* [PATCH v3 0/8] x86/resctrl: Add support for Sub-NUMA cluster (SNC) systems
@ 2023-07-13 16:31 Tony Luck
  2023-07-13 16:32 ` [PATCH v3 1/8] x86/resctrl: Refactor in preparation for node-scoped resources Tony Luck
                   ` (9 more replies)
  0 siblings, 10 replies; 331+ messages in thread
From: Tony Luck @ 2023-07-13 16:31 UTC (permalink / raw)
  To: Fenghua Yu, Reinette Chatre, Peter Newman, Jonathan Corbet,
	Shuah Khan, x86
  Cc: Shaopeng Tan, James Morse, Jamie Iles, Babu Moger, Randy Dunlap,
	linux-kernel, linux-doc, patches, Tony Luck

There isn't a simple hardware enumeration to indicate to software that
a system is running with Sub-NUMA Clustering enabled.

Compare the number of NUMA nodes with the number of L3 caches to calculate
the number of Sub-NUMA nodes per L3 cache.

When Sub-NUMA clustering mode is enabled in BIOS setup, the RMID counters
are distributed equally between the SNC nodes within each socket.

E.g. if there are 400 RMID counters, and the system is configured with
two SNC nodes per socket, then RMID counter 0..199 are used on SNC node
0 on the socket, and RMID counter 200..399 on SNC node 1.

A model specific MSR (0xca0) can change the configuration of the RMIDs
when SNC mode is enabled.

The MSR controls the interpretation of the RMID field in the
IA32_PQR_ASSOC MSR so that the appropriate hardware counters within the
SNC node are updated. If reconfigured from default, RMIDs are divided
evenly across clusters.

Also initialize a per-cpu RMID offset value. Use this to calculate the
value to write to the IA32_QM_EVTSEL MSR when reading RMID event values.
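
For reference, a condensed sketch of the mapping that patch 7 of this
series implements (see that patch for the full code):

	/* Per-cpu offset, chosen when a CPU comes online */
	this_cpu_write(rmid_offset, (cpu_to_node(cpu) % snc_ways) * r->num_rmid);

	/* Applied whenever an event counter is read */
	wrmsr(MSR_IA32_QM_EVTSEL, eventid, rmid + this_cpu_read(rmid_offset));
	rdmsrl(MSR_IA32_QM_CTR, msr_val);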

N.B. this works well for well-behaved NUMA applications that access
memory predominantly from the local memory node. For applications that
access memory across multiple nodes it may be necessary for the user
to read counters for all SNC nodes on a socket and add the values to
get the actual LLC occupancy or memory bandwidth. Perhaps this isn't
all that different from applications that span across multiple sockets
in a legacy system.
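
As an illustration only (not code from this series), user space could do
the summation itself. Assuming resctrl is mounted at /sys/fs/resctrl and
SNC results in directories mon_L3_00 and mon_L3_01 for the socket of
interest:

	#include <stdio.h>

	static unsigned long long read_counter(const char *path)
	{
		unsigned long long val = 0;
		FILE *f = fopen(path, "r");

		if (f) {
			if (fscanf(f, "%llu", &val) != 1)
				val = 0;
			fclose(f);
		}
		return val;
	}

	int main(void)
	{
		unsigned long long total;

		/* Sum LLC occupancy across both SNC nodes of the socket */
		total = read_counter("/sys/fs/resctrl/mon_data/mon_L3_00/llc_occupancy") +
			read_counter("/sys/fs/resctrl/mon_data/mon_L3_01/llc_occupancy");
		printf("llc_occupancy (whole socket): %llu bytes\n", total);
		return 0;
	}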

Signed-off-by: Tony Luck <tony.luck@intel.com>

---

Changes since v2:

* Rebased to v6.5-rc1

Peter Newman:	Found that I'd reversed the actions writing to the new
		MSR to enable/disable RMID remapping for SNC mode.
* Fixed.

Peter Newman:	Provided Reviewed-by: and Tested-by: tags

* Included in this series.

Randy Dunlap:	Reported a run-on sentence in the documentation.

* Broke the sentence into two as suggested.

Shaopeng Tan:	Reported that the CMT resctrl self-test failed

* Added extra patch to the series to make the resctrl test detect when SNC
  mode is enabled and adjust effective cache size.

* I also patched the rdtgroup_cbm_to_size() function to adjust the cache
  size reported in the "size" files in resctrl groups when SNC is active.

Shaopeng Tan:	Noted the for_each_capable_rdt_resource() macro is no longer used.

* Deleted the definition of this macro.

Tony Luck (8):
  x86/resctrl: Refactor in preparation for node-scoped resources
  x86/resctrl: Remove hard code of RDT_RESOURCE_L3 in monitor.c
  x86/resctrl: Add a new node-scoped resource to rdt_resources_all[]
  x86/resctrl: Add code to setup monitoring at L3 or NODE scope.
  x86/resctrl: Add package scoped resource
  x86/resctrl: Update documentation with Sub-NUMA cluster changes
  x86/resctrl: Determine if Sub-NUMA Cluster is enabled and initialize.
  selftests/resctrl: Adjust effective L3 cache size when SNC enabled

 Documentation/arch/x86/resctrl.rst          |  10 +-
 include/linux/resctrl.h                     |   5 +-
 arch/x86/include/asm/resctrl.h              |   2 +
 arch/x86/kernel/cpu/resctrl/internal.h      |  20 ++-
 tools/testing/selftests/resctrl/resctrl.h   |   1 +
 arch/x86/kernel/cpu/resctrl/core.c          | 154 ++++++++++++++++++--
 arch/x86/kernel/cpu/resctrl/monitor.c       |  24 +--
 arch/x86/kernel/cpu/resctrl/pseudo_lock.c   |   2 +-
 arch/x86/kernel/cpu/resctrl/rdtgroup.c      |   6 +-
 tools/testing/selftests/resctrl/resctrlfs.c |  57 ++++++++
 10 files changed, 248 insertions(+), 33 deletions(-)


base-commit: 06c2afb862f9da8dc5efa4b6076a0e48c3fbaaa5
-- 
2.40.1



* [PATCH v3 1/8] x86/resctrl: Refactor in preparation for node-scoped resources
  2023-07-13 16:31 [PATCH v3 0/8] x86/resctrl: Add support for Sub-NUMA cluster (SNC) systems Tony Luck
@ 2023-07-13 16:32 ` Tony Luck
  2023-07-18 20:36   ` Reinette Chatre
  2023-07-13 16:32 ` [PATCH v3 2/8] x86/resctrl: Remove hard code of RDT_RESOURCE_L3 in monitor.c Tony Luck
                   ` (8 subsequent siblings)
  9 siblings, 1 reply; 331+ messages in thread
From: Tony Luck @ 2023-07-13 16:32 UTC (permalink / raw)
  To: Fenghua Yu, Reinette Chatre, Peter Newman, Jonathan Corbet,
	Shuah Khan, x86
  Cc: Shaopeng Tan, James Morse, Jamie Iles, Babu Moger, Randy Dunlap,
	linux-kernel, linux-doc, patches, Tony Luck

Sub-NUMA cluster systems provide monitoring resources at the NUMA
node scope instead of the L3 cache scope.

Rename the cache_level field in struct rdt_resource to the more
generic "scope" and add symbolic names and a helper function.

No functional change.

Signed-off-by: Tony Luck <tony.luck@intel.com>
Reviewed-by: Peter Newman <peternewman@google.com>
---
 include/linux/resctrl.h                   |  4 ++--
 arch/x86/kernel/cpu/resctrl/internal.h    |  5 +++++
 arch/x86/kernel/cpu/resctrl/core.c        | 17 +++++++++++------
 arch/x86/kernel/cpu/resctrl/pseudo_lock.c |  2 +-
 arch/x86/kernel/cpu/resctrl/rdtgroup.c    |  2 +-
 5 files changed, 20 insertions(+), 10 deletions(-)

diff --git a/include/linux/resctrl.h b/include/linux/resctrl.h
index 8334eeacfec5..25051daa6655 100644
--- a/include/linux/resctrl.h
+++ b/include/linux/resctrl.h
@@ -150,7 +150,7 @@ struct resctrl_schema;
  * @alloc_capable:	Is allocation available on this machine
  * @mon_capable:	Is monitor feature available on this machine
  * @num_rmid:		Number of RMIDs available
- * @cache_level:	Which cache level defines scope of this resource
+ * @scope:		Scope of this resource (cache level or NUMA node)
  * @cache:		Cache allocation related data
  * @membw:		If the component has bandwidth controls, their properties.
  * @domains:		All domains for this resource
@@ -168,7 +168,7 @@ struct rdt_resource {
 	bool			alloc_capable;
 	bool			mon_capable;
 	int			num_rmid;
-	int			cache_level;
+	int			scope;
 	struct resctrl_cache	cache;
 	struct resctrl_membw	membw;
 	struct list_head	domains;
diff --git a/arch/x86/kernel/cpu/resctrl/internal.h b/arch/x86/kernel/cpu/resctrl/internal.h
index 85ceaf9a31ac..8275b8a74f7e 100644
--- a/arch/x86/kernel/cpu/resctrl/internal.h
+++ b/arch/x86/kernel/cpu/resctrl/internal.h
@@ -440,6 +440,11 @@ enum resctrl_res_level {
 	RDT_NUM_RESOURCES,
 };
 
+enum resctrl_scope {
+	SCOPE_L2_CACHE = 2,
+	SCOPE_L3_CACHE = 3
+};
+
 static inline struct rdt_resource *resctrl_inc(struct rdt_resource *res)
 {
 	struct rdt_hw_resource *hw_res = resctrl_to_arch_res(res);
diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
index 030d3b409768..6571514752f3 100644
--- a/arch/x86/kernel/cpu/resctrl/core.c
+++ b/arch/x86/kernel/cpu/resctrl/core.c
@@ -65,7 +65,7 @@ struct rdt_hw_resource rdt_resources_all[] = {
 		.r_resctrl = {
 			.rid			= RDT_RESOURCE_L3,
 			.name			= "L3",
-			.cache_level		= 3,
+			.scope			= SCOPE_L3_CACHE,
 			.domains		= domain_init(RDT_RESOURCE_L3),
 			.parse_ctrlval		= parse_cbm,
 			.format_str		= "%d=%0*x",
@@ -79,7 +79,7 @@ struct rdt_hw_resource rdt_resources_all[] = {
 		.r_resctrl = {
 			.rid			= RDT_RESOURCE_L2,
 			.name			= "L2",
-			.cache_level		= 2,
+			.scope			= SCOPE_L2_CACHE,
 			.domains		= domain_init(RDT_RESOURCE_L2),
 			.parse_ctrlval		= parse_cbm,
 			.format_str		= "%d=%0*x",
@@ -93,7 +93,7 @@ struct rdt_hw_resource rdt_resources_all[] = {
 		.r_resctrl = {
 			.rid			= RDT_RESOURCE_MBA,
 			.name			= "MB",
-			.cache_level		= 3,
+			.scope			= SCOPE_L3_CACHE,
 			.domains		= domain_init(RDT_RESOURCE_MBA),
 			.parse_ctrlval		= parse_bw,
 			.format_str		= "%d=%*u",
@@ -105,7 +105,7 @@ struct rdt_hw_resource rdt_resources_all[] = {
 		.r_resctrl = {
 			.rid			= RDT_RESOURCE_SMBA,
 			.name			= "SMBA",
-			.cache_level		= 3,
+			.scope			= 3,
 			.domains		= domain_init(RDT_RESOURCE_SMBA),
 			.parse_ctrlval		= parse_bw,
 			.format_str		= "%d=%*u",
@@ -487,6 +487,11 @@ static int arch_domain_mbm_alloc(u32 num_rmid, struct rdt_hw_domain *hw_dom)
 	return 0;
 }
 
+static int get_domain_id(int cpu, enum resctrl_scope scope)
+{
+	return get_cpu_cacheinfo_id(cpu, scope);
+}
+
 /*
  * domain_add_cpu - Add a cpu to a resource's domain list.
  *
@@ -502,7 +507,7 @@ static int arch_domain_mbm_alloc(u32 num_rmid, struct rdt_hw_domain *hw_dom)
  */
 static void domain_add_cpu(int cpu, struct rdt_resource *r)
 {
-	int id = get_cpu_cacheinfo_id(cpu, r->cache_level);
+	int id = get_domain_id(cpu, r->scope);
 	struct list_head *add_pos = NULL;
 	struct rdt_hw_domain *hw_dom;
 	struct rdt_domain *d;
@@ -552,7 +557,7 @@ static void domain_add_cpu(int cpu, struct rdt_resource *r)
 
 static void domain_remove_cpu(int cpu, struct rdt_resource *r)
 {
-	int id = get_cpu_cacheinfo_id(cpu, r->cache_level);
+	int id = get_domain_id(cpu, r->scope);
 	struct rdt_hw_domain *hw_dom;
 	struct rdt_domain *d;
 
diff --git a/arch/x86/kernel/cpu/resctrl/pseudo_lock.c b/arch/x86/kernel/cpu/resctrl/pseudo_lock.c
index 458cb7419502..42f124ffb968 100644
--- a/arch/x86/kernel/cpu/resctrl/pseudo_lock.c
+++ b/arch/x86/kernel/cpu/resctrl/pseudo_lock.c
@@ -297,7 +297,7 @@ static int pseudo_lock_region_init(struct pseudo_lock_region *plr)
 	plr->size = rdtgroup_cbm_to_size(plr->s->res, plr->d, plr->cbm);
 
 	for (i = 0; i < ci->num_leaves; i++) {
-		if (ci->info_list[i].level == plr->s->res->cache_level) {
+		if (ci->info_list[i].level == plr->s->res->scope) {
 			plr->line_size = ci->info_list[i].coherency_line_size;
 			return 0;
 		}
diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
index 725344048f85..418658f0a9ad 100644
--- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
+++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
@@ -1348,7 +1348,7 @@ unsigned int rdtgroup_cbm_to_size(struct rdt_resource *r,
 	num_b = bitmap_weight(&cbm, r->cache.cbm_len);
 	ci = get_cpu_cacheinfo(cpumask_any(&d->cpu_mask));
 	for (i = 0; i < ci->num_leaves; i++) {
-		if (ci->info_list[i].level == r->cache_level) {
+		if (ci->info_list[i].level == r->scope) {
 			size = ci->info_list[i].size / r->cache.cbm_len * num_b;
 			break;
 		}
-- 
2.40.1



* [PATCH v3 2/8] x86/resctrl: Remove hard code of RDT_RESOURCE_L3 in monitor.c
  2023-07-13 16:31 [PATCH v3 0/8] x86/resctrl: Add support for Sub-NUMA cluster (SNC) systems Tony Luck
  2023-07-13 16:32 ` [PATCH v3 1/8] x86/resctrl: Refactor in preparation for node-scoped resources Tony Luck
@ 2023-07-13 16:32 ` Tony Luck
  2023-07-18 20:37   ` Reinette Chatre
  2023-07-13 16:32 ` [PATCH v3 3/8] x86/resctrl: Add a new node-scoped resource to rdt_resources_all[] Tony Luck
                   ` (7 subsequent siblings)
  9 siblings, 1 reply; 331+ messages in thread
From: Tony Luck @ 2023-07-13 16:32 UTC (permalink / raw)
  To: Fenghua Yu, Reinette Chatre, Peter Newman, Jonathan Corbet,
	Shuah Khan, x86
  Cc: Shaopeng Tan, James Morse, Jamie Iles, Babu Moger, Randy Dunlap,
	linux-kernel, linux-doc, patches, Tony Luck

Monitoring may be scoped at L3 cache granularity (legacy) or at the
node level (systems with Sub-NUMA Cluster enabled).

Save the struct rdt_resource pointer that was used to initialize
the monitor sections of code and use that value instead of the
hard-coded RDT_RESOURCE_L3.

No functional change.

Signed-off-by: Tony Luck <tony.luck@intel.com>
Reviewed-by: Peter Newman <peternewman@google.com>
---
 arch/x86/kernel/cpu/resctrl/monitor.c | 18 +++++++++++-------
 1 file changed, 11 insertions(+), 7 deletions(-)

diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c
index ded1fc7cb7cb..9be6ffdd01ae 100644
--- a/arch/x86/kernel/cpu/resctrl/monitor.c
+++ b/arch/x86/kernel/cpu/resctrl/monitor.c
@@ -30,6 +30,8 @@ struct rmid_entry {
 	struct list_head		list;
 };
 
+static struct rdt_resource *mon_resource;
+
 /**
  * @rmid_free_lru    A least recently used list of free RMIDs
  *     These RMIDs are guaranteed to have an occupancy less than the
@@ -268,7 +270,7 @@ int resctrl_arch_rmid_read(struct rdt_resource *r, struct rdt_domain *d,
  */
 void __check_limbo(struct rdt_domain *d, bool force_free)
 {
-	struct rdt_resource *r = &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl;
+	struct rdt_resource *r = mon_resource;
 	struct rmid_entry *entry;
 	u32 crmid = 1, nrmid;
 	bool rmid_dirty;
@@ -333,7 +335,7 @@ int alloc_rmid(void)
 
 static void add_rmid_to_limbo(struct rmid_entry *entry)
 {
-	struct rdt_resource *r = &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl;
+	struct rdt_resource *r = mon_resource;
 	struct rdt_domain *d;
 	int cpu, err;
 	u64 val = 0;
@@ -645,7 +647,7 @@ void cqm_handle_limbo(struct work_struct *work)
 
 	mutex_lock(&rdtgroup_mutex);
 
-	r = &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl;
+	r = mon_resource;
 	d = container_of(work, struct rdt_domain, cqm_limbo.work);
 
 	__check_limbo(d, false);
@@ -681,7 +683,7 @@ void mbm_handle_overflow(struct work_struct *work)
 	if (!static_branch_likely(&rdt_mon_enable_key))
 		goto out_unlock;
 
-	r = &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl;
+	r = mon_resource;
 	d = container_of(work, struct rdt_domain, mbm_over.work);
 
 	list_for_each_entry(prgrp, &rdt_all_groups, rdtgroup_list) {
@@ -759,9 +761,9 @@ static struct mon_evt mbm_local_event = {
 /*
  * Initialize the event list for the resource.
  *
- * Note that MBM events are also part of RDT_RESOURCE_L3 resource
- * because as per the SDM the total and local memory bandwidth
- * are enumerated as part of L3 monitoring.
+ * Monitor events can either be part of RDT_RESOURCE_L3 resource,
+ * or they may be per NUMA node on systems with sub-NUMA cluster
+ * enabled and are then in the RDT_RESOURCE_NODE resource.
  */
 static void l3_mon_evt_init(struct rdt_resource *r)
 {
@@ -773,6 +775,8 @@ static void l3_mon_evt_init(struct rdt_resource *r)
 		list_add_tail(&mbm_total_event.list, &r->evt_list);
 	if (is_mbm_local_enabled())
 		list_add_tail(&mbm_local_event.list, &r->evt_list);
+
+	mon_resource = r;
 }
 
 int __init rdt_get_mon_l3_config(struct rdt_resource *r)
-- 
2.40.1



* [PATCH v3 3/8] x86/resctrl: Add a new node-scoped resource to rdt_resources_all[]
  2023-07-13 16:31 [PATCH v3 0/8] x86/resctrl: Add support for Sub-NUMA cluster (SNC) systems Tony Luck
  2023-07-13 16:32 ` [PATCH v3 1/8] x86/resctrl: Refactor in preparation for node-scoped resources Tony Luck
  2023-07-13 16:32 ` [PATCH v3 2/8] x86/resctrl: Remove hard code of RDT_RESOURCE_L3 in monitor.c Tony Luck
@ 2023-07-13 16:32 ` Tony Luck
  2023-07-18 20:40   ` Reinette Chatre
  2023-07-13 16:32 ` [PATCH v3 4/8] x86/resctrl: Add code to setup monitoring at L3 or NODE scope Tony Luck
                   ` (6 subsequent siblings)
  9 siblings, 1 reply; 331+ messages in thread
From: Tony Luck @ 2023-07-13 16:32 UTC (permalink / raw)
  To: Fenghua Yu, Reinette Chatre, Peter Newman, Jonathan Corbet,
	Shuah Khan, x86
  Cc: Shaopeng Tan, James Morse, Jamie Iles, Babu Moger, Randy Dunlap,
	linux-kernel, linux-doc, patches, Tony Luck

Add a placeholder in the array of struct rdt_hw_resource to be used
for event monitoring of systems with Sub-NUMA Cluster enabled.

Update get_domain_id() to handle SCOPE_NODE.

Signed-off-by: Tony Luck <tony.luck@intel.com>
Reviewed-by: Peter Newman <peternewman@google.com>
---
 arch/x86/kernel/cpu/resctrl/internal.h |  4 +++-
 arch/x86/kernel/cpu/resctrl/core.c     | 12 ++++++++++++
 2 files changed, 15 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kernel/cpu/resctrl/internal.h b/arch/x86/kernel/cpu/resctrl/internal.h
index 8275b8a74f7e..243017096ddf 100644
--- a/arch/x86/kernel/cpu/resctrl/internal.h
+++ b/arch/x86/kernel/cpu/resctrl/internal.h
@@ -435,6 +435,7 @@ enum resctrl_res_level {
 	RDT_RESOURCE_L2,
 	RDT_RESOURCE_MBA,
 	RDT_RESOURCE_SMBA,
+	RDT_RESOURCE_NODE,
 
 	/* Must be the last */
 	RDT_NUM_RESOURCES,
@@ -442,7 +443,8 @@ enum resctrl_res_level {
 
 enum resctrl_scope {
 	SCOPE_L2_CACHE = 2,
-	SCOPE_L3_CACHE = 3
+	SCOPE_L3_CACHE = 3,
+	SCOPE_NODE,
 };
 
 static inline struct rdt_resource *resctrl_inc(struct rdt_resource *res)
diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
index 6571514752f3..e4bd3072927c 100644
--- a/arch/x86/kernel/cpu/resctrl/core.c
+++ b/arch/x86/kernel/cpu/resctrl/core.c
@@ -112,6 +112,16 @@ struct rdt_hw_resource rdt_resources_all[] = {
 			.fflags			= RFTYPE_RES_MB,
 		},
 	},
+	[RDT_RESOURCE_NODE] =
+	{
+		.r_resctrl = {
+			.rid			= RDT_RESOURCE_NODE,
+			.name			= "L3",
+			.scope			= SCOPE_NODE,
+			.domains		= domain_init(RDT_RESOURCE_NODE),
+			.fflags			= 0,
+		},
+	},
 };
 
 /*
@@ -489,6 +499,8 @@ static int arch_domain_mbm_alloc(u32 num_rmid, struct rdt_hw_domain *hw_dom)
 
 static int get_domain_id(int cpu, enum resctrl_scope scope)
 {
+	if (scope == SCOPE_NODE)
+		return cpu_to_node(cpu);
 	return get_cpu_cacheinfo_id(cpu, scope);
 }
 
-- 
2.40.1



* [PATCH v3 4/8] x86/resctrl: Add code to setup monitoring at L3 or NODE scope.
  2023-07-13 16:31 [PATCH v3 0/8] x86/resctrl: Add support for Sub-NUMA cluster (SNC) systems Tony Luck
                   ` (2 preceding siblings ...)
  2023-07-13 16:32 ` [PATCH v3 3/8] x86/resctrl: Add a new node-scoped resource to rdt_resources_all[] Tony Luck
@ 2023-07-13 16:32 ` Tony Luck
  2023-07-18 20:42   ` Reinette Chatre
  2023-07-13 16:32 ` [PATCH v3 5/8] x86/resctrl: Add package scoped resource Tony Luck
                   ` (5 subsequent siblings)
  9 siblings, 1 reply; 331+ messages in thread
From: Tony Luck @ 2023-07-13 16:32 UTC (permalink / raw)
  To: Fenghua Yu, Reinette Chatre, Peter Newman, Jonathan Corbet,
	Shuah Khan, x86
  Cc: Shaopeng Tan, James Morse, Jamie Iles, Babu Moger, Randy Dunlap,
	linux-kernel, linux-doc, patches, Tony Luck

When Sub-NUMA cluster is enabled (snc_ways > 1) use the RDT_RESOURCE_NODE
instead of RDT_RESOURCE_L3 for all monitoring operations.

The mon_scale and num_rmid values from CPUID(0xf,0x1),(EBX,ECX) must be
scaled down by the number of Sub-NUMA Clusters.

A subsequent change will detect sub-NUMA cluster mode and set
"snc_ways". For now set to one (meaning each L3 cache spans one
node).
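
E.g. with 400 RMID counters reported by CPUID and two SNC nodes per
socket, the scaling in this patch works out as:

	r->num_rmid       = (boot_cpu_data.x86_cache_max_rmid + 1) / snc_ways = 400 / 2 = 200
	hw_res->mon_scale = boot_cpu_data.x86_cache_occ_scale / snc_ways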

Signed-off-by: Tony Luck <tony.luck@intel.com>
Reviewed-by: Peter Newman <peternewman@google.com>
---
 arch/x86/kernel/cpu/resctrl/internal.h | 7 +++++++
 arch/x86/kernel/cpu/resctrl/core.c     | 7 ++++++-
 arch/x86/kernel/cpu/resctrl/monitor.c  | 4 ++--
 arch/x86/kernel/cpu/resctrl/rdtgroup.c | 2 +-
 4 files changed, 16 insertions(+), 4 deletions(-)

diff --git a/arch/x86/kernel/cpu/resctrl/internal.h b/arch/x86/kernel/cpu/resctrl/internal.h
index 243017096ddf..38bac0062c82 100644
--- a/arch/x86/kernel/cpu/resctrl/internal.h
+++ b/arch/x86/kernel/cpu/resctrl/internal.h
@@ -430,6 +430,8 @@ DECLARE_STATIC_KEY_FALSE(rdt_alloc_enable_key);
 
 extern struct dentry *debugfs_resctrl;
 
+extern int snc_ways;
+
 enum resctrl_res_level {
 	RDT_RESOURCE_L3,
 	RDT_RESOURCE_L2,
@@ -447,6 +449,11 @@ enum resctrl_scope {
 	SCOPE_NODE,
 };
 
+static inline int get_mbm_res_level(void)
+{
+	return snc_ways > 1 ? RDT_RESOURCE_NODE : RDT_RESOURCE_L3;
+}
+
 static inline struct rdt_resource *resctrl_inc(struct rdt_resource *res)
 {
 	struct rdt_hw_resource *hw_res = resctrl_to_arch_res(res);
diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
index e4bd3072927c..6fe9f87d4403 100644
--- a/arch/x86/kernel/cpu/resctrl/core.c
+++ b/arch/x86/kernel/cpu/resctrl/core.c
@@ -48,6 +48,11 @@ int max_name_width, max_data_width;
  */
 bool rdt_alloc_capable;
 
+/*
+ * How many Sub-Numa Cluster nodes share a single L3 cache
+ */
+int snc_ways = 1;
+
 static void
 mba_wrmsr_intel(struct rdt_domain *d, struct msr_param *m,
 		struct rdt_resource *r);
@@ -831,7 +836,7 @@ static __init bool get_rdt_alloc_resources(void)
 
 static __init bool get_rdt_mon_resources(void)
 {
-	struct rdt_resource *r = &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl;
+	struct rdt_resource *r = &rdt_resources_all[get_mbm_res_level()].r_resctrl;
 
 	if (rdt_cpu_has(X86_FEATURE_CQM_OCCUP_LLC))
 		rdt_mon_features |= (1 << QOS_L3_OCCUP_EVENT_ID);
diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c
index 9be6ffdd01ae..da3f36212898 100644
--- a/arch/x86/kernel/cpu/resctrl/monitor.c
+++ b/arch/x86/kernel/cpu/resctrl/monitor.c
@@ -787,8 +787,8 @@ int __init rdt_get_mon_l3_config(struct rdt_resource *r)
 	int ret;
 
 	resctrl_rmid_realloc_limit = boot_cpu_data.x86_cache_size * 1024;
-	hw_res->mon_scale = boot_cpu_data.x86_cache_occ_scale;
-	r->num_rmid = boot_cpu_data.x86_cache_max_rmid + 1;
+	hw_res->mon_scale = boot_cpu_data.x86_cache_occ_scale / snc_ways;
+	r->num_rmid = (boot_cpu_data.x86_cache_max_rmid + 1) / snc_ways;
 	hw_res->mbm_width = MBM_CNTR_WIDTH_BASE;
 
 	if (mbm_offset > 0 && mbm_offset <= MBM_CNTR_WIDTH_OFFSET_MAX)
diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
index 418658f0a9ad..d037f3da9e55 100644
--- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
+++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
@@ -2524,7 +2524,7 @@ static int rdt_get_tree(struct fs_context *fc)
 		static_branch_enable_cpuslocked(&rdt_enable_key);
 
 	if (is_mbm_enabled()) {
-		r = &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl;
+		r = &rdt_resources_all[get_mbm_res_level()].r_resctrl;
 		list_for_each_entry(dom, &r->domains, list)
 			mbm_setup_overflow_handler(dom, MBM_OVERFLOW_INTERVAL);
 	}
-- 
2.40.1



* [PATCH v3 5/8] x86/resctrl: Add package scoped resource
  2023-07-13 16:31 [PATCH v3 0/8] x86/resctrl: Add support for Sub-NUMA cluster (SNC) systems Tony Luck
                   ` (3 preceding siblings ...)
  2023-07-13 16:32 ` [PATCH v3 4/8] x86/resctrl: Add code to setup monitoring at L3 or NODE scope Tony Luck
@ 2023-07-13 16:32 ` Tony Luck
  2023-07-18 20:43   ` Reinette Chatre
  2023-07-13 16:32 ` [PATCH v3 6/8] x86/resctrl: Update documentation with Sub-NUMA cluster changes Tony Luck
                   ` (4 subsequent siblings)
  9 siblings, 1 reply; 331+ messages in thread
From: Tony Luck @ 2023-07-13 16:32 UTC (permalink / raw)
  To: Fenghua Yu, Reinette Chatre, Peter Newman, Jonathan Corbet,
	Shuah Khan, x86
  Cc: Shaopeng Tan, James Morse, Jamie Iles, Babu Moger, Randy Dunlap,
	linux-kernel, linux-doc, patches, Tony Luck

Some Intel features require setting a package scoped model specific
register.

Add a new resource that builds domains for each package.

Signed-off-by: Tony Luck <tony.luck@intel.com>
Reviewed-by: Peter Newman <peternewman@google.com>
---
 include/linux/resctrl.h                |  1 +
 arch/x86/kernel/cpu/resctrl/internal.h |  6 ++++--
 arch/x86/kernel/cpu/resctrl/core.c     | 23 +++++++++++++++++++----
 3 files changed, 24 insertions(+), 6 deletions(-)

diff --git a/include/linux/resctrl.h b/include/linux/resctrl.h
index 25051daa6655..f504f6263fec 100644
--- a/include/linux/resctrl.h
+++ b/include/linux/resctrl.h
@@ -167,6 +167,7 @@ struct rdt_resource {
 	int			rid;
 	bool			alloc_capable;
 	bool			mon_capable;
+	bool			pkg_actions;
 	int			num_rmid;
 	int			scope;
 	struct resctrl_cache	cache;
diff --git a/arch/x86/kernel/cpu/resctrl/internal.h b/arch/x86/kernel/cpu/resctrl/internal.h
index 38bac0062c82..67340c83392f 100644
--- a/arch/x86/kernel/cpu/resctrl/internal.h
+++ b/arch/x86/kernel/cpu/resctrl/internal.h
@@ -438,6 +438,7 @@ enum resctrl_res_level {
 	RDT_RESOURCE_MBA,
 	RDT_RESOURCE_SMBA,
 	RDT_RESOURCE_NODE,
+	RDT_RESOURCE_PKG,
 
 	/* Must be the last */
 	RDT_NUM_RESOURCES,
@@ -447,6 +448,7 @@ enum resctrl_scope {
 	SCOPE_L2_CACHE = 2,
 	SCOPE_L3_CACHE = 3,
 	SCOPE_NODE,
+	SCOPE_PKG,
 };
 
 static inline int get_mbm_res_level(void)
@@ -478,9 +480,9 @@ int resctrl_arch_set_cdp_enabled(enum resctrl_res_level l, bool enable);
 	     r <= &rdt_resources_all[RDT_NUM_RESOURCES - 1].r_resctrl;	      \
 	     r = resctrl_inc(r))
 
-#define for_each_capable_rdt_resource(r)				      \
+#define for_each_domain_needed_rdt_resource(r)				      \
 	for_each_rdt_resource(r)					      \
-		if (r->alloc_capable || r->mon_capable)
+		if (r->alloc_capable || r->mon_capable || r->pkg_actions)
 
 #define for_each_alloc_capable_rdt_resource(r)				      \
 	for_each_rdt_resource(r)					      \
diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
index 6fe9f87d4403..af3be3c2db96 100644
--- a/arch/x86/kernel/cpu/resctrl/core.c
+++ b/arch/x86/kernel/cpu/resctrl/core.c
@@ -127,6 +127,16 @@ struct rdt_hw_resource rdt_resources_all[] = {
 			.fflags			= 0,
 		},
 	},
+	[RDT_RESOURCE_PKG] =
+	{
+		.r_resctrl = {
+			.rid			= RDT_RESOURCE_PKG,
+			.name			= "PKG",
+			.scope			= SCOPE_PKG,
+			.domains		= domain_init(RDT_RESOURCE_PKG),
+			.fflags			= 0,
+		},
+	},
 };
 
 /*
@@ -504,9 +514,14 @@ static int arch_domain_mbm_alloc(u32 num_rmid, struct rdt_hw_domain *hw_dom)
 
 static int get_domain_id(int cpu, enum resctrl_scope scope)
 {
-	if (scope == SCOPE_NODE)
+	switch (scope) {
+	case SCOPE_NODE:
 		return cpu_to_node(cpu);
-	return get_cpu_cacheinfo_id(cpu, scope);
+	case SCOPE_PKG:
+		return topology_physical_package_id(cpu);
+	default:
+		return get_cpu_cacheinfo_id(cpu, scope);
+	}
 }
 
 /*
@@ -630,7 +645,7 @@ static int resctrl_online_cpu(unsigned int cpu)
 	struct rdt_resource *r;
 
 	mutex_lock(&rdtgroup_mutex);
-	for_each_capable_rdt_resource(r)
+	for_each_domain_needed_rdt_resource(r)
 		domain_add_cpu(cpu, r);
 	/* The cpu is set in default rdtgroup after online. */
 	cpumask_set_cpu(cpu, &rdtgroup_default.cpu_mask);
@@ -657,7 +672,7 @@ static int resctrl_offline_cpu(unsigned int cpu)
 	struct rdt_resource *r;
 
 	mutex_lock(&rdtgroup_mutex);
-	for_each_capable_rdt_resource(r)
+	for_each_domain_needed_rdt_resource(r)
 		domain_remove_cpu(cpu, r);
 	list_for_each_entry(rdtgrp, &rdt_all_groups, rdtgroup_list) {
 		if (cpumask_test_and_clear_cpu(cpu, &rdtgrp->cpu_mask)) {
-- 
2.40.1



* [PATCH v3 6/8] x86/resctrl: Update documentation with Sub-NUMA cluster changes
  2023-07-13 16:31 [PATCH v3 0/8] x86/resctrl: Add support for Sub-NUMA cluster (SNC) systems Tony Luck
                   ` (4 preceding siblings ...)
  2023-07-13 16:32 ` [PATCH v3 5/8] x86/resctrl: Add package scoped resource Tony Luck
@ 2023-07-13 16:32 ` Tony Luck
  2023-07-13 16:32 ` [PATCH v3 7/8] x86/resctrl: Determine if Sub-NUMA Cluster is enabled and initialize Tony Luck
                   ` (3 subsequent siblings)
  9 siblings, 0 replies; 331+ messages in thread
From: Tony Luck @ 2023-07-13 16:32 UTC (permalink / raw)
  To: Fenghua Yu, Reinette Chatre, Peter Newman, Jonathan Corbet,
	Shuah Khan, x86
  Cc: Shaopeng Tan, James Morse, Jamie Iles, Babu Moger, Randy Dunlap,
	linux-kernel, linux-doc, patches, Tony Luck

With Sub-NUMA Cluster mode enabled the scope of monitoring resources is
per-NODE instead of per-L3 cache. Suffixes of directories with "L3" in
their name refer to Sub-NUMA nodes instead of L3 cache ids.

Signed-off-by: Tony Luck <tony.luck@intel.com>
Reviewed-by: Peter Newman <peternewman@google.com>
---
 Documentation/arch/x86/resctrl.rst | 10 +++++++---
 1 file changed, 7 insertions(+), 3 deletions(-)

diff --git a/Documentation/arch/x86/resctrl.rst b/Documentation/arch/x86/resctrl.rst
index cb05d90111b4..4d9ddb91751d 100644
--- a/Documentation/arch/x86/resctrl.rst
+++ b/Documentation/arch/x86/resctrl.rst
@@ -345,9 +345,13 @@ When control is enabled all CTRL_MON groups will also contain:
 When monitoring is enabled all MON groups will also contain:
 
 "mon_data":
-	This contains a set of files organized by L3 domain and by
-	RDT event. E.g. on a system with two L3 domains there will
-	be subdirectories "mon_L3_00" and "mon_L3_01".	Each of these
+	This contains a set of files organized by L3 domain or by NUMA
+	node (depending on whether Sub-NUMA Cluster (SNC) mode is disabled
+	or enabled respectively) and by RDT event. E.g. on a system with
+	SNC mode disabled with two L3 domains there will be subdirectories
+	"mon_L3_00" and "mon_L3_01". The numerical suffix refers to the
+	L3 cache id.  With SNC enabled the directory names are the same,
+	but the numerical suffix refers to the node id.  Each of these
 	directories have one file per event (e.g. "llc_occupancy",
 	"mbm_total_bytes", and "mbm_local_bytes"). In a MON group these
 	files provide a read out of the current value of the event for
-- 
2.40.1



* [PATCH v3 7/8] x86/resctrl: Determine if Sub-NUMA Cluster is enabled and initialize.
  2023-07-13 16:31 [PATCH v3 0/8] x86/resctrl: Add support for Sub-NUMA cluster (SNC) systems Tony Luck
                   ` (5 preceding siblings ...)
  2023-07-13 16:32 ` [PATCH v3 6/8] x86/resctrl: Update documentation with Sub-NUMA cluster changes Tony Luck
@ 2023-07-13 16:32 ` Tony Luck
  2023-07-18 20:45   ` Reinette Chatre
  2023-07-13 16:32 ` [PATCH v3 8/8] selftests/resctrl: Adjust effective L3 cache size when SNC enabled Tony Luck
                   ` (2 subsequent siblings)
  9 siblings, 1 reply; 331+ messages in thread
From: Tony Luck @ 2023-07-13 16:32 UTC (permalink / raw)
  To: Fenghua Yu, Reinette Chatre, Peter Newman, Jonathan Corbet,
	Shuah Khan, x86
  Cc: Shaopeng Tan, James Morse, Jamie Iles, Babu Moger, Randy Dunlap,
	linux-kernel, linux-doc, patches, Tony Luck

There isn't a simple hardware enumeration to indicate to software that
a system is running with Sub-NUMA Cluster enabled.

Compare the number of NUMA nodes with the number of L3 caches to calculate
the number of Sub-NUMA nodes per L3 cache.

When Sub-NUMA cluster mode is enabled in BIOS setup the RMID counters
are distributed equally between the SNC nodes within each socket.

E.g. if there are 400 RMID counters, and the system is configured with
two SNC nodes per socket, then RMID counter 0..199 are used on SNC node
0 on the socket, and RMID counter 200..399 on SNC node 1.

A model specific MSR (0xca0) can change the configuration of the RMIDs
when SNC mode is enabled.

The MSR controls the interpretation of the RMID field in the
IA32_PQR_ASSOC MSR so that the appropriate hardware counters
within the SNC node are updated.

Also initialize a per-cpu RMID offset value. Use this
to calculate the value to write to the IA32_QM_EVTSEL MSR when
reading RMID event values.

N.B. this works well for well-behaved NUMA applications that access
memory predominantly from the local memory node. For applications that
access memory across multiple nodes it may be necessary for the user
to read counters for all SNC nodes on a socket and add the values to
get the actual LLC occupancy or memory bandwidth. Perhaps this isn't
all that different from applications that span across multiple sockets
in a legacy system.

Signed-off-by: Tony Luck <tony.luck@intel.com>
Reviewed-by: Peter Newman <peternewman@google.com>
Tested-by: Peter Newman <peternewman@google.com>
---
 arch/x86/include/asm/resctrl.h         |  2 +
 arch/x86/kernel/cpu/resctrl/core.c     | 99 +++++++++++++++++++++++++-
 arch/x86/kernel/cpu/resctrl/monitor.c  |  2 +-
 arch/x86/kernel/cpu/resctrl/rdtgroup.c |  2 +-
 4 files changed, 100 insertions(+), 5 deletions(-)

diff --git a/arch/x86/include/asm/resctrl.h b/arch/x86/include/asm/resctrl.h
index 255a78d9d906..f95e69bacc65 100644
--- a/arch/x86/include/asm/resctrl.h
+++ b/arch/x86/include/asm/resctrl.h
@@ -35,6 +35,8 @@ DECLARE_STATIC_KEY_FALSE(rdt_enable_key);
 DECLARE_STATIC_KEY_FALSE(rdt_alloc_enable_key);
 DECLARE_STATIC_KEY_FALSE(rdt_mon_enable_key);
 
+DECLARE_PER_CPU(int, rmid_offset);
+
 /*
  * __resctrl_sched_in() - Writes the task's CLOSid/RMID to IA32_PQR_MSR
  *
diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
index af3be3c2db96..a03ff1a95624 100644
--- a/arch/x86/kernel/cpu/resctrl/core.c
+++ b/arch/x86/kernel/cpu/resctrl/core.c
@@ -16,11 +16,14 @@
 
 #define pr_fmt(fmt)	"resctrl: " fmt
 
+#include <linux/cpu.h>
 #include <linux/slab.h>
 #include <linux/err.h>
 #include <linux/cacheinfo.h>
 #include <linux/cpuhotplug.h>
+#include <linux/mod_devicetable.h>
 
+#include <asm/cpu_device_id.h>
 #include <asm/intel-family.h>
 #include <asm/resctrl.h>
 #include "internal.h"
@@ -524,6 +527,39 @@ static int get_domain_id(int cpu, enum resctrl_scope scope)
 	}
 }
 
+DEFINE_PER_CPU(int, rmid_offset);
+
+static void set_per_cpu_rmid_offset(int cpu, struct rdt_resource *r)
+{
+	this_cpu_write(rmid_offset, (cpu_to_node(cpu) % snc_ways) * r->num_rmid);
+}
+
+/*
+ * This MSR provides for configuration of RMIDs on Sub-NUMA Cluster
+ * systems.
+ * Bit0 = 1 (default) For legacy configuration
+ * Bit0 = 0 RMIDs are divided evenly between SNC nodes.
+ */
+#define MSR_RMID_SNC_CONFIG   0xCA0
+
+static void snc_add_pkg(void)
+{
+	u64	msrval;
+
+	rdmsrl(MSR_RMID_SNC_CONFIG, msrval);
+	msrval &= ~BIT_ULL(0);
+	wrmsrl(MSR_RMID_SNC_CONFIG, msrval);
+}
+
+static void snc_remove_pkg(void)
+{
+	u64	msrval;
+
+	rdmsrl(MSR_RMID_SNC_CONFIG, msrval);
+	msrval |= BIT_ULL(0);
+	wrmsrl(MSR_RMID_SNC_CONFIG, msrval);
+}
+
 /*
  * domain_add_cpu - Add a cpu to a resource's domain list.
  *
@@ -555,6 +591,8 @@ static void domain_add_cpu(int cpu, struct rdt_resource *r)
 		cpumask_set_cpu(cpu, &d->cpu_mask);
 		if (r->cache.arch_has_per_cpu_cfg)
 			rdt_domain_reconfigure_cdp(r);
+		if (r->mon_capable)
+			set_per_cpu_rmid_offset(cpu, r);
 		return;
 	}
 
@@ -573,11 +611,17 @@ static void domain_add_cpu(int cpu, struct rdt_resource *r)
 		return;
 	}
 
-	if (r->mon_capable && arch_domain_mbm_alloc(r->num_rmid, hw_dom)) {
-		domain_free(hw_dom);
-		return;
+	if (r->mon_capable) {
+		if (arch_domain_mbm_alloc(r->num_rmid, hw_dom)) {
+			domain_free(hw_dom);
+			return;
+		}
+		set_per_cpu_rmid_offset(cpu, r);
 	}
 
+	if (r->pkg_actions)
+		snc_add_pkg();
+
 	list_add_tail(&d->list, add_pos);
 
 	err = resctrl_online_domain(r, d);
@@ -613,6 +657,9 @@ static void domain_remove_cpu(int cpu, struct rdt_resource *r)
 			d->plr->d = NULL;
 		domain_free(hw_dom);
 
+		if (r->pkg_actions)
+			snc_remove_pkg();
+
 		return;
 	}
 
@@ -899,11 +946,57 @@ static __init bool get_rdt_resources(void)
 	return (rdt_mon_capable || rdt_alloc_capable);
 }
 
+static const struct x86_cpu_id snc_cpu_ids[] __initconst = {
+	X86_MATCH_INTEL_FAM6_MODEL(ICELAKE_X, 0),
+	X86_MATCH_INTEL_FAM6_MODEL(SAPPHIRERAPIDS_X, 0),
+	X86_MATCH_INTEL_FAM6_MODEL(EMERALDRAPIDS_X, 0),
+	{}
+};
+
+/*
+ * There isn't a simple enumeration bit to show whether SNC mode
+ * is enabled. Look at the ratio of number of NUMA nodes to the
+ * number of distinct L3 caches. Take care to skip memory-only nodes.
+ */
+static __init int find_snc_ways(void)
+{
+	unsigned long *node_caches;
+	int mem_only_nodes = 0;
+	int cpu, node, ret;
+
+	if (!x86_match_cpu(snc_cpu_ids))
+		return 1;
+
+	node_caches = kcalloc(BITS_TO_LONGS(nr_node_ids), sizeof(*node_caches), GFP_KERNEL);
+	if (!node_caches)
+		return 1;
+
+	cpus_read_lock();
+	for_each_node(node) {
+		cpu = cpumask_first(cpumask_of_node(node));
+		if (cpu < nr_cpu_ids)
+			set_bit(get_cpu_cacheinfo_id(cpu, 3), node_caches);
+		else
+			mem_only_nodes++;
+	}
+	cpus_read_unlock();
+
+	ret = (nr_node_ids - mem_only_nodes) / bitmap_weight(node_caches, nr_node_ids);
+	kfree(node_caches);
+
+	if (ret > 1)
+		rdt_resources_all[RDT_RESOURCE_PKG].r_resctrl.pkg_actions = true;
+
+	return ret;
+}
+
 static __init void rdt_init_res_defs_intel(void)
 {
 	struct rdt_hw_resource *hw_res;
 	struct rdt_resource *r;
 
+	snc_ways = find_snc_ways();
+
 	for_each_rdt_resource(r) {
 		hw_res = resctrl_to_arch_res(r);
 
diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c
index da3f36212898..74db99d299e1 100644
--- a/arch/x86/kernel/cpu/resctrl/monitor.c
+++ b/arch/x86/kernel/cpu/resctrl/monitor.c
@@ -160,7 +160,7 @@ static int __rmid_read(u32 rmid, enum resctrl_event_id eventid, u64 *val)
 	 * IA32_QM_CTR.Error (bit 63) and IA32_QM_CTR.Unavailable (bit 62)
 	 * are error bits.
 	 */
-	wrmsr(MSR_IA32_QM_EVTSEL, eventid, rmid);
+	wrmsr(MSR_IA32_QM_EVTSEL, eventid, rmid + this_cpu_read(rmid_offset));
 	rdmsrl(MSR_IA32_QM_CTR, msr_val);
 
 	if (msr_val & RMID_VAL_ERROR)
diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
index d037f3da9e55..1a9c38b018ba 100644
--- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
+++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
@@ -1354,7 +1354,7 @@ unsigned int rdtgroup_cbm_to_size(struct rdt_resource *r,
 		}
 	}
 
-	return size;
+	return size / snc_ways;
 }
 
 /**
-- 
2.40.1



* [PATCH v3 8/8] selftests/resctrl: Adjust effective L3 cache size when SNC enabled
  2023-07-13 16:31 [PATCH v3 0/8] x86/resctrl: Add support for Sub-NUMA cluster (SNC) systems Tony Luck
                   ` (6 preceding siblings ...)
  2023-07-13 16:32 ` [PATCH v3 7/8] x86/resctrl: Determine if Sub-NUMA Cluster is enabled and initialize Tony Luck
@ 2023-07-13 16:32 ` Tony Luck
  2023-07-19  2:43 ` [PATCH v3 0/8] x86/resctrl: Add support for Sub-NUMA cluster (SNC) systems Shaopeng Tan (Fujitsu)
  2023-07-22 19:07 ` [PATCH v4 0/7] " Tony Luck
  9 siblings, 0 replies; 331+ messages in thread
From: Tony Luck @ 2023-07-13 16:32 UTC (permalink / raw)
  To: Fenghua Yu, Reinette Chatre, Peter Newman, Jonathan Corbet,
	Shuah Khan, x86
  Cc: Shaopeng Tan, James Morse, Jamie Iles, Babu Moger, Randy Dunlap,
	linux-kernel, linux-doc, patches, Tony Luck

Sub-NUMA Cluster divides CPUs sharing an L3 cache into separate NUMA
nodes. Systems may support splitting into either two or four nodes.

When SNC mode is enabled, the effective amount of L3 cache available
for allocation is divided by the number of nodes per L3.

Detect which SNC mode is active by comparing the number of CPUs that
share a cache with CPU0 against the number of CPUs on node0.

Reported-by: "Shaopeng Tan (Fujitsu)" <tan.shaopeng@fujitsu.com>
Closes: https://lore.kernel.org/r/TYAPR01MB6330B9B17686EF426D2C3F308B25A@TYAPR01MB6330.jpnprd01.prod.outlook.com
Signed-off-by: Tony Luck <tony.luck@intel.com>
---
 tools/testing/selftests/resctrl/resctrl.h   |  1 +
 tools/testing/selftests/resctrl/resctrlfs.c | 57 +++++++++++++++++++++
 2 files changed, 58 insertions(+)

diff --git a/tools/testing/selftests/resctrl/resctrl.h b/tools/testing/selftests/resctrl/resctrl.h
index 87e39456dee0..a8b43210b573 100644
--- a/tools/testing/selftests/resctrl/resctrl.h
+++ b/tools/testing/selftests/resctrl/resctrl.h
@@ -13,6 +13,7 @@
 #include <signal.h>
 #include <dirent.h>
 #include <stdbool.h>
+#include <ctype.h>
 #include <sys/stat.h>
 #include <sys/ioctl.h>
 #include <sys/mount.h>
diff --git a/tools/testing/selftests/resctrl/resctrlfs.c b/tools/testing/selftests/resctrl/resctrlfs.c
index fb00245dee92..79eecbf9f863 100644
--- a/tools/testing/selftests/resctrl/resctrlfs.c
+++ b/tools/testing/selftests/resctrl/resctrlfs.c
@@ -130,6 +130,61 @@ int get_resource_id(int cpu_no, int *resource_id)
 	return 0;
 }
 
+/*
+ * Count number of CPUs in a /sys bit map
+ */
+static int count_sys_bitmap_bits(char *name)
+{
+	FILE *fp = fopen(name, "r");
+	int count = 0, c;
+
+	if (!fp)
+		return 0;
+
+	while ((c = fgetc(fp)) != EOF) {
+		if (!isxdigit(c))
+			continue;
+		switch (c) {
+		case 'f':
+			count++;
+		case '7': case 'b': case 'd': case 'e':
+			count++;
+		case '3': case '5': case '6': case '9': case 'a': case 'c':
+			count++;
+		case '1': case '2': case '4': case '8':
+			count++;
+		}
+	}
+	fclose(fp);
+
+	return count;
+}
+
+/*
+ * Detect SNC by comparing #CPUs in node0 with #CPUs sharing LLC with CPU0
+ * Try to get this right, even if a few CPUs are offline so that the number
+ * of CPUs in node0 is not exactly half or a quarter of the CPUs sharing the
+ * LLC of CPU0.
+ */
+static int snc_ways(void)
+{
+	int node_cpus, cache_cpus;
+
+	node_cpus = count_sys_bitmap_bits("/sys/devices/system/node/node0/cpumap");
+	cache_cpus = count_sys_bitmap_bits("/sys/devices/system/cpu/cpu0/cache/index3/shared_cpu_map");
+
+	if (!node_cpus || !cache_cpus) {
+		fprintf(stderr, "Warning could not determine Sub-NUMA Cluster mode\n");
+		return 1;
+	}
+
+	if (4 * node_cpus >= cache_cpus)
+		return 4;
+	else if (2 * node_cpus >= cache_cpus)
+		return 2;
+	return 1;
+}
+
 /*
  * get_cache_size - Get cache size for a specified CPU
  * @cpu_no:	CPU number
@@ -190,6 +245,8 @@ int get_cache_size(int cpu_no, char *cache_type, unsigned long *cache_size)
 			break;
 	}
 
+	if (cache_num == 3)
+		*cache_size /= snc_ways();
 	return 0;
 }
 
-- 
2.40.1



* Re: [PATCH v3 1/8] x86/resctrl: Refactor in preparation for node-scoped resources
  2023-07-13 16:32 ` [PATCH v3 1/8] x86/resctrl: Refactor in preparation for node-scoped resources Tony Luck
@ 2023-07-18 20:36   ` Reinette Chatre
  0 siblings, 0 replies; 331+ messages in thread
From: Reinette Chatre @ 2023-07-18 20:36 UTC (permalink / raw)
  To: Tony Luck, Fenghua Yu, Peter Newman, Jonathan Corbet, Shuah Khan, x86
  Cc: Shaopeng Tan, James Morse, Jamie Iles, Babu Moger, Randy Dunlap,
	linux-kernel, linux-doc, patches

Hi Tony,

On 7/13/2023 9:32 AM, Tony Luck wrote:
> Sub-NUMA cluster systems provide monitoring resources at the NUMA
> node scope instead of the L3 cache scope.
> 
> Rename the cache_level field in struct rdt_resource to the more
> generic "scope" and add symbolic names and a helper function.

Can the changelog elaborate on how the helper function is intended
to be used? When a changelog just states "add a helper function" it
is unnecessary, since that much is clear from the code.

> 
> No functional change.
> 
> Signed-off-by: Tony Luck <tony.luck@intel.com>
> Reviewed-by: Peter Newman <peternewman@google.com>
> ---

...

> diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
> index 030d3b409768..6571514752f3 100644
> --- a/arch/x86/kernel/cpu/resctrl/core.c
> +++ b/arch/x86/kernel/cpu/resctrl/core.c
> @@ -65,7 +65,7 @@ struct rdt_hw_resource rdt_resources_all[] = {
>  		.r_resctrl = {
>  			.rid			= RDT_RESOURCE_L3,
>  			.name			= "L3",
> -			.cache_level		= 3,
> +			.scope			= SCOPE_L3_CACHE,
>  			.domains		= domain_init(RDT_RESOURCE_L3),
>  			.parse_ctrlval		= parse_cbm,
>  			.format_str		= "%d=%0*x",
> @@ -79,7 +79,7 @@ struct rdt_hw_resource rdt_resources_all[] = {
>  		.r_resctrl = {
>  			.rid			= RDT_RESOURCE_L2,
>  			.name			= "L2",
> -			.cache_level		= 2,
> +			.scope			= SCOPE_L2_CACHE,
>  			.domains		= domain_init(RDT_RESOURCE_L2),
>  			.parse_ctrlval		= parse_cbm,
>  			.format_str		= "%d=%0*x",
> @@ -93,7 +93,7 @@ struct rdt_hw_resource rdt_resources_all[] = {
>  		.r_resctrl = {
>  			.rid			= RDT_RESOURCE_MBA,
>  			.name			= "MB",
> -			.cache_level		= 3,
> +			.scope			= SCOPE_L3_CACHE,
>  			.domains		= domain_init(RDT_RESOURCE_MBA),
>  			.parse_ctrlval		= parse_bw,
>  			.format_str		= "%d=%*u",
> @@ -105,7 +105,7 @@ struct rdt_hw_resource rdt_resources_all[] = {
>  		.r_resctrl = {
>  			.rid			= RDT_RESOURCE_SMBA,
>  			.name			= "SMBA",
> -			.cache_level		= 3,
> +			.scope			= 3,

Should this be SCOPE_L3_CACHE?

>  			.domains		= domain_init(RDT_RESOURCE_SMBA),
>  			.parse_ctrlval		= parse_bw,
>  			.format_str		= "%d=%*u",

...

> diff --git a/arch/x86/kernel/cpu/resctrl/pseudo_lock.c b/arch/x86/kernel/cpu/resctrl/pseudo_lock.c
> index 458cb7419502..42f124ffb968 100644
> --- a/arch/x86/kernel/cpu/resctrl/pseudo_lock.c
> +++ b/arch/x86/kernel/cpu/resctrl/pseudo_lock.c
> @@ -297,7 +297,7 @@ static int pseudo_lock_region_init(struct pseudo_lock_region *plr)
>  	plr->size = rdtgroup_cbm_to_size(plr->s->res, plr->d, plr->cbm);
>  
>  	for (i = 0; i < ci->num_leaves; i++) {
> -		if (ci->info_list[i].level == plr->s->res->cache_level) {
> +		if (ci->info_list[i].level == plr->s->res->scope) {
>  			plr->line_size = ci->info_list[i].coherency_line_size;
>  			return 0;
>  		}
> diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> index 725344048f85..418658f0a9ad 100644
> --- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> +++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> @@ -1348,7 +1348,7 @@ unsigned int rdtgroup_cbm_to_size(struct rdt_resource *r,
>  	num_b = bitmap_weight(&cbm, r->cache.cbm_len);
>  	ci = get_cpu_cacheinfo(cpumask_any(&d->cpu_mask));
>  	for (i = 0; i < ci->num_leaves; i++) {
> -		if (ci->info_list[i].level == r->cache_level) {
> +		if (ci->info_list[i].level == r->scope) {
>  			size = ci->info_list[i].size / r->cache.cbm_len * num_b;
>  			break;
>  		}

The last two hunks are red flags to me. Clearly the "cache_level"->"scope"
change is done in preparation for "scope" to be assigned more values than
2 or 3. Yet the code continue to use these values as cache levels, comparing
it to cacheinfo->level for which I only expect cache levels 2 or 3 to be valid.
The above two hunks thus now have potential for errors when rdt_resource->scope
has a value that is not 2 or 3. 

Even if these functions may not be called if rdt_resource->scope is not 2 or 3,
this change makes the code harder to understand and maintain because now it
requires users to know in which flows particular functions can be called and/or
when code paths with invalid values are "ok".

Reinette



* Re: [PATCH v3 2/8] x86/resctrl: Remove hard code of RDT_RESOURCE_L3 in monitor.c
  2023-07-13 16:32 ` [PATCH v3 2/8] x86/resctrl: Remove hard code of RDT_RESOURCE_L3 in monitor.c Tony Luck
@ 2023-07-18 20:37   ` Reinette Chatre
  0 siblings, 0 replies; 331+ messages in thread
From: Reinette Chatre @ 2023-07-18 20:37 UTC (permalink / raw)
  To: Tony Luck, Fenghua Yu, Peter Newman, Jonathan Corbet, Shuah Khan, x86
  Cc: Shaopeng Tan, James Morse, Jamie Iles, Babu Moger, Randy Dunlap,
	linux-kernel, linux-doc, patches

Hi Tony,

Regarding subject and change: Why is the focus on just monitor.c?
Hardcoding of RDT_RESOURCE_L3 as monitoring resource is done
elsewhere also (rdtgroup.c:rdt_get_tree()) - why not treat all
hardcoding?

On 7/13/2023 9:32 AM, Tony Luck wrote:

...

> @@ -759,9 +761,9 @@ static struct mon_evt mbm_local_event = {
>  /*
>   * Initialize the event list for the resource.
>   *
> - * Note that MBM events are also part of RDT_RESOURCE_L3 resource
> - * because as per the SDM the total and local memory bandwidth
> - * are enumerated as part of L3 monitoring.
> + * Monitor events can either be part of RDT_RESOURCE_L3 resource,
> + * or they may be per NUMA node on systems with sub-NUMA cluster
> + * enabled and are then in the RDT_RESOURCE_NODE resource.
>   */
>  static void l3_mon_evt_init(struct rdt_resource *r)
>  {
> @@ -773,6 +775,8 @@ static void l3_mon_evt_init(struct rdt_resource *r)
>  		list_add_tail(&mbm_total_event.list, &r->evt_list);
>  	if (is_mbm_local_enabled())
>  		list_add_tail(&mbm_local_event.list, &r->evt_list);
> +
> +	mon_resource = r;
>  }
>  

This does not seem like the right place for this initialization.
mon_evt_init() has a single job that the function comment clearly
states: "Initialize the event list for the resource". What does
the global mon_resource have to do with the event list?

Would get_rdt_mon_resources() not be more appropriate?
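
For example, just a sketch of where I would expect it (not tested):

	static __init bool get_rdt_mon_resources(void)
	{
		struct rdt_resource *r = &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl;

		mon_resource = r;
		...
	}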

Although, looking ahead it is not clear to me why this is needed.
I'll try to focus my responses to the individual patches in this
regard.

>  int __init rdt_get_mon_l3_config(struct rdt_resource *r)

Reinette


* Re: [PATCH v3 3/8] x86/resctrl: Add a new node-scoped resource to rdt_resources_all[]
  2023-07-13 16:32 ` [PATCH v3 3/8] x86/resctrl: Add a new node-scoped resource to rdt_resources_all[] Tony Luck
@ 2023-07-18 20:40   ` Reinette Chatre
  2023-07-18 22:57     ` Tony Luck
  0 siblings, 1 reply; 331+ messages in thread
From: Reinette Chatre @ 2023-07-18 20:40 UTC (permalink / raw)
  To: Tony Luck, Fenghua Yu, Peter Newman, Jonathan Corbet, Shuah Khan, x86
  Cc: Shaopeng Tan, James Morse, Jamie Iles, Babu Moger, Randy Dunlap,
	linux-kernel, linux-doc, patches

Hi Tony,

On 7/13/2023 9:32 AM, Tony Luck wrote:
> Add a placeholder in the array of struct rdt_hw_resource to be used
> for event monitoring of systems with Sub-NUMA Cluster enabled.

Could you please elaborate why a new resource is required?


> 
> Update get_domain_id() to handle SCOPE_NODE.
> 
> Signed-off-by: Tony Luck <tony.luck@intel.com>
> Reviewed-by: Peter Newman <peternewman@google.com>
> ---
>  arch/x86/kernel/cpu/resctrl/internal.h |  4 +++-
>  arch/x86/kernel/cpu/resctrl/core.c     | 12 ++++++++++++
>  2 files changed, 15 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/x86/kernel/cpu/resctrl/internal.h b/arch/x86/kernel/cpu/resctrl/internal.h
> index 8275b8a74f7e..243017096ddf 100644
> --- a/arch/x86/kernel/cpu/resctrl/internal.h
> +++ b/arch/x86/kernel/cpu/resctrl/internal.h
> @@ -435,6 +435,7 @@ enum resctrl_res_level {
>  	RDT_RESOURCE_L2,
>  	RDT_RESOURCE_MBA,
>  	RDT_RESOURCE_SMBA,
> +	RDT_RESOURCE_NODE,
>  
>  	/* Must be the last */
>  	RDT_NUM_RESOURCES,
> @@ -442,7 +443,8 @@ enum resctrl_res_level {
>  
>  enum resctrl_scope {
>  	SCOPE_L2_CACHE = 2,
> -	SCOPE_L3_CACHE = 3
> +	SCOPE_L3_CACHE = 3,
> +	SCOPE_NODE,
>  };

A new resource _and_ a new scope are added. Could the changelog please
explain why this is required?

>  
>  static inline struct rdt_resource *resctrl_inc(struct rdt_resource *res)
> diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
> index 6571514752f3..e4bd3072927c 100644
> --- a/arch/x86/kernel/cpu/resctrl/core.c
> +++ b/arch/x86/kernel/cpu/resctrl/core.c
> @@ -112,6 +112,16 @@ struct rdt_hw_resource rdt_resources_all[] = {
>  			.fflags			= RFTYPE_RES_MB,
>  		},
>  	},
> +	[RDT_RESOURCE_NODE] =
> +	{
> +		.r_resctrl = {
> +			.rid			= RDT_RESOURCE_NODE,
> +			.name			= "L3",
> +			.scope			= SCOPE_NODE,
> +			.domains		= domain_init(RDT_RESOURCE_NODE),
> +			.fflags			= 0,
> +		},
> +	},
>  };

So the new resource has the same name, from the user's perspective,
as RDT_RESOURCE_L3. From this perspective it thus seems to be a
shadow of RDT_RESOURCE_L3 that is used as an alternative for some
properties of the actual RDT_RESOURCE_L3? This is starting to look as
though this solution is wrenching itself into the current architecture.

From what I can tell the monitoring in an SNC environment needs a different
domain list because of the change in scope. What else is needed in the
resource that is different from the existing L3 resource? Could the
monitoring scope of a resource not instead be made distinct from its
allocation scope? By default monitoring and allocation scope will be
the same and thus use the same domain list but when SNC is enabled
then monitoring uses a different domain list.
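
Something along these lines, purely as an illustration (field names made
up, not taken from this series):

	struct rdt_resource {
		...
		enum resctrl_scope	ctrl_scope;	/* scope of allocation domains */
		enum resctrl_scope	mon_scope;	/* scope of monitoring domains */
		struct list_head	ctrl_domains;
		struct list_head	mon_domains;	/* differs only when SNC enabled */
		...
	};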
  
>  /*
> @@ -489,6 +499,8 @@ static int arch_domain_mbm_alloc(u32 num_rmid, struct rdt_hw_domain *hw_dom)
>  
>  static int get_domain_id(int cpu, enum resctrl_scope scope)
>  {
> +	if (scope == SCOPE_NODE)
> +		return cpu_to_node(cpu);
>  	return get_cpu_cacheinfo_id(cpu, scope);
>  }
>  

Reinette


* Re: [PATCH v3 4/8] x86/resctrl: Add code to setup monitoring at L3 or NODE scope.
  2023-07-13 16:32 ` [PATCH v3 4/8] x86/resctrl: Add code to setup monitoring at L3 or NODE scope Tony Luck
@ 2023-07-18 20:42   ` Reinette Chatre
  0 siblings, 0 replies; 331+ messages in thread
From: Reinette Chatre @ 2023-07-18 20:42 UTC (permalink / raw)
  To: Tony Luck, Fenghua Yu, Peter Newman, Jonathan Corbet, Shuah Khan, x86
  Cc: Shaopeng Tan, James Morse, Jamie Iles, Babu Moger, Randy Dunlap,
	linux-kernel, linux-doc, patches

Hi Tony,

Regarding subject: "Add code" is not necessary.

On 7/13/2023 9:32 AM, Tony Luck wrote:
> When Sub-NUMA cluster is enabled (snc_ways > 1) use the RDT_RESOURCE_NODE
> instead of RDT_RESOURCE_L3 for all monitoring operations.

This duplication of the resource does not look right to me.
RDT_RESOURCE_NODE now contains the monitoring data for RDT_RESOURCE_L3,
with the related structures within RDT_RESOURCE_L3 going unused.

> 
> The mon_scale and num_rmid values from CPUID(0xf,0x1),(EBX,ECX) must be
> scaled down by the number of Sub-NUMA Clusters.
> 
> A subsequent change will detect sub-NUMA cluster mode and set
> "snc_ways". For now set to one (meaning each L3 cache spans one
> node).
> 
> Signed-off-by: Tony Luck <tony.luck@intel.com>
> Reviewed-by: Peter Newman <peternewman@google.com>
> ---
>  arch/x86/kernel/cpu/resctrl/internal.h | 7 +++++++
>  arch/x86/kernel/cpu/resctrl/core.c     | 7 ++++++-
>  arch/x86/kernel/cpu/resctrl/monitor.c  | 4 ++--
>  arch/x86/kernel/cpu/resctrl/rdtgroup.c | 2 +-
>  4 files changed, 16 insertions(+), 4 deletions(-)
> 
> diff --git a/arch/x86/kernel/cpu/resctrl/internal.h b/arch/x86/kernel/cpu/resctrl/internal.h
> index 243017096ddf..38bac0062c82 100644
> --- a/arch/x86/kernel/cpu/resctrl/internal.h
> +++ b/arch/x86/kernel/cpu/resctrl/internal.h
> @@ -430,6 +430,8 @@ DECLARE_STATIC_KEY_FALSE(rdt_alloc_enable_key);
>  
>  extern struct dentry *debugfs_resctrl;
>  
> +extern int snc_ways;
> +
>  enum resctrl_res_level {
>  	RDT_RESOURCE_L3,
>  	RDT_RESOURCE_L2,
> @@ -447,6 +449,11 @@ enum resctrl_scope {
>  	SCOPE_NODE,
>  };
>  
> +static inline int get_mbm_res_level(void)
> +{
> +	return snc_ways > 1 ? RDT_RESOURCE_NODE : RDT_RESOURCE_L3;
> +}

Need to return the enum here? It may be simpler for this helper to
just return a pointer to the resource. (Although the need for a
separate resource is still not clear to me.)
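
Something like this (name made up, untested):

	static inline struct rdt_resource *resctrl_mon_resource(void)
	{
		int rid = snc_ways > 1 ? RDT_RESOURCE_NODE : RDT_RESOURCE_L3;

		return &rdt_resources_all[rid].r_resctrl;
	}
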

> +
>  static inline struct rdt_resource *resctrl_inc(struct rdt_resource *res)
>  {
>  	struct rdt_hw_resource *hw_res = resctrl_to_arch_res(res);
> diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
> index e4bd3072927c..6fe9f87d4403 100644
> --- a/arch/x86/kernel/cpu/resctrl/core.c
> +++ b/arch/x86/kernel/cpu/resctrl/core.c
> @@ -48,6 +48,11 @@ int max_name_width, max_data_width;
>   */
>  bool rdt_alloc_capable;
>  
> +/*
> + * How many Sub-Numa Cluster nodes share a single L3 cache
> + */
> +int snc_ways = 1;
> +

Since snc_ways is always used I think the comment should provide
more detail on the possible values it may have. For example, to
a reader it may not be obvious what the value of snc_ways should be
if SNC is disabled. Also, what does "ways" refer to? (Also mentioned
later).

>  static void
>  mba_wrmsr_intel(struct rdt_domain *d, struct msr_param *m,
>  		struct rdt_resource *r);
> @@ -831,7 +836,7 @@ static __init bool get_rdt_alloc_resources(void)
>  
>  static __init bool get_rdt_mon_resources(void)
>  {
> -	struct rdt_resource *r = &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl;
> +	struct rdt_resource *r = &rdt_resources_all[get_mbm_res_level()].r_resctrl;
>  
>  	if (rdt_cpu_has(X86_FEATURE_CQM_OCCUP_LLC))
>  		rdt_mon_features |= (1 << QOS_L3_OCCUP_EVENT_ID);
> diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c
> index 9be6ffdd01ae..da3f36212898 100644
> --- a/arch/x86/kernel/cpu/resctrl/monitor.c
> +++ b/arch/x86/kernel/cpu/resctrl/monitor.c
> @@ -787,8 +787,8 @@ int __init rdt_get_mon_l3_config(struct rdt_resource *r)
>  	int ret;
>  
>  	resctrl_rmid_realloc_limit = boot_cpu_data.x86_cache_size * 1024;
> -	hw_res->mon_scale = boot_cpu_data.x86_cache_occ_scale;
> -	r->num_rmid = boot_cpu_data.x86_cache_max_rmid + 1;
> +	hw_res->mon_scale = boot_cpu_data.x86_cache_occ_scale / snc_ways;
> +	r->num_rmid = (boot_cpu_data.x86_cache_max_rmid + 1) / snc_ways;
>  	hw_res->mbm_width = MBM_CNTR_WIDTH_BASE;
>  
>  	if (mbm_offset > 0 && mbm_offset <= MBM_CNTR_WIDTH_OFFSET_MAX)
> diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> index 418658f0a9ad..d037f3da9e55 100644
> --- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> +++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> @@ -2524,7 +2524,7 @@ static int rdt_get_tree(struct fs_context *fc)
>  		static_branch_enable_cpuslocked(&rdt_enable_key);
>  
>  	if (is_mbm_enabled()) {
> -		r = &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl;
> +		r = &rdt_resources_all[get_mbm_res_level()].r_resctrl;
>  		list_for_each_entry(dom, &r->domains, list)
>  			mbm_setup_overflow_handler(dom, MBM_OVERFLOW_INTERVAL);
>  	}

This final hunk makes me wonder why the monitor.c:
mon_resource is necessary at all. A single helper used everywhere
may be simpler.

Reinette

^ permalink raw reply	[flat|nested] 331+ messages in thread

* Re: [PATCH v3 5/8] x86/resctrl: Add package scoped resource
  2023-07-13 16:32 ` [PATCH v3 5/8] x86/resctrl: Add package scoped resource Tony Luck
@ 2023-07-18 20:43   ` Reinette Chatre
  2023-07-18 23:05     ` Tony Luck
  0 siblings, 1 reply; 331+ messages in thread
From: Reinette Chatre @ 2023-07-18 20:43 UTC (permalink / raw)
  To: Tony Luck, Fenghua Yu, Peter Newman, Jonathan Corbet, Shuah Khan, x86
  Cc: Shaopeng Tan, James Morse, Jamie Iles, Babu Moger, Randy Dunlap,
	linux-kernel, linux-doc, patches

Hi Tony,

On 7/13/2023 9:32 AM, Tony Luck wrote:
> Some Intel features require setting a package scoped model specific
> register.
> 
> Add a new resource that builds domains for each package.

If I understand correctly the only purpose of this new resource
is to know when the first CPU associated with a package
comes online. Am I not reading this right? Using a resctrl resource
for this purpose seems inappropriate and unnecessary while also
making the code very hard to follow.

Reinette

^ permalink raw reply	[flat|nested] 331+ messages in thread

* Re: [PATCH v3 7/8] x86/resctrl: Determine if Sub-NUMA Cluster is enabled and initialize.
  2023-07-13 16:32 ` [PATCH v3 7/8] x86/resctrl: Determine if Sub-NUMA Cluster is enabled and initialize Tony Luck
@ 2023-07-18 20:45   ` Reinette Chatre
  0 siblings, 0 replies; 331+ messages in thread
From: Reinette Chatre @ 2023-07-18 20:45 UTC (permalink / raw)
  To: Tony Luck, Fenghua Yu, Peter Newman, Jonathan Corbet, Shuah Khan, x86
  Cc: Shaopeng Tan, James Morse, Jamie Iles, Babu Moger, Randy Dunlap,
	linux-kernel, linux-doc, patches

Hi Tony,

On 7/13/2023 9:32 AM, Tony Luck wrote:
> There isn't a simple hardware enumeration to indicate to software that
> a system is running with Sub-NUMA Cluster enabled.

This changelog appears to _almost_ be identical to the cover letter. There
is no problem with changelog and cover letter being identical but it makes
things confusing when there are slight differences between the text.
For example, "Sub-NUMA Cluster" vs "Sub-NUMA Clustering". With this difference
between the two, the reader is left wondering what is behind the difference.

> 
> Compare the number of NUMA nodes with the number of L3 caches to calculate
> the number of Sub-NUMA nodes per L3 cache.
> 
> When Sub-NUMA cluster mode is enabled in BIOS setup the RMID counters
> are distributed equally between the SNC nodes within each socket.
> 
> E.g. if there are 400 RMID counters, and the system is configured with
> two SNC nodes per socket, then RMID counter 0..199 are used on SNC node
> 0 on the socket, and RMID counter 200..399 on SNC node 1.
> 
> A model specific MSR (0xca0) can change the configuration of the RMIDs
> when SNC mode is enabled.
> 
> The MSR controls the interpretation of the RMID field in the
> IA32_PQR_ASSOC MSR so that the appropriate hardware counters
> within the SNC node are updated.
> 
> Also initialize a per-cpu RMID offset value. Use this
> to calculate the value to write to the IA32_QM_EVTSEL MSR when
> reading RMID event values.
> 
> N.B. this works well for well-behaved NUMA applications that access
> memory predominantly from the local memory node. For applications that
> access memory across multiple nodes it may be necessary for the user
> to read counters for all SNC nodes on a socket and add the values to
> get the actual LLC occupancy or memory bandwidth. Perhaps this isn't
> all that different from applications that span across multiple sockets
> in a legacy system.
> 
> Signed-off-by: Tony Luck <tony.luck@intel.com>
> Reviewed-by: Peter Newman <peternewman@google.com>
> Tested-by: Peter Newman <peternewman@google.com>
> ---
>  arch/x86/include/asm/resctrl.h         |  2 +
>  arch/x86/kernel/cpu/resctrl/core.c     | 99 +++++++++++++++++++++++++-
>  arch/x86/kernel/cpu/resctrl/monitor.c  |  2 +-
>  arch/x86/kernel/cpu/resctrl/rdtgroup.c |  2 +-
>  4 files changed, 100 insertions(+), 5 deletions(-)
> 
> diff --git a/arch/x86/include/asm/resctrl.h b/arch/x86/include/asm/resctrl.h
> index 255a78d9d906..f95e69bacc65 100644
> --- a/arch/x86/include/asm/resctrl.h
> +++ b/arch/x86/include/asm/resctrl.h
> @@ -35,6 +35,8 @@ DECLARE_STATIC_KEY_FALSE(rdt_enable_key);
>  DECLARE_STATIC_KEY_FALSE(rdt_alloc_enable_key);
>  DECLARE_STATIC_KEY_FALSE(rdt_mon_enable_key);
>  
> +DECLARE_PER_CPU(int, rmid_offset);
> +
>  /*
>   * __resctrl_sched_in() - Writes the task's CLOSid/RMID to IA32_PQR_MSR
>   *
> diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
> index af3be3c2db96..a03ff1a95624 100644
> --- a/arch/x86/kernel/cpu/resctrl/core.c
> +++ b/arch/x86/kernel/cpu/resctrl/core.c
> @@ -16,11 +16,14 @@
>  
>  #define pr_fmt(fmt)	"resctrl: " fmt
>  
> +#include <linux/cpu.h>
>  #include <linux/slab.h>
>  #include <linux/err.h>
>  #include <linux/cacheinfo.h>
>  #include <linux/cpuhotplug.h>
> +#include <linux/mod_devicetable.h>
>  
> +#include <asm/cpu_device_id.h>
>  #include <asm/intel-family.h>
>  #include <asm/resctrl.h>
>  #include "internal.h"
> @@ -524,6 +527,39 @@ static int get_domain_id(int cpu, enum resctrl_scope scope)
>  	}
>  }
>  
> +DEFINE_PER_CPU(int, rmid_offset);
> +
> +static void set_per_cpu_rmid_offset(int cpu, struct rdt_resource *r)
> +{
> +	this_cpu_write(rmid_offset, (cpu_to_node(cpu) % snc_ways) * r->num_rmid);
> +}

Does this mean that per-cpu data is used as a way to keep "per SNC node" data?
Why is it required for this to be per-cpu data instead of, for example, the
offset computed when it is needed?
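
For example, just a sketch of computing it at the point of use in
__rmid_read() (the L3 resource is reachable there via
rdt_resources_all[], so no new per-cpu state would be needed):

	struct rdt_resource *r = &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl;
	u32 offset;

	/* Which SNC node within this L3 cache the current CPU belongs to */
	offset = (cpu_to_node(smp_processor_id()) % snc_ways) * r->num_rmid;

	wrmsr(MSR_IA32_QM_EVTSEL, eventid, rmid + offset);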

> +
> +/*
> + * This MSR provides for configuration of RMIDs on Sub-NUMA Cluster
> + * systems.
> + * Bit0 = 1 (default) For legacy configuration
> + * Bit0 = 0 RMIDs are divided evenly between SNC nodes.
> + */
> +#define MSR_RMID_SNC_CONFIG   0xCA0

Please move to msr-index.h. For reference:
97fa21f65c3e ("x86/resctrl: Move MSR defines into msr-index.h")

> +
> +static void snc_add_pkg(void)
> +{
> +	u64	msrval;
> +
> +	rdmsrl(MSR_RMID_SNC_CONFIG, msrval);
> +	msrval &= ~BIT_ULL(0);
> +	wrmsrl(MSR_RMID_SNC_CONFIG, msrval);
> +}
> +
> +static void snc_remove_pkg(void)
> +{
> +	u64	msrval;
> +
> +	rdmsrl(MSR_RMID_SNC_CONFIG, msrval);
> +	msrval |= BIT_ULL(0);
> +	wrmsrl(MSR_RMID_SNC_CONFIG, msrval);
> +}
> +
>  /*
>   * domain_add_cpu - Add a cpu to a resource's domain list.
>   *
> @@ -555,6 +591,8 @@ static void domain_add_cpu(int cpu, struct rdt_resource *r)
>  		cpumask_set_cpu(cpu, &d->cpu_mask);
>  		if (r->cache.arch_has_per_cpu_cfg)
>  			rdt_domain_reconfigure_cdp(r);
> +		if (r->mon_capable)
> +			set_per_cpu_rmid_offset(cpu, r);
>  		return;
>  	}
>  
> @@ -573,11 +611,17 @@ static void domain_add_cpu(int cpu, struct rdt_resource *r)
>  		return;
>  	}
>  
> -	if (r->mon_capable && arch_domain_mbm_alloc(r->num_rmid, hw_dom)) {
> -		domain_free(hw_dom);
> -		return;
> +	if (r->mon_capable) {
> +		if (arch_domain_mbm_alloc(r->num_rmid, hw_dom)) {
> +			domain_free(hw_dom);
> +			return;
> +		}
> +		set_per_cpu_rmid_offset(cpu, r);
>  	}
>  
> +	if (r->pkg_actions)
> +		snc_add_pkg();
> +

This seems unnecessary use of a resctrl resource.


>  	list_add_tail(&d->list, add_pos);
>  
>  	err = resctrl_online_domain(r, d);
> @@ -613,6 +657,9 @@ static void domain_remove_cpu(int cpu, struct rdt_resource *r)
>  			d->plr->d = NULL;
>  		domain_free(hw_dom);
>  
> +		if (r->pkg_actions)
> +			snc_remove_pkg();
> +
>  		return;
>  	}
>  
> @@ -899,11 +946,57 @@ static __init bool get_rdt_resources(void)
>  	return (rdt_mon_capable || rdt_alloc_capable);
>  }
>  
> +static const struct x86_cpu_id snc_cpu_ids[] __initconst = {
> +	X86_MATCH_INTEL_FAM6_MODEL(ICELAKE_X, 0),
> +	X86_MATCH_INTEL_FAM6_MODEL(SAPPHIRERAPIDS_X, 0),
> +	X86_MATCH_INTEL_FAM6_MODEL(EMERALDRAPIDS_X, 0),
> +	{}
> +};
> +
> +/*
> + * There isn't a simple enumeration bit to show whether SNC mode
> + * is enabled. Look at the ratio of number of NUMA nodes to the
> + * number of distinct L3 caches. Take care to skip memory-only nodes.
> + */
> +static __init int find_snc_ways(void)

Based on the function comment and function name it is not clear what
this function is intended to do. For caches, "ways" has a particular meaning;
what does "ways" mean in this context? 

> +{
> +	unsigned long *node_caches;
> +	int mem_only_nodes = 0;
> +	int cpu, node, ret;
> +
> +	if (!x86_match_cpu(snc_cpu_ids))
> +		return 1;
> +
> +	node_caches = kcalloc(BITS_TO_LONGS(nr_node_ids), sizeof(*node_caches), GFP_KERNEL);
> +	if (!node_caches)
> +		return 1;
> +
> +	cpus_read_lock();
> +	for_each_node(node) {
> +		cpu = cpumask_first(cpumask_of_node(node));
> +		if (cpu < nr_cpu_ids)
> +			set_bit(get_cpu_cacheinfo_id(cpu, 3), node_caches);
> +		else
> +			mem_only_nodes++;
> +	}
> +	cpus_read_unlock();
> +
> +	ret = (nr_node_ids - mem_only_nodes) / bitmap_weight(node_caches, nr_node_ids);
> +	kfree(node_caches);
> +
> +	if (ret > 1)
> +		rdt_resources_all[RDT_RESOURCE_PKG].r_resctrl.pkg_actions = true;
> +
> +	return ret;
> +}
> +
>  static __init void rdt_init_res_defs_intel(void)
>  {
>  	struct rdt_hw_resource *hw_res;
>  	struct rdt_resource *r;
>  
> +	snc_ways = find_snc_ways();
> +
>  	for_each_rdt_resource(r) {
>  		hw_res = resctrl_to_arch_res(r);
>  
> diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c
> index da3f36212898..74db99d299e1 100644
> --- a/arch/x86/kernel/cpu/resctrl/monitor.c
> +++ b/arch/x86/kernel/cpu/resctrl/monitor.c
> @@ -160,7 +160,7 @@ static int __rmid_read(u32 rmid, enum resctrl_event_id eventid, u64 *val)
>  	 * IA32_QM_CTR.Error (bit 63) and IA32_QM_CTR.Unavailable (bit 62)
>  	 * are error bits.
>  	 */
> -	wrmsr(MSR_IA32_QM_EVTSEL, eventid, rmid);
> +	wrmsr(MSR_IA32_QM_EVTSEL, eventid, rmid + this_cpu_read(rmid_offset));
>  	rdmsrl(MSR_IA32_QM_CTR, msr_val);
>  
>  	if (msr_val & RMID_VAL_ERROR)
> diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> index d037f3da9e55..1a9c38b018ba 100644
> --- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> +++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> @@ -1354,7 +1354,7 @@ unsigned int rdtgroup_cbm_to_size(struct rdt_resource *r,
>  		}
>  	}
>  
> -	return size;
> +	return size / snc_ways;
>  }
>  
>  /**

The last hunk does not seem to be covered in the changelog.

Reinette

^ permalink raw reply	[flat|nested] 331+ messages in thread

* Re: [PATCH v3 3/8] x86/resctrl: Add a new node-scoped resource to rdt_resources_all[]
  2023-07-18 20:40   ` Reinette Chatre
@ 2023-07-18 22:57     ` Tony Luck
  2023-07-18 23:45       ` Reinette Chatre
  0 siblings, 1 reply; 331+ messages in thread
From: Tony Luck @ 2023-07-18 22:57 UTC (permalink / raw)
  To: Reinette Chatre
  Cc: Fenghua Yu, Peter Newman, Jonathan Corbet, Shuah Khan, x86,
	Shaopeng Tan, James Morse, Jamie Iles, Babu Moger, Randy Dunlap,
	linux-kernel, linux-doc, patches

On Tue, Jul 18, 2023 at 01:40:32PM -0700, Reinette Chatre wrote:
> > +	[RDT_RESOURCE_NODE] =
> > +	{
> > +		.r_resctrl = {
> > +			.rid			= RDT_RESOURCE_NODE,
> > +			.name			= "L3",
> > +			.scope			= SCOPE_NODE,
> > +			.domains		= domain_init(RDT_RESOURCE_NODE),
> > +			.fflags			= 0,
> > +		},
> > +	},
> >  };
> 
> So the new resource has the same name, from user perspective,
> as RDT_RESOURCE_L3. From this perspective it thus seems to be a
> shadow of RDT_RESOURCE_L3 that is used as alternative for some properties
> of the actual RDT_RESOURCE_L3? This is starting to look as though this
> solution is wrenching itself into current architecture.
> 
> From what I can tell the monitoring in SNC environment needs a different
> domain list because of the change in scope. What else is needed in the
> resource that is different from the existing L3 resource? Could the
> monitoring scope of a resource not instead be made distinct from its
> allocation scope? By default monitoring and allocation scope will be
> the same and thus use the same domain list but when SNC is enabled
> then monitoring uses a different domain list.

Answering this part first, because my choice here affects a bunch
of the code that also raised comments from you.

The crux of the issue is that when SNC mode is enabled the scope
for L3 monitoring functions changes to "node" scope, while the
scope of L3 control functions (CAT, CDP) remains at L3 cache scope.

My solution was to just create a new resource. But you have an
interesting alternate solution. Add an extra domain list to the
resource structure to allow creation of distinct domain lists
for this case where the scope for control and monitor functions
differs.

So change the resource structure like this:

diff --git a/include/linux/resctrl.h b/include/linux/resctrl.h
index 8334eeacfec5..01590aa59a67 100644
--- a/include/linux/resctrl.h
+++ b/include/linux/resctrl.h
@@ -168,10 +168,12 @@ struct rdt_resource {
 	bool			alloc_capable;
 	bool			mon_capable;
 	int			num_rmid;
-	int			cache_level;
+	int			ctrl_scope;
+	int			mon_scope;
 	struct resctrl_cache	cache;
 	struct resctrl_membw	membw;
-	struct list_head	domains;
+	struct list_head	ctrl_domains;
+	struct list_head	mon_domains;
 	char			*name;
 	int			data_width;
 	u32			default_ctrl;

and build/use separate domain lists for when this resource is
being referenced for allocation/monitoring. E.g. domain_add_cpu()
would check "r->alloc_capable" and add a cpu to the ctrl_domains
list based on the ctrl_scope value. It would do the same with
mon_capable / mon_domains / mon_scope.

If ctrl_scope == mon_scope, just build one list as you suggest above.

Maybe there are more places that walk the list of control domains than
walk the list of monitor domains. Need to audit this set:

$ git grep list_for_each.*domains -- arch/x86/kernel/cpu/resctrl
arch/x86/kernel/cpu/resctrl/core.c:     list_for_each_entry(d, &r->domains, list) {
arch/x86/kernel/cpu/resctrl/core.c:     list_for_each(l, &r->domains) {
arch/x86/kernel/cpu/resctrl/ctrlmondata.c:      list_for_each_entry(d, &r->domains, list) {
arch/x86/kernel/cpu/resctrl/ctrlmondata.c:      list_for_each_entry(d, &r->domains, list) {
arch/x86/kernel/cpu/resctrl/ctrlmondata.c:      list_for_each_entry(dom, &r->domains, list) {
arch/x86/kernel/cpu/resctrl/monitor.c:  list_for_each_entry(d, &r->domains, list) {
arch/x86/kernel/cpu/resctrl/pseudo_lock.c:              list_for_each_entry(d_i, &r->domains, list) {
arch/x86/kernel/cpu/resctrl/rdtgroup.c:         list_for_each_entry(dom, &r->domains, list)
arch/x86/kernel/cpu/resctrl/rdtgroup.c: list_for_each_entry(dom, &r->domains, list) {
arch/x86/kernel/cpu/resctrl/rdtgroup.c:         list_for_each_entry(d, &r->domains, list) {
arch/x86/kernel/cpu/resctrl/rdtgroup.c:         list_for_each_entry(d, &r->domains, list) {
arch/x86/kernel/cpu/resctrl/rdtgroup.c: list_for_each_entry(dom, &r->domains, list) {
arch/x86/kernel/cpu/resctrl/rdtgroup.c: list_for_each_entry(d, &r->domains, list) {
arch/x86/kernel/cpu/resctrl/rdtgroup.c: list_for_each_entry(d, &r_l->domains, list) {
arch/x86/kernel/cpu/resctrl/rdtgroup.c: list_for_each_entry(d, &r->domains, list) {
arch/x86/kernel/cpu/resctrl/rdtgroup.c:         list_for_each_entry(dom, &r->domains, list)
arch/x86/kernel/cpu/resctrl/rdtgroup.c: list_for_each_entry(d, &r->domains, list) {
arch/x86/kernel/cpu/resctrl/rdtgroup.c: list_for_each_entry(dom, &r->domains, list) {
arch/x86/kernel/cpu/resctrl/rdtgroup.c: list_for_each_entry(d, &s->res->domains, list) {
arch/x86/kernel/cpu/resctrl/rdtgroup.c: list_for_each_entry(d, &r->domains, list) {

Maybe "domains" can keep its name and make a "list_for_each_monitor_domain()" macro
to pick the right list to walk?
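
Something like this, just as a sketch (assuming "domains" keeps its name
for the control list and a separate "mon_domains" list is only populated
when the two scopes differ):

#define list_for_each_monitor_domain(d, r) \
	list_for_each_entry(d, \
		(r)->mon_scope != (r)->ctrl_scope ? &(r)->mon_domains : &(r)->domains, list)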


I don't think this will reduce the amount of code change in a
significant way. But it may be conceptually easier to follow
what is going on.

-Tony

^ permalink raw reply related	[flat|nested] 331+ messages in thread

* Re: [PATCH v3 5/8] x86/resctrl: Add package scoped resource
  2023-07-18 20:43   ` Reinette Chatre
@ 2023-07-18 23:05     ` Tony Luck
  0 siblings, 0 replies; 331+ messages in thread
From: Tony Luck @ 2023-07-18 23:05 UTC (permalink / raw)
  To: Reinette Chatre
  Cc: Fenghua Yu, Peter Newman, Jonathan Corbet, Shuah Khan, x86,
	Shaopeng Tan, James Morse, Jamie Iles, Babu Moger, Randy Dunlap,
	linux-kernel, linux-doc, patches

On Tue, Jul 18, 2023 at 01:43:45PM -0700, Reinette Chatre wrote:
> Hi Tony,
> 
> On 7/13/2023 9:32 AM, Tony Luck wrote:
> > Some Intel features require setting a package scoped model specific
> > register.
> > 
> > Add a new resource that builds domains for each package.
> 
> If I understand correctly the only purpose of this new resource
> is to know when the first CPU associated with a package
> comes online. Am I not reading this right? Using a resctrl resource
> for this purpose seems inappropriate and unnecessary while also
> making the code very hard to follow.

Reinette,

Yes. You understand.

I agree that this is blatant abuse of the resource structures.  I can find
another way to perform an action on the first CPU of a package online,
and the last CPU of a package offline.
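
One possible shape, purely as a sketch (this assumes topology_core_cpumask()
only contains the online CPUs of the package by the time the resctrl
hotplug callback runs):

	/* First CPU of this package to come online: set up the SNC RMID MSR */
	if (snc_ways > 1 && cpumask_weight(topology_core_cpumask(cpu)) == 1)
		snc_add_pkg();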

-Tony

^ permalink raw reply	[flat|nested] 331+ messages in thread

* Re: [PATCH v3 3/8] x86/resctrl: Add a new node-scoped resource to rdt_resources_all[]
  2023-07-18 22:57     ` Tony Luck
@ 2023-07-18 23:45       ` Reinette Chatre
  2023-07-19  0:11         ` Luck, Tony
  2023-07-20  0:20         ` Tony Luck
  0 siblings, 2 replies; 331+ messages in thread
From: Reinette Chatre @ 2023-07-18 23:45 UTC (permalink / raw)
  To: Tony Luck
  Cc: Fenghua Yu, Peter Newman, Jonathan Corbet, Shuah Khan, x86,
	Shaopeng Tan, James Morse, Jamie Iles, Babu Moger, Randy Dunlap,
	linux-kernel, linux-doc, patches

Hi Tony,

On 7/18/2023 3:57 PM, Tony Luck wrote:
> On Tue, Jul 18, 2023 at 01:40:32PM -0700, Reinette Chatre wrote:
>>> +	[RDT_RESOURCE_NODE] =
>>> +	{
>>> +		.r_resctrl = {
>>> +			.rid			= RDT_RESOURCE_NODE,
>>> +			.name			= "L3",
>>> +			.scope			= SCOPE_NODE,
>>> +			.domains		= domain_init(RDT_RESOURCE_NODE),
>>> +			.fflags			= 0,
>>> +		},
>>> +	},
>>>  };
>>
>> So the new resource has the same name, from user perspective,
>> as RDT_RESOURCE_L3. From this perspective it thus seems to be a
>> shadow of RDT_RESOURCE_L3 that is used as alternative for some properties
>> of the actual RDT_RESOURCE_L3? This is starting to look as though this
>> solution is wrenching itself into current architecture.
>>
>> From what I can tell the monitoring in SNC environment needs a different
>> domain list because of the change in scope. What else is needed in the
>> resource that is different from the existing L3 resource? Could the
>> monitoring scope of a resource not instead be made distinct from its
>> allocation scope? By default monitoring and allocation scope will be
>> the same and thus use the same domain list but when SNC is enabled
>> then monitoring uses a different domain list.
> 
> Answering this part first, because my choice here affects a bunch
> of the code that also raised comments from you.

Indeed.

> 
> The crux of the issue is that when SNC mode is enabled the scope
> for L3 monitoring functions changes to "node" scope, while the
> scope of L3 control functions (CAT, CDP) remains at L3 cache scope.
> 
> My solution was to just create a new resource. But you have an
> interesting alternate solution. Add an extra domain list to the
> resource structure to allow creation of distinct domain lists
> for this case where the scope for control and monitor functions
> differs.
> 
> So change the resource structure like this:
> 
> diff --git a/include/linux/resctrl.h b/include/linux/resctrl.h
> index 8334eeacfec5..01590aa59a67 100644
> --- a/include/linux/resctrl.h
> +++ b/include/linux/resctrl.h
> @@ -168,10 +168,12 @@ struct rdt_resource {
>  	bool			alloc_capable;
>  	bool			mon_capable;
>  	int			num_rmid;
> -	int			cache_level;
> +	int			ctrl_scope;
> +	int			mon_scope;

I am not sure about getting rid of cache_level so fast.
I see regarding the current problem being solved that
ctrl_scope would have the same values as cache_level but
I find that adding this level of indirection while keeping
the comparison with cacheinfo->level to create a trap
for future mistakes.


>  	struct resctrl_cache	cache;
>  	struct resctrl_membw	membw;
> -	struct list_head	domains;
> +	struct list_head	ctrl_domains;
> +	struct list_head	mon_domains;
>  	char			*name;
>  	int			data_width;
>  	u32			default_ctrl;
> 
> and build/use separate domain lists for when this resource is
> being referenced for allocation/monitoring. E.g. domain_add_cpu()
> would check "r->alloc_capable" and add a cpu to the ctrl_domains
> list based on the ctrl_scope value. It would do the same with
> mon_capable / mon_domains / mon_scope.
> 
> If ctrl_scope == mon_scope, just build one list as you suggest above.

Yes, this is the idea. Thank you for considering it. Something else
to consider that may make this even cleaner/simpler would be to review
struct rdt_domain and struct rdt_hw_domain members for "monitor" vs "control"
usage. These structs could potentially be split further into separate
"control" and "monitor" variants. For example, "struct rdt_domain" split into
"struct rdt_ctrl_domain" and "struct rdt_mon_domain". If there is a clean
split then resctrl can always create two lists with the unnecessary duplication
eliminated when two domain lists are created. This would also
eliminate the need to scatter ctrl_scope == mon_scope checks throughout. 
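
As a rough illustration of the kind of split I have in mind (member
selection below is only a guess based on how the current struct
rdt_domain fields are used):

struct rdt_ctrl_domain {
	struct list_head		list;
	int				id;
	struct cpumask			cpu_mask;
	struct pseudo_lock_region	*plr;
	u32				*mbps_val;
	struct resctrl_staged_config	staged_config[CDP_NUM_TYPES];
};

struct rdt_mon_domain {
	struct list_head	list;
	int			id;
	struct cpumask		cpu_mask;
	unsigned long		*rmid_busy_llc;
	struct mbm_state	*mbm_total;
	struct mbm_state	*mbm_local;
	struct delayed_work	mbm_over;
	struct delayed_work	cqm_limbo;
	int			mbm_work_cpu;
	int			cqm_work_cpu;
};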

> 
> Maybe there are more places that walk the list of control domains than
> walk the list of monitor domains. Need to audit this set:
> 
> $ git grep list_for_each.*domains -- arch/x86/kernel/cpu/resctrl
> arch/x86/kernel/cpu/resctrl/core.c:     list_for_each_entry(d, &r->domains, list) {
> arch/x86/kernel/cpu/resctrl/core.c:     list_for_each(l, &r->domains) {
> arch/x86/kernel/cpu/resctrl/ctrlmondata.c:      list_for_each_entry(d, &r->domains, list) {
> arch/x86/kernel/cpu/resctrl/ctrlmondata.c:      list_for_each_entry(d, &r->domains, list) {
> arch/x86/kernel/cpu/resctrl/ctrlmondata.c:      list_for_each_entry(dom, &r->domains, list) {
> arch/x86/kernel/cpu/resctrl/monitor.c:  list_for_each_entry(d, &r->domains, list) {
> arch/x86/kernel/cpu/resctrl/pseudo_lock.c:              list_for_each_entry(d_i, &r->domains, list) {
> arch/x86/kernel/cpu/resctrl/rdtgroup.c:         list_for_each_entry(dom, &r->domains, list)
> arch/x86/kernel/cpu/resctrl/rdtgroup.c: list_for_each_entry(dom, &r->domains, list) {
> arch/x86/kernel/cpu/resctrl/rdtgroup.c:         list_for_each_entry(d, &r->domains, list) {
> arch/x86/kernel/cpu/resctrl/rdtgroup.c:         list_for_each_entry(d, &r->domains, list) {
> arch/x86/kernel/cpu/resctrl/rdtgroup.c: list_for_each_entry(dom, &r->domains, list) {
> arch/x86/kernel/cpu/resctrl/rdtgroup.c: list_for_each_entry(d, &r->domains, list) {
> arch/x86/kernel/cpu/resctrl/rdtgroup.c: list_for_each_entry(d, &r_l->domains, list) {
> arch/x86/kernel/cpu/resctrl/rdtgroup.c: list_for_each_entry(d, &r->domains, list) {
> arch/x86/kernel/cpu/resctrl/rdtgroup.c:         list_for_each_entry(dom, &r->domains, list)
> arch/x86/kernel/cpu/resctrl/rdtgroup.c: list_for_each_entry(d, &r->domains, list) {
> arch/x86/kernel/cpu/resctrl/rdtgroup.c: list_for_each_entry(dom, &r->domains, list) {
> arch/x86/kernel/cpu/resctrl/rdtgroup.c: list_for_each_entry(d, &s->res->domains, list) {
> arch/x86/kernel/cpu/resctrl/rdtgroup.c: list_for_each_entry(d, &r->domains, list) {
> 
> Maybe "domains" can keep its name and make a "list_for_each_monitor_domain()" macro
> to pick the right list to walk?

It is not clear to me how "domains" can keep its name. If I understand correctly,
the macro would be useful if scope always needs to be considered. I wonder
if the list walkers may not mostly just walk the appropriate list directly
if resctrl always creates separate "control domain" and "monitor domain"
lists.

> I don't think this will reduce the amount of code change in a
> significant way. But it may be conceptually easier to follow
> what is going on.

Reducing the amount of code changed is not a goal to me. If I understand
correctly I think that adapting resctrl to support different monitor and
control scope could create a foundation into which SNC can slot in smoothly. 

Reinette



^ permalink raw reply	[flat|nested] 331+ messages in thread

* RE: [PATCH v3 3/8] x86/resctrl: Add a new node-scoped resource to rdt_resources_all[]
  2023-07-18 23:45       ` Reinette Chatre
@ 2023-07-19  0:11         ` Luck, Tony
  2023-07-20 17:27           ` Reinette Chatre
  2023-07-20  0:20         ` Tony Luck
  1 sibling, 1 reply; 331+ messages in thread
From: Luck, Tony @ 2023-07-19  0:11 UTC (permalink / raw)
  To: Chatre, Reinette
  Cc: Yu, Fenghua, Peter Newman, Jonathan Corbet, Shuah Khan, x86,
	Shaopeng Tan, James Morse, Jamie Iles, Babu Moger, Randy Dunlap,
	linux-kernel, linux-doc, patches

> Yes, this is the idea. Thank you for considering it. Something else
> to consider that may make this even cleaner/simpler would be to review
> struct rdt_domain and struct rdt_hw_domain members for "monitor" vs "control"
> usage. These structs could potentially be split further into separate
> "control" and "monitor" variants. For example, "struct rdt_domain" split into
> "struct rdt_ctrl_domain" and "struct rdt_mon_domain". If there is a clean
> split then resctrl can always create two lists with the unnecessary duplication
> eliminated when two domain lists are created. This would also
> eliminate the need to scatter ctrl_scope == mon_scope checks throughout.

You might like what I'm doing in the "resctrl2" re-write[1]. Arch independent code
that maintains the domain lists for a resource via a cpuhp notifier just has this
for the domain structure:

struct resctrl_domain {
        struct list_head        list;
        struct cpumask          cpu_mask;
        int                     id;
        int                     cache_size;
};

Each module managing a resource decides what extra information it wants to
carry in the domain. So the above structure is common to all, but it is followed
by whatever the resource module wants. E.g. the CBM masks for each CLOSid
for the CAT module. The module tells core code the size to allocate.
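
Purely illustrative (resctrl2 is a work in progress and these names are
invented), a module-private domain then looks like:

struct cat_domain {
	struct resctrl_domain	common;		/* must come first */
	u32			cbm[];		/* one mask per CLOSid */
};

and the core hotplug code just does a kzalloc() of whatever size the
module registered for its domains.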

"cache_size" is only there because the cache topology bits needed to discover
sizes of caches aren't exported. Both the "size" file and pseudo-locking need
to know the size.

It's also possible that you may hate it. There is zero sharing of resource structures
even if they have the same scope. This is because all modules are independently
loadable.

-Tony

[1] WIP snapshot at git://git.kernel.org/pub/scm/linux/kernel/git/aegl/linux.git
branch resctrl2_v65rc1. That doesn't have pseudo-locking, but most of the rest
of existing resctrl functionality is there.

^ permalink raw reply	[flat|nested] 331+ messages in thread

* RE: [PATCH v3 0/8] x86/resctrl: Add support for Sub-NUMA cluster (SNC) systems
  2023-07-13 16:31 [PATCH v3 0/8] x86/resctrl: Add support for Sub-NUMA cluster (SNC) systems Tony Luck
                   ` (7 preceding siblings ...)
  2023-07-13 16:32 ` [PATCH v3 8/8] selftests/resctrl: Adjust effective L3 cache size when SNC enabled Tony Luck
@ 2023-07-19  2:43 ` Shaopeng Tan (Fujitsu)
  2023-07-22 20:21   ` Tony Luck
  2023-07-22 19:07 ` [PATCH v4 0/7] " Tony Luck
  9 siblings, 1 reply; 331+ messages in thread
From: Shaopeng Tan (Fujitsu) @ 2023-07-19  2:43 UTC (permalink / raw)
  To: 'Tony Luck',
	Fenghua Yu, Reinette Chatre, Peter Newman, Jonathan Corbet,
	Shuah Khan, x86
  Cc: James Morse, Jamie Iles, Babu Moger, Randy Dunlap, linux-kernel,
	linux-doc, patches

Hi Tony,

I ran selftest/resctrl in my environment,
the test result is "not ok".

Processer in my environment:
Intel(R) Xeon(R) Gold 6254 CPU @ 3.10GHz

kernel:
$ uname -r
6.5.0-rc1+

Result :
Sub-NUMA enable:
xxx@xxx:~/linux_v6.5_rc1l$ sudo make -C tools/testing/selftests/resctrl run_tests
make: Entering directory '/.../tools/testing/selftests/resctrl'
TAP version 13
1..1
# timeout set to 120
# selftests: resctrl: resctrl_tests
# TAP version 13
# # Pass: Check kernel supports resctrl filesystem
# # Pass: Check resctrl mountpoint "/sys/fs/resctrl" exists
# # resctrl filesystem not mounted
# # dmesg: [    3.060018] resctrl: L3 allocation detected
# # dmesg: [    3.098180] resctrl: MB allocation detected
# # dmesg: [    3.118507] resctrl: L3 monitoring detected
# 1..4
# # Starting MBM BW change ...
# # Mounting resctrl to "/sys/fs/resctrl"
# # Mounting resctrl to "/sys/fs/resctrl"
# # Benchmark PID: 14784
# # Writing benchmark parameters to resctrl FS
# # Write schema "MB:0=100" to resctrl FS
# # Checking for pass/fail
# # Fail: Check MBM diff within 5%
# # avg_diff_per: 100%
# # Span (MB): 250
# # avg_bw_imc: 14185
# # avg_bw_resc: 28389
# not ok 1 MBM: bw change
# # Intel MBM may be inaccurate when Sub-NUMA Clustering is enabled. Check BIOS configuration.
# # Starting MBA Schemata change ...
# # Mounting resctrl to "/sys/fs/resctrl"
# # Mounting resctrl to "/sys/fs/resctrl"
# # Benchmark PID: 14787
# # Writing benchmark parameters to resctrl FS
# # Write schema "MB:0=100" to resctrl FS
# # Write schema "MB:0=90" to resctrl FS
# # Write schema "MB:0=80" to resctrl FS
# # Write schema "MB:0=70" to resctrl FS
# # Write schema "MB:0=60" to resctrl FS
# # Write schema "MB:0=50" to resctrl FS
# # Write schema "MB:0=40" to resctrl FS
# # Write schema "MB:0=30" to resctrl FS
# # Write schema "MB:0=20" to resctrl FS
# # Write schema "MB:0=10" to resctrl FS
# # Results are displayed in (MB)
# # Fail: Check MBA diff within 5% for schemata 100
# # avg_diff_per: 99%
# # avg_bw_imc: 14179
# # avg_bw_resc: 28340
# # Fail: Check MBA diff within 5% for schemata 90
# # avg_diff_per: 100%
# # avg_bw_imc: 9244
# # avg_bw_resc: 18497
# # Fail: Check MBA diff within 5% for schemata 80
# # avg_diff_per: 100%
# # avg_bw_imc: 9249
# # avg_bw_resc: 18504
# # Fail: Check MBA diff within 5% for schemata 70
# # avg_diff_per: 100%
# # avg_bw_imc: 9250
# # avg_bw_resc: 18506
# # Fail: Check MBA diff within 5% for schemata 60
# # avg_diff_per: 100%
# # avg_bw_imc: 7521
# # avg_bw_resc: 15055
# # Fail: Check MBA diff within 5% for schemata 50
# # avg_diff_per: 100%
# # avg_bw_imc: 7455
# # avg_bw_resc: 14917
# # Fail: Check MBA diff within 5% for schemata 40
# # avg_diff_per: 100%
# # avg_bw_imc: 5962
# # avg_bw_resc: 11934
# # Fail: Check MBA diff within 5% for schemata 30
# # avg_diff_per: 100%
# # avg_bw_imc: 4208
# # avg_bw_resc: 8436
# # Fail: Check MBA diff within 5% for schemata 20
# # avg_diff_per: 98%
# # avg_bw_imc: 2972
# # avg_bw_resc: 5909
# # Fail: Check MBA diff within 5% for schemata 10
# # avg_diff_per: 99%
# # avg_bw_imc: 1715
# # avg_bw_resc: 3426
# # Fail: Check schemata change using MBA
# # At least one test failed
# not ok 2 MBA: schemata change
# # Starting CMT test ...
# # Mounting resctrl to "/sys/fs/resctrl"
# # Mounting resctrl to "/sys/fs/resctrl"
# # Cache size :6488064
# # Benchmark PID: 14793
# # Writing benchmark parameters to resctrl FS
# # Checking for pass/fail
# # Fail: Check cache miss rate within 15%
# # Percent diff=91
# # Number of bits: 5
# # Average LLC val: 5640192
# # Cache span (bytes): 2949120
# not ok 3 CMT: test
# # Intel CMT may be inaccurate when Sub-NUMA Clustering is enabled. Check BIOS configuration.
# # Starting CAT test ...
# # Mounting resctrl to "/sys/fs/resctrl"
# # Mounting resctrl to "/sys/fs/resctrl"
# # Cache size :6488064
# # Writing benchmark parameters to resctrl FS
# # Write schema "L3:0=3f" to resctrl FS
# # Checking for pass/fail
# # Fail: Check cache miss rate within 4%
# # Percent diff=6
# # Number of bits: 6
# # Average LLC val: 51475
# # Cache span (lines): 55296
# not ok 4 CAT: test
# # Totals: pass:0 fail:4 xfail:0 xpass:0 skip:0 error:0
not ok 1 selftests: resctrl: resctrl_tests # exit=1
make: Leaving directory '/...l/tools/testing/selftests/resctrl'

Sub-NUMA disable:
xxx@xxx:~/linux_v6.5_rc1l$ sudo make -C tools/testing/selftests/resctrl run_tests
...
# # Starting CAT test ...
# # Mounting resctrl to "/sys/fs/resctrl"
# # Mounting resctrl to "/sys/fs/resctrl"
# # Cache size :6488064
# # Writing benchmark parameters to resctrl FS
# # Write schema "L3:0=3f" to resctrl FS
# # Checking for pass/fail
# # Fail: Check cache miss rate within 4%
# # Percent diff=6
# # Number of bits: 6
# # Average LLC val: 51899
# # Cache span (lines): 55296
# not ok 4 CAT: test
# # Totals: pass:3 fail:1 xfail:0 xpass:0 skip:0 error:0
not ok 1 selftests: resctrl: resctrl_tests # exit=1
make: Leaving directory '/.../tools/testing/selftests/resctrl'

Best regards,
Shaopeng TAN

^ permalink raw reply	[flat|nested] 331+ messages in thread

* Re: [PATCH v3 3/8] x86/resctrl: Add a new node-scoped resource to rdt_resources_all[]
  2023-07-18 23:45       ` Reinette Chatre
  2023-07-19  0:11         ` Luck, Tony
@ 2023-07-20  0:20         ` Tony Luck
  2023-07-20 17:27           ` Reinette Chatre
  1 sibling, 1 reply; 331+ messages in thread
From: Tony Luck @ 2023-07-20  0:20 UTC (permalink / raw)
  To: Reinette Chatre
  Cc: Fenghua Yu, Peter Newman, Jonathan Corbet, Shuah Khan, x86,
	Shaopeng Tan, James Morse, Jamie Iles, Babu Moger, Randy Dunlap,
	linux-kernel, linux-doc, patches

Here's a quick hack to see how things might look with
separate domain lists in the "L3" resource.

For testing purposes on a non-SNC system I set ->mon_scope =
MON_SCOPE_NODE, but made domain_add_cpu() allocate the mondomains
list based on L3 scope ... just so I could check that I found all
the places where monitoring needs to use the mondomains list.
The kernel doesn't crash when running tools/testing/selftests/resctrl,
and the tests all pass. But that doesn't mean I didn't miss something.

Some restructuring of control vs. monitoring initialization might
avoid some of the code I duplicated in domain_add_cpu(). But this
is intended just as a "Is this what you meant?" before I dig deeper.

Overall, I think it is a cleaner approach than making a new
"L3" resource with a different scope just for SNC monitoring.

-Tony

---

diff --git a/include/linux/resctrl.h b/include/linux/resctrl.h
index 8334eeacfec5..e4b653088a22 100644
--- a/include/linux/resctrl.h
+++ b/include/linux/resctrl.h
@@ -151,9 +151,11 @@ struct resctrl_schema;
  * @mon_capable:	Is monitor feature available on this machine
  * @num_rmid:		Number of RMIDs available
  * @cache_level:	Which cache level defines scope of this resource
+ * @mon_scope:		Scope of this resource if different from cache_level
  * @cache:		Cache allocation related data
  * @membw:		If the component has bandwidth controls, their properties.
  * @domains:		All domains for this resource
+ * @mondomains:		Monitor domains for this resource (if mon_scope != 0)
  * @name:		Name to use in "schemata" file.
  * @data_width:		Character width of data when displaying
  * @default_ctrl:	Specifies default cache cbm or memory B/W percent.
@@ -169,9 +171,11 @@ struct rdt_resource {
 	bool			mon_capable;
 	int			num_rmid;
 	int			cache_level;
+	int			mon_scope;
 	struct resctrl_cache	cache;
 	struct resctrl_membw	membw;
 	struct list_head	domains;
+	struct list_head	mondomains;
 	char			*name;
 	int			data_width;
 	u32			default_ctrl;
@@ -184,6 +188,8 @@ struct rdt_resource {
 	bool			cdp_capable;
 };
 
+#define MON_SCOPE_NODE	1
+
 /**
  * struct resctrl_schema - configuration abilities of a resource presented to
  *			   user-space
@@ -217,8 +223,8 @@ int resctrl_arch_update_one(struct rdt_resource *r, struct rdt_domain *d,
 
 u32 resctrl_arch_get_config(struct rdt_resource *r, struct rdt_domain *d,
 			    u32 closid, enum resctrl_conf_type type);
-int resctrl_online_domain(struct rdt_resource *r, struct rdt_domain *d);
-void resctrl_offline_domain(struct rdt_resource *r, struct rdt_domain *d);
+int resctrl_online_domain(struct rdt_resource *r, struct rdt_domain *d, bool mon_setup);
+void resctrl_offline_domain(struct rdt_resource *r, struct rdt_domain *d, bool mon_teardown);
 
 /**
  * resctrl_arch_rmid_read() - Read the eventid counter corresponding to rmid
diff --git a/arch/x86/kernel/cpu/resctrl/internal.h b/arch/x86/kernel/cpu/resctrl/internal.h
index 85ceaf9a31ac..c5e2ac2a60cf 100644
--- a/arch/x86/kernel/cpu/resctrl/internal.h
+++ b/arch/x86/kernel/cpu/resctrl/internal.h
@@ -511,7 +511,7 @@ void rdtgroup_kn_unlock(struct kernfs_node *kn);
 int rdtgroup_kn_mode_restrict(struct rdtgroup *r, const char *name);
 int rdtgroup_kn_mode_restore(struct rdtgroup *r, const char *name,
 			     umode_t mask);
-struct rdt_domain *rdt_find_domain(struct rdt_resource *r, int id,
+struct rdt_domain *rdt_find_domain(struct list_head *h, int id,
 				   struct list_head **pos);
 ssize_t rdtgroup_schemata_write(struct kernfs_open_file *of,
 				char *buf, size_t nbytes, loff_t off);
diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
index 030d3b409768..545d563ba956 100644
--- a/arch/x86/kernel/cpu/resctrl/core.c
+++ b/arch/x86/kernel/cpu/resctrl/core.c
@@ -57,7 +57,7 @@ static void
 mba_wrmsr_amd(struct rdt_domain *d, struct msr_param *m,
 	      struct rdt_resource *r);
 
-#define domain_init(id) LIST_HEAD_INIT(rdt_resources_all[id].r_resctrl.domains)
+#define domain_init(id, field) LIST_HEAD_INIT(rdt_resources_all[id].r_resctrl.field)
 
 struct rdt_hw_resource rdt_resources_all[] = {
 	[RDT_RESOURCE_L3] =
@@ -66,7 +66,9 @@ struct rdt_hw_resource rdt_resources_all[] = {
 			.rid			= RDT_RESOURCE_L3,
 			.name			= "L3",
 			.cache_level		= 3,
-			.domains		= domain_init(RDT_RESOURCE_L3),
+			.mon_scope		= MON_SCOPE_NODE, //FAKE
+			.domains		= domain_init(RDT_RESOURCE_L3, domains),
+			.mondomains		= domain_init(RDT_RESOURCE_L3, mondomains),
 			.parse_ctrlval		= parse_cbm,
 			.format_str		= "%d=%0*x",
 			.fflags			= RFTYPE_RES_CACHE,
@@ -80,7 +82,7 @@ struct rdt_hw_resource rdt_resources_all[] = {
 			.rid			= RDT_RESOURCE_L2,
 			.name			= "L2",
 			.cache_level		= 2,
-			.domains		= domain_init(RDT_RESOURCE_L2),
+			.domains		= domain_init(RDT_RESOURCE_L2, domains),
 			.parse_ctrlval		= parse_cbm,
 			.format_str		= "%d=%0*x",
 			.fflags			= RFTYPE_RES_CACHE,
@@ -94,7 +96,7 @@ struct rdt_hw_resource rdt_resources_all[] = {
 			.rid			= RDT_RESOURCE_MBA,
 			.name			= "MB",
 			.cache_level		= 3,
-			.domains		= domain_init(RDT_RESOURCE_MBA),
+			.domains		= domain_init(RDT_RESOURCE_MBA, domains),
 			.parse_ctrlval		= parse_bw,
 			.format_str		= "%d=%*u",
 			.fflags			= RFTYPE_RES_MB,
@@ -106,7 +108,7 @@ struct rdt_hw_resource rdt_resources_all[] = {
 			.rid			= RDT_RESOURCE_SMBA,
 			.name			= "SMBA",
 			.cache_level		= 3,
-			.domains		= domain_init(RDT_RESOURCE_SMBA),
+			.domains		= domain_init(RDT_RESOURCE_SMBA, domains),
 			.parse_ctrlval		= parse_bw,
 			.format_str		= "%d=%*u",
 			.fflags			= RFTYPE_RES_MB,
@@ -384,14 +386,15 @@ void rdt_ctrl_update(void *arg)
 }
 
 /*
- * rdt_find_domain - Find a domain in a resource that matches input resource id
+ * rdt_find_domain - Find a domain in one of the lists for a resource that
+ * matches input resource id
  *
  * Search resource r's domain list to find the resource id. If the resource
  * id is found in a domain, return the domain. Otherwise, if requested by
  * caller, return the first domain whose id is bigger than the input id.
  * The domain list is sorted by id in ascending order.
  */
-struct rdt_domain *rdt_find_domain(struct rdt_resource *r, int id,
+struct rdt_domain *rdt_find_domain(struct list_head *h, int id,
 				   struct list_head **pos)
 {
 	struct rdt_domain *d;
@@ -400,7 +403,7 @@ struct rdt_domain *rdt_find_domain(struct rdt_resource *r, int id,
 	if (id < 0)
 		return ERR_PTR(-ENODEV);
 
-	list_for_each(l, &r->domains) {
+	list_for_each(l, h) {
 		d = list_entry(l, struct rdt_domain, list);
 		/* When id is found, return its domain. */
 		if (id == d->id)
@@ -508,7 +511,7 @@ static void domain_add_cpu(int cpu, struct rdt_resource *r)
 	struct rdt_domain *d;
 	int err;
 
-	d = rdt_find_domain(r, id, &add_pos);
+	d = rdt_find_domain(&r->domains, id, &add_pos);
 	if (IS_ERR(d)) {
 		pr_warn("Couldn't find cache id for CPU %d\n", cpu);
 		return;
@@ -536,6 +539,44 @@ static void domain_add_cpu(int cpu, struct rdt_resource *r)
 		return;
 	}
 
+	if (!r->mon_scope && r->mon_capable && arch_domain_mbm_alloc(r->num_rmid, hw_dom)) {
+		domain_free(hw_dom);
+		return;
+	}
+
+	list_add_tail(&d->list, add_pos);
+
+	err = resctrl_online_domain(r, d, r->mon_scope == 0);
+	if (err) {
+		list_del(&d->list);
+		domain_free(hw_dom);
+	}
+
+	if (r->mon_scope != MON_SCOPE_NODE)
+		return;
+
+	//id = cpu_to_node(cpu);
+	id = get_cpu_cacheinfo_id(cpu, r->cache_level); // FAKE
+	add_pos = NULL;
+	d = rdt_find_domain(&r->mondomains, id, &add_pos);
+	if (IS_ERR(d)) {
+		pr_warn("Couldn't find node id for CPU %d\n", cpu);
+		return;
+	}
+
+	if (d) {
+		cpumask_set_cpu(cpu, &d->cpu_mask);
+		return;
+	}
+
+	hw_dom = kzalloc_node(sizeof(*hw_dom), GFP_KERNEL, cpu_to_node(cpu));
+	if (!hw_dom)
+		return;
+
+	d = &hw_dom->d_resctrl;
+	d->id = id;
+	cpumask_set_cpu(cpu, &d->cpu_mask);
+
 	if (r->mon_capable && arch_domain_mbm_alloc(r->num_rmid, hw_dom)) {
 		domain_free(hw_dom);
 		return;
@@ -543,7 +584,7 @@ static void domain_add_cpu(int cpu, struct rdt_resource *r)
 
 	list_add_tail(&d->list, add_pos);
 
-	err = resctrl_online_domain(r, d);
+	err = resctrl_online_domain(r, d, true);
 	if (err) {
 		list_del(&d->list);
 		domain_free(hw_dom);
@@ -556,7 +597,7 @@ static void domain_remove_cpu(int cpu, struct rdt_resource *r)
 	struct rdt_hw_domain *hw_dom;
 	struct rdt_domain *d;
 
-	d = rdt_find_domain(r, id, NULL);
+	d = rdt_find_domain(&r->domains, id, NULL);
 	if (IS_ERR_OR_NULL(d)) {
 		pr_warn("Couldn't find cache id for CPU %d\n", cpu);
 		return;
@@ -565,7 +606,7 @@ static void domain_remove_cpu(int cpu, struct rdt_resource *r)
 
 	cpumask_clear_cpu(cpu, &d->cpu_mask);
 	if (cpumask_empty(&d->cpu_mask)) {
-		resctrl_offline_domain(r, d);
+		resctrl_offline_domain(r, d, r->mon_scope == 0);
 		list_del(&d->list);
 
 		/*
@@ -579,7 +620,7 @@ static void domain_remove_cpu(int cpu, struct rdt_resource *r)
 		return;
 	}
 
-	if (r == &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl) {
+	if (r->mon_scope == 0 && r == &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl) {
 		if (is_mbm_enabled() && cpu == d->mbm_work_cpu) {
 			cancel_delayed_work(&d->mbm_over);
 			mbm_setup_overflow_handler(d, 0);
@@ -590,6 +631,23 @@ static void domain_remove_cpu(int cpu, struct rdt_resource *r)
 			cqm_setup_limbo_handler(d, 0);
 		}
 	}
+
+	if (r->mon_scope != MON_SCOPE_NODE)
+		return;
+
+	id = cpu_to_node(cpu);
+	d = rdt_find_domain(&r->mondomains, id, NULL);
+	if (IS_ERR_OR_NULL(d)) {
+		pr_warn("Couldn't find node id for CPU %d\n", cpu);
+		return;
+	}
+
+	cpumask_clear_cpu(cpu, &d->cpu_mask);
+	if (cpumask_empty(&d->cpu_mask)) {
+		resctrl_offline_domain(r, d, true);
+		list_del(&d->list);
+		domain_free(hw_dom);
+	}
 }
 
 static void clear_closid_rmid(int cpu)
diff --git a/arch/x86/kernel/cpu/resctrl/ctrlmondata.c b/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
index b44c487727d4..80033cb698d0 100644
--- a/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
+++ b/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
@@ -545,6 +545,7 @@ int rdtgroup_mondata_show(struct seq_file *m, void *arg)
 	struct rdt_resource *r;
 	union mon_data_bits md;
 	struct rdt_domain *d;
+	struct list_head *h;
 	struct rmid_read rr;
 	int ret = 0;
 
@@ -560,7 +561,8 @@ int rdtgroup_mondata_show(struct seq_file *m, void *arg)
 	evtid = md.u.evtid;
 
 	r = &rdt_resources_all[resid].r_resctrl;
-	d = rdt_find_domain(r, domid, NULL);
+	h = r->mon_scope ? &r->mondomains : &r->domains;
+	d = rdt_find_domain(h, domid, NULL);
 	if (IS_ERR_OR_NULL(d)) {
 		ret = -ENOENT;
 		goto out;
diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c
index ded1fc7cb7cb..08085202582a 100644
--- a/arch/x86/kernel/cpu/resctrl/monitor.c
+++ b/arch/x86/kernel/cpu/resctrl/monitor.c
@@ -335,12 +335,14 @@ static void add_rmid_to_limbo(struct rmid_entry *entry)
 {
 	struct rdt_resource *r = &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl;
 	struct rdt_domain *d;
+	struct list_head *h;
 	int cpu, err;
 	u64 val = 0;
 
 	entry->busy = 0;
 	cpu = get_cpu();
-	list_for_each_entry(d, &r->domains, list) {
+	h = r->mon_scope ? &r->mondomains : &r->domains;
+	list_for_each_entry(d, h, list) {
 		if (cpumask_test_cpu(cpu, &d->cpu_mask)) {
 			err = resctrl_arch_rmid_read(r, d, entry->rmid,
 						     QOS_L3_OCCUP_EVENT_ID,
diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
index 725344048f85..fb5b23fcb6d4 100644
--- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
+++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
@@ -1492,11 +1492,13 @@ static int mbm_config_show(struct seq_file *s, struct rdt_resource *r, u32 evtid
 {
 	struct mon_config_info mon_info = {0};
 	struct rdt_domain *dom;
+	struct list_head *h;
 	bool sep = false;
 
 	mutex_lock(&rdtgroup_mutex);
 
-	list_for_each_entry(dom, &r->domains, list) {
+	h = r->mon_scope ? &r->mondomains : &r->domains;
+	list_for_each_entry(dom, h, list) {
 		if (sep)
 			seq_puts(s, ";");
 
@@ -1599,6 +1601,7 @@ static int mon_config_write(struct rdt_resource *r, char *tok, u32 evtid)
 	char *dom_str = NULL, *id_str;
 	unsigned long dom_id, val;
 	struct rdt_domain *d;
+	struct list_head *h;
 	int ret = 0;
 
 next:
@@ -1619,7 +1622,8 @@ static int mon_config_write(struct rdt_resource *r, char *tok, u32 evtid)
 		return -EINVAL;
 	}
 
-	list_for_each_entry(d, &r->domains, list) {
+	h = r->mon_scope ? &r->mondomains : &r->domains;
+	list_for_each_entry(d, h, list) {
 		if (d->id == dom_id) {
 			ret = mbm_config_write_domain(r, d, evtid, val);
 			if (ret)
@@ -2465,6 +2469,7 @@ static int rdt_get_tree(struct fs_context *fc)
 	struct rdt_fs_context *ctx = rdt_fc2context(fc);
 	struct rdt_domain *dom;
 	struct rdt_resource *r;
+	struct list_head *h;
 	int ret;
 
 	cpus_read_lock();
@@ -2525,7 +2530,8 @@ static int rdt_get_tree(struct fs_context *fc)
 
 	if (is_mbm_enabled()) {
 		r = &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl;
-		list_for_each_entry(dom, &r->domains, list)
+		h = r->mon_scope ? &r->mondomains : &r->domains;
+		list_for_each_entry(dom, h, list)
 			mbm_setup_overflow_handler(dom, MBM_OVERFLOW_INTERVAL);
 	}
 
@@ -2917,9 +2923,11 @@ static int mkdir_mondata_subdir_alldom(struct kernfs_node *parent_kn,
 				       struct rdtgroup *prgrp)
 {
 	struct rdt_domain *dom;
+	struct list_head *h;
 	int ret;
 
-	list_for_each_entry(dom, &r->domains, list) {
+	h = r->mon_scope ? &r->mondomains : &r->domains;
+	list_for_each_entry(dom, h, list) {
 		ret = mkdir_mondata_subdir(parent_kn, dom, r, prgrp);
 		if (ret)
 			return ret;
@@ -3708,14 +3716,14 @@ static void domain_destroy_mon_state(struct rdt_domain *d)
 	kfree(d->mbm_local);
 }
 
-void resctrl_offline_domain(struct rdt_resource *r, struct rdt_domain *d)
+void resctrl_offline_domain(struct rdt_resource *r, struct rdt_domain *d, bool mon_teardown)
 {
 	lockdep_assert_held(&rdtgroup_mutex);
 
 	if (supports_mba_mbps() && r->rid == RDT_RESOURCE_MBA)
 		mba_sc_domain_destroy(r, d);
 
-	if (!r->mon_capable)
+	if (!mon_teardown || !r->mon_capable)
 		return;
 
 	/*
@@ -3773,7 +3781,7 @@ static int domain_setup_mon_state(struct rdt_resource *r, struct rdt_domain *d)
 	return 0;
 }
 
-int resctrl_online_domain(struct rdt_resource *r, struct rdt_domain *d)
+int resctrl_online_domain(struct rdt_resource *r, struct rdt_domain *d, bool mon_setup)
 {
 	int err;
 
@@ -3783,7 +3791,7 @@ int resctrl_online_domain(struct rdt_resource *r, struct rdt_domain *d)
 		/* RDT_RESOURCE_MBA is never mon_capable */
 		return mba_sc_domain_allocate(r, d);
 
-	if (!r->mon_capable)
+	if (!mon_setup || !r->mon_capable)
 		return 0;
 
 	err = domain_setup_mon_state(r, d);

^ permalink raw reply related	[flat|nested] 331+ messages in thread

* Re: [PATCH v3 3/8] x86/resctrl: Add a new node-scoped resource to rdt_resources_all[]
  2023-07-19  0:11         ` Luck, Tony
@ 2023-07-20 17:27           ` Reinette Chatre
  0 siblings, 0 replies; 331+ messages in thread
From: Reinette Chatre @ 2023-07-20 17:27 UTC (permalink / raw)
  To: Luck, Tony
  Cc: Yu, Fenghua, Peter Newman, Jonathan Corbet, Shuah Khan, x86,
	Shaopeng Tan, James Morse, Jamie Iles, Babu Moger, Randy Dunlap,
	linux-kernel, linux-doc, patches

Hi Tony,

On 7/18/2023 5:11 PM, Luck, Tony wrote:
>> Yes, this is the idea. Thank you for considering it. Something else
>> to consider that may make this even cleaner/simpler would be to review
>> struct rdt_domain and struct rdt_hw_domain members for "monitor" vs "control"
>> usage. These structs could potentially be split further into separate
>> "control" and "monitor" variants. For example, "struct rdt_domain" split into
>> "struct rdt_ctrl_domain" and "struct rdt_mon_domain". If there is a clean
>> split then resctrl can always create two lists with the unnecessary duplication
>> eliminated when two domain lists are created. This would also
>> eliminate the need to scatter ctrl_scope == mon_scope checks throughout.
> 
> You might like what I'm doing in the "resctrl2" re-write[1]. Arch independent code
> that maintains the domain lists for a resource via a cpuhp notifier just has this
> for the domain structure:
> 
> struct resctrl_domain {
>         struct list_head        list;
>         struct cpumask          cpu_mask;
>         int                     id;
>         int                     cache_size;
> };
> 
> Each module managing a resource decides what extra information it wants to
> carry in the domain. So the above structure is common to all, but it is followed
> by whatever the resource module wants. E.g. the CBM masks for each CLOSid
> for the CAT module. The module tells core code the size to allocate.

hmmm ... what I am *hearing* you say is that the goodness from the
rewrite can be added to resctrl? :)

> "cache_size" is only there because the cache topology bits needed to discover
> sizes of caches aren't exported. Both the "size" file and pseudo-locking need
> to know the size.
> 
> It's also possible that you may hate it. There is zero sharing of resource structures
> even if they have the same scope. This is because all modules are independently
> loadable.

Apologies but I am still unable to understand the problem statement that
motivates the rewrite.

Reinette

^ permalink raw reply	[flat|nested] 331+ messages in thread

* Re: [PATCH v3 3/8] x86/resctrl: Add a new node-scoped resource to rdt_resources_all[]
  2023-07-20  0:20         ` Tony Luck
@ 2023-07-20 17:27           ` Reinette Chatre
  2023-07-20 21:56             ` Luck, Tony
  0 siblings, 1 reply; 331+ messages in thread
From: Reinette Chatre @ 2023-07-20 17:27 UTC (permalink / raw)
  To: Tony Luck
  Cc: Fenghua Yu, Peter Newman, Jonathan Corbet, Shuah Khan, x86,
	Shaopeng Tan, James Morse, Jamie Iles, Babu Moger, Randy Dunlap,
	linux-kernel, linux-doc, patches

Hi Tony,

On 7/19/2023 5:20 PM, Tony Luck wrote:
> Here's a quick hack to see how things might look with
> separate domain lists in the "L3" resource.
> 
> For testing purposes on a non-SNC system I set ->mon_scope =
> MON_SCOPE_NODE, but made domain_add_cpu() allocate the mondomains
> list based on L3 scope ... just so I could check that I found all
> the places where monitoring needs to use the mondomains list.
> The kernel doesn't crash when running tools/testing/selftests/resctrl,
> and the tests all pass. But that doesn't mean I didn't miss something.
> 
> Some restructuring of control vs. monitoring initialization might
> avoid some of the code I duplicated in domain_add_cpu(). But this
> is intended just as a "Is this what you meant?" before I dig deeper.

Thank you for considering the approach. I find that this sample moves
towards the idea while also highlighting what else can be considered.
I do not know if you already considered these ideas and found it flawed
so I will try to make it explicit so that you can point out to me where
things will fall apart.

The sample code introduces a new list "mondomains" that is intended to
be used when the monitoring scope is different from the allocation scope.
This introduces duplication when the monitoring and allocation scope is
different. Each list, "domains" and "mondomains" will host structures
that can accommodate both monitoring and allocation data, with the data
not relevant to the list going unused as it is unnecessarily duplicated.
Additionally this forces significant portions of resctrl to now always
consider whether the monitoring and allocation scope is different ...
note how this sample now has code like below scattered throughout.
   h = r->mon_scope ? &r->mondomains : &r->domains;

I also find the domain_add_cpu() becoming intricate as it needs to
navigate all the different scenarios.

This unnecessary duplication, new environment checks needed throughout,
and additional code complexities are red flags to me that this solution
is not well integrated.

To deal with these complexities I would like to consider if it may
make things simpler to always (irrespective of allocation and
monitoring scope) maintain allocation and monitoring domain lists.
Each list need only carry data appropriate to its use ... the allocation
list only has data relevant to allocation, the monitoring list only
has data relevant to monitoring. This is the struct rdt_domain related
split I mentioned previously.

Code could become something like:

resctrl_online_cpu()
{
	...
	for_each_alloc_capable_rdt_resource(r)
		alloc_domain_add_cpu(...)
	for_each_mon_capable_rdt_resource(r)
		mon_domain_add_cpu(...)
	...
}

This would reduce complication in domain_add_cpu() since each domain list
only needs to concern itself with monitoring or allocation.

Even resctrl_online_domain() can be simplified significantly by
making it specific to allocation or monitoring. For example,
resctrl_online_mon_domain() would only and always just run
the monitoring related code.

With the separate allocation and monitoring domain lists there
may no longer be a need for scattering code with checks like:
   h = r->mon_scope ? &r->mondomains : &r->domains;
This would be because the code can directly pick the domain
list it is operating on.

What do you think? The above is just refactoring of existing
code and from what I can tell this would make supporting
SNC straightforward.

Reinette

^ permalink raw reply	[flat|nested] 331+ messages in thread

* RE: [PATCH v3 3/8] x86/resctrl: Add a new node-scoped resource to rdt_resources_all[]
  2023-07-20 17:27           ` Reinette Chatre
@ 2023-07-20 21:56             ` Luck, Tony
  2023-07-22 19:18               ` Tony Luck
  0 siblings, 1 reply; 331+ messages in thread
From: Luck, Tony @ 2023-07-20 21:56 UTC (permalink / raw)
  To: Chatre, Reinette
  Cc: Yu, Fenghua, Peter Newman, Jonathan Corbet, Shuah Khan, x86,
	Shaopeng Tan, James Morse, Jamie Iles, Babu Moger, Randy Dunlap,
	linux-kernel, linux-doc, patches

> To deal with these complexities I would like to consider if it may
> make things simpler to always (irrespective of allocation and
> monitoring scope) maintain allocation and monitoring domain lists.
> Each list need only carry data appropriate to its use ... the allocation
> list only has data relevant to allocation, the monitoring list only
> has data relevant to monitoring. This is the struct rdt_domain related
> split I mentioned previously.
>
> Code could become something like:

> resctrl_online_cpu()
> {
> 	...
> 	for_each_alloc_capable_rdt_resource(r)
> 		alloc_domain_add_cpu(...)
> 	for_each_mon_capable_rdt_resource(r)
> 		mon_domain_add_cpu(...)
> 	...
> }

> This would reduce complication in domain_add_cpu() since each domain list
> only needs to concern itself with monitoring or allocation.

This does seem a worthy target.

I started on a patch to do this ... but I'm not sure I have the stamina or the time
to see it through. 

I split struct rdt_domain into rdt_ctrl_domain and rdt_mon_domain. But that
led to also splitting the rdt_hw_domain structure into two, and then splitting
the resctrl_to_arch_dom() function, and then another and another.

That process will eventually converge (there are a finite number of lines
of code) .... but it will be a big patch. I don't see how to stage it a piece
at a time.

-Tony

^ permalink raw reply	[flat|nested] 331+ messages in thread

* [PATCH v4 0/7] Add support for Sub-NUMA cluster (SNC) systems
  2023-07-13 16:31 [PATCH v3 0/8] x86/resctrl: Add support for Sub-NUMA cluster (SNC) systems Tony Luck
                   ` (8 preceding siblings ...)
  2023-07-19  2:43 ` [PATCH v3 0/8] x86/resctrl: Add support for Sub-NUMA cluster (SNC) systems Shaopeng Tan (Fujitsu)
@ 2023-07-22 19:07 ` Tony Luck
  2023-07-22 19:07   ` [PATCH v4 1/7] x86/resctrl: Create separate domains for control and monitoring Tony Luck
                     ` (9 more replies)
  9 siblings, 10 replies; 331+ messages in thread
From: Tony Luck @ 2023-07-22 19:07 UTC (permalink / raw)
  To: Fenghua Yu, Reinette Chatre, Peter Newman, Jonathan Corbet,
	Shuah Khan, x86
  Cc: Shaopeng Tan, James Morse, Jamie Iles, Babu Moger, Randy Dunlap,
	linux-kernel, linux-doc, patches, Tony Luck

The Sub-NUMA cluster feature on some Intel processors partitions
the CPUs that share an L3 cache into two or more sets. This plays
havoc with the Resource Director Technology (RDT) monitoring features.
Prior to this patch series, Intel has advised that SNC and RDT are incompatible.

Some of these CPUs support an MSR that can partition the RMID
counters in the same way. This allows monitoring features
to be used (with the caveat that memory accesses between different
SNC NUMA nodes may still not be counted accurately).

Signed-off-by: Tony Luck <tony.luck@intel.com>

---

Changes since v3:

Reinette provided the most excellent suggestion that this series
could better achieve its objective if it enabled separate domain
lists for control & monitoring within a resource, rather than
creating a whole new resource to support the separate node scope needed
for SNC monitoring. Thus all the preamble patches from the previous
version have gone, replaced by patches 1-4 of this new series.

Note to anyone backporting this to an older Linux kernel version:
you may be able to skip parts 2-4. These provide separate domain
structures for control and monitoring with just the fields needed for
each, but this is largely cosmetic.

Of the code from v3 that survived to v4, the following changes have
been made (also from Reinette's review of v3):

1) Rename "snc_ways" to "snc_nodes_per_l3_cache" to avoid the confusing
use of "ways" which means something entirely different when talking
about caches.
2) Move the #define for MSR_RMID_SNC_CONFIG to <asm/msr-index.h> along
with all the other RDT MSRs.
3) Don't use a per-CPU variable "rmid_offset". Just calculate value
needed at the one place where it is used.
4) Don't create an entire resource structure with package scoped domains
just to set the SNC MSR.
5) Add comment in the commit message about adjusting the value shown in
the "size" files in each resctrl ctrl_mon directory.

This one not from Reinette:
6) Prevent mounting in "mba_MBps" mode when SNC mode is enabled. This
would just be confusing since monitoring is done at the node scope while
control is still at package scope.

Tony Luck (7):
  x86/resctrl: Create separate domains for control and monitoring
  x86/resctrl: Split the rdt_domain structures
  x86/resctrl: Change monitor code to use rdt_mondomain
  x86/resctrl: Delete unused fields from struct rdt_domain
  x86/resctrl: Determine if Sub-NUMA Cluster is enabled and initialize.
  x86/resctrl: Update documentation with Sub-NUMA cluster changes
  selftests/resctrl: Adjust effective L3 cache size when SNC enabled

 Documentation/arch/x86/resctrl.rst          |  10 +-
 include/linux/resctrl.h                     |  50 +++-
 arch/x86/include/asm/msr-index.h            |   1 +
 arch/x86/kernel/cpu/resctrl/internal.h      |  40 ++-
 tools/testing/selftests/resctrl/resctrl.h   |   1 +
 arch/x86/kernel/cpu/resctrl/core.c          | 289 ++++++++++++++++----
 arch/x86/kernel/cpu/resctrl/ctrlmondata.c   |   6 +-
 arch/x86/kernel/cpu/resctrl/monitor.c       |  58 ++--
 arch/x86/kernel/cpu/resctrl/rdtgroup.c      |  54 ++--
 tools/testing/selftests/resctrl/resctrlfs.c |  57 ++++
 10 files changed, 427 insertions(+), 139 deletions(-)


base-commit: fdf0eaf11452d72945af31804e2a1048ee1b574c
-- 
2.40.1


^ permalink raw reply	[flat|nested] 331+ messages in thread

* [PATCH v4 1/7] x86/resctrl: Create separate domains for control and monitoring
  2023-07-22 19:07 ` [PATCH v4 0/7] " Tony Luck
@ 2023-07-22 19:07   ` Tony Luck
  2023-08-11 17:29     ` Reinette Chatre
  2023-07-22 19:07   ` [PATCH v4 2/7] x86/resctrl: Split the rdt_domain structures Tony Luck
                     ` (8 subsequent siblings)
  9 siblings, 1 reply; 331+ messages in thread
From: Tony Luck @ 2023-07-22 19:07 UTC (permalink / raw)
  To: Fenghua Yu, Reinette Chatre, Peter Newman, Jonathan Corbet,
	Shuah Khan, x86
  Cc: Shaopeng Tan, James Morse, Jamie Iles, Babu Moger, Randy Dunlap,
	linux-kernel, linux-doc, patches, Tony Luck

First step towards supporting resource control where the scope of
control operations is not the same as that of monitor operations.

Add an extra list in the rdt_resource structure. For now this will
just duplicate the existing list of domains based on the L3 cache
scope.

Refactor the domain_add_cpu() and domain_remove_cpu() functions to
build separate lists for r->alloc_capable and r->mon_capable
resources. Note that only the "L3" resource currently supports
both types.

Change all places where monitoring functions walk the list of
domains to use the new "mondomains" list instead of the old
"domains" list.

Signed-off-by: Tony Luck <tony.luck@intel.com>
---
 include/linux/resctrl.h                   |  10 +-
 arch/x86/kernel/cpu/resctrl/internal.h    |   2 +-
 arch/x86/kernel/cpu/resctrl/core.c        | 195 +++++++++++++++-------
 arch/x86/kernel/cpu/resctrl/ctrlmondata.c |   2 +-
 arch/x86/kernel/cpu/resctrl/monitor.c     |   2 +-
 arch/x86/kernel/cpu/resctrl/rdtgroup.c    |  30 ++--
 6 files changed, 167 insertions(+), 74 deletions(-)

diff --git a/include/linux/resctrl.h b/include/linux/resctrl.h
index 8334eeacfec5..1267d56f9e76 100644
--- a/include/linux/resctrl.h
+++ b/include/linux/resctrl.h
@@ -151,9 +151,11 @@ struct resctrl_schema;
  * @mon_capable:	Is monitor feature available on this machine
  * @num_rmid:		Number of RMIDs available
  * @cache_level:	Which cache level defines scope of this resource
+ * @mon_scope:		Scope of this resource if different from cache_level
  * @cache:		Cache allocation related data
  * @membw:		If the component has bandwidth controls, their properties.
  * @domains:		All domains for this resource
+ * @mondomains:		Monitor domains for this resource
  * @name:		Name to use in "schemata" file.
  * @data_width:		Character width of data when displaying
  * @default_ctrl:	Specifies default cache cbm or memory B/W percent.
@@ -169,9 +171,11 @@ struct rdt_resource {
 	bool			mon_capable;
 	int			num_rmid;
 	int			cache_level;
+	int			mon_scope;
 	struct resctrl_cache	cache;
 	struct resctrl_membw	membw;
 	struct list_head	domains;
+	struct list_head	mondomains;
 	char			*name;
 	int			data_width;
 	u32			default_ctrl;
@@ -217,8 +221,10 @@ int resctrl_arch_update_one(struct rdt_resource *r, struct rdt_domain *d,
 
 u32 resctrl_arch_get_config(struct rdt_resource *r, struct rdt_domain *d,
 			    u32 closid, enum resctrl_conf_type type);
-int resctrl_online_domain(struct rdt_resource *r, struct rdt_domain *d);
-void resctrl_offline_domain(struct rdt_resource *r, struct rdt_domain *d);
+int resctrl_online_ctrl_domain(struct rdt_resource *r, struct rdt_domain *d);
+int resctrl_online_mon_domain(struct rdt_resource *r, struct rdt_domain *d);
+void resctrl_offline_ctrl_domain(struct rdt_resource *r, struct rdt_domain *d);
+void resctrl_offline_mon_domain(struct rdt_resource *r, struct rdt_domain *d);
 
 /**
  * resctrl_arch_rmid_read() - Read the eventid counter corresponding to rmid
diff --git a/arch/x86/kernel/cpu/resctrl/internal.h b/arch/x86/kernel/cpu/resctrl/internal.h
index 85ceaf9a31ac..c5e2ac2a60cf 100644
--- a/arch/x86/kernel/cpu/resctrl/internal.h
+++ b/arch/x86/kernel/cpu/resctrl/internal.h
@@ -511,7 +511,7 @@ void rdtgroup_kn_unlock(struct kernfs_node *kn);
 int rdtgroup_kn_mode_restrict(struct rdtgroup *r, const char *name);
 int rdtgroup_kn_mode_restore(struct rdtgroup *r, const char *name,
 			     umode_t mask);
-struct rdt_domain *rdt_find_domain(struct rdt_resource *r, int id,
+struct rdt_domain *rdt_find_domain(struct list_head *h, int id,
 				   struct list_head **pos);
 ssize_t rdtgroup_schemata_write(struct kernfs_open_file *of,
 				char *buf, size_t nbytes, loff_t off);
diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
index 030d3b409768..274605aaa026 100644
--- a/arch/x86/kernel/cpu/resctrl/core.c
+++ b/arch/x86/kernel/cpu/resctrl/core.c
@@ -57,7 +57,7 @@ static void
 mba_wrmsr_amd(struct rdt_domain *d, struct msr_param *m,
 	      struct rdt_resource *r);
 
-#define domain_init(id) LIST_HEAD_INIT(rdt_resources_all[id].r_resctrl.domains)
+#define domain_init(id, field) LIST_HEAD_INIT(rdt_resources_all[id].r_resctrl.field)
 
 struct rdt_hw_resource rdt_resources_all[] = {
 	[RDT_RESOURCE_L3] =
@@ -66,7 +66,9 @@ struct rdt_hw_resource rdt_resources_all[] = {
 			.rid			= RDT_RESOURCE_L3,
 			.name			= "L3",
 			.cache_level		= 3,
-			.domains		= domain_init(RDT_RESOURCE_L3),
+			.mon_scope		= 3,
+			.domains		= domain_init(RDT_RESOURCE_L3, domains),
+			.mondomains		= domain_init(RDT_RESOURCE_L3, mondomains),
 			.parse_ctrlval		= parse_cbm,
 			.format_str		= "%d=%0*x",
 			.fflags			= RFTYPE_RES_CACHE,
@@ -80,7 +82,7 @@ struct rdt_hw_resource rdt_resources_all[] = {
 			.rid			= RDT_RESOURCE_L2,
 			.name			= "L2",
 			.cache_level		= 2,
-			.domains		= domain_init(RDT_RESOURCE_L2),
+			.domains		= domain_init(RDT_RESOURCE_L2, domains),
 			.parse_ctrlval		= parse_cbm,
 			.format_str		= "%d=%0*x",
 			.fflags			= RFTYPE_RES_CACHE,
@@ -94,7 +96,7 @@ struct rdt_hw_resource rdt_resources_all[] = {
 			.rid			= RDT_RESOURCE_MBA,
 			.name			= "MB",
 			.cache_level		= 3,
-			.domains		= domain_init(RDT_RESOURCE_MBA),
+			.domains		= domain_init(RDT_RESOURCE_MBA, domains),
 			.parse_ctrlval		= parse_bw,
 			.format_str		= "%d=%*u",
 			.fflags			= RFTYPE_RES_MB,
@@ -106,7 +108,7 @@ struct rdt_hw_resource rdt_resources_all[] = {
 			.rid			= RDT_RESOURCE_SMBA,
 			.name			= "SMBA",
 			.cache_level		= 3,
-			.domains		= domain_init(RDT_RESOURCE_SMBA),
+			.domains		= domain_init(RDT_RESOURCE_SMBA, domains),
 			.parse_ctrlval		= parse_bw,
 			.format_str		= "%d=%*u",
 			.fflags			= RFTYPE_RES_MB,
@@ -384,14 +386,15 @@ void rdt_ctrl_update(void *arg)
 }
 
 /*
- * rdt_find_domain - Find a domain in a resource that matches input resource id
+ * rdt_find_domain - Find a domain in one of the lists for a resource that
+ * matches input resource id
  *
  * Search resource r's domain list to find the resource id. If the resource
  * id is found in a domain, return the domain. Otherwise, if requested by
  * caller, return the first domain whose id is bigger than the input id.
  * The domain list is sorted by id in ascending order.
  */
-struct rdt_domain *rdt_find_domain(struct rdt_resource *r, int id,
+struct rdt_domain *rdt_find_domain(struct list_head *h, int id,
 				   struct list_head **pos)
 {
 	struct rdt_domain *d;
@@ -400,7 +403,7 @@ struct rdt_domain *rdt_find_domain(struct rdt_resource *r, int id,
 	if (id < 0)
 		return ERR_PTR(-ENODEV);
 
-	list_for_each(l, &r->domains) {
+	list_for_each(l, h) {
 		d = list_entry(l, struct rdt_domain, list);
 		/* When id is found, return its domain. */
 		if (id == d->id)
@@ -487,6 +490,94 @@ static int arch_domain_mbm_alloc(u32 num_rmid, struct rdt_hw_domain *hw_dom)
 	return 0;
 }
 
+static void domain_add_cpu_ctrl(int cpu, struct rdt_resource *r)
+{
+	int id = get_cpu_cacheinfo_id(cpu, r->cache_level);
+	struct list_head *add_pos = NULL;
+	struct rdt_hw_domain *hw_dom;
+	struct rdt_domain *d;
+	int err;
+
+	d = rdt_find_domain(&r->domains, id, &add_pos);
+	if (IS_ERR(d)) {
+		pr_warn("Couldn't find cache id for CPU %d\n", cpu);
+		return;
+	}
+
+	if (d) {
+		cpumask_set_cpu(cpu, &d->cpu_mask);
+		if (r->cache.arch_has_per_cpu_cfg)
+			rdt_domain_reconfigure_cdp(r);
+		return;
+	}
+
+	hw_dom = kzalloc_node(sizeof(*hw_dom), GFP_KERNEL, cpu_to_node(cpu));
+	if (!hw_dom)
+		return;
+
+	d = &hw_dom->d_resctrl;
+	d->id = id;
+	cpumask_set_cpu(cpu, &d->cpu_mask);
+
+	rdt_domain_reconfigure_cdp(r);
+
+	if (domain_setup_ctrlval(r, d)) {
+		domain_free(hw_dom);
+		return;
+	}
+
+	list_add_tail(&d->list, add_pos);
+
+	err = resctrl_online_ctrl_domain(r, d);
+	if (err) {
+		list_del(&d->list);
+		domain_free(hw_dom);
+	}
+}
+
+static void domain_add_cpu_mon(int cpu, struct rdt_resource *r)
+{
+	int id = get_cpu_cacheinfo_id(cpu, r->mon_scope);
+	struct list_head *add_pos = NULL;
+	struct rdt_hw_domain *hw_dom;
+	struct rdt_domain *d;
+	int err;
+
+	d = rdt_find_domain(&r->mondomains, id, &add_pos);
+	if (IS_ERR(d)) {
+		pr_warn("Couldn't find cache id for CPU %d\n", cpu);
+		return;
+	}
+
+	if (d) {
+		cpumask_set_cpu(cpu, &d->cpu_mask);
+		if (r->cache.arch_has_per_cpu_cfg)
+			rdt_domain_reconfigure_cdp(r);
+		return;
+	}
+
+	hw_dom = kzalloc_node(sizeof(*hw_dom), GFP_KERNEL, cpu_to_node(cpu));
+	if (!hw_dom)
+		return;
+
+	d = &hw_dom->d_resctrl;
+	d->id = id;
+	cpumask_set_cpu(cpu, &d->cpu_mask);
+
+	if (arch_domain_mbm_alloc(r->num_rmid, hw_dom)) {
+		domain_free(hw_dom);
+		return;
+	}
+
+	list_add_tail(&d->list, add_pos);
+
+	err = resctrl_online_mon_domain(r, d);
+	if (err) {
+		list_del(&d->list);
+		domain_free(hw_dom);
+	}
+}
+
 /*
  * domain_add_cpu - Add a cpu to a resource's domain list.
  *
@@ -502,61 +593,19 @@ static int arch_domain_mbm_alloc(u32 num_rmid, struct rdt_hw_domain *hw_dom)
  */
 static void domain_add_cpu(int cpu, struct rdt_resource *r)
 {
-	int id = get_cpu_cacheinfo_id(cpu, r->cache_level);
-	struct list_head *add_pos = NULL;
-	struct rdt_hw_domain *hw_dom;
-	struct rdt_domain *d;
-	int err;
-
-	d = rdt_find_domain(r, id, &add_pos);
-	if (IS_ERR(d)) {
-		pr_warn("Couldn't find cache id for CPU %d\n", cpu);
-		return;
-	}
-
-	if (d) {
-		cpumask_set_cpu(cpu, &d->cpu_mask);
-		if (r->cache.arch_has_per_cpu_cfg)
-			rdt_domain_reconfigure_cdp(r);
-		return;
-	}
-
-	hw_dom = kzalloc_node(sizeof(*hw_dom), GFP_KERNEL, cpu_to_node(cpu));
-	if (!hw_dom)
-		return;
-
-	d = &hw_dom->d_resctrl;
-	d->id = id;
-	cpumask_set_cpu(cpu, &d->cpu_mask);
-
-	rdt_domain_reconfigure_cdp(r);
-
-	if (r->alloc_capable && domain_setup_ctrlval(r, d)) {
-		domain_free(hw_dom);
-		return;
-	}
-
-	if (r->mon_capable && arch_domain_mbm_alloc(r->num_rmid, hw_dom)) {
-		domain_free(hw_dom);
-		return;
-	}
-
-	list_add_tail(&d->list, add_pos);
-
-	err = resctrl_online_domain(r, d);
-	if (err) {
-		list_del(&d->list);
-		domain_free(hw_dom);
-	}
+	if (r->alloc_capable)
+		domain_add_cpu_ctrl(cpu, r);
+	if (r->mon_capable)
+		domain_add_cpu_mon(cpu, r);
 }
 
-static void domain_remove_cpu(int cpu, struct rdt_resource *r)
+static void domain_remove_cpu_ctrl(int cpu, struct rdt_resource *r)
 {
 	int id = get_cpu_cacheinfo_id(cpu, r->cache_level);
 	struct rdt_hw_domain *hw_dom;
 	struct rdt_domain *d;
 
-	d = rdt_find_domain(r, id, NULL);
+	d = rdt_find_domain(&r->domains, id, NULL);
 	if (IS_ERR_OR_NULL(d)) {
 		pr_warn("Couldn't find cache id for CPU %d\n", cpu);
 		return;
@@ -565,7 +614,7 @@ static void domain_remove_cpu(int cpu, struct rdt_resource *r)
 
 	cpumask_clear_cpu(cpu, &d->cpu_mask);
 	if (cpumask_empty(&d->cpu_mask)) {
-		resctrl_offline_domain(r, d);
+		resctrl_offline_ctrl_domain(r, d);
 		list_del(&d->list);
 
 		/*
@@ -578,6 +627,30 @@ static void domain_remove_cpu(int cpu, struct rdt_resource *r)
 
 		return;
 	}
+}
+
+static void domain_remove_cpu_mon(int cpu, struct rdt_resource *r)
+{
+	int id = get_cpu_cacheinfo_id(cpu, r->cache_level);
+	struct rdt_hw_domain *hw_dom;
+	struct rdt_domain *d;
+
+	d = rdt_find_domain(&r->mondomains, id, NULL);
+	if (IS_ERR_OR_NULL(d)) {
+		pr_warn("Couldn't find cache id for CPU %d\n", cpu);
+		return;
+	}
+	hw_dom = resctrl_to_arch_dom(d);
+
+	cpumask_clear_cpu(cpu, &d->cpu_mask);
+	if (cpumask_empty(&d->cpu_mask)) {
+		resctrl_offline_mon_domain(r, d);
+		list_del(&d->list);
+
+		domain_free(hw_dom);
+
+		return;
+	}
 
 	if (r == &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl) {
 		if (is_mbm_enabled() && cpu == d->mbm_work_cpu) {
@@ -592,6 +665,14 @@ static void domain_remove_cpu(int cpu, struct rdt_resource *r)
 	}
 }
 
+static void domain_remove_cpu(int cpu, struct rdt_resource *r)
+{
+	if (r->alloc_capable)
+		domain_remove_cpu_ctrl(cpu, r);
+	if (r->mon_capable)
+		domain_remove_cpu_mon(cpu, r);
+}
+
 static void clear_closid_rmid(int cpu)
 {
 	struct resctrl_pqr_state *state = this_cpu_ptr(&pqr_state);
diff --git a/arch/x86/kernel/cpu/resctrl/ctrlmondata.c b/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
index b44c487727d4..839df83d1a0a 100644
--- a/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
+++ b/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
@@ -560,7 +560,7 @@ int rdtgroup_mondata_show(struct seq_file *m, void *arg)
 	evtid = md.u.evtid;
 
 	r = &rdt_resources_all[resid].r_resctrl;
-	d = rdt_find_domain(r, domid, NULL);
+	d = rdt_find_domain(&r->mondomains, domid, NULL);
 	if (IS_ERR_OR_NULL(d)) {
 		ret = -ENOENT;
 		goto out;
diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c
index ded1fc7cb7cb..66beca785535 100644
--- a/arch/x86/kernel/cpu/resctrl/monitor.c
+++ b/arch/x86/kernel/cpu/resctrl/monitor.c
@@ -340,7 +340,7 @@ static void add_rmid_to_limbo(struct rmid_entry *entry)
 
 	entry->busy = 0;
 	cpu = get_cpu();
-	list_for_each_entry(d, &r->domains, list) {
+	list_for_each_entry(d, &r->mondomains, list) {
 		if (cpumask_test_cpu(cpu, &d->cpu_mask)) {
 			err = resctrl_arch_rmid_read(r, d, entry->rmid,
 						     QOS_L3_OCCUP_EVENT_ID,
diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
index 725344048f85..27753eb5d513 100644
--- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
+++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
@@ -1496,7 +1496,7 @@ static int mbm_config_show(struct seq_file *s, struct rdt_resource *r, u32 evtid
 
 	mutex_lock(&rdtgroup_mutex);
 
-	list_for_each_entry(dom, &r->domains, list) {
+	list_for_each_entry(dom, &r->mondomains, list) {
 		if (sep)
 			seq_puts(s, ";");
 
@@ -1619,7 +1619,7 @@ static int mon_config_write(struct rdt_resource *r, char *tok, u32 evtid)
 		return -EINVAL;
 	}
 
-	list_for_each_entry(d, &r->domains, list) {
+	list_for_each_entry(d, &r->mondomains, list) {
 		if (d->id == dom_id) {
 			ret = mbm_config_write_domain(r, d, evtid, val);
 			if (ret)
@@ -2525,7 +2525,7 @@ static int rdt_get_tree(struct fs_context *fc)
 
 	if (is_mbm_enabled()) {
 		r = &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl;
-		list_for_each_entry(dom, &r->domains, list)
+		list_for_each_entry(dom, &r->mondomains, list)
 			mbm_setup_overflow_handler(dom, MBM_OVERFLOW_INTERVAL);
 	}
 
@@ -2919,7 +2919,7 @@ static int mkdir_mondata_subdir_alldom(struct kernfs_node *parent_kn,
 	struct rdt_domain *dom;
 	int ret;
 
-	list_for_each_entry(dom, &r->domains, list) {
+	list_for_each_entry(dom, &r->mondomains, list) {
 		ret = mkdir_mondata_subdir(parent_kn, dom, r, prgrp);
 		if (ret)
 			return ret;
@@ -3708,15 +3708,17 @@ static void domain_destroy_mon_state(struct rdt_domain *d)
 	kfree(d->mbm_local);
 }
 
-void resctrl_offline_domain(struct rdt_resource *r, struct rdt_domain *d)
+void resctrl_offline_ctrl_domain(struct rdt_resource *r, struct rdt_domain *d)
 {
 	lockdep_assert_held(&rdtgroup_mutex);
 
 	if (supports_mba_mbps() && r->rid == RDT_RESOURCE_MBA)
 		mba_sc_domain_destroy(r, d);
+}
 
-	if (!r->mon_capable)
-		return;
+void resctrl_offline_mon_domain(struct rdt_resource *r, struct rdt_domain *d)
+{
+	lockdep_assert_held(&rdtgroup_mutex);
 
 	/*
 	 * If resctrl is mounted, remove all the
@@ -3773,18 +3775,22 @@ static int domain_setup_mon_state(struct rdt_resource *r, struct rdt_domain *d)
 	return 0;
 }
 
-int resctrl_online_domain(struct rdt_resource *r, struct rdt_domain *d)
+int resctrl_online_ctrl_domain(struct rdt_resource *r, struct rdt_domain *d)
 {
-	int err;
-
 	lockdep_assert_held(&rdtgroup_mutex);
 
 	if (supports_mba_mbps() && r->rid == RDT_RESOURCE_MBA)
 		/* RDT_RESOURCE_MBA is never mon_capable */
 		return mba_sc_domain_allocate(r, d);
 
-	if (!r->mon_capable)
-		return 0;
+	return 0;
+}
+
+int resctrl_online_mon_domain(struct rdt_resource *r, struct rdt_domain *d)
+{
+	int err;
+
+	lockdep_assert_held(&rdtgroup_mutex);
 
 	err = domain_setup_mon_state(r, d);
 	if (err)
-- 
2.40.1


^ permalink raw reply related	[flat|nested] 331+ messages in thread

* [PATCH v4 2/7] x86/resctrl: Split the rdt_domain structures
  2023-07-22 19:07 ` [PATCH v4 0/7] " Tony Luck
  2023-07-22 19:07   ` [PATCH v4 1/7] x86/resctrl: Create separate domains for control and monitoring Tony Luck
@ 2023-07-22 19:07   ` Tony Luck
  2023-07-22 19:07   ` [PATCH v4 3/7] x86/resctrl: Change monitor code to use rdt_mondomain Tony Luck
                     ` (7 subsequent siblings)
  9 siblings, 0 replies; 331+ messages in thread
From: Tony Luck @ 2023-07-22 19:07 UTC (permalink / raw)
  To: Fenghua Yu, Reinette Chatre, Peter Newman, Jonathan Corbet,
	Shuah Khan, x86
  Cc: Shaopeng Tan, James Morse, Jamie Iles, Babu Moger, Randy Dunlap,
	linux-kernel, linux-doc, patches, Tony Luck

The rdt_domain and rdt_hw_domain structures contain an amalgam of
fields used by control and monitoring features. Now that there
are separate domain lists for control and monitoring, these can be
divided between two structures.

First step: Add new domain structures for monitoring with the
fields that are needed. Leave these fields in the legacy structure
so compilation won't fail. They will be deleted once all the
monitoring code has been converted to use the new structure.

Signed-off-by: Tony Luck <tony.luck@intel.com>
---
 include/linux/resctrl.h                | 28 +++++++++++++++++++++++++-
 arch/x86/kernel/cpu/resctrl/internal.h | 17 +++++++++++++++-
 2 files changed, 43 insertions(+), 2 deletions(-)

diff --git a/include/linux/resctrl.h b/include/linux/resctrl.h
index 1267d56f9e76..475912662e47 100644
--- a/include/linux/resctrl.h
+++ b/include/linux/resctrl.h
@@ -53,7 +53,7 @@ struct resctrl_staged_config {
 };
 
 /**
- * struct rdt_domain - group of CPUs sharing a resctrl resource
+ * struct rdt_domain - group of CPUs sharing a resctrl control resource
  * @list:		all instances of this resource
  * @id:			unique id for this instance
  * @cpu_mask:		which CPUs share this resource
@@ -86,6 +86,32 @@ struct rdt_domain {
 	u32				*mbps_val;
 };
 
+/**
+ * struct rdt_mondomain - group of CPUs sharing a resctrl monitor resource
+ * @list:		all instances of this resource
+ * @id:			unique id for this instance
+ * @cpu_mask:		which CPUs share this resource
+ * @rmid_busy_llc:	bitmap of which limbo RMIDs are above threshold
+ * @mbm_total:		saved state for MBM total bandwidth
+ * @mbm_local:		saved state for MBM local bandwidth
+ * @mbm_over:		worker to periodically read MBM h/w counters
+ * @cqm_limbo:		worker to periodically read CQM h/w counters
+ * @mbm_work_cpu:	worker CPU for MBM h/w counters
+ * @cqm_work_cpu:	worker CPU for CQM h/w counters
+ */
+struct rdt_mondomain {
+	struct list_head		list;
+	int				id;
+	struct cpumask			cpu_mask;
+	unsigned long			*rmid_busy_llc;
+	struct mbm_state		*mbm_total;
+	struct mbm_state		*mbm_local;
+	struct delayed_work		mbm_over;
+	struct delayed_work		cqm_limbo;
+	int				mbm_work_cpu;
+	int				cqm_work_cpu;
+};
+
 /**
  * struct resctrl_cache - Cache allocation related data
  * @cbm_len:		Length of the cache bit mask
diff --git a/arch/x86/kernel/cpu/resctrl/internal.h b/arch/x86/kernel/cpu/resctrl/internal.h
index c5e2ac2a60cf..e956090a874e 100644
--- a/arch/x86/kernel/cpu/resctrl/internal.h
+++ b/arch/x86/kernel/cpu/resctrl/internal.h
@@ -320,7 +320,7 @@ struct arch_mbm_state {
 
 /**
  * struct rdt_hw_domain - Arch private attributes of a set of CPUs that share
- *			  a resource
+ *			  a control resource
  * @d_resctrl:	Properties exposed to the resctrl file system
  * @ctrl_val:	array of cache or mem ctrl values (indexed by CLOSID)
  * @arch_mbm_total:	arch private state for MBM total bandwidth
@@ -335,6 +335,21 @@ struct rdt_hw_domain {
 	struct arch_mbm_state		*arch_mbm_local;
 };
 
+/**
+ * struct rdt_hw_mondomain - Arch private attributes of a set of CPUs that share
+ *			  a monitor resource
+ * @d_resctrl:	Properties exposed to the resctrl file system
+ * @arch_mbm_total:	arch private state for MBM total bandwidth
+ * @arch_mbm_local:	arch private state for MBM local bandwidth
+ *
+ * Members of this structure are accessed via helpers that provide abstraction.
+ */
+struct rdt_hw_mondomain {
+	struct rdt_mondomain		d_resctrl;
+	struct arch_mbm_state		*arch_mbm_total;
+	struct arch_mbm_state		*arch_mbm_local;
+};
+
 static inline struct rdt_hw_domain *resctrl_to_arch_dom(struct rdt_domain *r)
 {
 	return container_of(r, struct rdt_hw_domain, d_resctrl);
-- 
2.40.1


^ permalink raw reply related	[flat|nested] 331+ messages in thread

* [PATCH v4 3/7] x86/resctrl: Change monitor code to use rdt_mondomain
  2023-07-22 19:07 ` [PATCH v4 0/7] " Tony Luck
  2023-07-22 19:07   ` [PATCH v4 1/7] x86/resctrl: Create separate domains for control and monitoring Tony Luck
  2023-07-22 19:07   ` [PATCH v4 2/7] x86/resctrl: Split the rdt_domain structures Tony Luck
@ 2023-07-22 19:07   ` Tony Luck
  2023-08-11 17:30     ` Reinette Chatre
  2023-07-22 19:07   ` [PATCH v4 4/7] x86/resctrl: Delete unused fields from struct rdt_domain Tony Luck
                     ` (6 subsequent siblings)
  9 siblings, 1 reply; 331+ messages in thread
From: Tony Luck @ 2023-07-22 19:07 UTC (permalink / raw)
  To: Fenghua Yu, Reinette Chatre, Peter Newman, Jonathan Corbet,
	Shuah Khan, x86
  Cc: Shaopeng Tan, James Morse, Jamie Iles, Babu Moger, Randy Dunlap,
	linux-kernel, linux-doc, patches, Tony Luck

A few functions need to be duplicated to provide versions to
operate on control and monitor domains respectively. But most
of the changes are just fixing argument and return value types.

Signed-off-by: Tony Luck <tony.luck@intel.com>
---
 include/linux/resctrl.h                   | 10 +++---
 arch/x86/kernel/cpu/resctrl/internal.h    | 21 +++++++-----
 arch/x86/kernel/cpu/resctrl/core.c        | 40 ++++++++++++++---------
 arch/x86/kernel/cpu/resctrl/ctrlmondata.c |  4 +--
 arch/x86/kernel/cpu/resctrl/monitor.c     | 38 ++++++++++-----------
 arch/x86/kernel/cpu/resctrl/rdtgroup.c    | 24 +++++++-------
 6 files changed, 75 insertions(+), 62 deletions(-)

diff --git a/include/linux/resctrl.h b/include/linux/resctrl.h
index 475912662e47..663bbc427c4b 100644
--- a/include/linux/resctrl.h
+++ b/include/linux/resctrl.h
@@ -248,9 +248,9 @@ int resctrl_arch_update_one(struct rdt_resource *r, struct rdt_domain *d,
 u32 resctrl_arch_get_config(struct rdt_resource *r, struct rdt_domain *d,
 			    u32 closid, enum resctrl_conf_type type);
 int resctrl_online_ctrl_domain(struct rdt_resource *r, struct rdt_domain *d);
-int resctrl_online_mon_domain(struct rdt_resource *r, struct rdt_domain *d);
+int resctrl_online_mon_domain(struct rdt_resource *r, struct rdt_mondomain *d);
 void resctrl_offline_ctrl_domain(struct rdt_resource *r, struct rdt_domain *d);
-void resctrl_offline_mon_domain(struct rdt_resource *r, struct rdt_domain *d);
+void resctrl_offline_mon_domain(struct rdt_resource *r, struct rdt_mondomain *d);
 
 /**
  * resctrl_arch_rmid_read() - Read the eventid counter corresponding to rmid
@@ -266,7 +266,7 @@ void resctrl_offline_mon_domain(struct rdt_resource *r, struct rdt_domain *d);
  * Return:
  * 0 on success, or -EIO, -EINVAL etc on error.
  */
-int resctrl_arch_rmid_read(struct rdt_resource *r, struct rdt_domain *d,
+int resctrl_arch_rmid_read(struct rdt_resource *r, struct rdt_mondomain *d,
 			   u32 rmid, enum resctrl_event_id eventid, u64 *val);
 
 /**
@@ -279,7 +279,7 @@ int resctrl_arch_rmid_read(struct rdt_resource *r, struct rdt_domain *d,
  *
  * This can be called from any CPU.
  */
-void resctrl_arch_reset_rmid(struct rdt_resource *r, struct rdt_domain *d,
+void resctrl_arch_reset_rmid(struct rdt_resource *r, struct rdt_mondomain *d,
 			     u32 rmid, enum resctrl_event_id eventid);
 
 /**
@@ -291,7 +291,7 @@ void resctrl_arch_reset_rmid(struct rdt_resource *r, struct rdt_domain *d,
  *
  * This can be called from any CPU.
  */
-void resctrl_arch_reset_rmid_all(struct rdt_resource *r, struct rdt_domain *d);
+void resctrl_arch_reset_rmid_all(struct rdt_resource *r, struct rdt_mondomain *d);
 
 extern unsigned int resctrl_rmid_realloc_threshold;
 extern unsigned int resctrl_rmid_realloc_limit;
diff --git a/arch/x86/kernel/cpu/resctrl/internal.h b/arch/x86/kernel/cpu/resctrl/internal.h
index e956090a874e..401af6ccf272 100644
--- a/arch/x86/kernel/cpu/resctrl/internal.h
+++ b/arch/x86/kernel/cpu/resctrl/internal.h
@@ -106,7 +106,7 @@ union mon_data_bits {
 struct rmid_read {
 	struct rdtgroup		*rgrp;
 	struct rdt_resource	*r;
-	struct rdt_domain	*d;
+	struct rdt_mondomain	*d;
 	enum resctrl_event_id	evtid;
 	bool			first;
 	int			err;
@@ -355,6 +355,11 @@ static inline struct rdt_hw_domain *resctrl_to_arch_dom(struct rdt_domain *r)
 	return container_of(r, struct rdt_hw_domain, d_resctrl);
 }
 
+static inline struct rdt_hw_mondomain *resctrl_to_arch_mondom(struct rdt_mondomain *r)
+{
+	return container_of(r, struct rdt_hw_mondomain, d_resctrl);
+}
+
 /**
  * struct msr_param - set a range of MSRs from a domain
  * @res:       The resource to use
@@ -526,8 +531,8 @@ void rdtgroup_kn_unlock(struct kernfs_node *kn);
 int rdtgroup_kn_mode_restrict(struct rdtgroup *r, const char *name);
 int rdtgroup_kn_mode_restore(struct rdtgroup *r, const char *name,
 			     umode_t mask);
-struct rdt_domain *rdt_find_domain(struct list_head *h, int id,
-				   struct list_head **pos);
+void *rdt_find_domain(struct list_head *h, int id,
+		      struct list_head **pos);
 ssize_t rdtgroup_schemata_write(struct kernfs_open_file *of,
 				char *buf, size_t nbytes, loff_t off);
 int rdtgroup_schemata_show(struct kernfs_open_file *of,
@@ -556,17 +561,17 @@ bool __init rdt_cpu_has(int flag);
 void mon_event_count(void *info);
 int rdtgroup_mondata_show(struct seq_file *m, void *arg);
 void mon_event_read(struct rmid_read *rr, struct rdt_resource *r,
-		    struct rdt_domain *d, struct rdtgroup *rdtgrp,
+		    struct rdt_mondomain *d, struct rdtgroup *rdtgrp,
 		    int evtid, int first);
-void mbm_setup_overflow_handler(struct rdt_domain *dom,
+void mbm_setup_overflow_handler(struct rdt_mondomain *dom,
 				unsigned long delay_ms);
 void mbm_handle_overflow(struct work_struct *work);
 void __init intel_rdt_mbm_apply_quirk(void);
 bool is_mba_sc(struct rdt_resource *r);
-void cqm_setup_limbo_handler(struct rdt_domain *dom, unsigned long delay_ms);
+void cqm_setup_limbo_handler(struct rdt_mondomain *dom, unsigned long delay_ms);
 void cqm_handle_limbo(struct work_struct *work);
-bool has_busy_rmid(struct rdt_resource *r, struct rdt_domain *d);
-void __check_limbo(struct rdt_domain *d, bool force_free);
+bool has_busy_rmid(struct rdt_resource *r, struct rdt_mondomain *d);
+void __check_limbo(struct rdt_mondomain *d, bool force_free);
 void rdt_domain_reconfigure_cdp(struct rdt_resource *r);
 void __init thread_throttle_mode_init(void);
 void __init mbm_config_rftype_init(const char *config);
diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
index 274605aaa026..0161362b0c3e 100644
--- a/arch/x86/kernel/cpu/resctrl/core.c
+++ b/arch/x86/kernel/cpu/resctrl/core.c
@@ -393,9 +393,12 @@ void rdt_ctrl_update(void *arg)
  * id is found in a domain, return the domain. Otherwise, if requested by
  * caller, return the first domain whose id is bigger than the input id.
  * The domain list is sorted by id in ascending order.
+ *
+ * N.B. Returned value may be either a pointer to "struct rdt_domain" or
+ * to "struct rdt_mondomain" depending on which domain list is scanned.
  */
-struct rdt_domain *rdt_find_domain(struct list_head *h, int id,
-				   struct list_head **pos)
+void *rdt_find_domain(struct list_head *h, int id,
+		      struct list_head **pos)
 {
 	struct rdt_domain *d;
 	struct list_head *l;
@@ -434,10 +437,15 @@ static void setup_default_ctrlval(struct rdt_resource *r, u32 *dc)
 }
 
 static void domain_free(struct rdt_hw_domain *hw_dom)
+{
+	kfree(hw_dom->ctrl_val);
+	kfree(hw_dom);
+}
+
+static void mondomain_free(struct rdt_hw_mondomain *hw_dom)
 {
 	kfree(hw_dom->arch_mbm_total);
 	kfree(hw_dom->arch_mbm_local);
-	kfree(hw_dom->ctrl_val);
 	kfree(hw_dom);
 }
 
@@ -467,7 +475,7 @@ static int domain_setup_ctrlval(struct rdt_resource *r, struct rdt_domain *d)
  * @num_rmid:	The size of the MBM counter array
  * @hw_dom:	The domain that owns the allocated arrays
  */
-static int arch_domain_mbm_alloc(u32 num_rmid, struct rdt_hw_domain *hw_dom)
+static int arch_domain_mbm_alloc(u32 num_rmid, struct rdt_hw_mondomain *hw_dom)
 {
 	size_t tsize;
 
@@ -539,8 +547,8 @@ static void domain_add_cpu_mon(int cpu, struct rdt_resource *r)
 {
 	int id = get_cpu_cacheinfo_id(cpu, r->mon_scope);
 	struct list_head *add_pos = NULL;
-	struct rdt_hw_domain *hw_dom;
-	struct rdt_domain *d;
+	struct rdt_hw_mondomain *hw_mondom;
+	struct rdt_mondomain *d;
 	int err;
 
 	d = rdt_find_domain(&r->mondomains, id, &add_pos);
@@ -556,16 +564,16 @@ static void domain_add_cpu_mon(int cpu, struct rdt_resource *r)
 		return;
 	}
 
-	hw_dom = kzalloc_node(sizeof(*hw_dom), GFP_KERNEL, cpu_to_node(cpu));
-	if (!hw_dom)
+	hw_mondom = kzalloc_node(sizeof(*hw_mondom), GFP_KERNEL, cpu_to_node(cpu));
+	if (!hw_mondom)
 		return;
 
-	d = &hw_dom->d_resctrl;
+	d = &hw_mondom->d_resctrl;
 	d->id = id;
 	cpumask_set_cpu(cpu, &d->cpu_mask);
 
-	if (arch_domain_mbm_alloc(r->num_rmid, hw_dom)) {
-		domain_free(hw_dom);
+	if (arch_domain_mbm_alloc(r->num_rmid, hw_mondom)) {
+		mondomain_free(hw_mondom);
 		return;
 	}
 
@@ -574,7 +582,7 @@ static void domain_add_cpu_mon(int cpu, struct rdt_resource *r)
 	err = resctrl_online_mon_domain(r, d);
 	if (err) {
 		list_del(&d->list);
-		domain_free(hw_dom);
+		mondomain_free(hw_mondom);
 	}
 }
 
@@ -632,22 +640,22 @@ static void domain_remove_cpu_ctrl(int cpu, struct rdt_resource *r)
 static void domain_remove_cpu_mon(int cpu, struct rdt_resource *r)
 {
 	int id = get_cpu_cacheinfo_id(cpu, r->cache_level);
-	struct rdt_hw_domain *hw_dom;
-	struct rdt_domain *d;
+	struct rdt_hw_mondomain *hw_mondom;
+	struct rdt_mondomain *d;
 
 	d = rdt_find_domain(&r->mondomains, id, NULL);
 	if (IS_ERR_OR_NULL(d)) {
 		pr_warn("Couldn't find cache id for CPU %d\n", cpu);
 		return;
 	}
-	hw_dom = resctrl_to_arch_dom(d);
+	hw_mondom = resctrl_to_arch_mondom(d);
 
 	cpumask_clear_cpu(cpu, &d->cpu_mask);
 	if (cpumask_empty(&d->cpu_mask)) {
 		resctrl_offline_mon_domain(r, d);
 		list_del(&d->list);
 
-		domain_free(hw_dom);
+		mondomain_free(hw_mondom);
 
 		return;
 	}
diff --git a/arch/x86/kernel/cpu/resctrl/ctrlmondata.c b/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
index 839df83d1a0a..86fc5b0e3d39 100644
--- a/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
+++ b/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
@@ -521,7 +521,7 @@ int rdtgroup_schemata_show(struct kernfs_open_file *of,
 }
 
 void mon_event_read(struct rmid_read *rr, struct rdt_resource *r,
-		    struct rdt_domain *d, struct rdtgroup *rdtgrp,
+		    struct rdt_mondomain *d, struct rdtgroup *rdtgrp,
 		    int evtid, int first)
 {
 	/*
@@ -544,7 +544,7 @@ int rdtgroup_mondata_show(struct seq_file *m, void *arg)
 	struct rdtgroup *rdtgrp;
 	struct rdt_resource *r;
 	union mon_data_bits md;
-	struct rdt_domain *d;
+	struct rdt_mondomain *d;
 	struct rmid_read rr;
 	int ret = 0;
 
diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c
index 66beca785535..0d9605fccb34 100644
--- a/arch/x86/kernel/cpu/resctrl/monitor.c
+++ b/arch/x86/kernel/cpu/resctrl/monitor.c
@@ -170,7 +170,7 @@ static int __rmid_read(u32 rmid, enum resctrl_event_id eventid, u64 *val)
 	return 0;
 }
 
-static struct arch_mbm_state *get_arch_mbm_state(struct rdt_hw_domain *hw_dom,
+static struct arch_mbm_state *get_arch_mbm_state(struct rdt_hw_mondomain *hw_dom,
 						 u32 rmid,
 						 enum resctrl_event_id eventid)
 {
@@ -189,10 +189,10 @@ static struct arch_mbm_state *get_arch_mbm_state(struct rdt_hw_domain *hw_dom,
 	return NULL;
 }
 
-void resctrl_arch_reset_rmid(struct rdt_resource *r, struct rdt_domain *d,
+void resctrl_arch_reset_rmid(struct rdt_resource *r, struct rdt_mondomain *d,
 			     u32 rmid, enum resctrl_event_id eventid)
 {
-	struct rdt_hw_domain *hw_dom = resctrl_to_arch_dom(d);
+	struct rdt_hw_mondomain *hw_dom = resctrl_to_arch_mondom(d);
 	struct arch_mbm_state *am;
 
 	am = get_arch_mbm_state(hw_dom, rmid, eventid);
@@ -208,9 +208,9 @@ void resctrl_arch_reset_rmid(struct rdt_resource *r, struct rdt_domain *d,
  * Assumes that hardware counters are also reset and thus that there is
  * no need to record initial non-zero counts.
  */
-void resctrl_arch_reset_rmid_all(struct rdt_resource *r, struct rdt_domain *d)
+void resctrl_arch_reset_rmid_all(struct rdt_resource *r, struct rdt_mondomain *d)
 {
-	struct rdt_hw_domain *hw_dom = resctrl_to_arch_dom(d);
+	struct rdt_hw_mondomain *hw_dom = resctrl_to_arch_mondom(d);
 
 	if (is_mbm_total_enabled())
 		memset(hw_dom->arch_mbm_total, 0,
@@ -229,11 +229,11 @@ static u64 mbm_overflow_count(u64 prev_msr, u64 cur_msr, unsigned int width)
 	return chunks >> shift;
 }
 
-int resctrl_arch_rmid_read(struct rdt_resource *r, struct rdt_domain *d,
+int resctrl_arch_rmid_read(struct rdt_resource *r, struct rdt_mondomain *d,
 			   u32 rmid, enum resctrl_event_id eventid, u64 *val)
 {
 	struct rdt_hw_resource *hw_res = resctrl_to_arch_res(r);
-	struct rdt_hw_domain *hw_dom = resctrl_to_arch_dom(d);
+	struct rdt_hw_mondomain *hw_dom = resctrl_to_arch_mondom(d);
 	struct arch_mbm_state *am;
 	u64 msr_val, chunks;
 	int ret;
@@ -266,7 +266,7 @@ int resctrl_arch_rmid_read(struct rdt_resource *r, struct rdt_domain *d,
  * decrement the count. If the busy count gets to zero on an RMID, we
  * free the RMID
  */
-void __check_limbo(struct rdt_domain *d, bool force_free)
+void __check_limbo(struct rdt_mondomain *d, bool force_free)
 {
 	struct rdt_resource *r = &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl;
 	struct rmid_entry *entry;
@@ -305,7 +305,7 @@ void __check_limbo(struct rdt_domain *d, bool force_free)
 	}
 }
 
-bool has_busy_rmid(struct rdt_resource *r, struct rdt_domain *d)
+bool has_busy_rmid(struct rdt_resource *r, struct rdt_mondomain *d)
 {
 	return find_first_bit(d->rmid_busy_llc, r->num_rmid) != r->num_rmid;
 }
@@ -334,7 +334,7 @@ int alloc_rmid(void)
 static void add_rmid_to_limbo(struct rmid_entry *entry)
 {
 	struct rdt_resource *r = &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl;
-	struct rdt_domain *d;
+	struct rdt_mondomain *d;
 	int cpu, err;
 	u64 val = 0;
 
@@ -383,7 +383,7 @@ void free_rmid(u32 rmid)
 		list_add_tail(&entry->list, &rmid_free_lru);
 }
 
-static struct mbm_state *get_mbm_state(struct rdt_domain *d, u32 rmid,
+static struct mbm_state *get_mbm_state(struct rdt_mondomain *d, u32 rmid,
 				       enum resctrl_event_id evtid)
 {
 	switch (evtid) {
@@ -516,7 +516,7 @@ void mon_event_count(void *info)
  * throttle MSRs already have low percentage values.  To avoid
  * unnecessarily restricting such rdtgroups, we also increase the bandwidth.
  */
-static void update_mba_bw(struct rdtgroup *rgrp, struct rdt_domain *dom_mbm)
+static void update_mba_bw(struct rdtgroup *rgrp, struct rdt_mondomain *dom_mbm)
 {
 	u32 closid, rmid, cur_msr_val, new_msr_val;
 	struct mbm_state *pmbm_data, *cmbm_data;
@@ -600,7 +600,7 @@ static void update_mba_bw(struct rdtgroup *rgrp, struct rdt_domain *dom_mbm)
 	}
 }
 
-static void mbm_update(struct rdt_resource *r, struct rdt_domain *d, int rmid)
+static void mbm_update(struct rdt_resource *r, struct rdt_mondomain *d, int rmid)
 {
 	struct rmid_read rr;
 
@@ -641,12 +641,12 @@ void cqm_handle_limbo(struct work_struct *work)
 	unsigned long delay = msecs_to_jiffies(CQM_LIMBOCHECK_INTERVAL);
 	int cpu = smp_processor_id();
 	struct rdt_resource *r;
-	struct rdt_domain *d;
+	struct rdt_mondomain *d;
 
 	mutex_lock(&rdtgroup_mutex);
 
 	r = &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl;
-	d = container_of(work, struct rdt_domain, cqm_limbo.work);
+	d = container_of(work, struct rdt_mondomain, cqm_limbo.work);
 
 	__check_limbo(d, false);
 
@@ -656,7 +656,7 @@ void cqm_handle_limbo(struct work_struct *work)
 	mutex_unlock(&rdtgroup_mutex);
 }
 
-void cqm_setup_limbo_handler(struct rdt_domain *dom, unsigned long delay_ms)
+void cqm_setup_limbo_handler(struct rdt_mondomain *dom, unsigned long delay_ms)
 {
 	unsigned long delay = msecs_to_jiffies(delay_ms);
 	int cpu;
@@ -674,7 +674,7 @@ void mbm_handle_overflow(struct work_struct *work)
 	int cpu = smp_processor_id();
 	struct list_head *head;
 	struct rdt_resource *r;
-	struct rdt_domain *d;
+	struct rdt_mondomain *d;
 
 	mutex_lock(&rdtgroup_mutex);
 
@@ -682,7 +682,7 @@ void mbm_handle_overflow(struct work_struct *work)
 		goto out_unlock;
 
 	r = &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl;
-	d = container_of(work, struct rdt_domain, mbm_over.work);
+	d = container_of(work, struct rdt_mondomain, mbm_over.work);
 
 	list_for_each_entry(prgrp, &rdt_all_groups, rdtgroup_list) {
 		mbm_update(r, d, prgrp->mon.rmid);
@@ -701,7 +701,7 @@ void mbm_handle_overflow(struct work_struct *work)
 	mutex_unlock(&rdtgroup_mutex);
 }
 
-void mbm_setup_overflow_handler(struct rdt_domain *dom, unsigned long delay_ms)
+void mbm_setup_overflow_handler(struct rdt_mondomain *dom, unsigned long delay_ms)
 {
 	unsigned long delay = msecs_to_jiffies(delay_ms);
 	int cpu;
diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
index 27753eb5d513..4a268df9b456 100644
--- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
+++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
@@ -1483,7 +1483,7 @@ static void mon_event_config_read(void *info)
 	mon_info->mon_config = msrval & MAX_EVT_CONFIG_BITS;
 }
 
-static void mondata_config_read(struct rdt_domain *d, struct mon_config_info *mon_info)
+static void mondata_config_read(struct rdt_mondomain *d, struct mon_config_info *mon_info)
 {
 	smp_call_function_any(&d->cpu_mask, mon_event_config_read, mon_info, 1);
 }
@@ -1491,7 +1491,7 @@ static void mondata_config_read(struct rdt_domain *d, struct mon_config_info *mo
 static int mbm_config_show(struct seq_file *s, struct rdt_resource *r, u32 evtid)
 {
 	struct mon_config_info mon_info = {0};
-	struct rdt_domain *dom;
+	struct rdt_mondomain *dom;
 	bool sep = false;
 
 	mutex_lock(&rdtgroup_mutex);
@@ -1548,7 +1548,7 @@ static void mon_event_config_write(void *info)
 }
 
 static int mbm_config_write_domain(struct rdt_resource *r,
-				   struct rdt_domain *d, u32 evtid, u32 val)
+				   struct rdt_mondomain *d, u32 evtid, u32 val)
 {
 	struct mon_config_info mon_info = {0};
 	int ret = 0;
@@ -1598,7 +1598,7 @@ static int mon_config_write(struct rdt_resource *r, char *tok, u32 evtid)
 {
 	char *dom_str = NULL, *id_str;
 	unsigned long dom_id, val;
-	struct rdt_domain *d;
+	struct rdt_mondomain *d;
 	int ret = 0;
 
 next:
@@ -2463,7 +2463,7 @@ static void schemata_list_destroy(void)
 static int rdt_get_tree(struct fs_context *fc)
 {
 	struct rdt_fs_context *ctx = rdt_fc2context(fc);
-	struct rdt_domain *dom;
+	struct rdt_mondomain *dom;
 	struct rdt_resource *r;
 	int ret;
 
@@ -2845,7 +2845,7 @@ static void rmdir_mondata_subdir_allrdtgrp(struct rdt_resource *r,
 }
 
 static int mkdir_mondata_subdir(struct kernfs_node *parent_kn,
-				struct rdt_domain *d,
+				struct rdt_mondomain *d,
 				struct rdt_resource *r, struct rdtgroup *prgrp)
 {
 	union mon_data_bits priv;
@@ -2894,7 +2894,7 @@ static int mkdir_mondata_subdir(struct kernfs_node *parent_kn,
  * and "monitor" groups with given domain id.
  */
 static void mkdir_mondata_subdir_allrdtgrp(struct rdt_resource *r,
-					   struct rdt_domain *d)
+					   struct rdt_mondomain *d)
 {
 	struct kernfs_node *parent_kn;
 	struct rdtgroup *prgrp, *crgrp;
@@ -2916,7 +2916,7 @@ static int mkdir_mondata_subdir_alldom(struct kernfs_node *parent_kn,
 				       struct rdt_resource *r,
 				       struct rdtgroup *prgrp)
 {
-	struct rdt_domain *dom;
+	struct rdt_mondomain *dom;
 	int ret;
 
 	list_for_each_entry(dom, &r->mondomains, list) {
@@ -3701,7 +3701,7 @@ static int __init rdtgroup_setup_root(void)
 	return ret;
 }
 
-static void domain_destroy_mon_state(struct rdt_domain *d)
+static void domain_destroy_mon_state(struct rdt_mondomain *d)
 {
 	bitmap_free(d->rmid_busy_llc);
 	kfree(d->mbm_total);
@@ -3716,7 +3716,7 @@ void resctrl_offline_ctrl_domain(struct rdt_resource *r, struct rdt_domain *d)
 		mba_sc_domain_destroy(r, d);
 }
 
-void resctrl_offline_mon_domain(struct rdt_resource *r, struct rdt_domain *d)
+void resctrl_offline_mon_domain(struct rdt_resource *r, struct rdt_mondomain *d)
 {
 	lockdep_assert_held(&rdtgroup_mutex);
 
@@ -3745,7 +3745,7 @@ void resctrl_offline_mon_domain(struct rdt_resource *r, struct rdt_domain *d)
 	domain_destroy_mon_state(d);
 }
 
-static int domain_setup_mon_state(struct rdt_resource *r, struct rdt_domain *d)
+static int domain_setup_mon_state(struct rdt_resource *r, struct rdt_mondomain *d)
 {
 	size_t tsize;
 
@@ -3786,7 +3786,7 @@ int resctrl_online_ctrl_domain(struct rdt_resource *r, struct rdt_domain *d)
 	return 0;
 }
 
-int resctrl_online_mon_domain(struct rdt_resource *r, struct rdt_domain *d)
+int resctrl_online_mon_domain(struct rdt_resource *r, struct rdt_mondomain *d)
 {
 	int err;
 
-- 
2.40.1


^ permalink raw reply related	[flat|nested] 331+ messages in thread

* [PATCH v4 4/7] x86/resctrl: Delete unused fields from struct rdt_domain
  2023-07-22 19:07 ` [PATCH v4 0/7] " Tony Luck
                     ` (2 preceding siblings ...)
  2023-07-22 19:07   ` [PATCH v4 3/7] x86/resctrl: Change monitor code to use rdt_mondomain Tony Luck
@ 2023-07-22 19:07   ` Tony Luck
  2023-08-11 17:30     ` Reinette Chatre
  2023-07-22 19:07   ` [PATCH v4 5/7] x86/resctrl: Determine if Sub-NUMA Cluster is enabled and initialize Tony Luck
                     ` (5 subsequent siblings)
  9 siblings, 1 reply; 331+ messages in thread
From: Tony Luck @ 2023-07-22 19:07 UTC (permalink / raw)
  To: Fenghua Yu, Reinette Chatre, Peter Newman, Jonathan Corbet,
	Shuah Khan, x86
  Cc: Shaopeng Tan, James Morse, Jamie Iles, Babu Moger, Randy Dunlap,
	linux-kernel, linux-doc, patches, Tony Luck

Now that all the monitoring functions use struct rdt_mondomain, the
monitor fields can be dropped from the structure used for control
operations.

Signed-off-by: Tony Luck <tony.luck@intel.com>
---
 include/linux/resctrl.h                | 14 --------------
 arch/x86/kernel/cpu/resctrl/internal.h |  4 ----
 2 files changed, 18 deletions(-)

diff --git a/include/linux/resctrl.h b/include/linux/resctrl.h
index 663bbc427c4b..80a89d171eba 100644
--- a/include/linux/resctrl.h
+++ b/include/linux/resctrl.h
@@ -57,13 +57,6 @@ struct resctrl_staged_config {
  * @list:		all instances of this resource
  * @id:			unique id for this instance
  * @cpu_mask:		which CPUs share this resource
- * @rmid_busy_llc:	bitmap of which limbo RMIDs are above threshold
- * @mbm_total:		saved state for MBM total bandwidth
- * @mbm_local:		saved state for MBM local bandwidth
- * @mbm_over:		worker to periodically read MBM h/w counters
- * @cqm_limbo:		worker to periodically read CQM h/w counters
- * @mbm_work_cpu:	worker CPU for MBM h/w counters
- * @cqm_work_cpu:	worker CPU for CQM h/w counters
  * @plr:		pseudo-locked region (if any) associated with domain
  * @staged_config:	parsed configuration to be applied
  * @mbps_val:		When mba_sc is enabled, this holds the array of user
@@ -74,13 +67,6 @@ struct rdt_domain {
 	struct list_head		list;
 	int				id;
 	struct cpumask			cpu_mask;
-	unsigned long			*rmid_busy_llc;
-	struct mbm_state		*mbm_total;
-	struct mbm_state		*mbm_local;
-	struct delayed_work		mbm_over;
-	struct delayed_work		cqm_limbo;
-	int				mbm_work_cpu;
-	int				cqm_work_cpu;
 	struct pseudo_lock_region	*plr;
 	struct resctrl_staged_config	staged_config[CDP_NUM_TYPES];
 	u32				*mbps_val;
diff --git a/arch/x86/kernel/cpu/resctrl/internal.h b/arch/x86/kernel/cpu/resctrl/internal.h
index 401af6ccf272..016ef0373c5a 100644
--- a/arch/x86/kernel/cpu/resctrl/internal.h
+++ b/arch/x86/kernel/cpu/resctrl/internal.h
@@ -323,16 +323,12 @@ struct arch_mbm_state {
  *			  a control resource
  * @d_resctrl:	Properties exposed to the resctrl file system
  * @ctrl_val:	array of cache or mem ctrl values (indexed by CLOSID)
- * @arch_mbm_total:	arch private state for MBM total bandwidth
- * @arch_mbm_local:	arch private state for MBM local bandwidth
  *
  * Members of this structure are accessed via helpers that provide abstraction.
  */
 struct rdt_hw_domain {
 	struct rdt_domain		d_resctrl;
 	u32				*ctrl_val;
-	struct arch_mbm_state		*arch_mbm_total;
-	struct arch_mbm_state		*arch_mbm_local;
 };
 
 /**
-- 
2.40.1


^ permalink raw reply related	[flat|nested] 331+ messages in thread

* [PATCH v4 5/7] x86/resctrl: Determine if Sub-NUMA Cluster is enabled and initialize.
  2023-07-22 19:07 ` [PATCH v4 0/7] " Tony Luck
                     ` (3 preceding siblings ...)
  2023-07-22 19:07   ` [PATCH v4 4/7] x86/resctrl: Delete unused fields from struct rdt_domain Tony Luck
@ 2023-07-22 19:07   ` Tony Luck
  2023-08-11 17:32     ` Reinette Chatre
  2023-07-22 19:07   ` [PATCH v4 6/7] x86/resctrl: Update documentation with Sub-NUMA cluster changes Tony Luck
                     ` (4 subsequent siblings)
  9 siblings, 1 reply; 331+ messages in thread
From: Tony Luck @ 2023-07-22 19:07 UTC (permalink / raw)
  To: Fenghua Yu, Reinette Chatre, Peter Newman, Jonathan Corbet,
	Shuah Khan, x86
  Cc: Shaopeng Tan, James Morse, Jamie Iles, Babu Moger, Randy Dunlap,
	linux-kernel, linux-doc, patches, Tony Luck

There isn't a simple hardware enumeration to indicate to software that
a system is running with Sub-NUMA Cluster enabled.

Compare the number of NUMA nodes with the number of L3 caches to calculate
the number of Sub-NUMA nodes per L3 cache.
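
For example (illustrative numbers): a two-socket system with SNC2
enabled reports four CPU-bearing NUMA nodes but only two distinct L3
cache instances, giving 4 / 2 = 2 SNC nodes per L3 cache. With SNC
disabled the ratio is 1 and behavior is unchanged.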

When Sub-NUMA Cluster mode is enabled in BIOS setup, the RMID counters
are distributed equally between the SNC nodes within each socket.

E.g. if there are 400 RMID counters, and the system is configured with
two SNC nodes per socket, then RMID counters 0..199 are used on SNC node
0 on the socket, and RMID counters 200..399 on SNC node 1.

A model specific MSR (0xca0) can change the configuration of the RMIDs
when SNC mode is enabled.

The MSR controls the interpretation of the RMID field in the
IA32_PQR_ASSOC MSR so that the appropriate hardware counters
within the SNC node are updated.

To read the RMID counters, an offset must be used to get data
from the physical counter associated with the SNC node. As in
the example above with 400 RMID counters, Linux sees only 200
counters. No special action is needed to read a counter from
the first SNC node on a socket. But to read Linux-visible
counter 50 on the second SNC node, the kernel must load 250
into the QM_EVTSEL MSR.
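
A sketch of the intended mapping (illustrative only; the real code is
in the __rmid_read() hunk below):

	/* 400 physical RMIDs, 2 SNC nodes => r->num_rmid = 200 */
	offset = (cpu_to_node(cpu) % snc_nodes_per_l3_cache) * r->num_rmid;
	wrmsr(MSR_IA32_QM_EVTSEL, eventid, rmid + offset);
	/* e.g. Linux RMID 50 on the second SNC node => physical RMID 250 */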

N.B. this works well for well-behaved NUMA applications that access
memory predominantly from the local memory node. For applications that
access memory across multiple nodes it may be necessary for the user
to read counters for all SNC nodes on a socket and add the values to
get the actual LLC occupancy or memory bandwidth. Perhaps this isn't
all that different from applications that span across multiple sockets
in a legacy system.

The cache allocation feature still provides the same number of
bits in a mask to control allocation into the L3 cache. But each
of those ways has its capacity reduced because the cache is divided
between the SNC nodes. Adjust the value reported in the resctrl
"size" file accordingly.

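For example (illustrative numbers): with a 60MB L3 cache and two SNC
nodes per socket, a group allowed the full cache bit mask would have
reported size = 60MB before this change and reports 30MB after, which
matches the capacity actually available within one SNC node.
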
Mounting the file system with the "mba_MBps" option is disabled
when SNC mode is enabled. This is because the measurement of bandwidth
is per SNC node, while the MBA throttling controls are still at
the L3 cache scope.

Signed-off-by: Tony Luck <tony.luck@intel.com>
---
 include/linux/resctrl.h                |  2 +
 arch/x86/include/asm/msr-index.h       |  1 +
 arch/x86/kernel/cpu/resctrl/internal.h |  2 +
 arch/x86/kernel/cpu/resctrl/core.c     | 82 +++++++++++++++++++++++++-
 arch/x86/kernel/cpu/resctrl/monitor.c  | 18 +++++-
 arch/x86/kernel/cpu/resctrl/rdtgroup.c |  4 +-
 6 files changed, 103 insertions(+), 6 deletions(-)

diff --git a/include/linux/resctrl.h b/include/linux/resctrl.h
index 80a89d171eba..576dc21bd990 100644
--- a/include/linux/resctrl.h
+++ b/include/linux/resctrl.h
@@ -200,6 +200,8 @@ struct rdt_resource {
 	bool			cdp_capable;
 };
 
+#define MON_SCOPE_NODE 100
+
 /**
  * struct resctrl_schema - configuration abilities of a resource presented to
  *			   user-space
diff --git a/arch/x86/include/asm/msr-index.h b/arch/x86/include/asm/msr-index.h
index 3aedae61af4f..4b624a37d64a 100644
--- a/arch/x86/include/asm/msr-index.h
+++ b/arch/x86/include/asm/msr-index.h
@@ -1087,6 +1087,7 @@
 #define MSR_IA32_QM_CTR			0xc8e
 #define MSR_IA32_PQR_ASSOC		0xc8f
 #define MSR_IA32_L3_CBM_BASE		0xc90
+#define MSR_RMID_SNC_CONFIG		0xca0
 #define MSR_IA32_L2_CBM_BASE		0xd10
 #define MSR_IA32_MBA_THRTL_BASE		0xd50
 
diff --git a/arch/x86/kernel/cpu/resctrl/internal.h b/arch/x86/kernel/cpu/resctrl/internal.h
index 016ef0373c5a..00a330bc5ced 100644
--- a/arch/x86/kernel/cpu/resctrl/internal.h
+++ b/arch/x86/kernel/cpu/resctrl/internal.h
@@ -446,6 +446,8 @@ DECLARE_STATIC_KEY_FALSE(rdt_alloc_enable_key);
 
 extern struct dentry *debugfs_resctrl;
 
+extern int snc_nodes_per_l3_cache;
+
 enum resctrl_res_level {
 	RDT_RESOURCE_L3,
 	RDT_RESOURCE_L2,
diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
index 0161362b0c3e..1331add347fc 100644
--- a/arch/x86/kernel/cpu/resctrl/core.c
+++ b/arch/x86/kernel/cpu/resctrl/core.c
@@ -16,11 +16,14 @@
 
 #define pr_fmt(fmt)	"resctrl: " fmt
 
+#include <linux/cpu.h>
 #include <linux/slab.h>
 #include <linux/err.h>
 #include <linux/cacheinfo.h>
 #include <linux/cpuhotplug.h>
+#include <linux/mod_devicetable.h>
 
+#include <asm/cpu_device_id.h>
 #include <asm/intel-family.h>
 #include <asm/resctrl.h>
 #include "internal.h"
@@ -48,6 +51,13 @@ int max_name_width, max_data_width;
  */
 bool rdt_alloc_capable;
 
+/*
+ * Number of SNC nodes that share each L3 cache.
+ * Default is 1 for systems that do not support
+ * SNC, or have SNC disabled.
+ */
+int snc_nodes_per_l3_cache = 1;
+
 static void
 mba_wrmsr_intel(struct rdt_domain *d, struct msr_param *m,
 		struct rdt_resource *r);
@@ -543,9 +553,16 @@ static void domain_add_cpu_ctrl(int cpu, struct rdt_resource *r)
 	}
 }
 
+static int get_mon_scope_id(int cpu, int scope)
+{
+	if (scope == MON_SCOPE_NODE)
+		return cpu_to_node(cpu);
+	return get_cpu_cacheinfo_id(cpu, scope);
+}
+
 static void domain_add_cpu_mon(int cpu, struct rdt_resource *r)
 {
-	int id = get_cpu_cacheinfo_id(cpu, r->mon_scope);
+	int id = get_mon_scope_id(cpu, r->mon_scope);
 	struct list_head *add_pos = NULL;
 	struct rdt_hw_mondomain *hw_mondom;
 	struct rdt_mondomain *d;
@@ -692,11 +709,28 @@ static void clear_closid_rmid(int cpu)
 	wrmsr(MSR_IA32_PQR_ASSOC, 0, 0);
 }
 
+static void snc_remap_rmids(int cpu)
+{
+	u64	val;
+
+	/* Only need to enable once per package */
+	if (cpumask_first(topology_core_cpumask(cpu)) != cpu)
+		return;
+
+	rdmsrl(MSR_RMID_SNC_CONFIG, val);
+	val &= ~BIT_ULL(0);
+	wrmsrl(MSR_RMID_SNC_CONFIG, val);
+}
+
 static int resctrl_online_cpu(unsigned int cpu)
 {
 	struct rdt_resource *r;
 
 	mutex_lock(&rdtgroup_mutex);
+
+	if (snc_nodes_per_l3_cache > 1)
+		snc_remap_rmids(cpu);
+
 	for_each_capable_rdt_resource(r)
 		domain_add_cpu(cpu, r);
 	/* The cpu is set in default rdtgroup after online. */
@@ -951,11 +985,57 @@ static __init bool get_rdt_resources(void)
 	return (rdt_mon_capable || rdt_alloc_capable);
 }
 
+static const struct x86_cpu_id snc_cpu_ids[] __initconst = {
+	X86_MATCH_INTEL_FAM6_MODEL(ICELAKE_X, 0),
+	X86_MATCH_INTEL_FAM6_MODEL(SAPPHIRERAPIDS_X, 0),
+	X86_MATCH_INTEL_FAM6_MODEL(EMERALDRAPIDS_X, 0),
+	{}
+};
+
+/*
+ * There isn't a simple enumeration bit to show whether SNC mode
+ * is enabled. Look at the ratio of number of NUMA nodes to the
+ * number of distinct L3 caches. Take care to skip memory-only nodes.
+ */
+static __init int get_snc_config(void)
+{
+	unsigned long *node_caches;
+	int mem_only_nodes = 0;
+	int cpu, node, ret;
+
+	if (!x86_match_cpu(snc_cpu_ids))
+		return 1;
+
+	node_caches = kcalloc(BITS_TO_LONGS(nr_node_ids), sizeof(*node_caches), GFP_KERNEL);
+	if (!node_caches)
+		return 1;
+
+	cpus_read_lock();
+	for_each_node(node) {
+		cpu = cpumask_first(cpumask_of_node(node));
+		if (cpu < nr_cpu_ids)
+			set_bit(get_cpu_cacheinfo_id(cpu, 3), node_caches);
+		else
+			mem_only_nodes++;
+	}
+	cpus_read_unlock();
+
+	ret = (nr_node_ids - mem_only_nodes) / bitmap_weight(node_caches, nr_node_ids);
+	kfree(node_caches);
+
+	if (ret > 1)
+		rdt_resources_all[RDT_RESOURCE_L3].r_resctrl.mon_scope = MON_SCOPE_NODE;
+
+	return ret;
+}
+
 static __init void rdt_init_res_defs_intel(void)
 {
 	struct rdt_hw_resource *hw_res;
 	struct rdt_resource *r;
 
+	snc_nodes_per_l3_cache = get_snc_config();
+
 	for_each_rdt_resource(r) {
 		hw_res = resctrl_to_arch_res(r);
 
diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c
index 0d9605fccb34..4ca064e62911 100644
--- a/arch/x86/kernel/cpu/resctrl/monitor.c
+++ b/arch/x86/kernel/cpu/resctrl/monitor.c
@@ -148,8 +148,18 @@ static inline struct rmid_entry *__rmid_entry(u32 rmid)
 
 static int __rmid_read(u32 rmid, enum resctrl_event_id eventid, u64 *val)
 {
+	struct rdt_resource *r = &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl;
+	int cpu = get_cpu();
+	int rmid_offset = 0;
 	u64 msr_val;
 
+	/*
+	 * When SNC mode is on, need to compute the offset to read the
+	 * physical RMID counter for the node to which this CPU belongs
+	 */
+	if (snc_nodes_per_l3_cache > 1)
+		rmid_offset = (cpu_to_node(cpu) % snc_nodes_per_l3_cache) * r->num_rmid;
+
 	/*
 	 * As per the SDM, when IA32_QM_EVTSEL.EvtID (bits 7:0) is configured
 	 * with a valid event code for supported resource type and the bits
@@ -158,9 +168,11 @@ static int __rmid_read(u32 rmid, enum resctrl_event_id eventid, u64 *val)
 	 * IA32_QM_CTR.Error (bit 63) and IA32_QM_CTR.Unavailable (bit 62)
 	 * are error bits.
 	 */
-	wrmsr(MSR_IA32_QM_EVTSEL, eventid, rmid);
+	wrmsr(MSR_IA32_QM_EVTSEL, eventid, rmid + rmid_offset);
 	rdmsrl(MSR_IA32_QM_CTR, msr_val);
 
+	put_cpu();
+
 	if (msr_val & RMID_VAL_ERROR)
 		return -EIO;
 	if (msr_val & RMID_VAL_UNAVAIL)
@@ -783,8 +795,8 @@ int __init rdt_get_mon_l3_config(struct rdt_resource *r)
 	int ret;
 
 	resctrl_rmid_realloc_limit = boot_cpu_data.x86_cache_size * 1024;
-	hw_res->mon_scale = boot_cpu_data.x86_cache_occ_scale;
-	r->num_rmid = boot_cpu_data.x86_cache_max_rmid + 1;
+	hw_res->mon_scale = boot_cpu_data.x86_cache_occ_scale / snc_nodes_per_l3_cache;
+	r->num_rmid = (boot_cpu_data.x86_cache_max_rmid + 1) / snc_nodes_per_l3_cache;
 	hw_res->mbm_width = MBM_CNTR_WIDTH_BASE;
 
 	if (mbm_offset > 0 && mbm_offset <= MBM_CNTR_WIDTH_OFFSET_MAX)
diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
index 4a268df9b456..d831b21f7389 100644
--- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
+++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
@@ -1354,7 +1354,7 @@ unsigned int rdtgroup_cbm_to_size(struct rdt_resource *r,
 		}
 	}
 
-	return size;
+	return size / snc_nodes_per_l3_cache;
 }
 
 /**
@@ -2587,7 +2587,7 @@ static int rdt_parse_param(struct fs_context *fc, struct fs_parameter *param)
 		ctx->enable_cdpl2 = true;
 		return 0;
 	case Opt_mba_mbps:
-		if (!supports_mba_mbps())
+		if (!supports_mba_mbps() || snc_nodes_per_l3_cache > 1)
 			return -EINVAL;
 		ctx->enable_mba_mbps = true;
 		return 0;
-- 
2.40.1


^ permalink raw reply related	[flat|nested] 331+ messages in thread

* [PATCH v4 6/7] x86/resctrl: Update documentation with Sub-NUMA cluster changes
  2023-07-22 19:07 ` [PATCH v4 0/7] " Tony Luck
                     ` (4 preceding siblings ...)
  2023-07-22 19:07   ` [PATCH v4 5/7] x86/resctrl: Determine if Sub-NUMA Cluster is enabled and initialize Tony Luck
@ 2023-07-22 19:07   ` Tony Luck
  2023-08-11 17:33     ` Reinette Chatre
  2023-07-22 19:07   ` [PATCH v4 7/7] selftests/resctrl: Adjust effective L3 cache size when SNC enabled Tony Luck
                     ` (3 subsequent siblings)
  9 siblings, 1 reply; 331+ messages in thread
From: Tony Luck @ 2023-07-22 19:07 UTC (permalink / raw)
  To: Fenghua Yu, Reinette Chatre, Peter Newman, Jonathan Corbet,
	Shuah Khan, x86
  Cc: Shaopeng Tan, James Morse, Jamie Iles, Babu Moger, Randy Dunlap,
	linux-kernel, linux-doc, patches, Tony Luck

With Sub-NUMA Cluster mode enabled the scope of monitoring resources is
per-NODE instead of per-L3 cache. Suffixes of directories with "L3" in
their name refer to Sub-NUMA nodes instead of L3 cache ids.

Signed-off-by: Tony Luck <tony.luck@intel.com>
Reviewed-by: Peter Newman <peternewman@google.com>
---
 Documentation/arch/x86/resctrl.rst | 10 +++++++---
 1 file changed, 7 insertions(+), 3 deletions(-)

diff --git a/Documentation/arch/x86/resctrl.rst b/Documentation/arch/x86/resctrl.rst
index cb05d90111b4..4d9ddb91751d 100644
--- a/Documentation/arch/x86/resctrl.rst
+++ b/Documentation/arch/x86/resctrl.rst
@@ -345,9 +345,13 @@ When control is enabled all CTRL_MON groups will also contain:
 When monitoring is enabled all MON groups will also contain:
 
 "mon_data":
-	This contains a set of files organized by L3 domain and by
-	RDT event. E.g. on a system with two L3 domains there will
-	be subdirectories "mon_L3_00" and "mon_L3_01".	Each of these
+	This contains a set of files organized by L3 domain or by NUMA
+	node (depending on whether Sub-NUMA Cluster (SNC) mode is disabled
+	or enabled respectively) and by RDT event. E.g. on a system with
+	SNC mode disabled with two L3 domains there will be subdirectories
+	"mon_L3_00" and "mon_L3_01". The numerical suffix refers to the
+	L3 cache id.  With SNC enabled the directory names are the same,
+	but the numerical suffix refers to the node id.  Each of these
 	directories have one file per event (e.g. "llc_occupancy",
 	"mbm_total_bytes", and "mbm_local_bytes"). In a MON group these
 	files provide a read out of the current value of the event for
-- 
2.40.1


^ permalink raw reply related	[flat|nested] 331+ messages in thread

* [PATCH v4 7/7] selftests/resctrl: Adjust effective L3 cache size when SNC enabled
  2023-07-22 19:07 ` [PATCH v4 0/7] " Tony Luck
                     ` (5 preceding siblings ...)
  2023-07-22 19:07   ` [PATCH v4 6/7] x86/resctrl: Update documentation with Sub-NUMA cluster changes Tony Luck
@ 2023-07-22 19:07   ` Tony Luck
  2023-08-11 17:33     ` Reinette Chatre
  2023-07-26  3:10   ` [PATCH v4 0/7] Add support for Sub-NUMA cluster (SNC) systems Drew Fustini
                     ` (2 subsequent siblings)
  9 siblings, 1 reply; 331+ messages in thread
From: Tony Luck @ 2023-07-22 19:07 UTC (permalink / raw)
  To: Fenghua Yu, Reinette Chatre, Peter Newman, Jonathan Corbet,
	Shuah Khan, x86
  Cc: Shaopeng Tan, James Morse, Jamie Iles, Babu Moger, Randy Dunlap,
	linux-kernel, linux-doc, patches, Tony Luck

Sub-NUMA Cluster divides CPUs sharing an L3 cache into separate NUMA
nodes. Systems may support splitting into either two or four nodes.

When SNC mode is enabled the effective amount of L3 cache available
for allocation is divided by the number of nodes per L3.

Detect which SNC mode is active by comparing the number of CPUs
that share a cache with CPU0, with the number of CPUs on node0.

Reported-by: "Shaopeng Tan (Fujitsu)" <tan.shaopeng@fujitsu.com>
Closes: https://lore.kernel.org/r/TYAPR01MB6330B9B17686EF426D2C3F308B25A@TYAPR01MB6330.jpnprd01.prod.outlook.com
Signed-off-by: Tony Luck <tony.luck@intel.com>
---
 tools/testing/selftests/resctrl/resctrl.h   |  1 +
 tools/testing/selftests/resctrl/resctrlfs.c | 57 +++++++++++++++++++++
 2 files changed, 58 insertions(+)

diff --git a/tools/testing/selftests/resctrl/resctrl.h b/tools/testing/selftests/resctrl/resctrl.h
index 87e39456dee0..a8b43210b573 100644
--- a/tools/testing/selftests/resctrl/resctrl.h
+++ b/tools/testing/selftests/resctrl/resctrl.h
@@ -13,6 +13,7 @@
 #include <signal.h>
 #include <dirent.h>
 #include <stdbool.h>
+#include <ctype.h>
 #include <sys/stat.h>
 #include <sys/ioctl.h>
 #include <sys/mount.h>
diff --git a/tools/testing/selftests/resctrl/resctrlfs.c b/tools/testing/selftests/resctrl/resctrlfs.c
index fb00245dee92..79eecbf9f863 100644
--- a/tools/testing/selftests/resctrl/resctrlfs.c
+++ b/tools/testing/selftests/resctrl/resctrlfs.c
@@ -130,6 +130,61 @@ int get_resource_id(int cpu_no, int *resource_id)
 	return 0;
 }
 
+/*
+ * Count number of CPUs in a /sys bit map
+ */
+static int count_sys_bitmap_bits(char *name)
+{
+	FILE *fp = fopen(name, "r");
+	int count = 0, c;
+
+	if (!fp)
+		return 0;
+
+	while ((c = fgetc(fp)) != EOF) {
+		if (!isxdigit(c))
+			continue;
+		switch (c) {
+		case 'f':
+			count++;
+		case '7': case 'b': case 'd': case 'e':
+			count++;
+		case '3': case '5': case '6': case '9': case 'a': case 'c':
+			count++;
+		case '1': case '2': case '4': case '8':
+			count++;
+		}
+	}
+	fclose(fp);
+
+	return count;
+}
+
+/*
+ * Detect SNC by comparing #CPUs in node0 with #CPUs sharing LLC with CPU0
+ * Try to get this right, even if a few CPUs are offline so that the number
+ * of CPUs in node0 is not exactly half or a quarter of the CPUs sharing the
+ * LLC of CPU0.
+ */
+static int snc_ways(void)
+{
+	int node_cpus, cache_cpus;
+
+	node_cpus = count_sys_bitmap_bits("/sys/devices/system/node/node0/cpumap");
+	cache_cpus = count_sys_bitmap_bits("/sys/devices/system/cpu/cpu0/cache/index3/shared_cpu_map");
+
+	if (!node_cpus || !cache_cpus) {
+		fprintf(stderr, "Warning could not determine Sub-NUMA Cluster mode\n");
+		return 1;
+	}
+
+	if (4 * node_cpus >= cache_cpus)
+		return 4;
+	else if (2 * node_cpus >= cache_cpus)
+		return 2;
+	return 1;
+}
+
 /*
  * get_cache_size - Get cache size for a specified CPU
  * @cpu_no:	CPU number
@@ -190,6 +245,8 @@ int get_cache_size(int cpu_no, char *cache_type, unsigned long *cache_size)
 			break;
 	}
 
+	if (cache_num == 3)
+		*cache_size /= snc_ways();
 	return 0;
 }
 
-- 
2.40.1


^ permalink raw reply related	[flat|nested] 331+ messages in thread

* Re: [PATCH v3 3/8] x86/resctrl: Add a new node-scoped resource to rdt_resources_all[]
  2023-07-20 21:56             ` Luck, Tony
@ 2023-07-22 19:18               ` Tony Luck
  0 siblings, 0 replies; 331+ messages in thread
From: Tony Luck @ 2023-07-22 19:18 UTC (permalink / raw)
  To: Chatre, Reinette
  Cc: Yu, Fenghua, Peter Newman, Jonathan Corbet, Shuah Khan, x86,
	Shaopeng Tan, James Morse, Jamie Iles, Babu Moger, Randy Dunlap,
	linux-kernel, linux-doc, patches

On Thu, Jul 20, 2023 at 09:56:50PM +0000, Luck, Tony wrote:
> This does seem a worthy target.
> 
> I started on a patch to do this ... but I'm not sure I have the stamina or the time
> to see it through. 

I was being a wuss. I came back at this on Friday from a slightly
different perspective, and it all came together fairly easily.

New series posted here:
	https://lore.kernel.org/all/20230722190740.326190-1-tony.luck@intel.com/

-Tony

^ permalink raw reply	[flat|nested] 331+ messages in thread

* Re: [PATCH v3 0/8] x86/resctrl: Add support for Sub-NUMA cluster (SNC) systems
  2023-07-19  2:43 ` [PATCH v3 0/8] x86/resctrl: Add support for Sub-NUMA cluster (SNC) systems Shaopeng Tan (Fujitsu)
@ 2023-07-22 20:21   ` Tony Luck
  0 siblings, 0 replies; 331+ messages in thread
From: Tony Luck @ 2023-07-22 20:21 UTC (permalink / raw)
  To: Shaopeng Tan (Fujitsu)
  Cc: Fenghua Yu, Reinette Chatre, Peter Newman, Jonathan Corbet,
	Shuah Khan, x86, James Morse, Jamie Iles, Babu Moger,
	Randy Dunlap, linux-kernel, linux-doc, patches

On Wed, Jul 19, 2023 at 02:43:20AM +0000, Shaopeng Tan (Fujitsu) wrote:
> Hi tony,
> 
> I ran selftest/resctrl in my environment,
> the test result is "not ok".
> 
> Processor in my environment:
> Intel(R) Xeon(R) Gold 6254 CPU @ 3.10GHz
> 
> kernel:
> $ uname -r
> 6.5.0-rc1+
> 
> Result :
> Sub-NUMA enable:
> xxx@xxx:~/linux_v6.5_rc1l$ sudo make -C tools/testing/selftests/resctrl run_tests
> make: Entering directory '/.../tools/testing/selftests/resctrl'

I see most tests pass. Just one fail on my most recent run with the
v4 patch series:

# # Fail: Check MBA diff within 5% for schemata 10
# # avg_diff_per: 7%
# # avg_bw_imc: 883
# # avg_bw_resc: 815
# # Fail: Check schemata change using MBA

But it just missed the 5% target by a small amount,
not the near total failures that you see.

I wonder if there is a cross-SNC node memory
allocation issue.  Can you try running the test
bound to a CPU in one node:

$ taskset -c 1 sudo make -C tools/testing/selftests/resctrl run_tests

Try with different "-c" arguments to bind to different nodes. Do you
see different results on different nodes?

-Tony

^ permalink raw reply	[flat|nested] 331+ messages in thread

* Re: [PATCH v4 0/7] Add support for Sub-NUMA cluster (SNC) systems
  2023-07-22 19:07 ` [PATCH v4 0/7] " Tony Luck
                     ` (6 preceding siblings ...)
  2023-07-22 19:07   ` [PATCH v4 7/7] selftests/resctrl: Adjust effective L3 cache size when SNC enabled Tony Luck
@ 2023-07-26  3:10   ` Drew Fustini
  2023-07-26 14:10     ` Tony Luck
  2023-08-11 17:27   ` Reinette Chatre
  2023-08-29 23:44   ` [PATCH v5 0/8] " Tony Luck
  9 siblings, 1 reply; 331+ messages in thread
From: Drew Fustini @ 2023-07-26  3:10 UTC (permalink / raw)
  To: Tony Luck
  Cc: Fenghua Yu, Reinette Chatre, Peter Newman, Jonathan Corbet,
	Shuah Khan, x86, Shaopeng Tan, James Morse, Jamie Iles,
	Babu Moger, Randy Dunlap, linux-kernel, linux-doc, patches

On Sat, Jul 22, 2023 at 12:07:33PM -0700, Tony Luck wrote:
> The Sub-NUMA cluster feature on some Intel processors partitions
> the CPUs that share an L3 cache into two or more sets. This plays
> havoc with the Resource Director Technology (RDT) monitoring features.
> Prior to this patch Intel has advised that SNC and RDT are incompatible.
> 
> Some of these CPU support an MSR that can partition the RMID
> counters in the same way. This allows for monitoring features
> to be used (with the caveat that memory accesses between different
> SNC NUMA nodes may still not be counted accuratlely.
> 
> Signed-off-by: Tony Luck <tony.luck@intel.com>
> 
> ---
> 
> Changes since v3:
> 
> Reinette provided the most excellent suggestion that this series
> could better achieve its objective if it enabled separate domain
> lists for control & monitoring within a resource, rather than
> creating a whole new resource to support separate node scope needed
> for SNC monitoring. Thus all the pre-amble patches from the previous
> version have gone, replaced by patches 1-4 of this new series.

[This comment is unrelated to Sub-NUMA support so please disregard if
this is the wrong place to make these comments]

I think that the resctrl interface for RISC-V CBQRI could also benefit
from separate domain lists for control and monitoring.

For example, the bandwidth controller QoS register [1] interface allows
a device to implement both bandwidth usage monitoring and bandwidth
allocation. The resctrl proof-of-concept [2] had to awkwardly create two
domains for each memory controller in our example SoC, one that would
contain the MBA resource and one that would contain the L3 resource to
represent MBM files like local_bytes.

This resulted in a very odd looking schemata that would be hard for the
user to understand:

  # cat /sys/fs/resctrl/schemata
  MB:4=  80;6=  80;8=  80
  L2:0=0fff;1=0fff
  L3:2=ffff;3=0000;5=0000;7=0000

Where:

  Domain 0 is L2 cache controller 0 capacity allocation
  Domain 1 is L2 cache controller 1 capacity allocation
  Domain 2 is L3 cache controller capacity allocation

  Domain 4 is Memory controller 0 bandwidth allocation
  Domain 6 is Memory controller 1 bandwidth allocation
  Domain 8 is Memory controller 2 bandwidth allocation

  Domain 3 is Memory controller 0 bandwidth monitoring
  Domain 5 is Memory controller 1 bandwidth monitoring
  Domain 7 is Memory controller 2 bandwidth monitoring

But there is no value in having the domains created for the purposes of
bandwidth monitoring in schemata.

I've not yet fully understood how the new approach in this patch series
could help the situation for CBQRI, but I thought I would mention that
separate lists for control and monitoring might be useful.

Thanks,
Drew

[1] https://github.com/riscv-non-isa/riscv-cbqri/blob/main/qos_bandwidth.adoc
[2] https://lore.kernel.org/linux-riscv/20230419111111.477118-1-dfustini@baylibre.com/

^ permalink raw reply	[flat|nested] 331+ messages in thread

* Re: [PATCH v4 0/7] Add support for Sub-NUMA cluster (SNC) systems
  2023-07-26  3:10   ` [PATCH v4 0/7] Add support for Sub-NUMA cluster (SNC) systems Drew Fustini
@ 2023-07-26 14:10     ` Tony Luck
  0 siblings, 0 replies; 331+ messages in thread
From: Tony Luck @ 2023-07-26 14:10 UTC (permalink / raw)
  To: Drew Fustini
  Cc: Fenghua Yu, Reinette Chatre, Peter Newman, Jonathan Corbet,
	Shuah Khan, x86, Shaopeng Tan, James Morse, Jamie Iles,
	Babu Moger, Randy Dunlap, linux-kernel, linux-doc, patches

On Tue, Jul 25, 2023 at 08:10:52PM -0700, Drew Fustini wrote:
> I think that the resctrl interface for RISC-V CBQRI could also benefit
> from separate domain lists for control and monitoring.
> 
> For example, the bandwidth controller QoS register [1] interface allows
> a device to implement both bandwidth usage monitoring and bandwidth
> allocation. The resctrl proof-of-concept [2] had to awkwardly create two
> domains for each memory controller in our example SoC, one that would
> contain the MBA resource and one that would contain the L3 resource to
> represent MBM files like local_bytes.
> 
> This resulted in a very odd looking schemata that would be hard for the
> user to understand:
> 
>   # cat /sys/fs/resctrl/schemata
>   MB:4=  80;6=  80;8=  80
>   L2:0=0fff;1=0fff
>   L3:2=ffff;3=0000;5=0000;7=0000
> 
> Where:
> 
>   Domain 0 is L2 cache controller 0 capacity allocation
>   Domain 1 is L2 cache controller 1 capacity allocation
>   Domain 2 is L3 cache controller capacity allocation
> 
>   Domain 4 is Memory controller 0 bandwidth allocation
>   Domain 6 is Memory controller 1 bandwidth allocation
>   Domain 8 is Memory controller 2 bandwidth allocation
> 
>   Domain 3 is Memory controller 0 bandwidth monitoring
>   Domain 5 is Memory controller 1 bandwidth monitoring
>   Domain 7 is Memory controller 2 bandwidth monitoring
> 
> But there is no value in having the domains created for the purposes of
> bandwidth monitoring in schemata.

There's certainly no value in exposing those domain numbers
in the schemata file. There should also be some way for users
to decode the ids. On x86 the "id" is exposed in sysfs. Though
the user does need to work to get all the details:

$ cat /sys/devices/system/cpu/cpu36/cache/index3/level
3
$ cat /sys/devices/system/cpu/cpu36/cache/index3/id
1
$ cat /sys/devices/system/cpu/cpu36/cache/index3/shared_cpu_list
36-71,108-143

This shows the L3 cache with id "1" is shared by CPUs 36-71,108-143

X86 also has independent domain numbers for each resource. So the
L2 ones count 0, 1, 2, ... and so do the L3 ones: 0, 1, 2 and the
MBA ones: 0, 1, 2

That fits well with the /sys decoding ... but maybe your approach of
not repeating domain numbers across different resources is less
confusing?

Note that in my resctrl re-write where each resource is handled by
a separate loadable module it may be hard for you to keep the unique
domain scheme as resource modules are unaware of each other. Though
perhaps it's just an arch-specific hook to provide domain numbers.

> I've not yet fully understood how the new approach in this patch series
> could help the situation for CBQRI, but I thought I would mention that
> separate lists for control and monitoring might be useful.

Good. It's nice to know there's potentially another use case for
this split besides SNC.

-Tony

^ permalink raw reply	[flat|nested] 331+ messages in thread

* Re: [PATCH v4 0/7] Add support for Sub-NUMA cluster (SNC) systems
  2023-07-22 19:07 ` [PATCH v4 0/7] " Tony Luck
                     ` (7 preceding siblings ...)
  2023-07-26  3:10   ` [PATCH v4 0/7] Add support for Sub-NUMA cluster (SNC) systems Drew Fustini
@ 2023-08-11 17:27   ` Reinette Chatre
  2023-08-29 23:44   ` [PATCH v5 0/8] " Tony Luck
  9 siblings, 0 replies; 331+ messages in thread
From: Reinette Chatre @ 2023-08-11 17:27 UTC (permalink / raw)
  To: Tony Luck, Fenghua Yu, Peter Newman, Jonathan Corbet, Shuah Khan, x86
  Cc: Shaopeng Tan, James Morse, Jamie Iles, Babu Moger, Randy Dunlap,
	linux-kernel, linux-doc, patches

Hi Tony,

On 7/22/2023 12:07 PM, Tony Luck wrote:
> The Sub-NUMA cluster feature on some Intel processors partitions
> the CPUs that share an L3 cache into two or more sets. This plays
> havoc with the Resource Director Technology (RDT) monitoring features.
> Prior to this patch Intel has advised that SNC and RDT are incompatible.
> 
> Some of these CPU support an MSR that can partition the RMID
> counters in the same way. This allows for monitoring features
> to be used (with the caveat that memory accesses between different
> SNC NUMA nodes may still not be counted accuratlely.

accuratlely. -> accurately).

Is there any guidance on the scenarios under which memory accesses
may not be counted accurately, how users can detect when this
is the case, or any techniques users can use to avoid this?

Since this question has come up during this series I do think it
will help to document the impact of SNC on CAT.

Reinette

^ permalink raw reply	[flat|nested] 331+ messages in thread

* Re: [PATCH v4 1/7] x86/resctrl: Create separate domains for control and monitoring
  2023-07-22 19:07   ` [PATCH v4 1/7] x86/resctrl: Create separate domains for control and monitoring Tony Luck
@ 2023-08-11 17:29     ` Reinette Chatre
  2023-08-25 17:05       ` Tony Luck
  0 siblings, 1 reply; 331+ messages in thread
From: Reinette Chatre @ 2023-08-11 17:29 UTC (permalink / raw)
  To: Tony Luck, Fenghua Yu, Peter Newman, Jonathan Corbet, Shuah Khan, x86
  Cc: Shaopeng Tan, James Morse, Jamie Iles, Babu Moger, Randy Dunlap,
	linux-kernel, linux-doc, patches

Hi Tony,

On 7/22/2023 12:07 PM, Tony Luck wrote:
> First step towards supporting resource control where the scope of
> control operations is not the same as monitor operations.

Each changelog should stand on its own merit. This changelog
appears to be written as a continuation of the cover letter.

Please do ensure that each patch first establishes the context
before it describes the problem and solution. For example,
as a context this changelog can start by describing what the
resctrl domains list represents.

> 
> Add an extra list in the rdt_resource structure. For this will
> just duplicate the existing list of domains based on the L3 cache
> scope.

The above paragraph does not make this change appealing at all.

> Refactor the domain_add_cpu() and domain_remove() functions to

domain_remove() -> domain_remove_cpu()

> build separate lists for r->alloc_capable and r->mon_capable
> resources. Note that only the "L3" domain currently supports
> both types.

"L3" domain -> "L3" resource?

> 
> Change all places where monitoring functions walk the list of
> domains to use the new "mondomains" list instead of the old
> "domains" list.

I would not refer to it as "the old domains list" as it creates
impression that this is being replaced. The changelog makes
no mention that domains list will remain and be dedicated to
control domains. I think this is important to include in description
of this change.

> 
> Signed-off-by: Tony Luck <tony.luck@intel.com>
> ---
>  include/linux/resctrl.h                   |  10 +-
>  arch/x86/kernel/cpu/resctrl/internal.h    |   2 +-
>  arch/x86/kernel/cpu/resctrl/core.c        | 195 +++++++++++++++-------
>  arch/x86/kernel/cpu/resctrl/ctrlmondata.c |   2 +-
>  arch/x86/kernel/cpu/resctrl/monitor.c     |   2 +-
>  arch/x86/kernel/cpu/resctrl/rdtgroup.c    |  30 ++--
>  6 files changed, 167 insertions(+), 74 deletions(-)
> 
> diff --git a/include/linux/resctrl.h b/include/linux/resctrl.h
> index 8334eeacfec5..1267d56f9e76 100644
> --- a/include/linux/resctrl.h
> +++ b/include/linux/resctrl.h
> @@ -151,9 +151,11 @@ struct resctrl_schema;
>   * @mon_capable:	Is monitor feature available on this machine
>   * @num_rmid:		Number of RMIDs available
>   * @cache_level:	Which cache level defines scope of this resource
> + * @mon_scope:		Scope of this resource if different from cache_level

I think this addition should be deferred. As it is here, the "if different
from cache_level" also creates many questions (when will it be different?
how will it be determined that the scope is different in order to know that
mon_scope should be used?)

Looking ahead on how mon_scope is used there does not seem to be an "if"
involved at all ... mon_scope is always the monitoring scope. 

>   * @cache:		Cache allocation related data
>   * @membw:		If the component has bandwidth controls, their properties.
>   * @domains:		All domains for this resource

A change to the domains comment would also help - to highlight that it is
now dedicated to control domains.

> + * @mondomains:		Monitor domains for this resource
>   * @name:		Name to use in "schemata" file.
>   * @data_width:		Character width of data when displaying
>   * @default_ctrl:	Specifies default cache cbm or memory B/W percent.
> @@ -169,9 +171,11 @@ struct rdt_resource {
>  	bool			mon_capable;
>  	int			num_rmid;
>  	int			cache_level;
> +	int			mon_scope;
>  	struct resctrl_cache	cache;
>  	struct resctrl_membw	membw;
>  	struct list_head	domains;
> +	struct list_head	mondomains;
>  	char			*name;
>  	int			data_width;
>  	u32			default_ctrl;

...

> @@ -384,14 +386,15 @@ void rdt_ctrl_update(void *arg)
>  }
>  
>  /*
> - * rdt_find_domain - Find a domain in a resource that matches input resource id
> + * rdt_find_domain - Find a domain in one of the lists for a resource that
> + * matches input resource id
>   *

This change makes the function more vague. I think the original summary is
still accurate; how the list is used can be described in the details below.
I see more changes to this function are upcoming and I will comment more 
at those sites.

>   * Search resource r's domain list to find the resource id. If the resource
>   * id is found in a domain, return the domain. Otherwise, if requested by
>   * caller, return the first domain whose id is bigger than the input id.
>   * The domain list is sorted by id in ascending order.
>   */
> -struct rdt_domain *rdt_find_domain(struct rdt_resource *r, int id,
> +struct rdt_domain *rdt_find_domain(struct list_head *h, int id,
>  				   struct list_head **pos)
>  {
>  	struct rdt_domain *d;
> @@ -400,7 +403,7 @@ struct rdt_domain *rdt_find_domain(struct rdt_resource *r, int id,
>  	if (id < 0)
>  		return ERR_PTR(-ENODEV);
>  
> -	list_for_each(l, &r->domains) {
> +	list_for_each(l, h) {
>  		d = list_entry(l, struct rdt_domain, list);
>  		/* When id is found, return its domain. */
>  		if (id == d->id)
> @@ -487,6 +490,94 @@ static int arch_domain_mbm_alloc(u32 num_rmid, struct rdt_hw_domain *hw_dom)
>  	return 0;
>  }
>  
> +static void domain_add_cpu_ctrl(int cpu, struct rdt_resource *r)
> +{
> +	int id = get_cpu_cacheinfo_id(cpu, r->cache_level);
> +	struct list_head *add_pos = NULL;
> +	struct rdt_hw_domain *hw_dom;
> +	struct rdt_domain *d;
> +	int err;
> +
> +	d = rdt_find_domain(&r->domains, id, &add_pos);
> +	if (IS_ERR(d)) {
> +		pr_warn("Couldn't find cache id for CPU %d\n", cpu);
> +		return;
> +	}
> +
> +	if (d) {
> +		cpumask_set_cpu(cpu, &d->cpu_mask);
> +		if (r->cache.arch_has_per_cpu_cfg)
> +			rdt_domain_reconfigure_cdp(r);
> +		return;
> +	}
> +
> +	hw_dom = kzalloc_node(sizeof(*hw_dom), GFP_KERNEL, cpu_to_node(cpu));
> +	if (!hw_dom)
> +		return;
> +
> +	d = &hw_dom->d_resctrl;
> +	d->id = id;
> +	cpumask_set_cpu(cpu, &d->cpu_mask);
> +
> +	rdt_domain_reconfigure_cdp(r);
> +
> +	if (domain_setup_ctrlval(r, d)) {
> +		domain_free(hw_dom);
> +		return;
> +	}
> +
> +	list_add_tail(&d->list, add_pos);
> +
> +	err = resctrl_online_ctrl_domain(r, d);
> +	if (err) {
> +		list_del(&d->list);
> +		domain_free(hw_dom);
> +	}
> +}
> +
> +static void domain_add_cpu_mon(int cpu, struct rdt_resource *r)
> +{
> +	int id = get_cpu_cacheinfo_id(cpu, r->mon_scope);

Using a different scope variable but continuing to treat it
as a cache level creates unnecessary confusion at this point.

> +	struct list_head *add_pos = NULL;
> +	struct rdt_hw_domain *hw_dom;
> +	struct rdt_domain *d;
> +	int err;
> +
> +	d = rdt_find_domain(&r->mondomains, id, &add_pos);
> +	if (IS_ERR(d)) {
> +		pr_warn("Couldn't find cache id for CPU %d\n", cpu);

Note for future change ... this continues to refer to monitor scope as
a cache id. I did not see this changed in the later patch that actually
changes how scope is used.

> +		return;
> +	}
> +
> +	if (d) {
> +		cpumask_set_cpu(cpu, &d->cpu_mask);
> +		if (r->cache.arch_has_per_cpu_cfg)
> +			rdt_domain_reconfigure_cdp(r);

Copy & paste error?

> +		return;
> +	}
> +
> +	hw_dom = kzalloc_node(sizeof(*hw_dom), GFP_KERNEL, cpu_to_node(cpu));
> +	if (!hw_dom)
> +		return;
> +
> +	d = &hw_dom->d_resctrl;
> +	d->id = id;
> +	cpumask_set_cpu(cpu, &d->cpu_mask);
> +
> +	if (arch_domain_mbm_alloc(r->num_rmid, hw_dom)) {
> +		domain_free(hw_dom);
> +		return;
> +	}
> +
> +	list_add_tail(&d->list, add_pos);
> +
> +	err = resctrl_online_mon_domain(r, d);
> +	if (err) {
> +		list_del(&d->list);
> +		domain_free(hw_dom);
> +	}
> +}
> +
>  /*
>   * domain_add_cpu - Add a cpu to a resource's domain list.
>   *
> @@ -502,61 +593,19 @@ static int arch_domain_mbm_alloc(u32 num_rmid, struct rdt_hw_domain *hw_dom)
>   */
>  static void domain_add_cpu(int cpu, struct rdt_resource *r)
>  {
> -	int id = get_cpu_cacheinfo_id(cpu, r->cache_level);
> -	struct list_head *add_pos = NULL;
> -	struct rdt_hw_domain *hw_dom;
> -	struct rdt_domain *d;
> -	int err;
> -
> -	d = rdt_find_domain(r, id, &add_pos);
> -	if (IS_ERR(d)) {
> -		pr_warn("Couldn't find cache id for CPU %d\n", cpu);
> -		return;
> -	}
> -
> -	if (d) {
> -		cpumask_set_cpu(cpu, &d->cpu_mask);
> -		if (r->cache.arch_has_per_cpu_cfg)
> -			rdt_domain_reconfigure_cdp(r);
> -		return;
> -	}
> -
> -	hw_dom = kzalloc_node(sizeof(*hw_dom), GFP_KERNEL, cpu_to_node(cpu));
> -	if (!hw_dom)
> -		return;
> -
> -	d = &hw_dom->d_resctrl;
> -	d->id = id;
> -	cpumask_set_cpu(cpu, &d->cpu_mask);
> -
> -	rdt_domain_reconfigure_cdp(r);
> -
> -	if (r->alloc_capable && domain_setup_ctrlval(r, d)) {
> -		domain_free(hw_dom);
> -		return;
> -	}
> -
> -	if (r->mon_capable && arch_domain_mbm_alloc(r->num_rmid, hw_dom)) {
> -		domain_free(hw_dom);
> -		return;
> -	}
> -
> -	list_add_tail(&d->list, add_pos);
> -
> -	err = resctrl_online_domain(r, d);
> -	if (err) {
> -		list_del(&d->list);
> -		domain_free(hw_dom);
> -	}
> +	if (r->alloc_capable)
> +		domain_add_cpu_ctrl(cpu, r);
> +	if (r->mon_capable)
> +		domain_add_cpu_mon(cpu, r);
>  }

A resource could be both alloc and mon capable ... both
domain_add_cpu_ctrl() and domain_add_cpu_mon() can fail.
Should domain_add_cpu_mon() still be run for a CPU if
domain_add_cpu_ctrl() failed? 

Looking ahead the CPU should probably also not be added
to the default groups mask if a failure occurred.
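
Something like this is what I have in mind (untested sketch just to show the
shape; it assumes domain_add_cpu*() are changed to return an error):

	static int domain_add_cpu(int cpu, struct rdt_resource *r)
	{
		int err = 0;

		if (r->alloc_capable)
			err = domain_add_cpu_ctrl(cpu, r);
		if (!err && r->mon_capable)
			err = domain_add_cpu_mon(cpu, r);

		return err;
	}

with resctrl_online_cpu() checking the return value before setting the CPU
in the default group's cpu_mask.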

> -static void domain_remove_cpu(int cpu, struct rdt_resource *r)
> +static void domain_remove_cpu_ctrl(int cpu, struct rdt_resource *r)
>  {
>  	int id = get_cpu_cacheinfo_id(cpu, r->cache_level);
>  	struct rdt_hw_domain *hw_dom;
>  	struct rdt_domain *d;
>  
> -	d = rdt_find_domain(r, id, NULL);
> +	d = rdt_find_domain(&r->domains, id, NULL);
>  	if (IS_ERR_OR_NULL(d)) {
>  		pr_warn("Couldn't find cache id for CPU %d\n", cpu);
>  		return;
> @@ -565,7 +614,7 @@ static void domain_remove_cpu(int cpu, struct rdt_resource *r)
>  
>  	cpumask_clear_cpu(cpu, &d->cpu_mask);
>  	if (cpumask_empty(&d->cpu_mask)) {
> -		resctrl_offline_domain(r, d);
> +		resctrl_offline_ctrl_domain(r, d);
>  		list_del(&d->list);
>  
>  		/*
> @@ -578,6 +627,30 @@ static void domain_remove_cpu(int cpu, struct rdt_resource *r)
>  
>  		return;
>  	}
> +}
> +
> +static void domain_remove_cpu_mon(int cpu, struct rdt_resource *r)
> +{
> +	int id = get_cpu_cacheinfo_id(cpu, r->cache_level);

Introducing mon_scope can really be deferred ... here the monitoring code
is not using mon_scope anyway.

> +	struct rdt_hw_domain *hw_dom;
> +	struct rdt_domain *d;
> +
> +	d = rdt_find_domain(&r->mondomains, id, NULL);
> +	if (IS_ERR_OR_NULL(d)) {
> +		pr_warn("Couldn't find cache id for CPU %d\n", cpu);
> +		return;
> +	}
> +	hw_dom = resctrl_to_arch_dom(d);
> +
> +	cpumask_clear_cpu(cpu, &d->cpu_mask);
> +	if (cpumask_empty(&d->cpu_mask)) {
> +		resctrl_offline_mon_domain(r, d);
> +		list_del(&d->list);
> +
> +		domain_free(hw_dom);
> +
> +		return;
> +	}
>  
>  	if (r == &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl) {
>  		if (is_mbm_enabled() && cpu == d->mbm_work_cpu) {

Reinette

^ permalink raw reply	[flat|nested] 331+ messages in thread

* Re: [PATCH v4 3/7] x86/resctrl: Change monitor code to use rdt_mondomain
  2023-07-22 19:07   ` [PATCH v4 3/7] x86/resctrl: Change monitor code to use rdt_mondomain Tony Luck
@ 2023-08-11 17:30     ` Reinette Chatre
  2023-08-25 17:13       ` Tony Luck
  0 siblings, 1 reply; 331+ messages in thread
From: Reinette Chatre @ 2023-08-11 17:30 UTC (permalink / raw)
  To: Tony Luck, Fenghua Yu, Peter Newman, Jonathan Corbet, Shuah Khan, x86
  Cc: Shaopeng Tan, James Morse, Jamie Iles, Babu Moger, Randy Dunlap,
	linux-kernel, linux-doc, patches

Hi Tony,

On 7/22/2023 12:07 PM, Tony Luck wrote:
> A few functions need to be duplicated to provide versions to
> operate on control and monitor domains respectively. But most
> of the changes are just fixing argument and return value types.

Could you please add some context in support of this change?

I do not think "duplicated" is appropriate though. Functions
are not duplicated but instead made to be dedicated to
either control or monitoring domains, no?

...

> diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
> index 274605aaa026..0161362b0c3e 100644
> --- a/arch/x86/kernel/cpu/resctrl/core.c
> +++ b/arch/x86/kernel/cpu/resctrl/core.c
> @@ -393,9 +393,12 @@ void rdt_ctrl_update(void *arg)
>   * id is found in a domain, return the domain. Otherwise, if requested by
>   * caller, return the first domain whose id is bigger than the input id.
>   * The domain list is sorted by id in ascending order.
> + *
> + * N.B. Returned value may be either a pointer to "struct rdt_domain" or
> + * to "struct rdt_mondomain" depending on which domain list is scanned.
>   */
> -struct rdt_domain *rdt_find_domain(struct list_head *h, int id,
> -				   struct list_head **pos)
> +void *rdt_find_domain(struct list_head *h, int id,
> +		      struct list_head **pos)
>  {
>  	struct rdt_domain *d;
>  	struct list_head *l;

I do not think that void pointers should be passed around. How about two
new functions dedicated to the different domain types with the void pointer
handling contained in a static function? For example,

static void *__rdt_find_domain(struct list_head *h, int id, struct list_head **pos)

struct rdt_mondomain *rdt_find_mondomain(struct rdt_resource *r, int id, struct list_head **pos)
struct rdt_domain *rdt_find_ctrldomain(struct rdt_resource *r, int id, struct list_head **pos)

rdt_find_mondomain() and rdt_find_ctrldomain() would be what callers use
while they can be wrappers of __rdt_find_domain().
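
A rough sketch of what I mean (untested, using the list/struct names from
this series):

	static void *__rdt_find_domain(struct list_head *h, int id,
				       struct list_head **pos)
	{
		/* ... existing list walk from rdt_find_domain(), unchanged ... */
	}

	struct rdt_domain *rdt_find_ctrldomain(struct rdt_resource *r, int id,
					       struct list_head **pos)
	{
		return __rdt_find_domain(&r->domains, id, pos);
	}

	struct rdt_mondomain *rdt_find_mondomain(struct rdt_resource *r, int id,
						 struct list_head **pos)
	{
		return __rdt_find_domain(&r->mondomains, id, pos);
	}

That way the void pointer never escapes the static helper and callers always
get the correct type.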


> @@ -434,10 +437,15 @@ static void setup_default_ctrlval(struct rdt_resource *r, u32 *dc)
>  }
>  
>  static void domain_free(struct rdt_hw_domain *hw_dom)
> +{
> +	kfree(hw_dom->ctrl_val);
> +	kfree(hw_dom);
> +}
> +
> +static void mondomain_free(struct rdt_hw_mondomain *hw_dom)
>  {
>  	kfree(hw_dom->arch_mbm_total);
>  	kfree(hw_dom->arch_mbm_local);
> -	kfree(hw_dom->ctrl_val);
>  	kfree(hw_dom);
>  }
>  
> @@ -467,7 +475,7 @@ static int domain_setup_ctrlval(struct rdt_resource *r, struct rdt_domain *d)
>   * @num_rmid:	The size of the MBM counter array
>   * @hw_dom:	The domain that owns the allocated arrays
>   */
> -static int arch_domain_mbm_alloc(u32 num_rmid, struct rdt_hw_domain *hw_dom)
> +static int arch_domain_mbm_alloc(u32 num_rmid, struct rdt_hw_mondomain *hw_dom)
>  {
>  	size_t tsize;
>  
> @@ -539,8 +547,8 @@ static void domain_add_cpu_mon(int cpu, struct rdt_resource *r)
>  {
>  	int id = get_cpu_cacheinfo_id(cpu, r->mon_scope);
>  	struct list_head *add_pos = NULL;
> -	struct rdt_hw_domain *hw_dom;
> -	struct rdt_domain *d;
> +	struct rdt_hw_mondomain *hw_mondom;
> +	struct rdt_mondomain *d;
>  	int err;
>  

Please ensure that reverse fir tree order is maintained in all these changes.

Reinette

^ permalink raw reply	[flat|nested] 331+ messages in thread

* Re: [PATCH v4 4/7] x86/resctrl: Delete unused fields from struct rdt_domain
  2023-07-22 19:07   ` [PATCH v4 4/7] x86/resctrl: Delete unused fields from struct rdt_domain Tony Luck
@ 2023-08-11 17:30     ` Reinette Chatre
  2023-08-25 17:13       ` Tony Luck
  0 siblings, 1 reply; 331+ messages in thread
From: Reinette Chatre @ 2023-08-11 17:30 UTC (permalink / raw)
  To: Tony Luck, Fenghua Yu, Peter Newman, Jonathan Corbet, Shuah Khan, x86
  Cc: Shaopeng Tan, James Morse, Jamie Iles, Babu Moger, Randy Dunlap,
	linux-kernel, linux-doc, patches

Hi Tony,

On 7/22/2023 12:07 PM, Tony Luck wrote:
> Now that all the monitoring functions use struct rdt_mondomain the
> monitor fields can be dropped from the structure used for control
> operations.

Please provide some context for this patch so that it
can stand by itself when merged.

Note that two structures are changed in the patch.

Reinette

^ permalink raw reply	[flat|nested] 331+ messages in thread

* Re: [PATCH v4 5/7] x86/resctrl: Determine if Sub-NUMA Cluster is enabled and initialize.
  2023-07-22 19:07   ` [PATCH v4 5/7] x86/resctrl: Determine if Sub-NUMA Cluster is enabled and initialize Tony Luck
@ 2023-08-11 17:32     ` Reinette Chatre
  2023-08-25 17:49       ` Tony Luck
  0 siblings, 1 reply; 331+ messages in thread
From: Reinette Chatre @ 2023-08-11 17:32 UTC (permalink / raw)
  To: Tony Luck, Fenghua Yu, Peter Newman, Jonathan Corbet, Shuah Khan, x86
  Cc: Shaopeng Tan, James Morse, Jamie Iles, Babu Moger, Randy Dunlap,
	linux-kernel, linux-doc, patches

Hi Tony,

On 7/22/2023 12:07 PM, Tony Luck wrote:
> There isn't a simple hardware enumeration to indicate to software that
> a system is running with Sub-NUMA Cluster enabled.
> 
> Compare the number of NUMA nodes with the number of L3 caches to calculate
> the number of Sub-NUMA nodes per L3 cache.
> 
> When Sub-NUMA cluster mode is enabled in BIOS setup the RMID counters
> are distributed equally between the SNC nodes within each socket.
> 
> E.g. if there are 400 RMID counters, and the system is configured with
> two SNC nodes per socket, then RMID counter 0..199 are used on SNC node
> 0 on the socket, and RMID counter 200..399 on SNC node 1.
> 
> A model specific MSR (0xca0) can change the configuration of the RMIDs
> when SNC mode is enabled.
> 
> The MSR controls the interpretation of the RMID field in the
> IA32_PQR_ASSOC MSR so that the appropriate hardware counters
> within the SNC node are updated.
> 
> To read the RMID counters an offset must be used to get data
> from the physical counter associated with the SNC node. As in
> the example above with 400 RMID counters Linux sees only 200
> counters. No special action is needed to read a counter from
> the first SNC node on a socket. But to read a Linux visible
> counter 50 on the second SNC node the kernel must load 250
> into the QM_EVTSEL MSR.
> 
> N.B. this works well for well-behaved NUMA applications that access
> memory predominantly from the local memory node. For applications that
> access memory across multiple nodes it may be necessary for the user
> to read counters for all SNC nodes on a socket and add the values to
> get the actual LLC occupancy or memory bandwidth. Perhaps this isn't
> all that different from applications that span across multiple sockets
> in a legacy system.
> 
> The cache allocation feature still provides the same number of
> bits in a mask to control allocation into the L3 cache. But each
> of those ways has its capacity reduced because the cache is divided
> between the SNC nodes. Adjust the value reported in the resctrl
> "size" file accordingly.
> 
> Mounting the file system with the "mba_MBps" option is disabled
> when SNC mode is enabled. This is because the measurement of bandwidth
> is per SNC node, while the MBA throttling controls are still at
> the L3 cache scope.
> 

I'm counting four logical changes in this changelog. Can they be separate
patches?

> Signed-off-by: Tony Luck <tony.luck@intel.com>
> ---
>  include/linux/resctrl.h                |  2 +
>  arch/x86/include/asm/msr-index.h       |  1 +
>  arch/x86/kernel/cpu/resctrl/internal.h |  2 +
>  arch/x86/kernel/cpu/resctrl/core.c     | 82 +++++++++++++++++++++++++-
>  arch/x86/kernel/cpu/resctrl/monitor.c  | 18 +++++-
>  arch/x86/kernel/cpu/resctrl/rdtgroup.c |  4 +-
>  6 files changed, 103 insertions(+), 6 deletions(-)
> 
> diff --git a/include/linux/resctrl.h b/include/linux/resctrl.h
> index 80a89d171eba..576dc21bd990 100644
> --- a/include/linux/resctrl.h
> +++ b/include/linux/resctrl.h
> @@ -200,6 +200,8 @@ struct rdt_resource {
>  	bool			cdp_capable;
>  };
>  
> +#define MON_SCOPE_NODE 100
> +

Could you please add a comment to explain what this constant represents
and how it is used?

>  /**
>   * struct resctrl_schema - configuration abilities of a resource presented to
>   *			   user-space
> diff --git a/arch/x86/include/asm/msr-index.h b/arch/x86/include/asm/msr-index.h
> index 3aedae61af4f..4b624a37d64a 100644
> --- a/arch/x86/include/asm/msr-index.h
> +++ b/arch/x86/include/asm/msr-index.h
> @@ -1087,6 +1087,7 @@
>  #define MSR_IA32_QM_CTR			0xc8e
>  #define MSR_IA32_PQR_ASSOC		0xc8f
>  #define MSR_IA32_L3_CBM_BASE		0xc90
> +#define MSR_RMID_SNC_CONFIG		0xca0
>  #define MSR_IA32_L2_CBM_BASE		0xd10
>  #define MSR_IA32_MBA_THRTL_BASE		0xd50
>  
> diff --git a/arch/x86/kernel/cpu/resctrl/internal.h b/arch/x86/kernel/cpu/resctrl/internal.h
> index 016ef0373c5a..00a330bc5ced 100644
> --- a/arch/x86/kernel/cpu/resctrl/internal.h
> +++ b/arch/x86/kernel/cpu/resctrl/internal.h
> @@ -446,6 +446,8 @@ DECLARE_STATIC_KEY_FALSE(rdt_alloc_enable_key);
>  
>  extern struct dentry *debugfs_resctrl;
>  
> +extern int snc_nodes_per_l3_cache;
> +
>  enum resctrl_res_level {
>  	RDT_RESOURCE_L3,
>  	RDT_RESOURCE_L2,
> diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
> index 0161362b0c3e..1331add347fc 100644
> --- a/arch/x86/kernel/cpu/resctrl/core.c
> +++ b/arch/x86/kernel/cpu/resctrl/core.c
> @@ -16,11 +16,14 @@
>  
>  #define pr_fmt(fmt)	"resctrl: " fmt
>  
> +#include <linux/cpu.h>
>  #include <linux/slab.h>
>  #include <linux/err.h>
>  #include <linux/cacheinfo.h>
>  #include <linux/cpuhotplug.h>
> +#include <linux/mod_devicetable.h>
>  

What does this include provide?

> +#include <asm/cpu_device_id.h>
>  #include <asm/intel-family.h>
>  #include <asm/resctrl.h>
>  #include "internal.h"
> @@ -48,6 +51,13 @@ int max_name_width, max_data_width;
>   */
>  bool rdt_alloc_capable;
>  
> +/*
> + * Number of SNC nodes that share each L3 cache.
> + * Default is 1 for systems that do not support
> + * SNC, or have SNC disabled.
> + */
> +int snc_nodes_per_l3_cache = 1;
> +
>  static void
>  mba_wrmsr_intel(struct rdt_domain *d, struct msr_param *m,
>  		struct rdt_resource *r);
> @@ -543,9 +553,16 @@ static void domain_add_cpu_ctrl(int cpu, struct rdt_resource *r)
>  	}
>  }
>  
> +static int get_mon_scope_id(int cpu, int scope)
> +{
> +	if (scope == MON_SCOPE_NODE)
> +		return cpu_to_node(cpu);
> +	return get_cpu_cacheinfo_id(cpu, scope);
> +}
> +
>  static void domain_add_cpu_mon(int cpu, struct rdt_resource *r)
>  {
> -	int id = get_cpu_cacheinfo_id(cpu, r->mon_scope);
> +	int id = get_mon_scope_id(cpu, r->mon_scope);
>  	struct list_head *add_pos = NULL;
>  	struct rdt_hw_mondomain *hw_mondom;
>  	struct rdt_mondomain *d;
> @@ -692,11 +709,28 @@ static void clear_closid_rmid(int cpu)
>  	wrmsr(MSR_IA32_PQR_ASSOC, 0, 0);
>  }
>  
> +static void snc_remap_rmids(int cpu)
> +{
> +	u64	val;

No need for tab here.

> +
> +	/* Only need to enable once per package */
> +	if (cpumask_first(topology_core_cpumask(cpu)) != cpu)
> +		return;
> +
> +	rdmsrl(MSR_RMID_SNC_CONFIG, val);
> +	val &= ~BIT_ULL(0);
> +	wrmsrl(MSR_RMID_SNC_CONFIG, val);
> +}

Could you please document snc_remap_rmids()
with information on what the bit in the register means
and what the above function does?
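
Perhaps something along these lines (wording taken from the changelog; I am
guessing at the exact meaning of bit 0 so please correct that part):

	/*
	 * MSR_RMID_SNC_CONFIG changes how the RMID field in IA32_PQR_ASSOC
	 * is interpreted when SNC is enabled. Clearing bit 0 selects the
	 * mode in which the RMID counters are divided between the SNC
	 * nodes sharing an L3 cache, so that the counters of the local
	 * SNC node are updated.
	 */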

> +
>  static int resctrl_online_cpu(unsigned int cpu)
>  {
>  	struct rdt_resource *r;
>  
>  	mutex_lock(&rdtgroup_mutex);
> +
> +	if (snc_nodes_per_l3_cache > 1)
> +		snc_remap_rmids(cpu);
> +
>  	for_each_capable_rdt_resource(r)
>  		domain_add_cpu(cpu, r);
>  	/* The cpu is set in default rdtgroup after online. */
> @@ -951,11 +985,57 @@ static __init bool get_rdt_resources(void)
>  	return (rdt_mon_capable || rdt_alloc_capable);
>  }
>  

I think it will help to add a comment like:
"CPUs that support the model specific MSR_RMID_SNC_CONFIG register."

> +static const struct x86_cpu_id snc_cpu_ids[] __initconst = {
> +	X86_MATCH_INTEL_FAM6_MODEL(ICELAKE_X, 0),
> +	X86_MATCH_INTEL_FAM6_MODEL(SAPPHIRERAPIDS_X, 0),
> +	X86_MATCH_INTEL_FAM6_MODEL(EMERALDRAPIDS_X, 0),
> +	{}
> +};
> +
> +/*
> + * There isn't a simple enumeration bit to show whether SNC mode
> + * is enabled. Look at the ratio of number of NUMA nodes to the
> + * number of distinct L3 caches. Take care to skip memory-only nodes.
> + */
> +static __init int get_snc_config(void)
> +{
> +	unsigned long *node_caches;
> +	int mem_only_nodes = 0;
> +	int cpu, node, ret;
> +
> +	if (!x86_match_cpu(snc_cpu_ids))
> +		return 1;
> +
> +	node_caches = kcalloc(BITS_TO_LONGS(nr_node_ids), sizeof(*node_caches), GFP_KERNEL);
> +	if (!node_caches)
> +		return 1;
> +
> +	cpus_read_lock();
> +	for_each_node(node) {
> +		cpu = cpumask_first(cpumask_of_node(node));
> +		if (cpu < nr_cpu_ids)
> +			set_bit(get_cpu_cacheinfo_id(cpu, 3), node_caches);
> +		else
> +			mem_only_nodes++;
> +	}
> +	cpus_read_unlock();

I am not familiar with the numa code at all so please correct me
where I am wrong. I do see that nr_node_ids is initialized with __init code
so it should be accurate at this point. It looks to me like this initialization
assumes that at least one CPU per node will be online at the time it is run.
It is not clear to me that this assumption would always be true. 

> +
> +	ret = (nr_node_ids - mem_only_nodes) / bitmap_weight(node_caches, nr_node_ids);
> +	kfree(node_caches);
> +
> +	if (ret > 1)
> +		rdt_resources_all[RDT_RESOURCE_L3].r_resctrl.mon_scope = MON_SCOPE_NODE;
> +
> +	return ret;
> +}
> +
>  static __init void rdt_init_res_defs_intel(void)
>  {
>  	struct rdt_hw_resource *hw_res;
>  	struct rdt_resource *r;
>  
> +	snc_nodes_per_l3_cache = get_snc_config();
> +
>  	for_each_rdt_resource(r) {
>  		hw_res = resctrl_to_arch_res(r);
>  
> diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c
> index 0d9605fccb34..4ca064e62911 100644
> --- a/arch/x86/kernel/cpu/resctrl/monitor.c
> +++ b/arch/x86/kernel/cpu/resctrl/monitor.c
> @@ -148,8 +148,18 @@ static inline struct rmid_entry *__rmid_entry(u32 rmid)
>  
>  static int __rmid_read(u32 rmid, enum resctrl_event_id eventid, u64 *val)
>  {
> +	struct rdt_resource *r = &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl;
> +	int cpu = get_cpu();

I do not think it is necessary to disable preemption here. Also please note that
James is working on changes to not have this code block. Could just smp_processor_id()
do?
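
That is, something like (untested, assuming __rmid_read() only runs on a CPU
in the target domain via the smp_call_*() helpers):

	int cpu = smp_processor_id();
	...
	if (snc_nodes_per_l3_cache > 1)
		rmid_offset = (cpu_to_node(cpu) % snc_nodes_per_l3_cache) * r->num_rmid;

with the get_cpu()/put_cpu() pair dropped.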

> +	int rmid_offset = 0;
>  	u64 msr_val;
>  
> +	/*
> +	 * When SNC mode is on, need to compute the offset to read the
> +	 * physical RMID counter for the node to which this CPU belongs
> +	 */
> +	if (snc_nodes_per_l3_cache > 1)
> +		rmid_offset = (cpu_to_node(cpu) % snc_nodes_per_l3_cache) * r->num_rmid;
> +
>  	/*
>  	 * As per the SDM, when IA32_QM_EVTSEL.EvtID (bits 7:0) is configured
>  	 * with a valid event code for supported resource type and the bits
> @@ -158,9 +168,11 @@ static int __rmid_read(u32 rmid, enum resctrl_event_id eventid, u64 *val)
>  	 * IA32_QM_CTR.Error (bit 63) and IA32_QM_CTR.Unavailable (bit 62)
>  	 * are error bits.
>  	 */
> -	wrmsr(MSR_IA32_QM_EVTSEL, eventid, rmid);
> +	wrmsr(MSR_IA32_QM_EVTSEL, eventid, rmid + rmid_offset);
>  	rdmsrl(MSR_IA32_QM_CTR, msr_val);
>  
> +	put_cpu();
> +
>  	if (msr_val & RMID_VAL_ERROR)
>  		return -EIO;
>  	if (msr_val & RMID_VAL_UNAVAIL)

Reinette

^ permalink raw reply	[flat|nested] 331+ messages in thread

* Re: [PATCH v4 6/7] x86/resctrl: Update documentation with Sub-NUMA cluster changes
  2023-07-22 19:07   ` [PATCH v4 6/7] x86/resctrl: Update documentation with Sub-NUMA cluster changes Tony Luck
@ 2023-08-11 17:33     ` Reinette Chatre
  2023-08-25 17:50       ` Tony Luck
  0 siblings, 1 reply; 331+ messages in thread
From: Reinette Chatre @ 2023-08-11 17:33 UTC (permalink / raw)
  To: Tony Luck, Fenghua Yu, Peter Newman, Jonathan Corbet, Shuah Khan, x86
  Cc: Shaopeng Tan, James Morse, Jamie Iles, Babu Moger, Randy Dunlap,
	linux-kernel, linux-doc, patches

Hi Tony,

On 7/22/2023 12:07 PM, Tony Luck wrote:
> With Sub-NUMA Cluster mode enabled the scope of monitoring resources is
> per-NODE instead of per-L3 cache. Suffixes of directories with "L3" in
> their name refer to Sub-NUMA nodes instead of L3 cache ids.
> 
> Signed-off-by: Tony Luck <tony.luck@intel.com>
> Reviewed-by: Peter Newman <peternewman@google.com>
> ---
>  Documentation/arch/x86/resctrl.rst | 10 +++++++---
>  1 file changed, 7 insertions(+), 3 deletions(-)
> 
> diff --git a/Documentation/arch/x86/resctrl.rst b/Documentation/arch/x86/resctrl.rst
> index cb05d90111b4..4d9ddb91751d 100644
> --- a/Documentation/arch/x86/resctrl.rst
> +++ b/Documentation/arch/x86/resctrl.rst
> @@ -345,9 +345,13 @@ When control is enabled all CTRL_MON groups will also contain:
>  When monitoring is enabled all MON groups will also contain:
>  
>  "mon_data":
> -	This contains a set of files organized by L3 domain and by
> -	RDT event. E.g. on a system with two L3 domains there will
> -	be subdirectories "mon_L3_00" and "mon_L3_01".	Each of these
> +	This contains a set of files organized by L3 domain or by NUMA
> +	node (depending on whether Sub-NUMA Cluster (SNC) mode is disabled
> +	or enabled respectively) and by RDT event. E.g. on a system with
> +	SNC mode disabled with two L3 domains there will be subdirectories
> +	"mon_L3_00" and "mon_L3_01". The numerical suffix refers to the
> +	L3 cache id.  With SNC enabled the directory names are the same,
> +	but the numerical suffix refers to the node id.  Each of these
>  	directories have one file per event (e.g. "llc_occupancy",
>  	"mbm_total_bytes", and "mbm_local_bytes"). In a MON group these
>  	files provide a read out of the current value of the event for

I think it would be helpful to add a modified version of the snippet
(from previous patch changelog) regarding well-behaved NUMA apps.
With the above it may be confusing that a single cache allocation has
multiple cache occupancy counters. 
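
Perhaps a reworked version of that paragraph, e.g. (rough wording only,
lifted from the changelog):

	"Counting is at the SNC node level, so a well-behaved NUMA
	application that accesses memory predominantly from its local
	node only needs the counters of that node. Applications that
	access memory across multiple nodes may need to read the
	counters for all SNC nodes on a socket and add the values to
	get the total LLC occupancy or memory bandwidth."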

This also changes the meaning of the numbers in the directory names.
The documentation already provides guidance on how to find the cache
ID of a logical CPU (see section "Cache IDs"). I think it will be 
helpful to add a snippet that makes it clear to users how to map
a CPU to its node ID.
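
For example (the CPU number is just for illustration):

	$ ls -d /sys/devices/system/cpu/cpu36/node*
	/sys/devices/system/cpu/cpu36/node1

i.e. CPU 36 belongs to node 1.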

Reinette

^ permalink raw reply	[flat|nested] 331+ messages in thread

* Re: [PATCH v4 7/7] selftests/resctrl: Adjust effective L3 cache size when SNC enabled
  2023-07-22 19:07   ` [PATCH v4 7/7] selftests/resctrl: Adjust effective L3 cache size when SNC enabled Tony Luck
@ 2023-08-11 17:33     ` Reinette Chatre
  2023-08-25 17:56       ` Tony Luck
  0 siblings, 1 reply; 331+ messages in thread
From: Reinette Chatre @ 2023-08-11 17:33 UTC (permalink / raw)
  To: Tony Luck, Fenghua Yu, Peter Newman, Jonathan Corbet, Shuah Khan, x86
  Cc: Shaopeng Tan, James Morse, Jamie Iles, Babu Moger, Randy Dunlap,
	linux-kernel, linux-doc, patches

Hi Tony,

On 7/22/2023 12:07 PM, Tony Luck wrote:
> Sub-NUMA Cluster divides CPUs sharing an L3 cache into separate NUMA
> nodes. Systems may support splitting into either two or four nodes.
> 
> When SNC mode is enabled the effective amount of L3 cache available
> for allocation is divided by the number of nodes per L3.
> 
> Detect which SNC mode is active by comparing the number of CPUs
> that share a cache with CPU0, with the number of CPUs on node0.
> 
> Reported-by: "Shaopeng Tan (Fujitsu)" <tan.shaopeng@fujitsu.com>
> Closes: https://lore.kernel.org/r/TYAPR01MB6330B9B17686EF426D2C3F308B25A@TYAPR01MB6330.jpnprd01.prod.outlook.com

This does not seem to be the case when looking at
https://lore.kernel.org/all/TYAPR01MB6330A4EB3633B791939EA45E8B39A@TYAPR01MB6330.jpnprd01.prod.outlook.com/

> Signed-off-by: Tony Luck <tony.luck@intel.com>
> ---
>  tools/testing/selftests/resctrl/resctrl.h   |  1 +
>  tools/testing/selftests/resctrl/resctrlfs.c | 57 +++++++++++++++++++++
>  2 files changed, 58 insertions(+)
> 
> diff --git a/tools/testing/selftests/resctrl/resctrl.h b/tools/testing/selftests/resctrl/resctrl.h
> index 87e39456dee0..a8b43210b573 100644
> --- a/tools/testing/selftests/resctrl/resctrl.h
> +++ b/tools/testing/selftests/resctrl/resctrl.h
> @@ -13,6 +13,7 @@
>  #include <signal.h>
>  #include <dirent.h>
>  #include <stdbool.h>
> +#include <ctype.h>
>  #include <sys/stat.h>
>  #include <sys/ioctl.h>
>  #include <sys/mount.h>
> diff --git a/tools/testing/selftests/resctrl/resctrlfs.c b/tools/testing/selftests/resctrl/resctrlfs.c
> index fb00245dee92..79eecbf9f863 100644
> --- a/tools/testing/selftests/resctrl/resctrlfs.c
> +++ b/tools/testing/selftests/resctrl/resctrlfs.c
> @@ -130,6 +130,61 @@ int get_resource_id(int cpu_no, int *resource_id)
>  	return 0;
>  }
>  
> +/*
> + * Count number of CPUs in a /sys bit map
> + */
> +static int count_sys_bitmap_bits(char *name)
> +{
> +	FILE *fp = fopen(name, "r");
> +	int count = 0, c;
> +
> +	if (!fp)
> +		return 0;
> +
> +	while ((c = fgetc(fp)) != EOF) {
> +		if (!isxdigit(c))
> +			continue;
> +		switch (c) {
> +		case 'f':
> +			count++;
> +		case '7': case 'b': case 'd': case 'e':
> +			count++;
> +		case '3': case '5': case '6': case '9': case 'a': case 'c':
> +			count++;
> +		case '1': case '2': case '4': case '8':
> +			count++;
> +		}
> +	}
> +	fclose(fp);
> +
> +	return count;
> +}
> +
> +/*
> + * Detect SNC by comparing #CPUs in node0 with #CPUs sharing LLC with CPU0
> + * Try to get this right, even if a few CPUs are offline so that the number
> + * of CPUs in node0 is not exactly half or a quarter of the CPUs sharing the
> + * LLC of CPU0.
> + */
> +static int snc_ways(void)
> +{
> +	int node_cpus, cache_cpus;
> +
> +	node_cpus = count_sys_bitmap_bits("/sys/devices/system/node/node0/cpumap");
> +	cache_cpus = count_sys_bitmap_bits("/sys/devices/system/cpu/cpu0/cache/index3/shared_cpu_map");
> +
> +	if (!node_cpus || !cache_cpus) {
> +		fprintf(stderr, "Warning could not determine Sub-NUMA Cluster mode\n");
> +		return 1;
> +	}
> +
> +	if (4 * node_cpus >= cache_cpus)
> +		return 4;
> +	else if (2 * node_cpus >= cache_cpus)
> +		return 2;
> +	return 1;
> +}
> +
>  /*
>   * get_cache_size - Get cache size for a specified CPU
>   * @cpu_no:	CPU number
> @@ -190,6 +245,8 @@ int get_cache_size(int cpu_no, char *cache_type, unsigned long *cache_size)
>  			break;
>  	}
>  
> +	if (cache_num == 3)
> +		*cache_size /= snc_ways();
>  	return 0;
>  }
>  

I am surprised that this small change is sufficient. The resctrl
selftests are definitely not NUMA aware and the CAT and CMT tests
are not taking that into account when picking CPUs to run on. From
what I understand LLC occupancy counters need to be added in this
scenario but I do not see that done either.

Reinette

^ permalink raw reply	[flat|nested] 331+ messages in thread

* Re: [PATCH v4 1/7] x86/resctrl: Create separate domains for control and monitoring
  2023-08-11 17:29     ` Reinette Chatre
@ 2023-08-25 17:05       ` Tony Luck
  2023-08-28 17:05         ` Reinette Chatre
  0 siblings, 1 reply; 331+ messages in thread
From: Tony Luck @ 2023-08-25 17:05 UTC (permalink / raw)
  To: Reinette Chatre
  Cc: Fenghua Yu, Peter Newman, Jonathan Corbet, Shuah Khan, x86,
	Shaopeng Tan, James Morse, Jamie Iles, Babu Moger, Randy Dunlap,
	linux-kernel, linux-doc, patches

On Fri, Aug 11, 2023 at 10:29:25AM -0700, Reinette Chatre wrote:
> Hi Tony,
> 
> On 7/22/2023 12:07 PM, Tony Luck wrote:
> > First step towards supporting resource control where the scope of
> > control operations is not the same as monitor operations.
> 
> Each changelog should stand on its own merit. This changelog
> appears to be written as a continuation of the cover letter.
> 
> Please do ensure that each patch first establishes the context
> before it describes the problem and solution. For example,
> as a context this changelog can start by describing what the
> resctrl domains list represents.
> 
> > 
> > Add an extra list in the rdt_resource structure. For this will
> > just duplicate the existing list of domains based on the L3 cache
> > scope.
> 
> The above paragraph does not make this change appealing at all.
> 
> > Refactor the domain_add_cpu() and domain_remove() functions to
> 
> domain_remove() -> domain_remove_cpu()
> 
> > build separate lists for r->alloc_capable and r->mon_capable
> > resources. Note that only the "L3" domain currently supports
> > both types.
> 
> "L3" domain -> "L3" resource?
> 
> > 
> > Change all places where monitoring functions walk the list of
> > domains to use the new "mondomains" list instead of the old
> > "domains" list.
> 
> I would not refer to it as "the old domains list" as it creates
> impression that this is being replaced. The changelog makes
> no mention that domains list will remain and be dedicated to
> control domains. I think this is important to include in description
> of this change.

I've rewritten the entire commit message incorporating your suggestions.
V6 will be posted soon (after I get some time on an SNC SPR to check
that it all works!)

> 
> > 
> > Signed-off-by: Tony Luck <tony.luck@intel.com>
> > ---
> >  include/linux/resctrl.h                   |  10 +-
> >  arch/x86/kernel/cpu/resctrl/internal.h    |   2 +-
> >  arch/x86/kernel/cpu/resctrl/core.c        | 195 +++++++++++++++-------
> >  arch/x86/kernel/cpu/resctrl/ctrlmondata.c |   2 +-
> >  arch/x86/kernel/cpu/resctrl/monitor.c     |   2 +-
> >  arch/x86/kernel/cpu/resctrl/rdtgroup.c    |  30 ++--
> >  6 files changed, 167 insertions(+), 74 deletions(-)
> > 
> > diff --git a/include/linux/resctrl.h b/include/linux/resctrl.h
> > index 8334eeacfec5..1267d56f9e76 100644
> > --- a/include/linux/resctrl.h
> > +++ b/include/linux/resctrl.h
> > @@ -151,9 +151,11 @@ struct resctrl_schema;
> >   * @mon_capable:	Is monitor feature available on this machine
> >   * @num_rmid:		Number of RMIDs available
> >   * @cache_level:	Which cache level defines scope of this resource
> > + * @mon_scope:		Scope of this resource if different from cache_level
> 
> I think this addition should be deferred. As it is here it the "if different
> from cache_level" also creates many questions (when will it be different?
> how will it be determined that the scope is different in order to know that
> mon_scope should be used?)

I've gone in a different direction. V6 renames "cache_level" to
"ctrl_scope". I think this makes intent clear from step #1.

> 
> Looking ahead on how mon_scope is used there does not seem to be an "if"
> involved at all ... mon_scope is always the monitoring scope. 

Dropped the "if different" comment. You are correct that this is
unconditionally the monitor scope.

> 
> >   * @cache:		Cache allocation related data
> >   * @membw:		If the component has bandwidth controls, their properties.
> >   * @domains:		All domains for this resource
> 
> A change to the domains comment would also help - to highlight that it is
> now dedicated to control domains.

Done.

> 
> > + * @mondomains:		Monitor domains for this resource
> >   * @name:		Name to use in "schemata" file.
> >   * @data_width:		Character width of data when displaying
> >   * @default_ctrl:	Specifies default cache cbm or memory B/W percent.
> > @@ -169,9 +171,11 @@ struct rdt_resource {
> >  	bool			mon_capable;
> >  	int			num_rmid;
> >  	int			cache_level;
> > +	int			mon_scope;
> >  	struct resctrl_cache	cache;
> >  	struct resctrl_membw	membw;
> >  	struct list_head	domains;
> > +	struct list_head	mondomains;
> >  	char			*name;
> >  	int			data_width;
> >  	u32			default_ctrl;
> 
> ...
> 
> > @@ -384,14 +386,15 @@ void rdt_ctrl_update(void *arg)
> >  }
> >  
> >  /*
> > - * rdt_find_domain - Find a domain in a resource that matches input resource id
> > + * rdt_find_domain - Find a domain in one of the lists for a resource that
> > + * matches input resource id
> >   *
> 
> This change makes the function more vague. I think original summary is
> still accurate, how the list is used can be describe in the details below.
> I see more changes to this function is upcoming and I will comment more 
> at those sites.

Re-worked this based on your other suggestions to have separate find*
functions for the ctrl and mon cases, both calling a common __rdt_find_domain().
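
Roughly this shape (a sketch only -- the v5 code may differ in detail, and
the rdt_mondomain return type only becomes possible once the later patch
introduces that struct):

static void *__rdt_find_domain(struct list_head *h, int id,
			       struct list_head **pos)
{
	struct rdt_domain *d;
	struct list_head *l;

	if (id < 0)
		return ERR_PTR(-ENODEV);

	list_for_each(l, h) {
		d = list_entry(l, struct rdt_domain, list);
		/* When id is found, return its domain. */
		if (id == d->id)
			return d;
		/* Stop at the insertion point in the sorted list. */
		if (id < d->id)
			break;
	}

	if (pos)
		*pos = l;

	return NULL;
}

struct rdt_domain *rdt_find_ctrldomain(struct rdt_resource *r, int id,
				       struct list_head **pos)
{
	return __rdt_find_domain(&r->domains, id, pos);
}

struct rdt_mondomain *rdt_find_mondomain(struct rdt_resource *r, int id,
					 struct list_head **pos)
{
	return __rdt_find_domain(&r->mondomains, id, pos);
}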

> 
> >   * Search resource r's domain list to find the resource id. If the resource
> >   * id is found in a domain, return the domain. Otherwise, if requested by
> >   * caller, return the first domain whose id is bigger than the input id.
> >   * The domain list is sorted by id in ascending order.
> >   */
> > -struct rdt_domain *rdt_find_domain(struct rdt_resource *r, int id,
> > +struct rdt_domain *rdt_find_domain(struct list_head *h, int id,
> >  				   struct list_head **pos)
> >  {
> >  	struct rdt_domain *d;
> > @@ -400,7 +403,7 @@ struct rdt_domain *rdt_find_domain(struct rdt_resource *r, int id,
> >  	if (id < 0)
> >  		return ERR_PTR(-ENODEV);
> >  
> > -	list_for_each(l, &r->domains) {
> > +	list_for_each(l, h) {
> >  		d = list_entry(l, struct rdt_domain, list);
> >  		/* When id is found, return its domain. */
> >  		if (id == d->id)
> > @@ -487,6 +490,94 @@ static int arch_domain_mbm_alloc(u32 num_rmid, struct rdt_hw_domain *hw_dom)
> >  	return 0;
> >  }
> >  
> > +static void domain_add_cpu_ctrl(int cpu, struct rdt_resource *r)
> > +{
> > +	int id = get_cpu_cacheinfo_id(cpu, r->cache_level);
> > +	struct list_head *add_pos = NULL;
> > +	struct rdt_hw_domain *hw_dom;
> > +	struct rdt_domain *d;
> > +	int err;
> > +
> > +	d = rdt_find_domain(&r->domains, id, &add_pos);
> > +	if (IS_ERR(d)) {
> > +		pr_warn("Couldn't find cache id for CPU %d\n", cpu);
> > +		return;
> > +	}
> > +
> > +	if (d) {
> > +		cpumask_set_cpu(cpu, &d->cpu_mask);
> > +		if (r->cache.arch_has_per_cpu_cfg)
> > +			rdt_domain_reconfigure_cdp(r);
> > +		return;
> > +	}
> > +
> > +	hw_dom = kzalloc_node(sizeof(*hw_dom), GFP_KERNEL, cpu_to_node(cpu));
> > +	if (!hw_dom)
> > +		return;
> > +
> > +	d = &hw_dom->d_resctrl;
> > +	d->id = id;
> > +	cpumask_set_cpu(cpu, &d->cpu_mask);
> > +
> > +	rdt_domain_reconfigure_cdp(r);
> > +
> > +	if (domain_setup_ctrlval(r, d)) {
> > +		domain_free(hw_dom);
> > +		return;
> > +	}
> > +
> > +	list_add_tail(&d->list, add_pos);
> > +
> > +	err = resctrl_online_ctrl_domain(r, d);
> > +	if (err) {
> > +		list_del(&d->list);
> > +		domain_free(hw_dom);
> > +	}
> > +}
> > +
> > +static void domain_add_cpu_mon(int cpu, struct rdt_resource *r)
> > +{
> > +	int id = get_cpu_cacheinfo_id(cpu, r->mon_scope);
> 
> Using a different scope variable but continuing to treat it
> as a cache level creates unnecessary confusion at this point.
> 
> > +	struct list_head *add_pos = NULL;
> > +	struct rdt_hw_domain *hw_dom;
> > +	struct rdt_domain *d;
> > +	int err;
> > +
> > +	d = rdt_find_domain(&r->mondomains, id, &add_pos);
> > +	if (IS_ERR(d)) {
> > +		pr_warn("Couldn't find cache id for CPU %d\n", cpu);
> 
> Note for future change ... this continues to refer to monitor scope as
> a cache id. I did not see this changed in the later patch that actually
> changes how scope is used.

Changed all these messages to refer to scope instead of cache id.

> 
> > +		return;
> > +	}
> > +
> > +	if (d) {
> > +		cpumask_set_cpu(cpu, &d->cpu_mask);
> > +		if (r->cache.arch_has_per_cpu_cfg)
> > +			rdt_domain_reconfigure_cdp(r);
> 
> Copy & paste error?

Indeed yes. This code isn't appropriate for the monitor case.
Deleted.

> 
> > +		return;
> > +	}
> > +
> > +	hw_dom = kzalloc_node(sizeof(*hw_dom), GFP_KERNEL, cpu_to_node(cpu));
> > +	if (!hw_dom)
> > +		return;
> > +
> > +	d = &hw_dom->d_resctrl;
> > +	d->id = id;
> > +	cpumask_set_cpu(cpu, &d->cpu_mask);
> > +
> > +	if (arch_domain_mbm_alloc(r->num_rmid, hw_dom)) {
> > +		domain_free(hw_dom);
> > +		return;
> > +	}
> > +
> > +	list_add_tail(&d->list, add_pos);
> > +
> > +	err = resctrl_online_mon_domain(r, d);
> > +	if (err) {
> > +		list_del(&d->list);
> > +		domain_free(hw_dom);
> > +	}
> > +}
> > +
> >  /*
> >   * domain_add_cpu - Add a cpu to a resource's domain list.
> >   *
> > @@ -502,61 +593,19 @@ static int arch_domain_mbm_alloc(u32 num_rmid, struct rdt_hw_domain *hw_dom)
> >   */
> >  static void domain_add_cpu(int cpu, struct rdt_resource *r)
> >  {
> > -	int id = get_cpu_cacheinfo_id(cpu, r->cache_level);
> > -	struct list_head *add_pos = NULL;
> > -	struct rdt_hw_domain *hw_dom;
> > -	struct rdt_domain *d;
> > -	int err;
> > -
> > -	d = rdt_find_domain(r, id, &add_pos);
> > -	if (IS_ERR(d)) {
> > -		pr_warn("Couldn't find cache id for CPU %d\n", cpu);
> > -		return;
> > -	}
> > -
> > -	if (d) {
> > -		cpumask_set_cpu(cpu, &d->cpu_mask);
> > -		if (r->cache.arch_has_per_cpu_cfg)
> > -			rdt_domain_reconfigure_cdp(r);
> > -		return;
> > -	}
> > -
> > -	hw_dom = kzalloc_node(sizeof(*hw_dom), GFP_KERNEL, cpu_to_node(cpu));
> > -	if (!hw_dom)
> > -		return;
> > -
> > -	d = &hw_dom->d_resctrl;
> > -	d->id = id;
> > -	cpumask_set_cpu(cpu, &d->cpu_mask);
> > -
> > -	rdt_domain_reconfigure_cdp(r);
> > -
> > -	if (r->alloc_capable && domain_setup_ctrlval(r, d)) {
> > -		domain_free(hw_dom);
> > -		return;
> > -	}
> > -
> > -	if (r->mon_capable && arch_domain_mbm_alloc(r->num_rmid, hw_dom)) {
> > -		domain_free(hw_dom);
> > -		return;
> > -	}
> > -
> > -	list_add_tail(&d->list, add_pos);
> > -
> > -	err = resctrl_online_domain(r, d);
> > -	if (err) {
> > -		list_del(&d->list);
> > -		domain_free(hw_dom);
> > -	}
> > +	if (r->alloc_capable)
> > +		domain_add_cpu_ctrl(cpu, r);
> > +	if (r->mon_capable)
> > +		domain_add_cpu_mon(cpu, r);
> >  }
> 
> A resource could be both alloc and mon capable ... both
> domain_add_cpu_ctrl() and domain_add_cpu_mon() can fail.
> Should domain_add_cpu_mon() still be run for a CPU if
> domain_add_cpu_ctrl() failed? 
> 
> Looking ahead the CPU should probably also not be added
> to the default groups mask if a failure occurred.

Existing code doesn't do anything for the case where a CPU
can't be added to a domain (probably the only real error case
is failure to allocate memory for the domain structure).

Maybe something to tackle in a future series if anyone
thinks this is a serious problem and has suggestions on
what to do. It seems like a catastrophic problem to have
some CPUs missing from some or all domains of some resources.
Maybe this should disable mounting the resctrl filesystem
completely?

> 
> > -static void domain_remove_cpu(int cpu, struct rdt_resource *r)
> > +static void domain_remove_cpu_ctrl(int cpu, struct rdt_resource *r)
> >  {
> >  	int id = get_cpu_cacheinfo_id(cpu, r->cache_level);
> >  	struct rdt_hw_domain *hw_dom;
> >  	struct rdt_domain *d;
> >  
> > -	d = rdt_find_domain(r, id, NULL);
> > +	d = rdt_find_domain(&r->domains, id, NULL);
> >  	if (IS_ERR_OR_NULL(d)) {
> >  		pr_warn("Couldn't find cache id for CPU %d\n", cpu);
> >  		return;
> > @@ -565,7 +614,7 @@ static void domain_remove_cpu(int cpu, struct rdt_resource *r)
> >  
> >  	cpumask_clear_cpu(cpu, &d->cpu_mask);
> >  	if (cpumask_empty(&d->cpu_mask)) {
> > -		resctrl_offline_domain(r, d);
> > +		resctrl_offline_ctrl_domain(r, d);
> >  		list_del(&d->list);
> >  
> >  		/*
> > @@ -578,6 +627,30 @@ static void domain_remove_cpu(int cpu, struct rdt_resource *r)
> >  
> >  		return;
> >  	}
> > +}
> > +
> > +static void domain_remove_cpu_mon(int cpu, struct rdt_resource *r)
> > +{
> > +	int id = get_cpu_cacheinfo_id(cpu, r->cache_level);
> 
> Introducing mon_scope can really be deferred ... here the monitoring code
> is not using mon_scope anyway.

Not deferring. But I did fix this to use r->mon_scope. Good catch.

> 
> > +	struct rdt_hw_domain *hw_dom;
> > +	struct rdt_domain *d;
> > +
> > +	d = rdt_find_domain(&r->mondomains, id, NULL);
> > +	if (IS_ERR_OR_NULL(d)) {
> > +		pr_warn("Couldn't find cache id for CPU %d\n", cpu);
> > +		return;
> > +	}
> > +	hw_dom = resctrl_to_arch_dom(d);
> > +
> > +	cpumask_clear_cpu(cpu, &d->cpu_mask);
> > +	if (cpumask_empty(&d->cpu_mask)) {
> > +		resctrl_offline_mon_domain(r, d);
> > +		list_del(&d->list);
> > +
> > +		domain_free(hw_dom);
> > +
> > +		return;
> > +	}
> >  
> >  	if (r == &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl) {
> >  		if (is_mbm_enabled() && cpu == d->mbm_work_cpu) {
> 
> Reinette

-Tony

^ permalink raw reply	[flat|nested] 331+ messages in thread

* Re: [PATCH v4 3/7] x86/resctrl: Change monitor code to use rdt_mondomain
  2023-08-11 17:30     ` Reinette Chatre
@ 2023-08-25 17:13       ` Tony Luck
  0 siblings, 0 replies; 331+ messages in thread
From: Tony Luck @ 2023-08-25 17:13 UTC (permalink / raw)
  To: Reinette Chatre
  Cc: Fenghua Yu, Peter Newman, Jonathan Corbet, Shuah Khan, x86,
	Shaopeng Tan, James Morse, Jamie Iles, Babu Moger, Randy Dunlap,
	linux-kernel, linux-doc, patches

On Fri, Aug 11, 2023 at 10:30:20AM -0700, Reinette Chatre wrote:
> Hi Tony,
> 
> On 7/22/2023 12:07 PM, Tony Luck wrote:
> > A few functions need to be duplicated to provide versions to
> > operate on control and monitor domains respectively. But most
> > of the changes are just fixing argument and return value types.
> 
> Could you please add some context in support of this change?
> 
> I do not think "duplicated" is appropriate though. Functions
> are not duplicated but instead made to be dedicated to
> either control or monitoring domains, no?

Commit comment rewritten based on these suggestions

> 
> ...
> 
> > diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
> > index 274605aaa026..0161362b0c3e 100644
> > --- a/arch/x86/kernel/cpu/resctrl/core.c
> > +++ b/arch/x86/kernel/cpu/resctrl/core.c
> > @@ -393,9 +393,12 @@ void rdt_ctrl_update(void *arg)
> >   * id is found in a domain, return the domain. Otherwise, if requested by
> >   * caller, return the first domain whose id is bigger than the input id.
> >   * The domain list is sorted by id in ascending order.
> > + *
> > + * N.B. Returned value may be either a pointer to "struct rdt_domain" or
> > + * to "struct rdt_mondomain" depending on which domain list is scanned.
> >   */
> > -struct rdt_domain *rdt_find_domain(struct list_head *h, int id,
> > -				   struct list_head **pos)
> > +void *rdt_find_domain(struct list_head *h, int id,
> > +		      struct list_head **pos)
> >  {
> >  	struct rdt_domain *d;
> >  	struct list_head *l;
> 
> I do not think that void pointers should be passed around. How about two
> new functions dedicated to the different domain types with the void pointer
> handling contained in a static function? For example,
> 
> static void *__rdt_find_domain(struct list_head *h, int id, struct list_head **pos)
> 
> struct rdt_mondomain *rdt_find_mondomain(struct rdt_resource *r, int id, struct list_head **pos)
> struct rdt_domain *rdt_find_ctrldomain(struct rdt_resource *r, int id, struct list_head **pos)
> 
> rdt_find_mondomain() and rdt_find_ctrldomain() would be what callers use
> while they can be wrappers of __rdt_find_domain().

Good suggestion. Initial bits are in patch 1. Types changed later
after struct rdt_mondomain is added.

> 
> 
> > @@ -434,10 +437,15 @@ static void setup_default_ctrlval(struct rdt_resource *r, u32 *dc)
> >  }
> >  
> >  static void domain_free(struct rdt_hw_domain *hw_dom)
> > +{
> > +	kfree(hw_dom->ctrl_val);
> > +	kfree(hw_dom);
> > +}
> > +
> > +static void mondomain_free(struct rdt_hw_mondomain *hw_dom)
> >  {
> >  	kfree(hw_dom->arch_mbm_total);
> >  	kfree(hw_dom->arch_mbm_local);
> > -	kfree(hw_dom->ctrl_val);
> >  	kfree(hw_dom);
> >  }
> >  
> > @@ -467,7 +475,7 @@ static int domain_setup_ctrlval(struct rdt_resource *r, struct rdt_domain *d)
> >   * @num_rmid:	The size of the MBM counter array
> >   * @hw_dom:	The domain that owns the allocated arrays
> >   */
> > -static int arch_domain_mbm_alloc(u32 num_rmid, struct rdt_hw_domain *hw_dom)
> > +static int arch_domain_mbm_alloc(u32 num_rmid, struct rdt_hw_mondomain *hw_dom)
> >  {
> >  	size_t tsize;
> >  
> > @@ -539,8 +547,8 @@ static void domain_add_cpu_mon(int cpu, struct rdt_resource *r)
> >  {
> >  	int id = get_cpu_cacheinfo_id(cpu, r->mon_scope);
> >  	struct list_head *add_pos = NULL;
> > -	struct rdt_hw_domain *hw_dom;
> > -	struct rdt_domain *d;
> > +	struct rdt_hw_mondomain *hw_mondom;
> > +	struct rdt_mondomain *d;
> >  	int err;
> >  
> 
> Please ensure that reverse fir tree order is maintained in all these changes.

Oops. Fixed this one and looked around for any others introduced by me.
> 
> Reinette

-Tony

^ permalink raw reply	[flat|nested] 331+ messages in thread

* Re: [PATCH v4 4/7] x86/resctrl: Delete unused fields from struct rdt_domain
  2023-08-11 17:30     ` Reinette Chatre
@ 2023-08-25 17:13       ` Tony Luck
  0 siblings, 0 replies; 331+ messages in thread
From: Tony Luck @ 2023-08-25 17:13 UTC (permalink / raw)
  To: Reinette Chatre
  Cc: Fenghua Yu, Peter Newman, Jonathan Corbet, Shuah Khan, x86,
	Shaopeng Tan, James Morse, Jamie Iles, Babu Moger, Randy Dunlap,
	linux-kernel, linux-doc, patches

On Fri, Aug 11, 2023 at 10:30:47AM -0700, Reinette Chatre wrote:
> Hi Tony,
> 
> On 7/22/2023 12:07 PM, Tony Luck wrote:
> > Now that all the monitoring functions use struct rdt_mondomain the
> > monitor fields can be dropped from the structure used for control
> > operations.
> 
> Please provide some context for this patch so that it
> can stand by itself when merged.
> 
> Note that two structures are changed in the patch.
> 
> Reinette

Commit comment rewritten.

-Tony

^ permalink raw reply	[flat|nested] 331+ messages in thread

* Re: [PATCH v4 5/7] x86/resctrl: Determine if Sub-NUMA Cluster is enabled and initialize.
  2023-08-11 17:32     ` Reinette Chatre
@ 2023-08-25 17:49       ` Tony Luck
  2023-08-28 17:05         ` Reinette Chatre
  0 siblings, 1 reply; 331+ messages in thread
From: Tony Luck @ 2023-08-25 17:49 UTC (permalink / raw)
  To: Reinette Chatre
  Cc: Fenghua Yu, Peter Newman, Jonathan Corbet, Shuah Khan, x86,
	Shaopeng Tan, James Morse, Jamie Iles, Babu Moger, Randy Dunlap,
	linux-kernel, linux-doc, patches

On Fri, Aug 11, 2023 at 10:32:29AM -0700, Reinette Chatre wrote:
> Hi Tony,
> 
> On 7/22/2023 12:07 PM, Tony Luck wrote:
> > There isn't a simple hardware enumeration to indicate to software that
> > a system is running with Sub-NUMA Cluster enabled.
> > 
> > Compare the number of NUMA nodes with the number of L3 caches to calculate
> > the number of Sub-NUMA nodes per L3 cache.
> > 
> > When Sub-NUMA cluster mode is enabled in BIOS setup the RMID counters
> > are distributed equally between the SNC nodes within each socket.
> > 
> > E.g. if there are 400 RMID counters, and the system is configured with
> > two SNC nodes per socket, then RMID counter 0..199 are used on SNC node
> > 0 on the socket, and RMID counter 200..399 on SNC node 1.
> > 
> > A model specific MSR (0xca0) can change the configuration of the RMIDs
> > when SNC mode is enabled.
> > 
> > The MSR controls the interpretation of the RMID field in the
> > IA32_PQR_ASSOC MSR so that the appropriate hardware counters
> > within the SNC node are updated.
> > 
> > To read the RMID counters an offset must be used to get data
> > from the physical counter associated with the SNC node. As in
> > the example above with 400 RMID counters Linux sees only 200
> > counters. No special action is needed to read a counter from
> > the first SNC node on a socket. But to read a Linux visible
> > counter 50 on the second SNC node the kernel must load 250
> > into the QM_EVTSEL MSR.
> > 
> > N.B. this works well for well-behaved NUMA applications that access
> > memory predominantly from the local memory node. For applications that
> > access memory across multiple nodes it may be necessary for the user
> > to read counters for all SNC nodes on a socket and add the values to
> > get the actual LLC occupancy or memory bandwidth. Perhaps this isn't
> > all that different from applications that span across multiple sockets
> > in a legacy system.
> > 
> > The cache allocation feature still provides the same number of
> > bits in a mask to control allocation into the L3 cache. But each
> > of those ways has its capacity reduced because the cache is divided
> > between the SNC nodes. Adjust the value reported in the resctrl
> > "size" file accordingly.
> > 
> > Mounting the file system with the "mba_MBps" option is disabled
> > when SNC mode is enabled. This is because the measurement of bandwidth
> > is per SNC node, while the MBA throttling controls are still at
> > the L3 cache scope.
> > 
> 
> I'm counting four logical changes in this changelog. Can they be separate
> patches?

Split into three parts. I couldn't quite see how to get to four.

> 
> > Signed-off-by: Tony Luck <tony.luck@intel.com>
> > ---
> >  include/linux/resctrl.h                |  2 +
> >  arch/x86/include/asm/msr-index.h       |  1 +
> >  arch/x86/kernel/cpu/resctrl/internal.h |  2 +
> >  arch/x86/kernel/cpu/resctrl/core.c     | 82 +++++++++++++++++++++++++-
> >  arch/x86/kernel/cpu/resctrl/monitor.c  | 18 +++++-
> >  arch/x86/kernel/cpu/resctrl/rdtgroup.c |  4 +-
> >  6 files changed, 103 insertions(+), 6 deletions(-)
> > 
> > diff --git a/include/linux/resctrl.h b/include/linux/resctrl.h
> > index 80a89d171eba..576dc21bd990 100644
> > --- a/include/linux/resctrl.h
> > +++ b/include/linux/resctrl.h
> > @@ -200,6 +200,8 @@ struct rdt_resource {
> >  	bool			cdp_capable;
> >  };
> >  
> > +#define MON_SCOPE_NODE 100
> > +
> 
> Could you please add a comment to explain what this constant represents
> and how it is used?

This is gone in V6. It's become part of an "enum" for each of the scope
options.
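
Roughly like this (a sketch; the exact names in the posted series may
differ). The cache entries keep their numeric level values so existing
get_cpu_cacheinfo_id(cpu, scope) callers continue to work:

enum resctrl_scope {
	RESCTRL_L2_CACHE = 2,
	RESCTRL_L3_CACHE = 3,
	RESCTRL_NODE,
};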

> 
> >  /**
> >   * struct resctrl_schema - configuration abilities of a resource presented to
> >   *			   user-space
> > diff --git a/arch/x86/include/asm/msr-index.h b/arch/x86/include/asm/msr-index.h
> > index 3aedae61af4f..4b624a37d64a 100644
> > --- a/arch/x86/include/asm/msr-index.h
> > +++ b/arch/x86/include/asm/msr-index.h
> > @@ -1087,6 +1087,7 @@
> >  #define MSR_IA32_QM_CTR			0xc8e
> >  #define MSR_IA32_PQR_ASSOC		0xc8f
> >  #define MSR_IA32_L3_CBM_BASE		0xc90
> > +#define MSR_RMID_SNC_CONFIG		0xca0
> >  #define MSR_IA32_L2_CBM_BASE		0xd10
> >  #define MSR_IA32_MBA_THRTL_BASE		0xd50
> >  
> > diff --git a/arch/x86/kernel/cpu/resctrl/internal.h b/arch/x86/kernel/cpu/resctrl/internal.h
> > index 016ef0373c5a..00a330bc5ced 100644
> > --- a/arch/x86/kernel/cpu/resctrl/internal.h
> > +++ b/arch/x86/kernel/cpu/resctrl/internal.h
> > @@ -446,6 +446,8 @@ DECLARE_STATIC_KEY_FALSE(rdt_alloc_enable_key);
> >  
> >  extern struct dentry *debugfs_resctrl;
> >  
> > +extern int snc_nodes_per_l3_cache;
> > +
> >  enum resctrl_res_level {
> >  	RDT_RESOURCE_L3,
> >  	RDT_RESOURCE_L2,
> > diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
> > index 0161362b0c3e..1331add347fc 100644
> > --- a/arch/x86/kernel/cpu/resctrl/core.c
> > +++ b/arch/x86/kernel/cpu/resctrl/core.c
> > @@ -16,11 +16,14 @@
> >  
> >  #define pr_fmt(fmt)	"resctrl: " fmt
> >  
> > +#include <linux/cpu.h>
> >  #include <linux/slab.h>
> >  #include <linux/err.h>
> >  #include <linux/cacheinfo.h>
> >  #include <linux/cpuhotplug.h>
> > +#include <linux/mod_devicetable.h>
> >  
> 
> What does this include provide?

This header provides the definition of struct x86_cpu_id.

> 
> > +#include <asm/cpu_device_id.h>
> >  #include <asm/intel-family.h>
> >  #include <asm/resctrl.h>
> >  #include "internal.h"
> > @@ -48,6 +51,13 @@ int max_name_width, max_data_width;
> >   */
> >  bool rdt_alloc_capable;
> >  
> > +/*
> > + * Number of SNC nodes that share each L3 cache.
> > + * Default is 1 for systems that do not support
> > + * SNC, or have SNC disabled.
> > + */
> > +int snc_nodes_per_l3_cache = 1;
> > +
> >  static void
> >  mba_wrmsr_intel(struct rdt_domain *d, struct msr_param *m,
> >  		struct rdt_resource *r);
> > @@ -543,9 +553,16 @@ static void domain_add_cpu_ctrl(int cpu, struct rdt_resource *r)
> >  	}
> >  }
> >  
> > +static int get_mon_scope_id(int cpu, int scope)
> > +{
> > +	if (scope == MON_SCOPE_NODE)
> > +		return cpu_to_node(cpu);
> > +	return get_cpu_cacheinfo_id(cpu, scope);
> > +}
> > +
> >  static void domain_add_cpu_mon(int cpu, struct rdt_resource *r)
> >  {
> > -	int id = get_cpu_cacheinfo_id(cpu, r->mon_scope);
> > +	int id = get_mon_scope_id(cpu, r->mon_scope);
> >  	struct list_head *add_pos = NULL;
> >  	struct rdt_hw_mondomain *hw_mondom;
> >  	struct rdt_mondomain *d;
> > @@ -692,11 +709,28 @@ static void clear_closid_rmid(int cpu)
> >  	wrmsr(MSR_IA32_PQR_ASSOC, 0, 0);
> >  }
> >  
> > +static void snc_remap_rmids(int cpu)
> > +{
> > +	u64	val;
> 
> No need for tab here.

Turned into a <SPACE>

> 
> > +
> > +	/* Only need to enable once per package */
> > +	if (cpumask_first(topology_core_cpumask(cpu)) != cpu)
> > +		return;
> > +
> > +	rdmsrl(MSR_RMID_SNC_CONFIG, val);
> > +	val &= ~BIT_ULL(0);
> > +	wrmsrl(MSR_RMID_SNC_CONFIG, val);
> > +}
> 
> Could you please document snc_remap_rmids()
> with information on what the bit in the register means
> and what the above function does?

Added header comment for this function describing the MSR
and the function.

> 
> > +
> >  static int resctrl_online_cpu(unsigned int cpu)
> >  {
> >  	struct rdt_resource *r;
> >  
> >  	mutex_lock(&rdtgroup_mutex);
> > +
> > +	if (snc_nodes_per_l3_cache > 1)
> > +		snc_remap_rmids(cpu);
> > +
> >  	for_each_capable_rdt_resource(r)
> >  		domain_add_cpu(cpu, r);
> >  	/* The cpu is set in default rdtgroup after online. */
> > @@ -951,11 +985,57 @@ static __init bool get_rdt_resources(void)
> >  	return (rdt_mon_capable || rdt_alloc_capable);
> >  }
> >  
> 
> I think it will help to add a comment like:
> "CPUs that support the model specific MSR_RMID_SNC_CONFIG register."

Done.

> 
> > +static const struct x86_cpu_id snc_cpu_ids[] __initconst = {
> > +	X86_MATCH_INTEL_FAM6_MODEL(ICELAKE_X, 0),
> > +	X86_MATCH_INTEL_FAM6_MODEL(SAPPHIRERAPIDS_X, 0),
> > +	X86_MATCH_INTEL_FAM6_MODEL(EMERALDRAPIDS_X, 0),
> > +	{}
> > +};
> > +
> > +/*
> > + * There isn't a simple enumeration bit to show whether SNC mode
> > + * is enabled. Look at the ratio of number of NUMA nodes to the
> > + * number of distinct L3 caches. Take care to skip memory-only nodes.
> > + */
> > +static __init int get_snc_config(void)
> > +{
> > +	unsigned long *node_caches;
> > +	int mem_only_nodes = 0;
> > +	int cpu, node, ret;
> > +
> > +	if (!x86_match_cpu(snc_cpu_ids))
> > +		return 1;
> > +
> > +	node_caches = kcalloc(BITS_TO_LONGS(nr_node_ids), sizeof(*node_caches), GFP_KERNEL);
> > +	if (!node_caches)
> > +		return 1;
> > +
> > +	cpus_read_lock();
> > +	for_each_node(node) {
> > +		cpu = cpumask_first(cpumask_of_node(node));
> > +		if (cpu < nr_cpu_ids)
> > +			set_bit(get_cpu_cacheinfo_id(cpu, 3), node_caches);
> > +		else
> > +			mem_only_nodes++;
> > +	}
> > +	cpus_read_unlock();
> 
> I am not familiar with the numa code at all so please correct me
> where I am wrong. I do see that nr_node_ids is initialized with __init code
> so it should be accurate at this point. It looks to me like this initialization
> assumes that at least one CPU per node will be online at the time it is run.
> It is not clear to me that this assumption would always be true. 

Resctrl initialization is kicked off as a late_initcall(). So all CPUs
and devices are fully initialized before this code runs.

Resctrl can't be moved to an "init" state before CPUs are brought online
because it makes a call to cpuhp_setup_state() to get callbacks for
online/offline CPU events ... that call can't be done early.
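
For reference, the ordering in core.c is roughly the following (a trimmed
sketch from memory, with unrelated calls and error paths shortened):

static int __init resctrl_late_init(void)
{
	int state, ret;

	rdt_init_res_defs();	/* the proposed get_snc_config() runs from here */
	check_quirks();

	if (!get_rdt_resources())
		return -ENODEV;

	/* Needs the CPU hotplug machinery, so cannot run from early init */
	state = cpuhp_setup_state(CPUHP_AP_ONLINE_DYN,
				  "x86/resctrl/cat:online:",
				  resctrl_online_cpu, resctrl_offline_cpu);
	if (state < 0)
		return state;

	ret = rdtgroup_init();
	if (ret) {
		cpuhp_remove_state(state);
		return ret;
	}

	return 0;
}
late_initcall(resctrl_late_init);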

> 
> > +
> > +	ret = (nr_node_ids - mem_only_nodes) / bitmap_weight(node_caches, nr_node_ids);
> > +	kfree(node_caches);
> > +
> > +	if (ret > 1)
> > +		rdt_resources_all[RDT_RESOURCE_L3].r_resctrl.mon_scope = MON_SCOPE_NODE;
> > +
> > +	return ret;
> > +}
> > +
> >  static __init void rdt_init_res_defs_intel(void)
> >  {
> >  	struct rdt_hw_resource *hw_res;
> >  	struct rdt_resource *r;
> >  
> > +	snc_nodes_per_l3_cache = get_snc_config();
> > +
> >  	for_each_rdt_resource(r) {
> >  		hw_res = resctrl_to_arch_res(r);
> >  
> > diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c
> > index 0d9605fccb34..4ca064e62911 100644
> > --- a/arch/x86/kernel/cpu/resctrl/monitor.c
> > +++ b/arch/x86/kernel/cpu/resctrl/monitor.c
> > @@ -148,8 +148,18 @@ static inline struct rmid_entry *__rmid_entry(u32 rmid)
> >  
> >  static int __rmid_read(u32 rmid, enum resctrl_event_id eventid, u64 *val)
> >  {
> > +	struct rdt_resource *r = &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl;
> > +	int cpu = get_cpu();
> 
> I do not think it is necessary to disable preemption here. Also please note that
> James is working on changes to not have this code block. Could just smp_processor_id()
> do?

No need to disable preemption here - it is already disabled: the
call sequence to get here comes through smp_call_function_single(),
which does its own get_cpu(). Will change this code to use
smp_processor_id().
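
That is, the function would end up roughly like this (a sketch of the
intended change, not the final v5 code):

static int __rmid_read(u32 rmid, enum resctrl_event_id eventid, u64 *val)
{
	struct rdt_resource *r = &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl;
	int cpu = smp_processor_id();	/* preemption already off in the callers */
	int rmid_offset = 0;
	u64 msr_val;

	/*
	 * When SNC mode is on, compute the offset to read the physical
	 * RMID counter for the node to which this CPU belongs.
	 */
	if (snc_nodes_per_l3_cache > 1)
		rmid_offset = (cpu_to_node(cpu) % snc_nodes_per_l3_cache) * r->num_rmid;

	wrmsr(MSR_IA32_QM_EVTSEL, eventid, rmid + rmid_offset);
	rdmsrl(MSR_IA32_QM_CTR, msr_val);

	if (msr_val & RMID_VAL_ERROR)
		return -EIO;
	if (msr_val & RMID_VAL_UNAVAIL)
		return -EINVAL;

	*val = msr_val;
	return 0;
}
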
> 
> > +	int rmid_offset = 0;
> >  	u64 msr_val;
> >  
> > +	/*
> > +	 * When SNC mode is on, need to compute the offset to read the
> > +	 * physical RMID counter for the node to which this CPU belongs
> > +	 */
> > +	if (snc_nodes_per_l3_cache > 1)
> > +		rmid_offset = (cpu_to_node(cpu) % snc_nodes_per_l3_cache) * r->num_rmid;
> > +
> >  	/*
> >  	 * As per the SDM, when IA32_QM_EVTSEL.EvtID (bits 7:0) is configured
> >  	 * with a valid event code for supported resource type and the bits
> > @@ -158,9 +168,11 @@ static int __rmid_read(u32 rmid, enum resctrl_event_id eventid, u64 *val)
> >  	 * IA32_QM_CTR.Error (bit 63) and IA32_QM_CTR.Unavailable (bit 62)
> >  	 * are error bits.
> >  	 */
> > -	wrmsr(MSR_IA32_QM_EVTSEL, eventid, rmid);
> > +	wrmsr(MSR_IA32_QM_EVTSEL, eventid, rmid + rmid_offset);
> >  	rdmsrl(MSR_IA32_QM_CTR, msr_val);
> >  
> > +	put_cpu();
> > +
> >  	if (msr_val & RMID_VAL_ERROR)
> >  		return -EIO;
> >  	if (msr_val & RMID_VAL_UNAVAIL)
> 
> Reinette

-Tony

^ permalink raw reply	[flat|nested] 331+ messages in thread

* Re: [PATCH v4 6/7] x86/resctrl: Update documentation with Sub-NUMA cluster changes
  2023-08-11 17:33     ` Reinette Chatre
@ 2023-08-25 17:50       ` Tony Luck
  0 siblings, 0 replies; 331+ messages in thread
From: Tony Luck @ 2023-08-25 17:50 UTC (permalink / raw)
  To: Reinette Chatre
  Cc: Fenghua Yu, Peter Newman, Jonathan Corbet, Shuah Khan, x86,
	Shaopeng Tan, James Morse, Jamie Iles, Babu Moger, Randy Dunlap,
	linux-kernel, linux-doc, patches

On Fri, Aug 11, 2023 at 10:33:18AM -0700, Reinette Chatre wrote:
> Hi Tony,
> 
> On 7/22/2023 12:07 PM, Tony Luck wrote:
> > With Sub-NUMA Cluster mode enabled the scope of monitoring resources is
> > per-NODE instead of per-L3 cache. Suffixes of directories with "L3" in
> > their name refer to Sub-NUMA nodes instead of L3 cache ids.
> > 
> > Signed-off-by: Tony Luck <tony.luck@intel.com>
> > Reviewed-by: Peter Newman <peternewman@google.com>
> > ---
> >  Documentation/arch/x86/resctrl.rst | 10 +++++++---
> >  1 file changed, 7 insertions(+), 3 deletions(-)
> > 
> > diff --git a/Documentation/arch/x86/resctrl.rst b/Documentation/arch/x86/resctrl.rst
> > index cb05d90111b4..4d9ddb91751d 100644
> > --- a/Documentation/arch/x86/resctrl.rst
> > +++ b/Documentation/arch/x86/resctrl.rst
> > @@ -345,9 +345,13 @@ When control is enabled all CTRL_MON groups will also contain:
> >  When monitoring is enabled all MON groups will also contain:
> >  
> >  "mon_data":
> > -	This contains a set of files organized by L3 domain and by
> > -	RDT event. E.g. on a system with two L3 domains there will
> > -	be subdirectories "mon_L3_00" and "mon_L3_01".	Each of these
> > +	This contains a set of files organized by L3 domain or by NUMA
> > +	node (depending on whether Sub-NUMA Cluster (SNC) mode is disabled
> > +	or enabled respectively) and by RDT event. E.g. on a system with
> > +	SNC mode disabled with two L3 domains there will be subdirectories
> > +	"mon_L3_00" and "mon_L3_01". The numerical suffix refers to the
> > +	L3 cache id.  With SNC enabled the directory names are the same,
> > +	but the numerical suffix refers to the node id.  Each of these
> >  	directories have one file per event (e.g. "llc_occupancy",
> >  	"mbm_total_bytes", and "mbm_local_bytes"). In a MON group these
> >  	files provide a read out of the current value of the event for
> 
> I think it would be helpful to add a modified version of the snippet
> (from previous patch changelog) regarding well-behaved NUMA apps.
> With the above it may be confusing that a single cache allocation has
> multiple cache occupancy counters. 
> 
> This also changes the meaning of the numbers in the directory names.
> The documentation already provides guidance on how to find the cache
> ID of a logical CPU (see section "Cache IDs"). I think it will be 
> helpful to add a snippet that makes it clear to users how to map
> a CPU to its node ID.

Added extra details as suggested.

> 
> Reinette

-Tony

^ permalink raw reply	[flat|nested] 331+ messages in thread

* Re: [PATCH v4 7/7] selftests/resctrl: Adjust effective L3 cache size when SNC enabled
  2023-08-11 17:33     ` Reinette Chatre
@ 2023-08-25 17:56       ` Tony Luck
  2023-08-28 17:06         ` Reinette Chatre
  0 siblings, 1 reply; 331+ messages in thread
From: Tony Luck @ 2023-08-25 17:56 UTC (permalink / raw)
  To: Reinette Chatre
  Cc: Fenghua Yu, Peter Newman, Jonathan Corbet, Shuah Khan, x86,
	Shaopeng Tan, James Morse, Jamie Iles, Babu Moger, Randy Dunlap,
	linux-kernel, linux-doc, patches

On Fri, Aug 11, 2023 at 10:33:43AM -0700, Reinette Chatre wrote:
> Hi Tony,
> 
> On 7/22/2023 12:07 PM, Tony Luck wrote:
> > Sub-NUMA Cluster divides CPUs sharing an L3 cache into separate NUMA
> > nodes. Systems may support splitting into either two or four nodes.
> > 
> > When SNC mode is enabled the effective amount of L3 cache available
> > for allocation is divided by the number of nodes per L3.
> > 
> > Detect which SNC mode is active by comparing the number of CPUs
> > that share a cache with CPU0, with the number of CPUs on node0.
> > 
> > Reported-by: "Shaopeng Tan (Fujitsu)" <tan.shaopeng@fujitsu.com>
> > Closes: https://lore.kernel.org/r/TYAPR01MB6330B9B17686EF426D2C3F308B25A@TYAPR01MB6330.jpnprd01.prod.outlook.com
> 
> This does not seem to be the case when looking at
> https://lore.kernel.org/all/TYAPR01MB6330A4EB3633B791939EA45E8B39A@TYAPR01MB6330.jpnprd01.prod.outlook.com/

Correct. I'll drop the "Closes:" tag.  I'm not sure what
the status is. Shaopeng didn't respond to my suggestion
to try "taskset(1)" when running the tests to check if
NUMA effects are causing the test to fail.

> 
> > Signed-off-by: Tony Luck <tony.luck@intel.com>
> > ---
> >  tools/testing/selftests/resctrl/resctrl.h   |  1 +
> >  tools/testing/selftests/resctrl/resctrlfs.c | 57 +++++++++++++++++++++
> >  2 files changed, 58 insertions(+)
> > 
> > diff --git a/tools/testing/selftests/resctrl/resctrl.h b/tools/testing/selftests/resctrl/resctrl.h
> > index 87e39456dee0..a8b43210b573 100644
> > --- a/tools/testing/selftests/resctrl/resctrl.h
> > +++ b/tools/testing/selftests/resctrl/resctrl.h
> > @@ -13,6 +13,7 @@
> >  #include <signal.h>
> >  #include <dirent.h>
> >  #include <stdbool.h>
> > +#include <ctype.h>
> >  #include <sys/stat.h>
> >  #include <sys/ioctl.h>
> >  #include <sys/mount.h>
> > diff --git a/tools/testing/selftests/resctrl/resctrlfs.c b/tools/testing/selftests/resctrl/resctrlfs.c
> > index fb00245dee92..79eecbf9f863 100644
> > --- a/tools/testing/selftests/resctrl/resctrlfs.c
> > +++ b/tools/testing/selftests/resctrl/resctrlfs.c
> > @@ -130,6 +130,61 @@ int get_resource_id(int cpu_no, int *resource_id)
> >  	return 0;
> >  }
> >  
> > +/*
> > + * Count number of CPUs in a /sys bit map
> > + */
> > +static int count_sys_bitmap_bits(char *name)
> > +{
> > +	FILE *fp = fopen(name, "r");
> > +	int count = 0, c;
> > +
> > +	if (!fp)
> > +		return 0;
> > +
> > +	while ((c = fgetc(fp)) != EOF) {
> > +		if (!isxdigit(c))
> > +			continue;
> > +		switch (c) {
> > +		case 'f':
> > +			count++;
> > +		case '7': case 'b': case 'd': case 'e':
> > +			count++;
> > +		case '3': case '5': case '6': case '9': case 'a': case 'c':
> > +			count++;
> > +		case '1': case '2': case '4': case '8':
> > +			count++;
> > +		}
> > +	}
> > +	fclose(fp);
> > +
> > +	return count;
> > +}
> > +
> > +/*
> > + * Detect SNC by comparing #CPUs in node0 with #CPUs sharing LLC with CPU0
> > + * Try to get this right, even if a few CPUs are offline so that the number
> > + * of CPUs in node0 is not exactly half or a quarter of the CPUs sharing the
> > + * LLC of CPU0.
> > + */
> > +static int snc_ways(void)
> > +{
> > +	int node_cpus, cache_cpus;
> > +
> > +	node_cpus = count_sys_bitmap_bits("/sys/devices/system/node/node0/cpumap");
> > +	cache_cpus = count_sys_bitmap_bits("/sys/devices/system/cpu/cpu0/cache/index3/shared_cpu_map");
> > +
> > +	if (!node_cpus || !cache_cpus) {
> > +		fprintf(stderr, "Warning could not determine Sub-NUMA Cluster mode\n");
> > +		return 1;
> > +	}
> > +
> > +	if (4 * node_cpus >= cache_cpus)
> > +		return 4;
> > +	else if (2 * node_cpus >= cache_cpus)
> > +		return 2;
> > +	return 1;
> > +}
> > +
> >  /*
> >   * get_cache_size - Get cache size for a specified CPU
> >   * @cpu_no:	CPU number
> > @@ -190,6 +245,8 @@ int get_cache_size(int cpu_no, char *cache_type, unsigned long *cache_size)
> >  			break;
> >  	}
> >  
> > +	if (cache_num == 3)
> > +		*cache_size /= snc_ways();
> >  	return 0;
> >  }
> >  
> 
> I am surprised that this small change is sufficient. The resctrl
> selftests are definitely not NUMA aware and the CAT and CMT tests
> are not taking that into account when picking CPUs to run on. From
> what I understand LLC occupancy counters need to be added in this
> scenario but I do not see that done either.

This is a first step (the tests are definitely going to fail if
they have incorrect information about the cache size).

For a fully reliable set of tests some major surgery will be required
to bind to CPUs and memory to control allocation and access.
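
Just to illustrate the direction (a sketch only, not part of this series;
bind_to_node() is a hypothetical helper using raw sched_setaffinity() and
set_mempolicy() from <numaif.h>):

#define _GNU_SOURCE
#include <sched.h>
#include <numaif.h>

/*
 * Hypothetical helper: pin the benchmark to one CPU and restrict its
 * memory allocations to the SNC node that owns that CPU.
 */
static int bind_to_node(int cpu, int node)
{
	unsigned long nodemask = 1UL << node;	/* assumes node < 64 */
	cpu_set_t cpus;

	CPU_ZERO(&cpus);
	CPU_SET(cpu, &cpus);
	if (sched_setaffinity(0, sizeof(cpus), &cpus))
		return -1;

	/* MPOL_BIND: future allocations must come from 'node' */
	return set_mempolicy(MPOL_BIND, &nodemask, 8 * sizeof(nodemask));
}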

> 
> Reinette

-Tony

^ permalink raw reply	[flat|nested] 331+ messages in thread

* Re: [PATCH v4 1/7] x86/resctrl: Create separate domains for control and monitoring
  2023-08-25 17:05       ` Tony Luck
@ 2023-08-28 17:05         ` Reinette Chatre
  2023-08-28 18:46           ` Tony Luck
  0 siblings, 1 reply; 331+ messages in thread
From: Reinette Chatre @ 2023-08-28 17:05 UTC (permalink / raw)
  To: Tony Luck
  Cc: Fenghua Yu, Peter Newman, Jonathan Corbet, Shuah Khan, x86,
	Shaopeng Tan, James Morse, Jamie Iles, Babu Moger, Randy Dunlap,
	linux-kernel, linux-doc, patches

Hi Tony,

On 8/25/2023 10:05 AM, Tony Luck wrote:
> On Fri, Aug 11, 2023 at 10:29:25AM -0700, Reinette Chatre wrote:
>> On 7/22/2023 12:07 PM, Tony Luck wrote:

>>> Change all places where monitoring functions walk the list of
>>> domains to use the new "mondomains" list instead of the old
>>> "domains" list.
>>
>> I would not refer to it as "the old domains list" as it creates
>> impression that this is being replaced. The changelog makes
>> no mention that domains list will remain and be dedicated to
>> control domains. I think this is important to include in description
>> of this change.
> 
> I've rewritten the entire commit message incorporating your suggestions.
> V6 will be posted soon (after I get some time on an SNC SPR to check
> that it all works!)

I seem to have missed v5.

> 
>>
>>>
>>> Signed-off-by: Tony Luck <tony.luck@intel.com>
>>> ---
>>>  include/linux/resctrl.h                   |  10 +-
>>>  arch/x86/kernel/cpu/resctrl/internal.h    |   2 +-
>>>  arch/x86/kernel/cpu/resctrl/core.c        | 195 +++++++++++++++-------
>>>  arch/x86/kernel/cpu/resctrl/ctrlmondata.c |   2 +-
>>>  arch/x86/kernel/cpu/resctrl/monitor.c     |   2 +-
>>>  arch/x86/kernel/cpu/resctrl/rdtgroup.c    |  30 ++--
>>>  6 files changed, 167 insertions(+), 74 deletions(-)
>>>
>>> diff --git a/include/linux/resctrl.h b/include/linux/resctrl.h
>>> index 8334eeacfec5..1267d56f9e76 100644
>>> --- a/include/linux/resctrl.h
>>> +++ b/include/linux/resctrl.h
>>> @@ -151,9 +151,11 @@ struct resctrl_schema;
>>>   * @mon_capable:	Is monitor feature available on this machine
>>>   * @num_rmid:		Number of RMIDs available
>>>   * @cache_level:	Which cache level defines scope of this resource
>>> + * @mon_scope:		Scope of this resource if different from cache_level
>>
>> I think this addition should be deferred. As it is here it the "if different
>> from cache_level" also creates many questions (when will it be different?
>> how will it be determined that the scope is different in order to know that
>> mon_scope should be used?)
> 
> I've gone in a different direction. V6 renames "cache_level" to
> "ctrl_scope". I think this makes intent clear from step #1.
>

This change is not clear to me. Previously you changed this name
but kept using it in code specific to cache levels. It is not clear
to me how this time's rename would be different.
 
...

>>> @@ -502,61 +593,19 @@ static int arch_domain_mbm_alloc(u32 num_rmid, struct rdt_hw_domain *hw_dom)
>>>   */
>>>  static void domain_add_cpu(int cpu, struct rdt_resource *r)
>>>  {
>>> -	int id = get_cpu_cacheinfo_id(cpu, r->cache_level);
>>> -	struct list_head *add_pos = NULL;
>>> -	struct rdt_hw_domain *hw_dom;
>>> -	struct rdt_domain *d;
>>> -	int err;
>>> -
>>> -	d = rdt_find_domain(r, id, &add_pos);
>>> -	if (IS_ERR(d)) {
>>> -		pr_warn("Couldn't find cache id for CPU %d\n", cpu);
>>> -		return;
>>> -	}
>>> -
>>> -	if (d) {
>>> -		cpumask_set_cpu(cpu, &d->cpu_mask);
>>> -		if (r->cache.arch_has_per_cpu_cfg)
>>> -			rdt_domain_reconfigure_cdp(r);
>>> -		return;
>>> -	}
>>> -
>>> -	hw_dom = kzalloc_node(sizeof(*hw_dom), GFP_KERNEL, cpu_to_node(cpu));
>>> -	if (!hw_dom)
>>> -		return;
>>> -
>>> -	d = &hw_dom->d_resctrl;
>>> -	d->id = id;
>>> -	cpumask_set_cpu(cpu, &d->cpu_mask);
>>> -
>>> -	rdt_domain_reconfigure_cdp(r);
>>> -
>>> -	if (r->alloc_capable && domain_setup_ctrlval(r, d)) {
>>> -		domain_free(hw_dom);
>>> -		return;
>>> -	}
>>> -
>>> -	if (r->mon_capable && arch_domain_mbm_alloc(r->num_rmid, hw_dom)) {
>>> -		domain_free(hw_dom);
>>> -		return;
>>> -	}
>>> -
>>> -	list_add_tail(&d->list, add_pos);
>>> -
>>> -	err = resctrl_online_domain(r, d);
>>> -	if (err) {
>>> -		list_del(&d->list);
>>> -		domain_free(hw_dom);
>>> -	}
>>> +	if (r->alloc_capable)
>>> +		domain_add_cpu_ctrl(cpu, r);
>>> +	if (r->mon_capable)
>>> +		domain_add_cpu_mon(cpu, r);
>>>  }
>>
>> A resource could be both alloc and mon capable ... both
>> domain_add_cpu_ctrl() and domain_add_cpu_mon() can fail.
>> Should domain_add_cpu_mon() still be run for a CPU if
>> domain_add_cpu_ctrl() failed? 
>>
>> Looking ahead the CPU should probably also not be added
>> to the default groups mask if a failure occurred.
> 
> Existing code doesn't do anything for the case where a CPU
> can't be added to a domain (probably the only real error case
> is failure to allocate memory for the domain structure).

Is my statement about CPU being added to default group mask
incorrect? Seems like a potential issue related to domain's
CPU mask also.

Please see my earlier question. Existing code does not proceed
with monitor initialization if control initialization fails and
undoes control initialization if monitor initialization fails. 

> May be something to tackle in a future series if anyone
> thinks this is a serious problem and has suggestions on
> what to do. It seems like a catastrophic problem to not
> have some CPUs in some/all domains of some resources.
> Maybe this should disable mounting resctrl filesystem
> completely?

It is not clear to me how this is catastrophic, but I
do not think resctrl should claim support for a resource
on a CPU if it was not able to complete initialization.

Reinette

^ permalink raw reply	[flat|nested] 331+ messages in thread

* Re: [PATCH v4 5/7] x86/resctrl: Determine if Sub-NUMA Cluster is enabled and initialize.
  2023-08-25 17:49       ` Tony Luck
@ 2023-08-28 17:05         ` Reinette Chatre
  2023-08-28 19:01           ` Tony Luck
  0 siblings, 1 reply; 331+ messages in thread
From: Reinette Chatre @ 2023-08-28 17:05 UTC (permalink / raw)
  To: Tony Luck
  Cc: Fenghua Yu, Peter Newman, Jonathan Corbet, Shuah Khan, x86,
	Shaopeng Tan, James Morse, Jamie Iles, Babu Moger, Randy Dunlap,
	linux-kernel, linux-doc, patches

Hi Tony,

On 8/25/2023 10:49 AM, Tony Luck wrote:
> On Fri, Aug 11, 2023 at 10:32:29AM -0700, Reinette Chatre wrote:
>> On 7/22/2023 12:07 PM, Tony Luck wrote:

...

>>> +static const struct x86_cpu_id snc_cpu_ids[] __initconst = {
>>> +	X86_MATCH_INTEL_FAM6_MODEL(ICELAKE_X, 0),
>>> +	X86_MATCH_INTEL_FAM6_MODEL(SAPPHIRERAPIDS_X, 0),
>>> +	X86_MATCH_INTEL_FAM6_MODEL(EMERALDRAPIDS_X, 0),
>>> +	{}
>>> +};
>>> +
>>> +/*
>>> + * There isn't a simple enumeration bit to show whether SNC mode
>>> + * is enabled. Look at the ratio of number of NUMA nodes to the
>>> + * number of distinct L3 caches. Take care to skip memory-only nodes.
>>> + */
>>> +static __init int get_snc_config(void)
>>> +{
>>> +	unsigned long *node_caches;
>>> +	int mem_only_nodes = 0;
>>> +	int cpu, node, ret;
>>> +
>>> +	if (!x86_match_cpu(snc_cpu_ids))
>>> +		return 1;
>>> +
>>> +	node_caches = kcalloc(BITS_TO_LONGS(nr_node_ids), sizeof(*node_caches), GFP_KERNEL);
>>> +	if (!node_caches)
>>> +		return 1;
>>> +
>>> +	cpus_read_lock();
>>> +	for_each_node(node) {
>>> +		cpu = cpumask_first(cpumask_of_node(node));
>>> +		if (cpu < nr_cpu_ids)
>>> +			set_bit(get_cpu_cacheinfo_id(cpu, 3), node_caches);
>>> +		else
>>> +			mem_only_nodes++;
>>> +	}
>>> +	cpus_read_unlock();
>>
>> I am not familiar with the numa code at all so please correct me
>> where I am wrong. I do see that nr_node_ids is initialized with __init code
>> so it should be accurate at this point. It looks to me like this initialization
>> assumes that at least one CPU per node will be online at the time it is run.
>> It is not clear to me that this assumption would always be true. 
> 
> Resctrl initialization is kicked off as a late_initcall(). So all CPUs
> and devices are fully initialized before this code runs.
> 
> Resctrl can't be moved to an "init" state before CPUs are brought online
> because it makes a call to cpuhp_setup_state() to get callbacks for
> online/offline CPU events ... that call can't be done early.

Apologies but this is not so obvious to me. From what I understand a
system need not be booted with all CPUs online. CPUs can be brought
online at any time.

>>> +
>>> +	ret = (nr_node_ids - mem_only_nodes) / bitmap_weight(node_caches, nr_node_ids);
>>> +	kfree(node_caches);
>>> +
>>> +	if (ret > 1)
>>> +		rdt_resources_all[RDT_RESOURCE_L3].r_resctrl.mon_scope = MON_SCOPE_NODE;
>>> +
>>> +	return ret;
>>> +}
>>> +


Reinette

^ permalink raw reply	[flat|nested] 331+ messages in thread

* Re: [PATCH v4 7/7] selftests/resctrl: Adjust effective L3 cache size when SNC enabled
  2023-08-25 17:56       ` Tony Luck
@ 2023-08-28 17:06         ` Reinette Chatre
  2023-08-28 19:06           ` Tony Luck
  0 siblings, 1 reply; 331+ messages in thread
From: Reinette Chatre @ 2023-08-28 17:06 UTC (permalink / raw)
  To: Tony Luck
  Cc: Fenghua Yu, Peter Newman, Jonathan Corbet, Shuah Khan, x86,
	Shaopeng Tan, James Morse, Jamie Iles, Babu Moger, Randy Dunlap,
	linux-kernel, linux-doc, patches

Hi Tony,

On 8/25/2023 10:56 AM, Tony Luck wrote:
> On Fri, Aug 11, 2023 at 10:33:43AM -0700, Reinette Chatre wrote:
>> Hi Tony,
>>
>> On 7/22/2023 12:07 PM, Tony Luck wrote:

...

>>> @@ -190,6 +245,8 @@ int get_cache_size(int cpu_no, char *cache_type, unsigned long *cache_size)
>>>  			break;
>>>  	}
>>>  
>>> +	if (cache_num == 3)
>>> +		*cache_size /= snc_ways();
>>>  	return 0;
>>>  }
>>>  
>>
>> I am surprised that this small change is sufficient. The resctrl
>> selftests are definitely not NUMA aware and the CAT and CMT tests
>> are not taking that into account when picking CPUs to run on. From
>> what I understand LLC occupancy counters need to be added in this
>> scenario but I do not see that done either.
> 
> This is a first step (the tests are definitely going to fail if
> they have incorrect information about the cache size).
> 
> For a fully reliable set of tests some major surgery will be required
> to bind to CPUs and memory to control allocation and access.
> 

What is the plan for making the tests more reliable? What is the
use of this patch if it is just the first step?

Reinette



^ permalink raw reply	[flat|nested] 331+ messages in thread

* Re: [PATCH v4 1/7] x86/resctrl: Create separate domains for control and monitoring
  2023-08-28 17:05         ` Reinette Chatre
@ 2023-08-28 18:46           ` Tony Luck
  2023-08-28 19:56             ` Reinette Chatre
  0 siblings, 1 reply; 331+ messages in thread
From: Tony Luck @ 2023-08-28 18:46 UTC (permalink / raw)
  To: Reinette Chatre
  Cc: Fenghua Yu, Peter Newman, Jonathan Corbet, Shuah Khan, x86,
	Shaopeng Tan, James Morse, Jamie Iles, Babu Moger, Randy Dunlap,
	linux-kernel, linux-doc, patches

On Mon, Aug 28, 2023 at 10:05:31AM -0700, Reinette Chatre wrote:
> Hi Tony,
> 
> On 8/25/2023 10:05 AM, Tony Luck wrote:
> > On Fri, Aug 11, 2023 at 10:29:25AM -0700, Reinette Chatre wrote:
> >> On 7/22/2023 12:07 PM, Tony Luck wrote:
> 
> >>> Change all places where monitoring functions walk the list of
> >>> domains to use the new "mondomains" list instead of the old
> >>> "domains" list.
> >>
> >> I would not refer to it as "the old domains list" as it creates
> >> impression that this is being replaced. The changelog makes
> >> no mention that domains list will remain and be dedicated to
> >> control domains. I think this is important to include in description
> >> of this change.
> > 
> > I've rewritten the entire commit message incorporating your suggestions.
> > V6 will be posted soon (after I get some time on an SNC SPR to check
> > that it all works!)
> 
> I seem to have missed v5.

I simply can't count. You are correct that the next version to be posted
will be v5.

> > 
> >>
> >>>
> >>> Signed-off-by: Tony Luck <tony.luck@intel.com>
> >>> ---
> >>>  include/linux/resctrl.h                   |  10 +-
> >>>  arch/x86/kernel/cpu/resctrl/internal.h    |   2 +-
> >>>  arch/x86/kernel/cpu/resctrl/core.c        | 195 +++++++++++++++-------
> >>>  arch/x86/kernel/cpu/resctrl/ctrlmondata.c |   2 +-
> >>>  arch/x86/kernel/cpu/resctrl/monitor.c     |   2 +-
> >>>  arch/x86/kernel/cpu/resctrl/rdtgroup.c    |  30 ++--
> >>>  6 files changed, 167 insertions(+), 74 deletions(-)
> >>>
> >>> diff --git a/include/linux/resctrl.h b/include/linux/resctrl.h
> >>> index 8334eeacfec5..1267d56f9e76 100644
> >>> --- a/include/linux/resctrl.h
> >>> +++ b/include/linux/resctrl.h
> >>> @@ -151,9 +151,11 @@ struct resctrl_schema;
> >>>   * @mon_capable:	Is monitor feature available on this machine
> >>>   * @num_rmid:		Number of RMIDs available
> >>>   * @cache_level:	Which cache level defines scope of this resource
> >>> + * @mon_scope:		Scope of this resource if different from cache_level
> >>
> >> I think this addition should be deferred. As it is here it the "if different
> >> from cache_level" also creates many questions (when will it be different?
> >> how will it be determined that the scope is different in order to know that
> >> mon_scope should be used?)
> > 
> > I've gone in a different direction. V6 renames "cache_level" to
> > "ctrl_scope". I think this makes intent clear from step #1.
> >
> 
> This change is not clear to me. Previously you changed this name
> but kept using it in code specific to cache levels. It is not clear
> to me how this time's rename would be different.

The current "cache_level" field in the structure is describing
the scope of each instance using the cache level (2 or 3) as the
method to describe which CPUs are considered part of a group. Currently
the scope is the same for both control and monitor resources.

Would you like to see patches in this progression:

1) Rename "cache_level" to "scope". With commit comment that future
patches are going to base the scope on NUMA nodes in addition to sharing
caches at particular levels, and will split into separate control and
monitor scope.

2) Split the "scope" field from first patch into "ctrl_scope" and
"mon_scope" (also with the addition of the new list for the mon_scope).

3) Add "node" as a new option for scope in addtion to L3 and L2 cache.

> ...
> 
> >>> @@ -502,61 +593,19 @@ static int arch_domain_mbm_alloc(u32 num_rmid, struct rdt_hw_domain *hw_dom)
> >>>   */
> >>>  static void domain_add_cpu(int cpu, struct rdt_resource *r)
> >>>  {
> >>> -	int id = get_cpu_cacheinfo_id(cpu, r->cache_level);
> >>> -	struct list_head *add_pos = NULL;
> >>> -	struct rdt_hw_domain *hw_dom;
> >>> -	struct rdt_domain *d;
> >>> -	int err;
> >>> -
> >>> -	d = rdt_find_domain(r, id, &add_pos);
> >>> -	if (IS_ERR(d)) {
> >>> -		pr_warn("Couldn't find cache id for CPU %d\n", cpu);
> >>> -		return;
> >>> -	}
> >>> -
> >>> -	if (d) {
> >>> -		cpumask_set_cpu(cpu, &d->cpu_mask);
> >>> -		if (r->cache.arch_has_per_cpu_cfg)
> >>> -			rdt_domain_reconfigure_cdp(r);
> >>> -		return;
> >>> -	}
> >>> -
> >>> -	hw_dom = kzalloc_node(sizeof(*hw_dom), GFP_KERNEL, cpu_to_node(cpu));
> >>> -	if (!hw_dom)
> >>> -		return;
> >>> -
> >>> -	d = &hw_dom->d_resctrl;
> >>> -	d->id = id;
> >>> -	cpumask_set_cpu(cpu, &d->cpu_mask);
> >>> -
> >>> -	rdt_domain_reconfigure_cdp(r);
> >>> -
> >>> -	if (r->alloc_capable && domain_setup_ctrlval(r, d)) {
> >>> -		domain_free(hw_dom);
> >>> -		return;
> >>> -	}
> >>> -
> >>> -	if (r->mon_capable && arch_domain_mbm_alloc(r->num_rmid, hw_dom)) {
> >>> -		domain_free(hw_dom);
> >>> -		return;
> >>> -	}
> >>> -
> >>> -	list_add_tail(&d->list, add_pos);
> >>> -
> >>> -	err = resctrl_online_domain(r, d);
> >>> -	if (err) {
> >>> -		list_del(&d->list);
> >>> -		domain_free(hw_dom);
> >>> -	}
> >>> +	if (r->alloc_capable)
> >>> +		domain_add_cpu_ctrl(cpu, r);
> >>> +	if (r->mon_capable)
> >>> +		domain_add_cpu_mon(cpu, r);
> >>>  }
> >>
> >> A resource could be both alloc and mon capable ... both
> >> domain_add_cpu_ctrl() and domain_add_cpu_mon() can fail.
> >> Should domain_add_cpu_mon() still be run for a CPU if
> >> domain_add_cpu_ctrl() failed? 
> >>
> >> Looking ahead the CPU should probably also not be added
> >> to the default groups mask if a failure occurred.
> > 
> > Existing code doesn't do anything for the case where a CPU
> > can't be added to a domain (probably the only real error case
> > is failure to allocate memory for the domain structure).
> 
> Is my statement about CPU being added to default group mask
> incorrect? Seems like a potential issue related to domain's
> CPU mask also.
> 
> Please see my earlier question. Existing code does not proceed
> with monitor initialization if control initialization fails and
> undoes control initialization if monitor initialization fails. 

Existing code silently continues if a domain structure cannot
be allocated to add a CPU to a domain:

503 static void domain_add_cpu(int cpu, struct rdt_resource *r)
504 {
505         int id = get_cpu_cacheinfo_id(cpu, r->cache_level);
506         struct list_head *add_pos = NULL;
507         struct rdt_hw_domain *hw_dom;
508         struct rdt_domain *d;
509         int err;

...

523
524         hw_dom = kzalloc_node(sizeof(*hw_dom), GFP_KERNEL, cpu_to_node(cpu));
525         if (!hw_dom)
526                 return;
527

> 
> > May be something to tackle in a future series if anyone
> > thinks this is a serious problem and has suggestions on
> > what to do. It seems like a catastrophic problem to not
> > have some CPUs in some/all domains of some resources.
> > Maybe this should disable mounting resctrl filesystem
> > completely?
> 
> It is not clear to me how this is catastrophic but I
> do not think resctrl should claim support for a resource
> on a CPU if it was not able to complete initialization

That's the status quo. See above code snippet.

-Tony

^ permalink raw reply	[flat|nested] 331+ messages in thread

* Re: [PATCH v4 5/7] x86/resctrl: Determine if Sub-NUMA Cluster is enabled and initialize.
  2023-08-28 17:05         ` Reinette Chatre
@ 2023-08-28 19:01           ` Tony Luck
  0 siblings, 0 replies; 331+ messages in thread
From: Tony Luck @ 2023-08-28 19:01 UTC (permalink / raw)
  To: Reinette Chatre
  Cc: Fenghua Yu, Peter Newman, Jonathan Corbet, Shuah Khan, x86,
	Shaopeng Tan, James Morse, Jamie Iles, Babu Moger, Randy Dunlap,
	linux-kernel, linux-doc, patches

On Mon, Aug 28, 2023 at 10:05:50AM -0700, Reinette Chatre wrote:
> Hi Tony,
> 
> On 8/25/2023 10:49 AM, Tony Luck wrote:
> > On Fri, Aug 11, 2023 at 10:32:29AM -0700, Reinette Chatre wrote:
> >> On 7/22/2023 12:07 PM, Tony Luck wrote:
> 
> ...
> 
> >>> +static const struct x86_cpu_id snc_cpu_ids[] __initconst = {
> >>> +	X86_MATCH_INTEL_FAM6_MODEL(ICELAKE_X, 0),
> >>> +	X86_MATCH_INTEL_FAM6_MODEL(SAPPHIRERAPIDS_X, 0),
> >>> +	X86_MATCH_INTEL_FAM6_MODEL(EMERALDRAPIDS_X, 0),
> >>> +	{}
> >>> +};
> >>> +
> >>> +/*
> >>> + * There isn't a simple enumeration bit to show whether SNC mode
> >>> + * is enabled. Look at the ratio of number of NUMA nodes to the
> >>> + * number of distinct L3 caches. Take care to skip memory-only nodes.
> >>> + */
> >>> +static __init int get_snc_config(void)
> >>> +{
> >>> +	unsigned long *node_caches;
> >>> +	int mem_only_nodes = 0;
> >>> +	int cpu, node, ret;
> >>> +
> >>> +	if (!x86_match_cpu(snc_cpu_ids))
> >>> +		return 1;
> >>> +
> >>> +	node_caches = kcalloc(BITS_TO_LONGS(nr_node_ids), sizeof(*node_caches), GFP_KERNEL);
> >>> +	if (!node_caches)
> >>> +		return 1;
> >>> +
> >>> +	cpus_read_lock();
> >>> +	for_each_node(node) {
> >>> +		cpu = cpumask_first(cpumask_of_node(node));
> >>> +		if (cpu < nr_cpu_ids)
> >>> +			set_bit(get_cpu_cacheinfo_id(cpu, 3), node_caches);
> >>> +		else
> >>> +			mem_only_nodes++;
> >>> +	}
> >>> +	cpus_read_unlock();
> >>
> >> I am not familiar with the numa code at all so please correct me
> >> where I am wrong. I do see that nr_node_ids is initialized with __init code
> >> so it should be accurate at this point. It looks to me like this initialization
> >> assumes that at least one CPU per node will be online at the time it is run.
> >> It is not clear to me that this assumption would always be true. 
> > 
> > Resctrl initialization is kicked off as a late_initcall(). So all CPUs
> > and devices are fully initialized before this code runs.
> > 
> > Resctrl can't be moved to an "init" state before CPUs are brought online
> > because it makes a call to cpuhp_setup_state() to get callbacks for
> > online/offline CPU events ... that call can't be done early.
> 
> Apologies but this is not so obvious to me. From what I understand a
> system need not be booted with all CPUs online. CPUs can be brought
> online at any time.

So you are concerned about the case where the system was booted with a
maxcpus=N boot argument, and additional CPUs are brought online later?

My code will fail to detect SNC mode if someone does that. I don't see
any possible way to recover from this in a safe way later when additional
CPUs are brought online. Those CPUs could be brought online in any order
by the user, and there is no way to know when they are all done. Without an
explicit enumeration of SNC mode there is no fool-proof way to detect
SNC mode in the face of CPU hot plug post-boot.

I could add a pr_warn() into this code:

 915 static int __init nrcpus(char *str)
 916 {
 917         int nr_cpus;
 918
 919         if (get_option(&str, &nr_cpus) && nr_cpus > 0 && nr_cpus < nr_cpu_ids)
 920                 set_nr_cpu_ids(nr_cpus);
 921
 922         return 0;
 923 }
 924
 925 early_param("nr_cpus", nrcpus);

But this feels like a "user shot themself in the foot" case. I don't
think it necessary to add checks and warnings for every possible strange
way a user could configure a system.
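
(If we did want it, this is roughly all I have in mind -- purely a sketch
to illustrate, not something I plan to post:)

static int __init nrcpus(char *str)
{
	int nr_cpus;

	if (get_option(&str, &nr_cpus) && nr_cpus > 0 && nr_cpus < nr_cpu_ids) {
		set_nr_cpu_ids(nr_cpus);
		/*
		 * Hypothetical warning: resctrl SNC detection counts NUMA
		 * nodes vs. L3 caches at late_initcall time, so a reduced
		 * CPU count can fool it.
		 */
		pr_warn("nr_cpus=%d may defeat Sub-NUMA Cluster detection\n",
			nr_cpus);
	}

	return 0;
}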

> >>> +
> >>> +	ret = (nr_node_ids - mem_only_nodes) / bitmap_weight(node_caches, nr_node_ids);
> >>> +	kfree(node_caches);
> >>> +
> >>> +	if (ret > 1)
> >>> +		rdt_resources_all[RDT_RESOURCE_L3].r_resctrl.mon_scope = MON_SCOPE_NODE;
> >>> +
> >>> +	return ret;
> >>> +}
> >>> +

-Tony

^ permalink raw reply	[flat|nested] 331+ messages in thread

* Re: [PATCH v4 7/7] selftests/resctrl: Adjust effective L3 cache size when SNC enabled
  2023-08-28 17:06         ` Reinette Chatre
@ 2023-08-28 19:06           ` Tony Luck
  0 siblings, 0 replies; 331+ messages in thread
From: Tony Luck @ 2023-08-28 19:06 UTC (permalink / raw)
  To: Reinette Chatre
  Cc: Fenghua Yu, Peter Newman, Jonathan Corbet, Shuah Khan, x86,
	Shaopeng Tan, James Morse, Jamie Iles, Babu Moger, Randy Dunlap,
	linux-kernel, linux-doc, patches

On Mon, Aug 28, 2023 at 10:06:32AM -0700, Reinette Chatre wrote:
> Hi Tony,
> 
> On 8/25/2023 10:56 AM, Tony Luck wrote:
> > On Fri, Aug 11, 2023 at 10:33:43AM -0700, Reinette Chatre wrote:
> >> Hi Tony,
> >>
> >> On 7/22/2023 12:07 PM, Tony Luck wrote:
> 
> ...
> 
> >>> @@ -190,6 +245,8 @@ int get_cache_size(int cpu_no, char *cache_type, unsigned long *cache_size)
> >>>  			break;
> >>>  	}
> >>>  
> >>> +	if (cache_num == 3)
> >>> +		*cache_size /= snc_ways();
> >>>  	return 0;
> >>>  }
> >>>  
> >>
> >> I am surprised that this small change is sufficient. The resctrl
> >> selftests are definitely not NUMA aware and the CAT and CMT tests
> >> are not taking that into account when picking CPUs to run on. From
> >> what I understand LLC occupancy counters need to be added in this
> >> scenario but I do not see that done either.
> > 
> > This is a first step (the tests are definitely going to fail if
> > they have incorrect information about the cache size).
> > 
> > For a fully reliable set of tests some major surgery will be required
> > to bind to CPUs and memory to control allocation and access.
> > 
> 
> What is the plan for making the tests more reliable? What is the
> use of this patch if it is just the first step?

Reinette,

I have no immediate plan to re-architect the resctrl self-tests.
If you feel this step towards a solution is useless unless it is part
of a complete solution, then I can drop it from this series.

-Tony

^ permalink raw reply	[flat|nested] 331+ messages in thread

* Re: [PATCH v4 1/7] x86/resctrl: Create separate domains for control and monitoring
  2023-08-28 18:46           ` Tony Luck
@ 2023-08-28 19:56             ` Reinette Chatre
  2023-08-28 20:59               ` Tony Luck
  0 siblings, 1 reply; 331+ messages in thread
From: Reinette Chatre @ 2023-08-28 19:56 UTC (permalink / raw)
  To: Tony Luck
  Cc: Fenghua Yu, Peter Newman, Jonathan Corbet, Shuah Khan, x86,
	Shaopeng Tan, James Morse, Jamie Iles, Babu Moger, Randy Dunlap,
	linux-kernel, linux-doc, patches

Hi Tony,

On 8/28/2023 11:46 AM, Tony Luck wrote:
> On Mon, Aug 28, 2023 at 10:05:31AM -0700, Reinette Chatre wrote:
>> Hi Tony,
>>
>> On 8/25/2023 10:05 AM, Tony Luck wrote:
>>> On Fri, Aug 11, 2023 at 10:29:25AM -0700, Reinette Chatre wrote:
>>>> On 7/22/2023 12:07 PM, Tony Luck wrote:
>>
>>>>> Change all places where monitoring functions walk the list of
>>>>> domains to use the new "mondomains" list instead of the old
>>>>> "domains" list.
>>>>
>>>> I would not refer to it as "the old domains list" as it creates
>>>> impression that this is being replaced. The changelog makes
>>>> no mention that domains list will remain and be dedicated to
>>>> control domains. I think this is important to include in description
>>>> of this change.
>>>
>>> I've rewritten the entire commit message incorporating your suggestions.
>>> V6 will be posted soon (after I get some time on an SNC SPR to check
>>> that it all works!)
>>
>> I seem to have missed v5.
> 
> I simply can't count. You are correct that next version to be posted
> will be v5.
> 
>>>
>>>>
>>>>>
>>>>> Signed-off-by: Tony Luck <tony.luck@intel.com>
>>>>> ---
>>>>>  include/linux/resctrl.h                   |  10 +-
>>>>>  arch/x86/kernel/cpu/resctrl/internal.h    |   2 +-
>>>>>  arch/x86/kernel/cpu/resctrl/core.c        | 195 +++++++++++++++-------
>>>>>  arch/x86/kernel/cpu/resctrl/ctrlmondata.c |   2 +-
>>>>>  arch/x86/kernel/cpu/resctrl/monitor.c     |   2 +-
>>>>>  arch/x86/kernel/cpu/resctrl/rdtgroup.c    |  30 ++--
>>>>>  6 files changed, 167 insertions(+), 74 deletions(-)
>>>>>
>>>>> diff --git a/include/linux/resctrl.h b/include/linux/resctrl.h
>>>>> index 8334eeacfec5..1267d56f9e76 100644
>>>>> --- a/include/linux/resctrl.h
>>>>> +++ b/include/linux/resctrl.h
>>>>> @@ -151,9 +151,11 @@ struct resctrl_schema;
>>>>>   * @mon_capable:	Is monitor feature available on this machine
>>>>>   * @num_rmid:		Number of RMIDs available
>>>>>   * @cache_level:	Which cache level defines scope of this resource
>>>>> + * @mon_scope:		Scope of this resource if different from cache_level
>>>>
>>>> I think this addition should be deferred. As it is here it the "if different
>>>> from cache_level" also creates many questions (when will it be different?
>>>> how will it be determined that the scope is different in order to know that
>>>> mon_scope should be used?)
>>>
>>> I've gone in a different direction. V6 renames "cache_level" to
>>> "ctrl_scope". I think this makes intent clear from step #1.
>>>
>>
>> This change is not clear to me. Previously you changed this name
>> but kept using it in code specific to cache levels. It is not clear
>> to me how this time's rename would be different.
> 
> The current "cache_level" field in the structure is describing
> the scope of each instance using the cache level (2 or 3) as the
> method to describe which CPUs are considered part of a group. Currently
> the scope is the same for both control and monitor resources.

Right.

> 
> Would you like to see patches in this progrssion:
> 
> 1) Rename "cache_level" to "scope". With commit comment that future
> patches are going to base the scope on NUMA nodes in addtion to sharing
> caches at particular levels, and will split into separate control and
> monitor scope.
> 
> 2) Split the "scope" field from first patch into "ctrl_scope" and
> "mon_scope" (also with the addition of the new list for the mon_scope).
> 
> 3) Add "node" as a new option for scope in addtion to L3 and L2 cache.
> 

hmmm - my comment cannot be addressed through patch re-ordering.
If I understand correctly you plan to change the name of "cache_level"
to "ctrl_scope". My comment is that this obfuscates the code as long as
you use this variable to compare against data that can only represent cache
levels. This just repeats what I wrote in
https://lore.kernel.org/lkml/09847c37-66d7-c286-a313-308eaa338c64@intel.com/


>> ...
>>
>>>>> @@ -502,61 +593,19 @@ static int arch_domain_mbm_alloc(u32 num_rmid, struct rdt_hw_domain *hw_dom)
>>>>>   */
>>>>>  static void domain_add_cpu(int cpu, struct rdt_resource *r)
>>>>>  {
>>>>> -	int id = get_cpu_cacheinfo_id(cpu, r->cache_level);
>>>>> -	struct list_head *add_pos = NULL;
>>>>> -	struct rdt_hw_domain *hw_dom;
>>>>> -	struct rdt_domain *d;
>>>>> -	int err;
>>>>> -
>>>>> -	d = rdt_find_domain(r, id, &add_pos);
>>>>> -	if (IS_ERR(d)) {
>>>>> -		pr_warn("Couldn't find cache id for CPU %d\n", cpu);
>>>>> -		return;
>>>>> -	}
>>>>> -
>>>>> -	if (d) {
>>>>> -		cpumask_set_cpu(cpu, &d->cpu_mask);
>>>>> -		if (r->cache.arch_has_per_cpu_cfg)
>>>>> -			rdt_domain_reconfigure_cdp(r);
>>>>> -		return;
>>>>> -	}
>>>>> -
>>>>> -	hw_dom = kzalloc_node(sizeof(*hw_dom), GFP_KERNEL, cpu_to_node(cpu));
>>>>> -	if (!hw_dom)
>>>>> -		return;
>>>>> -
>>>>> -	d = &hw_dom->d_resctrl;
>>>>> -	d->id = id;
>>>>> -	cpumask_set_cpu(cpu, &d->cpu_mask);
>>>>> -
>>>>> -	rdt_domain_reconfigure_cdp(r);
>>>>> -
>>>>> -	if (r->alloc_capable && domain_setup_ctrlval(r, d)) {
>>>>> -		domain_free(hw_dom);
>>>>> -		return;
>>>>> -	}
>>>>> -
>>>>> -	if (r->mon_capable && arch_domain_mbm_alloc(r->num_rmid, hw_dom)) {
>>>>> -		domain_free(hw_dom);
>>>>> -		return;
>>>>> -	}
>>>>> -
>>>>> -	list_add_tail(&d->list, add_pos);
>>>>> -
>>>>> -	err = resctrl_online_domain(r, d);
>>>>> -	if (err) {
>>>>> -		list_del(&d->list);
>>>>> -		domain_free(hw_dom);
>>>>> -	}
>>>>> +	if (r->alloc_capable)
>>>>> +		domain_add_cpu_ctrl(cpu, r);
>>>>> +	if (r->mon_capable)
>>>>> +		domain_add_cpu_mon(cpu, r);
>>>>>  }
>>>>
>>>> A resource could be both alloc and mon capable ... both
>>>> domain_add_cpu_ctrl() and domain_add_cpu_mon() can fail.
>>>> Should domain_add_cpu_mon() still be run for a CPU if
>>>> domain_add_cpu_ctrl() failed? 
>>>>
>>>> Looking ahead the CPU should probably also not be added
>>>> to the default groups mask if a failure occurred.
>>>
>>> Existing code doesn't do anything for the case where a CPU
>>> can't be added to a domain (probably the only real error case
>>> is failure to allocate memory for the domain structure).
>>
>> Is my statement about CPU being added to default group mask
>> incorrect? Seems like a potential issue related to domain's
>> CPU mask also.
>>
>> Please see my earlier question. Existing code does not proceed
>> with monitor initialization if control initialization fails and
>> undoes control initialization if monitor initialization fails. 
> 
> Existing code silently continues if a domain structure cannot
> be allocated to add a CPU to a domain:
> 
> 503 static void domain_add_cpu(int cpu, struct rdt_resource *r)
> 504 {
> 505         int id = get_cpu_cacheinfo_id(cpu, r->cache_level);
> 506         struct list_head *add_pos = NULL;
> 507         struct rdt_hw_domain *hw_dom;
> 508         struct rdt_domain *d;
> 509         int err;
> 
> ...
> 
> 523
> 524         hw_dom = kzalloc_node(sizeof(*hw_dom), GFP_KERNEL, cpu_to_node(cpu));
> 525         if (!hw_dom)
> 526                 return;
> 527
> 


Right ... and if it returns silently as above it runs:

static int resctrl_online_cpu(unsigned int cpu)
{


	for_each_capable_rdt_resource(r)
		domain_add_cpu(cpu, r);
	>>>>> cpumask_set_cpu(cpu, &rdtgroup_default.cpu_mask); <<<<<<<<

}

Also, note within domain_add_cpu():

static void domain_add_cpu(int cpu, struct rdt_resource *r)
{


	...
	if (r->alloc_capable && domain_setup_ctrlval(r, d)) {
		domain_free(hw_dom);
		return;
	}

	if (r->mon_capable && arch_domain_mbm_alloc(r->num_rmid, hw_dom)) {
		domain_free(hw_dom);
		return;
	}

	...
}

The above is the other item that I've been trying to discuss
with you. Note that existing resctrl will not initialize monitoring if
control could not be initialized.
Compare with this submission:

	if (r->alloc_capable)
		domain_add_cpu_ctrl(cpu, r);
	if (r->mon_capable)
		domain_add_cpu_mon(cpu, r);



>>
>>> May be something to tackle in a future series if anyone
>>> thinks this is a serious problem and has suggestions on
>>> what to do. It seems like a catastrophic problem to not
>>> have some CPUs in some/all domains of some resources.
>>> Maybe this should disable mounting resctrl filesystem
>>> completely?
>>
>> It is not clear to me how this is catastrophic but I
>> do not think resctrl should claim support for a resource
>> on a CPU if it was not able to complete initialization
> 
> That's the status quo. See above code snippet.

Could you please elaborate what you mean with status quo?

Reinette


^ permalink raw reply	[flat|nested] 331+ messages in thread

* Re: [PATCH v4 1/7] x86/resctrl: Create separate domains for control and monitoring
  2023-08-28 19:56             ` Reinette Chatre
@ 2023-08-28 20:59               ` Tony Luck
  2023-08-28 21:20                 ` Reinette Chatre
  0 siblings, 1 reply; 331+ messages in thread
From: Tony Luck @ 2023-08-28 20:59 UTC (permalink / raw)
  To: Reinette Chatre
  Cc: Fenghua Yu, Peter Newman, Jonathan Corbet, Shuah Khan, x86,
	Shaopeng Tan, James Morse, Jamie Iles, Babu Moger, Randy Dunlap,
	linux-kernel, linux-doc, patches

On Mon, Aug 28, 2023 at 12:56:27PM -0700, Reinette Chatre wrote:
> Hi Tony,
> 
> On 8/28/2023 11:46 AM, Tony Luck wrote:
> > On Mon, Aug 28, 2023 at 10:05:31AM -0700, Reinette Chatre wrote:
> >> Hi Tony,
> >>
> >> On 8/25/2023 10:05 AM, Tony Luck wrote:
> >>> On Fri, Aug 11, 2023 at 10:29:25AM -0700, Reinette Chatre wrote:
> >>>> On 7/22/2023 12:07 PM, Tony Luck wrote:
> >>
> >>>>> Change all places where monitoring functions walk the list of
> >>>>> domains to use the new "mondomains" list instead of the old
> >>>>> "domains" list.
> >>>>
> >>>> I would not refer to it as "the old domains list" as it creates
> >>>> impression that this is being replaced. The changelog makes
> >>>> no mention that domains list will remain and be dedicated to
> >>>> control domains. I think this is important to include in description
> >>>> of this change.
> >>>
> >>> I've rewritten the entire commit message incorporating your suggestions.
> >>> V6 will be posted soon (after I get some time on an SNC SPR to check
> >>> that it all works!)
> >>
> >> I seem to have missed v5.
> > 
> > I simply can't count. You are correct that next version to be posted
> > will be v5.
> > 
> >>>
> >>>>
> >>>>>
> >>>>> Signed-off-by: Tony Luck <tony.luck@intel.com>
> >>>>> ---
> >>>>>  include/linux/resctrl.h                   |  10 +-
> >>>>>  arch/x86/kernel/cpu/resctrl/internal.h    |   2 +-
> >>>>>  arch/x86/kernel/cpu/resctrl/core.c        | 195 +++++++++++++++-------
> >>>>>  arch/x86/kernel/cpu/resctrl/ctrlmondata.c |   2 +-
> >>>>>  arch/x86/kernel/cpu/resctrl/monitor.c     |   2 +-
> >>>>>  arch/x86/kernel/cpu/resctrl/rdtgroup.c    |  30 ++--
> >>>>>  6 files changed, 167 insertions(+), 74 deletions(-)
> >>>>>
> >>>>> diff --git a/include/linux/resctrl.h b/include/linux/resctrl.h
> >>>>> index 8334eeacfec5..1267d56f9e76 100644
> >>>>> --- a/include/linux/resctrl.h
> >>>>> +++ b/include/linux/resctrl.h
> >>>>> @@ -151,9 +151,11 @@ struct resctrl_schema;
> >>>>>   * @mon_capable:	Is monitor feature available on this machine
> >>>>>   * @num_rmid:		Number of RMIDs available
> >>>>>   * @cache_level:	Which cache level defines scope of this resource
> >>>>> + * @mon_scope:		Scope of this resource if different from cache_level
> >>>>
> >>>> I think this addition should be deferred. As it is here it the "if different
> >>>> from cache_level" also creates many questions (when will it be different?
> >>>> how will it be determined that the scope is different in order to know that
> >>>> mon_scope should be used?)
> >>>
> >>> I've gone in a different direction. V6 renames "cache_level" to
> >>> "ctrl_scope". I think this makes intent clear from step #1.
> >>>
> >>
> >> This change is not clear to me. Previously you changed this name
> >> but kept using it in code specific to cache levels. It is not clear
> >> to me how this time's rename would be different.
> > 
> > The current "cache_level" field in the structure is describing
> > the scope of each instance using the cache level (2 or 3) as the
> > method to describe which CPUs are considered part of a group. Currently
> > the scope is the same for both control and monitor resources.
> 
> Right.
> 
> > 
> > Would you like to see patches in this progrssion:
> > 
> > 1) Rename "cache_level" to "scope". With commit comment that future
> > patches are going to base the scope on NUMA nodes in addtion to sharing
> > caches at particular levels, and will split into separate control and
> > monitor scope.
> > 
> > 2) Split the "scope" field from first patch into "ctrl_scope" and
> > "mon_scope" (also with the addition of the new list for the mon_scope).
> > 
> > 3) Add "node" as a new option for scope in addtion to L3 and L2 cache.
> > 
> 
> hmmm - my comment cannot be addressed through patch re-ordering.
> If I understand correctly you plan to change the name of "cache_level"
> to "ctrl_scope". My comment is that this obfuscates the code as long as
> you use this variable to compare against data that can only represent cache
> levels. This just repeats what I wrote in
> https://lore.kernel.org/lkml/09847c37-66d7-c286-a313-308eaa338c64@intel.com/

I'm proposing more than just re-ordering. The above sequence is a
couple of extra patches in the series.

Existing state of code:

	There is a single field named "cache_level" that describes how
	CPUs are assigned to domains based on their sharing of a cache
	at a particular level. Hard-coded values of "2" and "3" are used
	to describe the level. This is just a scope description of which
	CPUs are grouped together. But it is limited to just doing so
	based on which caches are shared by those CPUs.

Step 1:

	Change the name of the field s/cache_level/scope/. Provide an
	enum with values RESCTRL_L3_CACHE, RESCTRL_L2_CACHE and use
	those throughout the code instead of 3, 2, and implicitly passing
	the resctrl scope to functions like get_cpu_cacheinfo_id().

	Add get_domain_id_from_scope() function that takes the enum
	values for scope and converts to "3", "2" to pass to
	get_cpu_cacheinfo_id().

	No functional change. Just broadening the meaning of the field
	so that future patches can use it to describe scopes that
	aren't a function of sharing a cache of a particular level.
	(Rough sketch below, after step 3.)

Step 2:

	Break the single list of domains into two separate lists. One
	for control domains, one for monitor domains. Change the name
	of the "scope" field to "ctrl"scope", add a new field
	"mon_scope".

Step 3:

	Add RESCTRL_NODE as a new option in the enum for how CPUs are
	grouped into domains.
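
As a rough sketch (this is what step 1 looks like in the patches I have
queued -- exact names could still change), the new scope enum and helper
would be:

enum resctrl_scope {
	RESCTRL_L3_CACHE,
	RESCTRL_L2_CACHE,
};

/* Map a resource's scope to the domain id for this CPU */
static int get_domain_id_from_scope(int cpu, enum resctrl_scope scope)
{
	switch (scope) {
	case RESCTRL_L3_CACHE:
		return get_cpu_cacheinfo_id(cpu, 3);
	case RESCTRL_L2_CACHE:
		return get_cpu_cacheinfo_id(cpu, 2);
	default:
		WARN_ON_ONCE(1);
		break;
	}

	return -1;
}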
> 
> 
> >> ...
> >>
> >>>>> @@ -502,61 +593,19 @@ static int arch_domain_mbm_alloc(u32 num_rmid, struct rdt_hw_domain *hw_dom)
> >>>>>   */
> >>>>>  static void domain_add_cpu(int cpu, struct rdt_resource *r)
> >>>>>  {
> >>>>> -	int id = get_cpu_cacheinfo_id(cpu, r->cache_level);
> >>>>> -	struct list_head *add_pos = NULL;
> >>>>> -	struct rdt_hw_domain *hw_dom;
> >>>>> -	struct rdt_domain *d;
> >>>>> -	int err;
> >>>>> -
> >>>>> -	d = rdt_find_domain(r, id, &add_pos);
> >>>>> -	if (IS_ERR(d)) {
> >>>>> -		pr_warn("Couldn't find cache id for CPU %d\n", cpu);
> >>>>> -		return;
> >>>>> -	}
> >>>>> -
> >>>>> -	if (d) {
> >>>>> -		cpumask_set_cpu(cpu, &d->cpu_mask);
> >>>>> -		if (r->cache.arch_has_per_cpu_cfg)
> >>>>> -			rdt_domain_reconfigure_cdp(r);
> >>>>> -		return;
> >>>>> -	}
> >>>>> -
> >>>>> -	hw_dom = kzalloc_node(sizeof(*hw_dom), GFP_KERNEL, cpu_to_node(cpu));
> >>>>> -	if (!hw_dom)
> >>>>> -		return;
> >>>>> -
> >>>>> -	d = &hw_dom->d_resctrl;
> >>>>> -	d->id = id;
> >>>>> -	cpumask_set_cpu(cpu, &d->cpu_mask);
> >>>>> -
> >>>>> -	rdt_domain_reconfigure_cdp(r);
> >>>>> -
> >>>>> -	if (r->alloc_capable && domain_setup_ctrlval(r, d)) {
> >>>>> -		domain_free(hw_dom);
> >>>>> -		return;
> >>>>> -	}
> >>>>> -
> >>>>> -	if (r->mon_capable && arch_domain_mbm_alloc(r->num_rmid, hw_dom)) {
> >>>>> -		domain_free(hw_dom);
> >>>>> -		return;
> >>>>> -	}
> >>>>> -
> >>>>> -	list_add_tail(&d->list, add_pos);
> >>>>> -
> >>>>> -	err = resctrl_online_domain(r, d);
> >>>>> -	if (err) {
> >>>>> -		list_del(&d->list);
> >>>>> -		domain_free(hw_dom);
> >>>>> -	}
> >>>>> +	if (r->alloc_capable)
> >>>>> +		domain_add_cpu_ctrl(cpu, r);
> >>>>> +	if (r->mon_capable)
> >>>>> +		domain_add_cpu_mon(cpu, r);
> >>>>>  }
> >>>>
> >>>> A resource could be both alloc and mon capable ... both
> >>>> domain_add_cpu_ctrl() and domain_add_cpu_mon() can fail.
> >>>> Should domain_add_cpu_mon() still be run for a CPU if
> >>>> domain_add_cpu_ctrl() failed? 
> >>>>
> >>>> Looking ahead the CPU should probably also not be added
> >>>> to the default groups mask if a failure occurred.
> >>>
> >>> Existing code doesn't do anything for the case where a CPU
> >>> can't be added to a domain (probably the only real error case
> >>> is failure to allocate memory for the domain structure).
> >>
> >> Is my statement about CPU being added to default group mask
> >> incorrect? Seems like a potential issue related to domain's
> >> CPU mask also.
> >>
> >> Please see my earlier question. Existing code does not proceed
> >> with monitor initialization if control initialization fails and
> >> undoes control initialization if monitor initialization fails. 
> > 
> > Existing code silently continues if a domain structure cannot
> > be allocated to add a CPU to a domain:
> > 
> > 503 static void domain_add_cpu(int cpu, struct rdt_resource *r)
> > 504 {
> > 505         int id = get_cpu_cacheinfo_id(cpu, r->cache_level);
> > 506         struct list_head *add_pos = NULL;
> > 507         struct rdt_hw_domain *hw_dom;
> > 508         struct rdt_domain *d;
> > 509         int err;
> > 
> > ...
> > 
> > 523
> > 524         hw_dom = kzalloc_node(sizeof(*hw_dom), GFP_KERNEL, cpu_to_node(cpu));
> > 525         if (!hw_dom)
> > 526                 return;
> > 527
> > 
> 
> 
> Right ... and if it returns silently as above it runs:
> 
> static int resctrl_online_cpu(unsigned int cpu)
> {
> 
> 
> 	for_each_capable_rdt_resource(r)
> 		domain_add_cpu(cpu, r);
> 	>>>>> cpumask_set_cpu(cpu, &rdtgroup_default.cpu_mask); <<<<<<<<
> 
> }
> 
> Also, note within domain_add_cpu():
> 
> static void domain_add_cpu(int cpu, struct rdt_resource *r)
> {
> 
> 
> 	...
> 	if (r->alloc_capable && domain_setup_ctrlval(r, d)) {
> 		domain_free(hw_dom);
> 		return;
> 	}
> 
> 	if (r->mon_capable && arch_domain_mbm_alloc(r->num_rmid, hw_dom)) {
> 		domain_free(hw_dom);
> 		return;
> 	}
> 
> 	...
> }
> 
> The above is the other item that I've been trying to discuss
> with you. Note that existing resctrl will not initialize monitoring if
> control could not be initialized.
> Compare with this submission:
> 
> 	if (r->alloc_capable)
> 		domain_add_cpu_ctrl(cpu, r);
> 	if (r->mon_capable)
> 		domain_add_cpu_mon(cpu, r);
> 
> 
> 
> >>
> >>> May be something to tackle in a future series if anyone
> >>> thinks this is a serious problem and has suggestions on
> >>> what to do. It seems like a catastrophic problem to not
> >>> have some CPUs in some/all domains of some resources.
> >>> Maybe this should disable mounting resctrl filesystem
> >>> completely?
> >>
> >> It is not clear to me how this is catastrophic but I
> >> do not think resctrl should claim support for a resource
> >> on a CPU if it was not able to complete initialization
> > 
> > That's the status quo. See above code snippet.
> 
> Could you please elaborate what you mean with status quo?

Status quo: the system may be booting and adding a bunch of CPUs to the
first domain for a resource. When it finds a CPU from a different
domain it calls domain_add_cpu(), but for some reason the allocation
of the rdt_hw_domain structure fails here:

524         hw_dom = kzalloc_node(sizeof(*hw_dom), GFP_KERNEL, cpu_to_node(cpu));
525         if (!hw_dom)
526                 return;

domain_add_cpu() returns with no message to the console, no error code.
System boots, but this CPU is missing from the domain (likely the whole
domain is missing as the same allocation will most likely fail for the
remainder of the CPUs in this domain). The only indication to the user is
when they read the schemata file and see something like this:

$ cat /sys/fs/resctrl/schemata
    MB:0= 100
    L3:0=7fff;1=7fff

(assuming that all domains were successfully allocated for L3, and the
failure described above happened on the allocation for the second domain
of the MB resource.)

-Tony

^ permalink raw reply	[flat|nested] 331+ messages in thread

* Re: [PATCH v4 1/7] x86/resctrl: Create separate domains for control and monitoring
  2023-08-28 20:59               ` Tony Luck
@ 2023-08-28 21:20                 ` Reinette Chatre
  2023-08-28 22:12                   ` Tony Luck
  0 siblings, 1 reply; 331+ messages in thread
From: Reinette Chatre @ 2023-08-28 21:20 UTC (permalink / raw)
  To: Tony Luck
  Cc: Fenghua Yu, Peter Newman, Jonathan Corbet, Shuah Khan, x86,
	Shaopeng Tan, James Morse, Jamie Iles, Babu Moger, Randy Dunlap,
	linux-kernel, linux-doc, patches



On 8/28/2023 1:59 PM, Tony Luck wrote:
> On Mon, Aug 28, 2023 at 12:56:27PM -0700, Reinette Chatre wrote:
>> Hi Tony,
>>
>> On 8/28/2023 11:46 AM, Tony Luck wrote:
>>> On Mon, Aug 28, 2023 at 10:05:31AM -0700, Reinette Chatre wrote:
>>>> Hi Tony,
>>>>
>>>> On 8/25/2023 10:05 AM, Tony Luck wrote:
>>>>> On Fri, Aug 11, 2023 at 10:29:25AM -0700, Reinette Chatre wrote:
>>>>>> On 7/22/2023 12:07 PM, Tony Luck wrote:
>>>>
>>>>>>> Change all places where monitoring functions walk the list of
>>>>>>> domains to use the new "mondomains" list instead of the old
>>>>>>> "domains" list.
>>>>>>
>>>>>> I would not refer to it as "the old domains list" as it creates
>>>>>> impression that this is being replaced. The changelog makes
>>>>>> no mention that domains list will remain and be dedicated to
>>>>>> control domains. I think this is important to include in description
>>>>>> of this change.
>>>>>
>>>>> I've rewritten the entire commit message incorporating your suggestions.
>>>>> V6 will be posted soon (after I get some time on an SNC SPR to check
>>>>> that it all works!)
>>>>
>>>> I seem to have missed v5.
>>>
>>> I simply can't count. You are correct that next version to be posted
>>> will be v5.
>>>
>>>>>
>>>>>>
>>>>>>>
>>>>>>> Signed-off-by: Tony Luck <tony.luck@intel.com>
>>>>>>> ---
>>>>>>>  include/linux/resctrl.h                   |  10 +-
>>>>>>>  arch/x86/kernel/cpu/resctrl/internal.h    |   2 +-
>>>>>>>  arch/x86/kernel/cpu/resctrl/core.c        | 195 +++++++++++++++-------
>>>>>>>  arch/x86/kernel/cpu/resctrl/ctrlmondata.c |   2 +-
>>>>>>>  arch/x86/kernel/cpu/resctrl/monitor.c     |   2 +-
>>>>>>>  arch/x86/kernel/cpu/resctrl/rdtgroup.c    |  30 ++--
>>>>>>>  6 files changed, 167 insertions(+), 74 deletions(-)
>>>>>>>
>>>>>>> diff --git a/include/linux/resctrl.h b/include/linux/resctrl.h
>>>>>>> index 8334eeacfec5..1267d56f9e76 100644
>>>>>>> --- a/include/linux/resctrl.h
>>>>>>> +++ b/include/linux/resctrl.h
>>>>>>> @@ -151,9 +151,11 @@ struct resctrl_schema;
>>>>>>>   * @mon_capable:	Is monitor feature available on this machine
>>>>>>>   * @num_rmid:		Number of RMIDs available
>>>>>>>   * @cache_level:	Which cache level defines scope of this resource
>>>>>>> + * @mon_scope:		Scope of this resource if different from cache_level
>>>>>>
>>>>>> I think this addition should be deferred. As it is here it the "if different
>>>>>> from cache_level" also creates many questions (when will it be different?
>>>>>> how will it be determined that the scope is different in order to know that
>>>>>> mon_scope should be used?)
>>>>>
>>>>> I've gone in a different direction. V6 renames "cache_level" to
>>>>> "ctrl_scope". I think this makes intent clear from step #1.
>>>>>
>>>>
>>>> This change is not clear to me. Previously you changed this name
>>>> but kept using it in code specific to cache levels. It is not clear
>>>> to me how this time's rename would be different.
>>>
>>> The current "cache_level" field in the structure is describing
>>> the scope of each instance using the cache level (2 or 3) as the
>>> method to describe which CPUs are considered part of a group. Currently
>>> the scope is the same for both control and monitor resources.
>>
>> Right.
>>
>>>
>>> Would you like to see patches in this progrssion:
>>>
>>> 1) Rename "cache_level" to "scope". With commit comment that future
>>> patches are going to base the scope on NUMA nodes in addtion to sharing
>>> caches at particular levels, and will split into separate control and
>>> monitor scope.
>>>
>>> 2) Split the "scope" field from first patch into "ctrl_scope" and
>>> "mon_scope" (also with the addition of the new list for the mon_scope).
>>>
>>> 3) Add "node" as a new option for scope in addtion to L3 and L2 cache.
>>>
>>
>> hmmm - my comment cannot be addressed through patch re-ordering.
>> If I understand correctly you plan to change the name of "cache_level"
>> to "ctrl_scope". My comment is that this obfuscates the code as long as
>> you use this variable to compare against data that can only represent cache
>> levels. This just repeats what I wrote in
>> https://lore.kernel.org/lkml/09847c37-66d7-c286-a313-308eaa338c64@intel.com/
> 
> I'm proposing more than just re-ordering. The above sequence is a
> couple of extra patches in the series.
> 
> Existing state of code:
> 
> 	There is a single field named "cache_level" that describes how
> 	CPUs are assigned to domains based on their sharing of a cache
> 	at a particular level. Hard-coded values of "2" and "3" are used
> 	to describe the level. This is just a scope description of which
> 	CPUs are grouped together. But it is limited to just doing so
> 	based on which caches are shared by those CPUs.
> 
> Step 1:
> 
> 	Change the name of the field s/cache_level/scope/. Provide an
> 	enum with values RESCTRL_L3_CACHE, RESCTRL_L2_CACHE aand use
> 	those througout code instead of 3, 2, and implictly passing
> 	the resctrl scope to functions like get_cpu_cacheinfo_id()
> 
> 	Add get_domain_id_from_scope() function that takes the enum
> 	values for scope and converts to "3", "2" to pass to
> 	get_cpu_cacheinfo_id().
> 
> 	No functional change. Just broadening the meaning of the field
> 	so that it can in future patches to describe scopes that
> 	aren't a function of sharing a cache of a particular level.

Right. So if I understand you are planning the same change as you did
in V3, like below, and my original comment still stands.

 @@ -1348,7 +1348,7 @@ unsigned int rdtgroup_cbm_to_size(struct rdt_resource *r,
  	num_b = bitmap_weight(&cbm, r->cache.cbm_len);
  	ci = get_cpu_cacheinfo(cpumask_any(&d->cpu_mask));
  	for (i = 0; i < ci->num_leaves; i++) {
 -		if (ci->info_list[i].level == r->cache_level) {
 +		if (ci->info_list[i].level == r->scope) {
  			size = ci->info_list[i].size / r->cache.cbm_len * num_b;
  			break;
  		}


>>>>>>> @@ -502,61 +593,19 @@ static int arch_domain_mbm_alloc(u32 num_rmid, struct rdt_hw_domain *hw_dom)
>>>>>>>   */
>>>>>>>  static void domain_add_cpu(int cpu, struct rdt_resource *r)
>>>>>>>  {
>>>>>>> -	int id = get_cpu_cacheinfo_id(cpu, r->cache_level);
>>>>>>> -	struct list_head *add_pos = NULL;
>>>>>>> -	struct rdt_hw_domain *hw_dom;
>>>>>>> -	struct rdt_domain *d;
>>>>>>> -	int err;
>>>>>>> -
>>>>>>> -	d = rdt_find_domain(r, id, &add_pos);
>>>>>>> -	if (IS_ERR(d)) {
>>>>>>> -		pr_warn("Couldn't find cache id for CPU %d\n", cpu);
>>>>>>> -		return;
>>>>>>> -	}
>>>>>>> -
>>>>>>> -	if (d) {
>>>>>>> -		cpumask_set_cpu(cpu, &d->cpu_mask);
>>>>>>> -		if (r->cache.arch_has_per_cpu_cfg)
>>>>>>> -			rdt_domain_reconfigure_cdp(r);
>>>>>>> -		return;
>>>>>>> -	}
>>>>>>> -
>>>>>>> -	hw_dom = kzalloc_node(sizeof(*hw_dom), GFP_KERNEL, cpu_to_node(cpu));
>>>>>>> -	if (!hw_dom)
>>>>>>> -		return;
>>>>>>> -
>>>>>>> -	d = &hw_dom->d_resctrl;
>>>>>>> -	d->id = id;
>>>>>>> -	cpumask_set_cpu(cpu, &d->cpu_mask);
>>>>>>> -
>>>>>>> -	rdt_domain_reconfigure_cdp(r);
>>>>>>> -
>>>>>>> -	if (r->alloc_capable && domain_setup_ctrlval(r, d)) {
>>>>>>> -		domain_free(hw_dom);
>>>>>>> -		return;
>>>>>>> -	}
>>>>>>> -
>>>>>>> -	if (r->mon_capable && arch_domain_mbm_alloc(r->num_rmid, hw_dom)) {
>>>>>>> -		domain_free(hw_dom);
>>>>>>> -		return;
>>>>>>> -	}
>>>>>>> -
>>>>>>> -	list_add_tail(&d->list, add_pos);
>>>>>>> -
>>>>>>> -	err = resctrl_online_domain(r, d);
>>>>>>> -	if (err) {
>>>>>>> -		list_del(&d->list);
>>>>>>> -		domain_free(hw_dom);
>>>>>>> -	}
>>>>>>> +	if (r->alloc_capable)
>>>>>>> +		domain_add_cpu_ctrl(cpu, r);
>>>>>>> +	if (r->mon_capable)
>>>>>>> +		domain_add_cpu_mon(cpu, r);
>>>>>>>  }
>>>>>>
>>>>>> A resource could be both alloc and mon capable ... both
>>>>>> domain_add_cpu_ctrl() and domain_add_cpu_mon() can fail.
>>>>>> Should domain_add_cpu_mon() still be run for a CPU if
>>>>>> domain_add_cpu_ctrl() failed? 
>>>>>>
>>>>>> Looking ahead the CPU should probably also not be added
>>>>>> to the default groups mask if a failure occurred.
>>>>>
>>>>> Existing code doesn't do anything for the case where a CPU
>>>>> can't be added to a domain (probably the only real error case
>>>>> is failure to allocate memory for the domain structure).
>>>>
>>>> Is my statement about CPU being added to default group mask
>>>> incorrect? Seems like a potential issue related to domain's
>>>> CPU mask also.
>>>>
>>>> Please see my earlier question. Existing code does not proceed
>>>> with monitor initialization if control initialization fails and
>>>> undoes control initialization if monitor initialization fails. 
>>>
>>> Existing code silently continues if a domain structure cannot
>>> be allocated to add a CPU to a domain:
>>>
>>> 503 static void domain_add_cpu(int cpu, struct rdt_resource *r)
>>> 504 {
>>> 505         int id = get_cpu_cacheinfo_id(cpu, r->cache_level);
>>> 506         struct list_head *add_pos = NULL;
>>> 507         struct rdt_hw_domain *hw_dom;
>>> 508         struct rdt_domain *d;
>>> 509         int err;
>>>
>>> ...
>>>
>>> 523
>>> 524         hw_dom = kzalloc_node(sizeof(*hw_dom), GFP_KERNEL, cpu_to_node(cpu));
>>> 525         if (!hw_dom)
>>> 526                 return;
>>> 527
>>>
>>
>>
>> Right ... and if it returns silently as above it runs:
>>
>> static int resctrl_online_cpu(unsigned int cpu)
>> {
>>
>>
>> 	for_each_capable_rdt_resource(r)
>> 		domain_add_cpu(cpu, r);
>> 	>>>>> cpumask_set_cpu(cpu, &rdtgroup_default.cpu_mask); <<<<<<<<
>>
>> }
>>
>> Also, note within domain_add_cpu():
>>
>> static void domain_add_cpu(int cpu, struct rdt_resource *r)
>> {
>>
>>
>> 	...
>> 	if (r->alloc_capable && domain_setup_ctrlval(r, d)) {
>> 		domain_free(hw_dom);
>> 		return;
>> 	}
>>
>> 	if (r->mon_capable && arch_domain_mbm_alloc(r->num_rmid, hw_dom)) {
>> 		domain_free(hw_dom);
>> 		return;
>> 	}
>>
>> 	...
>> }
>>
>> The above is the other item that I've been trying to discuss
>> with you. Note that existing resctrl will not initialize monitoring if
>> control could not be initialized.
>> Compare with this submission:
>>
>> 	if (r->alloc_capable)
>> 		domain_add_cpu_ctrl(cpu, r);
>> 	if (r->mon_capable)
>> 		domain_add_cpu_mon(cpu, r);
>>
>>

I'll stop trying

Reinette

^ permalink raw reply	[flat|nested] 331+ messages in thread

* Re: [PATCH v4 1/7] x86/resctrl: Create separate domains for control and monitoring
  2023-08-28 21:20                 ` Reinette Chatre
@ 2023-08-28 22:12                   ` Tony Luck
  2023-08-29  0:31                     ` Tony Luck
  0 siblings, 1 reply; 331+ messages in thread
From: Tony Luck @ 2023-08-28 22:12 UTC (permalink / raw)
  To: Reinette Chatre
  Cc: Fenghua Yu, Peter Newman, Jonathan Corbet, Shuah Khan, x86,
	Shaopeng Tan, James Morse, Jamie Iles, Babu Moger, Randy Dunlap,
	linux-kernel, linux-doc, patches

On Mon, Aug 28, 2023 at 02:20:04PM -0700, Reinette Chatre wrote:
> 
> 
> On 8/28/2023 1:59 PM, Tony Luck wrote:
> > On Mon, Aug 28, 2023 at 12:56:27PM -0700, Reinette Chatre wrote:
> >> Hi Tony,
> >>
> >> On 8/28/2023 11:46 AM, Tony Luck wrote:
> >>> On Mon, Aug 28, 2023 at 10:05:31AM -0700, Reinette Chatre wrote:
> >>>> Hi Tony,
> >>>>
> >>>> On 8/25/2023 10:05 AM, Tony Luck wrote:
> >>>>> On Fri, Aug 11, 2023 at 10:29:25AM -0700, Reinette Chatre wrote:
> >>>>>> On 7/22/2023 12:07 PM, Tony Luck wrote:
> >>>>
> >>>>>>> Change all places where monitoring functions walk the list of
> >>>>>>> domains to use the new "mondomains" list instead of the old
> >>>>>>> "domains" list.
> >>>>>>
> >>>>>> I would not refer to it as "the old domains list" as it creates
> >>>>>> impression that this is being replaced. The changelog makes
> >>>>>> no mention that domains list will remain and be dedicated to
> >>>>>> control domains. I think this is important to include in description
> >>>>>> of this change.
> >>>>>
> >>>>> I've rewritten the entire commit message incorporating your suggestions.
> >>>>> V6 will be posted soon (after I get some time on an SNC SPR to check
> >>>>> that it all works!)
> >>>>
> >>>> I seem to have missed v5.
> >>>
> >>> I simply can't count. You are correct that next version to be posted
> >>> will be v5.
> >>>
> >>>>>
> >>>>>>
> >>>>>>>
> >>>>>>> Signed-off-by: Tony Luck <tony.luck@intel.com>
> >>>>>>> ---
> >>>>>>>  include/linux/resctrl.h                   |  10 +-
> >>>>>>>  arch/x86/kernel/cpu/resctrl/internal.h    |   2 +-
> >>>>>>>  arch/x86/kernel/cpu/resctrl/core.c        | 195 +++++++++++++++-------
> >>>>>>>  arch/x86/kernel/cpu/resctrl/ctrlmondata.c |   2 +-
> >>>>>>>  arch/x86/kernel/cpu/resctrl/monitor.c     |   2 +-
> >>>>>>>  arch/x86/kernel/cpu/resctrl/rdtgroup.c    |  30 ++--
> >>>>>>>  6 files changed, 167 insertions(+), 74 deletions(-)
> >>>>>>>
> >>>>>>> diff --git a/include/linux/resctrl.h b/include/linux/resctrl.h
> >>>>>>> index 8334eeacfec5..1267d56f9e76 100644
> >>>>>>> --- a/include/linux/resctrl.h
> >>>>>>> +++ b/include/linux/resctrl.h
> >>>>>>> @@ -151,9 +151,11 @@ struct resctrl_schema;
> >>>>>>>   * @mon_capable:	Is monitor feature available on this machine
> >>>>>>>   * @num_rmid:		Number of RMIDs available
> >>>>>>>   * @cache_level:	Which cache level defines scope of this resource
> >>>>>>> + * @mon_scope:		Scope of this resource if different from cache_level
> >>>>>>
> >>>>>> I think this addition should be deferred. As it is here it the "if different
> >>>>>> from cache_level" also creates many questions (when will it be different?
> >>>>>> how will it be determined that the scope is different in order to know that
> >>>>>> mon_scope should be used?)
> >>>>>
> >>>>> I've gone in a different direction. V6 renames "cache_level" to
> >>>>> "ctrl_scope". I think this makes intent clear from step #1.
> >>>>>
> >>>>
> >>>> This change is not clear to me. Previously you changed this name
> >>>> but kept using it in code specific to cache levels. It is not clear
> >>>> to me how this time's rename would be different.
> >>>
> >>> The current "cache_level" field in the structure is describing
> >>> the scope of each instance using the cache level (2 or 3) as the
> >>> method to describe which CPUs are considered part of a group. Currently
> >>> the scope is the same for both control and monitor resources.
> >>
> >> Right.
> >>
> >>>
> >>> Would you like to see patches in this progrssion:
> >>>
> >>> 1) Rename "cache_level" to "scope". With commit comment that future
> >>> patches are going to base the scope on NUMA nodes in addtion to sharing
> >>> caches at particular levels, and will split into separate control and
> >>> monitor scope.
> >>>
> >>> 2) Split the "scope" field from first patch into "ctrl_scope" and
> >>> "mon_scope" (also with the addition of the new list for the mon_scope).
> >>>
> >>> 3) Add "node" as a new option for scope in addtion to L3 and L2 cache.
> >>>
> >>
> >> hmmm - my comment cannot be addressed through patch re-ordering.
> >> If I understand correctly you plan to change the name of "cache_level"
> >> to "ctrl_scope". My comment is that this obfuscates the code as long as
> >> you use this variable to compare against data that can only represent cache
> >> levels. This just repeats what I wrote in
> >> https://lore.kernel.org/lkml/09847c37-66d7-c286-a313-308eaa338c64@intel.com/
> > 
> > I'm proposing more than just re-ordering. The above sequence is a
> > couple of extra patches in the series.
> > 
> > Existing state of code:
> > 
> > 	There is a single field named "cache_level" that describes how
> > 	CPUs are assigned to domains based on their sharing of a cache
> > 	at a particular level. Hard-coded values of "2" and "3" are used
> > 	to describe the level. This is just a scope description of which
> > 	CPUs are grouped together. But it is limited to just doing so
> > 	based on which caches are shared by those CPUs.
> > 
> > Step 1:
> > 
> > 	Change the name of the field s/cache_level/scope/. Provide an
> > 	enum with values RESCTRL_L3_CACHE, RESCTRL_L2_CACHE aand use
> > 	those througout code instead of 3, 2, and implictly passing
> > 	the resctrl scope to functions like get_cpu_cacheinfo_id()
> > 
> > 	Add get_domain_id_from_scope() function that takes the enum
> > 	values for scope and converts to "3", "2" to pass to
> > 	get_cpu_cacheinfo_id().
> > 
> > 	No functional change. Just broadening the meaning of the field
> > 	so that it can in future patches to describe scopes that
> > 	aren't a function of sharing a cache of a particular level.
> 
> Right. So if I understand you are planning the same change as you did
> in V3, like below, and my original comment still stands.
> 
>  @@ -1348,7 +1348,7 @@ unsigned int rdtgroup_cbm_to_size(struct rdt_resource *r,
>   	num_b = bitmap_weight(&cbm, r->cache.cbm_len);
>   	ci = get_cpu_cacheinfo(cpumask_any(&d->cpu_mask));
>   	for (i = 0; i < ci->num_leaves; i++) {
>  -		if (ci->info_list[i].level == r->cache_level) {
>  +		if (ci->info_list[i].level == r->scope) {
>   			size = ci->info_list[i].size / r->cache.cbm_len * num_b;
>   			break;

No. I heard your complaint about mixing "scope" and "cache_level".
Latest version of that code looks like this:

1341 unsigned int rdtgroup_cbm_to_size(struct rdt_resource *r,
1342                                   struct rdt_domain *d, unsigned long cbm)
1343 {
1344         struct cpu_cacheinfo *ci;
1345         unsigned int size = 0;
1346         int cache_level;
1347         int num_b, i;
1348
1349         switch (r->ctrl_scope) {
1350         case RESCTRL_L3_CACHE:
1351                 cache_level = 3;
1352                 break;
1353         case RESCTRL_L2_CACHE:
1354                 cache_level = 2;
1355                 break;
1356         default:
1357                 WARN_ON_ONCE(1);
1358                 return size;
1359         }
1360
1361         num_b = bitmap_weight(&cbm, r->cache.cbm_len);
1362         ci = get_cpu_cacheinfo(cpumask_any(&d->cpu_mask));
1363         for (i = 0; i < ci->num_leaves; i++) {
1364                 if (ci->info_list[i].level == cache_level) {
1365                         size = ci->info_list[i].size / r->cache.cbm_len * num_b;
1366                         break;
1367                 }
1368         }
1369
1370         return size / snc_nodes_per_l3_cache;
1371 }

>   		}
> 
> 
> >>>>>>> @@ -502,61 +593,19 @@ static int arch_domain_mbm_alloc(u32 num_rmid, struct rdt_hw_domain *hw_dom)
> >>>>>>>   */
> >>>>>>>  static void domain_add_cpu(int cpu, struct rdt_resource *r)
> >>>>>>>  {
> >>>>>>> -	int id = get_cpu_cacheinfo_id(cpu, r->cache_level);
> >>>>>>> -	struct list_head *add_pos = NULL;
> >>>>>>> -	struct rdt_hw_domain *hw_dom;
> >>>>>>> -	struct rdt_domain *d;
> >>>>>>> -	int err;
> >>>>>>> -
> >>>>>>> -	d = rdt_find_domain(r, id, &add_pos);
> >>>>>>> -	if (IS_ERR(d)) {
> >>>>>>> -		pr_warn("Couldn't find cache id for CPU %d\n", cpu);
> >>>>>>> -		return;
> >>>>>>> -	}
> >>>>>>> -
> >>>>>>> -	if (d) {
> >>>>>>> -		cpumask_set_cpu(cpu, &d->cpu_mask);
> >>>>>>> -		if (r->cache.arch_has_per_cpu_cfg)
> >>>>>>> -			rdt_domain_reconfigure_cdp(r);
> >>>>>>> -		return;
> >>>>>>> -	}
> >>>>>>> -
> >>>>>>> -	hw_dom = kzalloc_node(sizeof(*hw_dom), GFP_KERNEL, cpu_to_node(cpu));
> >>>>>>> -	if (!hw_dom)
> >>>>>>> -		return;
> >>>>>>> -
> >>>>>>> -	d = &hw_dom->d_resctrl;
> >>>>>>> -	d->id = id;
> >>>>>>> -	cpumask_set_cpu(cpu, &d->cpu_mask);
> >>>>>>> -
> >>>>>>> -	rdt_domain_reconfigure_cdp(r);
> >>>>>>> -
> >>>>>>> -	if (r->alloc_capable && domain_setup_ctrlval(r, d)) {
> >>>>>>> -		domain_free(hw_dom);
> >>>>>>> -		return;
> >>>>>>> -	}
> >>>>>>> -
> >>>>>>> -	if (r->mon_capable && arch_domain_mbm_alloc(r->num_rmid, hw_dom)) {
> >>>>>>> -		domain_free(hw_dom);
> >>>>>>> -		return;
> >>>>>>> -	}
> >>>>>>> -
> >>>>>>> -	list_add_tail(&d->list, add_pos);
> >>>>>>> -
> >>>>>>> -	err = resctrl_online_domain(r, d);
> >>>>>>> -	if (err) {
> >>>>>>> -		list_del(&d->list);
> >>>>>>> -		domain_free(hw_dom);
> >>>>>>> -	}
> >>>>>>> +	if (r->alloc_capable)
> >>>>>>> +		domain_add_cpu_ctrl(cpu, r);
> >>>>>>> +	if (r->mon_capable)
> >>>>>>> +		domain_add_cpu_mon(cpu, r);
> >>>>>>>  }
> >>>>>>
> >>>>>> A resource could be both alloc and mon capable ... both
> >>>>>> domain_add_cpu_ctrl() and domain_add_cpu_mon() can fail.
> >>>>>> Should domain_add_cpu_mon() still be run for a CPU if
> >>>>>> domain_add_cpu_ctrl() failed? 
> >>>>>>
> >>>>>> Looking ahead the CPU should probably also not be added
> >>>>>> to the default groups mask if a failure occurred.
> >>>>>
> >>>>> Existing code doesn't do anything for the case where a CPU
> >>>>> can't be added to a domain (probably the only real error case
> >>>>> is failure to allocate memory for the domain structure).
> >>>>
> >>>> Is my statement about CPU being added to default group mask
> >>>> incorrect? Seems like a potential issue related to domain's
> >>>> CPU mask also.
> >>>>
> >>>> Please see my earlier question. Existing code does not proceed
> >>>> with monitor initialization if control initialization fails and
> >>>> undoes control initialization if monitor initialization fails. 
> >>>
> >>> Existing code silently continues if a domain structure cannot
> >>> be allocated to add a CPU to a domain:
> >>>
> >>> 503 static void domain_add_cpu(int cpu, struct rdt_resource *r)
> >>> 504 {
> >>> 505         int id = get_cpu_cacheinfo_id(cpu, r->cache_level);
> >>> 506         struct list_head *add_pos = NULL;
> >>> 507         struct rdt_hw_domain *hw_dom;
> >>> 508         struct rdt_domain *d;
> >>> 509         int err;
> >>>
> >>> ...
> >>>
> >>> 523
> >>> 524         hw_dom = kzalloc_node(sizeof(*hw_dom), GFP_KERNEL, cpu_to_node(cpu));
> >>> 525         if (!hw_dom)
> >>> 526                 return;
> >>> 527
> >>>
> >>
> >>
> >> Right ... and if it returns silently as above it runs:
> >>
> >> static int resctrl_online_cpu(unsigned int cpu)
> >> {
> >>
> >>
> >> 	for_each_capable_rdt_resource(r)
> >> 		domain_add_cpu(cpu, r);
> >> 	>>>>> cpumask_set_cpu(cpu, &rdtgroup_default.cpu_mask); <<<<<<<<
> >>
> >> }
> >>
> >> Also, note within domain_add_cpu():
> >>
> >> static void domain_add_cpu(int cpu, struct rdt_resource *r)
> >> {
> >>
> >>
> >> 	...
> >> 	if (r->alloc_capable && domain_setup_ctrlval(r, d)) {
> >> 		domain_free(hw_dom);
> >> 		return;
> >> 	}
> >>
> >> 	if (r->mon_capable && arch_domain_mbm_alloc(r->num_rmid, hw_dom)) {
> >> 		domain_free(hw_dom);
> >> 		return;
> >> 	}
> >>
> >> 	...
> >> }
> >>
> >> The above is the other item that I've been trying to discuss
> >> with you. Note that existing resctrl will not initialize monitoring if
> >> control could not be initialized.
> >> Compare with this submission:
> >>
> >> 	if (r->alloc_capable)
> >> 		domain_add_cpu_ctrl(cpu, r);
> >> 	if (r->mon_capable)
> >> 		domain_add_cpu_mon(cpu, r);
> >>
> >>
> 
> I'll stop trying

I'll see what I can do to improve the error handling.

-Tony

^ permalink raw reply	[flat|nested] 331+ messages in thread

* Re: [PATCH v4 1/7] x86/resctrl: Create separate domains for control and monitoring
  2023-08-28 22:12                   ` Tony Luck
@ 2023-08-29  0:31                     ` Tony Luck
  0 siblings, 0 replies; 331+ messages in thread
From: Tony Luck @ 2023-08-29  0:31 UTC (permalink / raw)
  To: Reinette Chatre
  Cc: Fenghua Yu, Peter Newman, Jonathan Corbet, Shuah Khan, x86,
	Shaopeng Tan, James Morse, Jamie Iles, Babu Moger, Randy Dunlap,
	linux-kernel, linux-doc, patches

Trimming to the core of the discussion.

> > >> 	if (r->alloc_capable && domain_setup_ctrlval(r, d)) {
> > >> 		domain_free(hw_dom);
> > >> 		return;
> > >> 	}
> > >>
> > >> 	if (r->mon_capable && arch_domain_mbm_alloc(r->num_rmid, hw_dom)) {
> > >> 		domain_free(hw_dom);
> > >> 		return;
> > >> 	}

In the current code the control and monitor functions share a domain
structure. So enabling one without the other would lead to a whole
slew of complications. Various other bits of code would need to change
to make run-time checks that the control/monitor pieces of the shared
domain structure had been completely set up. So the existing code takes
the easy path out and says "if I can't have both, I'll have
neither".

> > >> The above is the other item that I've been trying to discuss
> > >> with you. Note that existing resctrl will not initialize monitoring if
> > >> control could not be initialized.
> > >> Compare with this submission:
> > >>
> > >> 	if (r->alloc_capable)
> > >> 		domain_add_cpu_ctrl(cpu, r);
> > >> 	if (r->mon_capable)
> > >> 		domain_add_cpu_mon(cpu, r);

With the updates I'm making, there are separate domain structures for
control and monitor.  They don't have to be tied together. It's possible
for initialization to fail on one, but the other to remain completely
functional (without adding "is this completely set up" checks to
other code).

The question is what should the code do? Should it allow a partial
success (now that is possible)? Or should it maintain legacy behavior
and block use of both control and monitor for a domain if either fails
to allocate necessary memory to operate?

I'm generally in favor of legacy semantics. But it seems weird to make
the code deliberately worse to preserve this particular behavior.
Especially as it adds complexity to the code for a case that in practice
is never going to happen.
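
If we do decide the legacy "all or nothing" behavior is worth keeping, the
minimal sketch I have in mind looks like this (assuming the split helpers
are changed to return an error; domain_remove_cpu_ctrl() is a hypothetical
undo helper):

static int domain_add_cpu(int cpu, struct rdt_resource *r)
{
	int err;

	if (r->alloc_capable) {
		err = domain_add_cpu_ctrl(cpu, r);
		if (err)
			return err;
	}

	if (r->mon_capable) {
		err = domain_add_cpu_mon(cpu, r);
		if (err) {
			/* Undo control setup so neither half is active */
			if (r->alloc_capable)
				domain_remove_cpu_ctrl(cpu, r);
			return err;
		}
	}

	return 0;
}

resctrl_online_cpu() could then skip the cpumask_set_cpu() on the default
group when this returns an error, which would also cover your point about
the default group's CPU mask.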

-Tony

^ permalink raw reply	[flat|nested] 331+ messages in thread

* [PATCH v5 0/8] Add support for Sub-NUMA cluster (SNC) systems
  2023-07-22 19:07 ` [PATCH v4 0/7] " Tony Luck
                     ` (8 preceding siblings ...)
  2023-08-11 17:27   ` Reinette Chatre
@ 2023-08-29 23:44   ` Tony Luck
  2023-08-29 23:44     ` [PATCH v5 1/8] x86/resctrl: Prepare for new domain scope Tony Luck
                       ` (9 more replies)
  9 siblings, 10 replies; 331+ messages in thread
From: Tony Luck @ 2023-08-29 23:44 UTC (permalink / raw)
  To: Fenghua Yu, Reinette Chatre, Peter Newman, Jonathan Corbet,
	Shuah Khan, x86
  Cc: Shaopeng Tan, James Morse, Jamie Iles, Babu Moger, Randy Dunlap,
	linux-kernel, linux-doc, patches, Tony Luck

The Sub-NUMA cluster feature on some Intel processors partitions
the CPUs that share an L3 cache into two or more sets. This plays
havoc with the Resource Director Technology (RDT) monitoring features.
Prior to this patch Intel has advised that SNC and RDT are incompatible.

Some of these CPUs support an MSR that can partition the RMID
counters in the same way. This allows for monitoring features
to be used (with the caveat that memory accesses between different
SNC NUMA nodes may still not be counted accurately).

Note that this patch series improves resctrl reporting considerably
on systems with SNC enabled, but there will still be some anomalies
for processes accessing memory from other sub-NUMA nodes.

Signed-off-by: Tony Luck <tony.luck@intel.com>

Changes since v4:

Rebased to upstream v6.5

Addressed problems reported by Reinette in follow-up messages to
v4 posting:
   https://lore.kernel.org/r/20230722190740.326190-1-tony.luck@intel.com

Broke the patch series into a (hopefully) more logical progression.

First two patches are infrastructure changes to allow resctrl
domains to have scopes that are not defined by which CPUs share
a particular cache instance, and to allow resources to have different
scope for control and monitor features.

Patch 3 cleans up some loose ends from the first two patches by
adding a new variant of the rdt_domain structure with just monitoring
fields, and removing the monitor fields from the original rdt_domain
structure since it is now only used for control features.

Patch 4 adds "node" as a scope option.

Patch 5 adjusts all code paths that need to be aware of SNC mode.

Patch 6 detects SNC mode, modifies the MSR that adjusts interpretation
of physical RMID counters.

Patch 7 updates documentation.

Patch 8 does a partial update for the resctrl selftests.

Tony Luck (8):
  x86/resctrl: Prepare for new domain scope
  x86/resctrl: Prepare for different scope for control/monitor
    operations
  x86/resctrl: Split the rdt_domain structure
  x86/resctrl: Add node-scope to the options for feature scope
  x86/resctrl: Introduce snc_nodes_per_l3_cache
  x86/resctrl: Sub NUMA Cluster detection and enable
  x86/resctrl: Update documentation with Sub-NUMA cluster changes
  selftests/resctrl: Adjust effective L3 cache size when SNC enabled

 Documentation/arch/x86/resctrl.rst          |  25 +-
 include/linux/resctrl.h                     |  64 ++--
 arch/x86/include/asm/msr-index.h            |   1 +
 arch/x86/kernel/cpu/resctrl/internal.h      |  42 ++-
 tools/testing/selftests/resctrl/resctrl.h   |   1 +
 arch/x86/kernel/cpu/resctrl/core.c          | 321 ++++++++++++++++----
 arch/x86/kernel/cpu/resctrl/ctrlmondata.c   |   6 +-
 arch/x86/kernel/cpu/resctrl/monitor.c       |  58 ++--
 arch/x86/kernel/cpu/resctrl/pseudo_lock.c   |  15 +-
 arch/x86/kernel/cpu/resctrl/rdtgroup.c      |  69 +++--
 tools/testing/selftests/resctrl/resctrlfs.c |  57 ++++
 11 files changed, 508 insertions(+), 151 deletions(-)


base-commit: 2dde18cd1d8fac735875f2e4987f11817cc0bc2c
-- 
2.41.0


^ permalink raw reply	[flat|nested] 331+ messages in thread

* [PATCH v5 1/8] x86/resctrl: Prepare for new domain scope
  2023-08-29 23:44   ` [PATCH v5 0/8] " Tony Luck
@ 2023-08-29 23:44     ` Tony Luck
  2023-08-30  8:53       ` Maciej Wieczór-Retman
                         ` (2 more replies)
  2023-08-29 23:44     ` [PATCH v5 2/8] x86/resctrl: Prepare for different scope for control/monitor operations Tony Luck
                       ` (8 subsequent siblings)
  9 siblings, 3 replies; 331+ messages in thread
From: Tony Luck @ 2023-08-29 23:44 UTC (permalink / raw)
  To: Fenghua Yu, Reinette Chatre, Peter Newman, Jonathan Corbet,
	Shuah Khan, x86
  Cc: Shaopeng Tan, James Morse, Jamie Iles, Babu Moger, Randy Dunlap,
	linux-kernel, linux-doc, patches, Tony Luck

Legacy resctrl features operated on subsets of CPUs in the system with
the defining attribute of each subset being an instance of a particular
level of cache. E.g. all CPUs sharing an L3 cache would be part of the
same domain.

In preparation for features that are scoped at the NUMA node level,
change the code from explicit references to "cache_level" to a more
generic scope. At this point the only options for this scope are groups
of CPUs that share an L2 cache or L3 cache.

No functional change.

Signed-off-by: Tony Luck <tony.luck@intel.com>
---
 include/linux/resctrl.h                   |  9 ++++++--
 arch/x86/kernel/cpu/resctrl/core.c        | 27 ++++++++++++++++++-----
 arch/x86/kernel/cpu/resctrl/pseudo_lock.c | 15 ++++++++++++-
 arch/x86/kernel/cpu/resctrl/rdtgroup.c    | 15 ++++++++++++-
 4 files changed, 56 insertions(+), 10 deletions(-)

diff --git a/include/linux/resctrl.h b/include/linux/resctrl.h
index 8334eeacfec5..2db1244ae642 100644
--- a/include/linux/resctrl.h
+++ b/include/linux/resctrl.h
@@ -144,13 +144,18 @@ struct resctrl_membw {
 struct rdt_parse_data;
 struct resctrl_schema;
 
+enum resctrl_scope {
+	RESCTRL_L3_CACHE,
+	RESCTRL_L2_CACHE,
+};
+
 /**
  * struct rdt_resource - attributes of a resctrl resource
  * @rid:		The index of the resource
  * @alloc_capable:	Is allocation available on this machine
  * @mon_capable:	Is monitor feature available on this machine
  * @num_rmid:		Number of RMIDs available
- * @cache_level:	Which cache level defines scope of this resource
+ * @scope:		Scope of this resource
  * @cache:		Cache allocation related data
  * @membw:		If the component has bandwidth controls, their properties.
  * @domains:		All domains for this resource
@@ -168,7 +173,7 @@ struct rdt_resource {
 	bool			alloc_capable;
 	bool			mon_capable;
 	int			num_rmid;
-	int			cache_level;
+	enum resctrl_scope	scope;
 	struct resctrl_cache	cache;
 	struct resctrl_membw	membw;
 	struct list_head	domains;
diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
index 030d3b409768..0d3bae523ecb 100644
--- a/arch/x86/kernel/cpu/resctrl/core.c
+++ b/arch/x86/kernel/cpu/resctrl/core.c
@@ -65,7 +65,7 @@ struct rdt_hw_resource rdt_resources_all[] = {
 		.r_resctrl = {
 			.rid			= RDT_RESOURCE_L3,
 			.name			= "L3",
-			.cache_level		= 3,
+			.scope			= RESCTRL_L3_CACHE,
 			.domains		= domain_init(RDT_RESOURCE_L3),
 			.parse_ctrlval		= parse_cbm,
 			.format_str		= "%d=%0*x",
@@ -79,7 +79,7 @@ struct rdt_hw_resource rdt_resources_all[] = {
 		.r_resctrl = {
 			.rid			= RDT_RESOURCE_L2,
 			.name			= "L2",
-			.cache_level		= 2,
+			.scope			= RESCTRL_L2_CACHE,
 			.domains		= domain_init(RDT_RESOURCE_L2),
 			.parse_ctrlval		= parse_cbm,
 			.format_str		= "%d=%0*x",
@@ -93,7 +93,7 @@ struct rdt_hw_resource rdt_resources_all[] = {
 		.r_resctrl = {
 			.rid			= RDT_RESOURCE_MBA,
 			.name			= "MB",
-			.cache_level		= 3,
+			.scope			= RESCTRL_L3_CACHE,
 			.domains		= domain_init(RDT_RESOURCE_MBA),
 			.parse_ctrlval		= parse_bw,
 			.format_str		= "%d=%*u",
@@ -105,7 +105,7 @@ struct rdt_hw_resource rdt_resources_all[] = {
 		.r_resctrl = {
 			.rid			= RDT_RESOURCE_SMBA,
 			.name			= "SMBA",
-			.cache_level		= 3,
+			.scope			= RESCTRL_L3_CACHE,
 			.domains		= domain_init(RDT_RESOURCE_SMBA),
 			.parse_ctrlval		= parse_bw,
 			.format_str		= "%d=%*u",
@@ -487,6 +487,21 @@ static int arch_domain_mbm_alloc(u32 num_rmid, struct rdt_hw_domain *hw_dom)
 	return 0;
 }
 
+static int get_domain_id_from_scope(int cpu, enum resctrl_scope scope)
+{
+	switch (scope) {
+	case RESCTRL_L3_CACHE:
+		return get_cpu_cacheinfo_id(cpu, 3);
+	case RESCTRL_L2_CACHE:
+		return get_cpu_cacheinfo_id(cpu, 2);
+	default:
+		WARN_ON_ONCE(1);
+		break;
+	}
+
+	return -1;
+}
+
 /*
  * domain_add_cpu - Add a cpu to a resource's domain list.
  *
@@ -502,7 +517,7 @@ static int arch_domain_mbm_alloc(u32 num_rmid, struct rdt_hw_domain *hw_dom)
  */
 static void domain_add_cpu(int cpu, struct rdt_resource *r)
 {
-	int id = get_cpu_cacheinfo_id(cpu, r->cache_level);
+	int id = get_domain_id_from_scope(cpu, r->scope);
 	struct list_head *add_pos = NULL;
 	struct rdt_hw_domain *hw_dom;
 	struct rdt_domain *d;
@@ -552,7 +567,7 @@ static void domain_add_cpu(int cpu, struct rdt_resource *r)
 
 static void domain_remove_cpu(int cpu, struct rdt_resource *r)
 {
-	int id = get_cpu_cacheinfo_id(cpu, r->cache_level);
+	int id = get_domain_id_from_scope(cpu, r->scope);
 	struct rdt_hw_domain *hw_dom;
 	struct rdt_domain *d;
 
diff --git a/arch/x86/kernel/cpu/resctrl/pseudo_lock.c b/arch/x86/kernel/cpu/resctrl/pseudo_lock.c
index 458cb7419502..e79324676f57 100644
--- a/arch/x86/kernel/cpu/resctrl/pseudo_lock.c
+++ b/arch/x86/kernel/cpu/resctrl/pseudo_lock.c
@@ -279,6 +279,7 @@ static void pseudo_lock_region_clear(struct pseudo_lock_region *plr)
 static int pseudo_lock_region_init(struct pseudo_lock_region *plr)
 {
 	struct cpu_cacheinfo *ci;
+	int cache_level;
 	int ret;
 	int i;
 
@@ -296,8 +297,20 @@ static int pseudo_lock_region_init(struct pseudo_lock_region *plr)
 
 	plr->size = rdtgroup_cbm_to_size(plr->s->res, plr->d, plr->cbm);
 
+	switch (plr->s->res->scope) {
+	case RESCTRL_L3_CACHE:
+		cache_level = 3;
+		break;
+	case RESCTRL_L2_CACHE:
+		cache_level = 2;
+		break;
+	default:
+		WARN_ON_ONCE(1);
+		return -ENODEV;
+	}
+
 	for (i = 0; i < ci->num_leaves; i++) {
-		if (ci->info_list[i].level == plr->s->res->cache_level) {
+		if (ci->info_list[i].level == cache_level) {
 			plr->line_size = ci->info_list[i].coherency_line_size;
 			return 0;
 		}
diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
index 725344048f85..f510414bf6ce 100644
--- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
+++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
@@ -1343,12 +1343,25 @@ unsigned int rdtgroup_cbm_to_size(struct rdt_resource *r,
 {
 	struct cpu_cacheinfo *ci;
 	unsigned int size = 0;
+	int cache_level;
 	int num_b, i;
 
+	switch (r->scope) {
+	case RESCTRL_L3_CACHE:
+		cache_level = 3;
+		break;
+	case RESCTRL_L2_CACHE:
+		cache_level = 2;
+		break;
+	default:
+		WARN_ON_ONCE(1);
+		return size;
+	}
+
 	num_b = bitmap_weight(&cbm, r->cache.cbm_len);
 	ci = get_cpu_cacheinfo(cpumask_any(&d->cpu_mask));
 	for (i = 0; i < ci->num_leaves; i++) {
-		if (ci->info_list[i].level == r->cache_level) {
+		if (ci->info_list[i].level == cache_level) {
 			size = ci->info_list[i].size / r->cache.cbm_len * num_b;
 			break;
 		}
-- 
2.41.0


^ permalink raw reply related	[flat|nested] 331+ messages in thread

* [PATCH v5 2/8] x86/resctrl: Prepare for different scope for control/monitor operations
  2023-08-29 23:44   ` [PATCH v5 0/8] " Tony Luck
  2023-08-29 23:44     ` [PATCH v5 1/8] x86/resctrl: Prepare for new domain scope Tony Luck
@ 2023-08-29 23:44     ` Tony Luck
  2023-09-25 23:24       ` Reinette Chatre
  2023-09-26 16:54       ` Moger, Babu
  2023-08-29 23:44     ` [PATCH v5 3/8] x86/resctrl: Split the rdt_domain structure Tony Luck
                       ` (7 subsequent siblings)
  9 siblings, 2 replies; 331+ messages in thread
From: Tony Luck @ 2023-08-29 23:44 UTC (permalink / raw)
  To: Fenghua Yu, Reinette Chatre, Peter Newman, Jonathan Corbet,
	Shuah Khan, x86
  Cc: Shaopeng Tan, James Morse, Jamie Iles, Babu Moger, Randy Dunlap,
	linux-kernel, linux-doc, patches, Tony Luck

Existing resctrl assumes that control and monitor operations on a
resource are performed at the same scope.

Prepare for systems that use different scope (specifically L3 scope
for cache control and NODE scope for cache occupancy and memory
bandwidth monitoring).

Create separate domain lists for control and monitor operations.

No important functional change. But note that errors during
initialization of either control or monitor functions on a domain would
previously result in that domain being excluded from both control and
monitor operations. Now that the domains are allocated independently, it
is no longer necessary to disable both control and monitor operations
if either fails.
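
Below is a minimal standalone sketch of the shape this gives each
resource (the structure and function names are simplified and
hypothetical, not the kernel ones in the diff): two scope fields and
two domain lists, so control and monitor domains can be created, and
can fail, independently.

#include <stdio.h>

enum scope { SCOPE_L3_CACHE, SCOPE_L2_CACHE, SCOPE_NODE };

struct domain {
	struct domain *next;
	int id;
};

struct resource {
	enum scope ctrl_scope;		/* scope used to build ctrl_domains */
	enum scope mon_scope;		/* scope used to build mon_domains */
	struct domain *ctrl_domains;	/* one entry per ctrl_scope instance */
	struct domain *mon_domains;	/* one entry per mon_scope instance */
};

static void print_list(const char *what, struct domain *d)
{
	printf("%s:", what);
	for (; d; d = d->next)
		printf(" %d", d->id);
	printf("\n");
}

int main(void)
{
	/* Hypothetical socket with one L3 cache split into two SNC nodes:
	 * one control domain (L3 scope) but two monitor domains (node scope).
	 */
	struct domain mon1 = { NULL, 1 }, mon0 = { &mon1, 0 };
	struct domain ctrl0 = { NULL, 0 };
	struct resource l3 = {
		.ctrl_scope   = SCOPE_L3_CACHE,
		.mon_scope    = SCOPE_NODE,
		.ctrl_domains = &ctrl0,
		.mon_domains  = &mon0,
	};

	print_list("control domains", l3.ctrl_domains);
	print_list("monitor domains", l3.mon_domains);
	return 0;
}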

Signed-off-by: Tony Luck <tony.luck@intel.com>
---
 include/linux/resctrl.h                   |  16 +-
 arch/x86/kernel/cpu/resctrl/internal.h    |   6 +-
 arch/x86/kernel/cpu/resctrl/core.c        | 227 +++++++++++++++-------
 arch/x86/kernel/cpu/resctrl/ctrlmondata.c |   2 +-
 arch/x86/kernel/cpu/resctrl/monitor.c     |   2 +-
 arch/x86/kernel/cpu/resctrl/pseudo_lock.c |   2 +-
 arch/x86/kernel/cpu/resctrl/rdtgroup.c    |  32 +--
 7 files changed, 199 insertions(+), 88 deletions(-)

diff --git a/include/linux/resctrl.h b/include/linux/resctrl.h
index 2db1244ae642..33856943a787 100644
--- a/include/linux/resctrl.h
+++ b/include/linux/resctrl.h
@@ -155,10 +155,12 @@ enum resctrl_scope {
  * @alloc_capable:	Is allocation available on this machine
  * @mon_capable:	Is monitor feature available on this machine
  * @num_rmid:		Number of RMIDs available
- * @scope:		Scope of this resource
+ * @ctrl_scope:		Scope of this resource for control functions
+ * @mon_scope:		Scope of this resource for monitor functions
  * @cache:		Cache allocation related data
  * @membw:		If the component has bandwidth controls, their properties.
- * @domains:		All domains for this resource
+ * @domains:		Control domains for this resource
+ * @mon_domains:	Monitor domains for this resource
  * @name:		Name to use in "schemata" file.
  * @data_width:		Character width of data when displaying
  * @default_ctrl:	Specifies default cache cbm or memory B/W percent.
@@ -173,10 +175,12 @@ struct rdt_resource {
 	bool			alloc_capable;
 	bool			mon_capable;
 	int			num_rmid;
-	enum resctrl_scope	scope;
+	enum resctrl_scope	ctrl_scope;
+	enum resctrl_scope	mon_scope;
 	struct resctrl_cache	cache;
 	struct resctrl_membw	membw;
 	struct list_head	domains;
+	struct list_head	mondomains;
 	char			*name;
 	int			data_width;
 	u32			default_ctrl;
@@ -222,8 +226,10 @@ int resctrl_arch_update_one(struct rdt_resource *r, struct rdt_domain *d,
 
 u32 resctrl_arch_get_config(struct rdt_resource *r, struct rdt_domain *d,
 			    u32 closid, enum resctrl_conf_type type);
-int resctrl_online_domain(struct rdt_resource *r, struct rdt_domain *d);
-void resctrl_offline_domain(struct rdt_resource *r, struct rdt_domain *d);
+int resctrl_online_ctrl_domain(struct rdt_resource *r, struct rdt_domain *d);
+int resctrl_online_mon_domain(struct rdt_resource *r, struct rdt_domain *d);
+void resctrl_offline_ctrl_domain(struct rdt_resource *r, struct rdt_domain *d);
+void resctrl_offline_mon_domain(struct rdt_resource *r, struct rdt_domain *d);
 
 /**
  * resctrl_arch_rmid_read() - Read the eventid counter corresponding to rmid
diff --git a/arch/x86/kernel/cpu/resctrl/internal.h b/arch/x86/kernel/cpu/resctrl/internal.h
index 85ceaf9a31ac..31a5fc3b717f 100644
--- a/arch/x86/kernel/cpu/resctrl/internal.h
+++ b/arch/x86/kernel/cpu/resctrl/internal.h
@@ -511,8 +511,10 @@ void rdtgroup_kn_unlock(struct kernfs_node *kn);
 int rdtgroup_kn_mode_restrict(struct rdtgroup *r, const char *name);
 int rdtgroup_kn_mode_restore(struct rdtgroup *r, const char *name,
 			     umode_t mask);
-struct rdt_domain *rdt_find_domain(struct rdt_resource *r, int id,
-				   struct list_head **pos);
+struct rdt_domain *rdt_find_ctrldomain(struct list_head *h, int id,
+				       struct list_head **pos);
+struct rdt_domain *rdt_find_mondomain(struct list_head *h, int id,
+				      struct list_head **pos);
 ssize_t rdtgroup_schemata_write(struct kernfs_open_file *of,
 				char *buf, size_t nbytes, loff_t off);
 int rdtgroup_schemata_show(struct kernfs_open_file *of,
diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
index 0d3bae523ecb..97f6f9715fdb 100644
--- a/arch/x86/kernel/cpu/resctrl/core.c
+++ b/arch/x86/kernel/cpu/resctrl/core.c
@@ -57,7 +57,7 @@ static void
 mba_wrmsr_amd(struct rdt_domain *d, struct msr_param *m,
 	      struct rdt_resource *r);
 
-#define domain_init(id) LIST_HEAD_INIT(rdt_resources_all[id].r_resctrl.domains)
+#define domain_init(id, field) LIST_HEAD_INIT(rdt_resources_all[id].r_resctrl.field)
 
 struct rdt_hw_resource rdt_resources_all[] = {
 	[RDT_RESOURCE_L3] =
@@ -65,8 +65,10 @@ struct rdt_hw_resource rdt_resources_all[] = {
 		.r_resctrl = {
 			.rid			= RDT_RESOURCE_L3,
 			.name			= "L3",
-			.scope			= RESCTRL_L3_CACHE,
-			.domains		= domain_init(RDT_RESOURCE_L3),
+			.ctrl_scope		= RESCTRL_L3_CACHE,
+			.mon_scope		= RESCTRL_L3_CACHE,
+			.domains		= domain_init(RDT_RESOURCE_L3, domains),
+			.mondomains		= domain_init(RDT_RESOURCE_L3, mondomains),
 			.parse_ctrlval		= parse_cbm,
 			.format_str		= "%d=%0*x",
 			.fflags			= RFTYPE_RES_CACHE,
@@ -79,8 +81,8 @@ struct rdt_hw_resource rdt_resources_all[] = {
 		.r_resctrl = {
 			.rid			= RDT_RESOURCE_L2,
 			.name			= "L2",
-			.scope			= RESCTRL_L2_CACHE,
-			.domains		= domain_init(RDT_RESOURCE_L2),
+			.ctrl_scope		= RESCTRL_L2_CACHE,
+			.domains		= domain_init(RDT_RESOURCE_L2, domains),
 			.parse_ctrlval		= parse_cbm,
 			.format_str		= "%d=%0*x",
 			.fflags			= RFTYPE_RES_CACHE,
@@ -93,8 +95,8 @@ struct rdt_hw_resource rdt_resources_all[] = {
 		.r_resctrl = {
 			.rid			= RDT_RESOURCE_MBA,
 			.name			= "MB",
-			.scope			= RESCTRL_L3_CACHE,
-			.domains		= domain_init(RDT_RESOURCE_MBA),
+			.ctrl_scope		= RESCTRL_L3_CACHE,
+			.domains		= domain_init(RDT_RESOURCE_MBA, domains),
 			.parse_ctrlval		= parse_bw,
 			.format_str		= "%d=%*u",
 			.fflags			= RFTYPE_RES_MB,
@@ -105,8 +107,8 @@ struct rdt_hw_resource rdt_resources_all[] = {
 		.r_resctrl = {
 			.rid			= RDT_RESOURCE_SMBA,
 			.name			= "SMBA",
-			.scope			= RESCTRL_L3_CACHE,
-			.domains		= domain_init(RDT_RESOURCE_SMBA),
+			.ctrl_scope		= RESCTRL_L3_CACHE,
+			.domains		= domain_init(RDT_RESOURCE_SMBA, domains),
 			.parse_ctrlval		= parse_bw,
 			.format_str		= "%d=%*u",
 			.fflags			= RFTYPE_RES_MB,
@@ -384,15 +386,16 @@ void rdt_ctrl_update(void *arg)
 }
 
 /*
- * rdt_find_domain - Find a domain in a resource that matches input resource id
+ * __rdt_find_domain - Find a domain in either the list of control or
+ * monitor domains that matches input resource id
  *
  * Search resource r's domain list to find the resource id. If the resource
  * id is found in a domain, return the domain. Otherwise, if requested by
  * caller, return the first domain whose id is bigger than the input id.
  * The domain list is sorted by id in ascending order.
  */
-struct rdt_domain *rdt_find_domain(struct rdt_resource *r, int id,
-				   struct list_head **pos)
+static void *__rdt_find_domain(struct list_head *h, int id,
+			       struct list_head **pos)
 {
 	struct rdt_domain *d;
 	struct list_head *l;
@@ -400,7 +403,7 @@ struct rdt_domain *rdt_find_domain(struct rdt_resource *r, int id,
 	if (id < 0)
 		return ERR_PTR(-ENODEV);
 
-	list_for_each(l, &r->domains) {
+	list_for_each(l, h) {
 		d = list_entry(l, struct rdt_domain, list);
 		/* When id is found, return its domain. */
 		if (id == d->id)
@@ -416,6 +419,18 @@ struct rdt_domain *rdt_find_domain(struct rdt_resource *r, int id,
 	return NULL;
 }
 
+struct rdt_domain *rdt_find_ctrldomain(struct list_head *h, int id,
+				       struct list_head **pos)
+{
+	return __rdt_find_domain(h, id, pos);
+}
+
+struct rdt_domain *rdt_find_mondomain(struct list_head *h, int id,
+				      struct list_head **pos)
+{
+	return __rdt_find_domain(h, id, pos);
+}
+
 static void setup_default_ctrlval(struct rdt_resource *r, u32 *dc)
 {
 	struct rdt_hw_resource *hw_res = resctrl_to_arch_res(r);
@@ -431,10 +446,15 @@ static void setup_default_ctrlval(struct rdt_resource *r, u32 *dc)
 }
 
 static void domain_free(struct rdt_hw_domain *hw_dom)
+{
+	kfree(hw_dom->ctrl_val);
+	kfree(hw_dom);
+}
+
+static void mondomain_free(struct rdt_hw_domain *hw_dom)
 {
 	kfree(hw_dom->arch_mbm_total);
 	kfree(hw_dom->arch_mbm_local);
-	kfree(hw_dom->ctrl_val);
 	kfree(hw_dom);
 }
 
@@ -502,6 +522,93 @@ static int get_domain_id_from_scope(int cpu, enum resctrl_scope scope)
 	return -1;
 }
 
+static void domain_add_cpu_ctrl(int cpu, struct rdt_resource *r)
+{
+	int id = get_domain_id_from_scope(cpu, r->ctrl_scope);
+	struct list_head *add_pos = NULL;
+	struct rdt_hw_domain *hw_dom;
+	struct rdt_domain *d;
+	int err;
+
+	d = rdt_find_ctrldomain(&r->domains, id, &add_pos);
+	if (IS_ERR(d)) {
+		pr_warn("Couldn't find scope id=%d for CPU %d\n", id, cpu);
+		return;
+	}
+
+	if (d) {
+		cpumask_set_cpu(cpu, &d->cpu_mask);
+		if (r->cache.arch_has_per_cpu_cfg)
+			rdt_domain_reconfigure_cdp(r);
+		return;
+	}
+
+	hw_dom = kzalloc_node(sizeof(*hw_dom), GFP_KERNEL, cpu_to_node(cpu));
+	if (!hw_dom)
+		return;
+
+	d = &hw_dom->d_resctrl;
+	d->id = id;
+	cpumask_set_cpu(cpu, &d->cpu_mask);
+
+	rdt_domain_reconfigure_cdp(r);
+
+	if (domain_setup_ctrlval(r, d)) {
+		domain_free(hw_dom);
+		return;
+	}
+
+	list_add_tail(&d->list, add_pos);
+
+	err = resctrl_online_ctrl_domain(r, d);
+	if (err) {
+		list_del(&d->list);
+		domain_free(hw_dom);
+	}
+}
+
+static void domain_add_cpu_mon(int cpu, struct rdt_resource *r)
+{
+	int id = get_domain_id_from_scope(cpu, r->mon_scope);
+	struct rdt_hw_domain *hw_mondom;
+	struct list_head *add_pos = NULL;
+	struct rdt_domain *d;
+	int err;
+
+	d = rdt_find_mondomain(&r->mondomains, id, &add_pos);
+	if (IS_ERR(d)) {
+		pr_warn("Couldn't find scope id=%d for CPU %d\n", id, cpu);
+		return;
+	}
+
+	if (d) {
+		cpumask_set_cpu(cpu, &d->cpu_mask);
+
+		return;
+	}
+
+	hw_mondom = kzalloc_node(sizeof(*hw_mondom), GFP_KERNEL, cpu_to_node(cpu));
+	if (!hw_mondom)
+		return;
+
+	d = &hw_mondom->d_resctrl;
+	d->id = id;
+	cpumask_set_cpu(cpu, &d->cpu_mask);
+
+	if (arch_domain_mbm_alloc(r->num_rmid, hw_mondom)) {
+		mondomain_free(hw_mondom);
+		return;
+	}
+
+	list_add_tail(&d->list, add_pos);
+
+	err = resctrl_online_mon_domain(r, d);
+	if (err) {
+		list_del(&d->list);
+		mondomain_free(hw_mondom);
+	}
+}
+
 /*
  * domain_add_cpu - Add a cpu to a resource's domain list.
  *
@@ -517,70 +624,28 @@ static int get_domain_id_from_scope(int cpu, enum resctrl_scope scope)
  */
 static void domain_add_cpu(int cpu, struct rdt_resource *r)
 {
-	int id = get_domain_id_from_scope(cpu, r->scope);
-	struct list_head *add_pos = NULL;
-	struct rdt_hw_domain *hw_dom;
-	struct rdt_domain *d;
-	int err;
-
-	d = rdt_find_domain(r, id, &add_pos);
-	if (IS_ERR(d)) {
-		pr_warn("Couldn't find cache id for CPU %d\n", cpu);
-		return;
-	}
-
-	if (d) {
-		cpumask_set_cpu(cpu, &d->cpu_mask);
-		if (r->cache.arch_has_per_cpu_cfg)
-			rdt_domain_reconfigure_cdp(r);
-		return;
-	}
-
-	hw_dom = kzalloc_node(sizeof(*hw_dom), GFP_KERNEL, cpu_to_node(cpu));
-	if (!hw_dom)
-		return;
-
-	d = &hw_dom->d_resctrl;
-	d->id = id;
-	cpumask_set_cpu(cpu, &d->cpu_mask);
-
-	rdt_domain_reconfigure_cdp(r);
-
-	if (r->alloc_capable && domain_setup_ctrlval(r, d)) {
-		domain_free(hw_dom);
-		return;
-	}
-
-	if (r->mon_capable && arch_domain_mbm_alloc(r->num_rmid, hw_dom)) {
-		domain_free(hw_dom);
-		return;
-	}
-
-	list_add_tail(&d->list, add_pos);
-
-	err = resctrl_online_domain(r, d);
-	if (err) {
-		list_del(&d->list);
-		domain_free(hw_dom);
-	}
+	if (r->alloc_capable)
+		domain_add_cpu_ctrl(cpu, r);
+	if (r->mon_capable)
+		domain_add_cpu_mon(cpu, r);
 }
 
-static void domain_remove_cpu(int cpu, struct rdt_resource *r)
+static void domain_remove_cpu_ctrl(int cpu, struct rdt_resource *r)
 {
-	int id = get_domain_id_from_scope(cpu, r->scope);
+	int id = get_domain_id_from_scope(cpu, r->ctrl_scope);
 	struct rdt_hw_domain *hw_dom;
 	struct rdt_domain *d;
 
-	d = rdt_find_domain(r, id, NULL);
+	d = rdt_find_ctrldomain(&r->domains, id, NULL);
 	if (IS_ERR_OR_NULL(d)) {
-		pr_warn("Couldn't find cache id for CPU %d\n", cpu);
+		pr_warn("Couldn't find scope id=%d for CPU %d\n", id, cpu);
 		return;
 	}
 	hw_dom = resctrl_to_arch_dom(d);
 
 	cpumask_clear_cpu(cpu, &d->cpu_mask);
 	if (cpumask_empty(&d->cpu_mask)) {
-		resctrl_offline_domain(r, d);
+		resctrl_offline_ctrl_domain(r, d);
 		list_del(&d->list);
 
 		/*
@@ -593,6 +658,30 @@ static void domain_remove_cpu(int cpu, struct rdt_resource *r)
 
 		return;
 	}
+}
+
+static void domain_remove_cpu_mon(int cpu, struct rdt_resource *r)
+{
+	int id = get_domain_id_from_scope(cpu, r->mon_scope);
+	struct rdt_hw_domain *hw_mondom;
+	struct rdt_domain *d;
+
+	d = rdt_find_mondomain(&r->mondomains, id, NULL);
+	if (IS_ERR_OR_NULL(d)) {
+		pr_warn("Couldn't find scope id=%d for CPU %d\n", id, cpu);
+		return;
+	}
+	hw_mondom = resctrl_to_arch_dom(d);
+
+	cpumask_clear_cpu(cpu, &d->cpu_mask);
+	if (cpumask_empty(&d->cpu_mask)) {
+		resctrl_offline_mon_domain(r, d);
+		list_del(&d->list);
+
+		mondomain_free(hw_mondom);
+
+		return;
+	}
 
 	if (r == &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl) {
 		if (is_mbm_enabled() && cpu == d->mbm_work_cpu) {
@@ -607,6 +696,14 @@ static void domain_remove_cpu(int cpu, struct rdt_resource *r)
 	}
 }
 
+static void domain_remove_cpu(int cpu, struct rdt_resource *r)
+{
+	if (r->alloc_capable)
+		domain_remove_cpu_ctrl(cpu, r);
+	if (r->mon_capable)
+		domain_remove_cpu_mon(cpu, r);
+}
+
 static void clear_closid_rmid(int cpu)
 {
 	struct resctrl_pqr_state *state = this_cpu_ptr(&pqr_state);
diff --git a/arch/x86/kernel/cpu/resctrl/ctrlmondata.c b/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
index b44c487727d4..468c1815edfd 100644
--- a/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
+++ b/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
@@ -560,7 +560,7 @@ int rdtgroup_mondata_show(struct seq_file *m, void *arg)
 	evtid = md.u.evtid;
 
 	r = &rdt_resources_all[resid].r_resctrl;
-	d = rdt_find_domain(r, domid, NULL);
+	d = rdt_find_mondomain(&r->mondomains, domid, NULL);
 	if (IS_ERR_OR_NULL(d)) {
 		ret = -ENOENT;
 		goto out;
diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c
index ded1fc7cb7cb..66beca785535 100644
--- a/arch/x86/kernel/cpu/resctrl/monitor.c
+++ b/arch/x86/kernel/cpu/resctrl/monitor.c
@@ -340,7 +340,7 @@ static void add_rmid_to_limbo(struct rmid_entry *entry)
 
 	entry->busy = 0;
 	cpu = get_cpu();
-	list_for_each_entry(d, &r->domains, list) {
+	list_for_each_entry(d, &r->mondomains, list) {
 		if (cpumask_test_cpu(cpu, &d->cpu_mask)) {
 			err = resctrl_arch_rmid_read(r, d, entry->rmid,
 						     QOS_L3_OCCUP_EVENT_ID,
diff --git a/arch/x86/kernel/cpu/resctrl/pseudo_lock.c b/arch/x86/kernel/cpu/resctrl/pseudo_lock.c
index e79324676f57..be8b5f28e638 100644
--- a/arch/x86/kernel/cpu/resctrl/pseudo_lock.c
+++ b/arch/x86/kernel/cpu/resctrl/pseudo_lock.c
@@ -297,7 +297,7 @@ static int pseudo_lock_region_init(struct pseudo_lock_region *plr)
 
 	plr->size = rdtgroup_cbm_to_size(plr->s->res, plr->d, plr->cbm);
 
-	switch (plr->s->res->scope) {
+	switch (plr->s->res->ctrl_scope) {
 	case RESCTRL_L3_CACHE:
 		cache_level = 3;
 		break;
diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
index f510414bf6ce..f2aec39c49df 100644
--- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
+++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
@@ -1346,7 +1346,7 @@ unsigned int rdtgroup_cbm_to_size(struct rdt_resource *r,
 	int cache_level;
 	int num_b, i;
 
-	switch (r->scope) {
+	switch (r->ctrl_scope) {
 	case RESCTRL_L3_CACHE:
 		cache_level = 3;
 		break;
@@ -1509,7 +1509,7 @@ static int mbm_config_show(struct seq_file *s, struct rdt_resource *r, u32 evtid
 
 	mutex_lock(&rdtgroup_mutex);
 
-	list_for_each_entry(dom, &r->domains, list) {
+	list_for_each_entry(dom, &r->mondomains, list) {
 		if (sep)
 			seq_puts(s, ";");
 
@@ -1632,7 +1632,7 @@ static int mon_config_write(struct rdt_resource *r, char *tok, u32 evtid)
 		return -EINVAL;
 	}
 
-	list_for_each_entry(d, &r->domains, list) {
+	list_for_each_entry(d, &r->mondomains, list) {
 		if (d->id == dom_id) {
 			ret = mbm_config_write_domain(r, d, evtid, val);
 			if (ret)
@@ -2538,7 +2538,7 @@ static int rdt_get_tree(struct fs_context *fc)
 
 	if (is_mbm_enabled()) {
 		r = &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl;
-		list_for_each_entry(dom, &r->domains, list)
+		list_for_each_entry(dom, &r->mondomains, list)
 			mbm_setup_overflow_handler(dom, MBM_OVERFLOW_INTERVAL);
 	}
 
@@ -2932,7 +2932,7 @@ static int mkdir_mondata_subdir_alldom(struct kernfs_node *parent_kn,
 	struct rdt_domain *dom;
 	int ret;
 
-	list_for_each_entry(dom, &r->domains, list) {
+	list_for_each_entry(dom, &r->mondomains, list) {
 		ret = mkdir_mondata_subdir(parent_kn, dom, r, prgrp);
 		if (ret)
 			return ret;
@@ -3721,15 +3721,17 @@ static void domain_destroy_mon_state(struct rdt_domain *d)
 	kfree(d->mbm_local);
 }
 
-void resctrl_offline_domain(struct rdt_resource *r, struct rdt_domain *d)
+void resctrl_offline_ctrl_domain(struct rdt_resource *r, struct rdt_domain *d)
 {
 	lockdep_assert_held(&rdtgroup_mutex);
 
 	if (supports_mba_mbps() && r->rid == RDT_RESOURCE_MBA)
 		mba_sc_domain_destroy(r, d);
+}
 
-	if (!r->mon_capable)
-		return;
+void resctrl_offline_mon_domain(struct rdt_resource *r, struct rdt_domain *d)
+{
+	lockdep_assert_held(&rdtgroup_mutex);
 
 	/*
 	 * If resctrl is mounted, remove all the
@@ -3786,18 +3788,22 @@ static int domain_setup_mon_state(struct rdt_resource *r, struct rdt_domain *d)
 	return 0;
 }
 
-int resctrl_online_domain(struct rdt_resource *r, struct rdt_domain *d)
+int resctrl_online_ctrl_domain(struct rdt_resource *r, struct rdt_domain *d)
 {
-	int err;
-
 	lockdep_assert_held(&rdtgroup_mutex);
 
 	if (supports_mba_mbps() && r->rid == RDT_RESOURCE_MBA)
 		/* RDT_RESOURCE_MBA is never mon_capable */
 		return mba_sc_domain_allocate(r, d);
 
-	if (!r->mon_capable)
-		return 0;
+	return 0;
+}
+
+int resctrl_online_mon_domain(struct rdt_resource *r, struct rdt_domain *d)
+{
+	int err;
+
+	lockdep_assert_held(&rdtgroup_mutex);
 
 	err = domain_setup_mon_state(r, d);
 	if (err)
-- 
2.41.0


^ permalink raw reply related	[flat|nested] 331+ messages in thread

* [PATCH v5 3/8] x86/resctrl: Split the rdt_domain structure
  2023-08-29 23:44   ` [PATCH v5 0/8] " Tony Luck
  2023-08-29 23:44     ` [PATCH v5 1/8] x86/resctrl: Prepare for new domain scope Tony Luck
  2023-08-29 23:44     ` [PATCH v5 2/8] x86/resctrl: Prepare for different scope for control/monitor operations Tony Luck
@ 2023-08-29 23:44     ` Tony Luck
  2023-09-25 23:25       ` Reinette Chatre
  2023-08-29 23:44     ` [PATCH v5 4/8] x86/resctrl: Add node-scope to the options for feature scope Tony Luck
                       ` (6 subsequent siblings)
  9 siblings, 1 reply; 331+ messages in thread
From: Tony Luck @ 2023-08-29 23:44 UTC (permalink / raw)
  To: Fenghua Yu, Reinette Chatre, Peter Newman, Jonathan Corbet,
	Shuah Khan, x86
  Cc: Shaopeng Tan, James Morse, Jamie Iles, Babu Moger, Randy Dunlap,
	linux-kernel, linux-doc, patches, Tony Luck

The same rdt_domain structure is used for both control and monitor
functions. But this results in wasted memory, as some of the fields are
used only by control functions while most are used only for monitor
functions.

Create a new rdt_mondomain structure tailored explicitly for use in
monitor parts of the core. Slim down the rdt_domain structure by
removing the unused monitor fields.

Make a similar breakout of struct rdt_hw_mondomain from struct rdt_hw_domain.
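
The "first three fields must match" comments are what keep a single
lookup helper such as __rdt_find_domain() usable on either list: it
only touches the list linkage and the id, which sit at the same offsets
in both structures. A standalone sketch of that pattern follows (names
are hypothetical; the kernel patch keeps the common fields inline rather
than in a named header struct, but the idea is the same):

#include <stdio.h>

/* Common prefix shared by both domain flavours. */
struct dom_hdr {
	struct dom_hdr *next;
	int id;
};

struct ctrl_dom {
	struct dom_hdr hdr;	/* must be first */
	int ctrl_only_state;
};

struct mon_dom {
	struct dom_hdr hdr;	/* must be first */
	int mon_only_state;
};

/* One lookup serves both lists because it only touches the header. */
static void *find_dom(struct dom_hdr *head, int id)
{
	for (struct dom_hdr *d = head; d; d = d->next)
		if (d->id == id)
			return d;
	return NULL;
}

int main(void)
{
	struct mon_dom m1 = { .hdr = { NULL, 3 }, .mon_only_state = 42 };
	struct mon_dom m0 = { .hdr = { &m1.hdr, 1 }, .mon_only_state = 7 };
	struct mon_dom *hit = find_dom(&m0.hdr, 3);

	printf("found id=%d mon state=%d\n", hit->hdr.id, hit->mon_only_state);
	return 0;
}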

Signed-off-by: Tony Luck <tony.luck@intel.com>
---
 include/linux/resctrl.h                   | 46 +++++++++++++++--------
 arch/x86/kernel/cpu/resctrl/internal.h    | 38 +++++++++++++------
 arch/x86/kernel/cpu/resctrl/core.c        | 18 ++++-----
 arch/x86/kernel/cpu/resctrl/ctrlmondata.c |  4 +-
 arch/x86/kernel/cpu/resctrl/monitor.c     | 40 ++++++++++----------
 arch/x86/kernel/cpu/resctrl/rdtgroup.c    | 24 ++++++------
 6 files changed, 101 insertions(+), 69 deletions(-)

diff --git a/include/linux/resctrl.h b/include/linux/resctrl.h
index 33856943a787..08382548571e 100644
--- a/include/linux/resctrl.h
+++ b/include/linux/resctrl.h
@@ -53,7 +53,29 @@ struct resctrl_staged_config {
 };
 
 /**
- * struct rdt_domain - group of CPUs sharing a resctrl resource
+ * struct rdt_domain - group of CPUs sharing a resctrl control resource
+ * @list:		all instances of this resource
+ * @id:			unique id for this instance
+ * @cpu_mask:		which CPUs share this resource
+ * @plr:		pseudo-locked region (if any) associated with domain
+ * @staged_config:	parsed configuration to be applied
+ * @mbps_val:		When mba_sc is enabled, this holds the array of user
+ *			specified control values for mba_sc in MBps, indexed
+ *			by closid
+ */
+struct rdt_domain {
+	// First three fields must match struct rdt_mondomain below.
+	struct list_head		list;
+	int				id;
+	struct cpumask			cpu_mask;
+
+	struct pseudo_lock_region	*plr;
+	struct resctrl_staged_config	staged_config[CDP_NUM_TYPES];
+	u32				*mbps_val;
+};
+
+/**
+ * struct rdt_mondomain - group of CPUs sharing a resctrl monitor resource
  * @list:		all instances of this resource
  * @id:			unique id for this instance
  * @cpu_mask:		which CPUs share this resource
@@ -64,16 +86,13 @@ struct resctrl_staged_config {
  * @cqm_limbo:		worker to periodically read CQM h/w counters
  * @mbm_work_cpu:	worker CPU for MBM h/w counters
  * @cqm_work_cpu:	worker CPU for CQM h/w counters
- * @plr:		pseudo-locked region (if any) associated with domain
- * @staged_config:	parsed configuration to be applied
- * @mbps_val:		When mba_sc is enabled, this holds the array of user
- *			specified control values for mba_sc in MBps, indexed
- *			by closid
  */
-struct rdt_domain {
+struct rdt_mondomain {
+	// First three fields must match struct rdt_domain above.
 	struct list_head		list;
 	int				id;
 	struct cpumask			cpu_mask;
+
 	unsigned long			*rmid_busy_llc;
 	struct mbm_state		*mbm_total;
 	struct mbm_state		*mbm_local;
@@ -81,9 +100,6 @@ struct rdt_domain {
 	struct delayed_work		cqm_limbo;
 	int				mbm_work_cpu;
 	int				cqm_work_cpu;
-	struct pseudo_lock_region	*plr;
-	struct resctrl_staged_config	staged_config[CDP_NUM_TYPES];
-	u32				*mbps_val;
 };
 
 /**
@@ -227,9 +243,9 @@ int resctrl_arch_update_one(struct rdt_resource *r, struct rdt_domain *d,
 u32 resctrl_arch_get_config(struct rdt_resource *r, struct rdt_domain *d,
 			    u32 closid, enum resctrl_conf_type type);
 int resctrl_online_ctrl_domain(struct rdt_resource *r, struct rdt_domain *d);
-int resctrl_online_mon_domain(struct rdt_resource *r, struct rdt_domain *d);
+int resctrl_online_mon_domain(struct rdt_resource *r, struct rdt_mondomain *d);
 void resctrl_offline_ctrl_domain(struct rdt_resource *r, struct rdt_domain *d);
-void resctrl_offline_mon_domain(struct rdt_resource *r, struct rdt_domain *d);
+void resctrl_offline_mon_domain(struct rdt_resource *r, struct rdt_mondomain *d);
 
 /**
  * resctrl_arch_rmid_read() - Read the eventid counter corresponding to rmid
@@ -245,7 +261,7 @@ void resctrl_offline_mon_domain(struct rdt_resource *r, struct rdt_domain *d);
  * Return:
  * 0 on success, or -EIO, -EINVAL etc on error.
  */
-int resctrl_arch_rmid_read(struct rdt_resource *r, struct rdt_domain *d,
+int resctrl_arch_rmid_read(struct rdt_resource *r, struct rdt_mondomain *d,
 			   u32 rmid, enum resctrl_event_id eventid, u64 *val);
 
 /**
@@ -258,7 +274,7 @@ int resctrl_arch_rmid_read(struct rdt_resource *r, struct rdt_domain *d,
  *
  * This can be called from any CPU.
  */
-void resctrl_arch_reset_rmid(struct rdt_resource *r, struct rdt_domain *d,
+void resctrl_arch_reset_rmid(struct rdt_resource *r, struct rdt_mondomain *d,
 			     u32 rmid, enum resctrl_event_id eventid);
 
 /**
@@ -270,7 +286,7 @@ void resctrl_arch_reset_rmid(struct rdt_resource *r, struct rdt_domain *d,
  *
  * This can be called from any CPU.
  */
-void resctrl_arch_reset_rmid_all(struct rdt_resource *r, struct rdt_domain *d);
+void resctrl_arch_reset_rmid_all(struct rdt_resource *r, struct rdt_mondomain *d);
 
 extern unsigned int resctrl_rmid_realloc_threshold;
 extern unsigned int resctrl_rmid_realloc_limit;
diff --git a/arch/x86/kernel/cpu/resctrl/internal.h b/arch/x86/kernel/cpu/resctrl/internal.h
index 31a5fc3b717f..c61fd6709730 100644
--- a/arch/x86/kernel/cpu/resctrl/internal.h
+++ b/arch/x86/kernel/cpu/resctrl/internal.h
@@ -106,7 +106,7 @@ union mon_data_bits {
 struct rmid_read {
 	struct rdtgroup		*rgrp;
 	struct rdt_resource	*r;
-	struct rdt_domain	*d;
+	struct rdt_mondomain	*d;
 	enum resctrl_event_id	evtid;
 	bool			first;
 	int			err;
@@ -320,17 +320,28 @@ struct arch_mbm_state {
 
 /**
  * struct rdt_hw_domain - Arch private attributes of a set of CPUs that share
- *			  a resource
+ *			  a control resource
  * @d_resctrl:	Properties exposed to the resctrl file system
  * @ctrl_val:	array of cache or mem ctrl values (indexed by CLOSID)
- * @arch_mbm_total:	arch private state for MBM total bandwidth
- * @arch_mbm_local:	arch private state for MBM local bandwidth
  *
  * Members of this structure are accessed via helpers that provide abstraction.
  */
 struct rdt_hw_domain {
 	struct rdt_domain		d_resctrl;
 	u32				*ctrl_val;
+};
+
+/**
+ * struct rdt_hw_mondomain - Arch private attributes of a set of CPUs that share
+ *			  a monitor resource
+ * @d_resctrl:	Properties exposed to the resctrl file system
+ * @arch_mbm_total:	arch private state for MBM total bandwidth
+ * @arch_mbm_local:	arch private state for MBM local bandwidth
+ *
+ * Members of this structure are accessed via helpers that provide abstraction.
+ */
+struct rdt_hw_mondomain {
+	struct rdt_mondomain		d_resctrl;
 	struct arch_mbm_state		*arch_mbm_total;
 	struct arch_mbm_state		*arch_mbm_local;
 };
@@ -340,6 +351,11 @@ static inline struct rdt_hw_domain *resctrl_to_arch_dom(struct rdt_domain *r)
 	return container_of(r, struct rdt_hw_domain, d_resctrl);
 }
 
+static inline struct rdt_hw_mondomain *resctrl_to_arch_mondom(struct rdt_mondomain *r)
+{
+	return container_of(r, struct rdt_hw_mondomain, d_resctrl);
+}
+
 /**
  * struct msr_param - set a range of MSRs from a domain
  * @res:       The resource to use
@@ -513,8 +529,8 @@ int rdtgroup_kn_mode_restore(struct rdtgroup *r, const char *name,
 			     umode_t mask);
 struct rdt_domain *rdt_find_ctrldomain(struct list_head *h, int id,
 				       struct list_head **pos);
-struct rdt_domain *rdt_find_mondomain(struct list_head *h, int id,
-				      struct list_head **pos);
+struct rdt_mondomain *rdt_find_mondomain(struct list_head *h, int id,
+					 struct list_head **pos);
 ssize_t rdtgroup_schemata_write(struct kernfs_open_file *of,
 				char *buf, size_t nbytes, loff_t off);
 int rdtgroup_schemata_show(struct kernfs_open_file *of,
@@ -543,17 +559,17 @@ bool __init rdt_cpu_has(int flag);
 void mon_event_count(void *info);
 int rdtgroup_mondata_show(struct seq_file *m, void *arg);
 void mon_event_read(struct rmid_read *rr, struct rdt_resource *r,
-		    struct rdt_domain *d, struct rdtgroup *rdtgrp,
+		    struct rdt_mondomain *d, struct rdtgroup *rdtgrp,
 		    int evtid, int first);
-void mbm_setup_overflow_handler(struct rdt_domain *dom,
+void mbm_setup_overflow_handler(struct rdt_mondomain *dom,
 				unsigned long delay_ms);
 void mbm_handle_overflow(struct work_struct *work);
 void __init intel_rdt_mbm_apply_quirk(void);
 bool is_mba_sc(struct rdt_resource *r);
-void cqm_setup_limbo_handler(struct rdt_domain *dom, unsigned long delay_ms);
+void cqm_setup_limbo_handler(struct rdt_mondomain *dom, unsigned long delay_ms);
 void cqm_handle_limbo(struct work_struct *work);
-bool has_busy_rmid(struct rdt_resource *r, struct rdt_domain *d);
-void __check_limbo(struct rdt_domain *d, bool force_free);
+bool has_busy_rmid(struct rdt_resource *r, struct rdt_mondomain *d);
+void __check_limbo(struct rdt_mondomain *d, bool force_free);
 void rdt_domain_reconfigure_cdp(struct rdt_resource *r);
 void __init thread_throttle_mode_init(void);
 void __init mbm_config_rftype_init(const char *config);
diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
index 97f6f9715fdb..3e08aa04a7ff 100644
--- a/arch/x86/kernel/cpu/resctrl/core.c
+++ b/arch/x86/kernel/cpu/resctrl/core.c
@@ -425,8 +425,8 @@ struct rdt_domain *rdt_find_ctrldomain(struct list_head *h, int id,
 	return __rdt_find_domain(h, id, pos);
 }
 
-struct rdt_domain *rdt_find_mondomain(struct list_head *h, int id,
-				      struct list_head **pos)
+struct rdt_mondomain *rdt_find_mondomain(struct list_head *h, int id,
+					 struct list_head **pos)
 {
 	return __rdt_find_domain(h, id, pos);
 }
@@ -451,7 +451,7 @@ static void domain_free(struct rdt_hw_domain *hw_dom)
 	kfree(hw_dom);
 }
 
-static void mondomain_free(struct rdt_hw_domain *hw_dom)
+static void mondomain_free(struct rdt_hw_mondomain *hw_dom)
 {
 	kfree(hw_dom->arch_mbm_total);
 	kfree(hw_dom->arch_mbm_local);
@@ -484,7 +484,7 @@ static int domain_setup_ctrlval(struct rdt_resource *r, struct rdt_domain *d)
  * @num_rmid:	The size of the MBM counter array
  * @hw_dom:	The domain that owns the allocated arrays
  */
-static int arch_domain_mbm_alloc(u32 num_rmid, struct rdt_hw_domain *hw_dom)
+static int arch_domain_mbm_alloc(u32 num_rmid, struct rdt_hw_mondomain *hw_dom)
 {
 	size_t tsize;
 
@@ -570,9 +570,9 @@ static void domain_add_cpu_ctrl(int cpu, struct rdt_resource *r)
 static void domain_add_cpu_mon(int cpu, struct rdt_resource *r)
 {
 	int id = get_domain_id_from_scope(cpu, r->mon_scope);
-	struct rdt_hw_domain *hw_mondom;
+	struct rdt_hw_mondomain *hw_mondom;
 	struct list_head *add_pos = NULL;
-	struct rdt_domain *d;
+	struct rdt_mondomain *d;
 	int err;
 
 	d = rdt_find_mondomain(&r->mondomains, id, &add_pos);
@@ -663,15 +663,15 @@ static void domain_remove_cpu_ctrl(int cpu, struct rdt_resource *r)
 static void domain_remove_cpu_mon(int cpu, struct rdt_resource *r)
 {
 	int id = get_domain_id_from_scope(cpu, r->mon_scope);
-	struct rdt_hw_domain *hw_mondom;
-	struct rdt_domain *d;
+	struct rdt_hw_mondomain *hw_mondom;
+	struct rdt_mondomain *d;
 
 	d = rdt_find_mondomain(&r->mondomains, id, NULL);
 	if (IS_ERR_OR_NULL(d)) {
 		pr_warn("Couldn't find scope id=%d for CPU %d\n", id, cpu);
 		return;
 	}
-	hw_mondom = resctrl_to_arch_dom(d);
+	hw_mondom = resctrl_to_arch_mondom(d);
 
 	cpumask_clear_cpu(cpu, &d->cpu_mask);
 	if (cpumask_empty(&d->cpu_mask)) {
diff --git a/arch/x86/kernel/cpu/resctrl/ctrlmondata.c b/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
index 468c1815edfd..5167ac9cbe98 100644
--- a/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
+++ b/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
@@ -521,7 +521,7 @@ int rdtgroup_schemata_show(struct kernfs_open_file *of,
 }
 
 void mon_event_read(struct rmid_read *rr, struct rdt_resource *r,
-		    struct rdt_domain *d, struct rdtgroup *rdtgrp,
+		    struct rdt_mondomain *d, struct rdtgroup *rdtgrp,
 		    int evtid, int first)
 {
 	/*
@@ -544,7 +544,7 @@ int rdtgroup_mondata_show(struct seq_file *m, void *arg)
 	struct rdtgroup *rdtgrp;
 	struct rdt_resource *r;
 	union mon_data_bits md;
-	struct rdt_domain *d;
+	struct rdt_mondomain *d;
 	struct rmid_read rr;
 	int ret = 0;
 
diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c
index 66beca785535..42262d59ef9b 100644
--- a/arch/x86/kernel/cpu/resctrl/monitor.c
+++ b/arch/x86/kernel/cpu/resctrl/monitor.c
@@ -170,7 +170,7 @@ static int __rmid_read(u32 rmid, enum resctrl_event_id eventid, u64 *val)
 	return 0;
 }
 
-static struct arch_mbm_state *get_arch_mbm_state(struct rdt_hw_domain *hw_dom,
+static struct arch_mbm_state *get_arch_mbm_state(struct rdt_hw_mondomain *hw_dom,
 						 u32 rmid,
 						 enum resctrl_event_id eventid)
 {
@@ -189,10 +189,10 @@ static struct arch_mbm_state *get_arch_mbm_state(struct rdt_hw_domain *hw_dom,
 	return NULL;
 }
 
-void resctrl_arch_reset_rmid(struct rdt_resource *r, struct rdt_domain *d,
+void resctrl_arch_reset_rmid(struct rdt_resource *r, struct rdt_mondomain *d,
 			     u32 rmid, enum resctrl_event_id eventid)
 {
-	struct rdt_hw_domain *hw_dom = resctrl_to_arch_dom(d);
+	struct rdt_hw_mondomain *hw_dom = resctrl_to_arch_mondom(d);
 	struct arch_mbm_state *am;
 
 	am = get_arch_mbm_state(hw_dom, rmid, eventid);
@@ -208,9 +208,9 @@ void resctrl_arch_reset_rmid(struct rdt_resource *r, struct rdt_domain *d,
  * Assumes that hardware counters are also reset and thus that there is
  * no need to record initial non-zero counts.
  */
-void resctrl_arch_reset_rmid_all(struct rdt_resource *r, struct rdt_domain *d)
+void resctrl_arch_reset_rmid_all(struct rdt_resource *r, struct rdt_mondomain *d)
 {
-	struct rdt_hw_domain *hw_dom = resctrl_to_arch_dom(d);
+	struct rdt_hw_mondomain *hw_dom = resctrl_to_arch_mondom(d);
 
 	if (is_mbm_total_enabled())
 		memset(hw_dom->arch_mbm_total, 0,
@@ -229,11 +229,11 @@ static u64 mbm_overflow_count(u64 prev_msr, u64 cur_msr, unsigned int width)
 	return chunks >> shift;
 }
 
-int resctrl_arch_rmid_read(struct rdt_resource *r, struct rdt_domain *d,
+int resctrl_arch_rmid_read(struct rdt_resource *r, struct rdt_mondomain *d,
 			   u32 rmid, enum resctrl_event_id eventid, u64 *val)
 {
 	struct rdt_hw_resource *hw_res = resctrl_to_arch_res(r);
-	struct rdt_hw_domain *hw_dom = resctrl_to_arch_dom(d);
+	struct rdt_hw_mondomain *hw_mondom = resctrl_to_arch_mondom(d);
 	struct arch_mbm_state *am;
 	u64 msr_val, chunks;
 	int ret;
@@ -245,7 +245,7 @@ int resctrl_arch_rmid_read(struct rdt_resource *r, struct rdt_domain *d,
 	if (ret)
 		return ret;
 
-	am = get_arch_mbm_state(hw_dom, rmid, eventid);
+	am = get_arch_mbm_state(hw_mondom, rmid, eventid);
 	if (am) {
 		am->chunks += mbm_overflow_count(am->prev_msr, msr_val,
 						 hw_res->mbm_width);
@@ -266,7 +266,7 @@ int resctrl_arch_rmid_read(struct rdt_resource *r, struct rdt_domain *d,
  * decrement the count. If the busy count gets to zero on an RMID, we
  * free the RMID
  */
-void __check_limbo(struct rdt_domain *d, bool force_free)
+void __check_limbo(struct rdt_mondomain *d, bool force_free)
 {
 	struct rdt_resource *r = &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl;
 	struct rmid_entry *entry;
@@ -305,7 +305,7 @@ void __check_limbo(struct rdt_domain *d, bool force_free)
 	}
 }
 
-bool has_busy_rmid(struct rdt_resource *r, struct rdt_domain *d)
+bool has_busy_rmid(struct rdt_resource *r, struct rdt_mondomain *d)
 {
 	return find_first_bit(d->rmid_busy_llc, r->num_rmid) != r->num_rmid;
 }
@@ -334,7 +334,7 @@ int alloc_rmid(void)
 static void add_rmid_to_limbo(struct rmid_entry *entry)
 {
 	struct rdt_resource *r = &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl;
-	struct rdt_domain *d;
+	struct rdt_mondomain *d;
 	int cpu, err;
 	u64 val = 0;
 
@@ -383,7 +383,7 @@ void free_rmid(u32 rmid)
 		list_add_tail(&entry->list, &rmid_free_lru);
 }
 
-static struct mbm_state *get_mbm_state(struct rdt_domain *d, u32 rmid,
+static struct mbm_state *get_mbm_state(struct rdt_mondomain *d, u32 rmid,
 				       enum resctrl_event_id evtid)
 {
 	switch (evtid) {
@@ -516,7 +516,7 @@ void mon_event_count(void *info)
  * throttle MSRs already have low percentage values.  To avoid
  * unnecessarily restricting such rdtgroups, we also increase the bandwidth.
  */
-static void update_mba_bw(struct rdtgroup *rgrp, struct rdt_domain *dom_mbm)
+static void update_mba_bw(struct rdtgroup *rgrp, struct rdt_mondomain *dom_mbm)
 {
 	u32 closid, rmid, cur_msr_val, new_msr_val;
 	struct mbm_state *pmbm_data, *cmbm_data;
@@ -600,7 +600,7 @@ static void update_mba_bw(struct rdtgroup *rgrp, struct rdt_domain *dom_mbm)
 	}
 }
 
-static void mbm_update(struct rdt_resource *r, struct rdt_domain *d, int rmid)
+static void mbm_update(struct rdt_resource *r, struct rdt_mondomain *d, int rmid)
 {
 	struct rmid_read rr;
 
@@ -641,12 +641,12 @@ void cqm_handle_limbo(struct work_struct *work)
 	unsigned long delay = msecs_to_jiffies(CQM_LIMBOCHECK_INTERVAL);
 	int cpu = smp_processor_id();
 	struct rdt_resource *r;
-	struct rdt_domain *d;
+	struct rdt_mondomain *d;
 
 	mutex_lock(&rdtgroup_mutex);
 
 	r = &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl;
-	d = container_of(work, struct rdt_domain, cqm_limbo.work);
+	d = container_of(work, struct rdt_mondomain, cqm_limbo.work);
 
 	__check_limbo(d, false);
 
@@ -656,7 +656,7 @@ void cqm_handle_limbo(struct work_struct *work)
 	mutex_unlock(&rdtgroup_mutex);
 }
 
-void cqm_setup_limbo_handler(struct rdt_domain *dom, unsigned long delay_ms)
+void cqm_setup_limbo_handler(struct rdt_mondomain *dom, unsigned long delay_ms)
 {
 	unsigned long delay = msecs_to_jiffies(delay_ms);
 	int cpu;
@@ -674,7 +674,7 @@ void mbm_handle_overflow(struct work_struct *work)
 	int cpu = smp_processor_id();
 	struct list_head *head;
 	struct rdt_resource *r;
-	struct rdt_domain *d;
+	struct rdt_mondomain *d;
 
 	mutex_lock(&rdtgroup_mutex);
 
@@ -682,7 +682,7 @@ void mbm_handle_overflow(struct work_struct *work)
 		goto out_unlock;
 
 	r = &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl;
-	d = container_of(work, struct rdt_domain, mbm_over.work);
+	d = container_of(work, struct rdt_mondomain, mbm_over.work);
 
 	list_for_each_entry(prgrp, &rdt_all_groups, rdtgroup_list) {
 		mbm_update(r, d, prgrp->mon.rmid);
@@ -701,7 +701,7 @@ void mbm_handle_overflow(struct work_struct *work)
 	mutex_unlock(&rdtgroup_mutex);
 }
 
-void mbm_setup_overflow_handler(struct rdt_domain *dom, unsigned long delay_ms)
+void mbm_setup_overflow_handler(struct rdt_mondomain *dom, unsigned long delay_ms)
 {
 	unsigned long delay = msecs_to_jiffies(delay_ms);
 	int cpu;
diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
index f2aec39c49df..5feec2c33544 100644
--- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
+++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
@@ -1496,7 +1496,7 @@ static void mon_event_config_read(void *info)
 	mon_info->mon_config = msrval & MAX_EVT_CONFIG_BITS;
 }
 
-static void mondata_config_read(struct rdt_domain *d, struct mon_config_info *mon_info)
+static void mondata_config_read(struct rdt_mondomain *d, struct mon_config_info *mon_info)
 {
 	smp_call_function_any(&d->cpu_mask, mon_event_config_read, mon_info, 1);
 }
@@ -1504,7 +1504,7 @@ static void mondata_config_read(struct rdt_domain *d, struct mon_config_info *mo
 static int mbm_config_show(struct seq_file *s, struct rdt_resource *r, u32 evtid)
 {
 	struct mon_config_info mon_info = {0};
-	struct rdt_domain *dom;
+	struct rdt_mondomain *dom;
 	bool sep = false;
 
 	mutex_lock(&rdtgroup_mutex);
@@ -1561,7 +1561,7 @@ static void mon_event_config_write(void *info)
 }
 
 static int mbm_config_write_domain(struct rdt_resource *r,
-				   struct rdt_domain *d, u32 evtid, u32 val)
+				   struct rdt_mondomain *d, u32 evtid, u32 val)
 {
 	struct mon_config_info mon_info = {0};
 	int ret = 0;
@@ -1611,7 +1611,7 @@ static int mon_config_write(struct rdt_resource *r, char *tok, u32 evtid)
 {
 	char *dom_str = NULL, *id_str;
 	unsigned long dom_id, val;
-	struct rdt_domain *d;
+	struct rdt_mondomain *d;
 	int ret = 0;
 
 next:
@@ -2476,7 +2476,7 @@ static void schemata_list_destroy(void)
 static int rdt_get_tree(struct fs_context *fc)
 {
 	struct rdt_fs_context *ctx = rdt_fc2context(fc);
-	struct rdt_domain *dom;
+	struct rdt_mondomain *dom;
 	struct rdt_resource *r;
 	int ret;
 
@@ -2858,7 +2858,7 @@ static void rmdir_mondata_subdir_allrdtgrp(struct rdt_resource *r,
 }
 
 static int mkdir_mondata_subdir(struct kernfs_node *parent_kn,
-				struct rdt_domain *d,
+				struct rdt_mondomain *d,
 				struct rdt_resource *r, struct rdtgroup *prgrp)
 {
 	union mon_data_bits priv;
@@ -2907,7 +2907,7 @@ static int mkdir_mondata_subdir(struct kernfs_node *parent_kn,
  * and "monitor" groups with given domain id.
  */
 static void mkdir_mondata_subdir_allrdtgrp(struct rdt_resource *r,
-					   struct rdt_domain *d)
+					   struct rdt_mondomain *d)
 {
 	struct kernfs_node *parent_kn;
 	struct rdtgroup *prgrp, *crgrp;
@@ -2929,7 +2929,7 @@ static int mkdir_mondata_subdir_alldom(struct kernfs_node *parent_kn,
 				       struct rdt_resource *r,
 				       struct rdtgroup *prgrp)
 {
-	struct rdt_domain *dom;
+	struct rdt_mondomain *dom;
 	int ret;
 
 	list_for_each_entry(dom, &r->mondomains, list) {
@@ -3714,7 +3714,7 @@ static int __init rdtgroup_setup_root(void)
 	return ret;
 }
 
-static void domain_destroy_mon_state(struct rdt_domain *d)
+static void domain_destroy_mon_state(struct rdt_mondomain *d)
 {
 	bitmap_free(d->rmid_busy_llc);
 	kfree(d->mbm_total);
@@ -3729,7 +3729,7 @@ void resctrl_offline_ctrl_domain(struct rdt_resource *r, struct rdt_domain *d)
 		mba_sc_domain_destroy(r, d);
 }
 
-void resctrl_offline_mon_domain(struct rdt_resource *r, struct rdt_domain *d)
+void resctrl_offline_mon_domain(struct rdt_resource *r, struct rdt_mondomain *d)
 {
 	lockdep_assert_held(&rdtgroup_mutex);
 
@@ -3758,7 +3758,7 @@ void resctrl_offline_mon_domain(struct rdt_resource *r, struct rdt_domain *d)
 	domain_destroy_mon_state(d);
 }
 
-static int domain_setup_mon_state(struct rdt_resource *r, struct rdt_domain *d)
+static int domain_setup_mon_state(struct rdt_resource *r, struct rdt_mondomain *d)
 {
 	size_t tsize;
 
@@ -3799,7 +3799,7 @@ int resctrl_online_ctrl_domain(struct rdt_resource *r, struct rdt_domain *d)
 	return 0;
 }
 
-int resctrl_online_mon_domain(struct rdt_resource *r, struct rdt_domain *d)
+int resctrl_online_mon_domain(struct rdt_resource *r, struct rdt_mondomain *d)
 {
 	int err;
 
-- 
2.41.0


^ permalink raw reply related	[flat|nested] 331+ messages in thread

* [PATCH v5 4/8] x86/resctrl: Add node-scope to the options for feature scope
  2023-08-29 23:44   ` [PATCH v5 0/8] " Tony Luck
                       ` (2 preceding siblings ...)
  2023-08-29 23:44     ` [PATCH v5 3/8] x86/resctrl: Split the rdt_domain structure Tony Luck
@ 2023-08-29 23:44     ` Tony Luck
  2023-09-25 23:25       ` Reinette Chatre
  2023-08-29 23:44     ` [PATCH v5 5/8] x86/resctrl: Introduce snc_nodes_per_l3_cache Tony Luck
                       ` (5 subsequent siblings)
  9 siblings, 1 reply; 331+ messages in thread
From: Tony Luck @ 2023-08-29 23:44 UTC (permalink / raw)
  To: Fenghua Yu, Reinette Chatre, Peter Newman, Jonathan Corbet,
	Shuah Khan, x86
  Cc: Shaopeng Tan, James Morse, Jamie Iles, Babu Moger, Randy Dunlap,
	linux-kernel, linux-doc, patches, Tony Luck

All currently supported resctrl features are scoped to domains that
match the scope of the L2 or L3 caches.

Add "node" as a new option for domain scope.

Signed-off-by: Tony Luck <tony.luck@intel.com>
---
 include/linux/resctrl.h            | 1 +
 arch/x86/kernel/cpu/resctrl/core.c | 2 ++
 2 files changed, 3 insertions(+)

diff --git a/include/linux/resctrl.h b/include/linux/resctrl.h
index 08382548571e..f55cf7afd4eb 100644
--- a/include/linux/resctrl.h
+++ b/include/linux/resctrl.h
@@ -163,6 +163,7 @@ struct resctrl_schema;
 enum resctrl_scope {
 	RESCTRL_L3_CACHE,
 	RESCTRL_L2_CACHE,
+	RESCTRL_NODE,
 };
 
 /**
diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
index 3e08aa04a7ff..9fcc264fac6c 100644
--- a/arch/x86/kernel/cpu/resctrl/core.c
+++ b/arch/x86/kernel/cpu/resctrl/core.c
@@ -514,6 +514,8 @@ static int get_domain_id_from_scope(int cpu, enum resctrl_scope scope)
 		return get_cpu_cacheinfo_id(cpu, 3);
 	case RESCTRL_L2_CACHE:
 		return get_cpu_cacheinfo_id(cpu, 2);
+	case RESCTRL_NODE:
+		return cpu_to_node(cpu);
 	default:
 		WARN_ON_ONCE(1);
 		break;
-- 
2.41.0


^ permalink raw reply related	[flat|nested] 331+ messages in thread

* [PATCH v5 5/8] x86/resctrl: Introduce snc_nodes_per_l3_cache
  2023-08-29 23:44   ` [PATCH v5 0/8] " Tony Luck
                       ` (3 preceding siblings ...)
  2023-08-29 23:44     ` [PATCH v5 4/8] x86/resctrl: Add node-scope to the options for feature scope Tony Luck
@ 2023-08-29 23:44     ` Tony Luck
  2023-09-25 23:27       ` Reinette Chatre
  2023-08-29 23:44     ` [PATCH v5 6/8] x86/resctrl: Sub NUMA Cluster detection and enable Tony Luck
                       ` (4 subsequent siblings)
  9 siblings, 1 reply; 331+ messages in thread
From: Tony Luck @ 2023-08-29 23:44 UTC (permalink / raw)
  To: Fenghua Yu, Reinette Chatre, Peter Newman, Jonathan Corbet,
	Shuah Khan, x86
  Cc: Shaopeng Tan, James Morse, Jamie Iles, Babu Moger, Randy Dunlap,
	linux-kernel, linux-doc, patches, Tony Luck

Intel Sub-NUMA Cluster mode requires several changes in resctrl
behavior for correct operation.

Add a global integer "snc_nodes_per_l3_cache" that will show how many
SNC nodes share each L3 cache. When this is "1", SNC mode is either
not implemented, or not enabled.

A later patch will detect SNC mode and set snc_nodes_per_l3_cache to
the appropriate value. For now it remains at the default "1" to
indicate SNC mode is not active.

The code paths that need to take action when SNC is enabled are:
1) The number of logical RMIDs available for use is the number of
   physical RMIDs divided by the number of SNC nodes.
2) Likewise the "mon_scale" value must be adjusted for the number
   of SNC nodes.
3) When reading an RMID counter, code must adjust from the logical
   RMID used to the physical RMID value that must be loaded into
   the IA32_QM_EVTSEL MSR (see the sketch after this list).
4) The L3 cache is divided between the SNC nodes. So the value
   reported in the resctrl "size" file is adjusted.
5) The "-o mba_MBps" mount option must be disabled in SNC mode
   because the monitoring is being done per SNC node, while the
   bandwidth allocation is still done at the L3 cache scope.
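
A compile-able sketch of the logical-to-physical RMID arithmetic behind
items 1 and 3 (standalone C; the RMID count, SNC node count and node id
below are made-up values for illustration only):

#include <stdio.h>

int main(void)
{
	int physical_rmids = 256;	/* hypothetical total per L3 cache */
	int snc_nodes_per_l3 = 2;	/* hypothetical SNC configuration */

	/* Item 1: logical RMIDs visible to resctrl per monitoring domain. */
	int logical_rmids = physical_rmids / snc_nodes_per_l3;

	/* Item 3: a CPU on the second SNC node reading logical RMID 5. */
	int node_on_l3 = 1;		/* this CPU's node id modulo snc_nodes_per_l3 */
	int logical_rmid = 5;
	int physical_rmid = logical_rmid + node_on_l3 * logical_rmids;

	printf("logical RMIDs per node: %d\n", logical_rmids);
	printf("logical %d on node %d -> physical %d\n",
	       logical_rmid, node_on_l3, physical_rmid);
	return 0;
}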

Signed-off-by: Tony Luck <tony.luck@intel.com>
---
 arch/x86/kernel/cpu/resctrl/internal.h |  2 ++
 arch/x86/kernel/cpu/resctrl/core.c     |  7 +++++++
 arch/x86/kernel/cpu/resctrl/monitor.c  | 16 +++++++++++++---
 arch/x86/kernel/cpu/resctrl/rdtgroup.c |  4 ++--
 4 files changed, 24 insertions(+), 5 deletions(-)

diff --git a/arch/x86/kernel/cpu/resctrl/internal.h b/arch/x86/kernel/cpu/resctrl/internal.h
index c61fd6709730..326ca6b3688a 100644
--- a/arch/x86/kernel/cpu/resctrl/internal.h
+++ b/arch/x86/kernel/cpu/resctrl/internal.h
@@ -446,6 +446,8 @@ DECLARE_STATIC_KEY_FALSE(rdt_alloc_enable_key);
 
 extern struct dentry *debugfs_resctrl;
 
+extern int snc_nodes_per_l3_cache;
+
 enum resctrl_res_level {
 	RDT_RESOURCE_L3,
 	RDT_RESOURCE_L2,
diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
index 9fcc264fac6c..ed4f55b3e5e4 100644
--- a/arch/x86/kernel/cpu/resctrl/core.c
+++ b/arch/x86/kernel/cpu/resctrl/core.c
@@ -48,6 +48,13 @@ int max_name_width, max_data_width;
  */
 bool rdt_alloc_capable;
 
+/*
+ * Number of SNC nodes that share each L3 cache.
+ * Default is 1 for systems that do not support
+ * SNC, or have SNC disabled.
+ */
+int snc_nodes_per_l3_cache = 1;
+
 static void
 mba_wrmsr_intel(struct rdt_domain *d, struct msr_param *m,
 		struct rdt_resource *r);
diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c
index 42262d59ef9b..b6b3fb0f9abe 100644
--- a/arch/x86/kernel/cpu/resctrl/monitor.c
+++ b/arch/x86/kernel/cpu/resctrl/monitor.c
@@ -148,8 +148,18 @@ static inline struct rmid_entry *__rmid_entry(u32 rmid)
 
 static int __rmid_read(u32 rmid, enum resctrl_event_id eventid, u64 *val)
 {
+	struct rdt_resource *r = &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl;
+	int cpu = smp_processor_id();
+	int rmid_offset = 0;
 	u64 msr_val;
 
+	/*
+	 * When SNC mode is on, need to compute the offset to read the
+	 * physical RMID counter for the node to which this CPU belongs
+	 */
+	if (snc_nodes_per_l3_cache > 1)
+		rmid_offset = (cpu_to_node(cpu) % snc_nodes_per_l3_cache) * r->num_rmid;
+
 	/*
 	 * As per the SDM, when IA32_QM_EVTSEL.EvtID (bits 7:0) is configured
 	 * with a valid event code for supported resource type and the bits
@@ -158,7 +168,7 @@ static int __rmid_read(u32 rmid, enum resctrl_event_id eventid, u64 *val)
 	 * IA32_QM_CTR.Error (bit 63) and IA32_QM_CTR.Unavailable (bit 62)
 	 * are error bits.
 	 */
-	wrmsr(MSR_IA32_QM_EVTSEL, eventid, rmid);
+	wrmsr(MSR_IA32_QM_EVTSEL, eventid, rmid + rmid_offset);
 	rdmsrl(MSR_IA32_QM_CTR, msr_val);
 
 	if (msr_val & RMID_VAL_ERROR)
@@ -783,8 +793,8 @@ int __init rdt_get_mon_l3_config(struct rdt_resource *r)
 	int ret;
 
 	resctrl_rmid_realloc_limit = boot_cpu_data.x86_cache_size * 1024;
-	hw_res->mon_scale = boot_cpu_data.x86_cache_occ_scale;
-	r->num_rmid = boot_cpu_data.x86_cache_max_rmid + 1;
+	hw_res->mon_scale = boot_cpu_data.x86_cache_occ_scale / snc_nodes_per_l3_cache;
+	r->num_rmid = (boot_cpu_data.x86_cache_max_rmid + 1) / snc_nodes_per_l3_cache;
 	hw_res->mbm_width = MBM_CNTR_WIDTH_BASE;
 
 	if (mbm_offset > 0 && mbm_offset <= MBM_CNTR_WIDTH_OFFSET_MAX)
diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
index 5feec2c33544..a8cf6251e506 100644
--- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
+++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
@@ -1367,7 +1367,7 @@ unsigned int rdtgroup_cbm_to_size(struct rdt_resource *r,
 		}
 	}
 
-	return size;
+	return size / snc_nodes_per_l3_cache;
 }
 
 /**
@@ -2600,7 +2600,7 @@ static int rdt_parse_param(struct fs_context *fc, struct fs_parameter *param)
 		ctx->enable_cdpl2 = true;
 		return 0;
 	case Opt_mba_mbps:
-		if (!supports_mba_mbps())
+		if (!supports_mba_mbps() || snc_nodes_per_l3_cache > 1)
 			return -EINVAL;
 		ctx->enable_mba_mbps = true;
 		return 0;
-- 
2.41.0


^ permalink raw reply related	[flat|nested] 331+ messages in thread

* [PATCH v5 6/8] x86/resctrl: Sub NUMA Cluster detection and enable
  2023-08-29 23:44   ` [PATCH v5 0/8] " Tony Luck
                       ` (4 preceding siblings ...)
  2023-08-29 23:44     ` [PATCH v5 5/8] x86/resctrl: Introduce snc_nodes_per_l3_cache Tony Luck
@ 2023-08-29 23:44     ` Tony Luck
  2023-09-25 23:29       ` Reinette Chatre
  2023-09-26 19:16       ` Moger, Babu
  2023-08-29 23:44     ` [PATCH v5 7/8] x86/resctrl: Update documentation with Sub-NUMA cluster changes Tony Luck
                       ` (3 subsequent siblings)
  9 siblings, 2 replies; 331+ messages in thread
From: Tony Luck @ 2023-08-29 23:44 UTC (permalink / raw)
  To: Fenghua Yu, Reinette Chatre, Peter Newman, Jonathan Corbet,
	Shuah Khan, x86
  Cc: Shaopeng Tan, James Morse, Jamie Iles, Babu Moger, Randy Dunlap,
	linux-kernel, linux-doc, patches, Tony Luck

There isn't a simple h/w bit that indicates whether a CPU is
running in Sub NUMA Cluster mode. Infer the state from the ratio
of NUMA nodes to L3 cache instances.

When SNC mode is detected, reconfigure the RMID counters by updating
the MSR_RMID_SNC_CONFIG MSR on each socket as CPUs are seen.

Signed-off-by: Tony Luck <tony.luck@intel.com>
---
 arch/x86/include/asm/msr-index.h   |  1 +
 arch/x86/kernel/cpu/resctrl/core.c | 68 ++++++++++++++++++++++++++++++
 2 files changed, 69 insertions(+)

diff --git a/arch/x86/include/asm/msr-index.h b/arch/x86/include/asm/msr-index.h
index 1d111350197f..393d1b047617 100644
--- a/arch/x86/include/asm/msr-index.h
+++ b/arch/x86/include/asm/msr-index.h
@@ -1100,6 +1100,7 @@
 #define MSR_IA32_QM_CTR			0xc8e
 #define MSR_IA32_PQR_ASSOC		0xc8f
 #define MSR_IA32_L3_CBM_BASE		0xc90
+#define MSR_RMID_SNC_CONFIG		0xca0
 #define MSR_IA32_L2_CBM_BASE		0xd10
 #define MSR_IA32_MBA_THRTL_BASE		0xd50
 
diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
index ed4f55b3e5e4..9f0ac9721fab 100644
--- a/arch/x86/kernel/cpu/resctrl/core.c
+++ b/arch/x86/kernel/cpu/resctrl/core.c
@@ -16,11 +16,14 @@
 
 #define pr_fmt(fmt)	"resctrl: " fmt
 
+#include <linux/cpu.h>
 #include <linux/slab.h>
 #include <linux/err.h>
 #include <linux/cacheinfo.h>
 #include <linux/cpuhotplug.h>
+#include <linux/mod_devicetable.h>
 
+#include <asm/cpu_device_id.h>
 #include <asm/intel-family.h>
 #include <asm/resctrl.h>
 #include "internal.h"
@@ -724,11 +727,34 @@ static void clear_closid_rmid(int cpu)
 	wrmsr(MSR_IA32_PQR_ASSOC, 0, 0);
 }
 
+/*
+ * The power-on reset value of MSR_RMID_SNC_CONFIG is 0x1
+ * which indicates that RMIDs are configured in legacy mode.
+ * Clearing bit 0 reconfigures the RMID counters for use
+ * in Sub NUMA Cluster mode.
+ */
+static void snc_remap_rmids(int cpu)
+{
+	u64 val;
+
+	/* Only need to enable once per package */
+	if (cpumask_first(topology_core_cpumask(cpu)) != cpu)
+		return;
+
+	rdmsrl(MSR_RMID_SNC_CONFIG, val);
+	val &= ~BIT_ULL(0);
+	wrmsrl(MSR_RMID_SNC_CONFIG, val);
+}
+
 static int resctrl_online_cpu(unsigned int cpu)
 {
 	struct rdt_resource *r;
 
 	mutex_lock(&rdtgroup_mutex);
+
+	if (snc_nodes_per_l3_cache > 1)
+		snc_remap_rmids(cpu);
+
 	for_each_capable_rdt_resource(r)
 		domain_add_cpu(cpu, r);
 	/* The cpu is set in default rdtgroup after online. */
@@ -983,11 +1009,53 @@ static __init bool get_rdt_resources(void)
 	return (rdt_mon_capable || rdt_alloc_capable);
 }
 
+/* CPU models that support MSR_RMID_SNC_CONFIG */
+static const struct x86_cpu_id snc_cpu_ids[] __initconst = {
+	X86_MATCH_INTEL_FAM6_MODEL(ICELAKE_X, 0),
+	X86_MATCH_INTEL_FAM6_MODEL(SAPPHIRERAPIDS_X, 0),
+	X86_MATCH_INTEL_FAM6_MODEL(EMERALDRAPIDS_X, 0),
+	{}
+};
+
+static __init int get_snc_config(void)
+{
+	unsigned long *node_caches;
+	int mem_only_nodes = 0;
+	int cpu, node, ret;
+
+	if (!x86_match_cpu(snc_cpu_ids))
+		return 1;
+
+	node_caches = kcalloc(BITS_TO_LONGS(nr_node_ids), sizeof(*node_caches), GFP_KERNEL);
+	if (!node_caches)
+		return 1;
+
+	cpus_read_lock();
+	for_each_node(node) {
+		cpu = cpumask_first(cpumask_of_node(node));
+		if (cpu < nr_cpu_ids)
+			set_bit(get_cpu_cacheinfo_id(cpu, 3), node_caches);
+		else
+			mem_only_nodes++;
+	}
+	cpus_read_unlock();
+
+	ret = (nr_node_ids - mem_only_nodes) / bitmap_weight(node_caches, nr_node_ids);
+	kfree(node_caches);
+
+	if (ret > 1)
+		rdt_resources_all[RDT_RESOURCE_L3].r_resctrl.mon_scope = RESCTRL_NODE;
+
+	return ret;
+}
+
 static __init void rdt_init_res_defs_intel(void)
 {
 	struct rdt_hw_resource *hw_res;
 	struct rdt_resource *r;
 
+	snc_nodes_per_l3_cache = get_snc_config();
+
 	for_each_rdt_resource(r) {
 		hw_res = resctrl_to_arch_res(r);
 
-- 
2.41.0


^ permalink raw reply related	[flat|nested] 331+ messages in thread

* [PATCH v5 7/8] x86/resctrl: Update documentation with Sub-NUMA cluster changes
  2023-08-29 23:44   ` [PATCH v5 0/8] " Tony Luck
                       ` (5 preceding siblings ...)
  2023-08-29 23:44     ` [PATCH v5 6/8] x86/resctrl: Sub NUMA Cluster detection and enable Tony Luck
@ 2023-08-29 23:44     ` Tony Luck
  2023-09-25 23:30       ` Reinette Chatre
  2023-09-26 19:00       ` Moger, Babu
  2023-08-29 23:44     ` [PATCH v5 8/8] selftests/resctrl: Adjust effective L3 cache size when SNC enabled Tony Luck
                       ` (2 subsequent siblings)
  9 siblings, 2 replies; 331+ messages in thread
From: Tony Luck @ 2023-08-29 23:44 UTC (permalink / raw)
  To: Fenghua Yu, Reinette Chatre, Peter Newman, Jonathan Corbet,
	Shuah Khan, x86
  Cc: Shaopeng Tan, James Morse, Jamie Iles, Babu Moger, Randy Dunlap,
	linux-kernel, linux-doc, patches, Tony Luck

With Sub-NUMA Cluster mode enabled the scope of monitoring resources is
per-NODE instead of per-L3 cache. Suffixes of directories with "L3" in
their name refer to Sub-NUMA nodes instead of L3 cache ids.

Users should be aware that SNC mode also affects the amount of L3 cache
available for allocation within each SNC node.

Signed-off-by: Tony Luck <tony.luck@intel.com>
---
 Documentation/arch/x86/resctrl.rst | 25 ++++++++++++++++++++++---
 1 file changed, 22 insertions(+), 3 deletions(-)

diff --git a/Documentation/arch/x86/resctrl.rst b/Documentation/arch/x86/resctrl.rst
index cb05d90111b4..407764f43f25 100644
--- a/Documentation/arch/x86/resctrl.rst
+++ b/Documentation/arch/x86/resctrl.rst
@@ -345,9 +345,15 @@ When control is enabled all CTRL_MON groups will also contain:
 When monitoring is enabled all MON groups will also contain:
 
 "mon_data":
-	This contains a set of files organized by L3 domain and by
-	RDT event. E.g. on a system with two L3 domains there will
-	be subdirectories "mon_L3_00" and "mon_L3_01".	Each of these
+	This contains a set of files organized by L3 domain or by NUMA
+	node (depending on whether Sub-NUMA Cluster (SNC) mode is disabled
+	or enabled respectively) and by RDT event. E.g. on a system with
+	SNC mode disabled with two L3 domains there will be subdirectories
+	"mon_L3_00" and "mon_L3_01". The numerical suffix refers to the
+	L3 cache id.  With SNC enabled the directory names are the same,
+	but the numerical suffix refers to the node id.
+	Mappings from node ids to CPUs are available in the
+	/sys/devices/system/node/node*/cpulist files. Each of these
 	directories have one file per event (e.g. "llc_occupancy",
 	"mbm_total_bytes", and "mbm_local_bytes"). In a MON group these
 	files provide a read out of the current value of the event for
@@ -452,6 +458,19 @@ and 0xA are not.  On a system with a 20-bit mask each bit represents 5%
 of the capacity of the cache. You could partition the cache into four
 equal parts with masks: 0x1f, 0x3e0, 0x7c00, 0xf8000.
 
+Notes on Sub-NUMA Cluster mode
+==============================
+When SNC mode is enabled the "llc_occupancy", "mbm_total_bytes", and
+"mbm_local_bytes" will only give accurate results for well behaved NUMA
+applications. I.e. those that perform the majority of memory accesses
+to memory on the local NUMA node to the CPU where the task is executing.
+
+The cache allocation feature still provides the same number of
+bits in a mask to control allocation into the L3 cache. But each
+of those ways has its capacity reduced because the cache is divided
+between the SNC nodes. The values reported in the resctrl
+"size" files are adjusted accordingly.
+
 Memory bandwidth Allocation and monitoring
 ==========================================
 
-- 
2.41.0


^ permalink raw reply related	[flat|nested] 331+ messages in thread

* [PATCH v5 8/8] selftests/resctrl: Adjust effective L3 cache size when SNC enabled
  2023-08-29 23:44   ` [PATCH v5 0/8] " Tony Luck
                       ` (6 preceding siblings ...)
  2023-08-29 23:44     ` [PATCH v5 7/8] x86/resctrl: Update documentation with Sub-NUMA cluster changes Tony Luck
@ 2023-08-29 23:44     ` Tony Luck
  2023-08-30 13:08       ` Maciej Wieczór-Retman
  2023-09-07  5:21       ` Shaopeng Tan (Fujitsu)
  2023-09-11 20:23     ` [PATCH v5 0/8] Add support for Sub-NUMA cluster (SNC) systems Reinette Chatre
  2023-09-28 19:13     ` [PATCH v6 " Tony Luck
  9 siblings, 2 replies; 331+ messages in thread
From: Tony Luck @ 2023-08-29 23:44 UTC (permalink / raw)
  To: Fenghua Yu, Reinette Chatre, Peter Newman, Jonathan Corbet,
	Shuah Khan, x86
  Cc: Shaopeng Tan, James Morse, Jamie Iles, Babu Moger, Randy Dunlap,
	linux-kernel, linux-doc, patches, Tony Luck

Sub-NUMA Cluster divides CPUs sharing an L3 cache into separate NUMA
nodes. Systems may support splitting into either two or four nodes.

When SNC mode is enabled the effective amount of L3 cache available
for allocation is divided by the number of nodes per L3.

Detect which SNC mode is active by comparing the number of CPUs
that share a cache with CPU0, with the number of CPUs on node0.

This gives some hope of tests passing. But additional test
infrastructure changes are needed to bind tests to nodes and
guarantee memory allocation from the local node.

Reported-by: "Shaopeng Tan (Fujitsu)" <tan.shaopeng@fujitsu.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
---
 tools/testing/selftests/resctrl/resctrl.h   |  1 +
 tools/testing/selftests/resctrl/resctrlfs.c | 57 +++++++++++++++++++++
 2 files changed, 58 insertions(+)

diff --git a/tools/testing/selftests/resctrl/resctrl.h b/tools/testing/selftests/resctrl/resctrl.h
index 87e39456dee0..a8b43210b573 100644
--- a/tools/testing/selftests/resctrl/resctrl.h
+++ b/tools/testing/selftests/resctrl/resctrl.h
@@ -13,6 +13,7 @@
 #include <signal.h>
 #include <dirent.h>
 #include <stdbool.h>
+#include <ctype.h>
 #include <sys/stat.h>
 #include <sys/ioctl.h>
 #include <sys/mount.h>
diff --git a/tools/testing/selftests/resctrl/resctrlfs.c b/tools/testing/selftests/resctrl/resctrlfs.c
index fb00245dee92..79eecbf9f863 100644
--- a/tools/testing/selftests/resctrl/resctrlfs.c
+++ b/tools/testing/selftests/resctrl/resctrlfs.c
@@ -130,6 +130,61 @@ int get_resource_id(int cpu_no, int *resource_id)
 	return 0;
 }
 
+/*
+ * Count number of CPUs in a /sys bit map
+ */
+static int count_sys_bitmap_bits(char *name)
+{
+	FILE *fp = fopen(name, "r");
+	int count = 0, c;
+
+	if (!fp)
+		return 0;
+
+	while ((c = fgetc(fp)) != EOF) {
+		if (!isxdigit(c))
+			continue;
+		switch (c) {
+		case 'f':
+			count++;
+		case '7': case 'b': case 'd': case 'e':
+			count++;
+		case '3': case '5': case '6': case '9': case 'a': case 'c':
+			count++;
+		case '1': case '2': case '4': case '8':
+			count++;
+		}
+	}
+	fclose(fp);
+
+	return count;
+}
+
+/*
+ * Detect SNC by comparing #CPUs in node0 with #CPUs sharing LLC with CPU0
+ * Try to get this right, even if a few CPUs are offline so that the number
+ * of CPUs in node0 is not exactly half or a quarter of the CPUs sharing the
+ * LLC of CPU0.
+ */
+static int snc_ways(void)
+{
+	int node_cpus, cache_cpus;
+
+	node_cpus = count_sys_bitmap_bits("/sys/devices/system/node/node0/cpumap");
+	cache_cpus = count_sys_bitmap_bits("/sys/devices/system/cpu/cpu0/cache/index3/shared_cpu_map");
+
+	if (!node_cpus || !cache_cpus) {
+		fprintf(stderr, "Warning could not determine Sub-NUMA Cluster mode\n");
+		return 1;
+	}
+
+	if (4 * node_cpus >= cache_cpus)
+		return 4;
+	else if (2 * node_cpus >= cache_cpus)
+		return 2;
+	return 1;
+}
+
 /*
  * get_cache_size - Get cache size for a specified CPU
  * @cpu_no:	CPU number
@@ -190,6 +245,8 @@ int get_cache_size(int cpu_no, char *cache_type, unsigned long *cache_size)
 			break;
 	}
 
+	if (cache_num == 3)
+		*cache_size /= snc_ways();
 	return 0;
 }
 
-- 
2.41.0


^ permalink raw reply related	[flat|nested] 331+ messages in thread

* Re: [PATCH v5 1/8] x86/resctrl: Prepare for new domain scope
  2023-08-29 23:44     ` [PATCH v5 1/8] x86/resctrl: Prepare for new domain scope Tony Luck
@ 2023-08-30  8:53       ` Maciej Wieczór-Retman
  2023-08-30 14:11         ` Luck, Tony
  2023-09-25 23:22       ` Reinette Chatre
  2023-09-26 16:36       ` Moger, Babu
  2 siblings, 1 reply; 331+ messages in thread
From: Maciej Wieczór-Retman @ 2023-08-30  8:53 UTC (permalink / raw)
  To: Tony Luck
  Cc: Fenghua Yu, Reinette Chatre, Peter Newman, Jonathan Corbet,
	Shuah Khan, x86, Shaopeng Tan, James Morse, Jamie Iles,
	Babu Moger, Randy Dunlap, linux-kernel, linux-doc, patches

Hello,

On 2023-08-29 at 16:44:19 -0700, Tony Luck wrote:
>diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
>index 030d3b409768..0d3bae523ecb 100644
>--- a/arch/x86/kernel/cpu/resctrl/core.c
>+++ b/arch/x86/kernel/cpu/resctrl/core.c
>@@ -487,6 +487,21 @@ static int arch_domain_mbm_alloc(u32 num_rmid, struct rdt_hw_domain *hw_dom)
> 	return 0;
> }
> 
>+static int get_domain_id_from_scope(int cpu, enum resctrl_scope scope)
>+{
>+	switch (scope) {
>+	case RESCTRL_L3_CACHE:
>+		return get_cpu_cacheinfo_id(cpu, 3);
>+	case RESCTRL_L2_CACHE:
>+		return get_cpu_cacheinfo_id(cpu, 2);
>+	default:
>+		WARN_ON_ONCE(1);
>+		break;
>+	}
>+
>+	return -1;
>+}

Is there some reason the "return -1" is outside of the default switch
case?

Other switch statements in this patch do have returns inside the default
case, is this one different in some way?

-- 
Kind regards
Maciej Wieczór-Retman

^ permalink raw reply	[flat|nested] 331+ messages in thread

* Re: [PATCH v5 8/8] selftests/resctrl: Adjust effective L3 cache size when SNC enabled
  2023-08-29 23:44     ` [PATCH v5 8/8] selftests/resctrl: Adjust effective L3 cache size when SNC enabled Tony Luck
@ 2023-08-30 13:08       ` Maciej Wieczór-Retman
  2023-08-30 15:43         ` Luck, Tony
  2023-09-07  5:21       ` Shaopeng Tan (Fujitsu)
  1 sibling, 1 reply; 331+ messages in thread
From: Maciej Wieczór-Retman @ 2023-08-30 13:08 UTC (permalink / raw)
  To: Tony Luck
  Cc: Fenghua Yu, Reinette Chatre, Peter Newman, Jonathan Corbet,
	Shuah Khan, x86, Shaopeng Tan, James Morse, Jamie Iles,
	Babu Moger, Randy Dunlap, linux-kernel, linux-doc, patches

Hello,

On 2023-08-29 at 16:44:26 -0700, Tony Luck wrote:
>Sub-NUMA Cluster divides CPUs sharing an L3 cache into separate NUMA
>nodes. Systems may support splitting into either two or four nodes.
>
>When SNC mode is enabled the effective amount of L3 cache available
>for allocation is divided by the number of nodes per L3.
>
>Detect which SNC mode is active by comparing the number of CPUs
>that share a cache with CPU0, with the number of CPUs on node0.
>
>This gives some hope of tests passing. But additional test
>infrastructure changes are needed to bind tests to nodes and
>guarantee memory allocation from the local node.
>
>Reported-by: "Shaopeng Tan (Fujitsu)" <tan.shaopeng@fujitsu.com>
>Signed-off-by: Tony Luck <tony.luck@intel.com>
>---
> tools/testing/selftests/resctrl/resctrl.h   |  1 +
> tools/testing/selftests/resctrl/resctrlfs.c | 57 +++++++++++++++++++++
> 2 files changed, 58 insertions(+)
>
>diff --git a/tools/testing/selftests/resctrl/resctrl.h b/tools/testing/selftests/resctrl/resctrl.h
>index 87e39456dee0..a8b43210b573 100644
>--- a/tools/testing/selftests/resctrl/resctrl.h
>+++ b/tools/testing/selftests/resctrl/resctrl.h
>@@ -13,6 +13,7 @@
> #include <signal.h>
> #include <dirent.h>
> #include <stdbool.h>
>+#include <ctype.h>
> #include <sys/stat.h>
> #include <sys/ioctl.h>
> #include <sys/mount.h>
>diff --git a/tools/testing/selftests/resctrl/resctrlfs.c b/tools/testing/selftests/resctrl/resctrlfs.c
>index fb00245dee92..79eecbf9f863 100644
>--- a/tools/testing/selftests/resctrl/resctrlfs.c
>+++ b/tools/testing/selftests/resctrl/resctrlfs.c
>@@ -130,6 +130,61 @@ int get_resource_id(int cpu_no, int *resource_id)
> 	return 0;
> }
> 
>+/*
>+ * Count number of CPUs in a /sys bit map
>+ */
>+static int count_sys_bitmap_bits(char *name)
>+{
>+	FILE *fp = fopen(name, "r");
>+	int count = 0, c;
>+
>+	if (!fp)
>+		return 0;
>+
>+	while ((c = fgetc(fp)) != EOF) {
>+		if (!isxdigit(c))
>+			continue;
>+		switch (c) {
>+		case 'f':
>+			count++;
>+		case '7': case 'b': case 'd': case 'e':
>+			count++;
>+		case '3': case '5': case '6': case '9': case 'a': case 'c':
>+			count++;
>+		case '1': case '2': case '4': case '8':
>+			count++;
>+		}
>+	}
>+	fclose(fp);
>+
>+	return count;
>+}
>+

The resctrl selftest has a function for counting bits, could it be used
here instead of the switch statement like this for example?

count = count_bits(c);

Or is there some reason this wouldn't be a good fit here?

-- 
Kind regards
Maciej Wieczór-Retman

^ permalink raw reply	[flat|nested] 331+ messages in thread

* RE: [PATCH v5 1/8] x86/resctrl: Prepare for new domain scope
  2023-08-30  8:53       ` Maciej Wieczór-Retman
@ 2023-08-30 14:11         ` Luck, Tony
  2023-08-31 13:02           ` Maciej Wieczór-Retman
  0 siblings, 1 reply; 331+ messages in thread
From: Luck, Tony @ 2023-08-30 14:11 UTC (permalink / raw)
  To: Wieczor-Retman, Maciej
  Cc: Yu, Fenghua, Chatre, Reinette, Peter Newman, Jonathan Corbet,
	Shuah Khan, x86, Shaopeng Tan, James Morse, Jamie Iles,
	Babu Moger, Randy Dunlap, linux-kernel, linux-doc, patches

> >+static int get_domain_id_from_scope(int cpu, enum resctrl_scope scope)
> >+{
> >+    switch (scope) {
> >+    case RESCTRL_L3_CACHE:
> >+            return get_cpu_cacheinfo_id(cpu, 3);
> >+    case RESCTRL_L2_CACHE:
> >+            return get_cpu_cacheinfo_id(cpu, 2);
> >+    default:
> >+            WARN_ON_ONCE(1);
> >+            break;
> >+    }
> >+
> >+    return -1;
> >+}
>
> Is there some reason the "return -1" is outside of the default switch
> case?
>
> Other switch statements in this patch do have returns inside the default
> case, is this one different in some way?

I've sometimes had compilers complain about code written:

static int get_domain_id_from_scope(int cpu, enum resctrl_scope scope)
{
        switch (scope) {
        case RESCTRL_L3_CACHE:
                return get_cpu_cacheinfo_id(cpu, 3);
        case RESCTRL_L2_CACHE:
                return get_cpu_cacheinfo_id(cpu, 2);
        default:
                WARN_ON_ONCE(1);
                return -1;
        }
}

because they failed to notice that every path in the switch does a "return" and they
issue a warning that the function has no return value because they don't realize
that the end of the function can't be reached.

So it's defensive programming against possible compiler issues.

-Tony







^ permalink raw reply	[flat|nested] 331+ messages in thread

* RE: [PATCH v5 8/8] selftests/resctrl: Adjust effective L3 cache size when SNC enabled
  2023-08-30 13:08       ` Maciej Wieczór-Retman
@ 2023-08-30 15:43         ` Luck, Tony
  2023-08-31  6:51           ` Maciej Wieczór-Retman
  0 siblings, 1 reply; 331+ messages in thread
From: Luck, Tony @ 2023-08-30 15:43 UTC (permalink / raw)
  To: Wieczor-Retman, Maciej
  Cc: Yu, Fenghua, Chatre, Reinette, Peter Newman, Jonathan Corbet,
	Shuah Khan, x86, Shaopeng Tan, James Morse, Jamie Iles,
	Babu Moger, Randy Dunlap, linux-kernel, linux-doc, patches

> >+static int count_sys_bitmap_bits(char *name)
> >+{
> >+    FILE *fp = fopen(name, "r");
> >+    int count = 0, c;
> >+
> >+    if (!fp)
> >+            return 0;
> >+
> >+    while ((c = fgetc(fp)) != EOF) {
> >+            if (!isxdigit(c))
> >+                    continue;
> >+            switch (c) {
> >+            case 'f':
> >+                    count++;
> >+            case '7': case 'b': case 'd': case 'e':
> >+                    count++;
> >+            case '3': case '5': case '6': case '9': case 'a': case 'c':
> >+                    count++;
> >+            case '1': case '2': case '4': case '8':
> >+                    count++;
> >+            }
> >+    }
> >+    fclose(fp);
> >+
> >+    return count;
> >+}
> >+
>
> The resctrl selftest has a function for counting bits, could it be used
> here instead of the switch statement like this for example?
>
> count = count_bits(c);
>
> Or is there some reason this wouldn't be a good fit here?

Thanks for looking at my patch.

That count_bits() function is doing so with input from an "unsigned long"
argument.  My function is parsing the string result from a sysfs file which
might look like this:

$ cat shared_cpu_map
0000,00000fff,ffffff00,0000000f,ffffffff

To use count_bits() I'd have to use something like strtol() on each of the
comma separated fields first to convert from ascii strings to binary
values to feed into count_bits().
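
For illustration, a rough and untested sketch of that approach (assuming
count_bits() is the existing selftest helper that takes an unsigned long,
and that string.h/stdlib.h are available):

static int count_sys_bitmap_bits(char *name)
{
	char buf[1024], *tok, *save;
	int count = 0;
	FILE *fp;

	fp = fopen(name, "r");
	if (!fp)
		return 0;
	if (!fgets(buf, sizeof(buf), fp)) {
		fclose(fp);
		return 0;
	}
	fclose(fp);

	/* Each comma separated field is at most 8 hex digits (32 bits) */
	for (tok = strtok_r(buf, ",\n", &save); tok;
	     tok = strtok_r(NULL, ",\n", &save))
		count += count_bits(strtoul(tok, NULL, 16));

	return count;
}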

-Tony



^ permalink raw reply	[flat|nested] 331+ messages in thread

* Re: [PATCH v5 8/8] selftests/resctrl: Adjust effective L3 cache size when SNC enabled
  2023-08-30 15:43         ` Luck, Tony
@ 2023-08-31  6:51           ` Maciej Wieczór-Retman
  2023-08-31 17:04             ` Luck, Tony
  0 siblings, 1 reply; 331+ messages in thread
From: Maciej Wieczór-Retman @ 2023-08-31  6:51 UTC (permalink / raw)
  To: Luck, Tony
  Cc: Yu, Fenghua, Chatre, Reinette, Peter Newman, Jonathan Corbet,
	Shuah Khan, x86, Shaopeng Tan, James Morse, Jamie Iles,
	Babu Moger, Randy Dunlap, linux-kernel, linux-doc, patches

On 2023-08-30 at 15:43:05 +0000, Luck, Tony wrote:
>> >+static int count_sys_bitmap_bits(char *name)
>> >+{
>> >+    FILE *fp = fopen(name, "r");
>> >+    int count = 0, c;
>> >+
>> >+    if (!fp)
>> >+            return 0;
>> >+
>> >+    while ((c = fgetc(fp)) != EOF) {
>> >+            if (!isxdigit(c))
>> >+                    continue;
>> >+            switch (c) {
>> >+            case 'f':
>> >+                    count++;
>> >+            case '7': case 'b': case 'd': case 'e':
>> >+                    count++;
>> >+            case '3': case '5': case '6': case '9': case 'a': case 'c':
>> >+                    count++;
>> >+            case '1': case '2': case '4': case '8':
>> >+                    count++;
>> >+            }
>> >+    }
>> >+    fclose(fp);
>> >+
>> >+    return count;
>> >+}
>> >+
>>
>> The resctrl selftest has a function for counting bits, could it be used
>> here instead of the switch statement like this for example?
>>
>> count = count_bits(c);
>>
>> Or is there some reason this wouldn't be a good fit here?
>
>Thanks for looking at my patch.
>
>That count_bits() function is doing so with input from an "unsigned long"
>argument.  My function is parsing the string result from a sysfs file which
>might look like this:
>
>$ cat shared_cpu_map
>0000,00000fff,ffffff00,0000000f,ffffffff
>
>To use count_bits() I'd have to use something like strtol() on each of the
>comma separated fields first to convert from ascii strings to binary
>values to feed into count_bits().

I missed they are being read as characters and not bytes, sorry.

Out of curiosity, what about using fscanf instead of fgetc? With the
format being %x and reading one byte at a time. Then instead of
isxdigit just checking if the read number was bigger than 0xF.

I also remembered there is a gcc (and I think clang has it as well)
builtin function that returns the number of set bits in a number.

So it would look like this:

	while ((fscanf(fp, "%x", &c)) != EOF) {
	    if (c > 0xF)
		    continue;
	    count += __builtin_popcount(c);
	}

Are there some problems with an approach like that?

-- 
Kind regards
Maciej Wieczór-Retman

^ permalink raw reply	[flat|nested] 331+ messages in thread

* Re: [PATCH v5 1/8] x86/resctrl: Prepare for new domain scope
  2023-08-30 14:11         ` Luck, Tony
@ 2023-08-31 13:02           ` Maciej Wieczór-Retman
  2023-08-31 17:26             ` Luck, Tony
  0 siblings, 1 reply; 331+ messages in thread
From: Maciej Wieczór-Retman @ 2023-08-31 13:02 UTC (permalink / raw)
  To: Luck, Tony
  Cc: Yu, Fenghua, Chatre, Reinette, Peter Newman, Jonathan Corbet,
	Shuah Khan, x86, Shaopeng Tan, James Morse, Jamie Iles,
	Babu Moger, Randy Dunlap, linux-kernel, linux-doc, patches

On 2023-08-30 at 14:11:14 +0000, Luck, Tony wrote:
>> >+static int get_domain_id_from_scope(int cpu, enum resctrl_scope scope)
>> >+{
>> >+    switch (scope) {
>> >+    case RESCTRL_L3_CACHE:
>> >+            return get_cpu_cacheinfo_id(cpu, 3);
>> >+    case RESCTRL_L2_CACHE:
>> >+            return get_cpu_cacheinfo_id(cpu, 2);
>> >+    default:
>> >+            WARN_ON_ONCE(1);
>> >+            break;
>> >+    }
>> >+
>> >+    return -1;
>> >+}
>>
>> Is there some reason the "return -1" is outside of the default switch
>> case?
>>
>> Other switch statements in this patch do have returns inside the default
>> case, is this one different in some way?
>
>I've sometimes had compilers complain about code written:
>
>static int get_domain_id_from_scope(int cpu, enum resctrl_scope scope)
>{
>        switch (scope) {
>        case RESCTRL_L3_CACHE:
>                return get_cpu_cacheinfo_id(cpu, 3);
>        case RESCTRL_L2_CACHE:
>                return get_cpu_cacheinfo_id(cpu, 2);
>        default:
>                WARN_ON_ONCE(1);
>                return -1;
>        }
>}
>
>because they failed to notice that every path in the switch does a "return" and they
>issue a warning that the function has no return value because they don't realize
>that the end of the function can't be reached.
>
>So it's defensive programming against possible compiler issues.

I recall getting that error somewhere while playing around with a
language server protocol for neovim a while ago but I tried to cause
it today with gcc and clang and with some different flags and couldn't.
Are there some particular compilers or compiler flags that trigger
that?

-- 
Kind regards
Maciej Wieczór-Retman

^ permalink raw reply	[flat|nested] 331+ messages in thread

* RE: [PATCH v5 8/8] selftests/resctrl: Adjust effective L3 cache size when SNC enabled
  2023-08-31  6:51           ` Maciej Wieczór-Retman
@ 2023-08-31 17:04             ` Luck, Tony
  0 siblings, 0 replies; 331+ messages in thread
From: Luck, Tony @ 2023-08-31 17:04 UTC (permalink / raw)
  To: Wieczor-Retman, Maciej
  Cc: Yu, Fenghua, Chatre, Reinette, Peter Newman, Jonathan Corbet,
	Shuah Khan, x86, Shaopeng Tan, James Morse, Jamie Iles,
	Babu Moger, Randy Dunlap, linux-kernel, linux-doc, patches

> So it would look like this:
>
>       while ((fscanf(fp, "%x", &c)) != EOF) {
>           if (c > 0xF)
>                   continue;
>           count += __builtin_popcount(c);
>       }
>
> Are there some problems with an approach like that?

I think I'd prefer something that does more checking on the input
(e.g. the hex numbers are separated by "," and terminated with
a '\n').

If Shuah Khan doesn't like my original patch I can re-write
to use fscanf() et. al.
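
Perhaps something along these lines (untested sketch that only checks
that the fields are separated by ',' and that the line ends with a
newline):

static int count_sys_bitmap_bits(char *name)
{
	FILE *fp = fopen(name, "r");
	unsigned int field;
	int count = 0, sep;

	if (!fp)
		return 0;

	while (fscanf(fp, "%x", &field) == 1) {
		count += __builtin_popcount(field);
		sep = fgetc(fp);
		if (sep == '\n' || sep == EOF)
			break;
		if (sep != ',') {
			count = 0;	/* malformed input */
			break;
		}
	}
	fclose(fp);

	return count;
}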

-Tony

^ permalink raw reply	[flat|nested] 331+ messages in thread

* RE: [PATCH v5 1/8] x86/resctrl: Prepare for new domain scope
  2023-08-31 13:02           ` Maciej Wieczór-Retman
@ 2023-08-31 17:26             ` Luck, Tony
  0 siblings, 0 replies; 331+ messages in thread
From: Luck, Tony @ 2023-08-31 17:26 UTC (permalink / raw)
  To: Wieczor-Retman, Maciej
  Cc: Yu, Fenghua, Chatre, Reinette, Peter Newman, Jonathan Corbet,
	Shuah Khan, x86, Shaopeng Tan, James Morse, Jamie Iles,
	Babu Moger, Randy Dunlap, linux-kernel, linux-doc, patches

> >I've sometimes had compilers complain about code written:
> >
> >static int get_domain_id_from_scope(int cpu, enum resctrl_scope scope)
> >{
> >        switch (scope) {
> >        case RESCTRL_L3_CACHE:
> >                return get_cpu_cacheinfo_id(cpu, 3);
> >        case RESCTRL_L2_CACHE:
> >                return get_cpu_cacheinfo_id(cpu, 2);
> >        default:
> >                WARN_ON_ONCE(1);
> >                return -1;
> >        }
> >}
> >
> >because they failed to notice that every path in the switch does a "return" and they
> >issue a warning that the function has no return value because they don't realize
> >that the end of the function can't be reached.
> >
> >So it's defensive programming against possible compiler issues.
>
> I recall getting that error somewhere while playing around with a
> language server protocol for neovim a while ago but I tried to cause
> it today with gcc and clang and with some different flags and couldn't.
> Are there some particular compilers or compiler flags that trigger
> that?

I think I saw this from an lkp report using clang. But it's quite possible
that the exact code construct was different in some way.

-Tony

^ permalink raw reply	[flat|nested] 331+ messages in thread

* RE: [PATCH v5 8/8] selftests/resctrl: Adjust effective L3 cache size when SNC enabled
  2023-08-29 23:44     ` [PATCH v5 8/8] selftests/resctrl: Adjust effective L3 cache size when SNC enabled Tony Luck
  2023-08-30 13:08       ` Maciej Wieczór-Retman
@ 2023-09-07  5:21       ` Shaopeng Tan (Fujitsu)
  2023-09-07 16:19         ` Luck, Tony
  1 sibling, 1 reply; 331+ messages in thread
From: Shaopeng Tan (Fujitsu) @ 2023-09-07  5:21 UTC (permalink / raw)
  To: 'Tony Luck',
	Fenghua Yu, Reinette Chatre, Peter Newman, Jonathan Corbet,
	Shuah Khan, x86
  Cc: James Morse, Jamie Iles, Babu Moger, Randy Dunlap, linux-kernel,
	linux-doc, patches

Hi Tony,

> Sub-NUMA Cluster divides CPUs sharing an L3 cache into separate NUMA
> nodes. Systems may support splitting into either two or four nodes.
> 
> When SNC mode is enabled the effective amount of L3 cache available for
> allocation is divided by the number of nodes per L3.
> 
> Detect which SNC mode is active by comparing the number of CPUs that share
> a cache with CPU0, with the number of CPUs on node0.
> 
> This gives some hope of tests passing. But additional test infrastructure
> changes are needed to bind tests to nodes and guarantee memory allocation
> from the local node.
> 
> Reported-by: "Shaopeng Tan (Fujitsu)" <tan.shaopeng@fujitsu.com>
> Signed-off-by: Tony Luck <tony.luck@intel.com>
> ---
>  tools/testing/selftests/resctrl/resctrl.h   |  1 +
>  tools/testing/selftests/resctrl/resctrlfs.c | 57
> +++++++++++++++++++++
>  2 files changed, 58 insertions(+)
> 
> diff --git a/tools/testing/selftests/resctrl/resctrl.h
> b/tools/testing/selftests/resctrl/resctrl.h
> index 87e39456dee0..a8b43210b573 100644
> --- a/tools/testing/selftests/resctrl/resctrl.h
> +++ b/tools/testing/selftests/resctrl/resctrl.h
> @@ -13,6 +13,7 @@
>  #include <signal.h>
>  #include <dirent.h>
>  #include <stdbool.h>
> +#include <ctype.h>
>  #include <sys/stat.h>
>  #include <sys/ioctl.h>
>  #include <sys/mount.h>
> diff --git a/tools/testing/selftests/resctrl/resctrlfs.c
> b/tools/testing/selftests/resctrl/resctrlfs.c
> index fb00245dee92..79eecbf9f863 100644
> --- a/tools/testing/selftests/resctrl/resctrlfs.c
> +++ b/tools/testing/selftests/resctrl/resctrlfs.c
> @@ -130,6 +130,61 @@ int get_resource_id(int cpu_no, int *resource_id)
>  	return 0;
>  }
> 
> +/*
> + * Count number of CPUs in a /sys bit map  */ static int
> +count_sys_bitmap_bits(char *name) {
> +	FILE *fp = fopen(name, "r");
> +	int count = 0, c;
> +
> +	if (!fp)
> +		return 0;
> +
> +	while ((c = fgetc(fp)) != EOF) {
> +		if (!isxdigit(c))
> +			continue;
> +		switch (c) {
> +		case 'f':
> +			count++;
> +		case '7': case 'b': case 'd': case 'e':
> +			count++;
> +		case '3': case '5': case '6': case '9': case 'a': case 'c':
> +			count++;
> +		case '1': case '2': case '4': case '8':
> +			count++;
> +		}
> +	}
> +	fclose(fp);
> +
> +	return count;
> +}
> +
> +/*
> + * Detect SNC by comparing #CPUs in node0 with #CPUs sharing LLC with
> +CPU0
> + * Try to get this right, even if a few CPUs are offline so that the
> +number
> + * of CPUs in node0 is not exactly half or a quarter of the CPUs
> +sharing the
> + * LLC of CPU0.
> + */
> +static int snc_ways(void)
> +{
> +	int node_cpus, cache_cpus;
> +
> +	node_cpus =
> count_sys_bitmap_bits("/sys/devices/system/node/node0/cpumap");
> +	cache_cpus =
> +count_sys_bitmap_bits("/sys/devices/system/cpu/cpu0/cache/index3/sh
> ared
> +_cpu_map");
> +
> +	if (!node_cpus || !cache_cpus) {
> +		fprintf(stderr, "Warning could not determine Sub-NUMA
> Cluster mode\n");
> +		return 1;
> +	}
> +
> +	if (4 * node_cpus >= cache_cpus)
> +		return 4;
> +	else if (2 * node_cpus >= cache_cpus)
> +		return 2;


If "4 * node_cpus >= cache_cpus " is not true,
"2 * node_cpus >= cache_cpus" will never be true.
Is it the following code?

+	if (2 * node_cpus >= cache_cpus)
+		return 2;
+	else if (4 * node_cpus >= cache_cpus)
+		return 4;

Best regards,
Shaopeng TAN

^ permalink raw reply	[flat|nested] 331+ messages in thread

* RE: [PATCH v5 8/8] selftests/resctrl: Adjust effective L3 cache size when SNC enabled
  2023-09-07  5:21       ` Shaopeng Tan (Fujitsu)
@ 2023-09-07 16:19         ` Luck, Tony
  2023-09-19  6:50           ` Maciej Wieczór-Retman
  0 siblings, 1 reply; 331+ messages in thread
From: Luck, Tony @ 2023-09-07 16:19 UTC (permalink / raw)
  To: Shaopeng Tan (Fujitsu),
	Yu, Fenghua, Chatre, Reinette, Peter Newman, Jonathan Corbet,
	Shuah Khan, x86
  Cc: James Morse, Jamie Iles, Babu Moger, Randy Dunlap, linux-kernel,
	linux-doc, patches

> > +   if (4 * node_cpus >= cache_cpus)
> > +           return 4;
> > +   else if (2 * node_cpus >= cache_cpus)
> > +           return 2;
>
>
> If "4 * node_cpus >= cache_cpus " is not true,
> "2 * node_cpus >= cache_cpus" will never be true.
> Is it the following code?
>
> +     if (2 * node_cpus >= cache_cpus)
> +             return 2;
> +     else if (4 * node_cpus >= cache_cpus)
> +             return 4;


Shaopeng TAN,

Good catch. Your solution is the correct one.

Will fix in next post.

-Tony

^ permalink raw reply	[flat|nested] 331+ messages in thread

* Re: [PATCH v5 0/8] Add support for Sub-NUMA cluster (SNC) systems
  2023-08-29 23:44   ` [PATCH v5 0/8] " Tony Luck
                       ` (7 preceding siblings ...)
  2023-08-29 23:44     ` [PATCH v5 8/8] selftests/resctrl: Adjust effective L3 cache size when SNC enabled Tony Luck
@ 2023-09-11 20:23     ` Reinette Chatre
  2023-09-12 16:01       ` Tony Luck
  2023-09-28 19:13     ` [PATCH v6 " Tony Luck
  9 siblings, 1 reply; 331+ messages in thread
From: Reinette Chatre @ 2023-09-11 20:23 UTC (permalink / raw)
  To: Tony Luck, Fenghua Yu, Peter Newman, Jonathan Corbet, Shuah Khan, x86
  Cc: Shaopeng Tan, James Morse, Jamie Iles, Babu Moger, Randy Dunlap,
	linux-kernel, linux-doc, patches

Hi Tony,

On 8/29/2023 4:44 PM, Tony Luck wrote:
> The Sub-NUMA cluster feature on some Intel processors partitions
> the CPUs that share an L3 cache into two or more sets. This plays
> havoc with the Resource Director Technology (RDT) monitoring features.
> Prior to this patch Intel has advised that SNC and RDT are incompatible.
> 
> Some of these CPU support an MSR that can partition the RMID
> counters in the same way. This allows for monitoring features
> to be used (with the caveat that memory accesses between different
> SNC NUMA nodes may still not be counted accuratlely.

Same typo as in V4.

> 
> Note that this patch series improves resctrl reporting considerably
> on systems with SNC enabled, but there will still be some anomalies
> for processes accessing memory from other sub-NUMA nodes.

I have the same question as with V4 that was not answered in that email
thread nor in this new version.
https://lore.kernel.org/lkml/e350514e-76ed-14ea-3e74-c0852658182f@intel.com/

I stop my review of this series here.

Reinette

^ permalink raw reply	[flat|nested] 331+ messages in thread

* Re: [PATCH v5 0/8] Add support for Sub-NUMA cluster (SNC) systems
  2023-09-11 20:23     ` [PATCH v5 0/8] Add support for Sub-NUMA cluster (SNC) systems Reinette Chatre
@ 2023-09-12 16:01       ` Tony Luck
  2023-09-12 17:13         ` Reinette Chatre
  0 siblings, 1 reply; 331+ messages in thread
From: Tony Luck @ 2023-09-12 16:01 UTC (permalink / raw)
  To: Reinette Chatre
  Cc: Fenghua Yu, Peter Newman, Jonathan Corbet, Shuah Khan, x86,
	Shaopeng Tan, James Morse, Jamie Iles, Babu Moger, Randy Dunlap,
	linux-kernel, linux-doc, patches

On Mon, Sep 11, 2023 at 01:23:35PM -0700, Reinette Chatre wrote:
> Hi Tony,
> 
> On 8/29/2023 4:44 PM, Tony Luck wrote:
> > The Sub-NUMA cluster feature on some Intel processors partitions
> > the CPUs that share an L3 cache into two or more sets. This plays
> > havoc with the Resource Director Technology (RDT) monitoring features.
> > Prior to this patch Intel has advised that SNC and RDT are incompatible.
> > 
> > Some of these CPU support an MSR that can partition the RMID
> > counters in the same way. This allows for monitoring features
> > to be used (with the caveat that memory accesses between different
> > SNC NUMA nodes may still not be counted accuratlely.
> 
> Same typo as in V4.

Sorry. Will fix and re-post.

> > 
> > Note that this patch series improves resctrl reporting considerably
> > on systems with SNC enabled, but there will still be some anomalies
> > for processes accessing memory from other sub-NUMA nodes.
> 
> I have the same question as with V4 that was not answered in that email
> thread nor in this new version.
> https://lore.kernel.org/lkml/e350514e-76ed-14ea-3e74-c0852658182f@intel.com/

Non-SNC systems already have an issue when reporting memory bandwidth
for a task: Linux may migrate the task to a CPU on a different node,
which means that logging for that task will also move to different files
in the mon_data/mon_L3_*/ directory for the new node.

With SNC enabled, migration between NUMA nodes on the same socket may happen
much more frequently because:
1) The CPUs on the other NUMA nodes in the socket are in the same Linux
   L3 cache domain. So Linux regards the migration as "cheap".
2) The ACPI SLIT table on SNC enabled systems may also report the
   latency for remote access to another NUMA node on the same socket
   as significantly lower than the latency for cross-socket access. On
   my test system the SLIT distance for same socket nodes is 0xC,
   compared to 0x15 for cross-socket distance. This will also lead
   to Linux being more likely to migrate a task to a CPU on another
   SNC NUMA node in the same socket.

To avoid migration issues, users may use sched_setaffinity(2) to bind
tasks to the subset of CPUs that share an SNC NUMA node.
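
For example, something like this (rough, untested sketch; libnuma is
used here only as a convenient way to look up the node's CPUs):

#define _GNU_SOURCE
#include <sched.h>
#include <numa.h>

/* Bind the calling task to the CPUs of one SNC NUMA node */
static int bind_to_snc_node(int node)
{
	struct bitmask *cpus = numa_allocate_cpumask();
	cpu_set_t set;
	int i;

	if (!cpus)
		return -1;
	if (numa_node_to_cpus(node, cpus)) {
		numa_free_cpumask(cpus);
		return -1;
	}

	CPU_ZERO(&set);
	for (i = 0; i < numa_num_configured_cpus(); i++)
		if (numa_bitmask_isbitset(cpus, i))
			CPU_SET(i, &set);
	numa_free_cpumask(cpus);

	return sched_setaffinity(0, sizeof(set), &set);
}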

I can write this up in a new cover letter.

> I stop my review of this series here.

Reinette

Should I repost the whole series as v6 with the new cover letter. The
only change to the patches so far is to the selftest reported by
Shaopeng Tan[1].

-Tony

[1] https://lore.kernel.org/all/TYAPR01MB633033C489AAC0E514CBC6688BEEA@TYAPR01MB6330.jpnprd01.prod.outlook.com/

^ permalink raw reply	[flat|nested] 331+ messages in thread

* Re: [PATCH v5 0/8] Add support for Sub-NUMA cluster (SNC) systems
  2023-09-12 16:01       ` Tony Luck
@ 2023-09-12 17:13         ` Reinette Chatre
  2023-09-12 17:45           ` Tony Luck
  0 siblings, 1 reply; 331+ messages in thread
From: Reinette Chatre @ 2023-09-12 17:13 UTC (permalink / raw)
  To: Tony Luck
  Cc: Fenghua Yu, Peter Newman, Jonathan Corbet, Shuah Khan, x86,
	Shaopeng Tan, James Morse, Jamie Iles, Babu Moger, Randy Dunlap,
	linux-kernel, linux-doc, patches

Hi Tony,

On 9/12/2023 9:01 AM, Tony Luck wrote:
> On Mon, Sep 11, 2023 at 01:23:35PM -0700, Reinette Chatre wrote:
>> Hi Tony,
>>
>> On 8/29/2023 4:44 PM, Tony Luck wrote:
>>> The Sub-NUMA cluster feature on some Intel processors partitions
>>> the CPUs that share an L3 cache into two or more sets. This plays
>>> havoc with the Resource Director Technology (RDT) monitoring features.
>>> Prior to this patch Intel has advised that SNC and RDT are incompatible.
>>>
>>> Some of these CPU support an MSR that can partition the RMID
>>> counters in the same way. This allows for monitoring features
>>> to be used (with the caveat that memory accesses between different
>>> SNC NUMA nodes may still not be counted accuratlely.
>>
>> Same typo as in V4.
> 
> Sorry. Will fix and re-post.
> 
>>>
>>> Note that this patch series improves resctrl reporting considerably
>>> on systems with SNC enabled, but there will still be some anomalies
>>> for processes accessing memory from other sub-NUMA nodes.
>>
>> I have the same question as with V4 that was not answered in that email
>> thread nor in this new version.
>> https://lore.kernel.org/lkml/e350514e-76ed-14ea-3e74-c0852658182f@intel.com/
> 
> Non-SNC systems already have an issue when reporting memory bandwidth
> for a task: Linux may migrate the task to a CPU on a different node,
> which means that logging for that task will also move to different files
> in the mon_data/mon_L3_*/ directory for the new node.

It is not obvious to me that this is an issue. From what I understand
the data remains accurate.

How does this map to the earlier "may still not be counted
accurately"? 

> 
> With SNC enabled, migration between NUMA nodes on the same socket may happen
> much more frequently because:
> 1) The CPUs on the other NUMA nodes in the socket are in the same Linux
>    L3 cache domain. So Linux regards the migration as "cheap".
> 2) The ACPI SLIT table on SNC enabled systems may also report the
>    latency for remote access to another NUMA node on the same socket
>    as significantly lower than the latency for cross-socket access. On
>    my test system the SLIT distance for same socket nodes is 0xC,
>    compared to 0x15 for cross-socket distance. This will also lead
>    to Linux being more likely to migrate a task to a CPU on another
>    SNC NUMA node in the same socket.
> 
> To avoid migration issues, users may use sched_setaffinity(2) to bind
> tasks to the subset of CPUs that share an SNC NUMA node.
> 
> I can write this up in a new cover letter.
> 
>> I stop my review of this series here.
> 
> Reinette
> 
> Should I repost the whole series as v6 with the new cover letter. The
> only change to the patches so far is to the selftest reported by
> Shaopeng Tan[1].
> 

Is this an assurance that the cover letter in no way reflects how 
feedback was addressed in the rest of this series?

Reinette

^ permalink raw reply	[flat|nested] 331+ messages in thread

* Re: [PATCH v5 0/8] Add support for Sub-NUMA cluster (SNC) systems
  2023-09-12 17:13         ` Reinette Chatre
@ 2023-09-12 17:45           ` Tony Luck
  2023-09-12 18:31             ` Reinette Chatre
  0 siblings, 1 reply; 331+ messages in thread
From: Tony Luck @ 2023-09-12 17:45 UTC (permalink / raw)
  To: Reinette Chatre
  Cc: Fenghua Yu, Peter Newman, Jonathan Corbet, Shuah Khan, x86,
	Shaopeng Tan, James Morse, Jamie Iles, Babu Moger, Randy Dunlap,
	linux-kernel, linux-doc, patches

On Tue, Sep 12, 2023 at 10:13:31AM -0700, Reinette Chatre wrote:
> Hi Tony,
> 
> On 9/12/2023 9:01 AM, Tony Luck wrote:
> > On Mon, Sep 11, 2023 at 01:23:35PM -0700, Reinette Chatre wrote:
> >> Hi Tony,
> >>
> >> On 8/29/2023 4:44 PM, Tony Luck wrote:
> >>> The Sub-NUMA cluster feature on some Intel processors partitions
> >>> the CPUs that share an L3 cache into two or more sets. This plays
> >>> havoc with the Resource Director Technology (RDT) monitoring features.
> >>> Prior to this patch Intel has advised that SNC and RDT are incompatible.
> >>>
> >>> Some of these CPU support an MSR that can partition the RMID
> >>> counters in the same way. This allows for monitoring features
> >>> to be used (with the caveat that memory accesses between different
> >>> SNC NUMA nodes may still not be counted accuratlely.
> >>
> >> Same typo as in V4.
> > 
> > Sorry. Will fix and re-post.
> > 
> >>>
> >>> Note that this patch series improves resctrl reporting considerably
> >>> on systems with SNC enabled, but there will still be some anomalies
> >>> for processes accessing memory from other sub-NUMA nodes.
> >>
> >> I have the same question as with V4 that was not answered in that email
> >> thread nor in this new version.
> >> https://lore.kernel.org/lkml/e350514e-76ed-14ea-3e74-c0852658182f@intel.com/
> > 
> > Non-SNC systems already have an issue when reporting memory bandwidth
> > for a task: Linux may migrate the task to a CPU on a different node,
> > which means that logging for that task will also move to different files
> > in the mon_data/mon_L3_*/ directory for the new node.
> 
> It is not obvious to me that this is an issue. From what I understand
> the data remains accurate.
> 
> How does this map to the earlier "may still not be counted
> accurately"? 

Yes, the data is accurate. But a naive application reading the wrong
files from mon_data will not see the accurate data. Without SNC users
may only see this issue rarely as Linux tries hard to not migrate
processes to other NUMA nodes or L3 cache domains. But with SNC enabled
this is no longer the case. The ACPI SLIT distance of 0xC is below the
threshold that Linux checks for "is the target CPU for a migration far away"
so migration to other SNC nodes may be quite common.

I can move this out of the cover letter and provide guidance/warnings
in the patch to Documentation/arch/x86/resctrl.rst

> 
> > 
> > With SNC enabled, migration between NUMA nodes on the same socket may happen
> > much more frequently because:
> > 1) The CPUs on the other NUMA nodes in the socket are in the same Linux
> >    L3 cache domain. So Linux regards the migration as "cheap".
> > 2) The ACPI SLIT table on SNC enabled systems may also report the
> >    latency for remote access to another NUMA node on the same socket
> >    as significantly lower than the latency for cross-socket access. On
> >    my test system the SLIT distance for same socket nodes is 0xC,
> >    compared to 0x15 for cross-socket distance. This will also lead
> >    to Linux being more likely to migrate a task to a CPU on another
> >    SNC NUMA node in the same socket.
> > 
> > To avoid migration issues, users may use sched_setaffinity(2) to bind
> > tasks to the subset of CPUs that share an SNC NUMA node.
> > 
> > I can write this up in a new cover letter.
> > 
> >> I stop my review of this series here.
> > 
> > Reinette
> > 
> > Should I repost the whole series as v6 with the new cover letter. The
> > only change to the patches so far is to the selftest reported by
> > Shaopeng Tan[1].
> > 
> 
> Is this an assurance that the cover letter in no way reflects how 
> feedback was addressed in the rest of this series?

My track record here is far from perfect. I believe I addressed all
the issues you raised. But it's very possible that I may have missed
some, or misinterpreted the concerns you raised.

> 
> Reinette

-Tony

^ permalink raw reply	[flat|nested] 331+ messages in thread

* Re: [PATCH v5 0/8] Add support for Sub-NUMA cluster (SNC) systems
  2023-09-12 17:45           ` Tony Luck
@ 2023-09-12 18:31             ` Reinette Chatre
  0 siblings, 0 replies; 331+ messages in thread
From: Reinette Chatre @ 2023-09-12 18:31 UTC (permalink / raw)
  To: Tony Luck
  Cc: Fenghua Yu, Peter Newman, Jonathan Corbet, Shuah Khan, x86,
	Shaopeng Tan, James Morse, Jamie Iles, Babu Moger, Randy Dunlap,
	linux-kernel, linux-doc, patches

Hi Tony,

On 9/12/2023 10:45 AM, Tony Luck wrote:
> On Tue, Sep 12, 2023 at 10:13:31AM -0700, Reinette Chatre wrote:
>> Hi Tony,
>>
>> On 9/12/2023 9:01 AM, Tony Luck wrote:
>>> On Mon, Sep 11, 2023 at 01:23:35PM -0700, Reinette Chatre wrote:
>>>> Hi Tony,
>>>>
>>>> On 8/29/2023 4:44 PM, Tony Luck wrote:
>>>>> The Sub-NUMA cluster feature on some Intel processors partitions
>>>>> the CPUs that share an L3 cache into two or more sets. This plays
>>>>> havoc with the Resource Director Technology (RDT) monitoring features.
>>>>> Prior to this patch Intel has advised that SNC and RDT are incompatible.
>>>>>
>>>>> Some of these CPU support an MSR that can partition the RMID
>>>>> counters in the same way. This allows for monitoring features
>>>>> to be used (with the caveat that memory accesses between different
>>>>> SNC NUMA nodes may still not be counted accuratlely.
>>>>
>>>> Same typo as in V4.
>>>
>>> Sorry. Will fix and re-post.
>>>
>>>>>
>>>>> Note that this patch series improves resctrl reporting considerably
>>>>> on systems with SNC enabled, but there will still be some anomalies
>>>>> for processes accessing memory from other sub-NUMA nodes.
>>>>
>>>> I have the same question as with V4 that was not answered in that email
>>>> thread nor in this new version.
>>>> https://lore.kernel.org/lkml/e350514e-76ed-14ea-3e74-c0852658182f@intel.com/
>>>
>>> Non-SNC systems already have an issue when reporting memory bandwidth
>>> for a task: Linux may migrate the task to a CPU on a different node,
>>> which means that logging for that task will also move to different files
>>> in the mon_data/mon_L3_*/ directory for the new node.
>>
>> It is not obvious to me that this is an issue. From what I understand
>> the data remains accurate.
>>
>> How does this map to the earlier "may still not be counted
>> accurately"? 
> 
> Yes, the data is accurate. But a naive application reading the wrong
> files from mon_data will not see the accurate data. Without SNC users
> may only see this issue rarely as Linux tries hard to not migrate
> processes to other NUMA nodes or L3 cache domains. But with SNC enabled
> this is no longer the case. The ACPI SLIT distance of 0xC is below the
> threshold that Linux checks for "is the target CPU for a migration far away"
> so migration to other SNC nodes may be quite common.

I would like to recommend that you take care not to present this work
using uncertain terms like "may not be counted accurately" or "there will
still be some anomalies". If I understand correctly there is no uncertainty.
When that "naive application" reads the wrong files then I think it can be
considered a usage error and should not be documented as an issue with the
counters. I understand that the user space requirements are not obvious,
and there should be guidance. 

> I can move this out of the cover letter and provide guidance/warnings
> in the patch to Documentation/arch/x86/resctrl.rst

Yes, I do think this will be very helpful.

> 
>>
>>>
>>> With SNC enabled, migration between NUMA nodes on the same socket may happen
>>> much more frequently because:
>>> 1) The CPUs on the other NUMA nodes in the socket are in the same Linux
>>>    L3 cache domain. So Linux regards the migration as "cheap".
>>> 2) The ACPI SLIT table on SNC enabled systems may also report the
>>>    latency for remote access to another NUMA node on the same socket
>>>    as significantly lower than the latency for cross-socket access. On
>>>    my test system the SLIT distance for same socket nodes is 0xC,
>>>    compared to 0x15 for cross-socket distance. This will also lead
>>>    to Linux being more likely to migrate a task to a CPU on another
>>>    SNC NUMA node in the same socket.
>>>
>>> To avoid migration issues, users may use sched_setaffinity(2) to bind
>>> tasks to the subset of CPUs that share an SNC NUMA node.
>>>
>>> I can write this up in a new cover letter.
>>>
>>>> I stop my review of this series here.
>>>
>>> Reinette
>>>
>>> Should I repost the whole series as v6 with the new cover letter. The
>>> only change to the patches so far is to the selftest reported by
>>> Shaopeng Tan[1].
>>>
>>
>> Is this an assurance that the cover letter in no way reflects how 
>> feedback was addressed in the rest of this series?
> 
> My track record here is far from perfect. I believe I addressed all
> the issues you raised. But it's very possible that I may have missed
> some, or misinterpreted the concerns you raised.

This is a familiar response from you that just puts the burden back
on me to go dig out previous discussions. This will be the last time.

Reinette




^ permalink raw reply	[flat|nested] 331+ messages in thread

* Re: [PATCH v5 8/8] selftests/resctrl: Adjust effective L3 cache size when SNC enabled
  2023-09-07 16:19         ` Luck, Tony
@ 2023-09-19  6:50           ` Maciej Wieczór-Retman
  2023-09-19 14:36             ` Luck, Tony
  0 siblings, 1 reply; 331+ messages in thread
From: Maciej Wieczór-Retman @ 2023-09-19  6:50 UTC (permalink / raw)
  To: Luck, Tony
  Cc: Shaopeng Tan (Fujitsu),
	Yu, Fenghua, Chatre, Reinette, Peter Newman, Jonathan Corbet,
	Shuah Khan, x86, James Morse, Jamie Iles, Babu Moger,
	Randy Dunlap, linux-kernel, linux-doc, patches

On 2023-09-07 at 16:19:37 +0000, Luck, Tony wrote:
>> > +   if (4 * node_cpus >= cache_cpus)
>> > +           return 4;
>> > +   else if (2 * node_cpus >= cache_cpus)
>> > +           return 2;
>>
>>
>> If "4 * node_cpus >= cache_cpus " is not true,
>> "2 * node_cpus >= cache_cpus" will never be true.
>> Is it the following code?
>>
>> +     if (2 * node_cpus >= cache_cpus)
>> +             return 2;
>> +     else if (4 * node_cpus >= cache_cpus)
>> +             return 4;
>
>
>Shaopeng TAN,
>
>Good catch. Your solution is the correct one.
>
>Will fix in next post.

I played around with this code a little and I think the logical
expressions are returning wrong values.

On a system that has SNC disabled the function reports both "node_cpus"
and "cache_cpus" equal to 56. In this case snc_ways() returns "2". It is
the same on a system with SNC enabled that reports the previously mentioned
variables to be different by a factor of two (36 and 72).

Is it possible for node_cpus and cache_cpus to not be multiples of each
other? (as in for example cache_cpus being 10 and node_cpus being 21?).
If not I'd suggest using "==" instead of ">=".

If yes then I guess something like this could work? :

+     if (node_cpus >= cache_cpus)
+             return 1;
+     else if (2 * node_cpus >= cache_cpus)
+             return 2;
+     else if (4 * node_cpus >= cache_cpus)
+             return 4;

PS. I did my tests on two Intel Ice Lakes.

-- 
Kind regards
Maciej Wieczór-Retman

^ permalink raw reply	[flat|nested] 331+ messages in thread

* RE: [PATCH v5 8/8] selftests/resctrl: Adjust effective L3 cache size when SNC enabled
  2023-09-19  6:50           ` Maciej Wieczór-Retman
@ 2023-09-19 14:36             ` Luck, Tony
  2023-09-20  6:06               ` Maciej Wieczór-Retman
  0 siblings, 1 reply; 331+ messages in thread
From: Luck, Tony @ 2023-09-19 14:36 UTC (permalink / raw)
  To: Wieczor-Retman, Maciej
  Cc: Shaopeng Tan (Fujitsu),
	Yu, Fenghua, Chatre, Reinette, Peter Newman, Jonathan Corbet,
	Shuah Khan, x86, James Morse, Jamie Iles, Babu Moger,
	Randy Dunlap, linux-kernel, linux-doc, patches

> On a system that has SNC disabled the function reports both "node_cpus"
> and "cache_cpus" equal to 56. In this case snc_ways() returns "2". It is
> the same on a system with SNC enabled that reports the previously mentioned
> variables to be different by a factor of two (36 and 72).

> Is it possible for node_cpus and cache_cpus to not be multiples of each
> other? (as in for example cache_cpus being 10 and node_cpus being 21?).
> If not I'd suggest using "==" instead of ">=".

Some CPUs may be offline when the test is run. E.g. with one CPU offline on SNC
node 0, you'd see node_cpus = 35 and cache_cpus = 71. But with one CPU offline
on node 1, you'd have node_cpus = 36, cache_cpus = 71.



> If yes then I guess something like this could work? :

+     if (node_cpus >= cache_cpus)
+             return 1;
+     else if (2 * node_cpus >= cache_cpus)
+             return 2;
+     else if (4 * node_cpus >= cache_cpus)
+             return 4;

This returns "4" for the 36 71 case. But should still be "2".

>> PS. I did my tests on two Intel Ice Lakes.

Perhaps easier to play with the algorithm in user code?

#include <stdio.h>
#include <stdlib.h>

static int snc(int node_cpus, int cache_cpus)
{
     if (node_cpus >= cache_cpus)
             return 1;
     else if (2 * node_cpus >= cache_cpus)
             return 2;
     else if (4 * node_cpus >= cache_cpus)
             return 4;
     return -1;
}

int main(int argc, char **argv)
{
        printf("%d\n", snc(atoi(argv[1]), atoi(argv[2])));

        return 0;
}

N.B. it's probably not possible to handle the case where somebody took ALL the CPUs in SNC
node 1 offline (or SNC nodes 1,2,3 for the SNC 4 case).

I think it reasonable that the code handle some simple "small number of CPUs offline" cases.
But don't worry too much about cases where the user has done something extreme.

-Tony



^ permalink raw reply	[flat|nested] 331+ messages in thread

* Re: [PATCH v5 8/8] selftests/resctrl: Adjust effective L3 cache size when SNC enabled
  2023-09-19 14:36             ` Luck, Tony
@ 2023-09-20  6:06               ` Maciej Wieczór-Retman
  2023-09-20 15:00                 ` Luck, Tony
  0 siblings, 1 reply; 331+ messages in thread
From: Maciej Wieczór-Retman @ 2023-09-20  6:06 UTC (permalink / raw)
  To: Luck, Tony
  Cc: Shaopeng Tan (Fujitsu),
	Yu, Fenghua, Chatre, Reinette, Peter Newman, Jonathan Corbet,
	Shuah Khan, x86, James Morse, Jamie Iles, Babu Moger,
	Randy Dunlap, linux-kernel, linux-doc, patches

Hi, thanks for the reply

On 2023-09-19 at 14:36:06 +0000, Luck, Tony wrote:
>> On a system that has SNC disabled the function reports both "node_cpus"
>> and "cache_cpus" equal to 56. In this case snc_ways() returns "2". It is
>> the same on a system with SNC enabled that reports the previously mentioned
>> variables to be different by a factor of two (36 and 72).
>
>> Is it possible for node_cpus and cache_cpus to not be multiples of each
>> other? (as in for example cache_cpus being 10 and node_cpus being 21?).
>> If not I'd suggest using "==" instead of ">=".
>
>Some CPUs may be offline when the test is run. E.g. with one CPU offline on SNC
>node 0, you'd see node_cpus = 35 and cache_cpus = 71. But with one CPU offline
>on node 1, you'd have node_cpus = 36, cache_cpus = 71.

Okay, thanks, good to know. On systems with SNC disabled the two numbers
should still be equal even if some CPUs were offline, right? I was mostly
concerned that the previous version was returning the same value whether
SNC was enabled with 2 nodes or disabled.

>> If yes then I guess something like this could work? :
>
>+     if (node_cpus >= cache_cpus)
>+             return 1;
>+     else if (2 * node_cpus >= cache_cpus)
>+             return 2;
>+     else if (4 * node_cpus >= cache_cpus)
>+             return 4;
>
>This returns "4" for the 35 71 case (one CPU offline on SNC node 0), but it should still be "2".
>
>>> PS. I did my tests on two Intel Ice Lakes.
>
>Perhaps easier to play with the algorithm in user code?
>
>#include <stdio.h>
>#include <stdlib.h>
>
>static int snc(int node_cpus, int cache_cpus)
>{
>     if (node_cpus >= cache_cpus)
>             return 1;
>     else if (2 * node_cpus >= cache_cpus)
>             return 2;
>     else if (4 * node_cpus >= cache_cpus)
>             return 4;
>     return -1;
>}
>
>int main(int argc, char **argv)
>{
>        printf("%d\n", snc(atoi(argv[1]), atoi(argv[2])));
>
>        return 0;
>}

My previous understanding was that the presence of the ">=" comparison
implied the number of node_cpus could somehow get larger. So I assumed
that keeping it that way would be sufficient, but now I can see that
wouldn't be the case.

>
>N.B. it's probably not possible to handle the case where somebody took ALL the CPUs in SNC
>node 1 offline (or SNC nodes 1,2,3 for the SNC 4 case).
>
>I think it reasonable that the code handle some simple "small number of CPUs offline" cases.
>But don't worry too much about cases where the user has done something extreme.
>
>-Tony

What about outputting this value to userspace from resctrl? The ratio is
already saved inside the snc_nodes_per_l3_cache variable. That would
help avoid these difficult cases where some CPUs are offline, which could
cause snc_ways() to return a wrong value. Or are there some pitfalls
to that approach?

-- 
Kind regards
Maciej Wieczór-Retman

^ permalink raw reply	[flat|nested] 331+ messages in thread

* RE: [PATCH v5 8/8] selftests/resctrl: Adjust effective L3 cache size when SNC enabled
  2023-09-20  6:06               ` Maciej Wieczór-Retman
@ 2023-09-20 15:00                 ` Luck, Tony
  0 siblings, 0 replies; 331+ messages in thread
From: Luck, Tony @ 2023-09-20 15:00 UTC (permalink / raw)
  To: Wieczor-Retman, Maciej
  Cc: Shaopeng Tan (Fujitsu),
	Yu, Fenghua, Chatre, Reinette, Peter Newman, Jonathan Corbet,
	Shuah Khan, x86, James Morse, Jamie Iles, Babu Moger,
	Randy Dunlap, linux-kernel, linux-doc, patches

> What about outputting this value to userspace from resctrl? The ratio is
> already saved inside the snc_nodes_per_l3_cache variable. That would
> help avoid these difficult cases where some CPUs are offline, which could
> cause snc_ways() to return a wrong value. Or are there some pitfalls
> to that approach?

My original patch series added an "snc_ways" file to the info/ directory
to make this visible. But I was talked out of it because of a lack of clear
user mode use case that needs it.

https://lore.kernel.org/all/f0841866-315b-4727-0a6c-ec60d22ca29c@arm.com/

I don't know if the resctrl self tests constitute a valid use case. Perhaps not,
as they can figure this out themselves.

But ... maybe the difficulties that user mode has (because of possibly offline
CPUs) are an indicator that the kernel should expose this. If only to
make the kernel's view clear in case there are situations where it got the
calculation wrong. E.g. kernel booted with a maxcpus=N parameter.
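
If it were exposed, a minimal sketch of the handler (hypothetical file name
"snc_ways", assuming the snc_nodes_per_l3_cache variable from this series and
the usual rftype seq_show convention) would just be:

static int rdt_snc_ways_show(struct kernfs_open_file *of,
                             struct seq_file *seq, void *v)
{
        seq_printf(seq, "%d\n", snc_nodes_per_l3_cache);
        return 0;
}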

-Tony

^ permalink raw reply	[flat|nested] 331+ messages in thread

* Re: [PATCH v5 1/8] x86/resctrl: Prepare for new domain scope
  2023-08-29 23:44     ` [PATCH v5 1/8] x86/resctrl: Prepare for new domain scope Tony Luck
  2023-08-30  8:53       ` Maciej Wieczór-Retman
@ 2023-09-25 23:22       ` Reinette Chatre
  2023-09-26  0:56         ` Tony Luck
  2023-09-26 16:36       ` Moger, Babu
  2 siblings, 1 reply; 331+ messages in thread
From: Reinette Chatre @ 2023-09-25 23:22 UTC (permalink / raw)
  To: Tony Luck, Fenghua Yu, Peter Newman, Jonathan Corbet, Shuah Khan, x86
  Cc: Shaopeng Tan, James Morse, Jamie Iles, Babu Moger, Randy Dunlap,
	linux-kernel, linux-doc, patches

Hi Tony,

On 8/29/2023 4:44 PM, Tony Luck wrote:
> Legacy resctrl features operated on subsets of CPUs in the system with

What is a "legacy" resctrl feature? I am not aware of anything in resctrl
that distinguishes a feature as legacy. Could "resctrl resource" be more
appropriate to match the resctrl implementation? 

Please use "operate on" instead of "operated on".

> the defining attribute of each subset being an instance of a particular
> level of cache. E.g. all CPUs sharing an L3 cache would be part of the
> same domain.
> 
> In preparation for features that are scoped at the NUMA node level
> change the code from explicit references to "cache_level" to a more
> generic scope. At this point the only options for this scope are groups
> of CPUs that share an L2 cache or L3 cache.
> 
> No functional change.
> 
> Signed-off-by: Tony Luck <tony.luck@intel.com>
> ---
>  include/linux/resctrl.h                   |  9 ++++++--
>  arch/x86/kernel/cpu/resctrl/core.c        | 27 ++++++++++++++++++-----
>  arch/x86/kernel/cpu/resctrl/pseudo_lock.c | 15 ++++++++++++-
>  arch/x86/kernel/cpu/resctrl/rdtgroup.c    | 15 ++++++++++++-
>  4 files changed, 56 insertions(+), 10 deletions(-)
> 
> diff --git a/include/linux/resctrl.h b/include/linux/resctrl.h
> index 8334eeacfec5..2db1244ae642 100644
> --- a/include/linux/resctrl.h
> +++ b/include/linux/resctrl.h
> @@ -144,13 +144,18 @@ struct resctrl_membw {
>  struct rdt_parse_data;
>  struct resctrl_schema;
>  
> +enum resctrl_scope {
> +	RESCTRL_L3_CACHE,
> +	RESCTRL_L2_CACHE,
> +};

I'm curious why L3 appears before L2? This is not a big deal but
I think the additional indirection that this introduces is
not necessary. As you had in an earlier version this could be
RESCTRL_L2_CACHE = 2 and then the value can just be used directly
(after ensuring it is used in a valid context).
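
A minimal sketch of that suggestion (with the enum values being the cache
levels themselves):

enum resctrl_scope {
        RESCTRL_L2_CACHE = 2,
        RESCTRL_L3_CACHE = 3,
};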

> +
>  /**
>   * struct rdt_resource - attributes of a resctrl resource
>   * @rid:		The index of the resource
>   * @alloc_capable:	Is allocation available on this machine
>   * @mon_capable:	Is monitor feature available on this machine
>   * @num_rmid:		Number of RMIDs available
> - * @cache_level:	Which cache level defines scope of this resource
> + * @scope:		Scope of this resource
>   * @cache:		Cache allocation related data
>   * @membw:		If the component has bandwidth controls, their properties.
>   * @domains:		All domains for this resource
> @@ -168,7 +173,7 @@ struct rdt_resource {
>  	bool			alloc_capable;
>  	bool			mon_capable;
>  	int			num_rmid;
> -	int			cache_level;
> +	enum resctrl_scope	scope;
>  	struct resctrl_cache	cache;
>  	struct resctrl_membw	membw;
>  	struct list_head	domains;
> diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
> index 030d3b409768..0d3bae523ecb 100644
> --- a/arch/x86/kernel/cpu/resctrl/core.c
> +++ b/arch/x86/kernel/cpu/resctrl/core.c
> @@ -65,7 +65,7 @@ struct rdt_hw_resource rdt_resources_all[] = {
>  		.r_resctrl = {
>  			.rid			= RDT_RESOURCE_L3,
>  			.name			= "L3",
> -			.cache_level		= 3,
> +			.scope			= RESCTRL_L3_CACHE,
>  			.domains		= domain_init(RDT_RESOURCE_L3),
>  			.parse_ctrlval		= parse_cbm,
>  			.format_str		= "%d=%0*x",
> @@ -79,7 +79,7 @@ struct rdt_hw_resource rdt_resources_all[] = {
>  		.r_resctrl = {
>  			.rid			= RDT_RESOURCE_L2,
>  			.name			= "L2",
> -			.cache_level		= 2,
> +			.scope			= RESCTRL_L2_CACHE,
>  			.domains		= domain_init(RDT_RESOURCE_L2),
>  			.parse_ctrlval		= parse_cbm,
>  			.format_str		= "%d=%0*x",
> @@ -93,7 +93,7 @@ struct rdt_hw_resource rdt_resources_all[] = {
>  		.r_resctrl = {
>  			.rid			= RDT_RESOURCE_MBA,
>  			.name			= "MB",
> -			.cache_level		= 3,
> +			.scope			= RESCTRL_L3_CACHE,
>  			.domains		= domain_init(RDT_RESOURCE_MBA),
>  			.parse_ctrlval		= parse_bw,
>  			.format_str		= "%d=%*u",
> @@ -105,7 +105,7 @@ struct rdt_hw_resource rdt_resources_all[] = {
>  		.r_resctrl = {
>  			.rid			= RDT_RESOURCE_SMBA,
>  			.name			= "SMBA",
> -			.cache_level		= 3,
> +			.scope			= RESCTRL_L3_CACHE,
>  			.domains		= domain_init(RDT_RESOURCE_SMBA),
>  			.parse_ctrlval		= parse_bw,
>  			.format_str		= "%d=%*u",
> @@ -487,6 +487,21 @@ static int arch_domain_mbm_alloc(u32 num_rmid, struct rdt_hw_domain *hw_dom)
>  	return 0;
>  }
>  
> +static int get_domain_id_from_scope(int cpu, enum resctrl_scope scope)
> +{
> +	switch (scope) {
> +	case RESCTRL_L3_CACHE:
> +		return get_cpu_cacheinfo_id(cpu, 3);
> +	case RESCTRL_L2_CACHE:
> +		return get_cpu_cacheinfo_id(cpu, 2);
> +	default:
> +		WARN_ON_ONCE(1);
> +		break;
> +	}
> +
> +	return -1;
> +}

Looking ahead at the next patch, some members of rdt_resources_all[] are
left with a default "0" initialization of mon_scope. In this implementation
"0" is the valid scope RESCTRL_L3_CACHE, so it would not be caught by
defensive code like the above. For the above to catch a case like this I
think that there should be some default
initialization - but if you do move to something like you
had in v3 then this may not be necessary. If the L2 scope is 2,
L3 scope is 3, node scope is 4, then no initialization will be zero
and the above default can more accurately catch failure cases.
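
Building on that, a sketch that also reserves zero as an explicit "unknown"
value (names here are illustrative) would make a forgotten initialization
stand out:

enum resctrl_scope {
        RESCTRL_SCOPE_UNKNOWN = 0,
        RESCTRL_L2_CACHE = 2,
        RESCTRL_L3_CACHE = 3,
        RESCTRL_NODE = 4,
};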

> +
>  /*
>   * domain_add_cpu - Add a cpu to a resource's domain list.
>   *
> @@ -502,7 +517,7 @@ static int arch_domain_mbm_alloc(u32 num_rmid, struct rdt_hw_domain *hw_dom)
>   */
>  static void domain_add_cpu(int cpu, struct rdt_resource *r)
>  {
> -	int id = get_cpu_cacheinfo_id(cpu, r->cache_level);
> +	int id = get_domain_id_from_scope(cpu, r->scope);
>  	struct list_head *add_pos = NULL;
>  	struct rdt_hw_domain *hw_dom;
>  	struct rdt_domain *d;
> @@ -552,7 +567,7 @@ static void domain_add_cpu(int cpu, struct rdt_resource *r)
>  
>  static void domain_remove_cpu(int cpu, struct rdt_resource *r)
>  {
> -	int id = get_cpu_cacheinfo_id(cpu, r->cache_level);
> +	int id = get_domain_id_from_scope(cpu, r->scope);
>  	struct rdt_hw_domain *hw_dom;
>  	struct rdt_domain *d;
>  
> diff --git a/arch/x86/kernel/cpu/resctrl/pseudo_lock.c b/arch/x86/kernel/cpu/resctrl/pseudo_lock.c
> index 458cb7419502..e79324676f57 100644
> --- a/arch/x86/kernel/cpu/resctrl/pseudo_lock.c
> +++ b/arch/x86/kernel/cpu/resctrl/pseudo_lock.c
> @@ -279,6 +279,7 @@ static void pseudo_lock_region_clear(struct pseudo_lock_region *plr)
>  static int pseudo_lock_region_init(struct pseudo_lock_region *plr)
>  {
>  	struct cpu_cacheinfo *ci;
> +	int cache_level;
>  	int ret;
>  	int i;
>  
> @@ -296,8 +297,20 @@ static int pseudo_lock_region_init(struct pseudo_lock_region *plr)
>  
>  	plr->size = rdtgroup_cbm_to_size(plr->s->res, plr->d, plr->cbm);
>  
> +	switch (plr->s->res->scope) {
> +	case RESCTRL_L3_CACHE:
> +		cache_level = 3;
> +		break;
> +	case RESCTRL_L2_CACHE:
> +		cache_level = 2;
> +		break;
> +	default:
> +		WARN_ON_ONCE(1);
> +		return -ENODEV;
> +	}

I think this can be simplified without the indirection - a simplified
WARN can just test for valid values of plr->s->res->scope directly
(exit on failure) and then it can be used directly in the code below
when the enum value corresponds to a cache level.
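
A sketch of that simplification, assuming the enum values are set to the
cache levels as suggested for patch 1:

        if (WARN_ON_ONCE(plr->s->res->scope != RESCTRL_L2_CACHE &&
                         plr->s->res->scope != RESCTRL_L3_CACHE))
                return -ENODEV;

        for (i = 0; i < ci->num_leaves; i++) {
                if (ci->info_list[i].level == plr->s->res->scope) {
                        plr->line_size = ci->info_list[i].coherency_line_size;
                        return 0;
                }
        }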

> +
>  	for (i = 0; i < ci->num_leaves; i++) {
> -		if (ci->info_list[i].level == plr->s->res->cache_level) {
> +		if (ci->info_list[i].level == cache_level) {
>  			plr->line_size = ci->info_list[i].coherency_line_size;
>  			return 0;
>  		}
> diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> index 725344048f85..f510414bf6ce 100644
> --- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> +++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> @@ -1343,12 +1343,25 @@ unsigned int rdtgroup_cbm_to_size(struct rdt_resource *r,
>  {
>  	struct cpu_cacheinfo *ci;
>  	unsigned int size = 0;
> +	int cache_level;
>  	int num_b, i;
>  
> +	switch (r->scope) {
> +	case RESCTRL_L3_CACHE:
> +		cache_level = 3;
> +		break;
> +	case RESCTRL_L2_CACHE:
> +		cache_level = 2;
> +		break;
> +	default:
> +		WARN_ON_ONCE(1);
> +		return size;
> +	}
> +

Same here.

>  	num_b = bitmap_weight(&cbm, r->cache.cbm_len);
>  	ci = get_cpu_cacheinfo(cpumask_any(&d->cpu_mask));
>  	for (i = 0; i < ci->num_leaves; i++) {
> -		if (ci->info_list[i].level == r->cache_level) {
> +		if (ci->info_list[i].level == cache_level) {
>  			size = ci->info_list[i].size / r->cache.cbm_len * num_b;
>  			break;
>  		}


Reinette

^ permalink raw reply	[flat|nested] 331+ messages in thread

* Re: [PATCH v5 2/8] x86/resctrl: Prepare for different scope for control/monitor operations
  2023-08-29 23:44     ` [PATCH v5 2/8] x86/resctrl: Prepare for different scope for control/monitor operations Tony Luck
@ 2023-09-25 23:24       ` Reinette Chatre
  2023-09-28 17:21         ` Tony Luck
  2023-09-26 16:54       ` Moger, Babu
  1 sibling, 1 reply; 331+ messages in thread
From: Reinette Chatre @ 2023-09-25 23:24 UTC (permalink / raw)
  To: Tony Luck, Fenghua Yu, Peter Newman, Jonathan Corbet, Shuah Khan, x86
  Cc: Shaopeng Tan, James Morse, Jamie Iles, Babu Moger, Randy Dunlap,
	linux-kernel, linux-doc, patches

Hi Tony,

On 8/29/2023 4:44 PM, Tony Luck wrote:
> Existing resctrl assumes that control and monitor operations on a

Please remove "Existing", it can just be "resctrl assumes ..."

> resource are performed at the same scope.
> 
> Prepare for systems that use different scope (specifically L3 scope
> for cache control and NODE scope for cache occupancy and memory
> bandwidth monitoring).
> 
> Create separate domain lists for control and monitor operations.
> 
> No important functional change. But note that errors during
> initialization of either control or monitor functions on a domain would
> previously result in that domain being excluded from both control and
> monitor operations. Now the domains are allocated independently it is
> no longer required to disable both control and monitor operations if
> either fail.
> 
> Signed-off-by: Tony Luck <tony.luck@intel.com>
> ---
>  include/linux/resctrl.h                   |  16 +-
>  arch/x86/kernel/cpu/resctrl/internal.h    |   6 +-
>  arch/x86/kernel/cpu/resctrl/core.c        | 227 +++++++++++++++-------
>  arch/x86/kernel/cpu/resctrl/ctrlmondata.c |   2 +-
>  arch/x86/kernel/cpu/resctrl/monitor.c     |   2 +-
>  arch/x86/kernel/cpu/resctrl/pseudo_lock.c |   2 +-
>  arch/x86/kernel/cpu/resctrl/rdtgroup.c    |  32 +--
>  7 files changed, 199 insertions(+), 88 deletions(-)
> 
> diff --git a/include/linux/resctrl.h b/include/linux/resctrl.h
> index 2db1244ae642..33856943a787 100644
> --- a/include/linux/resctrl.h
> +++ b/include/linux/resctrl.h
> @@ -155,10 +155,12 @@ enum resctrl_scope {
>   * @alloc_capable:	Is allocation available on this machine
>   * @mon_capable:	Is monitor feature available on this machine
>   * @num_rmid:		Number of RMIDs available
> - * @scope:		Scope of this resource
> + * @ctrl_scope:		Scope of this resource for control functions
> + * @mon_scope:		Scope of this resource for monitor functions
>   * @cache:		Cache allocation related data
>   * @membw:		If the component has bandwidth controls, their properties.
> - * @domains:		All domains for this resource
> + * @domains:		Control domains for this resource
> + * @mon_domains:	Monitor domains for this resource
>   * @name:		Name to use in "schemata" file.
>   * @data_width:		Character width of data when displaying
>   * @default_ctrl:	Specifies default cache cbm or memory B/W percent.
> @@ -173,10 +175,12 @@ struct rdt_resource {
>  	bool			alloc_capable;
>  	bool			mon_capable;
>  	int			num_rmid;
> -	enum resctrl_scope	scope;
> +	enum resctrl_scope	ctrl_scope;
> +	enum resctrl_scope	mon_scope;
>  	struct resctrl_cache	cache;
>  	struct resctrl_membw	membw;
>  	struct list_head	domains;
> +	struct list_head	mondomains;

kerneldoc is "mon_domains" while member is "mondomains".
I do think to be consistent with other members that "mon_domains"
would be appropriate. I also think it will be very helpful if
domains is renamed to "ctrl_domains" to more accurately match
"ctrl_scope" and what it represents.

>  	char			*name;
>  	int			data_width;
>  	u32			default_ctrl;
> @@ -222,8 +226,10 @@ int resctrl_arch_update_one(struct rdt_resource *r, struct rdt_domain *d,
>  
>  u32 resctrl_arch_get_config(struct rdt_resource *r, struct rdt_domain *d,
>  			    u32 closid, enum resctrl_conf_type type);
> -int resctrl_online_domain(struct rdt_resource *r, struct rdt_domain *d);
> -void resctrl_offline_domain(struct rdt_resource *r, struct rdt_domain *d);
> +int resctrl_online_ctrl_domain(struct rdt_resource *r, struct rdt_domain *d);
> +int resctrl_online_mon_domain(struct rdt_resource *r, struct rdt_domain *d);
> +void resctrl_offline_ctrl_domain(struct rdt_resource *r, struct rdt_domain *d);
> +void resctrl_offline_mon_domain(struct rdt_resource *r, struct rdt_domain *d);
>  
>  /**
>   * resctrl_arch_rmid_read() - Read the eventid counter corresponding to rmid
> diff --git a/arch/x86/kernel/cpu/resctrl/internal.h b/arch/x86/kernel/cpu/resctrl/internal.h
> index 85ceaf9a31ac..31a5fc3b717f 100644
> --- a/arch/x86/kernel/cpu/resctrl/internal.h
> +++ b/arch/x86/kernel/cpu/resctrl/internal.h
> @@ -511,8 +511,10 @@ void rdtgroup_kn_unlock(struct kernfs_node *kn);
>  int rdtgroup_kn_mode_restrict(struct rdtgroup *r, const char *name);
>  int rdtgroup_kn_mode_restore(struct rdtgroup *r, const char *name,
>  			     umode_t mask);
> -struct rdt_domain *rdt_find_domain(struct rdt_resource *r, int id,
> -				   struct list_head **pos);
> +struct rdt_domain *rdt_find_ctrldomain(struct list_head *h, int id,
> +				       struct list_head **pos);
> +struct rdt_domain *rdt_find_mondomain(struct list_head *h, int id,
> +				      struct list_head **pos);

This is not what I expected after our previous discussion. It
should not be necessary for the caller to be aware of the different
lists. It looks to me like the original parameters of (struct rdt_resource *r,
int id, struct list_head **pos) can be maintained.
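
That is, roughly this sketch, where the wrappers keep the (r, id, pos)
signature and hide which list is walked:

struct rdt_domain *rdt_find_ctrldomain(struct rdt_resource *r, int id,
                                       struct list_head **pos)
{
        return __rdt_find_domain(&r->domains, id, pos);
}

struct rdt_domain *rdt_find_mondomain(struct rdt_resource *r, int id,
                                      struct list_head **pos)
{
        return __rdt_find_domain(&r->mondomains, id, pos);
}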

>  ssize_t rdtgroup_schemata_write(struct kernfs_open_file *of,
>  				char *buf, size_t nbytes, loff_t off);
>  int rdtgroup_schemata_show(struct kernfs_open_file *of,
> diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
> index 0d3bae523ecb..97f6f9715fdb 100644
> --- a/arch/x86/kernel/cpu/resctrl/core.c
> +++ b/arch/x86/kernel/cpu/resctrl/core.c
> @@ -57,7 +57,7 @@ static void
>  mba_wrmsr_amd(struct rdt_domain *d, struct msr_param *m,
>  	      struct rdt_resource *r);
>  
> -#define domain_init(id) LIST_HEAD_INIT(rdt_resources_all[id].r_resctrl.domains)
> +#define domain_init(id, field) LIST_HEAD_INIT(rdt_resources_all[id].r_resctrl.field)
>  

It may make the code easier to read if this is split into
a "mon_domain_init()" and "ctrl_domain_init()"

>  struct rdt_hw_resource rdt_resources_all[] = {
>  	[RDT_RESOURCE_L3] =
> @@ -65,8 +65,10 @@ struct rdt_hw_resource rdt_resources_all[] = {
>  		.r_resctrl = {
>  			.rid			= RDT_RESOURCE_L3,
>  			.name			= "L3",
> -			.scope			= RESCTRL_L3_CACHE,
> -			.domains		= domain_init(RDT_RESOURCE_L3),
> +			.ctrl_scope		= RESCTRL_L3_CACHE,
> +			.mon_scope		= RESCTRL_L3_CACHE,
> +			.domains		= domain_init(RDT_RESOURCE_L3, domains),
> +			.mondomains		= domain_init(RDT_RESOURCE_L3, mondomains),
>  			.parse_ctrlval		= parse_cbm,
>  			.format_str		= "%d=%0*x",
>  			.fflags			= RFTYPE_RES_CACHE,
> @@ -79,8 +81,8 @@ struct rdt_hw_resource rdt_resources_all[] = {
>  		.r_resctrl = {
>  			.rid			= RDT_RESOURCE_L2,
>  			.name			= "L2",
> -			.scope			= RESCTRL_L2_CACHE,
> -			.domains		= domain_init(RDT_RESOURCE_L2),
> +			.ctrl_scope		= RESCTRL_L2_CACHE,
> +			.domains		= domain_init(RDT_RESOURCE_L2, domains),
>  			.parse_ctrlval		= parse_cbm,
>  			.format_str		= "%d=%0*x",
>  			.fflags			= RFTYPE_RES_CACHE,
> @@ -93,8 +95,8 @@ struct rdt_hw_resource rdt_resources_all[] = {
>  		.r_resctrl = {
>  			.rid			= RDT_RESOURCE_MBA,
>  			.name			= "MB",
> -			.scope			= RESCTRL_L3_CACHE,
> -			.domains		= domain_init(RDT_RESOURCE_MBA),
> +			.ctrl_scope		= RESCTRL_L3_CACHE,
> +			.domains		= domain_init(RDT_RESOURCE_MBA, domains),
>  			.parse_ctrlval		= parse_bw,
>  			.format_str		= "%d=%*u",
>  			.fflags			= RFTYPE_RES_MB,
> @@ -105,8 +107,8 @@ struct rdt_hw_resource rdt_resources_all[] = {
>  		.r_resctrl = {
>  			.rid			= RDT_RESOURCE_SMBA,
>  			.name			= "SMBA",
> -			.scope			= RESCTRL_L3_CACHE,
> -			.domains		= domain_init(RDT_RESOURCE_SMBA),
> +			.ctrl_scope		= RESCTRL_L3_CACHE,
> +			.domains		= domain_init(RDT_RESOURCE_SMBA, domains),
>  			.parse_ctrlval		= parse_bw,
>  			.format_str		= "%d=%*u",
>  			.fflags			= RFTYPE_RES_MB,
> @@ -384,15 +386,16 @@ void rdt_ctrl_update(void *arg)
>  }
>  
>  /*
> - * rdt_find_domain - Find a domain in a resource that matches input resource id
> + * __rdt_find_domain - Find a domain in either the list of control or
> + * monitor domains that matches input resource id
>   *
>   * Search resource r's domain list to find the resource id. If the resource
>   * id is found in a domain, return the domain. Otherwise, if requested by
>   * caller, return the first domain whose id is bigger than the input id.
>   * The domain list is sorted by id in ascending order.
>   */
> -struct rdt_domain *rdt_find_domain(struct rdt_resource *r, int id,
> -				   struct list_head **pos)
> +static void *__rdt_find_domain(struct list_head *h, int id,
> +			       struct list_head **pos)
>  {
>  	struct rdt_domain *d;
>  	struct list_head *l;
> @@ -400,7 +403,7 @@ struct rdt_domain *rdt_find_domain(struct rdt_resource *r, int id,
>  	if (id < 0)
>  		return ERR_PTR(-ENODEV);
>  
> -	list_for_each(l, &r->domains) {
> +	list_for_each(l, h) {
>  		d = list_entry(l, struct rdt_domain, list);
>  		/* When id is found, return its domain. */
>  		if (id == d->id)
> @@ -416,6 +419,18 @@ struct rdt_domain *rdt_find_domain(struct rdt_resource *r, int id,
>  	return NULL;
>  }
>  
> +struct rdt_domain *rdt_find_ctrldomain(struct list_head *h, int id,
> +				       struct list_head **pos)
> +{
> +	return __rdt_find_domain(h, id, pos);
> +}
> +
> +struct rdt_domain *rdt_find_mondomain(struct list_head *h, int id,
> +				      struct list_head **pos)
> +{
> +	return __rdt_find_domain(h, id, pos);
> +}
> +
>  static void setup_default_ctrlval(struct rdt_resource *r, u32 *dc)
>  {
>  	struct rdt_hw_resource *hw_res = resctrl_to_arch_res(r);
> @@ -431,10 +446,15 @@ static void setup_default_ctrlval(struct rdt_resource *r, u32 *dc)
>  }
>  
>  static void domain_free(struct rdt_hw_domain *hw_dom)
> +{
> +	kfree(hw_dom->ctrl_val);
> +	kfree(hw_dom);
> +}
> +
> +static void mondomain_free(struct rdt_hw_domain *hw_dom)
>  {
>  	kfree(hw_dom->arch_mbm_total);
>  	kfree(hw_dom->arch_mbm_local);
> -	kfree(hw_dom->ctrl_val);
>  	kfree(hw_dom);
>  }
>  
> @@ -502,6 +522,93 @@ static int get_domain_id_from_scope(int cpu, enum resctrl_scope scope)
>  	return -1;
>  }
>  
> +static void domain_add_cpu_ctrl(int cpu, struct rdt_resource *r)
> +{
> +	int id = get_domain_id_from_scope(cpu, r->ctrl_scope);
> +	struct list_head *add_pos = NULL;
> +	struct rdt_hw_domain *hw_dom;
> +	struct rdt_domain *d;
> +	int err;
> +
> +	d = rdt_find_ctrldomain(&r->domains, id, &add_pos);
> +	if (IS_ERR(d)) {
> +		pr_warn("Couldn't find scope id=%d for CPU %d\n", id, cpu);

I am not sure here ... this generates identical error messages
for control and monitor domain. How will the user know what failed?
Also, how does printing id help? If I understand correctly it can
only be -1 when the above prints.

> +		return;
> +	}
> +
> +	if (d) {
> +		cpumask_set_cpu(cpu, &d->cpu_mask);
> +		if (r->cache.arch_has_per_cpu_cfg)
> +			rdt_domain_reconfigure_cdp(r);
> +		return;
> +	}
> +
> +	hw_dom = kzalloc_node(sizeof(*hw_dom), GFP_KERNEL, cpu_to_node(cpu));
> +	if (!hw_dom)
> +		return;
> +
> +	d = &hw_dom->d_resctrl;
> +	d->id = id;
> +	cpumask_set_cpu(cpu, &d->cpu_mask);
> +
> +	rdt_domain_reconfigure_cdp(r);
> +
> +	if (domain_setup_ctrlval(r, d)) {
> +		domain_free(hw_dom);
> +		return;
> +	}
> +
> +	list_add_tail(&d->list, add_pos);
> +
> +	err = resctrl_online_ctrl_domain(r, d);
> +	if (err) {
> +		list_del(&d->list);
> +		domain_free(hw_dom);
> +	}
> +}
> +
> +static void domain_add_cpu_mon(int cpu, struct rdt_resource *r)
> +{
> +	int id = get_domain_id_from_scope(cpu, r->mon_scope);
> +	struct rdt_hw_domain *hw_mondom;
> +	struct list_head *add_pos = NULL;

Please maintain reverse fir ordering. Please check all code.

> +	struct rdt_domain *d;
> +	int err;
> +
> +	d = rdt_find_mondomain(&r->mondomains, id, &add_pos);
> +	if (IS_ERR(d)) {
> +		pr_warn("Couldn't find scope id=%d for CPU %d\n", id, cpu);
> +		return;
> +	}
> +
> +	if (d) {
> +		cpumask_set_cpu(cpu, &d->cpu_mask);
> +

This is an unnecessary empty line that distracts from how the rest
of the function looks.

> +		return;
> +	}
> +
> +	hw_mondom = kzalloc_node(sizeof(*hw_mondom), GFP_KERNEL, cpu_to_node(cpu));
> +	if (!hw_mondom)
> +		return;
> +
> +	d = &hw_mondom->d_resctrl;
> +	d->id = id;
> +	cpumask_set_cpu(cpu, &d->cpu_mask);
> +
> +	if (arch_domain_mbm_alloc(r->num_rmid, hw_mondom)) {
> +		mondomain_free(hw_mondom);
> +		return;
> +	}
> +
> +	list_add_tail(&d->list, add_pos);
> +
> +	err = resctrl_online_mon_domain(r, d);
> +	if (err) {
> +		list_del(&d->list);
> +		mondomain_free(hw_mondom);
> +	}
> +}
> +
>  /*
>   * domain_add_cpu - Add a cpu to a resource's domain list.
>   *

Note that this leaves the comments about list management here while all
the list management code is moved away.

> @@ -517,70 +624,28 @@ static int get_domain_id_from_scope(int cpu, enum resctrl_scope scope)
>   */
>  static void domain_add_cpu(int cpu, struct rdt_resource *r)
>  {
> -	int id = get_domain_id_from_scope(cpu, r->scope);
> -	struct list_head *add_pos = NULL;
> -	struct rdt_hw_domain *hw_dom;
> -	struct rdt_domain *d;
> -	int err;
> -
> -	d = rdt_find_domain(r, id, &add_pos);
> -	if (IS_ERR(d)) {
> -		pr_warn("Couldn't find cache id for CPU %d\n", cpu);
> -		return;
> -	}
> -
> -	if (d) {
> -		cpumask_set_cpu(cpu, &d->cpu_mask);
> -		if (r->cache.arch_has_per_cpu_cfg)
> -			rdt_domain_reconfigure_cdp(r);
> -		return;
> -	}
> -
> -	hw_dom = kzalloc_node(sizeof(*hw_dom), GFP_KERNEL, cpu_to_node(cpu));
> -	if (!hw_dom)
> -		return;
> -
> -	d = &hw_dom->d_resctrl;
> -	d->id = id;
> -	cpumask_set_cpu(cpu, &d->cpu_mask);
> -
> -	rdt_domain_reconfigure_cdp(r);
> -
> -	if (r->alloc_capable && domain_setup_ctrlval(r, d)) {
> -		domain_free(hw_dom);
> -		return;
> -	}
> -
> -	if (r->mon_capable && arch_domain_mbm_alloc(r->num_rmid, hw_dom)) {
> -		domain_free(hw_dom);
> -		return;
> -	}
> -
> -	list_add_tail(&d->list, add_pos);
> -
> -	err = resctrl_online_domain(r, d);
> -	if (err) {
> -		list_del(&d->list);
> -		domain_free(hw_dom);
> -	}
> +	if (r->alloc_capable)
> +		domain_add_cpu_ctrl(cpu, r);
> +	if (r->mon_capable)
> +		domain_add_cpu_mon(cpu, r);
>  }
>  
> -static void domain_remove_cpu(int cpu, struct rdt_resource *r)
> +static void domain_remove_cpu_ctrl(int cpu, struct rdt_resource *r)
>  {
> -	int id = get_domain_id_from_scope(cpu, r->scope);
> +	int id = get_domain_id_from_scope(cpu, r->ctrl_scope);
>  	struct rdt_hw_domain *hw_dom;
>  	struct rdt_domain *d;
>  
> -	d = rdt_find_domain(r, id, NULL);
> +	d = rdt_find_ctrldomain(&r->domains, id, NULL);
>  	if (IS_ERR_OR_NULL(d)) {
> -		pr_warn("Couldn't find cache id for CPU %d\n", cpu);
> +		pr_warn("Couldn't find scope id=%d for CPU %d\n", id, cpu);
>  		return;
>  	}
>  	hw_dom = resctrl_to_arch_dom(d);
>  
>  	cpumask_clear_cpu(cpu, &d->cpu_mask);
>  	if (cpumask_empty(&d->cpu_mask)) {
> -		resctrl_offline_domain(r, d);
> +		resctrl_offline_ctrl_domain(r, d);
>  		list_del(&d->list);
>  
>  		/*
> @@ -593,6 +658,30 @@ static void domain_remove_cpu(int cpu, struct rdt_resource *r)
>  
>  		return;
>  	}
> +}
> +
> +static void domain_remove_cpu_mon(int cpu, struct rdt_resource *r)
> +{
> +	int id = get_domain_id_from_scope(cpu, r->mon_scope);
> +	struct rdt_hw_domain *hw_mondom;
> +	struct rdt_domain *d;
> +
> +	d = rdt_find_mondomain(&r->mondomains, id, NULL);
> +	if (IS_ERR_OR_NULL(d)) {
> +		pr_warn("Couldn't find scope id=%d for CPU %d\n", id, cpu);
> +		return;
> +	}
> +	hw_mondom = resctrl_to_arch_dom(d);
> +
> +	cpumask_clear_cpu(cpu, &d->cpu_mask);
> +	if (cpumask_empty(&d->cpu_mask)) {
> +		resctrl_offline_mon_domain(r, d);
> +		list_del(&d->list);
> +

This is an awkward empty line.

> +		mondomain_free(hw_mondom);
> +
> +		return;
> +	}
>  
>  	if (r == &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl) {
>  		if (is_mbm_enabled() && cpu == d->mbm_work_cpu) {

...

Reinette

^ permalink raw reply	[flat|nested] 331+ messages in thread

* Re: [PATCH v5 3/8] x86/resctrl: Split the rdt_domain structure
  2023-08-29 23:44     ` [PATCH v5 3/8] x86/resctrl: Split the rdt_domain structure Tony Luck
@ 2023-09-25 23:25       ` Reinette Chatre
  2023-09-26 18:46         ` Tony Luck
  2023-09-28 17:33         ` Tony Luck
  0 siblings, 2 replies; 331+ messages in thread
From: Reinette Chatre @ 2023-09-25 23:25 UTC (permalink / raw)
  To: Tony Luck, Fenghua Yu, Peter Newman, Jonathan Corbet, Shuah Khan, x86
  Cc: Shaopeng Tan, James Morse, Jamie Iles, Babu Moger, Randy Dunlap,
	linux-kernel, linux-doc, patches

Hi Tony,

Subject:

x86/resctrl: Split the rdt_domain and rdt_hw_domain structures


On 8/29/2023 4:44 PM, Tony Luck wrote:
> The same rdt_domain structure is used for both control an monitor

"control an monitor" -> "control and monitor"

> functions. But this results in wasted memory as some of the fields
> are only used by control functions, while most are only used for monitor
> functions.
> 
> Create a new rdt_mondomain structure tailored explicitly for use in
> monitor parts of the core. Slim down the rdt_domain structure by
> removing the unused monitor fields.
> 

Similar to the previous patch I think it will make the code
easier to understand if the naming is clear for both
monitoring and control structures. Why not rdt_mondomain
and rdt_ctrldomain instead?


> Similar breakout of struct rdt_hw_mondomain from struct rdt_hw_domain.

rdt_hw_mondomain and rdt_hw_ctrldomain?

> 
> Signed-off-by: Tony Luck <tony.luck@intel.com>
> ---
>  include/linux/resctrl.h                   | 46 +++++++++++++++--------
>  arch/x86/kernel/cpu/resctrl/internal.h    | 38 +++++++++++++------
>  arch/x86/kernel/cpu/resctrl/core.c        | 18 ++++-----
>  arch/x86/kernel/cpu/resctrl/ctrlmondata.c |  4 +-
>  arch/x86/kernel/cpu/resctrl/monitor.c     | 40 ++++++++++----------
>  arch/x86/kernel/cpu/resctrl/rdtgroup.c    | 24 ++++++------
>  6 files changed, 101 insertions(+), 69 deletions(-)
> 
> diff --git a/include/linux/resctrl.h b/include/linux/resctrl.h
> index 33856943a787..08382548571e 100644
> --- a/include/linux/resctrl.h
> +++ b/include/linux/resctrl.h
> @@ -53,7 +53,29 @@ struct resctrl_staged_config {
>  };
>  
>  /**
> - * struct rdt_domain - group of CPUs sharing a resctrl resource
> + * struct rdt_domain - group of CPUs sharing a resctrl control resource
> + * @list:		all instances of this resource
> + * @id:			unique id for this instance
> + * @cpu_mask:		which CPUs share this resource
> + * @plr:		pseudo-locked region (if any) associated with domain
> + * @staged_config:	parsed configuration to be applied
> + * @mbps_val:		When mba_sc is enabled, this holds the array of user
> + *			specified control values for mba_sc in MBps, indexed
> + *			by closid
> + */
> +struct rdt_domain {
> +	// First three fields must match struct rdt_mondomain below.

Please avoid comments within declarations. Even so, could you please
elaborate what the above means? Why do the first three fields have to
match? I understand there is common code, for example, __rdt_find_domain()
that operates on the same members of the two structs, but does that
require the members be in the same position in the struct?
I understand that a comment may be required if position in the struct
is important but I cannot see that it is.

> +	struct list_head		list;
> +	int				id;
> +	struct cpumask			cpu_mask;
> +
> +	struct pseudo_lock_region	*plr;
> +	struct resctrl_staged_config	staged_config[CDP_NUM_TYPES];
> +	u32				*mbps_val;
> +};
> +
> +/**
> + * struct rdt_mondomain - group of CPUs sharing a resctrl monitor resource
>   * @list:		all instances of this resource
>   * @id:			unique id for this instance
>   * @cpu_mask:		which CPUs share this resource
> @@ -64,16 +86,13 @@ struct resctrl_staged_config {
>   * @cqm_limbo:		worker to periodically read CQM h/w counters
>   * @mbm_work_cpu:	worker CPU for MBM h/w counters
>   * @cqm_work_cpu:	worker CPU for CQM h/w counters
> - * @plr:		pseudo-locked region (if any) associated with domain
> - * @staged_config:	parsed configuration to be applied
> - * @mbps_val:		When mba_sc is enabled, this holds the array of user
> - *			specified control values for mba_sc in MBps, indexed
> - *			by closid
>   */
> -struct rdt_domain {
> +struct rdt_mondomain {
> +	// First three fields must match struct rdt_domain above.

Same comment.

>  	struct list_head		list;
>  	int				id;
>  	struct cpumask			cpu_mask;
> +
>  	unsigned long			*rmid_busy_llc;
>  	struct mbm_state		*mbm_total;
>  	struct mbm_state		*mbm_local;
> @@ -81,9 +100,6 @@ struct rdt_domain {
>  	struct delayed_work		cqm_limbo;
>  	int				mbm_work_cpu;
>  	int				cqm_work_cpu;
> -	struct pseudo_lock_region	*plr;
> -	struct resctrl_staged_config	staged_config[CDP_NUM_TYPES];
> -	u32				*mbps_val;
>  };
>  
>  /**

...

> diff --git a/arch/x86/kernel/cpu/resctrl/ctrlmondata.c b/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
> index 468c1815edfd..5167ac9cbe98 100644
> --- a/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
> +++ b/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
> @@ -521,7 +521,7 @@ int rdtgroup_schemata_show(struct kernfs_open_file *of,
>  }
>  
>  void mon_event_read(struct rmid_read *rr, struct rdt_resource *r,
> -		    struct rdt_domain *d, struct rdtgroup *rdtgrp,
> +		    struct rdt_mondomain *d, struct rdtgroup *rdtgrp,
>  		    int evtid, int first)
>  {
>  	/*
> @@ -544,7 +544,7 @@ int rdtgroup_mondata_show(struct seq_file *m, void *arg)
>  	struct rdtgroup *rdtgrp;
>  	struct rdt_resource *r;
>  	union mon_data_bits md;
> -	struct rdt_domain *d;
> +	struct rdt_mondomain *d;

Reverse fir order.

>  	struct rmid_read rr;
>  	int ret = 0;
>  
> diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c
> index 66beca785535..42262d59ef9b 100644
> --- a/arch/x86/kernel/cpu/resctrl/monitor.c
> +++ b/arch/x86/kernel/cpu/resctrl/monitor.c
> @@ -170,7 +170,7 @@ static int __rmid_read(u32 rmid, enum resctrl_event_id eventid, u64 *val)
>  	return 0;
>  }
>  
> -static struct arch_mbm_state *get_arch_mbm_state(struct rdt_hw_domain *hw_dom,
> +static struct arch_mbm_state *get_arch_mbm_state(struct rdt_hw_mondomain *hw_dom,
>  						 u32 rmid,
>  						 enum resctrl_event_id eventid)
>  {
> @@ -189,10 +189,10 @@ static struct arch_mbm_state *get_arch_mbm_state(struct rdt_hw_domain *hw_dom,
>  	return NULL;
>  }
>  
> -void resctrl_arch_reset_rmid(struct rdt_resource *r, struct rdt_domain *d,
> +void resctrl_arch_reset_rmid(struct rdt_resource *r, struct rdt_mondomain *d,
>  			     u32 rmid, enum resctrl_event_id eventid)
>  {
> -	struct rdt_hw_domain *hw_dom = resctrl_to_arch_dom(d);
> +	struct rdt_hw_mondomain *hw_dom = resctrl_to_arch_mondom(d);
>  	struct arch_mbm_state *am;
>  
>  	am = get_arch_mbm_state(hw_dom, rmid, eventid);
> @@ -208,9 +208,9 @@ void resctrl_arch_reset_rmid(struct rdt_resource *r, struct rdt_domain *d,
>   * Assumes that hardware counters are also reset and thus that there is
>   * no need to record initial non-zero counts.
>   */
> -void resctrl_arch_reset_rmid_all(struct rdt_resource *r, struct rdt_domain *d)
> +void resctrl_arch_reset_rmid_all(struct rdt_resource *r, struct rdt_mondomain *d)
>  {
> -	struct rdt_hw_domain *hw_dom = resctrl_to_arch_dom(d);
> +	struct rdt_hw_mondomain *hw_dom = resctrl_to_arch_mondom(d);
>  
>  	if (is_mbm_total_enabled())
>  		memset(hw_dom->arch_mbm_total, 0,
> @@ -229,11 +229,11 @@ static u64 mbm_overflow_count(u64 prev_msr, u64 cur_msr, unsigned int width)
>  	return chunks >> shift;
>  }
>  
> -int resctrl_arch_rmid_read(struct rdt_resource *r, struct rdt_domain *d,
> +int resctrl_arch_rmid_read(struct rdt_resource *r, struct rdt_mondomain *d,
>  			   u32 rmid, enum resctrl_event_id eventid, u64 *val)
>  {
>  	struct rdt_hw_resource *hw_res = resctrl_to_arch_res(r);
> -	struct rdt_hw_domain *hw_dom = resctrl_to_arch_dom(d);
> +	struct rdt_hw_mondomain *hw_mondom = resctrl_to_arch_mondom(d);

Reverse fir.

>  	struct arch_mbm_state *am;
>  	u64 msr_val, chunks;
>  	int ret;
> @@ -245,7 +245,7 @@ int resctrl_arch_rmid_read(struct rdt_resource *r, struct rdt_domain *d,
>  	if (ret)
>  		return ret;
>  
> -	am = get_arch_mbm_state(hw_dom, rmid, eventid);
> +	am = get_arch_mbm_state(hw_mondom, rmid, eventid);
>  	if (am) {
>  		am->chunks += mbm_overflow_count(am->prev_msr, msr_val,
>  						 hw_res->mbm_width);
> @@ -266,7 +266,7 @@ int resctrl_arch_rmid_read(struct rdt_resource *r, struct rdt_domain *d,
>   * decrement the count. If the busy count gets to zero on an RMID, we
>   * free the RMID
>   */
> -void __check_limbo(struct rdt_domain *d, bool force_free)
> +void __check_limbo(struct rdt_mondomain *d, bool force_free)
>  {
>  	struct rdt_resource *r = &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl;
>  	struct rmid_entry *entry;
> @@ -305,7 +305,7 @@ void __check_limbo(struct rdt_domain *d, bool force_free)
>  	}
>  }
>  
> -bool has_busy_rmid(struct rdt_resource *r, struct rdt_domain *d)
> +bool has_busy_rmid(struct rdt_resource *r, struct rdt_mondomain *d)
>  {
>  	return find_first_bit(d->rmid_busy_llc, r->num_rmid) != r->num_rmid;
>  }
> @@ -334,7 +334,7 @@ int alloc_rmid(void)
>  static void add_rmid_to_limbo(struct rmid_entry *entry)
>  {
>  	struct rdt_resource *r = &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl;
> -	struct rdt_domain *d;
> +	struct rdt_mondomain *d;
>  	int cpu, err;
>  	u64 val = 0;
>  
> @@ -383,7 +383,7 @@ void free_rmid(u32 rmid)
>  		list_add_tail(&entry->list, &rmid_free_lru);
>  }
>  
> -static struct mbm_state *get_mbm_state(struct rdt_domain *d, u32 rmid,
> +static struct mbm_state *get_mbm_state(struct rdt_mondomain *d, u32 rmid,
>  				       enum resctrl_event_id evtid)
>  {
>  	switch (evtid) {
> @@ -516,7 +516,7 @@ void mon_event_count(void *info)
>   * throttle MSRs already have low percentage values.  To avoid
>   * unnecessarily restricting such rdtgroups, we also increase the bandwidth.
>   */
> -static void update_mba_bw(struct rdtgroup *rgrp, struct rdt_domain *dom_mbm)
> +static void update_mba_bw(struct rdtgroup *rgrp, struct rdt_mondomain *dom_mbm)
>  {
>  	u32 closid, rmid, cur_msr_val, new_msr_val;
>  	struct mbm_state *pmbm_data, *cmbm_data;
> @@ -600,7 +600,7 @@ static void update_mba_bw(struct rdtgroup *rgrp, struct rdt_domain *dom_mbm)
>  	}
>  }
>  
> -static void mbm_update(struct rdt_resource *r, struct rdt_domain *d, int rmid)
> +static void mbm_update(struct rdt_resource *r, struct rdt_mondomain *d, int rmid)
>  {
>  	struct rmid_read rr;
>  
> @@ -641,12 +641,12 @@ void cqm_handle_limbo(struct work_struct *work)
>  	unsigned long delay = msecs_to_jiffies(CQM_LIMBOCHECK_INTERVAL);
>  	int cpu = smp_processor_id();
>  	struct rdt_resource *r;
> -	struct rdt_domain *d;
> +	struct rdt_mondomain *d;

Reverse fir (Please check all code).

Reinette

^ permalink raw reply	[flat|nested] 331+ messages in thread

* Re: [PATCH v5 4/8] x86/resctrl: Add node-scope to the options for feature scope
  2023-08-29 23:44     ` [PATCH v5 4/8] x86/resctrl: Add node-scope to the options for feature scope Tony Luck
@ 2023-09-25 23:25       ` Reinette Chatre
  2023-09-28 17:42         ` Tony Luck
  0 siblings, 1 reply; 331+ messages in thread
From: Reinette Chatre @ 2023-09-25 23:25 UTC (permalink / raw)
  To: Tony Luck, Fenghua Yu, Peter Newman, Jonathan Corbet, Shuah Khan, x86
  Cc: Shaopeng Tan, James Morse, Jamie Iles, Babu Moger, Randy Dunlap,
	linux-kernel, linux-doc, patches

Hi Tony,

On 8/29/2023 4:44 PM, Tony Luck wrote:
> Currently supported resctrl features are all domain scoped the same as the
> scope of the L2 or L3 caches.

fyi ... this patch series seems to use the terms "resctrl feature"
and "resctrl resource" interchangeably and it is not always clear
if the terms mean something different.

> 
> Add "node" as a new option for domain scope.

Could the commit message please get a snippet about what "node"
represents and why this new scope is needed?

> Signed-off-by: Tony Luck <tony.luck@intel.com>
> ---
>  include/linux/resctrl.h            | 1 +
>  arch/x86/kernel/cpu/resctrl/core.c | 2 ++
>  2 files changed, 3 insertions(+)
> 
> diff --git a/include/linux/resctrl.h b/include/linux/resctrl.h
> index 08382548571e..f55cf7afd4eb 100644
> --- a/include/linux/resctrl.h
> +++ b/include/linux/resctrl.h
> @@ -163,6 +163,7 @@ struct resctrl_schema;
>  enum resctrl_scope {
>  	RESCTRL_L3_CACHE,
>  	RESCTRL_L2_CACHE,
> +	RESCTRL_NODE,
>  };
>  
>  /**
> diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
> index 3e08aa04a7ff..9fcc264fac6c 100644
> --- a/arch/x86/kernel/cpu/resctrl/core.c
> +++ b/arch/x86/kernel/cpu/resctrl/core.c
> @@ -514,6 +514,8 @@ static int get_domain_id_from_scope(int cpu, enum resctrl_scope scope)
>  		return get_cpu_cacheinfo_id(cpu, 3);
>  	case RESCTRL_L2_CACHE:
>  		return get_cpu_cacheinfo_id(cpu, 2);
> +	case RESCTRL_NODE:
> +		return cpu_to_node(cpu);
>  	default:
>  		WARN_ON_ONCE(1);
>  		break;


Reinette

^ permalink raw reply	[flat|nested] 331+ messages in thread

* Re: [PATCH v5 5/8] x86/resctrl: Introduce snc_nodes_per_l3_cache
  2023-08-29 23:44     ` [PATCH v5 5/8] x86/resctrl: Introduce snc_nodes_per_l3_cache Tony Luck
@ 2023-09-25 23:27       ` Reinette Chatre
  2023-09-28 17:48         ` Tony Luck
  0 siblings, 1 reply; 331+ messages in thread
From: Reinette Chatre @ 2023-09-25 23:27 UTC (permalink / raw)
  To: Tony Luck, Fenghua Yu, Peter Newman, Jonathan Corbet, Shuah Khan, x86
  Cc: Shaopeng Tan, James Morse, Jamie Iles, Babu Moger, Randy Dunlap,
	linux-kernel, linux-doc, patches

Hi Tony,

On 8/29/2023 4:44 PM, Tony Luck wrote:

Could the commit message please provide a brief overview of what SNC is
before jumping to the things needed to support it?

> Intel Sub-NUMA Cluster mode requires several changes in resctrl

I think the intention is to introduce the acronym here so maybe:
"Intel Sub-NUMA Cluster (SNC) ..."

> behavior for correct operation.
> 
> Add a global integer "snc_nodes_per_l3_cache" that will show how many
> SNC nodes share each L3 cache. When this is "1", SNC mode is either
> not implemented, or not enabled.
> 
> A later patch will detect SNC mode and set snc_nodes_per_l3_cache to
> the appropriate value. For now it remains at the default "1" to
> indicate SNC mode is not active.
> 
> Code that needs to take action when SNC is enabled is:
> 1) The number of logical RMIDs available for use is the number of
>    physical RMIDs divided by the number of SNC nodes.

Could this maybe be "... number of SNC nodes per L3 cache" to be
specific? Even so, this jumps into supporting logical RMIDs and 
physical RMIDs without introducing what logical vs physical means.
Is this something that can be added to the intro of this commit message?

> 2) Likewise the "mon_scale" value must be adjusted for the number
>    of SNC nodes.
> 3) When reading an RMID counter code must adjust from the logical
>    RMID used to the physical RMID value that must be loaded into
>    the IA32_QM_EVTSEL MSR.
> 4) The L3 cache is divided between the SNC nodes. So the value
>    reported in the resctrl "size" file is adjusted.
> 5) The "-o mba_MBps" mount option must be disabled in SNC mode
>    because the monitoring is being done per SNC node, while the
>    bandwidth allocation is still done at the L3 cache scope.

This motivation for disabling is not clear to me. Why is only
mba_MBps impacted? MBA is also at the L3 scope and it is not
disabled. Neither is cache allocation that remains at L3
scope with its monitoring moving to node scope.


> 
> Signed-off-by: Tony Luck <tony.luck@intel.com>
> ---
>  arch/x86/kernel/cpu/resctrl/internal.h |  2 ++
>  arch/x86/kernel/cpu/resctrl/core.c     |  7 +++++++
>  arch/x86/kernel/cpu/resctrl/monitor.c  | 16 +++++++++++++---
>  arch/x86/kernel/cpu/resctrl/rdtgroup.c |  4 ++--
>  4 files changed, 24 insertions(+), 5 deletions(-)
> 
> diff --git a/arch/x86/kernel/cpu/resctrl/internal.h b/arch/x86/kernel/cpu/resctrl/internal.h
> index c61fd6709730..326ca6b3688a 100644
> --- a/arch/x86/kernel/cpu/resctrl/internal.h
> +++ b/arch/x86/kernel/cpu/resctrl/internal.h
> @@ -446,6 +446,8 @@ DECLARE_STATIC_KEY_FALSE(rdt_alloc_enable_key);
>  
>  extern struct dentry *debugfs_resctrl;
>  
> +extern int snc_nodes_per_l3_cache;
> +
>  enum resctrl_res_level {
>  	RDT_RESOURCE_L3,
>  	RDT_RESOURCE_L2,
> diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
> index 9fcc264fac6c..ed4f55b3e5e4 100644
> --- a/arch/x86/kernel/cpu/resctrl/core.c
> +++ b/arch/x86/kernel/cpu/resctrl/core.c
> @@ -48,6 +48,13 @@ int max_name_width, max_data_width;
>   */
>  bool rdt_alloc_capable;
>  
> +/*
> + * Number of SNC nodes that share each L3 cache.
> + * Default is 1 for systems that do not support
> + * SNC, or have SNC disabled.
> + */

There is some extra space available to make the lines longer.

> +int snc_nodes_per_l3_cache = 1;
> +
>  static void
>  mba_wrmsr_intel(struct rdt_domain *d, struct msr_param *m,
>  		struct rdt_resource *r);
> diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c
> index 42262d59ef9b..b6b3fb0f9abe 100644
> --- a/arch/x86/kernel/cpu/resctrl/monitor.c
> +++ b/arch/x86/kernel/cpu/resctrl/monitor.c
> @@ -148,8 +148,18 @@ static inline struct rmid_entry *__rmid_entry(u32 rmid)
>  
>  static int __rmid_read(u32 rmid, enum resctrl_event_id eventid, u64 *val)
>  {
> +	struct rdt_resource *r = &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl;
> +	int cpu = smp_processor_id();
> +	int rmid_offset = 0;
>  	u64 msr_val;
>  
> +	/*
> +	 * When SNC mode is on, need to compute the offset to read the
> +	 * physical RMID counter for the node to which this CPU belongs
> +	 */

Please end sentence with a period.

> +	if (snc_nodes_per_l3_cache > 1)
> +		rmid_offset = (cpu_to_node(cpu) % snc_nodes_per_l3_cache) * r->num_rmid;
> +
>  	/*
>  	 * As per the SDM, when IA32_QM_EVTSEL.EvtID (bits 7:0) is configured
>  	 * with a valid event code for supported resource type and the bits
> @@ -158,7 +168,7 @@ static int __rmid_read(u32 rmid, enum resctrl_event_id eventid, u64 *val)
>  	 * IA32_QM_CTR.Error (bit 63) and IA32_QM_CTR.Unavailable (bit 62)
>  	 * are error bits.
>  	 */
> -	wrmsr(MSR_IA32_QM_EVTSEL, eventid, rmid);
> +	wrmsr(MSR_IA32_QM_EVTSEL, eventid, rmid + rmid_offset);
>  	rdmsrl(MSR_IA32_QM_CTR, msr_val);
>  
>  	if (msr_val & RMID_VAL_ERROR)
> @@ -783,8 +793,8 @@ int __init rdt_get_mon_l3_config(struct rdt_resource *r)
>  	int ret;
>  
>  	resctrl_rmid_realloc_limit = boot_cpu_data.x86_cache_size * 1024;
> -	hw_res->mon_scale = boot_cpu_data.x86_cache_occ_scale;
> -	r->num_rmid = boot_cpu_data.x86_cache_max_rmid + 1;
> +	hw_res->mon_scale = boot_cpu_data.x86_cache_occ_scale / snc_nodes_per_l3_cache;
> +	r->num_rmid = (boot_cpu_data.x86_cache_max_rmid + 1) / snc_nodes_per_l3_cache;
>  	hw_res->mbm_width = MBM_CNTR_WIDTH_BASE;
>  
>  	if (mbm_offset > 0 && mbm_offset <= MBM_CNTR_WIDTH_OFFSET_MAX)
> diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> index 5feec2c33544..a8cf6251e506 100644
> --- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> +++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> @@ -1367,7 +1367,7 @@ unsigned int rdtgroup_cbm_to_size(struct rdt_resource *r,
>  		}
>  	}
>  
> -	return size;
> +	return size / snc_nodes_per_l3_cache;
>  }
>  
>  /**
> @@ -2600,7 +2600,7 @@ static int rdt_parse_param(struct fs_context *fc, struct fs_parameter *param)
>  		ctx->enable_cdpl2 = true;
>  		return 0;
>  	case Opt_mba_mbps:
> -		if (!supports_mba_mbps())
> +		if (!supports_mba_mbps() || snc_nodes_per_l3_cache > 1)
>  			return -EINVAL;
>  		ctx->enable_mba_mbps = true;
>  		return 0;


Reinette

^ permalink raw reply	[flat|nested] 331+ messages in thread

* Re: [PATCH v5 6/8] x86/resctrl: Sub NUMA Cluster detection and enable
  2023-08-29 23:44     ` [PATCH v5 6/8] x86/resctrl: Sub NUMA Cluster detection and enable Tony Luck
@ 2023-09-25 23:29       ` Reinette Chatre
  2023-09-28 18:12         ` Tony Luck
  2023-09-26 19:16       ` Moger, Babu
  1 sibling, 1 reply; 331+ messages in thread
From: Reinette Chatre @ 2023-09-25 23:29 UTC (permalink / raw)
  To: Tony Luck, Fenghua Yu, Peter Newman, Jonathan Corbet, Shuah Khan, x86
  Cc: Shaopeng Tan, James Morse, Jamie Iles, Babu Moger, Randy Dunlap,
	linux-kernel, linux-doc, patches

Hi Tony,

On 8/29/2023 4:44 PM, Tony Luck wrote:
> There isn't a simple h/w bit that indicates whether a CPU is
> running in Sub NUMA Cluster mode. Infer the state by comparing

Sub NUMA Cluster (SNC)

> the ratio of NUMA nodes to L3 cache instances.
> 
> When SNC mode is detected, reconfigure the RMID counters by updating
> the MSR_RMID_SNC_CONFIG MSR on each socket as CPUs are seen.

Can it be explained how RMID counters are reconfigured when
the MSR is updated?

> 
> Signed-off-by: Tony Luck <tony.luck@intel.com>
> ---
>  arch/x86/include/asm/msr-index.h   |  1 +
>  arch/x86/kernel/cpu/resctrl/core.c | 68 ++++++++++++++++++++++++++++++
>  2 files changed, 69 insertions(+)
> 
> diff --git a/arch/x86/include/asm/msr-index.h b/arch/x86/include/asm/msr-index.h
> index 1d111350197f..393d1b047617 100644
> --- a/arch/x86/include/asm/msr-index.h
> +++ b/arch/x86/include/asm/msr-index.h
> @@ -1100,6 +1100,7 @@
>  #define MSR_IA32_QM_CTR			0xc8e
>  #define MSR_IA32_PQR_ASSOC		0xc8f
>  #define MSR_IA32_L3_CBM_BASE		0xc90
> +#define MSR_RMID_SNC_CONFIG		0xca0
>  #define MSR_IA32_L2_CBM_BASE		0xd10
>  #define MSR_IA32_MBA_THRTL_BASE		0xd50
>  
> diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
> index ed4f55b3e5e4..9f0ac9721fab 100644
> --- a/arch/x86/kernel/cpu/resctrl/core.c
> +++ b/arch/x86/kernel/cpu/resctrl/core.c
> @@ -16,11 +16,14 @@
>  
>  #define pr_fmt(fmt)	"resctrl: " fmt
>  
> +#include <linux/cpu.h>
>  #include <linux/slab.h>
>  #include <linux/err.h>
>  #include <linux/cacheinfo.h>
>  #include <linux/cpuhotplug.h>
> +#include <linux/mod_devicetable.h>
>  
> +#include <asm/cpu_device_id.h>
>  #include <asm/intel-family.h>
>  #include <asm/resctrl.h>
>  #include "internal.h"
> @@ -724,11 +727,34 @@ static void clear_closid_rmid(int cpu)
>  	wrmsr(MSR_IA32_PQR_ASSOC, 0, 0);
>  }
>  
> +/*
> + * The power-on reset value of MSR_RMID_SNC_CONFIG is 0x1
> + * which indicates that RMIDs are configured in legacy mode.
> + * Clearing bit 0 reconfigures the RMID counters for use
> + * in Sub NUMA Cluster mode.
> + */

I think I missed where "legacy mode" and "Sub NUMA Cluster mode"
are explained. 

> +static void snc_remap_rmids(int cpu)
> +{
> +	u64 val;
> +
> +	/* Only need to enable once per package */

Sentence end with period.

> +	if (cpumask_first(topology_core_cpumask(cpu)) != cpu)
> +		return;
> +

This is an area I am not familiar with. The above code seems
to assume that CPUs are onlined in a particular numerical
order. For example, if I understand correctly, if CPUs
are onlined from higher number to lower number then
the above code may end up running on every CPU online.
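
One untested sketch that avoids depending on the online order is to remember
which packages have already been reconfigured (the mask name is made up here,
and the package id is used as a bit index):

static cpumask_t snc_packages_configured;

static void snc_remap_rmids(int cpu)
{
        u64 val;

        /* Reconfigure each package only once, whatever the online order. */
        if (cpumask_test_and_set_cpu(topology_physical_package_id(cpu),
                                     &snc_packages_configured))
                return;

        rdmsrl(MSR_RMID_SNC_CONFIG, val);
        val &= ~BIT_ULL(0);
        wrmsrl(MSR_RMID_SNC_CONFIG, val);
}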

> +	rdmsrl(MSR_RMID_SNC_CONFIG, val);
> +	val &= ~BIT_ULL(0);
> +	wrmsrl(MSR_RMID_SNC_CONFIG, val);
> +}
> +
>  static int resctrl_online_cpu(unsigned int cpu)
>  {
>  	struct rdt_resource *r;
>  
>  	mutex_lock(&rdtgroup_mutex);
> +
> +	if (snc_nodes_per_l3_cache > 1)
> +		snc_remap_rmids(cpu);
> +
>  	for_each_capable_rdt_resource(r)
>  		domain_add_cpu(cpu, r);
>  	/* The cpu is set in default rdtgroup after online. */
> @@ -983,11 +1009,53 @@ static __init bool get_rdt_resources(void)
>  	return (rdt_mon_capable || rdt_alloc_capable);
>  }
>  
> +/* CPU models that support MSR_RMID_SNC_CONFIG */
> +static const struct x86_cpu_id snc_cpu_ids[] __initconst = {
> +	X86_MATCH_INTEL_FAM6_MODEL(ICELAKE_X, 0),
> +	X86_MATCH_INTEL_FAM6_MODEL(SAPPHIRERAPIDS_X, 0),
> +	X86_MATCH_INTEL_FAM6_MODEL(EMERALDRAPIDS_X, 0),
> +	{}
> +};
> +
> +static __init int get_snc_config(void)

Could this function please get a comment about what it does?
The first paragraph from the commit message should be ok.

> +{
> +	unsigned long *node_caches;
> +	int mem_only_nodes = 0;
> +	int cpu, node, ret;
> +
> +	if (!x86_match_cpu(snc_cpu_ids))
> +		return 1;
> +
> +	node_caches = kcalloc(BITS_TO_LONGS(nr_node_ids), sizeof(*node_caches), GFP_KERNEL);

How about bitmap_zalloc()?
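
That is, roughly:

	node_caches = bitmap_zalloc(nr_node_ids, GFP_KERNEL);
	if (!node_caches)
		return 1;

with the later kfree(node_caches) becoming bitmap_free(node_caches).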

> +	if (!node_caches)
> +		return 1;
> +
> +	cpus_read_lock();
> +	for_each_node(node) {
> +		cpu = cpumask_first(cpumask_of_node(node));
> +		if (cpu < nr_cpu_ids)
> +			set_bit(get_cpu_cacheinfo_id(cpu, 3), node_caches);
> +		else
> +			mem_only_nodes++;
> +	}
> +	cpus_read_unlock();
> +
> +	ret = (nr_node_ids - mem_only_nodes) / bitmap_weight(node_caches, nr_node_ids);

My static checker complains that the above line risks a "divide by zero",
which can happen if bitmap_weight() returns zero. You may need to ensure
this is safe here before more static checkers
start analyzing this code.
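
A sketch of one way to guard it (num_l3_caches is an added local here):

	num_l3_caches = bitmap_weight(node_caches, nr_node_ids);
	if (!num_l3_caches) {
		kfree(node_caches);
		return 1;
	}
	ret = (nr_node_ids - mem_only_nodes) / num_l3_caches;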

> +	kfree(node_caches);
> +
> +	if (ret > 1)
> +		rdt_resources_all[RDT_RESOURCE_L3].r_resctrl.mon_scope = RESCTRL_NODE;
> +
> +	return ret;
> +}

As discussed there are scenarios where the above may not provide the
correct value. Could you please document the assumptions of this code to
highlight this to x86 maintainers for their opinions?

> +
>  static __init void rdt_init_res_defs_intel(void)
>  {
>  	struct rdt_hw_resource *hw_res;
>  	struct rdt_resource *r;
>  
> +	snc_nodes_per_l3_cache = get_snc_config();
> +
>  	for_each_rdt_resource(r) {
>  		hw_res = resctrl_to_arch_res(r);
>  


Reinette

^ permalink raw reply	[flat|nested] 331+ messages in thread

* Re: [PATCH v5 7/8] x86/resctrl: Update documentation with Sub-NUMA cluster changes
  2023-08-29 23:44     ` [PATCH v5 7/8] x86/resctrl: Update documentation with Sub-NUMA cluster changes Tony Luck
@ 2023-09-25 23:30       ` Reinette Chatre
  2023-09-28 18:15         ` Tony Luck
  2023-09-26 19:00       ` Moger, Babu
  1 sibling, 1 reply; 331+ messages in thread
From: Reinette Chatre @ 2023-09-25 23:30 UTC (permalink / raw)
  To: Tony Luck, Fenghua Yu, Peter Newman, Jonathan Corbet, Shuah Khan, x86
  Cc: Shaopeng Tan, James Morse, Jamie Iles, Babu Moger, Randy Dunlap,
	linux-kernel, linux-doc, patches

Hi Tony,

On 8/29/2023 4:44 PM, Tony Luck wrote:
> With Sub-NUMA Cluster mode enabled the scope of monitoring resources is
> per-NODE instead of per-L3 cache. Suffixes of directories with "L3" in
> their name refer to Sub-NUMA nodes instead of L3 cache ids.
> 
> Users should be aware that SNC mode also affects the amount of L3 cache
> available for allocation within each SNC node.
> 
> Signed-off-by: Tony Luck <tony.luck@intel.com>
> ---
>  Documentation/arch/x86/resctrl.rst | 25 ++++++++++++++++++++++---
>  1 file changed, 22 insertions(+), 3 deletions(-)
> 
> diff --git a/Documentation/arch/x86/resctrl.rst b/Documentation/arch/x86/resctrl.rst
> index cb05d90111b4..407764f43f25 100644
> --- a/Documentation/arch/x86/resctrl.rst
> +++ b/Documentation/arch/x86/resctrl.rst
> @@ -345,9 +345,15 @@ When control is enabled all CTRL_MON groups will also contain:
>  When monitoring is enabled all MON groups will also contain:
>  
>  "mon_data":
> -	This contains a set of files organized by L3 domain and by
> -	RDT event. E.g. on a system with two L3 domains there will
> -	be subdirectories "mon_L3_00" and "mon_L3_01".	Each of these
> +	This contains a set of files organized by L3 domain or by NUMA
> +	node (depending on whether Sub-NUMA Cluster (SNC) mode is disabled
> +	or enabled respectively) and by RDT event. E.g. on a system with
> +	SNC mode disabled with two L3 domains there will be subdirectories
> +	"mon_L3_00" and "mon_L3_01". The numerical suffix refers to the
> +	L3 cache id.  With SNC enabled the directory names are the same,
> +	but the numerical suffix refers to the node id.
> +        Mappings from node ids to CPUs are available in the
> +        /sys/devices/system/node/node*/cpulist files. Each of these

Please be consistent with tabs vs spaces.

>  	directories have one file per event (e.g. "llc_occupancy",
>  	"mbm_total_bytes", and "mbm_local_bytes"). In a MON group these
>  	files provide a read out of the current value of the event for
> @@ -452,6 +458,19 @@ and 0xA are not.  On a system with a 20-bit mask each bit represents 5%
>  of the capacity of the cache. You could partition the cache into four
>  equal parts with masks: 0x1f, 0x3e0, 0x7c00, 0xf8000.
>  
> +Notes on Sub-NUMA Cluster mode
> +==============================
> +When SNC mode is enabled the "llc_occupancy", "mbm_total_bytes", and
> +"mbm_local_bytes" will only give accurate results for well behaved NUMA
> +applications. I.e. those that perform the majority of memory accesses
> +to memory on the local NUMA node to the CPU where the task is executing.
> +

Following our discussion of the cover letter I believe this will change to
be specific about what user space can expect and what is actually "accurate".

> +The cache allocation feature still provides the same number of
> +bits in a mask to control allocation into the L3 cache. But each
> +of those ways has its capacity reduced because the cache is divided
> +between the SNC nodes. The values reported in the resctrl
> +"size" files are adjusted accordingly.
> +
>  Memory bandwidth Allocation and monitoring
>  ==========================================
>  

Reinette

^ permalink raw reply	[flat|nested] 331+ messages in thread

* Re: [PATCH v5 1/8] x86/resctrl: Prepare for new domain scope
  2023-09-25 23:22       ` Reinette Chatre
@ 2023-09-26  0:56         ` Tony Luck
  2023-09-26 15:19           ` Reinette Chatre
  0 siblings, 1 reply; 331+ messages in thread
From: Tony Luck @ 2023-09-26  0:56 UTC (permalink / raw)
  To: Reinette Chatre
  Cc: Fenghua Yu, Peter Newman, Jonathan Corbet, Shuah Khan, x86,
	Shaopeng Tan, James Morse, Jamie Iles, Babu Moger, Randy Dunlap,
	linux-kernel, linux-doc, patches

On Mon, Sep 25, 2023 at 04:22:29PM -0700, Reinette Chatre wrote:
> Hi Tony,
> 
> On 8/29/2023 4:44 PM, Tony Luck wrote:
> > Legacy resctrl features operated on subsets of CPUs in the system with
> 
> What is a "legacy" resctrl feature? I am not aware of anything in resctrl
> that distinguishes a feature as legacy. Could "resctrl resource" be more
> appropriate to match the resctrl implementation? 

Ok. Will change.

> 
> Please use "operate on" instead of "operated on".

Ditto.

> 
> > the defining attribute of each subset being an instance of a particular
> > level of cache. E.g. all CPUs sharing an L3 cache would be part of the
> > same domain.
> > 
> > In preparation for features that are scoped at the NUMA node level
> > change the code from explicit references to "cache_level" to a more
> > generic scope. At this point the only options for this scope are groups
> > of CPUs that share an L2 cache or L3 cache.
> > 
> > No functional change.
> > 
> > Signed-off-by: Tony Luck <tony.luck@intel.com>
> > ---
> >  include/linux/resctrl.h                   |  9 ++++++--
> >  arch/x86/kernel/cpu/resctrl/core.c        | 27 ++++++++++++++++++-----
> >  arch/x86/kernel/cpu/resctrl/pseudo_lock.c | 15 ++++++++++++-
> >  arch/x86/kernel/cpu/resctrl/rdtgroup.c    | 15 ++++++++++++-
> >  4 files changed, 56 insertions(+), 10 deletions(-)
> > 
> > diff --git a/include/linux/resctrl.h b/include/linux/resctrl.h
> > index 8334eeacfec5..2db1244ae642 100644
> > --- a/include/linux/resctrl.h
> > +++ b/include/linux/resctrl.h
> > @@ -144,13 +144,18 @@ struct resctrl_membw {
> >  struct rdt_parse_data;
> >  struct resctrl_schema;
> >  
> > +enum resctrl_scope {
> > +	RESCTRL_L3_CACHE,
> > +	RESCTRL_L2_CACHE,
> > +};
> 
> I'm curious why L3 appears before L2? This is not a big deal but
> I think the additional indirection that this introduces is
> not necessary. As you had in an earlier version this could be
> RESCTRL_L2_CACHE = 2 and then the value can just be used directly
> (after ensuring it is used in a valid context).

I appear to have misunderstood your earlier comments. I thought you
didn't like my use of RESCTRL_L2_CACHE = 2 and RESCTRL_L3_CACHE = 3
so that they could be passed directly to get_cpu_cacheinfo_id().

But I see now the issue was "after ensuring it is used in a valid
context". Are you looking for something like this:

enum resctrl_scope {
	RESCTRL_UNINITIALIZED_SCOPE = 0,
	RESCTRL_L2_CACHE = 2,
	RESCTRL_L3_CACHE = 3,
	RESCTRL_L3_NODE = 4,
};

static int get_domain_id_from_scope(int cpu, enum resctrl_scope scope)
{
	switch (scope) {
	case RESCTRL_L2_CACHE:
	case RESCTRL_L3_CACHE:
		return get_cpu_cacheinfo_id(cpu, scope);
	case RESCTRL_NODE:
		return cpu_to_node(cpu);
	default:
	case RESCTRL_UNINITIALIZED_SCOPE:
		WARN_ON_ONCE(1);
		return -1;
	}

	return -1;
}

> 
> > +
> >  /**
> >   * struct rdt_resource - attributes of a resctrl resource
> >   * @rid:		The index of the resource
> >   * @alloc_capable:	Is allocation available on this machine
> >   * @mon_capable:	Is monitor feature available on this machine
> >   * @num_rmid:		Number of RMIDs available
> > - * @cache_level:	Which cache level defines scope of this resource
> > + * @scope:		Scope of this resource
> >   * @cache:		Cache allocation related data
> >   * @membw:		If the component has bandwidth controls, their properties.
> >   * @domains:		All domains for this resource
> > @@ -168,7 +173,7 @@ struct rdt_resource {
> >  	bool			alloc_capable;
> >  	bool			mon_capable;
> >  	int			num_rmid;
> > -	int			cache_level;
> > +	enum resctrl_scope	scope;
> >  	struct resctrl_cache	cache;
> >  	struct resctrl_membw	membw;
> >  	struct list_head	domains;
> > diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
> > index 030d3b409768..0d3bae523ecb 100644
> > --- a/arch/x86/kernel/cpu/resctrl/core.c
> > +++ b/arch/x86/kernel/cpu/resctrl/core.c
> > @@ -65,7 +65,7 @@ struct rdt_hw_resource rdt_resources_all[] = {
> >  		.r_resctrl = {
> >  			.rid			= RDT_RESOURCE_L3,
> >  			.name			= "L3",
> > -			.cache_level		= 3,
> > +			.scope			= RESCTRL_L3_CACHE,
> >  			.domains		= domain_init(RDT_RESOURCE_L3),
> >  			.parse_ctrlval		= parse_cbm,
> >  			.format_str		= "%d=%0*x",
> > @@ -79,7 +79,7 @@ struct rdt_hw_resource rdt_resources_all[] = {
> >  		.r_resctrl = {
> >  			.rid			= RDT_RESOURCE_L2,
> >  			.name			= "L2",
> > -			.cache_level		= 2,
> > +			.scope			= RESCTRL_L2_CACHE,
> >  			.domains		= domain_init(RDT_RESOURCE_L2),
> >  			.parse_ctrlval		= parse_cbm,
> >  			.format_str		= "%d=%0*x",
> > @@ -93,7 +93,7 @@ struct rdt_hw_resource rdt_resources_all[] = {
> >  		.r_resctrl = {
> >  			.rid			= RDT_RESOURCE_MBA,
> >  			.name			= "MB",
> > -			.cache_level		= 3,
> > +			.scope			= RESCTRL_L3_CACHE,
> >  			.domains		= domain_init(RDT_RESOURCE_MBA),
> >  			.parse_ctrlval		= parse_bw,
> >  			.format_str		= "%d=%*u",
> > @@ -105,7 +105,7 @@ struct rdt_hw_resource rdt_resources_all[] = {
> >  		.r_resctrl = {
> >  			.rid			= RDT_RESOURCE_SMBA,
> >  			.name			= "SMBA",
> > -			.cache_level		= 3,
> > +			.scope			= RESCTRL_L3_CACHE,
> >  			.domains		= domain_init(RDT_RESOURCE_SMBA),
> >  			.parse_ctrlval		= parse_bw,
> >  			.format_str		= "%d=%*u",
> > @@ -487,6 +487,21 @@ static int arch_domain_mbm_alloc(u32 num_rmid, struct rdt_hw_domain *hw_dom)
> >  	return 0;
> >  }
> >  
> > +static int get_domain_id_from_scope(int cpu, enum resctrl_scope scope)
> > +{
> > +	switch (scope) {
> > +	case RESCTRL_L3_CACHE:
> > +		return get_cpu_cacheinfo_id(cpu, 3);
> > +	case RESCTRL_L2_CACHE:
> > +		return get_cpu_cacheinfo_id(cpu, 2);
> > +	default:
> > +		WARN_ON_ONCE(1);
> > +		break;
> > +	}
> > +
> > +	return -1;
> > +}
> 
> Looking ahead at the next patch some members of rdt_resources_all[]
> are left with a default "0" initialization of mon_scope that is a
> valid scope of RESCTRL_L3_CACHE in this implementation that would
> not be caught with defensive code like above. For the above to catch
> a case like this I think that there should be some default
> initialization - but if you do move to something like you
> had in v3 then this may not be necessary. If the L2 scope is 2,
> L3 scope is 3, node scope is 4, then no initialization will be zero
> and the above default can more accurately catch failure cases.

See above (with a defensive enum constant that has the value 0).

> 
> > +
> >  /*
> >   * domain_add_cpu - Add a cpu to a resource's domain list.
> >   *
> > @@ -502,7 +517,7 @@ static int arch_domain_mbm_alloc(u32 num_rmid, struct rdt_hw_domain *hw_dom)
> >   */
> >  static void domain_add_cpu(int cpu, struct rdt_resource *r)
> >  {
> > -	int id = get_cpu_cacheinfo_id(cpu, r->cache_level);
> > +	int id = get_domain_id_from_scope(cpu, r->scope);
> >  	struct list_head *add_pos = NULL;
> >  	struct rdt_hw_domain *hw_dom;
> >  	struct rdt_domain *d;
> > @@ -552,7 +567,7 @@ static void domain_add_cpu(int cpu, struct rdt_resource *r)
> >  
> >  static void domain_remove_cpu(int cpu, struct rdt_resource *r)
> >  {
> > -	int id = get_cpu_cacheinfo_id(cpu, r->cache_level);
> > +	int id = get_domain_id_from_scope(cpu, r->scope);
> >  	struct rdt_hw_domain *hw_dom;
> >  	struct rdt_domain *d;
> >  
> > diff --git a/arch/x86/kernel/cpu/resctrl/pseudo_lock.c b/arch/x86/kernel/cpu/resctrl/pseudo_lock.c
> > index 458cb7419502..e79324676f57 100644
> > --- a/arch/x86/kernel/cpu/resctrl/pseudo_lock.c
> > +++ b/arch/x86/kernel/cpu/resctrl/pseudo_lock.c
> > @@ -279,6 +279,7 @@ static void pseudo_lock_region_clear(struct pseudo_lock_region *plr)
> >  static int pseudo_lock_region_init(struct pseudo_lock_region *plr)
> >  {
> >  	struct cpu_cacheinfo *ci;
> > +	int cache_level;
> >  	int ret;
> >  	int i;
> >  
> > @@ -296,8 +297,20 @@ static int pseudo_lock_region_init(struct pseudo_lock_region *plr)
> >  
> >  	plr->size = rdtgroup_cbm_to_size(plr->s->res, plr->d, plr->cbm);
> >  
> > +	switch (plr->s->res->scope) {
> > +	case RESCTRL_L3_CACHE:
> > +		cache_level = 3;
> > +		break;
> > +	case RESCTRL_L2_CACHE:
> > +		cache_level = 2;
> > +		break;
> > +	default:
> > +		WARN_ON_ONCE(1);
> > +		return -ENODEV;
> > +	}
> 
> I think this can be simplified without the indirection - a simplified
> WARN can just test for valid values of plr->s->res->scope directly
> (exit on failure) and then it can be used directly in the code below
> when the enum value corresponds to a cache level.

Is this what you want here?

	if (plr->s->res->scope != RESCTRL_L2_CACHE &&
	    plr->s->res->scope != RESCTRL_L3_CACHE) {
	    	WARN_ON_ONCE(1);
		return -ENODEV;
	}

> 
> > +
> >  	for (i = 0; i < ci->num_leaves; i++) {
> > -		if (ci->info_list[i].level == plr->s->res->cache_level) {
> > +		if (ci->info_list[i].level == cache_level) {

then drop this change to keep using plr->s->res->cache_level (and
delete the now unused local variable cache_level).

> >  			plr->line_size = ci->info_list[i].coherency_line_size;
> >  			return 0;
> >  		}
> > diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> > index 725344048f85..f510414bf6ce 100644
> > --- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> > +++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> > @@ -1343,12 +1343,25 @@ unsigned int rdtgroup_cbm_to_size(struct rdt_resource *r,
> >  {
> >  	struct cpu_cacheinfo *ci;
> >  	unsigned int size = 0;
> > +	int cache_level;
> >  	int num_b, i;
> >  
> > +	switch (r->scope) {
> > +	case RESCTRL_L3_CACHE:
> > +		cache_level = 3;
> > +		break;
> > +	case RESCTRL_L2_CACHE:
> > +		cache_level = 2;
> > +		break;
> > +	default:
> > +		WARN_ON_ONCE(1);
> > +		return size;
> > +	}
> > +
> 
> Same here.

If above it what you want, will do it here too.

> 
> >  	num_b = bitmap_weight(&cbm, r->cache.cbm_len);
> >  	ci = get_cpu_cacheinfo(cpumask_any(&d->cpu_mask));
> >  	for (i = 0; i < ci->num_leaves; i++) {
> > -		if (ci->info_list[i].level == r->cache_level) {
> > +		if (ci->info_list[i].level == cache_level) {
> >  			size = ci->info_list[i].size / r->cache.cbm_len * num_b;
> >  			break;
> >  		}
> 
> 
> Reinette

-Tony

^ permalink raw reply	[flat|nested] 331+ messages in thread

* Re: [PATCH v5 1/8] x86/resctrl: Prepare for new domain scope
  2023-09-26  0:56         ` Tony Luck
@ 2023-09-26 15:19           ` Reinette Chatre
  0 siblings, 0 replies; 331+ messages in thread
From: Reinette Chatre @ 2023-09-26 15:19 UTC (permalink / raw)
  To: Tony Luck
  Cc: Fenghua Yu, Peter Newman, Jonathan Corbet, Shuah Khan, x86,
	Shaopeng Tan, James Morse, Jamie Iles, Babu Moger, Randy Dunlap,
	linux-kernel, linux-doc, patches

Hi Tony,

On 9/25/2023 5:56 PM, Tony Luck wrote:
> On Mon, Sep 25, 2023 at 04:22:29PM -0700, Reinette Chatre wrote:
>> On 8/29/2023 4:44 PM, Tony Luck wrote:

...

>>> diff --git a/include/linux/resctrl.h b/include/linux/resctrl.h
>>> index 8334eeacfec5..2db1244ae642 100644
>>> --- a/include/linux/resctrl.h
>>> +++ b/include/linux/resctrl.h
>>> @@ -144,13 +144,18 @@ struct resctrl_membw {
>>>  struct rdt_parse_data;
>>>  struct resctrl_schema;
>>>  
>>> +enum resctrl_scope {
>>> +	RESCTRL_L3_CACHE,
>>> +	RESCTRL_L2_CACHE,
>>> +};
>>
>> I'm curious why L3 appears before L2? This is not a big deal but
>> I think the additional indirection that this introduces is
>> not necessary. As you had in an earlier version this could be
>> RESCTRL_L2_CACHE = 2 and then the value can just be used directly
>> (after ensuring it is used in a valid context).
> 
> I appear to have misunderstood your earlier comments. I thought you
> didn't like my use of RESCTRL_L2_CACHE = 2 and RESCTRL_L3_CACHE = 3
> so that they could be passed directly to get_cpu_cacheinfo_id().
> 
> But I see now the issue was "after ensuring it is used in a valid
> context". Are you looking for something like this:

Indeed. My concern with the earlier version was seeing a variable that
may contain any scope be used in a context where it can only represent
a cache level.

> 
> enum resctrl_scope {
> 	RESCTRL_UNINITIALIZED_SCOPE = 0,
> 	RESCTRL_L2_CACHE = 2,
> 	RESCTRL_L3_CACHE = 3,
> 	RESCTRL_L3_NODE = 4,
> };

I do not think RESCTRL_UNINITIALIZED_SCOPE is required (perhaps getting too
defensive?) but adding it would surely not do harm and indeed make it
obvious how the uninitialized case is handled. I am not familiar with
customs in this regard and would be ok either way. You can decide what
works best for you.

About the new name RESCTRL_L3_NODE: it is not clear to me why "L3" is
in the name.

> 
> static int get_domain_id_from_scope(int cpu, enum resctrl_scope scope)
> {
> 	switch (scope) {
> 	case RESCTRL_L2_CACHE:
> 	case RESCTRL_L3_CACHE:
> 		return get_cpu_cacheinfo_id(cpu, scope);
> 	case RESCTRL_NODE:
> 		return cpu_to_node(cpu);

(oh - maybe the earlier RESCTRL_L3_NODE was a typo?)

> 	default:
> 	case RESCTRL_UNINITIALIZED_SCOPE:
> 		WARN_ON_ONCE(1);
> 		return -1;
> 	}
> 
> 	return -1;
> }
> 

I'll leave it to you to decide if you want to use
RESCTRL_UNINITIALIZED_SCOPE. If you do decide to keep it
could you please let the "default" case be the last one?

>>
>>> +
>>>  /**
>>>   * struct rdt_resource - attributes of a resctrl resource
>>>   * @rid:		The index of the resource
>>>   * @alloc_capable:	Is allocation available on this machine
>>>   * @mon_capable:	Is monitor feature available on this machine
>>>   * @num_rmid:		Number of RMIDs available
>>> - * @cache_level:	Which cache level defines scope of this resource
>>> + * @scope:		Scope of this resource
>>>   * @cache:		Cache allocation related data
>>>   * @membw:		If the component has bandwidth controls, their properties.
>>>   * @domains:		All domains for this resource
>>> @@ -168,7 +173,7 @@ struct rdt_resource {
>>>  	bool			alloc_capable;
>>>  	bool			mon_capable;
>>>  	int			num_rmid;
>>> -	int			cache_level;
>>> +	enum resctrl_scope	scope;
>>>  	struct resctrl_cache	cache;
>>>  	struct resctrl_membw	membw;
>>>  	struct list_head	domains;
>>> diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
>>> index 030d3b409768..0d3bae523ecb 100644
>>> --- a/arch/x86/kernel/cpu/resctrl/core.c
>>> +++ b/arch/x86/kernel/cpu/resctrl/core.c
>>> @@ -65,7 +65,7 @@ struct rdt_hw_resource rdt_resources_all[] = {
>>>  		.r_resctrl = {
>>>  			.rid			= RDT_RESOURCE_L3,
>>>  			.name			= "L3",
>>> -			.cache_level		= 3,
>>> +			.scope			= RESCTRL_L3_CACHE,
>>>  			.domains		= domain_init(RDT_RESOURCE_L3),
>>>  			.parse_ctrlval		= parse_cbm,
>>>  			.format_str		= "%d=%0*x",
>>> @@ -79,7 +79,7 @@ struct rdt_hw_resource rdt_resources_all[] = {
>>>  		.r_resctrl = {
>>>  			.rid			= RDT_RESOURCE_L2,
>>>  			.name			= "L2",
>>> -			.cache_level		= 2,
>>> +			.scope			= RESCTRL_L2_CACHE,
>>>  			.domains		= domain_init(RDT_RESOURCE_L2),
>>>  			.parse_ctrlval		= parse_cbm,
>>>  			.format_str		= "%d=%0*x",
>>> @@ -93,7 +93,7 @@ struct rdt_hw_resource rdt_resources_all[] = {
>>>  		.r_resctrl = {
>>>  			.rid			= RDT_RESOURCE_MBA,
>>>  			.name			= "MB",
>>> -			.cache_level		= 3,
>>> +			.scope			= RESCTRL_L3_CACHE,
>>>  			.domains		= domain_init(RDT_RESOURCE_MBA),
>>>  			.parse_ctrlval		= parse_bw,
>>>  			.format_str		= "%d=%*u",
>>> @@ -105,7 +105,7 @@ struct rdt_hw_resource rdt_resources_all[] = {
>>>  		.r_resctrl = {
>>>  			.rid			= RDT_RESOURCE_SMBA,
>>>  			.name			= "SMBA",
>>> -			.cache_level		= 3,
>>> +			.scope			= RESCTRL_L3_CACHE,
>>>  			.domains		= domain_init(RDT_RESOURCE_SMBA),
>>>  			.parse_ctrlval		= parse_bw,
>>>  			.format_str		= "%d=%*u",
>>> @@ -487,6 +487,21 @@ static int arch_domain_mbm_alloc(u32 num_rmid, struct rdt_hw_domain *hw_dom)
>>>  	return 0;
>>>  }
>>>  
>>> +static int get_domain_id_from_scope(int cpu, enum resctrl_scope scope)
>>> +{
>>> +	switch (scope) {
>>> +	case RESCTRL_L3_CACHE:
>>> +		return get_cpu_cacheinfo_id(cpu, 3);
>>> +	case RESCTRL_L2_CACHE:
>>> +		return get_cpu_cacheinfo_id(cpu, 2);
>>> +	default:
>>> +		WARN_ON_ONCE(1);
>>> +		break;
>>> +	}
>>> +
>>> +	return -1;
>>> +}
>>
>> Looking ahead at the next patch some members of rdt_resources_all[]
>> are left with a default "0" initialization of mon_scope that is a
>> valid scope of RESCTRL_L3_CACHE in this implementation that would
>> not be caught with defensive code like above. For the above to catch
>> a case like this I think that there should be some default
>> initialization - but if you do move to something like you
>> had in v3 then this may not be necessary. If the L2 scope is 2,
>> L3 scope is 3, node scope is 4, then no initialization will be zero
>> and the above default can more accurately catch failure cases.
> 
> See above (with a defensive enum constant that has the value 0).
> 
>>
>>> +
>>>  /*
>>>   * domain_add_cpu - Add a cpu to a resource's domain list.
>>>   *
>>> @@ -502,7 +517,7 @@ static int arch_domain_mbm_alloc(u32 num_rmid, struct rdt_hw_domain *hw_dom)
>>>   */
>>>  static void domain_add_cpu(int cpu, struct rdt_resource *r)
>>>  {
>>> -	int id = get_cpu_cacheinfo_id(cpu, r->cache_level);
>>> +	int id = get_domain_id_from_scope(cpu, r->scope);
>>>  	struct list_head *add_pos = NULL;
>>>  	struct rdt_hw_domain *hw_dom;
>>>  	struct rdt_domain *d;
>>> @@ -552,7 +567,7 @@ static void domain_add_cpu(int cpu, struct rdt_resource *r)
>>>  
>>>  static void domain_remove_cpu(int cpu, struct rdt_resource *r)
>>>  {
>>> -	int id = get_cpu_cacheinfo_id(cpu, r->cache_level);
>>> +	int id = get_domain_id_from_scope(cpu, r->scope);
>>>  	struct rdt_hw_domain *hw_dom;
>>>  	struct rdt_domain *d;
>>>  
>>> diff --git a/arch/x86/kernel/cpu/resctrl/pseudo_lock.c b/arch/x86/kernel/cpu/resctrl/pseudo_lock.c
>>> index 458cb7419502..e79324676f57 100644
>>> --- a/arch/x86/kernel/cpu/resctrl/pseudo_lock.c
>>> +++ b/arch/x86/kernel/cpu/resctrl/pseudo_lock.c
>>> @@ -279,6 +279,7 @@ static void pseudo_lock_region_clear(struct pseudo_lock_region *plr)
>>>  static int pseudo_lock_region_init(struct pseudo_lock_region *plr)
>>>  {
>>>  	struct cpu_cacheinfo *ci;
>>> +	int cache_level;
>>>  	int ret;
>>>  	int i;
>>>  
>>> @@ -296,8 +297,20 @@ static int pseudo_lock_region_init(struct pseudo_lock_region *plr)
>>>  
>>>  	plr->size = rdtgroup_cbm_to_size(plr->s->res, plr->d, plr->cbm);
>>>  
>>> +	switch (plr->s->res->scope) {
>>> +	case RESCTRL_L3_CACHE:
>>> +		cache_level = 3;
>>> +		break;
>>> +	case RESCTRL_L2_CACHE:
>>> +		cache_level = 2;
>>> +		break;
>>> +	default:
>>> +		WARN_ON_ONCE(1);
>>> +		return -ENODEV;
>>> +	}
>>
>> I think this can be simplified without the indirection - a simplified
>> WARN can just test for valid values of plr->s->res->scope directly
>> (exit on failure) and then it can be used directly in the code below
>> when the enum value corresponds to a cache level.
> 
> Is this what you want here?
> 
> 	if (plr->s->res->scope != RESCTRL_L2_CACHE &&
> 	    plr->s->res->scope != RESCTRL_L3_CACHE) {
> 	    	WARN_ON_ONCE(1);
> 		return -ENODEV;
> 	}

Something like this, yes. It could be simplified more with a:

	/* Just to get line length shorter: */
	enum resctrl_scope scope = plr->s->res->scope; /* or cache_level? */

	if (WARN_ON_ONCE(scope != RESCTRL_L2_CACHE && scope != RESCTRL_L3_CACHE))
		return -ENODEV;

Reinette

^ permalink raw reply	[flat|nested] 331+ messages in thread

* Re: [PATCH v5 1/8] x86/resctrl: Prepare for new domain scope
  2023-08-29 23:44     ` [PATCH v5 1/8] x86/resctrl: Prepare for new domain scope Tony Luck
  2023-08-30  8:53       ` Maciej Wieczór-Retman
  2023-09-25 23:22       ` Reinette Chatre
@ 2023-09-26 16:36       ` Moger, Babu
  2023-09-26 17:18         ` Luck, Tony
  2 siblings, 1 reply; 331+ messages in thread
From: Moger, Babu @ 2023-09-26 16:36 UTC (permalink / raw)
  To: Tony Luck, Fenghua Yu, Reinette Chatre, Peter Newman,
	Jonathan Corbet, Shuah Khan, x86
  Cc: Shaopeng Tan, James Morse, Jamie Iles, Randy Dunlap,
	linux-kernel, linux-doc, patches

Hi Tony,

On 8/29/23 18:44, Tony Luck wrote:
> Legacy resctrl features operated on subsets of CPUs in the system with
> the defining attribute of each subset being an instance of a particular
> level of cache. E.g. all CPUs sharing an L3 cache would be part of the
> same domain.
> 
> In preparation for features that are scoped at the NUMA node level
> change the code from explicit references to "cache_level" to a more
> generic scope. At this point the only options for this scope are groups
> of CPUs that share an L2 cache or L3 cache.
> 
> No functional change.
> 
> Signed-off-by: Tony Luck <tony.luck@intel.com>
> ---
>  include/linux/resctrl.h                   |  9 ++++++--
>  arch/x86/kernel/cpu/resctrl/core.c        | 27 ++++++++++++++++++-----
>  arch/x86/kernel/cpu/resctrl/pseudo_lock.c | 15 ++++++++++++-
>  arch/x86/kernel/cpu/resctrl/rdtgroup.c    | 15 ++++++++++++-
>  4 files changed, 56 insertions(+), 10 deletions(-)
> 
> diff --git a/include/linux/resctrl.h b/include/linux/resctrl.h
> index 8334eeacfec5..2db1244ae642 100644
> --- a/include/linux/resctrl.h
> +++ b/include/linux/resctrl.h
> @@ -144,13 +144,18 @@ struct resctrl_membw {
>  struct rdt_parse_data;
>  struct resctrl_schema;
>  
> +enum resctrl_scope {
> +	RESCTRL_L3_CACHE,
> +	RESCTRL_L2_CACHE,
> +};

How about?

enum resctrl_scope {
       RESCTRL_L2_CACHE = 2,
       RESCTRL_L3_CACHE,
};

I think this will simplify a lot of your code below. You can directly use
scope as it is done now.


> +
>  /**
>   * struct rdt_resource - attributes of a resctrl resource
>   * @rid:		The index of the resource
>   * @alloc_capable:	Is allocation available on this machine
>   * @mon_capable:	Is monitor feature available on this machine
>   * @num_rmid:		Number of RMIDs available
> - * @cache_level:	Which cache level defines scope of this resource
> + * @scope:		Scope of this resource
>   * @cache:		Cache allocation related data
>   * @membw:		If the component has bandwidth controls, their properties.
>   * @domains:		All domains for this resource
> @@ -168,7 +173,7 @@ struct rdt_resource {
>  	bool			alloc_capable;
>  	bool			mon_capable;
>  	int			num_rmid;
> -	int			cache_level;
> +	enum resctrl_scope	scope;
>  	struct resctrl_cache	cache;
>  	struct resctrl_membw	membw;
>  	struct list_head	domains;
> diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
> index 030d3b409768..0d3bae523ecb 100644
> --- a/arch/x86/kernel/cpu/resctrl/core.c
> +++ b/arch/x86/kernel/cpu/resctrl/core.c
> @@ -65,7 +65,7 @@ struct rdt_hw_resource rdt_resources_all[] = {
>  		.r_resctrl = {
>  			.rid			= RDT_RESOURCE_L3,
>  			.name			= "L3",
> -			.cache_level		= 3,
> +			.scope			= RESCTRL_L3_CACHE,
>  			.domains		= domain_init(RDT_RESOURCE_L3),
>  			.parse_ctrlval		= parse_cbm,
>  			.format_str		= "%d=%0*x",
> @@ -79,7 +79,7 @@ struct rdt_hw_resource rdt_resources_all[] = {
>  		.r_resctrl = {
>  			.rid			= RDT_RESOURCE_L2,
>  			.name			= "L2",
> -			.cache_level		= 2,
> +			.scope			= RESCTRL_L2_CACHE,
>  			.domains		= domain_init(RDT_RESOURCE_L2),
>  			.parse_ctrlval		= parse_cbm,
>  			.format_str		= "%d=%0*x",
> @@ -93,7 +93,7 @@ struct rdt_hw_resource rdt_resources_all[] = {
>  		.r_resctrl = {
>  			.rid			= RDT_RESOURCE_MBA,
>  			.name			= "MB",
> -			.cache_level		= 3,
> +			.scope			= RESCTRL_L3_CACHE,
>  			.domains		= domain_init(RDT_RESOURCE_MBA),
>  			.parse_ctrlval		= parse_bw,
>  			.format_str		= "%d=%*u",
> @@ -105,7 +105,7 @@ struct rdt_hw_resource rdt_resources_all[] = {
>  		.r_resctrl = {
>  			.rid			= RDT_RESOURCE_SMBA,
>  			.name			= "SMBA",
> -			.cache_level		= 3,
> +			.scope			= RESCTRL_L3_CACHE,
>  			.domains		= domain_init(RDT_RESOURCE_SMBA),
>  			.parse_ctrlval		= parse_bw,
>  			.format_str		= "%d=%*u",
> @@ -487,6 +487,21 @@ static int arch_domain_mbm_alloc(u32 num_rmid, struct rdt_hw_domain *hw_dom)
>  	return 0;
>  }
>  
> +static int get_domain_id_from_scope(int cpu, enum resctrl_scope scope)
> +{
> +	switch (scope) {
> +	case RESCTRL_L3_CACHE:
> +		return get_cpu_cacheinfo_id(cpu, 3);
> +	case RESCTRL_L2_CACHE:
> +		return get_cpu_cacheinfo_id(cpu, 2);
> +	default:
> +		WARN_ON_ONCE(1);
> +		break;
> +	}
> +
> +	return -1;
> +}
> +

You may not need this code with new definition of resctrl_scope.

>  /*
>   * domain_add_cpu - Add a cpu to a resource's domain list.
>   *
> @@ -502,7 +517,7 @@ static int arch_domain_mbm_alloc(u32 num_rmid, struct rdt_hw_domain *hw_dom)
>   */
>  static void domain_add_cpu(int cpu, struct rdt_resource *r)
>  {
> -	int id = get_cpu_cacheinfo_id(cpu, r->cache_level);
> +	int id = get_domain_id_from_scope(cpu, r->scope);

You can directly call
        int id = get_cpu_cacheinfo_id(cpu, r->scope);

>  	struct list_head *add_pos = NULL;
>  	struct rdt_hw_domain *hw_dom;
>  	struct rdt_domain *d;
> @@ -552,7 +567,7 @@ static void domain_add_cpu(int cpu, struct rdt_resource *r)
>  
>  static void domain_remove_cpu(int cpu, struct rdt_resource *r)
>  {
> -	int id = get_cpu_cacheinfo_id(cpu, r->cache_level);
> +	int id = get_domain_id_from_scope(cpu, r->scope);
>  	struct rdt_hw_domain *hw_dom;
>  	struct rdt_domain *d;
>  
> diff --git a/arch/x86/kernel/cpu/resctrl/pseudo_lock.c b/arch/x86/kernel/cpu/resctrl/pseudo_lock.c
> index 458cb7419502..e79324676f57 100644
> --- a/arch/x86/kernel/cpu/resctrl/pseudo_lock.c
> +++ b/arch/x86/kernel/cpu/resctrl/pseudo_lock.c
> @@ -279,6 +279,7 @@ static void pseudo_lock_region_clear(struct pseudo_lock_region *plr)
>  static int pseudo_lock_region_init(struct pseudo_lock_region *plr)
>  {
>  	struct cpu_cacheinfo *ci;
> +	int cache_level;
>  	int ret;
>  	int i;
>  
> @@ -296,8 +297,20 @@ static int pseudo_lock_region_init(struct pseudo_lock_region *plr)
>  
>  	plr->size = rdtgroup_cbm_to_size(plr->s->res, plr->d, plr->cbm);
>  
> +	switch (plr->s->res->scope) {
> +	case RESCTRL_L3_CACHE:
> +		cache_level = 3;
> +		break;
> +	case RESCTRL_L2_CACHE:
> +		cache_level = 2;
> +		break;
> +	default:
> +		WARN_ON_ONCE(1);
> +		return -ENODEV;
> +	}

Again this may not be required.

> +
>  	for (i = 0; i < ci->num_leaves; i++) {
> -		if (ci->info_list[i].level == plr->s->res->cache_level) {
> +		if (ci->info_list[i].level == cache_level) {
>  			plr->line_size = ci->info_list[i].coherency_line_size;
>  			return 0;
>  		}
> diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> index 725344048f85..f510414bf6ce 100644
> --- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> +++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> @@ -1343,12 +1343,25 @@ unsigned int rdtgroup_cbm_to_size(struct rdt_resource *r,
>  {
>  	struct cpu_cacheinfo *ci;
>  	unsigned int size = 0;
> +	int cache_level;
>  	int num_b, i;
>  
> +	switch (r->scope) {
> +	case RESCTRL_L3_CACHE:
> +		cache_level = 3;
> +		break;
> +	case RESCTRL_L2_CACHE:
> +		cache_level = 2;
> +		break;
> +	default:
> +		WARN_ON_ONCE(1);
> +		return size;
> +	}

Again this may not be required.

> +
>  	num_b = bitmap_weight(&cbm, r->cache.cbm_len);
>  	ci = get_cpu_cacheinfo(cpumask_any(&d->cpu_mask));
>  	for (i = 0; i < ci->num_leaves; i++) {
> -		if (ci->info_list[i].level == r->cache_level) {
> +		if (ci->info_list[i].level == cache_level) {
>  			size = ci->info_list[i].size / r->cache.cbm_len * num_b;
>  			break;
>  		}

-- 
Thanks
Babu Moger

^ permalink raw reply	[flat|nested] 331+ messages in thread

* Re: [PATCH v5 2/8] x86/resctrl: Prepare for different scope for control/monitor operations
  2023-08-29 23:44     ` [PATCH v5 2/8] x86/resctrl: Prepare for different scope for control/monitor operations Tony Luck
  2023-09-25 23:24       ` Reinette Chatre
@ 2023-09-26 16:54       ` Moger, Babu
  2023-09-26 17:45         ` Luck, Tony
  1 sibling, 1 reply; 331+ messages in thread
From: Moger, Babu @ 2023-09-26 16:54 UTC (permalink / raw)
  To: Tony Luck, Fenghua Yu, Reinette Chatre, Peter Newman,
	Jonathan Corbet, Shuah Khan, x86
  Cc: Shaopeng Tan, James Morse, Jamie Iles, Randy Dunlap,
	linux-kernel, linux-doc, patches

Hi Tony,

On 8/29/23 18:44, Tony Luck wrote:
> Existing resctrl assumes that control and monitor operations on a
> resource are performed at the same scope.
> 
> Prepare for systems that use different scope (specifically L3 scope
> for cache control and NODE scope for cache occupancy and memory
> bandwidth monitoring).
> 
> Create separate domain lists for control and monitor operations.
> 
> No important functional change. But note that errors during

It's better to remove the line "No important functional change."

> initialization of either control or monitor functions on a domain would
> previously result in that domain being excluded from both control and
> monitor operations. Now the domains are allocated independently it is
> no longer required to disable both control and monitor operations if
> either fail.
> 
> Signed-off-by: Tony Luck <tony.luck@intel.com>
> ---
>  include/linux/resctrl.h                   |  16 +-
>  arch/x86/kernel/cpu/resctrl/internal.h    |   6 +-
>  arch/x86/kernel/cpu/resctrl/core.c        | 227 +++++++++++++++-------
>  arch/x86/kernel/cpu/resctrl/ctrlmondata.c |   2 +-
>  arch/x86/kernel/cpu/resctrl/monitor.c     |   2 +-
>  arch/x86/kernel/cpu/resctrl/pseudo_lock.c |   2 +-
>  arch/x86/kernel/cpu/resctrl/rdtgroup.c    |  32 +--
>  7 files changed, 199 insertions(+), 88 deletions(-)
> 
> diff --git a/include/linux/resctrl.h b/include/linux/resctrl.h
> index 2db1244ae642..33856943a787 100644
> --- a/include/linux/resctrl.h
> +++ b/include/linux/resctrl.h
> @@ -155,10 +155,12 @@ enum resctrl_scope {
>   * @alloc_capable:	Is allocation available on this machine
>   * @mon_capable:	Is monitor feature available on this machine
>   * @num_rmid:		Number of RMIDs available
> - * @scope:		Scope of this resource
> + * @ctrl_scope:		Scope of this resource for control functions
> + * @mon_scope:		Scope of this resource for monitor functions
>   * @cache:		Cache allocation related data
>   * @membw:		If the component has bandwidth controls, their properties.
> - * @domains:		All domains for this resource
> + * @domains:		Control domains for this resource
> + * @mon_domains:	Monitor domains for this resource
>   * @name:		Name to use in "schemata" file.
>   * @data_width:		Character width of data when displaying
>   * @default_ctrl:	Specifies default cache cbm or memory B/W percent.
> @@ -173,10 +175,12 @@ struct rdt_resource {
>  	bool			alloc_capable;
>  	bool			mon_capable;
>  	int			num_rmid;
> -	enum resctrl_scope	scope;
> +	enum resctrl_scope	ctrl_scope;
> +	enum resctrl_scope	mon_scope;
>  	struct resctrl_cache	cache;
>  	struct resctrl_membw	membw;
>  	struct list_head	domains;
> +	struct list_head	mondomains;

For consistency, it's better to rename it to mon_domains (in line with
mon_scope, mon_capable).

>  	char			*name;
>  	int			data_width;
>  	u32			default_ctrl;
> @@ -222,8 +226,10 @@ int resctrl_arch_update_one(struct rdt_resource *r, struct rdt_domain *d,
>  
>  u32 resctrl_arch_get_config(struct rdt_resource *r, struct rdt_domain *d,
>  			    u32 closid, enum resctrl_conf_type type);
> -int resctrl_online_domain(struct rdt_resource *r, struct rdt_domain *d);
> -void resctrl_offline_domain(struct rdt_resource *r, struct rdt_domain *d);
> +int resctrl_online_ctrl_domain(struct rdt_resource *r, struct rdt_domain *d);
> +int resctrl_online_mon_domain(struct rdt_resource *r, struct rdt_domain *d);
> +void resctrl_offline_ctrl_domain(struct rdt_resource *r, struct rdt_domain *d);
> +void resctrl_offline_mon_domain(struct rdt_resource *r, struct rdt_domain *d);
>  
>  /**
>   * resctrl_arch_rmid_read() - Read the eventid counter corresponding to rmid
> diff --git a/arch/x86/kernel/cpu/resctrl/internal.h b/arch/x86/kernel/cpu/resctrl/internal.h
> index 85ceaf9a31ac..31a5fc3b717f 100644
> --- a/arch/x86/kernel/cpu/resctrl/internal.h
> +++ b/arch/x86/kernel/cpu/resctrl/internal.h
> @@ -511,8 +511,10 @@ void rdtgroup_kn_unlock(struct kernfs_node *kn);
>  int rdtgroup_kn_mode_restrict(struct rdtgroup *r, const char *name);
>  int rdtgroup_kn_mode_restore(struct rdtgroup *r, const char *name,
>  			     umode_t mask);
> -struct rdt_domain *rdt_find_domain(struct rdt_resource *r, int id,
> -				   struct list_head **pos);
> +struct rdt_domain *rdt_find_ctrldomain(struct list_head *h, int id,
> +				       struct list_head **pos);
> +struct rdt_domain *rdt_find_mondomain(struct list_head *h, int id,
> +				      struct list_head **pos);

For consistency, it is better to rename these to rdt_find_ctrl_domain and
rdt_find_mon_domain respectively.


>  ssize_t rdtgroup_schemata_write(struct kernfs_open_file *of,
>  				char *buf, size_t nbytes, loff_t off);
>  int rdtgroup_schemata_show(struct kernfs_open_file *of,
> diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
> index 0d3bae523ecb..97f6f9715fdb 100644
> --- a/arch/x86/kernel/cpu/resctrl/core.c
> +++ b/arch/x86/kernel/cpu/resctrl/core.c
> @@ -57,7 +57,7 @@ static void
>  mba_wrmsr_amd(struct rdt_domain *d, struct msr_param *m,
>  	      struct rdt_resource *r);
>  
> -#define domain_init(id) LIST_HEAD_INIT(rdt_resources_all[id].r_resctrl.domains)
> +#define domain_init(id, field) LIST_HEAD_INIT(rdt_resources_all[id].r_resctrl.field)
>  
>  struct rdt_hw_resource rdt_resources_all[] = {
>  	[RDT_RESOURCE_L3] =
> @@ -65,8 +65,10 @@ struct rdt_hw_resource rdt_resources_all[] = {
>  		.r_resctrl = {
>  			.rid			= RDT_RESOURCE_L3,
>  			.name			= "L3",
> -			.scope			= RESCTRL_L3_CACHE,
> -			.domains		= domain_init(RDT_RESOURCE_L3),
> +			.ctrl_scope		= RESCTRL_L3_CACHE,
> +			.mon_scope		= RESCTRL_L3_CACHE,
> +			.domains		= domain_init(RDT_RESOURCE_L3, domains),
> +			.mondomains		= domain_init(RDT_RESOURCE_L3, mondomains),
>  			.parse_ctrlval		= parse_cbm,
>  			.format_str		= "%d=%0*x",
>  			.fflags			= RFTYPE_RES_CACHE,
> @@ -79,8 +81,8 @@ struct rdt_hw_resource rdt_resources_all[] = {
>  		.r_resctrl = {
>  			.rid			= RDT_RESOURCE_L2,
>  			.name			= "L2",
> -			.scope			= RESCTRL_L2_CACHE,
> -			.domains		= domain_init(RDT_RESOURCE_L2),
> +			.ctrl_scope		= RESCTRL_L2_CACHE,
> +			.domains		= domain_init(RDT_RESOURCE_L2, domains),
>  			.parse_ctrlval		= parse_cbm,
>  			.format_str		= "%d=%0*x",
>  			.fflags			= RFTYPE_RES_CACHE,
> @@ -93,8 +95,8 @@ struct rdt_hw_resource rdt_resources_all[] = {
>  		.r_resctrl = {
>  			.rid			= RDT_RESOURCE_MBA,
>  			.name			= "MB",
> -			.scope			= RESCTRL_L3_CACHE,
> -			.domains		= domain_init(RDT_RESOURCE_MBA),
> +			.ctrl_scope		= RESCTRL_L3_CACHE,
> +			.domains		= domain_init(RDT_RESOURCE_MBA, domains),
>  			.parse_ctrlval		= parse_bw,
>  			.format_str		= "%d=%*u",
>  			.fflags			= RFTYPE_RES_MB,
> @@ -105,8 +107,8 @@ struct rdt_hw_resource rdt_resources_all[] = {
>  		.r_resctrl = {
>  			.rid			= RDT_RESOURCE_SMBA,
>  			.name			= "SMBA",
> -			.scope			= RESCTRL_L3_CACHE,
> -			.domains		= domain_init(RDT_RESOURCE_SMBA),
> +			.ctrl_scope		= RESCTRL_L3_CACHE,
> +			.domains		= domain_init(RDT_RESOURCE_SMBA, domains),
>  			.parse_ctrlval		= parse_bw,
>  			.format_str		= "%d=%*u",
>  			.fflags			= RFTYPE_RES_MB,
> @@ -384,15 +386,16 @@ void rdt_ctrl_update(void *arg)
>  }
>  
>  /*
> - * rdt_find_domain - Find a domain in a resource that matches input resource id
> + * __rdt_find_domain - Find a domain in either the list of control or
> + * monitor domains that matches input resource id
>   *
>   * Search resource r's domain list to find the resource id. If the resource
>   * id is found in a domain, return the domain. Otherwise, if requested by
>   * caller, return the first domain whose id is bigger than the input id.
>   * The domain list is sorted by id in ascending order.
>   */
> -struct rdt_domain *rdt_find_domain(struct rdt_resource *r, int id,
> -				   struct list_head **pos)
> +static void *__rdt_find_domain(struct list_head *h, int id,
> +			       struct list_head **pos)
>  {
>  	struct rdt_domain *d;
>  	struct list_head *l;
> @@ -400,7 +403,7 @@ struct rdt_domain *rdt_find_domain(struct rdt_resource *r, int id,
>  	if (id < 0)
>  		return ERR_PTR(-ENODEV);
>  
> -	list_for_each(l, &r->domains) {
> +	list_for_each(l, h) {
>  		d = list_entry(l, struct rdt_domain, list);
>  		/* When id is found, return its domain. */
>  		if (id == d->id)
> @@ -416,6 +419,18 @@ struct rdt_domain *rdt_find_domain(struct rdt_resource *r, int id,
>  	return NULL;
>  }
>  
> +struct rdt_domain *rdt_find_ctrldomain(struct list_head *h, int id,
> +				       struct list_head **pos)
> +{
> +	return __rdt_find_domain(h, id, pos);
> +}
> +
> +struct rdt_domain *rdt_find_mondomain(struct list_head *h, int id,
> +				      struct list_head **pos)
> +{
> +	return __rdt_find_domain(h, id, pos);
> +}
> +
>  static void setup_default_ctrlval(struct rdt_resource *r, u32 *dc)
>  {
>  	struct rdt_hw_resource *hw_res = resctrl_to_arch_res(r);
> @@ -431,10 +446,15 @@ static void setup_default_ctrlval(struct rdt_resource *r, u32 *dc)
>  }
>  
>  static void domain_free(struct rdt_hw_domain *hw_dom)
> +{
> +	kfree(hw_dom->ctrl_val);
> +	kfree(hw_dom);
> +}
> +
> +static void mondomain_free(struct rdt_hw_domain *hw_dom)

It's better to rename this to mon_domain_free.

>  {
>  	kfree(hw_dom->arch_mbm_total);
>  	kfree(hw_dom->arch_mbm_local);
> -	kfree(hw_dom->ctrl_val);
>  	kfree(hw_dom);
>  }
>  
> @@ -502,6 +522,93 @@ static int get_domain_id_from_scope(int cpu, enum resctrl_scope scope)
>  	return -1;
>  }
>  
> +static void domain_add_cpu_ctrl(int cpu, struct rdt_resource *r)
> +{
> +	int id = get_domain_id_from_scope(cpu, r->ctrl_scope);
> +	struct list_head *add_pos = NULL;
> +	struct rdt_hw_domain *hw_dom;
> +	struct rdt_domain *d;
> +	int err;
> +
> +	d = rdt_find_ctrldomain(&r->domains, id, &add_pos);
> +	if (IS_ERR(d)) {
> +		pr_warn("Couldn't find scope id=%d for CPU %d\n", id, cpu);
> +		return;
> +	}
> +
> +	if (d) {
> +		cpumask_set_cpu(cpu, &d->cpu_mask);
> +		if (r->cache.arch_has_per_cpu_cfg)
> +			rdt_domain_reconfigure_cdp(r);
> +		return;
> +	}
> +
> +	hw_dom = kzalloc_node(sizeof(*hw_dom), GFP_KERNEL, cpu_to_node(cpu));
> +	if (!hw_dom)
> +		return;
> +
> +	d = &hw_dom->d_resctrl;
> +	d->id = id;
> +	cpumask_set_cpu(cpu, &d->cpu_mask);
> +
> +	rdt_domain_reconfigure_cdp(r);
> +
> +	if (domain_setup_ctrlval(r, d)) {
> +		domain_free(hw_dom);
> +		return;
> +	}
> +
> +	list_add_tail(&d->list, add_pos);
> +
> +	err = resctrl_online_ctrl_domain(r, d);
> +	if (err) {
> +		list_del(&d->list);
> +		domain_free(hw_dom);
> +	}
> +}
> +
> +static void domain_add_cpu_mon(int cpu, struct rdt_resource *r)
> +{
> +	int id = get_domain_id_from_scope(cpu, r->mon_scope);
> +	struct rdt_hw_domain *hw_mondom;
> +	struct list_head *add_pos = NULL;
> +	struct rdt_domain *d;
> +	int err;
> +
> +	d = rdt_find_mondomain(&r->mondomains, id, &add_pos);
> +	if (IS_ERR(d)) {
> +		pr_warn("Couldn't find scope id=%d for CPU %d\n", id, cpu);
> +		return;
> +	}
> +
> +	if (d) {
> +		cpumask_set_cpu(cpu, &d->cpu_mask);
> +
> +		return;
> +	}
> +
> +	hw_mondom = kzalloc_node(sizeof(*hw_mondom), GFP_KERNEL, cpu_to_node(cpu));
> +	if (!hw_mondom)
> +		return;
> +
> +	d = &hw_mondom->d_resctrl;
> +	d->id = id;
> +	cpumask_set_cpu(cpu, &d->cpu_mask);
> +
> +	if (arch_domain_mbm_alloc(r->num_rmid, hw_mondom)) {
> +		mondomain_free(hw_mondom);
> +		return;
> +	}
> +
> +	list_add_tail(&d->list, add_pos);
> +
> +	err = resctrl_online_mon_domain(r, d);
> +	if (err) {
> +		list_del(&d->list);
> +		mondomain_free(hw_mondom);
> +	}
> +}
> +
>  /*
>   * domain_add_cpu - Add a cpu to a resource's domain list.
>   *
> @@ -517,70 +624,28 @@ static int get_domain_id_from_scope(int cpu, enum resctrl_scope scope)
>   */
>  static void domain_add_cpu(int cpu, struct rdt_resource *r)
>  {
> -	int id = get_domain_id_from_scope(cpu, r->scope);
> -	struct list_head *add_pos = NULL;
> -	struct rdt_hw_domain *hw_dom;
> -	struct rdt_domain *d;
> -	int err;
> -
> -	d = rdt_find_domain(r, id, &add_pos);
> -	if (IS_ERR(d)) {
> -		pr_warn("Couldn't find cache id for CPU %d\n", cpu);
> -		return;
> -	}
> -
> -	if (d) {
> -		cpumask_set_cpu(cpu, &d->cpu_mask);
> -		if (r->cache.arch_has_per_cpu_cfg)
> -			rdt_domain_reconfigure_cdp(r);
> -		return;
> -	}
> -
> -	hw_dom = kzalloc_node(sizeof(*hw_dom), GFP_KERNEL, cpu_to_node(cpu));
> -	if (!hw_dom)
> -		return;
> -
> -	d = &hw_dom->d_resctrl;
> -	d->id = id;
> -	cpumask_set_cpu(cpu, &d->cpu_mask);
> -
> -	rdt_domain_reconfigure_cdp(r);
> -
> -	if (r->alloc_capable && domain_setup_ctrlval(r, d)) {
> -		domain_free(hw_dom);
> -		return;
> -	}
> -
> -	if (r->mon_capable && arch_domain_mbm_alloc(r->num_rmid, hw_dom)) {
> -		domain_free(hw_dom);
> -		return;
> -	}
> -
> -	list_add_tail(&d->list, add_pos);
> -
> -	err = resctrl_online_domain(r, d);
> -	if (err) {
> -		list_del(&d->list);
> -		domain_free(hw_dom);
> -	}
> +	if (r->alloc_capable)
> +		domain_add_cpu_ctrl(cpu, r);
> +	if (r->mon_capable)
> +		domain_add_cpu_mon(cpu, r);
>  }
>  
> -static void domain_remove_cpu(int cpu, struct rdt_resource *r)
> +static void domain_remove_cpu_ctrl(int cpu, struct rdt_resource *r)
>  {
> -	int id = get_domain_id_from_scope(cpu, r->scope);
> +	int id = get_domain_id_from_scope(cpu, r->ctrl_scope);
>  	struct rdt_hw_domain *hw_dom;
>  	struct rdt_domain *d;
>  
> -	d = rdt_find_domain(r, id, NULL);
> +	d = rdt_find_ctrldomain(&r->domains, id, NULL);
>  	if (IS_ERR_OR_NULL(d)) {
> -		pr_warn("Couldn't find cache id for CPU %d\n", cpu);
> +		pr_warn("Couldn't find scope id=%d for CPU %d\n", id, cpu);
>  		return;
>  	}
>  	hw_dom = resctrl_to_arch_dom(d);
>  
>  	cpumask_clear_cpu(cpu, &d->cpu_mask);
>  	if (cpumask_empty(&d->cpu_mask)) {
> -		resctrl_offline_domain(r, d);
> +		resctrl_offline_ctrl_domain(r, d);
>  		list_del(&d->list);
>  
>  		/*
> @@ -593,6 +658,30 @@ static void domain_remove_cpu(int cpu, struct rdt_resource *r)
>  
>  		return;
>  	}
> +}
> +
> +static void domain_remove_cpu_mon(int cpu, struct rdt_resource *r)
> +{
> +	int id = get_domain_id_from_scope(cpu, r->mon_scope);
> +	struct rdt_hw_domain *hw_mondom;
> +	struct rdt_domain *d;
> +
> +	d = rdt_find_mondomain(&r->mondomains, id, NULL);
> +	if (IS_ERR_OR_NULL(d)) {
> +		pr_warn("Couldn't find scope id=%d for CPU %d\n", id, cpu);
> +		return;
> +	}
> +	hw_mondom = resctrl_to_arch_dom(d);
> +
> +	cpumask_clear_cpu(cpu, &d->cpu_mask);
> +	if (cpumask_empty(&d->cpu_mask)) {
> +		resctrl_offline_mon_domain(r, d);
> +		list_del(&d->list);
> +
> +		mondomain_free(hw_mondom);
> +
> +		return;
> +	}
>  
>  	if (r == &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl) {
>  		if (is_mbm_enabled() && cpu == d->mbm_work_cpu) {
> @@ -607,6 +696,14 @@ static void domain_remove_cpu(int cpu, struct rdt_resource *r)
>  	}
>  }
>  
> +static void domain_remove_cpu(int cpu, struct rdt_resource *r)
> +{
> +	if (r->alloc_capable)
> +		domain_remove_cpu_ctrl(cpu, r);
> +	if (r->mon_capable)
> +		domain_remove_cpu_mon(cpu, r);
> +}
> +
>  static void clear_closid_rmid(int cpu)
>  {
>  	struct resctrl_pqr_state *state = this_cpu_ptr(&pqr_state);
> diff --git a/arch/x86/kernel/cpu/resctrl/ctrlmondata.c b/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
> index b44c487727d4..468c1815edfd 100644
> --- a/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
> +++ b/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
> @@ -560,7 +560,7 @@ int rdtgroup_mondata_show(struct seq_file *m, void *arg)
>  	evtid = md.u.evtid;
>  
>  	r = &rdt_resources_all[resid].r_resctrl;
> -	d = rdt_find_domain(r, domid, NULL);
> +	d = rdt_find_mondomain(&r->mondomains, domid, NULL);
>  	if (IS_ERR_OR_NULL(d)) {
>  		ret = -ENOENT;
>  		goto out;
> diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c
> index ded1fc7cb7cb..66beca785535 100644
> --- a/arch/x86/kernel/cpu/resctrl/monitor.c
> +++ b/arch/x86/kernel/cpu/resctrl/monitor.c
> @@ -340,7 +340,7 @@ static void add_rmid_to_limbo(struct rmid_entry *entry)
>  
>  	entry->busy = 0;
>  	cpu = get_cpu();
> -	list_for_each_entry(d, &r->domains, list) {
> +	list_for_each_entry(d, &r->mondomains, list) {
>  		if (cpumask_test_cpu(cpu, &d->cpu_mask)) {
>  			err = resctrl_arch_rmid_read(r, d, entry->rmid,
>  						     QOS_L3_OCCUP_EVENT_ID,
> diff --git a/arch/x86/kernel/cpu/resctrl/pseudo_lock.c b/arch/x86/kernel/cpu/resctrl/pseudo_lock.c
> index e79324676f57..be8b5f28e638 100644
> --- a/arch/x86/kernel/cpu/resctrl/pseudo_lock.c
> +++ b/arch/x86/kernel/cpu/resctrl/pseudo_lock.c
> @@ -297,7 +297,7 @@ static int pseudo_lock_region_init(struct pseudo_lock_region *plr)
>  
>  	plr->size = rdtgroup_cbm_to_size(plr->s->res, plr->d, plr->cbm);
>  
> -	switch (plr->s->res->scope) {
> +	switch (plr->s->res->ctrl_scope) {
>  	case RESCTRL_L3_CACHE:
>  		cache_level = 3;
>  		break;
> diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> index f510414bf6ce..f2aec39c49df 100644
> --- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> +++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> @@ -1346,7 +1346,7 @@ unsigned int rdtgroup_cbm_to_size(struct rdt_resource *r,
>  	int cache_level;
>  	int num_b, i;
>  
> -	switch (r->scope) {
> +	switch (r->ctrl_scope) {
>  	case RESCTRL_L3_CACHE:
>  		cache_level = 3;
>  		break;
> @@ -1509,7 +1509,7 @@ static int mbm_config_show(struct seq_file *s, struct rdt_resource *r, u32 evtid
>  
>  	mutex_lock(&rdtgroup_mutex);
>  
> -	list_for_each_entry(dom, &r->domains, list) {
> +	list_for_each_entry(dom, &r->mondomains, list) {
>  		if (sep)
>  			seq_puts(s, ";");
>  
> @@ -1632,7 +1632,7 @@ static int mon_config_write(struct rdt_resource *r, char *tok, u32 evtid)
>  		return -EINVAL;
>  	}
>  
> -	list_for_each_entry(d, &r->domains, list) {
> +	list_for_each_entry(d, &r->mondomains, list) {
>  		if (d->id == dom_id) {
>  			ret = mbm_config_write_domain(r, d, evtid, val);
>  			if (ret)
> @@ -2538,7 +2538,7 @@ static int rdt_get_tree(struct fs_context *fc)
>  
>  	if (is_mbm_enabled()) {
>  		r = &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl;
> -		list_for_each_entry(dom, &r->domains, list)
> +		list_for_each_entry(dom, &r->mondomains, list)
>  			mbm_setup_overflow_handler(dom, MBM_OVERFLOW_INTERVAL);
>  	}
>  
> @@ -2932,7 +2932,7 @@ static int mkdir_mondata_subdir_alldom(struct kernfs_node *parent_kn,
>  	struct rdt_domain *dom;
>  	int ret;
>  
> -	list_for_each_entry(dom, &r->domains, list) {
> +	list_for_each_entry(dom, &r->mondomains, list) {
>  		ret = mkdir_mondata_subdir(parent_kn, dom, r, prgrp);
>  		if (ret)
>  			return ret;
> @@ -3721,15 +3721,17 @@ static void domain_destroy_mon_state(struct rdt_domain *d)
>  	kfree(d->mbm_local);
>  }
>  
> -void resctrl_offline_domain(struct rdt_resource *r, struct rdt_domain *d)
> +void resctrl_offline_ctrl_domain(struct rdt_resource *r, struct rdt_domain *d)
>  {
>  	lockdep_assert_held(&rdtgroup_mutex);
>  
>  	if (supports_mba_mbps() && r->rid == RDT_RESOURCE_MBA)
>  		mba_sc_domain_destroy(r, d);
> +}
>  
> -	if (!r->mon_capable)
> -		return;
> +void resctrl_offline_mon_domain(struct rdt_resource *r, struct rdt_domain *d)
> +{
> +	lockdep_assert_held(&rdtgroup_mutex);
>  
>  	/*
>  	 * If resctrl is mounted, remove all the
> @@ -3786,18 +3788,22 @@ static int domain_setup_mon_state(struct rdt_resource *r, struct rdt_domain *d)
>  	return 0;
>  }
>  
> -int resctrl_online_domain(struct rdt_resource *r, struct rdt_domain *d)
> +int resctrl_online_ctrl_domain(struct rdt_resource *r, struct rdt_domain *d)
>  {
> -	int err;
> -
>  	lockdep_assert_held(&rdtgroup_mutex);
>  
>  	if (supports_mba_mbps() && r->rid == RDT_RESOURCE_MBA)
>  		/* RDT_RESOURCE_MBA is never mon_capable */
>  		return mba_sc_domain_allocate(r, d);
>  
> -	if (!r->mon_capable)
> -		return 0;
> +	return 0;
> +}
> +
> +int resctrl_online_mon_domain(struct rdt_resource *r, struct rdt_domain *d)
> +{
> +	int err;
> +
> +	lockdep_assert_held(&rdtgroup_mutex);
>  
>  	err = domain_setup_mon_state(r, d);
>  	if (err)

-- 
Thanks
Babu Moger

^ permalink raw reply	[flat|nested] 331+ messages in thread

* RE: [PATCH v5 1/8] x86/resctrl: Prepare for new domain scope
  2023-09-26 16:36       ` Moger, Babu
@ 2023-09-26 17:18         ` Luck, Tony
  0 siblings, 0 replies; 331+ messages in thread
From: Luck, Tony @ 2023-09-26 17:18 UTC (permalink / raw)
  To: babu.moger, Yu, Fenghua, Chatre, Reinette, Peter Newman,
	Jonathan Corbet, Shuah Khan, x86
  Cc: Shaopeng Tan, James Morse, Jamie Iles, Randy Dunlap,
	linux-kernel, linux-doc, patches

>> +enum resctrl_scope {
>> +	RESCTRL_L3_CACHE,
>> +	RESCTRL_L2_CACHE,
>> +};
>
> How about?
>
> enum resctrl_scope {
>       RESCTRL_L2_CACHE = 2,
>       RESCTRL_L3_CACHE,
> };

Babu. Thanks for the review.

Reinette made the same observation. I'm updating the
patch to do this, with a small extra defensive step of explicitly defining

	RESCTRL_L3_CACHE = 3,

rather than relying on the compiler picking the next integer ... just in
case somebody adds another entry between the L2 and L3 lines.
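So the enum would end up along these lines (just a sketch - the exact
name for the node scope is still being discussed in the other
sub-thread):

enum resctrl_scope {
	RESCTRL_L2_CACHE = 2,
	RESCTRL_L3_CACHE = 3,
	RESCTRL_NODE = 4,
};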

-Tony


^ permalink raw reply	[flat|nested] 331+ messages in thread

* RE: [PATCH v5 2/8] x86/resctrl: Prepare for different scope for control/monitor operations
  2023-09-26 16:54       ` Moger, Babu
@ 2023-09-26 17:45         ` Luck, Tony
  0 siblings, 0 replies; 331+ messages in thread
From: Luck, Tony @ 2023-09-26 17:45 UTC (permalink / raw)
  To: babu.moger, Yu, Fenghua, Chatre, Reinette, Peter Newman,
	Jonathan Corbet, Shuah Khan, x86
  Cc: Shaopeng Tan, James Morse, Jamie Iles, Randy Dunlap,
	linux-kernel, linux-doc, patches

>> Create separate domain lists for control and monitor operations.
>> 
>> No important functional change. But note that errors during
>
> It's better to remove the line "No important functional change."

Will drop it.

>>  	struct list_head	domains;
>> +	struct list_head	mondomains;
>
> For consistency, it's better to rename it to mon_domains (in line with
> mon_scope, mon_capable).

Again, you are thinking like Reinette! You'll be a resctrl maintainer soon.

>> +struct rdt_domain *rdt_find_ctrldomain(struct list_head *h, int id,
>> +				       struct list_head **pos);
>> +struct rdt_domain *rdt_find_mondomain(struct list_head *h, int id,
>> +				      struct list_head **pos);
>
> For consistency, it is better to rename these to rdt_find_ctrl_domain and
> rdt_find_mon_domain respectively.

Agreed.

Thanks for the review.

-Tony

^ permalink raw reply	[flat|nested] 331+ messages in thread

* Re: [PATCH v5 3/8] x86/resctrl: Split the rdt_domain structure
  2023-09-25 23:25       ` Reinette Chatre
@ 2023-09-26 18:46         ` Tony Luck
  2023-09-26 22:00           ` Reinette Chatre
  2023-09-28 17:33         ` Tony Luck
  1 sibling, 1 reply; 331+ messages in thread
From: Tony Luck @ 2023-09-26 18:46 UTC (permalink / raw)
  To: Reinette Chatre
  Cc: Fenghua Yu, Peter Newman, Jonathan Corbet, Shuah Khan, x86,
	Shaopeng Tan, James Morse, Jamie Iles, Babu Moger, Randy Dunlap,
	linux-kernel, linux-doc, patches

On Mon, Sep 25, 2023 at 04:25:15PM -0700, Reinette Chatre wrote:
> > +struct rdt_domain {
> > +	// First three fields must match struct rdt_mondomain below.
> 
> Please avoid comments within declarations. Even so, could you please
> elaborate what the above means? Why do the first three fields have to
> match? I understand there is common code, for example, __rdt_find_domain()
> that operated on the same members of the two structs but does that
> require the members be in the same position in the struct?
> I understand that a comment may be required if position in the struct
> is important but I cannot see that it is.

[Just replying to this one point in your message to get guidance. I'll
address all the rest in other replies]

I'm wrong about the first *three* fields ... but the first *two* fields
(the "list" and the "id") do need to be at the same offsets in different
structures if a common routine is going to be used to access those
fields.

If the "id" were at offset 0x10 in the control version of the domain
structure, and at offset 0x20 in the monitor version of the domain
structure, there would be no hope for a common routine to access the
"id" field when searching a list that could be either control or
monitor domains.
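(For illustration: the layout requirement could at least be made
explicit with compile-time checks in __rdt_find_domain(), e.g. this
untested sketch using the struct names from this series:

	BUILD_BUG_ON(offsetof(struct rdt_domain, list) !=
		     offsetof(struct rdt_mondomain, list));
	BUILD_BUG_ON(offsetof(struct rdt_domain, id) !=
		     offsetof(struct rdt_mondomain, id));

but that only documents the constraint, it doesn't remove it.)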

I'm looking at making this far more explicit with a new patch between
0001 and 0002 that pulls the two fields into a common substructure that
will be included in each of the control and monitor versions of the
structure.

Patch included below.

But this seems like a lot of churn to avoid having separate
functions to search the control and monitor lists. Each would be a clone
of the existing ~24 line rdt_find_domain() with just the type changed
for the return value and the list traversal.

What do you think?

-Tony

commit 08992b4be1f53a3144f8aadd821f815a40a05e75
Author: Tony Luck <tony.luck@intel.com>
Date:   Tue Sep 26 11:07:12 2023 -0700

    x86/resctrl: Prepare to split rdt_domain structure
    
    The rdt_domain structure is used for both control and monitor features.
    It is about to be split into separate structures for these two usages
    because the scope for control and monitoring features for a resource
    will be different for future resources.
    
    To allow for common code that scans a list of domains looking for a
    specific domain id, move the "list" and "id" fields into their own
    structure within the rdt_domain structure.
    
    Signed-off-by: Tony Luck <tony.luck@intel.com>

diff --git a/include/linux/resctrl.h b/include/linux/resctrl.h
index 618735e396cb..a583fa88ea5a 100644
--- a/include/linux/resctrl.h
+++ b/include/linux/resctrl.h
@@ -53,9 +53,18 @@ struct resctrl_staged_config {
 };
 
 /**
- * struct rdt_domain - group of CPUs sharing a resctrl resource
+ * struct rdt_domain_hdr - common header for different domain types
  * @list:		all instances of this resource
  * @id:			unique id for this instance
+ */
+struct rdt_domain_hdr {
+	struct list_head		list;
+	int				id;
+};
+
+/**
+ * struct rdt_domain - group of CPUs sharing a resctrl resource
+ * @hdr:		common header for different domain types
  * @cpu_mask:		which CPUs share this resource
  * @rmid_busy_llc:	bitmap of which limbo RMIDs are above threshold
  * @mbm_total:		saved state for MBM total bandwidth
@@ -71,8 +80,7 @@ struct resctrl_staged_config {
  *			by closid
  */
 struct rdt_domain {
-	struct list_head		list;
-	int				id;
+	struct rdt_domain_hdr		hdr;
 	struct cpumask			cpu_mask;
 	unsigned long			*rmid_busy_llc;
 	struct mbm_state		*mbm_total;
diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
index 3b1837e1fb6b..05369add4578 100644
--- a/arch/x86/kernel/cpu/resctrl/core.c
+++ b/arch/x86/kernel/cpu/resctrl/core.c
@@ -352,7 +352,7 @@ struct rdt_domain *get_domain_from_cpu(int cpu, struct rdt_resource *r)
 {
 	struct rdt_domain *d;
 
-	list_for_each_entry(d, &r->domains, list) {
+	list_for_each_entry(d, &r->domains, hdr.list) {
 		/* Find the domain that contains this CPU */
 		if (cpumask_test_cpu(cpu, &d->cpu_mask))
 			return d;
@@ -401,12 +401,12 @@ struct rdt_domain *rdt_find_domain(struct rdt_resource *r, int id,
 		return ERR_PTR(-ENODEV);
 
 	list_for_each(l, &r->domains) {
-		d = list_entry(l, struct rdt_domain, list);
+		d = list_entry(l, struct rdt_domain, hdr.list);
 		/* When id is found, return its domain. */
-		if (id == d->id)
+		if (id == d->hdr.id)
 			return d;
 		/* Stop searching when finding id's position in sorted list. */
-		if (id < d->id)
+		if (id < d->hdr.id)
 			break;
 	}
 
@@ -544,7 +544,7 @@ static void domain_add_cpu(int cpu, struct rdt_resource *r)
 		return;
 
 	d = &hw_dom->d_resctrl;
-	d->id = id;
+	d->hdr.id = id;
 	cpumask_set_cpu(cpu, &d->cpu_mask);
 
 	rdt_domain_reconfigure_cdp(r);
@@ -559,11 +559,11 @@ static void domain_add_cpu(int cpu, struct rdt_resource *r)
 		return;
 	}
 
-	list_add_tail(&d->list, add_pos);
+	list_add_tail(&d->hdr.list, add_pos);
 
 	err = resctrl_online_domain(r, d);
 	if (err) {
-		list_del(&d->list);
+		list_del(&d->hdr.list);
 		domain_free(hw_dom);
 	}
 }
@@ -587,7 +587,7 @@ static void domain_remove_cpu(int cpu, struct rdt_resource *r)
 	cpumask_clear_cpu(cpu, &d->cpu_mask);
 	if (cpumask_empty(&d->cpu_mask)) {
 		resctrl_offline_domain(r, d);
-		list_del(&d->list);
+		list_del(&d->hdr.list);
 
 		/*
 		 * rdt_domain "d" is going to be freed below, so clear
diff --git a/arch/x86/kernel/cpu/resctrl/ctrlmondata.c b/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
index b44c487727d4..8bce591a1018 100644
--- a/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
+++ b/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
@@ -67,7 +67,7 @@ int parse_bw(struct rdt_parse_data *data, struct resctrl_schema *s,
 
 	cfg = &d->staged_config[s->conf_type];
 	if (cfg->have_new_ctrl) {
-		rdt_last_cmd_printf("Duplicate domain %d\n", d->id);
+		rdt_last_cmd_printf("Duplicate domain %d\n", d->hdr.id);
 		return -EINVAL;
 	}
 
@@ -144,7 +144,7 @@ int parse_cbm(struct rdt_parse_data *data, struct resctrl_schema *s,
 
 	cfg = &d->staged_config[s->conf_type];
 	if (cfg->have_new_ctrl) {
-		rdt_last_cmd_printf("Duplicate domain %d\n", d->id);
+		rdt_last_cmd_printf("Duplicate domain %d\n", d->hdr.id);
 		return -EINVAL;
 	}
 
@@ -224,8 +224,8 @@ static int parse_line(char *line, struct resctrl_schema *s,
 		return -EINVAL;
 	}
 	dom = strim(dom);
-	list_for_each_entry(d, &r->domains, list) {
-		if (d->id == dom_id) {
+	list_for_each_entry(d, &r->domains, hdr.list) {
+		if (d->hdr.id == dom_id) {
 			data.buf = dom;
 			data.rdtgrp = rdtgrp;
 			if (r->parse_ctrlval(&data, s, d))
@@ -316,7 +316,7 @@ int resctrl_arch_update_domains(struct rdt_resource *r, u32 closid)
 		return -ENOMEM;
 
 	msr_param.res = NULL;
-	list_for_each_entry(d, &r->domains, list) {
+	list_for_each_entry(d, &r->domains, hdr.list) {
 		hw_dom = resctrl_to_arch_dom(d);
 		for (t = 0; t < CDP_NUM_TYPES; t++) {
 			cfg = &hw_dom->d_resctrl.staged_config[t];
@@ -464,7 +464,7 @@ static void show_doms(struct seq_file *s, struct resctrl_schema *schema, int clo
 	u32 ctrl_val;
 
 	seq_printf(s, "%*s:", max_name_width, schema->name);
-	list_for_each_entry(dom, &r->domains, list) {
+	list_for_each_entry(dom, &r->domains, hdr.list) {
 		if (sep)
 			seq_puts(s, ";");
 
@@ -474,7 +474,7 @@ static void show_doms(struct seq_file *s, struct resctrl_schema *schema, int clo
 			ctrl_val = resctrl_arch_get_config(r, dom, closid,
 							   schema->conf_type);
 
-		seq_printf(s, r->format_str, dom->id, max_data_width,
+		seq_printf(s, r->format_str, dom->hdr.id, max_data_width,
 			   ctrl_val);
 		sep = true;
 	}
@@ -503,7 +503,7 @@ int rdtgroup_schemata_show(struct kernfs_open_file *of,
 			} else {
 				seq_printf(s, "%s:%d=%x\n",
 					   rdtgrp->plr->s->res->name,
-					   rdtgrp->plr->d->id,
+					   rdtgrp->plr->d->hdr.id,
 					   rdtgrp->plr->cbm);
 			}
 		} else {
diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c
index ded1fc7cb7cb..27cda5988d7f 100644
--- a/arch/x86/kernel/cpu/resctrl/monitor.c
+++ b/arch/x86/kernel/cpu/resctrl/monitor.c
@@ -340,7 +340,7 @@ static void add_rmid_to_limbo(struct rmid_entry *entry)
 
 	entry->busy = 0;
 	cpu = get_cpu();
-	list_for_each_entry(d, &r->domains, list) {
+	list_for_each_entry(d, &r->domains, hdr.list) {
 		if (cpumask_test_cpu(cpu, &d->cpu_mask)) {
 			err = resctrl_arch_rmid_read(r, d, entry->rmid,
 						     QOS_L3_OCCUP_EVENT_ID,
diff --git a/arch/x86/kernel/cpu/resctrl/pseudo_lock.c b/arch/x86/kernel/cpu/resctrl/pseudo_lock.c
index 8c5f932bc00b..18b6183a1b48 100644
--- a/arch/x86/kernel/cpu/resctrl/pseudo_lock.c
+++ b/arch/x86/kernel/cpu/resctrl/pseudo_lock.c
@@ -856,7 +856,7 @@ bool rdtgroup_pseudo_locked_in_hierarchy(struct rdt_domain *d)
 	 * associated with them.
 	 */
 	for_each_alloc_capable_rdt_resource(r) {
-		list_for_each_entry(d_i, &r->domains, list) {
+		list_for_each_entry(d_i, &r->domains, hdr.list) {
 			if (d_i->plr)
 				cpumask_or(cpu_with_psl, cpu_with_psl,
 					   &d_i->cpu_mask);
diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
index 1cf2b36f5bf8..42adf17ea6fa 100644
--- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
+++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
@@ -86,7 +86,7 @@ void rdt_staged_configs_clear(void)
 	lockdep_assert_held(&rdtgroup_mutex);
 
 	for_each_alloc_capable_rdt_resource(r) {
-		list_for_each_entry(dom, &r->domains, list)
+		list_for_each_entry(dom, &r->domains, hdr.list)
 			memset(dom->staged_config, 0, sizeof(dom->staged_config));
 	}
 }
@@ -928,12 +928,12 @@ static int rdt_bit_usage_show(struct kernfs_open_file *of,
 
 	mutex_lock(&rdtgroup_mutex);
 	hw_shareable = r->cache.shareable_bits;
-	list_for_each_entry(dom, &r->domains, list) {
+	list_for_each_entry(dom, &r->domains, hdr.list) {
 		if (sep)
 			seq_putc(seq, ';');
 		sw_shareable = 0;
 		exclusive = 0;
-		seq_printf(seq, "%d=", dom->id);
+		seq_printf(seq, "%d=", dom->hdr.id);
 		for (i = 0; i < closids_supported(); i++) {
 			if (!closid_allocated(i))
 				continue;
@@ -1233,7 +1233,7 @@ static bool rdtgroup_mode_test_exclusive(struct rdtgroup *rdtgrp)
 		if (r->rid == RDT_RESOURCE_MBA || r->rid == RDT_RESOURCE_SMBA)
 			continue;
 		has_cache = true;
-		list_for_each_entry(d, &r->domains, list) {
+		list_for_each_entry(d, &r->domains, hdr.list) {
 			ctrl = resctrl_arch_get_config(r, d, closid,
 						       s->conf_type);
 			if (rdtgroup_cbm_overlaps(s, d, ctrl, closid, false)) {
@@ -1398,7 +1398,7 @@ static int rdtgroup_size_show(struct kernfs_open_file *of,
 			size = rdtgroup_cbm_to_size(rdtgrp->plr->s->res,
 						    rdtgrp->plr->d,
 						    rdtgrp->plr->cbm);
-			seq_printf(s, "%d=%u\n", rdtgrp->plr->d->id, size);
+			seq_printf(s, "%d=%u\n", rdtgrp->plr->d->hdr.id, size);
 		}
 		goto out;
 	}
@@ -1410,7 +1410,7 @@ static int rdtgroup_size_show(struct kernfs_open_file *of,
 		type = schema->conf_type;
 		sep = false;
 		seq_printf(s, "%*s:", max_name_width, schema->name);
-		list_for_each_entry(d, &r->domains, list) {
+		list_for_each_entry(d, &r->domains, hdr.list) {
 			if (sep)
 				seq_putc(s, ';');
 			if (rdtgrp->mode == RDT_MODE_PSEUDO_LOCKSETUP) {
@@ -1428,7 +1428,7 @@ static int rdtgroup_size_show(struct kernfs_open_file *of,
 				else
 					size = rdtgroup_cbm_to_size(r, d, ctrl);
 			}
-			seq_printf(s, "%d=%u", d->id, size);
+			seq_printf(s, "%d=%u", d->hdr.id, size);
 			sep = true;
 		}
 		seq_putc(s, '\n');
@@ -1499,7 +1499,7 @@ static int mbm_config_show(struct seq_file *s, struct rdt_resource *r, u32 evtid
 
 	mutex_lock(&rdtgroup_mutex);
 
-	list_for_each_entry(dom, &r->domains, list) {
+	list_for_each_entry(dom, &r->domains, hdr.list) {
 		if (sep)
 			seq_puts(s, ";");
 
@@ -1507,7 +1507,7 @@ static int mbm_config_show(struct seq_file *s, struct rdt_resource *r, u32 evtid
 		mon_info.evtid = evtid;
 		mondata_config_read(dom, &mon_info);
 
-		seq_printf(s, "%d=0x%02x", dom->id, mon_info.mon_config);
+		seq_printf(s, "%d=0x%02x", dom->hdr.id, mon_info.mon_config);
 		sep = true;
 	}
 	seq_puts(s, "\n");
@@ -1622,8 +1622,8 @@ static int mon_config_write(struct rdt_resource *r, char *tok, u32 evtid)
 		return -EINVAL;
 	}
 
-	list_for_each_entry(d, &r->domains, list) {
-		if (d->id == dom_id) {
+	list_for_each_entry(d, &r->domains, hdr.list) {
+		if (d->hdr.id == dom_id) {
 			ret = mbm_config_write_domain(r, d, evtid, val);
 			if (ret)
 				return -EINVAL;
@@ -2141,7 +2141,7 @@ static int set_cache_qos_cfg(int level, bool enable)
 		return -ENOMEM;
 
 	r_l = &rdt_resources_all[level].r_resctrl;
-	list_for_each_entry(d, &r_l->domains, list) {
+	list_for_each_entry(d, &r_l->domains, hdr.list) {
 		if (r_l->cache.arch_has_per_cpu_cfg)
 			/* Pick all the CPUs in the domain instance */
 			for_each_cpu(cpu, &d->cpu_mask)
@@ -2226,7 +2226,7 @@ static int set_mba_sc(bool mba_sc)
 
 	r->membw.mba_sc = mba_sc;
 
-	list_for_each_entry(d, &r->domains, list) {
+	list_for_each_entry(d, &r->domains, hdr.list) {
 		for (i = 0; i < num_closid; i++)
 			d->mbps_val[i] = MBA_MAX_MBPS;
 	}
@@ -2528,7 +2528,7 @@ static int rdt_get_tree(struct fs_context *fc)
 
 	if (is_mbm_enabled()) {
 		r = &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl;
-		list_for_each_entry(dom, &r->domains, list)
+		list_for_each_entry(dom, &r->domains, hdr.list)
 			mbm_setup_overflow_handler(dom, MBM_OVERFLOW_INTERVAL);
 	}
 
@@ -2652,7 +2652,7 @@ static int reset_all_ctrls(struct rdt_resource *r)
 	 * CBMs in all domains to the maximum mask value. Pick one CPU
 	 * from each domain to update the MSRs below.
 	 */
-	list_for_each_entry(d, &r->domains, list) {
+	list_for_each_entry(d, &r->domains, hdr.list) {
 		hw_dom = resctrl_to_arch_dom(d);
 		cpumask_set_cpu(cpumask_any(&d->cpu_mask), cpu_mask);
 
@@ -2858,7 +2858,7 @@ static int mkdir_mondata_subdir(struct kernfs_node *parent_kn,
 	char name[32];
 	int ret;
 
-	sprintf(name, "mon_%s_%02d", r->name, d->id);
+	sprintf(name, "mon_%s_%02d", r->name, d->hdr.id);
 	/* create the directory */
 	kn = kernfs_create_dir(parent_kn, name, parent_kn->mode, prgrp);
 	if (IS_ERR(kn))
@@ -2874,7 +2874,7 @@ static int mkdir_mondata_subdir(struct kernfs_node *parent_kn,
 	}
 
 	priv.u.rid = r->rid;
-	priv.u.domid = d->id;
+	priv.u.domid = d->hdr.id;
 	list_for_each_entry(mevt, &r->evt_list, list) {
 		priv.u.evtid = mevt->evtid;
 		ret = mon_addfile(kn, mevt->name, priv.priv);
@@ -2922,7 +2922,7 @@ static int mkdir_mondata_subdir_alldom(struct kernfs_node *parent_kn,
 	struct rdt_domain *dom;
 	int ret;
 
-	list_for_each_entry(dom, &r->domains, list) {
+	list_for_each_entry(dom, &r->domains, hdr.list) {
 		ret = mkdir_mondata_subdir(parent_kn, dom, r, prgrp);
 		if (ret)
 			return ret;
@@ -3081,7 +3081,7 @@ static int __init_one_rdt_domain(struct rdt_domain *d, struct resctrl_schema *s,
 	 */
 	tmp_cbm = cfg->new_ctrl;
 	if (bitmap_weight(&tmp_cbm, r->cache.cbm_len) < r->cache.min_cbm_bits) {
-		rdt_last_cmd_printf("No space on %s:%d\n", s->name, d->id);
+		rdt_last_cmd_printf("No space on %s:%d\n", s->name, d->hdr.id);
 		return -ENOSPC;
 	}
 	cfg->have_new_ctrl = true;
@@ -3104,7 +3104,7 @@ static int rdtgroup_init_cat(struct resctrl_schema *s, u32 closid)
 	struct rdt_domain *d;
 	int ret;
 
-	list_for_each_entry(d, &s->res->domains, list) {
+	list_for_each_entry(d, &s->res->domains, hdr.list) {
 		ret = __init_one_rdt_domain(d, s, closid);
 		if (ret < 0)
 			return ret;
@@ -3119,7 +3119,7 @@ static void rdtgroup_init_mba(struct rdt_resource *r, u32 closid)
 	struct resctrl_staged_config *cfg;
 	struct rdt_domain *d;
 
-	list_for_each_entry(d, &r->domains, list) {
+	list_for_each_entry(d, &r->domains, hdr.list) {
 		if (is_mba_sc(r)) {
 			d->mbps_val[closid] = MBA_MAX_MBPS;
 			continue;
@@ -3726,7 +3726,7 @@ void resctrl_offline_domain(struct rdt_resource *r, struct rdt_domain *d)
 	 * per domain monitor data directories.
 	 */
 	if (static_branch_unlikely(&rdt_mon_enable_key))
-		rmdir_mondata_subdir_allrdtgrp(r, d->id);
+		rmdir_mondata_subdir_allrdtgrp(r, d->hdr.id);
 
 	if (is_mbm_enabled())
 		cancel_delayed_work(&d->mbm_over);

^ permalink raw reply related	[flat|nested] 331+ messages in thread

* Re: [PATCH v5 7/8] x86/resctrl: Update documentation with Sub-NUMA cluster changes
  2023-08-29 23:44     ` [PATCH v5 7/8] x86/resctrl: Update documentation with Sub-NUMA cluster changes Tony Luck
  2023-09-25 23:30       ` Reinette Chatre
@ 2023-09-26 19:00       ` Moger, Babu
  2023-09-26 19:11         ` Luck, Tony
  1 sibling, 1 reply; 331+ messages in thread
From: Moger, Babu @ 2023-09-26 19:00 UTC (permalink / raw)
  To: Tony Luck, Fenghua Yu, Reinette Chatre, Peter Newman,
	Jonathan Corbet, Shuah Khan, x86
  Cc: Shaopeng Tan, James Morse, Jamie Iles, Randy Dunlap,
	linux-kernel, linux-doc, patches

Hi Tony,

How does the user know Sub-NUMA Cluster mode is enabled on the system?

Do you have any information in /sys/fs/resctrl/info?

The documentation below does not have any info about it.
Would it be better to add that in the "info" directory?

Thanks
Babu



On 8/29/23 18:44, Tony Luck wrote:
> With Sub-NUMA Cluster mode enabled the scope of monitoring resources is
> per-NODE instead of per-L3 cache. Suffixes of directories with "L3" in
> their name refer to Sub-NUMA nodes instead of L3 cache ids.
> 
> Users should be aware that SNC mode also affects the amount of L3 cache
> available for allocation within each SNC node.
> 
> Signed-off-by: Tony Luck <tony.luck@intel.com>
> ---
>  Documentation/arch/x86/resctrl.rst | 25 ++++++++++++++++++++++---
>  1 file changed, 22 insertions(+), 3 deletions(-)
> 
> diff --git a/Documentation/arch/x86/resctrl.rst b/Documentation/arch/x86/resctrl.rst
> index cb05d90111b4..407764f43f25 100644
> --- a/Documentation/arch/x86/resctrl.rst
> +++ b/Documentation/arch/x86/resctrl.rst
> @@ -345,9 +345,15 @@ When control is enabled all CTRL_MON groups will also contain:
>  When monitoring is enabled all MON groups will also contain:
>  
>  "mon_data":
> -	This contains a set of files organized by L3 domain and by
> -	RDT event. E.g. on a system with two L3 domains there will
> -	be subdirectories "mon_L3_00" and "mon_L3_01".	Each of these
> +	This contains a set of files organized by L3 domain or by NUMA
> +	node (depending on whether Sub-NUMA Cluster (SNC) mode is disabled
> +	or enabled respectively) and by RDT event. E.g. on a system with
> +	SNC mode disabled with two L3 domains there will be subdirectories
> +	"mon_L3_00" and "mon_L3_01". The numerical suffix refers to the
> +	L3 cache id.  With SNC enabled the directory names are the same,
> +	but the numerical suffix refers to the node id.
> +        Mappings from node ids to CPUs are available in the
> +        /sys/devices/system/node/node*/cpulist files. Each of these
>  	directories have one file per event (e.g. "llc_occupancy",
>  	"mbm_total_bytes", and "mbm_local_bytes"). In a MON group these
>  	files provide a read out of the current value of the event for
> @@ -452,6 +458,19 @@ and 0xA are not.  On a system with a 20-bit mask each bit represents 5%
>  of the capacity of the cache. You could partition the cache into four
>  equal parts with masks: 0x1f, 0x3e0, 0x7c00, 0xf8000.
>  
> +Notes on Sub-NUMA Cluster mode
> +==============================
> +When SNC mode is enabled the "llc_occupancy", "mbm_total_bytes", and
> +"mbm_local_bytes" will only give accurate results for well behaved NUMA
> +applications. I.e. those that perform the majority of memory accesses
> +to memory on the local NUMA node to the CPU where the task is executing.
> +
> +The cache allocation feature still provides the same number of
> +bits in a mask to control allocation into the L3 cache. But each
> +of those ways has its capacity reduced because the cache is divided
> +between the SNC nodes. The values reported in the resctrl
> +"size" files are adjusted accordingly.
> +
>  Memory bandwidth Allocation and monitoring
>  ==========================================
>  

-- 
Thanks
Babu Moger

^ permalink raw reply	[flat|nested] 331+ messages in thread

* RE: [PATCH v5 7/8] x86/resctrl: Update documentation with Sub-NUMA cluster changes
  2023-09-26 19:00       ` Moger, Babu
@ 2023-09-26 19:11         ` Luck, Tony
  2023-09-26 19:26           ` Moger, Babu
  0 siblings, 1 reply; 331+ messages in thread
From: Luck, Tony @ 2023-09-26 19:11 UTC (permalink / raw)
  To: babu.moger, Yu, Fenghua, Chatre, Reinette, Peter Newman,
	Jonathan Corbet, Shuah Khan, x86
  Cc: Shaopeng Tan, James Morse, Jamie Iles, Randy Dunlap,
	linux-kernel, linux-doc, patches

> How does the user know Sub-NUMA Cluster mode is enabled on the system?
>
> Do you have any information in /sys/fs/resctrl/info?
>
> The documentation below does not have any info about it.
> Would it be better to add that in the "info" directory?


Babu,

My original patch series added an "snc_ways" file to the info/ directory
to make this visible. But I was talked out of it because of a lack of clear
user mode use case that needs it.

https://lore.kernel.org/all/f0841866-315b-4727-0a6c-ec60d22ca29c@arm.com/

Users can figure out whether SNC is enabled by looking at topology
files for nodes and L3 caches in /sys. The open question is whether
we should provide a simple short cut to that information, or make them
walk two miles uphill in the snow to figure it out.
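
For the record, a quick and dirty user space sketch of that walk
(illustrative only: it assumes cache "index3" is the L3, contiguous
CPU numbering, and it does not exclude CPU-less memory-only nodes the
way the kernel detection code does):

#include <dirent.h>
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
	int nodes = 0, l3s = 0, cpu, id;
	char path[128], buf[32];
	char seen[256] = { 0 };	/* crude set of L3 cache ids */
	struct dirent *de;
	FILE *f;
	DIR *d;

	/* Count NUMA nodes */
	d = opendir("/sys/devices/system/node");
	while (d && (de = readdir(d)))
		if (sscanf(de->d_name, "node%d", &id) == 1)
			nodes++;
	if (d)
		closedir(d);

	/* Count distinct L3 cache instances */
	for (cpu = 0; ; cpu++) {
		snprintf(path, sizeof(path),
			 "/sys/devices/system/cpu/cpu%d/cache/index3/id", cpu);
		f = fopen(path, "r");
		if (!f)
			break;
		if (fgets(buf, sizeof(buf), f)) {
			id = atoi(buf);
			if (id >= 0 && id < 256 && !seen[id]) {
				seen[id] = 1;
				l3s++;
			}
		}
		fclose(f);
	}

	printf("nodes=%d l3_caches=%d snc_nodes_per_l3=%d\n",
	       nodes, l3s, l3s ? nodes / l3s : 0);
	return 0;
}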

-Tony

^ permalink raw reply	[flat|nested] 331+ messages in thread

* Re: [PATCH v5 6/8] x86/resctrl: Sub NUMA Cluster detection and enable
  2023-08-29 23:44     ` [PATCH v5 6/8] x86/resctrl: Sub NUMA Cluster detection and enable Tony Luck
  2023-09-25 23:29       ` Reinette Chatre
@ 2023-09-26 19:16       ` Moger, Babu
  2023-09-26 19:40         ` Luck, Tony
  1 sibling, 1 reply; 331+ messages in thread
From: Moger, Babu @ 2023-09-26 19:16 UTC (permalink / raw)
  To: Tony Luck, Fenghua Yu, Reinette Chatre, Peter Newman,
	Jonathan Corbet, Shuah Khan, x86
  Cc: Shaopeng Tan, James Morse, Jamie Iles, Randy Dunlap,
	linux-kernel, linux-doc, patches

Hi Tony,

On 8/29/23 18:44, Tony Luck wrote:
> There isn't a simple h/w bit that indicates whether a CPU is
> running in Sub NUMA Cluster mode. Infer the state by comparing
> the ratio of NUMA nodes to L3 cache instances.
> 
> When SNC mode is detected, reconfigure the RMID counters by updating
> the MSR_RMID_SNC_CONFIG MSR on each socket as CPUs are seen.
> 
> Signed-off-by: Tony Luck <tony.luck@intel.com>
> ---
>  arch/x86/include/asm/msr-index.h   |  1 +
>  arch/x86/kernel/cpu/resctrl/core.c | 68 ++++++++++++++++++++++++++++++
>  2 files changed, 69 insertions(+)
> 
> diff --git a/arch/x86/include/asm/msr-index.h b/arch/x86/include/asm/msr-index.h
> index 1d111350197f..393d1b047617 100644
> --- a/arch/x86/include/asm/msr-index.h
> +++ b/arch/x86/include/asm/msr-index.h
> @@ -1100,6 +1100,7 @@
>  #define MSR_IA32_QM_CTR			0xc8e
>  #define MSR_IA32_PQR_ASSOC		0xc8f
>  #define MSR_IA32_L3_CBM_BASE		0xc90
> +#define MSR_RMID_SNC_CONFIG		0xca0
>  #define MSR_IA32_L2_CBM_BASE		0xd10
>  #define MSR_IA32_MBA_THRTL_BASE		0xd50
>  
> diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
> index ed4f55b3e5e4..9f0ac9721fab 100644
> --- a/arch/x86/kernel/cpu/resctrl/core.c
> +++ b/arch/x86/kernel/cpu/resctrl/core.c
> @@ -16,11 +16,14 @@
>  
>  #define pr_fmt(fmt)	"resctrl: " fmt
>  
> +#include <linux/cpu.h>
>  #include <linux/slab.h>
>  #include <linux/err.h>
>  #include <linux/cacheinfo.h>
>  #include <linux/cpuhotplug.h>
> +#include <linux/mod_devicetable.h>

I didn't see the need for this include.
>  
> +#include <asm/cpu_device_id.h>
>  #include <asm/intel-family.h>
>  #include <asm/resctrl.h>
>  #include "internal.h"
> @@ -724,11 +727,34 @@ static void clear_closid_rmid(int cpu)
>  	wrmsr(MSR_IA32_PQR_ASSOC, 0, 0);
>  }
>  
> +/*
> + * The power-on reset value of MSR_RMID_SNC_CONFIG is 0x1
> + * which indicates that RMIDs are configured in legacy mode.
> + * Clearing bit 0 reconfigures the RMID counters for use
> + * in Sub NUMA Cluster mode.
> + */
> +static void snc_remap_rmids(int cpu)

While adding the new functions, I see that new function names start with
resctrl_ prefix.  However, we are all not very consistent. Can you rename
this function to resctrl_snc_remap_rmids?


> +{
> +	u64 val;
> +
> +	/* Only need to enable once per package */
> +	if (cpumask_first(topology_core_cpumask(cpu)) != cpu)
> +		return;
> +
> +	rdmsrl(MSR_RMID_SNC_CONFIG, val);
> +	val &= ~BIT_ULL(0);
> +	wrmsrl(MSR_RMID_SNC_CONFIG, val);
> +}
> +
>  static int resctrl_online_cpu(unsigned int cpu)
>  {
>  	struct rdt_resource *r;
>  
>  	mutex_lock(&rdtgroup_mutex);
> +
> +	if (snc_nodes_per_l3_cache > 1)
> +		snc_remap_rmids(cpu);
> +
>  	for_each_capable_rdt_resource(r)
>  		domain_add_cpu(cpu, r);
>  	/* The cpu is set in default rdtgroup after online. */
> @@ -983,11 +1009,53 @@ static __init bool get_rdt_resources(void)
>  	return (rdt_mon_capable || rdt_alloc_capable);
>  }
>  
> +/* CPU models that support MSR_RMID_SNC_CONFIG */
> +static const struct x86_cpu_id snc_cpu_ids[] __initconst = {
> +	X86_MATCH_INTEL_FAM6_MODEL(ICELAKE_X, 0),
> +	X86_MATCH_INTEL_FAM6_MODEL(SAPPHIRERAPIDS_X, 0),
> +	X86_MATCH_INTEL_FAM6_MODEL(EMERALDRAPIDS_X, 0),
> +	{}
> +};
> +
> +static __init int get_snc_config(void)

Same comment as above.

> +{
> +	unsigned long *node_caches;
> +	int mem_only_nodes = 0;
> +	int cpu, node, ret;
> +
> +	if (!x86_match_cpu(snc_cpu_ids))
> +		return 1;
> +
> +	node_caches = kcalloc(BITS_TO_LONGS(nr_node_ids), sizeof(*node_caches), GFP_KERNEL);
> +	if (!node_caches)
> +		return 1;
> +
> +	cpus_read_lock();
> +	for_each_node(node) {
> +		cpu = cpumask_first(cpumask_of_node(node));
> +		if (cpu < nr_cpu_ids)
> +			set_bit(get_cpu_cacheinfo_id(cpu, 3), node_caches);
> +		else
> +			mem_only_nodes++;
> +	}
> +	cpus_read_unlock();
> +
> +	ret = (nr_node_ids - mem_only_nodes) / bitmap_weight(node_caches, nr_node_ids);
> +	kfree(node_caches);
> +
> +	if (ret > 1)
> +		rdt_resources_all[RDT_RESOURCE_L3].r_resctrl.mon_scope = RESCTRL_NODE;
> +
> +	return ret;
> +}
> +
>  static __init void rdt_init_res_defs_intel(void)
>  {
>  	struct rdt_hw_resource *hw_res;
>  	struct rdt_resource *r;
>  
> +	snc_nodes_per_l3_cache = get_snc_config();
> +
>  	for_each_rdt_resource(r) {
>  		hw_res = resctrl_to_arch_res(r);
>  

-- 
Thanks
Babu Moger

^ permalink raw reply	[flat|nested] 331+ messages in thread

* Re: [PATCH v5 7/8] x86/resctrl: Update documentation with Sub-NUMA cluster changes
  2023-09-26 19:11         ` Luck, Tony
@ 2023-09-26 19:26           ` Moger, Babu
  2023-09-26 19:43             ` Luck, Tony
  0 siblings, 1 reply; 331+ messages in thread
From: Moger, Babu @ 2023-09-26 19:26 UTC (permalink / raw)
  To: Luck, Tony, Yu, Fenghua, Chatre, Reinette, Peter Newman,
	Jonathan Corbet, Shuah Khan, x86
  Cc: Shaopeng Tan, James Morse, Jamie Iles, Randy Dunlap,
	linux-kernel, linux-doc, patches



On 9/26/23 14:11, Luck, Tony wrote:
>> How does the user know Sub-NUMA Cluster mode is enabled on the system?
>>
>> Do you have any information in /sys/fs/resctrl/info?
>>
>> The documentation below does not have any info about it.
>> Would it be better to add that in the "info" directory?
> 
> 
> Babu,
> 
> My original patch series added an "snc_ways" file to the info/ directory
> to make this visible. But I was talked out of it because of a lack of clear
> user mode use case that needs it.

OK. Let's go with it.

Thanks
Babu Moger

^ permalink raw reply	[flat|nested] 331+ messages in thread

* RE: [PATCH v5 6/8] x86/resctrl: Sub NUMA Cluster detection and enable
  2023-09-26 19:16       ` Moger, Babu
@ 2023-09-26 19:40         ` Luck, Tony
  2023-09-26 19:48           ` Moger, Babu
  2023-09-26 21:03           ` Reinette Chatre
  0 siblings, 2 replies; 331+ messages in thread
From: Luck, Tony @ 2023-09-26 19:40 UTC (permalink / raw)
  To: babu.moger, Yu, Fenghua, Chatre, Reinette, Peter Newman,
	Jonathan Corbet, Shuah Khan, x86
  Cc: Shaopeng Tan, James Morse, Jamie Iles, Randy Dunlap,
	linux-kernel, linux-doc, patches

>> +#include <linux/mod_devicetable.h>
>
> I didn't see the need for this include.

struct x86_cpu_id is defined in this #include file.

>> +static void snc_remap_rmids(int cpu)
>
> While adding the new functions, I see that new function names start with
> resctrl_ prefix.  However, we are all not very consistent. Can you rename
> this function to resctrl_snc_remap_rmids?

I try to put a subsystem prefix on any global symbols to avoid random
conflicts in other parts of the kernel. But I'm less sure of the value for
static functions and variables that are only visible inside a single ".c"
file.

If it must have a prefix, should it be "intel_" rather than "resctrl_" to
indicate that it is an Intel specific function?


>> +static __init int get_snc_config(void)
>
> Same comment as above.

Same answer.


Reinette: Any opinions on these?

-Tony

^ permalink raw reply	[flat|nested] 331+ messages in thread

* RE: [PATCH v5 7/8] x86/resctrl: Update documentation with Sub-NUMA cluster changes
  2023-09-26 19:26           ` Moger, Babu
@ 2023-09-26 19:43             ` Luck, Tony
  2023-09-26 19:52               ` Moger, Babu
  0 siblings, 1 reply; 331+ messages in thread
From: Luck, Tony @ 2023-09-26 19:43 UTC (permalink / raw)
  To: babu.moger, Yu, Fenghua, Chatre, Reinette, Peter Newman,
	Jonathan Corbet, Shuah Khan, x86
  Cc: Shaopeng Tan, James Morse, Jamie Iles, Randy Dunlap,
	linux-kernel, linux-doc, patches

>> My original patch series added an "snc_ways" file to the info/ directory
>> to make this visible. But I was talked out of it because of a lack of clear
>> user mode use case that needs it.
>
> OK. Let's go with it.

Babu,

I'm not sure whether you are voting to bring the snc_ways patch back into
the series. Or to stick with the decision to drop that patch.

-Tony

^ permalink raw reply	[flat|nested] 331+ messages in thread

* Re: [PATCH v5 6/8] x86/resctrl: Sub NUMA Cluster detection and enable
  2023-09-26 19:40         ` Luck, Tony
@ 2023-09-26 19:48           ` Moger, Babu
  2023-09-26 20:02             ` Luck, Tony
  2023-09-26 21:03           ` Reinette Chatre
  1 sibling, 1 reply; 331+ messages in thread
From: Moger, Babu @ 2023-09-26 19:48 UTC (permalink / raw)
  To: Luck, Tony, Yu, Fenghua, Chatre, Reinette, Peter Newman,
	Jonathan Corbet, Shuah Khan, x86
  Cc: Shaopeng Tan, James Morse, Jamie Iles, Randy Dunlap,
	linux-kernel, linux-doc, patches



On 9/26/23 14:40, Luck, Tony wrote:
>>> +#include <linux/mod_devicetable.h>
>>
>> I didn't see the need for this include.
> 
> struct x86_cpu_id is defined in this #include file.

Actually, the file <asm/cpu_device_id.h> already includes the above file.

> 
>>> +static void snc_remap_rmids(int cpu)
>>
>> While adding the new functions, I see that new function names start with
>> resctrl_ prefix.  However, we are all not very consistent. Can you rename
>> this function to resctrl_snc_remap_rmids?
> 
> I try to put a subsystem prefix on any global symbols to avoid random
> conflicts in other parts of the kernel. But I'm less sure of the value for
> static functions and variables that are only visible inside a single ".c"
> file.
> 
> If it must have a prefix, should it be "intel_" rather than "resctrl_" to
> indicate that it is an Intel specific function?

If you add it it should be resctrl_.

Thanks
Babu
> 
> 
>>> +static __init int get_snc_config(void)
>>
>> Same comment as above.
> 
> Same answer.
> 
> 
> Reinette: Any opinions on these?
> 
> -Tony


^ permalink raw reply	[flat|nested] 331+ messages in thread

* Re: [PATCH v5 7/8] x86/resctrl: Update documentation with Sub-NUMA cluster changes
  2023-09-26 19:43             ` Luck, Tony
@ 2023-09-26 19:52               ` Moger, Babu
  0 siblings, 0 replies; 331+ messages in thread
From: Moger, Babu @ 2023-09-26 19:52 UTC (permalink / raw)
  To: Luck, Tony, Yu, Fenghua, Chatre, Reinette, Peter Newman,
	Jonathan Corbet, Shuah Khan, x86
  Cc: Shaopeng Tan, James Morse, Jamie Iles, Randy Dunlap,
	linux-kernel, linux-doc, patches



On 9/26/23 14:43, Luck, Tony wrote:
>>> My original patch series added an "snc_ways" file to the info/ directory
>>> to make this visible. But I was talked out of it because of a lack of clear
>>> user mode use case that needs it.
>>
>> OK. Let's go with it.
> 
> Babu,
> 
> I'm not sure whether you are voting to bring the snc_ways patch back into
> the series. Or to stick with the decision to drop that patch.
> 

Sorry for not being clear. Let's stick with James's comment and drop it.

-- 
Thanks
Babu Moger

^ permalink raw reply	[flat|nested] 331+ messages in thread

* RE: [PATCH v5 6/8] x86/resctrl: Sub NUMA Cluster detection and enable
  2023-09-26 19:48           ` Moger, Babu
@ 2023-09-26 20:02             ` Luck, Tony
  2023-09-26 20:18               ` Moger, Babu
  0 siblings, 1 reply; 331+ messages in thread
From: Luck, Tony @ 2023-09-26 20:02 UTC (permalink / raw)
  To: babu.moger, Yu, Fenghua, Chatre, Reinette, Peter Newman,
	Jonathan Corbet, Shuah Khan, x86
  Cc: Shaopeng Tan, James Morse, Jamie Iles, Randy Dunlap,
	linux-kernel, linux-doc, patches

>>>> +#include <linux/mod_devicetable.h>
>>>
>>> I didn't see the need for this include.
>> 
>> struct x86_cpu_id is defined in this #include file.
>
> Actually, the file <asm/cpu_device_id.h> already includes the above file.

It does today. Will it include it next week when somebody has to
re-arrange things to resolve some #include dependency?

I thought there was a guideline somewhere that says to #include
the files that define the things you use. Not just rely on chains of
#includes. But some simple grep's haven't found that written in
Documentation/*

-Tony


^ permalink raw reply	[flat|nested] 331+ messages in thread

* Re: [PATCH v5 6/8] x86/resctrl: Sub NUMA Cluster detection and enable
  2023-09-26 20:02             ` Luck, Tony
@ 2023-09-26 20:18               ` Moger, Babu
  0 siblings, 0 replies; 331+ messages in thread
From: Moger, Babu @ 2023-09-26 20:18 UTC (permalink / raw)
  To: Luck, Tony, Yu, Fenghua, Chatre, Reinette, Peter Newman,
	Jonathan Corbet, Shuah Khan, x86
  Cc: Shaopeng Tan, James Morse, Jamie Iles, Randy Dunlap,
	linux-kernel, linux-doc, patches



On 9/26/23 15:02, Luck, Tony wrote:
>>>>> +#include <linux/mod_devicetable.h>
>>>>
>>>> I didn't see the need for this include.
>>>
>>> struct x86_cpu_id is defined in this #include file.
>>
>> Actually, the file <asm/cpu_device_id.h> already includes the above file.
> 
> It does today. Will it include it next week when somebody has to
> re-arrange things to resolve some #include dependency?
> 
> I thought there was a guideline somewhere that says to #include
> the files that define the things you use. Not just rely on chains of
> #includes. But some simple grep's haven't found that written in
> Documentation/*

Yes, correct. There are no guidelines. Let's keep the #include
<linux/mod_devicetable.h>

-- 
Thanks
Babu Moger

^ permalink raw reply	[flat|nested] 331+ messages in thread

* Re: [PATCH v5 6/8] x86/resctrl: Sub NUMA Cluster detection and enable
  2023-09-26 19:40         ` Luck, Tony
  2023-09-26 19:48           ` Moger, Babu
@ 2023-09-26 21:03           ` Reinette Chatre
  1 sibling, 0 replies; 331+ messages in thread
From: Reinette Chatre @ 2023-09-26 21:03 UTC (permalink / raw)
  To: Luck, Tony, babu.moger, Yu, Fenghua, Peter Newman,
	Jonathan Corbet, Shuah Khan, x86
  Cc: Shaopeng Tan, James Morse, Jamie Iles, Randy Dunlap,
	linux-kernel, linux-doc, patches

Hi Tony,

On 9/26/2023 12:40 PM, Luck, Tony wrote:
>>> +#include <linux/mod_devicetable.h>
>>
>> I didn't see the need for this include.
> 
> struct x86_cpu_id is defined in this #include file.
> 
>>> +static void snc_remap_rmids(int cpu)
>>
>> While adding the new functions, I see that new function names start with
>> resctrl_ prefix.  However, we are all not very consistent. Can you rename
>> this function to resctrl_snc_remap_rmids?
> 
> I try to put a subsystem prefix on any global symbols to avoid random
> conflicts in other parts of the kernel. But I'm less sure of the value for
> static functions and variables that are only visible inside a single ".c"
> file.
> 
> If it must have a prefix, should it be "intel_" rather than "resctrl_" to
> indicate that it is an Intel specific function?
> 
> 
>>> +static __init int get_snc_config(void)
>>
>> Same comment as above.
> 
> Same answer.
> 
> 
> Reinette: Any opinions on these?
> 

I agree with you that a consistent prefix is expected from the global
symbols and a prefix of resctrl_ is indeed the goal as can be seen in
the growing include/linux/resctrl.h. resctrl has no established pattern
for static functions (look at the other static functions in this file
being changed) or even those non-static functions and data
structures shared between the resctrl files (see
arch/x86/kernel/cpu/resctrl/internal.h).

I'm ok with the static functions named snc_remap_rmids() and get_snc_config().

We can surely start a discussion if there is concern about resctrl static
function prefixes but that discussion will have larger scope than
this series.

If we do want to focus on function naming I do think a change that may
benefit this work is to be consistent with verb placement in the function
names. i.e either always have verb first or have a consistent snc_ prefix
followed by a verb. I do not recall if there are x86 requirements in this
regard.

Reinette

^ permalink raw reply	[flat|nested] 331+ messages in thread

* Re: [PATCH v5 3/8] x86/resctrl: Split the rdt_domain structure
  2023-09-26 18:46         ` Tony Luck
@ 2023-09-26 22:00           ` Reinette Chatre
  2023-09-26 22:14             ` Luck, Tony
  0 siblings, 1 reply; 331+ messages in thread
From: Reinette Chatre @ 2023-09-26 22:00 UTC (permalink / raw)
  To: Tony Luck
  Cc: Fenghua Yu, Peter Newman, Jonathan Corbet, Shuah Khan, x86,
	Shaopeng Tan, James Morse, Jamie Iles, Babu Moger, Randy Dunlap,
	linux-kernel, linux-doc, patches

Hi Tony,

On 9/26/2023 11:46 AM, Tony Luck wrote:
> On Mon, Sep 25, 2023 at 04:25:15PM -0700, Reinette Chatre wrote:
>>> +struct rdt_domain {
>>> +	// First three fields must match struct rdt_mondomain below.
>>
>> Please avoid comments within declarations. Even so, could you please
>> elaborate what the above means? Why do the first three fields have to
>> match? I understand there is common code, for example, __rdt_find_domain()
>> that operates on the same members of the two structs but does that
>> require the members be in the same position in the struct?
>> I understand that a comment may be required if position in the struct
>> is important but I cannot see that it is.
> 
> [Just replying to this one point in your message to get guidance. I'll
> address all the rest in other replies]
> 
> I'm wrong about the first *three* fields ... but the first *two* fields
> (the "list" and the "id") do need to be at the same offsets in different
> structures if a common routine is going to be used to access those
> fields.
> 
> If the "id" were at offset 0x10 in the control version of the domain
> structure, and at offset 0x20 in the monitor version of the domain
> structure, there would be no hope for a common routine to access the
> "id" field when searching a list that could be either control or
> monitor domains.
> 
> I'm looking at making this far more explicit with a new patch between
> 0001 and 0002 that pulls the two fields into a common substructure that
> will be included in each of the control and monitor versions of the
> structure.
> 
> Patch included below.
> 
> But this seems like it is a lot of churn to avoid having separate
> functions to search control and monitor lists. Each a clone of
> the existing ~24 line rdt_find_domain() with just the type changed
> for the return value and the list traversal.

Yes. Sorry, I did not realize this implication during the earlier
discussions.

> 
> What do you think?
> 

It sounds to me as though you are advocating for open coding 
rdt_find_ctrl_domain() and rdt_find_mon_domain()? That sounds good
to me.

Sorry for the noise.

Reinette

^ permalink raw reply	[flat|nested] 331+ messages in thread

* RE: [PATCH v5 3/8] x86/resctrl: Split the rdt_domain structure
  2023-09-26 22:00           ` Reinette Chatre
@ 2023-09-26 22:14             ` Luck, Tony
  2023-09-26 23:06               ` Reinette Chatre
  0 siblings, 1 reply; 331+ messages in thread
From: Luck, Tony @ 2023-09-26 22:14 UTC (permalink / raw)
  To: Chatre, Reinette
  Cc: Yu, Fenghua, Peter Newman, Jonathan Corbet, Shuah Khan, x86,
	Shaopeng Tan, James Morse, Jamie Iles, Babu Moger, Randy Dunlap,
	linux-kernel, linux-doc, patches

> > But this seems like it is a lot of churn to avoid having separate
> > functions to search control and monitor lists. Each a clone of
> > the existing ~24 line rdt_find_domain() with just the type changed
> > for the return value and the list traversal.
>
> Yes. Sorry, I did not realize this implication during the earlier
> discussions.
>
> >
> > What do you think?
> >
>
> It sounds to me as though you are advocating for open coding
> rdt_find_ctrl_domain() and rdt_find_mon_domain()? That sounds good
> to me.

Reinette,

While there is some churn, it maybe isn't all that bad.  I also ran the open
coding case and having a pair of 24-line functions one after the other with
just two trivial lines changed between them is unlikely to get past the x86
maintainers without running this same conversation again.

-Tony

^ permalink raw reply	[flat|nested] 331+ messages in thread

* Re: [PATCH v5 3/8] x86/resctrl: Split the rdt_domain structure
  2023-09-26 22:14             ` Luck, Tony
@ 2023-09-26 23:06               ` Reinette Chatre
  2023-09-27  0:00                 ` Luck, Tony
  0 siblings, 1 reply; 331+ messages in thread
From: Reinette Chatre @ 2023-09-26 23:06 UTC (permalink / raw)
  To: Luck, Tony
  Cc: Yu, Fenghua, Peter Newman, Jonathan Corbet, Shuah Khan, x86,
	Shaopeng Tan, James Morse, Jamie Iles, Babu Moger, Randy Dunlap,
	linux-kernel, linux-doc, patches

Hi Tony,

On 9/26/2023 3:14 PM, Luck, Tony wrote:
>>> But this seems like it is a lot of churn to avoid having separate
>>> functions to search control and monitor lists. Each a clone of
>>> the existing ~24 line rdt_find_domain() with just the type changed
>>> for the return value and the list traversal.
>>
>> Yes. Sorry, I did not realize this implication during the earlier
>> discussions.
>>
>>>
>>> What do you think?
>>>
>>
>> It sounds to me as though you are advocating for open coding
>> rdt_find_ctrl_domain() and rdt_find_mon_domain()? That sounds good
>> to me.
> 
> Reinette,
> 
> While there is some churn, it maybe isn't all that bad.  I also ran the open
> coding case and having a pair of 24-line functions one after the other with
> just two trivial lines changed between them is unlikely to get past the x86
> maintainers without running this same conversation again.
> 

I expect you are right and your proposal makes the code cleaner.
Could the list_entry() call within rdt_find_domain() instead
extract a pointer to a struct rdt_domain_hdr? To me that would make
it most generic and avoid using wrong type for a pointer. I do think that
may have been what you intended by moving id in there though ...

Reinette

^ permalink raw reply	[flat|nested] 331+ messages in thread

* RE: [PATCH v5 3/8] x86/resctrl: Split the rdt_domain structure
  2023-09-26 23:06               ` Reinette Chatre
@ 2023-09-27  0:00                 ` Luck, Tony
  2023-09-27 15:22                   ` Reinette Chatre
  0 siblings, 1 reply; 331+ messages in thread
From: Luck, Tony @ 2023-09-27  0:00 UTC (permalink / raw)
  To: Chatre, Reinette
  Cc: Yu, Fenghua, Peter Newman, Jonathan Corbet, Shuah Khan, x86,
	Shaopeng Tan, James Morse, Jamie Iles, Babu Moger, Randy Dunlap,
	linux-kernel, linux-doc, patches

> I expect you are right and your proposal makes the code cleaner.
> Could the list_entry() call within rdt_find_domain() instead
> extract a pointer to a struct rdt_domain_hdr? To me that would make
> it most generic and avoid using wrong type for a pointer. I do think that
> may have been what you intended by moving id in there though ...

Reinette,

Exactly my thoughts! The function currently looks like this
(though I expect my mail client may mess up the formatting):

struct rdt_domain_hdr *rdt_find_domain(struct list_head *h, int id,
                                       struct list_head **pos)
{
        struct rdt_domain_hdr *d;
        struct list_head *l;

        if (id < 0)
                return ERR_PTR(-ENODEV);

        list_for_each(l, h) {
                d = list_entry(l, struct rdt_domain_hdr, list);
                /* When id is found, return its domain. */
                if (id == d->id)
                        return d;
                /* Stop searching when finding id's position in sorted list. */
                if (id < d->id)
                        break;
        }

        if (pos)
                *pos = l;

        return NULL;
}

and a typical caller does this:

        struct rdt_domain_hdr *hdr;
        struct rdt_mon_domain *d;




        hdr = rdt_find_domain(&r->mon_domains, id, &add_pos);
        if (IS_ERR(hdr)) {
                pr_warn("Couldn't find monitor scope id=%d for CPU %d\n", id, cpu);
                return;
        }
        d = container_of(hdr, struct rdt_mon_domain, hdr);

-Tony

^ permalink raw reply	[flat|nested] 331+ messages in thread

* Re: [PATCH v5 3/8] x86/resctrl: Split the rdt_domain structure
  2023-09-27  0:00                 ` Luck, Tony
@ 2023-09-27 15:22                   ` Reinette Chatre
  0 siblings, 0 replies; 331+ messages in thread
From: Reinette Chatre @ 2023-09-27 15:22 UTC (permalink / raw)
  To: Luck, Tony
  Cc: Yu, Fenghua, Peter Newman, Jonathan Corbet, Shuah Khan, x86,
	Shaopeng Tan, James Morse, Jamie Iles, Babu Moger, Randy Dunlap,
	linux-kernel, linux-doc, patches

Hi Tony,

On 9/26/2023 5:00 PM, Luck, Tony wrote:
>> I expect you are right and your proposal makes the code cleaner.
>> Could the list_entry() call within rdt_find_domain() instead
>> extract a pointer to a struct rdt_domain_hdr? To me that would make
>> it most generic and avoid using wrong type for a pointer. I do think that
>> may have been what you intended by moving id in there though ...
> 
> Reinette,
> 
> Exactly my thoughts! The function currently looks like this
> (though I expect my mail client may mess up the formatting):
> 
> struct rdt_domain_hdr *rdt_find_domain(struct list_head *h, int id,
>                                        struct list_head **pos)
> {
>         struct rdt_domain_hdr *d;
>         struct list_head *l;
> 
>         if (id < 0)
>                 return ERR_PTR(-ENODEV);
> 
>         list_for_each(l, h) {
>                 d = list_entry(l, struct rdt_domain_hdr, list);
>                 /* When id is found, return its domain. */
>                 if (id == d->id)
>                         return d;
>                 /* Stop searching when finding id's position in sorted list. */
>                 if (id < d->id)
>                         break;
>         }
> 
>         if (pos)
>                 *pos = l;
> 
>         return NULL;
> }
> 
> and a typical caller does this:
> 
>         struct rdt_domain_hdr *hdr;
>         struct rdt_mon_domain *d;
> 
> 
> 
> 
>         hdr = rdt_find_domain(&r->mon_domains, id, &add_pos);
>         if (IS_ERR(hdr)) {
>                 pr_warn("Couldn't find monitor scope id=%d for CPU %d\n", id, cpu);
>                 return;
>         }
>         d = container_of(hdr, struct rdt_mon_domain, hdr);
> 

This looks very good to me. Thank you very much for creating a
solution that integrates well.

Reinette


^ permalink raw reply	[flat|nested] 331+ messages in thread

* Re: [PATCH v5 2/8] x86/resctrl: Prepare for different scope for control/monitor operations
  2023-09-25 23:24       ` Reinette Chatre
@ 2023-09-28 17:21         ` Tony Luck
  0 siblings, 0 replies; 331+ messages in thread
From: Tony Luck @ 2023-09-28 17:21 UTC (permalink / raw)
  To: Reinette Chatre
  Cc: Fenghua Yu, Peter Newman, Jonathan Corbet, Shuah Khan, x86,
	Shaopeng Tan, James Morse, Jamie Iles, Babu Moger, Randy Dunlap,
	linux-kernel, linux-doc, patches

On Mon, Sep 25, 2023 at 04:24:10PM -0700, Reinette Chatre wrote:
> Hi Tony,
> 
> On 8/29/2023 4:44 PM, Tony Luck wrote:
> > Existing resctrl assumes that control and monitor operations on a
> 
> Please remove "Existing", it can just be "resctrl assumes ..."

Will do.

> 
> > resource are performed at the same scope.
> > 
> > Prepare for systems that use different scope (specifically L3 scope
> > for cache control and NODE scope for cache occupancy and memory
> > bandwidth monitoring).
> > 
> > Create separate domain lists for control and monitor operations.
> > 
> > No important functional change. But note that errors during
> > initialization of either control or monitor functions on a domain would
> > previously result in that domain being excluded from both control and
> > monitor operations. Now the domains are allocated independently it is
> > no longer required to disable both control and monitor operations if
> > either fail.
> > 
> > Signed-off-by: Tony Luck <tony.luck@intel.com>
> > ---
> >  include/linux/resctrl.h                   |  16 +-
> >  arch/x86/kernel/cpu/resctrl/internal.h    |   6 +-
> >  arch/x86/kernel/cpu/resctrl/core.c        | 227 +++++++++++++++-------
> >  arch/x86/kernel/cpu/resctrl/ctrlmondata.c |   2 +-
> >  arch/x86/kernel/cpu/resctrl/monitor.c     |   2 +-
> >  arch/x86/kernel/cpu/resctrl/pseudo_lock.c |   2 +-
> >  arch/x86/kernel/cpu/resctrl/rdtgroup.c    |  32 +--
> >  7 files changed, 199 insertions(+), 88 deletions(-)
> > 
> > diff --git a/include/linux/resctrl.h b/include/linux/resctrl.h
> > index 2db1244ae642..33856943a787 100644
> > --- a/include/linux/resctrl.h
> > +++ b/include/linux/resctrl.h
> > @@ -155,10 +155,12 @@ enum resctrl_scope {
> >   * @alloc_capable:	Is allocation available on this machine
> >   * @mon_capable:	Is monitor feature available on this machine
> >   * @num_rmid:		Number of RMIDs available
> > - * @scope:		Scope of this resource
> > + * @ctrl_scope:		Scope of this resource for control functions
> > + * @mon_scope:		Scope of this resource for monitor functions
> >   * @cache:		Cache allocation related data
> >   * @membw:		If the component has bandwidth controls, their properties.
> > - * @domains:		All domains for this resource
> > + * @domains:		Control domains for this resource
> > + * @mon_domains:	Monitor domains for this resource
> >   * @name:		Name to use in "schemata" file.
> >   * @data_width:		Character width of data when displaying
> >   * @default_ctrl:	Specifies default cache cbm or memory B/W percent.
> > @@ -173,10 +175,12 @@ struct rdt_resource {
> >  	bool			alloc_capable;
> >  	bool			mon_capable;
> >  	int			num_rmid;
> > -	enum resctrl_scope	scope;
> > +	enum resctrl_scope	ctrl_scope;
> > +	enum resctrl_scope	mon_scope;
> >  	struct resctrl_cache	cache;
> >  	struct resctrl_membw	membw;
> >  	struct list_head	domains;
> > +	struct list_head	mondomains;
> 
> kerneldoc is "mon_domains" while member is "mondomains".
> I do think to be consistent with other members that "mon_domains"
> would be appropriate. I also think it will be very helpful it
> domains is renamed to "ctrl_domains" to more accurately match
> "ctrl_scope" and what it represents.

Next version will use "mon_domains" (and "mon_" in all the associated
functions and macros). I'm also changing to "ctrl_" as suggested.

> >  	char			*name;
> >  	int			data_width;
> >  	u32			default_ctrl;
> > @@ -222,8 +226,10 @@ int resctrl_arch_update_one(struct rdt_resource *r, struct rdt_domain *d,
> >  
> >  u32 resctrl_arch_get_config(struct rdt_resource *r, struct rdt_domain *d,
> >  			    u32 closid, enum resctrl_conf_type type);
> > -int resctrl_online_domain(struct rdt_resource *r, struct rdt_domain *d);
> > -void resctrl_offline_domain(struct rdt_resource *r, struct rdt_domain *d);
> > +int resctrl_online_ctrl_domain(struct rdt_resource *r, struct rdt_domain *d);
> > +int resctrl_online_mon_domain(struct rdt_resource *r, struct rdt_domain *d);
> > +void resctrl_offline_ctrl_domain(struct rdt_resource *r, struct rdt_domain *d);
> > +void resctrl_offline_mon_domain(struct rdt_resource *r, struct rdt_domain *d);
> >  
> >  /**
> >   * resctrl_arch_rmid_read() - Read the eventid counter corresponding to rmid
> > diff --git a/arch/x86/kernel/cpu/resctrl/internal.h b/arch/x86/kernel/cpu/resctrl/internal.h
> > index 85ceaf9a31ac..31a5fc3b717f 100644
> > --- a/arch/x86/kernel/cpu/resctrl/internal.h
> > +++ b/arch/x86/kernel/cpu/resctrl/internal.h
> > @@ -511,8 +511,10 @@ void rdtgroup_kn_unlock(struct kernfs_node *kn);
> >  int rdtgroup_kn_mode_restrict(struct rdtgroup *r, const char *name);
> >  int rdtgroup_kn_mode_restore(struct rdtgroup *r, const char *name,
> >  			     umode_t mask);
> > -struct rdt_domain *rdt_find_domain(struct rdt_resource *r, int id,
> > -				   struct list_head **pos);
> > +struct rdt_domain *rdt_find_ctrldomain(struct list_head *h, int id,
> > +				       struct list_head **pos);
> > +struct rdt_domain *rdt_find_mondomain(struct list_head *h, int id,
> > +				      struct list_head **pos);
> 
> This is not what I expected after our previous discussion. It
> should not be necessary for caller to be aware of the different
> lists. It looks to me like the original parameters of (struct rdt_resource *r,
> int id, struct list_head **pos) can be maintained.

Discussed in other e-mail. There is now a "struct rdt_domain_hdr"
embedded in the control and monitor versions of the domain structure.
rdt_find_domain() operates on that, and callers use container_of() to
get to the enclosing structure.

> 
> >  ssize_t rdtgroup_schemata_write(struct kernfs_open_file *of,
> >  				char *buf, size_t nbytes, loff_t off);
> >  int rdtgroup_schemata_show(struct kernfs_open_file *of,
> > diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
> > index 0d3bae523ecb..97f6f9715fdb 100644
> > --- a/arch/x86/kernel/cpu/resctrl/core.c
> > +++ b/arch/x86/kernel/cpu/resctrl/core.c
> > @@ -57,7 +57,7 @@ static void
> >  mba_wrmsr_amd(struct rdt_domain *d, struct msr_param *m,
> >  	      struct rdt_resource *r);
> >  
> > -#define domain_init(id) LIST_HEAD_INIT(rdt_resources_all[id].r_resctrl.domains)
> > +#define domain_init(id, field) LIST_HEAD_INIT(rdt_resources_all[id].r_resctrl.field)
> >  
> 
> It may make the code easier to read if this is split into
> a "mon_domain_init()" and "ctrl_domain_init()"

Done
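
i.e. roughly this in v6 (exact form may still change a little):

#define ctrl_domain_init(id) LIST_HEAD_INIT(rdt_resources_all[id].r_resctrl.ctrl_domains)
#define mon_domain_init(id) LIST_HEAD_INIT(rdt_resources_all[id].r_resctrl.mon_domains)

The rdt_resources_all[] initializers get updated to use whichever one
applies.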

> 
> >  struct rdt_hw_resource rdt_resources_all[] = {
> >  	[RDT_RESOURCE_L3] =
> > @@ -65,8 +65,10 @@ struct rdt_hw_resource rdt_resources_all[] = {
> >  		.r_resctrl = {
> >  			.rid			= RDT_RESOURCE_L3,
> >  			.name			= "L3",
> > -			.scope			= RESCTRL_L3_CACHE,
> > -			.domains		= domain_init(RDT_RESOURCE_L3),
> > +			.ctrl_scope		= RESCTRL_L3_CACHE,
> > +			.mon_scope		= RESCTRL_L3_CACHE,
> > +			.domains		= domain_init(RDT_RESOURCE_L3, domains),
> > +			.mondomains		= domain_init(RDT_RESOURCE_L3, mondomains),
> >  			.parse_ctrlval		= parse_cbm,
> >  			.format_str		= "%d=%0*x",
> >  			.fflags			= RFTYPE_RES_CACHE,
> > @@ -79,8 +81,8 @@ struct rdt_hw_resource rdt_resources_all[] = {
> >  		.r_resctrl = {
> >  			.rid			= RDT_RESOURCE_L2,
> >  			.name			= "L2",
> > -			.scope			= RESCTRL_L2_CACHE,
> > -			.domains		= domain_init(RDT_RESOURCE_L2),
> > +			.ctrl_scope		= RESCTRL_L2_CACHE,
> > +			.domains		= domain_init(RDT_RESOURCE_L2, domains),
> >  			.parse_ctrlval		= parse_cbm,
> >  			.format_str		= "%d=%0*x",
> >  			.fflags			= RFTYPE_RES_CACHE,
> > @@ -93,8 +95,8 @@ struct rdt_hw_resource rdt_resources_all[] = {
> >  		.r_resctrl = {
> >  			.rid			= RDT_RESOURCE_MBA,
> >  			.name			= "MB",
> > -			.scope			= RESCTRL_L3_CACHE,
> > -			.domains		= domain_init(RDT_RESOURCE_MBA),
> > +			.ctrl_scope		= RESCTRL_L3_CACHE,
> > +			.domains		= domain_init(RDT_RESOURCE_MBA, domains),
> >  			.parse_ctrlval		= parse_bw,
> >  			.format_str		= "%d=%*u",
> >  			.fflags			= RFTYPE_RES_MB,
> > @@ -105,8 +107,8 @@ struct rdt_hw_resource rdt_resources_all[] = {
> >  		.r_resctrl = {
> >  			.rid			= RDT_RESOURCE_SMBA,
> >  			.name			= "SMBA",
> > -			.scope			= RESCTRL_L3_CACHE,
> > -			.domains		= domain_init(RDT_RESOURCE_SMBA),
> > +			.ctrl_scope		= RESCTRL_L3_CACHE,
> > +			.domains		= domain_init(RDT_RESOURCE_SMBA, domains),
> >  			.parse_ctrlval		= parse_bw,
> >  			.format_str		= "%d=%*u",
> >  			.fflags			= RFTYPE_RES_MB,
> > @@ -384,15 +386,16 @@ void rdt_ctrl_update(void *arg)
> >  }
> >  
> >  /*
> > - * rdt_find_domain - Find a domain in a resource that matches input resource id
> > + * __rdt_find_domain - Find a domain in either the list of control or
> > + * monitor domains that matches input resource id
> >   *
> >   * Search resource r's domain list to find the resource id. If the resource
> >   * id is found in a domain, return the domain. Otherwise, if requested by
> >   * caller, return the first domain whose id is bigger than the input id.
> >   * The domain list is sorted by id in ascending order.
> >   */
> > -struct rdt_domain *rdt_find_domain(struct rdt_resource *r, int id,
> > -				   struct list_head **pos)
> > +static void *__rdt_find_domain(struct list_head *h, int id,
> > +			       struct list_head **pos)
> >  {
> >  	struct rdt_domain *d;
> >  	struct list_head *l;
> > @@ -400,7 +403,7 @@ struct rdt_domain *rdt_find_domain(struct rdt_resource *r, int id,
> >  	if (id < 0)
> >  		return ERR_PTR(-ENODEV);
> >  
> > -	list_for_each(l, &r->domains) {
> > +	list_for_each(l, h) {
> >  		d = list_entry(l, struct rdt_domain, list);
> >  		/* When id is found, return its domain. */
> >  		if (id == d->id)
> > @@ -416,6 +419,18 @@ struct rdt_domain *rdt_find_domain(struct rdt_resource *r, int id,
> >  	return NULL;
> >  }
> >  
> > +struct rdt_domain *rdt_find_ctrldomain(struct list_head *h, int id,
> > +				       struct list_head **pos)
> > +{
> > +	return __rdt_find_domain(h, id, pos);
> > +}
> > +
> > +struct rdt_domain *rdt_find_mondomain(struct list_head *h, int id,
> > +				      struct list_head **pos)
> > +{
> > +	return __rdt_find_domain(h, id, pos);
> > +}
> > +
> >  static void setup_default_ctrlval(struct rdt_resource *r, u32 *dc)
> >  {
> >  	struct rdt_hw_resource *hw_res = resctrl_to_arch_res(r);
> > @@ -431,10 +446,15 @@ static void setup_default_ctrlval(struct rdt_resource *r, u32 *dc)
> >  }
> >  
> >  static void domain_free(struct rdt_hw_domain *hw_dom)
> > +{
> > +	kfree(hw_dom->ctrl_val);
> > +	kfree(hw_dom);
> > +}
> > +
> > +static void mondomain_free(struct rdt_hw_domain *hw_dom)
> >  {
> >  	kfree(hw_dom->arch_mbm_total);
> >  	kfree(hw_dom->arch_mbm_local);
> > -	kfree(hw_dom->ctrl_val);
> >  	kfree(hw_dom);
> >  }
> >  
> > @@ -502,6 +522,93 @@ static int get_domain_id_from_scope(int cpu, enum resctrl_scope scope)
> >  	return -1;
> >  }
> >  
> > +static void domain_add_cpu_ctrl(int cpu, struct rdt_resource *r)
> > +{
> > +	int id = get_domain_id_from_scope(cpu, r->ctrl_scope);
> > +	struct list_head *add_pos = NULL;
> > +	struct rdt_hw_domain *hw_dom;
> > +	struct rdt_domain *d;
> > +	int err;
> > +
> > +	d = rdt_find_ctrldomain(&r->domains, id, &add_pos);
> > +	if (IS_ERR(d)) {
> > +		pr_warn("Couldn't find scope id=%d for CPU %d\n", id, cpu);
> 
> I am not sure here ... this generates identical error messages
> for control and monitor domain. How will the user know what failed?
> Also, how does printing id help? If I understand correctly it can
> only be -1 when the above prints.

Improved error messages in next version. Separate ones for failure in
get_domain_id_from_scope() and rdt_find_domain(). Each has "control"
or "monitor" in the message to help the user pinpoint exactly what
went wrong.
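
Something along these lines (exact wording still subject to change):

	pr_warn("Can't find control domain id=%d for CPU %d\n", id, cpu);
	pr_warn("Can't find monitor domain id=%d for CPU %d\n", id, cpu);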

> > +		return;
> > +	}
> > +
> > +	if (d) {
> > +		cpumask_set_cpu(cpu, &d->cpu_mask);
> > +		if (r->cache.arch_has_per_cpu_cfg)
> > +			rdt_domain_reconfigure_cdp(r);
> > +		return;
> > +	}
> > +
> > +	hw_dom = kzalloc_node(sizeof(*hw_dom), GFP_KERNEL, cpu_to_node(cpu));
> > +	if (!hw_dom)
> > +		return;
> > +
> > +	d = &hw_dom->d_resctrl;
> > +	d->id = id;
> > +	cpumask_set_cpu(cpu, &d->cpu_mask);
> > +
> > +	rdt_domain_reconfigure_cdp(r);
> > +
> > +	if (domain_setup_ctrlval(r, d)) {
> > +		domain_free(hw_dom);
> > +		return;
> > +	}
> > +
> > +	list_add_tail(&d->list, add_pos);
> > +
> > +	err = resctrl_online_ctrl_domain(r, d);
> > +	if (err) {
> > +		list_del(&d->list);
> > +		domain_free(hw_dom);
> > +	}
> > +}
> > +
> > +static void domain_add_cpu_mon(int cpu, struct rdt_resource *r)
> > +{
> > +	int id = get_domain_id_from_scope(cpu, r->mon_scope);
> > +	struct rdt_hw_domain *hw_mondom;
> > +	struct list_head *add_pos = NULL;
> 
> Please maintain reverse fir ordering. Please check all code.

I believe I got them all (for functions that are touched by this
series - there are a handful of places in resctrl code that don't
have reverse fir ordering - I didn't touch those).
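
(For anyone new to the convention: "reverse fir" just means the local
declarations are ordered longest line first, e.g.

	struct rdt_hw_mondomain *hw_mondom;
	struct list_head *add_pos = NULL;
	struct rdt_domain *d;
	int err;

so the block tapers like an upside-down fir tree.)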

> > +	struct rdt_domain *d;
> > +	int err;
> > +
> > +	d = rdt_find_mondomain(&r->mondomains, id, &add_pos);
> > +	if (IS_ERR(d)) {
> > +		pr_warn("Couldn't find scope id=%d for CPU %d\n", id, cpu);
> > +		return;
> > +	}
> > +
> > +	if (d) {
> > +		cpumask_set_cpu(cpu, &d->cpu_mask);
> > +
> 
> This is an unnecessary empty line that distracts from how the rest
> of the function looks.

Deleted it.

> 
> > +		return;
> > +	}
> > +
> > +	hw_mondom = kzalloc_node(sizeof(*hw_mondom), GFP_KERNEL, cpu_to_node(cpu));
> > +	if (!hw_mondom)
> > +		return;
> > +
> > +	d = &hw_mondom->d_resctrl;
> > +	d->id = id;
> > +	cpumask_set_cpu(cpu, &d->cpu_mask);
> > +
> > +	if (arch_domain_mbm_alloc(r->num_rmid, hw_mondom)) {
> > +		mondomain_free(hw_mondom);
> > +		return;
> > +	}
> > +
> > +	list_add_tail(&d->list, add_pos);
> > +
> > +	err = resctrl_online_mon_domain(r, d);
> > +	if (err) {
> > +		list_del(&d->list);
> > +		mondomain_free(hw_mondom);
> > +	}
> > +}
> > +
> >  /*
> >   * domain_add_cpu - Add a cpu to a resource's domain list.
> >   *
> 
> Note that this leaves the comments about list management here while all
> the list management code is moved away.

Moved the comment in front of rdt_find_domain() which does most of the
work. Small edit in the comment to say that callers do the rest.

> 
> > @@ -517,70 +624,28 @@ static int get_domain_id_from_scope(int cpu, enum resctrl_scope scope)
> >   */
> >  static void domain_add_cpu(int cpu, struct rdt_resource *r)
> >  {
> > -	int id = get_domain_id_from_scope(cpu, r->scope);
> > -	struct list_head *add_pos = NULL;
> > -	struct rdt_hw_domain *hw_dom;
> > -	struct rdt_domain *d;
> > -	int err;
> > -
> > -	d = rdt_find_domain(r, id, &add_pos);
> > -	if (IS_ERR(d)) {
> > -		pr_warn("Couldn't find cache id for CPU %d\n", cpu);
> > -		return;
> > -	}
> > -
> > -	if (d) {
> > -		cpumask_set_cpu(cpu, &d->cpu_mask);
> > -		if (r->cache.arch_has_per_cpu_cfg)
> > -			rdt_domain_reconfigure_cdp(r);
> > -		return;
> > -	}
> > -
> > -	hw_dom = kzalloc_node(sizeof(*hw_dom), GFP_KERNEL, cpu_to_node(cpu));
> > -	if (!hw_dom)
> > -		return;
> > -
> > -	d = &hw_dom->d_resctrl;
> > -	d->id = id;
> > -	cpumask_set_cpu(cpu, &d->cpu_mask);
> > -
> > -	rdt_domain_reconfigure_cdp(r);
> > -
> > -	if (r->alloc_capable && domain_setup_ctrlval(r, d)) {
> > -		domain_free(hw_dom);
> > -		return;
> > -	}
> > -
> > -	if (r->mon_capable && arch_domain_mbm_alloc(r->num_rmid, hw_dom)) {
> > -		domain_free(hw_dom);
> > -		return;
> > -	}
> > -
> > -	list_add_tail(&d->list, add_pos);
> > -
> > -	err = resctrl_online_domain(r, d);
> > -	if (err) {
> > -		list_del(&d->list);
> > -		domain_free(hw_dom);
> > -	}
> > +	if (r->alloc_capable)
> > +		domain_add_cpu_ctrl(cpu, r);
> > +	if (r->mon_capable)
> > +		domain_add_cpu_mon(cpu, r);
> >  }
> >  
> > -static void domain_remove_cpu(int cpu, struct rdt_resource *r)
> > +static void domain_remove_cpu_ctrl(int cpu, struct rdt_resource *r)
> >  {
> > -	int id = get_domain_id_from_scope(cpu, r->scope);
> > +	int id = get_domain_id_from_scope(cpu, r->ctrl_scope);
> >  	struct rdt_hw_domain *hw_dom;
> >  	struct rdt_domain *d;
> >  
> > -	d = rdt_find_domain(r, id, NULL);
> > +	d = rdt_find_ctrldomain(&r->domains, id, NULL);
> >  	if (IS_ERR_OR_NULL(d)) {
> > -		pr_warn("Couldn't find cache id for CPU %d\n", cpu);
> > +		pr_warn("Couldn't find scope id=%d for CPU %d\n", id, cpu);
> >  		return;
> >  	}
> >  	hw_dom = resctrl_to_arch_dom(d);
> >  
> >  	cpumask_clear_cpu(cpu, &d->cpu_mask);
> >  	if (cpumask_empty(&d->cpu_mask)) {
> > -		resctrl_offline_domain(r, d);
> > +		resctrl_offline_ctrl_domain(r, d);
> >  		list_del(&d->list);
> >  
> >  		/*
> > @@ -593,6 +658,30 @@ static void domain_remove_cpu(int cpu, struct rdt_resource *r)
> >  
> >  		return;
> >  	}
> > +}
> > +
> > +static void domain_remove_cpu_mon(int cpu, struct rdt_resource *r)
> > +{
> > +	int id = get_domain_id_from_scope(cpu, r->mon_scope);
> > +	struct rdt_hw_domain *hw_mondom;
> > +	struct rdt_domain *d;
> > +
> > +	d = rdt_find_mondomain(&r->mondomains, id, NULL);
> > +	if (IS_ERR_OR_NULL(d)) {
> > +		pr_warn("Couldn't find scope id=%d for CPU %d\n", id, cpu);
> > +		return;
> > +	}
> > +	hw_mondom = resctrl_to_arch_dom(d);
> > +
> > +	cpumask_clear_cpu(cpu, &d->cpu_mask);
> > +	if (cpumask_empty(&d->cpu_mask)) {
> > +		resctrl_offline_mon_domain(r, d);
> > +		list_del(&d->list);
> > +
> 
> This is an awkward empty line.

Deleted it.

> 
> > +		mondomain_free(hw_mondom);
> > +
> > +		return;
> > +	}
> >  
> >  	if (r == &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl) {
> >  		if (is_mbm_enabled() && cpu == d->mbm_work_cpu) {
> 
> ...
> 
> Reinette

^ permalink raw reply	[flat|nested] 331+ messages in thread

* Re: [PATCH v5 3/8] x86/resctrl: Split the rdt_domain structure
  2023-09-25 23:25       ` Reinette Chatre
  2023-09-26 18:46         ` Tony Luck
@ 2023-09-28 17:33         ` Tony Luck
  2023-09-28 18:42           ` Reinette Chatre
  1 sibling, 1 reply; 331+ messages in thread
From: Tony Luck @ 2023-09-28 17:33 UTC (permalink / raw)
  To: Reinette Chatre
  Cc: Fenghua Yu, Peter Newman, Jonathan Corbet, Shuah Khan, x86,
	Shaopeng Tan, James Morse, Jamie Iles, Babu Moger, Randy Dunlap,
	linux-kernel, linux-doc, patches

On Mon, Sep 25, 2023 at 04:25:15PM -0700, Reinette Chatre wrote:
> Hi Tony,
> 
> Subject:
> 
> x86/resctrl: Split the rdt_domain and rdt_hw_domain structures

Fixed subject as suggested.

> 
> 
> On 8/29/2023 4:44 PM, Tony Luck wrote:
> > The same rdt_domain structure is used for both control an monitor
> 
> "control an monitor" -> "control and monitor"

Fixed.

> 
> > functions. But this results in wasted memory as some of the fields
> > are only used by control functions, while most are only used for monitor
> > functions.
> > 
> > Create a new rdt_mondomain structure tailored explicitly for use in
> > monitor parts of the core. Slim down the rdt_domain structure by
> > removing the unused monitor fields.
> > 
> 
> Similar to the previous patch I think it will make the code
> easier to understand if the naming is clear for both
> monitoring and control structured. Why not rdt_mondomain
> and rdt_ctrldomain instead?

Done.

> 
> 
> > Similar breakout of struct rdt_hw_mondomain from struct rdt_hw_domain.
> 
> rdt_hw_mondomain and rdt_hw_ctrldomain?

Yes. rdt_hw_domain is split two ways and each gets an appropriate name.
I added an extra "_" to match shape of other names. So these are now
"rdt_hw_ctrl_domain" and "rdt_hw_mon_domain"

> > 
> > Signed-off-by: Tony Luck <tony.luck@intel.com>
> > ---
> >  include/linux/resctrl.h                   | 46 +++++++++++++++--------
> >  arch/x86/kernel/cpu/resctrl/internal.h    | 38 +++++++++++++------
> >  arch/x86/kernel/cpu/resctrl/core.c        | 18 ++++-----
> >  arch/x86/kernel/cpu/resctrl/ctrlmondata.c |  4 +-
> >  arch/x86/kernel/cpu/resctrl/monitor.c     | 40 ++++++++++----------
> >  arch/x86/kernel/cpu/resctrl/rdtgroup.c    | 24 ++++++------
> >  6 files changed, 101 insertions(+), 69 deletions(-)
> > 
> > diff --git a/include/linux/resctrl.h b/include/linux/resctrl.h
> > index 33856943a787..08382548571e 100644
> > --- a/include/linux/resctrl.h
> > +++ b/include/linux/resctrl.h
> > @@ -53,7 +53,29 @@ struct resctrl_staged_config {
> >  };
> >  
> >  /**
> > - * struct rdt_domain - group of CPUs sharing a resctrl resource
> > + * struct rdt_domain - group of CPUs sharing a resctrl control resource
> > + * @list:		all instances of this resource
> > + * @id:			unique id for this instance
> > + * @cpu_mask:		which CPUs share this resource
> > + * @plr:		pseudo-locked region (if any) associated with domain
> > + * @staged_config:	parsed configuration to be applied
> > + * @mbps_val:		When mba_sc is enabled, this holds the array of user
> > + *			specified control values for mba_sc in MBps, indexed
> > + *			by closid
> > + */
> > +struct rdt_domain {
> > +	// First three fields must match struct rdt_mondomain below.
> 
> Please avoid comments within declarations. Even so, could you please
> elaborate what the above means? Why do the first three fields have to
> match? I understand there is common code, for example, __rdt_find_domain()
> that operated on the same members of the two structs but does that
> require the members be in the same position in the struct?
> I understand that a comment may be required if position in the struct
> is important but I cannot see that it is.

Discussed in the other e-mail thread. Comments go away. The TWO (not three)
fields that must be common are now in an embedded "rdt_domain_hdr"
structure.
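
I.e.:

	struct rdt_domain_hdr {
		struct list_head	list;
		int			id;
	};

with the control and monitor domain structures each embedding it as
their first member.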

> 
> > +	struct list_head		list;
> > +	int				id;
> > +	struct cpumask			cpu_mask;
> > +
> > +	struct pseudo_lock_region	*plr;
> > +	struct resctrl_staged_config	staged_config[CDP_NUM_TYPES];
> > +	u32				*mbps_val;
> > +};
> > +
> > +/**
> > + * struct rdt_mondomain - group of CPUs sharing a resctrl monitor resource
> >   * @list:		all instances of this resource
> >   * @id:			unique id for this instance
> >   * @cpu_mask:		which CPUs share this resource
> > @@ -64,16 +86,13 @@ struct resctrl_staged_config {
> >   * @cqm_limbo:		worker to periodically read CQM h/w counters
> >   * @mbm_work_cpu:	worker CPU for MBM h/w counters
> >   * @cqm_work_cpu:	worker CPU for CQM h/w counters
> > - * @plr:		pseudo-locked region (if any) associated with domain
> > - * @staged_config:	parsed configuration to be applied
> > - * @mbps_val:		When mba_sc is enabled, this holds the array of user
> > - *			specified control values for mba_sc in MBps, indexed
> > - *			by closid
> >   */
> > -struct rdt_domain {
> > +struct rdt_mondomain {
> > +	// First three fields must match struct rdt_domain above.
> 
> Same comment.

Same solution.
> 
> >  	struct list_head		list;
> >  	int				id;
> >  	struct cpumask			cpu_mask;
> > +
> >  	unsigned long			*rmid_busy_llc;
> >  	struct mbm_state		*mbm_total;
> >  	struct mbm_state		*mbm_local;
> > @@ -81,9 +100,6 @@ struct rdt_domain {
> >  	struct delayed_work		cqm_limbo;
> >  	int				mbm_work_cpu;
> >  	int				cqm_work_cpu;
> > -	struct pseudo_lock_region	*plr;
> > -	struct resctrl_staged_config	staged_config[CDP_NUM_TYPES];
> > -	u32				*mbps_val;
> >  };
> >  
> >  /**
> 
> ...
> 
> > diff --git a/arch/x86/kernel/cpu/resctrl/ctrlmondata.c b/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
> > index 468c1815edfd..5167ac9cbe98 100644
> > --- a/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
> > +++ b/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
> > @@ -521,7 +521,7 @@ int rdtgroup_schemata_show(struct kernfs_open_file *of,
> >  }
> >  
> >  void mon_event_read(struct rmid_read *rr, struct rdt_resource *r,
> > -		    struct rdt_domain *d, struct rdtgroup *rdtgrp,
> > +		    struct rdt_mondomain *d, struct rdtgroup *rdtgrp,
> >  		    int evtid, int first)
> >  {
> >  	/*
> > @@ -544,7 +544,7 @@ int rdtgroup_mondata_show(struct seq_file *m, void *arg)
> >  	struct rdtgroup *rdtgrp;
> >  	struct rdt_resource *r;
> >  	union mon_data_bits md;
> > -	struct rdt_domain *d;
> > +	struct rdt_mondomain *d;
> 
> Reverse fir order.

Fixed.

> 
> >  	struct rmid_read rr;
> >  	int ret = 0;
> >  
> > diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c
> > index 66beca785535..42262d59ef9b 100644
> > --- a/arch/x86/kernel/cpu/resctrl/monitor.c
> > +++ b/arch/x86/kernel/cpu/resctrl/monitor.c
> > @@ -170,7 +170,7 @@ static int __rmid_read(u32 rmid, enum resctrl_event_id eventid, u64 *val)
> >  	return 0;
> >  }
> >  
> > -static struct arch_mbm_state *get_arch_mbm_state(struct rdt_hw_domain *hw_dom,
> > +static struct arch_mbm_state *get_arch_mbm_state(struct rdt_hw_mondomain *hw_dom,
> >  						 u32 rmid,
> >  						 enum resctrl_event_id eventid)
> >  {
> > @@ -189,10 +189,10 @@ static struct arch_mbm_state *get_arch_mbm_state(struct rdt_hw_domain *hw_dom,
> >  	return NULL;
> >  }
> >  
> > -void resctrl_arch_reset_rmid(struct rdt_resource *r, struct rdt_domain *d,
> > +void resctrl_arch_reset_rmid(struct rdt_resource *r, struct rdt_mondomain *d,
> >  			     u32 rmid, enum resctrl_event_id eventid)
> >  {
> > -	struct rdt_hw_domain *hw_dom = resctrl_to_arch_dom(d);
> > +	struct rdt_hw_mondomain *hw_dom = resctrl_to_arch_mondom(d);
> >  	struct arch_mbm_state *am;
> >  
> >  	am = get_arch_mbm_state(hw_dom, rmid, eventid);
> > @@ -208,9 +208,9 @@ void resctrl_arch_reset_rmid(struct rdt_resource *r, struct rdt_domain *d,
> >   * Assumes that hardware counters are also reset and thus that there is
> >   * no need to record initial non-zero counts.
> >   */
> > -void resctrl_arch_reset_rmid_all(struct rdt_resource *r, struct rdt_domain *d)
> > +void resctrl_arch_reset_rmid_all(struct rdt_resource *r, struct rdt_mondomain *d)
> >  {
> > -	struct rdt_hw_domain *hw_dom = resctrl_to_arch_dom(d);
> > +	struct rdt_hw_mondomain *hw_dom = resctrl_to_arch_mondom(d);
> >  
> >  	if (is_mbm_total_enabled())
> >  		memset(hw_dom->arch_mbm_total, 0,
> > @@ -229,11 +229,11 @@ static u64 mbm_overflow_count(u64 prev_msr, u64 cur_msr, unsigned int width)
> >  	return chunks >> shift;
> >  }
> >  
> > -int resctrl_arch_rmid_read(struct rdt_resource *r, struct rdt_domain *d,
> > +int resctrl_arch_rmid_read(struct rdt_resource *r, struct rdt_mondomain *d,
> >  			   u32 rmid, enum resctrl_event_id eventid, u64 *val)
> >  {
> >  	struct rdt_hw_resource *hw_res = resctrl_to_arch_res(r);
> > -	struct rdt_hw_domain *hw_dom = resctrl_to_arch_dom(d);
> > +	struct rdt_hw_mondomain *hw_mondom = resctrl_to_arch_mondom(d);
> 
> Reverse fir.

Fixed.

> 
> >  	struct arch_mbm_state *am;
> >  	u64 msr_val, chunks;
> >  	int ret;
> > @@ -245,7 +245,7 @@ int resctrl_arch_rmid_read(struct rdt_resource *r, struct rdt_domain *d,
> >  	if (ret)
> >  		return ret;
> >  
> > -	am = get_arch_mbm_state(hw_dom, rmid, eventid);
> > +	am = get_arch_mbm_state(hw_mondom, rmid, eventid);
> >  	if (am) {
> >  		am->chunks += mbm_overflow_count(am->prev_msr, msr_val,
> >  						 hw_res->mbm_width);
> > @@ -266,7 +266,7 @@ int resctrl_arch_rmid_read(struct rdt_resource *r, struct rdt_domain *d,
> >   * decrement the count. If the busy count gets to zero on an RMID, we
> >   * free the RMID
> >   */
> > -void __check_limbo(struct rdt_domain *d, bool force_free)
> > +void __check_limbo(struct rdt_mondomain *d, bool force_free)
> >  {
> >  	struct rdt_resource *r = &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl;
> >  	struct rmid_entry *entry;
> > @@ -305,7 +305,7 @@ void __check_limbo(struct rdt_domain *d, bool force_free)
> >  	}
> >  }
> >  
> > -bool has_busy_rmid(struct rdt_resource *r, struct rdt_domain *d)
> > +bool has_busy_rmid(struct rdt_resource *r, struct rdt_mondomain *d)
> >  {
> >  	return find_first_bit(d->rmid_busy_llc, r->num_rmid) != r->num_rmid;
> >  }
> > @@ -334,7 +334,7 @@ int alloc_rmid(void)
> >  static void add_rmid_to_limbo(struct rmid_entry *entry)
> >  {
> >  	struct rdt_resource *r = &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl;
> > -	struct rdt_domain *d;
> > +	struct rdt_mondomain *d;
> >  	int cpu, err;
> >  	u64 val = 0;
> >  
> > @@ -383,7 +383,7 @@ void free_rmid(u32 rmid)
> >  		list_add_tail(&entry->list, &rmid_free_lru);
> >  }
> >  
> > -static struct mbm_state *get_mbm_state(struct rdt_domain *d, u32 rmid,
> > +static struct mbm_state *get_mbm_state(struct rdt_mondomain *d, u32 rmid,
> >  				       enum resctrl_event_id evtid)
> >  {
> >  	switch (evtid) {
> > @@ -516,7 +516,7 @@ void mon_event_count(void *info)
> >   * throttle MSRs already have low percentage values.  To avoid
> >   * unnecessarily restricting such rdtgroups, we also increase the bandwidth.
> >   */
> > -static void update_mba_bw(struct rdtgroup *rgrp, struct rdt_domain *dom_mbm)
> > +static void update_mba_bw(struct rdtgroup *rgrp, struct rdt_mondomain *dom_mbm)
> >  {
> >  	u32 closid, rmid, cur_msr_val, new_msr_val;
> >  	struct mbm_state *pmbm_data, *cmbm_data;
> > @@ -600,7 +600,7 @@ static void update_mba_bw(struct rdtgroup *rgrp, struct rdt_domain *dom_mbm)
> >  	}
> >  }
> >  
> > -static void mbm_update(struct rdt_resource *r, struct rdt_domain *d, int rmid)
> > +static void mbm_update(struct rdt_resource *r, struct rdt_mondomain *d, int rmid)
> >  {
> >  	struct rmid_read rr;
> >  
> > @@ -641,12 +641,12 @@ void cqm_handle_limbo(struct work_struct *work)
> >  	unsigned long delay = msecs_to_jiffies(CQM_LIMBOCHECK_INTERVAL);
> >  	int cpu = smp_processor_id();
> >  	struct rdt_resource *r;
> > -	struct rdt_domain *d;
> > +	struct rdt_mondomain *d;
> 
> Reverse fir (Please check all code).

Fixed. (I wrote a simple awk script to find these (as I'm
obviously bad at noticing them). Included at end of this
message).

> 
> Reinette


#!/usr/bin/awk -f
# Flag functions whose local variable declarations are not in
# "reverse fir tree" (longest line first) order.

BEGIN {
	keyw["bool"] = 1
	keyw["char"] = 1
	keyw["enum"] = 1
	keyw["int"] = 1
	keyw["long"] = 1
	keyw["short"] = 1
	keyw["static"] = 1
	keyw["struct"] = 1
	keyw["u16"] = 1
	keyw["u32"] = 1
	keyw["u64"] = 1
	keyw["u8"] = 1
	keyw["unsigned"] = 1
	keyw["void"] = 1
}

{
	source[NR] = $0

	# A "{" alone on a line marks the start of a function body.
	if ($0 == "{") {
		infunction = NR
		next
	}

	# A "}" alone on a line marks the end of a function body.
	if ($0 == "}") {
		infunction = 0
		next
	}

	# Lines starting with a type keyword are local declarations.
	# Each one should be no longer than the line before it.
	if (infunction && ($1 in keyw)) {
		inlocals++
		if (inlocals > 1 && length(source[NR - 1]) < length)
			badfir = 1
		next
	}

	# A blank line ends the declarations; report any violation.
	if (inlocals && NF == 0) {
		if (badfir) {
			print "==== BAD FIR ===="
			for (i = infunction - 2; i < NR; i++)
				printf("%4d: %s\n", i, source[i])
		}
		inlocals = 0
		badfir = 0
	}
}

^ permalink raw reply	[flat|nested] 331+ messages in thread

* Re: [PATCH v5 4/8] x86/resctrl: Add node-scope to the options for feature scope
  2023-09-25 23:25       ` Reinette Chatre
@ 2023-09-28 17:42         ` Tony Luck
  2023-09-28 18:44           ` Reinette Chatre
  0 siblings, 1 reply; 331+ messages in thread
From: Tony Luck @ 2023-09-28 17:42 UTC (permalink / raw)
  To: Reinette Chatre
  Cc: Fenghua Yu, Peter Newman, Jonathan Corbet, Shuah Khan, x86,
	Shaopeng Tan, James Morse, Jamie Iles, Babu Moger, Randy Dunlap,
	linux-kernel, linux-doc, patches

On Mon, Sep 25, 2023 at 04:25:54PM -0700, Reinette Chatre wrote:
> Hi Tony,
> 
> On 8/29/2023 4:44 PM, Tony Luck wrote:
> > Currently supported resctrl features are all domain scoped the same as the
> > scope of the L2 or L3 caches.
> 
> fyi ... this patch series seems to use the terms "resctrl feature"
> and "resctrl resource" interchangeably and it is not always clear
> if the terms mean something different.

I think a "resctrl feature" is a h/w control or monitor feature. A
"resctrl resource" is "struct rdt_resource" (which may have more than
one "resctrl feature" attached to it. E.g. the RDT_RESOURCE_L3 resource
has L3 CAT, MBM, CQM attached).

> 
> > 
> > Add "node" as a new option for domain scope.
> 
> Could the commit message please get a snippet about what "node"
> represents and why this new scope is needed?

Yes. I've added a note.

> 
> > Signed-off-by: Tony Luck <tony.luck@intel.com>
> > ---
> >  include/linux/resctrl.h            | 1 +
> >  arch/x86/kernel/cpu/resctrl/core.c | 2 ++
> >  2 files changed, 3 insertions(+)
> > 
> > diff --git a/include/linux/resctrl.h b/include/linux/resctrl.h
> > index 08382548571e..f55cf7afd4eb 100644
> > --- a/include/linux/resctrl.h
> > +++ b/include/linux/resctrl.h
> > @@ -163,6 +163,7 @@ struct resctrl_schema;
> >  enum resctrl_scope {
> >  	RESCTRL_L3_CACHE,
> >  	RESCTRL_L2_CACHE,
> > +	RESCTRL_NODE,
> >  };
> >  
> >  /**
> > diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
> > index 3e08aa04a7ff..9fcc264fac6c 100644
> > --- a/arch/x86/kernel/cpu/resctrl/core.c
> > +++ b/arch/x86/kernel/cpu/resctrl/core.c
> > @@ -514,6 +514,8 @@ static int get_domain_id_from_scope(int cpu, enum resctrl_scope scope)
> >  		return get_cpu_cacheinfo_id(cpu, 3);
> >  	case RESCTRL_L2_CACHE:
> >  		return get_cpu_cacheinfo_id(cpu, 2);
> > +	case RESCTRL_NODE:
> > +		return cpu_to_node(cpu);
> >  	default:
> >  		WARN_ON_ONCE(1);
> >  		break;
> 
> 
> Reinette

^ permalink raw reply	[flat|nested] 331+ messages in thread

* Re: [PATCH v5 5/8] x86/resctrl: Introduce snc_nodes_per_l3_cache
  2023-09-25 23:27       ` Reinette Chatre
@ 2023-09-28 17:48         ` Tony Luck
  0 siblings, 0 replies; 331+ messages in thread
From: Tony Luck @ 2023-09-28 17:48 UTC (permalink / raw)
  To: Reinette Chatre
  Cc: Fenghua Yu, Peter Newman, Jonathan Corbet, Shuah Khan, x86,
	Shaopeng Tan, James Morse, Jamie Iles, Babu Moger, Randy Dunlap,
	linux-kernel, linux-doc, patches

On Mon, Sep 25, 2023 at 04:27:45PM -0700, Reinette Chatre wrote:
> Hi Tony,
> 
> On 8/29/2023 4:44 PM, Tony Luck wrote:
> 
> Could the commit message please provide a brief overview of what SNC is
> before jumping to the things needed to support it?

Ok. Added an overview.

> 
> > Intel Sub-NUMA Cluster mode requires several changes in resctrl
> 
> I think the intention is to introduce the acronym here so maybe:
> "Intel Sub-NUMA Cluster (SNC) ..."

Commit now starts with this definition.

> 
> > behavior for correct operation.
> > 
> > Add a global integer "snc_nodes_per_l3_cache" that will show how many
> > SNC nodes share each L3 cache. When this is "1", SNC mode is either
> > not implemented, or not enabled.
> > 
> > A later patch will detect SNC mode and set snc_nodes_per_l3_cache to
> > the appropriate value. For now it remains at the default "1" to
> > indicate SNC mode is not active.
> > 
> > Code that needs to take action when SNC is enabled is:
> > 1) The number of logical RMIDs available for use is the number of
> >    physical RMIDs divided by the number of SNC nodes.
> 
> Could this maybe be "... number of SNC nodes per L3 cache" to be
> specific? Even so, this jumps into supporting logical RMIDs and 
> physical RMIDs without introducing what logical vs physical means.
> Is this something that can be added to the intro of this commit message?

Added that to be specific. Also added more preamble text to set
up context.

> 
> > 2) Likewise the "mon_scale" value must be adjusted for the number
> >    of SNC nodes.
> > 3) When reading an RMID counter code must adjust from the logical
> >    RMID used to the physical RMID value that must be loaded into
> >    the IA32_QM_EVTSEL MSR.
> > 4) The L3 cache is divided between the SNC nodes. So the value
> >    reported in the resctrl "size" file is adjusted.
> > 5) The "-o mba_MBps" mount option must be disabled in SNC mode
> >    because the monitoring is being done per SNC node, while the
> >    bandwidth allocation is still done at the L3 cache scope.
> 
> This motivation for disabling is not clear to me. Why is only
> mba_MBps impacted? MBA is also at the L3 scope and it is not
> disabled. Neither is cache allocation that remains at L3
> scope with its monitoring moving to node scope.

Added text for why (essentially the s/w feedback loop now has
independent MBM inputs from each SNC node, but still only one
MBA control at L3 cache scope. The feedback code can't do anything
useful if one SNC node says "I'm running too fast" while another
node sharing the same L3 says "I'm running too slow").

> 
> > 
> > Signed-off-by: Tony Luck <tony.luck@intel.com>
> > ---
> >  arch/x86/kernel/cpu/resctrl/internal.h |  2 ++
> >  arch/x86/kernel/cpu/resctrl/core.c     |  7 +++++++
> >  arch/x86/kernel/cpu/resctrl/monitor.c  | 16 +++++++++++++---
> >  arch/x86/kernel/cpu/resctrl/rdtgroup.c |  4 ++--
> >  4 files changed, 24 insertions(+), 5 deletions(-)
> > 
> > diff --git a/arch/x86/kernel/cpu/resctrl/internal.h b/arch/x86/kernel/cpu/resctrl/internal.h
> > index c61fd6709730..326ca6b3688a 100644
> > --- a/arch/x86/kernel/cpu/resctrl/internal.h
> > +++ b/arch/x86/kernel/cpu/resctrl/internal.h
> > @@ -446,6 +446,8 @@ DECLARE_STATIC_KEY_FALSE(rdt_alloc_enable_key);
> >  
> >  extern struct dentry *debugfs_resctrl;
> >  
> > +extern int snc_nodes_per_l3_cache;
> > +
> >  enum resctrl_res_level {
> >  	RDT_RESOURCE_L3,
> >  	RDT_RESOURCE_L2,
> > diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
> > index 9fcc264fac6c..ed4f55b3e5e4 100644
> > --- a/arch/x86/kernel/cpu/resctrl/core.c
> > +++ b/arch/x86/kernel/cpu/resctrl/core.c
> > @@ -48,6 +48,13 @@ int max_name_width, max_data_width;
> >   */
> >  bool rdt_alloc_capable;
> >  
> > +/*
> > + * Number of SNC nodes that share each L3 cache.
> > + * Default is 1 for systems that do not support
> > + * SNC, or have SNC disabled.
> > + */
> 
> There is some extra space available to make the lines longer.

Re-formatted to use longer lines.

> 
> > +int snc_nodes_per_l3_cache = 1;
> > +
> >  static void
> >  mba_wrmsr_intel(struct rdt_domain *d, struct msr_param *m,
> >  		struct rdt_resource *r);
> > diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c
> > index 42262d59ef9b..b6b3fb0f9abe 100644
> > --- a/arch/x86/kernel/cpu/resctrl/monitor.c
> > +++ b/arch/x86/kernel/cpu/resctrl/monitor.c
> > @@ -148,8 +148,18 @@ static inline struct rmid_entry *__rmid_entry(u32 rmid)
> >  
> >  static int __rmid_read(u32 rmid, enum resctrl_event_id eventid, u64 *val)
> >  {
> > +	struct rdt_resource *r = &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl;
> > +	int cpu = smp_processor_id();
> > +	int rmid_offset = 0;
> >  	u64 msr_val;
> >  
> > +	/*
> > +	 * When SNC mode is on, need to compute the offset to read the
> > +	 * physical RMID counter for the node to which this CPU belongs
> > +	 */
> 
> Please end sentence with a period.

Added period.

> 
> > +	if (snc_nodes_per_l3_cache > 1)
> > +		rmid_offset = (cpu_to_node(cpu) % snc_nodes_per_l3_cache) * r->num_rmid;
> > +
> >  	/*
> >  	 * As per the SDM, when IA32_QM_EVTSEL.EvtID (bits 7:0) is configured
> >  	 * with a valid event code for supported resource type and the bits
> > @@ -158,7 +168,7 @@ static int __rmid_read(u32 rmid, enum resctrl_event_id eventid, u64 *val)
> >  	 * IA32_QM_CTR.Error (bit 63) and IA32_QM_CTR.Unavailable (bit 62)
> >  	 * are error bits.
> >  	 */
> > -	wrmsr(MSR_IA32_QM_EVTSEL, eventid, rmid);
> > +	wrmsr(MSR_IA32_QM_EVTSEL, eventid, rmid + rmid_offset);
> >  	rdmsrl(MSR_IA32_QM_CTR, msr_val);
> >  
> >  	if (msr_val & RMID_VAL_ERROR)
> > @@ -783,8 +793,8 @@ int __init rdt_get_mon_l3_config(struct rdt_resource *r)
> >  	int ret;
> >  
> >  	resctrl_rmid_realloc_limit = boot_cpu_data.x86_cache_size * 1024;
> > -	hw_res->mon_scale = boot_cpu_data.x86_cache_occ_scale;
> > -	r->num_rmid = boot_cpu_data.x86_cache_max_rmid + 1;
> > +	hw_res->mon_scale = boot_cpu_data.x86_cache_occ_scale / snc_nodes_per_l3_cache;
> > +	r->num_rmid = (boot_cpu_data.x86_cache_max_rmid + 1) / snc_nodes_per_l3_cache;
> >  	hw_res->mbm_width = MBM_CNTR_WIDTH_BASE;
> >  
> >  	if (mbm_offset > 0 && mbm_offset <= MBM_CNTR_WIDTH_OFFSET_MAX)
> > diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> > index 5feec2c33544..a8cf6251e506 100644
> > --- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> > +++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> > @@ -1367,7 +1367,7 @@ unsigned int rdtgroup_cbm_to_size(struct rdt_resource *r,
> >  		}
> >  	}
> >  
> > -	return size;
> > +	return size / snc_nodes_per_l3_cache;
> >  }
> >  
> >  /**
> > @@ -2600,7 +2600,7 @@ static int rdt_parse_param(struct fs_context *fc, struct fs_parameter *param)
> >  		ctx->enable_cdpl2 = true;
> >  		return 0;
> >  	case Opt_mba_mbps:
> > -		if (!supports_mba_mbps())
> > +		if (!supports_mba_mbps() || snc_nodes_per_l3_cache > 1)
> >  			return -EINVAL;
> >  		ctx->enable_mba_mbps = true;
> >  		return 0;
> 
> 
> Reinette

^ permalink raw reply	[flat|nested] 331+ messages in thread

* Re: [PATCH v5 6/8] x86/resctrl: Sub NUMA Cluster detection and enable
  2023-09-25 23:29       ` Reinette Chatre
@ 2023-09-28 18:12         ` Tony Luck
  2023-09-28 18:44           ` Reinette Chatre
  0 siblings, 1 reply; 331+ messages in thread
From: Tony Luck @ 2023-09-28 18:12 UTC (permalink / raw)
  To: Reinette Chatre
  Cc: Fenghua Yu, Peter Newman, Jonathan Corbet, Shuah Khan, x86,
	Shaopeng Tan, James Morse, Jamie Iles, Babu Moger, Randy Dunlap,
	linux-kernel, linux-doc, patches

On Mon, Sep 25, 2023 at 04:29:44PM -0700, Reinette Chatre wrote:
> Hi Tony,
> 
> On 8/29/2023 4:44 PM, Tony Luck wrote:
> > There isn't a simple h/w bit that indicates whether a CPU is
> > running in Sub NUMA Cluster mode. Infer the state by comparing
> 
> Sub NUMA Cluster (SNC)

Fixed.

> 
> > the ratio of NUMA nodes to L3 cache instances.
> > 
> > When SNC mode is detected, reconfigure the RMID counters by updating
> > the MSR_RMID_SNC_CONFIG MSR on each socket as CPUs are seen.
> 
> Can it be explained how RMID counters are reconfigured when
> the MSR is updated?

Added some in previous patch (which implements the changes needed
by the reconfiguration). Small note in this patch to refer to that.

> > 
> > Signed-off-by: Tony Luck <tony.luck@intel.com>
> > ---
> >  arch/x86/include/asm/msr-index.h   |  1 +
> >  arch/x86/kernel/cpu/resctrl/core.c | 68 ++++++++++++++++++++++++++++++
> >  2 files changed, 69 insertions(+)
> > 
> > diff --git a/arch/x86/include/asm/msr-index.h b/arch/x86/include/asm/msr-index.h
> > index 1d111350197f..393d1b047617 100644
> > --- a/arch/x86/include/asm/msr-index.h
> > +++ b/arch/x86/include/asm/msr-index.h
> > @@ -1100,6 +1100,7 @@
> >  #define MSR_IA32_QM_CTR			0xc8e
> >  #define MSR_IA32_PQR_ASSOC		0xc8f
> >  #define MSR_IA32_L3_CBM_BASE		0xc90
> > +#define MSR_RMID_SNC_CONFIG		0xca0
> >  #define MSR_IA32_L2_CBM_BASE		0xd10
> >  #define MSR_IA32_MBA_THRTL_BASE		0xd50
> >  
> > diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
> > index ed4f55b3e5e4..9f0ac9721fab 100644
> > --- a/arch/x86/kernel/cpu/resctrl/core.c
> > +++ b/arch/x86/kernel/cpu/resctrl/core.c
> > @@ -16,11 +16,14 @@
> >  
> >  #define pr_fmt(fmt)	"resctrl: " fmt
> >  
> > +#include <linux/cpu.h>
> >  #include <linux/slab.h>
> >  #include <linux/err.h>
> >  #include <linux/cacheinfo.h>
> >  #include <linux/cpuhotplug.h>
> > +#include <linux/mod_devicetable.h>
> >  
> > +#include <asm/cpu_device_id.h>
> >  #include <asm/intel-family.h>
> >  #include <asm/resctrl.h>
> >  #include "internal.h"
> > @@ -724,11 +727,34 @@ static void clear_closid_rmid(int cpu)
> >  	wrmsr(MSR_IA32_PQR_ASSOC, 0, 0);
> >  }
> >  
> > +/*
> > + * The power-on reset value of MSR_RMID_SNC_CONFIG is 0x1
> > + * which indicates that RMIDs are configured in legacy mode.
> > + * Clearing bit 0 reconfigures the RMID counters for use
> > + * in Sub NUMA Cluster mode.
> > + */
> 
> I think I missed where "legacy mode" and "Sub NUMA Cluster mode"
> are explained. 

Updated this comment to cover this better.

> 
> > +static void snc_remap_rmids(int cpu)
> > +{
> > +	u64 val;
> > +
> > +	/* Only need to enable once per package */
> 
> Sentence end with period.

Period added.
> 
> > +	if (cpumask_first(topology_core_cpumask(cpu)) != cpu)
> > +		return;
> > +
> 
> This is an area I am not familiar with. The above code seems
> to assume that CPUs are onlined in a particular numerical
> order. For example, if I understand correctly, if CPUs
> are onlined from higher number to lower number then
> the above code may end up running on every CPU online.

This sent me on a voyage of exploration into early Linux
bringup. There's a CONFIG_HOTPLUG_PARALLEL option to bring
CPUs up in parallel. I have it set on my kernel, but I still
see the "announce_cpu()" console messages show up in monotonic
increasing order:

[    3.148423] smpboot: x86: Booting SMP configuration:
[    3.148940] .... node  #0, CPUs:          #1   #2   #3   #4   #5   #6 #7 and so on

But, without solving this mystery, I realized this doesn't matter.
Whatever order the CPUs come online in, it has all completed long
before resctrl is initialized:

	late_initcall(resctrl_late_init);

So the order that resctrl sees CPUs is dependent on the callbacks
from the registration with cpuhp_setup_state(). That happens with:

        /*
         * Try to call the startup callback for each present cpu
         * depending on the hotplug state of the cpu.
         */
        for_each_present_cpu(cpu) {

which is going to call in increasing numerical order as the bitmap
of present CPUs is traversed.

If someone changed this, the only ill effect on the code I'm
adding would be to set the MSR multiple times (which is
inefficient, but won't break anything).


> 
> > +	rdmsrl(MSR_RMID_SNC_CONFIG, val);
> > +	val &= ~BIT_ULL(0);
> > +	wrmsrl(MSR_RMID_SNC_CONFIG, val);
> > +}
> > +
> >  static int resctrl_online_cpu(unsigned int cpu)
> >  {
> >  	struct rdt_resource *r;
> >  
> >  	mutex_lock(&rdtgroup_mutex);
> > +
> > +	if (snc_nodes_per_l3_cache > 1)
> > +		snc_remap_rmids(cpu);
> > +
> >  	for_each_capable_rdt_resource(r)
> >  		domain_add_cpu(cpu, r);
> >  	/* The cpu is set in default rdtgroup after online. */
> > @@ -983,11 +1009,53 @@ static __init bool get_rdt_resources(void)
> >  	return (rdt_mon_capable || rdt_alloc_capable);
> >  }
> >  
> > +/* CPU models that support MSR_RMID_SNC_CONFIG */
> > +static const struct x86_cpu_id snc_cpu_ids[] __initconst = {
> > +	X86_MATCH_INTEL_FAM6_MODEL(ICELAKE_X, 0),
> > +	X86_MATCH_INTEL_FAM6_MODEL(SAPPHIRERAPIDS_X, 0),
> > +	X86_MATCH_INTEL_FAM6_MODEL(EMERALDRAPIDS_X, 0),
> > +	{}
> > +};
> > +
> > +static __init int get_snc_config(void)
> 
> Could this function please get a comment about what it does?
> The first paragraph from the commit message should be ok.

Comment added. Also renamed to snc_get_config() so that these
SNC related functions use the same naming convention.
> 
> > +{
> > +	unsigned long *node_caches;
> > +	int mem_only_nodes = 0;
> > +	int cpu, node, ret;
> > +
> > +	if (!x86_match_cpu(snc_cpu_ids))
> > +		return 1;
> > +
> > +	node_caches = kcalloc(BITS_TO_LONGS(nr_node_ids), sizeof(*node_caches), GFP_KERNEL);
> 
> How about bitmap_zalloc()?

Excellent! Now using it.
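
I.e. roughly (sketch; the matching free becomes bitmap_free()):

	node_caches = bitmap_zalloc(nr_node_ids, GFP_KERNEL);
	...
	bitmap_free(node_caches);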

> 
> > +	if (!node_caches)
> > +		return 1;
> > +
> > +	cpus_read_lock();
> > +	for_each_node(node) {
> > +		cpu = cpumask_first(cpumask_of_node(node));
> > +		if (cpu < nr_cpu_ids)
> > +			set_bit(get_cpu_cacheinfo_id(cpu, 3), node_caches);
> > +		else
> > +			mem_only_nodes++;
> > +	}
> > +	cpus_read_unlock();
> > +
> > +	ret = (nr_node_ids - mem_only_nodes) / bitmap_weight(node_caches, nr_node_ids);
> 
> My static checker complains about the above line risking
> a "divide by zero" that may be the case if the bitmap_weight() returns
> zero. You may need to ensure this is safe here before more static checkers
> start analyzing this code.

Added defensive code that checks the result of bitmap_weight() for a
zero return. If it is zero (somehow ... that would mean there are no
nodes with CPUs that have an L3 cache!) then the code will assume SNC
is not enabled.
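
Something along these lines (a sketch of the intent, not the final
code - "num_l3_caches" is just an illustrative local here):

	num_l3_caches = bitmap_weight(node_caches, nr_node_ids);
	bitmap_free(node_caches);

	if (!num_l3_caches)
		return 1;

	ret = (nr_node_ids - mem_only_nodes) / num_l3_caches;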

> 
> > +	kfree(node_caches);
> > +
> > +	if (ret > 1)
> > +		rdt_resources_all[RDT_RESOURCE_L3].r_resctrl.mon_scope = RESCTRL_NODE;
> > +
> > +	return ret;
> > +}
> 
> As discussed there are scenarios where the above may not provide the
> correct value. Could you please document the assumptions of this code to
> highlight this to x86 maintainers for their opinions?

Added a note that booting with maxcpus=N boot parameter will break this
check in ways that cannot be worked around. Early boot code doesn't have
to worry about user taking CPUs offline, because it runs before users
can do that.

> > +
> >  static __init void rdt_init_res_defs_intel(void)
> >  {
> >  	struct rdt_hw_resource *hw_res;
> >  	struct rdt_resource *r;
> >  
> > +	snc_nodes_per_l3_cache = get_snc_config();
> > +
> >  	for_each_rdt_resource(r) {
> >  		hw_res = resctrl_to_arch_res(r);
> >  
> 
> 
> Reinette

^ permalink raw reply	[flat|nested] 331+ messages in thread

* Re: [PATCH v5 7/8] x86/resctrl: Update documentation with Sub-NUMA cluster changes
  2023-09-25 23:30       ` Reinette Chatre
@ 2023-09-28 18:15         ` Tony Luck
  0 siblings, 0 replies; 331+ messages in thread
From: Tony Luck @ 2023-09-28 18:15 UTC (permalink / raw)
  To: Reinette Chatre
  Cc: Fenghua Yu, Peter Newman, Jonathan Corbet, Shuah Khan, x86,
	Shaopeng Tan, James Morse, Jamie Iles, Babu Moger, Randy Dunlap,
	linux-kernel, linux-doc, patches

On Mon, Sep 25, 2023 at 04:30:22PM -0700, Reinette Chatre wrote:
> Hi Tony,
> 
> On 8/29/2023 4:44 PM, Tony Luck wrote:
> > With Sub-NUMA Cluster mode enabled the scope of monitoring resources is
> > per-NODE instead of per-L3 cache. Suffixes of directories with "L3" in
> > their name refer to Sub-NUMA nodes instead of L3 cache ids.
> > 
> > Users should be aware that SNC mode also affects the amount of L3 cache
> > available for allocation within each SNC node.
> > 
> > Signed-off-by: Tony Luck <tony.luck@intel.com>
> > ---
> >  Documentation/arch/x86/resctrl.rst | 25 ++++++++++++++++++++++---
> >  1 file changed, 22 insertions(+), 3 deletions(-)
> > 
> > diff --git a/Documentation/arch/x86/resctrl.rst b/Documentation/arch/x86/resctrl.rst
> > index cb05d90111b4..407764f43f25 100644
> > --- a/Documentation/arch/x86/resctrl.rst
> > +++ b/Documentation/arch/x86/resctrl.rst
> > @@ -345,9 +345,15 @@ When control is enabled all CTRL_MON groups will also contain:
> >  When monitoring is enabled all MON groups will also contain:
> >  
> >  "mon_data":
> > -	This contains a set of files organized by L3 domain and by
> > -	RDT event. E.g. on a system with two L3 domains there will
> > -	be subdirectories "mon_L3_00" and "mon_L3_01".	Each of these
> > +	This contains a set of files organized by L3 domain or by NUMA
> > +	node (depending on whether Sub-NUMA Cluster (SNC) mode is disabled
> > +	or enabled respectively) and by RDT event. E.g. on a system with
> > +	SNC mode disabled with two L3 domains there will be subdirectories
> > +	"mon_L3_00" and "mon_L3_01". The numerical suffix refers to the
> > +	L3 cache id.  With SNC enabled the directory names are the same,
> > +	but the numerical suffix refers to the node id.
> > +        Mappings from node ids to CPUs are available in the
> > +        /sys/devices/system/node/node*/cpulist files. Each of these
> 
> Please be consistent with tabs vs spaces.

I blame "vim" for this one. Typing <TAB> while working on a .rst file
throws in 8 spaces. Maybe there's a vim setting to stop this annoying
behaviour (set syntax=off didn't help). Had to use CTRL(V)<TAB> to make
it work.

> 
> >  	directories have one file per event (e.g. "llc_occupancy",
> >  	"mbm_total_bytes", and "mbm_local_bytes"). In a MON group these
> >  	files provide a read out of the current value of the event for
> > @@ -452,6 +458,19 @@ and 0xA are not.  On a system with a 20-bit mask each bit represents 5%
> >  of the capacity of the cache. You could partition the cache into four
> >  equal parts with masks: 0x1f, 0x3e0, 0x7c00, 0xf8000.
> >  
> > +Notes on Sub-NUMA Cluster mode
> > +==============================
> > +When SNC mode is enabled the "llc_occupancy", "mbm_total_bytes", and
> > +"mbm_local_bytes" will only give accurate results for well behaved NUMA
> > +applications. I.e. those that perform the majority of memory accesses
> > +to memory on the local NUMA node to the CPU where the task is executing.
> > +
> 
> Following our discussion of the cover letter I believe this will change to
> be specific about what user space can expect and what is actually "accurate".

I added more text about the importance of binding tasks to SNC nodes.
> 
> > +The cache allocation feature still provides the same number of
> > +bits in a mask to control allocation into the L3 cache. But each
> > +of those ways has its capacity reduced because the cache is divided
> > +between the SNC nodes. The values reported in the resctrl
> > +"size" files are adjusted accordingly.
> > +
> >  Memory bandwidth Allocation and monitoring
> >  ==========================================
> >  
> 
> Reinette

^ permalink raw reply	[flat|nested] 331+ messages in thread

* Re: [PATCH v5 3/8] x86/resctrl: Split the rdt_domain structure
  2023-09-28 17:33         ` Tony Luck
@ 2023-09-28 18:42           ` Reinette Chatre
  0 siblings, 0 replies; 331+ messages in thread
From: Reinette Chatre @ 2023-09-28 18:42 UTC (permalink / raw)
  To: Tony Luck
  Cc: Fenghua Yu, Peter Newman, Jonathan Corbet, Shuah Khan, x86,
	Shaopeng Tan, James Morse, Jamie Iles, Babu Moger, Randy Dunlap,
	linux-kernel, linux-doc, patches

Hi Tony,

On 9/28/2023 10:33 AM, Tony Luck wrote:
> On Mon, Sep 25, 2023 at 04:25:15PM -0700, Reinette Chatre wrote:
>>
>> On 8/29/2023 4:44 PM, Tony Luck wrote:

>>> Similar breakout of struct rdt_hw_mondomain from struct rdt_hw_domain.
>>
>> rdt_hw_mondomain and rdt_hw_ctrldomain?
> 
> Yes. rdt_hw_domain is split two ways and each gets an appropriate name.
> I added an extra "_" to match shape of other names. So these are now
> "rdt_hw_ctrl_domain" and "rdt_hw_mon_domain"

Thank you for adding the extra "_", it looks good.

...

>>> @@ -641,12 +641,12 @@ void cqm_handle_limbo(struct work_struct *work)
>>>  	unsigned long delay = msecs_to_jiffies(CQM_LIMBOCHECK_INTERVAL);
>>>  	int cpu = smp_processor_id();
>>>  	struct rdt_resource *r;
>>> -	struct rdt_domain *d;
>>> +	struct rdt_mondomain *d;
>>
>> Reverse fir (Please check all code).
> 
> Fixed. (I wrote a simple awk script to find these (as I'm
> obviously bad at noticing them). Included at end of this
> message).

Thank you very much for sharing your script. 


Reinette

^ permalink raw reply	[flat|nested] 331+ messages in thread

* Re: [PATCH v5 4/8] x86/resctrl: Add node-scope to the options for feature scope
  2023-09-28 17:42         ` Tony Luck
@ 2023-09-28 18:44           ` Reinette Chatre
  0 siblings, 0 replies; 331+ messages in thread
From: Reinette Chatre @ 2023-09-28 18:44 UTC (permalink / raw)
  To: Tony Luck
  Cc: Fenghua Yu, Peter Newman, Jonathan Corbet, Shuah Khan, x86,
	Shaopeng Tan, James Morse, Jamie Iles, Babu Moger, Randy Dunlap,
	linux-kernel, linux-doc, patches

Hi Tony,

On 9/28/2023 10:42 AM, Tony Luck wrote:
> On Mon, Sep 25, 2023 at 04:25:54PM -0700, Reinette Chatre wrote:
>> Hi Tony,
>>
>> On 8/29/2023 4:44 PM, Tony Luck wrote:
>>> Currently supported resctrl features are all domain scoped the same as the
>>> scope of the L2 or L3 caches.
>>
>> fyi ... this patch series seems to use the terms "resctrl feature"
>> and "resctrl resource" interchangeably and it is not always clear
>> if the terms mean something different.
> 
> I think a "resctrl feature" is a h/w control or monitor feature. A
> "resctrl resource" is "struct rdt_resource" (which may have more than
> one "resctrl feature" attached to it. E.g. the RDT_RESOURCE_L3 resource
> has L3 CAT, MBM, CQM attached).

Please keep in mind that in resctrl documentation
(Documentation/arch/x86/resctrl.rst) L3_MON is described as 
a resource. The main message is just to ensure that it is clear
what is meant with particular terms and then be consistent in
their use.

Reinette

^ permalink raw reply	[flat|nested] 331+ messages in thread

* Re: [PATCH v5 6/8] x86/resctrl: Sub NUMA Cluster detection and enable
  2023-09-28 18:12         ` Tony Luck
@ 2023-09-28 18:44           ` Reinette Chatre
  0 siblings, 0 replies; 331+ messages in thread
From: Reinette Chatre @ 2023-09-28 18:44 UTC (permalink / raw)
  To: Tony Luck
  Cc: Fenghua Yu, Peter Newman, Jonathan Corbet, Shuah Khan, x86,
	Shaopeng Tan, James Morse, Jamie Iles, Babu Moger, Randy Dunlap,
	linux-kernel, linux-doc, patches

Hi Tony,

On 9/28/2023 11:12 AM, Tony Luck wrote:
> On Mon, Sep 25, 2023 at 04:29:44PM -0700, Reinette Chatre wrote:
>> On 8/29/2023 4:44 PM, Tony Luck wrote:

>>> +	if (cpumask_first(topology_core_cpumask(cpu)) != cpu)
>>> +		return;
>>> +
>>
>> This is an area I am not familiar with. The above code seems
>> to assume that CPUs are onlined in a particular numerical
>> order. For example, if I understand correctly, if CPUs
>> are onlined from higher number to lower number then
>> the above code may end up running on every CPU online.
> 
> This sent me on a voyage of exploration into early Linux
> bringup. There's a CONFIG_HOTPLUG_PARALLEL option to bring
> CPUs up in parallel. I have it set on my kernel, but I still
> see the "announce_cpu()" console messages show up in monotonic
> increasing order:
> 
> [    3.148423] smpboot: x86: Booting SMP configuration:
> [    3.148940] .... node  #0, CPUs:          #1   #2   #3   #4   #5   #6 #7 and so on
> 
> But, without solving this mystery, I realized this doesn't matter.
> Whatever order the CPUs come online was all completed long before
> resctrl is initialized:
> 
> 	late_initcall(resctrl_late_init);
> 
> So the order that resctrl sees CPUs is dependent on the callbacks
> from the registration with cpuhp_setup_state(). That happens with:
> 
>         /*
>          * Try to call the startup callback for each present cpu
>          * depending on the hotplug state of the cpu.
>          */
>         for_each_present_cpu(cpu) {
> 
> which is going to call in increasing numerical order as the bitmap
> of present CPUs is traversed.
> 
> If someone changed this, the only ill effect on the code I'm
> adding would be to set the MSR multiple times (which is
> inefficient, but won't break anything).
> 
> 

Thank you very much for investigating this.


Reinette

^ permalink raw reply	[flat|nested] 331+ messages in thread

* [PATCH v6 0/8] Add support for Sub-NUMA cluster (SNC) systems
  2023-08-29 23:44   ` [PATCH v5 0/8] " Tony Luck
                       ` (8 preceding siblings ...)
  2023-09-11 20:23     ` [PATCH v5 0/8] Add support for Sub-NUMA cluster (SNC) systems Reinette Chatre
@ 2023-09-28 19:13     ` Tony Luck
  2023-09-28 19:13       ` [PATCH v6 1/8] x86/resctrl: Prepare for new domain scope Tony Luck
                         ` (10 more replies)
  9 siblings, 11 replies; 331+ messages in thread
From: Tony Luck @ 2023-09-28 19:13 UTC (permalink / raw)
  To: Fenghua Yu, Reinette Chatre, Peter Newman, Jonathan Corbet,
	Shuah Khan, x86
  Cc: Shaopeng Tan, James Morse, Jamie Iles, Babu Moger, Randy Dunlap,
	linux-kernel, linux-doc, patches, Tony Luck

The Sub-NUMA cluster feature on some Intel processors partitions
the CPUs that share an L3 cache into two or more sets. This plays
havoc with the Resource Director Technology (RDT) monitoring features.
Prior to this patch Intel has advised that SNC and RDT are incompatible.

Some of these CPUs support an MSR that can partition the RMID
counters in the same way. This allows the monitoring features
to be used (with the caveat that memory accesses between different
SNC NUMA nodes may still not be counted accurately).

Note that this patch series improves resctrl reporting considerably
on systems with SNC enabled, but there will still be some anomalies
for processes accessing memory from other sub-NUMA nodes.

Signed-off-by: Tony Luck <tony.luck@intel.com>

---

Summary of changes since v5 - see each patch commit for more specifics

Rebased to v6.6-rc3

0001	Define "scope" enum with values 2, 3 for caches to simplify some
	code (but sanity check before each such usage).
	Better warning messages when scope lookup fails

0002	New patch so that some code can be shared between looking up
	control and monitor domains

0003	Spell "mondomains" as "mon_domains" and be consistent with all
	the other "mon" identifiers also having similar "_".
	Don't leave control stuff with old names, change those too
	so now have ctrl_scope, ctrl_domains, etc.

0004	Use infrastructure from 0002 to have a common rdt_find_domain()
	function for both types of domain structure.
	0003 was using same "rdt_domain" structure for both control
	and monitor domains. Divide it into rdt_ctrl_domain and
	rdt_mon_domain structures with just the fields they need.
	Ditto for rdt_hw_domain. Also split and rename many support
	functions and macros.
	Lots of "fir tree local declaration order" changes because
	lengths of typenames changed.

0005	Better commit description

0006	Better commit and code comments

0007	More explanations in commit and code comments.
	Use consistent naming for "snc_*()" functions.

Patch to update selftests dropped from this series. Someone else
has taken over that work.

Tony Luck (8):
  x86/resctrl: Prepare for new domain scope
  x86/resctrl: Prepare to split rdt_domain structure
  x86/resctrl: Prepare for different scope for control/monitor
    operations
  x86/resctrl: Split the rdt_domain and rdt_hw_domain structures
  x86/resctrl: Add node-scope to the options for feature scope
  x86/resctrl: Introduce snc_nodes_per_l3_cache
  x86/resctrl: Sub NUMA Cluster detection and enable
  x86/resctrl: Update documentation with Sub-NUMA cluster changes

 Documentation/arch/x86/resctrl.rst        |  34 +-
 include/linux/resctrl.h                   |  78 +++--
 arch/x86/include/asm/msr-index.h          |   1 +
 arch/x86/kernel/cpu/resctrl/internal.h    |  66 ++--
 arch/x86/kernel/cpu/resctrl/core.c        | 380 +++++++++++++++++-----
 arch/x86/kernel/cpu/resctrl/ctrlmondata.c |  52 +--
 arch/x86/kernel/cpu/resctrl/monitor.c     |  58 ++--
 arch/x86/kernel/cpu/resctrl/pseudo_lock.c |  14 +-
 arch/x86/kernel/cpu/resctrl/rdtgroup.c    | 131 ++++----
 9 files changed, 567 insertions(+), 247 deletions(-)


base-commit: 6465e260f48790807eef06b583b38ca9789b6072
-- 
2.41.0


^ permalink raw reply	[flat|nested] 331+ messages in thread

* [PATCH v6 1/8] x86/resctrl: Prepare for new domain scope
  2023-09-28 19:13     ` [PATCH v6 " Tony Luck
@ 2023-09-28 19:13       ` Tony Luck
  2023-09-29 12:09         ` Peter Newman
  2023-09-28 19:13       ` [PATCH v6 2/8] x86/resctrl: Prepare to split rdt_domain structure Tony Luck
                         ` (9 subsequent siblings)
  10 siblings, 1 reply; 331+ messages in thread
From: Tony Luck @ 2023-09-28 19:13 UTC (permalink / raw)
  To: Fenghua Yu, Reinette Chatre, Peter Newman, Jonathan Corbet,
	Shuah Khan, x86
  Cc: Shaopeng Tan, James Morse, Jamie Iles, Babu Moger, Randy Dunlap,
	linux-kernel, linux-doc, patches, Tony Luck

Resctrl resources operate on subsets of CPUs in the system with the
defining attribute of each subset being an instance of a particular
level of cache. E.g. all CPUs sharing an L3 cache would be part of the
same domain.

In preparation for features that are scoped at the NUMA node level,
change the code from explicit references to "cache_level" to a more
generic scope. At this point the only options for this scope are groups
of CPUs that share an L2 cache or L3 cache.

Provide a more detailed warning message if a domain id cannot be found
when adding a CPU. Just check and silently return if the domain id can't
be found when removing a CPU.

No functional change.

Signed-off-by: Tony Luck <tony.luck@intel.com>

---

Changes since v5:
1) Set enum values of RESCTRL_L2_CACHE and RESCTRL_L3_CACHE to "2" and
   "3 respectively so they can be passed to get_cpu_cacheinfo_id()
   (in code paths that check that scope is one of the cache types).

2) Simplified the check on scope in pseudo_lock_region_init() and
   rdtgroup_cbm_to_size().

3) Added detailed warning if the domain id for a CPU cannot be
   determined when adding a CPU. Simpler check on removal.
---
 include/linux/resctrl.h                   |  9 +++++--
 arch/x86/kernel/cpu/resctrl/core.c        | 33 ++++++++++++++++++-----
 arch/x86/kernel/cpu/resctrl/pseudo_lock.c |  6 ++++-
 arch/x86/kernel/cpu/resctrl/rdtgroup.c    |  5 +++-
 4 files changed, 43 insertions(+), 10 deletions(-)

diff --git a/include/linux/resctrl.h b/include/linux/resctrl.h
index 8334eeacfec5..618735e396cb 100644
--- a/include/linux/resctrl.h
+++ b/include/linux/resctrl.h
@@ -144,13 +144,18 @@ struct resctrl_membw {
 struct rdt_parse_data;
 struct resctrl_schema;
 
+enum resctrl_scope {
+	RESCTRL_L2_CACHE = 2,
+	RESCTRL_L3_CACHE = 3,
+};
+
 /**
  * struct rdt_resource - attributes of a resctrl resource
  * @rid:		The index of the resource
  * @alloc_capable:	Is allocation available on this machine
  * @mon_capable:	Is monitor feature available on this machine
  * @num_rmid:		Number of RMIDs available
- * @cache_level:	Which cache level defines scope of this resource
+ * @scope:		Scope of this resource
  * @cache:		Cache allocation related data
  * @membw:		If the component has bandwidth controls, their properties.
  * @domains:		All domains for this resource
@@ -168,7 +173,7 @@ struct rdt_resource {
 	bool			alloc_capable;
 	bool			mon_capable;
 	int			num_rmid;
-	int			cache_level;
+	enum resctrl_scope	scope;
 	struct resctrl_cache	cache;
 	struct resctrl_membw	membw;
 	struct list_head	domains;
diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
index 030d3b409768..3b1837e1fb6b 100644
--- a/arch/x86/kernel/cpu/resctrl/core.c
+++ b/arch/x86/kernel/cpu/resctrl/core.c
@@ -65,7 +65,7 @@ struct rdt_hw_resource rdt_resources_all[] = {
 		.r_resctrl = {
 			.rid			= RDT_RESOURCE_L3,
 			.name			= "L3",
-			.cache_level		= 3,
+			.scope			= RESCTRL_L3_CACHE,
 			.domains		= domain_init(RDT_RESOURCE_L3),
 			.parse_ctrlval		= parse_cbm,
 			.format_str		= "%d=%0*x",
@@ -79,7 +79,7 @@ struct rdt_hw_resource rdt_resources_all[] = {
 		.r_resctrl = {
 			.rid			= RDT_RESOURCE_L2,
 			.name			= "L2",
-			.cache_level		= 2,
+			.scope			= RESCTRL_L2_CACHE,
 			.domains		= domain_init(RDT_RESOURCE_L2),
 			.parse_ctrlval		= parse_cbm,
 			.format_str		= "%d=%0*x",
@@ -93,7 +93,7 @@ struct rdt_hw_resource rdt_resources_all[] = {
 		.r_resctrl = {
 			.rid			= RDT_RESOURCE_MBA,
 			.name			= "MB",
-			.cache_level		= 3,
+			.scope			= RESCTRL_L3_CACHE,
 			.domains		= domain_init(RDT_RESOURCE_MBA),
 			.parse_ctrlval		= parse_bw,
 			.format_str		= "%d=%*u",
@@ -105,7 +105,7 @@ struct rdt_hw_resource rdt_resources_all[] = {
 		.r_resctrl = {
 			.rid			= RDT_RESOURCE_SMBA,
 			.name			= "SMBA",
-			.cache_level		= 3,
+			.scope			= RESCTRL_L3_CACHE,
 			.domains		= domain_init(RDT_RESOURCE_SMBA),
 			.parse_ctrlval		= parse_bw,
 			.format_str		= "%d=%*u",
@@ -487,6 +487,19 @@ static int arch_domain_mbm_alloc(u32 num_rmid, struct rdt_hw_domain *hw_dom)
 	return 0;
 }
 
+static int get_domain_id_from_scope(int cpu, enum resctrl_scope scope)
+{
+	switch (scope) {
+	case RESCTRL_L2_CACHE:
+	case RESCTRL_L3_CACHE:
+		return get_cpu_cacheinfo_id(cpu, scope);
+	default:
+		break;
+	}
+
+	return -EINVAL;
+}
+
 /*
  * domain_add_cpu - Add a cpu to a resource's domain list.
  *
@@ -502,12 +515,17 @@ static int arch_domain_mbm_alloc(u32 num_rmid, struct rdt_hw_domain *hw_dom)
  */
 static void domain_add_cpu(int cpu, struct rdt_resource *r)
 {
-	int id = get_cpu_cacheinfo_id(cpu, r->cache_level);
+	int id = get_domain_id_from_scope(cpu, r->scope);
 	struct list_head *add_pos = NULL;
 	struct rdt_hw_domain *hw_dom;
 	struct rdt_domain *d;
 	int err;
 
+	if (id < 0) {
+		pr_warn_once("Can't find domain id for CPU:%d scope:%d for resource %s\n",
+			     cpu, r->scope, r->name);
+		return;
+	}
 	d = rdt_find_domain(r, id, &add_pos);
 	if (IS_ERR(d)) {
 		pr_warn("Couldn't find cache id for CPU %d\n", cpu);
@@ -552,10 +570,13 @@ static void domain_add_cpu(int cpu, struct rdt_resource *r)
 
 static void domain_remove_cpu(int cpu, struct rdt_resource *r)
 {
-	int id = get_cpu_cacheinfo_id(cpu, r->cache_level);
+	int id = get_domain_id_from_scope(cpu, r->scope);
 	struct rdt_hw_domain *hw_dom;
 	struct rdt_domain *d;
 
+	if (id < 0)
+		return;
+
 	d = rdt_find_domain(r, id, NULL);
 	if (IS_ERR_OR_NULL(d)) {
 		pr_warn("Couldn't find cache id for CPU %d\n", cpu);
diff --git a/arch/x86/kernel/cpu/resctrl/pseudo_lock.c b/arch/x86/kernel/cpu/resctrl/pseudo_lock.c
index 8f559eeae08e..8c5f932bc00b 100644
--- a/arch/x86/kernel/cpu/resctrl/pseudo_lock.c
+++ b/arch/x86/kernel/cpu/resctrl/pseudo_lock.c
@@ -292,10 +292,14 @@ static void pseudo_lock_region_clear(struct pseudo_lock_region *plr)
  */
 static int pseudo_lock_region_init(struct pseudo_lock_region *plr)
 {
+	int scope = plr->s->res->scope;
 	struct cpu_cacheinfo *ci;
 	int ret;
 	int i;
 
+	if (WARN_ON_ONCE(scope != RESCTRL_L2_CACHE && scope != RESCTRL_L3_CACHE))
+		return -ENODEV;
+
 	/* Pick the first cpu we find that is associated with the cache. */
 	plr->cpu = cpumask_first(&plr->d->cpu_mask);
 
@@ -311,7 +315,7 @@ static int pseudo_lock_region_init(struct pseudo_lock_region *plr)
 	plr->size = rdtgroup_cbm_to_size(plr->s->res, plr->d, plr->cbm);
 
 	for (i = 0; i < ci->num_leaves; i++) {
-		if (ci->info_list[i].level == plr->s->res->cache_level) {
+		if (ci->info_list[i].level == scope) {
 			plr->line_size = ci->info_list[i].coherency_line_size;
 			return 0;
 		}
diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
index 725344048f85..1cf2b36f5bf8 100644
--- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
+++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
@@ -1345,10 +1345,13 @@ unsigned int rdtgroup_cbm_to_size(struct rdt_resource *r,
 	unsigned int size = 0;
 	int num_b, i;
 
+	if (WARN_ON_ONCE(r->scope != RESCTRL_L2_CACHE && r->scope != RESCTRL_L3_CACHE))
+		return -EINVAL;
+
 	num_b = bitmap_weight(&cbm, r->cache.cbm_len);
 	ci = get_cpu_cacheinfo(cpumask_any(&d->cpu_mask));
 	for (i = 0; i < ci->num_leaves; i++) {
-		if (ci->info_list[i].level == r->cache_level) {
+		if (ci->info_list[i].level == r->scope) {
 			size = ci->info_list[i].size / r->cache.cbm_len * num_b;
 			break;
 		}
-- 
2.41.0


^ permalink raw reply related	[flat|nested] 331+ messages in thread

* [PATCH v6 2/8] x86/resctrl: Prepare to split rdt_domain structure
  2023-09-28 19:13     ` [PATCH v6 " Tony Luck
  2023-09-28 19:13       ` [PATCH v6 1/8] x86/resctrl: Prepare for new domain scope Tony Luck
@ 2023-09-28 19:13       ` Tony Luck
  2023-09-29 12:19         ` Peter Newman
  2023-09-28 19:13       ` [PATCH v6 3/8] x86/resctrl: Prepare for different scope for control/monitor operations Tony Luck
                         ` (8 subsequent siblings)
  10 siblings, 1 reply; 331+ messages in thread
From: Tony Luck @ 2023-09-28 19:13 UTC (permalink / raw)
  To: Fenghua Yu, Reinette Chatre, Peter Newman, Jonathan Corbet,
	Shuah Khan, x86
  Cc: Shaopeng Tan, James Morse, Jamie Iles, Babu Moger, Randy Dunlap,
	linux-kernel, linux-doc, patches, Tony Luck

The rdt_domain structure is used for both control and monitor features.
It is about to be split into separate structures for these two usages
because the scope for control and monitoring features for a resource
will be different for future resources.

To allow for common code that scans a list of domains looking for a
specific domain id, move the "list" and "id" fields into their own
structure within the rdt_domain structure.
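
As an illustration only (a standalone userspace sketch, not kernel code; all
names here are made up), generic lookup code then only needs to know about the
embedded header, and callers recover their specific domain type with
container_of():

	#include <stddef.h>
	#include <stdio.h>

	#define container_of(ptr, type, member) \
		((type *)((char *)(ptr) - offsetof(type, member)))

	/* Common header: the only part the generic list walker needs. */
	struct common_hdr {
		struct common_hdr *next;	/* stand-in for struct list_head */
		int id;
	};

	/* A specific domain type embeds the header plus its own state. */
	struct example_domain {
		struct common_hdr hdr;
		int private_state;
	};

	/* Generic search by id over the header list. */
	static struct common_hdr *find_hdr(struct common_hdr *head, int id)
	{
		struct common_hdr *h;

		for (h = head; h; h = h->next)
			if (h->id == id)
				return h;
		return NULL;
	}

	int main(void)
	{
		struct example_domain dom = { .hdr = { .id = 3 }, .private_state = 42 };
		struct common_hdr *hdr = find_hdr(&dom.hdr, 3);

		if (hdr) {
			/* Caller converts back to its specific type. */
			struct example_domain *d = container_of(hdr, struct example_domain, hdr);

			printf("domain %d, state %d\n", d->hdr.id, d->private_state);
		}
		return 0;
	}

A later patch in this series relies on exactly this container_of() step once
rdt_find_domain() starts returning the common header rather than the full
rdt_domain.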

Signed-off-by: Tony Luck <tony.luck@intel.com>

---

Changes since v5:

This is a new patch, spawned from Reinette's review comment
about my attempt to mark some fields in the rdt_ctrl_domain
and rdt_mon_domain structures as required to be the same.

This is a better solution that doesn't require developers to
read and obey those comments.
---
 include/linux/resctrl.h                   | 14 ++++++--
 arch/x86/kernel/cpu/resctrl/core.c        | 16 ++++-----
 arch/x86/kernel/cpu/resctrl/ctrlmondata.c | 16 ++++-----
 arch/x86/kernel/cpu/resctrl/monitor.c     |  2 +-
 arch/x86/kernel/cpu/resctrl/pseudo_lock.c |  2 +-
 arch/x86/kernel/cpu/resctrl/rdtgroup.c    | 44 +++++++++++------------
 6 files changed, 51 insertions(+), 43 deletions(-)

diff --git a/include/linux/resctrl.h b/include/linux/resctrl.h
index 618735e396cb..a583fa88ea5a 100644
--- a/include/linux/resctrl.h
+++ b/include/linux/resctrl.h
@@ -53,9 +53,18 @@ struct resctrl_staged_config {
 };
 
 /**
- * struct rdt_domain - group of CPUs sharing a resctrl resource
+ * struct rdt_domain_hdr - common header for different domain types
  * @list:		all instances of this resource
  * @id:			unique id for this instance
+ */
+struct rdt_domain_hdr {
+	struct list_head		list;
+	int				id;
+};
+
+/**
+ * struct rdt_domain - group of CPUs sharing a resctrl resource
+ * @hdr:		common header for different domain types
  * @cpu_mask:		which CPUs share this resource
  * @rmid_busy_llc:	bitmap of which limbo RMIDs are above threshold
  * @mbm_total:		saved state for MBM total bandwidth
@@ -71,8 +80,7 @@ struct resctrl_staged_config {
  *			by closid
  */
 struct rdt_domain {
-	struct list_head		list;
-	int				id;
+	struct rdt_domain_hdr		hdr;
 	struct cpumask			cpu_mask;
 	unsigned long			*rmid_busy_llc;
 	struct mbm_state		*mbm_total;
diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
index 3b1837e1fb6b..05369add4578 100644
--- a/arch/x86/kernel/cpu/resctrl/core.c
+++ b/arch/x86/kernel/cpu/resctrl/core.c
@@ -352,7 +352,7 @@ struct rdt_domain *get_domain_from_cpu(int cpu, struct rdt_resource *r)
 {
 	struct rdt_domain *d;
 
-	list_for_each_entry(d, &r->domains, list) {
+	list_for_each_entry(d, &r->domains, hdr.list) {
 		/* Find the domain that contains this CPU */
 		if (cpumask_test_cpu(cpu, &d->cpu_mask))
 			return d;
@@ -401,12 +401,12 @@ struct rdt_domain *rdt_find_domain(struct rdt_resource *r, int id,
 		return ERR_PTR(-ENODEV);
 
 	list_for_each(l, &r->domains) {
-		d = list_entry(l, struct rdt_domain, list);
+		d = list_entry(l, struct rdt_domain, hdr.list);
 		/* When id is found, return its domain. */
-		if (id == d->id)
+		if (id == d->hdr.id)
 			return d;
 		/* Stop searching when finding id's position in sorted list. */
-		if (id < d->id)
+		if (id < d->hdr.id)
 			break;
 	}
 
@@ -544,7 +544,7 @@ static void domain_add_cpu(int cpu, struct rdt_resource *r)
 		return;
 
 	d = &hw_dom->d_resctrl;
-	d->id = id;
+	d->hdr.id = id;
 	cpumask_set_cpu(cpu, &d->cpu_mask);
 
 	rdt_domain_reconfigure_cdp(r);
@@ -559,11 +559,11 @@ static void domain_add_cpu(int cpu, struct rdt_resource *r)
 		return;
 	}
 
-	list_add_tail(&d->list, add_pos);
+	list_add_tail(&d->hdr.list, add_pos);
 
 	err = resctrl_online_domain(r, d);
 	if (err) {
-		list_del(&d->list);
+		list_del(&d->hdr.list);
 		domain_free(hw_dom);
 	}
 }
@@ -587,7 +587,7 @@ static void domain_remove_cpu(int cpu, struct rdt_resource *r)
 	cpumask_clear_cpu(cpu, &d->cpu_mask);
 	if (cpumask_empty(&d->cpu_mask)) {
 		resctrl_offline_domain(r, d);
-		list_del(&d->list);
+		list_del(&d->hdr.list);
 
 		/*
 		 * rdt_domain "d" is going to be freed below, so clear
diff --git a/arch/x86/kernel/cpu/resctrl/ctrlmondata.c b/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
index b44c487727d4..8bce591a1018 100644
--- a/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
+++ b/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
@@ -67,7 +67,7 @@ int parse_bw(struct rdt_parse_data *data, struct resctrl_schema *s,
 
 	cfg = &d->staged_config[s->conf_type];
 	if (cfg->have_new_ctrl) {
-		rdt_last_cmd_printf("Duplicate domain %d\n", d->id);
+		rdt_last_cmd_printf("Duplicate domain %d\n", d->hdr.id);
 		return -EINVAL;
 	}
 
@@ -144,7 +144,7 @@ int parse_cbm(struct rdt_parse_data *data, struct resctrl_schema *s,
 
 	cfg = &d->staged_config[s->conf_type];
 	if (cfg->have_new_ctrl) {
-		rdt_last_cmd_printf("Duplicate domain %d\n", d->id);
+		rdt_last_cmd_printf("Duplicate domain %d\n", d->hdr.id);
 		return -EINVAL;
 	}
 
@@ -224,8 +224,8 @@ static int parse_line(char *line, struct resctrl_schema *s,
 		return -EINVAL;
 	}
 	dom = strim(dom);
-	list_for_each_entry(d, &r->domains, list) {
-		if (d->id == dom_id) {
+	list_for_each_entry(d, &r->domains, hdr.list) {
+		if (d->hdr.id == dom_id) {
 			data.buf = dom;
 			data.rdtgrp = rdtgrp;
 			if (r->parse_ctrlval(&data, s, d))
@@ -316,7 +316,7 @@ int resctrl_arch_update_domains(struct rdt_resource *r, u32 closid)
 		return -ENOMEM;
 
 	msr_param.res = NULL;
-	list_for_each_entry(d, &r->domains, list) {
+	list_for_each_entry(d, &r->domains, hdr.list) {
 		hw_dom = resctrl_to_arch_dom(d);
 		for (t = 0; t < CDP_NUM_TYPES; t++) {
 			cfg = &hw_dom->d_resctrl.staged_config[t];
@@ -464,7 +464,7 @@ static void show_doms(struct seq_file *s, struct resctrl_schema *schema, int clo
 	u32 ctrl_val;
 
 	seq_printf(s, "%*s:", max_name_width, schema->name);
-	list_for_each_entry(dom, &r->domains, list) {
+	list_for_each_entry(dom, &r->domains, hdr.list) {
 		if (sep)
 			seq_puts(s, ";");
 
@@ -474,7 +474,7 @@ static void show_doms(struct seq_file *s, struct resctrl_schema *schema, int clo
 			ctrl_val = resctrl_arch_get_config(r, dom, closid,
 							   schema->conf_type);
 
-		seq_printf(s, r->format_str, dom->id, max_data_width,
+		seq_printf(s, r->format_str, dom->hdr.id, max_data_width,
 			   ctrl_val);
 		sep = true;
 	}
@@ -503,7 +503,7 @@ int rdtgroup_schemata_show(struct kernfs_open_file *of,
 			} else {
 				seq_printf(s, "%s:%d=%x\n",
 					   rdtgrp->plr->s->res->name,
-					   rdtgrp->plr->d->id,
+					   rdtgrp->plr->d->hdr.id,
 					   rdtgrp->plr->cbm);
 			}
 		} else {
diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c
index ded1fc7cb7cb..27cda5988d7f 100644
--- a/arch/x86/kernel/cpu/resctrl/monitor.c
+++ b/arch/x86/kernel/cpu/resctrl/monitor.c
@@ -340,7 +340,7 @@ static void add_rmid_to_limbo(struct rmid_entry *entry)
 
 	entry->busy = 0;
 	cpu = get_cpu();
-	list_for_each_entry(d, &r->domains, list) {
+	list_for_each_entry(d, &r->domains, hdr.list) {
 		if (cpumask_test_cpu(cpu, &d->cpu_mask)) {
 			err = resctrl_arch_rmid_read(r, d, entry->rmid,
 						     QOS_L3_OCCUP_EVENT_ID,
diff --git a/arch/x86/kernel/cpu/resctrl/pseudo_lock.c b/arch/x86/kernel/cpu/resctrl/pseudo_lock.c
index 8c5f932bc00b..18b6183a1b48 100644
--- a/arch/x86/kernel/cpu/resctrl/pseudo_lock.c
+++ b/arch/x86/kernel/cpu/resctrl/pseudo_lock.c
@@ -856,7 +856,7 @@ bool rdtgroup_pseudo_locked_in_hierarchy(struct rdt_domain *d)
 	 * associated with them.
 	 */
 	for_each_alloc_capable_rdt_resource(r) {
-		list_for_each_entry(d_i, &r->domains, list) {
+		list_for_each_entry(d_i, &r->domains, hdr.list) {
 			if (d_i->plr)
 				cpumask_or(cpu_with_psl, cpu_with_psl,
 					   &d_i->cpu_mask);
diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
index 1cf2b36f5bf8..42adf17ea6fa 100644
--- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
+++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
@@ -86,7 +86,7 @@ void rdt_staged_configs_clear(void)
 	lockdep_assert_held(&rdtgroup_mutex);
 
 	for_each_alloc_capable_rdt_resource(r) {
-		list_for_each_entry(dom, &r->domains, list)
+		list_for_each_entry(dom, &r->domains, hdr.list)
 			memset(dom->staged_config, 0, sizeof(dom->staged_config));
 	}
 }
@@ -928,12 +928,12 @@ static int rdt_bit_usage_show(struct kernfs_open_file *of,
 
 	mutex_lock(&rdtgroup_mutex);
 	hw_shareable = r->cache.shareable_bits;
-	list_for_each_entry(dom, &r->domains, list) {
+	list_for_each_entry(dom, &r->domains, hdr.list) {
 		if (sep)
 			seq_putc(seq, ';');
 		sw_shareable = 0;
 		exclusive = 0;
-		seq_printf(seq, "%d=", dom->id);
+		seq_printf(seq, "%d=", dom->hdr.id);
 		for (i = 0; i < closids_supported(); i++) {
 			if (!closid_allocated(i))
 				continue;
@@ -1233,7 +1233,7 @@ static bool rdtgroup_mode_test_exclusive(struct rdtgroup *rdtgrp)
 		if (r->rid == RDT_RESOURCE_MBA || r->rid == RDT_RESOURCE_SMBA)
 			continue;
 		has_cache = true;
-		list_for_each_entry(d, &r->domains, list) {
+		list_for_each_entry(d, &r->domains, hdr.list) {
 			ctrl = resctrl_arch_get_config(r, d, closid,
 						       s->conf_type);
 			if (rdtgroup_cbm_overlaps(s, d, ctrl, closid, false)) {
@@ -1398,7 +1398,7 @@ static int rdtgroup_size_show(struct kernfs_open_file *of,
 			size = rdtgroup_cbm_to_size(rdtgrp->plr->s->res,
 						    rdtgrp->plr->d,
 						    rdtgrp->plr->cbm);
-			seq_printf(s, "%d=%u\n", rdtgrp->plr->d->id, size);
+			seq_printf(s, "%d=%u\n", rdtgrp->plr->d->hdr.id, size);
 		}
 		goto out;
 	}
@@ -1410,7 +1410,7 @@ static int rdtgroup_size_show(struct kernfs_open_file *of,
 		type = schema->conf_type;
 		sep = false;
 		seq_printf(s, "%*s:", max_name_width, schema->name);
-		list_for_each_entry(d, &r->domains, list) {
+		list_for_each_entry(d, &r->domains, hdr.list) {
 			if (sep)
 				seq_putc(s, ';');
 			if (rdtgrp->mode == RDT_MODE_PSEUDO_LOCKSETUP) {
@@ -1428,7 +1428,7 @@ static int rdtgroup_size_show(struct kernfs_open_file *of,
 				else
 					size = rdtgroup_cbm_to_size(r, d, ctrl);
 			}
-			seq_printf(s, "%d=%u", d->id, size);
+			seq_printf(s, "%d=%u", d->hdr.id, size);
 			sep = true;
 		}
 		seq_putc(s, '\n');
@@ -1499,7 +1499,7 @@ static int mbm_config_show(struct seq_file *s, struct rdt_resource *r, u32 evtid
 
 	mutex_lock(&rdtgroup_mutex);
 
-	list_for_each_entry(dom, &r->domains, list) {
+	list_for_each_entry(dom, &r->domains, hdr.list) {
 		if (sep)
 			seq_puts(s, ";");
 
@@ -1507,7 +1507,7 @@ static int mbm_config_show(struct seq_file *s, struct rdt_resource *r, u32 evtid
 		mon_info.evtid = evtid;
 		mondata_config_read(dom, &mon_info);
 
-		seq_printf(s, "%d=0x%02x", dom->id, mon_info.mon_config);
+		seq_printf(s, "%d=0x%02x", dom->hdr.id, mon_info.mon_config);
 		sep = true;
 	}
 	seq_puts(s, "\n");
@@ -1622,8 +1622,8 @@ static int mon_config_write(struct rdt_resource *r, char *tok, u32 evtid)
 		return -EINVAL;
 	}
 
-	list_for_each_entry(d, &r->domains, list) {
-		if (d->id == dom_id) {
+	list_for_each_entry(d, &r->domains, hdr.list) {
+		if (d->hdr.id == dom_id) {
 			ret = mbm_config_write_domain(r, d, evtid, val);
 			if (ret)
 				return -EINVAL;
@@ -2141,7 +2141,7 @@ static int set_cache_qos_cfg(int level, bool enable)
 		return -ENOMEM;
 
 	r_l = &rdt_resources_all[level].r_resctrl;
-	list_for_each_entry(d, &r_l->domains, list) {
+	list_for_each_entry(d, &r_l->domains, hdr.list) {
 		if (r_l->cache.arch_has_per_cpu_cfg)
 			/* Pick all the CPUs in the domain instance */
 			for_each_cpu(cpu, &d->cpu_mask)
@@ -2226,7 +2226,7 @@ static int set_mba_sc(bool mba_sc)
 
 	r->membw.mba_sc = mba_sc;
 
-	list_for_each_entry(d, &r->domains, list) {
+	list_for_each_entry(d, &r->domains, hdr.list) {
 		for (i = 0; i < num_closid; i++)
 			d->mbps_val[i] = MBA_MAX_MBPS;
 	}
@@ -2528,7 +2528,7 @@ static int rdt_get_tree(struct fs_context *fc)
 
 	if (is_mbm_enabled()) {
 		r = &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl;
-		list_for_each_entry(dom, &r->domains, list)
+		list_for_each_entry(dom, &r->domains, hdr.list)
 			mbm_setup_overflow_handler(dom, MBM_OVERFLOW_INTERVAL);
 	}
 
@@ -2652,7 +2652,7 @@ static int reset_all_ctrls(struct rdt_resource *r)
 	 * CBMs in all domains to the maximum mask value. Pick one CPU
 	 * from each domain to update the MSRs below.
 	 */
-	list_for_each_entry(d, &r->domains, list) {
+	list_for_each_entry(d, &r->domains, hdr.list) {
 		hw_dom = resctrl_to_arch_dom(d);
 		cpumask_set_cpu(cpumask_any(&d->cpu_mask), cpu_mask);
 
@@ -2858,7 +2858,7 @@ static int mkdir_mondata_subdir(struct kernfs_node *parent_kn,
 	char name[32];
 	int ret;
 
-	sprintf(name, "mon_%s_%02d", r->name, d->id);
+	sprintf(name, "mon_%s_%02d", r->name, d->hdr.id);
 	/* create the directory */
 	kn = kernfs_create_dir(parent_kn, name, parent_kn->mode, prgrp);
 	if (IS_ERR(kn))
@@ -2874,7 +2874,7 @@ static int mkdir_mondata_subdir(struct kernfs_node *parent_kn,
 	}
 
 	priv.u.rid = r->rid;
-	priv.u.domid = d->id;
+	priv.u.domid = d->hdr.id;
 	list_for_each_entry(mevt, &r->evt_list, list) {
 		priv.u.evtid = mevt->evtid;
 		ret = mon_addfile(kn, mevt->name, priv.priv);
@@ -2922,7 +2922,7 @@ static int mkdir_mondata_subdir_alldom(struct kernfs_node *parent_kn,
 	struct rdt_domain *dom;
 	int ret;
 
-	list_for_each_entry(dom, &r->domains, list) {
+	list_for_each_entry(dom, &r->domains, hdr.list) {
 		ret = mkdir_mondata_subdir(parent_kn, dom, r, prgrp);
 		if (ret)
 			return ret;
@@ -3081,7 +3081,7 @@ static int __init_one_rdt_domain(struct rdt_domain *d, struct resctrl_schema *s,
 	 */
 	tmp_cbm = cfg->new_ctrl;
 	if (bitmap_weight(&tmp_cbm, r->cache.cbm_len) < r->cache.min_cbm_bits) {
-		rdt_last_cmd_printf("No space on %s:%d\n", s->name, d->id);
+		rdt_last_cmd_printf("No space on %s:%d\n", s->name, d->hdr.id);
 		return -ENOSPC;
 	}
 	cfg->have_new_ctrl = true;
@@ -3104,7 +3104,7 @@ static int rdtgroup_init_cat(struct resctrl_schema *s, u32 closid)
 	struct rdt_domain *d;
 	int ret;
 
-	list_for_each_entry(d, &s->res->domains, list) {
+	list_for_each_entry(d, &s->res->domains, hdr.list) {
 		ret = __init_one_rdt_domain(d, s, closid);
 		if (ret < 0)
 			return ret;
@@ -3119,7 +3119,7 @@ static void rdtgroup_init_mba(struct rdt_resource *r, u32 closid)
 	struct resctrl_staged_config *cfg;
 	struct rdt_domain *d;
 
-	list_for_each_entry(d, &r->domains, list) {
+	list_for_each_entry(d, &r->domains, hdr.list) {
 		if (is_mba_sc(r)) {
 			d->mbps_val[closid] = MBA_MAX_MBPS;
 			continue;
@@ -3726,7 +3726,7 @@ void resctrl_offline_domain(struct rdt_resource *r, struct rdt_domain *d)
 	 * per domain monitor data directories.
 	 */
 	if (static_branch_unlikely(&rdt_mon_enable_key))
-		rmdir_mondata_subdir_allrdtgrp(r, d->id);
+		rmdir_mondata_subdir_allrdtgrp(r, d->hdr.id);
 
 	if (is_mbm_enabled())
 		cancel_delayed_work(&d->mbm_over);
-- 
2.41.0


^ permalink raw reply related	[flat|nested] 331+ messages in thread

* [PATCH v6 3/8] x86/resctrl: Prepare for different scope for control/monitor operations
  2023-09-28 19:13     ` [PATCH v6 " Tony Luck
  2023-09-28 19:13       ` [PATCH v6 1/8] x86/resctrl: Prepare for new domain scope Tony Luck
  2023-09-28 19:13       ` [PATCH v6 2/8] x86/resctrl: Prepare to split rdt_domain structure Tony Luck
@ 2023-09-28 19:13       ` Tony Luck
  2023-09-29 13:10         ` Peter Newman
  2023-09-28 19:13       ` [PATCH v6 4/8] x86/resctrl: Split the rdt_domain and rdt_hw_domain structures Tony Luck
                         ` (7 subsequent siblings)
  10 siblings, 1 reply; 331+ messages in thread
From: Tony Luck @ 2023-09-28 19:13 UTC (permalink / raw)
  To: Fenghua Yu, Reinette Chatre, Peter Newman, Jonathan Corbet,
	Shuah Khan, x86
  Cc: Shaopeng Tan, James Morse, Jamie Iles, Babu Moger, Randy Dunlap,
	linux-kernel, linux-doc, patches, Tony Luck

Resctrl assumes that control and monitor operations on a resource are
performed at the same scope.

Prepare for systems that use different scope (specifically L3 scope for
cache control and NODE scope for cache occupancy and memory bandwidth
monitoring).

Create separate domain lists for control and monitor operations.

Note that errors during initialization of either control or monitor
functions on a domain would previously result in that domain being
excluded from both control and monitor operations. Now that the domains
are allocated independently, it is no longer necessary to disable both
control and monitor operations if either fails.
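
To show the shape of the change (a self-contained userspace sketch with
made-up names, not the kernel code), a resource now carries a scope and a
domain list per function, and CPU hotplug touches each side only when the
resource is capable of it:

	#include <stdbool.h>
	#include <stdio.h>

	enum scope { SCOPE_L3, SCOPE_NODE };

	struct fake_resource {
		bool alloc_capable;
		bool mon_capable;
		enum scope ctrl_scope;	/* e.g. L3 cache for allocation */
		enum scope mon_scope;	/* e.g. NUMA node when SNC is on */
		int nr_ctrl_domains;	/* stand-ins for the two domain lists */
		int nr_mon_domains;
	};

	/* Pretend (cpu, scope) -> domain id mapping; numbers are arbitrary. */
	static int domain_id(int cpu, enum scope s)
	{
		return s == SCOPE_L3 ? cpu / 8 : cpu / 4;
	}

	static void add_cpu_ctrl(int cpu, struct fake_resource *r)
	{
		if (domain_id(cpu, r->ctrl_scope) < 0)
			return;		/* a control-side failure stays local */
		r->nr_ctrl_domains++;
	}

	static void add_cpu_mon(int cpu, struct fake_resource *r)
	{
		if (domain_id(cpu, r->mon_scope) < 0)
			return;		/* likewise for the monitor side */
		r->nr_mon_domains++;
	}

	static void domain_add_cpu(int cpu, struct fake_resource *r)
	{
		if (r->alloc_capable)
			add_cpu_ctrl(cpu, r);
		if (r->mon_capable)
			add_cpu_mon(cpu, r);
	}

	int main(void)
	{
		struct fake_resource l3 = {
			.alloc_capable = true, .mon_capable = true,
			.ctrl_scope = SCOPE_L3, .mon_scope = SCOPE_NODE,
		};

		domain_add_cpu(0, &l3);
		printf("ctrl domains: %d, mon domains: %d\n",
		       l3.nr_ctrl_domains, l3.nr_mon_domains);
		return 0;
	}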

Signed-off-by: Tony Luck <tony.luck@intel.com>

---

Changes since v5:

Commit comment: s/Existing resctrl assumes/Resctrl assumes/

Many new names. Put an underscore in "mon_domains" for consistency
with "mon_scope". Do the same with all the other "mon" changes.

Also rename "scope" to "ctrl_scope", "domains" to "ctrl_domains",
and all the associated functions and macros.
---
 include/linux/resctrl.h                   |  18 +-
 arch/x86/kernel/cpu/resctrl/internal.h    |   4 +-
 arch/x86/kernel/cpu/resctrl/core.c        | 198 ++++++++++++++++------
 arch/x86/kernel/cpu/resctrl/ctrlmondata.c |  12 +-
 arch/x86/kernel/cpu/resctrl/monitor.c     |   2 +-
 arch/x86/kernel/cpu/resctrl/pseudo_lock.c |   4 +-
 arch/x86/kernel/cpu/resctrl/rdtgroup.c    |  54 +++---
 7 files changed, 200 insertions(+), 92 deletions(-)

diff --git a/include/linux/resctrl.h b/include/linux/resctrl.h
index a583fa88ea5a..0af5c5aa5a6f 100644
--- a/include/linux/resctrl.h
+++ b/include/linux/resctrl.h
@@ -163,10 +163,12 @@ enum resctrl_scope {
  * @alloc_capable:	Is allocation available on this machine
  * @mon_capable:	Is monitor feature available on this machine
  * @num_rmid:		Number of RMIDs available
- * @scope:		Scope of this resource
+ * @ctrl_scope:		Scope of this resource for control functions
+ * @mon_scope:		Scope of this resource for monitor functions
  * @cache:		Cache allocation related data
  * @membw:		If the component has bandwidth controls, their properties.
- * @domains:		All domains for this resource
+ * @ctrl_domains:	Control domains for this resource
+ * @mon_domains:	Monitor domains for this resource
  * @name:		Name to use in "schemata" file.
  * @data_width:		Character width of data when displaying
  * @default_ctrl:	Specifies default cache cbm or memory B/W percent.
@@ -181,10 +183,12 @@ struct rdt_resource {
 	bool			alloc_capable;
 	bool			mon_capable;
 	int			num_rmid;
-	enum resctrl_scope	scope;
+	enum resctrl_scope	ctrl_scope;
+	enum resctrl_scope	mon_scope;
 	struct resctrl_cache	cache;
 	struct resctrl_membw	membw;
-	struct list_head	domains;
+	struct list_head	ctrl_domains;
+	struct list_head	mon_domains;
 	char			*name;
 	int			data_width;
 	u32			default_ctrl;
@@ -230,8 +234,10 @@ int resctrl_arch_update_one(struct rdt_resource *r, struct rdt_domain *d,
 
 u32 resctrl_arch_get_config(struct rdt_resource *r, struct rdt_domain *d,
 			    u32 closid, enum resctrl_conf_type type);
-int resctrl_online_domain(struct rdt_resource *r, struct rdt_domain *d);
-void resctrl_offline_domain(struct rdt_resource *r, struct rdt_domain *d);
+int resctrl_online_ctrl_domain(struct rdt_resource *r, struct rdt_domain *d);
+int resctrl_online_mon_domain(struct rdt_resource *r, struct rdt_domain *d);
+void resctrl_offline_ctrl_domain(struct rdt_resource *r, struct rdt_domain *d);
+void resctrl_offline_mon_domain(struct rdt_resource *r, struct rdt_domain *d);
 
 /**
  * resctrl_arch_rmid_read() - Read the eventid counter corresponding to rmid
diff --git a/arch/x86/kernel/cpu/resctrl/internal.h b/arch/x86/kernel/cpu/resctrl/internal.h
index 85ceaf9a31ac..e9a2a8993d14 100644
--- a/arch/x86/kernel/cpu/resctrl/internal.h
+++ b/arch/x86/kernel/cpu/resctrl/internal.h
@@ -511,8 +511,8 @@ void rdtgroup_kn_unlock(struct kernfs_node *kn);
 int rdtgroup_kn_mode_restrict(struct rdtgroup *r, const char *name);
 int rdtgroup_kn_mode_restore(struct rdtgroup *r, const char *name,
 			     umode_t mask);
-struct rdt_domain *rdt_find_domain(struct rdt_resource *r, int id,
-				   struct list_head **pos);
+struct rdt_domain_hdr *rdt_find_domain(struct list_head *h, int id,
+				       struct list_head **pos);
 ssize_t rdtgroup_schemata_write(struct kernfs_open_file *of,
 				char *buf, size_t nbytes, loff_t off);
 int rdtgroup_schemata_show(struct kernfs_open_file *of,
diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
index 05369add4578..7ef178fb7c77 100644
--- a/arch/x86/kernel/cpu/resctrl/core.c
+++ b/arch/x86/kernel/cpu/resctrl/core.c
@@ -57,7 +57,8 @@ static void
 mba_wrmsr_amd(struct rdt_domain *d, struct msr_param *m,
 	      struct rdt_resource *r);
 
-#define domain_init(id) LIST_HEAD_INIT(rdt_resources_all[id].r_resctrl.domains)
+#define ctrl_domain_init(id) LIST_HEAD_INIT(rdt_resources_all[id].r_resctrl.ctrl_domains)
+#define mon_domain_init(id) LIST_HEAD_INIT(rdt_resources_all[id].r_resctrl.mon_domains)
 
 struct rdt_hw_resource rdt_resources_all[] = {
 	[RDT_RESOURCE_L3] =
@@ -65,8 +66,10 @@ struct rdt_hw_resource rdt_resources_all[] = {
 		.r_resctrl = {
 			.rid			= RDT_RESOURCE_L3,
 			.name			= "L3",
-			.scope			= RESCTRL_L3_CACHE,
-			.domains		= domain_init(RDT_RESOURCE_L3),
+			.ctrl_scope		= RESCTRL_L3_CACHE,
+			.mon_scope		= RESCTRL_L3_CACHE,
+			.ctrl_domains		= ctrl_domain_init(RDT_RESOURCE_L3),
+			.mon_domains		= mon_domain_init(RDT_RESOURCE_L3),
 			.parse_ctrlval		= parse_cbm,
 			.format_str		= "%d=%0*x",
 			.fflags			= RFTYPE_RES_CACHE,
@@ -79,8 +82,8 @@ struct rdt_hw_resource rdt_resources_all[] = {
 		.r_resctrl = {
 			.rid			= RDT_RESOURCE_L2,
 			.name			= "L2",
-			.scope			= RESCTRL_L2_CACHE,
-			.domains		= domain_init(RDT_RESOURCE_L2),
+			.ctrl_scope		= RESCTRL_L2_CACHE,
+			.ctrl_domains		= ctrl_domain_init(RDT_RESOURCE_L2),
 			.parse_ctrlval		= parse_cbm,
 			.format_str		= "%d=%0*x",
 			.fflags			= RFTYPE_RES_CACHE,
@@ -93,8 +96,8 @@ struct rdt_hw_resource rdt_resources_all[] = {
 		.r_resctrl = {
 			.rid			= RDT_RESOURCE_MBA,
 			.name			= "MB",
-			.scope			= RESCTRL_L3_CACHE,
-			.domains		= domain_init(RDT_RESOURCE_MBA),
+			.ctrl_scope		= RESCTRL_L3_CACHE,
+			.ctrl_domains		= ctrl_domain_init(RDT_RESOURCE_MBA),
 			.parse_ctrlval		= parse_bw,
 			.format_str		= "%d=%*u",
 			.fflags			= RFTYPE_RES_MB,
@@ -105,8 +108,8 @@ struct rdt_hw_resource rdt_resources_all[] = {
 		.r_resctrl = {
 			.rid			= RDT_RESOURCE_SMBA,
 			.name			= "SMBA",
-			.scope			= RESCTRL_L3_CACHE,
-			.domains		= domain_init(RDT_RESOURCE_SMBA),
+			.ctrl_scope		= RESCTRL_L3_CACHE,
+			.ctrl_domains		= ctrl_domain_init(RDT_RESOURCE_SMBA),
 			.parse_ctrlval		= parse_bw,
 			.format_str		= "%d=%*u",
 			.fflags			= RFTYPE_RES_MB,
@@ -352,7 +355,7 @@ struct rdt_domain *get_domain_from_cpu(int cpu, struct rdt_resource *r)
 {
 	struct rdt_domain *d;
 
-	list_for_each_entry(d, &r->domains, hdr.list) {
+	list_for_each_entry(d, &r->ctrl_domains, hdr.list) {
 		/* Find the domain that contains this CPU */
 		if (cpumask_test_cpu(cpu, &d->cpu_mask))
 			return d;
@@ -384,29 +387,39 @@ void rdt_ctrl_update(void *arg)
 }
 
 /*
- * rdt_find_domain - Find a domain in a resource that matches input resource id
+ * rdt_find_domain - Find a domain in one of a resource's domain lists.
  *
- * Search resource r's domain list to find the resource id. If the resource
- * id is found in a domain, return the domain. Otherwise, if requested by
- * caller, return the first domain whose id is bigger than the input id.
+ * Search the list to find the resource id. If the resource id is found
+ * in a domain, return the domain. Otherwise, if requested by caller,
+ * return the first domain whose id is bigger than the input id.
  * The domain list is sorted by id in ascending order.
+ *
+ * If an existing domain in the resource r's domain list matches the cpu's
+ * resource id, add the cpu in the domain.
+ *
+ * Otherwise, caller will allocate a new domain and insert into the right position
+ * in the domain list sorted by id in ascending order.
+ *
+ * The order in the domain list is visible to users when we print entries
+ * in the schemata file and schemata input is validated to have the same order
+ * as this list.
  */
-struct rdt_domain *rdt_find_domain(struct rdt_resource *r, int id,
-				   struct list_head **pos)
+struct rdt_domain_hdr *rdt_find_domain(struct list_head *h, int id,
+				       struct list_head **pos)
 {
-	struct rdt_domain *d;
+	struct rdt_domain_hdr *d;
 	struct list_head *l;
 
 	if (id < 0)
 		return ERR_PTR(-ENODEV);
 
-	list_for_each(l, &r->domains) {
-		d = list_entry(l, struct rdt_domain, hdr.list);
+	list_for_each(l, h) {
+		d = list_entry(l, struct rdt_domain_hdr, list);
 		/* When id is found, return its domain. */
-		if (id == d->hdr.id)
+		if (id == d->id)
 			return d;
 		/* Stop searching when finding id's position in sorted list. */
-		if (id < d->hdr.id)
+		if (id < d->id)
 			break;
 	}
 
@@ -500,37 +513,27 @@ static int get_domain_id_from_scope(int cpu, enum resctrl_scope scope)
 	return -EINVAL;
 }
 
-/*
- * domain_add_cpu - Add a cpu to a resource's domain list.
- *
- * If an existing domain in the resource r's domain list matches the cpu's
- * resource id, add the cpu in the domain.
- *
- * Otherwise, a new domain is allocated and inserted into the right position
- * in the domain list sorted by id in ascending order.
- *
- * The order in the domain list is visible to users when we print entries
- * in the schemata file and schemata input is validated to have the same order
- * as this list.
- */
-static void domain_add_cpu(int cpu, struct rdt_resource *r)
+static void domain_add_cpu_ctrl(int cpu, struct rdt_resource *r)
 {
-	int id = get_domain_id_from_scope(cpu, r->scope);
+	int id = get_domain_id_from_scope(cpu, r->ctrl_scope);
 	struct list_head *add_pos = NULL;
 	struct rdt_hw_domain *hw_dom;
+	struct rdt_domain_hdr *hdr;
 	struct rdt_domain *d;
 	int err;
 
 	if (id < 0) {
-		pr_warn_once("Can't find domain id for CPU:%d scope:%d for resource %s\n",
-			     cpu, r->scope, r->name);
+		pr_warn_once("Can't find control domain id for CPU:%d scope:%d for resource %s\n",
+			     cpu, r->ctrl_scope, r->name);
 		return;
 	}
-	d = rdt_find_domain(r, id, &add_pos);
-	if (IS_ERR(d)) {
-		pr_warn("Couldn't find cache id for CPU %d\n", cpu);
+
+	hdr = rdt_find_domain(&r->ctrl_domains, id, &add_pos);
+	if (IS_ERR(hdr)) {
+		pr_warn("Couldn't find control scope id=%d for CPU %d\n", id, cpu);
 		return;
 	}
+	d = container_of(hdr, struct rdt_domain, hdr);
 
 	if (d) {
 		cpumask_set_cpu(cpu, &d->cpu_mask);
@@ -549,44 +552,101 @@ static void domain_add_cpu(int cpu, struct rdt_resource *r)
 
 	rdt_domain_reconfigure_cdp(r);
 
-	if (r->alloc_capable && domain_setup_ctrlval(r, d)) {
+	if (domain_setup_ctrlval(r, d)) {
 		domain_free(hw_dom);
 		return;
 	}
 
-	if (r->mon_capable && arch_domain_mbm_alloc(r->num_rmid, hw_dom)) {
+	list_add_tail(&d->hdr.list, add_pos);
+
+	err = resctrl_online_ctrl_domain(r, d);
+	if (err) {
+		list_del(&d->hdr.list);
 		domain_free(hw_dom);
+	}
+}
+
+static void domain_add_cpu_mon(int cpu, struct rdt_resource *r)
+{
+	int id = get_domain_id_from_scope(cpu, r->mon_scope);
+	struct list_head *add_pos = NULL;
+	struct rdt_hw_domain *hw_mondom;
+	struct rdt_domain_hdr *hdr;
+	struct rdt_domain *d;
+	int err;
+
+	if (id < 0) {
+		pr_warn_once("Can't find monitor domain id for CPU:%d scope:%d for resource %s\n",
+			     cpu, r->mon_scope, r->name);
+		return;
+	}
+
+	hdr = rdt_find_domain(&r->mon_domains, id, &add_pos);
+	if (IS_ERR(hdr)) {
+		pr_warn("Couldn't find monitor scope id=%d for CPU %d\n", id, cpu);
+		return;
+	}
+	d = container_of(hdr, struct rdt_domain, hdr);
+
+	if (d) {
+		cpumask_set_cpu(cpu, &d->cpu_mask);
+		return;
+	}
+
+	hw_mondom = kzalloc_node(sizeof(*hw_mondom), GFP_KERNEL, cpu_to_node(cpu));
+	if (!hw_mondom)
+		return;
+
+	d = &hw_mondom->d_resctrl;
+	d->hdr.id = id;
+	cpumask_set_cpu(cpu, &d->cpu_mask);
+
+	if (arch_domain_mbm_alloc(r->num_rmid, hw_mondom)) {
+		domain_free(hw_mondom);
 		return;
 	}
 
 	list_add_tail(&d->hdr.list, add_pos);
 
-	err = resctrl_online_domain(r, d);
+	err = resctrl_online_mon_domain(r, d);
 	if (err) {
 		list_del(&d->hdr.list);
-		domain_free(hw_dom);
+		domain_free(hw_mondom);
 	}
 }
 
-static void domain_remove_cpu(int cpu, struct rdt_resource *r)
+/*
+ * domain_add_cpu - Add a cpu to either/both resource's domain lists.
+ */
+static void domain_add_cpu(int cpu, struct rdt_resource *r)
+{
+	if (r->alloc_capable)
+		domain_add_cpu_ctrl(cpu, r);
+	if (r->mon_capable)
+		domain_add_cpu_mon(cpu, r);
+}
+
+static void domain_remove_cpu_ctrl(int cpu, struct rdt_resource *r)
 {
-	int id = get_domain_id_from_scope(cpu, r->scope);
+	int id = get_domain_id_from_scope(cpu, r->ctrl_scope);
 	struct rdt_hw_domain *hw_dom;
+	struct rdt_domain_hdr *hdr;
 	struct rdt_domain *d;
 
 	if (id < 0)
 		return;
 
-	d = rdt_find_domain(r, id, NULL);
-	if (IS_ERR_OR_NULL(d)) {
-		pr_warn("Couldn't find cache id for CPU %d\n", cpu);
+	hdr = rdt_find_domain(&r->ctrl_domains, id, NULL);
+	if (IS_ERR_OR_NULL(hdr)) {
+		pr_warn("Couldn't find control scope id=%d for CPU %d\n", id, cpu);
 		return;
 	}
+	d = container_of(hdr, struct rdt_domain, hdr);
 	hw_dom = resctrl_to_arch_dom(d);
 
 	cpumask_clear_cpu(cpu, &d->cpu_mask);
 	if (cpumask_empty(&d->cpu_mask)) {
-		resctrl_offline_domain(r, d);
+		resctrl_offline_ctrl_domain(r, d);
 		list_del(&d->hdr.list);
 
 		/*
@@ -599,6 +659,34 @@ static void domain_remove_cpu(int cpu, struct rdt_resource *r)
 
 		return;
 	}
+}
+
+static void domain_remove_cpu_mon(int cpu, struct rdt_resource *r)
+{
+	int id = get_domain_id_from_scope(cpu, r->mon_scope);
+	struct rdt_hw_domain *hw_mondom;
+	struct rdt_domain_hdr *hdr;
+	struct rdt_domain *d;
+
+	if (id < 0)
+		return;
+
+	hdr = rdt_find_domain(&r->mon_domains, id, NULL);
+	if (IS_ERR_OR_NULL(hdr)) {
+		pr_warn("Couldn't find scope id=%d for CPU %d\n", id, cpu);
+		return;
+	}
+	d = container_of(hdr, struct rdt_domain, hdr);
+	hw_mondom = resctrl_to_arch_dom(d);
+
+	cpumask_clear_cpu(cpu, &d->cpu_mask);
+	if (cpumask_empty(&d->cpu_mask)) {
+		resctrl_offline_mon_domain(r, d);
+		list_del(&d->hdr.list);
+		domain_free(hw_mondom);
+
+		return;
+	}
 
 	if (r == &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl) {
 		if (is_mbm_enabled() && cpu == d->mbm_work_cpu) {
@@ -613,6 +701,14 @@ static void domain_remove_cpu(int cpu, struct rdt_resource *r)
 	}
 }
 
+static void domain_remove_cpu(int cpu, struct rdt_resource *r)
+{
+	if (r->alloc_capable)
+		domain_remove_cpu_ctrl(cpu, r);
+	if (r->mon_capable)
+		domain_remove_cpu_mon(cpu, r);
+}
+
 static void clear_closid_rmid(int cpu)
 {
 	struct resctrl_pqr_state *state = this_cpu_ptr(&pqr_state);
diff --git a/arch/x86/kernel/cpu/resctrl/ctrlmondata.c b/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
index 8bce591a1018..a6261e177cc1 100644
--- a/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
+++ b/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
@@ -224,7 +224,7 @@ static int parse_line(char *line, struct resctrl_schema *s,
 		return -EINVAL;
 	}
 	dom = strim(dom);
-	list_for_each_entry(d, &r->domains, hdr.list) {
+	list_for_each_entry(d, &r->ctrl_domains, hdr.list) {
 		if (d->hdr.id == dom_id) {
 			data.buf = dom;
 			data.rdtgrp = rdtgrp;
@@ -316,7 +316,7 @@ int resctrl_arch_update_domains(struct rdt_resource *r, u32 closid)
 		return -ENOMEM;
 
 	msr_param.res = NULL;
-	list_for_each_entry(d, &r->domains, hdr.list) {
+	list_for_each_entry(d, &r->ctrl_domains, hdr.list) {
 		hw_dom = resctrl_to_arch_dom(d);
 		for (t = 0; t < CDP_NUM_TYPES; t++) {
 			cfg = &hw_dom->d_resctrl.staged_config[t];
@@ -464,7 +464,7 @@ static void show_doms(struct seq_file *s, struct resctrl_schema *schema, int clo
 	u32 ctrl_val;
 
 	seq_printf(s, "%*s:", max_name_width, schema->name);
-	list_for_each_entry(dom, &r->domains, hdr.list) {
+	list_for_each_entry(dom, &r->ctrl_domains, hdr.list) {
 		if (sep)
 			seq_puts(s, ";");
 
@@ -540,6 +540,7 @@ void mon_event_read(struct rmid_read *rr, struct rdt_resource *r,
 int rdtgroup_mondata_show(struct seq_file *m, void *arg)
 {
 	struct kernfs_open_file *of = m->private;
+	struct rdt_domain_hdr *hdr;
 	u32 resid, evtid, domid;
 	struct rdtgroup *rdtgrp;
 	struct rdt_resource *r;
@@ -560,11 +561,12 @@ int rdtgroup_mondata_show(struct seq_file *m, void *arg)
 	evtid = md.u.evtid;
 
 	r = &rdt_resources_all[resid].r_resctrl;
-	d = rdt_find_domain(r, domid, NULL);
-	if (IS_ERR_OR_NULL(d)) {
+	hdr = rdt_find_domain(&r->mon_domains, domid, NULL);
+	if (IS_ERR_OR_NULL(hdr)) {
 		ret = -ENOENT;
 		goto out;
 	}
+	d = container_of(hdr, struct rdt_domain, hdr);
 
 	mon_event_read(&rr, r, d, rdtgrp, evtid, false);
 
diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c
index 27cda5988d7f..3265b8499e2a 100644
--- a/arch/x86/kernel/cpu/resctrl/monitor.c
+++ b/arch/x86/kernel/cpu/resctrl/monitor.c
@@ -340,7 +340,7 @@ static void add_rmid_to_limbo(struct rmid_entry *entry)
 
 	entry->busy = 0;
 	cpu = get_cpu();
-	list_for_each_entry(d, &r->domains, hdr.list) {
+	list_for_each_entry(d, &r->mon_domains, hdr.list) {
 		if (cpumask_test_cpu(cpu, &d->cpu_mask)) {
 			err = resctrl_arch_rmid_read(r, d, entry->rmid,
 						     QOS_L3_OCCUP_EVENT_ID,
diff --git a/arch/x86/kernel/cpu/resctrl/pseudo_lock.c b/arch/x86/kernel/cpu/resctrl/pseudo_lock.c
index 18b6183a1b48..bda32b4e1c1e 100644
--- a/arch/x86/kernel/cpu/resctrl/pseudo_lock.c
+++ b/arch/x86/kernel/cpu/resctrl/pseudo_lock.c
@@ -292,7 +292,7 @@ static void pseudo_lock_region_clear(struct pseudo_lock_region *plr)
  */
 static int pseudo_lock_region_init(struct pseudo_lock_region *plr)
 {
-	int scope = plr->s->res->scope;
+	int scope = plr->s->res->ctrl_scope;
 	struct cpu_cacheinfo *ci;
 	int ret;
 	int i;
@@ -856,7 +856,7 @@ bool rdtgroup_pseudo_locked_in_hierarchy(struct rdt_domain *d)
 	 * associated with them.
 	 */
 	for_each_alloc_capable_rdt_resource(r) {
-		list_for_each_entry(d_i, &r->domains, hdr.list) {
+		list_for_each_entry(d_i, &r->ctrl_domains, hdr.list) {
 			if (d_i->plr)
 				cpumask_or(cpu_with_psl, cpu_with_psl,
 					   &d_i->cpu_mask);
diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
index 42adf17ea6fa..8132f81f31bb 100644
--- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
+++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
@@ -86,7 +86,7 @@ void rdt_staged_configs_clear(void)
 	lockdep_assert_held(&rdtgroup_mutex);
 
 	for_each_alloc_capable_rdt_resource(r) {
-		list_for_each_entry(dom, &r->domains, hdr.list)
+		list_for_each_entry(dom, &r->ctrl_domains, hdr.list)
 			memset(dom->staged_config, 0, sizeof(dom->staged_config));
 	}
 }
@@ -928,7 +928,7 @@ static int rdt_bit_usage_show(struct kernfs_open_file *of,
 
 	mutex_lock(&rdtgroup_mutex);
 	hw_shareable = r->cache.shareable_bits;
-	list_for_each_entry(dom, &r->domains, hdr.list) {
+	list_for_each_entry(dom, &r->ctrl_domains, hdr.list) {
 		if (sep)
 			seq_putc(seq, ';');
 		sw_shareable = 0;
@@ -1233,7 +1233,7 @@ static bool rdtgroup_mode_test_exclusive(struct rdtgroup *rdtgrp)
 		if (r->rid == RDT_RESOURCE_MBA || r->rid == RDT_RESOURCE_SMBA)
 			continue;
 		has_cache = true;
-		list_for_each_entry(d, &r->domains, hdr.list) {
+		list_for_each_entry(d, &r->ctrl_domains, hdr.list) {
 			ctrl = resctrl_arch_get_config(r, d, closid,
 						       s->conf_type);
 			if (rdtgroup_cbm_overlaps(s, d, ctrl, closid, false)) {
@@ -1345,13 +1345,13 @@ unsigned int rdtgroup_cbm_to_size(struct rdt_resource *r,
 	unsigned int size = 0;
 	int num_b, i;
 
-	if (WARN_ON_ONCE(r->scope != RESCTRL_L2_CACHE && r->scope != RESCTRL_L3_CACHE))
+	if (WARN_ON_ONCE(r->ctrl_scope != RESCTRL_L2_CACHE && r->ctrl_scope != RESCTRL_L3_CACHE))
 		return -EINVAL;
 
 	num_b = bitmap_weight(&cbm, r->cache.cbm_len);
 	ci = get_cpu_cacheinfo(cpumask_any(&d->cpu_mask));
 	for (i = 0; i < ci->num_leaves; i++) {
-		if (ci->info_list[i].level == r->scope) {
+		if (ci->info_list[i].level == r->ctrl_scope) {
 			size = ci->info_list[i].size / r->cache.cbm_len * num_b;
 			break;
 		}
@@ -1410,7 +1410,7 @@ static int rdtgroup_size_show(struct kernfs_open_file *of,
 		type = schema->conf_type;
 		sep = false;
 		seq_printf(s, "%*s:", max_name_width, schema->name);
-		list_for_each_entry(d, &r->domains, hdr.list) {
+		list_for_each_entry(d, &r->ctrl_domains, hdr.list) {
 			if (sep)
 				seq_putc(s, ';');
 			if (rdtgrp->mode == RDT_MODE_PSEUDO_LOCKSETUP) {
@@ -1499,7 +1499,7 @@ static int mbm_config_show(struct seq_file *s, struct rdt_resource *r, u32 evtid
 
 	mutex_lock(&rdtgroup_mutex);
 
-	list_for_each_entry(dom, &r->domains, hdr.list) {
+	list_for_each_entry(dom, &r->mon_domains, hdr.list) {
 		if (sep)
 			seq_puts(s, ";");
 
@@ -1622,7 +1622,7 @@ static int mon_config_write(struct rdt_resource *r, char *tok, u32 evtid)
 		return -EINVAL;
 	}
 
-	list_for_each_entry(d, &r->domains, hdr.list) {
+	list_for_each_entry(d, &r->mon_domains, hdr.list) {
 		if (d->hdr.id == dom_id) {
 			ret = mbm_config_write_domain(r, d, evtid, val);
 			if (ret)
@@ -2141,7 +2141,7 @@ static int set_cache_qos_cfg(int level, bool enable)
 		return -ENOMEM;
 
 	r_l = &rdt_resources_all[level].r_resctrl;
-	list_for_each_entry(d, &r_l->domains, hdr.list) {
+	list_for_each_entry(d, &r_l->ctrl_domains, hdr.list) {
 		if (r_l->cache.arch_has_per_cpu_cfg)
 			/* Pick all the CPUs in the domain instance */
 			for_each_cpu(cpu, &d->cpu_mask)
@@ -2226,7 +2226,7 @@ static int set_mba_sc(bool mba_sc)
 
 	r->membw.mba_sc = mba_sc;
 
-	list_for_each_entry(d, &r->domains, hdr.list) {
+	list_for_each_entry(d, &r->ctrl_domains, hdr.list) {
 		for (i = 0; i < num_closid; i++)
 			d->mbps_val[i] = MBA_MAX_MBPS;
 	}
@@ -2528,7 +2528,7 @@ static int rdt_get_tree(struct fs_context *fc)
 
 	if (is_mbm_enabled()) {
 		r = &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl;
-		list_for_each_entry(dom, &r->domains, hdr.list)
+		list_for_each_entry(dom, &r->mon_domains, hdr.list)
 			mbm_setup_overflow_handler(dom, MBM_OVERFLOW_INTERVAL);
 	}
 
@@ -2649,10 +2649,10 @@ static int reset_all_ctrls(struct rdt_resource *r)
 
 	/*
 	 * Disable resource control for this resource by setting all
-	 * CBMs in all domains to the maximum mask value. Pick one CPU
+	 * CBMs in all ctrl_domains to the maximum mask value. Pick one CPU
 	 * from each domain to update the MSRs below.
 	 */
-	list_for_each_entry(d, &r->domains, hdr.list) {
+	list_for_each_entry(d, &r->ctrl_domains, hdr.list) {
 		hw_dom = resctrl_to_arch_dom(d);
 		cpumask_set_cpu(cpumask_any(&d->cpu_mask), cpu_mask);
 
@@ -2922,7 +2922,7 @@ static int mkdir_mondata_subdir_alldom(struct kernfs_node *parent_kn,
 	struct rdt_domain *dom;
 	int ret;
 
-	list_for_each_entry(dom, &r->domains, hdr.list) {
+	list_for_each_entry(dom, &r->mon_domains, hdr.list) {
 		ret = mkdir_mondata_subdir(parent_kn, dom, r, prgrp);
 		if (ret)
 			return ret;
@@ -3104,7 +3104,7 @@ static int rdtgroup_init_cat(struct resctrl_schema *s, u32 closid)
 	struct rdt_domain *d;
 	int ret;
 
-	list_for_each_entry(d, &s->res->domains, hdr.list) {
+	list_for_each_entry(d, &s->res->ctrl_domains, hdr.list) {
 		ret = __init_one_rdt_domain(d, s, closid);
 		if (ret < 0)
 			return ret;
@@ -3119,7 +3119,7 @@ static void rdtgroup_init_mba(struct rdt_resource *r, u32 closid)
 	struct resctrl_staged_config *cfg;
 	struct rdt_domain *d;
 
-	list_for_each_entry(d, &r->domains, hdr.list) {
+	list_for_each_entry(d, &r->ctrl_domains, hdr.list) {
 		if (is_mba_sc(r)) {
 			d->mbps_val[closid] = MBA_MAX_MBPS;
 			continue;
@@ -3711,16 +3711,16 @@ static void domain_destroy_mon_state(struct rdt_domain *d)
 	kfree(d->mbm_local);
 }
 
-void resctrl_offline_domain(struct rdt_resource *r, struct rdt_domain *d)
+void resctrl_offline_ctrl_domain(struct rdt_resource *r, struct rdt_domain *d)
 {
 	lockdep_assert_held(&rdtgroup_mutex);
 
 	if (supports_mba_mbps() && r->rid == RDT_RESOURCE_MBA)
 		mba_sc_domain_destroy(r, d);
+}
 
-	if (!r->mon_capable)
-		return;
-
+void resctrl_offline_mon_domain(struct rdt_resource *r, struct rdt_domain *d)
+{
 	/*
 	 * If resctrl is mounted, remove all the
 	 * per domain monitor data directories.
@@ -3776,18 +3776,22 @@ static int domain_setup_mon_state(struct rdt_resource *r, struct rdt_domain *d)
 	return 0;
 }
 
-int resctrl_online_domain(struct rdt_resource *r, struct rdt_domain *d)
+int resctrl_online_ctrl_domain(struct rdt_resource *r, struct rdt_domain *d)
 {
-	int err;
-
 	lockdep_assert_held(&rdtgroup_mutex);
 
 	if (supports_mba_mbps() && r->rid == RDT_RESOURCE_MBA)
 		/* RDT_RESOURCE_MBA is never mon_capable */
 		return mba_sc_domain_allocate(r, d);
 
-	if (!r->mon_capable)
-		return 0;
+	return 0;
+}
+
+int resctrl_online_mon_domain(struct rdt_resource *r, struct rdt_domain *d)
+{
+	int err;
+
+	lockdep_assert_held(&rdtgroup_mutex);
 
 	err = domain_setup_mon_state(r, d);
 	if (err)
-- 
2.41.0


^ permalink raw reply related	[flat|nested] 331+ messages in thread

* [PATCH v6 4/8] x86/resctrl: Split the rdt_domain and rdt_hw_domain structures
  2023-09-28 19:13     ` [PATCH v6 " Tony Luck
                         ` (2 preceding siblings ...)
  2023-09-28 19:13       ` [PATCH v6 3/8] x86/resctrl: Prepare for different scope for control/monitor operations Tony Luck
@ 2023-09-28 19:13       ` Tony Luck
  2023-09-29 13:44         ` Peter Newman
  2023-09-28 19:13       ` [PATCH v6 5/8] x86/resctrl: Add node-scope to the options for feature scope Tony Luck
                         ` (6 subsequent siblings)
  10 siblings, 1 reply; 331+ messages in thread
From: Tony Luck @ 2023-09-28 19:13 UTC (permalink / raw)
  To: Fenghua Yu, Reinette Chatre, Peter Newman, Jonathan Corbet,
	Shuah Khan, x86
  Cc: Shaopeng Tan, James Morse, Jamie Iles, Babu Moger, Randy Dunlap,
	linux-kernel, linux-doc, patches, Tony Luck

The same rdt_domain structure is used for both control and monitor
functions. But this results in wasted memory as some of the fields are
only used by control functions, while most are only used for monitor
functions.

Split into separate rdt_ctrl_domain and rdt_mon_domain structures with
just the fields required for control and monitoring respectively.

Similarly, split the rdt_hw_domain structure into rdt_hw_ctrl_domain
and rdt_hw_mon_domain.
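
The memory argument can be seen with a quick standalone sketch (userspace,
simplified field types, not the kernel structures themselves): with the
combined layout every control domain also carries the monitoring-only fields,
and vice versa.

	#include <stdio.h>

	struct combined_domain {
		int id;
		unsigned long cpu_mask[4];	/* stand-in for struct cpumask */
		/* control-only */
		unsigned int *staged_config;
		unsigned int *mbps_val;
		/* monitor-only */
		unsigned long *rmid_busy_llc;
		void *mbm_total, *mbm_local;
		int mbm_work_cpu, cqm_work_cpu;
	};

	struct ctrl_domain {
		int id;
		unsigned long cpu_mask[4];
		unsigned int *staged_config;
		unsigned int *mbps_val;
	};

	struct mon_domain {
		int id;
		unsigned long cpu_mask[4];
		unsigned long *rmid_busy_llc;
		void *mbm_total, *mbm_local;
		int mbm_work_cpu, cqm_work_cpu;
	};

	int main(void)
	{
		printf("combined: %zu bytes\n", sizeof(struct combined_domain));
		printf("ctrl:     %zu bytes\n", sizeof(struct ctrl_domain));
		printf("mon:      %zu bytes\n", sizeof(struct mon_domain));
		return 0;
	}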

Signed-off-by: Tony Luck <tony.luck@intel.com>

---

Changes since v5:

Make rdt_find_domain() work on either control or monitor domains
using the infrastructure set up in the previous patch to have a
common header structure in each.

Don't use a field parameter in the domain_init() macro, just provide
separate "ctrl_domain_init()" and "mon_domain_init()" versions.

Improve error messages if domain_add_cpu_{ctrl,mon}() fail to
locate a domain when adding a CPU.

Re-order local variable declarations to maintain the reverse fir tree
pattern in functions where the name changes in this patch broke
the pattern.

Move the comment describing how domain lists are ordered from in
front of domain_add_cpu() to rdt_find_domain(), which now does the
majority of this work.

Drop two blank lines that don't belong.
---
 include/linux/resctrl.h                   | 50 +++++++------
 arch/x86/kernel/cpu/resctrl/internal.h    | 60 ++++++++++------
 arch/x86/kernel/cpu/resctrl/core.c        | 87 ++++++++++++-----------
 arch/x86/kernel/cpu/resctrl/ctrlmondata.c | 32 ++++-----
 arch/x86/kernel/cpu/resctrl/monitor.c     | 40 +++++------
 arch/x86/kernel/cpu/resctrl/pseudo_lock.c |  6 +-
 arch/x86/kernel/cpu/resctrl/rdtgroup.c    | 62 ++++++++--------
 7 files changed, 184 insertions(+), 153 deletions(-)

diff --git a/include/linux/resctrl.h b/include/linux/resctrl.h
index 0af5c5aa5a6f..1c925e3db2ea 100644
--- a/include/linux/resctrl.h
+++ b/include/linux/resctrl.h
@@ -63,7 +63,25 @@ struct rdt_domain_hdr {
 };
 
 /**
- * struct rdt_domain - group of CPUs sharing a resctrl resource
+ * struct rdt_ctrl_domain - group of CPUs sharing a resctrl control resource
+ * @hdr:		common header for different domain types
+ * @cpu_mask:		which CPUs share this resource
+ * @plr:		pseudo-locked region (if any) associated with domain
+ * @staged_config:	parsed configuration to be applied
+ * @mbps_val:		When mba_sc is enabled, this holds the array of user
+ *			specified control values for mba_sc in MBps, indexed
+ *			by closid
+ */
+struct rdt_ctrl_domain {
+	struct rdt_domain_hdr		hdr;
+	struct cpumask			cpu_mask;
+	struct pseudo_lock_region	*plr;
+	struct resctrl_staged_config	staged_config[CDP_NUM_TYPES];
+	u32				*mbps_val;
+};
+
+/**
+ * struct rdt_mon_domain - group of CPUs sharing a resctrl monitor resource
  * @hdr:		common header for different domain types
  * @cpu_mask:		which CPUs share this resource
  * @rmid_busy_llc:	bitmap of which limbo RMIDs are above threshold
@@ -73,13 +91,8 @@ struct rdt_domain_hdr {
  * @cqm_limbo:		worker to periodically read CQM h/w counters
  * @mbm_work_cpu:	worker CPU for MBM h/w counters
  * @cqm_work_cpu:	worker CPU for CQM h/w counters
- * @plr:		pseudo-locked region (if any) associated with domain
- * @staged_config:	parsed configuration to be applied
- * @mbps_val:		When mba_sc is enabled, this holds the array of user
- *			specified control values for mba_sc in MBps, indexed
- *			by closid
  */
-struct rdt_domain {
+struct rdt_mon_domain {
 	struct rdt_domain_hdr		hdr;
 	struct cpumask			cpu_mask;
 	unsigned long			*rmid_busy_llc;
@@ -89,9 +102,6 @@ struct rdt_domain {
 	struct delayed_work		cqm_limbo;
 	int				mbm_work_cpu;
 	int				cqm_work_cpu;
-	struct pseudo_lock_region	*plr;
-	struct resctrl_staged_config	staged_config[CDP_NUM_TYPES];
-	u32				*mbps_val;
 };
 
 /**
@@ -195,7 +205,7 @@ struct rdt_resource {
 	const char		*format_str;
 	int			(*parse_ctrlval)(struct rdt_parse_data *data,
 						 struct resctrl_schema *s,
-						 struct rdt_domain *d);
+						 struct rdt_ctrl_domain *d);
 	struct list_head	evt_list;
 	unsigned long		fflags;
 	bool			cdp_capable;
@@ -229,15 +239,15 @@ int resctrl_arch_update_domains(struct rdt_resource *r, u32 closid);
  * Update the ctrl_val and apply this config right now.
  * Must be called on one of the domain's CPUs.
  */
-int resctrl_arch_update_one(struct rdt_resource *r, struct rdt_domain *d,
+int resctrl_arch_update_one(struct rdt_resource *r, struct rdt_ctrl_domain *d,
 			    u32 closid, enum resctrl_conf_type t, u32 cfg_val);
 
-u32 resctrl_arch_get_config(struct rdt_resource *r, struct rdt_domain *d,
+u32 resctrl_arch_get_config(struct rdt_resource *r, struct rdt_ctrl_domain *d,
 			    u32 closid, enum resctrl_conf_type type);
-int resctrl_online_ctrl_domain(struct rdt_resource *r, struct rdt_domain *d);
-int resctrl_online_mon_domain(struct rdt_resource *r, struct rdt_domain *d);
-void resctrl_offline_ctrl_domain(struct rdt_resource *r, struct rdt_domain *d);
-void resctrl_offline_mon_domain(struct rdt_resource *r, struct rdt_domain *d);
+int resctrl_online_ctrl_domain(struct rdt_resource *r, struct rdt_ctrl_domain *d);
+int resctrl_online_mon_domain(struct rdt_resource *r, struct rdt_mon_domain *d);
+void resctrl_offline_ctrl_domain(struct rdt_resource *r, struct rdt_ctrl_domain *d);
+void resctrl_offline_mon_domain(struct rdt_resource *r, struct rdt_mon_domain *d);
 
 /**
  * resctrl_arch_rmid_read() - Read the eventid counter corresponding to rmid
@@ -253,7 +263,7 @@ void resctrl_offline_mon_domain(struct rdt_resource *r, struct rdt_domain *d);
  * Return:
  * 0 on success, or -EIO, -EINVAL etc on error.
  */
-int resctrl_arch_rmid_read(struct rdt_resource *r, struct rdt_domain *d,
+int resctrl_arch_rmid_read(struct rdt_resource *r, struct rdt_mon_domain *d,
 			   u32 rmid, enum resctrl_event_id eventid, u64 *val);
 
 /**
@@ -266,7 +276,7 @@ int resctrl_arch_rmid_read(struct rdt_resource *r, struct rdt_domain *d,
  *
  * This can be called from any CPU.
  */
-void resctrl_arch_reset_rmid(struct rdt_resource *r, struct rdt_domain *d,
+void resctrl_arch_reset_rmid(struct rdt_resource *r, struct rdt_mon_domain *d,
 			     u32 rmid, enum resctrl_event_id eventid);
 
 /**
@@ -278,7 +288,7 @@ void resctrl_arch_reset_rmid(struct rdt_resource *r, struct rdt_domain *d,
  *
  * This can be called from any CPU.
  */
-void resctrl_arch_reset_rmid_all(struct rdt_resource *r, struct rdt_domain *d);
+void resctrl_arch_reset_rmid_all(struct rdt_resource *r, struct rdt_mon_domain *d);
 
 extern unsigned int resctrl_rmid_realloc_threshold;
 extern unsigned int resctrl_rmid_realloc_limit;
diff --git a/arch/x86/kernel/cpu/resctrl/internal.h b/arch/x86/kernel/cpu/resctrl/internal.h
index e9a2a8993d14..ee38249c6f1d 100644
--- a/arch/x86/kernel/cpu/resctrl/internal.h
+++ b/arch/x86/kernel/cpu/resctrl/internal.h
@@ -106,7 +106,7 @@ union mon_data_bits {
 struct rmid_read {
 	struct rdtgroup		*rgrp;
 	struct rdt_resource	*r;
-	struct rdt_domain	*d;
+	struct rdt_mon_domain	*d;
 	enum resctrl_event_id	evtid;
 	bool			first;
 	int			err;
@@ -191,7 +191,7 @@ struct mongroup {
  */
 struct pseudo_lock_region {
 	struct resctrl_schema	*s;
-	struct rdt_domain	*d;
+	struct rdt_ctrl_domain	*d;
 	u32			cbm;
 	wait_queue_head_t	lock_thread_wq;
 	int			thread_done;
@@ -319,25 +319,41 @@ struct arch_mbm_state {
 };
 
 /**
- * struct rdt_hw_domain - Arch private attributes of a set of CPUs that share
- *			  a resource
+ * struct rdt_hw_ctrl_domain - Arch private attributes of a set of CPUs that share
+ *			  a resource for a control function
  * @d_resctrl:	Properties exposed to the resctrl file system
  * @ctrl_val:	array of cache or mem ctrl values (indexed by CLOSID)
+ *
+ * Members of this structure are accessed via helpers that provide abstraction.
+ */
+struct rdt_hw_ctrl_domain {
+	struct rdt_ctrl_domain		d_resctrl;
+	u32				*ctrl_val;
+};
+
+/**
+ * struct rdt_hw_mon_domain - Arch private attributes of a set of CPUs that share
+ *			  a resource for a monitor function
+ * @d_resctrl:	Properties exposed to the resctrl file system
  * @arch_mbm_total:	arch private state for MBM total bandwidth
  * @arch_mbm_local:	arch private state for MBM local bandwidth
  *
  * Members of this structure are accessed via helpers that provide abstraction.
  */
-struct rdt_hw_domain {
-	struct rdt_domain		d_resctrl;
-	u32				*ctrl_val;
+struct rdt_hw_mon_domain {
+	struct rdt_mon_domain		d_resctrl;
 	struct arch_mbm_state		*arch_mbm_total;
 	struct arch_mbm_state		*arch_mbm_local;
 };
 
-static inline struct rdt_hw_domain *resctrl_to_arch_dom(struct rdt_domain *r)
+static inline struct rdt_hw_ctrl_domain *resctrl_to_arch_ctrl_dom(struct rdt_ctrl_domain *r)
 {
-	return container_of(r, struct rdt_hw_domain, d_resctrl);
+	return container_of(r, struct rdt_hw_ctrl_domain, d_resctrl);
+}
+
+static inline struct rdt_hw_mon_domain *resctrl_to_arch_mon_dom(struct rdt_mon_domain *r)
+{
+	return container_of(r, struct rdt_hw_mon_domain, d_resctrl);
 }
 
 /**
@@ -405,7 +421,7 @@ struct rdt_hw_resource {
 	struct rdt_resource	r_resctrl;
 	u32			num_closid;
 	unsigned int		msr_base;
-	void (*msr_update)	(struct rdt_domain *d, struct msr_param *m,
+	void (*msr_update)	(struct rdt_ctrl_domain *d, struct msr_param *m,
 				 struct rdt_resource *r);
 	unsigned int		mon_scale;
 	unsigned int		mbm_width;
@@ -418,9 +434,9 @@ static inline struct rdt_hw_resource *resctrl_to_arch_res(struct rdt_resource *r
 }
 
 int parse_cbm(struct rdt_parse_data *data, struct resctrl_schema *s,
-	      struct rdt_domain *d);
+	      struct rdt_ctrl_domain *d);
 int parse_bw(struct rdt_parse_data *data, struct resctrl_schema *s,
-	     struct rdt_domain *d);
+	     struct rdt_ctrl_domain *d);
 
 extern struct mutex rdtgroup_mutex;
 
@@ -517,21 +533,21 @@ ssize_t rdtgroup_schemata_write(struct kernfs_open_file *of,
 				char *buf, size_t nbytes, loff_t off);
 int rdtgroup_schemata_show(struct kernfs_open_file *of,
 			   struct seq_file *s, void *v);
-bool rdtgroup_cbm_overlaps(struct resctrl_schema *s, struct rdt_domain *d,
+bool rdtgroup_cbm_overlaps(struct resctrl_schema *s, struct rdt_ctrl_domain *d,
 			   unsigned long cbm, int closid, bool exclusive);
-unsigned int rdtgroup_cbm_to_size(struct rdt_resource *r, struct rdt_domain *d,
+unsigned int rdtgroup_cbm_to_size(struct rdt_resource *r, struct rdt_ctrl_domain *d,
 				  unsigned long cbm);
 enum rdtgrp_mode rdtgroup_mode_by_closid(int closid);
 int rdtgroup_tasks_assigned(struct rdtgroup *r);
 int rdtgroup_locksetup_enter(struct rdtgroup *rdtgrp);
 int rdtgroup_locksetup_exit(struct rdtgroup *rdtgrp);
-bool rdtgroup_cbm_overlaps_pseudo_locked(struct rdt_domain *d, unsigned long cbm);
-bool rdtgroup_pseudo_locked_in_hierarchy(struct rdt_domain *d);
+bool rdtgroup_cbm_overlaps_pseudo_locked(struct rdt_ctrl_domain *d, unsigned long cbm);
+bool rdtgroup_pseudo_locked_in_hierarchy(struct rdt_ctrl_domain *d);
 int rdt_pseudo_lock_init(void);
 void rdt_pseudo_lock_release(void);
 int rdtgroup_pseudo_lock_create(struct rdtgroup *rdtgrp);
 void rdtgroup_pseudo_lock_remove(struct rdtgroup *rdtgrp);
-struct rdt_domain *get_domain_from_cpu(int cpu, struct rdt_resource *r);
+struct rdt_ctrl_domain *get_domain_from_cpu(int cpu, struct rdt_resource *r);
 int closids_supported(void);
 void closid_free(int closid);
 int alloc_rmid(void);
@@ -541,17 +557,17 @@ bool __init rdt_cpu_has(int flag);
 void mon_event_count(void *info);
 int rdtgroup_mondata_show(struct seq_file *m, void *arg);
 void mon_event_read(struct rmid_read *rr, struct rdt_resource *r,
-		    struct rdt_domain *d, struct rdtgroup *rdtgrp,
+		    struct rdt_mon_domain *d, struct rdtgroup *rdtgrp,
 		    int evtid, int first);
-void mbm_setup_overflow_handler(struct rdt_domain *dom,
+void mbm_setup_overflow_handler(struct rdt_mon_domain *dom,
 				unsigned long delay_ms);
 void mbm_handle_overflow(struct work_struct *work);
 void __init intel_rdt_mbm_apply_quirk(void);
 bool is_mba_sc(struct rdt_resource *r);
-void cqm_setup_limbo_handler(struct rdt_domain *dom, unsigned long delay_ms);
+void cqm_setup_limbo_handler(struct rdt_mon_domain *dom, unsigned long delay_ms);
 void cqm_handle_limbo(struct work_struct *work);
-bool has_busy_rmid(struct rdt_resource *r, struct rdt_domain *d);
-void __check_limbo(struct rdt_domain *d, bool force_free);
+bool has_busy_rmid(struct rdt_resource *r, struct rdt_mon_domain *d);
+void __check_limbo(struct rdt_mon_domain *d, bool force_free);
 void rdt_domain_reconfigure_cdp(struct rdt_resource *r);
 void __init thread_throttle_mode_init(void);
 void __init mbm_config_rftype_init(const char *config);
diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
index 7ef178fb7c77..726f00c01079 100644
--- a/arch/x86/kernel/cpu/resctrl/core.c
+++ b/arch/x86/kernel/cpu/resctrl/core.c
@@ -49,12 +49,12 @@ int max_name_width, max_data_width;
 bool rdt_alloc_capable;
 
 static void
-mba_wrmsr_intel(struct rdt_domain *d, struct msr_param *m,
+mba_wrmsr_intel(struct rdt_ctrl_domain *d, struct msr_param *m,
 		struct rdt_resource *r);
 static void
-cat_wrmsr(struct rdt_domain *d, struct msr_param *m, struct rdt_resource *r);
+cat_wrmsr(struct rdt_ctrl_domain *d, struct msr_param *m, struct rdt_resource *r);
 static void
-mba_wrmsr_amd(struct rdt_domain *d, struct msr_param *m,
+mba_wrmsr_amd(struct rdt_ctrl_domain *d, struct msr_param *m,
 	      struct rdt_resource *r);
 
 #define ctrl_domain_init(id) LIST_HEAD_INIT(rdt_resources_all[id].r_resctrl.ctrl_domains)
@@ -303,11 +303,11 @@ static void rdt_get_cdp_l2_config(void)
 }
 
 static void
-mba_wrmsr_amd(struct rdt_domain *d, struct msr_param *m, struct rdt_resource *r)
+mba_wrmsr_amd(struct rdt_ctrl_domain *d, struct msr_param *m, struct rdt_resource *r)
 {
-	unsigned int i;
-	struct rdt_hw_domain *hw_dom = resctrl_to_arch_dom(d);
+	struct rdt_hw_ctrl_domain *hw_dom = resctrl_to_arch_ctrl_dom(d);
 	struct rdt_hw_resource *hw_res = resctrl_to_arch_res(r);
+	unsigned int i;
 
 	for (i = m->low; i < m->high; i++)
 		wrmsrl(hw_res->msr_base + i, hw_dom->ctrl_val[i]);
@@ -328,12 +328,12 @@ static u32 delay_bw_map(unsigned long bw, struct rdt_resource *r)
 }
 
 static void
-mba_wrmsr_intel(struct rdt_domain *d, struct msr_param *m,
+mba_wrmsr_intel(struct rdt_ctrl_domain *d, struct msr_param *m,
 		struct rdt_resource *r)
 {
-	unsigned int i;
-	struct rdt_hw_domain *hw_dom = resctrl_to_arch_dom(d);
+	struct rdt_hw_ctrl_domain *hw_dom = resctrl_to_arch_ctrl_dom(d);
 	struct rdt_hw_resource *hw_res = resctrl_to_arch_res(r);
+	unsigned int i;
 
 	/*  Write the delay values for mba. */
 	for (i = m->low; i < m->high; i++)
@@ -341,19 +341,19 @@ mba_wrmsr_intel(struct rdt_domain *d, struct msr_param *m,
 }
 
 static void
-cat_wrmsr(struct rdt_domain *d, struct msr_param *m, struct rdt_resource *r)
+cat_wrmsr(struct rdt_ctrl_domain *d, struct msr_param *m, struct rdt_resource *r)
 {
-	unsigned int i;
-	struct rdt_hw_domain *hw_dom = resctrl_to_arch_dom(d);
+	struct rdt_hw_ctrl_domain *hw_dom = resctrl_to_arch_ctrl_dom(d);
 	struct rdt_hw_resource *hw_res = resctrl_to_arch_res(r);
+	unsigned int i;
 
 	for (i = m->low; i < m->high; i++)
 		wrmsrl(hw_res->msr_base + i, hw_dom->ctrl_val[i]);
 }
 
-struct rdt_domain *get_domain_from_cpu(int cpu, struct rdt_resource *r)
+struct rdt_ctrl_domain *get_domain_from_cpu(int cpu, struct rdt_resource *r)
 {
-	struct rdt_domain *d;
+	struct rdt_ctrl_domain *d;
 
 	list_for_each_entry(d, &r->ctrl_domains, hdr.list) {
 		/* Find the domain that contains this CPU */
@@ -375,7 +375,7 @@ void rdt_ctrl_update(void *arg)
 	struct rdt_hw_resource *hw_res = resctrl_to_arch_res(m->res);
 	struct rdt_resource *r = m->res;
 	int cpu = smp_processor_id();
-	struct rdt_domain *d;
+	struct rdt_ctrl_domain *d;
 
 	d = get_domain_from_cpu(cpu, r);
 	if (d) {
@@ -443,18 +443,23 @@ static void setup_default_ctrlval(struct rdt_resource *r, u32 *dc)
 		*dc = r->default_ctrl;
 }
 
-static void domain_free(struct rdt_hw_domain *hw_dom)
+static void ctrl_domain_free(struct rdt_hw_ctrl_domain *hw_dom)
+{
+	kfree(hw_dom->ctrl_val);
+	kfree(hw_dom);
+}
+
+static void mon_domain_free(struct rdt_hw_mon_domain *hw_dom)
 {
 	kfree(hw_dom->arch_mbm_total);
 	kfree(hw_dom->arch_mbm_local);
-	kfree(hw_dom->ctrl_val);
 	kfree(hw_dom);
 }
 
-static int domain_setup_ctrlval(struct rdt_resource *r, struct rdt_domain *d)
+static int domain_setup_ctrlval(struct rdt_resource *r, struct rdt_ctrl_domain *d)
 {
+	struct rdt_hw_ctrl_domain *hw_dom = resctrl_to_arch_ctrl_dom(d);
 	struct rdt_hw_resource *hw_res = resctrl_to_arch_res(r);
-	struct rdt_hw_domain *hw_dom = resctrl_to_arch_dom(d);
 	struct msr_param m;
 	u32 *dc;
 
@@ -477,7 +482,7 @@ static int domain_setup_ctrlval(struct rdt_resource *r, struct rdt_domain *d)
  * @num_rmid:	The size of the MBM counter array
  * @hw_dom:	The domain that owns the allocated arrays
  */
-static int arch_domain_mbm_alloc(u32 num_rmid, struct rdt_hw_domain *hw_dom)
+static int arch_domain_mbm_alloc(u32 num_rmid, struct rdt_hw_mon_domain *hw_dom)
 {
 	size_t tsize;
 
@@ -516,10 +521,10 @@ static int get_domain_id_from_scope(int cpu, enum resctrl_scope scope)
 static void domain_add_cpu_ctrl(int cpu, struct rdt_resource *r)
 {
 	int id = get_domain_id_from_scope(cpu, r->ctrl_scope);
+	struct rdt_hw_ctrl_domain *hw_dom;
 	struct list_head *add_pos = NULL;
-	struct rdt_hw_domain *hw_dom;
 	struct rdt_domain_hdr *hdr;
-	struct rdt_domain *d;
+	struct rdt_ctrl_domain *d;
 	int err;
 
 	if (id < 0) {
@@ -533,7 +538,7 @@ static void domain_add_cpu_ctrl(int cpu, struct rdt_resource *r)
 		pr_warn("Couldn't find control scope id=%d for CPU %d\n", id, cpu);
 		return;
 	}
-	d = container_of(hdr, struct rdt_domain, hdr);
+	d = container_of(hdr, struct rdt_ctrl_domain, hdr);
 
 	if (d) {
 		cpumask_set_cpu(cpu, &d->cpu_mask);
@@ -553,7 +558,7 @@ static void domain_add_cpu_ctrl(int cpu, struct rdt_resource *r)
 	rdt_domain_reconfigure_cdp(r);
 
 	if (domain_setup_ctrlval(r, d)) {
-		domain_free(hw_dom);
+		ctrl_domain_free(hw_dom);
 		return;
 	}
 
@@ -562,17 +567,17 @@ static void domain_add_cpu_ctrl(int cpu, struct rdt_resource *r)
 	err = resctrl_online_ctrl_domain(r, d);
 	if (err) {
 		list_del(&d->hdr.list);
-		domain_free(hw_dom);
+		ctrl_domain_free(hw_dom);
 	}
 }
 
 static void domain_add_cpu_mon(int cpu, struct rdt_resource *r)
 {
 	int id = get_domain_id_from_scope(cpu, r->mon_scope);
+	struct rdt_hw_mon_domain *hw_mondom;
 	struct list_head *add_pos = NULL;
-	struct rdt_hw_domain *hw_mondom;
 	struct rdt_domain_hdr *hdr;
-	struct rdt_domain *d;
+	struct rdt_mon_domain *d;
 	int err;
 
 	if (id < 0) {
@@ -586,7 +591,7 @@ static void domain_add_cpu_mon(int cpu, struct rdt_resource *r)
 		pr_warn("Couldn't find monitor scope id=%d for CPU %d\n", id, cpu);
 		return;
 	}
-	d = container_of(hdr, struct rdt_domain, hdr);
+	d = container_of(hdr, struct rdt_mon_domain, hdr);
 
 	if (d) {
 		cpumask_set_cpu(cpu, &d->cpu_mask);
@@ -602,7 +607,7 @@ static void domain_add_cpu_mon(int cpu, struct rdt_resource *r)
 	cpumask_set_cpu(cpu, &d->cpu_mask);
 
 	if (arch_domain_mbm_alloc(r->num_rmid, hw_mondom)) {
-		domain_free(hw_mondom);
+		mon_domain_free(hw_mondom);
 		return;
 	}
 
@@ -611,7 +616,7 @@ static void domain_add_cpu_mon(int cpu, struct rdt_resource *r)
 	err = resctrl_online_mon_domain(r, d);
 	if (err) {
 		list_del(&d->hdr.list);
-		domain_free(hw_mondom);
+		mon_domain_free(hw_mondom);
 	}
 }
 
@@ -629,9 +634,9 @@ static void domain_add_cpu(int cpu, struct rdt_resource *r)
 static void domain_remove_cpu_ctrl(int cpu, struct rdt_resource *r)
 {
 	int id = get_domain_id_from_scope(cpu, r->ctrl_scope);
-	struct rdt_hw_domain *hw_dom;
+	struct rdt_hw_ctrl_domain *hw_dom;
 	struct rdt_domain_hdr *hdr;
-	struct rdt_domain *d;
+	struct rdt_ctrl_domain *d;
 
 	if (id < 0)
 		return;
@@ -641,8 +646,8 @@ static void domain_remove_cpu_ctrl(int cpu, struct rdt_resource *r)
 		pr_warn("Couldn't find control scope id=%d for CPU %d\n", id, cpu);
 		return;
 	}
-	d = container_of(hdr, struct rdt_domain, hdr);
-	hw_dom = resctrl_to_arch_dom(d);
+	d = container_of(hdr, struct rdt_ctrl_domain, hdr);
+	hw_dom = resctrl_to_arch_ctrl_dom(d);
 
 	cpumask_clear_cpu(cpu, &d->cpu_mask);
 	if (cpumask_empty(&d->cpu_mask)) {
@@ -650,12 +655,12 @@ static void domain_remove_cpu_ctrl(int cpu, struct rdt_resource *r)
 		list_del(&d->hdr.list);
 
 		/*
-		 * rdt_domain "d" is going to be freed below, so clear
+		 * rdt_ctrl_domain "d" is going to be freed below, so clear
 		 * its pointer from pseudo_lock_region struct.
 		 */
 		if (d->plr)
 			d->plr->d = NULL;
-		domain_free(hw_dom);
+		ctrl_domain_free(hw_dom);
 
 		return;
 	}
@@ -664,9 +669,9 @@ static void domain_remove_cpu_ctrl(int cpu, struct rdt_resource *r)
 static void domain_remove_cpu_mon(int cpu, struct rdt_resource *r)
 {
 	int id = get_domain_id_from_scope(cpu, r->mon_scope);
-	struct rdt_hw_domain *hw_mondom;
+	struct rdt_hw_mon_domain *hw_mondom;
 	struct rdt_domain_hdr *hdr;
-	struct rdt_domain *d;
+	struct rdt_mon_domain *d;
 
 	if (id < 0)
 		return;
@@ -676,14 +681,14 @@ static void domain_remove_cpu_mon(int cpu, struct rdt_resource *r)
 		pr_warn("Couldn't find scope id=%d for CPU %d\n", id, cpu);
 		return;
 	}
-	d = container_of(hdr, struct rdt_domain, hdr);
-	hw_mondom = resctrl_to_arch_dom(d);
+	d = container_of(hdr, struct rdt_mon_domain, hdr);
+	hw_mondom = resctrl_to_arch_mon_dom(d);
 
 	cpumask_clear_cpu(cpu, &d->cpu_mask);
 	if (cpumask_empty(&d->cpu_mask)) {
 		resctrl_offline_mon_domain(r, d);
 		list_del(&d->hdr.list);
-		domain_free(hw_mondom);
+		mon_domain_free(hw_mondom);
 
 		return;
 	}
diff --git a/arch/x86/kernel/cpu/resctrl/ctrlmondata.c b/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
index a6261e177cc1..7513eba9feaf 100644
--- a/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
+++ b/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
@@ -58,7 +58,7 @@ static bool bw_validate(char *buf, unsigned long *data, struct rdt_resource *r)
 }
 
 int parse_bw(struct rdt_parse_data *data, struct resctrl_schema *s,
-	     struct rdt_domain *d)
+	     struct rdt_ctrl_domain *d)
 {
 	struct resctrl_staged_config *cfg;
 	u32 closid = data->rdtgrp->closid;
@@ -135,7 +135,7 @@ static bool cbm_validate(char *buf, u32 *data, struct rdt_resource *r)
  * resource type.
  */
 int parse_cbm(struct rdt_parse_data *data, struct resctrl_schema *s,
-	      struct rdt_domain *d)
+	      struct rdt_ctrl_domain *d)
 {
 	struct rdtgroup *rdtgrp = data->rdtgrp;
 	struct resctrl_staged_config *cfg;
@@ -205,7 +205,7 @@ static int parse_line(char *line, struct resctrl_schema *s,
 	struct rdt_resource *r = s->res;
 	struct rdt_parse_data data;
 	char *dom = NULL, *id;
-	struct rdt_domain *d;
+	struct rdt_ctrl_domain *d;
 	unsigned long dom_id;
 
 	if (rdtgrp->mode == RDT_MODE_PSEUDO_LOCKSETUP &&
@@ -265,11 +265,11 @@ static u32 get_config_index(u32 closid, enum resctrl_conf_type type)
 	}
 }
 
-static bool apply_config(struct rdt_hw_domain *hw_dom,
+static bool apply_config(struct rdt_hw_ctrl_domain *hw_dom,
 			 struct resctrl_staged_config *cfg, u32 idx,
 			 cpumask_var_t cpu_mask)
 {
-	struct rdt_domain *dom = &hw_dom->d_resctrl;
+	struct rdt_ctrl_domain *dom = &hw_dom->d_resctrl;
 
 	if (cfg->new_ctrl != hw_dom->ctrl_val[idx]) {
 		cpumask_set_cpu(cpumask_any(&dom->cpu_mask), cpu_mask);
@@ -281,11 +281,11 @@ static bool apply_config(struct rdt_hw_domain *hw_dom,
 	return false;
 }
 
-int resctrl_arch_update_one(struct rdt_resource *r, struct rdt_domain *d,
+int resctrl_arch_update_one(struct rdt_resource *r, struct rdt_ctrl_domain *d,
 			    u32 closid, enum resctrl_conf_type t, u32 cfg_val)
 {
+	struct rdt_hw_ctrl_domain *hw_dom = resctrl_to_arch_ctrl_dom(d);
 	struct rdt_hw_resource *hw_res = resctrl_to_arch_res(r);
-	struct rdt_hw_domain *hw_dom = resctrl_to_arch_dom(d);
 	u32 idx = get_config_index(closid, t);
 	struct msr_param msr_param;
 
@@ -305,11 +305,11 @@ int resctrl_arch_update_one(struct rdt_resource *r, struct rdt_domain *d,
 int resctrl_arch_update_domains(struct rdt_resource *r, u32 closid)
 {
 	struct resctrl_staged_config *cfg;
-	struct rdt_hw_domain *hw_dom;
+	struct rdt_hw_ctrl_domain *hw_dom;
 	struct msr_param msr_param;
 	enum resctrl_conf_type t;
+	struct rdt_ctrl_domain *d;
 	cpumask_var_t cpu_mask;
-	struct rdt_domain *d;
 	u32 idx;
 
 	if (!zalloc_cpumask_var(&cpu_mask, GFP_KERNEL))
@@ -317,7 +317,7 @@ int resctrl_arch_update_domains(struct rdt_resource *r, u32 closid)
 
 	msr_param.res = NULL;
 	list_for_each_entry(d, &r->ctrl_domains, hdr.list) {
-		hw_dom = resctrl_to_arch_dom(d);
+		hw_dom = resctrl_to_arch_ctrl_dom(d);
 		for (t = 0; t < CDP_NUM_TYPES; t++) {
 			cfg = &hw_dom->d_resctrl.staged_config[t];
 			if (!cfg->have_new_ctrl)
@@ -447,10 +447,10 @@ ssize_t rdtgroup_schemata_write(struct kernfs_open_file *of,
 	return ret ?: nbytes;
 }
 
-u32 resctrl_arch_get_config(struct rdt_resource *r, struct rdt_domain *d,
+u32 resctrl_arch_get_config(struct rdt_resource *r, struct rdt_ctrl_domain *d,
 			    u32 closid, enum resctrl_conf_type type)
 {
-	struct rdt_hw_domain *hw_dom = resctrl_to_arch_dom(d);
+	struct rdt_hw_ctrl_domain *hw_dom = resctrl_to_arch_ctrl_dom(d);
 	u32 idx = get_config_index(closid, type);
 
 	return hw_dom->ctrl_val[idx];
@@ -459,7 +459,7 @@ u32 resctrl_arch_get_config(struct rdt_resource *r, struct rdt_domain *d,
 static void show_doms(struct seq_file *s, struct resctrl_schema *schema, int closid)
 {
 	struct rdt_resource *r = schema->res;
-	struct rdt_domain *dom;
+	struct rdt_ctrl_domain *dom;
 	bool sep = false;
 	u32 ctrl_val;
 
@@ -521,7 +521,7 @@ int rdtgroup_schemata_show(struct kernfs_open_file *of,
 }
 
 void mon_event_read(struct rmid_read *rr, struct rdt_resource *r,
-		    struct rdt_domain *d, struct rdtgroup *rdtgrp,
+		    struct rdt_mon_domain *d, struct rdtgroup *rdtgrp,
 		    int evtid, int first)
 {
 	/*
@@ -541,11 +541,11 @@ int rdtgroup_mondata_show(struct seq_file *m, void *arg)
 {
 	struct kernfs_open_file *of = m->private;
 	struct rdt_domain_hdr *hdr;
+	struct rdt_mon_domain *d;
 	u32 resid, evtid, domid;
 	struct rdtgroup *rdtgrp;
 	struct rdt_resource *r;
 	union mon_data_bits md;
-	struct rdt_domain *d;
 	struct rmid_read rr;
 	int ret = 0;
 
@@ -566,7 +566,7 @@ int rdtgroup_mondata_show(struct seq_file *m, void *arg)
 		ret = -ENOENT;
 		goto out;
 	}
-	d = container_of(hdr, struct rdt_domain, hdr);
+	d = container_of(hdr, struct rdt_mon_domain, hdr);
 
 	mon_event_read(&rr, r, d, rdtgrp, evtid, false);
 
diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c
index 3265b8499e2a..97d2ed829f5d 100644
--- a/arch/x86/kernel/cpu/resctrl/monitor.c
+++ b/arch/x86/kernel/cpu/resctrl/monitor.c
@@ -170,7 +170,7 @@ static int __rmid_read(u32 rmid, enum resctrl_event_id eventid, u64 *val)
 	return 0;
 }
 
-static struct arch_mbm_state *get_arch_mbm_state(struct rdt_hw_domain *hw_dom,
+static struct arch_mbm_state *get_arch_mbm_state(struct rdt_hw_mon_domain *hw_dom,
 						 u32 rmid,
 						 enum resctrl_event_id eventid)
 {
@@ -189,10 +189,10 @@ static struct arch_mbm_state *get_arch_mbm_state(struct rdt_hw_domain *hw_dom,
 	return NULL;
 }
 
-void resctrl_arch_reset_rmid(struct rdt_resource *r, struct rdt_domain *d,
+void resctrl_arch_reset_rmid(struct rdt_resource *r, struct rdt_mon_domain *d,
 			     u32 rmid, enum resctrl_event_id eventid)
 {
-	struct rdt_hw_domain *hw_dom = resctrl_to_arch_dom(d);
+	struct rdt_hw_mon_domain *hw_dom = resctrl_to_arch_mon_dom(d);
 	struct arch_mbm_state *am;
 
 	am = get_arch_mbm_state(hw_dom, rmid, eventid);
@@ -208,9 +208,9 @@ void resctrl_arch_reset_rmid(struct rdt_resource *r, struct rdt_domain *d,
  * Assumes that hardware counters are also reset and thus that there is
  * no need to record initial non-zero counts.
  */
-void resctrl_arch_reset_rmid_all(struct rdt_resource *r, struct rdt_domain *d)
+void resctrl_arch_reset_rmid_all(struct rdt_resource *r, struct rdt_mon_domain *d)
 {
-	struct rdt_hw_domain *hw_dom = resctrl_to_arch_dom(d);
+	struct rdt_hw_mon_domain *hw_dom = resctrl_to_arch_mon_dom(d);
 
 	if (is_mbm_total_enabled())
 		memset(hw_dom->arch_mbm_total, 0,
@@ -229,11 +229,11 @@ static u64 mbm_overflow_count(u64 prev_msr, u64 cur_msr, unsigned int width)
 	return chunks >> shift;
 }
 
-int resctrl_arch_rmid_read(struct rdt_resource *r, struct rdt_domain *d,
+int resctrl_arch_rmid_read(struct rdt_resource *r, struct rdt_mon_domain *d,
 			   u32 rmid, enum resctrl_event_id eventid, u64 *val)
 {
+	struct rdt_hw_mon_domain *hw_dom = resctrl_to_arch_mon_dom(d);
 	struct rdt_hw_resource *hw_res = resctrl_to_arch_res(r);
-	struct rdt_hw_domain *hw_dom = resctrl_to_arch_dom(d);
 	struct arch_mbm_state *am;
 	u64 msr_val, chunks;
 	int ret;
@@ -266,7 +266,7 @@ int resctrl_arch_rmid_read(struct rdt_resource *r, struct rdt_domain *d,
  * decrement the count. If the busy count gets to zero on an RMID, we
  * free the RMID
  */
-void __check_limbo(struct rdt_domain *d, bool force_free)
+void __check_limbo(struct rdt_mon_domain *d, bool force_free)
 {
 	struct rdt_resource *r = &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl;
 	struct rmid_entry *entry;
@@ -305,7 +305,7 @@ void __check_limbo(struct rdt_domain *d, bool force_free)
 	}
 }
 
-bool has_busy_rmid(struct rdt_resource *r, struct rdt_domain *d)
+bool has_busy_rmid(struct rdt_resource *r, struct rdt_mon_domain *d)
 {
 	return find_first_bit(d->rmid_busy_llc, r->num_rmid) != r->num_rmid;
 }
@@ -334,7 +334,7 @@ int alloc_rmid(void)
 static void add_rmid_to_limbo(struct rmid_entry *entry)
 {
 	struct rdt_resource *r = &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl;
-	struct rdt_domain *d;
+	struct rdt_mon_domain *d;
 	int cpu, err;
 	u64 val = 0;
 
@@ -383,7 +383,7 @@ void free_rmid(u32 rmid)
 		list_add_tail(&entry->list, &rmid_free_lru);
 }
 
-static struct mbm_state *get_mbm_state(struct rdt_domain *d, u32 rmid,
+static struct mbm_state *get_mbm_state(struct rdt_mon_domain *d, u32 rmid,
 				       enum resctrl_event_id evtid)
 {
 	switch (evtid) {
@@ -516,13 +516,13 @@ void mon_event_count(void *info)
  * throttle MSRs already have low percentage values.  To avoid
  * unnecessarily restricting such rdtgroups, we also increase the bandwidth.
  */
-static void update_mba_bw(struct rdtgroup *rgrp, struct rdt_domain *dom_mbm)
+static void update_mba_bw(struct rdtgroup *rgrp, struct rdt_mon_domain *dom_mbm)
 {
 	u32 closid, rmid, cur_msr_val, new_msr_val;
 	struct mbm_state *pmbm_data, *cmbm_data;
+	struct rdt_ctrl_domain *dom_mba;
 	u32 cur_bw, delta_bw, user_bw;
 	struct rdt_resource *r_mba;
-	struct rdt_domain *dom_mba;
 	struct list_head *head;
 	struct rdtgroup *entry;
 
@@ -600,7 +600,7 @@ static void update_mba_bw(struct rdtgroup *rgrp, struct rdt_domain *dom_mbm)
 	}
 }
 
-static void mbm_update(struct rdt_resource *r, struct rdt_domain *d, int rmid)
+static void mbm_update(struct rdt_resource *r, struct rdt_mon_domain *d, int rmid)
 {
 	struct rmid_read rr;
 
@@ -640,13 +640,13 @@ void cqm_handle_limbo(struct work_struct *work)
 {
 	unsigned long delay = msecs_to_jiffies(CQM_LIMBOCHECK_INTERVAL);
 	int cpu = smp_processor_id();
+	struct rdt_mon_domain *d;
 	struct rdt_resource *r;
-	struct rdt_domain *d;
 
 	mutex_lock(&rdtgroup_mutex);
 
 	r = &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl;
-	d = container_of(work, struct rdt_domain, cqm_limbo.work);
+	d = container_of(work, struct rdt_mon_domain, cqm_limbo.work);
 
 	__check_limbo(d, false);
 
@@ -656,7 +656,7 @@ void cqm_handle_limbo(struct work_struct *work)
 	mutex_unlock(&rdtgroup_mutex);
 }
 
-void cqm_setup_limbo_handler(struct rdt_domain *dom, unsigned long delay_ms)
+void cqm_setup_limbo_handler(struct rdt_mon_domain *dom, unsigned long delay_ms)
 {
 	unsigned long delay = msecs_to_jiffies(delay_ms);
 	int cpu;
@@ -672,9 +672,9 @@ void mbm_handle_overflow(struct work_struct *work)
 	unsigned long delay = msecs_to_jiffies(MBM_OVERFLOW_INTERVAL);
 	struct rdtgroup *prgrp, *crgrp;
 	int cpu = smp_processor_id();
+	struct rdt_mon_domain *d;
 	struct list_head *head;
 	struct rdt_resource *r;
-	struct rdt_domain *d;
 
 	mutex_lock(&rdtgroup_mutex);
 
@@ -682,7 +682,7 @@ void mbm_handle_overflow(struct work_struct *work)
 		goto out_unlock;
 
 	r = &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl;
-	d = container_of(work, struct rdt_domain, mbm_over.work);
+	d = container_of(work, struct rdt_mon_domain, mbm_over.work);
 
 	list_for_each_entry(prgrp, &rdt_all_groups, rdtgroup_list) {
 		mbm_update(r, d, prgrp->mon.rmid);
@@ -701,7 +701,7 @@ void mbm_handle_overflow(struct work_struct *work)
 	mutex_unlock(&rdtgroup_mutex);
 }
 
-void mbm_setup_overflow_handler(struct rdt_domain *dom, unsigned long delay_ms)
+void mbm_setup_overflow_handler(struct rdt_mon_domain *dom, unsigned long delay_ms)
 {
 	unsigned long delay = msecs_to_jiffies(delay_ms);
 	int cpu;
diff --git a/arch/x86/kernel/cpu/resctrl/pseudo_lock.c b/arch/x86/kernel/cpu/resctrl/pseudo_lock.c
index bda32b4e1c1e..675e9e47af54 100644
--- a/arch/x86/kernel/cpu/resctrl/pseudo_lock.c
+++ b/arch/x86/kernel/cpu/resctrl/pseudo_lock.c
@@ -814,7 +814,7 @@ int rdtgroup_locksetup_exit(struct rdtgroup *rdtgrp)
  * Return: true if @cbm overlaps with pseudo-locked region on @d, false
  * otherwise.
  */
-bool rdtgroup_cbm_overlaps_pseudo_locked(struct rdt_domain *d, unsigned long cbm)
+bool rdtgroup_cbm_overlaps_pseudo_locked(struct rdt_ctrl_domain *d, unsigned long cbm)
 {
 	unsigned int cbm_len;
 	unsigned long cbm_b;
@@ -841,11 +841,11 @@ bool rdtgroup_cbm_overlaps_pseudo_locked(struct rdt_domain *d, unsigned long cbm
  *         if it is not possible to test due to memory allocation issue,
  *         false otherwise.
  */
-bool rdtgroup_pseudo_locked_in_hierarchy(struct rdt_domain *d)
+bool rdtgroup_pseudo_locked_in_hierarchy(struct rdt_ctrl_domain *d)
 {
+	struct rdt_ctrl_domain *d_i;
 	cpumask_var_t cpu_with_psl;
 	struct rdt_resource *r;
-	struct rdt_domain *d_i;
 	bool ret = false;
 
 	if (!zalloc_cpumask_var(&cpu_with_psl, GFP_KERNEL))
diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
index 8132f81f31bb..b0901fb95aa9 100644
--- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
+++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
@@ -80,8 +80,8 @@ void rdt_last_cmd_printf(const char *fmt, ...)
 
 void rdt_staged_configs_clear(void)
 {
+	struct rdt_ctrl_domain *dom;
 	struct rdt_resource *r;
-	struct rdt_domain *dom;
 
 	lockdep_assert_held(&rdtgroup_mutex);
 
@@ -920,7 +920,7 @@ static int rdt_bit_usage_show(struct kernfs_open_file *of,
 	unsigned long sw_shareable = 0, hw_shareable = 0;
 	unsigned long exclusive = 0, pseudo_locked = 0;
 	struct rdt_resource *r = s->res;
-	struct rdt_domain *dom;
+	struct rdt_ctrl_domain *dom;
 	int i, hwb, swb, excl, psl;
 	enum rdtgrp_mode mode;
 	bool sep = false;
@@ -1137,7 +1137,7 @@ static enum resctrl_conf_type resctrl_peer_type(enum resctrl_conf_type my_type)
  *
  * Return: false if CBM does not overlap, true if it does.
  */
-static bool __rdtgroup_cbm_overlaps(struct rdt_resource *r, struct rdt_domain *d,
+static bool __rdtgroup_cbm_overlaps(struct rdt_resource *r, struct rdt_ctrl_domain *d,
 				    unsigned long cbm, int closid,
 				    enum resctrl_conf_type type, bool exclusive)
 {
@@ -1192,7 +1192,7 @@ static bool __rdtgroup_cbm_overlaps(struct rdt_resource *r, struct rdt_domain *d
  *
  * Return: true if CBM overlap detected, false if there is no overlap
  */
-bool rdtgroup_cbm_overlaps(struct resctrl_schema *s, struct rdt_domain *d,
+bool rdtgroup_cbm_overlaps(struct resctrl_schema *s, struct rdt_ctrl_domain *d,
 			   unsigned long cbm, int closid, bool exclusive)
 {
 	enum resctrl_conf_type peer_type = resctrl_peer_type(s->conf_type);
@@ -1222,10 +1222,10 @@ bool rdtgroup_cbm_overlaps(struct resctrl_schema *s, struct rdt_domain *d,
 static bool rdtgroup_mode_test_exclusive(struct rdtgroup *rdtgrp)
 {
 	int closid = rdtgrp->closid;
+	struct rdt_ctrl_domain *d;
 	struct resctrl_schema *s;
 	struct rdt_resource *r;
 	bool has_cache = false;
-	struct rdt_domain *d;
 	u32 ctrl;
 
 	list_for_each_entry(s, &resctrl_schema_all, list) {
@@ -1339,7 +1339,7 @@ static ssize_t rdtgroup_mode_write(struct kernfs_open_file *of,
  * bitmap functions work correctly.
  */
 unsigned int rdtgroup_cbm_to_size(struct rdt_resource *r,
-				  struct rdt_domain *d, unsigned long cbm)
+				  struct rdt_ctrl_domain *d, unsigned long cbm)
 {
 	struct cpu_cacheinfo *ci;
 	unsigned int size = 0;
@@ -1372,9 +1372,9 @@ static int rdtgroup_size_show(struct kernfs_open_file *of,
 {
 	struct resctrl_schema *schema;
 	enum resctrl_conf_type type;
+	struct rdt_ctrl_domain *d;
 	struct rdtgroup *rdtgrp;
 	struct rdt_resource *r;
-	struct rdt_domain *d;
 	unsigned int size;
 	int ret = 0;
 	u32 closid;
@@ -1486,7 +1486,7 @@ static void mon_event_config_read(void *info)
 	mon_info->mon_config = msrval & MAX_EVT_CONFIG_BITS;
 }
 
-static void mondata_config_read(struct rdt_domain *d, struct mon_config_info *mon_info)
+static void mondata_config_read(struct rdt_mon_domain *d, struct mon_config_info *mon_info)
 {
 	smp_call_function_any(&d->cpu_mask, mon_event_config_read, mon_info, 1);
 }
@@ -1494,7 +1494,7 @@ static void mondata_config_read(struct rdt_domain *d, struct mon_config_info *mo
 static int mbm_config_show(struct seq_file *s, struct rdt_resource *r, u32 evtid)
 {
 	struct mon_config_info mon_info = {0};
-	struct rdt_domain *dom;
+	struct rdt_mon_domain *dom;
 	bool sep = false;
 
 	mutex_lock(&rdtgroup_mutex);
@@ -1551,7 +1551,7 @@ static void mon_event_config_write(void *info)
 }
 
 static int mbm_config_write_domain(struct rdt_resource *r,
-				   struct rdt_domain *d, u32 evtid, u32 val)
+				   struct rdt_mon_domain *d, u32 evtid, u32 val)
 {
 	struct mon_config_info mon_info = {0};
 	int ret = 0;
@@ -1601,7 +1601,7 @@ static int mon_config_write(struct rdt_resource *r, char *tok, u32 evtid)
 {
 	char *dom_str = NULL, *id_str;
 	unsigned long dom_id, val;
-	struct rdt_domain *d;
+	struct rdt_mon_domain *d;
 	int ret = 0;
 
 next:
@@ -2125,9 +2125,9 @@ static inline bool is_mba_linear(void)
 static int set_cache_qos_cfg(int level, bool enable)
 {
 	void (*update)(void *arg);
+	struct rdt_ctrl_domain *d;
 	struct rdt_resource *r_l;
 	cpumask_var_t cpu_mask;
-	struct rdt_domain *d;
 	int cpu;
 
 	if (level == RDT_RESOURCE_L3)
@@ -2174,7 +2174,7 @@ void rdt_domain_reconfigure_cdp(struct rdt_resource *r)
 		l3_qos_cfg_update(&hw_res->cdp_enabled);
 }
 
-static int mba_sc_domain_allocate(struct rdt_resource *r, struct rdt_domain *d)
+static int mba_sc_domain_allocate(struct rdt_resource *r, struct rdt_ctrl_domain *d)
 {
 	u32 num_closid = resctrl_arch_get_num_closid(r);
 	int cpu = cpumask_any(&d->cpu_mask);
@@ -2192,7 +2192,7 @@ static int mba_sc_domain_allocate(struct rdt_resource *r, struct rdt_domain *d)
 }
 
 static void mba_sc_domain_destroy(struct rdt_resource *r,
-				  struct rdt_domain *d)
+				  struct rdt_ctrl_domain *d)
 {
 	kfree(d->mbps_val);
 	d->mbps_val = NULL;
@@ -2218,7 +2218,7 @@ static int set_mba_sc(bool mba_sc)
 {
 	struct rdt_resource *r = &rdt_resources_all[RDT_RESOURCE_MBA].r_resctrl;
 	u32 num_closid = resctrl_arch_get_num_closid(r);
-	struct rdt_domain *d;
+	struct rdt_ctrl_domain *d;
 	int i;
 
 	if (!supports_mba_mbps() || mba_sc == is_mba_sc(r))
@@ -2466,7 +2466,7 @@ static void schemata_list_destroy(void)
 static int rdt_get_tree(struct fs_context *fc)
 {
 	struct rdt_fs_context *ctx = rdt_fc2context(fc);
-	struct rdt_domain *dom;
+	struct rdt_mon_domain *dom;
 	struct rdt_resource *r;
 	int ret;
 
@@ -2634,10 +2634,10 @@ static int rdt_init_fs_context(struct fs_context *fc)
 static int reset_all_ctrls(struct rdt_resource *r)
 {
 	struct rdt_hw_resource *hw_res = resctrl_to_arch_res(r);
-	struct rdt_hw_domain *hw_dom;
+	struct rdt_hw_ctrl_domain *hw_dom;
 	struct msr_param msr_param;
+	struct rdt_ctrl_domain *d;
 	cpumask_var_t cpu_mask;
-	struct rdt_domain *d;
 	int i;
 
 	if (!zalloc_cpumask_var(&cpu_mask, GFP_KERNEL))
@@ -2653,7 +2653,7 @@ static int reset_all_ctrls(struct rdt_resource *r)
 	 * from each domain to update the MSRs below.
 	 */
 	list_for_each_entry(d, &r->ctrl_domains, hdr.list) {
-		hw_dom = resctrl_to_arch_dom(d);
+		hw_dom = resctrl_to_arch_ctrl_dom(d);
 		cpumask_set_cpu(cpumask_any(&d->cpu_mask), cpu_mask);
 
 		for (i = 0; i < hw_res->num_closid; i++)
@@ -2848,7 +2848,7 @@ static void rmdir_mondata_subdir_allrdtgrp(struct rdt_resource *r,
 }
 
 static int mkdir_mondata_subdir(struct kernfs_node *parent_kn,
-				struct rdt_domain *d,
+				struct rdt_mon_domain *d,
 				struct rdt_resource *r, struct rdtgroup *prgrp)
 {
 	union mon_data_bits priv;
@@ -2897,7 +2897,7 @@ static int mkdir_mondata_subdir(struct kernfs_node *parent_kn,
  * and "monitor" groups with given domain id.
  */
 static void mkdir_mondata_subdir_allrdtgrp(struct rdt_resource *r,
-					   struct rdt_domain *d)
+					   struct rdt_mon_domain *d)
 {
 	struct kernfs_node *parent_kn;
 	struct rdtgroup *prgrp, *crgrp;
@@ -2919,7 +2919,7 @@ static int mkdir_mondata_subdir_alldom(struct kernfs_node *parent_kn,
 				       struct rdt_resource *r,
 				       struct rdtgroup *prgrp)
 {
-	struct rdt_domain *dom;
+	struct rdt_mon_domain *dom;
 	int ret;
 
 	list_for_each_entry(dom, &r->mon_domains, hdr.list) {
@@ -3021,7 +3021,7 @@ static u32 cbm_ensure_valid(u32 _val, struct rdt_resource *r)
  * Set the RDT domain up to start off with all usable allocations. That is,
  * all shareable and unused bits. All-zero CBM is invalid.
  */
-static int __init_one_rdt_domain(struct rdt_domain *d, struct resctrl_schema *s,
+static int __init_one_rdt_domain(struct rdt_ctrl_domain *d, struct resctrl_schema *s,
 				 u32 closid)
 {
 	enum resctrl_conf_type peer_type = resctrl_peer_type(s->conf_type);
@@ -3101,7 +3101,7 @@ static int __init_one_rdt_domain(struct rdt_domain *d, struct resctrl_schema *s,
  */
 static int rdtgroup_init_cat(struct resctrl_schema *s, u32 closid)
 {
-	struct rdt_domain *d;
+	struct rdt_ctrl_domain *d;
 	int ret;
 
 	list_for_each_entry(d, &s->res->ctrl_domains, hdr.list) {
@@ -3117,7 +3117,7 @@ static int rdtgroup_init_cat(struct resctrl_schema *s, u32 closid)
 static void rdtgroup_init_mba(struct rdt_resource *r, u32 closid)
 {
 	struct resctrl_staged_config *cfg;
-	struct rdt_domain *d;
+	struct rdt_ctrl_domain *d;
 
 	list_for_each_entry(d, &r->ctrl_domains, hdr.list) {
 		if (is_mba_sc(r)) {
@@ -3704,14 +3704,14 @@ static int __init rdtgroup_setup_root(void)
 	return ret;
 }
 
-static void domain_destroy_mon_state(struct rdt_domain *d)
+static void domain_destroy_mon_state(struct rdt_mon_domain *d)
 {
 	bitmap_free(d->rmid_busy_llc);
 	kfree(d->mbm_total);
 	kfree(d->mbm_local);
 }
 
-void resctrl_offline_ctrl_domain(struct rdt_resource *r, struct rdt_domain *d)
+void resctrl_offline_ctrl_domain(struct rdt_resource *r, struct rdt_ctrl_domain *d)
 {
 	lockdep_assert_held(&rdtgroup_mutex);
 
@@ -3719,7 +3719,7 @@ void resctrl_offline_ctrl_domain(struct rdt_resource *r, struct rdt_domain *d)
 		mba_sc_domain_destroy(r, d);
 }
 
-void resctrl_offline_mon_domain(struct rdt_resource *r, struct rdt_domain *d)
+void resctrl_offline_mon_domain(struct rdt_resource *r, struct rdt_mon_domain *d)
 {
 	/*
 	 * If resctrl is mounted, remove all the
@@ -3746,7 +3746,7 @@ void resctrl_offline_mon_domain(struct rdt_resource *r, struct rdt_domain *d)
 	domain_destroy_mon_state(d);
 }
 
-static int domain_setup_mon_state(struct rdt_resource *r, struct rdt_domain *d)
+static int domain_setup_mon_state(struct rdt_resource *r, struct rdt_mon_domain *d)
 {
 	size_t tsize;
 
@@ -3776,7 +3776,7 @@ static int domain_setup_mon_state(struct rdt_resource *r, struct rdt_domain *d)
 	return 0;
 }
 
-int resctrl_online_ctrl_domain(struct rdt_resource *r, struct rdt_domain *d)
+int resctrl_online_ctrl_domain(struct rdt_resource *r, struct rdt_ctrl_domain *d)
 {
 	lockdep_assert_held(&rdtgroup_mutex);
 
@@ -3787,7 +3787,7 @@ int resctrl_online_ctrl_domain(struct rdt_resource *r, struct rdt_domain *d)
 	return 0;
 }
 
-int resctrl_online_mon_domain(struct rdt_resource *r, struct rdt_domain *d)
+int resctrl_online_mon_domain(struct rdt_resource *r, struct rdt_mon_domain *d)
 {
 	int err;
 
-- 
2.41.0


^ permalink raw reply related	[flat|nested] 331+ messages in thread

* [PATCH v6 5/8] x86/resctrl: Add node-scope to the options for feature scope
  2023-09-28 19:13     ` [PATCH v6 " Tony Luck
                         ` (3 preceding siblings ...)
  2023-09-28 19:13       ` [PATCH v6 4/8] x86/resctrl: Split the rdt_domain and rdt_hw_domain structures Tony Luck
@ 2023-09-28 19:13       ` Tony Luck
  2023-09-29 14:21         ` Peter Newman
  2023-09-28 19:13       ` [PATCH v6 6/8] x86/resctrl: Introduce snc_nodes_per_l3_cache Tony Luck
                         ` (5 subsequent siblings)
  10 siblings, 1 reply; 331+ messages in thread
From: Tony Luck @ 2023-09-28 19:13 UTC (permalink / raw)
  To: Fenghua Yu, Reinette Chatre, Peter Newman, Jonathan Corbet,
	Shuah Khan, x86
  Cc: Shaopeng Tan, James Morse, Jamie Iles, Babu Moger, Randy Dunlap,
	linux-kernel, linux-doc, patches, Tony Luck

All currently supported resctrl features are domain scoped, with the domain
scope matching that of the L2 or L3 caches.

Add RESCTRL_NODE as a new option for features that are scoped at the
same granularity as NUMA nodes. This is needed for Intel's Sub-NUMA
Cluster (SNC) feature where monitoring features are node scoped.
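
For reference, patch 7/8 later in this series (quoted further down in this
thread) opts the L3 monitoring resource into the new scope when SNC is
detected:

	rdt_resources_all[RDT_RESOURCE_L3].r_resctrl.mon_scope = RESCTRL_NODE;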

Signed-off-by: Tony Luck <tony.luck@intel.com>
---

Changes since v5:

Updates to commit message.

 include/linux/resctrl.h            | 1 +
 arch/x86/kernel/cpu/resctrl/core.c | 2 ++
 2 files changed, 3 insertions(+)

diff --git a/include/linux/resctrl.h b/include/linux/resctrl.h
index 1c925e3db2ea..18ed787f9798 100644
--- a/include/linux/resctrl.h
+++ b/include/linux/resctrl.h
@@ -165,6 +165,7 @@ struct resctrl_schema;
 enum resctrl_scope {
 	RESCTRL_L2_CACHE = 2,
 	RESCTRL_L3_CACHE = 3,
+	RESCTRL_NODE,
 };
 
 /**
diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
index 726f00c01079..e61bf919ac78 100644
--- a/arch/x86/kernel/cpu/resctrl/core.c
+++ b/arch/x86/kernel/cpu/resctrl/core.c
@@ -511,6 +511,8 @@ static int get_domain_id_from_scope(int cpu, enum resctrl_scope scope)
 	case RESCTRL_L2_CACHE:
 	case RESCTRL_L3_CACHE:
 		return get_cpu_cacheinfo_id(cpu, scope);
+	case RESCTRL_NODE:
+		return cpu_to_node(cpu);
 	default:
 		break;
 	}
-- 
2.41.0


^ permalink raw reply related	[flat|nested] 331+ messages in thread

* [PATCH v6 6/8] x86/resctrl: Introduce snc_nodes_per_l3_cache
  2023-09-28 19:13     ` [PATCH v6 " Tony Luck
                         ` (4 preceding siblings ...)
  2023-09-28 19:13       ` [PATCH v6 5/8] x86/resctrl: Add node-scope to the options for feature scope Tony Luck
@ 2023-09-28 19:13       ` Tony Luck
  2023-09-29 14:21         ` Peter Newman
  2023-09-28 19:13       ` [PATCH v6 7/8] x86/resctrl: Sub NUMA Cluster detection and enable Tony Luck
                         ` (4 subsequent siblings)
  10 siblings, 1 reply; 331+ messages in thread
From: Tony Luck @ 2023-09-28 19:13 UTC (permalink / raw)
  To: Fenghua Yu, Reinette Chatre, Peter Newman, Jonathan Corbet,
	Shuah Khan, x86
  Cc: Shaopeng Tan, James Morse, Jamie Iles, Babu Moger, Randy Dunlap,
	linux-kernel, linux-doc, patches, Tony Luck

Intel Sub-NUMA Cluster (SNC) is a feature that subdivides the CPU cores
and memory controllers on a socket into two or more groups. These are
presented to the operating system as NUMA nodes.

This may enable some workloads to have slightly lower latency to memory
as the memory controller(s) in an SNC node are electrically closer to the
CPU cores on that SNC node. The latency benefit may be offset by lower
bandwidth since the memory accesses for each core can only be interleaved
between the memory controllers on the same SNC node.

Resctrl monitoring on Intel systems depends upon attaching RMIDs to tasks
to track L3 cache occupancy and memory bandwidth. There is an MSR that
controls how the RMIDs are shared between SNC nodes.

The default mode divides them numerically. E.g. when there are two SNC
nodes on a socket the lower-numbered half of the RMIDs are given to the
first node, the remainder to the second node. This would be difficult
to use with the Linux resctrl interface as specific RMID values assigned
to resctrl groups are not visible to users.

The other mode divides the RMIDs and renumbers the ones on the second
SNC node to start from zero.
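
To make the renumbering concrete, here is a minimal illustrative sketch (not
part of the patch; the counts are hypothetical) of mapping a logical RMID back
to the physical counter that gets programmed into IA32_QM_EVTSEL. It mirrors
the rmid_offset adjustment added to __rmid_read() in the diff below:

	/*
	 * Illustration only: assume 2 SNC nodes per L3 cache and 200
	 * logical RMIDs available in each SNC node.
	 */
	static u32 example_physical_rmid(u32 logical_rmid, int snc_node_on_socket,
					 u32 rmids_per_snc_node)
	{
		/* SNC node 0 uses counters 0..199, SNC node 1 uses 200..399 */
		return logical_rmid + snc_node_on_socket * rmids_per_snc_node;
	}

With those assumed numbers, logical RMID 5 on the second SNC node of a socket
is read from physical counter 205.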

Even with this renumbering, SNC mode requires several changes in resctrl
behavior for correct operation.

Add a global integer "snc_nodes_per_l3_cache" that will show how many
SNC nodes share each L3 cache. When this is "1", SNC mode is either
not implemented, or not enabled.

A later patch will detect SNC mode and set snc_nodes_per_l3_cache to
the appropriate value. For now it remains at the default "1" to
indicate SNC mode is not active.

Code that needs to take action when SNC is enabled is:
1) The number of logical RMIDs per L3 cache available for use is the
   number of physical RMIDs divided by the number of SNC nodes.
2) Likewise the "mon_scale" value must be adjusted for the number
   of SNC nodes.
3) The RMID renumbering operates when using the value from the
   IA32_PQR_ASSOC MSR to count accesses by a task. When reading an RMID
   counter, code must adjust from the logical RMID used to the physical
   RMID value for the SNC node that it wishes to read and load the
   adjusted value into the IA32_QM_EVTSEL MSR.
4) The L3 cache is divided between the SNC nodes. So the value
   reported in the resctrl "size" file is adjusted.
5) The "-o mba_MBps" mount option must be disabled in SNC mode
   because the monitoring is being done per SNC node, while the
   bandwidth allocation is still done at the L3 cache scope.
   Trying to use this feedback loop might result in contradictory
   changes to the throttling level coming from each of the SNC
   node bandwidth measurements.

Signed-off-by: Tony Luck <tony.luck@intel.com>

---

Changes since v5:

Major overhaul to the commit message. Starts with high level overview
of what SNC is, before going into details on changes needed.

Begin with a definition of the SNC acronym.

Clarify in point "1" that available RMIDs are per L3 cache.

Add extra detail in "5" why mba_MBps is incompatible with SNC mode.

Code changes:

Reformat a comment to use longer lines.

Added a period at end of sentence for a comment.
---
 arch/x86/kernel/cpu/resctrl/internal.h |  2 ++
 arch/x86/kernel/cpu/resctrl/core.c     |  6 ++++++
 arch/x86/kernel/cpu/resctrl/monitor.c  | 16 +++++++++++++---
 arch/x86/kernel/cpu/resctrl/rdtgroup.c |  4 ++--
 4 files changed, 23 insertions(+), 5 deletions(-)

diff --git a/arch/x86/kernel/cpu/resctrl/internal.h b/arch/x86/kernel/cpu/resctrl/internal.h
index ee38249c6f1d..0a17ace5811e 100644
--- a/arch/x86/kernel/cpu/resctrl/internal.h
+++ b/arch/x86/kernel/cpu/resctrl/internal.h
@@ -446,6 +446,8 @@ DECLARE_STATIC_KEY_FALSE(rdt_alloc_enable_key);
 
 extern struct dentry *debugfs_resctrl;
 
+extern int snc_nodes_per_l3_cache;
+
 enum resctrl_res_level {
 	RDT_RESOURCE_L3,
 	RDT_RESOURCE_L2,
diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
index e61bf919ac78..1f94b7b11f3e 100644
--- a/arch/x86/kernel/cpu/resctrl/core.c
+++ b/arch/x86/kernel/cpu/resctrl/core.c
@@ -48,6 +48,12 @@ int max_name_width, max_data_width;
  */
 bool rdt_alloc_capable;
 
+/*
+ * Number of SNC nodes that share each L3 cache.  Default is 1 for
+ * systems that do not support SNC, or have SNC disabled.
+ */
+int snc_nodes_per_l3_cache = 1;
+
 static void
 mba_wrmsr_intel(struct rdt_ctrl_domain *d, struct msr_param *m,
 		struct rdt_resource *r);
diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c
index 97d2ed829f5d..e6e566921a60 100644
--- a/arch/x86/kernel/cpu/resctrl/monitor.c
+++ b/arch/x86/kernel/cpu/resctrl/monitor.c
@@ -148,8 +148,18 @@ static inline struct rmid_entry *__rmid_entry(u32 rmid)
 
 static int __rmid_read(u32 rmid, enum resctrl_event_id eventid, u64 *val)
 {
+	struct rdt_resource *r = &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl;
+	int cpu = smp_processor_id();
+	int rmid_offset = 0;
 	u64 msr_val;
 
+	/*
+	 * When SNC mode is on, need to compute the offset to read the
+	 * physical RMID counter for the node to which this CPU belongs.
+	 */
+	if (snc_nodes_per_l3_cache > 1)
+		rmid_offset = (cpu_to_node(cpu) % snc_nodes_per_l3_cache) * r->num_rmid;
+
 	/*
 	 * As per the SDM, when IA32_QM_EVTSEL.EvtID (bits 7:0) is configured
 	 * with a valid event code for supported resource type and the bits
@@ -158,7 +168,7 @@ static int __rmid_read(u32 rmid, enum resctrl_event_id eventid, u64 *val)
 	 * IA32_QM_CTR.Error (bit 63) and IA32_QM_CTR.Unavailable (bit 62)
 	 * are error bits.
 	 */
-	wrmsr(MSR_IA32_QM_EVTSEL, eventid, rmid);
+	wrmsr(MSR_IA32_QM_EVTSEL, eventid, rmid + rmid_offset);
 	rdmsrl(MSR_IA32_QM_CTR, msr_val);
 
 	if (msr_val & RMID_VAL_ERROR)
@@ -783,8 +793,8 @@ int __init rdt_get_mon_l3_config(struct rdt_resource *r)
 	int ret;
 
 	resctrl_rmid_realloc_limit = boot_cpu_data.x86_cache_size * 1024;
-	hw_res->mon_scale = boot_cpu_data.x86_cache_occ_scale;
-	r->num_rmid = boot_cpu_data.x86_cache_max_rmid + 1;
+	hw_res->mon_scale = boot_cpu_data.x86_cache_occ_scale / snc_nodes_per_l3_cache;
+	r->num_rmid = (boot_cpu_data.x86_cache_max_rmid + 1) / snc_nodes_per_l3_cache;
 	hw_res->mbm_width = MBM_CNTR_WIDTH_BASE;
 
 	if (mbm_offset > 0 && mbm_offset <= MBM_CNTR_WIDTH_OFFSET_MAX)
diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
index b0901fb95aa9..a5404c412f53 100644
--- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
+++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
@@ -1357,7 +1357,7 @@ unsigned int rdtgroup_cbm_to_size(struct rdt_resource *r,
 		}
 	}
 
-	return size;
+	return size / snc_nodes_per_l3_cache;
 }
 
 /**
@@ -2590,7 +2590,7 @@ static int rdt_parse_param(struct fs_context *fc, struct fs_parameter *param)
 		ctx->enable_cdpl2 = true;
 		return 0;
 	case Opt_mba_mbps:
-		if (!supports_mba_mbps())
+		if (!supports_mba_mbps() || snc_nodes_per_l3_cache > 1)
 			return -EINVAL;
 		ctx->enable_mba_mbps = true;
 		return 0;
-- 
2.41.0


^ permalink raw reply related	[flat|nested] 331+ messages in thread

* [PATCH v6 7/8] x86/resctrl: Sub NUMA Cluster detection and enable
  2023-09-28 19:13     ` [PATCH v6 " Tony Luck
                         ` (5 preceding siblings ...)
  2023-09-28 19:13       ` [PATCH v6 6/8] x86/resctrl: Introduce snc_nodes_per_l3_cache Tony Luck
@ 2023-09-28 19:13       ` Tony Luck
  2023-09-29 14:32         ` Peter Newman
  2023-09-28 19:13       ` [PATCH v6 8/8] x86/resctrl: Update documentation with Sub-NUMA cluster changes Tony Luck
                         ` (3 subsequent siblings)
  10 siblings, 1 reply; 331+ messages in thread
From: Tony Luck @ 2023-09-28 19:13 UTC (permalink / raw)
  To: Fenghua Yu, Reinette Chatre, Peter Newman, Jonathan Corbet,
	Shuah Khan, x86
  Cc: Shaopeng Tan, James Morse, Jamie Iles, Babu Moger, Randy Dunlap,
	linux-kernel, linux-doc, patches, Tony Luck

There isn't a simple h/w bit that indicates whether a CPU is
running in Sub NUMA Cluster (SNC) mode. Infer the state by comparing
the ratio of NUMA nodes to L3 cache instances.
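
As a concrete (hypothetical) example of that inference, consider a two-socket
system with SNC2 enabled in BIOS:

	CPU-bearing NUMA nodes reported by Linux:  4   (two per socket)
	Distinct L3 cache instances:               2   (one per socket)

	snc_nodes_per_l3_cache = 4 / 2 = 2

A result of 1 means SNC is either unsupported or disabled. snc_get_config()
in the diff below performs this calculation, excluding memory-only nodes from
the count.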

When SNC mode is detected, reconfigure the RMID counters by updating
the MSR_RMID_SNC_CONFIG MSR on each socket as CPUs are seen.

Clearing bit zero of the MSR divides the RMIDs and renumbers the ones
on the second SNC node to start from zero. An earlier commit includes
all the required changes in Linux to operate in this reconfigured mode.

Signed-off-by: Tony Luck <tony.luck@intel.com>

---

Changes since v5:

Short explanation of RMID reconfiguration added in the commit message
(longer one included in previous patch that has all the code that takes
action when "snc_nodes_per_l3_cache" is greater than "1").

Added to the comment before snc_remap_rmids() describing what
"remapping" is occurring.

Added a comment before snc_get_config() [renamed from get_snc_config()
to be consistent with "snc_" as a prefix] describing how it works, and
that it can fail if the system is booted with the "maxcpus=N" parameter.

Now using bitmap_zalloc() to allocate bitmap.

Add code to defend against divide by zero if no caches are found.
---
 arch/x86/include/asm/msr-index.h   |  1 +
 arch/x86/kernel/cpu/resctrl/core.c | 90 ++++++++++++++++++++++++++++++
 2 files changed, 91 insertions(+)

diff --git a/arch/x86/include/asm/msr-index.h b/arch/x86/include/asm/msr-index.h
index 1d111350197f..393d1b047617 100644
--- a/arch/x86/include/asm/msr-index.h
+++ b/arch/x86/include/asm/msr-index.h
@@ -1100,6 +1100,7 @@
 #define MSR_IA32_QM_CTR			0xc8e
 #define MSR_IA32_PQR_ASSOC		0xc8f
 #define MSR_IA32_L3_CBM_BASE		0xc90
+#define MSR_RMID_SNC_CONFIG		0xca0
 #define MSR_IA32_L2_CBM_BASE		0xd10
 #define MSR_IA32_MBA_THRTL_BASE		0xd50
 
diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
index 1f94b7b11f3e..0041c80c3b2c 100644
--- a/arch/x86/kernel/cpu/resctrl/core.c
+++ b/arch/x86/kernel/cpu/resctrl/core.c
@@ -16,11 +16,14 @@
 
 #define pr_fmt(fmt)	"resctrl: " fmt
 
+#include <linux/cpu.h>
 #include <linux/slab.h>
 #include <linux/err.h>
 #include <linux/cacheinfo.h>
 #include <linux/cpuhotplug.h>
+#include <linux/mod_devicetable.h>
 
+#include <asm/cpu_device_id.h>
 #include <asm/intel-family.h>
 #include <asm/resctrl.h>
 #include "internal.h"
@@ -733,11 +736,42 @@ static void clear_closid_rmid(int cpu)
 	wrmsr(MSR_IA32_PQR_ASSOC, 0, 0);
 }
 
+/*
+ * The power-on reset value of MSR_RMID_SNC_CONFIG is 0x1
+ * which indicates that RMIDs are configured in legacy mode.
+ * This mode is incompatible with Linux resctrl semantics
+ * as RMIDs are partitioned between SNC nodes, which requires
+ * a user to know which RMID is allocated to a task.
+ * Clearing bit 0 reconfigures the RMID counters for use
+ * in Sub NUMA Cluster mode. This mode is better for Linux.
+ * The RMID space is divided between all SNC nodes with the
+ * RMIDs renumbered to start from zero in each node when
+ * couning operations from tasks. Code to read the counters
+ * must adjust RMID counnter numbers based on SNC node. See
+ * __rmid_read() for code that does this.
+ */
+static void snc_remap_rmids(int cpu)
+{
+	u64 val;
+
+	/* Only need to enable once per package. */
+	if (cpumask_first(topology_core_cpumask(cpu)) != cpu)
+		return;
+
+	rdmsrl(MSR_RMID_SNC_CONFIG, val);
+	val &= ~BIT_ULL(0);
+	wrmsrl(MSR_RMID_SNC_CONFIG, val);
+}
+
 static int resctrl_online_cpu(unsigned int cpu)
 {
 	struct rdt_resource *r;
 
 	mutex_lock(&rdtgroup_mutex);
+
+	if (snc_nodes_per_l3_cache > 1)
+		snc_remap_rmids(cpu);
+
 	for_each_capable_rdt_resource(r)
 		domain_add_cpu(cpu, r);
 	/* The cpu is set in default rdtgroup after online. */
@@ -992,11 +1026,67 @@ static __init bool get_rdt_resources(void)
 	return (rdt_mon_capable || rdt_alloc_capable);
 }
 
+/* CPU models that support MSR_RMID_SNC_CONFIG */
+static const struct x86_cpu_id snc_cpu_ids[] __initconst = {
+	X86_MATCH_INTEL_FAM6_MODEL(ICELAKE_X, 0),
+	X86_MATCH_INTEL_FAM6_MODEL(SAPPHIRERAPIDS_X, 0),
+	X86_MATCH_INTEL_FAM6_MODEL(EMERALDRAPIDS_X, 0),
+	{}
+};
+
+/*
+ * There isn't a simple h/w bit that indicates whether a CPU is running
+ * in Sub NUMA Cluster (SNC) mode. Infer the state by comparing the
+ * ratio of NUMA nodes to L3 cache instances.
+ * It is not possible to accurately determine SNC state if the system is
+ * booted with a maxcpus=N parameter. That distorts the ratio of SNC nodes
+ * to L3 caches. It will be OK if system is booted with hyperthreading
+ * disabled (since this doesn't affect the ratio).
+ */
+static __init int snc_get_config(void)
+{
+	unsigned long *node_caches;
+	int mem_only_nodes = 0;
+	int cpu, node, ret;
+	int num_l3_caches;
+
+	if (!x86_match_cpu(snc_cpu_ids))
+		return 1;
+
+	node_caches = bitmap_zalloc(nr_node_ids, GFP_KERNEL);
+	if (!node_caches)
+		return 1;
+
+	cpus_read_lock();
+	for_each_node(node) {
+		cpu = cpumask_first(cpumask_of_node(node));
+		if (cpu < nr_cpu_ids)
+			set_bit(get_cpu_cacheinfo_id(cpu, 3), node_caches);
+		else
+			mem_only_nodes++;
+	}
+	cpus_read_unlock();
+
+	num_l3_caches = bitmap_weight(node_caches, nr_node_ids);
+	if (!num_l3_caches)
+		return 1;
+
+	ret = (nr_node_ids - mem_only_nodes) / num_l3_caches;
+	kfree(node_caches);
+
+	if (ret > 1)
+		rdt_resources_all[RDT_RESOURCE_L3].r_resctrl.mon_scope = RESCTRL_NODE;
+
+	return ret;
+}
+
 static __init void rdt_init_res_defs_intel(void)
 {
 	struct rdt_hw_resource *hw_res;
 	struct rdt_resource *r;
 
+	snc_nodes_per_l3_cache = snc_get_config();
+
 	for_each_rdt_resource(r) {
 		hw_res = resctrl_to_arch_res(r);
 
-- 
2.41.0


^ permalink raw reply related	[flat|nested] 331+ messages in thread

* [PATCH v6 8/8] x86/resctrl: Update documentation with Sub-NUMA cluster changes
  2023-09-28 19:13     ` [PATCH v6 " Tony Luck
                         ` (6 preceding siblings ...)
  2023-09-28 19:13       ` [PATCH v6 7/8] x86/resctrl: Sub NUMA Cluster detection and enable Tony Luck
@ 2023-09-28 19:13       ` Tony Luck
  2023-09-29 14:54         ` Peter Newman
  2023-09-29 14:33       ` [PATCH v6 0/8] Add support for Sub-NUMA cluster (SNC) systems Peter Newman
                         ` (2 subsequent siblings)
  10 siblings, 1 reply; 331+ messages in thread
From: Tony Luck @ 2023-09-28 19:13 UTC (permalink / raw)
  To: Fenghua Yu, Reinette Chatre, Peter Newman, Jonathan Corbet,
	Shuah Khan, x86
  Cc: Shaopeng Tan, James Morse, Jamie Iles, Babu Moger, Randy Dunlap,
	linux-kernel, linux-doc, patches, Tony Luck

With Sub-NUMA Cluster mode enabled the scope of monitoring resources is
per-NODE instead of per-L3 cache. Suffixes of directories with "L3" in
their name refer to Sub-NUMA nodes instead of L3 cache ids.

Users should be aware that SNC mode also affects the amount of L3 cache
available for allocation within each SNC node.
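
As an example (hypothetical system: two sockets with SNC2 enabled, giving four
nodes), the mon_data directories keep the "L3" prefix but are suffixed by
node id:

	mon_data/mon_L3_00/	<- SNC node 0 (socket 0)
	mon_data/mon_L3_01/	<- SNC node 1 (socket 0)
	mon_data/mon_L3_02/	<- SNC node 2 (socket 1)
	mon_data/mon_L3_03/	<- SNC node 3 (socket 1)

A task that has run on both SNC nodes of socket 0 needs the values from
mon_L3_00 and mon_L3_01 added together to see its full traffic, as the
documentation hunk below explains.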

Signed-off-by: Tony Luck <tony.luck@intel.com>

---

Changes since v5:

Added additional details about the challenges of tracking tasks when SNC
mode is enabled.
---
 Documentation/arch/x86/resctrl.rst | 34 +++++++++++++++++++++++++++---
 1 file changed, 31 insertions(+), 3 deletions(-)

diff --git a/Documentation/arch/x86/resctrl.rst b/Documentation/arch/x86/resctrl.rst
index cb05d90111b4..d6b6a4cfd967 100644
--- a/Documentation/arch/x86/resctrl.rst
+++ b/Documentation/arch/x86/resctrl.rst
@@ -345,9 +345,15 @@ When control is enabled all CTRL_MON groups will also contain:
 When monitoring is enabled all MON groups will also contain:
 
 "mon_data":
-	This contains a set of files organized by L3 domain and by
-	RDT event. E.g. on a system with two L3 domains there will
-	be subdirectories "mon_L3_00" and "mon_L3_01".	Each of these
+	This contains a set of files organized by L3 domain or by NUMA
+	node (depending on whether Sub-NUMA Cluster (SNC) mode is disabled
+	or enabled respectively) and by RDT event. E.g. on a system with
+	SNC mode disabled with two L3 domains there will be subdirectories
+	"mon_L3_00" and "mon_L3_01". The numerical suffix refers to the
+	L3 cache id.  With SNC enabled the directory names are the same,
+	but the numerical suffix refers to the node id.
+	Mappings from node ids to CPUs are available in the
+	/sys/devices/system/node/node*/cpulist files. Each of these
 	directories have one file per event (e.g. "llc_occupancy",
 	"mbm_total_bytes", and "mbm_local_bytes"). In a MON group these
 	files provide a read out of the current value of the event for
@@ -452,6 +458,28 @@ and 0xA are not.  On a system with a 20-bit mask each bit represents 5%
 of the capacity of the cache. You could partition the cache into four
 equal parts with masks: 0x1f, 0x3e0, 0x7c00, 0xf8000.
 
+Notes on Sub-NUMA Cluster mode
+==============================
+When SNC mode is enabled the "llc_occupancy", "mbm_total_bytes", and
+"mbm_local_bytes" will only give meaningful results for well-behaved NUMA
+applications. I.e. those that perform the majority of memory accesses
+to the NUMA node local to the CPU where the task is executing.
+Note that Linux may load balance tasks between Sub-NUMA nodes much
+more readily than between regular NUMA nodes since the CPUs on SNC nodes
+in a socket share the same L3 cache and the system may report the NUMA distance
+between SNC nodes with a lower value than used for regular NUMA nodes.
+Tasks that migrate between nodes will have their traffic recorded by the
+counters in different SNC nodes so a user will need to read mon_data
+files from each node on which the task executed to get the full
+view of traffic for which the task was the source.
+
+
+The cache allocation feature still provides the same number of
+bits in a mask to control allocation into the L3 cache. But each
+of those ways has its capacity reduced because the cache is divided
+between the SNC nodes. The values reported in the resctrl
+"size" files are adjusted accordingly.
+
 Memory bandwidth Allocation and monitoring
 ==========================================
 
-- 
2.41.0


^ permalink raw reply related	[flat|nested] 331+ messages in thread

* Re: [PATCH v6 1/8] x86/resctrl: Prepare for new domain scope
  2023-09-28 19:13       ` [PATCH v6 1/8] x86/resctrl: Prepare for new domain scope Tony Luck
@ 2023-09-29 12:09         ` Peter Newman
  2023-09-29 18:38           ` Tony Luck
  0 siblings, 1 reply; 331+ messages in thread
From: Peter Newman @ 2023-09-29 12:09 UTC (permalink / raw)
  To: Tony Luck
  Cc: Fenghua Yu, Reinette Chatre, Jonathan Corbet, Shuah Khan, x86,
	Shaopeng Tan, James Morse, Jamie Iles, Babu Moger, Randy Dunlap,
	linux-kernel, linux-doc, patches

Hi Tony,

On Thu, Sep 28, 2023 at 9:14 PM Tony Luck <tony.luck@intel.com> wrote:
>
> Resctrl resources operate on subsets of CPUs in the system with the
> defining attribute of each subset being an instance of a particular
> level of cache. E.g. all CPUs sharing an L3 cache would be part of the
> same domain.
>
> In preparation for features that are scoped at the NUMA node level
> change the code from explicit references to "cache_level" to a more
> generic scope. At this point the only options for this scope are groups
> of CPUs that share an L2 cache or L3 cache.
>
> Provide a more detailed warning message if a domain id cannot be found
> when adding a CPU. Just check and silent return if the domain id can't
> be found when removing a CPU.
>
> No functional change.

I see a number of diagnostic checks added below. Are you sure there's
no functional change?


> diff --git a/arch/x86/kernel/cpu/resctrl/pseudo_lock.c b/arch/x86/kernel/cpu/resctrl/pseudo_lock.c
> index 8f559eeae08e..8c5f932bc00b 100644
> --- a/arch/x86/kernel/cpu/resctrl/pseudo_lock.c
> +++ b/arch/x86/kernel/cpu/resctrl/pseudo_lock.c
> @@ -292,10 +292,14 @@ static void pseudo_lock_region_clear(struct pseudo_lock_region *plr)
>   */
>  static int pseudo_lock_region_init(struct pseudo_lock_region *plr)
>  {
> +       int scope = plr->s->res->scope;
>         struct cpu_cacheinfo *ci;
>         int ret;
>         int i;
>
> +       if (WARN_ON_ONCE(scope != RESCTRL_L2_CACHE && scope != RESCTRL_L3_CACHE))
> +               return -ENODEV;

Functional change?


> diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> index 725344048f85..1cf2b36f5bf8 100644
> --- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> +++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> @@ -1345,10 +1345,13 @@ unsigned int rdtgroup_cbm_to_size(struct rdt_resource *r,
>         unsigned int size = 0;
>         int num_b, i;
>
> +       if (WARN_ON_ONCE(r->scope != RESCTRL_L2_CACHE && r->scope != RESCTRL_L3_CACHE))
> +               return -EINVAL;

This function returns unsigned int. That's a huge region!
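
For illustration only, returning 0 from the WARN path would at least stay
within the unsigned return type:

	if (WARN_ON_ONCE(r->scope != RESCTRL_L2_CACHE && r->scope != RESCTRL_L3_CACHE))
		return 0;	/* report a zero size instead of (unsigned int)-EINVAL */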

-Peter

^ permalink raw reply	[flat|nested] 331+ messages in thread

* Re: [PATCH v6 2/8] x86/resctrl: Prepare to split rdt_domain structure
  2023-09-28 19:13       ` [PATCH v6 2/8] x86/resctrl: Prepare to split rdt_domain structure Tony Luck
@ 2023-09-29 12:19         ` Peter Newman
  0 siblings, 0 replies; 331+ messages in thread
From: Peter Newman @ 2023-09-29 12:19 UTC (permalink / raw)
  To: Tony Luck
  Cc: Fenghua Yu, Reinette Chatre, Jonathan Corbet, Shuah Khan, x86,
	Shaopeng Tan, James Morse, Jamie Iles, Babu Moger, Randy Dunlap,
	linux-kernel, linux-doc, patches

Hi Tony,

On Thu, Sep 28, 2023 at 9:14 PM Tony Luck <tony.luck@intel.com> wrote:
>
> The rdt_domain structure is used for both control and monitor features.
> It is about to be split into separate structures for these two usages
> because the scope for control and monitoring features for a resource
> will be different for future resources.
>
> To allow for common code that scans a list of domains looking for a
> specific domain id, move the "list" and "id" fields into their own
> structure within the rdt_domain structure.

On this one I think you can say "No functional change"

Reviewed-by: Peter Newman <peternewman@google.com>

Thanks!
-Peter

^ permalink raw reply	[flat|nested] 331+ messages in thread

* Re: [PATCH v6 3/8] x86/resctrl: Prepare for different scope for control/monitor operations
  2023-09-28 19:13       ` [PATCH v6 3/8] x86/resctrl: Prepare for different scope for control/monitor operations Tony Luck
@ 2023-09-29 13:10         ` Peter Newman
  2023-09-29 19:15           ` Tony Luck
  0 siblings, 1 reply; 331+ messages in thread
From: Peter Newman @ 2023-09-29 13:10 UTC (permalink / raw)
  To: Tony Luck
  Cc: Fenghua Yu, Reinette Chatre, Jonathan Corbet, Shuah Khan, x86,
	Shaopeng Tan, James Morse, Jamie Iles, Babu Moger, Randy Dunlap,
	linux-kernel, linux-doc, patches

Hi Tony,

On Thu, Sep 28, 2023 at 9:14 PM Tony Luck <tony.luck@intel.com> wrote:
>
> @@ -352,7 +355,7 @@ struct rdt_domain *get_domain_from_cpu(int cpu, struct rdt_resource *r)
>  {
>         struct rdt_domain *d;
>
> -       list_for_each_entry(d, &r->domains, hdr.list) {
> +       list_for_each_entry(d, &r->ctrl_domains, hdr.list) {

If someone were to call get_domain_from_cpu() looking for a
mon_domain, I don't think they'd be happy with the result.

This problem seems adequately addressed by the next patch where a type
mismatch on the return value would result.

In any case, perhaps the name could be updated to set expectations better.
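
For example (the name here is only a suggestion), something that makes the
ctrl_domains traversal explicit:

	struct rdt_ctrl_domain *get_ctrl_domain_from_cpu(int cpu, struct rdt_resource *r);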


> @@ -549,44 +552,101 @@ static void domain_add_cpu(int cpu, struct rdt_resource *r)
>
>         rdt_domain_reconfigure_cdp(r);
>
> -       if (r->alloc_capable && domain_setup_ctrlval(r, d)) {
> +       if (domain_setup_ctrlval(r, d)) {
>                 domain_free(hw_dom);
>                 return;
>         }
>
> -       if (r->mon_capable && arch_domain_mbm_alloc(r->num_rmid, hw_dom)) {
> +       list_add_tail(&d->hdr.list, add_pos);
> +
> +       err = resctrl_online_ctrl_domain(r, d);
> +       if (err) {
> +               list_del(&d->hdr.list);
>                 domain_free(hw_dom);
> +       }
> +}
> +
> +static void domain_add_cpu_mon(int cpu, struct rdt_resource *r)
> +{
> +       int id = get_domain_id_from_scope(cpu, r->mon_scope);
> +       struct list_head *add_pos = NULL;
> +       struct rdt_hw_domain *hw_mondom;

It's still hw_dom in domain_add_cpu_ctrl(), so why hw_mondom here?


> @@ -3711,16 +3711,16 @@ static void domain_destroy_mon_state(struct rdt_domain *d)
>         kfree(d->mbm_local);
>  }
>
> -void resctrl_offline_domain(struct rdt_resource *r, struct rdt_domain *d)
> +void resctrl_offline_ctrl_domain(struct rdt_resource *r, struct rdt_domain *d)
>  {
>         lockdep_assert_held(&rdtgroup_mutex);
>
>         if (supports_mba_mbps() && r->rid == RDT_RESOURCE_MBA)
>                 mba_sc_domain_destroy(r, d);
> +}
>
> -       if (!r->mon_capable)
> -               return;
> -
> +void resctrl_offline_mon_domain(struct rdt_resource *r, struct rdt_domain *d)
> +{
>         /*
>          * If resctrl is mounted, remove all the
>          * per domain monitor data directories.

We did a lockdep_assert_held(&rdtgroup_mutex) for both types before.
Should we continue to do so here?


> --
> 2.41.0
>

In the resctrl2 prototype I complained that resctrl_resource was
awkwardly disjoint in its support for control and monitoring
groups[1]. In this patch, you seem to have already done most of the
hard work in separating the control and monitoring functionality, so
taking the next step and using a different structure to represent
control and monitoring resources would further improve the code by
statically typing monitoring and control resources, which would be
less error-prone than run-time checks on the alloc_capable and
mon_capable fields, which seem easy to forget.

I don't think this is necessary to complete SNC support, but it would
give me confidence that there isn't a misplaced {alloc,mon}_capable
check resulting in the wrong domain list being traversed. I will
probably do this myself later if you don't.

Thanks!
-Peter

[1] https://lore.kernel.org/all/CALPaoCj_oa=nATvOO_uysYvu+PdTQtd0pvssvm9_M1+fP-Z8JA@mail.gmail.com/

^ permalink raw reply	[flat|nested] 331+ messages in thread

* Re: [PATCH v6 4/8] x86/resctrl: Split the rdt_domain and rdt_hw_domain structures
  2023-09-28 19:13       ` [PATCH v6 4/8] x86/resctrl: Split the rdt_domain and rdt_hw_domain structures Tony Luck
@ 2023-09-29 13:44         ` Peter Newman
  2023-09-29 20:06           ` Tony Luck
  0 siblings, 1 reply; 331+ messages in thread
From: Peter Newman @ 2023-09-29 13:44 UTC (permalink / raw)
  To: Tony Luck
  Cc: Fenghua Yu, Reinette Chatre, Jonathan Corbet, Shuah Khan, x86,
	Shaopeng Tan, James Morse, Jamie Iles, Babu Moger, Randy Dunlap,
	linux-kernel, linux-doc, patches

Hi Tony,

On Thu, Sep 28, 2023 at 9:14 PM Tony Luck <tony.luck@intel.com> wrote:
>  /**
> - * struct rdt_hw_domain - Arch private attributes of a set of CPUs that share
> - *                       a resource
> + * struct rdt_hw_ctrl_domain - Arch private attributes of a set of CPUs that share
> + *                       a resource for a control function

wrapped line not quite aligned anymore

>   * @d_resctrl: Properties exposed to the resctrl file system
>   * @ctrl_val:  array of cache or mem ctrl values (indexed by CLOSID)
> + *
> + * Members of this structure are accessed via helpers that provide abstraction.
> + */
> +struct rdt_hw_ctrl_domain {
> +       struct rdt_ctrl_domain          d_resctrl;
> +       u32                             *ctrl_val;
> +};
> +
> +/**
> + * struct rdt_hw_mon_domain - Arch private attributes of a set of CPUs that share
> + *                       a resource for a monitor function

wrapped line not quite aligned anymore


> --
> 2.41.0
>

Reviewed-by: Peter Newman <peternewman@google.com>

^ permalink raw reply	[flat|nested] 331+ messages in thread

* Re: [PATCH v6 5/8] x86/resctrl: Add node-scope to the options for feature scope
  2023-09-28 19:13       ` [PATCH v6 5/8] x86/resctrl: Add node-scope to the options for feature scope Tony Luck
@ 2023-09-29 14:21         ` Peter Newman
  0 siblings, 0 replies; 331+ messages in thread
From: Peter Newman @ 2023-09-29 14:21 UTC (permalink / raw)
  To: Tony Luck
  Cc: Fenghua Yu, Reinette Chatre, Jonathan Corbet, Shuah Khan, x86,
	Shaopeng Tan, James Morse, Jamie Iles, Babu Moger, Randy Dunlap,
	linux-kernel, linux-doc, patches

Hi Tony,

On Thu, Sep 28, 2023 at 9:14 PM Tony Luck <tony.luck@intel.com> wrote:
>
> Currently supported resctrl features are all domain scoped the same as the
> scope of the L2 or L3 caches.
>
> Add RESCTRL_NODE as a new option for features that are scoped at the
> same granularity as NUMA nodes. This is needed for Intel's Sub-NUMA
> Cluster (SNC) feature where monitoring features are node scoped.
>
> Signed-off-by: Tony Luck <tony.luck@intel.com>
> ---
>
> Changes since v5:
>
> Updates to commit message.
>
>  include/linux/resctrl.h            | 1 +
>  arch/x86/kernel/cpu/resctrl/core.c | 2 ++
>  2 files changed, 3 insertions(+)
>
> diff --git a/include/linux/resctrl.h b/include/linux/resctrl.h
> index 1c925e3db2ea..18ed787f9798 100644
> --- a/include/linux/resctrl.h
> +++ b/include/linux/resctrl.h
> @@ -165,6 +165,7 @@ struct resctrl_schema;
>  enum resctrl_scope {
>         RESCTRL_L2_CACHE = 2,
>         RESCTRL_L3_CACHE = 3,
> +       RESCTRL_NODE,
>  };
>
>  /**
> diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
> index 726f00c01079..e61bf919ac78 100644
> --- a/arch/x86/kernel/cpu/resctrl/core.c
> +++ b/arch/x86/kernel/cpu/resctrl/core.c
> @@ -511,6 +511,8 @@ static int get_domain_id_from_scope(int cpu, enum resctrl_scope scope)
>         case RESCTRL_L2_CACHE:
>         case RESCTRL_L3_CACHE:
>                 return get_cpu_cacheinfo_id(cpu, scope);
> +       case RESCTRL_NODE:
> +               return cpu_to_node(cpu);
>         default:
>                 break;
>         }
> --
> 2.41.0
>

Looks fine.

Reviewed-by: Peter Newman <peternewman@google.com>

^ permalink raw reply	[flat|nested] 331+ messages in thread

* Re: [PATCH v6 6/8] x86/resctrl: Introduce snc_nodes_per_l3_cache
  2023-09-28 19:13       ` [PATCH v6 6/8] x86/resctrl: Introduce snc_nodes_per_l3_cache Tony Luck
@ 2023-09-29 14:21         ` Peter Newman
  2023-09-29 20:18           ` Tony Luck
  0 siblings, 1 reply; 331+ messages in thread
From: Peter Newman @ 2023-09-29 14:21 UTC (permalink / raw)
  To: Tony Luck
  Cc: Fenghua Yu, Reinette Chatre, Jonathan Corbet, Shuah Khan, x86,
	Shaopeng Tan, James Morse, Jamie Iles, Babu Moger, Randy Dunlap,
	linux-kernel, linux-doc, patches

Hi Tony,

On Thu, Sep 28, 2023 at 9:14 PM Tony Luck <tony.luck@intel.com> wrote:
>
> Intel Sub-NUMA Cluster (SNC) is a feature that subdivides the CPU cores
> and memory controllers on a socket into two or more groups. These are
> presented to the operating system as NUMA nodes.
>
> This may enable some workloads to have slightly lower latency to memory
> as the memory controller(s) in an SNC node are electrically closer to the
> CPU cores on that SNC node. This cost may be offset by lower bandwidth
> since the memory accesses for each core can only be interleaved between
> the memory controllers on the same SNC node.
>
> Resctrl monitoring on Intel system depends upon attaching RMIDs to tasks
> to track L3 cache occupancy and memory bandwidth. There is an MSR that
> controls how the RMIDs are shared between SNC nodes.
>
> The default mode divides them numerically. E.g. when there are two SNC
> nodes on a socket the lower number half of the RMIDs are given to the
> first node, the remainder to the second node. This would be difficult
> to use with the Linux resctrl interface as specific RMID values assigned
> to resctrl groups are not visible to users.
>
> The other mode divides the RMIDs and renumbers the ones on the second
> SNC node to start from zero.
>
> Even with this redumbering SNC mode requires several changes in resctrl
> behavior for correct operation.

redumbering? Harsh.


> diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> index b0901fb95aa9..a5404c412f53 100644
> --- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> +++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> @@ -1357,7 +1357,7 @@ unsigned int rdtgroup_cbm_to_size(struct rdt_resource *r,
>                 }
>         }
>
> -       return size;
> +       return size / snc_nodes_per_l3_cache;

To confirm, the size represented by a bit goes down rather than the
CBM mask shrinking in each sub-NUMA domain?

I would maybe have expected the CBM mask to already be allocating at
the smallest granularity the hardware supports.

>  }
>
>  /**
> @@ -2590,7 +2590,7 @@ static int rdt_parse_param(struct fs_context *fc, struct fs_parameter *param)
>                 ctx->enable_cdpl2 = true;
>                 return 0;
>         case Opt_mba_mbps:
> -               if (!supports_mba_mbps())
> +               if (!supports_mba_mbps() || snc_nodes_per_l3_cache > 1)

Factor into supports_mba_mbps()?

>                         return -EINVAL;
>                 ctx->enable_mba_mbps = true;
>                 return 0;
> --
> 2.41.0
>

Reviewed-by: Peter Newman <peternewman@google.com>

^ permalink raw reply	[flat|nested] 331+ messages in thread

* Re: [PATCH v6 7/8] x86/resctrl: Sub NUMA Cluster detection and enable
  2023-09-28 19:13       ` [PATCH v6 7/8] x86/resctrl: Sub NUMA Cluster detection and enable Tony Luck
@ 2023-09-29 14:32         ` Peter Newman
  0 siblings, 0 replies; 331+ messages in thread
From: Peter Newman @ 2023-09-29 14:32 UTC (permalink / raw)
  To: Tony Luck
  Cc: Fenghua Yu, Reinette Chatre, Jonathan Corbet, Shuah Khan, x86,
	Shaopeng Tan, James Morse, Jamie Iles, Babu Moger, Randy Dunlap,
	linux-kernel, linux-doc, patches

Hi Tony,

On Thu, Sep 28, 2023 at 9:14 PM Tony Luck <tony.luck@intel.com> wrote:
> +static __init int snc_get_config(void)
> +{
> +       unsigned long *node_caches;
> +       int mem_only_nodes = 0;
> +       int cpu, node, ret;
> +       int num_l3_caches;
> +
> +       if (!x86_match_cpu(snc_cpu_ids))
> +               return 1;
> +
> +       node_caches = bitmap_zalloc(nr_node_ids, GFP_KERNEL);
> +       if (!node_caches)
> +               return 1;
> +
> +       cpus_read_lock();
> +       for_each_node(node) {
> +               cpu = cpumask_first(cpumask_of_node(node));
> +               if (cpu < nr_cpu_ids)
> +                       set_bit(get_cpu_cacheinfo_id(cpu, 3), node_caches);
> +               else
> +                       mem_only_nodes++;
> +       }
> +       cpus_read_unlock();
> +
> +       num_l3_caches = bitmap_weight(node_caches, nr_node_ids);
> +       if (!num_l3_caches)
> +               return 1;
> +
> +       ret = (nr_node_ids - mem_only_nodes) / num_l3_caches;
> +       kfree(node_caches);

It seems a little peculiar to free node_caches a couple of lines after
you're actually done with it.

> +
> +       if (ret > 1)
> +               rdt_resources_all[RDT_RESOURCE_L3].r_resctrl.mon_scope = RESCTRL_NODE;
> +
> +       return ret;
> +}
> +
>  static __init void rdt_init_res_defs_intel(void)
>  {
>         struct rdt_hw_resource *hw_res;
>         struct rdt_resource *r;
>
> +       snc_nodes_per_l3_cache = snc_get_config();
> +
>         for_each_rdt_resource(r) {
>                 hw_res = resctrl_to_arch_res(r);
>
> --
> 2.41.0
>

Reviewed-by: Peter Newman <peternewman@google.com>

^ permalink raw reply	[flat|nested] 331+ messages in thread

* Re: [PATCH v6 0/8] Add support for Sub-NUMA cluster (SNC) systems
  2023-09-28 19:13     ` [PATCH v6 " Tony Luck
                         ` (7 preceding siblings ...)
  2023-09-28 19:13       ` [PATCH v6 8/8] x86/resctrl: Update documentation with Sub-NUMA cluster changes Tony Luck
@ 2023-09-29 14:33       ` Peter Newman
  2023-09-29 21:06         ` Tony Luck
  2023-10-03 16:07       ` [PATCH v7 " Tony Luck
  2023-10-03 21:30       ` [PATCH v8 " Tony Luck
  10 siblings, 1 reply; 331+ messages in thread
From: Peter Newman @ 2023-09-29 14:33 UTC (permalink / raw)
  To: Tony Luck
  Cc: Fenghua Yu, Reinette Chatre, Jonathan Corbet, Shuah Khan, x86,
	Shaopeng Tan, James Morse, Jamie Iles, Babu Moger, Randy Dunlap,
	linux-kernel, linux-doc, patches

Hi Tony,

On Thu, Sep 28, 2023 at 9:14 PM Tony Luck <tony.luck@intel.com> wrote:
>
> The Sub-NUMA cluster feature on some Intel processors partitions
> the CPUs that share an L3 cache into two or more sets. This plays
> havoc with the Resource Director Technology (RDT) monitoring features.
> Prior to this patch Intel has advised that SNC and RDT are incompatible.
>
> Some of these CPUs support an MSR that can partition the RMID
> counters in the same way. This allows for monitoring features
> to be used (with the caveat that memory accesses between different
> SNC NUMA nodes may still not be counted accurately).

Is an "SNC NUMA node" a "sub-NUMA node", or a NUMA node on which SNC
has been enabled?

Thanks!
-Peter

^ permalink raw reply	[flat|nested] 331+ messages in thread

* Re: [PATCH v6 8/8] x86/resctrl: Update documentation with Sub-NUMA cluster changes
  2023-09-28 19:13       ` [PATCH v6 8/8] x86/resctrl: Update documentation with Sub-NUMA cluster changes Tony Luck
@ 2023-09-29 14:54         ` Peter Newman
  2023-09-29 20:51           ` Tony Luck
  0 siblings, 1 reply; 331+ messages in thread
From: Peter Newman @ 2023-09-29 14:54 UTC (permalink / raw)
  To: Tony Luck
  Cc: Fenghua Yu, Reinette Chatre, Jonathan Corbet, Shuah Khan, x86,
	Shaopeng Tan, James Morse, Jamie Iles, Babu Moger, Randy Dunlap,
	linux-kernel, linux-doc, patches

Hi Tony,

On Thu, Sep 28, 2023 at 9:14 PM Tony Luck <tony.luck@intel.com> wrote:
> diff --git a/Documentation/arch/x86/resctrl.rst b/Documentation/arch/x86/resctrl.rst
> index cb05d90111b4..d6b6a4cfd967 100644
> --- a/Documentation/arch/x86/resctrl.rst
> +++ b/Documentation/arch/x86/resctrl.rst
> @@ -345,9 +345,15 @@ When control is enabled all CTRL_MON groups will also contain:
>  When monitoring is enabled all MON groups will also contain:
>
>  "mon_data":
> -       This contains a set of files organized by L3 domain and by
> -       RDT event. E.g. on a system with two L3 domains there will
> -       be subdirectories "mon_L3_00" and "mon_L3_01".  Each of these
> +       This contains a set of files organized by L3 domain or by NUMA
> +       node (depending on whether Sub-NUMA Cluster (SNC) mode is disabled
> +       or enabled respectively) and by RDT event. E.g. on a system with
> +       SNC mode disabled with two L3 domains there will be subdirectories
> +       "mon_L3_00" and "mon_L3_01". The numerical suffix refers to the
> +       L3 cache id.  With SNC enabled the directory names are the same,
> +       but the numerical suffix refers to the node id.
> +       Mappings from node ids to CPUs are available in the
> +       /sys/devices/system/node/node*/cpulist files. Each of these

The explanation of mon_data seems overwhelmingly SNC-centric now.
Maybe the SNC section should be responsible for explaining its impact
on the mon_data directory. Mainly by reminding the reader that domain
ids in the mon_data directory are node ids in SNC mode.


>         directories have one file per event (e.g. "llc_occupancy",
>         "mbm_total_bytes", and "mbm_local_bytes"). In a MON group these
>         files provide a read out of the current value of the event for
> @@ -452,6 +458,28 @@ and 0xA are not.  On a system with a 20-bit mask each bit represents 5%
>  of the capacity of the cache. You could partition the cache into four
>  equal parts with masks: 0x1f, 0x3e0, 0x7c00, 0xf8000.
>
> +Notes on Sub-NUMA Cluster mode
> +==============================
> +When SNC mode is enabled the "llc_occupancy", "mbm_total_bytes", and
> +"mbm_local_bytes" will only give meaningful results for well behaved NUMA
> +applications. I.e. those that perform the majority of memory accesses
> +to memory on the local NUMA node to the CPU where the task is executing.

Not being specific about why the results aren't meaningful, this
sounds vague and alarming.

> +Note that Linux may load balance tasks between Sub-NUMA nodes much
> +more readily than between regular NUMA nodes since the CPUs on SNC
> +share the same L3 cache and the system may report the NUMA distance
> +between SNC nodes with a lower value than used for regular NUMA nodes.
> +Tasks that migrate between nodes will have their traffic recorded by the
> +counters in different SNC nodes so a user will need to read mon_data
> +files from each node on which the task executed to get the full
> +view of traffic for which the task was the source.
> +
> +
> +The cache allocation feature still provides the same number of
> +bits in a mask to control allocation into the L3 cache. But each
> +of those ways has its capacity reduced because the cache is divided
> +between the SNC nodes. The values reported in the resctrl
> +"size" files are adjusted accordingly.
> +
>  Memory bandwidth Allocation and monitoring
>  ==========================================
>
> --
> 2.41.0
>

Reviewed-by: Peter Newman <peternewman@google.com>

^ permalink raw reply	[flat|nested] 331+ messages in thread

* Re: [PATCH v6 1/8] x86/resctrl: Prepare for new domain scope
  2023-09-29 12:09         ` Peter Newman
@ 2023-09-29 18:38           ` Tony Luck
  0 siblings, 0 replies; 331+ messages in thread
From: Tony Luck @ 2023-09-29 18:38 UTC (permalink / raw)
  To: Peter Newman
  Cc: Fenghua Yu, Reinette Chatre, Jonathan Corbet, Shuah Khan, x86,
	Shaopeng Tan, James Morse, Jamie Iles, Babu Moger, Randy Dunlap,
	linux-kernel, linux-doc, patches

On Fri, Sep 29, 2023 at 02:09:42PM +0200, Peter Newman wrote:
> Hi Tony,
> 
> On Thu, Sep 28, 2023 at 9:14 PM Tony Luck <tony.luck@intel.com> wrote:
> >
> > Resctrl resources operate on subsets of CPUs in the system with the
> > defining attribute of each subset being an instance of a particular
> > level of cache. E.g. all CPUs sharing an L3 cache would be part of the
> > same domain.
> >
> > In preparation for features that are scoped at the NUMA node level
> > change the code from explicit references to "cache_level" to a more
> > generic scope. At this point the only options for this scope are groups
> > of CPUs that share an L2 cache or L3 cache.
> >
> > Provide a more detailed warning message if a domain id cannot be found
> > when adding a CPU. Just check and silent return if the domain id can't
> > be found when removing a CPU.
> >
> > No functional change.
> 
> I see a number of diagnostic checks added below. Are you sure there's
> no functional change?

Those checks should be for "can't happen" cases where some "scope" that
isn't attached to a cache_level somehow made its way down to a routine
that is only expecting cache-scoped resources.

But I'll remove the "No functional change" from the commit comment.
> 
> 
> > diff --git a/arch/x86/kernel/cpu/resctrl/pseudo_lock.c b/arch/x86/kernel/cpu/resctrl/pseudo_lock.c
> > index 8f559eeae08e..8c5f932bc00b 100644
> > --- a/arch/x86/kernel/cpu/resctrl/pseudo_lock.c
> > +++ b/arch/x86/kernel/cpu/resctrl/pseudo_lock.c
> > @@ -292,10 +292,14 @@ static void pseudo_lock_region_clear(struct pseudo_lock_region *plr)
> >   */
> >  static int pseudo_lock_region_init(struct pseudo_lock_region *plr)
> >  {
> > +       int scope = plr->s->res->scope;
> >         struct cpu_cacheinfo *ci;
> >         int ret;
> >         int i;
> >
> > +       if (WARN_ON_ONCE(scope != RESCTRL_L2_CACHE && scope != RESCTRL_L3_CACHE))
> > +               return -ENODEV;
> 
> Functional change?
> 
> 
> > diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> > index 725344048f85..1cf2b36f5bf8 100644
> > --- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> > +++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> > @@ -1345,10 +1345,13 @@ unsigned int rdtgroup_cbm_to_size(struct rdt_resource *r,
> >         unsigned int size = 0;
> >         int num_b, i;
> >
> > +       if (WARN_ON_ONCE(r->scope != RESCTRL_L2_CACHE && r->scope != RESCTRL_L3_CACHE))
> > +               return -EINVAL;
> 
> This function returns unsigned int. That's a huge region!

I wondered for a bit what value to return for this "can't happen" case
and initially went with an error code.  But the function can already
fail (in ways that won't happen) and in that case the return is "0".

I'll update to return 0 here too.

> -Peter

Thanks for the review.

-Tony

^ permalink raw reply	[flat|nested] 331+ messages in thread

* Re: [PATCH v6 3/8] x86/resctrl: Prepare for different scope for control/monitor operations
  2023-09-29 13:10         ` Peter Newman
@ 2023-09-29 19:15           ` Tony Luck
  0 siblings, 0 replies; 331+ messages in thread
From: Tony Luck @ 2023-09-29 19:15 UTC (permalink / raw)
  To: Peter Newman
  Cc: Fenghua Yu, Reinette Chatre, Jonathan Corbet, Shuah Khan, x86,
	Shaopeng Tan, James Morse, Jamie Iles, Babu Moger, Randy Dunlap,
	linux-kernel, linux-doc, patches

On Fri, Sep 29, 2023 at 03:10:18PM +0200, Peter Newman wrote:
> Hi Tony,
> 
> On Thu, Sep 28, 2023 at 9:14 PM Tony Luck <tony.luck@intel.com> wrote:
> >
> > @@ -352,7 +355,7 @@ struct rdt_domain *get_domain_from_cpu(int cpu, struct rdt_resource *r)
> >  {
> >         struct rdt_domain *d;
> >
> > -       list_for_each_entry(d, &r->domains, hdr.list) {
> > +       list_for_each_entry(d, &r->ctrl_domains, hdr.list) {
> 
> If someone were to call get_domain_from_cpu() looking for a
> mon_domain, I don't think they'd be happy with the result.

Indeed not. Type checking in "C" doesn't seem adequate to address this
(when using "container_of()", which blindly trusts that the user provided
the right type/fieldname). I'm using the rdt_domain_hdr.type field to
provide the necessary checks.
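
Roughly something like this (a minimal sketch of the idea, not the posted
code -- the helper name and struct member layout here are assumptions):

	static inline struct rdt_mon_domain *
	domain_hdr_to_mon(struct rdt_domain_hdr *hdr)
	{
		/* Refuse to convert a header that isn't a monitor domain */
		if (WARN_ON_ONCE(hdr->type != RESCTRL_MON_DOMAIN))
			return NULL;
		return container_of(hdr, struct rdt_mon_domain, hdr);
	}

so a lookup that returns a plain header only gets converted to the monitor
flavor after the type field has been checked.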

> 
> This problem seems adequately addressed by the next patch where a type
> mismatch on the return value would result.
> 
> In any case, perhaps the name could be updated to set expectations better.
> 
> 
> > @@ -549,44 +552,101 @@ static void domain_add_cpu(int cpu, struct rdt_resource *r)
> >
> >         rdt_domain_reconfigure_cdp(r);
> >
> > -       if (r->alloc_capable && domain_setup_ctrlval(r, d)) {
> > +       if (domain_setup_ctrlval(r, d)) {
> >                 domain_free(hw_dom);
> >                 return;
> >         }
> >
> > -       if (r->mon_capable && arch_domain_mbm_alloc(r->num_rmid, hw_dom)) {
> > +       list_add_tail(&d->hdr.list, add_pos);
> > +
> > +       err = resctrl_online_ctrl_domain(r, d);
> > +       if (err) {
> > +               list_del(&d->hdr.list);
> >                 domain_free(hw_dom);
> > +       }
> > +}
> > +
> > +static void domain_add_cpu_mon(int cpu, struct rdt_resource *r)
> > +{
> > +       int id = get_domain_id_from_scope(cpu, r->mon_scope);
> > +       struct list_head *add_pos = NULL;
> > +       struct rdt_hw_domain *hw_mondom;
> 
> It's still hw_dom in domain_add_cpu_ctrl(), so why hw_mondom here?

No good reason. I'll change it to "hw_dom".

> 
> 
> > @@ -3711,16 +3711,16 @@ static void domain_destroy_mon_state(struct rdt_domain *d)
> >         kfree(d->mbm_local);
> >  }
> >
> > -void resctrl_offline_domain(struct rdt_resource *r, struct rdt_domain *d)
> > +void resctrl_offline_ctrl_domain(struct rdt_resource *r, struct rdt_domain *d)
> >  {
> >         lockdep_assert_held(&rdtgroup_mutex);
> >
> >         if (supports_mba_mbps() && r->rid == RDT_RESOURCE_MBA)
> >                 mba_sc_domain_destroy(r, d);
> > +}
> >
> > -       if (!r->mon_capable)
> > -               return;
> > -
> > +void resctrl_offline_mon_domain(struct rdt_resource *r, struct rdt_domain *d)
> > +{
> >         /*
> >          * If resctrl is mounted, remove all the
> >          * per domain monitor data directories.
> 
> We did a lockdep_assert_held(&rdtgroup_mutex) for both types before.
> Should we continue to do so here?

Yes. Added it.

> 
> 
> > --
> > 2.41.0
> >
> 
> In the resctrl2 prototype I complained that resctrl_resource was
> awkwardly disjoint in its support for control and monitoring
> groups[1]. In this patch, you seem to have already done most of the
> hard work in separating the control and monitoring functionality, so
> taking the next step and using a different structure to represent
> control and monitoring resources would further improve the code by
> statically typing monitoring and control resources, which would be
> less error-prone than run-time checks on the alloc_capable and
> mon_capable fields, which seem easy to forget.
> 
> I don't think this is necessary to complete SNC support, but it would
> give me confidence that there isn't a misplaced {alloc,mon}_capable
> check resulting in the wrong domain list being traversed. I will
> probably do this myself later if you don't.

Simple change. It's split between previous patch to add the field
and current patch to initialize and check.

> 
> Thanks!
> -Peter
> 
> [1] https://lore.kernel.org/all/CALPaoCj_oa=nATvOO_uysYvu+PdTQtd0pvssvm9_M1+fP-Z8JA@mail.gmail.com/

Thanks

-Tony

^ permalink raw reply	[flat|nested] 331+ messages in thread

* Re: [PATCH v6 4/8] x86/resctrl: Split the rdt_domain and rdt_hw_domain structures
  2023-09-29 13:44         ` Peter Newman
@ 2023-09-29 20:06           ` Tony Luck
  0 siblings, 0 replies; 331+ messages in thread
From: Tony Luck @ 2023-09-29 20:06 UTC (permalink / raw)
  To: Peter Newman
  Cc: Fenghua Yu, Reinette Chatre, Jonathan Corbet, Shuah Khan, x86,
	Shaopeng Tan, James Morse, Jamie Iles, Babu Moger, Randy Dunlap,
	linux-kernel, linux-doc, patches

On Fri, Sep 29, 2023 at 03:44:54PM +0200, Peter Newman wrote:
> Hi Tony,
> 
> On Thu, Sep 28, 2023 at 9:14 PM Tony Luck <tony.luck@intel.com> wrote:
> >  /**
> > - * struct rdt_hw_domain - Arch private attributes of a set of CPUs that share
> > - *                       a resource
> > + * struct rdt_hw_ctrl_domain - Arch private attributes of a set of CPUs that share
> > + *                       a resource for a control function
> 
> wrapped line not quite aligned anymore

Fixed

> 
> >   * @d_resctrl: Properties exposed to the resctrl file system
> >   * @ctrl_val:  array of cache or mem ctrl values (indexed by CLOSID)
> > + *
> > + * Members of this structure are accessed via helpers that provide abstraction.
> > + */
> > +struct rdt_hw_ctrl_domain {
> > +       struct rdt_ctrl_domain          d_resctrl;
> > +       u32                             *ctrl_val;
> > +};
> > +
> > +/**
> > + * struct rdt_hw_mon_domain - Arch private attributes of a set of CPUs that share
> > + *                       a resource for a monitor function
> 
> wrapped line not quite aligned anymore

Fixed

> 
> 
> > --
> > 2.41.0
> >
> 
> Reviewed-by: Peter Newman <peternewman@google.com>

Thanks

-Tony

^ permalink raw reply	[flat|nested] 331+ messages in thread

* Re: [PATCH v6 6/8] x86/resctrl: Introduce snc_nodes_per_l3_cache
  2023-09-29 14:21         ` Peter Newman
@ 2023-09-29 20:18           ` Tony Luck
  0 siblings, 0 replies; 331+ messages in thread
From: Tony Luck @ 2023-09-29 20:18 UTC (permalink / raw)
  To: Peter Newman
  Cc: Fenghua Yu, Reinette Chatre, Jonathan Corbet, Shuah Khan, x86,
	Shaopeng Tan, James Morse, Jamie Iles, Babu Moger, Randy Dunlap,
	linux-kernel, linux-doc, patches

On Fri, Sep 29, 2023 at 04:21:54PM +0200, Peter Newman wrote:
> Hi Tony,
> 
> On Thu, Sep 28, 2023 at 9:14 PM Tony Luck <tony.luck@intel.com> wrote:
> >
> > Intel Sub-NUMA Cluster (SNC) is a feature that subdivides the CPU cores
> > and memory controllers on a socket into two or more groups. These are
> > presented to the operating system as NUMA nodes.
> >
> > This may enable some workloads to have slightly lower latency to memory
> > as the memory controller(s) in an SNC node are electrically closer to the
> > CPU cores on that SNC node. This cost may be offset by lower bandwidth
> > since the memory accesses for each core can only be interleaved between
> > the memory controllers on the same SNC node.
> >
> > Resctrl monitoring on Intel system depends upon attaching RMIDs to tasks
> > to track L3 cache occupancy and memory bandwidth. There is an MSR that
> > controls how the RMIDs are shared between SNC nodes.
> >
> > The default mode divides them numerically. E.g. when there are two SNC
> > nodes on a socket the lower number half of the RMIDs are given to the
> > first node, the remainder to the second node. This would be difficult
> > to use with the Linux resctrl interface as specific RMID values assigned
> > to resctrl groups are not visible to users.
> >
> > The other mode divides the RMIDs and renumbers the ones on the second
> > SNC node to start from zero.
> >
> > Even with this redumbering SNC mode requires several changes in resctrl
> > behavior for correct operation.
> 
> redumbering? Harsh.

Spell check didn't catch this. Implies that redumbering is a real word!
Now I need to find an opportunity to use it :-)

Changed to renumbering.
> 
> 
> > diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> > index b0901fb95aa9..a5404c412f53 100644
> > --- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> > +++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> > @@ -1357,7 +1357,7 @@ unsigned int rdtgroup_cbm_to_size(struct rdt_resource *r,
> >                 }
> >         }
> >
> > -       return size;
> > +       return size / snc_nodes_per_l3_cache;
> 
> To confirm, the size represented by a bit goes down rather than the
> CBM mask shrinking in each sub-NUMA domain?

Correct. The CBM mask keeps the same number of bits. Those bits are
all available to control allocation in each SNC node. But the amount
of cache you get per-bit is reduced in the ratio of the number of SNC
ways.
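
For example (made-up numbers): with a 32MB L3, a 16-bit CBM and two SNC
nodes per socket, each CBM bit still exists, but it now represents
32MB / 16 / 2 = 1MB of cache within an SNC node rather than 2MB, which
is why rdtgroup_cbm_to_size() divides by snc_nodes_per_l3_cache.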

> 
> I would maybe have expected the CBM mask to already be allocating at
> the smallest granularity the hardware supports.
> 
> >  }
> >
> >  /**
> > @@ -2590,7 +2590,7 @@ static int rdt_parse_param(struct fs_context *fc, struct fs_parameter *param)
> >                 ctx->enable_cdpl2 = true;
> >                 return 0;
> >         case Opt_mba_mbps:
> > -               if (!supports_mba_mbps())
> > +               if (!supports_mba_mbps() || snc_nodes_per_l3_cache > 1)
> 
> Factor into supports_mba_mbps()?

Agreed. That's a better place. Moved this test into supports_mba_mbps().
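
Something along these lines (sketch only -- the existing body of
supports_mba_mbps() is reproduced from memory and may not match exactly):

	static bool supports_mba_mbps(void)
	{
		struct rdt_resource *r = &rdt_resources_all[RDT_RESOURCE_MBA].r_resctrl;

		/* mba_MBps is not supported when SNC mode is enabled */
		if (snc_nodes_per_l3_cache > 1)
			return false;

		return (is_mbm_local_enabled() &&
			r->alloc_capable && is_mba_linear());
	}

so rdt_parse_param() can keep the plain !supports_mba_mbps() check.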
> 
> >                         return -EINVAL;
> >                 ctx->enable_mba_mbps = true;
> >                 return 0;
> > --
> > 2.41.0
> >
> 
> Reviewed-by: Peter Newman <peternewman@google.com>

Thanks

-Tony

^ permalink raw reply	[flat|nested] 331+ messages in thread

* Re: [PATCH v6 8/8] x86/resctrl: Update documentation with Sub-NUMA cluster changes
  2023-09-29 14:54         ` Peter Newman
@ 2023-09-29 20:51           ` Tony Luck
  0 siblings, 0 replies; 331+ messages in thread
From: Tony Luck @ 2023-09-29 20:51 UTC (permalink / raw)
  To: Peter Newman
  Cc: Fenghua Yu, Reinette Chatre, Jonathan Corbet, Shuah Khan, x86,
	Shaopeng Tan, James Morse, Jamie Iles, Babu Moger, Randy Dunlap,
	linux-kernel, linux-doc, patches

On Fri, Sep 29, 2023 at 04:54:21PM +0200, Peter Newman wrote:
> Hi Tony,
> 
> On Thu, Sep 28, 2023 at 9:14 PM Tony Luck <tony.luck@intel.com> wrote:
> > diff --git a/Documentation/arch/x86/resctrl.rst b/Documentation/arch/x86/resctrl.rst
> > index cb05d90111b4..d6b6a4cfd967 100644
> > --- a/Documentation/arch/x86/resctrl.rst
> > +++ b/Documentation/arch/x86/resctrl.rst
> > @@ -345,9 +345,15 @@ When control is enabled all CTRL_MON groups will also contain:
> >  When monitoring is enabled all MON groups will also contain:
> >
> >  "mon_data":
> > -       This contains a set of files organized by L3 domain and by
> > -       RDT event. E.g. on a system with two L3 domains there will
> > -       be subdirectories "mon_L3_00" and "mon_L3_01".  Each of these
> > +       This contains a set of files organized by L3 domain or by NUMA
> > +       node (depending on whether Sub-NUMA Cluster (SNC) mode is disabled
> > +       or enabled respectively) and by RDT event. E.g. on a system with
> > +       SNC mode disabled with two L3 domains there will be subdirectories
> > +       "mon_L3_00" and "mon_L3_01". The numerical suffix refers to the
> > +       L3 cache id.  With SNC enabled the directory names are the same,
> > +       but the numerical suffix refers to the node id.
> > +       Mappings from node ids to CPUs are available in the
> > +       /sys/devices/system/node/node*/cpulist files. Each of these
> 
> The explanation of mon_data seems overwhelmingly SNC-centric now.
> Maybe the SNC section should be responsible for explaining its impact
> on the mon_data directory. Mainly by reminding the reader that domain
> ids in the mon_data directory are node ids in SNC mode.

I cut out all the examples and just note that the numerical suffixes
are node ids instead of cache instances.

This bit of the git diff now reads:

-       This contains a set of files organized by L3 domain and by
-       RDT event. E.g. on a system with two L3 domains there will
-       be subdirectories "mon_L3_00" and "mon_L3_01".  Each of these
+       This contains a set of files organized by L3 domain or by NUMA
+       node (depending on whether Sub-NUMA Cluster (SNC) mode is disabled
+       or enabled respectively) and by RDT event.  Each of these


> 
> 
> >         directories have one file per event (e.g. "llc_occupancy",
> >         "mbm_total_bytes", and "mbm_local_bytes"). In a MON group these
> >         files provide a read out of the current value of the event for
> > @@ -452,6 +458,28 @@ and 0xA are not.  On a system with a 20-bit mask each bit represents 5%
> >  of the capacity of the cache. You could partition the cache into four
> >  equal parts with masks: 0x1f, 0x3e0, 0x7c00, 0xf8000.
> >
> > +Notes on Sub-NUMA Cluster mode
> > +==============================
> > +When SNC mode is enabled the "llc_occupancy", "mbm_total_bytes", and
> > +"mbm_local_bytes" will only give meaningful results for well behaved NUMA
> > +applications. I.e. those that perform the majority of memory accesses
> > +to memory on the local NUMA node to the CPU where the task is executing.
> 
> Not being specific about why the results aren't meaningful, this
> sounds vague and alarming.

Removed the trigger word "meaningful" and re-worded to just explain
the increased likelihood that tasks will migrate between nodes, so users
must collect data from all nodes. Technically this has always been true
on multi-socket systems. But since there is a much higher barrier to
task migration between sockets, users may find that simple measurements
that used to work now behave differently.
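
(E.g. for a task that ran on both SNC nodes of a two-node socket, the
user would add mon_data/mon_L3_00/llc_occupancy and
mon_data/mon_L3_01/llc_occupancy together to get the task's total
occupancy, with the numerical suffixes now being node ids.)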

New version:

+Notes on Sub-NUMA Cluster mode
+==============================
+When SNC mode is enabled Linux may load balance tasks between Sub-NUMA
+nodes much more readily than between regular NUMA nodes since the CPUs
+on Sub-NUMA nodes share the same L3 cache and the system may report
+the NUMA distance between Sub-NUMA nodes with a lower value than used
+for regular NUMA nodes.  Users who do not bind tasks to the CPUs of a
+specific Sub-NUMA node must read the "llc_occupancy", "mbm_total_bytes",
+and "mbm_local_bytes" for all Sub-NUMA nodes where the tasks may execute
+to get the full view of traffic for which the tasks were the source.
+
+The cache allocation feature still provides the same number of
+bits in a mask to control allocation into the L3 cache. But each
+of those ways has its capacity reduced because the cache is divided
+between the SNC nodes. The values reported in the resctrl
+"size" files are adjusted accordingly.


> 
> > +Note that Linux may load balance tasks between Sub-NUMA nodes much
> > +more readily than between regular NUMA nodes since the CPUs on SNC
> > +share the same L3 cache and the system may report the NUMA distance
> > +between SNC nodes with a lower value than used for regular NUMA nodes.
> > +Tasks that migrate between nodes will have their traffic recorded by the
> > +counters in different SNC nodes so a user will need to read mon_data
> > +files from each node on which the task executed to get the full
> > +view of traffic for which the task was the source.
> > +
> > +
> > +The cache allocation feature still provides the same number of
> > +bits in a mask to control allocation into the L3 cache. But each
> > +of those ways has its capacity reduced because the cache is divided
> > +between the SNC nodes. The values reported in the resctrl
> > +"size" files are adjusted accordingly.
> > +
> >  Memory bandwidth Allocation and monitoring
> >  ==========================================
> >
> > --
> > 2.41.0
> >
> 
> Reviewed-by: Peter Newman <peternewman@google.com>

^ permalink raw reply	[flat|nested] 331+ messages in thread

* Re: [PATCH v6 0/8] Add support for Sub-NUMA cluster (SNC) systems
  2023-09-29 14:33       ` [PATCH v6 0/8] Add support for Sub-NUMA cluster (SNC) systems Peter Newman
@ 2023-09-29 21:06         ` Tony Luck
  0 siblings, 0 replies; 331+ messages in thread
From: Tony Luck @ 2023-09-29 21:06 UTC (permalink / raw)
  To: Peter Newman
  Cc: Fenghua Yu, Reinette Chatre, Jonathan Corbet, Shuah Khan, x86,
	Shaopeng Tan, James Morse, Jamie Iles, Babu Moger, Randy Dunlap,
	linux-kernel, linux-doc, patches

On Fri, Sep 29, 2023 at 04:33:17PM +0200, Peter Newman wrote:
> Hi Tony,
> 
> On Thu, Sep 28, 2023 at 9:14 PM Tony Luck <tony.luck@intel.com> wrote:
> >
> > The Sub-NUMA cluster feature on some Intel processors partitions
> > the CPUs that share an L3 cache into two or more sets. This plays
> > havoc with the Resource Director Technology (RDT) monitoring features.
> > Prior to this patch Intel has advised that SNC and RDT are incompatible.
> >
> > Some of these CPUs support an MSR that can partition the RMID
> > counters in the same way. This allows for monitoring features
> > to be used (with the caveat that memory accesses between different
> > SNC NUMA nodes may still not be counted accurately).
> 
> Is an "SNC NUMA node" a "sub-NUMA node", or a NUMA node on which SNC
> has been enabled?

It would be architecturally possible to enable SNC mode on
a subset of CPU sockets. But there isn't a BIOS setup option
to do that. You either have SNC everywhere, or nowhere.

I prefer "SNC NUMA node" == "sub-NUMA node".

This version "NUMA node on which SNC has been enabled"
makes it sound like there is a control on a NUMA node
that can be switched.  The control is on the CPU socket.
That's often equivalent to a NUMA node, but Intel has
had CPUs in the past where this isn't the case (e.g.
Cascade Lake-AP and Cooper Lake).
> 
> Thanks!
> -Peter


Thanks for the review of the series. I've applied changes
to my local tree. Will post v7 of the series early next
week if no other reviews come in.

-Tony

^ permalink raw reply	[flat|nested] 331+ messages in thread

* [PATCH v7 0/8] Add support for Sub-NUMA cluster (SNC) systems
  2023-09-28 19:13     ` [PATCH v6 " Tony Luck
                         ` (8 preceding siblings ...)
  2023-09-29 14:33       ` [PATCH v6 0/8] Add support for Sub-NUMA cluster (SNC) systems Peter Newman
@ 2023-10-03 16:07       ` Tony Luck
  2023-10-03 16:07         ` [PATCH v7 1/8] x86/resctrl: Prepare for new domain scope Tony Luck
                           ` (8 more replies)
  2023-10-03 21:30       ` [PATCH v8 " Tony Luck
  10 siblings, 9 replies; 331+ messages in thread
From: Tony Luck @ 2023-10-03 16:07 UTC (permalink / raw)
  To: Fenghua Yu, Reinette Chatre, Peter Newman, Jonathan Corbet,
	Shuah Khan, x86
  Cc: Shaopeng Tan, James Morse, Jamie Iles, Babu Moger, Randy Dunlap,
	linux-kernel, linux-doc, patches, Tony Luck

The Sub-NUMA cluster feature on some Intel processors partitions
the CPUs that share an L3 cache into two or more sets. This plays
havoc with the Resource Director Technology (RDT) monitoring features.
Prior to this patch Intel has advised that SNC and RDT are incompatible.

Some of these CPUs support an MSR that can partition the RMID
counters in the same way. This allows for monitoring features
to be used (with the caveat that memory accesses between different
SNC NUMA nodes may still not be counted accurately).

Note that this patch series improves resctrl reporting considerably
on systems with SNC enabled, but there will still be some anomalies
for processes accessing memory from other sub-NUMA nodes.

Signed-off-by: Tony Luck <tony.luck@intel.com>

Tony Luck (8):
  x86/resctrl: Prepare for new domain scope
  x86/resctrl: Prepare to split rdt_domain structure
  x86/resctrl: Prepare for different scope for control/monitor
    operations
  x86/resctrl: Split the rdt_domain and rdt_hw_domain structures
  x86/resctrl: Add node-scope to the options for feature scope
  x86/resctrl: Introduce snc_nodes_per_l3_cache
  x86/resctrl: Sub NUMA Cluster detection and enable
  x86/resctrl: Update documentation with Sub-NUMA cluster changes

 Documentation/arch/x86/resctrl.rst        |  23 +-
 include/linux/resctrl.h                   |  85 +++--
 arch/x86/include/asm/msr-index.h          |   1 +
 arch/x86/kernel/cpu/resctrl/internal.h    |  66 ++--
 arch/x86/kernel/cpu/resctrl/core.c        | 400 +++++++++++++++++-----
 arch/x86/kernel/cpu/resctrl/ctrlmondata.c |  58 ++--
 arch/x86/kernel/cpu/resctrl/monitor.c     |  58 ++--
 arch/x86/kernel/cpu/resctrl/pseudo_lock.c |  14 +-
 arch/x86/kernel/cpu/resctrl/rdtgroup.c    | 132 +++----
 9 files changed, 591 insertions(+), 246 deletions(-)


base-commit: 6465e260f48790807eef06b583b38ca9789b6072
-- 
2.41.0


^ permalink raw reply	[flat|nested] 331+ messages in thread

* [PATCH v7 1/8] x86/resctrl: Prepare for new domain scope
  2023-10-03 16:07       ` [PATCH v7 " Tony Luck
@ 2023-10-03 16:07         ` Tony Luck
  2023-10-03 16:07         ` [PATCH v7 2/8] x86/resctrl: Prepare to split rdt_domain structure Tony Luck
                           ` (7 subsequent siblings)
  8 siblings, 0 replies; 331+ messages in thread
From: Tony Luck @ 2023-10-03 16:07 UTC (permalink / raw)
  To: Fenghua Yu, Reinette Chatre, Peter Newman, Jonathan Corbet,
	Shuah Khan, x86
  Cc: Shaopeng Tan, James Morse, Jamie Iles, Babu Moger, Randy Dunlap,
	linux-kernel, linux-doc, patches, Tony Luck

Resctrl resources operate on subsets of CPUs in the system with the
defining attribute of each subset being an instance of a particular
level of cache. E.g. all CPUs sharing an L3 cache would be part of the
same domain.

In preparation for features that are scoped at the NUMA node level
change the code from explicit references to "cache_level" to a more
generic scope. At this point the only options for this scope are groups
of CPUs that share an L2 cache or L3 cache.

Provide a more detailed warning message if a domain id cannot be found
when adding a CPU. Just check and silently return if the domain id can't
be found when removing a CPU.

Signed-off-by: Tony Luck <tony.luck@intel.com>

---
Changes since last version:

s/-EINVAL/0/ for return value of rdtgroup_cbm_to_size()

 include/linux/resctrl.h                   |  9 +++++--
 arch/x86/kernel/cpu/resctrl/core.c        | 33 ++++++++++++++++++-----
 arch/x86/kernel/cpu/resctrl/pseudo_lock.c |  6 ++++-
 arch/x86/kernel/cpu/resctrl/rdtgroup.c    |  5 +++-
 4 files changed, 43 insertions(+), 10 deletions(-)

diff --git a/include/linux/resctrl.h b/include/linux/resctrl.h
index 8334eeacfec5..618735e396cb 100644
--- a/include/linux/resctrl.h
+++ b/include/linux/resctrl.h
@@ -144,13 +144,18 @@ struct resctrl_membw {
 struct rdt_parse_data;
 struct resctrl_schema;
 
+enum resctrl_scope {
+	RESCTRL_L2_CACHE = 2,
+	RESCTRL_L3_CACHE = 3,
+};
+
 /**
  * struct rdt_resource - attributes of a resctrl resource
  * @rid:		The index of the resource
  * @alloc_capable:	Is allocation available on this machine
  * @mon_capable:	Is monitor feature available on this machine
  * @num_rmid:		Number of RMIDs available
- * @cache_level:	Which cache level defines scope of this resource
+ * @scope:		Scope of this resource
  * @cache:		Cache allocation related data
  * @membw:		If the component has bandwidth controls, their properties.
  * @domains:		All domains for this resource
@@ -168,7 +173,7 @@ struct rdt_resource {
 	bool			alloc_capable;
 	bool			mon_capable;
 	int			num_rmid;
-	int			cache_level;
+	enum resctrl_scope	scope;
 	struct resctrl_cache	cache;
 	struct resctrl_membw	membw;
 	struct list_head	domains;
diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
index 030d3b409768..3b1837e1fb6b 100644
--- a/arch/x86/kernel/cpu/resctrl/core.c
+++ b/arch/x86/kernel/cpu/resctrl/core.c
@@ -65,7 +65,7 @@ struct rdt_hw_resource rdt_resources_all[] = {
 		.r_resctrl = {
 			.rid			= RDT_RESOURCE_L3,
 			.name			= "L3",
-			.cache_level		= 3,
+			.scope			= RESCTRL_L3_CACHE,
 			.domains		= domain_init(RDT_RESOURCE_L3),
 			.parse_ctrlval		= parse_cbm,
 			.format_str		= "%d=%0*x",
@@ -79,7 +79,7 @@ struct rdt_hw_resource rdt_resources_all[] = {
 		.r_resctrl = {
 			.rid			= RDT_RESOURCE_L2,
 			.name			= "L2",
-			.cache_level		= 2,
+			.scope			= RESCTRL_L2_CACHE,
 			.domains		= domain_init(RDT_RESOURCE_L2),
 			.parse_ctrlval		= parse_cbm,
 			.format_str		= "%d=%0*x",
@@ -93,7 +93,7 @@ struct rdt_hw_resource rdt_resources_all[] = {
 		.r_resctrl = {
 			.rid			= RDT_RESOURCE_MBA,
 			.name			= "MB",
-			.cache_level		= 3,
+			.scope			= RESCTRL_L3_CACHE,
 			.domains		= domain_init(RDT_RESOURCE_MBA),
 			.parse_ctrlval		= parse_bw,
 			.format_str		= "%d=%*u",
@@ -105,7 +105,7 @@ struct rdt_hw_resource rdt_resources_all[] = {
 		.r_resctrl = {
 			.rid			= RDT_RESOURCE_SMBA,
 			.name			= "SMBA",
-			.cache_level		= 3,
+			.scope			= RESCTRL_L3_CACHE,
 			.domains		= domain_init(RDT_RESOURCE_SMBA),
 			.parse_ctrlval		= parse_bw,
 			.format_str		= "%d=%*u",
@@ -487,6 +487,19 @@ static int arch_domain_mbm_alloc(u32 num_rmid, struct rdt_hw_domain *hw_dom)
 	return 0;
 }
 
+static int get_domain_id_from_scope(int cpu, enum resctrl_scope scope)
+{
+	switch (scope) {
+	case RESCTRL_L2_CACHE:
+	case RESCTRL_L3_CACHE:
+		return get_cpu_cacheinfo_id(cpu, scope);
+	default:
+		break;
+	}
+
+	return -EINVAL;
+}
+
 /*
  * domain_add_cpu - Add a cpu to a resource's domain list.
  *
@@ -502,12 +515,17 @@ static int arch_domain_mbm_alloc(u32 num_rmid, struct rdt_hw_domain *hw_dom)
  */
 static void domain_add_cpu(int cpu, struct rdt_resource *r)
 {
-	int id = get_cpu_cacheinfo_id(cpu, r->cache_level);
+	int id = get_domain_id_from_scope(cpu, r->scope);
 	struct list_head *add_pos = NULL;
 	struct rdt_hw_domain *hw_dom;
 	struct rdt_domain *d;
 	int err;
 
+	if (id < 0) {
+		pr_warn_once("Can't find domain id for CPU:%d scope:%d for resource %s\n",
+			     cpu, r->scope, r->name);
+		return;
+	}
 	d = rdt_find_domain(r, id, &add_pos);
 	if (IS_ERR(d)) {
 		pr_warn("Couldn't find cache id for CPU %d\n", cpu);
@@ -552,10 +570,13 @@ static void domain_add_cpu(int cpu, struct rdt_resource *r)
 
 static void domain_remove_cpu(int cpu, struct rdt_resource *r)
 {
-	int id = get_cpu_cacheinfo_id(cpu, r->cache_level);
+	int id = get_domain_id_from_scope(cpu, r->scope);
 	struct rdt_hw_domain *hw_dom;
 	struct rdt_domain *d;
 
+	if (id < 0)
+		return;
+
 	d = rdt_find_domain(r, id, NULL);
 	if (IS_ERR_OR_NULL(d)) {
 		pr_warn("Couldn't find cache id for CPU %d\n", cpu);
diff --git a/arch/x86/kernel/cpu/resctrl/pseudo_lock.c b/arch/x86/kernel/cpu/resctrl/pseudo_lock.c
index 8f559eeae08e..8c5f932bc00b 100644
--- a/arch/x86/kernel/cpu/resctrl/pseudo_lock.c
+++ b/arch/x86/kernel/cpu/resctrl/pseudo_lock.c
@@ -292,10 +292,14 @@ static void pseudo_lock_region_clear(struct pseudo_lock_region *plr)
  */
 static int pseudo_lock_region_init(struct pseudo_lock_region *plr)
 {
+	int scope = plr->s->res->scope;
 	struct cpu_cacheinfo *ci;
 	int ret;
 	int i;
 
+	if (WARN_ON_ONCE(scope != RESCTRL_L2_CACHE && scope != RESCTRL_L3_CACHE))
+		return -ENODEV;
+
 	/* Pick the first cpu we find that is associated with the cache. */
 	plr->cpu = cpumask_first(&plr->d->cpu_mask);
 
@@ -311,7 +315,7 @@ static int pseudo_lock_region_init(struct pseudo_lock_region *plr)
 	plr->size = rdtgroup_cbm_to_size(plr->s->res, plr->d, plr->cbm);
 
 	for (i = 0; i < ci->num_leaves; i++) {
-		if (ci->info_list[i].level == plr->s->res->cache_level) {
+		if (ci->info_list[i].level == scope) {
 			plr->line_size = ci->info_list[i].coherency_line_size;
 			return 0;
 		}
diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
index 725344048f85..04c164f6d39d 100644
--- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
+++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
@@ -1345,10 +1345,13 @@ unsigned int rdtgroup_cbm_to_size(struct rdt_resource *r,
 	unsigned int size = 0;
 	int num_b, i;
 
+	if (WARN_ON_ONCE(r->scope != RESCTRL_L2_CACHE && r->scope != RESCTRL_L3_CACHE))
+		return size;
+
 	num_b = bitmap_weight(&cbm, r->cache.cbm_len);
 	ci = get_cpu_cacheinfo(cpumask_any(&d->cpu_mask));
 	for (i = 0; i < ci->num_leaves; i++) {
-		if (ci->info_list[i].level == r->cache_level) {
+		if (ci->info_list[i].level == r->scope) {
 			size = ci->info_list[i].size / r->cache.cbm_len * num_b;
 			break;
 		}
-- 
2.41.0


^ permalink raw reply related	[flat|nested] 331+ messages in thread

* [PATCH v7 2/8] x86/resctrl: Prepare to split rdt_domain structure
  2023-10-03 16:07       ` [PATCH v7 " Tony Luck
  2023-10-03 16:07         ` [PATCH v7 1/8] x86/resctrl: Prepare for new domain scope Tony Luck
@ 2023-10-03 16:07         ` Tony Luck
  2023-10-03 16:07         ` [PATCH v7 3/8] x86/resctrl: Prepare for different scope for control/monitor operations Tony Luck
                           ` (6 subsequent siblings)
  8 siblings, 0 replies; 331+ messages in thread
From: Tony Luck @ 2023-10-03 16:07 UTC (permalink / raw)
  To: Fenghua Yu, Reinette Chatre, Peter Newman, Jonathan Corbet,
	Shuah Khan, x86
  Cc: Shaopeng Tan, James Morse, Jamie Iles, Babu Moger, Randy Dunlap,
	linux-kernel, linux-doc, patches, Tony Luck

The rdt_domain structure is used for both control and monitor features.
It is about to be split into separate structures for these two usages
because the scope for control and monitoring features for a resource
will be different for future resources.

To allow for common code that scans a list of domains looking for a
specific domain id, move the "list" and "id" fields into their own
structure within the rdt_domain structure.

Add a "type" field to the header to be used later so that callers looking
up a domain can be assured that they found one of the expected type.

Signed-off-by: Tony Luck <tony.luck@intel.com>

---
Changes since last version:
Add the "type" field to rdt_domain_hdr.

 include/linux/resctrl.h                   | 21 +++++++++--
 arch/x86/kernel/cpu/resctrl/core.c        | 16 ++++-----
 arch/x86/kernel/cpu/resctrl/ctrlmondata.c | 16 ++++-----
 arch/x86/kernel/cpu/resctrl/monitor.c     |  2 +-
 arch/x86/kernel/cpu/resctrl/pseudo_lock.c |  2 +-
 arch/x86/kernel/cpu/resctrl/rdtgroup.c    | 44 +++++++++++------------
 6 files changed, 58 insertions(+), 43 deletions(-)

diff --git a/include/linux/resctrl.h b/include/linux/resctrl.h
index 618735e396cb..4c8870345ff8 100644
--- a/include/linux/resctrl.h
+++ b/include/linux/resctrl.h
@@ -52,10 +52,26 @@ struct resctrl_staged_config {
 	bool			have_new_ctrl;
 };
 
+enum resctrl_domain_type {
+	RESCTRL_CTRL_DOMAIN,
+	RESCTRL_MON_DOMAIN,
+};
+
 /**
- * struct rdt_domain - group of CPUs sharing a resctrl resource
+ * struct rdt_domain_hdr - common header for different domain types
  * @list:		all instances of this resource
  * @id:			unique id for this instance
+ * @type:		type of this instance
+ */
+struct rdt_domain_hdr {
+	struct list_head		list;
+	int				id;
+	enum resctrl_domain_type	type;
+};
+
+/**
+ * struct rdt_domain - group of CPUs sharing a resctrl resource
+ * @hdr:		common header for different domain types
  * @cpu_mask:		which CPUs share this resource
  * @rmid_busy_llc:	bitmap of which limbo RMIDs are above threshold
  * @mbm_total:		saved state for MBM total bandwidth
@@ -71,8 +87,7 @@ struct resctrl_staged_config {
  *			by closid
  */
 struct rdt_domain {
-	struct list_head		list;
-	int				id;
+	struct rdt_domain_hdr		hdr;
 	struct cpumask			cpu_mask;
 	unsigned long			*rmid_busy_llc;
 	struct mbm_state		*mbm_total;
diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
index 3b1837e1fb6b..05369add4578 100644
--- a/arch/x86/kernel/cpu/resctrl/core.c
+++ b/arch/x86/kernel/cpu/resctrl/core.c
@@ -352,7 +352,7 @@ struct rdt_domain *get_domain_from_cpu(int cpu, struct rdt_resource *r)
 {
 	struct rdt_domain *d;
 
-	list_for_each_entry(d, &r->domains, list) {
+	list_for_each_entry(d, &r->domains, hdr.list) {
 		/* Find the domain that contains this CPU */
 		if (cpumask_test_cpu(cpu, &d->cpu_mask))
 			return d;
@@ -401,12 +401,12 @@ struct rdt_domain *rdt_find_domain(struct rdt_resource *r, int id,
 		return ERR_PTR(-ENODEV);
 
 	list_for_each(l, &r->domains) {
-		d = list_entry(l, struct rdt_domain, list);
+		d = list_entry(l, struct rdt_domain, hdr.list);
 		/* When id is found, return its domain. */
-		if (id == d->id)
+		if (id == d->hdr.id)
 			return d;
 		/* Stop searching when finding id's position in sorted list. */
-		if (id < d->id)
+		if (id < d->hdr.id)
 			break;
 	}
 
@@ -544,7 +544,7 @@ static void domain_add_cpu(int cpu, struct rdt_resource *r)
 		return;
 
 	d = &hw_dom->d_resctrl;
-	d->id = id;
+	d->hdr.id = id;
 	cpumask_set_cpu(cpu, &d->cpu_mask);
 
 	rdt_domain_reconfigure_cdp(r);
@@ -559,11 +559,11 @@ static void domain_add_cpu(int cpu, struct rdt_resource *r)
 		return;
 	}
 
-	list_add_tail(&d->list, add_pos);
+	list_add_tail(&d->hdr.list, add_pos);
 
 	err = resctrl_online_domain(r, d);
 	if (err) {
-		list_del(&d->list);
+		list_del(&d->hdr.list);
 		domain_free(hw_dom);
 	}
 }
@@ -587,7 +587,7 @@ static void domain_remove_cpu(int cpu, struct rdt_resource *r)
 	cpumask_clear_cpu(cpu, &d->cpu_mask);
 	if (cpumask_empty(&d->cpu_mask)) {
 		resctrl_offline_domain(r, d);
-		list_del(&d->list);
+		list_del(&d->hdr.list);
 
 		/*
 		 * rdt_domain "d" is going to be freed below, so clear
diff --git a/arch/x86/kernel/cpu/resctrl/ctrlmondata.c b/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
index b44c487727d4..8bce591a1018 100644
--- a/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
+++ b/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
@@ -67,7 +67,7 @@ int parse_bw(struct rdt_parse_data *data, struct resctrl_schema *s,
 
 	cfg = &d->staged_config[s->conf_type];
 	if (cfg->have_new_ctrl) {
-		rdt_last_cmd_printf("Duplicate domain %d\n", d->id);
+		rdt_last_cmd_printf("Duplicate domain %d\n", d->hdr.id);
 		return -EINVAL;
 	}
 
@@ -144,7 +144,7 @@ int parse_cbm(struct rdt_parse_data *data, struct resctrl_schema *s,
 
 	cfg = &d->staged_config[s->conf_type];
 	if (cfg->have_new_ctrl) {
-		rdt_last_cmd_printf("Duplicate domain %d\n", d->id);
+		rdt_last_cmd_printf("Duplicate domain %d\n", d->hdr.id);
 		return -EINVAL;
 	}
 
@@ -224,8 +224,8 @@ static int parse_line(char *line, struct resctrl_schema *s,
 		return -EINVAL;
 	}
 	dom = strim(dom);
-	list_for_each_entry(d, &r->domains, list) {
-		if (d->id == dom_id) {
+	list_for_each_entry(d, &r->domains, hdr.list) {
+		if (d->hdr.id == dom_id) {
 			data.buf = dom;
 			data.rdtgrp = rdtgrp;
 			if (r->parse_ctrlval(&data, s, d))
@@ -316,7 +316,7 @@ int resctrl_arch_update_domains(struct rdt_resource *r, u32 closid)
 		return -ENOMEM;
 
 	msr_param.res = NULL;
-	list_for_each_entry(d, &r->domains, list) {
+	list_for_each_entry(d, &r->domains, hdr.list) {
 		hw_dom = resctrl_to_arch_dom(d);
 		for (t = 0; t < CDP_NUM_TYPES; t++) {
 			cfg = &hw_dom->d_resctrl.staged_config[t];
@@ -464,7 +464,7 @@ static void show_doms(struct seq_file *s, struct resctrl_schema *schema, int clo
 	u32 ctrl_val;
 
 	seq_printf(s, "%*s:", max_name_width, schema->name);
-	list_for_each_entry(dom, &r->domains, list) {
+	list_for_each_entry(dom, &r->domains, hdr.list) {
 		if (sep)
 			seq_puts(s, ";");
 
@@ -474,7 +474,7 @@ static void show_doms(struct seq_file *s, struct resctrl_schema *schema, int clo
 			ctrl_val = resctrl_arch_get_config(r, dom, closid,
 							   schema->conf_type);
 
-		seq_printf(s, r->format_str, dom->id, max_data_width,
+		seq_printf(s, r->format_str, dom->hdr.id, max_data_width,
 			   ctrl_val);
 		sep = true;
 	}
@@ -503,7 +503,7 @@ int rdtgroup_schemata_show(struct kernfs_open_file *of,
 			} else {
 				seq_printf(s, "%s:%d=%x\n",
 					   rdtgrp->plr->s->res->name,
-					   rdtgrp->plr->d->id,
+					   rdtgrp->plr->d->hdr.id,
 					   rdtgrp->plr->cbm);
 			}
 		} else {
diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c
index ded1fc7cb7cb..27cda5988d7f 100644
--- a/arch/x86/kernel/cpu/resctrl/monitor.c
+++ b/arch/x86/kernel/cpu/resctrl/monitor.c
@@ -340,7 +340,7 @@ static void add_rmid_to_limbo(struct rmid_entry *entry)
 
 	entry->busy = 0;
 	cpu = get_cpu();
-	list_for_each_entry(d, &r->domains, list) {
+	list_for_each_entry(d, &r->domains, hdr.list) {
 		if (cpumask_test_cpu(cpu, &d->cpu_mask)) {
 			err = resctrl_arch_rmid_read(r, d, entry->rmid,
 						     QOS_L3_OCCUP_EVENT_ID,
diff --git a/arch/x86/kernel/cpu/resctrl/pseudo_lock.c b/arch/x86/kernel/cpu/resctrl/pseudo_lock.c
index 8c5f932bc00b..18b6183a1b48 100644
--- a/arch/x86/kernel/cpu/resctrl/pseudo_lock.c
+++ b/arch/x86/kernel/cpu/resctrl/pseudo_lock.c
@@ -856,7 +856,7 @@ bool rdtgroup_pseudo_locked_in_hierarchy(struct rdt_domain *d)
 	 * associated with them.
 	 */
 	for_each_alloc_capable_rdt_resource(r) {
-		list_for_each_entry(d_i, &r->domains, list) {
+		list_for_each_entry(d_i, &r->domains, hdr.list) {
 			if (d_i->plr)
 				cpumask_or(cpu_with_psl, cpu_with_psl,
 					   &d_i->cpu_mask);
diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
index 04c164f6d39d..739e56cfb31f 100644
--- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
+++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
@@ -86,7 +86,7 @@ void rdt_staged_configs_clear(void)
 	lockdep_assert_held(&rdtgroup_mutex);
 
 	for_each_alloc_capable_rdt_resource(r) {
-		list_for_each_entry(dom, &r->domains, list)
+		list_for_each_entry(dom, &r->domains, hdr.list)
 			memset(dom->staged_config, 0, sizeof(dom->staged_config));
 	}
 }
@@ -928,12 +928,12 @@ static int rdt_bit_usage_show(struct kernfs_open_file *of,
 
 	mutex_lock(&rdtgroup_mutex);
 	hw_shareable = r->cache.shareable_bits;
-	list_for_each_entry(dom, &r->domains, list) {
+	list_for_each_entry(dom, &r->domains, hdr.list) {
 		if (sep)
 			seq_putc(seq, ';');
 		sw_shareable = 0;
 		exclusive = 0;
-		seq_printf(seq, "%d=", dom->id);
+		seq_printf(seq, "%d=", dom->hdr.id);
 		for (i = 0; i < closids_supported(); i++) {
 			if (!closid_allocated(i))
 				continue;
@@ -1233,7 +1233,7 @@ static bool rdtgroup_mode_test_exclusive(struct rdtgroup *rdtgrp)
 		if (r->rid == RDT_RESOURCE_MBA || r->rid == RDT_RESOURCE_SMBA)
 			continue;
 		has_cache = true;
-		list_for_each_entry(d, &r->domains, list) {
+		list_for_each_entry(d, &r->domains, hdr.list) {
 			ctrl = resctrl_arch_get_config(r, d, closid,
 						       s->conf_type);
 			if (rdtgroup_cbm_overlaps(s, d, ctrl, closid, false)) {
@@ -1398,7 +1398,7 @@ static int rdtgroup_size_show(struct kernfs_open_file *of,
 			size = rdtgroup_cbm_to_size(rdtgrp->plr->s->res,
 						    rdtgrp->plr->d,
 						    rdtgrp->plr->cbm);
-			seq_printf(s, "%d=%u\n", rdtgrp->plr->d->id, size);
+			seq_printf(s, "%d=%u\n", rdtgrp->plr->d->hdr.id, size);
 		}
 		goto out;
 	}
@@ -1410,7 +1410,7 @@ static int rdtgroup_size_show(struct kernfs_open_file *of,
 		type = schema->conf_type;
 		sep = false;
 		seq_printf(s, "%*s:", max_name_width, schema->name);
-		list_for_each_entry(d, &r->domains, list) {
+		list_for_each_entry(d, &r->domains, hdr.list) {
 			if (sep)
 				seq_putc(s, ';');
 			if (rdtgrp->mode == RDT_MODE_PSEUDO_LOCKSETUP) {
@@ -1428,7 +1428,7 @@ static int rdtgroup_size_show(struct kernfs_open_file *of,
 				else
 					size = rdtgroup_cbm_to_size(r, d, ctrl);
 			}
-			seq_printf(s, "%d=%u", d->id, size);
+			seq_printf(s, "%d=%u", d->hdr.id, size);
 			sep = true;
 		}
 		seq_putc(s, '\n');
@@ -1499,7 +1499,7 @@ static int mbm_config_show(struct seq_file *s, struct rdt_resource *r, u32 evtid
 
 	mutex_lock(&rdtgroup_mutex);
 
-	list_for_each_entry(dom, &r->domains, list) {
+	list_for_each_entry(dom, &r->domains, hdr.list) {
 		if (sep)
 			seq_puts(s, ";");
 
@@ -1507,7 +1507,7 @@ static int mbm_config_show(struct seq_file *s, struct rdt_resource *r, u32 evtid
 		mon_info.evtid = evtid;
 		mondata_config_read(dom, &mon_info);
 
-		seq_printf(s, "%d=0x%02x", dom->id, mon_info.mon_config);
+		seq_printf(s, "%d=0x%02x", dom->hdr.id, mon_info.mon_config);
 		sep = true;
 	}
 	seq_puts(s, "\n");
@@ -1622,8 +1622,8 @@ static int mon_config_write(struct rdt_resource *r, char *tok, u32 evtid)
 		return -EINVAL;
 	}
 
-	list_for_each_entry(d, &r->domains, list) {
-		if (d->id == dom_id) {
+	list_for_each_entry(d, &r->domains, hdr.list) {
+		if (d->hdr.id == dom_id) {
 			ret = mbm_config_write_domain(r, d, evtid, val);
 			if (ret)
 				return -EINVAL;
@@ -2141,7 +2141,7 @@ static int set_cache_qos_cfg(int level, bool enable)
 		return -ENOMEM;
 
 	r_l = &rdt_resources_all[level].r_resctrl;
-	list_for_each_entry(d, &r_l->domains, list) {
+	list_for_each_entry(d, &r_l->domains, hdr.list) {
 		if (r_l->cache.arch_has_per_cpu_cfg)
 			/* Pick all the CPUs in the domain instance */
 			for_each_cpu(cpu, &d->cpu_mask)
@@ -2226,7 +2226,7 @@ static int set_mba_sc(bool mba_sc)
 
 	r->membw.mba_sc = mba_sc;
 
-	list_for_each_entry(d, &r->domains, list) {
+	list_for_each_entry(d, &r->domains, hdr.list) {
 		for (i = 0; i < num_closid; i++)
 			d->mbps_val[i] = MBA_MAX_MBPS;
 	}
@@ -2528,7 +2528,7 @@ static int rdt_get_tree(struct fs_context *fc)
 
 	if (is_mbm_enabled()) {
 		r = &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl;
-		list_for_each_entry(dom, &r->domains, list)
+		list_for_each_entry(dom, &r->domains, hdr.list)
 			mbm_setup_overflow_handler(dom, MBM_OVERFLOW_INTERVAL);
 	}
 
@@ -2652,7 +2652,7 @@ static int reset_all_ctrls(struct rdt_resource *r)
 	 * CBMs in all domains to the maximum mask value. Pick one CPU
 	 * from each domain to update the MSRs below.
 	 */
-	list_for_each_entry(d, &r->domains, list) {
+	list_for_each_entry(d, &r->domains, hdr.list) {
 		hw_dom = resctrl_to_arch_dom(d);
 		cpumask_set_cpu(cpumask_any(&d->cpu_mask), cpu_mask);
 
@@ -2858,7 +2858,7 @@ static int mkdir_mondata_subdir(struct kernfs_node *parent_kn,
 	char name[32];
 	int ret;
 
-	sprintf(name, "mon_%s_%02d", r->name, d->id);
+	sprintf(name, "mon_%s_%02d", r->name, d->hdr.id);
 	/* create the directory */
 	kn = kernfs_create_dir(parent_kn, name, parent_kn->mode, prgrp);
 	if (IS_ERR(kn))
@@ -2874,7 +2874,7 @@ static int mkdir_mondata_subdir(struct kernfs_node *parent_kn,
 	}
 
 	priv.u.rid = r->rid;
-	priv.u.domid = d->id;
+	priv.u.domid = d->hdr.id;
 	list_for_each_entry(mevt, &r->evt_list, list) {
 		priv.u.evtid = mevt->evtid;
 		ret = mon_addfile(kn, mevt->name, priv.priv);
@@ -2922,7 +2922,7 @@ static int mkdir_mondata_subdir_alldom(struct kernfs_node *parent_kn,
 	struct rdt_domain *dom;
 	int ret;
 
-	list_for_each_entry(dom, &r->domains, list) {
+	list_for_each_entry(dom, &r->domains, hdr.list) {
 		ret = mkdir_mondata_subdir(parent_kn, dom, r, prgrp);
 		if (ret)
 			return ret;
@@ -3081,7 +3081,7 @@ static int __init_one_rdt_domain(struct rdt_domain *d, struct resctrl_schema *s,
 	 */
 	tmp_cbm = cfg->new_ctrl;
 	if (bitmap_weight(&tmp_cbm, r->cache.cbm_len) < r->cache.min_cbm_bits) {
-		rdt_last_cmd_printf("No space on %s:%d\n", s->name, d->id);
+		rdt_last_cmd_printf("No space on %s:%d\n", s->name, d->hdr.id);
 		return -ENOSPC;
 	}
 	cfg->have_new_ctrl = true;
@@ -3104,7 +3104,7 @@ static int rdtgroup_init_cat(struct resctrl_schema *s, u32 closid)
 	struct rdt_domain *d;
 	int ret;
 
-	list_for_each_entry(d, &s->res->domains, list) {
+	list_for_each_entry(d, &s->res->domains, hdr.list) {
 		ret = __init_one_rdt_domain(d, s, closid);
 		if (ret < 0)
 			return ret;
@@ -3119,7 +3119,7 @@ static void rdtgroup_init_mba(struct rdt_resource *r, u32 closid)
 	struct resctrl_staged_config *cfg;
 	struct rdt_domain *d;
 
-	list_for_each_entry(d, &r->domains, list) {
+	list_for_each_entry(d, &r->domains, hdr.list) {
 		if (is_mba_sc(r)) {
 			d->mbps_val[closid] = MBA_MAX_MBPS;
 			continue;
@@ -3726,7 +3726,7 @@ void resctrl_offline_domain(struct rdt_resource *r, struct rdt_domain *d)
 	 * per domain monitor data directories.
 	 */
 	if (static_branch_unlikely(&rdt_mon_enable_key))
-		rmdir_mondata_subdir_allrdtgrp(r, d->id);
+		rmdir_mondata_subdir_allrdtgrp(r, d->hdr.id);
 
 	if (is_mbm_enabled())
 		cancel_delayed_work(&d->mbm_over);
-- 
2.41.0


^ permalink raw reply related	[flat|nested] 331+ messages in thread

* [PATCH v7 3/8] x86/resctrl: Prepare for different scope for control/monitor operations
  2023-10-03 16:07       ` [PATCH v7 " Tony Luck
  2023-10-03 16:07         ` [PATCH v7 1/8] x86/resctrl: Prepare for new domain scope Tony Luck
  2023-10-03 16:07         ` [PATCH v7 2/8] x86/resctrl: Prepare to split rdt_domain structure Tony Luck
@ 2023-10-03 16:07         ` Tony Luck
  2023-10-03 16:07         ` [PATCH v7 4/8] x86/resctrl: Split the rdt_domain and rdt_hw_domain structures Tony Luck
                           ` (5 subsequent siblings)
  8 siblings, 0 replies; 331+ messages in thread
From: Tony Luck @ 2023-10-03 16:07 UTC (permalink / raw)
  To: Fenghua Yu, Reinette Chatre, Peter Newman, Jonathan Corbet,
	Shuah Khan, x86
  Cc: Shaopeng Tan, James Morse, Jamie Iles, Babu Moger, Randy Dunlap,
	linux-kernel, linux-doc, patches, Tony Luck

Resctrl assumes that control and monitor operations on a resource are
performed at the same scope.

Prepare for systems that use different scopes for the two (specifically
L3 scope for cache control and NODE scope for cache occupancy and memory
bandwidth monitoring).

Create separate domain lists for control and monitor operations.

Note that errors during initialization of either control or monitor
functions on a domain would previously result in that domain being
excluded from both control and monitor operations. Now that the domains
are allocated independently, it is no longer necessary to disable both
control and monitor operations if either fails.
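
In outline, the resulting shape is the one sketched below. This is a
simplified excerpt of the diff that follows, not a standalone build
unit; all names are taken from the patch. Each resource now carries a
scope and a domain list per function, and the CPU hotplug path
dispatches to per-function helpers so that a failure on one list no
longer affects the other:

struct rdt_resource {
	...
	enum resctrl_scope	ctrl_scope;	/* scope for control functions */
	enum resctrl_scope	mon_scope;	/* scope for monitor functions */
	struct list_head	ctrl_domains;	/* domains for control functions */
	struct list_head	mon_domains;	/* domains for monitor functions */
	...
};

static void domain_add_cpu(int cpu, struct rdt_resource *r)
{
	if (r->alloc_capable)
		domain_add_cpu_ctrl(cpu, r);
	if (r->mon_capable)
		domain_add_cpu_mon(cpu, r);
}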

Signed-off-by: Tony Luck <tony.luck@intel.com>

---

Changes since last version:

Initialize the "type" in rdt_domain_hdr when creating domains.
Check that the type has the expected value before using container_of()
to get to the surrounding structure.

Rename "hw_mondom" to "hw_dom" in domain_add_cpu_mon() and
in domain_remove_cpu_mon().

Add lockdep_assert_held(&rdtgroup_mutex) to resctrl_offline_mon_domain()
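
The type check mentioned above looks like this in the new
domain_add_cpu_ctrl() path. This is a simplified excerpt of the diff
below, shown only for orientation; the lookup now operates on a plain
list head and returns the embedded header, and callers validate the
header type before recovering the surrounding structure:

	hdr = rdt_find_domain(&r->ctrl_domains, id, &add_pos);
	if (IS_ERR(hdr)) {
		pr_warn("Couldn't find control scope id=%d for CPU %d\n", id, cpu);
		return;
	}

	if (WARN_ON_ONCE(hdr->type != RESCTRL_CTRL_DOMAIN))
		return;

	d = container_of(hdr, struct rdt_domain, hdr);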

 include/linux/resctrl.h                   |  18 +-
 arch/x86/kernel/cpu/resctrl/internal.h    |   4 +-
 arch/x86/kernel/cpu/resctrl/core.c        | 214 +++++++++++++++++-----
 arch/x86/kernel/cpu/resctrl/ctrlmondata.c |  18 +-
 arch/x86/kernel/cpu/resctrl/monitor.c     |   2 +-
 arch/x86/kernel/cpu/resctrl/pseudo_lock.c |   4 +-
 arch/x86/kernel/cpu/resctrl/rdtgroup.c    |  54 +++---
 7 files changed, 224 insertions(+), 90 deletions(-)

diff --git a/include/linux/resctrl.h b/include/linux/resctrl.h
index 4c8870345ff8..f47b01ae28ca 100644
--- a/include/linux/resctrl.h
+++ b/include/linux/resctrl.h
@@ -170,10 +170,12 @@ enum resctrl_scope {
  * @alloc_capable:	Is allocation available on this machine
  * @mon_capable:	Is monitor feature available on this machine
  * @num_rmid:		Number of RMIDs available
- * @scope:		Scope of this resource
+ * @ctrl_scope:		Scope of this resource for control functions
+ * @mon_scope:		Scope of this resource for monitor functions
  * @cache:		Cache allocation related data
  * @membw:		If the component has bandwidth controls, their properties.
- * @domains:		All domains for this resource
+ * @ctrl_domains:	Control domains for this resource
+ * @mon_domains:	Monitor domains for this resource
  * @name:		Name to use in "schemata" file.
  * @data_width:		Character width of data when displaying
  * @default_ctrl:	Specifies default cache cbm or memory B/W percent.
@@ -188,10 +190,12 @@ struct rdt_resource {
 	bool			alloc_capable;
 	bool			mon_capable;
 	int			num_rmid;
-	enum resctrl_scope	scope;
+	enum resctrl_scope	ctrl_scope;
+	enum resctrl_scope	mon_scope;
 	struct resctrl_cache	cache;
 	struct resctrl_membw	membw;
-	struct list_head	domains;
+	struct list_head	ctrl_domains;
+	struct list_head	mon_domains;
 	char			*name;
 	int			data_width;
 	u32			default_ctrl;
@@ -237,8 +241,10 @@ int resctrl_arch_update_one(struct rdt_resource *r, struct rdt_domain *d,
 
 u32 resctrl_arch_get_config(struct rdt_resource *r, struct rdt_domain *d,
 			    u32 closid, enum resctrl_conf_type type);
-int resctrl_online_domain(struct rdt_resource *r, struct rdt_domain *d);
-void resctrl_offline_domain(struct rdt_resource *r, struct rdt_domain *d);
+int resctrl_online_ctrl_domain(struct rdt_resource *r, struct rdt_domain *d);
+int resctrl_online_mon_domain(struct rdt_resource *r, struct rdt_domain *d);
+void resctrl_offline_ctrl_domain(struct rdt_resource *r, struct rdt_domain *d);
+void resctrl_offline_mon_domain(struct rdt_resource *r, struct rdt_domain *d);
 
 /**
  * resctrl_arch_rmid_read() - Read the eventid counter corresponding to rmid
diff --git a/arch/x86/kernel/cpu/resctrl/internal.h b/arch/x86/kernel/cpu/resctrl/internal.h
index 85ceaf9a31ac..e9a2a8993d14 100644
--- a/arch/x86/kernel/cpu/resctrl/internal.h
+++ b/arch/x86/kernel/cpu/resctrl/internal.h
@@ -511,8 +511,8 @@ void rdtgroup_kn_unlock(struct kernfs_node *kn);
 int rdtgroup_kn_mode_restrict(struct rdtgroup *r, const char *name);
 int rdtgroup_kn_mode_restore(struct rdtgroup *r, const char *name,
 			     umode_t mask);
-struct rdt_domain *rdt_find_domain(struct rdt_resource *r, int id,
-				   struct list_head **pos);
+struct rdt_domain_hdr *rdt_find_domain(struct list_head *h, int id,
+				       struct list_head **pos);
 ssize_t rdtgroup_schemata_write(struct kernfs_open_file *of,
 				char *buf, size_t nbytes, loff_t off);
 int rdtgroup_schemata_show(struct kernfs_open_file *of,
diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
index 05369add4578..ae2b66d38afc 100644
--- a/arch/x86/kernel/cpu/resctrl/core.c
+++ b/arch/x86/kernel/cpu/resctrl/core.c
@@ -57,7 +57,8 @@ static void
 mba_wrmsr_amd(struct rdt_domain *d, struct msr_param *m,
 	      struct rdt_resource *r);
 
-#define domain_init(id) LIST_HEAD_INIT(rdt_resources_all[id].r_resctrl.domains)
+#define ctrl_domain_init(id) LIST_HEAD_INIT(rdt_resources_all[id].r_resctrl.ctrl_domains)
+#define mon_domain_init(id) LIST_HEAD_INIT(rdt_resources_all[id].r_resctrl.mon_domains)
 
 struct rdt_hw_resource rdt_resources_all[] = {
 	[RDT_RESOURCE_L3] =
@@ -65,8 +66,10 @@ struct rdt_hw_resource rdt_resources_all[] = {
 		.r_resctrl = {
 			.rid			= RDT_RESOURCE_L3,
 			.name			= "L3",
-			.scope			= RESCTRL_L3_CACHE,
-			.domains		= domain_init(RDT_RESOURCE_L3),
+			.ctrl_scope		= RESCTRL_L3_CACHE,
+			.mon_scope		= RESCTRL_L3_CACHE,
+			.ctrl_domains		= ctrl_domain_init(RDT_RESOURCE_L3),
+			.mon_domains		= mon_domain_init(RDT_RESOURCE_L3),
 			.parse_ctrlval		= parse_cbm,
 			.format_str		= "%d=%0*x",
 			.fflags			= RFTYPE_RES_CACHE,
@@ -79,8 +82,8 @@ struct rdt_hw_resource rdt_resources_all[] = {
 		.r_resctrl = {
 			.rid			= RDT_RESOURCE_L2,
 			.name			= "L2",
-			.scope			= RESCTRL_L2_CACHE,
-			.domains		= domain_init(RDT_RESOURCE_L2),
+			.ctrl_scope		= RESCTRL_L2_CACHE,
+			.ctrl_domains		= ctrl_domain_init(RDT_RESOURCE_L2),
 			.parse_ctrlval		= parse_cbm,
 			.format_str		= "%d=%0*x",
 			.fflags			= RFTYPE_RES_CACHE,
@@ -93,8 +96,8 @@ struct rdt_hw_resource rdt_resources_all[] = {
 		.r_resctrl = {
 			.rid			= RDT_RESOURCE_MBA,
 			.name			= "MB",
-			.scope			= RESCTRL_L3_CACHE,
-			.domains		= domain_init(RDT_RESOURCE_MBA),
+			.ctrl_scope		= RESCTRL_L3_CACHE,
+			.ctrl_domains		= ctrl_domain_init(RDT_RESOURCE_MBA),
 			.parse_ctrlval		= parse_bw,
 			.format_str		= "%d=%*u",
 			.fflags			= RFTYPE_RES_MB,
@@ -105,8 +108,8 @@ struct rdt_hw_resource rdt_resources_all[] = {
 		.r_resctrl = {
 			.rid			= RDT_RESOURCE_SMBA,
 			.name			= "SMBA",
-			.scope			= RESCTRL_L3_CACHE,
-			.domains		= domain_init(RDT_RESOURCE_SMBA),
+			.ctrl_scope		= RESCTRL_L3_CACHE,
+			.ctrl_domains		= ctrl_domain_init(RDT_RESOURCE_SMBA),
 			.parse_ctrlval		= parse_bw,
 			.format_str		= "%d=%*u",
 			.fflags			= RFTYPE_RES_MB,
@@ -352,7 +355,7 @@ struct rdt_domain *get_domain_from_cpu(int cpu, struct rdt_resource *r)
 {
 	struct rdt_domain *d;
 
-	list_for_each_entry(d, &r->domains, hdr.list) {
+	list_for_each_entry(d, &r->ctrl_domains, hdr.list) {
 		/* Find the domain that contains this CPU */
 		if (cpumask_test_cpu(cpu, &d->cpu_mask))
 			return d;
@@ -384,29 +387,39 @@ void rdt_ctrl_update(void *arg)
 }
 
 /*
- * rdt_find_domain - Find a domain in a resource that matches input resource id
+ * rdt_find_domain - Find a domain in one of a resource's domain lists.
  *
- * Search resource r's domain list to find the resource id. If the resource
- * id is found in a domain, return the domain. Otherwise, if requested by
- * caller, return the first domain whose id is bigger than the input id.
+ * Search the list to find the resource id. If the resource id is found
+ * in a domain, return the domain. Otherwise, if requested by caller,
+ * return the first domain whose id is bigger than the input id.
  * The domain list is sorted by id in ascending order.
+ *
+ * If an existing domain in the resource r's domain list matches the cpu's
+ * resource id, add the cpu in the domain.
+ *
+ * Otherwise, caller will allocate a new domain and insert into the right position
+ * in the domain list sorted by id in ascending order.
+ *
+ * The order in the domain list is visible to users when we print entries
+ * in the schemata file and schemata input is validated to have the same order
+ * as this list.
  */
-struct rdt_domain *rdt_find_domain(struct rdt_resource *r, int id,
-				   struct list_head **pos)
+struct rdt_domain_hdr *rdt_find_domain(struct list_head *h, int id,
+				       struct list_head **pos)
 {
-	struct rdt_domain *d;
+	struct rdt_domain_hdr *d;
 	struct list_head *l;
 
 	if (id < 0)
 		return ERR_PTR(-ENODEV);
 
-	list_for_each(l, &r->domains) {
-		d = list_entry(l, struct rdt_domain, hdr.list);
+	list_for_each(l, h) {
+		d = list_entry(l, struct rdt_domain_hdr, list);
 		/* When id is found, return its domain. */
-		if (id == d->hdr.id)
+		if (id == d->id)
 			return d;
 		/* Stop searching when finding id's position in sorted list. */
-		if (id < d->hdr.id)
+		if (id < d->id)
 			break;
 	}
 
@@ -500,38 +513,32 @@ static int get_domain_id_from_scope(int cpu, enum resctrl_scope scope)
 	return -EINVAL;
 }
 
-/*
- * domain_add_cpu - Add a cpu to a resource's domain list.
- *
- * If an existing domain in the resource r's domain list matches the cpu's
- * resource id, add the cpu in the domain.
- *
- * Otherwise, a new domain is allocated and inserted into the right position
- * in the domain list sorted by id in ascending order.
- *
- * The order in the domain list is visible to users when we print entries
- * in the schemata file and schemata input is validated to have the same order
- * as this list.
- */
-static void domain_add_cpu(int cpu, struct rdt_resource *r)
+static void domain_add_cpu_ctrl(int cpu, struct rdt_resource *r)
 {
-	int id = get_domain_id_from_scope(cpu, r->scope);
+	int id = get_domain_id_from_scope(cpu, r->ctrl_scope);
 	struct list_head *add_pos = NULL;
 	struct rdt_hw_domain *hw_dom;
+	struct rdt_domain_hdr *hdr;
 	struct rdt_domain *d;
 	int err;
 
 	if (id < 0) {
-		pr_warn_once("Can't find domain id for CPU:%d scope:%d for resource %s\n",
-			     cpu, r->scope, r->name);
+		pr_warn_once("Can't find control domain id for CPU:%d scope:%d for resource %s\n",
+			     cpu, r->ctrl_scope, r->name);
 		return;
 	}
-	d = rdt_find_domain(r, id, &add_pos);
-	if (IS_ERR(d)) {
-		pr_warn("Couldn't find cache id for CPU %d\n", cpu);
+
+	hdr = rdt_find_domain(&r->ctrl_domains, id, &add_pos);
+	if (IS_ERR(hdr)) {
+		pr_warn("Couldn't find control scope id=%d for CPU %d\n", id, cpu);
 		return;
 	}
 
+	if (WARN_ON_ONCE(hdr->type != RESCTRL_CTRL_DOMAIN))
+		return;
+
+	d = container_of(hdr, struct rdt_domain, hdr);
+
 	if (d) {
 		cpumask_set_cpu(cpu, &d->cpu_mask);
 		if (r->cache.arch_has_per_cpu_cfg)
@@ -545,48 +552,115 @@ static void domain_add_cpu(int cpu, struct rdt_resource *r)
 
 	d = &hw_dom->d_resctrl;
 	d->hdr.id = id;
+	d->hdr.type = RESCTRL_CTRL_DOMAIN;
 	cpumask_set_cpu(cpu, &d->cpu_mask);
 
 	rdt_domain_reconfigure_cdp(r);
 
-	if (r->alloc_capable && domain_setup_ctrlval(r, d)) {
+	if (domain_setup_ctrlval(r, d)) {
 		domain_free(hw_dom);
 		return;
 	}
 
-	if (r->mon_capable && arch_domain_mbm_alloc(r->num_rmid, hw_dom)) {
+	list_add_tail(&d->hdr.list, add_pos);
+
+	err = resctrl_online_ctrl_domain(r, d);
+	if (err) {
+		list_del(&d->hdr.list);
+		domain_free(hw_dom);
+	}
+}
+
+static void domain_add_cpu_mon(int cpu, struct rdt_resource *r)
+{
+	int id = get_domain_id_from_scope(cpu, r->mon_scope);
+	struct list_head *add_pos = NULL;
+	struct rdt_hw_domain *hw_dom;
+	struct rdt_domain_hdr *hdr;
+	struct rdt_domain *d;
+	int err;
+
+	if (id < 0) {
+		pr_warn_once("Can't find monitor domain id for CPU:%d scope:%d for resource %s\n",
+			     cpu, r->mon_scope, r->name);
+		return;
+	}
+
+	hdr = rdt_find_domain(&r->mon_domains, id, &add_pos);
+	if (IS_ERR(hdr)) {
+		pr_warn("Couldn't find monitor scope id=%d for CPU %d\n", id, cpu);
+		return;
+	}
+
+	if (WARN_ON_ONCE(hdr->type != RESCTRL_MON_DOMAIN))
+		return;
+
+	d = container_of(hdr, struct rdt_domain, hdr);
+
+	if (d) {
+		cpumask_set_cpu(cpu, &d->cpu_mask);
+		return;
+	}
+
+	hw_dom = kzalloc_node(sizeof(*hw_dom), GFP_KERNEL, cpu_to_node(cpu));
+	if (!hw_dom)
+		return;
+
+	d = &hw_dom->d_resctrl;
+	d->hdr.id = id;
+	d->hdr.type = RESCTRL_MON_DOMAIN;
+	cpumask_set_cpu(cpu, &d->cpu_mask);
+
+	if (arch_domain_mbm_alloc(r->num_rmid, hw_dom)) {
 		domain_free(hw_dom);
 		return;
 	}
 
 	list_add_tail(&d->hdr.list, add_pos);
 
-	err = resctrl_online_domain(r, d);
+	err = resctrl_online_mon_domain(r, d);
 	if (err) {
 		list_del(&d->hdr.list);
 		domain_free(hw_dom);
 	}
 }
 
-static void domain_remove_cpu(int cpu, struct rdt_resource *r)
+/*
+ * domain_add_cpu - Add a cpu to either/both resource's domain lists.
+ */
+static void domain_add_cpu(int cpu, struct rdt_resource *r)
+{
+	if (r->alloc_capable)
+		domain_add_cpu_ctrl(cpu, r);
+	if (r->mon_capable)
+		domain_add_cpu_mon(cpu, r);
+}
+
+static void domain_remove_cpu_ctrl(int cpu, struct rdt_resource *r)
 {
-	int id = get_domain_id_from_scope(cpu, r->scope);
+	int id = get_domain_id_from_scope(cpu, r->ctrl_scope);
 	struct rdt_hw_domain *hw_dom;
+	struct rdt_domain_hdr *hdr;
 	struct rdt_domain *d;
 
 	if (id < 0)
 		return;
 
-	d = rdt_find_domain(r, id, NULL);
-	if (IS_ERR_OR_NULL(d)) {
-		pr_warn("Couldn't find cache id for CPU %d\n", cpu);
+	hdr = rdt_find_domain(&r->ctrl_domains, id, NULL);
+	if (IS_ERR_OR_NULL(hdr)) {
+		pr_warn("Couldn't find control scope id=%d for CPU %d\n", id, cpu);
 		return;
 	}
+
+	if (WARN_ON_ONCE(hdr->type != RESCTRL_CTRL_DOMAIN))
+		return;
+
+	d = container_of(hdr, struct rdt_domain, hdr);
 	hw_dom = resctrl_to_arch_dom(d);
 
 	cpumask_clear_cpu(cpu, &d->cpu_mask);
 	if (cpumask_empty(&d->cpu_mask)) {
-		resctrl_offline_domain(r, d);
+		resctrl_offline_ctrl_domain(r, d);
 		list_del(&d->hdr.list);
 
 		/*
@@ -599,6 +673,38 @@ static void domain_remove_cpu(int cpu, struct rdt_resource *r)
 
 		return;
 	}
+}
+
+static void domain_remove_cpu_mon(int cpu, struct rdt_resource *r)
+{
+	int id = get_domain_id_from_scope(cpu, r->mon_scope);
+	struct rdt_hw_domain *hw_dom;
+	struct rdt_domain_hdr *hdr;
+	struct rdt_domain *d;
+
+	if (id < 0)
+		return;
+
+	hdr = rdt_find_domain(&r->mon_domains, id, NULL);
+	if (IS_ERR_OR_NULL(hdr)) {
+		pr_warn("Couldn't find scope id=%d for CPU %d\n", id, cpu);
+		return;
+	}
+
+	if (WARN_ON_ONCE(hdr->type != RESCTRL_MON_DOMAIN))
+		return;
+
+	d = container_of(hdr, struct rdt_domain, hdr);
+	hw_dom = resctrl_to_arch_dom(d);
+
+	cpumask_clear_cpu(cpu, &d->cpu_mask);
+	if (cpumask_empty(&d->cpu_mask)) {
+		resctrl_offline_mon_domain(r, d);
+		list_del(&d->hdr.list);
+		domain_free(hw_dom);
+
+		return;
+	}
 
 	if (r == &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl) {
 		if (is_mbm_enabled() && cpu == d->mbm_work_cpu) {
@@ -613,6 +719,14 @@ static void domain_remove_cpu(int cpu, struct rdt_resource *r)
 	}
 }
 
+static void domain_remove_cpu(int cpu, struct rdt_resource *r)
+{
+	if (r->alloc_capable)
+		domain_remove_cpu_ctrl(cpu, r);
+	if (r->mon_capable)
+		domain_remove_cpu_mon(cpu, r);
+}
+
 static void clear_closid_rmid(int cpu)
 {
 	struct resctrl_pqr_state *state = this_cpu_ptr(&pqr_state);
diff --git a/arch/x86/kernel/cpu/resctrl/ctrlmondata.c b/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
index 8bce591a1018..33ff4d00a08c 100644
--- a/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
+++ b/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
@@ -224,7 +224,7 @@ static int parse_line(char *line, struct resctrl_schema *s,
 		return -EINVAL;
 	}
 	dom = strim(dom);
-	list_for_each_entry(d, &r->domains, hdr.list) {
+	list_for_each_entry(d, &r->ctrl_domains, hdr.list) {
 		if (d->hdr.id == dom_id) {
 			data.buf = dom;
 			data.rdtgrp = rdtgrp;
@@ -316,7 +316,7 @@ int resctrl_arch_update_domains(struct rdt_resource *r, u32 closid)
 		return -ENOMEM;
 
 	msr_param.res = NULL;
-	list_for_each_entry(d, &r->domains, hdr.list) {
+	list_for_each_entry(d, &r->ctrl_domains, hdr.list) {
 		hw_dom = resctrl_to_arch_dom(d);
 		for (t = 0; t < CDP_NUM_TYPES; t++) {
 			cfg = &hw_dom->d_resctrl.staged_config[t];
@@ -464,7 +464,7 @@ static void show_doms(struct seq_file *s, struct resctrl_schema *schema, int clo
 	u32 ctrl_val;
 
 	seq_printf(s, "%*s:", max_name_width, schema->name);
-	list_for_each_entry(dom, &r->domains, hdr.list) {
+	list_for_each_entry(dom, &r->ctrl_domains, hdr.list) {
 		if (sep)
 			seq_puts(s, ";");
 
@@ -540,6 +540,7 @@ void mon_event_read(struct rmid_read *rr, struct rdt_resource *r,
 int rdtgroup_mondata_show(struct seq_file *m, void *arg)
 {
 	struct kernfs_open_file *of = m->private;
+	struct rdt_domain_hdr *hdr;
 	u32 resid, evtid, domid;
 	struct rdtgroup *rdtgrp;
 	struct rdt_resource *r;
@@ -560,12 +561,19 @@ int rdtgroup_mondata_show(struct seq_file *m, void *arg)
 	evtid = md.u.evtid;
 
 	r = &rdt_resources_all[resid].r_resctrl;
-	d = rdt_find_domain(r, domid, NULL);
-	if (IS_ERR_OR_NULL(d)) {
+	hdr = rdt_find_domain(&r->mon_domains, domid, NULL);
+	if (IS_ERR_OR_NULL(hdr)) {
 		ret = -ENOENT;
 		goto out;
 	}
 
+	if (WARN_ON_ONCE(hdr->type != RESCTRL_MON_DOMAIN)) {
+		ret = -EINVAL;
+		goto out;
+	}
+
+	d = container_of(hdr, struct rdt_domain, hdr);
+
 	mon_event_read(&rr, r, d, rdtgrp, evtid, false);
 
 	if (rr.err == -EIO)
diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c
index 27cda5988d7f..3265b8499e2a 100644
--- a/arch/x86/kernel/cpu/resctrl/monitor.c
+++ b/arch/x86/kernel/cpu/resctrl/monitor.c
@@ -340,7 +340,7 @@ static void add_rmid_to_limbo(struct rmid_entry *entry)
 
 	entry->busy = 0;
 	cpu = get_cpu();
-	list_for_each_entry(d, &r->domains, hdr.list) {
+	list_for_each_entry(d, &r->mon_domains, hdr.list) {
 		if (cpumask_test_cpu(cpu, &d->cpu_mask)) {
 			err = resctrl_arch_rmid_read(r, d, entry->rmid,
 						     QOS_L3_OCCUP_EVENT_ID,
diff --git a/arch/x86/kernel/cpu/resctrl/pseudo_lock.c b/arch/x86/kernel/cpu/resctrl/pseudo_lock.c
index 18b6183a1b48..bda32b4e1c1e 100644
--- a/arch/x86/kernel/cpu/resctrl/pseudo_lock.c
+++ b/arch/x86/kernel/cpu/resctrl/pseudo_lock.c
@@ -292,7 +292,7 @@ static void pseudo_lock_region_clear(struct pseudo_lock_region *plr)
  */
 static int pseudo_lock_region_init(struct pseudo_lock_region *plr)
 {
-	int scope = plr->s->res->scope;
+	int scope = plr->s->res->ctrl_scope;
 	struct cpu_cacheinfo *ci;
 	int ret;
 	int i;
@@ -856,7 +856,7 @@ bool rdtgroup_pseudo_locked_in_hierarchy(struct rdt_domain *d)
 	 * associated with them.
 	 */
 	for_each_alloc_capable_rdt_resource(r) {
-		list_for_each_entry(d_i, &r->domains, hdr.list) {
+		list_for_each_entry(d_i, &r->ctrl_domains, hdr.list) {
 			if (d_i->plr)
 				cpumask_or(cpu_with_psl, cpu_with_psl,
 					   &d_i->cpu_mask);
diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
index 739e56cfb31f..90565bb44d0e 100644
--- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
+++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
@@ -86,7 +86,7 @@ void rdt_staged_configs_clear(void)
 	lockdep_assert_held(&rdtgroup_mutex);
 
 	for_each_alloc_capable_rdt_resource(r) {
-		list_for_each_entry(dom, &r->domains, hdr.list)
+		list_for_each_entry(dom, &r->ctrl_domains, hdr.list)
 			memset(dom->staged_config, 0, sizeof(dom->staged_config));
 	}
 }
@@ -928,7 +928,7 @@ static int rdt_bit_usage_show(struct kernfs_open_file *of,
 
 	mutex_lock(&rdtgroup_mutex);
 	hw_shareable = r->cache.shareable_bits;
-	list_for_each_entry(dom, &r->domains, hdr.list) {
+	list_for_each_entry(dom, &r->ctrl_domains, hdr.list) {
 		if (sep)
 			seq_putc(seq, ';');
 		sw_shareable = 0;
@@ -1233,7 +1233,7 @@ static bool rdtgroup_mode_test_exclusive(struct rdtgroup *rdtgrp)
 		if (r->rid == RDT_RESOURCE_MBA || r->rid == RDT_RESOURCE_SMBA)
 			continue;
 		has_cache = true;
-		list_for_each_entry(d, &r->domains, hdr.list) {
+		list_for_each_entry(d, &r->ctrl_domains, hdr.list) {
 			ctrl = resctrl_arch_get_config(r, d, closid,
 						       s->conf_type);
 			if (rdtgroup_cbm_overlaps(s, d, ctrl, closid, false)) {
@@ -1345,13 +1345,13 @@ unsigned int rdtgroup_cbm_to_size(struct rdt_resource *r,
 	unsigned int size = 0;
 	int num_b, i;
 
-	if (WARN_ON_ONCE(r->scope != RESCTRL_L2_CACHE && r->scope != RESCTRL_L3_CACHE))
+	if (WARN_ON_ONCE(r->ctrl_scope != RESCTRL_L2_CACHE && r->ctrl_scope != RESCTRL_L3_CACHE))
 		return size;
 
 	num_b = bitmap_weight(&cbm, r->cache.cbm_len);
 	ci = get_cpu_cacheinfo(cpumask_any(&d->cpu_mask));
 	for (i = 0; i < ci->num_leaves; i++) {
-		if (ci->info_list[i].level == r->scope) {
+		if (ci->info_list[i].level == r->ctrl_scope) {
 			size = ci->info_list[i].size / r->cache.cbm_len * num_b;
 			break;
 		}
@@ -1410,7 +1410,7 @@ static int rdtgroup_size_show(struct kernfs_open_file *of,
 		type = schema->conf_type;
 		sep = false;
 		seq_printf(s, "%*s:", max_name_width, schema->name);
-		list_for_each_entry(d, &r->domains, hdr.list) {
+		list_for_each_entry(d, &r->ctrl_domains, hdr.list) {
 			if (sep)
 				seq_putc(s, ';');
 			if (rdtgrp->mode == RDT_MODE_PSEUDO_LOCKSETUP) {
@@ -1499,7 +1499,7 @@ static int mbm_config_show(struct seq_file *s, struct rdt_resource *r, u32 evtid
 
 	mutex_lock(&rdtgroup_mutex);
 
-	list_for_each_entry(dom, &r->domains, hdr.list) {
+	list_for_each_entry(dom, &r->mon_domains, hdr.list) {
 		if (sep)
 			seq_puts(s, ";");
 
@@ -1622,7 +1622,7 @@ static int mon_config_write(struct rdt_resource *r, char *tok, u32 evtid)
 		return -EINVAL;
 	}
 
-	list_for_each_entry(d, &r->domains, hdr.list) {
+	list_for_each_entry(d, &r->mon_domains, hdr.list) {
 		if (d->hdr.id == dom_id) {
 			ret = mbm_config_write_domain(r, d, evtid, val);
 			if (ret)
@@ -2141,7 +2141,7 @@ static int set_cache_qos_cfg(int level, bool enable)
 		return -ENOMEM;
 
 	r_l = &rdt_resources_all[level].r_resctrl;
-	list_for_each_entry(d, &r_l->domains, hdr.list) {
+	list_for_each_entry(d, &r_l->ctrl_domains, hdr.list) {
 		if (r_l->cache.arch_has_per_cpu_cfg)
 			/* Pick all the CPUs in the domain instance */
 			for_each_cpu(cpu, &d->cpu_mask)
@@ -2226,7 +2226,7 @@ static int set_mba_sc(bool mba_sc)
 
 	r->membw.mba_sc = mba_sc;
 
-	list_for_each_entry(d, &r->domains, hdr.list) {
+	list_for_each_entry(d, &r->ctrl_domains, hdr.list) {
 		for (i = 0; i < num_closid; i++)
 			d->mbps_val[i] = MBA_MAX_MBPS;
 	}
@@ -2528,7 +2528,7 @@ static int rdt_get_tree(struct fs_context *fc)
 
 	if (is_mbm_enabled()) {
 		r = &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl;
-		list_for_each_entry(dom, &r->domains, hdr.list)
+		list_for_each_entry(dom, &r->mon_domains, hdr.list)
 			mbm_setup_overflow_handler(dom, MBM_OVERFLOW_INTERVAL);
 	}
 
@@ -2649,10 +2649,10 @@ static int reset_all_ctrls(struct rdt_resource *r)
 
 	/*
 	 * Disable resource control for this resource by setting all
-	 * CBMs in all domains to the maximum mask value. Pick one CPU
+	 * CBMs in all ctrl_domains to the maximum mask value. Pick one CPU
 	 * from each domain to update the MSRs below.
 	 */
-	list_for_each_entry(d, &r->domains, hdr.list) {
+	list_for_each_entry(d, &r->ctrl_domains, hdr.list) {
 		hw_dom = resctrl_to_arch_dom(d);
 		cpumask_set_cpu(cpumask_any(&d->cpu_mask), cpu_mask);
 
@@ -2922,7 +2922,7 @@ static int mkdir_mondata_subdir_alldom(struct kernfs_node *parent_kn,
 	struct rdt_domain *dom;
 	int ret;
 
-	list_for_each_entry(dom, &r->domains, hdr.list) {
+	list_for_each_entry(dom, &r->mon_domains, hdr.list) {
 		ret = mkdir_mondata_subdir(parent_kn, dom, r, prgrp);
 		if (ret)
 			return ret;
@@ -3104,7 +3104,7 @@ static int rdtgroup_init_cat(struct resctrl_schema *s, u32 closid)
 	struct rdt_domain *d;
 	int ret;
 
-	list_for_each_entry(d, &s->res->domains, hdr.list) {
+	list_for_each_entry(d, &s->res->ctrl_domains, hdr.list) {
 		ret = __init_one_rdt_domain(d, s, closid);
 		if (ret < 0)
 			return ret;
@@ -3119,7 +3119,7 @@ static void rdtgroup_init_mba(struct rdt_resource *r, u32 closid)
 	struct resctrl_staged_config *cfg;
 	struct rdt_domain *d;
 
-	list_for_each_entry(d, &r->domains, hdr.list) {
+	list_for_each_entry(d, &r->ctrl_domains, hdr.list) {
 		if (is_mba_sc(r)) {
 			d->mbps_val[closid] = MBA_MAX_MBPS;
 			continue;
@@ -3711,15 +3711,17 @@ static void domain_destroy_mon_state(struct rdt_domain *d)
 	kfree(d->mbm_local);
 }
 
-void resctrl_offline_domain(struct rdt_resource *r, struct rdt_domain *d)
+void resctrl_offline_ctrl_domain(struct rdt_resource *r, struct rdt_domain *d)
 {
 	lockdep_assert_held(&rdtgroup_mutex);
 
 	if (supports_mba_mbps() && r->rid == RDT_RESOURCE_MBA)
 		mba_sc_domain_destroy(r, d);
+}
 
-	if (!r->mon_capable)
-		return;
+void resctrl_offline_mon_domain(struct rdt_resource *r, struct rdt_domain *d)
+{
+	lockdep_assert_held(&rdtgroup_mutex);
 
 	/*
 	 * If resctrl is mounted, remove all the
@@ -3776,18 +3778,22 @@ static int domain_setup_mon_state(struct rdt_resource *r, struct rdt_domain *d)
 	return 0;
 }
 
-int resctrl_online_domain(struct rdt_resource *r, struct rdt_domain *d)
+int resctrl_online_ctrl_domain(struct rdt_resource *r, struct rdt_domain *d)
 {
-	int err;
-
 	lockdep_assert_held(&rdtgroup_mutex);
 
 	if (supports_mba_mbps() && r->rid == RDT_RESOURCE_MBA)
 		/* RDT_RESOURCE_MBA is never mon_capable */
 		return mba_sc_domain_allocate(r, d);
 
-	if (!r->mon_capable)
-		return 0;
+	return 0;
+}
+
+int resctrl_online_mon_domain(struct rdt_resource *r, struct rdt_domain *d)
+{
+	int err;
+
+	lockdep_assert_held(&rdtgroup_mutex);
 
 	err = domain_setup_mon_state(r, d);
 	if (err)
-- 
2.41.0


^ permalink raw reply related	[flat|nested] 331+ messages in thread

* [PATCH v7 4/8] x86/resctrl: Split the rdt_domain and rdt_hw_domain structures
  2023-10-03 16:07       ` [PATCH v7 " Tony Luck
                           ` (2 preceding siblings ...)
  2023-10-03 16:07         ` [PATCH v7 3/8] x86/resctrl: Prepare for different scope for control/monitor operations Tony Luck
@ 2023-10-03 16:07         ` Tony Luck
  2023-10-03 16:46           ` Luck, Tony
  2023-10-03 16:07         ` [PATCH v7 5/8] x86/resctrl: Add node-scope to the options for feature scope Tony Luck
                           ` (4 subsequent siblings)
  8 siblings, 1 reply; 331+ messages in thread
From: Tony Luck @ 2023-10-03 16:07 UTC (permalink / raw)
  To: Fenghua Yu, Reinette Chatre, Peter Newman, Jonathan Corbet,
	Shuah Khan, x86
  Cc: Shaopeng Tan, James Morse, Jamie Iles, Babu Moger, Randy Dunlap,
	linux-kernel, linux-doc, patches, Tony Luck

The same rdt_domain structure is used for both control and monitor
functions. But this results in wasted memory as some of the fields are
only used by control functions, while most are only used for monitor
functions.

Split into separate rdt_ctrl_domain and rdt_mon_domain structures with
just the fields required for control and monitoring respectively.

Similarly, split the rdt_hw_domain structure into rdt_hw_ctrl_domain
and rdt_hw_mon_domain.
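
Roughly, both new structures keep the common rdt_domain_hdr so the
generic list and lookup code stays shared, while the function-specific
state moves to the appropriate type. The sketch below is trimmed from
the definitions in the diff that follows (all names come from the
patch); it is illustrative rather than a complete header:

struct rdt_ctrl_domain {
	struct rdt_domain_hdr		hdr;	/* id, type, list linkage */
	struct cpumask			cpu_mask;
	struct pseudo_lock_region	*plr;
	struct resctrl_staged_config	staged_config[CDP_NUM_TYPES];
	u32				*mbps_val;
};

struct rdt_mon_domain {
	struct rdt_domain_hdr		hdr;	/* same common header */
	struct cpumask			cpu_mask;
	unsigned long			*rmid_busy_llc;
	struct mbm_state		*mbm_total;
	struct mbm_state		*mbm_local;
	struct delayed_work		mbm_over;
	struct delayed_work		cqm_limbo;
	int				mbm_work_cpu;
	int				cqm_work_cpu;
};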

Signed-off-by: Tony Luck <tony.luck@intel.com>
Reviewed-by: Peter Newman <peternewman@google.com>

---

Changes since last version:

"wrapped line not quite aligned anymore" * 2. Both fixed.

 include/linux/resctrl.h                   | 50 +++++++------
 arch/x86/kernel/cpu/resctrl/internal.h    | 60 ++++++++++------
 arch/x86/kernel/cpu/resctrl/core.c        | 87 ++++++++++++-----------
 arch/x86/kernel/cpu/resctrl/ctrlmondata.c | 32 ++++-----
 arch/x86/kernel/cpu/resctrl/monitor.c     | 40 +++++------
 arch/x86/kernel/cpu/resctrl/pseudo_lock.c |  6 +-
 arch/x86/kernel/cpu/resctrl/rdtgroup.c    | 62 ++++++++--------
 7 files changed, 184 insertions(+), 153 deletions(-)

diff --git a/include/linux/resctrl.h b/include/linux/resctrl.h
index 0af5c5aa5a6f..1c925e3db2ea 100644
--- a/include/linux/resctrl.h
+++ b/include/linux/resctrl.h
@@ -63,7 +63,25 @@ struct rdt_domain_hdr {
 };

 /**
- * struct rdt_domain - group of CPUs sharing a resctrl resource
+ * struct rdt_ctrl_domain - group of CPUs sharing a resctrl control resource
+ * @hdr:		common header for different domain types
+ * @cpu_mask:		which CPUs share this resource
+ * @plr:		pseudo-locked region (if any) associated with domain
+ * @staged_config:	parsed configuration to be applied
+ * @mbps_val:		When mba_sc is enabled, this holds the array of user
+ *			specified control values for mba_sc in MBps, indexed
+ *			by closid
+ */
+struct rdt_ctrl_domain {
+	struct rdt_domain_hdr		hdr;
+	struct cpumask			cpu_mask;
+	struct pseudo_lock_region	*plr;
+	struct resctrl_staged_config	staged_config[CDP_NUM_TYPES];
+	u32				*mbps_val;
+};
+
+/**
+ * struct rdt_mon_domain - group of CPUs sharing a resctrl monitor resource
  * @hdr:		common header for different domain types
  * @cpu_mask:		which CPUs share this resource
  * @rmid_busy_llc:	bitmap of which limbo RMIDs are above threshold
@@ -73,13 +91,8 @@ struct rdt_domain_hdr {
  * @cqm_limbo:		worker to periodically read CQM h/w counters
  * @mbm_work_cpu:	worker CPU for MBM h/w counters
  * @cqm_work_cpu:	worker CPU for CQM h/w counters
- * @plr:		pseudo-locked region (if any) associated with domain
- * @staged_config:	parsed configuration to be applied
- * @mbps_val:		When mba_sc is enabled, this holds the array of user
- *			specified control values for mba_sc in MBps, indexed
- *			by closid
  */
-struct rdt_domain {
+struct rdt_mon_domain {
 	struct rdt_domain_hdr		hdr;
 	struct cpumask			cpu_mask;
 	unsigned long			*rmid_busy_llc;
@@ -89,9 +102,6 @@ struct rdt_domain {
 	struct delayed_work		cqm_limbo;
 	int				mbm_work_cpu;
 	int				cqm_work_cpu;
-	struct pseudo_lock_region	*plr;
-	struct resctrl_staged_config	staged_config[CDP_NUM_TYPES];
-	u32				*mbps_val;
 };

 /**
@@ -195,7 +205,7 @@ struct rdt_resource {
 	const char		*format_str;
 	int			(*parse_ctrlval)(struct rdt_parse_data *data,
 						 struct resctrl_schema *s,
-						 struct rdt_domain *d);
+						 struct rdt_ctrl_domain *d);
 	struct list_head	evt_list;
 	unsigned long		fflags;
 	bool			cdp_capable;
@@ -229,15 +239,15 @@ int resctrl_arch_update_domains(struct rdt_resource *r, u32 closid);
  * Update the ctrl_val and apply this config right now.
  * Must be called on one of the domain's CPUs.
  */
-int resctrl_arch_update_one(struct rdt_resource *r, struct rdt_domain *d,
+int resctrl_arch_update_one(struct rdt_resource *r, struct rdt_ctrl_domain *d,
 			    u32 closid, enum resctrl_conf_type t, u32 cfg_val);

-u32 resctrl_arch_get_config(struct rdt_resource *r, struct rdt_domain *d,
+u32 resctrl_arch_get_config(struct rdt_resource *r, struct rdt_ctrl_domain *d,
 			    u32 closid, enum resctrl_conf_type type);
-int resctrl_online_ctrl_domain(struct rdt_resource *r, struct rdt_domain *d);
-int resctrl_online_mon_domain(struct rdt_resource *r, struct rdt_domain *d);
-void resctrl_offline_ctrl_domain(struct rdt_resource *r, struct rdt_domain *d);
-void resctrl_offline_mon_domain(struct rdt_resource *r, struct rdt_domain *d);
+int resctrl_online_ctrl_domain(struct rdt_resource *r, struct rdt_ctrl_domain *d);
+int resctrl_online_mon_domain(struct rdt_resource *r, struct rdt_mon_domain *d);
+void resctrl_offline_ctrl_domain(struct rdt_resource *r, struct rdt_ctrl_domain *d);
+void resctrl_offline_mon_domain(struct rdt_resource *r, struct rdt_mon_domain *d);

 /**
  * resctrl_arch_rmid_read() - Read the eventid counter corresponding to rmid
@@ -253,7 +263,7 @@ void resctrl_offline_mon_domain(struct rdt_resource *r, struct rdt_domain *d);
  * Return:
  * 0 on success, or -EIO, -EINVAL etc on error.
  */
-int resctrl_arch_rmid_read(struct rdt_resource *r, struct rdt_domain *d,
+int resctrl_arch_rmid_read(struct rdt_resource *r, struct rdt_mon_domain *d,
 			   u32 rmid, enum resctrl_event_id eventid, u64 *val);

 /**
@@ -266,7 +276,7 @@ int resctrl_arch_rmid_read(struct rdt_resource *r, struct rdt_domain *d,
  *
  * This can be called from any CPU.
  */
-void resctrl_arch_reset_rmid(struct rdt_resource *r, struct rdt_domain *d,
+void resctrl_arch_reset_rmid(struct rdt_resource *r, struct rdt_mon_domain *d,
 			     u32 rmid, enum resctrl_event_id eventid);

 /**
@@ -278,7 +288,7 @@ void resctrl_arch_reset_rmid(struct rdt_resource *r, struct rdt_domain *d,
  *
  * This can be called from any CPU.
  */
-void resctrl_arch_reset_rmid_all(struct rdt_resource *r, struct rdt_domain *d);
+void resctrl_arch_reset_rmid_all(struct rdt_resource *r, struct rdt_mon_domain *d);

 extern unsigned int resctrl_rmid_realloc_threshold;
 extern unsigned int resctrl_rmid_realloc_limit;
diff --git a/arch/x86/kernel/cpu/resctrl/internal.h b/arch/x86/kernel/cpu/resctrl/internal.h
index e9a2a8993d14..ee38249c6f1d 100644
--- a/arch/x86/kernel/cpu/resctrl/internal.h
+++ b/arch/x86/kernel/cpu/resctrl/internal.h
@@ -106,7 +106,7 @@ union mon_data_bits {
 struct rmid_read {
 	struct rdtgroup		*rgrp;
 	struct rdt_resource	*r;
-	struct rdt_domain	*d;
+	struct rdt_mon_domain	*d;
 	enum resctrl_event_id	evtid;
 	bool			first;
 	int			err;
@@ -191,7 +191,7 @@ struct mongroup {
  */
 struct pseudo_lock_region {
 	struct resctrl_schema	*s;
-	struct rdt_domain	*d;
+	struct rdt_ctrl_domain	*d;
 	u32			cbm;
 	wait_queue_head_t	lock_thread_wq;
 	int			thread_done;
@@ -319,25 +319,41 @@ struct arch_mbm_state {
 };

 /**
- * struct rdt_hw_domain - Arch private attributes of a set of CPUs that share
- *			  a resource
+ * struct rdt_hw_ctrl_domain - Arch private attributes of a set of CPUs that share
+ *			  a resource for a control function
  * @d_resctrl:	Properties exposed to the resctrl file system
  * @ctrl_val:	array of cache or mem ctrl values (indexed by CLOSID)
+ *
+ * Members of this structure are accessed via helpers that provide abstraction.
+ */
+struct rdt_hw_ctrl_domain {
+	struct rdt_ctrl_domain		d_resctrl;
+	u32				*ctrl_val;
+};
+
+/**
+ * struct rdt_hw_mon_domain - Arch private attributes of a set of CPUs that share
+ *			  a resource for a monitor function
+ * @d_resctrl:	Properties exposed to the resctrl file system
  * @arch_mbm_total:	arch private state for MBM total bandwidth
  * @arch_mbm_local:	arch private state for MBM local bandwidth
  *
  * Members of this structure are accessed via helpers that provide abstraction.
  */
-struct rdt_hw_domain {
-	struct rdt_domain		d_resctrl;
-	u32				*ctrl_val;
+struct rdt_hw_mon_domain {
+	struct rdt_mon_domain		d_resctrl;
 	struct arch_mbm_state		*arch_mbm_total;
 	struct arch_mbm_state		*arch_mbm_local;
 };

-static inline struct rdt_hw_domain *resctrl_to_arch_dom(struct rdt_domain *r)
+static inline struct rdt_hw_ctrl_domain *resctrl_to_arch_ctrl_dom(struct rdt_ctrl_domain *r)
 {
-	return container_of(r, struct rdt_hw_domain, d_resctrl);
+	return container_of(r, struct rdt_hw_ctrl_domain, d_resctrl);
+}
+
+static inline struct rdt_hw_mon_domain *resctrl_to_arch_mon_dom(struct rdt_mon_domain *r)
+{
+	return container_of(r, struct rdt_hw_mon_domain, d_resctrl);
 }

 /**
@@ -405,7 +421,7 @@ struct rdt_hw_resource {
 	struct rdt_resource	r_resctrl;
 	u32			num_closid;
 	unsigned int		msr_base;
-	void (*msr_update)	(struct rdt_domain *d, struct msr_param *m,
+	void (*msr_update)	(struct rdt_ctrl_domain *d, struct msr_param *m,
 				 struct rdt_resource *r);
 	unsigned int		mon_scale;
 	unsigned int		mbm_width;
@@ -418,9 +434,9 @@ static inline struct rdt_hw_resource *resctrl_to_arch_res(struct rdt_resource *r
 }

 int parse_cbm(struct rdt_parse_data *data, struct resctrl_schema *s,
-	      struct rdt_domain *d);
+	      struct rdt_ctrl_domain *d);
 int parse_bw(struct rdt_parse_data *data, struct resctrl_schema *s,
-	     struct rdt_domain *d);
+	     struct rdt_ctrl_domain *d);

 extern struct mutex rdtgroup_mutex;

@@ -517,21 +533,21 @@ ssize_t rdtgroup_schemata_write(struct kernfs_open_file *of,
 				char *buf, size_t nbytes, loff_t off);
 int rdtgroup_schemata_show(struct kernfs_open_file *of,
 			   struct seq_file *s, void *v);
-bool rdtgroup_cbm_overlaps(struct resctrl_schema *s, struct rdt_domain *d,
+bool rdtgroup_cbm_overlaps(struct resctrl_schema *s, struct rdt_ctrl_domain *d,
 			   unsigned long cbm, int closid, bool exclusive);
-unsigned int rdtgroup_cbm_to_size(struct rdt_resource *r, struct rdt_domain *d,
+unsigned int rdtgroup_cbm_to_size(struct rdt_resource *r, struct rdt_ctrl_domain *d,
 				  unsigned long cbm);
 enum rdtgrp_mode rdtgroup_mode_by_closid(int closid);
 int rdtgroup_tasks_assigned(struct rdtgroup *r);
 int rdtgroup_locksetup_enter(struct rdtgroup *rdtgrp);
 int rdtgroup_locksetup_exit(struct rdtgroup *rdtgrp);
-bool rdtgroup_cbm_overlaps_pseudo_locked(struct rdt_domain *d, unsigned long cbm);
-bool rdtgroup_pseudo_locked_in_hierarchy(struct rdt_domain *d);
+bool rdtgroup_cbm_overlaps_pseudo_locked(struct rdt_ctrl_domain *d, unsigned long cbm);
+bool rdtgroup_pseudo_locked_in_hierarchy(struct rdt_ctrl_domain *d);
 int rdt_pseudo_lock_init(void);
 void rdt_pseudo_lock_release(void);
 int rdtgroup_pseudo_lock_create(struct rdtgroup *rdtgrp);
 void rdtgroup_pseudo_lock_remove(struct rdtgroup *rdtgrp);
-struct rdt_domain *get_domain_from_cpu(int cpu, struct rdt_resource *r);
+struct rdt_ctrl_domain *get_domain_from_cpu(int cpu, struct rdt_resource *r);
 int closids_supported(void);
 void closid_free(int closid);
 int alloc_rmid(void);
@@ -541,17 +557,17 @@ bool __init rdt_cpu_has(int flag);
 void mon_event_count(void *info);
 int rdtgroup_mondata_show(struct seq_file *m, void *arg);
 void mon_event_read(struct rmid_read *rr, struct rdt_resource *r,
-		    struct rdt_domain *d, struct rdtgroup *rdtgrp,
+		    struct rdt_mon_domain *d, struct rdtgroup *rdtgrp,
 		    int evtid, int first);
-void mbm_setup_overflow_handler(struct rdt_domain *dom,
+void mbm_setup_overflow_handler(struct rdt_mon_domain *dom,
 				unsigned long delay_ms);
 void mbm_handle_overflow(struct work_struct *work);
 void __init intel_rdt_mbm_apply_quirk(void);
 bool is_mba_sc(struct rdt_resource *r);
-void cqm_setup_limbo_handler(struct rdt_domain *dom, unsigned long delay_ms);
+void cqm_setup_limbo_handler(struct rdt_mon_domain *dom, unsigned long delay_ms);
 void cqm_handle_limbo(struct work_struct *work);
-bool has_busy_rmid(struct rdt_resource *r, struct rdt_domain *d);
-void __check_limbo(struct rdt_domain *d, bool force_free);
+bool has_busy_rmid(struct rdt_resource *r, struct rdt_mon_domain *d);
+void __check_limbo(struct rdt_mon_domain *d, bool force_free);
 void rdt_domain_reconfigure_cdp(struct rdt_resource *r);
 void __init thread_throttle_mode_init(void);
 void __init mbm_config_rftype_init(const char *config);
diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
index 7ef178fb7c77..726f00c01079 100644
--- a/arch/x86/kernel/cpu/resctrl/core.c
+++ b/arch/x86/kernel/cpu/resctrl/core.c
@@ -49,12 +49,12 @@ int max_name_width, max_data_width;
 bool rdt_alloc_capable;

 static void
-mba_wrmsr_intel(struct rdt_domain *d, struct msr_param *m,
+mba_wrmsr_intel(struct rdt_ctrl_domain *d, struct msr_param *m,
 		struct rdt_resource *r);
 static void
-cat_wrmsr(struct rdt_domain *d, struct msr_param *m, struct rdt_resource *r);
+cat_wrmsr(struct rdt_ctrl_domain *d, struct msr_param *m, struct rdt_resource *r);
 static void
-mba_wrmsr_amd(struct rdt_domain *d, struct msr_param *m,
+mba_wrmsr_amd(struct rdt_ctrl_domain *d, struct msr_param *m,
 	      struct rdt_resource *r);

 #define ctrl_domain_init(id) LIST_HEAD_INIT(rdt_resources_all[id].r_resctrl.ctrl_domains)
@@ -303,11 +303,11 @@ static void rdt_get_cdp_l2_config(void)
 }

 static void
-mba_wrmsr_amd(struct rdt_domain *d, struct msr_param *m, struct rdt_resource *r)
+mba_wrmsr_amd(struct rdt_ctrl_domain *d, struct msr_param *m, struct rdt_resource *r)
 {
-	unsigned int i;
-	struct rdt_hw_domain *hw_dom = resctrl_to_arch_dom(d);
+	struct rdt_hw_ctrl_domain *hw_dom = resctrl_to_arch_ctrl_dom(d);
 	struct rdt_hw_resource *hw_res = resctrl_to_arch_res(r);
+	unsigned int i;

 	for (i = m->low; i < m->high; i++)
 		wrmsrl(hw_res->msr_base + i, hw_dom->ctrl_val[i]);
@@ -328,12 +328,12 @@ static u32 delay_bw_map(unsigned long bw, struct rdt_resource *r)
 }

 static void
-mba_wrmsr_intel(struct rdt_domain *d, struct msr_param *m,
+mba_wrmsr_intel(struct rdt_ctrl_domain *d, struct msr_param *m,
 		struct rdt_resource *r)
 {
-	unsigned int i;
-	struct rdt_hw_domain *hw_dom = resctrl_to_arch_dom(d);
+	struct rdt_hw_ctrl_domain *hw_dom = resctrl_to_arch_ctrl_dom(d);
 	struct rdt_hw_resource *hw_res = resctrl_to_arch_res(r);
+	unsigned int i;

 	/*  Write the delay values for mba. */
 	for (i = m->low; i < m->high; i++)
@@ -341,19 +341,19 @@ mba_wrmsr_intel(struct rdt_domain *d, struct msr_param *m,
 }

 static void
-cat_wrmsr(struct rdt_domain *d, struct msr_param *m, struct rdt_resource *r)
+cat_wrmsr(struct rdt_ctrl_domain *d, struct msr_param *m, struct rdt_resource *r)
 {
-	unsigned int i;
-	struct rdt_hw_domain *hw_dom = resctrl_to_arch_dom(d);
+	struct rdt_hw_ctrl_domain *hw_dom = resctrl_to_arch_ctrl_dom(d);
 	struct rdt_hw_resource *hw_res = resctrl_to_arch_res(r);
+	unsigned int i;

 	for (i = m->low; i < m->high; i++)
 		wrmsrl(hw_res->msr_base + i, hw_dom->ctrl_val[i]);
 }

-struct rdt_domain *get_domain_from_cpu(int cpu, struct rdt_resource *r)
+struct rdt_ctrl_domain *get_domain_from_cpu(int cpu, struct rdt_resource *r)
 {
-	struct rdt_domain *d;
+	struct rdt_ctrl_domain *d;

 	list_for_each_entry(d, &r->ctrl_domains, hdr.list) {
 		/* Find the domain that contains this CPU */
@@ -375,7 +375,7 @@ void rdt_ctrl_update(void *arg)
 	struct rdt_hw_resource *hw_res = resctrl_to_arch_res(m->res);
 	struct rdt_resource *r = m->res;
 	int cpu = smp_processor_id();
-	struct rdt_domain *d;
+	struct rdt_ctrl_domain *d;

 	d = get_domain_from_cpu(cpu, r);
 	if (d) {
@@ -443,18 +443,23 @@ static void setup_default_ctrlval(struct rdt_resource *r, u32 *dc)
 		*dc = r->default_ctrl;
 }

-static void domain_free(struct rdt_hw_domain *hw_dom)
+static void ctrl_domain_free(struct rdt_hw_ctrl_domain *hw_dom)
+{
+	kfree(hw_dom->ctrl_val);
+	kfree(hw_dom);
+}
+
+static void mon_domain_free(struct rdt_hw_mon_domain *hw_dom)
 {
 	kfree(hw_dom->arch_mbm_total);
 	kfree(hw_dom->arch_mbm_local);
-	kfree(hw_dom->ctrl_val);
 	kfree(hw_dom);
 }

-static int domain_setup_ctrlval(struct rdt_resource *r, struct rdt_domain *d)
+static int domain_setup_ctrlval(struct rdt_resource *r, struct rdt_ctrl_domain *d)
 {
+	struct rdt_hw_ctrl_domain *hw_dom = resctrl_to_arch_ctrl_dom(d);
 	struct rdt_hw_resource *hw_res = resctrl_to_arch_res(r);
-	struct rdt_hw_domain *hw_dom = resctrl_to_arch_dom(d);
 	struct msr_param m;
 	u32 *dc;

@@ -477,7 +482,7 @@ static int domain_setup_ctrlval(struct rdt_resource *r, struct rdt_domain *d)
  * @num_rmid:	The size of the MBM counter array
  * @hw_dom:	The domain that owns the allocated arrays
  */
-static int arch_domain_mbm_alloc(u32 num_rmid, struct rdt_hw_domain *hw_dom)
+static int arch_domain_mbm_alloc(u32 num_rmid, struct rdt_hw_mon_domain *hw_dom)
 {
 	size_t tsize;

@@ -516,10 +521,10 @@ static int get_domain_id_from_scope(int cpu, enum resctrl_scope scope)
 static void domain_add_cpu_ctrl(int cpu, struct rdt_resource *r)
 {
 	int id = get_domain_id_from_scope(cpu, r->ctrl_scope);
+	struct rdt_hw_ctrl_domain *hw_dom;
 	struct list_head *add_pos = NULL;
-	struct rdt_hw_domain *hw_dom;
 	struct rdt_domain_hdr *hdr;
-	struct rdt_domain *d;
+	struct rdt_ctrl_domain *d;
 	int err;

 	if (id < 0) {
@@ -533,7 +538,7 @@ static void domain_add_cpu_ctrl(int cpu, struct rdt_resource *r)
 		pr_warn("Couldn't find control scope id=%d for CPU %d\n", id, cpu);
 		return;
 	}
-	d = container_of(hdr, struct rdt_domain, hdr);
+	d = container_of(hdr, struct rdt_ctrl_domain, hdr);

 	if (d) {
 		cpumask_set_cpu(cpu, &d->cpu_mask);
@@ -553,7 +558,7 @@ static void domain_add_cpu_ctrl(int cpu, struct rdt_resource *r)
 	rdt_domain_reconfigure_cdp(r);

 	if (domain_setup_ctrlval(r, d)) {
-		domain_free(hw_dom);
+		ctrl_domain_free(hw_dom);
 		return;
 	}

@@ -562,17 +567,17 @@ static void domain_add_cpu_ctrl(int cpu, struct rdt_resource *r)
 	err = resctrl_online_ctrl_domain(r, d);
 	if (err) {
 		list_del(&d->hdr.list);
-		domain_free(hw_dom);
+		ctrl_domain_free(hw_dom);
 	}
 }

 static void domain_add_cpu_mon(int cpu, struct rdt_resource *r)
 {
 	int id = get_domain_id_from_scope(cpu, r->mon_scope);
+	struct rdt_hw_mon_domain *hw_mondom;
 	struct list_head *add_pos = NULL;
-	struct rdt_hw_domain *hw_mondom;
 	struct rdt_domain_hdr *hdr;
-	struct rdt_domain *d;
+	struct rdt_mon_domain *d;
 	int err;

 	if (id < 0) {
@@ -586,7 +591,7 @@ static void domain_add_cpu_mon(int cpu, struct rdt_resource *r)
 		pr_warn("Couldn't find monitor scope id=%d for CPU %d\n", id, cpu);
 		return;
 	}
-	d = container_of(hdr, struct rdt_domain, hdr);
+	d = container_of(hdr, struct rdt_mon_domain, hdr);

 	if (d) {
 		cpumask_set_cpu(cpu, &d->cpu_mask);
@@ -602,7 +607,7 @@ static void domain_add_cpu_mon(int cpu, struct rdt_resource *r)
 	cpumask_set_cpu(cpu, &d->cpu_mask);

 	if (arch_domain_mbm_alloc(r->num_rmid, hw_mondom)) {
-		domain_free(hw_mondom);
+		mon_domain_free(hw_mondom);
 		return;
 	}

@@ -611,7 +616,7 @@ static void domain_add_cpu_mon(int cpu, struct rdt_resource *r)
 	err = resctrl_online_mon_domain(r, d);
 	if (err) {
 		list_del(&d->hdr.list);
-		domain_free(hw_mondom);
+		mon_domain_free(hw_mondom);
 	}
 }

@@ -629,9 +634,9 @@ static void domain_add_cpu(int cpu, struct rdt_resource *r)
 static void domain_remove_cpu_ctrl(int cpu, struct rdt_resource *r)
 {
 	int id = get_domain_id_from_scope(cpu, r->ctrl_scope);
-	struct rdt_hw_domain *hw_dom;
+	struct rdt_hw_ctrl_domain *hw_dom;
 	struct rdt_domain_hdr *hdr;
-	struct rdt_domain *d;
+	struct rdt_ctrl_domain *d;

 	if (id < 0)
 		return;
@@ -641,8 +646,8 @@ static void domain_remove_cpu_ctrl(int cpu, struct rdt_resource *r)
 		pr_warn("Couldn't find control scope id=%d for CPU %d\n", id, cpu);
 		return;
 	}
-	d = container_of(hdr, struct rdt_domain, hdr);
-	hw_dom = resctrl_to_arch_dom(d);
+	d = container_of(hdr, struct rdt_ctrl_domain, hdr);
+	hw_dom = resctrl_to_arch_ctrl_dom(d);

 	cpumask_clear_cpu(cpu, &d->cpu_mask);
 	if (cpumask_empty(&d->cpu_mask)) {
@@ -650,12 +655,12 @@ static void domain_remove_cpu_ctrl(int cpu, struct rdt_resource *r)
 		list_del(&d->hdr.list);

 		/*
-		 * rdt_domain "d" is going to be freed below, so clear
+		 * rdt_ctrl_domain "d" is going to be freed below, so clear
 		 * its pointer from pseudo_lock_region struct.
 		 */
 		if (d->plr)
 			d->plr->d = NULL;
-		domain_free(hw_dom);
+		ctrl_domain_free(hw_dom);

 		return;
 	}
@@ -664,9 +669,9 @@ static void domain_remove_cpu_ctrl(int cpu, struct rdt_resource *r)
 static void domain_remove_cpu_mon(int cpu, struct rdt_resource *r)
 {
 	int id = get_domain_id_from_scope(cpu, r->mon_scope);
-	struct rdt_hw_domain *hw_mondom;
+	struct rdt_hw_mon_domain *hw_mondom;
 	struct rdt_domain_hdr *hdr;
-	struct rdt_domain *d;
+	struct rdt_mon_domain *d;

 	if (id < 0)
 		return;
@@ -676,14 +681,14 @@ static void domain_remove_cpu_mon(int cpu, struct rdt_resource *r)
 		pr_warn("Couldn't find scope id=%d for CPU %d\n", id, cpu);
 		return;
 	}
-	d = container_of(hdr, struct rdt_domain, hdr);
-	hw_mondom = resctrl_to_arch_dom(d);
+	d = container_of(hdr, struct rdt_mon_domain, hdr);
+	hw_mondom = resctrl_to_arch_mon_dom(d);

 	cpumask_clear_cpu(cpu, &d->cpu_mask);
 	if (cpumask_empty(&d->cpu_mask)) {
 		resctrl_offline_mon_domain(r, d);
 		list_del(&d->hdr.list);
-		domain_free(hw_mondom);
+		mon_domain_free(hw_mondom);

 		return;
 	}
diff --git a/arch/x86/kernel/cpu/resctrl/ctrlmondata.c b/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
index a6261e177cc1..7513eba9feaf 100644
--- a/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
+++ b/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
@@ -58,7 +58,7 @@ static bool bw_validate(char *buf, unsigned long *data, struct rdt_resource *r)
 }

 int parse_bw(struct rdt_parse_data *data, struct resctrl_schema *s,
-	     struct rdt_domain *d)
+	     struct rdt_ctrl_domain *d)
 {
 	struct resctrl_staged_config *cfg;
 	u32 closid = data->rdtgrp->closid;
@@ -135,7 +135,7 @@ static bool cbm_validate(char *buf, u32 *data, struct rdt_resource *r)
  * resource type.
  */
 int parse_cbm(struct rdt_parse_data *data, struct resctrl_schema *s,
-	      struct rdt_domain *d)
+	      struct rdt_ctrl_domain *d)
 {
 	struct rdtgroup *rdtgrp = data->rdtgrp;
 	struct resctrl_staged_config *cfg;
@@ -205,7 +205,7 @@ static int parse_line(char *line, struct resctrl_schema *s,
 	struct rdt_resource *r = s->res;
 	struct rdt_parse_data data;
 	char *dom = NULL, *id;
-	struct rdt_domain *d;
+	struct rdt_ctrl_domain *d;
 	unsigned long dom_id;

 	if (rdtgrp->mode == RDT_MODE_PSEUDO_LOCKSETUP &&
@@ -265,11 +265,11 @@ static u32 get_config_index(u32 closid, enum resctrl_conf_type type)
 	}
 }

-static bool apply_config(struct rdt_hw_domain *hw_dom,
+static bool apply_config(struct rdt_hw_ctrl_domain *hw_dom,
 			 struct resctrl_staged_config *cfg, u32 idx,
 			 cpumask_var_t cpu_mask)
 {
-	struct rdt_domain *dom = &hw_dom->d_resctrl;
+	struct rdt_ctrl_domain *dom = &hw_dom->d_resctrl;

 	if (cfg->new_ctrl != hw_dom->ctrl_val[idx]) {
 		cpumask_set_cpu(cpumask_any(&dom->cpu_mask), cpu_mask);
@@ -281,11 +281,11 @@ static bool apply_config(struct rdt_hw_domain *hw_dom,
 	return false;
 }

-int resctrl_arch_update_one(struct rdt_resource *r, struct rdt_domain *d,
+int resctrl_arch_update_one(struct rdt_resource *r, struct rdt_ctrl_domain *d,
 			    u32 closid, enum resctrl_conf_type t, u32 cfg_val)
 {
+	struct rdt_hw_ctrl_domain *hw_dom = resctrl_to_arch_ctrl_dom(d);
 	struct rdt_hw_resource *hw_res = resctrl_to_arch_res(r);
-	struct rdt_hw_domain *hw_dom = resctrl_to_arch_dom(d);
 	u32 idx = get_config_index(closid, t);
 	struct msr_param msr_param;

@@ -305,11 +305,11 @@ int resctrl_arch_update_one(struct rdt_resource *r, struct rdt_domain *d,
 int resctrl_arch_update_domains(struct rdt_resource *r, u32 closid)
 {
 	struct resctrl_staged_config *cfg;
-	struct rdt_hw_domain *hw_dom;
+	struct rdt_hw_ctrl_domain *hw_dom;
 	struct msr_param msr_param;
 	enum resctrl_conf_type t;
+	struct rdt_ctrl_domain *d;
 	cpumask_var_t cpu_mask;
-	struct rdt_domain *d;
 	u32 idx;

 	if (!zalloc_cpumask_var(&cpu_mask, GFP_KERNEL))
@@ -317,7 +317,7 @@ int resctrl_arch_update_domains(struct rdt_resource *r, u32 closid)

 	msr_param.res = NULL;
 	list_for_each_entry(d, &r->ctrl_domains, hdr.list) {
-		hw_dom = resctrl_to_arch_dom(d);
+		hw_dom = resctrl_to_arch_ctrl_dom(d);
 		for (t = 0; t < CDP_NUM_TYPES; t++) {
 			cfg = &hw_dom->d_resctrl.staged_config[t];
 			if (!cfg->have_new_ctrl)
@@ -447,10 +447,10 @@ ssize_t rdtgroup_schemata_write(struct kernfs_open_file *of,
 	return ret ?: nbytes;
 }

-u32 resctrl_arch_get_config(struct rdt_resource *r, struct rdt_domain *d,
+u32 resctrl_arch_get_config(struct rdt_resource *r, struct rdt_ctrl_domain *d,
 			    u32 closid, enum resctrl_conf_type type)
 {
-	struct rdt_hw_domain *hw_dom = resctrl_to_arch_dom(d);
+	struct rdt_hw_ctrl_domain *hw_dom = resctrl_to_arch_ctrl_dom(d);
 	u32 idx = get_config_index(closid, type);

 	return hw_dom->ctrl_val[idx];
@@ -459,7 +459,7 @@ u32 resctrl_arch_get_config(struct rdt_resource *r, struct rdt_domain *d,
 static void show_doms(struct seq_file *s, struct resctrl_schema *schema, int closid)
 {
 	struct rdt_resource *r = schema->res;
-	struct rdt_domain *dom;
+	struct rdt_ctrl_domain *dom;
 	bool sep = false;
 	u32 ctrl_val;

@@ -521,7 +521,7 @@ int rdtgroup_schemata_show(struct kernfs_open_file *of,
 }

 void mon_event_read(struct rmid_read *rr, struct rdt_resource *r,
-		    struct rdt_domain *d, struct rdtgroup *rdtgrp,
+		    struct rdt_mon_domain *d, struct rdtgroup *rdtgrp,
 		    int evtid, int first)
 {
 	/*
@@ -541,11 +541,11 @@ int rdtgroup_mondata_show(struct seq_file *m, void *arg)
 {
 	struct kernfs_open_file *of = m->private;
 	struct rdt_domain_hdr *hdr;
+	struct rdt_mon_domain *d;
 	u32 resid, evtid, domid;
 	struct rdtgroup *rdtgrp;
 	struct rdt_resource *r;
 	union mon_data_bits md;
-	struct rdt_domain *d;
 	struct rmid_read rr;
 	int ret = 0;

@@ -566,7 +566,7 @@ int rdtgroup_mondata_show(struct seq_file *m, void *arg)
 		ret = -ENOENT;
 		goto out;
 	}
-	d = container_of(hdr, struct rdt_domain, hdr);
+	d = container_of(hdr, struct rdt_mon_domain, hdr);

 	mon_event_read(&rr, r, d, rdtgrp, evtid, false);

diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c
index 3265b8499e2a..97d2ed829f5d 100644
--- a/arch/x86/kernel/cpu/resctrl/monitor.c
+++ b/arch/x86/kernel/cpu/resctrl/monitor.c
@@ -170,7 +170,7 @@ static int __rmid_read(u32 rmid, enum resctrl_event_id eventid, u64 *val)
 	return 0;
 }

-static struct arch_mbm_state *get_arch_mbm_state(struct rdt_hw_domain *hw_dom,
+static struct arch_mbm_state *get_arch_mbm_state(struct rdt_hw_mon_domain *hw_dom,
 						 u32 rmid,
 						 enum resctrl_event_id eventid)
 {
@@ -189,10 +189,10 @@ static struct arch_mbm_state *get_arch_mbm_state(struct rdt_hw_domain *hw_dom,
 	return NULL;
 }

-void resctrl_arch_reset_rmid(struct rdt_resource *r, struct rdt_domain *d,
+void resctrl_arch_reset_rmid(struct rdt_resource *r, struct rdt_mon_domain *d,
 			     u32 rmid, enum resctrl_event_id eventid)
 {
-	struct rdt_hw_domain *hw_dom = resctrl_to_arch_dom(d);
+	struct rdt_hw_mon_domain *hw_dom = resctrl_to_arch_mon_dom(d);
 	struct arch_mbm_state *am;

 	am = get_arch_mbm_state(hw_dom, rmid, eventid);
@@ -208,9 +208,9 @@ void resctrl_arch_reset_rmid(struct rdt_resource *r, struct rdt_domain *d,
  * Assumes that hardware counters are also reset and thus that there is
  * no need to record initial non-zero counts.
  */
-void resctrl_arch_reset_rmid_all(struct rdt_resource *r, struct rdt_domain *d)
+void resctrl_arch_reset_rmid_all(struct rdt_resource *r, struct rdt_mon_domain *d)
 {
-	struct rdt_hw_domain *hw_dom = resctrl_to_arch_dom(d);
+	struct rdt_hw_mon_domain *hw_dom = resctrl_to_arch_mon_dom(d);

 	if (is_mbm_total_enabled())
 		memset(hw_dom->arch_mbm_total, 0,
@@ -229,11 +229,11 @@ static u64 mbm_overflow_count(u64 prev_msr, u64 cur_msr, unsigned int width)
 	return chunks >> shift;
 }

-int resctrl_arch_rmid_read(struct rdt_resource *r, struct rdt_domain *d,
+int resctrl_arch_rmid_read(struct rdt_resource *r, struct rdt_mon_domain *d,
 			   u32 rmid, enum resctrl_event_id eventid, u64 *val)
 {
+	struct rdt_hw_mon_domain *hw_dom = resctrl_to_arch_mon_dom(d);
 	struct rdt_hw_resource *hw_res = resctrl_to_arch_res(r);
-	struct rdt_hw_domain *hw_dom = resctrl_to_arch_dom(d);
 	struct arch_mbm_state *am;
 	u64 msr_val, chunks;
 	int ret;
@@ -266,7 +266,7 @@ int resctrl_arch_rmid_read(struct rdt_resource *r, struct rdt_domain *d,
  * decrement the count. If the busy count gets to zero on an RMID, we
  * free the RMID
  */
-void __check_limbo(struct rdt_domain *d, bool force_free)
+void __check_limbo(struct rdt_mon_domain *d, bool force_free)
 {
 	struct rdt_resource *r = &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl;
 	struct rmid_entry *entry;
@@ -305,7 +305,7 @@ void __check_limbo(struct rdt_domain *d, bool force_free)
 	}
 }

-bool has_busy_rmid(struct rdt_resource *r, struct rdt_domain *d)
+bool has_busy_rmid(struct rdt_resource *r, struct rdt_mon_domain *d)
 {
 	return find_first_bit(d->rmid_busy_llc, r->num_rmid) != r->num_rmid;
 }
@@ -334,7 +334,7 @@ int alloc_rmid(void)
 static void add_rmid_to_limbo(struct rmid_entry *entry)
 {
 	struct rdt_resource *r = &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl;
-	struct rdt_domain *d;
+	struct rdt_mon_domain *d;
 	int cpu, err;
 	u64 val = 0;

@@ -383,7 +383,7 @@ void free_rmid(u32 rmid)
 		list_add_tail(&entry->list, &rmid_free_lru);
 }

-static struct mbm_state *get_mbm_state(struct rdt_domain *d, u32 rmid,
+static struct mbm_state *get_mbm_state(struct rdt_mon_domain *d, u32 rmid,
 				       enum resctrl_event_id evtid)
 {
 	switch (evtid) {
@@ -516,13 +516,13 @@ void mon_event_count(void *info)
  * throttle MSRs already have low percentage values.  To avoid
  * unnecessarily restricting such rdtgroups, we also increase the bandwidth.
  */
-static void update_mba_bw(struct rdtgroup *rgrp, struct rdt_domain *dom_mbm)
+static void update_mba_bw(struct rdtgroup *rgrp, struct rdt_mon_domain *dom_mbm)
 {
 	u32 closid, rmid, cur_msr_val, new_msr_val;
 	struct mbm_state *pmbm_data, *cmbm_data;
+	struct rdt_ctrl_domain *dom_mba;
 	u32 cur_bw, delta_bw, user_bw;
 	struct rdt_resource *r_mba;
-	struct rdt_domain *dom_mba;
 	struct list_head *head;
 	struct rdtgroup *entry;

@@ -600,7 +600,7 @@ static void update_mba_bw(struct rdtgroup *rgrp, struct rdt_domain *dom_mbm)
 	}
 }

-static void mbm_update(struct rdt_resource *r, struct rdt_domain *d, int rmid)
+static void mbm_update(struct rdt_resource *r, struct rdt_mon_domain *d, int rmid)
 {
 	struct rmid_read rr;

@@ -640,13 +640,13 @@ void cqm_handle_limbo(struct work_struct *work)
 {
 	unsigned long delay = msecs_to_jiffies(CQM_LIMBOCHECK_INTERVAL);
 	int cpu = smp_processor_id();
+	struct rdt_mon_domain *d;
 	struct rdt_resource *r;
-	struct rdt_domain *d;

 	mutex_lock(&rdtgroup_mutex);

 	r = &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl;
-	d = container_of(work, struct rdt_domain, cqm_limbo.work);
+	d = container_of(work, struct rdt_mon_domain, cqm_limbo.work);

 	__check_limbo(d, false);

@@ -656,7 +656,7 @@ void cqm_handle_limbo(struct work_struct *work)
 	mutex_unlock(&rdtgroup_mutex);
 }

-void cqm_setup_limbo_handler(struct rdt_domain *dom, unsigned long delay_ms)
+void cqm_setup_limbo_handler(struct rdt_mon_domain *dom, unsigned long delay_ms)
 {
 	unsigned long delay = msecs_to_jiffies(delay_ms);
 	int cpu;
@@ -672,9 +672,9 @@ void mbm_handle_overflow(struct work_struct *work)
 	unsigned long delay = msecs_to_jiffies(MBM_OVERFLOW_INTERVAL);
 	struct rdtgroup *prgrp, *crgrp;
 	int cpu = smp_processor_id();
+	struct rdt_mon_domain *d;
 	struct list_head *head;
 	struct rdt_resource *r;
-	struct rdt_domain *d;

 	mutex_lock(&rdtgroup_mutex);

@@ -682,7 +682,7 @@ void mbm_handle_overflow(struct work_struct *work)
 		goto out_unlock;

 	r = &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl;
-	d = container_of(work, struct rdt_domain, mbm_over.work);
+	d = container_of(work, struct rdt_mon_domain, mbm_over.work);

 	list_for_each_entry(prgrp, &rdt_all_groups, rdtgroup_list) {
 		mbm_update(r, d, prgrp->mon.rmid);
@@ -701,7 +701,7 @@ void mbm_handle_overflow(struct work_struct *work)
 	mutex_unlock(&rdtgroup_mutex);
 }

-void mbm_setup_overflow_handler(struct rdt_domain *dom, unsigned long delay_ms)
+void mbm_setup_overflow_handler(struct rdt_mon_domain *dom, unsigned long delay_ms)
 {
 	unsigned long delay = msecs_to_jiffies(delay_ms);
 	int cpu;
diff --git a/arch/x86/kernel/cpu/resctrl/pseudo_lock.c b/arch/x86/kernel/cpu/resctrl/pseudo_lock.c
index bda32b4e1c1e..675e9e47af54 100644
--- a/arch/x86/kernel/cpu/resctrl/pseudo_lock.c
+++ b/arch/x86/kernel/cpu/resctrl/pseudo_lock.c
@@ -814,7 +814,7 @@ int rdtgroup_locksetup_exit(struct rdtgroup *rdtgrp)
  * Return: true if @cbm overlaps with pseudo-locked region on @d, false
  * otherwise.
  */
-bool rdtgroup_cbm_overlaps_pseudo_locked(struct rdt_domain *d, unsigned long cbm)
+bool rdtgroup_cbm_overlaps_pseudo_locked(struct rdt_ctrl_domain *d, unsigned long cbm)
 {
 	unsigned int cbm_len;
 	unsigned long cbm_b;
@@ -841,11 +841,11 @@ bool rdtgroup_cbm_overlaps_pseudo_locked(struct rdt_domain *d, unsigned long cbm
  *         if it is not possible to test due to memory allocation issue,
  *         false otherwise.
  */
-bool rdtgroup_pseudo_locked_in_hierarchy(struct rdt_domain *d)
+bool rdtgroup_pseudo_locked_in_hierarchy(struct rdt_ctrl_domain *d)
 {
+	struct rdt_ctrl_domain *d_i;
 	cpumask_var_t cpu_with_psl;
 	struct rdt_resource *r;
-	struct rdt_domain *d_i;
 	bool ret = false;

 	if (!zalloc_cpumask_var(&cpu_with_psl, GFP_KERNEL))
diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
index 8132f81f31bb..b0901fb95aa9 100644
--- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
+++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
@@ -80,8 +80,8 @@ void rdt_last_cmd_printf(const char *fmt, ...)

 void rdt_staged_configs_clear(void)
 {
+	struct rdt_ctrl_domain *dom;
 	struct rdt_resource *r;
-	struct rdt_domain *dom;

 	lockdep_assert_held(&rdtgroup_mutex);

@@ -920,7 +920,7 @@ static int rdt_bit_usage_show(struct kernfs_open_file *of,
 	unsigned long sw_shareable = 0, hw_shareable = 0;
 	unsigned long exclusive = 0, pseudo_locked = 0;
 	struct rdt_resource *r = s->res;
-	struct rdt_domain *dom;
+	struct rdt_ctrl_domain *dom;
 	int i, hwb, swb, excl, psl;
 	enum rdtgrp_mode mode;
 	bool sep = false;
@@ -1137,7 +1137,7 @@ static enum resctrl_conf_type resctrl_peer_type(enum resctrl_conf_type my_type)
  *
  * Return: false if CBM does not overlap, true if it does.
  */
-static bool __rdtgroup_cbm_overlaps(struct rdt_resource *r, struct rdt_domain *d,
+static bool __rdtgroup_cbm_overlaps(struct rdt_resource *r, struct rdt_ctrl_domain *d,
 				    unsigned long cbm, int closid,
 				    enum resctrl_conf_type type, bool exclusive)
 {
@@ -1192,7 +1192,7 @@ static bool __rdtgroup_cbm_overlaps(struct rdt_resource *r, struct rdt_domain *d
  *
  * Return: true if CBM overlap detected, false if there is no overlap
  */
-bool rdtgroup_cbm_overlaps(struct resctrl_schema *s, struct rdt_domain *d,
+bool rdtgroup_cbm_overlaps(struct resctrl_schema *s, struct rdt_ctrl_domain *d,
 			   unsigned long cbm, int closid, bool exclusive)
 {
 	enum resctrl_conf_type peer_type = resctrl_peer_type(s->conf_type);
@@ -1222,10 +1222,10 @@ bool rdtgroup_cbm_overlaps(struct resctrl_schema *s, struct rdt_domain *d,
 static bool rdtgroup_mode_test_exclusive(struct rdtgroup *rdtgrp)
 {
 	int closid = rdtgrp->closid;
+	struct rdt_ctrl_domain *d;
 	struct resctrl_schema *s;
 	struct rdt_resource *r;
 	bool has_cache = false;
-	struct rdt_domain *d;
 	u32 ctrl;

 	list_for_each_entry(s, &resctrl_schema_all, list) {
@@ -1339,7 +1339,7 @@ static ssize_t rdtgroup_mode_write(struct kernfs_open_file *of,
  * bitmap functions work correctly.
  */
 unsigned int rdtgroup_cbm_to_size(struct rdt_resource *r,
-				  struct rdt_domain *d, unsigned long cbm)
+				  struct rdt_ctrl_domain *d, unsigned long cbm)
 {
 	struct cpu_cacheinfo *ci;
 	unsigned int size = 0;
@@ -1372,9 +1372,9 @@ static int rdtgroup_size_show(struct kernfs_open_file *of,
 {
 	struct resctrl_schema *schema;
 	enum resctrl_conf_type type;
+	struct rdt_ctrl_domain *d;
 	struct rdtgroup *rdtgrp;
 	struct rdt_resource *r;
-	struct rdt_domain *d;
 	unsigned int size;
 	int ret = 0;
 	u32 closid;
@@ -1486,7 +1486,7 @@ static void mon_event_config_read(void *info)
 	mon_info->mon_config = msrval & MAX_EVT_CONFIG_BITS;
 }

-static void mondata_config_read(struct rdt_domain *d, struct mon_config_info *mon_info)
+static void mondata_config_read(struct rdt_mon_domain *d, struct mon_config_info *mon_info)
 {
 	smp_call_function_any(&d->cpu_mask, mon_event_config_read, mon_info, 1);
 }
@@ -1494,7 +1494,7 @@ static void mondata_config_read(struct rdt_domain *d, struct mon_config_info *mo
 static int mbm_config_show(struct seq_file *s, struct rdt_resource *r, u32 evtid)
 {
 	struct mon_config_info mon_info = {0};
-	struct rdt_domain *dom;
+	struct rdt_mon_domain *dom;
 	bool sep = false;

 	mutex_lock(&rdtgroup_mutex);
@@ -1551,7 +1551,7 @@ static void mon_event_config_write(void *info)
 }

 static int mbm_config_write_domain(struct rdt_resource *r,
-				   struct rdt_domain *d, u32 evtid, u32 val)
+				   struct rdt_mon_domain *d, u32 evtid, u32 val)
 {
 	struct mon_config_info mon_info = {0};
 	int ret = 0;
@@ -1601,7 +1601,7 @@ static int mon_config_write(struct rdt_resource *r, char *tok, u32 evtid)
 {
 	char *dom_str = NULL, *id_str;
 	unsigned long dom_id, val;
-	struct rdt_domain *d;
+	struct rdt_mon_domain *d;
 	int ret = 0;

 next:
@@ -2125,9 +2125,9 @@ static inline bool is_mba_linear(void)
 static int set_cache_qos_cfg(int level, bool enable)
 {
 	void (*update)(void *arg);
+	struct rdt_ctrl_domain *d;
 	struct rdt_resource *r_l;
 	cpumask_var_t cpu_mask;
-	struct rdt_domain *d;
 	int cpu;

 	if (level == RDT_RESOURCE_L3)
@@ -2174,7 +2174,7 @@ void rdt_domain_reconfigure_cdp(struct rdt_resource *r)
 		l3_qos_cfg_update(&hw_res->cdp_enabled);
 }

-static int mba_sc_domain_allocate(struct rdt_resource *r, struct rdt_domain *d)
+static int mba_sc_domain_allocate(struct rdt_resource *r, struct rdt_ctrl_domain *d)
 {
 	u32 num_closid = resctrl_arch_get_num_closid(r);
 	int cpu = cpumask_any(&d->cpu_mask);
@@ -2192,7 +2192,7 @@ static int mba_sc_domain_allocate(struct rdt_resource *r, struct rdt_domain *d)
 }

 static void mba_sc_domain_destroy(struct rdt_resource *r,
-				  struct rdt_domain *d)
+				  struct rdt_ctrl_domain *d)
 {
 	kfree(d->mbps_val);
 	d->mbps_val = NULL;
@@ -2218,7 +2218,7 @@ static int set_mba_sc(bool mba_sc)
 {
 	struct rdt_resource *r = &rdt_resources_all[RDT_RESOURCE_MBA].r_resctrl;
 	u32 num_closid = resctrl_arch_get_num_closid(r);
-	struct rdt_domain *d;
+	struct rdt_ctrl_domain *d;
 	int i;

 	if (!supports_mba_mbps() || mba_sc == is_mba_sc(r))
@@ -2466,7 +2466,7 @@ static void schemata_list_destroy(void)
 static int rdt_get_tree(struct fs_context *fc)
 {
 	struct rdt_fs_context *ctx = rdt_fc2context(fc);
-	struct rdt_domain *dom;
+	struct rdt_mon_domain *dom;
 	struct rdt_resource *r;
 	int ret;

@@ -2634,10 +2634,10 @@ static int rdt_init_fs_context(struct fs_context *fc)
 static int reset_all_ctrls(struct rdt_resource *r)
 {
 	struct rdt_hw_resource *hw_res = resctrl_to_arch_res(r);
-	struct rdt_hw_domain *hw_dom;
+	struct rdt_hw_ctrl_domain *hw_dom;
 	struct msr_param msr_param;
+	struct rdt_ctrl_domain *d;
 	cpumask_var_t cpu_mask;
-	struct rdt_domain *d;
 	int i;

 	if (!zalloc_cpumask_var(&cpu_mask, GFP_KERNEL))
@@ -2653,7 +2653,7 @@ static int reset_all_ctrls(struct rdt_resource *r)
 	 * from each domain to update the MSRs below.
 	 */
 	list_for_each_entry(d, &r->ctrl_domains, hdr.list) {
-		hw_dom = resctrl_to_arch_dom(d);
+		hw_dom = resctrl_to_arch_ctrl_dom(d);
 		cpumask_set_cpu(cpumask_any(&d->cpu_mask), cpu_mask);

 		for (i = 0; i < hw_res->num_closid; i++)
@@ -2848,7 +2848,7 @@ static void rmdir_mondata_subdir_allrdtgrp(struct rdt_resource *r,
 }

 static int mkdir_mondata_subdir(struct kernfs_node *parent_kn,
-				struct rdt_domain *d,
+				struct rdt_mon_domain *d,
 				struct rdt_resource *r, struct rdtgroup *prgrp)
 {
 	union mon_data_bits priv;
@@ -2897,7 +2897,7 @@ static int mkdir_mondata_subdir(struct kernfs_node *parent_kn,
  * and "monitor" groups with given domain id.
  */
 static void mkdir_mondata_subdir_allrdtgrp(struct rdt_resource *r,
-					   struct rdt_domain *d)
+					   struct rdt_mon_domain *d)
 {
 	struct kernfs_node *parent_kn;
 	struct rdtgroup *prgrp, *crgrp;
@@ -2919,7 +2919,7 @@ static int mkdir_mondata_subdir_alldom(struct kernfs_node *parent_kn,
 				       struct rdt_resource *r,
 				       struct rdtgroup *prgrp)
 {
-	struct rdt_domain *dom;
+	struct rdt_mon_domain *dom;
 	int ret;

 	list_for_each_entry(dom, &r->mon_domains, hdr.list) {
@@ -3021,7 +3021,7 @@ static u32 cbm_ensure_valid(u32 _val, struct rdt_resource *r)
  * Set the RDT domain up to start off with all usable allocations. That is,
  * all shareable and unused bits. All-zero CBM is invalid.
  */
-static int __init_one_rdt_domain(struct rdt_domain *d, struct resctrl_schema *s,
+static int __init_one_rdt_domain(struct rdt_ctrl_domain *d, struct resctrl_schema *s,
 				 u32 closid)
 {
 	enum resctrl_conf_type peer_type = resctrl_peer_type(s->conf_type);
@@ -3101,7 +3101,7 @@ static int __init_one_rdt_domain(struct rdt_domain *d, struct resctrl_schema *s,
  */
 static int rdtgroup_init_cat(struct resctrl_schema *s, u32 closid)
 {
-	struct rdt_domain *d;
+	struct rdt_ctrl_domain *d;
 	int ret;

 	list_for_each_entry(d, &s->res->ctrl_domains, hdr.list) {
@@ -3117,7 +3117,7 @@ static int rdtgroup_init_cat(struct resctrl_schema *s, u32 closid)
 static void rdtgroup_init_mba(struct rdt_resource *r, u32 closid)
 {
 	struct resctrl_staged_config *cfg;
-	struct rdt_domain *d;
+	struct rdt_ctrl_domain *d;

 	list_for_each_entry(d, &r->ctrl_domains, hdr.list) {
 		if (is_mba_sc(r)) {
@@ -3704,14 +3704,14 @@ static int __init rdtgroup_setup_root(void)
 	return ret;
 }

-static void domain_destroy_mon_state(struct rdt_domain *d)
+static void domain_destroy_mon_state(struct rdt_mon_domain *d)
 {
 	bitmap_free(d->rmid_busy_llc);
 	kfree(d->mbm_total);
 	kfree(d->mbm_local);
 }

-void resctrl_offline_ctrl_domain(struct rdt_resource *r, struct rdt_domain *d)
+void resctrl_offline_ctrl_domain(struct rdt_resource *r, struct rdt_ctrl_domain *d)
 {
 	lockdep_assert_held(&rdtgroup_mutex);

@@ -3719,7 +3719,7 @@ void resctrl_offline_ctrl_domain(struct rdt_resource *r, struct rdt_domain *d)
 		mba_sc_domain_destroy(r, d);
 }

-void resctrl_offline_mon_domain(struct rdt_resource *r, struct rdt_domain *d)
+void resctrl_offline_mon_domain(struct rdt_resource *r, struct rdt_mon_domain *d)
 {
 	/*
 	 * If resctrl is mounted, remove all the
@@ -3746,7 +3746,7 @@ void resctrl_offline_mon_domain(struct rdt_resource *r, struct rdt_domain *d)
 	domain_destroy_mon_state(d);
 }

-static int domain_setup_mon_state(struct rdt_resource *r, struct rdt_domain *d)
+static int domain_setup_mon_state(struct rdt_resource *r, struct rdt_mon_domain *d)
 {
 	size_t tsize;

@@ -3776,7 +3776,7 @@ static int domain_setup_mon_state(struct rdt_resource *r, struct rdt_domain *d)
 	return 0;
 }

-int resctrl_online_ctrl_domain(struct rdt_resource *r, struct rdt_domain *d)
+int resctrl_online_ctrl_domain(struct rdt_resource *r, struct rdt_ctrl_domain *d)
 {
 	lockdep_assert_held(&rdtgroup_mutex);

@@ -3787,7 +3787,7 @@ int resctrl_online_ctrl_domain(struct rdt_resource *r, struct rdt_domain *d)
 	return 0;
 }

-int resctrl_online_mon_domain(struct rdt_resource *r, struct rdt_domain *d)
+int resctrl_online_mon_domain(struct rdt_resource *r, struct rdt_mon_domain *d)
 {
 	int err;

--
2.41.0
---
 include/linux/resctrl.h                   | 50 +++++++------
 arch/x86/kernel/cpu/resctrl/internal.h    | 60 ++++++++++------
 arch/x86/kernel/cpu/resctrl/core.c        | 87 ++++++++++++-----------
 arch/x86/kernel/cpu/resctrl/ctrlmondata.c | 32 ++++-----
 arch/x86/kernel/cpu/resctrl/monitor.c     | 40 +++++------
 arch/x86/kernel/cpu/resctrl/pseudo_lock.c |  6 +-
 arch/x86/kernel/cpu/resctrl/rdtgroup.c    | 62 ++++++++--------
 7 files changed, 184 insertions(+), 153 deletions(-)

diff --git a/include/linux/resctrl.h b/include/linux/resctrl.h
index f47b01ae28ca..56fa04cedb50 100644
--- a/include/linux/resctrl.h
+++ b/include/linux/resctrl.h
@@ -70,7 +70,25 @@ struct rdt_domain_hdr {
 };
 
 /**
- * struct rdt_domain - group of CPUs sharing a resctrl resource
+ * struct rdt_ctrl_domain - group of CPUs sharing a resctrl control resource
+ * @hdr:		common header for different domain types
+ * @cpu_mask:		which CPUs share this resource
+ * @plr:		pseudo-locked region (if any) associated with domain
+ * @staged_config:	parsed configuration to be applied
+ * @mbps_val:		When mba_sc is enabled, this holds the array of user
+ *			specified control values for mba_sc in MBps, indexed
+ *			by closid
+ */
+struct rdt_ctrl_domain {
+	struct rdt_domain_hdr		hdr;
+	struct cpumask			cpu_mask;
+	struct pseudo_lock_region	*plr;
+	struct resctrl_staged_config	staged_config[CDP_NUM_TYPES];
+	u32				*mbps_val;
+};
+
+/**
+ * struct rdt_mon_domain - group of CPUs sharing a resctrl monitor resource
  * @hdr:		common header for different domain types
  * @cpu_mask:		which CPUs share this resource
  * @rmid_busy_llc:	bitmap of which limbo RMIDs are above threshold
@@ -80,13 +98,8 @@ struct rdt_domain_hdr {
  * @cqm_limbo:		worker to periodically read CQM h/w counters
  * @mbm_work_cpu:	worker CPU for MBM h/w counters
  * @cqm_work_cpu:	worker CPU for CQM h/w counters
- * @plr:		pseudo-locked region (if any) associated with domain
- * @staged_config:	parsed configuration to be applied
- * @mbps_val:		When mba_sc is enabled, this holds the array of user
- *			specified control values for mba_sc in MBps, indexed
- *			by closid
  */
-struct rdt_domain {
+struct rdt_mon_domain {
 	struct rdt_domain_hdr		hdr;
 	struct cpumask			cpu_mask;
 	unsigned long			*rmid_busy_llc;
@@ -96,9 +109,6 @@ struct rdt_domain {
 	struct delayed_work		cqm_limbo;
 	int				mbm_work_cpu;
 	int				cqm_work_cpu;
-	struct pseudo_lock_region	*plr;
-	struct resctrl_staged_config	staged_config[CDP_NUM_TYPES];
-	u32				*mbps_val;
 };
 
 /**
@@ -202,7 +212,7 @@ struct rdt_resource {
 	const char		*format_str;
 	int			(*parse_ctrlval)(struct rdt_parse_data *data,
 						 struct resctrl_schema *s,
-						 struct rdt_domain *d);
+						 struct rdt_ctrl_domain *d);
 	struct list_head	evt_list;
 	unsigned long		fflags;
 	bool			cdp_capable;
@@ -236,15 +246,15 @@ int resctrl_arch_update_domains(struct rdt_resource *r, u32 closid);
  * Update the ctrl_val and apply this config right now.
  * Must be called on one of the domain's CPUs.
  */
-int resctrl_arch_update_one(struct rdt_resource *r, struct rdt_domain *d,
+int resctrl_arch_update_one(struct rdt_resource *r, struct rdt_ctrl_domain *d,
 			    u32 closid, enum resctrl_conf_type t, u32 cfg_val);
 
-u32 resctrl_arch_get_config(struct rdt_resource *r, struct rdt_domain *d,
+u32 resctrl_arch_get_config(struct rdt_resource *r, struct rdt_ctrl_domain *d,
 			    u32 closid, enum resctrl_conf_type type);
-int resctrl_online_ctrl_domain(struct rdt_resource *r, struct rdt_domain *d);
-int resctrl_online_mon_domain(struct rdt_resource *r, struct rdt_domain *d);
-void resctrl_offline_ctrl_domain(struct rdt_resource *r, struct rdt_domain *d);
-void resctrl_offline_mon_domain(struct rdt_resource *r, struct rdt_domain *d);
+int resctrl_online_ctrl_domain(struct rdt_resource *r, struct rdt_ctrl_domain *d);
+int resctrl_online_mon_domain(struct rdt_resource *r, struct rdt_mon_domain *d);
+void resctrl_offline_ctrl_domain(struct rdt_resource *r, struct rdt_ctrl_domain *d);
+void resctrl_offline_mon_domain(struct rdt_resource *r, struct rdt_mon_domain *d);
 
 /**
  * resctrl_arch_rmid_read() - Read the eventid counter corresponding to rmid
@@ -260,7 +270,7 @@ void resctrl_offline_mon_domain(struct rdt_resource *r, struct rdt_domain *d);
  * Return:
  * 0 on success, or -EIO, -EINVAL etc on error.
  */
-int resctrl_arch_rmid_read(struct rdt_resource *r, struct rdt_domain *d,
+int resctrl_arch_rmid_read(struct rdt_resource *r, struct rdt_mon_domain *d,
 			   u32 rmid, enum resctrl_event_id eventid, u64 *val);
 
 /**
@@ -273,7 +283,7 @@ int resctrl_arch_rmid_read(struct rdt_resource *r, struct rdt_domain *d,
  *
  * This can be called from any CPU.
  */
-void resctrl_arch_reset_rmid(struct rdt_resource *r, struct rdt_domain *d,
+void resctrl_arch_reset_rmid(struct rdt_resource *r, struct rdt_mon_domain *d,
 			     u32 rmid, enum resctrl_event_id eventid);
 
 /**
@@ -285,7 +295,7 @@ void resctrl_arch_reset_rmid(struct rdt_resource *r, struct rdt_domain *d,
  *
  * This can be called from any CPU.
  */
-void resctrl_arch_reset_rmid_all(struct rdt_resource *r, struct rdt_domain *d);
+void resctrl_arch_reset_rmid_all(struct rdt_resource *r, struct rdt_mon_domain *d);
 
 extern unsigned int resctrl_rmid_realloc_threshold;
 extern unsigned int resctrl_rmid_realloc_limit;
diff --git a/arch/x86/kernel/cpu/resctrl/internal.h b/arch/x86/kernel/cpu/resctrl/internal.h
index e9a2a8993d14..3aed8e7b8487 100644
--- a/arch/x86/kernel/cpu/resctrl/internal.h
+++ b/arch/x86/kernel/cpu/resctrl/internal.h
@@ -106,7 +106,7 @@ union mon_data_bits {
 struct rmid_read {
 	struct rdtgroup		*rgrp;
 	struct rdt_resource	*r;
-	struct rdt_domain	*d;
+	struct rdt_mon_domain	*d;
 	enum resctrl_event_id	evtid;
 	bool			first;
 	int			err;
@@ -191,7 +191,7 @@ struct mongroup {
  */
 struct pseudo_lock_region {
 	struct resctrl_schema	*s;
-	struct rdt_domain	*d;
+	struct rdt_ctrl_domain	*d;
 	u32			cbm;
 	wait_queue_head_t	lock_thread_wq;
 	int			thread_done;
@@ -319,25 +319,41 @@ struct arch_mbm_state {
 };
 
 /**
- * struct rdt_hw_domain - Arch private attributes of a set of CPUs that share
- *			  a resource
+ * struct rdt_hw_ctrl_domain - Arch private attributes of a set of CPUs that share
+ *			       a resource for a control function
  * @d_resctrl:	Properties exposed to the resctrl file system
  * @ctrl_val:	array of cache or mem ctrl values (indexed by CLOSID)
+ *
+ * Members of this structure are accessed via helpers that provide abstraction.
+ */
+struct rdt_hw_ctrl_domain {
+	struct rdt_ctrl_domain		d_resctrl;
+	u32				*ctrl_val;
+};
+
+/**
+ * struct rdt_hw_mon_domain - Arch private attributes of a set of CPUs that share
+ *			      a resource for a monitor function
+ * @d_resctrl:	Properties exposed to the resctrl file system
  * @arch_mbm_total:	arch private state for MBM total bandwidth
  * @arch_mbm_local:	arch private state for MBM local bandwidth
  *
  * Members of this structure are accessed via helpers that provide abstraction.
  */
-struct rdt_hw_domain {
-	struct rdt_domain		d_resctrl;
-	u32				*ctrl_val;
+struct rdt_hw_mon_domain {
+	struct rdt_mon_domain		d_resctrl;
 	struct arch_mbm_state		*arch_mbm_total;
 	struct arch_mbm_state		*arch_mbm_local;
 };
 
-static inline struct rdt_hw_domain *resctrl_to_arch_dom(struct rdt_domain *r)
+static inline struct rdt_hw_ctrl_domain *resctrl_to_arch_ctrl_dom(struct rdt_ctrl_domain *r)
 {
-	return container_of(r, struct rdt_hw_domain, d_resctrl);
+	return container_of(r, struct rdt_hw_ctrl_domain, d_resctrl);
+}
+
+static inline struct rdt_hw_mon_domain *resctrl_to_arch_mon_dom(struct rdt_mon_domain *r)
+{
+	return container_of(r, struct rdt_hw_mon_domain, d_resctrl);
 }
 
 /**
@@ -405,7 +421,7 @@ struct rdt_hw_resource {
 	struct rdt_resource	r_resctrl;
 	u32			num_closid;
 	unsigned int		msr_base;
-	void (*msr_update)	(struct rdt_domain *d, struct msr_param *m,
+	void (*msr_update)	(struct rdt_ctrl_domain *d, struct msr_param *m,
 				 struct rdt_resource *r);
 	unsigned int		mon_scale;
 	unsigned int		mbm_width;
@@ -418,9 +434,9 @@ static inline struct rdt_hw_resource *resctrl_to_arch_res(struct rdt_resource *r
 }
 
 int parse_cbm(struct rdt_parse_data *data, struct resctrl_schema *s,
-	      struct rdt_domain *d);
+	      struct rdt_ctrl_domain *d);
 int parse_bw(struct rdt_parse_data *data, struct resctrl_schema *s,
-	     struct rdt_domain *d);
+	     struct rdt_ctrl_domain *d);
 
 extern struct mutex rdtgroup_mutex;
 
@@ -517,21 +533,21 @@ ssize_t rdtgroup_schemata_write(struct kernfs_open_file *of,
 				char *buf, size_t nbytes, loff_t off);
 int rdtgroup_schemata_show(struct kernfs_open_file *of,
 			   struct seq_file *s, void *v);
-bool rdtgroup_cbm_overlaps(struct resctrl_schema *s, struct rdt_domain *d,
+bool rdtgroup_cbm_overlaps(struct resctrl_schema *s, struct rdt_ctrl_domain *d,
 			   unsigned long cbm, int closid, bool exclusive);
-unsigned int rdtgroup_cbm_to_size(struct rdt_resource *r, struct rdt_domain *d,
+unsigned int rdtgroup_cbm_to_size(struct rdt_resource *r, struct rdt_ctrl_domain *d,
 				  unsigned long cbm);
 enum rdtgrp_mode rdtgroup_mode_by_closid(int closid);
 int rdtgroup_tasks_assigned(struct rdtgroup *r);
 int rdtgroup_locksetup_enter(struct rdtgroup *rdtgrp);
 int rdtgroup_locksetup_exit(struct rdtgroup *rdtgrp);
-bool rdtgroup_cbm_overlaps_pseudo_locked(struct rdt_domain *d, unsigned long cbm);
-bool rdtgroup_pseudo_locked_in_hierarchy(struct rdt_domain *d);
+bool rdtgroup_cbm_overlaps_pseudo_locked(struct rdt_ctrl_domain *d, unsigned long cbm);
+bool rdtgroup_pseudo_locked_in_hierarchy(struct rdt_ctrl_domain *d);
 int rdt_pseudo_lock_init(void);
 void rdt_pseudo_lock_release(void);
 int rdtgroup_pseudo_lock_create(struct rdtgroup *rdtgrp);
 void rdtgroup_pseudo_lock_remove(struct rdtgroup *rdtgrp);
-struct rdt_domain *get_domain_from_cpu(int cpu, struct rdt_resource *r);
+struct rdt_ctrl_domain *get_domain_from_cpu(int cpu, struct rdt_resource *r);
 int closids_supported(void);
 void closid_free(int closid);
 int alloc_rmid(void);
@@ -541,17 +557,17 @@ bool __init rdt_cpu_has(int flag);
 void mon_event_count(void *info);
 int rdtgroup_mondata_show(struct seq_file *m, void *arg);
 void mon_event_read(struct rmid_read *rr, struct rdt_resource *r,
-		    struct rdt_domain *d, struct rdtgroup *rdtgrp,
+		    struct rdt_mon_domain *d, struct rdtgroup *rdtgrp,
 		    int evtid, int first);
-void mbm_setup_overflow_handler(struct rdt_domain *dom,
+void mbm_setup_overflow_handler(struct rdt_mon_domain *dom,
 				unsigned long delay_ms);
 void mbm_handle_overflow(struct work_struct *work);
 void __init intel_rdt_mbm_apply_quirk(void);
 bool is_mba_sc(struct rdt_resource *r);
-void cqm_setup_limbo_handler(struct rdt_domain *dom, unsigned long delay_ms);
+void cqm_setup_limbo_handler(struct rdt_mon_domain *dom, unsigned long delay_ms);
 void cqm_handle_limbo(struct work_struct *work);
-bool has_busy_rmid(struct rdt_resource *r, struct rdt_domain *d);
-void __check_limbo(struct rdt_domain *d, bool force_free);
+bool has_busy_rmid(struct rdt_resource *r, struct rdt_mon_domain *d);
+void __check_limbo(struct rdt_mon_domain *d, bool force_free);
 void rdt_domain_reconfigure_cdp(struct rdt_resource *r);
 void __init thread_throttle_mode_init(void);
 void __init mbm_config_rftype_init(const char *config);
diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
index ae2b66d38afc..d92fdce4e44f 100644
--- a/arch/x86/kernel/cpu/resctrl/core.c
+++ b/arch/x86/kernel/cpu/resctrl/core.c
@@ -49,12 +49,12 @@ int max_name_width, max_data_width;
 bool rdt_alloc_capable;
 
 static void
-mba_wrmsr_intel(struct rdt_domain *d, struct msr_param *m,
+mba_wrmsr_intel(struct rdt_ctrl_domain *d, struct msr_param *m,
 		struct rdt_resource *r);
 static void
-cat_wrmsr(struct rdt_domain *d, struct msr_param *m, struct rdt_resource *r);
+cat_wrmsr(struct rdt_ctrl_domain *d, struct msr_param *m, struct rdt_resource *r);
 static void
-mba_wrmsr_amd(struct rdt_domain *d, struct msr_param *m,
+mba_wrmsr_amd(struct rdt_ctrl_domain *d, struct msr_param *m,
 	      struct rdt_resource *r);
 
 #define ctrl_domain_init(id) LIST_HEAD_INIT(rdt_resources_all[id].r_resctrl.ctrl_domains)
@@ -303,11 +303,11 @@ static void rdt_get_cdp_l2_config(void)
 }
 
 static void
-mba_wrmsr_amd(struct rdt_domain *d, struct msr_param *m, struct rdt_resource *r)
+mba_wrmsr_amd(struct rdt_ctrl_domain *d, struct msr_param *m, struct rdt_resource *r)
 {
-	unsigned int i;
-	struct rdt_hw_domain *hw_dom = resctrl_to_arch_dom(d);
+	struct rdt_hw_ctrl_domain *hw_dom = resctrl_to_arch_ctrl_dom(d);
 	struct rdt_hw_resource *hw_res = resctrl_to_arch_res(r);
+	unsigned int i;
 
 	for (i = m->low; i < m->high; i++)
 		wrmsrl(hw_res->msr_base + i, hw_dom->ctrl_val[i]);
@@ -328,12 +328,12 @@ static u32 delay_bw_map(unsigned long bw, struct rdt_resource *r)
 }
 
 static void
-mba_wrmsr_intel(struct rdt_domain *d, struct msr_param *m,
+mba_wrmsr_intel(struct rdt_ctrl_domain *d, struct msr_param *m,
 		struct rdt_resource *r)
 {
-	unsigned int i;
-	struct rdt_hw_domain *hw_dom = resctrl_to_arch_dom(d);
+	struct rdt_hw_ctrl_domain *hw_dom = resctrl_to_arch_ctrl_dom(d);
 	struct rdt_hw_resource *hw_res = resctrl_to_arch_res(r);
+	unsigned int i;
 
 	/*  Write the delay values for mba. */
 	for (i = m->low; i < m->high; i++)
@@ -341,19 +341,19 @@ mba_wrmsr_intel(struct rdt_domain *d, struct msr_param *m,
 }
 
 static void
-cat_wrmsr(struct rdt_domain *d, struct msr_param *m, struct rdt_resource *r)
+cat_wrmsr(struct rdt_ctrl_domain *d, struct msr_param *m, struct rdt_resource *r)
 {
-	unsigned int i;
-	struct rdt_hw_domain *hw_dom = resctrl_to_arch_dom(d);
+	struct rdt_hw_ctrl_domain *hw_dom = resctrl_to_arch_ctrl_dom(d);
 	struct rdt_hw_resource *hw_res = resctrl_to_arch_res(r);
+	unsigned int i;
 
 	for (i = m->low; i < m->high; i++)
 		wrmsrl(hw_res->msr_base + i, hw_dom->ctrl_val[i]);
 }
 
-struct rdt_domain *get_domain_from_cpu(int cpu, struct rdt_resource *r)
+struct rdt_ctrl_domain *get_domain_from_cpu(int cpu, struct rdt_resource *r)
 {
-	struct rdt_domain *d;
+	struct rdt_ctrl_domain *d;
 
 	list_for_each_entry(d, &r->ctrl_domains, hdr.list) {
 		/* Find the domain that contains this CPU */
@@ -375,7 +375,7 @@ void rdt_ctrl_update(void *arg)
 	struct rdt_hw_resource *hw_res = resctrl_to_arch_res(m->res);
 	struct rdt_resource *r = m->res;
 	int cpu = smp_processor_id();
-	struct rdt_domain *d;
+	struct rdt_ctrl_domain *d;
 
 	d = get_domain_from_cpu(cpu, r);
 	if (d) {
@@ -443,18 +443,23 @@ static void setup_default_ctrlval(struct rdt_resource *r, u32 *dc)
 		*dc = r->default_ctrl;
 }
 
-static void domain_free(struct rdt_hw_domain *hw_dom)
+static void ctrl_domain_free(struct rdt_hw_ctrl_domain *hw_dom)
+{
+	kfree(hw_dom->ctrl_val);
+	kfree(hw_dom);
+}
+
+static void mon_domain_free(struct rdt_hw_mon_domain *hw_dom)
 {
 	kfree(hw_dom->arch_mbm_total);
 	kfree(hw_dom->arch_mbm_local);
-	kfree(hw_dom->ctrl_val);
 	kfree(hw_dom);
 }
 
-static int domain_setup_ctrlval(struct rdt_resource *r, struct rdt_domain *d)
+static int domain_setup_ctrlval(struct rdt_resource *r, struct rdt_ctrl_domain *d)
 {
+	struct rdt_hw_ctrl_domain *hw_dom = resctrl_to_arch_ctrl_dom(d);
 	struct rdt_hw_resource *hw_res = resctrl_to_arch_res(r);
-	struct rdt_hw_domain *hw_dom = resctrl_to_arch_dom(d);
 	struct msr_param m;
 	u32 *dc;
 
@@ -477,7 +482,7 @@ static int domain_setup_ctrlval(struct rdt_resource *r, struct rdt_domain *d)
  * @num_rmid:	The size of the MBM counter array
  * @hw_dom:	The domain that owns the allocated arrays
  */
-static int arch_domain_mbm_alloc(u32 num_rmid, struct rdt_hw_domain *hw_dom)
+static int arch_domain_mbm_alloc(u32 num_rmid, struct rdt_hw_mon_domain *hw_dom)
 {
 	size_t tsize;
 
@@ -516,10 +521,10 @@ static int get_domain_id_from_scope(int cpu, enum resctrl_scope scope)
 static void domain_add_cpu_ctrl(int cpu, struct rdt_resource *r)
 {
 	int id = get_domain_id_from_scope(cpu, r->ctrl_scope);
+	struct rdt_hw_ctrl_domain *hw_dom;
 	struct list_head *add_pos = NULL;
-	struct rdt_hw_domain *hw_dom;
 	struct rdt_domain_hdr *hdr;
-	struct rdt_domain *d;
+	struct rdt_ctrl_domain *d;
 	int err;
 
 	if (id < 0) {
@@ -537,7 +542,7 @@ static void domain_add_cpu_ctrl(int cpu, struct rdt_resource *r)
 	if (WARN_ON_ONCE(hdr->type != RESCTRL_CTRL_DOMAIN))
 		return;
 
-	d = container_of(hdr, struct rdt_domain, hdr);
+	d = container_of(hdr, struct rdt_ctrl_domain, hdr);
 
 	if (d) {
 		cpumask_set_cpu(cpu, &d->cpu_mask);
@@ -558,7 +563,7 @@ static void domain_add_cpu_ctrl(int cpu, struct rdt_resource *r)
 	rdt_domain_reconfigure_cdp(r);
 
 	if (domain_setup_ctrlval(r, d)) {
-		domain_free(hw_dom);
+		ctrl_domain_free(hw_dom);
 		return;
 	}
 
@@ -567,17 +572,17 @@ static void domain_add_cpu_ctrl(int cpu, struct rdt_resource *r)
 	err = resctrl_online_ctrl_domain(r, d);
 	if (err) {
 		list_del(&d->hdr.list);
-		domain_free(hw_dom);
+		ctrl_domain_free(hw_dom);
 	}
 }
 
 static void domain_add_cpu_mon(int cpu, struct rdt_resource *r)
 {
 	int id = get_domain_id_from_scope(cpu, r->mon_scope);
+	struct rdt_hw_mon_domain *hw_dom;
 	struct list_head *add_pos = NULL;
-	struct rdt_hw_domain *hw_dom;
 	struct rdt_domain_hdr *hdr;
-	struct rdt_domain *d;
+	struct rdt_mon_domain *d;
 	int err;
 
 	if (id < 0) {
@@ -595,7 +600,7 @@ static void domain_add_cpu_mon(int cpu, struct rdt_resource *r)
 	if (WARN_ON_ONCE(hdr->type != RESCTRL_MON_DOMAIN))
 		return;
 
-	d = container_of(hdr, struct rdt_domain, hdr);
+	d = container_of(hdr, struct rdt_mon_domain, hdr);
 
 	if (d) {
 		cpumask_set_cpu(cpu, &d->cpu_mask);
@@ -612,7 +617,7 @@ static void domain_add_cpu_mon(int cpu, struct rdt_resource *r)
 	cpumask_set_cpu(cpu, &d->cpu_mask);
 
 	if (arch_domain_mbm_alloc(r->num_rmid, hw_dom)) {
-		domain_free(hw_dom);
+		mon_domain_free(hw_dom);
 		return;
 	}
 
@@ -621,7 +626,7 @@ static void domain_add_cpu_mon(int cpu, struct rdt_resource *r)
 	err = resctrl_online_mon_domain(r, d);
 	if (err) {
 		list_del(&d->hdr.list);
-		domain_free(hw_dom);
+		mon_domain_free(hw_dom);
 	}
 }
 
@@ -639,9 +644,9 @@ static void domain_add_cpu(int cpu, struct rdt_resource *r)
 static void domain_remove_cpu_ctrl(int cpu, struct rdt_resource *r)
 {
 	int id = get_domain_id_from_scope(cpu, r->ctrl_scope);
-	struct rdt_hw_domain *hw_dom;
+	struct rdt_hw_ctrl_domain *hw_dom;
 	struct rdt_domain_hdr *hdr;
-	struct rdt_domain *d;
+	struct rdt_ctrl_domain *d;
 
 	if (id < 0)
 		return;
@@ -655,8 +660,8 @@ static void domain_remove_cpu_ctrl(int cpu, struct rdt_resource *r)
 	if (WARN_ON_ONCE(hdr->type != RESCTRL_CTRL_DOMAIN))
 		return;
 
-	d = container_of(hdr, struct rdt_domain, hdr);
-	hw_dom = resctrl_to_arch_dom(d);
+	d = container_of(hdr, struct rdt_ctrl_domain, hdr);
+	hw_dom = resctrl_to_arch_ctrl_dom(d);
 
 	cpumask_clear_cpu(cpu, &d->cpu_mask);
 	if (cpumask_empty(&d->cpu_mask)) {
@@ -664,12 +669,12 @@ static void domain_remove_cpu_ctrl(int cpu, struct rdt_resource *r)
 		list_del(&d->hdr.list);
 
 		/*
-		 * rdt_domain "d" is going to be freed below, so clear
+		 * rdt_ctrl_domain "d" is going to be freed below, so clear
 		 * its pointer from pseudo_lock_region struct.
 		 */
 		if (d->plr)
 			d->plr->d = NULL;
-		domain_free(hw_dom);
+		ctrl_domain_free(hw_dom);
 
 		return;
 	}
@@ -678,9 +683,9 @@ static void domain_remove_cpu_ctrl(int cpu, struct rdt_resource *r)
 static void domain_remove_cpu_mon(int cpu, struct rdt_resource *r)
 {
 	int id = get_domain_id_from_scope(cpu, r->mon_scope);
-	struct rdt_hw_domain *hw_dom;
+	struct rdt_hw_mon_domain *hw_dom;
 	struct rdt_domain_hdr *hdr;
-	struct rdt_domain *d;
+	struct rdt_mon_domain *d;
 
 	if (id < 0)
 		return;
@@ -694,14 +699,14 @@ static void domain_remove_cpu_mon(int cpu, struct rdt_resource *r)
 	if (WARN_ON_ONCE(hdr->type != RESCTRL_MON_DOMAIN))
 		return;
 
-	d = container_of(hdr, struct rdt_domain, hdr);
-	hw_dom = resctrl_to_arch_dom(d);
+	d = container_of(hdr, struct rdt_mon_domain, hdr);
+	hw_dom = resctrl_to_arch_mon_dom(d);
 
 	cpumask_clear_cpu(cpu, &d->cpu_mask);
 	if (cpumask_empty(&d->cpu_mask)) {
 		resctrl_offline_mon_domain(r, d);
 		list_del(&d->hdr.list);
-		domain_free(hw_dom);
+		mon_domain_free(hw_dom);
 
 		return;
 	}
diff --git a/arch/x86/kernel/cpu/resctrl/ctrlmondata.c b/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
index 33ff4d00a08c..b0b868baaaa3 100644
--- a/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
+++ b/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
@@ -58,7 +58,7 @@ static bool bw_validate(char *buf, unsigned long *data, struct rdt_resource *r)
 }
 
 int parse_bw(struct rdt_parse_data *data, struct resctrl_schema *s,
-	     struct rdt_domain *d)
+	     struct rdt_ctrl_domain *d)
 {
 	struct resctrl_staged_config *cfg;
 	u32 closid = data->rdtgrp->closid;
@@ -135,7 +135,7 @@ static bool cbm_validate(char *buf, u32 *data, struct rdt_resource *r)
  * resource type.
  */
 int parse_cbm(struct rdt_parse_data *data, struct resctrl_schema *s,
-	      struct rdt_domain *d)
+	      struct rdt_ctrl_domain *d)
 {
 	struct rdtgroup *rdtgrp = data->rdtgrp;
 	struct resctrl_staged_config *cfg;
@@ -205,7 +205,7 @@ static int parse_line(char *line, struct resctrl_schema *s,
 	struct rdt_resource *r = s->res;
 	struct rdt_parse_data data;
 	char *dom = NULL, *id;
-	struct rdt_domain *d;
+	struct rdt_ctrl_domain *d;
 	unsigned long dom_id;
 
 	if (rdtgrp->mode == RDT_MODE_PSEUDO_LOCKSETUP &&
@@ -265,11 +265,11 @@ static u32 get_config_index(u32 closid, enum resctrl_conf_type type)
 	}
 }
 
-static bool apply_config(struct rdt_hw_domain *hw_dom,
+static bool apply_config(struct rdt_hw_ctrl_domain *hw_dom,
 			 struct resctrl_staged_config *cfg, u32 idx,
 			 cpumask_var_t cpu_mask)
 {
-	struct rdt_domain *dom = &hw_dom->d_resctrl;
+	struct rdt_ctrl_domain *dom = &hw_dom->d_resctrl;
 
 	if (cfg->new_ctrl != hw_dom->ctrl_val[idx]) {
 		cpumask_set_cpu(cpumask_any(&dom->cpu_mask), cpu_mask);
@@ -281,11 +281,11 @@ static bool apply_config(struct rdt_hw_domain *hw_dom,
 	return false;
 }
 
-int resctrl_arch_update_one(struct rdt_resource *r, struct rdt_domain *d,
+int resctrl_arch_update_one(struct rdt_resource *r, struct rdt_ctrl_domain *d,
 			    u32 closid, enum resctrl_conf_type t, u32 cfg_val)
 {
+	struct rdt_hw_ctrl_domain *hw_dom = resctrl_to_arch_ctrl_dom(d);
 	struct rdt_hw_resource *hw_res = resctrl_to_arch_res(r);
-	struct rdt_hw_domain *hw_dom = resctrl_to_arch_dom(d);
 	u32 idx = get_config_index(closid, t);
 	struct msr_param msr_param;
 
@@ -305,11 +305,11 @@ int resctrl_arch_update_one(struct rdt_resource *r, struct rdt_domain *d,
 int resctrl_arch_update_domains(struct rdt_resource *r, u32 closid)
 {
 	struct resctrl_staged_config *cfg;
-	struct rdt_hw_domain *hw_dom;
+	struct rdt_hw_ctrl_domain *hw_dom;
 	struct msr_param msr_param;
 	enum resctrl_conf_type t;
+	struct rdt_ctrl_domain *d;
 	cpumask_var_t cpu_mask;
-	struct rdt_domain *d;
 	u32 idx;
 
 	if (!zalloc_cpumask_var(&cpu_mask, GFP_KERNEL))
@@ -317,7 +317,7 @@ int resctrl_arch_update_domains(struct rdt_resource *r, u32 closid)
 
 	msr_param.res = NULL;
 	list_for_each_entry(d, &r->ctrl_domains, hdr.list) {
-		hw_dom = resctrl_to_arch_dom(d);
+		hw_dom = resctrl_to_arch_ctrl_dom(d);
 		for (t = 0; t < CDP_NUM_TYPES; t++) {
 			cfg = &hw_dom->d_resctrl.staged_config[t];
 			if (!cfg->have_new_ctrl)
@@ -447,10 +447,10 @@ ssize_t rdtgroup_schemata_write(struct kernfs_open_file *of,
 	return ret ?: nbytes;
 }
 
-u32 resctrl_arch_get_config(struct rdt_resource *r, struct rdt_domain *d,
+u32 resctrl_arch_get_config(struct rdt_resource *r, struct rdt_ctrl_domain *d,
 			    u32 closid, enum resctrl_conf_type type)
 {
-	struct rdt_hw_domain *hw_dom = resctrl_to_arch_dom(d);
+	struct rdt_hw_ctrl_domain *hw_dom = resctrl_to_arch_ctrl_dom(d);
 	u32 idx = get_config_index(closid, type);
 
 	return hw_dom->ctrl_val[idx];
@@ -459,7 +459,7 @@ u32 resctrl_arch_get_config(struct rdt_resource *r, struct rdt_domain *d,
 static void show_doms(struct seq_file *s, struct resctrl_schema *schema, int closid)
 {
 	struct rdt_resource *r = schema->res;
-	struct rdt_domain *dom;
+	struct rdt_ctrl_domain *dom;
 	bool sep = false;
 	u32 ctrl_val;
 
@@ -521,7 +521,7 @@ int rdtgroup_schemata_show(struct kernfs_open_file *of,
 }
 
 void mon_event_read(struct rmid_read *rr, struct rdt_resource *r,
-		    struct rdt_domain *d, struct rdtgroup *rdtgrp,
+		    struct rdt_mon_domain *d, struct rdtgroup *rdtgrp,
 		    int evtid, int first)
 {
 	/*
@@ -541,11 +541,11 @@ int rdtgroup_mondata_show(struct seq_file *m, void *arg)
 {
 	struct kernfs_open_file *of = m->private;
 	struct rdt_domain_hdr *hdr;
+	struct rdt_mon_domain *d;
 	u32 resid, evtid, domid;
 	struct rdtgroup *rdtgrp;
 	struct rdt_resource *r;
 	union mon_data_bits md;
-	struct rdt_domain *d;
 	struct rmid_read rr;
 	int ret = 0;
 
@@ -572,7 +572,7 @@ int rdtgroup_mondata_show(struct seq_file *m, void *arg)
 		goto out;
 	}
 
-	d = container_of(hdr, struct rdt_domain, hdr);
+	d = container_of(hdr, struct rdt_mon_domain, hdr);
 
 	mon_event_read(&rr, r, d, rdtgrp, evtid, false);
 
diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c
index 3265b8499e2a..97d2ed829f5d 100644
--- a/arch/x86/kernel/cpu/resctrl/monitor.c
+++ b/arch/x86/kernel/cpu/resctrl/monitor.c
@@ -170,7 +170,7 @@ static int __rmid_read(u32 rmid, enum resctrl_event_id eventid, u64 *val)
 	return 0;
 }
 
-static struct arch_mbm_state *get_arch_mbm_state(struct rdt_hw_domain *hw_dom,
+static struct arch_mbm_state *get_arch_mbm_state(struct rdt_hw_mon_domain *hw_dom,
 						 u32 rmid,
 						 enum resctrl_event_id eventid)
 {
@@ -189,10 +189,10 @@ static struct arch_mbm_state *get_arch_mbm_state(struct rdt_hw_domain *hw_dom,
 	return NULL;
 }
 
-void resctrl_arch_reset_rmid(struct rdt_resource *r, struct rdt_domain *d,
+void resctrl_arch_reset_rmid(struct rdt_resource *r, struct rdt_mon_domain *d,
 			     u32 rmid, enum resctrl_event_id eventid)
 {
-	struct rdt_hw_domain *hw_dom = resctrl_to_arch_dom(d);
+	struct rdt_hw_mon_domain *hw_dom = resctrl_to_arch_mon_dom(d);
 	struct arch_mbm_state *am;
 
 	am = get_arch_mbm_state(hw_dom, rmid, eventid);
@@ -208,9 +208,9 @@ void resctrl_arch_reset_rmid(struct rdt_resource *r, struct rdt_domain *d,
  * Assumes that hardware counters are also reset and thus that there is
  * no need to record initial non-zero counts.
  */
-void resctrl_arch_reset_rmid_all(struct rdt_resource *r, struct rdt_domain *d)
+void resctrl_arch_reset_rmid_all(struct rdt_resource *r, struct rdt_mon_domain *d)
 {
-	struct rdt_hw_domain *hw_dom = resctrl_to_arch_dom(d);
+	struct rdt_hw_mon_domain *hw_dom = resctrl_to_arch_mon_dom(d);
 
 	if (is_mbm_total_enabled())
 		memset(hw_dom->arch_mbm_total, 0,
@@ -229,11 +229,11 @@ static u64 mbm_overflow_count(u64 prev_msr, u64 cur_msr, unsigned int width)
 	return chunks >> shift;
 }
 
-int resctrl_arch_rmid_read(struct rdt_resource *r, struct rdt_domain *d,
+int resctrl_arch_rmid_read(struct rdt_resource *r, struct rdt_mon_domain *d,
 			   u32 rmid, enum resctrl_event_id eventid, u64 *val)
 {
+	struct rdt_hw_mon_domain *hw_dom = resctrl_to_arch_mon_dom(d);
 	struct rdt_hw_resource *hw_res = resctrl_to_arch_res(r);
-	struct rdt_hw_domain *hw_dom = resctrl_to_arch_dom(d);
 	struct arch_mbm_state *am;
 	u64 msr_val, chunks;
 	int ret;
@@ -266,7 +266,7 @@ int resctrl_arch_rmid_read(struct rdt_resource *r, struct rdt_domain *d,
  * decrement the count. If the busy count gets to zero on an RMID, we
  * free the RMID
  */
-void __check_limbo(struct rdt_domain *d, bool force_free)
+void __check_limbo(struct rdt_mon_domain *d, bool force_free)
 {
 	struct rdt_resource *r = &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl;
 	struct rmid_entry *entry;
@@ -305,7 +305,7 @@ void __check_limbo(struct rdt_domain *d, bool force_free)
 	}
 }
 
-bool has_busy_rmid(struct rdt_resource *r, struct rdt_domain *d)
+bool has_busy_rmid(struct rdt_resource *r, struct rdt_mon_domain *d)
 {
 	return find_first_bit(d->rmid_busy_llc, r->num_rmid) != r->num_rmid;
 }
@@ -334,7 +334,7 @@ int alloc_rmid(void)
 static void add_rmid_to_limbo(struct rmid_entry *entry)
 {
 	struct rdt_resource *r = &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl;
-	struct rdt_domain *d;
+	struct rdt_mon_domain *d;
 	int cpu, err;
 	u64 val = 0;
 
@@ -383,7 +383,7 @@ void free_rmid(u32 rmid)
 		list_add_tail(&entry->list, &rmid_free_lru);
 }
 
-static struct mbm_state *get_mbm_state(struct rdt_domain *d, u32 rmid,
+static struct mbm_state *get_mbm_state(struct rdt_mon_domain *d, u32 rmid,
 				       enum resctrl_event_id evtid)
 {
 	switch (evtid) {
@@ -516,13 +516,13 @@ void mon_event_count(void *info)
  * throttle MSRs already have low percentage values.  To avoid
  * unnecessarily restricting such rdtgroups, we also increase the bandwidth.
  */
-static void update_mba_bw(struct rdtgroup *rgrp, struct rdt_domain *dom_mbm)
+static void update_mba_bw(struct rdtgroup *rgrp, struct rdt_mon_domain *dom_mbm)
 {
 	u32 closid, rmid, cur_msr_val, new_msr_val;
 	struct mbm_state *pmbm_data, *cmbm_data;
+	struct rdt_ctrl_domain *dom_mba;
 	u32 cur_bw, delta_bw, user_bw;
 	struct rdt_resource *r_mba;
-	struct rdt_domain *dom_mba;
 	struct list_head *head;
 	struct rdtgroup *entry;
 
@@ -600,7 +600,7 @@ static void update_mba_bw(struct rdtgroup *rgrp, struct rdt_domain *dom_mbm)
 	}
 }
 
-static void mbm_update(struct rdt_resource *r, struct rdt_domain *d, int rmid)
+static void mbm_update(struct rdt_resource *r, struct rdt_mon_domain *d, int rmid)
 {
 	struct rmid_read rr;
 
@@ -640,13 +640,13 @@ void cqm_handle_limbo(struct work_struct *work)
 {
 	unsigned long delay = msecs_to_jiffies(CQM_LIMBOCHECK_INTERVAL);
 	int cpu = smp_processor_id();
+	struct rdt_mon_domain *d;
 	struct rdt_resource *r;
-	struct rdt_domain *d;
 
 	mutex_lock(&rdtgroup_mutex);
 
 	r = &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl;
-	d = container_of(work, struct rdt_domain, cqm_limbo.work);
+	d = container_of(work, struct rdt_mon_domain, cqm_limbo.work);
 
 	__check_limbo(d, false);
 
@@ -656,7 +656,7 @@ void cqm_handle_limbo(struct work_struct *work)
 	mutex_unlock(&rdtgroup_mutex);
 }
 
-void cqm_setup_limbo_handler(struct rdt_domain *dom, unsigned long delay_ms)
+void cqm_setup_limbo_handler(struct rdt_mon_domain *dom, unsigned long delay_ms)
 {
 	unsigned long delay = msecs_to_jiffies(delay_ms);
 	int cpu;
@@ -672,9 +672,9 @@ void mbm_handle_overflow(struct work_struct *work)
 	unsigned long delay = msecs_to_jiffies(MBM_OVERFLOW_INTERVAL);
 	struct rdtgroup *prgrp, *crgrp;
 	int cpu = smp_processor_id();
+	struct rdt_mon_domain *d;
 	struct list_head *head;
 	struct rdt_resource *r;
-	struct rdt_domain *d;
 
 	mutex_lock(&rdtgroup_mutex);
 
@@ -682,7 +682,7 @@ void mbm_handle_overflow(struct work_struct *work)
 		goto out_unlock;
 
 	r = &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl;
-	d = container_of(work, struct rdt_domain, mbm_over.work);
+	d = container_of(work, struct rdt_mon_domain, mbm_over.work);
 
 	list_for_each_entry(prgrp, &rdt_all_groups, rdtgroup_list) {
 		mbm_update(r, d, prgrp->mon.rmid);
@@ -701,7 +701,7 @@ void mbm_handle_overflow(struct work_struct *work)
 	mutex_unlock(&rdtgroup_mutex);
 }
 
-void mbm_setup_overflow_handler(struct rdt_domain *dom, unsigned long delay_ms)
+void mbm_setup_overflow_handler(struct rdt_mon_domain *dom, unsigned long delay_ms)
 {
 	unsigned long delay = msecs_to_jiffies(delay_ms);
 	int cpu;
diff --git a/arch/x86/kernel/cpu/resctrl/pseudo_lock.c b/arch/x86/kernel/cpu/resctrl/pseudo_lock.c
index bda32b4e1c1e..675e9e47af54 100644
--- a/arch/x86/kernel/cpu/resctrl/pseudo_lock.c
+++ b/arch/x86/kernel/cpu/resctrl/pseudo_lock.c
@@ -814,7 +814,7 @@ int rdtgroup_locksetup_exit(struct rdtgroup *rdtgrp)
  * Return: true if @cbm overlaps with pseudo-locked region on @d, false
  * otherwise.
  */
-bool rdtgroup_cbm_overlaps_pseudo_locked(struct rdt_domain *d, unsigned long cbm)
+bool rdtgroup_cbm_overlaps_pseudo_locked(struct rdt_ctrl_domain *d, unsigned long cbm)
 {
 	unsigned int cbm_len;
 	unsigned long cbm_b;
@@ -841,11 +841,11 @@ bool rdtgroup_cbm_overlaps_pseudo_locked(struct rdt_domain *d, unsigned long cbm
  *         if it is not possible to test due to memory allocation issue,
  *         false otherwise.
  */
-bool rdtgroup_pseudo_locked_in_hierarchy(struct rdt_domain *d)
+bool rdtgroup_pseudo_locked_in_hierarchy(struct rdt_ctrl_domain *d)
 {
+	struct rdt_ctrl_domain *d_i;
 	cpumask_var_t cpu_with_psl;
 	struct rdt_resource *r;
-	struct rdt_domain *d_i;
 	bool ret = false;
 
 	if (!zalloc_cpumask_var(&cpu_with_psl, GFP_KERNEL))
diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
index 90565bb44d0e..afa7a8dca48d 100644
--- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
+++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
@@ -80,8 +80,8 @@ void rdt_last_cmd_printf(const char *fmt, ...)
 
 void rdt_staged_configs_clear(void)
 {
+	struct rdt_ctrl_domain *dom;
 	struct rdt_resource *r;
-	struct rdt_domain *dom;
 
 	lockdep_assert_held(&rdtgroup_mutex);
 
@@ -920,7 +920,7 @@ static int rdt_bit_usage_show(struct kernfs_open_file *of,
 	unsigned long sw_shareable = 0, hw_shareable = 0;
 	unsigned long exclusive = 0, pseudo_locked = 0;
 	struct rdt_resource *r = s->res;
-	struct rdt_domain *dom;
+	struct rdt_ctrl_domain *dom;
 	int i, hwb, swb, excl, psl;
 	enum rdtgrp_mode mode;
 	bool sep = false;
@@ -1137,7 +1137,7 @@ static enum resctrl_conf_type resctrl_peer_type(enum resctrl_conf_type my_type)
  *
  * Return: false if CBM does not overlap, true if it does.
  */
-static bool __rdtgroup_cbm_overlaps(struct rdt_resource *r, struct rdt_domain *d,
+static bool __rdtgroup_cbm_overlaps(struct rdt_resource *r, struct rdt_ctrl_domain *d,
 				    unsigned long cbm, int closid,
 				    enum resctrl_conf_type type, bool exclusive)
 {
@@ -1192,7 +1192,7 @@ static bool __rdtgroup_cbm_overlaps(struct rdt_resource *r, struct rdt_domain *d
  *
  * Return: true if CBM overlap detected, false if there is no overlap
  */
-bool rdtgroup_cbm_overlaps(struct resctrl_schema *s, struct rdt_domain *d,
+bool rdtgroup_cbm_overlaps(struct resctrl_schema *s, struct rdt_ctrl_domain *d,
 			   unsigned long cbm, int closid, bool exclusive)
 {
 	enum resctrl_conf_type peer_type = resctrl_peer_type(s->conf_type);
@@ -1222,10 +1222,10 @@ bool rdtgroup_cbm_overlaps(struct resctrl_schema *s, struct rdt_domain *d,
 static bool rdtgroup_mode_test_exclusive(struct rdtgroup *rdtgrp)
 {
 	int closid = rdtgrp->closid;
+	struct rdt_ctrl_domain *d;
 	struct resctrl_schema *s;
 	struct rdt_resource *r;
 	bool has_cache = false;
-	struct rdt_domain *d;
 	u32 ctrl;
 
 	list_for_each_entry(s, &resctrl_schema_all, list) {
@@ -1339,7 +1339,7 @@ static ssize_t rdtgroup_mode_write(struct kernfs_open_file *of,
  * bitmap functions work correctly.
  */
 unsigned int rdtgroup_cbm_to_size(struct rdt_resource *r,
-				  struct rdt_domain *d, unsigned long cbm)
+				  struct rdt_ctrl_domain *d, unsigned long cbm)
 {
 	struct cpu_cacheinfo *ci;
 	unsigned int size = 0;
@@ -1372,9 +1372,9 @@ static int rdtgroup_size_show(struct kernfs_open_file *of,
 {
 	struct resctrl_schema *schema;
 	enum resctrl_conf_type type;
+	struct rdt_ctrl_domain *d;
 	struct rdtgroup *rdtgrp;
 	struct rdt_resource *r;
-	struct rdt_domain *d;
 	unsigned int size;
 	int ret = 0;
 	u32 closid;
@@ -1486,7 +1486,7 @@ static void mon_event_config_read(void *info)
 	mon_info->mon_config = msrval & MAX_EVT_CONFIG_BITS;
 }
 
-static void mondata_config_read(struct rdt_domain *d, struct mon_config_info *mon_info)
+static void mondata_config_read(struct rdt_mon_domain *d, struct mon_config_info *mon_info)
 {
 	smp_call_function_any(&d->cpu_mask, mon_event_config_read, mon_info, 1);
 }
@@ -1494,7 +1494,7 @@ static void mondata_config_read(struct rdt_domain *d, struct mon_config_info *mo
 static int mbm_config_show(struct seq_file *s, struct rdt_resource *r, u32 evtid)
 {
 	struct mon_config_info mon_info = {0};
-	struct rdt_domain *dom;
+	struct rdt_mon_domain *dom;
 	bool sep = false;
 
 	mutex_lock(&rdtgroup_mutex);
@@ -1551,7 +1551,7 @@ static void mon_event_config_write(void *info)
 }
 
 static int mbm_config_write_domain(struct rdt_resource *r,
-				   struct rdt_domain *d, u32 evtid, u32 val)
+				   struct rdt_mon_domain *d, u32 evtid, u32 val)
 {
 	struct mon_config_info mon_info = {0};
 	int ret = 0;
@@ -1601,7 +1601,7 @@ static int mon_config_write(struct rdt_resource *r, char *tok, u32 evtid)
 {
 	char *dom_str = NULL, *id_str;
 	unsigned long dom_id, val;
-	struct rdt_domain *d;
+	struct rdt_mon_domain *d;
 	int ret = 0;
 
 next:
@@ -2125,9 +2125,9 @@ static inline bool is_mba_linear(void)
 static int set_cache_qos_cfg(int level, bool enable)
 {
 	void (*update)(void *arg);
+	struct rdt_ctrl_domain *d;
 	struct rdt_resource *r_l;
 	cpumask_var_t cpu_mask;
-	struct rdt_domain *d;
 	int cpu;
 
 	if (level == RDT_RESOURCE_L3)
@@ -2174,7 +2174,7 @@ void rdt_domain_reconfigure_cdp(struct rdt_resource *r)
 		l3_qos_cfg_update(&hw_res->cdp_enabled);
 }
 
-static int mba_sc_domain_allocate(struct rdt_resource *r, struct rdt_domain *d)
+static int mba_sc_domain_allocate(struct rdt_resource *r, struct rdt_ctrl_domain *d)
 {
 	u32 num_closid = resctrl_arch_get_num_closid(r);
 	int cpu = cpumask_any(&d->cpu_mask);
@@ -2192,7 +2192,7 @@ static int mba_sc_domain_allocate(struct rdt_resource *r, struct rdt_domain *d)
 }
 
 static void mba_sc_domain_destroy(struct rdt_resource *r,
-				  struct rdt_domain *d)
+				  struct rdt_ctrl_domain *d)
 {
 	kfree(d->mbps_val);
 	d->mbps_val = NULL;
@@ -2218,7 +2218,7 @@ static int set_mba_sc(bool mba_sc)
 {
 	struct rdt_resource *r = &rdt_resources_all[RDT_RESOURCE_MBA].r_resctrl;
 	u32 num_closid = resctrl_arch_get_num_closid(r);
-	struct rdt_domain *d;
+	struct rdt_ctrl_domain *d;
 	int i;
 
 	if (!supports_mba_mbps() || mba_sc == is_mba_sc(r))
@@ -2466,7 +2466,7 @@ static void schemata_list_destroy(void)
 static int rdt_get_tree(struct fs_context *fc)
 {
 	struct rdt_fs_context *ctx = rdt_fc2context(fc);
-	struct rdt_domain *dom;
+	struct rdt_mon_domain *dom;
 	struct rdt_resource *r;
 	int ret;
 
@@ -2634,10 +2634,10 @@ static int rdt_init_fs_context(struct fs_context *fc)
 static int reset_all_ctrls(struct rdt_resource *r)
 {
 	struct rdt_hw_resource *hw_res = resctrl_to_arch_res(r);
-	struct rdt_hw_domain *hw_dom;
+	struct rdt_hw_ctrl_domain *hw_dom;
 	struct msr_param msr_param;
+	struct rdt_ctrl_domain *d;
 	cpumask_var_t cpu_mask;
-	struct rdt_domain *d;
 	int i;
 
 	if (!zalloc_cpumask_var(&cpu_mask, GFP_KERNEL))
@@ -2653,7 +2653,7 @@ static int reset_all_ctrls(struct rdt_resource *r)
 	 * from each domain to update the MSRs below.
 	 */
 	list_for_each_entry(d, &r->ctrl_domains, hdr.list) {
-		hw_dom = resctrl_to_arch_dom(d);
+		hw_dom = resctrl_to_arch_ctrl_dom(d);
 		cpumask_set_cpu(cpumask_any(&d->cpu_mask), cpu_mask);
 
 		for (i = 0; i < hw_res->num_closid; i++)
@@ -2848,7 +2848,7 @@ static void rmdir_mondata_subdir_allrdtgrp(struct rdt_resource *r,
 }
 
 static int mkdir_mondata_subdir(struct kernfs_node *parent_kn,
-				struct rdt_domain *d,
+				struct rdt_mon_domain *d,
 				struct rdt_resource *r, struct rdtgroup *prgrp)
 {
 	union mon_data_bits priv;
@@ -2897,7 +2897,7 @@ static int mkdir_mondata_subdir(struct kernfs_node *parent_kn,
  * and "monitor" groups with given domain id.
  */
 static void mkdir_mondata_subdir_allrdtgrp(struct rdt_resource *r,
-					   struct rdt_domain *d)
+					   struct rdt_mon_domain *d)
 {
 	struct kernfs_node *parent_kn;
 	struct rdtgroup *prgrp, *crgrp;
@@ -2919,7 +2919,7 @@ static int mkdir_mondata_subdir_alldom(struct kernfs_node *parent_kn,
 				       struct rdt_resource *r,
 				       struct rdtgroup *prgrp)
 {
-	struct rdt_domain *dom;
+	struct rdt_mon_domain *dom;
 	int ret;
 
 	list_for_each_entry(dom, &r->mon_domains, hdr.list) {
@@ -3021,7 +3021,7 @@ static u32 cbm_ensure_valid(u32 _val, struct rdt_resource *r)
  * Set the RDT domain up to start off with all usable allocations. That is,
  * all shareable and unused bits. All-zero CBM is invalid.
  */
-static int __init_one_rdt_domain(struct rdt_domain *d, struct resctrl_schema *s,
+static int __init_one_rdt_domain(struct rdt_ctrl_domain *d, struct resctrl_schema *s,
 				 u32 closid)
 {
 	enum resctrl_conf_type peer_type = resctrl_peer_type(s->conf_type);
@@ -3101,7 +3101,7 @@ static int __init_one_rdt_domain(struct rdt_domain *d, struct resctrl_schema *s,
  */
 static int rdtgroup_init_cat(struct resctrl_schema *s, u32 closid)
 {
-	struct rdt_domain *d;
+	struct rdt_ctrl_domain *d;
 	int ret;
 
 	list_for_each_entry(d, &s->res->ctrl_domains, hdr.list) {
@@ -3117,7 +3117,7 @@ static int rdtgroup_init_cat(struct resctrl_schema *s, u32 closid)
 static void rdtgroup_init_mba(struct rdt_resource *r, u32 closid)
 {
 	struct resctrl_staged_config *cfg;
-	struct rdt_domain *d;
+	struct rdt_ctrl_domain *d;
 
 	list_for_each_entry(d, &r->ctrl_domains, hdr.list) {
 		if (is_mba_sc(r)) {
@@ -3704,14 +3704,14 @@ static int __init rdtgroup_setup_root(void)
 	return ret;
 }
 
-static void domain_destroy_mon_state(struct rdt_domain *d)
+static void domain_destroy_mon_state(struct rdt_mon_domain *d)
 {
 	bitmap_free(d->rmid_busy_llc);
 	kfree(d->mbm_total);
 	kfree(d->mbm_local);
 }
 
-void resctrl_offline_ctrl_domain(struct rdt_resource *r, struct rdt_domain *d)
+void resctrl_offline_ctrl_domain(struct rdt_resource *r, struct rdt_ctrl_domain *d)
 {
 	lockdep_assert_held(&rdtgroup_mutex);
 
@@ -3719,7 +3719,7 @@ void resctrl_offline_ctrl_domain(struct rdt_resource *r, struct rdt_domain *d)
 		mba_sc_domain_destroy(r, d);
 }
 
-void resctrl_offline_mon_domain(struct rdt_resource *r, struct rdt_domain *d)
+void resctrl_offline_mon_domain(struct rdt_resource *r, struct rdt_mon_domain *d)
 {
 	lockdep_assert_held(&rdtgroup_mutex);
 
@@ -3748,7 +3748,7 @@ void resctrl_offline_mon_domain(struct rdt_resource *r, struct rdt_domain *d)
 	domain_destroy_mon_state(d);
 }
 
-static int domain_setup_mon_state(struct rdt_resource *r, struct rdt_domain *d)
+static int domain_setup_mon_state(struct rdt_resource *r, struct rdt_mon_domain *d)
 {
 	size_t tsize;
 
@@ -3778,7 +3778,7 @@ static int domain_setup_mon_state(struct rdt_resource *r, struct rdt_domain *d)
 	return 0;
 }
 
-int resctrl_online_ctrl_domain(struct rdt_resource *r, struct rdt_domain *d)
+int resctrl_online_ctrl_domain(struct rdt_resource *r, struct rdt_ctrl_domain *d)
 {
 	lockdep_assert_held(&rdtgroup_mutex);
 
@@ -3789,7 +3789,7 @@ int resctrl_online_ctrl_domain(struct rdt_resource *r, struct rdt_domain *d)
 	return 0;
 }
 
-int resctrl_online_mon_domain(struct rdt_resource *r, struct rdt_domain *d)
+int resctrl_online_mon_domain(struct rdt_resource *r, struct rdt_mon_domain *d)
 {
 	int err;
 
-- 
2.41.0


^ permalink raw reply related	[flat|nested] 331+ messages in thread

* [PATCH v7 5/8] x86/resctrl: Add node-scope to the options for feature scope
  2023-10-03 16:07       ` [PATCH v7 " Tony Luck
                           ` (3 preceding siblings ...)
  2023-10-03 16:07         ` [PATCH v7 4/8] x86/resctrl: Split the rdt_domain and rdt_hw_domain structures Tony Luck
@ 2023-10-03 16:07         ` Tony Luck
  2023-10-03 16:07         ` [PATCH v7 6/8] x86/resctrl: Introduce snc_nodes_per_l3_cache Tony Luck
                           ` (3 subsequent siblings)
  8 siblings, 0 replies; 331+ messages in thread
From: Tony Luck @ 2023-10-03 16:07 UTC (permalink / raw)
  To: Fenghua Yu, Reinette Chatre, Peter Newman, Jonathan Corbet,
	Shuah Khan, x86
  Cc: Shaopeng Tan, James Morse, Jamie Iles, Babu Moger, Randy Dunlap,
	linux-kernel, linux-doc, patches, Tony Luck

All currently supported resctrl features are domain scoped, with the
scope of each domain matching that of the L2 or L3 caches.

Add RESCTRL_NODE as a new option for features that are scoped at the
same granularity as NUMA nodes. This is needed for Intel's Sub-NUMA
Cluster (SNC) feature where monitoring features are node scoped.

Signed-off-by: Tony Luck <tony.luck@intel.com>
Reviewed-by: Peter Newman <peternewman@google.com>
---

No changes since last version

 include/linux/resctrl.h            | 1 +
 arch/x86/kernel/cpu/resctrl/core.c | 2 ++
 2 files changed, 3 insertions(+)

diff --git a/include/linux/resctrl.h b/include/linux/resctrl.h
index 56fa04cedb50..15b98ac91272 100644
--- a/include/linux/resctrl.h
+++ b/include/linux/resctrl.h
@@ -172,6 +172,7 @@ struct resctrl_schema;
 enum resctrl_scope {
 	RESCTRL_L2_CACHE = 2,
 	RESCTRL_L3_CACHE = 3,
+	RESCTRL_NODE,
 };
 
 /**
diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
index d92fdce4e44f..6b937da36e4c 100644
--- a/arch/x86/kernel/cpu/resctrl/core.c
+++ b/arch/x86/kernel/cpu/resctrl/core.c
@@ -511,6 +511,8 @@ static int get_domain_id_from_scope(int cpu, enum resctrl_scope scope)
 	case RESCTRL_L2_CACHE:
 	case RESCTRL_L3_CACHE:
 		return get_cpu_cacheinfo_id(cpu, scope);
+	case RESCTRL_NODE:
+		return cpu_to_node(cpu);
 	default:
 		break;
 	}
-- 
2.41.0


^ permalink raw reply related	[flat|nested] 331+ messages in thread

* [PATCH v7 6/8] x86/resctrl: Introduce snc_nodes_per_l3_cache
  2023-10-03 16:07       ` [PATCH v7 " Tony Luck
                           ` (4 preceding siblings ...)
  2023-10-03 16:07         ` [PATCH v7 5/8] x86/resctrl: Add node-scope to the options for feature scope Tony Luck
@ 2023-10-03 16:07         ` Tony Luck
  2023-10-03 16:07         ` [PATCH v7 7/8] x86/resctrl: Sub NUMA Cluster detection and enable Tony Luck
                           ` (2 subsequent siblings)
  8 siblings, 0 replies; 331+ messages in thread
From: Tony Luck @ 2023-10-03 16:07 UTC (permalink / raw)
  To: Fenghua Yu, Reinette Chatre, Peter Newman, Jonathan Corbet,
	Shuah Khan, x86
  Cc: Shaopeng Tan, James Morse, Jamie Iles, Babu Moger, Randy Dunlap,
	linux-kernel, linux-doc, patches, Tony Luck

Intel Sub-NUMA Cluster (SNC) is a feature that subdivides the CPU cores
and memory controllers on a socket into two or more groups. These are
presented to the operating system as NUMA nodes.

This may enable some workloads to have slightly lower latency to memory
as the memory controller(s) in an SNC node are electrically closer to the
CPU cores on that SNC node. This benefit may be offset by lower bandwidth
since the memory accesses for each core can only be interleaved between
the memory controllers on the same SNC node.

Resctrl monitoring on Intel systems depends upon attaching RMIDs to tasks
to track L3 cache occupancy and memory bandwidth. There is an MSR that
controls how the RMIDs are shared between SNC nodes.

The default mode divides them numerically. E.g. when there are two SNC
nodes on a socket the lower-numbered half of the RMIDs is given to the
first node, the remainder to the second node. This would be difficult
to use with the Linux resctrl interface as specific RMID values assigned
to resctrl groups are not visible to users.

The other mode divides the RMIDs and renumbers the ones on the second
SNC node to start from zero.
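
As a purely illustrative sketch (not part of this patch: the helper name and
its parameters are invented here, and only the arithmetic mirrors what
__rmid_read() does later in this series), the renumbered mode means a
logical RMID used on a given SNC node maps to a physical counter like this:

	/*
	 * Illustrative only: each SNC node owns an equal, contiguous slice
	 * of the physical RMID space, while the logical RMIDs visible to
	 * resctrl start from zero on every node.
	 */
	static unsigned int snc_physical_rmid(unsigned int logical_rmid,
					      unsigned int snc_node_on_socket,
					      unsigned int rmids_per_node)
	{
		/* Node N's slice of physical counters starts at N * rmids_per_node */
		return logical_rmid + snc_node_on_socket * rmids_per_node;
	}

The in-kernel equivalent computes the same offset as
(cpu_to_node(cpu) % snc_nodes_per_l3_cache) * r->num_rmid before writing
IA32_QM_EVTSEL.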

Even with this renumbering SNC mode requires several changes in resctrl
behavior for correct operation.

Add a global integer "snc_nodes_per_l3_cache" that will show how many
SNC nodes share each L3 cache. When this is "1", SNC mode is either
not implemented, or not enabled.

A later patch will detect SNC mode and set snc_nodes_per_l3_cache to
the appropriate value. For now it remains at the default "1" to
indicate SNC mode is not active.

Code that needs to take action when SNC is enabled is (a short sketch of
items 1, 2 and 4 follows this list):
1) The number of logical RMIDs per L3 cache available for use is the
   number of physical RMIDs divided by the number of SNC nodes.
2) Likewise the "mon_scale" value must be adjusted for the number
   of SNC nodes.
3) The RMID renumbering operates when using the value from the
   IA32_PQR_ASSOC MSR to count accesses by a task. When reading an RMID
   counter, code must adjust from the logical RMID used to the physical
   RMID value for the SNC node that it wishes to read and load the
   adjusted value into the IA32_QM_EVTSEL MSR.
4) The L3 cache is divided between the SNC nodes. So the value
   reported in the resctrl "size" file is adjusted.
5) The "-o mba_MBps" mount option must be disabled in SNC mode
   because the monitoring is being done per SNC node, while the
   bandwidth allocation is still done at the L3 cache scope.
   Trying to use this feedback loop might result in contradictory
   changes to the throttling level coming from each of the SNC
   node bandwidth measurements.
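
As flagged above, here is a minimal sketch of items 1, 2 and 4 (illustrative
only, not part of the patch; the function and parameter names are invented,
and only the division by snc_nodes_per_l3_cache reflects this series):

	/*
	 * Sketch only: a per-L3 quantity (number of RMIDs, mon_scale, or the
	 * cache size reported in "size" files) is shared equally between the
	 * SNC nodes on that L3 cache.
	 */
	static unsigned int snc_scale_per_l3_value(unsigned int per_l3_value,
						   unsigned int snc_nodes_per_l3_cache)
	{
		return per_l3_value / snc_nodes_per_l3_cache;
	}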

Signed-off-by: Tony Luck <tony.luck@intel.com>
Reviewed-by: Peter Newman <peternewman@google.com>

---

Changes since last version:

In commit comment s/redumbering/renumbering/

Move check that SNC is not enabled into supports_mba_mbps().

 arch/x86/kernel/cpu/resctrl/internal.h |  2 ++
 arch/x86/kernel/cpu/resctrl/core.c     |  6 ++++++
 arch/x86/kernel/cpu/resctrl/monitor.c  | 16 +++++++++++++---
 arch/x86/kernel/cpu/resctrl/rdtgroup.c |  5 +++--
 4 files changed, 24 insertions(+), 5 deletions(-)

diff --git a/arch/x86/kernel/cpu/resctrl/internal.h b/arch/x86/kernel/cpu/resctrl/internal.h
index 3aed8e7b8487..3fddda401b83 100644
--- a/arch/x86/kernel/cpu/resctrl/internal.h
+++ b/arch/x86/kernel/cpu/resctrl/internal.h
@@ -446,6 +446,8 @@ DECLARE_STATIC_KEY_FALSE(rdt_alloc_enable_key);
 
 extern struct dentry *debugfs_resctrl;
 
+extern int snc_nodes_per_l3_cache;
+
 enum resctrl_res_level {
 	RDT_RESOURCE_L3,
 	RDT_RESOURCE_L2,
diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
index 6b937da36e4c..cd189b7ca6ea 100644
--- a/arch/x86/kernel/cpu/resctrl/core.c
+++ b/arch/x86/kernel/cpu/resctrl/core.c
@@ -48,6 +48,12 @@ int max_name_width, max_data_width;
  */
 bool rdt_alloc_capable;
 
+/*
+ * Number of SNC nodes that share each L3 cache.  Default is 1 for
+ * systems that do not support SNC, or have SNC disabled.
+ */
+int snc_nodes_per_l3_cache = 1;
+
 static void
 mba_wrmsr_intel(struct rdt_ctrl_domain *d, struct msr_param *m,
 		struct rdt_resource *r);
diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c
index 97d2ed829f5d..e6e566921a60 100644
--- a/arch/x86/kernel/cpu/resctrl/monitor.c
+++ b/arch/x86/kernel/cpu/resctrl/monitor.c
@@ -148,8 +148,18 @@ static inline struct rmid_entry *__rmid_entry(u32 rmid)
 
 static int __rmid_read(u32 rmid, enum resctrl_event_id eventid, u64 *val)
 {
+	struct rdt_resource *r = &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl;
+	int cpu = smp_processor_id();
+	int rmid_offset = 0;
 	u64 msr_val;
 
+	/*
+	 * When SNC mode is on, need to compute the offset to read the
+	 * physical RMID counter for the node to which this CPU belongs.
+	 */
+	if (snc_nodes_per_l3_cache > 1)
+		rmid_offset = (cpu_to_node(cpu) % snc_nodes_per_l3_cache) * r->num_rmid;
+
 	/*
 	 * As per the SDM, when IA32_QM_EVTSEL.EvtID (bits 7:0) is configured
 	 * with a valid event code for supported resource type and the bits
@@ -158,7 +168,7 @@ static int __rmid_read(u32 rmid, enum resctrl_event_id eventid, u64 *val)
 	 * IA32_QM_CTR.Error (bit 63) and IA32_QM_CTR.Unavailable (bit 62)
 	 * are error bits.
 	 */
-	wrmsr(MSR_IA32_QM_EVTSEL, eventid, rmid);
+	wrmsr(MSR_IA32_QM_EVTSEL, eventid, rmid + rmid_offset);
 	rdmsrl(MSR_IA32_QM_CTR, msr_val);
 
 	if (msr_val & RMID_VAL_ERROR)
@@ -783,8 +793,8 @@ int __init rdt_get_mon_l3_config(struct rdt_resource *r)
 	int ret;
 
 	resctrl_rmid_realloc_limit = boot_cpu_data.x86_cache_size * 1024;
-	hw_res->mon_scale = boot_cpu_data.x86_cache_occ_scale;
-	r->num_rmid = boot_cpu_data.x86_cache_max_rmid + 1;
+	hw_res->mon_scale = boot_cpu_data.x86_cache_occ_scale / snc_nodes_per_l3_cache;
+	r->num_rmid = (boot_cpu_data.x86_cache_max_rmid + 1) / snc_nodes_per_l3_cache;
 	hw_res->mbm_width = MBM_CNTR_WIDTH_BASE;
 
 	if (mbm_offset > 0 && mbm_offset <= MBM_CNTR_WIDTH_OFFSET_MAX)
diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
index afa7a8dca48d..def203c40d70 100644
--- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
+++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
@@ -1357,7 +1357,7 @@ unsigned int rdtgroup_cbm_to_size(struct rdt_resource *r,
 		}
 	}
 
-	return size;
+	return size / snc_nodes_per_l3_cache;
 }
 
 /**
@@ -2207,7 +2207,8 @@ static bool supports_mba_mbps(void)
 	struct rdt_resource *r = &rdt_resources_all[RDT_RESOURCE_MBA].r_resctrl;
 
 	return (is_mbm_local_enabled() &&
-		r->alloc_capable && is_mba_linear());
+		r->alloc_capable && is_mba_linear() &&
+		snc_nodes_per_l3_cache == 1);
 }
 
 /*
-- 
2.41.0


^ permalink raw reply related	[flat|nested] 331+ messages in thread

* [PATCH v7 7/8] x86/resctrl: Sub NUMA Cluster detection and enable
  2023-10-03 16:07       ` [PATCH v7 " Tony Luck
                           ` (5 preceding siblings ...)
  2023-10-03 16:07         ` [PATCH v7 6/8] x86/resctrl: Introduce snc_nodes_per_l3_cache Tony Luck
@ 2023-10-03 16:07         ` Tony Luck
  2023-10-03 16:07         ` [PATCH v7 8/8] x86/resctrl: Update documentation with Sub-NUMA cluster changes Tony Luck
  2023-10-03 16:16         ` [PATCH v7 0/8] Add support for Sub-NUMA cluster (SNC) systems Luck, Tony
  8 siblings, 0 replies; 331+ messages in thread
From: Tony Luck @ 2023-10-03 16:07 UTC (permalink / raw)
  To: Fenghua Yu, Reinette Chatre, Peter Newman, Jonathan Corbet,
	Shuah Khan, x86
  Cc: Shaopeng Tan, James Morse, Jamie Iles, Babu Moger, Randy Dunlap,
	linux-kernel, linux-doc, patches, Tony Luck

There isn't a simple h/w bit that indicates whether a CPU is
running in Sub NUMA Cluster (SNC) mode. Infer the state by comparing
the ratio of NUMA nodes to L3 cache instances.

When SNC mode is detected, reconfigure the RMID counters by updating
the MSR_RMID_SNC_CONFIG MSR on each socket as CPUs are seen.

Clearing bit zero of the MSR divides the RMIDs and renumbers the ones
on the second SNC node to start from zero. An earlier commit includes
all the required changes in Linux to operate in this reconfigured mode.

Signed-off-by: Tony Luck <tony.luck@intel.com>
Reviewed-by: Peter Newman <peternewman@google.com>

---

Changes since last version:

Moved kfree(node_caches); earlier, to the earliest point where it
is no longer needed.

Added Granite Rapids to list of CPU models that support SNC mode.

 arch/x86/include/asm/msr-index.h   |  1 +
 arch/x86/kernel/cpu/resctrl/core.c | 92 ++++++++++++++++++++++++++++++
 2 files changed, 93 insertions(+)

diff --git a/arch/x86/include/asm/msr-index.h b/arch/x86/include/asm/msr-index.h
index 1d111350197f..393d1b047617 100644
--- a/arch/x86/include/asm/msr-index.h
+++ b/arch/x86/include/asm/msr-index.h
@@ -1100,6 +1100,7 @@
 #define MSR_IA32_QM_CTR			0xc8e
 #define MSR_IA32_PQR_ASSOC		0xc8f
 #define MSR_IA32_L3_CBM_BASE		0xc90
+#define MSR_RMID_SNC_CONFIG		0xca0
 #define MSR_IA32_L2_CBM_BASE		0xd10
 #define MSR_IA32_MBA_THRTL_BASE		0xd50
 
diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
index cd189b7ca6ea..dff294529f27 100644
--- a/arch/x86/kernel/cpu/resctrl/core.c
+++ b/arch/x86/kernel/cpu/resctrl/core.c
@@ -16,11 +16,14 @@
 
 #define pr_fmt(fmt)	"resctrl: " fmt
 
+#include <linux/cpu.h>
 #include <linux/slab.h>
 #include <linux/err.h>
 #include <linux/cacheinfo.h>
 #include <linux/cpuhotplug.h>
+#include <linux/mod_devicetable.h>
 
+#include <asm/cpu_device_id.h>
 #include <asm/intel-family.h>
 #include <asm/resctrl.h>
 #include "internal.h"
@@ -751,11 +754,42 @@ static void clear_closid_rmid(int cpu)
 	wrmsr(MSR_IA32_PQR_ASSOC, 0, 0);
 }
 
+/*
+ * The power-on reset value of MSR_RMID_SNC_CONFIG is 0x1
+ * which indicates that RMIDs are configured in legacy mode.
+ * This mode is incompatible with Linux resctrl semantics
+ * as RMIDs are partitioned between SNC nodes, which requires
+ * a user to know which RMID is allocated to a task.
+ * Clearing bit 0 reconfigures the RMID counters for use
+ * in Sub NUMA Cluster mode. This mode is better for Linux.
+ * The RMID space is divided between all SNC nodes with the
+ * RMIDs renumbered to start from zero in each node when
+ * counting operations from tasks. Code to read the counters
+ * must adjust RMID counter numbers based on SNC node. See
+ * __rmid_read() for code that does this.
+ */
+static void snc_remap_rmids(int cpu)
+{
+	u64 val;
+
+	/* Only need to enable once per package. */
+	if (cpumask_first(topology_core_cpumask(cpu)) != cpu)
+		return;
+
+	rdmsrl(MSR_RMID_SNC_CONFIG, val);
+	val &= ~BIT_ULL(0);
+	wrmsrl(MSR_RMID_SNC_CONFIG, val);
+}
+
 static int resctrl_online_cpu(unsigned int cpu)
 {
 	struct rdt_resource *r;
 
 	mutex_lock(&rdtgroup_mutex);
+
+	if (snc_nodes_per_l3_cache > 1)
+		snc_remap_rmids(cpu);
+
 	for_each_capable_rdt_resource(r)
 		domain_add_cpu(cpu, r);
 	/* The cpu is set in default rdtgroup after online. */
@@ -1010,11 +1044,69 @@ static __init bool get_rdt_resources(void)
 	return (rdt_mon_capable || rdt_alloc_capable);
 }
 
+/* CPU models that support MSR_RMID_SNC_CONFIG */
+static const struct x86_cpu_id snc_cpu_ids[] __initconst = {
+	X86_MATCH_INTEL_FAM6_MODEL(ICELAKE_X, 0),
+	X86_MATCH_INTEL_FAM6_MODEL(SAPPHIRERAPIDS_X, 0),
+	X86_MATCH_INTEL_FAM6_MODEL(EMERALDRAPIDS_X, 0),
+	X86_MATCH_INTEL_FAM6_MODEL(GRANITERAPIDS_X, 0),
+	{}
+};
+
+/*
+ * There isn't a simple h/w bit that indicates whether a CPU is running
+ * in Sub NUMA Cluster (SNC) mode. Infer the state by comparing the
+ * ratio of NUMA nodes to L3 cache instances.
+ * It is not possible to accurately determine SNC state if the system is
+ * booted with a maxcpus=N parameter. That distorts the ratio of SNC nodes
+ * to L3 caches. It will be OK if the system is booted with hyperthreading
+ * disabled (since this doesn't affect the ratio).
+ */
+static __init int snc_get_config(void)
+{
+	unsigned long *node_caches;
+	int mem_only_nodes = 0;
+	int cpu, node, ret;
+	int num_l3_caches;
+
+	if (!x86_match_cpu(snc_cpu_ids))
+		return 1;
+
+	node_caches = bitmap_zalloc(nr_node_ids, GFP_KERNEL);
+	if (!node_caches)
+		return 1;
+
+	cpus_read_lock();
+	for_each_node(node) {
+		cpu = cpumask_first(cpumask_of_node(node));
+		if (cpu < nr_cpu_ids)
+			set_bit(get_cpu_cacheinfo_id(cpu, 3), node_caches);
+		else
+			mem_only_nodes++;
+	}
+	cpus_read_unlock();
+
+	num_l3_caches = bitmap_weight(node_caches, nr_node_ids);
+	kfree(node_caches);
+
+	if (!num_l3_caches)
+		return 1;
+
+	ret = (nr_node_ids - mem_only_nodes) / num_l3_caches;
+
+	if (ret > 1)
+		rdt_resources_all[RDT_RESOURCE_L3].r_resctrl.mon_scope = RESCTRL_NODE;
+
+	return ret;
+}
+
 static __init void rdt_init_res_defs_intel(void)
 {
 	struct rdt_hw_resource *hw_res;
 	struct rdt_resource *r;
 
+	snc_nodes_per_l3_cache = snc_get_config();
+
 	for_each_rdt_resource(r) {
 		hw_res = resctrl_to_arch_res(r);
 
-- 
2.41.0


^ permalink raw reply related	[flat|nested] 331+ messages in thread

* [PATCH v7 8/8] x86/resctrl: Update documentation with Sub-NUMA cluster changes
  2023-10-03 16:07       ` [PATCH v7 " Tony Luck
                           ` (6 preceding siblings ...)
  2023-10-03 16:07         ` [PATCH v7 7/8] x86/resctrl: Sub NUMA Cluster detection and enable Tony Luck
@ 2023-10-03 16:07         ` Tony Luck
  2023-10-03 16:16         ` [PATCH v7 0/8] Add support for Sub-NUMA cluster (SNC) systems Luck, Tony
  8 siblings, 0 replies; 331+ messages in thread
From: Tony Luck @ 2023-10-03 16:07 UTC (permalink / raw)
  To: Fenghua Yu, Reinette Chatre, Peter Newman, Jonathan Corbet,
	Shuah Khan, x86
  Cc: Shaopeng Tan, James Morse, Jamie Iles, Babu Moger, Randy Dunlap,
	linux-kernel, linux-doc, patches, Tony Luck

With Sub-NUMA Cluster mode enabled the scope of monitoring resources is
per-NODE instead of per-L3 cache. Suffixes of directories with "L3" in
their name refer to Sub-NUMA nodes instead of L3 cache ids.

Users should be aware that SNC mode also affects the amount of L3 cache
available for allocation within each SNC node.

Signed-off-by: Tony Luck <tony.luck@intel.com>

---

Changes since v5:

Added additional details about challenges tracking tasks when SNC
mode is enabled.

 Documentation/arch/x86/resctrl.rst | 23 ++++++++++++++++++++---
 1 file changed, 20 insertions(+), 3 deletions(-)

diff --git a/Documentation/arch/x86/resctrl.rst b/Documentation/arch/x86/resctrl.rst
index cb05d90111b4..222c507089a5 100644
--- a/Documentation/arch/x86/resctrl.rst
+++ b/Documentation/arch/x86/resctrl.rst
@@ -345,9 +345,9 @@ When control is enabled all CTRL_MON groups will also contain:
 When monitoring is enabled all MON groups will also contain:
 
 "mon_data":
-	This contains a set of files organized by L3 domain and by
-	RDT event. E.g. on a system with two L3 domains there will
-	be subdirectories "mon_L3_00" and "mon_L3_01".	Each of these
+	This contains a set of files organized by L3 domain or by NUMA
+	node (depending on whether Sub-NUMA Cluster (SNC) mode is disabled
+	or enabled respectively) and by RDT event.  Each of these
 	directories have one file per event (e.g. "llc_occupancy",
 	"mbm_total_bytes", and "mbm_local_bytes"). In a MON group these
 	files provide a read out of the current value of the event for
@@ -452,6 +452,23 @@ and 0xA are not.  On a system with a 20-bit mask each bit represents 5%
 of the capacity of the cache. You could partition the cache into four
 equal parts with masks: 0x1f, 0x3e0, 0x7c00, 0xf8000.
 
+Notes on Sub-NUMA Cluster mode
+==============================
+When SNC mode is enabled Linux may load balance tasks between Sub-NUMA
+nodes much more readily than between regular NUMA nodes since the CPUs
+on Sub-NUMA nodes share the same L3 cache and the system may report
+the NUMA distance between Sub-NUMA nodes with a lower value than used
+for regular NUMA nodes.  Users who do not bind tasks to the CPUs of a
+specific Sub-NUMA node must read the "llc_occupancy", "mbm_total_bytes",
+and "mbm_local_bytes" for all Sub-NUMA nodes where the tasks may execute
+to get the full view of traffic for which the tasks were the source.
+
+The cache allocation feature still provides the same number of
+bits in a mask to control allocation into the L3 cache. But each
+of those ways has its capacity reduced because the cache is divided
+between the SNC nodes. The values reported in the resctrl
+"size" files are adjusted accordingly.
+
 Memory bandwidth Allocation and monitoring
 ==========================================
 
-- 
2.41.0


^ permalink raw reply related	[flat|nested] 331+ messages in thread

* RE: [PATCH v7 0/8] Add support for Sub-NUMA cluster (SNC) systems
  2023-10-03 16:07       ` [PATCH v7 " Tony Luck
                           ` (7 preceding siblings ...)
  2023-10-03 16:07         ` [PATCH v7 8/8] x86/resctrl: Update documentation with Sub-NUMA cluster changes Tony Luck
@ 2023-10-03 16:16         ` Luck, Tony
  2023-10-03 16:34           ` Reinette Chatre
  8 siblings, 1 reply; 331+ messages in thread
From: Luck, Tony @ 2023-10-03 16:16 UTC (permalink / raw)
  To: Yu, Fenghua, Chatre, Reinette, Peter Newman, Jonathan Corbet,
	Shuah Khan, x86
  Cc: Shaopeng Tan, James Morse, Jamie Iles, Babu Moger, Randy Dunlap,
	linux-kernel, linux-doc, patches

> The Sub-NUMA cluster feature on some Intel processors partitions
> the CPUs that share an L3 cache into two or more sets. This plays
> havoc with the Resource Director Technology (RDT) monitoring features.
> Prior to this patch Intel has advised that SNC and RDT are incompatible.
>
> Some of these CPU support an MSR that can partition the RMID
> counters in the same way. This allows for monitoring features
> to be used (with the caveat that memory accesses between different
> SNC NUMA nodes may still not be counted accuratlely.
>
> Note that this patch series improves resctrl reporting considerably
> on systems with SNC enabled, but there will still be some anomalies
> for processes accessing memory from other sub-NUMA nodes.

Bother .. forgot to add the changes since last version summary
to the cover letter. I fixed all the issues called out by Peter Newman
in his review of v6 series. Specific details are included in each patch
(except for patch 0005 which is unchanged).

I added Peter's "Reviewed-by" to patches where he offered it AND
where I didn't make substantive changes (parts 4, 5, 6, 7)

-Tony

^ permalink raw reply	[flat|nested] 331+ messages in thread

* Re: [PATCH v7 0/8] Add support for Sub-NUMA cluster (SNC) systems
  2023-10-03 16:16         ` [PATCH v7 0/8] Add support for Sub-NUMA cluster (SNC) systems Luck, Tony
@ 2023-10-03 16:34           ` Reinette Chatre
  0 siblings, 0 replies; 331+ messages in thread
From: Reinette Chatre @ 2023-10-03 16:34 UTC (permalink / raw)
  To: Luck, Tony, Yu, Fenghua, Peter Newman, Jonathan Corbet, Shuah Khan, x86
  Cc: Shaopeng Tan, James Morse, Jamie Iles, Babu Moger, Randy Dunlap,
	linux-kernel, linux-doc, patches



On 10/3/2023 9:16 AM, Luck, Tony wrote:
>> The Sub-NUMA cluster feature on some Intel processors partitions
>> the CPUs that share an L3 cache into two or more sets. This plays
>> havoc with the Resource Director Technology (RDT) monitoring features.
>> Prior to this patch Intel has advised that SNC and RDT are incompatible.
>>
>> Some of these CPU support an MSR that can partition the RMID
>> counters in the same way. This allows for monitoring features
>> to be used (with the caveat that memory accesses between different
>> SNC NUMA nodes may still not be counted accuratlely.

The typo that I pointed out in V4 as well as V5 remains.
Not fixing something this fundamental reflects poorly on the rest
of this work.

Reinette


^ permalink raw reply	[flat|nested] 331+ messages in thread

* RE: [PATCH v7 4/8] x86/resctrl: Split the rdt_domain and rdt_hw_domain structures
  2023-10-03 16:07         ` [PATCH v7 4/8] x86/resctrl: Split the rdt_domain and rdt_hw_domain structures Tony Luck
@ 2023-10-03 16:46           ` Luck, Tony
  0 siblings, 0 replies; 331+ messages in thread
From: Luck, Tony @ 2023-10-03 16:46 UTC (permalink / raw)
  To: Yu, Fenghua, Chatre, Reinette, Peter Newman, Jonathan Corbet,
	Shuah Khan, x86
  Cc: Shaopeng Tan, James Morse, Jamie Iles, Babu Moger, Randy Dunlap,
	linux-kernel, linux-doc, patches

> The same rdt_domain structure is used for both control and monitor
> functions. But this results in wasted memory as some of the fields are
> only used by control functions, while most are only used for monitor
> functions.
>
> Split into separate rdt_ctrl_domain and rdt_mon_domain structures with
> just the fields required for control and monitoring respectively.
>
> Similar split of the rdt_hw_domain structure into rdt_hw_ctrl_domain
> and rdt_hw_mon_domain.
>
> Signed-off-by: Tony Luck <tony.luck@intel.com>
> Reviewed-by: Peter Newman <peternewman@google.com>

Bother^2 ... lkp complained it can't apply this series to any of the
usual upstream targets (linus, tip, linx-next). Digging into that it
seems that "git format-patch" may have generated something that
can't be applied using "git am".

Not sure if I've got something bad in my config, or my git tree.

Will keep digging.

-Tony

^ permalink raw reply	[flat|nested] 331+ messages in thread

* [PATCH v8 0/8] Add support for Sub-NUMA cluster (SNC) systems
  2023-09-28 19:13     ` [PATCH v6 " Tony Luck
                         ` (9 preceding siblings ...)
  2023-10-03 16:07       ` [PATCH v7 " Tony Luck
@ 2023-10-03 21:30       ` Tony Luck
  2023-10-03 21:30         ` [PATCH v8 1/8] x86/resctrl: Prepare for new domain scope Tony Luck
                           ` (10 more replies)
  10 siblings, 11 replies; 331+ messages in thread
From: Tony Luck @ 2023-10-03 21:30 UTC (permalink / raw)
  To: Fenghua Yu, Reinette Chatre, Peter Newman, Jonathan Corbet,
	Shuah Khan, x86
  Cc: Shaopeng Tan, James Morse, Jamie Iles, Babu Moger, Randy Dunlap,
	linux-kernel, linux-doc, patches, Tony Luck

The Sub-NUMA cluster feature on some Intel processors partitions
the CPUs that share an L3 cache into two or more sets. This plays
havoc with the Resource Director Technology (RDT) monitoring features.
Prior to this patch Intel has advised that SNC and RDT are incompatible.

Some of these CPUs support an MSR that can partition the RMID
counters in the same way. This allows for monitoring features
to be used (with the caveat that memory accesses between different
SNC NUMA nodes may still not be counted accurately).

Note that this patch series improves resctrl reporting considerably
on systems with SNC enabled, but there will still be some anomalies
for processes accessing memory from other sub-NUMA nodes.

Signed-off-by: Tony Luck <tony.luck@intel.com>

---
Please ignore v7 posting. There was some glitch in how I created
the patches with "git format-patch" that meant part 0004 would not
apply.

Changes since v6:

* Fixed spelling of "accurately" in cover letter.

* Applied changes from Peter Newman's review
Link: https://lore.kernel.org/r/CALPaoChB5ryT96ZZBQb6+3=xO+A0uR-ToN0TWqUjLJ7bgi==Rg@mail.gmail.com
(and follow-on posts against other patches in the v6 series).

See comments in individual patches for specific details.

Added Peter's "Reviewed-by" to parts 4-7.

Tony Luck (8):
  x86/resctrl: Prepare for new domain scope
  x86/resctrl: Prepare to split rdt_domain structure
  x86/resctrl: Prepare for different scope for control/monitor
    operations
  x86/resctrl: Split the rdt_domain and rdt_hw_domain structures
  x86/resctrl: Add node-scope to the options for feature scope
  x86/resctrl: Introduce snc_nodes_per_l3_cache
  x86/resctrl: Sub NUMA Cluster detection and enable
  x86/resctrl: Update documentation with Sub-NUMA cluster changes

 Documentation/arch/x86/resctrl.rst        |  23 +-
 include/linux/resctrl.h                   |  85 +++--
 arch/x86/include/asm/msr-index.h          |   1 +
 arch/x86/kernel/cpu/resctrl/internal.h    |  66 ++--
 arch/x86/kernel/cpu/resctrl/core.c        | 400 +++++++++++++++++-----
 arch/x86/kernel/cpu/resctrl/ctrlmondata.c |  58 ++--
 arch/x86/kernel/cpu/resctrl/monitor.c     |  58 ++--
 arch/x86/kernel/cpu/resctrl/pseudo_lock.c |  14 +-
 arch/x86/kernel/cpu/resctrl/rdtgroup.c    | 132 +++----
 9 files changed, 591 insertions(+), 246 deletions(-)


base-commit: 6465e260f48790807eef06b583b38ca9789b6072
-- 
2.41.0


^ permalink raw reply	[flat|nested] 331+ messages in thread

* [PATCH v8 1/8] x86/resctrl: Prepare for new domain scope
  2023-10-03 21:30       ` [PATCH v8 " Tony Luck
@ 2023-10-03 21:30         ` Tony Luck
  2023-10-05 14:27           ` Peter Newman
  2023-10-03 21:30         ` [PATCH v8 2/8] x86/resctrl: Prepare to split rdt_domain structure Tony Luck
                           ` (9 subsequent siblings)
  10 siblings, 1 reply; 331+ messages in thread
From: Tony Luck @ 2023-10-03 21:30 UTC (permalink / raw)
  To: Fenghua Yu, Reinette Chatre, Peter Newman, Jonathan Corbet,
	Shuah Khan, x86
  Cc: Shaopeng Tan, James Morse, Jamie Iles, Babu Moger, Randy Dunlap,
	linux-kernel, linux-doc, patches, Tony Luck

Resctrl resources operate on subsets of CPUs in the system with the
defining attribute of each subset being an instance of a particular
level of cache. E.g. all CPUs sharing an L3 cache would be part of the
same domain.

In preparation for features that are scoped at the NUMA node level,
change the code from explicit references to "cache_level" to a more
generic scope. At this point the only options for this scope are groups
of CPUs that share an L2 cache or L3 cache.

Provide a more detailed warning message if a domain id cannot be found
when adding a CPU. Just check and silently return if the domain id can't
be found when removing a CPU.

Signed-off-by: Tony Luck <tony.luck@intel.com>

---
Changes since last version:

s/-EINVAL/0/ for return value of rdtgroup_cbm_to_size()

---
diff --git a/include/linux/resctrl.h b/include/linux/resctrl.h
index 8334eeacfec5..618735e396cb 100644
--- a/include/linux/resctrl.h
+++ b/include/linux/resctrl.h
@@ -144,13 +144,18 @@ struct resctrl_membw {
 struct rdt_parse_data;
 struct resctrl_schema;
 
+enum resctrl_scope {
+	RESCTRL_L2_CACHE = 2,
+	RESCTRL_L3_CACHE = 3,
+};
+
 /**
  * struct rdt_resource - attributes of a resctrl resource
  * @rid:		The index of the resource
  * @alloc_capable:	Is allocation available on this machine
  * @mon_capable:	Is monitor feature available on this machine
  * @num_rmid:		Number of RMIDs available
- * @cache_level:	Which cache level defines scope of this resource
+ * @scope:		Scope of this resource
  * @cache:		Cache allocation related data
  * @membw:		If the component has bandwidth controls, their properties.
  * @domains:		All domains for this resource
@@ -168,7 +173,7 @@ struct rdt_resource {
 	bool			alloc_capable;
 	bool			mon_capable;
 	int			num_rmid;
-	int			cache_level;
+	enum resctrl_scope	scope;
 	struct resctrl_cache	cache;
 	struct resctrl_membw	membw;
 	struct list_head	domains;
diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
index 030d3b409768..3b1837e1fb6b 100644
--- a/arch/x86/kernel/cpu/resctrl/core.c
+++ b/arch/x86/kernel/cpu/resctrl/core.c
@@ -65,7 +65,7 @@ struct rdt_hw_resource rdt_resources_all[] = {
 		.r_resctrl = {
 			.rid			= RDT_RESOURCE_L3,
 			.name			= "L3",
-			.cache_level		= 3,
+			.scope			= RESCTRL_L3_CACHE,
 			.domains		= domain_init(RDT_RESOURCE_L3),
 			.parse_ctrlval		= parse_cbm,
 			.format_str		= "%d=%0*x",
@@ -79,7 +79,7 @@ struct rdt_hw_resource rdt_resources_all[] = {
 		.r_resctrl = {
 			.rid			= RDT_RESOURCE_L2,
 			.name			= "L2",
-			.cache_level		= 2,
+			.scope			= RESCTRL_L2_CACHE,
 			.domains		= domain_init(RDT_RESOURCE_L2),
 			.parse_ctrlval		= parse_cbm,
 			.format_str		= "%d=%0*x",
@@ -93,7 +93,7 @@ struct rdt_hw_resource rdt_resources_all[] = {
 		.r_resctrl = {
 			.rid			= RDT_RESOURCE_MBA,
 			.name			= "MB",
-			.cache_level		= 3,
+			.scope			= RESCTRL_L3_CACHE,
 			.domains		= domain_init(RDT_RESOURCE_MBA),
 			.parse_ctrlval		= parse_bw,
 			.format_str		= "%d=%*u",
@@ -105,7 +105,7 @@ struct rdt_hw_resource rdt_resources_all[] = {
 		.r_resctrl = {
 			.rid			= RDT_RESOURCE_SMBA,
 			.name			= "SMBA",
-			.cache_level		= 3,
+			.scope			= RESCTRL_L3_CACHE,
 			.domains		= domain_init(RDT_RESOURCE_SMBA),
 			.parse_ctrlval		= parse_bw,
 			.format_str		= "%d=%*u",
@@ -487,6 +487,19 @@ static int arch_domain_mbm_alloc(u32 num_rmid, struct rdt_hw_domain *hw_dom)
 	return 0;
 }
 
+static int get_domain_id_from_scope(int cpu, enum resctrl_scope scope)
+{
+	switch (scope) {
+	case RESCTRL_L2_CACHE:
+	case RESCTRL_L3_CACHE:
+		return get_cpu_cacheinfo_id(cpu, scope);
+	default:
+		break;
+	}
+
+	return -EINVAL;
+}
+
 /*
  * domain_add_cpu - Add a cpu to a resource's domain list.
  *
@@ -502,12 +515,17 @@ static int arch_domain_mbm_alloc(u32 num_rmid, struct rdt_hw_domain *hw_dom)
  */
 static void domain_add_cpu(int cpu, struct rdt_resource *r)
 {
-	int id = get_cpu_cacheinfo_id(cpu, r->cache_level);
+	int id = get_domain_id_from_scope(cpu, r->scope);
 	struct list_head *add_pos = NULL;
 	struct rdt_hw_domain *hw_dom;
 	struct rdt_domain *d;
 	int err;
 
+	if (id < 0) {
+		pr_warn_once("Can't find domain id for CPU:%d scope:%d for resource %s\n",
+			     cpu, r->scope, r->name);
+		return;
+	}
 	d = rdt_find_domain(r, id, &add_pos);
 	if (IS_ERR(d)) {
 		pr_warn("Couldn't find cache id for CPU %d\n", cpu);
@@ -552,10 +570,13 @@ static void domain_add_cpu(int cpu, struct rdt_resource *r)
 
 static void domain_remove_cpu(int cpu, struct rdt_resource *r)
 {
-	int id = get_cpu_cacheinfo_id(cpu, r->cache_level);
+	int id = get_domain_id_from_scope(cpu, r->scope);
 	struct rdt_hw_domain *hw_dom;
 	struct rdt_domain *d;
 
+	if (id < 0)
+		return;
+
 	d = rdt_find_domain(r, id, NULL);
 	if (IS_ERR_OR_NULL(d)) {
 		pr_warn("Couldn't find cache id for CPU %d\n", cpu);
diff --git a/arch/x86/kernel/cpu/resctrl/pseudo_lock.c b/arch/x86/kernel/cpu/resctrl/pseudo_lock.c
index 8f559eeae08e..8c5f932bc00b 100644
--- a/arch/x86/kernel/cpu/resctrl/pseudo_lock.c
+++ b/arch/x86/kernel/cpu/resctrl/pseudo_lock.c
@@ -292,10 +292,14 @@ static void pseudo_lock_region_clear(struct pseudo_lock_region *plr)
  */
 static int pseudo_lock_region_init(struct pseudo_lock_region *plr)
 {
+	int scope = plr->s->res->scope;
 	struct cpu_cacheinfo *ci;
 	int ret;
 	int i;
 
+	if (WARN_ON_ONCE(scope != RESCTRL_L2_CACHE && scope != RESCTRL_L3_CACHE))
+		return -ENODEV;
+
 	/* Pick the first cpu we find that is associated with the cache. */
 	plr->cpu = cpumask_first(&plr->d->cpu_mask);
 
@@ -311,7 +315,7 @@ static int pseudo_lock_region_init(struct pseudo_lock_region *plr)
 	plr->size = rdtgroup_cbm_to_size(plr->s->res, plr->d, plr->cbm);
 
 	for (i = 0; i < ci->num_leaves; i++) {
-		if (ci->info_list[i].level == plr->s->res->cache_level) {
+		if (ci->info_list[i].level == scope) {
 			plr->line_size = ci->info_list[i].coherency_line_size;
 			return 0;
 		}
diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
index 725344048f85..04c164f6d39d 100644
--- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
+++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
@@ -1345,10 +1345,13 @@ unsigned int rdtgroup_cbm_to_size(struct rdt_resource *r,
 	unsigned int size = 0;
 	int num_b, i;
 
+	if (WARN_ON_ONCE(r->scope != RESCTRL_L2_CACHE && r->scope != RESCTRL_L3_CACHE))
+		return size;
+
 	num_b = bitmap_weight(&cbm, r->cache.cbm_len);
 	ci = get_cpu_cacheinfo(cpumask_any(&d->cpu_mask));
 	for (i = 0; i < ci->num_leaves; i++) {
-		if (ci->info_list[i].level == r->cache_level) {
+		if (ci->info_list[i].level == r->scope) {
 			size = ci->info_list[i].size / r->cache.cbm_len * num_b;
 			break;
 		}

^ permalink raw reply related	[flat|nested] 331+ messages in thread

* [PATCH v8 2/8] x86/resctrl: Prepare to split rdt_domain structure
  2023-10-03 21:30       ` [PATCH v8 " Tony Luck
  2023-10-03 21:30         ` [PATCH v8 1/8] x86/resctrl: Prepare for new domain scope Tony Luck
@ 2023-10-03 21:30         ` Tony Luck
  2023-10-03 21:30         ` [PATCH v8 3/8] x86/resctrl: Prepare for different scope for control/monitor operations Tony Luck
                           ` (8 subsequent siblings)
  10 siblings, 0 replies; 331+ messages in thread
From: Tony Luck @ 2023-10-03 21:30 UTC (permalink / raw)
  To: Fenghua Yu, Reinette Chatre, Peter Newman, Jonathan Corbet,
	Shuah Khan, x86
  Cc: Shaopeng Tan, James Morse, Jamie Iles, Babu Moger, Randy Dunlap,
	linux-kernel, linux-doc, patches, Tony Luck

The rdt_domain structure is used for both control and monitor features.
It is about to be split into separate structures for these two usages
because the scope for control and monitoring features for a resource
will be different for future resources.

To allow for common code that scans a list of domains looking for a
specific domain id, move the "list" and "id" fields into their own
structure within the rdt_domain structure.

Add a "type" field to the header to be used later so that callers looking
up a domain can be assured that they found one of the expected type.
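
As a sketch of how a caller could use that assurance (illustrative only, not
part of this patch; the helper name is invented, and the check-then-container_of()
pattern is the one described in the change notes of the next patch in this
series):

	/*
	 * Sketch only: verify the header really belongs to the expected
	 * domain type before converting back to the containing structure.
	 */
	static struct rdt_domain *rdt_domain_from_hdr(struct rdt_domain_hdr *hdr,
						      enum resctrl_domain_type type)
	{
		if (WARN_ON_ONCE(hdr->type != type))
			return NULL;

		return container_of(hdr, struct rdt_domain, hdr);
	}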

Signed-off-by: Tony Luck <tony.luck@intel.com>

---
Changes since last version:
Add the "type" field to rdt_domain_hdr.

---
diff --git a/include/linux/resctrl.h b/include/linux/resctrl.h
index 618735e396cb..4c8870345ff8 100644
--- a/include/linux/resctrl.h
+++ b/include/linux/resctrl.h
@@ -52,10 +52,26 @@ struct resctrl_staged_config {
 	bool			have_new_ctrl;
 };
 
+enum resctrl_domain_type {
+	RESCTRL_CTRL_DOMAIN,
+	RESCTRL_MON_DOMAIN,
+};
+
 /**
- * struct rdt_domain - group of CPUs sharing a resctrl resource
+ * struct rdt_domain_hdr - common header for different domain types
  * @list:		all instances of this resource
  * @id:			unique id for this instance
+ * @type:		type of this instance
+ */
+struct rdt_domain_hdr {
+	struct list_head		list;
+	int				id;
+	enum resctrl_domain_type	type;
+};
+
+/**
+ * struct rdt_domain - group of CPUs sharing a resctrl resource
+ * @hdr:		common header for different domain types
  * @cpu_mask:		which CPUs share this resource
  * @rmid_busy_llc:	bitmap of which limbo RMIDs are above threshold
  * @mbm_total:		saved state for MBM total bandwidth
@@ -71,8 +87,7 @@ struct resctrl_staged_config {
  *			by closid
  */
 struct rdt_domain {
-	struct list_head		list;
-	int				id;
+	struct rdt_domain_hdr		hdr;
 	struct cpumask			cpu_mask;
 	unsigned long			*rmid_busy_llc;
 	struct mbm_state		*mbm_total;
diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
index 3b1837e1fb6b..05369add4578 100644
--- a/arch/x86/kernel/cpu/resctrl/core.c
+++ b/arch/x86/kernel/cpu/resctrl/core.c
@@ -352,7 +352,7 @@ struct rdt_domain *get_domain_from_cpu(int cpu, struct rdt_resource *r)
 {
 	struct rdt_domain *d;
 
-	list_for_each_entry(d, &r->domains, list) {
+	list_for_each_entry(d, &r->domains, hdr.list) {
 		/* Find the domain that contains this CPU */
 		if (cpumask_test_cpu(cpu, &d->cpu_mask))
 			return d;
@@ -401,12 +401,12 @@ struct rdt_domain *rdt_find_domain(struct rdt_resource *r, int id,
 		return ERR_PTR(-ENODEV);
 
 	list_for_each(l, &r->domains) {
-		d = list_entry(l, struct rdt_domain, list);
+		d = list_entry(l, struct rdt_domain, hdr.list);
 		/* When id is found, return its domain. */
-		if (id == d->id)
+		if (id == d->hdr.id)
 			return d;
 		/* Stop searching when finding id's position in sorted list. */
-		if (id < d->id)
+		if (id < d->hdr.id)
 			break;
 	}
 
@@ -544,7 +544,7 @@ static void domain_add_cpu(int cpu, struct rdt_resource *r)
 		return;
 
 	d = &hw_dom->d_resctrl;
-	d->id = id;
+	d->hdr.id = id;
 	cpumask_set_cpu(cpu, &d->cpu_mask);
 
 	rdt_domain_reconfigure_cdp(r);
@@ -559,11 +559,11 @@ static void domain_add_cpu(int cpu, struct rdt_resource *r)
 		return;
 	}
 
-	list_add_tail(&d->list, add_pos);
+	list_add_tail(&d->hdr.list, add_pos);
 
 	err = resctrl_online_domain(r, d);
 	if (err) {
-		list_del(&d->list);
+		list_del(&d->hdr.list);
 		domain_free(hw_dom);
 	}
 }
@@ -587,7 +587,7 @@ static void domain_remove_cpu(int cpu, struct rdt_resource *r)
 	cpumask_clear_cpu(cpu, &d->cpu_mask);
 	if (cpumask_empty(&d->cpu_mask)) {
 		resctrl_offline_domain(r, d);
-		list_del(&d->list);
+		list_del(&d->hdr.list);
 
 		/*
 		 * rdt_domain "d" is going to be freed below, so clear
diff --git a/arch/x86/kernel/cpu/resctrl/ctrlmondata.c b/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
index b44c487727d4..8bce591a1018 100644
--- a/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
+++ b/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
@@ -67,7 +67,7 @@ int parse_bw(struct rdt_parse_data *data, struct resctrl_schema *s,
 
 	cfg = &d->staged_config[s->conf_type];
 	if (cfg->have_new_ctrl) {
-		rdt_last_cmd_printf("Duplicate domain %d\n", d->id);
+		rdt_last_cmd_printf("Duplicate domain %d\n", d->hdr.id);
 		return -EINVAL;
 	}
 
@@ -144,7 +144,7 @@ int parse_cbm(struct rdt_parse_data *data, struct resctrl_schema *s,
 
 	cfg = &d->staged_config[s->conf_type];
 	if (cfg->have_new_ctrl) {
-		rdt_last_cmd_printf("Duplicate domain %d\n", d->id);
+		rdt_last_cmd_printf("Duplicate domain %d\n", d->hdr.id);
 		return -EINVAL;
 	}
 
@@ -224,8 +224,8 @@ static int parse_line(char *line, struct resctrl_schema *s,
 		return -EINVAL;
 	}
 	dom = strim(dom);
-	list_for_each_entry(d, &r->domains, list) {
-		if (d->id == dom_id) {
+	list_for_each_entry(d, &r->domains, hdr.list) {
+		if (d->hdr.id == dom_id) {
 			data.buf = dom;
 			data.rdtgrp = rdtgrp;
 			if (r->parse_ctrlval(&data, s, d))
@@ -316,7 +316,7 @@ int resctrl_arch_update_domains(struct rdt_resource *r, u32 closid)
 		return -ENOMEM;
 
 	msr_param.res = NULL;
-	list_for_each_entry(d, &r->domains, list) {
+	list_for_each_entry(d, &r->domains, hdr.list) {
 		hw_dom = resctrl_to_arch_dom(d);
 		for (t = 0; t < CDP_NUM_TYPES; t++) {
 			cfg = &hw_dom->d_resctrl.staged_config[t];
@@ -464,7 +464,7 @@ static void show_doms(struct seq_file *s, struct resctrl_schema *schema, int clo
 	u32 ctrl_val;
 
 	seq_printf(s, "%*s:", max_name_width, schema->name);
-	list_for_each_entry(dom, &r->domains, list) {
+	list_for_each_entry(dom, &r->domains, hdr.list) {
 		if (sep)
 			seq_puts(s, ";");
 
@@ -474,7 +474,7 @@ static void show_doms(struct seq_file *s, struct resctrl_schema *schema, int clo
 			ctrl_val = resctrl_arch_get_config(r, dom, closid,
 							   schema->conf_type);
 
-		seq_printf(s, r->format_str, dom->id, max_data_width,
+		seq_printf(s, r->format_str, dom->hdr.id, max_data_width,
 			   ctrl_val);
 		sep = true;
 	}
@@ -503,7 +503,7 @@ int rdtgroup_schemata_show(struct kernfs_open_file *of,
 			} else {
 				seq_printf(s, "%s:%d=%x\n",
 					   rdtgrp->plr->s->res->name,
-					   rdtgrp->plr->d->id,
+					   rdtgrp->plr->d->hdr.id,
 					   rdtgrp->plr->cbm);
 			}
 		} else {
diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c
index ded1fc7cb7cb..27cda5988d7f 100644
--- a/arch/x86/kernel/cpu/resctrl/monitor.c
+++ b/arch/x86/kernel/cpu/resctrl/monitor.c
@@ -340,7 +340,7 @@ static void add_rmid_to_limbo(struct rmid_entry *entry)
 
 	entry->busy = 0;
 	cpu = get_cpu();
-	list_for_each_entry(d, &r->domains, list) {
+	list_for_each_entry(d, &r->domains, hdr.list) {
 		if (cpumask_test_cpu(cpu, &d->cpu_mask)) {
 			err = resctrl_arch_rmid_read(r, d, entry->rmid,
 						     QOS_L3_OCCUP_EVENT_ID,
diff --git a/arch/x86/kernel/cpu/resctrl/pseudo_lock.c b/arch/x86/kernel/cpu/resctrl/pseudo_lock.c
index 8c5f932bc00b..18b6183a1b48 100644
--- a/arch/x86/kernel/cpu/resctrl/pseudo_lock.c
+++ b/arch/x86/kernel/cpu/resctrl/pseudo_lock.c
@@ -856,7 +856,7 @@ bool rdtgroup_pseudo_locked_in_hierarchy(struct rdt_domain *d)
 	 * associated with them.
 	 */
 	for_each_alloc_capable_rdt_resource(r) {
-		list_for_each_entry(d_i, &r->domains, list) {
+		list_for_each_entry(d_i, &r->domains, hdr.list) {
 			if (d_i->plr)
 				cpumask_or(cpu_with_psl, cpu_with_psl,
 					   &d_i->cpu_mask);
diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
index 04c164f6d39d..739e56cfb31f 100644
--- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
+++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
@@ -86,7 +86,7 @@ void rdt_staged_configs_clear(void)
 	lockdep_assert_held(&rdtgroup_mutex);
 
 	for_each_alloc_capable_rdt_resource(r) {
-		list_for_each_entry(dom, &r->domains, list)
+		list_for_each_entry(dom, &r->domains, hdr.list)
 			memset(dom->staged_config, 0, sizeof(dom->staged_config));
 	}
 }
@@ -928,12 +928,12 @@ static int rdt_bit_usage_show(struct kernfs_open_file *of,
 
 	mutex_lock(&rdtgroup_mutex);
 	hw_shareable = r->cache.shareable_bits;
-	list_for_each_entry(dom, &r->domains, list) {
+	list_for_each_entry(dom, &r->domains, hdr.list) {
 		if (sep)
 			seq_putc(seq, ';');
 		sw_shareable = 0;
 		exclusive = 0;
-		seq_printf(seq, "%d=", dom->id);
+		seq_printf(seq, "%d=", dom->hdr.id);
 		for (i = 0; i < closids_supported(); i++) {
 			if (!closid_allocated(i))
 				continue;
@@ -1233,7 +1233,7 @@ static bool rdtgroup_mode_test_exclusive(struct rdtgroup *rdtgrp)
 		if (r->rid == RDT_RESOURCE_MBA || r->rid == RDT_RESOURCE_SMBA)
 			continue;
 		has_cache = true;
-		list_for_each_entry(d, &r->domains, list) {
+		list_for_each_entry(d, &r->domains, hdr.list) {
 			ctrl = resctrl_arch_get_config(r, d, closid,
 						       s->conf_type);
 			if (rdtgroup_cbm_overlaps(s, d, ctrl, closid, false)) {
@@ -1398,7 +1398,7 @@ static int rdtgroup_size_show(struct kernfs_open_file *of,
 			size = rdtgroup_cbm_to_size(rdtgrp->plr->s->res,
 						    rdtgrp->plr->d,
 						    rdtgrp->plr->cbm);
-			seq_printf(s, "%d=%u\n", rdtgrp->plr->d->id, size);
+			seq_printf(s, "%d=%u\n", rdtgrp->plr->d->hdr.id, size);
 		}
 		goto out;
 	}
@@ -1410,7 +1410,7 @@ static int rdtgroup_size_show(struct kernfs_open_file *of,
 		type = schema->conf_type;
 		sep = false;
 		seq_printf(s, "%*s:", max_name_width, schema->name);
-		list_for_each_entry(d, &r->domains, list) {
+		list_for_each_entry(d, &r->domains, hdr.list) {
 			if (sep)
 				seq_putc(s, ';');
 			if (rdtgrp->mode == RDT_MODE_PSEUDO_LOCKSETUP) {
@@ -1428,7 +1428,7 @@ static int rdtgroup_size_show(struct kernfs_open_file *of,
 				else
 					size = rdtgroup_cbm_to_size(r, d, ctrl);
 			}
-			seq_printf(s, "%d=%u", d->id, size);
+			seq_printf(s, "%d=%u", d->hdr.id, size);
 			sep = true;
 		}
 		seq_putc(s, '\n');
@@ -1499,7 +1499,7 @@ static int mbm_config_show(struct seq_file *s, struct rdt_resource *r, u32 evtid
 
 	mutex_lock(&rdtgroup_mutex);
 
-	list_for_each_entry(dom, &r->domains, list) {
+	list_for_each_entry(dom, &r->domains, hdr.list) {
 		if (sep)
 			seq_puts(s, ";");
 
@@ -1507,7 +1507,7 @@ static int mbm_config_show(struct seq_file *s, struct rdt_resource *r, u32 evtid
 		mon_info.evtid = evtid;
 		mondata_config_read(dom, &mon_info);
 
-		seq_printf(s, "%d=0x%02x", dom->id, mon_info.mon_config);
+		seq_printf(s, "%d=0x%02x", dom->hdr.id, mon_info.mon_config);
 		sep = true;
 	}
 	seq_puts(s, "\n");
@@ -1622,8 +1622,8 @@ static int mon_config_write(struct rdt_resource *r, char *tok, u32 evtid)
 		return -EINVAL;
 	}
 
-	list_for_each_entry(d, &r->domains, list) {
-		if (d->id == dom_id) {
+	list_for_each_entry(d, &r->domains, hdr.list) {
+		if (d->hdr.id == dom_id) {
 			ret = mbm_config_write_domain(r, d, evtid, val);
 			if (ret)
 				return -EINVAL;
@@ -2141,7 +2141,7 @@ static int set_cache_qos_cfg(int level, bool enable)
 		return -ENOMEM;
 
 	r_l = &rdt_resources_all[level].r_resctrl;
-	list_for_each_entry(d, &r_l->domains, list) {
+	list_for_each_entry(d, &r_l->domains, hdr.list) {
 		if (r_l->cache.arch_has_per_cpu_cfg)
 			/* Pick all the CPUs in the domain instance */
 			for_each_cpu(cpu, &d->cpu_mask)
@@ -2226,7 +2226,7 @@ static int set_mba_sc(bool mba_sc)
 
 	r->membw.mba_sc = mba_sc;
 
-	list_for_each_entry(d, &r->domains, list) {
+	list_for_each_entry(d, &r->domains, hdr.list) {
 		for (i = 0; i < num_closid; i++)
 			d->mbps_val[i] = MBA_MAX_MBPS;
 	}
@@ -2528,7 +2528,7 @@ static int rdt_get_tree(struct fs_context *fc)
 
 	if (is_mbm_enabled()) {
 		r = &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl;
-		list_for_each_entry(dom, &r->domains, list)
+		list_for_each_entry(dom, &r->domains, hdr.list)
 			mbm_setup_overflow_handler(dom, MBM_OVERFLOW_INTERVAL);
 	}
 
@@ -2652,7 +2652,7 @@ static int reset_all_ctrls(struct rdt_resource *r)
 	 * CBMs in all domains to the maximum mask value. Pick one CPU
 	 * from each domain to update the MSRs below.
 	 */
-	list_for_each_entry(d, &r->domains, list) {
+	list_for_each_entry(d, &r->domains, hdr.list) {
 		hw_dom = resctrl_to_arch_dom(d);
 		cpumask_set_cpu(cpumask_any(&d->cpu_mask), cpu_mask);
 
@@ -2858,7 +2858,7 @@ static int mkdir_mondata_subdir(struct kernfs_node *parent_kn,
 	char name[32];
 	int ret;
 
-	sprintf(name, "mon_%s_%02d", r->name, d->id);
+	sprintf(name, "mon_%s_%02d", r->name, d->hdr.id);
 	/* create the directory */
 	kn = kernfs_create_dir(parent_kn, name, parent_kn->mode, prgrp);
 	if (IS_ERR(kn))
@@ -2874,7 +2874,7 @@ static int mkdir_mondata_subdir(struct kernfs_node *parent_kn,
 	}
 
 	priv.u.rid = r->rid;
-	priv.u.domid = d->id;
+	priv.u.domid = d->hdr.id;
 	list_for_each_entry(mevt, &r->evt_list, list) {
 		priv.u.evtid = mevt->evtid;
 		ret = mon_addfile(kn, mevt->name, priv.priv);
@@ -2922,7 +2922,7 @@ static int mkdir_mondata_subdir_alldom(struct kernfs_node *parent_kn,
 	struct rdt_domain *dom;
 	int ret;
 
-	list_for_each_entry(dom, &r->domains, list) {
+	list_for_each_entry(dom, &r->domains, hdr.list) {
 		ret = mkdir_mondata_subdir(parent_kn, dom, r, prgrp);
 		if (ret)
 			return ret;
@@ -3081,7 +3081,7 @@ static int __init_one_rdt_domain(struct rdt_domain *d, struct resctrl_schema *s,
 	 */
 	tmp_cbm = cfg->new_ctrl;
 	if (bitmap_weight(&tmp_cbm, r->cache.cbm_len) < r->cache.min_cbm_bits) {
-		rdt_last_cmd_printf("No space on %s:%d\n", s->name, d->id);
+		rdt_last_cmd_printf("No space on %s:%d\n", s->name, d->hdr.id);
 		return -ENOSPC;
 	}
 	cfg->have_new_ctrl = true;
@@ -3104,7 +3104,7 @@ static int rdtgroup_init_cat(struct resctrl_schema *s, u32 closid)
 	struct rdt_domain *d;
 	int ret;
 
-	list_for_each_entry(d, &s->res->domains, list) {
+	list_for_each_entry(d, &s->res->domains, hdr.list) {
 		ret = __init_one_rdt_domain(d, s, closid);
 		if (ret < 0)
 			return ret;
@@ -3119,7 +3119,7 @@ static void rdtgroup_init_mba(struct rdt_resource *r, u32 closid)
 	struct resctrl_staged_config *cfg;
 	struct rdt_domain *d;
 
-	list_for_each_entry(d, &r->domains, list) {
+	list_for_each_entry(d, &r->domains, hdr.list) {
 		if (is_mba_sc(r)) {
 			d->mbps_val[closid] = MBA_MAX_MBPS;
 			continue;
@@ -3726,7 +3726,7 @@ void resctrl_offline_domain(struct rdt_resource *r, struct rdt_domain *d)
 	 * per domain monitor data directories.
 	 */
 	if (static_branch_unlikely(&rdt_mon_enable_key))
-		rmdir_mondata_subdir_allrdtgrp(r, d->id);
+		rmdir_mondata_subdir_allrdtgrp(r, d->hdr.id);
 
 	if (is_mbm_enabled())
 		cancel_delayed_work(&d->mbm_over);


* [PATCH v8 3/8] x86/resctrl: Prepare for different scope for control/monitor operations
  2023-10-03 21:30       ` [PATCH v8 " Tony Luck
  2023-10-03 21:30         ` [PATCH v8 1/8] x86/resctrl: Prepare for new domain scope Tony Luck
  2023-10-03 21:30         ` [PATCH v8 2/8] x86/resctrl: Prepare to split rdt_domain structure Tony Luck
@ 2023-10-03 21:30         ` Tony Luck
  2023-10-05 14:34           ` Peter Newman
  2023-10-03 21:30         ` [PATCH v8 4/8] x86/resctrl: Split the rdt_domain and rdt_hw_domain structures Tony Luck
                           ` (7 subsequent siblings)
  10 siblings, 1 reply; 331+ messages in thread
From: Tony Luck @ 2023-10-03 21:30 UTC (permalink / raw)
  To: Fenghua Yu, Reinette Chatre, Peter Newman, Jonathan Corbet,
	Shuah Khan, x86
  Cc: Shaopeng Tan, James Morse, Jamie Iles, Babu Moger, Randy Dunlap,
	linux-kernel, linux-doc, patches, Tony Luck

Resctrl assumes that control and monitor operations on a resource are
performed at the same scope.

Prepare for systems that use different scopes (specifically, L3 scope
for cache control and NODE scope for cache occupancy and memory
bandwidth monitoring).
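
For reference, these scope values are turned into a domain id by
get_domain_id_from_scope() (added earlier in this series). A simplified
sketch of that mapping, assuming the scope enum values match cache
levels and using the RESCTRL_NODE scope added later in this series:

static int get_domain_id_from_scope(int cpu, enum resctrl_scope scope)
{
	switch (scope) {
	case RESCTRL_L2_CACHE:
	case RESCTRL_L3_CACHE:
		/* id of the L2/L3 cache instance this CPU belongs to */
		return get_cpu_cacheinfo_id(cpu, scope);
	case RESCTRL_NODE:
		/* SNC: one monitor domain per NUMA node */
		return cpu_to_node(cpu);
	default:
		return -EINVAL;
	}
}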

Create separate domain lists for control and monitor operations.

Note that an error during initialization of either the control or the
monitor functions on a domain previously resulted in that domain being
excluded from both control and monitor operations. Now that the domains
are allocated independently, a failure on one side no longer requires
disabling the other.

Signed-off-by: Tony Luck <tony.luck@intel.com>

---

Changes since last version:

Initialize the "type" in rdt_domain_hdr when creating domains.
Check type has expected value before using container_of() to
get to the surrounding structure.

Rename "hw_mondom" to "hw_dom" in domain_add_cpu_mon() and
in domain_remove_cpu_mon().

Add lockdep_assert_held(&rdtgroup_mutex) to resctrl_offline_mon_domain()
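
A standalone illustration of that header-plus-container_of() pattern
(plain userspace C, not kernel code; the names are simplified and only
loosely mirror the resctrl structures):

#include <stddef.h>
#include <stdio.h>

#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

enum domain_type { CTRL_DOMAIN, MON_DOMAIN };

/* Common header embedded in every domain flavour */
struct domain_hdr {
	enum domain_type type;
	int id;
};

/* One of the domain flavours, with its own private payload */
struct ctrl_domain {
	struct domain_hdr hdr;
	unsigned int cbm;
};

static struct ctrl_domain *hdr_to_ctrl(struct domain_hdr *hdr)
{
	/* Mirrors the WARN_ON_ONCE() type check before container_of() */
	if (hdr->type != CTRL_DOMAIN)
		return NULL;
	return container_of(hdr, struct ctrl_domain, hdr);
}

int main(void)
{
	struct ctrl_domain d = { .hdr = { CTRL_DOMAIN, 0 }, .cbm = 0xff };
	struct ctrl_domain *c = hdr_to_ctrl(&d.hdr);

	printf("domain %d, cbm=0x%x\n", c->hdr.id, c->cbm);
	return 0;
}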

---
diff --git a/include/linux/resctrl.h b/include/linux/resctrl.h
index 4c8870345ff8..f47b01ae28ca 100644
--- a/include/linux/resctrl.h
+++ b/include/linux/resctrl.h
@@ -170,10 +170,12 @@ enum resctrl_scope {
  * @alloc_capable:	Is allocation available on this machine
  * @mon_capable:	Is monitor feature available on this machine
  * @num_rmid:		Number of RMIDs available
- * @scope:		Scope of this resource
+ * @ctrl_scope:		Scope of this resource for control functions
+ * @mon_scope:		Scope of this resource for monitor functions
  * @cache:		Cache allocation related data
  * @membw:		If the component has bandwidth controls, their properties.
- * @domains:		All domains for this resource
+ * @ctrl_domains:	Control domains for this resource
+ * @mon_domains:	Monitor domains for this resource
  * @name:		Name to use in "schemata" file.
  * @data_width:		Character width of data when displaying
  * @default_ctrl:	Specifies default cache cbm or memory B/W percent.
@@ -188,10 +190,12 @@ struct rdt_resource {
 	bool			alloc_capable;
 	bool			mon_capable;
 	int			num_rmid;
-	enum resctrl_scope	scope;
+	enum resctrl_scope	ctrl_scope;
+	enum resctrl_scope	mon_scope;
 	struct resctrl_cache	cache;
 	struct resctrl_membw	membw;
-	struct list_head	domains;
+	struct list_head	ctrl_domains;
+	struct list_head	mon_domains;
 	char			*name;
 	int			data_width;
 	u32			default_ctrl;
@@ -237,8 +241,10 @@ int resctrl_arch_update_one(struct rdt_resource *r, struct rdt_domain *d,
 
 u32 resctrl_arch_get_config(struct rdt_resource *r, struct rdt_domain *d,
 			    u32 closid, enum resctrl_conf_type type);
-int resctrl_online_domain(struct rdt_resource *r, struct rdt_domain *d);
-void resctrl_offline_domain(struct rdt_resource *r, struct rdt_domain *d);
+int resctrl_online_ctrl_domain(struct rdt_resource *r, struct rdt_domain *d);
+int resctrl_online_mon_domain(struct rdt_resource *r, struct rdt_domain *d);
+void resctrl_offline_ctrl_domain(struct rdt_resource *r, struct rdt_domain *d);
+void resctrl_offline_mon_domain(struct rdt_resource *r, struct rdt_domain *d);
 
 /**
  * resctrl_arch_rmid_read() - Read the eventid counter corresponding to rmid
diff --git a/arch/x86/kernel/cpu/resctrl/internal.h b/arch/x86/kernel/cpu/resctrl/internal.h
index 85ceaf9a31ac..e9a2a8993d14 100644
--- a/arch/x86/kernel/cpu/resctrl/internal.h
+++ b/arch/x86/kernel/cpu/resctrl/internal.h
@@ -511,8 +511,8 @@ void rdtgroup_kn_unlock(struct kernfs_node *kn);
 int rdtgroup_kn_mode_restrict(struct rdtgroup *r, const char *name);
 int rdtgroup_kn_mode_restore(struct rdtgroup *r, const char *name,
 			     umode_t mask);
-struct rdt_domain *rdt_find_domain(struct rdt_resource *r, int id,
-				   struct list_head **pos);
+struct rdt_domain_hdr *rdt_find_domain(struct list_head *h, int id,
+				       struct list_head **pos);
 ssize_t rdtgroup_schemata_write(struct kernfs_open_file *of,
 				char *buf, size_t nbytes, loff_t off);
 int rdtgroup_schemata_show(struct kernfs_open_file *of,
diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
index 05369add4578..ae2b66d38afc 100644
--- a/arch/x86/kernel/cpu/resctrl/core.c
+++ b/arch/x86/kernel/cpu/resctrl/core.c
@@ -57,7 +57,8 @@ static void
 mba_wrmsr_amd(struct rdt_domain *d, struct msr_param *m,
 	      struct rdt_resource *r);
 
-#define domain_init(id) LIST_HEAD_INIT(rdt_resources_all[id].r_resctrl.domains)
+#define ctrl_domain_init(id) LIST_HEAD_INIT(rdt_resources_all[id].r_resctrl.ctrl_domains)
+#define mon_domain_init(id) LIST_HEAD_INIT(rdt_resources_all[id].r_resctrl.mon_domains)
 
 struct rdt_hw_resource rdt_resources_all[] = {
 	[RDT_RESOURCE_L3] =
@@ -65,8 +66,10 @@ struct rdt_hw_resource rdt_resources_all[] = {
 		.r_resctrl = {
 			.rid			= RDT_RESOURCE_L3,
 			.name			= "L3",
-			.scope			= RESCTRL_L3_CACHE,
-			.domains		= domain_init(RDT_RESOURCE_L3),
+			.ctrl_scope		= RESCTRL_L3_CACHE,
+			.mon_scope		= RESCTRL_L3_CACHE,
+			.ctrl_domains		= ctrl_domain_init(RDT_RESOURCE_L3),
+			.mon_domains		= mon_domain_init(RDT_RESOURCE_L3),
 			.parse_ctrlval		= parse_cbm,
 			.format_str		= "%d=%0*x",
 			.fflags			= RFTYPE_RES_CACHE,
@@ -79,8 +82,8 @@ struct rdt_hw_resource rdt_resources_all[] = {
 		.r_resctrl = {
 			.rid			= RDT_RESOURCE_L2,
 			.name			= "L2",
-			.scope			= RESCTRL_L2_CACHE,
-			.domains		= domain_init(RDT_RESOURCE_L2),
+			.ctrl_scope		= RESCTRL_L2_CACHE,
+			.ctrl_domains		= ctrl_domain_init(RDT_RESOURCE_L2),
 			.parse_ctrlval		= parse_cbm,
 			.format_str		= "%d=%0*x",
 			.fflags			= RFTYPE_RES_CACHE,
@@ -93,8 +96,8 @@ struct rdt_hw_resource rdt_resources_all[] = {
 		.r_resctrl = {
 			.rid			= RDT_RESOURCE_MBA,
 			.name			= "MB",
-			.scope			= RESCTRL_L3_CACHE,
-			.domains		= domain_init(RDT_RESOURCE_MBA),
+			.ctrl_scope		= RESCTRL_L3_CACHE,
+			.ctrl_domains		= ctrl_domain_init(RDT_RESOURCE_MBA),
 			.parse_ctrlval		= parse_bw,
 			.format_str		= "%d=%*u",
 			.fflags			= RFTYPE_RES_MB,
@@ -105,8 +108,8 @@ struct rdt_hw_resource rdt_resources_all[] = {
 		.r_resctrl = {
 			.rid			= RDT_RESOURCE_SMBA,
 			.name			= "SMBA",
-			.scope			= RESCTRL_L3_CACHE,
-			.domains		= domain_init(RDT_RESOURCE_SMBA),
+			.ctrl_scope		= RESCTRL_L3_CACHE,
+			.ctrl_domains		= ctrl_domain_init(RDT_RESOURCE_SMBA),
 			.parse_ctrlval		= parse_bw,
 			.format_str		= "%d=%*u",
 			.fflags			= RFTYPE_RES_MB,
@@ -352,7 +355,7 @@ struct rdt_domain *get_domain_from_cpu(int cpu, struct rdt_resource *r)
 {
 	struct rdt_domain *d;
 
-	list_for_each_entry(d, &r->domains, hdr.list) {
+	list_for_each_entry(d, &r->ctrl_domains, hdr.list) {
 		/* Find the domain that contains this CPU */
 		if (cpumask_test_cpu(cpu, &d->cpu_mask))
 			return d;
@@ -384,29 +387,39 @@ void rdt_ctrl_update(void *arg)
 }
 
 /*
- * rdt_find_domain - Find a domain in a resource that matches input resource id
+ * rdt_find_domain - Find a domain in one of a resource's domain lists.
  *
- * Search resource r's domain list to find the resource id. If the resource
- * id is found in a domain, return the domain. Otherwise, if requested by
- * caller, return the first domain whose id is bigger than the input id.
+ * Search the list to find the resource id. If the resource id is found
+ * in a domain, return the domain. Otherwise, if requested by caller,
+ * return the first domain whose id is bigger than the input id.
  * The domain list is sorted by id in ascending order.
+ *
+ * If an existing domain in the resource r's domain list matches the cpu's
+ * resource id, add the cpu in the domain.
+ *
+ * Otherwise, caller will allocate a new domain and insert into the right position
+ * in the domain list sorted by id in ascending order.
+ *
+ * The order in the domain list is visible to users when we print entries
+ * in the schemata file and schemata input is validated to have the same order
+ * as this list.
  */
-struct rdt_domain *rdt_find_domain(struct rdt_resource *r, int id,
-				   struct list_head **pos)
+struct rdt_domain_hdr *rdt_find_domain(struct list_head *h, int id,
+				       struct list_head **pos)
 {
-	struct rdt_domain *d;
+	struct rdt_domain_hdr *d;
 	struct list_head *l;
 
 	if (id < 0)
 		return ERR_PTR(-ENODEV);
 
-	list_for_each(l, &r->domains) {
-		d = list_entry(l, struct rdt_domain, hdr.list);
+	list_for_each(l, h) {
+		d = list_entry(l, struct rdt_domain_hdr, list);
 		/* When id is found, return its domain. */
-		if (id == d->hdr.id)
+		if (id == d->id)
 			return d;
 		/* Stop searching when finding id's position in sorted list. */
-		if (id < d->hdr.id)
+		if (id < d->id)
 			break;
 	}
 
@@ -500,38 +513,32 @@ static int get_domain_id_from_scope(int cpu, enum resctrl_scope scope)
 	return -EINVAL;
 }
 
-/*
- * domain_add_cpu - Add a cpu to a resource's domain list.
- *
- * If an existing domain in the resource r's domain list matches the cpu's
- * resource id, add the cpu in the domain.
- *
- * Otherwise, a new domain is allocated and inserted into the right position
- * in the domain list sorted by id in ascending order.
- *
- * The order in the domain list is visible to users when we print entries
- * in the schemata file and schemata input is validated to have the same order
- * as this list.
- */
-static void domain_add_cpu(int cpu, struct rdt_resource *r)
+static void domain_add_cpu_ctrl(int cpu, struct rdt_resource *r)
 {
-	int id = get_domain_id_from_scope(cpu, r->scope);
+	int id = get_domain_id_from_scope(cpu, r->ctrl_scope);
 	struct list_head *add_pos = NULL;
 	struct rdt_hw_domain *hw_dom;
+	struct rdt_domain_hdr *hdr;
 	struct rdt_domain *d;
 	int err;
 
 	if (id < 0) {
-		pr_warn_once("Can't find domain id for CPU:%d scope:%d for resource %s\n",
-			     cpu, r->scope, r->name);
+		pr_warn_once("Can't find control domain id for CPU:%d scope:%d for resource %s\n",
+			     cpu, r->ctrl_scope, r->name);
 		return;
 	}
-	d = rdt_find_domain(r, id, &add_pos);
-	if (IS_ERR(d)) {
-		pr_warn("Couldn't find cache id for CPU %d\n", cpu);
+
+	hdr = rdt_find_domain(&r->ctrl_domains, id, &add_pos);
+	if (IS_ERR(hdr)) {
+		pr_warn("Couldn't find control scope id=%d for CPU %d\n", id, cpu);
 		return;
 	}
 
+	if (WARN_ON_ONCE(hdr->type != RESCTRL_CTRL_DOMAIN))
+		return;
+
+	d = container_of(hdr, struct rdt_domain, hdr);
+
 	if (d) {
 		cpumask_set_cpu(cpu, &d->cpu_mask);
 		if (r->cache.arch_has_per_cpu_cfg)
@@ -545,48 +552,115 @@ static void domain_add_cpu(int cpu, struct rdt_resource *r)
 
 	d = &hw_dom->d_resctrl;
 	d->hdr.id = id;
+	d->hdr.type = RESCTRL_CTRL_DOMAIN;
 	cpumask_set_cpu(cpu, &d->cpu_mask);
 
 	rdt_domain_reconfigure_cdp(r);
 
-	if (r->alloc_capable && domain_setup_ctrlval(r, d)) {
+	if (domain_setup_ctrlval(r, d)) {
 		domain_free(hw_dom);
 		return;
 	}
 
-	if (r->mon_capable && arch_domain_mbm_alloc(r->num_rmid, hw_dom)) {
+	list_add_tail(&d->hdr.list, add_pos);
+
+	err = resctrl_online_ctrl_domain(r, d);
+	if (err) {
+		list_del(&d->hdr.list);
+		domain_free(hw_dom);
+	}
+}
+
+static void domain_add_cpu_mon(int cpu, struct rdt_resource *r)
+{
+	int id = get_domain_id_from_scope(cpu, r->mon_scope);
+	struct list_head *add_pos = NULL;
+	struct rdt_hw_domain *hw_dom;
+	struct rdt_domain_hdr *hdr;
+	struct rdt_domain *d;
+	int err;
+
+	if (id < 0) {
+		pr_warn_once("Can't find monitor domain id for CPU:%d scope:%d for resource %s\n",
+			     cpu, r->mon_scope, r->name);
+		return;
+	}
+
+	hdr = rdt_find_domain(&r->mon_domains, id, &add_pos);
+	if (IS_ERR(hdr)) {
+		pr_warn("Couldn't find monitor scope id=%d for CPU %d\n", id, cpu);
+		return;
+	}
+
+	if (WARN_ON_ONCE(hdr->type != RESCTRL_MON_DOMAIN))
+		return;
+
+	d = container_of(hdr, struct rdt_domain, hdr);
+
+	if (d) {
+		cpumask_set_cpu(cpu, &d->cpu_mask);
+		return;
+	}
+
+	hw_dom = kzalloc_node(sizeof(*hw_dom), GFP_KERNEL, cpu_to_node(cpu));
+	if (!hw_dom)
+		return;
+
+	d = &hw_dom->d_resctrl;
+	d->hdr.id = id;
+	d->hdr.type = RESCTRL_MON_DOMAIN;
+	cpumask_set_cpu(cpu, &d->cpu_mask);
+
+	if (arch_domain_mbm_alloc(r->num_rmid, hw_dom)) {
 		domain_free(hw_dom);
 		return;
 	}
 
 	list_add_tail(&d->hdr.list, add_pos);
 
-	err = resctrl_online_domain(r, d);
+	err = resctrl_online_mon_domain(r, d);
 	if (err) {
 		list_del(&d->hdr.list);
 		domain_free(hw_dom);
 	}
 }
 
-static void domain_remove_cpu(int cpu, struct rdt_resource *r)
+/*
+ * domain_add_cpu - Add a cpu to either/both resource's domain lists.
+ */
+static void domain_add_cpu(int cpu, struct rdt_resource *r)
+{
+	if (r->alloc_capable)
+		domain_add_cpu_ctrl(cpu, r);
+	if (r->mon_capable)
+		domain_add_cpu_mon(cpu, r);
+}
+
+static void domain_remove_cpu_ctrl(int cpu, struct rdt_resource *r)
 {
-	int id = get_domain_id_from_scope(cpu, r->scope);
+	int id = get_domain_id_from_scope(cpu, r->ctrl_scope);
 	struct rdt_hw_domain *hw_dom;
+	struct rdt_domain_hdr *hdr;
 	struct rdt_domain *d;
 
 	if (id < 0)
 		return;
 
-	d = rdt_find_domain(r, id, NULL);
-	if (IS_ERR_OR_NULL(d)) {
-		pr_warn("Couldn't find cache id for CPU %d\n", cpu);
+	hdr = rdt_find_domain(&r->ctrl_domains, id, NULL);
+	if (IS_ERR_OR_NULL(hdr)) {
+		pr_warn("Couldn't find control scope id=%d for CPU %d\n", id, cpu);
 		return;
 	}
+
+	if (WARN_ON_ONCE(hdr->type != RESCTRL_CTRL_DOMAIN))
+		return;
+
+	d = container_of(hdr, struct rdt_domain, hdr);
 	hw_dom = resctrl_to_arch_dom(d);
 
 	cpumask_clear_cpu(cpu, &d->cpu_mask);
 	if (cpumask_empty(&d->cpu_mask)) {
-		resctrl_offline_domain(r, d);
+		resctrl_offline_ctrl_domain(r, d);
 		list_del(&d->hdr.list);
 
 		/*
@@ -599,6 +673,38 @@ static void domain_remove_cpu(int cpu, struct rdt_resource *r)
 
 		return;
 	}
+}
+
+static void domain_remove_cpu_mon(int cpu, struct rdt_resource *r)
+{
+	int id = get_domain_id_from_scope(cpu, r->mon_scope);
+	struct rdt_hw_domain *hw_dom;
+	struct rdt_domain_hdr *hdr;
+	struct rdt_domain *d;
+
+	if (id < 0)
+		return;
+
+	hdr = rdt_find_domain(&r->mon_domains, id, NULL);
+	if (IS_ERR_OR_NULL(hdr)) {
+		pr_warn("Couldn't find scope id=%d for CPU %d\n", id, cpu);
+		return;
+	}
+
+	if (WARN_ON_ONCE(hdr->type != RESCTRL_MON_DOMAIN))
+		return;
+
+	d = container_of(hdr, struct rdt_domain, hdr);
+	hw_dom = resctrl_to_arch_dom(d);
+
+	cpumask_clear_cpu(cpu, &d->cpu_mask);
+	if (cpumask_empty(&d->cpu_mask)) {
+		resctrl_offline_mon_domain(r, d);
+		list_del(&d->hdr.list);
+		domain_free(hw_dom);
+
+		return;
+	}
 
 	if (r == &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl) {
 		if (is_mbm_enabled() && cpu == d->mbm_work_cpu) {
@@ -613,6 +719,14 @@ static void domain_remove_cpu(int cpu, struct rdt_resource *r)
 	}
 }
 
+static void domain_remove_cpu(int cpu, struct rdt_resource *r)
+{
+	if (r->alloc_capable)
+		domain_remove_cpu_ctrl(cpu, r);
+	if (r->mon_capable)
+		domain_remove_cpu_mon(cpu, r);
+}
+
 static void clear_closid_rmid(int cpu)
 {
 	struct resctrl_pqr_state *state = this_cpu_ptr(&pqr_state);
diff --git a/arch/x86/kernel/cpu/resctrl/ctrlmondata.c b/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
index 8bce591a1018..33ff4d00a08c 100644
--- a/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
+++ b/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
@@ -224,7 +224,7 @@ static int parse_line(char *line, struct resctrl_schema *s,
 		return -EINVAL;
 	}
 	dom = strim(dom);
-	list_for_each_entry(d, &r->domains, hdr.list) {
+	list_for_each_entry(d, &r->ctrl_domains, hdr.list) {
 		if (d->hdr.id == dom_id) {
 			data.buf = dom;
 			data.rdtgrp = rdtgrp;
@@ -316,7 +316,7 @@ int resctrl_arch_update_domains(struct rdt_resource *r, u32 closid)
 		return -ENOMEM;
 
 	msr_param.res = NULL;
-	list_for_each_entry(d, &r->domains, hdr.list) {
+	list_for_each_entry(d, &r->ctrl_domains, hdr.list) {
 		hw_dom = resctrl_to_arch_dom(d);
 		for (t = 0; t < CDP_NUM_TYPES; t++) {
 			cfg = &hw_dom->d_resctrl.staged_config[t];
@@ -464,7 +464,7 @@ static void show_doms(struct seq_file *s, struct resctrl_schema *schema, int clo
 	u32 ctrl_val;
 
 	seq_printf(s, "%*s:", max_name_width, schema->name);
-	list_for_each_entry(dom, &r->domains, hdr.list) {
+	list_for_each_entry(dom, &r->ctrl_domains, hdr.list) {
 		if (sep)
 			seq_puts(s, ";");
 
@@ -540,6 +540,7 @@ void mon_event_read(struct rmid_read *rr, struct rdt_resource *r,
 int rdtgroup_mondata_show(struct seq_file *m, void *arg)
 {
 	struct kernfs_open_file *of = m->private;
+	struct rdt_domain_hdr *hdr;
 	u32 resid, evtid, domid;
 	struct rdtgroup *rdtgrp;
 	struct rdt_resource *r;
@@ -560,12 +561,19 @@ int rdtgroup_mondata_show(struct seq_file *m, void *arg)
 	evtid = md.u.evtid;
 
 	r = &rdt_resources_all[resid].r_resctrl;
-	d = rdt_find_domain(r, domid, NULL);
-	if (IS_ERR_OR_NULL(d)) {
+	hdr = rdt_find_domain(&r->mon_domains, domid, NULL);
+	if (IS_ERR_OR_NULL(hdr)) {
 		ret = -ENOENT;
 		goto out;
 	}
 
+	if (WARN_ON_ONCE(hdr->type != RESCTRL_MON_DOMAIN)) {
+		ret = -EINVAL;
+		goto out;
+	}
+
+	d = container_of(hdr, struct rdt_domain, hdr);
+
 	mon_event_read(&rr, r, d, rdtgrp, evtid, false);
 
 	if (rr.err == -EIO)
diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c
index 27cda5988d7f..3265b8499e2a 100644
--- a/arch/x86/kernel/cpu/resctrl/monitor.c
+++ b/arch/x86/kernel/cpu/resctrl/monitor.c
@@ -340,7 +340,7 @@ static void add_rmid_to_limbo(struct rmid_entry *entry)
 
 	entry->busy = 0;
 	cpu = get_cpu();
-	list_for_each_entry(d, &r->domains, hdr.list) {
+	list_for_each_entry(d, &r->mon_domains, hdr.list) {
 		if (cpumask_test_cpu(cpu, &d->cpu_mask)) {
 			err = resctrl_arch_rmid_read(r, d, entry->rmid,
 						     QOS_L3_OCCUP_EVENT_ID,
diff --git a/arch/x86/kernel/cpu/resctrl/pseudo_lock.c b/arch/x86/kernel/cpu/resctrl/pseudo_lock.c
index 18b6183a1b48..bda32b4e1c1e 100644
--- a/arch/x86/kernel/cpu/resctrl/pseudo_lock.c
+++ b/arch/x86/kernel/cpu/resctrl/pseudo_lock.c
@@ -292,7 +292,7 @@ static void pseudo_lock_region_clear(struct pseudo_lock_region *plr)
  */
 static int pseudo_lock_region_init(struct pseudo_lock_region *plr)
 {
-	int scope = plr->s->res->scope;
+	int scope = plr->s->res->ctrl_scope;
 	struct cpu_cacheinfo *ci;
 	int ret;
 	int i;
@@ -856,7 +856,7 @@ bool rdtgroup_pseudo_locked_in_hierarchy(struct rdt_domain *d)
 	 * associated with them.
 	 */
 	for_each_alloc_capable_rdt_resource(r) {
-		list_for_each_entry(d_i, &r->domains, hdr.list) {
+		list_for_each_entry(d_i, &r->ctrl_domains, hdr.list) {
 			if (d_i->plr)
 				cpumask_or(cpu_with_psl, cpu_with_psl,
 					   &d_i->cpu_mask);
diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
index 739e56cfb31f..90565bb44d0e 100644
--- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
+++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
@@ -86,7 +86,7 @@ void rdt_staged_configs_clear(void)
 	lockdep_assert_held(&rdtgroup_mutex);
 
 	for_each_alloc_capable_rdt_resource(r) {
-		list_for_each_entry(dom, &r->domains, hdr.list)
+		list_for_each_entry(dom, &r->ctrl_domains, hdr.list)
 			memset(dom->staged_config, 0, sizeof(dom->staged_config));
 	}
 }
@@ -928,7 +928,7 @@ static int rdt_bit_usage_show(struct kernfs_open_file *of,
 
 	mutex_lock(&rdtgroup_mutex);
 	hw_shareable = r->cache.shareable_bits;
-	list_for_each_entry(dom, &r->domains, hdr.list) {
+	list_for_each_entry(dom, &r->ctrl_domains, hdr.list) {
 		if (sep)
 			seq_putc(seq, ';');
 		sw_shareable = 0;
@@ -1233,7 +1233,7 @@ static bool rdtgroup_mode_test_exclusive(struct rdtgroup *rdtgrp)
 		if (r->rid == RDT_RESOURCE_MBA || r->rid == RDT_RESOURCE_SMBA)
 			continue;
 		has_cache = true;
-		list_for_each_entry(d, &r->domains, hdr.list) {
+		list_for_each_entry(d, &r->ctrl_domains, hdr.list) {
 			ctrl = resctrl_arch_get_config(r, d, closid,
 						       s->conf_type);
 			if (rdtgroup_cbm_overlaps(s, d, ctrl, closid, false)) {
@@ -1345,13 +1345,13 @@ unsigned int rdtgroup_cbm_to_size(struct rdt_resource *r,
 	unsigned int size = 0;
 	int num_b, i;
 
-	if (WARN_ON_ONCE(r->scope != RESCTRL_L2_CACHE && r->scope != RESCTRL_L3_CACHE))
+	if (WARN_ON_ONCE(r->ctrl_scope != RESCTRL_L2_CACHE && r->ctrl_scope != RESCTRL_L3_CACHE))
 		return size;
 
 	num_b = bitmap_weight(&cbm, r->cache.cbm_len);
 	ci = get_cpu_cacheinfo(cpumask_any(&d->cpu_mask));
 	for (i = 0; i < ci->num_leaves; i++) {
-		if (ci->info_list[i].level == r->scope) {
+		if (ci->info_list[i].level == r->ctrl_scope) {
 			size = ci->info_list[i].size / r->cache.cbm_len * num_b;
 			break;
 		}
@@ -1410,7 +1410,7 @@ static int rdtgroup_size_show(struct kernfs_open_file *of,
 		type = schema->conf_type;
 		sep = false;
 		seq_printf(s, "%*s:", max_name_width, schema->name);
-		list_for_each_entry(d, &r->domains, hdr.list) {
+		list_for_each_entry(d, &r->ctrl_domains, hdr.list) {
 			if (sep)
 				seq_putc(s, ';');
 			if (rdtgrp->mode == RDT_MODE_PSEUDO_LOCKSETUP) {
@@ -1499,7 +1499,7 @@ static int mbm_config_show(struct seq_file *s, struct rdt_resource *r, u32 evtid
 
 	mutex_lock(&rdtgroup_mutex);
 
-	list_for_each_entry(dom, &r->domains, hdr.list) {
+	list_for_each_entry(dom, &r->mon_domains, hdr.list) {
 		if (sep)
 			seq_puts(s, ";");
 
@@ -1622,7 +1622,7 @@ static int mon_config_write(struct rdt_resource *r, char *tok, u32 evtid)
 		return -EINVAL;
 	}
 
-	list_for_each_entry(d, &r->domains, hdr.list) {
+	list_for_each_entry(d, &r->mon_domains, hdr.list) {
 		if (d->hdr.id == dom_id) {
 			ret = mbm_config_write_domain(r, d, evtid, val);
 			if (ret)
@@ -2141,7 +2141,7 @@ static int set_cache_qos_cfg(int level, bool enable)
 		return -ENOMEM;
 
 	r_l = &rdt_resources_all[level].r_resctrl;
-	list_for_each_entry(d, &r_l->domains, hdr.list) {
+	list_for_each_entry(d, &r_l->ctrl_domains, hdr.list) {
 		if (r_l->cache.arch_has_per_cpu_cfg)
 			/* Pick all the CPUs in the domain instance */
 			for_each_cpu(cpu, &d->cpu_mask)
@@ -2226,7 +2226,7 @@ static int set_mba_sc(bool mba_sc)
 
 	r->membw.mba_sc = mba_sc;
 
-	list_for_each_entry(d, &r->domains, hdr.list) {
+	list_for_each_entry(d, &r->ctrl_domains, hdr.list) {
 		for (i = 0; i < num_closid; i++)
 			d->mbps_val[i] = MBA_MAX_MBPS;
 	}
@@ -2528,7 +2528,7 @@ static int rdt_get_tree(struct fs_context *fc)
 
 	if (is_mbm_enabled()) {
 		r = &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl;
-		list_for_each_entry(dom, &r->domains, hdr.list)
+		list_for_each_entry(dom, &r->mon_domains, hdr.list)
 			mbm_setup_overflow_handler(dom, MBM_OVERFLOW_INTERVAL);
 	}
 
@@ -2649,10 +2649,10 @@ static int reset_all_ctrls(struct rdt_resource *r)
 
 	/*
 	 * Disable resource control for this resource by setting all
-	 * CBMs in all domains to the maximum mask value. Pick one CPU
+	 * CBMs in all ctrl_domains to the maximum mask value. Pick one CPU
 	 * from each domain to update the MSRs below.
 	 */
-	list_for_each_entry(d, &r->domains, hdr.list) {
+	list_for_each_entry(d, &r->ctrl_domains, hdr.list) {
 		hw_dom = resctrl_to_arch_dom(d);
 		cpumask_set_cpu(cpumask_any(&d->cpu_mask), cpu_mask);
 
@@ -2922,7 +2922,7 @@ static int mkdir_mondata_subdir_alldom(struct kernfs_node *parent_kn,
 	struct rdt_domain *dom;
 	int ret;
 
-	list_for_each_entry(dom, &r->domains, hdr.list) {
+	list_for_each_entry(dom, &r->mon_domains, hdr.list) {
 		ret = mkdir_mondata_subdir(parent_kn, dom, r, prgrp);
 		if (ret)
 			return ret;
@@ -3104,7 +3104,7 @@ static int rdtgroup_init_cat(struct resctrl_schema *s, u32 closid)
 	struct rdt_domain *d;
 	int ret;
 
-	list_for_each_entry(d, &s->res->domains, hdr.list) {
+	list_for_each_entry(d, &s->res->ctrl_domains, hdr.list) {
 		ret = __init_one_rdt_domain(d, s, closid);
 		if (ret < 0)
 			return ret;
@@ -3119,7 +3119,7 @@ static void rdtgroup_init_mba(struct rdt_resource *r, u32 closid)
 	struct resctrl_staged_config *cfg;
 	struct rdt_domain *d;
 
-	list_for_each_entry(d, &r->domains, hdr.list) {
+	list_for_each_entry(d, &r->ctrl_domains, hdr.list) {
 		if (is_mba_sc(r)) {
 			d->mbps_val[closid] = MBA_MAX_MBPS;
 			continue;
@@ -3711,15 +3711,17 @@ static void domain_destroy_mon_state(struct rdt_domain *d)
 	kfree(d->mbm_local);
 }
 
-void resctrl_offline_domain(struct rdt_resource *r, struct rdt_domain *d)
+void resctrl_offline_ctrl_domain(struct rdt_resource *r, struct rdt_domain *d)
 {
 	lockdep_assert_held(&rdtgroup_mutex);
 
 	if (supports_mba_mbps() && r->rid == RDT_RESOURCE_MBA)
 		mba_sc_domain_destroy(r, d);
+}
 
-	if (!r->mon_capable)
-		return;
+void resctrl_offline_mon_domain(struct rdt_resource *r, struct rdt_domain *d)
+{
+	lockdep_assert_held(&rdtgroup_mutex);
 
 	/*
 	 * If resctrl is mounted, remove all the
@@ -3776,18 +3778,22 @@ static int domain_setup_mon_state(struct rdt_resource *r, struct rdt_domain *d)
 	return 0;
 }
 
-int resctrl_online_domain(struct rdt_resource *r, struct rdt_domain *d)
+int resctrl_online_ctrl_domain(struct rdt_resource *r, struct rdt_domain *d)
 {
-	int err;
-
 	lockdep_assert_held(&rdtgroup_mutex);
 
 	if (supports_mba_mbps() && r->rid == RDT_RESOURCE_MBA)
 		/* RDT_RESOURCE_MBA is never mon_capable */
 		return mba_sc_domain_allocate(r, d);
 
-	if (!r->mon_capable)
-		return 0;
+	return 0;
+}
+
+int resctrl_online_mon_domain(struct rdt_resource *r, struct rdt_domain *d)
+{
+	int err;
+
+	lockdep_assert_held(&rdtgroup_mutex);
 
 	err = domain_setup_mon_state(r, d);
 	if (err)


* [PATCH v8 4/8] x86/resctrl: Split the rdt_domain and rdt_hw_domain structures
  2023-10-03 21:30       ` [PATCH v8 " Tony Luck
                           ` (2 preceding siblings ...)
  2023-10-03 21:30         ` [PATCH v8 3/8] x86/resctrl: Prepare for different scope for control/monitor operations Tony Luck
@ 2023-10-03 21:30         ` Tony Luck
  2023-10-03 21:30         ` [PATCH v8 5/8] x86/resctrl: Add node-scope to the options for feature scope Tony Luck
                           ` (6 subsequent siblings)
  10 siblings, 0 replies; 331+ messages in thread
From: Tony Luck @ 2023-10-03 21:30 UTC (permalink / raw)
  To: Fenghua Yu, Reinette Chatre, Peter Newman, Jonathan Corbet,
	Shuah Khan, x86
  Cc: Shaopeng Tan, James Morse, Jamie Iles, Babu Moger, Randy Dunlap,
	linux-kernel, linux-doc, patches, Tony Luck

The same rdt_domain structure is used for both control and monitor
functions. But this wastes memory: some of the fields are used only by
control functions, while most are used only by monitor functions.

Split it into separate rdt_ctrl_domain and rdt_mon_domain structures,
each carrying just the fields required for control and monitoring
respectively.

Similarly, split the rdt_hw_domain structure into rdt_hw_ctrl_domain
and rdt_hw_mon_domain.
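
A standalone illustration of the memory argument (plain userspace C;
the field names only loosely mirror the kernel structures and the sizes
are purely indicative):

#include <stdio.h>

struct combined_domain {		/* before: one structure for both roles */
	int id;
	unsigned long *rmid_busy_llc;	/* monitor-only */
	void *mbm_total, *mbm_local;	/* monitor-only */
	void *plr;			/* control-only */
	void *staged_config[4];		/* control-only */
	unsigned int *mbps_val;		/* control-only */
};

struct ctrl_domain {			/* after: control-only payload */
	int id;
	void *plr;
	void *staged_config[4];
	unsigned int *mbps_val;
};

struct mon_domain {			/* after: monitor-only payload */
	int id;
	unsigned long *rmid_busy_llc;
	void *mbm_total, *mbm_local;
};

int main(void)
{
	printf("combined: %zu bytes\n", sizeof(struct combined_domain));
	printf("ctrl:     %zu bytes\n", sizeof(struct ctrl_domain));
	printf("mon:      %zu bytes\n", sizeof(struct mon_domain));
	return 0;
}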

Signed-off-by: Tony Luck <tony.luck@intel.com>
Reviewed-by: Peter Newman <peternewman@google.com>

---
Changes since last version:

Fixed two places where line wrapping inside comments was not
properly aligned.


---
diff --git a/include/linux/resctrl.h b/include/linux/resctrl.h
index f47b01ae28ca..56fa04cedb50 100644
--- a/include/linux/resctrl.h
+++ b/include/linux/resctrl.h
@@ -70,7 +70,25 @@ struct rdt_domain_hdr {
 };
 
 /**
- * struct rdt_domain - group of CPUs sharing a resctrl resource
+ * struct rdt_ctrl_domain - group of CPUs sharing a resctrl control resource
+ * @hdr:		common header for different domain types
+ * @cpu_mask:		which CPUs share this resource
+ * @plr:		pseudo-locked region (if any) associated with domain
+ * @staged_config:	parsed configuration to be applied
+ * @mbps_val:		When mba_sc is enabled, this holds the array of user
+ *			specified control values for mba_sc in MBps, indexed
+ *			by closid
+ */
+struct rdt_ctrl_domain {
+	struct rdt_domain_hdr		hdr;
+	struct cpumask			cpu_mask;
+	struct pseudo_lock_region	*plr;
+	struct resctrl_staged_config	staged_config[CDP_NUM_TYPES];
+	u32				*mbps_val;
+};
+
+/**
+ * struct rdt_mon_domain - group of CPUs sharing a resctrl monitor resource
  * @hdr:		common header for different domain types
  * @cpu_mask:		which CPUs share this resource
  * @rmid_busy_llc:	bitmap of which limbo RMIDs are above threshold
@@ -80,13 +98,8 @@ struct rdt_domain_hdr {
  * @cqm_limbo:		worker to periodically read CQM h/w counters
  * @mbm_work_cpu:	worker CPU for MBM h/w counters
  * @cqm_work_cpu:	worker CPU for CQM h/w counters
- * @plr:		pseudo-locked region (if any) associated with domain
- * @staged_config:	parsed configuration to be applied
- * @mbps_val:		When mba_sc is enabled, this holds the array of user
- *			specified control values for mba_sc in MBps, indexed
- *			by closid
  */
-struct rdt_domain {
+struct rdt_mon_domain {
 	struct rdt_domain_hdr		hdr;
 	struct cpumask			cpu_mask;
 	unsigned long			*rmid_busy_llc;
@@ -96,9 +109,6 @@ struct rdt_domain {
 	struct delayed_work		cqm_limbo;
 	int				mbm_work_cpu;
 	int				cqm_work_cpu;
-	struct pseudo_lock_region	*plr;
-	struct resctrl_staged_config	staged_config[CDP_NUM_TYPES];
-	u32				*mbps_val;
 };
 
 /**
@@ -202,7 +212,7 @@ struct rdt_resource {
 	const char		*format_str;
 	int			(*parse_ctrlval)(struct rdt_parse_data *data,
 						 struct resctrl_schema *s,
-						 struct rdt_domain *d);
+						 struct rdt_ctrl_domain *d);
 	struct list_head	evt_list;
 	unsigned long		fflags;
 	bool			cdp_capable;
@@ -236,15 +246,15 @@ int resctrl_arch_update_domains(struct rdt_resource *r, u32 closid);
  * Update the ctrl_val and apply this config right now.
  * Must be called on one of the domain's CPUs.
  */
-int resctrl_arch_update_one(struct rdt_resource *r, struct rdt_domain *d,
+int resctrl_arch_update_one(struct rdt_resource *r, struct rdt_ctrl_domain *d,
 			    u32 closid, enum resctrl_conf_type t, u32 cfg_val);
 
-u32 resctrl_arch_get_config(struct rdt_resource *r, struct rdt_domain *d,
+u32 resctrl_arch_get_config(struct rdt_resource *r, struct rdt_ctrl_domain *d,
 			    u32 closid, enum resctrl_conf_type type);
-int resctrl_online_ctrl_domain(struct rdt_resource *r, struct rdt_domain *d);
-int resctrl_online_mon_domain(struct rdt_resource *r, struct rdt_domain *d);
-void resctrl_offline_ctrl_domain(struct rdt_resource *r, struct rdt_domain *d);
-void resctrl_offline_mon_domain(struct rdt_resource *r, struct rdt_domain *d);
+int resctrl_online_ctrl_domain(struct rdt_resource *r, struct rdt_ctrl_domain *d);
+int resctrl_online_mon_domain(struct rdt_resource *r, struct rdt_mon_domain *d);
+void resctrl_offline_ctrl_domain(struct rdt_resource *r, struct rdt_ctrl_domain *d);
+void resctrl_offline_mon_domain(struct rdt_resource *r, struct rdt_mon_domain *d);
 
 /**
  * resctrl_arch_rmid_read() - Read the eventid counter corresponding to rmid
@@ -260,7 +270,7 @@ void resctrl_offline_mon_domain(struct rdt_resource *r, struct rdt_domain *d);
  * Return:
  * 0 on success, or -EIO, -EINVAL etc on error.
  */
-int resctrl_arch_rmid_read(struct rdt_resource *r, struct rdt_domain *d,
+int resctrl_arch_rmid_read(struct rdt_resource *r, struct rdt_mon_domain *d,
 			   u32 rmid, enum resctrl_event_id eventid, u64 *val);
 
 /**
@@ -273,7 +283,7 @@ int resctrl_arch_rmid_read(struct rdt_resource *r, struct rdt_domain *d,
  *
  * This can be called from any CPU.
  */
-void resctrl_arch_reset_rmid(struct rdt_resource *r, struct rdt_domain *d,
+void resctrl_arch_reset_rmid(struct rdt_resource *r, struct rdt_mon_domain *d,
 			     u32 rmid, enum resctrl_event_id eventid);
 
 /**
@@ -285,7 +295,7 @@ void resctrl_arch_reset_rmid(struct rdt_resource *r, struct rdt_domain *d,
  *
  * This can be called from any CPU.
  */
-void resctrl_arch_reset_rmid_all(struct rdt_resource *r, struct rdt_domain *d);
+void resctrl_arch_reset_rmid_all(struct rdt_resource *r, struct rdt_mon_domain *d);
 
 extern unsigned int resctrl_rmid_realloc_threshold;
 extern unsigned int resctrl_rmid_realloc_limit;
diff --git a/arch/x86/kernel/cpu/resctrl/internal.h b/arch/x86/kernel/cpu/resctrl/internal.h
index e9a2a8993d14..3aed8e7b8487 100644
--- a/arch/x86/kernel/cpu/resctrl/internal.h
+++ b/arch/x86/kernel/cpu/resctrl/internal.h
@@ -106,7 +106,7 @@ union mon_data_bits {
 struct rmid_read {
 	struct rdtgroup		*rgrp;
 	struct rdt_resource	*r;
-	struct rdt_domain	*d;
+	struct rdt_mon_domain	*d;
 	enum resctrl_event_id	evtid;
 	bool			first;
 	int			err;
@@ -191,7 +191,7 @@ struct mongroup {
  */
 struct pseudo_lock_region {
 	struct resctrl_schema	*s;
-	struct rdt_domain	*d;
+	struct rdt_ctrl_domain	*d;
 	u32			cbm;
 	wait_queue_head_t	lock_thread_wq;
 	int			thread_done;
@@ -319,25 +319,41 @@ struct arch_mbm_state {
 };
 
 /**
- * struct rdt_hw_domain - Arch private attributes of a set of CPUs that share
- *			  a resource
+ * struct rdt_hw_ctrl_domain - Arch private attributes of a set of CPUs that share
+ *			       a resource for a control function
  * @d_resctrl:	Properties exposed to the resctrl file system
  * @ctrl_val:	array of cache or mem ctrl values (indexed by CLOSID)
+ *
+ * Members of this structure are accessed via helpers that provide abstraction.
+ */
+struct rdt_hw_ctrl_domain {
+	struct rdt_ctrl_domain		d_resctrl;
+	u32				*ctrl_val;
+};
+
+/**
+ * struct rdt_hw_mon_domain - Arch private attributes of a set of CPUs that share
+ *			      a resource for a monitor function
+ * @d_resctrl:	Properties exposed to the resctrl file system
  * @arch_mbm_total:	arch private state for MBM total bandwidth
  * @arch_mbm_local:	arch private state for MBM local bandwidth
  *
  * Members of this structure are accessed via helpers that provide abstraction.
  */
-struct rdt_hw_domain {
-	struct rdt_domain		d_resctrl;
-	u32				*ctrl_val;
+struct rdt_hw_mon_domain {
+	struct rdt_mon_domain		d_resctrl;
 	struct arch_mbm_state		*arch_mbm_total;
 	struct arch_mbm_state		*arch_mbm_local;
 };
 
-static inline struct rdt_hw_domain *resctrl_to_arch_dom(struct rdt_domain *r)
+static inline struct rdt_hw_ctrl_domain *resctrl_to_arch_ctrl_dom(struct rdt_ctrl_domain *r)
+{
+	return container_of(r, struct rdt_hw_ctrl_domain, d_resctrl);
+}
+
+static inline struct rdt_hw_mon_domain *resctrl_to_arch_mon_dom(struct rdt_mon_domain *r)
 {
-	return container_of(r, struct rdt_hw_domain, d_resctrl);
+	return container_of(r, struct rdt_hw_mon_domain, d_resctrl);
 }
 
 /**
@@ -405,7 +421,7 @@ struct rdt_hw_resource {
 	struct rdt_resource	r_resctrl;
 	u32			num_closid;
 	unsigned int		msr_base;
-	void (*msr_update)	(struct rdt_domain *d, struct msr_param *m,
+	void (*msr_update)	(struct rdt_ctrl_domain *d, struct msr_param *m,
 				 struct rdt_resource *r);
 	unsigned int		mon_scale;
 	unsigned int		mbm_width;
@@ -418,9 +434,9 @@ static inline struct rdt_hw_resource *resctrl_to_arch_res(struct rdt_resource *r
 }
 
 int parse_cbm(struct rdt_parse_data *data, struct resctrl_schema *s,
-	      struct rdt_domain *d);
+	      struct rdt_ctrl_domain *d);
 int parse_bw(struct rdt_parse_data *data, struct resctrl_schema *s,
-	     struct rdt_domain *d);
+	     struct rdt_ctrl_domain *d);
 
 extern struct mutex rdtgroup_mutex;
 
@@ -517,21 +533,21 @@ ssize_t rdtgroup_schemata_write(struct kernfs_open_file *of,
 				char *buf, size_t nbytes, loff_t off);
 int rdtgroup_schemata_show(struct kernfs_open_file *of,
 			   struct seq_file *s, void *v);
-bool rdtgroup_cbm_overlaps(struct resctrl_schema *s, struct rdt_domain *d,
+bool rdtgroup_cbm_overlaps(struct resctrl_schema *s, struct rdt_ctrl_domain *d,
 			   unsigned long cbm, int closid, bool exclusive);
-unsigned int rdtgroup_cbm_to_size(struct rdt_resource *r, struct rdt_domain *d,
+unsigned int rdtgroup_cbm_to_size(struct rdt_resource *r, struct rdt_ctrl_domain *d,
 				  unsigned long cbm);
 enum rdtgrp_mode rdtgroup_mode_by_closid(int closid);
 int rdtgroup_tasks_assigned(struct rdtgroup *r);
 int rdtgroup_locksetup_enter(struct rdtgroup *rdtgrp);
 int rdtgroup_locksetup_exit(struct rdtgroup *rdtgrp);
-bool rdtgroup_cbm_overlaps_pseudo_locked(struct rdt_domain *d, unsigned long cbm);
-bool rdtgroup_pseudo_locked_in_hierarchy(struct rdt_domain *d);
+bool rdtgroup_cbm_overlaps_pseudo_locked(struct rdt_ctrl_domain *d, unsigned long cbm);
+bool rdtgroup_pseudo_locked_in_hierarchy(struct rdt_ctrl_domain *d);
 int rdt_pseudo_lock_init(void);
 void rdt_pseudo_lock_release(void);
 int rdtgroup_pseudo_lock_create(struct rdtgroup *rdtgrp);
 void rdtgroup_pseudo_lock_remove(struct rdtgroup *rdtgrp);
-struct rdt_domain *get_domain_from_cpu(int cpu, struct rdt_resource *r);
+struct rdt_ctrl_domain *get_domain_from_cpu(int cpu, struct rdt_resource *r);
 int closids_supported(void);
 void closid_free(int closid);
 int alloc_rmid(void);
@@ -541,17 +557,17 @@ bool __init rdt_cpu_has(int flag);
 void mon_event_count(void *info);
 int rdtgroup_mondata_show(struct seq_file *m, void *arg);
 void mon_event_read(struct rmid_read *rr, struct rdt_resource *r,
-		    struct rdt_domain *d, struct rdtgroup *rdtgrp,
+		    struct rdt_mon_domain *d, struct rdtgroup *rdtgrp,
 		    int evtid, int first);
-void mbm_setup_overflow_handler(struct rdt_domain *dom,
+void mbm_setup_overflow_handler(struct rdt_mon_domain *dom,
 				unsigned long delay_ms);
 void mbm_handle_overflow(struct work_struct *work);
 void __init intel_rdt_mbm_apply_quirk(void);
 bool is_mba_sc(struct rdt_resource *r);
-void cqm_setup_limbo_handler(struct rdt_domain *dom, unsigned long delay_ms);
+void cqm_setup_limbo_handler(struct rdt_mon_domain *dom, unsigned long delay_ms);
 void cqm_handle_limbo(struct work_struct *work);
-bool has_busy_rmid(struct rdt_resource *r, struct rdt_domain *d);
-void __check_limbo(struct rdt_domain *d, bool force_free);
+bool has_busy_rmid(struct rdt_resource *r, struct rdt_mon_domain *d);
+void __check_limbo(struct rdt_mon_domain *d, bool force_free);
 void rdt_domain_reconfigure_cdp(struct rdt_resource *r);
 void __init thread_throttle_mode_init(void);
 void __init mbm_config_rftype_init(const char *config);
diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
index ae2b66d38afc..d92fdce4e44f 100644
--- a/arch/x86/kernel/cpu/resctrl/core.c
+++ b/arch/x86/kernel/cpu/resctrl/core.c
@@ -49,12 +49,12 @@ int max_name_width, max_data_width;
 bool rdt_alloc_capable;
 
 static void
-mba_wrmsr_intel(struct rdt_domain *d, struct msr_param *m,
+mba_wrmsr_intel(struct rdt_ctrl_domain *d, struct msr_param *m,
 		struct rdt_resource *r);
 static void
-cat_wrmsr(struct rdt_domain *d, struct msr_param *m, struct rdt_resource *r);
+cat_wrmsr(struct rdt_ctrl_domain *d, struct msr_param *m, struct rdt_resource *r);
 static void
-mba_wrmsr_amd(struct rdt_domain *d, struct msr_param *m,
+mba_wrmsr_amd(struct rdt_ctrl_domain *d, struct msr_param *m,
 	      struct rdt_resource *r);
 
 #define ctrl_domain_init(id) LIST_HEAD_INIT(rdt_resources_all[id].r_resctrl.ctrl_domains)
@@ -303,11 +303,11 @@ static void rdt_get_cdp_l2_config(void)
 }
 
 static void
-mba_wrmsr_amd(struct rdt_domain *d, struct msr_param *m, struct rdt_resource *r)
+mba_wrmsr_amd(struct rdt_ctrl_domain *d, struct msr_param *m, struct rdt_resource *r)
 {
-	unsigned int i;
-	struct rdt_hw_domain *hw_dom = resctrl_to_arch_dom(d);
+	struct rdt_hw_ctrl_domain *hw_dom = resctrl_to_arch_ctrl_dom(d);
 	struct rdt_hw_resource *hw_res = resctrl_to_arch_res(r);
+	unsigned int i;
 
 	for (i = m->low; i < m->high; i++)
 		wrmsrl(hw_res->msr_base + i, hw_dom->ctrl_val[i]);
@@ -328,12 +328,12 @@ static u32 delay_bw_map(unsigned long bw, struct rdt_resource *r)
 }
 
 static void
-mba_wrmsr_intel(struct rdt_domain *d, struct msr_param *m,
+mba_wrmsr_intel(struct rdt_ctrl_domain *d, struct msr_param *m,
 		struct rdt_resource *r)
 {
-	unsigned int i;
-	struct rdt_hw_domain *hw_dom = resctrl_to_arch_dom(d);
+	struct rdt_hw_ctrl_domain *hw_dom = resctrl_to_arch_ctrl_dom(d);
 	struct rdt_hw_resource *hw_res = resctrl_to_arch_res(r);
+	unsigned int i;
 
 	/*  Write the delay values for mba. */
 	for (i = m->low; i < m->high; i++)
@@ -341,19 +341,19 @@ mba_wrmsr_intel(struct rdt_domain *d, struct msr_param *m,
 }
 
 static void
-cat_wrmsr(struct rdt_domain *d, struct msr_param *m, struct rdt_resource *r)
+cat_wrmsr(struct rdt_ctrl_domain *d, struct msr_param *m, struct rdt_resource *r)
 {
-	unsigned int i;
-	struct rdt_hw_domain *hw_dom = resctrl_to_arch_dom(d);
+	struct rdt_hw_ctrl_domain *hw_dom = resctrl_to_arch_ctrl_dom(d);
 	struct rdt_hw_resource *hw_res = resctrl_to_arch_res(r);
+	unsigned int i;
 
 	for (i = m->low; i < m->high; i++)
 		wrmsrl(hw_res->msr_base + i, hw_dom->ctrl_val[i]);
 }
 
-struct rdt_domain *get_domain_from_cpu(int cpu, struct rdt_resource *r)
+struct rdt_ctrl_domain *get_domain_from_cpu(int cpu, struct rdt_resource *r)
 {
-	struct rdt_domain *d;
+	struct rdt_ctrl_domain *d;
 
 	list_for_each_entry(d, &r->ctrl_domains, hdr.list) {
 		/* Find the domain that contains this CPU */
@@ -375,7 +375,7 @@ void rdt_ctrl_update(void *arg)
 	struct rdt_hw_resource *hw_res = resctrl_to_arch_res(m->res);
 	struct rdt_resource *r = m->res;
 	int cpu = smp_processor_id();
-	struct rdt_domain *d;
+	struct rdt_ctrl_domain *d;
 
 	d = get_domain_from_cpu(cpu, r);
 	if (d) {
@@ -443,18 +443,23 @@ static void setup_default_ctrlval(struct rdt_resource *r, u32 *dc)
 		*dc = r->default_ctrl;
 }
 
-static void domain_free(struct rdt_hw_domain *hw_dom)
+static void ctrl_domain_free(struct rdt_hw_ctrl_domain *hw_dom)
+{
+	kfree(hw_dom->ctrl_val);
+	kfree(hw_dom);
+}
+
+static void mon_domain_free(struct rdt_hw_mon_domain *hw_dom)
 {
 	kfree(hw_dom->arch_mbm_total);
 	kfree(hw_dom->arch_mbm_local);
-	kfree(hw_dom->ctrl_val);
 	kfree(hw_dom);
 }
 
-static int domain_setup_ctrlval(struct rdt_resource *r, struct rdt_domain *d)
+static int domain_setup_ctrlval(struct rdt_resource *r, struct rdt_ctrl_domain *d)
 {
+	struct rdt_hw_ctrl_domain *hw_dom = resctrl_to_arch_ctrl_dom(d);
 	struct rdt_hw_resource *hw_res = resctrl_to_arch_res(r);
-	struct rdt_hw_domain *hw_dom = resctrl_to_arch_dom(d);
 	struct msr_param m;
 	u32 *dc;
 
@@ -477,7 +482,7 @@ static int domain_setup_ctrlval(struct rdt_resource *r, struct rdt_domain *d)
  * @num_rmid:	The size of the MBM counter array
  * @hw_dom:	The domain that owns the allocated arrays
  */
-static int arch_domain_mbm_alloc(u32 num_rmid, struct rdt_hw_domain *hw_dom)
+static int arch_domain_mbm_alloc(u32 num_rmid, struct rdt_hw_mon_domain *hw_dom)
 {
 	size_t tsize;
 
@@ -516,10 +521,10 @@ static int get_domain_id_from_scope(int cpu, enum resctrl_scope scope)
 static void domain_add_cpu_ctrl(int cpu, struct rdt_resource *r)
 {
 	int id = get_domain_id_from_scope(cpu, r->ctrl_scope);
+	struct rdt_hw_ctrl_domain *hw_dom;
 	struct list_head *add_pos = NULL;
-	struct rdt_hw_domain *hw_dom;
 	struct rdt_domain_hdr *hdr;
-	struct rdt_domain *d;
+	struct rdt_ctrl_domain *d;
 	int err;
 
 	if (id < 0) {
@@ -537,7 +542,7 @@ static void domain_add_cpu_ctrl(int cpu, struct rdt_resource *r)
 	if (WARN_ON_ONCE(hdr->type != RESCTRL_CTRL_DOMAIN))
 		return;
 
-	d = container_of(hdr, struct rdt_domain, hdr);
+	d = container_of(hdr, struct rdt_ctrl_domain, hdr);
 
 	if (d) {
 		cpumask_set_cpu(cpu, &d->cpu_mask);
@@ -558,7 +563,7 @@ static void domain_add_cpu_ctrl(int cpu, struct rdt_resource *r)
 	rdt_domain_reconfigure_cdp(r);
 
 	if (domain_setup_ctrlval(r, d)) {
-		domain_free(hw_dom);
+		ctrl_domain_free(hw_dom);
 		return;
 	}
 
@@ -567,17 +572,17 @@ static void domain_add_cpu_ctrl(int cpu, struct rdt_resource *r)
 	err = resctrl_online_ctrl_domain(r, d);
 	if (err) {
 		list_del(&d->hdr.list);
-		domain_free(hw_dom);
+		ctrl_domain_free(hw_dom);
 	}
 }
 
 static void domain_add_cpu_mon(int cpu, struct rdt_resource *r)
 {
 	int id = get_domain_id_from_scope(cpu, r->mon_scope);
+	struct rdt_hw_mon_domain *hw_dom;
 	struct list_head *add_pos = NULL;
-	struct rdt_hw_domain *hw_dom;
 	struct rdt_domain_hdr *hdr;
-	struct rdt_domain *d;
+	struct rdt_mon_domain *d;
 	int err;
 
 	if (id < 0) {
@@ -595,7 +600,7 @@ static void domain_add_cpu_mon(int cpu, struct rdt_resource *r)
 	if (WARN_ON_ONCE(hdr->type != RESCTRL_MON_DOMAIN))
 		return;
 
-	d = container_of(hdr, struct rdt_domain, hdr);
+	d = container_of(hdr, struct rdt_mon_domain, hdr);
 
 	if (d) {
 		cpumask_set_cpu(cpu, &d->cpu_mask);
@@ -612,7 +617,7 @@ static void domain_add_cpu_mon(int cpu, struct rdt_resource *r)
 	cpumask_set_cpu(cpu, &d->cpu_mask);
 
 	if (arch_domain_mbm_alloc(r->num_rmid, hw_dom)) {
-		domain_free(hw_dom);
+		mon_domain_free(hw_dom);
 		return;
 	}
 
@@ -621,7 +626,7 @@ static void domain_add_cpu_mon(int cpu, struct rdt_resource *r)
 	err = resctrl_online_mon_domain(r, d);
 	if (err) {
 		list_del(&d->hdr.list);
-		domain_free(hw_dom);
+		mon_domain_free(hw_dom);
 	}
 }
 
@@ -639,9 +644,9 @@ static void domain_add_cpu(int cpu, struct rdt_resource *r)
 static void domain_remove_cpu_ctrl(int cpu, struct rdt_resource *r)
 {
 	int id = get_domain_id_from_scope(cpu, r->ctrl_scope);
-	struct rdt_hw_domain *hw_dom;
+	struct rdt_hw_ctrl_domain *hw_dom;
 	struct rdt_domain_hdr *hdr;
-	struct rdt_domain *d;
+	struct rdt_ctrl_domain *d;
 
 	if (id < 0)
 		return;
@@ -655,8 +660,8 @@ static void domain_remove_cpu_ctrl(int cpu, struct rdt_resource *r)
 	if (WARN_ON_ONCE(hdr->type != RESCTRL_CTRL_DOMAIN))
 		return;
 
-	d = container_of(hdr, struct rdt_domain, hdr);
-	hw_dom = resctrl_to_arch_dom(d);
+	d = container_of(hdr, struct rdt_ctrl_domain, hdr);
+	hw_dom = resctrl_to_arch_ctrl_dom(d);
 
 	cpumask_clear_cpu(cpu, &d->cpu_mask);
 	if (cpumask_empty(&d->cpu_mask)) {
@@ -664,12 +669,12 @@ static void domain_remove_cpu_ctrl(int cpu, struct rdt_resource *r)
 		list_del(&d->hdr.list);
 
 		/*
-		 * rdt_domain "d" is going to be freed below, so clear
+		 * rdt_ctrl_domain "d" is going to be freed below, so clear
 		 * its pointer from pseudo_lock_region struct.
 		 */
 		if (d->plr)
 			d->plr->d = NULL;
-		domain_free(hw_dom);
+		ctrl_domain_free(hw_dom);
 
 		return;
 	}
@@ -678,9 +683,9 @@ static void domain_remove_cpu_ctrl(int cpu, struct rdt_resource *r)
 static void domain_remove_cpu_mon(int cpu, struct rdt_resource *r)
 {
 	int id = get_domain_id_from_scope(cpu, r->mon_scope);
-	struct rdt_hw_domain *hw_dom;
+	struct rdt_hw_mon_domain *hw_dom;
 	struct rdt_domain_hdr *hdr;
-	struct rdt_domain *d;
+	struct rdt_mon_domain *d;
 
 	if (id < 0)
 		return;
@@ -694,14 +699,14 @@ static void domain_remove_cpu_mon(int cpu, struct rdt_resource *r)
 	if (WARN_ON_ONCE(hdr->type != RESCTRL_MON_DOMAIN))
 		return;
 
-	d = container_of(hdr, struct rdt_domain, hdr);
-	hw_dom = resctrl_to_arch_dom(d);
+	d = container_of(hdr, struct rdt_mon_domain, hdr);
+	hw_dom = resctrl_to_arch_mon_dom(d);
 
 	cpumask_clear_cpu(cpu, &d->cpu_mask);
 	if (cpumask_empty(&d->cpu_mask)) {
 		resctrl_offline_mon_domain(r, d);
 		list_del(&d->hdr.list);
-		domain_free(hw_dom);
+		mon_domain_free(hw_dom);
 
 		return;
 	}
diff --git a/arch/x86/kernel/cpu/resctrl/ctrlmondata.c b/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
index 33ff4d00a08c..b0b868baaaa3 100644
--- a/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
+++ b/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
@@ -58,7 +58,7 @@ static bool bw_validate(char *buf, unsigned long *data, struct rdt_resource *r)
 }
 
 int parse_bw(struct rdt_parse_data *data, struct resctrl_schema *s,
-	     struct rdt_domain *d)
+	     struct rdt_ctrl_domain *d)
 {
 	struct resctrl_staged_config *cfg;
 	u32 closid = data->rdtgrp->closid;
@@ -135,7 +135,7 @@ static bool cbm_validate(char *buf, u32 *data, struct rdt_resource *r)
  * resource type.
  */
 int parse_cbm(struct rdt_parse_data *data, struct resctrl_schema *s,
-	      struct rdt_domain *d)
+	      struct rdt_ctrl_domain *d)
 {
 	struct rdtgroup *rdtgrp = data->rdtgrp;
 	struct resctrl_staged_config *cfg;
@@ -205,7 +205,7 @@ static int parse_line(char *line, struct resctrl_schema *s,
 	struct rdt_resource *r = s->res;
 	struct rdt_parse_data data;
 	char *dom = NULL, *id;
-	struct rdt_domain *d;
+	struct rdt_ctrl_domain *d;
 	unsigned long dom_id;
 
 	if (rdtgrp->mode == RDT_MODE_PSEUDO_LOCKSETUP &&
@@ -265,11 +265,11 @@ static u32 get_config_index(u32 closid, enum resctrl_conf_type type)
 	}
 }
 
-static bool apply_config(struct rdt_hw_domain *hw_dom,
+static bool apply_config(struct rdt_hw_ctrl_domain *hw_dom,
 			 struct resctrl_staged_config *cfg, u32 idx,
 			 cpumask_var_t cpu_mask)
 {
-	struct rdt_domain *dom = &hw_dom->d_resctrl;
+	struct rdt_ctrl_domain *dom = &hw_dom->d_resctrl;
 
 	if (cfg->new_ctrl != hw_dom->ctrl_val[idx]) {
 		cpumask_set_cpu(cpumask_any(&dom->cpu_mask), cpu_mask);
@@ -281,11 +281,11 @@ static bool apply_config(struct rdt_hw_domain *hw_dom,
 	return false;
 }
 
-int resctrl_arch_update_one(struct rdt_resource *r, struct rdt_domain *d,
+int resctrl_arch_update_one(struct rdt_resource *r, struct rdt_ctrl_domain *d,
 			    u32 closid, enum resctrl_conf_type t, u32 cfg_val)
 {
+	struct rdt_hw_ctrl_domain *hw_dom = resctrl_to_arch_ctrl_dom(d);
 	struct rdt_hw_resource *hw_res = resctrl_to_arch_res(r);
-	struct rdt_hw_domain *hw_dom = resctrl_to_arch_dom(d);
 	u32 idx = get_config_index(closid, t);
 	struct msr_param msr_param;
 
@@ -305,11 +305,11 @@ int resctrl_arch_update_one(struct rdt_resource *r, struct rdt_domain *d,
 int resctrl_arch_update_domains(struct rdt_resource *r, u32 closid)
 {
 	struct resctrl_staged_config *cfg;
-	struct rdt_hw_domain *hw_dom;
+	struct rdt_hw_ctrl_domain *hw_dom;
 	struct msr_param msr_param;
 	enum resctrl_conf_type t;
+	struct rdt_ctrl_domain *d;
 	cpumask_var_t cpu_mask;
-	struct rdt_domain *d;
 	u32 idx;
 
 	if (!zalloc_cpumask_var(&cpu_mask, GFP_KERNEL))
@@ -317,7 +317,7 @@ int resctrl_arch_update_domains(struct rdt_resource *r, u32 closid)
 
 	msr_param.res = NULL;
 	list_for_each_entry(d, &r->ctrl_domains, hdr.list) {
-		hw_dom = resctrl_to_arch_dom(d);
+		hw_dom = resctrl_to_arch_ctrl_dom(d);
 		for (t = 0; t < CDP_NUM_TYPES; t++) {
 			cfg = &hw_dom->d_resctrl.staged_config[t];
 			if (!cfg->have_new_ctrl)
@@ -447,10 +447,10 @@ ssize_t rdtgroup_schemata_write(struct kernfs_open_file *of,
 	return ret ?: nbytes;
 }
 
-u32 resctrl_arch_get_config(struct rdt_resource *r, struct rdt_domain *d,
+u32 resctrl_arch_get_config(struct rdt_resource *r, struct rdt_ctrl_domain *d,
 			    u32 closid, enum resctrl_conf_type type)
 {
-	struct rdt_hw_domain *hw_dom = resctrl_to_arch_dom(d);
+	struct rdt_hw_ctrl_domain *hw_dom = resctrl_to_arch_ctrl_dom(d);
 	u32 idx = get_config_index(closid, type);
 
 	return hw_dom->ctrl_val[idx];
@@ -459,7 +459,7 @@ u32 resctrl_arch_get_config(struct rdt_resource *r, struct rdt_domain *d,
 static void show_doms(struct seq_file *s, struct resctrl_schema *schema, int closid)
 {
 	struct rdt_resource *r = schema->res;
-	struct rdt_domain *dom;
+	struct rdt_ctrl_domain *dom;
 	bool sep = false;
 	u32 ctrl_val;
 
@@ -521,7 +521,7 @@ int rdtgroup_schemata_show(struct kernfs_open_file *of,
 }
 
 void mon_event_read(struct rmid_read *rr, struct rdt_resource *r,
-		    struct rdt_domain *d, struct rdtgroup *rdtgrp,
+		    struct rdt_mon_domain *d, struct rdtgroup *rdtgrp,
 		    int evtid, int first)
 {
 	/*
@@ -541,11 +541,11 @@ int rdtgroup_mondata_show(struct seq_file *m, void *arg)
 {
 	struct kernfs_open_file *of = m->private;
 	struct rdt_domain_hdr *hdr;
+	struct rdt_mon_domain *d;
 	u32 resid, evtid, domid;
 	struct rdtgroup *rdtgrp;
 	struct rdt_resource *r;
 	union mon_data_bits md;
-	struct rdt_domain *d;
 	struct rmid_read rr;
 	int ret = 0;
 
@@ -572,7 +572,7 @@ int rdtgroup_mondata_show(struct seq_file *m, void *arg)
 		goto out;
 	}
 
-	d = container_of(hdr, struct rdt_domain, hdr);
+	d = container_of(hdr, struct rdt_mon_domain, hdr);
 
 	mon_event_read(&rr, r, d, rdtgrp, evtid, false);
 
diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c
index 3265b8499e2a..97d2ed829f5d 100644
--- a/arch/x86/kernel/cpu/resctrl/monitor.c
+++ b/arch/x86/kernel/cpu/resctrl/monitor.c
@@ -170,7 +170,7 @@ static int __rmid_read(u32 rmid, enum resctrl_event_id eventid, u64 *val)
 	return 0;
 }
 
-static struct arch_mbm_state *get_arch_mbm_state(struct rdt_hw_domain *hw_dom,
+static struct arch_mbm_state *get_arch_mbm_state(struct rdt_hw_mon_domain *hw_dom,
 						 u32 rmid,
 						 enum resctrl_event_id eventid)
 {
@@ -189,10 +189,10 @@ static struct arch_mbm_state *get_arch_mbm_state(struct rdt_hw_domain *hw_dom,
 	return NULL;
 }
 
-void resctrl_arch_reset_rmid(struct rdt_resource *r, struct rdt_domain *d,
+void resctrl_arch_reset_rmid(struct rdt_resource *r, struct rdt_mon_domain *d,
 			     u32 rmid, enum resctrl_event_id eventid)
 {
-	struct rdt_hw_domain *hw_dom = resctrl_to_arch_dom(d);
+	struct rdt_hw_mon_domain *hw_dom = resctrl_to_arch_mon_dom(d);
 	struct arch_mbm_state *am;
 
 	am = get_arch_mbm_state(hw_dom, rmid, eventid);
@@ -208,9 +208,9 @@ void resctrl_arch_reset_rmid(struct rdt_resource *r, struct rdt_domain *d,
  * Assumes that hardware counters are also reset and thus that there is
  * no need to record initial non-zero counts.
  */
-void resctrl_arch_reset_rmid_all(struct rdt_resource *r, struct rdt_domain *d)
+void resctrl_arch_reset_rmid_all(struct rdt_resource *r, struct rdt_mon_domain *d)
 {
-	struct rdt_hw_domain *hw_dom = resctrl_to_arch_dom(d);
+	struct rdt_hw_mon_domain *hw_dom = resctrl_to_arch_mon_dom(d);
 
 	if (is_mbm_total_enabled())
 		memset(hw_dom->arch_mbm_total, 0,
@@ -229,11 +229,11 @@ static u64 mbm_overflow_count(u64 prev_msr, u64 cur_msr, unsigned int width)
 	return chunks >> shift;
 }
 
-int resctrl_arch_rmid_read(struct rdt_resource *r, struct rdt_domain *d,
+int resctrl_arch_rmid_read(struct rdt_resource *r, struct rdt_mon_domain *d,
 			   u32 rmid, enum resctrl_event_id eventid, u64 *val)
 {
+	struct rdt_hw_mon_domain *hw_dom = resctrl_to_arch_mon_dom(d);
 	struct rdt_hw_resource *hw_res = resctrl_to_arch_res(r);
-	struct rdt_hw_domain *hw_dom = resctrl_to_arch_dom(d);
 	struct arch_mbm_state *am;
 	u64 msr_val, chunks;
 	int ret;
@@ -266,7 +266,7 @@ int resctrl_arch_rmid_read(struct rdt_resource *r, struct rdt_domain *d,
  * decrement the count. If the busy count gets to zero on an RMID, we
  * free the RMID
  */
-void __check_limbo(struct rdt_domain *d, bool force_free)
+void __check_limbo(struct rdt_mon_domain *d, bool force_free)
 {
 	struct rdt_resource *r = &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl;
 	struct rmid_entry *entry;
@@ -305,7 +305,7 @@ void __check_limbo(struct rdt_domain *d, bool force_free)
 	}
 }
 
-bool has_busy_rmid(struct rdt_resource *r, struct rdt_domain *d)
+bool has_busy_rmid(struct rdt_resource *r, struct rdt_mon_domain *d)
 {
 	return find_first_bit(d->rmid_busy_llc, r->num_rmid) != r->num_rmid;
 }
@@ -334,7 +334,7 @@ int alloc_rmid(void)
 static void add_rmid_to_limbo(struct rmid_entry *entry)
 {
 	struct rdt_resource *r = &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl;
-	struct rdt_domain *d;
+	struct rdt_mon_domain *d;
 	int cpu, err;
 	u64 val = 0;
 
@@ -383,7 +383,7 @@ void free_rmid(u32 rmid)
 		list_add_tail(&entry->list, &rmid_free_lru);
 }
 
-static struct mbm_state *get_mbm_state(struct rdt_domain *d, u32 rmid,
+static struct mbm_state *get_mbm_state(struct rdt_mon_domain *d, u32 rmid,
 				       enum resctrl_event_id evtid)
 {
 	switch (evtid) {
@@ -516,13 +516,13 @@ void mon_event_count(void *info)
  * throttle MSRs already have low percentage values.  To avoid
  * unnecessarily restricting such rdtgroups, we also increase the bandwidth.
  */
-static void update_mba_bw(struct rdtgroup *rgrp, struct rdt_domain *dom_mbm)
+static void update_mba_bw(struct rdtgroup *rgrp, struct rdt_mon_domain *dom_mbm)
 {
 	u32 closid, rmid, cur_msr_val, new_msr_val;
 	struct mbm_state *pmbm_data, *cmbm_data;
+	struct rdt_ctrl_domain *dom_mba;
 	u32 cur_bw, delta_bw, user_bw;
 	struct rdt_resource *r_mba;
-	struct rdt_domain *dom_mba;
 	struct list_head *head;
 	struct rdtgroup *entry;
 
@@ -600,7 +600,7 @@ static void update_mba_bw(struct rdtgroup *rgrp, struct rdt_domain *dom_mbm)
 	}
 }
 
-static void mbm_update(struct rdt_resource *r, struct rdt_domain *d, int rmid)
+static void mbm_update(struct rdt_resource *r, struct rdt_mon_domain *d, int rmid)
 {
 	struct rmid_read rr;
 
@@ -640,13 +640,13 @@ void cqm_handle_limbo(struct work_struct *work)
 {
 	unsigned long delay = msecs_to_jiffies(CQM_LIMBOCHECK_INTERVAL);
 	int cpu = smp_processor_id();
+	struct rdt_mon_domain *d;
 	struct rdt_resource *r;
-	struct rdt_domain *d;
 
 	mutex_lock(&rdtgroup_mutex);
 
 	r = &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl;
-	d = container_of(work, struct rdt_domain, cqm_limbo.work);
+	d = container_of(work, struct rdt_mon_domain, cqm_limbo.work);
 
 	__check_limbo(d, false);
 
@@ -656,7 +656,7 @@ void cqm_handle_limbo(struct work_struct *work)
 	mutex_unlock(&rdtgroup_mutex);
 }
 
-void cqm_setup_limbo_handler(struct rdt_domain *dom, unsigned long delay_ms)
+void cqm_setup_limbo_handler(struct rdt_mon_domain *dom, unsigned long delay_ms)
 {
 	unsigned long delay = msecs_to_jiffies(delay_ms);
 	int cpu;
@@ -672,9 +672,9 @@ void mbm_handle_overflow(struct work_struct *work)
 	unsigned long delay = msecs_to_jiffies(MBM_OVERFLOW_INTERVAL);
 	struct rdtgroup *prgrp, *crgrp;
 	int cpu = smp_processor_id();
+	struct rdt_mon_domain *d;
 	struct list_head *head;
 	struct rdt_resource *r;
-	struct rdt_domain *d;
 
 	mutex_lock(&rdtgroup_mutex);
 
@@ -682,7 +682,7 @@ void mbm_handle_overflow(struct work_struct *work)
 		goto out_unlock;
 
 	r = &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl;
-	d = container_of(work, struct rdt_domain, mbm_over.work);
+	d = container_of(work, struct rdt_mon_domain, mbm_over.work);
 
 	list_for_each_entry(prgrp, &rdt_all_groups, rdtgroup_list) {
 		mbm_update(r, d, prgrp->mon.rmid);
@@ -701,7 +701,7 @@ void mbm_handle_overflow(struct work_struct *work)
 	mutex_unlock(&rdtgroup_mutex);
 }
 
-void mbm_setup_overflow_handler(struct rdt_domain *dom, unsigned long delay_ms)
+void mbm_setup_overflow_handler(struct rdt_mon_domain *dom, unsigned long delay_ms)
 {
 	unsigned long delay = msecs_to_jiffies(delay_ms);
 	int cpu;
diff --git a/arch/x86/kernel/cpu/resctrl/pseudo_lock.c b/arch/x86/kernel/cpu/resctrl/pseudo_lock.c
index bda32b4e1c1e..675e9e47af54 100644
--- a/arch/x86/kernel/cpu/resctrl/pseudo_lock.c
+++ b/arch/x86/kernel/cpu/resctrl/pseudo_lock.c
@@ -814,7 +814,7 @@ int rdtgroup_locksetup_exit(struct rdtgroup *rdtgrp)
  * Return: true if @cbm overlaps with pseudo-locked region on @d, false
  * otherwise.
  */
-bool rdtgroup_cbm_overlaps_pseudo_locked(struct rdt_domain *d, unsigned long cbm)
+bool rdtgroup_cbm_overlaps_pseudo_locked(struct rdt_ctrl_domain *d, unsigned long cbm)
 {
 	unsigned int cbm_len;
 	unsigned long cbm_b;
@@ -841,11 +841,11 @@ bool rdtgroup_cbm_overlaps_pseudo_locked(struct rdt_domain *d, unsigned long cbm
  *         if it is not possible to test due to memory allocation issue,
  *         false otherwise.
  */
-bool rdtgroup_pseudo_locked_in_hierarchy(struct rdt_domain *d)
+bool rdtgroup_pseudo_locked_in_hierarchy(struct rdt_ctrl_domain *d)
 {
+	struct rdt_ctrl_domain *d_i;
 	cpumask_var_t cpu_with_psl;
 	struct rdt_resource *r;
-	struct rdt_domain *d_i;
 	bool ret = false;
 
 	if (!zalloc_cpumask_var(&cpu_with_psl, GFP_KERNEL))
diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
index 90565bb44d0e..afa7a8dca48d 100644
--- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
+++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
@@ -80,8 +80,8 @@ void rdt_last_cmd_printf(const char *fmt, ...)
 
 void rdt_staged_configs_clear(void)
 {
+	struct rdt_ctrl_domain *dom;
 	struct rdt_resource *r;
-	struct rdt_domain *dom;
 
 	lockdep_assert_held(&rdtgroup_mutex);
 
@@ -920,7 +920,7 @@ static int rdt_bit_usage_show(struct kernfs_open_file *of,
 	unsigned long sw_shareable = 0, hw_shareable = 0;
 	unsigned long exclusive = 0, pseudo_locked = 0;
 	struct rdt_resource *r = s->res;
-	struct rdt_domain *dom;
+	struct rdt_ctrl_domain *dom;
 	int i, hwb, swb, excl, psl;
 	enum rdtgrp_mode mode;
 	bool sep = false;
@@ -1137,7 +1137,7 @@ static enum resctrl_conf_type resctrl_peer_type(enum resctrl_conf_type my_type)
  *
  * Return: false if CBM does not overlap, true if it does.
  */
-static bool __rdtgroup_cbm_overlaps(struct rdt_resource *r, struct rdt_domain *d,
+static bool __rdtgroup_cbm_overlaps(struct rdt_resource *r, struct rdt_ctrl_domain *d,
 				    unsigned long cbm, int closid,
 				    enum resctrl_conf_type type, bool exclusive)
 {
@@ -1192,7 +1192,7 @@ static bool __rdtgroup_cbm_overlaps(struct rdt_resource *r, struct rdt_domain *d
  *
  * Return: true if CBM overlap detected, false if there is no overlap
  */
-bool rdtgroup_cbm_overlaps(struct resctrl_schema *s, struct rdt_domain *d,
+bool rdtgroup_cbm_overlaps(struct resctrl_schema *s, struct rdt_ctrl_domain *d,
 			   unsigned long cbm, int closid, bool exclusive)
 {
 	enum resctrl_conf_type peer_type = resctrl_peer_type(s->conf_type);
@@ -1222,10 +1222,10 @@ bool rdtgroup_cbm_overlaps(struct resctrl_schema *s, struct rdt_domain *d,
 static bool rdtgroup_mode_test_exclusive(struct rdtgroup *rdtgrp)
 {
 	int closid = rdtgrp->closid;
+	struct rdt_ctrl_domain *d;
 	struct resctrl_schema *s;
 	struct rdt_resource *r;
 	bool has_cache = false;
-	struct rdt_domain *d;
 	u32 ctrl;
 
 	list_for_each_entry(s, &resctrl_schema_all, list) {
@@ -1339,7 +1339,7 @@ static ssize_t rdtgroup_mode_write(struct kernfs_open_file *of,
  * bitmap functions work correctly.
  */
 unsigned int rdtgroup_cbm_to_size(struct rdt_resource *r,
-				  struct rdt_domain *d, unsigned long cbm)
+				  struct rdt_ctrl_domain *d, unsigned long cbm)
 {
 	struct cpu_cacheinfo *ci;
 	unsigned int size = 0;
@@ -1372,9 +1372,9 @@ static int rdtgroup_size_show(struct kernfs_open_file *of,
 {
 	struct resctrl_schema *schema;
 	enum resctrl_conf_type type;
+	struct rdt_ctrl_domain *d;
 	struct rdtgroup *rdtgrp;
 	struct rdt_resource *r;
-	struct rdt_domain *d;
 	unsigned int size;
 	int ret = 0;
 	u32 closid;
@@ -1486,7 +1486,7 @@ static void mon_event_config_read(void *info)
 	mon_info->mon_config = msrval & MAX_EVT_CONFIG_BITS;
 }
 
-static void mondata_config_read(struct rdt_domain *d, struct mon_config_info *mon_info)
+static void mondata_config_read(struct rdt_mon_domain *d, struct mon_config_info *mon_info)
 {
 	smp_call_function_any(&d->cpu_mask, mon_event_config_read, mon_info, 1);
 }
@@ -1494,7 +1494,7 @@ static void mondata_config_read(struct rdt_domain *d, struct mon_config_info *mo
 static int mbm_config_show(struct seq_file *s, struct rdt_resource *r, u32 evtid)
 {
 	struct mon_config_info mon_info = {0};
-	struct rdt_domain *dom;
+	struct rdt_mon_domain *dom;
 	bool sep = false;
 
 	mutex_lock(&rdtgroup_mutex);
@@ -1551,7 +1551,7 @@ static void mon_event_config_write(void *info)
 }
 
 static int mbm_config_write_domain(struct rdt_resource *r,
-				   struct rdt_domain *d, u32 evtid, u32 val)
+				   struct rdt_mon_domain *d, u32 evtid, u32 val)
 {
 	struct mon_config_info mon_info = {0};
 	int ret = 0;
@@ -1601,7 +1601,7 @@ static int mon_config_write(struct rdt_resource *r, char *tok, u32 evtid)
 {
 	char *dom_str = NULL, *id_str;
 	unsigned long dom_id, val;
-	struct rdt_domain *d;
+	struct rdt_mon_domain *d;
 	int ret = 0;
 
 next:
@@ -2125,9 +2125,9 @@ static inline bool is_mba_linear(void)
 static int set_cache_qos_cfg(int level, bool enable)
 {
 	void (*update)(void *arg);
+	struct rdt_ctrl_domain *d;
 	struct rdt_resource *r_l;
 	cpumask_var_t cpu_mask;
-	struct rdt_domain *d;
 	int cpu;
 
 	if (level == RDT_RESOURCE_L3)
@@ -2174,7 +2174,7 @@ void rdt_domain_reconfigure_cdp(struct rdt_resource *r)
 		l3_qos_cfg_update(&hw_res->cdp_enabled);
 }
 
-static int mba_sc_domain_allocate(struct rdt_resource *r, struct rdt_domain *d)
+static int mba_sc_domain_allocate(struct rdt_resource *r, struct rdt_ctrl_domain *d)
 {
 	u32 num_closid = resctrl_arch_get_num_closid(r);
 	int cpu = cpumask_any(&d->cpu_mask);
@@ -2192,7 +2192,7 @@ static int mba_sc_domain_allocate(struct rdt_resource *r, struct rdt_domain *d)
 }
 
 static void mba_sc_domain_destroy(struct rdt_resource *r,
-				  struct rdt_domain *d)
+				  struct rdt_ctrl_domain *d)
 {
 	kfree(d->mbps_val);
 	d->mbps_val = NULL;
@@ -2218,7 +2218,7 @@ static int set_mba_sc(bool mba_sc)
 {
 	struct rdt_resource *r = &rdt_resources_all[RDT_RESOURCE_MBA].r_resctrl;
 	u32 num_closid = resctrl_arch_get_num_closid(r);
-	struct rdt_domain *d;
+	struct rdt_ctrl_domain *d;
 	int i;
 
 	if (!supports_mba_mbps() || mba_sc == is_mba_sc(r))
@@ -2466,7 +2466,7 @@ static void schemata_list_destroy(void)
 static int rdt_get_tree(struct fs_context *fc)
 {
 	struct rdt_fs_context *ctx = rdt_fc2context(fc);
-	struct rdt_domain *dom;
+	struct rdt_mon_domain *dom;
 	struct rdt_resource *r;
 	int ret;
 
@@ -2634,10 +2634,10 @@ static int rdt_init_fs_context(struct fs_context *fc)
 static int reset_all_ctrls(struct rdt_resource *r)
 {
 	struct rdt_hw_resource *hw_res = resctrl_to_arch_res(r);
-	struct rdt_hw_domain *hw_dom;
+	struct rdt_hw_ctrl_domain *hw_dom;
 	struct msr_param msr_param;
+	struct rdt_ctrl_domain *d;
 	cpumask_var_t cpu_mask;
-	struct rdt_domain *d;
 	int i;
 
 	if (!zalloc_cpumask_var(&cpu_mask, GFP_KERNEL))
@@ -2653,7 +2653,7 @@ static int reset_all_ctrls(struct rdt_resource *r)
 	 * from each domain to update the MSRs below.
 	 */
 	list_for_each_entry(d, &r->ctrl_domains, hdr.list) {
-		hw_dom = resctrl_to_arch_dom(d);
+		hw_dom = resctrl_to_arch_ctrl_dom(d);
 		cpumask_set_cpu(cpumask_any(&d->cpu_mask), cpu_mask);
 
 		for (i = 0; i < hw_res->num_closid; i++)
@@ -2848,7 +2848,7 @@ static void rmdir_mondata_subdir_allrdtgrp(struct rdt_resource *r,
 }
 
 static int mkdir_mondata_subdir(struct kernfs_node *parent_kn,
-				struct rdt_domain *d,
+				struct rdt_mon_domain *d,
 				struct rdt_resource *r, struct rdtgroup *prgrp)
 {
 	union mon_data_bits priv;
@@ -2897,7 +2897,7 @@ static int mkdir_mondata_subdir(struct kernfs_node *parent_kn,
  * and "monitor" groups with given domain id.
  */
 static void mkdir_mondata_subdir_allrdtgrp(struct rdt_resource *r,
-					   struct rdt_domain *d)
+					   struct rdt_mon_domain *d)
 {
 	struct kernfs_node *parent_kn;
 	struct rdtgroup *prgrp, *crgrp;
@@ -2919,7 +2919,7 @@ static int mkdir_mondata_subdir_alldom(struct kernfs_node *parent_kn,
 				       struct rdt_resource *r,
 				       struct rdtgroup *prgrp)
 {
-	struct rdt_domain *dom;
+	struct rdt_mon_domain *dom;
 	int ret;
 
 	list_for_each_entry(dom, &r->mon_domains, hdr.list) {
@@ -3021,7 +3021,7 @@ static u32 cbm_ensure_valid(u32 _val, struct rdt_resource *r)
  * Set the RDT domain up to start off with all usable allocations. That is,
  * all shareable and unused bits. All-zero CBM is invalid.
  */
-static int __init_one_rdt_domain(struct rdt_domain *d, struct resctrl_schema *s,
+static int __init_one_rdt_domain(struct rdt_ctrl_domain *d, struct resctrl_schema *s,
 				 u32 closid)
 {
 	enum resctrl_conf_type peer_type = resctrl_peer_type(s->conf_type);
@@ -3101,7 +3101,7 @@ static int __init_one_rdt_domain(struct rdt_domain *d, struct resctrl_schema *s,
  */
 static int rdtgroup_init_cat(struct resctrl_schema *s, u32 closid)
 {
-	struct rdt_domain *d;
+	struct rdt_ctrl_domain *d;
 	int ret;
 
 	list_for_each_entry(d, &s->res->ctrl_domains, hdr.list) {
@@ -3117,7 +3117,7 @@ static int rdtgroup_init_cat(struct resctrl_schema *s, u32 closid)
 static void rdtgroup_init_mba(struct rdt_resource *r, u32 closid)
 {
 	struct resctrl_staged_config *cfg;
-	struct rdt_domain *d;
+	struct rdt_ctrl_domain *d;
 
 	list_for_each_entry(d, &r->ctrl_domains, hdr.list) {
 		if (is_mba_sc(r)) {
@@ -3704,14 +3704,14 @@ static int __init rdtgroup_setup_root(void)
 	return ret;
 }
 
-static void domain_destroy_mon_state(struct rdt_domain *d)
+static void domain_destroy_mon_state(struct rdt_mon_domain *d)
 {
 	bitmap_free(d->rmid_busy_llc);
 	kfree(d->mbm_total);
 	kfree(d->mbm_local);
 }
 
-void resctrl_offline_ctrl_domain(struct rdt_resource *r, struct rdt_domain *d)
+void resctrl_offline_ctrl_domain(struct rdt_resource *r, struct rdt_ctrl_domain *d)
 {
 	lockdep_assert_held(&rdtgroup_mutex);
 
@@ -3719,7 +3719,7 @@ void resctrl_offline_ctrl_domain(struct rdt_resource *r, struct rdt_domain *d)
 		mba_sc_domain_destroy(r, d);
 }
 
-void resctrl_offline_mon_domain(struct rdt_resource *r, struct rdt_domain *d)
+void resctrl_offline_mon_domain(struct rdt_resource *r, struct rdt_mon_domain *d)
 {
 	lockdep_assert_held(&rdtgroup_mutex);
 
@@ -3748,7 +3748,7 @@ void resctrl_offline_mon_domain(struct rdt_resource *r, struct rdt_domain *d)
 	domain_destroy_mon_state(d);
 }
 
-static int domain_setup_mon_state(struct rdt_resource *r, struct rdt_domain *d)
+static int domain_setup_mon_state(struct rdt_resource *r, struct rdt_mon_domain *d)
 {
 	size_t tsize;
 
@@ -3778,7 +3778,7 @@ static int domain_setup_mon_state(struct rdt_resource *r, struct rdt_domain *d)
 	return 0;
 }
 
-int resctrl_online_ctrl_domain(struct rdt_resource *r, struct rdt_domain *d)
+int resctrl_online_ctrl_domain(struct rdt_resource *r, struct rdt_ctrl_domain *d)
 {
 	lockdep_assert_held(&rdtgroup_mutex);
 
@@ -3789,7 +3789,7 @@ int resctrl_online_ctrl_domain(struct rdt_resource *r, struct rdt_domain *d)
 	return 0;
 }
 
-int resctrl_online_mon_domain(struct rdt_resource *r, struct rdt_domain *d)
+int resctrl_online_mon_domain(struct rdt_resource *r, struct rdt_mon_domain *d)
 {
 	int err;
 

^ permalink raw reply related	[flat|nested] 331+ messages in thread

* [PATCH v8 5/8] x86/resctrl: Add node-scope to the options for feature scope
  2023-10-03 21:30       ` [PATCH v8 " Tony Luck
                           ` (3 preceding siblings ...)
  2023-10-03 21:30         ` [PATCH v8 4/8] x86/resctrl: Split the rdt_domain and rdt_hw_domain structures Tony Luck
@ 2023-10-03 21:30         ` Tony Luck
  2023-10-03 21:30         ` [PATCH v8 6/8] x86/resctrl: Introduce snc_nodes_per_l3_cache Tony Luck
                           ` (5 subsequent siblings)
  10 siblings, 0 replies; 331+ messages in thread
From: Tony Luck @ 2023-10-03 21:30 UTC (permalink / raw)
  To: Fenghua Yu, Reinette Chatre, Peter Newman, Jonathan Corbet,
	Shuah Khan, x86
  Cc: Shaopeng Tan, James Morse, Jamie Iles, Babu Moger, Randy Dunlap,
	linux-kernel, linux-doc, patches, Tony Luck

All currently supported resctrl features are domain scoped at the same
granularity as the L2 or L3 caches.

Add RESCTRL_NODE as a new option for features that are scoped at the
same granularity as NUMA nodes. This is needed for Intel's Sub-NUMA
Cluster (SNC) feature where monitoring features are node scoped.

Signed-off-by: Tony Luck <tony.luck@intel.com>
Reviewed-by: Peter Newman <peternewman@google.com>
---

No changes since last version

---
diff --git a/include/linux/resctrl.h b/include/linux/resctrl.h
index 56fa04cedb50..15b98ac91272 100644
--- a/include/linux/resctrl.h
+++ b/include/linux/resctrl.h
@@ -172,6 +172,7 @@ struct resctrl_schema;
 enum resctrl_scope {
 	RESCTRL_L2_CACHE = 2,
 	RESCTRL_L3_CACHE = 3,
+	RESCTRL_NODE,
 };
 
 /**
diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
index d92fdce4e44f..6b937da36e4c 100644
--- a/arch/x86/kernel/cpu/resctrl/core.c
+++ b/arch/x86/kernel/cpu/resctrl/core.c
@@ -511,6 +511,8 @@ static int get_domain_id_from_scope(int cpu, enum resctrl_scope scope)
 	case RESCTRL_L2_CACHE:
 	case RESCTRL_L3_CACHE:
 		return get_cpu_cacheinfo_id(cpu, scope);
+	case RESCTRL_NODE:
+		return cpu_to_node(cpu);
 	default:
 		break;
 	}

^ permalink raw reply related	[flat|nested] 331+ messages in thread

* [PATCH v8 6/8] x86/resctrl: Introduce snc_nodes_per_l3_cache
  2023-10-03 21:30       ` [PATCH v8 " Tony Luck
                           ` (4 preceding siblings ...)
  2023-10-03 21:30         ` [PATCH v8 5/8] x86/resctrl: Add node-scope to the options for feature scope Tony Luck
@ 2023-10-03 21:30         ` Tony Luck
  2023-10-03 21:30         ` [PATCH v8 7/8] x86/resctrl: Sub NUMA Cluster detection and enable Tony Luck
                           ` (4 subsequent siblings)
  10 siblings, 0 replies; 331+ messages in thread
From: Tony Luck @ 2023-10-03 21:30 UTC (permalink / raw)
  To: Fenghua Yu, Reinette Chatre, Peter Newman, Jonathan Corbet,
	Shuah Khan, x86
  Cc: Shaopeng Tan, James Morse, Jamie Iles, Babu Moger, Randy Dunlap,
	linux-kernel, linux-doc, patches, Tony Luck

Intel Sub-NUMA Cluster (SNC) is a feature that subdivides the CPU cores
and memory controllers on a socket into two or more groups. These are
presented to the operating system as NUMA nodes.

This may enable some workloads to have slightly lower latency to memory
as the memory controller(s) in an SNC node are electrically closer to the
CPU cores on that SNC node. This benefit may be offset by lower bandwidth
since the memory accesses for each core can only be interleaved between
the memory controllers on the same SNC node.

Resctrl monitoring on Intel systems depends upon attaching RMIDs to tasks
to track L3 cache occupancy and memory bandwidth. There is an MSR that
controls how the RMIDs are shared between SNC nodes.

The default mode divides them numerically. E.g. when there are two SNC
nodes on a socket, the lower-numbered half of the RMIDs are given to the
first node, the remainder to the second node. This would be difficult
to use with the Linux resctrl interface as specific RMID values assigned
to resctrl groups are not visible to users.

The other mode divides the RMIDs and renumbers the ones on the second
SNC node to start from zero.
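
As an illustration only (the helper name and the RMID counts below are made
up, not taken from this series), the renumbered mode amounts to adding a
per-node offset to the logical RMID before it is programmed into
IA32_QM_EVTSEL. E.g. with two SNC nodes and 256 physical RMIDs per L3
cache, each node gets 128 logical RMIDs, and logical RMID 5 on the second
node maps to physical counter 133:

#include <stdio.h>

/*
 * Hypothetical sketch, not kernel code: map the logical RMID handed out
 * by resctrl to the physical counter for a given SNC node.
 * rmids_per_node is the physical RMID count divided by the number of
 * SNC nodes sharing the L3 cache.
 */
static unsigned int snc_physical_rmid(unsigned int logical_rmid,
				      unsigned int snc_node,
				      unsigned int rmids_per_node)
{
	/* node 0 keeps 0..rmids_per_node-1, node 1 starts right after, ... */
	return logical_rmid + snc_node * rmids_per_node;
}

int main(void)
{
	/* 256 physical RMIDs per L3, two SNC nodes -> 128 logical per node */
	printf("%u\n", snc_physical_rmid(5, 1, 128));	/* prints 133 */
	return 0;
}

The adjustment this series actually makes lives in __rmid_read() (see the
monitor.c hunk below), which derives the offset from cpu_to_node().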

Even with this renumbering SNC mode requires several changes in resctrl
behavior for correct operation.

Add a global integer "snc_nodes_per_l3_cache" that will show how many
SNC nodes share each L3 cache. When this is "1", SNC mode is either
not implemented, or not enabled.

A later patch will detect SNC mode and set snc_nodes_per_l3_cache to
the appropriate value. For now it remains at the default "1" to
indicate SNC mode is not active.

Code that needs to take action when SNC is enabled is:
1) The number of logical RMIDs per L3 cache available for use is the
   number of physical RMIDs divided by the number of SNC nodes.
2) Likewise the "mon_scale" value must be adjusted for the number
   of SNC nodes.
3) The RMID renumbering operates when using the value from the
   IA32_PQR_ASSOC MSR to count accesses by a task. When reading an RMID
   counter, code must adjust from the logical RMID used to the physical
   RMID value for the SNC node that it wishes to read and load the
   adjusted value into the IA32_QM_EVTSEL MSR.
4) The L3 cache is divided between the SNC nodes. So the value
   reported in the resctrl "size" file is adjusted.
5) The "-o mba_MBps" mount option must be disabled in SNC mode
   because the monitoring is being done per SNC node, while the
   bandwidth allocation is still done at the L3 cache scope.
   Trying to use this feedback loop might result in contradictory
   changes to the throttling level coming from each of the SNC
   node bandwidth measurements.

Signed-off-by: Tony Luck <tony.luck@intel.com>
Reviewed-by: Peter Newman <peternewman@google.com>

---

Changes since last version:

In commit comment s/redumbering/renumbering/

Move check that SNC is not enabled into supports_mba_mbps().

---
diff --git a/arch/x86/kernel/cpu/resctrl/internal.h b/arch/x86/kernel/cpu/resctrl/internal.h
index 3aed8e7b8487..3fddda401b83 100644
--- a/arch/x86/kernel/cpu/resctrl/internal.h
+++ b/arch/x86/kernel/cpu/resctrl/internal.h
@@ -446,6 +446,8 @@ DECLARE_STATIC_KEY_FALSE(rdt_alloc_enable_key);
 
 extern struct dentry *debugfs_resctrl;
 
+extern int snc_nodes_per_l3_cache;
+
 enum resctrl_res_level {
 	RDT_RESOURCE_L3,
 	RDT_RESOURCE_L2,
diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
index 6b937da36e4c..cd189b7ca6ea 100644
--- a/arch/x86/kernel/cpu/resctrl/core.c
+++ b/arch/x86/kernel/cpu/resctrl/core.c
@@ -48,6 +48,12 @@ int max_name_width, max_data_width;
  */
 bool rdt_alloc_capable;
 
+/*
+ * Number of SNC nodes that share each L3 cache.  Default is 1 for
+ * systems that do not support SNC, or have SNC disabled.
+ */
+int snc_nodes_per_l3_cache = 1;
+
 static void
 mba_wrmsr_intel(struct rdt_ctrl_domain *d, struct msr_param *m,
 		struct rdt_resource *r);
diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c
index 97d2ed829f5d..e6e566921a60 100644
--- a/arch/x86/kernel/cpu/resctrl/monitor.c
+++ b/arch/x86/kernel/cpu/resctrl/monitor.c
@@ -148,8 +148,18 @@ static inline struct rmid_entry *__rmid_entry(u32 rmid)
 
 static int __rmid_read(u32 rmid, enum resctrl_event_id eventid, u64 *val)
 {
+	struct rdt_resource *r = &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl;
+	int cpu = smp_processor_id();
+	int rmid_offset = 0;
 	u64 msr_val;
 
+	/*
+	 * When SNC mode is on, need to compute the offset to read the
+	 * physical RMID counter for the node to which this CPU belongs.
+	 */
+	if (snc_nodes_per_l3_cache > 1)
+		rmid_offset = (cpu_to_node(cpu) % snc_nodes_per_l3_cache) * r->num_rmid;
+
 	/*
 	 * As per the SDM, when IA32_QM_EVTSEL.EvtID (bits 7:0) is configured
 	 * with a valid event code for supported resource type and the bits
@@ -158,7 +168,7 @@ static int __rmid_read(u32 rmid, enum resctrl_event_id eventid, u64 *val)
 	 * IA32_QM_CTR.Error (bit 63) and IA32_QM_CTR.Unavailable (bit 62)
 	 * are error bits.
 	 */
-	wrmsr(MSR_IA32_QM_EVTSEL, eventid, rmid);
+	wrmsr(MSR_IA32_QM_EVTSEL, eventid, rmid + rmid_offset);
 	rdmsrl(MSR_IA32_QM_CTR, msr_val);
 
 	if (msr_val & RMID_VAL_ERROR)
@@ -783,8 +793,8 @@ int __init rdt_get_mon_l3_config(struct rdt_resource *r)
 	int ret;
 
 	resctrl_rmid_realloc_limit = boot_cpu_data.x86_cache_size * 1024;
-	hw_res->mon_scale = boot_cpu_data.x86_cache_occ_scale;
-	r->num_rmid = boot_cpu_data.x86_cache_max_rmid + 1;
+	hw_res->mon_scale = boot_cpu_data.x86_cache_occ_scale / snc_nodes_per_l3_cache;
+	r->num_rmid = (boot_cpu_data.x86_cache_max_rmid + 1) / snc_nodes_per_l3_cache;
 	hw_res->mbm_width = MBM_CNTR_WIDTH_BASE;
 
 	if (mbm_offset > 0 && mbm_offset <= MBM_CNTR_WIDTH_OFFSET_MAX)
diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
index afa7a8dca48d..def203c40d70 100644
--- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
+++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
@@ -1357,7 +1357,7 @@ unsigned int rdtgroup_cbm_to_size(struct rdt_resource *r,
 		}
 	}
 
-	return size;
+	return size / snc_nodes_per_l3_cache;
 }
 
 /**
@@ -2207,7 +2207,8 @@ static bool supports_mba_mbps(void)
 	struct rdt_resource *r = &rdt_resources_all[RDT_RESOURCE_MBA].r_resctrl;
 
 	return (is_mbm_local_enabled() &&
-		r->alloc_capable && is_mba_linear());
+		r->alloc_capable && is_mba_linear() &&
+		snc_nodes_per_l3_cache == 1);
 }
 
 /*

^ permalink raw reply related	[flat|nested] 331+ messages in thread

* [PATCH v8 7/8] x86/resctrl: Sub NUMA Cluster detection and enable
  2023-10-03 21:30       ` [PATCH v8 " Tony Luck
                           ` (5 preceding siblings ...)
  2023-10-03 21:30         ` [PATCH v8 6/8] x86/resctrl: Introduce snc_nodes_per_l3_cache Tony Luck
@ 2023-10-03 21:30         ` Tony Luck
  2023-10-03 21:30         ` [PATCH v8 8/8] x86/resctrl: Update documentation with Sub-NUMA cluster changes Tony Luck
                           ` (3 subsequent siblings)
  10 siblings, 0 replies; 331+ messages in thread
From: Tony Luck @ 2023-10-03 21:30 UTC (permalink / raw)
  To: Fenghua Yu, Reinette Chatre, Peter Newman, Jonathan Corbet,
	Shuah Khan, x86
  Cc: Shaopeng Tan, James Morse, Jamie Iles, Babu Moger, Randy Dunlap,
	linux-kernel, linux-doc, patches, Tony Luck

There isn't a simple h/w bit that indicates whether a CPU is
running in Sub NUMA Cluster (SNC) mode. Infer the state by comparing
the ratio of NUMA nodes to L3 cache instances.

When SNC mode is detected, reconfigure the RMID counters by updating
the MSR_RMID_SNC_CONFIG MSR on each socket as CPUs are seen.

Clearing bit zero of the MSR divides the RMIDs and renumbers the ones
on the second SNC node to start from zero. An earlier commit includes
all the required changes in Linux to operate in this reconfigured mode.

Signed-off-by: Tony Luck <tony.luck@intel.com>
Reviewed-by: Peter Newman <peternewman@google.com>

---

Changes since last version:

Moved kfree(node_caches); earlier, to the earliest point where it
is no longer needed.

Added Granite Rapids to list of CPU models that support SNC mode.

---
diff --git a/arch/x86/include/asm/msr-index.h b/arch/x86/include/asm/msr-index.h
index 1d111350197f..393d1b047617 100644
--- a/arch/x86/include/asm/msr-index.h
+++ b/arch/x86/include/asm/msr-index.h
@@ -1100,6 +1100,7 @@
 #define MSR_IA32_QM_CTR			0xc8e
 #define MSR_IA32_PQR_ASSOC		0xc8f
 #define MSR_IA32_L3_CBM_BASE		0xc90
+#define MSR_RMID_SNC_CONFIG		0xca0
 #define MSR_IA32_L2_CBM_BASE		0xd10
 #define MSR_IA32_MBA_THRTL_BASE		0xd50
 
diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
index cd189b7ca6ea..dff294529f27 100644
--- a/arch/x86/kernel/cpu/resctrl/core.c
+++ b/arch/x86/kernel/cpu/resctrl/core.c
@@ -16,11 +16,14 @@
 
 #define pr_fmt(fmt)	"resctrl: " fmt
 
+#include <linux/cpu.h>
 #include <linux/slab.h>
 #include <linux/err.h>
 #include <linux/cacheinfo.h>
 #include <linux/cpuhotplug.h>
+#include <linux/mod_devicetable.h>
 
+#include <asm/cpu_device_id.h>
 #include <asm/intel-family.h>
 #include <asm/resctrl.h>
 #include "internal.h"
@@ -751,11 +754,42 @@ static void clear_closid_rmid(int cpu)
 	wrmsr(MSR_IA32_PQR_ASSOC, 0, 0);
 }
 
+/*
+ * The power-on reset value of MSR_RMID_SNC_CONFIG is 0x1
+ * which indicates that RMIDs are configured in legacy mode.
+ * This mode is incompatible with Linux resctrl semantics
+ * as RMIDs are partitioned between SNC nodes, which requires
+ * a user to know which RMID is allocated to a task.
+ * Clearing bit 0 reconfigures the RMID counters for use
+ * in Sub NUMA Cluster mode. This mode is better for Linux.
+ * The RMID space is divided between all SNC nodes with the
+ * RMIDs renumbered to start from zero in each node when
+ * counting operations from tasks. Code to read the counters
+ * must adjust RMID counter numbers based on SNC node. See
+ * __rmid_read() for code that does this.
+ */
+static void snc_remap_rmids(int cpu)
+{
+	u64 val;
+
+	/* Only need to enable once per package. */
+	if (cpumask_first(topology_core_cpumask(cpu)) != cpu)
+		return;
+
+	rdmsrl(MSR_RMID_SNC_CONFIG, val);
+	val &= ~BIT_ULL(0);
+	wrmsrl(MSR_RMID_SNC_CONFIG, val);
+}
+
 static int resctrl_online_cpu(unsigned int cpu)
 {
 	struct rdt_resource *r;
 
 	mutex_lock(&rdtgroup_mutex);
+
+	if (snc_nodes_per_l3_cache > 1)
+		snc_remap_rmids(cpu);
+
 	for_each_capable_rdt_resource(r)
 		domain_add_cpu(cpu, r);
 	/* The cpu is set in default rdtgroup after online. */
@@ -1010,11 +1044,69 @@ static __init bool get_rdt_resources(void)
 	return (rdt_mon_capable || rdt_alloc_capable);
 }
 
+/* CPU models that support MSR_RMID_SNC_CONFIG */
+static const struct x86_cpu_id snc_cpu_ids[] __initconst = {
+	X86_MATCH_INTEL_FAM6_MODEL(ICELAKE_X, 0),
+	X86_MATCH_INTEL_FAM6_MODEL(SAPPHIRERAPIDS_X, 0),
+	X86_MATCH_INTEL_FAM6_MODEL(EMERALDRAPIDS_X, 0),
+	X86_MATCH_INTEL_FAM6_MODEL(GRANITERAPIDS_X, 0),
+	{}
+};
+
+/*
+ * There isn't a simple h/w bit that indicates whether a CPU is running
+ * in Sub NUMA Cluster (SNC) mode. Infer the state by comparing the
+ * ratio of NUMA nodes to L3 cache instances.
+ * It is not possible to accurately determine SNC state if the system is
+ * booted with a maxcpus=N parameter. That distorts the ratio of SNC nodes
+ * to L3 caches. It will be OK if the system is booted with hyperthreading
+ * disabled (since this doesn't affect the ratio).
+ */
+static __init int snc_get_config(void)
+{
+	unsigned long *node_caches;
+	int mem_only_nodes = 0;
+	int cpu, node, ret;
+	int num_l3_caches;
+
+	if (!x86_match_cpu(snc_cpu_ids))
+		return 1;
+
+	node_caches = bitmap_zalloc(nr_node_ids, GFP_KERNEL);
+	if (!node_caches)
+		return 1;
+
+	cpus_read_lock();
+	for_each_node(node) {
+		cpu = cpumask_first(cpumask_of_node(node));
+		if (cpu < nr_cpu_ids)
+			set_bit(get_cpu_cacheinfo_id(cpu, 3), node_caches);
+		else
+			mem_only_nodes++;
+	}
+	cpus_read_unlock();
+
+	num_l3_caches = bitmap_weight(node_caches, nr_node_ids);
+	kfree(node_caches);
+
+	if (!num_l3_caches)
+		return 1;
+
+	ret = (nr_node_ids - mem_only_nodes) / num_l3_caches;
+
+	if (ret > 1)
+		rdt_resources_all[RDT_RESOURCE_L3].r_resctrl.mon_scope = RESCTRL_NODE;
+
+	return ret;
+}
+
 static __init void rdt_init_res_defs_intel(void)
 {
 	struct rdt_hw_resource *hw_res;
 	struct rdt_resource *r;
 
+	snc_nodes_per_l3_cache = snc_get_config();
+
 	for_each_rdt_resource(r) {
 		hw_res = resctrl_to_arch_res(r);
 

^ permalink raw reply related	[flat|nested] 331+ messages in thread

* [PATCH v8 8/8] x86/resctrl: Update documentation with Sub-NUMA cluster changes
  2023-10-03 21:30       ` [PATCH v8 " Tony Luck
                           ` (6 preceding siblings ...)
  2023-10-03 21:30         ` [PATCH v8 7/8] x86/resctrl: Sub NUMA Cluster detection and enable Tony Luck
@ 2023-10-03 21:30         ` Tony Luck
  2023-10-05 13:54           ` Peter Newman
  2023-10-03 21:44         ` [PATCH v8 0/8] Add support for Sub-NUMA cluster (SNC) systems Reinette Chatre
                           ` (2 subsequent siblings)
  10 siblings, 1 reply; 331+ messages in thread
From: Tony Luck @ 2023-10-03 21:30 UTC (permalink / raw)
  To: Fenghua Yu, Reinette Chatre, Peter Newman, Jonathan Corbet,
	Shuah Khan, x86
  Cc: Shaopeng Tan, James Morse, Jamie Iles, Babu Moger, Randy Dunlap,
	linux-kernel, linux-doc, patches, Tony Luck

With Sub-NUMA Cluster mode enabled the scope of monitoring resources is
per-NODE instead of per-L3 cache. Suffixes of directories with "L3" in
their name refer to Sub-NUMA nodes instead of L3 cache ids.

Users should be aware that SNC mode also affects the amount of L3 cache
available for allocation within each SNC node.

Signed-off-by: Tony Luck <tony.luck@intel.com>

---

Changes since v5:

Added additional details about challenges tracking tasks when SNC
mode is enabled.

---
diff --git a/Documentation/arch/x86/resctrl.rst b/Documentation/arch/x86/resctrl.rst
index cb05d90111b4..222c507089a5 100644
--- a/Documentation/arch/x86/resctrl.rst
+++ b/Documentation/arch/x86/resctrl.rst
@@ -345,9 +345,9 @@ When control is enabled all CTRL_MON groups will also contain:
 When monitoring is enabled all MON groups will also contain:
 
 "mon_data":
-	This contains a set of files organized by L3 domain and by
-	RDT event. E.g. on a system with two L3 domains there will
-	be subdirectories "mon_L3_00" and "mon_L3_01".	Each of these
+	This contains a set of files organized by L3 domain or by NUMA
+	node (depending on whether Sub-NUMA Cluster (SNC) mode is disabled
+	or enabled respectively) and by RDT event.  Each of these
 	directories have one file per event (e.g. "llc_occupancy",
 	"mbm_total_bytes", and "mbm_local_bytes"). In a MON group these
 	files provide a read out of the current value of the event for
@@ -452,6 +452,23 @@ and 0xA are not.  On a system with a 20-bit mask each bit represents 5%
 of the capacity of the cache. You could partition the cache into four
 equal parts with masks: 0x1f, 0x3e0, 0x7c00, 0xf8000.
 
+Notes on Sub-NUMA Cluster mode
+==============================
+When SNC mode is enabled Linux may load balance tasks between Sub-NUMA
+nodes much more readily than between regular NUMA nodes since the CPUs
+on Sub-NUMA nodes share the same L3 cache and the system may report
+the NUMA distance between Sub-NUMA nodes with a lower value than used
+for regular NUMA nodes.  Users who do not bind tasks to the CPUs of a
+specific Sub-NUMA node must read the "llc_occupancy", "mbm_total_bytes",
+and "mbm_local_bytes" for all Sub-NUMA nodes where the tasks may execute
+to get the full view of traffic for which the tasks were the source.
+
+The cache allocation feature still provides the same number of
+bits in a mask to control allocation into the L3 cache. But each
+of those ways has its capacity reduced because the cache is divided
+between the SNC nodes. The values reported in the resctrl
+"size" files are adjusted accordingly.
+
 Memory bandwidth Allocation and monitoring
 ==========================================
 

^ permalink raw reply related	[flat|nested] 331+ messages in thread

* Re: [PATCH v8 0/8] Add support for Sub-NUMA cluster (SNC) systems
  2023-10-03 21:30       ` [PATCH v8 " Tony Luck
                           ` (7 preceding siblings ...)
  2023-10-03 21:30         ` [PATCH v8 8/8] x86/resctrl: Update documentation with Sub-NUMA cluster changes Tony Luck
@ 2023-10-03 21:44         ` Reinette Chatre
  2023-10-05  8:10         ` Shaopeng Tan (Fujitsu)
  2023-10-20 21:30         ` [PATCH v9 " Tony Luck
  10 siblings, 0 replies; 331+ messages in thread
From: Reinette Chatre @ 2023-10-03 21:44 UTC (permalink / raw)
  To: Tony Luck, Fenghua Yu, Peter Newman, Jonathan Corbet, Shuah Khan, x86
  Cc: Shaopeng Tan, James Morse, Jamie Iles, Babu Moger, Randy Dunlap,
	linux-kernel, linux-doc, patches

Tony,

On 10/3/2023 2:30 PM, Tony Luck wrote:
> The Sub-NUMA cluster feature on some Intel processors partitions
> the CPUs that share an L3 cache into two or more sets. This plays
> havoc with the Resource Director Technology (RDT) monitoring features.
> Prior to this patch Intel has advised that SNC and RDT are incompatible.
> 
> Some of these CPUs support an MSR that can partition the RMID
> counters in the same way. This allows for monitoring features
> to be used (with the caveat that memory accesses between different
> SNC NUMA nodes may still not be counted accurately).

Almost. For reference:
https://lore.kernel.org/lkml/e350514e-76ed-14ea-3e74-c0852658182f@intel.com/

No need to send a new series just for this, but this series does find
itself at the back of my queue.

> Note that this patch series improves resctrl reporting considerably
> on systems with SNC enabled, but there will still be some anomalies
> for processes accessing memory from other sub-NUMA nodes.
> 
> Signed-off-by: Tony Luck <tony.luck@intel.com>

Reinette

^ permalink raw reply	[flat|nested] 331+ messages in thread

* RE: [PATCH v8 0/8] Add support for Sub-NUMA cluster (SNC) systems
  2023-10-03 21:30       ` [PATCH v8 " Tony Luck
                           ` (8 preceding siblings ...)
  2023-10-03 21:44         ` [PATCH v8 0/8] Add support for Sub-NUMA cluster (SNC) systems Reinette Chatre
@ 2023-10-05  8:10         ` Shaopeng Tan (Fujitsu)
  2023-10-05 15:08           ` Luck, Tony
  2023-10-20 21:30         ` [PATCH v9 " Tony Luck
  10 siblings, 1 reply; 331+ messages in thread
From: Shaopeng Tan (Fujitsu) @ 2023-10-05  8:10 UTC (permalink / raw)
  To: 'Tony Luck',
	Fenghua Yu, Reinette Chatre, Peter Newman, Jonathan Corbet,
	Shuah Khan, x86
  Cc: James Morse, Jamie Iles, Babu Moger, Randy Dunlap, linux-kernel,
	linux-doc, patches

Hi Tony,

I applied this patch series to kernel v6.5 and v6.6-rc4, but the kernel cannot be booted.
Could you tell me what kernel version this patch series is based on?

Best regards,
Shaopeng TAN

^ permalink raw reply	[flat|nested] 331+ messages in thread

* Re: [PATCH v8 8/8] x86/resctrl: Update documentation with Sub-NUMA cluster changes
  2023-10-03 21:30         ` [PATCH v8 8/8] x86/resctrl: Update documentation with Sub-NUMA cluster changes Tony Luck
@ 2023-10-05 13:54           ` Peter Newman
  0 siblings, 0 replies; 331+ messages in thread
From: Peter Newman @ 2023-10-05 13:54 UTC (permalink / raw)
  To: Tony Luck
  Cc: Fenghua Yu, Reinette Chatre, Jonathan Corbet, Shuah Khan, x86,
	Shaopeng Tan, James Morse, Jamie Iles, Babu Moger, Randy Dunlap,
	linux-kernel, linux-doc, patches

On Tue, Oct 3, 2023 at 11:30 PM Tony Luck <tony.luck@intel.com> wrote:
>
> With Sub-NUMA Cluster mode enabled the scope of monitoring resources is
> per-NODE instead of per-L3 cache. Suffixes of directories with "L3" in
> their name refer to Sub-NUMA nodes instead of L3 cache ids.
>
> Users should be aware that SNC mode also affects the amount of L3 cache
> available for allocation within each SNC node.
>
> Signed-off-by: Tony Luck <tony.luck@intel.com>
>
> ---
>
> Changes since v5:
>
> Added addtional details about challenges tracking tasks when SNC
> mode is enabled.
>
> ---
> diff --git a/Documentation/arch/x86/resctrl.rst b/Documentation/arch/x86/resctrl.rst
> index cb05d90111b4..222c507089a5 100644
> --- a/Documentation/arch/x86/resctrl.rst
> +++ b/Documentation/arch/x86/resctrl.rst
> @@ -345,9 +345,9 @@ When control is enabled all CTRL_MON groups will also contain:
>  When monitoring is enabled all MON groups will also contain:
>
>  "mon_data":
> -       This contains a set of files organized by L3 domain and by
> -       RDT event. E.g. on a system with two L3 domains there will
> -       be subdirectories "mon_L3_00" and "mon_L3_01".  Each of these
> +       This contains a set of files organized by L3 domain or by NUMA
> +       node (depending on whether Sub-NUMA Cluster (SNC) mode is disabled
> +       or enabled respectively) and by RDT event.  Each of these
>         directories have one file per event (e.g. "llc_occupancy",
>         "mbm_total_bytes", and "mbm_local_bytes"). In a MON group these
>         files provide a read out of the current value of the event for
> @@ -452,6 +452,23 @@ and 0xA are not.  On a system with a 20-bit mask each bit represents 5%
>  of the capacity of the cache. You could partition the cache into four
>  equal parts with masks: 0x1f, 0x3e0, 0x7c00, 0xf8000.
>
> +Notes on Sub-NUMA Cluster mode
> +==============================
> +When SNC mode is enabled Linux may load balance tasks between Sub-NUMA
> +nodes much more readily than between regular NUMA nodes since the CPUs
> +on Sub-NUMA nodes share the same L3 cache and the system may report
> +the NUMA distance between Sub-NUMA nodes with a lower value than used
> +for regular NUMA nodes.  Users who do not bind tasks to the CPUs of a
> +specific Sub-NUMA node must read the "llc_occupancy", "mbm_total_bytes",
> +and "mbm_local_bytes" for all Sub-NUMA nodes where the tasks may execute
> +to get the full view of traffic for which the tasks were the source.
> +
> +The cache allocation feature still provides the same number of
> +bits in a mask to control allocation into the L3 cache. But each
> +of those ways has its capacity reduced because the cache is divided
> +between the SNC nodes. The values reported in the resctrl
> +"size" files are adjusted accordingly.
> +
>  Memory bandwidth Allocation and monitoring
>  ==========================================
>

Sounds better, thanks.

Reviewed-by: Peter Newman <peternewman@google.com>

^ permalink raw reply	[flat|nested] 331+ messages in thread

* Re: [PATCH v8 1/8] x86/resctrl: Prepare for new domain scope
  2023-10-03 21:30         ` [PATCH v8 1/8] x86/resctrl: Prepare for new domain scope Tony Luck
@ 2023-10-05 14:27           ` Peter Newman
  0 siblings, 0 replies; 331+ messages in thread
From: Peter Newman @ 2023-10-05 14:27 UTC (permalink / raw)
  To: Tony Luck
  Cc: Fenghua Yu, Reinette Chatre, Jonathan Corbet, Shuah Khan, x86,
	Shaopeng Tan, James Morse, Jamie Iles, Babu Moger, Randy Dunlap,
	linux-kernel, linux-doc, patches

On Tue, Oct 3, 2023 at 11:30 PM Tony Luck <tony.luck@intel.com> wrote:
> diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> index 725344048f85..04c164f6d39d 100644
> --- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> +++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> @@ -1345,10 +1345,13 @@ unsigned int rdtgroup_cbm_to_size(struct rdt_resource *r,
>         unsigned int size = 0;
>         int num_b, i;
>
> +       if (WARN_ON_ONCE(r->scope != RESCTRL_L2_CACHE && r->scope != RESCTRL_L3_CACHE))
> +               return size;

Thanks!

Reviewed-by: Peter Newman <peternewman@google.com>

^ permalink raw reply	[flat|nested] 331+ messages in thread

* Re: [PATCH v8 3/8] x86/resctrl: Prepare for different scope for control/monitor operations
  2023-10-03 21:30         ` [PATCH v8 3/8] x86/resctrl: Prepare for different scope for control/monitor operations Tony Luck
@ 2023-10-05 14:34           ` Peter Newman
  0 siblings, 0 replies; 331+ messages in thread
From: Peter Newman @ 2023-10-05 14:34 UTC (permalink / raw)
  To: Tony Luck
  Cc: Fenghua Yu, Reinette Chatre, Jonathan Corbet, Shuah Khan, x86,
	Shaopeng Tan, James Morse, Jamie Iles, Babu Moger, Randy Dunlap,
	linux-kernel, linux-doc, patches

On Tue, Oct 3, 2023 at 11:30 PM Tony Luck <tony.luck@intel.com> wrote:
>
> Resctrl assumes that control and monitor operations on a resource are
> performed at the same scope.
>
> Prepare for systems that use different scope (specifically L3 scope for
> cache control and NODE scope for cache occupancy and memory bandwidth
> monitoring).
>
> Create separate domain lists for control and monitor operations.
>
> Note that errors during initialization of either control or monitor
> functions on a domain would previously result in that domain being
> excluded from both control and monitor operations. Now the domains are
> allocated independently it is no longer required to disable both control
> and monitor operations if either fail.
>
> Signed-off-by: Tony Luck <tony.luck@intel.com>
>
> ---
>
> Changes since last version:
>
> Initialize the "type" in rdt_domain_hdr when creating domains.
> Check type has expected value before using container_of() to
> get to the surrounding structure.
>
> Rename "hw_mondom" to "hw_dom" in domain_add_cpu_mon() and
> in domain_remove_cpu_mon().
>
> Add lockdep_assert_held(&rdtgroup_mutex) to resctrl_offline_mon_domain()

Thanks!

Reviewed-by: Peter Newman <peternewman@google.com>

^ permalink raw reply	[flat|nested] 331+ messages in thread

* RE: [PATCH v8 0/8] Add support for Sub-NUMA cluster (SNC) systems
  2023-10-05  8:10         ` Shaopeng Tan (Fujitsu)
@ 2023-10-05 15:08           ` Luck, Tony
  2023-10-06 20:27             ` Reinette Chatre
  0 siblings, 1 reply; 331+ messages in thread
From: Luck, Tony @ 2023-10-05 15:08 UTC (permalink / raw)
  To: Shaopeng Tan (Fujitsu),
	Yu, Fenghua, Chatre, Reinette, Peter Newman, Jonathan Corbet,
	Shuah Khan, x86
  Cc: James Morse, Jamie Iles, Babu Moger, Randy Dunlap, linux-kernel,
	linux-doc, patches

> I applied this patch series to kernel v6.5 and v6.6-rc4, but the kernel cannot be booted.
> Could you tell me what kernel version this patch series is based on?

Hi Shaopeng,

Patches are based on v6.6-rc3 (see the "base-commit" line at the end of the cover letter).

Which CPU family/model/stepping & microcode are you testing on?

Are there error or panic messages when it does not boot?

-Tony

^ permalink raw reply	[flat|nested] 331+ messages in thread

* Re: [PATCH v8 0/8] Add support for Sub-NUMA cluster (SNC) systems
  2023-10-05 15:08           ` Luck, Tony
@ 2023-10-06 20:27             ` Reinette Chatre
  0 siblings, 0 replies; 331+ messages in thread
From: Reinette Chatre @ 2023-10-06 20:27 UTC (permalink / raw)
  To: Luck, Tony, Shaopeng Tan (Fujitsu),
	Yu, Fenghua, Peter Newman, Jonathan Corbet, Shuah Khan, x86
  Cc: James Morse, Jamie Iles, Babu Moger, Randy Dunlap, linux-kernel,
	linux-doc, patches

Hi Tony,

On 10/5/2023 8:08 AM, Luck, Tony wrote:
>> I applied this patch series to kernel v6.5 and v6.6-rc4, but the kernel cannot be booted.
>> Could you tell me what kernel version this patch series is based on?
> 
> Hi Shaopeng,
> 
> Patches are based on v6.6-rc3 (see the "base-commit" line at the end of the cover letter).
> 
> Which CPU family/model/stepping & microcode are you testing on?
> 
> Are there error or panic messages when it does not boot?

There is no need to ask Shaopeng for more information. If you test
this series you will immediately learn that it is broken. 
Reading the series I only made it to patch #3 where I realized that this
had not been tested before posting.

Reinette


^ permalink raw reply	[flat|nested] 331+ messages in thread

* [PATCH v9 0/8] Add support for Sub-NUMA cluster (SNC) systems
  2023-10-03 21:30       ` [PATCH v8 " Tony Luck
                           ` (9 preceding siblings ...)
  2023-10-05  8:10         ` Shaopeng Tan (Fujitsu)
@ 2023-10-20 21:30         ` Tony Luck
  2023-10-20 21:30           ` [PATCH v9 1/8] x86/resctrl: Prepare for new domain scope Tony Luck
                             ` (9 more replies)
  10 siblings, 10 replies; 331+ messages in thread
From: Tony Luck @ 2023-10-20 21:30 UTC (permalink / raw)
  To: Fenghua Yu, Reinette Chatre, Peter Newman, Jonathan Corbet,
	Shuah Khan, x86
  Cc: Shaopeng Tan, James Morse, Jamie Iles, Babu Moger, Randy Dunlap,
	linux-kernel, linux-doc, patches, Tony Luck

The Sub-NUMA cluster feature on some Intel processors partitions the CPUs
that share an L3 cache into two or more sets. This plays havoc with the
Resource Director Technology (RDT) monitoring features.  Prior to this
patch Intel has advised that SNC and RDT are incompatible.

Some of these CPUs support an MSR that can partition the RMID counters in
the same way. This allows monitoring features to be used, with the caveat
that users must be aware that Linux may migrate tasks more frequently
between SNC nodes than between "regular" NUMA nodes, so reading counters
from all SNC nodes may be needed to get a complete picture of activity
for tasks.
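
For reference, here is a minimal user-space sketch of what "reading
counters from all SNC nodes" can look like. It is not part of this series;
the default mount point, the default CTRL_MON group and the
mbm_total_bytes event are only an example:

#include <glob.h>
#include <inttypes.h>
#include <stdio.h>

/*
 * Sum mbm_total_bytes over every mon_L3_* directory of the default
 * resctrl group.  With SNC enabled each directory corresponds to one
 * SNC node, so the sum is the total traffic attributed to the group.
 */
int main(void)
{
	const char *pat = "/sys/fs/resctrl/mon_data/mon_L3_*/mbm_total_bytes";
	uint64_t total = 0;
	glob_t g;

	if (glob(pat, 0, NULL, &g))
		return 1;

	for (size_t i = 0; i < g.gl_pathc; i++) {
		FILE *f = fopen(g.gl_pathv[i], "r");
		uint64_t val;

		if (f && fscanf(f, "%" SCNu64, &val) == 1)
			total += val;
		if (f)
			fclose(f);
	}
	globfree(&g);
	printf("mbm_total_bytes across SNC nodes: %" PRIu64 "\n", total);
	return 0;
}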

Cache and memory bandwidth allocation features continue to operate at
the scope of the L3 cache.

Signed-off-by: Tony Luck <tony.luck@intel.com>

Changes since v6 (see individual patches for specifics):

v7 -	had some git format-patch disaster and one of the patches couldn't
	be applied.

v8 -	Was rushed. Somehow I booted the wrong kernel while testing and
	let escape a brown-paper-bag bug that crashed during boot.
	Sincere apologies to all who wasted time reading this series,
	or trying to boot it.

v9 -	Tested (Really! I checked timestamps in dmesg, and ran all sorts of
	other checks to make sure I was really looking at a kernel built
	with these patches).

	Rebased to tip/master October 20th since that has several other
	resctrl changes staged ready for the next merge window. No
	significant collisions, just noise where "git am" would not
	automatically apply. New base is:

	3300447612b2 ("Merge branch into tip/master: 'x86/tdx'")

	Fixed the brown-paper-bag bug from v8.

	Added Peter's "Reviewed-by" where offered (except on patch 3
	which had the aforementioned bug).

Tony Luck (8):
  x86/resctrl: Prepare for new domain scope
  x86/resctrl: Prepare to split rdt_domain structure
  x86/resctrl: Prepare for different scope for control/monitor
    operations
  x86/resctrl: Split the rdt_domain and rdt_hw_domain structures
  x86/resctrl: Add node-scope to the options for feature scope
  x86/resctrl: Introduce snc_nodes_per_l3_cache
  x86/resctrl: Sub NUMA Cluster detection and enable
  x86/resctrl: Update documentation with Sub-NUMA cluster changes

 Documentation/arch/x86/resctrl.rst        |  23 +-
 include/linux/resctrl.h                   |  85 +++--
 arch/x86/include/asm/msr-index.h          |   1 +
 arch/x86/kernel/cpu/resctrl/internal.h    |  66 ++--
 arch/x86/kernel/cpu/resctrl/core.c        | 402 +++++++++++++++++-----
 arch/x86/kernel/cpu/resctrl/ctrlmondata.c |  58 ++--
 arch/x86/kernel/cpu/resctrl/monitor.c     |  58 ++--
 arch/x86/kernel/cpu/resctrl/pseudo_lock.c |  14 +-
 arch/x86/kernel/cpu/resctrl/rdtgroup.c    | 132 +++----
 9 files changed, 592 insertions(+), 247 deletions(-)


base-commit: 3300447612b2adbc05cbb90e5d1cb288f19c40c6
-- 
2.41.0


^ permalink raw reply	[flat|nested] 331+ messages in thread

* [PATCH v9 1/8] x86/resctrl: Prepare for new domain scope
  2023-10-20 21:30         ` [PATCH v9 " Tony Luck
@ 2023-10-20 21:30           ` Tony Luck
  2023-10-30 21:18             ` Reinette Chatre
  2023-10-20 21:30           ` [PATCH v9 2/8] x86/resctrl: Prepare to split rdt_domain structure Tony Luck
                             ` (8 subsequent siblings)
  9 siblings, 1 reply; 331+ messages in thread
From: Tony Luck @ 2023-10-20 21:30 UTC (permalink / raw)
  To: Fenghua Yu, Reinette Chatre, Peter Newman, Jonathan Corbet,
	Shuah Khan, x86
  Cc: Shaopeng Tan, James Morse, Jamie Iles, Babu Moger, Randy Dunlap,
	linux-kernel, linux-doc, patches, Tony Luck

Resctrl resources operate on subsets of CPUs in the system with the
defining attribute of each subset being an instance of a particular
level of cache. E.g. all CPUs sharing an L3 cache would be part of the
same domain.

In preparation for features that are scoped at the NUMA node level
change the code from explicit references to "cache_level" to a more
generic scope. At this point the only options for this scope are groups
of CPUs that share an L2 cache or L3 cache.

Provide a more detailed warning message if a domain id cannot be found
when adding a CPU. Just check and silently return if the domain id can't
be found when removing a CPU.

Reviewed-by: Peter Newman <peternewman@google.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
---
Changes since v6:

s/-EINVAL/0/ for return value of rdtgroup_cbm_to_size()
Added Peter's review tag

 include/linux/resctrl.h                   |  9 +++++--
 arch/x86/kernel/cpu/resctrl/core.c        | 33 ++++++++++++++++++-----
 arch/x86/kernel/cpu/resctrl/pseudo_lock.c |  6 ++++-
 arch/x86/kernel/cpu/resctrl/rdtgroup.c    |  5 +++-
 4 files changed, 43 insertions(+), 10 deletions(-)

diff --git a/include/linux/resctrl.h b/include/linux/resctrl.h
index 66942d7fba7f..7d4eb7df611d 100644
--- a/include/linux/resctrl.h
+++ b/include/linux/resctrl.h
@@ -144,13 +144,18 @@ struct resctrl_membw {
 struct rdt_parse_data;
 struct resctrl_schema;
 
+enum resctrl_scope {
+	RESCTRL_L2_CACHE = 2,
+	RESCTRL_L3_CACHE = 3,
+};
+
 /**
  * struct rdt_resource - attributes of a resctrl resource
  * @rid:		The index of the resource
  * @alloc_capable:	Is allocation available on this machine
  * @mon_capable:	Is monitor feature available on this machine
  * @num_rmid:		Number of RMIDs available
- * @cache_level:	Which cache level defines scope of this resource
+ * @scope:		Scope of this resource
  * @cache:		Cache allocation related data
  * @membw:		If the component has bandwidth controls, their properties.
  * @domains:		All domains for this resource
@@ -168,7 +173,7 @@ struct rdt_resource {
 	bool			alloc_capable;
 	bool			mon_capable;
 	int			num_rmid;
-	int			cache_level;
+	enum resctrl_scope	scope;
 	struct resctrl_cache	cache;
 	struct resctrl_membw	membw;
 	struct list_head	domains;
diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
index 19e0681f0435..e1588bcd9bd2 100644
--- a/arch/x86/kernel/cpu/resctrl/core.c
+++ b/arch/x86/kernel/cpu/resctrl/core.c
@@ -65,7 +65,7 @@ struct rdt_hw_resource rdt_resources_all[] = {
 		.r_resctrl = {
 			.rid			= RDT_RESOURCE_L3,
 			.name			= "L3",
-			.cache_level		= 3,
+			.scope			= RESCTRL_L3_CACHE,
 			.domains		= domain_init(RDT_RESOURCE_L3),
 			.parse_ctrlval		= parse_cbm,
 			.format_str		= "%d=%0*x",
@@ -79,7 +79,7 @@ struct rdt_hw_resource rdt_resources_all[] = {
 		.r_resctrl = {
 			.rid			= RDT_RESOURCE_L2,
 			.name			= "L2",
-			.cache_level		= 2,
+			.scope			= RESCTRL_L2_CACHE,
 			.domains		= domain_init(RDT_RESOURCE_L2),
 			.parse_ctrlval		= parse_cbm,
 			.format_str		= "%d=%0*x",
@@ -93,7 +93,7 @@ struct rdt_hw_resource rdt_resources_all[] = {
 		.r_resctrl = {
 			.rid			= RDT_RESOURCE_MBA,
 			.name			= "MB",
-			.cache_level		= 3,
+			.scope			= RESCTRL_L3_CACHE,
 			.domains		= domain_init(RDT_RESOURCE_MBA),
 			.parse_ctrlval		= parse_bw,
 			.format_str		= "%d=%*u",
@@ -105,7 +105,7 @@ struct rdt_hw_resource rdt_resources_all[] = {
 		.r_resctrl = {
 			.rid			= RDT_RESOURCE_SMBA,
 			.name			= "SMBA",
-			.cache_level		= 3,
+			.scope			= RESCTRL_L3_CACHE,
 			.domains		= domain_init(RDT_RESOURCE_SMBA),
 			.parse_ctrlval		= parse_bw,
 			.format_str		= "%d=%*u",
@@ -491,6 +491,19 @@ static int arch_domain_mbm_alloc(u32 num_rmid, struct rdt_hw_domain *hw_dom)
 	return 0;
 }
 
+static int get_domain_id_from_scope(int cpu, enum resctrl_scope scope)
+{
+	switch (scope) {
+	case RESCTRL_L2_CACHE:
+	case RESCTRL_L3_CACHE:
+		return get_cpu_cacheinfo_id(cpu, scope);
+	default:
+		break;
+	}
+
+	return -EINVAL;
+}
+
 /*
  * domain_add_cpu - Add a cpu to a resource's domain list.
  *
@@ -506,12 +519,17 @@ static int arch_domain_mbm_alloc(u32 num_rmid, struct rdt_hw_domain *hw_dom)
  */
 static void domain_add_cpu(int cpu, struct rdt_resource *r)
 {
-	int id = get_cpu_cacheinfo_id(cpu, r->cache_level);
+	int id = get_domain_id_from_scope(cpu, r->scope);
 	struct list_head *add_pos = NULL;
 	struct rdt_hw_domain *hw_dom;
 	struct rdt_domain *d;
 	int err;
 
+	if (id < 0) {
+		pr_warn_once("Can't find domain id for CPU:%d scope:%d for resource %s\n",
+			     cpu, r->scope, r->name);
+		return;
+	}
 	d = rdt_find_domain(r, id, &add_pos);
 	if (IS_ERR(d)) {
 		pr_warn("Couldn't find cache id for CPU %d\n", cpu);
@@ -556,10 +574,13 @@ static void domain_add_cpu(int cpu, struct rdt_resource *r)
 
 static void domain_remove_cpu(int cpu, struct rdt_resource *r)
 {
-	int id = get_cpu_cacheinfo_id(cpu, r->cache_level);
+	int id = get_domain_id_from_scope(cpu, r->scope);
 	struct rdt_hw_domain *hw_dom;
 	struct rdt_domain *d;
 
+	if (id < 0)
+		return;
+
 	d = rdt_find_domain(r, id, NULL);
 	if (IS_ERR_OR_NULL(d)) {
 		pr_warn("Couldn't find cache id for CPU %d\n", cpu);
diff --git a/arch/x86/kernel/cpu/resctrl/pseudo_lock.c b/arch/x86/kernel/cpu/resctrl/pseudo_lock.c
index 8f559eeae08e..8c5f932bc00b 100644
--- a/arch/x86/kernel/cpu/resctrl/pseudo_lock.c
+++ b/arch/x86/kernel/cpu/resctrl/pseudo_lock.c
@@ -292,10 +292,14 @@ static void pseudo_lock_region_clear(struct pseudo_lock_region *plr)
  */
 static int pseudo_lock_region_init(struct pseudo_lock_region *plr)
 {
+	int scope = plr->s->res->scope;
 	struct cpu_cacheinfo *ci;
 	int ret;
 	int i;
 
+	if (WARN_ON_ONCE(scope != RESCTRL_L2_CACHE && scope != RESCTRL_L3_CACHE))
+		return -ENODEV;
+
 	/* Pick the first cpu we find that is associated with the cache. */
 	plr->cpu = cpumask_first(&plr->d->cpu_mask);
 
@@ -311,7 +315,7 @@ static int pseudo_lock_region_init(struct pseudo_lock_region *plr)
 	plr->size = rdtgroup_cbm_to_size(plr->s->res, plr->d, plr->cbm);
 
 	for (i = 0; i < ci->num_leaves; i++) {
-		if (ci->info_list[i].level == plr->s->res->cache_level) {
+		if (ci->info_list[i].level == scope) {
 			plr->line_size = ci->info_list[i].coherency_line_size;
 			return 0;
 		}
diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
index 69a1de92384a..c44be64d65ec 100644
--- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
+++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
@@ -1413,10 +1413,13 @@ unsigned int rdtgroup_cbm_to_size(struct rdt_resource *r,
 	unsigned int size = 0;
 	int num_b, i;
 
+	if (WARN_ON_ONCE(r->scope != RESCTRL_L2_CACHE && r->scope != RESCTRL_L3_CACHE))
+		return size;
+
 	num_b = bitmap_weight(&cbm, r->cache.cbm_len);
 	ci = get_cpu_cacheinfo(cpumask_any(&d->cpu_mask));
 	for (i = 0; i < ci->num_leaves; i++) {
-		if (ci->info_list[i].level == r->cache_level) {
+		if (ci->info_list[i].level == r->scope) {
 			size = ci->info_list[i].size / r->cache.cbm_len * num_b;
 			break;
 		}
-- 
2.41.0


^ permalink raw reply related	[flat|nested] 331+ messages in thread

* [PATCH v9 2/8] x86/resctrl: Prepare to split rdt_domain structure
  2023-10-20 21:30         ` [PATCH v9 " Tony Luck
  2023-10-20 21:30           ` [PATCH v9 1/8] x86/resctrl: Prepare for new domain scope Tony Luck
@ 2023-10-20 21:30           ` Tony Luck
  2023-10-30 21:19             ` Reinette Chatre
  2023-10-20 21:30           ` [PATCH v9 3/8] x86/resctrl: Prepare for different scope for control/monitor operations Tony Luck
                             ` (7 subsequent siblings)
  9 siblings, 1 reply; 331+ messages in thread
From: Tony Luck @ 2023-10-20 21:30 UTC (permalink / raw)
  To: Fenghua Yu, Reinette Chatre, Peter Newman, Jonathan Corbet,
	Shuah Khan, x86
  Cc: Shaopeng Tan, James Morse, Jamie Iles, Babu Moger, Randy Dunlap,
	linux-kernel, linux-doc, patches, Tony Luck

The rdt_domain structure is used for both control and monitor features.
It is about to be split into separate structures for these two usages
because future resources will have different scope for their control and
monitoring features.

To allow for common code that scans a list of domains looking for a
specific domain id, move the "list" and "id" fields into their own
structure within the rdt_domain structure.

Add a "type" field to the header to be used later so that callers looking
up a domain can be assured that they found one of the expected type.
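
In outline, the layout after this change looks like the sketch below
(condensed from the diff, not a literal excerpt; the remaining rdt_domain
fields are unchanged apart from the embedded header):

        struct rdt_domain_hdr {
                struct list_head                list;
                int                             id;
                enum resctrl_domain_type        type;
        };

        struct rdt_domain {
                struct rdt_domain_hdr           hdr;
                /* ... remaining control/monitor state ... */
        };

        /* Common code walking a domain list can later recover the
         * containing domain from the header:
         */
        d = container_of(hdr, struct rdt_domain, hdr);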

Signed-off-by: Tony Luck <tony.luck@intel.com>
---
Changes since v6:

Add the "type" field to rdt_domain_hdr.

 include/linux/resctrl.h                   | 21 +++++++++--
 arch/x86/kernel/cpu/resctrl/core.c        | 16 ++++-----
 arch/x86/kernel/cpu/resctrl/ctrlmondata.c | 16 ++++-----
 arch/x86/kernel/cpu/resctrl/monitor.c     |  2 +-
 arch/x86/kernel/cpu/resctrl/pseudo_lock.c |  2 +-
 arch/x86/kernel/cpu/resctrl/rdtgroup.c    | 44 +++++++++++------------
 6 files changed, 58 insertions(+), 43 deletions(-)

diff --git a/include/linux/resctrl.h b/include/linux/resctrl.h
index 7d4eb7df611d..320febbb0a4e 100644
--- a/include/linux/resctrl.h
+++ b/include/linux/resctrl.h
@@ -52,10 +52,26 @@ struct resctrl_staged_config {
 	bool			have_new_ctrl;
 };
 
+enum resctrl_domain_type {
+	RESCTRL_CTRL_DOMAIN,
+	RESCTRL_MON_DOMAIN,
+};
+
 /**
- * struct rdt_domain - group of CPUs sharing a resctrl resource
+ * struct rdt_domain_hdr - common header for different domain types
  * @list:		all instances of this resource
  * @id:			unique id for this instance
+ * @type:		type of this instance
+ */
+struct rdt_domain_hdr {
+	struct list_head		list;
+	int				id;
+	enum resctrl_domain_type	type;
+};
+
+/**
+ * struct rdt_domain - group of CPUs sharing a resctrl resource
+ * @hdr:		common header for different domain types
  * @cpu_mask:		which CPUs share this resource
  * @rmid_busy_llc:	bitmap of which limbo RMIDs are above threshold
  * @mbm_total:		saved state for MBM total bandwidth
@@ -71,8 +87,7 @@ struct resctrl_staged_config {
  *			by closid
  */
 struct rdt_domain {
-	struct list_head		list;
-	int				id;
+	struct rdt_domain_hdr		hdr;
 	struct cpumask			cpu_mask;
 	unsigned long			*rmid_busy_llc;
 	struct mbm_state		*mbm_total;
diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
index e1588bcd9bd2..7daa5b7e6cb0 100644
--- a/arch/x86/kernel/cpu/resctrl/core.c
+++ b/arch/x86/kernel/cpu/resctrl/core.c
@@ -356,7 +356,7 @@ struct rdt_domain *get_domain_from_cpu(int cpu, struct rdt_resource *r)
 {
 	struct rdt_domain *d;
 
-	list_for_each_entry(d, &r->domains, list) {
+	list_for_each_entry(d, &r->domains, hdr.list) {
 		/* Find the domain that contains this CPU */
 		if (cpumask_test_cpu(cpu, &d->cpu_mask))
 			return d;
@@ -405,12 +405,12 @@ struct rdt_domain *rdt_find_domain(struct rdt_resource *r, int id,
 		return ERR_PTR(-ENODEV);
 
 	list_for_each(l, &r->domains) {
-		d = list_entry(l, struct rdt_domain, list);
+		d = list_entry(l, struct rdt_domain, hdr.list);
 		/* When id is found, return its domain. */
-		if (id == d->id)
+		if (id == d->hdr.id)
 			return d;
 		/* Stop searching when finding id's position in sorted list. */
-		if (id < d->id)
+		if (id < d->hdr.id)
 			break;
 	}
 
@@ -548,7 +548,7 @@ static void domain_add_cpu(int cpu, struct rdt_resource *r)
 		return;
 
 	d = &hw_dom->d_resctrl;
-	d->id = id;
+	d->hdr.id = id;
 	cpumask_set_cpu(cpu, &d->cpu_mask);
 
 	rdt_domain_reconfigure_cdp(r);
@@ -563,11 +563,11 @@ static void domain_add_cpu(int cpu, struct rdt_resource *r)
 		return;
 	}
 
-	list_add_tail(&d->list, add_pos);
+	list_add_tail(&d->hdr.list, add_pos);
 
 	err = resctrl_online_domain(r, d);
 	if (err) {
-		list_del(&d->list);
+		list_del(&d->hdr.list);
 		domain_free(hw_dom);
 	}
 }
@@ -591,7 +591,7 @@ static void domain_remove_cpu(int cpu, struct rdt_resource *r)
 	cpumask_clear_cpu(cpu, &d->cpu_mask);
 	if (cpumask_empty(&d->cpu_mask)) {
 		resctrl_offline_domain(r, d);
-		list_del(&d->list);
+		list_del(&d->hdr.list);
 
 		/*
 		 * rdt_domain "d" is going to be freed below, so clear
diff --git a/arch/x86/kernel/cpu/resctrl/ctrlmondata.c b/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
index beccb0e87ba7..6f4152b21985 100644
--- a/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
+++ b/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
@@ -67,7 +67,7 @@ int parse_bw(struct rdt_parse_data *data, struct resctrl_schema *s,
 
 	cfg = &d->staged_config[s->conf_type];
 	if (cfg->have_new_ctrl) {
-		rdt_last_cmd_printf("Duplicate domain %d\n", d->id);
+		rdt_last_cmd_printf("Duplicate domain %d\n", d->hdr.id);
 		return -EINVAL;
 	}
 
@@ -146,7 +146,7 @@ int parse_cbm(struct rdt_parse_data *data, struct resctrl_schema *s,
 
 	cfg = &d->staged_config[s->conf_type];
 	if (cfg->have_new_ctrl) {
-		rdt_last_cmd_printf("Duplicate domain %d\n", d->id);
+		rdt_last_cmd_printf("Duplicate domain %d\n", d->hdr.id);
 		return -EINVAL;
 	}
 
@@ -226,8 +226,8 @@ static int parse_line(char *line, struct resctrl_schema *s,
 		return -EINVAL;
 	}
 	dom = strim(dom);
-	list_for_each_entry(d, &r->domains, list) {
-		if (d->id == dom_id) {
+	list_for_each_entry(d, &r->domains, hdr.list) {
+		if (d->hdr.id == dom_id) {
 			data.buf = dom;
 			data.rdtgrp = rdtgrp;
 			if (r->parse_ctrlval(&data, s, d))
@@ -318,7 +318,7 @@ int resctrl_arch_update_domains(struct rdt_resource *r, u32 closid)
 		return -ENOMEM;
 
 	msr_param.res = NULL;
-	list_for_each_entry(d, &r->domains, list) {
+	list_for_each_entry(d, &r->domains, hdr.list) {
 		hw_dom = resctrl_to_arch_dom(d);
 		for (t = 0; t < CDP_NUM_TYPES; t++) {
 			cfg = &hw_dom->d_resctrl.staged_config[t];
@@ -466,7 +466,7 @@ static void show_doms(struct seq_file *s, struct resctrl_schema *schema, int clo
 	u32 ctrl_val;
 
 	seq_printf(s, "%*s:", max_name_width, schema->name);
-	list_for_each_entry(dom, &r->domains, list) {
+	list_for_each_entry(dom, &r->domains, hdr.list) {
 		if (sep)
 			seq_puts(s, ";");
 
@@ -476,7 +476,7 @@ static void show_doms(struct seq_file *s, struct resctrl_schema *schema, int clo
 			ctrl_val = resctrl_arch_get_config(r, dom, closid,
 							   schema->conf_type);
 
-		seq_printf(s, r->format_str, dom->id, max_data_width,
+		seq_printf(s, r->format_str, dom->hdr.id, max_data_width,
 			   ctrl_val);
 		sep = true;
 	}
@@ -505,7 +505,7 @@ int rdtgroup_schemata_show(struct kernfs_open_file *of,
 			} else {
 				seq_printf(s, "%s:%d=%x\n",
 					   rdtgrp->plr->s->res->name,
-					   rdtgrp->plr->d->id,
+					   rdtgrp->plr->d->hdr.id,
 					   rdtgrp->plr->cbm);
 			}
 		} else {
diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c
index f136ac046851..ef12f78392e9 100644
--- a/arch/x86/kernel/cpu/resctrl/monitor.c
+++ b/arch/x86/kernel/cpu/resctrl/monitor.c
@@ -340,7 +340,7 @@ static void add_rmid_to_limbo(struct rmid_entry *entry)
 
 	entry->busy = 0;
 	cpu = get_cpu();
-	list_for_each_entry(d, &r->domains, list) {
+	list_for_each_entry(d, &r->domains, hdr.list) {
 		if (cpumask_test_cpu(cpu, &d->cpu_mask)) {
 			err = resctrl_arch_rmid_read(r, d, entry->rmid,
 						     QOS_L3_OCCUP_EVENT_ID,
diff --git a/arch/x86/kernel/cpu/resctrl/pseudo_lock.c b/arch/x86/kernel/cpu/resctrl/pseudo_lock.c
index 8c5f932bc00b..18b6183a1b48 100644
--- a/arch/x86/kernel/cpu/resctrl/pseudo_lock.c
+++ b/arch/x86/kernel/cpu/resctrl/pseudo_lock.c
@@ -856,7 +856,7 @@ bool rdtgroup_pseudo_locked_in_hierarchy(struct rdt_domain *d)
 	 * associated with them.
 	 */
 	for_each_alloc_capable_rdt_resource(r) {
-		list_for_each_entry(d_i, &r->domains, list) {
+		list_for_each_entry(d_i, &r->domains, hdr.list) {
 			if (d_i->plr)
 				cpumask_or(cpu_with_psl, cpu_with_psl,
 					   &d_i->cpu_mask);
diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
index c44be64d65ec..746ee56856a9 100644
--- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
+++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
@@ -91,7 +91,7 @@ void rdt_staged_configs_clear(void)
 	lockdep_assert_held(&rdtgroup_mutex);
 
 	for_each_alloc_capable_rdt_resource(r) {
-		list_for_each_entry(dom, &r->domains, list)
+		list_for_each_entry(dom, &r->domains, hdr.list)
 			memset(dom->staged_config, 0, sizeof(dom->staged_config));
 	}
 }
@@ -984,12 +984,12 @@ static int rdt_bit_usage_show(struct kernfs_open_file *of,
 
 	mutex_lock(&rdtgroup_mutex);
 	hw_shareable = r->cache.shareable_bits;
-	list_for_each_entry(dom, &r->domains, list) {
+	list_for_each_entry(dom, &r->domains, hdr.list) {
 		if (sep)
 			seq_putc(seq, ';');
 		sw_shareable = 0;
 		exclusive = 0;
-		seq_printf(seq, "%d=", dom->id);
+		seq_printf(seq, "%d=", dom->hdr.id);
 		for (i = 0; i < closids_supported(); i++) {
 			if (!closid_allocated(i))
 				continue;
@@ -1302,7 +1302,7 @@ static bool rdtgroup_mode_test_exclusive(struct rdtgroup *rdtgrp)
 		if (r->rid == RDT_RESOURCE_MBA || r->rid == RDT_RESOURCE_SMBA)
 			continue;
 		has_cache = true;
-		list_for_each_entry(d, &r->domains, list) {
+		list_for_each_entry(d, &r->domains, hdr.list) {
 			ctrl = resctrl_arch_get_config(r, d, closid,
 						       s->conf_type);
 			if (rdtgroup_cbm_overlaps(s, d, ctrl, closid, false)) {
@@ -1465,7 +1465,7 @@ static int rdtgroup_size_show(struct kernfs_open_file *of,
 			size = rdtgroup_cbm_to_size(rdtgrp->plr->s->res,
 						    rdtgrp->plr->d,
 						    rdtgrp->plr->cbm);
-			seq_printf(s, "%d=%u\n", rdtgrp->plr->d->id, size);
+			seq_printf(s, "%d=%u\n", rdtgrp->plr->d->hdr.id, size);
 		}
 		goto out;
 	}
@@ -1477,7 +1477,7 @@ static int rdtgroup_size_show(struct kernfs_open_file *of,
 		type = schema->conf_type;
 		sep = false;
 		seq_printf(s, "%*s:", max_name_width, schema->name);
-		list_for_each_entry(d, &r->domains, list) {
+		list_for_each_entry(d, &r->domains, hdr.list) {
 			if (sep)
 				seq_putc(s, ';');
 			if (rdtgrp->mode == RDT_MODE_PSEUDO_LOCKSETUP) {
@@ -1495,7 +1495,7 @@ static int rdtgroup_size_show(struct kernfs_open_file *of,
 				else
 					size = rdtgroup_cbm_to_size(r, d, ctrl);
 			}
-			seq_printf(s, "%d=%u", d->id, size);
+			seq_printf(s, "%d=%u", d->hdr.id, size);
 			sep = true;
 		}
 		seq_putc(s, '\n');
@@ -1566,7 +1566,7 @@ static int mbm_config_show(struct seq_file *s, struct rdt_resource *r, u32 evtid
 
 	mutex_lock(&rdtgroup_mutex);
 
-	list_for_each_entry(dom, &r->domains, list) {
+	list_for_each_entry(dom, &r->domains, hdr.list) {
 		if (sep)
 			seq_puts(s, ";");
 
@@ -1574,7 +1574,7 @@ static int mbm_config_show(struct seq_file *s, struct rdt_resource *r, u32 evtid
 		mon_info.evtid = evtid;
 		mondata_config_read(dom, &mon_info);
 
-		seq_printf(s, "%d=0x%02x", dom->id, mon_info.mon_config);
+		seq_printf(s, "%d=0x%02x", dom->hdr.id, mon_info.mon_config);
 		sep = true;
 	}
 	seq_puts(s, "\n");
@@ -1689,8 +1689,8 @@ static int mon_config_write(struct rdt_resource *r, char *tok, u32 evtid)
 		return -EINVAL;
 	}
 
-	list_for_each_entry(d, &r->domains, list) {
-		if (d->id == dom_id) {
+	list_for_each_entry(d, &r->domains, hdr.list) {
+		if (d->hdr.id == dom_id) {
 			ret = mbm_config_write_domain(r, d, evtid, val);
 			if (ret)
 				return -EINVAL;
@@ -2232,7 +2232,7 @@ static int set_cache_qos_cfg(int level, bool enable)
 		return -ENOMEM;
 
 	r_l = &rdt_resources_all[level].r_resctrl;
-	list_for_each_entry(d, &r_l->domains, list) {
+	list_for_each_entry(d, &r_l->domains, hdr.list) {
 		if (r_l->cache.arch_has_per_cpu_cfg)
 			/* Pick all the CPUs in the domain instance */
 			for_each_cpu(cpu, &d->cpu_mask)
@@ -2317,7 +2317,7 @@ static int set_mba_sc(bool mba_sc)
 
 	r->membw.mba_sc = mba_sc;
 
-	list_for_each_entry(d, &r->domains, list) {
+	list_for_each_entry(d, &r->domains, hdr.list) {
 		for (i = 0; i < num_closid; i++)
 			d->mbps_val[i] = MBA_MAX_MBPS;
 	}
@@ -2653,7 +2653,7 @@ static int rdt_get_tree(struct fs_context *fc)
 
 	if (is_mbm_enabled()) {
 		r = &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl;
-		list_for_each_entry(dom, &r->domains, list)
+		list_for_each_entry(dom, &r->domains, hdr.list)
 			mbm_setup_overflow_handler(dom, MBM_OVERFLOW_INTERVAL);
 	}
 
@@ -2780,7 +2780,7 @@ static int reset_all_ctrls(struct rdt_resource *r)
 	 * CBMs in all domains to the maximum mask value. Pick one CPU
 	 * from each domain to update the MSRs below.
 	 */
-	list_for_each_entry(d, &r->domains, list) {
+	list_for_each_entry(d, &r->domains, hdr.list) {
 		hw_dom = resctrl_to_arch_dom(d);
 		cpumask_set_cpu(cpumask_any(&d->cpu_mask), cpu_mask);
 
@@ -2986,7 +2986,7 @@ static int mkdir_mondata_subdir(struct kernfs_node *parent_kn,
 	char name[32];
 	int ret;
 
-	sprintf(name, "mon_%s_%02d", r->name, d->id);
+	sprintf(name, "mon_%s_%02d", r->name, d->hdr.id);
 	/* create the directory */
 	kn = kernfs_create_dir(parent_kn, name, parent_kn->mode, prgrp);
 	if (IS_ERR(kn))
@@ -3002,7 +3002,7 @@ static int mkdir_mondata_subdir(struct kernfs_node *parent_kn,
 	}
 
 	priv.u.rid = r->rid;
-	priv.u.domid = d->id;
+	priv.u.domid = d->hdr.id;
 	list_for_each_entry(mevt, &r->evt_list, list) {
 		priv.u.evtid = mevt->evtid;
 		ret = mon_addfile(kn, mevt->name, priv.priv);
@@ -3050,7 +3050,7 @@ static int mkdir_mondata_subdir_alldom(struct kernfs_node *parent_kn,
 	struct rdt_domain *dom;
 	int ret;
 
-	list_for_each_entry(dom, &r->domains, list) {
+	list_for_each_entry(dom, &r->domains, hdr.list) {
 		ret = mkdir_mondata_subdir(parent_kn, dom, r, prgrp);
 		if (ret)
 			return ret;
@@ -3209,7 +3209,7 @@ static int __init_one_rdt_domain(struct rdt_domain *d, struct resctrl_schema *s,
 	 */
 	tmp_cbm = cfg->new_ctrl;
 	if (bitmap_weight(&tmp_cbm, r->cache.cbm_len) < r->cache.min_cbm_bits) {
-		rdt_last_cmd_printf("No space on %s:%d\n", s->name, d->id);
+		rdt_last_cmd_printf("No space on %s:%d\n", s->name, d->hdr.id);
 		return -ENOSPC;
 	}
 	cfg->have_new_ctrl = true;
@@ -3232,7 +3232,7 @@ static int rdtgroup_init_cat(struct resctrl_schema *s, u32 closid)
 	struct rdt_domain *d;
 	int ret;
 
-	list_for_each_entry(d, &s->res->domains, list) {
+	list_for_each_entry(d, &s->res->domains, hdr.list) {
 		ret = __init_one_rdt_domain(d, s, closid);
 		if (ret < 0)
 			return ret;
@@ -3247,7 +3247,7 @@ static void rdtgroup_init_mba(struct rdt_resource *r, u32 closid)
 	struct resctrl_staged_config *cfg;
 	struct rdt_domain *d;
 
-	list_for_each_entry(d, &r->domains, list) {
+	list_for_each_entry(d, &r->domains, hdr.list) {
 		if (is_mba_sc(r)) {
 			d->mbps_val[closid] = MBA_MAX_MBPS;
 			continue;
@@ -3864,7 +3864,7 @@ void resctrl_offline_domain(struct rdt_resource *r, struct rdt_domain *d)
 	 * per domain monitor data directories.
 	 */
 	if (static_branch_unlikely(&rdt_mon_enable_key))
-		rmdir_mondata_subdir_allrdtgrp(r, d->id);
+		rmdir_mondata_subdir_allrdtgrp(r, d->hdr.id);
 
 	if (is_mbm_enabled())
 		cancel_delayed_work(&d->mbm_over);
-- 
2.41.0


^ permalink raw reply related	[flat|nested] 331+ messages in thread

* [PATCH v9 3/8] x86/resctrl: Prepare for different scope for control/monitor operations
  2023-10-20 21:30         ` [PATCH v9 " Tony Luck
  2023-10-20 21:30           ` [PATCH v9 1/8] x86/resctrl: Prepare for new domain scope Tony Luck
  2023-10-20 21:30           ` [PATCH v9 2/8] x86/resctrl: Prepare to split rdt_domain structure Tony Luck
@ 2023-10-20 21:30           ` Tony Luck
  2023-10-30 21:21             ` Reinette Chatre
  2023-10-20 21:30           ` [PATCH v9 4/8] x86/resctrl: Split the rdt_domain and rdt_hw_domain structures Tony Luck
                             ` (6 subsequent siblings)
  9 siblings, 1 reply; 331+ messages in thread
From: Tony Luck @ 2023-10-20 21:30 UTC (permalink / raw)
  To: Fenghua Yu, Reinette Chatre, Peter Newman, Jonathan Corbet,
	Shuah Khan, x86
  Cc: Shaopeng Tan, James Morse, Jamie Iles, Babu Moger, Randy Dunlap,
	linux-kernel, linux-doc, patches, Tony Luck

Resctrl assumes that control and monitor operations on a resource are
performed at the same scope.

Prepare for systems that use different scope (specifically L3 scope for
cache control and NODE scope for cache occupancy and memory bandwidth
monitoring).

Create separate domain lists for control and monitor operations.

Note that errors during initialization of either control or monitor
functions on a domain would previously result in that domain being
excluded from both control and monitor operations. Now that the domains
are allocated independently, it is no longer necessary to disable both
control and monitor operations if either fails.
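
Condensed from the changes below (a sketch, not a literal hunk), callers
that look up a domain on one of the new lists follow this pattern:

        struct rdt_domain_hdr *hdr;
        struct rdt_domain *d;

        hdr = rdt_find_domain(&r->mon_domains, id, NULL);
        if (IS_ERR_OR_NULL(hdr))
                return;         /* no monitor domain with this id */
        if (WARN_ON_ONCE(hdr->type != RESCTRL_MON_DOMAIN))
                return;         /* found a domain of the wrong type */
        d = container_of(hdr, struct rdt_domain, hdr);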

Signed-off-by: Tony Luck <tony.luck@intel.com>
---
Changes since v6:

Initialize the "type" in rdt_domain_hdr when creating domains.
Check type has expected value before using container_of() to
get to the surrounding structure.

Rename "hw_mondom" to "hw_dom" in domain_add_cpu_mon() and
in domain_remove_cpu_mon().

Add lockdep_assert_held(&rdtgroup_mutex) to resctrl_offline_mon_domain()

Changes since v8:
Fix the brown-paper-bag bugs in handling a NULL result from rdt_find_domain()
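
(For reference, a condensed sketch of the intended return-value handling,
not a literal hunk: rdt_find_domain() returns ERR_PTR(-ENODEV) for a
negative id, NULL when no matching domain exists yet, and the domain
header otherwise.)

        /* Add path: NULL means "allocate a new domain at add_pos" */
        hdr = rdt_find_domain(&r->ctrl_domains, id, &add_pos);
        if (IS_ERR(hdr))
                return;

        /* Remove/show paths: NULL and ERR_PTR are both failures */
        hdr = rdt_find_domain(&r->ctrl_domains, id, NULL);
        if (IS_ERR_OR_NULL(hdr))
                return;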


 include/linux/resctrl.h                   |  18 +-
 arch/x86/kernel/cpu/resctrl/internal.h    |   4 +-
 arch/x86/kernel/cpu/resctrl/core.c        | 216 +++++++++++++++++-----
 arch/x86/kernel/cpu/resctrl/ctrlmondata.c |  18 +-
 arch/x86/kernel/cpu/resctrl/monitor.c     |   2 +-
 arch/x86/kernel/cpu/resctrl/pseudo_lock.c |   4 +-
 arch/x86/kernel/cpu/resctrl/rdtgroup.c    |  54 +++---
 7 files changed, 225 insertions(+), 91 deletions(-)

diff --git a/include/linux/resctrl.h b/include/linux/resctrl.h
index 320febbb0a4e..98d917aff075 100644
--- a/include/linux/resctrl.h
+++ b/include/linux/resctrl.h
@@ -170,10 +170,12 @@ enum resctrl_scope {
  * @alloc_capable:	Is allocation available on this machine
  * @mon_capable:	Is monitor feature available on this machine
  * @num_rmid:		Number of RMIDs available
- * @scope:		Scope of this resource
+ * @ctrl_scope:		Scope of this resource for control functions
+ * @mon_scope:		Scope of this resource for monitor functions
  * @cache:		Cache allocation related data
  * @membw:		If the component has bandwidth controls, their properties.
- * @domains:		All domains for this resource
+ * @ctrl_domains:	Control domains for this resource
+ * @mon_domains:	Monitor domains for this resource
  * @name:		Name to use in "schemata" file.
  * @data_width:		Character width of data when displaying
  * @default_ctrl:	Specifies default cache cbm or memory B/W percent.
@@ -188,10 +190,12 @@ struct rdt_resource {
 	bool			alloc_capable;
 	bool			mon_capable;
 	int			num_rmid;
-	enum resctrl_scope	scope;
+	enum resctrl_scope	ctrl_scope;
+	enum resctrl_scope	mon_scope;
 	struct resctrl_cache	cache;
 	struct resctrl_membw	membw;
-	struct list_head	domains;
+	struct list_head	ctrl_domains;
+	struct list_head	mon_domains;
 	char			*name;
 	int			data_width;
 	u32			default_ctrl;
@@ -237,8 +241,10 @@ int resctrl_arch_update_one(struct rdt_resource *r, struct rdt_domain *d,
 
 u32 resctrl_arch_get_config(struct rdt_resource *r, struct rdt_domain *d,
 			    u32 closid, enum resctrl_conf_type type);
-int resctrl_online_domain(struct rdt_resource *r, struct rdt_domain *d);
-void resctrl_offline_domain(struct rdt_resource *r, struct rdt_domain *d);
+int resctrl_online_ctrl_domain(struct rdt_resource *r, struct rdt_domain *d);
+int resctrl_online_mon_domain(struct rdt_resource *r, struct rdt_domain *d);
+void resctrl_offline_ctrl_domain(struct rdt_resource *r, struct rdt_domain *d);
+void resctrl_offline_mon_domain(struct rdt_resource *r, struct rdt_domain *d);
 
 /**
  * resctrl_arch_rmid_read() - Read the eventid counter corresponding to rmid
diff --git a/arch/x86/kernel/cpu/resctrl/internal.h b/arch/x86/kernel/cpu/resctrl/internal.h
index a4f1aa15f0a2..12f1ea3ba8a1 100644
--- a/arch/x86/kernel/cpu/resctrl/internal.h
+++ b/arch/x86/kernel/cpu/resctrl/internal.h
@@ -520,8 +520,8 @@ void rdtgroup_kn_unlock(struct kernfs_node *kn);
 int rdtgroup_kn_mode_restrict(struct rdtgroup *r, const char *name);
 int rdtgroup_kn_mode_restore(struct rdtgroup *r, const char *name,
 			     umode_t mask);
-struct rdt_domain *rdt_find_domain(struct rdt_resource *r, int id,
-				   struct list_head **pos);
+struct rdt_domain_hdr *rdt_find_domain(struct list_head *h, int id,
+				       struct list_head **pos);
 ssize_t rdtgroup_schemata_write(struct kernfs_open_file *of,
 				char *buf, size_t nbytes, loff_t off);
 int rdtgroup_schemata_show(struct kernfs_open_file *of,
diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
index 7daa5b7e6cb0..b3b8f936bea5 100644
--- a/arch/x86/kernel/cpu/resctrl/core.c
+++ b/arch/x86/kernel/cpu/resctrl/core.c
@@ -57,7 +57,8 @@ static void
 mba_wrmsr_amd(struct rdt_domain *d, struct msr_param *m,
 	      struct rdt_resource *r);
 
-#define domain_init(id) LIST_HEAD_INIT(rdt_resources_all[id].r_resctrl.domains)
+#define ctrl_domain_init(id) LIST_HEAD_INIT(rdt_resources_all[id].r_resctrl.ctrl_domains)
+#define mon_domain_init(id) LIST_HEAD_INIT(rdt_resources_all[id].r_resctrl.mon_domains)
 
 struct rdt_hw_resource rdt_resources_all[] = {
 	[RDT_RESOURCE_L3] =
@@ -65,8 +66,10 @@ struct rdt_hw_resource rdt_resources_all[] = {
 		.r_resctrl = {
 			.rid			= RDT_RESOURCE_L3,
 			.name			= "L3",
-			.scope			= RESCTRL_L3_CACHE,
-			.domains		= domain_init(RDT_RESOURCE_L3),
+			.ctrl_scope		= RESCTRL_L3_CACHE,
+			.mon_scope		= RESCTRL_L3_CACHE,
+			.ctrl_domains		= ctrl_domain_init(RDT_RESOURCE_L3),
+			.mon_domains		= mon_domain_init(RDT_RESOURCE_L3),
 			.parse_ctrlval		= parse_cbm,
 			.format_str		= "%d=%0*x",
 			.fflags			= RFTYPE_RES_CACHE,
@@ -79,8 +82,8 @@ struct rdt_hw_resource rdt_resources_all[] = {
 		.r_resctrl = {
 			.rid			= RDT_RESOURCE_L2,
 			.name			= "L2",
-			.scope			= RESCTRL_L2_CACHE,
-			.domains		= domain_init(RDT_RESOURCE_L2),
+			.ctrl_scope		= RESCTRL_L2_CACHE,
+			.ctrl_domains		= ctrl_domain_init(RDT_RESOURCE_L2),
 			.parse_ctrlval		= parse_cbm,
 			.format_str		= "%d=%0*x",
 			.fflags			= RFTYPE_RES_CACHE,
@@ -93,8 +96,8 @@ struct rdt_hw_resource rdt_resources_all[] = {
 		.r_resctrl = {
 			.rid			= RDT_RESOURCE_MBA,
 			.name			= "MB",
-			.scope			= RESCTRL_L3_CACHE,
-			.domains		= domain_init(RDT_RESOURCE_MBA),
+			.ctrl_scope		= RESCTRL_L3_CACHE,
+			.ctrl_domains		= ctrl_domain_init(RDT_RESOURCE_MBA),
 			.parse_ctrlval		= parse_bw,
 			.format_str		= "%d=%*u",
 			.fflags			= RFTYPE_RES_MB,
@@ -105,8 +108,8 @@ struct rdt_hw_resource rdt_resources_all[] = {
 		.r_resctrl = {
 			.rid			= RDT_RESOURCE_SMBA,
 			.name			= "SMBA",
-			.scope			= RESCTRL_L3_CACHE,
-			.domains		= domain_init(RDT_RESOURCE_SMBA),
+			.ctrl_scope		= RESCTRL_L3_CACHE,
+			.ctrl_domains		= ctrl_domain_init(RDT_RESOURCE_SMBA),
 			.parse_ctrlval		= parse_bw,
 			.format_str		= "%d=%*u",
 			.fflags			= RFTYPE_RES_MB,
@@ -356,7 +359,7 @@ struct rdt_domain *get_domain_from_cpu(int cpu, struct rdt_resource *r)
 {
 	struct rdt_domain *d;
 
-	list_for_each_entry(d, &r->domains, hdr.list) {
+	list_for_each_entry(d, &r->ctrl_domains, hdr.list) {
 		/* Find the domain that contains this CPU */
 		if (cpumask_test_cpu(cpu, &d->cpu_mask))
 			return d;
@@ -388,29 +391,39 @@ void rdt_ctrl_update(void *arg)
 }
 
 /*
- * rdt_find_domain - Find a domain in a resource that matches input resource id
+ * rdt_find_domain - Find a domain in one of a resource's domain lists.
  *
- * Search resource r's domain list to find the resource id. If the resource
- * id is found in a domain, return the domain. Otherwise, if requested by
- * caller, return the first domain whose id is bigger than the input id.
+ * Search the list to find the resource id. If the resource id is found
+ * in a domain, return the domain. Otherwise, if requested by caller,
+ * return the first domain whose id is bigger than the input id.
  * The domain list is sorted by id in ascending order.
+ *
+ * If an existing domain in the resource r's domain list matches the cpu's
+ * resource id, add the cpu in the domain.
+ *
+ * Otherwise, caller will allocate a new domain and insert into the right position
+ * in the domain list sorted by id in ascending order.
+ *
+ * The order in the domain list is visible to users when we print entries
+ * in the schemata file and schemata input is validated to have the same order
+ * as this list.
  */
-struct rdt_domain *rdt_find_domain(struct rdt_resource *r, int id,
-				   struct list_head **pos)
+struct rdt_domain_hdr *rdt_find_domain(struct list_head *h, int id,
+				       struct list_head **pos)
 {
-	struct rdt_domain *d;
+	struct rdt_domain_hdr *d;
 	struct list_head *l;
 
 	if (id < 0)
 		return ERR_PTR(-ENODEV);
 
-	list_for_each(l, &r->domains) {
-		d = list_entry(l, struct rdt_domain, hdr.list);
+	list_for_each(l, h) {
+		d = list_entry(l, struct rdt_domain_hdr, list);
 		/* When id is found, return its domain. */
-		if (id == d->hdr.id)
+		if (id == d->id)
 			return d;
 		/* Stop searching when finding id's position in sorted list. */
-		if (id < d->hdr.id)
+		if (id < d->id)
 			break;
 	}
 
@@ -504,39 +517,33 @@ static int get_domain_id_from_scope(int cpu, enum resctrl_scope scope)
 	return -EINVAL;
 }
 
-/*
- * domain_add_cpu - Add a cpu to a resource's domain list.
- *
- * If an existing domain in the resource r's domain list matches the cpu's
- * resource id, add the cpu in the domain.
- *
- * Otherwise, a new domain is allocated and inserted into the right position
- * in the domain list sorted by id in ascending order.
- *
- * The order in the domain list is visible to users when we print entries
- * in the schemata file and schemata input is validated to have the same order
- * as this list.
- */
-static void domain_add_cpu(int cpu, struct rdt_resource *r)
+static void domain_add_cpu_ctrl(int cpu, struct rdt_resource *r)
 {
-	int id = get_domain_id_from_scope(cpu, r->scope);
+	int id = get_domain_id_from_scope(cpu, r->ctrl_scope);
 	struct list_head *add_pos = NULL;
 	struct rdt_hw_domain *hw_dom;
+	struct rdt_domain_hdr *hdr;
 	struct rdt_domain *d;
 	int err;
 
 	if (id < 0) {
-		pr_warn_once("Can't find domain id for CPU:%d scope:%d for resource %s\n",
-			     cpu, r->scope, r->name);
+		pr_warn_once("Can't find control domain id for CPU:%d scope:%d for resource %s\n",
+			     cpu, r->ctrl_scope, r->name);
 		return;
 	}
-	d = rdt_find_domain(r, id, &add_pos);
-	if (IS_ERR(d)) {
-		pr_warn("Couldn't find cache id for CPU %d\n", cpu);
+
+	hdr = rdt_find_domain(&r->ctrl_domains, id, &add_pos);
+	if (IS_ERR(hdr)) {
+		pr_warn("Couldn't find control scope id=%d for CPU %d\n", id, cpu);
 		return;
 	}
 
-	if (d) {
+	if (hdr) {
+		if (WARN_ON_ONCE(hdr->type != RESCTRL_CTRL_DOMAIN))
+			return;
+
+		d = container_of(hdr, struct rdt_domain, hdr);
+
 		cpumask_set_cpu(cpu, &d->cpu_mask);
 		if (r->cache.arch_has_per_cpu_cfg)
 			rdt_domain_reconfigure_cdp(r);
@@ -549,48 +556,115 @@ static void domain_add_cpu(int cpu, struct rdt_resource *r)
 
 	d = &hw_dom->d_resctrl;
 	d->hdr.id = id;
+	d->hdr.type = RESCTRL_CTRL_DOMAIN;
 	cpumask_set_cpu(cpu, &d->cpu_mask);
 
 	rdt_domain_reconfigure_cdp(r);
 
-	if (r->alloc_capable && domain_setup_ctrlval(r, d)) {
+	if (domain_setup_ctrlval(r, d)) {
+		domain_free(hw_dom);
+		return;
+	}
+
+	list_add_tail(&d->hdr.list, add_pos);
+
+	err = resctrl_online_ctrl_domain(r, d);
+	if (err) {
+		list_del(&d->hdr.list);
 		domain_free(hw_dom);
+	}
+}
+
+static void domain_add_cpu_mon(int cpu, struct rdt_resource *r)
+{
+	int id = get_domain_id_from_scope(cpu, r->mon_scope);
+	struct list_head *add_pos = NULL;
+	struct rdt_hw_domain *hw_dom;
+	struct rdt_domain_hdr *hdr;
+	struct rdt_domain *d;
+	int err;
+
+	if (id < 0) {
+		pr_warn_once("Can't find monitor domain id for CPU:%d scope:%d for resource %s\n",
+			     cpu, r->mon_scope, r->name);
+		return;
+	}
+
+	hdr = rdt_find_domain(&r->mon_domains, id, &add_pos);
+	if (IS_ERR(hdr)) {
+		pr_warn("Couldn't find monitor scope id=%d for CPU %d\n", id, cpu);
+		return;
+	}
+
+	if (hdr) {
+		if (WARN_ON_ONCE(hdr->type != RESCTRL_MON_DOMAIN))
+			return;
+
+		d = container_of(hdr, struct rdt_domain, hdr);
+
+		cpumask_set_cpu(cpu, &d->cpu_mask);
 		return;
 	}
 
-	if (r->mon_capable && arch_domain_mbm_alloc(r->num_rmid, hw_dom)) {
+	hw_dom = kzalloc_node(sizeof(*hw_dom), GFP_KERNEL, cpu_to_node(cpu));
+	if (!hw_dom)
+		return;
+
+	d = &hw_dom->d_resctrl;
+	d->hdr.id = id;
+	d->hdr.type = RESCTRL_MON_DOMAIN;
+	cpumask_set_cpu(cpu, &d->cpu_mask);
+
+	if (arch_domain_mbm_alloc(r->num_rmid, hw_dom)) {
 		domain_free(hw_dom);
 		return;
 	}
 
 	list_add_tail(&d->hdr.list, add_pos);
 
-	err = resctrl_online_domain(r, d);
+	err = resctrl_online_mon_domain(r, d);
 	if (err) {
 		list_del(&d->hdr.list);
 		domain_free(hw_dom);
 	}
 }
 
-static void domain_remove_cpu(int cpu, struct rdt_resource *r)
+/*
+ * domain_add_cpu - Add a cpu to either or both of a resource's domain lists.
+ */
+static void domain_add_cpu(int cpu, struct rdt_resource *r)
+{
+	if (r->alloc_capable)
+		domain_add_cpu_ctrl(cpu, r);
+	if (r->mon_capable)
+		domain_add_cpu_mon(cpu, r);
+}
+
+static void domain_remove_cpu_ctrl(int cpu, struct rdt_resource *r)
 {
-	int id = get_domain_id_from_scope(cpu, r->scope);
+	int id = get_domain_id_from_scope(cpu, r->ctrl_scope);
 	struct rdt_hw_domain *hw_dom;
+	struct rdt_domain_hdr *hdr;
 	struct rdt_domain *d;
 
 	if (id < 0)
 		return;
 
-	d = rdt_find_domain(r, id, NULL);
-	if (IS_ERR_OR_NULL(d)) {
-		pr_warn("Couldn't find cache id for CPU %d\n", cpu);
+	hdr = rdt_find_domain(&r->ctrl_domains, id, NULL);
+	if (IS_ERR_OR_NULL(hdr)) {
+		pr_warn("Couldn't find control scope id=%d for CPU %d\n", id, cpu);
 		return;
 	}
+
+	if (WARN_ON_ONCE(hdr->type != RESCTRL_CTRL_DOMAIN))
+		return;
+
+	d = container_of(hdr, struct rdt_domain, hdr);
 	hw_dom = resctrl_to_arch_dom(d);
 
 	cpumask_clear_cpu(cpu, &d->cpu_mask);
 	if (cpumask_empty(&d->cpu_mask)) {
-		resctrl_offline_domain(r, d);
+		resctrl_offline_ctrl_domain(r, d);
 		list_del(&d->hdr.list);
 
 		/*
@@ -603,6 +677,38 @@ static void domain_remove_cpu(int cpu, struct rdt_resource *r)
 
 		return;
 	}
+}
+
+static void domain_remove_cpu_mon(int cpu, struct rdt_resource *r)
+{
+	int id = get_domain_id_from_scope(cpu, r->mon_scope);
+	struct rdt_hw_domain *hw_dom;
+	struct rdt_domain_hdr *hdr;
+	struct rdt_domain *d;
+
+	if (id < 0)
+		return;
+
+	hdr = rdt_find_domain(&r->mon_domains, id, NULL);
+	if (IS_ERR_OR_NULL(hdr)) {
+		pr_warn("Couldn't find scope id=%d for CPU %d\n", id, cpu);
+		return;
+	}
+
+	if (WARN_ON_ONCE(hdr->type != RESCTRL_MON_DOMAIN))
+		return;
+
+	d = container_of(hdr, struct rdt_domain, hdr);
+	hw_dom = resctrl_to_arch_dom(d);
+
+	cpumask_clear_cpu(cpu, &d->cpu_mask);
+	if (cpumask_empty(&d->cpu_mask)) {
+		resctrl_offline_mon_domain(r, d);
+		list_del(&d->hdr.list);
+		domain_free(hw_dom);
+
+		return;
+	}
 
 	if (r == &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl) {
 		if (is_mbm_enabled() && cpu == d->mbm_work_cpu) {
@@ -617,6 +723,14 @@ static void domain_remove_cpu(int cpu, struct rdt_resource *r)
 	}
 }
 
+static void domain_remove_cpu(int cpu, struct rdt_resource *r)
+{
+	if (r->alloc_capable)
+		domain_remove_cpu_ctrl(cpu, r);
+	if (r->mon_capable)
+		domain_remove_cpu_mon(cpu, r);
+}
+
 static void clear_closid_rmid(int cpu)
 {
 	struct resctrl_pqr_state *state = this_cpu_ptr(&pqr_state);
diff --git a/arch/x86/kernel/cpu/resctrl/ctrlmondata.c b/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
index 6f4152b21985..34b7eb26b06d 100644
--- a/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
+++ b/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
@@ -226,7 +226,7 @@ static int parse_line(char *line, struct resctrl_schema *s,
 		return -EINVAL;
 	}
 	dom = strim(dom);
-	list_for_each_entry(d, &r->domains, hdr.list) {
+	list_for_each_entry(d, &r->ctrl_domains, hdr.list) {
 		if (d->hdr.id == dom_id) {
 			data.buf = dom;
 			data.rdtgrp = rdtgrp;
@@ -318,7 +318,7 @@ int resctrl_arch_update_domains(struct rdt_resource *r, u32 closid)
 		return -ENOMEM;
 
 	msr_param.res = NULL;
-	list_for_each_entry(d, &r->domains, hdr.list) {
+	list_for_each_entry(d, &r->ctrl_domains, hdr.list) {
 		hw_dom = resctrl_to_arch_dom(d);
 		for (t = 0; t < CDP_NUM_TYPES; t++) {
 			cfg = &hw_dom->d_resctrl.staged_config[t];
@@ -466,7 +466,7 @@ static void show_doms(struct seq_file *s, struct resctrl_schema *schema, int clo
 	u32 ctrl_val;
 
 	seq_printf(s, "%*s:", max_name_width, schema->name);
-	list_for_each_entry(dom, &r->domains, hdr.list) {
+	list_for_each_entry(dom, &r->ctrl_domains, hdr.list) {
 		if (sep)
 			seq_puts(s, ";");
 
@@ -542,6 +542,7 @@ void mon_event_read(struct rmid_read *rr, struct rdt_resource *r,
 int rdtgroup_mondata_show(struct seq_file *m, void *arg)
 {
 	struct kernfs_open_file *of = m->private;
+	struct rdt_domain_hdr *hdr;
 	u32 resid, evtid, domid;
 	struct rdtgroup *rdtgrp;
 	struct rdt_resource *r;
@@ -562,12 +563,19 @@ int rdtgroup_mondata_show(struct seq_file *m, void *arg)
 	evtid = md.u.evtid;
 
 	r = &rdt_resources_all[resid].r_resctrl;
-	d = rdt_find_domain(r, domid, NULL);
-	if (IS_ERR_OR_NULL(d)) {
+	hdr = rdt_find_domain(&r->mon_domains, domid, NULL);
+	if (IS_ERR_OR_NULL(hdr)) {
 		ret = -ENOENT;
 		goto out;
 	}
 
+	if (WARN_ON_ONCE(hdr->type != RESCTRL_MON_DOMAIN)) {
+		ret = -EINVAL;
+		goto out;
+	}
+
+	d = container_of(hdr, struct rdt_domain, hdr);
+
 	mon_event_read(&rr, r, d, rdtgrp, evtid, false);
 
 	if (rr.err == -EIO)
diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c
index ef12f78392e9..6d5e2cbdefb5 100644
--- a/arch/x86/kernel/cpu/resctrl/monitor.c
+++ b/arch/x86/kernel/cpu/resctrl/monitor.c
@@ -340,7 +340,7 @@ static void add_rmid_to_limbo(struct rmid_entry *entry)
 
 	entry->busy = 0;
 	cpu = get_cpu();
-	list_for_each_entry(d, &r->domains, hdr.list) {
+	list_for_each_entry(d, &r->mon_domains, hdr.list) {
 		if (cpumask_test_cpu(cpu, &d->cpu_mask)) {
 			err = resctrl_arch_rmid_read(r, d, entry->rmid,
 						     QOS_L3_OCCUP_EVENT_ID,
diff --git a/arch/x86/kernel/cpu/resctrl/pseudo_lock.c b/arch/x86/kernel/cpu/resctrl/pseudo_lock.c
index 18b6183a1b48..bda32b4e1c1e 100644
--- a/arch/x86/kernel/cpu/resctrl/pseudo_lock.c
+++ b/arch/x86/kernel/cpu/resctrl/pseudo_lock.c
@@ -292,7 +292,7 @@ static void pseudo_lock_region_clear(struct pseudo_lock_region *plr)
  */
 static int pseudo_lock_region_init(struct pseudo_lock_region *plr)
 {
-	int scope = plr->s->res->scope;
+	int scope = plr->s->res->ctrl_scope;
 	struct cpu_cacheinfo *ci;
 	int ret;
 	int i;
@@ -856,7 +856,7 @@ bool rdtgroup_pseudo_locked_in_hierarchy(struct rdt_domain *d)
 	 * associated with them.
 	 */
 	for_each_alloc_capable_rdt_resource(r) {
-		list_for_each_entry(d_i, &r->domains, hdr.list) {
+		list_for_each_entry(d_i, &r->ctrl_domains, hdr.list) {
 			if (d_i->plr)
 				cpumask_or(cpu_with_psl, cpu_with_psl,
 					   &d_i->cpu_mask);
diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
index 746ee56856a9..9df8f02ecb63 100644
--- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
+++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
@@ -91,7 +91,7 @@ void rdt_staged_configs_clear(void)
 	lockdep_assert_held(&rdtgroup_mutex);
 
 	for_each_alloc_capable_rdt_resource(r) {
-		list_for_each_entry(dom, &r->domains, hdr.list)
+		list_for_each_entry(dom, &r->ctrl_domains, hdr.list)
 			memset(dom->staged_config, 0, sizeof(dom->staged_config));
 	}
 }
@@ -984,7 +984,7 @@ static int rdt_bit_usage_show(struct kernfs_open_file *of,
 
 	mutex_lock(&rdtgroup_mutex);
 	hw_shareable = r->cache.shareable_bits;
-	list_for_each_entry(dom, &r->domains, hdr.list) {
+	list_for_each_entry(dom, &r->ctrl_domains, hdr.list) {
 		if (sep)
 			seq_putc(seq, ';');
 		sw_shareable = 0;
@@ -1302,7 +1302,7 @@ static bool rdtgroup_mode_test_exclusive(struct rdtgroup *rdtgrp)
 		if (r->rid == RDT_RESOURCE_MBA || r->rid == RDT_RESOURCE_SMBA)
 			continue;
 		has_cache = true;
-		list_for_each_entry(d, &r->domains, hdr.list) {
+		list_for_each_entry(d, &r->ctrl_domains, hdr.list) {
 			ctrl = resctrl_arch_get_config(r, d, closid,
 						       s->conf_type);
 			if (rdtgroup_cbm_overlaps(s, d, ctrl, closid, false)) {
@@ -1413,13 +1413,13 @@ unsigned int rdtgroup_cbm_to_size(struct rdt_resource *r,
 	unsigned int size = 0;
 	int num_b, i;
 
-	if (WARN_ON_ONCE(r->scope != RESCTRL_L2_CACHE && r->scope != RESCTRL_L3_CACHE))
+	if (WARN_ON_ONCE(r->ctrl_scope != RESCTRL_L2_CACHE && r->ctrl_scope != RESCTRL_L3_CACHE))
 		return size;
 
 	num_b = bitmap_weight(&cbm, r->cache.cbm_len);
 	ci = get_cpu_cacheinfo(cpumask_any(&d->cpu_mask));
 	for (i = 0; i < ci->num_leaves; i++) {
-		if (ci->info_list[i].level == r->scope) {
+		if (ci->info_list[i].level == r->ctrl_scope) {
 			size = ci->info_list[i].size / r->cache.cbm_len * num_b;
 			break;
 		}
@@ -1477,7 +1477,7 @@ static int rdtgroup_size_show(struct kernfs_open_file *of,
 		type = schema->conf_type;
 		sep = false;
 		seq_printf(s, "%*s:", max_name_width, schema->name);
-		list_for_each_entry(d, &r->domains, hdr.list) {
+		list_for_each_entry(d, &r->ctrl_domains, hdr.list) {
 			if (sep)
 				seq_putc(s, ';');
 			if (rdtgrp->mode == RDT_MODE_PSEUDO_LOCKSETUP) {
@@ -1566,7 +1566,7 @@ static int mbm_config_show(struct seq_file *s, struct rdt_resource *r, u32 evtid
 
 	mutex_lock(&rdtgroup_mutex);
 
-	list_for_each_entry(dom, &r->domains, hdr.list) {
+	list_for_each_entry(dom, &r->mon_domains, hdr.list) {
 		if (sep)
 			seq_puts(s, ";");
 
@@ -1689,7 +1689,7 @@ static int mon_config_write(struct rdt_resource *r, char *tok, u32 evtid)
 		return -EINVAL;
 	}
 
-	list_for_each_entry(d, &r->domains, hdr.list) {
+	list_for_each_entry(d, &r->mon_domains, hdr.list) {
 		if (d->hdr.id == dom_id) {
 			ret = mbm_config_write_domain(r, d, evtid, val);
 			if (ret)
@@ -2232,7 +2232,7 @@ static int set_cache_qos_cfg(int level, bool enable)
 		return -ENOMEM;
 
 	r_l = &rdt_resources_all[level].r_resctrl;
-	list_for_each_entry(d, &r_l->domains, hdr.list) {
+	list_for_each_entry(d, &r_l->ctrl_domains, hdr.list) {
 		if (r_l->cache.arch_has_per_cpu_cfg)
 			/* Pick all the CPUs in the domain instance */
 			for_each_cpu(cpu, &d->cpu_mask)
@@ -2317,7 +2317,7 @@ static int set_mba_sc(bool mba_sc)
 
 	r->membw.mba_sc = mba_sc;
 
-	list_for_each_entry(d, &r->domains, hdr.list) {
+	list_for_each_entry(d, &r->ctrl_domains, hdr.list) {
 		for (i = 0; i < num_closid; i++)
 			d->mbps_val[i] = MBA_MAX_MBPS;
 	}
@@ -2653,7 +2653,7 @@ static int rdt_get_tree(struct fs_context *fc)
 
 	if (is_mbm_enabled()) {
 		r = &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl;
-		list_for_each_entry(dom, &r->domains, hdr.list)
+		list_for_each_entry(dom, &r->mon_domains, hdr.list)
 			mbm_setup_overflow_handler(dom, MBM_OVERFLOW_INTERVAL);
 	}
 
@@ -2777,10 +2777,10 @@ static int reset_all_ctrls(struct rdt_resource *r)
 
 	/*
 	 * Disable resource control for this resource by setting all
-	 * CBMs in all domains to the maximum mask value. Pick one CPU
+	 * CBMs in all ctrl_domains to the maximum mask value. Pick one CPU
 	 * from each domain to update the MSRs below.
 	 */
-	list_for_each_entry(d, &r->domains, hdr.list) {
+	list_for_each_entry(d, &r->ctrl_domains, hdr.list) {
 		hw_dom = resctrl_to_arch_dom(d);
 		cpumask_set_cpu(cpumask_any(&d->cpu_mask), cpu_mask);
 
@@ -3050,7 +3050,7 @@ static int mkdir_mondata_subdir_alldom(struct kernfs_node *parent_kn,
 	struct rdt_domain *dom;
 	int ret;
 
-	list_for_each_entry(dom, &r->domains, hdr.list) {
+	list_for_each_entry(dom, &r->mon_domains, hdr.list) {
 		ret = mkdir_mondata_subdir(parent_kn, dom, r, prgrp);
 		if (ret)
 			return ret;
@@ -3232,7 +3232,7 @@ static int rdtgroup_init_cat(struct resctrl_schema *s, u32 closid)
 	struct rdt_domain *d;
 	int ret;
 
-	list_for_each_entry(d, &s->res->domains, hdr.list) {
+	list_for_each_entry(d, &s->res->ctrl_domains, hdr.list) {
 		ret = __init_one_rdt_domain(d, s, closid);
 		if (ret < 0)
 			return ret;
@@ -3247,7 +3247,7 @@ static void rdtgroup_init_mba(struct rdt_resource *r, u32 closid)
 	struct resctrl_staged_config *cfg;
 	struct rdt_domain *d;
 
-	list_for_each_entry(d, &r->domains, hdr.list) {
+	list_for_each_entry(d, &r->ctrl_domains, hdr.list) {
 		if (is_mba_sc(r)) {
 			d->mbps_val[closid] = MBA_MAX_MBPS;
 			continue;
@@ -3849,15 +3849,17 @@ static void domain_destroy_mon_state(struct rdt_domain *d)
 	kfree(d->mbm_local);
 }
 
-void resctrl_offline_domain(struct rdt_resource *r, struct rdt_domain *d)
+void resctrl_offline_ctrl_domain(struct rdt_resource *r, struct rdt_domain *d)
 {
 	lockdep_assert_held(&rdtgroup_mutex);
 
 	if (supports_mba_mbps() && r->rid == RDT_RESOURCE_MBA)
 		mba_sc_domain_destroy(r, d);
+}
 
-	if (!r->mon_capable)
-		return;
+void resctrl_offline_mon_domain(struct rdt_resource *r, struct rdt_domain *d)
+{
+	lockdep_assert_held(&rdtgroup_mutex);
 
 	/*
 	 * If resctrl is mounted, remove all the
@@ -3914,18 +3916,22 @@ static int domain_setup_mon_state(struct rdt_resource *r, struct rdt_domain *d)
 	return 0;
 }
 
-int resctrl_online_domain(struct rdt_resource *r, struct rdt_domain *d)
+int resctrl_online_ctrl_domain(struct rdt_resource *r, struct rdt_domain *d)
 {
-	int err;
-
 	lockdep_assert_held(&rdtgroup_mutex);
 
 	if (supports_mba_mbps() && r->rid == RDT_RESOURCE_MBA)
 		/* RDT_RESOURCE_MBA is never mon_capable */
 		return mba_sc_domain_allocate(r, d);
 
-	if (!r->mon_capable)
-		return 0;
+	return 0;
+}
+
+int resctrl_online_mon_domain(struct rdt_resource *r, struct rdt_domain *d)
+{
+	int err;
+
+	lockdep_assert_held(&rdtgroup_mutex);
 
 	err = domain_setup_mon_state(r, d);
 	if (err)
-- 
2.41.0


^ permalink raw reply related	[flat|nested] 331+ messages in thread

* [PATCH v9 4/8] x86/resctrl: Split the rdt_domain and rdt_hw_domain structures
  2023-10-20 21:30         ` [PATCH v9 " Tony Luck
                             ` (2 preceding siblings ...)
  2023-10-20 21:30           ` [PATCH v9 3/8] x86/resctrl: Prepare for different scope for control/monitor operations Tony Luck
@ 2023-10-20 21:30           ` Tony Luck
  2023-10-30 21:21             ` Reinette Chatre
  2023-10-20 21:30           ` [PATCH v9 5/8] x86/resctrl: Add node-scope to the options for feature scope Tony Luck
                             ` (5 subsequent siblings)
  9 siblings, 1 reply; 331+ messages in thread
From: Tony Luck @ 2023-10-20 21:30 UTC (permalink / raw)
  To: Fenghua Yu, Reinette Chatre, Peter Newman, Jonathan Corbet,
	Shuah Khan, x86
  Cc: Shaopeng Tan, James Morse, Jamie Iles, Babu Moger, Randy Dunlap,
	linux-kernel, linux-doc, patches, Tony Luck

The same rdt_domain structure is used for both control and monitor
functions. But this results in wasted memory, as some of the fields are
used only by control functions while most are used only by monitor
functions.

Split into separate rdt_ctrl_domain and rdt_mon_domain structures with
just the fields required for control and monitoring respectively.

Similarly, split the rdt_hw_domain structure into rdt_hw_ctrl_domain
and rdt_hw_mon_domain.
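
As a rough illustration (hypothetical helper, not part of this patch),
walkers now pick the list that matches their purpose and see only the
state relevant to it:

        static void walk_domains(struct rdt_resource *r)
        {
                struct rdt_ctrl_domain *cd;
                struct rdt_mon_domain *md;

                /* Control state only: staged_config, mbps_val, plr */
                list_for_each_entry(cd, &r->ctrl_domains, hdr.list)
                        pr_debug("ctrl domain %d\n", cd->hdr.id);

                /* Monitor state only: MBM/CQM counters and workers */
                list_for_each_entry(md, &r->mon_domains, hdr.list)
                        pr_debug("mon domain %d\n", md->hdr.id);
        }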

Reviewed-by: Peter Newman <peternewman@google.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
---
Changes since v6:

Fixed two places where line wrapping inside comments was not
properly aligned.

Added Peter's review tag

 include/linux/resctrl.h                   | 50 +++++++------
 arch/x86/kernel/cpu/resctrl/internal.h    | 60 ++++++++++------
 arch/x86/kernel/cpu/resctrl/core.c        | 87 ++++++++++++-----------
 arch/x86/kernel/cpu/resctrl/ctrlmondata.c | 32 ++++-----
 arch/x86/kernel/cpu/resctrl/monitor.c     | 40 +++++------
 arch/x86/kernel/cpu/resctrl/pseudo_lock.c |  6 +-
 arch/x86/kernel/cpu/resctrl/rdtgroup.c    | 62 ++++++++--------
 7 files changed, 184 insertions(+), 153 deletions(-)

diff --git a/include/linux/resctrl.h b/include/linux/resctrl.h
index 98d917aff075..4778ef71c893 100644
--- a/include/linux/resctrl.h
+++ b/include/linux/resctrl.h
@@ -70,7 +70,25 @@ struct rdt_domain_hdr {
 };
 
 /**
- * struct rdt_domain - group of CPUs sharing a resctrl resource
+ * struct rdt_ctrl_domain - group of CPUs sharing a resctrl control resource
+ * @hdr:		common header for different domain types
+ * @cpu_mask:		which CPUs share this resource
+ * @plr:		pseudo-locked region (if any) associated with domain
+ * @staged_config:	parsed configuration to be applied
+ * @mbps_val:		When mba_sc is enabled, this holds the array of user
+ *			specified control values for mba_sc in MBps, indexed
+ *			by closid
+ */
+struct rdt_ctrl_domain {
+	struct rdt_domain_hdr		hdr;
+	struct cpumask			cpu_mask;
+	struct pseudo_lock_region	*plr;
+	struct resctrl_staged_config	staged_config[CDP_NUM_TYPES];
+	u32				*mbps_val;
+};
+
+/**
+ * struct rdt_mon_domain - group of CPUs sharing a resctrl monitor resource
  * @hdr:		common header for different domain types
  * @cpu_mask:		which CPUs share this resource
  * @rmid_busy_llc:	bitmap of which limbo RMIDs are above threshold
@@ -80,13 +98,8 @@ struct rdt_domain_hdr {
  * @cqm_limbo:		worker to periodically read CQM h/w counters
  * @mbm_work_cpu:	worker CPU for MBM h/w counters
  * @cqm_work_cpu:	worker CPU for CQM h/w counters
- * @plr:		pseudo-locked region (if any) associated with domain
- * @staged_config:	parsed configuration to be applied
- * @mbps_val:		When mba_sc is enabled, this holds the array of user
- *			specified control values for mba_sc in MBps, indexed
- *			by closid
  */
-struct rdt_domain {
+struct rdt_mon_domain {
 	struct rdt_domain_hdr		hdr;
 	struct cpumask			cpu_mask;
 	unsigned long			*rmid_busy_llc;
@@ -96,9 +109,6 @@ struct rdt_domain {
 	struct delayed_work		cqm_limbo;
 	int				mbm_work_cpu;
 	int				cqm_work_cpu;
-	struct pseudo_lock_region	*plr;
-	struct resctrl_staged_config	staged_config[CDP_NUM_TYPES];
-	u32				*mbps_val;
 };
 
 /**
@@ -202,7 +212,7 @@ struct rdt_resource {
 	const char		*format_str;
 	int			(*parse_ctrlval)(struct rdt_parse_data *data,
 						 struct resctrl_schema *s,
-						 struct rdt_domain *d);
+						 struct rdt_ctrl_domain *d);
 	struct list_head	evt_list;
 	unsigned long		fflags;
 	bool			cdp_capable;
@@ -236,15 +246,15 @@ int resctrl_arch_update_domains(struct rdt_resource *r, u32 closid);
  * Update the ctrl_val and apply this config right now.
  * Must be called on one of the domain's CPUs.
  */
-int resctrl_arch_update_one(struct rdt_resource *r, struct rdt_domain *d,
+int resctrl_arch_update_one(struct rdt_resource *r, struct rdt_ctrl_domain *d,
 			    u32 closid, enum resctrl_conf_type t, u32 cfg_val);
 
-u32 resctrl_arch_get_config(struct rdt_resource *r, struct rdt_domain *d,
+u32 resctrl_arch_get_config(struct rdt_resource *r, struct rdt_ctrl_domain *d,
 			    u32 closid, enum resctrl_conf_type type);
-int resctrl_online_ctrl_domain(struct rdt_resource *r, struct rdt_domain *d);
-int resctrl_online_mon_domain(struct rdt_resource *r, struct rdt_domain *d);
-void resctrl_offline_ctrl_domain(struct rdt_resource *r, struct rdt_domain *d);
-void resctrl_offline_mon_domain(struct rdt_resource *r, struct rdt_domain *d);
+int resctrl_online_ctrl_domain(struct rdt_resource *r, struct rdt_ctrl_domain *d);
+int resctrl_online_mon_domain(struct rdt_resource *r, struct rdt_mon_domain *d);
+void resctrl_offline_ctrl_domain(struct rdt_resource *r, struct rdt_ctrl_domain *d);
+void resctrl_offline_mon_domain(struct rdt_resource *r, struct rdt_mon_domain *d);
 
 /**
  * resctrl_arch_rmid_read() - Read the eventid counter corresponding to rmid
@@ -260,7 +270,7 @@ void resctrl_offline_mon_domain(struct rdt_resource *r, struct rdt_domain *d);
  * Return:
  * 0 on success, or -EIO, -EINVAL etc on error.
  */
-int resctrl_arch_rmid_read(struct rdt_resource *r, struct rdt_domain *d,
+int resctrl_arch_rmid_read(struct rdt_resource *r, struct rdt_mon_domain *d,
 			   u32 rmid, enum resctrl_event_id eventid, u64 *val);
 
 /**
@@ -273,7 +283,7 @@ int resctrl_arch_rmid_read(struct rdt_resource *r, struct rdt_domain *d,
  *
  * This can be called from any CPU.
  */
-void resctrl_arch_reset_rmid(struct rdt_resource *r, struct rdt_domain *d,
+void resctrl_arch_reset_rmid(struct rdt_resource *r, struct rdt_mon_domain *d,
 			     u32 rmid, enum resctrl_event_id eventid);
 
 /**
@@ -285,7 +295,7 @@ void resctrl_arch_reset_rmid(struct rdt_resource *r, struct rdt_domain *d,
  *
  * This can be called from any CPU.
  */
-void resctrl_arch_reset_rmid_all(struct rdt_resource *r, struct rdt_domain *d);
+void resctrl_arch_reset_rmid_all(struct rdt_resource *r, struct rdt_mon_domain *d);
 
 extern unsigned int resctrl_rmid_realloc_threshold;
 extern unsigned int resctrl_rmid_realloc_limit;
diff --git a/arch/x86/kernel/cpu/resctrl/internal.h b/arch/x86/kernel/cpu/resctrl/internal.h
index 12f1ea3ba8a1..41a23556f57d 100644
--- a/arch/x86/kernel/cpu/resctrl/internal.h
+++ b/arch/x86/kernel/cpu/resctrl/internal.h
@@ -107,7 +107,7 @@ union mon_data_bits {
 struct rmid_read {
 	struct rdtgroup		*rgrp;
 	struct rdt_resource	*r;
-	struct rdt_domain	*d;
+	struct rdt_mon_domain	*d;
 	enum resctrl_event_id	evtid;
 	bool			first;
 	int			err;
@@ -192,7 +192,7 @@ struct mongroup {
  */
 struct pseudo_lock_region {
 	struct resctrl_schema	*s;
-	struct rdt_domain	*d;
+	struct rdt_ctrl_domain	*d;
 	u32			cbm;
 	wait_queue_head_t	lock_thread_wq;
 	int			thread_done;
@@ -319,25 +319,41 @@ struct arch_mbm_state {
 };
 
 /**
- * struct rdt_hw_domain - Arch private attributes of a set of CPUs that share
- *			  a resource
+ * struct rdt_hw_ctrl_domain - Arch private attributes of a set of CPUs that share
+ *			       a resource for a control function
  * @d_resctrl:	Properties exposed to the resctrl file system
  * @ctrl_val:	array of cache or mem ctrl values (indexed by CLOSID)
+ *
+ * Members of this structure are accessed via helpers that provide abstraction.
+ */
+struct rdt_hw_ctrl_domain {
+	struct rdt_ctrl_domain		d_resctrl;
+	u32				*ctrl_val;
+};
+
+/**
+ * struct rdt_hw_mon_domain - Arch private attributes of a set of CPUs that share
+ *			      a resource for a monitor function
+ * @d_resctrl:	Properties exposed to the resctrl file system
  * @arch_mbm_total:	arch private state for MBM total bandwidth
  * @arch_mbm_local:	arch private state for MBM local bandwidth
  *
  * Members of this structure are accessed via helpers that provide abstraction.
  */
-struct rdt_hw_domain {
-	struct rdt_domain		d_resctrl;
-	u32				*ctrl_val;
+struct rdt_hw_mon_domain {
+	struct rdt_mon_domain		d_resctrl;
 	struct arch_mbm_state		*arch_mbm_total;
 	struct arch_mbm_state		*arch_mbm_local;
 };
 
-static inline struct rdt_hw_domain *resctrl_to_arch_dom(struct rdt_domain *r)
+static inline struct rdt_hw_ctrl_domain *resctrl_to_arch_ctrl_dom(struct rdt_ctrl_domain *r)
+{
+	return container_of(r, struct rdt_hw_ctrl_domain, d_resctrl);
+}
+
+static inline struct rdt_hw_mon_domain *resctrl_to_arch_mon_dom(struct rdt_mon_domain *r)
 {
-	return container_of(r, struct rdt_hw_domain, d_resctrl);
+	return container_of(r, struct rdt_hw_mon_domain, d_resctrl);
 }
 
 /**
@@ -405,7 +421,7 @@ struct rdt_hw_resource {
 	struct rdt_resource	r_resctrl;
 	u32			num_closid;
 	unsigned int		msr_base;
-	void (*msr_update)	(struct rdt_domain *d, struct msr_param *m,
+	void (*msr_update)	(struct rdt_ctrl_domain *d, struct msr_param *m,
 				 struct rdt_resource *r);
 	unsigned int		mon_scale;
 	unsigned int		mbm_width;
@@ -418,9 +434,9 @@ static inline struct rdt_hw_resource *resctrl_to_arch_res(struct rdt_resource *r
 }
 
 int parse_cbm(struct rdt_parse_data *data, struct resctrl_schema *s,
-	      struct rdt_domain *d);
+	      struct rdt_ctrl_domain *d);
 int parse_bw(struct rdt_parse_data *data, struct resctrl_schema *s,
-	     struct rdt_domain *d);
+	     struct rdt_ctrl_domain *d);
 
 extern struct mutex rdtgroup_mutex;
 
@@ -526,21 +542,21 @@ ssize_t rdtgroup_schemata_write(struct kernfs_open_file *of,
 				char *buf, size_t nbytes, loff_t off);
 int rdtgroup_schemata_show(struct kernfs_open_file *of,
 			   struct seq_file *s, void *v);
-bool rdtgroup_cbm_overlaps(struct resctrl_schema *s, struct rdt_domain *d,
+bool rdtgroup_cbm_overlaps(struct resctrl_schema *s, struct rdt_ctrl_domain *d,
 			   unsigned long cbm, int closid, bool exclusive);
-unsigned int rdtgroup_cbm_to_size(struct rdt_resource *r, struct rdt_domain *d,
+unsigned int rdtgroup_cbm_to_size(struct rdt_resource *r, struct rdt_ctrl_domain *d,
 				  unsigned long cbm);
 enum rdtgrp_mode rdtgroup_mode_by_closid(int closid);
 int rdtgroup_tasks_assigned(struct rdtgroup *r);
 int rdtgroup_locksetup_enter(struct rdtgroup *rdtgrp);
 int rdtgroup_locksetup_exit(struct rdtgroup *rdtgrp);
-bool rdtgroup_cbm_overlaps_pseudo_locked(struct rdt_domain *d, unsigned long cbm);
-bool rdtgroup_pseudo_locked_in_hierarchy(struct rdt_domain *d);
+bool rdtgroup_cbm_overlaps_pseudo_locked(struct rdt_ctrl_domain *d, unsigned long cbm);
+bool rdtgroup_pseudo_locked_in_hierarchy(struct rdt_ctrl_domain *d);
 int rdt_pseudo_lock_init(void);
 void rdt_pseudo_lock_release(void);
 int rdtgroup_pseudo_lock_create(struct rdtgroup *rdtgrp);
 void rdtgroup_pseudo_lock_remove(struct rdtgroup *rdtgrp);
-struct rdt_domain *get_domain_from_cpu(int cpu, struct rdt_resource *r);
+struct rdt_ctrl_domain *get_domain_from_cpu(int cpu, struct rdt_resource *r);
 int closids_supported(void);
 void closid_free(int closid);
 int alloc_rmid(void);
@@ -550,17 +566,17 @@ bool __init rdt_cpu_has(int flag);
 void mon_event_count(void *info);
 int rdtgroup_mondata_show(struct seq_file *m, void *arg);
 void mon_event_read(struct rmid_read *rr, struct rdt_resource *r,
-		    struct rdt_domain *d, struct rdtgroup *rdtgrp,
+		    struct rdt_mon_domain *d, struct rdtgroup *rdtgrp,
 		    int evtid, int first);
-void mbm_setup_overflow_handler(struct rdt_domain *dom,
+void mbm_setup_overflow_handler(struct rdt_mon_domain *dom,
 				unsigned long delay_ms);
 void mbm_handle_overflow(struct work_struct *work);
 void __init intel_rdt_mbm_apply_quirk(void);
 bool is_mba_sc(struct rdt_resource *r);
-void cqm_setup_limbo_handler(struct rdt_domain *dom, unsigned long delay_ms);
+void cqm_setup_limbo_handler(struct rdt_mon_domain *dom, unsigned long delay_ms);
 void cqm_handle_limbo(struct work_struct *work);
-bool has_busy_rmid(struct rdt_resource *r, struct rdt_domain *d);
-void __check_limbo(struct rdt_domain *d, bool force_free);
+bool has_busy_rmid(struct rdt_resource *r, struct rdt_mon_domain *d);
+void __check_limbo(struct rdt_mon_domain *d, bool force_free);
 void rdt_domain_reconfigure_cdp(struct rdt_resource *r);
 void __init thread_throttle_mode_init(void);
 void __init mbm_config_rftype_init(const char *config);
diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
index b3b8f936bea5..bcc4bd2e1930 100644
--- a/arch/x86/kernel/cpu/resctrl/core.c
+++ b/arch/x86/kernel/cpu/resctrl/core.c
@@ -49,12 +49,12 @@ int max_name_width, max_data_width;
 bool rdt_alloc_capable;
 
 static void
-mba_wrmsr_intel(struct rdt_domain *d, struct msr_param *m,
+mba_wrmsr_intel(struct rdt_ctrl_domain *d, struct msr_param *m,
 		struct rdt_resource *r);
 static void
-cat_wrmsr(struct rdt_domain *d, struct msr_param *m, struct rdt_resource *r);
+cat_wrmsr(struct rdt_ctrl_domain *d, struct msr_param *m, struct rdt_resource *r);
 static void
-mba_wrmsr_amd(struct rdt_domain *d, struct msr_param *m,
+mba_wrmsr_amd(struct rdt_ctrl_domain *d, struct msr_param *m,
 	      struct rdt_resource *r);
 
 #define ctrl_domain_init(id) LIST_HEAD_INIT(rdt_resources_all[id].r_resctrl.ctrl_domains)
@@ -307,11 +307,11 @@ static void rdt_get_cdp_l2_config(void)
 }
 
 static void
-mba_wrmsr_amd(struct rdt_domain *d, struct msr_param *m, struct rdt_resource *r)
+mba_wrmsr_amd(struct rdt_ctrl_domain *d, struct msr_param *m, struct rdt_resource *r)
 {
-	unsigned int i;
-	struct rdt_hw_domain *hw_dom = resctrl_to_arch_dom(d);
+	struct rdt_hw_ctrl_domain *hw_dom = resctrl_to_arch_ctrl_dom(d);
 	struct rdt_hw_resource *hw_res = resctrl_to_arch_res(r);
+	unsigned int i;
 
 	for (i = m->low; i < m->high; i++)
 		wrmsrl(hw_res->msr_base + i, hw_dom->ctrl_val[i]);
@@ -332,12 +332,12 @@ static u32 delay_bw_map(unsigned long bw, struct rdt_resource *r)
 }
 
 static void
-mba_wrmsr_intel(struct rdt_domain *d, struct msr_param *m,
+mba_wrmsr_intel(struct rdt_ctrl_domain *d, struct msr_param *m,
 		struct rdt_resource *r)
 {
-	unsigned int i;
-	struct rdt_hw_domain *hw_dom = resctrl_to_arch_dom(d);
+	struct rdt_hw_ctrl_domain *hw_dom = resctrl_to_arch_ctrl_dom(d);
 	struct rdt_hw_resource *hw_res = resctrl_to_arch_res(r);
+	unsigned int i;
 
 	/*  Write the delay values for mba. */
 	for (i = m->low; i < m->high; i++)
@@ -345,19 +345,19 @@ mba_wrmsr_intel(struct rdt_domain *d, struct msr_param *m,
 }
 
 static void
-cat_wrmsr(struct rdt_domain *d, struct msr_param *m, struct rdt_resource *r)
+cat_wrmsr(struct rdt_ctrl_domain *d, struct msr_param *m, struct rdt_resource *r)
 {
-	unsigned int i;
-	struct rdt_hw_domain *hw_dom = resctrl_to_arch_dom(d);
+	struct rdt_hw_ctrl_domain *hw_dom = resctrl_to_arch_ctrl_dom(d);
 	struct rdt_hw_resource *hw_res = resctrl_to_arch_res(r);
+	unsigned int i;
 
 	for (i = m->low; i < m->high; i++)
 		wrmsrl(hw_res->msr_base + i, hw_dom->ctrl_val[i]);
 }
 
-struct rdt_domain *get_domain_from_cpu(int cpu, struct rdt_resource *r)
+struct rdt_ctrl_domain *get_domain_from_cpu(int cpu, struct rdt_resource *r)
 {
-	struct rdt_domain *d;
+	struct rdt_ctrl_domain *d;
 
 	list_for_each_entry(d, &r->ctrl_domains, hdr.list) {
 		/* Find the domain that contains this CPU */
@@ -379,7 +379,7 @@ void rdt_ctrl_update(void *arg)
 	struct rdt_hw_resource *hw_res = resctrl_to_arch_res(m->res);
 	struct rdt_resource *r = m->res;
 	int cpu = smp_processor_id();
-	struct rdt_domain *d;
+	struct rdt_ctrl_domain *d;
 
 	d = get_domain_from_cpu(cpu, r);
 	if (d) {
@@ -447,18 +447,23 @@ static void setup_default_ctrlval(struct rdt_resource *r, u32 *dc)
 		*dc = r->default_ctrl;
 }
 
-static void domain_free(struct rdt_hw_domain *hw_dom)
+static void ctrl_domain_free(struct rdt_hw_ctrl_domain *hw_dom)
+{
+	kfree(hw_dom->ctrl_val);
+	kfree(hw_dom);
+}
+
+static void mon_domain_free(struct rdt_hw_mon_domain *hw_dom)
 {
 	kfree(hw_dom->arch_mbm_total);
 	kfree(hw_dom->arch_mbm_local);
-	kfree(hw_dom->ctrl_val);
 	kfree(hw_dom);
 }
 
-static int domain_setup_ctrlval(struct rdt_resource *r, struct rdt_domain *d)
+static int domain_setup_ctrlval(struct rdt_resource *r, struct rdt_ctrl_domain *d)
 {
+	struct rdt_hw_ctrl_domain *hw_dom = resctrl_to_arch_ctrl_dom(d);
 	struct rdt_hw_resource *hw_res = resctrl_to_arch_res(r);
-	struct rdt_hw_domain *hw_dom = resctrl_to_arch_dom(d);
 	struct msr_param m;
 	u32 *dc;
 
@@ -481,7 +486,7 @@ static int domain_setup_ctrlval(struct rdt_resource *r, struct rdt_domain *d)
  * @num_rmid:	The size of the MBM counter array
  * @hw_dom:	The domain that owns the allocated arrays
  */
-static int arch_domain_mbm_alloc(u32 num_rmid, struct rdt_hw_domain *hw_dom)
+static int arch_domain_mbm_alloc(u32 num_rmid, struct rdt_hw_mon_domain *hw_dom)
 {
 	size_t tsize;
 
@@ -520,10 +525,10 @@ static int get_domain_id_from_scope(int cpu, enum resctrl_scope scope)
 static void domain_add_cpu_ctrl(int cpu, struct rdt_resource *r)
 {
 	int id = get_domain_id_from_scope(cpu, r->ctrl_scope);
+	struct rdt_hw_ctrl_domain *hw_dom;
 	struct list_head *add_pos = NULL;
-	struct rdt_hw_domain *hw_dom;
 	struct rdt_domain_hdr *hdr;
-	struct rdt_domain *d;
+	struct rdt_ctrl_domain *d;
 	int err;
 
 	if (id < 0) {
@@ -542,7 +547,7 @@ static void domain_add_cpu_ctrl(int cpu, struct rdt_resource *r)
 		if (WARN_ON_ONCE(hdr->type != RESCTRL_CTRL_DOMAIN))
 			return;
 
-		d = container_of(hdr, struct rdt_domain, hdr);
+		d = container_of(hdr, struct rdt_ctrl_domain, hdr);
 
 		cpumask_set_cpu(cpu, &d->cpu_mask);
 		if (r->cache.arch_has_per_cpu_cfg)
@@ -562,7 +567,7 @@ static void domain_add_cpu_ctrl(int cpu, struct rdt_resource *r)
 	rdt_domain_reconfigure_cdp(r);
 
 	if (domain_setup_ctrlval(r, d)) {
-		domain_free(hw_dom);
+		ctrl_domain_free(hw_dom);
 		return;
 	}
 
@@ -571,17 +576,17 @@ static void domain_add_cpu_ctrl(int cpu, struct rdt_resource *r)
 	err = resctrl_online_ctrl_domain(r, d);
 	if (err) {
 		list_del(&d->hdr.list);
-		domain_free(hw_dom);
+		ctrl_domain_free(hw_dom);
 	}
 }
 
 static void domain_add_cpu_mon(int cpu, struct rdt_resource *r)
 {
 	int id = get_domain_id_from_scope(cpu, r->mon_scope);
+	struct rdt_hw_mon_domain *hw_dom;
 	struct list_head *add_pos = NULL;
-	struct rdt_hw_domain *hw_dom;
 	struct rdt_domain_hdr *hdr;
-	struct rdt_domain *d;
+	struct rdt_mon_domain *d;
 	int err;
 
 	if (id < 0) {
@@ -600,7 +605,7 @@ static void domain_add_cpu_mon(int cpu, struct rdt_resource *r)
 		if (WARN_ON_ONCE(hdr->type != RESCTRL_MON_DOMAIN))
 			return;
 
-		d = container_of(hdr, struct rdt_domain, hdr);
+		d = container_of(hdr, struct rdt_mon_domain, hdr);
 
 		cpumask_set_cpu(cpu, &d->cpu_mask);
 		return;
@@ -616,7 +621,7 @@ static void domain_add_cpu_mon(int cpu, struct rdt_resource *r)
 	cpumask_set_cpu(cpu, &d->cpu_mask);
 
 	if (arch_domain_mbm_alloc(r->num_rmid, hw_dom)) {
-		domain_free(hw_dom);
+		mon_domain_free(hw_dom);
 		return;
 	}
 
@@ -625,7 +630,7 @@ static void domain_add_cpu_mon(int cpu, struct rdt_resource *r)
 	err = resctrl_online_mon_domain(r, d);
 	if (err) {
 		list_del(&d->hdr.list);
-		domain_free(hw_dom);
+		mon_domain_free(hw_dom);
 	}
 }
 
@@ -643,9 +648,9 @@ static void domain_add_cpu(int cpu, struct rdt_resource *r)
 static void domain_remove_cpu_ctrl(int cpu, struct rdt_resource *r)
 {
 	int id = get_domain_id_from_scope(cpu, r->ctrl_scope);
-	struct rdt_hw_domain *hw_dom;
+	struct rdt_hw_ctrl_domain *hw_dom;
 	struct rdt_domain_hdr *hdr;
-	struct rdt_domain *d;
+	struct rdt_ctrl_domain *d;
 
 	if (id < 0)
 		return;
@@ -659,8 +664,8 @@ static void domain_remove_cpu_ctrl(int cpu, struct rdt_resource *r)
 	if (WARN_ON_ONCE(hdr->type != RESCTRL_CTRL_DOMAIN))
 		return;
 
-	d = container_of(hdr, struct rdt_domain, hdr);
-	hw_dom = resctrl_to_arch_dom(d);
+	d = container_of(hdr, struct rdt_ctrl_domain, hdr);
+	hw_dom = resctrl_to_arch_ctrl_dom(d);
 
 	cpumask_clear_cpu(cpu, &d->cpu_mask);
 	if (cpumask_empty(&d->cpu_mask)) {
@@ -668,12 +673,12 @@ static void domain_remove_cpu_ctrl(int cpu, struct rdt_resource *r)
 		list_del(&d->hdr.list);
 
 		/*
-		 * rdt_domain "d" is going to be freed below, so clear
+		 * rdt_ctrl_domain "d" is going to be freed below, so clear
 		 * its pointer from pseudo_lock_region struct.
 		 */
 		if (d->plr)
 			d->plr->d = NULL;
-		domain_free(hw_dom);
+		ctrl_domain_free(hw_dom);
 
 		return;
 	}
@@ -682,9 +687,9 @@ static void domain_remove_cpu_ctrl(int cpu, struct rdt_resource *r)
 static void domain_remove_cpu_mon(int cpu, struct rdt_resource *r)
 {
 	int id = get_domain_id_from_scope(cpu, r->mon_scope);
-	struct rdt_hw_domain *hw_dom;
+	struct rdt_hw_mon_domain *hw_dom;
 	struct rdt_domain_hdr *hdr;
-	struct rdt_domain *d;
+	struct rdt_mon_domain *d;
 
 	if (id < 0)
 		return;
@@ -698,14 +703,14 @@ static void domain_remove_cpu_mon(int cpu, struct rdt_resource *r)
 	if (WARN_ON_ONCE(hdr->type != RESCTRL_MON_DOMAIN))
 		return;
 
-	d = container_of(hdr, struct rdt_domain, hdr);
-	hw_dom = resctrl_to_arch_dom(d);
+	d = container_of(hdr, struct rdt_mon_domain, hdr);
+	hw_dom = resctrl_to_arch_mon_dom(d);
 
 	cpumask_clear_cpu(cpu, &d->cpu_mask);
 	if (cpumask_empty(&d->cpu_mask)) {
 		resctrl_offline_mon_domain(r, d);
 		list_del(&d->hdr.list);
-		domain_free(hw_dom);
+		mon_domain_free(hw_dom);
 
 		return;
 	}
diff --git a/arch/x86/kernel/cpu/resctrl/ctrlmondata.c b/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
index 34b7eb26b06d..1e68e1b91d31 100644
--- a/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
+++ b/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
@@ -58,7 +58,7 @@ static bool bw_validate(char *buf, unsigned long *data, struct rdt_resource *r)
 }
 
 int parse_bw(struct rdt_parse_data *data, struct resctrl_schema *s,
-	     struct rdt_domain *d)
+	     struct rdt_ctrl_domain *d)
 {
 	struct resctrl_staged_config *cfg;
 	u32 closid = data->rdtgrp->closid;
@@ -137,7 +137,7 @@ static bool cbm_validate(char *buf, u32 *data, struct rdt_resource *r)
  * resource type.
  */
 int parse_cbm(struct rdt_parse_data *data, struct resctrl_schema *s,
-	      struct rdt_domain *d)
+	      struct rdt_ctrl_domain *d)
 {
 	struct rdtgroup *rdtgrp = data->rdtgrp;
 	struct resctrl_staged_config *cfg;
@@ -206,8 +206,8 @@ static int parse_line(char *line, struct resctrl_schema *s,
 	struct resctrl_staged_config *cfg;
 	struct rdt_resource *r = s->res;
 	struct rdt_parse_data data;
+	struct rdt_ctrl_domain *d;
 	char *dom = NULL, *id;
-	struct rdt_domain *d;
 	unsigned long dom_id;
 
 	if (rdtgrp->mode == RDT_MODE_PSEUDO_LOCKSETUP &&
@@ -267,11 +267,11 @@ static u32 get_config_index(u32 closid, enum resctrl_conf_type type)
 	}
 }
 
-static bool apply_config(struct rdt_hw_domain *hw_dom,
+static bool apply_config(struct rdt_hw_ctrl_domain *hw_dom,
 			 struct resctrl_staged_config *cfg, u32 idx,
 			 cpumask_var_t cpu_mask)
 {
-	struct rdt_domain *dom = &hw_dom->d_resctrl;
+	struct rdt_ctrl_domain *dom = &hw_dom->d_resctrl;
 
 	if (cfg->new_ctrl != hw_dom->ctrl_val[idx]) {
 		cpumask_set_cpu(cpumask_any(&dom->cpu_mask), cpu_mask);
@@ -283,11 +283,11 @@ static bool apply_config(struct rdt_hw_domain *hw_dom,
 	return false;
 }
 
-int resctrl_arch_update_one(struct rdt_resource *r, struct rdt_domain *d,
+int resctrl_arch_update_one(struct rdt_resource *r, struct rdt_ctrl_domain *d,
 			    u32 closid, enum resctrl_conf_type t, u32 cfg_val)
 {
+	struct rdt_hw_ctrl_domain *hw_dom = resctrl_to_arch_ctrl_dom(d);
 	struct rdt_hw_resource *hw_res = resctrl_to_arch_res(r);
-	struct rdt_hw_domain *hw_dom = resctrl_to_arch_dom(d);
 	u32 idx = get_config_index(closid, t);
 	struct msr_param msr_param;
 
@@ -307,11 +307,11 @@ int resctrl_arch_update_one(struct rdt_resource *r, struct rdt_domain *d,
 int resctrl_arch_update_domains(struct rdt_resource *r, u32 closid)
 {
 	struct resctrl_staged_config *cfg;
-	struct rdt_hw_domain *hw_dom;
+	struct rdt_hw_ctrl_domain *hw_dom;
 	struct msr_param msr_param;
+	struct rdt_ctrl_domain *d;
 	enum resctrl_conf_type t;
 	cpumask_var_t cpu_mask;
-	struct rdt_domain *d;
 	u32 idx;
 
 	if (!zalloc_cpumask_var(&cpu_mask, GFP_KERNEL))
@@ -319,7 +319,7 @@ int resctrl_arch_update_domains(struct rdt_resource *r, u32 closid)
 
 	msr_param.res = NULL;
 	list_for_each_entry(d, &r->ctrl_domains, hdr.list) {
-		hw_dom = resctrl_to_arch_dom(d);
+		hw_dom = resctrl_to_arch_ctrl_dom(d);
 		for (t = 0; t < CDP_NUM_TYPES; t++) {
 			cfg = &hw_dom->d_resctrl.staged_config[t];
 			if (!cfg->have_new_ctrl)
@@ -449,10 +449,10 @@ ssize_t rdtgroup_schemata_write(struct kernfs_open_file *of,
 	return ret ?: nbytes;
 }
 
-u32 resctrl_arch_get_config(struct rdt_resource *r, struct rdt_domain *d,
+u32 resctrl_arch_get_config(struct rdt_resource *r, struct rdt_ctrl_domain *d,
 			    u32 closid, enum resctrl_conf_type type)
 {
-	struct rdt_hw_domain *hw_dom = resctrl_to_arch_dom(d);
+	struct rdt_hw_ctrl_domain *hw_dom = resctrl_to_arch_ctrl_dom(d);
 	u32 idx = get_config_index(closid, type);
 
 	return hw_dom->ctrl_val[idx];
@@ -461,7 +461,7 @@ u32 resctrl_arch_get_config(struct rdt_resource *r, struct rdt_domain *d,
 static void show_doms(struct seq_file *s, struct resctrl_schema *schema, int closid)
 {
 	struct rdt_resource *r = schema->res;
-	struct rdt_domain *dom;
+	struct rdt_ctrl_domain *dom;
 	bool sep = false;
 	u32 ctrl_val;
 
@@ -523,7 +523,7 @@ int rdtgroup_schemata_show(struct kernfs_open_file *of,
 }
 
 void mon_event_read(struct rmid_read *rr, struct rdt_resource *r,
-		    struct rdt_domain *d, struct rdtgroup *rdtgrp,
+		    struct rdt_mon_domain *d, struct rdtgroup *rdtgrp,
 		    int evtid, int first)
 {
 	/*
@@ -543,11 +543,11 @@ int rdtgroup_mondata_show(struct seq_file *m, void *arg)
 {
 	struct kernfs_open_file *of = m->private;
 	struct rdt_domain_hdr *hdr;
+	struct rdt_mon_domain *d;
 	u32 resid, evtid, domid;
 	struct rdtgroup *rdtgrp;
 	struct rdt_resource *r;
 	union mon_data_bits md;
-	struct rdt_domain *d;
 	struct rmid_read rr;
 	int ret = 0;
 
@@ -574,7 +574,7 @@ int rdtgroup_mondata_show(struct seq_file *m, void *arg)
 		goto out;
 	}
 
-	d = container_of(hdr, struct rdt_domain, hdr);
+	d = container_of(hdr, struct rdt_mon_domain, hdr);
 
 	mon_event_read(&rr, r, d, rdtgrp, evtid, false);
 
diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c
index 6d5e2cbdefb5..7f06848fb828 100644
--- a/arch/x86/kernel/cpu/resctrl/monitor.c
+++ b/arch/x86/kernel/cpu/resctrl/monitor.c
@@ -170,7 +170,7 @@ static int __rmid_read(u32 rmid, enum resctrl_event_id eventid, u64 *val)
 	return 0;
 }
 
-static struct arch_mbm_state *get_arch_mbm_state(struct rdt_hw_domain *hw_dom,
+static struct arch_mbm_state *get_arch_mbm_state(struct rdt_hw_mon_domain *hw_dom,
 						 u32 rmid,
 						 enum resctrl_event_id eventid)
 {
@@ -189,10 +189,10 @@ static struct arch_mbm_state *get_arch_mbm_state(struct rdt_hw_domain *hw_dom,
 	return NULL;
 }
 
-void resctrl_arch_reset_rmid(struct rdt_resource *r, struct rdt_domain *d,
+void resctrl_arch_reset_rmid(struct rdt_resource *r, struct rdt_mon_domain *d,
 			     u32 rmid, enum resctrl_event_id eventid)
 {
-	struct rdt_hw_domain *hw_dom = resctrl_to_arch_dom(d);
+	struct rdt_hw_mon_domain *hw_dom = resctrl_to_arch_mon_dom(d);
 	struct arch_mbm_state *am;
 
 	am = get_arch_mbm_state(hw_dom, rmid, eventid);
@@ -208,9 +208,9 @@ void resctrl_arch_reset_rmid(struct rdt_resource *r, struct rdt_domain *d,
  * Assumes that hardware counters are also reset and thus that there is
  * no need to record initial non-zero counts.
  */
-void resctrl_arch_reset_rmid_all(struct rdt_resource *r, struct rdt_domain *d)
+void resctrl_arch_reset_rmid_all(struct rdt_resource *r, struct rdt_mon_domain *d)
 {
-	struct rdt_hw_domain *hw_dom = resctrl_to_arch_dom(d);
+	struct rdt_hw_mon_domain *hw_dom = resctrl_to_arch_mon_dom(d);
 
 	if (is_mbm_total_enabled())
 		memset(hw_dom->arch_mbm_total, 0,
@@ -229,11 +229,11 @@ static u64 mbm_overflow_count(u64 prev_msr, u64 cur_msr, unsigned int width)
 	return chunks >> shift;
 }
 
-int resctrl_arch_rmid_read(struct rdt_resource *r, struct rdt_domain *d,
+int resctrl_arch_rmid_read(struct rdt_resource *r, struct rdt_mon_domain *d,
 			   u32 rmid, enum resctrl_event_id eventid, u64 *val)
 {
+	struct rdt_hw_mon_domain *hw_dom = resctrl_to_arch_mon_dom(d);
 	struct rdt_hw_resource *hw_res = resctrl_to_arch_res(r);
-	struct rdt_hw_domain *hw_dom = resctrl_to_arch_dom(d);
 	struct arch_mbm_state *am;
 	u64 msr_val, chunks;
 	int ret;
@@ -266,7 +266,7 @@ int resctrl_arch_rmid_read(struct rdt_resource *r, struct rdt_domain *d,
  * decrement the count. If the busy count gets to zero on an RMID, we
  * free the RMID
  */
-void __check_limbo(struct rdt_domain *d, bool force_free)
+void __check_limbo(struct rdt_mon_domain *d, bool force_free)
 {
 	struct rdt_resource *r = &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl;
 	struct rmid_entry *entry;
@@ -305,7 +305,7 @@ void __check_limbo(struct rdt_domain *d, bool force_free)
 	}
 }
 
-bool has_busy_rmid(struct rdt_resource *r, struct rdt_domain *d)
+bool has_busy_rmid(struct rdt_resource *r, struct rdt_mon_domain *d)
 {
 	return find_first_bit(d->rmid_busy_llc, r->num_rmid) != r->num_rmid;
 }
@@ -334,7 +334,7 @@ int alloc_rmid(void)
 static void add_rmid_to_limbo(struct rmid_entry *entry)
 {
 	struct rdt_resource *r = &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl;
-	struct rdt_domain *d;
+	struct rdt_mon_domain *d;
 	int cpu, err;
 	u64 val = 0;
 
@@ -383,7 +383,7 @@ void free_rmid(u32 rmid)
 		list_add_tail(&entry->list, &rmid_free_lru);
 }
 
-static struct mbm_state *get_mbm_state(struct rdt_domain *d, u32 rmid,
+static struct mbm_state *get_mbm_state(struct rdt_mon_domain *d, u32 rmid,
 				       enum resctrl_event_id evtid)
 {
 	switch (evtid) {
@@ -516,13 +516,13 @@ void mon_event_count(void *info)
  * throttle MSRs already have low percentage values.  To avoid
  * unnecessarily restricting such rdtgroups, we also increase the bandwidth.
  */
-static void update_mba_bw(struct rdtgroup *rgrp, struct rdt_domain *dom_mbm)
+static void update_mba_bw(struct rdtgroup *rgrp, struct rdt_mon_domain *dom_mbm)
 {
 	u32 closid, rmid, cur_msr_val, new_msr_val;
 	struct mbm_state *pmbm_data, *cmbm_data;
+	struct rdt_ctrl_domain *dom_mba;
 	u32 cur_bw, delta_bw, user_bw;
 	struct rdt_resource *r_mba;
-	struct rdt_domain *dom_mba;
 	struct list_head *head;
 	struct rdtgroup *entry;
 
@@ -600,7 +600,7 @@ static void update_mba_bw(struct rdtgroup *rgrp, struct rdt_domain *dom_mbm)
 	}
 }
 
-static void mbm_update(struct rdt_resource *r, struct rdt_domain *d, int rmid)
+static void mbm_update(struct rdt_resource *r, struct rdt_mon_domain *d, int rmid)
 {
 	struct rmid_read rr;
 
@@ -640,13 +640,13 @@ void cqm_handle_limbo(struct work_struct *work)
 {
 	unsigned long delay = msecs_to_jiffies(CQM_LIMBOCHECK_INTERVAL);
 	int cpu = smp_processor_id();
+	struct rdt_mon_domain *d;
 	struct rdt_resource *r;
-	struct rdt_domain *d;
 
 	mutex_lock(&rdtgroup_mutex);
 
 	r = &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl;
-	d = container_of(work, struct rdt_domain, cqm_limbo.work);
+	d = container_of(work, struct rdt_mon_domain, cqm_limbo.work);
 
 	__check_limbo(d, false);
 
@@ -656,7 +656,7 @@ void cqm_handle_limbo(struct work_struct *work)
 	mutex_unlock(&rdtgroup_mutex);
 }
 
-void cqm_setup_limbo_handler(struct rdt_domain *dom, unsigned long delay_ms)
+void cqm_setup_limbo_handler(struct rdt_mon_domain *dom, unsigned long delay_ms)
 {
 	unsigned long delay = msecs_to_jiffies(delay_ms);
 	int cpu;
@@ -672,9 +672,9 @@ void mbm_handle_overflow(struct work_struct *work)
 	unsigned long delay = msecs_to_jiffies(MBM_OVERFLOW_INTERVAL);
 	struct rdtgroup *prgrp, *crgrp;
 	int cpu = smp_processor_id();
+	struct rdt_mon_domain *d;
 	struct list_head *head;
 	struct rdt_resource *r;
-	struct rdt_domain *d;
 
 	mutex_lock(&rdtgroup_mutex);
 
@@ -682,7 +682,7 @@ void mbm_handle_overflow(struct work_struct *work)
 		goto out_unlock;
 
 	r = &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl;
-	d = container_of(work, struct rdt_domain, mbm_over.work);
+	d = container_of(work, struct rdt_mon_domain, mbm_over.work);
 
 	list_for_each_entry(prgrp, &rdt_all_groups, rdtgroup_list) {
 		mbm_update(r, d, prgrp->mon.rmid);
@@ -701,7 +701,7 @@ void mbm_handle_overflow(struct work_struct *work)
 	mutex_unlock(&rdtgroup_mutex);
 }
 
-void mbm_setup_overflow_handler(struct rdt_domain *dom, unsigned long delay_ms)
+void mbm_setup_overflow_handler(struct rdt_mon_domain *dom, unsigned long delay_ms)
 {
 	unsigned long delay = msecs_to_jiffies(delay_ms);
 	int cpu;
diff --git a/arch/x86/kernel/cpu/resctrl/pseudo_lock.c b/arch/x86/kernel/cpu/resctrl/pseudo_lock.c
index bda32b4e1c1e..675e9e47af54 100644
--- a/arch/x86/kernel/cpu/resctrl/pseudo_lock.c
+++ b/arch/x86/kernel/cpu/resctrl/pseudo_lock.c
@@ -814,7 +814,7 @@ int rdtgroup_locksetup_exit(struct rdtgroup *rdtgrp)
  * Return: true if @cbm overlaps with pseudo-locked region on @d, false
  * otherwise.
  */
-bool rdtgroup_cbm_overlaps_pseudo_locked(struct rdt_domain *d, unsigned long cbm)
+bool rdtgroup_cbm_overlaps_pseudo_locked(struct rdt_ctrl_domain *d, unsigned long cbm)
 {
 	unsigned int cbm_len;
 	unsigned long cbm_b;
@@ -841,11 +841,11 @@ bool rdtgroup_cbm_overlaps_pseudo_locked(struct rdt_domain *d, unsigned long cbm
  *         if it is not possible to test due to memory allocation issue,
  *         false otherwise.
  */
-bool rdtgroup_pseudo_locked_in_hierarchy(struct rdt_domain *d)
+bool rdtgroup_pseudo_locked_in_hierarchy(struct rdt_ctrl_domain *d)
 {
+	struct rdt_ctrl_domain *d_i;
 	cpumask_var_t cpu_with_psl;
 	struct rdt_resource *r;
-	struct rdt_domain *d_i;
 	bool ret = false;
 
 	if (!zalloc_cpumask_var(&cpu_with_psl, GFP_KERNEL))
diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
index 9df8f02ecb63..46c6d6807bad 100644
--- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
+++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
@@ -85,8 +85,8 @@ void rdt_last_cmd_printf(const char *fmt, ...)
 
 void rdt_staged_configs_clear(void)
 {
+	struct rdt_ctrl_domain *dom;
 	struct rdt_resource *r;
-	struct rdt_domain *dom;
 
 	lockdep_assert_held(&rdtgroup_mutex);
 
@@ -976,7 +976,7 @@ static int rdt_bit_usage_show(struct kernfs_open_file *of,
 	unsigned long sw_shareable = 0, hw_shareable = 0;
 	unsigned long exclusive = 0, pseudo_locked = 0;
 	struct rdt_resource *r = s->res;
-	struct rdt_domain *dom;
+	struct rdt_ctrl_domain *dom;
 	int i, hwb, swb, excl, psl;
 	enum rdtgrp_mode mode;
 	bool sep = false;
@@ -1205,7 +1205,7 @@ static int rdt_has_sparse_bitmasks_show(struct kernfs_open_file *of,
  *
  * Return: false if CBM does not overlap, true if it does.
  */
-static bool __rdtgroup_cbm_overlaps(struct rdt_resource *r, struct rdt_domain *d,
+static bool __rdtgroup_cbm_overlaps(struct rdt_resource *r, struct rdt_ctrl_domain *d,
 				    unsigned long cbm, int closid,
 				    enum resctrl_conf_type type, bool exclusive)
 {
@@ -1260,7 +1260,7 @@ static bool __rdtgroup_cbm_overlaps(struct rdt_resource *r, struct rdt_domain *d
  *
  * Return: true if CBM overlap detected, false if there is no overlap
  */
-bool rdtgroup_cbm_overlaps(struct resctrl_schema *s, struct rdt_domain *d,
+bool rdtgroup_cbm_overlaps(struct resctrl_schema *s, struct rdt_ctrl_domain *d,
 			   unsigned long cbm, int closid, bool exclusive)
 {
 	enum resctrl_conf_type peer_type = resctrl_peer_type(s->conf_type);
@@ -1291,10 +1291,10 @@ bool rdtgroup_cbm_overlaps(struct resctrl_schema *s, struct rdt_domain *d,
 static bool rdtgroup_mode_test_exclusive(struct rdtgroup *rdtgrp)
 {
 	int closid = rdtgrp->closid;
+	struct rdt_ctrl_domain *d;
 	struct resctrl_schema *s;
 	struct rdt_resource *r;
 	bool has_cache = false;
-	struct rdt_domain *d;
 	u32 ctrl;
 
 	list_for_each_entry(s, &resctrl_schema_all, list) {
@@ -1407,7 +1407,7 @@ static ssize_t rdtgroup_mode_write(struct kernfs_open_file *of,
  * bitmap functions work correctly.
  */
 unsigned int rdtgroup_cbm_to_size(struct rdt_resource *r,
-				  struct rdt_domain *d, unsigned long cbm)
+				  struct rdt_ctrl_domain *d, unsigned long cbm)
 {
 	struct cpu_cacheinfo *ci;
 	unsigned int size = 0;
@@ -1439,9 +1439,9 @@ static int rdtgroup_size_show(struct kernfs_open_file *of,
 {
 	struct resctrl_schema *schema;
 	enum resctrl_conf_type type;
+	struct rdt_ctrl_domain *d;
 	struct rdtgroup *rdtgrp;
 	struct rdt_resource *r;
-	struct rdt_domain *d;
 	unsigned int size;
 	int ret = 0;
 	u32 closid;
@@ -1553,7 +1553,7 @@ static void mon_event_config_read(void *info)
 	mon_info->mon_config = msrval & MAX_EVT_CONFIG_BITS;
 }
 
-static void mondata_config_read(struct rdt_domain *d, struct mon_config_info *mon_info)
+static void mondata_config_read(struct rdt_mon_domain *d, struct mon_config_info *mon_info)
 {
 	smp_call_function_any(&d->cpu_mask, mon_event_config_read, mon_info, 1);
 }
@@ -1561,7 +1561,7 @@ static void mondata_config_read(struct rdt_domain *d, struct mon_config_info *mo
 static int mbm_config_show(struct seq_file *s, struct rdt_resource *r, u32 evtid)
 {
 	struct mon_config_info mon_info = {0};
-	struct rdt_domain *dom;
+	struct rdt_mon_domain *dom;
 	bool sep = false;
 
 	mutex_lock(&rdtgroup_mutex);
@@ -1618,7 +1618,7 @@ static void mon_event_config_write(void *info)
 }
 
 static int mbm_config_write_domain(struct rdt_resource *r,
-				   struct rdt_domain *d, u32 evtid, u32 val)
+				   struct rdt_mon_domain *d, u32 evtid, u32 val)
 {
 	struct mon_config_info mon_info = {0};
 	int ret = 0;
@@ -1668,7 +1668,7 @@ static int mon_config_write(struct rdt_resource *r, char *tok, u32 evtid)
 {
 	char *dom_str = NULL, *id_str;
 	unsigned long dom_id, val;
-	struct rdt_domain *d;
+	struct rdt_mon_domain *d;
 	int ret = 0;
 
 next:
@@ -2216,9 +2216,9 @@ static inline bool is_mba_linear(void)
 static int set_cache_qos_cfg(int level, bool enable)
 {
 	void (*update)(void *arg);
+	struct rdt_ctrl_domain *d;
 	struct rdt_resource *r_l;
 	cpumask_var_t cpu_mask;
-	struct rdt_domain *d;
 	int cpu;
 
 	if (level == RDT_RESOURCE_L3)
@@ -2265,7 +2265,7 @@ void rdt_domain_reconfigure_cdp(struct rdt_resource *r)
 		l3_qos_cfg_update(&hw_res->cdp_enabled);
 }
 
-static int mba_sc_domain_allocate(struct rdt_resource *r, struct rdt_domain *d)
+static int mba_sc_domain_allocate(struct rdt_resource *r, struct rdt_ctrl_domain *d)
 {
 	u32 num_closid = resctrl_arch_get_num_closid(r);
 	int cpu = cpumask_any(&d->cpu_mask);
@@ -2283,7 +2283,7 @@ static int mba_sc_domain_allocate(struct rdt_resource *r, struct rdt_domain *d)
 }
 
 static void mba_sc_domain_destroy(struct rdt_resource *r,
-				  struct rdt_domain *d)
+				  struct rdt_ctrl_domain *d)
 {
 	kfree(d->mbps_val);
 	d->mbps_val = NULL;
@@ -2309,7 +2309,7 @@ static int set_mba_sc(bool mba_sc)
 {
 	struct rdt_resource *r = &rdt_resources_all[RDT_RESOURCE_MBA].r_resctrl;
 	u32 num_closid = resctrl_arch_get_num_closid(r);
-	struct rdt_domain *d;
+	struct rdt_ctrl_domain *d;
 	int i;
 
 	if (!supports_mba_mbps() || mba_sc == is_mba_sc(r))
@@ -2578,7 +2578,7 @@ static int rdt_get_tree(struct fs_context *fc)
 {
 	struct rdt_fs_context *ctx = rdt_fc2context(fc);
 	unsigned long flags = RFTYPE_CTRL_BASE;
-	struct rdt_domain *dom;
+	struct rdt_mon_domain *dom;
 	struct rdt_resource *r;
 	int ret;
 
@@ -2762,10 +2762,10 @@ static int rdt_init_fs_context(struct fs_context *fc)
 static int reset_all_ctrls(struct rdt_resource *r)
 {
 	struct rdt_hw_resource *hw_res = resctrl_to_arch_res(r);
-	struct rdt_hw_domain *hw_dom;
+	struct rdt_hw_ctrl_domain *hw_dom;
 	struct msr_param msr_param;
+	struct rdt_ctrl_domain *d;
 	cpumask_var_t cpu_mask;
-	struct rdt_domain *d;
 	int i;
 
 	if (!zalloc_cpumask_var(&cpu_mask, GFP_KERNEL))
@@ -2781,7 +2781,7 @@ static int reset_all_ctrls(struct rdt_resource *r)
 	 * from each domain to update the MSRs below.
 	 */
 	list_for_each_entry(d, &r->ctrl_domains, hdr.list) {
-		hw_dom = resctrl_to_arch_dom(d);
+		hw_dom = resctrl_to_arch_ctrl_dom(d);
 		cpumask_set_cpu(cpumask_any(&d->cpu_mask), cpu_mask);
 
 		for (i = 0; i < hw_res->num_closid; i++)
@@ -2976,7 +2976,7 @@ static void rmdir_mondata_subdir_allrdtgrp(struct rdt_resource *r,
 }
 
 static int mkdir_mondata_subdir(struct kernfs_node *parent_kn,
-				struct rdt_domain *d,
+				struct rdt_mon_domain *d,
 				struct rdt_resource *r, struct rdtgroup *prgrp)
 {
 	union mon_data_bits priv;
@@ -3025,7 +3025,7 @@ static int mkdir_mondata_subdir(struct kernfs_node *parent_kn,
  * and "monitor" groups with given domain id.
  */
 static void mkdir_mondata_subdir_allrdtgrp(struct rdt_resource *r,
-					   struct rdt_domain *d)
+					   struct rdt_mon_domain *d)
 {
 	struct kernfs_node *parent_kn;
 	struct rdtgroup *prgrp, *crgrp;
@@ -3047,7 +3047,7 @@ static int mkdir_mondata_subdir_alldom(struct kernfs_node *parent_kn,
 				       struct rdt_resource *r,
 				       struct rdtgroup *prgrp)
 {
-	struct rdt_domain *dom;
+	struct rdt_mon_domain *dom;
 	int ret;
 
 	list_for_each_entry(dom, &r->mon_domains, hdr.list) {
@@ -3149,7 +3149,7 @@ static u32 cbm_ensure_valid(u32 _val, struct rdt_resource *r)
  * Set the RDT domain up to start off with all usable allocations. That is,
  * all shareable and unused bits. All-zero CBM is invalid.
  */
-static int __init_one_rdt_domain(struct rdt_domain *d, struct resctrl_schema *s,
+static int __init_one_rdt_domain(struct rdt_ctrl_domain *d, struct resctrl_schema *s,
 				 u32 closid)
 {
 	enum resctrl_conf_type peer_type = resctrl_peer_type(s->conf_type);
@@ -3229,7 +3229,7 @@ static int __init_one_rdt_domain(struct rdt_domain *d, struct resctrl_schema *s,
  */
 static int rdtgroup_init_cat(struct resctrl_schema *s, u32 closid)
 {
-	struct rdt_domain *d;
+	struct rdt_ctrl_domain *d;
 	int ret;
 
 	list_for_each_entry(d, &s->res->ctrl_domains, hdr.list) {
@@ -3245,7 +3245,7 @@ static int rdtgroup_init_cat(struct resctrl_schema *s, u32 closid)
 static void rdtgroup_init_mba(struct rdt_resource *r, u32 closid)
 {
 	struct resctrl_staged_config *cfg;
-	struct rdt_domain *d;
+	struct rdt_ctrl_domain *d;
 
 	list_for_each_entry(d, &r->ctrl_domains, hdr.list) {
 		if (is_mba_sc(r)) {
@@ -3842,14 +3842,14 @@ static void __init rdtgroup_setup_default(void)
 	mutex_unlock(&rdtgroup_mutex);
 }
 
-static void domain_destroy_mon_state(struct rdt_domain *d)
+static void domain_destroy_mon_state(struct rdt_mon_domain *d)
 {
 	bitmap_free(d->rmid_busy_llc);
 	kfree(d->mbm_total);
 	kfree(d->mbm_local);
 }
 
-void resctrl_offline_ctrl_domain(struct rdt_resource *r, struct rdt_domain *d)
+void resctrl_offline_ctrl_domain(struct rdt_resource *r, struct rdt_ctrl_domain *d)
 {
 	lockdep_assert_held(&rdtgroup_mutex);
 
@@ -3857,7 +3857,7 @@ void resctrl_offline_ctrl_domain(struct rdt_resource *r, struct rdt_domain *d)
 		mba_sc_domain_destroy(r, d);
 }
 
-void resctrl_offline_mon_domain(struct rdt_resource *r, struct rdt_domain *d)
+void resctrl_offline_mon_domain(struct rdt_resource *r, struct rdt_mon_domain *d)
 {
 	lockdep_assert_held(&rdtgroup_mutex);
 
@@ -3886,7 +3886,7 @@ void resctrl_offline_mon_domain(struct rdt_resource *r, struct rdt_domain *d)
 	domain_destroy_mon_state(d);
 }
 
-static int domain_setup_mon_state(struct rdt_resource *r, struct rdt_domain *d)
+static int domain_setup_mon_state(struct rdt_resource *r, struct rdt_mon_domain *d)
 {
 	size_t tsize;
 
@@ -3916,7 +3916,7 @@ static int domain_setup_mon_state(struct rdt_resource *r, struct rdt_domain *d)
 	return 0;
 }
 
-int resctrl_online_ctrl_domain(struct rdt_resource *r, struct rdt_domain *d)
+int resctrl_online_ctrl_domain(struct rdt_resource *r, struct rdt_ctrl_domain *d)
 {
 	lockdep_assert_held(&rdtgroup_mutex);
 
@@ -3927,7 +3927,7 @@ int resctrl_online_ctrl_domain(struct rdt_resource *r, struct rdt_domain *d)
 	return 0;
 }
 
-int resctrl_online_mon_domain(struct rdt_resource *r, struct rdt_domain *d)
+int resctrl_online_mon_domain(struct rdt_resource *r, struct rdt_mon_domain *d)
 {
 	int err;
 
-- 
2.41.0


^ permalink raw reply related	[flat|nested] 331+ messages in thread

* [PATCH v9 5/8] x86/resctrl: Add node-scope to the options for feature scope
  2023-10-20 21:30         ` [PATCH v9 " Tony Luck
                             ` (3 preceding siblings ...)
  2023-10-20 21:30           ` [PATCH v9 4/8] x86/resctrl: Split the rdt_domain and rdt_hw_domain structures Tony Luck
@ 2023-10-20 21:30           ` Tony Luck
  2023-10-20 21:30           ` [PATCH v9 6/8] x86/resctrl: Introduce snc_nodes_per_l3_cache Tony Luck
                             ` (4 subsequent siblings)
  9 siblings, 0 replies; 331+ messages in thread
From: Tony Luck @ 2023-10-20 21:30 UTC (permalink / raw)
  To: Fenghua Yu, Reinette Chatre, Peter Newman, Jonathan Corbet,
	Shuah Khan, x86
  Cc: Shaopeng Tan, James Morse, Jamie Iles, Babu Moger, Randy Dunlap,
	linux-kernel, linux-doc, patches, Tony Luck

Currently supported resctrl features are all scoped at the same
granularity as the L2 or L3 caches.

Add RESCTRL_NODE as a new option for features that are scoped at the
same granularity as NUMA nodes. This is needed for Intel's Sub-NUMA
Cluster (SNC) feature where monitoring features are node scoped.
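
For context, a later patch in this series applies the new scope to L3
monitoring once SNC is detected. A minimal sketch of that use (simplified;
the real assignment lives in the SNC detection code added later):

	if (snc_nodes_per_l3_cache > 1)
		rdt_resources_all[RDT_RESOURCE_L3].r_resctrl.mon_scope = RESCTRL_NODE;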

Reviewed-by: Peter Newman <peternewman@google.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
---
No changes since v6 except to add Peter's review tag

 include/linux/resctrl.h            | 1 +
 arch/x86/kernel/cpu/resctrl/core.c | 2 ++
 2 files changed, 3 insertions(+)

diff --git a/include/linux/resctrl.h b/include/linux/resctrl.h
index 4778ef71c893..683706355810 100644
--- a/include/linux/resctrl.h
+++ b/include/linux/resctrl.h
@@ -172,6 +172,7 @@ struct resctrl_schema;
 enum resctrl_scope {
 	RESCTRL_L2_CACHE = 2,
 	RESCTRL_L3_CACHE = 3,
+	RESCTRL_NODE,
 };
 
 /**
diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
index bcc4bd2e1930..2c3975c9c20c 100644
--- a/arch/x86/kernel/cpu/resctrl/core.c
+++ b/arch/x86/kernel/cpu/resctrl/core.c
@@ -515,6 +515,8 @@ static int get_domain_id_from_scope(int cpu, enum resctrl_scope scope)
 	case RESCTRL_L2_CACHE:
 	case RESCTRL_L3_CACHE:
 		return get_cpu_cacheinfo_id(cpu, scope);
+	case RESCTRL_NODE:
+		return cpu_to_node(cpu);
 	default:
 		break;
 	}
-- 
2.41.0


^ permalink raw reply related	[flat|nested] 331+ messages in thread

* [PATCH v9 6/8] x86/resctrl: Introduce snc_nodes_per_l3_cache
  2023-10-20 21:30         ` [PATCH v9 " Tony Luck
                             ` (4 preceding siblings ...)
  2023-10-20 21:30           ` [PATCH v9 5/8] x86/resctrl: Add node-scope to the options for feature scope Tony Luck
@ 2023-10-20 21:30           ` Tony Luck
  2023-10-30 21:22             ` Reinette Chatre
  2023-10-20 21:30           ` [PATCH v9 7/8] x86/resctrl: Sub NUMA Cluster detection and enable Tony Luck
                             ` (3 subsequent siblings)
  9 siblings, 1 reply; 331+ messages in thread
From: Tony Luck @ 2023-10-20 21:30 UTC (permalink / raw)
  To: Fenghua Yu, Reinette Chatre, Peter Newman, Jonathan Corbet,
	Shuah Khan, x86
  Cc: Shaopeng Tan, James Morse, Jamie Iles, Babu Moger, Randy Dunlap,
	linux-kernel, linux-doc, patches, Tony Luck

Intel Sub-NUMA Cluster (SNC) is a feature that subdivides the CPU cores
and memory controllers on a socket into two or more groups. These are
presented to the operating system as NUMA nodes.

This may enable some workloads to have slightly lower latency to memory
as the memory controller(s) in an SNC node are electrically closer to the
CPU cores on that SNC node. This cost may be offset by lower bandwidth
since the memory accesses for each core can only be interleaved between
the memory controllers on the same SNC node.

Resctrl monitoring on Intel systems depends upon attaching RMIDs to tasks
to track L3 cache occupancy and memory bandwidth. There is an MSR that
controls how the RMIDs are shared between SNC nodes.

The default mode divides them numerically. E.g. when there are two SNC
nodes on a socket, the lower-numbered half of the RMIDs are given to the
first node, the remainder to the second node. This would be difficult
to use with the Linux resctrl interface as specific RMID values assigned
to resctrl groups are not visible to users.

The other mode divides the RMIDs and renumbers the ones on the second
SNC node to start from zero.

Even with this renumbering, SNC mode requires several changes in resctrl
behavior for correct operation.

Add a global integer "snc_nodes_per_l3_cache" that will show how many
SNC nodes share each L3 cache. When this is "1", SNC mode is either
not implemented or not enabled.

A later patch will detect SNC mode and set snc_nodes_per_l3_cache to
the appropriate value. For now it remains at the default "1" to
indicate SNC mode is not active.

Code that needs to take action when SNC is enabled is:
1) The number of logical RMIDs per L3 cache available for use is the
   number of physical RMIDs divided by the number of SNC nodes.
2) Likewise the "mon_scale" value must be adjusted for the number
   of SNC nodes.
3) The RMID renumbering operates when using the value from the
   IA32_PQR_ASSOC MSR to count accesses by a task. When reading an RMID
   counter, code must adjust from the logical RMID used to the physical
   RMID value for the SNC node that it wishes to read and load the
   adjusted value into the IA32_QM_EVTSEL MSR.
4) The L3 cache is divided between the SNC nodes. So the value
   reported in the resctrl "size" file is adjusted.
5) The "-o mba_MBps" mount option must be disabled in SNC mode
   because the monitoring is being done per SNC node, while the
   bandwidth allocation is still done at the L3 cache scope.
   Trying to use this feedback loop might result in contradictory
   changes to the throttling level coming from each of the SNC
   node bandwidth measurements.
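
As an illustration of point 3 (the numbers are invented for the example
and do not describe any particular CPU): with two SNC nodes per L3 cache
and 128 logical RMIDs per node, a CPU on the second SNC node of a socket
reads logical RMID 5 by programming physical RMID 5 + 128 = 133. A sketch
of the adjustment made in __rmid_read():

	rmid_offset = (cpu_to_node(cpu) % snc_nodes_per_l3_cache) * r->num_rmid;
	wrmsr(MSR_IA32_QM_EVTSEL, eventid, rmid + rmid_offset);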

Reviewed-by: Peter Newman <peternewman@google.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
---
Changes since v6:

In commit comment s/redumbering/renumbering/

Move check that SNC is not enabled into supports_mba_mbps().

Add Peter's review tag.

 arch/x86/kernel/cpu/resctrl/internal.h |  2 ++
 arch/x86/kernel/cpu/resctrl/core.c     |  6 ++++++
 arch/x86/kernel/cpu/resctrl/monitor.c  | 16 +++++++++++++---
 arch/x86/kernel/cpu/resctrl/rdtgroup.c |  5 +++--
 4 files changed, 24 insertions(+), 5 deletions(-)

diff --git a/arch/x86/kernel/cpu/resctrl/internal.h b/arch/x86/kernel/cpu/resctrl/internal.h
index 41a23556f57d..563e6203321e 100644
--- a/arch/x86/kernel/cpu/resctrl/internal.h
+++ b/arch/x86/kernel/cpu/resctrl/internal.h
@@ -446,6 +446,8 @@ DECLARE_STATIC_KEY_FALSE(rdt_alloc_enable_key);
 
 extern struct dentry *debugfs_resctrl;
 
+extern int snc_nodes_per_l3_cache;
+
 enum resctrl_res_level {
 	RDT_RESOURCE_L3,
 	RDT_RESOURCE_L2,
diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
index 2c3975c9c20c..0e418dd14070 100644
--- a/arch/x86/kernel/cpu/resctrl/core.c
+++ b/arch/x86/kernel/cpu/resctrl/core.c
@@ -48,6 +48,12 @@ int max_name_width, max_data_width;
  */
 bool rdt_alloc_capable;
 
+/*
+ * Number of SNC nodes that share each L3 cache.  Default is 1 for
+ * systems that do not support SNC, or have SNC disabled.
+ */
+int snc_nodes_per_l3_cache = 1;
+
 static void
 mba_wrmsr_intel(struct rdt_ctrl_domain *d, struct msr_param *m,
 		struct rdt_resource *r);
diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c
index 7f06848fb828..9122c9a725e2 100644
--- a/arch/x86/kernel/cpu/resctrl/monitor.c
+++ b/arch/x86/kernel/cpu/resctrl/monitor.c
@@ -148,8 +148,18 @@ static inline struct rmid_entry *__rmid_entry(u32 rmid)
 
 static int __rmid_read(u32 rmid, enum resctrl_event_id eventid, u64 *val)
 {
+	struct rdt_resource *r = &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl;
+	int cpu = smp_processor_id();
+	int rmid_offset = 0;
 	u64 msr_val;
 
+	/*
+	 * When SNC mode is on, need to compute the offset to read the
+	 * physical RMID counter for the node to which this CPU belongs.
+	 */
+	if (snc_nodes_per_l3_cache > 1)
+		rmid_offset = (cpu_to_node(cpu) % snc_nodes_per_l3_cache) * r->num_rmid;
+
 	/*
 	 * As per the SDM, when IA32_QM_EVTSEL.EvtID (bits 7:0) is configured
 	 * with a valid event code for supported resource type and the bits
@@ -158,7 +168,7 @@ static int __rmid_read(u32 rmid, enum resctrl_event_id eventid, u64 *val)
 	 * IA32_QM_CTR.Error (bit 63) and IA32_QM_CTR.Unavailable (bit 62)
 	 * are error bits.
 	 */
-	wrmsr(MSR_IA32_QM_EVTSEL, eventid, rmid);
+	wrmsr(MSR_IA32_QM_EVTSEL, eventid, rmid + rmid_offset);
 	rdmsrl(MSR_IA32_QM_CTR, msr_val);
 
 	if (msr_val & RMID_VAL_ERROR)
@@ -783,8 +793,8 @@ int __init rdt_get_mon_l3_config(struct rdt_resource *r)
 	int ret;
 
 	resctrl_rmid_realloc_limit = boot_cpu_data.x86_cache_size * 1024;
-	hw_res->mon_scale = boot_cpu_data.x86_cache_occ_scale;
-	r->num_rmid = boot_cpu_data.x86_cache_max_rmid + 1;
+	hw_res->mon_scale = boot_cpu_data.x86_cache_occ_scale / snc_nodes_per_l3_cache;
+	r->num_rmid = (boot_cpu_data.x86_cache_max_rmid + 1) / snc_nodes_per_l3_cache;
 	hw_res->mbm_width = MBM_CNTR_WIDTH_BASE;
 
 	if (mbm_offset > 0 && mbm_offset <= MBM_CNTR_WIDTH_OFFSET_MAX)
diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
index 46c6d6807bad..d2aae0ca3c40 100644
--- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
+++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
@@ -1425,7 +1425,7 @@ unsigned int rdtgroup_cbm_to_size(struct rdt_resource *r,
 		}
 	}
 
-	return size;
+	return size / snc_nodes_per_l3_cache;
 }
 
 /*
@@ -2298,7 +2298,8 @@ static bool supports_mba_mbps(void)
 	struct rdt_resource *r = &rdt_resources_all[RDT_RESOURCE_MBA].r_resctrl;
 
 	return (is_mbm_local_enabled() &&
-		r->alloc_capable && is_mba_linear());
+		r->alloc_capable && is_mba_linear() &&
+		snc_nodes_per_l3_cache == 1);
 }
 
 /*
-- 
2.41.0


^ permalink raw reply related	[flat|nested] 331+ messages in thread

* [PATCH v9 7/8] x86/resctrl: Sub NUMA Cluster detection and enable
  2023-10-20 21:30         ` [PATCH v9 " Tony Luck
                             ` (5 preceding siblings ...)
  2023-10-20 21:30           ` [PATCH v9 6/8] x86/resctrl: Introduce snc_nodes_per_l3_cache Tony Luck
@ 2023-10-20 21:30           ` Tony Luck
  2023-10-30 21:23             ` Reinette Chatre
  2023-10-20 21:31           ` [PATCH v9 8/8] x86/resctrl: Update documentation with Sub-NUMA cluster changes Tony Luck
                             ` (2 subsequent siblings)
  9 siblings, 1 reply; 331+ messages in thread
From: Tony Luck @ 2023-10-20 21:30 UTC (permalink / raw)
  To: Fenghua Yu, Reinette Chatre, Peter Newman, Jonathan Corbet,
	Shuah Khan, x86
  Cc: Shaopeng Tan, James Morse, Jamie Iles, Babu Moger, Randy Dunlap,
	linux-kernel, linux-doc, patches, Tony Luck

There isn't a simple h/w bit that indicates whether a CPU is
running in Sub NUMA Cluster (SNC) mode. Infer the state by comparing
the ratio of NUMA nodes to L3 cache instances.

When SNC mode is detected, reconfigure the RMID counters by updating
the MSR_RMID_SNC_CONFIG MSR on each socket as CPUs are seen.

Clearing bit zero of the MSR divides the RMIDs and renumbers the ones
on the second SNC node to start from zero. An earlier commit includes
all the required changes in Linux to operate in this reconfigured mode.
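
A rough worked example of the inference (hypothetical topology): a
two-socket system with SNC2 enabled reports four CPU-bearing NUMA nodes
but only two L3 cache instances, so the ratio is 4 / 2 = 2 SNC nodes per
L3 cache. The same calculation on a system without SNC gives 1. In code
terms (simplified from snc_get_config() below, after memory-only nodes
are excluded):

	snc_nodes_per_l3_cache = (nr_node_ids - mem_only_nodes) / num_l3_caches;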

Reviewed-by: Peter Newman <peternewman@google.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
---
Changes since v6:

Moved kfree(node_caches); earlier, to the earliest point where it
is no longer needed.

Added Granite Rapids to list of CPU models that support SNC mode.

Added Peter's review tag

 arch/x86/include/asm/msr-index.h   |  1 +
 arch/x86/kernel/cpu/resctrl/core.c | 92 ++++++++++++++++++++++++++++++
 2 files changed, 93 insertions(+)

diff --git a/arch/x86/include/asm/msr-index.h b/arch/x86/include/asm/msr-index.h
index e3fa9cecd599..4285a5ee81fe 100644
--- a/arch/x86/include/asm/msr-index.h
+++ b/arch/x86/include/asm/msr-index.h
@@ -1109,6 +1109,7 @@
 #define MSR_IA32_QM_CTR			0xc8e
 #define MSR_IA32_PQR_ASSOC		0xc8f
 #define MSR_IA32_L3_CBM_BASE		0xc90
+#define MSR_RMID_SNC_CONFIG		0xca0
 #define MSR_IA32_L2_CBM_BASE		0xd10
 #define MSR_IA32_MBA_THRTL_BASE		0xd50
 
diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
index 0e418dd14070..ac187eb0440f 100644
--- a/arch/x86/kernel/cpu/resctrl/core.c
+++ b/arch/x86/kernel/cpu/resctrl/core.c
@@ -16,11 +16,14 @@
 
 #define pr_fmt(fmt)	"resctrl: " fmt
 
+#include <linux/cpu.h>
 #include <linux/slab.h>
 #include <linux/err.h>
 #include <linux/cacheinfo.h>
 #include <linux/cpuhotplug.h>
+#include <linux/mod_devicetable.h>
 
+#include <asm/cpu_device_id.h>
 #include <asm/intel-family.h>
 #include <asm/resctrl.h>
 #include "internal.h"
@@ -755,11 +758,42 @@ static void clear_closid_rmid(int cpu)
 	wrmsr(MSR_IA32_PQR_ASSOC, 0, 0);
 }
 
+/*
+ * The power-on reset value of MSR_RMID_SNC_CONFIG is 0x1
+ * which indicates that RMIDs are configured in legacy mode.
+ * This mode is incompatible with Linux resctrl semantics
+ * as RMIDs are partitioned between SNC nodes, which requires
+ * a user to know which RMID is allocated to a task.
+ * Clearing bit 0 reconfigures the RMID counters for use
+ * in Sub NUMA Cluster mode. This mode is better for Linux.
+ * The RMID space is divided between all SNC nodes with the
+ * RMIDs renumbered to start from zero in each node when
+ * counting operations from tasks. Code to read the counters
+ * must adjust RMID counter numbers based on SNC node. See
+ * __rmid_read() for code that does this.
+ */
+static void snc_remap_rmids(int cpu)
+{
+	u64 val;
+
+	/* Only need to enable once per package. */
+	if (cpumask_first(topology_core_cpumask(cpu)) != cpu)
+		return;
+
+	rdmsrl(MSR_RMID_SNC_CONFIG, val);
+	val &= ~BIT_ULL(0);
+	wrmsrl(MSR_RMID_SNC_CONFIG, val);
+}
+
 static int resctrl_online_cpu(unsigned int cpu)
 {
 	struct rdt_resource *r;
 
 	mutex_lock(&rdtgroup_mutex);
+
+	if (snc_nodes_per_l3_cache > 1)
+		snc_remap_rmids(cpu);
+
 	for_each_capable_rdt_resource(r)
 		domain_add_cpu(cpu, r);
 	/* The cpu is set in default rdtgroup after online. */
@@ -1014,11 +1048,69 @@ static __init bool get_rdt_resources(void)
 	return (rdt_mon_capable || rdt_alloc_capable);
 }
 
+/* CPU models that support MSR_RMID_SNC_CONFIG */
+static const struct x86_cpu_id snc_cpu_ids[] __initconst = {
+	X86_MATCH_INTEL_FAM6_MODEL(ICELAKE_X, 0),
+	X86_MATCH_INTEL_FAM6_MODEL(SAPPHIRERAPIDS_X, 0),
+	X86_MATCH_INTEL_FAM6_MODEL(EMERALDRAPIDS_X, 0),
+	X86_MATCH_INTEL_FAM6_MODEL(GRANITERAPIDS_X, 0),
+	{}
+};
+
+/*
+ * There isn't a simple h/w bit that indicates whether a CPU is running
+ * in Sub NUMA Cluster (SNC) mode. Infer the state by comparing the
+ * ratio of NUMA nodes to L3 cache instances.
+ * It is not possible to accurately determine SNC state if the system is
+ * booted with a maxcpus=N parameter. That distorts the ratio of SNC nodes
+ * to L3 caches. It will be OK if the system is booted with hyperthreading
+ * disabled (since this doesn't affect the ratio).
+ */
+static __init int snc_get_config(void)
+{
+	unsigned long *node_caches;
+	int mem_only_nodes = 0;
+	int cpu, node, ret;
+	int num_l3_caches;
+
+	if (!x86_match_cpu(snc_cpu_ids))
+		return 1;
+
+	node_caches = bitmap_zalloc(nr_node_ids, GFP_KERNEL);
+	if (!node_caches)
+		return 1;
+
+	cpus_read_lock();
+	for_each_node(node) {
+		cpu = cpumask_first(cpumask_of_node(node));
+		if (cpu < nr_cpu_ids)
+			set_bit(get_cpu_cacheinfo_id(cpu, 3), node_caches);
+		else
+			mem_only_nodes++;
+	}
+	cpus_read_unlock();
+
+	num_l3_caches = bitmap_weight(node_caches, nr_node_ids);
+	kfree(node_caches);
+
+	if (!num_l3_caches)
+		return 1;
+
+	ret = (nr_node_ids - mem_only_nodes) / num_l3_caches;
+
+	if (ret > 1)
+		rdt_resources_all[RDT_RESOURCE_L3].r_resctrl.mon_scope = RESCTRL_NODE;
+
+	return ret;
+}
+
 static __init void rdt_init_res_defs_intel(void)
 {
 	struct rdt_hw_resource *hw_res;
 	struct rdt_resource *r;
 
+	snc_nodes_per_l3_cache = snc_get_config();
+
 	for_each_rdt_resource(r) {
 		hw_res = resctrl_to_arch_res(r);
 
-- 
2.41.0


^ permalink raw reply related	[flat|nested] 331+ messages in thread

* [PATCH v9 8/8] x86/resctrl: Update documentation with Sub-NUMA cluster changes
  2023-10-20 21:30         ` [PATCH v9 " Tony Luck
                             ` (6 preceding siblings ...)
  2023-10-20 21:30           ` [PATCH v9 7/8] x86/resctrl: Sub NUMA Cluster detection and enable Tony Luck
@ 2023-10-20 21:31           ` Tony Luck
  2023-10-24  5:41           ` [PATCH v9 0/8] Add support for Sub-NUMA cluster (SNC) systems Shaopeng Tan (Fujitsu)
  2023-10-31 21:17           ` [PATCH v10 " Tony Luck
  9 siblings, 0 replies; 331+ messages in thread
From: Tony Luck @ 2023-10-20 21:31 UTC (permalink / raw)
  To: Fenghua Yu, Reinette Chatre, Peter Newman, Jonathan Corbet,
	Shuah Khan, x86
  Cc: Shaopeng Tan, James Morse, Jamie Iles, Babu Moger, Randy Dunlap,
	linux-kernel, linux-doc, patches, Tony Luck

With Sub-NUMA Cluster mode enabled the scope of monitoring resources is
per-NODE instead of per-L3 cache. Suffixes of directories with "L3" in
their name refer to Sub-NUMA nodes instead of L3 cache ids.

Users should be aware that SNC mode also affects the amount of L3 cache
available for allocation within each SNC node.
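
As a hypothetical illustration (the directory names depend on which node
ids are present): a two-socket system with two SNC nodes per socket shows
four monitoring directories instead of two, e.g.

	mon_data/mon_L3_00
	mon_data/mon_L3_01
	mon_data/mon_L3_02
	mon_data/mon_L3_03

with each directory reporting events for a single SNC node rather than
for a whole L3 cache.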

Reviewed-by: Peter Newman <peternewman@google.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
---
Changes since v6:

Added Peter's review tag

 Documentation/arch/x86/resctrl.rst | 23 ++++++++++++++++++++---
 1 file changed, 20 insertions(+), 3 deletions(-)

diff --git a/Documentation/arch/x86/resctrl.rst b/Documentation/arch/x86/resctrl.rst
index a6279df64a9d..d1db200db5f9 100644
--- a/Documentation/arch/x86/resctrl.rst
+++ b/Documentation/arch/x86/resctrl.rst
@@ -366,9 +366,9 @@ When control is enabled all CTRL_MON groups will also contain:
 When monitoring is enabled all MON groups will also contain:
 
 "mon_data":
-	This contains a set of files organized by L3 domain and by
-	RDT event. E.g. on a system with two L3 domains there will
-	be subdirectories "mon_L3_00" and "mon_L3_01".	Each of these
+	This contains a set of files organized by L3 domain or by NUMA
+	node (depending on whether Sub-NUMA Cluster (SNC) mode is disabled
+	or enabled respectively) and by RDT event.  Each of these
 	directories have one file per event (e.g. "llc_occupancy",
 	"mbm_total_bytes", and "mbm_local_bytes"). In a MON group these
 	files provide a read out of the current value of the event for
@@ -478,6 +478,23 @@ if non-contiguous 1s value is supported. On a system with a 20-bit mask
 each bit represents 5% of the capacity of the cache. You could partition
 the cache into four equal parts with masks: 0x1f, 0x3e0, 0x7c00, 0xf8000.
 
+Notes on Sub-NUMA Cluster mode
+==============================
+When SNC mode is enabled Linux may load balance tasks between Sub-NUMA
+nodes much more readily than between regular NUMA nodes since the CPUs
+on Sub-NUMA nodes share the same L3 cache and the system may report
+the NUMA distance between Sub-NUMA nodes with a lower value than used
+for regular NUMA nodes.  Users who do not bind tasks to the CPUs of a
+specific Sub-NUMA node must read the "llc_occupancy", "mbm_total_bytes",
+and "mbm_local_bytes" for all Sub-NUMA nodes where the tasks may execute
+to get the full view of traffic for which the tasks were the source.
+
+The cache allocation feature still provides the same number of
+bits in a mask to control allocation into the L3 cache. But each
+of those ways has its capacity reduced because the cache is divided
+between the SNC nodes. The values reported in the resctrl
+"size" files are adjusted accordingly.
+
 Memory bandwidth Allocation and monitoring
 ==========================================
 
-- 
2.41.0


^ permalink raw reply related	[flat|nested] 331+ messages in thread

* RE: [PATCH v9 0/8] Add support for Sub-NUMA cluster (SNC) systems
  2023-10-20 21:30         ` [PATCH v9 " Tony Luck
                             ` (7 preceding siblings ...)
  2023-10-20 21:31           ` [PATCH v9 8/8] x86/resctrl: Update documentation with Sub-NUMA cluster changes Tony Luck
@ 2023-10-24  5:41           ` Shaopeng Tan (Fujitsu)
  2023-10-24 15:29             ` Luck, Tony
  2023-10-31 21:17           ` [PATCH v10 " Tony Luck
  9 siblings, 1 reply; 331+ messages in thread
From: Shaopeng Tan (Fujitsu) @ 2023-10-24  5:41 UTC (permalink / raw)
  To: 'Tony Luck',
	Fenghua Yu, Reinette Chatre, Peter Newman, Jonathan Corbet,
	Shuah Khan, x86
  Cc: James Morse, Jamie Iles, Babu Moger, Randy Dunlap, linux-kernel,
	linux-doc, patches

Hi tony,

> base-commit: 3300447612b2adbc05cbb90e5d1cb288f19c40c6

$ git checkout 3300447612b2adbc05cbb90e5d1cb288f19c40c6 -b patch_test
fatal: reference is not a tree: 3300447612b2adbc05cbb90e5d1cb288f19c40c6

Then I tried to apply this patch series to kernel v6.5 and v6.6-rc1~7,
but it failed with the error message "Patch failed at 000x".

Could you tell me what kernel version this patch series is based on?

Best regards,
Shaopeng TAN

^ permalink raw reply	[flat|nested] 331+ messages in thread

* RE: [PATCH v9 0/8] Add support for Sub-NUMA cluster (SNC) systems
  2023-10-24  5:41           ` [PATCH v9 0/8] Add support for Sub-NUMA cluster (SNC) systems Shaopeng Tan (Fujitsu)
@ 2023-10-24 15:29             ` Luck, Tony
  0 siblings, 0 replies; 331+ messages in thread
From: Luck, Tony @ 2023-10-24 15:29 UTC (permalink / raw)
  To: Shaopeng Tan (Fujitsu),
	Yu, Fenghua, Chatre, Reinette, Peter Newman, Jonathan Corbet,
	Shuah Khan, x86
  Cc: James Morse, Jamie Iles, Babu Moger, Randy Dunlap, linux-kernel,
	linux-doc, patches

> Could you tell me what kernel version this patch series is based on?

Shaopeng TAN,

Sorry. It's buried in all the other text in the change log in the cover letter.
I should have put this more prominently at the start of the cover letter.

	Rebased to tip/master October 20th since that has several other
	resctrl changes staged ready for the next merge window. No
	significant collisions, just noise where "git am" would not
	automatically apply. New base is:

	3300447612b2 ("Merge branch into tip/master: 'x86/tdx'")

-Tony

^ permalink raw reply	[flat|nested] 331+ messages in thread

* Re: [PATCH v9 1/8] x86/resctrl: Prepare for new domain scope
  2023-10-20 21:30           ` [PATCH v9 1/8] x86/resctrl: Prepare for new domain scope Tony Luck
@ 2023-10-30 21:18             ` Reinette Chatre
  0 siblings, 0 replies; 331+ messages in thread
From: Reinette Chatre @ 2023-10-30 21:18 UTC (permalink / raw)
  To: Tony Luck, Fenghua Yu, Peter Newman, Jonathan Corbet, Shuah Khan, x86
  Cc: Shaopeng Tan, James Morse, Jamie Iles, Babu Moger, Randy Dunlap,
	linux-kernel, linux-doc, patches

Hi Tony,

On 10/20/2023 2:30 PM, Tony Luck wrote:

...

> @@ -506,12 +519,17 @@ static int arch_domain_mbm_alloc(u32 num_rmid, struct rdt_hw_domain *hw_dom)
>   */
>  static void domain_add_cpu(int cpu, struct rdt_resource *r)
>  {
> -	int id = get_cpu_cacheinfo_id(cpu, r->cache_level);
> +	int id = get_domain_id_from_scope(cpu, r->scope);
>  	struct list_head *add_pos = NULL;
>  	struct rdt_hw_domain *hw_dom;
>  	struct rdt_domain *d;
>  	int err;
>  
> +	if (id < 0) {
> +		pr_warn_once("Can't find domain id for CPU:%d scope:%d for resource %s\n",
> +			     cpu, r->scope, r->name);
> +		return;
> +	}
>  	d = rdt_find_domain(r, id, &add_pos);
>  	if (IS_ERR(d)) {
>  		pr_warn("Couldn't find cache id for CPU %d\n", cpu);

From what I can tell the original implementation relied on the implementation of
rdt_find_domain() to do error checking of the id value, printing the above pr_warn()
if id was found to be invalid. In your change the error checking on id is moved
earlier, yet this original behavior is maintained. How could rdt_find_domain()
possibly fail for this reason at this point?

> diff --git a/arch/x86/kernel/cpu/resctrl/pseudo_lock.c b/arch/x86/kernel/cpu/resctrl/pseudo_lock.c
> index 8f559eeae08e..8c5f932bc00b 100644
> --- a/arch/x86/kernel/cpu/resctrl/pseudo_lock.c
> +++ b/arch/x86/kernel/cpu/resctrl/pseudo_lock.c
> @@ -292,10 +292,14 @@ static void pseudo_lock_region_clear(struct pseudo_lock_region *plr)
>   */
>  static int pseudo_lock_region_init(struct pseudo_lock_region *plr)
>  {
> +	int scope = plr->s->res->scope;

enum resctrl_scope ? 

>  	struct cpu_cacheinfo *ci;
>  	int ret;
>  	int i;




Reinette


^ permalink raw reply	[flat|nested] 331+ messages in thread

* Re: [PATCH v9 2/8] x86/resctrl: Prepare to split rdt_domain structure
  2023-10-20 21:30           ` [PATCH v9 2/8] x86/resctrl: Prepare to split rdt_domain structure Tony Luck
@ 2023-10-30 21:19             ` Reinette Chatre
  0 siblings, 0 replies; 331+ messages in thread
From: Reinette Chatre @ 2023-10-30 21:19 UTC (permalink / raw)
  To: Tony Luck, Fenghua Yu, Peter Newman, Jonathan Corbet, Shuah Khan, x86
  Cc: Shaopeng Tan, James Morse, Jamie Iles, Babu Moger, Randy Dunlap,
	linux-kernel, linux-doc, patches

Hi Tony,

On 10/20/2023 2:30 PM, Tony Luck wrote:
> The rdt_domain structure is used for both control and monitor features.
> It is about to be split into separate structures for these two usages
> because the scope for control and monitoring features for a resource
> will be different for future resources.
> 
> To allow for common code that scans a list of domains looking for a
> specific domain id, move the "list" and "id" fields into their own
> structure within the rdt_domain structure.

The motivation for this split is not clear to me. Here the motivation
is to support the code that needs to traverse both lists. Later (patch #4)
the motivation is that what remains should be "just the fields required for
control and monitoring respectively". The comment "common header for different
domain types" also makes me think this new structure should contain all
common members?
The reason why the motivation needs to be clear in this regard is because
there is a common field, cpu_mask, that did not make it into the header.
Should it?

> Add a "type" field to the header to be used later so that callers looking
> up a domain can be assured that they found one of the expected type.

Please move this addition to when it is used.

Reinette

^ permalink raw reply	[flat|nested] 331+ messages in thread

* Re: [PATCH v9 3/8] x86/resctrl: Prepare for different scope for control/monitor operations
  2023-10-20 21:30           ` [PATCH v9 3/8] x86/resctrl: Prepare for different scope for control/monitor operations Tony Luck
@ 2023-10-30 21:21             ` Reinette Chatre
  0 siblings, 0 replies; 331+ messages in thread
From: Reinette Chatre @ 2023-10-30 21:21 UTC (permalink / raw)
  To: Tony Luck, Fenghua Yu, Peter Newman, Jonathan Corbet, Shuah Khan, x86
  Cc: Shaopeng Tan, James Morse, Jamie Iles, Babu Moger, Randy Dunlap,
	linux-kernel, linux-doc, patches

Hi Tony,

On 10/20/2023 2:30 PM, Tony Luck wrote:
> Resctrl assumes that control and monitor operations on a resource are
> performed at the same scope.
> 
> Prepare for systems that use different scope (specifically L3 scope for
> cache control and NODE scope for cache occupancy and memory bandwidth
> monitoring).

The first paragraph is a generalization of all resources but then the
second paragraph only mentions L3. In preparation for readers seeing
that only the L3 resource's monitoring scope is initialized, it may help
to be specific here that resctrl only supports monitoring on L3.

> Create separate domain lists for control and monitor operations.

Please do note that an upcoming change changes the domain list locking.
I expect the transition to go smoothly with the locking and list type
translating to both lists so just sharing for your information in case you
are now aware:
https://lore.kernel.org/lkml/20231025180345.28061-25-james.morse@arm.com/

> 
> Note that errors during initialization of either control or monitor
> functions on a domain would previously result in that domain being
> excluded from both control and monitor operations. Now the domains are
> allocated independently it is no longer required to disable both control
> and monitor operations if either fail.
> 
> Signed-off-by: Tony Luck <tony.luck@intel.com>
> ---
...
> @@ -356,7 +359,7 @@ struct rdt_domain *get_domain_from_cpu(int cpu, struct rdt_resource *r)
>  {
>  	struct rdt_domain *d;
>  
> -	list_for_each_entry(d, &r->domains, hdr.list) {
> +	list_for_each_entry(d, &r->ctrl_domains, hdr.list) {
>  		/* Find the domain that contains this CPU */
>  		if (cpumask_test_cpu(cpu, &d->cpu_mask))
>  			return d;

This appears to silently turn a generic "get_domain_from_cpu()" into
code that assumes the control domain. This works for the existing
users for now but can trip up future changes. I think this would be
better renamed to get_ctrl_domain_from_cpu() or similar.

> @@ -388,29 +391,39 @@ void rdt_ctrl_update(void *arg)
>  }
>  
>  /*
> - * rdt_find_domain - Find a domain in a resource that matches input resource id
> + * rdt_find_domain - Find a domain in one of a resource domain lists.

The above does not sound right.
"one of a resource domain lists" -> "a resource domain list" or "one of the resource
domain lists" or ?

>   *
> - * Search resource r's domain list to find the resource id. If the resource
> - * id is found in a domain, return the domain. Otherwise, if requested by
> - * caller, return the first domain whose id is bigger than the input id.
> + * Search the list to find the resource id. If the resource id is found
> + * in a domain, return the domain. Otherwise, if requested by caller,
> + * return the first domain whose id is bigger than the input id.

The above does not sound right. First there is "Search the list to find
the resource id." The resource id is not involved in this code, do you
mean "domain id"? Also later "If the resource id is found in a domain,"
what does that mean here?

>   * The domain list is sorted by id in ascending order.
> + *
> + * If an existing domain in the resource r's domain list matches the cpu's
> + * resource id, add the cpu in the domain.

domain id?

> + *
> + * Otherwise, caller will allocate a new domain and insert into the right position
> + * in the domain list sorted by id in ascending order.
> + *
> + * The order in the domain list is visible to users when we print entries
> + * in the schemata file and schemata input is validated to have the same order
> + * as this list.

Please document what the caller does at the caller. Also, please always use "CPU",
not "cpu". Finally, please do not impersonate code (no "we"). I understand you
are copying original comments, no need to propagate these issues.

>   */
> -struct rdt_domain *rdt_find_domain(struct rdt_resource *r, int id,
> -				   struct list_head **pos)
> +struct rdt_domain_hdr *rdt_find_domain(struct list_head *h, int id,
> +				       struct list_head **pos)
>  {
> -	struct rdt_domain *d;
> +	struct rdt_domain_hdr *d;
>  	struct list_head *l;
>  
>  	if (id < 0)
>  		return ERR_PTR(-ENODEV);
>  
> -	list_for_each(l, &r->domains) {
> -		d = list_entry(l, struct rdt_domain, hdr.list);
> +	list_for_each(l, h) {
> +		d = list_entry(l, struct rdt_domain_hdr, list);
>  		/* When id is found, return its domain. */
> -		if (id == d->hdr.id)
> +		if (id == d->id)
>  			return d;
>  		/* Stop searching when finding id's position in sorted list. */
> -		if (id < d->hdr.id)
> +		if (id < d->id)
>  			break;
>  	}
>  
> @@ -504,39 +517,33 @@ static int get_domain_id_from_scope(int cpu, enum resctrl_scope scope)
>  	return -EINVAL;
>  }
>  
> -/*
> - * domain_add_cpu - Add a cpu to a resource's domain list.
> - *
> - * If an existing domain in the resource r's domain list matches the cpu's
> - * resource id, add the cpu in the domain.
> - *
> - * Otherwise, a new domain is allocated and inserted into the right position
> - * in the domain list sorted by id in ascending order.
> - *
> - * The order in the domain list is visible to users when we print entries
> - * in the schemata file and schemata input is validated to have the same order
> - * as this list.
> - */
> -static void domain_add_cpu(int cpu, struct rdt_resource *r)
> +static void domain_add_cpu_ctrl(int cpu, struct rdt_resource *r)
>  {
> -	int id = get_domain_id_from_scope(cpu, r->scope);
> +	int id = get_domain_id_from_scope(cpu, r->ctrl_scope);
>  	struct list_head *add_pos = NULL;
>  	struct rdt_hw_domain *hw_dom;
> +	struct rdt_domain_hdr *hdr;
>  	struct rdt_domain *d;
>  	int err;
>  
>  	if (id < 0) {
> -		pr_warn_once("Can't find domain id for CPU:%d scope:%d for resource %s\n",
> -			     cpu, r->scope, r->name);
> +		pr_warn_once("Can't find control domain id for CPU:%d scope:%d for resource %s\n",
> +			     cpu, r->ctrl_scope, r->name);
>  		return;
>  	}
> -	d = rdt_find_domain(r, id, &add_pos);
> -	if (IS_ERR(d)) {
> -		pr_warn("Couldn't find cache id for CPU %d\n", cpu);
> +
> +	hdr = rdt_find_domain(&r->ctrl_domains, id, &add_pos);
> +	if (IS_ERR(hdr)) {
> +		pr_warn("Couldn't find control scope id=%d for CPU %d\n", id, cpu);

How can the failure in the error message be encountered at this point?

>  		return;
>  	}
>  
> -	if (d) {
> +	if (hdr) {
> +		if (WARN_ON_ONCE(hdr->type != RESCTRL_CTRL_DOMAIN))
> +			return;
> +
> +		d = container_of(hdr, struct rdt_domain, hdr);
> +
>  		cpumask_set_cpu(cpu, &d->cpu_mask);
>  		if (r->cache.arch_has_per_cpu_cfg)
>  			rdt_domain_reconfigure_cdp(r);
> @@ -549,48 +556,115 @@ static void domain_add_cpu(int cpu, struct rdt_resource *r)
>  
>  	d = &hw_dom->d_resctrl;
>  	d->hdr.id = id;
> +	d->hdr.type = RESCTRL_CTRL_DOMAIN;
>  	cpumask_set_cpu(cpu, &d->cpu_mask);
>  
>  	rdt_domain_reconfigure_cdp(r);
>  
> -	if (r->alloc_capable && domain_setup_ctrlval(r, d)) {
> +	if (domain_setup_ctrlval(r, d)) {
> +		domain_free(hw_dom);
> +		return;
> +	}
> +
> +	list_add_tail(&d->hdr.list, add_pos);
> +
> +	err = resctrl_online_ctrl_domain(r, d);
> +	if (err) {
> +		list_del(&d->hdr.list);
>  		domain_free(hw_dom);
> +	}
> +}
> +
> +static void domain_add_cpu_mon(int cpu, struct rdt_resource *r)
> +{
> +	int id = get_domain_id_from_scope(cpu, r->mon_scope);
> +	struct list_head *add_pos = NULL;
> +	struct rdt_hw_domain *hw_dom;
> +	struct rdt_domain_hdr *hdr;
> +	struct rdt_domain *d;
> +	int err;
> +
> +	if (id < 0) {
> +		pr_warn_once("Can't find monitor domain id for CPU:%d scope:%d for resource %s\n",
> +			     cpu, r->mon_scope, r->name);
> +		return;
> +	}
> +
> +	hdr = rdt_find_domain(&r->mon_domains, id, &add_pos);
> +	if (IS_ERR(hdr)) {
> +		pr_warn("Couldn't find monitor scope id=%d for CPU %d\n", id, cpu);
> +		return;
> +	}
> +
> +	if (hdr) {
> +		if (WARN_ON_ONCE(hdr->type != RESCTRL_MON_DOMAIN))
> +			return;
> +
> +		d = container_of(hdr, struct rdt_domain, hdr);
> +
> +		cpumask_set_cpu(cpu, &d->cpu_mask);
>  		return;
>  	}
>  
> -	if (r->mon_capable && arch_domain_mbm_alloc(r->num_rmid, hw_dom)) {
> +	hw_dom = kzalloc_node(sizeof(*hw_dom), GFP_KERNEL, cpu_to_node(cpu));
> +	if (!hw_dom)
> +		return;
> +
> +	d = &hw_dom->d_resctrl;
> +	d->hdr.id = id;
> +	d->hdr.type = RESCTRL_MON_DOMAIN;
> +	cpumask_set_cpu(cpu, &d->cpu_mask);
> +
> +	if (arch_domain_mbm_alloc(r->num_rmid, hw_dom)) {
>  		domain_free(hw_dom);
>  		return;
>  	}
>  
>  	list_add_tail(&d->hdr.list, add_pos);
>  
> -	err = resctrl_online_domain(r, d);
> +	err = resctrl_online_mon_domain(r, d);
>  	if (err) {
>  		list_del(&d->hdr.list);
>  		domain_free(hw_dom);
>  	}
>  }
>  
> -static void domain_remove_cpu(int cpu, struct rdt_resource *r)
> +/*
> + * domain_add_cpu - Add a cpu to either/both resource's domain lists.

cpu -> CPU (please check all changelog and comments)

> + */
> +static void domain_add_cpu(int cpu, struct rdt_resource *r)
> +{
> +	if (r->alloc_capable)
> +		domain_add_cpu_ctrl(cpu, r);
> +	if (r->mon_capable)
> +		domain_add_cpu_mon(cpu, r);
> +}
> +

> @@ -3914,18 +3916,22 @@ static int domain_setup_mon_state(struct rdt_resource *r, struct rdt_domain *d)
>  	return 0;
>  }
>  
> -int resctrl_online_domain(struct rdt_resource *r, struct rdt_domain *d)
> +int resctrl_online_ctrl_domain(struct rdt_resource *r, struct rdt_domain *d)
>  {
> -	int err;
> -
>  	lockdep_assert_held(&rdtgroup_mutex);
>  
>  	if (supports_mba_mbps() && r->rid == RDT_RESOURCE_MBA)
>  		/* RDT_RESOURCE_MBA is never mon_capable */

This comment was used to justify the early exit based on the later
"if (!r->mon_capable)" test. With that test removed, this comment
becomes unnecessary.


Reinette

^ permalink raw reply	[flat|nested] 331+ messages in thread

* Re: [PATCH v9 4/8] x86/resctrl: Split the rdt_domain and rdt_hw_domain structures
  2023-10-20 21:30           ` [PATCH v9 4/8] x86/resctrl: Split the rdt_domain and rdt_hw_domain structures Tony Luck
@ 2023-10-30 21:21             ` Reinette Chatre
  0 siblings, 0 replies; 331+ messages in thread
From: Reinette Chatre @ 2023-10-30 21:21 UTC (permalink / raw)
  To: Tony Luck, Fenghua Yu, Peter Newman, Jonathan Corbet, Shuah Khan, x86
  Cc: Shaopeng Tan, James Morse, Jamie Iles, Babu Moger, Randy Dunlap,
	linux-kernel, linux-doc, patches

Hi Tony,

On 10/20/2023 2:30 PM, Tony Luck wrote:
> The same rdt_domain structure is used for both control and monitor
> functions. But this results in wasted memory as some of the fields are
> only used by control functions, while most are only used for monitor
> functions.
> 
> Split into separate rdt_ctrl_domain and rdt_mon_domain structures with
> just the fields required for control and monitoring respectively.

Sounds like a motivation for the cpumask to form part of the
common header?

Reinette

^ permalink raw reply	[flat|nested] 331+ messages in thread

* Re: [PATCH v9 6/8] x86/resctrl: Introduce snc_nodes_per_l3_cache
  2023-10-20 21:30           ` [PATCH v9 6/8] x86/resctrl: Introduce snc_nodes_per_l3_cache Tony Luck
@ 2023-10-30 21:22             ` Reinette Chatre
  0 siblings, 0 replies; 331+ messages in thread
From: Reinette Chatre @ 2023-10-30 21:22 UTC (permalink / raw)
  To: Tony Luck, Fenghua Yu, Peter Newman, Jonathan Corbet, Shuah Khan, x86
  Cc: Shaopeng Tan, James Morse, Jamie Iles, Babu Moger, Randy Dunlap,
	linux-kernel, linux-doc, patches

Hi Tony,

On 10/20/2023 2:30 PM, Tony Luck wrote:
> Intel Sub-NUMA Cluster (SNC) is a feature that subdivides the CPU cores
> and memory controllers on a socket into two or more groups. These are
> presented to the operating system as NUMA nodes.
> 
> This may enable some workloads to have slightly lower latency to memory
> as the memory controller(s) in an SNC node are electrically closer to the
> CPU cores on that SNC node. This advantage may be offset by lower bandwidth
> since the memory accesses for each core can only be interleaved between
> the memory controllers on the same SNC node.
> 
> Resctrl monitoring on Intel system depends upon attaching RMIDs to tasks

"on Intel system depends" -> "on an Intel system depends" or "on Intel
systems depend" or ... ?

> to track L3 cache occupancy and memory bandwidth. There is an MSR that
> controls how the RMIDs are shared between SNC nodes.
> 
> The default mode divides them numerically. E.g. when there are two SNC
> nodes on a socket the lower number half of the RMIDs are given to the
> first node, the remainder to the second node. This would be difficult
> to use with the Linux resctrl interface as specific RMID values assigned
> to resctrl groups are not visible to users.
> 
> The other mode divides the RMIDs and renumbers the ones on the second
> SNC node to start from zero.
> 
> Even with this renumbering SNC mode requires several changes in resctrl
> behavior for correct operation.
> 
> Add a global integer "snc_nodes_per_l3_cache" that will show how many
> SNC nodes share each L3 cache. When this is "1", SNC mode is either
> not implemented, or not enabled.
> 
> A later patch will detect SNC mode and set snc_nodes_per_l3_cache to

Please remove usages of "later patch" from this series. For reference:
https://lore.kernel.org/lkml/20231009171918.GPZSQ2Frs%2Fqp129wsP@fat_crate.local/
Please check the whole series. For the same reason I expect "earlier patch"
to need removal also.

> the appropriate value. For now it remains at the default "1" to
> indicate SNC mode is not active.
> 
> Code that needs to take action when SNC is enabled is:
> 1) The number of logical RMIDs per L3 cache available for use is the
>    number of physical RMIDs divided by the number of SNC nodes.
> 2) Likewise the "mon_scale" value must be adjusted for the number
>    of SNC nodes.

Can this be expanded to indicate how the value needs to be adjusted?

> 3) The RMID renumbering operates when using the value from the
>    IA32_PQR_ASSOC MSR to count accesses by a task. When reading an RMID
>    counter, code must adjust from the logical RMID used to the physical
>    RMID value for the SNC node that it wishes to read and load the
>    adjusted value into the IA32_QM_EVTSEL MSR.
> 4) The L3 cache is divided between the SNC nodes. So the value
>    reported in the resctrl "size" file is adjusted.

Can this be expanded to indicate how the value needs to be adjusted?

> 5) The "-o mba_MBps" mount option must be disabled in SNC mode
>    because the monitoring is being done per SNC node, while the
>    bandwidth allocation is still done at the L3 cache scope.
>    Trying to use this feedback loop might result in contradictory
>    changes to the throttling level coming from each of the SNC
>    node bandwidth measurements.
> 
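As a rough illustration of the adjustments in items 1, 3 and 4 above, here
is a small, self-contained userspace sketch (not code from this series; the
variable names and the numbers 512 RMIDs, 2 SNC nodes and 32 MB of L3 are
made up for the example, and item 3 follows one plausible reading of the
renumbering described earlier):

#include <stdio.h>

static unsigned int snc_nodes_per_l3_cache = 2;	/* assumed SNC config */

int main(void)
{
	unsigned int phys_rmids = 512;		/* physical RMIDs per L3 (example) */
	unsigned int l3_size_kb = 32 * 1024;	/* L3 size in KB (example) */

	/* Item 1: logical RMIDs available to resctrl on each SNC node */
	unsigned int num_rmid = phys_rmids / snc_nodes_per_l3_cache;

	/* Item 4: "size" reported for the share of L3 seen by one SNC node */
	unsigned int node_size_kb = l3_size_kb / snc_nodes_per_l3_cache;

	/*
	 * Item 3: a logical RMID used on SNC node N of a socket maps to
	 * physical RMID "logical + N * num_rmid" when the counter is read
	 * via IA32_QM_EVTSEL.
	 */
	unsigned int logical_rmid = 7, snc_node = 1;
	unsigned int physical_rmid = logical_rmid + snc_node * num_rmid;

	printf("logical RMIDs per SNC node: %u\n", num_rmid);
	printf("reported L3 size per SNC node: %u KB\n", node_size_kb);
	printf("logical RMID %u on SNC node %u -> physical RMID %u\n",
	       logical_rmid, snc_node, physical_rmid);
	return 0;
}

With these example values the program prints 256 logical RMIDs, 16384 KB,
and physical RMID 263.
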
> Reviewed-by: Peter Newman <peternewman@google.com>
> Signed-off-by: Tony Luck <tony.luck@intel.com>
> ---
> Changes since v6:
> 
> In commit comment s/redumbering/renumbering/
> 
> Move check that SNC is not enabled into supports_mba_mbps().
> 
> Add Peter's review tag.
> 
>  arch/x86/kernel/cpu/resctrl/internal.h |  2 ++
>  arch/x86/kernel/cpu/resctrl/core.c     |  6 ++++++
>  arch/x86/kernel/cpu/resctrl/monitor.c  | 16 +++++++++++++---
>  arch/x86/kernel/cpu/resctrl/rdtgroup.c |  5 +++--
>  4 files changed, 24 insertions(+), 5 deletions(-)
> 
> diff --git a/arch/x86/kernel/cpu/resctrl/internal.h b/arch/x86/kernel/cpu/resctrl/internal.h
> index 41a23556f57d..563e6203321e 100644
> --- a/arch/x86/kernel/cpu/resctrl/internal.h
> +++ b/arch/x86/kernel/cpu/resctrl/internal.h
> @@ -446,6 +446,8 @@ DECLARE_STATIC_KEY_FALSE(rdt_alloc_enable_key);
>  
>  extern struct dentry *debugfs_resctrl;
>  
> +extern int snc_nodes_per_l3_cache;
> +
>  enum resctrl_res_level {
>  	RDT_RESOURCE_L3,
>  	RDT_RESOURCE_L2,
> diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
> index 2c3975c9c20c..0e418dd14070 100644
> --- a/arch/x86/kernel/cpu/resctrl/core.c
> +++ b/arch/x86/kernel/cpu/resctrl/core.c
> @@ -48,6 +48,12 @@ int max_name_width, max_data_width;
>   */
>  bool rdt_alloc_capable;
>  
> +/*
> + * Number of SNC nodes that share each L3 cache.  Default is 1 for
> + * systems that do not support SNC, or have SNC disabled.
> + */
> +int snc_nodes_per_l3_cache = 1;

Should this be an unsigned int?

> +
>  static void
>  mba_wrmsr_intel(struct rdt_ctrl_domain *d, struct msr_param *m,
>  		struct rdt_resource *r);

Reinette

^ permalink raw reply	[flat|nested] 331+ messages in thread

* Re: [PATCH v9 7/8] x86/resctrl: Sub NUMA Cluster detection and enable
  2023-10-20 21:30           ` [PATCH v9 7/8] x86/resctrl: Sub NUMA Cluster detection and enable Tony Luck
@ 2023-10-30 21:23             ` Reinette Chatre
  0 siblings, 0 replies; 331+ messages in thread
From: Reinette Chatre @ 2023-10-30 21:23 UTC (permalink / raw)
  To: Tony Luck, Fenghua Yu, Peter Newman, Jonathan Corbet, Shuah Khan, x86
  Cc: Shaopeng Tan, James Morse, Jamie Iles, Babu Moger, Randy Dunlap,
	linux-kernel, linux-doc, patches

Hi Tony,

On 10/20/2023 2:30 PM, Tony Luck wrote:
> There isn't a simple h/w bit that indicates whether a CPU is

"h/w" -> hardware

> running in Sub NUMA Cluster (SNC) mode. Infer the state by comparing
> the ratio of NUMA nodes to L3 cache instances.
> 
> When SNC mode is detected, reconfigure the RMID counters by updating
> the MSR_RMID_SNC_CONFIG MSR on each socket as CPUs are seen.
> 
> Clearing bit zero of the MSR divides the RMIDs and renumbers the ones
> on the second SNC node to start from zero. An earlier commit includes

Please drop the "earlier commit" reference.

> all the required changes in Linux to operate in this reconfigured mode.
> 
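As a worked example of that inference (numbers assumed for illustration):
a two-socket system with SNC enabled that reports four CPU-bearing NUMA
nodes and two distinct L3 cache instances gives 4 / 2 = 2 SNC nodes per L3
cache; the same system with SNC disabled reports two nodes and two caches,
giving 1 (i.e. SNC not active). Memory-only NUMA nodes are excluded from
the node count before the division.
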
> Reviewed-by: Peter Newman <peternewman@google.com>
> Signed-off-by: Tony Luck <tony.luck@intel.com>
> ---
> Changes since v6:
> 
> Moved kfree(node_caches); earlier, to the earliest point where it
> is no longer needed.
> 
> Added Granite Rapids to list of CPU models that support SNC mode.
> 
> Added Peter's review tag
> 
>  arch/x86/include/asm/msr-index.h   |  1 +
>  arch/x86/kernel/cpu/resctrl/core.c | 92 ++++++++++++++++++++++++++++++
>  2 files changed, 93 insertions(+)
> 
> diff --git a/arch/x86/include/asm/msr-index.h b/arch/x86/include/asm/msr-index.h
> index e3fa9cecd599..4285a5ee81fe 100644
> --- a/arch/x86/include/asm/msr-index.h
> +++ b/arch/x86/include/asm/msr-index.h
> @@ -1109,6 +1109,7 @@
>  #define MSR_IA32_QM_CTR			0xc8e
>  #define MSR_IA32_PQR_ASSOC		0xc8f
>  #define MSR_IA32_L3_CBM_BASE		0xc90
> +#define MSR_RMID_SNC_CONFIG		0xca0
>  #define MSR_IA32_L2_CBM_BASE		0xd10
>  #define MSR_IA32_MBA_THRTL_BASE		0xd50
>  
> diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
> index 0e418dd14070..ac187eb0440f 100644
> --- a/arch/x86/kernel/cpu/resctrl/core.c
> +++ b/arch/x86/kernel/cpu/resctrl/core.c
> @@ -16,11 +16,14 @@
>  
>  #define pr_fmt(fmt)	"resctrl: " fmt
>  
> +#include <linux/cpu.h>
>  #include <linux/slab.h>
>  #include <linux/err.h>
>  #include <linux/cacheinfo.h>
>  #include <linux/cpuhotplug.h>
> +#include <linux/mod_devicetable.h>
>  
> +#include <asm/cpu_device_id.h>
>  #include <asm/intel-family.h>
>  #include <asm/resctrl.h>
>  #include "internal.h"
> @@ -755,11 +758,42 @@ static void clear_closid_rmid(int cpu)
>  	wrmsr(MSR_IA32_PQR_ASSOC, 0, 0);
>  }
>  
> +/*
> + * The power-on reset value of MSR_RMID_SNC_CONFIG is 0x1
> + * which indicates that RMIDs are configured in legacy mode.
> + * This mode is incompatible with Linux resctrl semantics
> + * as RMIDs are partitioned between SNC nodes, which requires
> + * a user to know which RMID is allocated to a task.
> + * Clearing bit 0 reconfigures the RMID counters for use
> + * in Sub NUMA Cluster mode. This mode is better for Linux.
> + * The RMID space is divided between all SNC nodes with the
> + * RMIDs renumbered to start from zero in each node when
> + * couning operations from tasks. Code to read the counters
> + * must adjust RMID counnter numbers based on SNC node. See

counnter -> counter

> + * __rmid_read() for code that does this.
> + */
> +static void snc_remap_rmids(int cpu)
> +{
> +	u64 val;
> +
> +	/* Only need to enable once per package. */
> +	if (cpumask_first(topology_core_cpumask(cpu)) != cpu)
> +		return;
> +
> +	rdmsrl(MSR_RMID_SNC_CONFIG, val);
> +	val &= ~BIT_ULL(0);
> +	wrmsrl(MSR_RMID_SNC_CONFIG, val);
> +}
> +
>  static int resctrl_online_cpu(unsigned int cpu)
>  {
>  	struct rdt_resource *r;
>  
>  	mutex_lock(&rdtgroup_mutex);
> +
> +	if (snc_nodes_per_l3_cache > 1)
> +		snc_remap_rmids(cpu);
> +
>  	for_each_capable_rdt_resource(r)
>  		domain_add_cpu(cpu, r);
>  	/* The cpu is set in default rdtgroup after online. */
> @@ -1014,11 +1048,69 @@ static __init bool get_rdt_resources(void)
>  	return (rdt_mon_capable || rdt_alloc_capable);
>  }
>  
> +/* CPU models that support MSR_RMID_SNC_CONFIG */
> +static const struct x86_cpu_id snc_cpu_ids[] __initconst = {
> +	X86_MATCH_INTEL_FAM6_MODEL(ICELAKE_X, 0),
> +	X86_MATCH_INTEL_FAM6_MODEL(SAPPHIRERAPIDS_X, 0),
> +	X86_MATCH_INTEL_FAM6_MODEL(EMERALDRAPIDS_X, 0),
> +	X86_MATCH_INTEL_FAM6_MODEL(GRANITERAPIDS_X, 0),
> +	{}
> +};
> +
> +/*
> + * There isn't a simple h/w bit that indicates whether a CPU is running

h/w -> hardware

> + * in Sub NUMA Cluster (SNC) mode. Infer the state by comparing the
> + * ratio of NUMA nodes to L3 cache instances.
> + * It is not possible to accurately determine SNC state if the system is
> + * booted with a maxcpus=N parameter. That distorts the ratio of SNC nodes

Can the user be warned when SNC state cannot be determined accurately? 

> + * to L3 caches. It will be OK if system is booted with hyperthreading
> + * disabled (since this doesn't affect the ratio).
> + */
> +static __init int snc_get_config(void)
> +{
> +	unsigned long *node_caches;
> +	int mem_only_nodes = 0;
> +	int cpu, node, ret;
> +	int num_l3_caches;
> +
> +	if (!x86_match_cpu(snc_cpu_ids))
> +		return 1;
> +
> +	node_caches = bitmap_zalloc(nr_node_ids, GFP_KERNEL);
> +	if (!node_caches)
> +		return 1;
> +
> +	cpus_read_lock();
> +	for_each_node(node) {
> +		cpu = cpumask_first(cpumask_of_node(node));
> +		if (cpu < nr_cpu_ids)
> +			set_bit(get_cpu_cacheinfo_id(cpu, 3), node_caches);
> +		else
> +			mem_only_nodes++;
> +	}
> +	cpus_read_unlock();
> +
> +	num_l3_caches = bitmap_weight(node_caches, nr_node_ids);
> +	kfree(node_caches);
> +
> +	if (!num_l3_caches)
> +		return 1;
> +
> +	ret = (nr_node_ids - mem_only_nodes) / num_l3_caches;
> +
> +	if (ret > 1)
> +		rdt_resources_all[RDT_RESOURCE_L3].r_resctrl.mon_scope = RESCTRL_NODE;
> +
> +	return ret;
> +}
> +
>  static __init void rdt_init_res_defs_intel(void)
>  {
>  	struct rdt_hw_resource *hw_res;
>  	struct rdt_resource *r;
>  
> +	snc_nodes_per_l3_cache = snc_get_config();
> +
>  	for_each_rdt_resource(r) {
>  		hw_res = resctrl_to_arch_res(r);
>  

Reinette

^ permalink raw reply	[flat|nested] 331+ messages in thread

* [PATCH v10 0/8] Add support for Sub-NUMA cluster (SNC) systems
  2023-10-20 21:30         ` [PATCH v9 " Tony Luck
                             ` (8 preceding siblings ...)
  2023-10-24  5:41           ` [PATCH v9 0/8] Add support for Sub-NUMA cluster (SNC) systems Shaopeng Tan (Fujitsu)
@ 2023-10-31 21:17           ` Tony Luck
  2023-10-31 21:17             ` [PATCH v10 1/8] x86/resctrl: Prepare for new domain scope Tony Luck
                               ` (8 more replies)
  9 siblings, 9 replies; 331+ messages in thread
From: Tony Luck @ 2023-10-31 21:17 UTC (permalink / raw)
  To: Fenghua Yu, Reinette Chatre, Peter Newman, Jonathan Corbet,
	Shuah Khan, x86
  Cc: Shaopeng Tan, James Morse, Jamie Iles, Babu Moger, Randy Dunlap,
	linux-kernel, linux-doc, patches, Tony Luck

The Sub-NUMA cluster feature on some Intel processors partitions the CPUs
that share an L3 cache into two or more sets. This plays havoc with the
Resource Director Technology (RDT) monitoring features. Prior to this
patch series, Intel advised that SNC and RDT are incompatible.

Some of these CPUs support an MSR that can partition the RMID counters in
the same way. This allows the monitoring features to be used, with the
caveat that Linux may migrate tasks more frequently between SNC nodes than
between "regular" NUMA nodes, so reading counters from all SNC nodes may be
needed to get a complete picture of activity for a task.

Cache and memory bandwidth allocation features continue to operate at
the scope of the L3 cache.
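
To illustrate the caveat about counters above, a minimal userspace sketch
(not part of this series) of summing one event across the SNC nodes of a
socket; read_snc_node_count() is a hypothetical stand-in for reading one
node's counter, e.g. a value from a mon_data/mon_L3_<id> directory:

#include <stdio.h>

/* Hypothetical stand-in for reading one SNC node's counter value. */
static unsigned long long read_snc_node_count(int snc_node)
{
	/* Placeholder values; a real tool would read the resctrl files. */
	static const unsigned long long sample[] = { 1234567, 890123 };

	return sample[snc_node];
}

int main(void)
{
	int snc_nodes_on_socket = 2;	/* assumed SNC configuration */
	unsigned long long total = 0;
	int node;

	/* A task may have run on any SNC node, so sum them all. */
	for (node = 0; node < snc_nodes_on_socket; node++)
		total += read_snc_node_count(node);

	printf("socket-wide count: %llu\n", total);
	return 0;
}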

Signed-off-by: Tony Luck <tony.luck@intel.com>

---

Dropped Peter's "Reviewed-by" from all but parts 5 & 8 since there
have been many changes since he provided those.

Other changes since v9 (all from Reinette's comments)

global s/cpu/CPU/ in commit messages and code comments

#1
New test for invalid domain id before calling rdt_find_domain() means that
error handling in that function and at all call-sites can be simplified.
In pseudo_lock_region_init() use the new enum resctrl_scope for local variable.

#2
Include *all* common fields in the rdt_domain_hdr. Defer adding "type" until it is
used later in part #3.

#3
Fix commit message to be specific that only the RDT_RESOURCE_L3 resource is
going to have different monitor and control scope.
Rename get_domain_from_cpu() -> get_ctrl_domain_from_cpu()
Rewrite comment for rdt_find_domain().
Add "type" field to rdt_domain_hdr structure.
Delete the /* RDT_RESOURCE_MBA is never mon_capable */ comment.

#4
Comment was against patch #4, but is now addressed in patch #2: cpu_mask
is included in the common header.

#5
No comments. No changes.

#6
Fixed missing word s/monitoring on Intel/monitoring on an Intel/
Deleted "A later patch" paragraph.
Expanded description of how values are "adjusted" for mon_scale
and cache size.
Changed type of "snc_nodes_per_l3_cache" to "unsigned int".

#7
Expand h/w to hardware (commit and code comments)
Remove "earlier commit" reference
s/counnter/counter/
Check for offline CPUs and warn the user that SNC detection may be broken.

#8
No comments. No changes.

Tony Luck (8):
  x86/resctrl: Prepare for new domain scope
  x86/resctrl: Prepare to split rdt_domain structure
  x86/resctrl: Prepare for different scope for control/monitor
    operations
  x86/resctrl: Split the rdt_domain and rdt_hw_domain structures
  x86/resctrl: Add node-scope to the options for feature scope
  x86/resctrl: Introduce snc_nodes_per_l3_cache
  x86/resctrl: Sub NUMA Cluster detection and enable
  x86/resctrl: Update documentation with Sub-NUMA cluster changes

 Documentation/arch/x86/resctrl.rst        |  23 +-
 include/linux/resctrl.h                   |  87 +++--
 arch/x86/include/asm/msr-index.h          |   1 +
 arch/x86/kernel/cpu/resctrl/internal.h    |  66 ++--
 arch/x86/kernel/cpu/resctrl/core.c        | 411 +++++++++++++++++-----
 arch/x86/kernel/cpu/resctrl/ctrlmondata.c |  58 +--
 arch/x86/kernel/cpu/resctrl/monitor.c     |  68 ++--
 arch/x86/kernel/cpu/resctrl/pseudo_lock.c |  26 +-
 arch/x86/kernel/cpu/resctrl/rdtgroup.c    | 149 ++++----
 9 files changed, 607 insertions(+), 282 deletions(-)


base-commit: 5a6a09e97199d6600d31383055f9d43fbbcbe86f
-- 
2.41.0


^ permalink raw reply	[flat|nested] 331+ messages in thread

* [PATCH v10 1/8] x86/resctrl: Prepare for new domain scope
  2023-10-31 21:17           ` [PATCH v10 " Tony Luck
@ 2023-10-31 21:17             ` Tony Luck
  2023-11-07  0:31               ` Reinette Chatre
  2023-10-31 21:17             ` [PATCH v10 2/8] x86/resctrl: Prepare to split rdt_domain structure Tony Luck
                               ` (7 subsequent siblings)
  8 siblings, 1 reply; 331+ messages in thread
From: Tony Luck @ 2023-10-31 21:17 UTC (permalink / raw)
  To: Fenghua Yu, Reinette Chatre, Peter Newman, Jonathan Corbet,
	Shuah Khan, x86
  Cc: Shaopeng Tan, James Morse, Jamie Iles, Babu Moger, Randy Dunlap,
	linux-kernel, linux-doc, patches, Tony Luck

Resctrl resources operate on subsets of CPUs in the system with the
defining attribute of each subset being an instance of a particular
level of cache. E.g. all CPUs sharing an L3 cache would be part of the
same domain.

In preparation for features that are scoped at the NUMA node level,
change the code from explicit references to "cache_level" to a more
generic scope. At this point the only options for this scope are groups
of CPUs that share an L2 cache or L3 cache.

Clean up the error handling when looking up domains. Report invalid ids
before calling rdt_find_domain() in preparation for better messages when
the scope can be something other than a cache. This means that
rdt_find_domain() will never return an error, so remove the error checks
from the callsites.

Signed-off-by: Tony Luck <tony.luck@intel.com>
---
Changes since v9
New test for invalid domain id before calling rdt_find_domain() means that
error handling in that function and at all call-sites can be simplified.
In pseudo_lock_region_init() use the new enum resctrl_scope for local variable.

 include/linux/resctrl.h                   |  9 +++--
 arch/x86/kernel/cpu/resctrl/core.c        | 40 +++++++++++++++--------
 arch/x86/kernel/cpu/resctrl/ctrlmondata.c |  2 +-
 arch/x86/kernel/cpu/resctrl/pseudo_lock.c |  6 +++-
 arch/x86/kernel/cpu/resctrl/rdtgroup.c    |  5 ++-
 5 files changed, 44 insertions(+), 18 deletions(-)

diff --git a/include/linux/resctrl.h b/include/linux/resctrl.h
index 66942d7fba7f..7d4eb7df611d 100644
--- a/include/linux/resctrl.h
+++ b/include/linux/resctrl.h
@@ -144,13 +144,18 @@ struct resctrl_membw {
 struct rdt_parse_data;
 struct resctrl_schema;
 
+enum resctrl_scope {
+	RESCTRL_L2_CACHE = 2,
+	RESCTRL_L3_CACHE = 3,
+};
+
 /**
  * struct rdt_resource - attributes of a resctrl resource
  * @rid:		The index of the resource
  * @alloc_capable:	Is allocation available on this machine
  * @mon_capable:	Is monitor feature available on this machine
  * @num_rmid:		Number of RMIDs available
- * @cache_level:	Which cache level defines scope of this resource
+ * @scope:		Scope of this resource
  * @cache:		Cache allocation related data
  * @membw:		If the component has bandwidth controls, their properties.
  * @domains:		All domains for this resource
@@ -168,7 +173,7 @@ struct rdt_resource {
 	bool			alloc_capable;
 	bool			mon_capable;
 	int			num_rmid;
-	int			cache_level;
+	enum resctrl_scope	scope;
 	struct resctrl_cache	cache;
 	struct resctrl_membw	membw;
 	struct list_head	domains;
diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
index 19e0681f0435..47f92390edbb 100644
--- a/arch/x86/kernel/cpu/resctrl/core.c
+++ b/arch/x86/kernel/cpu/resctrl/core.c
@@ -65,7 +65,7 @@ struct rdt_hw_resource rdt_resources_all[] = {
 		.r_resctrl = {
 			.rid			= RDT_RESOURCE_L3,
 			.name			= "L3",
-			.cache_level		= 3,
+			.scope			= RESCTRL_L3_CACHE,
 			.domains		= domain_init(RDT_RESOURCE_L3),
 			.parse_ctrlval		= parse_cbm,
 			.format_str		= "%d=%0*x",
@@ -79,7 +79,7 @@ struct rdt_hw_resource rdt_resources_all[] = {
 		.r_resctrl = {
 			.rid			= RDT_RESOURCE_L2,
 			.name			= "L2",
-			.cache_level		= 2,
+			.scope			= RESCTRL_L2_CACHE,
 			.domains		= domain_init(RDT_RESOURCE_L2),
 			.parse_ctrlval		= parse_cbm,
 			.format_str		= "%d=%0*x",
@@ -93,7 +93,7 @@ struct rdt_hw_resource rdt_resources_all[] = {
 		.r_resctrl = {
 			.rid			= RDT_RESOURCE_MBA,
 			.name			= "MB",
-			.cache_level		= 3,
+			.scope			= RESCTRL_L3_CACHE,
 			.domains		= domain_init(RDT_RESOURCE_MBA),
 			.parse_ctrlval		= parse_bw,
 			.format_str		= "%d=%*u",
@@ -105,7 +105,7 @@ struct rdt_hw_resource rdt_resources_all[] = {
 		.r_resctrl = {
 			.rid			= RDT_RESOURCE_SMBA,
 			.name			= "SMBA",
-			.cache_level		= 3,
+			.scope			= RESCTRL_L3_CACHE,
 			.domains		= domain_init(RDT_RESOURCE_SMBA),
 			.parse_ctrlval		= parse_bw,
 			.format_str		= "%d=%*u",
@@ -401,9 +401,6 @@ struct rdt_domain *rdt_find_domain(struct rdt_resource *r, int id,
 	struct rdt_domain *d;
 	struct list_head *l;
 
-	if (id < 0)
-		return ERR_PTR(-ENODEV);
-
 	list_for_each(l, &r->domains) {
 		d = list_entry(l, struct rdt_domain, list);
 		/* When id is found, return its domain. */
@@ -491,6 +488,19 @@ static int arch_domain_mbm_alloc(u32 num_rmid, struct rdt_hw_domain *hw_dom)
 	return 0;
 }
 
+static int get_domain_id_from_scope(int cpu, enum resctrl_scope scope)
+{
+	switch (scope) {
+	case RESCTRL_L2_CACHE:
+	case RESCTRL_L3_CACHE:
+		return get_cpu_cacheinfo_id(cpu, scope);
+	default:
+		break;
+	}
+
+	return -EINVAL;
+}
+
 /*
  * domain_add_cpu - Add a cpu to a resource's domain list.
  *
@@ -506,17 +516,18 @@ static int arch_domain_mbm_alloc(u32 num_rmid, struct rdt_hw_domain *hw_dom)
  */
 static void domain_add_cpu(int cpu, struct rdt_resource *r)
 {
-	int id = get_cpu_cacheinfo_id(cpu, r->cache_level);
+	int id = get_domain_id_from_scope(cpu, r->scope);
 	struct list_head *add_pos = NULL;
 	struct rdt_hw_domain *hw_dom;
 	struct rdt_domain *d;
 	int err;
 
-	d = rdt_find_domain(r, id, &add_pos);
-	if (IS_ERR(d)) {
-		pr_warn("Couldn't find cache id for CPU %d\n", cpu);
+	if (id < 0) {
+		pr_warn_once("Can't find domain id for CPU:%d scope:%d for resource %s\n",
+			     cpu, r->scope, r->name);
 		return;
 	}
+	d = rdt_find_domain(r, id, &add_pos);
 
 	if (d) {
 		cpumask_set_cpu(cpu, &d->cpu_mask);
@@ -556,12 +567,15 @@ static void domain_add_cpu(int cpu, struct rdt_resource *r)
 
 static void domain_remove_cpu(int cpu, struct rdt_resource *r)
 {
-	int id = get_cpu_cacheinfo_id(cpu, r->cache_level);
+	int id = get_domain_id_from_scope(cpu, r->scope);
 	struct rdt_hw_domain *hw_dom;
 	struct rdt_domain *d;
 
+	if (id < 0)
+		return;
+
 	d = rdt_find_domain(r, id, NULL);
-	if (IS_ERR_OR_NULL(d)) {
+	if (!d) {
 		pr_warn("Couldn't find cache id for CPU %d\n", cpu);
 		return;
 	}
diff --git a/arch/x86/kernel/cpu/resctrl/ctrlmondata.c b/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
index beccb0e87ba7..3f8891d57fac 100644
--- a/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
+++ b/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
@@ -563,7 +563,7 @@ int rdtgroup_mondata_show(struct seq_file *m, void *arg)
 
 	r = &rdt_resources_all[resid].r_resctrl;
 	d = rdt_find_domain(r, domid, NULL);
-	if (IS_ERR_OR_NULL(d)) {
+	if (!d) {
 		ret = -ENOENT;
 		goto out;
 	}
diff --git a/arch/x86/kernel/cpu/resctrl/pseudo_lock.c b/arch/x86/kernel/cpu/resctrl/pseudo_lock.c
index 8f559eeae08e..2a682da9f43a 100644
--- a/arch/x86/kernel/cpu/resctrl/pseudo_lock.c
+++ b/arch/x86/kernel/cpu/resctrl/pseudo_lock.c
@@ -292,10 +292,14 @@ static void pseudo_lock_region_clear(struct pseudo_lock_region *plr)
  */
 static int pseudo_lock_region_init(struct pseudo_lock_region *plr)
 {
+	enum resctrl_scope scope = plr->s->res->scope;
 	struct cpu_cacheinfo *ci;
 	int ret;
 	int i;
 
+	if (WARN_ON_ONCE(scope != RESCTRL_L2_CACHE && scope != RESCTRL_L3_CACHE))
+		return -ENODEV;
+
 	/* Pick the first cpu we find that is associated with the cache. */
 	plr->cpu = cpumask_first(&plr->d->cpu_mask);
 
@@ -311,7 +315,7 @@ static int pseudo_lock_region_init(struct pseudo_lock_region *plr)
 	plr->size = rdtgroup_cbm_to_size(plr->s->res, plr->d, plr->cbm);
 
 	for (i = 0; i < ci->num_leaves; i++) {
-		if (ci->info_list[i].level == plr->s->res->cache_level) {
+		if (ci->info_list[i].level == scope) {
 			plr->line_size = ci->info_list[i].coherency_line_size;
 			return 0;
 		}
diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
index 69a1de92384a..c44be64d65ec 100644
--- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
+++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
@@ -1413,10 +1413,13 @@ unsigned int rdtgroup_cbm_to_size(struct rdt_resource *r,
 	unsigned int size = 0;
 	int num_b, i;
 
+	if (WARN_ON_ONCE(r->scope != RESCTRL_L2_CACHE && r->scope != RESCTRL_L3_CACHE))
+		return size;
+
 	num_b = bitmap_weight(&cbm, r->cache.cbm_len);
 	ci = get_cpu_cacheinfo(cpumask_any(&d->cpu_mask));
 	for (i = 0; i < ci->num_leaves; i++) {
-		if (ci->info_list[i].level == r->cache_level) {
+		if (ci->info_list[i].level == r->scope) {
 			size = ci->info_list[i].size / r->cache.cbm_len * num_b;
 			break;
 		}
-- 
2.41.0


^ permalink raw reply related	[flat|nested] 331+ messages in thread

* [PATCH v10 2/8] x86/resctrl: Prepare to split rdt_domain structure
  2023-10-31 21:17           ` [PATCH v10 " Tony Luck
  2023-10-31 21:17             ` [PATCH v10 1/8] x86/resctrl: Prepare for new domain scope Tony Luck
@ 2023-10-31 21:17             ` Tony Luck
  2023-10-31 21:17             ` [PATCH v10 3/8] x86/resctrl: Prepare for different scope for control/monitor operations Tony Luck
                               ` (6 subsequent siblings)
  8 siblings, 0 replies; 331+ messages in thread
From: Tony Luck @ 2023-10-31 21:17 UTC (permalink / raw)
  To: Fenghua Yu, Reinette Chatre, Peter Newman, Jonathan Corbet,
	Shuah Khan, x86
  Cc: Shaopeng Tan, James Morse, Jamie Iles, Babu Moger, Randy Dunlap,
	linux-kernel, linux-doc, patches, Tony Luck

The rdt_domain structure is used for both control and monitor features.
It is about to be split into separate structures for these two usages
because the scope for control and monitoring features for a resource
will be different for future resources.

To allow for common code that scans a list of domains looking for a
specific domain id, move all the common fields ("list", "id", "cpu_mask")
into their own structure within the rdt_domain structure.

Signed-off-by: Tony Luck <tony.luck@intel.com>
---
Changes since v9
Include *all* common fields in the rdt_domain_hdr. Defer adding "type" until it is
used later in part #3.

 include/linux/resctrl.h                   | 16 ++++--
 arch/x86/kernel/cpu/resctrl/core.c        | 26 +++++-----
 arch/x86/kernel/cpu/resctrl/ctrlmondata.c | 22 ++++-----
 arch/x86/kernel/cpu/resctrl/monitor.c     | 10 ++--
 arch/x86/kernel/cpu/resctrl/pseudo_lock.c | 14 +++---
 arch/x86/kernel/cpu/resctrl/rdtgroup.c    | 60 +++++++++++------------
 6 files changed, 78 insertions(+), 70 deletions(-)

diff --git a/include/linux/resctrl.h b/include/linux/resctrl.h
index 7d4eb7df611d..c4067150a6b7 100644
--- a/include/linux/resctrl.h
+++ b/include/linux/resctrl.h
@@ -53,10 +53,20 @@ struct resctrl_staged_config {
 };
 
 /**
- * struct rdt_domain - group of CPUs sharing a resctrl resource
+ * struct rdt_domain_hdr - common header for different domain types
  * @list:		all instances of this resource
  * @id:			unique id for this instance
  * @cpu_mask:		which CPUs share this resource
+ */
+struct rdt_domain_hdr {
+	struct list_head		list;
+	int				id;
+	struct cpumask			cpu_mask;
+};
+
+/**
+ * struct rdt_domain - group of CPUs sharing a resctrl resource
+ * @hdr:		common header for different domain types
  * @rmid_busy_llc:	bitmap of which limbo RMIDs are above threshold
  * @mbm_total:		saved state for MBM total bandwidth
  * @mbm_local:		saved state for MBM local bandwidth
@@ -71,9 +81,7 @@ struct resctrl_staged_config {
  *			by closid
  */
 struct rdt_domain {
-	struct list_head		list;
-	int				id;
-	struct cpumask			cpu_mask;
+	struct rdt_domain_hdr		hdr;
 	unsigned long			*rmid_busy_llc;
 	struct mbm_state		*mbm_total;
 	struct mbm_state		*mbm_local;
diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
index 47f92390edbb..c26ecb2e415f 100644
--- a/arch/x86/kernel/cpu/resctrl/core.c
+++ b/arch/x86/kernel/cpu/resctrl/core.c
@@ -356,9 +356,9 @@ struct rdt_domain *get_domain_from_cpu(int cpu, struct rdt_resource *r)
 {
 	struct rdt_domain *d;
 
-	list_for_each_entry(d, &r->domains, list) {
+	list_for_each_entry(d, &r->domains, hdr.list) {
 		/* Find the domain that contains this CPU */
-		if (cpumask_test_cpu(cpu, &d->cpu_mask))
+		if (cpumask_test_cpu(cpu, &d->hdr.cpu_mask))
 			return d;
 	}
 
@@ -402,12 +402,12 @@ struct rdt_domain *rdt_find_domain(struct rdt_resource *r, int id,
 	struct list_head *l;
 
 	list_for_each(l, &r->domains) {
-		d = list_entry(l, struct rdt_domain, list);
+		d = list_entry(l, struct rdt_domain, hdr.list);
 		/* When id is found, return its domain. */
-		if (id == d->id)
+		if (id == d->hdr.id)
 			return d;
 		/* Stop searching when finding id's position in sorted list. */
-		if (id < d->id)
+		if (id < d->hdr.id)
 			break;
 	}
 
@@ -530,7 +530,7 @@ static void domain_add_cpu(int cpu, struct rdt_resource *r)
 	d = rdt_find_domain(r, id, &add_pos);
 
 	if (d) {
-		cpumask_set_cpu(cpu, &d->cpu_mask);
+		cpumask_set_cpu(cpu, &d->hdr.cpu_mask);
 		if (r->cache.arch_has_per_cpu_cfg)
 			rdt_domain_reconfigure_cdp(r);
 		return;
@@ -541,8 +541,8 @@ static void domain_add_cpu(int cpu, struct rdt_resource *r)
 		return;
 
 	d = &hw_dom->d_resctrl;
-	d->id = id;
-	cpumask_set_cpu(cpu, &d->cpu_mask);
+	d->hdr.id = id;
+	cpumask_set_cpu(cpu, &d->hdr.cpu_mask);
 
 	rdt_domain_reconfigure_cdp(r);
 
@@ -556,11 +556,11 @@ static void domain_add_cpu(int cpu, struct rdt_resource *r)
 		return;
 	}
 
-	list_add_tail(&d->list, add_pos);
+	list_add_tail(&d->hdr.list, add_pos);
 
 	err = resctrl_online_domain(r, d);
 	if (err) {
-		list_del(&d->list);
+		list_del(&d->hdr.list);
 		domain_free(hw_dom);
 	}
 }
@@ -581,10 +581,10 @@ static void domain_remove_cpu(int cpu, struct rdt_resource *r)
 	}
 	hw_dom = resctrl_to_arch_dom(d);
 
-	cpumask_clear_cpu(cpu, &d->cpu_mask);
-	if (cpumask_empty(&d->cpu_mask)) {
+	cpumask_clear_cpu(cpu, &d->hdr.cpu_mask);
+	if (cpumask_empty(&d->hdr.cpu_mask)) {
 		resctrl_offline_domain(r, d);
-		list_del(&d->list);
+		list_del(&d->hdr.list);
 
 		/*
 		 * rdt_domain "d" is going to be freed below, so clear
diff --git a/arch/x86/kernel/cpu/resctrl/ctrlmondata.c b/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
index 3f8891d57fac..23f8258d36a8 100644
--- a/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
+++ b/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
@@ -67,7 +67,7 @@ int parse_bw(struct rdt_parse_data *data, struct resctrl_schema *s,
 
 	cfg = &d->staged_config[s->conf_type];
 	if (cfg->have_new_ctrl) {
-		rdt_last_cmd_printf("Duplicate domain %d\n", d->id);
+		rdt_last_cmd_printf("Duplicate domain %d\n", d->hdr.id);
 		return -EINVAL;
 	}
 
@@ -146,7 +146,7 @@ int parse_cbm(struct rdt_parse_data *data, struct resctrl_schema *s,
 
 	cfg = &d->staged_config[s->conf_type];
 	if (cfg->have_new_ctrl) {
-		rdt_last_cmd_printf("Duplicate domain %d\n", d->id);
+		rdt_last_cmd_printf("Duplicate domain %d\n", d->hdr.id);
 		return -EINVAL;
 	}
 
@@ -226,8 +226,8 @@ static int parse_line(char *line, struct resctrl_schema *s,
 		return -EINVAL;
 	}
 	dom = strim(dom);
-	list_for_each_entry(d, &r->domains, list) {
-		if (d->id == dom_id) {
+	list_for_each_entry(d, &r->domains, hdr.list) {
+		if (d->hdr.id == dom_id) {
 			data.buf = dom;
 			data.rdtgrp = rdtgrp;
 			if (r->parse_ctrlval(&data, s, d))
@@ -274,7 +274,7 @@ static bool apply_config(struct rdt_hw_domain *hw_dom,
 	struct rdt_domain *dom = &hw_dom->d_resctrl;
 
 	if (cfg->new_ctrl != hw_dom->ctrl_val[idx]) {
-		cpumask_set_cpu(cpumask_any(&dom->cpu_mask), cpu_mask);
+		cpumask_set_cpu(cpumask_any(&dom->hdr.cpu_mask), cpu_mask);
 		hw_dom->ctrl_val[idx] = cfg->new_ctrl;
 
 		return true;
@@ -291,7 +291,7 @@ int resctrl_arch_update_one(struct rdt_resource *r, struct rdt_domain *d,
 	u32 idx = get_config_index(closid, t);
 	struct msr_param msr_param;
 
-	if (!cpumask_test_cpu(smp_processor_id(), &d->cpu_mask))
+	if (!cpumask_test_cpu(smp_processor_id(), &d->hdr.cpu_mask))
 		return -EINVAL;
 
 	hw_dom->ctrl_val[idx] = cfg_val;
@@ -318,7 +318,7 @@ int resctrl_arch_update_domains(struct rdt_resource *r, u32 closid)
 		return -ENOMEM;
 
 	msr_param.res = NULL;
-	list_for_each_entry(d, &r->domains, list) {
+	list_for_each_entry(d, &r->domains, hdr.list) {
 		hw_dom = resctrl_to_arch_dom(d);
 		for (t = 0; t < CDP_NUM_TYPES; t++) {
 			cfg = &hw_dom->d_resctrl.staged_config[t];
@@ -466,7 +466,7 @@ static void show_doms(struct seq_file *s, struct resctrl_schema *schema, int clo
 	u32 ctrl_val;
 
 	seq_printf(s, "%*s:", max_name_width, schema->name);
-	list_for_each_entry(dom, &r->domains, list) {
+	list_for_each_entry(dom, &r->domains, hdr.list) {
 		if (sep)
 			seq_puts(s, ";");
 
@@ -476,7 +476,7 @@ static void show_doms(struct seq_file *s, struct resctrl_schema *schema, int clo
 			ctrl_val = resctrl_arch_get_config(r, dom, closid,
 							   schema->conf_type);
 
-		seq_printf(s, r->format_str, dom->id, max_data_width,
+		seq_printf(s, r->format_str, dom->hdr.id, max_data_width,
 			   ctrl_val);
 		sep = true;
 	}
@@ -505,7 +505,7 @@ int rdtgroup_schemata_show(struct kernfs_open_file *of,
 			} else {
 				seq_printf(s, "%s:%d=%x\n",
 					   rdtgrp->plr->s->res->name,
-					   rdtgrp->plr->d->id,
+					   rdtgrp->plr->d->hdr.id,
 					   rdtgrp->plr->cbm);
 			}
 		} else {
@@ -536,7 +536,7 @@ void mon_event_read(struct rmid_read *rr, struct rdt_resource *r,
 	rr->val = 0;
 	rr->first = first;
 
-	smp_call_function_any(&d->cpu_mask, mon_event_count, rr, 1);
+	smp_call_function_any(&d->hdr.cpu_mask, mon_event_count, rr, 1);
 }
 
 int rdtgroup_mondata_show(struct seq_file *m, void *arg)
diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c
index f136ac046851..dd0ea1bc0092 100644
--- a/arch/x86/kernel/cpu/resctrl/monitor.c
+++ b/arch/x86/kernel/cpu/resctrl/monitor.c
@@ -238,7 +238,7 @@ int resctrl_arch_rmid_read(struct rdt_resource *r, struct rdt_domain *d,
 	u64 msr_val, chunks;
 	int ret;
 
-	if (!cpumask_test_cpu(smp_processor_id(), &d->cpu_mask))
+	if (!cpumask_test_cpu(smp_processor_id(), &d->hdr.cpu_mask))
 		return -EINVAL;
 
 	ret = __rmid_read(rmid, eventid, &msr_val);
@@ -340,8 +340,8 @@ static void add_rmid_to_limbo(struct rmid_entry *entry)
 
 	entry->busy = 0;
 	cpu = get_cpu();
-	list_for_each_entry(d, &r->domains, list) {
-		if (cpumask_test_cpu(cpu, &d->cpu_mask)) {
+	list_for_each_entry(d, &r->domains, hdr.list) {
+		if (cpumask_test_cpu(cpu, &d->hdr.cpu_mask)) {
 			err = resctrl_arch_rmid_read(r, d, entry->rmid,
 						     QOS_L3_OCCUP_EVENT_ID,
 						     &val);
@@ -661,7 +661,7 @@ void cqm_setup_limbo_handler(struct rdt_domain *dom, unsigned long delay_ms)
 	unsigned long delay = msecs_to_jiffies(delay_ms);
 	int cpu;
 
-	cpu = cpumask_any(&dom->cpu_mask);
+	cpu = cpumask_any(&dom->hdr.cpu_mask);
 	dom->cqm_work_cpu = cpu;
 
 	schedule_delayed_work_on(cpu, &dom->cqm_limbo, delay);
@@ -708,7 +708,7 @@ void mbm_setup_overflow_handler(struct rdt_domain *dom, unsigned long delay_ms)
 
 	if (!static_branch_likely(&rdt_mon_enable_key))
 		return;
-	cpu = cpumask_any(&dom->cpu_mask);
+	cpu = cpumask_any(&dom->hdr.cpu_mask);
 	dom->mbm_work_cpu = cpu;
 	schedule_delayed_work_on(cpu, &dom->mbm_over, delay);
 }
diff --git a/arch/x86/kernel/cpu/resctrl/pseudo_lock.c b/arch/x86/kernel/cpu/resctrl/pseudo_lock.c
index 2a682da9f43a..fcbd99e2eb66 100644
--- a/arch/x86/kernel/cpu/resctrl/pseudo_lock.c
+++ b/arch/x86/kernel/cpu/resctrl/pseudo_lock.c
@@ -221,7 +221,7 @@ static int pseudo_lock_cstates_constrain(struct pseudo_lock_region *plr)
 	int cpu;
 	int ret;
 
-	for_each_cpu(cpu, &plr->d->cpu_mask) {
+	for_each_cpu(cpu, &plr->d->hdr.cpu_mask) {
 		pm_req = kzalloc(sizeof(*pm_req), GFP_KERNEL);
 		if (!pm_req) {
 			rdt_last_cmd_puts("Failure to allocate memory for PM QoS\n");
@@ -301,7 +301,7 @@ static int pseudo_lock_region_init(struct pseudo_lock_region *plr)
 		return -ENODEV;
 
 	/* Pick the first cpu we find that is associated with the cache. */
-	plr->cpu = cpumask_first(&plr->d->cpu_mask);
+	plr->cpu = cpumask_first(&plr->d->hdr.cpu_mask);
 
 	if (!cpu_online(plr->cpu)) {
 		rdt_last_cmd_printf("CPU %u associated with cache not online\n",
@@ -856,10 +856,10 @@ bool rdtgroup_pseudo_locked_in_hierarchy(struct rdt_domain *d)
 	 * associated with them.
 	 */
 	for_each_alloc_capable_rdt_resource(r) {
-		list_for_each_entry(d_i, &r->domains, list) {
+		list_for_each_entry(d_i, &r->domains, hdr.list) {
 			if (d_i->plr)
 				cpumask_or(cpu_with_psl, cpu_with_psl,
-					   &d_i->cpu_mask);
+					   &d_i->hdr.cpu_mask);
 		}
 	}
 
@@ -867,7 +867,7 @@ bool rdtgroup_pseudo_locked_in_hierarchy(struct rdt_domain *d)
 	 * Next test if new pseudo-locked region would intersect with
 	 * existing region.
 	 */
-	if (cpumask_intersects(&d->cpu_mask, cpu_with_psl))
+	if (cpumask_intersects(&d->hdr.cpu_mask, cpu_with_psl))
 		ret = true;
 
 	free_cpumask_var(cpu_with_psl);
@@ -1199,7 +1199,7 @@ static int pseudo_lock_measure_cycles(struct rdtgroup *rdtgrp, int sel)
 	}
 
 	plr->thread_done = 0;
-	cpu = cpumask_first(&plr->d->cpu_mask);
+	cpu = cpumask_first(&plr->d->hdr.cpu_mask);
 	if (!cpu_online(cpu)) {
 		ret = -ENODEV;
 		goto out;
@@ -1529,7 +1529,7 @@ static int pseudo_lock_dev_mmap(struct file *filp, struct vm_area_struct *vma)
 	 * may be scheduled elsewhere and invalidate entries in the
 	 * pseudo-locked region.
 	 */
-	if (!cpumask_subset(current->cpus_ptr, &plr->d->cpu_mask)) {
+	if (!cpumask_subset(current->cpus_ptr, &plr->d->hdr.cpu_mask)) {
 		mutex_unlock(&rdtgroup_mutex);
 		return -EINVAL;
 	}
diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
index c44be64d65ec..04d32602ac33 100644
--- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
+++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
@@ -91,7 +91,7 @@ void rdt_staged_configs_clear(void)
 	lockdep_assert_held(&rdtgroup_mutex);
 
 	for_each_alloc_capable_rdt_resource(r) {
-		list_for_each_entry(dom, &r->domains, list)
+		list_for_each_entry(dom, &r->domains, hdr.list)
 			memset(dom->staged_config, 0, sizeof(dom->staged_config));
 	}
 }
@@ -295,7 +295,7 @@ static int rdtgroup_cpus_show(struct kernfs_open_file *of,
 				rdt_last_cmd_puts("Cache domain offline\n");
 				ret = -ENODEV;
 			} else {
-				mask = &rdtgrp->plr->d->cpu_mask;
+				mask = &rdtgrp->plr->d->hdr.cpu_mask;
 				seq_printf(s, is_cpu_list(of) ?
 					   "%*pbl\n" : "%*pb\n",
 					   cpumask_pr_args(mask));
@@ -984,12 +984,12 @@ static int rdt_bit_usage_show(struct kernfs_open_file *of,
 
 	mutex_lock(&rdtgroup_mutex);
 	hw_shareable = r->cache.shareable_bits;
-	list_for_each_entry(dom, &r->domains, list) {
+	list_for_each_entry(dom, &r->domains, hdr.list) {
 		if (sep)
 			seq_putc(seq, ';');
 		sw_shareable = 0;
 		exclusive = 0;
-		seq_printf(seq, "%d=", dom->id);
+		seq_printf(seq, "%d=", dom->hdr.id);
 		for (i = 0; i < closids_supported(); i++) {
 			if (!closid_allocated(i))
 				continue;
@@ -1302,7 +1302,7 @@ static bool rdtgroup_mode_test_exclusive(struct rdtgroup *rdtgrp)
 		if (r->rid == RDT_RESOURCE_MBA || r->rid == RDT_RESOURCE_SMBA)
 			continue;
 		has_cache = true;
-		list_for_each_entry(d, &r->domains, list) {
+		list_for_each_entry(d, &r->domains, hdr.list) {
 			ctrl = resctrl_arch_get_config(r, d, closid,
 						       s->conf_type);
 			if (rdtgroup_cbm_overlaps(s, d, ctrl, closid, false)) {
@@ -1417,7 +1417,7 @@ unsigned int rdtgroup_cbm_to_size(struct rdt_resource *r,
 		return size;
 
 	num_b = bitmap_weight(&cbm, r->cache.cbm_len);
-	ci = get_cpu_cacheinfo(cpumask_any(&d->cpu_mask));
+	ci = get_cpu_cacheinfo(cpumask_any(&d->hdr.cpu_mask));
 	for (i = 0; i < ci->num_leaves; i++) {
 		if (ci->info_list[i].level == r->scope) {
 			size = ci->info_list[i].size / r->cache.cbm_len * num_b;
@@ -1465,7 +1465,7 @@ static int rdtgroup_size_show(struct kernfs_open_file *of,
 			size = rdtgroup_cbm_to_size(rdtgrp->plr->s->res,
 						    rdtgrp->plr->d,
 						    rdtgrp->plr->cbm);
-			seq_printf(s, "%d=%u\n", rdtgrp->plr->d->id, size);
+			seq_printf(s, "%d=%u\n", rdtgrp->plr->d->hdr.id, size);
 		}
 		goto out;
 	}
@@ -1477,7 +1477,7 @@ static int rdtgroup_size_show(struct kernfs_open_file *of,
 		type = schema->conf_type;
 		sep = false;
 		seq_printf(s, "%*s:", max_name_width, schema->name);
-		list_for_each_entry(d, &r->domains, list) {
+		list_for_each_entry(d, &r->domains, hdr.list) {
 			if (sep)
 				seq_putc(s, ';');
 			if (rdtgrp->mode == RDT_MODE_PSEUDO_LOCKSETUP) {
@@ -1495,7 +1495,7 @@ static int rdtgroup_size_show(struct kernfs_open_file *of,
 				else
 					size = rdtgroup_cbm_to_size(r, d, ctrl);
 			}
-			seq_printf(s, "%d=%u", d->id, size);
+			seq_printf(s, "%d=%u", d->hdr.id, size);
 			sep = true;
 		}
 		seq_putc(s, '\n');
@@ -1555,7 +1555,7 @@ static void mon_event_config_read(void *info)
 
 static void mondata_config_read(struct rdt_domain *d, struct mon_config_info *mon_info)
 {
-	smp_call_function_any(&d->cpu_mask, mon_event_config_read, mon_info, 1);
+	smp_call_function_any(&d->hdr.cpu_mask, mon_event_config_read, mon_info, 1);
 }
 
 static int mbm_config_show(struct seq_file *s, struct rdt_resource *r, u32 evtid)
@@ -1566,7 +1566,7 @@ static int mbm_config_show(struct seq_file *s, struct rdt_resource *r, u32 evtid
 
 	mutex_lock(&rdtgroup_mutex);
 
-	list_for_each_entry(dom, &r->domains, list) {
+	list_for_each_entry(dom, &r->domains, hdr.list) {
 		if (sep)
 			seq_puts(s, ";");
 
@@ -1574,7 +1574,7 @@ static int mbm_config_show(struct seq_file *s, struct rdt_resource *r, u32 evtid
 		mon_info.evtid = evtid;
 		mondata_config_read(dom, &mon_info);
 
-		seq_printf(s, "%d=0x%02x", dom->id, mon_info.mon_config);
+		seq_printf(s, "%d=0x%02x", dom->hdr.id, mon_info.mon_config);
 		sep = true;
 	}
 	seq_puts(s, "\n");
@@ -1646,7 +1646,7 @@ static int mbm_config_write_domain(struct rdt_resource *r,
 	 * are scoped at the domain level. Writing any of these MSRs
 	 * on one CPU is observed by all the CPUs in the domain.
 	 */
-	smp_call_function_any(&d->cpu_mask, mon_event_config_write,
+	smp_call_function_any(&d->hdr.cpu_mask, mon_event_config_write,
 			      &mon_info, 1);
 
 	/*
@@ -1689,8 +1689,8 @@ static int mon_config_write(struct rdt_resource *r, char *tok, u32 evtid)
 		return -EINVAL;
 	}
 
-	list_for_each_entry(d, &r->domains, list) {
-		if (d->id == dom_id) {
+	list_for_each_entry(d, &r->domains, hdr.list) {
+		if (d->hdr.id == dom_id) {
 			ret = mbm_config_write_domain(r, d, evtid, val);
 			if (ret)
 				return -EINVAL;
@@ -2232,14 +2232,14 @@ static int set_cache_qos_cfg(int level, bool enable)
 		return -ENOMEM;
 
 	r_l = &rdt_resources_all[level].r_resctrl;
-	list_for_each_entry(d, &r_l->domains, list) {
+	list_for_each_entry(d, &r_l->domains, hdr.list) {
 		if (r_l->cache.arch_has_per_cpu_cfg)
 			/* Pick all the CPUs in the domain instance */
-			for_each_cpu(cpu, &d->cpu_mask)
+			for_each_cpu(cpu, &d->hdr.cpu_mask)
 				cpumask_set_cpu(cpu, cpu_mask);
 		else
 			/* Pick one CPU from each domain instance to update MSR */
-			cpumask_set_cpu(cpumask_any(&d->cpu_mask), cpu_mask);
+			cpumask_set_cpu(cpumask_any(&d->hdr.cpu_mask), cpu_mask);
 	}
 
 	/* Update QOS_CFG MSR on all the CPUs in cpu_mask */
@@ -2268,7 +2268,7 @@ void rdt_domain_reconfigure_cdp(struct rdt_resource *r)
 static int mba_sc_domain_allocate(struct rdt_resource *r, struct rdt_domain *d)
 {
 	u32 num_closid = resctrl_arch_get_num_closid(r);
-	int cpu = cpumask_any(&d->cpu_mask);
+	int cpu = cpumask_any(&d->hdr.cpu_mask);
 	int i;
 
 	d->mbps_val = kcalloc_node(num_closid, sizeof(*d->mbps_val),
@@ -2317,7 +2317,7 @@ static int set_mba_sc(bool mba_sc)
 
 	r->membw.mba_sc = mba_sc;
 
-	list_for_each_entry(d, &r->domains, list) {
+	list_for_each_entry(d, &r->domains, hdr.list) {
 		for (i = 0; i < num_closid; i++)
 			d->mbps_val[i] = MBA_MAX_MBPS;
 	}
@@ -2653,7 +2653,7 @@ static int rdt_get_tree(struct fs_context *fc)
 
 	if (is_mbm_enabled()) {
 		r = &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl;
-		list_for_each_entry(dom, &r->domains, list)
+		list_for_each_entry(dom, &r->domains, hdr.list)
 			mbm_setup_overflow_handler(dom, MBM_OVERFLOW_INTERVAL);
 	}
 
@@ -2780,9 +2780,9 @@ static int reset_all_ctrls(struct rdt_resource *r)
 	 * CBMs in all domains to the maximum mask value. Pick one CPU
 	 * from each domain to update the MSRs below.
 	 */
-	list_for_each_entry(d, &r->domains, list) {
+	list_for_each_entry(d, &r->domains, hdr.list) {
 		hw_dom = resctrl_to_arch_dom(d);
-		cpumask_set_cpu(cpumask_any(&d->cpu_mask), cpu_mask);
+		cpumask_set_cpu(cpumask_any(&d->hdr.cpu_mask), cpu_mask);
 
 		for (i = 0; i < hw_res->num_closid; i++)
 			hw_dom->ctrl_val[i] = r->default_ctrl;
@@ -2986,7 +2986,7 @@ static int mkdir_mondata_subdir(struct kernfs_node *parent_kn,
 	char name[32];
 	int ret;
 
-	sprintf(name, "mon_%s_%02d", r->name, d->id);
+	sprintf(name, "mon_%s_%02d", r->name, d->hdr.id);
 	/* create the directory */
 	kn = kernfs_create_dir(parent_kn, name, parent_kn->mode, prgrp);
 	if (IS_ERR(kn))
@@ -3002,7 +3002,7 @@ static int mkdir_mondata_subdir(struct kernfs_node *parent_kn,
 	}
 
 	priv.u.rid = r->rid;
-	priv.u.domid = d->id;
+	priv.u.domid = d->hdr.id;
 	list_for_each_entry(mevt, &r->evt_list, list) {
 		priv.u.evtid = mevt->evtid;
 		ret = mon_addfile(kn, mevt->name, priv.priv);
@@ -3050,7 +3050,7 @@ static int mkdir_mondata_subdir_alldom(struct kernfs_node *parent_kn,
 	struct rdt_domain *dom;
 	int ret;
 
-	list_for_each_entry(dom, &r->domains, list) {
+	list_for_each_entry(dom, &r->domains, hdr.list) {
 		ret = mkdir_mondata_subdir(parent_kn, dom, r, prgrp);
 		if (ret)
 			return ret;
@@ -3209,7 +3209,7 @@ static int __init_one_rdt_domain(struct rdt_domain *d, struct resctrl_schema *s,
 	 */
 	tmp_cbm = cfg->new_ctrl;
 	if (bitmap_weight(&tmp_cbm, r->cache.cbm_len) < r->cache.min_cbm_bits) {
-		rdt_last_cmd_printf("No space on %s:%d\n", s->name, d->id);
+		rdt_last_cmd_printf("No space on %s:%d\n", s->name, d->hdr.id);
 		return -ENOSPC;
 	}
 	cfg->have_new_ctrl = true;
@@ -3232,7 +3232,7 @@ static int rdtgroup_init_cat(struct resctrl_schema *s, u32 closid)
 	struct rdt_domain *d;
 	int ret;
 
-	list_for_each_entry(d, &s->res->domains, list) {
+	list_for_each_entry(d, &s->res->domains, hdr.list) {
 		ret = __init_one_rdt_domain(d, s, closid);
 		if (ret < 0)
 			return ret;
@@ -3247,7 +3247,7 @@ static void rdtgroup_init_mba(struct rdt_resource *r, u32 closid)
 	struct resctrl_staged_config *cfg;
 	struct rdt_domain *d;
 
-	list_for_each_entry(d, &r->domains, list) {
+	list_for_each_entry(d, &r->domains, hdr.list) {
 		if (is_mba_sc(r)) {
 			d->mbps_val[closid] = MBA_MAX_MBPS;
 			continue;
@@ -3864,7 +3864,7 @@ void resctrl_offline_domain(struct rdt_resource *r, struct rdt_domain *d)
 	 * per domain monitor data directories.
 	 */
 	if (static_branch_unlikely(&rdt_mon_enable_key))
-		rmdir_mondata_subdir_allrdtgrp(r, d->id);
+		rmdir_mondata_subdir_allrdtgrp(r, d->hdr.id);
 
 	if (is_mbm_enabled())
 		cancel_delayed_work(&d->mbm_over);
-- 
2.41.0


^ permalink raw reply related	[flat|nested] 331+ messages in thread

* [PATCH v10 3/8] x86/resctrl: Prepare for different scope for control/monitor operations
  2023-10-31 21:17           ` [PATCH v10 " Tony Luck
  2023-10-31 21:17             ` [PATCH v10 1/8] x86/resctrl: Prepare for new domain scope Tony Luck
  2023-10-31 21:17             ` [PATCH v10 2/8] x86/resctrl: Prepare to split rdt_domain structure Tony Luck
@ 2023-10-31 21:17             ` Tony Luck
  2023-11-01 19:53               ` Luck, Tony
  2023-11-07  0:32               ` Reinette Chatre
  2023-10-31 21:17             ` [PATCH v10 4/8] x86/resctrl: Split the rdt_domain and rdt_hw_domain structures Tony Luck
                               ` (5 subsequent siblings)
  8 siblings, 2 replies; 331+ messages in thread
From: Tony Luck @ 2023-10-31 21:17 UTC (permalink / raw)
  To: Fenghua Yu, Reinette Chatre, Peter Newman, Jonathan Corbet,
	Shuah Khan, x86
  Cc: Shaopeng Tan, James Morse, Jamie Iles, Babu Moger, Randy Dunlap,
	linux-kernel, linux-doc, patches, Tony Luck

Resctrl assumes that control and monitor operations on a resource are
performed at the same scope.

Prepare for systems that use different scopes for control and monitoring
(specifically, Intel needs to split the RDT_RESOURCE_L3 resource so that
it uses L3 scope for cache control and NODE scope for cache occupancy
and memory bandwidth monitoring).

Create separate domain lists for control and monitor operations.

Note that an error during initialization of either the control or the
monitor functions on a domain would previously result in that domain
being excluded from both control and monitor operations. Now that the
domains are allocated independently, it is no longer necessary to
disable both control and monitor operations if either fails.
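
For illustration only (not part of this patch): with the separate lists
and the common rdt_domain_hdr, a caller that looks up a domain recovers
the full structure from the header via container_of() after checking the
header type. A minimal sketch of that pattern, using the
rdt_find_domain() signature introduced below:

static struct rdt_domain *lookup_mon_domain(struct rdt_resource *r, int id)
{
	struct rdt_domain_hdr *hdr;

	/* Search the monitor domain list for this id. */
	hdr = rdt_find_domain(&r->mon_domains, id, NULL);
	if (IS_ERR_OR_NULL(hdr))
		return NULL;

	/* The header records which kind of list this entry is on. */
	if (WARN_ON_ONCE(hdr->type != RESCTRL_MON_DOMAIN))
		return NULL;

	return container_of(hdr, struct rdt_domain, hdr);
}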

Signed-off-by: Tony Luck <tony.luck@intel.com>
---
Changes since v9:
Fix commit message to be specific that only the RDT_RESOURCE_L3 resource
is going to have different monitor and control scope.
Rename get_domain_from_cpu() -> get_ctrl_domain_from_cpu()
Rewrite comment for rdt_find_domains().
Add "type" field to rdt_domain_hdr structure.
Delete the /* RDT_RESOURCE_MBA is never mon_capable */ comment.

 include/linux/resctrl.h                   |  25 ++-
 arch/x86/kernel/cpu/resctrl/internal.h    |   6 +-
 arch/x86/kernel/cpu/resctrl/core.c        | 206 ++++++++++++++++------
 arch/x86/kernel/cpu/resctrl/ctrlmondata.c |  12 +-
 arch/x86/kernel/cpu/resctrl/monitor.c     |   4 +-
 arch/x86/kernel/cpu/resctrl/pseudo_lock.c |   4 +-
 arch/x86/kernel/cpu/resctrl/rdtgroup.c    |  55 +++---
 7 files changed, 218 insertions(+), 94 deletions(-)

diff --git a/include/linux/resctrl.h b/include/linux/resctrl.h
index c4067150a6b7..35e700edc6e6 100644
--- a/include/linux/resctrl.h
+++ b/include/linux/resctrl.h
@@ -52,15 +52,22 @@ struct resctrl_staged_config {
 	bool			have_new_ctrl;
 };
 
+enum resctrl_domain_type {
+	RESCTRL_CTRL_DOMAIN,
+	RESCTRL_MON_DOMAIN,
+};
+
 /**
  * struct rdt_domain_hdr - common header for different domain types
  * @list:		all instances of this resource
  * @id:			unique id for this instance
+ * @type:		type of this instance
  * @cpu_mask:		which CPUs share this resource
  */
 struct rdt_domain_hdr {
 	struct list_head		list;
 	int				id;
+	enum resctrl_domain_type	type;
 	struct cpumask			cpu_mask;
 };
 
@@ -163,10 +170,12 @@ enum resctrl_scope {
  * @alloc_capable:	Is allocation available on this machine
  * @mon_capable:	Is monitor feature available on this machine
  * @num_rmid:		Number of RMIDs available
- * @scope:		Scope of this resource
+ * @ctrl_scope:		Scope of this resource for control functions
+ * @mon_scope:		Scope of this resource for monitor functions
  * @cache:		Cache allocation related data
  * @membw:		If the component has bandwidth controls, their properties.
- * @domains:		All domains for this resource
+ * @ctrl_domains:	Control domains for this resource
+ * @mon_domains:	Monitor domains for this resource
  * @name:		Name to use in "schemata" file.
  * @data_width:		Character width of data when displaying
  * @default_ctrl:	Specifies default cache cbm or memory B/W percent.
@@ -181,10 +190,12 @@ struct rdt_resource {
 	bool			alloc_capable;
 	bool			mon_capable;
 	int			num_rmid;
-	enum resctrl_scope	scope;
+	enum resctrl_scope	ctrl_scope;
+	enum resctrl_scope	mon_scope;
 	struct resctrl_cache	cache;
 	struct resctrl_membw	membw;
-	struct list_head	domains;
+	struct list_head	ctrl_domains;
+	struct list_head	mon_domains;
 	char			*name;
 	int			data_width;
 	u32			default_ctrl;
@@ -230,8 +241,10 @@ int resctrl_arch_update_one(struct rdt_resource *r, struct rdt_domain *d,
 
 u32 resctrl_arch_get_config(struct rdt_resource *r, struct rdt_domain *d,
 			    u32 closid, enum resctrl_conf_type type);
-int resctrl_online_domain(struct rdt_resource *r, struct rdt_domain *d);
-void resctrl_offline_domain(struct rdt_resource *r, struct rdt_domain *d);
+int resctrl_online_ctrl_domain(struct rdt_resource *r, struct rdt_domain *d);
+int resctrl_online_mon_domain(struct rdt_resource *r, struct rdt_domain *d);
+void resctrl_offline_ctrl_domain(struct rdt_resource *r, struct rdt_domain *d);
+void resctrl_offline_mon_domain(struct rdt_resource *r, struct rdt_domain *d);
 
 /**
  * resctrl_arch_rmid_read() - Read the eventid counter corresponding to rmid
diff --git a/arch/x86/kernel/cpu/resctrl/internal.h b/arch/x86/kernel/cpu/resctrl/internal.h
index a4f1aa15f0a2..24bf9d7989a9 100644
--- a/arch/x86/kernel/cpu/resctrl/internal.h
+++ b/arch/x86/kernel/cpu/resctrl/internal.h
@@ -520,8 +520,8 @@ void rdtgroup_kn_unlock(struct kernfs_node *kn);
 int rdtgroup_kn_mode_restrict(struct rdtgroup *r, const char *name);
 int rdtgroup_kn_mode_restore(struct rdtgroup *r, const char *name,
 			     umode_t mask);
-struct rdt_domain *rdt_find_domain(struct rdt_resource *r, int id,
-				   struct list_head **pos);
+struct rdt_domain_hdr *rdt_find_domain(struct list_head *h, int id,
+				       struct list_head **pos);
 ssize_t rdtgroup_schemata_write(struct kernfs_open_file *of,
 				char *buf, size_t nbytes, loff_t off);
 int rdtgroup_schemata_show(struct kernfs_open_file *of,
@@ -540,7 +540,7 @@ int rdt_pseudo_lock_init(void);
 void rdt_pseudo_lock_release(void);
 int rdtgroup_pseudo_lock_create(struct rdtgroup *rdtgrp);
 void rdtgroup_pseudo_lock_remove(struct rdtgroup *rdtgrp);
-struct rdt_domain *get_domain_from_cpu(int cpu, struct rdt_resource *r);
+struct rdt_domain *get_ctrl_domain_from_cpu(int cpu, struct rdt_resource *r);
 int closids_supported(void);
 void closid_free(int closid);
 int alloc_rmid(void);
diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
index c26ecb2e415f..8dc2cb49358e 100644
--- a/arch/x86/kernel/cpu/resctrl/core.c
+++ b/arch/x86/kernel/cpu/resctrl/core.c
@@ -57,7 +57,8 @@ static void
 mba_wrmsr_amd(struct rdt_domain *d, struct msr_param *m,
 	      struct rdt_resource *r);
 
-#define domain_init(id) LIST_HEAD_INIT(rdt_resources_all[id].r_resctrl.domains)
+#define ctrl_domain_init(id) LIST_HEAD_INIT(rdt_resources_all[id].r_resctrl.ctrl_domains)
+#define mon_domain_init(id) LIST_HEAD_INIT(rdt_resources_all[id].r_resctrl.mon_domains)
 
 struct rdt_hw_resource rdt_resources_all[] = {
 	[RDT_RESOURCE_L3] =
@@ -65,8 +66,10 @@ struct rdt_hw_resource rdt_resources_all[] = {
 		.r_resctrl = {
 			.rid			= RDT_RESOURCE_L3,
 			.name			= "L3",
-			.scope			= RESCTRL_L3_CACHE,
-			.domains		= domain_init(RDT_RESOURCE_L3),
+			.ctrl_scope		= RESCTRL_L3_CACHE,
+			.mon_scope		= RESCTRL_L3_CACHE,
+			.ctrl_domains		= ctrl_domain_init(RDT_RESOURCE_L3),
+			.mon_domains		= mon_domain_init(RDT_RESOURCE_L3),
 			.parse_ctrlval		= parse_cbm,
 			.format_str		= "%d=%0*x",
 			.fflags			= RFTYPE_RES_CACHE,
@@ -79,8 +82,8 @@ struct rdt_hw_resource rdt_resources_all[] = {
 		.r_resctrl = {
 			.rid			= RDT_RESOURCE_L2,
 			.name			= "L2",
-			.scope			= RESCTRL_L2_CACHE,
-			.domains		= domain_init(RDT_RESOURCE_L2),
+			.ctrl_scope		= RESCTRL_L2_CACHE,
+			.ctrl_domains		= ctrl_domain_init(RDT_RESOURCE_L2),
 			.parse_ctrlval		= parse_cbm,
 			.format_str		= "%d=%0*x",
 			.fflags			= RFTYPE_RES_CACHE,
@@ -93,8 +96,8 @@ struct rdt_hw_resource rdt_resources_all[] = {
 		.r_resctrl = {
 			.rid			= RDT_RESOURCE_MBA,
 			.name			= "MB",
-			.scope			= RESCTRL_L3_CACHE,
-			.domains		= domain_init(RDT_RESOURCE_MBA),
+			.ctrl_scope		= RESCTRL_L3_CACHE,
+			.ctrl_domains		= ctrl_domain_init(RDT_RESOURCE_MBA),
 			.parse_ctrlval		= parse_bw,
 			.format_str		= "%d=%*u",
 			.fflags			= RFTYPE_RES_MB,
@@ -105,8 +108,8 @@ struct rdt_hw_resource rdt_resources_all[] = {
 		.r_resctrl = {
 			.rid			= RDT_RESOURCE_SMBA,
 			.name			= "SMBA",
-			.scope			= RESCTRL_L3_CACHE,
-			.domains		= domain_init(RDT_RESOURCE_SMBA),
+			.ctrl_scope		= RESCTRL_L3_CACHE,
+			.ctrl_domains		= ctrl_domain_init(RDT_RESOURCE_SMBA),
 			.parse_ctrlval		= parse_bw,
 			.format_str		= "%d=%*u",
 			.fflags			= RFTYPE_RES_MB,
@@ -352,11 +355,11 @@ cat_wrmsr(struct rdt_domain *d, struct msr_param *m, struct rdt_resource *r)
 		wrmsrl(hw_res->msr_base + i, hw_dom->ctrl_val[i]);
 }
 
-struct rdt_domain *get_domain_from_cpu(int cpu, struct rdt_resource *r)
+struct rdt_domain *get_ctrl_domain_from_cpu(int cpu, struct rdt_resource *r)
 {
 	struct rdt_domain *d;
 
-	list_for_each_entry(d, &r->domains, hdr.list) {
+	list_for_each_entry(d, &r->ctrl_domains, hdr.list) {
 		/* Find the domain that contains this CPU */
 		if (cpumask_test_cpu(cpu, &d->hdr.cpu_mask))
 			return d;
@@ -378,7 +381,7 @@ void rdt_ctrl_update(void *arg)
 	int cpu = smp_processor_id();
 	struct rdt_domain *d;
 
-	d = get_domain_from_cpu(cpu, r);
+	d = get_ctrl_domain_from_cpu(cpu, r);
 	if (d) {
 		hw_res->msr_update(d, m, r);
 		return;
@@ -388,26 +391,26 @@ void rdt_ctrl_update(void *arg)
 }
 
 /*
- * rdt_find_domain - Find a domain in a resource that matches input resource id
+ * rdt_find_domain - Search for a domain id in a resource domain list.
  *
- * Search resource r's domain list to find the resource id. If the resource
- * id is found in a domain, return the domain. Otherwise, if requested by
- * caller, return the first domain whose id is bigger than the input id.
+ * Search the list to find the domain id. If the domain id is found
+ * in a domain, return the domain. Otherwise, if requested by caller,
+ * return the first domain whose id is bigger than the input id.
  * The domain list is sorted by id in ascending order.
  */
-struct rdt_domain *rdt_find_domain(struct rdt_resource *r, int id,
-				   struct list_head **pos)
+struct rdt_domain_hdr *rdt_find_domain(struct list_head *h, int id,
+				       struct list_head **pos)
 {
-	struct rdt_domain *d;
+	struct rdt_domain_hdr *d;
 	struct list_head *l;
 
-	list_for_each(l, &r->domains) {
-		d = list_entry(l, struct rdt_domain, hdr.list);
+	list_for_each(l, h) {
+		d = list_entry(l, struct rdt_domain_hdr, list);
 		/* When id is found, return its domain. */
-		if (id == d->hdr.id)
+		if (id == d->id)
 			return d;
 		/* Stop searching when finding id's position in sorted list. */
-		if (id < d->hdr.id)
+		if (id < d->id)
 			break;
 	}
 
@@ -501,35 +504,29 @@ static int get_domain_id_from_scope(int cpu, enum resctrl_scope scope)
 	return -EINVAL;
 }
 
-/*
- * domain_add_cpu - Add a cpu to a resource's domain list.
- *
- * If an existing domain in the resource r's domain list matches the cpu's
- * resource id, add the cpu in the domain.
- *
- * Otherwise, a new domain is allocated and inserted into the right position
- * in the domain list sorted by id in ascending order.
- *
- * The order in the domain list is visible to users when we print entries
- * in the schemata file and schemata input is validated to have the same order
- * as this list.
- */
-static void domain_add_cpu(int cpu, struct rdt_resource *r)
+static void domain_add_cpu_ctrl(int cpu, struct rdt_resource *r)
 {
-	int id = get_domain_id_from_scope(cpu, r->scope);
+	int id = get_domain_id_from_scope(cpu, r->ctrl_scope);
 	struct list_head *add_pos = NULL;
 	struct rdt_hw_domain *hw_dom;
+	struct rdt_domain_hdr *hdr;
 	struct rdt_domain *d;
 	int err;
 
 	if (id < 0) {
-		pr_warn_once("Can't find domain id for CPU:%d scope:%d for resource %s\n",
-			     cpu, r->scope, r->name);
+		pr_warn_once("Can't find control domain id for CPU:%d scope:%d for resource %s\n",
+			     cpu, r->ctrl_scope, r->name);
 		return;
 	}
-	d = rdt_find_domain(r, id, &add_pos);
 
-	if (d) {
+	hdr = rdt_find_domain(&r->ctrl_domains, id, &add_pos);
+
+	if (hdr) {
+		if (WARN_ON_ONCE(hdr->type != RESCTRL_CTRL_DOMAIN))
+			return;
+
+		d = container_of(hdr, struct rdt_domain, hdr);
+
 		cpumask_set_cpu(cpu, &d->hdr.cpu_mask);
 		if (r->cache.arch_has_per_cpu_cfg)
 			rdt_domain_reconfigure_cdp(r);
@@ -542,48 +539,115 @@ static void domain_add_cpu(int cpu, struct rdt_resource *r)
 
 	d = &hw_dom->d_resctrl;
 	d->hdr.id = id;
+	d->hdr.type = RESCTRL_CTRL_DOMAIN;
 	cpumask_set_cpu(cpu, &d->hdr.cpu_mask);
 
 	rdt_domain_reconfigure_cdp(r);
 
-	if (r->alloc_capable && domain_setup_ctrlval(r, d)) {
+	if (domain_setup_ctrlval(r, d)) {
 		domain_free(hw_dom);
 		return;
 	}
 
-	if (r->mon_capable && arch_domain_mbm_alloc(r->num_rmid, hw_dom)) {
+	list_add_tail(&d->hdr.list, add_pos);
+
+	err = resctrl_online_ctrl_domain(r, d);
+	if (err) {
+		list_del(&d->hdr.list);
+		domain_free(hw_dom);
+	}
+}
+
+static void domain_add_cpu_mon(int cpu, struct rdt_resource *r)
+{
+	int id = get_domain_id_from_scope(cpu, r->mon_scope);
+	struct list_head *add_pos = NULL;
+	struct rdt_hw_domain *hw_dom;
+	struct rdt_domain_hdr *hdr;
+	struct rdt_domain *d;
+	int err;
+
+	if (id < 0) {
+		pr_warn_once("Can't find monitor domain id for CPU:%d scope:%d for resource %s\n",
+			     cpu, r->mon_scope, r->name);
+		return;
+	}
+
+	hdr = rdt_find_domain(&r->mon_domains, id, &add_pos);
+	if (IS_ERR(hdr)) {
+		pr_warn("Couldn't find monitor scope id=%d for CPU %d\n", id, cpu);
+		return;
+	}
+
+	if (hdr) {
+		if (WARN_ON_ONCE(hdr->type != RESCTRL_MON_DOMAIN))
+			return;
+
+		d = container_of(hdr, struct rdt_domain, hdr);
+
+		cpumask_set_cpu(cpu, &d->hdr.cpu_mask);
+		return;
+	}
+
+	hw_dom = kzalloc_node(sizeof(*hw_dom), GFP_KERNEL, cpu_to_node(cpu));
+	if (!hw_dom)
+		return;
+
+	d = &hw_dom->d_resctrl;
+	d->hdr.id = id;
+	d->hdr.type = RESCTRL_MON_DOMAIN;
+	cpumask_set_cpu(cpu, &d->hdr.cpu_mask);
+
+	if (arch_domain_mbm_alloc(r->num_rmid, hw_dom)) {
 		domain_free(hw_dom);
 		return;
 	}
 
 	list_add_tail(&d->hdr.list, add_pos);
 
-	err = resctrl_online_domain(r, d);
+	err = resctrl_online_mon_domain(r, d);
 	if (err) {
 		list_del(&d->hdr.list);
 		domain_free(hw_dom);
 	}
 }
 
-static void domain_remove_cpu(int cpu, struct rdt_resource *r)
+/*
+ * domain_add_cpu - Add a cpu to either/both resource's domain lists.
+ */
+static void domain_add_cpu(int cpu, struct rdt_resource *r)
+{
+	if (r->alloc_capable)
+		domain_add_cpu_ctrl(cpu, r);
+	if (r->mon_capable)
+		domain_add_cpu_mon(cpu, r);
+}
+
+static void domain_remove_cpu_ctrl(int cpu, struct rdt_resource *r)
 {
-	int id = get_domain_id_from_scope(cpu, r->scope);
+	int id = get_domain_id_from_scope(cpu, r->ctrl_scope);
 	struct rdt_hw_domain *hw_dom;
+	struct rdt_domain_hdr *hdr;
 	struct rdt_domain *d;
 
 	if (id < 0)
 		return;
 
-	d = rdt_find_domain(r, id, NULL);
-	if (!d) {
-		pr_warn("Couldn't find cache id for CPU %d\n", cpu);
+	hdr = rdt_find_domain(&r->ctrl_domains, id, NULL);
+	if (IS_ERR_OR_NULL(hdr)) {
+		pr_warn("Couldn't find control scope id=%d for CPU %d\n", id, cpu);
 		return;
 	}
+
+	if (WARN_ON_ONCE(hdr->type != RESCTRL_CTRL_DOMAIN))
+		return;
+
+	d = container_of(hdr, struct rdt_domain, hdr);
 	hw_dom = resctrl_to_arch_dom(d);
 
 	cpumask_clear_cpu(cpu, &d->hdr.cpu_mask);
 	if (cpumask_empty(&d->hdr.cpu_mask)) {
-		resctrl_offline_domain(r, d);
+		resctrl_offline_ctrl_domain(r, d);
 		list_del(&d->hdr.list);
 
 		/*
@@ -596,6 +660,38 @@ static void domain_remove_cpu(int cpu, struct rdt_resource *r)
 
 		return;
 	}
+}
+
+static void domain_remove_cpu_mon(int cpu, struct rdt_resource *r)
+{
+	int id = get_domain_id_from_scope(cpu, r->mon_scope);
+	struct rdt_hw_domain *hw_dom;
+	struct rdt_domain_hdr *hdr;
+	struct rdt_domain *d;
+
+	if (id < 0)
+		return;
+
+	hdr = rdt_find_domain(&r->mon_domains, id, NULL);
+	if (IS_ERR_OR_NULL(hdr)) {
+		pr_warn("Couldn't find scope id=%d for CPU %d\n", id, cpu);
+		return;
+	}
+
+	if (WARN_ON_ONCE(hdr->type != RESCTRL_MON_DOMAIN))
+		return;
+
+	d = container_of(hdr, struct rdt_domain, hdr);
+	hw_dom = resctrl_to_arch_dom(d);
+
+	cpumask_clear_cpu(cpu, &d->hdr.cpu_mask);
+	if (cpumask_empty(&d->hdr.cpu_mask)) {
+		resctrl_offline_mon_domain(r, d);
+		list_del(&d->hdr.list);
+		domain_free(hw_dom);
+
+		return;
+	}
 
 	if (r == &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl) {
 		if (is_mbm_enabled() && cpu == d->mbm_work_cpu) {
@@ -610,6 +706,14 @@ static void domain_remove_cpu(int cpu, struct rdt_resource *r)
 	}
 }
 
+static void domain_remove_cpu(int cpu, struct rdt_resource *r)
+{
+	if (r->alloc_capable)
+		domain_remove_cpu_ctrl(cpu, r);
+	if (r->mon_capable)
+		domain_remove_cpu_mon(cpu, r);
+}
+
 static void clear_closid_rmid(int cpu)
 {
 	struct resctrl_pqr_state *state = this_cpu_ptr(&pqr_state);
diff --git a/arch/x86/kernel/cpu/resctrl/ctrlmondata.c b/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
index 23f8258d36a8..0b4136c42762 100644
--- a/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
+++ b/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
@@ -226,7 +226,7 @@ static int parse_line(char *line, struct resctrl_schema *s,
 		return -EINVAL;
 	}
 	dom = strim(dom);
-	list_for_each_entry(d, &r->domains, hdr.list) {
+	list_for_each_entry(d, &r->ctrl_domains, hdr.list) {
 		if (d->hdr.id == dom_id) {
 			data.buf = dom;
 			data.rdtgrp = rdtgrp;
@@ -318,7 +318,7 @@ int resctrl_arch_update_domains(struct rdt_resource *r, u32 closid)
 		return -ENOMEM;
 
 	msr_param.res = NULL;
-	list_for_each_entry(d, &r->domains, hdr.list) {
+	list_for_each_entry(d, &r->ctrl_domains, hdr.list) {
 		hw_dom = resctrl_to_arch_dom(d);
 		for (t = 0; t < CDP_NUM_TYPES; t++) {
 			cfg = &hw_dom->d_resctrl.staged_config[t];
@@ -466,7 +466,7 @@ static void show_doms(struct seq_file *s, struct resctrl_schema *schema, int clo
 	u32 ctrl_val;
 
 	seq_printf(s, "%*s:", max_name_width, schema->name);
-	list_for_each_entry(dom, &r->domains, hdr.list) {
+	list_for_each_entry(dom, &r->ctrl_domains, hdr.list) {
 		if (sep)
 			seq_puts(s, ";");
 
@@ -542,6 +542,7 @@ void mon_event_read(struct rmid_read *rr, struct rdt_resource *r,
 int rdtgroup_mondata_show(struct seq_file *m, void *arg)
 {
 	struct kernfs_open_file *of = m->private;
+	struct rdt_domain_hdr *hdr;
 	u32 resid, evtid, domid;
 	struct rdtgroup *rdtgrp;
 	struct rdt_resource *r;
@@ -562,11 +563,12 @@ int rdtgroup_mondata_show(struct seq_file *m, void *arg)
 	evtid = md.u.evtid;
 
 	r = &rdt_resources_all[resid].r_resctrl;
-	d = rdt_find_domain(r, domid, NULL);
-	if (!d) {
+	hdr = rdt_find_domain(&r->mon_domains, domid, NULL);
+	if (!hdr || WARN_ON_ONCE(hdr->type != RESCTRL_MON_DOMAIN)) {
 		ret = -ENOENT;
 		goto out;
 	}
+	d = container_of(hdr, struct rdt_domain, hdr);
 
 	mon_event_read(&rr, r, d, rdtgrp, evtid, false);
 
diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c
index dd0ea1bc0092..ec5ad926c5dc 100644
--- a/arch/x86/kernel/cpu/resctrl/monitor.c
+++ b/arch/x86/kernel/cpu/resctrl/monitor.c
@@ -340,7 +340,7 @@ static void add_rmid_to_limbo(struct rmid_entry *entry)
 
 	entry->busy = 0;
 	cpu = get_cpu();
-	list_for_each_entry(d, &r->domains, hdr.list) {
+	list_for_each_entry(d, &r->mon_domains, hdr.list) {
 		if (cpumask_test_cpu(cpu, &d->hdr.cpu_mask)) {
 			err = resctrl_arch_rmid_read(r, d, entry->rmid,
 						     QOS_L3_OCCUP_EVENT_ID,
@@ -535,7 +535,7 @@ static void update_mba_bw(struct rdtgroup *rgrp, struct rdt_domain *dom_mbm)
 	rmid = rgrp->mon.rmid;
 	pmbm_data = &dom_mbm->mbm_local[rmid];
 
-	dom_mba = get_domain_from_cpu(smp_processor_id(), r_mba);
+	dom_mba = get_ctrl_domain_from_cpu(smp_processor_id(), r_mba);
 	if (!dom_mba) {
 		pr_warn_once("Failure to get domain for MBA update\n");
 		return;
diff --git a/arch/x86/kernel/cpu/resctrl/pseudo_lock.c b/arch/x86/kernel/cpu/resctrl/pseudo_lock.c
index fcbd99e2eb66..ed6d59af1cef 100644
--- a/arch/x86/kernel/cpu/resctrl/pseudo_lock.c
+++ b/arch/x86/kernel/cpu/resctrl/pseudo_lock.c
@@ -292,7 +292,7 @@ static void pseudo_lock_region_clear(struct pseudo_lock_region *plr)
  */
 static int pseudo_lock_region_init(struct pseudo_lock_region *plr)
 {
-	enum resctrl_scope scope = plr->s->res->scope;
+	enum resctrl_scope scope = plr->s->res->ctrl_scope;
 	struct cpu_cacheinfo *ci;
 	int ret;
 	int i;
@@ -856,7 +856,7 @@ bool rdtgroup_pseudo_locked_in_hierarchy(struct rdt_domain *d)
 	 * associated with them.
 	 */
 	for_each_alloc_capable_rdt_resource(r) {
-		list_for_each_entry(d_i, &r->domains, hdr.list) {
+		list_for_each_entry(d_i, &r->ctrl_domains, hdr.list) {
 			if (d_i->plr)
 				cpumask_or(cpu_with_psl, cpu_with_psl,
 					   &d_i->hdr.cpu_mask);
diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
index 04d32602ac33..760013ed1bff 100644
--- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
+++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
@@ -91,7 +91,7 @@ void rdt_staged_configs_clear(void)
 	lockdep_assert_held(&rdtgroup_mutex);
 
 	for_each_alloc_capable_rdt_resource(r) {
-		list_for_each_entry(dom, &r->domains, hdr.list)
+		list_for_each_entry(dom, &r->ctrl_domains, hdr.list)
 			memset(dom->staged_config, 0, sizeof(dom->staged_config));
 	}
 }
@@ -984,7 +984,7 @@ static int rdt_bit_usage_show(struct kernfs_open_file *of,
 
 	mutex_lock(&rdtgroup_mutex);
 	hw_shareable = r->cache.shareable_bits;
-	list_for_each_entry(dom, &r->domains, hdr.list) {
+	list_for_each_entry(dom, &r->ctrl_domains, hdr.list) {
 		if (sep)
 			seq_putc(seq, ';');
 		sw_shareable = 0;
@@ -1302,7 +1302,7 @@ static bool rdtgroup_mode_test_exclusive(struct rdtgroup *rdtgrp)
 		if (r->rid == RDT_RESOURCE_MBA || r->rid == RDT_RESOURCE_SMBA)
 			continue;
 		has_cache = true;
-		list_for_each_entry(d, &r->domains, hdr.list) {
+		list_for_each_entry(d, &r->ctrl_domains, hdr.list) {
 			ctrl = resctrl_arch_get_config(r, d, closid,
 						       s->conf_type);
 			if (rdtgroup_cbm_overlaps(s, d, ctrl, closid, false)) {
@@ -1413,13 +1413,13 @@ unsigned int rdtgroup_cbm_to_size(struct rdt_resource *r,
 	unsigned int size = 0;
 	int num_b, i;
 
-	if (WARN_ON_ONCE(r->scope != RESCTRL_L2_CACHE && r->scope != RESCTRL_L3_CACHE))
+	if (WARN_ON_ONCE(r->ctrl_scope != RESCTRL_L2_CACHE && r->ctrl_scope != RESCTRL_L3_CACHE))
 		return size;
 
 	num_b = bitmap_weight(&cbm, r->cache.cbm_len);
 	ci = get_cpu_cacheinfo(cpumask_any(&d->hdr.cpu_mask));
 	for (i = 0; i < ci->num_leaves; i++) {
-		if (ci->info_list[i].level == r->scope) {
+		if (ci->info_list[i].level == r->ctrl_scope) {
 			size = ci->info_list[i].size / r->cache.cbm_len * num_b;
 			break;
 		}
@@ -1477,7 +1477,7 @@ static int rdtgroup_size_show(struct kernfs_open_file *of,
 		type = schema->conf_type;
 		sep = false;
 		seq_printf(s, "%*s:", max_name_width, schema->name);
-		list_for_each_entry(d, &r->domains, hdr.list) {
+		list_for_each_entry(d, &r->ctrl_domains, hdr.list) {
 			if (sep)
 				seq_putc(s, ';');
 			if (rdtgrp->mode == RDT_MODE_PSEUDO_LOCKSETUP) {
@@ -1566,7 +1566,7 @@ static int mbm_config_show(struct seq_file *s, struct rdt_resource *r, u32 evtid
 
 	mutex_lock(&rdtgroup_mutex);
 
-	list_for_each_entry(dom, &r->domains, hdr.list) {
+	list_for_each_entry(dom, &r->mon_domains, hdr.list) {
 		if (sep)
 			seq_puts(s, ";");
 
@@ -1689,7 +1689,7 @@ static int mon_config_write(struct rdt_resource *r, char *tok, u32 evtid)
 		return -EINVAL;
 	}
 
-	list_for_each_entry(d, &r->domains, hdr.list) {
+	list_for_each_entry(d, &r->mon_domains, hdr.list) {
 		if (d->hdr.id == dom_id) {
 			ret = mbm_config_write_domain(r, d, evtid, val);
 			if (ret)
@@ -2232,7 +2232,7 @@ static int set_cache_qos_cfg(int level, bool enable)
 		return -ENOMEM;
 
 	r_l = &rdt_resources_all[level].r_resctrl;
-	list_for_each_entry(d, &r_l->domains, hdr.list) {
+	list_for_each_entry(d, &r_l->ctrl_domains, hdr.list) {
 		if (r_l->cache.arch_has_per_cpu_cfg)
 			/* Pick all the CPUs in the domain instance */
 			for_each_cpu(cpu, &d->hdr.cpu_mask)
@@ -2317,7 +2317,7 @@ static int set_mba_sc(bool mba_sc)
 
 	r->membw.mba_sc = mba_sc;
 
-	list_for_each_entry(d, &r->domains, hdr.list) {
+	list_for_each_entry(d, &r->ctrl_domains, hdr.list) {
 		for (i = 0; i < num_closid; i++)
 			d->mbps_val[i] = MBA_MAX_MBPS;
 	}
@@ -2653,7 +2653,7 @@ static int rdt_get_tree(struct fs_context *fc)
 
 	if (is_mbm_enabled()) {
 		r = &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl;
-		list_for_each_entry(dom, &r->domains, hdr.list)
+		list_for_each_entry(dom, &r->mon_domains, hdr.list)
 			mbm_setup_overflow_handler(dom, MBM_OVERFLOW_INTERVAL);
 	}
 
@@ -2777,10 +2777,10 @@ static int reset_all_ctrls(struct rdt_resource *r)
 
 	/*
 	 * Disable resource control for this resource by setting all
-	 * CBMs in all domains to the maximum mask value. Pick one CPU
+	 * CBMs in all ctrl_domains to the maximum mask value. Pick one CPU
 	 * from each domain to update the MSRs below.
 	 */
-	list_for_each_entry(d, &r->domains, hdr.list) {
+	list_for_each_entry(d, &r->ctrl_domains, hdr.list) {
 		hw_dom = resctrl_to_arch_dom(d);
 		cpumask_set_cpu(cpumask_any(&d->hdr.cpu_mask), cpu_mask);
 
@@ -3050,7 +3050,7 @@ static int mkdir_mondata_subdir_alldom(struct kernfs_node *parent_kn,
 	struct rdt_domain *dom;
 	int ret;
 
-	list_for_each_entry(dom, &r->domains, hdr.list) {
+	list_for_each_entry(dom, &r->mon_domains, hdr.list) {
 		ret = mkdir_mondata_subdir(parent_kn, dom, r, prgrp);
 		if (ret)
 			return ret;
@@ -3232,7 +3232,7 @@ static int rdtgroup_init_cat(struct resctrl_schema *s, u32 closid)
 	struct rdt_domain *d;
 	int ret;
 
-	list_for_each_entry(d, &s->res->domains, hdr.list) {
+	list_for_each_entry(d, &s->res->ctrl_domains, hdr.list) {
 		ret = __init_one_rdt_domain(d, s, closid);
 		if (ret < 0)
 			return ret;
@@ -3247,7 +3247,7 @@ static void rdtgroup_init_mba(struct rdt_resource *r, u32 closid)
 	struct resctrl_staged_config *cfg;
 	struct rdt_domain *d;
 
-	list_for_each_entry(d, &r->domains, hdr.list) {
+	list_for_each_entry(d, &r->ctrl_domains, hdr.list) {
 		if (is_mba_sc(r)) {
 			d->mbps_val[closid] = MBA_MAX_MBPS;
 			continue;
@@ -3849,15 +3849,17 @@ static void domain_destroy_mon_state(struct rdt_domain *d)
 	kfree(d->mbm_local);
 }
 
-void resctrl_offline_domain(struct rdt_resource *r, struct rdt_domain *d)
+void resctrl_offline_ctrl_domain(struct rdt_resource *r, struct rdt_domain *d)
 {
 	lockdep_assert_held(&rdtgroup_mutex);
 
 	if (supports_mba_mbps() && r->rid == RDT_RESOURCE_MBA)
 		mba_sc_domain_destroy(r, d);
+}
 
-	if (!r->mon_capable)
-		return;
+void resctrl_offline_mon_domain(struct rdt_resource *r, struct rdt_domain *d)
+{
+	lockdep_assert_held(&rdtgroup_mutex);
 
 	/*
 	 * If resctrl is mounted, remove all the
@@ -3914,18 +3916,21 @@ static int domain_setup_mon_state(struct rdt_resource *r, struct rdt_domain *d)
 	return 0;
 }
 
-int resctrl_online_domain(struct rdt_resource *r, struct rdt_domain *d)
+int resctrl_online_ctrl_domain(struct rdt_resource *r, struct rdt_domain *d)
 {
-	int err;
-
 	lockdep_assert_held(&rdtgroup_mutex);
 
 	if (supports_mba_mbps() && r->rid == RDT_RESOURCE_MBA)
-		/* RDT_RESOURCE_MBA is never mon_capable */
 		return mba_sc_domain_allocate(r, d);
 
-	if (!r->mon_capable)
-		return 0;
+	return 0;
+}
+
+int resctrl_online_mon_domain(struct rdt_resource *r, struct rdt_domain *d)
+{
+	int err;
+
+	lockdep_assert_held(&rdtgroup_mutex);
 
 	err = domain_setup_mon_state(r, d);
 	if (err)
-- 
2.41.0


^ permalink raw reply related	[flat|nested] 331+ messages in thread

* [PATCH v10 4/8] x86/resctrl: Split the rdt_domain and rdt_hw_domain structures
  2023-10-31 21:17           ` [PATCH v10 " Tony Luck
                               ` (2 preceding siblings ...)
  2023-10-31 21:17             ` [PATCH v10 3/8] x86/resctrl: Prepare for different scope for control/monitor operations Tony Luck
@ 2023-10-31 21:17             ` Tony Luck
  2023-11-07  0:32               ` Reinette Chatre
  2023-10-31 21:17             ` [PATCH v10 5/8] x86/resctrl: Add node-scope to the options for feature scope Tony Luck
                               ` (4 subsequent siblings)
  8 siblings, 1 reply; 331+ messages in thread
From: Tony Luck @ 2023-10-31 21:17 UTC (permalink / raw)
  To: Fenghua Yu, Reinette Chatre, Peter Newman, Jonathan Corbet,
	Shuah Khan, x86
  Cc: Shaopeng Tan, James Morse, Jamie Iles, Babu Moger, Randy Dunlap,
	linux-kernel, linux-doc, patches, Tony Luck

The same rdt_domain structure is used for both control and monitor
functions. This wastes memory because some of the fields are used only
by the control functions, while most are used only by the monitor
functions.

Split it into separate rdt_ctrl_domain and rdt_mon_domain structures,
each containing just the fields required for control and monitoring
respectively.

Similarly, split the rdt_hw_domain structure into rdt_hw_ctrl_domain
and rdt_hw_mon_domain.
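
For illustration only (not part of this patch): after the split, code
that needs the arch private state picks the helper matching the type of
domain it holds. A minimal sketch based on the new
resctrl_to_arch_ctrl_dom() helper:

static u32 read_one_ctrl_val(struct rdt_ctrl_domain *d, u32 idx)
{
	struct rdt_hw_ctrl_domain *hw_dom = resctrl_to_arch_ctrl_dom(d);

	/* ctrl_val[] lives only in the control flavor of the hw domain. */
	return hw_dom->ctrl_val[idx];
}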

Signed-off-by: Tony Luck <tony.luck@intel.com>
---
Changes since v9:
Addressed comment against patch 4 (fix is now in patch #2): cpu_mask
is included in the common header.

 include/linux/resctrl.h                   | 50 +++++++------
 arch/x86/kernel/cpu/resctrl/internal.h    | 60 ++++++++++------
 arch/x86/kernel/cpu/resctrl/core.c        | 87 ++++++++++++-----------
 arch/x86/kernel/cpu/resctrl/ctrlmondata.c | 32 ++++-----
 arch/x86/kernel/cpu/resctrl/monitor.c     | 40 +++++------
 arch/x86/kernel/cpu/resctrl/pseudo_lock.c |  6 +-
 arch/x86/kernel/cpu/resctrl/rdtgroup.c    | 62 ++++++++--------
 7 files changed, 184 insertions(+), 153 deletions(-)

diff --git a/include/linux/resctrl.h b/include/linux/resctrl.h
index 35e700edc6e6..36503e8870cd 100644
--- a/include/linux/resctrl.h
+++ b/include/linux/resctrl.h
@@ -72,7 +72,25 @@ struct rdt_domain_hdr {
 };
 
 /**
- * struct rdt_domain - group of CPUs sharing a resctrl resource
+ * struct rdt_ctrl_domain - group of CPUs sharing a resctrl control resource
+ * @hdr:		common header for different domain types
+ * @cpu_mask:		which CPUs share this resource
+ * @plr:		pseudo-locked region (if any) associated with domain
+ * @staged_config:	parsed configuration to be applied
+ * @mbps_val:		When mba_sc is enabled, this holds the array of user
+ *			specified control values for mba_sc in MBps, indexed
+ *			by closid
+ */
+struct rdt_ctrl_domain {
+	struct rdt_domain_hdr		hdr;
+	struct cpumask			cpu_mask;
+	struct pseudo_lock_region	*plr;
+	struct resctrl_staged_config	staged_config[CDP_NUM_TYPES];
+	u32				*mbps_val;
+};
+
+/**
+ * struct rdt_mon_domain - group of CPUs sharing a resctrl monitor resource
  * @hdr:		common header for different domain types
  * @rmid_busy_llc:	bitmap of which limbo RMIDs are above threshold
  * @mbm_total:		saved state for MBM total bandwidth
@@ -81,13 +99,8 @@ struct rdt_domain_hdr {
  * @cqm_limbo:		worker to periodically read CQM h/w counters
  * @mbm_work_cpu:	worker CPU for MBM h/w counters
  * @cqm_work_cpu:	worker CPU for CQM h/w counters
- * @plr:		pseudo-locked region (if any) associated with domain
- * @staged_config:	parsed configuration to be applied
- * @mbps_val:		When mba_sc is enabled, this holds the array of user
- *			specified control values for mba_sc in MBps, indexed
- *			by closid
  */
-struct rdt_domain {
+struct rdt_mon_domain {
 	struct rdt_domain_hdr		hdr;
 	unsigned long			*rmid_busy_llc;
 	struct mbm_state		*mbm_total;
@@ -96,9 +109,6 @@ struct rdt_domain {
 	struct delayed_work		cqm_limbo;
 	int				mbm_work_cpu;
 	int				cqm_work_cpu;
-	struct pseudo_lock_region	*plr;
-	struct resctrl_staged_config	staged_config[CDP_NUM_TYPES];
-	u32				*mbps_val;
 };
 
 /**
@@ -202,7 +212,7 @@ struct rdt_resource {
 	const char		*format_str;
 	int			(*parse_ctrlval)(struct rdt_parse_data *data,
 						 struct resctrl_schema *s,
-						 struct rdt_domain *d);
+						 struct rdt_ctrl_domain *d);
 	struct list_head	evt_list;
 	unsigned long		fflags;
 	bool			cdp_capable;
@@ -236,15 +246,15 @@ int resctrl_arch_update_domains(struct rdt_resource *r, u32 closid);
  * Update the ctrl_val and apply this config right now.
  * Must be called on one of the domain's CPUs.
  */
-int resctrl_arch_update_one(struct rdt_resource *r, struct rdt_domain *d,
+int resctrl_arch_update_one(struct rdt_resource *r, struct rdt_ctrl_domain *d,
 			    u32 closid, enum resctrl_conf_type t, u32 cfg_val);
 
-u32 resctrl_arch_get_config(struct rdt_resource *r, struct rdt_domain *d,
+u32 resctrl_arch_get_config(struct rdt_resource *r, struct rdt_ctrl_domain *d,
 			    u32 closid, enum resctrl_conf_type type);
-int resctrl_online_ctrl_domain(struct rdt_resource *r, struct rdt_domain *d);
-int resctrl_online_mon_domain(struct rdt_resource *r, struct rdt_domain *d);
-void resctrl_offline_ctrl_domain(struct rdt_resource *r, struct rdt_domain *d);
-void resctrl_offline_mon_domain(struct rdt_resource *r, struct rdt_domain *d);
+int resctrl_online_ctrl_domain(struct rdt_resource *r, struct rdt_ctrl_domain *d);
+int resctrl_online_mon_domain(struct rdt_resource *r, struct rdt_mon_domain *d);
+void resctrl_offline_ctrl_domain(struct rdt_resource *r, struct rdt_ctrl_domain *d);
+void resctrl_offline_mon_domain(struct rdt_resource *r, struct rdt_mon_domain *d);
 
 /**
  * resctrl_arch_rmid_read() - Read the eventid counter corresponding to rmid
@@ -260,7 +270,7 @@ void resctrl_offline_mon_domain(struct rdt_resource *r, struct rdt_domain *d);
  * Return:
  * 0 on success, or -EIO, -EINVAL etc on error.
  */
-int resctrl_arch_rmid_read(struct rdt_resource *r, struct rdt_domain *d,
+int resctrl_arch_rmid_read(struct rdt_resource *r, struct rdt_mon_domain *d,
 			   u32 rmid, enum resctrl_event_id eventid, u64 *val);
 
 /**
@@ -273,7 +283,7 @@ int resctrl_arch_rmid_read(struct rdt_resource *r, struct rdt_domain *d,
  *
  * This can be called from any CPU.
  */
-void resctrl_arch_reset_rmid(struct rdt_resource *r, struct rdt_domain *d,
+void resctrl_arch_reset_rmid(struct rdt_resource *r, struct rdt_mon_domain *d,
 			     u32 rmid, enum resctrl_event_id eventid);
 
 /**
@@ -285,7 +295,7 @@ void resctrl_arch_reset_rmid(struct rdt_resource *r, struct rdt_domain *d,
  *
  * This can be called from any CPU.
  */
-void resctrl_arch_reset_rmid_all(struct rdt_resource *r, struct rdt_domain *d);
+void resctrl_arch_reset_rmid_all(struct rdt_resource *r, struct rdt_mon_domain *d);
 
 extern unsigned int resctrl_rmid_realloc_threshold;
 extern unsigned int resctrl_rmid_realloc_limit;
diff --git a/arch/x86/kernel/cpu/resctrl/internal.h b/arch/x86/kernel/cpu/resctrl/internal.h
index 24bf9d7989a9..ce3a70657842 100644
--- a/arch/x86/kernel/cpu/resctrl/internal.h
+++ b/arch/x86/kernel/cpu/resctrl/internal.h
@@ -107,7 +107,7 @@ union mon_data_bits {
 struct rmid_read {
 	struct rdtgroup		*rgrp;
 	struct rdt_resource	*r;
-	struct rdt_domain	*d;
+	struct rdt_mon_domain	*d;
 	enum resctrl_event_id	evtid;
 	bool			first;
 	int			err;
@@ -192,7 +192,7 @@ struct mongroup {
  */
 struct pseudo_lock_region {
 	struct resctrl_schema	*s;
-	struct rdt_domain	*d;
+	struct rdt_ctrl_domain	*d;
 	u32			cbm;
 	wait_queue_head_t	lock_thread_wq;
 	int			thread_done;
@@ -319,25 +319,41 @@ struct arch_mbm_state {
 };
 
 /**
- * struct rdt_hw_domain - Arch private attributes of a set of CPUs that share
- *			  a resource
+ * struct rdt_hw_ctrl_domain - Arch private attributes of a set of CPUs that share
+ *			       a resource for a control function
  * @d_resctrl:	Properties exposed to the resctrl file system
  * @ctrl_val:	array of cache or mem ctrl values (indexed by CLOSID)
+ *
+ * Members of this structure are accessed via helpers that provide abstraction.
+ */
+struct rdt_hw_ctrl_domain {
+	struct rdt_ctrl_domain		d_resctrl;
+	u32				*ctrl_val;
+};
+
+/**
+ * struct rdt_hw_mon_domain - Arch private attributes of a set of CPUs that share
+ *			      a resource for a monitor function
+ * @d_resctrl:	Properties exposed to the resctrl file system
  * @arch_mbm_total:	arch private state for MBM total bandwidth
  * @arch_mbm_local:	arch private state for MBM local bandwidth
  *
  * Members of this structure are accessed via helpers that provide abstraction.
  */
-struct rdt_hw_domain {
-	struct rdt_domain		d_resctrl;
-	u32				*ctrl_val;
+struct rdt_hw_mon_domain {
+	struct rdt_mon_domain		d_resctrl;
 	struct arch_mbm_state		*arch_mbm_total;
 	struct arch_mbm_state		*arch_mbm_local;
 };
 
-static inline struct rdt_hw_domain *resctrl_to_arch_dom(struct rdt_domain *r)
+static inline struct rdt_hw_ctrl_domain *resctrl_to_arch_ctrl_dom(struct rdt_ctrl_domain *r)
+{
+	return container_of(r, struct rdt_hw_ctrl_domain, d_resctrl);
+}
+
+static inline struct rdt_hw_mon_domain *resctrl_to_arch_mon_dom(struct rdt_mon_domain *r)
 {
-	return container_of(r, struct rdt_hw_domain, d_resctrl);
+	return container_of(r, struct rdt_hw_mon_domain, d_resctrl);
 }
 
 /**
@@ -405,7 +421,7 @@ struct rdt_hw_resource {
 	struct rdt_resource	r_resctrl;
 	u32			num_closid;
 	unsigned int		msr_base;
-	void (*msr_update)	(struct rdt_domain *d, struct msr_param *m,
+	void (*msr_update)	(struct rdt_ctrl_domain *d, struct msr_param *m,
 				 struct rdt_resource *r);
 	unsigned int		mon_scale;
 	unsigned int		mbm_width;
@@ -418,9 +434,9 @@ static inline struct rdt_hw_resource *resctrl_to_arch_res(struct rdt_resource *r
 }
 
 int parse_cbm(struct rdt_parse_data *data, struct resctrl_schema *s,
-	      struct rdt_domain *d);
+	      struct rdt_ctrl_domain *d);
 int parse_bw(struct rdt_parse_data *data, struct resctrl_schema *s,
-	     struct rdt_domain *d);
+	     struct rdt_ctrl_domain *d);
 
 extern struct mutex rdtgroup_mutex;
 
@@ -526,21 +542,21 @@ ssize_t rdtgroup_schemata_write(struct kernfs_open_file *of,
 				char *buf, size_t nbytes, loff_t off);
 int rdtgroup_schemata_show(struct kernfs_open_file *of,
 			   struct seq_file *s, void *v);
-bool rdtgroup_cbm_overlaps(struct resctrl_schema *s, struct rdt_domain *d,
+bool rdtgroup_cbm_overlaps(struct resctrl_schema *s, struct rdt_ctrl_domain *d,
 			   unsigned long cbm, int closid, bool exclusive);
-unsigned int rdtgroup_cbm_to_size(struct rdt_resource *r, struct rdt_domain *d,
+unsigned int rdtgroup_cbm_to_size(struct rdt_resource *r, struct rdt_ctrl_domain *d,
 				  unsigned long cbm);
 enum rdtgrp_mode rdtgroup_mode_by_closid(int closid);
 int rdtgroup_tasks_assigned(struct rdtgroup *r);
 int rdtgroup_locksetup_enter(struct rdtgroup *rdtgrp);
 int rdtgroup_locksetup_exit(struct rdtgroup *rdtgrp);
-bool rdtgroup_cbm_overlaps_pseudo_locked(struct rdt_domain *d, unsigned long cbm);
-bool rdtgroup_pseudo_locked_in_hierarchy(struct rdt_domain *d);
+bool rdtgroup_cbm_overlaps_pseudo_locked(struct rdt_ctrl_domain *d, unsigned long cbm);
+bool rdtgroup_pseudo_locked_in_hierarchy(struct rdt_ctrl_domain *d);
 int rdt_pseudo_lock_init(void);
 void rdt_pseudo_lock_release(void);
 int rdtgroup_pseudo_lock_create(struct rdtgroup *rdtgrp);
 void rdtgroup_pseudo_lock_remove(struct rdtgroup *rdtgrp);
-struct rdt_domain *get_ctrl_domain_from_cpu(int cpu, struct rdt_resource *r);
+struct rdt_ctrl_domain *get_ctrl_domain_from_cpu(int cpu, struct rdt_resource *r);
 int closids_supported(void);
 void closid_free(int closid);
 int alloc_rmid(void);
@@ -550,17 +566,17 @@ bool __init rdt_cpu_has(int flag);
 void mon_event_count(void *info);
 int rdtgroup_mondata_show(struct seq_file *m, void *arg);
 void mon_event_read(struct rmid_read *rr, struct rdt_resource *r,
-		    struct rdt_domain *d, struct rdtgroup *rdtgrp,
+		    struct rdt_mon_domain *d, struct rdtgroup *rdtgrp,
 		    int evtid, int first);
-void mbm_setup_overflow_handler(struct rdt_domain *dom,
+void mbm_setup_overflow_handler(struct rdt_mon_domain *dom,
 				unsigned long delay_ms);
 void mbm_handle_overflow(struct work_struct *work);
 void __init intel_rdt_mbm_apply_quirk(void);
 bool is_mba_sc(struct rdt_resource *r);
-void cqm_setup_limbo_handler(struct rdt_domain *dom, unsigned long delay_ms);
+void cqm_setup_limbo_handler(struct rdt_mon_domain *dom, unsigned long delay_ms);
 void cqm_handle_limbo(struct work_struct *work);
-bool has_busy_rmid(struct rdt_resource *r, struct rdt_domain *d);
-void __check_limbo(struct rdt_domain *d, bool force_free);
+bool has_busy_rmid(struct rdt_resource *r, struct rdt_mon_domain *d);
+void __check_limbo(struct rdt_mon_domain *d, bool force_free);
 void rdt_domain_reconfigure_cdp(struct rdt_resource *r);
 void __init thread_throttle_mode_init(void);
 void __init mbm_config_rftype_init(const char *config);
diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
index 8dc2cb49358e..6bae0a658b94 100644
--- a/arch/x86/kernel/cpu/resctrl/core.c
+++ b/arch/x86/kernel/cpu/resctrl/core.c
@@ -49,12 +49,12 @@ int max_name_width, max_data_width;
 bool rdt_alloc_capable;
 
 static void
-mba_wrmsr_intel(struct rdt_domain *d, struct msr_param *m,
+mba_wrmsr_intel(struct rdt_ctrl_domain *d, struct msr_param *m,
 		struct rdt_resource *r);
 static void
-cat_wrmsr(struct rdt_domain *d, struct msr_param *m, struct rdt_resource *r);
+cat_wrmsr(struct rdt_ctrl_domain *d, struct msr_param *m, struct rdt_resource *r);
 static void
-mba_wrmsr_amd(struct rdt_domain *d, struct msr_param *m,
+mba_wrmsr_amd(struct rdt_ctrl_domain *d, struct msr_param *m,
 	      struct rdt_resource *r);
 
 #define ctrl_domain_init(id) LIST_HEAD_INIT(rdt_resources_all[id].r_resctrl.ctrl_domains)
@@ -307,11 +307,11 @@ static void rdt_get_cdp_l2_config(void)
 }
 
 static void
-mba_wrmsr_amd(struct rdt_domain *d, struct msr_param *m, struct rdt_resource *r)
+mba_wrmsr_amd(struct rdt_ctrl_domain *d, struct msr_param *m, struct rdt_resource *r)
 {
-	unsigned int i;
-	struct rdt_hw_domain *hw_dom = resctrl_to_arch_dom(d);
+	struct rdt_hw_ctrl_domain *hw_dom = resctrl_to_arch_ctrl_dom(d);
 	struct rdt_hw_resource *hw_res = resctrl_to_arch_res(r);
+	unsigned int i;
 
 	for (i = m->low; i < m->high; i++)
 		wrmsrl(hw_res->msr_base + i, hw_dom->ctrl_val[i]);
@@ -332,12 +332,12 @@ static u32 delay_bw_map(unsigned long bw, struct rdt_resource *r)
 }
 
 static void
-mba_wrmsr_intel(struct rdt_domain *d, struct msr_param *m,
+mba_wrmsr_intel(struct rdt_ctrl_domain *d, struct msr_param *m,
 		struct rdt_resource *r)
 {
-	unsigned int i;
-	struct rdt_hw_domain *hw_dom = resctrl_to_arch_dom(d);
+	struct rdt_hw_ctrl_domain *hw_dom = resctrl_to_arch_ctrl_dom(d);
 	struct rdt_hw_resource *hw_res = resctrl_to_arch_res(r);
+	unsigned int i;
 
 	/*  Write the delay values for mba. */
 	for (i = m->low; i < m->high; i++)
@@ -345,19 +345,19 @@ mba_wrmsr_intel(struct rdt_domain *d, struct msr_param *m,
 }
 
 static void
-cat_wrmsr(struct rdt_domain *d, struct msr_param *m, struct rdt_resource *r)
+cat_wrmsr(struct rdt_ctrl_domain *d, struct msr_param *m, struct rdt_resource *r)
 {
-	unsigned int i;
-	struct rdt_hw_domain *hw_dom = resctrl_to_arch_dom(d);
+	struct rdt_hw_ctrl_domain *hw_dom = resctrl_to_arch_ctrl_dom(d);
 	struct rdt_hw_resource *hw_res = resctrl_to_arch_res(r);
+	unsigned int i;
 
 	for (i = m->low; i < m->high; i++)
 		wrmsrl(hw_res->msr_base + i, hw_dom->ctrl_val[i]);
 }
 
-struct rdt_domain *get_ctrl_domain_from_cpu(int cpu, struct rdt_resource *r)
+struct rdt_ctrl_domain *get_ctrl_domain_from_cpu(int cpu, struct rdt_resource *r)
 {
-	struct rdt_domain *d;
+	struct rdt_ctrl_domain *d;
 
 	list_for_each_entry(d, &r->ctrl_domains, hdr.list) {
 		/* Find the domain that contains this CPU */
@@ -379,7 +379,7 @@ void rdt_ctrl_update(void *arg)
 	struct rdt_hw_resource *hw_res = resctrl_to_arch_res(m->res);
 	struct rdt_resource *r = m->res;
 	int cpu = smp_processor_id();
-	struct rdt_domain *d;
+	struct rdt_ctrl_domain *d;
 
 	d = get_ctrl_domain_from_cpu(cpu, r);
 	if (d) {
@@ -434,18 +434,23 @@ static void setup_default_ctrlval(struct rdt_resource *r, u32 *dc)
 		*dc = r->default_ctrl;
 }
 
-static void domain_free(struct rdt_hw_domain *hw_dom)
+static void ctrl_domain_free(struct rdt_hw_ctrl_domain *hw_dom)
+{
+	kfree(hw_dom->ctrl_val);
+	kfree(hw_dom);
+}
+
+static void mon_domain_free(struct rdt_hw_mon_domain *hw_dom)
 {
 	kfree(hw_dom->arch_mbm_total);
 	kfree(hw_dom->arch_mbm_local);
-	kfree(hw_dom->ctrl_val);
 	kfree(hw_dom);
 }
 
-static int domain_setup_ctrlval(struct rdt_resource *r, struct rdt_domain *d)
+static int domain_setup_ctrlval(struct rdt_resource *r, struct rdt_ctrl_domain *d)
 {
+	struct rdt_hw_ctrl_domain *hw_dom = resctrl_to_arch_ctrl_dom(d);
 	struct rdt_hw_resource *hw_res = resctrl_to_arch_res(r);
-	struct rdt_hw_domain *hw_dom = resctrl_to_arch_dom(d);
 	struct msr_param m;
 	u32 *dc;
 
@@ -468,7 +473,7 @@ static int domain_setup_ctrlval(struct rdt_resource *r, struct rdt_domain *d)
  * @num_rmid:	The size of the MBM counter array
  * @hw_dom:	The domain that owns the allocated arrays
  */
-static int arch_domain_mbm_alloc(u32 num_rmid, struct rdt_hw_domain *hw_dom)
+static int arch_domain_mbm_alloc(u32 num_rmid, struct rdt_hw_mon_domain *hw_dom)
 {
 	size_t tsize;
 
@@ -507,10 +512,10 @@ static int get_domain_id_from_scope(int cpu, enum resctrl_scope scope)
 static void domain_add_cpu_ctrl(int cpu, struct rdt_resource *r)
 {
 	int id = get_domain_id_from_scope(cpu, r->ctrl_scope);
+	struct rdt_hw_ctrl_domain *hw_dom;
 	struct list_head *add_pos = NULL;
-	struct rdt_hw_domain *hw_dom;
 	struct rdt_domain_hdr *hdr;
-	struct rdt_domain *d;
+	struct rdt_ctrl_domain *d;
 	int err;
 
 	if (id < 0) {
@@ -525,7 +530,7 @@ static void domain_add_cpu_ctrl(int cpu, struct rdt_resource *r)
 		if (WARN_ON_ONCE(hdr->type != RESCTRL_CTRL_DOMAIN))
 			return;
 
-		d = container_of(hdr, struct rdt_domain, hdr);
+		d = container_of(hdr, struct rdt_ctrl_domain, hdr);
 
 		cpumask_set_cpu(cpu, &d->hdr.cpu_mask);
 		if (r->cache.arch_has_per_cpu_cfg)
@@ -545,7 +550,7 @@ static void domain_add_cpu_ctrl(int cpu, struct rdt_resource *r)
 	rdt_domain_reconfigure_cdp(r);
 
 	if (domain_setup_ctrlval(r, d)) {
-		domain_free(hw_dom);
+		ctrl_domain_free(hw_dom);
 		return;
 	}
 
@@ -554,17 +559,17 @@ static void domain_add_cpu_ctrl(int cpu, struct rdt_resource *r)
 	err = resctrl_online_ctrl_domain(r, d);
 	if (err) {
 		list_del(&d->hdr.list);
-		domain_free(hw_dom);
+		ctrl_domain_free(hw_dom);
 	}
 }
 
 static void domain_add_cpu_mon(int cpu, struct rdt_resource *r)
 {
 	int id = get_domain_id_from_scope(cpu, r->mon_scope);
+	struct rdt_hw_mon_domain *hw_dom;
 	struct list_head *add_pos = NULL;
-	struct rdt_hw_domain *hw_dom;
 	struct rdt_domain_hdr *hdr;
-	struct rdt_domain *d;
+	struct rdt_mon_domain *d;
 	int err;
 
 	if (id < 0) {
@@ -583,7 +588,7 @@ static void domain_add_cpu_mon(int cpu, struct rdt_resource *r)
 		if (WARN_ON_ONCE(hdr->type != RESCTRL_MON_DOMAIN))
 			return;
 
-		d = container_of(hdr, struct rdt_domain, hdr);
+		d = container_of(hdr, struct rdt_mon_domain, hdr);
 
 		cpumask_set_cpu(cpu, &d->hdr.cpu_mask);
 		return;
@@ -599,7 +604,7 @@ static void domain_add_cpu_mon(int cpu, struct rdt_resource *r)
 	cpumask_set_cpu(cpu, &d->hdr.cpu_mask);
 
 	if (arch_domain_mbm_alloc(r->num_rmid, hw_dom)) {
-		domain_free(hw_dom);
+		mon_domain_free(hw_dom);
 		return;
 	}
 
@@ -608,7 +613,7 @@ static void domain_add_cpu_mon(int cpu, struct rdt_resource *r)
 	err = resctrl_online_mon_domain(r, d);
 	if (err) {
 		list_del(&d->hdr.list);
-		domain_free(hw_dom);
+		mon_domain_free(hw_dom);
 	}
 }
 
@@ -626,9 +631,9 @@ static void domain_add_cpu(int cpu, struct rdt_resource *r)
 static void domain_remove_cpu_ctrl(int cpu, struct rdt_resource *r)
 {
 	int id = get_domain_id_from_scope(cpu, r->ctrl_scope);
-	struct rdt_hw_domain *hw_dom;
+	struct rdt_hw_ctrl_domain *hw_dom;
 	struct rdt_domain_hdr *hdr;
-	struct rdt_domain *d;
+	struct rdt_ctrl_domain *d;
 
 	if (id < 0)
 		return;
@@ -642,8 +647,8 @@ static void domain_remove_cpu_ctrl(int cpu, struct rdt_resource *r)
 	if (WARN_ON_ONCE(hdr->type != RESCTRL_CTRL_DOMAIN))
 		return;
 
-	d = container_of(hdr, struct rdt_domain, hdr);
-	hw_dom = resctrl_to_arch_dom(d);
+	d = container_of(hdr, struct rdt_ctrl_domain, hdr);
+	hw_dom = resctrl_to_arch_ctrl_dom(d);
 
 	cpumask_clear_cpu(cpu, &d->hdr.cpu_mask);
 	if (cpumask_empty(&d->hdr.cpu_mask)) {
@@ -651,12 +656,12 @@ static void domain_remove_cpu_ctrl(int cpu, struct rdt_resource *r)
 		list_del(&d->hdr.list);
 
 		/*
-		 * rdt_domain "d" is going to be freed below, so clear
+		 * rdt_ctrl_domain "d" is going to be freed below, so clear
 		 * its pointer from pseudo_lock_region struct.
 		 */
 		if (d->plr)
 			d->plr->d = NULL;
-		domain_free(hw_dom);
+		ctrl_domain_free(hw_dom);
 
 		return;
 	}
@@ -665,9 +670,9 @@ static void domain_remove_cpu_ctrl(int cpu, struct rdt_resource *r)
 static void domain_remove_cpu_mon(int cpu, struct rdt_resource *r)
 {
 	int id = get_domain_id_from_scope(cpu, r->mon_scope);
-	struct rdt_hw_domain *hw_dom;
+	struct rdt_hw_mon_domain *hw_dom;
 	struct rdt_domain_hdr *hdr;
-	struct rdt_domain *d;
+	struct rdt_mon_domain *d;
 
 	if (id < 0)
 		return;
@@ -681,14 +686,14 @@ static void domain_remove_cpu_mon(int cpu, struct rdt_resource *r)
 	if (WARN_ON_ONCE(hdr->type != RESCTRL_MON_DOMAIN))
 		return;
 
-	d = container_of(hdr, struct rdt_domain, hdr);
-	hw_dom = resctrl_to_arch_dom(d);
+	d = container_of(hdr, struct rdt_mon_domain, hdr);
+	hw_dom = resctrl_to_arch_mon_dom(d);
 
 	cpumask_clear_cpu(cpu, &d->hdr.cpu_mask);
 	if (cpumask_empty(&d->hdr.cpu_mask)) {
 		resctrl_offline_mon_domain(r, d);
 		list_del(&d->hdr.list);
-		domain_free(hw_dom);
+		mon_domain_free(hw_dom);
 
 		return;
 	}
diff --git a/arch/x86/kernel/cpu/resctrl/ctrlmondata.c b/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
index 0b4136c42762..08fc97ce4135 100644
--- a/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
+++ b/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
@@ -58,7 +58,7 @@ static bool bw_validate(char *buf, unsigned long *data, struct rdt_resource *r)
 }
 
 int parse_bw(struct rdt_parse_data *data, struct resctrl_schema *s,
-	     struct rdt_domain *d)
+	     struct rdt_ctrl_domain *d)
 {
 	struct resctrl_staged_config *cfg;
 	u32 closid = data->rdtgrp->closid;
@@ -137,7 +137,7 @@ static bool cbm_validate(char *buf, u32 *data, struct rdt_resource *r)
  * resource type.
  */
 int parse_cbm(struct rdt_parse_data *data, struct resctrl_schema *s,
-	      struct rdt_domain *d)
+	      struct rdt_ctrl_domain *d)
 {
 	struct rdtgroup *rdtgrp = data->rdtgrp;
 	struct resctrl_staged_config *cfg;
@@ -206,8 +206,8 @@ static int parse_line(char *line, struct resctrl_schema *s,
 	struct resctrl_staged_config *cfg;
 	struct rdt_resource *r = s->res;
 	struct rdt_parse_data data;
+	struct rdt_ctrl_domain *d;
 	char *dom = NULL, *id;
-	struct rdt_domain *d;
 	unsigned long dom_id;
 
 	if (rdtgrp->mode == RDT_MODE_PSEUDO_LOCKSETUP &&
@@ -267,11 +267,11 @@ static u32 get_config_index(u32 closid, enum resctrl_conf_type type)
 	}
 }
 
-static bool apply_config(struct rdt_hw_domain *hw_dom,
+static bool apply_config(struct rdt_hw_ctrl_domain *hw_dom,
 			 struct resctrl_staged_config *cfg, u32 idx,
 			 cpumask_var_t cpu_mask)
 {
-	struct rdt_domain *dom = &hw_dom->d_resctrl;
+	struct rdt_ctrl_domain *dom = &hw_dom->d_resctrl;
 
 	if (cfg->new_ctrl != hw_dom->ctrl_val[idx]) {
 		cpumask_set_cpu(cpumask_any(&dom->hdr.cpu_mask), cpu_mask);
@@ -283,11 +283,11 @@ static bool apply_config(struct rdt_hw_domain *hw_dom,
 	return false;
 }
 
-int resctrl_arch_update_one(struct rdt_resource *r, struct rdt_domain *d,
+int resctrl_arch_update_one(struct rdt_resource *r, struct rdt_ctrl_domain *d,
 			    u32 closid, enum resctrl_conf_type t, u32 cfg_val)
 {
+	struct rdt_hw_ctrl_domain *hw_dom = resctrl_to_arch_ctrl_dom(d);
 	struct rdt_hw_resource *hw_res = resctrl_to_arch_res(r);
-	struct rdt_hw_domain *hw_dom = resctrl_to_arch_dom(d);
 	u32 idx = get_config_index(closid, t);
 	struct msr_param msr_param;
 
@@ -307,11 +307,11 @@ int resctrl_arch_update_one(struct rdt_resource *r, struct rdt_domain *d,
 int resctrl_arch_update_domains(struct rdt_resource *r, u32 closid)
 {
 	struct resctrl_staged_config *cfg;
-	struct rdt_hw_domain *hw_dom;
+	struct rdt_hw_ctrl_domain *hw_dom;
 	struct msr_param msr_param;
+	struct rdt_ctrl_domain *d;
 	enum resctrl_conf_type t;
 	cpumask_var_t cpu_mask;
-	struct rdt_domain *d;
 	u32 idx;
 
 	if (!zalloc_cpumask_var(&cpu_mask, GFP_KERNEL))
@@ -319,7 +319,7 @@ int resctrl_arch_update_domains(struct rdt_resource *r, u32 closid)
 
 	msr_param.res = NULL;
 	list_for_each_entry(d, &r->ctrl_domains, hdr.list) {
-		hw_dom = resctrl_to_arch_dom(d);
+		hw_dom = resctrl_to_arch_ctrl_dom(d);
 		for (t = 0; t < CDP_NUM_TYPES; t++) {
 			cfg = &hw_dom->d_resctrl.staged_config[t];
 			if (!cfg->have_new_ctrl)
@@ -449,10 +449,10 @@ ssize_t rdtgroup_schemata_write(struct kernfs_open_file *of,
 	return ret ?: nbytes;
 }
 
-u32 resctrl_arch_get_config(struct rdt_resource *r, struct rdt_domain *d,
+u32 resctrl_arch_get_config(struct rdt_resource *r, struct rdt_ctrl_domain *d,
 			    u32 closid, enum resctrl_conf_type type)
 {
-	struct rdt_hw_domain *hw_dom = resctrl_to_arch_dom(d);
+	struct rdt_hw_ctrl_domain *hw_dom = resctrl_to_arch_ctrl_dom(d);
 	u32 idx = get_config_index(closid, type);
 
 	return hw_dom->ctrl_val[idx];
@@ -461,7 +461,7 @@ u32 resctrl_arch_get_config(struct rdt_resource *r, struct rdt_domain *d,
 static void show_doms(struct seq_file *s, struct resctrl_schema *schema, int closid)
 {
 	struct rdt_resource *r = schema->res;
-	struct rdt_domain *dom;
+	struct rdt_ctrl_domain *dom;
 	bool sep = false;
 	u32 ctrl_val;
 
@@ -523,7 +523,7 @@ int rdtgroup_schemata_show(struct kernfs_open_file *of,
 }
 
 void mon_event_read(struct rmid_read *rr, struct rdt_resource *r,
-		    struct rdt_domain *d, struct rdtgroup *rdtgrp,
+		    struct rdt_mon_domain *d, struct rdtgroup *rdtgrp,
 		    int evtid, int first)
 {
 	/*
@@ -543,11 +543,11 @@ int rdtgroup_mondata_show(struct seq_file *m, void *arg)
 {
 	struct kernfs_open_file *of = m->private;
 	struct rdt_domain_hdr *hdr;
+	struct rdt_mon_domain *d;
 	u32 resid, evtid, domid;
 	struct rdtgroup *rdtgrp;
 	struct rdt_resource *r;
 	union mon_data_bits md;
-	struct rdt_domain *d;
 	struct rmid_read rr;
 	int ret = 0;
 
@@ -568,7 +568,7 @@ int rdtgroup_mondata_show(struct seq_file *m, void *arg)
 		ret = -ENOENT;
 		goto out;
 	}
-	d = container_of(hdr, struct rdt_domain, hdr);
+	d = container_of(hdr, struct rdt_mon_domain, hdr);
 
 	mon_event_read(&rr, r, d, rdtgrp, evtid, false);
 
diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c
index ec5ad926c5dc..4e145f5620b0 100644
--- a/arch/x86/kernel/cpu/resctrl/monitor.c
+++ b/arch/x86/kernel/cpu/resctrl/monitor.c
@@ -170,7 +170,7 @@ static int __rmid_read(u32 rmid, enum resctrl_event_id eventid, u64 *val)
 	return 0;
 }
 
-static struct arch_mbm_state *get_arch_mbm_state(struct rdt_hw_domain *hw_dom,
+static struct arch_mbm_state *get_arch_mbm_state(struct rdt_hw_mon_domain *hw_dom,
 						 u32 rmid,
 						 enum resctrl_event_id eventid)
 {
@@ -189,10 +189,10 @@ static struct arch_mbm_state *get_arch_mbm_state(struct rdt_hw_domain *hw_dom,
 	return NULL;
 }
 
-void resctrl_arch_reset_rmid(struct rdt_resource *r, struct rdt_domain *d,
+void resctrl_arch_reset_rmid(struct rdt_resource *r, struct rdt_mon_domain *d,
 			     u32 rmid, enum resctrl_event_id eventid)
 {
-	struct rdt_hw_domain *hw_dom = resctrl_to_arch_dom(d);
+	struct rdt_hw_mon_domain *hw_dom = resctrl_to_arch_mon_dom(d);
 	struct arch_mbm_state *am;
 
 	am = get_arch_mbm_state(hw_dom, rmid, eventid);
@@ -208,9 +208,9 @@ void resctrl_arch_reset_rmid(struct rdt_resource *r, struct rdt_domain *d,
  * Assumes that hardware counters are also reset and thus that there is
  * no need to record initial non-zero counts.
  */
-void resctrl_arch_reset_rmid_all(struct rdt_resource *r, struct rdt_domain *d)
+void resctrl_arch_reset_rmid_all(struct rdt_resource *r, struct rdt_mon_domain *d)
 {
-	struct rdt_hw_domain *hw_dom = resctrl_to_arch_dom(d);
+	struct rdt_hw_mon_domain *hw_dom = resctrl_to_arch_mon_dom(d);
 
 	if (is_mbm_total_enabled())
 		memset(hw_dom->arch_mbm_total, 0,
@@ -229,11 +229,11 @@ static u64 mbm_overflow_count(u64 prev_msr, u64 cur_msr, unsigned int width)
 	return chunks >> shift;
 }
 
-int resctrl_arch_rmid_read(struct rdt_resource *r, struct rdt_domain *d,
+int resctrl_arch_rmid_read(struct rdt_resource *r, struct rdt_mon_domain *d,
 			   u32 rmid, enum resctrl_event_id eventid, u64 *val)
 {
+	struct rdt_hw_mon_domain *hw_dom = resctrl_to_arch_mon_dom(d);
 	struct rdt_hw_resource *hw_res = resctrl_to_arch_res(r);
-	struct rdt_hw_domain *hw_dom = resctrl_to_arch_dom(d);
 	struct arch_mbm_state *am;
 	u64 msr_val, chunks;
 	int ret;
@@ -266,7 +266,7 @@ int resctrl_arch_rmid_read(struct rdt_resource *r, struct rdt_domain *d,
  * decrement the count. If the busy count gets to zero on an RMID, we
  * free the RMID
  */
-void __check_limbo(struct rdt_domain *d, bool force_free)
+void __check_limbo(struct rdt_mon_domain *d, bool force_free)
 {
 	struct rdt_resource *r = &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl;
 	struct rmid_entry *entry;
@@ -305,7 +305,7 @@ void __check_limbo(struct rdt_domain *d, bool force_free)
 	}
 }
 
-bool has_busy_rmid(struct rdt_resource *r, struct rdt_domain *d)
+bool has_busy_rmid(struct rdt_resource *r, struct rdt_mon_domain *d)
 {
 	return find_first_bit(d->rmid_busy_llc, r->num_rmid) != r->num_rmid;
 }
@@ -334,7 +334,7 @@ int alloc_rmid(void)
 static void add_rmid_to_limbo(struct rmid_entry *entry)
 {
 	struct rdt_resource *r = &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl;
-	struct rdt_domain *d;
+	struct rdt_mon_domain *d;
 	int cpu, err;
 	u64 val = 0;
 
@@ -383,7 +383,7 @@ void free_rmid(u32 rmid)
 		list_add_tail(&entry->list, &rmid_free_lru);
 }
 
-static struct mbm_state *get_mbm_state(struct rdt_domain *d, u32 rmid,
+static struct mbm_state *get_mbm_state(struct rdt_mon_domain *d, u32 rmid,
 				       enum resctrl_event_id evtid)
 {
 	switch (evtid) {
@@ -516,13 +516,13 @@ void mon_event_count(void *info)
  * throttle MSRs already have low percentage values.  To avoid
  * unnecessarily restricting such rdtgroups, we also increase the bandwidth.
  */
-static void update_mba_bw(struct rdtgroup *rgrp, struct rdt_domain *dom_mbm)
+static void update_mba_bw(struct rdtgroup *rgrp, struct rdt_mon_domain *dom_mbm)
 {
 	u32 closid, rmid, cur_msr_val, new_msr_val;
 	struct mbm_state *pmbm_data, *cmbm_data;
+	struct rdt_ctrl_domain *dom_mba;
 	u32 cur_bw, delta_bw, user_bw;
 	struct rdt_resource *r_mba;
-	struct rdt_domain *dom_mba;
 	struct list_head *head;
 	struct rdtgroup *entry;
 
@@ -600,7 +600,7 @@ static void update_mba_bw(struct rdtgroup *rgrp, struct rdt_domain *dom_mbm)
 	}
 }
 
-static void mbm_update(struct rdt_resource *r, struct rdt_domain *d, int rmid)
+static void mbm_update(struct rdt_resource *r, struct rdt_mon_domain *d, int rmid)
 {
 	struct rmid_read rr;
 
@@ -640,13 +640,13 @@ void cqm_handle_limbo(struct work_struct *work)
 {
 	unsigned long delay = msecs_to_jiffies(CQM_LIMBOCHECK_INTERVAL);
 	int cpu = smp_processor_id();
+	struct rdt_mon_domain *d;
 	struct rdt_resource *r;
-	struct rdt_domain *d;
 
 	mutex_lock(&rdtgroup_mutex);
 
 	r = &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl;
-	d = container_of(work, struct rdt_domain, cqm_limbo.work);
+	d = container_of(work, struct rdt_mon_domain, cqm_limbo.work);
 
 	__check_limbo(d, false);
 
@@ -656,7 +656,7 @@ void cqm_handle_limbo(struct work_struct *work)
 	mutex_unlock(&rdtgroup_mutex);
 }
 
-void cqm_setup_limbo_handler(struct rdt_domain *dom, unsigned long delay_ms)
+void cqm_setup_limbo_handler(struct rdt_mon_domain *dom, unsigned long delay_ms)
 {
 	unsigned long delay = msecs_to_jiffies(delay_ms);
 	int cpu;
@@ -672,9 +672,9 @@ void mbm_handle_overflow(struct work_struct *work)
 	unsigned long delay = msecs_to_jiffies(MBM_OVERFLOW_INTERVAL);
 	struct rdtgroup *prgrp, *crgrp;
 	int cpu = smp_processor_id();
+	struct rdt_mon_domain *d;
 	struct list_head *head;
 	struct rdt_resource *r;
-	struct rdt_domain *d;
 
 	mutex_lock(&rdtgroup_mutex);
 
@@ -682,7 +682,7 @@ void mbm_handle_overflow(struct work_struct *work)
 		goto out_unlock;
 
 	r = &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl;
-	d = container_of(work, struct rdt_domain, mbm_over.work);
+	d = container_of(work, struct rdt_mon_domain, mbm_over.work);
 
 	list_for_each_entry(prgrp, &rdt_all_groups, rdtgroup_list) {
 		mbm_update(r, d, prgrp->mon.rmid);
@@ -701,7 +701,7 @@ void mbm_handle_overflow(struct work_struct *work)
 	mutex_unlock(&rdtgroup_mutex);
 }
 
-void mbm_setup_overflow_handler(struct rdt_domain *dom, unsigned long delay_ms)
+void mbm_setup_overflow_handler(struct rdt_mon_domain *dom, unsigned long delay_ms)
 {
 	unsigned long delay = msecs_to_jiffies(delay_ms);
 	int cpu;
diff --git a/arch/x86/kernel/cpu/resctrl/pseudo_lock.c b/arch/x86/kernel/cpu/resctrl/pseudo_lock.c
index ed6d59af1cef..08d35f828bc3 100644
--- a/arch/x86/kernel/cpu/resctrl/pseudo_lock.c
+++ b/arch/x86/kernel/cpu/resctrl/pseudo_lock.c
@@ -814,7 +814,7 @@ int rdtgroup_locksetup_exit(struct rdtgroup *rdtgrp)
  * Return: true if @cbm overlaps with pseudo-locked region on @d, false
  * otherwise.
  */
-bool rdtgroup_cbm_overlaps_pseudo_locked(struct rdt_domain *d, unsigned long cbm)
+bool rdtgroup_cbm_overlaps_pseudo_locked(struct rdt_ctrl_domain *d, unsigned long cbm)
 {
 	unsigned int cbm_len;
 	unsigned long cbm_b;
@@ -841,11 +841,11 @@ bool rdtgroup_cbm_overlaps_pseudo_locked(struct rdt_domain *d, unsigned long cbm
  *         if it is not possible to test due to memory allocation issue,
  *         false otherwise.
  */
-bool rdtgroup_pseudo_locked_in_hierarchy(struct rdt_domain *d)
+bool rdtgroup_pseudo_locked_in_hierarchy(struct rdt_ctrl_domain *d)
 {
+	struct rdt_ctrl_domain *d_i;
 	cpumask_var_t cpu_with_psl;
 	struct rdt_resource *r;
-	struct rdt_domain *d_i;
 	bool ret = false;
 
 	if (!zalloc_cpumask_var(&cpu_with_psl, GFP_KERNEL))
diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
index 760013ed1bff..21bbd832f3f2 100644
--- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
+++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
@@ -85,8 +85,8 @@ void rdt_last_cmd_printf(const char *fmt, ...)
 
 void rdt_staged_configs_clear(void)
 {
+	struct rdt_ctrl_domain *dom;
 	struct rdt_resource *r;
-	struct rdt_domain *dom;
 
 	lockdep_assert_held(&rdtgroup_mutex);
 
@@ -976,7 +976,7 @@ static int rdt_bit_usage_show(struct kernfs_open_file *of,
 	unsigned long sw_shareable = 0, hw_shareable = 0;
 	unsigned long exclusive = 0, pseudo_locked = 0;
 	struct rdt_resource *r = s->res;
-	struct rdt_domain *dom;
+	struct rdt_ctrl_domain *dom;
 	int i, hwb, swb, excl, psl;
 	enum rdtgrp_mode mode;
 	bool sep = false;
@@ -1205,7 +1205,7 @@ static int rdt_has_sparse_bitmasks_show(struct kernfs_open_file *of,
  *
  * Return: false if CBM does not overlap, true if it does.
  */
-static bool __rdtgroup_cbm_overlaps(struct rdt_resource *r, struct rdt_domain *d,
+static bool __rdtgroup_cbm_overlaps(struct rdt_resource *r, struct rdt_ctrl_domain *d,
 				    unsigned long cbm, int closid,
 				    enum resctrl_conf_type type, bool exclusive)
 {
@@ -1260,7 +1260,7 @@ static bool __rdtgroup_cbm_overlaps(struct rdt_resource *r, struct rdt_domain *d
  *
  * Return: true if CBM overlap detected, false if there is no overlap
  */
-bool rdtgroup_cbm_overlaps(struct resctrl_schema *s, struct rdt_domain *d,
+bool rdtgroup_cbm_overlaps(struct resctrl_schema *s, struct rdt_ctrl_domain *d,
 			   unsigned long cbm, int closid, bool exclusive)
 {
 	enum resctrl_conf_type peer_type = resctrl_peer_type(s->conf_type);
@@ -1291,10 +1291,10 @@ bool rdtgroup_cbm_overlaps(struct resctrl_schema *s, struct rdt_domain *d,
 static bool rdtgroup_mode_test_exclusive(struct rdtgroup *rdtgrp)
 {
 	int closid = rdtgrp->closid;
+	struct rdt_ctrl_domain *d;
 	struct resctrl_schema *s;
 	struct rdt_resource *r;
 	bool has_cache = false;
-	struct rdt_domain *d;
 	u32 ctrl;
 
 	list_for_each_entry(s, &resctrl_schema_all, list) {
@@ -1407,7 +1407,7 @@ static ssize_t rdtgroup_mode_write(struct kernfs_open_file *of,
  * bitmap functions work correctly.
  */
 unsigned int rdtgroup_cbm_to_size(struct rdt_resource *r,
-				  struct rdt_domain *d, unsigned long cbm)
+				  struct rdt_ctrl_domain *d, unsigned long cbm)
 {
 	struct cpu_cacheinfo *ci;
 	unsigned int size = 0;
@@ -1439,9 +1439,9 @@ static int rdtgroup_size_show(struct kernfs_open_file *of,
 {
 	struct resctrl_schema *schema;
 	enum resctrl_conf_type type;
+	struct rdt_ctrl_domain *d;
 	struct rdtgroup *rdtgrp;
 	struct rdt_resource *r;
-	struct rdt_domain *d;
 	unsigned int size;
 	int ret = 0;
 	u32 closid;
@@ -1553,7 +1553,7 @@ static void mon_event_config_read(void *info)
 	mon_info->mon_config = msrval & MAX_EVT_CONFIG_BITS;
 }
 
-static void mondata_config_read(struct rdt_domain *d, struct mon_config_info *mon_info)
+static void mondata_config_read(struct rdt_mon_domain *d, struct mon_config_info *mon_info)
 {
 	smp_call_function_any(&d->hdr.cpu_mask, mon_event_config_read, mon_info, 1);
 }
@@ -1561,7 +1561,7 @@ static void mondata_config_read(struct rdt_domain *d, struct mon_config_info *mo
 static int mbm_config_show(struct seq_file *s, struct rdt_resource *r, u32 evtid)
 {
 	struct mon_config_info mon_info = {0};
-	struct rdt_domain *dom;
+	struct rdt_mon_domain *dom;
 	bool sep = false;
 
 	mutex_lock(&rdtgroup_mutex);
@@ -1618,7 +1618,7 @@ static void mon_event_config_write(void *info)
 }
 
 static int mbm_config_write_domain(struct rdt_resource *r,
-				   struct rdt_domain *d, u32 evtid, u32 val)
+				   struct rdt_mon_domain *d, u32 evtid, u32 val)
 {
 	struct mon_config_info mon_info = {0};
 	int ret = 0;
@@ -1668,7 +1668,7 @@ static int mon_config_write(struct rdt_resource *r, char *tok, u32 evtid)
 {
 	char *dom_str = NULL, *id_str;
 	unsigned long dom_id, val;
-	struct rdt_domain *d;
+	struct rdt_mon_domain *d;
 	int ret = 0;
 
 next:
@@ -2216,9 +2216,9 @@ static inline bool is_mba_linear(void)
 static int set_cache_qos_cfg(int level, bool enable)
 {
 	void (*update)(void *arg);
+	struct rdt_ctrl_domain *d;
 	struct rdt_resource *r_l;
 	cpumask_var_t cpu_mask;
-	struct rdt_domain *d;
 	int cpu;
 
 	if (level == RDT_RESOURCE_L3)
@@ -2265,7 +2265,7 @@ void rdt_domain_reconfigure_cdp(struct rdt_resource *r)
 		l3_qos_cfg_update(&hw_res->cdp_enabled);
 }
 
-static int mba_sc_domain_allocate(struct rdt_resource *r, struct rdt_domain *d)
+static int mba_sc_domain_allocate(struct rdt_resource *r, struct rdt_ctrl_domain *d)
 {
 	u32 num_closid = resctrl_arch_get_num_closid(r);
 	int cpu = cpumask_any(&d->hdr.cpu_mask);
@@ -2283,7 +2283,7 @@ static int mba_sc_domain_allocate(struct rdt_resource *r, struct rdt_domain *d)
 }
 
 static void mba_sc_domain_destroy(struct rdt_resource *r,
-				  struct rdt_domain *d)
+				  struct rdt_ctrl_domain *d)
 {
 	kfree(d->mbps_val);
 	d->mbps_val = NULL;
@@ -2309,7 +2309,7 @@ static int set_mba_sc(bool mba_sc)
 {
 	struct rdt_resource *r = &rdt_resources_all[RDT_RESOURCE_MBA].r_resctrl;
 	u32 num_closid = resctrl_arch_get_num_closid(r);
-	struct rdt_domain *d;
+	struct rdt_ctrl_domain *d;
 	int i;
 
 	if (!supports_mba_mbps() || mba_sc == is_mba_sc(r))
@@ -2578,7 +2578,7 @@ static int rdt_get_tree(struct fs_context *fc)
 {
 	struct rdt_fs_context *ctx = rdt_fc2context(fc);
 	unsigned long flags = RFTYPE_CTRL_BASE;
-	struct rdt_domain *dom;
+	struct rdt_mon_domain *dom;
 	struct rdt_resource *r;
 	int ret;
 
@@ -2762,10 +2762,10 @@ static int rdt_init_fs_context(struct fs_context *fc)
 static int reset_all_ctrls(struct rdt_resource *r)
 {
 	struct rdt_hw_resource *hw_res = resctrl_to_arch_res(r);
-	struct rdt_hw_domain *hw_dom;
+	struct rdt_hw_ctrl_domain *hw_dom;
 	struct msr_param msr_param;
+	struct rdt_ctrl_domain *d;
 	cpumask_var_t cpu_mask;
-	struct rdt_domain *d;
 	int i;
 
 	if (!zalloc_cpumask_var(&cpu_mask, GFP_KERNEL))
@@ -2781,7 +2781,7 @@ static int reset_all_ctrls(struct rdt_resource *r)
 	 * from each domain to update the MSRs below.
 	 */
 	list_for_each_entry(d, &r->ctrl_domains, hdr.list) {
-		hw_dom = resctrl_to_arch_dom(d);
+		hw_dom = resctrl_to_arch_ctrl_dom(d);
 		cpumask_set_cpu(cpumask_any(&d->hdr.cpu_mask), cpu_mask);
 
 		for (i = 0; i < hw_res->num_closid; i++)
@@ -2976,7 +2976,7 @@ static void rmdir_mondata_subdir_allrdtgrp(struct rdt_resource *r,
 }
 
 static int mkdir_mondata_subdir(struct kernfs_node *parent_kn,
-				struct rdt_domain *d,
+				struct rdt_mon_domain *d,
 				struct rdt_resource *r, struct rdtgroup *prgrp)
 {
 	union mon_data_bits priv;
@@ -3025,7 +3025,7 @@ static int mkdir_mondata_subdir(struct kernfs_node *parent_kn,
  * and "monitor" groups with given domain id.
  */
 static void mkdir_mondata_subdir_allrdtgrp(struct rdt_resource *r,
-					   struct rdt_domain *d)
+					   struct rdt_mon_domain *d)
 {
 	struct kernfs_node *parent_kn;
 	struct rdtgroup *prgrp, *crgrp;
@@ -3047,7 +3047,7 @@ static int mkdir_mondata_subdir_alldom(struct kernfs_node *parent_kn,
 				       struct rdt_resource *r,
 				       struct rdtgroup *prgrp)
 {
-	struct rdt_domain *dom;
+	struct rdt_mon_domain *dom;
 	int ret;
 
 	list_for_each_entry(dom, &r->mon_domains, hdr.list) {
@@ -3149,7 +3149,7 @@ static u32 cbm_ensure_valid(u32 _val, struct rdt_resource *r)
  * Set the RDT domain up to start off with all usable allocations. That is,
  * all shareable and unused bits. All-zero CBM is invalid.
  */
-static int __init_one_rdt_domain(struct rdt_domain *d, struct resctrl_schema *s,
+static int __init_one_rdt_domain(struct rdt_ctrl_domain *d, struct resctrl_schema *s,
 				 u32 closid)
 {
 	enum resctrl_conf_type peer_type = resctrl_peer_type(s->conf_type);
@@ -3229,7 +3229,7 @@ static int __init_one_rdt_domain(struct rdt_domain *d, struct resctrl_schema *s,
  */
 static int rdtgroup_init_cat(struct resctrl_schema *s, u32 closid)
 {
-	struct rdt_domain *d;
+	struct rdt_ctrl_domain *d;
 	int ret;
 
 	list_for_each_entry(d, &s->res->ctrl_domains, hdr.list) {
@@ -3245,7 +3245,7 @@ static int rdtgroup_init_cat(struct resctrl_schema *s, u32 closid)
 static void rdtgroup_init_mba(struct rdt_resource *r, u32 closid)
 {
 	struct resctrl_staged_config *cfg;
-	struct rdt_domain *d;
+	struct rdt_ctrl_domain *d;
 
 	list_for_each_entry(d, &r->ctrl_domains, hdr.list) {
 		if (is_mba_sc(r)) {
@@ -3842,14 +3842,14 @@ static void __init rdtgroup_setup_default(void)
 	mutex_unlock(&rdtgroup_mutex);
 }
 
-static void domain_destroy_mon_state(struct rdt_domain *d)
+static void domain_destroy_mon_state(struct rdt_mon_domain *d)
 {
 	bitmap_free(d->rmid_busy_llc);
 	kfree(d->mbm_total);
 	kfree(d->mbm_local);
 }
 
-void resctrl_offline_ctrl_domain(struct rdt_resource *r, struct rdt_domain *d)
+void resctrl_offline_ctrl_domain(struct rdt_resource *r, struct rdt_ctrl_domain *d)
 {
 	lockdep_assert_held(&rdtgroup_mutex);
 
@@ -3857,7 +3857,7 @@ void resctrl_offline_ctrl_domain(struct rdt_resource *r, struct rdt_domain *d)
 		mba_sc_domain_destroy(r, d);
 }
 
-void resctrl_offline_mon_domain(struct rdt_resource *r, struct rdt_domain *d)
+void resctrl_offline_mon_domain(struct rdt_resource *r, struct rdt_mon_domain *d)
 {
 	lockdep_assert_held(&rdtgroup_mutex);
 
@@ -3886,7 +3886,7 @@ void resctrl_offline_mon_domain(struct rdt_resource *r, struct rdt_domain *d)
 	domain_destroy_mon_state(d);
 }
 
-static int domain_setup_mon_state(struct rdt_resource *r, struct rdt_domain *d)
+static int domain_setup_mon_state(struct rdt_resource *r, struct rdt_mon_domain *d)
 {
 	size_t tsize;
 
@@ -3916,7 +3916,7 @@ static int domain_setup_mon_state(struct rdt_resource *r, struct rdt_domain *d)
 	return 0;
 }
 
-int resctrl_online_ctrl_domain(struct rdt_resource *r, struct rdt_domain *d)
+int resctrl_online_ctrl_domain(struct rdt_resource *r, struct rdt_ctrl_domain *d)
 {
 	lockdep_assert_held(&rdtgroup_mutex);
 
@@ -3926,7 +3926,7 @@ int resctrl_online_ctrl_domain(struct rdt_resource *r, struct rdt_domain *d)
 	return 0;
 }
 
-int resctrl_online_mon_domain(struct rdt_resource *r, struct rdt_domain *d)
+int resctrl_online_mon_domain(struct rdt_resource *r, struct rdt_mon_domain *d)
 {
 	int err;
 
-- 
2.41.0


^ permalink raw reply related	[flat|nested] 331+ messages in thread

* [PATCH v10 5/8] x86/resctrl: Add node-scope to the options for feature scope
  2023-10-31 21:17           ` [PATCH v10 " Tony Luck
                               ` (3 preceding siblings ...)
  2023-10-31 21:17             ` [PATCH v10 4/8] x86/resctrl: Split the rdt_domain and rdt_hw_domain structures Tony Luck
@ 2023-10-31 21:17             ` Tony Luck
  2023-10-31 21:17             ` [PATCH v10 6/8] x86/resctrl: Introduce snc_nodes_per_l3_cache Tony Luck
                               ` (3 subsequent siblings)
  8 siblings, 0 replies; 331+ messages in thread
From: Tony Luck @ 2023-10-31 21:17 UTC (permalink / raw)
  To: Fenghua Yu, Reinette Chatre, Peter Newman, Jonathan Corbet,
	Shuah Khan, x86
  Cc: Shaopeng Tan, James Morse, Jamie Iles, Babu Moger, Randy Dunlap,
	linux-kernel, linux-doc, patches, Tony Luck

All currently supported resctrl features are domain scoped, with the
domain scope matching that of the L2 or L3 caches.

Add RESCTRL_NODE as a new option for features that are scoped at the
same granularity as NUMA nodes. This is needed for Intel's Sub-NUMA
Cluster (SNC) feature where monitoring features are node scoped.

Reviewed-by: Peter Newman <peternewman@google.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
---
No changes since v9

 include/linux/resctrl.h            | 1 +
 arch/x86/kernel/cpu/resctrl/core.c | 2 ++
 2 files changed, 3 insertions(+)

diff --git a/include/linux/resctrl.h b/include/linux/resctrl.h
index 36503e8870cd..f42a5e59027b 100644
--- a/include/linux/resctrl.h
+++ b/include/linux/resctrl.h
@@ -172,6 +172,7 @@ struct resctrl_schema;
 enum resctrl_scope {
 	RESCTRL_L2_CACHE = 2,
 	RESCTRL_L3_CACHE = 3,
+	RESCTRL_NODE,
 };
 
 /**
diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
index 6bae0a658b94..d2c1aa8411a3 100644
--- a/arch/x86/kernel/cpu/resctrl/core.c
+++ b/arch/x86/kernel/cpu/resctrl/core.c
@@ -502,6 +502,8 @@ static int get_domain_id_from_scope(int cpu, enum resctrl_scope scope)
 	case RESCTRL_L2_CACHE:
 	case RESCTRL_L3_CACHE:
 		return get_cpu_cacheinfo_id(cpu, scope);
+	case RESCTRL_NODE:
+		return cpu_to_node(cpu);
 	default:
 		break;
 	}
-- 
2.41.0


^ permalink raw reply related	[flat|nested] 331+ messages in thread

* [PATCH v10 6/8] x86/resctrl: Introduce snc_nodes_per_l3_cache
  2023-10-31 21:17           ` [PATCH v10 " Tony Luck
                               ` (4 preceding siblings ...)
  2023-10-31 21:17             ` [PATCH v10 5/8] x86/resctrl: Add node-scope to the options for feature scope Tony Luck
@ 2023-10-31 21:17             ` Tony Luck
  2023-10-31 21:17             ` [PATCH v10 7/8] x86/resctrl: Sub NUMA Cluster detection and enable Tony Luck
                               ` (2 subsequent siblings)
  8 siblings, 0 replies; 331+ messages in thread
From: Tony Luck @ 2023-10-31 21:17 UTC (permalink / raw)
  To: Fenghua Yu, Reinette Chatre, Peter Newman, Jonathan Corbet,
	Shuah Khan, x86
  Cc: Shaopeng Tan, James Morse, Jamie Iles, Babu Moger, Randy Dunlap,
	linux-kernel, linux-doc, patches, Tony Luck

Intel Sub-NUMA Cluster (SNC) is a feature that subdivides the CPU cores
and memory controllers on a socket into two or more groups. These are
presented to the operating system as NUMA nodes.

This may enable some workloads to have slightly lower latency to memory
as the memory controller(s) in an SNC node are electrically closer to the
CPU cores on that SNC node. This latency benefit may be offset by lower bandwidth
since the memory accesses for each core can only be interleaved between
the memory controllers on the same SNC node.

Resctrl monitoring on an Intel system depends upon attaching RMIDs to tasks
to track L3 cache occupancy and memory bandwidth. There is an MSR that
controls how the RMIDs are shared between SNC nodes.

The default mode divides them numerically. E.g. when there are two SNC
nodes on a socket the lower number half of the RMIDs are given to the
first node, the remainder to the second node. This would be difficult
to use with the Linux resctrl interface as specific RMID values assigned
to resctrl groups are not visible to users.

The other mode divides the RMIDs and renumbers the ones on the second
SNC node to start from zero.
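
For illustration only (hypothetical counts, two SNC nodes per socket,
200 physical RMIDs):

  Default mode:     SNC node 0 uses RMIDs   0..99
                    SNC node 1 uses RMIDs 100..199
  Renumbered mode:  each SNC node uses logical RMIDs 0..99, with
                    logical RMID n on SNC node 1 backed by physical
                    counter n + 100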

Even with this renumbering SNC mode requires several changes in resctrl
behavior for correct operation.

Add a global integer "snc_nodes_per_l3_cache" that shows how many SNC
nodes share each L3 cache. When this is "1", SNC mode is either not
implemented or not enabled. All places that need to check it are
updated to take appropriate action when SNC mode is enabled.

Code that needs to take action when SNC is enabled is:
1) The number of logical RMIDs per L3 cache available for use is the
   number of physical RMIDs divided by the number of SNC nodes.
2) Likewise the "mon_scale" value must be divided by the number of SNC
   nodes.
3) The RMID renumbering operates when using the value from the
   IA32_PQR_ASSOC MSR to count accesses by a task. When reading an RMID
   counter, code must adjust from the logical RMID used to the physical
   RMID value for the SNC node that it wishes to read and load the
   adjusted value into the IA32_QM_EVTSEL MSR (see the sketch after
   this list).
4) The L3 cache is divided between the SNC nodes. So the value
   reported in the resctrl "size" file is divided by the number of SNC
   nodes because the effective amount of cache that can be allocated
   is reduced by that factor.
5) The "-o mba_MBps" mount option must be disabled in SNC mode
   because the monitoring is being done per SNC node, while the
   bandwidth allocation is still done at the L3 cache scope.
   Trying to use this feedback loop might result in contradictory
   changes to the throttling level coming from each of the SNC
   node bandwidth measurements.
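
A minimal sketch of the arithmetic described in items 1-3 (illustrative
only; the variable names are invented for this example, the real
changes are in the diff below):

	/* Each L3 cache is shared by "snc_nodes" SNC nodes */
	num_rmid  = num_physical_rmids / snc_nodes;	/* item 1 */
	mon_scale = physical_mon_scale / snc_nodes;	/* item 2 */
	/* item 3: physical counter to program for a logical RMID */
	physical_rmid = logical_rmid +
			(cpu_to_node(cpu) % snc_nodes) * num_rmid;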

Signed-off-by: Tony Luck <tony.luck@intel.com>
---
Changes since v9
Fixed missing word s/monitoring on Intel/monitoring on an Intel/
Deleted "A later patch" paragraph.
Expanded description of how values are "adjusted" for mon_scale
and cache size.
Changed type of "snc_nodes_per_l3_cache" to "unsigned int".

 arch/x86/kernel/cpu/resctrl/internal.h |  2 ++
 arch/x86/kernel/cpu/resctrl/core.c     |  6 ++++++
 arch/x86/kernel/cpu/resctrl/monitor.c  | 16 +++++++++++++---
 arch/x86/kernel/cpu/resctrl/rdtgroup.c |  5 +++--
 4 files changed, 24 insertions(+), 5 deletions(-)

diff --git a/arch/x86/kernel/cpu/resctrl/internal.h b/arch/x86/kernel/cpu/resctrl/internal.h
index ce3a70657842..e7a75a439c16 100644
--- a/arch/x86/kernel/cpu/resctrl/internal.h
+++ b/arch/x86/kernel/cpu/resctrl/internal.h
@@ -446,6 +446,8 @@ DECLARE_STATIC_KEY_FALSE(rdt_alloc_enable_key);
 
 extern struct dentry *debugfs_resctrl;
 
+extern unsigned int snc_nodes_per_l3_cache;
+
 enum resctrl_res_level {
 	RDT_RESOURCE_L3,
 	RDT_RESOURCE_L2,
diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
index d2c1aa8411a3..97d2a5a7dd41 100644
--- a/arch/x86/kernel/cpu/resctrl/core.c
+++ b/arch/x86/kernel/cpu/resctrl/core.c
@@ -48,6 +48,12 @@ int max_name_width, max_data_width;
  */
 bool rdt_alloc_capable;
 
+/*
+ * Number of SNC nodes that share each L3 cache.  Default is 1 for
+ * systems that do not support SNC, or have SNC disabled.
+ */
+unsigned int snc_nodes_per_l3_cache = 1;
+
 static void
 mba_wrmsr_intel(struct rdt_ctrl_domain *d, struct msr_param *m,
 		struct rdt_resource *r);
diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c
index 4e145f5620b0..30b7c3b9b517 100644
--- a/arch/x86/kernel/cpu/resctrl/monitor.c
+++ b/arch/x86/kernel/cpu/resctrl/monitor.c
@@ -148,8 +148,18 @@ static inline struct rmid_entry *__rmid_entry(u32 rmid)
 
 static int __rmid_read(u32 rmid, enum resctrl_event_id eventid, u64 *val)
 {
+	struct rdt_resource *r = &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl;
+	int cpu = smp_processor_id();
+	int rmid_offset = 0;
 	u64 msr_val;
 
+	/*
+	 * When SNC mode is on, need to compute the offset to read the
+	 * physical RMID counter for the node to which this CPU belongs.
+	 */
+	if (snc_nodes_per_l3_cache > 1)
+		rmid_offset = (cpu_to_node(cpu) % snc_nodes_per_l3_cache) * r->num_rmid;
+
 	/*
 	 * As per the SDM, when IA32_QM_EVTSEL.EvtID (bits 7:0) is configured
 	 * with a valid event code for supported resource type and the bits
@@ -158,7 +168,7 @@ static int __rmid_read(u32 rmid, enum resctrl_event_id eventid, u64 *val)
 	 * IA32_QM_CTR.Error (bit 63) and IA32_QM_CTR.Unavailable (bit 62)
 	 * are error bits.
 	 */
-	wrmsr(MSR_IA32_QM_EVTSEL, eventid, rmid);
+	wrmsr(MSR_IA32_QM_EVTSEL, eventid, rmid + rmid_offset);
 	rdmsrl(MSR_IA32_QM_CTR, msr_val);
 
 	if (msr_val & RMID_VAL_ERROR)
@@ -783,8 +793,8 @@ int __init rdt_get_mon_l3_config(struct rdt_resource *r)
 	int ret;
 
 	resctrl_rmid_realloc_limit = boot_cpu_data.x86_cache_size * 1024;
-	hw_res->mon_scale = boot_cpu_data.x86_cache_occ_scale;
-	r->num_rmid = boot_cpu_data.x86_cache_max_rmid + 1;
+	hw_res->mon_scale = boot_cpu_data.x86_cache_occ_scale / snc_nodes_per_l3_cache;
+	r->num_rmid = (boot_cpu_data.x86_cache_max_rmid + 1) / snc_nodes_per_l3_cache;
 	hw_res->mbm_width = MBM_CNTR_WIDTH_BASE;
 
 	if (mbm_offset > 0 && mbm_offset <= MBM_CNTR_WIDTH_OFFSET_MAX)
diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
index 21bbd832f3f2..79d57dade568 100644
--- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
+++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
@@ -1425,7 +1425,7 @@ unsigned int rdtgroup_cbm_to_size(struct rdt_resource *r,
 		}
 	}
 
-	return size;
+	return size / snc_nodes_per_l3_cache;
 }
 
 /*
@@ -2298,7 +2298,8 @@ static bool supports_mba_mbps(void)
 	struct rdt_resource *r = &rdt_resources_all[RDT_RESOURCE_MBA].r_resctrl;
 
 	return (is_mbm_local_enabled() &&
-		r->alloc_capable && is_mba_linear());
+		r->alloc_capable && is_mba_linear() &&
+		snc_nodes_per_l3_cache == 1);
 }
 
 /*
-- 
2.41.0


^ permalink raw reply related	[flat|nested] 331+ messages in thread

* [PATCH v10 7/8] x86/resctrl: Sub NUMA Cluster detection and enable
  2023-10-31 21:17           ` [PATCH v10 " Tony Luck
                               ` (5 preceding siblings ...)
  2023-10-31 21:17             ` [PATCH v10 6/8] x86/resctrl: Introduce snc_nodes_per_l3_cache Tony Luck
@ 2023-10-31 21:17             ` Tony Luck
  2023-11-07  0:33               ` Reinette Chatre
  2023-10-31 21:17             ` [PATCH v10 8/8] x86/resctrl: Update documentation with Sub-NUMA cluster changes Tony Luck
  2023-11-09 23:09             ` [PATCH v11 0/8] Add support for Sub-NUMA cluster (SNC) systems Tony Luck
  8 siblings, 1 reply; 331+ messages in thread
From: Tony Luck @ 2023-10-31 21:17 UTC (permalink / raw)
  To: Fenghua Yu, Reinette Chatre, Peter Newman, Jonathan Corbet,
	Shuah Khan, x86
  Cc: Shaopeng Tan, James Morse, Jamie Iles, Babu Moger, Randy Dunlap,
	linux-kernel, linux-doc, patches, Tony Luck

There isn't a simple hardware bit that indicates whether a CPU is
running in Sub NUMA Cluster (SNC) mode. Infer the state by comparing
the ratio of NUMA nodes to L3 cache instances.
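
For example (hypothetical configuration): a two socket system with SNC2
enabled reports four NUMA nodes but only two L3 cache instances, so
4 / 2 = 2 SNC nodes share each L3 cache. With SNC disabled the same
system reports two NUMA nodes and two L3 caches, giving a ratio of 1.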

When SNC mode is detected, reconfigure the RMID counters by updating
the MSR_RMID_SNC_CONFIG MSR on each socket as CPUs are seen.

Clearing bit zero of the MSR divides the RMIDs and renumbers the ones
on the second SNC node to start from zero.

Signed-off-by: Tony Luck <tony.luck@intel.com>
---
Changes since v9
Expand h/w to hardware (commit and code comments)
Remove "earlier commit" reference
s/counnter/counter/
Check for offline CPUs and warn user SNC detection may be broken.

 arch/x86/include/asm/msr-index.h   |   1 +
 arch/x86/kernel/cpu/resctrl/core.c | 100 ++++++++++++++++++++++++++++-
 2 files changed, 99 insertions(+), 2 deletions(-)

diff --git a/arch/x86/include/asm/msr-index.h b/arch/x86/include/asm/msr-index.h
index e3fa9cecd599..4285a5ee81fe 100644
--- a/arch/x86/include/asm/msr-index.h
+++ b/arch/x86/include/asm/msr-index.h
@@ -1109,6 +1109,7 @@
 #define MSR_IA32_QM_CTR			0xc8e
 #define MSR_IA32_PQR_ASSOC		0xc8f
 #define MSR_IA32_L3_CBM_BASE		0xc90
+#define MSR_RMID_SNC_CONFIG		0xca0
 #define MSR_IA32_L2_CBM_BASE		0xd10
 #define MSR_IA32_MBA_THRTL_BASE		0xd50
 
diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
index 97d2a5a7dd41..034f9797e1fb 100644
--- a/arch/x86/kernel/cpu/resctrl/core.c
+++ b/arch/x86/kernel/cpu/resctrl/core.c
@@ -16,11 +16,14 @@
 
 #define pr_fmt(fmt)	"resctrl: " fmt
 
+#include <linux/cpu.h>
 #include <linux/slab.h>
 #include <linux/err.h>
 #include <linux/cacheinfo.h>
 #include <linux/cpuhotplug.h>
+#include <linux/mod_devicetable.h>
 
+#include <asm/cpu_device_id.h>
 #include <asm/intel-family.h>
 #include <asm/resctrl.h>
 #include "internal.h"
@@ -184,10 +187,10 @@ bool is_mba_sc(struct rdt_resource *r)
 
 /*
  * rdt_get_mb_table() - get a mapping of bandwidth(b/w) percentage values
- * exposed to user interface and the h/w understandable delay values.
+ * exposed to user interface and the hardware understandable delay values.
  *
  * The non-linear delay values have the granularity of power of two
- * and also the h/w does not guarantee a curve for configured delay
+ * and also the hardware does not guarantee a curve for configured delay
  * values vs. actual b/w enforced.
  * Hence we need a mapping that is pre calibrated so the user can
  * express the memory b/w as a percentage value.
@@ -738,11 +741,42 @@ static void clear_closid_rmid(int cpu)
 	wrmsr(MSR_IA32_PQR_ASSOC, 0, 0);
 }
 
+/*
+ * The power-on reset value of MSR_RMID_SNC_CONFIG is 0x1
+ * which indicates that RMIDs are configured in legacy mode.
+ * This mode is incompatible with Linux resctrl semantics
+ * as RMIDs are partitioned between SNC nodes, which requires
+ * a user to know which RMID is allocated to a task.
+ * Clearing bit 0 reconfigures the RMID counters for use
+ * in Sub NUMA Cluster mode. This mode is better for Linux.
+ * The RMID space is divided between all SNC nodes with the
+ * RMIDs renumbered to start from zero in each node when
+ * counting operations from tasks. Code to read the counters
+ * must adjust RMID counter numbers based on SNC node. See
+ * __rmid_read() for code that does this.
+ */
+static void snc_remap_rmids(int cpu)
+{
+	u64 val;
+
+	/* Only need to enable once per package. */
+	if (cpumask_first(topology_core_cpumask(cpu)) != cpu)
+		return;
+
+	rdmsrl(MSR_RMID_SNC_CONFIG, val);
+	val &= ~BIT_ULL(0);
+	wrmsrl(MSR_RMID_SNC_CONFIG, val);
+}
+
 static int resctrl_online_cpu(unsigned int cpu)
 {
 	struct rdt_resource *r;
 
 	mutex_lock(&rdtgroup_mutex);
+
+	if (snc_nodes_per_l3_cache > 1)
+		snc_remap_rmids(cpu);
+
 	for_each_capable_rdt_resource(r)
 		domain_add_cpu(cpu, r);
 	/* The cpu is set in default rdtgroup after online. */
@@ -997,11 +1031,73 @@ static __init bool get_rdt_resources(void)
 	return (rdt_mon_capable || rdt_alloc_capable);
 }
 
+/* CPU models that support MSR_RMID_SNC_CONFIG */
+static const struct x86_cpu_id snc_cpu_ids[] __initconst = {
+	X86_MATCH_INTEL_FAM6_MODEL(ICELAKE_X, 0),
+	X86_MATCH_INTEL_FAM6_MODEL(SAPPHIRERAPIDS_X, 0),
+	X86_MATCH_INTEL_FAM6_MODEL(EMERALDRAPIDS_X, 0),
+	X86_MATCH_INTEL_FAM6_MODEL(GRANITERAPIDS_X, 0),
+	{}
+};
+
+/*
+ * There isn't a simple hardware bit that indicates whether a CPU is running
+ * in Sub NUMA Cluster (SNC) mode. Infer the state by comparing the
+ * ratio of NUMA nodes to L3 cache instances.
+ * It is not possible to accurately determine SNC state if the system is
+ * booted with a maxcpus=N parameter. That distorts the ratio of SNC nodes
+ * to L3 caches. It will be OK if the system is booted with hyperthreading
+ * disabled (since this doesn't affect the ratio).
+ */
+static __init int snc_get_config(void)
+{
+	unsigned long *node_caches;
+	int mem_only_nodes = 0;
+	int cpu, node, ret;
+	int num_l3_caches;
+
+	if (!x86_match_cpu(snc_cpu_ids))
+		return 1;
+
+	node_caches = bitmap_zalloc(nr_node_ids, GFP_KERNEL);
+	if (!node_caches)
+		return 1;
+
+	cpus_read_lock();
+
+	if (num_online_cpus() != num_present_cpus())
+		pr_warn("Some CPUs offline, SNC detection may be incorrect\n");
+
+	for_each_node(node) {
+		cpu = cpumask_first(cpumask_of_node(node));
+		if (cpu < nr_cpu_ids)
+			set_bit(get_cpu_cacheinfo_id(cpu, 3), node_caches);
+		else
+			mem_only_nodes++;
+	}
+	cpus_read_unlock();
+
+	num_l3_caches = bitmap_weight(node_caches, nr_node_ids);
+	kfree(node_caches);
+
+	if (!num_l3_caches)
+		return 1;
+
+	ret = (nr_node_ids - mem_only_nodes) / num_l3_caches;
+
+	if (ret > 1)
+		rdt_resources_all[RDT_RESOURCE_L3].r_resctrl.mon_scope = RESCTRL_NODE;
+
+	return ret;
+}
+
 static __init void rdt_init_res_defs_intel(void)
 {
 	struct rdt_hw_resource *hw_res;
 	struct rdt_resource *r;
 
+	snc_nodes_per_l3_cache = snc_get_config();
+
 	for_each_rdt_resource(r) {
 		hw_res = resctrl_to_arch_res(r);
 
-- 
2.41.0


^ permalink raw reply related	[flat|nested] 331+ messages in thread

* [PATCH v10 8/8] x86/resctrl: Update documentation with Sub-NUMA cluster changes
  2023-10-31 21:17           ` [PATCH v10 " Tony Luck
                               ` (6 preceding siblings ...)
  2023-10-31 21:17             ` [PATCH v10 7/8] x86/resctrl: Sub NUMA Cluster detection and enable Tony Luck
@ 2023-10-31 21:17             ` Tony Luck
  2023-11-09 23:09             ` [PATCH v11 0/8] Add support for Sub-NUMA cluster (SNC) systems Tony Luck
  8 siblings, 0 replies; 331+ messages in thread
From: Tony Luck @ 2023-10-31 21:17 UTC (permalink / raw)
  To: Fenghua Yu, Reinette Chatre, Peter Newman, Jonathan Corbet,
	Shuah Khan, x86
  Cc: Shaopeng Tan, James Morse, Jamie Iles, Babu Moger, Randy Dunlap,
	linux-kernel, linux-doc, patches, Tony Luck

With Sub-NUMA Cluster mode enabled the scope of monitoring resources is
per-NODE instead of per-L3 cache. Suffixes of directories with "L3" in
their name refer to Sub-NUMA nodes instead of L3 cache ids.

Users should be aware that SNC mode also affects the amount of L3 cache
available for allocation within each SNC node.

Reviewed-by: Peter Newman <peternewman@google.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
---
No changes since v9

 Documentation/arch/x86/resctrl.rst | 23 ++++++++++++++++++++---
 1 file changed, 20 insertions(+), 3 deletions(-)

diff --git a/Documentation/arch/x86/resctrl.rst b/Documentation/arch/x86/resctrl.rst
index a6279df64a9d..d1db200db5f9 100644
--- a/Documentation/arch/x86/resctrl.rst
+++ b/Documentation/arch/x86/resctrl.rst
@@ -366,9 +366,9 @@ When control is enabled all CTRL_MON groups will also contain:
 When monitoring is enabled all MON groups will also contain:
 
 "mon_data":
-	This contains a set of files organized by L3 domain and by
-	RDT event. E.g. on a system with two L3 domains there will
-	be subdirectories "mon_L3_00" and "mon_L3_01".	Each of these
+	This contains a set of files organized by L3 domain or by NUMA
+	node (depending on whether Sub-NUMA Cluster (SNC) mode is disabled
+	or enabled respectively) and by RDT event.  Each of these
 	directories have one file per event (e.g. "llc_occupancy",
 	"mbm_total_bytes", and "mbm_local_bytes"). In a MON group these
 	files provide a read out of the current value of the event for
@@ -478,6 +478,23 @@ if non-contiguous 1s value is supported. On a system with a 20-bit mask
 each bit represents 5% of the capacity of the cache. You could partition
 the cache into four equal parts with masks: 0x1f, 0x3e0, 0x7c00, 0xf8000.
 
+Notes on Sub-NUMA Cluster mode
+==============================
+When SNC mode is enabled Linux may load balance tasks between Sub-NUMA
+nodes much more readily than between regular NUMA nodes since the CPUs
+on Sub-NUMA nodes share the same L3 cache and the system may report
+the NUMA distance between Sub-NUMA nodes with a lower value than used
+for regular NUMA nodes.  Users who do not bind tasks to the CPUs of a
+specific Sub-NUMA node must read the "llc_occupancy", "mbm_total_bytes",
+and "mbm_local_bytes" for all Sub-NUMA nodes where the tasks may execute
+to get the full view of traffic for which the tasks were the source.
+
+The cache allocation feature still provides the same number of
+bits in a mask to control allocation into the L3 cache. But each
+of those ways has its capacity reduced because the cache is divided
+between the SNC nodes. The values reported in the resctrl
+"size" files are adjusted accordingly.
+
 Memory bandwidth Allocation and monitoring
 ==========================================
 
-- 
2.41.0


^ permalink raw reply related	[flat|nested] 331+ messages in thread

* RE: [PATCH v10 3/8] x86/resctrl: Prepare for different scope for control/monitor operations
  2023-10-31 21:17             ` [PATCH v10 3/8] x86/resctrl: Prepare for different scope for control/monitor operations Tony Luck
@ 2023-11-01 19:53               ` Luck, Tony
  2023-11-07  0:32               ` Reinette Chatre
  1 sibling, 0 replies; 331+ messages in thread
From: Luck, Tony @ 2023-11-01 19:53 UTC (permalink / raw)
  To: Yu, Fenghua, Chatre, Reinette, Peter Newman, Jonathan Corbet,
	Shuah Khan, x86
  Cc: Shaopeng Tan, James Morse, Jamie Iles, Babu Moger, Randy Dunlap,
	linux-kernel, linux-doc, patches

+/*
+ * domain_add_cpu - Add a cpu to either/both resource's domain lists.
+ */
+static void domain_add_cpu(int cpu, struct rdt_resource *r)

Bother. Missed one comment that needs s/cpu/CPU/

-Tony

^ permalink raw reply	[flat|nested] 331+ messages in thread

* Re: [PATCH v10 1/8] x86/resctrl: Prepare for new domain scope
  2023-10-31 21:17             ` [PATCH v10 1/8] x86/resctrl: Prepare for new domain scope Tony Luck
@ 2023-11-07  0:31               ` Reinette Chatre
  0 siblings, 0 replies; 331+ messages in thread
From: Reinette Chatre @ 2023-11-07  0:31 UTC (permalink / raw)
  To: Tony Luck, Fenghua Yu, Peter Newman, Jonathan Corbet, Shuah Khan, x86
  Cc: Shaopeng Tan, James Morse, Jamie Iles, Babu Moger, Randy Dunlap,
	linux-kernel, linux-doc, patches

Hi Tony,

On 10/31/2023 2:17 PM, Tony Luck wrote:
> Resctrl resources operate on subsets of CPUs in the system with the
> defining attribute of each subset being an instance of a particular
> level of cache. E.g. all CPUs sharing an L3 cache would be part of the
> same domain.
> 
> In preparation for features that are scoped at the NUMA node level
> change the code from explicit references to "cache_level" to a more
> generic scope. At this point the only options for this scope are groups
> of CPUs that share an L2 cache or L3 cache.
> 
> Clean up the error handling when looking up domains. Report invalid id's
> before calling rdt_find_domain() in preparation for better messages when
> scope can be other than cache scope. This means that rdt_find_domain()
> will never return an error. So remove checks for error from the callsites.
> 
> Signed-off-by: Tony Luck <tony.luck@intel.com>
> ---
> Changes since v9
> New test for invalid domain id before calling rdt_find_domain() means that
> error handling in that function and at all call-sites can be simplified.

These changes do not appear to be consistent in this series. Simplifying the
call-sites is indeed done in this patch but this work seems to be undone in
patch 3 where it reverts back to the previous error handling in
domain_add_cpu_mon(), domain_remove_cpu_ctrl(), and domain_remove_cpu_mon().

> In pseudo_lock_region_init() use the new enum resctrl_scope for local variable.
> 
>  include/linux/resctrl.h                   |  9 +++--
>  arch/x86/kernel/cpu/resctrl/core.c        | 40 +++++++++++++++--------
>  arch/x86/kernel/cpu/resctrl/ctrlmondata.c |  2 +-
>  arch/x86/kernel/cpu/resctrl/pseudo_lock.c |  6 +++-
>  arch/x86/kernel/cpu/resctrl/rdtgroup.c    |  5 ++-
>  5 files changed, 44 insertions(+), 18 deletions(-)
> 

...

> @@ -506,17 +516,18 @@ static int arch_domain_mbm_alloc(u32 num_rmid, struct rdt_hw_domain *hw_dom)
>   */
>  static void domain_add_cpu(int cpu, struct rdt_resource *r)
>  {
> -	int id = get_cpu_cacheinfo_id(cpu, r->cache_level);
> +	int id = get_domain_id_from_scope(cpu, r->scope);
>  	struct list_head *add_pos = NULL;
>  	struct rdt_hw_domain *hw_dom;
>  	struct rdt_domain *d;
>  	int err;
>  
> -	d = rdt_find_domain(r, id, &add_pos);
> -	if (IS_ERR(d)) {
> -		pr_warn("Couldn't find cache id for CPU %d\n", cpu);
> +	if (id < 0) {
> +		pr_warn_once("Can't find domain id for CPU:%d scope:%d for resource %s\n",
> +			     cpu, r->scope, r->name);
>  		return;
>  	}

Please add empty line here ...

> +	d = rdt_find_domain(r, id, &add_pos);
>  

... and remove this empty line.

>  	if (d) {
>  		cpumask_set_cpu(cpu, &d->cpu_mask);
> @@ -556,12 +567,15 @@ static void domain_add_cpu(int cpu, struct rdt_resource *r)
>  
>  static void domain_remove_cpu(int cpu, struct rdt_resource *r)
>  {
> -	int id = get_cpu_cacheinfo_id(cpu, r->cache_level);
> +	int id = get_domain_id_from_scope(cpu, r->scope);
>  	struct rdt_hw_domain *hw_dom;
>  	struct rdt_domain *d;
>  
> +	if (id < 0)
> +		return;
> +
>  	d = rdt_find_domain(r, id, NULL);
> -	if (IS_ERR_OR_NULL(d)) {
> +	if (!d) {
>  		pr_warn("Couldn't find cache id for CPU %d\n", cpu);

This error message is no longer accurate.

>  		return;
>  	}


Reinette

^ permalink raw reply	[flat|nested] 331+ messages in thread

* Re: [PATCH v10 3/8] x86/resctrl: Prepare for different scope for control/monitor operations
  2023-10-31 21:17             ` [PATCH v10 3/8] x86/resctrl: Prepare for different scope for control/monitor operations Tony Luck
  2023-11-01 19:53               ` Luck, Tony
@ 2023-11-07  0:32               ` Reinette Chatre
  1 sibling, 0 replies; 331+ messages in thread
From: Reinette Chatre @ 2023-11-07  0:32 UTC (permalink / raw)
  To: Tony Luck, Fenghua Yu, Peter Newman, Jonathan Corbet, Shuah Khan, x86
  Cc: Shaopeng Tan, James Morse, Jamie Iles, Babu Moger, Randy Dunlap,
	linux-kernel, linux-doc, patches

Hi Tony,

On 10/31/2023 2:17 PM, Tony Luck wrote:
> Resctrl assumes that control and monitor operations on a resource are
> performed at the same scope.
> 
> Prepare for systems that use different scope (specifically Intel needs
> to split the RDT_RESOURCE_L3 resource to use L3 scope for cache control
> and NODE scope for cache occupancy and memory bandwidth monitoring).
> 
> Create separate domain lists for control and monitor operations.
> 
> Note that errors during initialization of either control or monitor
> functions on a domain would previously result in that domain being
> excluded from both control and monitor operations. Now the domains are
> allocated independently it is no longer required to disable both control
> and monitor operations if either fail.
> 
> Signed-off-by: Tony Luck <tony.luck@intel.com>
> ---
> Changes since v9
> Fix commit to be specific the only the RDT_RESOURCE_L3 resource is going
> to have different monitor and control scope.
> Rename get_domain_from_cpu() -> get_ctrl_domain_from_cpu()
> Rewrite comment for rdt_find_domains().
> Add "type" field to rdt_domain_hdr structure.
> Delete the /* RDT_RESOURCE_MBA is never mon_capable */ comment.
> 
>  include/linux/resctrl.h                   |  25 ++-
>  arch/x86/kernel/cpu/resctrl/internal.h    |   6 +-
>  arch/x86/kernel/cpu/resctrl/core.c        | 206 ++++++++++++++++------
>  arch/x86/kernel/cpu/resctrl/ctrlmondata.c |  12 +-
>  arch/x86/kernel/cpu/resctrl/monitor.c     |   4 +-
>  arch/x86/kernel/cpu/resctrl/pseudo_lock.c |   4 +-
>  arch/x86/kernel/cpu/resctrl/rdtgroup.c    |  55 +++---
>  7 files changed, 218 insertions(+), 94 deletions(-)
> 
> diff --git a/include/linux/resctrl.h b/include/linux/resctrl.h
> index c4067150a6b7..35e700edc6e6 100644
> --- a/include/linux/resctrl.h
> +++ b/include/linux/resctrl.h
> @@ -52,15 +52,22 @@ struct resctrl_staged_config {
>  	bool			have_new_ctrl;
>  };
>  
> +enum resctrl_domain_type {
> +	RESCTRL_CTRL_DOMAIN,
> +	RESCTRL_MON_DOMAIN,
> +};
> +
>  /**
>   * struct rdt_domain_hdr - common header for different domain types
>   * @list:		all instances of this resource
>   * @id:			unique id for this instance
> + * @type:		type of this instance
>   * @cpu_mask:		which CPUs share this resource
>   */
>  struct rdt_domain_hdr {
>  	struct list_head		list;
>  	int				id;
> +	enum resctrl_domain_type	type;
>  	struct cpumask			cpu_mask;
>  };
>  
> @@ -163,10 +170,12 @@ enum resctrl_scope {
>   * @alloc_capable:	Is allocation available on this machine
>   * @mon_capable:	Is monitor feature available on this machine
>   * @num_rmid:		Number of RMIDs available
> - * @scope:		Scope of this resource
> + * @ctrl_scope:		Scope of this resource for control functions
> + * @mon_scope:		Scope of this resource for monitor functions
>   * @cache:		Cache allocation related data
>   * @membw:		If the component has bandwidth controls, their properties.
> - * @domains:		All domains for this resource
> + * @ctrl_domains:	Control domains for this resource
> + * @mon_domains:	Monitor domains for this resource
>   * @name:		Name to use in "schemata" file.
>   * @data_width:		Character width of data when displaying
>   * @default_ctrl:	Specifies default cache cbm or memory B/W percent.
> @@ -181,10 +190,12 @@ struct rdt_resource {
>  	bool			alloc_capable;
>  	bool			mon_capable;
>  	int			num_rmid;
> -	enum resctrl_scope	scope;
> +	enum resctrl_scope	ctrl_scope;
> +	enum resctrl_scope	mon_scope;
>  	struct resctrl_cache	cache;
>  	struct resctrl_membw	membw;
> -	struct list_head	domains;
> +	struct list_head	ctrl_domains;
> +	struct list_head	mon_domains;
>  	char			*name;
>  	int			data_width;
>  	u32			default_ctrl;
> @@ -230,8 +241,10 @@ int resctrl_arch_update_one(struct rdt_resource *r, struct rdt_domain *d,
>  
>  u32 resctrl_arch_get_config(struct rdt_resource *r, struct rdt_domain *d,
>  			    u32 closid, enum resctrl_conf_type type);
> -int resctrl_online_domain(struct rdt_resource *r, struct rdt_domain *d);
> -void resctrl_offline_domain(struct rdt_resource *r, struct rdt_domain *d);
> +int resctrl_online_ctrl_domain(struct rdt_resource *r, struct rdt_domain *d);
> +int resctrl_online_mon_domain(struct rdt_resource *r, struct rdt_domain *d);
> +void resctrl_offline_ctrl_domain(struct rdt_resource *r, struct rdt_domain *d);
> +void resctrl_offline_mon_domain(struct rdt_resource *r, struct rdt_domain *d);
>  
>  /**
>   * resctrl_arch_rmid_read() - Read the eventid counter corresponding to rmid
> diff --git a/arch/x86/kernel/cpu/resctrl/internal.h b/arch/x86/kernel/cpu/resctrl/internal.h
> index a4f1aa15f0a2..24bf9d7989a9 100644
> --- a/arch/x86/kernel/cpu/resctrl/internal.h
> +++ b/arch/x86/kernel/cpu/resctrl/internal.h
> @@ -520,8 +520,8 @@ void rdtgroup_kn_unlock(struct kernfs_node *kn);
>  int rdtgroup_kn_mode_restrict(struct rdtgroup *r, const char *name);
>  int rdtgroup_kn_mode_restore(struct rdtgroup *r, const char *name,
>  			     umode_t mask);
> -struct rdt_domain *rdt_find_domain(struct rdt_resource *r, int id,
> -				   struct list_head **pos);
> +struct rdt_domain_hdr *rdt_find_domain(struct list_head *h, int id,
> +				       struct list_head **pos);
>  ssize_t rdtgroup_schemata_write(struct kernfs_open_file *of,
>  				char *buf, size_t nbytes, loff_t off);
>  int rdtgroup_schemata_show(struct kernfs_open_file *of,
> @@ -540,7 +540,7 @@ int rdt_pseudo_lock_init(void);
>  void rdt_pseudo_lock_release(void);
>  int rdtgroup_pseudo_lock_create(struct rdtgroup *rdtgrp);
>  void rdtgroup_pseudo_lock_remove(struct rdtgroup *rdtgrp);
> -struct rdt_domain *get_domain_from_cpu(int cpu, struct rdt_resource *r);
> +struct rdt_domain *get_ctrl_domain_from_cpu(int cpu, struct rdt_resource *r);
>  int closids_supported(void);
>  void closid_free(int closid);
>  int alloc_rmid(void);
> diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
> index c26ecb2e415f..8dc2cb49358e 100644
> --- a/arch/x86/kernel/cpu/resctrl/core.c
> +++ b/arch/x86/kernel/cpu/resctrl/core.c
> @@ -57,7 +57,8 @@ static void
>  mba_wrmsr_amd(struct rdt_domain *d, struct msr_param *m,
>  	      struct rdt_resource *r);
>  
> -#define domain_init(id) LIST_HEAD_INIT(rdt_resources_all[id].r_resctrl.domains)
> +#define ctrl_domain_init(id) LIST_HEAD_INIT(rdt_resources_all[id].r_resctrl.ctrl_domains)
> +#define mon_domain_init(id) LIST_HEAD_INIT(rdt_resources_all[id].r_resctrl.mon_domains)
>  
>  struct rdt_hw_resource rdt_resources_all[] = {
>  	[RDT_RESOURCE_L3] =
> @@ -65,8 +66,10 @@ struct rdt_hw_resource rdt_resources_all[] = {
>  		.r_resctrl = {
>  			.rid			= RDT_RESOURCE_L3,
>  			.name			= "L3",
> -			.scope			= RESCTRL_L3_CACHE,
> -			.domains		= domain_init(RDT_RESOURCE_L3),
> +			.ctrl_scope		= RESCTRL_L3_CACHE,
> +			.mon_scope		= RESCTRL_L3_CACHE,
> +			.ctrl_domains		= ctrl_domain_init(RDT_RESOURCE_L3),
> +			.mon_domains		= mon_domain_init(RDT_RESOURCE_L3),
>  			.parse_ctrlval		= parse_cbm,
>  			.format_str		= "%d=%0*x",
>  			.fflags			= RFTYPE_RES_CACHE,
> @@ -79,8 +82,8 @@ struct rdt_hw_resource rdt_resources_all[] = {
>  		.r_resctrl = {
>  			.rid			= RDT_RESOURCE_L2,
>  			.name			= "L2",
> -			.scope			= RESCTRL_L2_CACHE,
> -			.domains		= domain_init(RDT_RESOURCE_L2),
> +			.ctrl_scope		= RESCTRL_L2_CACHE,
> +			.ctrl_domains		= ctrl_domain_init(RDT_RESOURCE_L2),
>  			.parse_ctrlval		= parse_cbm,
>  			.format_str		= "%d=%0*x",
>  			.fflags			= RFTYPE_RES_CACHE,
> @@ -93,8 +96,8 @@ struct rdt_hw_resource rdt_resources_all[] = {
>  		.r_resctrl = {
>  			.rid			= RDT_RESOURCE_MBA,
>  			.name			= "MB",
> -			.scope			= RESCTRL_L3_CACHE,
> -			.domains		= domain_init(RDT_RESOURCE_MBA),
> +			.ctrl_scope		= RESCTRL_L3_CACHE,
> +			.ctrl_domains		= ctrl_domain_init(RDT_RESOURCE_MBA),
>  			.parse_ctrlval		= parse_bw,
>  			.format_str		= "%d=%*u",
>  			.fflags			= RFTYPE_RES_MB,
> @@ -105,8 +108,8 @@ struct rdt_hw_resource rdt_resources_all[] = {
>  		.r_resctrl = {
>  			.rid			= RDT_RESOURCE_SMBA,
>  			.name			= "SMBA",
> -			.scope			= RESCTRL_L3_CACHE,
> -			.domains		= domain_init(RDT_RESOURCE_SMBA),
> +			.ctrl_scope		= RESCTRL_L3_CACHE,
> +			.ctrl_domains		= ctrl_domain_init(RDT_RESOURCE_SMBA),
>  			.parse_ctrlval		= parse_bw,
>  			.format_str		= "%d=%*u",
>  			.fflags			= RFTYPE_RES_MB,
> @@ -352,11 +355,11 @@ cat_wrmsr(struct rdt_domain *d, struct msr_param *m, struct rdt_resource *r)
>  		wrmsrl(hw_res->msr_base + i, hw_dom->ctrl_val[i]);
>  }
>  
> -struct rdt_domain *get_domain_from_cpu(int cpu, struct rdt_resource *r)
> +struct rdt_domain *get_ctrl_domain_from_cpu(int cpu, struct rdt_resource *r)
>  {
>  	struct rdt_domain *d;
>  
> -	list_for_each_entry(d, &r->domains, hdr.list) {
> +	list_for_each_entry(d, &r->ctrl_domains, hdr.list) {
>  		/* Find the domain that contains this CPU */
>  		if (cpumask_test_cpu(cpu, &d->hdr.cpu_mask))
>  			return d;
> @@ -378,7 +381,7 @@ void rdt_ctrl_update(void *arg)
>  	int cpu = smp_processor_id();
>  	struct rdt_domain *d;
>  
> -	d = get_domain_from_cpu(cpu, r);
> +	d = get_ctrl_domain_from_cpu(cpu, r);
>  	if (d) {
>  		hw_res->msr_update(d, m, r);
>  		return;
> @@ -388,26 +391,26 @@ void rdt_ctrl_update(void *arg)
>  }
>  
>  /*
> - * rdt_find_domain - Find a domain in a resource that matches input resource id
> + * rdt_find_domain - Search for a domain id in a resource domain list.
>   *
> - * Search resource r's domain list to find the resource id. If the resource
> - * id is found in a domain, return the domain. Otherwise, if requested by
> - * caller, return the first domain whose id is bigger than the input id.
> + * Search the list to find the resource id. If the domain id is found

This still talks about searching for a "resource id". And the "otherwise"
is not accurate ... it returns NULL if domain id cannot be found. The
"otherwise" refers to a value returned in a parameter, not the function
return value.

How about:
	Search the domain list to find the domain id. If the domain id
	is found, return the domain. NULL otherwise.
	If the domain id is not found (and NULL returned) then the first
	domain with id bigger than the input id can be returned to the
	caller via @pos.

Please feel free to improve.

> + * in a domain, return the domain. Otherwise, if requested by caller,
> + * return the first domain whose id is bigger than the input id.
>   * The domain list is sorted by id in ascending order.
>   */
> -struct rdt_domain *rdt_find_domain(struct rdt_resource *r, int id,
> -				   struct list_head **pos)
> +struct rdt_domain_hdr *rdt_find_domain(struct list_head *h, int id,
> +				       struct list_head **pos)
>  {
> -	struct rdt_domain *d;
> +	struct rdt_domain_hdr *d;
>  	struct list_head *l;
>  
> -	list_for_each(l, &r->domains) {
> -		d = list_entry(l, struct rdt_domain, hdr.list);
> +	list_for_each(l, h) {
> +		d = list_entry(l, struct rdt_domain_hdr, list);
>  		/* When id is found, return its domain. */
> -		if (id == d->hdr.id)
> +		if (id == d->id)
>  			return d;
>  		/* Stop searching when finding id's position in sorted list. */
> -		if (id < d->hdr.id)
> +		if (id < d->id)
>  			break;
>  	}
>  
> @@ -501,35 +504,29 @@ static int get_domain_id_from_scope(int cpu, enum resctrl_scope scope)
>  	return -EINVAL;
>  }
>  
> -/*
> - * domain_add_cpu - Add a cpu to a resource's domain list.
> - *
> - * If an existing domain in the resource r's domain list matches the cpu's
> - * resource id, add the cpu in the domain.
> - *
> - * Otherwise, a new domain is allocated and inserted into the right position
> - * in the domain list sorted by id in ascending order.
> - *
> - * The order in the domain list is visible to users when we print entries
> - * in the schemata file and schemata input is validated to have the same order
> - * as this list.
> - */
> -static void domain_add_cpu(int cpu, struct rdt_resource *r)
> +static void domain_add_cpu_ctrl(int cpu, struct rdt_resource *r)
>  {
> -	int id = get_domain_id_from_scope(cpu, r->scope);
> +	int id = get_domain_id_from_scope(cpu, r->ctrl_scope);
>  	struct list_head *add_pos = NULL;
>  	struct rdt_hw_domain *hw_dom;
> +	struct rdt_domain_hdr *hdr;
>  	struct rdt_domain *d;
>  	int err;
>  
>  	if (id < 0) {
> -		pr_warn_once("Can't find domain id for CPU:%d scope:%d for resource %s\n",
> -			     cpu, r->scope, r->name);
> +		pr_warn_once("Can't find control domain id for CPU:%d scope:%d for resource %s\n",
> +			     cpu, r->ctrl_scope, r->name);
>  		return;
>  	}
> -	d = rdt_find_domain(r, id, &add_pos);
>  
> -	if (d) {
> +	hdr = rdt_find_domain(&r->ctrl_domains, id, &add_pos);
> +

Please remove empty line.

> +	if (hdr) {
> +		if (WARN_ON_ONCE(hdr->type != RESCTRL_CTRL_DOMAIN))
> +			return;
> +
> +		d = container_of(hdr, struct rdt_domain, hdr);
> +
>  		cpumask_set_cpu(cpu, &d->hdr.cpu_mask);
>  		if (r->cache.arch_has_per_cpu_cfg)
>  			rdt_domain_reconfigure_cdp(r);
> @@ -542,48 +539,115 @@ static void domain_add_cpu(int cpu, struct rdt_resource *r)
>  
>  	d = &hw_dom->d_resctrl;
>  	d->hdr.id = id;
> +	d->hdr.type = RESCTRL_CTRL_DOMAIN;
>  	cpumask_set_cpu(cpu, &d->hdr.cpu_mask);
>  
>  	rdt_domain_reconfigure_cdp(r);
>  
> -	if (r->alloc_capable && domain_setup_ctrlval(r, d)) {
> +	if (domain_setup_ctrlval(r, d)) {
>  		domain_free(hw_dom);
>  		return;
>  	}
>  
> -	if (r->mon_capable && arch_domain_mbm_alloc(r->num_rmid, hw_dom)) {
> +	list_add_tail(&d->hdr.list, add_pos);
> +
> +	err = resctrl_online_ctrl_domain(r, d);
> +	if (err) {
> +		list_del(&d->hdr.list);
> +		domain_free(hw_dom);
> +	}
> +}
> +
> +static void domain_add_cpu_mon(int cpu, struct rdt_resource *r)
> +{
> +	int id = get_domain_id_from_scope(cpu, r->mon_scope);
> +	struct list_head *add_pos = NULL;
> +	struct rdt_hw_domain *hw_dom;
> +	struct rdt_domain_hdr *hdr;
> +	struct rdt_domain *d;
> +	int err;
> +
> +	if (id < 0) {
> +		pr_warn_once("Can't find monitor domain id for CPU:%d scope:%d for resource %s\n",
> +			     cpu, r->mon_scope, r->name);
> +		return;
> +	}
> +
> +	hdr = rdt_find_domain(&r->mon_domains, id, &add_pos);
> +	if (IS_ERR(hdr)) {

Can this happen with rdt_find_domain() no longer returning an error?

It does not seem as though changes were made consistently to this series.

> +		pr_warn("Couldn't find monitor scope id=%d for CPU %d\n", id, cpu);
> +		return;
> +	}
> +
> +	if (hdr) {
> +		if (WARN_ON_ONCE(hdr->type != RESCTRL_MON_DOMAIN))
> +			return;
> +
> +		d = container_of(hdr, struct rdt_domain, hdr);
> +
> +		cpumask_set_cpu(cpu, &d->hdr.cpu_mask);
> +		return;
> +	}
> +
> +	hw_dom = kzalloc_node(sizeof(*hw_dom), GFP_KERNEL, cpu_to_node(cpu));
> +	if (!hw_dom)
> +		return;
> +
> +	d = &hw_dom->d_resctrl;
> +	d->hdr.id = id;
> +	d->hdr.type = RESCTRL_MON_DOMAIN;
> +	cpumask_set_cpu(cpu, &d->hdr.cpu_mask);
> +
> +	if (arch_domain_mbm_alloc(r->num_rmid, hw_dom)) {
>  		domain_free(hw_dom);
>  		return;
>  	}
>  
>  	list_add_tail(&d->hdr.list, add_pos);
>  
> -	err = resctrl_online_domain(r, d);
> +	err = resctrl_online_mon_domain(r, d);
>  	if (err) {
>  		list_del(&d->hdr.list);
>  		domain_free(hw_dom);
>  	}
>  }
>  
> -static void domain_remove_cpu(int cpu, struct rdt_resource *r)
> +/*
> + * domain_add_cpu - Add a cpu to either/both resource's domain lists.
> + */
> +static void domain_add_cpu(int cpu, struct rdt_resource *r)
> +{
> +	if (r->alloc_capable)
> +		domain_add_cpu_ctrl(cpu, r);
> +	if (r->mon_capable)
> +		domain_add_cpu_mon(cpu, r);
> +}
> +
> +static void domain_remove_cpu_ctrl(int cpu, struct rdt_resource *r)
>  {
> -	int id = get_domain_id_from_scope(cpu, r->scope);
> +	int id = get_domain_id_from_scope(cpu, r->ctrl_scope);
>  	struct rdt_hw_domain *hw_dom;
> +	struct rdt_domain_hdr *hdr;
>  	struct rdt_domain *d;
>  
>  	if (id < 0)
>  		return;
>  
> -	d = rdt_find_domain(r, id, NULL);
> -	if (!d) {
> -		pr_warn("Couldn't find cache id for CPU %d\n", cpu);
> +	hdr = rdt_find_domain(&r->ctrl_domains, id, NULL);
> +	if (IS_ERR_OR_NULL(hdr)) {
> +		pr_warn("Couldn't find control scope id=%d for CPU %d\n", id, cpu);

Why did the !d error checking transition to IS_ERR_OR_NULL() here?

Also, the error message does not sound reasonable for what can be encountered
at this point.

>  		return;
>  	}
> +
> +	if (WARN_ON_ONCE(hdr->type != RESCTRL_CTRL_DOMAIN))
> +		return;
> +
> +	d = container_of(hdr, struct rdt_domain, hdr);
>  	hw_dom = resctrl_to_arch_dom(d);
>  
>  	cpumask_clear_cpu(cpu, &d->hdr.cpu_mask);
>  	if (cpumask_empty(&d->hdr.cpu_mask)) {
> -		resctrl_offline_domain(r, d);
> +		resctrl_offline_ctrl_domain(r, d);
>  		list_del(&d->hdr.list);
>  
>  		/*
> @@ -596,6 +660,38 @@ static void domain_remove_cpu(int cpu, struct rdt_resource *r)
>  
>  		return;
>  	}
> +}
> +
> +static void domain_remove_cpu_mon(int cpu, struct rdt_resource *r)
> +{
> +	int id = get_domain_id_from_scope(cpu, r->mon_scope);
> +	struct rdt_hw_domain *hw_dom;
> +	struct rdt_domain_hdr *hdr;
> +	struct rdt_domain *d;
> +
> +	if (id < 0)
> +		return;
> +
> +	hdr = rdt_find_domain(&r->mon_domains, id, NULL);
> +	if (IS_ERR_OR_NULL(hdr)) {
> +		pr_warn("Couldn't find scope id=%d for CPU %d\n", id, cpu);

same here

> +		return;
> +	}
> +
> +	if (WARN_ON_ONCE(hdr->type != RESCTRL_MON_DOMAIN))
> +		return;
> +
> +	d = container_of(hdr, struct rdt_domain, hdr);
> +	hw_dom = resctrl_to_arch_dom(d);
> +
> +	cpumask_clear_cpu(cpu, &d->hdr.cpu_mask);
> +	if (cpumask_empty(&d->hdr.cpu_mask)) {
> +		resctrl_offline_mon_domain(r, d);
> +		list_del(&d->hdr.list);
> +		domain_free(hw_dom);
> +
> +		return;
> +	}
>  
>  	if (r == &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl) {
>  		if (is_mbm_enabled() && cpu == d->mbm_work_cpu) {


Reinette

^ permalink raw reply	[flat|nested] 331+ messages in thread

* Re: [PATCH v10 4/8] x86/resctrl: Split the rdt_domain and rdt_hw_domain structures
  2023-10-31 21:17             ` [PATCH v10 4/8] x86/resctrl: Split the rdt_domain and rdt_hw_domain structures Tony Luck
@ 2023-11-07  0:32               ` Reinette Chatre
  2023-11-08 19:19                 ` Tony Luck
  0 siblings, 1 reply; 331+ messages in thread
From: Reinette Chatre @ 2023-11-07  0:32 UTC (permalink / raw)
  To: Tony Luck, Fenghua Yu, Peter Newman, Jonathan Corbet, Shuah Khan, x86
  Cc: Shaopeng Tan, James Morse, Jamie Iles, Babu Moger, Randy Dunlap,
	linux-kernel, linux-doc, patches

Hi Tony,

On 10/31/2023 2:17 PM, Tony Luck wrote:
> The same rdt_domain structure is used for both control and monitor
> functions. But this results in wasted memory as some of the fields are
> only used by control functions, while most are only used for monitor
> functions.
> 
> Split into separate rdt_ctrl_domain and rdt_mon_domain structures with
> just the fields required for control and monitoring respectively.
> 
> Similar split of the rdt_hw_domain structure into rdt_hw_ctrl_domain
> and rdt_hw_mon_domain.
> 
> Signed-off-by: Tony Luck <tony.luck@intel.com>
> ---
> Changes since v9
> Comment against patch 4, but now fixed in patch #2. cpu_mask
> is included in common header.
> 
>  include/linux/resctrl.h                   | 50 +++++++------
>  arch/x86/kernel/cpu/resctrl/internal.h    | 60 ++++++++++------
>  arch/x86/kernel/cpu/resctrl/core.c        | 87 ++++++++++++-----------
>  arch/x86/kernel/cpu/resctrl/ctrlmondata.c | 32 ++++-----
>  arch/x86/kernel/cpu/resctrl/monitor.c     | 40 +++++------
>  arch/x86/kernel/cpu/resctrl/pseudo_lock.c |  6 +-
>  arch/x86/kernel/cpu/resctrl/rdtgroup.c    | 62 ++++++++--------
>  7 files changed, 184 insertions(+), 153 deletions(-)
> 
> diff --git a/include/linux/resctrl.h b/include/linux/resctrl.h
> index 35e700edc6e6..36503e8870cd 100644
> --- a/include/linux/resctrl.h
> +++ b/include/linux/resctrl.h
> @@ -72,7 +72,25 @@ struct rdt_domain_hdr {
>  };
>  
>  /**
> - * struct rdt_domain - group of CPUs sharing a resctrl resource
> + * struct rdt_ctrl_domain - group of CPUs sharing a resctrl control resource
> + * @hdr:		common header for different domain types
> + * @cpu_mask:		which CPUs share this resource
> + * @plr:		pseudo-locked region (if any) associated with domain
> + * @staged_config:	parsed configuration to be applied
> + * @mbps_val:		When mba_sc is enabled, this holds the array of user
> + *			specified control values for mba_sc in MBps, indexed
> + *			by closid
> + */
> +struct rdt_ctrl_domain {
> +	struct rdt_domain_hdr		hdr;
> +	struct cpumask			cpu_mask;

This patch did not change what it said it changed.

Reinette

^ permalink raw reply	[flat|nested] 331+ messages in thread

* Re: [PATCH v10 7/8] x86/resctrl: Sub NUMA Cluster detection and enable
  2023-10-31 21:17             ` [PATCH v10 7/8] x86/resctrl: Sub NUMA Cluster detection and enable Tony Luck
@ 2023-11-07  0:33               ` Reinette Chatre
  0 siblings, 0 replies; 331+ messages in thread
From: Reinette Chatre @ 2023-11-07  0:33 UTC (permalink / raw)
  To: Tony Luck, Fenghua Yu, Peter Newman, Jonathan Corbet, Shuah Khan, x86
  Cc: Shaopeng Tan, James Morse, Jamie Iles, Babu Moger, Randy Dunlap,
	linux-kernel, linux-doc, patches

Hi Tony,

On 10/31/2023 2:17 PM, Tony Luck wrote:
> There isn't a simple hardware bit that indicates whether a CPU is
> running in Sub NUMA Cluster (SNC) mode. Infer the state by comparing
> the ratio of NUMA nodes to L3 cache instances.
> 
> When SNC mode is detected, reconfigure the RMID counters by updating
> the MSR_RMID_SNC_CONFIG MSR on each socket as CPUs are seen.
> 
> Clearing bit zero of the MSR divides the RMIDs and renumbers the ones
> on the second SNC node to start from zero.
> 
> Signed-off-by: Tony Luck <tony.luck@intel.com>
> ---
> Changes since v9
> Expand h/w to hardware (commit and code comments)
> Remove "earlier commit" reference
> s/counnter/counter/
> Check for offline CPUs and warn user SNC detection may be broken.
> 
>  arch/x86/include/asm/msr-index.h   |   1 +
>  arch/x86/kernel/cpu/resctrl/core.c | 100 ++++++++++++++++++++++++++++-
>  2 files changed, 99 insertions(+), 2 deletions(-)
> 
> diff --git a/arch/x86/include/asm/msr-index.h b/arch/x86/include/asm/msr-index.h
> index e3fa9cecd599..4285a5ee81fe 100644
> --- a/arch/x86/include/asm/msr-index.h
> +++ b/arch/x86/include/asm/msr-index.h
> @@ -1109,6 +1109,7 @@
>  #define MSR_IA32_QM_CTR			0xc8e
>  #define MSR_IA32_PQR_ASSOC		0xc8f
>  #define MSR_IA32_L3_CBM_BASE		0xc90
> +#define MSR_RMID_SNC_CONFIG		0xca0
>  #define MSR_IA32_L2_CBM_BASE		0xd10
>  #define MSR_IA32_MBA_THRTL_BASE		0xd50
>  
> diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
> index 97d2a5a7dd41..034f9797e1fb 100644
> --- a/arch/x86/kernel/cpu/resctrl/core.c
> +++ b/arch/x86/kernel/cpu/resctrl/core.c
> @@ -16,11 +16,14 @@
>  
>  #define pr_fmt(fmt)	"resctrl: " fmt
>  
> +#include <linux/cpu.h>
>  #include <linux/slab.h>
>  #include <linux/err.h>
>  #include <linux/cacheinfo.h>
>  #include <linux/cpuhotplug.h>
> +#include <linux/mod_devicetable.h>
>  
> +#include <asm/cpu_device_id.h>
>  #include <asm/intel-family.h>
>  #include <asm/resctrl.h>
>  #include "internal.h"
> @@ -184,10 +187,10 @@ bool is_mba_sc(struct rdt_resource *r)
>  
>  /*
>   * rdt_get_mb_table() - get a mapping of bandwidth(b/w) percentage values
> - * exposed to user interface and the h/w understandable delay values.
> + * exposed to user interface and the hardware understandable delay values.
>   *
>   * The non-linear delay values have the granularity of power of two
> - * and also the h/w does not guarantee a curve for configured delay
> + * and also the hardware does not guarantee a curve for configured delay
>   * values vs. actual b/w enforced.
>   * Hence we need a mapping that is pre calibrated so the user can
>   * express the memory b/w as a percentage value.

This seems out of place here. If you want to make such a global change
it can be done as a separate patch. For this work it can just use
"hardware" consistently in the areas it changes.

Reinette

^ permalink raw reply	[flat|nested] 331+ messages in thread

* Re: [PATCH v10 4/8] x86/resctrl: Split the rdt_domain and rdt_hw_domain structures
  2023-11-07  0:32               ` Reinette Chatre
@ 2023-11-08 19:19                 ` Tony Luck
  2023-11-08 19:43                   ` Reinette Chatre
  0 siblings, 1 reply; 331+ messages in thread
From: Tony Luck @ 2023-11-08 19:19 UTC (permalink / raw)
  To: Reinette Chatre
  Cc: Fenghua Yu, Peter Newman, Jonathan Corbet, Shuah Khan, x86,
	Shaopeng Tan, James Morse, Jamie Iles, Babu Moger, Randy Dunlap,
	linux-kernel, linux-doc, patches

On Mon, Nov 06, 2023 at 04:32:56PM -0800, Reinette Chatre wrote:
> Hi Tony,
> 
> On 10/31/2023 2:17 PM, Tony Luck wrote:
> > The same rdt_domain structure is used for both control and monitor
> > functions. But this results in wasted memory as some of the fields are
> > only used by control functions, while most are only used for monitor
> > functions.
> > 
> > Split into separate rdt_ctrl_domain and rdt_mon_domain structures with
> > just the fields required for control and monitoring respectively.
> > 
> > Similar split of the rdt_hw_domain structure into rdt_hw_ctrl_domain
> > and rdt_hw_mon_domain.
> > 
> > Signed-off-by: Tony Luck <tony.luck@intel.com>
> > ---
> > Changes since v9
> > Comment against patch 4, but now fixed in patch #2. cpu_mask
> > is included in common header.
> > 
> >  include/linux/resctrl.h                   | 50 +++++++------
> >  arch/x86/kernel/cpu/resctrl/internal.h    | 60 ++++++++++------
> >  arch/x86/kernel/cpu/resctrl/core.c        | 87 ++++++++++++-----------
> >  arch/x86/kernel/cpu/resctrl/ctrlmondata.c | 32 ++++-----
> >  arch/x86/kernel/cpu/resctrl/monitor.c     | 40 +++++------
> >  arch/x86/kernel/cpu/resctrl/pseudo_lock.c |  6 +-
> >  arch/x86/kernel/cpu/resctrl/rdtgroup.c    | 62 ++++++++--------
> >  7 files changed, 184 insertions(+), 153 deletions(-)
> > 
> > diff --git a/include/linux/resctrl.h b/include/linux/resctrl.h
> > index 35e700edc6e6..36503e8870cd 100644
> > --- a/include/linux/resctrl.h
> > +++ b/include/linux/resctrl.h
> > @@ -72,7 +72,25 @@ struct rdt_domain_hdr {
> >  };
> >  
> >  /**
> > - * struct rdt_domain - group of CPUs sharing a resctrl resource
> > + * struct rdt_ctrl_domain - group of CPUs sharing a resctrl control resource
> > + * @hdr:		common header for different domain types
> > + * @cpu_mask:		which CPUs share this resource
> > + * @plr:		pseudo-locked region (if any) associated with domain
> > + * @staged_config:	parsed configuration to be applied
> > + * @mbps_val:		When mba_sc is enabled, this holds the array of user
> > + *			specified control values for mba_sc in MBps, indexed
> > + *			by closid
> > + */
> > +struct rdt_ctrl_domain {
> > +	struct rdt_domain_hdr		hdr;
> > +	struct cpumask			cpu_mask;
> 
> This patch did not change what it said it changed.

Reinette,

I'm not sure of the problem. The commit said the patch is splitting the
rdt_domain structure into rdt_ctrl_domain and rdt_mon_domain.

The piece of the patch where you gave this comment changed like this:

------- Before -------
/**
 * struct rdt_domain - group of CPUs sharing a resctrl resource
 * @hdr:		common header for different domain types
 * @rmid_busy_llc:	bitmap of which limbo RMIDs are above threshold
 * @mbm_total:		saved state for MBM total bandwidth
 * @mbm_local:		saved state for MBM local bandwidth
 * @mbm_over:		worker to periodically read MBM h/w counters
 * @cqm_limbo:		worker to periodically read CQM h/w counters
 * @mbm_work_cpu:	worker CPU for MBM h/w counters
 * @cqm_work_cpu:	worker CPU for CQM h/w counters
 * @plr:		pseudo-locked region (if any) associated with domain
 * @staged_config:	parsed configuration to be applied
 * @mbps_val:		When mba_sc is enabled, this holds the array of user
 *			specified control values for mba_sc in MBps, indexed
 *			by closid
 */
struct rdt_domain {
	struct rdt_domain_hdr		hdr;
	unsigned long			*rmid_busy_llc;
	struct mbm_state		*mbm_total;
	struct mbm_state		*mbm_local;
	struct delayed_work		mbm_over;
	struct delayed_work		cqm_limbo;
	int				mbm_work_cpu;
	int				cqm_work_cpu;
	struct pseudo_lock_region	*plr;
	struct resctrl_staged_config	staged_config[CDP_NUM_TYPES];
	u32				*mbps_val;
};
------- After -------
/**
 * struct rdt_ctrl_domain - group of CPUs sharing a resctrl control resource
 * @hdr:		common header for different domain types
 * @cpu_mask:		which CPUs share this resource
 * @plr:		pseudo-locked region (if any) associated with domain
 * @staged_config:	parsed configuration to be applied
 * @mbps_val:		When mba_sc is enabled, this holds the array of user
 *			specified control values for mba_sc in MBps, indexed
 *			by closid
 */
struct rdt_ctrl_domain {
	struct rdt_domain_hdr		hdr;
	struct cpumask			cpu_mask;
	struct pseudo_lock_region	*plr;
	struct resctrl_staged_config	staged_config[CDP_NUM_TYPES];
	u32				*mbps_val;
};

/**
 * struct rdt_mon_domain - group of CPUs sharing a resctrl control resource
 * @hdr:		common header for different domain types
 * @rmid_busy_llc:	bitmap of which limbo RMIDs are above threshold
 * @mbm_total:		saved state for MBM total bandwidth
 * @mbm_local:		saved state for MBM local bandwidth
 * @mbm_over:		worker to periodically read MBM h/w counters
 * @cqm_limbo:		worker to periodically read CQM h/w counters
 * @mbm_work_cpu:	worker CPU for MBM h/w counters
 * @cqm_work_cpu:	worker CPU for CQM h/w counters
 */
struct rdt_mon_domain {
	struct rdt_domain_hdr		hdr;
	unsigned long			*rmid_busy_llc;
	struct mbm_state		*mbm_total;
	struct mbm_state		*mbm_local;
	struct delayed_work		mbm_over;
	struct delayed_work		cqm_limbo;
	int				mbm_work_cpu;
	int				cqm_work_cpu;
};
-----

Which to my eyes looks like exactly what the commit comment says I
was going to do in this patch. The old rdt_domain structure has
been replaced with two structures, with names as described in the commit
comment. The fields of the original structure have been distributed
between the two new structures based on whether they are used for
control or monitor functions.


What am I missing? (Apart from a silly cut & paste error in the comments
that I just noticed and will fix now).

-Tony

^ permalink raw reply	[flat|nested] 331+ messages in thread

* Re: [PATCH v10 4/8] x86/resctrl: Split the rdt_domain and rdt_hw_domain structures
  2023-11-08 19:19                 ` Tony Luck
@ 2023-11-08 19:43                   ` Reinette Chatre
  2023-11-08 20:10                     ` Luck, Tony
  0 siblings, 1 reply; 331+ messages in thread
From: Reinette Chatre @ 2023-11-08 19:43 UTC (permalink / raw)
  To: Tony Luck
  Cc: Fenghua Yu, Peter Newman, Jonathan Corbet, Shuah Khan, x86,
	Shaopeng Tan, James Morse, Jamie Iles, Babu Moger, Randy Dunlap,
	linux-kernel, linux-doc, patches

Hi Tony,

On 11/8/2023 11:19 AM, Tony Luck wrote:
> On Mon, Nov 06, 2023 at 04:32:56PM -0800, Reinette Chatre wrote:
>> Hi Tony,
>>
>> On 10/31/2023 2:17 PM, Tony Luck wrote:
>>> The same rdt_domain structure is used for both control and monitor
>>> functions. But this results in wasted memory as some of the fields are
>>> only used by control functions, while most are only used for monitor
>>> functions.
>>>
>>> Split into separate rdt_ctrl_domain and rdt_mon_domain structures with
>>> just the fields required for control and monitoring respectively.
>>>
>>> Similar split of the rdt_hw_domain structure into rdt_hw_ctrl_domain
>>> and rdt_hw_mon_domain.
>>>
>>> Signed-off-by: Tony Luck <tony.luck@intel.com>
>>> ---
>>> Changes since v9
>>> Comment against patch 4, but now fixed in patch #2. cpu_mask
>>> is included in common header.

This is the expected change I referred to. Specifically:
	"cpu_mask is included in common header"

> What am I missing? (Apart from a silly cut & paste error in the comments
> that I just noticed and will fix now).


	struct rdt_ctrl_domain {
		struct rdt_domain_hdr		hdr;
		struct cpumask			cpu_mask;
		...
	}

Considering the description of the changes to expect in this version, I
did not expect to see a cpu_mask member in struct rdt_ctrl_domain since
it has now been moved to struct rdt_domain_hdr. What am I missing?

Reinette




^ permalink raw reply	[flat|nested] 331+ messages in thread

* RE: [PATCH v10 4/8] x86/resctrl: Split the rdt_domain and rdt_hw_domain structures
  2023-11-08 19:43                   ` Reinette Chatre
@ 2023-11-08 20:10                     ` Luck, Tony
  0 siblings, 0 replies; 331+ messages in thread
From: Luck, Tony @ 2023-11-08 20:10 UTC (permalink / raw)
  To: Chatre, Reinette
  Cc: Yu, Fenghua, Peter Newman, Jonathan Corbet, Shuah Khan, x86,
	Shaopeng Tan, James Morse, Jamie Iles, Babu Moger, Randy Dunlap,
	linux-kernel, linux-doc, patches

>       struct rdt_ctrl_domain {
>               struct rdt_domain_hdr           hdr;
>               struct cpumask                  cpu_mask;
>               ...
>       }
>
> Considering the description of the changes to expect in this version I
> did not expect to see a cpu_mask member in struct rdt_ctrl_domain since
> it has now been moved to struct rdt_domain_hdr. What am I missing?

Reinette,

Yes. cpu_mask was moved into "struct rdt_domain_hdr" in patch 2 of the
series. I'm not sure how it managed to reappear here. Definitely a zombie.
I've removed it now. 

-Tony




^ permalink raw reply	[flat|nested] 331+ messages in thread

* [PATCH v11 0/8] Add support for Sub-NUMA cluster (SNC) systems
  2023-10-31 21:17           ` [PATCH v10 " Tony Luck
                               ` (7 preceding siblings ...)
  2023-10-31 21:17             ` [PATCH v10 8/8] x86/resctrl: Update documentation with Sub-NUMA cluster changes Tony Luck
@ 2023-11-09 23:09             ` Tony Luck
  2023-11-09 23:09               ` [PATCH v11 1/8] x86/resctrl: Prepare for new domain scope Tony Luck
                                 ` (9 more replies)
  8 siblings, 10 replies; 331+ messages in thread
From: Tony Luck @ 2023-11-09 23:09 UTC (permalink / raw)
  To: Fenghua Yu, Reinette Chatre, Peter Newman, Jonathan Corbet,
	Shuah Khan, x86
  Cc: Shaopeng Tan, James Morse, Jamie Iles, Babu Moger, Randy Dunlap,
	linux-kernel, linux-doc, patches, Tony Luck

The Sub-NUMA cluster feature on some Intel processors partitions the CPUs
that share an L3 cache into two or more sets. This plays havoc with the
Resource Director Technology (RDT) monitoring features. Prior to this
patch, Intel has advised that SNC and RDT are incompatible.

Some of these CPUs support an MSR that can partition the RMID counters in
the same way. This allows monitoring features to be used, with the caveat
that users must be aware that Linux may migrate tasks more frequently
between SNC nodes than between "regular" NUMA nodes, so reading counters
from all SNC nodes may be needed to get a complete picture of activity
for tasks.
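
As a concrete illustration of that caveat, here is a minimal user-space
sketch (not part of this series) that totals one monitoring event across
the SNC nodes of a socket. The mon_L3_00/mon_L3_01 directory names and the
two-node layout are assumptions made for this example; the actual domain
ids depend on the system topology.

#include <stdio.h>

static int read_event(const char *path, long long *val)
{
	FILE *f = fopen(path, "r");

	if (!f)
		return -1;
	if (fscanf(f, "%lld", val) != 1) {
		fclose(f);
		return -1;
	}
	fclose(f);
	return 0;
}

int main(void)
{
	/* Hypothetical per-SNC-node monitor files for one socket. */
	static const char * const nodes[] = {
		"/sys/fs/resctrl/mon_data/mon_L3_00/llc_occupancy",
		"/sys/fs/resctrl/mon_data/mon_L3_01/llc_occupancy",
	};
	long long total = 0, val;
	unsigned int i;

	for (i = 0; i < sizeof(nodes) / sizeof(nodes[0]); i++) {
		if (read_event(nodes[i], &val)) {
			fprintf(stderr, "cannot read %s\n", nodes[i]);
			return 1;
		}
		total += val;	/* sum across SNC nodes sharing the L3 */
	}
	printf("socket-wide llc_occupancy: %lld\n", total);
	return 0;
}

The same summing applies to the MBM events when a task's memory accesses
span SNC nodes.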

Cache and memory bandwidth allocation features continue to operate at
the scope of the L3 cache.

Signed-off-by: Tony Luck <tony.luck@intel.com>

Changes since v10 to patches 1, 3, 4, 7. See patches for details.

Tony Luck (8):
  x86/resctrl: Prepare for new domain scope
  x86/resctrl: Prepare to split rdt_domain structure
  x86/resctrl: Prepare for different scope for control/monitor
    operations
  x86/resctrl: Split the rdt_domain and rdt_hw_domain structures
  x86/resctrl: Add node-scope to the options for feature scope
  x86/resctrl: Introduce snc_nodes_per_l3_cache
  x86/resctrl: Sub NUMA Cluster detection and enable
  x86/resctrl: Update documentation with Sub-NUMA cluster changes

 Documentation/arch/x86/resctrl.rst        |  23 +-
 include/linux/resctrl.h                   |  85 +++--
 arch/x86/include/asm/msr-index.h          |   1 +
 arch/x86/kernel/cpu/resctrl/internal.h    |  66 ++--
 arch/x86/kernel/cpu/resctrl/core.c        | 403 +++++++++++++++++-----
 arch/x86/kernel/cpu/resctrl/ctrlmondata.c |  58 ++--
 arch/x86/kernel/cpu/resctrl/monitor.c     |  68 ++--
 arch/x86/kernel/cpu/resctrl/pseudo_lock.c |  26 +-
 arch/x86/kernel/cpu/resctrl/rdtgroup.c    | 149 ++++----
 9 files changed, 598 insertions(+), 281 deletions(-)


base-commit: 5a6a09e97199d6600d31383055f9d43fbbcbe86f
-- 
2.41.0


^ permalink raw reply	[flat|nested] 331+ messages in thread

* [PATCH v11 1/8] x86/resctrl: Prepare for new domain scope
  2023-11-09 23:09             ` [PATCH v11 0/8] Add support for Sub-NUMA cluster (SNC) systems Tony Luck
@ 2023-11-09 23:09               ` Tony Luck
  2023-11-13 19:36                 ` Moger, Babu
  2023-11-20 22:16                 ` Reinette Chatre
  2023-11-09 23:09               ` [PATCH v11 2/8] x86/resctrl: Prepare to split rdt_domain structure Tony Luck
                                 ` (8 subsequent siblings)
  9 siblings, 2 replies; 331+ messages in thread
From: Tony Luck @ 2023-11-09 23:09 UTC (permalink / raw)
  To: Fenghua Yu, Reinette Chatre, Peter Newman, Jonathan Corbet,
	Shuah Khan, x86
  Cc: Shaopeng Tan, James Morse, Jamie Iles, Babu Moger, Randy Dunlap,
	linux-kernel, linux-doc, patches, Tony Luck

Resctrl resources operate on subsets of CPUs in the system with the
defining attribute of each subset being an instance of a particular
level of cache. E.g. all CPUs sharing an L3 cache would be part of the
same domain.

In preparation for features that are scoped at the NUMA node level,
change the code from explicit references to "cache_level" to a more
generic scope. At this point the only options for this scope are groups
of CPUs that share an L2 cache or L3 cache.

Clean up the error handling when looking up domains. Report invalid ids
before calling rdt_find_domain() in preparation for better messages when
scope can be other than cache scope. This means that rdt_find_domain()
will never return an error. So remove checks for error from the callsites.

Signed-off-by: Tony Luck <tony.luck@intel.com>
---
Changes since v10:
Reinette: Move a blank line in domain_add_cpu()
Reinette: Fix warning message in domain_remove_cpu()

 include/linux/resctrl.h                   |  9 +++--
 arch/x86/kernel/cpu/resctrl/core.c        | 42 +++++++++++++++--------
 arch/x86/kernel/cpu/resctrl/ctrlmondata.c |  2 +-
 arch/x86/kernel/cpu/resctrl/pseudo_lock.c |  6 +++-
 arch/x86/kernel/cpu/resctrl/rdtgroup.c    |  5 ++-
 5 files changed, 45 insertions(+), 19 deletions(-)

diff --git a/include/linux/resctrl.h b/include/linux/resctrl.h
index 66942d7fba7f..7d4eb7df611d 100644
--- a/include/linux/resctrl.h
+++ b/include/linux/resctrl.h
@@ -144,13 +144,18 @@ struct resctrl_membw {
 struct rdt_parse_data;
 struct resctrl_schema;
 
+enum resctrl_scope {
+	RESCTRL_L2_CACHE = 2,
+	RESCTRL_L3_CACHE = 3,
+};
+
 /**
  * struct rdt_resource - attributes of a resctrl resource
  * @rid:		The index of the resource
  * @alloc_capable:	Is allocation available on this machine
  * @mon_capable:	Is monitor feature available on this machine
  * @num_rmid:		Number of RMIDs available
- * @cache_level:	Which cache level defines scope of this resource
+ * @scope:		Scope of this resource
  * @cache:		Cache allocation related data
  * @membw:		If the component has bandwidth controls, their properties.
  * @domains:		All domains for this resource
@@ -168,7 +173,7 @@ struct rdt_resource {
 	bool			alloc_capable;
 	bool			mon_capable;
 	int			num_rmid;
-	int			cache_level;
+	enum resctrl_scope	scope;
 	struct resctrl_cache	cache;
 	struct resctrl_membw	membw;
 	struct list_head	domains;
diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
index 19e0681f0435..9f8d87abd751 100644
--- a/arch/x86/kernel/cpu/resctrl/core.c
+++ b/arch/x86/kernel/cpu/resctrl/core.c
@@ -65,7 +65,7 @@ struct rdt_hw_resource rdt_resources_all[] = {
 		.r_resctrl = {
 			.rid			= RDT_RESOURCE_L3,
 			.name			= "L3",
-			.cache_level		= 3,
+			.scope			= RESCTRL_L3_CACHE,
 			.domains		= domain_init(RDT_RESOURCE_L3),
 			.parse_ctrlval		= parse_cbm,
 			.format_str		= "%d=%0*x",
@@ -79,7 +79,7 @@ struct rdt_hw_resource rdt_resources_all[] = {
 		.r_resctrl = {
 			.rid			= RDT_RESOURCE_L2,
 			.name			= "L2",
-			.cache_level		= 2,
+			.scope			= RESCTRL_L2_CACHE,
 			.domains		= domain_init(RDT_RESOURCE_L2),
 			.parse_ctrlval		= parse_cbm,
 			.format_str		= "%d=%0*x",
@@ -93,7 +93,7 @@ struct rdt_hw_resource rdt_resources_all[] = {
 		.r_resctrl = {
 			.rid			= RDT_RESOURCE_MBA,
 			.name			= "MB",
-			.cache_level		= 3,
+			.scope			= RESCTRL_L3_CACHE,
 			.domains		= domain_init(RDT_RESOURCE_MBA),
 			.parse_ctrlval		= parse_bw,
 			.format_str		= "%d=%*u",
@@ -105,7 +105,7 @@ struct rdt_hw_resource rdt_resources_all[] = {
 		.r_resctrl = {
 			.rid			= RDT_RESOURCE_SMBA,
 			.name			= "SMBA",
-			.cache_level		= 3,
+			.scope			= RESCTRL_L3_CACHE,
 			.domains		= domain_init(RDT_RESOURCE_SMBA),
 			.parse_ctrlval		= parse_bw,
 			.format_str		= "%d=%*u",
@@ -401,9 +401,6 @@ struct rdt_domain *rdt_find_domain(struct rdt_resource *r, int id,
 	struct rdt_domain *d;
 	struct list_head *l;
 
-	if (id < 0)
-		return ERR_PTR(-ENODEV);
-
 	list_for_each(l, &r->domains) {
 		d = list_entry(l, struct rdt_domain, list);
 		/* When id is found, return its domain. */
@@ -491,6 +488,19 @@ static int arch_domain_mbm_alloc(u32 num_rmid, struct rdt_hw_domain *hw_dom)
 	return 0;
 }
 
+static int get_domain_id_from_scope(int cpu, enum resctrl_scope scope)
+{
+	switch (scope) {
+	case RESCTRL_L2_CACHE:
+	case RESCTRL_L3_CACHE:
+		return get_cpu_cacheinfo_id(cpu, scope);
+	default:
+		break;
+	}
+
+	return -EINVAL;
+}
+
 /*
  * domain_add_cpu - Add a cpu to a resource's domain list.
  *
@@ -506,18 +516,19 @@ static int arch_domain_mbm_alloc(u32 num_rmid, struct rdt_hw_domain *hw_dom)
  */
 static void domain_add_cpu(int cpu, struct rdt_resource *r)
 {
-	int id = get_cpu_cacheinfo_id(cpu, r->cache_level);
+	int id = get_domain_id_from_scope(cpu, r->scope);
 	struct list_head *add_pos = NULL;
 	struct rdt_hw_domain *hw_dom;
 	struct rdt_domain *d;
 	int err;
 
-	d = rdt_find_domain(r, id, &add_pos);
-	if (IS_ERR(d)) {
-		pr_warn("Couldn't find cache id for CPU %d\n", cpu);
+	if (id < 0) {
+		pr_warn_once("Can't find domain id for CPU:%d scope:%d for resource %s\n",
+			     cpu, r->scope, r->name);
 		return;
 	}
 
+	d = rdt_find_domain(r, id, &add_pos);
 	if (d) {
 		cpumask_set_cpu(cpu, &d->cpu_mask);
 		if (r->cache.arch_has_per_cpu_cfg)
@@ -556,13 +567,16 @@ static void domain_add_cpu(int cpu, struct rdt_resource *r)
 
 static void domain_remove_cpu(int cpu, struct rdt_resource *r)
 {
-	int id = get_cpu_cacheinfo_id(cpu, r->cache_level);
+	int id = get_domain_id_from_scope(cpu, r->scope);
 	struct rdt_hw_domain *hw_dom;
 	struct rdt_domain *d;
 
+	if (id < 0)
+		return;
+
 	d = rdt_find_domain(r, id, NULL);
-	if (IS_ERR_OR_NULL(d)) {
-		pr_warn("Couldn't find cache id for CPU %d\n", cpu);
+	if (!d) {
+		pr_warn("Couldn't find domain with id=%d for CPU %d\n", id, cpu);
 		return;
 	}
 	hw_dom = resctrl_to_arch_dom(d);
diff --git a/arch/x86/kernel/cpu/resctrl/ctrlmondata.c b/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
index beccb0e87ba7..3f8891d57fac 100644
--- a/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
+++ b/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
@@ -563,7 +563,7 @@ int rdtgroup_mondata_show(struct seq_file *m, void *arg)
 
 	r = &rdt_resources_all[resid].r_resctrl;
 	d = rdt_find_domain(r, domid, NULL);
-	if (IS_ERR_OR_NULL(d)) {
+	if (!d) {
 		ret = -ENOENT;
 		goto out;
 	}
diff --git a/arch/x86/kernel/cpu/resctrl/pseudo_lock.c b/arch/x86/kernel/cpu/resctrl/pseudo_lock.c
index 8f559eeae08e..2a682da9f43a 100644
--- a/arch/x86/kernel/cpu/resctrl/pseudo_lock.c
+++ b/arch/x86/kernel/cpu/resctrl/pseudo_lock.c
@@ -292,10 +292,14 @@ static void pseudo_lock_region_clear(struct pseudo_lock_region *plr)
  */
 static int pseudo_lock_region_init(struct pseudo_lock_region *plr)
 {
+	enum resctrl_scope scope = plr->s->res->scope;
 	struct cpu_cacheinfo *ci;
 	int ret;
 	int i;
 
+	if (WARN_ON_ONCE(scope != RESCTRL_L2_CACHE && scope != RESCTRL_L3_CACHE))
+		return -ENODEV;
+
 	/* Pick the first cpu we find that is associated with the cache. */
 	plr->cpu = cpumask_first(&plr->d->cpu_mask);
 
@@ -311,7 +315,7 @@ static int pseudo_lock_region_init(struct pseudo_lock_region *plr)
 	plr->size = rdtgroup_cbm_to_size(plr->s->res, plr->d, plr->cbm);
 
 	for (i = 0; i < ci->num_leaves; i++) {
-		if (ci->info_list[i].level == plr->s->res->cache_level) {
+		if (ci->info_list[i].level == scope) {
 			plr->line_size = ci->info_list[i].coherency_line_size;
 			return 0;
 		}
diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
index 69a1de92384a..c44be64d65ec 100644
--- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
+++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
@@ -1413,10 +1413,13 @@ unsigned int rdtgroup_cbm_to_size(struct rdt_resource *r,
 	unsigned int size = 0;
 	int num_b, i;
 
+	if (WARN_ON_ONCE(r->scope != RESCTRL_L2_CACHE && r->scope != RESCTRL_L3_CACHE))
+		return size;
+
 	num_b = bitmap_weight(&cbm, r->cache.cbm_len);
 	ci = get_cpu_cacheinfo(cpumask_any(&d->cpu_mask));
 	for (i = 0; i < ci->num_leaves; i++) {
-		if (ci->info_list[i].level == r->cache_level) {
+		if (ci->info_list[i].level == r->scope) {
 			size = ci->info_list[i].size / r->cache.cbm_len * num_b;
 			break;
 		}
-- 
2.41.0


^ permalink raw reply related	[flat|nested] 331+ messages in thread

* [PATCH v11 2/8] x86/resctrl: Prepare to split rdt_domain structure
  2023-11-09 23:09             ` [PATCH v11 0/8] Add support for Sub-NUMA cluster (SNC) systems Tony Luck
  2023-11-09 23:09               ` [PATCH v11 1/8] x86/resctrl: Prepare for new domain scope Tony Luck
@ 2023-11-09 23:09               ` Tony Luck
  2023-11-13 18:08                 ` Moger, Babu
  2023-11-20 22:18                 ` Reinette Chatre
  2023-11-09 23:09               ` [PATCH v11 3/8] x86/resctrl: Prepare for different scope for control/monitor operations Tony Luck
                                 ` (7 subsequent siblings)
  9 siblings, 2 replies; 331+ messages in thread
From: Tony Luck @ 2023-11-09 23:09 UTC (permalink / raw)
  To: Fenghua Yu, Reinette Chatre, Peter Newman, Jonathan Corbet,
	Shuah Khan, x86
  Cc: Shaopeng Tan, James Morse, Jamie Iles, Babu Moger, Randy Dunlap,
	linux-kernel, linux-doc, patches, Tony Luck

The rdt_domain structure is used for both control and monitor features.
It is about to be split into separate structures for these two usages
because the scope for control and monitoring features for a resource
will be different for future resources.

To allow for common code that scans a list of domains looking for a
specific domain id, move all the common fields ("list", "id", "cpu_mask")
into their own structure within the rdt_domain structure.
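
As an aside, the pattern this enables can be shown with a small standalone
sketch (plain user-space C, not kernel code): generic lookup code walks the
common headers, and a caller recovers the full domain structure with
container_of(). Only the rdt_domain_hdr and rdt_domain names mirror the
patch; the rest is invented for illustration.

#include <stddef.h>
#include <stdio.h>

#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

struct rdt_domain_hdr {
	int id;			/* common: unique id for this instance */
};

struct rdt_domain {
	struct rdt_domain_hdr hdr;	/* common header comes first */
	int ctrl_state;			/* stand-in for type-specific fields */
};

/* Generic search over headers; knows nothing about the containing type. */
static struct rdt_domain_hdr *find_hdr(struct rdt_domain_hdr **hdrs, int n, int id)
{
	for (int i = 0; i < n; i++)
		if (hdrs[i]->id == id)
			return hdrs[i];
	return NULL;
}

int main(void)
{
	struct rdt_domain d0 = { .hdr = { .id = 0 }, .ctrl_state = 42 };
	struct rdt_domain d1 = { .hdr = { .id = 1 }, .ctrl_state = 7 };
	struct rdt_domain_hdr *hdrs[] = { &d0.hdr, &d1.hdr };
	struct rdt_domain_hdr *hdr = find_hdr(hdrs, 2, 1);

	if (hdr) {
		/* Recover the containing structure from the common header. */
		struct rdt_domain *d = container_of(hdr, struct rdt_domain, hdr);

		printf("domain %d ctrl_state %d\n", d->hdr.id, d->ctrl_state);
	}
	return 0;
}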

Signed-off-by: Tony Luck <tony.luck@intel.com>
---
 include/linux/resctrl.h                   | 16 ++++--
 arch/x86/kernel/cpu/resctrl/core.c        | 26 +++++-----
 arch/x86/kernel/cpu/resctrl/ctrlmondata.c | 22 ++++-----
 arch/x86/kernel/cpu/resctrl/monitor.c     | 10 ++--
 arch/x86/kernel/cpu/resctrl/pseudo_lock.c | 14 +++---
 arch/x86/kernel/cpu/resctrl/rdtgroup.c    | 60 +++++++++++------------
 6 files changed, 78 insertions(+), 70 deletions(-)

diff --git a/include/linux/resctrl.h b/include/linux/resctrl.h
index 7d4eb7df611d..c4067150a6b7 100644
--- a/include/linux/resctrl.h
+++ b/include/linux/resctrl.h
@@ -53,10 +53,20 @@ struct resctrl_staged_config {
 };
 
 /**
- * struct rdt_domain - group of CPUs sharing a resctrl resource
+ * struct rdt_domain_hdr - common header for different domain types
  * @list:		all instances of this resource
  * @id:			unique id for this instance
  * @cpu_mask:		which CPUs share this resource
+ */
+struct rdt_domain_hdr {
+	struct list_head		list;
+	int				id;
+	struct cpumask			cpu_mask;
+};
+
+/**
+ * struct rdt_domain - group of CPUs sharing a resctrl resource
+ * @hdr:		common header for different domain types
  * @rmid_busy_llc:	bitmap of which limbo RMIDs are above threshold
  * @mbm_total:		saved state for MBM total bandwidth
  * @mbm_local:		saved state for MBM local bandwidth
@@ -71,9 +81,7 @@ struct resctrl_staged_config {
  *			by closid
  */
 struct rdt_domain {
-	struct list_head		list;
-	int				id;
-	struct cpumask			cpu_mask;
+	struct rdt_domain_hdr		hdr;
 	unsigned long			*rmid_busy_llc;
 	struct mbm_state		*mbm_total;
 	struct mbm_state		*mbm_local;
diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
index 9f8d87abd751..c0659e84c9e3 100644
--- a/arch/x86/kernel/cpu/resctrl/core.c
+++ b/arch/x86/kernel/cpu/resctrl/core.c
@@ -356,9 +356,9 @@ struct rdt_domain *get_domain_from_cpu(int cpu, struct rdt_resource *r)
 {
 	struct rdt_domain *d;
 
-	list_for_each_entry(d, &r->domains, list) {
+	list_for_each_entry(d, &r->domains, hdr.list) {
 		/* Find the domain that contains this CPU */
-		if (cpumask_test_cpu(cpu, &d->cpu_mask))
+		if (cpumask_test_cpu(cpu, &d->hdr.cpu_mask))
 			return d;
 	}
 
@@ -402,12 +402,12 @@ struct rdt_domain *rdt_find_domain(struct rdt_resource *r, int id,
 	struct list_head *l;
 
 	list_for_each(l, &r->domains) {
-		d = list_entry(l, struct rdt_domain, list);
+		d = list_entry(l, struct rdt_domain, hdr.list);
 		/* When id is found, return its domain. */
-		if (id == d->id)
+		if (id == d->hdr.id)
 			return d;
 		/* Stop searching when finding id's position in sorted list. */
-		if (id < d->id)
+		if (id < d->hdr.id)
 			break;
 	}
 
@@ -530,7 +530,7 @@ static void domain_add_cpu(int cpu, struct rdt_resource *r)
 
 	d = rdt_find_domain(r, id, &add_pos);
 	if (d) {
-		cpumask_set_cpu(cpu, &d->cpu_mask);
+		cpumask_set_cpu(cpu, &d->hdr.cpu_mask);
 		if (r->cache.arch_has_per_cpu_cfg)
 			rdt_domain_reconfigure_cdp(r);
 		return;
@@ -541,8 +541,8 @@ static void domain_add_cpu(int cpu, struct rdt_resource *r)
 		return;
 
 	d = &hw_dom->d_resctrl;
-	d->id = id;
-	cpumask_set_cpu(cpu, &d->cpu_mask);
+	d->hdr.id = id;
+	cpumask_set_cpu(cpu, &d->hdr.cpu_mask);
 
 	rdt_domain_reconfigure_cdp(r);
 
@@ -556,11 +556,11 @@ static void domain_add_cpu(int cpu, struct rdt_resource *r)
 		return;
 	}
 
-	list_add_tail(&d->list, add_pos);
+	list_add_tail(&d->hdr.list, add_pos);
 
 	err = resctrl_online_domain(r, d);
 	if (err) {
-		list_del(&d->list);
+		list_del(&d->hdr.list);
 		domain_free(hw_dom);
 	}
 }
@@ -581,10 +581,10 @@ static void domain_remove_cpu(int cpu, struct rdt_resource *r)
 	}
 	hw_dom = resctrl_to_arch_dom(d);
 
-	cpumask_clear_cpu(cpu, &d->cpu_mask);
-	if (cpumask_empty(&d->cpu_mask)) {
+	cpumask_clear_cpu(cpu, &d->hdr.cpu_mask);
+	if (cpumask_empty(&d->hdr.cpu_mask)) {
 		resctrl_offline_domain(r, d);
-		list_del(&d->list);
+		list_del(&d->hdr.list);
 
 		/*
 		 * rdt_domain "d" is going to be freed below, so clear
diff --git a/arch/x86/kernel/cpu/resctrl/ctrlmondata.c b/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
index 3f8891d57fac..23f8258d36a8 100644
--- a/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
+++ b/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
@@ -67,7 +67,7 @@ int parse_bw(struct rdt_parse_data *data, struct resctrl_schema *s,
 
 	cfg = &d->staged_config[s->conf_type];
 	if (cfg->have_new_ctrl) {
-		rdt_last_cmd_printf("Duplicate domain %d\n", d->id);
+		rdt_last_cmd_printf("Duplicate domain %d\n", d->hdr.id);
 		return -EINVAL;
 	}
 
@@ -146,7 +146,7 @@ int parse_cbm(struct rdt_parse_data *data, struct resctrl_schema *s,
 
 	cfg = &d->staged_config[s->conf_type];
 	if (cfg->have_new_ctrl) {
-		rdt_last_cmd_printf("Duplicate domain %d\n", d->id);
+		rdt_last_cmd_printf("Duplicate domain %d\n", d->hdr.id);
 		return -EINVAL;
 	}
 
@@ -226,8 +226,8 @@ static int parse_line(char *line, struct resctrl_schema *s,
 		return -EINVAL;
 	}
 	dom = strim(dom);
-	list_for_each_entry(d, &r->domains, list) {
-		if (d->id == dom_id) {
+	list_for_each_entry(d, &r->domains, hdr.list) {
+		if (d->hdr.id == dom_id) {
 			data.buf = dom;
 			data.rdtgrp = rdtgrp;
 			if (r->parse_ctrlval(&data, s, d))
@@ -274,7 +274,7 @@ static bool apply_config(struct rdt_hw_domain *hw_dom,
 	struct rdt_domain *dom = &hw_dom->d_resctrl;
 
 	if (cfg->new_ctrl != hw_dom->ctrl_val[idx]) {
-		cpumask_set_cpu(cpumask_any(&dom->cpu_mask), cpu_mask);
+		cpumask_set_cpu(cpumask_any(&dom->hdr.cpu_mask), cpu_mask);
 		hw_dom->ctrl_val[idx] = cfg->new_ctrl;
 
 		return true;
@@ -291,7 +291,7 @@ int resctrl_arch_update_one(struct rdt_resource *r, struct rdt_domain *d,
 	u32 idx = get_config_index(closid, t);
 	struct msr_param msr_param;
 
-	if (!cpumask_test_cpu(smp_processor_id(), &d->cpu_mask))
+	if (!cpumask_test_cpu(smp_processor_id(), &d->hdr.cpu_mask))
 		return -EINVAL;
 
 	hw_dom->ctrl_val[idx] = cfg_val;
@@ -318,7 +318,7 @@ int resctrl_arch_update_domains(struct rdt_resource *r, u32 closid)
 		return -ENOMEM;
 
 	msr_param.res = NULL;
-	list_for_each_entry(d, &r->domains, list) {
+	list_for_each_entry(d, &r->domains, hdr.list) {
 		hw_dom = resctrl_to_arch_dom(d);
 		for (t = 0; t < CDP_NUM_TYPES; t++) {
 			cfg = &hw_dom->d_resctrl.staged_config[t];
@@ -466,7 +466,7 @@ static void show_doms(struct seq_file *s, struct resctrl_schema *schema, int clo
 	u32 ctrl_val;
 
 	seq_printf(s, "%*s:", max_name_width, schema->name);
-	list_for_each_entry(dom, &r->domains, list) {
+	list_for_each_entry(dom, &r->domains, hdr.list) {
 		if (sep)
 			seq_puts(s, ";");
 
@@ -476,7 +476,7 @@ static void show_doms(struct seq_file *s, struct resctrl_schema *schema, int clo
 			ctrl_val = resctrl_arch_get_config(r, dom, closid,
 							   schema->conf_type);
 
-		seq_printf(s, r->format_str, dom->id, max_data_width,
+		seq_printf(s, r->format_str, dom->hdr.id, max_data_width,
 			   ctrl_val);
 		sep = true;
 	}
@@ -505,7 +505,7 @@ int rdtgroup_schemata_show(struct kernfs_open_file *of,
 			} else {
 				seq_printf(s, "%s:%d=%x\n",
 					   rdtgrp->plr->s->res->name,
-					   rdtgrp->plr->d->id,
+					   rdtgrp->plr->d->hdr.id,
 					   rdtgrp->plr->cbm);
 			}
 		} else {
@@ -536,7 +536,7 @@ void mon_event_read(struct rmid_read *rr, struct rdt_resource *r,
 	rr->val = 0;
 	rr->first = first;
 
-	smp_call_function_any(&d->cpu_mask, mon_event_count, rr, 1);
+	smp_call_function_any(&d->hdr.cpu_mask, mon_event_count, rr, 1);
 }
 
 int rdtgroup_mondata_show(struct seq_file *m, void *arg)
diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c
index f136ac046851..dd0ea1bc0092 100644
--- a/arch/x86/kernel/cpu/resctrl/monitor.c
+++ b/arch/x86/kernel/cpu/resctrl/monitor.c
@@ -238,7 +238,7 @@ int resctrl_arch_rmid_read(struct rdt_resource *r, struct rdt_domain *d,
 	u64 msr_val, chunks;
 	int ret;
 
-	if (!cpumask_test_cpu(smp_processor_id(), &d->cpu_mask))
+	if (!cpumask_test_cpu(smp_processor_id(), &d->hdr.cpu_mask))
 		return -EINVAL;
 
 	ret = __rmid_read(rmid, eventid, &msr_val);
@@ -340,8 +340,8 @@ static void add_rmid_to_limbo(struct rmid_entry *entry)
 
 	entry->busy = 0;
 	cpu = get_cpu();
-	list_for_each_entry(d, &r->domains, list) {
-		if (cpumask_test_cpu(cpu, &d->cpu_mask)) {
+	list_for_each_entry(d, &r->domains, hdr.list) {
+		if (cpumask_test_cpu(cpu, &d->hdr.cpu_mask)) {
 			err = resctrl_arch_rmid_read(r, d, entry->rmid,
 						     QOS_L3_OCCUP_EVENT_ID,
 						     &val);
@@ -661,7 +661,7 @@ void cqm_setup_limbo_handler(struct rdt_domain *dom, unsigned long delay_ms)
 	unsigned long delay = msecs_to_jiffies(delay_ms);
 	int cpu;
 
-	cpu = cpumask_any(&dom->cpu_mask);
+	cpu = cpumask_any(&dom->hdr.cpu_mask);
 	dom->cqm_work_cpu = cpu;
 
 	schedule_delayed_work_on(cpu, &dom->cqm_limbo, delay);
@@ -708,7 +708,7 @@ void mbm_setup_overflow_handler(struct rdt_domain *dom, unsigned long delay_ms)
 
 	if (!static_branch_likely(&rdt_mon_enable_key))
 		return;
-	cpu = cpumask_any(&dom->cpu_mask);
+	cpu = cpumask_any(&dom->hdr.cpu_mask);
 	dom->mbm_work_cpu = cpu;
 	schedule_delayed_work_on(cpu, &dom->mbm_over, delay);
 }
diff --git a/arch/x86/kernel/cpu/resctrl/pseudo_lock.c b/arch/x86/kernel/cpu/resctrl/pseudo_lock.c
index 2a682da9f43a..fcbd99e2eb66 100644
--- a/arch/x86/kernel/cpu/resctrl/pseudo_lock.c
+++ b/arch/x86/kernel/cpu/resctrl/pseudo_lock.c
@@ -221,7 +221,7 @@ static int pseudo_lock_cstates_constrain(struct pseudo_lock_region *plr)
 	int cpu;
 	int ret;
 
-	for_each_cpu(cpu, &plr->d->cpu_mask) {
+	for_each_cpu(cpu, &plr->d->hdr.cpu_mask) {
 		pm_req = kzalloc(sizeof(*pm_req), GFP_KERNEL);
 		if (!pm_req) {
 			rdt_last_cmd_puts("Failure to allocate memory for PM QoS\n");
@@ -301,7 +301,7 @@ static int pseudo_lock_region_init(struct pseudo_lock_region *plr)
 		return -ENODEV;
 
 	/* Pick the first cpu we find that is associated with the cache. */
-	plr->cpu = cpumask_first(&plr->d->cpu_mask);
+	plr->cpu = cpumask_first(&plr->d->hdr.cpu_mask);
 
 	if (!cpu_online(plr->cpu)) {
 		rdt_last_cmd_printf("CPU %u associated with cache not online\n",
@@ -856,10 +856,10 @@ bool rdtgroup_pseudo_locked_in_hierarchy(struct rdt_domain *d)
 	 * associated with them.
 	 */
 	for_each_alloc_capable_rdt_resource(r) {
-		list_for_each_entry(d_i, &r->domains, list) {
+		list_for_each_entry(d_i, &r->domains, hdr.list) {
 			if (d_i->plr)
 				cpumask_or(cpu_with_psl, cpu_with_psl,
-					   &d_i->cpu_mask);
+					   &d_i->hdr.cpu_mask);
 		}
 	}
 
@@ -867,7 +867,7 @@ bool rdtgroup_pseudo_locked_in_hierarchy(struct rdt_domain *d)
 	 * Next test if new pseudo-locked region would intersect with
 	 * existing region.
 	 */
-	if (cpumask_intersects(&d->cpu_mask, cpu_with_psl))
+	if (cpumask_intersects(&d->hdr.cpu_mask, cpu_with_psl))
 		ret = true;
 
 	free_cpumask_var(cpu_with_psl);
@@ -1199,7 +1199,7 @@ static int pseudo_lock_measure_cycles(struct rdtgroup *rdtgrp, int sel)
 	}
 
 	plr->thread_done = 0;
-	cpu = cpumask_first(&plr->d->cpu_mask);
+	cpu = cpumask_first(&plr->d->hdr.cpu_mask);
 	if (!cpu_online(cpu)) {
 		ret = -ENODEV;
 		goto out;
@@ -1529,7 +1529,7 @@ static int pseudo_lock_dev_mmap(struct file *filp, struct vm_area_struct *vma)
 	 * may be scheduled elsewhere and invalidate entries in the
 	 * pseudo-locked region.
 	 */
-	if (!cpumask_subset(current->cpus_ptr, &plr->d->cpu_mask)) {
+	if (!cpumask_subset(current->cpus_ptr, &plr->d->hdr.cpu_mask)) {
 		mutex_unlock(&rdtgroup_mutex);
 		return -EINVAL;
 	}
diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
index c44be64d65ec..04d32602ac33 100644
--- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
+++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
@@ -91,7 +91,7 @@ void rdt_staged_configs_clear(void)
 	lockdep_assert_held(&rdtgroup_mutex);
 
 	for_each_alloc_capable_rdt_resource(r) {
-		list_for_each_entry(dom, &r->domains, list)
+		list_for_each_entry(dom, &r->domains, hdr.list)
 			memset(dom->staged_config, 0, sizeof(dom->staged_config));
 	}
 }
@@ -295,7 +295,7 @@ static int rdtgroup_cpus_show(struct kernfs_open_file *of,
 				rdt_last_cmd_puts("Cache domain offline\n");
 				ret = -ENODEV;
 			} else {
-				mask = &rdtgrp->plr->d->cpu_mask;
+				mask = &rdtgrp->plr->d->hdr.cpu_mask;
 				seq_printf(s, is_cpu_list(of) ?
 					   "%*pbl\n" : "%*pb\n",
 					   cpumask_pr_args(mask));
@@ -984,12 +984,12 @@ static int rdt_bit_usage_show(struct kernfs_open_file *of,
 
 	mutex_lock(&rdtgroup_mutex);
 	hw_shareable = r->cache.shareable_bits;
-	list_for_each_entry(dom, &r->domains, list) {
+	list_for_each_entry(dom, &r->domains, hdr.list) {
 		if (sep)
 			seq_putc(seq, ';');
 		sw_shareable = 0;
 		exclusive = 0;
-		seq_printf(seq, "%d=", dom->id);
+		seq_printf(seq, "%d=", dom->hdr.id);
 		for (i = 0; i < closids_supported(); i++) {
 			if (!closid_allocated(i))
 				continue;
@@ -1302,7 +1302,7 @@ static bool rdtgroup_mode_test_exclusive(struct rdtgroup *rdtgrp)
 		if (r->rid == RDT_RESOURCE_MBA || r->rid == RDT_RESOURCE_SMBA)
 			continue;
 		has_cache = true;
-		list_for_each_entry(d, &r->domains, list) {
+		list_for_each_entry(d, &r->domains, hdr.list) {
 			ctrl = resctrl_arch_get_config(r, d, closid,
 						       s->conf_type);
 			if (rdtgroup_cbm_overlaps(s, d, ctrl, closid, false)) {
@@ -1417,7 +1417,7 @@ unsigned int rdtgroup_cbm_to_size(struct rdt_resource *r,
 		return size;
 
 	num_b = bitmap_weight(&cbm, r->cache.cbm_len);
-	ci = get_cpu_cacheinfo(cpumask_any(&d->cpu_mask));
+	ci = get_cpu_cacheinfo(cpumask_any(&d->hdr.cpu_mask));
 	for (i = 0; i < ci->num_leaves; i++) {
 		if (ci->info_list[i].level == r->scope) {
 			size = ci->info_list[i].size / r->cache.cbm_len * num_b;
@@ -1465,7 +1465,7 @@ static int rdtgroup_size_show(struct kernfs_open_file *of,
 			size = rdtgroup_cbm_to_size(rdtgrp->plr->s->res,
 						    rdtgrp->plr->d,
 						    rdtgrp->plr->cbm);
-			seq_printf(s, "%d=%u\n", rdtgrp->plr->d->id, size);
+			seq_printf(s, "%d=%u\n", rdtgrp->plr->d->hdr.id, size);
 		}
 		goto out;
 	}
@@ -1477,7 +1477,7 @@ static int rdtgroup_size_show(struct kernfs_open_file *of,
 		type = schema->conf_type;
 		sep = false;
 		seq_printf(s, "%*s:", max_name_width, schema->name);
-		list_for_each_entry(d, &r->domains, list) {
+		list_for_each_entry(d, &r->domains, hdr.list) {
 			if (sep)
 				seq_putc(s, ';');
 			if (rdtgrp->mode == RDT_MODE_PSEUDO_LOCKSETUP) {
@@ -1495,7 +1495,7 @@ static int rdtgroup_size_show(struct kernfs_open_file *of,
 				else
 					size = rdtgroup_cbm_to_size(r, d, ctrl);
 			}
-			seq_printf(s, "%d=%u", d->id, size);
+			seq_printf(s, "%d=%u", d->hdr.id, size);
 			sep = true;
 		}
 		seq_putc(s, '\n');
@@ -1555,7 +1555,7 @@ static void mon_event_config_read(void *info)
 
 static void mondata_config_read(struct rdt_domain *d, struct mon_config_info *mon_info)
 {
-	smp_call_function_any(&d->cpu_mask, mon_event_config_read, mon_info, 1);
+	smp_call_function_any(&d->hdr.cpu_mask, mon_event_config_read, mon_info, 1);
 }
 
 static int mbm_config_show(struct seq_file *s, struct rdt_resource *r, u32 evtid)
@@ -1566,7 +1566,7 @@ static int mbm_config_show(struct seq_file *s, struct rdt_resource *r, u32 evtid
 
 	mutex_lock(&rdtgroup_mutex);
 
-	list_for_each_entry(dom, &r->domains, list) {
+	list_for_each_entry(dom, &r->domains, hdr.list) {
 		if (sep)
 			seq_puts(s, ";");
 
@@ -1574,7 +1574,7 @@ static int mbm_config_show(struct seq_file *s, struct rdt_resource *r, u32 evtid
 		mon_info.evtid = evtid;
 		mondata_config_read(dom, &mon_info);
 
-		seq_printf(s, "%d=0x%02x", dom->id, mon_info.mon_config);
+		seq_printf(s, "%d=0x%02x", dom->hdr.id, mon_info.mon_config);
 		sep = true;
 	}
 	seq_puts(s, "\n");
@@ -1646,7 +1646,7 @@ static int mbm_config_write_domain(struct rdt_resource *r,
 	 * are scoped at the domain level. Writing any of these MSRs
 	 * on one CPU is observed by all the CPUs in the domain.
 	 */
-	smp_call_function_any(&d->cpu_mask, mon_event_config_write,
+	smp_call_function_any(&d->hdr.cpu_mask, mon_event_config_write,
 			      &mon_info, 1);
 
 	/*
@@ -1689,8 +1689,8 @@ static int mon_config_write(struct rdt_resource *r, char *tok, u32 evtid)
 		return -EINVAL;
 	}
 
-	list_for_each_entry(d, &r->domains, list) {
-		if (d->id == dom_id) {
+	list_for_each_entry(d, &r->domains, hdr.list) {
+		if (d->hdr.id == dom_id) {
 			ret = mbm_config_write_domain(r, d, evtid, val);
 			if (ret)
 				return -EINVAL;
@@ -2232,14 +2232,14 @@ static int set_cache_qos_cfg(int level, bool enable)
 		return -ENOMEM;
 
 	r_l = &rdt_resources_all[level].r_resctrl;
-	list_for_each_entry(d, &r_l->domains, list) {
+	list_for_each_entry(d, &r_l->domains, hdr.list) {
 		if (r_l->cache.arch_has_per_cpu_cfg)
 			/* Pick all the CPUs in the domain instance */
-			for_each_cpu(cpu, &d->cpu_mask)
+			for_each_cpu(cpu, &d->hdr.cpu_mask)
 				cpumask_set_cpu(cpu, cpu_mask);
 		else
 			/* Pick one CPU from each domain instance to update MSR */
-			cpumask_set_cpu(cpumask_any(&d->cpu_mask), cpu_mask);
+			cpumask_set_cpu(cpumask_any(&d->hdr.cpu_mask), cpu_mask);
 	}
 
 	/* Update QOS_CFG MSR on all the CPUs in cpu_mask */
@@ -2268,7 +2268,7 @@ void rdt_domain_reconfigure_cdp(struct rdt_resource *r)
 static int mba_sc_domain_allocate(struct rdt_resource *r, struct rdt_domain *d)
 {
 	u32 num_closid = resctrl_arch_get_num_closid(r);
-	int cpu = cpumask_any(&d->cpu_mask);
+	int cpu = cpumask_any(&d->hdr.cpu_mask);
 	int i;
 
 	d->mbps_val = kcalloc_node(num_closid, sizeof(*d->mbps_val),
@@ -2317,7 +2317,7 @@ static int set_mba_sc(bool mba_sc)
 
 	r->membw.mba_sc = mba_sc;
 
-	list_for_each_entry(d, &r->domains, list) {
+	list_for_each_entry(d, &r->domains, hdr.list) {
 		for (i = 0; i < num_closid; i++)
 			d->mbps_val[i] = MBA_MAX_MBPS;
 	}
@@ -2653,7 +2653,7 @@ static int rdt_get_tree(struct fs_context *fc)
 
 	if (is_mbm_enabled()) {
 		r = &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl;
-		list_for_each_entry(dom, &r->domains, list)
+		list_for_each_entry(dom, &r->domains, hdr.list)
 			mbm_setup_overflow_handler(dom, MBM_OVERFLOW_INTERVAL);
 	}
 
@@ -2780,9 +2780,9 @@ static int reset_all_ctrls(struct rdt_resource *r)
 	 * CBMs in all domains to the maximum mask value. Pick one CPU
 	 * from each domain to update the MSRs below.
 	 */
-	list_for_each_entry(d, &r->domains, list) {
+	list_for_each_entry(d, &r->domains, hdr.list) {
 		hw_dom = resctrl_to_arch_dom(d);
-		cpumask_set_cpu(cpumask_any(&d->cpu_mask), cpu_mask);
+		cpumask_set_cpu(cpumask_any(&d->hdr.cpu_mask), cpu_mask);
 
 		for (i = 0; i < hw_res->num_closid; i++)
 			hw_dom->ctrl_val[i] = r->default_ctrl;
@@ -2986,7 +2986,7 @@ static int mkdir_mondata_subdir(struct kernfs_node *parent_kn,
 	char name[32];
 	int ret;
 
-	sprintf(name, "mon_%s_%02d", r->name, d->id);
+	sprintf(name, "mon_%s_%02d", r->name, d->hdr.id);
 	/* create the directory */
 	kn = kernfs_create_dir(parent_kn, name, parent_kn->mode, prgrp);
 	if (IS_ERR(kn))
@@ -3002,7 +3002,7 @@ static int mkdir_mondata_subdir(struct kernfs_node *parent_kn,
 	}
 
 	priv.u.rid = r->rid;
-	priv.u.domid = d->id;
+	priv.u.domid = d->hdr.id;
 	list_for_each_entry(mevt, &r->evt_list, list) {
 		priv.u.evtid = mevt->evtid;
 		ret = mon_addfile(kn, mevt->name, priv.priv);
@@ -3050,7 +3050,7 @@ static int mkdir_mondata_subdir_alldom(struct kernfs_node *parent_kn,
 	struct rdt_domain *dom;
 	int ret;
 
-	list_for_each_entry(dom, &r->domains, list) {
+	list_for_each_entry(dom, &r->domains, hdr.list) {
 		ret = mkdir_mondata_subdir(parent_kn, dom, r, prgrp);
 		if (ret)
 			return ret;
@@ -3209,7 +3209,7 @@ static int __init_one_rdt_domain(struct rdt_domain *d, struct resctrl_schema *s,
 	 */
 	tmp_cbm = cfg->new_ctrl;
 	if (bitmap_weight(&tmp_cbm, r->cache.cbm_len) < r->cache.min_cbm_bits) {
-		rdt_last_cmd_printf("No space on %s:%d\n", s->name, d->id);
+		rdt_last_cmd_printf("No space on %s:%d\n", s->name, d->hdr.id);
 		return -ENOSPC;
 	}
 	cfg->have_new_ctrl = true;
@@ -3232,7 +3232,7 @@ static int rdtgroup_init_cat(struct resctrl_schema *s, u32 closid)
 	struct rdt_domain *d;
 	int ret;
 
-	list_for_each_entry(d, &s->res->domains, list) {
+	list_for_each_entry(d, &s->res->domains, hdr.list) {
 		ret = __init_one_rdt_domain(d, s, closid);
 		if (ret < 0)
 			return ret;
@@ -3247,7 +3247,7 @@ static void rdtgroup_init_mba(struct rdt_resource *r, u32 closid)
 	struct resctrl_staged_config *cfg;
 	struct rdt_domain *d;
 
-	list_for_each_entry(d, &r->domains, list) {
+	list_for_each_entry(d, &r->domains, hdr.list) {
 		if (is_mba_sc(r)) {
 			d->mbps_val[closid] = MBA_MAX_MBPS;
 			continue;
@@ -3864,7 +3864,7 @@ void resctrl_offline_domain(struct rdt_resource *r, struct rdt_domain *d)
 	 * per domain monitor data directories.
 	 */
 	if (static_branch_unlikely(&rdt_mon_enable_key))
-		rmdir_mondata_subdir_allrdtgrp(r, d->id);
+		rmdir_mondata_subdir_allrdtgrp(r, d->hdr.id);
 
 	if (is_mbm_enabled())
 		cancel_delayed_work(&d->mbm_over);
-- 
2.41.0


^ permalink raw reply related	[flat|nested] 331+ messages in thread

* [PATCH v11 3/8] x86/resctrl: Prepare for different scope for control/monitor operations
  2023-11-09 23:09             ` [PATCH v11 0/8] Add support for Sub-NUMA cluster (SNC) systems Tony Luck
  2023-11-09 23:09               ` [PATCH v11 1/8] x86/resctrl: Prepare for new domain scope Tony Luck
  2023-11-09 23:09               ` [PATCH v11 2/8] x86/resctrl: Prepare to split rdt_domain structure Tony Luck
@ 2023-11-09 23:09               ` Tony Luck
  2023-11-20 22:19                 ` Reinette Chatre
  2023-11-09 23:09               ` [PATCH v11 4/8] x86/resctrl: Split the rdt_domain and rdt_hw_domain structures Tony Luck
                                 ` (6 subsequent siblings)
  9 siblings, 1 reply; 331+ messages in thread
From: Tony Luck @ 2023-11-09 23:09 UTC (permalink / raw)
  To: Fenghua Yu, Reinette Chatre, Peter Newman, Jonathan Corbet,
	Shuah Khan, x86
  Cc: Shaopeng Tan, James Morse, Jamie Iles, Babu Moger, Randy Dunlap,
	linux-kernel, linux-doc, patches, Tony Luck

Resctrl assumes that control and monitor operations on a resource are
performed at the same scope.

Prepare for systems that use different scope (specifically Intel needs
to split the RDT_RESOURCE_L3 resource to use L3 scope for cache control
and NODE scope for cache occupancy and memory bandwidth monitoring).

Create separate domain lists for control and monitor operations.

Note that errors during initialization of either control or monitor
functions on a domain would previously result in that domain being
excluded from both control and monitor operations. Now that the domains
are allocated independently, it is no longer required to disable both
control and monitor operations if either one fails.
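
Sketched below (again plain user-space C, not the code in this patch) is
the shape of that decoupling: the two kinds of domain are brought up
independently, so a setup failure on one side no longer takes the other
offline. All example_* names are invented; only the idea of gating on
alloc_capable and mon_capable mirrors the patch.

#include <stdbool.h>
#include <stdio.h>

struct example_resource {
	bool alloc_capable;	/* has control (allocation) support */
	bool mon_capable;	/* has monitoring support */
};

/* Each helper manages only its own domain list and may fail independently. */
static int example_add_cpu_ctrl(int cpu, struct example_resource *r)
{
	printf("cpu %d joined a control domain\n", cpu);
	return 0;
}

static int example_add_cpu_mon(int cpu, struct example_resource *r)
{
	printf("cpu %d joined a monitor domain\n", cpu);
	return 0;
}

static void example_add_cpu(int cpu, struct example_resource *r)
{
	if (r->alloc_capable)
		example_add_cpu_ctrl(cpu, r);
	if (r->mon_capable)
		example_add_cpu_mon(cpu, r);
}

int main(void)
{
	struct example_resource l3 = { .alloc_capable = true, .mon_capable = true };

	example_add_cpu(0, &l3);
	return 0;
}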

Signed-off-by: Tony Luck <tony.luck@intel.com>
---
Changes since v10:
Reinette: Took your rewritten comment for rdt_find_domain()
Reinette: Fixed error handling in domain_add_cpu_mon(),
	  domain_remove_cpu_{ctrl,mon}()
Tony: s/domain_add_cpu - Add a cpu/domain_add_cpu - Add a CPU/

 include/linux/resctrl.h                   |  25 ++-
 arch/x86/kernel/cpu/resctrl/internal.h    |   6 +-
 arch/x86/kernel/cpu/resctrl/core.c        | 202 ++++++++++++++++------
 arch/x86/kernel/cpu/resctrl/ctrlmondata.c |  12 +-
 arch/x86/kernel/cpu/resctrl/monitor.c     |   4 +-
 arch/x86/kernel/cpu/resctrl/pseudo_lock.c |   4 +-
 arch/x86/kernel/cpu/resctrl/rdtgroup.c    |  55 +++---
 7 files changed, 213 insertions(+), 95 deletions(-)

diff --git a/include/linux/resctrl.h b/include/linux/resctrl.h
index c4067150a6b7..35e700edc6e6 100644
--- a/include/linux/resctrl.h
+++ b/include/linux/resctrl.h
@@ -52,15 +52,22 @@ struct resctrl_staged_config {
 	bool			have_new_ctrl;
 };
 
+enum resctrl_domain_type {
+	RESCTRL_CTRL_DOMAIN,
+	RESCTRL_MON_DOMAIN,
+};
+
 /**
  * struct rdt_domain_hdr - common header for different domain types
  * @list:		all instances of this resource
  * @id:			unique id for this instance
+ * @type:		type of this instance
  * @cpu_mask:		which CPUs share this resource
  */
 struct rdt_domain_hdr {
 	struct list_head		list;
 	int				id;
+	enum resctrl_domain_type	type;
 	struct cpumask			cpu_mask;
 };
 
@@ -163,10 +170,12 @@ enum resctrl_scope {
  * @alloc_capable:	Is allocation available on this machine
  * @mon_capable:	Is monitor feature available on this machine
  * @num_rmid:		Number of RMIDs available
- * @scope:		Scope of this resource
+ * @ctrl_scope:		Scope of this resource for control functions
+ * @mon_scope:		Scope of this resource for monitor functions
  * @cache:		Cache allocation related data
  * @membw:		If the component has bandwidth controls, their properties.
- * @domains:		All domains for this resource
+ * @ctrl_domains:	Control domains for this resource
+ * @mon_domains:	Monitor domains for this resource
  * @name:		Name to use in "schemata" file.
  * @data_width:		Character width of data when displaying
  * @default_ctrl:	Specifies default cache cbm or memory B/W percent.
@@ -181,10 +190,12 @@ struct rdt_resource {
 	bool			alloc_capable;
 	bool			mon_capable;
 	int			num_rmid;
-	enum resctrl_scope	scope;
+	enum resctrl_scope	ctrl_scope;
+	enum resctrl_scope	mon_scope;
 	struct resctrl_cache	cache;
 	struct resctrl_membw	membw;
-	struct list_head	domains;
+	struct list_head	ctrl_domains;
+	struct list_head	mon_domains;
 	char			*name;
 	int			data_width;
 	u32			default_ctrl;
@@ -230,8 +241,10 @@ int resctrl_arch_update_one(struct rdt_resource *r, struct rdt_domain *d,
 
 u32 resctrl_arch_get_config(struct rdt_resource *r, struct rdt_domain *d,
 			    u32 closid, enum resctrl_conf_type type);
-int resctrl_online_domain(struct rdt_resource *r, struct rdt_domain *d);
-void resctrl_offline_domain(struct rdt_resource *r, struct rdt_domain *d);
+int resctrl_online_ctrl_domain(struct rdt_resource *r, struct rdt_domain *d);
+int resctrl_online_mon_domain(struct rdt_resource *r, struct rdt_domain *d);
+void resctrl_offline_ctrl_domain(struct rdt_resource *r, struct rdt_domain *d);
+void resctrl_offline_mon_domain(struct rdt_resource *r, struct rdt_domain *d);
 
 /**
  * resctrl_arch_rmid_read() - Read the eventid counter corresponding to rmid
diff --git a/arch/x86/kernel/cpu/resctrl/internal.h b/arch/x86/kernel/cpu/resctrl/internal.h
index a4f1aa15f0a2..24bf9d7989a9 100644
--- a/arch/x86/kernel/cpu/resctrl/internal.h
+++ b/arch/x86/kernel/cpu/resctrl/internal.h
@@ -520,8 +520,8 @@ void rdtgroup_kn_unlock(struct kernfs_node *kn);
 int rdtgroup_kn_mode_restrict(struct rdtgroup *r, const char *name);
 int rdtgroup_kn_mode_restore(struct rdtgroup *r, const char *name,
 			     umode_t mask);
-struct rdt_domain *rdt_find_domain(struct rdt_resource *r, int id,
-				   struct list_head **pos);
+struct rdt_domain_hdr *rdt_find_domain(struct list_head *h, int id,
+				       struct list_head **pos);
 ssize_t rdtgroup_schemata_write(struct kernfs_open_file *of,
 				char *buf, size_t nbytes, loff_t off);
 int rdtgroup_schemata_show(struct kernfs_open_file *of,
@@ -540,7 +540,7 @@ int rdt_pseudo_lock_init(void);
 void rdt_pseudo_lock_release(void);
 int rdtgroup_pseudo_lock_create(struct rdtgroup *rdtgrp);
 void rdtgroup_pseudo_lock_remove(struct rdtgroup *rdtgrp);
-struct rdt_domain *get_domain_from_cpu(int cpu, struct rdt_resource *r);
+struct rdt_domain *get_ctrl_domain_from_cpu(int cpu, struct rdt_resource *r);
 int closids_supported(void);
 void closid_free(int closid);
 int alloc_rmid(void);
diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
index c0659e84c9e3..d5aad434b2fb 100644
--- a/arch/x86/kernel/cpu/resctrl/core.c
+++ b/arch/x86/kernel/cpu/resctrl/core.c
@@ -57,7 +57,8 @@ static void
 mba_wrmsr_amd(struct rdt_domain *d, struct msr_param *m,
 	      struct rdt_resource *r);
 
-#define domain_init(id) LIST_HEAD_INIT(rdt_resources_all[id].r_resctrl.domains)
+#define ctrl_domain_init(id) LIST_HEAD_INIT(rdt_resources_all[id].r_resctrl.ctrl_domains)
+#define mon_domain_init(id) LIST_HEAD_INIT(rdt_resources_all[id].r_resctrl.mon_domains)
 
 struct rdt_hw_resource rdt_resources_all[] = {
 	[RDT_RESOURCE_L3] =
@@ -65,8 +66,10 @@ struct rdt_hw_resource rdt_resources_all[] = {
 		.r_resctrl = {
 			.rid			= RDT_RESOURCE_L3,
 			.name			= "L3",
-			.scope			= RESCTRL_L3_CACHE,
-			.domains		= domain_init(RDT_RESOURCE_L3),
+			.ctrl_scope		= RESCTRL_L3_CACHE,
+			.mon_scope		= RESCTRL_L3_CACHE,
+			.ctrl_domains		= ctrl_domain_init(RDT_RESOURCE_L3),
+			.mon_domains		= mon_domain_init(RDT_RESOURCE_L3),
 			.parse_ctrlval		= parse_cbm,
 			.format_str		= "%d=%0*x",
 			.fflags			= RFTYPE_RES_CACHE,
@@ -79,8 +82,8 @@ struct rdt_hw_resource rdt_resources_all[] = {
 		.r_resctrl = {
 			.rid			= RDT_RESOURCE_L2,
 			.name			= "L2",
-			.scope			= RESCTRL_L2_CACHE,
-			.domains		= domain_init(RDT_RESOURCE_L2),
+			.ctrl_scope		= RESCTRL_L2_CACHE,
+			.ctrl_domains		= ctrl_domain_init(RDT_RESOURCE_L2),
 			.parse_ctrlval		= parse_cbm,
 			.format_str		= "%d=%0*x",
 			.fflags			= RFTYPE_RES_CACHE,
@@ -93,8 +96,8 @@ struct rdt_hw_resource rdt_resources_all[] = {
 		.r_resctrl = {
 			.rid			= RDT_RESOURCE_MBA,
 			.name			= "MB",
-			.scope			= RESCTRL_L3_CACHE,
-			.domains		= domain_init(RDT_RESOURCE_MBA),
+			.ctrl_scope		= RESCTRL_L3_CACHE,
+			.ctrl_domains		= ctrl_domain_init(RDT_RESOURCE_MBA),
 			.parse_ctrlval		= parse_bw,
 			.format_str		= "%d=%*u",
 			.fflags			= RFTYPE_RES_MB,
@@ -105,8 +108,8 @@ struct rdt_hw_resource rdt_resources_all[] = {
 		.r_resctrl = {
 			.rid			= RDT_RESOURCE_SMBA,
 			.name			= "SMBA",
-			.scope			= RESCTRL_L3_CACHE,
-			.domains		= domain_init(RDT_RESOURCE_SMBA),
+			.ctrl_scope		= RESCTRL_L3_CACHE,
+			.ctrl_domains		= ctrl_domain_init(RDT_RESOURCE_SMBA),
 			.parse_ctrlval		= parse_bw,
 			.format_str		= "%d=%*u",
 			.fflags			= RFTYPE_RES_MB,
@@ -352,11 +355,11 @@ cat_wrmsr(struct rdt_domain *d, struct msr_param *m, struct rdt_resource *r)
 		wrmsrl(hw_res->msr_base + i, hw_dom->ctrl_val[i]);
 }
 
-struct rdt_domain *get_domain_from_cpu(int cpu, struct rdt_resource *r)
+struct rdt_domain *get_ctrl_domain_from_cpu(int cpu, struct rdt_resource *r)
 {
 	struct rdt_domain *d;
 
-	list_for_each_entry(d, &r->domains, hdr.list) {
+	list_for_each_entry(d, &r->ctrl_domains, hdr.list) {
 		/* Find the domain that contains this CPU */
 		if (cpumask_test_cpu(cpu, &d->hdr.cpu_mask))
 			return d;
@@ -378,7 +381,7 @@ void rdt_ctrl_update(void *arg)
 	int cpu = smp_processor_id();
 	struct rdt_domain *d;
 
-	d = get_domain_from_cpu(cpu, r);
+	d = get_ctrl_domain_from_cpu(cpu, r);
 	if (d) {
 		hw_res->msr_update(d, m, r);
 		return;
@@ -388,26 +391,26 @@ void rdt_ctrl_update(void *arg)
 }
 
 /*
- * rdt_find_domain - Find a domain in a resource that matches input resource id
+ * rdt_find_domain - Search for a domain id in a resource domain list.
  *
- * Search resource r's domain list to find the resource id. If the resource
- * id is found in a domain, return the domain. Otherwise, if requested by
- * caller, return the first domain whose id is bigger than the input id.
- * The domain list is sorted by id in ascending order.
+ * Search the domain list to find the domain id. If the domain id is
+ * found, return the domain. NULL otherwise.  If the domain id is not
+ * found (and NULL returned) then the first domain with id bigger than
+ * the input id can be returned to the caller via @pos.
  */
-struct rdt_domain *rdt_find_domain(struct rdt_resource *r, int id,
-				   struct list_head **pos)
+struct rdt_domain_hdr *rdt_find_domain(struct list_head *h, int id,
+				       struct list_head **pos)
 {
-	struct rdt_domain *d;
+	struct rdt_domain_hdr *d;
 	struct list_head *l;
 
-	list_for_each(l, &r->domains) {
-		d = list_entry(l, struct rdt_domain, hdr.list);
+	list_for_each(l, h) {
+		d = list_entry(l, struct rdt_domain_hdr, list);
 		/* When id is found, return its domain. */
-		if (id == d->hdr.id)
+		if (id == d->id)
 			return d;
 		/* Stop searching when finding id's position in sorted list. */
-		if (id < d->hdr.id)
+		if (id < d->id)
 			break;
 	}
 
@@ -501,35 +504,28 @@ static int get_domain_id_from_scope(int cpu, enum resctrl_scope scope)
 	return -EINVAL;
 }
 
-/*
- * domain_add_cpu - Add a cpu to a resource's domain list.
- *
- * If an existing domain in the resource r's domain list matches the cpu's
- * resource id, add the cpu in the domain.
- *
- * Otherwise, a new domain is allocated and inserted into the right position
- * in the domain list sorted by id in ascending order.
- *
- * The order in the domain list is visible to users when we print entries
- * in the schemata file and schemata input is validated to have the same order
- * as this list.
- */
-static void domain_add_cpu(int cpu, struct rdt_resource *r)
+static void domain_add_cpu_ctrl(int cpu, struct rdt_resource *r)
 {
-	int id = get_domain_id_from_scope(cpu, r->scope);
+	int id = get_domain_id_from_scope(cpu, r->ctrl_scope);
 	struct list_head *add_pos = NULL;
 	struct rdt_hw_domain *hw_dom;
+	struct rdt_domain_hdr *hdr;
 	struct rdt_domain *d;
 	int err;
 
 	if (id < 0) {
-		pr_warn_once("Can't find domain id for CPU:%d scope:%d for resource %s\n",
-			     cpu, r->scope, r->name);
+		pr_warn_once("Can't find control domain id for CPU:%d scope:%d for resource %s\n",
+			     cpu, r->ctrl_scope, r->name);
 		return;
 	}
 
-	d = rdt_find_domain(r, id, &add_pos);
-	if (d) {
+	hdr = rdt_find_domain(&r->ctrl_domains, id, &add_pos);
+	if (hdr) {
+		if (WARN_ON_ONCE(hdr->type != RESCTRL_CTRL_DOMAIN))
+			return;
+
+		d = container_of(hdr, struct rdt_domain, hdr);
+
 		cpumask_set_cpu(cpu, &d->hdr.cpu_mask);
 		if (r->cache.arch_has_per_cpu_cfg)
 			rdt_domain_reconfigure_cdp(r);
@@ -542,48 +538,110 @@ static void domain_add_cpu(int cpu, struct rdt_resource *r)
 
 	d = &hw_dom->d_resctrl;
 	d->hdr.id = id;
+	d->hdr.type = RESCTRL_CTRL_DOMAIN;
 	cpumask_set_cpu(cpu, &d->hdr.cpu_mask);
 
 	rdt_domain_reconfigure_cdp(r);
 
-	if (r->alloc_capable && domain_setup_ctrlval(r, d)) {
+	if (domain_setup_ctrlval(r, d)) {
 		domain_free(hw_dom);
 		return;
 	}
 
-	if (r->mon_capable && arch_domain_mbm_alloc(r->num_rmid, hw_dom)) {
+	list_add_tail(&d->hdr.list, add_pos);
+
+	err = resctrl_online_ctrl_domain(r, d);
+	if (err) {
+		list_del(&d->hdr.list);
+		domain_free(hw_dom);
+	}
+}
+
+static void domain_add_cpu_mon(int cpu, struct rdt_resource *r)
+{
+	int id = get_domain_id_from_scope(cpu, r->mon_scope);
+	struct list_head *add_pos = NULL;
+	struct rdt_hw_domain *hw_dom;
+	struct rdt_domain_hdr *hdr;
+	struct rdt_domain *d;
+	int err;
+
+	if (id < 0) {
+		pr_warn_once("Can't find monitor domain id for CPU:%d scope:%d for resource %s\n",
+			     cpu, r->mon_scope, r->name);
+		return;
+	}
+
+	hdr = rdt_find_domain(&r->mon_domains, id, &add_pos);
+	if (hdr) {
+		if (WARN_ON_ONCE(hdr->type != RESCTRL_MON_DOMAIN))
+			return;
+
+		d = container_of(hdr, struct rdt_domain, hdr);
+
+		cpumask_set_cpu(cpu, &d->hdr.cpu_mask);
+		return;
+	}
+
+	hw_dom = kzalloc_node(sizeof(*hw_dom), GFP_KERNEL, cpu_to_node(cpu));
+	if (!hw_dom)
+		return;
+
+	d = &hw_dom->d_resctrl;
+	d->hdr.id = id;
+	d->hdr.type = RESCTRL_MON_DOMAIN;
+	cpumask_set_cpu(cpu, &d->hdr.cpu_mask);
+
+	if (arch_domain_mbm_alloc(r->num_rmid, hw_dom)) {
 		domain_free(hw_dom);
 		return;
 	}
 
 	list_add_tail(&d->hdr.list, add_pos);
 
-	err = resctrl_online_domain(r, d);
+	err = resctrl_online_mon_domain(r, d);
 	if (err) {
 		list_del(&d->hdr.list);
 		domain_free(hw_dom);
 	}
 }
 
-static void domain_remove_cpu(int cpu, struct rdt_resource *r)
+/*
+ * domain_add_cpu - Add a CPU to either/both resource's domain lists.
+ */
+static void domain_add_cpu(int cpu, struct rdt_resource *r)
 {
-	int id = get_domain_id_from_scope(cpu, r->scope);
+	if (r->alloc_capable)
+		domain_add_cpu_ctrl(cpu, r);
+	if (r->mon_capable)
+		domain_add_cpu_mon(cpu, r);
+}
+
+static void domain_remove_cpu_ctrl(int cpu, struct rdt_resource *r)
+{
+	int id = get_domain_id_from_scope(cpu, r->ctrl_scope);
 	struct rdt_hw_domain *hw_dom;
+	struct rdt_domain_hdr *hdr;
 	struct rdt_domain *d;
 
 	if (id < 0)
 		return;
 
-	d = rdt_find_domain(r, id, NULL);
-	if (!d) {
-		pr_warn("Couldn't find domain with id=%d for CPU %d\n", id, cpu);
+	hdr = rdt_find_domain(&r->ctrl_domains, id, NULL);
+	if (!hdr) {
+		pr_warn("Couldn't find control scope id=%d for CPU %d\n", id, cpu);
 		return;
 	}
+
+	if (WARN_ON_ONCE(hdr->type != RESCTRL_CTRL_DOMAIN))
+		return;
+
+	d = container_of(hdr, struct rdt_domain, hdr);
 	hw_dom = resctrl_to_arch_dom(d);
 
 	cpumask_clear_cpu(cpu, &d->hdr.cpu_mask);
 	if (cpumask_empty(&d->hdr.cpu_mask)) {
-		resctrl_offline_domain(r, d);
+		resctrl_offline_ctrl_domain(r, d);
 		list_del(&d->hdr.list);
 
 		/*
@@ -596,6 +654,38 @@ static void domain_remove_cpu(int cpu, struct rdt_resource *r)
 
 		return;
 	}
+}
+
+static void domain_remove_cpu_mon(int cpu, struct rdt_resource *r)
+{
+	int id = get_domain_id_from_scope(cpu, r->mon_scope);
+	struct rdt_hw_domain *hw_dom;
+	struct rdt_domain_hdr *hdr;
+	struct rdt_domain *d;
+
+	if (id < 0)
+		return;
+
+	hdr = rdt_find_domain(&r->mon_domains, id, NULL);
+	if (!hdr) {
+		pr_warn("Couldn't find monitor scope id=%d for CPU %d\n", id, cpu);
+		return;
+	}
+
+	if (WARN_ON_ONCE(hdr->type != RESCTRL_MON_DOMAIN))
+		return;
+
+	d = container_of(hdr, struct rdt_domain, hdr);
+	hw_dom = resctrl_to_arch_dom(d);
+
+	cpumask_clear_cpu(cpu, &d->hdr.cpu_mask);
+	if (cpumask_empty(&d->hdr.cpu_mask)) {
+		resctrl_offline_mon_domain(r, d);
+		list_del(&d->hdr.list);
+		domain_free(hw_dom);
+
+		return;
+	}
 
 	if (r == &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl) {
 		if (is_mbm_enabled() && cpu == d->mbm_work_cpu) {
@@ -610,6 +700,14 @@ static void domain_remove_cpu(int cpu, struct rdt_resource *r)
 	}
 }
 
+static void domain_remove_cpu(int cpu, struct rdt_resource *r)
+{
+	if (r->alloc_capable)
+		domain_remove_cpu_ctrl(cpu, r);
+	if (r->mon_capable)
+		domain_remove_cpu_mon(cpu, r);
+}
+
 static void clear_closid_rmid(int cpu)
 {
 	struct resctrl_pqr_state *state = this_cpu_ptr(&pqr_state);
diff --git a/arch/x86/kernel/cpu/resctrl/ctrlmondata.c b/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
index 23f8258d36a8..0b4136c42762 100644
--- a/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
+++ b/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
@@ -226,7 +226,7 @@ static int parse_line(char *line, struct resctrl_schema *s,
 		return -EINVAL;
 	}
 	dom = strim(dom);
-	list_for_each_entry(d, &r->domains, hdr.list) {
+	list_for_each_entry(d, &r->ctrl_domains, hdr.list) {
 		if (d->hdr.id == dom_id) {
 			data.buf = dom;
 			data.rdtgrp = rdtgrp;
@@ -318,7 +318,7 @@ int resctrl_arch_update_domains(struct rdt_resource *r, u32 closid)
 		return -ENOMEM;
 
 	msr_param.res = NULL;
-	list_for_each_entry(d, &r->domains, hdr.list) {
+	list_for_each_entry(d, &r->ctrl_domains, hdr.list) {
 		hw_dom = resctrl_to_arch_dom(d);
 		for (t = 0; t < CDP_NUM_TYPES; t++) {
 			cfg = &hw_dom->d_resctrl.staged_config[t];
@@ -466,7 +466,7 @@ static void show_doms(struct seq_file *s, struct resctrl_schema *schema, int clo
 	u32 ctrl_val;
 
 	seq_printf(s, "%*s:", max_name_width, schema->name);
-	list_for_each_entry(dom, &r->domains, hdr.list) {
+	list_for_each_entry(dom, &r->ctrl_domains, hdr.list) {
 		if (sep)
 			seq_puts(s, ";");
 
@@ -542,6 +542,7 @@ void mon_event_read(struct rmid_read *rr, struct rdt_resource *r,
 int rdtgroup_mondata_show(struct seq_file *m, void *arg)
 {
 	struct kernfs_open_file *of = m->private;
+	struct rdt_domain_hdr *hdr;
 	u32 resid, evtid, domid;
 	struct rdtgroup *rdtgrp;
 	struct rdt_resource *r;
@@ -562,11 +563,12 @@ int rdtgroup_mondata_show(struct seq_file *m, void *arg)
 	evtid = md.u.evtid;
 
 	r = &rdt_resources_all[resid].r_resctrl;
-	d = rdt_find_domain(r, domid, NULL);
-	if (!d) {
+	hdr = rdt_find_domain(&r->mon_domains, domid, NULL);
+	if (!hdr || WARN_ON_ONCE(hdr->type != RESCTRL_MON_DOMAIN)) {
 		ret = -ENOENT;
 		goto out;
 	}
+	d = container_of(hdr, struct rdt_domain, hdr);
 
 	mon_event_read(&rr, r, d, rdtgrp, evtid, false);
 
diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c
index dd0ea1bc0092..ec5ad926c5dc 100644
--- a/arch/x86/kernel/cpu/resctrl/monitor.c
+++ b/arch/x86/kernel/cpu/resctrl/monitor.c
@@ -340,7 +340,7 @@ static void add_rmid_to_limbo(struct rmid_entry *entry)
 
 	entry->busy = 0;
 	cpu = get_cpu();
-	list_for_each_entry(d, &r->domains, hdr.list) {
+	list_for_each_entry(d, &r->mon_domains, hdr.list) {
 		if (cpumask_test_cpu(cpu, &d->hdr.cpu_mask)) {
 			err = resctrl_arch_rmid_read(r, d, entry->rmid,
 						     QOS_L3_OCCUP_EVENT_ID,
@@ -535,7 +535,7 @@ static void update_mba_bw(struct rdtgroup *rgrp, struct rdt_domain *dom_mbm)
 	rmid = rgrp->mon.rmid;
 	pmbm_data = &dom_mbm->mbm_local[rmid];
 
-	dom_mba = get_domain_from_cpu(smp_processor_id(), r_mba);
+	dom_mba = get_ctrl_domain_from_cpu(smp_processor_id(), r_mba);
 	if (!dom_mba) {
 		pr_warn_once("Failure to get domain for MBA update\n");
 		return;
diff --git a/arch/x86/kernel/cpu/resctrl/pseudo_lock.c b/arch/x86/kernel/cpu/resctrl/pseudo_lock.c
index fcbd99e2eb66..ed6d59af1cef 100644
--- a/arch/x86/kernel/cpu/resctrl/pseudo_lock.c
+++ b/arch/x86/kernel/cpu/resctrl/pseudo_lock.c
@@ -292,7 +292,7 @@ static void pseudo_lock_region_clear(struct pseudo_lock_region *plr)
  */
 static int pseudo_lock_region_init(struct pseudo_lock_region *plr)
 {
-	enum resctrl_scope scope = plr->s->res->scope;
+	enum resctrl_scope scope = plr->s->res->ctrl_scope;
 	struct cpu_cacheinfo *ci;
 	int ret;
 	int i;
@@ -856,7 +856,7 @@ bool rdtgroup_pseudo_locked_in_hierarchy(struct rdt_domain *d)
 	 * associated with them.
 	 */
 	for_each_alloc_capable_rdt_resource(r) {
-		list_for_each_entry(d_i, &r->domains, hdr.list) {
+		list_for_each_entry(d_i, &r->ctrl_domains, hdr.list) {
 			if (d_i->plr)
 				cpumask_or(cpu_with_psl, cpu_with_psl,
 					   &d_i->hdr.cpu_mask);
diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
index 04d32602ac33..760013ed1bff 100644
--- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
+++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
@@ -91,7 +91,7 @@ void rdt_staged_configs_clear(void)
 	lockdep_assert_held(&rdtgroup_mutex);
 
 	for_each_alloc_capable_rdt_resource(r) {
-		list_for_each_entry(dom, &r->domains, hdr.list)
+		list_for_each_entry(dom, &r->ctrl_domains, hdr.list)
 			memset(dom->staged_config, 0, sizeof(dom->staged_config));
 	}
 }
@@ -984,7 +984,7 @@ static int rdt_bit_usage_show(struct kernfs_open_file *of,
 
 	mutex_lock(&rdtgroup_mutex);
 	hw_shareable = r->cache.shareable_bits;
-	list_for_each_entry(dom, &r->domains, hdr.list) {
+	list_for_each_entry(dom, &r->ctrl_domains, hdr.list) {
 		if (sep)
 			seq_putc(seq, ';');
 		sw_shareable = 0;
@@ -1302,7 +1302,7 @@ static bool rdtgroup_mode_test_exclusive(struct rdtgroup *rdtgrp)
 		if (r->rid == RDT_RESOURCE_MBA || r->rid == RDT_RESOURCE_SMBA)
 			continue;
 		has_cache = true;
-		list_for_each_entry(d, &r->domains, hdr.list) {
+		list_for_each_entry(d, &r->ctrl_domains, hdr.list) {
 			ctrl = resctrl_arch_get_config(r, d, closid,
 						       s->conf_type);
 			if (rdtgroup_cbm_overlaps(s, d, ctrl, closid, false)) {
@@ -1413,13 +1413,13 @@ unsigned int rdtgroup_cbm_to_size(struct rdt_resource *r,
 	unsigned int size = 0;
 	int num_b, i;
 
-	if (WARN_ON_ONCE(r->scope != RESCTRL_L2_CACHE && r->scope != RESCTRL_L3_CACHE))
+	if (WARN_ON_ONCE(r->ctrl_scope != RESCTRL_L2_CACHE && r->ctrl_scope != RESCTRL_L3_CACHE))
 		return size;
 
 	num_b = bitmap_weight(&cbm, r->cache.cbm_len);
 	ci = get_cpu_cacheinfo(cpumask_any(&d->hdr.cpu_mask));
 	for (i = 0; i < ci->num_leaves; i++) {
-		if (ci->info_list[i].level == r->scope) {
+		if (ci->info_list[i].level == r->ctrl_scope) {
 			size = ci->info_list[i].size / r->cache.cbm_len * num_b;
 			break;
 		}
@@ -1477,7 +1477,7 @@ static int rdtgroup_size_show(struct kernfs_open_file *of,
 		type = schema->conf_type;
 		sep = false;
 		seq_printf(s, "%*s:", max_name_width, schema->name);
-		list_for_each_entry(d, &r->domains, hdr.list) {
+		list_for_each_entry(d, &r->ctrl_domains, hdr.list) {
 			if (sep)
 				seq_putc(s, ';');
 			if (rdtgrp->mode == RDT_MODE_PSEUDO_LOCKSETUP) {
@@ -1566,7 +1566,7 @@ static int mbm_config_show(struct seq_file *s, struct rdt_resource *r, u32 evtid
 
 	mutex_lock(&rdtgroup_mutex);
 
-	list_for_each_entry(dom, &r->domains, hdr.list) {
+	list_for_each_entry(dom, &r->mon_domains, hdr.list) {
 		if (sep)
 			seq_puts(s, ";");
 
@@ -1689,7 +1689,7 @@ static int mon_config_write(struct rdt_resource *r, char *tok, u32 evtid)
 		return -EINVAL;
 	}
 
-	list_for_each_entry(d, &r->domains, hdr.list) {
+	list_for_each_entry(d, &r->mon_domains, hdr.list) {
 		if (d->hdr.id == dom_id) {
 			ret = mbm_config_write_domain(r, d, evtid, val);
 			if (ret)
@@ -2232,7 +2232,7 @@ static int set_cache_qos_cfg(int level, bool enable)
 		return -ENOMEM;
 
 	r_l = &rdt_resources_all[level].r_resctrl;
-	list_for_each_entry(d, &r_l->domains, hdr.list) {
+	list_for_each_entry(d, &r_l->ctrl_domains, hdr.list) {
 		if (r_l->cache.arch_has_per_cpu_cfg)
 			/* Pick all the CPUs in the domain instance */
 			for_each_cpu(cpu, &d->hdr.cpu_mask)
@@ -2317,7 +2317,7 @@ static int set_mba_sc(bool mba_sc)
 
 	r->membw.mba_sc = mba_sc;
 
-	list_for_each_entry(d, &r->domains, hdr.list) {
+	list_for_each_entry(d, &r->ctrl_domains, hdr.list) {
 		for (i = 0; i < num_closid; i++)
 			d->mbps_val[i] = MBA_MAX_MBPS;
 	}
@@ -2653,7 +2653,7 @@ static int rdt_get_tree(struct fs_context *fc)
 
 	if (is_mbm_enabled()) {
 		r = &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl;
-		list_for_each_entry(dom, &r->domains, hdr.list)
+		list_for_each_entry(dom, &r->mon_domains, hdr.list)
 			mbm_setup_overflow_handler(dom, MBM_OVERFLOW_INTERVAL);
 	}
 
@@ -2777,10 +2777,10 @@ static int reset_all_ctrls(struct rdt_resource *r)
 
 	/*
 	 * Disable resource control for this resource by setting all
-	 * CBMs in all domains to the maximum mask value. Pick one CPU
+	 * CBMs in all ctrl_domains to the maximum mask value. Pick one CPU
 	 * from each domain to update the MSRs below.
 	 */
-	list_for_each_entry(d, &r->domains, hdr.list) {
+	list_for_each_entry(d, &r->ctrl_domains, hdr.list) {
 		hw_dom = resctrl_to_arch_dom(d);
 		cpumask_set_cpu(cpumask_any(&d->hdr.cpu_mask), cpu_mask);
 
@@ -3050,7 +3050,7 @@ static int mkdir_mondata_subdir_alldom(struct kernfs_node *parent_kn,
 	struct rdt_domain *dom;
 	int ret;
 
-	list_for_each_entry(dom, &r->domains, hdr.list) {
+	list_for_each_entry(dom, &r->mon_domains, hdr.list) {
 		ret = mkdir_mondata_subdir(parent_kn, dom, r, prgrp);
 		if (ret)
 			return ret;
@@ -3232,7 +3232,7 @@ static int rdtgroup_init_cat(struct resctrl_schema *s, u32 closid)
 	struct rdt_domain *d;
 	int ret;
 
-	list_for_each_entry(d, &s->res->domains, hdr.list) {
+	list_for_each_entry(d, &s->res->ctrl_domains, hdr.list) {
 		ret = __init_one_rdt_domain(d, s, closid);
 		if (ret < 0)
 			return ret;
@@ -3247,7 +3247,7 @@ static void rdtgroup_init_mba(struct rdt_resource *r, u32 closid)
 	struct resctrl_staged_config *cfg;
 	struct rdt_domain *d;
 
-	list_for_each_entry(d, &r->domains, hdr.list) {
+	list_for_each_entry(d, &r->ctrl_domains, hdr.list) {
 		if (is_mba_sc(r)) {
 			d->mbps_val[closid] = MBA_MAX_MBPS;
 			continue;
@@ -3849,15 +3849,17 @@ static void domain_destroy_mon_state(struct rdt_domain *d)
 	kfree(d->mbm_local);
 }
 
-void resctrl_offline_domain(struct rdt_resource *r, struct rdt_domain *d)
+void resctrl_offline_ctrl_domain(struct rdt_resource *r, struct rdt_domain *d)
 {
 	lockdep_assert_held(&rdtgroup_mutex);
 
 	if (supports_mba_mbps() && r->rid == RDT_RESOURCE_MBA)
 		mba_sc_domain_destroy(r, d);
+}
 
-	if (!r->mon_capable)
-		return;
+void resctrl_offline_mon_domain(struct rdt_resource *r, struct rdt_domain *d)
+{
+	lockdep_assert_held(&rdtgroup_mutex);
 
 	/*
 	 * If resctrl is mounted, remove all the
@@ -3914,18 +3916,21 @@ static int domain_setup_mon_state(struct rdt_resource *r, struct rdt_domain *d)
 	return 0;
 }
 
-int resctrl_online_domain(struct rdt_resource *r, struct rdt_domain *d)
+int resctrl_online_ctrl_domain(struct rdt_resource *r, struct rdt_domain *d)
 {
-	int err;
-
 	lockdep_assert_held(&rdtgroup_mutex);
 
 	if (supports_mba_mbps() && r->rid == RDT_RESOURCE_MBA)
-		/* RDT_RESOURCE_MBA is never mon_capable */
 		return mba_sc_domain_allocate(r, d);
 
-	if (!r->mon_capable)
-		return 0;
+	return 0;
+}
+
+int resctrl_online_mon_domain(struct rdt_resource *r, struct rdt_domain *d)
+{
+	int err;
+
+	lockdep_assert_held(&rdtgroup_mutex);
 
 	err = domain_setup_mon_state(r, d);
 	if (err)
-- 
2.41.0


^ permalink raw reply related	[flat|nested] 331+ messages in thread

* [PATCH v11 4/8] x86/resctrl: Split the rdt_domain and rdt_hw_domain structures
  2023-11-09 23:09             ` [PATCH v11 0/8] Add support for Sub-NUMA cluster (SNC) systems Tony Luck
                                 ` (2 preceding siblings ...)
  2023-11-09 23:09               ` [PATCH v11 3/8] x86/resctrl: Prepare for different scope for control/monitor operations Tony Luck
@ 2023-11-09 23:09               ` Tony Luck
  2023-11-20 22:19                 ` Reinette Chatre
  2023-11-09 23:09               ` [PATCH v11 5/8] x86/resctrl: Add node-scope to the options for feature scope Tony Luck
                                 ` (5 subsequent siblings)
  9 siblings, 1 reply; 331+ messages in thread
From: Tony Luck @ 2023-11-09 23:09 UTC (permalink / raw)
  To: Fenghua Yu, Reinette Chatre, Peter Newman, Jonathan Corbet,
	Shuah Khan, x86
  Cc: Shaopeng Tan, James Morse, Jamie Iles, Babu Moger, Randy Dunlap,
	linux-kernel, linux-doc, patches, Tony Luck

The same rdt_domain structure is used for both control and monitor
functions. This wastes memory because every domain instance carries
all fields, even though some are used only by control functions and
most only by monitor functions.

Split into separate rdt_ctrl_domain and rdt_mon_domain structures with
just the fields required for control and monitoring respectively.

Split the rdt_hw_domain structure similarly into rdt_hw_ctrl_domain
and rdt_hw_mon_domain.
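
To illustrate the resulting layout (this sketch is not part of the
patch): both new structures embed the common rdt_domain_hdr, and code
that walks a domain list uses container_of() on that header to recover
the specific type. Below is a minimal user-space sketch of that
pattern; the staged_cfg and mbm_work_cpu fields, the show() helper and
the main() harness are invented stand-ins for demonstration, and only
the header plus container_of() shape mirrors the patch.

/*
 * Illustrative only -- not part of the patch.  Field names other than
 * "hdr" are stand-ins for the real control/monitor state.
 */
#include <stddef.h>
#include <stdio.h>

#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

enum resctrl_domain_type { RESCTRL_CTRL_DOMAIN, RESCTRL_MON_DOMAIN };

struct rdt_domain_hdr {
	int				id;
	enum resctrl_domain_type	type;
};

struct rdt_ctrl_domain {		/* control-only state */
	struct rdt_domain_hdr	hdr;
	unsigned int		staged_cfg;	/* stand-in field */
};

struct rdt_mon_domain {			/* monitor-only state */
	struct rdt_domain_hdr	hdr;
	int			mbm_work_cpu;	/* stand-in field */
};

/* A generic walker sees only the header and recovers the full type. */
static void show(struct rdt_domain_hdr *hdr)
{
	if (hdr->type == RESCTRL_CTRL_DOMAIN) {
		struct rdt_ctrl_domain *d =
			container_of(hdr, struct rdt_ctrl_domain, hdr);
		printf("ctrl domain %d staged_cfg=%#x\n", d->hdr.id, d->staged_cfg);
	} else {
		struct rdt_mon_domain *d =
			container_of(hdr, struct rdt_mon_domain, hdr);
		printf("mon domain %d mbm_work_cpu=%d\n", d->hdr.id, d->mbm_work_cpu);
	}
}

int main(void)
{
	struct rdt_ctrl_domain c = { .hdr = { 0, RESCTRL_CTRL_DOMAIN }, .staged_cfg = 0xf };
	struct rdt_mon_domain  m = { .hdr = { 1, RESCTRL_MON_DOMAIN }, .mbm_work_cpu = 3 };

	show(&c.hdr);
	show(&m.hdr);
	return 0;
}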

Reviewed-by: Peter Newman <peternewman@google.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
---
Changes since v10:
Reinette: Remove struct cpumask cpu_mask; that had snuck back into the
	  definition of "struct rdt_domain"

 include/linux/resctrl.h                   | 48 +++++++------
 arch/x86/kernel/cpu/resctrl/internal.h    | 60 ++++++++++------
 arch/x86/kernel/cpu/resctrl/core.c        | 87 ++++++++++++-----------
 arch/x86/kernel/cpu/resctrl/ctrlmondata.c | 32 ++++-----
 arch/x86/kernel/cpu/resctrl/monitor.c     | 40 +++++------
 arch/x86/kernel/cpu/resctrl/pseudo_lock.c |  6 +-
 arch/x86/kernel/cpu/resctrl/rdtgroup.c    | 62 ++++++++--------
 7 files changed, 182 insertions(+), 153 deletions(-)

diff --git a/include/linux/resctrl.h b/include/linux/resctrl.h
index 35e700edc6e6..058a940c3239 100644
--- a/include/linux/resctrl.h
+++ b/include/linux/resctrl.h
@@ -72,7 +72,23 @@ struct rdt_domain_hdr {
 };
 
 /**
- * struct rdt_domain - group of CPUs sharing a resctrl resource
+ * struct rdt_ctrl_domain - group of CPUs sharing a resctrl control resource
+ * @hdr:		common header for different domain types
+ * @plr:		pseudo-locked region (if any) associated with domain
+ * @staged_config:	parsed configuration to be applied
+ * @mbps_val:		When mba_sc is enabled, this holds the array of user
+ *			specified control values for mba_sc in MBps, indexed
+ *			by closid
+ */
+struct rdt_ctrl_domain {
+	struct rdt_domain_hdr		hdr;
+	struct pseudo_lock_region	*plr;
+	struct resctrl_staged_config	staged_config[CDP_NUM_TYPES];
+	u32				*mbps_val;
+};
+
+/**
+ * struct rdt_mon_domain - group of CPUs sharing a resctrl monitor resource
  * @hdr:		common header for different domain types
  * @rmid_busy_llc:	bitmap of which limbo RMIDs are above threshold
  * @mbm_total:		saved state for MBM total bandwidth
@@ -81,13 +97,8 @@ struct rdt_domain_hdr {
  * @cqm_limbo:		worker to periodically read CQM h/w counters
  * @mbm_work_cpu:	worker CPU for MBM h/w counters
  * @cqm_work_cpu:	worker CPU for CQM h/w counters
- * @plr:		pseudo-locked region (if any) associated with domain
- * @staged_config:	parsed configuration to be applied
- * @mbps_val:		When mba_sc is enabled, this holds the array of user
- *			specified control values for mba_sc in MBps, indexed
- *			by closid
  */
-struct rdt_domain {
+struct rdt_mon_domain {
 	struct rdt_domain_hdr		hdr;
 	unsigned long			*rmid_busy_llc;
 	struct mbm_state		*mbm_total;
@@ -96,9 +107,6 @@ struct rdt_domain {
 	struct delayed_work		cqm_limbo;
 	int				mbm_work_cpu;
 	int				cqm_work_cpu;
-	struct pseudo_lock_region	*plr;
-	struct resctrl_staged_config	staged_config[CDP_NUM_TYPES];
-	u32				*mbps_val;
 };
 
 /**
@@ -202,7 +210,7 @@ struct rdt_resource {
 	const char		*format_str;
 	int			(*parse_ctrlval)(struct rdt_parse_data *data,
 						 struct resctrl_schema *s,
-						 struct rdt_domain *d);
+						 struct rdt_ctrl_domain *d);
 	struct list_head	evt_list;
 	unsigned long		fflags;
 	bool			cdp_capable;
@@ -236,15 +244,15 @@ int resctrl_arch_update_domains(struct rdt_resource *r, u32 closid);
  * Update the ctrl_val and apply this config right now.
  * Must be called on one of the domain's CPUs.
  */
-int resctrl_arch_update_one(struct rdt_resource *r, struct rdt_domain *d,
+int resctrl_arch_update_one(struct rdt_resource *r, struct rdt_ctrl_domain *d,
 			    u32 closid, enum resctrl_conf_type t, u32 cfg_val);
 
-u32 resctrl_arch_get_config(struct rdt_resource *r, struct rdt_domain *d,
+u32 resctrl_arch_get_config(struct rdt_resource *r, struct rdt_ctrl_domain *d,
 			    u32 closid, enum resctrl_conf_type type);
-int resctrl_online_ctrl_domain(struct rdt_resource *r, struct rdt_domain *d);
-int resctrl_online_mon_domain(struct rdt_resource *r, struct rdt_domain *d);
-void resctrl_offline_ctrl_domain(struct rdt_resource *r, struct rdt_domain *d);
-void resctrl_offline_mon_domain(struct rdt_resource *r, struct rdt_domain *d);
+int resctrl_online_ctrl_domain(struct rdt_resource *r, struct rdt_ctrl_domain *d);
+int resctrl_online_mon_domain(struct rdt_resource *r, struct rdt_mon_domain *d);
+void resctrl_offline_ctrl_domain(struct rdt_resource *r, struct rdt_ctrl_domain *d);
+void resctrl_offline_mon_domain(struct rdt_resource *r, struct rdt_mon_domain *d);
 
 /**
  * resctrl_arch_rmid_read() - Read the eventid counter corresponding to rmid
@@ -260,7 +268,7 @@ void resctrl_offline_mon_domain(struct rdt_resource *r, struct rdt_domain *d);
  * Return:
  * 0 on success, or -EIO, -EINVAL etc on error.
  */
-int resctrl_arch_rmid_read(struct rdt_resource *r, struct rdt_domain *d,
+int resctrl_arch_rmid_read(struct rdt_resource *r, struct rdt_mon_domain *d,
 			   u32 rmid, enum resctrl_event_id eventid, u64 *val);
 
 /**
@@ -273,7 +281,7 @@ int resctrl_arch_rmid_read(struct rdt_resource *r, struct rdt_domain *d,
  *
  * This can be called from any CPU.
  */
-void resctrl_arch_reset_rmid(struct rdt_resource *r, struct rdt_domain *d,
+void resctrl_arch_reset_rmid(struct rdt_resource *r, struct rdt_mon_domain *d,
 			     u32 rmid, enum resctrl_event_id eventid);
 
 /**
@@ -285,7 +293,7 @@ void resctrl_arch_reset_rmid(struct rdt_resource *r, struct rdt_domain *d,
  *
  * This can be called from any CPU.
  */
-void resctrl_arch_reset_rmid_all(struct rdt_resource *r, struct rdt_domain *d);
+void resctrl_arch_reset_rmid_all(struct rdt_resource *r, struct rdt_mon_domain *d);
 
 extern unsigned int resctrl_rmid_realloc_threshold;
 extern unsigned int resctrl_rmid_realloc_limit;
diff --git a/arch/x86/kernel/cpu/resctrl/internal.h b/arch/x86/kernel/cpu/resctrl/internal.h
index 24bf9d7989a9..ce3a70657842 100644
--- a/arch/x86/kernel/cpu/resctrl/internal.h
+++ b/arch/x86/kernel/cpu/resctrl/internal.h
@@ -107,7 +107,7 @@ union mon_data_bits {
 struct rmid_read {
 	struct rdtgroup		*rgrp;
 	struct rdt_resource	*r;
-	struct rdt_domain	*d;
+	struct rdt_mon_domain	*d;
 	enum resctrl_event_id	evtid;
 	bool			first;
 	int			err;
@@ -192,7 +192,7 @@ struct mongroup {
  */
 struct pseudo_lock_region {
 	struct resctrl_schema	*s;
-	struct rdt_domain	*d;
+	struct rdt_ctrl_domain	*d;
 	u32			cbm;
 	wait_queue_head_t	lock_thread_wq;
 	int			thread_done;
@@ -319,25 +319,41 @@ struct arch_mbm_state {
 };
 
 /**
- * struct rdt_hw_domain - Arch private attributes of a set of CPUs that share
- *			  a resource
+ * struct rdt_hw_ctrl_domain - Arch private attributes of a set of CPUs that share
+ *			       a resource for a control function
  * @d_resctrl:	Properties exposed to the resctrl file system
  * @ctrl_val:	array of cache or mem ctrl values (indexed by CLOSID)
+ *
+ * Members of this structure are accessed via helpers that provide abstraction.
+ */
+struct rdt_hw_ctrl_domain {
+	struct rdt_ctrl_domain		d_resctrl;
+	u32				*ctrl_val;
+};
+
+/**
+ * struct rdt_hw_mon_domain - Arch private attributes of a set of CPUs that share
+ *			      a resource for a monitor function
+ * @d_resctrl:	Properties exposed to the resctrl file system
  * @arch_mbm_total:	arch private state for MBM total bandwidth
  * @arch_mbm_local:	arch private state for MBM local bandwidth
  *
  * Members of this structure are accessed via helpers that provide abstraction.
  */
-struct rdt_hw_domain {
-	struct rdt_domain		d_resctrl;
-	u32				*ctrl_val;
+struct rdt_hw_mon_domain {
+	struct rdt_mon_domain		d_resctrl;
 	struct arch_mbm_state		*arch_mbm_total;
 	struct arch_mbm_state		*arch_mbm_local;
 };
 
-static inline struct rdt_hw_domain *resctrl_to_arch_dom(struct rdt_domain *r)
+static inline struct rdt_hw_ctrl_domain *resctrl_to_arch_ctrl_dom(struct rdt_ctrl_domain *r)
+{
+	return container_of(r, struct rdt_hw_ctrl_domain, d_resctrl);
+}
+
+static inline struct rdt_hw_mon_domain *resctrl_to_arch_mon_dom(struct rdt_mon_domain *r)
 {
-	return container_of(r, struct rdt_hw_domain, d_resctrl);
+	return container_of(r, struct rdt_hw_mon_domain, d_resctrl);
 }
 
 /**
@@ -405,7 +421,7 @@ struct rdt_hw_resource {
 	struct rdt_resource	r_resctrl;
 	u32			num_closid;
 	unsigned int		msr_base;
-	void (*msr_update)	(struct rdt_domain *d, struct msr_param *m,
+	void (*msr_update)	(struct rdt_ctrl_domain *d, struct msr_param *m,
 				 struct rdt_resource *r);
 	unsigned int		mon_scale;
 	unsigned int		mbm_width;
@@ -418,9 +434,9 @@ static inline struct rdt_hw_resource *resctrl_to_arch_res(struct rdt_resource *r
 }
 
 int parse_cbm(struct rdt_parse_data *data, struct resctrl_schema *s,
-	      struct rdt_domain *d);
+	      struct rdt_ctrl_domain *d);
 int parse_bw(struct rdt_parse_data *data, struct resctrl_schema *s,
-	     struct rdt_domain *d);
+	     struct rdt_ctrl_domain *d);
 
 extern struct mutex rdtgroup_mutex;
 
@@ -526,21 +542,21 @@ ssize_t rdtgroup_schemata_write(struct kernfs_open_file *of,
 				char *buf, size_t nbytes, loff_t off);
 int rdtgroup_schemata_show(struct kernfs_open_file *of,
 			   struct seq_file *s, void *v);
-bool rdtgroup_cbm_overlaps(struct resctrl_schema *s, struct rdt_domain *d,
+bool rdtgroup_cbm_overlaps(struct resctrl_schema *s, struct rdt_ctrl_domain *d,
 			   unsigned long cbm, int closid, bool exclusive);
-unsigned int rdtgroup_cbm_to_size(struct rdt_resource *r, struct rdt_domain *d,
+unsigned int rdtgroup_cbm_to_size(struct rdt_resource *r, struct rdt_ctrl_domain *d,
 				  unsigned long cbm);
 enum rdtgrp_mode rdtgroup_mode_by_closid(int closid);
 int rdtgroup_tasks_assigned(struct rdtgroup *r);
 int rdtgroup_locksetup_enter(struct rdtgroup *rdtgrp);
 int rdtgroup_locksetup_exit(struct rdtgroup *rdtgrp);
-bool rdtgroup_cbm_overlaps_pseudo_locked(struct rdt_domain *d, unsigned long cbm);
-bool rdtgroup_pseudo_locked_in_hierarchy(struct rdt_domain *d);
+bool rdtgroup_cbm_overlaps_pseudo_locked(struct rdt_ctrl_domain *d, unsigned long cbm);
+bool rdtgroup_pseudo_locked_in_hierarchy(struct rdt_ctrl_domain *d);
 int rdt_pseudo_lock_init(void);
 void rdt_pseudo_lock_release(void);
 int rdtgroup_pseudo_lock_create(struct rdtgroup *rdtgrp);
 void rdtgroup_pseudo_lock_remove(struct rdtgroup *rdtgrp);
-struct rdt_domain *get_ctrl_domain_from_cpu(int cpu, struct rdt_resource *r);
+struct rdt_ctrl_domain *get_ctrl_domain_from_cpu(int cpu, struct rdt_resource *r);
 int closids_supported(void);
 void closid_free(int closid);
 int alloc_rmid(void);
@@ -550,17 +566,17 @@ bool __init rdt_cpu_has(int flag);
 void mon_event_count(void *info);
 int rdtgroup_mondata_show(struct seq_file *m, void *arg);
 void mon_event_read(struct rmid_read *rr, struct rdt_resource *r,
-		    struct rdt_domain *d, struct rdtgroup *rdtgrp,
+		    struct rdt_mon_domain *d, struct rdtgroup *rdtgrp,
 		    int evtid, int first);
-void mbm_setup_overflow_handler(struct rdt_domain *dom,
+void mbm_setup_overflow_handler(struct rdt_mon_domain *dom,
 				unsigned long delay_ms);
 void mbm_handle_overflow(struct work_struct *work);
 void __init intel_rdt_mbm_apply_quirk(void);
 bool is_mba_sc(struct rdt_resource *r);
-void cqm_setup_limbo_handler(struct rdt_domain *dom, unsigned long delay_ms);
+void cqm_setup_limbo_handler(struct rdt_mon_domain *dom, unsigned long delay_ms);
 void cqm_handle_limbo(struct work_struct *work);
-bool has_busy_rmid(struct rdt_resource *r, struct rdt_domain *d);
-void __check_limbo(struct rdt_domain *d, bool force_free);
+bool has_busy_rmid(struct rdt_resource *r, struct rdt_mon_domain *d);
+void __check_limbo(struct rdt_mon_domain *d, bool force_free);
 void rdt_domain_reconfigure_cdp(struct rdt_resource *r);
 void __init thread_throttle_mode_init(void);
 void __init mbm_config_rftype_init(const char *config);
diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
index d5aad434b2fb..cee8b87566fa 100644
--- a/arch/x86/kernel/cpu/resctrl/core.c
+++ b/arch/x86/kernel/cpu/resctrl/core.c
@@ -49,12 +49,12 @@ int max_name_width, max_data_width;
 bool rdt_alloc_capable;
 
 static void
-mba_wrmsr_intel(struct rdt_domain *d, struct msr_param *m,
+mba_wrmsr_intel(struct rdt_ctrl_domain *d, struct msr_param *m,
 		struct rdt_resource *r);
 static void
-cat_wrmsr(struct rdt_domain *d, struct msr_param *m, struct rdt_resource *r);
+cat_wrmsr(struct rdt_ctrl_domain *d, struct msr_param *m, struct rdt_resource *r);
 static void
-mba_wrmsr_amd(struct rdt_domain *d, struct msr_param *m,
+mba_wrmsr_amd(struct rdt_ctrl_domain *d, struct msr_param *m,
 	      struct rdt_resource *r);
 
 #define ctrl_domain_init(id) LIST_HEAD_INIT(rdt_resources_all[id].r_resctrl.ctrl_domains)
@@ -307,11 +307,11 @@ static void rdt_get_cdp_l2_config(void)
 }
 
 static void
-mba_wrmsr_amd(struct rdt_domain *d, struct msr_param *m, struct rdt_resource *r)
+mba_wrmsr_amd(struct rdt_ctrl_domain *d, struct msr_param *m, struct rdt_resource *r)
 {
-	unsigned int i;
-	struct rdt_hw_domain *hw_dom = resctrl_to_arch_dom(d);
+	struct rdt_hw_ctrl_domain *hw_dom = resctrl_to_arch_ctrl_dom(d);
 	struct rdt_hw_resource *hw_res = resctrl_to_arch_res(r);
+	unsigned int i;
 
 	for (i = m->low; i < m->high; i++)
 		wrmsrl(hw_res->msr_base + i, hw_dom->ctrl_val[i]);
@@ -332,12 +332,12 @@ static u32 delay_bw_map(unsigned long bw, struct rdt_resource *r)
 }
 
 static void
-mba_wrmsr_intel(struct rdt_domain *d, struct msr_param *m,
+mba_wrmsr_intel(struct rdt_ctrl_domain *d, struct msr_param *m,
 		struct rdt_resource *r)
 {
-	unsigned int i;
-	struct rdt_hw_domain *hw_dom = resctrl_to_arch_dom(d);
+	struct rdt_hw_ctrl_domain *hw_dom = resctrl_to_arch_ctrl_dom(d);
 	struct rdt_hw_resource *hw_res = resctrl_to_arch_res(r);
+	unsigned int i;
 
 	/*  Write the delay values for mba. */
 	for (i = m->low; i < m->high; i++)
@@ -345,19 +345,19 @@ mba_wrmsr_intel(struct rdt_domain *d, struct msr_param *m,
 }
 
 static void
-cat_wrmsr(struct rdt_domain *d, struct msr_param *m, struct rdt_resource *r)
+cat_wrmsr(struct rdt_ctrl_domain *d, struct msr_param *m, struct rdt_resource *r)
 {
-	unsigned int i;
-	struct rdt_hw_domain *hw_dom = resctrl_to_arch_dom(d);
+	struct rdt_hw_ctrl_domain *hw_dom = resctrl_to_arch_ctrl_dom(d);
 	struct rdt_hw_resource *hw_res = resctrl_to_arch_res(r);
+	unsigned int i;
 
 	for (i = m->low; i < m->high; i++)
 		wrmsrl(hw_res->msr_base + i, hw_dom->ctrl_val[i]);
 }
 
-struct rdt_domain *get_ctrl_domain_from_cpu(int cpu, struct rdt_resource *r)
+struct rdt_ctrl_domain *get_ctrl_domain_from_cpu(int cpu, struct rdt_resource *r)
 {
-	struct rdt_domain *d;
+	struct rdt_ctrl_domain *d;
 
 	list_for_each_entry(d, &r->ctrl_domains, hdr.list) {
 		/* Find the domain that contains this CPU */
@@ -379,7 +379,7 @@ void rdt_ctrl_update(void *arg)
 	struct rdt_hw_resource *hw_res = resctrl_to_arch_res(m->res);
 	struct rdt_resource *r = m->res;
 	int cpu = smp_processor_id();
-	struct rdt_domain *d;
+	struct rdt_ctrl_domain *d;
 
 	d = get_ctrl_domain_from_cpu(cpu, r);
 	if (d) {
@@ -434,18 +434,23 @@ static void setup_default_ctrlval(struct rdt_resource *r, u32 *dc)
 		*dc = r->default_ctrl;
 }
 
-static void domain_free(struct rdt_hw_domain *hw_dom)
+static void ctrl_domain_free(struct rdt_hw_ctrl_domain *hw_dom)
+{
+	kfree(hw_dom->ctrl_val);
+	kfree(hw_dom);
+}
+
+static void mon_domain_free(struct rdt_hw_mon_domain *hw_dom)
 {
 	kfree(hw_dom->arch_mbm_total);
 	kfree(hw_dom->arch_mbm_local);
-	kfree(hw_dom->ctrl_val);
 	kfree(hw_dom);
 }
 
-static int domain_setup_ctrlval(struct rdt_resource *r, struct rdt_domain *d)
+static int domain_setup_ctrlval(struct rdt_resource *r, struct rdt_ctrl_domain *d)
 {
+	struct rdt_hw_ctrl_domain *hw_dom = resctrl_to_arch_ctrl_dom(d);
 	struct rdt_hw_resource *hw_res = resctrl_to_arch_res(r);
-	struct rdt_hw_domain *hw_dom = resctrl_to_arch_dom(d);
 	struct msr_param m;
 	u32 *dc;
 
@@ -468,7 +473,7 @@ static int domain_setup_ctrlval(struct rdt_resource *r, struct rdt_domain *d)
  * @num_rmid:	The size of the MBM counter array
  * @hw_dom:	The domain that owns the allocated arrays
  */
-static int arch_domain_mbm_alloc(u32 num_rmid, struct rdt_hw_domain *hw_dom)
+static int arch_domain_mbm_alloc(u32 num_rmid, struct rdt_hw_mon_domain *hw_dom)
 {
 	size_t tsize;
 
@@ -507,10 +512,10 @@ static int get_domain_id_from_scope(int cpu, enum resctrl_scope scope)
 static void domain_add_cpu_ctrl(int cpu, struct rdt_resource *r)
 {
 	int id = get_domain_id_from_scope(cpu, r->ctrl_scope);
+	struct rdt_hw_ctrl_domain *hw_dom;
 	struct list_head *add_pos = NULL;
-	struct rdt_hw_domain *hw_dom;
 	struct rdt_domain_hdr *hdr;
-	struct rdt_domain *d;
+	struct rdt_ctrl_domain *d;
 	int err;
 
 	if (id < 0) {
@@ -524,7 +529,7 @@ static void domain_add_cpu_ctrl(int cpu, struct rdt_resource *r)
 		if (WARN_ON_ONCE(hdr->type != RESCTRL_CTRL_DOMAIN))
 			return;
 
-		d = container_of(hdr, struct rdt_domain, hdr);
+		d = container_of(hdr, struct rdt_ctrl_domain, hdr);
 
 		cpumask_set_cpu(cpu, &d->hdr.cpu_mask);
 		if (r->cache.arch_has_per_cpu_cfg)
@@ -544,7 +549,7 @@ static void domain_add_cpu_ctrl(int cpu, struct rdt_resource *r)
 	rdt_domain_reconfigure_cdp(r);
 
 	if (domain_setup_ctrlval(r, d)) {
-		domain_free(hw_dom);
+		ctrl_domain_free(hw_dom);
 		return;
 	}
 
@@ -553,17 +558,17 @@ static void domain_add_cpu_ctrl(int cpu, struct rdt_resource *r)
 	err = resctrl_online_ctrl_domain(r, d);
 	if (err) {
 		list_del(&d->hdr.list);
-		domain_free(hw_dom);
+		ctrl_domain_free(hw_dom);
 	}
 }
 
 static void domain_add_cpu_mon(int cpu, struct rdt_resource *r)
 {
 	int id = get_domain_id_from_scope(cpu, r->mon_scope);
+	struct rdt_hw_mon_domain *hw_dom;
 	struct list_head *add_pos = NULL;
-	struct rdt_hw_domain *hw_dom;
 	struct rdt_domain_hdr *hdr;
-	struct rdt_domain *d;
+	struct rdt_mon_domain *d;
 	int err;
 
 	if (id < 0) {
@@ -577,7 +582,7 @@ static void domain_add_cpu_mon(int cpu, struct rdt_resource *r)
 		if (WARN_ON_ONCE(hdr->type != RESCTRL_MON_DOMAIN))
 			return;
 
-		d = container_of(hdr, struct rdt_domain, hdr);
+		d = container_of(hdr, struct rdt_mon_domain, hdr);
 
 		cpumask_set_cpu(cpu, &d->hdr.cpu_mask);
 		return;
@@ -593,7 +598,7 @@ static void domain_add_cpu_mon(int cpu, struct rdt_resource *r)
 	cpumask_set_cpu(cpu, &d->hdr.cpu_mask);
 
 	if (arch_domain_mbm_alloc(r->num_rmid, hw_dom)) {
-		domain_free(hw_dom);
+		mon_domain_free(hw_dom);
 		return;
 	}
 
@@ -602,7 +607,7 @@ static void domain_add_cpu_mon(int cpu, struct rdt_resource *r)
 	err = resctrl_online_mon_domain(r, d);
 	if (err) {
 		list_del(&d->hdr.list);
-		domain_free(hw_dom);
+		mon_domain_free(hw_dom);
 	}
 }
 
@@ -620,9 +625,9 @@ static void domain_add_cpu(int cpu, struct rdt_resource *r)
 static void domain_remove_cpu_ctrl(int cpu, struct rdt_resource *r)
 {
 	int id = get_domain_id_from_scope(cpu, r->ctrl_scope);
-	struct rdt_hw_domain *hw_dom;
+	struct rdt_hw_ctrl_domain *hw_dom;
 	struct rdt_domain_hdr *hdr;
-	struct rdt_domain *d;
+	struct rdt_ctrl_domain *d;
 
 	if (id < 0)
 		return;
@@ -636,8 +641,8 @@ static void domain_remove_cpu_ctrl(int cpu, struct rdt_resource *r)
 	if (WARN_ON_ONCE(hdr->type != RESCTRL_CTRL_DOMAIN))
 		return;
 
-	d = container_of(hdr, struct rdt_domain, hdr);
-	hw_dom = resctrl_to_arch_dom(d);
+	d = container_of(hdr, struct rdt_ctrl_domain, hdr);
+	hw_dom = resctrl_to_arch_ctrl_dom(d);
 
 	cpumask_clear_cpu(cpu, &d->hdr.cpu_mask);
 	if (cpumask_empty(&d->hdr.cpu_mask)) {
@@ -645,12 +650,12 @@ static void domain_remove_cpu_ctrl(int cpu, struct rdt_resource *r)
 		list_del(&d->hdr.list);
 
 		/*
-		 * rdt_domain "d" is going to be freed below, so clear
+		 * rdt_ctrl_domain "d" is going to be freed below, so clear
 		 * its pointer from pseudo_lock_region struct.
 		 */
 		if (d->plr)
 			d->plr->d = NULL;
-		domain_free(hw_dom);
+		ctrl_domain_free(hw_dom);
 
 		return;
 	}
@@ -659,9 +664,9 @@ static void domain_remove_cpu_ctrl(int cpu, struct rdt_resource *r)
 static void domain_remove_cpu_mon(int cpu, struct rdt_resource *r)
 {
 	int id = get_domain_id_from_scope(cpu, r->mon_scope);
-	struct rdt_hw_domain *hw_dom;
+	struct rdt_hw_mon_domain *hw_dom;
 	struct rdt_domain_hdr *hdr;
-	struct rdt_domain *d;
+	struct rdt_mon_domain *d;
 
 	if (id < 0)
 		return;
@@ -675,14 +680,14 @@ static void domain_remove_cpu_mon(int cpu, struct rdt_resource *r)
 	if (WARN_ON_ONCE(hdr->type != RESCTRL_MON_DOMAIN))
 		return;
 
-	d = container_of(hdr, struct rdt_domain, hdr);
-	hw_dom = resctrl_to_arch_dom(d);
+	d = container_of(hdr, struct rdt_mon_domain, hdr);
+	hw_dom = resctrl_to_arch_mon_dom(d);
 
 	cpumask_clear_cpu(cpu, &d->hdr.cpu_mask);
 	if (cpumask_empty(&d->hdr.cpu_mask)) {
 		resctrl_offline_mon_domain(r, d);
 		list_del(&d->hdr.list);
-		domain_free(hw_dom);
+		mon_domain_free(hw_dom);
 
 		return;
 	}
diff --git a/arch/x86/kernel/cpu/resctrl/ctrlmondata.c b/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
index 0b4136c42762..08fc97ce4135 100644
--- a/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
+++ b/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
@@ -58,7 +58,7 @@ static bool bw_validate(char *buf, unsigned long *data, struct rdt_resource *r)
 }
 
 int parse_bw(struct rdt_parse_data *data, struct resctrl_schema *s,
-	     struct rdt_domain *d)
+	     struct rdt_ctrl_domain *d)
 {
 	struct resctrl_staged_config *cfg;
 	u32 closid = data->rdtgrp->closid;
@@ -137,7 +137,7 @@ static bool cbm_validate(char *buf, u32 *data, struct rdt_resource *r)
  * resource type.
  */
 int parse_cbm(struct rdt_parse_data *data, struct resctrl_schema *s,
-	      struct rdt_domain *d)
+	      struct rdt_ctrl_domain *d)
 {
 	struct rdtgroup *rdtgrp = data->rdtgrp;
 	struct resctrl_staged_config *cfg;
@@ -206,8 +206,8 @@ static int parse_line(char *line, struct resctrl_schema *s,
 	struct resctrl_staged_config *cfg;
 	struct rdt_resource *r = s->res;
 	struct rdt_parse_data data;
+	struct rdt_ctrl_domain *d;
 	char *dom = NULL, *id;
-	struct rdt_domain *d;
 	unsigned long dom_id;
 
 	if (rdtgrp->mode == RDT_MODE_PSEUDO_LOCKSETUP &&
@@ -267,11 +267,11 @@ static u32 get_config_index(u32 closid, enum resctrl_conf_type type)
 	}
 }
 
-static bool apply_config(struct rdt_hw_domain *hw_dom,
+static bool apply_config(struct rdt_hw_ctrl_domain *hw_dom,
 			 struct resctrl_staged_config *cfg, u32 idx,
 			 cpumask_var_t cpu_mask)
 {
-	struct rdt_domain *dom = &hw_dom->d_resctrl;
+	struct rdt_ctrl_domain *dom = &hw_dom->d_resctrl;
 
 	if (cfg->new_ctrl != hw_dom->ctrl_val[idx]) {
 		cpumask_set_cpu(cpumask_any(&dom->hdr.cpu_mask), cpu_mask);
@@ -283,11 +283,11 @@ static bool apply_config(struct rdt_hw_domain *hw_dom,
 	return false;
 }
 
-int resctrl_arch_update_one(struct rdt_resource *r, struct rdt_domain *d,
+int resctrl_arch_update_one(struct rdt_resource *r, struct rdt_ctrl_domain *d,
 			    u32 closid, enum resctrl_conf_type t, u32 cfg_val)
 {
+	struct rdt_hw_ctrl_domain *hw_dom = resctrl_to_arch_ctrl_dom(d);
 	struct rdt_hw_resource *hw_res = resctrl_to_arch_res(r);
-	struct rdt_hw_domain *hw_dom = resctrl_to_arch_dom(d);
 	u32 idx = get_config_index(closid, t);
 	struct msr_param msr_param;
 
@@ -307,11 +307,11 @@ int resctrl_arch_update_one(struct rdt_resource *r, struct rdt_domain *d,
 int resctrl_arch_update_domains(struct rdt_resource *r, u32 closid)
 {
 	struct resctrl_staged_config *cfg;
-	struct rdt_hw_domain *hw_dom;
+	struct rdt_hw_ctrl_domain *hw_dom;
 	struct msr_param msr_param;
+	struct rdt_ctrl_domain *d;
 	enum resctrl_conf_type t;
 	cpumask_var_t cpu_mask;
-	struct rdt_domain *d;
 	u32 idx;
 
 	if (!zalloc_cpumask_var(&cpu_mask, GFP_KERNEL))
@@ -319,7 +319,7 @@ int resctrl_arch_update_domains(struct rdt_resource *r, u32 closid)
 
 	msr_param.res = NULL;
 	list_for_each_entry(d, &r->ctrl_domains, hdr.list) {
-		hw_dom = resctrl_to_arch_dom(d);
+		hw_dom = resctrl_to_arch_ctrl_dom(d);
 		for (t = 0; t < CDP_NUM_TYPES; t++) {
 			cfg = &hw_dom->d_resctrl.staged_config[t];
 			if (!cfg->have_new_ctrl)
@@ -449,10 +449,10 @@ ssize_t rdtgroup_schemata_write(struct kernfs_open_file *of,
 	return ret ?: nbytes;
 }
 
-u32 resctrl_arch_get_config(struct rdt_resource *r, struct rdt_domain *d,
+u32 resctrl_arch_get_config(struct rdt_resource *r, struct rdt_ctrl_domain *d,
 			    u32 closid, enum resctrl_conf_type type)
 {
-	struct rdt_hw_domain *hw_dom = resctrl_to_arch_dom(d);
+	struct rdt_hw_ctrl_domain *hw_dom = resctrl_to_arch_ctrl_dom(d);
 	u32 idx = get_config_index(closid, type);
 
 	return hw_dom->ctrl_val[idx];
@@ -461,7 +461,7 @@ u32 resctrl_arch_get_config(struct rdt_resource *r, struct rdt_domain *d,
 static void show_doms(struct seq_file *s, struct resctrl_schema *schema, int closid)
 {
 	struct rdt_resource *r = schema->res;
-	struct rdt_domain *dom;
+	struct rdt_ctrl_domain *dom;
 	bool sep = false;
 	u32 ctrl_val;
 
@@ -523,7 +523,7 @@ int rdtgroup_schemata_show(struct kernfs_open_file *of,
 }
 
 void mon_event_read(struct rmid_read *rr, struct rdt_resource *r,
-		    struct rdt_domain *d, struct rdtgroup *rdtgrp,
+		    struct rdt_mon_domain *d, struct rdtgroup *rdtgrp,
 		    int evtid, int first)
 {
 	/*
@@ -543,11 +543,11 @@ int rdtgroup_mondata_show(struct seq_file *m, void *arg)
 {
 	struct kernfs_open_file *of = m->private;
 	struct rdt_domain_hdr *hdr;
+	struct rdt_mon_domain *d;
 	u32 resid, evtid, domid;
 	struct rdtgroup *rdtgrp;
 	struct rdt_resource *r;
 	union mon_data_bits md;
-	struct rdt_domain *d;
 	struct rmid_read rr;
 	int ret = 0;
 
@@ -568,7 +568,7 @@ int rdtgroup_mondata_show(struct seq_file *m, void *arg)
 		ret = -ENOENT;
 		goto out;
 	}
-	d = container_of(hdr, struct rdt_domain, hdr);
+	d = container_of(hdr, struct rdt_mon_domain, hdr);
 
 	mon_event_read(&rr, r, d, rdtgrp, evtid, false);
 
diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c
index ec5ad926c5dc..4e145f5620b0 100644
--- a/arch/x86/kernel/cpu/resctrl/monitor.c
+++ b/arch/x86/kernel/cpu/resctrl/monitor.c
@@ -170,7 +170,7 @@ static int __rmid_read(u32 rmid, enum resctrl_event_id eventid, u64 *val)
 	return 0;
 }
 
-static struct arch_mbm_state *get_arch_mbm_state(struct rdt_hw_domain *hw_dom,
+static struct arch_mbm_state *get_arch_mbm_state(struct rdt_hw_mon_domain *hw_dom,
 						 u32 rmid,
 						 enum resctrl_event_id eventid)
 {
@@ -189,10 +189,10 @@ static struct arch_mbm_state *get_arch_mbm_state(struct rdt_hw_domain *hw_dom,
 	return NULL;
 }
 
-void resctrl_arch_reset_rmid(struct rdt_resource *r, struct rdt_domain *d,
+void resctrl_arch_reset_rmid(struct rdt_resource *r, struct rdt_mon_domain *d,
 			     u32 rmid, enum resctrl_event_id eventid)
 {
-	struct rdt_hw_domain *hw_dom = resctrl_to_arch_dom(d);
+	struct rdt_hw_mon_domain *hw_dom = resctrl_to_arch_mon_dom(d);
 	struct arch_mbm_state *am;
 
 	am = get_arch_mbm_state(hw_dom, rmid, eventid);
@@ -208,9 +208,9 @@ void resctrl_arch_reset_rmid(struct rdt_resource *r, struct rdt_domain *d,
  * Assumes that hardware counters are also reset and thus that there is
  * no need to record initial non-zero counts.
  */
-void resctrl_arch_reset_rmid_all(struct rdt_resource *r, struct rdt_domain *d)
+void resctrl_arch_reset_rmid_all(struct rdt_resource *r, struct rdt_mon_domain *d)
 {
-	struct rdt_hw_domain *hw_dom = resctrl_to_arch_dom(d);
+	struct rdt_hw_mon_domain *hw_dom = resctrl_to_arch_mon_dom(d);
 
 	if (is_mbm_total_enabled())
 		memset(hw_dom->arch_mbm_total, 0,
@@ -229,11 +229,11 @@ static u64 mbm_overflow_count(u64 prev_msr, u64 cur_msr, unsigned int width)
 	return chunks >> shift;
 }
 
-int resctrl_arch_rmid_read(struct rdt_resource *r, struct rdt_domain *d,
+int resctrl_arch_rmid_read(struct rdt_resource *r, struct rdt_mon_domain *d,
 			   u32 rmid, enum resctrl_event_id eventid, u64 *val)
 {
+	struct rdt_hw_mon_domain *hw_dom = resctrl_to_arch_mon_dom(d);
 	struct rdt_hw_resource *hw_res = resctrl_to_arch_res(r);
-	struct rdt_hw_domain *hw_dom = resctrl_to_arch_dom(d);
 	struct arch_mbm_state *am;
 	u64 msr_val, chunks;
 	int ret;
@@ -266,7 +266,7 @@ int resctrl_arch_rmid_read(struct rdt_resource *r, struct rdt_domain *d,
  * decrement the count. If the busy count gets to zero on an RMID, we
  * free the RMID
  */
-void __check_limbo(struct rdt_domain *d, bool force_free)
+void __check_limbo(struct rdt_mon_domain *d, bool force_free)
 {
 	struct rdt_resource *r = &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl;
 	struct rmid_entry *entry;
@@ -305,7 +305,7 @@ void __check_limbo(struct rdt_domain *d, bool force_free)
 	}
 }
 
-bool has_busy_rmid(struct rdt_resource *r, struct rdt_domain *d)
+bool has_busy_rmid(struct rdt_resource *r, struct rdt_mon_domain *d)
 {
 	return find_first_bit(d->rmid_busy_llc, r->num_rmid) != r->num_rmid;
 }
@@ -334,7 +334,7 @@ int alloc_rmid(void)
 static void add_rmid_to_limbo(struct rmid_entry *entry)
 {
 	struct rdt_resource *r = &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl;
-	struct rdt_domain *d;
+	struct rdt_mon_domain *d;
 	int cpu, err;
 	u64 val = 0;
 
@@ -383,7 +383,7 @@ void free_rmid(u32 rmid)
 		list_add_tail(&entry->list, &rmid_free_lru);
 }
 
-static struct mbm_state *get_mbm_state(struct rdt_domain *d, u32 rmid,
+static struct mbm_state *get_mbm_state(struct rdt_mon_domain *d, u32 rmid,
 				       enum resctrl_event_id evtid)
 {
 	switch (evtid) {
@@ -516,13 +516,13 @@ void mon_event_count(void *info)
  * throttle MSRs already have low percentage values.  To avoid
  * unnecessarily restricting such rdtgroups, we also increase the bandwidth.
  */
-static void update_mba_bw(struct rdtgroup *rgrp, struct rdt_domain *dom_mbm)
+static void update_mba_bw(struct rdtgroup *rgrp, struct rdt_mon_domain *dom_mbm)
 {
 	u32 closid, rmid, cur_msr_val, new_msr_val;
 	struct mbm_state *pmbm_data, *cmbm_data;
+	struct rdt_ctrl_domain *dom_mba;
 	u32 cur_bw, delta_bw, user_bw;
 	struct rdt_resource *r_mba;
-	struct rdt_domain *dom_mba;
 	struct list_head *head;
 	struct rdtgroup *entry;
 
@@ -600,7 +600,7 @@ static void update_mba_bw(struct rdtgroup *rgrp, struct rdt_domain *dom_mbm)
 	}
 }
 
-static void mbm_update(struct rdt_resource *r, struct rdt_domain *d, int rmid)
+static void mbm_update(struct rdt_resource *r, struct rdt_mon_domain *d, int rmid)
 {
 	struct rmid_read rr;
 
@@ -640,13 +640,13 @@ void cqm_handle_limbo(struct work_struct *work)
 {
 	unsigned long delay = msecs_to_jiffies(CQM_LIMBOCHECK_INTERVAL);
 	int cpu = smp_processor_id();
+	struct rdt_mon_domain *d;
 	struct rdt_resource *r;
-	struct rdt_domain *d;
 
 	mutex_lock(&rdtgroup_mutex);
 
 	r = &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl;
-	d = container_of(work, struct rdt_domain, cqm_limbo.work);
+	d = container_of(work, struct rdt_mon_domain, cqm_limbo.work);
 
 	__check_limbo(d, false);
 
@@ -656,7 +656,7 @@ void cqm_handle_limbo(struct work_struct *work)
 	mutex_unlock(&rdtgroup_mutex);
 }
 
-void cqm_setup_limbo_handler(struct rdt_domain *dom, unsigned long delay_ms)
+void cqm_setup_limbo_handler(struct rdt_mon_domain *dom, unsigned long delay_ms)
 {
 	unsigned long delay = msecs_to_jiffies(delay_ms);
 	int cpu;
@@ -672,9 +672,9 @@ void mbm_handle_overflow(struct work_struct *work)
 	unsigned long delay = msecs_to_jiffies(MBM_OVERFLOW_INTERVAL);
 	struct rdtgroup *prgrp, *crgrp;
 	int cpu = smp_processor_id();
+	struct rdt_mon_domain *d;
 	struct list_head *head;
 	struct rdt_resource *r;
-	struct rdt_domain *d;
 
 	mutex_lock(&rdtgroup_mutex);
 
@@ -682,7 +682,7 @@ void mbm_handle_overflow(struct work_struct *work)
 		goto out_unlock;
 
 	r = &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl;
-	d = container_of(work, struct rdt_domain, mbm_over.work);
+	d = container_of(work, struct rdt_mon_domain, mbm_over.work);
 
 	list_for_each_entry(prgrp, &rdt_all_groups, rdtgroup_list) {
 		mbm_update(r, d, prgrp->mon.rmid);
@@ -701,7 +701,7 @@ void mbm_handle_overflow(struct work_struct *work)
 	mutex_unlock(&rdtgroup_mutex);
 }
 
-void mbm_setup_overflow_handler(struct rdt_domain *dom, unsigned long delay_ms)
+void mbm_setup_overflow_handler(struct rdt_mon_domain *dom, unsigned long delay_ms)
 {
 	unsigned long delay = msecs_to_jiffies(delay_ms);
 	int cpu;
diff --git a/arch/x86/kernel/cpu/resctrl/pseudo_lock.c b/arch/x86/kernel/cpu/resctrl/pseudo_lock.c
index ed6d59af1cef..08d35f828bc3 100644
--- a/arch/x86/kernel/cpu/resctrl/pseudo_lock.c
+++ b/arch/x86/kernel/cpu/resctrl/pseudo_lock.c
@@ -814,7 +814,7 @@ int rdtgroup_locksetup_exit(struct rdtgroup *rdtgrp)
  * Return: true if @cbm overlaps with pseudo-locked region on @d, false
  * otherwise.
  */
-bool rdtgroup_cbm_overlaps_pseudo_locked(struct rdt_domain *d, unsigned long cbm)
+bool rdtgroup_cbm_overlaps_pseudo_locked(struct rdt_ctrl_domain *d, unsigned long cbm)
 {
 	unsigned int cbm_len;
 	unsigned long cbm_b;
@@ -841,11 +841,11 @@ bool rdtgroup_cbm_overlaps_pseudo_locked(struct rdt_domain *d, unsigned long cbm
  *         if it is not possible to test due to memory allocation issue,
  *         false otherwise.
  */
-bool rdtgroup_pseudo_locked_in_hierarchy(struct rdt_domain *d)
+bool rdtgroup_pseudo_locked_in_hierarchy(struct rdt_ctrl_domain *d)
 {
+	struct rdt_ctrl_domain *d_i;
 	cpumask_var_t cpu_with_psl;
 	struct rdt_resource *r;
-	struct rdt_domain *d_i;
 	bool ret = false;
 
 	if (!zalloc_cpumask_var(&cpu_with_psl, GFP_KERNEL))
diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
index 760013ed1bff..21bbd832f3f2 100644
--- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
+++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
@@ -85,8 +85,8 @@ void rdt_last_cmd_printf(const char *fmt, ...)
 
 void rdt_staged_configs_clear(void)
 {
+	struct rdt_ctrl_domain *dom;
 	struct rdt_resource *r;
-	struct rdt_domain *dom;
 
 	lockdep_assert_held(&rdtgroup_mutex);
 
@@ -976,7 +976,7 @@ static int rdt_bit_usage_show(struct kernfs_open_file *of,
 	unsigned long sw_shareable = 0, hw_shareable = 0;
 	unsigned long exclusive = 0, pseudo_locked = 0;
 	struct rdt_resource *r = s->res;
-	struct rdt_domain *dom;
+	struct rdt_ctrl_domain *dom;
 	int i, hwb, swb, excl, psl;
 	enum rdtgrp_mode mode;
 	bool sep = false;
@@ -1205,7 +1205,7 @@ static int rdt_has_sparse_bitmasks_show(struct kernfs_open_file *of,
  *
  * Return: false if CBM does not overlap, true if it does.
  */
-static bool __rdtgroup_cbm_overlaps(struct rdt_resource *r, struct rdt_domain *d,
+static bool __rdtgroup_cbm_overlaps(struct rdt_resource *r, struct rdt_ctrl_domain *d,
 				    unsigned long cbm, int closid,
 				    enum resctrl_conf_type type, bool exclusive)
 {
@@ -1260,7 +1260,7 @@ static bool __rdtgroup_cbm_overlaps(struct rdt_resource *r, struct rdt_domain *d
  *
  * Return: true if CBM overlap detected, false if there is no overlap
  */
-bool rdtgroup_cbm_overlaps(struct resctrl_schema *s, struct rdt_domain *d,
+bool rdtgroup_cbm_overlaps(struct resctrl_schema *s, struct rdt_ctrl_domain *d,
 			   unsigned long cbm, int closid, bool exclusive)
 {
 	enum resctrl_conf_type peer_type = resctrl_peer_type(s->conf_type);
@@ -1291,10 +1291,10 @@ bool rdtgroup_cbm_overlaps(struct resctrl_schema *s, struct rdt_domain *d,
 static bool rdtgroup_mode_test_exclusive(struct rdtgroup *rdtgrp)
 {
 	int closid = rdtgrp->closid;
+	struct rdt_ctrl_domain *d;
 	struct resctrl_schema *s;
 	struct rdt_resource *r;
 	bool has_cache = false;
-	struct rdt_domain *d;
 	u32 ctrl;
 
 	list_for_each_entry(s, &resctrl_schema_all, list) {
@@ -1407,7 +1407,7 @@ static ssize_t rdtgroup_mode_write(struct kernfs_open_file *of,
  * bitmap functions work correctly.
  */
 unsigned int rdtgroup_cbm_to_size(struct rdt_resource *r,
-				  struct rdt_domain *d, unsigned long cbm)
+				  struct rdt_ctrl_domain *d, unsigned long cbm)
 {
 	struct cpu_cacheinfo *ci;
 	unsigned int size = 0;
@@ -1439,9 +1439,9 @@ static int rdtgroup_size_show(struct kernfs_open_file *of,
 {
 	struct resctrl_schema *schema;
 	enum resctrl_conf_type type;
+	struct rdt_ctrl_domain *d;
 	struct rdtgroup *rdtgrp;
 	struct rdt_resource *r;
-	struct rdt_domain *d;
 	unsigned int size;
 	int ret = 0;
 	u32 closid;
@@ -1553,7 +1553,7 @@ static void mon_event_config_read(void *info)
 	mon_info->mon_config = msrval & MAX_EVT_CONFIG_BITS;
 }
 
-static void mondata_config_read(struct rdt_domain *d, struct mon_config_info *mon_info)
+static void mondata_config_read(struct rdt_mon_domain *d, struct mon_config_info *mon_info)
 {
 	smp_call_function_any(&d->hdr.cpu_mask, mon_event_config_read, mon_info, 1);
 }
@@ -1561,7 +1561,7 @@ static void mondata_config_read(struct rdt_domain *d, struct mon_config_info *mo
 static int mbm_config_show(struct seq_file *s, struct rdt_resource *r, u32 evtid)
 {
 	struct mon_config_info mon_info = {0};
-	struct rdt_domain *dom;
+	struct rdt_mon_domain *dom;
 	bool sep = false;
 
 	mutex_lock(&rdtgroup_mutex);
@@ -1618,7 +1618,7 @@ static void mon_event_config_write(void *info)
 }
 
 static int mbm_config_write_domain(struct rdt_resource *r,
-				   struct rdt_domain *d, u32 evtid, u32 val)
+				   struct rdt_mon_domain *d, u32 evtid, u32 val)
 {
 	struct mon_config_info mon_info = {0};
 	int ret = 0;
@@ -1668,7 +1668,7 @@ static int mon_config_write(struct rdt_resource *r, char *tok, u32 evtid)
 {
 	char *dom_str = NULL, *id_str;
 	unsigned long dom_id, val;
-	struct rdt_domain *d;
+	struct rdt_mon_domain *d;
 	int ret = 0;
 
 next:
@@ -2216,9 +2216,9 @@ static inline bool is_mba_linear(void)
 static int set_cache_qos_cfg(int level, bool enable)
 {
 	void (*update)(void *arg);
+	struct rdt_ctrl_domain *d;
 	struct rdt_resource *r_l;
 	cpumask_var_t cpu_mask;
-	struct rdt_domain *d;
 	int cpu;
 
 	if (level == RDT_RESOURCE_L3)
@@ -2265,7 +2265,7 @@ void rdt_domain_reconfigure_cdp(struct rdt_resource *r)
 		l3_qos_cfg_update(&hw_res->cdp_enabled);
 }
 
-static int mba_sc_domain_allocate(struct rdt_resource *r, struct rdt_domain *d)
+static int mba_sc_domain_allocate(struct rdt_resource *r, struct rdt_ctrl_domain *d)
 {
 	u32 num_closid = resctrl_arch_get_num_closid(r);
 	int cpu = cpumask_any(&d->hdr.cpu_mask);
@@ -2283,7 +2283,7 @@ static int mba_sc_domain_allocate(struct rdt_resource *r, struct rdt_domain *d)
 }
 
 static void mba_sc_domain_destroy(struct rdt_resource *r,
-				  struct rdt_domain *d)
+				  struct rdt_ctrl_domain *d)
 {
 	kfree(d->mbps_val);
 	d->mbps_val = NULL;
@@ -2309,7 +2309,7 @@ static int set_mba_sc(bool mba_sc)
 {
 	struct rdt_resource *r = &rdt_resources_all[RDT_RESOURCE_MBA].r_resctrl;
 	u32 num_closid = resctrl_arch_get_num_closid(r);
-	struct rdt_domain *d;
+	struct rdt_ctrl_domain *d;
 	int i;
 
 	if (!supports_mba_mbps() || mba_sc == is_mba_sc(r))
@@ -2578,7 +2578,7 @@ static int rdt_get_tree(struct fs_context *fc)
 {
 	struct rdt_fs_context *ctx = rdt_fc2context(fc);
 	unsigned long flags = RFTYPE_CTRL_BASE;
-	struct rdt_domain *dom;
+	struct rdt_mon_domain *dom;
 	struct rdt_resource *r;
 	int ret;
 
@@ -2762,10 +2762,10 @@ static int rdt_init_fs_context(struct fs_context *fc)
 static int reset_all_ctrls(struct rdt_resource *r)
 {
 	struct rdt_hw_resource *hw_res = resctrl_to_arch_res(r);
-	struct rdt_hw_domain *hw_dom;
+	struct rdt_hw_ctrl_domain *hw_dom;
 	struct msr_param msr_param;
+	struct rdt_ctrl_domain *d;
 	cpumask_var_t cpu_mask;
-	struct rdt_domain *d;
 	int i;
 
 	if (!zalloc_cpumask_var(&cpu_mask, GFP_KERNEL))
@@ -2781,7 +2781,7 @@ static int reset_all_ctrls(struct rdt_resource *r)
 	 * from each domain to update the MSRs below.
 	 */
 	list_for_each_entry(d, &r->ctrl_domains, hdr.list) {
-		hw_dom = resctrl_to_arch_dom(d);
+		hw_dom = resctrl_to_arch_ctrl_dom(d);
 		cpumask_set_cpu(cpumask_any(&d->hdr.cpu_mask), cpu_mask);
 
 		for (i = 0; i < hw_res->num_closid; i++)
@@ -2976,7 +2976,7 @@ static void rmdir_mondata_subdir_allrdtgrp(struct rdt_resource *r,
 }
 
 static int mkdir_mondata_subdir(struct kernfs_node *parent_kn,
-				struct rdt_domain *d,
+				struct rdt_mon_domain *d,
 				struct rdt_resource *r, struct rdtgroup *prgrp)
 {
 	union mon_data_bits priv;
@@ -3025,7 +3025,7 @@ static int mkdir_mondata_subdir(struct kernfs_node *parent_kn,
  * and "monitor" groups with given domain id.
  */
 static void mkdir_mondata_subdir_allrdtgrp(struct rdt_resource *r,
-					   struct rdt_domain *d)
+					   struct rdt_mon_domain *d)
 {
 	struct kernfs_node *parent_kn;
 	struct rdtgroup *prgrp, *crgrp;
@@ -3047,7 +3047,7 @@ static int mkdir_mondata_subdir_alldom(struct kernfs_node *parent_kn,
 				       struct rdt_resource *r,
 				       struct rdtgroup *prgrp)
 {
-	struct rdt_domain *dom;
+	struct rdt_mon_domain *dom;
 	int ret;
 
 	list_for_each_entry(dom, &r->mon_domains, hdr.list) {
@@ -3149,7 +3149,7 @@ static u32 cbm_ensure_valid(u32 _val, struct rdt_resource *r)
  * Set the RDT domain up to start off with all usable allocations. That is,
  * all shareable and unused bits. All-zero CBM is invalid.
  */
-static int __init_one_rdt_domain(struct rdt_domain *d, struct resctrl_schema *s,
+static int __init_one_rdt_domain(struct rdt_ctrl_domain *d, struct resctrl_schema *s,
 				 u32 closid)
 {
 	enum resctrl_conf_type peer_type = resctrl_peer_type(s->conf_type);
@@ -3229,7 +3229,7 @@ static int __init_one_rdt_domain(struct rdt_domain *d, struct resctrl_schema *s,
  */
 static int rdtgroup_init_cat(struct resctrl_schema *s, u32 closid)
 {
-	struct rdt_domain *d;
+	struct rdt_ctrl_domain *d;
 	int ret;
 
 	list_for_each_entry(d, &s->res->ctrl_domains, hdr.list) {
@@ -3245,7 +3245,7 @@ static int rdtgroup_init_cat(struct resctrl_schema *s, u32 closid)
 static void rdtgroup_init_mba(struct rdt_resource *r, u32 closid)
 {
 	struct resctrl_staged_config *cfg;
-	struct rdt_domain *d;
+	struct rdt_ctrl_domain *d;
 
 	list_for_each_entry(d, &r->ctrl_domains, hdr.list) {
 		if (is_mba_sc(r)) {
@@ -3842,14 +3842,14 @@ static void __init rdtgroup_setup_default(void)
 	mutex_unlock(&rdtgroup_mutex);
 }
 
-static void domain_destroy_mon_state(struct rdt_domain *d)
+static void domain_destroy_mon_state(struct rdt_mon_domain *d)
 {
 	bitmap_free(d->rmid_busy_llc);
 	kfree(d->mbm_total);
 	kfree(d->mbm_local);
 }
 
-void resctrl_offline_ctrl_domain(struct rdt_resource *r, struct rdt_domain *d)
+void resctrl_offline_ctrl_domain(struct rdt_resource *r, struct rdt_ctrl_domain *d)
 {
 	lockdep_assert_held(&rdtgroup_mutex);
 
@@ -3857,7 +3857,7 @@ void resctrl_offline_ctrl_domain(struct rdt_resource *r, struct rdt_domain *d)
 		mba_sc_domain_destroy(r, d);
 }
 
-void resctrl_offline_mon_domain(struct rdt_resource *r, struct rdt_domain *d)
+void resctrl_offline_mon_domain(struct rdt_resource *r, struct rdt_mon_domain *d)
 {
 	lockdep_assert_held(&rdtgroup_mutex);
 
@@ -3886,7 +3886,7 @@ void resctrl_offline_mon_domain(struct rdt_resource *r, struct rdt_domain *d)
 	domain_destroy_mon_state(d);
 }
 
-static int domain_setup_mon_state(struct rdt_resource *r, struct rdt_domain *d)
+static int domain_setup_mon_state(struct rdt_resource *r, struct rdt_mon_domain *d)
 {
 	size_t tsize;
 
@@ -3916,7 +3916,7 @@ static int domain_setup_mon_state(struct rdt_resource *r, struct rdt_domain *d)
 	return 0;
 }
 
-int resctrl_online_ctrl_domain(struct rdt_resource *r, struct rdt_domain *d)
+int resctrl_online_ctrl_domain(struct rdt_resource *r, struct rdt_ctrl_domain *d)
 {
 	lockdep_assert_held(&rdtgroup_mutex);
 
@@ -3926,7 +3926,7 @@ int resctrl_online_ctrl_domain(struct rdt_resource *r, struct rdt_domain *d)
 	return 0;
 }
 
-int resctrl_online_mon_domain(struct rdt_resource *r, struct rdt_domain *d)
+int resctrl_online_mon_domain(struct rdt_resource *r, struct rdt_mon_domain *d)
 {
 	int err;
 
-- 
2.41.0


^ permalink raw reply related	[flat|nested] 331+ messages in thread

* [PATCH v11 5/8] x86/resctrl: Add node-scope to the options for feature scope
  2023-11-09 23:09             ` [PATCH v11 0/8] Add support for Sub-NUMA cluster (SNC) systems Tony Luck
                                 ` (3 preceding siblings ...)
  2023-11-09 23:09               ` [PATCH v11 4/8] x86/resctrl: Split the rdt_domain and rdt_hw_domain structures Tony Luck
@ 2023-11-09 23:09               ` Tony Luck
  2023-11-20 22:22                 ` Reinette Chatre
  2023-11-09 23:09               ` [PATCH v11 6/8] x86/resctrl: Introduce snc_nodes_per_l3_cache Tony Luck
                                 ` (4 subsequent siblings)
  9 siblings, 1 reply; 331+ messages in thread
From: Tony Luck @ 2023-11-09 23:09 UTC (permalink / raw)
  To: Fenghua Yu, Reinette Chatre, Peter Newman, Jonathan Corbet,
	Shuah Khan, x86
  Cc: Shaopeng Tan, James Morse, Jamie Iles, Babu Moger, Randy Dunlap,
	linux-kernel, linux-doc, patches, Tony Luck

All currently supported resctrl features are domain scoped at the same
granularity as the L2 or L3 caches.

Add RESCTRL_NODE as a new option for features that are scoped at the
same granularity as NUMA nodes. This is needed for Intel's Sub-NUMA
Cluster (SNC) feature where monitoring features are node scoped.
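
For illustration only (a sketch, not a hunk in this patch): a later patch
in this series switches the L3 monitoring scope to the new value when SNC
is detected, so the domain id used for monitoring becomes the NUMA node id:

    rdt_resources_all[RDT_RESOURCE_L3].r_resctrl.mon_scope = RESCTRL_NODE;
    /* get_domain_id_from_scope(cpu, RESCTRL_NODE) returns cpu_to_node(cpu) */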

Reviewed-by: Peter Newman <peternewman@google.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
---
 include/linux/resctrl.h            | 1 +
 arch/x86/kernel/cpu/resctrl/core.c | 2 ++
 2 files changed, 3 insertions(+)

diff --git a/include/linux/resctrl.h b/include/linux/resctrl.h
index 058a940c3239..b8a3a11b970d 100644
--- a/include/linux/resctrl.h
+++ b/include/linux/resctrl.h
@@ -170,6 +170,7 @@ struct resctrl_schema;
 enum resctrl_scope {
 	RESCTRL_L2_CACHE = 2,
 	RESCTRL_L3_CACHE = 3,
+	RESCTRL_NODE,
 };
 
 /**
diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
index cee8b87566fa..0f2b5ee429b0 100644
--- a/arch/x86/kernel/cpu/resctrl/core.c
+++ b/arch/x86/kernel/cpu/resctrl/core.c
@@ -502,6 +502,8 @@ static int get_domain_id_from_scope(int cpu, enum resctrl_scope scope)
 	case RESCTRL_L2_CACHE:
 	case RESCTRL_L3_CACHE:
 		return get_cpu_cacheinfo_id(cpu, scope);
+	case RESCTRL_NODE:
+		return cpu_to_node(cpu);
 	default:
 		break;
 	}
-- 
2.41.0


^ permalink raw reply related	[flat|nested] 331+ messages in thread

* [PATCH v11 6/8] x86/resctrl: Introduce snc_nodes_per_l3_cache
  2023-11-09 23:09             ` [PATCH v11 0/8] Add support for Sub-NUMA cluster (SNC) systems Tony Luck
                                 ` (4 preceding siblings ...)
  2023-11-09 23:09               ` [PATCH v11 5/8] x86/resctrl: Add node-scope to the options for feature scope Tony Luck
@ 2023-11-09 23:09               ` Tony Luck
  2023-11-20 22:23                 ` Reinette Chatre
  2023-11-09 23:09               ` [PATCH v11 7/8] x86/resctrl: Sub NUMA Cluster detection and enable Tony Luck
                                 ` (3 subsequent siblings)
  9 siblings, 1 reply; 331+ messages in thread
From: Tony Luck @ 2023-11-09 23:09 UTC (permalink / raw)
  To: Fenghua Yu, Reinette Chatre, Peter Newman, Jonathan Corbet,
	Shuah Khan, x86
  Cc: Shaopeng Tan, James Morse, Jamie Iles, Babu Moger, Randy Dunlap,
	linux-kernel, linux-doc, patches, Tony Luck

Intel Sub-NUMA Cluster (SNC) is a feature that subdivides the CPU cores
and memory controllers on a socket into two or more groups. These are
presented to the operating system as NUMA nodes.

This may enable some workloads to have slightly lower latency to memory
as the memory controller(s) in an SNC node are electrically closer to the
CPU cores on that SNC node. This latency benefit may be offset by lower
bandwidth since the memory accesses for each core can only be interleaved
between the memory controllers on the same SNC node.

Resctrl monitoring on an Intel system depends upon attaching RMIDs to tasks
to track L3 cache occupancy and memory bandwidth. There is an MSR that
controls how the RMIDs are shared between SNC nodes.

The default mode divides them numerically. E.g. when there are two SNC
nodes on a socket the lower number half of the RMIDs are given to the
first node, the remainder to the second node. This would be difficult
to use with the Linux resctrl interface as specific RMID values assigned
to resctrl groups are not visible to users.

The other mode divides the RMIDs and renumbers the ones on the second
SNC node to start from zero.
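
The mapping from a resctrl logical RMID to the physical counter can be
sketched as below (physical_rmid and logical_rmid are illustrative names
only; the real adjustment is made in __rmid_read() in the diff below,
with r->num_rmid already divided by snc_nodes_per_l3_cache):

    physical_rmid = logical_rmid +
                    (cpu_to_node(cpu) % snc_nodes_per_l3_cache) * r->num_rmid;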

Even with this renumbering SNC mode requires several changes in resctrl
behavior for correct operation.

Add a global integer "snc_nodes_per_l3_cache" that shows how many
SNC nodes share each L3 cache. When this is "1", SNC mode is either
not implemented or not enabled. All places that need to check it
are updated to take appropriate action when SNC mode is enabled.

Code that needs to take action when SNC is enabled is:
1) The number of logical RMIDs per L3 cache available for use is the
   number of physical RMIDs divided by the number of SNC nodes.
2) Likewise the "mon_scale" value must be divided by the number of SNC
   nodes.
3) The RMID renumbering operates when using the value from the
   IA32_PQR_ASSOC MSR to count accesses by a task. When reading an RMID
   counter, code must adjust from the logical RMID used to the physical
   RMID value for the SNC node that it wishes to read and load the
   adjusted value into the IA32_QM_EVTSEL MSR.
4) The L3 cache is divided between the SNC nodes. So the value
   reported in the resctrl "size" file is divided by the number of SNC
   nodes because the effective amount of cache that can be allocated
   is reduced by that factor.
5) The "-o mba_MBps" mount option must be disabled in SNC mode
   because the monitoring is being done per SNC node, while the
   bandwidth allocation is still done at the L3 cache scope.
   Trying to use this feedback loop might result in contradictory
   changes to the throttling level coming from each of the SNC
   node bandwidth measurements.

Reviewed-by: Peter Newman <peternewman@google.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
---
 arch/x86/kernel/cpu/resctrl/internal.h |  2 ++
 arch/x86/kernel/cpu/resctrl/core.c     |  6 ++++++
 arch/x86/kernel/cpu/resctrl/monitor.c  | 16 +++++++++++++---
 arch/x86/kernel/cpu/resctrl/rdtgroup.c |  5 +++--
 4 files changed, 24 insertions(+), 5 deletions(-)

diff --git a/arch/x86/kernel/cpu/resctrl/internal.h b/arch/x86/kernel/cpu/resctrl/internal.h
index ce3a70657842..e7a75a439c16 100644
--- a/arch/x86/kernel/cpu/resctrl/internal.h
+++ b/arch/x86/kernel/cpu/resctrl/internal.h
@@ -446,6 +446,8 @@ DECLARE_STATIC_KEY_FALSE(rdt_alloc_enable_key);
 
 extern struct dentry *debugfs_resctrl;
 
+extern unsigned int snc_nodes_per_l3_cache;
+
 enum resctrl_res_level {
 	RDT_RESOURCE_L3,
 	RDT_RESOURCE_L2,
diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
index 0f2b5ee429b0..f10b68b45342 100644
--- a/arch/x86/kernel/cpu/resctrl/core.c
+++ b/arch/x86/kernel/cpu/resctrl/core.c
@@ -48,6 +48,12 @@ int max_name_width, max_data_width;
  */
 bool rdt_alloc_capable;
 
+/*
+ * Number of SNC nodes that share each L3 cache.  Default is 1 for
+ * systems that do not support SNC, or have SNC disabled.
+ */
+unsigned int snc_nodes_per_l3_cache = 1;
+
 static void
 mba_wrmsr_intel(struct rdt_ctrl_domain *d, struct msr_param *m,
 		struct rdt_resource *r);
diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c
index 4e145f5620b0..30b7c3b9b517 100644
--- a/arch/x86/kernel/cpu/resctrl/monitor.c
+++ b/arch/x86/kernel/cpu/resctrl/monitor.c
@@ -148,8 +148,18 @@ static inline struct rmid_entry *__rmid_entry(u32 rmid)
 
 static int __rmid_read(u32 rmid, enum resctrl_event_id eventid, u64 *val)
 {
+	struct rdt_resource *r = &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl;
+	int cpu = smp_processor_id();
+	int rmid_offset = 0;
 	u64 msr_val;
 
+	/*
+	 * When SNC mode is on, need to compute the offset to read the
+	 * physical RMID counter for the node to which this CPU belongs.
+	 */
+	if (snc_nodes_per_l3_cache > 1)
+		rmid_offset = (cpu_to_node(cpu) % snc_nodes_per_l3_cache) * r->num_rmid;
+
 	/*
 	 * As per the SDM, when IA32_QM_EVTSEL.EvtID (bits 7:0) is configured
 	 * with a valid event code for supported resource type and the bits
@@ -158,7 +168,7 @@ static int __rmid_read(u32 rmid, enum resctrl_event_id eventid, u64 *val)
 	 * IA32_QM_CTR.Error (bit 63) and IA32_QM_CTR.Unavailable (bit 62)
 	 * are error bits.
 	 */
-	wrmsr(MSR_IA32_QM_EVTSEL, eventid, rmid);
+	wrmsr(MSR_IA32_QM_EVTSEL, eventid, rmid + rmid_offset);
 	rdmsrl(MSR_IA32_QM_CTR, msr_val);
 
 	if (msr_val & RMID_VAL_ERROR)
@@ -783,8 +793,8 @@ int __init rdt_get_mon_l3_config(struct rdt_resource *r)
 	int ret;
 
 	resctrl_rmid_realloc_limit = boot_cpu_data.x86_cache_size * 1024;
-	hw_res->mon_scale = boot_cpu_data.x86_cache_occ_scale;
-	r->num_rmid = boot_cpu_data.x86_cache_max_rmid + 1;
+	hw_res->mon_scale = boot_cpu_data.x86_cache_occ_scale / snc_nodes_per_l3_cache;
+	r->num_rmid = (boot_cpu_data.x86_cache_max_rmid + 1) / snc_nodes_per_l3_cache;
 	hw_res->mbm_width = MBM_CNTR_WIDTH_BASE;
 
 	if (mbm_offset > 0 && mbm_offset <= MBM_CNTR_WIDTH_OFFSET_MAX)
diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
index 21bbd832f3f2..79d57dade568 100644
--- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
+++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
@@ -1425,7 +1425,7 @@ unsigned int rdtgroup_cbm_to_size(struct rdt_resource *r,
 		}
 	}
 
-	return size;
+	return size / snc_nodes_per_l3_cache;
 }
 
 /*
@@ -2298,7 +2298,8 @@ static bool supports_mba_mbps(void)
 	struct rdt_resource *r = &rdt_resources_all[RDT_RESOURCE_MBA].r_resctrl;
 
 	return (is_mbm_local_enabled() &&
-		r->alloc_capable && is_mba_linear());
+		r->alloc_capable && is_mba_linear() &&
+		snc_nodes_per_l3_cache == 1);
 }
 
 /*
-- 
2.41.0


^ permalink raw reply related	[flat|nested] 331+ messages in thread

* [PATCH v11 7/8] x86/resctrl: Sub NUMA Cluster detection and enable
  2023-11-09 23:09             ` [PATCH v11 0/8] Add support for Sub-NUMA cluster (SNC) systems Tony Luck
                                 ` (5 preceding siblings ...)
  2023-11-09 23:09               ` [PATCH v11 6/8] x86/resctrl: Introduce snc_nodes_per_l3_cache Tony Luck
@ 2023-11-09 23:09               ` Tony Luck
  2023-11-20 22:24                 ` Reinette Chatre
  2023-11-09 23:09               ` [PATCH v11 8/8] x86/resctrl: Update documentation with Sub-NUMA cluster changes Tony Luck
                                 ` (2 subsequent siblings)
  9 siblings, 1 reply; 331+ messages in thread
From: Tony Luck @ 2023-11-09 23:09 UTC (permalink / raw)
  To: Fenghua Yu, Reinette Chatre, Peter Newman, Jonathan Corbet,
	Shuah Khan, x86
  Cc: Shaopeng Tan, James Morse, Jamie Iles, Babu Moger, Randy Dunlap,
	linux-kernel, linux-doc, patches, Tony Luck

There isn't a simple hardware bit that indicates whether a CPU is
running in Sub NUMA Cluster (SNC) mode. Infer the state from the
ratio of NUMA nodes to L3 cache instances.
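
For example, on a hypothetical two-socket system with SNC2 enabled the
kernel sees four CPU NUMA nodes but only two L3 cache instances, giving
4 / 2 = 2 SNC nodes per L3 cache (memory-only NUMA nodes are excluded
from the count). With SNC disabled the ratio is 1.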

When SNC mode is detected, reconfigure the RMID counters by updating
the MSR_RMID_SNC_CONFIG MSR on each socket as CPUs are seen.

Clearing bit zero of the MSR divides the RMIDs and renumbers the ones
on the second SNC node to start from zero.

Reviewed-by: Peter Newman <peternewman@google.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
---
Changes since v10:
Reinette: Revert two places where I'd globally swapped "h/w" for
	  "hardware" in comments for functions that were not touched by this
	  patch.

 arch/x86/include/asm/msr-index.h   |  1 +
 arch/x86/kernel/cpu/resctrl/core.c | 96 ++++++++++++++++++++++++++++++
 2 files changed, 97 insertions(+)

diff --git a/arch/x86/include/asm/msr-index.h b/arch/x86/include/asm/msr-index.h
index e3fa9cecd599..4285a5ee81fe 100644
--- a/arch/x86/include/asm/msr-index.h
+++ b/arch/x86/include/asm/msr-index.h
@@ -1109,6 +1109,7 @@
 #define MSR_IA32_QM_CTR			0xc8e
 #define MSR_IA32_PQR_ASSOC		0xc8f
 #define MSR_IA32_L3_CBM_BASE		0xc90
+#define MSR_RMID_SNC_CONFIG		0xca0
 #define MSR_IA32_L2_CBM_BASE		0xd10
 #define MSR_IA32_MBA_THRTL_BASE		0xd50
 
diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
index f10b68b45342..fa7bc90ccc99 100644
--- a/arch/x86/kernel/cpu/resctrl/core.c
+++ b/arch/x86/kernel/cpu/resctrl/core.c
@@ -16,11 +16,14 @@
 
 #define pr_fmt(fmt)	"resctrl: " fmt
 
+#include <linux/cpu.h>
 #include <linux/slab.h>
 #include <linux/err.h>
 #include <linux/cacheinfo.h>
 #include <linux/cpuhotplug.h>
+#include <linux/mod_devicetable.h>
 
+#include <asm/cpu_device_id.h>
 #include <asm/intel-family.h>
 #include <asm/resctrl.h>
 #include "internal.h"
@@ -732,11 +735,42 @@ static void clear_closid_rmid(int cpu)
 	wrmsr(MSR_IA32_PQR_ASSOC, 0, 0);
 }
 
+/*
+ * The power-on reset value of MSR_RMID_SNC_CONFIG is 0x1
+ * which indicates that RMIDs are configured in legacy mode.
+ * This mode is incompatible with Linux resctrl semantics
+ * as RMIDs are partitioned between SNC nodes, which requires
+ * a user to know which RMID is allocated to a task.
+ * Clearing bit 0 reconfigures the RMID counters for use
+ * in Sub NUMA Cluster mode. This mode is better for Linux.
+ * The RMID space is divided between all SNC nodes with the
+ * RMIDs renumbered to start from zero in each node when
+ * counting operations from tasks. Code to read the counters
+ * must adjust RMID counter numbers based on SNC node. See
+ * __rmid_read() for code that does this.
+ */
+static void snc_remap_rmids(int cpu)
+{
+	u64 val;
+
+	/* Only need to enable once per package. */
+	if (cpumask_first(topology_core_cpumask(cpu)) != cpu)
+		return;
+
+	rdmsrl(MSR_RMID_SNC_CONFIG, val);
+	val &= ~BIT_ULL(0);
+	wrmsrl(MSR_RMID_SNC_CONFIG, val);
+}
+
 static int resctrl_online_cpu(unsigned int cpu)
 {
 	struct rdt_resource *r;
 
 	mutex_lock(&rdtgroup_mutex);
+
+	if (snc_nodes_per_l3_cache > 1)
+		snc_remap_rmids(cpu);
+
 	for_each_capable_rdt_resource(r)
 		domain_add_cpu(cpu, r);
 	/* The cpu is set in default rdtgroup after online. */
@@ -991,11 +1025,73 @@ static __init bool get_rdt_resources(void)
 	return (rdt_mon_capable || rdt_alloc_capable);
 }
 
+/* CPU models that support MSR_RMID_SNC_CONFIG */
+static const struct x86_cpu_id snc_cpu_ids[] __initconst = {
+	X86_MATCH_INTEL_FAM6_MODEL(ICELAKE_X, 0),
+	X86_MATCH_INTEL_FAM6_MODEL(SAPPHIRERAPIDS_X, 0),
+	X86_MATCH_INTEL_FAM6_MODEL(EMERALDRAPIDS_X, 0),
+	X86_MATCH_INTEL_FAM6_MODEL(GRANITERAPIDS_X, 0),
+	{}
+};
+
+/*
+ * There isn't a simple hardware bit that indicates whether a CPU is running
+ * in Sub NUMA Cluster (SNC) mode. Infer the state by comparing the
+ * ratio of NUMA nodes to L3 cache instances.
+ * It is not possible to accurately determine SNC state if the system is
+ * booted with a maxcpus=N parameter. That distorts the ratio of SNC nodes
+ * to L3 caches. It will be OK if the system is booted with hyperthreading
+ * disabled (since this doesn't affect the ratio).
+ */
+static __init int snc_get_config(void)
+{
+	unsigned long *node_caches;
+	int mem_only_nodes = 0;
+	int cpu, node, ret;
+	int num_l3_caches;
+
+	if (!x86_match_cpu(snc_cpu_ids))
+		return 1;
+
+	node_caches = bitmap_zalloc(nr_node_ids, GFP_KERNEL);
+	if (!node_caches)
+		return 1;
+
+	cpus_read_lock();
+
+	if (num_online_cpus() != num_present_cpus())
+		pr_warn("Some CPUs offline, SNC detection may be incorrect\n");
+
+	for_each_node(node) {
+		cpu = cpumask_first(cpumask_of_node(node));
+		if (cpu < nr_cpu_ids)
+			set_bit(get_cpu_cacheinfo_id(cpu, 3), node_caches);
+		else
+			mem_only_nodes++;
+	}
+	cpus_read_unlock();
+
+	num_l3_caches = bitmap_weight(node_caches, nr_node_ids);
+	kfree(node_caches);
+
+	if (!num_l3_caches)
+		return 1;
+
+	ret = (nr_node_ids - mem_only_nodes) / num_l3_caches;
+
+	if (ret > 1)
+		rdt_resources_all[RDT_RESOURCE_L3].r_resctrl.mon_scope = RESCTRL_NODE;
+
+	return ret;
+}
+
 static __init void rdt_init_res_defs_intel(void)
 {
 	struct rdt_hw_resource *hw_res;
 	struct rdt_resource *r;
 
+	snc_nodes_per_l3_cache = snc_get_config();
+
 	for_each_rdt_resource(r) {
 		hw_res = resctrl_to_arch_res(r);
 
-- 
2.41.0


^ permalink raw reply related	[flat|nested] 331+ messages in thread

* [PATCH v11 8/8] x86/resctrl: Update documentation with Sub-NUMA cluster changes
  2023-11-09 23:09             ` [PATCH v11 0/8] Add support for Sub-NUMA cluster (SNC) systems Tony Luck
                                 ` (6 preceding siblings ...)
  2023-11-09 23:09               ` [PATCH v11 7/8] x86/resctrl: Sub NUMA Cluster detection and enable Tony Luck
@ 2023-11-09 23:09               ` Tony Luck
  2023-11-09 23:14                 ` Randy Dunlap
  2023-11-20 22:25                 ` Reinette Chatre
  2023-11-13  4:46               ` [PATCH v11 0/8] Add support for Sub-NUMA cluster (SNC) systems Shaopeng Tan (Fujitsu)
  2023-11-30  0:34               ` [PATCH v12 " Tony Luck
  9 siblings, 2 replies; 331+ messages in thread
From: Tony Luck @ 2023-11-09 23:09 UTC (permalink / raw)
  To: Fenghua Yu, Reinette Chatre, Peter Newman, Jonathan Corbet,
	Shuah Khan, x86
  Cc: Shaopeng Tan, James Morse, Jamie Iles, Babu Moger, Randy Dunlap,
	linux-kernel, linux-doc, patches, Tony Luck

With Sub-NUMA Cluster mode enabled the scope of monitoring resources is
per-NODE instead of per-L3 cache. Suffixes of directories with "L3" in
their name refer to Sub-NUMA nodes instead of L3 cache ids.
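
For example, on a hypothetical two-socket system with two SNC nodes per
socket the mon_data directory contains mon_L3_00 through mon_L3_03, one
per SNC node, rather than one directory per L3 cache. A user wanting a
socket-wide view adds up the event values from the two directories that
belong to that socket.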

Users should be aware that SNC mode also affects the amount of L3 cache
available for allocation within each SNC node.

Reviewed-by: Peter Newman <peternewman@google.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
---
 Documentation/arch/x86/resctrl.rst | 23 ++++++++++++++++++++---
 1 file changed, 20 insertions(+), 3 deletions(-)

diff --git a/Documentation/arch/x86/resctrl.rst b/Documentation/arch/x86/resctrl.rst
index a6279df64a9d..d1db200db5f9 100644
--- a/Documentation/arch/x86/resctrl.rst
+++ b/Documentation/arch/x86/resctrl.rst
@@ -366,9 +366,9 @@ When control is enabled all CTRL_MON groups will also contain:
 When monitoring is enabled all MON groups will also contain:
 
 "mon_data":
-	This contains a set of files organized by L3 domain and by
-	RDT event. E.g. on a system with two L3 domains there will
-	be subdirectories "mon_L3_00" and "mon_L3_01".	Each of these
+	This contains a set of files organized by L3 domain or by NUMA
+	node (depending on whether Sub-NUMA Cluster (SNC) mode is disabled
+	or enabled respectively) and by RDT event.  Each of these
 	directories have one file per event (e.g. "llc_occupancy",
 	"mbm_total_bytes", and "mbm_local_bytes"). In a MON group these
 	files provide a read out of the current value of the event for
@@ -478,6 +478,23 @@ if non-contiguous 1s value is supported. On a system with a 20-bit mask
 each bit represents 5% of the capacity of the cache. You could partition
 the cache into four equal parts with masks: 0x1f, 0x3e0, 0x7c00, 0xf8000.
 
+Notes on Sub-NUMA Cluster mode
+==============================
+When SNC mode is enabled Linux may load balance tasks between Sub-NUMA
+nodes much more readily than between regular NUMA nodes since the CPUs
+on Sub-NUMA nodes share the same L3 cache and the system may report
+the NUMA distance between Sub-NUMA nodes with a lower value than used
+for regular NUMA nodes.  Users who do not bind tasks to the CPUs of a
+specific Sub-NUMA node must read the "llc_occupancy", "mbm_total_bytes",
+and "mbm_local_bytes" for all Sub-NUMA nodes where the tasks may execute
+to get the full view of traffic for which the tasks were the source.
+
+The cache allocation feature still provides the same number of
+bits in a mask to control allocation into the L3 cache. But each
+of those ways has its capacity reduced because the cache is divided
+between the SNC nodes. The values reported in the resctrl
+"size" files are adjusted accordingly.
+
 Memory bandwidth Allocation and monitoring
 ==========================================
 
-- 
2.41.0


^ permalink raw reply related	[flat|nested] 331+ messages in thread

* Re: [PATCH v11 8/8] x86/resctrl: Update documentation with Sub-NUMA cluster changes
  2023-11-09 23:09               ` [PATCH v11 8/8] x86/resctrl: Update documentation with Sub-NUMA cluster changes Tony Luck
@ 2023-11-09 23:14                 ` Randy Dunlap
  2023-11-20 22:25                 ` Reinette Chatre
  1 sibling, 0 replies; 331+ messages in thread
From: Randy Dunlap @ 2023-11-09 23:14 UTC (permalink / raw)
  To: Tony Luck, Fenghua Yu, Reinette Chatre, Peter Newman,
	Jonathan Corbet, Shuah Khan, x86
  Cc: Shaopeng Tan, James Morse, Jamie Iles, Babu Moger, linux-kernel,
	linux-doc, patches

Hi Tony,

If you make any updates to this:
(but not critical)

On 11/9/23 15:09, Tony Luck wrote:
> With Sub-NUMA Cluster mode enabled the scope of monitoring resources is
> per-NODE instead of per-L3 cache. Suffixes of directories with "L3" in
> their name refer to Sub-NUMA nodes instead of L3 cache ids.
> 
> Users should be aware that SNC mode also affects the amount of L3 cache
> available for allocation within each SNC node.
> 
> Reviewed-by: Peter Newman <peternewman@google.com>
> Signed-off-by: Tony Luck <tony.luck@intel.com>
> ---
>  Documentation/arch/x86/resctrl.rst | 23 ++++++++++++++++++++---
>  1 file changed, 20 insertions(+), 3 deletions(-)
> 
> diff --git a/Documentation/arch/x86/resctrl.rst b/Documentation/arch/x86/resctrl.rst
> index a6279df64a9d..d1db200db5f9 100644
> --- a/Documentation/arch/x86/resctrl.rst
> +++ b/Documentation/arch/x86/resctrl.rst
> @@ -366,9 +366,9 @@ When control is enabled all CTRL_MON groups will also contain:
>  When monitoring is enabled all MON groups will also contain:
>  
>  "mon_data":
> -	This contains a set of files organized by L3 domain and by
> -	RDT event. E.g. on a system with two L3 domains there will
> -	be subdirectories "mon_L3_00" and "mon_L3_01".	Each of these
> +	This contains a set of files organized by L3 domain or by NUMA
> +	node (depending on whether Sub-NUMA Cluster (SNC) mode is disabled
> +	or enabled respectively) and by RDT event.  Each of these
>  	directories have one file per event (e.g. "llc_occupancy",

	            has

>  	"mbm_total_bytes", and "mbm_local_bytes"). In a MON group these
>  	files provide a read out of the current value of the event for
> @@ -478,6 +478,23 @@ if non-contiguous 1s value is supported. On a system with a 20-bit mask
>  each bit represents 5% of the capacity of the cache. You could partition
>  the cache into four equal parts with masks: 0x1f, 0x3e0, 0x7c00, 0xf8000.
>  
> +Notes on Sub-NUMA Cluster mode
> +==============================
> +When SNC mode is enabled Linux may load balance tasks between Sub-NUMA
> +nodes much more readily than between regular NUMA nodes since the CPUs
> +on Sub-NUMA nodes share the same L3 cache and the system may report
> +the NUMA distance between Sub-NUMA nodes with a lower value than used
> +for regular NUMA nodes.  Users who do not bind tasks to the CPUs of a
> +specific Sub-NUMA node must read the "llc_occupancy", "mbm_total_bytes",
> +and "mbm_local_bytes" for all Sub-NUMA nodes where the tasks may execute
> +to get the full view of traffic for which the tasks were the source.
> +
> +The cache allocation feature still provides the same number of
> +bits in a mask to control allocation into the L3 cache. But each

                                                    cache, but each

> +of those ways has its capacity reduced because the cache is divided
> +between the SNC nodes. The values reported in the resctrl
> +"size" files are adjusted accordingly.
> +
>  Memory bandwidth Allocation and monitoring
>  ==========================================
>  

-- 
~Randy

^ permalink raw reply	[flat|nested] 331+ messages in thread

* RE: [PATCH v11 0/8] Add support for Sub-NUMA cluster (SNC) systems
  2023-11-09 23:09             ` [PATCH v11 0/8] Add support for Sub-NUMA cluster (SNC) systems Tony Luck
                                 ` (7 preceding siblings ...)
  2023-11-09 23:09               ` [PATCH v11 8/8] x86/resctrl: Update documentation with Sub-NUMA cluster changes Tony Luck
@ 2023-11-13  4:46               ` Shaopeng Tan (Fujitsu)
  2023-11-13 17:33                 ` Luck, Tony
  2023-11-30  0:34               ` [PATCH v12 " Tony Luck
  9 siblings, 1 reply; 331+ messages in thread
From: Shaopeng Tan (Fujitsu) @ 2023-11-13  4:46 UTC (permalink / raw)
  To: 'Tony Luck',
	Fenghua Yu, Reinette Chatre, Peter Newman, Jonathan Corbet,
	Shuah Khan, x86
  Cc: James Morse, Jamie Iles, Babu Moger, Randy Dunlap, linux-kernel,
	linux-doc, patches

Hello Tony,

I ran resctrl selftest on Intel(R) Xeon(R) Gold 6254 CPU.
Everything looks good.

Reviewed-by: Shaopeng Tan <tan.shaopeng@jp.fujitsu.com>
Tested-by: Shaopeng Tan <tan.shaopeng@jp.fujitsu.com>

Best regards,
Shaopeng TAN

^ permalink raw reply	[flat|nested] 331+ messages in thread

* RE: [PATCH v11 0/8] Add support for Sub-NUMA cluster (SNC) systems
  2023-11-13  4:46               ` [PATCH v11 0/8] Add support for Sub-NUMA cluster (SNC) systems Shaopeng Tan (Fujitsu)
@ 2023-11-13 17:33                 ` Luck, Tony
  0 siblings, 0 replies; 331+ messages in thread
From: Luck, Tony @ 2023-11-13 17:33 UTC (permalink / raw)
  To: Shaopeng Tan (Fujitsu),
	Yu, Fenghua, Chatre, Reinette, Peter Newman, Jonathan Corbet,
	Shuah Khan, x86
  Cc: James Morse, Jamie Iles, Babu Moger, Randy Dunlap, linux-kernel,
	linux-doc, patches

> I ran resctrl selftest on Intel(R) Xeon(R) Gold 6254 CPU.
> Everything looks good.
>
> Reviewed-by: Shaopeng Tan <tan.shaopeng@jp.fujitsu.com>
> Tested-by: Shaopeng Tan <tan.shaopeng@jp.fujitsu.com>


Shaopeng TAN,

Thanks for your patience, and for testing this latest version.

-Tony

^ permalink raw reply	[flat|nested] 331+ messages in thread

* Re: [PATCH v11 2/8] x86/resctrl: Prepare to split rdt_domain structure
  2023-11-09 23:09               ` [PATCH v11 2/8] x86/resctrl: Prepare to split rdt_domain structure Tony Luck
@ 2023-11-13 18:08                 ` Moger, Babu
  2023-11-13 18:52                   ` Tony Luck
  2023-11-20 22:18                 ` Reinette Chatre
  1 sibling, 1 reply; 331+ messages in thread
From: Moger, Babu @ 2023-11-13 18:08 UTC (permalink / raw)
  To: Tony Luck, Fenghua Yu, Reinette Chatre, Peter Newman,
	Jonathan Corbet, Shuah Khan, x86
  Cc: Shaopeng Tan, James Morse, Jamie Iles, Randy Dunlap,
	linux-kernel, linux-doc, patches

Hi Tony,

Sorry for the late review. The patches look good for the most part. But we
can simplify a little more. Please see my comments below.


On 11/9/23 17:09, Tony Luck wrote:
> The rdt_domain structure is used for both control and monitor features.
> It is about to be split into separate structures for these two usages
> because the scope for control and monitoring features for a resource
> will be different for future resources.
> 
> To allow for common code that scans a list of domains looking for a
> specific domain id, move all the common fields ("list", "id", "cpu_mask")
> into their own structure within the rdt_domain structure.
> 
> Signed-off-by: Tony Luck <tony.luck@intel.com>
> ---
>  include/linux/resctrl.h                   | 16 ++++--
>  arch/x86/kernel/cpu/resctrl/core.c        | 26 +++++-----
>  arch/x86/kernel/cpu/resctrl/ctrlmondata.c | 22 ++++-----
>  arch/x86/kernel/cpu/resctrl/monitor.c     | 10 ++--
>  arch/x86/kernel/cpu/resctrl/pseudo_lock.c | 14 +++---
>  arch/x86/kernel/cpu/resctrl/rdtgroup.c    | 60 +++++++++++------------
>  6 files changed, 78 insertions(+), 70 deletions(-)
> 
> diff --git a/include/linux/resctrl.h b/include/linux/resctrl.h
> index 7d4eb7df611d..c4067150a6b7 100644
> --- a/include/linux/resctrl.h
> +++ b/include/linux/resctrl.h
> @@ -53,10 +53,20 @@ struct resctrl_staged_config {
>  };
>  
>  /**
> - * struct rdt_domain - group of CPUs sharing a resctrl resource
> + * struct rdt_domain_hdr - common header for different domain types
>   * @list:		all instances of this resource
>   * @id:			unique id for this instance
>   * @cpu_mask:		which CPUs share this resource
> + */
> +struct rdt_domain_hdr {
> +	struct list_head		list;
> +	int				id;
> +	struct cpumask			cpu_mask;
> +};

I like the idea of separating the domains, one for control and another for
monitor. I have some comments on how it can be done to simplify the code.
Adding the hdr adds a little complexity to the code.

How about explicitly converting the current rdt_domain to rdt_mon_domain?

Something like this.

struct rdt_mon_domain {
        struct list_head                list;
        int                             id;
        struct cpumask                  cpu_mask;
        unsigned long                   *rmid_busy_llc;
        struct mbm_state                *mbm_total;
        struct mbm_state                *mbm_local;
        struct delayed_work             mbm_over;
        struct delayed_work             cqm_limbo;
        int                             mbm_work_cpu;
        int                             cqm_work_cpu;
};


Then introduce rdt_ctrl_domain, which separates the structure into two domains.

struct rdt_ctrl_domain {
        struct list_head                list;
        int                             id;
        struct cpumask                  cpu_mask;
        struct pseudo_lock_region       *plr;
        struct resctrl_staged_config    staged_config[CDP_NUM_TYPES];
        u32                             *mbps_val;
};

I feel this will be easier to understand and will make the code simpler. The
changes will be minimal.

Thanks
Babu


> +
> +/**
> + * struct rdt_domain - group of CPUs sharing a resctrl resource
> + * @hdr:		common header for different domain types
>   * @rmid_busy_llc:	bitmap of which limbo RMIDs are above threshold
>   * @mbm_total:		saved state for MBM total bandwidth
>   * @mbm_local:		saved state for MBM local bandwidth
> @@ -71,9 +81,7 @@ struct resctrl_staged_config {
>   *			by closid
>   */
>  struct rdt_domain {
> -	struct list_head		list;
> -	int				id;
> -	struct cpumask			cpu_mask;
> +	struct rdt_domain_hdr		hdr;
>  	unsigned long			*rmid_busy_llc;
>  	struct mbm_state		*mbm_total;
>  	struct mbm_state		*mbm_local;
> diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
> index 9f8d87abd751..c0659e84c9e3 100644
> --- a/arch/x86/kernel/cpu/resctrl/core.c
> +++ b/arch/x86/kernel/cpu/resctrl/core.c
> @@ -356,9 +356,9 @@ struct rdt_domain *get_domain_from_cpu(int cpu, struct rdt_resource *r)
>  {
>  	struct rdt_domain *d;
>  
> -	list_for_each_entry(d, &r->domains, list) {
> +	list_for_each_entry(d, &r->domains, hdr.list) {
>  		/* Find the domain that contains this CPU */
> -		if (cpumask_test_cpu(cpu, &d->cpu_mask))
> +		if (cpumask_test_cpu(cpu, &d->hdr.cpu_mask))
>  			return d;
>  	}
>  
> @@ -402,12 +402,12 @@ struct rdt_domain *rdt_find_domain(struct rdt_resource *r, int id,
>  	struct list_head *l;
>  
>  	list_for_each(l, &r->domains) {
> -		d = list_entry(l, struct rdt_domain, list);
> +		d = list_entry(l, struct rdt_domain, hdr.list);
>  		/* When id is found, return its domain. */
> -		if (id == d->id)
> +		if (id == d->hdr.id)
>  			return d;
>  		/* Stop searching when finding id's position in sorted list. */
> -		if (id < d->id)
> +		if (id < d->hdr.id)
>  			break;
>  	}
>  
> @@ -530,7 +530,7 @@ static void domain_add_cpu(int cpu, struct rdt_resource *r)
>  
>  	d = rdt_find_domain(r, id, &add_pos);
>  	if (d) {
> -		cpumask_set_cpu(cpu, &d->cpu_mask);
> +		cpumask_set_cpu(cpu, &d->hdr.cpu_mask);
>  		if (r->cache.arch_has_per_cpu_cfg)
>  			rdt_domain_reconfigure_cdp(r);
>  		return;
> @@ -541,8 +541,8 @@ static void domain_add_cpu(int cpu, struct rdt_resource *r)
>  		return;
>  
>  	d = &hw_dom->d_resctrl;
> -	d->id = id;
> -	cpumask_set_cpu(cpu, &d->cpu_mask);
> +	d->hdr.id = id;
> +	cpumask_set_cpu(cpu, &d->hdr.cpu_mask);
>  
>  	rdt_domain_reconfigure_cdp(r);
>  
> @@ -556,11 +556,11 @@ static void domain_add_cpu(int cpu, struct rdt_resource *r)
>  		return;
>  	}
>  
> -	list_add_tail(&d->list, add_pos);
> +	list_add_tail(&d->hdr.list, add_pos);
>  
>  	err = resctrl_online_domain(r, d);
>  	if (err) {
> -		list_del(&d->list);
> +		list_del(&d->hdr.list);
>  		domain_free(hw_dom);
>  	}
>  }
> @@ -581,10 +581,10 @@ static void domain_remove_cpu(int cpu, struct rdt_resource *r)
>  	}
>  	hw_dom = resctrl_to_arch_dom(d);
>  
> -	cpumask_clear_cpu(cpu, &d->cpu_mask);
> -	if (cpumask_empty(&d->cpu_mask)) {
> +	cpumask_clear_cpu(cpu, &d->hdr.cpu_mask);
> +	if (cpumask_empty(&d->hdr.cpu_mask)) {
>  		resctrl_offline_domain(r, d);
> -		list_del(&d->list);
> +		list_del(&d->hdr.list);
>  
>  		/*
>  		 * rdt_domain "d" is going to be freed below, so clear
> diff --git a/arch/x86/kernel/cpu/resctrl/ctrlmondata.c b/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
> index 3f8891d57fac..23f8258d36a8 100644
> --- a/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
> +++ b/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
> @@ -67,7 +67,7 @@ int parse_bw(struct rdt_parse_data *data, struct resctrl_schema *s,
>  
>  	cfg = &d->staged_config[s->conf_type];
>  	if (cfg->have_new_ctrl) {
> -		rdt_last_cmd_printf("Duplicate domain %d\n", d->id);
> +		rdt_last_cmd_printf("Duplicate domain %d\n", d->hdr.id);
>  		return -EINVAL;
>  	}
>  
> @@ -146,7 +146,7 @@ int parse_cbm(struct rdt_parse_data *data, struct resctrl_schema *s,
>  
>  	cfg = &d->staged_config[s->conf_type];
>  	if (cfg->have_new_ctrl) {
> -		rdt_last_cmd_printf("Duplicate domain %d\n", d->id);
> +		rdt_last_cmd_printf("Duplicate domain %d\n", d->hdr.id);
>  		return -EINVAL;
>  	}
>  
> @@ -226,8 +226,8 @@ static int parse_line(char *line, struct resctrl_schema *s,
>  		return -EINVAL;
>  	}
>  	dom = strim(dom);
> -	list_for_each_entry(d, &r->domains, list) {
> -		if (d->id == dom_id) {
> +	list_for_each_entry(d, &r->domains, hdr.list) {
> +		if (d->hdr.id == dom_id) {
>  			data.buf = dom;
>  			data.rdtgrp = rdtgrp;
>  			if (r->parse_ctrlval(&data, s, d))
> @@ -274,7 +274,7 @@ static bool apply_config(struct rdt_hw_domain *hw_dom,
>  	struct rdt_domain *dom = &hw_dom->d_resctrl;
>  
>  	if (cfg->new_ctrl != hw_dom->ctrl_val[idx]) {
> -		cpumask_set_cpu(cpumask_any(&dom->cpu_mask), cpu_mask);
> +		cpumask_set_cpu(cpumask_any(&dom->hdr.cpu_mask), cpu_mask);
>  		hw_dom->ctrl_val[idx] = cfg->new_ctrl;
>  
>  		return true;
> @@ -291,7 +291,7 @@ int resctrl_arch_update_one(struct rdt_resource *r, struct rdt_domain *d,
>  	u32 idx = get_config_index(closid, t);
>  	struct msr_param msr_param;
>  
> -	if (!cpumask_test_cpu(smp_processor_id(), &d->cpu_mask))
> +	if (!cpumask_test_cpu(smp_processor_id(), &d->hdr.cpu_mask))
>  		return -EINVAL;
>  
>  	hw_dom->ctrl_val[idx] = cfg_val;
> @@ -318,7 +318,7 @@ int resctrl_arch_update_domains(struct rdt_resource *r, u32 closid)
>  		return -ENOMEM;
>  
>  	msr_param.res = NULL;
> -	list_for_each_entry(d, &r->domains, list) {
> +	list_for_each_entry(d, &r->domains, hdr.list) {
>  		hw_dom = resctrl_to_arch_dom(d);
>  		for (t = 0; t < CDP_NUM_TYPES; t++) {
>  			cfg = &hw_dom->d_resctrl.staged_config[t];
> @@ -466,7 +466,7 @@ static void show_doms(struct seq_file *s, struct resctrl_schema *schema, int clo
>  	u32 ctrl_val;
>  
>  	seq_printf(s, "%*s:", max_name_width, schema->name);
> -	list_for_each_entry(dom, &r->domains, list) {
> +	list_for_each_entry(dom, &r->domains, hdr.list) {
>  		if (sep)
>  			seq_puts(s, ";");
>  
> @@ -476,7 +476,7 @@ static void show_doms(struct seq_file *s, struct resctrl_schema *schema, int clo
>  			ctrl_val = resctrl_arch_get_config(r, dom, closid,
>  							   schema->conf_type);
>  
> -		seq_printf(s, r->format_str, dom->id, max_data_width,
> +		seq_printf(s, r->format_str, dom->hdr.id, max_data_width,
>  			   ctrl_val);
>  		sep = true;
>  	}
> @@ -505,7 +505,7 @@ int rdtgroup_schemata_show(struct kernfs_open_file *of,
>  			} else {
>  				seq_printf(s, "%s:%d=%x\n",
>  					   rdtgrp->plr->s->res->name,
> -					   rdtgrp->plr->d->id,
> +					   rdtgrp->plr->d->hdr.id,
>  					   rdtgrp->plr->cbm);
>  			}
>  		} else {
> @@ -536,7 +536,7 @@ void mon_event_read(struct rmid_read *rr, struct rdt_resource *r,
>  	rr->val = 0;
>  	rr->first = first;
>  
> -	smp_call_function_any(&d->cpu_mask, mon_event_count, rr, 1);
> +	smp_call_function_any(&d->hdr.cpu_mask, mon_event_count, rr, 1);
>  }
>  
>  int rdtgroup_mondata_show(struct seq_file *m, void *arg)
> diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c
> index f136ac046851..dd0ea1bc0092 100644
> --- a/arch/x86/kernel/cpu/resctrl/monitor.c
> +++ b/arch/x86/kernel/cpu/resctrl/monitor.c
> @@ -238,7 +238,7 @@ int resctrl_arch_rmid_read(struct rdt_resource *r, struct rdt_domain *d,
>  	u64 msr_val, chunks;
>  	int ret;
>  
> -	if (!cpumask_test_cpu(smp_processor_id(), &d->cpu_mask))
> +	if (!cpumask_test_cpu(smp_processor_id(), &d->hdr.cpu_mask))
>  		return -EINVAL;
>  
>  	ret = __rmid_read(rmid, eventid, &msr_val);
> @@ -340,8 +340,8 @@ static void add_rmid_to_limbo(struct rmid_entry *entry)
>  
>  	entry->busy = 0;
>  	cpu = get_cpu();
> -	list_for_each_entry(d, &r->domains, list) {
> -		if (cpumask_test_cpu(cpu, &d->cpu_mask)) {
> +	list_for_each_entry(d, &r->domains, hdr.list) {
> +		if (cpumask_test_cpu(cpu, &d->hdr.cpu_mask)) {
>  			err = resctrl_arch_rmid_read(r, d, entry->rmid,
>  						     QOS_L3_OCCUP_EVENT_ID,
>  						     &val);
> @@ -661,7 +661,7 @@ void cqm_setup_limbo_handler(struct rdt_domain *dom, unsigned long delay_ms)
>  	unsigned long delay = msecs_to_jiffies(delay_ms);
>  	int cpu;
>  
> -	cpu = cpumask_any(&dom->cpu_mask);
> +	cpu = cpumask_any(&dom->hdr.cpu_mask);
>  	dom->cqm_work_cpu = cpu;
>  
>  	schedule_delayed_work_on(cpu, &dom->cqm_limbo, delay);
> @@ -708,7 +708,7 @@ void mbm_setup_overflow_handler(struct rdt_domain *dom, unsigned long delay_ms)
>  
>  	if (!static_branch_likely(&rdt_mon_enable_key))
>  		return;
> -	cpu = cpumask_any(&dom->cpu_mask);
> +	cpu = cpumask_any(&dom->hdr.cpu_mask);
>  	dom->mbm_work_cpu = cpu;
>  	schedule_delayed_work_on(cpu, &dom->mbm_over, delay);
>  }
> diff --git a/arch/x86/kernel/cpu/resctrl/pseudo_lock.c b/arch/x86/kernel/cpu/resctrl/pseudo_lock.c
> index 2a682da9f43a..fcbd99e2eb66 100644
> --- a/arch/x86/kernel/cpu/resctrl/pseudo_lock.c
> +++ b/arch/x86/kernel/cpu/resctrl/pseudo_lock.c
> @@ -221,7 +221,7 @@ static int pseudo_lock_cstates_constrain(struct pseudo_lock_region *plr)
>  	int cpu;
>  	int ret;
>  
> -	for_each_cpu(cpu, &plr->d->cpu_mask) {
> +	for_each_cpu(cpu, &plr->d->hdr.cpu_mask) {
>  		pm_req = kzalloc(sizeof(*pm_req), GFP_KERNEL);
>  		if (!pm_req) {
>  			rdt_last_cmd_puts("Failure to allocate memory for PM QoS\n");
> @@ -301,7 +301,7 @@ static int pseudo_lock_region_init(struct pseudo_lock_region *plr)
>  		return -ENODEV;
>  
>  	/* Pick the first cpu we find that is associated with the cache. */
> -	plr->cpu = cpumask_first(&plr->d->cpu_mask);
> +	plr->cpu = cpumask_first(&plr->d->hdr.cpu_mask);
>  
>  	if (!cpu_online(plr->cpu)) {
>  		rdt_last_cmd_printf("CPU %u associated with cache not online\n",
> @@ -856,10 +856,10 @@ bool rdtgroup_pseudo_locked_in_hierarchy(struct rdt_domain *d)
>  	 * associated with them.
>  	 */
>  	for_each_alloc_capable_rdt_resource(r) {
> -		list_for_each_entry(d_i, &r->domains, list) {
> +		list_for_each_entry(d_i, &r->domains, hdr.list) {
>  			if (d_i->plr)
>  				cpumask_or(cpu_with_psl, cpu_with_psl,
> -					   &d_i->cpu_mask);
> +					   &d_i->hdr.cpu_mask);
>  		}
>  	}
>  
> @@ -867,7 +867,7 @@ bool rdtgroup_pseudo_locked_in_hierarchy(struct rdt_domain *d)
>  	 * Next test if new pseudo-locked region would intersect with
>  	 * existing region.
>  	 */
> -	if (cpumask_intersects(&d->cpu_mask, cpu_with_psl))
> +	if (cpumask_intersects(&d->hdr.cpu_mask, cpu_with_psl))
>  		ret = true;
>  
>  	free_cpumask_var(cpu_with_psl);
> @@ -1199,7 +1199,7 @@ static int pseudo_lock_measure_cycles(struct rdtgroup *rdtgrp, int sel)
>  	}
>  
>  	plr->thread_done = 0;
> -	cpu = cpumask_first(&plr->d->cpu_mask);
> +	cpu = cpumask_first(&plr->d->hdr.cpu_mask);
>  	if (!cpu_online(cpu)) {
>  		ret = -ENODEV;
>  		goto out;
> @@ -1529,7 +1529,7 @@ static int pseudo_lock_dev_mmap(struct file *filp, struct vm_area_struct *vma)
>  	 * may be scheduled elsewhere and invalidate entries in the
>  	 * pseudo-locked region.
>  	 */
> -	if (!cpumask_subset(current->cpus_ptr, &plr->d->cpu_mask)) {
> +	if (!cpumask_subset(current->cpus_ptr, &plr->d->hdr.cpu_mask)) {
>  		mutex_unlock(&rdtgroup_mutex);
>  		return -EINVAL;
>  	}
> diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> index c44be64d65ec..04d32602ac33 100644
> --- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> +++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> @@ -91,7 +91,7 @@ void rdt_staged_configs_clear(void)
>  	lockdep_assert_held(&rdtgroup_mutex);
>  
>  	for_each_alloc_capable_rdt_resource(r) {
> -		list_for_each_entry(dom, &r->domains, list)
> +		list_for_each_entry(dom, &r->domains, hdr.list)
>  			memset(dom->staged_config, 0, sizeof(dom->staged_config));
>  	}
>  }
> @@ -295,7 +295,7 @@ static int rdtgroup_cpus_show(struct kernfs_open_file *of,
>  				rdt_last_cmd_puts("Cache domain offline\n");
>  				ret = -ENODEV;
>  			} else {
> -				mask = &rdtgrp->plr->d->cpu_mask;
> +				mask = &rdtgrp->plr->d->hdr.cpu_mask;
>  				seq_printf(s, is_cpu_list(of) ?
>  					   "%*pbl\n" : "%*pb\n",
>  					   cpumask_pr_args(mask));
> @@ -984,12 +984,12 @@ static int rdt_bit_usage_show(struct kernfs_open_file *of,
>  
>  	mutex_lock(&rdtgroup_mutex);
>  	hw_shareable = r->cache.shareable_bits;
> -	list_for_each_entry(dom, &r->domains, list) {
> +	list_for_each_entry(dom, &r->domains, hdr.list) {
>  		if (sep)
>  			seq_putc(seq, ';');
>  		sw_shareable = 0;
>  		exclusive = 0;
> -		seq_printf(seq, "%d=", dom->id);
> +		seq_printf(seq, "%d=", dom->hdr.id);
>  		for (i = 0; i < closids_supported(); i++) {
>  			if (!closid_allocated(i))
>  				continue;
> @@ -1302,7 +1302,7 @@ static bool rdtgroup_mode_test_exclusive(struct rdtgroup *rdtgrp)
>  		if (r->rid == RDT_RESOURCE_MBA || r->rid == RDT_RESOURCE_SMBA)
>  			continue;
>  		has_cache = true;
> -		list_for_each_entry(d, &r->domains, list) {
> +		list_for_each_entry(d, &r->domains, hdr.list) {
>  			ctrl = resctrl_arch_get_config(r, d, closid,
>  						       s->conf_type);
>  			if (rdtgroup_cbm_overlaps(s, d, ctrl, closid, false)) {
> @@ -1417,7 +1417,7 @@ unsigned int rdtgroup_cbm_to_size(struct rdt_resource *r,
>  		return size;
>  
>  	num_b = bitmap_weight(&cbm, r->cache.cbm_len);
> -	ci = get_cpu_cacheinfo(cpumask_any(&d->cpu_mask));
> +	ci = get_cpu_cacheinfo(cpumask_any(&d->hdr.cpu_mask));
>  	for (i = 0; i < ci->num_leaves; i++) {
>  		if (ci->info_list[i].level == r->scope) {
>  			size = ci->info_list[i].size / r->cache.cbm_len * num_b;
> @@ -1465,7 +1465,7 @@ static int rdtgroup_size_show(struct kernfs_open_file *of,
>  			size = rdtgroup_cbm_to_size(rdtgrp->plr->s->res,
>  						    rdtgrp->plr->d,
>  						    rdtgrp->plr->cbm);
> -			seq_printf(s, "%d=%u\n", rdtgrp->plr->d->id, size);
> +			seq_printf(s, "%d=%u\n", rdtgrp->plr->d->hdr.id, size);
>  		}
>  		goto out;
>  	}
> @@ -1477,7 +1477,7 @@ static int rdtgroup_size_show(struct kernfs_open_file *of,
>  		type = schema->conf_type;
>  		sep = false;
>  		seq_printf(s, "%*s:", max_name_width, schema->name);
> -		list_for_each_entry(d, &r->domains, list) {
> +		list_for_each_entry(d, &r->domains, hdr.list) {
>  			if (sep)
>  				seq_putc(s, ';');
>  			if (rdtgrp->mode == RDT_MODE_PSEUDO_LOCKSETUP) {
> @@ -1495,7 +1495,7 @@ static int rdtgroup_size_show(struct kernfs_open_file *of,
>  				else
>  					size = rdtgroup_cbm_to_size(r, d, ctrl);
>  			}
> -			seq_printf(s, "%d=%u", d->id, size);
> +			seq_printf(s, "%d=%u", d->hdr.id, size);
>  			sep = true;
>  		}
>  		seq_putc(s, '\n');
> @@ -1555,7 +1555,7 @@ static void mon_event_config_read(void *info)
>  
>  static void mondata_config_read(struct rdt_domain *d, struct mon_config_info *mon_info)
>  {
> -	smp_call_function_any(&d->cpu_mask, mon_event_config_read, mon_info, 1);
> +	smp_call_function_any(&d->hdr.cpu_mask, mon_event_config_read, mon_info, 1);
>  }
>  
>  static int mbm_config_show(struct seq_file *s, struct rdt_resource *r, u32 evtid)
> @@ -1566,7 +1566,7 @@ static int mbm_config_show(struct seq_file *s, struct rdt_resource *r, u32 evtid
>  
>  	mutex_lock(&rdtgroup_mutex);
>  
> -	list_for_each_entry(dom, &r->domains, list) {
> +	list_for_each_entry(dom, &r->domains, hdr.list) {
>  		if (sep)
>  			seq_puts(s, ";");
>  
> @@ -1574,7 +1574,7 @@ static int mbm_config_show(struct seq_file *s, struct rdt_resource *r, u32 evtid
>  		mon_info.evtid = evtid;
>  		mondata_config_read(dom, &mon_info);
>  
> -		seq_printf(s, "%d=0x%02x", dom->id, mon_info.mon_config);
> +		seq_printf(s, "%d=0x%02x", dom->hdr.id, mon_info.mon_config);
>  		sep = true;
>  	}
>  	seq_puts(s, "\n");
> @@ -1646,7 +1646,7 @@ static int mbm_config_write_domain(struct rdt_resource *r,
>  	 * are scoped at the domain level. Writing any of these MSRs
>  	 * on one CPU is observed by all the CPUs in the domain.
>  	 */
> -	smp_call_function_any(&d->cpu_mask, mon_event_config_write,
> +	smp_call_function_any(&d->hdr.cpu_mask, mon_event_config_write,
>  			      &mon_info, 1);
>  
>  	/*
> @@ -1689,8 +1689,8 @@ static int mon_config_write(struct rdt_resource *r, char *tok, u32 evtid)
>  		return -EINVAL;
>  	}
>  
> -	list_for_each_entry(d, &r->domains, list) {
> -		if (d->id == dom_id) {
> +	list_for_each_entry(d, &r->domains, hdr.list) {
> +		if (d->hdr.id == dom_id) {
>  			ret = mbm_config_write_domain(r, d, evtid, val);
>  			if (ret)
>  				return -EINVAL;
> @@ -2232,14 +2232,14 @@ static int set_cache_qos_cfg(int level, bool enable)
>  		return -ENOMEM;
>  
>  	r_l = &rdt_resources_all[level].r_resctrl;
> -	list_for_each_entry(d, &r_l->domains, list) {
> +	list_for_each_entry(d, &r_l->domains, hdr.list) {
>  		if (r_l->cache.arch_has_per_cpu_cfg)
>  			/* Pick all the CPUs in the domain instance */
> -			for_each_cpu(cpu, &d->cpu_mask)
> +			for_each_cpu(cpu, &d->hdr.cpu_mask)
>  				cpumask_set_cpu(cpu, cpu_mask);
>  		else
>  			/* Pick one CPU from each domain instance to update MSR */
> -			cpumask_set_cpu(cpumask_any(&d->cpu_mask), cpu_mask);
> +			cpumask_set_cpu(cpumask_any(&d->hdr.cpu_mask), cpu_mask);
>  	}
>  
>  	/* Update QOS_CFG MSR on all the CPUs in cpu_mask */
> @@ -2268,7 +2268,7 @@ void rdt_domain_reconfigure_cdp(struct rdt_resource *r)
>  static int mba_sc_domain_allocate(struct rdt_resource *r, struct rdt_domain *d)
>  {
>  	u32 num_closid = resctrl_arch_get_num_closid(r);
> -	int cpu = cpumask_any(&d->cpu_mask);
> +	int cpu = cpumask_any(&d->hdr.cpu_mask);
>  	int i;
>  
>  	d->mbps_val = kcalloc_node(num_closid, sizeof(*d->mbps_val),
> @@ -2317,7 +2317,7 @@ static int set_mba_sc(bool mba_sc)
>  
>  	r->membw.mba_sc = mba_sc;
>  
> -	list_for_each_entry(d, &r->domains, list) {
> +	list_for_each_entry(d, &r->domains, hdr.list) {
>  		for (i = 0; i < num_closid; i++)
>  			d->mbps_val[i] = MBA_MAX_MBPS;
>  	}
> @@ -2653,7 +2653,7 @@ static int rdt_get_tree(struct fs_context *fc)
>  
>  	if (is_mbm_enabled()) {
>  		r = &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl;
> -		list_for_each_entry(dom, &r->domains, list)
> +		list_for_each_entry(dom, &r->domains, hdr.list)
>  			mbm_setup_overflow_handler(dom, MBM_OVERFLOW_INTERVAL);
>  	}
>  
> @@ -2780,9 +2780,9 @@ static int reset_all_ctrls(struct rdt_resource *r)
>  	 * CBMs in all domains to the maximum mask value. Pick one CPU
>  	 * from each domain to update the MSRs below.
>  	 */
> -	list_for_each_entry(d, &r->domains, list) {
> +	list_for_each_entry(d, &r->domains, hdr.list) {
>  		hw_dom = resctrl_to_arch_dom(d);
> -		cpumask_set_cpu(cpumask_any(&d->cpu_mask), cpu_mask);
> +		cpumask_set_cpu(cpumask_any(&d->hdr.cpu_mask), cpu_mask);
>  
>  		for (i = 0; i < hw_res->num_closid; i++)
>  			hw_dom->ctrl_val[i] = r->default_ctrl;
> @@ -2986,7 +2986,7 @@ static int mkdir_mondata_subdir(struct kernfs_node *parent_kn,
>  	char name[32];
>  	int ret;
>  
> -	sprintf(name, "mon_%s_%02d", r->name, d->id);
> +	sprintf(name, "mon_%s_%02d", r->name, d->hdr.id);
>  	/* create the directory */
>  	kn = kernfs_create_dir(parent_kn, name, parent_kn->mode, prgrp);
>  	if (IS_ERR(kn))
> @@ -3002,7 +3002,7 @@ static int mkdir_mondata_subdir(struct kernfs_node *parent_kn,
>  	}
>  
>  	priv.u.rid = r->rid;
> -	priv.u.domid = d->id;
> +	priv.u.domid = d->hdr.id;
>  	list_for_each_entry(mevt, &r->evt_list, list) {
>  		priv.u.evtid = mevt->evtid;
>  		ret = mon_addfile(kn, mevt->name, priv.priv);
> @@ -3050,7 +3050,7 @@ static int mkdir_mondata_subdir_alldom(struct kernfs_node *parent_kn,
>  	struct rdt_domain *dom;
>  	int ret;
>  
> -	list_for_each_entry(dom, &r->domains, list) {
> +	list_for_each_entry(dom, &r->domains, hdr.list) {
>  		ret = mkdir_mondata_subdir(parent_kn, dom, r, prgrp);
>  		if (ret)
>  			return ret;
> @@ -3209,7 +3209,7 @@ static int __init_one_rdt_domain(struct rdt_domain *d, struct resctrl_schema *s,
>  	 */
>  	tmp_cbm = cfg->new_ctrl;
>  	if (bitmap_weight(&tmp_cbm, r->cache.cbm_len) < r->cache.min_cbm_bits) {
> -		rdt_last_cmd_printf("No space on %s:%d\n", s->name, d->id);
> +		rdt_last_cmd_printf("No space on %s:%d\n", s->name, d->hdr.id);
>  		return -ENOSPC;
>  	}
>  	cfg->have_new_ctrl = true;
> @@ -3232,7 +3232,7 @@ static int rdtgroup_init_cat(struct resctrl_schema *s, u32 closid)
>  	struct rdt_domain *d;
>  	int ret;
>  
> -	list_for_each_entry(d, &s->res->domains, list) {
> +	list_for_each_entry(d, &s->res->domains, hdr.list) {
>  		ret = __init_one_rdt_domain(d, s, closid);
>  		if (ret < 0)
>  			return ret;
> @@ -3247,7 +3247,7 @@ static void rdtgroup_init_mba(struct rdt_resource *r, u32 closid)
>  	struct resctrl_staged_config *cfg;
>  	struct rdt_domain *d;
>  
> -	list_for_each_entry(d, &r->domains, list) {
> +	list_for_each_entry(d, &r->domains, hdr.list) {
>  		if (is_mba_sc(r)) {
>  			d->mbps_val[closid] = MBA_MAX_MBPS;
>  			continue;
> @@ -3864,7 +3864,7 @@ void resctrl_offline_domain(struct rdt_resource *r, struct rdt_domain *d)
>  	 * per domain monitor data directories.
>  	 */
>  	if (static_branch_unlikely(&rdt_mon_enable_key))
> -		rmdir_mondata_subdir_allrdtgrp(r, d->id);
> +		rmdir_mondata_subdir_allrdtgrp(r, d->hdr.id);
>  
>  	if (is_mbm_enabled())
>  		cancel_delayed_work(&d->mbm_over);

-- 
Thanks
Babu Moger

^ permalink raw reply	[flat|nested] 331+ messages in thread

* Re: [PATCH v11 2/8] x86/resctrl: Prepare to split rdt_domain structure
  2023-11-13 18:08                 ` Moger, Babu
@ 2023-11-13 18:52                   ` Tony Luck
  2023-11-13 19:32                     ` Moger, Babu
  0 siblings, 1 reply; 331+ messages in thread
From: Tony Luck @ 2023-11-13 18:52 UTC (permalink / raw)
  To: Moger, Babu
  Cc: Fenghua Yu, Reinette Chatre, Peter Newman, Jonathan Corbet,
	Shuah Khan, x86, Shaopeng Tan, James Morse, Jamie Iles,
	Randy Dunlap, linux-kernel, linux-doc, patches

On Mon, Nov 13, 2023 at 12:08:19PM -0600, Moger, Babu wrote:
> Hi Tony,
> 
> Sorry for the late review. The patches look good for the most part. But we
> can simplify a little more. Please see my comments below.
> 
> 
> On 11/9/23 17:09, Tony Luck wrote:
> > The rdt_domain structure is used for both control and monitor features.
> > It is about to be split into separate structures for these two usages
> > because the scope for control and monitoring features for a resource
> > will be different for future resources.
> > 
> > To allow for common code that scans a list of domains looking for a
> > specific domain id, move all the common fields ("list", "id", "cpu_mask")
> > into their own structure within the rdt_domain structure.
> > 
> > Signed-off-by: Tony Luck <tony.luck@intel.com>
> > ---
> >  include/linux/resctrl.h                   | 16 ++++--
> >  arch/x86/kernel/cpu/resctrl/core.c        | 26 +++++-----
> >  arch/x86/kernel/cpu/resctrl/ctrlmondata.c | 22 ++++-----
> >  arch/x86/kernel/cpu/resctrl/monitor.c     | 10 ++--
> >  arch/x86/kernel/cpu/resctrl/pseudo_lock.c | 14 +++---
> >  arch/x86/kernel/cpu/resctrl/rdtgroup.c    | 60 +++++++++++------------
> >  6 files changed, 78 insertions(+), 70 deletions(-)
> > 
> > diff --git a/include/linux/resctrl.h b/include/linux/resctrl.h
> > index 7d4eb7df611d..c4067150a6b7 100644
> > --- a/include/linux/resctrl.h
> > +++ b/include/linux/resctrl.h
> > @@ -53,10 +53,20 @@ struct resctrl_staged_config {
> >  };
> >  
> >  /**
> > - * struct rdt_domain - group of CPUs sharing a resctrl resource
> > + * struct rdt_domain_hdr - common header for different domain types
> >   * @list:		all instances of this resource
> >   * @id:			unique id for this instance
> >   * @cpu_mask:		which CPUs share this resource
> > + */
> > +struct rdt_domain_hdr {
> > +	struct list_head		list;
> > +	int				id;
> > +	struct cpumask			cpu_mask;
> > +};
> 
> I like the idea of separating the domains, one for control and another for
> monitor. I have some comments on how it can be done to simplify the code.
> Adding the hdr adds a little complexity to the code.
> 
> How about explicitly converting the current rdt_domain to rdt_mon_domain?
> 
> Something like this.
> 
> struct rdt_mon_domain {
>         struct list_head                list;
>         int                             id;
>         struct cpumask                  cpu_mask;
>         unsigned long                   *rmid_busy_llc;
>         struct mbm_state                *mbm_total;
>         struct mbm_state                *mbm_local;
>         struct delayed_work             mbm_over;
>         struct delayed_work             cqm_limbo;
>         int                             mbm_work_cpu;
>         int                             cqm_work_cpu;
> };
> 
> 
> Then introduce rdt_ctrl_domain, which separates this into two domains.
> 
> struct rdt_ctrl_domain {
>         struct list_head                list;
>         int                             id;
>         struct cpumask                  cpu_mask;
>         struct pseudo_lock_region       *plr;
>         struct resctrl_staged_config    staged_config[CDP_NUM_TYPES];
>         u32                             *mbps_val;
> };
> 
> I feel this will be easy to understand and makes the code simpler. Changes
> will be minimal.

Babu,

I went in this direction because of rdt_find_domain() which is used
to search either of the "ctrl" or "mon" lists. So it needs to be sure
that the "list" and "id" fields are at identical offsets in both the
rdt_ctrl_domain and rdt_mon_domain structures. Best way to guarantee[1]
that is to create the "hdr" entry (which later acquired the cpu_mask
field as a common element after a comment from Reinette).

One way to avoid this would be to essentially duplicate
rdt_find_domain() into rdt_find_ctrl_domain() and rdt_find_mon_domain()
... but I fear the function is just big enough to get complaints
that the two copies should somehow be merged.
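
To make that concrete, here is roughly what the common helper boils down
to (a sketch, not the exact hunks from this series; the "type" check only
appears once patch 3 splits the ctrl/mon lists):

	struct rdt_domain_hdr *rdt_find_domain(struct list_head *h, int id,
					       struct list_head **add_pos)
	{
		struct rdt_domain_hdr *d;
		struct list_head *l;

		list_for_each(l, h) {
			d = list_entry(l, struct rdt_domain_hdr, list);
			/* When id is found, return its domain. */
			if (id == d->id)
				return d;
			/* Stop at id's position in the sorted list. */
			if (id < d->id)
				break;
		}

		if (add_pos)
			*add_pos = l;

		return NULL;
	}

	/* A control-path caller then recovers the full structure: */
	hdr = rdt_find_domain(&r->ctrl_domains, id, NULL);
	if (hdr && !WARN_ON_ONCE(hdr->type != RESCTRL_CTRL_DOMAIN))
		d = container_of(hdr, struct rdt_domain, hdr);

That only works if "list" and "id" (and now "cpu_mask") are guaranteed to
be at the same offsets in every domain flavor, which the embedded header
gives us for free.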

-Tony

[1] In v5 I tried using a comment in each to say they must be the same,
but Reinette didn't like comments within declarations:
https://lore.kernel.org/all/5f1256d3-737e-a447-abbe-f541767b2c8f@intel.com/

^ permalink raw reply	[flat|nested] 331+ messages in thread

* Re: [PATCH v11 2/8] x86/resctrl: Prepare to split rdt_domain structure
  2023-11-13 18:52                   ` Tony Luck
@ 2023-11-13 19:32                     ` Moger, Babu
  0 siblings, 0 replies; 331+ messages in thread
From: Moger, Babu @ 2023-11-13 19:32 UTC (permalink / raw)
  To: Tony Luck
  Cc: Fenghua Yu, Reinette Chatre, Peter Newman, Jonathan Corbet,
	Shuah Khan, x86, Shaopeng Tan, James Morse, Jamie Iles,
	Randy Dunlap, linux-kernel, linux-doc, patches



On 11/13/23 12:52, Tony Luck wrote:
> On Mon, Nov 13, 2023 at 12:08:19PM -0600, Moger, Babu wrote:
>> Hi Tony,
>>
>> Sorry for the late review. The patches look good for the most part. But we
>> can simplify a little more. Please see my comments below.
>>
>>
>> On 11/9/23 17:09, Tony Luck wrote:
>>> The rdt_domain structure is used for both control and monitor features.
>>> It is about to be split into separate structures for these two usages
>>> because the scope for control and monitoring features for a resource
>>> will be different for future resources.
>>>
>>> To allow for common code that scans a list of domains looking for a
>>> specific domain id, move all the common fields ("list", "id", "cpu_mask")
>>> into their own structure within the rdt_domain structure.
>>>
>>> Signed-off-by: Tony Luck <tony.luck@intel.com>
>>> ---
>>>  include/linux/resctrl.h                   | 16 ++++--
>>>  arch/x86/kernel/cpu/resctrl/core.c        | 26 +++++-----
>>>  arch/x86/kernel/cpu/resctrl/ctrlmondata.c | 22 ++++-----
>>>  arch/x86/kernel/cpu/resctrl/monitor.c     | 10 ++--
>>>  arch/x86/kernel/cpu/resctrl/pseudo_lock.c | 14 +++---
>>>  arch/x86/kernel/cpu/resctrl/rdtgroup.c    | 60 +++++++++++------------
>>>  6 files changed, 78 insertions(+), 70 deletions(-)
>>>
>>> diff --git a/include/linux/resctrl.h b/include/linux/resctrl.h
>>> index 7d4eb7df611d..c4067150a6b7 100644
>>> --- a/include/linux/resctrl.h
>>> +++ b/include/linux/resctrl.h
>>> @@ -53,10 +53,20 @@ struct resctrl_staged_config {
>>>  };
>>>  
>>>  /**
>>> - * struct rdt_domain - group of CPUs sharing a resctrl resource
>>> + * struct rdt_domain_hdr - common header for different domain types
>>>   * @list:		all instances of this resource
>>>   * @id:			unique id for this instance
>>>   * @cpu_mask:		which CPUs share this resource
>>> + */
>>> +struct rdt_domain_hdr {
>>> +	struct list_head		list;
>>> +	int				id;
>>> +	struct cpumask			cpu_mask;
>>> +};
>>
>> I like the idea of separating the domains, one for control and another for
>> monitor. I have some comments on how it can be done to simplify the code.
>> Adding the hdr adds a little complexity to the code.
>>
>> How about explicitly converting the current rdt_domain to rdt_mon_domain?
>>
>> Something like this.
>>
>> struct rdt_mon_domain {
>>         struct list_head                list;
>>         int                             id;
>>         struct cpumask                  cpu_mask;
>>         unsigned long                   *rmid_busy_llc;
>>         struct mbm_state                *mbm_total;
>>         struct mbm_state                *mbm_local;
>>         struct delayed_work             mbm_over;
>>         struct delayed_work             cqm_limbo;
>>         int                             mbm_work_cpu;
>>         int                             cqm_work_cpu;
>> };
>>
>>
>> Then introduce rdt_ctrl_domain, which separates this into two domains.
>>
>> struct rdt_ctrl_domain {
>>         struct list_head                list;
>>         int                             id;
>>         struct cpumask                  cpu_mask;
>>         struct pseudo_lock_region       *plr;
>>         struct resctrl_staged_config    staged_config[CDP_NUM_TYPES];
>>         u32                             *mbps_val;
>> };
>>
>> I feel this will be easy to understand and makes the code simpler. Changes
>> will be minimal.
> 
> Babu,
> 
> I went in this direction because of rdt_find_domain() which is used
> to search either of the "ctrl" or "mon" lists. So it needs to be sure
> that the "list" and "id" fields are at identical offsets in both the
> rdt_ctrl_domain and rdt_mon_domain structures. Best way to guarantee[1]
> that is to create the "hdr" entry (which later acquired the cpu_mask
> field as a common element after a comment from Reinette).
> 
> One way to avoid this would be to essentially duplicate
> rdt_find_domain() into rdt_find_ctrl_domain() and rdt_find_mon_domain()

It could have been solved by passing a flag (0 for mon and 1 for ctrl), since
we know where this is being called from.
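
Something like the below, untested sketch, is what I had in mind (field and
list names follow the split structures above plus the per-resource ctrl/mon
lists from later in your series, and I am ignoring the add_pos handling):

	static void *rdt_find_domain(struct rdt_resource *r, int id, bool ctrl)
	{
		struct list_head *head, *l;

		head = ctrl ? &r->ctrl_domains : &r->mon_domains;
		list_for_each(l, head) {
			if (ctrl) {
				struct rdt_ctrl_domain *d;

				d = list_entry(l, struct rdt_ctrl_domain, list);
				if (d->id == id)
					return d;
			} else {
				struct rdt_mon_domain *d;

				d = list_entry(l, struct rdt_mon_domain, list);
				if (d->id == id)
					return d;
			}
		}

		return NULL;
	}

It keeps a single entry point, at the cost of an if/else inside and a void *
return that callers have to cast.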

> ... but I fear the function is just big enough to get complaints
> that the two copies should somehow be merged.

Ok. That is fine. Let's go with the current implementation. Thanks for walking
through the discussion.

> 
> -Tony
> 
> [1] In v5 I tried using a comment in each to say they must be the same,
> but Reinette didn't like comments within declarations:
> https://lore.kernel.org/all/5f1256d3-737e-a447-abbe-f541767b2c8f@intel.com/

-- 
Thanks
Babu Moger

^ permalink raw reply	[flat|nested] 331+ messages in thread

* Re: [PATCH v11 1/8] x86/resctrl: Prepare for new domain scope
  2023-11-09 23:09               ` [PATCH v11 1/8] x86/resctrl: Prepare for new domain scope Tony Luck
@ 2023-11-13 19:36                 ` Moger, Babu
  2023-11-13 20:49                   ` Luck, Tony
  2023-11-20 22:16                 ` Reinette Chatre
  1 sibling, 1 reply; 331+ messages in thread
From: Moger, Babu @ 2023-11-13 19:36 UTC (permalink / raw)
  To: Tony Luck, Fenghua Yu, Reinette Chatre, Peter Newman,
	Jonathan Corbet, Shuah Khan, x86
  Cc: Shaopeng Tan, James Morse, Jamie Iles, Randy Dunlap,
	linux-kernel, linux-doc, patches

Hi Tony,

On 11/9/23 17:09, Tony Luck wrote:
> Resctrl resources operate on subsets of CPUs in the system with the
> defining attribute of each subset being an instance of a particular
> level of cache. E.g. all CPUs sharing an L3 cache would be part of the
> same domain.
> 
> In preparation for features that are scoped at the NUMA node level
> change the code from explicit references to "cache_level" to a more
> generic scope. At this point the only options for this scope are groups
> of CPUs that share an L2 cache or L3 cache.
> 
> Clean up the error handling when looking up domains. Report invalid id's
> before calling rdt_find_domain() in preparation for better messages when
> scope can be other than cache scope. This means that rdt_find_domain()
> will never return an error. So remove checks for error from the callsites.
> 
> Signed-off-by: Tony Luck <tony.luck@intel.com>
> ---
> Changes since v10:
> Reinette: Move a blank line in domain_add_cpu()
> Reniette: Fix warning message in domain_remove_cpu()
> 
>  include/linux/resctrl.h                   |  9 +++--
>  arch/x86/kernel/cpu/resctrl/core.c        | 42 +++++++++++++++--------
>  arch/x86/kernel/cpu/resctrl/ctrlmondata.c |  2 +-
>  arch/x86/kernel/cpu/resctrl/pseudo_lock.c |  6 +++-
>  arch/x86/kernel/cpu/resctrl/rdtgroup.c    |  5 ++-
>  5 files changed, 45 insertions(+), 19 deletions(-)
> 
> diff --git a/include/linux/resctrl.h b/include/linux/resctrl.h
> index 66942d7fba7f..7d4eb7df611d 100644
> --- a/include/linux/resctrl.h
> +++ b/include/linux/resctrl.h
> @@ -144,13 +144,18 @@ struct resctrl_membw {
>  struct rdt_parse_data;
>  struct resctrl_schema;
>  
> +enum resctrl_scope {
> +	RESCTRL_L2_CACHE = 2,
> +	RESCTRL_L3_CACHE = 3,
> +};
> +
>  /**
>   * struct rdt_resource - attributes of a resctrl resource
>   * @rid:		The index of the resource
>   * @alloc_capable:	Is allocation available on this machine
>   * @mon_capable:	Is monitor feature available on this machine
>   * @num_rmid:		Number of RMIDs available
> - * @cache_level:	Which cache level defines scope of this resource
> + * @scope:		Scope of this resource
>   * @cache:		Cache allocation related data
>   * @membw:		If the component has bandwidth controls, their properties.
>   * @domains:		All domains for this resource
> @@ -168,7 +173,7 @@ struct rdt_resource {
>  	bool			alloc_capable;
>  	bool			mon_capable;
>  	int			num_rmid;
> -	int			cache_level;
> +	enum resctrl_scope	scope;
>  	struct resctrl_cache	cache;
>  	struct resctrl_membw	membw;
>  	struct list_head	domains;
> diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
> index 19e0681f0435..9f8d87abd751 100644
> --- a/arch/x86/kernel/cpu/resctrl/core.c
> +++ b/arch/x86/kernel/cpu/resctrl/core.c
> @@ -65,7 +65,7 @@ struct rdt_hw_resource rdt_resources_all[] = {
>  		.r_resctrl = {
>  			.rid			= RDT_RESOURCE_L3,
>  			.name			= "L3",
> -			.cache_level		= 3,
> +			.scope			= RESCTRL_L3_CACHE,
>  			.domains		= domain_init(RDT_RESOURCE_L3),
>  			.parse_ctrlval		= parse_cbm,
>  			.format_str		= "%d=%0*x",
> @@ -79,7 +79,7 @@ struct rdt_hw_resource rdt_resources_all[] = {
>  		.r_resctrl = {
>  			.rid			= RDT_RESOURCE_L2,
>  			.name			= "L2",
> -			.cache_level		= 2,
> +			.scope			= RESCTRL_L2_CACHE,
>  			.domains		= domain_init(RDT_RESOURCE_L2),
>  			.parse_ctrlval		= parse_cbm,
>  			.format_str		= "%d=%0*x",
> @@ -93,7 +93,7 @@ struct rdt_hw_resource rdt_resources_all[] = {
>  		.r_resctrl = {
>  			.rid			= RDT_RESOURCE_MBA,
>  			.name			= "MB",
> -			.cache_level		= 3,
> +			.scope			= RESCTRL_L3_CACHE,
>  			.domains		= domain_init(RDT_RESOURCE_MBA),
>  			.parse_ctrlval		= parse_bw,
>  			.format_str		= "%d=%*u",
> @@ -105,7 +105,7 @@ struct rdt_hw_resource rdt_resources_all[] = {
>  		.r_resctrl = {
>  			.rid			= RDT_RESOURCE_SMBA,
>  			.name			= "SMBA",
> -			.cache_level		= 3,
> +			.scope			= RESCTRL_L3_CACHE,
>  			.domains		= domain_init(RDT_RESOURCE_SMBA),
>  			.parse_ctrlval		= parse_bw,
>  			.format_str		= "%d=%*u",
> @@ -401,9 +401,6 @@ struct rdt_domain *rdt_find_domain(struct rdt_resource *r, int id,
>  	struct rdt_domain *d;
>  	struct list_head *l;
>  
> -	if (id < 0)
> -		return ERR_PTR(-ENODEV);
> -
>  	list_for_each(l, &r->domains) {
>  		d = list_entry(l, struct rdt_domain, list);
>  		/* When id is found, return its domain. */
> @@ -491,6 +488,19 @@ static int arch_domain_mbm_alloc(u32 num_rmid, struct rdt_hw_domain *hw_dom)
>  	return 0;
>  }
>  
> +static int get_domain_id_from_scope(int cpu, enum resctrl_scope scope)
> +{
> +	switch (scope) {
> +	case RESCTRL_L2_CACHE:
> +	case RESCTRL_L3_CACHE:
> +		return get_cpu_cacheinfo_id(cpu, scope);
> +	default:
> +		break;
> +	}
> +
> +	return -EINVAL;
> +}

This function is meaningful when you introduce the node scope RESCTRL_NODE.

Can you please move this to patch 5?
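
For what it's worth, I assume that once RESCTRL_NODE exists this switch just
grows one more case, something like (my guess at the shape, not the actual
patch 5 hunk):

	case RESCTRL_L2_CACHE:
	case RESCTRL_L3_CACHE:
		return get_cpu_cacheinfo_id(cpu, scope);
	case RESCTRL_NODE:
		return cpu_to_node(cpu);
	default:
		break;

which is why it reads like patch 5 material to me.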

Thanks
Babu

> +
>  /*
>   * domain_add_cpu - Add a cpu to a resource's domain list.
>   *
> @@ -506,18 +516,19 @@ static int arch_domain_mbm_alloc(u32 num_rmid, struct rdt_hw_domain *hw_dom)
>   */
>  static void domain_add_cpu(int cpu, struct rdt_resource *r)
>  {
> -	int id = get_cpu_cacheinfo_id(cpu, r->cache_level);
> +	int id = get_domain_id_from_scope(cpu, r->scope);
>  	struct list_head *add_pos = NULL;
>  	struct rdt_hw_domain *hw_dom;
>  	struct rdt_domain *d;
>  	int err;
>  
> -	d = rdt_find_domain(r, id, &add_pos);
> -	if (IS_ERR(d)) {
> -		pr_warn("Couldn't find cache id for CPU %d\n", cpu);
> +	if (id < 0) {
> +		pr_warn_once("Can't find domain id for CPU:%d scope:%d for resource %s\n",
> +			     cpu, r->scope, r->name);
>  		return;
>  	}
>  
> +	d = rdt_find_domain(r, id, &add_pos);
>  	if (d) {
>  		cpumask_set_cpu(cpu, &d->cpu_mask);
>  		if (r->cache.arch_has_per_cpu_cfg)
> @@ -556,13 +567,16 @@ static void domain_add_cpu(int cpu, struct rdt_resource *r)
>  
>  static void domain_remove_cpu(int cpu, struct rdt_resource *r)
>  {
> -	int id = get_cpu_cacheinfo_id(cpu, r->cache_level);
> +	int id = get_domain_id_from_scope(cpu, r->scope);
>  	struct rdt_hw_domain *hw_dom;
>  	struct rdt_domain *d;
>  
> +	if (id < 0)
> +		return;
> +
>  	d = rdt_find_domain(r, id, NULL);
> -	if (IS_ERR_OR_NULL(d)) {
> -		pr_warn("Couldn't find cache id for CPU %d\n", cpu);
> +	if (!d) {
> +		pr_warn("Couldn't find domain with id=%d for CPU %d\n", id, cpu);
>  		return;
>  	}
>  	hw_dom = resctrl_to_arch_dom(d);
> diff --git a/arch/x86/kernel/cpu/resctrl/ctrlmondata.c b/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
> index beccb0e87ba7..3f8891d57fac 100644
> --- a/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
> +++ b/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
> @@ -563,7 +563,7 @@ int rdtgroup_mondata_show(struct seq_file *m, void *arg)
>  
>  	r = &rdt_resources_all[resid].r_resctrl;
>  	d = rdt_find_domain(r, domid, NULL);
> -	if (IS_ERR_OR_NULL(d)) {
> +	if (!d) {
>  		ret = -ENOENT;
>  		goto out;
>  	}
> diff --git a/arch/x86/kernel/cpu/resctrl/pseudo_lock.c b/arch/x86/kernel/cpu/resctrl/pseudo_lock.c
> index 8f559eeae08e..2a682da9f43a 100644
> --- a/arch/x86/kernel/cpu/resctrl/pseudo_lock.c
> +++ b/arch/x86/kernel/cpu/resctrl/pseudo_lock.c
> @@ -292,10 +292,14 @@ static void pseudo_lock_region_clear(struct pseudo_lock_region *plr)
>   */
>  static int pseudo_lock_region_init(struct pseudo_lock_region *plr)
>  {
> +	enum resctrl_scope scope = plr->s->res->scope;
>  	struct cpu_cacheinfo *ci;
>  	int ret;
>  	int i;
>  
> +	if (WARN_ON_ONCE(scope != RESCTRL_L2_CACHE && scope != RESCTRL_L3_CACHE))
> +		return -ENODEV;
> +
>  	/* Pick the first cpu we find that is associated with the cache. */
>  	plr->cpu = cpumask_first(&plr->d->cpu_mask);
>  
> @@ -311,7 +315,7 @@ static int pseudo_lock_region_init(struct pseudo_lock_region *plr)
>  	plr->size = rdtgroup_cbm_to_size(plr->s->res, plr->d, plr->cbm);
>  
>  	for (i = 0; i < ci->num_leaves; i++) {
> -		if (ci->info_list[i].level == plr->s->res->cache_level) {
> +		if (ci->info_list[i].level == scope) {
>  			plr->line_size = ci->info_list[i].coherency_line_size;
>  			return 0;
>  		}
> diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> index 69a1de92384a..c44be64d65ec 100644
> --- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> +++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> @@ -1413,10 +1413,13 @@ unsigned int rdtgroup_cbm_to_size(struct rdt_resource *r,
>  	unsigned int size = 0;
>  	int num_b, i;
>  
> +	if (WARN_ON_ONCE(r->scope != RESCTRL_L2_CACHE && r->scope != RESCTRL_L3_CACHE))
> +		return size;
> +
>  	num_b = bitmap_weight(&cbm, r->cache.cbm_len);
>  	ci = get_cpu_cacheinfo(cpumask_any(&d->cpu_mask));
>  	for (i = 0; i < ci->num_leaves; i++) {
> -		if (ci->info_list[i].level == r->cache_level) {
> +		if (ci->info_list[i].level == r->scope) {
>  			size = ci->info_list[i].size / r->cache.cbm_len * num_b;
>  			break;
>  		}

-- 
Thanks
Babu Moger

^ permalink raw reply	[flat|nested] 331+ messages in thread

* RE: [PATCH v11 1/8] x86/resctrl: Prepare for new domain scope
  2023-11-13 19:36                 ` Moger, Babu
@ 2023-11-13 20:49                   ` Luck, Tony
  0 siblings, 0 replies; 331+ messages in thread
From: Luck, Tony @ 2023-11-13 20:49 UTC (permalink / raw)
  To: babu.moger, Yu, Fenghua, Chatre, Reinette, Peter Newman,
	Jonathan Corbet, Shuah Khan, x86
  Cc: Shaopeng Tan, James Morse, Jamie Iles, Randy Dunlap,
	linux-kernel, linux-doc, patches

> > +static int get_domain_id_from_scope(int cpu, enum resctrl_scope scope)
> > +{
> > +   switch (scope) {
> > +   case RESCTRL_L2_CACHE:
> > +   case RESCTRL_L3_CACHE:
> > +           return get_cpu_cacheinfo_id(cpu, scope);
> > +   default:
> > +           break;
> > +   }
> > +
> > +   return -EINVAL;
> > +}
>
> This function is meanigfull when you introduce node scope RESCTRL_NODE.
>
> Can you please move this to patch 5?

The code has been this way since v5. I'll note your opinion and, if others
agree, I'll move it. But I think this is a reasonable part of the first step
in moving away from purely cache scope for domains.

-Tony

^ permalink raw reply	[flat|nested] 331+ messages in thread

* Re: [PATCH v11 1/8] x86/resctrl: Prepare for new domain scope
  2023-11-09 23:09               ` [PATCH v11 1/8] x86/resctrl: Prepare for new domain scope Tony Luck
  2023-11-13 19:36                 ` Moger, Babu
@ 2023-11-20 22:16                 ` Reinette Chatre
  1 sibling, 0 replies; 331+ messages in thread
From: Reinette Chatre @ 2023-11-20 22:16 UTC (permalink / raw)
  To: Tony Luck, Fenghua Yu, Peter Newman, Jonathan Corbet, Shuah Khan, x86
  Cc: Shaopeng Tan, James Morse, Jamie Iles, Babu Moger, Randy Dunlap,
	linux-kernel, linux-doc, patches

Hi Tony,

On 11/9/2023 3:09 PM, Tony Luck wrote:
> @@ -556,13 +567,16 @@ static void domain_add_cpu(int cpu, struct rdt_resource *r)
>  
>  static void domain_remove_cpu(int cpu, struct rdt_resource *r)
>  {
> -	int id = get_cpu_cacheinfo_id(cpu, r->cache_level);
> +	int id = get_domain_id_from_scope(cpu, r->scope);
>  	struct rdt_hw_domain *hw_dom;
>  	struct rdt_domain *d;
>  
> +	if (id < 0)
> +		return;
> +

Apologies for this late comment on code that has been like this for
a few versions. Something must have gone really wrong if id < 0, so I do
not think a silent failure is ideal. Could the same warning as above be
included here? That is:

	pr_warn_once("Can't find domain id for CPU:%d scope:%d for resource %s\n",
		     cpu, r->scope, r->name);


If you agree, please do ensure that this change propagates to
domain_remove_cpu_ctrl() and domain_remove_cpu_mon() in patches that
follow, in which case the text is expected to adjust to "Can't find control
domain id" and "Can't find monitor domain id" respectively. (btw ... is
the switching between "Can't" and "Couldn't" intentional?)

With this change please feel free to include:
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>

Reinette

^ permalink raw reply	[flat|nested] 331+ messages in thread

* Re: [PATCH v11 2/8] x86/resctrl: Prepare to split rdt_domain structure
  2023-11-09 23:09               ` [PATCH v11 2/8] x86/resctrl: Prepare to split rdt_domain structure Tony Luck
  2023-11-13 18:08                 ` Moger, Babu
@ 2023-11-20 22:18                 ` Reinette Chatre
  1 sibling, 0 replies; 331+ messages in thread
From: Reinette Chatre @ 2023-11-20 22:18 UTC (permalink / raw)
  To: Tony Luck, Fenghua Yu, Peter Newman, Jonathan Corbet, Shuah Khan, x86
  Cc: Shaopeng Tan, James Morse, Jamie Iles, Babu Moger, Randy Dunlap,
	linux-kernel, linux-doc, patches

Hi Tony,

On 11/9/2023 3:09 PM, Tony Luck wrote:
> The rdt_domain structure is used for both control and monitor features.
> It is about to be split into separate structures for these two usages
> because the scope for control and monitoring features for a resource
> will be different for future resources.
> 
> To allow for common code that scans a list of domains looking for a
> specific domain id, move all the common fields ("list", "id", "cpu_mask")
> into their own structure within the rdt_domain structure.
> 
> Signed-off-by: Tony Luck <tony.luck@intel.com>
> ---

Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>

Reinette

^ permalink raw reply	[flat|nested] 331+ messages in thread

* Re: [PATCH v11 3/8] x86/resctrl: Prepare for different scope for control/monitor operations
  2023-11-09 23:09               ` [PATCH v11 3/8] x86/resctrl: Prepare for different scope for control/monitor operations Tony Luck
@ 2023-11-20 22:19                 ` Reinette Chatre
  0 siblings, 0 replies; 331+ messages in thread
From: Reinette Chatre @ 2023-11-20 22:19 UTC (permalink / raw)
  To: Tony Luck, Fenghua Yu, Peter Newman, Jonathan Corbet, Shuah Khan, x86
  Cc: Shaopeng Tan, James Morse, Jamie Iles, Babu Moger, Randy Dunlap,
	linux-kernel, linux-doc, patches

Hi Tony,

On 11/9/2023 3:09 PM, Tony Luck wrote:

...

> +static void domain_remove_cpu_ctrl(int cpu, struct rdt_resource *r)
> +{
> +	int id = get_domain_id_from_scope(cpu, r->ctrl_scope);
>  	struct rdt_hw_domain *hw_dom;
> +	struct rdt_domain_hdr *hdr;
>  	struct rdt_domain *d;
>  
>  	if (id < 0)
>  		return;
>  
> -	d = rdt_find_domain(r, id, NULL);
> -	if (!d) {
> -		pr_warn("Couldn't find domain with id=%d for CPU %d\n", id, cpu);
> +	hdr = rdt_find_domain(&r->ctrl_domains, id, NULL);
> +	if (!hdr) {
> +		pr_warn("Couldn't find control scope id=%d for CPU %d\n", id, cpu);

This error message seems strange to me. Shouldn't this just be "Couldn't find
control domain with id=..."? Also, can the resource name be added to the
error message to help the user dig into the cause of the error?
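
Something along these lines is what I have in mind (exact wording is of
course up to you):

	pr_warn("Couldn't find control domain with id=%d for CPU %d for resource %s\n",
		id, cpu, r->name);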


>  		return;
>  	}
> +
> +	if (WARN_ON_ONCE(hdr->type != RESCTRL_CTRL_DOMAIN))
> +		return;
> +
> +	d = container_of(hdr, struct rdt_domain, hdr);
>  	hw_dom = resctrl_to_arch_dom(d);
>  
>  	cpumask_clear_cpu(cpu, &d->hdr.cpu_mask);
>  	if (cpumask_empty(&d->hdr.cpu_mask)) {
> -		resctrl_offline_domain(r, d);
> +		resctrl_offline_ctrl_domain(r, d);
>  		list_del(&d->hdr.list);
>  
>  		/*
> @@ -596,6 +654,38 @@ static void domain_remove_cpu(int cpu, struct rdt_resource *r)
>  
>  		return;
>  	}
> +}
> +
> +static void domain_remove_cpu_mon(int cpu, struct rdt_resource *r)
> +{
> +	int id = get_domain_id_from_scope(cpu, r->mon_scope);
> +	struct rdt_hw_domain *hw_dom;
> +	struct rdt_domain_hdr *hdr;
> +	struct rdt_domain *d;
> +
> +	if (id < 0)
> +		return;
> +
> +	hdr = rdt_find_domain(&r->mon_domains, id, NULL);
> +	if (!hdr) {
> +		pr_warn("Couldn't find monitor scope id=%d for CPU %d\n", id, cpu);
> +		return;
> +	}

Similar to above, can this be "Couldn't find monitor domain with id..."?
Can resource name be added here also?

The rest of the patch looks good to me.

Reinette


^ permalink raw reply	[flat|nested] 331+ messages in thread

* Re: [PATCH v11 4/8] x86/resctrl: Split the rdt_domain and rdt_hw_domain structures
  2023-11-09 23:09               ` [PATCH v11 4/8] x86/resctrl: Split the rdt_domain and rdt_hw_domain structures Tony Luck
@ 2023-11-20 22:19                 ` Reinette Chatre
  0 siblings, 0 replies; 331+ messages in thread
From: Reinette Chatre @ 2023-11-20 22:19 UTC (permalink / raw)
  To: Tony Luck, Fenghua Yu, Peter Newman, Jonathan Corbet, Shuah Khan, x86
  Cc: Shaopeng Tan, James Morse, Jamie Iles, Babu Moger, Randy Dunlap,
	linux-kernel, linux-doc, patches

Hi Tony,

On 11/9/2023 3:09 PM, Tony Luck wrote:
> The same rdt_domain structure is used for both control and monitor
> functions. But this results in wasted memory as some of the fields are
> only used by control functions, while most are only used for monitor
> functions.
> 
> Split into separate rdt_ctrl_domain and rdt_mon_domain structures with
> just the fields required for control and monitoring respectively.
> 
> Similar split of the rdt_hw_domain structure into rdt_hw_ctrl_domain
> and rdt_hw_mon_domain.
> 
> Reviewed-by: Peter Newman <peternewman@google.com>
> Signed-off-by: Tony Luck <tony.luck@intel.com>
> ---

Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>

Reinette

^ permalink raw reply	[flat|nested] 331+ messages in thread

* Re: [PATCH v11 5/8] x86/resctrl: Add node-scope to the options for feature scope
  2023-11-09 23:09               ` [PATCH v11 5/8] x86/resctrl: Add node-scope to the options for feature scope Tony Luck
@ 2023-11-20 22:22                 ` Reinette Chatre
  0 siblings, 0 replies; 331+ messages in thread
From: Reinette Chatre @ 2023-11-20 22:22 UTC (permalink / raw)
  To: Tony Luck, Fenghua Yu, Peter Newman, Jonathan Corbet, Shuah Khan, x86
  Cc: Shaopeng Tan, James Morse, Jamie Iles, Babu Moger, Randy Dunlap,
	linux-kernel, linux-doc, patches

Hi Tony,

On 11/9/2023 3:09 PM, Tony Luck wrote:
> Currently supported resctrl features are all domain scoped the same as the
> scope of the L2 or L3 caches.
> 
> Add RESCTRL_NODE as a new option for features that are scoped at the
> same granularity as NUMA nodes. This is needed for Intel's Sub-NUMA
> Cluster (SNC) feature where monitoring features are node scoped.
> 
> Reviewed-by: Peter Newman <peternewman@google.com>
> Signed-off-by: Tony Luck <tony.luck@intel.com>
> ---

Please follow guidance in section "Ordering of commit tags" from
Documentation/process/maintainer-tip.rst

With that addressed (thus adding Reviewed-by tags after your 
author SOB):
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>

Reinette

^ permalink raw reply	[flat|nested] 331+ messages in thread

* Re: [PATCH v11 6/8] x86/resctrl: Introduce snc_nodes_per_l3_cache
  2023-11-09 23:09               ` [PATCH v11 6/8] x86/resctrl: Introduce snc_nodes_per_l3_cache Tony Luck
@ 2023-11-20 22:23                 ` Reinette Chatre
  0 siblings, 0 replies; 331+ messages in thread
From: Reinette Chatre @ 2023-11-20 22:23 UTC (permalink / raw)
  To: Tony Luck, Fenghua Yu, Peter Newman, Jonathan Corbet, Shuah Khan, x86
  Cc: Shaopeng Tan, James Morse, Jamie Iles, Babu Moger, Randy Dunlap,
	linux-kernel, linux-doc, patches

Hi Tony,

On 11/9/2023 3:09 PM, Tony Luck wrote:
> Intel Sub-NUMA Cluster (SNC) is a feature that subdivides the CPU cores
> and memory controllers on a socket into two or more groups. These are
> presented to the operating system as NUMA nodes.
> 
> This may enable some workloads to have slightly lower latency to memory
> as the memory controller(s) in an SNC node are electrically closer to the
> CPU cores on that SNC node. This cost may be offset by lower bandwidth
> since the memory accesses for each core can only be interleaved between
> the memory controllers on the same SNC node.
> 
> Resctrl monitoring on an Intel system depends upon attaching RMIDs to tasks
> to track L3 cache occupancy and memory bandwidth. There is an MSR that
> controls how the RMIDs are shared between SNC nodes.
> 
> The default mode divides them numerically. E.g. when there are two SNC
> nodes on a socket the lower number half of the RMIDs are given to the
> first node, the remainder to the second node. This would be difficult
> to use with the Linux resctrl interface as specific RMID values assigned
> to resctrl groups are not visible to users.
> 
> The other mode divides the RMIDs and renumbers the ones on the second
> SNC node to start from zero.
> 
> Even with this renumbering SNC mode requires several changes in resctrl
> behavior for correct operation.
> 
> Add a global integer "snc_nodes_per_l3_cache" that will show how many
> SNC nodes share each L3 cache. When this is "1", SNC mode is either
> not implemented, or not enabled, but all places that need to check
> it are updated to take appropriate action when SNC mode is enabled.
> 
> Code that needs to take action when SNC is enabled is:
> 1) The number of logical RMIDs per L3 cache available for use is the
>    number of physical RMIDs divided by the number of SNC nodes.
> 2) Likewise the "mon_scale" value must be divided by the number of SNC
>    nodes.
> 3) The RMID renumbering operates when using the value from the
>    IA32_PQR_ASSOC MSR to count accesses by a task. When reading an RMID
>    counter, code must adjust from the logical RMID used to the physical
>    RMID value for the SNC node that it wishes to read and load the
>    adjusted value into the IA32_QM_EVTSEL MSR.
> 4) The L3 cache is divided between the SNC nodes. So the value
>    reported in the resctrl "size" file is divided by the number of SNC
>    nodes because the effective amount of cache that can be allocated
>    is reduced by that factor.
> 5) The "-o mba_MBps" mount option must be disabled in SNC mode
>    because the monitoring is being done per SNC node, while the
>    bandwidth allocation is still done at the L3 cache scope.
>    Trying to use this feedback loop might result in contradictory
>    changes to the throttling level coming from each of the SNC
>    node bandwidth measurements.
> 

The latter part of this changelog stopped being in imperative mood.
To reduce confusion I slightly reworked it below to address the
parts I noticed. I often get it wrong myself so please check again.

	Add a global integer "snc_nodes_per_l3_cache" that shows how many
	SNC nodes share each L3 cache. When "snc_nodes_per_l3_cache" is "1",
	SNC mode is either not implemented, or not enabled.

	Update all places to take appropriate action when SNC mode is enabled:
	1) The number of logical RMIDs per L3 cache available for use is the
	   number of physical RMIDs divided by the number of SNC nodes.
	2) Likewise the "mon_scale" value must be divided by the number of SNC
	   nodes.
	3) The RMID renumbering operates when using the value from the
	   IA32_PQR_ASSOC MSR to count accesses by a task. When reading an RMID
	   counter, adjust from the logical RMID to the physical
	   RMID value for the SNC node that it wishes to read and load the
	   adjusted value into the IA32_QM_EVTSEL MSR.
	4) Divide the L3 cache between the SNC nodes. Divide the value
	   reported in the resctrl "size" file by the number of SNC
	   nodes because the effective amount of cache that can be allocated
	   is reduced by that factor.
	5) Disable the "-o mba_MBps" mount option in SNC mode
	   because the monitoring is being done per SNC node, while the
	   bandwidth allocation is still done at the L3 cache scope.
	   Trying to use this feedback loop might result in contradictory
	   changes to the throttling level coming from each of the SNC
	   node bandwidth measurements.
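
As an aside, the adjustment described in item 3 amounts to roughly the
following; the helper name and the cpu_to_node() based lookup of the SNC
node index are my shorthand rather than a quote of the patch, and num_rmid
is already the per-SNC-node count from item 1:

	static int logical_rmid_to_physical_rmid(int cpu, int lrmid)
	{
		struct rdt_resource *r = &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl;

		if (snc_nodes_per_l3_cache == 1)
			return lrmid;

		return lrmid + (cpu_to_node(cpu) % snc_nodes_per_l3_cache) * r->num_rmid;
	}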

> Reviewed-by: Peter Newman <peternewman@google.com>
> Signed-off-by: Tony Luck <tony.luck@intel.com>
> ---

(same comment as previous patch about commit tag ordering)

Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>

Reinette

^ permalink raw reply	[flat|nested] 331+ messages in thread

* Re: [PATCH v11 7/8] x86/resctrl: Sub NUMA Cluster detection and enable
  2023-11-09 23:09               ` [PATCH v11 7/8] x86/resctrl: Sub NUMA Cluster detection and enable Tony Luck
@ 2023-11-20 22:24                 ` Reinette Chatre
  0 siblings, 0 replies; 331+ messages in thread
From: Reinette Chatre @ 2023-11-20 22:24 UTC (permalink / raw)
  To: Tony Luck, Fenghua Yu, Peter Newman, Jonathan Corbet, Shuah Khan, x86
  Cc: Shaopeng Tan, James Morse, Jamie Iles, Babu Moger, Randy Dunlap,
	linux-kernel, linux-doc, patches

Hi Tony,

On 11/9/2023 3:09 PM, Tony Luck wrote:
> There isn't a simple hardware bit that indicates whether a CPU is
> running in Sub NUMA Cluster (SNC) mode. Infer the state by comparing
> the ratio of NUMA nodes to L3 cache instances.
> 
> When SNC mode is detected, reconfigure the RMID counters by updating
> the MSR_RMID_SNC_CONFIG MSR on each socket as CPUs are seen.
> 
> Clearing bit zero of the MSR divides the RMIDs and renumbers the ones
> on the second SNC node to start from zero.
> 
> Reviewed-by: Peter Newman <peternewman@google.com>
> Signed-off-by: Tony Luck <tony.luck@intel.com>
> ---

(note ordering of commit tags)

Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>

Reinette

^ permalink raw reply	[flat|nested] 331+ messages in thread

* Re: [PATCH v11 8/8] x86/resctrl: Update documentation with Sub-NUMA cluster changes
  2023-11-09 23:09               ` [PATCH v11 8/8] x86/resctrl: Update documentation with Sub-NUMA cluster changes Tony Luck
  2023-11-09 23:14                 ` Randy Dunlap
@ 2023-11-20 22:25                 ` Reinette Chatre
  1 sibling, 0 replies; 331+ messages in thread
From: Reinette Chatre @ 2023-11-20 22:25 UTC (permalink / raw)
  To: Tony Luck, Fenghua Yu, Peter Newman, Jonathan Corbet, Shuah Khan, x86
  Cc: Shaopeng Tan, James Morse, Jamie Iles, Babu Moger, Randy Dunlap,
	linux-kernel, linux-doc, patches

Hi Tony,

On 11/9/2023 3:09 PM, Tony Luck wrote:
> With Sub-NUMA Cluster mode enabled the scope of monitoring resources is
> per-NODE instead of per-L3 cache. Suffixes of directories with "L3" in
> their name refer to Sub-NUMA nodes instead of L3 cache ids.
> 
> Users should be aware that SNC mode also affects the amount of L3 cache
> available for allocation within each SNC node.
> 
> Reviewed-by: Peter Newman <peternewman@google.com>
> Signed-off-by: Tony Luck <tony.luck@intel.com>
> ---

(note about order of commit tags)

Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>

Reinette

^ permalink raw reply	[flat|nested] 331+ messages in thread

* [PATCH v12 0/8] Add support for Sub-NUMA cluster (SNC) systems
  2023-11-09 23:09             ` [PATCH v11 0/8] Add support for Sub-NUMA cluster (SNC) systems Tony Luck
                                 ` (8 preceding siblings ...)
  2023-11-13  4:46               ` [PATCH v11 0/8] Add support for Sub-NUMA cluster (SNC) systems Shaopeng Tan (Fujitsu)
@ 2023-11-30  0:34               ` Tony Luck
  2023-11-30  0:34                 ` [PATCH v12 1/8] x86/resctrl: Prepare for new domain scope Tony Luck
                                   ` (8 more replies)
  9 siblings, 9 replies; 331+ messages in thread
From: Tony Luck @ 2023-11-30  0:34 UTC (permalink / raw)
  To: Fenghua Yu, Reinette Chatre, Peter Newman, Jonathan Corbet,
	Shuah Khan, x86
  Cc: Shaopeng Tan, James Morse, Jamie Iles, Babu Moger, Randy Dunlap,
	linux-kernel, linux-doc, patches, Tony Luck

The Sub-NUMA cluster feature on some Intel processors partitions the CPUs
that share an L3 cache into two or more sets. This plays havoc with the
Resource Director Technology (RDT) monitoring features.  Prior to this
patch series, Intel has advised that SNC and RDT are incompatible.

Some of these CPUs support an MSR that can partition the RMID counters in
the same way. This allows monitoring features to be used, with the caveat
that users must be aware that Linux may migrate tasks more frequently
between SNC nodes than between "regular" NUMA nodes, so reading counters
from all SNC nodes may be needed to get a complete picture of activity
for tasks.

Cache and memory bandwidth allocation features continue to operate at
the scope of the L3 cache.

Signed-off-by: Tony Luck <tony.luck@intel.com>

Changes since v11:

Global: (comment from Reinette)
	Reorder tags with Signed-off-by: first, then Reviewed/Tested

Patch1: (comment from Reinette)
	Add error message to domain_remove_cpu() [matching the one in
	domain_add_cpu()] for the case where get_cpu_cacheinfo_id()
	failed to find a cache ID for the current CPU.

Patch3: (comment from Reinette)
	When splitting the domain_add_cpu() and domain_remove_cpu()
	functions add "control" and "monitor" to the warning messages.
	Fix the:
		pr_warn("Couldn't find control scope id=%d for CPU %d\n", id, cpu);
	message:
		s/Couldn't/Can't/
		s/control scope/control domain with/
		Add resource name.
	Ditto for similar monitor message.

Patch6: (comment from Reinette)
	Used Reinette's rewrite into imperative mood for latter part
	of commit message.

Patch8: (comment from Randy)
	s/have/has/  s/cache. But/cache, but/

Added Reinette's "Reviewed-by:" to all patches except patch 3.

Added Shaopeng Tan's Reviewed and Tested to all patches.

Rebased to v6.7-rc3

Tony Luck (8):
  x86/resctrl: Prepare for new domain scope
  x86/resctrl: Prepare to split rdt_domain structure
  x86/resctrl: Prepare for different scope for control/monitor
    operations
  x86/resctrl: Split the rdt_domain and rdt_hw_domain structures
  x86/resctrl: Add node-scope to the options for feature scope
  x86/resctrl: Introduce snc_nodes_per_l3_cache
  x86/resctrl: Sub NUMA Cluster detection and enable
  x86/resctrl: Update documentation with Sub-NUMA cluster changes

 Documentation/arch/x86/resctrl.rst        |  25 +-
 include/linux/resctrl.h                   |  85 +++--
 arch/x86/include/asm/msr-index.h          |   1 +
 arch/x86/kernel/cpu/resctrl/internal.h    |  66 ++--
 arch/x86/kernel/cpu/resctrl/core.c        | 411 +++++++++++++++++-----
 arch/x86/kernel/cpu/resctrl/ctrlmondata.c |  58 +--
 arch/x86/kernel/cpu/resctrl/monitor.c     |  68 ++--
 arch/x86/kernel/cpu/resctrl/pseudo_lock.c |  26 +-
 arch/x86/kernel/cpu/resctrl/rdtgroup.c    | 149 ++++----
 9 files changed, 607 insertions(+), 282 deletions(-)


base-commit: 2cc14f52aeb78ce3f29677c2de1f06c0e91471ab
-- 
2.41.0


^ permalink raw reply	[flat|nested] 331+ messages in thread

* [PATCH v12 1/8] x86/resctrl: Prepare for new domain scope
  2023-11-30  0:34               ` [PATCH v12 " Tony Luck
@ 2023-11-30  0:34                 ` Tony Luck
  2023-11-30  0:34                 ` [PATCH v12 2/8] x86/resctrl: Prepare to split rdt_domain structure Tony Luck
                                   ` (7 subsequent siblings)
  8 siblings, 0 replies; 331+ messages in thread
From: Tony Luck @ 2023-11-30  0:34 UTC (permalink / raw)
  To: Fenghua Yu, Reinette Chatre, Peter Newman, Jonathan Corbet,
	Shuah Khan, x86
  Cc: Shaopeng Tan, James Morse, Jamie Iles, Babu Moger, Randy Dunlap,
	linux-kernel, linux-doc, patches, Tony Luck, Shaopeng Tan

Resctrl resources operate on subsets of CPUs in the system with the
defining attribute of each subset being an instance of a particular
level of cache. E.g. all CPUs sharing an L3 cache would be part of the
same domain.

In preparation for features that are scoped at the NUMA node level
change the code from explicit references to "cache_level" to a more
generic scope. At this point the only options for this scope are groups
of CPUs that share an L2 cache or L3 cache.

Clean up the error handling when looking up domains. Report invalid id's
before calling rdt_find_domain() in preparation for better messages when
scope can be other than cache scope. This means that rdt_find_domain()
will never return an error. So remove checks for error from the callsites.

Signed-off-by: Tony Luck <tony.luck@intel.com>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Reviewed-by: Shaopeng Tan <tan.shaopeng@jp.fujitsu.com>
Tested-by: Shaopeng Tan <tan.shaopeng@jp.fujitsu.com>
---
 include/linux/resctrl.h                   |  9 ++++-
 arch/x86/kernel/cpu/resctrl/core.c        | 45 ++++++++++++++++-------
 arch/x86/kernel/cpu/resctrl/ctrlmondata.c |  2 +-
 arch/x86/kernel/cpu/resctrl/pseudo_lock.c |  6 ++-
 arch/x86/kernel/cpu/resctrl/rdtgroup.c    |  5 ++-
 5 files changed, 48 insertions(+), 19 deletions(-)

diff --git a/include/linux/resctrl.h b/include/linux/resctrl.h
index 66942d7fba7f..7d4eb7df611d 100644
--- a/include/linux/resctrl.h
+++ b/include/linux/resctrl.h
@@ -144,13 +144,18 @@ struct resctrl_membw {
 struct rdt_parse_data;
 struct resctrl_schema;
 
+enum resctrl_scope {
+	RESCTRL_L2_CACHE = 2,
+	RESCTRL_L3_CACHE = 3,
+};
+
 /**
  * struct rdt_resource - attributes of a resctrl resource
  * @rid:		The index of the resource
  * @alloc_capable:	Is allocation available on this machine
  * @mon_capable:	Is monitor feature available on this machine
  * @num_rmid:		Number of RMIDs available
- * @cache_level:	Which cache level defines scope of this resource
+ * @scope:		Scope of this resource
  * @cache:		Cache allocation related data
  * @membw:		If the component has bandwidth controls, their properties.
  * @domains:		All domains for this resource
@@ -168,7 +173,7 @@ struct rdt_resource {
 	bool			alloc_capable;
 	bool			mon_capable;
 	int			num_rmid;
-	int			cache_level;
+	enum resctrl_scope	scope;
 	struct resctrl_cache	cache;
 	struct resctrl_membw	membw;
 	struct list_head	domains;
diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
index 19e0681f0435..fd113bc29d4e 100644
--- a/arch/x86/kernel/cpu/resctrl/core.c
+++ b/arch/x86/kernel/cpu/resctrl/core.c
@@ -65,7 +65,7 @@ struct rdt_hw_resource rdt_resources_all[] = {
 		.r_resctrl = {
 			.rid			= RDT_RESOURCE_L3,
 			.name			= "L3",
-			.cache_level		= 3,
+			.scope			= RESCTRL_L3_CACHE,
 			.domains		= domain_init(RDT_RESOURCE_L3),
 			.parse_ctrlval		= parse_cbm,
 			.format_str		= "%d=%0*x",
@@ -79,7 +79,7 @@ struct rdt_hw_resource rdt_resources_all[] = {
 		.r_resctrl = {
 			.rid			= RDT_RESOURCE_L2,
 			.name			= "L2",
-			.cache_level		= 2,
+			.scope			= RESCTRL_L2_CACHE,
 			.domains		= domain_init(RDT_RESOURCE_L2),
 			.parse_ctrlval		= parse_cbm,
 			.format_str		= "%d=%0*x",
@@ -93,7 +93,7 @@ struct rdt_hw_resource rdt_resources_all[] = {
 		.r_resctrl = {
 			.rid			= RDT_RESOURCE_MBA,
 			.name			= "MB",
-			.cache_level		= 3,
+			.scope			= RESCTRL_L3_CACHE,
 			.domains		= domain_init(RDT_RESOURCE_MBA),
 			.parse_ctrlval		= parse_bw,
 			.format_str		= "%d=%*u",
@@ -105,7 +105,7 @@ struct rdt_hw_resource rdt_resources_all[] = {
 		.r_resctrl = {
 			.rid			= RDT_RESOURCE_SMBA,
 			.name			= "SMBA",
-			.cache_level		= 3,
+			.scope			= RESCTRL_L3_CACHE,
 			.domains		= domain_init(RDT_RESOURCE_SMBA),
 			.parse_ctrlval		= parse_bw,
 			.format_str		= "%d=%*u",
@@ -401,9 +401,6 @@ struct rdt_domain *rdt_find_domain(struct rdt_resource *r, int id,
 	struct rdt_domain *d;
 	struct list_head *l;
 
-	if (id < 0)
-		return ERR_PTR(-ENODEV);
-
 	list_for_each(l, &r->domains) {
 		d = list_entry(l, struct rdt_domain, list);
 		/* When id is found, return its domain. */
@@ -491,6 +488,19 @@ static int arch_domain_mbm_alloc(u32 num_rmid, struct rdt_hw_domain *hw_dom)
 	return 0;
 }
 
+static int get_domain_id_from_scope(int cpu, enum resctrl_scope scope)
+{
+	switch (scope) {
+	case RESCTRL_L2_CACHE:
+	case RESCTRL_L3_CACHE:
+		return get_cpu_cacheinfo_id(cpu, scope);
+	default:
+		break;
+	}
+
+	return -EINVAL;
+}
+
 /*
  * domain_add_cpu - Add a cpu to a resource's domain list.
  *
@@ -506,18 +516,19 @@ static int arch_domain_mbm_alloc(u32 num_rmid, struct rdt_hw_domain *hw_dom)
  */
 static void domain_add_cpu(int cpu, struct rdt_resource *r)
 {
-	int id = get_cpu_cacheinfo_id(cpu, r->cache_level);
+	int id = get_domain_id_from_scope(cpu, r->scope);
 	struct list_head *add_pos = NULL;
 	struct rdt_hw_domain *hw_dom;
 	struct rdt_domain *d;
 	int err;
 
-	d = rdt_find_domain(r, id, &add_pos);
-	if (IS_ERR(d)) {
-		pr_warn("Couldn't find cache id for CPU %d\n", cpu);
+	if (id < 0) {
+		pr_warn_once("Can't find domain id for CPU:%d scope:%d for resource %s\n",
+			     cpu, r->scope, r->name);
 		return;
 	}
 
+	d = rdt_find_domain(r, id, &add_pos);
 	if (d) {
 		cpumask_set_cpu(cpu, &d->cpu_mask);
 		if (r->cache.arch_has_per_cpu_cfg)
@@ -556,13 +567,19 @@ static void domain_add_cpu(int cpu, struct rdt_resource *r)
 
 static void domain_remove_cpu(int cpu, struct rdt_resource *r)
 {
-	int id = get_cpu_cacheinfo_id(cpu, r->cache_level);
+	int id = get_domain_id_from_scope(cpu, r->scope);
 	struct rdt_hw_domain *hw_dom;
 	struct rdt_domain *d;
 
+	if (id < 0) {
+		pr_warn_once("Can't find domain id for CPU:%d scope:%d for resource %s\n",
+			     cpu, r->scope, r->name);
+		return;
+	}
+
 	d = rdt_find_domain(r, id, NULL);
-	if (IS_ERR_OR_NULL(d)) {
-		pr_warn("Couldn't find cache id for CPU %d\n", cpu);
+	if (!d) {
+		pr_warn("Couldn't find domain with id=%d for CPU %d\n", id, cpu);
 		return;
 	}
 	hw_dom = resctrl_to_arch_dom(d);
diff --git a/arch/x86/kernel/cpu/resctrl/ctrlmondata.c b/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
index beccb0e87ba7..3f8891d57fac 100644
--- a/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
+++ b/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
@@ -563,7 +563,7 @@ int rdtgroup_mondata_show(struct seq_file *m, void *arg)
 
 	r = &rdt_resources_all[resid].r_resctrl;
 	d = rdt_find_domain(r, domid, NULL);
-	if (IS_ERR_OR_NULL(d)) {
+	if (!d) {
 		ret = -ENOENT;
 		goto out;
 	}
diff --git a/arch/x86/kernel/cpu/resctrl/pseudo_lock.c b/arch/x86/kernel/cpu/resctrl/pseudo_lock.c
index 8f559eeae08e..2a682da9f43a 100644
--- a/arch/x86/kernel/cpu/resctrl/pseudo_lock.c
+++ b/arch/x86/kernel/cpu/resctrl/pseudo_lock.c
@@ -292,10 +292,14 @@ static void pseudo_lock_region_clear(struct pseudo_lock_region *plr)
  */
 static int pseudo_lock_region_init(struct pseudo_lock_region *plr)
 {
+	enum resctrl_scope scope = plr->s->res->scope;
 	struct cpu_cacheinfo *ci;
 	int ret;
 	int i;
 
+	if (WARN_ON_ONCE(scope != RESCTRL_L2_CACHE && scope != RESCTRL_L3_CACHE))
+		return -ENODEV;
+
 	/* Pick the first cpu we find that is associated with the cache. */
 	plr->cpu = cpumask_first(&plr->d->cpu_mask);
 
@@ -311,7 +315,7 @@ static int pseudo_lock_region_init(struct pseudo_lock_region *plr)
 	plr->size = rdtgroup_cbm_to_size(plr->s->res, plr->d, plr->cbm);
 
 	for (i = 0; i < ci->num_leaves; i++) {
-		if (ci->info_list[i].level == plr->s->res->cache_level) {
+		if (ci->info_list[i].level == scope) {
 			plr->line_size = ci->info_list[i].coherency_line_size;
 			return 0;
 		}
diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
index 69a1de92384a..c44be64d65ec 100644
--- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
+++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
@@ -1413,10 +1413,13 @@ unsigned int rdtgroup_cbm_to_size(struct rdt_resource *r,
 	unsigned int size = 0;
 	int num_b, i;
 
+	if (WARN_ON_ONCE(r->scope != RESCTRL_L2_CACHE && r->scope != RESCTRL_L3_CACHE))
+		return size;
+
 	num_b = bitmap_weight(&cbm, r->cache.cbm_len);
 	ci = get_cpu_cacheinfo(cpumask_any(&d->cpu_mask));
 	for (i = 0; i < ci->num_leaves; i++) {
-		if (ci->info_list[i].level == r->cache_level) {
+		if (ci->info_list[i].level == r->scope) {
 			size = ci->info_list[i].size / r->cache.cbm_len * num_b;
 			break;
 		}
-- 
2.41.0


^ permalink raw reply related	[flat|nested] 331+ messages in thread

* [PATCH v12 2/8] x86/resctrl: Prepare to split rdt_domain structure
  2023-11-30  0:34               ` [PATCH v12 " Tony Luck
  2023-11-30  0:34                 ` [PATCH v12 1/8] x86/resctrl: Prepare for new domain scope Tony Luck
@ 2023-11-30  0:34                 ` Tony Luck
  2023-11-30  0:34                 ` [PATCH v12 3/8] x86/resctrl: Prepare for different scope for control/monitor operations Tony Luck
                                   ` (6 subsequent siblings)
  8 siblings, 0 replies; 331+ messages in thread
From: Tony Luck @ 2023-11-30  0:34 UTC (permalink / raw)
  To: Fenghua Yu, Reinette Chatre, Peter Newman, Jonathan Corbet,
	Shuah Khan, x86
  Cc: Shaopeng Tan, James Morse, Jamie Iles, Babu Moger, Randy Dunlap,
	linux-kernel, linux-doc, patches, Tony Luck, Shaopeng Tan

The rdt_domain structure is used for both control and monitor features.
It is about to be split into separate structures for these two usages
because the scope for control and monitoring features for a resource
will be different for future resources.

To allow for common code that scans a list of domains looking for a
specific domain id, move all the common fields ("list", "id", "cpu_mask")
into their own structure within the rdt_domain structure.

Signed-off-by: Tony Luck <tony.luck@intel.com>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Reviewed-by: Shaopeng Tan <tan.shaopeng@jp.fujitsu.com>
Tested-by: Shaopeng Tan <tan.shaopeng@jp.fujitsu.com>
---
 include/linux/resctrl.h                   | 16 ++++--
 arch/x86/kernel/cpu/resctrl/core.c        | 26 +++++-----
 arch/x86/kernel/cpu/resctrl/ctrlmondata.c | 22 ++++-----
 arch/x86/kernel/cpu/resctrl/monitor.c     | 10 ++--
 arch/x86/kernel/cpu/resctrl/pseudo_lock.c | 14 +++---
 arch/x86/kernel/cpu/resctrl/rdtgroup.c    | 60 +++++++++++------------
 6 files changed, 78 insertions(+), 70 deletions(-)

diff --git a/include/linux/resctrl.h b/include/linux/resctrl.h
index 7d4eb7df611d..c4067150a6b7 100644
--- a/include/linux/resctrl.h
+++ b/include/linux/resctrl.h
@@ -53,10 +53,20 @@ struct resctrl_staged_config {
 };
 
 /**
- * struct rdt_domain - group of CPUs sharing a resctrl resource
+ * struct rdt_domain_hdr - common header for different domain types
  * @list:		all instances of this resource
  * @id:			unique id for this instance
  * @cpu_mask:		which CPUs share this resource
+ */
+struct rdt_domain_hdr {
+	struct list_head		list;
+	int				id;
+	struct cpumask			cpu_mask;
+};
+
+/**
+ * struct rdt_domain - group of CPUs sharing a resctrl resource
+ * @hdr:		common header for different domain types
  * @rmid_busy_llc:	bitmap of which limbo RMIDs are above threshold
  * @mbm_total:		saved state for MBM total bandwidth
  * @mbm_local:		saved state for MBM local bandwidth
@@ -71,9 +81,7 @@ struct resctrl_staged_config {
  *			by closid
  */
 struct rdt_domain {
-	struct list_head		list;
-	int				id;
-	struct cpumask			cpu_mask;
+	struct rdt_domain_hdr		hdr;
 	unsigned long			*rmid_busy_llc;
 	struct mbm_state		*mbm_total;
 	struct mbm_state		*mbm_local;
diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
index fd113bc29d4e..62a989fd950d 100644
--- a/arch/x86/kernel/cpu/resctrl/core.c
+++ b/arch/x86/kernel/cpu/resctrl/core.c
@@ -356,9 +356,9 @@ struct rdt_domain *get_domain_from_cpu(int cpu, struct rdt_resource *r)
 {
 	struct rdt_domain *d;
 
-	list_for_each_entry(d, &r->domains, list) {
+	list_for_each_entry(d, &r->domains, hdr.list) {
 		/* Find the domain that contains this CPU */
-		if (cpumask_test_cpu(cpu, &d->cpu_mask))
+		if (cpumask_test_cpu(cpu, &d->hdr.cpu_mask))
 			return d;
 	}
 
@@ -402,12 +402,12 @@ struct rdt_domain *rdt_find_domain(struct rdt_resource *r, int id,
 	struct list_head *l;
 
 	list_for_each(l, &r->domains) {
-		d = list_entry(l, struct rdt_domain, list);
+		d = list_entry(l, struct rdt_domain, hdr.list);
 		/* When id is found, return its domain. */
-		if (id == d->id)
+		if (id == d->hdr.id)
 			return d;
 		/* Stop searching when finding id's position in sorted list. */
-		if (id < d->id)
+		if (id < d->hdr.id)
 			break;
 	}
 
@@ -530,7 +530,7 @@ static void domain_add_cpu(int cpu, struct rdt_resource *r)
 
 	d = rdt_find_domain(r, id, &add_pos);
 	if (d) {
-		cpumask_set_cpu(cpu, &d->cpu_mask);
+		cpumask_set_cpu(cpu, &d->hdr.cpu_mask);
 		if (r->cache.arch_has_per_cpu_cfg)
 			rdt_domain_reconfigure_cdp(r);
 		return;
@@ -541,8 +541,8 @@ static void domain_add_cpu(int cpu, struct rdt_resource *r)
 		return;
 
 	d = &hw_dom->d_resctrl;
-	d->id = id;
-	cpumask_set_cpu(cpu, &d->cpu_mask);
+	d->hdr.id = id;
+	cpumask_set_cpu(cpu, &d->hdr.cpu_mask);
 
 	rdt_domain_reconfigure_cdp(r);
 
@@ -556,11 +556,11 @@ static void domain_add_cpu(int cpu, struct rdt_resource *r)
 		return;
 	}
 
-	list_add_tail(&d->list, add_pos);
+	list_add_tail(&d->hdr.list, add_pos);
 
 	err = resctrl_online_domain(r, d);
 	if (err) {
-		list_del(&d->list);
+		list_del(&d->hdr.list);
 		domain_free(hw_dom);
 	}
 }
@@ -584,10 +584,10 @@ static void domain_remove_cpu(int cpu, struct rdt_resource *r)
 	}
 	hw_dom = resctrl_to_arch_dom(d);
 
-	cpumask_clear_cpu(cpu, &d->cpu_mask);
-	if (cpumask_empty(&d->cpu_mask)) {
+	cpumask_clear_cpu(cpu, &d->hdr.cpu_mask);
+	if (cpumask_empty(&d->hdr.cpu_mask)) {
 		resctrl_offline_domain(r, d);
-		list_del(&d->list);
+		list_del(&d->hdr.list);
 
 		/*
 		 * rdt_domain "d" is going to be freed below, so clear
diff --git a/arch/x86/kernel/cpu/resctrl/ctrlmondata.c b/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
index 3f8891d57fac..23f8258d36a8 100644
--- a/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
+++ b/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
@@ -67,7 +67,7 @@ int parse_bw(struct rdt_parse_data *data, struct resctrl_schema *s,
 
 	cfg = &d->staged_config[s->conf_type];
 	if (cfg->have_new_ctrl) {
-		rdt_last_cmd_printf("Duplicate domain %d\n", d->id);
+		rdt_last_cmd_printf("Duplicate domain %d\n", d->hdr.id);
 		return -EINVAL;
 	}
 
@@ -146,7 +146,7 @@ int parse_cbm(struct rdt_parse_data *data, struct resctrl_schema *s,
 
 	cfg = &d->staged_config[s->conf_type];
 	if (cfg->have_new_ctrl) {
-		rdt_last_cmd_printf("Duplicate domain %d\n", d->id);
+		rdt_last_cmd_printf("Duplicate domain %d\n", d->hdr.id);
 		return -EINVAL;
 	}
 
@@ -226,8 +226,8 @@ static int parse_line(char *line, struct resctrl_schema *s,
 		return -EINVAL;
 	}
 	dom = strim(dom);
-	list_for_each_entry(d, &r->domains, list) {
-		if (d->id == dom_id) {
+	list_for_each_entry(d, &r->domains, hdr.list) {
+		if (d->hdr.id == dom_id) {
 			data.buf = dom;
 			data.rdtgrp = rdtgrp;
 			if (r->parse_ctrlval(&data, s, d))
@@ -274,7 +274,7 @@ static bool apply_config(struct rdt_hw_domain *hw_dom,
 	struct rdt_domain *dom = &hw_dom->d_resctrl;
 
 	if (cfg->new_ctrl != hw_dom->ctrl_val[idx]) {
-		cpumask_set_cpu(cpumask_any(&dom->cpu_mask), cpu_mask);
+		cpumask_set_cpu(cpumask_any(&dom->hdr.cpu_mask), cpu_mask);
 		hw_dom->ctrl_val[idx] = cfg->new_ctrl;
 
 		return true;
@@ -291,7 +291,7 @@ int resctrl_arch_update_one(struct rdt_resource *r, struct rdt_domain *d,
 	u32 idx = get_config_index(closid, t);
 	struct msr_param msr_param;
 
-	if (!cpumask_test_cpu(smp_processor_id(), &d->cpu_mask))
+	if (!cpumask_test_cpu(smp_processor_id(), &d->hdr.cpu_mask))
 		return -EINVAL;
 
 	hw_dom->ctrl_val[idx] = cfg_val;
@@ -318,7 +318,7 @@ int resctrl_arch_update_domains(struct rdt_resource *r, u32 closid)
 		return -ENOMEM;
 
 	msr_param.res = NULL;
-	list_for_each_entry(d, &r->domains, list) {
+	list_for_each_entry(d, &r->domains, hdr.list) {
 		hw_dom = resctrl_to_arch_dom(d);
 		for (t = 0; t < CDP_NUM_TYPES; t++) {
 			cfg = &hw_dom->d_resctrl.staged_config[t];
@@ -466,7 +466,7 @@ static void show_doms(struct seq_file *s, struct resctrl_schema *schema, int clo
 	u32 ctrl_val;
 
 	seq_printf(s, "%*s:", max_name_width, schema->name);
-	list_for_each_entry(dom, &r->domains, list) {
+	list_for_each_entry(dom, &r->domains, hdr.list) {
 		if (sep)
 			seq_puts(s, ";");
 
@@ -476,7 +476,7 @@ static void show_doms(struct seq_file *s, struct resctrl_schema *schema, int clo
 			ctrl_val = resctrl_arch_get_config(r, dom, closid,
 							   schema->conf_type);
 
-		seq_printf(s, r->format_str, dom->id, max_data_width,
+		seq_printf(s, r->format_str, dom->hdr.id, max_data_width,
 			   ctrl_val);
 		sep = true;
 	}
@@ -505,7 +505,7 @@ int rdtgroup_schemata_show(struct kernfs_open_file *of,
 			} else {
 				seq_printf(s, "%s:%d=%x\n",
 					   rdtgrp->plr->s->res->name,
-					   rdtgrp->plr->d->id,
+					   rdtgrp->plr->d->hdr.id,
 					   rdtgrp->plr->cbm);
 			}
 		} else {
@@ -536,7 +536,7 @@ void mon_event_read(struct rmid_read *rr, struct rdt_resource *r,
 	rr->val = 0;
 	rr->first = first;
 
-	smp_call_function_any(&d->cpu_mask, mon_event_count, rr, 1);
+	smp_call_function_any(&d->hdr.cpu_mask, mon_event_count, rr, 1);
 }
 
 int rdtgroup_mondata_show(struct seq_file *m, void *arg)
diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c
index f136ac046851..dd0ea1bc0092 100644
--- a/arch/x86/kernel/cpu/resctrl/monitor.c
+++ b/arch/x86/kernel/cpu/resctrl/monitor.c
@@ -238,7 +238,7 @@ int resctrl_arch_rmid_read(struct rdt_resource *r, struct rdt_domain *d,
 	u64 msr_val, chunks;
 	int ret;
 
-	if (!cpumask_test_cpu(smp_processor_id(), &d->cpu_mask))
+	if (!cpumask_test_cpu(smp_processor_id(), &d->hdr.cpu_mask))
 		return -EINVAL;
 
 	ret = __rmid_read(rmid, eventid, &msr_val);
@@ -340,8 +340,8 @@ static void add_rmid_to_limbo(struct rmid_entry *entry)
 
 	entry->busy = 0;
 	cpu = get_cpu();
-	list_for_each_entry(d, &r->domains, list) {
-		if (cpumask_test_cpu(cpu, &d->cpu_mask)) {
+	list_for_each_entry(d, &r->domains, hdr.list) {
+		if (cpumask_test_cpu(cpu, &d->hdr.cpu_mask)) {
 			err = resctrl_arch_rmid_read(r, d, entry->rmid,
 						     QOS_L3_OCCUP_EVENT_ID,
 						     &val);
@@ -661,7 +661,7 @@ void cqm_setup_limbo_handler(struct rdt_domain *dom, unsigned long delay_ms)
 	unsigned long delay = msecs_to_jiffies(delay_ms);
 	int cpu;
 
-	cpu = cpumask_any(&dom->cpu_mask);
+	cpu = cpumask_any(&dom->hdr.cpu_mask);
 	dom->cqm_work_cpu = cpu;
 
 	schedule_delayed_work_on(cpu, &dom->cqm_limbo, delay);
@@ -708,7 +708,7 @@ void mbm_setup_overflow_handler(struct rdt_domain *dom, unsigned long delay_ms)
 
 	if (!static_branch_likely(&rdt_mon_enable_key))
 		return;
-	cpu = cpumask_any(&dom->cpu_mask);
+	cpu = cpumask_any(&dom->hdr.cpu_mask);
 	dom->mbm_work_cpu = cpu;
 	schedule_delayed_work_on(cpu, &dom->mbm_over, delay);
 }
diff --git a/arch/x86/kernel/cpu/resctrl/pseudo_lock.c b/arch/x86/kernel/cpu/resctrl/pseudo_lock.c
index 2a682da9f43a..fcbd99e2eb66 100644
--- a/arch/x86/kernel/cpu/resctrl/pseudo_lock.c
+++ b/arch/x86/kernel/cpu/resctrl/pseudo_lock.c
@@ -221,7 +221,7 @@ static int pseudo_lock_cstates_constrain(struct pseudo_lock_region *plr)
 	int cpu;
 	int ret;
 
-	for_each_cpu(cpu, &plr->d->cpu_mask) {
+	for_each_cpu(cpu, &plr->d->hdr.cpu_mask) {
 		pm_req = kzalloc(sizeof(*pm_req), GFP_KERNEL);
 		if (!pm_req) {
 			rdt_last_cmd_puts("Failure to allocate memory for PM QoS\n");
@@ -301,7 +301,7 @@ static int pseudo_lock_region_init(struct pseudo_lock_region *plr)
 		return -ENODEV;
 
 	/* Pick the first cpu we find that is associated with the cache. */
-	plr->cpu = cpumask_first(&plr->d->cpu_mask);
+	plr->cpu = cpumask_first(&plr->d->hdr.cpu_mask);
 
 	if (!cpu_online(plr->cpu)) {
 		rdt_last_cmd_printf("CPU %u associated with cache not online\n",
@@ -856,10 +856,10 @@ bool rdtgroup_pseudo_locked_in_hierarchy(struct rdt_domain *d)
 	 * associated with them.
 	 */
 	for_each_alloc_capable_rdt_resource(r) {
-		list_for_each_entry(d_i, &r->domains, list) {
+		list_for_each_entry(d_i, &r->domains, hdr.list) {
 			if (d_i->plr)
 				cpumask_or(cpu_with_psl, cpu_with_psl,
-					   &d_i->cpu_mask);
+					   &d_i->hdr.cpu_mask);
 		}
 	}
 
@@ -867,7 +867,7 @@ bool rdtgroup_pseudo_locked_in_hierarchy(struct rdt_domain *d)
 	 * Next test if new pseudo-locked region would intersect with
 	 * existing region.
 	 */
-	if (cpumask_intersects(&d->cpu_mask, cpu_with_psl))
+	if (cpumask_intersects(&d->hdr.cpu_mask, cpu_with_psl))
 		ret = true;
 
 	free_cpumask_var(cpu_with_psl);
@@ -1199,7 +1199,7 @@ static int pseudo_lock_measure_cycles(struct rdtgroup *rdtgrp, int sel)
 	}
 
 	plr->thread_done = 0;
-	cpu = cpumask_first(&plr->d->cpu_mask);
+	cpu = cpumask_first(&plr->d->hdr.cpu_mask);
 	if (!cpu_online(cpu)) {
 		ret = -ENODEV;
 		goto out;
@@ -1529,7 +1529,7 @@ static int pseudo_lock_dev_mmap(struct file *filp, struct vm_area_struct *vma)
 	 * may be scheduled elsewhere and invalidate entries in the
 	 * pseudo-locked region.
 	 */
-	if (!cpumask_subset(current->cpus_ptr, &plr->d->cpu_mask)) {
+	if (!cpumask_subset(current->cpus_ptr, &plr->d->hdr.cpu_mask)) {
 		mutex_unlock(&rdtgroup_mutex);
 		return -EINVAL;
 	}
diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
index c44be64d65ec..04d32602ac33 100644
--- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
+++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
@@ -91,7 +91,7 @@ void rdt_staged_configs_clear(void)
 	lockdep_assert_held(&rdtgroup_mutex);
 
 	for_each_alloc_capable_rdt_resource(r) {
-		list_for_each_entry(dom, &r->domains, list)
+		list_for_each_entry(dom, &r->domains, hdr.list)
 			memset(dom->staged_config, 0, sizeof(dom->staged_config));
 	}
 }
@@ -295,7 +295,7 @@ static int rdtgroup_cpus_show(struct kernfs_open_file *of,
 				rdt_last_cmd_puts("Cache domain offline\n");
 				ret = -ENODEV;
 			} else {
-				mask = &rdtgrp->plr->d->cpu_mask;
+				mask = &rdtgrp->plr->d->hdr.cpu_mask;
 				seq_printf(s, is_cpu_list(of) ?
 					   "%*pbl\n" : "%*pb\n",
 					   cpumask_pr_args(mask));
@@ -984,12 +984,12 @@ static int rdt_bit_usage_show(struct kernfs_open_file *of,
 
 	mutex_lock(&rdtgroup_mutex);
 	hw_shareable = r->cache.shareable_bits;
-	list_for_each_entry(dom, &r->domains, list) {
+	list_for_each_entry(dom, &r->domains, hdr.list) {
 		if (sep)
 			seq_putc(seq, ';');
 		sw_shareable = 0;
 		exclusive = 0;
-		seq_printf(seq, "%d=", dom->id);
+		seq_printf(seq, "%d=", dom->hdr.id);
 		for (i = 0; i < closids_supported(); i++) {
 			if (!closid_allocated(i))
 				continue;
@@ -1302,7 +1302,7 @@ static bool rdtgroup_mode_test_exclusive(struct rdtgroup *rdtgrp)
 		if (r->rid == RDT_RESOURCE_MBA || r->rid == RDT_RESOURCE_SMBA)
 			continue;
 		has_cache = true;
-		list_for_each_entry(d, &r->domains, list) {
+		list_for_each_entry(d, &r->domains, hdr.list) {
 			ctrl = resctrl_arch_get_config(r, d, closid,
 						       s->conf_type);
 			if (rdtgroup_cbm_overlaps(s, d, ctrl, closid, false)) {
@@ -1417,7 +1417,7 @@ unsigned int rdtgroup_cbm_to_size(struct rdt_resource *r,
 		return size;
 
 	num_b = bitmap_weight(&cbm, r->cache.cbm_len);
-	ci = get_cpu_cacheinfo(cpumask_any(&d->cpu_mask));
+	ci = get_cpu_cacheinfo(cpumask_any(&d->hdr.cpu_mask));
 	for (i = 0; i < ci->num_leaves; i++) {
 		if (ci->info_list[i].level == r->scope) {
 			size = ci->info_list[i].size / r->cache.cbm_len * num_b;
@@ -1465,7 +1465,7 @@ static int rdtgroup_size_show(struct kernfs_open_file *of,
 			size = rdtgroup_cbm_to_size(rdtgrp->plr->s->res,
 						    rdtgrp->plr->d,
 						    rdtgrp->plr->cbm);
-			seq_printf(s, "%d=%u\n", rdtgrp->plr->d->id, size);
+			seq_printf(s, "%d=%u\n", rdtgrp->plr->d->hdr.id, size);
 		}
 		goto out;
 	}
@@ -1477,7 +1477,7 @@ static int rdtgroup_size_show(struct kernfs_open_file *of,
 		type = schema->conf_type;
 		sep = false;
 		seq_printf(s, "%*s:", max_name_width, schema->name);
-		list_for_each_entry(d, &r->domains, list) {
+		list_for_each_entry(d, &r->domains, hdr.list) {
 			if (sep)
 				seq_putc(s, ';');
 			if (rdtgrp->mode == RDT_MODE_PSEUDO_LOCKSETUP) {
@@ -1495,7 +1495,7 @@ static int rdtgroup_size_show(struct kernfs_open_file *of,
 				else
 					size = rdtgroup_cbm_to_size(r, d, ctrl);
 			}
-			seq_printf(s, "%d=%u", d->id, size);
+			seq_printf(s, "%d=%u", d->hdr.id, size);
 			sep = true;
 		}
 		seq_putc(s, '\n');
@@ -1555,7 +1555,7 @@ static void mon_event_config_read(void *info)
 
 static void mondata_config_read(struct rdt_domain *d, struct mon_config_info *mon_info)
 {
-	smp_call_function_any(&d->cpu_mask, mon_event_config_read, mon_info, 1);
+	smp_call_function_any(&d->hdr.cpu_mask, mon_event_config_read, mon_info, 1);
 }
 
 static int mbm_config_show(struct seq_file *s, struct rdt_resource *r, u32 evtid)
@@ -1566,7 +1566,7 @@ static int mbm_config_show(struct seq_file *s, struct rdt_resource *r, u32 evtid
 
 	mutex_lock(&rdtgroup_mutex);
 
-	list_for_each_entry(dom, &r->domains, list) {
+	list_for_each_entry(dom, &r->domains, hdr.list) {
 		if (sep)
 			seq_puts(s, ";");
 
@@ -1574,7 +1574,7 @@ static int mbm_config_show(struct seq_file *s, struct rdt_resource *r, u32 evtid
 		mon_info.evtid = evtid;
 		mondata_config_read(dom, &mon_info);
 
-		seq_printf(s, "%d=0x%02x", dom->id, mon_info.mon_config);
+		seq_printf(s, "%d=0x%02x", dom->hdr.id, mon_info.mon_config);
 		sep = true;
 	}
 	seq_puts(s, "\n");
@@ -1646,7 +1646,7 @@ static int mbm_config_write_domain(struct rdt_resource *r,
 	 * are scoped at the domain level. Writing any of these MSRs
 	 * on one CPU is observed by all the CPUs in the domain.
 	 */
-	smp_call_function_any(&d->cpu_mask, mon_event_config_write,
+	smp_call_function_any(&d->hdr.cpu_mask, mon_event_config_write,
 			      &mon_info, 1);
 
 	/*
@@ -1689,8 +1689,8 @@ static int mon_config_write(struct rdt_resource *r, char *tok, u32 evtid)
 		return -EINVAL;
 	}
 
-	list_for_each_entry(d, &r->domains, list) {
-		if (d->id == dom_id) {
+	list_for_each_entry(d, &r->domains, hdr.list) {
+		if (d->hdr.id == dom_id) {
 			ret = mbm_config_write_domain(r, d, evtid, val);
 			if (ret)
 				return -EINVAL;
@@ -2232,14 +2232,14 @@ static int set_cache_qos_cfg(int level, bool enable)
 		return -ENOMEM;
 
 	r_l = &rdt_resources_all[level].r_resctrl;
-	list_for_each_entry(d, &r_l->domains, list) {
+	list_for_each_entry(d, &r_l->domains, hdr.list) {
 		if (r_l->cache.arch_has_per_cpu_cfg)
 			/* Pick all the CPUs in the domain instance */
-			for_each_cpu(cpu, &d->cpu_mask)
+			for_each_cpu(cpu, &d->hdr.cpu_mask)
 				cpumask_set_cpu(cpu, cpu_mask);
 		else
 			/* Pick one CPU from each domain instance to update MSR */
-			cpumask_set_cpu(cpumask_any(&d->cpu_mask), cpu_mask);
+			cpumask_set_cpu(cpumask_any(&d->hdr.cpu_mask), cpu_mask);
 	}
 
 	/* Update QOS_CFG MSR on all the CPUs in cpu_mask */
@@ -2268,7 +2268,7 @@ void rdt_domain_reconfigure_cdp(struct rdt_resource *r)
 static int mba_sc_domain_allocate(struct rdt_resource *r, struct rdt_domain *d)
 {
 	u32 num_closid = resctrl_arch_get_num_closid(r);
-	int cpu = cpumask_any(&d->cpu_mask);
+	int cpu = cpumask_any(&d->hdr.cpu_mask);
 	int i;
 
 	d->mbps_val = kcalloc_node(num_closid, sizeof(*d->mbps_val),
@@ -2317,7 +2317,7 @@ static int set_mba_sc(bool mba_sc)
 
 	r->membw.mba_sc = mba_sc;
 
-	list_for_each_entry(d, &r->domains, list) {
+	list_for_each_entry(d, &r->domains, hdr.list) {
 		for (i = 0; i < num_closid; i++)
 			d->mbps_val[i] = MBA_MAX_MBPS;
 	}
@@ -2653,7 +2653,7 @@ static int rdt_get_tree(struct fs_context *fc)
 
 	if (is_mbm_enabled()) {
 		r = &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl;
-		list_for_each_entry(dom, &r->domains, list)
+		list_for_each_entry(dom, &r->domains, hdr.list)
 			mbm_setup_overflow_handler(dom, MBM_OVERFLOW_INTERVAL);
 	}
 
@@ -2780,9 +2780,9 @@ static int reset_all_ctrls(struct rdt_resource *r)
 	 * CBMs in all domains to the maximum mask value. Pick one CPU
 	 * from each domain to update the MSRs below.
 	 */
-	list_for_each_entry(d, &r->domains, list) {
+	list_for_each_entry(d, &r->domains, hdr.list) {
 		hw_dom = resctrl_to_arch_dom(d);
-		cpumask_set_cpu(cpumask_any(&d->cpu_mask), cpu_mask);
+		cpumask_set_cpu(cpumask_any(&d->hdr.cpu_mask), cpu_mask);
 
 		for (i = 0; i < hw_res->num_closid; i++)
 			hw_dom->ctrl_val[i] = r->default_ctrl;
@@ -2986,7 +2986,7 @@ static int mkdir_mondata_subdir(struct kernfs_node *parent_kn,
 	char name[32];
 	int ret;
 
-	sprintf(name, "mon_%s_%02d", r->name, d->id);
+	sprintf(name, "mon_%s_%02d", r->name, d->hdr.id);
 	/* create the directory */
 	kn = kernfs_create_dir(parent_kn, name, parent_kn->mode, prgrp);
 	if (IS_ERR(kn))
@@ -3002,7 +3002,7 @@ static int mkdir_mondata_subdir(struct kernfs_node *parent_kn,
 	}
 
 	priv.u.rid = r->rid;
-	priv.u.domid = d->id;
+	priv.u.domid = d->hdr.id;
 	list_for_each_entry(mevt, &r->evt_list, list) {
 		priv.u.evtid = mevt->evtid;
 		ret = mon_addfile(kn, mevt->name, priv.priv);
@@ -3050,7 +3050,7 @@ static int mkdir_mondata_subdir_alldom(struct kernfs_node *parent_kn,
 	struct rdt_domain *dom;
 	int ret;
 
-	list_for_each_entry(dom, &r->domains, list) {
+	list_for_each_entry(dom, &r->domains, hdr.list) {
 		ret = mkdir_mondata_subdir(parent_kn, dom, r, prgrp);
 		if (ret)
 			return ret;
@@ -3209,7 +3209,7 @@ static int __init_one_rdt_domain(struct rdt_domain *d, struct resctrl_schema *s,
 	 */
 	tmp_cbm = cfg->new_ctrl;
 	if (bitmap_weight(&tmp_cbm, r->cache.cbm_len) < r->cache.min_cbm_bits) {
-		rdt_last_cmd_printf("No space on %s:%d\n", s->name, d->id);
+		rdt_last_cmd_printf("No space on %s:%d\n", s->name, d->hdr.id);
 		return -ENOSPC;
 	}
 	cfg->have_new_ctrl = true;
@@ -3232,7 +3232,7 @@ static int rdtgroup_init_cat(struct resctrl_schema *s, u32 closid)
 	struct rdt_domain *d;
 	int ret;
 
-	list_for_each_entry(d, &s->res->domains, list) {
+	list_for_each_entry(d, &s->res->domains, hdr.list) {
 		ret = __init_one_rdt_domain(d, s, closid);
 		if (ret < 0)
 			return ret;
@@ -3247,7 +3247,7 @@ static void rdtgroup_init_mba(struct rdt_resource *r, u32 closid)
 	struct resctrl_staged_config *cfg;
 	struct rdt_domain *d;
 
-	list_for_each_entry(d, &r->domains, list) {
+	list_for_each_entry(d, &r->domains, hdr.list) {
 		if (is_mba_sc(r)) {
 			d->mbps_val[closid] = MBA_MAX_MBPS;
 			continue;
@@ -3864,7 +3864,7 @@ void resctrl_offline_domain(struct rdt_resource *r, struct rdt_domain *d)
 	 * per domain monitor data directories.
 	 */
 	if (static_branch_unlikely(&rdt_mon_enable_key))
-		rmdir_mondata_subdir_allrdtgrp(r, d->id);
+		rmdir_mondata_subdir_allrdtgrp(r, d->hdr.id);
 
 	if (is_mbm_enabled())
 		cancel_delayed_work(&d->mbm_over);
-- 
2.41.0


^ permalink raw reply related	[flat|nested] 331+ messages in thread

* [PATCH v12 3/8] x86/resctrl: Prepare for different scope for control/monitor operations
  2023-11-30  0:34               ` [PATCH v12 " Tony Luck
  2023-11-30  0:34                 ` [PATCH v12 1/8] x86/resctrl: Prepare for new domain scope Tony Luck
  2023-11-30  0:34                 ` [PATCH v12 2/8] x86/resctrl: Prepare to split rdt_domain structure Tony Luck
@ 2023-11-30  0:34                 ` Tony Luck
  2023-11-30 21:51                   ` Reinette Chatre
  2023-11-30  0:34                 ` [PATCH v12 4/8] x86/resctrl: Split the rdt_domain and rdt_hw_domain structures Tony Luck
                                   ` (5 subsequent siblings)
  8 siblings, 1 reply; 331+ messages in thread
From: Tony Luck @ 2023-11-30  0:34 UTC (permalink / raw)
  To: Fenghua Yu, Reinette Chatre, Peter Newman, Jonathan Corbet,
	Shuah Khan, x86
  Cc: Shaopeng Tan, James Morse, Jamie Iles, Babu Moger, Randy Dunlap,
	linux-kernel, linux-doc, patches, Tony Luck, Shaopeng Tan

Resctrl assumes that control and monitor operations on a resource are
performed at the same scope.

Prepare for systems that use different scopes (specifically, Intel
needs to split the RDT_RESOURCE_L3 resource to use L3 scope for cache
control and NODE scope for cache occupancy and memory bandwidth
monitoring).

Create separate domain lists for control and monitor operations.

Note that errors during initialization of either control or monitor
functions on a domain previously resulted in that domain being
excluded from both control and monitor operations. Now that the
domains are allocated independently, it is no longer necessary to
disable both control and monitor operations if either fails.
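
As an illustration (a simplified sketch based on the patch below),
lookups now search the relevant list and get back the common header;
callers check the domain type and convert back to the full structure
with container_of(), e.g. as rdtgroup_mondata_show() does:

	hdr = rdt_find_domain(&r->mon_domains, domid, NULL);
	if (!hdr || WARN_ON_ONCE(hdr->type != RESCTRL_MON_DOMAIN)) {
		ret = -ENOENT;
		goto out;
	}
	d = container_of(hdr, struct rdt_domain, hdr);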

Signed-off-by: Tony Luck <tony.luck@intel.com>
Reviewed-by: Shaopeng Tan <tan.shaopeng@jp.fujitsu.com>
Tested-by: Shaopeng Tan <tan.shaopeng@jp.fujitsu.com>
---
 include/linux/resctrl.h                   |  25 ++-
 arch/x86/kernel/cpu/resctrl/internal.h    |   6 +-
 arch/x86/kernel/cpu/resctrl/core.c        | 211 ++++++++++++++++------
 arch/x86/kernel/cpu/resctrl/ctrlmondata.c |  12 +-
 arch/x86/kernel/cpu/resctrl/monitor.c     |   4 +-
 arch/x86/kernel/cpu/resctrl/pseudo_lock.c |   4 +-
 arch/x86/kernel/cpu/resctrl/rdtgroup.c    |  55 +++---
 7 files changed, 220 insertions(+), 97 deletions(-)

diff --git a/include/linux/resctrl.h b/include/linux/resctrl.h
index c4067150a6b7..35e700edc6e6 100644
--- a/include/linux/resctrl.h
+++ b/include/linux/resctrl.h
@@ -52,15 +52,22 @@ struct resctrl_staged_config {
 	bool			have_new_ctrl;
 };
 
+enum resctrl_domain_type {
+	RESCTRL_CTRL_DOMAIN,
+	RESCTRL_MON_DOMAIN,
+};
+
 /**
  * struct rdt_domain_hdr - common header for different domain types
  * @list:		all instances of this resource
  * @id:			unique id for this instance
+ * @type:		type of this instance
  * @cpu_mask:		which CPUs share this resource
  */
 struct rdt_domain_hdr {
 	struct list_head		list;
 	int				id;
+	enum resctrl_domain_type	type;
 	struct cpumask			cpu_mask;
 };
 
@@ -163,10 +170,12 @@ enum resctrl_scope {
  * @alloc_capable:	Is allocation available on this machine
  * @mon_capable:	Is monitor feature available on this machine
  * @num_rmid:		Number of RMIDs available
- * @scope:		Scope of this resource
+ * @ctrl_scope:		Scope of this resource for control functions
+ * @mon_scope:		Scope of this resource for monitor functions
  * @cache:		Cache allocation related data
  * @membw:		If the component has bandwidth controls, their properties.
- * @domains:		All domains for this resource
+ * @ctrl_domains:	Control domains for this resource
+ * @mon_domains:	Monitor domains for this resource
  * @name:		Name to use in "schemata" file.
  * @data_width:		Character width of data when displaying
  * @default_ctrl:	Specifies default cache cbm or memory B/W percent.
@@ -181,10 +190,12 @@ struct rdt_resource {
 	bool			alloc_capable;
 	bool			mon_capable;
 	int			num_rmid;
-	enum resctrl_scope	scope;
+	enum resctrl_scope	ctrl_scope;
+	enum resctrl_scope	mon_scope;
 	struct resctrl_cache	cache;
 	struct resctrl_membw	membw;
-	struct list_head	domains;
+	struct list_head	ctrl_domains;
+	struct list_head	mon_domains;
 	char			*name;
 	int			data_width;
 	u32			default_ctrl;
@@ -230,8 +241,10 @@ int resctrl_arch_update_one(struct rdt_resource *r, struct rdt_domain *d,
 
 u32 resctrl_arch_get_config(struct rdt_resource *r, struct rdt_domain *d,
 			    u32 closid, enum resctrl_conf_type type);
-int resctrl_online_domain(struct rdt_resource *r, struct rdt_domain *d);
-void resctrl_offline_domain(struct rdt_resource *r, struct rdt_domain *d);
+int resctrl_online_ctrl_domain(struct rdt_resource *r, struct rdt_domain *d);
+int resctrl_online_mon_domain(struct rdt_resource *r, struct rdt_domain *d);
+void resctrl_offline_ctrl_domain(struct rdt_resource *r, struct rdt_domain *d);
+void resctrl_offline_mon_domain(struct rdt_resource *r, struct rdt_domain *d);
 
 /**
  * resctrl_arch_rmid_read() - Read the eventid counter corresponding to rmid
diff --git a/arch/x86/kernel/cpu/resctrl/internal.h b/arch/x86/kernel/cpu/resctrl/internal.h
index a4f1aa15f0a2..24bf9d7989a9 100644
--- a/arch/x86/kernel/cpu/resctrl/internal.h
+++ b/arch/x86/kernel/cpu/resctrl/internal.h
@@ -520,8 +520,8 @@ void rdtgroup_kn_unlock(struct kernfs_node *kn);
 int rdtgroup_kn_mode_restrict(struct rdtgroup *r, const char *name);
 int rdtgroup_kn_mode_restore(struct rdtgroup *r, const char *name,
 			     umode_t mask);
-struct rdt_domain *rdt_find_domain(struct rdt_resource *r, int id,
-				   struct list_head **pos);
+struct rdt_domain_hdr *rdt_find_domain(struct list_head *h, int id,
+				       struct list_head **pos);
 ssize_t rdtgroup_schemata_write(struct kernfs_open_file *of,
 				char *buf, size_t nbytes, loff_t off);
 int rdtgroup_schemata_show(struct kernfs_open_file *of,
@@ -540,7 +540,7 @@ int rdt_pseudo_lock_init(void);
 void rdt_pseudo_lock_release(void);
 int rdtgroup_pseudo_lock_create(struct rdtgroup *rdtgrp);
 void rdtgroup_pseudo_lock_remove(struct rdtgroup *rdtgrp);
-struct rdt_domain *get_domain_from_cpu(int cpu, struct rdt_resource *r);
+struct rdt_domain *get_ctrl_domain_from_cpu(int cpu, struct rdt_resource *r);
 int closids_supported(void);
 void closid_free(int closid);
 int alloc_rmid(void);
diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
index 62a989fd950d..1fd85533b4ca 100644
--- a/arch/x86/kernel/cpu/resctrl/core.c
+++ b/arch/x86/kernel/cpu/resctrl/core.c
@@ -57,7 +57,8 @@ static void
 mba_wrmsr_amd(struct rdt_domain *d, struct msr_param *m,
 	      struct rdt_resource *r);
 
-#define domain_init(id) LIST_HEAD_INIT(rdt_resources_all[id].r_resctrl.domains)
+#define ctrl_domain_init(id) LIST_HEAD_INIT(rdt_resources_all[id].r_resctrl.ctrl_domains)
+#define mon_domain_init(id) LIST_HEAD_INIT(rdt_resources_all[id].r_resctrl.mon_domains)
 
 struct rdt_hw_resource rdt_resources_all[] = {
 	[RDT_RESOURCE_L3] =
@@ -65,8 +66,10 @@ struct rdt_hw_resource rdt_resources_all[] = {
 		.r_resctrl = {
 			.rid			= RDT_RESOURCE_L3,
 			.name			= "L3",
-			.scope			= RESCTRL_L3_CACHE,
-			.domains		= domain_init(RDT_RESOURCE_L3),
+			.ctrl_scope		= RESCTRL_L3_CACHE,
+			.mon_scope		= RESCTRL_L3_CACHE,
+			.ctrl_domains		= ctrl_domain_init(RDT_RESOURCE_L3),
+			.mon_domains		= mon_domain_init(RDT_RESOURCE_L3),
 			.parse_ctrlval		= parse_cbm,
 			.format_str		= "%d=%0*x",
 			.fflags			= RFTYPE_RES_CACHE,
@@ -79,8 +82,8 @@ struct rdt_hw_resource rdt_resources_all[] = {
 		.r_resctrl = {
 			.rid			= RDT_RESOURCE_L2,
 			.name			= "L2",
-			.scope			= RESCTRL_L2_CACHE,
-			.domains		= domain_init(RDT_RESOURCE_L2),
+			.ctrl_scope		= RESCTRL_L2_CACHE,
+			.ctrl_domains		= ctrl_domain_init(RDT_RESOURCE_L2),
 			.parse_ctrlval		= parse_cbm,
 			.format_str		= "%d=%0*x",
 			.fflags			= RFTYPE_RES_CACHE,
@@ -93,8 +96,8 @@ struct rdt_hw_resource rdt_resources_all[] = {
 		.r_resctrl = {
 			.rid			= RDT_RESOURCE_MBA,
 			.name			= "MB",
-			.scope			= RESCTRL_L3_CACHE,
-			.domains		= domain_init(RDT_RESOURCE_MBA),
+			.ctrl_scope		= RESCTRL_L3_CACHE,
+			.ctrl_domains		= ctrl_domain_init(RDT_RESOURCE_MBA),
 			.parse_ctrlval		= parse_bw,
 			.format_str		= "%d=%*u",
 			.fflags			= RFTYPE_RES_MB,
@@ -105,8 +108,8 @@ struct rdt_hw_resource rdt_resources_all[] = {
 		.r_resctrl = {
 			.rid			= RDT_RESOURCE_SMBA,
 			.name			= "SMBA",
-			.scope			= RESCTRL_L3_CACHE,
-			.domains		= domain_init(RDT_RESOURCE_SMBA),
+			.ctrl_scope		= RESCTRL_L3_CACHE,
+			.ctrl_domains		= ctrl_domain_init(RDT_RESOURCE_SMBA),
 			.parse_ctrlval		= parse_bw,
 			.format_str		= "%d=%*u",
 			.fflags			= RFTYPE_RES_MB,
@@ -352,11 +355,11 @@ cat_wrmsr(struct rdt_domain *d, struct msr_param *m, struct rdt_resource *r)
 		wrmsrl(hw_res->msr_base + i, hw_dom->ctrl_val[i]);
 }
 
-struct rdt_domain *get_domain_from_cpu(int cpu, struct rdt_resource *r)
+struct rdt_domain *get_ctrl_domain_from_cpu(int cpu, struct rdt_resource *r)
 {
 	struct rdt_domain *d;
 
-	list_for_each_entry(d, &r->domains, hdr.list) {
+	list_for_each_entry(d, &r->ctrl_domains, hdr.list) {
 		/* Find the domain that contains this CPU */
 		if (cpumask_test_cpu(cpu, &d->hdr.cpu_mask))
 			return d;
@@ -378,7 +381,7 @@ void rdt_ctrl_update(void *arg)
 	int cpu = smp_processor_id();
 	struct rdt_domain *d;
 
-	d = get_domain_from_cpu(cpu, r);
+	d = get_ctrl_domain_from_cpu(cpu, r);
 	if (d) {
 		hw_res->msr_update(d, m, r);
 		return;
@@ -388,26 +391,26 @@ void rdt_ctrl_update(void *arg)
 }
 
 /*
- * rdt_find_domain - Find a domain in a resource that matches input resource id
+ * rdt_find_domain - Search for a domain id in a resource domain list.
  *
- * Search resource r's domain list to find the resource id. If the resource
- * id is found in a domain, return the domain. Otherwise, if requested by
- * caller, return the first domain whose id is bigger than the input id.
- * The domain list is sorted by id in ascending order.
+ * Search the domain list to find the domain id. If the domain id is
+ * found, return the domain. NULL otherwise.  If the domain id is not
+ * found (and NULL returned) then the first domain with id bigger than
+ * the input id can be returned to the caller via @pos.
  */
-struct rdt_domain *rdt_find_domain(struct rdt_resource *r, int id,
-				   struct list_head **pos)
+struct rdt_domain_hdr *rdt_find_domain(struct list_head *h, int id,
+				       struct list_head **pos)
 {
-	struct rdt_domain *d;
+	struct rdt_domain_hdr *d;
 	struct list_head *l;
 
-	list_for_each(l, &r->domains) {
-		d = list_entry(l, struct rdt_domain, hdr.list);
+	list_for_each(l, h) {
+		d = list_entry(l, struct rdt_domain_hdr, list);
 		/* When id is found, return its domain. */
-		if (id == d->hdr.id)
+		if (id == d->id)
 			return d;
 		/* Stop searching when finding id's position in sorted list. */
-		if (id < d->hdr.id)
+		if (id < d->id)
 			break;
 	}
 
@@ -501,35 +504,28 @@ static int get_domain_id_from_scope(int cpu, enum resctrl_scope scope)
 	return -EINVAL;
 }
 
-/*
- * domain_add_cpu - Add a cpu to a resource's domain list.
- *
- * If an existing domain in the resource r's domain list matches the cpu's
- * resource id, add the cpu in the domain.
- *
- * Otherwise, a new domain is allocated and inserted into the right position
- * in the domain list sorted by id in ascending order.
- *
- * The order in the domain list is visible to users when we print entries
- * in the schemata file and schemata input is validated to have the same order
- * as this list.
- */
-static void domain_add_cpu(int cpu, struct rdt_resource *r)
+static void domain_add_cpu_ctrl(int cpu, struct rdt_resource *r)
 {
-	int id = get_domain_id_from_scope(cpu, r->scope);
+	int id = get_domain_id_from_scope(cpu, r->ctrl_scope);
 	struct list_head *add_pos = NULL;
 	struct rdt_hw_domain *hw_dom;
+	struct rdt_domain_hdr *hdr;
 	struct rdt_domain *d;
 	int err;
 
 	if (id < 0) {
-		pr_warn_once("Can't find domain id for CPU:%d scope:%d for resource %s\n",
-			     cpu, r->scope, r->name);
+		pr_warn_once("Can't find control domain id for CPU:%d scope:%d for resource %s\n",
+			     cpu, r->ctrl_scope, r->name);
 		return;
 	}
 
-	d = rdt_find_domain(r, id, &add_pos);
-	if (d) {
+	hdr = rdt_find_domain(&r->ctrl_domains, id, &add_pos);
+	if (hdr) {
+		if (WARN_ON_ONCE(hdr->type != RESCTRL_CTRL_DOMAIN))
+			return;
+
+		d = container_of(hdr, struct rdt_domain, hdr);
+
 		cpumask_set_cpu(cpu, &d->hdr.cpu_mask);
 		if (r->cache.arch_has_per_cpu_cfg)
 			rdt_domain_reconfigure_cdp(r);
@@ -542,51 +538,114 @@ static void domain_add_cpu(int cpu, struct rdt_resource *r)
 
 	d = &hw_dom->d_resctrl;
 	d->hdr.id = id;
+	d->hdr.type = RESCTRL_CTRL_DOMAIN;
 	cpumask_set_cpu(cpu, &d->hdr.cpu_mask);
 
 	rdt_domain_reconfigure_cdp(r);
 
-	if (r->alloc_capable && domain_setup_ctrlval(r, d)) {
+	if (domain_setup_ctrlval(r, d)) {
 		domain_free(hw_dom);
 		return;
 	}
 
-	if (r->mon_capable && arch_domain_mbm_alloc(r->num_rmid, hw_dom)) {
+	list_add_tail(&d->hdr.list, add_pos);
+
+	err = resctrl_online_ctrl_domain(r, d);
+	if (err) {
+		list_del(&d->hdr.list);
+		domain_free(hw_dom);
+	}
+}
+
+static void domain_add_cpu_mon(int cpu, struct rdt_resource *r)
+{
+	int id = get_domain_id_from_scope(cpu, r->mon_scope);
+	struct list_head *add_pos = NULL;
+	struct rdt_hw_domain *hw_dom;
+	struct rdt_domain_hdr *hdr;
+	struct rdt_domain *d;
+	int err;
+
+	if (id < 0) {
+		pr_warn_once("Can't find monitor domain id for CPU:%d scope:%d for resource %s\n",
+			     cpu, r->mon_scope, r->name);
+		return;
+	}
+
+	hdr = rdt_find_domain(&r->mon_domains, id, &add_pos);
+	if (hdr) {
+		if (WARN_ON_ONCE(hdr->type != RESCTRL_MON_DOMAIN))
+			return;
+
+		d = container_of(hdr, struct rdt_domain, hdr);
+
+		cpumask_set_cpu(cpu, &d->hdr.cpu_mask);
+		return;
+	}
+
+	hw_dom = kzalloc_node(sizeof(*hw_dom), GFP_KERNEL, cpu_to_node(cpu));
+	if (!hw_dom)
+		return;
+
+	d = &hw_dom->d_resctrl;
+	d->hdr.id = id;
+	d->hdr.type = RESCTRL_MON_DOMAIN;
+	cpumask_set_cpu(cpu, &d->hdr.cpu_mask);
+
+	if (arch_domain_mbm_alloc(r->num_rmid, hw_dom)) {
 		domain_free(hw_dom);
 		return;
 	}
 
 	list_add_tail(&d->hdr.list, add_pos);
 
-	err = resctrl_online_domain(r, d);
+	err = resctrl_online_mon_domain(r, d);
 	if (err) {
 		list_del(&d->hdr.list);
 		domain_free(hw_dom);
 	}
 }
 
-static void domain_remove_cpu(int cpu, struct rdt_resource *r)
+/*
+ * domain_add_cpu - Add a CPU to either/both resource's domain lists.
+ */
+static void domain_add_cpu(int cpu, struct rdt_resource *r)
 {
-	int id = get_domain_id_from_scope(cpu, r->scope);
+	if (r->alloc_capable)
+		domain_add_cpu_ctrl(cpu, r);
+	if (r->mon_capable)
+		domain_add_cpu_mon(cpu, r);
+}
+
+static void domain_remove_cpu_ctrl(int cpu, struct rdt_resource *r)
+{
+	int id = get_domain_id_from_scope(cpu, r->ctrl_scope);
 	struct rdt_hw_domain *hw_dom;
+	struct rdt_domain_hdr *hdr;
 	struct rdt_domain *d;
 
 	if (id < 0) {
-		pr_warn_once("Can't find domain id for CPU:%d scope:%d for resource %s\n",
-			     cpu, r->scope, r->name);
+		pr_warn_once("Can't find control domain id for CPU:%d scope:%d for resource %s\n",
+			     cpu, r->ctrl_scope, r->name);
 		return;
 	}
 
-	d = rdt_find_domain(r, id, NULL);
-	if (!d) {
-		pr_warn("Couldn't find domain with id=%d for CPU %d\n", id, cpu);
+	hdr = rdt_find_domain(&r->ctrl_domains, id, NULL);
+	if (!hdr) {
+		pr_warn("Can't find control domain for id=%d for CPU %d for resource %s\n",
+			id, cpu, r->name);
 		return;
 	}
+
+	if (WARN_ON_ONCE(hdr->type != RESCTRL_CTRL_DOMAIN))
+		return;
+
+	d = container_of(hdr, struct rdt_domain, hdr);
 	hw_dom = resctrl_to_arch_dom(d);
 
 	cpumask_clear_cpu(cpu, &d->hdr.cpu_mask);
 	if (cpumask_empty(&d->hdr.cpu_mask)) {
-		resctrl_offline_domain(r, d);
+		resctrl_offline_ctrl_domain(r, d);
 		list_del(&d->hdr.list);
 
 		/*
@@ -599,6 +658,42 @@ static void domain_remove_cpu(int cpu, struct rdt_resource *r)
 
 		return;
 	}
+}
+
+static void domain_remove_cpu_mon(int cpu, struct rdt_resource *r)
+{
+	int id = get_domain_id_from_scope(cpu, r->mon_scope);
+	struct rdt_hw_domain *hw_dom;
+	struct rdt_domain_hdr *hdr;
+	struct rdt_domain *d;
+
+	if (id < 0) {
+		pr_warn_once("Can't find monitor domain id for CPU:%d scope:%d for resource %s\n",
+			     cpu, r->mon_scope, r->name);
+		return;
+	}
+
+	hdr = rdt_find_domain(&r->mon_domains, id, NULL);
+	if (!hdr) {
+		pr_warn("Can't find monitor domain for id=%d for CPU %d for resource %s\n",
+			id, cpu, r->name);
+		return;
+	}
+
+	if (WARN_ON_ONCE(hdr->type != RESCTRL_MON_DOMAIN))
+		return;
+
+	d = container_of(hdr, struct rdt_domain, hdr);
+	hw_dom = resctrl_to_arch_dom(d);
+
+	cpumask_clear_cpu(cpu, &d->hdr.cpu_mask);
+	if (cpumask_empty(&d->hdr.cpu_mask)) {
+		resctrl_offline_mon_domain(r, d);
+		list_del(&d->hdr.list);
+		domain_free(hw_dom);
+
+		return;
+	}
 
 	if (r == &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl) {
 		if (is_mbm_enabled() && cpu == d->mbm_work_cpu) {
@@ -613,6 +708,14 @@ static void domain_remove_cpu(int cpu, struct rdt_resource *r)
 	}
 }
 
+static void domain_remove_cpu(int cpu, struct rdt_resource *r)
+{
+	if (r->alloc_capable)
+		domain_remove_cpu_ctrl(cpu, r);
+	if (r->mon_capable)
+		domain_remove_cpu_mon(cpu, r);
+}
+
 static void clear_closid_rmid(int cpu)
 {
 	struct resctrl_pqr_state *state = this_cpu_ptr(&pqr_state);
diff --git a/arch/x86/kernel/cpu/resctrl/ctrlmondata.c b/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
index 23f8258d36a8..0b4136c42762 100644
--- a/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
+++ b/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
@@ -226,7 +226,7 @@ static int parse_line(char *line, struct resctrl_schema *s,
 		return -EINVAL;
 	}
 	dom = strim(dom);
-	list_for_each_entry(d, &r->domains, hdr.list) {
+	list_for_each_entry(d, &r->ctrl_domains, hdr.list) {
 		if (d->hdr.id == dom_id) {
 			data.buf = dom;
 			data.rdtgrp = rdtgrp;
@@ -318,7 +318,7 @@ int resctrl_arch_update_domains(struct rdt_resource *r, u32 closid)
 		return -ENOMEM;
 
 	msr_param.res = NULL;
-	list_for_each_entry(d, &r->domains, hdr.list) {
+	list_for_each_entry(d, &r->ctrl_domains, hdr.list) {
 		hw_dom = resctrl_to_arch_dom(d);
 		for (t = 0; t < CDP_NUM_TYPES; t++) {
 			cfg = &hw_dom->d_resctrl.staged_config[t];
@@ -466,7 +466,7 @@ static void show_doms(struct seq_file *s, struct resctrl_schema *schema, int clo
 	u32 ctrl_val;
 
 	seq_printf(s, "%*s:", max_name_width, schema->name);
-	list_for_each_entry(dom, &r->domains, hdr.list) {
+	list_for_each_entry(dom, &r->ctrl_domains, hdr.list) {
 		if (sep)
 			seq_puts(s, ";");
 
@@ -542,6 +542,7 @@ void mon_event_read(struct rmid_read *rr, struct rdt_resource *r,
 int rdtgroup_mondata_show(struct seq_file *m, void *arg)
 {
 	struct kernfs_open_file *of = m->private;
+	struct rdt_domain_hdr *hdr;
 	u32 resid, evtid, domid;
 	struct rdtgroup *rdtgrp;
 	struct rdt_resource *r;
@@ -562,11 +563,12 @@ int rdtgroup_mondata_show(struct seq_file *m, void *arg)
 	evtid = md.u.evtid;
 
 	r = &rdt_resources_all[resid].r_resctrl;
-	d = rdt_find_domain(r, domid, NULL);
-	if (!d) {
+	hdr = rdt_find_domain(&r->mon_domains, domid, NULL);
+	if (!hdr || WARN_ON_ONCE(hdr->type != RESCTRL_MON_DOMAIN)) {
 		ret = -ENOENT;
 		goto out;
 	}
+	d = container_of(hdr, struct rdt_domain, hdr);
 
 	mon_event_read(&rr, r, d, rdtgrp, evtid, false);
 
diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c
index dd0ea1bc0092..ec5ad926c5dc 100644
--- a/arch/x86/kernel/cpu/resctrl/monitor.c
+++ b/arch/x86/kernel/cpu/resctrl/monitor.c
@@ -340,7 +340,7 @@ static void add_rmid_to_limbo(struct rmid_entry *entry)
 
 	entry->busy = 0;
 	cpu = get_cpu();
-	list_for_each_entry(d, &r->domains, hdr.list) {
+	list_for_each_entry(d, &r->mon_domains, hdr.list) {
 		if (cpumask_test_cpu(cpu, &d->hdr.cpu_mask)) {
 			err = resctrl_arch_rmid_read(r, d, entry->rmid,
 						     QOS_L3_OCCUP_EVENT_ID,
@@ -535,7 +535,7 @@ static void update_mba_bw(struct rdtgroup *rgrp, struct rdt_domain *dom_mbm)
 	rmid = rgrp->mon.rmid;
 	pmbm_data = &dom_mbm->mbm_local[rmid];
 
-	dom_mba = get_domain_from_cpu(smp_processor_id(), r_mba);
+	dom_mba = get_ctrl_domain_from_cpu(smp_processor_id(), r_mba);
 	if (!dom_mba) {
 		pr_warn_once("Failure to get domain for MBA update\n");
 		return;
diff --git a/arch/x86/kernel/cpu/resctrl/pseudo_lock.c b/arch/x86/kernel/cpu/resctrl/pseudo_lock.c
index fcbd99e2eb66..ed6d59af1cef 100644
--- a/arch/x86/kernel/cpu/resctrl/pseudo_lock.c
+++ b/arch/x86/kernel/cpu/resctrl/pseudo_lock.c
@@ -292,7 +292,7 @@ static void pseudo_lock_region_clear(struct pseudo_lock_region *plr)
  */
 static int pseudo_lock_region_init(struct pseudo_lock_region *plr)
 {
-	enum resctrl_scope scope = plr->s->res->scope;
+	enum resctrl_scope scope = plr->s->res->ctrl_scope;
 	struct cpu_cacheinfo *ci;
 	int ret;
 	int i;
@@ -856,7 +856,7 @@ bool rdtgroup_pseudo_locked_in_hierarchy(struct rdt_domain *d)
 	 * associated with them.
 	 */
 	for_each_alloc_capable_rdt_resource(r) {
-		list_for_each_entry(d_i, &r->domains, hdr.list) {
+		list_for_each_entry(d_i, &r->ctrl_domains, hdr.list) {
 			if (d_i->plr)
 				cpumask_or(cpu_with_psl, cpu_with_psl,
 					   &d_i->hdr.cpu_mask);
diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
index 04d32602ac33..760013ed1bff 100644
--- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
+++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
@@ -91,7 +91,7 @@ void rdt_staged_configs_clear(void)
 	lockdep_assert_held(&rdtgroup_mutex);
 
 	for_each_alloc_capable_rdt_resource(r) {
-		list_for_each_entry(dom, &r->domains, hdr.list)
+		list_for_each_entry(dom, &r->ctrl_domains, hdr.list)
 			memset(dom->staged_config, 0, sizeof(dom->staged_config));
 	}
 }
@@ -984,7 +984,7 @@ static int rdt_bit_usage_show(struct kernfs_open_file *of,
 
 	mutex_lock(&rdtgroup_mutex);
 	hw_shareable = r->cache.shareable_bits;
-	list_for_each_entry(dom, &r->domains, hdr.list) {
+	list_for_each_entry(dom, &r->ctrl_domains, hdr.list) {
 		if (sep)
 			seq_putc(seq, ';');
 		sw_shareable = 0;
@@ -1302,7 +1302,7 @@ static bool rdtgroup_mode_test_exclusive(struct rdtgroup *rdtgrp)
 		if (r->rid == RDT_RESOURCE_MBA || r->rid == RDT_RESOURCE_SMBA)
 			continue;
 		has_cache = true;
-		list_for_each_entry(d, &r->domains, hdr.list) {
+		list_for_each_entry(d, &r->ctrl_domains, hdr.list) {
 			ctrl = resctrl_arch_get_config(r, d, closid,
 						       s->conf_type);
 			if (rdtgroup_cbm_overlaps(s, d, ctrl, closid, false)) {
@@ -1413,13 +1413,13 @@ unsigned int rdtgroup_cbm_to_size(struct rdt_resource *r,
 	unsigned int size = 0;
 	int num_b, i;
 
-	if (WARN_ON_ONCE(r->scope != RESCTRL_L2_CACHE && r->scope != RESCTRL_L3_CACHE))
+	if (WARN_ON_ONCE(r->ctrl_scope != RESCTRL_L2_CACHE && r->ctrl_scope != RESCTRL_L3_CACHE))
 		return size;
 
 	num_b = bitmap_weight(&cbm, r->cache.cbm_len);
 	ci = get_cpu_cacheinfo(cpumask_any(&d->hdr.cpu_mask));
 	for (i = 0; i < ci->num_leaves; i++) {
-		if (ci->info_list[i].level == r->scope) {
+		if (ci->info_list[i].level == r->ctrl_scope) {
 			size = ci->info_list[i].size / r->cache.cbm_len * num_b;
 			break;
 		}
@@ -1477,7 +1477,7 @@ static int rdtgroup_size_show(struct kernfs_open_file *of,
 		type = schema->conf_type;
 		sep = false;
 		seq_printf(s, "%*s:", max_name_width, schema->name);
-		list_for_each_entry(d, &r->domains, hdr.list) {
+		list_for_each_entry(d, &r->ctrl_domains, hdr.list) {
 			if (sep)
 				seq_putc(s, ';');
 			if (rdtgrp->mode == RDT_MODE_PSEUDO_LOCKSETUP) {
@@ -1566,7 +1566,7 @@ static int mbm_config_show(struct seq_file *s, struct rdt_resource *r, u32 evtid
 
 	mutex_lock(&rdtgroup_mutex);
 
-	list_for_each_entry(dom, &r->domains, hdr.list) {
+	list_for_each_entry(dom, &r->mon_domains, hdr.list) {
 		if (sep)
 			seq_puts(s, ";");
 
@@ -1689,7 +1689,7 @@ static int mon_config_write(struct rdt_resource *r, char *tok, u32 evtid)
 		return -EINVAL;
 	}
 
-	list_for_each_entry(d, &r->domains, hdr.list) {
+	list_for_each_entry(d, &r->mon_domains, hdr.list) {
 		if (d->hdr.id == dom_id) {
 			ret = mbm_config_write_domain(r, d, evtid, val);
 			if (ret)
@@ -2232,7 +2232,7 @@ static int set_cache_qos_cfg(int level, bool enable)
 		return -ENOMEM;
 
 	r_l = &rdt_resources_all[level].r_resctrl;
-	list_for_each_entry(d, &r_l->domains, hdr.list) {
+	list_for_each_entry(d, &r_l->ctrl_domains, hdr.list) {
 		if (r_l->cache.arch_has_per_cpu_cfg)
 			/* Pick all the CPUs in the domain instance */
 			for_each_cpu(cpu, &d->hdr.cpu_mask)
@@ -2317,7 +2317,7 @@ static int set_mba_sc(bool mba_sc)
 
 	r->membw.mba_sc = mba_sc;
 
-	list_for_each_entry(d, &r->domains, hdr.list) {
+	list_for_each_entry(d, &r->ctrl_domains, hdr.list) {
 		for (i = 0; i < num_closid; i++)
 			d->mbps_val[i] = MBA_MAX_MBPS;
 	}
@@ -2653,7 +2653,7 @@ static int rdt_get_tree(struct fs_context *fc)
 
 	if (is_mbm_enabled()) {
 		r = &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl;
-		list_for_each_entry(dom, &r->domains, hdr.list)
+		list_for_each_entry(dom, &r->mon_domains, hdr.list)
 			mbm_setup_overflow_handler(dom, MBM_OVERFLOW_INTERVAL);
 	}
 
@@ -2777,10 +2777,10 @@ static int reset_all_ctrls(struct rdt_resource *r)
 
 	/*
 	 * Disable resource control for this resource by setting all
-	 * CBMs in all domains to the maximum mask value. Pick one CPU
+	 * CBMs in all ctrl_domains to the maximum mask value. Pick one CPU
 	 * from each domain to update the MSRs below.
 	 */
-	list_for_each_entry(d, &r->domains, hdr.list) {
+	list_for_each_entry(d, &r->ctrl_domains, hdr.list) {
 		hw_dom = resctrl_to_arch_dom(d);
 		cpumask_set_cpu(cpumask_any(&d->hdr.cpu_mask), cpu_mask);
 
@@ -3050,7 +3050,7 @@ static int mkdir_mondata_subdir_alldom(struct kernfs_node *parent_kn,
 	struct rdt_domain *dom;
 	int ret;
 
-	list_for_each_entry(dom, &r->domains, hdr.list) {
+	list_for_each_entry(dom, &r->mon_domains, hdr.list) {
 		ret = mkdir_mondata_subdir(parent_kn, dom, r, prgrp);
 		if (ret)
 			return ret;
@@ -3232,7 +3232,7 @@ static int rdtgroup_init_cat(struct resctrl_schema *s, u32 closid)
 	struct rdt_domain *d;
 	int ret;
 
-	list_for_each_entry(d, &s->res->domains, hdr.list) {
+	list_for_each_entry(d, &s->res->ctrl_domains, hdr.list) {
 		ret = __init_one_rdt_domain(d, s, closid);
 		if (ret < 0)
 			return ret;
@@ -3247,7 +3247,7 @@ static void rdtgroup_init_mba(struct rdt_resource *r, u32 closid)
 	struct resctrl_staged_config *cfg;
 	struct rdt_domain *d;
 
-	list_for_each_entry(d, &r->domains, hdr.list) {
+	list_for_each_entry(d, &r->ctrl_domains, hdr.list) {
 		if (is_mba_sc(r)) {
 			d->mbps_val[closid] = MBA_MAX_MBPS;
 			continue;
@@ -3849,15 +3849,17 @@ static void domain_destroy_mon_state(struct rdt_domain *d)
 	kfree(d->mbm_local);
 }
 
-void resctrl_offline_domain(struct rdt_resource *r, struct rdt_domain *d)
+void resctrl_offline_ctrl_domain(struct rdt_resource *r, struct rdt_domain *d)
 {
 	lockdep_assert_held(&rdtgroup_mutex);
 
 	if (supports_mba_mbps() && r->rid == RDT_RESOURCE_MBA)
 		mba_sc_domain_destroy(r, d);
+}
 
-	if (!r->mon_capable)
-		return;
+void resctrl_offline_mon_domain(struct rdt_resource *r, struct rdt_domain *d)
+{
+	lockdep_assert_held(&rdtgroup_mutex);
 
 	/*
 	 * If resctrl is mounted, remove all the
@@ -3914,18 +3916,21 @@ static int domain_setup_mon_state(struct rdt_resource *r, struct rdt_domain *d)
 	return 0;
 }
 
-int resctrl_online_domain(struct rdt_resource *r, struct rdt_domain *d)
+int resctrl_online_ctrl_domain(struct rdt_resource *r, struct rdt_domain *d)
 {
-	int err;
-
 	lockdep_assert_held(&rdtgroup_mutex);
 
 	if (supports_mba_mbps() && r->rid == RDT_RESOURCE_MBA)
-		/* RDT_RESOURCE_MBA is never mon_capable */
 		return mba_sc_domain_allocate(r, d);
 
-	if (!r->mon_capable)
-		return 0;
+	return 0;
+}
+
+int resctrl_online_mon_domain(struct rdt_resource *r, struct rdt_domain *d)
+{
+	int err;
+
+	lockdep_assert_held(&rdtgroup_mutex);
 
 	err = domain_setup_mon_state(r, d);
 	if (err)
-- 
2.41.0


^ permalink raw reply related	[flat|nested] 331+ messages in thread

* [PATCH v12 4/8] x86/resctrl: Split the rdt_domain and rdt_hw_domain structures
  2023-11-30  0:34               ` [PATCH v12 " Tony Luck
                                   ` (2 preceding siblings ...)
  2023-11-30  0:34                 ` [PATCH v12 3/8] x86/resctrl: Prepare for different scope for control/monitor operations Tony Luck
@ 2023-11-30  0:34                 ` Tony Luck
  2023-11-30  0:34                 ` [PATCH v12 5/8] x86/resctrl: Add node-scope to the options for feature scope Tony Luck
                                   ` (4 subsequent siblings)
  8 siblings, 0 replies; 331+ messages in thread
From: Tony Luck @ 2023-11-30  0:34 UTC (permalink / raw)
  To: Fenghua Yu, Reinette Chatre, Peter Newman, Jonathan Corbet,
	Shuah Khan, x86
  Cc: Shaopeng Tan, James Morse, Jamie Iles, Babu Moger, Randy Dunlap,
	linux-kernel, linux-doc, patches, Tony Luck, Shaopeng Tan

The same rdt_domain structure is used for both control and monitor
functions. But this results in wasted memory as some of the fields are
only used by control functions, while most are only used by monitor
functions.

Split it into separate rdt_ctrl_domain and rdt_mon_domain structures
with just the fields required for control and monitoring respectively.

Similarly, split the rdt_hw_domain structure into rdt_hw_ctrl_domain
and rdt_hw_mon_domain.
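
As an illustration (field lists abbreviated; see the patch below for
the full definitions), each flavour keeps only the state its own
domain list needs, so control domains no longer carry MBM and limbo
state, and monitor domains no longer carry staged configurations:

	struct rdt_ctrl_domain {
		struct rdt_domain_hdr		hdr;
		struct pseudo_lock_region	*plr;
		struct resctrl_staged_config	staged_config[CDP_NUM_TYPES];
		u32				*mbps_val;
	};

	struct rdt_mon_domain {
		struct rdt_domain_hdr		hdr;
		unsigned long			*rmid_busy_llc;
		struct mbm_state		*mbm_total;
		struct mbm_state		*mbm_local;
		/* ... delayed work and worker CPU fields ... */
	};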

Signed-off-by: Tony Luck <tony.luck@intel.com>
Reviewed-by: Peter Newman <peternewman@google.com>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Reviewed-by: Shaopeng Tan <tan.shaopeng@jp.fujitsu.com>
Tested-by: Shaopeng Tan <tan.shaopeng@jp.fujitsu.com>
---
 include/linux/resctrl.h                   | 48 +++++++------
 arch/x86/kernel/cpu/resctrl/internal.h    | 60 ++++++++++------
 arch/x86/kernel/cpu/resctrl/core.c        | 87 ++++++++++++-----------
 arch/x86/kernel/cpu/resctrl/ctrlmondata.c | 32 ++++-----
 arch/x86/kernel/cpu/resctrl/monitor.c     | 40 +++++------
 arch/x86/kernel/cpu/resctrl/pseudo_lock.c |  6 +-
 arch/x86/kernel/cpu/resctrl/rdtgroup.c    | 62 ++++++++--------
 7 files changed, 182 insertions(+), 153 deletions(-)

diff --git a/include/linux/resctrl.h b/include/linux/resctrl.h
index 35e700edc6e6..058a940c3239 100644
--- a/include/linux/resctrl.h
+++ b/include/linux/resctrl.h
@@ -72,7 +72,23 @@ struct rdt_domain_hdr {
 };
 
 /**
- * struct rdt_domain - group of CPUs sharing a resctrl resource
+ * struct rdt_ctrl_domain - group of CPUs sharing a resctrl control resource
+ * @hdr:		common header for different domain types
+ * @plr:		pseudo-locked region (if any) associated with domain
+ * @staged_config:	parsed configuration to be applied
+ * @mbps_val:		When mba_sc is enabled, this holds the array of user
+ *			specified control values for mba_sc in MBps, indexed
+ *			by closid
+ */
+struct rdt_ctrl_domain {
+	struct rdt_domain_hdr		hdr;
+	struct pseudo_lock_region	*plr;
+	struct resctrl_staged_config	staged_config[CDP_NUM_TYPES];
+	u32				*mbps_val;
+};
+
+/**
+ * struct rdt_mon_domain - group of CPUs sharing a resctrl monitor resource
  * @hdr:		common header for different domain types
  * @rmid_busy_llc:	bitmap of which limbo RMIDs are above threshold
  * @mbm_total:		saved state for MBM total bandwidth
@@ -81,13 +97,8 @@ struct rdt_domain_hdr {
  * @cqm_limbo:		worker to periodically read CQM h/w counters
  * @mbm_work_cpu:	worker CPU for MBM h/w counters
  * @cqm_work_cpu:	worker CPU for CQM h/w counters
- * @plr:		pseudo-locked region (if any) associated with domain
- * @staged_config:	parsed configuration to be applied
- * @mbps_val:		When mba_sc is enabled, this holds the array of user
- *			specified control values for mba_sc in MBps, indexed
- *			by closid
  */
-struct rdt_domain {
+struct rdt_mon_domain {
 	struct rdt_domain_hdr		hdr;
 	unsigned long			*rmid_busy_llc;
 	struct mbm_state		*mbm_total;
@@ -96,9 +107,6 @@ struct rdt_domain {
 	struct delayed_work		cqm_limbo;
 	int				mbm_work_cpu;
 	int				cqm_work_cpu;
-	struct pseudo_lock_region	*plr;
-	struct resctrl_staged_config	staged_config[CDP_NUM_TYPES];
-	u32				*mbps_val;
 };
 
 /**
@@ -202,7 +210,7 @@ struct rdt_resource {
 	const char		*format_str;
 	int			(*parse_ctrlval)(struct rdt_parse_data *data,
 						 struct resctrl_schema *s,
-						 struct rdt_domain *d);
+						 struct rdt_ctrl_domain *d);
 	struct list_head	evt_list;
 	unsigned long		fflags;
 	bool			cdp_capable;
@@ -236,15 +244,15 @@ int resctrl_arch_update_domains(struct rdt_resource *r, u32 closid);
  * Update the ctrl_val and apply this config right now.
  * Must be called on one of the domain's CPUs.
  */
-int resctrl_arch_update_one(struct rdt_resource *r, struct rdt_domain *d,
+int resctrl_arch_update_one(struct rdt_resource *r, struct rdt_ctrl_domain *d,
 			    u32 closid, enum resctrl_conf_type t, u32 cfg_val);
 
-u32 resctrl_arch_get_config(struct rdt_resource *r, struct rdt_domain *d,
+u32 resctrl_arch_get_config(struct rdt_resource *r, struct rdt_ctrl_domain *d,
 			    u32 closid, enum resctrl_conf_type type);
-int resctrl_online_ctrl_domain(struct rdt_resource *r, struct rdt_domain *d);
-int resctrl_online_mon_domain(struct rdt_resource *r, struct rdt_domain *d);
-void resctrl_offline_ctrl_domain(struct rdt_resource *r, struct rdt_domain *d);
-void resctrl_offline_mon_domain(struct rdt_resource *r, struct rdt_domain *d);
+int resctrl_online_ctrl_domain(struct rdt_resource *r, struct rdt_ctrl_domain *d);
+int resctrl_online_mon_domain(struct rdt_resource *r, struct rdt_mon_domain *d);
+void resctrl_offline_ctrl_domain(struct rdt_resource *r, struct rdt_ctrl_domain *d);
+void resctrl_offline_mon_domain(struct rdt_resource *r, struct rdt_mon_domain *d);
 
 /**
  * resctrl_arch_rmid_read() - Read the eventid counter corresponding to rmid
@@ -260,7 +268,7 @@ void resctrl_offline_mon_domain(struct rdt_resource *r, struct rdt_domain *d);
  * Return:
  * 0 on success, or -EIO, -EINVAL etc on error.
  */
-int resctrl_arch_rmid_read(struct rdt_resource *r, struct rdt_domain *d,
+int resctrl_arch_rmid_read(struct rdt_resource *r, struct rdt_mon_domain *d,
 			   u32 rmid, enum resctrl_event_id eventid, u64 *val);
 
 /**
@@ -273,7 +281,7 @@ int resctrl_arch_rmid_read(struct rdt_resource *r, struct rdt_domain *d,
  *
  * This can be called from any CPU.
  */
-void resctrl_arch_reset_rmid(struct rdt_resource *r, struct rdt_domain *d,
+void resctrl_arch_reset_rmid(struct rdt_resource *r, struct rdt_mon_domain *d,
 			     u32 rmid, enum resctrl_event_id eventid);
 
 /**
@@ -285,7 +293,7 @@ void resctrl_arch_reset_rmid(struct rdt_resource *r, struct rdt_domain *d,
  *
  * This can be called from any CPU.
  */
-void resctrl_arch_reset_rmid_all(struct rdt_resource *r, struct rdt_domain *d);
+void resctrl_arch_reset_rmid_all(struct rdt_resource *r, struct rdt_mon_domain *d);
 
 extern unsigned int resctrl_rmid_realloc_threshold;
 extern unsigned int resctrl_rmid_realloc_limit;
diff --git a/arch/x86/kernel/cpu/resctrl/internal.h b/arch/x86/kernel/cpu/resctrl/internal.h
index 24bf9d7989a9..ce3a70657842 100644
--- a/arch/x86/kernel/cpu/resctrl/internal.h
+++ b/arch/x86/kernel/cpu/resctrl/internal.h
@@ -107,7 +107,7 @@ union mon_data_bits {
 struct rmid_read {
 	struct rdtgroup		*rgrp;
 	struct rdt_resource	*r;
-	struct rdt_domain	*d;
+	struct rdt_mon_domain	*d;
 	enum resctrl_event_id	evtid;
 	bool			first;
 	int			err;
@@ -192,7 +192,7 @@ struct mongroup {
  */
 struct pseudo_lock_region {
 	struct resctrl_schema	*s;
-	struct rdt_domain	*d;
+	struct rdt_ctrl_domain	*d;
 	u32			cbm;
 	wait_queue_head_t	lock_thread_wq;
 	int			thread_done;
@@ -319,25 +319,41 @@ struct arch_mbm_state {
 };
 
 /**
- * struct rdt_hw_domain - Arch private attributes of a set of CPUs that share
- *			  a resource
+ * struct rdt_hw_ctrl_domain - Arch private attributes of a set of CPUs that share
+ *			       a resource for a control function
  * @d_resctrl:	Properties exposed to the resctrl file system
  * @ctrl_val:	array of cache or mem ctrl values (indexed by CLOSID)
+ *
+ * Members of this structure are accessed via helpers that provide abstraction.
+ */
+struct rdt_hw_ctrl_domain {
+	struct rdt_ctrl_domain		d_resctrl;
+	u32				*ctrl_val;
+};
+
+/**
+ * struct rdt_hw_mon_domain - Arch private attributes of a set of CPUs that share
+ *			      a resource for a monitor function
+ * @d_resctrl:	Properties exposed to the resctrl file system
  * @arch_mbm_total:	arch private state for MBM total bandwidth
  * @arch_mbm_local:	arch private state for MBM local bandwidth
  *
  * Members of this structure are accessed via helpers that provide abstraction.
  */
-struct rdt_hw_domain {
-	struct rdt_domain		d_resctrl;
-	u32				*ctrl_val;
+struct rdt_hw_mon_domain {
+	struct rdt_mon_domain		d_resctrl;
 	struct arch_mbm_state		*arch_mbm_total;
 	struct arch_mbm_state		*arch_mbm_local;
 };
 
-static inline struct rdt_hw_domain *resctrl_to_arch_dom(struct rdt_domain *r)
+static inline struct rdt_hw_ctrl_domain *resctrl_to_arch_ctrl_dom(struct rdt_ctrl_domain *r)
+{
+	return container_of(r, struct rdt_hw_ctrl_domain, d_resctrl);
+}
+
+static inline struct rdt_hw_mon_domain *resctrl_to_arch_mon_dom(struct rdt_mon_domain *r)
 {
-	return container_of(r, struct rdt_hw_domain, d_resctrl);
+	return container_of(r, struct rdt_hw_mon_domain, d_resctrl);
 }
 
 /**
@@ -405,7 +421,7 @@ struct rdt_hw_resource {
 	struct rdt_resource	r_resctrl;
 	u32			num_closid;
 	unsigned int		msr_base;
-	void (*msr_update)	(struct rdt_domain *d, struct msr_param *m,
+	void (*msr_update)	(struct rdt_ctrl_domain *d, struct msr_param *m,
 				 struct rdt_resource *r);
 	unsigned int		mon_scale;
 	unsigned int		mbm_width;
@@ -418,9 +434,9 @@ static inline struct rdt_hw_resource *resctrl_to_arch_res(struct rdt_resource *r
 }
 
 int parse_cbm(struct rdt_parse_data *data, struct resctrl_schema *s,
-	      struct rdt_domain *d);
+	      struct rdt_ctrl_domain *d);
 int parse_bw(struct rdt_parse_data *data, struct resctrl_schema *s,
-	     struct rdt_domain *d);
+	     struct rdt_ctrl_domain *d);
 
 extern struct mutex rdtgroup_mutex;
 
@@ -526,21 +542,21 @@ ssize_t rdtgroup_schemata_write(struct kernfs_open_file *of,
 				char *buf, size_t nbytes, loff_t off);
 int rdtgroup_schemata_show(struct kernfs_open_file *of,
 			   struct seq_file *s, void *v);
-bool rdtgroup_cbm_overlaps(struct resctrl_schema *s, struct rdt_domain *d,
+bool rdtgroup_cbm_overlaps(struct resctrl_schema *s, struct rdt_ctrl_domain *d,
 			   unsigned long cbm, int closid, bool exclusive);
-unsigned int rdtgroup_cbm_to_size(struct rdt_resource *r, struct rdt_domain *d,
+unsigned int rdtgroup_cbm_to_size(struct rdt_resource *r, struct rdt_ctrl_domain *d,
 				  unsigned long cbm);
 enum rdtgrp_mode rdtgroup_mode_by_closid(int closid);
 int rdtgroup_tasks_assigned(struct rdtgroup *r);
 int rdtgroup_locksetup_enter(struct rdtgroup *rdtgrp);
 int rdtgroup_locksetup_exit(struct rdtgroup *rdtgrp);
-bool rdtgroup_cbm_overlaps_pseudo_locked(struct rdt_domain *d, unsigned long cbm);
-bool rdtgroup_pseudo_locked_in_hierarchy(struct rdt_domain *d);
+bool rdtgroup_cbm_overlaps_pseudo_locked(struct rdt_ctrl_domain *d, unsigned long cbm);
+bool rdtgroup_pseudo_locked_in_hierarchy(struct rdt_ctrl_domain *d);
 int rdt_pseudo_lock_init(void);
 void rdt_pseudo_lock_release(void);
 int rdtgroup_pseudo_lock_create(struct rdtgroup *rdtgrp);
 void rdtgroup_pseudo_lock_remove(struct rdtgroup *rdtgrp);
-struct rdt_domain *get_ctrl_domain_from_cpu(int cpu, struct rdt_resource *r);
+struct rdt_ctrl_domain *get_ctrl_domain_from_cpu(int cpu, struct rdt_resource *r);
 int closids_supported(void);
 void closid_free(int closid);
 int alloc_rmid(void);
@@ -550,17 +566,17 @@ bool __init rdt_cpu_has(int flag);
 void mon_event_count(void *info);
 int rdtgroup_mondata_show(struct seq_file *m, void *arg);
 void mon_event_read(struct rmid_read *rr, struct rdt_resource *r,
-		    struct rdt_domain *d, struct rdtgroup *rdtgrp,
+		    struct rdt_mon_domain *d, struct rdtgroup *rdtgrp,
 		    int evtid, int first);
-void mbm_setup_overflow_handler(struct rdt_domain *dom,
+void mbm_setup_overflow_handler(struct rdt_mon_domain *dom,
 				unsigned long delay_ms);
 void mbm_handle_overflow(struct work_struct *work);
 void __init intel_rdt_mbm_apply_quirk(void);
 bool is_mba_sc(struct rdt_resource *r);
-void cqm_setup_limbo_handler(struct rdt_domain *dom, unsigned long delay_ms);
+void cqm_setup_limbo_handler(struct rdt_mon_domain *dom, unsigned long delay_ms);
 void cqm_handle_limbo(struct work_struct *work);
-bool has_busy_rmid(struct rdt_resource *r, struct rdt_domain *d);
-void __check_limbo(struct rdt_domain *d, bool force_free);
+bool has_busy_rmid(struct rdt_resource *r, struct rdt_mon_domain *d);
+void __check_limbo(struct rdt_mon_domain *d, bool force_free);
 void rdt_domain_reconfigure_cdp(struct rdt_resource *r);
 void __init thread_throttle_mode_init(void);
 void __init mbm_config_rftype_init(const char *config);
diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
index 1fd85533b4ca..797cb3bf417a 100644
--- a/arch/x86/kernel/cpu/resctrl/core.c
+++ b/arch/x86/kernel/cpu/resctrl/core.c
@@ -49,12 +49,12 @@ int max_name_width, max_data_width;
 bool rdt_alloc_capable;
 
 static void
-mba_wrmsr_intel(struct rdt_domain *d, struct msr_param *m,
+mba_wrmsr_intel(struct rdt_ctrl_domain *d, struct msr_param *m,
 		struct rdt_resource *r);
 static void
-cat_wrmsr(struct rdt_domain *d, struct msr_param *m, struct rdt_resource *r);
+cat_wrmsr(struct rdt_ctrl_domain *d, struct msr_param *m, struct rdt_resource *r);
 static void
-mba_wrmsr_amd(struct rdt_domain *d, struct msr_param *m,
+mba_wrmsr_amd(struct rdt_ctrl_domain *d, struct msr_param *m,
 	      struct rdt_resource *r);
 
 #define ctrl_domain_init(id) LIST_HEAD_INIT(rdt_resources_all[id].r_resctrl.ctrl_domains)
@@ -307,11 +307,11 @@ static void rdt_get_cdp_l2_config(void)
 }
 
 static void
-mba_wrmsr_amd(struct rdt_domain *d, struct msr_param *m, struct rdt_resource *r)
+mba_wrmsr_amd(struct rdt_ctrl_domain *d, struct msr_param *m, struct rdt_resource *r)
 {
-	unsigned int i;
-	struct rdt_hw_domain *hw_dom = resctrl_to_arch_dom(d);
+	struct rdt_hw_ctrl_domain *hw_dom = resctrl_to_arch_ctrl_dom(d);
 	struct rdt_hw_resource *hw_res = resctrl_to_arch_res(r);
+	unsigned int i;
 
 	for (i = m->low; i < m->high; i++)
 		wrmsrl(hw_res->msr_base + i, hw_dom->ctrl_val[i]);
@@ -332,12 +332,12 @@ static u32 delay_bw_map(unsigned long bw, struct rdt_resource *r)
 }
 
 static void
-mba_wrmsr_intel(struct rdt_domain *d, struct msr_param *m,
+mba_wrmsr_intel(struct rdt_ctrl_domain *d, struct msr_param *m,
 		struct rdt_resource *r)
 {
-	unsigned int i;
-	struct rdt_hw_domain *hw_dom = resctrl_to_arch_dom(d);
+	struct rdt_hw_ctrl_domain *hw_dom = resctrl_to_arch_ctrl_dom(d);
 	struct rdt_hw_resource *hw_res = resctrl_to_arch_res(r);
+	unsigned int i;
 
 	/*  Write the delay values for mba. */
 	for (i = m->low; i < m->high; i++)
@@ -345,19 +345,19 @@ mba_wrmsr_intel(struct rdt_domain *d, struct msr_param *m,
 }
 
 static void
-cat_wrmsr(struct rdt_domain *d, struct msr_param *m, struct rdt_resource *r)
+cat_wrmsr(struct rdt_ctrl_domain *d, struct msr_param *m, struct rdt_resource *r)
 {
-	unsigned int i;
-	struct rdt_hw_domain *hw_dom = resctrl_to_arch_dom(d);
+	struct rdt_hw_ctrl_domain *hw_dom = resctrl_to_arch_ctrl_dom(d);
 	struct rdt_hw_resource *hw_res = resctrl_to_arch_res(r);
+	unsigned int i;
 
 	for (i = m->low; i < m->high; i++)
 		wrmsrl(hw_res->msr_base + i, hw_dom->ctrl_val[i]);
 }
 
-struct rdt_domain *get_ctrl_domain_from_cpu(int cpu, struct rdt_resource *r)
+struct rdt_ctrl_domain *get_ctrl_domain_from_cpu(int cpu, struct rdt_resource *r)
 {
-	struct rdt_domain *d;
+	struct rdt_ctrl_domain *d;
 
 	list_for_each_entry(d, &r->ctrl_domains, hdr.list) {
 		/* Find the domain that contains this CPU */
@@ -379,7 +379,7 @@ void rdt_ctrl_update(void *arg)
 	struct rdt_hw_resource *hw_res = resctrl_to_arch_res(m->res);
 	struct rdt_resource *r = m->res;
 	int cpu = smp_processor_id();
-	struct rdt_domain *d;
+	struct rdt_ctrl_domain *d;
 
 	d = get_ctrl_domain_from_cpu(cpu, r);
 	if (d) {
@@ -434,18 +434,23 @@ static void setup_default_ctrlval(struct rdt_resource *r, u32 *dc)
 		*dc = r->default_ctrl;
 }
 
-static void domain_free(struct rdt_hw_domain *hw_dom)
+static void ctrl_domain_free(struct rdt_hw_ctrl_domain *hw_dom)
+{
+	kfree(hw_dom->ctrl_val);
+	kfree(hw_dom);
+}
+
+static void mon_domain_free(struct rdt_hw_mon_domain *hw_dom)
 {
 	kfree(hw_dom->arch_mbm_total);
 	kfree(hw_dom->arch_mbm_local);
-	kfree(hw_dom->ctrl_val);
 	kfree(hw_dom);
 }
 
-static int domain_setup_ctrlval(struct rdt_resource *r, struct rdt_domain *d)
+static int domain_setup_ctrlval(struct rdt_resource *r, struct rdt_ctrl_domain *d)
 {
+	struct rdt_hw_ctrl_domain *hw_dom = resctrl_to_arch_ctrl_dom(d);
 	struct rdt_hw_resource *hw_res = resctrl_to_arch_res(r);
-	struct rdt_hw_domain *hw_dom = resctrl_to_arch_dom(d);
 	struct msr_param m;
 	u32 *dc;
 
@@ -468,7 +473,7 @@ static int domain_setup_ctrlval(struct rdt_resource *r, struct rdt_domain *d)
  * @num_rmid:	The size of the MBM counter array
  * @hw_dom:	The domain that owns the allocated arrays
  */
-static int arch_domain_mbm_alloc(u32 num_rmid, struct rdt_hw_domain *hw_dom)
+static int arch_domain_mbm_alloc(u32 num_rmid, struct rdt_hw_mon_domain *hw_dom)
 {
 	size_t tsize;
 
@@ -507,10 +512,10 @@ static int get_domain_id_from_scope(int cpu, enum resctrl_scope scope)
 static void domain_add_cpu_ctrl(int cpu, struct rdt_resource *r)
 {
 	int id = get_domain_id_from_scope(cpu, r->ctrl_scope);
+	struct rdt_hw_ctrl_domain *hw_dom;
 	struct list_head *add_pos = NULL;
-	struct rdt_hw_domain *hw_dom;
 	struct rdt_domain_hdr *hdr;
-	struct rdt_domain *d;
+	struct rdt_ctrl_domain *d;
 	int err;
 
 	if (id < 0) {
@@ -524,7 +529,7 @@ static void domain_add_cpu_ctrl(int cpu, struct rdt_resource *r)
 		if (WARN_ON_ONCE(hdr->type != RESCTRL_CTRL_DOMAIN))
 			return;
 
-		d = container_of(hdr, struct rdt_domain, hdr);
+		d = container_of(hdr, struct rdt_ctrl_domain, hdr);
 
 		cpumask_set_cpu(cpu, &d->hdr.cpu_mask);
 		if (r->cache.arch_has_per_cpu_cfg)
@@ -544,7 +549,7 @@ static void domain_add_cpu_ctrl(int cpu, struct rdt_resource *r)
 	rdt_domain_reconfigure_cdp(r);
 
 	if (domain_setup_ctrlval(r, d)) {
-		domain_free(hw_dom);
+		ctrl_domain_free(hw_dom);
 		return;
 	}
 
@@ -553,17 +558,17 @@ static void domain_add_cpu_ctrl(int cpu, struct rdt_resource *r)
 	err = resctrl_online_ctrl_domain(r, d);
 	if (err) {
 		list_del(&d->hdr.list);
-		domain_free(hw_dom);
+		ctrl_domain_free(hw_dom);
 	}
 }
 
 static void domain_add_cpu_mon(int cpu, struct rdt_resource *r)
 {
 	int id = get_domain_id_from_scope(cpu, r->mon_scope);
+	struct rdt_hw_mon_domain *hw_dom;
 	struct list_head *add_pos = NULL;
-	struct rdt_hw_domain *hw_dom;
 	struct rdt_domain_hdr *hdr;
-	struct rdt_domain *d;
+	struct rdt_mon_domain *d;
 	int err;
 
 	if (id < 0) {
@@ -577,7 +582,7 @@ static void domain_add_cpu_mon(int cpu, struct rdt_resource *r)
 		if (WARN_ON_ONCE(hdr->type != RESCTRL_MON_DOMAIN))
 			return;
 
-		d = container_of(hdr, struct rdt_domain, hdr);
+		d = container_of(hdr, struct rdt_mon_domain, hdr);
 
 		cpumask_set_cpu(cpu, &d->hdr.cpu_mask);
 		return;
@@ -593,7 +598,7 @@ static void domain_add_cpu_mon(int cpu, struct rdt_resource *r)
 	cpumask_set_cpu(cpu, &d->hdr.cpu_mask);
 
 	if (arch_domain_mbm_alloc(r->num_rmid, hw_dom)) {
-		domain_free(hw_dom);
+		mon_domain_free(hw_dom);
 		return;
 	}
 
@@ -602,7 +607,7 @@ static void domain_add_cpu_mon(int cpu, struct rdt_resource *r)
 	err = resctrl_online_mon_domain(r, d);
 	if (err) {
 		list_del(&d->hdr.list);
-		domain_free(hw_dom);
+		mon_domain_free(hw_dom);
 	}
 }
 
@@ -620,9 +625,9 @@ static void domain_add_cpu(int cpu, struct rdt_resource *r)
 static void domain_remove_cpu_ctrl(int cpu, struct rdt_resource *r)
 {
 	int id = get_domain_id_from_scope(cpu, r->ctrl_scope);
-	struct rdt_hw_domain *hw_dom;
+	struct rdt_hw_ctrl_domain *hw_dom;
 	struct rdt_domain_hdr *hdr;
-	struct rdt_domain *d;
+	struct rdt_ctrl_domain *d;
 
 	if (id < 0) {
 		pr_warn_once("Can't find control domain id for CPU:%d scope:%d for resource %s\n",
@@ -640,8 +645,8 @@ static void domain_remove_cpu_ctrl(int cpu, struct rdt_resource *r)
 	if (WARN_ON_ONCE(hdr->type != RESCTRL_CTRL_DOMAIN))
 		return;
 
-	d = container_of(hdr, struct rdt_domain, hdr);
-	hw_dom = resctrl_to_arch_dom(d);
+	d = container_of(hdr, struct rdt_ctrl_domain, hdr);
+	hw_dom = resctrl_to_arch_ctrl_dom(d);
 
 	cpumask_clear_cpu(cpu, &d->hdr.cpu_mask);
 	if (cpumask_empty(&d->hdr.cpu_mask)) {
@@ -649,12 +654,12 @@ static void domain_remove_cpu_ctrl(int cpu, struct rdt_resource *r)
 		list_del(&d->hdr.list);
 
 		/*
-		 * rdt_domain "d" is going to be freed below, so clear
+		 * rdt_ctrl_domain "d" is going to be freed below, so clear
 		 * its pointer from pseudo_lock_region struct.
 		 */
 		if (d->plr)
 			d->plr->d = NULL;
-		domain_free(hw_dom);
+		ctrl_domain_free(hw_dom);
 
 		return;
 	}
@@ -663,9 +668,9 @@ static void domain_remove_cpu_ctrl(int cpu, struct rdt_resource *r)
 static void domain_remove_cpu_mon(int cpu, struct rdt_resource *r)
 {
 	int id = get_domain_id_from_scope(cpu, r->mon_scope);
-	struct rdt_hw_domain *hw_dom;
+	struct rdt_hw_mon_domain *hw_dom;
 	struct rdt_domain_hdr *hdr;
-	struct rdt_domain *d;
+	struct rdt_mon_domain *d;
 
 	if (id < 0) {
 		pr_warn_once("Can't find monitor domain id for CPU:%d scope:%d for resource %s\n",
@@ -683,14 +688,14 @@ static void domain_remove_cpu_mon(int cpu, struct rdt_resource *r)
 	if (WARN_ON_ONCE(hdr->type != RESCTRL_MON_DOMAIN))
 		return;
 
-	d = container_of(hdr, struct rdt_domain, hdr);
-	hw_dom = resctrl_to_arch_dom(d);
+	d = container_of(hdr, struct rdt_mon_domain, hdr);
+	hw_dom = resctrl_to_arch_mon_dom(d);
 
 	cpumask_clear_cpu(cpu, &d->hdr.cpu_mask);
 	if (cpumask_empty(&d->hdr.cpu_mask)) {
 		resctrl_offline_mon_domain(r, d);
 		list_del(&d->hdr.list);
-		domain_free(hw_dom);
+		mon_domain_free(hw_dom);
 
 		return;
 	}
diff --git a/arch/x86/kernel/cpu/resctrl/ctrlmondata.c b/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
index 0b4136c42762..08fc97ce4135 100644
--- a/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
+++ b/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
@@ -58,7 +58,7 @@ static bool bw_validate(char *buf, unsigned long *data, struct rdt_resource *r)
 }
 
 int parse_bw(struct rdt_parse_data *data, struct resctrl_schema *s,
-	     struct rdt_domain *d)
+	     struct rdt_ctrl_domain *d)
 {
 	struct resctrl_staged_config *cfg;
 	u32 closid = data->rdtgrp->closid;
@@ -137,7 +137,7 @@ static bool cbm_validate(char *buf, u32 *data, struct rdt_resource *r)
  * resource type.
  */
 int parse_cbm(struct rdt_parse_data *data, struct resctrl_schema *s,
-	      struct rdt_domain *d)
+	      struct rdt_ctrl_domain *d)
 {
 	struct rdtgroup *rdtgrp = data->rdtgrp;
 	struct resctrl_staged_config *cfg;
@@ -206,8 +206,8 @@ static int parse_line(char *line, struct resctrl_schema *s,
 	struct resctrl_staged_config *cfg;
 	struct rdt_resource *r = s->res;
 	struct rdt_parse_data data;
+	struct rdt_ctrl_domain *d;
 	char *dom = NULL, *id;
-	struct rdt_domain *d;
 	unsigned long dom_id;
 
 	if (rdtgrp->mode == RDT_MODE_PSEUDO_LOCKSETUP &&
@@ -267,11 +267,11 @@ static u32 get_config_index(u32 closid, enum resctrl_conf_type type)
 	}
 }
 
-static bool apply_config(struct rdt_hw_domain *hw_dom,
+static bool apply_config(struct rdt_hw_ctrl_domain *hw_dom,
 			 struct resctrl_staged_config *cfg, u32 idx,
 			 cpumask_var_t cpu_mask)
 {
-	struct rdt_domain *dom = &hw_dom->d_resctrl;
+	struct rdt_ctrl_domain *dom = &hw_dom->d_resctrl;
 
 	if (cfg->new_ctrl != hw_dom->ctrl_val[idx]) {
 		cpumask_set_cpu(cpumask_any(&dom->hdr.cpu_mask), cpu_mask);
@@ -283,11 +283,11 @@ static bool apply_config(struct rdt_hw_domain *hw_dom,
 	return false;
 }
 
-int resctrl_arch_update_one(struct rdt_resource *r, struct rdt_domain *d,
+int resctrl_arch_update_one(struct rdt_resource *r, struct rdt_ctrl_domain *d,
 			    u32 closid, enum resctrl_conf_type t, u32 cfg_val)
 {
+	struct rdt_hw_ctrl_domain *hw_dom = resctrl_to_arch_ctrl_dom(d);
 	struct rdt_hw_resource *hw_res = resctrl_to_arch_res(r);
-	struct rdt_hw_domain *hw_dom = resctrl_to_arch_dom(d);
 	u32 idx = get_config_index(closid, t);
 	struct msr_param msr_param;
 
@@ -307,11 +307,11 @@ int resctrl_arch_update_one(struct rdt_resource *r, struct rdt_domain *d,
 int resctrl_arch_update_domains(struct rdt_resource *r, u32 closid)
 {
 	struct resctrl_staged_config *cfg;
-	struct rdt_hw_domain *hw_dom;
+	struct rdt_hw_ctrl_domain *hw_dom;
 	struct msr_param msr_param;
+	struct rdt_ctrl_domain *d;
 	enum resctrl_conf_type t;
 	cpumask_var_t cpu_mask;
-	struct rdt_domain *d;
 	u32 idx;
 
 	if (!zalloc_cpumask_var(&cpu_mask, GFP_KERNEL))
@@ -319,7 +319,7 @@ int resctrl_arch_update_domains(struct rdt_resource *r, u32 closid)
 
 	msr_param.res = NULL;
 	list_for_each_entry(d, &r->ctrl_domains, hdr.list) {
-		hw_dom = resctrl_to_arch_dom(d);
+		hw_dom = resctrl_to_arch_ctrl_dom(d);
 		for (t = 0; t < CDP_NUM_TYPES; t++) {
 			cfg = &hw_dom->d_resctrl.staged_config[t];
 			if (!cfg->have_new_ctrl)
@@ -449,10 +449,10 @@ ssize_t rdtgroup_schemata_write(struct kernfs_open_file *of,
 	return ret ?: nbytes;
 }
 
-u32 resctrl_arch_get_config(struct rdt_resource *r, struct rdt_domain *d,
+u32 resctrl_arch_get_config(struct rdt_resource *r, struct rdt_ctrl_domain *d,
 			    u32 closid, enum resctrl_conf_type type)
 {
-	struct rdt_hw_domain *hw_dom = resctrl_to_arch_dom(d);
+	struct rdt_hw_ctrl_domain *hw_dom = resctrl_to_arch_ctrl_dom(d);
 	u32 idx = get_config_index(closid, type);
 
 	return hw_dom->ctrl_val[idx];
@@ -461,7 +461,7 @@ u32 resctrl_arch_get_config(struct rdt_resource *r, struct rdt_domain *d,
 static void show_doms(struct seq_file *s, struct resctrl_schema *schema, int closid)
 {
 	struct rdt_resource *r = schema->res;
-	struct rdt_domain *dom;
+	struct rdt_ctrl_domain *dom;
 	bool sep = false;
 	u32 ctrl_val;
 
@@ -523,7 +523,7 @@ int rdtgroup_schemata_show(struct kernfs_open_file *of,
 }
 
 void mon_event_read(struct rmid_read *rr, struct rdt_resource *r,
-		    struct rdt_domain *d, struct rdtgroup *rdtgrp,
+		    struct rdt_mon_domain *d, struct rdtgroup *rdtgrp,
 		    int evtid, int first)
 {
 	/*
@@ -543,11 +543,11 @@ int rdtgroup_mondata_show(struct seq_file *m, void *arg)
 {
 	struct kernfs_open_file *of = m->private;
 	struct rdt_domain_hdr *hdr;
+	struct rdt_mon_domain *d;
 	u32 resid, evtid, domid;
 	struct rdtgroup *rdtgrp;
 	struct rdt_resource *r;
 	union mon_data_bits md;
-	struct rdt_domain *d;
 	struct rmid_read rr;
 	int ret = 0;
 
@@ -568,7 +568,7 @@ int rdtgroup_mondata_show(struct seq_file *m, void *arg)
 		ret = -ENOENT;
 		goto out;
 	}
-	d = container_of(hdr, struct rdt_domain, hdr);
+	d = container_of(hdr, struct rdt_mon_domain, hdr);
 
 	mon_event_read(&rr, r, d, rdtgrp, evtid, false);
 
diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c
index ec5ad926c5dc..4e145f5620b0 100644
--- a/arch/x86/kernel/cpu/resctrl/monitor.c
+++ b/arch/x86/kernel/cpu/resctrl/monitor.c
@@ -170,7 +170,7 @@ static int __rmid_read(u32 rmid, enum resctrl_event_id eventid, u64 *val)
 	return 0;
 }
 
-static struct arch_mbm_state *get_arch_mbm_state(struct rdt_hw_domain *hw_dom,
+static struct arch_mbm_state *get_arch_mbm_state(struct rdt_hw_mon_domain *hw_dom,
 						 u32 rmid,
 						 enum resctrl_event_id eventid)
 {
@@ -189,10 +189,10 @@ static struct arch_mbm_state *get_arch_mbm_state(struct rdt_hw_domain *hw_dom,
 	return NULL;
 }
 
-void resctrl_arch_reset_rmid(struct rdt_resource *r, struct rdt_domain *d,
+void resctrl_arch_reset_rmid(struct rdt_resource *r, struct rdt_mon_domain *d,
 			     u32 rmid, enum resctrl_event_id eventid)
 {
-	struct rdt_hw_domain *hw_dom = resctrl_to_arch_dom(d);
+	struct rdt_hw_mon_domain *hw_dom = resctrl_to_arch_mon_dom(d);
 	struct arch_mbm_state *am;
 
 	am = get_arch_mbm_state(hw_dom, rmid, eventid);
@@ -208,9 +208,9 @@ void resctrl_arch_reset_rmid(struct rdt_resource *r, struct rdt_domain *d,
  * Assumes that hardware counters are also reset and thus that there is
  * no need to record initial non-zero counts.
  */
-void resctrl_arch_reset_rmid_all(struct rdt_resource *r, struct rdt_domain *d)
+void resctrl_arch_reset_rmid_all(struct rdt_resource *r, struct rdt_mon_domain *d)
 {
-	struct rdt_hw_domain *hw_dom = resctrl_to_arch_dom(d);
+	struct rdt_hw_mon_domain *hw_dom = resctrl_to_arch_mon_dom(d);
 
 	if (is_mbm_total_enabled())
 		memset(hw_dom->arch_mbm_total, 0,
@@ -229,11 +229,11 @@ static u64 mbm_overflow_count(u64 prev_msr, u64 cur_msr, unsigned int width)
 	return chunks >> shift;
 }
 
-int resctrl_arch_rmid_read(struct rdt_resource *r, struct rdt_domain *d,
+int resctrl_arch_rmid_read(struct rdt_resource *r, struct rdt_mon_domain *d,
 			   u32 rmid, enum resctrl_event_id eventid, u64 *val)
 {
+	struct rdt_hw_mon_domain *hw_dom = resctrl_to_arch_mon_dom(d);
 	struct rdt_hw_resource *hw_res = resctrl_to_arch_res(r);
-	struct rdt_hw_domain *hw_dom = resctrl_to_arch_dom(d);
 	struct arch_mbm_state *am;
 	u64 msr_val, chunks;
 	int ret;
@@ -266,7 +266,7 @@ int resctrl_arch_rmid_read(struct rdt_resource *r, struct rdt_domain *d,
  * decrement the count. If the busy count gets to zero on an RMID, we
  * free the RMID
  */
-void __check_limbo(struct rdt_domain *d, bool force_free)
+void __check_limbo(struct rdt_mon_domain *d, bool force_free)
 {
 	struct rdt_resource *r = &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl;
 	struct rmid_entry *entry;
@@ -305,7 +305,7 @@ void __check_limbo(struct rdt_domain *d, bool force_free)
 	}
 }
 
-bool has_busy_rmid(struct rdt_resource *r, struct rdt_domain *d)
+bool has_busy_rmid(struct rdt_resource *r, struct rdt_mon_domain *d)
 {
 	return find_first_bit(d->rmid_busy_llc, r->num_rmid) != r->num_rmid;
 }
@@ -334,7 +334,7 @@ int alloc_rmid(void)
 static void add_rmid_to_limbo(struct rmid_entry *entry)
 {
 	struct rdt_resource *r = &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl;
-	struct rdt_domain *d;
+	struct rdt_mon_domain *d;
 	int cpu, err;
 	u64 val = 0;
 
@@ -383,7 +383,7 @@ void free_rmid(u32 rmid)
 		list_add_tail(&entry->list, &rmid_free_lru);
 }
 
-static struct mbm_state *get_mbm_state(struct rdt_domain *d, u32 rmid,
+static struct mbm_state *get_mbm_state(struct rdt_mon_domain *d, u32 rmid,
 				       enum resctrl_event_id evtid)
 {
 	switch (evtid) {
@@ -516,13 +516,13 @@ void mon_event_count(void *info)
  * throttle MSRs already have low percentage values.  To avoid
  * unnecessarily restricting such rdtgroups, we also increase the bandwidth.
  */
-static void update_mba_bw(struct rdtgroup *rgrp, struct rdt_domain *dom_mbm)
+static void update_mba_bw(struct rdtgroup *rgrp, struct rdt_mon_domain *dom_mbm)
 {
 	u32 closid, rmid, cur_msr_val, new_msr_val;
 	struct mbm_state *pmbm_data, *cmbm_data;
+	struct rdt_ctrl_domain *dom_mba;
 	u32 cur_bw, delta_bw, user_bw;
 	struct rdt_resource *r_mba;
-	struct rdt_domain *dom_mba;
 	struct list_head *head;
 	struct rdtgroup *entry;
 
@@ -600,7 +600,7 @@ static void update_mba_bw(struct rdtgroup *rgrp, struct rdt_domain *dom_mbm)
 	}
 }
 
-static void mbm_update(struct rdt_resource *r, struct rdt_domain *d, int rmid)
+static void mbm_update(struct rdt_resource *r, struct rdt_mon_domain *d, int rmid)
 {
 	struct rmid_read rr;
 
@@ -640,13 +640,13 @@ void cqm_handle_limbo(struct work_struct *work)
 {
 	unsigned long delay = msecs_to_jiffies(CQM_LIMBOCHECK_INTERVAL);
 	int cpu = smp_processor_id();
+	struct rdt_mon_domain *d;
 	struct rdt_resource *r;
-	struct rdt_domain *d;
 
 	mutex_lock(&rdtgroup_mutex);
 
 	r = &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl;
-	d = container_of(work, struct rdt_domain, cqm_limbo.work);
+	d = container_of(work, struct rdt_mon_domain, cqm_limbo.work);
 
 	__check_limbo(d, false);
 
@@ -656,7 +656,7 @@ void cqm_handle_limbo(struct work_struct *work)
 	mutex_unlock(&rdtgroup_mutex);
 }
 
-void cqm_setup_limbo_handler(struct rdt_domain *dom, unsigned long delay_ms)
+void cqm_setup_limbo_handler(struct rdt_mon_domain *dom, unsigned long delay_ms)
 {
 	unsigned long delay = msecs_to_jiffies(delay_ms);
 	int cpu;
@@ -672,9 +672,9 @@ void mbm_handle_overflow(struct work_struct *work)
 	unsigned long delay = msecs_to_jiffies(MBM_OVERFLOW_INTERVAL);
 	struct rdtgroup *prgrp, *crgrp;
 	int cpu = smp_processor_id();
+	struct rdt_mon_domain *d;
 	struct list_head *head;
 	struct rdt_resource *r;
-	struct rdt_domain *d;
 
 	mutex_lock(&rdtgroup_mutex);
 
@@ -682,7 +682,7 @@ void mbm_handle_overflow(struct work_struct *work)
 		goto out_unlock;
 
 	r = &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl;
-	d = container_of(work, struct rdt_domain, mbm_over.work);
+	d = container_of(work, struct rdt_mon_domain, mbm_over.work);
 
 	list_for_each_entry(prgrp, &rdt_all_groups, rdtgroup_list) {
 		mbm_update(r, d, prgrp->mon.rmid);
@@ -701,7 +701,7 @@ void mbm_handle_overflow(struct work_struct *work)
 	mutex_unlock(&rdtgroup_mutex);
 }
 
-void mbm_setup_overflow_handler(struct rdt_domain *dom, unsigned long delay_ms)
+void mbm_setup_overflow_handler(struct rdt_mon_domain *dom, unsigned long delay_ms)
 {
 	unsigned long delay = msecs_to_jiffies(delay_ms);
 	int cpu;
diff --git a/arch/x86/kernel/cpu/resctrl/pseudo_lock.c b/arch/x86/kernel/cpu/resctrl/pseudo_lock.c
index ed6d59af1cef..08d35f828bc3 100644
--- a/arch/x86/kernel/cpu/resctrl/pseudo_lock.c
+++ b/arch/x86/kernel/cpu/resctrl/pseudo_lock.c
@@ -814,7 +814,7 @@ int rdtgroup_locksetup_exit(struct rdtgroup *rdtgrp)
  * Return: true if @cbm overlaps with pseudo-locked region on @d, false
  * otherwise.
  */
-bool rdtgroup_cbm_overlaps_pseudo_locked(struct rdt_domain *d, unsigned long cbm)
+bool rdtgroup_cbm_overlaps_pseudo_locked(struct rdt_ctrl_domain *d, unsigned long cbm)
 {
 	unsigned int cbm_len;
 	unsigned long cbm_b;
@@ -841,11 +841,11 @@ bool rdtgroup_cbm_overlaps_pseudo_locked(struct rdt_domain *d, unsigned long cbm
  *         if it is not possible to test due to memory allocation issue,
  *         false otherwise.
  */
-bool rdtgroup_pseudo_locked_in_hierarchy(struct rdt_domain *d)
+bool rdtgroup_pseudo_locked_in_hierarchy(struct rdt_ctrl_domain *d)
 {
+	struct rdt_ctrl_domain *d_i;
 	cpumask_var_t cpu_with_psl;
 	struct rdt_resource *r;
-	struct rdt_domain *d_i;
 	bool ret = false;
 
 	if (!zalloc_cpumask_var(&cpu_with_psl, GFP_KERNEL))
diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
index 760013ed1bff..21bbd832f3f2 100644
--- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
+++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
@@ -85,8 +85,8 @@ void rdt_last_cmd_printf(const char *fmt, ...)
 
 void rdt_staged_configs_clear(void)
 {
+	struct rdt_ctrl_domain *dom;
 	struct rdt_resource *r;
-	struct rdt_domain *dom;
 
 	lockdep_assert_held(&rdtgroup_mutex);
 
@@ -976,7 +976,7 @@ static int rdt_bit_usage_show(struct kernfs_open_file *of,
 	unsigned long sw_shareable = 0, hw_shareable = 0;
 	unsigned long exclusive = 0, pseudo_locked = 0;
 	struct rdt_resource *r = s->res;
-	struct rdt_domain *dom;
+	struct rdt_ctrl_domain *dom;
 	int i, hwb, swb, excl, psl;
 	enum rdtgrp_mode mode;
 	bool sep = false;
@@ -1205,7 +1205,7 @@ static int rdt_has_sparse_bitmasks_show(struct kernfs_open_file *of,
  *
  * Return: false if CBM does not overlap, true if it does.
  */
-static bool __rdtgroup_cbm_overlaps(struct rdt_resource *r, struct rdt_domain *d,
+static bool __rdtgroup_cbm_overlaps(struct rdt_resource *r, struct rdt_ctrl_domain *d,
 				    unsigned long cbm, int closid,
 				    enum resctrl_conf_type type, bool exclusive)
 {
@@ -1260,7 +1260,7 @@ static bool __rdtgroup_cbm_overlaps(struct rdt_resource *r, struct rdt_domain *d
  *
  * Return: true if CBM overlap detected, false if there is no overlap
  */
-bool rdtgroup_cbm_overlaps(struct resctrl_schema *s, struct rdt_domain *d,
+bool rdtgroup_cbm_overlaps(struct resctrl_schema *s, struct rdt_ctrl_domain *d,
 			   unsigned long cbm, int closid, bool exclusive)
 {
 	enum resctrl_conf_type peer_type = resctrl_peer_type(s->conf_type);
@@ -1291,10 +1291,10 @@ bool rdtgroup_cbm_overlaps(struct resctrl_schema *s, struct rdt_domain *d,
 static bool rdtgroup_mode_test_exclusive(struct rdtgroup *rdtgrp)
 {
 	int closid = rdtgrp->closid;
+	struct rdt_ctrl_domain *d;
 	struct resctrl_schema *s;
 	struct rdt_resource *r;
 	bool has_cache = false;
-	struct rdt_domain *d;
 	u32 ctrl;
 
 	list_for_each_entry(s, &resctrl_schema_all, list) {
@@ -1407,7 +1407,7 @@ static ssize_t rdtgroup_mode_write(struct kernfs_open_file *of,
  * bitmap functions work correctly.
  */
 unsigned int rdtgroup_cbm_to_size(struct rdt_resource *r,
-				  struct rdt_domain *d, unsigned long cbm)
+				  struct rdt_ctrl_domain *d, unsigned long cbm)
 {
 	struct cpu_cacheinfo *ci;
 	unsigned int size = 0;
@@ -1439,9 +1439,9 @@ static int rdtgroup_size_show(struct kernfs_open_file *of,
 {
 	struct resctrl_schema *schema;
 	enum resctrl_conf_type type;
+	struct rdt_ctrl_domain *d;
 	struct rdtgroup *rdtgrp;
 	struct rdt_resource *r;
-	struct rdt_domain *d;
 	unsigned int size;
 	int ret = 0;
 	u32 closid;
@@ -1553,7 +1553,7 @@ static void mon_event_config_read(void *info)
 	mon_info->mon_config = msrval & MAX_EVT_CONFIG_BITS;
 }
 
-static void mondata_config_read(struct rdt_domain *d, struct mon_config_info *mon_info)
+static void mondata_config_read(struct rdt_mon_domain *d, struct mon_config_info *mon_info)
 {
 	smp_call_function_any(&d->hdr.cpu_mask, mon_event_config_read, mon_info, 1);
 }
@@ -1561,7 +1561,7 @@ static void mondata_config_read(struct rdt_domain *d, struct mon_config_info *mo
 static int mbm_config_show(struct seq_file *s, struct rdt_resource *r, u32 evtid)
 {
 	struct mon_config_info mon_info = {0};
-	struct rdt_domain *dom;
+	struct rdt_mon_domain *dom;
 	bool sep = false;
 
 	mutex_lock(&rdtgroup_mutex);
@@ -1618,7 +1618,7 @@ static void mon_event_config_write(void *info)
 }
 
 static int mbm_config_write_domain(struct rdt_resource *r,
-				   struct rdt_domain *d, u32 evtid, u32 val)
+				   struct rdt_mon_domain *d, u32 evtid, u32 val)
 {
 	struct mon_config_info mon_info = {0};
 	int ret = 0;
@@ -1668,7 +1668,7 @@ static int mon_config_write(struct rdt_resource *r, char *tok, u32 evtid)
 {
 	char *dom_str = NULL, *id_str;
 	unsigned long dom_id, val;
-	struct rdt_domain *d;
+	struct rdt_mon_domain *d;
 	int ret = 0;
 
 next:
@@ -2216,9 +2216,9 @@ static inline bool is_mba_linear(void)
 static int set_cache_qos_cfg(int level, bool enable)
 {
 	void (*update)(void *arg);
+	struct rdt_ctrl_domain *d;
 	struct rdt_resource *r_l;
 	cpumask_var_t cpu_mask;
-	struct rdt_domain *d;
 	int cpu;
 
 	if (level == RDT_RESOURCE_L3)
@@ -2265,7 +2265,7 @@ void rdt_domain_reconfigure_cdp(struct rdt_resource *r)
 		l3_qos_cfg_update(&hw_res->cdp_enabled);
 }
 
-static int mba_sc_domain_allocate(struct rdt_resource *r, struct rdt_domain *d)
+static int mba_sc_domain_allocate(struct rdt_resource *r, struct rdt_ctrl_domain *d)
 {
 	u32 num_closid = resctrl_arch_get_num_closid(r);
 	int cpu = cpumask_any(&d->hdr.cpu_mask);
@@ -2283,7 +2283,7 @@ static int mba_sc_domain_allocate(struct rdt_resource *r, struct rdt_domain *d)
 }
 
 static void mba_sc_domain_destroy(struct rdt_resource *r,
-				  struct rdt_domain *d)
+				  struct rdt_ctrl_domain *d)
 {
 	kfree(d->mbps_val);
 	d->mbps_val = NULL;
@@ -2309,7 +2309,7 @@ static int set_mba_sc(bool mba_sc)
 {
 	struct rdt_resource *r = &rdt_resources_all[RDT_RESOURCE_MBA].r_resctrl;
 	u32 num_closid = resctrl_arch_get_num_closid(r);
-	struct rdt_domain *d;
+	struct rdt_ctrl_domain *d;
 	int i;
 
 	if (!supports_mba_mbps() || mba_sc == is_mba_sc(r))
@@ -2578,7 +2578,7 @@ static int rdt_get_tree(struct fs_context *fc)
 {
 	struct rdt_fs_context *ctx = rdt_fc2context(fc);
 	unsigned long flags = RFTYPE_CTRL_BASE;
-	struct rdt_domain *dom;
+	struct rdt_mon_domain *dom;
 	struct rdt_resource *r;
 	int ret;
 
@@ -2762,10 +2762,10 @@ static int rdt_init_fs_context(struct fs_context *fc)
 static int reset_all_ctrls(struct rdt_resource *r)
 {
 	struct rdt_hw_resource *hw_res = resctrl_to_arch_res(r);
-	struct rdt_hw_domain *hw_dom;
+	struct rdt_hw_ctrl_domain *hw_dom;
 	struct msr_param msr_param;
+	struct rdt_ctrl_domain *d;
 	cpumask_var_t cpu_mask;
-	struct rdt_domain *d;
 	int i;
 
 	if (!zalloc_cpumask_var(&cpu_mask, GFP_KERNEL))
@@ -2781,7 +2781,7 @@ static int reset_all_ctrls(struct rdt_resource *r)
 	 * from each domain to update the MSRs below.
 	 */
 	list_for_each_entry(d, &r->ctrl_domains, hdr.list) {
-		hw_dom = resctrl_to_arch_dom(d);
+		hw_dom = resctrl_to_arch_ctrl_dom(d);
 		cpumask_set_cpu(cpumask_any(&d->hdr.cpu_mask), cpu_mask);
 
 		for (i = 0; i < hw_res->num_closid; i++)
@@ -2976,7 +2976,7 @@ static void rmdir_mondata_subdir_allrdtgrp(struct rdt_resource *r,
 }
 
 static int mkdir_mondata_subdir(struct kernfs_node *parent_kn,
-				struct rdt_domain *d,
+				struct rdt_mon_domain *d,
 				struct rdt_resource *r, struct rdtgroup *prgrp)
 {
 	union mon_data_bits priv;
@@ -3025,7 +3025,7 @@ static int mkdir_mondata_subdir(struct kernfs_node *parent_kn,
  * and "monitor" groups with given domain id.
  */
 static void mkdir_mondata_subdir_allrdtgrp(struct rdt_resource *r,
-					   struct rdt_domain *d)
+					   struct rdt_mon_domain *d)
 {
 	struct kernfs_node *parent_kn;
 	struct rdtgroup *prgrp, *crgrp;
@@ -3047,7 +3047,7 @@ static int mkdir_mondata_subdir_alldom(struct kernfs_node *parent_kn,
 				       struct rdt_resource *r,
 				       struct rdtgroup *prgrp)
 {
-	struct rdt_domain *dom;
+	struct rdt_mon_domain *dom;
 	int ret;
 
 	list_for_each_entry(dom, &r->mon_domains, hdr.list) {
@@ -3149,7 +3149,7 @@ static u32 cbm_ensure_valid(u32 _val, struct rdt_resource *r)
  * Set the RDT domain up to start off with all usable allocations. That is,
  * all shareable and unused bits. All-zero CBM is invalid.
  */
-static int __init_one_rdt_domain(struct rdt_domain *d, struct resctrl_schema *s,
+static int __init_one_rdt_domain(struct rdt_ctrl_domain *d, struct resctrl_schema *s,
 				 u32 closid)
 {
 	enum resctrl_conf_type peer_type = resctrl_peer_type(s->conf_type);
@@ -3229,7 +3229,7 @@ static int __init_one_rdt_domain(struct rdt_domain *d, struct resctrl_schema *s,
  */
 static int rdtgroup_init_cat(struct resctrl_schema *s, u32 closid)
 {
-	struct rdt_domain *d;
+	struct rdt_ctrl_domain *d;
 	int ret;
 
 	list_for_each_entry(d, &s->res->ctrl_domains, hdr.list) {
@@ -3245,7 +3245,7 @@ static int rdtgroup_init_cat(struct resctrl_schema *s, u32 closid)
 static void rdtgroup_init_mba(struct rdt_resource *r, u32 closid)
 {
 	struct resctrl_staged_config *cfg;
-	struct rdt_domain *d;
+	struct rdt_ctrl_domain *d;
 
 	list_for_each_entry(d, &r->ctrl_domains, hdr.list) {
 		if (is_mba_sc(r)) {
@@ -3842,14 +3842,14 @@ static void __init rdtgroup_setup_default(void)
 	mutex_unlock(&rdtgroup_mutex);
 }
 
-static void domain_destroy_mon_state(struct rdt_domain *d)
+static void domain_destroy_mon_state(struct rdt_mon_domain *d)
 {
 	bitmap_free(d->rmid_busy_llc);
 	kfree(d->mbm_total);
 	kfree(d->mbm_local);
 }
 
-void resctrl_offline_ctrl_domain(struct rdt_resource *r, struct rdt_domain *d)
+void resctrl_offline_ctrl_domain(struct rdt_resource *r, struct rdt_ctrl_domain *d)
 {
 	lockdep_assert_held(&rdtgroup_mutex);
 
@@ -3857,7 +3857,7 @@ void resctrl_offline_ctrl_domain(struct rdt_resource *r, struct rdt_domain *d)
 		mba_sc_domain_destroy(r, d);
 }
 
-void resctrl_offline_mon_domain(struct rdt_resource *r, struct rdt_domain *d)
+void resctrl_offline_mon_domain(struct rdt_resource *r, struct rdt_mon_domain *d)
 {
 	lockdep_assert_held(&rdtgroup_mutex);
 
@@ -3886,7 +3886,7 @@ void resctrl_offline_mon_domain(struct rdt_resource *r, struct rdt_domain *d)
 	domain_destroy_mon_state(d);
 }
 
-static int domain_setup_mon_state(struct rdt_resource *r, struct rdt_domain *d)
+static int domain_setup_mon_state(struct rdt_resource *r, struct rdt_mon_domain *d)
 {
 	size_t tsize;
 
@@ -3916,7 +3916,7 @@ static int domain_setup_mon_state(struct rdt_resource *r, struct rdt_domain *d)
 	return 0;
 }
 
-int resctrl_online_ctrl_domain(struct rdt_resource *r, struct rdt_domain *d)
+int resctrl_online_ctrl_domain(struct rdt_resource *r, struct rdt_ctrl_domain *d)
 {
 	lockdep_assert_held(&rdtgroup_mutex);
 
@@ -3926,7 +3926,7 @@ int resctrl_online_ctrl_domain(struct rdt_resource *r, struct rdt_domain *d)
 	return 0;
 }
 
-int resctrl_online_mon_domain(struct rdt_resource *r, struct rdt_domain *d)
+int resctrl_online_mon_domain(struct rdt_resource *r, struct rdt_mon_domain *d)
 {
 	int err;
 
-- 
2.41.0


^ permalink raw reply related	[flat|nested] 331+ messages in thread

* [PATCH v12 5/8] x86/resctrl: Add node-scope to the options for feature scope
  2023-11-30  0:34               ` [PATCH v12 " Tony Luck
                                   ` (3 preceding siblings ...)
  2023-11-30  0:34                 ` [PATCH v12 4/8] x86/resctrl: Split the rdt_domain and rdt_hw_domain structures Tony Luck
@ 2023-11-30  0:34                 ` Tony Luck
  2023-11-30  0:34                 ` [PATCH v12 6/8] x86/resctrl: Introduce snc_nodes_per_l3_cache Tony Luck
                                   ` (3 subsequent siblings)
  8 siblings, 0 replies; 331+ messages in thread
From: Tony Luck @ 2023-11-30  0:34 UTC (permalink / raw)
  To: Fenghua Yu, Reinette Chatre, Peter Newman, Jonathan Corbet,
	Shuah Khan, x86
  Cc: Shaopeng Tan, James Morse, Jamie Iles, Babu Moger, Randy Dunlap,
	linux-kernel, linux-doc, patches, Tony Luck, Shaopeng Tan

All currently supported resctrl features are domain scoped, and in each
case the domain scope matches that of the L2 or L3 caches.

Add RESCTRL_NODE as a new option for features that are scoped at the
same granularity as NUMA nodes. This is needed for Intel's Sub-NUMA
Cluster (SNC) feature where monitoring features are node scoped.

Signed-off-by: Tony Luck <tony.luck@intel.com>
Reviewed-by: Peter Newman <peternewman@google.com>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Reviewed-by: Shaopeng Tan <tan.shaopeng@jp.fujitsu.com>
Tested-by: Shaopeng Tan <tan.shaopeng@jp.fujitsu.com>
---
 include/linux/resctrl.h            | 1 +
 arch/x86/kernel/cpu/resctrl/core.c | 2 ++
 2 files changed, 3 insertions(+)

diff --git a/include/linux/resctrl.h b/include/linux/resctrl.h
index 058a940c3239..b8a3a11b970d 100644
--- a/include/linux/resctrl.h
+++ b/include/linux/resctrl.h
@@ -170,6 +170,7 @@ struct resctrl_schema;
 enum resctrl_scope {
 	RESCTRL_L2_CACHE = 2,
 	RESCTRL_L3_CACHE = 3,
+	RESCTRL_NODE,
 };
 
 /**
diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
index 797cb3bf417a..c9315ce8f7bd 100644
--- a/arch/x86/kernel/cpu/resctrl/core.c
+++ b/arch/x86/kernel/cpu/resctrl/core.c
@@ -502,6 +502,8 @@ static int get_domain_id_from_scope(int cpu, enum resctrl_scope scope)
 	case RESCTRL_L2_CACHE:
 	case RESCTRL_L3_CACHE:
 		return get_cpu_cacheinfo_id(cpu, scope);
+	case RESCTRL_NODE:
+		return cpu_to_node(cpu);
 	default:
 		break;
 	}
-- 
2.41.0


^ permalink raw reply related	[flat|nested] 331+ messages in thread

* [PATCH v12 6/8] x86/resctrl: Introduce snc_nodes_per_l3_cache
  2023-11-30  0:34               ` [PATCH v12 " Tony Luck
                                   ` (4 preceding siblings ...)
  2023-11-30  0:34                 ` [PATCH v12 5/8] x86/resctrl: Add node-scope to the options for feature scope Tony Luck
@ 2023-11-30  0:34                 ` Tony Luck
  2023-11-30  0:34                 ` [PATCH v12 7/8] x86/resctrl: Sub NUMA Cluster detection and enable Tony Luck
                                   ` (2 subsequent siblings)
  8 siblings, 0 replies; 331+ messages in thread
From: Tony Luck @ 2023-11-30  0:34 UTC (permalink / raw)
  To: Fenghua Yu, Reinette Chatre, Peter Newman, Jonathan Corbet,
	Shuah Khan, x86
  Cc: Shaopeng Tan, James Morse, Jamie Iles, Babu Moger, Randy Dunlap,
	linux-kernel, linux-doc, patches, Tony Luck, Shaopeng Tan

Intel Sub-NUMA Cluster (SNC) is a feature that subdivides the CPU cores
and memory controllers on a socket into two or more groups. These are
presented to the operating system as NUMA nodes.

This may enable some workloads to have slightly lower latency to memory
as the memory controller(s) in an SNC node are electrically closer to the
CPU cores on that SNC node. The latency benefit may be offset by lower
bandwidth since the memory accesses for each core can only be interleaved
between the memory controllers on the same SNC node.

Resctrl monitoring on an Intel system depends upon attaching RMIDs to tasks
to track L3 cache occupancy and memory bandwidth. There is an MSR that
controls how the RMIDs are shared between SNC nodes.

The default mode divides them numerically. E.g. when there are two SNC
nodes on a socket, the lower-numbered half of the RMIDs are given to the
first node and the remainder to the second node. This would be difficult
to use with the Linux resctrl interface as specific RMID values assigned
to resctrl groups are not visible to users.

The other mode divides the RMIDs and renumbers the ones on the second
SNC node to start from zero.

Even with this renumbering, SNC mode requires several changes in resctrl
behavior for correct operation.

Add a global integer "snc_nodes_per_l3_cache" that shows how many
SNC nodes share each L3 cache. When "snc_nodes_per_l3_cache" is "1",
SNC mode is either not implemented, or not enabled.

Update all places to take appropriate action when SNC mode is enabled:
1) The number of logical RMIDs per L3 cache available for use is the
   number of physical RMIDs divided by the number of SNC nodes.
2) Likewise the "mon_scale" value must be divided by the number of SNC
   nodes.
3) The RMID renumbering applies when hardware uses the value from the
   IA32_PQR_ASSOC MSR to count accesses by a task. When reading an RMID
   counter, software must adjust from the logical RMID to the physical
   RMID value for the SNC node it wants to read, and load the adjusted
   value into the IA32_QM_EVTSEL MSR (a standalone sketch of this mapping
   follows this list).
4) Divide the L3 cache between the SNC nodes. Divide the value
   reported in the resctrl "size" file by the number of SNC
   nodes because the effective amount of cache that can be allocated
   is reduced by that factor.
5) Disable the "-o mba_MBps" mount option in SNC mode
   because the monitoring is being done per SNC node, while the
   bandwidth allocation is still done at the L3 cache scope.
   Trying to use this feedback loop might result in contradictory
   changes to the throttling level coming from each of the SNC
   node bandwidth measurements.
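
As a standalone illustration of the logical-to-physical RMID mapping in
item (3). The helper name and parameters below are hypothetical, not
kernel APIs; the real change is in __rmid_read() in this patch.

	/*
	 * Map a per-node logical RMID to the physical counter owned by the
	 * SNC node this CPU belongs to. rmids_per_node is the physical RMID
	 * count divided by snc_nodes_per_l3_cache.
	 */
	static unsigned int snc_physical_rmid(unsigned int logical_rmid,
					      unsigned int snc_node_on_socket,
					      unsigned int rmids_per_node)
	{
		return logical_rmid + snc_node_on_socket * rmids_per_node;
	}

For example (numbers purely illustrative): with two SNC nodes and 112
logical RMIDs per node, reading logical RMID 5 on the second SNC node of
a socket means programming physical RMID 117 into IA32_QM_EVTSEL.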

Signed-off-by: Tony Luck <tony.luck@intel.com>
Reviewed-by: Peter Newman <peternewman@google.com>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Reviewed-by: Shaopeng Tan <tan.shaopeng@jp.fujitsu.com>
Tested-by: Shaopeng Tan <tan.shaopeng@jp.fujitsu.com>
---
 arch/x86/kernel/cpu/resctrl/internal.h |  2 ++
 arch/x86/kernel/cpu/resctrl/core.c     |  6 ++++++
 arch/x86/kernel/cpu/resctrl/monitor.c  | 16 +++++++++++++---
 arch/x86/kernel/cpu/resctrl/rdtgroup.c |  5 +++--
 4 files changed, 24 insertions(+), 5 deletions(-)

diff --git a/arch/x86/kernel/cpu/resctrl/internal.h b/arch/x86/kernel/cpu/resctrl/internal.h
index ce3a70657842..e7a75a439c16 100644
--- a/arch/x86/kernel/cpu/resctrl/internal.h
+++ b/arch/x86/kernel/cpu/resctrl/internal.h
@@ -446,6 +446,8 @@ DECLARE_STATIC_KEY_FALSE(rdt_alloc_enable_key);
 
 extern struct dentry *debugfs_resctrl;
 
+extern unsigned int snc_nodes_per_l3_cache;
+
 enum resctrl_res_level {
 	RDT_RESOURCE_L3,
 	RDT_RESOURCE_L2,
diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
index c9315ce8f7bd..cf5aba8a74bf 100644
--- a/arch/x86/kernel/cpu/resctrl/core.c
+++ b/arch/x86/kernel/cpu/resctrl/core.c
@@ -48,6 +48,12 @@ int max_name_width, max_data_width;
  */
 bool rdt_alloc_capable;
 
+/*
+ * Number of SNC nodes that share each L3 cache.  Default is 1 for
+ * systems that do not support SNC, or have SNC disabled.
+ */
+unsigned int snc_nodes_per_l3_cache = 1;
+
 static void
 mba_wrmsr_intel(struct rdt_ctrl_domain *d, struct msr_param *m,
 		struct rdt_resource *r);
diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c
index 4e145f5620b0..30b7c3b9b517 100644
--- a/arch/x86/kernel/cpu/resctrl/monitor.c
+++ b/arch/x86/kernel/cpu/resctrl/monitor.c
@@ -148,8 +148,18 @@ static inline struct rmid_entry *__rmid_entry(u32 rmid)
 
 static int __rmid_read(u32 rmid, enum resctrl_event_id eventid, u64 *val)
 {
+	struct rdt_resource *r = &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl;
+	int cpu = smp_processor_id();
+	int rmid_offset = 0;
 	u64 msr_val;
 
+	/*
+	 * When SNC mode is on, need to compute the offset to read the
+	 * physical RMID counter for the node to which this CPU belongs.
+	 */
+	if (snc_nodes_per_l3_cache > 1)
+		rmid_offset = (cpu_to_node(cpu) % snc_nodes_per_l3_cache) * r->num_rmid;
+
 	/*
 	 * As per the SDM, when IA32_QM_EVTSEL.EvtID (bits 7:0) is configured
 	 * with a valid event code for supported resource type and the bits
@@ -158,7 +168,7 @@ static int __rmid_read(u32 rmid, enum resctrl_event_id eventid, u64 *val)
 	 * IA32_QM_CTR.Error (bit 63) and IA32_QM_CTR.Unavailable (bit 62)
 	 * are error bits.
 	 */
-	wrmsr(MSR_IA32_QM_EVTSEL, eventid, rmid);
+	wrmsr(MSR_IA32_QM_EVTSEL, eventid, rmid + rmid_offset);
 	rdmsrl(MSR_IA32_QM_CTR, msr_val);
 
 	if (msr_val & RMID_VAL_ERROR)
@@ -783,8 +793,8 @@ int __init rdt_get_mon_l3_config(struct rdt_resource *r)
 	int ret;
 
 	resctrl_rmid_realloc_limit = boot_cpu_data.x86_cache_size * 1024;
-	hw_res->mon_scale = boot_cpu_data.x86_cache_occ_scale;
-	r->num_rmid = boot_cpu_data.x86_cache_max_rmid + 1;
+	hw_res->mon_scale = boot_cpu_data.x86_cache_occ_scale / snc_nodes_per_l3_cache;
+	r->num_rmid = (boot_cpu_data.x86_cache_max_rmid + 1) / snc_nodes_per_l3_cache;
 	hw_res->mbm_width = MBM_CNTR_WIDTH_BASE;
 
 	if (mbm_offset > 0 && mbm_offset <= MBM_CNTR_WIDTH_OFFSET_MAX)
diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
index 21bbd832f3f2..79d57dade568 100644
--- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
+++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
@@ -1425,7 +1425,7 @@ unsigned int rdtgroup_cbm_to_size(struct rdt_resource *r,
 		}
 	}
 
-	return size;
+	return size / snc_nodes_per_l3_cache;
 }
 
 /*
@@ -2298,7 +2298,8 @@ static bool supports_mba_mbps(void)
 	struct rdt_resource *r = &rdt_resources_all[RDT_RESOURCE_MBA].r_resctrl;
 
 	return (is_mbm_local_enabled() &&
-		r->alloc_capable && is_mba_linear());
+		r->alloc_capable && is_mba_linear() &&
+		snc_nodes_per_l3_cache == 1);
 }
 
 /*
-- 
2.41.0


^ permalink raw reply related	[flat|nested] 331+ messages in thread

* [PATCH v12 7/8] x86/resctrl: Sub NUMA Cluster detection and enable
  2023-11-30  0:34               ` [PATCH v12 " Tony Luck
                                   ` (5 preceding siblings ...)
  2023-11-30  0:34                 ` [PATCH v12 6/8] x86/resctrl: Introduce snc_nodes_per_l3_cache Tony Luck
@ 2023-11-30  0:34                 ` Tony Luck
  2023-11-30 18:02                   ` Fam Zheng
  2023-11-30  0:34                 ` [PATCH v12 8/8] x86/resctrl: Update documentation with Sub-NUMA cluster changes Tony Luck
  2023-12-04 18:53                 ` [PATCH v13 0/8] Add support for Sub-NUMA cluster (SNC) systems Tony Luck
  8 siblings, 1 reply; 331+ messages in thread
From: Tony Luck @ 2023-11-30  0:34 UTC (permalink / raw)
  To: Fenghua Yu, Reinette Chatre, Peter Newman, Jonathan Corbet,
	Shuah Khan, x86
  Cc: Shaopeng Tan, James Morse, Jamie Iles, Babu Moger, Randy Dunlap,
	linux-kernel, linux-doc, patches, Tony Luck, Shaopeng Tan

There isn't a simple hardware bit that indicates whether a CPU is
running in Sub NUMA Cluster (SNC) mode. Infer the state by comparing
the ratio of NUMA nodes to L3 cache instances.

When SNC mode is detected, reconfigure the RMID counters by updating
the MSR_RMID_SNC_CONFIG MSR on each socket as CPUs are seen.

Clearing bit zero of the MSR divides the RMIDs and renumbers the ones
on the second SNC node to start from zero.
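
As an illustrative example (not taken from a specific platform): a
two-socket system with SNC2 enabled presents four CPU-bearing NUMA nodes
but only two L3 cache instances, so the inferred ratio is two SNC nodes
per L3 cache. With SNC disabled the same system presents two nodes and
two caches, and a ratio of one leaves resctrl in its default configuration.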

Signed-off-by: Tony Luck <tony.luck@intel.com>
Reviewed-by: Peter Newman <peternewman@google.com>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Reviewed-by: Shaopeng Tan <tan.shaopeng@jp.fujitsu.com>
Tested-by: Shaopeng Tan <tan.shaopeng@jp.fujitsu.com>
---
 arch/x86/include/asm/msr-index.h   |  1 +
 arch/x86/kernel/cpu/resctrl/core.c | 96 ++++++++++++++++++++++++++++++
 2 files changed, 97 insertions(+)

diff --git a/arch/x86/include/asm/msr-index.h b/arch/x86/include/asm/msr-index.h
index 1d51e1850ed0..94d29d81e6db 100644
--- a/arch/x86/include/asm/msr-index.h
+++ b/arch/x86/include/asm/msr-index.h
@@ -1111,6 +1111,7 @@
 #define MSR_IA32_QM_CTR			0xc8e
 #define MSR_IA32_PQR_ASSOC		0xc8f
 #define MSR_IA32_L3_CBM_BASE		0xc90
+#define MSR_RMID_SNC_CONFIG		0xca0
 #define MSR_IA32_L2_CBM_BASE		0xd10
 #define MSR_IA32_MBA_THRTL_BASE		0xd50
 
diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
index cf5aba8a74bf..3293ab4c58b0 100644
--- a/arch/x86/kernel/cpu/resctrl/core.c
+++ b/arch/x86/kernel/cpu/resctrl/core.c
@@ -16,11 +16,14 @@
 
 #define pr_fmt(fmt)	"resctrl: " fmt
 
+#include <linux/cpu.h>
 #include <linux/slab.h>
 #include <linux/err.h>
 #include <linux/cacheinfo.h>
 #include <linux/cpuhotplug.h>
+#include <linux/mod_devicetable.h>
 
+#include <asm/cpu_device_id.h>
 #include <asm/intel-family.h>
 #include <asm/resctrl.h>
 #include "internal.h"
@@ -740,11 +743,42 @@ static void clear_closid_rmid(int cpu)
 	wrmsr(MSR_IA32_PQR_ASSOC, 0, 0);
 }
 
+/*
+ * The power-on reset value of MSR_RMID_SNC_CONFIG is 0x1
+ * which indicates that RMIDs are configured in legacy mode.
+ * This mode is incompatible with Linux resctrl semantics
+ * as RMIDs are partitioned between SNC nodes, which requires
+ * a user to know which RMID is allocated to a task.
+ * Clearing bit 0 reconfigures the RMID counters for use
+ * in Sub NUMA Cluster mode. This mode is better for Linux.
+ * The RMID space is divided between all SNC nodes with the
+ * RMIDs renumbered to start from zero in each node when
+ * counting operations from tasks. Code to read the counters
+ * must adjust RMID counter numbers based on SNC node. See
+ * __rmid_read() for code that does this.
+ */
+static void snc_remap_rmids(int cpu)
+{
+	u64 val;
+
+	/* Only need to enable once per package. */
+	if (cpumask_first(topology_core_cpumask(cpu)) != cpu)
+		return;
+
+	rdmsrl(MSR_RMID_SNC_CONFIG, val);
+	val &= ~BIT_ULL(0);
+	wrmsrl(MSR_RMID_SNC_CONFIG, val);
+}
+
 static int resctrl_online_cpu(unsigned int cpu)
 {
 	struct rdt_resource *r;
 
 	mutex_lock(&rdtgroup_mutex);
+
+	if (snc_nodes_per_l3_cache > 1)
+		snc_remap_rmids(cpu);
+
 	for_each_capable_rdt_resource(r)
 		domain_add_cpu(cpu, r);
 	/* The cpu is set in default rdtgroup after online. */
@@ -999,11 +1033,73 @@ static __init bool get_rdt_resources(void)
 	return (rdt_mon_capable || rdt_alloc_capable);
 }
 
+/* CPU models that support MSR_RMID_SNC_CONFIG */
+static const struct x86_cpu_id snc_cpu_ids[] __initconst = {
+	X86_MATCH_INTEL_FAM6_MODEL(ICELAKE_X, 0),
+	X86_MATCH_INTEL_FAM6_MODEL(SAPPHIRERAPIDS_X, 0),
+	X86_MATCH_INTEL_FAM6_MODEL(EMERALDRAPIDS_X, 0),
+	X86_MATCH_INTEL_FAM6_MODEL(GRANITERAPIDS_X, 0),
+	{}
+};
+
+/*
+ * There isn't a simple hardware bit that indicates whether a CPU is running
+ * in Sub NUMA Cluster (SNC) mode. Infer the state by comparing the
+ * ratio of NUMA nodes to L3 cache instances.
+ * It is not possible to accurately determine SNC state if the system is
+ * booted with a maxcpus=N parameter. That distorts the ratio of SNC nodes
+ * to L3 caches. It will be OK if the system is booted with hyperthreading
+ * disabled (since this doesn't affect the ratio).
+ */
+static __init int snc_get_config(void)
+{
+	unsigned long *node_caches;
+	int mem_only_nodes = 0;
+	int cpu, node, ret;
+	int num_l3_caches;
+
+	if (!x86_match_cpu(snc_cpu_ids))
+		return 1;
+
+	node_caches = bitmap_zalloc(nr_node_ids, GFP_KERNEL);
+	if (!node_caches)
+		return 1;
+
+	cpus_read_lock();
+
+	if (num_online_cpus() != num_present_cpus())
+		pr_warn("Some CPUs offline, SNC detection may be incorrect\n");
+
+	for_each_node(node) {
+		cpu = cpumask_first(cpumask_of_node(node));
+		if (cpu < nr_cpu_ids)
+			set_bit(get_cpu_cacheinfo_id(cpu, 3), node_caches);
+		else
+			mem_only_nodes++;
+	}
+	cpus_read_unlock();
+
+	num_l3_caches = bitmap_weight(node_caches, nr_node_ids);
+	kfree(node_caches);
+
+	if (!num_l3_caches)
+		return 1;
+
+	ret = (nr_node_ids - mem_only_nodes) / num_l3_caches;
+
+	if (ret > 1)
+		rdt_resources_all[RDT_RESOURCE_L3].r_resctrl.mon_scope = RESCTRL_NODE;
+
+	return ret;
+}
+
 static __init void rdt_init_res_defs_intel(void)
 {
 	struct rdt_hw_resource *hw_res;
 	struct rdt_resource *r;
 
+	snc_nodes_per_l3_cache = snc_get_config();
+
 	for_each_rdt_resource(r) {
 		hw_res = resctrl_to_arch_res(r);
 
-- 
2.41.0


^ permalink raw reply related	[flat|nested] 331+ messages in thread

* [PATCH v12 8/8] x86/resctrl: Update documentation with Sub-NUMA cluster changes
  2023-11-30  0:34               ` [PATCH v12 " Tony Luck
                                   ` (6 preceding siblings ...)
  2023-11-30  0:34                 ` [PATCH v12 7/8] x86/resctrl: Sub NUMA Cluster detection and enable Tony Luck
@ 2023-11-30  0:34                 ` Tony Luck
  2023-12-04 16:55                   ` Moger, Babu
  2023-12-04 18:53                 ` [PATCH v13 0/8] Add support for Sub-NUMA cluster (SNC) systems Tony Luck
  8 siblings, 1 reply; 331+ messages in thread
From: Tony Luck @ 2023-11-30  0:34 UTC (permalink / raw)
  To: Fenghua Yu, Reinette Chatre, Peter Newman, Jonathan Corbet,
	Shuah Khan, x86
  Cc: Shaopeng Tan, James Morse, Jamie Iles, Babu Moger, Randy Dunlap,
	linux-kernel, linux-doc, patches, Tony Luck, Shaopeng Tan

With Sub-NUMA Cluster mode enabled the scope of monitoring resources is
per-NODE instead of per-L3 cache. Suffixes of directories with "L3" in
their name refer to Sub-NUMA nodes instead of L3 cache ids.

Users should be aware that SNC mode also affects the amount of L3 cache
available for allocation within each SNC node.
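
For example (sizes purely illustrative): on a socket with a 60 MB L3 cache
and two SNC nodes, a group whose schemata permits the full cache bitmask
will report 30 MB in its "size" file, and a mask covering half of the ways
will report 15 MB.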

Signed-off-by: Tony Luck <tony.luck@intel.com>
Reviewed-by: Peter Newman <peternewman@google.com>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Reviewed-by: Shaopeng Tan <tan.shaopeng@jp.fujitsu.com>
Tested-by: Shaopeng Tan <tan.shaopeng@jp.fujitsu.com>
---
 Documentation/arch/x86/resctrl.rst | 25 +++++++++++++++++++++----
 1 file changed, 21 insertions(+), 4 deletions(-)

diff --git a/Documentation/arch/x86/resctrl.rst b/Documentation/arch/x86/resctrl.rst
index a6279df64a9d..49ff789db1d8 100644
--- a/Documentation/arch/x86/resctrl.rst
+++ b/Documentation/arch/x86/resctrl.rst
@@ -366,10 +366,10 @@ When control is enabled all CTRL_MON groups will also contain:
 When monitoring is enabled all MON groups will also contain:
 
 "mon_data":
-	This contains a set of files organized by L3 domain and by
-	RDT event. E.g. on a system with two L3 domains there will
-	be subdirectories "mon_L3_00" and "mon_L3_01".	Each of these
-	directories have one file per event (e.g. "llc_occupancy",
+	This contains a set of files organized by L3 domain or by NUMA
+	node (depending on whether Sub-NUMA Cluster (SNC) mode is disabled
+	or enabled respectively) and by RDT event.  Each of these
+	directories has one file per event (e.g. "llc_occupancy",
 	"mbm_total_bytes", and "mbm_local_bytes"). In a MON group these
 	files provide a read out of the current value of the event for
 	all tasks in the group. In CTRL_MON groups these files provide
@@ -478,6 +478,23 @@ if non-contiguous 1s value is supported. On a system with a 20-bit mask
 each bit represents 5% of the capacity of the cache. You could partition
 the cache into four equal parts with masks: 0x1f, 0x3e0, 0x7c00, 0xf8000.
 
+Notes on Sub-NUMA Cluster mode
+==============================
+When SNC mode is enabled Linux may load balance tasks between Sub-NUMA
+nodes much more readily than between regular NUMA nodes since the CPUs
+on Sub-NUMA nodes share the same L3 cache and the system may report
+the NUMA distance between Sub-NUMA nodes with a lower value than used
+for regular NUMA nodes.  Users who do not bind tasks to the CPUs of a
+specific Sub-NUMA node must read the "llc_occupancy", "mbm_total_bytes",
+and "mbm_local_bytes" for all Sub-NUMA nodes where the tasks may execute
+to get the full view of traffic for which the tasks were the source.
+
+The cache allocation feature still provides the same number of
+bits in a mask to control allocation into the L3 cache, but each
+of those ways has its capacity reduced because the cache is divided
+between the SNC nodes. The values reported in the resctrl
+"size" files are adjusted accordingly.
+
 Memory bandwidth Allocation and monitoring
 ==========================================
 
-- 
2.41.0


^ permalink raw reply related	[flat|nested] 331+ messages in thread

* Re: [PATCH v12 7/8] x86/resctrl: Sub NUMA Cluster detection and enable
  2023-11-30  0:34                 ` [PATCH v12 7/8] x86/resctrl: Sub NUMA Cluster detection and enable Tony Luck
@ 2023-11-30 18:02                   ` Fam Zheng
  2023-11-30 20:57                     ` Tony Luck
  0 siblings, 1 reply; 331+ messages in thread
From: Fam Zheng @ 2023-11-30 18:02 UTC (permalink / raw)
  To: Tony Luck
  Cc: Fenghua Yu, Reinette Chatre, Peter Newman, Jonathan Corbet,
	Shuah Khan, x86, Shaopeng Tan, James Morse, Jamie Iles,
	Babu Moger, Randy Dunlap, linux-kernel, linux-doc, patches,
	Shaopeng Tan

Hi Tony,

On Wed, Nov 29, 2023 at 04:34:17PM -0800, Tony Luck wrote:
> There isn't a simple hardware bit that indicates whether a CPU is
> running in Sub NUMA Cluster (SNC) mode. Infer the state by comparing
> the ratio of NUMA nodes to L3 cache instances.
> 
> When SNC mode is detected, reconfigure the RMID counters by updating
> the MSR_RMID_SNC_CONFIG MSR on each socket as CPUs are seen.
> 
> Clearing bit zero of the MSR divides the RMIDs and renumbers the ones
> on the second SNC node to start from zero.
> 
> Signed-off-by: Tony Luck <tony.luck@intel.com>
> Reviewed-by: Peter Newman <peternewman@google.com>
> Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
> Reviewed-by: Shaopeng Tan <tan.shaopeng@jp.fujitsu.com>
> Tested-by: Shaopeng Tan <tan.shaopeng@jp.fujitsu.com>
> ---
>  arch/x86/include/asm/msr-index.h   |  1 +
>  arch/x86/kernel/cpu/resctrl/core.c | 96 ++++++++++++++++++++++++++++++
>  2 files changed, 97 insertions(+)
> 
> diff --git a/arch/x86/include/asm/msr-index.h b/arch/x86/include/asm/msr-index.h
> index 1d51e1850ed0..94d29d81e6db 100644
> --- a/arch/x86/include/asm/msr-index.h
> +++ b/arch/x86/include/asm/msr-index.h
> @@ -1111,6 +1111,7 @@
>  #define MSR_IA32_QM_CTR			0xc8e
>  #define MSR_IA32_PQR_ASSOC		0xc8f
>  #define MSR_IA32_L3_CBM_BASE		0xc90
> +#define MSR_RMID_SNC_CONFIG		0xca0
>  #define MSR_IA32_L2_CBM_BASE		0xd10
>  #define MSR_IA32_MBA_THRTL_BASE		0xd50
>  
> diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
> index cf5aba8a74bf..3293ab4c58b0 100644
> --- a/arch/x86/kernel/cpu/resctrl/core.c
> +++ b/arch/x86/kernel/cpu/resctrl/core.c
> @@ -16,11 +16,14 @@
>  
>  #define pr_fmt(fmt)	"resctrl: " fmt
>  
> +#include <linux/cpu.h>
>  #include <linux/slab.h>
>  #include <linux/err.h>
>  #include <linux/cacheinfo.h>
>  #include <linux/cpuhotplug.h>
> +#include <linux/mod_devicetable.h>
>  
> +#include <asm/cpu_device_id.h>
>  #include <asm/intel-family.h>
>  #include <asm/resctrl.h>
>  #include "internal.h"
> @@ -740,11 +743,42 @@ static void clear_closid_rmid(int cpu)
>  	wrmsr(MSR_IA32_PQR_ASSOC, 0, 0);
>  }
>  
> +/*
> + * The power-on reset value of MSR_RMID_SNC_CONFIG is 0x1
> + * which indicates that RMIDs are configured in legacy mode.
> + * This mode is incompatible with Linux resctrl semantics
> + * as RMIDs are partitioned between SNC nodes, which requires
> + * a user to know which RMID is allocated to a task.
> + * Clearing bit 0 reconfigures the RMID counters for use
> + * in Sub NUMA Cluster mode. This mode is better for Linux.
> + * The RMID space is divided between all SNC nodes with the
> + * RMIDs renumbered to start from zero in each node when
> + * counting operations from tasks. Code to read the counters
> + * must adjust RMID counter numbers based on SNC node. See
> + * __rmid_read() for code that does this.
> + */
> +static void snc_remap_rmids(int cpu)
> +{
> +	u64 val;
> +
> +	/* Only need to enable once per package. */
> +	if (cpumask_first(topology_core_cpumask(cpu)) != cpu)
> +		return;
> +
> +	rdmsrl(MSR_RMID_SNC_CONFIG, val);
> +	val &= ~BIT_ULL(0);
> +	wrmsrl(MSR_RMID_SNC_CONFIG, val);
> +}
> +
>  static int resctrl_online_cpu(unsigned int cpu)
>  {
>  	struct rdt_resource *r;
>  
>  	mutex_lock(&rdtgroup_mutex);
> +
> +	if (snc_nodes_per_l3_cache > 1)
> +		snc_remap_rmids(cpu);
> +
>  	for_each_capable_rdt_resource(r)
>  		domain_add_cpu(cpu, r);
>  	/* The cpu is set in default rdtgroup after online. */
> @@ -999,11 +1033,73 @@ static __init bool get_rdt_resources(void)
>  	return (rdt_mon_capable || rdt_alloc_capable);
>  }
>  
> +/* CPU models that support MSR_RMID_SNC_CONFIG */
> +static const struct x86_cpu_id snc_cpu_ids[] __initconst = {
> +	X86_MATCH_INTEL_FAM6_MODEL(ICELAKE_X, 0),
> +	X86_MATCH_INTEL_FAM6_MODEL(SAPPHIRERAPIDS_X, 0),
> +	X86_MATCH_INTEL_FAM6_MODEL(EMERALDRAPIDS_X, 0),
> +	X86_MATCH_INTEL_FAM6_MODEL(GRANITERAPIDS_X, 0),
> +	{}
> +};
> +
> +/*
> + * There isn't a simple hardware bit that indicates whether a CPU is running
> + * in Sub NUMA Cluster (SNC) mode. Infer the state by comparing the
> + * ratio of NUMA nodes to L3 cache instances.
> + * It is not possible to accurately determine SNC state if the system is
> + * booted with a maxcpus=N parameter. That distorts the ratio of SNC nodes
> + * to L3 caches. It will be OK if system is booted with hyperthreading
> + * disabled (since this doesn't affect the ratio).
> + */
> +static __init int snc_get_config(void)
> +{
> +	unsigned long *node_caches;
> +	int mem_only_nodes = 0;
> +	int cpu, node, ret;
> +	int num_l3_caches;
> +
> +	if (!x86_match_cpu(snc_cpu_ids))
> +		return 1;
> +
> +	node_caches = bitmap_zalloc(nr_node_ids, GFP_KERNEL);
> +	if (!node_caches)
> +		return 1;
> +
> +	cpus_read_lock();
> +
> +	if (num_online_cpus() != num_present_cpus())
> +		pr_warn("Some CPUs offline, SNC detection may be incorrect\n");
> +
> +	for_each_node(node) {
> +		cpu = cpumask_first(cpumask_of_node(node));
> +		if (cpu < nr_cpu_ids)
> +			set_bit(get_cpu_cacheinfo_id(cpu, 3), node_caches);

Are we sure get_cpu_cacheinfo_id() is a valid index here? Looking at
the function it could be -1 or larger than nr_node_ids.

Fam

> +		else
> +			mem_only_nodes++;
> +	}
> +	cpus_read_unlock();
> +
> +	num_l3_caches = bitmap_weight(node_caches, nr_node_ids);
> +	kfree(node_caches);
> +
> +	if (!num_l3_caches)
> +		return 1;
> +
> +	ret = (nr_node_ids - mem_only_nodes) / num_l3_caches;
> +
> +	if (ret > 1)
> +		rdt_resources_all[RDT_RESOURCE_L3].r_resctrl.mon_scope = RESCTRL_NODE;
> +
> +	return ret;
> +}
> +
>  static __init void rdt_init_res_defs_intel(void)
>  {
>  	struct rdt_hw_resource *hw_res;
>  	struct rdt_resource *r;
>  
> +	snc_nodes_per_l3_cache = snc_get_config();
> +
>  	for_each_rdt_resource(r) {
>  		hw_res = resctrl_to_arch_res(r);
>  
> -- 
> 2.41.

^ permalink raw reply	[flat|nested] 331+ messages in thread

* Re: [PATCH v12 7/8] x86/resctrl: Sub NUMA Cluster detection and enable
  2023-11-30 18:02                   ` Fam Zheng
@ 2023-11-30 20:57                     ` Tony Luck
  2023-11-30 21:47                       ` Reinette Chatre
  0 siblings, 1 reply; 331+ messages in thread
From: Tony Luck @ 2023-11-30 20:57 UTC (permalink / raw)
  To: Fam Zheng
  Cc: Fenghua Yu, Reinette Chatre, Peter Newman, Jonathan Corbet,
	Shuah Khan, x86, Shaopeng Tan, James Morse, Jamie Iles,
	Babu Moger, Randy Dunlap, linux-kernel, linux-doc, patches,
	Shaopeng Tan

On Thu, Nov 30, 2023 at 06:02:42PM +0000, Fam Zheng wrote:
> > +static __init int snc_get_config(void)
> > +{
> > +	unsigned long *node_caches;
> > +	int mem_only_nodes = 0;
> > +	int cpu, node, ret;
> > +	int num_l3_caches;
> > +
> > +	if (!x86_match_cpu(snc_cpu_ids))
> > +		return 1;
> > +
> > +	node_caches = bitmap_zalloc(nr_node_ids, GFP_KERNEL);
> > +	if (!node_caches)
> > +		return 1;
> > +
> > +	cpus_read_lock();
> > +
> > +	if (num_online_cpus() != num_present_cpus())
> > +		pr_warn("Some CPUs offline, SNC detection may be incorrect\n");
> > +
> > +	for_each_node(node) {
> > +		cpu = cpumask_first(cpumask_of_node(node));
> > +		if (cpu < nr_cpu_ids)
> > +			set_bit(get_cpu_cacheinfo_id(cpu, 3), node_caches);
> 
> Are we sure get_cpu_cacheinfo_id() is a valid index here? Looking at
> the function it could be -1 or larger than nr_node_ids.

Fam,

Return -1 is possible (in the case where first CPU on a node doesn't
have an L3 cache). Larger than nr_node_ids seems a bit more speculative.
It would mean a system with multiple L3 cache instances per node. I
suppose that's theoretically possible. In the limit case every CPU may
have its own personal L3 cache, but still have multiple CPUs grouped
together on a node.

Patch below (to be folded into part7 of next version). Increases the
size of the bitmap. Checks for get_cpu_cacheinfo_id() returning -1.
Patch just ignores the node in this case. I'm never quite sure how much
code to add for "Can't happen" scenarios.

-Tony


diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
index 3293ab4c58b0..85f8a1b3feaf 100644
--- a/arch/x86/kernel/cpu/resctrl/core.c
+++ b/arch/x86/kernel/cpu/resctrl/core.c
@@ -1056,12 +1056,13 @@ static __init int snc_get_config(void)
 	unsigned long *node_caches;
 	int mem_only_nodes = 0;
 	int cpu, node, ret;
+	int cache_id;
 	int num_l3_caches;
 
 	if (!x86_match_cpu(snc_cpu_ids))
 		return 1;
 
-	node_caches = bitmap_zalloc(nr_node_ids, GFP_KERNEL);
+	node_caches = bitmap_zalloc(num_online_cpus(), GFP_KERNEL);
 	if (!node_caches)
 		return 1;
 
@@ -1072,10 +1073,13 @@ static __init int snc_get_config(void)
 
 	for_each_node(node) {
 		cpu = cpumask_first(cpumask_of_node(node));
-		if (cpu < nr_cpu_ids)
-			set_bit(get_cpu_cacheinfo_id(cpu, 3), node_caches);
-		else
+		if (cpu < nr_cpu_ids) {
+			cache_id = get_cpu_cacheinfo_id(cpu, 3);
+			if (cache_id != -1)
+				set_bit(cache_id, node_caches);
+		} else {
 			mem_only_nodes++;
+		}
 	}
 	cpus_read_unlock();
 
-- 
2.41.0


^ permalink raw reply related	[flat|nested] 331+ messages in thread

* Re: [PATCH v12 7/8] x86/resctrl: Sub NUMA Cluster detection and enable
  2023-11-30 20:57                     ` Tony Luck
@ 2023-11-30 21:47                       ` Reinette Chatre
  2023-11-30 22:43                         ` Tony Luck
  0 siblings, 1 reply; 331+ messages in thread
From: Reinette Chatre @ 2023-11-30 21:47 UTC (permalink / raw)
  To: Tony Luck, Fam Zheng
  Cc: Fenghua Yu, Peter Newman, Jonathan Corbet, Shuah Khan, x86,
	Shaopeng Tan, James Morse, Jamie Iles, Babu Moger, Randy Dunlap,
	linux-kernel, linux-doc, patches, Shaopeng Tan

Hi Tony,

On 11/30/2023 12:57 PM, Tony Luck wrote:
> On Thu, Nov 30, 2023 at 06:02:42PM +0000, Fam Zheng wrote:
>>> +static __init int snc_get_config(void)
>>> +{
>>> +	unsigned long *node_caches;
>>> +	int mem_only_nodes = 0;
>>> +	int cpu, node, ret;
>>> +	int num_l3_caches;
>>> +
>>> +	if (!x86_match_cpu(snc_cpu_ids))
>>> +		return 1;
>>> +
>>> +	node_caches = bitmap_zalloc(nr_node_ids, GFP_KERNEL);
>>> +	if (!node_caches)
>>> +		return 1;
>>> +
>>> +	cpus_read_lock();
>>> +
>>> +	if (num_online_cpus() != num_present_cpus())
>>> +		pr_warn("Some CPUs offline, SNC detection may be incorrect\n");
>>> +
>>> +	for_each_node(node) {
>>> +		cpu = cpumask_first(cpumask_of_node(node));
>>> +		if (cpu < nr_cpu_ids)
>>> +			set_bit(get_cpu_cacheinfo_id(cpu, 3), node_caches);
>>
>> Are we sure get_cpu_cacheinfo_id() is a valid index here? Looking at
>> the function it could be -1 or larger than nr_node_ids.
> 
> Fam,
> 
> Return -1 is possible (in the case where first CPU on a node doesn't
> have an L3 cache). Larger than nr_node_ids seems a bit more speculative.
> It would mean a system with multiple L3 cache instances per node. I
> suppose that's theoretically possible. In the limit case every CPU may
> have its own personal L3 cache, but still have multiple CPUs grouped
> together on a node.
> 
> Patch below (to be folded into part7 of next version). Increases the
> size of the bitmap. Checks for get_cpu_cacheinfo_id() returning -1.
> Patch just ignores the node in this case. I'm never quite sure how much
> code to add for "Can't happen" scenarios.
> 

Thank you.

> -Tony
> 
> 
> diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
> index 3293ab4c58b0..85f8a1b3feaf 100644
> --- a/arch/x86/kernel/cpu/resctrl/core.c
> +++ b/arch/x86/kernel/cpu/resctrl/core.c
> @@ -1056,12 +1056,13 @@ static __init int snc_get_config(void)
>  	unsigned long *node_caches;
>  	int mem_only_nodes = 0;
>  	int cpu, node, ret;
> +	int cache_id;
>  	int num_l3_caches;

Please do maintain reverse fir order.

>  
>  	if (!x86_match_cpu(snc_cpu_ids))
>  		return 1;

I understand and welcome this change as motivated by robustness. Apart
from that, with this being a model specific feature for this particular
group of systems, it is not clear to me in which scenarios this could
run on a system where a present CPU does not have access to L3 cache.

>  
> -	node_caches = bitmap_zalloc(nr_node_ids, GFP_KERNEL);
> +	node_caches = bitmap_zalloc(num_online_cpus(), GFP_KERNEL);

Please do take care to take new bitmap size into account in all
places. From what I can tell there is a later bitmap_weight() call that
still uses nr_node_ids as size.

>  	if (!node_caches)
>  		return 1;
>  
> @@ -1072,10 +1073,13 @@ static __init int snc_get_config(void)
>  
>  	for_each_node(node) {
>  		cpu = cpumask_first(cpumask_of_node(node));
> -		if (cpu < nr_cpu_ids)
> -			set_bit(get_cpu_cacheinfo_id(cpu, 3), node_caches);
> -		else
> +		if (cpu < nr_cpu_ids) {
> +			cache_id = get_cpu_cacheinfo_id(cpu, 3);
> +			if (cache_id != -1)
> +				set_bit(cache_id, node_caches);
> +		} else {
>  			mem_only_nodes++;
> +		}
>  	}
>  	cpus_read_unlock();
>  

Could this code be made even more robust by checking the computed
snc_nodes_per_l3_cache against the limited actually possible values?
Forcing it to 1 if something went wrong?

Reinette

^ permalink raw reply	[flat|nested] 331+ messages in thread

* Re: [PATCH v12 3/8] x86/resctrl: Prepare for different scope for control/monitor operations
  2023-11-30  0:34                 ` [PATCH v12 3/8] x86/resctrl: Prepare for different scope for control/monitor operations Tony Luck
@ 2023-11-30 21:51                   ` Reinette Chatre
  0 siblings, 0 replies; 331+ messages in thread
From: Reinette Chatre @ 2023-11-30 21:51 UTC (permalink / raw)
  To: Tony Luck, Fenghua Yu, Peter Newman, Jonathan Corbet, Shuah Khan, x86
  Cc: Shaopeng Tan, James Morse, Jamie Iles, Babu Moger, Randy Dunlap,
	linux-kernel, linux-doc, patches, Shaopeng Tan

Hi Tony,

On 11/29/2023 4:34 PM, Tony Luck wrote:
> Resctrl assumes that control and monitor operations on a resource are
> performed at the same scope.
> 
> Prepare for systems that use different scope (specifically Intel needs
> to split the RDT_RESOURCE_L3 resource to use L3 scope for cache control
> and NODE scope for cache occupancy and memory bandwidth monitoring).
> 
> Create separate domain lists for control and monitor operations.
> 
> Note that errors during initialization of either control or monitor
> functions on a domain would previously result in that domain being
> excluded from both control and monitor operations. Now the domains are
> allocated independently it is no longer required to disable both control
> and monitor operations if either fail.
> 
> Signed-off-by: Tony Luck <tony.luck@intel.com>
> Reviewed-by: Shaopeng Tan <tan.shaopeng@jp.fujitsu.com>
> Tested-by: Shaopeng Tan <tan.shaopeng@jp.fujitsu.com>
> ---

According to Documentation/process/maintainer-tip.rst the
"Tested-by" tag should go before the "Reviewed-by" tag. Please
check the whole series.

Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>

Reinette



^ permalink raw reply	[flat|nested] 331+ messages in thread

* Re: [PATCH v12 7/8] x86/resctrl: Sub NUMA Cluster detection and enable
  2023-11-30 21:47                       ` Reinette Chatre
@ 2023-11-30 22:43                         ` Tony Luck
  2023-11-30 23:40                           ` Reinette Chatre
  0 siblings, 1 reply; 331+ messages in thread
From: Tony Luck @ 2023-11-30 22:43 UTC (permalink / raw)
  To: Reinette Chatre
  Cc: Fam Zheng, Fenghua Yu, Peter Newman, Jonathan Corbet, Shuah Khan,
	x86, Shaopeng Tan, James Morse, Jamie Iles, Babu Moger,
	Randy Dunlap, linux-kernel, linux-doc, patches, Shaopeng Tan

On Thu, Nov 30, 2023 at 01:47:10PM -0800, Reinette Chatre wrote:
> Hi Tony,
> > diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
> > index 3293ab4c58b0..85f8a1b3feaf 100644
> > --- a/arch/x86/kernel/cpu/resctrl/core.c
> > +++ b/arch/x86/kernel/cpu/resctrl/core.c
> > @@ -1056,12 +1056,13 @@ static __init int snc_get_config(void)
> >  	unsigned long *node_caches;
> >  	int mem_only_nodes = 0;
> >  	int cpu, node, ret;
> > +	int cache_id;
> >  	int num_l3_caches;
> 
> Please do maintain reverse fir order.

Fixed.

> 
> >  
> >  	if (!x86_match_cpu(snc_cpu_ids))
> >  		return 1;
> 
> I understand and welcome this change as motivated by robustness. Apart
> from that, with this being a model specific feature for this particular
> group of systems, it is not clear to me in which scenarios this could
> run on a system where a present CPU does not have access to L3 cache.

Agreed that on these systems there should always be an L3 cache. Should
I drop the check for "-1"?

> >  
> > -	node_caches = bitmap_zalloc(nr_node_ids, GFP_KERNEL);
> > +	node_caches = bitmap_zalloc(num_online_cpus(), GFP_KERNEL);
> 
> Please do take care to take new bitmap size into account in all
> places. From what I can tell there is a later bitmap_weight() call that
> still uses nr_node_ids as size.

Oops. I was also using num_online_cpus() before cpus_read_lock(), so
things could theoretically change before the bitmap_weight() call.
I switched to using num_present_cpus() in both places.

> >  	if (!node_caches)
> >  		return 1;
> >  
> > @@ -1072,10 +1073,13 @@ static __init int snc_get_config(void)
> >  
> >  	for_each_node(node) {
> >  		cpu = cpumask_first(cpumask_of_node(node));
> > -		if (cpu < nr_cpu_ids)
> > -			set_bit(get_cpu_cacheinfo_id(cpu, 3), node_caches);
> > -		else
> > +		if (cpu < nr_cpu_ids) {
> > +			cache_id = get_cpu_cacheinfo_id(cpu, 3);
> > +			if (cache_id != -1)
> > +				set_bit(cache_id, node_caches);
> > +		} else {
> >  			mem_only_nodes++;
> > +		}
> >  	}
> >  	cpus_read_unlock();
> >  
> 
> Could this code be made even more robust by checking the computed
> snc_nodes_per_l3_cache against the limited actually possible values?
> Forcing it to 1 if something went wrong?

Added a couple of extra sanity checks. See updated incremental patch
below.

-Tony


diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
index 3293ab4c58b0..3684c6bf8224 100644
--- a/arch/x86/kernel/cpu/resctrl/core.c
+++ b/arch/x86/kernel/cpu/resctrl/core.c
@@ -1057,11 +1057,12 @@ static __init int snc_get_config(void)
 	int mem_only_nodes = 0;
 	int cpu, node, ret;
 	int num_l3_caches;
+	int cache_id;
 
 	if (!x86_match_cpu(snc_cpu_ids))
 		return 1;
 
-	node_caches = bitmap_zalloc(nr_node_ids, GFP_KERNEL);
+	node_caches = bitmap_zalloc(num_present_cpus(), GFP_KERNEL);
 	if (!node_caches)
 		return 1;
 
@@ -1072,23 +1073,39 @@ static __init int snc_get_config(void)
 
 	for_each_node(node) {
 		cpu = cpumask_first(cpumask_of_node(node));
-		if (cpu < nr_cpu_ids)
-			set_bit(get_cpu_cacheinfo_id(cpu, 3), node_caches);
-		else
+		if (cpu < nr_cpu_ids) {
+			cache_id = get_cpu_cacheinfo_id(cpu, 3);
+			if (cache_id != -1)
+				set_bit(cache_id, node_caches);
+		} else {
 			mem_only_nodes++;
+		}
 	}
 	cpus_read_unlock();
 
-	num_l3_caches = bitmap_weight(node_caches, nr_node_ids);
+	num_l3_caches = bitmap_weight(node_caches, num_present_cpus());
 	kfree(node_caches);
 
 	if (!num_l3_caches)
 		return 1;
 
+	/* sanity check #1: Number of CPU nodes must be multiple of num_l3_caches */
+	if ((nr_node_ids - mem_only_nodes) % num_l3_caches)
+		return 1;
+
 	ret = (nr_node_ids - mem_only_nodes) / num_l3_caches;
 
-	if (ret > 1)
+	/* sanity check #2: Only valid results are 1, 2, 4 */
+	switch (ret) {
+	case 1:
+		break;
+	case 2:
+	case 4:
 		rdt_resources_all[RDT_RESOURCE_L3].r_resctrl.mon_scope = RESCTRL_NODE;
+		break;
+	default:
+		return 1;
+	}
 
 	return ret;
 }
-- 
2.41.0


^ permalink raw reply related	[flat|nested] 331+ messages in thread

* Re: [PATCH v12 7/8] x86/resctrl: Sub NUMA Cluster detection and enable
  2023-11-30 22:43                         ` Tony Luck
@ 2023-11-30 23:40                           ` Reinette Chatre
  2023-12-01  0:37                             ` Tony Luck
  0 siblings, 1 reply; 331+ messages in thread
From: Reinette Chatre @ 2023-11-30 23:40 UTC (permalink / raw)
  To: Tony Luck
  Cc: Fam Zheng, Fenghua Yu, Peter Newman, Jonathan Corbet, Shuah Khan,
	x86, Shaopeng Tan, James Morse, Jamie Iles, Babu Moger,
	Randy Dunlap, linux-kernel, linux-doc, patches, Shaopeng Tan

Hi Tony,

On 11/30/2023 2:43 PM, Tony Luck wrote:
> On Thu, Nov 30, 2023 at 01:47:10PM -0800, Reinette Chatre wrote:

...

>>>  	if (!x86_match_cpu(snc_cpu_ids))
>>>  		return 1;
>>
>> I understand and welcome this change as motivated by robustness. Apart
>> from that, with this being a model specific feature for this particular
>> group of systems, it is not clear to me in which scenarios this could
>> run on a system where a present CPU does not have access to L3 cache.
> 
> Agreed that on these systems there should always be an L3 cache. Should
> I drop the check for "-1"?

Please do keep it. I welcome the additional robustness. The static checker I
tried did not complain about this but I expect that it is something that
could trigger checks.

> 
>>>  
>>> -	node_caches = bitmap_zalloc(nr_node_ids, GFP_KERNEL);
>>> +	node_caches = bitmap_zalloc(num_online_cpus(), GFP_KERNEL);
>>
>> Please do take care to take new bitmap size into account in all
>> places. From what I can tell there is a later bitmap_weight() call that
>> still uses nr_node_ids as size.
> 
> Oops. I was also using num_online_cpus() before cpus_read_lock(), so
> things could theoretically change before the bitmap_weight() call.
> I switched to using num_present_cpus() in both places.

Thanks for catching this. I am not sure if num_present_cpus() is the right
choice. I found its comment to say "If HOTPLUG is enabled, then cpu_present_mask
varies dynamically ...". num_possible_cpus() seems more appropriate when looking
for something that does not change while not holding the hotplug lock. Reading its
description more closely also makes me wonder if the later
	num_online_cpus() != num_present_cpus()
should also maybe be 
	num_online_cpus() != num_possible_cpus() ?
It seems to more closely match the intention.

>>>  	if (!node_caches)
>>>  		return 1;
>>>  
>>> @@ -1072,10 +1073,13 @@ static __init int snc_get_config(void)
>>>  
>>>  	for_each_node(node) {
>>>  		cpu = cpumask_first(cpumask_of_node(node));
>>> -		if (cpu < nr_cpu_ids)
>>> -			set_bit(get_cpu_cacheinfo_id(cpu, 3), node_caches);
>>> -		else
>>> +		if (cpu < nr_cpu_ids) {
>>> +			cache_id = get_cpu_cacheinfo_id(cpu, 3);
>>> +			if (cache_id != -1)
>>> +				set_bit(cache_id, node_caches);
>>> +		} else {
>>>  			mem_only_nodes++;
>>> +		}
>>>  	}
>>>  	cpus_read_unlock();
>>>  
>>
>> Could this code be made even more robust by checking the computed
>> snc_nodes_per_l3_cache against the limited actually possible values?
>> Forcing it to 1 if something went wrong?
> 
> Added a couple of extra sanity checks. See updated incremental patch
> below.

Thank you very much. The additional checks look good to me.

Reinette

^ permalink raw reply	[flat|nested] 331+ messages in thread

* Re: [PATCH v12 7/8] x86/resctrl: Sub NUMA Cluster detection and enable
  2023-11-30 23:40                           ` Reinette Chatre
@ 2023-12-01  0:37                             ` Tony Luck
  2023-12-01  2:08                               ` Reinette Chatre
  0 siblings, 1 reply; 331+ messages in thread
From: Tony Luck @ 2023-12-01  0:37 UTC (permalink / raw)
  To: Reinette Chatre
  Cc: Fam Zheng, Fenghua Yu, Peter Newman, Jonathan Corbet, Shuah Khan,
	x86, Shaopeng Tan, James Morse, Jamie Iles, Babu Moger,
	Randy Dunlap, linux-kernel, linux-doc, patches, Shaopeng Tan

On Thu, Nov 30, 2023 at 03:40:52PM -0800, Reinette Chatre wrote:
> Hi Tony,
> 
> On 11/30/2023 2:43 PM, Tony Luck wrote:
> > On Thu, Nov 30, 2023 at 01:47:10PM -0800, Reinette Chatre wrote:
> 
> ...
> 
> >>>  	if (!x86_match_cpu(snc_cpu_ids))
> >>>  		return 1;
> >>
> >> I understand and welcome this change as motivated by robustness. Apart
> >> from that, with this being a model specific feature for this particular
> >> group of systems, it is not clear to me in which scenarios this could
> >> run on a system where a present CPU does not have access to L3 cache.
> > 
> > Agreed that on these systems there should always be an L3 cache. Should
> > I drop the check for "-1"?
> 
> Please do keep it. I welcome the additional robustness. The static checker I
> tried did not complain about this but I expect that it is something that
> could trigger checks.
> 
> > 
> >>>  
> >>> -	node_caches = bitmap_zalloc(nr_node_ids, GFP_KERNEL);
> >>> +	node_caches = bitmap_zalloc(num_online_cpus(), GFP_KERNEL);
> >>
> >> Please do take care to take new bitmap size into account in all
> >> places. From what I can tell there is a later bitmap_weight() call that
> >> still uses nr_node_ids as size.
> > 
> > Oops. I was also using num_online_cpus() before cpus_read_lock(), so
> > things could theoretically change before the bitmap_weight() call.
> > I switched to using num_present_cpus() in both places.
> 
> Thanks for catching this. I am not sure if num_present_cpus() is the right
> choice. I found its comment to say "If HOTPLUG is enabled, then cpu_present_mask
> varies dynamically ...". num_possible_cpus() seems more appropriate when looking

I can size the bitmask based on num_possible_cpus().

> for something that does not change while not holding the hotplug lock. Reading its
> description more closely also makes me wonder if the later
> 	num_online_cpus() != num_present_cpus()
> should also maybe be 
> 	num_online_cpus() != num_possible_cpus() ?
> It seems to more closely match the intention.

This seems problematic. On a system that does support physical CPU
hotplug, num_possible_cpus() may be some very large number, reserving
space for CPUs that can be added later. None of those CPUs can be online
(obviously!). So this test would fail on such a system.
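
Just to make concrete what I'm planning locally (a sketch only, the exact
form may still change before v13): size the bitmap by the possible-CPU
count, but keep the sanity warning comparing online against present CPUs:

	/*
	 * cpu_online_mask is a subset of cpu_present_mask, which is a
	 * subset of cpu_possible_mask, so sizing the bitmap by
	 * num_possible_cpus() is always sufficient.
	 */
	node_caches = bitmap_zalloc(num_possible_cpus(), GFP_KERNEL);
	if (!node_caches)
		return 1;

	cpus_read_lock();

	/* Warn only about physically present CPUs that are offline. */
	if (num_online_cpus() != num_present_cpus())
		pr_warn("Some CPUs offline, SNC detection may be incorrect\n");

	/* for_each_node() loop unchanged from the incremental patch above */

	cpus_read_unlock();

	/* bitmap_weight() must use the same size as bitmap_zalloc() */
	num_l3_caches = bitmap_weight(node_caches, num_possible_cpus());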

> >>>  	if (!node_caches)
> >>>  		return 1;
> >>>  
> >>> @@ -1072,10 +1073,13 @@ static __init int snc_get_config(void)
> >>>  
> >>>  	for_each_node(node) {
> >>>  		cpu = cpumask_first(cpumask_of_node(node));
> >>> -		if (cpu < nr_cpu_ids)
> >>> -			set_bit(get_cpu_cacheinfo_id(cpu, 3), node_caches);
> >>> -		else
> >>> +		if (cpu < nr_cpu_ids) {
> >>> +			cache_id = get_cpu_cacheinfo_id(cpu, 3);
> >>> +			if (cache_id != -1)
> >>> +				set_bit(cache_id, node_caches);
> >>> +		} else {
> >>>  			mem_only_nodes++;
> >>> +		}
> >>>  	}
> >>>  	cpus_read_unlock();
> >>>  
> >>
> >> Could this code be made even more robust by checking the computed
> >> snc_nodes_per_l3_cache against the limited actually possible values?
> >> Forcing it to 1 if something went wrong?
> > 
> > Added a couple of extra sanity checks. See updated incremental patch
> > below.
> 
> Thank you very much. The additional checks look good to me.
> 
> Reinette

Thanks for looking at this. I'm applying changes to my local tree. I'll
give folks a little more time to find additional issues in v12 and post
v13 next week.

-Tony

^ permalink raw reply	[flat|nested] 331+ messages in thread

* Re: [PATCH v12 7/8] x86/resctrl: Sub NUMA Cluster detection and enable
  2023-12-01  0:37                             ` Tony Luck
@ 2023-12-01  2:08                               ` Reinette Chatre
  0 siblings, 0 replies; 331+ messages in thread
From: Reinette Chatre @ 2023-12-01  2:08 UTC (permalink / raw)
  To: Tony Luck
  Cc: Fam Zheng, Fenghua Yu, Peter Newman, Jonathan Corbet, Shuah Khan,
	x86, Shaopeng Tan, James Morse, Jamie Iles, Babu Moger,
	Randy Dunlap, linux-kernel, linux-doc, patches, Shaopeng Tan

Hi Tony,

On 11/30/2023 4:37 PM, Tony Luck wrote:
> On Thu, Nov 30, 2023 at 03:40:52PM -0800, Reinette Chatre wrote:
>> Hi Tony,
>>
>> On 11/30/2023 2:43 PM, Tony Luck wrote:
>>> On Thu, Nov 30, 2023 at 01:47:10PM -0800, Reinette Chatre wrote:

>> for something that does not change while not holding the hotplug lock. Reading its
>> description more closely also makes me wonder if the later
>> 	num_online_cpus() != num_present_cpus()
>> should also maybe be 
>> 	num_online_cpus() != num_possible_cpus() ?
>> It seems to more closely match the intention.
> 
> This seems problematic. On a system that does support physical CPU
> hotplug, num_possible_cpus() may be some very large number, reserving
> space for CPUs that can be added later. None of those CPUs can be online
> (obviously!). So this test would fail on such a system.

Thank you for clarifying. It was not obvious to me how these bitmaps are
different with physical hotplug. 

Reinette

^ permalink raw reply	[flat|nested] 331+ messages in thread

* Re: [PATCH v12 8/8] x86/resctrl: Update documentation with Sub-NUMA cluster changes
  2023-11-30  0:34                 ` [PATCH v12 8/8] x86/resctrl: Update documentation with Sub-NUMA cluster changes Tony Luck
@ 2023-12-04 16:55                   ` Moger, Babu
  2023-12-04 17:51                     ` Luck, Tony
  0 siblings, 1 reply; 331+ messages in thread
From: Moger, Babu @ 2023-12-04 16:55 UTC (permalink / raw)
  To: Tony Luck, Fenghua Yu, Reinette Chatre, Peter Newman,
	Jonathan Corbet, Shuah Khan, x86
  Cc: Shaopeng Tan, James Morse, Jamie Iles, Randy Dunlap,
	linux-kernel, linux-doc, patches, Shaopeng Tan



On 11/29/23 18:34, Tony Luck wrote:
> With Sub-NUMA Cluster mode enabled the scope of monitoring resources is
> per-NODE instead of per-L3 cache. Suffixes of directories with "L3" in
> their name refer to Sub-NUMA nodes instead of L3 cache ids.
> 
> Users should be aware that SNC mode also affects the amount of L3 cache
> available for allocation within each SNC node.
> 
> Signed-off-by: Tony Luck <tony.luck@intel.com>
> Reviewed-by: Peter Newman <peternewman@google.com>
> Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
> Reviewed-by: Shaopeng Tan <tan.shaopeng@jp.fujitsu.com>
> Tested-by: Shaopeng Tan <tan.shaopeng@jp.fujitsu.com>
> ---
>  Documentation/arch/x86/resctrl.rst | 25 +++++++++++++++++++++----
>  1 file changed, 21 insertions(+), 4 deletions(-)
> 
> diff --git a/Documentation/arch/x86/resctrl.rst b/Documentation/arch/x86/resctrl.rst
> index a6279df64a9d..49ff789db1d8 100644
> --- a/Documentation/arch/x86/resctrl.rst
> +++ b/Documentation/arch/x86/resctrl.rst
> @@ -366,10 +366,10 @@ When control is enabled all CTRL_MON groups will also contain:
>  When monitoring is enabled all MON groups will also contain:
>  
>  "mon_data":
> -	This contains a set of files organized by L3 domain and by
> -	RDT event. E.g. on a system with two L3 domains there will
> -	be subdirectories "mon_L3_00" and "mon_L3_01".	Each of these
> -	directories have one file per event (e.g. "llc_occupancy",
> +	This contains a set of files organized by L3 domain or by NUMA
> +	node (depending on whether Sub-NUMA Cluster (SNC) mode is disabled
> +	or enabled respectively) and by RDT event.  Each of these
> +	directories has one file per event (e.g. "llc_occupancy",
>  	"mbm_total_bytes", and "mbm_local_bytes"). In a MON group these
>  	files provide a read out of the current value of the event for
>  	all tasks in the group. In CTRL_MON groups these files provide
> @@ -478,6 +478,23 @@ if non-contiguous 1s value is supported. On a system with a 20-bit mask
>  each bit represents 5% of the capacity of the cache. You could partition
>  the cache into four equal parts with masks: 0x1f, 0x3e0, 0x7c00, 0xf8000.
>  
> +Notes on Sub-NUMA Cluster mode
> +==============================
> +When SNC mode is enabled Linux may load balance tasks between Sub-NUMA

enabled, Linux

Thanks
Babu

> +nodes much more readily than between regular NUMA nodes since the CPUs
> +on Sub-NUMA nodes share the same L3 cache and the system may report
> +the NUMA distance between Sub-NUMA nodes with a lower value than used
> +for regular NUMA nodes.  Users who do not bind tasks to the CPUs of a
> +specific Sub-NUMA node must read the "llc_occupancy", "mbm_total_bytes",
> +and "mbm_local_bytes" for all Sub-NUMA nodes where the tasks may execute
> +to get the full view of traffic for which the tasks were the source.
> +
> +The cache allocation feature still provides the same number of
> +bits in a mask to control allocation into the L3 cache, but each
> +of those ways has its capacity reduced because the cache is divided
> +between the SNC nodes. The values reported in the resctrl
> +"size" files are adjusted accordingly.
> +
>  Memory bandwidth Allocation and monitoring
>  ==========================================
>  

-- 
Thanks
Babu Moger

^ permalink raw reply	[flat|nested] 331+ messages in thread

* RE: [PATCH v12 8/8] x86/resctrl: Update documentation with Sub-NUMA cluster changes
  2023-12-04 16:55                   ` Moger, Babu
@ 2023-12-04 17:51                     ` Luck, Tony
  0 siblings, 0 replies; 331+ messages in thread
From: Luck, Tony @ 2023-12-04 17:51 UTC (permalink / raw)
  To: babu.moger, Yu, Fenghua, Chatre, Reinette, Peter Newman,
	Jonathan Corbet, Shuah Khan, x86
  Cc: Shaopeng Tan, James Morse, Jamie Iles, Randy Dunlap,
	linux-kernel, linux-doc, patches, Shaopeng Tan

>> +Notes on Sub-NUMA Cluster mode
>> +==============================
>> +When SNC mode is enabled Linux may load balance tasks between Sub-NUMA
>
> enabled, Linux

Babu,

Thanks. I've added that "," ready for next version.

-Tony


^ permalink raw reply	[flat|nested] 331+ messages in thread

* [PATCH v13 0/8] Add support for Sub-NUMA cluster (SNC) systems
  2023-11-30  0:34               ` [PATCH v12 " Tony Luck
                                   ` (7 preceding siblings ...)
  2023-11-30  0:34                 ` [PATCH v12 8/8] x86/resctrl: Update documentation with Sub-NUMA cluster changes Tony Luck
@ 2023-12-04 18:53                 ` Tony Luck
  2023-12-04 18:53                   ` [PATCH v13 1/8] x86/resctrl: Prepare for new domain scope Tony Luck
                                     ` (10 more replies)
  8 siblings, 11 replies; 331+ messages in thread
From: Tony Luck @ 2023-12-04 18:53 UTC (permalink / raw)
  To: Fenghua Yu, Reinette Chatre, Peter Newman, Jonathan Corbet,
	Shuah Khan, x86
  Cc: Shaopeng Tan, James Morse, Jamie Iles, Babu Moger, Randy Dunlap,
	linux-kernel, linux-doc, patches, Tony Luck

The Sub-NUMA cluster feature on some Intel processors partitions the CPUs
that share an L3 cache into two or more sets. This plays havoc with the
Resource Director Technology (RDT) monitoring features.  Prior to this
patch series, Intel advised that SNC and RDT are incompatible.

Some of these CPUs support an MSR that can partition the RMID counters in
the same way. This allows monitoring features to be used, with the caveat
that users must be aware that Linux may migrate tasks more frequently
between SNC nodes than between "regular" NUMA nodes, so reading counters
from all SNC nodes may be needed to get a complete picture of activity
for tasks.
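
As a rough illustration (not part of this series; it assumes the default
/sys/fs/resctrl mount point, the "mon_L3_XX" directory naming described in
patch 8, a hypothetical monitoring group named "g1", and two SNC nodes per
L3), summing an event across the SNC nodes of one group could look like:

#include <stdio.h>

/* Sum one RDT event over all SNC node directories of a mon group. */
static unsigned long long sum_event(const char *group, const char *event,
				    int nodes)
{
	unsigned long long total = 0, val;
	char path[256];
	FILE *f;
	int i;

	for (i = 0; i < nodes; i++) {
		/* With SNC enabled the "L3" suffix is an SNC node id. */
		snprintf(path, sizeof(path),
			 "/sys/fs/resctrl/%s/mon_data/mon_L3_%02d/%s",
			 group, i, event);
		f = fopen(path, "r");
		if (!f)
			continue;
		if (fscanf(f, "%llu", &val) == 1)
			total += val;
		fclose(f);
	}
	return total;
}

int main(void)
{
	/* e.g. total memory bandwidth bytes for the example group "g1" */
	printf("%llu\n", sum_event("g1", "mbm_total_bytes", 2));
	return 0;
}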

Cache and memory bandwidth allocation features continue to operate at
the scope of the L3 cache.

Signed-off-by: Tony Luck <tony.luck@intel.com>

Changes since v12:

All:
	Reinette - put commit tags in right order for TIP (Tested-by before
	Reviewed-by)

Patch 7:
	Fam Zheng - Check for -1 return from get_cpu_cacheinfo_id() and
	increase size of bitmap tracking # of L3 instances.
	Reinette - Add extra sanity checks. Note that this patch has
	some additional tweaks beyond the e-mail discussion.
		1) "3" is a valid return in addition to 1, 2, 4
		2) Added a warning if the sanity checks fail that
		prints number of CPU nodes and number of L3 cache
		instances that were found.

Patch 8:
	Babu - Fix grammar with an additional comma.


Tony Luck (8):
  x86/resctrl: Prepare for new domain scope
  x86/resctrl: Prepare to split rdt_domain structure
  x86/resctrl: Prepare for different scope for control/monitor
    operations
  x86/resctrl: Split the rdt_domain and rdt_hw_domain structures
  x86/resctrl: Add node-scope to the options for feature scope
  x86/resctrl: Introduce snc_nodes_per_l3_cache
  x86/resctrl: Sub NUMA Cluster detection and enable
  x86/resctrl: Update documentation with Sub-NUMA cluster changes

 Documentation/arch/x86/resctrl.rst        |  25 +-
 include/linux/resctrl.h                   |  85 +++--
 arch/x86/include/asm/msr-index.h          |   1 +
 arch/x86/kernel/cpu/resctrl/internal.h    |  66 ++--
 arch/x86/kernel/cpu/resctrl/core.c        | 433 +++++++++++++++++-----
 arch/x86/kernel/cpu/resctrl/ctrlmondata.c |  58 +--
 arch/x86/kernel/cpu/resctrl/monitor.c     |  68 ++--
 arch/x86/kernel/cpu/resctrl/pseudo_lock.c |  26 +-
 arch/x86/kernel/cpu/resctrl/rdtgroup.c    | 149 ++++----
 9 files changed, 629 insertions(+), 282 deletions(-)


base-commit: 2cc14f52aeb78ce3f29677c2de1f06c0e91471ab
-- 
2.41.0


^ permalink raw reply	[flat|nested] 331+ messages in thread

* [PATCH v13 1/8] x86/resctrl: Prepare for new domain scope
  2023-12-04 18:53                 ` [PATCH v13 0/8] Add support for Sub-NUMA cluster (SNC) systems Tony Luck
@ 2023-12-04 18:53                   ` Tony Luck
  2023-12-04 23:03                     ` Moger, Babu
  2023-12-04 18:53                   ` [PATCH v13 2/8] x86/resctrl: Prepare to split rdt_domain structure Tony Luck
                                     ` (9 subsequent siblings)
  10 siblings, 1 reply; 331+ messages in thread
From: Tony Luck @ 2023-12-04 18:53 UTC (permalink / raw)
  To: Fenghua Yu, Reinette Chatre, Peter Newman, Jonathan Corbet,
	Shuah Khan, x86
  Cc: Shaopeng Tan, James Morse, Jamie Iles, Babu Moger, Randy Dunlap,
	linux-kernel, linux-doc, patches, Tony Luck, Shaopeng Tan

Resctrl resources operate on subsets of CPUs in the system with the
defining attribute of each subset being an instance of a particular
level of cache. E.g. all CPUs sharing an L3 cache would be part of the
same domain.

In preparation for features that are scoped at the NUMA node level,
change the code from explicit references to "cache_level" to a more
generic scope. At this point the only options for this scope are groups
of CPUs that share an L2 cache or L3 cache.

Clean up the error handling when looking up domains. Report invalid ids
before calling rdt_find_domain() in preparation for better messages when
scope can be other than cache scope. This means that rdt_find_domain()
will never return an error. So remove checks for error from the callsites.

Signed-off-by: Tony Luck <tony.luck@intel.com>
Tested-by: Shaopeng Tan <tan.shaopeng@jp.fujitsu.com>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Reviewed-by: Shaopeng Tan <tan.shaopeng@jp.fujitsu.com>
---
 include/linux/resctrl.h                   |  9 ++++-
 arch/x86/kernel/cpu/resctrl/core.c        | 45 ++++++++++++++++-------
 arch/x86/kernel/cpu/resctrl/ctrlmondata.c |  2 +-
 arch/x86/kernel/cpu/resctrl/pseudo_lock.c |  6 ++-
 arch/x86/kernel/cpu/resctrl/rdtgroup.c    |  5 ++-
 5 files changed, 48 insertions(+), 19 deletions(-)

diff --git a/include/linux/resctrl.h b/include/linux/resctrl.h
index 66942d7fba7f..7d4eb7df611d 100644
--- a/include/linux/resctrl.h
+++ b/include/linux/resctrl.h
@@ -144,13 +144,18 @@ struct resctrl_membw {
 struct rdt_parse_data;
 struct resctrl_schema;
 
+enum resctrl_scope {
+	RESCTRL_L2_CACHE = 2,
+	RESCTRL_L3_CACHE = 3,
+};
+
 /**
  * struct rdt_resource - attributes of a resctrl resource
  * @rid:		The index of the resource
  * @alloc_capable:	Is allocation available on this machine
  * @mon_capable:	Is monitor feature available on this machine
  * @num_rmid:		Number of RMIDs available
- * @cache_level:	Which cache level defines scope of this resource
+ * @scope:		Scope of this resource
  * @cache:		Cache allocation related data
  * @membw:		If the component has bandwidth controls, their properties.
  * @domains:		All domains for this resource
@@ -168,7 +173,7 @@ struct rdt_resource {
 	bool			alloc_capable;
 	bool			mon_capable;
 	int			num_rmid;
-	int			cache_level;
+	enum resctrl_scope	scope;
 	struct resctrl_cache	cache;
 	struct resctrl_membw	membw;
 	struct list_head	domains;
diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
index 19e0681f0435..fd113bc29d4e 100644
--- a/arch/x86/kernel/cpu/resctrl/core.c
+++ b/arch/x86/kernel/cpu/resctrl/core.c
@@ -65,7 +65,7 @@ struct rdt_hw_resource rdt_resources_all[] = {
 		.r_resctrl = {
 			.rid			= RDT_RESOURCE_L3,
 			.name			= "L3",
-			.cache_level		= 3,
+			.scope			= RESCTRL_L3_CACHE,
 			.domains		= domain_init(RDT_RESOURCE_L3),
 			.parse_ctrlval		= parse_cbm,
 			.format_str		= "%d=%0*x",
@@ -79,7 +79,7 @@ struct rdt_hw_resource rdt_resources_all[] = {
 		.r_resctrl = {
 			.rid			= RDT_RESOURCE_L2,
 			.name			= "L2",
-			.cache_level		= 2,
+			.scope			= RESCTRL_L2_CACHE,
 			.domains		= domain_init(RDT_RESOURCE_L2),
 			.parse_ctrlval		= parse_cbm,
 			.format_str		= "%d=%0*x",
@@ -93,7 +93,7 @@ struct rdt_hw_resource rdt_resources_all[] = {
 		.r_resctrl = {
 			.rid			= RDT_RESOURCE_MBA,
 			.name			= "MB",
-			.cache_level		= 3,
+			.scope			= RESCTRL_L3_CACHE,
 			.domains		= domain_init(RDT_RESOURCE_MBA),
 			.parse_ctrlval		= parse_bw,
 			.format_str		= "%d=%*u",
@@ -105,7 +105,7 @@ struct rdt_hw_resource rdt_resources_all[] = {
 		.r_resctrl = {
 			.rid			= RDT_RESOURCE_SMBA,
 			.name			= "SMBA",
-			.cache_level		= 3,
+			.scope			= RESCTRL_L3_CACHE,
 			.domains		= domain_init(RDT_RESOURCE_SMBA),
 			.parse_ctrlval		= parse_bw,
 			.format_str		= "%d=%*u",
@@ -401,9 +401,6 @@ struct rdt_domain *rdt_find_domain(struct rdt_resource *r, int id,
 	struct rdt_domain *d;
 	struct list_head *l;
 
-	if (id < 0)
-		return ERR_PTR(-ENODEV);
-
 	list_for_each(l, &r->domains) {
 		d = list_entry(l, struct rdt_domain, list);
 		/* When id is found, return its domain. */
@@ -491,6 +488,19 @@ static int arch_domain_mbm_alloc(u32 num_rmid, struct rdt_hw_domain *hw_dom)
 	return 0;
 }
 
+static int get_domain_id_from_scope(int cpu, enum resctrl_scope scope)
+{
+	switch (scope) {
+	case RESCTRL_L2_CACHE:
+	case RESCTRL_L3_CACHE:
+		return get_cpu_cacheinfo_id(cpu, scope);
+	default:
+		break;
+	}
+
+	return -EINVAL;
+}
+
 /*
  * domain_add_cpu - Add a cpu to a resource's domain list.
  *
@@ -506,18 +516,19 @@ static int arch_domain_mbm_alloc(u32 num_rmid, struct rdt_hw_domain *hw_dom)
  */
 static void domain_add_cpu(int cpu, struct rdt_resource *r)
 {
-	int id = get_cpu_cacheinfo_id(cpu, r->cache_level);
+	int id = get_domain_id_from_scope(cpu, r->scope);
 	struct list_head *add_pos = NULL;
 	struct rdt_hw_domain *hw_dom;
 	struct rdt_domain *d;
 	int err;
 
-	d = rdt_find_domain(r, id, &add_pos);
-	if (IS_ERR(d)) {
-		pr_warn("Couldn't find cache id for CPU %d\n", cpu);
+	if (id < 0) {
+		pr_warn_once("Can't find domain id for CPU:%d scope:%d for resource %s\n",
+			     cpu, r->scope, r->name);
 		return;
 	}
 
+	d = rdt_find_domain(r, id, &add_pos);
 	if (d) {
 		cpumask_set_cpu(cpu, &d->cpu_mask);
 		if (r->cache.arch_has_per_cpu_cfg)
@@ -556,13 +567,19 @@ static void domain_add_cpu(int cpu, struct rdt_resource *r)
 
 static void domain_remove_cpu(int cpu, struct rdt_resource *r)
 {
-	int id = get_cpu_cacheinfo_id(cpu, r->cache_level);
+	int id = get_domain_id_from_scope(cpu, r->scope);
 	struct rdt_hw_domain *hw_dom;
 	struct rdt_domain *d;
 
+	if (id < 0) {
+		pr_warn_once("Can't find domain id for CPU:%d scope:%d for resource %s\n",
+			     cpu, r->scope, r->name);
+		return;
+	}
+
 	d = rdt_find_domain(r, id, NULL);
-	if (IS_ERR_OR_NULL(d)) {
-		pr_warn("Couldn't find cache id for CPU %d\n", cpu);
+	if (!d) {
+		pr_warn("Couldn't find domain with id=%d for CPU %d\n", id, cpu);
 		return;
 	}
 	hw_dom = resctrl_to_arch_dom(d);
diff --git a/arch/x86/kernel/cpu/resctrl/ctrlmondata.c b/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
index beccb0e87ba7..3f8891d57fac 100644
--- a/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
+++ b/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
@@ -563,7 +563,7 @@ int rdtgroup_mondata_show(struct seq_file *m, void *arg)
 
 	r = &rdt_resources_all[resid].r_resctrl;
 	d = rdt_find_domain(r, domid, NULL);
-	if (IS_ERR_OR_NULL(d)) {
+	if (!d) {
 		ret = -ENOENT;
 		goto out;
 	}
diff --git a/arch/x86/kernel/cpu/resctrl/pseudo_lock.c b/arch/x86/kernel/cpu/resctrl/pseudo_lock.c
index 8f559eeae08e..2a682da9f43a 100644
--- a/arch/x86/kernel/cpu/resctrl/pseudo_lock.c
+++ b/arch/x86/kernel/cpu/resctrl/pseudo_lock.c
@@ -292,10 +292,14 @@ static void pseudo_lock_region_clear(struct pseudo_lock_region *plr)
  */
 static int pseudo_lock_region_init(struct pseudo_lock_region *plr)
 {
+	enum resctrl_scope scope = plr->s->res->scope;
 	struct cpu_cacheinfo *ci;
 	int ret;
 	int i;
 
+	if (WARN_ON_ONCE(scope != RESCTRL_L2_CACHE && scope != RESCTRL_L3_CACHE))
+		return -ENODEV;
+
 	/* Pick the first cpu we find that is associated with the cache. */
 	plr->cpu = cpumask_first(&plr->d->cpu_mask);
 
@@ -311,7 +315,7 @@ static int pseudo_lock_region_init(struct pseudo_lock_region *plr)
 	plr->size = rdtgroup_cbm_to_size(plr->s->res, plr->d, plr->cbm);
 
 	for (i = 0; i < ci->num_leaves; i++) {
-		if (ci->info_list[i].level == plr->s->res->cache_level) {
+		if (ci->info_list[i].level == scope) {
 			plr->line_size = ci->info_list[i].coherency_line_size;
 			return 0;
 		}
diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
index 69a1de92384a..c44be64d65ec 100644
--- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
+++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
@@ -1413,10 +1413,13 @@ unsigned int rdtgroup_cbm_to_size(struct rdt_resource *r,
 	unsigned int size = 0;
 	int num_b, i;
 
+	if (WARN_ON_ONCE(r->scope != RESCTRL_L2_CACHE && r->scope != RESCTRL_L3_CACHE))
+		return size;
+
 	num_b = bitmap_weight(&cbm, r->cache.cbm_len);
 	ci = get_cpu_cacheinfo(cpumask_any(&d->cpu_mask));
 	for (i = 0; i < ci->num_leaves; i++) {
-		if (ci->info_list[i].level == r->cache_level) {
+		if (ci->info_list[i].level == r->scope) {
 			size = ci->info_list[i].size / r->cache.cbm_len * num_b;
 			break;
 		}
-- 
2.41.0


^ permalink raw reply related	[flat|nested] 331+ messages in thread

* [PATCH v13 2/8] x86/resctrl: Prepare to split rdt_domain structure
  2023-12-04 18:53                 ` [PATCH v13 0/8] Add support for Sub-NUMA cluster (SNC) systems Tony Luck
  2023-12-04 18:53                   ` [PATCH v13 1/8] x86/resctrl: Prepare for new domain scope Tony Luck
@ 2023-12-04 18:53                   ` Tony Luck
  2023-12-04 23:03                     ` Moger, Babu
  2023-12-04 18:53                   ` [PATCH v13 3/8] x86/resctrl: Prepare for different scope for control/monitor operations Tony Luck
                                     ` (8 subsequent siblings)
  10 siblings, 1 reply; 331+ messages in thread
From: Tony Luck @ 2023-12-04 18:53 UTC (permalink / raw)
  To: Fenghua Yu, Reinette Chatre, Peter Newman, Jonathan Corbet,
	Shuah Khan, x86
  Cc: Shaopeng Tan, James Morse, Jamie Iles, Babu Moger, Randy Dunlap,
	linux-kernel, linux-doc, patches, Tony Luck, Shaopeng Tan

The rdt_domain structure is used for both control and monitor features.
It is about to be split into separate structures for these two usages
because the scope for control and monitoring features for a resource
will be different for future resources.

To allow for common code that scans a list of domains looking for a
specific domain id, move all the common fields ("list", "id", "cpu_mask")
into their own structure within the rdt_domain structure.

Signed-off-by: Tony Luck <tony.luck@intel.com>
Tested-by: Shaopeng Tan <tan.shaopeng@jp.fujitsu.com>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Reviewed-by: Shaopeng Tan <tan.shaopeng@jp.fujitsu.com>
---
 include/linux/resctrl.h                   | 16 ++++--
 arch/x86/kernel/cpu/resctrl/core.c        | 26 +++++-----
 arch/x86/kernel/cpu/resctrl/ctrlmondata.c | 22 ++++-----
 arch/x86/kernel/cpu/resctrl/monitor.c     | 10 ++--
 arch/x86/kernel/cpu/resctrl/pseudo_lock.c | 14 +++---
 arch/x86/kernel/cpu/resctrl/rdtgroup.c    | 60 +++++++++++------------
 6 files changed, 78 insertions(+), 70 deletions(-)

diff --git a/include/linux/resctrl.h b/include/linux/resctrl.h
index 7d4eb7df611d..c4067150a6b7 100644
--- a/include/linux/resctrl.h
+++ b/include/linux/resctrl.h
@@ -53,10 +53,20 @@ struct resctrl_staged_config {
 };
 
 /**
- * struct rdt_domain - group of CPUs sharing a resctrl resource
+ * struct rdt_domain_hdr - common header for different domain types
  * @list:		all instances of this resource
  * @id:			unique id for this instance
  * @cpu_mask:		which CPUs share this resource
+ */
+struct rdt_domain_hdr {
+	struct list_head		list;
+	int				id;
+	struct cpumask			cpu_mask;
+};
+
+/**
+ * struct rdt_domain - group of CPUs sharing a resctrl resource
+ * @hdr:		common header for different domain types
  * @rmid_busy_llc:	bitmap of which limbo RMIDs are above threshold
  * @mbm_total:		saved state for MBM total bandwidth
  * @mbm_local:		saved state for MBM local bandwidth
@@ -71,9 +81,7 @@ struct resctrl_staged_config {
  *			by closid
  */
 struct rdt_domain {
-	struct list_head		list;
-	int				id;
-	struct cpumask			cpu_mask;
+	struct rdt_domain_hdr		hdr;
 	unsigned long			*rmid_busy_llc;
 	struct mbm_state		*mbm_total;
 	struct mbm_state		*mbm_local;
diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
index fd113bc29d4e..62a989fd950d 100644
--- a/arch/x86/kernel/cpu/resctrl/core.c
+++ b/arch/x86/kernel/cpu/resctrl/core.c
@@ -356,9 +356,9 @@ struct rdt_domain *get_domain_from_cpu(int cpu, struct rdt_resource *r)
 {
 	struct rdt_domain *d;
 
-	list_for_each_entry(d, &r->domains, list) {
+	list_for_each_entry(d, &r->domains, hdr.list) {
 		/* Find the domain that contains this CPU */
-		if (cpumask_test_cpu(cpu, &d->cpu_mask))
+		if (cpumask_test_cpu(cpu, &d->hdr.cpu_mask))
 			return d;
 	}
 
@@ -402,12 +402,12 @@ struct rdt_domain *rdt_find_domain(struct rdt_resource *r, int id,
 	struct list_head *l;
 
 	list_for_each(l, &r->domains) {
-		d = list_entry(l, struct rdt_domain, list);
+		d = list_entry(l, struct rdt_domain, hdr.list);
 		/* When id is found, return its domain. */
-		if (id == d->id)
+		if (id == d->hdr.id)
 			return d;
 		/* Stop searching when finding id's position in sorted list. */
-		if (id < d->id)
+		if (id < d->hdr.id)
 			break;
 	}
 
@@ -530,7 +530,7 @@ static void domain_add_cpu(int cpu, struct rdt_resource *r)
 
 	d = rdt_find_domain(r, id, &add_pos);
 	if (d) {
-		cpumask_set_cpu(cpu, &d->cpu_mask);
+		cpumask_set_cpu(cpu, &d->hdr.cpu_mask);
 		if (r->cache.arch_has_per_cpu_cfg)
 			rdt_domain_reconfigure_cdp(r);
 		return;
@@ -541,8 +541,8 @@ static void domain_add_cpu(int cpu, struct rdt_resource *r)
 		return;
 
 	d = &hw_dom->d_resctrl;
-	d->id = id;
-	cpumask_set_cpu(cpu, &d->cpu_mask);
+	d->hdr.id = id;
+	cpumask_set_cpu(cpu, &d->hdr.cpu_mask);
 
 	rdt_domain_reconfigure_cdp(r);
 
@@ -556,11 +556,11 @@ static void domain_add_cpu(int cpu, struct rdt_resource *r)
 		return;
 	}
 
-	list_add_tail(&d->list, add_pos);
+	list_add_tail(&d->hdr.list, add_pos);
 
 	err = resctrl_online_domain(r, d);
 	if (err) {
-		list_del(&d->list);
+		list_del(&d->hdr.list);
 		domain_free(hw_dom);
 	}
 }
@@ -584,10 +584,10 @@ static void domain_remove_cpu(int cpu, struct rdt_resource *r)
 	}
 	hw_dom = resctrl_to_arch_dom(d);
 
-	cpumask_clear_cpu(cpu, &d->cpu_mask);
-	if (cpumask_empty(&d->cpu_mask)) {
+	cpumask_clear_cpu(cpu, &d->hdr.cpu_mask);
+	if (cpumask_empty(&d->hdr.cpu_mask)) {
 		resctrl_offline_domain(r, d);
-		list_del(&d->list);
+		list_del(&d->hdr.list);
 
 		/*
 		 * rdt_domain "d" is going to be freed below, so clear
diff --git a/arch/x86/kernel/cpu/resctrl/ctrlmondata.c b/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
index 3f8891d57fac..23f8258d36a8 100644
--- a/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
+++ b/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
@@ -67,7 +67,7 @@ int parse_bw(struct rdt_parse_data *data, struct resctrl_schema *s,
 
 	cfg = &d->staged_config[s->conf_type];
 	if (cfg->have_new_ctrl) {
-		rdt_last_cmd_printf("Duplicate domain %d\n", d->id);
+		rdt_last_cmd_printf("Duplicate domain %d\n", d->hdr.id);
 		return -EINVAL;
 	}
 
@@ -146,7 +146,7 @@ int parse_cbm(struct rdt_parse_data *data, struct resctrl_schema *s,
 
 	cfg = &d->staged_config[s->conf_type];
 	if (cfg->have_new_ctrl) {
-		rdt_last_cmd_printf("Duplicate domain %d\n", d->id);
+		rdt_last_cmd_printf("Duplicate domain %d\n", d->hdr.id);
 		return -EINVAL;
 	}
 
@@ -226,8 +226,8 @@ static int parse_line(char *line, struct resctrl_schema *s,
 		return -EINVAL;
 	}
 	dom = strim(dom);
-	list_for_each_entry(d, &r->domains, list) {
-		if (d->id == dom_id) {
+	list_for_each_entry(d, &r->domains, hdr.list) {
+		if (d->hdr.id == dom_id) {
 			data.buf = dom;
 			data.rdtgrp = rdtgrp;
 			if (r->parse_ctrlval(&data, s, d))
@@ -274,7 +274,7 @@ static bool apply_config(struct rdt_hw_domain *hw_dom,
 	struct rdt_domain *dom = &hw_dom->d_resctrl;
 
 	if (cfg->new_ctrl != hw_dom->ctrl_val[idx]) {
-		cpumask_set_cpu(cpumask_any(&dom->cpu_mask), cpu_mask);
+		cpumask_set_cpu(cpumask_any(&dom->hdr.cpu_mask), cpu_mask);
 		hw_dom->ctrl_val[idx] = cfg->new_ctrl;
 
 		return true;
@@ -291,7 +291,7 @@ int resctrl_arch_update_one(struct rdt_resource *r, struct rdt_domain *d,
 	u32 idx = get_config_index(closid, t);
 	struct msr_param msr_param;
 
-	if (!cpumask_test_cpu(smp_processor_id(), &d->cpu_mask))
+	if (!cpumask_test_cpu(smp_processor_id(), &d->hdr.cpu_mask))
 		return -EINVAL;
 
 	hw_dom->ctrl_val[idx] = cfg_val;
@@ -318,7 +318,7 @@ int resctrl_arch_update_domains(struct rdt_resource *r, u32 closid)
 		return -ENOMEM;
 
 	msr_param.res = NULL;
-	list_for_each_entry(d, &r->domains, list) {
+	list_for_each_entry(d, &r->domains, hdr.list) {
 		hw_dom = resctrl_to_arch_dom(d);
 		for (t = 0; t < CDP_NUM_TYPES; t++) {
 			cfg = &hw_dom->d_resctrl.staged_config[t];
@@ -466,7 +466,7 @@ static void show_doms(struct seq_file *s, struct resctrl_schema *schema, int clo
 	u32 ctrl_val;
 
 	seq_printf(s, "%*s:", max_name_width, schema->name);
-	list_for_each_entry(dom, &r->domains, list) {
+	list_for_each_entry(dom, &r->domains, hdr.list) {
 		if (sep)
 			seq_puts(s, ";");
 
@@ -476,7 +476,7 @@ static void show_doms(struct seq_file *s, struct resctrl_schema *schema, int clo
 			ctrl_val = resctrl_arch_get_config(r, dom, closid,
 							   schema->conf_type);
 
-		seq_printf(s, r->format_str, dom->id, max_data_width,
+		seq_printf(s, r->format_str, dom->hdr.id, max_data_width,
 			   ctrl_val);
 		sep = true;
 	}
@@ -505,7 +505,7 @@ int rdtgroup_schemata_show(struct kernfs_open_file *of,
 			} else {
 				seq_printf(s, "%s:%d=%x\n",
 					   rdtgrp->plr->s->res->name,
-					   rdtgrp->plr->d->id,
+					   rdtgrp->plr->d->hdr.id,
 					   rdtgrp->plr->cbm);
 			}
 		} else {
@@ -536,7 +536,7 @@ void mon_event_read(struct rmid_read *rr, struct rdt_resource *r,
 	rr->val = 0;
 	rr->first = first;
 
-	smp_call_function_any(&d->cpu_mask, mon_event_count, rr, 1);
+	smp_call_function_any(&d->hdr.cpu_mask, mon_event_count, rr, 1);
 }
 
 int rdtgroup_mondata_show(struct seq_file *m, void *arg)
diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c
index f136ac046851..dd0ea1bc0092 100644
--- a/arch/x86/kernel/cpu/resctrl/monitor.c
+++ b/arch/x86/kernel/cpu/resctrl/monitor.c
@@ -238,7 +238,7 @@ int resctrl_arch_rmid_read(struct rdt_resource *r, struct rdt_domain *d,
 	u64 msr_val, chunks;
 	int ret;
 
-	if (!cpumask_test_cpu(smp_processor_id(), &d->cpu_mask))
+	if (!cpumask_test_cpu(smp_processor_id(), &d->hdr.cpu_mask))
 		return -EINVAL;
 
 	ret = __rmid_read(rmid, eventid, &msr_val);
@@ -340,8 +340,8 @@ static void add_rmid_to_limbo(struct rmid_entry *entry)
 
 	entry->busy = 0;
 	cpu = get_cpu();
-	list_for_each_entry(d, &r->domains, list) {
-		if (cpumask_test_cpu(cpu, &d->cpu_mask)) {
+	list_for_each_entry(d, &r->domains, hdr.list) {
+		if (cpumask_test_cpu(cpu, &d->hdr.cpu_mask)) {
 			err = resctrl_arch_rmid_read(r, d, entry->rmid,
 						     QOS_L3_OCCUP_EVENT_ID,
 						     &val);
@@ -661,7 +661,7 @@ void cqm_setup_limbo_handler(struct rdt_domain *dom, unsigned long delay_ms)
 	unsigned long delay = msecs_to_jiffies(delay_ms);
 	int cpu;
 
-	cpu = cpumask_any(&dom->cpu_mask);
+	cpu = cpumask_any(&dom->hdr.cpu_mask);
 	dom->cqm_work_cpu = cpu;
 
 	schedule_delayed_work_on(cpu, &dom->cqm_limbo, delay);
@@ -708,7 +708,7 @@ void mbm_setup_overflow_handler(struct rdt_domain *dom, unsigned long delay_ms)
 
 	if (!static_branch_likely(&rdt_mon_enable_key))
 		return;
-	cpu = cpumask_any(&dom->cpu_mask);
+	cpu = cpumask_any(&dom->hdr.cpu_mask);
 	dom->mbm_work_cpu = cpu;
 	schedule_delayed_work_on(cpu, &dom->mbm_over, delay);
 }
diff --git a/arch/x86/kernel/cpu/resctrl/pseudo_lock.c b/arch/x86/kernel/cpu/resctrl/pseudo_lock.c
index 2a682da9f43a..fcbd99e2eb66 100644
--- a/arch/x86/kernel/cpu/resctrl/pseudo_lock.c
+++ b/arch/x86/kernel/cpu/resctrl/pseudo_lock.c
@@ -221,7 +221,7 @@ static int pseudo_lock_cstates_constrain(struct pseudo_lock_region *plr)
 	int cpu;
 	int ret;
 
-	for_each_cpu(cpu, &plr->d->cpu_mask) {
+	for_each_cpu(cpu, &plr->d->hdr.cpu_mask) {
 		pm_req = kzalloc(sizeof(*pm_req), GFP_KERNEL);
 		if (!pm_req) {
 			rdt_last_cmd_puts("Failure to allocate memory for PM QoS\n");
@@ -301,7 +301,7 @@ static int pseudo_lock_region_init(struct pseudo_lock_region *plr)
 		return -ENODEV;
 
 	/* Pick the first cpu we find that is associated with the cache. */
-	plr->cpu = cpumask_first(&plr->d->cpu_mask);
+	plr->cpu = cpumask_first(&plr->d->hdr.cpu_mask);
 
 	if (!cpu_online(plr->cpu)) {
 		rdt_last_cmd_printf("CPU %u associated with cache not online\n",
@@ -856,10 +856,10 @@ bool rdtgroup_pseudo_locked_in_hierarchy(struct rdt_domain *d)
 	 * associated with them.
 	 */
 	for_each_alloc_capable_rdt_resource(r) {
-		list_for_each_entry(d_i, &r->domains, list) {
+		list_for_each_entry(d_i, &r->domains, hdr.list) {
 			if (d_i->plr)
 				cpumask_or(cpu_with_psl, cpu_with_psl,
-					   &d_i->cpu_mask);
+					   &d_i->hdr.cpu_mask);
 		}
 	}
 
@@ -867,7 +867,7 @@ bool rdtgroup_pseudo_locked_in_hierarchy(struct rdt_domain *d)
 	 * Next test if new pseudo-locked region would intersect with
 	 * existing region.
 	 */
-	if (cpumask_intersects(&d->cpu_mask, cpu_with_psl))
+	if (cpumask_intersects(&d->hdr.cpu_mask, cpu_with_psl))
 		ret = true;
 
 	free_cpumask_var(cpu_with_psl);
@@ -1199,7 +1199,7 @@ static int pseudo_lock_measure_cycles(struct rdtgroup *rdtgrp, int sel)
 	}
 
 	plr->thread_done = 0;
-	cpu = cpumask_first(&plr->d->cpu_mask);
+	cpu = cpumask_first(&plr->d->hdr.cpu_mask);
 	if (!cpu_online(cpu)) {
 		ret = -ENODEV;
 		goto out;
@@ -1529,7 +1529,7 @@ static int pseudo_lock_dev_mmap(struct file *filp, struct vm_area_struct *vma)
 	 * may be scheduled elsewhere and invalidate entries in the
 	 * pseudo-locked region.
 	 */
-	if (!cpumask_subset(current->cpus_ptr, &plr->d->cpu_mask)) {
+	if (!cpumask_subset(current->cpus_ptr, &plr->d->hdr.cpu_mask)) {
 		mutex_unlock(&rdtgroup_mutex);
 		return -EINVAL;
 	}
diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
index c44be64d65ec..04d32602ac33 100644
--- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
+++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
@@ -91,7 +91,7 @@ void rdt_staged_configs_clear(void)
 	lockdep_assert_held(&rdtgroup_mutex);
 
 	for_each_alloc_capable_rdt_resource(r) {
-		list_for_each_entry(dom, &r->domains, list)
+		list_for_each_entry(dom, &r->domains, hdr.list)
 			memset(dom->staged_config, 0, sizeof(dom->staged_config));
 	}
 }
@@ -295,7 +295,7 @@ static int rdtgroup_cpus_show(struct kernfs_open_file *of,
 				rdt_last_cmd_puts("Cache domain offline\n");
 				ret = -ENODEV;
 			} else {
-				mask = &rdtgrp->plr->d->cpu_mask;
+				mask = &rdtgrp->plr->d->hdr.cpu_mask;
 				seq_printf(s, is_cpu_list(of) ?
 					   "%*pbl\n" : "%*pb\n",
 					   cpumask_pr_args(mask));
@@ -984,12 +984,12 @@ static int rdt_bit_usage_show(struct kernfs_open_file *of,
 
 	mutex_lock(&rdtgroup_mutex);
 	hw_shareable = r->cache.shareable_bits;
-	list_for_each_entry(dom, &r->domains, list) {
+	list_for_each_entry(dom, &r->domains, hdr.list) {
 		if (sep)
 			seq_putc(seq, ';');
 		sw_shareable = 0;
 		exclusive = 0;
-		seq_printf(seq, "%d=", dom->id);
+		seq_printf(seq, "%d=", dom->hdr.id);
 		for (i = 0; i < closids_supported(); i++) {
 			if (!closid_allocated(i))
 				continue;
@@ -1302,7 +1302,7 @@ static bool rdtgroup_mode_test_exclusive(struct rdtgroup *rdtgrp)
 		if (r->rid == RDT_RESOURCE_MBA || r->rid == RDT_RESOURCE_SMBA)
 			continue;
 		has_cache = true;
-		list_for_each_entry(d, &r->domains, list) {
+		list_for_each_entry(d, &r->domains, hdr.list) {
 			ctrl = resctrl_arch_get_config(r, d, closid,
 						       s->conf_type);
 			if (rdtgroup_cbm_overlaps(s, d, ctrl, closid, false)) {
@@ -1417,7 +1417,7 @@ unsigned int rdtgroup_cbm_to_size(struct rdt_resource *r,
 		return size;
 
 	num_b = bitmap_weight(&cbm, r->cache.cbm_len);
-	ci = get_cpu_cacheinfo(cpumask_any(&d->cpu_mask));
+	ci = get_cpu_cacheinfo(cpumask_any(&d->hdr.cpu_mask));
 	for (i = 0; i < ci->num_leaves; i++) {
 		if (ci->info_list[i].level == r->scope) {
 			size = ci->info_list[i].size / r->cache.cbm_len * num_b;
@@ -1465,7 +1465,7 @@ static int rdtgroup_size_show(struct kernfs_open_file *of,
 			size = rdtgroup_cbm_to_size(rdtgrp->plr->s->res,
 						    rdtgrp->plr->d,
 						    rdtgrp->plr->cbm);
-			seq_printf(s, "%d=%u\n", rdtgrp->plr->d->id, size);
+			seq_printf(s, "%d=%u\n", rdtgrp->plr->d->hdr.id, size);
 		}
 		goto out;
 	}
@@ -1477,7 +1477,7 @@ static int rdtgroup_size_show(struct kernfs_open_file *of,
 		type = schema->conf_type;
 		sep = false;
 		seq_printf(s, "%*s:", max_name_width, schema->name);
-		list_for_each_entry(d, &r->domains, list) {
+		list_for_each_entry(d, &r->domains, hdr.list) {
 			if (sep)
 				seq_putc(s, ';');
 			if (rdtgrp->mode == RDT_MODE_PSEUDO_LOCKSETUP) {
@@ -1495,7 +1495,7 @@ static int rdtgroup_size_show(struct kernfs_open_file *of,
 				else
 					size = rdtgroup_cbm_to_size(r, d, ctrl);
 			}
-			seq_printf(s, "%d=%u", d->id, size);
+			seq_printf(s, "%d=%u", d->hdr.id, size);
 			sep = true;
 		}
 		seq_putc(s, '\n');
@@ -1555,7 +1555,7 @@ static void mon_event_config_read(void *info)
 
 static void mondata_config_read(struct rdt_domain *d, struct mon_config_info *mon_info)
 {
-	smp_call_function_any(&d->cpu_mask, mon_event_config_read, mon_info, 1);
+	smp_call_function_any(&d->hdr.cpu_mask, mon_event_config_read, mon_info, 1);
 }
 
 static int mbm_config_show(struct seq_file *s, struct rdt_resource *r, u32 evtid)
@@ -1566,7 +1566,7 @@ static int mbm_config_show(struct seq_file *s, struct rdt_resource *r, u32 evtid
 
 	mutex_lock(&rdtgroup_mutex);
 
-	list_for_each_entry(dom, &r->domains, list) {
+	list_for_each_entry(dom, &r->domains, hdr.list) {
 		if (sep)
 			seq_puts(s, ";");
 
@@ -1574,7 +1574,7 @@ static int mbm_config_show(struct seq_file *s, struct rdt_resource *r, u32 evtid
 		mon_info.evtid = evtid;
 		mondata_config_read(dom, &mon_info);
 
-		seq_printf(s, "%d=0x%02x", dom->id, mon_info.mon_config);
+		seq_printf(s, "%d=0x%02x", dom->hdr.id, mon_info.mon_config);
 		sep = true;
 	}
 	seq_puts(s, "\n");
@@ -1646,7 +1646,7 @@ static int mbm_config_write_domain(struct rdt_resource *r,
 	 * are scoped at the domain level. Writing any of these MSRs
 	 * on one CPU is observed by all the CPUs in the domain.
 	 */
-	smp_call_function_any(&d->cpu_mask, mon_event_config_write,
+	smp_call_function_any(&d->hdr.cpu_mask, mon_event_config_write,
 			      &mon_info, 1);
 
 	/*
@@ -1689,8 +1689,8 @@ static int mon_config_write(struct rdt_resource *r, char *tok, u32 evtid)
 		return -EINVAL;
 	}
 
-	list_for_each_entry(d, &r->domains, list) {
-		if (d->id == dom_id) {
+	list_for_each_entry(d, &r->domains, hdr.list) {
+		if (d->hdr.id == dom_id) {
 			ret = mbm_config_write_domain(r, d, evtid, val);
 			if (ret)
 				return -EINVAL;
@@ -2232,14 +2232,14 @@ static int set_cache_qos_cfg(int level, bool enable)
 		return -ENOMEM;
 
 	r_l = &rdt_resources_all[level].r_resctrl;
-	list_for_each_entry(d, &r_l->domains, list) {
+	list_for_each_entry(d, &r_l->domains, hdr.list) {
 		if (r_l->cache.arch_has_per_cpu_cfg)
 			/* Pick all the CPUs in the domain instance */
-			for_each_cpu(cpu, &d->cpu_mask)
+			for_each_cpu(cpu, &d->hdr.cpu_mask)
 				cpumask_set_cpu(cpu, cpu_mask);
 		else
 			/* Pick one CPU from each domain instance to update MSR */
-			cpumask_set_cpu(cpumask_any(&d->cpu_mask), cpu_mask);
+			cpumask_set_cpu(cpumask_any(&d->hdr.cpu_mask), cpu_mask);
 	}
 
 	/* Update QOS_CFG MSR on all the CPUs in cpu_mask */
@@ -2268,7 +2268,7 @@ void rdt_domain_reconfigure_cdp(struct rdt_resource *r)
 static int mba_sc_domain_allocate(struct rdt_resource *r, struct rdt_domain *d)
 {
 	u32 num_closid = resctrl_arch_get_num_closid(r);
-	int cpu = cpumask_any(&d->cpu_mask);
+	int cpu = cpumask_any(&d->hdr.cpu_mask);
 	int i;
 
 	d->mbps_val = kcalloc_node(num_closid, sizeof(*d->mbps_val),
@@ -2317,7 +2317,7 @@ static int set_mba_sc(bool mba_sc)
 
 	r->membw.mba_sc = mba_sc;
 
-	list_for_each_entry(d, &r->domains, list) {
+	list_for_each_entry(d, &r->domains, hdr.list) {
 		for (i = 0; i < num_closid; i++)
 			d->mbps_val[i] = MBA_MAX_MBPS;
 	}
@@ -2653,7 +2653,7 @@ static int rdt_get_tree(struct fs_context *fc)
 
 	if (is_mbm_enabled()) {
 		r = &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl;
-		list_for_each_entry(dom, &r->domains, list)
+		list_for_each_entry(dom, &r->domains, hdr.list)
 			mbm_setup_overflow_handler(dom, MBM_OVERFLOW_INTERVAL);
 	}
 
@@ -2780,9 +2780,9 @@ static int reset_all_ctrls(struct rdt_resource *r)
 	 * CBMs in all domains to the maximum mask value. Pick one CPU
 	 * from each domain to update the MSRs below.
 	 */
-	list_for_each_entry(d, &r->domains, list) {
+	list_for_each_entry(d, &r->domains, hdr.list) {
 		hw_dom = resctrl_to_arch_dom(d);
-		cpumask_set_cpu(cpumask_any(&d->cpu_mask), cpu_mask);
+		cpumask_set_cpu(cpumask_any(&d->hdr.cpu_mask), cpu_mask);
 
 		for (i = 0; i < hw_res->num_closid; i++)
 			hw_dom->ctrl_val[i] = r->default_ctrl;
@@ -2986,7 +2986,7 @@ static int mkdir_mondata_subdir(struct kernfs_node *parent_kn,
 	char name[32];
 	int ret;
 
-	sprintf(name, "mon_%s_%02d", r->name, d->id);
+	sprintf(name, "mon_%s_%02d", r->name, d->hdr.id);
 	/* create the directory */
 	kn = kernfs_create_dir(parent_kn, name, parent_kn->mode, prgrp);
 	if (IS_ERR(kn))
@@ -3002,7 +3002,7 @@ static int mkdir_mondata_subdir(struct kernfs_node *parent_kn,
 	}
 
 	priv.u.rid = r->rid;
-	priv.u.domid = d->id;
+	priv.u.domid = d->hdr.id;
 	list_for_each_entry(mevt, &r->evt_list, list) {
 		priv.u.evtid = mevt->evtid;
 		ret = mon_addfile(kn, mevt->name, priv.priv);
@@ -3050,7 +3050,7 @@ static int mkdir_mondata_subdir_alldom(struct kernfs_node *parent_kn,
 	struct rdt_domain *dom;
 	int ret;
 
-	list_for_each_entry(dom, &r->domains, list) {
+	list_for_each_entry(dom, &r->domains, hdr.list) {
 		ret = mkdir_mondata_subdir(parent_kn, dom, r, prgrp);
 		if (ret)
 			return ret;
@@ -3209,7 +3209,7 @@ static int __init_one_rdt_domain(struct rdt_domain *d, struct resctrl_schema *s,
 	 */
 	tmp_cbm = cfg->new_ctrl;
 	if (bitmap_weight(&tmp_cbm, r->cache.cbm_len) < r->cache.min_cbm_bits) {
-		rdt_last_cmd_printf("No space on %s:%d\n", s->name, d->id);
+		rdt_last_cmd_printf("No space on %s:%d\n", s->name, d->hdr.id);
 		return -ENOSPC;
 	}
 	cfg->have_new_ctrl = true;
@@ -3232,7 +3232,7 @@ static int rdtgroup_init_cat(struct resctrl_schema *s, u32 closid)
 	struct rdt_domain *d;
 	int ret;
 
-	list_for_each_entry(d, &s->res->domains, list) {
+	list_for_each_entry(d, &s->res->domains, hdr.list) {
 		ret = __init_one_rdt_domain(d, s, closid);
 		if (ret < 0)
 			return ret;
@@ -3247,7 +3247,7 @@ static void rdtgroup_init_mba(struct rdt_resource *r, u32 closid)
 	struct resctrl_staged_config *cfg;
 	struct rdt_domain *d;
 
-	list_for_each_entry(d, &r->domains, list) {
+	list_for_each_entry(d, &r->domains, hdr.list) {
 		if (is_mba_sc(r)) {
 			d->mbps_val[closid] = MBA_MAX_MBPS;
 			continue;
@@ -3864,7 +3864,7 @@ void resctrl_offline_domain(struct rdt_resource *r, struct rdt_domain *d)
 	 * per domain monitor data directories.
 	 */
 	if (static_branch_unlikely(&rdt_mon_enable_key))
-		rmdir_mondata_subdir_allrdtgrp(r, d->id);
+		rmdir_mondata_subdir_allrdtgrp(r, d->hdr.id);
 
 	if (is_mbm_enabled())
 		cancel_delayed_work(&d->mbm_over);
-- 
2.41.0



* [PATCH v13 3/8] x86/resctrl: Prepare for different scope for control/monitor operations
  2023-12-04 18:53                 ` [PATCH v13 0/8] Add support for Sub-NUMA cluster (SNC) systems Tony Luck
  2023-12-04 18:53                   ` [PATCH v13 1/8] x86/resctrl: Prepare for new domain scope Tony Luck
  2023-12-04 18:53                   ` [PATCH v13 2/8] x86/resctrl: Prepare to split rdt_domain structure Tony Luck
@ 2023-12-04 18:53                   ` Tony Luck
  2023-12-04 23:03                     ` Moger, Babu
  2023-12-04 18:53                   ` [PATCH v13 4/8] x86/resctrl: Split the rdt_domain and rdt_hw_domain structures Tony Luck
                                     ` (7 subsequent siblings)
  10 siblings, 1 reply; 331+ messages in thread
From: Tony Luck @ 2023-12-04 18:53 UTC (permalink / raw)
  To: Fenghua Yu, Reinette Chatre, Peter Newman, Jonathan Corbet,
	Shuah Khan, x86
  Cc: Shaopeng Tan, James Morse, Jamie Iles, Babu Moger, Randy Dunlap,
	linux-kernel, linux-doc, patches, Tony Luck, Shaopeng Tan

Resctrl assumes that control and monitor operations on a resource are
performed at the same scope.

Prepare for systems that use different scopes for control and monitor
operations (specifically, Intel needs to split the RDT_RESOURCE_L3
resource to use L3 scope for cache control and NODE scope for cache
occupancy and memory bandwidth monitoring).

Create separate domain lists for control and monitor operations.

Note that errors during initialization of either control or monitor
functions on a domain previously resulted in that domain being excluded
from both control and monitor operations. Now that the domains are
allocated independently, it is no longer necessary to disable both
control and monitor operations if either fails.
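
A minimal sketch (illustrative only, not part of the patch; the helper
name and pr_info() output are made up) of how arch code now picks the
list that matches the operation, using the fields this patch adds:

static void walk_domains_sketch(struct rdt_resource *r)
{
	struct rdt_domain *d;

	/* Control state is kept on the per-resource control domain list. */
	if (r->alloc_capable)
		list_for_each_entry(d, &r->ctrl_domains, hdr.list)
			pr_info("ctrl domain %d\n", d->hdr.id);

	/* Monitor state is kept on the separate monitor domain list. */
	if (r->mon_capable)
		list_for_each_entry(d, &r->mon_domains, hdr.list)
			pr_info("mon domain %d\n", d->hdr.id);
}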

Signed-off-by: Tony Luck <tony.luck@intel.com>
Tested-by: Shaopeng Tan <tan.shaopeng@jp.fujitsu.com>
Reviewed-by: Shaopeng Tan <tan.shaopeng@jp.fujitsu.com>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
---
 include/linux/resctrl.h                   |  25 ++-
 arch/x86/kernel/cpu/resctrl/internal.h    |   6 +-
 arch/x86/kernel/cpu/resctrl/core.c        | 211 ++++++++++++++++------
 arch/x86/kernel/cpu/resctrl/ctrlmondata.c |  12 +-
 arch/x86/kernel/cpu/resctrl/monitor.c     |   4 +-
 arch/x86/kernel/cpu/resctrl/pseudo_lock.c |   4 +-
 arch/x86/kernel/cpu/resctrl/rdtgroup.c    |  55 +++---
 7 files changed, 220 insertions(+), 97 deletions(-)

diff --git a/include/linux/resctrl.h b/include/linux/resctrl.h
index c4067150a6b7..35e700edc6e6 100644
--- a/include/linux/resctrl.h
+++ b/include/linux/resctrl.h
@@ -52,15 +52,22 @@ struct resctrl_staged_config {
 	bool			have_new_ctrl;
 };
 
+enum resctrl_domain_type {
+	RESCTRL_CTRL_DOMAIN,
+	RESCTRL_MON_DOMAIN,
+};
+
 /**
  * struct rdt_domain_hdr - common header for different domain types
  * @list:		all instances of this resource
  * @id:			unique id for this instance
+ * @type:		type of this instance
  * @cpu_mask:		which CPUs share this resource
  */
 struct rdt_domain_hdr {
 	struct list_head		list;
 	int				id;
+	enum resctrl_domain_type	type;
 	struct cpumask			cpu_mask;
 };
 
@@ -163,10 +170,12 @@ enum resctrl_scope {
  * @alloc_capable:	Is allocation available on this machine
  * @mon_capable:	Is monitor feature available on this machine
  * @num_rmid:		Number of RMIDs available
- * @scope:		Scope of this resource
+ * @ctrl_scope:		Scope of this resource for control functions
+ * @mon_scope:		Scope of this resource for monitor functions
  * @cache:		Cache allocation related data
  * @membw:		If the component has bandwidth controls, their properties.
- * @domains:		All domains for this resource
+ * @ctrl_domains:	Control domains for this resource
+ * @mon_domains:	Monitor domains for this resource
  * @name:		Name to use in "schemata" file.
  * @data_width:		Character width of data when displaying
  * @default_ctrl:	Specifies default cache cbm or memory B/W percent.
@@ -181,10 +190,12 @@ struct rdt_resource {
 	bool			alloc_capable;
 	bool			mon_capable;
 	int			num_rmid;
-	enum resctrl_scope	scope;
+	enum resctrl_scope	ctrl_scope;
+	enum resctrl_scope	mon_scope;
 	struct resctrl_cache	cache;
 	struct resctrl_membw	membw;
-	struct list_head	domains;
+	struct list_head	ctrl_domains;
+	struct list_head	mon_domains;
 	char			*name;
 	int			data_width;
 	u32			default_ctrl;
@@ -230,8 +241,10 @@ int resctrl_arch_update_one(struct rdt_resource *r, struct rdt_domain *d,
 
 u32 resctrl_arch_get_config(struct rdt_resource *r, struct rdt_domain *d,
 			    u32 closid, enum resctrl_conf_type type);
-int resctrl_online_domain(struct rdt_resource *r, struct rdt_domain *d);
-void resctrl_offline_domain(struct rdt_resource *r, struct rdt_domain *d);
+int resctrl_online_ctrl_domain(struct rdt_resource *r, struct rdt_domain *d);
+int resctrl_online_mon_domain(struct rdt_resource *r, struct rdt_domain *d);
+void resctrl_offline_ctrl_domain(struct rdt_resource *r, struct rdt_domain *d);
+void resctrl_offline_mon_domain(struct rdt_resource *r, struct rdt_domain *d);
 
 /**
  * resctrl_arch_rmid_read() - Read the eventid counter corresponding to rmid
diff --git a/arch/x86/kernel/cpu/resctrl/internal.h b/arch/x86/kernel/cpu/resctrl/internal.h
index a4f1aa15f0a2..24bf9d7989a9 100644
--- a/arch/x86/kernel/cpu/resctrl/internal.h
+++ b/arch/x86/kernel/cpu/resctrl/internal.h
@@ -520,8 +520,8 @@ void rdtgroup_kn_unlock(struct kernfs_node *kn);
 int rdtgroup_kn_mode_restrict(struct rdtgroup *r, const char *name);
 int rdtgroup_kn_mode_restore(struct rdtgroup *r, const char *name,
 			     umode_t mask);
-struct rdt_domain *rdt_find_domain(struct rdt_resource *r, int id,
-				   struct list_head **pos);
+struct rdt_domain_hdr *rdt_find_domain(struct list_head *h, int id,
+				       struct list_head **pos);
 ssize_t rdtgroup_schemata_write(struct kernfs_open_file *of,
 				char *buf, size_t nbytes, loff_t off);
 int rdtgroup_schemata_show(struct kernfs_open_file *of,
@@ -540,7 +540,7 @@ int rdt_pseudo_lock_init(void);
 void rdt_pseudo_lock_release(void);
 int rdtgroup_pseudo_lock_create(struct rdtgroup *rdtgrp);
 void rdtgroup_pseudo_lock_remove(struct rdtgroup *rdtgrp);
-struct rdt_domain *get_domain_from_cpu(int cpu, struct rdt_resource *r);
+struct rdt_domain *get_ctrl_domain_from_cpu(int cpu, struct rdt_resource *r);
 int closids_supported(void);
 void closid_free(int closid);
 int alloc_rmid(void);
diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
index 62a989fd950d..1fd85533b4ca 100644
--- a/arch/x86/kernel/cpu/resctrl/core.c
+++ b/arch/x86/kernel/cpu/resctrl/core.c
@@ -57,7 +57,8 @@ static void
 mba_wrmsr_amd(struct rdt_domain *d, struct msr_param *m,
 	      struct rdt_resource *r);
 
-#define domain_init(id) LIST_HEAD_INIT(rdt_resources_all[id].r_resctrl.domains)
+#define ctrl_domain_init(id) LIST_HEAD_INIT(rdt_resources_all[id].r_resctrl.ctrl_domains)
+#define mon_domain_init(id) LIST_HEAD_INIT(rdt_resources_all[id].r_resctrl.mon_domains)
 
 struct rdt_hw_resource rdt_resources_all[] = {
 	[RDT_RESOURCE_L3] =
@@ -65,8 +66,10 @@ struct rdt_hw_resource rdt_resources_all[] = {
 		.r_resctrl = {
 			.rid			= RDT_RESOURCE_L3,
 			.name			= "L3",
-			.scope			= RESCTRL_L3_CACHE,
-			.domains		= domain_init(RDT_RESOURCE_L3),
+			.ctrl_scope		= RESCTRL_L3_CACHE,
+			.mon_scope		= RESCTRL_L3_CACHE,
+			.ctrl_domains		= ctrl_domain_init(RDT_RESOURCE_L3),
+			.mon_domains		= mon_domain_init(RDT_RESOURCE_L3),
 			.parse_ctrlval		= parse_cbm,
 			.format_str		= "%d=%0*x",
 			.fflags			= RFTYPE_RES_CACHE,
@@ -79,8 +82,8 @@ struct rdt_hw_resource rdt_resources_all[] = {
 		.r_resctrl = {
 			.rid			= RDT_RESOURCE_L2,
 			.name			= "L2",
-			.scope			= RESCTRL_L2_CACHE,
-			.domains		= domain_init(RDT_RESOURCE_L2),
+			.ctrl_scope		= RESCTRL_L2_CACHE,
+			.ctrl_domains		= ctrl_domain_init(RDT_RESOURCE_L2),
 			.parse_ctrlval		= parse_cbm,
 			.format_str		= "%d=%0*x",
 			.fflags			= RFTYPE_RES_CACHE,
@@ -93,8 +96,8 @@ struct rdt_hw_resource rdt_resources_all[] = {
 		.r_resctrl = {
 			.rid			= RDT_RESOURCE_MBA,
 			.name			= "MB",
-			.scope			= RESCTRL_L3_CACHE,
-			.domains		= domain_init(RDT_RESOURCE_MBA),
+			.ctrl_scope		= RESCTRL_L3_CACHE,
+			.ctrl_domains		= ctrl_domain_init(RDT_RESOURCE_MBA),
 			.parse_ctrlval		= parse_bw,
 			.format_str		= "%d=%*u",
 			.fflags			= RFTYPE_RES_MB,
@@ -105,8 +108,8 @@ struct rdt_hw_resource rdt_resources_all[] = {
 		.r_resctrl = {
 			.rid			= RDT_RESOURCE_SMBA,
 			.name			= "SMBA",
-			.scope			= RESCTRL_L3_CACHE,
-			.domains		= domain_init(RDT_RESOURCE_SMBA),
+			.ctrl_scope		= RESCTRL_L3_CACHE,
+			.ctrl_domains		= ctrl_domain_init(RDT_RESOURCE_SMBA),
 			.parse_ctrlval		= parse_bw,
 			.format_str		= "%d=%*u",
 			.fflags			= RFTYPE_RES_MB,
@@ -352,11 +355,11 @@ cat_wrmsr(struct rdt_domain *d, struct msr_param *m, struct rdt_resource *r)
 		wrmsrl(hw_res->msr_base + i, hw_dom->ctrl_val[i]);
 }
 
-struct rdt_domain *get_domain_from_cpu(int cpu, struct rdt_resource *r)
+struct rdt_domain *get_ctrl_domain_from_cpu(int cpu, struct rdt_resource *r)
 {
 	struct rdt_domain *d;
 
-	list_for_each_entry(d, &r->domains, hdr.list) {
+	list_for_each_entry(d, &r->ctrl_domains, hdr.list) {
 		/* Find the domain that contains this CPU */
 		if (cpumask_test_cpu(cpu, &d->hdr.cpu_mask))
 			return d;
@@ -378,7 +381,7 @@ void rdt_ctrl_update(void *arg)
 	int cpu = smp_processor_id();
 	struct rdt_domain *d;
 
-	d = get_domain_from_cpu(cpu, r);
+	d = get_ctrl_domain_from_cpu(cpu, r);
 	if (d) {
 		hw_res->msr_update(d, m, r);
 		return;
@@ -388,26 +391,26 @@ void rdt_ctrl_update(void *arg)
 }
 
 /*
- * rdt_find_domain - Find a domain in a resource that matches input resource id
+ * rdt_find_domain - Search for a domain id in a resource domain list.
  *
- * Search resource r's domain list to find the resource id. If the resource
- * id is found in a domain, return the domain. Otherwise, if requested by
- * caller, return the first domain whose id is bigger than the input id.
- * The domain list is sorted by id in ascending order.
+ * Search the domain list to find the domain id. If the domain id is
+ * found, return the domain. NULL otherwise.  If the domain id is not
+ * found (and NULL returned) then the first domain with id bigger than
+ * the input id can be returned to the caller via @pos.
  */
-struct rdt_domain *rdt_find_domain(struct rdt_resource *r, int id,
-				   struct list_head **pos)
+struct rdt_domain_hdr *rdt_find_domain(struct list_head *h, int id,
+				       struct list_head **pos)
 {
-	struct rdt_domain *d;
+	struct rdt_domain_hdr *d;
 	struct list_head *l;
 
-	list_for_each(l, &r->domains) {
-		d = list_entry(l, struct rdt_domain, hdr.list);
+	list_for_each(l, h) {
+		d = list_entry(l, struct rdt_domain_hdr, list);
 		/* When id is found, return its domain. */
-		if (id == d->hdr.id)
+		if (id == d->id)
 			return d;
 		/* Stop searching when finding id's position in sorted list. */
-		if (id < d->hdr.id)
+		if (id < d->id)
 			break;
 	}
 
@@ -501,35 +504,28 @@ static int get_domain_id_from_scope(int cpu, enum resctrl_scope scope)
 	return -EINVAL;
 }
 
-/*
- * domain_add_cpu - Add a cpu to a resource's domain list.
- *
- * If an existing domain in the resource r's domain list matches the cpu's
- * resource id, add the cpu in the domain.
- *
- * Otherwise, a new domain is allocated and inserted into the right position
- * in the domain list sorted by id in ascending order.
- *
- * The order in the domain list is visible to users when we print entries
- * in the schemata file and schemata input is validated to have the same order
- * as this list.
- */
-static void domain_add_cpu(int cpu, struct rdt_resource *r)
+static void domain_add_cpu_ctrl(int cpu, struct rdt_resource *r)
 {
-	int id = get_domain_id_from_scope(cpu, r->scope);
+	int id = get_domain_id_from_scope(cpu, r->ctrl_scope);
 	struct list_head *add_pos = NULL;
 	struct rdt_hw_domain *hw_dom;
+	struct rdt_domain_hdr *hdr;
 	struct rdt_domain *d;
 	int err;
 
 	if (id < 0) {
-		pr_warn_once("Can't find domain id for CPU:%d scope:%d for resource %s\n",
-			     cpu, r->scope, r->name);
+		pr_warn_once("Can't find control domain id for CPU:%d scope:%d for resource %s\n",
+			     cpu, r->ctrl_scope, r->name);
 		return;
 	}
 
-	d = rdt_find_domain(r, id, &add_pos);
-	if (d) {
+	hdr = rdt_find_domain(&r->ctrl_domains, id, &add_pos);
+	if (hdr) {
+		if (WARN_ON_ONCE(hdr->type != RESCTRL_CTRL_DOMAIN))
+			return;
+
+		d = container_of(hdr, struct rdt_domain, hdr);
+
 		cpumask_set_cpu(cpu, &d->hdr.cpu_mask);
 		if (r->cache.arch_has_per_cpu_cfg)
 			rdt_domain_reconfigure_cdp(r);
@@ -542,51 +538,114 @@ static void domain_add_cpu(int cpu, struct rdt_resource *r)
 
 	d = &hw_dom->d_resctrl;
 	d->hdr.id = id;
+	d->hdr.type = RESCTRL_CTRL_DOMAIN;
 	cpumask_set_cpu(cpu, &d->hdr.cpu_mask);
 
 	rdt_domain_reconfigure_cdp(r);
 
-	if (r->alloc_capable && domain_setup_ctrlval(r, d)) {
+	if (domain_setup_ctrlval(r, d)) {
 		domain_free(hw_dom);
 		return;
 	}
 
-	if (r->mon_capable && arch_domain_mbm_alloc(r->num_rmid, hw_dom)) {
+	list_add_tail(&d->hdr.list, add_pos);
+
+	err = resctrl_online_ctrl_domain(r, d);
+	if (err) {
+		list_del(&d->hdr.list);
+		domain_free(hw_dom);
+	}
+}
+
+static void domain_add_cpu_mon(int cpu, struct rdt_resource *r)
+{
+	int id = get_domain_id_from_scope(cpu, r->mon_scope);
+	struct list_head *add_pos = NULL;
+	struct rdt_hw_domain *hw_dom;
+	struct rdt_domain_hdr *hdr;
+	struct rdt_domain *d;
+	int err;
+
+	if (id < 0) {
+		pr_warn_once("Can't find monitor domain id for CPU:%d scope:%d for resource %s\n",
+			     cpu, r->mon_scope, r->name);
+		return;
+	}
+
+	hdr = rdt_find_domain(&r->mon_domains, id, &add_pos);
+	if (hdr) {
+		if (WARN_ON_ONCE(hdr->type != RESCTRL_MON_DOMAIN))
+			return;
+
+		d = container_of(hdr, struct rdt_domain, hdr);
+
+		cpumask_set_cpu(cpu, &d->hdr.cpu_mask);
+		return;
+	}
+
+	hw_dom = kzalloc_node(sizeof(*hw_dom), GFP_KERNEL, cpu_to_node(cpu));
+	if (!hw_dom)
+		return;
+
+	d = &hw_dom->d_resctrl;
+	d->hdr.id = id;
+	d->hdr.type = RESCTRL_MON_DOMAIN;
+	cpumask_set_cpu(cpu, &d->hdr.cpu_mask);
+
+	if (arch_domain_mbm_alloc(r->num_rmid, hw_dom)) {
 		domain_free(hw_dom);
 		return;
 	}
 
 	list_add_tail(&d->hdr.list, add_pos);
 
-	err = resctrl_online_domain(r, d);
+	err = resctrl_online_mon_domain(r, d);
 	if (err) {
 		list_del(&d->hdr.list);
 		domain_free(hw_dom);
 	}
 }
 
-static void domain_remove_cpu(int cpu, struct rdt_resource *r)
+/*
+ * domain_add_cpu - Add a CPU to either/both resource's domain lists.
+ */
+static void domain_add_cpu(int cpu, struct rdt_resource *r)
 {
-	int id = get_domain_id_from_scope(cpu, r->scope);
+	if (r->alloc_capable)
+		domain_add_cpu_ctrl(cpu, r);
+	if (r->mon_capable)
+		domain_add_cpu_mon(cpu, r);
+}
+
+static void domain_remove_cpu_ctrl(int cpu, struct rdt_resource *r)
+{
+	int id = get_domain_id_from_scope(cpu, r->ctrl_scope);
 	struct rdt_hw_domain *hw_dom;
+	struct rdt_domain_hdr *hdr;
 	struct rdt_domain *d;
 
 	if (id < 0) {
-		pr_warn_once("Can't find domain id for CPU:%d scope:%d for resource %s\n",
-			     cpu, r->scope, r->name);
+		pr_warn_once("Can't find control domain id for CPU:%d scope:%d for resource %s\n",
+			     cpu, r->ctrl_scope, r->name);
 		return;
 	}
 
-	d = rdt_find_domain(r, id, NULL);
-	if (!d) {
-		pr_warn("Couldn't find domain with id=%d for CPU %d\n", id, cpu);
+	hdr = rdt_find_domain(&r->ctrl_domains, id, NULL);
+	if (!hdr) {
+		pr_warn("Can't find control domain for id=%d for CPU %d for resource %s\n",
+			id, cpu, r->name);
 		return;
 	}
+
+	if (WARN_ON_ONCE(hdr->type != RESCTRL_CTRL_DOMAIN))
+		return;
+
+	d = container_of(hdr, struct rdt_domain, hdr);
 	hw_dom = resctrl_to_arch_dom(d);
 
 	cpumask_clear_cpu(cpu, &d->hdr.cpu_mask);
 	if (cpumask_empty(&d->hdr.cpu_mask)) {
-		resctrl_offline_domain(r, d);
+		resctrl_offline_ctrl_domain(r, d);
 		list_del(&d->hdr.list);
 
 		/*
@@ -599,6 +658,42 @@ static void domain_remove_cpu(int cpu, struct rdt_resource *r)
 
 		return;
 	}
+}
+
+static void domain_remove_cpu_mon(int cpu, struct rdt_resource *r)
+{
+	int id = get_domain_id_from_scope(cpu, r->mon_scope);
+	struct rdt_hw_domain *hw_dom;
+	struct rdt_domain_hdr *hdr;
+	struct rdt_domain *d;
+
+	if (id < 0) {
+		pr_warn_once("Can't find monitor domain id for CPU:%d scope:%d for resource %s\n",
+			     cpu, r->mon_scope, r->name);
+		return;
+	}
+
+	hdr = rdt_find_domain(&r->mon_domains, id, NULL);
+	if (!hdr) {
+		pr_warn("Can't find monitor domain for id=%d for CPU %d for resource %s\n",
+			id, cpu, r->name);
+		return;
+	}
+
+	if (WARN_ON_ONCE(hdr->type != RESCTRL_MON_DOMAIN))
+		return;
+
+	d = container_of(hdr, struct rdt_domain, hdr);
+	hw_dom = resctrl_to_arch_dom(d);
+
+	cpumask_clear_cpu(cpu, &d->hdr.cpu_mask);
+	if (cpumask_empty(&d->hdr.cpu_mask)) {
+		resctrl_offline_mon_domain(r, d);
+		list_del(&d->hdr.list);
+		domain_free(hw_dom);
+
+		return;
+	}
 
 	if (r == &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl) {
 		if (is_mbm_enabled() && cpu == d->mbm_work_cpu) {
@@ -613,6 +708,14 @@ static void domain_remove_cpu(int cpu, struct rdt_resource *r)
 	}
 }
 
+static void domain_remove_cpu(int cpu, struct rdt_resource *r)
+{
+	if (r->alloc_capable)
+		domain_remove_cpu_ctrl(cpu, r);
+	if (r->mon_capable)
+		domain_remove_cpu_mon(cpu, r);
+}
+
 static void clear_closid_rmid(int cpu)
 {
 	struct resctrl_pqr_state *state = this_cpu_ptr(&pqr_state);
diff --git a/arch/x86/kernel/cpu/resctrl/ctrlmondata.c b/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
index 23f8258d36a8..0b4136c42762 100644
--- a/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
+++ b/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
@@ -226,7 +226,7 @@ static int parse_line(char *line, struct resctrl_schema *s,
 		return -EINVAL;
 	}
 	dom = strim(dom);
-	list_for_each_entry(d, &r->domains, hdr.list) {
+	list_for_each_entry(d, &r->ctrl_domains, hdr.list) {
 		if (d->hdr.id == dom_id) {
 			data.buf = dom;
 			data.rdtgrp = rdtgrp;
@@ -318,7 +318,7 @@ int resctrl_arch_update_domains(struct rdt_resource *r, u32 closid)
 		return -ENOMEM;
 
 	msr_param.res = NULL;
-	list_for_each_entry(d, &r->domains, hdr.list) {
+	list_for_each_entry(d, &r->ctrl_domains, hdr.list) {
 		hw_dom = resctrl_to_arch_dom(d);
 		for (t = 0; t < CDP_NUM_TYPES; t++) {
 			cfg = &hw_dom->d_resctrl.staged_config[t];
@@ -466,7 +466,7 @@ static void show_doms(struct seq_file *s, struct resctrl_schema *schema, int clo
 	u32 ctrl_val;
 
 	seq_printf(s, "%*s:", max_name_width, schema->name);
-	list_for_each_entry(dom, &r->domains, hdr.list) {
+	list_for_each_entry(dom, &r->ctrl_domains, hdr.list) {
 		if (sep)
 			seq_puts(s, ";");
 
@@ -542,6 +542,7 @@ void mon_event_read(struct rmid_read *rr, struct rdt_resource *r,
 int rdtgroup_mondata_show(struct seq_file *m, void *arg)
 {
 	struct kernfs_open_file *of = m->private;
+	struct rdt_domain_hdr *hdr;
 	u32 resid, evtid, domid;
 	struct rdtgroup *rdtgrp;
 	struct rdt_resource *r;
@@ -562,11 +563,12 @@ int rdtgroup_mondata_show(struct seq_file *m, void *arg)
 	evtid = md.u.evtid;
 
 	r = &rdt_resources_all[resid].r_resctrl;
-	d = rdt_find_domain(r, domid, NULL);
-	if (!d) {
+	hdr = rdt_find_domain(&r->mon_domains, domid, NULL);
+	if (!hdr || WARN_ON_ONCE(hdr->type != RESCTRL_MON_DOMAIN)) {
 		ret = -ENOENT;
 		goto out;
 	}
+	d = container_of(hdr, struct rdt_domain, hdr);
 
 	mon_event_read(&rr, r, d, rdtgrp, evtid, false);
 
diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c
index dd0ea1bc0092..ec5ad926c5dc 100644
--- a/arch/x86/kernel/cpu/resctrl/monitor.c
+++ b/arch/x86/kernel/cpu/resctrl/monitor.c
@@ -340,7 +340,7 @@ static void add_rmid_to_limbo(struct rmid_entry *entry)
 
 	entry->busy = 0;
 	cpu = get_cpu();
-	list_for_each_entry(d, &r->domains, hdr.list) {
+	list_for_each_entry(d, &r->mon_domains, hdr.list) {
 		if (cpumask_test_cpu(cpu, &d->hdr.cpu_mask)) {
 			err = resctrl_arch_rmid_read(r, d, entry->rmid,
 						     QOS_L3_OCCUP_EVENT_ID,
@@ -535,7 +535,7 @@ static void update_mba_bw(struct rdtgroup *rgrp, struct rdt_domain *dom_mbm)
 	rmid = rgrp->mon.rmid;
 	pmbm_data = &dom_mbm->mbm_local[rmid];
 
-	dom_mba = get_domain_from_cpu(smp_processor_id(), r_mba);
+	dom_mba = get_ctrl_domain_from_cpu(smp_processor_id(), r_mba);
 	if (!dom_mba) {
 		pr_warn_once("Failure to get domain for MBA update\n");
 		return;
diff --git a/arch/x86/kernel/cpu/resctrl/pseudo_lock.c b/arch/x86/kernel/cpu/resctrl/pseudo_lock.c
index fcbd99e2eb66..ed6d59af1cef 100644
--- a/arch/x86/kernel/cpu/resctrl/pseudo_lock.c
+++ b/arch/x86/kernel/cpu/resctrl/pseudo_lock.c
@@ -292,7 +292,7 @@ static void pseudo_lock_region_clear(struct pseudo_lock_region *plr)
  */
 static int pseudo_lock_region_init(struct pseudo_lock_region *plr)
 {
-	enum resctrl_scope scope = plr->s->res->scope;
+	enum resctrl_scope scope = plr->s->res->ctrl_scope;
 	struct cpu_cacheinfo *ci;
 	int ret;
 	int i;
@@ -856,7 +856,7 @@ bool rdtgroup_pseudo_locked_in_hierarchy(struct rdt_domain *d)
 	 * associated with them.
 	 */
 	for_each_alloc_capable_rdt_resource(r) {
-		list_for_each_entry(d_i, &r->domains, hdr.list) {
+		list_for_each_entry(d_i, &r->ctrl_domains, hdr.list) {
 			if (d_i->plr)
 				cpumask_or(cpu_with_psl, cpu_with_psl,
 					   &d_i->hdr.cpu_mask);
diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
index 04d32602ac33..760013ed1bff 100644
--- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
+++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
@@ -91,7 +91,7 @@ void rdt_staged_configs_clear(void)
 	lockdep_assert_held(&rdtgroup_mutex);
 
 	for_each_alloc_capable_rdt_resource(r) {
-		list_for_each_entry(dom, &r->domains, hdr.list)
+		list_for_each_entry(dom, &r->ctrl_domains, hdr.list)
 			memset(dom->staged_config, 0, sizeof(dom->staged_config));
 	}
 }
@@ -984,7 +984,7 @@ static int rdt_bit_usage_show(struct kernfs_open_file *of,
 
 	mutex_lock(&rdtgroup_mutex);
 	hw_shareable = r->cache.shareable_bits;
-	list_for_each_entry(dom, &r->domains, hdr.list) {
+	list_for_each_entry(dom, &r->ctrl_domains, hdr.list) {
 		if (sep)
 			seq_putc(seq, ';');
 		sw_shareable = 0;
@@ -1302,7 +1302,7 @@ static bool rdtgroup_mode_test_exclusive(struct rdtgroup *rdtgrp)
 		if (r->rid == RDT_RESOURCE_MBA || r->rid == RDT_RESOURCE_SMBA)
 			continue;
 		has_cache = true;
-		list_for_each_entry(d, &r->domains, hdr.list) {
+		list_for_each_entry(d, &r->ctrl_domains, hdr.list) {
 			ctrl = resctrl_arch_get_config(r, d, closid,
 						       s->conf_type);
 			if (rdtgroup_cbm_overlaps(s, d, ctrl, closid, false)) {
@@ -1413,13 +1413,13 @@ unsigned int rdtgroup_cbm_to_size(struct rdt_resource *r,
 	unsigned int size = 0;
 	int num_b, i;
 
-	if (WARN_ON_ONCE(r->scope != RESCTRL_L2_CACHE && r->scope != RESCTRL_L3_CACHE))
+	if (WARN_ON_ONCE(r->ctrl_scope != RESCTRL_L2_CACHE && r->ctrl_scope != RESCTRL_L3_CACHE))
 		return size;
 
 	num_b = bitmap_weight(&cbm, r->cache.cbm_len);
 	ci = get_cpu_cacheinfo(cpumask_any(&d->hdr.cpu_mask));
 	for (i = 0; i < ci->num_leaves; i++) {
-		if (ci->info_list[i].level == r->scope) {
+		if (ci->info_list[i].level == r->ctrl_scope) {
 			size = ci->info_list[i].size / r->cache.cbm_len * num_b;
 			break;
 		}
@@ -1477,7 +1477,7 @@ static int rdtgroup_size_show(struct kernfs_open_file *of,
 		type = schema->conf_type;
 		sep = false;
 		seq_printf(s, "%*s:", max_name_width, schema->name);
-		list_for_each_entry(d, &r->domains, hdr.list) {
+		list_for_each_entry(d, &r->ctrl_domains, hdr.list) {
 			if (sep)
 				seq_putc(s, ';');
 			if (rdtgrp->mode == RDT_MODE_PSEUDO_LOCKSETUP) {
@@ -1566,7 +1566,7 @@ static int mbm_config_show(struct seq_file *s, struct rdt_resource *r, u32 evtid
 
 	mutex_lock(&rdtgroup_mutex);
 
-	list_for_each_entry(dom, &r->domains, hdr.list) {
+	list_for_each_entry(dom, &r->mon_domains, hdr.list) {
 		if (sep)
 			seq_puts(s, ";");
 
@@ -1689,7 +1689,7 @@ static int mon_config_write(struct rdt_resource *r, char *tok, u32 evtid)
 		return -EINVAL;
 	}
 
-	list_for_each_entry(d, &r->domains, hdr.list) {
+	list_for_each_entry(d, &r->mon_domains, hdr.list) {
 		if (d->hdr.id == dom_id) {
 			ret = mbm_config_write_domain(r, d, evtid, val);
 			if (ret)
@@ -2232,7 +2232,7 @@ static int set_cache_qos_cfg(int level, bool enable)
 		return -ENOMEM;
 
 	r_l = &rdt_resources_all[level].r_resctrl;
-	list_for_each_entry(d, &r_l->domains, hdr.list) {
+	list_for_each_entry(d, &r_l->ctrl_domains, hdr.list) {
 		if (r_l->cache.arch_has_per_cpu_cfg)
 			/* Pick all the CPUs in the domain instance */
 			for_each_cpu(cpu, &d->hdr.cpu_mask)
@@ -2317,7 +2317,7 @@ static int set_mba_sc(bool mba_sc)
 
 	r->membw.mba_sc = mba_sc;
 
-	list_for_each_entry(d, &r->domains, hdr.list) {
+	list_for_each_entry(d, &r->ctrl_domains, hdr.list) {
 		for (i = 0; i < num_closid; i++)
 			d->mbps_val[i] = MBA_MAX_MBPS;
 	}
@@ -2653,7 +2653,7 @@ static int rdt_get_tree(struct fs_context *fc)
 
 	if (is_mbm_enabled()) {
 		r = &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl;
-		list_for_each_entry(dom, &r->domains, hdr.list)
+		list_for_each_entry(dom, &r->mon_domains, hdr.list)
 			mbm_setup_overflow_handler(dom, MBM_OVERFLOW_INTERVAL);
 	}
 
@@ -2777,10 +2777,10 @@ static int reset_all_ctrls(struct rdt_resource *r)
 
 	/*
 	 * Disable resource control for this resource by setting all
-	 * CBMs in all domains to the maximum mask value. Pick one CPU
+	 * CBMs in all ctrl_domains to the maximum mask value. Pick one CPU
 	 * from each domain to update the MSRs below.
 	 */
-	list_for_each_entry(d, &r->domains, hdr.list) {
+	list_for_each_entry(d, &r->ctrl_domains, hdr.list) {
 		hw_dom = resctrl_to_arch_dom(d);
 		cpumask_set_cpu(cpumask_any(&d->hdr.cpu_mask), cpu_mask);
 
@@ -3050,7 +3050,7 @@ static int mkdir_mondata_subdir_alldom(struct kernfs_node *parent_kn,
 	struct rdt_domain *dom;
 	int ret;
 
-	list_for_each_entry(dom, &r->domains, hdr.list) {
+	list_for_each_entry(dom, &r->mon_domains, hdr.list) {
 		ret = mkdir_mondata_subdir(parent_kn, dom, r, prgrp);
 		if (ret)
 			return ret;
@@ -3232,7 +3232,7 @@ static int rdtgroup_init_cat(struct resctrl_schema *s, u32 closid)
 	struct rdt_domain *d;
 	int ret;
 
-	list_for_each_entry(d, &s->res->domains, hdr.list) {
+	list_for_each_entry(d, &s->res->ctrl_domains, hdr.list) {
 		ret = __init_one_rdt_domain(d, s, closid);
 		if (ret < 0)
 			return ret;
@@ -3247,7 +3247,7 @@ static void rdtgroup_init_mba(struct rdt_resource *r, u32 closid)
 	struct resctrl_staged_config *cfg;
 	struct rdt_domain *d;
 
-	list_for_each_entry(d, &r->domains, hdr.list) {
+	list_for_each_entry(d, &r->ctrl_domains, hdr.list) {
 		if (is_mba_sc(r)) {
 			d->mbps_val[closid] = MBA_MAX_MBPS;
 			continue;
@@ -3849,15 +3849,17 @@ static void domain_destroy_mon_state(struct rdt_domain *d)
 	kfree(d->mbm_local);
 }
 
-void resctrl_offline_domain(struct rdt_resource *r, struct rdt_domain *d)
+void resctrl_offline_ctrl_domain(struct rdt_resource *r, struct rdt_domain *d)
 {
 	lockdep_assert_held(&rdtgroup_mutex);
 
 	if (supports_mba_mbps() && r->rid == RDT_RESOURCE_MBA)
 		mba_sc_domain_destroy(r, d);
+}
 
-	if (!r->mon_capable)
-		return;
+void resctrl_offline_mon_domain(struct rdt_resource *r, struct rdt_domain *d)
+{
+	lockdep_assert_held(&rdtgroup_mutex);
 
 	/*
 	 * If resctrl is mounted, remove all the
@@ -3914,18 +3916,21 @@ static int domain_setup_mon_state(struct rdt_resource *r, struct rdt_domain *d)
 	return 0;
 }
 
-int resctrl_online_domain(struct rdt_resource *r, struct rdt_domain *d)
+int resctrl_online_ctrl_domain(struct rdt_resource *r, struct rdt_domain *d)
 {
-	int err;
-
 	lockdep_assert_held(&rdtgroup_mutex);
 
 	if (supports_mba_mbps() && r->rid == RDT_RESOURCE_MBA)
-		/* RDT_RESOURCE_MBA is never mon_capable */
 		return mba_sc_domain_allocate(r, d);
 
-	if (!r->mon_capable)
-		return 0;
+	return 0;
+}
+
+int resctrl_online_mon_domain(struct rdt_resource *r, struct rdt_domain *d)
+{
+	int err;
+
+	lockdep_assert_held(&rdtgroup_mutex);
 
 	err = domain_setup_mon_state(r, d);
 	if (err)
-- 
2.41.0



* [PATCH v13 4/8] x86/resctrl: Split the rdt_domain and rdt_hw_domain structures
  2023-12-04 18:53                 ` [PATCH v13 0/8] Add support for Sub-NUMA cluster (SNC) systems Tony Luck
                                     ` (2 preceding siblings ...)
  2023-12-04 18:53                   ` [PATCH v13 3/8] x86/resctrl: Prepare for different scope for control/monitor operations Tony Luck
@ 2023-12-04 18:53                   ` Tony Luck
  2023-12-04 23:03                     ` Moger, Babu
  2023-12-04 18:53                   ` [PATCH v13 5/8] x86/resctrl: Add node-scope to the options for feature scope Tony Luck
                                     ` (6 subsequent siblings)
  10 siblings, 1 reply; 331+ messages in thread
From: Tony Luck @ 2023-12-04 18:53 UTC (permalink / raw)
  To: Fenghua Yu, Reinette Chatre, Peter Newman, Jonathan Corbet,
	Shuah Khan, x86
  Cc: Shaopeng Tan, James Morse, Jamie Iles, Babu Moger, Randy Dunlap,
	linux-kernel, linux-doc, patches, Tony Luck, Shaopeng Tan

The same rdt_domain structure is used for both control and monitor
functions, but this wastes memory because some of the fields are used
only by control functions while most are used only by monitor
functions.

Split into separate rdt_ctrl_domain and rdt_mon_domain structures with
just the fields required for control and monitoring respectively.

Similarly, split the rdt_hw_domain structure into rdt_hw_ctrl_domain
and rdt_hw_mon_domain.
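
For illustration only (a hedged sketch, not part of the patch; the
helper names are made up), callers holding the common header recover
the specific domain type with container_of() after checking hdr->type,
as the earlier patch does in domain_add_cpu_ctrl()/domain_add_cpu_mon():

static struct rdt_ctrl_domain *hdr_to_ctrl_domain(struct rdt_domain_hdr *hdr)
{
	/* Only a header embedded in a control domain may be converted. */
	if (WARN_ON_ONCE(hdr->type != RESCTRL_CTRL_DOMAIN))
		return NULL;
	return container_of(hdr, struct rdt_ctrl_domain, hdr);
}

static struct rdt_mon_domain *hdr_to_mon_domain(struct rdt_domain_hdr *hdr)
{
	/* Likewise for a header embedded in a monitor domain. */
	if (WARN_ON_ONCE(hdr->type != RESCTRL_MON_DOMAIN))
		return NULL;
	return container_of(hdr, struct rdt_mon_domain, hdr);
}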

Signed-off-by: Tony Luck <tony.luck@intel.com>
Tested-by: Shaopeng Tan <tan.shaopeng@jp.fujitsu.com>
Reviewed-by: Peter Newman <peternewman@google.com>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Reviewed-by: Shaopeng Tan <tan.shaopeng@jp.fujitsu.com>
---
 include/linux/resctrl.h                   | 48 +++++++------
 arch/x86/kernel/cpu/resctrl/internal.h    | 60 ++++++++++------
 arch/x86/kernel/cpu/resctrl/core.c        | 87 ++++++++++++-----------
 arch/x86/kernel/cpu/resctrl/ctrlmondata.c | 32 ++++-----
 arch/x86/kernel/cpu/resctrl/monitor.c     | 40 +++++------
 arch/x86/kernel/cpu/resctrl/pseudo_lock.c |  6 +-
 arch/x86/kernel/cpu/resctrl/rdtgroup.c    | 62 ++++++++--------
 7 files changed, 182 insertions(+), 153 deletions(-)

diff --git a/include/linux/resctrl.h b/include/linux/resctrl.h
index 35e700edc6e6..058a940c3239 100644
--- a/include/linux/resctrl.h
+++ b/include/linux/resctrl.h
@@ -72,7 +72,23 @@ struct rdt_domain_hdr {
 };
 
 /**
- * struct rdt_domain - group of CPUs sharing a resctrl resource
+ * struct rdt_ctrl_domain - group of CPUs sharing a resctrl control resource
+ * @hdr:		common header for different domain types
+ * @plr:		pseudo-locked region (if any) associated with domain
+ * @staged_config:	parsed configuration to be applied
+ * @mbps_val:		When mba_sc is enabled, this holds the array of user
+ *			specified control values for mba_sc in MBps, indexed
+ *			by closid
+ */
+struct rdt_ctrl_domain {
+	struct rdt_domain_hdr		hdr;
+	struct pseudo_lock_region	*plr;
+	struct resctrl_staged_config	staged_config[CDP_NUM_TYPES];
+	u32				*mbps_val;
+};
+
+/**
+ * struct rdt_mon_domain - group of CPUs sharing a resctrl monitor resource
  * @hdr:		common header for different domain types
  * @rmid_busy_llc:	bitmap of which limbo RMIDs are above threshold
  * @mbm_total:		saved state for MBM total bandwidth
@@ -81,13 +97,8 @@ struct rdt_domain_hdr {
  * @cqm_limbo:		worker to periodically read CQM h/w counters
  * @mbm_work_cpu:	worker CPU for MBM h/w counters
  * @cqm_work_cpu:	worker CPU for CQM h/w counters
- * @plr:		pseudo-locked region (if any) associated with domain
- * @staged_config:	parsed configuration to be applied
- * @mbps_val:		When mba_sc is enabled, this holds the array of user
- *			specified control values for mba_sc in MBps, indexed
- *			by closid
  */
-struct rdt_domain {
+struct rdt_mon_domain {
 	struct rdt_domain_hdr		hdr;
 	unsigned long			*rmid_busy_llc;
 	struct mbm_state		*mbm_total;
@@ -96,9 +107,6 @@ struct rdt_domain {
 	struct delayed_work		cqm_limbo;
 	int				mbm_work_cpu;
 	int				cqm_work_cpu;
-	struct pseudo_lock_region	*plr;
-	struct resctrl_staged_config	staged_config[CDP_NUM_TYPES];
-	u32				*mbps_val;
 };
 
 /**
@@ -202,7 +210,7 @@ struct rdt_resource {
 	const char		*format_str;
 	int			(*parse_ctrlval)(struct rdt_parse_data *data,
 						 struct resctrl_schema *s,
-						 struct rdt_domain *d);
+						 struct rdt_ctrl_domain *d);
 	struct list_head	evt_list;
 	unsigned long		fflags;
 	bool			cdp_capable;
@@ -236,15 +244,15 @@ int resctrl_arch_update_domains(struct rdt_resource *r, u32 closid);
  * Update the ctrl_val and apply this config right now.
  * Must be called on one of the domain's CPUs.
  */
-int resctrl_arch_update_one(struct rdt_resource *r, struct rdt_domain *d,
+int resctrl_arch_update_one(struct rdt_resource *r, struct rdt_ctrl_domain *d,
 			    u32 closid, enum resctrl_conf_type t, u32 cfg_val);
 
-u32 resctrl_arch_get_config(struct rdt_resource *r, struct rdt_domain *d,
+u32 resctrl_arch_get_config(struct rdt_resource *r, struct rdt_ctrl_domain *d,
 			    u32 closid, enum resctrl_conf_type type);
-int resctrl_online_ctrl_domain(struct rdt_resource *r, struct rdt_domain *d);
-int resctrl_online_mon_domain(struct rdt_resource *r, struct rdt_domain *d);
-void resctrl_offline_ctrl_domain(struct rdt_resource *r, struct rdt_domain *d);
-void resctrl_offline_mon_domain(struct rdt_resource *r, struct rdt_domain *d);
+int resctrl_online_ctrl_domain(struct rdt_resource *r, struct rdt_ctrl_domain *d);
+int resctrl_online_mon_domain(struct rdt_resource *r, struct rdt_mon_domain *d);
+void resctrl_offline_ctrl_domain(struct rdt_resource *r, struct rdt_ctrl_domain *d);
+void resctrl_offline_mon_domain(struct rdt_resource *r, struct rdt_mon_domain *d);
 
 /**
  * resctrl_arch_rmid_read() - Read the eventid counter corresponding to rmid
@@ -260,7 +268,7 @@ void resctrl_offline_mon_domain(struct rdt_resource *r, struct rdt_domain *d);
  * Return:
  * 0 on success, or -EIO, -EINVAL etc on error.
  */
-int resctrl_arch_rmid_read(struct rdt_resource *r, struct rdt_domain *d,
+int resctrl_arch_rmid_read(struct rdt_resource *r, struct rdt_mon_domain *d,
 			   u32 rmid, enum resctrl_event_id eventid, u64 *val);
 
 /**
@@ -273,7 +281,7 @@ int resctrl_arch_rmid_read(struct rdt_resource *r, struct rdt_domain *d,
  *
  * This can be called from any CPU.
  */
-void resctrl_arch_reset_rmid(struct rdt_resource *r, struct rdt_domain *d,
+void resctrl_arch_reset_rmid(struct rdt_resource *r, struct rdt_mon_domain *d,
 			     u32 rmid, enum resctrl_event_id eventid);
 
 /**
@@ -285,7 +293,7 @@ void resctrl_arch_reset_rmid(struct rdt_resource *r, struct rdt_domain *d,
  *
  * This can be called from any CPU.
  */
-void resctrl_arch_reset_rmid_all(struct rdt_resource *r, struct rdt_domain *d);
+void resctrl_arch_reset_rmid_all(struct rdt_resource *r, struct rdt_mon_domain *d);
 
 extern unsigned int resctrl_rmid_realloc_threshold;
 extern unsigned int resctrl_rmid_realloc_limit;
diff --git a/arch/x86/kernel/cpu/resctrl/internal.h b/arch/x86/kernel/cpu/resctrl/internal.h
index 24bf9d7989a9..ce3a70657842 100644
--- a/arch/x86/kernel/cpu/resctrl/internal.h
+++ b/arch/x86/kernel/cpu/resctrl/internal.h
@@ -107,7 +107,7 @@ union mon_data_bits {
 struct rmid_read {
 	struct rdtgroup		*rgrp;
 	struct rdt_resource	*r;
-	struct rdt_domain	*d;
+	struct rdt_mon_domain	*d;
 	enum resctrl_event_id	evtid;
 	bool			first;
 	int			err;
@@ -192,7 +192,7 @@ struct mongroup {
  */
 struct pseudo_lock_region {
 	struct resctrl_schema	*s;
-	struct rdt_domain	*d;
+	struct rdt_ctrl_domain	*d;
 	u32			cbm;
 	wait_queue_head_t	lock_thread_wq;
 	int			thread_done;
@@ -319,25 +319,41 @@ struct arch_mbm_state {
 };
 
 /**
- * struct rdt_hw_domain - Arch private attributes of a set of CPUs that share
- *			  a resource
+ * struct rdt_hw_ctrl_domain - Arch private attributes of a set of CPUs that share
+ *			       a resource for a control function
  * @d_resctrl:	Properties exposed to the resctrl file system
  * @ctrl_val:	array of cache or mem ctrl values (indexed by CLOSID)
+ *
+ * Members of this structure are accessed via helpers that provide abstraction.
+ */
+struct rdt_hw_ctrl_domain {
+	struct rdt_ctrl_domain		d_resctrl;
+	u32				*ctrl_val;
+};
+
+/**
+ * struct rdt_hw_mon_domain - Arch private attributes of a set of CPUs that share
+ *			      a resource for a monitor function
+ * @d_resctrl:	Properties exposed to the resctrl file system
  * @arch_mbm_total:	arch private state for MBM total bandwidth
  * @arch_mbm_local:	arch private state for MBM local bandwidth
  *
  * Members of this structure are accessed via helpers that provide abstraction.
  */
-struct rdt_hw_domain {
-	struct rdt_domain		d_resctrl;
-	u32				*ctrl_val;
+struct rdt_hw_mon_domain {
+	struct rdt_mon_domain		d_resctrl;
 	struct arch_mbm_state		*arch_mbm_total;
 	struct arch_mbm_state		*arch_mbm_local;
 };
 
-static inline struct rdt_hw_domain *resctrl_to_arch_dom(struct rdt_domain *r)
+static inline struct rdt_hw_ctrl_domain *resctrl_to_arch_ctrl_dom(struct rdt_ctrl_domain *r)
+{
+	return container_of(r, struct rdt_hw_ctrl_domain, d_resctrl);
+}
+
+static inline struct rdt_hw_mon_domain *resctrl_to_arch_mon_dom(struct rdt_mon_domain *r)
 {
-	return container_of(r, struct rdt_hw_domain, d_resctrl);
+	return container_of(r, struct rdt_hw_mon_domain, d_resctrl);
 }
 
 /**
@@ -405,7 +421,7 @@ struct rdt_hw_resource {
 	struct rdt_resource	r_resctrl;
 	u32			num_closid;
 	unsigned int		msr_base;
-	void (*msr_update)	(struct rdt_domain *d, struct msr_param *m,
+	void (*msr_update)	(struct rdt_ctrl_domain *d, struct msr_param *m,
 				 struct rdt_resource *r);
 	unsigned int		mon_scale;
 	unsigned int		mbm_width;
@@ -418,9 +434,9 @@ static inline struct rdt_hw_resource *resctrl_to_arch_res(struct rdt_resource *r
 }
 
 int parse_cbm(struct rdt_parse_data *data, struct resctrl_schema *s,
-	      struct rdt_domain *d);
+	      struct rdt_ctrl_domain *d);
 int parse_bw(struct rdt_parse_data *data, struct resctrl_schema *s,
-	     struct rdt_domain *d);
+	     struct rdt_ctrl_domain *d);
 
 extern struct mutex rdtgroup_mutex;
 
@@ -526,21 +542,21 @@ ssize_t rdtgroup_schemata_write(struct kernfs_open_file *of,
 				char *buf, size_t nbytes, loff_t off);
 int rdtgroup_schemata_show(struct kernfs_open_file *of,
 			   struct seq_file *s, void *v);
-bool rdtgroup_cbm_overlaps(struct resctrl_schema *s, struct rdt_domain *d,
+bool rdtgroup_cbm_overlaps(struct resctrl_schema *s, struct rdt_ctrl_domain *d,
 			   unsigned long cbm, int closid, bool exclusive);
-unsigned int rdtgroup_cbm_to_size(struct rdt_resource *r, struct rdt_domain *d,
+unsigned int rdtgroup_cbm_to_size(struct rdt_resource *r, struct rdt_ctrl_domain *d,
 				  unsigned long cbm);
 enum rdtgrp_mode rdtgroup_mode_by_closid(int closid);
 int rdtgroup_tasks_assigned(struct rdtgroup *r);
 int rdtgroup_locksetup_enter(struct rdtgroup *rdtgrp);
 int rdtgroup_locksetup_exit(struct rdtgroup *rdtgrp);
-bool rdtgroup_cbm_overlaps_pseudo_locked(struct rdt_domain *d, unsigned long cbm);
-bool rdtgroup_pseudo_locked_in_hierarchy(struct rdt_domain *d);
+bool rdtgroup_cbm_overlaps_pseudo_locked(struct rdt_ctrl_domain *d, unsigned long cbm);
+bool rdtgroup_pseudo_locked_in_hierarchy(struct rdt_ctrl_domain *d);
 int rdt_pseudo_lock_init(void);
 void rdt_pseudo_lock_release(void);
 int rdtgroup_pseudo_lock_create(struct rdtgroup *rdtgrp);
 void rdtgroup_pseudo_lock_remove(struct rdtgroup *rdtgrp);
-struct rdt_domain *get_ctrl_domain_from_cpu(int cpu, struct rdt_resource *r);
+struct rdt_ctrl_domain *get_ctrl_domain_from_cpu(int cpu, struct rdt_resource *r);
 int closids_supported(void);
 void closid_free(int closid);
 int alloc_rmid(void);
@@ -550,17 +566,17 @@ bool __init rdt_cpu_has(int flag);
 void mon_event_count(void *info);
 int rdtgroup_mondata_show(struct seq_file *m, void *arg);
 void mon_event_read(struct rmid_read *rr, struct rdt_resource *r,
-		    struct rdt_domain *d, struct rdtgroup *rdtgrp,
+		    struct rdt_mon_domain *d, struct rdtgroup *rdtgrp,
 		    int evtid, int first);
-void mbm_setup_overflow_handler(struct rdt_domain *dom,
+void mbm_setup_overflow_handler(struct rdt_mon_domain *dom,
 				unsigned long delay_ms);
 void mbm_handle_overflow(struct work_struct *work);
 void __init intel_rdt_mbm_apply_quirk(void);
 bool is_mba_sc(struct rdt_resource *r);
-void cqm_setup_limbo_handler(struct rdt_domain *dom, unsigned long delay_ms);
+void cqm_setup_limbo_handler(struct rdt_mon_domain *dom, unsigned long delay_ms);
 void cqm_handle_limbo(struct work_struct *work);
-bool has_busy_rmid(struct rdt_resource *r, struct rdt_domain *d);
-void __check_limbo(struct rdt_domain *d, bool force_free);
+bool has_busy_rmid(struct rdt_resource *r, struct rdt_mon_domain *d);
+void __check_limbo(struct rdt_mon_domain *d, bool force_free);
 void rdt_domain_reconfigure_cdp(struct rdt_resource *r);
 void __init thread_throttle_mode_init(void);
 void __init mbm_config_rftype_init(const char *config);
diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
index 1fd85533b4ca..797cb3bf417a 100644
--- a/arch/x86/kernel/cpu/resctrl/core.c
+++ b/arch/x86/kernel/cpu/resctrl/core.c
@@ -49,12 +49,12 @@ int max_name_width, max_data_width;
 bool rdt_alloc_capable;
 
 static void
-mba_wrmsr_intel(struct rdt_domain *d, struct msr_param *m,
+mba_wrmsr_intel(struct rdt_ctrl_domain *d, struct msr_param *m,
 		struct rdt_resource *r);
 static void
-cat_wrmsr(struct rdt_domain *d, struct msr_param *m, struct rdt_resource *r);
+cat_wrmsr(struct rdt_ctrl_domain *d, struct msr_param *m, struct rdt_resource *r);
 static void
-mba_wrmsr_amd(struct rdt_domain *d, struct msr_param *m,
+mba_wrmsr_amd(struct rdt_ctrl_domain *d, struct msr_param *m,
 	      struct rdt_resource *r);
 
 #define ctrl_domain_init(id) LIST_HEAD_INIT(rdt_resources_all[id].r_resctrl.ctrl_domains)
@@ -307,11 +307,11 @@ static void rdt_get_cdp_l2_config(void)
 }
 
 static void
-mba_wrmsr_amd(struct rdt_domain *d, struct msr_param *m, struct rdt_resource *r)
+mba_wrmsr_amd(struct rdt_ctrl_domain *d, struct msr_param *m, struct rdt_resource *r)
 {
-	unsigned int i;
-	struct rdt_hw_domain *hw_dom = resctrl_to_arch_dom(d);
+	struct rdt_hw_ctrl_domain *hw_dom = resctrl_to_arch_ctrl_dom(d);
 	struct rdt_hw_resource *hw_res = resctrl_to_arch_res(r);
+	unsigned int i;
 
 	for (i = m->low; i < m->high; i++)
 		wrmsrl(hw_res->msr_base + i, hw_dom->ctrl_val[i]);
@@ -332,12 +332,12 @@ static u32 delay_bw_map(unsigned long bw, struct rdt_resource *r)
 }
 
 static void
-mba_wrmsr_intel(struct rdt_domain *d, struct msr_param *m,
+mba_wrmsr_intel(struct rdt_ctrl_domain *d, struct msr_param *m,
 		struct rdt_resource *r)
 {
-	unsigned int i;
-	struct rdt_hw_domain *hw_dom = resctrl_to_arch_dom(d);
+	struct rdt_hw_ctrl_domain *hw_dom = resctrl_to_arch_ctrl_dom(d);
 	struct rdt_hw_resource *hw_res = resctrl_to_arch_res(r);
+	unsigned int i;
 
 	/*  Write the delay values for mba. */
 	for (i = m->low; i < m->high; i++)
@@ -345,19 +345,19 @@ mba_wrmsr_intel(struct rdt_domain *d, struct msr_param *m,
 }
 
 static void
-cat_wrmsr(struct rdt_domain *d, struct msr_param *m, struct rdt_resource *r)
+cat_wrmsr(struct rdt_ctrl_domain *d, struct msr_param *m, struct rdt_resource *r)
 {
-	unsigned int i;
-	struct rdt_hw_domain *hw_dom = resctrl_to_arch_dom(d);
+	struct rdt_hw_ctrl_domain *hw_dom = resctrl_to_arch_ctrl_dom(d);
 	struct rdt_hw_resource *hw_res = resctrl_to_arch_res(r);
+	unsigned int i;
 
 	for (i = m->low; i < m->high; i++)
 		wrmsrl(hw_res->msr_base + i, hw_dom->ctrl_val[i]);
 }
 
-struct rdt_domain *get_ctrl_domain_from_cpu(int cpu, struct rdt_resource *r)
+struct rdt_ctrl_domain *get_ctrl_domain_from_cpu(int cpu, struct rdt_resource *r)
 {
-	struct rdt_domain *d;
+	struct rdt_ctrl_domain *d;
 
 	list_for_each_entry(d, &r->ctrl_domains, hdr.list) {
 		/* Find the domain that contains this CPU */
@@ -379,7 +379,7 @@ void rdt_ctrl_update(void *arg)
 	struct rdt_hw_resource *hw_res = resctrl_to_arch_res(m->res);
 	struct rdt_resource *r = m->res;
 	int cpu = smp_processor_id();
-	struct rdt_domain *d;
+	struct rdt_ctrl_domain *d;
 
 	d = get_ctrl_domain_from_cpu(cpu, r);
 	if (d) {
@@ -434,18 +434,23 @@ static void setup_default_ctrlval(struct rdt_resource *r, u32 *dc)
 		*dc = r->default_ctrl;
 }
 
-static void domain_free(struct rdt_hw_domain *hw_dom)
+static void ctrl_domain_free(struct rdt_hw_ctrl_domain *hw_dom)
+{
+	kfree(hw_dom->ctrl_val);
+	kfree(hw_dom);
+}
+
+static void mon_domain_free(struct rdt_hw_mon_domain *hw_dom)
 {
 	kfree(hw_dom->arch_mbm_total);
 	kfree(hw_dom->arch_mbm_local);
-	kfree(hw_dom->ctrl_val);
 	kfree(hw_dom);
 }
 
-static int domain_setup_ctrlval(struct rdt_resource *r, struct rdt_domain *d)
+static int domain_setup_ctrlval(struct rdt_resource *r, struct rdt_ctrl_domain *d)
 {
+	struct rdt_hw_ctrl_domain *hw_dom = resctrl_to_arch_ctrl_dom(d);
 	struct rdt_hw_resource *hw_res = resctrl_to_arch_res(r);
-	struct rdt_hw_domain *hw_dom = resctrl_to_arch_dom(d);
 	struct msr_param m;
 	u32 *dc;
 
@@ -468,7 +473,7 @@ static int domain_setup_ctrlval(struct rdt_resource *r, struct rdt_domain *d)
  * @num_rmid:	The size of the MBM counter array
  * @hw_dom:	The domain that owns the allocated arrays
  */
-static int arch_domain_mbm_alloc(u32 num_rmid, struct rdt_hw_domain *hw_dom)
+static int arch_domain_mbm_alloc(u32 num_rmid, struct rdt_hw_mon_domain *hw_dom)
 {
 	size_t tsize;
 
@@ -507,10 +512,10 @@ static int get_domain_id_from_scope(int cpu, enum resctrl_scope scope)
 static void domain_add_cpu_ctrl(int cpu, struct rdt_resource *r)
 {
 	int id = get_domain_id_from_scope(cpu, r->ctrl_scope);
+	struct rdt_hw_ctrl_domain *hw_dom;
 	struct list_head *add_pos = NULL;
-	struct rdt_hw_domain *hw_dom;
 	struct rdt_domain_hdr *hdr;
-	struct rdt_domain *d;
+	struct rdt_ctrl_domain *d;
 	int err;
 
 	if (id < 0) {
@@ -524,7 +529,7 @@ static void domain_add_cpu_ctrl(int cpu, struct rdt_resource *r)
 		if (WARN_ON_ONCE(hdr->type != RESCTRL_CTRL_DOMAIN))
 			return;
 
-		d = container_of(hdr, struct rdt_domain, hdr);
+		d = container_of(hdr, struct rdt_ctrl_domain, hdr);
 
 		cpumask_set_cpu(cpu, &d->hdr.cpu_mask);
 		if (r->cache.arch_has_per_cpu_cfg)
@@ -544,7 +549,7 @@ static void domain_add_cpu_ctrl(int cpu, struct rdt_resource *r)
 	rdt_domain_reconfigure_cdp(r);
 
 	if (domain_setup_ctrlval(r, d)) {
-		domain_free(hw_dom);
+		ctrl_domain_free(hw_dom);
 		return;
 	}
 
@@ -553,17 +558,17 @@ static void domain_add_cpu_ctrl(int cpu, struct rdt_resource *r)
 	err = resctrl_online_ctrl_domain(r, d);
 	if (err) {
 		list_del(&d->hdr.list);
-		domain_free(hw_dom);
+		ctrl_domain_free(hw_dom);
 	}
 }
 
 static void domain_add_cpu_mon(int cpu, struct rdt_resource *r)
 {
 	int id = get_domain_id_from_scope(cpu, r->mon_scope);
+	struct rdt_hw_mon_domain *hw_dom;
 	struct list_head *add_pos = NULL;
-	struct rdt_hw_domain *hw_dom;
 	struct rdt_domain_hdr *hdr;
-	struct rdt_domain *d;
+	struct rdt_mon_domain *d;
 	int err;
 
 	if (id < 0) {
@@ -577,7 +582,7 @@ static void domain_add_cpu_mon(int cpu, struct rdt_resource *r)
 		if (WARN_ON_ONCE(hdr->type != RESCTRL_MON_DOMAIN))
 			return;
 
-		d = container_of(hdr, struct rdt_domain, hdr);
+		d = container_of(hdr, struct rdt_mon_domain, hdr);
 
 		cpumask_set_cpu(cpu, &d->hdr.cpu_mask);
 		return;
@@ -593,7 +598,7 @@ static void domain_add_cpu_mon(int cpu, struct rdt_resource *r)
 	cpumask_set_cpu(cpu, &d->hdr.cpu_mask);
 
 	if (arch_domain_mbm_alloc(r->num_rmid, hw_dom)) {
-		domain_free(hw_dom);
+		mon_domain_free(hw_dom);
 		return;
 	}
 
@@ -602,7 +607,7 @@ static void domain_add_cpu_mon(int cpu, struct rdt_resource *r)
 	err = resctrl_online_mon_domain(r, d);
 	if (err) {
 		list_del(&d->hdr.list);
-		domain_free(hw_dom);
+		mon_domain_free(hw_dom);
 	}
 }
 
@@ -620,9 +625,9 @@ static void domain_add_cpu(int cpu, struct rdt_resource *r)
 static void domain_remove_cpu_ctrl(int cpu, struct rdt_resource *r)
 {
 	int id = get_domain_id_from_scope(cpu, r->ctrl_scope);
-	struct rdt_hw_domain *hw_dom;
+	struct rdt_hw_ctrl_domain *hw_dom;
 	struct rdt_domain_hdr *hdr;
-	struct rdt_domain *d;
+	struct rdt_ctrl_domain *d;
 
 	if (id < 0) {
 		pr_warn_once("Can't find control domain id for CPU:%d scope:%d for resource %s\n",
@@ -640,8 +645,8 @@ static void domain_remove_cpu_ctrl(int cpu, struct rdt_resource *r)
 	if (WARN_ON_ONCE(hdr->type != RESCTRL_CTRL_DOMAIN))
 		return;
 
-	d = container_of(hdr, struct rdt_domain, hdr);
-	hw_dom = resctrl_to_arch_dom(d);
+	d = container_of(hdr, struct rdt_ctrl_domain, hdr);
+	hw_dom = resctrl_to_arch_ctrl_dom(d);
 
 	cpumask_clear_cpu(cpu, &d->hdr.cpu_mask);
 	if (cpumask_empty(&d->hdr.cpu_mask)) {
@@ -649,12 +654,12 @@ static void domain_remove_cpu_ctrl(int cpu, struct rdt_resource *r)
 		list_del(&d->hdr.list);
 
 		/*
-		 * rdt_domain "d" is going to be freed below, so clear
+		 * rdt_ctrl_domain "d" is going to be freed below, so clear
 		 * its pointer from pseudo_lock_region struct.
 		 */
 		if (d->plr)
 			d->plr->d = NULL;
-		domain_free(hw_dom);
+		ctrl_domain_free(hw_dom);
 
 		return;
 	}
@@ -663,9 +668,9 @@ static void domain_remove_cpu_ctrl(int cpu, struct rdt_resource *r)
 static void domain_remove_cpu_mon(int cpu, struct rdt_resource *r)
 {
 	int id = get_domain_id_from_scope(cpu, r->mon_scope);
-	struct rdt_hw_domain *hw_dom;
+	struct rdt_hw_mon_domain *hw_dom;
 	struct rdt_domain_hdr *hdr;
-	struct rdt_domain *d;
+	struct rdt_mon_domain *d;
 
 	if (id < 0) {
 		pr_warn_once("Can't find monitor domain id for CPU:%d scope:%d for resource %s\n",
@@ -683,14 +688,14 @@ static void domain_remove_cpu_mon(int cpu, struct rdt_resource *r)
 	if (WARN_ON_ONCE(hdr->type != RESCTRL_MON_DOMAIN))
 		return;
 
-	d = container_of(hdr, struct rdt_domain, hdr);
-	hw_dom = resctrl_to_arch_dom(d);
+	d = container_of(hdr, struct rdt_mon_domain, hdr);
+	hw_dom = resctrl_to_arch_mon_dom(d);
 
 	cpumask_clear_cpu(cpu, &d->hdr.cpu_mask);
 	if (cpumask_empty(&d->hdr.cpu_mask)) {
 		resctrl_offline_mon_domain(r, d);
 		list_del(&d->hdr.list);
-		domain_free(hw_dom);
+		mon_domain_free(hw_dom);
 
 		return;
 	}
diff --git a/arch/x86/kernel/cpu/resctrl/ctrlmondata.c b/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
index 0b4136c42762..08fc97ce4135 100644
--- a/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
+++ b/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
@@ -58,7 +58,7 @@ static bool bw_validate(char *buf, unsigned long *data, struct rdt_resource *r)
 }
 
 int parse_bw(struct rdt_parse_data *data, struct resctrl_schema *s,
-	     struct rdt_domain *d)
+	     struct rdt_ctrl_domain *d)
 {
 	struct resctrl_staged_config *cfg;
 	u32 closid = data->rdtgrp->closid;
@@ -137,7 +137,7 @@ static bool cbm_validate(char *buf, u32 *data, struct rdt_resource *r)
  * resource type.
  */
 int parse_cbm(struct rdt_parse_data *data, struct resctrl_schema *s,
-	      struct rdt_domain *d)
+	      struct rdt_ctrl_domain *d)
 {
 	struct rdtgroup *rdtgrp = data->rdtgrp;
 	struct resctrl_staged_config *cfg;
@@ -206,8 +206,8 @@ static int parse_line(char *line, struct resctrl_schema *s,
 	struct resctrl_staged_config *cfg;
 	struct rdt_resource *r = s->res;
 	struct rdt_parse_data data;
+	struct rdt_ctrl_domain *d;
 	char *dom = NULL, *id;
-	struct rdt_domain *d;
 	unsigned long dom_id;
 
 	if (rdtgrp->mode == RDT_MODE_PSEUDO_LOCKSETUP &&
@@ -267,11 +267,11 @@ static u32 get_config_index(u32 closid, enum resctrl_conf_type type)
 	}
 }
 
-static bool apply_config(struct rdt_hw_domain *hw_dom,
+static bool apply_config(struct rdt_hw_ctrl_domain *hw_dom,
 			 struct resctrl_staged_config *cfg, u32 idx,
 			 cpumask_var_t cpu_mask)
 {
-	struct rdt_domain *dom = &hw_dom->d_resctrl;
+	struct rdt_ctrl_domain *dom = &hw_dom->d_resctrl;
 
 	if (cfg->new_ctrl != hw_dom->ctrl_val[idx]) {
 		cpumask_set_cpu(cpumask_any(&dom->hdr.cpu_mask), cpu_mask);
@@ -283,11 +283,11 @@ static bool apply_config(struct rdt_hw_domain *hw_dom,
 	return false;
 }
 
-int resctrl_arch_update_one(struct rdt_resource *r, struct rdt_domain *d,
+int resctrl_arch_update_one(struct rdt_resource *r, struct rdt_ctrl_domain *d,
 			    u32 closid, enum resctrl_conf_type t, u32 cfg_val)
 {
+	struct rdt_hw_ctrl_domain *hw_dom = resctrl_to_arch_ctrl_dom(d);
 	struct rdt_hw_resource *hw_res = resctrl_to_arch_res(r);
-	struct rdt_hw_domain *hw_dom = resctrl_to_arch_dom(d);
 	u32 idx = get_config_index(closid, t);
 	struct msr_param msr_param;
 
@@ -307,11 +307,11 @@ int resctrl_arch_update_one(struct rdt_resource *r, struct rdt_domain *d,
 int resctrl_arch_update_domains(struct rdt_resource *r, u32 closid)
 {
 	struct resctrl_staged_config *cfg;
-	struct rdt_hw_domain *hw_dom;
+	struct rdt_hw_ctrl_domain *hw_dom;
 	struct msr_param msr_param;
+	struct rdt_ctrl_domain *d;
 	enum resctrl_conf_type t;
 	cpumask_var_t cpu_mask;
-	struct rdt_domain *d;
 	u32 idx;
 
 	if (!zalloc_cpumask_var(&cpu_mask, GFP_KERNEL))
@@ -319,7 +319,7 @@ int resctrl_arch_update_domains(struct rdt_resource *r, u32 closid)
 
 	msr_param.res = NULL;
 	list_for_each_entry(d, &r->ctrl_domains, hdr.list) {
-		hw_dom = resctrl_to_arch_dom(d);
+		hw_dom = resctrl_to_arch_ctrl_dom(d);
 		for (t = 0; t < CDP_NUM_TYPES; t++) {
 			cfg = &hw_dom->d_resctrl.staged_config[t];
 			if (!cfg->have_new_ctrl)
@@ -449,10 +449,10 @@ ssize_t rdtgroup_schemata_write(struct kernfs_open_file *of,
 	return ret ?: nbytes;
 }
 
-u32 resctrl_arch_get_config(struct rdt_resource *r, struct rdt_domain *d,
+u32 resctrl_arch_get_config(struct rdt_resource *r, struct rdt_ctrl_domain *d,
 			    u32 closid, enum resctrl_conf_type type)
 {
-	struct rdt_hw_domain *hw_dom = resctrl_to_arch_dom(d);
+	struct rdt_hw_ctrl_domain *hw_dom = resctrl_to_arch_ctrl_dom(d);
 	u32 idx = get_config_index(closid, type);
 
 	return hw_dom->ctrl_val[idx];
@@ -461,7 +461,7 @@ u32 resctrl_arch_get_config(struct rdt_resource *r, struct rdt_domain *d,
 static void show_doms(struct seq_file *s, struct resctrl_schema *schema, int closid)
 {
 	struct rdt_resource *r = schema->res;
-	struct rdt_domain *dom;
+	struct rdt_ctrl_domain *dom;
 	bool sep = false;
 	u32 ctrl_val;
 
@@ -523,7 +523,7 @@ int rdtgroup_schemata_show(struct kernfs_open_file *of,
 }
 
 void mon_event_read(struct rmid_read *rr, struct rdt_resource *r,
-		    struct rdt_domain *d, struct rdtgroup *rdtgrp,
+		    struct rdt_mon_domain *d, struct rdtgroup *rdtgrp,
 		    int evtid, int first)
 {
 	/*
@@ -543,11 +543,11 @@ int rdtgroup_mondata_show(struct seq_file *m, void *arg)
 {
 	struct kernfs_open_file *of = m->private;
 	struct rdt_domain_hdr *hdr;
+	struct rdt_mon_domain *d;
 	u32 resid, evtid, domid;
 	struct rdtgroup *rdtgrp;
 	struct rdt_resource *r;
 	union mon_data_bits md;
-	struct rdt_domain *d;
 	struct rmid_read rr;
 	int ret = 0;
 
@@ -568,7 +568,7 @@ int rdtgroup_mondata_show(struct seq_file *m, void *arg)
 		ret = -ENOENT;
 		goto out;
 	}
-	d = container_of(hdr, struct rdt_domain, hdr);
+	d = container_of(hdr, struct rdt_mon_domain, hdr);
 
 	mon_event_read(&rr, r, d, rdtgrp, evtid, false);
 
diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c
index ec5ad926c5dc..4e145f5620b0 100644
--- a/arch/x86/kernel/cpu/resctrl/monitor.c
+++ b/arch/x86/kernel/cpu/resctrl/monitor.c
@@ -170,7 +170,7 @@ static int __rmid_read(u32 rmid, enum resctrl_event_id eventid, u64 *val)
 	return 0;
 }
 
-static struct arch_mbm_state *get_arch_mbm_state(struct rdt_hw_domain *hw_dom,
+static struct arch_mbm_state *get_arch_mbm_state(struct rdt_hw_mon_domain *hw_dom,
 						 u32 rmid,
 						 enum resctrl_event_id eventid)
 {
@@ -189,10 +189,10 @@ static struct arch_mbm_state *get_arch_mbm_state(struct rdt_hw_domain *hw_dom,
 	return NULL;
 }
 
-void resctrl_arch_reset_rmid(struct rdt_resource *r, struct rdt_domain *d,
+void resctrl_arch_reset_rmid(struct rdt_resource *r, struct rdt_mon_domain *d,
 			     u32 rmid, enum resctrl_event_id eventid)
 {
-	struct rdt_hw_domain *hw_dom = resctrl_to_arch_dom(d);
+	struct rdt_hw_mon_domain *hw_dom = resctrl_to_arch_mon_dom(d);
 	struct arch_mbm_state *am;
 
 	am = get_arch_mbm_state(hw_dom, rmid, eventid);
@@ -208,9 +208,9 @@ void resctrl_arch_reset_rmid(struct rdt_resource *r, struct rdt_domain *d,
  * Assumes that hardware counters are also reset and thus that there is
  * no need to record initial non-zero counts.
  */
-void resctrl_arch_reset_rmid_all(struct rdt_resource *r, struct rdt_domain *d)
+void resctrl_arch_reset_rmid_all(struct rdt_resource *r, struct rdt_mon_domain *d)
 {
-	struct rdt_hw_domain *hw_dom = resctrl_to_arch_dom(d);
+	struct rdt_hw_mon_domain *hw_dom = resctrl_to_arch_mon_dom(d);
 
 	if (is_mbm_total_enabled())
 		memset(hw_dom->arch_mbm_total, 0,
@@ -229,11 +229,11 @@ static u64 mbm_overflow_count(u64 prev_msr, u64 cur_msr, unsigned int width)
 	return chunks >> shift;
 }
 
-int resctrl_arch_rmid_read(struct rdt_resource *r, struct rdt_domain *d,
+int resctrl_arch_rmid_read(struct rdt_resource *r, struct rdt_mon_domain *d,
 			   u32 rmid, enum resctrl_event_id eventid, u64 *val)
 {
+	struct rdt_hw_mon_domain *hw_dom = resctrl_to_arch_mon_dom(d);
 	struct rdt_hw_resource *hw_res = resctrl_to_arch_res(r);
-	struct rdt_hw_domain *hw_dom = resctrl_to_arch_dom(d);
 	struct arch_mbm_state *am;
 	u64 msr_val, chunks;
 	int ret;
@@ -266,7 +266,7 @@ int resctrl_arch_rmid_read(struct rdt_resource *r, struct rdt_domain *d,
  * decrement the count. If the busy count gets to zero on an RMID, we
  * free the RMID
  */
-void __check_limbo(struct rdt_domain *d, bool force_free)
+void __check_limbo(struct rdt_mon_domain *d, bool force_free)
 {
 	struct rdt_resource *r = &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl;
 	struct rmid_entry *entry;
@@ -305,7 +305,7 @@ void __check_limbo(struct rdt_domain *d, bool force_free)
 	}
 }
 
-bool has_busy_rmid(struct rdt_resource *r, struct rdt_domain *d)
+bool has_busy_rmid(struct rdt_resource *r, struct rdt_mon_domain *d)
 {
 	return find_first_bit(d->rmid_busy_llc, r->num_rmid) != r->num_rmid;
 }
@@ -334,7 +334,7 @@ int alloc_rmid(void)
 static void add_rmid_to_limbo(struct rmid_entry *entry)
 {
 	struct rdt_resource *r = &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl;
-	struct rdt_domain *d;
+	struct rdt_mon_domain *d;
 	int cpu, err;
 	u64 val = 0;
 
@@ -383,7 +383,7 @@ void free_rmid(u32 rmid)
 		list_add_tail(&entry->list, &rmid_free_lru);
 }
 
-static struct mbm_state *get_mbm_state(struct rdt_domain *d, u32 rmid,
+static struct mbm_state *get_mbm_state(struct rdt_mon_domain *d, u32 rmid,
 				       enum resctrl_event_id evtid)
 {
 	switch (evtid) {
@@ -516,13 +516,13 @@ void mon_event_count(void *info)
  * throttle MSRs already have low percentage values.  To avoid
  * unnecessarily restricting such rdtgroups, we also increase the bandwidth.
  */
-static void update_mba_bw(struct rdtgroup *rgrp, struct rdt_domain *dom_mbm)
+static void update_mba_bw(struct rdtgroup *rgrp, struct rdt_mon_domain *dom_mbm)
 {
 	u32 closid, rmid, cur_msr_val, new_msr_val;
 	struct mbm_state *pmbm_data, *cmbm_data;
+	struct rdt_ctrl_domain *dom_mba;
 	u32 cur_bw, delta_bw, user_bw;
 	struct rdt_resource *r_mba;
-	struct rdt_domain *dom_mba;
 	struct list_head *head;
 	struct rdtgroup *entry;
 
@@ -600,7 +600,7 @@ static void update_mba_bw(struct rdtgroup *rgrp, struct rdt_domain *dom_mbm)
 	}
 }
 
-static void mbm_update(struct rdt_resource *r, struct rdt_domain *d, int rmid)
+static void mbm_update(struct rdt_resource *r, struct rdt_mon_domain *d, int rmid)
 {
 	struct rmid_read rr;
 
@@ -640,13 +640,13 @@ void cqm_handle_limbo(struct work_struct *work)
 {
 	unsigned long delay = msecs_to_jiffies(CQM_LIMBOCHECK_INTERVAL);
 	int cpu = smp_processor_id();
+	struct rdt_mon_domain *d;
 	struct rdt_resource *r;
-	struct rdt_domain *d;
 
 	mutex_lock(&rdtgroup_mutex);
 
 	r = &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl;
-	d = container_of(work, struct rdt_domain, cqm_limbo.work);
+	d = container_of(work, struct rdt_mon_domain, cqm_limbo.work);
 
 	__check_limbo(d, false);
 
@@ -656,7 +656,7 @@ void cqm_handle_limbo(struct work_struct *work)
 	mutex_unlock(&rdtgroup_mutex);
 }
 
-void cqm_setup_limbo_handler(struct rdt_domain *dom, unsigned long delay_ms)
+void cqm_setup_limbo_handler(struct rdt_mon_domain *dom, unsigned long delay_ms)
 {
 	unsigned long delay = msecs_to_jiffies(delay_ms);
 	int cpu;
@@ -672,9 +672,9 @@ void mbm_handle_overflow(struct work_struct *work)
 	unsigned long delay = msecs_to_jiffies(MBM_OVERFLOW_INTERVAL);
 	struct rdtgroup *prgrp, *crgrp;
 	int cpu = smp_processor_id();
+	struct rdt_mon_domain *d;
 	struct list_head *head;
 	struct rdt_resource *r;
-	struct rdt_domain *d;
 
 	mutex_lock(&rdtgroup_mutex);
 
@@ -682,7 +682,7 @@ void mbm_handle_overflow(struct work_struct *work)
 		goto out_unlock;
 
 	r = &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl;
-	d = container_of(work, struct rdt_domain, mbm_over.work);
+	d = container_of(work, struct rdt_mon_domain, mbm_over.work);
 
 	list_for_each_entry(prgrp, &rdt_all_groups, rdtgroup_list) {
 		mbm_update(r, d, prgrp->mon.rmid);
@@ -701,7 +701,7 @@ void mbm_handle_overflow(struct work_struct *work)
 	mutex_unlock(&rdtgroup_mutex);
 }
 
-void mbm_setup_overflow_handler(struct rdt_domain *dom, unsigned long delay_ms)
+void mbm_setup_overflow_handler(struct rdt_mon_domain *dom, unsigned long delay_ms)
 {
 	unsigned long delay = msecs_to_jiffies(delay_ms);
 	int cpu;
diff --git a/arch/x86/kernel/cpu/resctrl/pseudo_lock.c b/arch/x86/kernel/cpu/resctrl/pseudo_lock.c
index ed6d59af1cef..08d35f828bc3 100644
--- a/arch/x86/kernel/cpu/resctrl/pseudo_lock.c
+++ b/arch/x86/kernel/cpu/resctrl/pseudo_lock.c
@@ -814,7 +814,7 @@ int rdtgroup_locksetup_exit(struct rdtgroup *rdtgrp)
  * Return: true if @cbm overlaps with pseudo-locked region on @d, false
  * otherwise.
  */
-bool rdtgroup_cbm_overlaps_pseudo_locked(struct rdt_domain *d, unsigned long cbm)
+bool rdtgroup_cbm_overlaps_pseudo_locked(struct rdt_ctrl_domain *d, unsigned long cbm)
 {
 	unsigned int cbm_len;
 	unsigned long cbm_b;
@@ -841,11 +841,11 @@ bool rdtgroup_cbm_overlaps_pseudo_locked(struct rdt_domain *d, unsigned long cbm
  *         if it is not possible to test due to memory allocation issue,
  *         false otherwise.
  */
-bool rdtgroup_pseudo_locked_in_hierarchy(struct rdt_domain *d)
+bool rdtgroup_pseudo_locked_in_hierarchy(struct rdt_ctrl_domain *d)
 {
+	struct rdt_ctrl_domain *d_i;
 	cpumask_var_t cpu_with_psl;
 	struct rdt_resource *r;
-	struct rdt_domain *d_i;
 	bool ret = false;
 
 	if (!zalloc_cpumask_var(&cpu_with_psl, GFP_KERNEL))
diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
index 760013ed1bff..21bbd832f3f2 100644
--- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
+++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
@@ -85,8 +85,8 @@ void rdt_last_cmd_printf(const char *fmt, ...)
 
 void rdt_staged_configs_clear(void)
 {
+	struct rdt_ctrl_domain *dom;
 	struct rdt_resource *r;
-	struct rdt_domain *dom;
 
 	lockdep_assert_held(&rdtgroup_mutex);
 
@@ -976,7 +976,7 @@ static int rdt_bit_usage_show(struct kernfs_open_file *of,
 	unsigned long sw_shareable = 0, hw_shareable = 0;
 	unsigned long exclusive = 0, pseudo_locked = 0;
 	struct rdt_resource *r = s->res;
-	struct rdt_domain *dom;
+	struct rdt_ctrl_domain *dom;
 	int i, hwb, swb, excl, psl;
 	enum rdtgrp_mode mode;
 	bool sep = false;
@@ -1205,7 +1205,7 @@ static int rdt_has_sparse_bitmasks_show(struct kernfs_open_file *of,
  *
  * Return: false if CBM does not overlap, true if it does.
  */
-static bool __rdtgroup_cbm_overlaps(struct rdt_resource *r, struct rdt_domain *d,
+static bool __rdtgroup_cbm_overlaps(struct rdt_resource *r, struct rdt_ctrl_domain *d,
 				    unsigned long cbm, int closid,
 				    enum resctrl_conf_type type, bool exclusive)
 {
@@ -1260,7 +1260,7 @@ static bool __rdtgroup_cbm_overlaps(struct rdt_resource *r, struct rdt_domain *d
  *
  * Return: true if CBM overlap detected, false if there is no overlap
  */
-bool rdtgroup_cbm_overlaps(struct resctrl_schema *s, struct rdt_domain *d,
+bool rdtgroup_cbm_overlaps(struct resctrl_schema *s, struct rdt_ctrl_domain *d,
 			   unsigned long cbm, int closid, bool exclusive)
 {
 	enum resctrl_conf_type peer_type = resctrl_peer_type(s->conf_type);
@@ -1291,10 +1291,10 @@ bool rdtgroup_cbm_overlaps(struct resctrl_schema *s, struct rdt_domain *d,
 static bool rdtgroup_mode_test_exclusive(struct rdtgroup *rdtgrp)
 {
 	int closid = rdtgrp->closid;
+	struct rdt_ctrl_domain *d;
 	struct resctrl_schema *s;
 	struct rdt_resource *r;
 	bool has_cache = false;
-	struct rdt_domain *d;
 	u32 ctrl;
 
 	list_for_each_entry(s, &resctrl_schema_all, list) {
@@ -1407,7 +1407,7 @@ static ssize_t rdtgroup_mode_write(struct kernfs_open_file *of,
  * bitmap functions work correctly.
  */
 unsigned int rdtgroup_cbm_to_size(struct rdt_resource *r,
-				  struct rdt_domain *d, unsigned long cbm)
+				  struct rdt_ctrl_domain *d, unsigned long cbm)
 {
 	struct cpu_cacheinfo *ci;
 	unsigned int size = 0;
@@ -1439,9 +1439,9 @@ static int rdtgroup_size_show(struct kernfs_open_file *of,
 {
 	struct resctrl_schema *schema;
 	enum resctrl_conf_type type;
+	struct rdt_ctrl_domain *d;
 	struct rdtgroup *rdtgrp;
 	struct rdt_resource *r;
-	struct rdt_domain *d;
 	unsigned int size;
 	int ret = 0;
 	u32 closid;
@@ -1553,7 +1553,7 @@ static void mon_event_config_read(void *info)
 	mon_info->mon_config = msrval & MAX_EVT_CONFIG_BITS;
 }
 
-static void mondata_config_read(struct rdt_domain *d, struct mon_config_info *mon_info)
+static void mondata_config_read(struct rdt_mon_domain *d, struct mon_config_info *mon_info)
 {
 	smp_call_function_any(&d->hdr.cpu_mask, mon_event_config_read, mon_info, 1);
 }
@@ -1561,7 +1561,7 @@ static void mondata_config_read(struct rdt_domain *d, struct mon_config_info *mo
 static int mbm_config_show(struct seq_file *s, struct rdt_resource *r, u32 evtid)
 {
 	struct mon_config_info mon_info = {0};
-	struct rdt_domain *dom;
+	struct rdt_mon_domain *dom;
 	bool sep = false;
 
 	mutex_lock(&rdtgroup_mutex);
@@ -1618,7 +1618,7 @@ static void mon_event_config_write(void *info)
 }
 
 static int mbm_config_write_domain(struct rdt_resource *r,
-				   struct rdt_domain *d, u32 evtid, u32 val)
+				   struct rdt_mon_domain *d, u32 evtid, u32 val)
 {
 	struct mon_config_info mon_info = {0};
 	int ret = 0;
@@ -1668,7 +1668,7 @@ static int mon_config_write(struct rdt_resource *r, char *tok, u32 evtid)
 {
 	char *dom_str = NULL, *id_str;
 	unsigned long dom_id, val;
-	struct rdt_domain *d;
+	struct rdt_mon_domain *d;
 	int ret = 0;
 
 next:
@@ -2216,9 +2216,9 @@ static inline bool is_mba_linear(void)
 static int set_cache_qos_cfg(int level, bool enable)
 {
 	void (*update)(void *arg);
+	struct rdt_ctrl_domain *d;
 	struct rdt_resource *r_l;
 	cpumask_var_t cpu_mask;
-	struct rdt_domain *d;
 	int cpu;
 
 	if (level == RDT_RESOURCE_L3)
@@ -2265,7 +2265,7 @@ void rdt_domain_reconfigure_cdp(struct rdt_resource *r)
 		l3_qos_cfg_update(&hw_res->cdp_enabled);
 }
 
-static int mba_sc_domain_allocate(struct rdt_resource *r, struct rdt_domain *d)
+static int mba_sc_domain_allocate(struct rdt_resource *r, struct rdt_ctrl_domain *d)
 {
 	u32 num_closid = resctrl_arch_get_num_closid(r);
 	int cpu = cpumask_any(&d->hdr.cpu_mask);
@@ -2283,7 +2283,7 @@ static int mba_sc_domain_allocate(struct rdt_resource *r, struct rdt_domain *d)
 }
 
 static void mba_sc_domain_destroy(struct rdt_resource *r,
-				  struct rdt_domain *d)
+				  struct rdt_ctrl_domain *d)
 {
 	kfree(d->mbps_val);
 	d->mbps_val = NULL;
@@ -2309,7 +2309,7 @@ static int set_mba_sc(bool mba_sc)
 {
 	struct rdt_resource *r = &rdt_resources_all[RDT_RESOURCE_MBA].r_resctrl;
 	u32 num_closid = resctrl_arch_get_num_closid(r);
-	struct rdt_domain *d;
+	struct rdt_ctrl_domain *d;
 	int i;
 
 	if (!supports_mba_mbps() || mba_sc == is_mba_sc(r))
@@ -2578,7 +2578,7 @@ static int rdt_get_tree(struct fs_context *fc)
 {
 	struct rdt_fs_context *ctx = rdt_fc2context(fc);
 	unsigned long flags = RFTYPE_CTRL_BASE;
-	struct rdt_domain *dom;
+	struct rdt_mon_domain *dom;
 	struct rdt_resource *r;
 	int ret;
 
@@ -2762,10 +2762,10 @@ static int rdt_init_fs_context(struct fs_context *fc)
 static int reset_all_ctrls(struct rdt_resource *r)
 {
 	struct rdt_hw_resource *hw_res = resctrl_to_arch_res(r);
-	struct rdt_hw_domain *hw_dom;
+	struct rdt_hw_ctrl_domain *hw_dom;
 	struct msr_param msr_param;
+	struct rdt_ctrl_domain *d;
 	cpumask_var_t cpu_mask;
-	struct rdt_domain *d;
 	int i;
 
 	if (!zalloc_cpumask_var(&cpu_mask, GFP_KERNEL))
@@ -2781,7 +2781,7 @@ static int reset_all_ctrls(struct rdt_resource *r)
 	 * from each domain to update the MSRs below.
 	 */
 	list_for_each_entry(d, &r->ctrl_domains, hdr.list) {
-		hw_dom = resctrl_to_arch_dom(d);
+		hw_dom = resctrl_to_arch_ctrl_dom(d);
 		cpumask_set_cpu(cpumask_any(&d->hdr.cpu_mask), cpu_mask);
 
 		for (i = 0; i < hw_res->num_closid; i++)
@@ -2976,7 +2976,7 @@ static void rmdir_mondata_subdir_allrdtgrp(struct rdt_resource *r,
 }
 
 static int mkdir_mondata_subdir(struct kernfs_node *parent_kn,
-				struct rdt_domain *d,
+				struct rdt_mon_domain *d,
 				struct rdt_resource *r, struct rdtgroup *prgrp)
 {
 	union mon_data_bits priv;
@@ -3025,7 +3025,7 @@ static int mkdir_mondata_subdir(struct kernfs_node *parent_kn,
  * and "monitor" groups with given domain id.
  */
 static void mkdir_mondata_subdir_allrdtgrp(struct rdt_resource *r,
-					   struct rdt_domain *d)
+					   struct rdt_mon_domain *d)
 {
 	struct kernfs_node *parent_kn;
 	struct rdtgroup *prgrp, *crgrp;
@@ -3047,7 +3047,7 @@ static int mkdir_mondata_subdir_alldom(struct kernfs_node *parent_kn,
 				       struct rdt_resource *r,
 				       struct rdtgroup *prgrp)
 {
-	struct rdt_domain *dom;
+	struct rdt_mon_domain *dom;
 	int ret;
 
 	list_for_each_entry(dom, &r->mon_domains, hdr.list) {
@@ -3149,7 +3149,7 @@ static u32 cbm_ensure_valid(u32 _val, struct rdt_resource *r)
  * Set the RDT domain up to start off with all usable allocations. That is,
  * all shareable and unused bits. All-zero CBM is invalid.
  */
-static int __init_one_rdt_domain(struct rdt_domain *d, struct resctrl_schema *s,
+static int __init_one_rdt_domain(struct rdt_ctrl_domain *d, struct resctrl_schema *s,
 				 u32 closid)
 {
 	enum resctrl_conf_type peer_type = resctrl_peer_type(s->conf_type);
@@ -3229,7 +3229,7 @@ static int __init_one_rdt_domain(struct rdt_domain *d, struct resctrl_schema *s,
  */
 static int rdtgroup_init_cat(struct resctrl_schema *s, u32 closid)
 {
-	struct rdt_domain *d;
+	struct rdt_ctrl_domain *d;
 	int ret;
 
 	list_for_each_entry(d, &s->res->ctrl_domains, hdr.list) {
@@ -3245,7 +3245,7 @@ static int rdtgroup_init_cat(struct resctrl_schema *s, u32 closid)
 static void rdtgroup_init_mba(struct rdt_resource *r, u32 closid)
 {
 	struct resctrl_staged_config *cfg;
-	struct rdt_domain *d;
+	struct rdt_ctrl_domain *d;
 
 	list_for_each_entry(d, &r->ctrl_domains, hdr.list) {
 		if (is_mba_sc(r)) {
@@ -3842,14 +3842,14 @@ static void __init rdtgroup_setup_default(void)
 	mutex_unlock(&rdtgroup_mutex);
 }
 
-static void domain_destroy_mon_state(struct rdt_domain *d)
+static void domain_destroy_mon_state(struct rdt_mon_domain *d)
 {
 	bitmap_free(d->rmid_busy_llc);
 	kfree(d->mbm_total);
 	kfree(d->mbm_local);
 }
 
-void resctrl_offline_ctrl_domain(struct rdt_resource *r, struct rdt_domain *d)
+void resctrl_offline_ctrl_domain(struct rdt_resource *r, struct rdt_ctrl_domain *d)
 {
 	lockdep_assert_held(&rdtgroup_mutex);
 
@@ -3857,7 +3857,7 @@ void resctrl_offline_ctrl_domain(struct rdt_resource *r, struct rdt_domain *d)
 		mba_sc_domain_destroy(r, d);
 }
 
-void resctrl_offline_mon_domain(struct rdt_resource *r, struct rdt_domain *d)
+void resctrl_offline_mon_domain(struct rdt_resource *r, struct rdt_mon_domain *d)
 {
 	lockdep_assert_held(&rdtgroup_mutex);
 
@@ -3886,7 +3886,7 @@ void resctrl_offline_mon_domain(struct rdt_resource *r, struct rdt_domain *d)
 	domain_destroy_mon_state(d);
 }
 
-static int domain_setup_mon_state(struct rdt_resource *r, struct rdt_domain *d)
+static int domain_setup_mon_state(struct rdt_resource *r, struct rdt_mon_domain *d)
 {
 	size_t tsize;
 
@@ -3916,7 +3916,7 @@ static int domain_setup_mon_state(struct rdt_resource *r, struct rdt_domain *d)
 	return 0;
 }
 
-int resctrl_online_ctrl_domain(struct rdt_resource *r, struct rdt_domain *d)
+int resctrl_online_ctrl_domain(struct rdt_resource *r, struct rdt_ctrl_domain *d)
 {
 	lockdep_assert_held(&rdtgroup_mutex);
 
@@ -3926,7 +3926,7 @@ int resctrl_online_ctrl_domain(struct rdt_resource *r, struct rdt_domain *d)
 	return 0;
 }
 
-int resctrl_online_mon_domain(struct rdt_resource *r, struct rdt_domain *d)
+int resctrl_online_mon_domain(struct rdt_resource *r, struct rdt_mon_domain *d)
 {
 	int err;
 
-- 
2.41.0


^ permalink raw reply related	[flat|nested] 331+ messages in thread

* [PATCH v13 5/8] x86/resctrl: Add node-scope to the options for feature scope
  2023-12-04 18:53                 ` [PATCH v13 0/8] Add support for Sub-NUMA cluster (SNC) systems Tony Luck
                                     ` (3 preceding siblings ...)
  2023-12-04 18:53                   ` [PATCH v13 4/8] x86/resctrl: Split the rdt_domain and rdt_hw_domain structures Tony Luck
@ 2023-12-04 18:53                   ` Tony Luck
  2023-12-04 23:04                     ` Moger, Babu
  2023-12-04 18:53                   ` [PATCH v13 6/8] x86/resctrl: Introduce snc_nodes_per_l3_cache Tony Luck
                                     ` (5 subsequent siblings)
  10 siblings, 1 reply; 331+ messages in thread
From: Tony Luck @ 2023-12-04 18:53 UTC (permalink / raw)
  To: Fenghua Yu, Reinette Chatre, Peter Newman, Jonathan Corbet,
	Shuah Khan, x86
  Cc: Shaopeng Tan, James Morse, Jamie Iles, Babu Moger, Randy Dunlap,
	linux-kernel, linux-doc, patches, Tony Luck, Shaopeng Tan

All currently supported resctrl features are domain scoped, with the scope
matching that of the L2 or L3 caches.

Add RESCTRL_NODE as a new option for features that are scoped at the
same granularity as NUMA nodes. This is needed for Intel's Sub-NUMA
Cluster (SNC) feature where monitoring features are node scoped.

Signed-off-by: Tony Luck <tony.luck@intel.com>
Tested-by: Shaopeng Tan <tan.shaopeng@jp.fujitsu.com>
Reviewed-by: Peter Newman <peternewman@google.com>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Reviewed-by: Shaopeng Tan <tan.shaopeng@jp.fujitsu.com>
---
 include/linux/resctrl.h            | 1 +
 arch/x86/kernel/cpu/resctrl/core.c | 2 ++
 2 files changed, 3 insertions(+)

diff --git a/include/linux/resctrl.h b/include/linux/resctrl.h
index 058a940c3239..b8a3a11b970d 100644
--- a/include/linux/resctrl.h
+++ b/include/linux/resctrl.h
@@ -170,6 +170,7 @@ struct resctrl_schema;
 enum resctrl_scope {
 	RESCTRL_L2_CACHE = 2,
 	RESCTRL_L3_CACHE = 3,
+	RESCTRL_NODE,
 };
 
 /**
diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
index 797cb3bf417a..c9315ce8f7bd 100644
--- a/arch/x86/kernel/cpu/resctrl/core.c
+++ b/arch/x86/kernel/cpu/resctrl/core.c
@@ -502,6 +502,8 @@ static int get_domain_id_from_scope(int cpu, enum resctrl_scope scope)
 	case RESCTRL_L2_CACHE:
 	case RESCTRL_L3_CACHE:
 		return get_cpu_cacheinfo_id(cpu, scope);
+	case RESCTRL_NODE:
+		return cpu_to_node(cpu);
 	default:
 		break;
 	}
-- 
2.41.0


^ permalink raw reply related	[flat|nested] 331+ messages in thread

* [PATCH v13 6/8] x86/resctrl: Introduce snc_nodes_per_l3_cache
  2023-12-04 18:53                 ` [PATCH v13 0/8] Add support for Sub-NUMA cluster (SNC) systems Tony Luck
                                     ` (4 preceding siblings ...)
  2023-12-04 18:53                   ` [PATCH v13 5/8] x86/resctrl: Add node-scope to the options for feature scope Tony Luck
@ 2023-12-04 18:53                   ` Tony Luck
  2023-12-04 23:04                     ` Moger, Babu
  2023-12-04 18:53                   ` [PATCH v13 7/8] x86/resctrl: Sub NUMA Cluster detection and enable Tony Luck
                                     ` (4 subsequent siblings)
  10 siblings, 1 reply; 331+ messages in thread
From: Tony Luck @ 2023-12-04 18:53 UTC (permalink / raw)
  To: Fenghua Yu, Reinette Chatre, Peter Newman, Jonathan Corbet,
	Shuah Khan, x86
  Cc: Shaopeng Tan, James Morse, Jamie Iles, Babu Moger, Randy Dunlap,
	linux-kernel, linux-doc, patches, Tony Luck, Shaopeng Tan

Intel Sub-NUMA Cluster (SNC) is a feature that subdivides the CPU cores
and memory controllers on a socket into two or more groups. These are
presented to the operating system as NUMA nodes.

This may enable some workloads to have slightly lower latency to memory
as the memory controller(s) in an SNC node are electrically closer to the
CPU cores on that SNC node. This benefit may be offset by lower bandwidth
since the memory accesses for each core can only be interleaved between
the memory controllers on the same SNC node.

Resctrl monitoring on an Intel system depends upon attaching RMIDs to tasks
to track L3 cache occupancy and memory bandwidth. There is an MSR that
controls how the RMIDs are shared between SNC nodes.

The default mode divides them numerically. E.g. when there are two SNC
nodes on a socket, the lower-numbered half of the RMIDs is given to the
first node and the remainder to the second node. This would be difficult
to use with the Linux resctrl interface as specific RMID values assigned
to resctrl groups are not visible to users.

The other mode divides the RMIDs and renumbers the ones on the second
SNC node to start from zero.
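
For illustration only (this sketch is not part of the patch), the difference
between the two schemes can be expressed as a pair of helpers; the RMID and
node counts below are made-up example values, not enumerated from any CPU:

	#include <stdio.h>

	#define TOTAL_RMIDS	256	/* hypothetical physical RMIDs per L3 */
	#define SNC_NODES	2	/* hypothetical SNC nodes sharing that L3 */

	/*
	 * Default mode: the RMID written to IA32_PQR_ASSOC must already lie
	 * in the range owned by the SNC node the task runs on: node 0 owns
	 * RMIDs 0..127, node 1 owns RMIDs 128..255.
	 */
	static int default_mode_rmid(int rmid_within_node, int snc_node)
	{
		return snc_node * (TOTAL_RMIDS / SNC_NODES) + rmid_within_node;
	}

	/*
	 * Remapped mode (used by this series): every SNC node numbers its
	 * RMIDs from zero, so resctrl can hand out the same logical RMID on
	 * every node and only the counter read path adds a per-node offset.
	 */
	static int remapped_mode_counter(int logical_rmid, int snc_node)
	{
		return logical_rmid + snc_node * (TOTAL_RMIDS / SNC_NODES);
	}

	int main(void)
	{
		printf("default mode, node 1, 3rd RMID: %d\n",
		       default_mode_rmid(2, 1));
		printf("remapped mode, logical RMID 2 read on node 1: counter %d\n",
		       remapped_mode_counter(2, 1));
		return 0;
	}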

Even with this renumbering, SNC mode requires several changes in resctrl
behavior for correct operation.

Add a global integer "snc_nodes_per_l3_cache" that shows how many
SNC nodes share each L3 cache. When "snc_nodes_per_l3_cache" is "1",
SNC mode is either not implemented, or not enabled.

Update all places to take appropriate action when SNC mode is enabled:
1) The number of logical RMIDs per L3 cache available for use is the
   number of physical RMIDs divided by the number of SNC nodes (see
   the sketch after this list).
2) Likewise the "mon_scale" value must be divided by the number of SNC
   nodes.
3) The RMID renumbering operates when using the value from the
   IA32_PQR_ASSOC MSR to count accesses by a task. When reading an RMID
   counter, software must adjust from the logical RMID to the physical
   RMID value for the SNC node it wishes to read, and load the
   adjusted value into the IA32_QM_EVTSEL MSR.
4) Divide the L3 cache between the SNC nodes. Divide the value
   reported in the resctrl "size" file by the number of SNC
   nodes because the effective amount of cache that can be allocated
   is reduced by that factor.
5) Disable the "-o mba_MBps" mount option in SNC mode
   because the monitoring is being done per SNC node, while the
   bandwidth allocation is still done at the L3 cache scope.
   Trying to use this feedback loop might result in contradictory
   changes to the throttling level coming from each of the SNC
   node bandwidth measurements.
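
A small sketch of the arithmetic behind points 1), 2) and 4) above,
using made-up hardware values (illustration only, not real CPUID output):

	#include <stdio.h>

	int main(void)
	{
		/* Hypothetical values as enumerated before SNC adjustment */
		unsigned int phys_rmids = 256;	/* RMIDs per L3 cache */
		unsigned int occ_scale = 64;	/* counter-to-bytes scale factor */
		unsigned int l3_size_kb = 32768;	/* size of one L3 instance */
		unsigned int snc_nodes_per_l3_cache = 2;

		/* 1) logical RMIDs usable in each monitoring domain */
		printf("num_rmid  = %u\n", phys_rmids / snc_nodes_per_l3_cache);

		/* 2) mon_scale is reduced by the same factor */
		printf("mon_scale = %u\n", occ_scale / snc_nodes_per_l3_cache);

		/* 4) "size" reported for a schema allocating the whole cache */
		printf("size      = %u KB\n", l3_size_kb / snc_nodes_per_l3_cache);

		return 0;
	}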

Signed-off-by: Tony Luck <tony.luck@intel.com>
Tested-by: Shaopeng Tan <tan.shaopeng@jp.fujitsu.com>
Reviewed-by: Peter Newman <peternewman@google.com>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Reviewed-by: Shaopeng Tan <tan.shaopeng@jp.fujitsu.com>
---
 arch/x86/kernel/cpu/resctrl/internal.h |  2 ++
 arch/x86/kernel/cpu/resctrl/core.c     |  6 ++++++
 arch/x86/kernel/cpu/resctrl/monitor.c  | 16 +++++++++++++---
 arch/x86/kernel/cpu/resctrl/rdtgroup.c |  5 +++--
 4 files changed, 24 insertions(+), 5 deletions(-)

diff --git a/arch/x86/kernel/cpu/resctrl/internal.h b/arch/x86/kernel/cpu/resctrl/internal.h
index ce3a70657842..e7a75a439c16 100644
--- a/arch/x86/kernel/cpu/resctrl/internal.h
+++ b/arch/x86/kernel/cpu/resctrl/internal.h
@@ -446,6 +446,8 @@ DECLARE_STATIC_KEY_FALSE(rdt_alloc_enable_key);
 
 extern struct dentry *debugfs_resctrl;
 
+extern unsigned int snc_nodes_per_l3_cache;
+
 enum resctrl_res_level {
 	RDT_RESOURCE_L3,
 	RDT_RESOURCE_L2,
diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
index c9315ce8f7bd..cf5aba8a74bf 100644
--- a/arch/x86/kernel/cpu/resctrl/core.c
+++ b/arch/x86/kernel/cpu/resctrl/core.c
@@ -48,6 +48,12 @@ int max_name_width, max_data_width;
  */
 bool rdt_alloc_capable;
 
+/*
+ * Number of SNC nodes that share each L3 cache.  Default is 1 for
+ * systems that do not support SNC, or have SNC disabled.
+ */
+unsigned int snc_nodes_per_l3_cache = 1;
+
 static void
 mba_wrmsr_intel(struct rdt_ctrl_domain *d, struct msr_param *m,
 		struct rdt_resource *r);
diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c
index 4e145f5620b0..30b7c3b9b517 100644
--- a/arch/x86/kernel/cpu/resctrl/monitor.c
+++ b/arch/x86/kernel/cpu/resctrl/monitor.c
@@ -148,8 +148,18 @@ static inline struct rmid_entry *__rmid_entry(u32 rmid)
 
 static int __rmid_read(u32 rmid, enum resctrl_event_id eventid, u64 *val)
 {
+	struct rdt_resource *r = &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl;
+	int cpu = smp_processor_id();
+	int rmid_offset = 0;
 	u64 msr_val;
 
+	/*
+	 * When SNC mode is on, need to compute the offset to read the
+	 * physical RMID counter for the node to which this CPU belongs.
+	 */
+	if (snc_nodes_per_l3_cache > 1)
+		rmid_offset = (cpu_to_node(cpu) % snc_nodes_per_l3_cache) * r->num_rmid;
+
 	/*
 	 * As per the SDM, when IA32_QM_EVTSEL.EvtID (bits 7:0) is configured
 	 * with a valid event code for supported resource type and the bits
@@ -158,7 +168,7 @@ static int __rmid_read(u32 rmid, enum resctrl_event_id eventid, u64 *val)
 	 * IA32_QM_CTR.Error (bit 63) and IA32_QM_CTR.Unavailable (bit 62)
 	 * are error bits.
 	 */
-	wrmsr(MSR_IA32_QM_EVTSEL, eventid, rmid);
+	wrmsr(MSR_IA32_QM_EVTSEL, eventid, rmid + rmid_offset);
 	rdmsrl(MSR_IA32_QM_CTR, msr_val);
 
 	if (msr_val & RMID_VAL_ERROR)
@@ -783,8 +793,8 @@ int __init rdt_get_mon_l3_config(struct rdt_resource *r)
 	int ret;
 
 	resctrl_rmid_realloc_limit = boot_cpu_data.x86_cache_size * 1024;
-	hw_res->mon_scale = boot_cpu_data.x86_cache_occ_scale;
-	r->num_rmid = boot_cpu_data.x86_cache_max_rmid + 1;
+	hw_res->mon_scale = boot_cpu_data.x86_cache_occ_scale / snc_nodes_per_l3_cache;
+	r->num_rmid = (boot_cpu_data.x86_cache_max_rmid + 1) / snc_nodes_per_l3_cache;
 	hw_res->mbm_width = MBM_CNTR_WIDTH_BASE;
 
 	if (mbm_offset > 0 && mbm_offset <= MBM_CNTR_WIDTH_OFFSET_MAX)
diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
index 21bbd832f3f2..79d57dade568 100644
--- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
+++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
@@ -1425,7 +1425,7 @@ unsigned int rdtgroup_cbm_to_size(struct rdt_resource *r,
 		}
 	}
 
-	return size;
+	return size / snc_nodes_per_l3_cache;
 }
 
 /*
@@ -2298,7 +2298,8 @@ static bool supports_mba_mbps(void)
 	struct rdt_resource *r = &rdt_resources_all[RDT_RESOURCE_MBA].r_resctrl;
 
 	return (is_mbm_local_enabled() &&
-		r->alloc_capable && is_mba_linear());
+		r->alloc_capable && is_mba_linear() &&
+		snc_nodes_per_l3_cache == 1);
 }
 
 /*
-- 
2.41.0


^ permalink raw reply related	[flat|nested] 331+ messages in thread

* [PATCH v13 7/8] x86/resctrl: Sub NUMA Cluster detection and enable
  2023-12-04 18:53                 ` [PATCH v13 0/8] Add support for Sub-NUMA cluster (SNC) systems Tony Luck
                                     ` (5 preceding siblings ...)
  2023-12-04 18:53                   ` [PATCH v13 6/8] x86/resctrl: Introduce snc_nodes_per_l3_cache Tony Luck
@ 2023-12-04 18:53                   ` Tony Luck
  2023-12-04 23:04                     ` Moger, Babu
  2023-12-04 18:53                   ` [PATCH v13 8/8] x86/resctrl: Update documentation with Sub-NUMA cluster changes Tony Luck
                                     ` (3 subsequent siblings)
  10 siblings, 1 reply; 331+ messages in thread
From: Tony Luck @ 2023-12-04 18:53 UTC (permalink / raw)
  To: Fenghua Yu, Reinette Chatre, Peter Newman, Jonathan Corbet,
	Shuah Khan, x86
  Cc: Shaopeng Tan, James Morse, Jamie Iles, Babu Moger, Randy Dunlap,
	linux-kernel, linux-doc, patches, Tony Luck, Shaopeng Tan

There isn't a simple hardware bit that indicates whether a CPU is
running in Sub NUMA Cluster (SNC) mode. Infer the state from the
ratio of NUMA nodes to L3 cache instances.

When SNC mode is detected, reconfigure the RMID counters by updating
the MSR_RMID_SNC_CONFIG MSR on each socket as CPUs are seen.

Clearing bit zero of the MSR divides the RMIDs and renumbers the ones
on the second SNC node to start from zero.
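
The detection itself boils down to a division; a minimal sketch of the idea
(the real snc_get_config() below also skips memory-only nodes and sanity
checks the result against the values 1-4):

	#include <stdio.h>

	/*
	 * Sketch only: nr_cpu_nodes and nr_l3_instances stand in for the
	 * values the kernel derives from the NUMA topology and cacheinfo.
	 */
	static int snc_nodes_per_l3(int nr_cpu_nodes, int nr_l3_instances)
	{
		if (!nr_l3_instances || nr_cpu_nodes % nr_l3_instances)
			return 1;	/* inconsistent: assume SNC is off */

		return nr_cpu_nodes / nr_l3_instances;
	}

	int main(void)
	{
		/* e.g. 4 CPU nodes sharing 2 L3 caches -> 2 SNC nodes per cache */
		printf("%d\n", snc_nodes_per_l3(4, 2));
		return 0;
	}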

Signed-off-by: Tony Luck <tony.luck@intel.com>
Tested-by: Shaopeng Tan <tan.shaopeng@jp.fujitsu.com>
Reviewed-by: Peter Newman <peternewman@google.com>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Reviewed-by: Shaopeng Tan <tan.shaopeng@jp.fujitsu.com>
---
 arch/x86/include/asm/msr-index.h   |   1 +
 arch/x86/kernel/cpu/resctrl/core.c | 118 +++++++++++++++++++++++++++++
 2 files changed, 119 insertions(+)

diff --git a/arch/x86/include/asm/msr-index.h b/arch/x86/include/asm/msr-index.h
index 1d51e1850ed0..94d29d81e6db 100644
--- a/arch/x86/include/asm/msr-index.h
+++ b/arch/x86/include/asm/msr-index.h
@@ -1111,6 +1111,7 @@
 #define MSR_IA32_QM_CTR			0xc8e
 #define MSR_IA32_PQR_ASSOC		0xc8f
 #define MSR_IA32_L3_CBM_BASE		0xc90
+#define MSR_RMID_SNC_CONFIG		0xca0
 #define MSR_IA32_L2_CBM_BASE		0xd10
 #define MSR_IA32_MBA_THRTL_BASE		0xd50
 
diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
index cf5aba8a74bf..1de1c4499b7d 100644
--- a/arch/x86/kernel/cpu/resctrl/core.c
+++ b/arch/x86/kernel/cpu/resctrl/core.c
@@ -16,11 +16,14 @@
 
 #define pr_fmt(fmt)	"resctrl: " fmt
 
+#include <linux/cpu.h>
 #include <linux/slab.h>
 #include <linux/err.h>
 #include <linux/cacheinfo.h>
 #include <linux/cpuhotplug.h>
+#include <linux/mod_devicetable.h>
 
+#include <asm/cpu_device_id.h>
 #include <asm/intel-family.h>
 #include <asm/resctrl.h>
 #include "internal.h"
@@ -740,11 +743,42 @@ static void clear_closid_rmid(int cpu)
 	wrmsr(MSR_IA32_PQR_ASSOC, 0, 0);
 }
 
+/*
+ * The power-on reset value of MSR_RMID_SNC_CONFIG is 0x1
+ * which indicates that RMIDs are configured in legacy mode.
+ * This mode is incompatible with Linux resctrl semantics
+ * as RMIDs are partitioned between SNC nodes, which requires
+ * a user to know which RMID is allocated to a task.
+ * Clearing bit 0 reconfigures the RMID counters for use
+ * in Sub NUMA Cluster mode. This mode is better for Linux.
+ * The RMID space is divided between all SNC nodes with the
+ * RMIDs renumbered to start from zero in each node when
+ * counting operations from tasks. Code to read the counters
+ * must adjust RMID counter numbers based on SNC node. See
+ * __rmid_read() for code that does this.
+ */
+static void snc_remap_rmids(int cpu)
+{
+	u64 val;
+
+	/* Only need to enable once per package. */
+	if (cpumask_first(topology_core_cpumask(cpu)) != cpu)
+		return;
+
+	rdmsrl(MSR_RMID_SNC_CONFIG, val);
+	val &= ~BIT_ULL(0);
+	wrmsrl(MSR_RMID_SNC_CONFIG, val);
+}
+
 static int resctrl_online_cpu(unsigned int cpu)
 {
 	struct rdt_resource *r;
 
 	mutex_lock(&rdtgroup_mutex);
+
+	if (snc_nodes_per_l3_cache > 1)
+		snc_remap_rmids(cpu);
+
 	for_each_capable_rdt_resource(r)
 		domain_add_cpu(cpu, r);
 	/* The cpu is set in default rdtgroup after online. */
@@ -999,11 +1033,95 @@ static __init bool get_rdt_resources(void)
 	return (rdt_mon_capable || rdt_alloc_capable);
 }
 
+/* CPU models that support MSR_RMID_SNC_CONFIG */
+static const struct x86_cpu_id snc_cpu_ids[] __initconst = {
+	X86_MATCH_INTEL_FAM6_MODEL(ICELAKE_X, 0),
+	X86_MATCH_INTEL_FAM6_MODEL(SAPPHIRERAPIDS_X, 0),
+	X86_MATCH_INTEL_FAM6_MODEL(EMERALDRAPIDS_X, 0),
+	X86_MATCH_INTEL_FAM6_MODEL(GRANITERAPIDS_X, 0),
+	{}
+};
+
+/*
+ * There isn't a simple hardware bit that indicates whether a CPU is running
+ * in Sub NUMA Cluster (SNC) mode. Infer the state by comparing the
+ * ratio of NUMA nodes to L3 cache instances.
+ * It is not possible to accurately determine SNC state if the system is
+ * booted with a maxcpus=N parameter. That distorts the ratio of SNC nodes
+ * to L3 caches. It will be OK if system is booted with hyperthreading
+ * disabled (since this doesn't affect the ratio).
+ */
+static __init int snc_get_config(void)
+{
+	unsigned long *node_caches;
+	int mem_only_nodes = 0;
+	int cpu, node, ret;
+	int num_l3_caches;
+	int cache_id;
+
+	if (!x86_match_cpu(snc_cpu_ids))
+		return 1;
+
+	node_caches = bitmap_zalloc(num_possible_cpus(), GFP_KERNEL);
+	if (!node_caches)
+		return 1;
+
+	cpus_read_lock();
+
+	if (num_online_cpus() != num_present_cpus())
+		pr_warn("Some CPUs offline, SNC detection may be incorrect\n");
+
+	for_each_node(node) {
+		cpu = cpumask_first(cpumask_of_node(node));
+		if (cpu < nr_cpu_ids) {
+			cache_id = get_cpu_cacheinfo_id(cpu, 3);
+			if (cache_id != -1)
+				set_bit(cache_id, node_caches);
+		} else {
+			mem_only_nodes++;
+		}
+	}
+	cpus_read_unlock();
+
+	num_l3_caches = bitmap_weight(node_caches, num_possible_cpus());
+	kfree(node_caches);
+
+	if (!num_l3_caches)
+		goto insane;
+
+	/* sanity check #1: Number of CPU nodes must be multiple of num_l3_caches */
+	if ((nr_node_ids - mem_only_nodes) % num_l3_caches)
+		goto insane;
+
+	ret = (nr_node_ids - mem_only_nodes) / num_l3_caches;
+
+	/* sanity check #2: Only valid results are 1, 2, 3, 4 */
+	switch (ret) {
+	case 1:
+		break;
+	case 2:
+	case 3:
+	case 4:
+		rdt_resources_all[RDT_RESOURCE_L3].r_resctrl.mon_scope = RESCTRL_NODE;
+		break;
+	default:
+		goto insane;
+	}
+
+	return ret;
+insane:
+	pr_warn("SNC insanity: CPU nodes = %d num_l3_caches = %d\n",
+		(nr_node_ids - mem_only_nodes), num_l3_caches);
+	return 1;
+}
+
 static __init void rdt_init_res_defs_intel(void)
 {
 	struct rdt_hw_resource *hw_res;
 	struct rdt_resource *r;
 
+	snc_nodes_per_l3_cache = snc_get_config();
+
 	for_each_rdt_resource(r) {
 		hw_res = resctrl_to_arch_res(r);
 
-- 
2.41.0


^ permalink raw reply related	[flat|nested] 331+ messages in thread

* [PATCH v13 8/8] x86/resctrl: Update documentation with Sub-NUMA cluster changes
  2023-12-04 18:53                 ` [PATCH v13 0/8] Add support for Sub-NUMA cluster (SNC) systems Tony Luck
                                     ` (6 preceding siblings ...)
  2023-12-04 18:53                   ` [PATCH v13 7/8] x86/resctrl: Sub NUMA Cluster detection and enable Tony Luck
@ 2023-12-04 18:53                   ` Tony Luck
  2023-12-04 23:04                     ` Moger, Babu
  2023-12-04 21:20                   ` [PATCH v13 0/8] Add support for Sub-NUMA cluster (SNC) systems Moger, Babu
                                     ` (2 subsequent siblings)
  10 siblings, 1 reply; 331+ messages in thread
From: Tony Luck @ 2023-12-04 18:53 UTC (permalink / raw)
  To: Fenghua Yu, Reinette Chatre, Peter Newman, Jonathan Corbet,
	Shuah Khan, x86
  Cc: Shaopeng Tan, James Morse, Jamie Iles, Babu Moger, Randy Dunlap,
	linux-kernel, linux-doc, patches, Tony Luck, Shaopeng Tan

With Sub-NUMA Cluster mode enabled, the scope of monitoring resources is
per-NODE instead of per-L3 cache. Suffixes of directories with "L3" in
their name refer to Sub-NUMA nodes instead of L3 cache ids.

Users should be aware that SNC mode also affects the amount of L3 cache
available for allocation within each SNC node.
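
As an illustration of the documentation's advice to read counters from every
Sub-NUMA node (the paths are examples only; the set of mon_L3_* directories
depends on the system and the resctrl group being monitored):

	#include <stdio.h>

	int main(void)
	{
		const char *files[] = {
			"/sys/fs/resctrl/mon_data/mon_L3_00/llc_occupancy",
			"/sys/fs/resctrl/mon_data/mon_L3_01/llc_occupancy",
		};
		unsigned long long total = 0, val;
		unsigned int i;

		for (i = 0; i < sizeof(files) / sizeof(files[0]); i++) {
			FILE *f = fopen(files[i], "r");

			if (!f)
				continue;	/* domain not present on this system */
			if (fscanf(f, "%llu", &val) == 1)
				total += val;
			fclose(f);
		}
		printf("total llc_occupancy: %llu\n", total);
		return 0;
	}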

Signed-off-by: Tony Luck <tony.luck@intel.com>
Tested-by: Shaopeng Tan <tan.shaopeng@jp.fujitsu.com>
Reviewed-by: Peter Newman <peternewman@google.com>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Reviewed-by: Shaopeng Tan <tan.shaopeng@jp.fujitsu.com>
---
 Documentation/arch/x86/resctrl.rst | 25 +++++++++++++++++++++----
 1 file changed, 21 insertions(+), 4 deletions(-)

diff --git a/Documentation/arch/x86/resctrl.rst b/Documentation/arch/x86/resctrl.rst
index a6279df64a9d..15f1cff6ee76 100644
--- a/Documentation/arch/x86/resctrl.rst
+++ b/Documentation/arch/x86/resctrl.rst
@@ -366,10 +366,10 @@ When control is enabled all CTRL_MON groups will also contain:
 When monitoring is enabled all MON groups will also contain:
 
 "mon_data":
-	This contains a set of files organized by L3 domain and by
-	RDT event. E.g. on a system with two L3 domains there will
-	be subdirectories "mon_L3_00" and "mon_L3_01".	Each of these
-	directories have one file per event (e.g. "llc_occupancy",
+	This contains a set of files organized by L3 domain or by NUMA
+	node (depending on whether Sub-NUMA Cluster (SNC) mode is disabled
+	or enabled respectively) and by RDT event.  Each of these
+	directories has one file per event (e.g. "llc_occupancy",
 	"mbm_total_bytes", and "mbm_local_bytes"). In a MON group these
 	files provide a read out of the current value of the event for
 	all tasks in the group. In CTRL_MON groups these files provide
@@ -478,6 +478,23 @@ if non-contiguous 1s value is supported. On a system with a 20-bit mask
 each bit represents 5% of the capacity of the cache. You could partition
 the cache into four equal parts with masks: 0x1f, 0x3e0, 0x7c00, 0xf8000.
 
+Notes on Sub-NUMA Cluster mode
+==============================
+When SNC mode is enabled, Linux may load balance tasks between Sub-NUMA
+nodes much more readily than between regular NUMA nodes since the CPUs
+on Sub-NUMA nodes share the same L3 cache and the system may report
+the NUMA distance between Sub-NUMA nodes with a lower value than used
+for regular NUMA nodes.  Users who do not bind tasks to the CPUs of a
+specific Sub-NUMA node must read the "llc_occupancy", "mbm_total_bytes",
+and "mbm_local_bytes" for all Sub-NUMA nodes where the tasks may execute
+to get the full view of traffic for which the tasks were the source.
+
+The cache allocation feature still provides the same number of
+bits in a mask to control allocation into the L3 cache, but each
+of those ways has its capacity reduced because the cache is divided
+between the SNC nodes. The values reported in the resctrl
+"size" files are adjusted accordingly.
+
 Memory bandwidth Allocation and monitoring
 ==========================================
 
-- 
2.41.0


^ permalink raw reply related	[flat|nested] 331+ messages in thread

* Re: [PATCH v13 0/8] Add support for Sub-NUMA cluster (SNC) systems
  2023-12-04 18:53                 ` [PATCH v13 0/8] Add support for Sub-NUMA cluster (SNC) systems Tony Luck
                                     ` (7 preceding siblings ...)
  2023-12-04 18:53                   ` [PATCH v13 8/8] x86/resctrl: Update documentation with Sub-NUMA cluster changes Tony Luck
@ 2023-12-04 21:20                   ` Moger, Babu
  2023-12-04 21:33                     ` Luck, Tony
  2023-12-04 23:25                   ` Tony Luck
  2024-01-26 22:38                   ` [PATCH v14 " Tony Luck
  10 siblings, 1 reply; 331+ messages in thread
From: Moger, Babu @ 2023-12-04 21:20 UTC (permalink / raw)
  To: Tony Luck, Fenghua Yu, Reinette Chatre, Peter Newman,
	Jonathan Corbet, Shuah Khan, x86
  Cc: Shaopeng Tan, James Morse, Jamie Iles, Babu Moger, Randy Dunlap,
	linux-kernel, linux-doc, patches

Hi Tony,

Tested the series on an AMD system. Just ran a few basic tests. Everything
looks good.

Thanks

Babu

On 12/4/2023 12:53 PM, Tony Luck wrote:
> The Sub-NUMA cluster feature on some Intel processors partitions the CPUs
> that share an L3 cache into two or more sets. This plays havoc with the
> Resource Director Technology (RDT) monitoring features.  Prior to this
> patch Intel has advised that SNC and RDT are incompatible.
>
> Some of these CPUs support an MSR that can partition the RMID counters in
> the same way. This allows monitoring features to be used. With the caveat
> that users must be aware that Linux may migrate tasks more frequently
> between SNC nodes than between "regular" NUMA nodes, so reading counters
> from all SNC nodes may be needed to get a complete picture of activity
> for tasks.
>
> Cache and memory bandwidth allocation features continue to operate at
> the scope of the L3 cache.
>
> Signed-off-by: Tony Luck <tony.luck@intel.com>
>
> Changes since v12:
>
> All:
> 	Reinette - put commit tags in right order for TIP (Tested-by before
> 	Reviewed-by)
>
> Patch 7:
> 	Fam Zheng - Check for -1 return from get_cpu_cacheinfo_id() and
> 	increase size of bitmap tracking # of L3 instances.
> 	Reinette - Add extra sanity checks. Note that this patch has
> 	some additional tweaks beyond the e-mail discussion.
> 		1) "3" is a valid return in addition to 1, 2, 4
> 		2) Added a warning if the sanity checks fail that
> 		prints number of CPU nodes and number of L3 cache
> 		instances that were found.
>
> Patch 8:
> 	Babu - Fix grammar with an additional comma.
>
>
> Tony Luck (8):
>    x86/resctrl: Prepare for new domain scope
>    x86/resctrl: Prepare to split rdt_domain structure
>    x86/resctrl: Prepare for different scope for control/monitor
>      operations
>    x86/resctrl: Split the rdt_domain and rdt_hw_domain structures
>    x86/resctrl: Add node-scope to the options for feature scope
>    x86/resctrl: Introduce snc_nodes_per_l3_cache
>    x86/resctrl: Sub NUMA Cluster detection and enable
>    x86/resctrl: Update documentation with Sub-NUMA cluster changes
>
>   Documentation/arch/x86/resctrl.rst        |  25 +-
>   include/linux/resctrl.h                   |  85 +++--
>   arch/x86/include/asm/msr-index.h          |   1 +
>   arch/x86/kernel/cpu/resctrl/internal.h    |  66 ++--
>   arch/x86/kernel/cpu/resctrl/core.c        | 433 +++++++++++++++++-----
>   arch/x86/kernel/cpu/resctrl/ctrlmondata.c |  58 +--
>   arch/x86/kernel/cpu/resctrl/monitor.c     |  68 ++--
>   arch/x86/kernel/cpu/resctrl/pseudo_lock.c |  26 +-
>   arch/x86/kernel/cpu/resctrl/rdtgroup.c    | 149 ++++----
>   9 files changed, 629 insertions(+), 282 deletions(-)
>
>
> base-commit: 2cc14f52aeb78ce3f29677c2de1f06c0e91471ab

^ permalink raw reply	[flat|nested] 331+ messages in thread

* RE: [PATCH v13 0/8] Add support for Sub-NUMA cluster (SNC) systems
  2023-12-04 21:20                   ` [PATCH v13 0/8] Add support for Sub-NUMA cluster (SNC) systems Moger, Babu
@ 2023-12-04 21:33                     ` Luck, Tony
  0 siblings, 0 replies; 331+ messages in thread
From: Luck, Tony @ 2023-12-04 21:33 UTC (permalink / raw)
  To: babu.moger, Yu, Fenghua, Chatre, Reinette, Peter Newman,
	Jonathan Corbet, Shuah Khan, x86
  Cc: Shaopeng Tan, James Morse, Jamie Iles, Randy Dunlap,
	linux-kernel, linux-doc, patches

> Tested the series on AMD system. Just ran few basic tests. Everything 
> looking good.

Babu,

Thanks for testing. I'll add your Tested-by tag if[1] I make a v14.

-Tony

[1] realistically not if, but when :-( 

^ permalink raw reply	[flat|nested] 331+ messages in thread

* Re: [PATCH v13 1/8] x86/resctrl: Prepare for new domain scope
  2023-12-04 18:53                   ` [PATCH v13 1/8] x86/resctrl: Prepare for new domain scope Tony Luck
@ 2023-12-04 23:03                     ` Moger, Babu
  0 siblings, 0 replies; 331+ messages in thread
From: Moger, Babu @ 2023-12-04 23:03 UTC (permalink / raw)
  To: Tony Luck, Fenghua Yu, Reinette Chatre, Peter Newman,
	Jonathan Corbet, Shuah Khan, x86
  Cc: Shaopeng Tan, James Morse, Jamie Iles, Babu Moger, Randy Dunlap,
	linux-kernel, linux-doc, patches, Shaopeng Tan


On 12/4/2023 12:53 PM, Tony Luck wrote:
> Resctrl resources operate on subsets of CPUs in the system with the
> defining attribute of each subset being an instance of a particular
> level of cache. E.g. all CPUs sharing an L3 cache would be part of the
> same domain.
>
> In preparation for features that are scoped at the NUMA node level
> change the code from explicit references to "cache_level" to a more
> generic scope. At this point the only options for this scope are groups
> of CPUs that share an L2 cache or L3 cache.
>
> Clean up the error handling when looking up domains. Report invalid id's
> before calling rdt_find_domain() in preparation for better messages when
> scope can be other than cache scope. This means that rdt_find_domain()
> will never return an error. So remove checks for error from the callsites.
>
> Signed-off-by: Tony Luck<tony.luck@intel.com>
> Tested-by: Shaopeng Tan<tan.shaopeng@jp.fujitsu.com>
> Reviewed-by: Reinette Chatre<reinette.chatre@intel.com>
> Reviewed-by: Shaopeng Tan<tan.shaopeng@jp.fujitsu.com>
Reviewed-by: Babu Moger <babu.moger@amd.com>
> ---
>   include/linux/resctrl.h                   |  9 ++++-
>   arch/x86/kernel/cpu/resctrl/core.c        | 45 ++++++++++++++++-------
>   arch/x86/kernel/cpu/resctrl/ctrlmondata.c |  2 +-
>   arch/x86/kernel/cpu/resctrl/pseudo_lock.c |  6 ++-
>   arch/x86/kernel/cpu/resctrl/rdtgroup.c    |  5 ++-
>   5 files changed, 48 insertions(+), 19 deletions(-)
>
> diff --git a/include/linux/resctrl.h b/include/linux/resctrl.h
> index 66942d7fba7f..7d4eb7df611d 100644
> --- a/include/linux/resctrl.h
> +++ b/include/linux/resctrl.h
> @@ -144,13 +144,18 @@ struct resctrl_membw {
>   struct rdt_parse_data;
>   struct resctrl_schema;
>   
> +enum resctrl_scope {
> +	RESCTRL_L2_CACHE = 2,
> +	RESCTRL_L3_CACHE = 3,
> +};
> +
>   /**
>    * struct rdt_resource - attributes of a resctrl resource
>    * @rid:		The index of the resource
>    * @alloc_capable:	Is allocation available on this machine
>    * @mon_capable:	Is monitor feature available on this machine
>    * @num_rmid:		Number of RMIDs available
> - * @cache_level:	Which cache level defines scope of this resource
> + * @scope:		Scope of this resource
>    * @cache:		Cache allocation related data
>    * @membw:		If the component has bandwidth controls, their properties.
>    * @domains:		All domains for this resource
> @@ -168,7 +173,7 @@ struct rdt_resource {
>   	bool			alloc_capable;
>   	bool			mon_capable;
>   	int			num_rmid;
> -	int			cache_level;
> +	enum resctrl_scope	scope;
>   	struct resctrl_cache	cache;
>   	struct resctrl_membw	membw;
>   	struct list_head	domains;
> diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
> index 19e0681f0435..fd113bc29d4e 100644
> --- a/arch/x86/kernel/cpu/resctrl/core.c
> +++ b/arch/x86/kernel/cpu/resctrl/core.c
> @@ -65,7 +65,7 @@ struct rdt_hw_resource rdt_resources_all[] = {
>   		.r_resctrl = {
>   			.rid			= RDT_RESOURCE_L3,
>   			.name			= "L3",
> -			.cache_level		= 3,
> +			.scope			= RESCTRL_L3_CACHE,
>   			.domains		= domain_init(RDT_RESOURCE_L3),
>   			.parse_ctrlval		= parse_cbm,
>   			.format_str		= "%d=%0*x",
> @@ -79,7 +79,7 @@ struct rdt_hw_resource rdt_resources_all[] = {
>   		.r_resctrl = {
>   			.rid			= RDT_RESOURCE_L2,
>   			.name			= "L2",
> -			.cache_level		= 2,
> +			.scope			= RESCTRL_L2_CACHE,
>   			.domains		= domain_init(RDT_RESOURCE_L2),
>   			.parse_ctrlval		= parse_cbm,
>   			.format_str		= "%d=%0*x",
> @@ -93,7 +93,7 @@ struct rdt_hw_resource rdt_resources_all[] = {
>   		.r_resctrl = {
>   			.rid			= RDT_RESOURCE_MBA,
>   			.name			= "MB",
> -			.cache_level		= 3,
> +			.scope			= RESCTRL_L3_CACHE,
>   			.domains		= domain_init(RDT_RESOURCE_MBA),
>   			.parse_ctrlval		= parse_bw,
>   			.format_str		= "%d=%*u",
> @@ -105,7 +105,7 @@ struct rdt_hw_resource rdt_resources_all[] = {
>   		.r_resctrl = {
>   			.rid			= RDT_RESOURCE_SMBA,
>   			.name			= "SMBA",
> -			.cache_level		= 3,
> +			.scope			= RESCTRL_L3_CACHE,
>   			.domains		= domain_init(RDT_RESOURCE_SMBA),
>   			.parse_ctrlval		= parse_bw,
>   			.format_str		= "%d=%*u",
> @@ -401,9 +401,6 @@ struct rdt_domain *rdt_find_domain(struct rdt_resource *r, int id,
>   	struct rdt_domain *d;
>   	struct list_head *l;
>   
> -	if (id < 0)
> -		return ERR_PTR(-ENODEV);
> -
>   	list_for_each(l, &r->domains) {
>   		d = list_entry(l, struct rdt_domain, list);
>   		/* When id is found, return its domain. */
> @@ -491,6 +488,19 @@ static int arch_domain_mbm_alloc(u32 num_rmid, struct rdt_hw_domain *hw_dom)
>   	return 0;
>   }
>   
> +static int get_domain_id_from_scope(int cpu, enum resctrl_scope scope)
> +{
> +	switch (scope) {
> +	case RESCTRL_L2_CACHE:
> +	case RESCTRL_L3_CACHE:
> +		return get_cpu_cacheinfo_id(cpu, scope);
> +	default:
> +		break;
> +	}
> +
> +	return -EINVAL;
> +}
> +
>   /*
>    * domain_add_cpu - Add a cpu to a resource's domain list.
>    *
> @@ -506,18 +516,19 @@ static int arch_domain_mbm_alloc(u32 num_rmid, struct rdt_hw_domain *hw_dom)
>    */
>   static void domain_add_cpu(int cpu, struct rdt_resource *r)
>   {
> -	int id = get_cpu_cacheinfo_id(cpu, r->cache_level);
> +	int id = get_domain_id_from_scope(cpu, r->scope);
>   	struct list_head *add_pos = NULL;
>   	struct rdt_hw_domain *hw_dom;
>   	struct rdt_domain *d;
>   	int err;
>   
> -	d = rdt_find_domain(r, id, &add_pos);
> -	if (IS_ERR(d)) {
> -		pr_warn("Couldn't find cache id for CPU %d\n", cpu);
> +	if (id < 0) {
> +		pr_warn_once("Can't find domain id for CPU:%d scope:%d for resource %s\n",
> +			     cpu, r->scope, r->name);
>   		return;
>   	}
>   
> +	d = rdt_find_domain(r, id, &add_pos);
>   	if (d) {
>   		cpumask_set_cpu(cpu, &d->cpu_mask);
>   		if (r->cache.arch_has_per_cpu_cfg)
> @@ -556,13 +567,19 @@ static void domain_add_cpu(int cpu, struct rdt_resource *r)
>   
>   static void domain_remove_cpu(int cpu, struct rdt_resource *r)
>   {
> -	int id = get_cpu_cacheinfo_id(cpu, r->cache_level);
> +	int id = get_domain_id_from_scope(cpu, r->scope);
>   	struct rdt_hw_domain *hw_dom;
>   	struct rdt_domain *d;
>   
> +	if (id < 0) {
> +		pr_warn_once("Can't find domain id for CPU:%d scope:%d for resource %s\n",
> +			     cpu, r->scope, r->name);
> +		return;
> +	}
> +
>   	d = rdt_find_domain(r, id, NULL);
> -	if (IS_ERR_OR_NULL(d)) {
> -		pr_warn("Couldn't find cache id for CPU %d\n", cpu);
> +	if (!d) {
> +		pr_warn("Couldn't find domain with id=%d for CPU %d\n", id, cpu);
>   		return;
>   	}
>   	hw_dom = resctrl_to_arch_dom(d);
> diff --git a/arch/x86/kernel/cpu/resctrl/ctrlmondata.c b/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
> index beccb0e87ba7..3f8891d57fac 100644
> --- a/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
> +++ b/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
> @@ -563,7 +563,7 @@ int rdtgroup_mondata_show(struct seq_file *m, void *arg)
>   
>   	r = &rdt_resources_all[resid].r_resctrl;
>   	d = rdt_find_domain(r, domid, NULL);
> -	if (IS_ERR_OR_NULL(d)) {
> +	if (!d) {
>   		ret = -ENOENT;
>   		goto out;
>   	}
> diff --git a/arch/x86/kernel/cpu/resctrl/pseudo_lock.c b/arch/x86/kernel/cpu/resctrl/pseudo_lock.c
> index 8f559eeae08e..2a682da9f43a 100644
> --- a/arch/x86/kernel/cpu/resctrl/pseudo_lock.c
> +++ b/arch/x86/kernel/cpu/resctrl/pseudo_lock.c
> @@ -292,10 +292,14 @@ static void pseudo_lock_region_clear(struct pseudo_lock_region *plr)
>    */
>   static int pseudo_lock_region_init(struct pseudo_lock_region *plr)
>   {
> +	enum resctrl_scope scope = plr->s->res->scope;
>   	struct cpu_cacheinfo *ci;
>   	int ret;
>   	int i;
>   
> +	if (WARN_ON_ONCE(scope != RESCTRL_L2_CACHE && scope != RESCTRL_L3_CACHE))
> +		return -ENODEV;
> +
>   	/* Pick the first cpu we find that is associated with the cache. */
>   	plr->cpu = cpumask_first(&plr->d->cpu_mask);
>   
> @@ -311,7 +315,7 @@ static int pseudo_lock_region_init(struct pseudo_lock_region *plr)
>   	plr->size = rdtgroup_cbm_to_size(plr->s->res, plr->d, plr->cbm);
>   
>   	for (i = 0; i < ci->num_leaves; i++) {
> -		if (ci->info_list[i].level == plr->s->res->cache_level) {
> +		if (ci->info_list[i].level == scope) {
>   			plr->line_size = ci->info_list[i].coherency_line_size;
>   			return 0;
>   		}
> diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> index 69a1de92384a..c44be64d65ec 100644
> --- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> +++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> @@ -1413,10 +1413,13 @@ unsigned int rdtgroup_cbm_to_size(struct rdt_resource *r,
>   	unsigned int size = 0;
>   	int num_b, i;
>   
> +	if (WARN_ON_ONCE(r->scope != RESCTRL_L2_CACHE && r->scope != RESCTRL_L3_CACHE))
> +		return size;
> +
>   	num_b = bitmap_weight(&cbm, r->cache.cbm_len);
>   	ci = get_cpu_cacheinfo(cpumask_any(&d->cpu_mask));
>   	for (i = 0; i < ci->num_leaves; i++) {
> -		if (ci->info_list[i].level == r->cache_level) {
> +		if (ci->info_list[i].level == r->scope) {
>   			size = ci->info_list[i].size / r->cache.cbm_len * num_b;
>   			break;
>   		}

^ permalink raw reply	[flat|nested] 331+ messages in thread
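
Aside for readers following the scope rework above: the enum values are deliberately chosen to equal the cache levels, so a scope value can be handed straight to the cacheinfo lookup. Below is a minimal userspace sketch of that idea; fake_cacheinfo_id() is a made-up stand-in for the kernel's get_cpu_cacheinfo_id(), and the CPU-to-cache mapping is invented purely for illustration.

  #include <stdio.h>

  enum resctrl_scope {
  	RESCTRL_L2_CACHE = 2,
  	RESCTRL_L3_CACHE = 3,
  };

  /* Made-up stand-in for get_cpu_cacheinfo_id(): pretends two CPUs share
   * an L2 and eight CPUs share an L3. */
  static int fake_cacheinfo_id(int cpu, int level)
  {
  	return level == 2 ? cpu / 2 : cpu / 8;
  }

  static int get_domain_id_from_scope(int cpu, enum resctrl_scope scope)
  {
  	switch (scope) {
  	case RESCTRL_L2_CACHE:
  	case RESCTRL_L3_CACHE:
  		/* The enum value doubles as the cache level. */
  		return fake_cacheinfo_id(cpu, scope);
  	default:
  		break;
  	}
  	return -1;	/* unknown scope: callers treat < 0 as "no domain" */
  }

  int main(void)
  {
  	for (int cpu = 0; cpu < 4; cpu++)
  		printf("cpu %d -> L2 domain %d, L3 domain %d\n", cpu,
  		       get_domain_id_from_scope(cpu, RESCTRL_L2_CACHE),
  		       get_domain_id_from_scope(cpu, RESCTRL_L3_CACHE));
  	return 0;
  }

The default branch mirrors the patch's behaviour of returning a negative value for scopes that do not map to a cache level, which later patches in the series extend for node scope.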

* Re: [PATCH v13 2/8] x86/resctrl: Prepare to split rdt_domain structure
  2023-12-04 18:53                   ` [PATCH v13 2/8] x86/resctrl: Prepare to split rdt_domain structure Tony Luck
@ 2023-12-04 23:03                     ` Moger, Babu
  0 siblings, 0 replies; 331+ messages in thread
From: Moger, Babu @ 2023-12-04 23:03 UTC (permalink / raw)
  To: Tony Luck, Fenghua Yu, Reinette Chatre, Peter Newman,
	Jonathan Corbet, Shuah Khan, x86
  Cc: Shaopeng Tan, James Morse, Jamie Iles, Babu Moger, Randy Dunlap,
	linux-kernel, linux-doc, patches, Shaopeng Tan


On 12/4/2023 12:53 PM, Tony Luck wrote:
> The rdt_domain structure is used for both control and monitor features.
> It is about to be split into separate structures for these two usages
> because the scope for control and monitoring features for a resource
> will be different for future resources.
>
> To allow for common code that scans a list of domains looking for a
> specific domain id, move all the common fields ("list", "id", "cpu_mask")
> into their own structure within the rdt_domain structure.
>
> Signed-off-by: Tony Luck <tony.luck@intel.com>
> Tested-by: Shaopeng Tan <tan.shaopeng@jp.fujitsu.com>
> Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
> Reviewed-by: Shaopeng Tan <tan.shaopeng@jp.fujitsu.com>
Reviewed-by: Babu Moger <babu.moger@amd.com>
> ---
>   include/linux/resctrl.h                   | 16 ++++--
>   arch/x86/kernel/cpu/resctrl/core.c        | 26 +++++-----
>   arch/x86/kernel/cpu/resctrl/ctrlmondata.c | 22 ++++-----
>   arch/x86/kernel/cpu/resctrl/monitor.c     | 10 ++--
>   arch/x86/kernel/cpu/resctrl/pseudo_lock.c | 14 +++---
>   arch/x86/kernel/cpu/resctrl/rdtgroup.c    | 60 +++++++++++------------
>   6 files changed, 78 insertions(+), 70 deletions(-)
>
> diff --git a/include/linux/resctrl.h b/include/linux/resctrl.h
> index 7d4eb7df611d..c4067150a6b7 100644
> --- a/include/linux/resctrl.h
> +++ b/include/linux/resctrl.h
> @@ -53,10 +53,20 @@ struct resctrl_staged_config {
>   };
>   
>   /**
> - * struct rdt_domain - group of CPUs sharing a resctrl resource
> + * struct rdt_domain_hdr - common header for different domain types
>    * @list:		all instances of this resource
>    * @id:			unique id for this instance
>    * @cpu_mask:		which CPUs share this resource
> + */
> +struct rdt_domain_hdr {
> +	struct list_head		list;
> +	int				id;
> +	struct cpumask			cpu_mask;
> +};
> +
> +/**
> + * struct rdt_domain - group of CPUs sharing a resctrl resource
> + * @hdr:		common header for different domain types
>    * @rmid_busy_llc:	bitmap of which limbo RMIDs are above threshold
>    * @mbm_total:		saved state for MBM total bandwidth
>    * @mbm_local:		saved state for MBM local bandwidth
> @@ -71,9 +81,7 @@ struct resctrl_staged_config {
>    *			by closid
>    */
>   struct rdt_domain {
> -	struct list_head		list;
> -	int				id;
> -	struct cpumask			cpu_mask;
> +	struct rdt_domain_hdr		hdr;
>   	unsigned long			*rmid_busy_llc;
>   	struct mbm_state		*mbm_total;
>   	struct mbm_state		*mbm_local;
> diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
> index fd113bc29d4e..62a989fd950d 100644
> --- a/arch/x86/kernel/cpu/resctrl/core.c
> +++ b/arch/x86/kernel/cpu/resctrl/core.c
> @@ -356,9 +356,9 @@ struct rdt_domain *get_domain_from_cpu(int cpu, struct rdt_resource *r)
>   {
>   	struct rdt_domain *d;
>   
> -	list_for_each_entry(d, &r->domains, list) {
> +	list_for_each_entry(d, &r->domains, hdr.list) {
>   		/* Find the domain that contains this CPU */
> -		if (cpumask_test_cpu(cpu, &d->cpu_mask))
> +		if (cpumask_test_cpu(cpu, &d->hdr.cpu_mask))
>   			return d;
>   	}
>   
> @@ -402,12 +402,12 @@ struct rdt_domain *rdt_find_domain(struct rdt_resource *r, int id,
>   	struct list_head *l;
>   
>   	list_for_each(l, &r->domains) {
> -		d = list_entry(l, struct rdt_domain, list);
> +		d = list_entry(l, struct rdt_domain, hdr.list);
>   		/* When id is found, return its domain. */
> -		if (id == d->id)
> +		if (id == d->hdr.id)
>   			return d;
>   		/* Stop searching when finding id's position in sorted list. */
> -		if (id < d->id)
> +		if (id < d->hdr.id)
>   			break;
>   	}
>   
> @@ -530,7 +530,7 @@ static void domain_add_cpu(int cpu, struct rdt_resource *r)
>   
>   	d = rdt_find_domain(r, id, &add_pos);
>   	if (d) {
> -		cpumask_set_cpu(cpu, &d->cpu_mask);
> +		cpumask_set_cpu(cpu, &d->hdr.cpu_mask);
>   		if (r->cache.arch_has_per_cpu_cfg)
>   			rdt_domain_reconfigure_cdp(r);
>   		return;
> @@ -541,8 +541,8 @@ static void domain_add_cpu(int cpu, struct rdt_resource *r)
>   		return;
>   
>   	d = &hw_dom->d_resctrl;
> -	d->id = id;
> -	cpumask_set_cpu(cpu, &d->cpu_mask);
> +	d->hdr.id = id;
> +	cpumask_set_cpu(cpu, &d->hdr.cpu_mask);
>   
>   	rdt_domain_reconfigure_cdp(r);
>   
> @@ -556,11 +556,11 @@ static void domain_add_cpu(int cpu, struct rdt_resource *r)
>   		return;
>   	}
>   
> -	list_add_tail(&d->list, add_pos);
> +	list_add_tail(&d->hdr.list, add_pos);
>   
>   	err = resctrl_online_domain(r, d);
>   	if (err) {
> -		list_del(&d->list);
> +		list_del(&d->hdr.list);
>   		domain_free(hw_dom);
>   	}
>   }
> @@ -584,10 +584,10 @@ static void domain_remove_cpu(int cpu, struct rdt_resource *r)
>   	}
>   	hw_dom = resctrl_to_arch_dom(d);
>   
> -	cpumask_clear_cpu(cpu, &d->cpu_mask);
> -	if (cpumask_empty(&d->cpu_mask)) {
> +	cpumask_clear_cpu(cpu, &d->hdr.cpu_mask);
> +	if (cpumask_empty(&d->hdr.cpu_mask)) {
>   		resctrl_offline_domain(r, d);
> -		list_del(&d->list);
> +		list_del(&d->hdr.list);
>   
>   		/*
>   		 * rdt_domain "d" is going to be freed below, so clear
> diff --git a/arch/x86/kernel/cpu/resctrl/ctrlmondata.c b/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
> index 3f8891d57fac..23f8258d36a8 100644
> --- a/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
> +++ b/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
> @@ -67,7 +67,7 @@ int parse_bw(struct rdt_parse_data *data, struct resctrl_schema *s,
>   
>   	cfg = &d->staged_config[s->conf_type];
>   	if (cfg->have_new_ctrl) {
> -		rdt_last_cmd_printf("Duplicate domain %d\n", d->id);
> +		rdt_last_cmd_printf("Duplicate domain %d\n", d->hdr.id);
>   		return -EINVAL;
>   	}
>   
> @@ -146,7 +146,7 @@ int parse_cbm(struct rdt_parse_data *data, struct resctrl_schema *s,
>   
>   	cfg = &d->staged_config[s->conf_type];
>   	if (cfg->have_new_ctrl) {
> -		rdt_last_cmd_printf("Duplicate domain %d\n", d->id);
> +		rdt_last_cmd_printf("Duplicate domain %d\n", d->hdr.id);
>   		return -EINVAL;
>   	}
>   
> @@ -226,8 +226,8 @@ static int parse_line(char *line, struct resctrl_schema *s,
>   		return -EINVAL;
>   	}
>   	dom = strim(dom);
> -	list_for_each_entry(d, &r->domains, list) {
> -		if (d->id == dom_id) {
> +	list_for_each_entry(d, &r->domains, hdr.list) {
> +		if (d->hdr.id == dom_id) {
>   			data.buf = dom;
>   			data.rdtgrp = rdtgrp;
>   			if (r->parse_ctrlval(&data, s, d))
> @@ -274,7 +274,7 @@ static bool apply_config(struct rdt_hw_domain *hw_dom,
>   	struct rdt_domain *dom = &hw_dom->d_resctrl;
>   
>   	if (cfg->new_ctrl != hw_dom->ctrl_val[idx]) {
> -		cpumask_set_cpu(cpumask_any(&dom->cpu_mask), cpu_mask);
> +		cpumask_set_cpu(cpumask_any(&dom->hdr.cpu_mask), cpu_mask);
>   		hw_dom->ctrl_val[idx] = cfg->new_ctrl;
>   
>   		return true;
> @@ -291,7 +291,7 @@ int resctrl_arch_update_one(struct rdt_resource *r, struct rdt_domain *d,
>   	u32 idx = get_config_index(closid, t);
>   	struct msr_param msr_param;
>   
> -	if (!cpumask_test_cpu(smp_processor_id(), &d->cpu_mask))
> +	if (!cpumask_test_cpu(smp_processor_id(), &d->hdr.cpu_mask))
>   		return -EINVAL;
>   
>   	hw_dom->ctrl_val[idx] = cfg_val;
> @@ -318,7 +318,7 @@ int resctrl_arch_update_domains(struct rdt_resource *r, u32 closid)
>   		return -ENOMEM;
>   
>   	msr_param.res = NULL;
> -	list_for_each_entry(d, &r->domains, list) {
> +	list_for_each_entry(d, &r->domains, hdr.list) {
>   		hw_dom = resctrl_to_arch_dom(d);
>   		for (t = 0; t < CDP_NUM_TYPES; t++) {
>   			cfg = &hw_dom->d_resctrl.staged_config[t];
> @@ -466,7 +466,7 @@ static void show_doms(struct seq_file *s, struct resctrl_schema *schema, int clo
>   	u32 ctrl_val;
>   
>   	seq_printf(s, "%*s:", max_name_width, schema->name);
> -	list_for_each_entry(dom, &r->domains, list) {
> +	list_for_each_entry(dom, &r->domains, hdr.list) {
>   		if (sep)
>   			seq_puts(s, ";");
>   
> @@ -476,7 +476,7 @@ static void show_doms(struct seq_file *s, struct resctrl_schema *schema, int clo
>   			ctrl_val = resctrl_arch_get_config(r, dom, closid,
>   							   schema->conf_type);
>   
> -		seq_printf(s, r->format_str, dom->id, max_data_width,
> +		seq_printf(s, r->format_str, dom->hdr.id, max_data_width,
>   			   ctrl_val);
>   		sep = true;
>   	}
> @@ -505,7 +505,7 @@ int rdtgroup_schemata_show(struct kernfs_open_file *of,
>   			} else {
>   				seq_printf(s, "%s:%d=%x\n",
>   					   rdtgrp->plr->s->res->name,
> -					   rdtgrp->plr->d->id,
> +					   rdtgrp->plr->d->hdr.id,
>   					   rdtgrp->plr->cbm);
>   			}
>   		} else {
> @@ -536,7 +536,7 @@ void mon_event_read(struct rmid_read *rr, struct rdt_resource *r,
>   	rr->val = 0;
>   	rr->first = first;
>   
> -	smp_call_function_any(&d->cpu_mask, mon_event_count, rr, 1);
> +	smp_call_function_any(&d->hdr.cpu_mask, mon_event_count, rr, 1);
>   }
>   
>   int rdtgroup_mondata_show(struct seq_file *m, void *arg)
> diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c
> index f136ac046851..dd0ea1bc0092 100644
> --- a/arch/x86/kernel/cpu/resctrl/monitor.c
> +++ b/arch/x86/kernel/cpu/resctrl/monitor.c
> @@ -238,7 +238,7 @@ int resctrl_arch_rmid_read(struct rdt_resource *r, struct rdt_domain *d,
>   	u64 msr_val, chunks;
>   	int ret;
>   
> -	if (!cpumask_test_cpu(smp_processor_id(), &d->cpu_mask))
> +	if (!cpumask_test_cpu(smp_processor_id(), &d->hdr.cpu_mask))
>   		return -EINVAL;
>   
>   	ret = __rmid_read(rmid, eventid, &msr_val);
> @@ -340,8 +340,8 @@ static void add_rmid_to_limbo(struct rmid_entry *entry)
>   
>   	entry->busy = 0;
>   	cpu = get_cpu();
> -	list_for_each_entry(d, &r->domains, list) {
> -		if (cpumask_test_cpu(cpu, &d->cpu_mask)) {
> +	list_for_each_entry(d, &r->domains, hdr.list) {
> +		if (cpumask_test_cpu(cpu, &d->hdr.cpu_mask)) {
>   			err = resctrl_arch_rmid_read(r, d, entry->rmid,
>   						     QOS_L3_OCCUP_EVENT_ID,
>   						     &val);
> @@ -661,7 +661,7 @@ void cqm_setup_limbo_handler(struct rdt_domain *dom, unsigned long delay_ms)
>   	unsigned long delay = msecs_to_jiffies(delay_ms);
>   	int cpu;
>   
> -	cpu = cpumask_any(&dom->cpu_mask);
> +	cpu = cpumask_any(&dom->hdr.cpu_mask);
>   	dom->cqm_work_cpu = cpu;
>   
>   	schedule_delayed_work_on(cpu, &dom->cqm_limbo, delay);
> @@ -708,7 +708,7 @@ void mbm_setup_overflow_handler(struct rdt_domain *dom, unsigned long delay_ms)
>   
>   	if (!static_branch_likely(&rdt_mon_enable_key))
>   		return;
> -	cpu = cpumask_any(&dom->cpu_mask);
> +	cpu = cpumask_any(&dom->hdr.cpu_mask);
>   	dom->mbm_work_cpu = cpu;
>   	schedule_delayed_work_on(cpu, &dom->mbm_over, delay);
>   }
> diff --git a/arch/x86/kernel/cpu/resctrl/pseudo_lock.c b/arch/x86/kernel/cpu/resctrl/pseudo_lock.c
> index 2a682da9f43a..fcbd99e2eb66 100644
> --- a/arch/x86/kernel/cpu/resctrl/pseudo_lock.c
> +++ b/arch/x86/kernel/cpu/resctrl/pseudo_lock.c
> @@ -221,7 +221,7 @@ static int pseudo_lock_cstates_constrain(struct pseudo_lock_region *plr)
>   	int cpu;
>   	int ret;
>   
> -	for_each_cpu(cpu, &plr->d->cpu_mask) {
> +	for_each_cpu(cpu, &plr->d->hdr.cpu_mask) {
>   		pm_req = kzalloc(sizeof(*pm_req), GFP_KERNEL);
>   		if (!pm_req) {
>   			rdt_last_cmd_puts("Failure to allocate memory for PM QoS\n");
> @@ -301,7 +301,7 @@ static int pseudo_lock_region_init(struct pseudo_lock_region *plr)
>   		return -ENODEV;
>   
>   	/* Pick the first cpu we find that is associated with the cache. */
> -	plr->cpu = cpumask_first(&plr->d->cpu_mask);
> +	plr->cpu = cpumask_first(&plr->d->hdr.cpu_mask);
>   
>   	if (!cpu_online(plr->cpu)) {
>   		rdt_last_cmd_printf("CPU %u associated with cache not online\n",
> @@ -856,10 +856,10 @@ bool rdtgroup_pseudo_locked_in_hierarchy(struct rdt_domain *d)
>   	 * associated with them.
>   	 */
>   	for_each_alloc_capable_rdt_resource(r) {
> -		list_for_each_entry(d_i, &r->domains, list) {
> +		list_for_each_entry(d_i, &r->domains, hdr.list) {
>   			if (d_i->plr)
>   				cpumask_or(cpu_with_psl, cpu_with_psl,
> -					   &d_i->cpu_mask);
> +					   &d_i->hdr.cpu_mask);
>   		}
>   	}
>   
> @@ -867,7 +867,7 @@ bool rdtgroup_pseudo_locked_in_hierarchy(struct rdt_domain *d)
>   	 * Next test if new pseudo-locked region would intersect with
>   	 * existing region.
>   	 */
> -	if (cpumask_intersects(&d->cpu_mask, cpu_with_psl))
> +	if (cpumask_intersects(&d->hdr.cpu_mask, cpu_with_psl))
>   		ret = true;
>   
>   	free_cpumask_var(cpu_with_psl);
> @@ -1199,7 +1199,7 @@ static int pseudo_lock_measure_cycles(struct rdtgroup *rdtgrp, int sel)
>   	}
>   
>   	plr->thread_done = 0;
> -	cpu = cpumask_first(&plr->d->cpu_mask);
> +	cpu = cpumask_first(&plr->d->hdr.cpu_mask);
>   	if (!cpu_online(cpu)) {
>   		ret = -ENODEV;
>   		goto out;
> @@ -1529,7 +1529,7 @@ static int pseudo_lock_dev_mmap(struct file *filp, struct vm_area_struct *vma)
>   	 * may be scheduled elsewhere and invalidate entries in the
>   	 * pseudo-locked region.
>   	 */
> -	if (!cpumask_subset(current->cpus_ptr, &plr->d->cpu_mask)) {
> +	if (!cpumask_subset(current->cpus_ptr, &plr->d->hdr.cpu_mask)) {
>   		mutex_unlock(&rdtgroup_mutex);
>   		return -EINVAL;
>   	}
> diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> index c44be64d65ec..04d32602ac33 100644
> --- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> +++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> @@ -91,7 +91,7 @@ void rdt_staged_configs_clear(void)
>   	lockdep_assert_held(&rdtgroup_mutex);
>   
>   	for_each_alloc_capable_rdt_resource(r) {
> -		list_for_each_entry(dom, &r->domains, list)
> +		list_for_each_entry(dom, &r->domains, hdr.list)
>   			memset(dom->staged_config, 0, sizeof(dom->staged_config));
>   	}
>   }
> @@ -295,7 +295,7 @@ static int rdtgroup_cpus_show(struct kernfs_open_file *of,
>   				rdt_last_cmd_puts("Cache domain offline\n");
>   				ret = -ENODEV;
>   			} else {
> -				mask = &rdtgrp->plr->d->cpu_mask;
> +				mask = &rdtgrp->plr->d->hdr.cpu_mask;
>   				seq_printf(s, is_cpu_list(of) ?
>   					   "%*pbl\n" : "%*pb\n",
>   					   cpumask_pr_args(mask));
> @@ -984,12 +984,12 @@ static int rdt_bit_usage_show(struct kernfs_open_file *of,
>   
>   	mutex_lock(&rdtgroup_mutex);
>   	hw_shareable = r->cache.shareable_bits;
> -	list_for_each_entry(dom, &r->domains, list) {
> +	list_for_each_entry(dom, &r->domains, hdr.list) {
>   		if (sep)
>   			seq_putc(seq, ';');
>   		sw_shareable = 0;
>   		exclusive = 0;
> -		seq_printf(seq, "%d=", dom->id);
> +		seq_printf(seq, "%d=", dom->hdr.id);
>   		for (i = 0; i < closids_supported(); i++) {
>   			if (!closid_allocated(i))
>   				continue;
> @@ -1302,7 +1302,7 @@ static bool rdtgroup_mode_test_exclusive(struct rdtgroup *rdtgrp)
>   		if (r->rid == RDT_RESOURCE_MBA || r->rid == RDT_RESOURCE_SMBA)
>   			continue;
>   		has_cache = true;
> -		list_for_each_entry(d, &r->domains, list) {
> +		list_for_each_entry(d, &r->domains, hdr.list) {
>   			ctrl = resctrl_arch_get_config(r, d, closid,
>   						       s->conf_type);
>   			if (rdtgroup_cbm_overlaps(s, d, ctrl, closid, false)) {
> @@ -1417,7 +1417,7 @@ unsigned int rdtgroup_cbm_to_size(struct rdt_resource *r,
>   		return size;
>   
>   	num_b = bitmap_weight(&cbm, r->cache.cbm_len);
> -	ci = get_cpu_cacheinfo(cpumask_any(&d->cpu_mask));
> +	ci = get_cpu_cacheinfo(cpumask_any(&d->hdr.cpu_mask));
>   	for (i = 0; i < ci->num_leaves; i++) {
>   		if (ci->info_list[i].level == r->scope) {
>   			size = ci->info_list[i].size / r->cache.cbm_len * num_b;
> @@ -1465,7 +1465,7 @@ static int rdtgroup_size_show(struct kernfs_open_file *of,
>   			size = rdtgroup_cbm_to_size(rdtgrp->plr->s->res,
>   						    rdtgrp->plr->d,
>   						    rdtgrp->plr->cbm);
> -			seq_printf(s, "%d=%u\n", rdtgrp->plr->d->id, size);
> +			seq_printf(s, "%d=%u\n", rdtgrp->plr->d->hdr.id, size);
>   		}
>   		goto out;
>   	}
> @@ -1477,7 +1477,7 @@ static int rdtgroup_size_show(struct kernfs_open_file *of,
>   		type = schema->conf_type;
>   		sep = false;
>   		seq_printf(s, "%*s:", max_name_width, schema->name);
> -		list_for_each_entry(d, &r->domains, list) {
> +		list_for_each_entry(d, &r->domains, hdr.list) {
>   			if (sep)
>   				seq_putc(s, ';');
>   			if (rdtgrp->mode == RDT_MODE_PSEUDO_LOCKSETUP) {
> @@ -1495,7 +1495,7 @@ static int rdtgroup_size_show(struct kernfs_open_file *of,
>   				else
>   					size = rdtgroup_cbm_to_size(r, d, ctrl);
>   			}
> -			seq_printf(s, "%d=%u", d->id, size);
> +			seq_printf(s, "%d=%u", d->hdr.id, size);
>   			sep = true;
>   		}
>   		seq_putc(s, '\n');
> @@ -1555,7 +1555,7 @@ static void mon_event_config_read(void *info)
>   
>   static void mondata_config_read(struct rdt_domain *d, struct mon_config_info *mon_info)
>   {
> -	smp_call_function_any(&d->cpu_mask, mon_event_config_read, mon_info, 1);
> +	smp_call_function_any(&d->hdr.cpu_mask, mon_event_config_read, mon_info, 1);
>   }
>   
>   static int mbm_config_show(struct seq_file *s, struct rdt_resource *r, u32 evtid)
> @@ -1566,7 +1566,7 @@ static int mbm_config_show(struct seq_file *s, struct rdt_resource *r, u32 evtid
>   
>   	mutex_lock(&rdtgroup_mutex);
>   
> -	list_for_each_entry(dom, &r->domains, list) {
> +	list_for_each_entry(dom, &r->domains, hdr.list) {
>   		if (sep)
>   			seq_puts(s, ";");
>   
> @@ -1574,7 +1574,7 @@ static int mbm_config_show(struct seq_file *s, struct rdt_resource *r, u32 evtid
>   		mon_info.evtid = evtid;
>   		mondata_config_read(dom, &mon_info);
>   
> -		seq_printf(s, "%d=0x%02x", dom->id, mon_info.mon_config);
> +		seq_printf(s, "%d=0x%02x", dom->hdr.id, mon_info.mon_config);
>   		sep = true;
>   	}
>   	seq_puts(s, "\n");
> @@ -1646,7 +1646,7 @@ static int mbm_config_write_domain(struct rdt_resource *r,
>   	 * are scoped at the domain level. Writing any of these MSRs
>   	 * on one CPU is observed by all the CPUs in the domain.
>   	 */
> -	smp_call_function_any(&d->cpu_mask, mon_event_config_write,
> +	smp_call_function_any(&d->hdr.cpu_mask, mon_event_config_write,
>   			      &mon_info, 1);
>   
>   	/*
> @@ -1689,8 +1689,8 @@ static int mon_config_write(struct rdt_resource *r, char *tok, u32 evtid)
>   		return -EINVAL;
>   	}
>   
> -	list_for_each_entry(d, &r->domains, list) {
> -		if (d->id == dom_id) {
> +	list_for_each_entry(d, &r->domains, hdr.list) {
> +		if (d->hdr.id == dom_id) {
>   			ret = mbm_config_write_domain(r, d, evtid, val);
>   			if (ret)
>   				return -EINVAL;
> @@ -2232,14 +2232,14 @@ static int set_cache_qos_cfg(int level, bool enable)
>   		return -ENOMEM;
>   
>   	r_l = &rdt_resources_all[level].r_resctrl;
> -	list_for_each_entry(d, &r_l->domains, list) {
> +	list_for_each_entry(d, &r_l->domains, hdr.list) {
>   		if (r_l->cache.arch_has_per_cpu_cfg)
>   			/* Pick all the CPUs in the domain instance */
> -			for_each_cpu(cpu, &d->cpu_mask)
> +			for_each_cpu(cpu, &d->hdr.cpu_mask)
>   				cpumask_set_cpu(cpu, cpu_mask);
>   		else
>   			/* Pick one CPU from each domain instance to update MSR */
> -			cpumask_set_cpu(cpumask_any(&d->cpu_mask), cpu_mask);
> +			cpumask_set_cpu(cpumask_any(&d->hdr.cpu_mask), cpu_mask);
>   	}
>   
>   	/* Update QOS_CFG MSR on all the CPUs in cpu_mask */
> @@ -2268,7 +2268,7 @@ void rdt_domain_reconfigure_cdp(struct rdt_resource *r)
>   static int mba_sc_domain_allocate(struct rdt_resource *r, struct rdt_domain *d)
>   {
>   	u32 num_closid = resctrl_arch_get_num_closid(r);
> -	int cpu = cpumask_any(&d->cpu_mask);
> +	int cpu = cpumask_any(&d->hdr.cpu_mask);
>   	int i;
>   
>   	d->mbps_val = kcalloc_node(num_closid, sizeof(*d->mbps_val),
> @@ -2317,7 +2317,7 @@ static int set_mba_sc(bool mba_sc)
>   
>   	r->membw.mba_sc = mba_sc;
>   
> -	list_for_each_entry(d, &r->domains, list) {
> +	list_for_each_entry(d, &r->domains, hdr.list) {
>   		for (i = 0; i < num_closid; i++)
>   			d->mbps_val[i] = MBA_MAX_MBPS;
>   	}
> @@ -2653,7 +2653,7 @@ static int rdt_get_tree(struct fs_context *fc)
>   
>   	if (is_mbm_enabled()) {
>   		r = &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl;
> -		list_for_each_entry(dom, &r->domains, list)
> +		list_for_each_entry(dom, &r->domains, hdr.list)
>   			mbm_setup_overflow_handler(dom, MBM_OVERFLOW_INTERVAL);
>   	}
>   
> @@ -2780,9 +2780,9 @@ static int reset_all_ctrls(struct rdt_resource *r)
>   	 * CBMs in all domains to the maximum mask value. Pick one CPU
>   	 * from each domain to update the MSRs below.
>   	 */
> -	list_for_each_entry(d, &r->domains, list) {
> +	list_for_each_entry(d, &r->domains, hdr.list) {
>   		hw_dom = resctrl_to_arch_dom(d);
> -		cpumask_set_cpu(cpumask_any(&d->cpu_mask), cpu_mask);
> +		cpumask_set_cpu(cpumask_any(&d->hdr.cpu_mask), cpu_mask);
>   
>   		for (i = 0; i < hw_res->num_closid; i++)
>   			hw_dom->ctrl_val[i] = r->default_ctrl;
> @@ -2986,7 +2986,7 @@ static int mkdir_mondata_subdir(struct kernfs_node *parent_kn,
>   	char name[32];
>   	int ret;
>   
> -	sprintf(name, "mon_%s_%02d", r->name, d->id);
> +	sprintf(name, "mon_%s_%02d", r->name, d->hdr.id);
>   	/* create the directory */
>   	kn = kernfs_create_dir(parent_kn, name, parent_kn->mode, prgrp);
>   	if (IS_ERR(kn))
> @@ -3002,7 +3002,7 @@ static int mkdir_mondata_subdir(struct kernfs_node *parent_kn,
>   	}
>   
>   	priv.u.rid = r->rid;
> -	priv.u.domid = d->id;
> +	priv.u.domid = d->hdr.id;
>   	list_for_each_entry(mevt, &r->evt_list, list) {
>   		priv.u.evtid = mevt->evtid;
>   		ret = mon_addfile(kn, mevt->name, priv.priv);
> @@ -3050,7 +3050,7 @@ static int mkdir_mondata_subdir_alldom(struct kernfs_node *parent_kn,
>   	struct rdt_domain *dom;
>   	int ret;
>   
> -	list_for_each_entry(dom, &r->domains, list) {
> +	list_for_each_entry(dom, &r->domains, hdr.list) {
>   		ret = mkdir_mondata_subdir(parent_kn, dom, r, prgrp);
>   		if (ret)
>   			return ret;
> @@ -3209,7 +3209,7 @@ static int __init_one_rdt_domain(struct rdt_domain *d, struct resctrl_schema *s,
>   	 */
>   	tmp_cbm = cfg->new_ctrl;
>   	if (bitmap_weight(&tmp_cbm, r->cache.cbm_len) < r->cache.min_cbm_bits) {
> -		rdt_last_cmd_printf("No space on %s:%d\n", s->name, d->id);
> +		rdt_last_cmd_printf("No space on %s:%d\n", s->name, d->hdr.id);
>   		return -ENOSPC;
>   	}
>   	cfg->have_new_ctrl = true;
> @@ -3232,7 +3232,7 @@ static int rdtgroup_init_cat(struct resctrl_schema *s, u32 closid)
>   	struct rdt_domain *d;
>   	int ret;
>   
> -	list_for_each_entry(d, &s->res->domains, list) {
> +	list_for_each_entry(d, &s->res->domains, hdr.list) {
>   		ret = __init_one_rdt_domain(d, s, closid);
>   		if (ret < 0)
>   			return ret;
> @@ -3247,7 +3247,7 @@ static void rdtgroup_init_mba(struct rdt_resource *r, u32 closid)
>   	struct resctrl_staged_config *cfg;
>   	struct rdt_domain *d;
>   
> -	list_for_each_entry(d, &r->domains, list) {
> +	list_for_each_entry(d, &r->domains, hdr.list) {
>   		if (is_mba_sc(r)) {
>   			d->mbps_val[closid] = MBA_MAX_MBPS;
>   			continue;
> @@ -3864,7 +3864,7 @@ void resctrl_offline_domain(struct rdt_resource *r, struct rdt_domain *d)
>   	 * per domain monitor data directories.
>   	 */
>   	if (static_branch_unlikely(&rdt_mon_enable_key))
> -		rmdir_mondata_subdir_allrdtgrp(r, d->id);
> +		rmdir_mondata_subdir_allrdtgrp(r, d->hdr.id);
>   
>   	if (is_mbm_enabled())
>   		cancel_delayed_work(&d->mbm_over);

^ permalink raw reply	[flat|nested] 331+ messages in thread
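
As a side note on the rdt_domain_hdr refactor reviewed above: pulling the shared fields into a common header lets generic list-walking code operate on the header alone and recover the enclosing structure afterwards. The sketch below illustrates that pattern in plain userspace C; struct example_domain and its fields are invented for illustration and only mirror the shape of rdt_domain.

  #include <stddef.h>
  #include <stdio.h>

  /* Recover a pointer to the enclosing structure from a pointer to one of
   * its members, like the kernel's container_of(). */
  #define container_of(ptr, type, member) \
  	((type *)((char *)(ptr) - offsetof(type, member)))

  /* Common header shared by every domain type (mirrors rdt_domain_hdr). */
  struct domain_hdr {
  	int id;
  };

  /* Invented example type standing in for rdt_domain: it embeds the
   * header and adds its own private state. */
  struct example_domain {
  	struct domain_hdr hdr;
  	unsigned long private_state;
  };

  int main(void)
  {
  	struct example_domain d = { .hdr = { .id = 7 }, .private_state = 42 };

  	/* Generic code only sees the header ... */
  	struct domain_hdr *hdr = &d.hdr;

  	/* ... and type-aware callers recover the full object afterwards. */
  	struct example_domain *dom = container_of(hdr, struct example_domain, hdr);

  	printf("id=%d private_state=%lu\n", dom->hdr.id, dom->private_state);
  	return 0;
  }

This is the same embed-and-container_of idiom the following patch relies on once control and monitor domains become distinct types.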

* Re: [PATCH v13 3/8] x86/resctrl: Prepare for different scope for control/monitor operations
  2023-12-04 18:53                   ` [PATCH v13 3/8] x86/resctrl: Prepare for different scope for control/monitor operations Tony Luck
@ 2023-12-04 23:03                     ` Moger, Babu
  0 siblings, 0 replies; 331+ messages in thread
From: Moger, Babu @ 2023-12-04 23:03 UTC (permalink / raw)
  To: Tony Luck, Fenghua Yu, Reinette Chatre, Peter Newman,
	Jonathan Corbet, Shuah Khan, x86
  Cc: Shaopeng Tan, James Morse, Jamie Iles, Babu Moger, Randy Dunlap,
	linux-kernel, linux-doc, patches, Shaopeng Tan


On 12/4/2023 12:53 PM, Tony Luck wrote:
> Resctrl assumes that control and monitor operations on a resource are
> performed at the same scope.
>
> Prepare for systems that use different scope (specifically Intel needs
> to split the RDT_RESOURCE_L3 resource to use L3 scope for cache control
> and NODE scope for cache occupancy and memory bandwidth monitoring).
>
> Create separate domain lists for control and monitor operations.
>
> Note that errors during initialization of either control or monitor
> functions on a domain would previously result in that domain being
> excluded from both control and monitor operations. Now that the domains
> are allocated independently, it is no longer necessary to disable both
> control and monitor operations if either one fails.
>
> Signed-off-by: Tony Luck <tony.luck@intel.com>
> Tested-by: Shaopeng Tan <tan.shaopeng@jp.fujitsu.com>
> Reviewed-by: Shaopeng Tan <tan.shaopeng@jp.fujitsu.com>
> Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Reviewed-by: Babu Moger <babu.moger@amd.com>
> ---
>   include/linux/resctrl.h                   |  25 ++-
>   arch/x86/kernel/cpu/resctrl/internal.h    |   6 +-
>   arch/x86/kernel/cpu/resctrl/core.c        | 211 ++++++++++++++++------
>   arch/x86/kernel/cpu/resctrl/ctrlmondata.c |  12 +-
>   arch/x86/kernel/cpu/resctrl/monitor.c     |   4 +-
>   arch/x86/kernel/cpu/resctrl/pseudo_lock.c |   4 +-
>   arch/x86/kernel/cpu/resctrl/rdtgroup.c    |  55 +++---
>   7 files changed, 220 insertions(+), 97 deletions(-)
>
> diff --git a/include/linux/resctrl.h b/include/linux/resctrl.h
> index c4067150a6b7..35e700edc6e6 100644
> --- a/include/linux/resctrl.h
> +++ b/include/linux/resctrl.h
> @@ -52,15 +52,22 @@ struct resctrl_staged_config {
>   	bool			have_new_ctrl;
>   };
>   
> +enum resctrl_domain_type {
> +	RESCTRL_CTRL_DOMAIN,
> +	RESCTRL_MON_DOMAIN,
> +};
> +
>   /**
>    * struct rdt_domain_hdr - common header for different domain types
>    * @list:		all instances of this resource
>    * @id:			unique id for this instance
> + * @type:		type of this instance
>    * @cpu_mask:		which CPUs share this resource
>    */
>   struct rdt_domain_hdr {
>   	struct list_head		list;
>   	int				id;
> +	enum resctrl_domain_type	type;
>   	struct cpumask			cpu_mask;
>   };
>   
> @@ -163,10 +170,12 @@ enum resctrl_scope {
>    * @alloc_capable:	Is allocation available on this machine
>    * @mon_capable:	Is monitor feature available on this machine
>    * @num_rmid:		Number of RMIDs available
> - * @scope:		Scope of this resource
> + * @ctrl_scope:		Scope of this resource for control functions
> + * @mon_scope:		Scope of this resource for monitor functions
>    * @cache:		Cache allocation related data
>    * @membw:		If the component has bandwidth controls, their properties.
> - * @domains:		All domains for this resource
> + * @ctrl_domains:	Control domains for this resource
> + * @mon_domains:	Monitor domains for this resource
>    * @name:		Name to use in "schemata" file.
>    * @data_width:		Character width of data when displaying
>    * @default_ctrl:	Specifies default cache cbm or memory B/W percent.
> @@ -181,10 +190,12 @@ struct rdt_resource {
>   	bool			alloc_capable;
>   	bool			mon_capable;
>   	int			num_rmid;
> -	enum resctrl_scope	scope;
> +	enum resctrl_scope	ctrl_scope;
> +	enum resctrl_scope	mon_scope;
>   	struct resctrl_cache	cache;
>   	struct resctrl_membw	membw;
> -	struct list_head	domains;
> +	struct list_head	ctrl_domains;
> +	struct list_head	mon_domains;
>   	char			*name;
>   	int			data_width;
>   	u32			default_ctrl;
> @@ -230,8 +241,10 @@ int resctrl_arch_update_one(struct rdt_resource *r, struct rdt_domain *d,
>   
>   u32 resctrl_arch_get_config(struct rdt_resource *r, struct rdt_domain *d,
>   			    u32 closid, enum resctrl_conf_type type);
> -int resctrl_online_domain(struct rdt_resource *r, struct rdt_domain *d);
> -void resctrl_offline_domain(struct rdt_resource *r, struct rdt_domain *d);
> +int resctrl_online_ctrl_domain(struct rdt_resource *r, struct rdt_domain *d);
> +int resctrl_online_mon_domain(struct rdt_resource *r, struct rdt_domain *d);
> +void resctrl_offline_ctrl_domain(struct rdt_resource *r, struct rdt_domain *d);
> +void resctrl_offline_mon_domain(struct rdt_resource *r, struct rdt_domain *d);
>   
>   /**
>    * resctrl_arch_rmid_read() - Read the eventid counter corresponding to rmid
> diff --git a/arch/x86/kernel/cpu/resctrl/internal.h b/arch/x86/kernel/cpu/resctrl/internal.h
> index a4f1aa15f0a2..24bf9d7989a9 100644
> --- a/arch/x86/kernel/cpu/resctrl/internal.h
> +++ b/arch/x86/kernel/cpu/resctrl/internal.h
> @@ -520,8 +520,8 @@ void rdtgroup_kn_unlock(struct kernfs_node *kn);
>   int rdtgroup_kn_mode_restrict(struct rdtgroup *r, const char *name);
>   int rdtgroup_kn_mode_restore(struct rdtgroup *r, const char *name,
>   			     umode_t mask);
> -struct rdt_domain *rdt_find_domain(struct rdt_resource *r, int id,
> -				   struct list_head **pos);
> +struct rdt_domain_hdr *rdt_find_domain(struct list_head *h, int id,
> +				       struct list_head **pos);
>   ssize_t rdtgroup_schemata_write(struct kernfs_open_file *of,
>   				char *buf, size_t nbytes, loff_t off);
>   int rdtgroup_schemata_show(struct kernfs_open_file *of,
> @@ -540,7 +540,7 @@ int rdt_pseudo_lock_init(void);
>   void rdt_pseudo_lock_release(void);
>   int rdtgroup_pseudo_lock_create(struct rdtgroup *rdtgrp);
>   void rdtgroup_pseudo_lock_remove(struct rdtgroup *rdtgrp);
> -struct rdt_domain *get_domain_from_cpu(int cpu, struct rdt_resource *r);
> +struct rdt_domain *get_ctrl_domain_from_cpu(int cpu, struct rdt_resource *r);
>   int closids_supported(void);
>   void closid_free(int closid);
>   int alloc_rmid(void);
> diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
> index 62a989fd950d..1fd85533b4ca 100644
> --- a/arch/x86/kernel/cpu/resctrl/core.c
> +++ b/arch/x86/kernel/cpu/resctrl/core.c
> @@ -57,7 +57,8 @@ static void
>   mba_wrmsr_amd(struct rdt_domain *d, struct msr_param *m,
>   	      struct rdt_resource *r);
>   
> -#define domain_init(id) LIST_HEAD_INIT(rdt_resources_all[id].r_resctrl.domains)
> +#define ctrl_domain_init(id) LIST_HEAD_INIT(rdt_resources_all[id].r_resctrl.ctrl_domains)
> +#define mon_domain_init(id) LIST_HEAD_INIT(rdt_resources_all[id].r_resctrl.mon_domains)
>   
>   struct rdt_hw_resource rdt_resources_all[] = {
>   	[RDT_RESOURCE_L3] =
> @@ -65,8 +66,10 @@ struct rdt_hw_resource rdt_resources_all[] = {
>   		.r_resctrl = {
>   			.rid			= RDT_RESOURCE_L3,
>   			.name			= "L3",
> -			.scope			= RESCTRL_L3_CACHE,
> -			.domains		= domain_init(RDT_RESOURCE_L3),
> +			.ctrl_scope		= RESCTRL_L3_CACHE,
> +			.mon_scope		= RESCTRL_L3_CACHE,
> +			.ctrl_domains		= ctrl_domain_init(RDT_RESOURCE_L3),
> +			.mon_domains		= mon_domain_init(RDT_RESOURCE_L3),
>   			.parse_ctrlval		= parse_cbm,
>   			.format_str		= "%d=%0*x",
>   			.fflags			= RFTYPE_RES_CACHE,
> @@ -79,8 +82,8 @@ struct rdt_hw_resource rdt_resources_all[] = {
>   		.r_resctrl = {
>   			.rid			= RDT_RESOURCE_L2,
>   			.name			= "L2",
> -			.scope			= RESCTRL_L2_CACHE,
> -			.domains		= domain_init(RDT_RESOURCE_L2),
> +			.ctrl_scope		= RESCTRL_L2_CACHE,
> +			.ctrl_domains		= ctrl_domain_init(RDT_RESOURCE_L2),
>   			.parse_ctrlval		= parse_cbm,
>   			.format_str		= "%d=%0*x",
>   			.fflags			= RFTYPE_RES_CACHE,
> @@ -93,8 +96,8 @@ struct rdt_hw_resource rdt_resources_all[] = {
>   		.r_resctrl = {
>   			.rid			= RDT_RESOURCE_MBA,
>   			.name			= "MB",
> -			.scope			= RESCTRL_L3_CACHE,
> -			.domains		= domain_init(RDT_RESOURCE_MBA),
> +			.ctrl_scope		= RESCTRL_L3_CACHE,
> +			.ctrl_domains		= ctrl_domain_init(RDT_RESOURCE_MBA),
>   			.parse_ctrlval		= parse_bw,
>   			.format_str		= "%d=%*u",
>   			.fflags			= RFTYPE_RES_MB,
> @@ -105,8 +108,8 @@ struct rdt_hw_resource rdt_resources_all[] = {
>   		.r_resctrl = {
>   			.rid			= RDT_RESOURCE_SMBA,
>   			.name			= "SMBA",
> -			.scope			= RESCTRL_L3_CACHE,
> -			.domains		= domain_init(RDT_RESOURCE_SMBA),
> +			.ctrl_scope		= RESCTRL_L3_CACHE,
> +			.ctrl_domains		= ctrl_domain_init(RDT_RESOURCE_SMBA),
>   			.parse_ctrlval		= parse_bw,
>   			.format_str		= "%d=%*u",
>   			.fflags			= RFTYPE_RES_MB,
> @@ -352,11 +355,11 @@ cat_wrmsr(struct rdt_domain *d, struct msr_param *m, struct rdt_resource *r)
>   		wrmsrl(hw_res->msr_base + i, hw_dom->ctrl_val[i]);
>   }
>   
> -struct rdt_domain *get_domain_from_cpu(int cpu, struct rdt_resource *r)
> +struct rdt_domain *get_ctrl_domain_from_cpu(int cpu, struct rdt_resource *r)
>   {
>   	struct rdt_domain *d;
>   
> -	list_for_each_entry(d, &r->domains, hdr.list) {
> +	list_for_each_entry(d, &r->ctrl_domains, hdr.list) {
>   		/* Find the domain that contains this CPU */
>   		if (cpumask_test_cpu(cpu, &d->hdr.cpu_mask))
>   			return d;
> @@ -378,7 +381,7 @@ void rdt_ctrl_update(void *arg)
>   	int cpu = smp_processor_id();
>   	struct rdt_domain *d;
>   
> -	d = get_domain_from_cpu(cpu, r);
> +	d = get_ctrl_domain_from_cpu(cpu, r);
>   	if (d) {
>   		hw_res->msr_update(d, m, r);
>   		return;
> @@ -388,26 +391,26 @@ void rdt_ctrl_update(void *arg)
>   }
>   
>   /*
> - * rdt_find_domain - Find a domain in a resource that matches input resource id
> + * rdt_find_domain - Search for a domain id in a resource domain list.
>    *
> - * Search resource r's domain list to find the resource id. If the resource
> - * id is found in a domain, return the domain. Otherwise, if requested by
> - * caller, return the first domain whose id is bigger than the input id.
> - * The domain list is sorted by id in ascending order.
> + * Search the domain list to find the domain id. If the domain id is
> + * found, return the domain. NULL otherwise.  If the domain id is not
> + * found (and NULL returned) then the first domain with id bigger than
> + * the input id can be returned to the caller via @pos.
>    */
> -struct rdt_domain *rdt_find_domain(struct rdt_resource *r, int id,
> -				   struct list_head **pos)
> +struct rdt_domain_hdr *rdt_find_domain(struct list_head *h, int id,
> +				       struct list_head **pos)
>   {
> -	struct rdt_domain *d;
> +	struct rdt_domain_hdr *d;
>   	struct list_head *l;
>   
> -	list_for_each(l, &r->domains) {
> -		d = list_entry(l, struct rdt_domain, hdr.list);
> +	list_for_each(l, h) {
> +		d = list_entry(l, struct rdt_domain_hdr, list);
>   		/* When id is found, return its domain. */
> -		if (id == d->hdr.id)
> +		if (id == d->id)
>   			return d;
>   		/* Stop searching when finding id's position in sorted list. */
> -		if (id < d->hdr.id)
> +		if (id < d->id)
>   			break;
>   	}
>   
> @@ -501,35 +504,28 @@ static int get_domain_id_from_scope(int cpu, enum resctrl_scope scope)
>   	return -EINVAL;
>   }
>   
> -/*
> - * domain_add_cpu - Add a cpu to a resource's domain list.
> - *
> - * If an existing domain in the resource r's domain list matches the cpu's
> - * resource id, add the cpu in the domain.
> - *
> - * Otherwise, a new domain is allocated and inserted into the right position
> - * in the domain list sorted by id in ascending order.
> - *
> - * The order in the domain list is visible to users when we print entries
> - * in the schemata file and schemata input is validated to have the same order
> - * as this list.
> - */
> -static void domain_add_cpu(int cpu, struct rdt_resource *r)
> +static void domain_add_cpu_ctrl(int cpu, struct rdt_resource *r)
>   {
> -	int id = get_domain_id_from_scope(cpu, r->scope);
> +	int id = get_domain_id_from_scope(cpu, r->ctrl_scope);
>   	struct list_head *add_pos = NULL;
>   	struct rdt_hw_domain *hw_dom;
> +	struct rdt_domain_hdr *hdr;
>   	struct rdt_domain *d;
>   	int err;
>   
>   	if (id < 0) {
> -		pr_warn_once("Can't find domain id for CPU:%d scope:%d for resource %s\n",
> -			     cpu, r->scope, r->name);
> +		pr_warn_once("Can't find control domain id for CPU:%d scope:%d for resource %s\n",
> +			     cpu, r->ctrl_scope, r->name);
>   		return;
>   	}
>   
> -	d = rdt_find_domain(r, id, &add_pos);
> -	if (d) {
> +	hdr = rdt_find_domain(&r->ctrl_domains, id, &add_pos);
> +	if (hdr) {
> +		if (WARN_ON_ONCE(hdr->type != RESCTRL_CTRL_DOMAIN))
> +			return;
> +
> +		d = container_of(hdr, struct rdt_domain, hdr);
> +
>   		cpumask_set_cpu(cpu, &d->hdr.cpu_mask);
>   		if (r->cache.arch_has_per_cpu_cfg)
>   			rdt_domain_reconfigure_cdp(r);
> @@ -542,51 +538,114 @@ static void domain_add_cpu(int cpu, struct rdt_resource *r)
>   
>   	d = &hw_dom->d_resctrl;
>   	d->hdr.id = id;
> +	d->hdr.type = RESCTRL_CTRL_DOMAIN;
>   	cpumask_set_cpu(cpu, &d->hdr.cpu_mask);
>   
>   	rdt_domain_reconfigure_cdp(r);
>   
> -	if (r->alloc_capable && domain_setup_ctrlval(r, d)) {
> +	if (domain_setup_ctrlval(r, d)) {
>   		domain_free(hw_dom);
>   		return;
>   	}
>   
> -	if (r->mon_capable && arch_domain_mbm_alloc(r->num_rmid, hw_dom)) {
> +	list_add_tail(&d->hdr.list, add_pos);
> +
> +	err = resctrl_online_ctrl_domain(r, d);
> +	if (err) {
> +		list_del(&d->hdr.list);
> +		domain_free(hw_dom);
> +	}
> +}
> +
> +static void domain_add_cpu_mon(int cpu, struct rdt_resource *r)
> +{
> +	int id = get_domain_id_from_scope(cpu, r->mon_scope);
> +	struct list_head *add_pos = NULL;
> +	struct rdt_hw_domain *hw_dom;
> +	struct rdt_domain_hdr *hdr;
> +	struct rdt_domain *d;
> +	int err;
> +
> +	if (id < 0) {
> +		pr_warn_once("Can't find monitor domain id for CPU:%d scope:%d for resource %s\n",
> +			     cpu, r->mon_scope, r->name);
> +		return;
> +	}
> +
> +	hdr = rdt_find_domain(&r->mon_domains, id, &add_pos);
> +	if (hdr) {
> +		if (WARN_ON_ONCE(hdr->type != RESCTRL_MON_DOMAIN))
> +			return;
> +
> +		d = container_of(hdr, struct rdt_domain, hdr);
> +
> +		cpumask_set_cpu(cpu, &d->hdr.cpu_mask);
> +		return;
> +	}
> +
> +	hw_dom = kzalloc_node(sizeof(*hw_dom), GFP_KERNEL, cpu_to_node(cpu));
> +	if (!hw_dom)
> +		return;
> +
> +	d = &hw_dom->d_resctrl;
> +	d->hdr.id = id;
> +	d->hdr.type = RESCTRL_MON_DOMAIN;
> +	cpumask_set_cpu(cpu, &d->hdr.cpu_mask);
> +
> +	if (arch_domain_mbm_alloc(r->num_rmid, hw_dom)) {
>   		domain_free(hw_dom);
>   		return;
>   	}
>   
>   	list_add_tail(&d->hdr.list, add_pos);
>   
> -	err = resctrl_online_domain(r, d);
> +	err = resctrl_online_mon_domain(r, d);
>   	if (err) {
>   		list_del(&d->hdr.list);
>   		domain_free(hw_dom);
>   	}
>   }
>   
> -static void domain_remove_cpu(int cpu, struct rdt_resource *r)
> +/*
> + * domain_add_cpu - Add a CPU to either/both resource's domain lists.
> + */
> +static void domain_add_cpu(int cpu, struct rdt_resource *r)
>   {
> -	int id = get_domain_id_from_scope(cpu, r->scope);
> +	if (r->alloc_capable)
> +		domain_add_cpu_ctrl(cpu, r);
> +	if (r->mon_capable)
> +		domain_add_cpu_mon(cpu, r);
> +}
> +
> +static void domain_remove_cpu_ctrl(int cpu, struct rdt_resource *r)
> +{
> +	int id = get_domain_id_from_scope(cpu, r->ctrl_scope);
>   	struct rdt_hw_domain *hw_dom;
> +	struct rdt_domain_hdr *hdr;
>   	struct rdt_domain *d;
>   
>   	if (id < 0) {
> -		pr_warn_once("Can't find domain id for CPU:%d scope:%d for resource %s\n",
> -			     cpu, r->scope, r->name);
> +		pr_warn_once("Can't find control domain id for CPU:%d scope:%d for resource %s\n",
> +			     cpu, r->ctrl_scope, r->name);
>   		return;
>   	}
>   
> -	d = rdt_find_domain(r, id, NULL);
> -	if (!d) {
> -		pr_warn("Couldn't find domain with id=%d for CPU %d\n", id, cpu);
> +	hdr = rdt_find_domain(&r->ctrl_domains, id, NULL);
> +	if (!hdr) {
> +		pr_warn("Can't find control domain for id=%d for CPU %d for resource %s\n",
> +			id, cpu, r->name);
>   		return;
>   	}
> +
> +	if (WARN_ON_ONCE(hdr->type != RESCTRL_CTRL_DOMAIN))
> +		return;
> +
> +	d = container_of(hdr, struct rdt_domain, hdr);
>   	hw_dom = resctrl_to_arch_dom(d);
>   
>   	cpumask_clear_cpu(cpu, &d->hdr.cpu_mask);
>   	if (cpumask_empty(&d->hdr.cpu_mask)) {
> -		resctrl_offline_domain(r, d);
> +		resctrl_offline_ctrl_domain(r, d);
>   		list_del(&d->hdr.list);
>   
>   		/*
> @@ -599,6 +658,42 @@ static void domain_remove_cpu(int cpu, struct rdt_resource *r)
>   
>   		return;
>   	}
> +}
> +
> +static void domain_remove_cpu_mon(int cpu, struct rdt_resource *r)
> +{
> +	int id = get_domain_id_from_scope(cpu, r->mon_scope);
> +	struct rdt_hw_domain *hw_dom;
> +	struct rdt_domain_hdr *hdr;
> +	struct rdt_domain *d;
> +
> +	if (id < 0) {
> +		pr_warn_once("Can't find monitor domain id for CPU:%d scope:%d for resource %s\n",
> +			     cpu, r->mon_scope, r->name);
> +		return;
> +	}
> +
> +	hdr = rdt_find_domain(&r->mon_domains, id, NULL);
> +	if (!hdr) {
> +		pr_warn("Can't find monitor domain for id=%d for CPU %d for resource %s\n",
> +			id, cpu, r->name);
> +		return;
> +	}
> +
> +	if (WARN_ON_ONCE(hdr->type != RESCTRL_MON_DOMAIN))
> +		return;
> +
> +	d = container_of(hdr, struct rdt_domain, hdr);
> +	hw_dom = resctrl_to_arch_dom(d);
> +
> +	cpumask_clear_cpu(cpu, &d->hdr.cpu_mask);
> +	if (cpumask_empty(&d->hdr.cpu_mask)) {
> +		resctrl_offline_mon_domain(r, d);
> +		list_del(&d->hdr.list);
> +		domain_free(hw_dom);
> +
> +		return;
> +	}
>   
>   	if (r == &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl) {
>   		if (is_mbm_enabled() && cpu == d->mbm_work_cpu) {
> @@ -613,6 +708,14 @@ static void domain_remove_cpu(int cpu, struct rdt_resource *r)
>   	}
>   }
>   
> +static void domain_remove_cpu(int cpu, struct rdt_resource *r)
> +{
> +	if (r->alloc_capable)
> +		domain_remove_cpu_ctrl(cpu, r);
> +	if (r->mon_capable)
> +		domain_remove_cpu_mon(cpu, r);
> +}
> +
>   static void clear_closid_rmid(int cpu)
>   {
>   	struct resctrl_pqr_state *state = this_cpu_ptr(&pqr_state);
> diff --git a/arch/x86/kernel/cpu/resctrl/ctrlmondata.c b/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
> index 23f8258d36a8..0b4136c42762 100644
> --- a/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
> +++ b/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
> @@ -226,7 +226,7 @@ static int parse_line(char *line, struct resctrl_schema *s,
>   		return -EINVAL;
>   	}
>   	dom = strim(dom);
> -	list_for_each_entry(d, &r->domains, hdr.list) {
> +	list_for_each_entry(d, &r->ctrl_domains, hdr.list) {
>   		if (d->hdr.id == dom_id) {
>   			data.buf = dom;
>   			data.rdtgrp = rdtgrp;
> @@ -318,7 +318,7 @@ int resctrl_arch_update_domains(struct rdt_resource *r, u32 closid)
>   		return -ENOMEM;
>   
>   	msr_param.res = NULL;
> -	list_for_each_entry(d, &r->domains, hdr.list) {
> +	list_for_each_entry(d, &r->ctrl_domains, hdr.list) {
>   		hw_dom = resctrl_to_arch_dom(d);
>   		for (t = 0; t < CDP_NUM_TYPES; t++) {
>   			cfg = &hw_dom->d_resctrl.staged_config[t];
> @@ -466,7 +466,7 @@ static void show_doms(struct seq_file *s, struct resctrl_schema *schema, int clo
>   	u32 ctrl_val;
>   
>   	seq_printf(s, "%*s:", max_name_width, schema->name);
> -	list_for_each_entry(dom, &r->domains, hdr.list) {
> +	list_for_each_entry(dom, &r->ctrl_domains, hdr.list) {
>   		if (sep)
>   			seq_puts(s, ";");
>   
> @@ -542,6 +542,7 @@ void mon_event_read(struct rmid_read *rr, struct rdt_resource *r,
>   int rdtgroup_mondata_show(struct seq_file *m, void *arg)
>   {
>   	struct kernfs_open_file *of = m->private;
> +	struct rdt_domain_hdr *hdr;
>   	u32 resid, evtid, domid;
>   	struct rdtgroup *rdtgrp;
>   	struct rdt_resource *r;
> @@ -562,11 +563,12 @@ int rdtgroup_mondata_show(struct seq_file *m, void *arg)
>   	evtid = md.u.evtid;
>   
>   	r = &rdt_resources_all[resid].r_resctrl;
> -	d = rdt_find_domain(r, domid, NULL);
> -	if (!d) {
> +	hdr = rdt_find_domain(&r->mon_domains, domid, NULL);
> +	if (!hdr || WARN_ON_ONCE(hdr->type != RESCTRL_MON_DOMAIN)) {
>   		ret = -ENOENT;
>   		goto out;
>   	}
> +	d = container_of(hdr, struct rdt_domain, hdr);
>   
>   	mon_event_read(&rr, r, d, rdtgrp, evtid, false);
>   
> diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c
> index dd0ea1bc0092..ec5ad926c5dc 100644
> --- a/arch/x86/kernel/cpu/resctrl/monitor.c
> +++ b/arch/x86/kernel/cpu/resctrl/monitor.c
> @@ -340,7 +340,7 @@ static void add_rmid_to_limbo(struct rmid_entry *entry)
>   
>   	entry->busy = 0;
>   	cpu = get_cpu();
> -	list_for_each_entry(d, &r->domains, hdr.list) {
> +	list_for_each_entry(d, &r->mon_domains, hdr.list) {
>   		if (cpumask_test_cpu(cpu, &d->hdr.cpu_mask)) {
>   			err = resctrl_arch_rmid_read(r, d, entry->rmid,
>   						     QOS_L3_OCCUP_EVENT_ID,
> @@ -535,7 +535,7 @@ static void update_mba_bw(struct rdtgroup *rgrp, struct rdt_domain *dom_mbm)
>   	rmid = rgrp->mon.rmid;
>   	pmbm_data = &dom_mbm->mbm_local[rmid];
>   
> -	dom_mba = get_domain_from_cpu(smp_processor_id(), r_mba);
> +	dom_mba = get_ctrl_domain_from_cpu(smp_processor_id(), r_mba);
>   	if (!dom_mba) {
>   		pr_warn_once("Failure to get domain for MBA update\n");
>   		return;
> diff --git a/arch/x86/kernel/cpu/resctrl/pseudo_lock.c b/arch/x86/kernel/cpu/resctrl/pseudo_lock.c
> index fcbd99e2eb66..ed6d59af1cef 100644
> --- a/arch/x86/kernel/cpu/resctrl/pseudo_lock.c
> +++ b/arch/x86/kernel/cpu/resctrl/pseudo_lock.c
> @@ -292,7 +292,7 @@ static void pseudo_lock_region_clear(struct pseudo_lock_region *plr)
>    */
>   static int pseudo_lock_region_init(struct pseudo_lock_region *plr)
>   {
> -	enum resctrl_scope scope = plr->s->res->scope;
> +	enum resctrl_scope scope = plr->s->res->ctrl_scope;
>   	struct cpu_cacheinfo *ci;
>   	int ret;
>   	int i;
> @@ -856,7 +856,7 @@ bool rdtgroup_pseudo_locked_in_hierarchy(struct rdt_domain *d)
>   	 * associated with them.
>   	 */
>   	for_each_alloc_capable_rdt_resource(r) {
> -		list_for_each_entry(d_i, &r->domains, hdr.list) {
> +		list_for_each_entry(d_i, &r->ctrl_domains, hdr.list) {
>   			if (d_i->plr)
>   				cpumask_or(cpu_with_psl, cpu_with_psl,
>   					   &d_i->hdr.cpu_mask);
> diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> index 04d32602ac33..760013ed1bff 100644
> --- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> +++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> @@ -91,7 +91,7 @@ void rdt_staged_configs_clear(void)
>   	lockdep_assert_held(&rdtgroup_mutex);
>   
>   	for_each_alloc_capable_rdt_resource(r) {
> -		list_for_each_entry(dom, &r->domains, hdr.list)
> +		list_for_each_entry(dom, &r->ctrl_domains, hdr.list)
>   			memset(dom->staged_config, 0, sizeof(dom->staged_config));
>   	}
>   }
> @@ -984,7 +984,7 @@ static int rdt_bit_usage_show(struct kernfs_open_file *of,
>   
>   	mutex_lock(&rdtgroup_mutex);
>   	hw_shareable = r->cache.shareable_bits;
> -	list_for_each_entry(dom, &r->domains, hdr.list) {
> +	list_for_each_entry(dom, &r->ctrl_domains, hdr.list) {
>   		if (sep)
>   			seq_putc(seq, ';');
>   		sw_shareable = 0;
> @@ -1302,7 +1302,7 @@ static bool rdtgroup_mode_test_exclusive(struct rdtgroup *rdtgrp)
>   		if (r->rid == RDT_RESOURCE_MBA || r->rid == RDT_RESOURCE_SMBA)
>   			continue;
>   		has_cache = true;
> -		list_for_each_entry(d, &r->domains, hdr.list) {
> +		list_for_each_entry(d, &r->ctrl_domains, hdr.list) {
>   			ctrl = resctrl_arch_get_config(r, d, closid,
>   						       s->conf_type);
>   			if (rdtgroup_cbm_overlaps(s, d, ctrl, closid, false)) {
> @@ -1413,13 +1413,13 @@ unsigned int rdtgroup_cbm_to_size(struct rdt_resource *r,
>   	unsigned int size = 0;
>   	int num_b, i;
>   
> -	if (WARN_ON_ONCE(r->scope != RESCTRL_L2_CACHE && r->scope != RESCTRL_L3_CACHE))
> +	if (WARN_ON_ONCE(r->ctrl_scope != RESCTRL_L2_CACHE && r->ctrl_scope != RESCTRL_L3_CACHE))
>   		return size;
>   
>   	num_b = bitmap_weight(&cbm, r->cache.cbm_len);
>   	ci = get_cpu_cacheinfo(cpumask_any(&d->hdr.cpu_mask));
>   	for (i = 0; i < ci->num_leaves; i++) {
> -		if (ci->info_list[i].level == r->scope) {
> +		if (ci->info_list[i].level == r->ctrl_scope) {
>   			size = ci->info_list[i].size / r->cache.cbm_len * num_b;
>   			break;
>   		}
> @@ -1477,7 +1477,7 @@ static int rdtgroup_size_show(struct kernfs_open_file *of,
>   		type = schema->conf_type;
>   		sep = false;
>   		seq_printf(s, "%*s:", max_name_width, schema->name);
> -		list_for_each_entry(d, &r->domains, hdr.list) {
> +		list_for_each_entry(d, &r->ctrl_domains, hdr.list) {
>   			if (sep)
>   				seq_putc(s, ';');
>   			if (rdtgrp->mode == RDT_MODE_PSEUDO_LOCKSETUP) {
> @@ -1566,7 +1566,7 @@ static int mbm_config_show(struct seq_file *s, struct rdt_resource *r, u32 evtid
>   
>   	mutex_lock(&rdtgroup_mutex);
>   
> -	list_for_each_entry(dom, &r->domains, hdr.list) {
> +	list_for_each_entry(dom, &r->mon_domains, hdr.list) {
>   		if (sep)
>   			seq_puts(s, ";");
>   
> @@ -1689,7 +1689,7 @@ static int mon_config_write(struct rdt_resource *r, char *tok, u32 evtid)
>   		return -EINVAL;
>   	}
>   
> -	list_for_each_entry(d, &r->domains, hdr.list) {
> +	list_for_each_entry(d, &r->mon_domains, hdr.list) {
>   		if (d->hdr.id == dom_id) {
>   			ret = mbm_config_write_domain(r, d, evtid, val);
>   			if (ret)
> @@ -2232,7 +2232,7 @@ static int set_cache_qos_cfg(int level, bool enable)
>   		return -ENOMEM;
>   
>   	r_l = &rdt_resources_all[level].r_resctrl;
> -	list_for_each_entry(d, &r_l->domains, hdr.list) {
> +	list_for_each_entry(d, &r_l->ctrl_domains, hdr.list) {
>   		if (r_l->cache.arch_has_per_cpu_cfg)
>   			/* Pick all the CPUs in the domain instance */
>   			for_each_cpu(cpu, &d->hdr.cpu_mask)
> @@ -2317,7 +2317,7 @@ static int set_mba_sc(bool mba_sc)
>   
>   	r->membw.mba_sc = mba_sc;
>   
> -	list_for_each_entry(d, &r->domains, hdr.list) {
> +	list_for_each_entry(d, &r->ctrl_domains, hdr.list) {
>   		for (i = 0; i < num_closid; i++)
>   			d->mbps_val[i] = MBA_MAX_MBPS;
>   	}
> @@ -2653,7 +2653,7 @@ static int rdt_get_tree(struct fs_context *fc)
>   
>   	if (is_mbm_enabled()) {
>   		r = &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl;
> -		list_for_each_entry(dom, &r->domains, hdr.list)
> +		list_for_each_entry(dom, &r->mon_domains, hdr.list)
>   			mbm_setup_overflow_handler(dom, MBM_OVERFLOW_INTERVAL);
>   	}
>   
> @@ -2777,10 +2777,10 @@ static int reset_all_ctrls(struct rdt_resource *r)
>   
>   	/*
>   	 * Disable resource control for this resource by setting all
> -	 * CBMs in all domains to the maximum mask value. Pick one CPU
> +	 * CBMs in all ctrl_domains to the maximum mask value. Pick one CPU
>   	 * from each domain to update the MSRs below.
>   	 */
> -	list_for_each_entry(d, &r->domains, hdr.list) {
> +	list_for_each_entry(d, &r->ctrl_domains, hdr.list) {
>   		hw_dom = resctrl_to_arch_dom(d);
>   		cpumask_set_cpu(cpumask_any(&d->hdr.cpu_mask), cpu_mask);
>   
> @@ -3050,7 +3050,7 @@ static int mkdir_mondata_subdir_alldom(struct kernfs_node *parent_kn,
>   	struct rdt_domain *dom;
>   	int ret;
>   
> -	list_for_each_entry(dom, &r->domains, hdr.list) {
> +	list_for_each_entry(dom, &r->mon_domains, hdr.list) {
>   		ret = mkdir_mondata_subdir(parent_kn, dom, r, prgrp);
>   		if (ret)
>   			return ret;
> @@ -3232,7 +3232,7 @@ static int rdtgroup_init_cat(struct resctrl_schema *s, u32 closid)
>   	struct rdt_domain *d;
>   	int ret;
>   
> -	list_for_each_entry(d, &s->res->domains, hdr.list) {
> +	list_for_each_entry(d, &s->res->ctrl_domains, hdr.list) {
>   		ret = __init_one_rdt_domain(d, s, closid);
>   		if (ret < 0)
>   			return ret;
> @@ -3247,7 +3247,7 @@ static void rdtgroup_init_mba(struct rdt_resource *r, u32 closid)
>   	struct resctrl_staged_config *cfg;
>   	struct rdt_domain *d;
>   
> -	list_for_each_entry(d, &r->domains, hdr.list) {
> +	list_for_each_entry(d, &r->ctrl_domains, hdr.list) {
>   		if (is_mba_sc(r)) {
>   			d->mbps_val[closid] = MBA_MAX_MBPS;
>   			continue;
> @@ -3849,15 +3849,17 @@ static void domain_destroy_mon_state(struct rdt_domain *d)
>   	kfree(d->mbm_local);
>   }
>   
> -void resctrl_offline_domain(struct rdt_resource *r, struct rdt_domain *d)
> +void resctrl_offline_ctrl_domain(struct rdt_resource *r, struct rdt_domain *d)
>   {
>   	lockdep_assert_held(&rdtgroup_mutex);
>   
>   	if (supports_mba_mbps() && r->rid == RDT_RESOURCE_MBA)
>   		mba_sc_domain_destroy(r, d);
> +}
>   
> -	if (!r->mon_capable)
> -		return;
> +void resctrl_offline_mon_domain(struct rdt_resource *r, struct rdt_domain *d)
> +{
> +	lockdep_assert_held(&rdtgroup_mutex);
>   
>   	/*
>   	 * If resctrl is mounted, remove all the
> @@ -3914,18 +3916,21 @@ static int domain_setup_mon_state(struct rdt_resource *r, struct rdt_domain *d)
>   	return 0;
>   }
>   
> -int resctrl_online_domain(struct rdt_resource *r, struct rdt_domain *d)
> +int resctrl_online_ctrl_domain(struct rdt_resource *r, struct rdt_domain *d)
>   {
> -	int err;
> -
>   	lockdep_assert_held(&rdtgroup_mutex);
>   
>   	if (supports_mba_mbps() && r->rid == RDT_RESOURCE_MBA)
> -		/* RDT_RESOURCE_MBA is never mon_capable */
>   		return mba_sc_domain_allocate(r, d);
>   
> -	if (!r->mon_capable)
> -		return 0;
> +	return 0;
> +}
> +
> +int resctrl_online_mon_domain(struct rdt_resource *r, struct rdt_domain *d)
> +{
> +	int err;
> +
> +	lockdep_assert_held(&rdtgroup_mutex);
>   
>   	err = domain_setup_mon_state(r, d);
>   	if (err)
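
As a reading aid (reassembled from the hunks above, not new code), the
control-side online hook ends up roughly as:

	int resctrl_online_ctrl_domain(struct rdt_resource *r, struct rdt_domain *d)
	{
		lockdep_assert_held(&rdtgroup_mutex);

		if (supports_mba_mbps() && r->rid == RDT_RESOURCE_MBA)
			return mba_sc_domain_allocate(r, d);

		return 0;
	}

with the monitoring-specific setup moved into resctrl_online_mon_domain().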

^ permalink raw reply	[flat|nested] 331+ messages in thread

* Re: [PATCH v13 4/8] x86/resctrl: Split the rdt_domain and rdt_hw_domain structures
  2023-12-04 18:53                   ` [PATCH v13 4/8] x86/resctrl: Split the rdt_domain and rdt_hw_domain structures Tony Luck
@ 2023-12-04 23:03                     ` Moger, Babu
  0 siblings, 0 replies; 331+ messages in thread
From: Moger, Babu @ 2023-12-04 23:03 UTC (permalink / raw)
  To: Tony Luck, Fenghua Yu, Reinette Chatre, Peter Newman,
	Jonathan Corbet, Shuah Khan, x86
  Cc: Shaopeng Tan, James Morse, Jamie Iles, Babu Moger, Randy Dunlap,
	linux-kernel, linux-doc, patches, Shaopeng Tan


On 12/4/2023 12:53 PM, Tony Luck wrote:
> The same rdt_domain structure is used for both control and monitoring
> functions. But this results in wasted memory, as some of the fields are
> used only by control functions while most are used only by monitoring
> functions.
>
> Split into separate rdt_ctrl_domain and rdt_mon_domain structures with
> just the fields required for control and monitoring respectively.
>
> Similar split of the rdt_hw_domain structure into rdt_hw_ctrl_domain
> and rdt_hw_mon_domain.
>
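
A minimal sketch (mine, not part of the patch) of the pattern the core.c
hunks below rely on: code that only holds the embedded common header checks
hdr->type before converting back to the specific type with container_of().
The helper name here is hypothetical.

	static struct rdt_ctrl_domain *ctrl_dom_from_hdr(struct rdt_domain_hdr *hdr)
	{
		if (WARN_ON_ONCE(hdr->type != RESCTRL_CTRL_DOMAIN))
			return NULL;

		return container_of(hdr, struct rdt_ctrl_domain, hdr);
	}

The same pattern is used with RESCTRL_MON_DOMAIN and struct rdt_mon_domain
on the monitoring side.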
> Signed-off-by: Tony Luck<tony.luck@intel.com>
> Tested-by: Shaopeng Tan<tan.shaopeng@jp.fujitsu.com>
> Reviewed-by: Peter Newman<peternewman@google.com>
> Reviewed-by: Reinette Chatre<reinette.chatre@intel.com>
> Reviewed-by: Shaopeng Tan<tan.shaopeng@jp.fujitsu.com>
Reviewed-by: Babu Moger <babu.moger@amd.com>
> ---
>   include/linux/resctrl.h                   | 48 +++++++------
>   arch/x86/kernel/cpu/resctrl/internal.h    | 60 ++++++++++------
>   arch/x86/kernel/cpu/resctrl/core.c        | 87 ++++++++++++-----------
>   arch/x86/kernel/cpu/resctrl/ctrlmondata.c | 32 ++++-----
>   arch/x86/kernel/cpu/resctrl/monitor.c     | 40 +++++------
>   arch/x86/kernel/cpu/resctrl/pseudo_lock.c |  6 +-
>   arch/x86/kernel/cpu/resctrl/rdtgroup.c    | 62 ++++++++--------
>   7 files changed, 182 insertions(+), 153 deletions(-)
>
> diff --git a/include/linux/resctrl.h b/include/linux/resctrl.h
> index 35e700edc6e6..058a940c3239 100644
> --- a/include/linux/resctrl.h
> +++ b/include/linux/resctrl.h
> @@ -72,7 +72,23 @@ struct rdt_domain_hdr {
>   };
>   
>   /**
> - * struct rdt_domain - group of CPUs sharing a resctrl resource
> + * struct rdt_ctrl_domain - group of CPUs sharing a resctrl control resource
> + * @hdr:		common header for different domain types
> + * @plr:		pseudo-locked region (if any) associated with domain
> + * @staged_config:	parsed configuration to be applied
> + * @mbps_val:		When mba_sc is enabled, this holds the array of user
> + *			specified control values for mba_sc in MBps, indexed
> + *			by closid
> + */
> +struct rdt_ctrl_domain {
> +	struct rdt_domain_hdr		hdr;
> +	struct pseudo_lock_region	*plr;
> +	struct resctrl_staged_config	staged_config[CDP_NUM_TYPES];
> +	u32				*mbps_val;
> +};
> +
> +/**
> + * struct rdt_mon_domain - group of CPUs sharing a resctrl monitor resource
>    * @hdr:		common header for different domain types
>    * @rmid_busy_llc:	bitmap of which limbo RMIDs are above threshold
>    * @mbm_total:		saved state for MBM total bandwidth
> @@ -81,13 +97,8 @@ struct rdt_domain_hdr {
>    * @cqm_limbo:		worker to periodically read CQM h/w counters
>    * @mbm_work_cpu:	worker CPU for MBM h/w counters
>    * @cqm_work_cpu:	worker CPU for CQM h/w counters
> - * @plr:		pseudo-locked region (if any) associated with domain
> - * @staged_config:	parsed configuration to be applied
> - * @mbps_val:		When mba_sc is enabled, this holds the array of user
> - *			specified control values for mba_sc in MBps, indexed
> - *			by closid
>    */
> -struct rdt_domain {
> +struct rdt_mon_domain {
>   	struct rdt_domain_hdr		hdr;
>   	unsigned long			*rmid_busy_llc;
>   	struct mbm_state		*mbm_total;
> @@ -96,9 +107,6 @@ struct rdt_domain {
>   	struct delayed_work		cqm_limbo;
>   	int				mbm_work_cpu;
>   	int				cqm_work_cpu;
> -	struct pseudo_lock_region	*plr;
> -	struct resctrl_staged_config	staged_config[CDP_NUM_TYPES];
> -	u32				*mbps_val;
>   };
>   
>   /**
> @@ -202,7 +210,7 @@ struct rdt_resource {
>   	const char		*format_str;
>   	int			(*parse_ctrlval)(struct rdt_parse_data *data,
>   						 struct resctrl_schema *s,
> -						 struct rdt_domain *d);
> +						 struct rdt_ctrl_domain *d);
>   	struct list_head	evt_list;
>   	unsigned long		fflags;
>   	bool			cdp_capable;
> @@ -236,15 +244,15 @@ int resctrl_arch_update_domains(struct rdt_resource *r, u32 closid);
>    * Update the ctrl_val and apply this config right now.
>    * Must be called on one of the domain's CPUs.
>    */
> -int resctrl_arch_update_one(struct rdt_resource *r, struct rdt_domain *d,
> +int resctrl_arch_update_one(struct rdt_resource *r, struct rdt_ctrl_domain *d,
>   			    u32 closid, enum resctrl_conf_type t, u32 cfg_val);
>   
> -u32 resctrl_arch_get_config(struct rdt_resource *r, struct rdt_domain *d,
> +u32 resctrl_arch_get_config(struct rdt_resource *r, struct rdt_ctrl_domain *d,
>   			    u32 closid, enum resctrl_conf_type type);
> -int resctrl_online_ctrl_domain(struct rdt_resource *r, struct rdt_domain *d);
> -int resctrl_online_mon_domain(struct rdt_resource *r, struct rdt_domain *d);
> -void resctrl_offline_ctrl_domain(struct rdt_resource *r, struct rdt_domain *d);
> -void resctrl_offline_mon_domain(struct rdt_resource *r, struct rdt_domain *d);
> +int resctrl_online_ctrl_domain(struct rdt_resource *r, struct rdt_ctrl_domain *d);
> +int resctrl_online_mon_domain(struct rdt_resource *r, struct rdt_mon_domain *d);
> +void resctrl_offline_ctrl_domain(struct rdt_resource *r, struct rdt_ctrl_domain *d);
> +void resctrl_offline_mon_domain(struct rdt_resource *r, struct rdt_mon_domain *d);
>   
>   /**
>    * resctrl_arch_rmid_read() - Read the eventid counter corresponding to rmid
> @@ -260,7 +268,7 @@ void resctrl_offline_mon_domain(struct rdt_resource *r, struct rdt_domain *d);
>    * Return:
>    * 0 on success, or -EIO, -EINVAL etc on error.
>    */
> -int resctrl_arch_rmid_read(struct rdt_resource *r, struct rdt_domain *d,
> +int resctrl_arch_rmid_read(struct rdt_resource *r, struct rdt_mon_domain *d,
>   			   u32 rmid, enum resctrl_event_id eventid, u64 *val);
>   
>   /**
> @@ -273,7 +281,7 @@ int resctrl_arch_rmid_read(struct rdt_resource *r, struct rdt_domain *d,
>    *
>    * This can be called from any CPU.
>    */
> -void resctrl_arch_reset_rmid(struct rdt_resource *r, struct rdt_domain *d,
> +void resctrl_arch_reset_rmid(struct rdt_resource *r, struct rdt_mon_domain *d,
>   			     u32 rmid, enum resctrl_event_id eventid);
>   
>   /**
> @@ -285,7 +293,7 @@ void resctrl_arch_reset_rmid(struct rdt_resource *r, struct rdt_domain *d,
>    *
>    * This can be called from any CPU.
>    */
> -void resctrl_arch_reset_rmid_all(struct rdt_resource *r, struct rdt_domain *d);
> +void resctrl_arch_reset_rmid_all(struct rdt_resource *r, struct rdt_mon_domain *d);
>   
>   extern unsigned int resctrl_rmid_realloc_threshold;
>   extern unsigned int resctrl_rmid_realloc_limit;
> diff --git a/arch/x86/kernel/cpu/resctrl/internal.h b/arch/x86/kernel/cpu/resctrl/internal.h
> index 24bf9d7989a9..ce3a70657842 100644
> --- a/arch/x86/kernel/cpu/resctrl/internal.h
> +++ b/arch/x86/kernel/cpu/resctrl/internal.h
> @@ -107,7 +107,7 @@ union mon_data_bits {
>   struct rmid_read {
>   	struct rdtgroup		*rgrp;
>   	struct rdt_resource	*r;
> -	struct rdt_domain	*d;
> +	struct rdt_mon_domain	*d;
>   	enum resctrl_event_id	evtid;
>   	bool			first;
>   	int			err;
> @@ -192,7 +192,7 @@ struct mongroup {
>    */
>   struct pseudo_lock_region {
>   	struct resctrl_schema	*s;
> -	struct rdt_domain	*d;
> +	struct rdt_ctrl_domain	*d;
>   	u32			cbm;
>   	wait_queue_head_t	lock_thread_wq;
>   	int			thread_done;
> @@ -319,25 +319,41 @@ struct arch_mbm_state {
>   };
>   
>   /**
> - * struct rdt_hw_domain - Arch private attributes of a set of CPUs that share
> - *			  a resource
> + * struct rdt_hw_ctrl_domain - Arch private attributes of a set of CPUs that share
> + *			       a resource for a control function
>    * @d_resctrl:	Properties exposed to the resctrl file system
>    * @ctrl_val:	array of cache or mem ctrl values (indexed by CLOSID)
> + *
> + * Members of this structure are accessed via helpers that provide abstraction.
> + */
> +struct rdt_hw_ctrl_domain {
> +	struct rdt_ctrl_domain		d_resctrl;
> +	u32				*ctrl_val;
> +};
> +
> +/**
> + * struct rdt_hw_mon_domain - Arch private attributes of a set of CPUs that share
> + *			      a resource for a monitor function
> + * @d_resctrl:	Properties exposed to the resctrl file system
>    * @arch_mbm_total:	arch private state for MBM total bandwidth
>    * @arch_mbm_local:	arch private state for MBM local bandwidth
>    *
>    * Members of this structure are accessed via helpers that provide abstraction.
>    */
> -struct rdt_hw_domain {
> -	struct rdt_domain		d_resctrl;
> -	u32				*ctrl_val;
> +struct rdt_hw_mon_domain {
> +	struct rdt_mon_domain		d_resctrl;
>   	struct arch_mbm_state		*arch_mbm_total;
>   	struct arch_mbm_state		*arch_mbm_local;
>   };
>   
> -static inline struct rdt_hw_domain *resctrl_to_arch_dom(struct rdt_domain *r)
> +static inline struct rdt_hw_ctrl_domain *resctrl_to_arch_ctrl_dom(struct rdt_ctrl_domain *r)
> +{
> +	return container_of(r, struct rdt_hw_ctrl_domain, d_resctrl);
> +}
> +
> +static inline struct rdt_hw_mon_domain *resctrl_to_arch_mon_dom(struct rdt_mon_domain *r)
>   {
> -	return container_of(r, struct rdt_hw_domain, d_resctrl);
> +	return container_of(r, struct rdt_hw_mon_domain, d_resctrl);
>   }
>   
>   /**
> @@ -405,7 +421,7 @@ struct rdt_hw_resource {
>   	struct rdt_resource	r_resctrl;
>   	u32			num_closid;
>   	unsigned int		msr_base;
> -	void (*msr_update)	(struct rdt_domain *d, struct msr_param *m,
> +	void (*msr_update)	(struct rdt_ctrl_domain *d, struct msr_param *m,
>   				 struct rdt_resource *r);
>   	unsigned int		mon_scale;
>   	unsigned int		mbm_width;
> @@ -418,9 +434,9 @@ static inline struct rdt_hw_resource *resctrl_to_arch_res(struct rdt_resource *r
>   }
>   
>   int parse_cbm(struct rdt_parse_data *data, struct resctrl_schema *s,
> -	      struct rdt_domain *d);
> +	      struct rdt_ctrl_domain *d);
>   int parse_bw(struct rdt_parse_data *data, struct resctrl_schema *s,
> -	     struct rdt_domain *d);
> +	     struct rdt_ctrl_domain *d);
>   
>   extern struct mutex rdtgroup_mutex;
>   
> @@ -526,21 +542,21 @@ ssize_t rdtgroup_schemata_write(struct kernfs_open_file *of,
>   				char *buf, size_t nbytes, loff_t off);
>   int rdtgroup_schemata_show(struct kernfs_open_file *of,
>   			   struct seq_file *s, void *v);
> -bool rdtgroup_cbm_overlaps(struct resctrl_schema *s, struct rdt_domain *d,
> +bool rdtgroup_cbm_overlaps(struct resctrl_schema *s, struct rdt_ctrl_domain *d,
>   			   unsigned long cbm, int closid, bool exclusive);
> -unsigned int rdtgroup_cbm_to_size(struct rdt_resource *r, struct rdt_domain *d,
> +unsigned int rdtgroup_cbm_to_size(struct rdt_resource *r, struct rdt_ctrl_domain *d,
>   				  unsigned long cbm);
>   enum rdtgrp_mode rdtgroup_mode_by_closid(int closid);
>   int rdtgroup_tasks_assigned(struct rdtgroup *r);
>   int rdtgroup_locksetup_enter(struct rdtgroup *rdtgrp);
>   int rdtgroup_locksetup_exit(struct rdtgroup *rdtgrp);
> -bool rdtgroup_cbm_overlaps_pseudo_locked(struct rdt_domain *d, unsigned long cbm);
> -bool rdtgroup_pseudo_locked_in_hierarchy(struct rdt_domain *d);
> +bool rdtgroup_cbm_overlaps_pseudo_locked(struct rdt_ctrl_domain *d, unsigned long cbm);
> +bool rdtgroup_pseudo_locked_in_hierarchy(struct rdt_ctrl_domain *d);
>   int rdt_pseudo_lock_init(void);
>   void rdt_pseudo_lock_release(void);
>   int rdtgroup_pseudo_lock_create(struct rdtgroup *rdtgrp);
>   void rdtgroup_pseudo_lock_remove(struct rdtgroup *rdtgrp);
> -struct rdt_domain *get_ctrl_domain_from_cpu(int cpu, struct rdt_resource *r);
> +struct rdt_ctrl_domain *get_ctrl_domain_from_cpu(int cpu, struct rdt_resource *r);
>   int closids_supported(void);
>   void closid_free(int closid);
>   int alloc_rmid(void);
> @@ -550,17 +566,17 @@ bool __init rdt_cpu_has(int flag);
>   void mon_event_count(void *info);
>   int rdtgroup_mondata_show(struct seq_file *m, void *arg);
>   void mon_event_read(struct rmid_read *rr, struct rdt_resource *r,
> -		    struct rdt_domain *d, struct rdtgroup *rdtgrp,
> +		    struct rdt_mon_domain *d, struct rdtgroup *rdtgrp,
>   		    int evtid, int first);
> -void mbm_setup_overflow_handler(struct rdt_domain *dom,
> +void mbm_setup_overflow_handler(struct rdt_mon_domain *dom,
>   				unsigned long delay_ms);
>   void mbm_handle_overflow(struct work_struct *work);
>   void __init intel_rdt_mbm_apply_quirk(void);
>   bool is_mba_sc(struct rdt_resource *r);
> -void cqm_setup_limbo_handler(struct rdt_domain *dom, unsigned long delay_ms);
> +void cqm_setup_limbo_handler(struct rdt_mon_domain *dom, unsigned long delay_ms);
>   void cqm_handle_limbo(struct work_struct *work);
> -bool has_busy_rmid(struct rdt_resource *r, struct rdt_domain *d);
> -void __check_limbo(struct rdt_domain *d, bool force_free);
> +bool has_busy_rmid(struct rdt_resource *r, struct rdt_mon_domain *d);
> +void __check_limbo(struct rdt_mon_domain *d, bool force_free);
>   void rdt_domain_reconfigure_cdp(struct rdt_resource *r);
>   void __init thread_throttle_mode_init(void);
>   void __init mbm_config_rftype_init(const char *config);
> diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
> index 1fd85533b4ca..797cb3bf417a 100644
> --- a/arch/x86/kernel/cpu/resctrl/core.c
> +++ b/arch/x86/kernel/cpu/resctrl/core.c
> @@ -49,12 +49,12 @@ int max_name_width, max_data_width;
>   bool rdt_alloc_capable;
>   
>   static void
> -mba_wrmsr_intel(struct rdt_domain *d, struct msr_param *m,
> +mba_wrmsr_intel(struct rdt_ctrl_domain *d, struct msr_param *m,
>   		struct rdt_resource *r);
>   static void
> -cat_wrmsr(struct rdt_domain *d, struct msr_param *m, struct rdt_resource *r);
> +cat_wrmsr(struct rdt_ctrl_domain *d, struct msr_param *m, struct rdt_resource *r);
>   static void
> -mba_wrmsr_amd(struct rdt_domain *d, struct msr_param *m,
> +mba_wrmsr_amd(struct rdt_ctrl_domain *d, struct msr_param *m,
>   	      struct rdt_resource *r);
>   
>   #define ctrl_domain_init(id) LIST_HEAD_INIT(rdt_resources_all[id].r_resctrl.ctrl_domains)
> @@ -307,11 +307,11 @@ static void rdt_get_cdp_l2_config(void)
>   }
>   
>   static void
> -mba_wrmsr_amd(struct rdt_domain *d, struct msr_param *m, struct rdt_resource *r)
> +mba_wrmsr_amd(struct rdt_ctrl_domain *d, struct msr_param *m, struct rdt_resource *r)
>   {
> -	unsigned int i;
> -	struct rdt_hw_domain *hw_dom = resctrl_to_arch_dom(d);
> +	struct rdt_hw_ctrl_domain *hw_dom = resctrl_to_arch_ctrl_dom(d);
>   	struct rdt_hw_resource *hw_res = resctrl_to_arch_res(r);
> +	unsigned int i;
>   
>   	for (i = m->low; i < m->high; i++)
>   		wrmsrl(hw_res->msr_base + i, hw_dom->ctrl_val[i]);
> @@ -332,12 +332,12 @@ static u32 delay_bw_map(unsigned long bw, struct rdt_resource *r)
>   }
>   
>   static void
> -mba_wrmsr_intel(struct rdt_domain *d, struct msr_param *m,
> +mba_wrmsr_intel(struct rdt_ctrl_domain *d, struct msr_param *m,
>   		struct rdt_resource *r)
>   {
> -	unsigned int i;
> -	struct rdt_hw_domain *hw_dom = resctrl_to_arch_dom(d);
> +	struct rdt_hw_ctrl_domain *hw_dom = resctrl_to_arch_ctrl_dom(d);
>   	struct rdt_hw_resource *hw_res = resctrl_to_arch_res(r);
> +	unsigned int i;
>   
>   	/*  Write the delay values for mba. */
>   	for (i = m->low; i < m->high; i++)
> @@ -345,19 +345,19 @@ mba_wrmsr_intel(struct rdt_domain *d, struct msr_param *m,
>   }
>   
>   static void
> -cat_wrmsr(struct rdt_domain *d, struct msr_param *m, struct rdt_resource *r)
> +cat_wrmsr(struct rdt_ctrl_domain *d, struct msr_param *m, struct rdt_resource *r)
>   {
> -	unsigned int i;
> -	struct rdt_hw_domain *hw_dom = resctrl_to_arch_dom(d);
> +	struct rdt_hw_ctrl_domain *hw_dom = resctrl_to_arch_ctrl_dom(d);
>   	struct rdt_hw_resource *hw_res = resctrl_to_arch_res(r);
> +	unsigned int i;
>   
>   	for (i = m->low; i < m->high; i++)
>   		wrmsrl(hw_res->msr_base + i, hw_dom->ctrl_val[i]);
>   }
>   
> -struct rdt_domain *get_ctrl_domain_from_cpu(int cpu, struct rdt_resource *r)
> +struct rdt_ctrl_domain *get_ctrl_domain_from_cpu(int cpu, struct rdt_resource *r)
>   {
> -	struct rdt_domain *d;
> +	struct rdt_ctrl_domain *d;
>   
>   	list_for_each_entry(d, &r->ctrl_domains, hdr.list) {
>   		/* Find the domain that contains this CPU */
> @@ -379,7 +379,7 @@ void rdt_ctrl_update(void *arg)
>   	struct rdt_hw_resource *hw_res = resctrl_to_arch_res(m->res);
>   	struct rdt_resource *r = m->res;
>   	int cpu = smp_processor_id();
> -	struct rdt_domain *d;
> +	struct rdt_ctrl_domain *d;
>   
>   	d = get_ctrl_domain_from_cpu(cpu, r);
>   	if (d) {
> @@ -434,18 +434,23 @@ static void setup_default_ctrlval(struct rdt_resource *r, u32 *dc)
>   		*dc = r->default_ctrl;
>   }
>   
> -static void domain_free(struct rdt_hw_domain *hw_dom)
> +static void ctrl_domain_free(struct rdt_hw_ctrl_domain *hw_dom)
> +{
> +	kfree(hw_dom->ctrl_val);
> +	kfree(hw_dom);
> +}
> +
> +static void mon_domain_free(struct rdt_hw_mon_domain *hw_dom)
>   {
>   	kfree(hw_dom->arch_mbm_total);
>   	kfree(hw_dom->arch_mbm_local);
> -	kfree(hw_dom->ctrl_val);
>   	kfree(hw_dom);
>   }
>   
> -static int domain_setup_ctrlval(struct rdt_resource *r, struct rdt_domain *d)
> +static int domain_setup_ctrlval(struct rdt_resource *r, struct rdt_ctrl_domain *d)
>   {
> +	struct rdt_hw_ctrl_domain *hw_dom = resctrl_to_arch_ctrl_dom(d);
>   	struct rdt_hw_resource *hw_res = resctrl_to_arch_res(r);
> -	struct rdt_hw_domain *hw_dom = resctrl_to_arch_dom(d);
>   	struct msr_param m;
>   	u32 *dc;
>   
> @@ -468,7 +473,7 @@ static int domain_setup_ctrlval(struct rdt_resource *r, struct rdt_domain *d)
>    * @num_rmid:	The size of the MBM counter array
>    * @hw_dom:	The domain that owns the allocated arrays
>    */
> -static int arch_domain_mbm_alloc(u32 num_rmid, struct rdt_hw_domain *hw_dom)
> +static int arch_domain_mbm_alloc(u32 num_rmid, struct rdt_hw_mon_domain *hw_dom)
>   {
>   	size_t tsize;
>   
> @@ -507,10 +512,10 @@ static int get_domain_id_from_scope(int cpu, enum resctrl_scope scope)
>   static void domain_add_cpu_ctrl(int cpu, struct rdt_resource *r)
>   {
>   	int id = get_domain_id_from_scope(cpu, r->ctrl_scope);
> +	struct rdt_hw_ctrl_domain *hw_dom;
>   	struct list_head *add_pos = NULL;
> -	struct rdt_hw_domain *hw_dom;
>   	struct rdt_domain_hdr *hdr;
> -	struct rdt_domain *d;
> +	struct rdt_ctrl_domain *d;
>   	int err;
>   
>   	if (id < 0) {
> @@ -524,7 +529,7 @@ static void domain_add_cpu_ctrl(int cpu, struct rdt_resource *r)
>   		if (WARN_ON_ONCE(hdr->type != RESCTRL_CTRL_DOMAIN))
>   			return;
>   
> -		d = container_of(hdr, struct rdt_domain, hdr);
> +		d = container_of(hdr, struct rdt_ctrl_domain, hdr);
>   
>   		cpumask_set_cpu(cpu, &d->hdr.cpu_mask);
>   		if (r->cache.arch_has_per_cpu_cfg)
> @@ -544,7 +549,7 @@ static void domain_add_cpu_ctrl(int cpu, struct rdt_resource *r)
>   	rdt_domain_reconfigure_cdp(r);
>   
>   	if (domain_setup_ctrlval(r, d)) {
> -		domain_free(hw_dom);
> +		ctrl_domain_free(hw_dom);
>   		return;
>   	}
>   
> @@ -553,17 +558,17 @@ static void domain_add_cpu_ctrl(int cpu, struct rdt_resource *r)
>   	err = resctrl_online_ctrl_domain(r, d);
>   	if (err) {
>   		list_del(&d->hdr.list);
> -		domain_free(hw_dom);
> +		ctrl_domain_free(hw_dom);
>   	}
>   }
>   
>   static void domain_add_cpu_mon(int cpu, struct rdt_resource *r)
>   {
>   	int id = get_domain_id_from_scope(cpu, r->mon_scope);
> +	struct rdt_hw_mon_domain *hw_dom;
>   	struct list_head *add_pos = NULL;
> -	struct rdt_hw_domain *hw_dom;
>   	struct rdt_domain_hdr *hdr;
> -	struct rdt_domain *d;
> +	struct rdt_mon_domain *d;
>   	int err;
>   
>   	if (id < 0) {
> @@ -577,7 +582,7 @@ static void domain_add_cpu_mon(int cpu, struct rdt_resource *r)
>   		if (WARN_ON_ONCE(hdr->type != RESCTRL_MON_DOMAIN))
>   			return;
>   
> -		d = container_of(hdr, struct rdt_domain, hdr);
> +		d = container_of(hdr, struct rdt_mon_domain, hdr);
>   
>   		cpumask_set_cpu(cpu, &d->hdr.cpu_mask);
>   		return;
> @@ -593,7 +598,7 @@ static void domain_add_cpu_mon(int cpu, struct rdt_resource *r)
>   	cpumask_set_cpu(cpu, &d->hdr.cpu_mask);
>   
>   	if (arch_domain_mbm_alloc(r->num_rmid, hw_dom)) {
> -		domain_free(hw_dom);
> +		mon_domain_free(hw_dom);
>   		return;
>   	}
>   
> @@ -602,7 +607,7 @@ static void domain_add_cpu_mon(int cpu, struct rdt_resource *r)
>   	err = resctrl_online_mon_domain(r, d);
>   	if (err) {
>   		list_del(&d->hdr.list);
> -		domain_free(hw_dom);
> +		mon_domain_free(hw_dom);
>   	}
>   }
>   
> @@ -620,9 +625,9 @@ static void domain_add_cpu(int cpu, struct rdt_resource *r)
>   static void domain_remove_cpu_ctrl(int cpu, struct rdt_resource *r)
>   {
>   	int id = get_domain_id_from_scope(cpu, r->ctrl_scope);
> -	struct rdt_hw_domain *hw_dom;
> +	struct rdt_hw_ctrl_domain *hw_dom;
>   	struct rdt_domain_hdr *hdr;
> -	struct rdt_domain *d;
> +	struct rdt_ctrl_domain *d;
>   
>   	if (id < 0) {
>   		pr_warn_once("Can't find control domain id for CPU:%d scope:%d for resource %s\n",
> @@ -640,8 +645,8 @@ static void domain_remove_cpu_ctrl(int cpu, struct rdt_resource *r)
>   	if (WARN_ON_ONCE(hdr->type != RESCTRL_CTRL_DOMAIN))
>   		return;
>   
> -	d = container_of(hdr, struct rdt_domain, hdr);
> -	hw_dom = resctrl_to_arch_dom(d);
> +	d = container_of(hdr, struct rdt_ctrl_domain, hdr);
> +	hw_dom = resctrl_to_arch_ctrl_dom(d);
>   
>   	cpumask_clear_cpu(cpu, &d->hdr.cpu_mask);
>   	if (cpumask_empty(&d->hdr.cpu_mask)) {
> @@ -649,12 +654,12 @@ static void domain_remove_cpu_ctrl(int cpu, struct rdt_resource *r)
>   		list_del(&d->hdr.list);
>   
>   		/*
> -		 * rdt_domain "d" is going to be freed below, so clear
> +		 * rdt_ctrl_domain "d" is going to be freed below, so clear
>   		 * its pointer from pseudo_lock_region struct.
>   		 */
>   		if (d->plr)
>   			d->plr->d = NULL;
> -		domain_free(hw_dom);
> +		ctrl_domain_free(hw_dom);
>   
>   		return;
>   	}
> @@ -663,9 +668,9 @@ static void domain_remove_cpu_ctrl(int cpu, struct rdt_resource *r)
>   static void domain_remove_cpu_mon(int cpu, struct rdt_resource *r)
>   {
>   	int id = get_domain_id_from_scope(cpu, r->mon_scope);
> -	struct rdt_hw_domain *hw_dom;
> +	struct rdt_hw_mon_domain *hw_dom;
>   	struct rdt_domain_hdr *hdr;
> -	struct rdt_domain *d;
> +	struct rdt_mon_domain *d;
>   
>   	if (id < 0) {
>   		pr_warn_once("Can't find monitor domain id for CPU:%d scope:%d for resource %s\n",
> @@ -683,14 +688,14 @@ static void domain_remove_cpu_mon(int cpu, struct rdt_resource *r)
>   	if (WARN_ON_ONCE(hdr->type != RESCTRL_MON_DOMAIN))
>   		return;
>   
> -	d = container_of(hdr, struct rdt_domain, hdr);
> -	hw_dom = resctrl_to_arch_dom(d);
> +	d = container_of(hdr, struct rdt_mon_domain, hdr);
> +	hw_dom = resctrl_to_arch_mon_dom(d);
>   
>   	cpumask_clear_cpu(cpu, &d->hdr.cpu_mask);
>   	if (cpumask_empty(&d->hdr.cpu_mask)) {
>   		resctrl_offline_mon_domain(r, d);
>   		list_del(&d->hdr.list);
> -		domain_free(hw_dom);
> +		mon_domain_free(hw_dom);
>   
>   		return;
>   	}
> diff --git a/arch/x86/kernel/cpu/resctrl/ctrlmondata.c b/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
> index 0b4136c42762..08fc97ce4135 100644
> --- a/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
> +++ b/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
> @@ -58,7 +58,7 @@ static bool bw_validate(char *buf, unsigned long *data, struct rdt_resource *r)
>   }
>   
>   int parse_bw(struct rdt_parse_data *data, struct resctrl_schema *s,
> -	     struct rdt_domain *d)
> +	     struct rdt_ctrl_domain *d)
>   {
>   	struct resctrl_staged_config *cfg;
>   	u32 closid = data->rdtgrp->closid;
> @@ -137,7 +137,7 @@ static bool cbm_validate(char *buf, u32 *data, struct rdt_resource *r)
>    * resource type.
>    */
>   int parse_cbm(struct rdt_parse_data *data, struct resctrl_schema *s,
> -	      struct rdt_domain *d)
> +	      struct rdt_ctrl_domain *d)
>   {
>   	struct rdtgroup *rdtgrp = data->rdtgrp;
>   	struct resctrl_staged_config *cfg;
> @@ -206,8 +206,8 @@ static int parse_line(char *line, struct resctrl_schema *s,
>   	struct resctrl_staged_config *cfg;
>   	struct rdt_resource *r = s->res;
>   	struct rdt_parse_data data;
> +	struct rdt_ctrl_domain *d;
>   	char *dom = NULL, *id;
> -	struct rdt_domain *d;
>   	unsigned long dom_id;
>   
>   	if (rdtgrp->mode == RDT_MODE_PSEUDO_LOCKSETUP &&
> @@ -267,11 +267,11 @@ static u32 get_config_index(u32 closid, enum resctrl_conf_type type)
>   	}
>   }
>   
> -static bool apply_config(struct rdt_hw_domain *hw_dom,
> +static bool apply_config(struct rdt_hw_ctrl_domain *hw_dom,
>   			 struct resctrl_staged_config *cfg, u32 idx,
>   			 cpumask_var_t cpu_mask)
>   {
> -	struct rdt_domain *dom = &hw_dom->d_resctrl;
> +	struct rdt_ctrl_domain *dom = &hw_dom->d_resctrl;
>   
>   	if (cfg->new_ctrl != hw_dom->ctrl_val[idx]) {
>   		cpumask_set_cpu(cpumask_any(&dom->hdr.cpu_mask), cpu_mask);
> @@ -283,11 +283,11 @@ static bool apply_config(struct rdt_hw_domain *hw_dom,
>   	return false;
>   }
>   
> -int resctrl_arch_update_one(struct rdt_resource *r, struct rdt_domain *d,
> +int resctrl_arch_update_one(struct rdt_resource *r, struct rdt_ctrl_domain *d,
>   			    u32 closid, enum resctrl_conf_type t, u32 cfg_val)
>   {
> +	struct rdt_hw_ctrl_domain *hw_dom = resctrl_to_arch_ctrl_dom(d);
>   	struct rdt_hw_resource *hw_res = resctrl_to_arch_res(r);
> -	struct rdt_hw_domain *hw_dom = resctrl_to_arch_dom(d);
>   	u32 idx = get_config_index(closid, t);
>   	struct msr_param msr_param;
>   
> @@ -307,11 +307,11 @@ int resctrl_arch_update_one(struct rdt_resource *r, struct rdt_domain *d,
>   int resctrl_arch_update_domains(struct rdt_resource *r, u32 closid)
>   {
>   	struct resctrl_staged_config *cfg;
> -	struct rdt_hw_domain *hw_dom;
> +	struct rdt_hw_ctrl_domain *hw_dom;
>   	struct msr_param msr_param;
> +	struct rdt_ctrl_domain *d;
>   	enum resctrl_conf_type t;
>   	cpumask_var_t cpu_mask;
> -	struct rdt_domain *d;
>   	u32 idx;
>   
>   	if (!zalloc_cpumask_var(&cpu_mask, GFP_KERNEL))
> @@ -319,7 +319,7 @@ int resctrl_arch_update_domains(struct rdt_resource *r, u32 closid)
>   
>   	msr_param.res = NULL;
>   	list_for_each_entry(d, &r->ctrl_domains, hdr.list) {
> -		hw_dom = resctrl_to_arch_dom(d);
> +		hw_dom = resctrl_to_arch_ctrl_dom(d);
>   		for (t = 0; t < CDP_NUM_TYPES; t++) {
>   			cfg = &hw_dom->d_resctrl.staged_config[t];
>   			if (!cfg->have_new_ctrl)
> @@ -449,10 +449,10 @@ ssize_t rdtgroup_schemata_write(struct kernfs_open_file *of,
>   	return ret ?: nbytes;
>   }
>   
> -u32 resctrl_arch_get_config(struct rdt_resource *r, struct rdt_domain *d,
> +u32 resctrl_arch_get_config(struct rdt_resource *r, struct rdt_ctrl_domain *d,
>   			    u32 closid, enum resctrl_conf_type type)
>   {
> -	struct rdt_hw_domain *hw_dom = resctrl_to_arch_dom(d);
> +	struct rdt_hw_ctrl_domain *hw_dom = resctrl_to_arch_ctrl_dom(d);
>   	u32 idx = get_config_index(closid, type);
>   
>   	return hw_dom->ctrl_val[idx];
> @@ -461,7 +461,7 @@ u32 resctrl_arch_get_config(struct rdt_resource *r, struct rdt_domain *d,
>   static void show_doms(struct seq_file *s, struct resctrl_schema *schema, int closid)
>   {
>   	struct rdt_resource *r = schema->res;
> -	struct rdt_domain *dom;
> +	struct rdt_ctrl_domain *dom;
>   	bool sep = false;
>   	u32 ctrl_val;
>   
> @@ -523,7 +523,7 @@ int rdtgroup_schemata_show(struct kernfs_open_file *of,
>   }
>   
>   void mon_event_read(struct rmid_read *rr, struct rdt_resource *r,
> -		    struct rdt_domain *d, struct rdtgroup *rdtgrp,
> +		    struct rdt_mon_domain *d, struct rdtgroup *rdtgrp,
>   		    int evtid, int first)
>   {
>   	/*
> @@ -543,11 +543,11 @@ int rdtgroup_mondata_show(struct seq_file *m, void *arg)
>   {
>   	struct kernfs_open_file *of = m->private;
>   	struct rdt_domain_hdr *hdr;
> +	struct rdt_mon_domain *d;
>   	u32 resid, evtid, domid;
>   	struct rdtgroup *rdtgrp;
>   	struct rdt_resource *r;
>   	union mon_data_bits md;
> -	struct rdt_domain *d;
>   	struct rmid_read rr;
>   	int ret = 0;
>   
> @@ -568,7 +568,7 @@ int rdtgroup_mondata_show(struct seq_file *m, void *arg)
>   		ret = -ENOENT;
>   		goto out;
>   	}
> -	d = container_of(hdr, struct rdt_domain, hdr);
> +	d = container_of(hdr, struct rdt_mon_domain, hdr);
>   
>   	mon_event_read(&rr, r, d, rdtgrp, evtid, false);
>   
> diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c
> index ec5ad926c5dc..4e145f5620b0 100644
> --- a/arch/x86/kernel/cpu/resctrl/monitor.c
> +++ b/arch/x86/kernel/cpu/resctrl/monitor.c
> @@ -170,7 +170,7 @@ static int __rmid_read(u32 rmid, enum resctrl_event_id eventid, u64 *val)
>   	return 0;
>   }
>   
> -static struct arch_mbm_state *get_arch_mbm_state(struct rdt_hw_domain *hw_dom,
> +static struct arch_mbm_state *get_arch_mbm_state(struct rdt_hw_mon_domain *hw_dom,
>   						 u32 rmid,
>   						 enum resctrl_event_id eventid)
>   {
> @@ -189,10 +189,10 @@ static struct arch_mbm_state *get_arch_mbm_state(struct rdt_hw_domain *hw_dom,
>   	return NULL;
>   }
>   
> -void resctrl_arch_reset_rmid(struct rdt_resource *r, struct rdt_domain *d,
> +void resctrl_arch_reset_rmid(struct rdt_resource *r, struct rdt_mon_domain *d,
>   			     u32 rmid, enum resctrl_event_id eventid)
>   {
> -	struct rdt_hw_domain *hw_dom = resctrl_to_arch_dom(d);
> +	struct rdt_hw_mon_domain *hw_dom = resctrl_to_arch_mon_dom(d);
>   	struct arch_mbm_state *am;
>   
>   	am = get_arch_mbm_state(hw_dom, rmid, eventid);
> @@ -208,9 +208,9 @@ void resctrl_arch_reset_rmid(struct rdt_resource *r, struct rdt_domain *d,
>    * Assumes that hardware counters are also reset and thus that there is
>    * no need to record initial non-zero counts.
>    */
> -void resctrl_arch_reset_rmid_all(struct rdt_resource *r, struct rdt_domain *d)
> +void resctrl_arch_reset_rmid_all(struct rdt_resource *r, struct rdt_mon_domain *d)
>   {
> -	struct rdt_hw_domain *hw_dom = resctrl_to_arch_dom(d);
> +	struct rdt_hw_mon_domain *hw_dom = resctrl_to_arch_mon_dom(d);
>   
>   	if (is_mbm_total_enabled())
>   		memset(hw_dom->arch_mbm_total, 0,
> @@ -229,11 +229,11 @@ static u64 mbm_overflow_count(u64 prev_msr, u64 cur_msr, unsigned int width)
>   	return chunks >> shift;
>   }
>   
> -int resctrl_arch_rmid_read(struct rdt_resource *r, struct rdt_domain *d,
> +int resctrl_arch_rmid_read(struct rdt_resource *r, struct rdt_mon_domain *d,
>   			   u32 rmid, enum resctrl_event_id eventid, u64 *val)
>   {
> +	struct rdt_hw_mon_domain *hw_dom = resctrl_to_arch_mon_dom(d);
>   	struct rdt_hw_resource *hw_res = resctrl_to_arch_res(r);
> -	struct rdt_hw_domain *hw_dom = resctrl_to_arch_dom(d);
>   	struct arch_mbm_state *am;
>   	u64 msr_val, chunks;
>   	int ret;
> @@ -266,7 +266,7 @@ int resctrl_arch_rmid_read(struct rdt_resource *r, struct rdt_domain *d,
>    * decrement the count. If the busy count gets to zero on an RMID, we
>    * free the RMID
>    */
> -void __check_limbo(struct rdt_domain *d, bool force_free)
> +void __check_limbo(struct rdt_mon_domain *d, bool force_free)
>   {
>   	struct rdt_resource *r = &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl;
>   	struct rmid_entry *entry;
> @@ -305,7 +305,7 @@ void __check_limbo(struct rdt_domain *d, bool force_free)
>   	}
>   }
>   
> -bool has_busy_rmid(struct rdt_resource *r, struct rdt_domain *d)
> +bool has_busy_rmid(struct rdt_resource *r, struct rdt_mon_domain *d)
>   {
>   	return find_first_bit(d->rmid_busy_llc, r->num_rmid) != r->num_rmid;
>   }
> @@ -334,7 +334,7 @@ int alloc_rmid(void)
>   static void add_rmid_to_limbo(struct rmid_entry *entry)
>   {
>   	struct rdt_resource *r = &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl;
> -	struct rdt_domain *d;
> +	struct rdt_mon_domain *d;
>   	int cpu, err;
>   	u64 val = 0;
>   
> @@ -383,7 +383,7 @@ void free_rmid(u32 rmid)
>   		list_add_tail(&entry->list, &rmid_free_lru);
>   }
>   
> -static struct mbm_state *get_mbm_state(struct rdt_domain *d, u32 rmid,
> +static struct mbm_state *get_mbm_state(struct rdt_mon_domain *d, u32 rmid,
>   				       enum resctrl_event_id evtid)
>   {
>   	switch (evtid) {
> @@ -516,13 +516,13 @@ void mon_event_count(void *info)
>    * throttle MSRs already have low percentage values.  To avoid
>    * unnecessarily restricting such rdtgroups, we also increase the bandwidth.
>    */
> -static void update_mba_bw(struct rdtgroup *rgrp, struct rdt_domain *dom_mbm)
> +static void update_mba_bw(struct rdtgroup *rgrp, struct rdt_mon_domain *dom_mbm)
>   {
>   	u32 closid, rmid, cur_msr_val, new_msr_val;
>   	struct mbm_state *pmbm_data, *cmbm_data;
> +	struct rdt_ctrl_domain *dom_mba;
>   	u32 cur_bw, delta_bw, user_bw;
>   	struct rdt_resource *r_mba;
> -	struct rdt_domain *dom_mba;
>   	struct list_head *head;
>   	struct rdtgroup *entry;
>   
> @@ -600,7 +600,7 @@ static void update_mba_bw(struct rdtgroup *rgrp, struct rdt_domain *dom_mbm)
>   	}
>   }
>   
> -static void mbm_update(struct rdt_resource *r, struct rdt_domain *d, int rmid)
> +static void mbm_update(struct rdt_resource *r, struct rdt_mon_domain *d, int rmid)
>   {
>   	struct rmid_read rr;
>   
> @@ -640,13 +640,13 @@ void cqm_handle_limbo(struct work_struct *work)
>   {
>   	unsigned long delay = msecs_to_jiffies(CQM_LIMBOCHECK_INTERVAL);
>   	int cpu = smp_processor_id();
> +	struct rdt_mon_domain *d;
>   	struct rdt_resource *r;
> -	struct rdt_domain *d;
>   
>   	mutex_lock(&rdtgroup_mutex);
>   
>   	r = &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl;
> -	d = container_of(work, struct rdt_domain, cqm_limbo.work);
> +	d = container_of(work, struct rdt_mon_domain, cqm_limbo.work);
>   
>   	__check_limbo(d, false);
>   
> @@ -656,7 +656,7 @@ void cqm_handle_limbo(struct work_struct *work)
>   	mutex_unlock(&rdtgroup_mutex);
>   }
>   
> -void cqm_setup_limbo_handler(struct rdt_domain *dom, unsigned long delay_ms)
> +void cqm_setup_limbo_handler(struct rdt_mon_domain *dom, unsigned long delay_ms)
>   {
>   	unsigned long delay = msecs_to_jiffies(delay_ms);
>   	int cpu;
> @@ -672,9 +672,9 @@ void mbm_handle_overflow(struct work_struct *work)
>   	unsigned long delay = msecs_to_jiffies(MBM_OVERFLOW_INTERVAL);
>   	struct rdtgroup *prgrp, *crgrp;
>   	int cpu = smp_processor_id();
> +	struct rdt_mon_domain *d;
>   	struct list_head *head;
>   	struct rdt_resource *r;
> -	struct rdt_domain *d;
>   
>   	mutex_lock(&rdtgroup_mutex);
>   
> @@ -682,7 +682,7 @@ void mbm_handle_overflow(struct work_struct *work)
>   		goto out_unlock;
>   
>   	r = &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl;
> -	d = container_of(work, struct rdt_domain, mbm_over.work);
> +	d = container_of(work, struct rdt_mon_domain, mbm_over.work);
>   
>   	list_for_each_entry(prgrp, &rdt_all_groups, rdtgroup_list) {
>   		mbm_update(r, d, prgrp->mon.rmid);
> @@ -701,7 +701,7 @@ void mbm_handle_overflow(struct work_struct *work)
>   	mutex_unlock(&rdtgroup_mutex);
>   }
>   
> -void mbm_setup_overflow_handler(struct rdt_domain *dom, unsigned long delay_ms)
> +void mbm_setup_overflow_handler(struct rdt_mon_domain *dom, unsigned long delay_ms)
>   {
>   	unsigned long delay = msecs_to_jiffies(delay_ms);
>   	int cpu;
> diff --git a/arch/x86/kernel/cpu/resctrl/pseudo_lock.c b/arch/x86/kernel/cpu/resctrl/pseudo_lock.c
> index ed6d59af1cef..08d35f828bc3 100644
> --- a/arch/x86/kernel/cpu/resctrl/pseudo_lock.c
> +++ b/arch/x86/kernel/cpu/resctrl/pseudo_lock.c
> @@ -814,7 +814,7 @@ int rdtgroup_locksetup_exit(struct rdtgroup *rdtgrp)
>    * Return: true if @cbm overlaps with pseudo-locked region on @d, false
>    * otherwise.
>    */
> -bool rdtgroup_cbm_overlaps_pseudo_locked(struct rdt_domain *d, unsigned long cbm)
> +bool rdtgroup_cbm_overlaps_pseudo_locked(struct rdt_ctrl_domain *d, unsigned long cbm)
>   {
>   	unsigned int cbm_len;
>   	unsigned long cbm_b;
> @@ -841,11 +841,11 @@ bool rdtgroup_cbm_overlaps_pseudo_locked(struct rdt_domain *d, unsigned long cbm
>    *         if it is not possible to test due to memory allocation issue,
>    *         false otherwise.
>    */
> -bool rdtgroup_pseudo_locked_in_hierarchy(struct rdt_domain *d)
> +bool rdtgroup_pseudo_locked_in_hierarchy(struct rdt_ctrl_domain *d)
>   {
> +	struct rdt_ctrl_domain *d_i;
>   	cpumask_var_t cpu_with_psl;
>   	struct rdt_resource *r;
> -	struct rdt_domain *d_i;
>   	bool ret = false;
>   
>   	if (!zalloc_cpumask_var(&cpu_with_psl, GFP_KERNEL))
> diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> index 760013ed1bff..21bbd832f3f2 100644
> --- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> +++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> @@ -85,8 +85,8 @@ void rdt_last_cmd_printf(const char *fmt, ...)
>   
>   void rdt_staged_configs_clear(void)
>   {
> +	struct rdt_ctrl_domain *dom;
>   	struct rdt_resource *r;
> -	struct rdt_domain *dom;
>   
>   	lockdep_assert_held(&rdtgroup_mutex);
>   
> @@ -976,7 +976,7 @@ static int rdt_bit_usage_show(struct kernfs_open_file *of,
>   	unsigned long sw_shareable = 0, hw_shareable = 0;
>   	unsigned long exclusive = 0, pseudo_locked = 0;
>   	struct rdt_resource *r = s->res;
> -	struct rdt_domain *dom;
> +	struct rdt_ctrl_domain *dom;
>   	int i, hwb, swb, excl, psl;
>   	enum rdtgrp_mode mode;
>   	bool sep = false;
> @@ -1205,7 +1205,7 @@ static int rdt_has_sparse_bitmasks_show(struct kernfs_open_file *of,
>    *
>    * Return: false if CBM does not overlap, true if it does.
>    */
> -static bool __rdtgroup_cbm_overlaps(struct rdt_resource *r, struct rdt_domain *d,
> +static bool __rdtgroup_cbm_overlaps(struct rdt_resource *r, struct rdt_ctrl_domain *d,
>   				    unsigned long cbm, int closid,
>   				    enum resctrl_conf_type type, bool exclusive)
>   {
> @@ -1260,7 +1260,7 @@ static bool __rdtgroup_cbm_overlaps(struct rdt_resource *r, struct rdt_domain *d
>    *
>    * Return: true if CBM overlap detected, false if there is no overlap
>    */
> -bool rdtgroup_cbm_overlaps(struct resctrl_schema *s, struct rdt_domain *d,
> +bool rdtgroup_cbm_overlaps(struct resctrl_schema *s, struct rdt_ctrl_domain *d,
>   			   unsigned long cbm, int closid, bool exclusive)
>   {
>   	enum resctrl_conf_type peer_type = resctrl_peer_type(s->conf_type);
> @@ -1291,10 +1291,10 @@ bool rdtgroup_cbm_overlaps(struct resctrl_schema *s, struct rdt_domain *d,
>   static bool rdtgroup_mode_test_exclusive(struct rdtgroup *rdtgrp)
>   {
>   	int closid = rdtgrp->closid;
> +	struct rdt_ctrl_domain *d;
>   	struct resctrl_schema *s;
>   	struct rdt_resource *r;
>   	bool has_cache = false;
> -	struct rdt_domain *d;
>   	u32 ctrl;
>   
>   	list_for_each_entry(s, &resctrl_schema_all, list) {
> @@ -1407,7 +1407,7 @@ static ssize_t rdtgroup_mode_write(struct kernfs_open_file *of,
>    * bitmap functions work correctly.
>    */
>   unsigned int rdtgroup_cbm_to_size(struct rdt_resource *r,
> -				  struct rdt_domain *d, unsigned long cbm)
> +				  struct rdt_ctrl_domain *d, unsigned long cbm)
>   {
>   	struct cpu_cacheinfo *ci;
>   	unsigned int size = 0;
> @@ -1439,9 +1439,9 @@ static int rdtgroup_size_show(struct kernfs_open_file *of,
>   {
>   	struct resctrl_schema *schema;
>   	enum resctrl_conf_type type;
> +	struct rdt_ctrl_domain *d;
>   	struct rdtgroup *rdtgrp;
>   	struct rdt_resource *r;
> -	struct rdt_domain *d;
>   	unsigned int size;
>   	int ret = 0;
>   	u32 closid;
> @@ -1553,7 +1553,7 @@ static void mon_event_config_read(void *info)
>   	mon_info->mon_config = msrval & MAX_EVT_CONFIG_BITS;
>   }
>   
> -static void mondata_config_read(struct rdt_domain *d, struct mon_config_info *mon_info)
> +static void mondata_config_read(struct rdt_mon_domain *d, struct mon_config_info *mon_info)
>   {
>   	smp_call_function_any(&d->hdr.cpu_mask, mon_event_config_read, mon_info, 1);
>   }
> @@ -1561,7 +1561,7 @@ static void mondata_config_read(struct rdt_domain *d, struct mon_config_info *mo
>   static int mbm_config_show(struct seq_file *s, struct rdt_resource *r, u32 evtid)
>   {
>   	struct mon_config_info mon_info = {0};
> -	struct rdt_domain *dom;
> +	struct rdt_mon_domain *dom;
>   	bool sep = false;
>   
>   	mutex_lock(&rdtgroup_mutex);
> @@ -1618,7 +1618,7 @@ static void mon_event_config_write(void *info)
>   }
>   
>   static int mbm_config_write_domain(struct rdt_resource *r,
> -				   struct rdt_domain *d, u32 evtid, u32 val)
> +				   struct rdt_mon_domain *d, u32 evtid, u32 val)
>   {
>   	struct mon_config_info mon_info = {0};
>   	int ret = 0;
> @@ -1668,7 +1668,7 @@ static int mon_config_write(struct rdt_resource *r, char *tok, u32 evtid)
>   {
>   	char *dom_str = NULL, *id_str;
>   	unsigned long dom_id, val;
> -	struct rdt_domain *d;
> +	struct rdt_mon_domain *d;
>   	int ret = 0;
>   
>   next:
> @@ -2216,9 +2216,9 @@ static inline bool is_mba_linear(void)
>   static int set_cache_qos_cfg(int level, bool enable)
>   {
>   	void (*update)(void *arg);
> +	struct rdt_ctrl_domain *d;
>   	struct rdt_resource *r_l;
>   	cpumask_var_t cpu_mask;
> -	struct rdt_domain *d;
>   	int cpu;
>   
>   	if (level == RDT_RESOURCE_L3)
> @@ -2265,7 +2265,7 @@ void rdt_domain_reconfigure_cdp(struct rdt_resource *r)
>   		l3_qos_cfg_update(&hw_res->cdp_enabled);
>   }
>   
> -static int mba_sc_domain_allocate(struct rdt_resource *r, struct rdt_domain *d)
> +static int mba_sc_domain_allocate(struct rdt_resource *r, struct rdt_ctrl_domain *d)
>   {
>   	u32 num_closid = resctrl_arch_get_num_closid(r);
>   	int cpu = cpumask_any(&d->hdr.cpu_mask);
> @@ -2283,7 +2283,7 @@ static int mba_sc_domain_allocate(struct rdt_resource *r, struct rdt_domain *d)
>   }
>   
>   static void mba_sc_domain_destroy(struct rdt_resource *r,
> -				  struct rdt_domain *d)
> +				  struct rdt_ctrl_domain *d)
>   {
>   	kfree(d->mbps_val);
>   	d->mbps_val = NULL;
> @@ -2309,7 +2309,7 @@ static int set_mba_sc(bool mba_sc)
>   {
>   	struct rdt_resource *r = &rdt_resources_all[RDT_RESOURCE_MBA].r_resctrl;
>   	u32 num_closid = resctrl_arch_get_num_closid(r);
> -	struct rdt_domain *d;
> +	struct rdt_ctrl_domain *d;
>   	int i;
>   
>   	if (!supports_mba_mbps() || mba_sc == is_mba_sc(r))
> @@ -2578,7 +2578,7 @@ static int rdt_get_tree(struct fs_context *fc)
>   {
>   	struct rdt_fs_context *ctx = rdt_fc2context(fc);
>   	unsigned long flags = RFTYPE_CTRL_BASE;
> -	struct rdt_domain *dom;
> +	struct rdt_mon_domain *dom;
>   	struct rdt_resource *r;
>   	int ret;
>   
> @@ -2762,10 +2762,10 @@ static int rdt_init_fs_context(struct fs_context *fc)
>   static int reset_all_ctrls(struct rdt_resource *r)
>   {
>   	struct rdt_hw_resource *hw_res = resctrl_to_arch_res(r);
> -	struct rdt_hw_domain *hw_dom;
> +	struct rdt_hw_ctrl_domain *hw_dom;
>   	struct msr_param msr_param;
> +	struct rdt_ctrl_domain *d;
>   	cpumask_var_t cpu_mask;
> -	struct rdt_domain *d;
>   	int i;
>   
>   	if (!zalloc_cpumask_var(&cpu_mask, GFP_KERNEL))
> @@ -2781,7 +2781,7 @@ static int reset_all_ctrls(struct rdt_resource *r)
>   	 * from each domain to update the MSRs below.
>   	 */
>   	list_for_each_entry(d, &r->ctrl_domains, hdr.list) {
> -		hw_dom = resctrl_to_arch_dom(d);
> +		hw_dom = resctrl_to_arch_ctrl_dom(d);
>   		cpumask_set_cpu(cpumask_any(&d->hdr.cpu_mask), cpu_mask);
>   
>   		for (i = 0; i < hw_res->num_closid; i++)
> @@ -2976,7 +2976,7 @@ static void rmdir_mondata_subdir_allrdtgrp(struct rdt_resource *r,
>   }
>   
>   static int mkdir_mondata_subdir(struct kernfs_node *parent_kn,
> -				struct rdt_domain *d,
> +				struct rdt_mon_domain *d,
>   				struct rdt_resource *r, struct rdtgroup *prgrp)
>   {
>   	union mon_data_bits priv;
> @@ -3025,7 +3025,7 @@ static int mkdir_mondata_subdir(struct kernfs_node *parent_kn,
>    * and "monitor" groups with given domain id.
>    */
>   static void mkdir_mondata_subdir_allrdtgrp(struct rdt_resource *r,
> -					   struct rdt_domain *d)
> +					   struct rdt_mon_domain *d)
>   {
>   	struct kernfs_node *parent_kn;
>   	struct rdtgroup *prgrp, *crgrp;
> @@ -3047,7 +3047,7 @@ static int mkdir_mondata_subdir_alldom(struct kernfs_node *parent_kn,
>   				       struct rdt_resource *r,
>   				       struct rdtgroup *prgrp)
>   {
> -	struct rdt_domain *dom;
> +	struct rdt_mon_domain *dom;
>   	int ret;
>   
>   	list_for_each_entry(dom, &r->mon_domains, hdr.list) {
> @@ -3149,7 +3149,7 @@ static u32 cbm_ensure_valid(u32 _val, struct rdt_resource *r)
>    * Set the RDT domain up to start off with all usable allocations. That is,
>    * all shareable and unused bits. All-zero CBM is invalid.
>    */
> -static int __init_one_rdt_domain(struct rdt_domain *d, struct resctrl_schema *s,
> +static int __init_one_rdt_domain(struct rdt_ctrl_domain *d, struct resctrl_schema *s,
>   				 u32 closid)
>   {
>   	enum resctrl_conf_type peer_type = resctrl_peer_type(s->conf_type);
> @@ -3229,7 +3229,7 @@ static int __init_one_rdt_domain(struct rdt_domain *d, struct resctrl_schema *s,
>    */
>   static int rdtgroup_init_cat(struct resctrl_schema *s, u32 closid)
>   {
> -	struct rdt_domain *d;
> +	struct rdt_ctrl_domain *d;
>   	int ret;
>   
>   	list_for_each_entry(d, &s->res->ctrl_domains, hdr.list) {
> @@ -3245,7 +3245,7 @@ static int rdtgroup_init_cat(struct resctrl_schema *s, u32 closid)
>   static void rdtgroup_init_mba(struct rdt_resource *r, u32 closid)
>   {
>   	struct resctrl_staged_config *cfg;
> -	struct rdt_domain *d;
> +	struct rdt_ctrl_domain *d;
>   
>   	list_for_each_entry(d, &r->ctrl_domains, hdr.list) {
>   		if (is_mba_sc(r)) {
> @@ -3842,14 +3842,14 @@ static void __init rdtgroup_setup_default(void)
>   	mutex_unlock(&rdtgroup_mutex);
>   }
>   
> -static void domain_destroy_mon_state(struct rdt_domain *d)
> +static void domain_destroy_mon_state(struct rdt_mon_domain *d)
>   {
>   	bitmap_free(d->rmid_busy_llc);
>   	kfree(d->mbm_total);
>   	kfree(d->mbm_local);
>   }
>   
> -void resctrl_offline_ctrl_domain(struct rdt_resource *r, struct rdt_domain *d)
> +void resctrl_offline_ctrl_domain(struct rdt_resource *r, struct rdt_ctrl_domain *d)
>   {
>   	lockdep_assert_held(&rdtgroup_mutex);
>   
> @@ -3857,7 +3857,7 @@ void resctrl_offline_ctrl_domain(struct rdt_resource *r, struct rdt_domain *d)
>   		mba_sc_domain_destroy(r, d);
>   }
>   
> -void resctrl_offline_mon_domain(struct rdt_resource *r, struct rdt_domain *d)
> +void resctrl_offline_mon_domain(struct rdt_resource *r, struct rdt_mon_domain *d)
>   {
>   	lockdep_assert_held(&rdtgroup_mutex);
>   
> @@ -3886,7 +3886,7 @@ void resctrl_offline_mon_domain(struct rdt_resource *r, struct rdt_domain *d)
>   	domain_destroy_mon_state(d);
>   }
>   
> -static int domain_setup_mon_state(struct rdt_resource *r, struct rdt_domain *d)
> +static int domain_setup_mon_state(struct rdt_resource *r, struct rdt_mon_domain *d)
>   {
>   	size_t tsize;
>   
> @@ -3916,7 +3916,7 @@ static int domain_setup_mon_state(struct rdt_resource *r, struct rdt_domain *d)
>   	return 0;
>   }
>   
> -int resctrl_online_ctrl_domain(struct rdt_resource *r, struct rdt_domain *d)
> +int resctrl_online_ctrl_domain(struct rdt_resource *r, struct rdt_ctrl_domain *d)
>   {
>   	lockdep_assert_held(&rdtgroup_mutex);
>   
> @@ -3926,7 +3926,7 @@ int resctrl_online_ctrl_domain(struct rdt_resource *r, struct rdt_domain *d)
>   	return 0;
>   }
>   
> -int resctrl_online_mon_domain(struct rdt_resource *r, struct rdt_domain *d)
> +int resctrl_online_mon_domain(struct rdt_resource *r, struct rdt_mon_domain *d)
>   {
>   	int err;
>   

^ permalink raw reply	[flat|nested] 331+ messages in thread

* Re: [PATCH v13 5/8] x86/resctrl: Add node-scope to the options for feature scope
  2023-12-04 18:53                   ` [PATCH v13 5/8] x86/resctrl: Add node-scope to the options for feature scope Tony Luck
@ 2023-12-04 23:04                     ` Moger, Babu
  0 siblings, 0 replies; 331+ messages in thread
From: Moger, Babu @ 2023-12-04 23:04 UTC (permalink / raw)
  To: Tony Luck, Fenghua Yu, Reinette Chatre, Peter Newman,
	Jonathan Corbet, Shuah Khan, x86
  Cc: Shaopeng Tan, James Morse, Jamie Iles, Babu Moger, Randy Dunlap,
	linux-kernel, linux-doc, patches, Shaopeng Tan


On 12/4/2023 12:53 PM, Tony Luck wrote:
> The currently supported resctrl features are all scoped at the same
> granularity as the L2 or L3 caches.
>
> Add RESCTRL_NODE as a new option for features that are scoped at the
> same granularity as NUMA nodes. This is needed for Intel's Sub-NUMA
> Cluster (SNC) feature where monitoring features are node scoped.
>
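
For context, a rough illustration (my sketch, not something this patch does)
of how a resource opts into the new scope; a later patch in this series does
something along these lines for L3 monitoring once SNC is detected:

	/* Hypothetical usage: switch L3 monitoring to node scope. */
	rdt_resources_all[RDT_RESOURCE_L3].r_resctrl.mon_scope = RESCTRL_NODE;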
> Signed-off-by: Tony Luck<tony.luck@intel.com>
> Tested-by: Shaopeng Tan<tan.shaopeng@jp.fujitsu.com>
> Reviewed-by: Peter Newman<peternewman@google.com>
> Reviewed-by: Reinette Chatre<reinette.chatre@intel.com>
> Reviewed-by: Shaopeng Tan<tan.shaopeng@jp.fujitsu.com>
Reviewed-by: Babu Moger <babu.moger@amd.com>
> ---
>   include/linux/resctrl.h            | 1 +
>   arch/x86/kernel/cpu/resctrl/core.c | 2 ++
>   2 files changed, 3 insertions(+)
>
> diff --git a/include/linux/resctrl.h b/include/linux/resctrl.h
> index 058a940c3239..b8a3a11b970d 100644
> --- a/include/linux/resctrl.h
> +++ b/include/linux/resctrl.h
> @@ -170,6 +170,7 @@ struct resctrl_schema;
>   enum resctrl_scope {
>   	RESCTRL_L2_CACHE = 2,
>   	RESCTRL_L3_CACHE = 3,
> +	RESCTRL_NODE,
>   };
>   
>   /**
> diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
> index 797cb3bf417a..c9315ce8f7bd 100644
> --- a/arch/x86/kernel/cpu/resctrl/core.c
> +++ b/arch/x86/kernel/cpu/resctrl/core.c
> @@ -502,6 +502,8 @@ static int get_domain_id_from_scope(int cpu, enum resctrl_scope scope)
>   	case RESCTRL_L2_CACHE:
>   	case RESCTRL_L3_CACHE:
>   		return get_cpu_cacheinfo_id(cpu, scope);
> +	case RESCTRL_NODE:
> +		return cpu_to_node(cpu);
>   	default:
>   		break;
>   	}

^ permalink raw reply	[flat|nested] 331+ messages in thread

* Re: [PATCH v13 6/8] x86/resctrl: Introduce snc_nodes_per_l3_cache
  2023-12-04 18:53                   ` [PATCH v13 6/8] x86/resctrl: Introduce snc_nodes_per_l3_cache Tony Luck
@ 2023-12-04 23:04                     ` Moger, Babu
  0 siblings, 0 replies; 331+ messages in thread
From: Moger, Babu @ 2023-12-04 23:04 UTC (permalink / raw)
  To: Tony Luck, Fenghua Yu, Reinette Chatre, Peter Newman,
	Jonathan Corbet, Shuah Khan, x86
  Cc: Shaopeng Tan, James Morse, Jamie Iles, Babu Moger, Randy Dunlap,
	linux-kernel, linux-doc, patches, Shaopeng Tan


On 12/4/2023 12:53 PM, Tony Luck wrote:
> Intel Sub-NUMA Cluster (SNC) is a feature that subdivides the CPU cores
> and memory controllers on a socket into two or more groups. These are
> presented to the operating system as NUMA nodes.
>
> This may enable some workloads to have slightly lower latency to memory
> as the memory controller(s) in an SNC node are electrically closer to the
> CPU cores on that SNC node. This benefit may be offset by lower bandwidth
> since the memory accesses for each core can only be interleaved between
> the memory controllers on the same SNC node.
>
> Resctrl monitoring on an Intel system depends upon attaching RMIDs to tasks
> to track L3 cache occupancy and memory bandwidth. There is an MSR that
> controls how the RMIDs are shared between SNC nodes.
>
> The default mode divides them numerically. E.g. when there are two SNC
> nodes on a socket the lower number half of the RMIDs are given to the
> first node, the remainder to the second node. This would be difficult
> to use with the Linux resctrl interface as specific RMID values assigned
> to resctrl groups are not visible to users.
>
> The other mode divides the RMIDs and renumbers the ones on the second
> SNC node to start from zero.
>
> Even with this renumbering SNC mode requires several changes in resctrl
> behavior for correct operation.
>
> Add a global integer "snc_nodes_per_l3_cache" that shows how many
> SNC nodes share each L3 cache. When "snc_nodes_per_l3_cache" is "1",
> SNC mode is either not implemented or not enabled.
>
> Update all places to take appropriate action when SNC mode is enabled:
> 1) The number of logical RMIDs per L3 cache available for use is the
>     number of physical RMIDs divided by the number of SNC nodes.
> 2) Likewise the "mon_scale" value must be divided by the number of SNC
>     nodes.
> 3) The RMID renumbering applies when hardware uses the value from the
>     IA32_PQR_ASSOC MSR to count accesses by a task. When software reads an
>     RMID counter it must convert the logical RMID into the physical RMID
>     for the SNC node being read, and load that adjusted value into the
>     IA32_QM_EVTSEL MSR.
> 4) Divide the L3 cache between the SNC nodes. Divide the value
>     reported in the resctrl "size" file by the number of SNC
>     nodes because the effective amount of cache that can be allocated
>     is reduced by that factor.
> 5) Disable the "-o mba_MBps" mount option in SNC mode
>     because the monitoring is being done per SNC node, while the
>     bandwidth allocation is still done at the L3 cache scope.
>     Trying to use this feedback loop might result in contradictory
>     changes to the throttling level coming from each of the SNC
>     node bandwidth measurements.
>
> Signed-off-by: Tony Luck<tony.luck@intel.com>
> Tested-by: Shaopeng Tan<tan.shaopeng@jp.fujitsu.com>
> Reviewed-by: Peter Newman<peternewman@google.com>
> Reviewed-by: Reinette Chatre<reinette.chatre@intel.com>
> Reviewed-by: Shaopeng Tan<tan.shaopeng@jp.fujitsu.com>
Reviewed-by: Babu Moger <babu.moger@amd.com>
> ---
>   arch/x86/kernel/cpu/resctrl/internal.h |  2 ++
>   arch/x86/kernel/cpu/resctrl/core.c     |  6 ++++++
>   arch/x86/kernel/cpu/resctrl/monitor.c  | 16 +++++++++++++---
>   arch/x86/kernel/cpu/resctrl/rdtgroup.c |  5 +++--
>   4 files changed, 24 insertions(+), 5 deletions(-)
>
> diff --git a/arch/x86/kernel/cpu/resctrl/internal.h b/arch/x86/kernel/cpu/resctrl/internal.h
> index ce3a70657842..e7a75a439c16 100644
> --- a/arch/x86/kernel/cpu/resctrl/internal.h
> +++ b/arch/x86/kernel/cpu/resctrl/internal.h
> @@ -446,6 +446,8 @@ DECLARE_STATIC_KEY_FALSE(rdt_alloc_enable_key);
>   
>   extern struct dentry *debugfs_resctrl;
>   
> +extern unsigned int snc_nodes_per_l3_cache;
> +
>   enum resctrl_res_level {
>   	RDT_RESOURCE_L3,
>   	RDT_RESOURCE_L2,
> diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
> index c9315ce8f7bd..cf5aba8a74bf 100644
> --- a/arch/x86/kernel/cpu/resctrl/core.c
> +++ b/arch/x86/kernel/cpu/resctrl/core.c
> @@ -48,6 +48,12 @@ int max_name_width, max_data_width;
>    */
>   bool rdt_alloc_capable;
>   
> +/*
> + * Number of SNC nodes that share each L3 cache.  Default is 1 for
> + * systems that do not support SNC, or have SNC disabled.
> + */
> +unsigned int snc_nodes_per_l3_cache = 1;
> +
>   static void
>   mba_wrmsr_intel(struct rdt_ctrl_domain *d, struct msr_param *m,
>   		struct rdt_resource *r);
> diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c
> index 4e145f5620b0..30b7c3b9b517 100644
> --- a/arch/x86/kernel/cpu/resctrl/monitor.c
> +++ b/arch/x86/kernel/cpu/resctrl/monitor.c
> @@ -148,8 +148,18 @@ static inline struct rmid_entry *__rmid_entry(u32 rmid)
>   
>   static int __rmid_read(u32 rmid, enum resctrl_event_id eventid, u64 *val)
>   {
> +	struct rdt_resource *r = &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl;
> +	int cpu = smp_processor_id();
> +	int rmid_offset = 0;
>   	u64 msr_val;
>   
> +	/*
> +	 * When SNC mode is on, need to compute the offset to read the
> +	 * physical RMID counter for the node to which this CPU belongs.
> +	 */
> +	if (snc_nodes_per_l3_cache > 1)
> +		rmid_offset = (cpu_to_node(cpu) % snc_nodes_per_l3_cache) * r->num_rmid;
> +
>   	/*
>   	 * As per the SDM, when IA32_QM_EVTSEL.EvtID (bits 7:0) is configured
>   	 * with a valid event code for supported resource type and the bits
> @@ -158,7 +168,7 @@ static int __rmid_read(u32 rmid, enum resctrl_event_id eventid, u64 *val)
>   	 * IA32_QM_CTR.Error (bit 63) and IA32_QM_CTR.Unavailable (bit 62)
>   	 * are error bits.
>   	 */
> -	wrmsr(MSR_IA32_QM_EVTSEL, eventid, rmid);
> +	wrmsr(MSR_IA32_QM_EVTSEL, eventid, rmid + rmid_offset);
>   	rdmsrl(MSR_IA32_QM_CTR, msr_val);
>   
>   	if (msr_val & RMID_VAL_ERROR)
> @@ -783,8 +793,8 @@ int __init rdt_get_mon_l3_config(struct rdt_resource *r)
>   	int ret;
>   
>   	resctrl_rmid_realloc_limit = boot_cpu_data.x86_cache_size * 1024;
> -	hw_res->mon_scale = boot_cpu_data.x86_cache_occ_scale;
> -	r->num_rmid = boot_cpu_data.x86_cache_max_rmid + 1;
> +	hw_res->mon_scale = boot_cpu_data.x86_cache_occ_scale / snc_nodes_per_l3_cache;
> +	r->num_rmid = (boot_cpu_data.x86_cache_max_rmid + 1) / snc_nodes_per_l3_cache;
>   	hw_res->mbm_width = MBM_CNTR_WIDTH_BASE;
>   
>   	if (mbm_offset > 0 && mbm_offset <= MBM_CNTR_WIDTH_OFFSET_MAX)
> diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> index 21bbd832f3f2..79d57dade568 100644
> --- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> +++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> @@ -1425,7 +1425,7 @@ unsigned int rdtgroup_cbm_to_size(struct rdt_resource *r,
>   		}
>   	}
>   
> -	return size;
> +	return size / snc_nodes_per_l3_cache;
>   }
>   
>   /*
> @@ -2298,7 +2298,8 @@ static bool supports_mba_mbps(void)
>   	struct rdt_resource *r = &rdt_resources_all[RDT_RESOURCE_MBA].r_resctrl;
>   
>   	return (is_mbm_local_enabled() &&
> -		r->alloc_capable && is_mba_linear());
> +		r->alloc_capable && is_mba_linear() &&
> +		snc_nodes_per_l3_cache == 1);
>   }
>   
>   /*

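A minimal standalone sketch of the logical-to-physical RMID mapping that __rmid_read() performs in the patch above. The topology numbers (two SNC nodes per L3 cache, 160 RMIDs left per node after the num_rmid division in rdt_get_mon_l3_config()) and the CPU's node are hypothetical, chosen only to make the arithmetic concrete:

#include <stdio.h>

int main(void)
{
	unsigned int snc_nodes_per_l3_cache = 2;	/* hypothetical: SNC2 */
	unsigned int num_rmid = 160;	/* RMIDs available per SNC node (hypothetical) */
	unsigned int cpu_node = 1;	/* CPU sits on the 2nd SNC node of its socket */
	unsigned int rmid = 7;		/* logical RMID assigned by resctrl */
	unsigned int rmid_offset = 0;

	/* Same computation as __rmid_read(): offset into this node's RMID slice */
	if (snc_nodes_per_l3_cache > 1)
		rmid_offset = (cpu_node % snc_nodes_per_l3_cache) * num_rmid;

	/* IA32_QM_EVTSEL would be written with rmid + rmid_offset */
	printf("logical RMID %u -> physical RMID %u\n", rmid, rmid + rmid_offset);
	return 0;
}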
^ permalink raw reply	[flat|nested] 331+ messages in thread

* Re: [PATCH v13 7/8] x86/resctrl: Sub NUMA Cluster detection and enable
  2023-12-04 18:53                   ` [PATCH v13 7/8] x86/resctrl: Sub NUMA Cluster detection and enable Tony Luck
@ 2023-12-04 23:04                     ` Moger, Babu
  0 siblings, 0 replies; 331+ messages in thread
From: Moger, Babu @ 2023-12-04 23:04 UTC (permalink / raw)
  To: Tony Luck, Fenghua Yu, Reinette Chatre, Peter Newman,
	Jonathan Corbet, Shuah Khan, x86
  Cc: Shaopeng Tan, James Morse, Jamie Iles, Babu Moger, Randy Dunlap,
	linux-kernel, linux-doc, patches, Shaopeng Tan


On 12/4/2023 12:53 PM, Tony Luck wrote:
> There isn't a simple hardware bit that indicates whether a CPU is
> running in Sub NUMA Cluster (SNC) mode. Infer the state from the
> ratio of NUMA nodes to L3 cache instances.
>
> When SNC mode is detected, reconfigure the RMID counters by updating
> the MSR_RMID_SNC_CONFIG MSR on each socket as CPUs are seen.
>
> Clearing bit zero of the MSR divides the RMIDs and renumbers the ones
> on the second SNC node to start from zero.
>
> Signed-off-by: Tony Luck<tony.luck@intel.com>
> Tested-by: Shaopeng Tan<tan.shaopeng@jp.fujitsu.com>
> Reviewed-by: Peter Newman<peternewman@google.com>
> Reviewed-by: Reinette Chatre<reinette.chatre@intel.com>
> Reviewed-by: Shaopeng Tan<tan.shaopeng@jp.fujitsu.com>

Reviewed-by: Babu Moger <babu.moger@amd.com>


> ---
>   arch/x86/include/asm/msr-index.h   |   1 +
>   arch/x86/kernel/cpu/resctrl/core.c | 118 +++++++++++++++++++++++++++++
>   2 files changed, 119 insertions(+)
>
> diff --git a/arch/x86/include/asm/msr-index.h b/arch/x86/include/asm/msr-index.h
> index 1d51e1850ed0..94d29d81e6db 100644
> --- a/arch/x86/include/asm/msr-index.h
> +++ b/arch/x86/include/asm/msr-index.h
> @@ -1111,6 +1111,7 @@
>   #define MSR_IA32_QM_CTR			0xc8e
>   #define MSR_IA32_PQR_ASSOC		0xc8f
>   #define MSR_IA32_L3_CBM_BASE		0xc90
> +#define MSR_RMID_SNC_CONFIG		0xca0
>   #define MSR_IA32_L2_CBM_BASE		0xd10
>   #define MSR_IA32_MBA_THRTL_BASE		0xd50
>   
> diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
> index cf5aba8a74bf..1de1c4499b7d 100644
> --- a/arch/x86/kernel/cpu/resctrl/core.c
> +++ b/arch/x86/kernel/cpu/resctrl/core.c
> @@ -16,11 +16,14 @@
>   
>   #define pr_fmt(fmt)	"resctrl: " fmt
>   
> +#include <linux/cpu.h>
>   #include <linux/slab.h>
>   #include <linux/err.h>
>   #include <linux/cacheinfo.h>
>   #include <linux/cpuhotplug.h>
> +#include <linux/mod_devicetable.h>
>   
> +#include <asm/cpu_device_id.h>
>   #include <asm/intel-family.h>
>   #include <asm/resctrl.h>
>   #include "internal.h"
> @@ -740,11 +743,42 @@ static void clear_closid_rmid(int cpu)
>   	wrmsr(MSR_IA32_PQR_ASSOC, 0, 0);
>   }
>   
> +/*
> + * The power-on reset value of MSR_RMID_SNC_CONFIG is 0x1
> + * which indicates that RMIDs are configured in legacy mode.
> + * This mode is incompatible with Linux resctrl semantics
> + * as RMIDs are partitioned between SNC nodes, which requires
> + * a user to know which RMID is allocated to a task.
> + * Clearing bit 0 reconfigures the RMID counters for use
> + * in Sub NUMA Cluster mode. This mode is better for Linux.
> + * The RMID space is divided between all SNC nodes with the
> + * RMIDs renumbered to start from zero in each node when
> + * counting operations from tasks. Code to read the counters
> + * must adjust RMID counter numbers based on SNC node. See
> + * __rmid_read() for code that does this.
> + */
> +static void snc_remap_rmids(int cpu)
> +{
> +	u64 val;
> +
> +	/* Only need to enable once per package. */
> +	if (cpumask_first(topology_core_cpumask(cpu)) != cpu)
> +		return;
> +
> +	rdmsrl(MSR_RMID_SNC_CONFIG, val);
> +	val &= ~BIT_ULL(0);
> +	wrmsrl(MSR_RMID_SNC_CONFIG, val);
> +}
> +
>   static int resctrl_online_cpu(unsigned int cpu)
>   {
>   	struct rdt_resource *r;
>   
>   	mutex_lock(&rdtgroup_mutex);
> +
> +	if (snc_nodes_per_l3_cache > 1)
> +		snc_remap_rmids(cpu);
> +
>   	for_each_capable_rdt_resource(r)
>   		domain_add_cpu(cpu, r);
>   	/* The cpu is set in default rdtgroup after online. */
> @@ -999,11 +1033,95 @@ static __init bool get_rdt_resources(void)
>   	return (rdt_mon_capable || rdt_alloc_capable);
>   }
>   
> +/* CPU models that support MSR_RMID_SNC_CONFIG */
> +static const struct x86_cpu_id snc_cpu_ids[] __initconst = {
> +	X86_MATCH_INTEL_FAM6_MODEL(ICELAKE_X, 0),
> +	X86_MATCH_INTEL_FAM6_MODEL(SAPPHIRERAPIDS_X, 0),
> +	X86_MATCH_INTEL_FAM6_MODEL(EMERALDRAPIDS_X, 0),
> +	X86_MATCH_INTEL_FAM6_MODEL(GRANITERAPIDS_X, 0),
> +	{}
> +};
> +
> +/*
> + * There isn't a simple hardware bit that indicates whether a CPU is running
> + * in Sub NUMA Cluster (SNC) mode. Infer the state from the
> + * ratio of NUMA nodes to L3 cache instances.
> + * It is not possible to accurately determine SNC state if the system is
> + * booted with a maxcpus=N parameter. That distorts the ratio of SNC nodes
> + * to L3 caches. It will be OK if the system is booted with hyperthreading
> + * disabled (since this doesn't affect the ratio).
> + */
> +static __init int snc_get_config(void)
> +{
> +	unsigned long *node_caches;
> +	int mem_only_nodes = 0;
> +	int cpu, node, ret;
> +	int num_l3_caches;
> +	int cache_id;
> +
> +	if (!x86_match_cpu(snc_cpu_ids))
> +		return 1;
> +
> +	node_caches = bitmap_zalloc(num_possible_cpus(), GFP_KERNEL);
> +	if (!node_caches)
> +		return 1;
> +
> +	cpus_read_lock();
> +
> +	if (num_online_cpus() != num_present_cpus())
> +		pr_warn("Some CPUs offline, SNC detection may be incorrect\n");
> +
> +	for_each_node(node) {
> +		cpu = cpumask_first(cpumask_of_node(node));
> +		if (cpu < nr_cpu_ids) {
> +			cache_id = get_cpu_cacheinfo_id(cpu, 3);
> +			if (cache_id != -1)
> +				set_bit(cache_id, node_caches);
> +		} else {
> +			mem_only_nodes++;
> +		}
> +	}
> +	cpus_read_unlock();
> +
> +	num_l3_caches = bitmap_weight(node_caches, num_possible_cpus());
> +	kfree(node_caches);
> +
> +	if (!num_l3_caches)
> +		goto insane;
> +
> +	/* sanity check #1: Number of CPU nodes must be multiple of num_l3_caches */
> +	if ((nr_node_ids - mem_only_nodes) % num_l3_caches)
> +		goto insane;
> +
> +	ret = (nr_node_ids - mem_only_nodes) / num_l3_caches;
> +
> +	/* sanity check #2: Only valid results are 1, 2, 3, 4 */
> +	switch (ret) {
> +	case 1:
> +		break;
> +	case 2:
> +	case 3:
> +	case 4:
> +		rdt_resources_all[RDT_RESOURCE_L3].r_resctrl.mon_scope = RESCTRL_NODE;
> +		break;
> +	default:
> +		goto insane;
> +	}
> +
> +	return ret;
> +insane:
> +	pr_warn("SNC insanity: CPU nodes = %d num_l3_caches = %d\n",
> +		(nr_node_ids - mem_only_nodes), num_l3_caches);
> +	return 1;
> +}
> +
>   static __init void rdt_init_res_defs_intel(void)
>   {
>   	struct rdt_hw_resource *hw_res;
>   	struct rdt_resource *r;
>   
> +	snc_nodes_per_l3_cache = snc_get_config();
> +
>   	for_each_rdt_resource(r) {
>   		hw_res = resctrl_to_arch_res(r);
>   

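As a worked illustration of the ratio check in snc_get_config() above, consider a hypothetical two-socket system booted with SNC2: four CPU NUMA nodes, no memory-only nodes, and two L3 cache instances, which yields snc_nodes_per_l3_cache = 2. A minimal standalone sketch of that arithmetic (all values assumed for illustration only):

#include <stdio.h>

int main(void)
{
	int nr_node_ids = 4;		/* hypothetical: 2 sockets x SNC2 */
	int mem_only_nodes = 0;		/* no CPU-less NUMA nodes */
	int num_l3_caches = 2;		/* one L3 instance per socket */
	int snc_nodes_per_l3;

	/* Mirror of sanity check #1: CPU node count must divide evenly */
	if ((nr_node_ids - mem_only_nodes) % num_l3_caches) {
		printf("inconsistent topology, assume SNC is disabled\n");
		return 1;
	}

	snc_nodes_per_l3 = (nr_node_ids - mem_only_nodes) / num_l3_caches;
	printf("snc_nodes_per_l3_cache = %d\n", snc_nodes_per_l3);
	return 0;
}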
^ permalink raw reply	[flat|nested] 331+ messages in thread

* Re: [PATCH v13 8/8] x86/resctrl: Update documentation with Sub-NUMA cluster changes
  2023-12-04 18:53                   ` [PATCH v13 8/8] x86/resctrl: Update documentation with Sub-NUMA cluster changes Tony Luck
@ 2023-12-04 23:04                     ` Moger, Babu
  0 siblings, 0 replies; 331+ messages in thread
From: Moger, Babu @ 2023-12-04 23:04 UTC (permalink / raw)
  To: Tony Luck, Fenghua Yu, Reinette Chatre, Peter Newman,
	Jonathan Corbet, Shuah Khan, x86
  Cc: Shaopeng Tan, James Morse, Jamie Iles, Babu Moger, Randy Dunlap,
	linux-kernel, linux-doc, patches, Shaopeng Tan


On 12/4/2023 12:53 PM, Tony Luck wrote:
> With Sub-NUMA Cluster mode enabled the scope of monitoring resources is
> per-NODE instead of per-L3 cache. Suffixes of directories with "L3" in
> their name refer to Sub-NUMA nodes instead of L3 cache ids.
>
> Users should be aware that SNC mode also affects the amount of L3 cache
> available for allocation within each SNC node.
>
> Signed-off-by: Tony Luck<tony.luck@intel.com>
> Tested-by: Shaopeng Tan<tan.shaopeng@jp.fujitsu.com>
> Reviewed-by: Peter Newman<peternewman@google.com>
> Reviewed-by: Reinette Chatre<reinette.chatre@intel.com>
> Reviewed-by: Shaopeng Tan<tan.shaopeng@jp.fujitsu.com>

Reviewed-by: Babu Moger <babu.moger@gmail.com>


> ---
>   Documentation/arch/x86/resctrl.rst | 25 +++++++++++++++++++++----
>   1 file changed, 21 insertions(+), 4 deletions(-)
>
> diff --git a/Documentation/arch/x86/resctrl.rst b/Documentation/arch/x86/resctrl.rst
> index a6279df64a9d..15f1cff6ee76 100644
> --- a/Documentation/arch/x86/resctrl.rst
> +++ b/Documentation/arch/x86/resctrl.rst
> @@ -366,10 +366,10 @@ When control is enabled all CTRL_MON groups will also contain:
>   When monitoring is enabled all MON groups will also contain:
>   
>   "mon_data":
> -	This contains a set of files organized by L3 domain and by
> -	RDT event. E.g. on a system with two L3 domains there will
> -	be subdirectories "mon_L3_00" and "mon_L3_01".	Each of these
> -	directories have one file per event (e.g. "llc_occupancy",
> +	This contains a set of files organized by L3 domain or by NUMA
> +	node (depending on whether Sub-NUMA Cluster (SNC) mode is disabled
> +	or enabled respectively) and by RDT event.  Each of these
> +	directories has one file per event (e.g. "llc_occupancy",
>   	"mbm_total_bytes", and "mbm_local_bytes"). In a MON group these
>   	files provide a read out of the current value of the event for
>   	all tasks in the group. In CTRL_MON groups these files provide
> @@ -478,6 +478,23 @@ if non-contiguous 1s value is supported. On a system with a 20-bit mask
>   each bit represents 5% of the capacity of the cache. You could partition
>   the cache into four equal parts with masks: 0x1f, 0x3e0, 0x7c00, 0xf8000.
>   
> +Notes on Sub-NUMA Cluster mode
> +==============================
> +When SNC mode is enabled, Linux may load balance tasks between Sub-NUMA
> +nodes much more readily than between regular NUMA nodes since the CPUs
> +on Sub-NUMA nodes share the same L3 cache and the system may report
> +the NUMA distance between Sub-NUMA nodes with a lower value than used
> +for regular NUMA nodes.  Users who do not bind tasks to the CPUs of a
> +specific Sub-NUMA node must read the "llc_occupancy", "mbm_total_bytes",
> +and "mbm_local_bytes" for all Sub-NUMA nodes where the tasks may execute
> +to get the full view of traffic for which the tasks were the source.
> +
> +The cache allocation feature still provides the same number of
> +bits in a mask to control allocation into the L3 cache, but each
> +of those ways has its capacity reduced because the cache is divided
> +between the SNC nodes. The values reported in the resctrl
> +"size" files are adjusted accordingly.
> +
>   Memory bandwidth Allocation and monitoring
>   ==========================================
>   

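A minimal userspace sketch of the "read the counters for all Sub-NUMA nodes and add the values" guidance in the documentation hunk above. The resctrl mount point, the group name "g1", and the presence of exactly two SNC-node directories are assumptions used only for illustration:

#include <stdio.h>

int main(void)
{
	/* Hypothetical monitor group with two SNC-node directories */
	const char *files[] = {
		"/sys/fs/resctrl/g1/mon_data/mon_L3_00/llc_occupancy",
		"/sys/fs/resctrl/g1/mon_data/mon_L3_01/llc_occupancy",
	};
	unsigned long long total = 0, val;
	int i;

	for (i = 0; i < 2; i++) {
		FILE *f = fopen(files[i], "r");

		if (!f || fscanf(f, "%llu", &val) != 1) {
			if (f)
				fclose(f);
			fprintf(stderr, "failed to read %s\n", files[i]);
			return 1;
		}
		total += val;
		fclose(f);
	}

	/* The sum across SNC nodes is the group's full LLC occupancy */
	printf("total llc_occupancy: %llu\n", total);
	return 0;
}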
^ permalink raw reply	[flat|nested] 331+ messages in thread

* Re: [PATCH v13 0/8] Add support for Sub-NUMA cluster (SNC) systems
  2023-12-04 18:53                 ` [PATCH v13 0/8] Add support for Sub-NUMA cluster (SNC) systems Tony Luck
                                     ` (8 preceding siblings ...)
  2023-12-04 21:20                   ` [PATCH v13 0/8] Add support for Sub-NUMA cluster (SNC) systems Moger, Babu
@ 2023-12-04 23:25                   ` Tony Luck
  2024-01-26 22:38                   ` [PATCH v14 " Tony Luck
  10 siblings, 0 replies; 331+ messages in thread
From: Tony Luck @ 2023-12-04 23:25 UTC (permalink / raw)
  To: Borislav Petkov, Fenghua Yu, Reinette Chatre, Peter Newman,
	Jonathan Corbet, Shuah Khan, x86
  Cc: Shaopeng Tan, James Morse, Jamie Iles, Babu Moger, Randy Dunlap,
	linux-kernel, linux-doc, patches

On Mon, Dec 04, 2023 at 10:53:49AM -0800, Tony Luck wrote:

Boris: I've collected "Reviewed-by:" from Reinette for all patches. Babu
sent a Tested-by for the series, and Reviewed-by for each patch just
now.

So it's ready to go into your to-be-reviewed queue.

Thanks

-Tony

> The Sub-NUMA cluster feature on some Intel processors partitions the CPUs
> that share an L3 cache into two or more sets. This plays havoc with the
> Resource Director Technology (RDT) monitoring features.  Prior to this
> patch series, Intel advised that SNC and RDT were incompatible.
> 
> Some of these CPUs support an MSR that can partition the RMID counters in
> the same way. This allows monitoring features to be used, with the caveat
> that users must be aware that Linux may migrate tasks more frequently
> between SNC nodes than between "regular" NUMA nodes, so reading counters
> from all SNC nodes may be needed to get a complete picture of activity
> for tasks.
> 
> Cache and memory bandwidth allocation features continue to operate at
> the scope of the L3 cache.
> 
> Signed-off-by: Tony Luck <tony.luck@intel.com>
> 
> Changes since v12:
> 
> All:
> 	Reinette - put commit tags in right order for TIP (Tested-by before
> 	Reviewed-by)
> 
> Patch 7:
> 	Fam Zheng - Check for -1 return from get_cpu_cacheinfo_id() and
> 	increase size of bitmap tracking # of L3 instances.
> 	Reinette - Add extra sanity checks. Note that this patch has
> 	some additional tweaks beyond the e-mail discussion.
> 		1) "3" is a valid return in addition to 1, 2, 4
> 		2) Added a warning if the sanity checks fail that
> 		prints number of CPU nodes and number of L3 cache
> 		instances that were found.
> 
> Patch 8:
> 	Babu - Fix grammar with an additional comma.
> 
> 
> Tony Luck (8):
>   x86/resctrl: Prepare for new domain scope
>   x86/resctrl: Prepare to split rdt_domain structure
>   x86/resctrl: Prepare for different scope for control/monitor
>     operations
>   x86/resctrl: Split the rdt_domain and rdt_hw_domain structures
>   x86/resctrl: Add node-scope to the options for feature scope
>   x86/resctrl: Introduce snc_nodes_per_l3_cache
>   x86/resctrl: Sub NUMA Cluster detection and enable
>   x86/resctrl: Update documentation with Sub-NUMA cluster changes
> 
>  Documentation/arch/x86/resctrl.rst        |  25 +-
>  include/linux/resctrl.h                   |  85 +++--
>  arch/x86/include/asm/msr-index.h          |   1 +
>  arch/x86/kernel/cpu/resctrl/internal.h    |  66 ++--
>  arch/x86/kernel/cpu/resctrl/core.c        | 433 +++++++++++++++++-----
>  arch/x86/kernel/cpu/resctrl/ctrlmondata.c |  58 +--
>  arch/x86/kernel/cpu/resctrl/monitor.c     |  68 ++--
>  arch/x86/kernel/cpu/resctrl/pseudo_lock.c |  26 +-
>  arch/x86/kernel/cpu/resctrl/rdtgroup.c    | 149 ++++----
>  9 files changed, 629 insertions(+), 282 deletions(-)
> 
> 
> base-commit: 2cc14f52aeb78ce3f29677c2de1f06c0e91471ab
> -- 
> 2.41.0
> 

^ permalink raw reply	[flat|nested] 331+ messages in thread

* [PATCH v14 0/8] Add support for Sub-NUMA cluster (SNC) systems
  2023-12-04 18:53                 ` [PATCH v13 0/8] Add support for Sub-NUMA cluster (SNC) systems Tony Luck
                                     ` (9 preceding siblings ...)
  2023-12-04 23:25                   ` Tony Luck
@ 2024-01-26 22:38                   ` Tony Luck
  2024-01-26 22:38                     ` [PATCH v14 1/8] x86/resctrl: Prepare for new domain scope Tony Luck
                                       ` (9 more replies)
  10 siblings, 10 replies; 331+ messages in thread
From: Tony Luck @ 2024-01-26 22:38 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Fenghua Yu, Reinette Chatre, Peter Newman, Jonathan Corbet,
	Shuah Khan, Shaopeng Tan, James Morse, Jamie Iles, Babu Moger,
	Randy Dunlap, x86, linux-kernel, linux-doc, patches, Tony Luck

The Sub-NUMA cluster feature on some Intel processors partitions the CPUs
that share an L3 cache into two or more sets. This plays havoc with the
Resource Director Technology (RDT) monitoring features.  Prior to this
patch series, Intel advised that SNC and RDT were incompatible.

Some of these CPUs support an MSR that can partition the RMID counters in
the same way. This allows monitoring features to be used, with the caveat
that users must be aware that Linux may migrate tasks more frequently
between SNC nodes than between "regular" NUMA nodes, so reading counters
from all SNC nodes may be needed to get a complete picture of activity
for tasks.

Cache and memory bandwidth allocation features continue to operate at
the scope of the L3 cache.

Signed-off-by: Tony Luck <tony.luck@intel.com>

---
Changes since v13:

Rebase to tip x86/cache
	fc747eebef73 ("x86/resctrl: Remove redundant variable in mbm_config_write_domain()")

Applied all Reviewed and Tested tags

Tony Luck (8):
  x86/resctrl: Prepare for new domain scope
  x86/resctrl: Prepare to split rdt_domain structure
  x86/resctrl: Prepare for different scope for control/monitor
    operations
  x86/resctrl: Split the rdt_domain and rdt_hw_domain structures
  x86/resctrl: Add node-scope to the options for feature scope
  x86/resctrl: Introduce snc_nodes_per_l3_cache
  x86/resctrl: Sub NUMA Cluster detection and enable
  x86/resctrl: Update documentation with Sub-NUMA cluster changes

 Documentation/arch/x86/resctrl.rst        |  25 +-
 include/linux/resctrl.h                   |  85 +++--
 arch/x86/include/asm/msr-index.h          |   1 +
 arch/x86/kernel/cpu/resctrl/internal.h    |  66 ++--
 arch/x86/kernel/cpu/resctrl/core.c        | 433 +++++++++++++++++-----
 arch/x86/kernel/cpu/resctrl/ctrlmondata.c |  58 +--
 arch/x86/kernel/cpu/resctrl/monitor.c     |  68 ++--
 arch/x86/kernel/cpu/resctrl/pseudo_lock.c |  26 +-
 arch/x86/kernel/cpu/resctrl/rdtgroup.c    | 149 ++++----
 9 files changed, 629 insertions(+), 282 deletions(-)


base-commit: fc747eebef734563cf68a512f57937c8f231834a
-- 
2.43.0


^ permalink raw reply	[flat|nested] 331+ messages in thread

* [PATCH v14 1/8] x86/resctrl: Prepare for new domain scope
  2024-01-26 22:38                   ` [PATCH v14 " Tony Luck
@ 2024-01-26 22:38                     ` Tony Luck
  2024-01-26 22:38                     ` [PATCH v14 2/8] x86/resctrl: Prepare to split rdt_domain structure Tony Luck
                                       ` (8 subsequent siblings)
  9 siblings, 0 replies; 331+ messages in thread
From: Tony Luck @ 2024-01-26 22:38 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Fenghua Yu, Reinette Chatre, Peter Newman, Jonathan Corbet,
	Shuah Khan, Shaopeng Tan, James Morse, Jamie Iles, Babu Moger,
	Randy Dunlap, x86, linux-kernel, linux-doc, patches, Tony Luck,
	Shaopeng Tan

Resctrl resources operate on subsets of CPUs in the system with the
defining attribute of each subset being an instance of a particular
level of cache. E.g. all CPUs sharing an L3 cache would be part of the
same domain.

In preparation for features that are scoped at the NUMA node level,
change the code from explicit references to "cache_level" to a more
generic scope. At this point the only options for this scope are groups
of CPUs that share an L2 cache or an L3 cache.

Clean up the error handling when looking up domains. Report invalid ids
before calling rdt_find_domain() in preparation for better messages when
the scope can be something other than cache scope. This means that
rdt_find_domain() will never return an error, so remove the error checks
from its callsites.

Tested-by: Shaopeng Tan <tan.shaopeng@jp.fujitsu.com>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Reviewed-by: Shaopeng Tan <tan.shaopeng@jp.fujitsu.com>
Reviewed-by: Babu Moger <babu.moger@amd.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
---
 include/linux/resctrl.h                   |  9 ++++-
 arch/x86/kernel/cpu/resctrl/core.c        | 45 ++++++++++++++++-------
 arch/x86/kernel/cpu/resctrl/ctrlmondata.c |  2 +-
 arch/x86/kernel/cpu/resctrl/pseudo_lock.c |  6 ++-
 arch/x86/kernel/cpu/resctrl/rdtgroup.c    |  5 ++-
 5 files changed, 48 insertions(+), 19 deletions(-)

diff --git a/include/linux/resctrl.h b/include/linux/resctrl.h
index 66942d7fba7f..7d4eb7df611d 100644
--- a/include/linux/resctrl.h
+++ b/include/linux/resctrl.h
@@ -144,13 +144,18 @@ struct resctrl_membw {
 struct rdt_parse_data;
 struct resctrl_schema;
 
+enum resctrl_scope {
+	RESCTRL_L2_CACHE = 2,
+	RESCTRL_L3_CACHE = 3,
+};
+
 /**
  * struct rdt_resource - attributes of a resctrl resource
  * @rid:		The index of the resource
  * @alloc_capable:	Is allocation available on this machine
  * @mon_capable:	Is monitor feature available on this machine
  * @num_rmid:		Number of RMIDs available
- * @cache_level:	Which cache level defines scope of this resource
+ * @scope:		Scope of this resource
  * @cache:		Cache allocation related data
  * @membw:		If the component has bandwidth controls, their properties.
  * @domains:		All domains for this resource
@@ -168,7 +173,7 @@ struct rdt_resource {
 	bool			alloc_capable;
 	bool			mon_capable;
 	int			num_rmid;
-	int			cache_level;
+	enum resctrl_scope	scope;
 	struct resctrl_cache	cache;
 	struct resctrl_membw	membw;
 	struct list_head	domains;
diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
index aa9810a64258..268965c7c7ef 100644
--- a/arch/x86/kernel/cpu/resctrl/core.c
+++ b/arch/x86/kernel/cpu/resctrl/core.c
@@ -65,7 +65,7 @@ struct rdt_hw_resource rdt_resources_all[] = {
 		.r_resctrl = {
 			.rid			= RDT_RESOURCE_L3,
 			.name			= "L3",
-			.cache_level		= 3,
+			.scope			= RESCTRL_L3_CACHE,
 			.domains		= domain_init(RDT_RESOURCE_L3),
 			.parse_ctrlval		= parse_cbm,
 			.format_str		= "%d=%0*x",
@@ -79,7 +79,7 @@ struct rdt_hw_resource rdt_resources_all[] = {
 		.r_resctrl = {
 			.rid			= RDT_RESOURCE_L2,
 			.name			= "L2",
-			.cache_level		= 2,
+			.scope			= RESCTRL_L2_CACHE,
 			.domains		= domain_init(RDT_RESOURCE_L2),
 			.parse_ctrlval		= parse_cbm,
 			.format_str		= "%d=%0*x",
@@ -93,7 +93,7 @@ struct rdt_hw_resource rdt_resources_all[] = {
 		.r_resctrl = {
 			.rid			= RDT_RESOURCE_MBA,
 			.name			= "MB",
-			.cache_level		= 3,
+			.scope			= RESCTRL_L3_CACHE,
 			.domains		= domain_init(RDT_RESOURCE_MBA),
 			.parse_ctrlval		= parse_bw,
 			.format_str		= "%d=%*u",
@@ -105,7 +105,7 @@ struct rdt_hw_resource rdt_resources_all[] = {
 		.r_resctrl = {
 			.rid			= RDT_RESOURCE_SMBA,
 			.name			= "SMBA",
-			.cache_level		= 3,
+			.scope			= RESCTRL_L3_CACHE,
 			.domains		= domain_init(RDT_RESOURCE_SMBA),
 			.parse_ctrlval		= parse_bw,
 			.format_str		= "%d=%*u",
@@ -399,9 +399,6 @@ struct rdt_domain *rdt_find_domain(struct rdt_resource *r, int id,
 	struct rdt_domain *d;
 	struct list_head *l;
 
-	if (id < 0)
-		return ERR_PTR(-ENODEV);
-
 	list_for_each(l, &r->domains) {
 		d = list_entry(l, struct rdt_domain, list);
 		/* When id is found, return its domain. */
@@ -489,6 +486,19 @@ static int arch_domain_mbm_alloc(u32 num_rmid, struct rdt_hw_domain *hw_dom)
 	return 0;
 }
 
+static int get_domain_id_from_scope(int cpu, enum resctrl_scope scope)
+{
+	switch (scope) {
+	case RESCTRL_L2_CACHE:
+	case RESCTRL_L3_CACHE:
+		return get_cpu_cacheinfo_id(cpu, scope);
+	default:
+		break;
+	}
+
+	return -EINVAL;
+}
+
 /*
  * domain_add_cpu - Add a cpu to a resource's domain list.
  *
@@ -504,18 +514,19 @@ static int arch_domain_mbm_alloc(u32 num_rmid, struct rdt_hw_domain *hw_dom)
  */
 static void domain_add_cpu(int cpu, struct rdt_resource *r)
 {
-	int id = get_cpu_cacheinfo_id(cpu, r->cache_level);
+	int id = get_domain_id_from_scope(cpu, r->scope);
 	struct list_head *add_pos = NULL;
 	struct rdt_hw_domain *hw_dom;
 	struct rdt_domain *d;
 	int err;
 
-	d = rdt_find_domain(r, id, &add_pos);
-	if (IS_ERR(d)) {
-		pr_warn("Couldn't find cache id for CPU %d\n", cpu);
+	if (id < 0) {
+		pr_warn_once("Can't find domain id for CPU:%d scope:%d for resource %s\n",
+			     cpu, r->scope, r->name);
 		return;
 	}
 
+	d = rdt_find_domain(r, id, &add_pos);
 	if (d) {
 		cpumask_set_cpu(cpu, &d->cpu_mask);
 		if (r->cache.arch_has_per_cpu_cfg)
@@ -554,13 +565,19 @@ static void domain_add_cpu(int cpu, struct rdt_resource *r)
 
 static void domain_remove_cpu(int cpu, struct rdt_resource *r)
 {
-	int id = get_cpu_cacheinfo_id(cpu, r->cache_level);
+	int id = get_domain_id_from_scope(cpu, r->scope);
 	struct rdt_hw_domain *hw_dom;
 	struct rdt_domain *d;
 
+	if (id < 0) {
+		pr_warn_once("Can't find domain id for CPU:%d scope:%d for resource %s\n",
+			     cpu, r->scope, r->name);
+		return;
+	}
+
 	d = rdt_find_domain(r, id, NULL);
-	if (IS_ERR_OR_NULL(d)) {
-		pr_warn("Couldn't find cache id for CPU %d\n", cpu);
+	if (!d) {
+		pr_warn("Couldn't find domain with id=%d for CPU %d\n", id, cpu);
 		return;
 	}
 	hw_dom = resctrl_to_arch_dom(d);
diff --git a/arch/x86/kernel/cpu/resctrl/ctrlmondata.c b/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
index beccb0e87ba7..3f8891d57fac 100644
--- a/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
+++ b/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
@@ -563,7 +563,7 @@ int rdtgroup_mondata_show(struct seq_file *m, void *arg)
 
 	r = &rdt_resources_all[resid].r_resctrl;
 	d = rdt_find_domain(r, domid, NULL);
-	if (IS_ERR_OR_NULL(d)) {
+	if (!d) {
 		ret = -ENOENT;
 		goto out;
 	}
diff --git a/arch/x86/kernel/cpu/resctrl/pseudo_lock.c b/arch/x86/kernel/cpu/resctrl/pseudo_lock.c
index 8f559eeae08e..2a682da9f43a 100644
--- a/arch/x86/kernel/cpu/resctrl/pseudo_lock.c
+++ b/arch/x86/kernel/cpu/resctrl/pseudo_lock.c
@@ -292,10 +292,14 @@ static void pseudo_lock_region_clear(struct pseudo_lock_region *plr)
  */
 static int pseudo_lock_region_init(struct pseudo_lock_region *plr)
 {
+	enum resctrl_scope scope = plr->s->res->scope;
 	struct cpu_cacheinfo *ci;
 	int ret;
 	int i;
 
+	if (WARN_ON_ONCE(scope != RESCTRL_L2_CACHE && scope != RESCTRL_L3_CACHE))
+		return -ENODEV;
+
 	/* Pick the first cpu we find that is associated with the cache. */
 	plr->cpu = cpumask_first(&plr->d->cpu_mask);
 
@@ -311,7 +315,7 @@ static int pseudo_lock_region_init(struct pseudo_lock_region *plr)
 	plr->size = rdtgroup_cbm_to_size(plr->s->res, plr->d, plr->cbm);
 
 	for (i = 0; i < ci->num_leaves; i++) {
-		if (ci->info_list[i].level == plr->s->res->cache_level) {
+		if (ci->info_list[i].level == scope) {
 			plr->line_size = ci->info_list[i].coherency_line_size;
 			return 0;
 		}
diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
index aa24343f1d23..1a9b2cd99b5f 100644
--- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
+++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
@@ -1413,10 +1413,13 @@ unsigned int rdtgroup_cbm_to_size(struct rdt_resource *r,
 	unsigned int size = 0;
 	int num_b, i;
 
+	if (WARN_ON_ONCE(r->scope != RESCTRL_L2_CACHE && r->scope != RESCTRL_L3_CACHE))
+		return size;
+
 	num_b = bitmap_weight(&cbm, r->cache.cbm_len);
 	ci = get_cpu_cacheinfo(cpumask_any(&d->cpu_mask));
 	for (i = 0; i < ci->num_leaves; i++) {
-		if (ci->info_list[i].level == r->cache_level) {
+		if (ci->info_list[i].level == r->scope) {
 			size = ci->info_list[i].size / r->cache.cbm_len * num_b;
 			break;
 		}
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 331+ messages in thread

* [PATCH v14 2/8] x86/resctrl: Prepare to split rdt_domain structure
  2024-01-26 22:38                   ` [PATCH v14 " Tony Luck
  2024-01-26 22:38                     ` [PATCH v14 1/8] x86/resctrl: Prepare for new domain scope Tony Luck
@ 2024-01-26 22:38                     ` Tony Luck
  2024-01-26 22:38                     ` [PATCH v14 3/8] x86/resctrl: Prepare for different scope for control/monitor operations Tony Luck
                                       ` (7 subsequent siblings)
  9 siblings, 0 replies; 331+ messages in thread
From: Tony Luck @ 2024-01-26 22:38 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Fenghua Yu, Reinette Chatre, Peter Newman, Jonathan Corbet,
	Shuah Khan, Shaopeng Tan, James Morse, Jamie Iles, Babu Moger,
	Randy Dunlap, x86, linux-kernel, linux-doc, patches, Tony Luck,
	Shaopeng Tan

The rdt_domain structure is used for both control and monitor features.
It is about to be split into separate structures for these two usages
because control and monitoring features will have different scopes for
future resources.

To allow for common code that scans a list of domains looking for a
specific domain id, move all the common fields ("list", "id", "cpu_mask")
into their own structure within the rdt_domain structure.

Tested-by: Shaopeng Tan <tan.shaopeng@jp.fujitsu.com>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Reviewed-by: Shaopeng Tan <tan.shaopeng@jp.fujitsu.com>
Reviewed-by: Babu Moger <babu.moger@amd.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
---
 include/linux/resctrl.h                   | 16 ++++--
 arch/x86/kernel/cpu/resctrl/core.c        | 26 +++++-----
 arch/x86/kernel/cpu/resctrl/ctrlmondata.c | 22 ++++-----
 arch/x86/kernel/cpu/resctrl/monitor.c     | 10 ++--
 arch/x86/kernel/cpu/resctrl/pseudo_lock.c | 14 +++---
 arch/x86/kernel/cpu/resctrl/rdtgroup.c    | 60 +++++++++++------------
 6 files changed, 78 insertions(+), 70 deletions(-)

diff --git a/include/linux/resctrl.h b/include/linux/resctrl.h
index 7d4eb7df611d..c4067150a6b7 100644
--- a/include/linux/resctrl.h
+++ b/include/linux/resctrl.h
@@ -53,10 +53,20 @@ struct resctrl_staged_config {
 };
 
 /**
- * struct rdt_domain - group of CPUs sharing a resctrl resource
+ * struct rdt_domain_hdr - common header for different domain types
  * @list:		all instances of this resource
  * @id:			unique id for this instance
  * @cpu_mask:		which CPUs share this resource
+ */
+struct rdt_domain_hdr {
+	struct list_head		list;
+	int				id;
+	struct cpumask			cpu_mask;
+};
+
+/**
+ * struct rdt_domain - group of CPUs sharing a resctrl resource
+ * @hdr:		common header for different domain types
  * @rmid_busy_llc:	bitmap of which limbo RMIDs are above threshold
  * @mbm_total:		saved state for MBM total bandwidth
  * @mbm_local:		saved state for MBM local bandwidth
@@ -71,9 +81,7 @@ struct resctrl_staged_config {
  *			by closid
  */
 struct rdt_domain {
-	struct list_head		list;
-	int				id;
-	struct cpumask			cpu_mask;
+	struct rdt_domain_hdr		hdr;
 	unsigned long			*rmid_busy_llc;
 	struct mbm_state		*mbm_total;
 	struct mbm_state		*mbm_local;
diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
index 268965c7c7ef..625f75899d7b 100644
--- a/arch/x86/kernel/cpu/resctrl/core.c
+++ b/arch/x86/kernel/cpu/resctrl/core.c
@@ -354,9 +354,9 @@ struct rdt_domain *get_domain_from_cpu(int cpu, struct rdt_resource *r)
 {
 	struct rdt_domain *d;
 
-	list_for_each_entry(d, &r->domains, list) {
+	list_for_each_entry(d, &r->domains, hdr.list) {
 		/* Find the domain that contains this CPU */
-		if (cpumask_test_cpu(cpu, &d->cpu_mask))
+		if (cpumask_test_cpu(cpu, &d->hdr.cpu_mask))
 			return d;
 	}
 
@@ -400,12 +400,12 @@ struct rdt_domain *rdt_find_domain(struct rdt_resource *r, int id,
 	struct list_head *l;
 
 	list_for_each(l, &r->domains) {
-		d = list_entry(l, struct rdt_domain, list);
+		d = list_entry(l, struct rdt_domain, hdr.list);
 		/* When id is found, return its domain. */
-		if (id == d->id)
+		if (id == d->hdr.id)
 			return d;
 		/* Stop searching when finding id's position in sorted list. */
-		if (id < d->id)
+		if (id < d->hdr.id)
 			break;
 	}
 
@@ -528,7 +528,7 @@ static void domain_add_cpu(int cpu, struct rdt_resource *r)
 
 	d = rdt_find_domain(r, id, &add_pos);
 	if (d) {
-		cpumask_set_cpu(cpu, &d->cpu_mask);
+		cpumask_set_cpu(cpu, &d->hdr.cpu_mask);
 		if (r->cache.arch_has_per_cpu_cfg)
 			rdt_domain_reconfigure_cdp(r);
 		return;
@@ -539,8 +539,8 @@ static void domain_add_cpu(int cpu, struct rdt_resource *r)
 		return;
 
 	d = &hw_dom->d_resctrl;
-	d->id = id;
-	cpumask_set_cpu(cpu, &d->cpu_mask);
+	d->hdr.id = id;
+	cpumask_set_cpu(cpu, &d->hdr.cpu_mask);
 
 	rdt_domain_reconfigure_cdp(r);
 
@@ -554,11 +554,11 @@ static void domain_add_cpu(int cpu, struct rdt_resource *r)
 		return;
 	}
 
-	list_add_tail(&d->list, add_pos);
+	list_add_tail(&d->hdr.list, add_pos);
 
 	err = resctrl_online_domain(r, d);
 	if (err) {
-		list_del(&d->list);
+		list_del(&d->hdr.list);
 		domain_free(hw_dom);
 	}
 }
@@ -582,10 +582,10 @@ static void domain_remove_cpu(int cpu, struct rdt_resource *r)
 	}
 	hw_dom = resctrl_to_arch_dom(d);
 
-	cpumask_clear_cpu(cpu, &d->cpu_mask);
-	if (cpumask_empty(&d->cpu_mask)) {
+	cpumask_clear_cpu(cpu, &d->hdr.cpu_mask);
+	if (cpumask_empty(&d->hdr.cpu_mask)) {
 		resctrl_offline_domain(r, d);
-		list_del(&d->list);
+		list_del(&d->hdr.list);
 
 		/*
 		 * rdt_domain "d" is going to be freed below, so clear
diff --git a/arch/x86/kernel/cpu/resctrl/ctrlmondata.c b/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
index 3f8891d57fac..23f8258d36a8 100644
--- a/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
+++ b/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
@@ -67,7 +67,7 @@ int parse_bw(struct rdt_parse_data *data, struct resctrl_schema *s,
 
 	cfg = &d->staged_config[s->conf_type];
 	if (cfg->have_new_ctrl) {
-		rdt_last_cmd_printf("Duplicate domain %d\n", d->id);
+		rdt_last_cmd_printf("Duplicate domain %d\n", d->hdr.id);
 		return -EINVAL;
 	}
 
@@ -146,7 +146,7 @@ int parse_cbm(struct rdt_parse_data *data, struct resctrl_schema *s,
 
 	cfg = &d->staged_config[s->conf_type];
 	if (cfg->have_new_ctrl) {
-		rdt_last_cmd_printf("Duplicate domain %d\n", d->id);
+		rdt_last_cmd_printf("Duplicate domain %d\n", d->hdr.id);
 		return -EINVAL;
 	}
 
@@ -226,8 +226,8 @@ static int parse_line(char *line, struct resctrl_schema *s,
 		return -EINVAL;
 	}
 	dom = strim(dom);
-	list_for_each_entry(d, &r->domains, list) {
-		if (d->id == dom_id) {
+	list_for_each_entry(d, &r->domains, hdr.list) {
+		if (d->hdr.id == dom_id) {
 			data.buf = dom;
 			data.rdtgrp = rdtgrp;
 			if (r->parse_ctrlval(&data, s, d))
@@ -274,7 +274,7 @@ static bool apply_config(struct rdt_hw_domain *hw_dom,
 	struct rdt_domain *dom = &hw_dom->d_resctrl;
 
 	if (cfg->new_ctrl != hw_dom->ctrl_val[idx]) {
-		cpumask_set_cpu(cpumask_any(&dom->cpu_mask), cpu_mask);
+		cpumask_set_cpu(cpumask_any(&dom->hdr.cpu_mask), cpu_mask);
 		hw_dom->ctrl_val[idx] = cfg->new_ctrl;
 
 		return true;
@@ -291,7 +291,7 @@ int resctrl_arch_update_one(struct rdt_resource *r, struct rdt_domain *d,
 	u32 idx = get_config_index(closid, t);
 	struct msr_param msr_param;
 
-	if (!cpumask_test_cpu(smp_processor_id(), &d->cpu_mask))
+	if (!cpumask_test_cpu(smp_processor_id(), &d->hdr.cpu_mask))
 		return -EINVAL;
 
 	hw_dom->ctrl_val[idx] = cfg_val;
@@ -318,7 +318,7 @@ int resctrl_arch_update_domains(struct rdt_resource *r, u32 closid)
 		return -ENOMEM;
 
 	msr_param.res = NULL;
-	list_for_each_entry(d, &r->domains, list) {
+	list_for_each_entry(d, &r->domains, hdr.list) {
 		hw_dom = resctrl_to_arch_dom(d);
 		for (t = 0; t < CDP_NUM_TYPES; t++) {
 			cfg = &hw_dom->d_resctrl.staged_config[t];
@@ -466,7 +466,7 @@ static void show_doms(struct seq_file *s, struct resctrl_schema *schema, int clo
 	u32 ctrl_val;
 
 	seq_printf(s, "%*s:", max_name_width, schema->name);
-	list_for_each_entry(dom, &r->domains, list) {
+	list_for_each_entry(dom, &r->domains, hdr.list) {
 		if (sep)
 			seq_puts(s, ";");
 
@@ -476,7 +476,7 @@ static void show_doms(struct seq_file *s, struct resctrl_schema *schema, int clo
 			ctrl_val = resctrl_arch_get_config(r, dom, closid,
 							   schema->conf_type);
 
-		seq_printf(s, r->format_str, dom->id, max_data_width,
+		seq_printf(s, r->format_str, dom->hdr.id, max_data_width,
 			   ctrl_val);
 		sep = true;
 	}
@@ -505,7 +505,7 @@ int rdtgroup_schemata_show(struct kernfs_open_file *of,
 			} else {
 				seq_printf(s, "%s:%d=%x\n",
 					   rdtgrp->plr->s->res->name,
-					   rdtgrp->plr->d->id,
+					   rdtgrp->plr->d->hdr.id,
 					   rdtgrp->plr->cbm);
 			}
 		} else {
@@ -536,7 +536,7 @@ void mon_event_read(struct rmid_read *rr, struct rdt_resource *r,
 	rr->val = 0;
 	rr->first = first;
 
-	smp_call_function_any(&d->cpu_mask, mon_event_count, rr, 1);
+	smp_call_function_any(&d->hdr.cpu_mask, mon_event_count, rr, 1);
 }
 
 int rdtgroup_mondata_show(struct seq_file *m, void *arg)
diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c
index 3a6c069614eb..310ab05f0cb7 100644
--- a/arch/x86/kernel/cpu/resctrl/monitor.c
+++ b/arch/x86/kernel/cpu/resctrl/monitor.c
@@ -238,7 +238,7 @@ int resctrl_arch_rmid_read(struct rdt_resource *r, struct rdt_domain *d,
 	u64 msr_val, chunks;
 	int ret;
 
-	if (!cpumask_test_cpu(smp_processor_id(), &d->cpu_mask))
+	if (!cpumask_test_cpu(smp_processor_id(), &d->hdr.cpu_mask))
 		return -EINVAL;
 
 	ret = __rmid_read(rmid, eventid, &msr_val);
@@ -340,8 +340,8 @@ static void add_rmid_to_limbo(struct rmid_entry *entry)
 
 	entry->busy = 0;
 	cpu = get_cpu();
-	list_for_each_entry(d, &r->domains, list) {
-		if (cpumask_test_cpu(cpu, &d->cpu_mask)) {
+	list_for_each_entry(d, &r->domains, hdr.list) {
+		if (cpumask_test_cpu(cpu, &d->hdr.cpu_mask)) {
 			err = resctrl_arch_rmid_read(r, d, entry->rmid,
 						     QOS_L3_OCCUP_EVENT_ID,
 						     &val);
@@ -639,7 +639,7 @@ void cqm_setup_limbo_handler(struct rdt_domain *dom, unsigned long delay_ms)
 	unsigned long delay = msecs_to_jiffies(delay_ms);
 	int cpu;
 
-	cpu = cpumask_any(&dom->cpu_mask);
+	cpu = cpumask_any(&dom->hdr.cpu_mask);
 	dom->cqm_work_cpu = cpu;
 
 	schedule_delayed_work_on(cpu, &dom->cqm_limbo, delay);
@@ -686,7 +686,7 @@ void mbm_setup_overflow_handler(struct rdt_domain *dom, unsigned long delay_ms)
 
 	if (!static_branch_likely(&rdt_mon_enable_key))
 		return;
-	cpu = cpumask_any(&dom->cpu_mask);
+	cpu = cpumask_any(&dom->hdr.cpu_mask);
 	dom->mbm_work_cpu = cpu;
 	schedule_delayed_work_on(cpu, &dom->mbm_over, delay);
 }
diff --git a/arch/x86/kernel/cpu/resctrl/pseudo_lock.c b/arch/x86/kernel/cpu/resctrl/pseudo_lock.c
index 2a682da9f43a..fcbd99e2eb66 100644
--- a/arch/x86/kernel/cpu/resctrl/pseudo_lock.c
+++ b/arch/x86/kernel/cpu/resctrl/pseudo_lock.c
@@ -221,7 +221,7 @@ static int pseudo_lock_cstates_constrain(struct pseudo_lock_region *plr)
 	int cpu;
 	int ret;
 
-	for_each_cpu(cpu, &plr->d->cpu_mask) {
+	for_each_cpu(cpu, &plr->d->hdr.cpu_mask) {
 		pm_req = kzalloc(sizeof(*pm_req), GFP_KERNEL);
 		if (!pm_req) {
 			rdt_last_cmd_puts("Failure to allocate memory for PM QoS\n");
@@ -301,7 +301,7 @@ static int pseudo_lock_region_init(struct pseudo_lock_region *plr)
 		return -ENODEV;
 
 	/* Pick the first cpu we find that is associated with the cache. */
-	plr->cpu = cpumask_first(&plr->d->cpu_mask);
+	plr->cpu = cpumask_first(&plr->d->hdr.cpu_mask);
 
 	if (!cpu_online(plr->cpu)) {
 		rdt_last_cmd_printf("CPU %u associated with cache not online\n",
@@ -856,10 +856,10 @@ bool rdtgroup_pseudo_locked_in_hierarchy(struct rdt_domain *d)
 	 * associated with them.
 	 */
 	for_each_alloc_capable_rdt_resource(r) {
-		list_for_each_entry(d_i, &r->domains, list) {
+		list_for_each_entry(d_i, &r->domains, hdr.list) {
 			if (d_i->plr)
 				cpumask_or(cpu_with_psl, cpu_with_psl,
-					   &d_i->cpu_mask);
+					   &d_i->hdr.cpu_mask);
 		}
 	}
 
@@ -867,7 +867,7 @@ bool rdtgroup_pseudo_locked_in_hierarchy(struct rdt_domain *d)
 	 * Next test if new pseudo-locked region would intersect with
 	 * existing region.
 	 */
-	if (cpumask_intersects(&d->cpu_mask, cpu_with_psl))
+	if (cpumask_intersects(&d->hdr.cpu_mask, cpu_with_psl))
 		ret = true;
 
 	free_cpumask_var(cpu_with_psl);
@@ -1199,7 +1199,7 @@ static int pseudo_lock_measure_cycles(struct rdtgroup *rdtgrp, int sel)
 	}
 
 	plr->thread_done = 0;
-	cpu = cpumask_first(&plr->d->cpu_mask);
+	cpu = cpumask_first(&plr->d->hdr.cpu_mask);
 	if (!cpu_online(cpu)) {
 		ret = -ENODEV;
 		goto out;
@@ -1529,7 +1529,7 @@ static int pseudo_lock_dev_mmap(struct file *filp, struct vm_area_struct *vma)
 	 * may be scheduled elsewhere and invalidate entries in the
 	 * pseudo-locked region.
 	 */
-	if (!cpumask_subset(current->cpus_ptr, &plr->d->cpu_mask)) {
+	if (!cpumask_subset(current->cpus_ptr, &plr->d->hdr.cpu_mask)) {
 		mutex_unlock(&rdtgroup_mutex);
 		return -EINVAL;
 	}
diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
index 1a9b2cd99b5f..a459c16735f7 100644
--- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
+++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
@@ -91,7 +91,7 @@ void rdt_staged_configs_clear(void)
 	lockdep_assert_held(&rdtgroup_mutex);
 
 	for_each_alloc_capable_rdt_resource(r) {
-		list_for_each_entry(dom, &r->domains, list)
+		list_for_each_entry(dom, &r->domains, hdr.list)
 			memset(dom->staged_config, 0, sizeof(dom->staged_config));
 	}
 }
@@ -295,7 +295,7 @@ static int rdtgroup_cpus_show(struct kernfs_open_file *of,
 				rdt_last_cmd_puts("Cache domain offline\n");
 				ret = -ENODEV;
 			} else {
-				mask = &rdtgrp->plr->d->cpu_mask;
+				mask = &rdtgrp->plr->d->hdr.cpu_mask;
 				seq_printf(s, is_cpu_list(of) ?
 					   "%*pbl\n" : "%*pb\n",
 					   cpumask_pr_args(mask));
@@ -984,12 +984,12 @@ static int rdt_bit_usage_show(struct kernfs_open_file *of,
 
 	mutex_lock(&rdtgroup_mutex);
 	hw_shareable = r->cache.shareable_bits;
-	list_for_each_entry(dom, &r->domains, list) {
+	list_for_each_entry(dom, &r->domains, hdr.list) {
 		if (sep)
 			seq_putc(seq, ';');
 		sw_shareable = 0;
 		exclusive = 0;
-		seq_printf(seq, "%d=", dom->id);
+		seq_printf(seq, "%d=", dom->hdr.id);
 		for (i = 0; i < closids_supported(); i++) {
 			if (!closid_allocated(i))
 				continue;
@@ -1302,7 +1302,7 @@ static bool rdtgroup_mode_test_exclusive(struct rdtgroup *rdtgrp)
 		if (r->rid == RDT_RESOURCE_MBA || r->rid == RDT_RESOURCE_SMBA)
 			continue;
 		has_cache = true;
-		list_for_each_entry(d, &r->domains, list) {
+		list_for_each_entry(d, &r->domains, hdr.list) {
 			ctrl = resctrl_arch_get_config(r, d, closid,
 						       s->conf_type);
 			if (rdtgroup_cbm_overlaps(s, d, ctrl, closid, false)) {
@@ -1417,7 +1417,7 @@ unsigned int rdtgroup_cbm_to_size(struct rdt_resource *r,
 		return size;
 
 	num_b = bitmap_weight(&cbm, r->cache.cbm_len);
-	ci = get_cpu_cacheinfo(cpumask_any(&d->cpu_mask));
+	ci = get_cpu_cacheinfo(cpumask_any(&d->hdr.cpu_mask));
 	for (i = 0; i < ci->num_leaves; i++) {
 		if (ci->info_list[i].level == r->scope) {
 			size = ci->info_list[i].size / r->cache.cbm_len * num_b;
@@ -1465,7 +1465,7 @@ static int rdtgroup_size_show(struct kernfs_open_file *of,
 			size = rdtgroup_cbm_to_size(rdtgrp->plr->s->res,
 						    rdtgrp->plr->d,
 						    rdtgrp->plr->cbm);
-			seq_printf(s, "%d=%u\n", rdtgrp->plr->d->id, size);
+			seq_printf(s, "%d=%u\n", rdtgrp->plr->d->hdr.id, size);
 		}
 		goto out;
 	}
@@ -1477,7 +1477,7 @@ static int rdtgroup_size_show(struct kernfs_open_file *of,
 		type = schema->conf_type;
 		sep = false;
 		seq_printf(s, "%*s:", max_name_width, schema->name);
-		list_for_each_entry(d, &r->domains, list) {
+		list_for_each_entry(d, &r->domains, hdr.list) {
 			if (sep)
 				seq_putc(s, ';');
 			if (rdtgrp->mode == RDT_MODE_PSEUDO_LOCKSETUP) {
@@ -1495,7 +1495,7 @@ static int rdtgroup_size_show(struct kernfs_open_file *of,
 				else
 					size = rdtgroup_cbm_to_size(r, d, ctrl);
 			}
-			seq_printf(s, "%d=%u", d->id, size);
+			seq_printf(s, "%d=%u", d->hdr.id, size);
 			sep = true;
 		}
 		seq_putc(s, '\n');
@@ -1555,7 +1555,7 @@ static void mon_event_config_read(void *info)
 
 static void mondata_config_read(struct rdt_domain *d, struct mon_config_info *mon_info)
 {
-	smp_call_function_any(&d->cpu_mask, mon_event_config_read, mon_info, 1);
+	smp_call_function_any(&d->hdr.cpu_mask, mon_event_config_read, mon_info, 1);
 }
 
 static int mbm_config_show(struct seq_file *s, struct rdt_resource *r, u32 evtid)
@@ -1566,7 +1566,7 @@ static int mbm_config_show(struct seq_file *s, struct rdt_resource *r, u32 evtid
 
 	mutex_lock(&rdtgroup_mutex);
 
-	list_for_each_entry(dom, &r->domains, list) {
+	list_for_each_entry(dom, &r->domains, hdr.list) {
 		if (sep)
 			seq_puts(s, ";");
 
@@ -1574,7 +1574,7 @@ static int mbm_config_show(struct seq_file *s, struct rdt_resource *r, u32 evtid
 		mon_info.evtid = evtid;
 		mondata_config_read(dom, &mon_info);
 
-		seq_printf(s, "%d=0x%02x", dom->id, mon_info.mon_config);
+		seq_printf(s, "%d=0x%02x", dom->hdr.id, mon_info.mon_config);
 		sep = true;
 	}
 	seq_puts(s, "\n");
@@ -1639,7 +1639,7 @@ static void mbm_config_write_domain(struct rdt_resource *r,
 	 * are scoped at the domain level. Writing any of these MSRs
 	 * on one CPU is observed by all the CPUs in the domain.
 	 */
-	smp_call_function_any(&d->cpu_mask, mon_event_config_write,
+	smp_call_function_any(&d->hdr.cpu_mask, mon_event_config_write,
 			      &mon_info, 1);
 
 	/*
@@ -1686,8 +1686,8 @@ static int mon_config_write(struct rdt_resource *r, char *tok, u32 evtid)
 		return -EINVAL;
 	}
 
-	list_for_each_entry(d, &r->domains, list) {
-		if (d->id == dom_id) {
+	list_for_each_entry(d, &r->domains, hdr.list) {
+		if (d->hdr.id == dom_id) {
 			mbm_config_write_domain(r, d, evtid, val);
 			goto next;
 		}
@@ -2227,14 +2227,14 @@ static int set_cache_qos_cfg(int level, bool enable)
 		return -ENOMEM;
 
 	r_l = &rdt_resources_all[level].r_resctrl;
-	list_for_each_entry(d, &r_l->domains, list) {
+	list_for_each_entry(d, &r_l->domains, hdr.list) {
 		if (r_l->cache.arch_has_per_cpu_cfg)
 			/* Pick all the CPUs in the domain instance */
-			for_each_cpu(cpu, &d->cpu_mask)
+			for_each_cpu(cpu, &d->hdr.cpu_mask)
 				cpumask_set_cpu(cpu, cpu_mask);
 		else
 			/* Pick one CPU from each domain instance to update MSR */
-			cpumask_set_cpu(cpumask_any(&d->cpu_mask), cpu_mask);
+			cpumask_set_cpu(cpumask_any(&d->hdr.cpu_mask), cpu_mask);
 	}
 
 	/* Update QOS_CFG MSR on all the CPUs in cpu_mask */
@@ -2263,7 +2263,7 @@ void rdt_domain_reconfigure_cdp(struct rdt_resource *r)
 static int mba_sc_domain_allocate(struct rdt_resource *r, struct rdt_domain *d)
 {
 	u32 num_closid = resctrl_arch_get_num_closid(r);
-	int cpu = cpumask_any(&d->cpu_mask);
+	int cpu = cpumask_any(&d->hdr.cpu_mask);
 	int i;
 
 	d->mbps_val = kcalloc_node(num_closid, sizeof(*d->mbps_val),
@@ -2312,7 +2312,7 @@ static int set_mba_sc(bool mba_sc)
 
 	r->membw.mba_sc = mba_sc;
 
-	list_for_each_entry(d, &r->domains, list) {
+	list_for_each_entry(d, &r->domains, hdr.list) {
 		for (i = 0; i < num_closid; i++)
 			d->mbps_val[i] = MBA_MAX_MBPS;
 	}
@@ -2648,7 +2648,7 @@ static int rdt_get_tree(struct fs_context *fc)
 
 	if (is_mbm_enabled()) {
 		r = &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl;
-		list_for_each_entry(dom, &r->domains, list)
+		list_for_each_entry(dom, &r->domains, hdr.list)
 			mbm_setup_overflow_handler(dom, MBM_OVERFLOW_INTERVAL);
 	}
 
@@ -2775,9 +2775,9 @@ static int reset_all_ctrls(struct rdt_resource *r)
 	 * CBMs in all domains to the maximum mask value. Pick one CPU
 	 * from each domain to update the MSRs below.
 	 */
-	list_for_each_entry(d, &r->domains, list) {
+	list_for_each_entry(d, &r->domains, hdr.list) {
 		hw_dom = resctrl_to_arch_dom(d);
-		cpumask_set_cpu(cpumask_any(&d->cpu_mask), cpu_mask);
+		cpumask_set_cpu(cpumask_any(&d->hdr.cpu_mask), cpu_mask);
 
 		for (i = 0; i < hw_res->num_closid; i++)
 			hw_dom->ctrl_val[i] = r->default_ctrl;
@@ -2981,7 +2981,7 @@ static int mkdir_mondata_subdir(struct kernfs_node *parent_kn,
 	char name[32];
 	int ret;
 
-	sprintf(name, "mon_%s_%02d", r->name, d->id);
+	sprintf(name, "mon_%s_%02d", r->name, d->hdr.id);
 	/* create the directory */
 	kn = kernfs_create_dir(parent_kn, name, parent_kn->mode, prgrp);
 	if (IS_ERR(kn))
@@ -2997,7 +2997,7 @@ static int mkdir_mondata_subdir(struct kernfs_node *parent_kn,
 	}
 
 	priv.u.rid = r->rid;
-	priv.u.domid = d->id;
+	priv.u.domid = d->hdr.id;
 	list_for_each_entry(mevt, &r->evt_list, list) {
 		priv.u.evtid = mevt->evtid;
 		ret = mon_addfile(kn, mevt->name, priv.priv);
@@ -3045,7 +3045,7 @@ static int mkdir_mondata_subdir_alldom(struct kernfs_node *parent_kn,
 	struct rdt_domain *dom;
 	int ret;
 
-	list_for_each_entry(dom, &r->domains, list) {
+	list_for_each_entry(dom, &r->domains, hdr.list) {
 		ret = mkdir_mondata_subdir(parent_kn, dom, r, prgrp);
 		if (ret)
 			return ret;
@@ -3204,7 +3204,7 @@ static int __init_one_rdt_domain(struct rdt_domain *d, struct resctrl_schema *s,
 	 */
 	tmp_cbm = cfg->new_ctrl;
 	if (bitmap_weight(&tmp_cbm, r->cache.cbm_len) < r->cache.min_cbm_bits) {
-		rdt_last_cmd_printf("No space on %s:%d\n", s->name, d->id);
+		rdt_last_cmd_printf("No space on %s:%d\n", s->name, d->hdr.id);
 		return -ENOSPC;
 	}
 	cfg->have_new_ctrl = true;
@@ -3227,7 +3227,7 @@ static int rdtgroup_init_cat(struct resctrl_schema *s, u32 closid)
 	struct rdt_domain *d;
 	int ret;
 
-	list_for_each_entry(d, &s->res->domains, list) {
+	list_for_each_entry(d, &s->res->domains, hdr.list) {
 		ret = __init_one_rdt_domain(d, s, closid);
 		if (ret < 0)
 			return ret;
@@ -3242,7 +3242,7 @@ static void rdtgroup_init_mba(struct rdt_resource *r, u32 closid)
 	struct resctrl_staged_config *cfg;
 	struct rdt_domain *d;
 
-	list_for_each_entry(d, &r->domains, list) {
+	list_for_each_entry(d, &r->domains, hdr.list) {
 		if (is_mba_sc(r)) {
 			d->mbps_val[closid] = MBA_MAX_MBPS;
 			continue;
@@ -3859,7 +3859,7 @@ void resctrl_offline_domain(struct rdt_resource *r, struct rdt_domain *d)
 	 * per domain monitor data directories.
 	 */
 	if (static_branch_unlikely(&rdt_mon_enable_key))
-		rmdir_mondata_subdir_allrdtgrp(r, d->id);
+		rmdir_mondata_subdir_allrdtgrp(r, d->hdr.id);
 
 	if (is_mbm_enabled())
 		cancel_delayed_work(&d->mbm_over);
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 331+ messages in thread

* [PATCH v14 3/8] x86/resctrl: Prepare for different scope for control/monitor operations
  2024-01-26 22:38                   ` [PATCH v14 " Tony Luck
  2024-01-26 22:38                     ` [PATCH v14 1/8] x86/resctrl: Prepare for new domain scope Tony Luck
  2024-01-26 22:38                     ` [PATCH v14 2/8] x86/resctrl: Prepare to split rdt_domain structure Tony Luck
@ 2024-01-26 22:38                     ` Tony Luck
  2024-01-26 22:38                     ` [PATCH v14 4/8] x86/resctrl: Split the rdt_domain and rdt_hw_domain structures Tony Luck
                                       ` (6 subsequent siblings)
  9 siblings, 0 replies; 331+ messages in thread
From: Tony Luck @ 2024-01-26 22:38 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Fenghua Yu, Reinette Chatre, Peter Newman, Jonathan Corbet,
	Shuah Khan, Shaopeng Tan, James Morse, Jamie Iles, Babu Moger,
	Randy Dunlap, x86, linux-kernel, linux-doc, patches, Tony Luck,
	Shaopeng Tan

Resctrl assumes that control and monitor operations on a resource are
performed at the same scope.

Prepare for systems that use different scope (specifically Intel needs
to split the RDT_RESOURCE_L3 resource to use L3 scope for cache control
and NODE scope for cache occupancy and memory bandwidth monitoring).

Create separate domain lists for control and monitor operations.

Note that errors during initialization of either control or monitor
functions on a domain would previously result in that domain being
excluded from both control and monitor operations. Now that the domains
are allocated independently, it is no longer necessary to disable both
control and monitor operations if either fails.

Tested-by: Shaopeng Tan <tan.shaopeng@jp.fujitsu.com>
Reviewed-by: Shaopeng Tan <tan.shaopeng@jp.fujitsu.com>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Reviewed-by: Babu Moger <babu.moger@amd.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
---
 include/linux/resctrl.h                   |  25 ++-
 arch/x86/kernel/cpu/resctrl/internal.h    |   6 +-
 arch/x86/kernel/cpu/resctrl/core.c        | 211 ++++++++++++++++------
 arch/x86/kernel/cpu/resctrl/ctrlmondata.c |  12 +-
 arch/x86/kernel/cpu/resctrl/monitor.c     |   4 +-
 arch/x86/kernel/cpu/resctrl/pseudo_lock.c |   4 +-
 arch/x86/kernel/cpu/resctrl/rdtgroup.c    |  55 +++---
 7 files changed, 220 insertions(+), 97 deletions(-)

diff --git a/include/linux/resctrl.h b/include/linux/resctrl.h
index c4067150a6b7..35e700edc6e6 100644
--- a/include/linux/resctrl.h
+++ b/include/linux/resctrl.h
@@ -52,15 +52,22 @@ struct resctrl_staged_config {
 	bool			have_new_ctrl;
 };
 
+enum resctrl_domain_type {
+	RESCTRL_CTRL_DOMAIN,
+	RESCTRL_MON_DOMAIN,
+};
+
 /**
  * struct rdt_domain_hdr - common header for different domain types
  * @list:		all instances of this resource
  * @id:			unique id for this instance
+ * @type:		type of this instance
  * @cpu_mask:		which CPUs share this resource
  */
 struct rdt_domain_hdr {
 	struct list_head		list;
 	int				id;
+	enum resctrl_domain_type	type;
 	struct cpumask			cpu_mask;
 };
 
@@ -163,10 +170,12 @@ enum resctrl_scope {
  * @alloc_capable:	Is allocation available on this machine
  * @mon_capable:	Is monitor feature available on this machine
  * @num_rmid:		Number of RMIDs available
- * @scope:		Scope of this resource
+ * @ctrl_scope:		Scope of this resource for control functions
+ * @mon_scope:		Scope of this resource for monitor functions
  * @cache:		Cache allocation related data
  * @membw:		If the component has bandwidth controls, their properties.
- * @domains:		All domains for this resource
+ * @ctrl_domains:	Control domains for this resource
+ * @mon_domains:	Monitor domains for this resource
  * @name:		Name to use in "schemata" file.
  * @data_width:		Character width of data when displaying
  * @default_ctrl:	Specifies default cache cbm or memory B/W percent.
@@ -181,10 +190,12 @@ struct rdt_resource {
 	bool			alloc_capable;
 	bool			mon_capable;
 	int			num_rmid;
-	enum resctrl_scope	scope;
+	enum resctrl_scope	ctrl_scope;
+	enum resctrl_scope	mon_scope;
 	struct resctrl_cache	cache;
 	struct resctrl_membw	membw;
-	struct list_head	domains;
+	struct list_head	ctrl_domains;
+	struct list_head	mon_domains;
 	char			*name;
 	int			data_width;
 	u32			default_ctrl;
@@ -230,8 +241,10 @@ int resctrl_arch_update_one(struct rdt_resource *r, struct rdt_domain *d,
 
 u32 resctrl_arch_get_config(struct rdt_resource *r, struct rdt_domain *d,
 			    u32 closid, enum resctrl_conf_type type);
-int resctrl_online_domain(struct rdt_resource *r, struct rdt_domain *d);
-void resctrl_offline_domain(struct rdt_resource *r, struct rdt_domain *d);
+int resctrl_online_ctrl_domain(struct rdt_resource *r, struct rdt_domain *d);
+int resctrl_online_mon_domain(struct rdt_resource *r, struct rdt_domain *d);
+void resctrl_offline_ctrl_domain(struct rdt_resource *r, struct rdt_domain *d);
+void resctrl_offline_mon_domain(struct rdt_resource *r, struct rdt_domain *d);
 
 /**
  * resctrl_arch_rmid_read() - Read the eventid counter corresponding to rmid
diff --git a/arch/x86/kernel/cpu/resctrl/internal.h b/arch/x86/kernel/cpu/resctrl/internal.h
index 52e7e7deee10..c2b0f563ddf9 100644
--- a/arch/x86/kernel/cpu/resctrl/internal.h
+++ b/arch/x86/kernel/cpu/resctrl/internal.h
@@ -518,8 +518,8 @@ void rdtgroup_kn_unlock(struct kernfs_node *kn);
 int rdtgroup_kn_mode_restrict(struct rdtgroup *r, const char *name);
 int rdtgroup_kn_mode_restore(struct rdtgroup *r, const char *name,
 			     umode_t mask);
-struct rdt_domain *rdt_find_domain(struct rdt_resource *r, int id,
-				   struct list_head **pos);
+struct rdt_domain_hdr *rdt_find_domain(struct list_head *h, int id,
+				       struct list_head **pos);
 ssize_t rdtgroup_schemata_write(struct kernfs_open_file *of,
 				char *buf, size_t nbytes, loff_t off);
 int rdtgroup_schemata_show(struct kernfs_open_file *of,
@@ -538,7 +538,7 @@ int rdt_pseudo_lock_init(void);
 void rdt_pseudo_lock_release(void);
 int rdtgroup_pseudo_lock_create(struct rdtgroup *rdtgrp);
 void rdtgroup_pseudo_lock_remove(struct rdtgroup *rdtgrp);
-struct rdt_domain *get_domain_from_cpu(int cpu, struct rdt_resource *r);
+struct rdt_domain *get_ctrl_domain_from_cpu(int cpu, struct rdt_resource *r);
 int closids_supported(void);
 void closid_free(int closid);
 int alloc_rmid(void);
diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
index 625f75899d7b..5783ca195a99 100644
--- a/arch/x86/kernel/cpu/resctrl/core.c
+++ b/arch/x86/kernel/cpu/resctrl/core.c
@@ -57,7 +57,8 @@ static void
 mba_wrmsr_amd(struct rdt_domain *d, struct msr_param *m,
 	      struct rdt_resource *r);
 
-#define domain_init(id) LIST_HEAD_INIT(rdt_resources_all[id].r_resctrl.domains)
+#define ctrl_domain_init(id) LIST_HEAD_INIT(rdt_resources_all[id].r_resctrl.ctrl_domains)
+#define mon_domain_init(id) LIST_HEAD_INIT(rdt_resources_all[id].r_resctrl.mon_domains)
 
 struct rdt_hw_resource rdt_resources_all[] = {
 	[RDT_RESOURCE_L3] =
@@ -65,8 +66,10 @@ struct rdt_hw_resource rdt_resources_all[] = {
 		.r_resctrl = {
 			.rid			= RDT_RESOURCE_L3,
 			.name			= "L3",
-			.scope			= RESCTRL_L3_CACHE,
-			.domains		= domain_init(RDT_RESOURCE_L3),
+			.ctrl_scope		= RESCTRL_L3_CACHE,
+			.mon_scope		= RESCTRL_L3_CACHE,
+			.ctrl_domains		= ctrl_domain_init(RDT_RESOURCE_L3),
+			.mon_domains		= mon_domain_init(RDT_RESOURCE_L3),
 			.parse_ctrlval		= parse_cbm,
 			.format_str		= "%d=%0*x",
 			.fflags			= RFTYPE_RES_CACHE,
@@ -79,8 +82,8 @@ struct rdt_hw_resource rdt_resources_all[] = {
 		.r_resctrl = {
 			.rid			= RDT_RESOURCE_L2,
 			.name			= "L2",
-			.scope			= RESCTRL_L2_CACHE,
-			.domains		= domain_init(RDT_RESOURCE_L2),
+			.ctrl_scope		= RESCTRL_L2_CACHE,
+			.ctrl_domains		= ctrl_domain_init(RDT_RESOURCE_L2),
 			.parse_ctrlval		= parse_cbm,
 			.format_str		= "%d=%0*x",
 			.fflags			= RFTYPE_RES_CACHE,
@@ -93,8 +96,8 @@ struct rdt_hw_resource rdt_resources_all[] = {
 		.r_resctrl = {
 			.rid			= RDT_RESOURCE_MBA,
 			.name			= "MB",
-			.scope			= RESCTRL_L3_CACHE,
-			.domains		= domain_init(RDT_RESOURCE_MBA),
+			.ctrl_scope		= RESCTRL_L3_CACHE,
+			.ctrl_domains		= ctrl_domain_init(RDT_RESOURCE_MBA),
 			.parse_ctrlval		= parse_bw,
 			.format_str		= "%d=%*u",
 			.fflags			= RFTYPE_RES_MB,
@@ -105,8 +108,8 @@ struct rdt_hw_resource rdt_resources_all[] = {
 		.r_resctrl = {
 			.rid			= RDT_RESOURCE_SMBA,
 			.name			= "SMBA",
-			.scope			= RESCTRL_L3_CACHE,
-			.domains		= domain_init(RDT_RESOURCE_SMBA),
+			.ctrl_scope		= RESCTRL_L3_CACHE,
+			.ctrl_domains		= ctrl_domain_init(RDT_RESOURCE_SMBA),
 			.parse_ctrlval		= parse_bw,
 			.format_str		= "%d=%*u",
 			.fflags			= RFTYPE_RES_MB,
@@ -350,11 +353,11 @@ cat_wrmsr(struct rdt_domain *d, struct msr_param *m, struct rdt_resource *r)
 		wrmsrl(hw_res->msr_base + i, hw_dom->ctrl_val[i]);
 }
 
-struct rdt_domain *get_domain_from_cpu(int cpu, struct rdt_resource *r)
+struct rdt_domain *get_ctrl_domain_from_cpu(int cpu, struct rdt_resource *r)
 {
 	struct rdt_domain *d;
 
-	list_for_each_entry(d, &r->domains, hdr.list) {
+	list_for_each_entry(d, &r->ctrl_domains, hdr.list) {
 		/* Find the domain that contains this CPU */
 		if (cpumask_test_cpu(cpu, &d->hdr.cpu_mask))
 			return d;
@@ -376,7 +379,7 @@ void rdt_ctrl_update(void *arg)
 	int cpu = smp_processor_id();
 	struct rdt_domain *d;
 
-	d = get_domain_from_cpu(cpu, r);
+	d = get_ctrl_domain_from_cpu(cpu, r);
 	if (d) {
 		hw_res->msr_update(d, m, r);
 		return;
@@ -386,26 +389,26 @@ void rdt_ctrl_update(void *arg)
 }
 
 /*
- * rdt_find_domain - Find a domain in a resource that matches input resource id
+ * rdt_find_domain - Search for a domain id in a resource domain list.
  *
- * Search resource r's domain list to find the resource id. If the resource
- * id is found in a domain, return the domain. Otherwise, if requested by
- * caller, return the first domain whose id is bigger than the input id.
- * The domain list is sorted by id in ascending order.
+ * Search the domain list to find the domain id. If the domain id is
+ * found, return the domain. NULL otherwise.  If the domain id is not
+ * found (and NULL returned) then the first domain with id bigger than
+ * the input id can be returned to the caller via @pos.
  */
-struct rdt_domain *rdt_find_domain(struct rdt_resource *r, int id,
-				   struct list_head **pos)
+struct rdt_domain_hdr *rdt_find_domain(struct list_head *h, int id,
+				       struct list_head **pos)
 {
-	struct rdt_domain *d;
+	struct rdt_domain_hdr *d;
 	struct list_head *l;
 
-	list_for_each(l, &r->domains) {
-		d = list_entry(l, struct rdt_domain, hdr.list);
+	list_for_each(l, h) {
+		d = list_entry(l, struct rdt_domain_hdr, list);
 		/* When id is found, return its domain. */
-		if (id == d->hdr.id)
+		if (id == d->id)
 			return d;
 		/* Stop searching when finding id's position in sorted list. */
-		if (id < d->hdr.id)
+		if (id < d->id)
 			break;
 	}
 
@@ -499,35 +502,28 @@ static int get_domain_id_from_scope(int cpu, enum resctrl_scope scope)
 	return -EINVAL;
 }
 
-/*
- * domain_add_cpu - Add a cpu to a resource's domain list.
- *
- * If an existing domain in the resource r's domain list matches the cpu's
- * resource id, add the cpu in the domain.
- *
- * Otherwise, a new domain is allocated and inserted into the right position
- * in the domain list sorted by id in ascending order.
- *
- * The order in the domain list is visible to users when we print entries
- * in the schemata file and schemata input is validated to have the same order
- * as this list.
- */
-static void domain_add_cpu(int cpu, struct rdt_resource *r)
+static void domain_add_cpu_ctrl(int cpu, struct rdt_resource *r)
 {
-	int id = get_domain_id_from_scope(cpu, r->scope);
+	int id = get_domain_id_from_scope(cpu, r->ctrl_scope);
 	struct list_head *add_pos = NULL;
 	struct rdt_hw_domain *hw_dom;
+	struct rdt_domain_hdr *hdr;
 	struct rdt_domain *d;
 	int err;
 
 	if (id < 0) {
-		pr_warn_once("Can't find domain id for CPU:%d scope:%d for resource %s\n",
-			     cpu, r->scope, r->name);
+		pr_warn_once("Can't find control domain id for CPU:%d scope:%d for resource %s\n",
+			     cpu, r->ctrl_scope, r->name);
 		return;
 	}
 
-	d = rdt_find_domain(r, id, &add_pos);
-	if (d) {
+	hdr = rdt_find_domain(&r->ctrl_domains, id, &add_pos);
+	if (hdr) {
+		if (WARN_ON_ONCE(hdr->type != RESCTRL_CTRL_DOMAIN))
+			return;
+
+		d = container_of(hdr, struct rdt_domain, hdr);
+
 		cpumask_set_cpu(cpu, &d->hdr.cpu_mask);
 		if (r->cache.arch_has_per_cpu_cfg)
 			rdt_domain_reconfigure_cdp(r);
@@ -540,51 +536,114 @@ static void domain_add_cpu(int cpu, struct rdt_resource *r)
 
 	d = &hw_dom->d_resctrl;
 	d->hdr.id = id;
+	d->hdr.type = RESCTRL_CTRL_DOMAIN;
 	cpumask_set_cpu(cpu, &d->hdr.cpu_mask);
 
 	rdt_domain_reconfigure_cdp(r);
 
-	if (r->alloc_capable && domain_setup_ctrlval(r, d)) {
+	if (domain_setup_ctrlval(r, d)) {
 		domain_free(hw_dom);
 		return;
 	}
 
-	if (r->mon_capable && arch_domain_mbm_alloc(r->num_rmid, hw_dom)) {
+	list_add_tail(&d->hdr.list, add_pos);
+
+	err = resctrl_online_ctrl_domain(r, d);
+	if (err) {
+		list_del(&d->hdr.list);
+		domain_free(hw_dom);
+	}
+}
+
+static void domain_add_cpu_mon(int cpu, struct rdt_resource *r)
+{
+	int id = get_domain_id_from_scope(cpu, r->mon_scope);
+	struct list_head *add_pos = NULL;
+	struct rdt_hw_domain *hw_dom;
+	struct rdt_domain_hdr *hdr;
+	struct rdt_domain *d;
+	int err;
+
+	if (id < 0) {
+		pr_warn_once("Can't find monitor domain id for CPU:%d scope:%d for resource %s\n",
+			     cpu, r->mon_scope, r->name);
+		return;
+	}
+
+	hdr = rdt_find_domain(&r->mon_domains, id, &add_pos);
+	if (hdr) {
+		if (WARN_ON_ONCE(hdr->type != RESCTRL_MON_DOMAIN))
+			return;
+
+		d = container_of(hdr, struct rdt_domain, hdr);
+
+		cpumask_set_cpu(cpu, &d->hdr.cpu_mask);
+		return;
+	}
+
+	hw_dom = kzalloc_node(sizeof(*hw_dom), GFP_KERNEL, cpu_to_node(cpu));
+	if (!hw_dom)
+		return;
+
+	d = &hw_dom->d_resctrl;
+	d->hdr.id = id;
+	d->hdr.type = RESCTRL_MON_DOMAIN;
+	cpumask_set_cpu(cpu, &d->hdr.cpu_mask);
+
+	if (arch_domain_mbm_alloc(r->num_rmid, hw_dom)) {
 		domain_free(hw_dom);
 		return;
 	}
 
 	list_add_tail(&d->hdr.list, add_pos);
 
-	err = resctrl_online_domain(r, d);
+	err = resctrl_online_mon_domain(r, d);
 	if (err) {
 		list_del(&d->hdr.list);
 		domain_free(hw_dom);
 	}
 }
 
-static void domain_remove_cpu(int cpu, struct rdt_resource *r)
+/*
+ * domain_add_cpu - Add a CPU to either/both resource's domain lists.
+ */
+static void domain_add_cpu(int cpu, struct rdt_resource *r)
 {
-	int id = get_domain_id_from_scope(cpu, r->scope);
+	if (r->alloc_capable)
+		domain_add_cpu_ctrl(cpu, r);
+	if (r->mon_capable)
+		domain_add_cpu_mon(cpu, r);
+}
+
+static void domain_remove_cpu_ctrl(int cpu, struct rdt_resource *r)
+{
+	int id = get_domain_id_from_scope(cpu, r->ctrl_scope);
 	struct rdt_hw_domain *hw_dom;
+	struct rdt_domain_hdr *hdr;
 	struct rdt_domain *d;
 
 	if (id < 0) {
-		pr_warn_once("Can't find domain id for CPU:%d scope:%d for resource %s\n",
-			     cpu, r->scope, r->name);
+		pr_warn_once("Can't find control domain id for CPU:%d scope:%d for resource %s\n",
+			     cpu, r->ctrl_scope, r->name);
 		return;
 	}
 
-	d = rdt_find_domain(r, id, NULL);
-	if (!d) {
-		pr_warn("Couldn't find domain with id=%d for CPU %d\n", id, cpu);
+	hdr = rdt_find_domain(&r->ctrl_domains, id, NULL);
+	if (!hdr) {
+		pr_warn("Can't find control domain for id=%d for CPU %d for resource %s\n",
+			id, cpu, r->name);
 		return;
 	}
+
+	if (WARN_ON_ONCE(hdr->type != RESCTRL_CTRL_DOMAIN))
+		return;
+
+	d = container_of(hdr, struct rdt_domain, hdr);
 	hw_dom = resctrl_to_arch_dom(d);
 
 	cpumask_clear_cpu(cpu, &d->hdr.cpu_mask);
 	if (cpumask_empty(&d->hdr.cpu_mask)) {
-		resctrl_offline_domain(r, d);
+		resctrl_offline_ctrl_domain(r, d);
 		list_del(&d->hdr.list);
 
 		/*
@@ -597,6 +656,42 @@ static void domain_remove_cpu(int cpu, struct rdt_resource *r)
 
 		return;
 	}
+}
+
+static void domain_remove_cpu_mon(int cpu, struct rdt_resource *r)
+{
+	int id = get_domain_id_from_scope(cpu, r->mon_scope);
+	struct rdt_hw_domain *hw_dom;
+	struct rdt_domain_hdr *hdr;
+	struct rdt_domain *d;
+
+	if (id < 0) {
+		pr_warn_once("Can't find monitor domain id for CPU:%d scope:%d for resource %s\n",
+			     cpu, r->mon_scope, r->name);
+		return;
+	}
+
+	hdr = rdt_find_domain(&r->mon_domains, id, NULL);
+	if (!hdr) {
+		pr_warn("Can't find monitor domain for id=%d for CPU %d for resource %s\n",
+			id, cpu, r->name);
+		return;
+	}
+
+	if (WARN_ON_ONCE(hdr->type != RESCTRL_MON_DOMAIN))
+		return;
+
+	d = container_of(hdr, struct rdt_domain, hdr);
+	hw_dom = resctrl_to_arch_dom(d);
+
+	cpumask_clear_cpu(cpu, &d->hdr.cpu_mask);
+	if (cpumask_empty(&d->hdr.cpu_mask)) {
+		resctrl_offline_mon_domain(r, d);
+		list_del(&d->hdr.list);
+		domain_free(hw_dom);
+
+		return;
+	}
 
 	if (r == &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl) {
 		if (is_mbm_enabled() && cpu == d->mbm_work_cpu) {
@@ -611,6 +706,14 @@ static void domain_remove_cpu(int cpu, struct rdt_resource *r)
 	}
 }
 
+static void domain_remove_cpu(int cpu, struct rdt_resource *r)
+{
+	if (r->alloc_capable)
+		domain_remove_cpu_ctrl(cpu, r);
+	if (r->mon_capable)
+		domain_remove_cpu_mon(cpu, r);
+}
+
 static void clear_closid_rmid(int cpu)
 {
 	struct resctrl_pqr_state *state = this_cpu_ptr(&pqr_state);
diff --git a/arch/x86/kernel/cpu/resctrl/ctrlmondata.c b/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
index 23f8258d36a8..0b4136c42762 100644
--- a/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
+++ b/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
@@ -226,7 +226,7 @@ static int parse_line(char *line, struct resctrl_schema *s,
 		return -EINVAL;
 	}
 	dom = strim(dom);
-	list_for_each_entry(d, &r->domains, hdr.list) {
+	list_for_each_entry(d, &r->ctrl_domains, hdr.list) {
 		if (d->hdr.id == dom_id) {
 			data.buf = dom;
 			data.rdtgrp = rdtgrp;
@@ -318,7 +318,7 @@ int resctrl_arch_update_domains(struct rdt_resource *r, u32 closid)
 		return -ENOMEM;
 
 	msr_param.res = NULL;
-	list_for_each_entry(d, &r->domains, hdr.list) {
+	list_for_each_entry(d, &r->ctrl_domains, hdr.list) {
 		hw_dom = resctrl_to_arch_dom(d);
 		for (t = 0; t < CDP_NUM_TYPES; t++) {
 			cfg = &hw_dom->d_resctrl.staged_config[t];
@@ -466,7 +466,7 @@ static void show_doms(struct seq_file *s, struct resctrl_schema *schema, int clo
 	u32 ctrl_val;
 
 	seq_printf(s, "%*s:", max_name_width, schema->name);
-	list_for_each_entry(dom, &r->domains, hdr.list) {
+	list_for_each_entry(dom, &r->ctrl_domains, hdr.list) {
 		if (sep)
 			seq_puts(s, ";");
 
@@ -542,6 +542,7 @@ void mon_event_read(struct rmid_read *rr, struct rdt_resource *r,
 int rdtgroup_mondata_show(struct seq_file *m, void *arg)
 {
 	struct kernfs_open_file *of = m->private;
+	struct rdt_domain_hdr *hdr;
 	u32 resid, evtid, domid;
 	struct rdtgroup *rdtgrp;
 	struct rdt_resource *r;
@@ -562,11 +563,12 @@ int rdtgroup_mondata_show(struct seq_file *m, void *arg)
 	evtid = md.u.evtid;
 
 	r = &rdt_resources_all[resid].r_resctrl;
-	d = rdt_find_domain(r, domid, NULL);
-	if (!d) {
+	hdr = rdt_find_domain(&r->mon_domains, domid, NULL);
+	if (!hdr || WARN_ON_ONCE(hdr->type != RESCTRL_MON_DOMAIN)) {
 		ret = -ENOENT;
 		goto out;
 	}
+	d = container_of(hdr, struct rdt_domain, hdr);
 
 	mon_event_read(&rr, r, d, rdtgrp, evtid, false);
 
diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c
index 310ab05f0cb7..749f6662e7ca 100644
--- a/arch/x86/kernel/cpu/resctrl/monitor.c
+++ b/arch/x86/kernel/cpu/resctrl/monitor.c
@@ -340,7 +340,7 @@ static void add_rmid_to_limbo(struct rmid_entry *entry)
 
 	entry->busy = 0;
 	cpu = get_cpu();
-	list_for_each_entry(d, &r->domains, hdr.list) {
+	list_for_each_entry(d, &r->mon_domains, hdr.list) {
 		if (cpumask_test_cpu(cpu, &d->hdr.cpu_mask)) {
 			err = resctrl_arch_rmid_read(r, d, entry->rmid,
 						     QOS_L3_OCCUP_EVENT_ID,
@@ -532,7 +532,7 @@ static void update_mba_bw(struct rdtgroup *rgrp, struct rdt_domain *dom_mbm)
 	rmid = rgrp->mon.rmid;
 	pmbm_data = &dom_mbm->mbm_local[rmid];
 
-	dom_mba = get_domain_from_cpu(smp_processor_id(), r_mba);
+	dom_mba = get_ctrl_domain_from_cpu(smp_processor_id(), r_mba);
 	if (!dom_mba) {
 		pr_warn_once("Failure to get domain for MBA update\n");
 		return;
diff --git a/arch/x86/kernel/cpu/resctrl/pseudo_lock.c b/arch/x86/kernel/cpu/resctrl/pseudo_lock.c
index fcbd99e2eb66..ed6d59af1cef 100644
--- a/arch/x86/kernel/cpu/resctrl/pseudo_lock.c
+++ b/arch/x86/kernel/cpu/resctrl/pseudo_lock.c
@@ -292,7 +292,7 @@ static void pseudo_lock_region_clear(struct pseudo_lock_region *plr)
  */
 static int pseudo_lock_region_init(struct pseudo_lock_region *plr)
 {
-	enum resctrl_scope scope = plr->s->res->scope;
+	enum resctrl_scope scope = plr->s->res->ctrl_scope;
 	struct cpu_cacheinfo *ci;
 	int ret;
 	int i;
@@ -856,7 +856,7 @@ bool rdtgroup_pseudo_locked_in_hierarchy(struct rdt_domain *d)
 	 * associated with them.
 	 */
 	for_each_alloc_capable_rdt_resource(r) {
-		list_for_each_entry(d_i, &r->domains, hdr.list) {
+		list_for_each_entry(d_i, &r->ctrl_domains, hdr.list) {
 			if (d_i->plr)
 				cpumask_or(cpu_with_psl, cpu_with_psl,
 					   &d_i->hdr.cpu_mask);
diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
index a459c16735f7..5cc6e9b11a16 100644
--- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
+++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
@@ -91,7 +91,7 @@ void rdt_staged_configs_clear(void)
 	lockdep_assert_held(&rdtgroup_mutex);
 
 	for_each_alloc_capable_rdt_resource(r) {
-		list_for_each_entry(dom, &r->domains, hdr.list)
+		list_for_each_entry(dom, &r->ctrl_domains, hdr.list)
 			memset(dom->staged_config, 0, sizeof(dom->staged_config));
 	}
 }
@@ -984,7 +984,7 @@ static int rdt_bit_usage_show(struct kernfs_open_file *of,
 
 	mutex_lock(&rdtgroup_mutex);
 	hw_shareable = r->cache.shareable_bits;
-	list_for_each_entry(dom, &r->domains, hdr.list) {
+	list_for_each_entry(dom, &r->ctrl_domains, hdr.list) {
 		if (sep)
 			seq_putc(seq, ';');
 		sw_shareable = 0;
@@ -1302,7 +1302,7 @@ static bool rdtgroup_mode_test_exclusive(struct rdtgroup *rdtgrp)
 		if (r->rid == RDT_RESOURCE_MBA || r->rid == RDT_RESOURCE_SMBA)
 			continue;
 		has_cache = true;
-		list_for_each_entry(d, &r->domains, hdr.list) {
+		list_for_each_entry(d, &r->ctrl_domains, hdr.list) {
 			ctrl = resctrl_arch_get_config(r, d, closid,
 						       s->conf_type);
 			if (rdtgroup_cbm_overlaps(s, d, ctrl, closid, false)) {
@@ -1413,13 +1413,13 @@ unsigned int rdtgroup_cbm_to_size(struct rdt_resource *r,
 	unsigned int size = 0;
 	int num_b, i;
 
-	if (WARN_ON_ONCE(r->scope != RESCTRL_L2_CACHE && r->scope != RESCTRL_L3_CACHE))
+	if (WARN_ON_ONCE(r->ctrl_scope != RESCTRL_L2_CACHE && r->ctrl_scope != RESCTRL_L3_CACHE))
 		return size;
 
 	num_b = bitmap_weight(&cbm, r->cache.cbm_len);
 	ci = get_cpu_cacheinfo(cpumask_any(&d->hdr.cpu_mask));
 	for (i = 0; i < ci->num_leaves; i++) {
-		if (ci->info_list[i].level == r->scope) {
+		if (ci->info_list[i].level == r->ctrl_scope) {
 			size = ci->info_list[i].size / r->cache.cbm_len * num_b;
 			break;
 		}
@@ -1477,7 +1477,7 @@ static int rdtgroup_size_show(struct kernfs_open_file *of,
 		type = schema->conf_type;
 		sep = false;
 		seq_printf(s, "%*s:", max_name_width, schema->name);
-		list_for_each_entry(d, &r->domains, hdr.list) {
+		list_for_each_entry(d, &r->ctrl_domains, hdr.list) {
 			if (sep)
 				seq_putc(s, ';');
 			if (rdtgrp->mode == RDT_MODE_PSEUDO_LOCKSETUP) {
@@ -1566,7 +1566,7 @@ static int mbm_config_show(struct seq_file *s, struct rdt_resource *r, u32 evtid
 
 	mutex_lock(&rdtgroup_mutex);
 
-	list_for_each_entry(dom, &r->domains, hdr.list) {
+	list_for_each_entry(dom, &r->mon_domains, hdr.list) {
 		if (sep)
 			seq_puts(s, ";");
 
@@ -1686,7 +1686,7 @@ static int mon_config_write(struct rdt_resource *r, char *tok, u32 evtid)
 		return -EINVAL;
 	}
 
-	list_for_each_entry(d, &r->domains, hdr.list) {
+	list_for_each_entry(d, &r->mon_domains, hdr.list) {
 		if (d->hdr.id == dom_id) {
 			mbm_config_write_domain(r, d, evtid, val);
 			goto next;
@@ -2227,7 +2227,7 @@ static int set_cache_qos_cfg(int level, bool enable)
 		return -ENOMEM;
 
 	r_l = &rdt_resources_all[level].r_resctrl;
-	list_for_each_entry(d, &r_l->domains, hdr.list) {
+	list_for_each_entry(d, &r_l->ctrl_domains, hdr.list) {
 		if (r_l->cache.arch_has_per_cpu_cfg)
 			/* Pick all the CPUs in the domain instance */
 			for_each_cpu(cpu, &d->hdr.cpu_mask)
@@ -2312,7 +2312,7 @@ static int set_mba_sc(bool mba_sc)
 
 	r->membw.mba_sc = mba_sc;
 
-	list_for_each_entry(d, &r->domains, hdr.list) {
+	list_for_each_entry(d, &r->ctrl_domains, hdr.list) {
 		for (i = 0; i < num_closid; i++)
 			d->mbps_val[i] = MBA_MAX_MBPS;
 	}
@@ -2648,7 +2648,7 @@ static int rdt_get_tree(struct fs_context *fc)
 
 	if (is_mbm_enabled()) {
 		r = &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl;
-		list_for_each_entry(dom, &r->domains, hdr.list)
+		list_for_each_entry(dom, &r->mon_domains, hdr.list)
 			mbm_setup_overflow_handler(dom, MBM_OVERFLOW_INTERVAL);
 	}
 
@@ -2772,10 +2772,10 @@ static int reset_all_ctrls(struct rdt_resource *r)
 
 	/*
 	 * Disable resource control for this resource by setting all
-	 * CBMs in all domains to the maximum mask value. Pick one CPU
+	 * CBMs in all ctrl_domains to the maximum mask value. Pick one CPU
 	 * from each domain to update the MSRs below.
 	 */
-	list_for_each_entry(d, &r->domains, hdr.list) {
+	list_for_each_entry(d, &r->ctrl_domains, hdr.list) {
 		hw_dom = resctrl_to_arch_dom(d);
 		cpumask_set_cpu(cpumask_any(&d->hdr.cpu_mask), cpu_mask);
 
@@ -3045,7 +3045,7 @@ static int mkdir_mondata_subdir_alldom(struct kernfs_node *parent_kn,
 	struct rdt_domain *dom;
 	int ret;
 
-	list_for_each_entry(dom, &r->domains, hdr.list) {
+	list_for_each_entry(dom, &r->mon_domains, hdr.list) {
 		ret = mkdir_mondata_subdir(parent_kn, dom, r, prgrp);
 		if (ret)
 			return ret;
@@ -3227,7 +3227,7 @@ static int rdtgroup_init_cat(struct resctrl_schema *s, u32 closid)
 	struct rdt_domain *d;
 	int ret;
 
-	list_for_each_entry(d, &s->res->domains, hdr.list) {
+	list_for_each_entry(d, &s->res->ctrl_domains, hdr.list) {
 		ret = __init_one_rdt_domain(d, s, closid);
 		if (ret < 0)
 			return ret;
@@ -3242,7 +3242,7 @@ static void rdtgroup_init_mba(struct rdt_resource *r, u32 closid)
 	struct resctrl_staged_config *cfg;
 	struct rdt_domain *d;
 
-	list_for_each_entry(d, &r->domains, hdr.list) {
+	list_for_each_entry(d, &r->ctrl_domains, hdr.list) {
 		if (is_mba_sc(r)) {
 			d->mbps_val[closid] = MBA_MAX_MBPS;
 			continue;
@@ -3844,15 +3844,17 @@ static void domain_destroy_mon_state(struct rdt_domain *d)
 	kfree(d->mbm_local);
 }
 
-void resctrl_offline_domain(struct rdt_resource *r, struct rdt_domain *d)
+void resctrl_offline_ctrl_domain(struct rdt_resource *r, struct rdt_domain *d)
 {
 	lockdep_assert_held(&rdtgroup_mutex);
 
 	if (supports_mba_mbps() && r->rid == RDT_RESOURCE_MBA)
 		mba_sc_domain_destroy(r, d);
+}
 
-	if (!r->mon_capable)
-		return;
+void resctrl_offline_mon_domain(struct rdt_resource *r, struct rdt_domain *d)
+{
+	lockdep_assert_held(&rdtgroup_mutex);
 
 	/*
 	 * If resctrl is mounted, remove all the
@@ -3909,18 +3911,21 @@ static int domain_setup_mon_state(struct rdt_resource *r, struct rdt_domain *d)
 	return 0;
 }
 
-int resctrl_online_domain(struct rdt_resource *r, struct rdt_domain *d)
+int resctrl_online_ctrl_domain(struct rdt_resource *r, struct rdt_domain *d)
 {
-	int err;
-
 	lockdep_assert_held(&rdtgroup_mutex);
 
 	if (supports_mba_mbps() && r->rid == RDT_RESOURCE_MBA)
-		/* RDT_RESOURCE_MBA is never mon_capable */
 		return mba_sc_domain_allocate(r, d);
 
-	if (!r->mon_capable)
-		return 0;
+	return 0;
+}
+
+int resctrl_online_mon_domain(struct rdt_resource *r, struct rdt_domain *d)
+{
+	int err;
+
+	lockdep_assert_held(&rdtgroup_mutex);
 
 	err = domain_setup_mon_state(r, d);
 	if (err)
-- 
2.43.0



* [PATCH v14 4/8] x86/resctrl: Split the rdt_domain and rdt_hw_domain structures
  2024-01-26 22:38                   ` [PATCH v14 " Tony Luck
                                       ` (2 preceding siblings ...)
  2024-01-26 22:38                     ` [PATCH v14 3/8] x86/resctrl: Prepare for different scope for control/monitor operations Tony Luck
@ 2024-01-26 22:38                     ` Tony Luck
  2024-01-26 22:38                     ` [PATCH v14 5/8] x86/resctrl: Add node-scope to the options for feature scope Tony Luck
                                       ` (5 subsequent siblings)
  9 siblings, 0 replies; 331+ messages in thread
From: Tony Luck @ 2024-01-26 22:38 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Fenghua Yu, Reinette Chatre, Peter Newman, Jonathan Corbet,
	Shuah Khan, Shaopeng Tan, James Morse, Jamie Iles, Babu Moger,
	Randy Dunlap, x86, linux-kernel, linux-doc, patches, Tony Luck,
	Shaopeng Tan

The same rdt_domain structure is used for both control and monitor
functions. This wastes memory because some of its fields are used only
by control functions, while most are used only by monitor functions.

Split into separate rdt_ctrl_domain and rdt_mon_domain structures with
just the fields required for control and monitoring respectively.

Make a similar split of the rdt_hw_domain structure into
rdt_hw_ctrl_domain and rdt_hw_mon_domain.

Tested-by: Shaopeng Tan <tan.shaopeng@jp.fujitsu.com>
Reviewed-by: Peter Newman <peternewman@google.com>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Reviewed-by: Shaopeng Tan <tan.shaopeng@jp.fujitsu.com>
Reviewed-by: Babu Moger <babu.moger@amd.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
---
 include/linux/resctrl.h                   | 48 +++++++------
 arch/x86/kernel/cpu/resctrl/internal.h    | 60 ++++++++++------
 arch/x86/kernel/cpu/resctrl/core.c        | 87 ++++++++++++-----------
 arch/x86/kernel/cpu/resctrl/ctrlmondata.c | 32 ++++-----
 arch/x86/kernel/cpu/resctrl/monitor.c     | 40 +++++------
 arch/x86/kernel/cpu/resctrl/pseudo_lock.c |  6 +-
 arch/x86/kernel/cpu/resctrl/rdtgroup.c    | 62 ++++++++--------
 7 files changed, 182 insertions(+), 153 deletions(-)
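
For illustration only, a minimal standalone sketch (not kernel code, with
field lists trimmed) of the layout this patch creates: both new structures
embed the common struct rdt_domain_hdr, and code that walks a domain list
checks hdr->type before using container_of() to recover the specific domain
type, as the core.c and ctrlmondata.c hunks below do.

/* Standalone user-space model of the split; compile with any C compiler. */
#include <stddef.h>
#include <stdio.h>

/* Stand-in name for the kernel's enum of domain types; the two constants
 * below are the ones used by the patch.
 */
enum resctrl_domain_type { RESCTRL_CTRL_DOMAIN, RESCTRL_MON_DOMAIN };

struct rdt_domain_hdr {
	int id;				/* domain id within the resource  */
	enum resctrl_domain_type type;	/* which domain list this is on   */
};

struct rdt_ctrl_domain {		/* control-only fields follow hdr */
	struct rdt_domain_hdr hdr;
	unsigned int *ctrl_val;
};

struct rdt_mon_domain {			/* monitor-only fields follow hdr */
	struct rdt_domain_hdr hdr;
	unsigned long *rmid_busy_llc;
};

#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

int main(void)
{
	struct rdt_ctrl_domain cd = {
		.hdr = { .id = 3, .type = RESCTRL_CTRL_DOMAIN },
	};
	/* What rdt_find_domain() hands back is only the header ... */
	struct rdt_domain_hdr *hdr = &cd.hdr;

	/* ... so callers verify the type before converting, as the patch
	 * does with WARN_ON_ONCE(hdr->type != RESCTRL_CTRL_DOMAIN).
	 */
	if (hdr->type == RESCTRL_CTRL_DOMAIN) {
		struct rdt_ctrl_domain *d =
			container_of(hdr, struct rdt_ctrl_domain, hdr);
		printf("ctrl domain id=%d\n", d->hdr.id);
	}
	return 0;
}

The real rdt_domain_hdr also carries the list linkage and CPU mask shared by
both domain types; those are omitted above to keep the sketch self-contained.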

diff --git a/include/linux/resctrl.h b/include/linux/resctrl.h
index 35e700edc6e6..058a940c3239 100644
--- a/include/linux/resctrl.h
+++ b/include/linux/resctrl.h
@@ -72,7 +72,23 @@ struct rdt_domain_hdr {
 };
 
 /**
- * struct rdt_domain - group of CPUs sharing a resctrl resource
+ * struct rdt_ctrl_domain - group of CPUs sharing a resctrl control resource
+ * @hdr:		common header for different domain types
+ * @plr:		pseudo-locked region (if any) associated with domain
+ * @staged_config:	parsed configuration to be applied
+ * @mbps_val:		When mba_sc is enabled, this holds the array of user
+ *			specified control values for mba_sc in MBps, indexed
+ *			by closid
+ */
+struct rdt_ctrl_domain {
+	struct rdt_domain_hdr		hdr;
+	struct pseudo_lock_region	*plr;
+	struct resctrl_staged_config	staged_config[CDP_NUM_TYPES];
+	u32				*mbps_val;
+};
+
+/**
+ * struct rdt_mon_domain - group of CPUs sharing a resctrl monitor resource
  * @hdr:		common header for different domain types
  * @rmid_busy_llc:	bitmap of which limbo RMIDs are above threshold
  * @mbm_total:		saved state for MBM total bandwidth
@@ -81,13 +97,8 @@ struct rdt_domain_hdr {
  * @cqm_limbo:		worker to periodically read CQM h/w counters
  * @mbm_work_cpu:	worker CPU for MBM h/w counters
  * @cqm_work_cpu:	worker CPU for CQM h/w counters
- * @plr:		pseudo-locked region (if any) associated with domain
- * @staged_config:	parsed configuration to be applied
- * @mbps_val:		When mba_sc is enabled, this holds the array of user
- *			specified control values for mba_sc in MBps, indexed
- *			by closid
  */
-struct rdt_domain {
+struct rdt_mon_domain {
 	struct rdt_domain_hdr		hdr;
 	unsigned long			*rmid_busy_llc;
 	struct mbm_state		*mbm_total;
@@ -96,9 +107,6 @@ struct rdt_domain {
 	struct delayed_work		cqm_limbo;
 	int				mbm_work_cpu;
 	int				cqm_work_cpu;
-	struct pseudo_lock_region	*plr;
-	struct resctrl_staged_config	staged_config[CDP_NUM_TYPES];
-	u32				*mbps_val;
 };
 
 /**
@@ -202,7 +210,7 @@ struct rdt_resource {
 	const char		*format_str;
 	int			(*parse_ctrlval)(struct rdt_parse_data *data,
 						 struct resctrl_schema *s,
-						 struct rdt_domain *d);
+						 struct rdt_ctrl_domain *d);
 	struct list_head	evt_list;
 	unsigned long		fflags;
 	bool			cdp_capable;
@@ -236,15 +244,15 @@ int resctrl_arch_update_domains(struct rdt_resource *r, u32 closid);
  * Update the ctrl_val and apply this config right now.
  * Must be called on one of the domain's CPUs.
  */
-int resctrl_arch_update_one(struct rdt_resource *r, struct rdt_domain *d,
+int resctrl_arch_update_one(struct rdt_resource *r, struct rdt_ctrl_domain *d,
 			    u32 closid, enum resctrl_conf_type t, u32 cfg_val);
 
-u32 resctrl_arch_get_config(struct rdt_resource *r, struct rdt_domain *d,
+u32 resctrl_arch_get_config(struct rdt_resource *r, struct rdt_ctrl_domain *d,
 			    u32 closid, enum resctrl_conf_type type);
-int resctrl_online_ctrl_domain(struct rdt_resource *r, struct rdt_domain *d);
-int resctrl_online_mon_domain(struct rdt_resource *r, struct rdt_domain *d);
-void resctrl_offline_ctrl_domain(struct rdt_resource *r, struct rdt_domain *d);
-void resctrl_offline_mon_domain(struct rdt_resource *r, struct rdt_domain *d);
+int resctrl_online_ctrl_domain(struct rdt_resource *r, struct rdt_ctrl_domain *d);
+int resctrl_online_mon_domain(struct rdt_resource *r, struct rdt_mon_domain *d);
+void resctrl_offline_ctrl_domain(struct rdt_resource *r, struct rdt_ctrl_domain *d);
+void resctrl_offline_mon_domain(struct rdt_resource *r, struct rdt_mon_domain *d);
 
 /**
  * resctrl_arch_rmid_read() - Read the eventid counter corresponding to rmid
@@ -260,7 +268,7 @@ void resctrl_offline_mon_domain(struct rdt_resource *r, struct rdt_domain *d);
  * Return:
  * 0 on success, or -EIO, -EINVAL etc on error.
  */
-int resctrl_arch_rmid_read(struct rdt_resource *r, struct rdt_domain *d,
+int resctrl_arch_rmid_read(struct rdt_resource *r, struct rdt_mon_domain *d,
 			   u32 rmid, enum resctrl_event_id eventid, u64 *val);
 
 /**
@@ -273,7 +281,7 @@ int resctrl_arch_rmid_read(struct rdt_resource *r, struct rdt_domain *d,
  *
  * This can be called from any CPU.
  */
-void resctrl_arch_reset_rmid(struct rdt_resource *r, struct rdt_domain *d,
+void resctrl_arch_reset_rmid(struct rdt_resource *r, struct rdt_mon_domain *d,
 			     u32 rmid, enum resctrl_event_id eventid);
 
 /**
@@ -285,7 +293,7 @@ void resctrl_arch_reset_rmid(struct rdt_resource *r, struct rdt_domain *d,
  *
  * This can be called from any CPU.
  */
-void resctrl_arch_reset_rmid_all(struct rdt_resource *r, struct rdt_domain *d);
+void resctrl_arch_reset_rmid_all(struct rdt_resource *r, struct rdt_mon_domain *d);
 
 extern unsigned int resctrl_rmid_realloc_threshold;
 extern unsigned int resctrl_rmid_realloc_limit;
diff --git a/arch/x86/kernel/cpu/resctrl/internal.h b/arch/x86/kernel/cpu/resctrl/internal.h
index c2b0f563ddf9..3bfd1cf25b49 100644
--- a/arch/x86/kernel/cpu/resctrl/internal.h
+++ b/arch/x86/kernel/cpu/resctrl/internal.h
@@ -106,7 +106,7 @@ union mon_data_bits {
 struct rmid_read {
 	struct rdtgroup		*rgrp;
 	struct rdt_resource	*r;
-	struct rdt_domain	*d;
+	struct rdt_mon_domain	*d;
 	enum resctrl_event_id	evtid;
 	bool			first;
 	int			err;
@@ -191,7 +191,7 @@ struct mongroup {
  */
 struct pseudo_lock_region {
 	struct resctrl_schema	*s;
-	struct rdt_domain	*d;
+	struct rdt_ctrl_domain	*d;
 	u32			cbm;
 	wait_queue_head_t	lock_thread_wq;
 	int			thread_done;
@@ -314,25 +314,41 @@ struct arch_mbm_state {
 };
 
 /**
- * struct rdt_hw_domain - Arch private attributes of a set of CPUs that share
- *			  a resource
+ * struct rdt_hw_ctrl_domain - Arch private attributes of a set of CPUs that share
+ *			       a resource for a control function
  * @d_resctrl:	Properties exposed to the resctrl file system
  * @ctrl_val:	array of cache or mem ctrl values (indexed by CLOSID)
+ *
+ * Members of this structure are accessed via helpers that provide abstraction.
+ */
+struct rdt_hw_ctrl_domain {
+	struct rdt_ctrl_domain		d_resctrl;
+	u32				*ctrl_val;
+};
+
+/**
+ * struct rdt_hw_mon_domain - Arch private attributes of a set of CPUs that share
+ *			      a resource for a monitor function
+ * @d_resctrl:	Properties exposed to the resctrl file system
  * @arch_mbm_total:	arch private state for MBM total bandwidth
  * @arch_mbm_local:	arch private state for MBM local bandwidth
  *
  * Members of this structure are accessed via helpers that provide abstraction.
  */
-struct rdt_hw_domain {
-	struct rdt_domain		d_resctrl;
-	u32				*ctrl_val;
+struct rdt_hw_mon_domain {
+	struct rdt_mon_domain		d_resctrl;
 	struct arch_mbm_state		*arch_mbm_total;
 	struct arch_mbm_state		*arch_mbm_local;
 };
 
-static inline struct rdt_hw_domain *resctrl_to_arch_dom(struct rdt_domain *r)
+static inline struct rdt_hw_ctrl_domain *resctrl_to_arch_ctrl_dom(struct rdt_ctrl_domain *r)
+{
+	return container_of(r, struct rdt_hw_ctrl_domain, d_resctrl);
+}
+
+static inline struct rdt_hw_mon_domain *resctrl_to_arch_mon_dom(struct rdt_mon_domain *r)
 {
-	return container_of(r, struct rdt_hw_domain, d_resctrl);
+	return container_of(r, struct rdt_hw_mon_domain, d_resctrl);
 }
 
 /**
@@ -402,7 +418,7 @@ struct rdt_hw_resource {
 	struct rdt_resource	r_resctrl;
 	u32			num_closid;
 	unsigned int		msr_base;
-	void (*msr_update)	(struct rdt_domain *d, struct msr_param *m,
+	void (*msr_update)	(struct rdt_ctrl_domain *d, struct msr_param *m,
 				 struct rdt_resource *r);
 	unsigned int		mon_scale;
 	unsigned int		mbm_width;
@@ -416,9 +432,9 @@ static inline struct rdt_hw_resource *resctrl_to_arch_res(struct rdt_resource *r
 }
 
 int parse_cbm(struct rdt_parse_data *data, struct resctrl_schema *s,
-	      struct rdt_domain *d);
+	      struct rdt_ctrl_domain *d);
 int parse_bw(struct rdt_parse_data *data, struct resctrl_schema *s,
-	     struct rdt_domain *d);
+	     struct rdt_ctrl_domain *d);
 
 extern struct mutex rdtgroup_mutex;
 
@@ -524,21 +540,21 @@ ssize_t rdtgroup_schemata_write(struct kernfs_open_file *of,
 				char *buf, size_t nbytes, loff_t off);
 int rdtgroup_schemata_show(struct kernfs_open_file *of,
 			   struct seq_file *s, void *v);
-bool rdtgroup_cbm_overlaps(struct resctrl_schema *s, struct rdt_domain *d,
+bool rdtgroup_cbm_overlaps(struct resctrl_schema *s, struct rdt_ctrl_domain *d,
 			   unsigned long cbm, int closid, bool exclusive);
-unsigned int rdtgroup_cbm_to_size(struct rdt_resource *r, struct rdt_domain *d,
+unsigned int rdtgroup_cbm_to_size(struct rdt_resource *r, struct rdt_ctrl_domain *d,
 				  unsigned long cbm);
 enum rdtgrp_mode rdtgroup_mode_by_closid(int closid);
 int rdtgroup_tasks_assigned(struct rdtgroup *r);
 int rdtgroup_locksetup_enter(struct rdtgroup *rdtgrp);
 int rdtgroup_locksetup_exit(struct rdtgroup *rdtgrp);
-bool rdtgroup_cbm_overlaps_pseudo_locked(struct rdt_domain *d, unsigned long cbm);
-bool rdtgroup_pseudo_locked_in_hierarchy(struct rdt_domain *d);
+bool rdtgroup_cbm_overlaps_pseudo_locked(struct rdt_ctrl_domain *d, unsigned long cbm);
+bool rdtgroup_pseudo_locked_in_hierarchy(struct rdt_ctrl_domain *d);
 int rdt_pseudo_lock_init(void);
 void rdt_pseudo_lock_release(void);
 int rdtgroup_pseudo_lock_create(struct rdtgroup *rdtgrp);
 void rdtgroup_pseudo_lock_remove(struct rdtgroup *rdtgrp);
-struct rdt_domain *get_ctrl_domain_from_cpu(int cpu, struct rdt_resource *r);
+struct rdt_ctrl_domain *get_ctrl_domain_from_cpu(int cpu, struct rdt_resource *r);
 int closids_supported(void);
 void closid_free(int closid);
 int alloc_rmid(void);
@@ -548,17 +564,17 @@ bool __init rdt_cpu_has(int flag);
 void mon_event_count(void *info);
 int rdtgroup_mondata_show(struct seq_file *m, void *arg);
 void mon_event_read(struct rmid_read *rr, struct rdt_resource *r,
-		    struct rdt_domain *d, struct rdtgroup *rdtgrp,
+		    struct rdt_mon_domain *d, struct rdtgroup *rdtgrp,
 		    int evtid, int first);
-void mbm_setup_overflow_handler(struct rdt_domain *dom,
+void mbm_setup_overflow_handler(struct rdt_mon_domain *dom,
 				unsigned long delay_ms);
 void mbm_handle_overflow(struct work_struct *work);
 void __init intel_rdt_mbm_apply_quirk(void);
 bool is_mba_sc(struct rdt_resource *r);
-void cqm_setup_limbo_handler(struct rdt_domain *dom, unsigned long delay_ms);
+void cqm_setup_limbo_handler(struct rdt_mon_domain *dom, unsigned long delay_ms);
 void cqm_handle_limbo(struct work_struct *work);
-bool has_busy_rmid(struct rdt_resource *r, struct rdt_domain *d);
-void __check_limbo(struct rdt_domain *d, bool force_free);
+bool has_busy_rmid(struct rdt_resource *r, struct rdt_mon_domain *d);
+void __check_limbo(struct rdt_mon_domain *d, bool force_free);
 void rdt_domain_reconfigure_cdp(struct rdt_resource *r);
 void __init thread_throttle_mode_init(void);
 void __init mbm_config_rftype_init(const char *config);
diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
index 5783ca195a99..d7c0651d337d 100644
--- a/arch/x86/kernel/cpu/resctrl/core.c
+++ b/arch/x86/kernel/cpu/resctrl/core.c
@@ -49,12 +49,12 @@ int max_name_width, max_data_width;
 bool rdt_alloc_capable;
 
 static void
-mba_wrmsr_intel(struct rdt_domain *d, struct msr_param *m,
+mba_wrmsr_intel(struct rdt_ctrl_domain *d, struct msr_param *m,
 		struct rdt_resource *r);
 static void
-cat_wrmsr(struct rdt_domain *d, struct msr_param *m, struct rdt_resource *r);
+cat_wrmsr(struct rdt_ctrl_domain *d, struct msr_param *m, struct rdt_resource *r);
 static void
-mba_wrmsr_amd(struct rdt_domain *d, struct msr_param *m,
+mba_wrmsr_amd(struct rdt_ctrl_domain *d, struct msr_param *m,
 	      struct rdt_resource *r);
 
 #define ctrl_domain_init(id) LIST_HEAD_INIT(rdt_resources_all[id].r_resctrl.ctrl_domains)
@@ -305,11 +305,11 @@ static void rdt_get_cdp_l2_config(void)
 }
 
 static void
-mba_wrmsr_amd(struct rdt_domain *d, struct msr_param *m, struct rdt_resource *r)
+mba_wrmsr_amd(struct rdt_ctrl_domain *d, struct msr_param *m, struct rdt_resource *r)
 {
-	unsigned int i;
-	struct rdt_hw_domain *hw_dom = resctrl_to_arch_dom(d);
+	struct rdt_hw_ctrl_domain *hw_dom = resctrl_to_arch_ctrl_dom(d);
 	struct rdt_hw_resource *hw_res = resctrl_to_arch_res(r);
+	unsigned int i;
 
 	for (i = m->low; i < m->high; i++)
 		wrmsrl(hw_res->msr_base + i, hw_dom->ctrl_val[i]);
@@ -330,12 +330,12 @@ static u32 delay_bw_map(unsigned long bw, struct rdt_resource *r)
 }
 
 static void
-mba_wrmsr_intel(struct rdt_domain *d, struct msr_param *m,
+mba_wrmsr_intel(struct rdt_ctrl_domain *d, struct msr_param *m,
 		struct rdt_resource *r)
 {
-	unsigned int i;
-	struct rdt_hw_domain *hw_dom = resctrl_to_arch_dom(d);
+	struct rdt_hw_ctrl_domain *hw_dom = resctrl_to_arch_ctrl_dom(d);
 	struct rdt_hw_resource *hw_res = resctrl_to_arch_res(r);
+	unsigned int i;
 
 	/*  Write the delay values for mba. */
 	for (i = m->low; i < m->high; i++)
@@ -343,19 +343,19 @@ mba_wrmsr_intel(struct rdt_domain *d, struct msr_param *m,
 }
 
 static void
-cat_wrmsr(struct rdt_domain *d, struct msr_param *m, struct rdt_resource *r)
+cat_wrmsr(struct rdt_ctrl_domain *d, struct msr_param *m, struct rdt_resource *r)
 {
-	unsigned int i;
-	struct rdt_hw_domain *hw_dom = resctrl_to_arch_dom(d);
+	struct rdt_hw_ctrl_domain *hw_dom = resctrl_to_arch_ctrl_dom(d);
 	struct rdt_hw_resource *hw_res = resctrl_to_arch_res(r);
+	unsigned int i;
 
 	for (i = m->low; i < m->high; i++)
 		wrmsrl(hw_res->msr_base + i, hw_dom->ctrl_val[i]);
 }
 
-struct rdt_domain *get_ctrl_domain_from_cpu(int cpu, struct rdt_resource *r)
+struct rdt_ctrl_domain *get_ctrl_domain_from_cpu(int cpu, struct rdt_resource *r)
 {
-	struct rdt_domain *d;
+	struct rdt_ctrl_domain *d;
 
 	list_for_each_entry(d, &r->ctrl_domains, hdr.list) {
 		/* Find the domain that contains this CPU */
@@ -377,7 +377,7 @@ void rdt_ctrl_update(void *arg)
 	struct rdt_hw_resource *hw_res = resctrl_to_arch_res(m->res);
 	struct rdt_resource *r = m->res;
 	int cpu = smp_processor_id();
-	struct rdt_domain *d;
+	struct rdt_ctrl_domain *d;
 
 	d = get_ctrl_domain_from_cpu(cpu, r);
 	if (d) {
@@ -432,18 +432,23 @@ static void setup_default_ctrlval(struct rdt_resource *r, u32 *dc)
 		*dc = r->default_ctrl;
 }
 
-static void domain_free(struct rdt_hw_domain *hw_dom)
+static void ctrl_domain_free(struct rdt_hw_ctrl_domain *hw_dom)
+{
+	kfree(hw_dom->ctrl_val);
+	kfree(hw_dom);
+}
+
+static void mon_domain_free(struct rdt_hw_mon_domain *hw_dom)
 {
 	kfree(hw_dom->arch_mbm_total);
 	kfree(hw_dom->arch_mbm_local);
-	kfree(hw_dom->ctrl_val);
 	kfree(hw_dom);
 }
 
-static int domain_setup_ctrlval(struct rdt_resource *r, struct rdt_domain *d)
+static int domain_setup_ctrlval(struct rdt_resource *r, struct rdt_ctrl_domain *d)
 {
+	struct rdt_hw_ctrl_domain *hw_dom = resctrl_to_arch_ctrl_dom(d);
 	struct rdt_hw_resource *hw_res = resctrl_to_arch_res(r);
-	struct rdt_hw_domain *hw_dom = resctrl_to_arch_dom(d);
 	struct msr_param m;
 	u32 *dc;
 
@@ -466,7 +471,7 @@ static int domain_setup_ctrlval(struct rdt_resource *r, struct rdt_domain *d)
  * @num_rmid:	The size of the MBM counter array
  * @hw_dom:	The domain that owns the allocated arrays
  */
-static int arch_domain_mbm_alloc(u32 num_rmid, struct rdt_hw_domain *hw_dom)
+static int arch_domain_mbm_alloc(u32 num_rmid, struct rdt_hw_mon_domain *hw_dom)
 {
 	size_t tsize;
 
@@ -505,10 +510,10 @@ static int get_domain_id_from_scope(int cpu, enum resctrl_scope scope)
 static void domain_add_cpu_ctrl(int cpu, struct rdt_resource *r)
 {
 	int id = get_domain_id_from_scope(cpu, r->ctrl_scope);
+	struct rdt_hw_ctrl_domain *hw_dom;
 	struct list_head *add_pos = NULL;
-	struct rdt_hw_domain *hw_dom;
 	struct rdt_domain_hdr *hdr;
-	struct rdt_domain *d;
+	struct rdt_ctrl_domain *d;
 	int err;
 
 	if (id < 0) {
@@ -522,7 +527,7 @@ static void domain_add_cpu_ctrl(int cpu, struct rdt_resource *r)
 		if (WARN_ON_ONCE(hdr->type != RESCTRL_CTRL_DOMAIN))
 			return;
 
-		d = container_of(hdr, struct rdt_domain, hdr);
+		d = container_of(hdr, struct rdt_ctrl_domain, hdr);
 
 		cpumask_set_cpu(cpu, &d->hdr.cpu_mask);
 		if (r->cache.arch_has_per_cpu_cfg)
@@ -542,7 +547,7 @@ static void domain_add_cpu_ctrl(int cpu, struct rdt_resource *r)
 	rdt_domain_reconfigure_cdp(r);
 
 	if (domain_setup_ctrlval(r, d)) {
-		domain_free(hw_dom);
+		ctrl_domain_free(hw_dom);
 		return;
 	}
 
@@ -551,17 +556,17 @@ static void domain_add_cpu_ctrl(int cpu, struct rdt_resource *r)
 	err = resctrl_online_ctrl_domain(r, d);
 	if (err) {
 		list_del(&d->hdr.list);
-		domain_free(hw_dom);
+		ctrl_domain_free(hw_dom);
 	}
 }
 
 static void domain_add_cpu_mon(int cpu, struct rdt_resource *r)
 {
 	int id = get_domain_id_from_scope(cpu, r->mon_scope);
+	struct rdt_hw_mon_domain *hw_dom;
 	struct list_head *add_pos = NULL;
-	struct rdt_hw_domain *hw_dom;
 	struct rdt_domain_hdr *hdr;
-	struct rdt_domain *d;
+	struct rdt_mon_domain *d;
 	int err;
 
 	if (id < 0) {
@@ -575,7 +580,7 @@ static void domain_add_cpu_mon(int cpu, struct rdt_resource *r)
 		if (WARN_ON_ONCE(hdr->type != RESCTRL_MON_DOMAIN))
 			return;
 
-		d = container_of(hdr, struct rdt_domain, hdr);
+		d = container_of(hdr, struct rdt_mon_domain, hdr);
 
 		cpumask_set_cpu(cpu, &d->hdr.cpu_mask);
 		return;
@@ -591,7 +596,7 @@ static void domain_add_cpu_mon(int cpu, struct rdt_resource *r)
 	cpumask_set_cpu(cpu, &d->hdr.cpu_mask);
 
 	if (arch_domain_mbm_alloc(r->num_rmid, hw_dom)) {
-		domain_free(hw_dom);
+		mon_domain_free(hw_dom);
 		return;
 	}
 
@@ -600,7 +605,7 @@ static void domain_add_cpu_mon(int cpu, struct rdt_resource *r)
 	err = resctrl_online_mon_domain(r, d);
 	if (err) {
 		list_del(&d->hdr.list);
-		domain_free(hw_dom);
+		mon_domain_free(hw_dom);
 	}
 }
 
@@ -618,9 +623,9 @@ static void domain_add_cpu(int cpu, struct rdt_resource *r)
 static void domain_remove_cpu_ctrl(int cpu, struct rdt_resource *r)
 {
 	int id = get_domain_id_from_scope(cpu, r->ctrl_scope);
-	struct rdt_hw_domain *hw_dom;
+	struct rdt_hw_ctrl_domain *hw_dom;
 	struct rdt_domain_hdr *hdr;
-	struct rdt_domain *d;
+	struct rdt_ctrl_domain *d;
 
 	if (id < 0) {
 		pr_warn_once("Can't find control domain id for CPU:%d scope:%d for resource %s\n",
@@ -638,8 +643,8 @@ static void domain_remove_cpu_ctrl(int cpu, struct rdt_resource *r)
 	if (WARN_ON_ONCE(hdr->type != RESCTRL_CTRL_DOMAIN))
 		return;
 
-	d = container_of(hdr, struct rdt_domain, hdr);
-	hw_dom = resctrl_to_arch_dom(d);
+	d = container_of(hdr, struct rdt_ctrl_domain, hdr);
+	hw_dom = resctrl_to_arch_ctrl_dom(d);
 
 	cpumask_clear_cpu(cpu, &d->hdr.cpu_mask);
 	if (cpumask_empty(&d->hdr.cpu_mask)) {
@@ -647,12 +652,12 @@ static void domain_remove_cpu_ctrl(int cpu, struct rdt_resource *r)
 		list_del(&d->hdr.list);
 
 		/*
-		 * rdt_domain "d" is going to be freed below, so clear
+		 * rdt_ctrl_domain "d" is going to be freed below, so clear
 		 * its pointer from pseudo_lock_region struct.
 		 */
 		if (d->plr)
 			d->plr->d = NULL;
-		domain_free(hw_dom);
+		ctrl_domain_free(hw_dom);
 
 		return;
 	}
@@ -661,9 +666,9 @@ static void domain_remove_cpu_ctrl(int cpu, struct rdt_resource *r)
 static void domain_remove_cpu_mon(int cpu, struct rdt_resource *r)
 {
 	int id = get_domain_id_from_scope(cpu, r->mon_scope);
-	struct rdt_hw_domain *hw_dom;
+	struct rdt_hw_mon_domain *hw_dom;
 	struct rdt_domain_hdr *hdr;
-	struct rdt_domain *d;
+	struct rdt_mon_domain *d;
 
 	if (id < 0) {
 		pr_warn_once("Can't find monitor domain id for CPU:%d scope:%d for resource %s\n",
@@ -681,14 +686,14 @@ static void domain_remove_cpu_mon(int cpu, struct rdt_resource *r)
 	if (WARN_ON_ONCE(hdr->type != RESCTRL_MON_DOMAIN))
 		return;
 
-	d = container_of(hdr, struct rdt_domain, hdr);
-	hw_dom = resctrl_to_arch_dom(d);
+	d = container_of(hdr, struct rdt_mon_domain, hdr);
+	hw_dom = resctrl_to_arch_mon_dom(d);
 
 	cpumask_clear_cpu(cpu, &d->hdr.cpu_mask);
 	if (cpumask_empty(&d->hdr.cpu_mask)) {
 		resctrl_offline_mon_domain(r, d);
 		list_del(&d->hdr.list);
-		domain_free(hw_dom);
+		mon_domain_free(hw_dom);
 
 		return;
 	}
diff --git a/arch/x86/kernel/cpu/resctrl/ctrlmondata.c b/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
index 0b4136c42762..08fc97ce4135 100644
--- a/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
+++ b/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
@@ -58,7 +58,7 @@ static bool bw_validate(char *buf, unsigned long *data, struct rdt_resource *r)
 }
 
 int parse_bw(struct rdt_parse_data *data, struct resctrl_schema *s,
-	     struct rdt_domain *d)
+	     struct rdt_ctrl_domain *d)
 {
 	struct resctrl_staged_config *cfg;
 	u32 closid = data->rdtgrp->closid;
@@ -137,7 +137,7 @@ static bool cbm_validate(char *buf, u32 *data, struct rdt_resource *r)
  * resource type.
  */
 int parse_cbm(struct rdt_parse_data *data, struct resctrl_schema *s,
-	      struct rdt_domain *d)
+	      struct rdt_ctrl_domain *d)
 {
 	struct rdtgroup *rdtgrp = data->rdtgrp;
 	struct resctrl_staged_config *cfg;
@@ -206,8 +206,8 @@ static int parse_line(char *line, struct resctrl_schema *s,
 	struct resctrl_staged_config *cfg;
 	struct rdt_resource *r = s->res;
 	struct rdt_parse_data data;
+	struct rdt_ctrl_domain *d;
 	char *dom = NULL, *id;
-	struct rdt_domain *d;
 	unsigned long dom_id;
 
 	if (rdtgrp->mode == RDT_MODE_PSEUDO_LOCKSETUP &&
@@ -267,11 +267,11 @@ static u32 get_config_index(u32 closid, enum resctrl_conf_type type)
 	}
 }
 
-static bool apply_config(struct rdt_hw_domain *hw_dom,
+static bool apply_config(struct rdt_hw_ctrl_domain *hw_dom,
 			 struct resctrl_staged_config *cfg, u32 idx,
 			 cpumask_var_t cpu_mask)
 {
-	struct rdt_domain *dom = &hw_dom->d_resctrl;
+	struct rdt_ctrl_domain *dom = &hw_dom->d_resctrl;
 
 	if (cfg->new_ctrl != hw_dom->ctrl_val[idx]) {
 		cpumask_set_cpu(cpumask_any(&dom->hdr.cpu_mask), cpu_mask);
@@ -283,11 +283,11 @@ static bool apply_config(struct rdt_hw_domain *hw_dom,
 	return false;
 }
 
-int resctrl_arch_update_one(struct rdt_resource *r, struct rdt_domain *d,
+int resctrl_arch_update_one(struct rdt_resource *r, struct rdt_ctrl_domain *d,
 			    u32 closid, enum resctrl_conf_type t, u32 cfg_val)
 {
+	struct rdt_hw_ctrl_domain *hw_dom = resctrl_to_arch_ctrl_dom(d);
 	struct rdt_hw_resource *hw_res = resctrl_to_arch_res(r);
-	struct rdt_hw_domain *hw_dom = resctrl_to_arch_dom(d);
 	u32 idx = get_config_index(closid, t);
 	struct msr_param msr_param;
 
@@ -307,11 +307,11 @@ int resctrl_arch_update_one(struct rdt_resource *r, struct rdt_domain *d,
 int resctrl_arch_update_domains(struct rdt_resource *r, u32 closid)
 {
 	struct resctrl_staged_config *cfg;
-	struct rdt_hw_domain *hw_dom;
+	struct rdt_hw_ctrl_domain *hw_dom;
 	struct msr_param msr_param;
+	struct rdt_ctrl_domain *d;
 	enum resctrl_conf_type t;
 	cpumask_var_t cpu_mask;
-	struct rdt_domain *d;
 	u32 idx;
 
 	if (!zalloc_cpumask_var(&cpu_mask, GFP_KERNEL))
@@ -319,7 +319,7 @@ int resctrl_arch_update_domains(struct rdt_resource *r, u32 closid)
 
 	msr_param.res = NULL;
 	list_for_each_entry(d, &r->ctrl_domains, hdr.list) {
-		hw_dom = resctrl_to_arch_dom(d);
+		hw_dom = resctrl_to_arch_ctrl_dom(d);
 		for (t = 0; t < CDP_NUM_TYPES; t++) {
 			cfg = &hw_dom->d_resctrl.staged_config[t];
 			if (!cfg->have_new_ctrl)
@@ -449,10 +449,10 @@ ssize_t rdtgroup_schemata_write(struct kernfs_open_file *of,
 	return ret ?: nbytes;
 }
 
-u32 resctrl_arch_get_config(struct rdt_resource *r, struct rdt_domain *d,
+u32 resctrl_arch_get_config(struct rdt_resource *r, struct rdt_ctrl_domain *d,
 			    u32 closid, enum resctrl_conf_type type)
 {
-	struct rdt_hw_domain *hw_dom = resctrl_to_arch_dom(d);
+	struct rdt_hw_ctrl_domain *hw_dom = resctrl_to_arch_ctrl_dom(d);
 	u32 idx = get_config_index(closid, type);
 
 	return hw_dom->ctrl_val[idx];
@@ -461,7 +461,7 @@ u32 resctrl_arch_get_config(struct rdt_resource *r, struct rdt_domain *d,
 static void show_doms(struct seq_file *s, struct resctrl_schema *schema, int closid)
 {
 	struct rdt_resource *r = schema->res;
-	struct rdt_domain *dom;
+	struct rdt_ctrl_domain *dom;
 	bool sep = false;
 	u32 ctrl_val;
 
@@ -523,7 +523,7 @@ int rdtgroup_schemata_show(struct kernfs_open_file *of,
 }
 
 void mon_event_read(struct rmid_read *rr, struct rdt_resource *r,
-		    struct rdt_domain *d, struct rdtgroup *rdtgrp,
+		    struct rdt_mon_domain *d, struct rdtgroup *rdtgrp,
 		    int evtid, int first)
 {
 	/*
@@ -543,11 +543,11 @@ int rdtgroup_mondata_show(struct seq_file *m, void *arg)
 {
 	struct kernfs_open_file *of = m->private;
 	struct rdt_domain_hdr *hdr;
+	struct rdt_mon_domain *d;
 	u32 resid, evtid, domid;
 	struct rdtgroup *rdtgrp;
 	struct rdt_resource *r;
 	union mon_data_bits md;
-	struct rdt_domain *d;
 	struct rmid_read rr;
 	int ret = 0;
 
@@ -568,7 +568,7 @@ int rdtgroup_mondata_show(struct seq_file *m, void *arg)
 		ret = -ENOENT;
 		goto out;
 	}
-	d = container_of(hdr, struct rdt_domain, hdr);
+	d = container_of(hdr, struct rdt_mon_domain, hdr);
 
 	mon_event_read(&rr, r, d, rdtgrp, evtid, false);
 
diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c
index 749f6662e7ca..e51bafc1945b 100644
--- a/arch/x86/kernel/cpu/resctrl/monitor.c
+++ b/arch/x86/kernel/cpu/resctrl/monitor.c
@@ -170,7 +170,7 @@ static int __rmid_read(u32 rmid, enum resctrl_event_id eventid, u64 *val)
 	return 0;
 }
 
-static struct arch_mbm_state *get_arch_mbm_state(struct rdt_hw_domain *hw_dom,
+static struct arch_mbm_state *get_arch_mbm_state(struct rdt_hw_mon_domain *hw_dom,
 						 u32 rmid,
 						 enum resctrl_event_id eventid)
 {
@@ -189,10 +189,10 @@ static struct arch_mbm_state *get_arch_mbm_state(struct rdt_hw_domain *hw_dom,
 	return NULL;
 }
 
-void resctrl_arch_reset_rmid(struct rdt_resource *r, struct rdt_domain *d,
+void resctrl_arch_reset_rmid(struct rdt_resource *r, struct rdt_mon_domain *d,
 			     u32 rmid, enum resctrl_event_id eventid)
 {
-	struct rdt_hw_domain *hw_dom = resctrl_to_arch_dom(d);
+	struct rdt_hw_mon_domain *hw_dom = resctrl_to_arch_mon_dom(d);
 	struct arch_mbm_state *am;
 
 	am = get_arch_mbm_state(hw_dom, rmid, eventid);
@@ -208,9 +208,9 @@ void resctrl_arch_reset_rmid(struct rdt_resource *r, struct rdt_domain *d,
  * Assumes that hardware counters are also reset and thus that there is
  * no need to record initial non-zero counts.
  */
-void resctrl_arch_reset_rmid_all(struct rdt_resource *r, struct rdt_domain *d)
+void resctrl_arch_reset_rmid_all(struct rdt_resource *r, struct rdt_mon_domain *d)
 {
-	struct rdt_hw_domain *hw_dom = resctrl_to_arch_dom(d);
+	struct rdt_hw_mon_domain *hw_dom = resctrl_to_arch_mon_dom(d);
 
 	if (is_mbm_total_enabled())
 		memset(hw_dom->arch_mbm_total, 0,
@@ -229,11 +229,11 @@ static u64 mbm_overflow_count(u64 prev_msr, u64 cur_msr, unsigned int width)
 	return chunks >> shift;
 }
 
-int resctrl_arch_rmid_read(struct rdt_resource *r, struct rdt_domain *d,
+int resctrl_arch_rmid_read(struct rdt_resource *r, struct rdt_mon_domain *d,
 			   u32 rmid, enum resctrl_event_id eventid, u64 *val)
 {
+	struct rdt_hw_mon_domain *hw_dom = resctrl_to_arch_mon_dom(d);
 	struct rdt_hw_resource *hw_res = resctrl_to_arch_res(r);
-	struct rdt_hw_domain *hw_dom = resctrl_to_arch_dom(d);
 	struct arch_mbm_state *am;
 	u64 msr_val, chunks;
 	int ret;
@@ -266,7 +266,7 @@ int resctrl_arch_rmid_read(struct rdt_resource *r, struct rdt_domain *d,
  * decrement the count. If the busy count gets to zero on an RMID, we
  * free the RMID
  */
-void __check_limbo(struct rdt_domain *d, bool force_free)
+void __check_limbo(struct rdt_mon_domain *d, bool force_free)
 {
 	struct rdt_resource *r = &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl;
 	struct rmid_entry *entry;
@@ -305,7 +305,7 @@ void __check_limbo(struct rdt_domain *d, bool force_free)
 	}
 }
 
-bool has_busy_rmid(struct rdt_resource *r, struct rdt_domain *d)
+bool has_busy_rmid(struct rdt_resource *r, struct rdt_mon_domain *d)
 {
 	return find_first_bit(d->rmid_busy_llc, r->num_rmid) != r->num_rmid;
 }
@@ -334,7 +334,7 @@ int alloc_rmid(void)
 static void add_rmid_to_limbo(struct rmid_entry *entry)
 {
 	struct rdt_resource *r = &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl;
-	struct rdt_domain *d;
+	struct rdt_mon_domain *d;
 	int cpu, err;
 	u64 val = 0;
 
@@ -383,7 +383,7 @@ void free_rmid(u32 rmid)
 		list_add_tail(&entry->list, &rmid_free_lru);
 }
 
-static struct mbm_state *get_mbm_state(struct rdt_domain *d, u32 rmid,
+static struct mbm_state *get_mbm_state(struct rdt_mon_domain *d, u32 rmid,
 				       enum resctrl_event_id evtid)
 {
 	switch (evtid) {
@@ -513,12 +513,12 @@ void mon_event_count(void *info)
  * throttle MSRs already have low percentage values.  To avoid
  * unnecessarily restricting such rdtgroups, we also increase the bandwidth.
  */
-static void update_mba_bw(struct rdtgroup *rgrp, struct rdt_domain *dom_mbm)
+static void update_mba_bw(struct rdtgroup *rgrp, struct rdt_mon_domain *dom_mbm)
 {
 	u32 closid, rmid, cur_msr_val, new_msr_val;
 	struct mbm_state *pmbm_data, *cmbm_data;
+	struct rdt_ctrl_domain *dom_mba;
 	struct rdt_resource *r_mba;
-	struct rdt_domain *dom_mba;
 	struct list_head *head;
 	struct rdtgroup *entry;
 	u32 cur_bw, user_bw;
@@ -578,7 +578,7 @@ static void update_mba_bw(struct rdtgroup *rgrp, struct rdt_domain *dom_mbm)
 	resctrl_arch_update_one(r_mba, dom_mba, closid, CDP_NONE, new_msr_val);
 }
 
-static void mbm_update(struct rdt_resource *r, struct rdt_domain *d, int rmid)
+static void mbm_update(struct rdt_resource *r, struct rdt_mon_domain *d, int rmid)
 {
 	struct rmid_read rr;
 
@@ -618,13 +618,13 @@ void cqm_handle_limbo(struct work_struct *work)
 {
 	unsigned long delay = msecs_to_jiffies(CQM_LIMBOCHECK_INTERVAL);
 	int cpu = smp_processor_id();
+	struct rdt_mon_domain *d;
 	struct rdt_resource *r;
-	struct rdt_domain *d;
 
 	mutex_lock(&rdtgroup_mutex);
 
 	r = &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl;
-	d = container_of(work, struct rdt_domain, cqm_limbo.work);
+	d = container_of(work, struct rdt_mon_domain, cqm_limbo.work);
 
 	__check_limbo(d, false);
 
@@ -634,7 +634,7 @@ void cqm_handle_limbo(struct work_struct *work)
 	mutex_unlock(&rdtgroup_mutex);
 }
 
-void cqm_setup_limbo_handler(struct rdt_domain *dom, unsigned long delay_ms)
+void cqm_setup_limbo_handler(struct rdt_mon_domain *dom, unsigned long delay_ms)
 {
 	unsigned long delay = msecs_to_jiffies(delay_ms);
 	int cpu;
@@ -650,9 +650,9 @@ void mbm_handle_overflow(struct work_struct *work)
 	unsigned long delay = msecs_to_jiffies(MBM_OVERFLOW_INTERVAL);
 	struct rdtgroup *prgrp, *crgrp;
 	int cpu = smp_processor_id();
+	struct rdt_mon_domain *d;
 	struct list_head *head;
 	struct rdt_resource *r;
-	struct rdt_domain *d;
 
 	mutex_lock(&rdtgroup_mutex);
 
@@ -660,7 +660,7 @@ void mbm_handle_overflow(struct work_struct *work)
 		goto out_unlock;
 
 	r = &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl;
-	d = container_of(work, struct rdt_domain, mbm_over.work);
+	d = container_of(work, struct rdt_mon_domain, mbm_over.work);
 
 	list_for_each_entry(prgrp, &rdt_all_groups, rdtgroup_list) {
 		mbm_update(r, d, prgrp->mon.rmid);
@@ -679,7 +679,7 @@ void mbm_handle_overflow(struct work_struct *work)
 	mutex_unlock(&rdtgroup_mutex);
 }
 
-void mbm_setup_overflow_handler(struct rdt_domain *dom, unsigned long delay_ms)
+void mbm_setup_overflow_handler(struct rdt_mon_domain *dom, unsigned long delay_ms)
 {
 	unsigned long delay = msecs_to_jiffies(delay_ms);
 	int cpu;
diff --git a/arch/x86/kernel/cpu/resctrl/pseudo_lock.c b/arch/x86/kernel/cpu/resctrl/pseudo_lock.c
index ed6d59af1cef..08d35f828bc3 100644
--- a/arch/x86/kernel/cpu/resctrl/pseudo_lock.c
+++ b/arch/x86/kernel/cpu/resctrl/pseudo_lock.c
@@ -814,7 +814,7 @@ int rdtgroup_locksetup_exit(struct rdtgroup *rdtgrp)
  * Return: true if @cbm overlaps with pseudo-locked region on @d, false
  * otherwise.
  */
-bool rdtgroup_cbm_overlaps_pseudo_locked(struct rdt_domain *d, unsigned long cbm)
+bool rdtgroup_cbm_overlaps_pseudo_locked(struct rdt_ctrl_domain *d, unsigned long cbm)
 {
 	unsigned int cbm_len;
 	unsigned long cbm_b;
@@ -841,11 +841,11 @@ bool rdtgroup_cbm_overlaps_pseudo_locked(struct rdt_domain *d, unsigned long cbm
  *         if it is not possible to test due to memory allocation issue,
  *         false otherwise.
  */
-bool rdtgroup_pseudo_locked_in_hierarchy(struct rdt_domain *d)
+bool rdtgroup_pseudo_locked_in_hierarchy(struct rdt_ctrl_domain *d)
 {
+	struct rdt_ctrl_domain *d_i;
 	cpumask_var_t cpu_with_psl;
 	struct rdt_resource *r;
-	struct rdt_domain *d_i;
 	bool ret = false;
 
 	if (!zalloc_cpumask_var(&cpu_with_psl, GFP_KERNEL))
diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
index 5cc6e9b11a16..441d2f744ccc 100644
--- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
+++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
@@ -85,8 +85,8 @@ void rdt_last_cmd_printf(const char *fmt, ...)
 
 void rdt_staged_configs_clear(void)
 {
+	struct rdt_ctrl_domain *dom;
 	struct rdt_resource *r;
-	struct rdt_domain *dom;
 
 	lockdep_assert_held(&rdtgroup_mutex);
 
@@ -976,7 +976,7 @@ static int rdt_bit_usage_show(struct kernfs_open_file *of,
 	unsigned long sw_shareable = 0, hw_shareable = 0;
 	unsigned long exclusive = 0, pseudo_locked = 0;
 	struct rdt_resource *r = s->res;
-	struct rdt_domain *dom;
+	struct rdt_ctrl_domain *dom;
 	int i, hwb, swb, excl, psl;
 	enum rdtgrp_mode mode;
 	bool sep = false;
@@ -1205,7 +1205,7 @@ static int rdt_has_sparse_bitmasks_show(struct kernfs_open_file *of,
  *
  * Return: false if CBM does not overlap, true if it does.
  */
-static bool __rdtgroup_cbm_overlaps(struct rdt_resource *r, struct rdt_domain *d,
+static bool __rdtgroup_cbm_overlaps(struct rdt_resource *r, struct rdt_ctrl_domain *d,
 				    unsigned long cbm, int closid,
 				    enum resctrl_conf_type type, bool exclusive)
 {
@@ -1260,7 +1260,7 @@ static bool __rdtgroup_cbm_overlaps(struct rdt_resource *r, struct rdt_domain *d
  *
  * Return: true if CBM overlap detected, false if there is no overlap
  */
-bool rdtgroup_cbm_overlaps(struct resctrl_schema *s, struct rdt_domain *d,
+bool rdtgroup_cbm_overlaps(struct resctrl_schema *s, struct rdt_ctrl_domain *d,
 			   unsigned long cbm, int closid, bool exclusive)
 {
 	enum resctrl_conf_type peer_type = resctrl_peer_type(s->conf_type);
@@ -1291,10 +1291,10 @@ bool rdtgroup_cbm_overlaps(struct resctrl_schema *s, struct rdt_domain *d,
 static bool rdtgroup_mode_test_exclusive(struct rdtgroup *rdtgrp)
 {
 	int closid = rdtgrp->closid;
+	struct rdt_ctrl_domain *d;
 	struct resctrl_schema *s;
 	struct rdt_resource *r;
 	bool has_cache = false;
-	struct rdt_domain *d;
 	u32 ctrl;
 
 	list_for_each_entry(s, &resctrl_schema_all, list) {
@@ -1407,7 +1407,7 @@ static ssize_t rdtgroup_mode_write(struct kernfs_open_file *of,
  * bitmap functions work correctly.
  */
 unsigned int rdtgroup_cbm_to_size(struct rdt_resource *r,
-				  struct rdt_domain *d, unsigned long cbm)
+				  struct rdt_ctrl_domain *d, unsigned long cbm)
 {
 	struct cpu_cacheinfo *ci;
 	unsigned int size = 0;
@@ -1439,9 +1439,9 @@ static int rdtgroup_size_show(struct kernfs_open_file *of,
 {
 	struct resctrl_schema *schema;
 	enum resctrl_conf_type type;
+	struct rdt_ctrl_domain *d;
 	struct rdtgroup *rdtgrp;
 	struct rdt_resource *r;
-	struct rdt_domain *d;
 	unsigned int size;
 	int ret = 0;
 	u32 closid;
@@ -1553,7 +1553,7 @@ static void mon_event_config_read(void *info)
 	mon_info->mon_config = msrval & MAX_EVT_CONFIG_BITS;
 }
 
-static void mondata_config_read(struct rdt_domain *d, struct mon_config_info *mon_info)
+static void mondata_config_read(struct rdt_mon_domain *d, struct mon_config_info *mon_info)
 {
 	smp_call_function_any(&d->hdr.cpu_mask, mon_event_config_read, mon_info, 1);
 }
@@ -1561,7 +1561,7 @@ static void mondata_config_read(struct rdt_domain *d, struct mon_config_info *mo
 static int mbm_config_show(struct seq_file *s, struct rdt_resource *r, u32 evtid)
 {
 	struct mon_config_info mon_info = {0};
-	struct rdt_domain *dom;
+	struct rdt_mon_domain *dom;
 	bool sep = false;
 
 	mutex_lock(&rdtgroup_mutex);
@@ -1618,7 +1618,7 @@ static void mon_event_config_write(void *info)
 }
 
 static void mbm_config_write_domain(struct rdt_resource *r,
-				    struct rdt_domain *d, u32 evtid, u32 val)
+				    struct rdt_mon_domain *d, u32 evtid, u32 val)
 {
 	struct mon_config_info mon_info = {0};
 
@@ -1659,7 +1659,7 @@ static int mon_config_write(struct rdt_resource *r, char *tok, u32 evtid)
 	struct rdt_hw_resource *hw_res = resctrl_to_arch_res(r);
 	char *dom_str = NULL, *id_str;
 	unsigned long dom_id, val;
-	struct rdt_domain *d;
+	struct rdt_mon_domain *d;
 
 next:
 	if (!tok || tok[0] == '\0')
@@ -2211,9 +2211,9 @@ static inline bool is_mba_linear(void)
 static int set_cache_qos_cfg(int level, bool enable)
 {
 	void (*update)(void *arg);
+	struct rdt_ctrl_domain *d;
 	struct rdt_resource *r_l;
 	cpumask_var_t cpu_mask;
-	struct rdt_domain *d;
 	int cpu;
 
 	if (level == RDT_RESOURCE_L3)
@@ -2260,7 +2260,7 @@ void rdt_domain_reconfigure_cdp(struct rdt_resource *r)
 		l3_qos_cfg_update(&hw_res->cdp_enabled);
 }
 
-static int mba_sc_domain_allocate(struct rdt_resource *r, struct rdt_domain *d)
+static int mba_sc_domain_allocate(struct rdt_resource *r, struct rdt_ctrl_domain *d)
 {
 	u32 num_closid = resctrl_arch_get_num_closid(r);
 	int cpu = cpumask_any(&d->hdr.cpu_mask);
@@ -2278,7 +2278,7 @@ static int mba_sc_domain_allocate(struct rdt_resource *r, struct rdt_domain *d)
 }
 
 static void mba_sc_domain_destroy(struct rdt_resource *r,
-				  struct rdt_domain *d)
+				  struct rdt_ctrl_domain *d)
 {
 	kfree(d->mbps_val);
 	d->mbps_val = NULL;
@@ -2304,7 +2304,7 @@ static int set_mba_sc(bool mba_sc)
 {
 	struct rdt_resource *r = &rdt_resources_all[RDT_RESOURCE_MBA].r_resctrl;
 	u32 num_closid = resctrl_arch_get_num_closid(r);
-	struct rdt_domain *d;
+	struct rdt_ctrl_domain *d;
 	int i;
 
 	if (!supports_mba_mbps() || mba_sc == is_mba_sc(r))
@@ -2573,7 +2573,7 @@ static int rdt_get_tree(struct fs_context *fc)
 {
 	struct rdt_fs_context *ctx = rdt_fc2context(fc);
 	unsigned long flags = RFTYPE_CTRL_BASE;
-	struct rdt_domain *dom;
+	struct rdt_mon_domain *dom;
 	struct rdt_resource *r;
 	int ret;
 
@@ -2757,10 +2757,10 @@ static int rdt_init_fs_context(struct fs_context *fc)
 static int reset_all_ctrls(struct rdt_resource *r)
 {
 	struct rdt_hw_resource *hw_res = resctrl_to_arch_res(r);
-	struct rdt_hw_domain *hw_dom;
+	struct rdt_hw_ctrl_domain *hw_dom;
 	struct msr_param msr_param;
+	struct rdt_ctrl_domain *d;
 	cpumask_var_t cpu_mask;
-	struct rdt_domain *d;
 	int i;
 
 	if (!zalloc_cpumask_var(&cpu_mask, GFP_KERNEL))
@@ -2776,7 +2776,7 @@ static int reset_all_ctrls(struct rdt_resource *r)
 	 * from each domain to update the MSRs below.
 	 */
 	list_for_each_entry(d, &r->ctrl_domains, hdr.list) {
-		hw_dom = resctrl_to_arch_dom(d);
+		hw_dom = resctrl_to_arch_ctrl_dom(d);
 		cpumask_set_cpu(cpumask_any(&d->hdr.cpu_mask), cpu_mask);
 
 		for (i = 0; i < hw_res->num_closid; i++)
@@ -2971,7 +2971,7 @@ static void rmdir_mondata_subdir_allrdtgrp(struct rdt_resource *r,
 }
 
 static int mkdir_mondata_subdir(struct kernfs_node *parent_kn,
-				struct rdt_domain *d,
+				struct rdt_mon_domain *d,
 				struct rdt_resource *r, struct rdtgroup *prgrp)
 {
 	union mon_data_bits priv;
@@ -3020,7 +3020,7 @@ static int mkdir_mondata_subdir(struct kernfs_node *parent_kn,
  * and "monitor" groups with given domain id.
  */
 static void mkdir_mondata_subdir_allrdtgrp(struct rdt_resource *r,
-					   struct rdt_domain *d)
+					   struct rdt_mon_domain *d)
 {
 	struct kernfs_node *parent_kn;
 	struct rdtgroup *prgrp, *crgrp;
@@ -3042,7 +3042,7 @@ static int mkdir_mondata_subdir_alldom(struct kernfs_node *parent_kn,
 				       struct rdt_resource *r,
 				       struct rdtgroup *prgrp)
 {
-	struct rdt_domain *dom;
+	struct rdt_mon_domain *dom;
 	int ret;
 
 	list_for_each_entry(dom, &r->mon_domains, hdr.list) {
@@ -3144,7 +3144,7 @@ static u32 cbm_ensure_valid(u32 _val, struct rdt_resource *r)
  * Set the RDT domain up to start off with all usable allocations. That is,
  * all shareable and unused bits. All-zero CBM is invalid.
  */
-static int __init_one_rdt_domain(struct rdt_domain *d, struct resctrl_schema *s,
+static int __init_one_rdt_domain(struct rdt_ctrl_domain *d, struct resctrl_schema *s,
 				 u32 closid)
 {
 	enum resctrl_conf_type peer_type = resctrl_peer_type(s->conf_type);
@@ -3224,7 +3224,7 @@ static int __init_one_rdt_domain(struct rdt_domain *d, struct resctrl_schema *s,
  */
 static int rdtgroup_init_cat(struct resctrl_schema *s, u32 closid)
 {
-	struct rdt_domain *d;
+	struct rdt_ctrl_domain *d;
 	int ret;
 
 	list_for_each_entry(d, &s->res->ctrl_domains, hdr.list) {
@@ -3240,7 +3240,7 @@ static int rdtgroup_init_cat(struct resctrl_schema *s, u32 closid)
 static void rdtgroup_init_mba(struct rdt_resource *r, u32 closid)
 {
 	struct resctrl_staged_config *cfg;
-	struct rdt_domain *d;
+	struct rdt_ctrl_domain *d;
 
 	list_for_each_entry(d, &r->ctrl_domains, hdr.list) {
 		if (is_mba_sc(r)) {
@@ -3837,14 +3837,14 @@ static void __init rdtgroup_setup_default(void)
 	mutex_unlock(&rdtgroup_mutex);
 }
 
-static void domain_destroy_mon_state(struct rdt_domain *d)
+static void domain_destroy_mon_state(struct rdt_mon_domain *d)
 {
 	bitmap_free(d->rmid_busy_llc);
 	kfree(d->mbm_total);
 	kfree(d->mbm_local);
 }
 
-void resctrl_offline_ctrl_domain(struct rdt_resource *r, struct rdt_domain *d)
+void resctrl_offline_ctrl_domain(struct rdt_resource *r, struct rdt_ctrl_domain *d)
 {
 	lockdep_assert_held(&rdtgroup_mutex);
 
@@ -3852,7 +3852,7 @@ void resctrl_offline_ctrl_domain(struct rdt_resource *r, struct rdt_domain *d)
 		mba_sc_domain_destroy(r, d);
 }
 
-void resctrl_offline_mon_domain(struct rdt_resource *r, struct rdt_domain *d)
+void resctrl_offline_mon_domain(struct rdt_resource *r, struct rdt_mon_domain *d)
 {
 	lockdep_assert_held(&rdtgroup_mutex);
 
@@ -3881,7 +3881,7 @@ void resctrl_offline_mon_domain(struct rdt_resource *r, struct rdt_domain *d)
 	domain_destroy_mon_state(d);
 }
 
-static int domain_setup_mon_state(struct rdt_resource *r, struct rdt_domain *d)
+static int domain_setup_mon_state(struct rdt_resource *r, struct rdt_mon_domain *d)
 {
 	size_t tsize;
 
@@ -3911,7 +3911,7 @@ static int domain_setup_mon_state(struct rdt_resource *r, struct rdt_domain *d)
 	return 0;
 }
 
-int resctrl_online_ctrl_domain(struct rdt_resource *r, struct rdt_domain *d)
+int resctrl_online_ctrl_domain(struct rdt_resource *r, struct rdt_ctrl_domain *d)
 {
 	lockdep_assert_held(&rdtgroup_mutex);
 
@@ -3921,7 +3921,7 @@ int resctrl_online_ctrl_domain(struct rdt_resource *r, struct rdt_domain *d)
 	return 0;
 }
 
-int resctrl_online_mon_domain(struct rdt_resource *r, struct rdt_domain *d)
+int resctrl_online_mon_domain(struct rdt_resource *r, struct rdt_mon_domain *d)
 {
 	int err;
 
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 331+ messages in thread

* [PATCH v14 5/8] x86/resctrl: Add node-scope to the options for feature scope
  2024-01-26 22:38                   ` [PATCH v14 " Tony Luck
                                       ` (3 preceding siblings ...)
  2024-01-26 22:38                     ` [PATCH v14 4/8] x86/resctrl: Split the rdt_domain and rdt_hw_domain structures Tony Luck
@ 2024-01-26 22:38                     ` Tony Luck
  2024-01-26 22:38                     ` [PATCH v14 6/8] x86/resctrl: Introduce snc_nodes_per_l3_cache Tony Luck
                                       ` (4 subsequent siblings)
  9 siblings, 0 replies; 331+ messages in thread
From: Tony Luck @ 2024-01-26 22:38 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Fenghua Yu, Reinette Chatre, Peter Newman, Jonathan Corbet,
	Shuah Khan, Shaopeng Tan, James Morse, Jamie Iles, Babu Moger,
	Randy Dunlap, x86, linux-kernel, linux-doc, patches, Tony Luck,
	Shaopeng Tan

Currently supported resctrl features are all domain scoped, with the scope
matching that of the L2 or L3 caches.

Add RESCTRL_NODE as a new option for features that are scoped at the
same granularity as NUMA nodes. This is needed for Intel's Sub-NUMA
Cluster (SNC) feature where monitoring features are node scoped.

Tested-by: Shaopeng Tan <tan.shaopeng@jp.fujitsu.com>
Reviewed-by: Peter Newman <peternewman@google.com>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Reviewed-by: Shaopeng Tan <tan.shaopeng@jp.fujitsu.com>
Reviewed-by: Babu Moger <babu.moger@amd.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
---
 include/linux/resctrl.h            | 1 +
 arch/x86/kernel/cpu/resctrl/core.c | 2 ++
 2 files changed, 3 insertions(+)

diff --git a/include/linux/resctrl.h b/include/linux/resctrl.h
index 058a940c3239..b8a3a11b970d 100644
--- a/include/linux/resctrl.h
+++ b/include/linux/resctrl.h
@@ -170,6 +170,7 @@ struct resctrl_schema;
 enum resctrl_scope {
 	RESCTRL_L2_CACHE = 2,
 	RESCTRL_L3_CACHE = 3,
+	RESCTRL_NODE,
 };
 
 /**
diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
index d7c0651d337d..27523841f9fd 100644
--- a/arch/x86/kernel/cpu/resctrl/core.c
+++ b/arch/x86/kernel/cpu/resctrl/core.c
@@ -500,6 +500,8 @@ static int get_domain_id_from_scope(int cpu, enum resctrl_scope scope)
 	case RESCTRL_L2_CACHE:
 	case RESCTRL_L3_CACHE:
 		return get_cpu_cacheinfo_id(cpu, scope);
+	case RESCTRL_NODE:
+		return cpu_to_node(cpu);
 	default:
 		break;
 	}
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 331+ messages in thread

* [PATCH v14 6/8] x86/resctrl: Introduce snc_nodes_per_l3_cache
  2024-01-26 22:38                   ` [PATCH v14 " Tony Luck
                                       ` (4 preceding siblings ...)
  2024-01-26 22:38                     ` [PATCH v14 5/8] x86/resctrl: Add node-scope to the options for feature scope Tony Luck
@ 2024-01-26 22:38                     ` Tony Luck
  2024-01-26 22:38                     ` [PATCH v14 7/8] x86/resctrl: Sub NUMA Cluster detection and enable Tony Luck
                                       ` (3 subsequent siblings)
  9 siblings, 0 replies; 331+ messages in thread
From: Tony Luck @ 2024-01-26 22:38 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Fenghua Yu, Reinette Chatre, Peter Newman, Jonathan Corbet,
	Shuah Khan, Shaopeng Tan, James Morse, Jamie Iles, Babu Moger,
	Randy Dunlap, x86, linux-kernel, linux-doc, patches, Tony Luck,
	Shaopeng Tan

Intel Sub-NUMA Cluster (SNC) is a feature that subdivides the CPU cores
and memory controllers on a socket into two or more groups. These are
presented to the operating system as NUMA nodes.

This may enable some workloads to have slightly lower latency to memory
as the memory controller(s) in an SNC node are electrically closer to the
CPU cores on that SNC node. This benefit may be offset by lower bandwidth
since the memory accesses for each core can only be interleaved between
the memory controllers on the same SNC node.

Resctrl monitoring on an Intel system depends upon attaching RMIDs to tasks
to track L3 cache occupancy and memory bandwidth. There is an MSR that
controls how the RMIDs are shared between SNC nodes.

The default mode divides them numerically. E.g. when there are two SNC
nodes on a socket, the lower-numbered half of the RMIDs are given to the
first node, the remainder to the second node. This would be difficult
to use with the Linux resctrl interface as specific RMID values assigned
to resctrl groups are not visible to users.

The other mode divides the RMIDs and renumbers the ones on the second
SNC node to start from zero.
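
A minimal sketch of the renumbering arithmetic (illustration only, not
part of this patch; the counts and the physical_rmid() helper are
hypothetical). It mirrors the "rmid + rmid_offset" calculation that
__rmid_read() performs below:

#include <stdio.h>

/* Hypothetical example: 400 physical RMIDs per L3 cache, two SNC nodes. */
#define RMIDS_PER_L3	400
#define SNC_NODES	2
#define RMIDS_PER_NODE	(RMIDS_PER_L3 / SNC_NODES)	/* 200 logical RMIDs per node */

/* Each SNC node sees logical RMIDs 0..199; add that node's offset. */
static unsigned int physical_rmid(unsigned int logical_rmid, unsigned int snc_node)
{
	return logical_rmid + snc_node * RMIDS_PER_NODE;
}

int main(void)
{
	/* Logical RMID 5 on the second SNC node uses physical counter 205. */
	printf("%u\n", physical_rmid(5, 1));
	return 0;
}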

Even with this renumbering SNC mode requires several changes in resctrl
behavior for correct operation.

Add a global integer "snc_nodes_per_l3_cache" that shows how many
SNC nodes share each L3 cache. When "snc_nodes_per_l3_cache" is "1",
SNC mode is either not implemented, or not enabled.

Update all places to take appropriate action when SNC mode is enabled:
1) The number of logical RMIDs per L3 cache available for use is the
   number of physical RMIDs divided by the number of SNC nodes.
2) Likewise the "mon_scale" value must be divided by the number of SNC
   nodes.
3) The RMID renumbering operates when using the value from the
   IA32_PQR_ASSOC MSR to count accesses by a task. When reading an RMID
   counter, adjust from the logical RMID to the physical RMID value
   for the SNC node being read, and load the adjusted value into the
   IA32_QM_EVTSEL MSR.
4) Divide the L3 cache between the SNC nodes. Divide the value
   reported in the resctrl "size" file by the number of SNC
   nodes because the effective amount of cache that can be allocated
   is reduced by that factor.
5) Disable the "-o mba_MBps" mount option in SNC mode
   because the monitoring is being done per SNC node, while the
   bandwidth allocation is still done at the L3 cache scope.
   Trying to use this feedback loop might result in contradictory
   changes to the throttling level coming from each of the SNC
   node bandwidth measurements.

Tested-by: Shaopeng Tan <tan.shaopeng@jp.fujitsu.com>
Reviewed-by: Peter Newman <peternewman@google.com>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Reviewed-by: Shaopeng Tan <tan.shaopeng@jp.fujitsu.com>
Reviewed-by: Babu Moger <babu.moger@amd.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
---
 arch/x86/kernel/cpu/resctrl/internal.h |  2 ++
 arch/x86/kernel/cpu/resctrl/core.c     |  6 ++++++
 arch/x86/kernel/cpu/resctrl/monitor.c  | 16 +++++++++++++---
 arch/x86/kernel/cpu/resctrl/rdtgroup.c |  5 +++--
 4 files changed, 24 insertions(+), 5 deletions(-)

diff --git a/arch/x86/kernel/cpu/resctrl/internal.h b/arch/x86/kernel/cpu/resctrl/internal.h
index 3bfd1cf25b49..a1fb73e0681d 100644
--- a/arch/x86/kernel/cpu/resctrl/internal.h
+++ b/arch/x86/kernel/cpu/resctrl/internal.h
@@ -444,6 +444,8 @@ DECLARE_STATIC_KEY_FALSE(rdt_alloc_enable_key);
 
 extern struct dentry *debugfs_resctrl;
 
+extern unsigned int snc_nodes_per_l3_cache;
+
 enum resctrl_res_level {
 	RDT_RESOURCE_L3,
 	RDT_RESOURCE_L2,
diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
index 27523841f9fd..fcfc0b117ff7 100644
--- a/arch/x86/kernel/cpu/resctrl/core.c
+++ b/arch/x86/kernel/cpu/resctrl/core.c
@@ -48,6 +48,12 @@ int max_name_width, max_data_width;
  */
 bool rdt_alloc_capable;
 
+/*
+ * Number of SNC nodes that share each L3 cache.  Default is 1 for
+ * systems that do not support SNC, or have SNC disabled.
+ */
+unsigned int snc_nodes_per_l3_cache = 1;
+
 static void
 mba_wrmsr_intel(struct rdt_ctrl_domain *d, struct msr_param *m,
 		struct rdt_resource *r);
diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c
index e51bafc1945b..d93f50f1e1b4 100644
--- a/arch/x86/kernel/cpu/resctrl/monitor.c
+++ b/arch/x86/kernel/cpu/resctrl/monitor.c
@@ -148,8 +148,18 @@ static inline struct rmid_entry *__rmid_entry(u32 rmid)
 
 static int __rmid_read(u32 rmid, enum resctrl_event_id eventid, u64 *val)
 {
+	struct rdt_resource *r = &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl;
+	int cpu = smp_processor_id();
+	int rmid_offset = 0;
 	u64 msr_val;
 
+	/*
+	 * When SNC mode is on, need to compute the offset to read the
+	 * physical RMID counter for the node to which this CPU belongs.
+	 */
+	if (snc_nodes_per_l3_cache > 1)
+		rmid_offset = (cpu_to_node(cpu) % snc_nodes_per_l3_cache) * r->num_rmid;
+
 	/*
 	 * As per the SDM, when IA32_QM_EVTSEL.EvtID (bits 7:0) is configured
 	 * with a valid event code for supported resource type and the bits
@@ -158,7 +168,7 @@ static int __rmid_read(u32 rmid, enum resctrl_event_id eventid, u64 *val)
 	 * IA32_QM_CTR.Error (bit 63) and IA32_QM_CTR.Unavailable (bit 62)
 	 * are error bits.
 	 */
-	wrmsr(MSR_IA32_QM_EVTSEL, eventid, rmid);
+	wrmsr(MSR_IA32_QM_EVTSEL, eventid, rmid + rmid_offset);
 	rdmsrl(MSR_IA32_QM_CTR, msr_val);
 
 	if (msr_val & RMID_VAL_ERROR)
@@ -761,8 +771,8 @@ int __init rdt_get_mon_l3_config(struct rdt_resource *r)
 	int ret;
 
 	resctrl_rmid_realloc_limit = boot_cpu_data.x86_cache_size * 1024;
-	hw_res->mon_scale = boot_cpu_data.x86_cache_occ_scale;
-	r->num_rmid = boot_cpu_data.x86_cache_max_rmid + 1;
+	hw_res->mon_scale = boot_cpu_data.x86_cache_occ_scale / snc_nodes_per_l3_cache;
+	r->num_rmid = (boot_cpu_data.x86_cache_max_rmid + 1) / snc_nodes_per_l3_cache;
 	hw_res->mbm_width = MBM_CNTR_WIDTH_BASE;
 
 	if (mbm_offset > 0 && mbm_offset <= MBM_CNTR_WIDTH_OFFSET_MAX)
diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
index 441d2f744ccc..52f8e0971ff1 100644
--- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
+++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
@@ -1425,7 +1425,7 @@ unsigned int rdtgroup_cbm_to_size(struct rdt_resource *r,
 		}
 	}
 
-	return size;
+	return size / snc_nodes_per_l3_cache;
 }
 
 /*
@@ -2293,7 +2293,8 @@ static bool supports_mba_mbps(void)
 	struct rdt_resource *r = &rdt_resources_all[RDT_RESOURCE_MBA].r_resctrl;
 
 	return (is_mbm_local_enabled() &&
-		r->alloc_capable && is_mba_linear());
+		r->alloc_capable && is_mba_linear() &&
+		snc_nodes_per_l3_cache == 1);
 }
 
 /*
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 331+ messages in thread

* [PATCH v14 7/8] x86/resctrl: Sub NUMA Cluster detection and enable
  2024-01-26 22:38                   ` [PATCH v14 " Tony Luck
                                       ` (5 preceding siblings ...)
  2024-01-26 22:38                     ` [PATCH v14 6/8] x86/resctrl: Introduce snc_nodes_per_l3_cache Tony Luck
@ 2024-01-26 22:38                     ` Tony Luck
  2024-01-26 22:38                     ` [PATCH v14 8/8] x86/resctrl: Update documentation with Sub-NUMA cluster changes Tony Luck
                                       ` (2 subsequent siblings)
  9 siblings, 0 replies; 331+ messages in thread
From: Tony Luck @ 2024-01-26 22:38 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Fenghua Yu, Reinette Chatre, Peter Newman, Jonathan Corbet,
	Shuah Khan, Shaopeng Tan, James Morse, Jamie Iles, Babu Moger,
	Randy Dunlap, x86, linux-kernel, linux-doc, patches, Tony Luck,
	Shaopeng Tan

There isn't a simple hardware bit that indicates whether a CPU is
running in Sub NUMA Cluster (SNC) mode. Infer the state by comparing
the ratio of NUMA nodes to L3 cache instances.
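
For example (hypothetical counts): a two-socket system that reports four
CPU-bearing NUMA nodes but only two L3 cache instances gives a ratio of
4 / 2 = 2 SNC nodes per L3 cache, while a ratio of 1 means SNC is not
enabled.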

When SNC mode is detected, reconfigure the RMID counters by updating
the MSR_RMID_SNC_CONFIG MSR on each socket as CPUs are seen.

Clearing bit zero of the MSR divides the RMIDs and renumbers the ones
on the second SNC node to start from zero.

Tested-by: Shaopeng Tan <tan.shaopeng@jp.fujitsu.com>
Reviewed-by: Peter Newman <peternewman@google.com>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Reviewed-by: Shaopeng Tan <tan.shaopeng@jp.fujitsu.com>
Reviewed-by: Babu Moger <babu.moger@amd.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
---
 arch/x86/include/asm/msr-index.h   |   1 +
 arch/x86/kernel/cpu/resctrl/core.c | 118 +++++++++++++++++++++++++++++
 2 files changed, 119 insertions(+)

diff --git a/arch/x86/include/asm/msr-index.h b/arch/x86/include/asm/msr-index.h
index f1bd7b91b3c6..f6ba7d0397b8 100644
--- a/arch/x86/include/asm/msr-index.h
+++ b/arch/x86/include/asm/msr-index.h
@@ -1119,6 +1119,7 @@
 #define MSR_IA32_QM_CTR			0xc8e
 #define MSR_IA32_PQR_ASSOC		0xc8f
 #define MSR_IA32_L3_CBM_BASE		0xc90
+#define MSR_RMID_SNC_CONFIG		0xca0
 #define MSR_IA32_L2_CBM_BASE		0xd10
 #define MSR_IA32_MBA_THRTL_BASE		0xd50
 
diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
index fcfc0b117ff7..ad8e68ed5ec9 100644
--- a/arch/x86/kernel/cpu/resctrl/core.c
+++ b/arch/x86/kernel/cpu/resctrl/core.c
@@ -16,11 +16,14 @@
 
 #define pr_fmt(fmt)	"resctrl: " fmt
 
+#include <linux/cpu.h>
 #include <linux/slab.h>
 #include <linux/err.h>
 #include <linux/cacheinfo.h>
 #include <linux/cpuhotplug.h>
+#include <linux/mod_devicetable.h>
 
+#include <asm/cpu_device_id.h>
 #include <asm/intel-family.h>
 #include <asm/resctrl.h>
 #include "internal.h"
@@ -738,11 +741,42 @@ static void clear_closid_rmid(int cpu)
 	wrmsr(MSR_IA32_PQR_ASSOC, 0, 0);
 }
 
+/*
+ * The power-on reset value of MSR_RMID_SNC_CONFIG is 0x1
+ * which indicates that RMIDs are configured in legacy mode.
+ * This mode is incompatible with Linux resctrl semantics
+ * as RMIDs are partitioned between SNC nodes, which requires
+ * a user to know which RMID is allocated to a task.
+ * Clearing bit 0 reconfigures the RMID counters for use
+ * in Sub NUMA Cluster mode. This mode is better for Linux.
+ * The RMID space is divided between all SNC nodes with the
+ * RMIDs renumbered to start from zero in each node when
+ * counting operations from tasks. Code to read the counters
+ * must adjust RMID counter numbers based on SNC node. See
+ * __rmid_read() for code that does this.
+ */
+static void snc_remap_rmids(int cpu)
+{
+	u64 val;
+
+	/* Only need to enable once per package. */
+	if (cpumask_first(topology_core_cpumask(cpu)) != cpu)
+		return;
+
+	rdmsrl(MSR_RMID_SNC_CONFIG, val);
+	val &= ~BIT_ULL(0);
+	wrmsrl(MSR_RMID_SNC_CONFIG, val);
+}
+
 static int resctrl_online_cpu(unsigned int cpu)
 {
 	struct rdt_resource *r;
 
 	mutex_lock(&rdtgroup_mutex);
+
+	if (snc_nodes_per_l3_cache > 1)
+		snc_remap_rmids(cpu);
+
 	for_each_capable_rdt_resource(r)
 		domain_add_cpu(cpu, r);
 	/* The cpu is set in default rdtgroup after online. */
@@ -997,11 +1031,95 @@ static __init bool get_rdt_resources(void)
 	return (rdt_mon_capable || rdt_alloc_capable);
 }
 
+/* CPU models that support MSR_RMID_SNC_CONFIG */
+static const struct x86_cpu_id snc_cpu_ids[] __initconst = {
+	X86_MATCH_INTEL_FAM6_MODEL(ICELAKE_X, 0),
+	X86_MATCH_INTEL_FAM6_MODEL(SAPPHIRERAPIDS_X, 0),
+	X86_MATCH_INTEL_FAM6_MODEL(EMERALDRAPIDS_X, 0),
+	X86_MATCH_INTEL_FAM6_MODEL(GRANITERAPIDS_X, 0),
+	{}
+};
+
+/*
+ * There isn't a simple hardware bit that indicates whether a CPU is running
+ * in Sub NUMA Cluster (SNC) mode. Infer the state by comparing the
+ * ratio of NUMA nodes to L3 cache instances.
+ * It is not possible to accurately determine SNC state if the system is
+ * booted with a maxcpus=N parameter. That distorts the ratio of SNC nodes
+ * to L3 caches. It will be OK if system is booted with hyperthreading
+ * disabled (since this doesn't affect the ratio).
+ */
+static __init int snc_get_config(void)
+{
+	unsigned long *node_caches;
+	int mem_only_nodes = 0;
+	int cpu, node, ret;
+	int num_l3_caches;
+	int cache_id;
+
+	if (!x86_match_cpu(snc_cpu_ids))
+		return 1;
+
+	node_caches = bitmap_zalloc(num_possible_cpus(), GFP_KERNEL);
+	if (!node_caches)
+		return 1;
+
+	cpus_read_lock();
+
+	if (num_online_cpus() != num_present_cpus())
+		pr_warn("Some CPUs offline, SNC detection may be incorrect\n");
+
+	for_each_node(node) {
+		cpu = cpumask_first(cpumask_of_node(node));
+		if (cpu < nr_cpu_ids) {
+			cache_id = get_cpu_cacheinfo_id(cpu, 3);
+			if (cache_id != -1)
+				set_bit(cache_id, node_caches);
+		} else {
+			mem_only_nodes++;
+		}
+	}
+	cpus_read_unlock();
+
+	num_l3_caches = bitmap_weight(node_caches, num_possible_cpus());
+	kfree(node_caches);
+
+	if (!num_l3_caches)
+		goto insane;
+
+	/* sanity check #1: Number of CPU nodes must be multiple of num_l3_caches */
+	if ((nr_node_ids - mem_only_nodes) % num_l3_caches)
+		goto insane;
+
+	ret = (nr_node_ids - mem_only_nodes) / num_l3_caches;
+
+	/* sanity check #2: Only valid results are 1, 2, 3, 4 */
+	switch (ret) {
+	case 1:
+		break;
+	case 2:
+	case 3:
+	case 4:
+		rdt_resources_all[RDT_RESOURCE_L3].r_resctrl.mon_scope = RESCTRL_NODE;
+		break;
+	default:
+		goto insane;
+	}
+
+	return ret;
+insane:
+	pr_warn("SNC insanity: CPU nodes = %d num_l3_caches = %d\n",
+		(nr_node_ids - mem_only_nodes), num_l3_caches);
+	return 1;
+}
+
 static __init void rdt_init_res_defs_intel(void)
 {
 	struct rdt_hw_resource *hw_res;
 	struct rdt_resource *r;
 
+	snc_nodes_per_l3_cache = snc_get_config();
+
 	for_each_rdt_resource(r) {
 		hw_res = resctrl_to_arch_res(r);
 
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 331+ messages in thread

* [PATCH v14 8/8] x86/resctrl: Update documentation with Sub-NUMA cluster changes
  2024-01-26 22:38                   ` [PATCH v14 " Tony Luck
                                       ` (6 preceding siblings ...)
  2024-01-26 22:38                     ` [PATCH v14 7/8] x86/resctrl: Sub NUMA Cluster detection and enable Tony Luck
@ 2024-01-26 22:38                     ` Tony Luck
  2024-01-30  1:05                     ` [PATCH v14 0/8] Add support for Sub-NUMA cluster (SNC) systems Tony Luck
  2024-01-30 22:20                     ` [PATCH v15-RFC " Tony Luck
  9 siblings, 0 replies; 331+ messages in thread
From: Tony Luck @ 2024-01-26 22:38 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Fenghua Yu, Reinette Chatre, Peter Newman, Jonathan Corbet,
	Shuah Khan, Shaopeng Tan, James Morse, Jamie Iles, Babu Moger,
	Randy Dunlap, x86, linux-kernel, linux-doc, patches, Tony Luck,
	Shaopeng Tan, Babu Moger

With Sub-NUMA Cluster mode enabled the scope of monitoring resources is
per-NODE instead of per-L3 cache. Suffixes of directories with "L3" in
their name refer to Sub-NUMA nodes instead of L3 cache ids.

Users should be aware that SNC mode also affects the amount of L3 cache
available for allocation within each SNC node.
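
For example (directory names are illustrative), on a single-socket system
with two SNC nodes, the full memory bandwidth of a group whose tasks are
not bound to a single SNC node is the sum of the values read from
mon_data/mon_L3_00/mbm_total_bytes and mon_data/mon_L3_01/mbm_total_bytes.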

Tested-by: Shaopeng Tan <tan.shaopeng@jp.fujitsu.com>
Reviewed-by: Peter Newman <peternewman@google.com>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Reviewed-by: Shaopeng Tan <tan.shaopeng@jp.fujitsu.com>
Reviewed-by: Babu Moger <babu.moger@gmail.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
---
 Documentation/arch/x86/resctrl.rst | 25 +++++++++++++++++++++----
 1 file changed, 21 insertions(+), 4 deletions(-)

diff --git a/Documentation/arch/x86/resctrl.rst b/Documentation/arch/x86/resctrl.rst
index a6279df64a9d..15f1cff6ee76 100644
--- a/Documentation/arch/x86/resctrl.rst
+++ b/Documentation/arch/x86/resctrl.rst
@@ -366,10 +366,10 @@ When control is enabled all CTRL_MON groups will also contain:
 When monitoring is enabled all MON groups will also contain:
 
 "mon_data":
-	This contains a set of files organized by L3 domain and by
-	RDT event. E.g. on a system with two L3 domains there will
-	be subdirectories "mon_L3_00" and "mon_L3_01".	Each of these
-	directories have one file per event (e.g. "llc_occupancy",
+	This contains a set of files organized by L3 domain or by NUMA
+	node (depending on whether Sub-NUMA Cluster (SNC) mode is disabled
+	or enabled respectively) and by RDT event.  Each of these
+	directories has one file per event (e.g. "llc_occupancy",
 	"mbm_total_bytes", and "mbm_local_bytes"). In a MON group these
 	files provide a read out of the current value of the event for
 	all tasks in the group. In CTRL_MON groups these files provide
@@ -478,6 +478,23 @@ if non-contiguous 1s value is supported. On a system with a 20-bit mask
 each bit represents 5% of the capacity of the cache. You could partition
 the cache into four equal parts with masks: 0x1f, 0x3e0, 0x7c00, 0xf8000.
 
+Notes on Sub-NUMA Cluster mode
+==============================
+When SNC mode is enabled, Linux may load balance tasks between Sub-NUMA
+nodes much more readily than between regular NUMA nodes since the CPUs
+on Sub-NUMA nodes share the same L3 cache and the system may report
+the NUMA distance between Sub-NUMA nodes with a lower value than used
+for regular NUMA nodes.  Users who do not bind tasks to the CPUs of a
+specific Sub-NUMA node must read the "llc_occupancy", "mbm_total_bytes",
+and "mbm_local_bytes" for all Sub-NUMA nodes where the tasks may execute
+to get the full view of traffic for which the tasks were the source.
+
+The cache allocation feature still provides the same number of
+bits in a mask to control allocation into the L3 cache, but each
+of those ways has its capacity reduced because the cache is divided
+between the SNC nodes. The values reported in the resctrl
+"size" files are adjusted accordingly.
+
 Memory bandwidth Allocation and monitoring
 ==========================================
 
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 331+ messages in thread

* Re: [PATCH v14 0/8] Add support for Sub-NUMA cluster (SNC) systems
  2024-01-26 22:38                   ` [PATCH v14 " Tony Luck
                                       ` (7 preceding siblings ...)
  2024-01-26 22:38                     ` [PATCH v14 8/8] x86/resctrl: Update documentation with Sub-NUMA cluster changes Tony Luck
@ 2024-01-30  1:05                     ` Tony Luck
  2024-01-30 22:20                     ` [PATCH v15-RFC " Tony Luck
  9 siblings, 0 replies; 331+ messages in thread
From: Tony Luck @ 2024-01-30  1:05 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Fenghua Yu, Reinette Chatre, Peter Newman, Jonathan Corbet,
	Shuah Khan, Shaopeng Tan, James Morse, Jamie Iles, Babu Moger,
	Randy Dunlap, x86, linux-kernel, linux-doc, patches

I've been wondering whether the SNC patches were creating way too much
infrastructure that isn't generally useful. Specifically the capability
for a resctrl resource to have different scopes for monitoring and
control functions.

That seems like a very strange thing.

History here is that Intel CMT, MBM and L3 CAT features arrived in
quick succession, closely followed by MBA and then L2 CAT. At the time
it made sense for the "L3" resource to cover all of CMT, MBM and L3 CAT.

That became slightly odd when MBA (another L3-scoped resource) was
added. It was given its own "struct rdt_resource". AMD later added
SMBA as yet another L3-scoped resource with its own "struct rdt_resource".

I wondered how the SNC series would look if I went back to an approach
from one of the earlier versions. In that one I created a new resource
for SNC monitoring that was NODE scoped, and then made all the CMT and
MBM code switch over to using that one when SNC was enabled. That was
a bit hacky, and I moved from there to the dual domain lists per
resource.

I just tried out an approach that splits the RDT_RESOURCE_L3 resource
into separate resources, one for control, one for monitoring.

It makes the code much simpler:

 8 files changed, 235 insertions(+), 30 deletions(-) [1]

vs.

 9 files changed, 629 insertions(+), 282 deletions(-)


Tradeoffs:

The complex series (posted as v14) cleanly split the "rdt_domain"
structure into two, so each one no longer carried all the fields needed
for both control and monitor, even though only one set was ever
used. But the cost was a lot of code churn that may never be useful
for anything other than SNC.

On non-SNC systems the new series produces separate linked lists
of domains for L3 control & monitor, even though the lists are the
same, and the domain structures still carry all fields for both
control and monitor functions.

Question: Does anyone think that a single domain with different scopes for
control and monitor is inherently more useful than two separate domains?


While I get this sorted out, maybe Boris should take James' next set
of MPAM patches as they are, instead of on top of the complex SNC
series.

-Tony

[1] I'll post this set just as soon as I can get time on an SNC machine
to make sure the patches actually work.

^ permalink raw reply	[flat|nested] 331+ messages in thread

* [PATCH v15-RFC 0/8] Add support for Sub-NUMA cluster (SNC) systems
  2024-01-26 22:38                   ` [PATCH v14 " Tony Luck
                                       ` (8 preceding siblings ...)
  2024-01-30  1:05                     ` [PATCH v14 0/8] Add support for Sub-NUMA cluster (SNC) systems Tony Luck
@ 2024-01-30 22:20                     ` Tony Luck
  2024-01-30 22:20                       ` [PATCH v15-RFC 1/8] x86/resctrl: Split the RDT_RESOURCE_L3 resource Tony Luck
                                         ` (10 more replies)
  9 siblings, 11 replies; 331+ messages in thread
From: Tony Luck @ 2024-01-30 22:20 UTC (permalink / raw)
  To: Fenghua Yu, Reinette Chatre, Peter Newman, Jonathan Corbet,
	Shuah Khan, x86
  Cc: Shaopeng Tan, James Morse, Jamie Iles, Babu Moger, Randy Dunlap,
	Drew Fustini, linux-kernel, linux-doc, patches, Tony Luck

This is the re-worked version of this series that I promised to post
yesterday. Check that e-mail for the arguments for this alternate
approach.

https://lore.kernel.org/all/ZbhLRDvZrxBZDv2j@agluck-desk3/

Apologies to Drew Fustini who I'd somehow dropped from later versions
of this series. Drew: you had made a comment at one point that having
different scopes within a single resource may be useful on RISC-V.
Version 14 included that, but it's gone here. Maybe multiple resctrl
"struct resource" for a single h/w entity like L3 as I'm doing in this
version could work for you too?

Patches 1-5 are almost completely rewritten based around the new
idea to give CMT and MBM their own "resource" instead of sharing
one with L3 CAT. This removes the need for separate domain lists,
and thus most of the churn of the previous version of this series.

Patches 6-8 are largely unchanged, but I removed all the Reviewed-by
and Tested-by tags since they are now built on a completely different
base.

Patches are against tip x86/cache:

  fc747eebef73 ("x86/resctrl: Remove redundant variable in mbm_config_write_domain()")

Re-work doesn't affect the v14 cover letter, so pasting it here:

The Sub-NUMA cluster feature on some Intel processors partitions the CPUs
that share an L3 cache into two or more sets. This plays havoc with the
Resource Director Technology (RDT) monitoring features.  Prior to this
patch series, Intel has advised that SNC and RDT are incompatible.

Some of these CPUs support an MSR that can partition the RMID counters in
the same way. This allows monitoring features to be used. With the caveat
that users must be aware that Linux may migrate tasks more frequently
between SNC nodes than between "regular" NUMA nodes, so reading counters
from all SNC nodes may be needed to get a complete picture of activity
for tasks.

Cache and memory bandwidth allocation features continue to operate at
the scope of the L3 cache.

Signed-off-by: Tony Luck <tony.luck@intel.com>

Tony Luck (8):
  x86/resctrl: Split the RDT_RESOURCE_L3 resource
  x86/resctrl: Move all monitoring functions to RDT_RESOURCE_L3_MON
  x86/resctrl: Prepare for non-cache-scoped resources
  x86/resctrl: Add helper function to look up domain_id from scope
  x86/resctrl: Add "NODE" as an option for resource scope
  x86/resctrl: Introduce snc_nodes_per_l3_cache
  x86/resctrl: Sub NUMA Cluster detection and enable
  x86/resctrl: Update documentation with Sub-NUMA cluster changes

 Documentation/arch/x86/resctrl.rst        |  25 ++-
 include/linux/resctrl.h                   |  10 +-
 arch/x86/include/asm/msr-index.h          |   1 +
 arch/x86/kernel/cpu/resctrl/internal.h    |   3 +
 arch/x86/kernel/cpu/resctrl/core.c        | 181 +++++++++++++++++++++-
 arch/x86/kernel/cpu/resctrl/monitor.c     |  28 ++--
 arch/x86/kernel/cpu/resctrl/pseudo_lock.c |   6 +-
 arch/x86/kernel/cpu/resctrl/rdtgroup.c    |  12 +-
 8 files changed, 236 insertions(+), 30 deletions(-)


base-commit: fc747eebef734563cf68a512f57937c8f231834a
-- 
2.43.0


^ permalink raw reply	[flat|nested] 331+ messages in thread

* [PATCH v15-RFC 1/8] x86/resctrl: Split the RDT_RESOURCE_L3 resource
  2024-01-30 22:20                     ` [PATCH v15-RFC " Tony Luck
@ 2024-01-30 22:20                       ` Tony Luck
  2024-02-09 15:28                         ` Moger, Babu
  2024-01-30 22:20                       ` [PATCH v15-RFC 2/8] x86/resctrl: Move all monitoring functions to RDT_RESOURCE_L3_MON Tony Luck
                                         ` (9 subsequent siblings)
  10 siblings, 1 reply; 331+ messages in thread
From: Tony Luck @ 2024-01-30 22:20 UTC (permalink / raw)
  To: Fenghua Yu, Reinette Chatre, Peter Newman, Jonathan Corbet,
	Shuah Khan, x86
  Cc: Shaopeng Tan, James Morse, Jamie Iles, Babu Moger, Randy Dunlap,
	Drew Fustini, linux-kernel, linux-doc, patches, Tony Luck

The RDT_RESOURCE_L3 resource is unique in that it is used for both monitoring
and control functions. This made sense while both uses had the same
scope. But systems with Sub-NUMA clustering enabled do not follow this
pattern.

Create a new resource: RDT_RESOURCE_L3_MON ready to take over the
monitoring functions.

Signed-off-by: Tony Luck <tony.luck@intel.com>
---
 arch/x86/kernel/cpu/resctrl/internal.h |  1 +
 arch/x86/kernel/cpu/resctrl/core.c     | 10 ++++++++++
 2 files changed, 11 insertions(+)

diff --git a/arch/x86/kernel/cpu/resctrl/internal.h b/arch/x86/kernel/cpu/resctrl/internal.h
index 52e7e7deee10..c6051bc70e96 100644
--- a/arch/x86/kernel/cpu/resctrl/internal.h
+++ b/arch/x86/kernel/cpu/resctrl/internal.h
@@ -429,6 +429,7 @@ DECLARE_STATIC_KEY_FALSE(rdt_alloc_enable_key);
 extern struct dentry *debugfs_resctrl;
 
 enum resctrl_res_level {
+	RDT_RESOURCE_L3_MON,
 	RDT_RESOURCE_L3,
 	RDT_RESOURCE_L2,
 	RDT_RESOURCE_MBA,
diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
index aa9810a64258..c50f55d7790e 100644
--- a/arch/x86/kernel/cpu/resctrl/core.c
+++ b/arch/x86/kernel/cpu/resctrl/core.c
@@ -60,6 +60,16 @@ mba_wrmsr_amd(struct rdt_domain *d, struct msr_param *m,
 #define domain_init(id) LIST_HEAD_INIT(rdt_resources_all[id].r_resctrl.domains)
 
 struct rdt_hw_resource rdt_resources_all[] = {
+	[RDT_RESOURCE_L3_MON] =
+	{
+		.r_resctrl = {
+			.rid			= RDT_RESOURCE_L3_MON,
+			.name			= "L3",
+			.cache_level		= 3,
+			.domains		= domain_init(RDT_RESOURCE_L3_MON),
+			.fflags			= RFTYPE_RES_CACHE,
+		},
+	},
 	[RDT_RESOURCE_L3] =
 	{
 		.r_resctrl = {
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 331+ messages in thread

* [PATCH v15-RFC 2/8] x86/resctrl: Move all monitoring functions to RDT_RESOURCE_L3_MON
  2024-01-30 22:20                     ` [PATCH v15-RFC " Tony Luck
  2024-01-30 22:20                       ` [PATCH v15-RFC 1/8] x86/resctrl: Split the RDT_RESOURCE_L3 resource Tony Luck
@ 2024-01-30 22:20                       ` Tony Luck
  2024-02-09 15:28                         ` Moger, Babu
  2024-01-30 22:20                       ` [PATCH v15-RFC 3/8] x86/resctrl: Prepare for non-cache-scoped resources Tony Luck
                                         ` (8 subsequent siblings)
  10 siblings, 1 reply; 331+ messages in thread
From: Tony Luck @ 2024-01-30 22:20 UTC (permalink / raw)
  To: Fenghua Yu, Reinette Chatre, Peter Newman, Jonathan Corbet,
	Shuah Khan, x86
  Cc: Shaopeng Tan, James Morse, Jamie Iles, Babu Moger, Randy Dunlap,
	Drew Fustini, linux-kernel, linux-doc, patches, Tony Luck

Switch over all places that set up and use monitoring functions to
use the new resource structure.

Signed-off-by: Tony Luck <tony.luck@intel.com>
---
 arch/x86/kernel/cpu/resctrl/core.c     |  6 ++++--
 arch/x86/kernel/cpu/resctrl/monitor.c  | 12 ++++--------
 arch/x86/kernel/cpu/resctrl/rdtgroup.c |  2 +-
 3 files changed, 9 insertions(+), 11 deletions(-)

diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
index c50f55d7790e..0828575c3e13 100644
--- a/arch/x86/kernel/cpu/resctrl/core.c
+++ b/arch/x86/kernel/cpu/resctrl/core.c
@@ -591,11 +591,13 @@ static void domain_remove_cpu(int cpu, struct rdt_resource *r)
 		return;
 	}
 
-	if (r == &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl) {
+	if (r == &rdt_resources_all[RDT_RESOURCE_L3_MON].r_resctrl) {
 		if (is_mbm_enabled() && cpu == d->mbm_work_cpu) {
 			cancel_delayed_work(&d->mbm_over);
 			mbm_setup_overflow_handler(d, 0);
 		}
+	}
+	if (r == &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl) {
 		if (is_llc_occupancy_enabled() && cpu == d->cqm_work_cpu &&
 		    has_busy_rmid(r, d)) {
 			cancel_delayed_work(&d->cqm_limbo);
@@ -826,7 +828,7 @@ static __init bool get_rdt_alloc_resources(void)
 
 static __init bool get_rdt_mon_resources(void)
 {
-	struct rdt_resource *r = &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl;
+	struct rdt_resource *r = &rdt_resources_all[RDT_RESOURCE_L3_MON].r_resctrl;
 
 	if (rdt_cpu_has(X86_FEATURE_CQM_OCCUP_LLC))
 		rdt_mon_features |= (1 << QOS_L3_OCCUP_EVENT_ID);
diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c
index 3a6c069614eb..080cad0d7288 100644
--- a/arch/x86/kernel/cpu/resctrl/monitor.c
+++ b/arch/x86/kernel/cpu/resctrl/monitor.c
@@ -268,7 +268,7 @@ int resctrl_arch_rmid_read(struct rdt_resource *r, struct rdt_domain *d,
  */
 void __check_limbo(struct rdt_domain *d, bool force_free)
 {
-	struct rdt_resource *r = &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl;
+	struct rdt_resource *r = &rdt_resources_all[RDT_RESOURCE_L3_MON].r_resctrl;
 	struct rmid_entry *entry;
 	u32 crmid = 1, nrmid;
 	bool rmid_dirty;
@@ -333,7 +333,7 @@ int alloc_rmid(void)
 
 static void add_rmid_to_limbo(struct rmid_entry *entry)
 {
-	struct rdt_resource *r = &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl;
+	struct rdt_resource *r = &rdt_resources_all[RDT_RESOURCE_L3_MON].r_resctrl;
 	struct rdt_domain *d;
 	int cpu, err;
 	u64 val = 0;
@@ -623,7 +623,7 @@ void cqm_handle_limbo(struct work_struct *work)
 
 	mutex_lock(&rdtgroup_mutex);
 
-	r = &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl;
+	r = &rdt_resources_all[RDT_RESOURCE_L3_MON].r_resctrl;
 	d = container_of(work, struct rdt_domain, cqm_limbo.work);
 
 	__check_limbo(d, false);
@@ -659,7 +659,7 @@ void mbm_handle_overflow(struct work_struct *work)
 	if (!static_branch_likely(&rdt_mon_enable_key))
 		goto out_unlock;
 
-	r = &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl;
+	r = &rdt_resources_all[RDT_RESOURCE_L3_MON].r_resctrl;
 	d = container_of(work, struct rdt_domain, mbm_over.work);
 
 	list_for_each_entry(prgrp, &rdt_all_groups, rdtgroup_list) {
@@ -736,10 +736,6 @@ static struct mon_evt mbm_local_event = {
 
 /*
  * Initialize the event list for the resource.
- *
- * Note that MBM events are also part of RDT_RESOURCE_L3 resource
- * because as per the SDM the total and local memory bandwidth
- * are enumerated as part of L3 monitoring.
  */
 static void l3_mon_evt_init(struct rdt_resource *r)
 {
diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
index aa24343f1d23..9ee3a9906781 100644
--- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
+++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
@@ -2644,7 +2644,7 @@ static int rdt_get_tree(struct fs_context *fc)
 		static_branch_enable_cpuslocked(&rdt_enable_key);
 
 	if (is_mbm_enabled()) {
-		r = &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl;
+		r = &rdt_resources_all[RDT_RESOURCE_L3_MON].r_resctrl;
 		list_for_each_entry(dom, &r->domains, list)
 			mbm_setup_overflow_handler(dom, MBM_OVERFLOW_INTERVAL);
 	}
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 331+ messages in thread

* [PATCH v15-RFC 3/8] x86/resctrl: Prepare for non-cache-scoped resources
  2024-01-30 22:20                     ` [PATCH v15-RFC " Tony Luck
  2024-01-30 22:20                       ` [PATCH v15-RFC 1/8] x86/resctrl: Split the RDT_RESOURCE_L3 resource Tony Luck
  2024-01-30 22:20                       ` [PATCH v15-RFC 2/8] x86/resctrl: Move all monitoring functions to RDT_RESOURCE_L3_MON Tony Luck
@ 2024-01-30 22:20                       ` Tony Luck
  2024-02-09 15:28                         ` Moger, Babu
  2024-01-30 22:20                       ` [PATCH v15-RFC 4/8] x86/resctrl: Add helper function to look up domain_id from scope Tony Luck
                                         ` (7 subsequent siblings)
  10 siblings, 1 reply; 331+ messages in thread
From: Tony Luck @ 2024-01-30 22:20 UTC (permalink / raw)
  To: Fenghua Yu, Reinette Chatre, Peter Newman, Jonathan Corbet,
	Shuah Khan, x86
  Cc: Shaopeng Tan, James Morse, Jamie Iles, Babu Moger, Randy Dunlap,
	Drew Fustini, linux-kernel, linux-doc, patches, Tony Luck

Not all resources are scoped in line with some level of hardware cache.

Prepare by renaming the "cache_level" field to "scope" and changing
the type to an enum to ease adding new scopes.

Signed-off-by: Tony Luck <tony.luck@intel.com>
---
 include/linux/resctrl.h                   |  9 +++++++--
 arch/x86/kernel/cpu/resctrl/core.c        | 14 +++++++-------
 arch/x86/kernel/cpu/resctrl/pseudo_lock.c |  2 +-
 arch/x86/kernel/cpu/resctrl/rdtgroup.c    |  2 +-
 4 files changed, 16 insertions(+), 11 deletions(-)

diff --git a/include/linux/resctrl.h b/include/linux/resctrl.h
index 66942d7fba7f..2155dc15e636 100644
--- a/include/linux/resctrl.h
+++ b/include/linux/resctrl.h
@@ -144,13 +144,18 @@ struct resctrl_membw {
 struct rdt_parse_data;
 struct resctrl_schema;
 
+enum resctrl_scope {
+	RESCTRL_L2_CACHE = 2,
+	RESCTRL_L3_CACHE = 3,
+};
+
 /**
  * struct rdt_resource - attributes of a resctrl resource
  * @rid:		The index of the resource
  * @alloc_capable:	Is allocation available on this machine
  * @mon_capable:	Is monitor feature available on this machine
  * @num_rmid:		Number of RMIDs available
- * @cache_level:	Which cache level defines scope of this resource
+ * @scope:		Hardware scope for this resource
  * @cache:		Cache allocation related data
  * @membw:		If the component has bandwidth controls, their properties.
  * @domains:		All domains for this resource
@@ -168,7 +173,7 @@ struct rdt_resource {
 	bool			alloc_capable;
 	bool			mon_capable;
 	int			num_rmid;
-	int			cache_level;
+	enum resctrl_scope	scope;
 	struct resctrl_cache	cache;
 	struct resctrl_membw	membw;
 	struct list_head	domains;
diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
index 0828575c3e13..d89dce63397b 100644
--- a/arch/x86/kernel/cpu/resctrl/core.c
+++ b/arch/x86/kernel/cpu/resctrl/core.c
@@ -65,7 +65,7 @@ struct rdt_hw_resource rdt_resources_all[] = {
 		.r_resctrl = {
 			.rid			= RDT_RESOURCE_L3_MON,
 			.name			= "L3",
-			.cache_level		= 3,
+			.scope			= RESCTRL_L3_CACHE,
 			.domains		= domain_init(RDT_RESOURCE_L3_MON),
 			.fflags			= RFTYPE_RES_CACHE,
 		},
@@ -75,7 +75,7 @@ struct rdt_hw_resource rdt_resources_all[] = {
 		.r_resctrl = {
 			.rid			= RDT_RESOURCE_L3,
 			.name			= "L3",
-			.cache_level		= 3,
+			.scope			= RESCTRL_L3_CACHE,
 			.domains		= domain_init(RDT_RESOURCE_L3),
 			.parse_ctrlval		= parse_cbm,
 			.format_str		= "%d=%0*x",
@@ -89,7 +89,7 @@ struct rdt_hw_resource rdt_resources_all[] = {
 		.r_resctrl = {
 			.rid			= RDT_RESOURCE_L2,
 			.name			= "L2",
-			.cache_level		= 2,
+			.scope			= RESCTRL_L2_CACHE,
 			.domains		= domain_init(RDT_RESOURCE_L2),
 			.parse_ctrlval		= parse_cbm,
 			.format_str		= "%d=%0*x",
@@ -103,7 +103,7 @@ struct rdt_hw_resource rdt_resources_all[] = {
 		.r_resctrl = {
 			.rid			= RDT_RESOURCE_MBA,
 			.name			= "MB",
-			.cache_level		= 3,
+			.scope			= RESCTRL_L3_CACHE,
 			.domains		= domain_init(RDT_RESOURCE_MBA),
 			.parse_ctrlval		= parse_bw,
 			.format_str		= "%d=%*u",
@@ -115,7 +115,7 @@ struct rdt_hw_resource rdt_resources_all[] = {
 		.r_resctrl = {
 			.rid			= RDT_RESOURCE_SMBA,
 			.name			= "SMBA",
-			.cache_level		= 3,
+			.scope			= RESCTRL_L3_CACHE,
 			.domains		= domain_init(RDT_RESOURCE_SMBA),
 			.parse_ctrlval		= parse_bw,
 			.format_str		= "%d=%*u",
@@ -514,7 +514,7 @@ static int arch_domain_mbm_alloc(u32 num_rmid, struct rdt_hw_domain *hw_dom)
  */
 static void domain_add_cpu(int cpu, struct rdt_resource *r)
 {
-	int id = get_cpu_cacheinfo_id(cpu, r->cache_level);
+	int id = get_cpu_cacheinfo_id(cpu, r->scope);
 	struct list_head *add_pos = NULL;
 	struct rdt_hw_domain *hw_dom;
 	struct rdt_domain *d;
@@ -564,7 +564,7 @@ static void domain_add_cpu(int cpu, struct rdt_resource *r)
 
 static void domain_remove_cpu(int cpu, struct rdt_resource *r)
 {
-	int id = get_cpu_cacheinfo_id(cpu, r->cache_level);
+	int id = get_cpu_cacheinfo_id(cpu, r->scope);
 	struct rdt_hw_domain *hw_dom;
 	struct rdt_domain *d;
 
diff --git a/arch/x86/kernel/cpu/resctrl/pseudo_lock.c b/arch/x86/kernel/cpu/resctrl/pseudo_lock.c
index 8f559eeae08e..6a72fb627aa5 100644
--- a/arch/x86/kernel/cpu/resctrl/pseudo_lock.c
+++ b/arch/x86/kernel/cpu/resctrl/pseudo_lock.c
@@ -311,7 +311,7 @@ static int pseudo_lock_region_init(struct pseudo_lock_region *plr)
 	plr->size = rdtgroup_cbm_to_size(plr->s->res, plr->d, plr->cbm);
 
 	for (i = 0; i < ci->num_leaves; i++) {
-		if (ci->info_list[i].level == plr->s->res->cache_level) {
+		if (ci->info_list[i].level == plr->s->res->scope) {
 			plr->line_size = ci->info_list[i].coherency_line_size;
 			return 0;
 		}
diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
index 9ee3a9906781..eff9d87547c9 100644
--- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
+++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
@@ -1416,7 +1416,7 @@ unsigned int rdtgroup_cbm_to_size(struct rdt_resource *r,
 	num_b = bitmap_weight(&cbm, r->cache.cbm_len);
 	ci = get_cpu_cacheinfo(cpumask_any(&d->cpu_mask));
 	for (i = 0; i < ci->num_leaves; i++) {
-		if (ci->info_list[i].level == r->cache_level) {
+		if (ci->info_list[i].level == r->scope) {
 			size = ci->info_list[i].size / r->cache.cbm_len * num_b;
 			break;
 		}
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 331+ messages in thread

* [PATCH v15-RFC 4/8] x86/resctrl: Add helper function to look up domain_id from scope
  2024-01-30 22:20                     ` [PATCH v15-RFC " Tony Luck
                                         ` (2 preceding siblings ...)
  2024-01-30 22:20                       ` [PATCH v15-RFC 3/8] x86/resctrl: Prepare for non-cache-scoped resources Tony Luck
@ 2024-01-30 22:20                       ` Tony Luck
  2024-02-09 15:28                         ` Moger, Babu
  2024-01-30 22:20                       ` [PATCH v15-RFC 5/8] x86/resctrl: Add "NODE" as an option for resource scope Tony Luck
                                         ` (6 subsequent siblings)
  10 siblings, 1 reply; 331+ messages in thread
From: Tony Luck @ 2024-01-30 22:20 UTC (permalink / raw)
  To: Fenghua Yu, Reinette Chatre, Peter Newman, Jonathan Corbet,
	Shuah Khan, x86
  Cc: Shaopeng Tan, James Morse, Jamie Iles, Babu Moger, Randy Dunlap,
	Drew Fustini, linux-kernel, linux-doc, patches, Tony Luck

Prepare for more options for scope of resources. Add some diagnostic
messages if lookup fails.

Signed-off-by: Tony Luck <tony.luck@intel.com>
---
 arch/x86/kernel/cpu/resctrl/core.c | 29 +++++++++++++++++++++++++++--
 1 file changed, 27 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
index d89dce63397b..59e6aa7abef5 100644
--- a/arch/x86/kernel/cpu/resctrl/core.c
+++ b/arch/x86/kernel/cpu/resctrl/core.c
@@ -499,6 +499,19 @@ static int arch_domain_mbm_alloc(u32 num_rmid, struct rdt_hw_domain *hw_dom)
 	return 0;
 }
 
+static int get_domain_id_from_scope(int cpu, enum resctrl_scope scope)
+{
+	switch (scope) {
+	case RESCTRL_L2_CACHE:
+	case RESCTRL_L3_CACHE:
+		return get_cpu_cacheinfo_id(cpu, scope);
+	default:
+		break;
+	}
+
+	return -EINVAL;
+}
+
 /*
  * domain_add_cpu - Add a cpu to a resource's domain list.
  *
@@ -514,12 +527,18 @@ static int arch_domain_mbm_alloc(u32 num_rmid, struct rdt_hw_domain *hw_dom)
  */
 static void domain_add_cpu(int cpu, struct rdt_resource *r)
 {
-	int id = get_cpu_cacheinfo_id(cpu, r->scope);
+	int id = get_domain_id_from_scope(cpu, r->scope);
 	struct list_head *add_pos = NULL;
 	struct rdt_hw_domain *hw_dom;
 	struct rdt_domain *d;
 	int err;
 
+	if (id < 0) {
+		pr_warn_once("Can't find domain id for CPU:%d scope:%d for resource %s\n",
+			     cpu, r->scope, r->name);
+		return;
+	}
+
 	d = rdt_find_domain(r, id, &add_pos);
 	if (IS_ERR(d)) {
 		pr_warn("Couldn't find cache id for CPU %d\n", cpu);
@@ -564,10 +583,16 @@ static void domain_add_cpu(int cpu, struct rdt_resource *r)
 
 static void domain_remove_cpu(int cpu, struct rdt_resource *r)
 {
-	int id = get_cpu_cacheinfo_id(cpu, r->scope);
+	int id = get_domain_id_from_scope(cpu, r->scope);
 	struct rdt_hw_domain *hw_dom;
 	struct rdt_domain *d;
 
+	if (id < 0) {
+		pr_warn_once("Can't find domain id for CPU:%d scope:%d for resource %s\n",
+			     cpu, r->scope, r->name);
+		return;
+	}
+
 	d = rdt_find_domain(r, id, NULL);
 	if (IS_ERR_OR_NULL(d)) {
 		pr_warn("Couldn't find cache id for CPU %d\n", cpu);
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 331+ messages in thread

* [PATCH v15-RFC 5/8] x86/resctrl: Add "NODE" as an option for resource scope
  2024-01-30 22:20                     ` [PATCH v15-RFC " Tony Luck
                                         ` (3 preceding siblings ...)
  2024-01-30 22:20                       ` [PATCH v15-RFC 4/8] x86/resctrl: Add helper function to look up domain_id from scope Tony Luck
@ 2024-01-30 22:20                       ` Tony Luck
  2024-02-09 15:29                         ` Moger, Babu
  2024-01-30 22:20                       ` [PATCH v15-RFC 6/8] x86/resctrl: Introduce snc_nodes_per_l3_cache Tony Luck
                                         ` (5 subsequent siblings)
  10 siblings, 1 reply; 331+ messages in thread
From: Tony Luck @ 2024-01-30 22:20 UTC (permalink / raw)
  To: Fenghua Yu, Reinette Chatre, Peter Newman, Jonathan Corbet,
	Shuah Khan, x86
  Cc: Shaopeng Tan, James Morse, Jamie Iles, Babu Moger, Randy Dunlap,
	Drew Fustini, linux-kernel, linux-doc, patches, Tony Luck

Add RESCTRL_NODE to the enum, and to the helper function that looks
up a domain id from a scope.

There are a couple of places where the scope must be a cache scope.
Add some defensive WARN_ON checks to those.

Signed-off-by: Tony Luck <tony.luck@intel.com>
---
 include/linux/resctrl.h                   | 1 +
 arch/x86/kernel/cpu/resctrl/core.c        | 3 +++
 arch/x86/kernel/cpu/resctrl/pseudo_lock.c | 4 ++++
 arch/x86/kernel/cpu/resctrl/rdtgroup.c    | 3 +++
 4 files changed, 11 insertions(+)

diff --git a/include/linux/resctrl.h b/include/linux/resctrl.h
index 2155dc15e636..e3cddf3f07f8 100644
--- a/include/linux/resctrl.h
+++ b/include/linux/resctrl.h
@@ -147,6 +147,7 @@ struct resctrl_schema;
 enum resctrl_scope {
 	RESCTRL_L2_CACHE = 2,
 	RESCTRL_L3_CACHE = 3,
+	RESCTRL_NODE,
 };
 
 /**
diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
index 59e6aa7abef5..b741cbf61843 100644
--- a/arch/x86/kernel/cpu/resctrl/core.c
+++ b/arch/x86/kernel/cpu/resctrl/core.c
@@ -505,6 +505,9 @@ static int get_domain_id_from_scope(int cpu, enum resctrl_scope scope)
 	case RESCTRL_L2_CACHE:
 	case RESCTRL_L3_CACHE:
 		return get_cpu_cacheinfo_id(cpu, scope);
+	case RESCTRL_NODE:
+		return cpu_to_node(cpu);
+
 	default:
 		break;
 	}
diff --git a/arch/x86/kernel/cpu/resctrl/pseudo_lock.c b/arch/x86/kernel/cpu/resctrl/pseudo_lock.c
index 6a72fb627aa5..2bafc73b51e2 100644
--- a/arch/x86/kernel/cpu/resctrl/pseudo_lock.c
+++ b/arch/x86/kernel/cpu/resctrl/pseudo_lock.c
@@ -292,10 +292,14 @@ static void pseudo_lock_region_clear(struct pseudo_lock_region *plr)
  */
 static int pseudo_lock_region_init(struct pseudo_lock_region *plr)
 {
+	enum resctrl_scope scope = plr->s->res->scope;
 	struct cpu_cacheinfo *ci;
 	int ret;
 	int i;
 
+	if (WARN_ON_ONCE(scope != RESCTRL_L2_CACHE && scope != RESCTRL_L3_CACHE))
+		return -ENODEV;
+
 	/* Pick the first cpu we find that is associated with the cache. */
 	plr->cpu = cpumask_first(&plr->d->cpu_mask);
 
diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
index eff9d87547c9..770f2bf98462 100644
--- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
+++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
@@ -1413,6 +1413,9 @@ unsigned int rdtgroup_cbm_to_size(struct rdt_resource *r,
 	unsigned int size = 0;
 	int num_b, i;
 
+	if (WARN_ON_ONCE(r->scope != RESCTRL_L2_CACHE && r->scope != RESCTRL_L3_CACHE))
+		return size;
+
 	num_b = bitmap_weight(&cbm, r->cache.cbm_len);
 	ci = get_cpu_cacheinfo(cpumask_any(&d->cpu_mask));
 	for (i = 0; i < ci->num_leaves; i++) {
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 331+ messages in thread

* [PATCH v15-RFC 6/8] x86/resctrl: Introduce snc_nodes_per_l3_cache
  2024-01-30 22:20                     ` [PATCH v15-RFC " Tony Luck
                                         ` (4 preceding siblings ...)
  2024-01-30 22:20                       ` [PATCH v15-RFC 5/8] x86/resctrl: Add "NODE" as an option for resource scope Tony Luck
@ 2024-01-30 22:20                       ` Tony Luck
  2024-02-09 15:29                         ` Moger, Babu
  2024-01-30 22:20                       ` [PATCH v15-RFC 7/8] x86/resctrl: Sub NUMA Cluster detection and enable Tony Luck
                                         ` (4 subsequent siblings)
  10 siblings, 1 reply; 331+ messages in thread
From: Tony Luck @ 2024-01-30 22:20 UTC (permalink / raw)
  To: Fenghua Yu, Reinette Chatre, Peter Newman, Jonathan Corbet,
	Shuah Khan, x86
  Cc: Shaopeng Tan, James Morse, Jamie Iles, Babu Moger, Randy Dunlap,
	Drew Fustini, linux-kernel, linux-doc, patches, Tony Luck

Intel Sub-NUMA Cluster (SNC) is a feature that subdivides the CPU cores
and memory controllers on a socket into two or more groups. These are
presented to the operating system as NUMA nodes.

This may enable some workloads to have slightly lower latency to memory
as the memory controller(s) in an SNC node are electrically closer to the
CPU cores on that SNC node. This benefit may be offset by lower bandwidth
since the memory accesses for each core can only be interleaved between
the memory controllers on the same SNC node.

Resctrl monitoring on an Intel system depends upon attaching RMIDs to tasks
to track L3 cache occupancy and memory bandwidth. There is an MSR that
controls how the RMIDs are shared between SNC nodes.

The default mode divides them numerically. E.g. when there are two SNC
nodes on a socket the lower number half of the RMIDs are given to the
first node, the remainder to the second node. This would be difficult
to use with the Linux resctrl interface as specific RMID values assigned
to resctrl groups are not visible to users.

The other mode divides the RMIDs and renumbers the ones on the second
SNC node to start from zero.
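
As an illustration only (not part of this patch), the renumbered mode
amounts to adding a per-node offset when programming the counter MSR.
A minimal sketch, using a hypothetical helper name and assuming
num_rmid already holds the per-node (logical) RMID count:

  /*
   * Illustrative sketch: convert a group's logical RMID into the
   * physical RMID that is written to IA32_QM_EVTSEL when reading a
   * counter on the given SNC node.
   */
  static u32 logical_to_physical_rmid(u32 rmid, int snc_node,
  				      unsigned int snc_nodes_per_l3,
  				      u32 num_rmid)
  {
  	if (snc_nodes_per_l3 <= 1)
  		return rmid;	/* SNC off: no adjustment needed */

  	return rmid + (snc_node % snc_nodes_per_l3) * num_rmid;
  }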

Even with this renumbering SNC mode requires several changes in resctrl
behavior for correct operation.

Add a global integer "snc_nodes_per_l3_cache" that shows how many
SNC nodes share each L3 cache. When "snc_nodes_per_l3_cache" is "1",
SNC mode is either not implemented, or not enabled.

Update all places to take appropriate action when SNC mode is enabled:
1) The number of logical RMIDs per L3 cache available for use is the
   number of physical RMIDs divided by the number of SNC nodes.
2) Likewise the "mon_scale" value must be divided by the number of SNC
   nodes.
3) The RMID renumbering operates when using the value from the
   IA32_PQR_ASSOC MSR to count accesses by a task. When reading an RMID
   counter, adjust from the logical RMID to the physical
   RMID value for the SNC node that it wishes to read and load the
   adjusted value into the IA32_QM_EVTSEL MSR.
4) Divide the L3 cache between the SNC nodes. Divide the value
   reported in the resctrl "size" file by the number of SNC
   nodes because the effective amount of cache that can be allocated
   is reduced by that factor.
5) Disable the "-o mba_MBps" mount option in SNC mode
   because the monitoring is being done per SNC node, while the
   bandwidth allocation is still done at the L3 cache scope.
   Trying to use this feedback loop might result in contradictory
   changes to the throttling level coming from each of the SNC
   node bandwidth measurements.

Signed-off-by: Tony Luck <tony.luck@intel.com>
---
 arch/x86/kernel/cpu/resctrl/internal.h |  2 ++
 arch/x86/kernel/cpu/resctrl/core.c     |  6 ++++++
 arch/x86/kernel/cpu/resctrl/monitor.c  | 16 +++++++++++++---
 arch/x86/kernel/cpu/resctrl/rdtgroup.c |  5 +++--
 4 files changed, 24 insertions(+), 5 deletions(-)

diff --git a/arch/x86/kernel/cpu/resctrl/internal.h b/arch/x86/kernel/cpu/resctrl/internal.h
index c6051bc70e96..d9c6dcf30922 100644
--- a/arch/x86/kernel/cpu/resctrl/internal.h
+++ b/arch/x86/kernel/cpu/resctrl/internal.h
@@ -428,6 +428,8 @@ DECLARE_STATIC_KEY_FALSE(rdt_alloc_enable_key);
 
 extern struct dentry *debugfs_resctrl;
 
+extern unsigned int snc_nodes_per_l3_cache;
+
 enum resctrl_res_level {
 	RDT_RESOURCE_L3_MON,
 	RDT_RESOURCE_L3,
diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
index b741cbf61843..dc886d2c9a33 100644
--- a/arch/x86/kernel/cpu/resctrl/core.c
+++ b/arch/x86/kernel/cpu/resctrl/core.c
@@ -48,6 +48,12 @@ int max_name_width, max_data_width;
  */
 bool rdt_alloc_capable;
 
+/*
+ * Number of SNC nodes that share each L3 cache.  Default is 1 for
+ * systems that do not support SNC, or have SNC disabled.
+ */
+unsigned int snc_nodes_per_l3_cache = 1;
+
 static void
 mba_wrmsr_intel(struct rdt_domain *d, struct msr_param *m,
 		struct rdt_resource *r);
diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c
index 080cad0d7288..357919bbadbe 100644
--- a/arch/x86/kernel/cpu/resctrl/monitor.c
+++ b/arch/x86/kernel/cpu/resctrl/monitor.c
@@ -148,8 +148,18 @@ static inline struct rmid_entry *__rmid_entry(u32 rmid)
 
 static int __rmid_read(u32 rmid, enum resctrl_event_id eventid, u64 *val)
 {
+	struct rdt_resource *r = &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl;
+	int cpu = smp_processor_id();
+	int rmid_offset = 0;
 	u64 msr_val;
 
+	/*
+	 * When SNC mode is on, need to compute the offset to read the
+	 * physical RMID counter for the node to which this CPU belongs.
+	 */
+	if (snc_nodes_per_l3_cache > 1)
+		rmid_offset = (cpu_to_node(cpu) % snc_nodes_per_l3_cache) * r->num_rmid;
+
 	/*
 	 * As per the SDM, when IA32_QM_EVTSEL.EvtID (bits 7:0) is configured
 	 * with a valid event code for supported resource type and the bits
@@ -158,7 +168,7 @@ static int __rmid_read(u32 rmid, enum resctrl_event_id eventid, u64 *val)
 	 * IA32_QM_CTR.Error (bit 63) and IA32_QM_CTR.Unavailable (bit 62)
 	 * are error bits.
 	 */
-	wrmsr(MSR_IA32_QM_EVTSEL, eventid, rmid);
+	wrmsr(MSR_IA32_QM_EVTSEL, eventid, rmid + rmid_offset);
 	rdmsrl(MSR_IA32_QM_CTR, msr_val);
 
 	if (msr_val & RMID_VAL_ERROR)
@@ -757,8 +767,8 @@ int __init rdt_get_mon_l3_config(struct rdt_resource *r)
 	int ret;
 
 	resctrl_rmid_realloc_limit = boot_cpu_data.x86_cache_size * 1024;
-	hw_res->mon_scale = boot_cpu_data.x86_cache_occ_scale;
-	r->num_rmid = boot_cpu_data.x86_cache_max_rmid + 1;
+	hw_res->mon_scale = boot_cpu_data.x86_cache_occ_scale / snc_nodes_per_l3_cache;
+	r->num_rmid = (boot_cpu_data.x86_cache_max_rmid + 1) / snc_nodes_per_l3_cache;
 	hw_res->mbm_width = MBM_CNTR_WIDTH_BASE;
 
 	if (mbm_offset > 0 && mbm_offset <= MBM_CNTR_WIDTH_OFFSET_MAX)
diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
index 770f2bf98462..e639069f871a 100644
--- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
+++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
@@ -1425,7 +1425,7 @@ unsigned int rdtgroup_cbm_to_size(struct rdt_resource *r,
 		}
 	}
 
-	return size;
+	return size / snc_nodes_per_l3_cache;
 }
 
 /*
@@ -2293,7 +2293,8 @@ static bool supports_mba_mbps(void)
 	struct rdt_resource *r = &rdt_resources_all[RDT_RESOURCE_MBA].r_resctrl;
 
 	return (is_mbm_local_enabled() &&
-		r->alloc_capable && is_mba_linear());
+		r->alloc_capable && is_mba_linear() &&
+		snc_nodes_per_l3_cache == 1);
 }
 
 /*
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 331+ messages in thread

* [PATCH v15-RFC 7/8] x86/resctrl: Sub NUMA Cluster detection and enable
  2024-01-30 22:20                     ` [PATCH v15-RFC " Tony Luck
                                         ` (5 preceding siblings ...)
  2024-01-30 22:20                       ` [PATCH v15-RFC 6/8] x86/resctrl: Introduce snc_nodes_per_l3_cache Tony Luck
@ 2024-01-30 22:20                       ` Tony Luck
  2024-01-30 22:20                       ` [PATCH v15-RFC 8/8] x86/resctrl: Update documentation with Sub-NUMA cluster changes Tony Luck
                                         ` (3 subsequent siblings)
  10 siblings, 0 replies; 331+ messages in thread
From: Tony Luck @ 2024-01-30 22:20 UTC (permalink / raw)
  To: Fenghua Yu, Reinette Chatre, Peter Newman, Jonathan Corbet,
	Shuah Khan, x86
  Cc: Shaopeng Tan, James Morse, Jamie Iles, Babu Moger, Randy Dunlap,
	Drew Fustini, linux-kernel, linux-doc, patches, Tony Luck

There isn't a simple hardware bit that indicates whether a CPU is
running in Sub NUMA Cluster (SNC) mode. Infer the state by comparing
the ratio of NUMA nodes to L3 cache instances.

When SNC mode is detected, reconfigure the RMID counters by updating
the MSR_RMID_SNC_CONFIG MSR on each socket as CPUs are seen. Update
the scope of the RDT_RESOURCE_L3_MON resource to NODE.

Clearing bit zero of the MSR divides the RMIDs and renumbers the ones
on the second SNC node to start from zero.
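
As an illustrative example (numbers chosen here, not taken from any
particular system): a two-socket system with SNC2 enabled reports four
NUMA nodes but only two L3 cache instances, so nodes / caches = 4 / 2
gives two SNC nodes per L3 cache. With SNC disabled the same system
reports two nodes and two caches, the ratio is 1, and no
reconfiguration is done.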

Signed-off-by: Tony Luck <tony.luck@intel.com>
---
 arch/x86/include/asm/msr-index.h   |   1 +
 arch/x86/kernel/cpu/resctrl/core.c | 119 +++++++++++++++++++++++++++++
 2 files changed, 120 insertions(+)

diff --git a/arch/x86/include/asm/msr-index.h b/arch/x86/include/asm/msr-index.h
index f1bd7b91b3c6..f6ba7d0397b8 100644
--- a/arch/x86/include/asm/msr-index.h
+++ b/arch/x86/include/asm/msr-index.h
@@ -1119,6 +1119,7 @@
 #define MSR_IA32_QM_CTR			0xc8e
 #define MSR_IA32_PQR_ASSOC		0xc8f
 #define MSR_IA32_L3_CBM_BASE		0xc90
+#define MSR_RMID_SNC_CONFIG		0xca0
 #define MSR_IA32_L2_CBM_BASE		0xd10
 #define MSR_IA32_MBA_THRTL_BASE		0xd50
 
diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
index dc886d2c9a33..84c36e10241f 100644
--- a/arch/x86/kernel/cpu/resctrl/core.c
+++ b/arch/x86/kernel/cpu/resctrl/core.c
@@ -16,11 +16,14 @@
 
 #define pr_fmt(fmt)	"resctrl: " fmt
 
+#include <linux/cpu.h>
 #include <linux/slab.h>
 #include <linux/err.h>
 #include <linux/cacheinfo.h>
 #include <linux/cpuhotplug.h>
+#include <linux/mod_devicetable.h>
 
+#include <asm/cpu_device_id.h>
 #include <asm/intel-family.h>
 #include <asm/resctrl.h>
 #include "internal.h"
@@ -651,11 +654,42 @@ static void clear_closid_rmid(int cpu)
 	wrmsr(MSR_IA32_PQR_ASSOC, 0, 0);
 }
 
+/*
+ * The power-on reset value of MSR_RMID_SNC_CONFIG is 0x1
+ * which indicates that RMIDs are configured in legacy mode.
+ * This mode is incompatible with Linux resctrl semantics
+ * as RMIDs are partitioned between SNC nodes, which requires
+ * a user to know which RMID is allocated to a task.
+ * Clearing bit 0 reconfigures the RMID counters for use
+ * in Sub NUMA Cluster mode. This mode is better for Linux.
+ * The RMID space is divided between all SNC nodes with the
+ * RMIDs renumbered to start from zero in each node when
+ * counting operations from tasks. Code to read the counters
+ * must adjust RMID counter numbers based on SNC node. See
+ * __rmid_read() for code that does this.
+ */
+static void snc_remap_rmids(int cpu)
+{
+	u64 val;
+
+	/* Only need to enable once per package. */
+	if (cpumask_first(topology_core_cpumask(cpu)) != cpu)
+		return;
+
+	rdmsrl(MSR_RMID_SNC_CONFIG, val);
+	val &= ~BIT_ULL(0);
+	wrmsrl(MSR_RMID_SNC_CONFIG, val);
+}
+
 static int resctrl_online_cpu(unsigned int cpu)
 {
 	struct rdt_resource *r;
 
 	mutex_lock(&rdtgroup_mutex);
+
+	if (snc_nodes_per_l3_cache > 1)
+		snc_remap_rmids(cpu);
+
 	for_each_capable_rdt_resource(r)
 		domain_add_cpu(cpu, r);
 	/* The cpu is set in default rdtgroup after online. */
@@ -910,11 +944,96 @@ static __init bool get_rdt_resources(void)
 	return (rdt_mon_capable || rdt_alloc_capable);
 }
 
+/* CPU models that support MSR_RMID_SNC_CONFIG */
+static const struct x86_cpu_id snc_cpu_ids[] __initconst = {
+	X86_MATCH_INTEL_FAM6_MODEL(ICELAKE_X, 0),
+	X86_MATCH_INTEL_FAM6_MODEL(SAPPHIRERAPIDS_X, 0),
+	X86_MATCH_INTEL_FAM6_MODEL(EMERALDRAPIDS_X, 0),
+	X86_MATCH_INTEL_FAM6_MODEL(GRANITERAPIDS_X, 0),
+	{}
+};
+
+/*
+ * There isn't a simple hardware bit that indicates whether a CPU is running
+ * in Sub NUMA Cluster (SNC) mode. Infer the state by comparing the
+ * ratio of NUMA nodes to L3 cache instances.
+ * It is not possible to accurately determine SNC state if the system is
+ * booted with a maxcpus=N parameter. That distorts the ratio of SNC nodes
+ * to L3 caches. It will be OK if the system is booted with hyperthreading
+ * disabled (since this doesn't affect the ratio).
+ */
+static __init int snc_get_config(void)
+{
+	unsigned long *node_caches;
+	int mem_only_nodes = 0;
+	int cpu, node, ret;
+	int num_l3_caches;
+	int cache_id;
+
+	if (!x86_match_cpu(snc_cpu_ids))
+		return 1;
+
+	node_caches = bitmap_zalloc(num_possible_cpus(), GFP_KERNEL);
+	if (!node_caches)
+		return 1;
+
+	cpus_read_lock();
+
+	if (num_online_cpus() != num_present_cpus())
+		pr_warn("Some CPUs offline, SNC detection may be incorrect\n");
+
+	for_each_node(node) {
+		cpu = cpumask_first(cpumask_of_node(node));
+		if (cpu < nr_cpu_ids) {
+			cache_id = get_cpu_cacheinfo_id(cpu, 3);
+			if (cache_id != -1)
+				set_bit(cache_id, node_caches);
+		} else {
+			mem_only_nodes++;
+		}
+	}
+	cpus_read_unlock();
+
+	num_l3_caches = bitmap_weight(node_caches, num_possible_cpus());
+	kfree(node_caches);
+
+	if (!num_l3_caches)
+		goto insane;
+
+	/* sanity check #1: Number of CPU nodes must be multiple of num_l3_caches */
+	if ((nr_node_ids - mem_only_nodes) % num_l3_caches)
+		goto insane;
+
+	ret = (nr_node_ids - mem_only_nodes) / num_l3_caches;
+
+	/* sanity check #2: Only valid results are 1, 2, 3, 4 */
+	switch (ret) {
+	case 1:
+		break;
+	case 2:
+	case 3:
+	case 4:
+		rdt_resources_all[RDT_RESOURCE_L3_MON].r_resctrl.scope = RESCTRL_NODE;
+		pr_info("Sub-NUMA Cluster: %d nodes per L3 cache\n", ret);
+		break;
+	default:
+		goto insane;
+	}
+
+	return ret;
+insane:
+	pr_warn("SNC insanity: CPU nodes = %d num_l3_caches = %d\n",
+		(nr_node_ids - mem_only_nodes), num_l3_caches);
+	return 1;
+}
+
 static __init void rdt_init_res_defs_intel(void)
 {
 	struct rdt_hw_resource *hw_res;
 	struct rdt_resource *r;
 
+	snc_nodes_per_l3_cache = snc_get_config();
+
 	for_each_rdt_resource(r) {
 		hw_res = resctrl_to_arch_res(r);
 
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 331+ messages in thread

* [PATCH v15-RFC 8/8] x86/resctrl: Update documentation with Sub-NUMA cluster changes
  2024-01-30 22:20                     ` [PATCH v15-RFC " Tony Luck
                                         ` (6 preceding siblings ...)
  2024-01-30 22:20                       ` [PATCH v15-RFC 7/8] x86/resctrl: Sub NUMA Cluster detection and enable Tony Luck
@ 2024-01-30 22:20                       ` Tony Luck
  2024-02-01 18:46                       ` [PATCH v15-RFC 0/8] Add support for Sub-NUMA cluster (SNC) systems Reinette Chatre
                                         ` (2 subsequent siblings)
  10 siblings, 0 replies; 331+ messages in thread
From: Tony Luck @ 2024-01-30 22:20 UTC (permalink / raw)
  To: Fenghua Yu, Reinette Chatre, Peter Newman, Jonathan Corbet,
	Shuah Khan, x86
  Cc: Shaopeng Tan, James Morse, Jamie Iles, Babu Moger, Randy Dunlap,
	Drew Fustini, linux-kernel, linux-doc, patches, Tony Luck,
	Shaopeng Tan, Babu Moger

With Sub-NUMA Cluster mode enabled the scope of monitoring resources is
per-NODE instead of per-L3 cache. Suffixes of directories with "L3" in
their name refer to Sub-NUMA nodes instead of L3 cache ids.

Users should be aware that SNC mode also affects the amount of L3 cache
available for allocation within each SNC node.
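
For example (illustrative numbers only): with SNC2 enabled on a system
that has a 32MB L3 cache, a CTRL_MON group whose schemata mask covers
every cache way would show roughly 16MB in its "size" file, since the
cache capacity behind that mask is split between the two SNC nodes.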

Tested-by: Shaopeng Tan <tan.shaopeng@jp.fujitsu.com>
Reviewed-by: Peter Newman <peternewman@google.com>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Reviewed-by: Shaopeng Tan <tan.shaopeng@jp.fujitsu.com>
Reviewed-by: Babu Moger <babu.moger@gmail.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
---
 Documentation/arch/x86/resctrl.rst | 25 +++++++++++++++++++++----
 1 file changed, 21 insertions(+), 4 deletions(-)

diff --git a/Documentation/arch/x86/resctrl.rst b/Documentation/arch/x86/resctrl.rst
index a6279df64a9d..15f1cff6ee76 100644
--- a/Documentation/arch/x86/resctrl.rst
+++ b/Documentation/arch/x86/resctrl.rst
@@ -366,10 +366,10 @@ When control is enabled all CTRL_MON groups will also contain:
 When monitoring is enabled all MON groups will also contain:
 
 "mon_data":
-	This contains a set of files organized by L3 domain and by
-	RDT event. E.g. on a system with two L3 domains there will
-	be subdirectories "mon_L3_00" and "mon_L3_01".	Each of these
-	directories have one file per event (e.g. "llc_occupancy",
+	This contains a set of files organized by L3 domain or by NUMA
+	node (depending on whether Sub-NUMA Cluster (SNC) mode is disabled
+	or enabled respectively) and by RDT event.  Each of these
+	directories has one file per event (e.g. "llc_occupancy",
 	"mbm_total_bytes", and "mbm_local_bytes"). In a MON group these
 	files provide a read out of the current value of the event for
 	all tasks in the group. In CTRL_MON groups these files provide
@@ -478,6 +478,23 @@ if non-contiguous 1s value is supported. On a system with a 20-bit mask
 each bit represents 5% of the capacity of the cache. You could partition
 the cache into four equal parts with masks: 0x1f, 0x3e0, 0x7c00, 0xf8000.
 
+Notes on Sub-NUMA Cluster mode
+==============================
+When SNC mode is enabled, Linux may load balance tasks between Sub-NUMA
+nodes much more readily than between regular NUMA nodes since the CPUs
+on Sub-NUMA nodes share the same L3 cache and the system may report
+the NUMA distance between Sub-NUMA nodes with a lower value than used
+for regular NUMA nodes.  Users who do not bind tasks to the CPUs of a
+specific Sub-NUMA node must read the "llc_occupancy", "mbm_total_bytes",
+and "mbm_local_bytes" for all Sub-NUMA nodes where the tasks may execute
+to get the full view of traffic for which the tasks were the source.
+
+The cache allocation feature still provides the same number of
+bits in a mask to control allocation into the L3 cache, but each
+of those ways has its capacity reduced because the cache is divided
+between the SNC nodes. The values reported in the resctrl
+"size" files are adjusted accordingly.
+
 Memory bandwidth Allocation and monitoring
 ==========================================
 
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 331+ messages in thread

* Re: [PATCH v15-RFC 0/8] Add support for Sub-NUMA cluster (SNC) systems
  2024-01-30 22:20                     ` [PATCH v15-RFC " Tony Luck
                                         ` (7 preceding siblings ...)
  2024-01-30 22:20                       ` [PATCH v15-RFC 8/8] x86/resctrl: Update documentation with Sub-NUMA cluster changes Tony Luck
@ 2024-02-01 18:46                       ` Reinette Chatre
  2024-02-08  4:17                       ` Drew Fustini
  2024-02-09 15:27                       ` Moger, Babu
  10 siblings, 0 replies; 331+ messages in thread
From: Reinette Chatre @ 2024-02-01 18:46 UTC (permalink / raw)
  To: Tony Luck, Fenghua Yu, Peter Newman, Jonathan Corbet, Shuah Khan, x86
  Cc: Shaopeng Tan, James Morse, Jamie Iles, Babu Moger, Randy Dunlap,
	Drew Fustini, linux-kernel, linux-doc, patches

Hi Tony,

On 1/30/2024 2:20 PM, Tony Luck wrote:
> This is the re-worked version of this series that I promised to post
> yesterday. Check that e-mail for the arguments for this alternate
> approach.
> 
> https://lore.kernel.org/all/ZbhLRDvZrxBZDv2j@agluck-desk3/
> 
> Apologies to Drew Fustini who I'd somehow dropped from later versions
> of this series. Drew: you had made a comment at one point that having
> different scopes within a single resource may be useful on RISC-V.
> Version 14 included that, but it's gone here. Maybe multiple resctrl
> "struct resource" for a single h/w entity like L3 as I'm doing in this
> version could work for you too?
> 
> Patches 1-5 are almost completely rewritten based around the new
> idea to give CMT and MBM their own "resource" instead of sharing
> one with L3 CAT. This removes the need for separate domain lists,
> and thus most of the churn of the previous version of this series.

I do not see it as removing the need for separate domain lists but
instead keeping the idea of separate domain lists but in this case
duplicating the resource in order to host the second domain
list. This solution also keeps the same structures for control and
monitoring that previous version cleaned up [1].
To me this thus seems like a similar solution as v14 but with
additional duplication due to an extra resource and without the cleanup.

Reinette

[1] https://lore.kernel.org/lkml/20240126223837.21835-5-tony.luck@intel.com/


^ permalink raw reply	[flat|nested] 331+ messages in thread

* Re: [PATCH v15-RFC 0/8] Add support for Sub-NUMA cluster (SNC) systems
  2024-01-30 22:20                     ` [PATCH v15-RFC " Tony Luck
                                         ` (8 preceding siblings ...)
  2024-02-01 18:46                       ` [PATCH v15-RFC 0/8] Add support for Sub-NUMA cluster (SNC) systems Reinette Chatre
@ 2024-02-08  4:17                       ` Drew Fustini
  2024-02-08 19:26                         ` Luck, Tony
  2024-02-09 15:27                       ` Moger, Babu
  10 siblings, 1 reply; 331+ messages in thread
From: Drew Fustini @ 2024-02-08  4:17 UTC (permalink / raw)
  To: Tony Luck
  Cc: Fenghua Yu, Reinette Chatre, Peter Newman, Jonathan Corbet,
	Shuah Khan, x86, Shaopeng Tan, James Morse, Jamie Iles,
	Babu Moger, Randy Dunlap, Drew Fustini, linux-kernel, linux-doc,
	patches

On Tue, Jan 30, 2024 at 02:20:26PM -0800, Tony Luck wrote:
> This is the re-worked version of this series that I promised to post
> yesterday. Check that e-mail for the arguments for this alternate
> approach.
> 
> https://lore.kernel.org/all/ZbhLRDvZrxBZDv2j@agluck-desk3/
> 
> Apologies to Drew Fustini who I'd somehow dropped from later versions
> of this series. Drew: you had made a comment at one point that having
> different scopes within a single resource may be useful on RISC-V.
> Version 14 included that, but it's gone here. Maybe multiple resctrl
> "struct resource" for a single h/w entity like L3 as I'm doing in this
> version could work for you too?

Sorry for the latency.

The RISC-V CBQRI specification [1] describes a bandwidth controller
register interface [2]. It allows a controller to implement both
bandwidth allocation and bandwidth usage monitoring.

The proof-of-concept resctrl implementation [3] that I worked on created
two domains for each memory controller in the example SoC. One domain
would contain the MBA resource and the other would contain the L3
resource to represent MBM files like local_bytes:

  # cat /sys/fs/resctrl/schemata
  MB:4=  80;6=  80;8=  80
  L2:0=0fff;1=0fff
  L3:2=ffff;3=0000;5=0000;7=0000

Where:

  Domain 0 is L2 cache controller 0 capacity allocation
  Domain 1 is L2 cache controller 1 capacity allocation
  Domain 2 is L3 cache controller capacity allocation

  Domain 4 is Memory controller 0 bandwidth allocation
  Domain 6 is Memory controller 1 bandwidth allocation
  Domain 8 is Memory controller 2 bandwidth allocation

  Domain 3 is Memory controller 0 bandwidth monitoring
  Domain 5 is Memory controller 1 bandwidth monitoring
  Domain 7 is Memory controller 2 bandwidth monitoring

I think this scheme is confusing but I wasn't able to find a better
way to do it at the time.

> Patches 1-5 are almost completely rewritten based around the new
> idea to give CMT and MBM their own "resource" instead of sharing
> one with L3 CAT. This removes the need for separate domain lists,
> and thus most of the churn of the previous version of this series.

Very interesting. Do you think I would be able to create MBM files for
each memory controller without creating pointless L3 domains that show
up in schemata?

Thanks,
Drew

[1] https://github.com/riscv-non-isa/riscv-cbqri/releases/tag/v1.0-rc1
[2] https://github.com/riscv-non-isa/riscv-cbqri/blob/main/qos_bandwidth.adoc
[3] https://lore.kernel.org/linux-riscv/20230419111111.477118-1-dfustini@baylibre.com/

^ permalink raw reply	[flat|nested] 331+ messages in thread

* RE: [PATCH v15-RFC 0/8] Add support for Sub-NUMA cluster (SNC) systems
  2024-02-08  4:17                       ` Drew Fustini
@ 2024-02-08 19:26                         ` Luck, Tony
  0 siblings, 0 replies; 331+ messages in thread
From: Luck, Tony @ 2024-02-08 19:26 UTC (permalink / raw)
  To: Drew Fustini
  Cc: Yu, Fenghua, Chatre, Reinette, Peter Newman, Jonathan Corbet,
	Shuah Khan, x86, Shaopeng Tan, James Morse, Jamie Iles,
	Babu Moger, Randy Dunlap, Drew Fustini, linux-kernel, linux-doc,
	patches

> > Patches 1-5 are almost completely rewritten based around the new
> > idea to give CMT and MBM their own "resource" instead of sharing
> > one with L3 CAT. This removes the need for separate domain lists,
> > and thus most of the churn of the previous version of this series.
>
> Very interesting. Do you think I would be able to create MBM files for
> each memory controller without creating pointless L3 domains that show
> up in schemata?

Entries only show up in the schemata file for resources that are "alloc_capable".
So you should be able to make as many rdt_hw_resource structures as you
need that are "mon_capable", but not "alloc_capable" ... though making more
than one such resource may explore untested areas of the code since there
has historically only been one mon_capable resource. It looks like the resource
id from the "rid" field is passed through to the "show" functions for MBM and CQM.

This patch series splits the one resource that is marked as both mon_capable
and alloc_capable into two. Maybe that's a useful cleanup, but maybe not a
requirement for what you need.
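
For illustration only (the resource name and enum value below are made
up, not from this series), a monitor-only entry in rdt_resources_all[]
might look roughly like this sketch. Leaving alloc_capable clear keeps
it out of the schemata file; mon_capable would be set at probe time as
is done for L3 monitoring today.

	/* Hypothetical monitor-only resource for a memory controller */
	[RDT_RESOURCE_MC_MON] =
	{
		.r_resctrl = {
			.rid		= RDT_RESOURCE_MC_MON,
			.name		= "MC",
			.domains	= domain_init(RDT_RESOURCE_MC_MON),
			.fflags		= RFTYPE_RES_MB,
		},
	},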

-Tony

^ permalink raw reply	[flat|nested] 331+ messages in thread

* Re: [PATCH v15-RFC 0/8] Add support for Sub-NUMA cluster (SNC) systems
  2024-01-30 22:20                     ` [PATCH v15-RFC " Tony Luck
                                         ` (9 preceding siblings ...)
  2024-02-08  4:17                       ` Drew Fustini
@ 2024-02-09 15:27                       ` Moger, Babu
  2024-02-09 18:31                         ` Tony Luck
  2024-02-12 19:44                         ` Reinette Chatre
  10 siblings, 2 replies; 331+ messages in thread
From: Moger, Babu @ 2024-02-09 15:27 UTC (permalink / raw)
  To: Tony Luck, Fenghua Yu, Reinette Chatre, Peter Newman,
	Jonathan Corbet, Shuah Khan, x86
  Cc: Shaopeng Tan, James Morse, Jamie Iles, Randy Dunlap,
	Drew Fustini, linux-kernel, linux-doc, patches

Hi Tony,

On 1/30/24 16:20, Tony Luck wrote:
> This is the re-worked version of this series that I promised to post
> yesterday. Check that e-mail for the arguments for this alternate
> approach.

To be honest, I like this series more than the previous series. I always
thought RDT_RESOURCE_L3_MON should have been a separate resource by itself.

You need to separate the domain lists for RDT_RESOURCE_L3 and
RDT_RESOURCE_L3_MON if you are going this route. I didn't see that in this
series. Also I have a few other comments as well.

Thanks
Babu


^ permalink raw reply	[flat|nested] 331+ messages in thread

* Re: [PATCH v15-RFC 1/8] x86/resctrl: Split the RDT_RESOURCE_L3 resource
  2024-01-30 22:20                       ` [PATCH v15-RFC 1/8] x86/resctrl: Split the RDT_RESOURCE_L3 resource Tony Luck
@ 2024-02-09 15:28                         ` Moger, Babu
  2024-02-09 18:44                           ` Tony Luck
  0 siblings, 1 reply; 331+ messages in thread
From: Moger, Babu @ 2024-02-09 15:28 UTC (permalink / raw)
  To: Tony Luck, Fenghua Yu, Reinette Chatre, Peter Newman,
	Jonathan Corbet, Shuah Khan, x86
  Cc: Shaopeng Tan, James Morse, Jamie Iles, Randy Dunlap,
	Drew Fustini, linux-kernel, linux-doc, patches

Hi Tony,

On 1/30/24 16:20, Tony Luck wrote:
> The RDT_RESOURCE_L3 is unique in that it is used for both monitoring
> an control functions. This made sense while both uses had the same

an/and

> scope. But systems with Sub-NUMA clustering enabled do not follow this
> pattern.
> 
> Create a new resource: RDT_RESOURCE_L3_MON ready to take over the
> monitoring functions.
> 
> Signed-off-by: Tony Luck <tony.luck@intel.com>
> ---
>  arch/x86/kernel/cpu/resctrl/internal.h |  1 +
>  arch/x86/kernel/cpu/resctrl/core.c     | 10 ++++++++++
>  2 files changed, 11 insertions(+)
> 
> diff --git a/arch/x86/kernel/cpu/resctrl/internal.h b/arch/x86/kernel/cpu/resctrl/internal.h
> index 52e7e7deee10..c6051bc70e96 100644
> --- a/arch/x86/kernel/cpu/resctrl/internal.h
> +++ b/arch/x86/kernel/cpu/resctrl/internal.h
> @@ -429,6 +429,7 @@ DECLARE_STATIC_KEY_FALSE(rdt_alloc_enable_key);
>  extern struct dentry *debugfs_resctrl;
>  
>  enum resctrl_res_level {
> +	RDT_RESOURCE_L3_MON,
>  	RDT_RESOURCE_L3,

How about?
RDT_RESOURCE_L3,
RDT_RESOURCE_L3_MON,

>  	RDT_RESOURCE_L2,
>  	RDT_RESOURCE_MBA,
> diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
> index aa9810a64258..c50f55d7790e 100644
> --- a/arch/x86/kernel/cpu/resctrl/core.c
> +++ b/arch/x86/kernel/cpu/resctrl/core.c
> @@ -60,6 +60,16 @@ mba_wrmsr_amd(struct rdt_domain *d, struct msr_param *m,
>  #define domain_init(id) LIST_HEAD_INIT(rdt_resources_all[id].r_resctrl.domains)
>  
>  struct rdt_hw_resource rdt_resources_all[] = {
> +	[RDT_RESOURCE_L3_MON] =
> +	{
> +		.r_resctrl = {
> +			.rid			= RDT_RESOURCE_L3_MON,
> +			.name			= "L3",

L3_MON ?

> +			.cache_level		= 3,
> +			.domains		= domain_init(RDT_RESOURCE_L3_MON),
> +			.fflags			= RFTYPE_RES_CACHE,
> +		},
> +	},
>  	[RDT_RESOURCE_L3] =
>  	{
>  		.r_resctrl = {

-- 
Thanks
Babu Moger

^ permalink raw reply	[flat|nested] 331+ messages in thread

* Re: [PATCH v15-RFC 2/8] x86/resctrl: Move all monitoring functions to RDT_RESOURCE_L3_MON
  2024-01-30 22:20                       ` [PATCH v15-RFC 2/8] x86/resctrl: Move all monitoring functions to RDT_RESOURCE_L3_MON Tony Luck
@ 2024-02-09 15:28                         ` Moger, Babu
  2024-02-09 18:51                           ` Tony Luck
  0 siblings, 1 reply; 331+ messages in thread
From: Moger, Babu @ 2024-02-09 15:28 UTC (permalink / raw)
  To: Tony Luck, Fenghua Yu, Reinette Chatre, Peter Newman,
	Jonathan Corbet, Shuah Khan, x86
  Cc: Shaopeng Tan, James Morse, Jamie Iles, Randy Dunlap,
	Drew Fustini, linux-kernel, linux-doc, patches

Tony,

On 1/30/24 16:20, Tony Luck wrote:
> Switch over all places that setup and use monitoring funtions to

functions?

> use the new resource structure.
> 
> Signed-off-by: Tony Luck <tony.luck@intel.com>
> ---
>  arch/x86/kernel/cpu/resctrl/core.c     |  6 ++++--
>  arch/x86/kernel/cpu/resctrl/monitor.c  | 12 ++++--------
>  arch/x86/kernel/cpu/resctrl/rdtgroup.c |  2 +-
>  3 files changed, 9 insertions(+), 11 deletions(-)
> 
> diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
> index c50f55d7790e..0828575c3e13 100644
> --- a/arch/x86/kernel/cpu/resctrl/core.c
> +++ b/arch/x86/kernel/cpu/resctrl/core.c
> @@ -591,11 +591,13 @@ static void domain_remove_cpu(int cpu, struct rdt_resource *r)
>  		return;
>  	}
>  
> -	if (r == &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl) {
> +	if (r == &rdt_resources_all[RDT_RESOURCE_L3_MON].r_resctrl) {
>  		if (is_mbm_enabled() && cpu == d->mbm_work_cpu) {
>  			cancel_delayed_work(&d->mbm_over);
>  			mbm_setup_overflow_handler(d, 0);
>  		}
> +	}
> +	if (r == &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl) {

RDT_RESOURCE_L3_MON?


>  		if (is_llc_occupancy_enabled() && cpu == d->cqm_work_cpu &&
>  		    has_busy_rmid(r, d)) {
>  			cancel_delayed_work(&d->cqm_limbo);
> @@ -826,7 +828,7 @@ static __init bool get_rdt_alloc_resources(void)
>  
>  static __init bool get_rdt_mon_resources(void)
>  {
> -	struct rdt_resource *r = &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl;
> +	struct rdt_resource *r = &rdt_resources_all[RDT_RESOURCE_L3_MON].r_resctrl;
>  
>  	if (rdt_cpu_has(X86_FEATURE_CQM_OCCUP_LLC))
>  		rdt_mon_features |= (1 << QOS_L3_OCCUP_EVENT_ID);
> diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c
> index 3a6c069614eb..080cad0d7288 100644
> --- a/arch/x86/kernel/cpu/resctrl/monitor.c
> +++ b/arch/x86/kernel/cpu/resctrl/monitor.c
> @@ -268,7 +268,7 @@ int resctrl_arch_rmid_read(struct rdt_resource *r, struct rdt_domain *d,
>   */
>  void __check_limbo(struct rdt_domain *d, bool force_free)
>  {
> -	struct rdt_resource *r = &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl;
> +	struct rdt_resource *r = &rdt_resources_all[RDT_RESOURCE_L3_MON].r_resctrl;
>  	struct rmid_entry *entry;
>  	u32 crmid = 1, nrmid;
>  	bool rmid_dirty;
> @@ -333,7 +333,7 @@ int alloc_rmid(void)
>  
>  static void add_rmid_to_limbo(struct rmid_entry *entry)
>  {
> -	struct rdt_resource *r = &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl;
> +	struct rdt_resource *r = &rdt_resources_all[RDT_RESOURCE_L3_MON].r_resctrl;
>  	struct rdt_domain *d;
>  	int cpu, err;
>  	u64 val = 0;
> @@ -623,7 +623,7 @@ void cqm_handle_limbo(struct work_struct *work)
>  
>  	mutex_lock(&rdtgroup_mutex);
>  
> -	r = &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl;
> +	r = &rdt_resources_all[RDT_RESOURCE_L3_MON].r_resctrl;
>  	d = container_of(work, struct rdt_domain, cqm_limbo.work);
>  
>  	__check_limbo(d, false);
> @@ -659,7 +659,7 @@ void mbm_handle_overflow(struct work_struct *work)
>  	if (!static_branch_likely(&rdt_mon_enable_key))
>  		goto out_unlock;
>  
> -	r = &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl;
> +	r = &rdt_resources_all[RDT_RESOURCE_L3_MON].r_resctrl;
>  	d = container_of(work, struct rdt_domain, mbm_over.work);
>  
>  	list_for_each_entry(prgrp, &rdt_all_groups, rdtgroup_list) {
> @@ -736,10 +736,6 @@ static struct mon_evt mbm_local_event = {
>  
>  /*
>   * Initialize the event list for the resource.
> - *
> - * Note that MBM events are also part of RDT_RESOURCE_L3 resource
> - * because as per the SDM the total and local memory bandwidth
> - * are enumerated as part of L3 monitoring.
>   */
>  static void l3_mon_evt_init(struct rdt_resource *r)
>  {
> diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> index aa24343f1d23..9ee3a9906781 100644
> --- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> +++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> @@ -2644,7 +2644,7 @@ static int rdt_get_tree(struct fs_context *fc)
>  		static_branch_enable_cpuslocked(&rdt_enable_key);
>  
>  	if (is_mbm_enabled()) {
> -		r = &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl;
> +		r = &rdt_resources_all[RDT_RESOURCE_L3_MON].r_resctrl;
>  		list_for_each_entry(dom, &r->domains, list)
>  			mbm_setup_overflow_handler(dom, MBM_OVERFLOW_INTERVAL);
>  	}

-- 
Thanks
Babu Moger

^ permalink raw reply	[flat|nested] 331+ messages in thread

* Re: [PATCH v15-RFC 3/8] x86/resctrl: Prepare for non-cache-scoped resources
  2024-01-30 22:20                       ` [PATCH v15-RFC 3/8] x86/resctrl: Prepare for non-cache-scoped resources Tony Luck
@ 2024-02-09 15:28                         ` Moger, Babu
  2024-02-09 18:57                           ` Tony Luck
  0 siblings, 1 reply; 331+ messages in thread
From: Moger, Babu @ 2024-02-09 15:28 UTC (permalink / raw)
  To: Tony Luck, Fenghua Yu, Reinette Chatre, Peter Newman,
	Jonathan Corbet, Shuah Khan, x86
  Cc: Shaopeng Tan, James Morse, Jamie Iles, Randy Dunlap,
	Drew Fustini, linux-kernel, linux-doc, patches

Hi Tony,

On 1/30/24 16:20, Tony Luck wrote:
> Not all resources are scoped in line with some level of hardware cache.

same level?

Thanks
Babu
> 
> Prepare by renaming the "cache_level" field to "scope" and change
> the type to an enum to ease adding new scopes.
> 
> Signed-off-by: Tony Luck <tony.luck@intel.com>
> ---
>  include/linux/resctrl.h                   |  9 +++++++--
>  arch/x86/kernel/cpu/resctrl/core.c        | 14 +++++++-------
>  arch/x86/kernel/cpu/resctrl/pseudo_lock.c |  2 +-
>  arch/x86/kernel/cpu/resctrl/rdtgroup.c    |  2 +-
>  4 files changed, 16 insertions(+), 11 deletions(-)
> 
> diff --git a/include/linux/resctrl.h b/include/linux/resctrl.h
> index 66942d7fba7f..2155dc15e636 100644
> --- a/include/linux/resctrl.h
> +++ b/include/linux/resctrl.h
> @@ -144,13 +144,18 @@ struct resctrl_membw {
>  struct rdt_parse_data;
>  struct resctrl_schema;
>  
> +enum resctrl_scope {
> +	RESCTRL_L2_CACHE = 2,
> +	RESCTRL_L3_CACHE = 3,
> +};
> +
>  /**
>   * struct rdt_resource - attributes of a resctrl resource
>   * @rid:		The index of the resource
>   * @alloc_capable:	Is allocation available on this machine
>   * @mon_capable:	Is monitor feature available on this machine
>   * @num_rmid:		Number of RMIDs available
> - * @cache_level:	Which cache level defines scope of this resource
> + * @scope:		Hardware scope for this resource
>   * @cache:		Cache allocation related data
>   * @membw:		If the component has bandwidth controls, their properties.
>   * @domains:		All domains for this resource
> @@ -168,7 +173,7 @@ struct rdt_resource {
>  	bool			alloc_capable;
>  	bool			mon_capable;
>  	int			num_rmid;
> -	int			cache_level;
> +	enum resctrl_scope	scope;
>  	struct resctrl_cache	cache;
>  	struct resctrl_membw	membw;
>  	struct list_head	domains;
> diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
> index 0828575c3e13..d89dce63397b 100644
> --- a/arch/x86/kernel/cpu/resctrl/core.c
> +++ b/arch/x86/kernel/cpu/resctrl/core.c
> @@ -65,7 +65,7 @@ struct rdt_hw_resource rdt_resources_all[] = {
>  		.r_resctrl = {
>  			.rid			= RDT_RESOURCE_L3_MON,
>  			.name			= "L3",
> -			.cache_level		= 3,
> +			.scope			= RESCTRL_L3_CACHE,
>  			.domains		= domain_init(RDT_RESOURCE_L3_MON),
>  			.fflags			= RFTYPE_RES_CACHE,
>  		},
> @@ -75,7 +75,7 @@ struct rdt_hw_resource rdt_resources_all[] = {
>  		.r_resctrl = {
>  			.rid			= RDT_RESOURCE_L3,
>  			.name			= "L3",
> -			.cache_level		= 3,
> +			.scope			= RESCTRL_L3_CACHE,
>  			.domains		= domain_init(RDT_RESOURCE_L3),
>  			.parse_ctrlval		= parse_cbm,
>  			.format_str		= "%d=%0*x",
> @@ -89,7 +89,7 @@ struct rdt_hw_resource rdt_resources_all[] = {
>  		.r_resctrl = {
>  			.rid			= RDT_RESOURCE_L2,
>  			.name			= "L2",
> -			.cache_level		= 2,
> +			.scope			= RESCTRL_L2_CACHE,
>  			.domains		= domain_init(RDT_RESOURCE_L2),
>  			.parse_ctrlval		= parse_cbm,
>  			.format_str		= "%d=%0*x",
> @@ -103,7 +103,7 @@ struct rdt_hw_resource rdt_resources_all[] = {
>  		.r_resctrl = {
>  			.rid			= RDT_RESOURCE_MBA,
>  			.name			= "MB",
> -			.cache_level		= 3,
> +			.scope			= RESCTRL_L3_CACHE,
>  			.domains		= domain_init(RDT_RESOURCE_MBA),
>  			.parse_ctrlval		= parse_bw,
>  			.format_str		= "%d=%*u",
> @@ -115,7 +115,7 @@ struct rdt_hw_resource rdt_resources_all[] = {
>  		.r_resctrl = {
>  			.rid			= RDT_RESOURCE_SMBA,
>  			.name			= "SMBA",
> -			.cache_level		= 3,
> +			.scope			= RESCTRL_L3_CACHE,
>  			.domains		= domain_init(RDT_RESOURCE_SMBA),
>  			.parse_ctrlval		= parse_bw,
>  			.format_str		= "%d=%*u",
> @@ -514,7 +514,7 @@ static int arch_domain_mbm_alloc(u32 num_rmid, struct rdt_hw_domain *hw_dom)
>   */
>  static void domain_add_cpu(int cpu, struct rdt_resource *r)
>  {
> -	int id = get_cpu_cacheinfo_id(cpu, r->cache_level);
> +	int id = get_cpu_cacheinfo_id(cpu, r->scope);
>  	struct list_head *add_pos = NULL;
>  	struct rdt_hw_domain *hw_dom;
>  	struct rdt_domain *d;
> @@ -564,7 +564,7 @@ static void domain_add_cpu(int cpu, struct rdt_resource *r)
>  
>  static void domain_remove_cpu(int cpu, struct rdt_resource *r)
>  {
> -	int id = get_cpu_cacheinfo_id(cpu, r->cache_level);
> +	int id = get_cpu_cacheinfo_id(cpu, r->scope);
>  	struct rdt_hw_domain *hw_dom;
>  	struct rdt_domain *d;
>  
> diff --git a/arch/x86/kernel/cpu/resctrl/pseudo_lock.c b/arch/x86/kernel/cpu/resctrl/pseudo_lock.c
> index 8f559eeae08e..6a72fb627aa5 100644
> --- a/arch/x86/kernel/cpu/resctrl/pseudo_lock.c
> +++ b/arch/x86/kernel/cpu/resctrl/pseudo_lock.c
> @@ -311,7 +311,7 @@ static int pseudo_lock_region_init(struct pseudo_lock_region *plr)
>  	plr->size = rdtgroup_cbm_to_size(plr->s->res, plr->d, plr->cbm);
>  
>  	for (i = 0; i < ci->num_leaves; i++) {
> -		if (ci->info_list[i].level == plr->s->res->cache_level) {
> +		if (ci->info_list[i].level == plr->s->res->scope) {
>  			plr->line_size = ci->info_list[i].coherency_line_size;
>  			return 0;
>  		}
> diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> index 9ee3a9906781..eff9d87547c9 100644
> --- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> +++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> @@ -1416,7 +1416,7 @@ unsigned int rdtgroup_cbm_to_size(struct rdt_resource *r,
>  	num_b = bitmap_weight(&cbm, r->cache.cbm_len);
>  	ci = get_cpu_cacheinfo(cpumask_any(&d->cpu_mask));
>  	for (i = 0; i < ci->num_leaves; i++) {
> -		if (ci->info_list[i].level == r->cache_level) {
> +		if (ci->info_list[i].level == r->scope) {
>  			size = ci->info_list[i].size / r->cache.cbm_len * num_b;
>  			break;
>  		}

-- 
Thanks
Babu Moger

^ permalink raw reply	[flat|nested] 331+ messages in thread

* Re: [PATCH v15-RFC 4/8] x86/resctrl: Add helper function to look up domain_id from scope
  2024-01-30 22:20                       ` [PATCH v15-RFC 4/8] x86/resctrl: Add helper function to look up domain_id from scope Tony Luck
@ 2024-02-09 15:28                         ` Moger, Babu
  2024-02-09 19:06                           ` Tony Luck
  0 siblings, 1 reply; 331+ messages in thread
From: Moger, Babu @ 2024-02-09 15:28 UTC (permalink / raw)
  To: Tony Luck, Fenghua Yu, Reinette Chatre, Peter Newman,
	Jonathan Corbet, Shuah Khan, x86
  Cc: Shaopeng Tan, James Morse, Jamie Iles, Randy Dunlap,
	Drew Fustini, linux-kernel, linux-doc, patches

Hi Tony,

On 1/30/24 16:20, Tony Luck wrote:
> Prepare for more options for scope of resources. Add some diagnostic
> messages if lookup fails.
> 
> Signed-off-by: Tony Luck <tony.luck@intel.com>
> ---
>  arch/x86/kernel/cpu/resctrl/core.c | 29 +++++++++++++++++++++++++++--
>  1 file changed, 27 insertions(+), 2 deletions(-)
> 
> diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
> index d89dce63397b..59e6aa7abef5 100644
> --- a/arch/x86/kernel/cpu/resctrl/core.c
> +++ b/arch/x86/kernel/cpu/resctrl/core.c
> @@ -499,6 +499,19 @@ static int arch_domain_mbm_alloc(u32 num_rmid, struct rdt_hw_domain *hw_dom)
>  	return 0;
>  }
>  
> +static int get_domain_id_from_scope(int cpu, enum resctrl_scope scope)
> +{
> +	switch (scope) {
> +	case RESCTRL_L2_CACHE:
> +	case RESCTRL_L3_CACHE:
> +		return get_cpu_cacheinfo_id(cpu, scope);
> +	default:
> +		break;
> +	}
> +
> +	return -EINVAL;
> +}
> +
>  /*
>   * domain_add_cpu - Add a cpu to a resource's domain list.
>   *
> @@ -514,12 +527,18 @@ static int arch_domain_mbm_alloc(u32 num_rmid, struct rdt_hw_domain *hw_dom)
>   */
>  static void domain_add_cpu(int cpu, struct rdt_resource *r)
>  {
> -	int id = get_cpu_cacheinfo_id(cpu, r->scope);
> +	int id = get_domain_id_from_scope(cpu, r->scope);
>  	struct list_head *add_pos = NULL;
>  	struct rdt_hw_domain *hw_dom;
>  	struct rdt_domain *d;
>  	int err;
>  
> +	if (id < 0) {
> +		pr_warn_once("Can't find domain id for CPU:%d scope:%d for resource %s\n",
> +			     cpu, r->scope, r->name);

Would it be better to move pr_warn_once inside get_domain_id_from_scope
instead of repeating it at every call site?

> +		return;
> +	}
> +
>  	d = rdt_find_domain(r, id, &add_pos);
>  	if (IS_ERR(d)) {
>  		pr_warn("Couldn't find cache id for CPU %d\n", cpu);
> @@ -564,10 +583,16 @@ static void domain_add_cpu(int cpu, struct rdt_resource *r)
>  
>  static void domain_remove_cpu(int cpu, struct rdt_resource *r)
>  {
> -	int id = get_cpu_cacheinfo_id(cpu, r->scope);
> +	int id = get_domain_id_from_scope(cpu, r->scope);
>  	struct rdt_hw_domain *hw_dom;
>  	struct rdt_domain *d;
>  
> +	if (id < 0) {
> +		pr_warn_once("Can't find domain id for CPU:%d scope:%d for resource %s\n",
> +			     cpu, r->scope, r->name);

Same comment as above. Would it be better to move pr_warn_once inside
get_domain_id_from_scope?

> +		return;
> +	}
> +
>  	d = rdt_find_domain(r, id, NULL);
>  	if (IS_ERR_OR_NULL(d)) {
>  		pr_warn("Couldn't find cache id for CPU %d\n", cpu);

-- 
Thanks
Babu Moger

^ permalink raw reply	[flat|nested] 331+ messages in thread

* Re: [PATCH v15-RFC 5/8] x86/resctrl: Add "NODE" as an option for resource scope
  2024-01-30 22:20                       ` [PATCH v15-RFC 5/8] x86/resctrl: Add "NODE" as an option for resource scope Tony Luck
@ 2024-02-09 15:29                         ` Moger, Babu
  2024-02-09 19:23                           ` Tony Luck
  0 siblings, 1 reply; 331+ messages in thread
From: Moger, Babu @ 2024-02-09 15:29 UTC (permalink / raw)
  To: Tony Luck, Fenghua Yu, Reinette Chatre, Peter Newman,
	Jonathan Corbet, Shuah Khan, x86
  Cc: Shaopeng Tan, James Morse, Jamie Iles, Randy Dunlap,
	Drew Fustini, linux-kernel, linux-doc, patches

Hi Tony,

This patch probably needs to be merged with with patch 7.

Thanks

On 1/30/24 16:20, Tony Luck wrote:
> Add RESCTRL_NODE to the enum, and to the helper function that looks
> up a domain id from a scope.
> 
> There are a couple of places where the scope must be a cache scope.
> Add some defensive WARN_ON checks to those.
> 
> Signed-off-by: Tony Luck <tony.luck@intel.com>
> ---
>  include/linux/resctrl.h                   | 1 +
>  arch/x86/kernel/cpu/resctrl/core.c        | 3 +++
>  arch/x86/kernel/cpu/resctrl/pseudo_lock.c | 4 ++++
>  arch/x86/kernel/cpu/resctrl/rdtgroup.c    | 3 +++
>  4 files changed, 11 insertions(+)
> 
> diff --git a/include/linux/resctrl.h b/include/linux/resctrl.h
> index 2155dc15e636..e3cddf3f07f8 100644
> --- a/include/linux/resctrl.h
> +++ b/include/linux/resctrl.h
> @@ -147,6 +147,7 @@ struct resctrl_schema;
>  enum resctrl_scope {
>  	RESCTRL_L2_CACHE = 2,
>  	RESCTRL_L3_CACHE = 3,
> +	RESCTRL_NODE,
>  };
>  
>  /**
> diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
> index 59e6aa7abef5..b741cbf61843 100644
> --- a/arch/x86/kernel/cpu/resctrl/core.c
> +++ b/arch/x86/kernel/cpu/resctrl/core.c
> @@ -505,6 +505,9 @@ static int get_domain_id_from_scope(int cpu, enum resctrl_scope scope)
>  	case RESCTRL_L2_CACHE:
>  	case RESCTRL_L3_CACHE:
>  		return get_cpu_cacheinfo_id(cpu, scope);
> +	case RESCTRL_NODE:
> +		return cpu_to_node(cpu);
> +
>  	default:
>  		break;
>  	}
> diff --git a/arch/x86/kernel/cpu/resctrl/pseudo_lock.c b/arch/x86/kernel/cpu/resctrl/pseudo_lock.c
> index 6a72fb627aa5..2bafc73b51e2 100644
> --- a/arch/x86/kernel/cpu/resctrl/pseudo_lock.c
> +++ b/arch/x86/kernel/cpu/resctrl/pseudo_lock.c
> @@ -292,10 +292,14 @@ static void pseudo_lock_region_clear(struct pseudo_lock_region *plr)
>   */
>  static int pseudo_lock_region_init(struct pseudo_lock_region *plr)
>  {
> +	enum resctrl_scope scope = plr->s->res->scope;
>  	struct cpu_cacheinfo *ci;
>  	int ret;
>  	int i;
>  
> +	if (WARN_ON_ONCE(scope != RESCTRL_L2_CACHE && scope != RESCTRL_L3_CACHE))
> +		return -ENODEV;
> +
>  	/* Pick the first cpu we find that is associated with the cache. */
>  	plr->cpu = cpumask_first(&plr->d->cpu_mask);
>  
> diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> index eff9d87547c9..770f2bf98462 100644
> --- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> +++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> @@ -1413,6 +1413,9 @@ unsigned int rdtgroup_cbm_to_size(struct rdt_resource *r,
>  	unsigned int size = 0;
>  	int num_b, i;
>  
> +	if (WARN_ON_ONCE(r->scope != RESCTRL_L2_CACHE && r->scope != RESCTRL_L3_CACHE))
> +		return size;
> +
>  	num_b = bitmap_weight(&cbm, r->cache.cbm_len);
>  	ci = get_cpu_cacheinfo(cpumask_any(&d->cpu_mask));
>  	for (i = 0; i < ci->num_leaves; i++) {

-- 
Thanks
Babu Moger

^ permalink raw reply	[flat|nested] 331+ messages in thread

* Re: [PATCH v15-RFC 6/8] x86/resctrl: Introduce snc_nodes_per_l3_cache
  2024-01-30 22:20                       ` [PATCH v15-RFC 6/8] x86/resctrl: Introduce snc_nodes_per_l3_cache Tony Luck
@ 2024-02-09 15:29                         ` Moger, Babu
  2024-02-09 19:35                           ` Tony Luck
  0 siblings, 1 reply; 331+ messages in thread
From: Moger, Babu @ 2024-02-09 15:29 UTC (permalink / raw)
  To: Tony Luck, Fenghua Yu, Reinette Chatre, Peter Newman,
	Jonathan Corbet, Shuah Khan, x86
  Cc: Shaopeng Tan, James Morse, Jamie Iles, Randy Dunlap,
	Drew Fustini, linux-kernel, linux-doc, patches

Hi Tony,

On 1/30/24 16:20, Tony Luck wrote:
> Intel Sub-NUMA Cluster (SNC) is a feature that subdivides the CPU cores
> and memory controllers on a socket into two or more groups. These are
> presented to the operating system as NUMA nodes.
> 
> This may enable some workloads to have slightly lower latency to memory
> as the memory controller(s) in an SNC node are electrically closer to the
> CPU cores on that SNC node. This benefit may be offset by lower bandwidth
> since the memory accesses for each core can only be interleaved between
> the memory controllers on the same SNC node.
> 
> Resctrl monitoring on an Intel system depends upon attaching RMIDs to tasks
> to track L3 cache occupancy and memory bandwidth. There is an MSR that
> controls how the RMIDs are shared between SNC nodes.
> 
> The default mode divides them numerically. E.g. when there are two SNC
> nodes on a socket the lower number half of the RMIDs are given to the
> first node, the remainder to the second node. This would be difficult
> to use with the Linux resctrl interface as specific RMID values assigned
> to resctrl groups are not visible to users.
> 
> The other mode divides the RMIDs and renumbers the ones on the second
> SNC node to start from zero.
> 
> Even with this renumbering SNC mode requires several changes in resctrl
> behavior for correct operation.
> 
> Add a global integer "snc_nodes_per_l3_cache" that shows how many
> SNC nodes share each L3 cache. When "snc_nodes_per_l3_cache" is "1",
> SNC mode is either not implemented, or not enabled.
> 
> Update all places to take appropriate action when SNC mode is enabled:
> 1) The number of logical RMIDs per L3 cache available for use is the
>    number of physical RMIDs divided by the number of SNC nodes.
> 2) Likewise the "mon_scale" value must be divided by the number of SNC
>    nodes.
> 3) The RMID renumbering operates when using the value from the
>    IA32_PQR_ASSOC MSR to count accesses by a task. When reading an RMID
>    counter, adjust from the logical RMID to the physical
>    RMID value for the SNC node that it wishes to read and load the
>    adjusted value into the IA32_QM_EVTSEL MSR.
> 4) Divide the L3 cache between the SNC nodes. Divide the value
>    reported in the resctrl "size" file by the number of SNC
>    nodes because the effective amount of cache that can be allocated
>    is reduced by that factor.
> 5) Disable the "-o mba_MBps" mount option in SNC mode
>    because the monitoring is being done per SNC node, while the
>    bandwidth allocation is still done at the L3 cache scope.
>    Trying to use this feedback loop might result in contradictory
>    changes to the throttling level coming from each of the SNC
>    node bandwidth measurements.
> 
> Signed-off-by: Tony Luck <tony.luck@intel.com>
> ---
>  arch/x86/kernel/cpu/resctrl/internal.h |  2 ++
>  arch/x86/kernel/cpu/resctrl/core.c     |  6 ++++++
>  arch/x86/kernel/cpu/resctrl/monitor.c  | 16 +++++++++++++---
>  arch/x86/kernel/cpu/resctrl/rdtgroup.c |  5 +++--
>  4 files changed, 24 insertions(+), 5 deletions(-)
> 
> diff --git a/arch/x86/kernel/cpu/resctrl/internal.h b/arch/x86/kernel/cpu/resctrl/internal.h
> index c6051bc70e96..d9c6dcf30922 100644
> --- a/arch/x86/kernel/cpu/resctrl/internal.h
> +++ b/arch/x86/kernel/cpu/resctrl/internal.h
> @@ -428,6 +428,8 @@ DECLARE_STATIC_KEY_FALSE(rdt_alloc_enable_key);
>  
>  extern struct dentry *debugfs_resctrl;
>  
> +extern unsigned int snc_nodes_per_l3_cache;

I feel this can be part of rdt_resource instead of global.


> +
>  enum resctrl_res_level {
>  	RDT_RESOURCE_L3_MON,
>  	RDT_RESOURCE_L3,
> diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
> index b741cbf61843..dc886d2c9a33 100644
> --- a/arch/x86/kernel/cpu/resctrl/core.c
> +++ b/arch/x86/kernel/cpu/resctrl/core.c
> @@ -48,6 +48,12 @@ int max_name_width, max_data_width;
>   */
>  bool rdt_alloc_capable;
>  
> +/*
> + * Number of SNC nodes that share each L3 cache.  Default is 1 for
> + * systems that do not support SNC, or have SNC disabled.
> + */
> +unsigned int snc_nodes_per_l3_cache = 1;
> +
>  static void
>  mba_wrmsr_intel(struct rdt_domain *d, struct msr_param *m,
>  		struct rdt_resource *r);
> diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c
> index 080cad0d7288..357919bbadbe 100644
> --- a/arch/x86/kernel/cpu/resctrl/monitor.c
> +++ b/arch/x86/kernel/cpu/resctrl/monitor.c
> @@ -148,8 +148,18 @@ static inline struct rmid_entry *__rmid_entry(u32 rmid)
>  
>  static int __rmid_read(u32 rmid, enum resctrl_event_id eventid, u64 *val)
>  {
> +	struct rdt_resource *r = &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl;

RDT_RESOURCE_L3_MON?


> +	int cpu = smp_processor_id();
> +	int rmid_offset = 0;
>  	u64 msr_val;
>  
> +	/*
> +	 * When SNC mode is on, need to compute the offset to read the
> +	 * physical RMID counter for the node to which this CPU belongs.
> +	 */
> +	if (snc_nodes_per_l3_cache > 1)
> +		rmid_offset = (cpu_to_node(cpu) % snc_nodes_per_l3_cache) * r->num_rmid;

Not sure if you have tested this or not. r->num_rmid is initialized for the
resource RDT_RESOURCE_L3_MON. For other resources it is always 0.


> +
>  	/*
>  	 * As per the SDM, when IA32_QM_EVTSEL.EvtID (bits 7:0) is configured
>  	 * with a valid event code for supported resource type and the bits
> @@ -158,7 +168,7 @@ static int __rmid_read(u32 rmid, enum resctrl_event_id eventid, u64 *val)
>  	 * IA32_QM_CTR.Error (bit 63) and IA32_QM_CTR.Unavailable (bit 62)
>  	 * are error bits.
>  	 */
> -	wrmsr(MSR_IA32_QM_EVTSEL, eventid, rmid);
> +	wrmsr(MSR_IA32_QM_EVTSEL, eventid, rmid + rmid_offset);
>  	rdmsrl(MSR_IA32_QM_CTR, msr_val);
>  
>  	if (msr_val & RMID_VAL_ERROR)
> @@ -757,8 +767,8 @@ int __init rdt_get_mon_l3_config(struct rdt_resource *r)
>  	int ret;
>  
>  	resctrl_rmid_realloc_limit = boot_cpu_data.x86_cache_size * 1024;
> -	hw_res->mon_scale = boot_cpu_data.x86_cache_occ_scale;
> -	r->num_rmid = boot_cpu_data.x86_cache_max_rmid + 1;
> +	hw_res->mon_scale = boot_cpu_data.x86_cache_occ_scale / snc_nodes_per_l3_cache;
> +	r->num_rmid = (boot_cpu_data.x86_cache_max_rmid + 1) / snc_nodes_per_l3_cache;
>  	hw_res->mbm_width = MBM_CNTR_WIDTH_BASE;
>  
>  	if (mbm_offset > 0 && mbm_offset <= MBM_CNTR_WIDTH_OFFSET_MAX)
> diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> index 770f2bf98462..e639069f871a 100644
> --- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> +++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> @@ -1425,7 +1425,7 @@ unsigned int rdtgroup_cbm_to_size(struct rdt_resource *r,
>  		}
>  	}
>  
> -	return size;
> +	return size / snc_nodes_per_l3_cache;
>  }
>  
>  /*
> @@ -2293,7 +2293,8 @@ static bool supports_mba_mbps(void)
>  	struct rdt_resource *r = &rdt_resources_all[RDT_RESOURCE_MBA].r_resctrl;
>  
>  	return (is_mbm_local_enabled() &&
> -		r->alloc_capable && is_mba_linear());
> +		r->alloc_capable && is_mba_linear() &&
> +		snc_nodes_per_l3_cache == 1);
>  }
>  
>  /*

-- 
Thanks
Babu Moger

^ permalink raw reply	[flat|nested] 331+ messages in thread

* Re: [PATCH v15-RFC 0/8] Add support for Sub-NUMA cluster (SNC) systems
  2024-02-09 15:27                       ` Moger, Babu
@ 2024-02-09 18:31                         ` Tony Luck
  2024-02-09 19:38                           ` Moger, Babu
  2024-02-12 19:44                         ` Reinette Chatre
  1 sibling, 1 reply; 331+ messages in thread
From: Tony Luck @ 2024-02-09 18:31 UTC (permalink / raw)
  To: Moger, Babu
  Cc: Fenghua Yu, Reinette Chatre, Peter Newman, Jonathan Corbet,
	Shuah Khan, x86, Shaopeng Tan, James Morse, Jamie Iles,
	Randy Dunlap, Drew Fustini, linux-kernel, linux-doc, patches

On Fri, Feb 09, 2024 at 09:27:56AM -0600, Moger, Babu wrote:
> Hi Tony,
> 
> On 1/30/24 16:20, Tony Luck wrote:
> > This is the re-worked version of this series that I promised to post
> > yesterday. Check that e-mail for the arguments for this alternate
> > approach.
> 
> To be honest, I like this series more than the previous series. I always
> thought RDT_RESOURCE_L3_MON should have been a separate resource by itself.
> 
> You need to separate the domain lists for RDT_RESOURCE_L3 and
> RDT_RESOURCE_L3_MON if you are going this route. I didn't see that in this
series. Also, I have a few other comments.

They are separated. Each "struct rdt_resource" has its own domain list.

Or do you mean break up the struct rdt_domain into the control and
monitor versions as was done in the previous series?
> 
> Thanks
> Babu
> 

^ permalink raw reply	[flat|nested] 331+ messages in thread

* Re: [PATCH v15-RFC 1/8] x86/resctrl: Split the RDT_RESOURCE_L3 resource
  2024-02-09 15:28                         ` Moger, Babu
@ 2024-02-09 18:44                           ` Tony Luck
  2024-02-09 19:40                             ` Moger, Babu
  0 siblings, 1 reply; 331+ messages in thread
From: Tony Luck @ 2024-02-09 18:44 UTC (permalink / raw)
  To: Moger, Babu
  Cc: Fenghua Yu, Reinette Chatre, Peter Newman, Jonathan Corbet,
	Shuah Khan, x86, Shaopeng Tan, James Morse, Jamie Iles,
	Randy Dunlap, Drew Fustini, linux-kernel, linux-doc, patches

On Fri, Feb 09, 2024 at 09:28:16AM -0600, Moger, Babu wrote:
> >  enum resctrl_res_level {
> > +	RDT_RESOURCE_L3_MON,
> >  	RDT_RESOURCE_L3,
> 
> How about?
> RDT_RESOURCE_L3,
> RDT_RESOURCE_L3_MON,

Does the order matter? I put the L3_MON one first because historically
cache occupancy was the first resctrl tool. But if you have a better
argument for the order, then I can change it.

> >  	RDT_RESOURCE_L2,
> >  	RDT_RESOURCE_MBA,
> > diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
> > index aa9810a64258..c50f55d7790e 100644
> > --- a/arch/x86/kernel/cpu/resctrl/core.c
> > +++ b/arch/x86/kernel/cpu/resctrl/core.c
> > @@ -60,6 +60,16 @@ mba_wrmsr_amd(struct rdt_domain *d, struct msr_param *m,
> >  #define domain_init(id) LIST_HEAD_INIT(rdt_resources_all[id].r_resctrl.domains)
> >  
> >  struct rdt_hw_resource rdt_resources_all[] = {
> > +	[RDT_RESOURCE_L3_MON] =
> > +	{
> > +		.r_resctrl = {
> > +			.rid			= RDT_RESOURCE_L3_MON,
> > +			.name			= "L3",
> 
> L3_MON ?

That was my first choice too. But I found:

$ ls /sys/fs/resctrl/info
L3  L3_MON_MON  last_cmd_status  MB

This would be easy to fix ... just change this code to not append
an extra "_MON" to the directory name:

        for_each_mon_capable_rdt_resource(r) {
                fflags = r->fflags | RFTYPE_MON_INFO;
                sprintf(name, "%s_MON", r->name);
                ret = rdtgroup_mkdir_info_resdir(r, name, fflags);
                if (ret)
                        goto out_destroy;
        }

But I also saw this:
$ ls /sys/fs/resctrl/mon_data/
mon_L3_MON_00  mon_L3_MON_01

which didn't seem to have an easy fix. So I took the easy route and left
the ".name" field as "L3".
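
For reference, the "easy" half of the fix might look something like this
(untested sketch - it just stops appending "_MON" and relies on the
resource name already carrying the suffix):

        for_each_mon_capable_rdt_resource(r) {
                fflags = r->fflags | RFTYPE_MON_INFO;
                /* r->name would already be "L3_MON" here, use it as-is */
                ret = rdtgroup_mkdir_info_resdir(r, r->name, fflags);
                if (ret)
                        goto out_destroy;
        }

but the mon_data naming is the part that kept me from going down that path.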

-Tony

^ permalink raw reply	[flat|nested] 331+ messages in thread

* Re: [PATCH v15-RFC 2/8] x86/resctrl: Move all monitoring functions to RDT_RESOURCE_L3_MON
  2024-02-09 15:28                         ` Moger, Babu
@ 2024-02-09 18:51                           ` Tony Luck
  0 siblings, 0 replies; 331+ messages in thread
From: Tony Luck @ 2024-02-09 18:51 UTC (permalink / raw)
  To: Moger, Babu
  Cc: Fenghua Yu, Reinette Chatre, Peter Newman, Jonathan Corbet,
	Shuah Khan, x86, Shaopeng Tan, James Morse, Jamie Iles,
	Randy Dunlap, Drew Fustini, linux-kernel, linux-doc, patches

On Fri, Feb 09, 2024 at 09:28:25AM -0600, Moger, Babu wrote:
> Tony,
> 
> On 1/30/24 16:20, Tony Luck wrote:
> > Switch over all places that setup and use monitoring funtions to
> 
> functions?

Yes. Will fix.

> > use the new resource structure.
> > 
> > Signed-off-by: Tony Luck <tony.luck@intel.com>
> > ---
> >  arch/x86/kernel/cpu/resctrl/core.c     |  6 ++++--
> >  arch/x86/kernel/cpu/resctrl/monitor.c  | 12 ++++--------
> >  arch/x86/kernel/cpu/resctrl/rdtgroup.c |  2 +-
> >  3 files changed, 9 insertions(+), 11 deletions(-)
> > 
> > diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
> > index c50f55d7790e..0828575c3e13 100644
> > --- a/arch/x86/kernel/cpu/resctrl/core.c
> > +++ b/arch/x86/kernel/cpu/resctrl/core.c
> > @@ -591,11 +591,13 @@ static void domain_remove_cpu(int cpu, struct rdt_resource *r)
> >  		return;
> >  	}
> >  
> > -	if (r == &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl) {
> > +	if (r == &rdt_resources_all[RDT_RESOURCE_L3_MON].r_resctrl) {
> >  		if (is_mbm_enabled() && cpu == d->mbm_work_cpu) {
> >  			cancel_delayed_work(&d->mbm_over);
> >  			mbm_setup_overflow_handler(d, 0);
> >  		}
> > +	}
> > +	if (r == &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl) {
> 
> RDT_RESOURCE_L3_MON?

Good catch.

-Tony

^ permalink raw reply	[flat|nested] 331+ messages in thread

* Re: [PATCH v15-RFC 3/8] x86/resctrl: Prepare for non-cache-scoped resources
  2024-02-09 15:28                         ` Moger, Babu
@ 2024-02-09 18:57                           ` Tony Luck
  2024-02-09 19:44                             ` Moger, Babu
  0 siblings, 1 reply; 331+ messages in thread
From: Tony Luck @ 2024-02-09 18:57 UTC (permalink / raw)
  To: Moger, Babu
  Cc: Fenghua Yu, Reinette Chatre, Peter Newman, Jonathan Corbet,
	Shuah Khan, x86, Shaopeng Tan, James Morse, Jamie Iles,
	Randy Dunlap, Drew Fustini, linux-kernel, linux-doc, patches

On Fri, Feb 09, 2024 at 09:28:35AM -0600, Moger, Babu wrote:
> Hi Tony,
> 
> On 1/30/24 16:20, Tony Luck wrote:
> > Not all resources are scoped in line with some level of hardware cache.
> 
> same level?

No. "same" isn't what I meant here. If I shuffle this around:

	Not all resources are scoped to match the scope of a hardware
	cache level.

Is that more clear?


-Tony

^ permalink raw reply	[flat|nested] 331+ messages in thread

* Re: [PATCH v15-RFC 4/8] x86/resctrl: Add helper function to look up domain_id from scope
  2024-02-09 15:28                         ` Moger, Babu
@ 2024-02-09 19:06                           ` Tony Luck
  0 siblings, 0 replies; 331+ messages in thread
From: Tony Luck @ 2024-02-09 19:06 UTC (permalink / raw)
  To: Moger, Babu
  Cc: Fenghua Yu, Reinette Chatre, Peter Newman, Jonathan Corbet,
	Shuah Khan, x86, Shaopeng Tan, James Morse, Jamie Iles,
	Randy Dunlap, Drew Fustini, linux-kernel, linux-doc, patches

On Fri, Feb 09, 2024 at 09:28:52AM -0600, Moger, Babu wrote:
> > +	if (id < 0) {
> > +		pr_warn_once("Can't find domain id for CPU:%d scope:%d for resource %s\n",
> > +			     cpu, r->scope, r->name);
> 
> Will it be good to move pr_warn_once inside get_domain_id_from_scope
> instead of repeating during every call?

Yes. Will move from here to get_domain_id_from_scope().

> > +	if (id < 0) {
> > +		pr_warn_once("Can't find domain id for CPU:%d scope:%d for resource %s\n",
> > +			     cpu, r->scope, r->name);
> 
> Same comment as above. Will it be good to move pr_warn_once inside
> get_domain_id_from_scope ?

Moved this one too.
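
Something like this, perhaps (untested sketch - the scope enum names and the
cacheinfo lookup are illustrative, not final):

static int get_domain_id_from_scope(int cpu, enum resctrl_scope scope)
{
        switch (scope) {
        case RESCTRL_L3_CACHE:
                return get_cpu_cacheinfo_id(cpu, 3);
        case RESCTRL_NODE:
                return cpu_to_node(cpu);
        default:
                break;
        }

        pr_warn_once("Can't find domain id for CPU:%d scope:%d\n", cpu, scope);
        return -EINVAL;
}

so the callers only need to check for a negative id.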

Thanks

-Tony

^ permalink raw reply	[flat|nested] 331+ messages in thread

* Re: [PATCH v15-RFC 5/8] x86/resctrl: Add "NODE" as an option for resource scope
  2024-02-09 15:29                         ` Moger, Babu
@ 2024-02-09 19:23                           ` Tony Luck
  2024-02-12 17:16                             ` Moger, Babu
  0 siblings, 1 reply; 331+ messages in thread
From: Tony Luck @ 2024-02-09 19:23 UTC (permalink / raw)
  To: Moger, Babu
  Cc: Fenghua Yu, Reinette Chatre, Peter Newman, Jonathan Corbet,
	Shuah Khan, x86, Shaopeng Tan, James Morse, Jamie Iles,
	Randy Dunlap, Drew Fustini, linux-kernel, linux-doc, patches

On Fri, Feb 09, 2024 at 09:29:03AM -0600, Moger, Babu wrote:
> Hi Tony,
> 
> This patch probably needs to be merged with with patch 7.

If it just added RESCTRL_NODE to the enum and the switch() I'd
agree (as patch 7 is where RESCTRL_NODE is first used). But this
patch also adds the sanity checks on scope where it has to be
a cache level.

Patch 7 is already on the big side (119 lines added to core.c).

If you really think this series needs to cut back the
number of patches, I could move the sanity check pieces
from here to patch 3 (where the enum is introduced) and
just the RESCTRL_NODE bits to patch 7.

-Tony

^ permalink raw reply	[flat|nested] 331+ messages in thread

* Re: [PATCH v15-RFC 6/8] x86/resctrl: Introduce snc_nodes_per_l3_cache
  2024-02-09 15:29                         ` Moger, Babu
@ 2024-02-09 19:35                           ` Tony Luck
  2024-02-12 17:22                             ` Moger, Babu
  0 siblings, 1 reply; 331+ messages in thread
From: Tony Luck @ 2024-02-09 19:35 UTC (permalink / raw)
  To: Moger, Babu
  Cc: Fenghua Yu, Reinette Chatre, Peter Newman, Jonathan Corbet,
	Shuah Khan, x86, Shaopeng Tan, James Morse, Jamie Iles,
	Randy Dunlap, Drew Fustini, linux-kernel, linux-doc, patches

On Fri, Feb 09, 2024 at 09:29:16AM -0600, Moger, Babu wrote:
> >  
> > +extern unsigned int snc_nodes_per_l3_cache;
> 
> I feel this can be part of rdt_resource instead of global.

Mixed emotions about that. It would be another field that appears
in every instance of rdt_resource, but only used by the RDT_RESOURCE_L3_MON
copy.

> 
> > +
> >  enum resctrl_res_level {
> >  	RDT_RESOURCE_L3_MON,
> >  	RDT_RESOURCE_L3,
> > diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
> > index b741cbf61843..dc886d2c9a33 100644
> > --- a/arch/x86/kernel/cpu/resctrl/core.c
> > +++ b/arch/x86/kernel/cpu/resctrl/core.c
> > @@ -48,6 +48,12 @@ int max_name_width, max_data_width;
> >   */
> >  bool rdt_alloc_capable;
> >  
> > +/*
> > + * Number of SNC nodes that share each L3 cache.  Default is 1 for
> > + * systems that do not support SNC, or have SNC disabled.
> > + */
> > +unsigned int snc_nodes_per_l3_cache = 1;
> > +
> >  static void
> >  mba_wrmsr_intel(struct rdt_domain *d, struct msr_param *m,
> >  		struct rdt_resource *r);
> > diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c
> > index 080cad0d7288..357919bbadbe 100644
> > --- a/arch/x86/kernel/cpu/resctrl/monitor.c
> > +++ b/arch/x86/kernel/cpu/resctrl/monitor.c
> > @@ -148,8 +148,18 @@ static inline struct rmid_entry *__rmid_entry(u32 rmid)
> >  
> >  static int __rmid_read(u32 rmid, enum resctrl_event_id eventid, u64 *val)
> >  {
> > +	struct rdt_resource *r = &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl;
> 
> RDT_RESOURCE_L3_MON?

Second good catch.

> 
> > +	int cpu = smp_processor_id();
> > +	int rmid_offset = 0;
> >  	u64 msr_val;
> >  
> > +	/*
> > +	 * When SNC mode is on, need to compute the offset to read the
> > +	 * physical RMID counter for the node to which this CPU belongs.
> > +	 */
> > +	if (snc_nodes_per_l3_cache > 1)
> > +		rmid_offset = (cpu_to_node(cpu) % snc_nodes_per_l3_cache) * r->num_rmid;
> 
> Not sure if you have tested or not. r->num_rmid is initialized for the
> resource RDT_RESOURCE_L3_MON. For other resource it is always 0.

I hadn't got time on the SNC machine to try this out. Thanks
for catching this one, I'd have been scratching my head for a
while to track the symptoms of this problem back to this mistake.

Thanks

-Tony

^ permalink raw reply	[flat|nested] 331+ messages in thread

* Re: [PATCH v15-RFC 0/8] Add support for Sub-NUMA cluster (SNC) systems
  2024-02-09 18:31                         ` Tony Luck
@ 2024-02-09 19:38                           ` Moger, Babu
  2024-02-09 21:26                             ` Reinette Chatre
  0 siblings, 1 reply; 331+ messages in thread
From: Moger, Babu @ 2024-02-09 19:38 UTC (permalink / raw)
  To: Tony Luck
  Cc: Fenghua Yu, Reinette Chatre, Peter Newman, Jonathan Corbet,
	Shuah Khan, x86, Shaopeng Tan, James Morse, Jamie Iles,
	Randy Dunlap, Drew Fustini, linux-kernel, linux-doc, patches


On 2/9/24 12:31, Tony Luck wrote:
> On Fri, Feb 09, 2024 at 09:27:56AM -0600, Moger, Babu wrote:
>> Hi Tony,
>>
>> On 1/30/24 16:20, Tony Luck wrote:
>>> This is the re-worked version of this series that I promised to post
>>> yesterday. Check that e-mail for the arguments for this alternate
>>> approach.
>>
>> To be honest, I like this series more than the previous series. I always
>> thought RDT_RESOURCE_L3_MON should have been a separate resource by itself.
>>
>> You need to separate the domain lists for RDT_RESOURCE_L3 and
>> RDT_RESOURCE_L3_MON if you are going this route. I didn't see that in this
>> series. Also, I have a few other comments.
> 
> They are separated. Each "struct rdt_resource" has its own domain list.

Yea. You are right.
> 
> Or do you mean break up the struct rdt_domain into the control and
> monitor versions as was done in the previous series?

No. Not required. Each resource has its own domain list. So, it is
separated already as far as I can see.

Reinette seems to have some concerns about this series. But I am fine with
both these approaches. I feel this is the cleaner approach.
-- 
Thanks
Babu Moger

^ permalink raw reply	[flat|nested] 331+ messages in thread

* Re: [PATCH v15-RFC 1/8] x86/resctrl: Split the RDT_RESOURCE_L3 resource
  2024-02-09 18:44                           ` Tony Luck
@ 2024-02-09 19:40                             ` Moger, Babu
  0 siblings, 0 replies; 331+ messages in thread
From: Moger, Babu @ 2024-02-09 19:40 UTC (permalink / raw)
  To: Tony Luck
  Cc: Fenghua Yu, Reinette Chatre, Peter Newman, Jonathan Corbet,
	Shuah Khan, x86, Shaopeng Tan, James Morse, Jamie Iles,
	Randy Dunlap, Drew Fustini, linux-kernel, linux-doc, patches



On 2/9/24 12:44, Tony Luck wrote:
> On Fri, Feb 09, 2024 at 09:28:16AM -0600, Moger, Babu wrote:
>>>  enum resctrl_res_level {
>>> +	RDT_RESOURCE_L3_MON,
>>>  	RDT_RESOURCE_L3,
>>
>> How about?
>> RDT_RESOURCE_L3,
>> RDT_RESOURCE_L3_MON,
> 
> Does the order matter? I put the L3_MON one first because historically
> cache occupancy was the first resctrl tool. But if you have a better
> argument for the order, then I can change it.

That is fine. No need to change.

> 
>>>  	RDT_RESOURCE_L2,
>>>  	RDT_RESOURCE_MBA,
>>> diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
>>> index aa9810a64258..c50f55d7790e 100644
>>> --- a/arch/x86/kernel/cpu/resctrl/core.c
>>> +++ b/arch/x86/kernel/cpu/resctrl/core.c
>>> @@ -60,6 +60,16 @@ mba_wrmsr_amd(struct rdt_domain *d, struct msr_param *m,
>>>  #define domain_init(id) LIST_HEAD_INIT(rdt_resources_all[id].r_resctrl.domains)
>>>  
>>>  struct rdt_hw_resource rdt_resources_all[] = {
>>> +	[RDT_RESOURCE_L3_MON] =
>>> +	{
>>> +		.r_resctrl = {
>>> +			.rid			= RDT_RESOURCE_L3_MON,
>>> +			.name			= "L3",
>>
>> L3_MON ?
> 
> That was my first choice too. But I found:
> 
> $ ls /sys/fs/resctrl/info
> L3  L3_MON_MON  last_cmd_status  MB
> 
> This would be easy to fix ... just change this code to not append
> an extra "_MON" to the directory name:
> 
>         for_each_mon_capable_rdt_resource(r) {
>                 fflags = r->fflags | RFTYPE_MON_INFO;
>                 sprintf(name, "%s_MON", r->name);
>                 ret = rdtgroup_mkdir_info_resdir(r, name, fflags);
>                 if (ret)
>                         goto out_destroy;
>         }
> 
> But I also saw this:
> $ ls /sys/fs/resctrl/mon_data/
> mon_L3_MON_00  mon_L3_MON_01
> 
> which didn't seem to have an easy fix. So I took the easy route and left
> the ".name" field as "L3".
> 

Ok. Sounds good.

-- 
Thanks
Babu Moger

^ permalink raw reply	[flat|nested] 331+ messages in thread

* Re: [PATCH v15-RFC 3/8] x86/resctrl: Prepare for non-cache-scoped resources
  2024-02-09 18:57                           ` Tony Luck
@ 2024-02-09 19:44                             ` Moger, Babu
  0 siblings, 0 replies; 331+ messages in thread
From: Moger, Babu @ 2024-02-09 19:44 UTC (permalink / raw)
  To: Tony Luck
  Cc: Fenghua Yu, Reinette Chatre, Peter Newman, Jonathan Corbet,
	Shuah Khan, x86, Shaopeng Tan, James Morse, Jamie Iles,
	Randy Dunlap, Drew Fustini, linux-kernel, linux-doc, patches



On 2/9/24 12:57, Tony Luck wrote:
> On Fri, Feb 09, 2024 at 09:28:35AM -0600, Moger, Babu wrote:
>> Hi Tony,
>>
>> On 1/30/24 16:20, Tony Luck wrote:
>>> Not all resources are scoped in line with some level of hardware cache.
>>
>> same level?
> 
> No. "same" isn't what I meant here. If I shuffle this around:
> 
> 	Not all resources are scoped to match the scope of a hardware
> 	cache level.
> 
> Is that more clear?

Sure. Looks good.

-- 
Thanks
Babu Moger

^ permalink raw reply	[flat|nested] 331+ messages in thread

* Re: [PATCH v15-RFC 0/8] Add support for Sub-NUMA cluster (SNC) systems
  2024-02-09 19:38                           ` Moger, Babu
@ 2024-02-09 21:26                             ` Reinette Chatre
  2024-02-09 21:36                               ` Luck, Tony
  0 siblings, 1 reply; 331+ messages in thread
From: Reinette Chatre @ 2024-02-09 21:26 UTC (permalink / raw)
  To: babu.moger, Tony Luck
  Cc: Fenghua Yu, Peter Newman, Jonathan Corbet, Shuah Khan, x86,
	Shaopeng Tan, James Morse, Jamie Iles, Randy Dunlap,
	Drew Fustini, linux-kernel, linux-doc, patches



On 2/9/2024 11:38 AM, Moger, Babu wrote:
> On 2/9/24 12:31, Tony Luck wrote:
>> On Fri, Feb 09, 2024 at 09:27:56AM -0600, Moger, Babu wrote:
>>> On 1/30/24 16:20, Tony Luck wrote:
> 
> Reinette seems to have some concerns about this series. But I am fine with
> both these approaches. I feel this is the cleaner approach.

I questioned the motivation but never received a response. 

Reinette

^ permalink raw reply	[flat|nested] 331+ messages in thread

* RE: [PATCH v15-RFC 0/8] Add support for Sub-NUMA cluster (SNC) systems
  2024-02-09 21:26                             ` Reinette Chatre
@ 2024-02-09 21:36                               ` Luck, Tony
  2024-02-09 22:16                                 ` Reinette Chatre
  0 siblings, 1 reply; 331+ messages in thread
From: Luck, Tony @ 2024-02-09 21:36 UTC (permalink / raw)
  To: Chatre, Reinette, babu.moger
  Cc: Yu, Fenghua, Peter Newman, Jonathan Corbet, Shuah Khan, x86,
	Shaopeng Tan, James Morse, Jamie Iles, Randy Dunlap,
	Drew Fustini, linux-kernel, linux-doc, patches

>> Reinette seems to have some concerns about this series. But I am fine with
>> both these approaches. I feel this is the cleaner approach.
>
> I questioned the motivation but never received a response. 

Reinette,

Sorry. My motivation was to reduce the amount of code churn that
was done by the previous incarnation.

   9 files changed, 629 insertions(+), 282 deletions(-)

Vast amounts of that just added "_mon" or "_ctrl" to structure
or variable names.

-Tony

^ permalink raw reply	[flat|nested] 331+ messages in thread

* Re: [PATCH v15-RFC 0/8] Add support for Sub-NUMA cluster (SNC) systems
  2024-02-09 21:36                               ` Luck, Tony
@ 2024-02-09 22:16                                 ` Reinette Chatre
  2024-02-09 23:44                                   ` Luck, Tony
  0 siblings, 1 reply; 331+ messages in thread
From: Reinette Chatre @ 2024-02-09 22:16 UTC (permalink / raw)
  To: Luck, Tony, babu.moger
  Cc: Yu, Fenghua, Peter Newman, Jonathan Corbet, Shuah Khan, x86,
	Shaopeng Tan, James Morse, Jamie Iles, Randy Dunlap,
	Drew Fustini, linux-kernel, linux-doc, patches

Hi Tony,

On 2/9/2024 1:36 PM, Luck, Tony wrote:
>>> Reinette seem to have some concerns about this series. But, I am fine with
>>> both these approaches. I feel this is more clean approach.
>>
>> I questioned the motivation but never received a response. 
> 
> Reinette,
> 
> Sorry. My motivation was to reduce the amount of code churn that
> was done by the previous incarnation.
> 
>    9 files changed, 629 insertions(+), 282 deletions(-)
> 
> Vast amounts of that just added "_mon" or "_ctrl" to structure
> or variable names.

I actually had specific points that this response also ignores. 
Let me repeat and highlight the same points:

1) You claim that this series "removes the need for separate domain
   lists" ... but then this series does just that (create a separate
   domain list), but in an obfuscated way (duplicate the resource to
   have the monitoring domain list in there).
2) You claim this series "reduces amount of code churn", but this is
   because this series keeps using the same original data structures
   for separate monitoring and control usages. The previous series made 
   an effort to separate the structures for the different usages
   but this series does not. What makes it ok in this series to
   use the same data structures for different usages? 

Additionally:

Regarding "Vast amounts of that just added "_mon" or "_ctrl" to structure
or variable names." ... that is because the structures are actually split,
no? It is not just renaming for unnecessary churn.

What is the benefit of keeping the data structures to be shared
between monitor and control usages?
If there is a benefit to keeping these data structures, why not just
address this aspect in previous solution?

Reinette

   


^ permalink raw reply	[flat|nested] 331+ messages in thread

* RE: [PATCH v15-RFC 0/8] Add support for Sub-NUMA cluster (SNC) systems
  2024-02-09 22:16                                 ` Reinette Chatre
@ 2024-02-09 23:44                                   ` Luck, Tony
  2024-02-10  0:28                                     ` Reinette Chatre
  0 siblings, 1 reply; 331+ messages in thread
From: Luck, Tony @ 2024-02-09 23:44 UTC (permalink / raw)
  To: Chatre, Reinette, babu.moger
  Cc: Yu, Fenghua, Peter Newman, Jonathan Corbet, Shuah Khan, x86,
	Shaopeng Tan, James Morse, Jamie Iles, Randy Dunlap,
	Drew Fustini, linux-kernel, linux-doc, patches

> I actually had specific points that this response also ignores.
> Let me repeat and highlight the same points:
>
> 1) You claim that this series "removes the need for separate domain
>    lists" ... but then this series does just that (create a separate
>    domain list), but in an obfuscated way (duplicate the resource to
>    have the monitoring domain list in there).

That was poorly worded on my part. I should have said "removes the
need for separate domain lists within a single rdt_resource".

Adding an extra domain list to a resource may be the start of a slippery
slope. What if there is some additional "L3"-like resctrl operation that
acts at the socket level (Intel has made products with multiple L3
instances per socket before). Would you be OK adding a third domain
list to every struct rdt_resource to handle this? Or would it be simpler
to just add a new rdt_resource structure with socket scoped domains?

> 2) You claim this series "reduces amount of code churn", but this is
>    because this series keeps using the same original data structures
>    for separate monitoring and control usages. The previous series made
>    an effort to separate the structures for the different usages
>    but this series does not. What makes it ok in this series to
>    use the same data structures for different usages?

Legacy resctrl has been using the same rdt_domain structure for both
usages since the dawn of time. So it has been OK up until now.

> Additionally:
>
> Regarding "Vast amounts of that just added "_mon" or "_ctrl" to structure
> or variable names." ... that is because the structures are actually split,
> no? It is not just renaming for unnecessary churn.

Perhaps not "unnecessary" churn. But certainly a lot of code change for
what I perceive as very little real gain. 

> What is the benefit of keeping the data structures to be shared
> between monitor and control usages?

Benefit is no code changes. Cost is continuing to waste memory with
structures that are slightly bigger than they need to be.

> If there is a benefit to keeping these data structures, why not just
> address this aspect in previous solution?

The previous solution evolved to splitting these structures. But this
happened incrementally (remember that at an early stage the monitor
structures all got the "_mon" addition to their names, but the control
structures kept the original names). Only when I got to the end of this
process did I look at the magnitude of the change.

-Tony

^ permalink raw reply	[flat|nested] 331+ messages in thread

* Re: [PATCH v15-RFC 0/8] Add support for Sub-NUMA cluster (SNC) systems
  2024-02-09 23:44                                   ` Luck, Tony
@ 2024-02-10  0:28                                     ` Reinette Chatre
  2024-02-12 16:52                                       ` Luck, Tony
  0 siblings, 1 reply; 331+ messages in thread
From: Reinette Chatre @ 2024-02-10  0:28 UTC (permalink / raw)
  To: Luck, Tony, babu.moger
  Cc: Yu, Fenghua, Peter Newman, Jonathan Corbet, Shuah Khan, x86,
	Shaopeng Tan, James Morse, Jamie Iles, Randy Dunlap,
	Drew Fustini, linux-kernel, linux-doc, patches

Hi Tony,

On 2/9/2024 3:44 PM, Luck, Tony wrote:
>> I actually had specific points that this response also ignores.
>> Let me repeat and highlight the same points:
>>
>> 1) You claim that this series "removes the need for separate domain
>>    lists" ... but then this series does just that (create a separate
>>    domain list), but in an obfuscated way (duplicate the resource to
>>    have the monitoring domain list in there).
> 
> That was poorly worded on my part. I should have said "removes the
> need for separate domain lists within a single rdt_resource".
> 
> Adding an extra domain list to a resource may be the start of a slippery
> slope. What if there is some additional "L3"-like resctrl operation that
> acts at the socket level (Intel has made products with multiple L3
> instances per socket before). Would you be OK adding a third domain
> list to every struct rdt_resource to handle this? Or would it be simpler
> to just add a new rdt_resource structure with socket scoped domains?

This should not be about what is simplest to patch into current resctrl.

There is no need to support a new domain list for a new scope. The domain
lists support the functionality: control or monitoring. If control has
socket scope the existing implementation supports that.
If there is another operation supported by a resource apart from 
control or monitoring then we can consider how to support it when
we know what it is. That would also be a great point to decide if
the same data structure should just grow to support an operation that
not all resources may support. That may depend on the amount of data
needed to support this hypothetical operation.

> 
>> 2) You claim this series "reduces amount of code churn", but this is
>>    because this series keeps using the same original data structures
>>    for separate monitoring and control usages. The previous series made
>>    an effort to separate the structures for the different usages
>>    but this series does not. What makes it ok in this series to
>>    use the same data structures for different usages?
> 
> Legacy resctrl has been using the same rdt_domain structure for both
> usages since the dawn of time. So it has been OK up until now.

This is not the same.

Legacy resctrl uses the same data structure in the same list for both control
and monitoring usages so it is fine to have both monitoring and control data
in the data structure.

What you are doing in both solutions is to place the same data structure
in separate lists for control and monitoring usages. In the one list only the
control data is used, on the other only the monitoring data is used.

>> Additionally:
>>
>> Regarding "Vast amounts of that just added "_mon" or "_ctrl" to structure
>> or variable names." ... that is because the structures are actually split,
>> no? It is not just renaming for unnecessary churn.
> 
> Perhaps not "unnecessary" churn. But certainly a lot of code change for
> what I perceive as very little real gain. 

ok. There may be little gain wrt saving space. One complication with
this single data structure is that its content may only be decided based
on which list it is part of. It should be obvious to developers which
members are valid and when. Perhaps this can be addressed with clear
documentation of the data structures.

> 
>> What is the benefit of keeping the data structures to be shared
>> between monitor and control usages?
> 
> Benefit is no code changes. Cost is continuing to waste memory with
> structures that are slightly bigger than they need to be.
> 
>> If there is a benefit to keeping these data structures, why not just
>> address this aspect in previous solution?
> 
> The previous solution evolved to splitting these structures. But this
> happened incrementally (remember that at an early stage the monitor
> structures all got the "_mon" addition to their names, but the control
> structures kept the original names). Only when I got to the end of this
> process did I look at the magnitude of the change.

Not answering my question. 

Reinette

^ permalink raw reply	[flat|nested] 331+ messages in thread

* RE: [PATCH v15-RFC 0/8] Add support for Sub-NUMA cluster (SNC) systems
  2024-02-10  0:28                                     ` Reinette Chatre
@ 2024-02-12 16:52                                       ` Luck, Tony
  0 siblings, 0 replies; 331+ messages in thread
From: Luck, Tony @ 2024-02-12 16:52 UTC (permalink / raw)
  To: Chatre, Reinette, babu.moger
  Cc: Yu, Fenghua, Peter Newman, Jonathan Corbet, Shuah Khan, x86,
	Shaopeng Tan, James Morse, Jamie Iles, Randy Dunlap,
	Drew Fustini, linux-kernel, linux-doc, patches

> >> I actually had specific points that this response also ignores.
> >> Let me repeat and highlight the same points:
> >>
> >> 1) You claim that this series "removes the need for separate domain
> >>    lists" ... but then this series does just that (create a separate
> >>    domain list), but in an obfuscated way (duplicate the resource to
> >>    have the monitoring domain list in there).
> >
> > That was poorly worded on my part. I should have said "removes the
> > need for separate domain lists within a single rdt_resource".
> >
> > Adding an extra domain list to a resource may be the start of a slippery
> > slope. What if there is some additional "L3"-like resctrl operation that
> > acts at the socket level (Intel has made products with multiple L3
> > instances per socket before). Would you be OK adding a third domain
> > list to every struct rdt_resource to handle this? Or would it be simpler
> > to just add a new rdt_resource structure with socket scoped domains?
>
> This should not be about what is simplest to patch into current resctrl.

I wanted to offer this in case Boris also thought that the previous version
was too much churn to support an obscure Intel-only (so far) feature.

But if you are going to Nack this new version on the grounds that it muddies
the water about usage of the rdt_domain structure, then I will abandon it.

> There is no need to support a new domain list for a new scope. The domain
> lists support the functionality: control or monitoring. If control has
> socket scope the existing implementation supports that.
> If there is another operation supported by a resource apart from
> control or monitoring then we can consider how to support it when
> we know what it is. That would also be a great point to decide if
> the same data structure should just grow to support an operation that
> not all resources may support. That may depend on the amount of data
> needed to support this hypothetical operation.
>
> >
> >> 2) You claim this series "reduces amount of code churn", but this is
> >>    because this series keeps using the same original data structures
> >>    for separate monitoring and control usages. The previous series made
> >>    an effort to separate the structures for the different usages
> >>    but this series does not. What makes it ok in this series to
> >>    use the same data structures for different usages?
> >
> > Legacy resctrl has been using the same rdt_domain structure for both
> > usages since the dawn of time. So it has been OK up until now.
>
> This is not the same.
>
> Legacy resctrl uses the same data structure in the same list for both control
> and monitoring usages so it is fine to have both monitoring and control data
> in the data structure.
>
> What you are doing in both solutions is to place the same data structure
> in separate lists for control and monitoring usages. In the one list only the
> control data is used, on the other only the monitoring data is used.
>
> >> Additionally:
> >>
> >> Regarding "Vast amounts of that just added "_mon" or "_ctrl" to structure
> >> or variable names." ... that is because the structures are actually split,
> >> no? It is not just renaming for unnecessary churn.
> >
> > Perhaps not "unnecessary" churn. But certainly a lot of code change for
> > what I perceive as very little real gain.
>
> ok. There may be little gain wrt saving space. One complication with
> this single data structure is that its content may only be decided based
> on which list it is part of. It should be obvious to developers which
> members are valid and when. Perhaps this can be addressed with clear
> documentation of the data structures.
>
> >
> >> What is the benefit of keeping the data structures to be shared
> >> between monitor and control usages?
> >
> > Benefit is no code changes. Cost is continuing to waste memory with
> > structures that are slightly bigger than they need to be.
> >
> >> If there is a benefit to keeping these data structures, why not just
> >> address this aspect in previous solution?
> >
> > The previous solution evolved to splitting these structures. But this
> > happened incrementally (remember that at an early stage the monitor
> > structures all got the "_mon" addition to their names, but the control
> > structures kept the original names). Only when I got to the end of this
> > process did I look at the magnitude of the change.
>
> Not answering my question.

I'm not exactly sure what "aspect" you thought could be addressed in the
previous series. But the point is moot now. This diversion from the
series has come to a dead end, and I hope that Boris will look at v14
(either before the next group of ARM patches, or after).

-Tony

^ permalink raw reply	[flat|nested] 331+ messages in thread

* Re: [PATCH v15-RFC 5/8] x86/resctrl: Add "NODE" as an option for resource scope
  2024-02-09 19:23                           ` Tony Luck
@ 2024-02-12 17:16                             ` Moger, Babu
  0 siblings, 0 replies; 331+ messages in thread
From: Moger, Babu @ 2024-02-12 17:16 UTC (permalink / raw)
  To: Tony Luck
  Cc: Fenghua Yu, Reinette Chatre, Peter Newman, Jonathan Corbet,
	Shuah Khan, x86, Shaopeng Tan, James Morse, Jamie Iles,
	Randy Dunlap, Drew Fustini, linux-kernel, linux-doc, patches

Hi Tony,

On 2/9/24 13:23, Tony Luck wrote:
> On Fri, Feb 09, 2024 at 09:29:03AM -0600, Moger, Babu wrote:
>> Hi Tony,
>>
>> This patch probably needs to be merged with with patch 7.
> 
> If it just added RESCTRL_NODE to the enum and the switch() I'd
> agree (as patch 7 is where RESCTRL_NODE is first used). But this
> patch also adds the sanity checks on scope where it has to be
> a cache level.
> 
> Patch 7 is already on the big side (119 lines added to core.c).
> 
> If you really think this series needs to cut back the
> number of patches, I could move the sanity check pieces
> from here to patch 3 (where the enum is introduced) and
> just the RESCTRL_NODE bits to patch 7.

No. You don't have to cut back on the number of patches. I think it is easier to
review if these changes are merged with patch 7.


-- 
Thanks
Babu Moger

^ permalink raw reply	[flat|nested] 331+ messages in thread

* Re: [PATCH v15-RFC 6/8] x86/resctrl: Introduce snc_nodes_per_l3_cache
  2024-02-09 19:35                           ` Tony Luck
@ 2024-02-12 17:22                             ` Moger, Babu
  0 siblings, 0 replies; 331+ messages in thread
From: Moger, Babu @ 2024-02-12 17:22 UTC (permalink / raw)
  To: Tony Luck
  Cc: Fenghua Yu, Reinette Chatre, Peter Newman, Jonathan Corbet,
	Shuah Khan, x86, Shaopeng Tan, James Morse, Jamie Iles,
	Randy Dunlap, Drew Fustini, linux-kernel, linux-doc, patches

Hi Tony,

On 2/9/24 13:35, Tony Luck wrote:
> On Fri, Feb 09, 2024 at 09:29:16AM -0600, Moger, Babu wrote:
>>>  
>>> +extern unsigned int snc_nodes_per_l3_cache;
>>
>> I feel this can be part of rdt_resource instead of global.
> 
> Mixed emotions about that. It would be another field that appears
> in every instance of rdt_resource, but only used by the RDT_RESOURCE_L3_MON
> copy.
> 
That was the comment I got earlier for my patches (search for global).

https://lore.kernel.org/lkml/47f870e0-ad1a-4a4d-9d9e-4c05bf431858@intel.com/
-- 
Thanks
Babu Moger

^ permalink raw reply	[flat|nested] 331+ messages in thread

* Re: [PATCH v15-RFC 0/8] Add support for Sub-NUMA cluster (SNC) systems
  2024-02-09 15:27                       ` Moger, Babu
  2024-02-09 18:31                         ` Tony Luck
@ 2024-02-12 19:44                         ` Reinette Chatre
  2024-02-12 19:57                           ` Luck, Tony
  2024-02-12 20:46                           ` Moger, Babu
  1 sibling, 2 replies; 331+ messages in thread
From: Reinette Chatre @ 2024-02-12 19:44 UTC (permalink / raw)
  To: babu.moger, Tony Luck, Fenghua Yu, Peter Newman, Jonathan Corbet,
	Shuah Khan, x86
  Cc: Shaopeng Tan, James Morse, Jamie Iles, Randy Dunlap,
	Drew Fustini, linux-kernel, linux-doc, patches

Hi Babu,

On 2/9/2024 7:27 AM, Moger, Babu wrote:

> To be honest, I like this series more than the previous series. I always
> thought RDT_RESOURCE_L3_MON should have been a separate resource by itself.

Would you prefer that your "Reviewed-by" tag be removed from the
previous series?

Reinette

^ permalink raw reply	[flat|nested] 331+ messages in thread

* RE: [PATCH v15-RFC 0/8] Add support for Sub-NUMA cluster (SNC) systems
  2024-02-12 19:44                         ` Reinette Chatre
@ 2024-02-12 19:57                           ` Luck, Tony
  2024-02-12 21:43                             ` Reinette Chatre
  2024-02-12 20:46                           ` Moger, Babu
  1 sibling, 1 reply; 331+ messages in thread
From: Luck, Tony @ 2024-02-12 19:57 UTC (permalink / raw)
  To: Chatre, Reinette, babu.moger, Yu, Fenghua, Peter Newman,
	Jonathan Corbet, Shuah Khan, x86
  Cc: Shaopeng Tan, James Morse, Jamie Iles, Randy Dunlap,
	Drew Fustini, linux-kernel, linux-doc, patches

>> To be honest, I like this series more than the previous series. I always
>> thought RDT_RESOURCE_L3_MON should have been a separate resource by itself.
>
> Would you prefer that your "Reviewed-by" tag be removed from the
> previous series?

I'm thinking that I could continue splitting things and break "struct rdt_resource" into
separate "ctrl" and "mon" structures. Then we'd have a clean split from top to bottom.

Doing that would get rid of the rdt_resources_all[] array. Replacing with individual
rdt_hw_ctrl_resource and rdt_hw_mon_resource declarations for each feature.

Features found on a system would be added to a list of ctrl or list of mon resources.

-Tony

^ permalink raw reply	[flat|nested] 331+ messages in thread

* Re: [PATCH v15-RFC 0/8] Add support for Sub-NUMA cluster (SNC) systems
  2024-02-12 19:44                         ` Reinette Chatre
  2024-02-12 19:57                           ` Luck, Tony
@ 2024-02-12 20:46                           ` Moger, Babu
  1 sibling, 0 replies; 331+ messages in thread
From: Moger, Babu @ 2024-02-12 20:46 UTC (permalink / raw)
  To: Reinette Chatre, Tony Luck, Fenghua Yu, Peter Newman,
	Jonathan Corbet, Shuah Khan, x86
  Cc: Shaopeng Tan, James Morse, Jamie Iles, Randy Dunlap,
	Drew Fustini, linux-kernel, linux-doc, patches



On 2/12/24 13:44, Reinette Chatre wrote:
> Hi Babu,
> 
> On 2/9/2024 7:27 AM, Moger, Babu wrote:
> 
>> To be honest, I like this series more than the previous series. I always
>> thought RDT_RESOURCE_L3_MON should have been a separate resource by itself.
> 
> Would you prefer that your "Reviewed-by" tag be removed from the
> previous series?
> 
Sure. I plan to review the new series again when Tony submits v16.
-- 
Thanks
Babu Moger

^ permalink raw reply	[flat|nested] 331+ messages in thread

* Re: [PATCH v15-RFC 0/8] Add support for Sub-NUMA cluster (SNC) systems
  2024-02-12 19:57                           ` Luck, Tony
@ 2024-02-12 21:43                             ` Reinette Chatre
  2024-02-12 22:05                               ` Tony Luck
  0 siblings, 1 reply; 331+ messages in thread
From: Reinette Chatre @ 2024-02-12 21:43 UTC (permalink / raw)
  To: Luck, Tony, babu.moger, Yu, Fenghua, Peter Newman,
	Jonathan Corbet, Shuah Khan, x86
  Cc: Shaopeng Tan, James Morse, Jamie Iles, Randy Dunlap,
	Drew Fustini, linux-kernel, linux-doc, patches

Hi Tony,

On 2/12/2024 11:57 AM, Luck, Tony wrote:
>>> To be honest, I like this series more than the previous series. I always
>>> thought RDT_RESOURCE_L3_MON should have been a separate resource by itself.
>>
>> Would you prefer that your "Reviewed-by" tag be removed from the
>> previous series?
> 
> I'm thinking that I could continue splitting things and break "struct rdt_resource" into
> separate "ctrl" and "mon" structures. Then we'd have a clean split from top to bottom.

It is not obvious what you mean with "continue splitting things". Are you
speaking about "continue splitting from v14" or "continue splitting from v15-RFC"?

I think that any solution needs to consider what makes sense for resctrl
as a whole instead of how to support SNC with the smallest patch possible.

There should not be any changes that make resctrl harder to understand
and maintain, as exemplified by the confusion introduced by such a simple thing as
resource name choice [1].

> 
> Doing that would get rid of the rdt_resources_all[] array. Replacing with individual
> rdt_hw_ctrl_resource and rdt_hw_mon_resource declarations for each feature.
>
> Features found on a system would be added to a list of ctrl or list of mon resources.

Could you please elaborate what is architecturally wrong with v14 and how this
new proposal addresses that?

Reinette

[1] https://lore.kernel.org/lkml/ZcZyqs5hnQqZ5ZV0@agluck-desk3/

^ permalink raw reply	[flat|nested] 331+ messages in thread

* Re: [PATCH v15-RFC 0/8] Add support for Sub-NUMA cluster (SNC) systems
  2024-02-12 21:43                             ` Reinette Chatre
@ 2024-02-12 22:05                               ` Tony Luck
  2024-02-13 18:11                                 ` James Morse
  0 siblings, 1 reply; 331+ messages in thread
From: Tony Luck @ 2024-02-12 22:05 UTC (permalink / raw)
  To: Reinette Chatre
  Cc: babu.moger, Yu, Fenghua, Peter Newman, Jonathan Corbet,
	Shuah Khan, x86, Shaopeng Tan, James Morse, Jamie Iles,
	Randy Dunlap, Drew Fustini, linux-kernel, linux-doc, patches

On Mon, Feb 12, 2024 at 01:43:56PM -0800, Reinette Chatre wrote:
> Hi Tony,
> 
> On 2/12/2024 11:57 AM, Luck, Tony wrote:
> >>> To be honest, I like this series more than the previous series. I always
> >>> thought RDT_RESOURCE_L3_MON should have been a separate resource by itself.
> >>
> >> Would you prefer that your "Reviewed-by" tag be removed from the
> >> previous series?
> > 
> > I'm thinking that I could continue splitting things and break "struct rdt_resource" into
> > separate "ctrl" and "mon" structures. Then we'd have a clean split from top to bottom.
> 
> It is not obvious what you mean with "continue splitting things". Are you
> speaking about "continue splitting from v14" or "continue splitting from v15-RFC"?

I'm speaking of some future potential changes. Not proposing to
do this now.

> I think that any solution needs to consider what makes sense for resctrl
> as a whole instead of how to support SNC with the smallest patch possible.

I am officially abandoning my v15-RFC patches. I wasn't clear enough in
my e-mail earlier today.

https://lore.kernel.org/all/SJ1PR11MB608378D1304224D9E8A9016FFC482@SJ1PR11MB6083.namprd11.prod.outlook.com/
> 
> There should not be any changes that make resctrl harder to understand
> and maintain, as exemplified by the confusion introduced by such a simple thing as
> resource name choice [1].
> 
> > 
> > Doing that would get rid of the rdt_resources_all[] array. Replacing with individual
> > rdt_hw_ctrl_resource and rdt_hw_mon_resource declarations for each feature.
> >
> > Features found on a system would be added to a list of ctrl or list of mon resources.
> 
> Could you please elaborate what is architecturally wrong with v14 and how this
> new proposal addresses that?

There is nothing architecturally wrong with v14. I thought it was more
complex than it needed to be. You have convinced me that my v15-RFC
series, while simpler, is not a reasonable path for long-term resctrl
maintainability.
> 
> Reinette
> 
> [1] https://lore.kernel.org/lkml/ZcZyqs5hnQqZ5ZV0@agluck-desk3/

-Tony

^ permalink raw reply	[flat|nested] 331+ messages in thread

* Re: [PATCH v15-RFC 0/8] Add support for Sub-NUMA cluster (SNC) systems
  2024-02-12 22:05                               ` Tony Luck
@ 2024-02-13 18:11                                 ` James Morse
  2024-02-13 19:02                                   ` Luck, Tony
  0 siblings, 1 reply; 331+ messages in thread
From: James Morse @ 2024-02-13 18:11 UTC (permalink / raw)
  To: Tony Luck, Reinette Chatre
  Cc: babu.moger, Yu, Fenghua, Peter Newman, Jonathan Corbet,
	Shuah Khan, x86, Shaopeng Tan, Jamie Iles, Randy Dunlap,
	Drew Fustini, linux-kernel, linux-doc, patches

Hello,

On 12/02/2024 22:05, Tony Luck wrote:
> On Mon, Feb 12, 2024 at 01:43:56PM -0800, Reinette Chatre wrote:
>> On 2/12/2024 11:57 AM, Luck, Tony wrote:
>>>>> To be honest, I like this series more than the previous series. I always
>>>>> thought RDT_RESOURCE_L3_MON should have been a separate resource by itself.
>>>>
>>>> Would you prefer that your "Reviewed-by" tag be removed from the
>>>> previous series?
>>>
>>> I'm thinking that I could continue splitting things and break "struct rdt_resource" into
>>> separate "ctrl" and "mon" structures. Then we'd have a clean split from top to bottom.
>>
>> It is not obvious what you mean with "continue splitting things". Are you
>> speaking about "continue splitting from v14" or "continue splitting from v15-RFC"?
> 
> I'm speaking of some future potential changes. Not proposing to
> do this now.
> 
>> I think that any solution needs to consider what makes sense for resctrl
>> as a whole instead of how to support SNC with the smallest patch possible.

>> There should not be any changes that make resctrl harder to understand
>> and maintain, as exemplified by the confusion introduced by such a simple thing as
>> resource name choice [1].
>>
>>>
>>> Doing that would get rid of the rdt_resources_all[] array. Replacing with individual
>>> rdt_hw_ctrl_resource and rdt_hw_mon_resource declarations for each feature.
>>>
>>> Features found on a system would be added to a list of ctrl or list of mon resources.
>>
>> Could you please elaborate what is architecturally wrong with v14 and how this
>> new proposal addresses that?
> 
> There is nothing architecturally wrong with v14. I thought it was more
> complex than it needed to be. You have convinced me that my v15-RFC
> series, while simpler, is not a reasonable path for long-term resctrl
> maintainability.

I'm not sure if it's helpful to describe a third approach at this point - but on the off
chance it's useful:
With SNC enabled, the L3 monitors are unaffected, but the controls behave as if they were
part of some other component in the system.

ACPI describes something called "memory side caches" [0] in the HMAT table, which are
outside the CPU cache hierarchy, and are associated with a Proximity-Domain. I've heard
that one of Arm's partners has built a system with MPAM controls on something like this.
How would we support this - and would this be a better fit for the way SNC behaves?

I think this would be a new resource and schema, 'MSC'(?) with domain-ids using the NUMA
nid. As these aren't CPU caches, they wouldn't appear in the same part of the sysfs
hierarchy, and wouldn't necessarily have a cache-id.
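
(Purely illustrative - if such a schema existed, a schemata line might read
something like "MSC:0=ff;1=ff", with the 0/1 domain-ids being NUMA nids.)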

For SNC systems, I think this would look like CMT on the L3, and CAT on the 'MSC'.
Existing software wouldn't know to use the new schema, but equally wouldn't be surprised
by the domain-ids being something other than the cache-id, and the controls and monitors
not lining up.
Where it's not quite right for SNC is that sysfs may not describe a memory side cache, but one
would be present in resctrl. I don't think that's a problem - unless these systems do also
have a memory-side-cache that behaves differently. (i.e. whether the controls are applied at
the 'near' side of the link - I don't think the difference matters)


I'm a little nervous that the SNC support looks strange if we ever add support for
something like the above. Given it's described in ACPI, I assume there are plenty of
machines out there that look like this.

(Why aren't memory-side-caches a CPU cache? They live near the memory controller and cache
based on the PA, not the CPU that issued the transaction)


Thanks,

James

[0]
https://uefi.org/specs/ACPI/6.5/05_ACPI_Software_Programming_Model.html#memory-side-cache-overview

^ permalink raw reply	[flat|nested] 331+ messages in thread

* RE: [PATCH v15-RFC 0/8] Add support for Sub-NUMA cluster (SNC) systems
  2024-02-13 18:11                                 ` James Morse
@ 2024-02-13 19:02                                   ` Luck, Tony
  0 siblings, 0 replies; 331+ messages in thread
From: Luck, Tony @ 2024-02-13 19:02 UTC (permalink / raw)
  To: James Morse, Chatre, Reinette
  Cc: babu.moger, Yu, Fenghua, Peter Newman, Jonathan Corbet,
	Shuah Khan, x86, Shaopeng Tan, Jamie Iles, Randy Dunlap,
	Drew Fustini, linux-kernel, linux-doc, patches

[-- Attachment #1: Type: text/plain, Size: 1496 bytes --]

> With SNC enabled, the L3 monitors are unaffected, but the controls behave as if they were
> part of some other component in the system.

I don't think of it like that. See attached picture of a single socket divided in two by SNC.
[If the attachment is stripped off for those reading this via mailing lists and you want the
picture, just send me an e-mail.]

Everything in blue is node 0. Yellow for node 1.

The rectangles in the middle represent the L3 cache (12-way associative). When cores
in node 0 access memory in node 0, it will be cached using the "top" half of the cache
indices. Similarly for node 1 using the "bottom" half.

Here’s how each of the Intel L3 resctrl functions operates with SNC enabled:

CQM: Reports how much of your half of the L3 cache is occupied

MBM: Reports on memory traffic from your half of the cache to your memory controllers.

CAT: Still controls which ways of the cache are available for allocation (but each way
has half the capacity.)

MBA: The same throttling levels applied to "blue" and "yellow" traffic (because there
are only socket level controls).
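
As a worked example with made-up numbers: on a 36MB, 12-way L3 split across
two SNC nodes, each way still exists for both nodes, but only half of its sets
belong to a given node. So a CBM that selects 6 ways yields roughly 9MB of
usable capacity instead of 18MB, which is also why rdtgroup_cbm_to_size()
divides the computed size by snc_nodes_per_l3_cache.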

> I'm a little nervous that the SNC support looks strange if we ever add support for
> something like the above. Given it's described in ACPI, I assume there are plenty of
> machines out there that look like this.

I'm also nervous as h/w designers find various ways to diverge from the old paradigm of

	socket scope == L3 cache scope == NUMA node scope

-Tony

[-- Attachment #2: SNC topology.png --]
[-- Type: image/png, Size: 18304 bytes --]

^ permalink raw reply	[flat|nested] 331+ messages in thread

end of thread, other threads:[~2024-02-13 19:02 UTC | newest]

Thread overview: 331+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-07-13 16:31 [PATCH v3 0/8] x86/resctrl: Add support for Sub-NUMA cluster (SNC) systems Tony Luck
2023-07-13 16:32 ` [PATCH v3 1/8] x86/resctrl: Refactor in preparation for node-scoped resources Tony Luck
2023-07-18 20:36   ` Reinette Chatre
2023-07-13 16:32 ` [PATCH v3 2/8] x86/resctrl: Remove hard code of RDT_RESOURCE_L3 in monitor.c Tony Luck
2023-07-18 20:37   ` Reinette Chatre
2023-07-13 16:32 ` [PATCH v3 3/8] x86/resctrl: Add a new node-scoped resource to rdt_resources_all[] Tony Luck
2023-07-18 20:40   ` Reinette Chatre
2023-07-18 22:57     ` Tony Luck
2023-07-18 23:45       ` Reinette Chatre
2023-07-19  0:11         ` Luck, Tony
2023-07-20 17:27           ` Reinette Chatre
2023-07-20  0:20         ` Tony Luck
2023-07-20 17:27           ` Reinette Chatre
2023-07-20 21:56             ` Luck, Tony
2023-07-22 19:18               ` Tony Luck
2023-07-13 16:32 ` [PATCH v3 4/8] x86/resctrl: Add code to setup monitoring at L3 or NODE scope Tony Luck
2023-07-18 20:42   ` Reinette Chatre
2023-07-13 16:32 ` [PATCH v3 5/8] x86/resctrl: Add package scoped resource Tony Luck
2023-07-18 20:43   ` Reinette Chatre
2023-07-18 23:05     ` Tony Luck
2023-07-13 16:32 ` [PATCH v3 6/8] x86/resctrl: Update documentation with Sub-NUMA cluster changes Tony Luck
2023-07-13 16:32 ` [PATCH v3 7/8] x86/resctrl: Determine if Sub-NUMA Cluster is enabled and initialize Tony Luck
2023-07-18 20:45   ` Reinette Chatre
2023-07-13 16:32 ` [PATCH v3 8/8] selftests/resctrl: Adjust effective L3 cache size when SNC enabled Tony Luck
2023-07-19  2:43 ` [PATCH v3 0/8] x86/resctrl: Add support for Sub-NUMA cluster (SNC) systems Shaopeng Tan (Fujitsu)
2023-07-22 20:21   ` Tony Luck
2023-07-22 19:07 ` [PATCH v4 0/7] " Tony Luck
2023-07-22 19:07   ` [PATCH v4 1/7] x86/resctrl: Create separate domains for control and monitoring Tony Luck
2023-08-11 17:29     ` Reinette Chatre
2023-08-25 17:05       ` Tony Luck
2023-08-28 17:05         ` Reinette Chatre
2023-08-28 18:46           ` Tony Luck
2023-08-28 19:56             ` Reinette Chatre
2023-08-28 20:59               ` Tony Luck
2023-08-28 21:20                 ` Reinette Chatre
2023-08-28 22:12                   ` Tony Luck
2023-08-29  0:31                     ` Tony Luck
2023-07-22 19:07   ` [PATCH v4 2/7] x86/resctrl: Split the rdt_domain structures Tony Luck
2023-07-22 19:07   ` [PATCH v4 3/7] x86/resctrl: Change monitor code to use rdt_mondomain Tony Luck
2023-08-11 17:30     ` Reinette Chatre
2023-08-25 17:13       ` Tony Luck
2023-07-22 19:07   ` [PATCH v4 4/7] x86/resctrl: Delete unused fields from struct rdt_domain Tony Luck
2023-08-11 17:30     ` Reinette Chatre
2023-08-25 17:13       ` Tony Luck
2023-07-22 19:07   ` [PATCH v4 5/7] x86/resctrl: Determine if Sub-NUMA Cluster is enabled and initialize Tony Luck
2023-08-11 17:32     ` Reinette Chatre
2023-08-25 17:49       ` Tony Luck
2023-08-28 17:05         ` Reinette Chatre
2023-08-28 19:01           ` Tony Luck
2023-07-22 19:07   ` [PATCH v4 6/7] x86/resctrl: Update documentation with Sub-NUMA cluster changes Tony Luck
2023-08-11 17:33     ` Reinette Chatre
2023-08-25 17:50       ` Tony Luck
2023-07-22 19:07   ` [PATCH v4 7/7] selftests/resctrl: Adjust effective L3 cache size when SNC enabled Tony Luck
2023-08-11 17:33     ` Reinette Chatre
2023-08-25 17:56       ` Tony Luck
2023-08-28 17:06         ` Reinette Chatre
2023-08-28 19:06           ` Tony Luck
2023-07-26  3:10   ` [PATCH v4 0/7] Add support for Sub-NUMA cluster (SNC) systems Drew Fustini
2023-07-26 14:10     ` Tony Luck
2023-08-11 17:27   ` Reinette Chatre
2023-08-29 23:44   ` [PATCH v5 0/8] " Tony Luck
2023-08-29 23:44     ` [PATCH v5 1/8] x86/resctrl: Prepare for new domain scope Tony Luck
2023-08-30  8:53       ` Maciej Wieczór-Retman
2023-08-30 14:11         ` Luck, Tony
2023-08-31 13:02           ` Maciej Wieczór-Retman
2023-08-31 17:26             ` Luck, Tony
2023-09-25 23:22       ` Reinette Chatre
2023-09-26  0:56         ` Tony Luck
2023-09-26 15:19           ` Reinette Chatre
2023-09-26 16:36       ` Moger, Babu
2023-09-26 17:18         ` Luck, Tony
2023-08-29 23:44     ` [PATCH v5 2/8] x86/resctrl: Prepare for different scope for control/monitor operations Tony Luck
2023-09-25 23:24       ` Reinette Chatre
2023-09-28 17:21         ` Tony Luck
2023-09-26 16:54       ` Moger, Babu
2023-09-26 17:45         ` Luck, Tony
2023-08-29 23:44     ` [PATCH v5 3/8] x86/resctrl: Split the rdt_domain structure Tony Luck
2023-09-25 23:25       ` Reinette Chatre
2023-09-26 18:46         ` Tony Luck
2023-09-26 22:00           ` Reinette Chatre
2023-09-26 22:14             ` Luck, Tony
2023-09-26 23:06               ` Reinette Chatre
2023-09-27  0:00                 ` Luck, Tony
2023-09-27 15:22                   ` Reinette Chatre
2023-09-28 17:33         ` Tony Luck
2023-09-28 18:42           ` Reinette Chatre
2023-08-29 23:44     ` [PATCH v5 4/8] x86/resctrl: Add node-scope to the options for feature scope Tony Luck
2023-09-25 23:25       ` Reinette Chatre
2023-09-28 17:42         ` Tony Luck
2023-09-28 18:44           ` Reinette Chatre
2023-08-29 23:44     ` [PATCH v5 5/8] x86/resctrl: Introduce snc_nodes_per_l3_cache Tony Luck
2023-09-25 23:27       ` Reinette Chatre
2023-09-28 17:48         ` Tony Luck
2023-08-29 23:44     ` [PATCH v5 6/8] x86/resctrl: Sub NUMA Cluster detection and enable Tony Luck
2023-09-25 23:29       ` Reinette Chatre
2023-09-28 18:12         ` Tony Luck
2023-09-28 18:44           ` Reinette Chatre
2023-09-26 19:16       ` Moger, Babu
2023-09-26 19:40         ` Luck, Tony
2023-09-26 19:48           ` Moger, Babu
2023-09-26 20:02             ` Luck, Tony
2023-09-26 20:18               ` Moger, Babu
2023-09-26 21:03           ` Reinette Chatre
2023-08-29 23:44     ` [PATCH v5 7/8] x86/resctrl: Update documentation with Sub-NUMA cluster changes Tony Luck
2023-09-25 23:30       ` Reinette Chatre
2023-09-28 18:15         ` Tony Luck
2023-09-26 19:00       ` Moger, Babu
2023-09-26 19:11         ` Luck, Tony
2023-09-26 19:26           ` Moger, Babu
2023-09-26 19:43             ` Luck, Tony
2023-09-26 19:52               ` Moger, Babu
2023-08-29 23:44     ` [PATCH v5 8/8] selftests/resctrl: Adjust effective L3 cache size when SNC enabled Tony Luck
2023-08-30 13:08       ` Maciej Wieczór-Retman
2023-08-30 15:43         ` Luck, Tony
2023-08-31  6:51           ` Maciej Wieczór-Retman
2023-08-31 17:04             ` Luck, Tony
2023-09-07  5:21       ` Shaopeng Tan (Fujitsu)
2023-09-07 16:19         ` Luck, Tony
2023-09-19  6:50           ` Maciej Wieczór-Retman
2023-09-19 14:36             ` Luck, Tony
2023-09-20  6:06               ` Maciej Wieczór-Retman
2023-09-20 15:00                 ` Luck, Tony
2023-09-11 20:23     ` [PATCH v5 0/8] Add support for Sub-NUMA cluster (SNC) systems Reinette Chatre
2023-09-12 16:01       ` Tony Luck
2023-09-12 17:13         ` Reinette Chatre
2023-09-12 17:45           ` Tony Luck
2023-09-12 18:31             ` Reinette Chatre
2023-09-28 19:13     ` [PATCH v6 " Tony Luck
2023-09-28 19:13       ` [PATCH v6 1/8] x86/resctrl: Prepare for new domain scope Tony Luck
2023-09-29 12:09         ` Peter Newman
2023-09-29 18:38           ` Tony Luck
2023-09-28 19:13       ` [PATCH v6 2/8] x86/resctrl: Prepare to split rdt_domain structure Tony Luck
2023-09-29 12:19         ` Peter Newman
2023-09-28 19:13       ` [PATCH v6 3/8] x86/resctrl: Prepare for different scope for control/monitor operations Tony Luck
2023-09-29 13:10         ` Peter Newman
2023-09-29 19:15           ` Tony Luck
2023-09-28 19:13       ` [PATCH v6 4/8] x86/resctrl: Split the rdt_domain and rdt_hw_domain structures Tony Luck
2023-09-29 13:44         ` Peter Newman
2023-09-29 20:06           ` Tony Luck
2023-09-28 19:13       ` [PATCH v6 5/8] x86/resctrl: Add node-scope to the options for feature scope Tony Luck
2023-09-29 14:21         ` Peter Newman
2023-09-28 19:13       ` [PATCH v6 6/8] x86/resctrl: Introduce snc_nodes_per_l3_cache Tony Luck
2023-09-29 14:21         ` Peter Newman
2023-09-29 20:18           ` Tony Luck
2023-09-28 19:13       ` [PATCH v6 7/8] x86/resctrl: Sub NUMA Cluster detection and enable Tony Luck
2023-09-29 14:32         ` Peter Newman
2023-09-28 19:13       ` [PATCH v6 8/8] x86/resctrl: Update documentation with Sub-NUMA cluster changes Tony Luck
2023-09-29 14:54         ` Peter Newman
2023-09-29 20:51           ` Tony Luck
2023-09-29 14:33       ` [PATCH v6 0/8] Add support for Sub-NUMA cluster (SNC) systems Peter Newman
2023-09-29 21:06         ` Tony Luck
2023-10-03 16:07       ` [PATCH v7 " Tony Luck
2023-10-03 16:07         ` [PATCH v7 1/8] x86/resctrl: Prepare for new domain scope Tony Luck
2023-10-03 16:07         ` [PATCH v7 2/8] x86/resctrl: Prepare to split rdt_domain structure Tony Luck
2023-10-03 16:07         ` [PATCH v7 3/8] x86/resctrl: Prepare for different scope for control/monitor operations Tony Luck
2023-10-03 16:07         ` [PATCH v7 4/8] x86/resctrl: Split the rdt_domain and rdt_hw_domain structures Tony Luck
2023-10-03 16:46           ` Luck, Tony
2023-10-03 16:07         ` [PATCH v7 5/8] x86/resctrl: Add node-scope to the options for feature scope Tony Luck
2023-10-03 16:07         ` [PATCH v7 6/8] x86/resctrl: Introduce snc_nodes_per_l3_cache Tony Luck
2023-10-03 16:07         ` [PATCH v7 7/8] x86/resctrl: Sub NUMA Cluster detection and enable Tony Luck
2023-10-03 16:07         ` [PATCH v7 8/8] x86/resctrl: Update documentation with Sub-NUMA cluster changes Tony Luck
2023-10-03 16:16         ` [PATCH v7 0/8] Add support for Sub-NUMA cluster (SNC) systems Luck, Tony
2023-10-03 16:34           ` Reinette Chatre
2023-10-03 21:30       ` [PATCH v8 " Tony Luck
2023-10-03 21:30         ` [PATCH v8 1/8] x86/resctrl: Prepare for new domain scope Tony Luck
2023-10-05 14:27           ` Peter Newman
2023-10-03 21:30         ` [PATCH v8 2/8] x86/resctrl: Prepare to split rdt_domain structure Tony Luck
2023-10-03 21:30         ` [PATCH v8 3/8] x86/resctrl: Prepare for different scope for control/monitor operations Tony Luck
2023-10-05 14:34           ` Peter Newman
2023-10-03 21:30         ` [PATCH v8 4/8] x86/resctrl: Split the rdt_domain and rdt_hw_domain structures Tony Luck
2023-10-03 21:30         ` [PATCH v8 5/8] x86/resctrl: Add node-scope to the options for feature scope Tony Luck
2023-10-03 21:30         ` [PATCH v8 6/8] x86/resctrl: Introduce snc_nodes_per_l3_cache Tony Luck
2023-10-03 21:30         ` [PATCH v8 7/8] x86/resctrl: Sub NUMA Cluster detection and enable Tony Luck
2023-10-03 21:30         ` [PATCH v8 8/8] x86/resctrl: Update documentation with Sub-NUMA cluster changes Tony Luck
2023-10-05 13:54           ` Peter Newman
2023-10-03 21:44         ` [PATCH v8 0/8] Add support for Sub-NUMA cluster (SNC) systems Reinette Chatre
2023-10-05  8:10         ` Shaopeng Tan (Fujitsu)
2023-10-05 15:08           ` Luck, Tony
2023-10-06 20:27             ` Reinette Chatre
2023-10-20 21:30         ` [PATCH v9 " Tony Luck
2023-10-20 21:30           ` [PATCH v9 1/8] x86/resctrl: Prepare for new domain scope Tony Luck
2023-10-30 21:18             ` Reinette Chatre
2023-10-20 21:30           ` [PATCH v9 2/8] x86/resctrl: Prepare to split rdt_domain structure Tony Luck
2023-10-30 21:19             ` Reinette Chatre
2023-10-20 21:30           ` [PATCH v9 3/8] x86/resctrl: Prepare for different scope for control/monitor operations Tony Luck
2023-10-30 21:21             ` Reinette Chatre
2023-10-20 21:30           ` [PATCH v9 4/8] x86/resctrl: Split the rdt_domain and rdt_hw_domain structures Tony Luck
2023-10-30 21:21             ` Reinette Chatre
2023-10-20 21:30           ` [PATCH v9 5/8] x86/resctrl: Add node-scope to the options for feature scope Tony Luck
2023-10-20 21:30           ` [PATCH v9 6/8] x86/resctrl: Introduce snc_nodes_per_l3_cache Tony Luck
2023-10-30 21:22             ` Reinette Chatre
2023-10-20 21:30           ` [PATCH v9 7/8] x86/resctrl: Sub NUMA Cluster detection and enable Tony Luck
2023-10-30 21:23             ` Reinette Chatre
2023-10-20 21:31           ` [PATCH v9 8/8] x86/resctrl: Update documentation with Sub-NUMA cluster changes Tony Luck
2023-10-24  5:41           ` [PATCH v9 0/8] Add support for Sub-NUMA cluster (SNC) systems Shaopeng Tan (Fujitsu)
2023-10-24 15:29             ` Luck, Tony
2023-10-31 21:17           ` [PATCH v10 " Tony Luck
2023-10-31 21:17             ` [PATCH v10 1/8] x86/resctrl: Prepare for new domain scope Tony Luck
2023-11-07  0:31               ` Reinette Chatre
2023-10-31 21:17             ` [PATCH v10 2/8] x86/resctrl: Prepare to split rdt_domain structure Tony Luck
2023-10-31 21:17             ` [PATCH v10 3/8] x86/resctrl: Prepare for different scope for control/monitor operations Tony Luck
2023-11-01 19:53               ` Luck, Tony
2023-11-07  0:32               ` Reinette Chatre
2023-10-31 21:17             ` [PATCH v10 4/8] x86/resctrl: Split the rdt_domain and rdt_hw_domain structures Tony Luck
2023-11-07  0:32               ` Reinette Chatre
2023-11-08 19:19                 ` Tony Luck
2023-11-08 19:43                   ` Reinette Chatre
2023-11-08 20:10                     ` Luck, Tony
2023-10-31 21:17             ` [PATCH v10 5/8] x86/resctrl: Add node-scope to the options for feature scope Tony Luck
2023-10-31 21:17             ` [PATCH v10 6/8] x86/resctrl: Introduce snc_nodes_per_l3_cache Tony Luck
2023-10-31 21:17             ` [PATCH v10 7/8] x86/resctrl: Sub NUMA Cluster detection and enable Tony Luck
2023-11-07  0:33               ` Reinette Chatre
2023-10-31 21:17             ` [PATCH v10 8/8] x86/resctrl: Update documentation with Sub-NUMA cluster changes Tony Luck
2023-11-09 23:09             ` [PATCH v11 0/8] Add support for Sub-NUMA cluster (SNC) systems Tony Luck
2023-11-09 23:09               ` [PATCH v11 1/8] x86/resctrl: Prepare for new domain scope Tony Luck
2023-11-13 19:36                 ` Moger, Babu
2023-11-13 20:49                   ` Luck, Tony
2023-11-20 22:16                 ` Reinette Chatre
2023-11-09 23:09               ` [PATCH v11 2/8] x86/resctrl: Prepare to split rdt_domain structure Tony Luck
2023-11-13 18:08                 ` Moger, Babu
2023-11-13 18:52                   ` Tony Luck
2023-11-13 19:32                     ` Moger, Babu
2023-11-20 22:18                 ` Reinette Chatre
2023-11-09 23:09               ` [PATCH v11 3/8] x86/resctrl: Prepare for different scope for control/monitor operations Tony Luck
2023-11-20 22:19                 ` Reinette Chatre
2023-11-09 23:09               ` [PATCH v11 4/8] x86/resctrl: Split the rdt_domain and rdt_hw_domain structures Tony Luck
2023-11-20 22:19                 ` Reinette Chatre
2023-11-09 23:09               ` [PATCH v11 5/8] x86/resctrl: Add node-scope to the options for feature scope Tony Luck
2023-11-20 22:22                 ` Reinette Chatre
2023-11-09 23:09               ` [PATCH v11 6/8] x86/resctrl: Introduce snc_nodes_per_l3_cache Tony Luck
2023-11-20 22:23                 ` Reinette Chatre
2023-11-09 23:09               ` [PATCH v11 7/8] x86/resctrl: Sub NUMA Cluster detection and enable Tony Luck
2023-11-20 22:24                 ` Reinette Chatre
2023-11-09 23:09               ` [PATCH v11 8/8] x86/resctrl: Update documentation with Sub-NUMA cluster changes Tony Luck
2023-11-09 23:14                 ` Randy Dunlap
2023-11-20 22:25                 ` Reinette Chatre
2023-11-13  4:46               ` [PATCH v11 0/8] Add support for Sub-NUMA cluster (SNC) systems Shaopeng Tan (Fujitsu)
2023-11-13 17:33                 ` Luck, Tony
2023-11-30  0:34               ` [PATCH v12 " Tony Luck
2023-11-30  0:34                 ` [PATCH v12 1/8] x86/resctrl: Prepare for new domain scope Tony Luck
2023-11-30  0:34                 ` [PATCH v12 2/8] x86/resctrl: Prepare to split rdt_domain structure Tony Luck
2023-11-30  0:34                 ` [PATCH v12 3/8] x86/resctrl: Prepare for different scope for control/monitor operations Tony Luck
2023-11-30 21:51                   ` Reinette Chatre
2023-11-30  0:34                 ` [PATCH v12 4/8] x86/resctrl: Split the rdt_domain and rdt_hw_domain structures Tony Luck
2023-11-30  0:34                 ` [PATCH v12 5/8] x86/resctrl: Add node-scope to the options for feature scope Tony Luck
2023-11-30  0:34                 ` [PATCH v12 6/8] x86/resctrl: Introduce snc_nodes_per_l3_cache Tony Luck
2023-11-30  0:34                 ` [PATCH v12 7/8] x86/resctrl: Sub NUMA Cluster detection and enable Tony Luck
2023-11-30 18:02                   ` Fam Zheng
2023-11-30 20:57                     ` Tony Luck
2023-11-30 21:47                       ` Reinette Chatre
2023-11-30 22:43                         ` Tony Luck
2023-11-30 23:40                           ` Reinette Chatre
2023-12-01  0:37                             ` Tony Luck
2023-12-01  2:08                               ` Reinette Chatre
2023-11-30  0:34                 ` [PATCH v12 8/8] x86/resctrl: Update documentation with Sub-NUMA cluster changes Tony Luck
2023-12-04 16:55                   ` Moger, Babu
2023-12-04 17:51                     ` Luck, Tony
2023-12-04 18:53                 ` [PATCH v13 0/8] Add support for Sub-NUMA cluster (SNC) systems Tony Luck
2023-12-04 18:53                   ` [PATCH v13 1/8] x86/resctrl: Prepare for new domain scope Tony Luck
2023-12-04 23:03                     ` Moger, Babu
2023-12-04 18:53                   ` [PATCH v13 2/8] x86/resctrl: Prepare to split rdt_domain structure Tony Luck
2023-12-04 23:03                     ` Moger, Babu
2023-12-04 18:53                   ` [PATCH v13 3/8] x86/resctrl: Prepare for different scope for control/monitor operations Tony Luck
2023-12-04 23:03                     ` Moger, Babu
2023-12-04 18:53                   ` [PATCH v13 4/8] x86/resctrl: Split the rdt_domain and rdt_hw_domain structures Tony Luck
2023-12-04 23:03                     ` Moger, Babu
2023-12-04 18:53                   ` [PATCH v13 5/8] x86/resctrl: Add node-scope to the options for feature scope Tony Luck
2023-12-04 23:04                     ` Moger, Babu
2023-12-04 18:53                   ` [PATCH v13 6/8] x86/resctrl: Introduce snc_nodes_per_l3_cache Tony Luck
2023-12-04 23:04                     ` Moger, Babu
2023-12-04 18:53                   ` [PATCH v13 7/8] x86/resctrl: Sub NUMA Cluster detection and enable Tony Luck
2023-12-04 23:04                     ` Moger, Babu
2023-12-04 18:53                   ` [PATCH v13 8/8] x86/resctrl: Update documentation with Sub-NUMA cluster changes Tony Luck
2023-12-04 23:04                     ` Moger, Babu
2023-12-04 21:20                   ` [PATCH v13 0/8] Add support for Sub-NUMA cluster (SNC) systems Moger, Babu
2023-12-04 21:33                     ` Luck, Tony
2023-12-04 23:25                   ` Tony Luck
2024-01-26 22:38                   ` [PATCH v14 " Tony Luck
2024-01-26 22:38                     ` [PATCH v14 1/8] x86/resctrl: Prepare for new domain scope Tony Luck
2024-01-26 22:38                     ` [PATCH v14 2/8] x86/resctrl: Prepare to split rdt_domain structure Tony Luck
2024-01-26 22:38                     ` [PATCH v14 3/8] x86/resctrl: Prepare for different scope for control/monitor operations Tony Luck
2024-01-26 22:38                     ` [PATCH v14 4/8] x86/resctrl: Split the rdt_domain and rdt_hw_domain structures Tony Luck
2024-01-26 22:38                     ` [PATCH v14 5/8] x86/resctrl: Add node-scope to the options for feature scope Tony Luck
2024-01-26 22:38                     ` [PATCH v14 6/8] x86/resctrl: Introduce snc_nodes_per_l3_cache Tony Luck
2024-01-26 22:38                     ` [PATCH v14 7/8] x86/resctrl: Sub NUMA Cluster detection and enable Tony Luck
2024-01-26 22:38                     ` [PATCH v14 8/8] x86/resctrl: Update documentation with Sub-NUMA cluster changes Tony Luck
2024-01-30  1:05                     ` [PATCH v14 0/8] Add support for Sub-NUMA cluster (SNC) systems Tony Luck
2024-01-30 22:20                     ` [PATCH v15-RFC " Tony Luck
2024-01-30 22:20                       ` [PATCH v15-RFC 1/8] x86/resctrl: Split the RDT_RESOURCE_L3 resource Tony Luck
2024-02-09 15:28                         ` Moger, Babu
2024-02-09 18:44                           ` Tony Luck
2024-02-09 19:40                             ` Moger, Babu
2024-01-30 22:20                       ` [PATCH v15-RFC 2/8] x86/resctrl: Move all monitoring functions to RDT_RESOURCE_L3_MON Tony Luck
2024-02-09 15:28                         ` Moger, Babu
2024-02-09 18:51                           ` Tony Luck
2024-01-30 22:20                       ` [PATCH v15-RFC 3/8] x86/resctrl: Prepare for non-cache-scoped resources Tony Luck
2024-02-09 15:28                         ` Moger, Babu
2024-02-09 18:57                           ` Tony Luck
2024-02-09 19:44                             ` Moger, Babu
2024-01-30 22:20                       ` [PATCH v15-RFC 4/8] x86/resctrl: Add helper function to look up domain_id from scope Tony Luck
2024-02-09 15:28                         ` Moger, Babu
2024-02-09 19:06                           ` Tony Luck
2024-01-30 22:20                       ` [PATCH v15-RFC 5/8] x86/resctrl: Add "NODE" as an option for resource scope Tony Luck
2024-02-09 15:29                         ` Moger, Babu
2024-02-09 19:23                           ` Tony Luck
2024-02-12 17:16                             ` Moger, Babu
2024-01-30 22:20                       ` [PATCH v15-RFC 6/8] x86/resctrl: Introduce snc_nodes_per_l3_cache Tony Luck
2024-02-09 15:29                         ` Moger, Babu
2024-02-09 19:35                           ` Tony Luck
2024-02-12 17:22                             ` Moger, Babu
2024-01-30 22:20                       ` [PATCH v15-RFC 7/8] x86/resctrl: Sub NUMA Cluster detection and enable Tony Luck
2024-01-30 22:20                       ` [PATCH v15-RFC 8/8] x86/resctrl: Update documentation with Sub-NUMA cluster changes Tony Luck
2024-02-01 18:46                       ` [PATCH v15-RFC 0/8] Add support for Sub-NUMA cluster (SNC) systems Reinette Chatre
2024-02-08  4:17                       ` Drew Fustini
2024-02-08 19:26                         ` Luck, Tony
2024-02-09 15:27                       ` Moger, Babu
2024-02-09 18:31                         ` Tony Luck
2024-02-09 19:38                           ` Moger, Babu
2024-02-09 21:26                             ` Reinette Chatre
2024-02-09 21:36                               ` Luck, Tony
2024-02-09 22:16                                 ` Reinette Chatre
2024-02-09 23:44                                   ` Luck, Tony
2024-02-10  0:28                                     ` Reinette Chatre
2024-02-12 16:52                                       ` Luck, Tony
2024-02-12 19:44                         ` Reinette Chatre
2024-02-12 19:57                           ` Luck, Tony
2024-02-12 21:43                             ` Reinette Chatre
2024-02-12 22:05                               ` Tony Luck
2024-02-13 18:11                                 ` James Morse
2024-02-13 19:02                                   ` Luck, Tony
2024-02-12 20:46                           ` Moger, Babu
