* [PATCH 0/7] x86/resctrl: Add support for Sub-NUMA cluster (SNC) systems
@ 2023-01-26 18:41 Tony Luck
  2023-01-26 18:41 ` [PATCH 1/7] x86/resctrl: Refactor in preparation for node-scoped resources Tony Luck
                   ` (7 more replies)
  0 siblings, 8 replies; 34+ messages in thread
From: Tony Luck @ 2023-01-26 18:41 UTC (permalink / raw)
  To: Fenghua Yu, Reinette Chatre, Peter Newman, Jonathan Corbet, x86
  Cc: Shaopeng Tan, James Morse, Jamie Iles, Babu Moger, linux-kernel,
	linux-doc, patches, Tony Luck

Intel server systems starting with Skylake support a mode that logically
partitions each socket. E.g. when partitioned two ways, half the cores,
L3 cache, and memory controllers are allocated to each of the partitions.
This may reduce average latency to access L3 cache and memory, with the
tradeoff that only half the L3 cache is available for subnode-local memory
access.

The existing Linux resctrl system mishandles RDT monitoring on systems
with SNC mode enabled.

But, with some simple changes, this can be fixed. When SNC mode is
enabled, the RDT RMID counters are also partitioned with the low numbered
counters going to the first partition, and the high numbered counters
to the second partition[1]. The key is to adjust the RMID value written
to the IA32_PQR_ASSOC MSR on context switch, the value written to the
IA32_QM_EVTSEL MSR when reading out counters, and the scaling factor
read from CPUID(0xf,1).EBX.

E.g. in a 2-way Sub-NUMA Cluster configuration with 200 RMID counters,
only 100 counters are available to the resctrl code. When running on the
first SNC node RMID values 0..99 are used as before. But when running on
the second node, a task that is assigned resctrl rmid=10 must load 10+100
into IA32_PQR_ASSOC to use RMID counter 110.
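
As a minimal illustrative sketch (not the exact kernel code; snc_ways
and num_rmid are the names used in the patches below), the mapping is:

	/*
	 * Illustrative only: map a resctrl-visible RMID to the hardware
	 * RMID for the SNC node that this CPU belongs to. The patches
	 * implement this with a per-CPU offset computed when each CPU
	 * comes online.
	 */
	static u32 snc_hw_rmid(u32 rmid, int cpu)
	{
		return rmid + (cpu_to_node(cpu) % snc_ways) * num_rmid;
	}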

There should be no changes to functionality on other architectures,
or on Intel systems with SNC disabled, where snc_ways == 1.

-Tony

[1] Some systems also support a 4-way split. All the above still
applies, just need to account for cores, cache, memory controllers
and RMID counters being divided four ways instead of two.

Tony Luck (7):
  x86/resctrl: Refactor in preparation for node-scoped resources
  x86/resctrl: Remove hard code of RDT_RESOURCE_L3 in monitor.c
  x86/resctrl: Add a new node-scoped resource to rdt_resources_all[]
  x86/resctrl: Add code to setup monitoring at L3 or NODE scope.
  x86/resctrl: Add a new "snc_ways" file to the monitoring info
    directory.
  x86/resctrl: Update documentation with Sub-NUMA cluster changes
  x86/resctrl: Determine if Sub-NUMA Cluster is enabled and initialize.

 Documentation/x86/resctrl.rst             | 15 +++-
 include/linux/resctrl.h                   |  4 +-
 arch/x86/include/asm/resctrl.h            |  4 +-
 arch/x86/kernel/cpu/resctrl/internal.h    |  9 +++
 arch/x86/kernel/cpu/resctrl/core.c        | 83 ++++++++++++++++++++---
 arch/x86/kernel/cpu/resctrl/monitor.c     | 24 ++++---
 arch/x86/kernel/cpu/resctrl/pseudo_lock.c |  2 +-
 arch/x86/kernel/cpu/resctrl/rdtgroup.c    | 22 +++++-
 8 files changed, 136 insertions(+), 27 deletions(-)

-- 
2.39.1



* [PATCH 1/7] x86/resctrl: Refactor in preparation for node-scoped resources
  2023-01-26 18:41 [PATCH 0/7] x86/resctrl: Add support for Sub-NUMA cluster (SNC) systems Tony Luck
@ 2023-01-26 18:41 ` Tony Luck
  2023-01-26 18:41 ` [PATCH 2/7] x86/resctrl: Remove hard code of RDT_RESOURCE_L3 in monitor.c Tony Luck
                   ` (6 subsequent siblings)
  7 siblings, 0 replies; 34+ messages in thread
From: Tony Luck @ 2023-01-26 18:41 UTC (permalink / raw)
  To: Fenghua Yu, Reinette Chatre, Peter Newman, Jonathan Corbet, x86
  Cc: Shaopeng Tan, James Morse, Jamie Iles, Babu Moger, linux-kernel,
	linux-doc, patches, Tony Luck

Sub-NUMA cluster systems provide monitoring resources at the NUMA
node scope instead of the L3 cache scope.

Rename the cache_level field in struct rdt_resource to the more
generic "scope" and add symbolic names and a helper function.

No functional change.

Signed-off-by: Tony Luck <tony.luck@intel.com>
---
 include/linux/resctrl.h                   |  4 ++--
 arch/x86/kernel/cpu/resctrl/internal.h    |  5 +++++
 arch/x86/kernel/cpu/resctrl/core.c        | 15 ++++++++++-----
 arch/x86/kernel/cpu/resctrl/pseudo_lock.c |  2 +-
 arch/x86/kernel/cpu/resctrl/rdtgroup.c    |  2 +-
 5 files changed, 19 insertions(+), 9 deletions(-)

diff --git a/include/linux/resctrl.h b/include/linux/resctrl.h
index 0cee154abc9f..64ecfcafa0a2 100644
--- a/include/linux/resctrl.h
+++ b/include/linux/resctrl.h
@@ -150,7 +150,7 @@ struct resctrl_schema;
  * @alloc_capable:	Is allocation available on this machine
  * @mon_capable:	Is monitor feature available on this machine
  * @num_rmid:		Number of RMIDs available
- * @cache_level:	Which cache level defines scope of this resource
+ * @scope:		Scope of this resource (cache level or NUMA node)
  * @cache:		Cache allocation related data
  * @membw:		If the component has bandwidth controls, their properties.
  * @domains:		All domains for this resource
@@ -168,7 +168,7 @@ struct rdt_resource {
 	bool			alloc_capable;
 	bool			mon_capable;
 	int			num_rmid;
-	int			cache_level;
+	int			scope;
 	struct resctrl_cache	cache;
 	struct resctrl_membw	membw;
 	struct list_head	domains;
diff --git a/arch/x86/kernel/cpu/resctrl/internal.h b/arch/x86/kernel/cpu/resctrl/internal.h
index 5ebd28e6aa0c..15cea517efaa 100644
--- a/arch/x86/kernel/cpu/resctrl/internal.h
+++ b/arch/x86/kernel/cpu/resctrl/internal.h
@@ -414,6 +414,11 @@ enum resctrl_res_level {
 	RDT_NUM_RESOURCES,
 };
 
+enum resctrl_scope {
+	SCOPE_L2_CACHE = 2,
+	SCOPE_L3_CACHE = 3
+};
+
 static inline struct rdt_resource *resctrl_inc(struct rdt_resource *res)
 {
 	struct rdt_hw_resource *hw_res = resctrl_to_arch_res(res);
diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
index c98e52ff5f20..6914232acf84 100644
--- a/arch/x86/kernel/cpu/resctrl/core.c
+++ b/arch/x86/kernel/cpu/resctrl/core.c
@@ -65,7 +65,7 @@ struct rdt_hw_resource rdt_resources_all[] = {
 		.r_resctrl = {
 			.rid			= RDT_RESOURCE_L3,
 			.name			= "L3",
-			.cache_level		= 3,
+			.scope			= SCOPE_L3_CACHE,
 			.domains		= domain_init(RDT_RESOURCE_L3),
 			.parse_ctrlval		= parse_cbm,
 			.format_str		= "%d=%0*x",
@@ -79,7 +79,7 @@ struct rdt_hw_resource rdt_resources_all[] = {
 		.r_resctrl = {
 			.rid			= RDT_RESOURCE_L2,
 			.name			= "L2",
-			.cache_level		= 2,
+			.scope			= SCOPE_L2_CACHE,
 			.domains		= domain_init(RDT_RESOURCE_L2),
 			.parse_ctrlval		= parse_cbm,
 			.format_str		= "%d=%0*x",
@@ -93,7 +93,7 @@ struct rdt_hw_resource rdt_resources_all[] = {
 		.r_resctrl = {
 			.rid			= RDT_RESOURCE_MBA,
 			.name			= "MB",
-			.cache_level		= 3,
+			.scope			= SCOPE_L3_CACHE,
 			.domains		= domain_init(RDT_RESOURCE_MBA),
 			.parse_ctrlval		= parse_bw,
 			.format_str		= "%d=%*u",
@@ -462,6 +462,11 @@ static int arch_domain_mbm_alloc(u32 num_rmid, struct rdt_hw_domain *hw_dom)
 	return 0;
 }
 
+static int get_domain_id(int cpu, enum resctrl_scope scope)
+{
+	return get_cpu_cacheinfo_id(cpu, scope);
+}
+
 /*
  * domain_add_cpu - Add a cpu to a resource's domain list.
  *
@@ -477,7 +482,7 @@ static int arch_domain_mbm_alloc(u32 num_rmid, struct rdt_hw_domain *hw_dom)
  */
 static void domain_add_cpu(int cpu, struct rdt_resource *r)
 {
-	int id = get_cpu_cacheinfo_id(cpu, r->cache_level);
+	int id = get_domain_id(cpu, r->scope);
 	struct list_head *add_pos = NULL;
 	struct rdt_hw_domain *hw_dom;
 	struct rdt_domain *d;
@@ -527,7 +532,7 @@ static void domain_add_cpu(int cpu, struct rdt_resource *r)
 
 static void domain_remove_cpu(int cpu, struct rdt_resource *r)
 {
-	int id = get_cpu_cacheinfo_id(cpu, r->cache_level);
+	int id = get_domain_id(cpu, r->scope);
 	struct rdt_hw_domain *hw_dom;
 	struct rdt_domain *d;
 
diff --git a/arch/x86/kernel/cpu/resctrl/pseudo_lock.c b/arch/x86/kernel/cpu/resctrl/pseudo_lock.c
index 524f8ff3e69c..d2ba4f7f6a79 100644
--- a/arch/x86/kernel/cpu/resctrl/pseudo_lock.c
+++ b/arch/x86/kernel/cpu/resctrl/pseudo_lock.c
@@ -297,7 +297,7 @@ static int pseudo_lock_region_init(struct pseudo_lock_region *plr)
 	plr->size = rdtgroup_cbm_to_size(plr->s->res, plr->d, plr->cbm);
 
 	for (i = 0; i < ci->num_leaves; i++) {
-		if (ci->info_list[i].level == plr->s->res->cache_level) {
+		if (ci->info_list[i].level == plr->s->res->scope) {
 			plr->line_size = ci->info_list[i].coherency_line_size;
 			return 0;
 		}
diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
index 5993da21d822..a6ba3080e5db 100644
--- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
+++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
@@ -1333,7 +1333,7 @@ unsigned int rdtgroup_cbm_to_size(struct rdt_resource *r,
 	num_b = bitmap_weight(&cbm, r->cache.cbm_len);
 	ci = get_cpu_cacheinfo(cpumask_any(&d->cpu_mask));
 	for (i = 0; i < ci->num_leaves; i++) {
-		if (ci->info_list[i].level == r->cache_level) {
+		if (ci->info_list[i].level == r->scope) {
 			size = ci->info_list[i].size / r->cache.cbm_len * num_b;
 			break;
 		}
-- 
2.39.1



* [PATCH 2/7] x86/resctrl: Remove hard code of RDT_RESOURCE_L3 in monitor.c
  2023-01-26 18:41 [PATCH 0/7] x86/resctrl: Add support for Sub-NUMA cluster (SNC) systems Tony Luck
  2023-01-26 18:41 ` [PATCH 1/7] x86/resctrl: Refactor in preparation for node-scoped resources Tony Luck
@ 2023-01-26 18:41 ` Tony Luck
  2023-01-27  4:51   ` Yu, Fenghua
  2023-01-26 18:41 ` [PATCH 3/7] x86/resctrl: Add a new node-scoped resource to rdt_resources_all[] Tony Luck
                   ` (5 subsequent siblings)
  7 siblings, 1 reply; 34+ messages in thread
From: Tony Luck @ 2023-01-26 18:41 UTC (permalink / raw)
  To: Fenghua Yu, Reinette Chatre, Peter Newman, Jonathan Corbet, x86
  Cc: Shaopeng Tan, James Morse, Jamie Iles, Babu Moger, linux-kernel,
	linux-doc, patches, Tony Luck

Monitoring may be scoped at L3 cache granularity (legacy) or
at the node level (systems with Sub-NUMA Cluster enabled).

Save the struct rdt_resource pointer that was used to initialize
the monitoring code and use that value instead of the hard-coded
RDT_RESOURCE_L3.

No functional change.

Signed-off-by: Tony Luck <tony.luck@intel.com>
---
 arch/x86/kernel/cpu/resctrl/monitor.c | 18 ++++++++++++------
 1 file changed, 12 insertions(+), 6 deletions(-)

diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c
index 77538abeb72a..d05bbd4f6b2d 100644
--- a/arch/x86/kernel/cpu/resctrl/monitor.c
+++ b/arch/x86/kernel/cpu/resctrl/monitor.c
@@ -30,6 +30,8 @@ struct rmid_entry {
 	struct list_head		list;
 };
 
+static struct rdt_resource *mon_resource;
+
 /**
  * @rmid_free_lru    A least recently used list of free RMIDs
  *     These RMIDs are guaranteed to have an occupancy less than the
@@ -251,7 +253,7 @@ int resctrl_arch_rmid_read(struct rdt_resource *r, struct rdt_domain *d,
  */
 void __check_limbo(struct rdt_domain *d, bool force_free)
 {
-	struct rdt_resource *r = &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl;
+	struct rdt_resource *r = mon_resource;
 	struct rmid_entry *entry;
 	u32 crmid = 1, nrmid;
 	bool rmid_dirty;
@@ -316,7 +318,7 @@ int alloc_rmid(void)
 
 static void add_rmid_to_limbo(struct rmid_entry *entry)
 {
-	struct rdt_resource *r = &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl;
+	struct rdt_resource *r = mon_resource;
 	struct rdt_domain *d;
 	int cpu, err;
 	u64 val = 0;
@@ -633,7 +635,7 @@ void cqm_handle_limbo(struct work_struct *work)
 
 	mutex_lock(&rdtgroup_mutex);
 
-	r = &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl;
+	r = mon_resource;
 	d = container_of(work, struct rdt_domain, cqm_limbo.work);
 
 	__check_limbo(d, false);
@@ -669,7 +671,7 @@ void mbm_handle_overflow(struct work_struct *work)
 	if (!static_branch_likely(&rdt_mon_enable_key))
 		goto out_unlock;
 
-	r = &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl;
+	r = mon_resource;
 	d = container_of(work, struct rdt_domain, mbm_over.work);
 
 	list_for_each_entry(prgrp, &rdt_all_groups, rdtgroup_list) {
@@ -747,9 +749,11 @@ static struct mon_evt mbm_local_event = {
 /*
  * Initialize the event list for the resource.
  *
- * Note that MBM events are also part of RDT_RESOURCE_L3 resource
+ * Note that MBM events can either be part of RDT_RESOURCE_L3 resource
  * because as per the SDM the total and local memory bandwidth
- * are enumerated as part of L3 monitoring.
+ * are enumerated as part of L3 monitoring, or they may be per NUMA
+ * node on systems with sub-NUMA cluster enabled and are then in the
+ * RDT_RESOURCE_NODE resource.
  */
 static void l3_mon_evt_init(struct rdt_resource *r)
 {
@@ -761,6 +765,8 @@ static void l3_mon_evt_init(struct rdt_resource *r)
 		list_add_tail(&mbm_total_event.list, &r->evt_list);
 	if (is_mbm_local_enabled())
 		list_add_tail(&mbm_local_event.list, &r->evt_list);
+
+	mon_resource = r;
 }
 
 int rdt_get_mon_l3_config(struct rdt_resource *r)
-- 
2.39.1



* [PATCH 3/7] x86/resctrl: Add a new node-scoped resource to rdt_resources_all[]
  2023-01-26 18:41 [PATCH 0/7] x86/resctrl: Add support for Sub-NUMA cluster (SNC) systems Tony Luck
  2023-01-26 18:41 ` [PATCH 1/7] x86/resctrl: Refactor in preparation for node-scoped resources Tony Luck
  2023-01-26 18:41 ` [PATCH 2/7] x86/resctrl: Remove hard code of RDT_RESOURCE_L3 in monitor.c Tony Luck
@ 2023-01-26 18:41 ` Tony Luck
  2023-01-27  5:24   ` Yu, Fenghua
                     ` (3 more replies)
  2023-01-26 18:41 ` [PATCH 4/7] x86/resctrl: Add code to setup monitoring at L3 or NODE scope Tony Luck
                   ` (4 subsequent siblings)
  7 siblings, 4 replies; 34+ messages in thread
From: Tony Luck @ 2023-01-26 18:41 UTC (permalink / raw)
  To: Fenghua Yu, Reinette Chatre, Peter Newman, Jonathan Corbet, x86
  Cc: Shaopeng Tan, James Morse, Jamie Iles, Babu Moger, linux-kernel,
	linux-doc, patches, Tony Luck

Add a placeholder in the array of struct rdt_hw_resource to be used
for event monitoring of systems with Sub-NUMA Cluster enabled.

Update get_domain_id() to handle SCOPE_NODE.

Signed-off-by: Tony Luck <tony.luck@intel.com>
---
 arch/x86/kernel/cpu/resctrl/internal.h |  2 ++
 arch/x86/kernel/cpu/resctrl/core.c     | 12 ++++++++++++
 2 files changed, 14 insertions(+)

diff --git a/arch/x86/kernel/cpu/resctrl/internal.h b/arch/x86/kernel/cpu/resctrl/internal.h
index 15cea517efaa..39a62babd60b 100644
--- a/arch/x86/kernel/cpu/resctrl/internal.h
+++ b/arch/x86/kernel/cpu/resctrl/internal.h
@@ -409,12 +409,14 @@ enum resctrl_res_level {
 	RDT_RESOURCE_L3,
 	RDT_RESOURCE_L2,
 	RDT_RESOURCE_MBA,
+	RDT_RESOURCE_NODE,
 
 	/* Must be the last */
 	RDT_NUM_RESOURCES,
 };
 
 enum resctrl_scope {
+	SCOPE_NODE,
 	SCOPE_L2_CACHE = 2,
 	SCOPE_L3_CACHE = 3
 };
diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
index 6914232acf84..19be6fe42ef3 100644
--- a/arch/x86/kernel/cpu/resctrl/core.c
+++ b/arch/x86/kernel/cpu/resctrl/core.c
@@ -100,6 +100,16 @@ struct rdt_hw_resource rdt_resources_all[] = {
 			.fflags			= RFTYPE_RES_MB,
 		},
 	},
+	[RDT_RESOURCE_NODE] =
+	{
+		.r_resctrl = {
+			.rid			= RDT_RESOURCE_NODE,
+			.name			= "L3",
+			.scope			= SCOPE_NODE,
+			.domains		= domain_init(RDT_RESOURCE_NODE),
+			.fflags			= RFTYPE_RES_MB,
+		},
+	},
 };
 
 /*
@@ -464,6 +474,8 @@ static int arch_domain_mbm_alloc(u32 num_rmid, struct rdt_hw_domain *hw_dom)
 
 static int get_domain_id(int cpu, enum resctrl_scope scope)
 {
+	if (scope == SCOPE_NODE)
+		return cpu_to_node(cpu);
 	return get_cpu_cacheinfo_id(cpu, scope);
 }
 
-- 
2.39.1



* [PATCH 4/7] x86/resctrl: Add code to setup monitoring at L3 or NODE scope.
  2023-01-26 18:41 [PATCH 0/7] x86/resctrl: Add support for Sub-NUMA cluster (SNC) systems Tony Luck
                   ` (2 preceding siblings ...)
  2023-01-26 18:41 ` [PATCH 3/7] x86/resctrl: Add a new node-scoped resource to rdt_resources_all[] Tony Luck
@ 2023-01-26 18:41 ` Tony Luck
  2023-02-28 17:13   ` James Morse
  2023-01-26 18:41 ` [PATCH 5/7] x86/resctrl: Add a new "snc_ways" file to the monitoring info directory Tony Luck
                   ` (3 subsequent siblings)
  7 siblings, 1 reply; 34+ messages in thread
From: Tony Luck @ 2023-01-26 18:41 UTC (permalink / raw)
  To: Fenghua Yu, Reinette Chatre, Peter Newman, Jonathan Corbet, x86
  Cc: Shaopeng Tan, James Morse, Jamie Iles, Babu Moger, linux-kernel,
	linux-doc, patches, Tony Luck

When Sub-NUMA Cluster is enabled (snc_ways > 1) use the RDT_RESOURCE_NODE
resource instead of RDT_RESOURCE_L3 for all monitoring operations.

The mon_scale and num_rmid values from CPUID(0xf,0x1),(EBX,ECX) must be
scaled down by the number of Sub-NUMA Clusters.

A subsequent change will detect Sub-NUMA Cluster mode and set
"snc_ways". For now it is set to one (meaning each L3 cache spans
one node).
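
For illustration (assumed CPUID values, not from real hardware): with
snc_ways == 2 and CPUID(0xf,0x1) reporting x86_cache_max_rmid = 199 and
x86_cache_occ_scale = 64, the scaled values become:

	num_rmid  = (199 + 1) / 2 = 100
	mon_scale = 64 / 2 = 32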

Signed-off-by: Tony Luck <tony.luck@intel.com>
---
 arch/x86/kernel/cpu/resctrl/internal.h |  2 ++
 arch/x86/kernel/cpu/resctrl/core.c     | 13 ++++++++++++-
 arch/x86/kernel/cpu/resctrl/monitor.c  |  4 ++--
 arch/x86/kernel/cpu/resctrl/rdtgroup.c |  5 ++++-
 4 files changed, 20 insertions(+), 4 deletions(-)

diff --git a/arch/x86/kernel/cpu/resctrl/internal.h b/arch/x86/kernel/cpu/resctrl/internal.h
index 39a62babd60b..ad26d008dafa 100644
--- a/arch/x86/kernel/cpu/resctrl/internal.h
+++ b/arch/x86/kernel/cpu/resctrl/internal.h
@@ -405,6 +405,8 @@ DECLARE_STATIC_KEY_FALSE(rdt_alloc_enable_key);
 
 extern struct dentry *debugfs_resctrl;
 
+extern int snc_ways;
+
 enum resctrl_res_level {
 	RDT_RESOURCE_L3,
 	RDT_RESOURCE_L2,
diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
index 19be6fe42ef3..53b2ab37af2f 100644
--- a/arch/x86/kernel/cpu/resctrl/core.c
+++ b/arch/x86/kernel/cpu/resctrl/core.c
@@ -48,6 +48,11 @@ int max_name_width, max_data_width;
  */
 bool rdt_alloc_capable;
 
+/*
+ * How many Sub-NUMA Cluster nodes share a single L3 cache
+ */
+int snc_ways = 1;
+
 static void
 mba_wrmsr_intel(struct rdt_domain *d, struct msr_param *m,
 		struct rdt_resource *r);
@@ -786,7 +791,13 @@ static __init bool get_rdt_alloc_resources(void)
 
 static __init bool get_rdt_mon_resources(void)
 {
-	struct rdt_resource *r = &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl;
+	struct rdt_resource *r;
+
+	/* When SNC is enabled, monitoring is at node scope, not L3 cache scope */
+	if (snc_ways > 1)
+		r = &rdt_resources_all[RDT_RESOURCE_NODE].r_resctrl;
+	else
+		r = &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl;
 
 	if (rdt_cpu_has(X86_FEATURE_CQM_OCCUP_LLC))
 		rdt_mon_features |= (1 << QOS_L3_OCCUP_EVENT_ID);
diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c
index d05bbd4f6b2d..3fc63aa68130 100644
--- a/arch/x86/kernel/cpu/resctrl/monitor.c
+++ b/arch/x86/kernel/cpu/resctrl/monitor.c
@@ -777,8 +777,8 @@ int rdt_get_mon_l3_config(struct rdt_resource *r)
 	int ret;
 
 	resctrl_rmid_realloc_limit = boot_cpu_data.x86_cache_size * 1024;
-	hw_res->mon_scale = boot_cpu_data.x86_cache_occ_scale;
-	r->num_rmid = boot_cpu_data.x86_cache_max_rmid + 1;
+	hw_res->mon_scale = boot_cpu_data.x86_cache_occ_scale / snc_ways;
+	r->num_rmid = (boot_cpu_data.x86_cache_max_rmid + 1) / snc_ways;
 	hw_res->mbm_width = MBM_CNTR_WIDTH_BASE;
 
 	if (mbm_offset > 0 && mbm_offset <= MBM_CNTR_WIDTH_OFFSET_MAX)
diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
index a6ba3080e5db..a0dc64a70d01 100644
--- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
+++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
@@ -2238,7 +2238,10 @@ static int rdt_get_tree(struct fs_context *fc)
 		static_branch_enable_cpuslocked(&rdt_enable_key);
 
 	if (is_mbm_enabled()) {
-		r = &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl;
+		if (snc_ways > 1)
+			r = &rdt_resources_all[RDT_RESOURCE_NODE].r_resctrl;
+		else
+			r = &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl;
 		list_for_each_entry(dom, &r->domains, list)
 			mbm_setup_overflow_handler(dom, MBM_OVERFLOW_INTERVAL);
 	}
-- 
2.39.1



* [PATCH 5/7] x86/resctrl: Add a new "snc_ways" file to the monitoring info directory.
  2023-01-26 18:41 [PATCH 0/7] x86/resctrl: Add support for Sub-NUMA cluster (SNC) systems Tony Luck
                   ` (3 preceding siblings ...)
  2023-01-26 18:41 ` [PATCH 4/7] x86/resctrl: Add code to setup monitoring at L3 or NODE scope Tony Luck
@ 2023-01-26 18:41 ` Tony Luck
  2023-02-28 17:13   ` James Morse
  2023-01-26 18:41 ` [PATCH 6/7] x86/resctrl: Update documentation with Sub-NUMA cluster changes Tony Luck
                   ` (2 subsequent siblings)
  7 siblings, 1 reply; 34+ messages in thread
From: Tony Luck @ 2023-01-26 18:41 UTC (permalink / raw)
  To: Fenghua Yu, Reinette Chatre, Peter Newman, Jonathan Corbet, x86
  Cc: Shaopeng Tan, James Morse, Jamie Iles, Babu Moger, linux-kernel,
	linux-doc, patches, Tony Luck

Make it easy for the user to tell if Sub-NUMA Cluster is enabled by
providing an info/ file.
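
For example (assuming the default /sys/fs/resctrl mount point), on a
system with two SNC nodes per socket:

  $ cat /sys/fs/resctrl/info/L3_MON/snc_ways
  2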

Signed-off-by: Tony Luck <tony.luck@intel.com>
---
 arch/x86/kernel/cpu/resctrl/rdtgroup.c | 15 +++++++++++++++
 1 file changed, 15 insertions(+)

diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
index a0dc64a70d01..392e7a08d083 100644
--- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
+++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
@@ -997,6 +997,14 @@ static int rdt_num_rmids_show(struct kernfs_open_file *of,
 	return 0;
 }
 
+static int rdt_snc_ways_show(struct kernfs_open_file *of,
+			     struct seq_file *seq, void *v)
+{
+	seq_printf(seq, "%d\n", snc_ways);
+
+	return 0;
+}
+
 static int rdt_mon_features_show(struct kernfs_open_file *of,
 				 struct seq_file *seq, void *v)
 {
@@ -1451,6 +1459,13 @@ static struct rftype res_common_files[] = {
 		.seq_show	= rdt_num_rmids_show,
 		.fflags		= RF_MON_INFO,
 	},
+	{
+		.name		= "snc_ways",
+		.mode		= 0444,
+		.kf_ops		= &rdtgroup_kf_single_ops,
+		.seq_show	= rdt_snc_ways_show,
+		.fflags		= RF_MON_INFO,
+	},
 	{
 		.name		= "cbm_mask",
 		.mode		= 0444,
-- 
2.39.1



* [PATCH 6/7] x86/resctrl: Update documentation with Sub-NUMA cluster changes
  2023-01-26 18:41 [PATCH 0/7] x86/resctrl: Add support for Sub-NUMA cluster (SNC) systems Tony Luck
                   ` (4 preceding siblings ...)
  2023-01-26 18:41 ` [PATCH 5/7] x86/resctrl: Add a new "snc_ways" file to the monitoring info directory Tony Luck
@ 2023-01-26 18:41 ` Tony Luck
  2023-02-28 17:14   ` James Morse
  2023-01-26 18:41 ` [PATCH 7/7] x86/resctrl: Determine if Sub-NUMA Cluster is enabled and initialize Tony Luck
  2023-02-28 17:12 ` [PATCH 0/7] x86/resctrl: Add support for Sub-NUMA cluster (SNC) systems James Morse
  7 siblings, 1 reply; 34+ messages in thread
From: Tony Luck @ 2023-01-26 18:41 UTC (permalink / raw)
  To: Fenghua Yu, Reinette Chatre, Peter Newman, Jonathan Corbet, x86
  Cc: Shaopeng Tan, James Morse, Jamie Iles, Babu Moger, linux-kernel,
	linux-doc, patches, Tony Luck

With Sub-NUMA Cluster mode enabled, the scope of monitoring resources
is per-node instead of per-L3-cache. Suffixes of directories with "L3"
in their name refer to Sub-NUMA nodes instead of L3 cache ids.

Signed-off-by: Tony Luck <tony.luck@intel.com>
---
 Documentation/x86/resctrl.rst | 15 ++++++++++++---
 1 file changed, 12 insertions(+), 3 deletions(-)

diff --git a/Documentation/x86/resctrl.rst b/Documentation/x86/resctrl.rst
index 71a531061e4e..9043a2d2f2d3 100644
--- a/Documentation/x86/resctrl.rst
+++ b/Documentation/x86/resctrl.rst
@@ -167,6 +167,11 @@ with the following files:
 		bytes) at which a previously used LLC_occupancy
 		counter can be considered for re-use.
 
+"snc_ways":
+		A value of "1" indicates that SNC mode is disabled.
+		Values of "2" or "4" indicate how many NUMA nodes
+		share an L3 cache.
+
 Finally, in the top level of the "info" directory there is a file
 named "last_cmd_status". This is reset with every "command" issued
 via the file system (making new directories or writing to any of the
@@ -254,9 +259,13 @@ When control is enabled all CTRL_MON groups will also contain:
 When monitoring is enabled all MON groups will also contain:
 
 "mon_data":
-	This contains a set of files organized by L3 domain and by
-	RDT event. E.g. on a system with two L3 domains there will
-	be subdirectories "mon_L3_00" and "mon_L3_01".	Each of these
+	This contains a set of files organized by L3 domain or by NUMA
+	node (depending on whether SNC mode is disabled or enabled
+	respectively) and by RDT event. E.g. on a system with SNC
+	mode disabled and two L3 domains there will be subdirectories
+	"mon_L3_00" and "mon_L3_01", where the numerical suffix refers
+	to the L3 cache id. With SNC enabled the directory names are the
+	same, but the numerical suffix refers to the node id. Each of these
 	directories have one file per event (e.g. "llc_occupancy",
 	"mbm_total_bytes", and "mbm_local_bytes"). In a MON group these
 	files provide a read out of the current value of the event for
-- 
2.39.1



* [PATCH 7/7] x86/resctrl: Determine if Sub-NUMA Cluster is enabled and initialize.
  2023-01-26 18:41 [PATCH 0/7] x86/resctrl: Add support for Sub-NUMA cluster (SNC) systems Tony Luck
                   ` (5 preceding siblings ...)
  2023-01-26 18:41 ` [PATCH 6/7] x86/resctrl: Update documentation with Sub-NUMA cluster changes Tony Luck
@ 2023-01-26 18:41 ` Tony Luck
  2023-02-27 13:30   ` Peter Newman
                     ` (2 more replies)
  2023-02-28 17:12 ` [PATCH 0/7] x86/resctrl: Add support for Sub-NUMA cluster (SNC) systems James Morse
  7 siblings, 3 replies; 34+ messages in thread
From: Tony Luck @ 2023-01-26 18:41 UTC (permalink / raw)
  To: Fenghua Yu, Reinette Chatre, Peter Newman, Jonathan Corbet, x86
  Cc: Shaopeng Tan, James Morse, Jamie Iles, Babu Moger, linux-kernel,
	linux-doc, patches, Tony Luck

There isn't a simple hardware enumeration to indicate to software that
a system is running with Sub-NUMA Cluster enabled.

Compare the number of NUMA nodes with the number of L3 caches to calculate
the number of Sub-NUMA nodes per L3 cache.
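
For illustration (an assumed two-socket system with SNC-2 enabled):

	NUMA nodes = 4, distinct L3 cache ids = 2
	snc_ways   = 4 / 2 = 2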

When Sub-NUMA Cluster mode is enabled in BIOS setup, the RMID counters
are distributed equally between the SNC nodes within each socket.

E.g. if there are 400 RMID counters, and the system is configured with
two SNC nodes per socket, then RMID counters 0..199 are used on SNC node
0 of each socket, and RMID counters 200..399 on SNC node 1.

Handle this by initializing a per-cpu RMID offset value. Use this
to calculate the value to write to the RMID field of the IA32_PQR_ASSOC
MSR during context switch, and also to the IA32_QM_EVTSEL MSR when
reading RMID event values.

N.B. this works well for well-behaved NUMA applications that access
memory predominantly from the local memory node. For applications that
access memory across multiple nodes it may be necessary for the user
to read counters for all SNC nodes on a socket and add the values to
get the actual LLC occupancy or memory bandwidth. Perhaps this isn't
all that different from applications that span multiple sockets in a
legacy system.
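
As an illustrative userspace sketch of that summing (hypothetical code,
not part of this series; it assumes the mon_data layout shown later in
this thread, with mon_L3_00 and mon_L3_01 being the two SNC nodes of
socket 0):

	#include <stdio.h>

	/* Read one monitoring event value from a mon_data subdirectory. */
	static long long read_event(const char *dir, const char *event)
	{
		char path[256];
		long long val;
		FILE *f;

		snprintf(path, sizeof(path),
			 "/sys/fs/resctrl/mon_data/%s/%s", dir, event);
		f = fopen(path, "r");
		if (!f)
			return -1;
		if (fscanf(f, "%lld", &val) != 1)
			val = -1;
		fclose(f);
		return val;
	}

	int main(void)
	{
		/* Sum the per-SNC-node counts to get a socket-wide value. */
		long long total = read_event("mon_L3_00", "mbm_total_bytes") +
				  read_event("mon_L3_01", "mbm_total_bytes");

		printf("socket 0 mbm_total_bytes: %lld\n", total);
		return 0;
	}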

Signed-off-by: Tony Luck <tony.luck@intel.com>
---
 arch/x86/include/asm/resctrl.h        |  4 ++-
 arch/x86/kernel/cpu/resctrl/core.c    | 43 +++++++++++++++++++++++++--
 arch/x86/kernel/cpu/resctrl/monitor.c |  2 +-
 3 files changed, 44 insertions(+), 5 deletions(-)

diff --git a/arch/x86/include/asm/resctrl.h b/arch/x86/include/asm/resctrl.h
index 52788f79786f..59b8afd8c53c 100644
--- a/arch/x86/include/asm/resctrl.h
+++ b/arch/x86/include/asm/resctrl.h
@@ -35,6 +35,8 @@ DECLARE_STATIC_KEY_FALSE(rdt_enable_key);
 DECLARE_STATIC_KEY_FALSE(rdt_alloc_enable_key);
 DECLARE_STATIC_KEY_FALSE(rdt_mon_enable_key);
 
+DECLARE_PER_CPU(int, rmid_offset);
+
 /*
  * __resctrl_sched_in() - Writes the task's CLOSid/RMID to IA32_PQR_MSR
  *
@@ -69,7 +71,7 @@ static void __resctrl_sched_in(void)
 	if (static_branch_likely(&rdt_mon_enable_key)) {
 		tmp = READ_ONCE(current->rmid);
 		if (tmp)
-			rmid = tmp;
+			rmid = tmp + this_cpu_read(rmid_offset);
 	}
 
 	if (closid != state->cur_closid || rmid != state->cur_rmid) {
diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
index 53b2ab37af2f..0ff739375e3b 100644
--- a/arch/x86/kernel/cpu/resctrl/core.c
+++ b/arch/x86/kernel/cpu/resctrl/core.c
@@ -16,6 +16,7 @@
 
 #define pr_fmt(fmt)	"resctrl: " fmt
 
+#include <linux/cpu.h>
 #include <linux/slab.h>
 #include <linux/err.h>
 #include <linux/cacheinfo.h>
@@ -484,6 +485,13 @@ static int get_domain_id(int cpu, enum resctrl_scope scope)
 	return get_cpu_cacheinfo_id(cpu, scope);
 }
 
+DEFINE_PER_CPU(int, rmid_offset);
+
+static void set_per_cpu_rmid_offset(int cpu, struct rdt_resource *r)
+{
+	this_cpu_write(rmid_offset, (cpu_to_node(cpu) % snc_ways) * r->num_rmid);
+}
+
 /*
  * domain_add_cpu - Add a cpu to a resource's domain list.
  *
@@ -515,6 +523,8 @@ static void domain_add_cpu(int cpu, struct rdt_resource *r)
 		cpumask_set_cpu(cpu, &d->cpu_mask);
 		if (r->cache.arch_has_per_cpu_cfg)
 			rdt_domain_reconfigure_cdp(r);
+		if (r->mon_capable)
+			set_per_cpu_rmid_offset(cpu, r);
 		return;
 	}
 
@@ -533,9 +543,12 @@ static void domain_add_cpu(int cpu, struct rdt_resource *r)
 		return;
 	}
 
-	if (r->mon_capable && arch_domain_mbm_alloc(r->num_rmid, hw_dom)) {
-		domain_free(hw_dom);
-		return;
+	if (r->mon_capable) {
+		if (arch_domain_mbm_alloc(r->num_rmid, hw_dom)) {
+			domain_free(hw_dom);
+			return;
+		}
+		set_per_cpu_rmid_offset(cpu, r);
 	}
 
 	list_add_tail(&d->list, add_pos);
@@ -845,11 +858,35 @@ static __init bool get_rdt_resources(void)
 	return (rdt_mon_capable || rdt_alloc_capable);
 }
 
+static __init int find_snc_ways(void)
+{
+	unsigned long *node_caches;
+	int cpu, node, ret;
+
+	node_caches = kcalloc(BITS_TO_LONGS(nr_node_ids), sizeof(*node_caches), GFP_KERNEL);
+	if (!node_caches)
+		return 1;
+
+	cpus_read_lock();
+	for_each_node(node) {
+		cpu = cpumask_first(cpumask_of_node(node));
+		set_bit(get_cpu_cacheinfo_id(cpu, 3), node_caches);
+	}
+	cpus_read_unlock();
+
+	ret = nr_node_ids / bitmap_weight(node_caches, nr_node_ids);
+	kfree(node_caches);
+
+	return ret;
+}
+
 static __init void rdt_init_res_defs_intel(void)
 {
 	struct rdt_hw_resource *hw_res;
 	struct rdt_resource *r;
 
+	snc_ways = find_snc_ways();
+
 	for_each_rdt_resource(r) {
 		hw_res = resctrl_to_arch_res(r);
 
diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c
index 3fc63aa68130..bd5ec348d925 100644
--- a/arch/x86/kernel/cpu/resctrl/monitor.c
+++ b/arch/x86/kernel/cpu/resctrl/monitor.c
@@ -160,7 +160,7 @@ static int __rmid_read(u32 rmid, enum resctrl_event_id eventid, u64 *val)
 	 * IA32_QM_CTR.Error (bit 63) and IA32_QM_CTR.Unavailable (bit 62)
 	 * are error bits.
 	 */
-	wrmsr(MSR_IA32_QM_EVTSEL, eventid, rmid);
+	wrmsr(MSR_IA32_QM_EVTSEL, eventid, rmid + this_cpu_read(rmid_offset));
 	rdmsrl(MSR_IA32_QM_CTR, msr_val);
 
 	if (msr_val & RMID_VAL_ERROR)
-- 
2.39.1



* Re: [PATCH 2/7] x86/resctrl: Remove hard code of RDT_RESOURCE_L3 in monitor.c
  2023-01-26 18:41 ` [PATCH 2/7] x86/resctrl: Remove hard code of RDT_RESOURCE_L3 in monitor.c Tony Luck
@ 2023-01-27  4:51   ` Yu, Fenghua
  0 siblings, 0 replies; 34+ messages in thread
From: Yu, Fenghua @ 2023-01-27  4:51 UTC (permalink / raw)
  To: Luck, Tony, Chatre, Reinette, Peter Newman, Jonathan Corbet, x86
  Cc: Shaopeng Tan, James Morse, Jamie Iles, Babu Moger, linux-kernel,
	linux-doc, patches

Hi, Tony,

On 1/26/23 10:41, Tony Luck wrote:
> Monitoring may be scoped at L3 cache granularity (legacy) or
> at the node level (systems with Sub-NUMA Cluster enabled).
> 
> Save the struct rdt_resource pointer that was used to initialize
> the monitoring code and use that value instead of the hard-coded
> RDT_RESOURCE_L3.
> 
> No functional change.
> 
> Signed-off-by: Tony Luck <tony.luck@intel.com>
> ---
>   arch/x86/kernel/cpu/resctrl/monitor.c | 18 ++++++++++++------
>   1 file changed, 12 insertions(+), 6 deletions(-)
> 
> diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c
> index 77538abeb72a..d05bbd4f6b2d 100644
> --- a/arch/x86/kernel/cpu/resctrl/monitor.c
> +++ b/arch/x86/kernel/cpu/resctrl/monitor.c
> @@ -30,6 +30,8 @@ struct rmid_entry {
>   	struct list_head		list;
>   };
>   
> +static struct rdt_resource *mon_resource;
> +
>   /**
>    * @rmid_free_lru    A least recently used list of free RMIDs
>    *     These RMIDs are guaranteed to have an occupancy less than the
> @@ -251,7 +253,7 @@ int resctrl_arch_rmid_read(struct rdt_resource *r, struct rdt_domain *d,
>    */
>   void __check_limbo(struct rdt_domain *d, bool force_free)
>   {
> -	struct rdt_resource *r = &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl;
> +	struct rdt_resource *r = mon_resource;
>   	struct rmid_entry *entry;
>   	u32 crmid = 1, nrmid;
>   	bool rmid_dirty;
> @@ -316,7 +318,7 @@ int alloc_rmid(void)
>   
>   static void add_rmid_to_limbo(struct rmid_entry *entry)
>   {
> -	struct rdt_resource *r = &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl;
> +	struct rdt_resource *r = mon_resource;
>   	struct rdt_domain *d;
>   	int cpu, err;
>   	u64 val = 0;
> @@ -633,7 +635,7 @@ void cqm_handle_limbo(struct work_struct *work)
>   
>   	mutex_lock(&rdtgroup_mutex);
>   
> -	r = &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl;
> +	r = mon_resource;
>   	d = container_of(work, struct rdt_domain, cqm_limbo.work);
>   
>   	__check_limbo(d, false);
> @@ -669,7 +671,7 @@ void mbm_handle_overflow(struct work_struct *work)
>   	if (!static_branch_likely(&rdt_mon_enable_key))
>   		goto out_unlock;
>   
> -	r = &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl;
> +	r = mon_resource;
>   	d = container_of(work, struct rdt_domain, mbm_over.work);
>   
>   	list_for_each_entry(prgrp, &rdt_all_groups, rdtgroup_list) {
> @@ -747,9 +749,11 @@ static struct mon_evt mbm_local_event = {
>   /*
>    * Initialize the event list for the resource.
>    *
> - * Note that MBM events are also part of RDT_RESOURCE_L3 resource
> + * Note that MBM events can either be part of RDT_RESOURCE_L3 resource
>    * because as per the SDM the total and local memory bandwidth
> - * are enumerated as part of L3 monitoring.
> + * are enumerated as part of L3 monitoring, or they may be per NUMA
> + * node on systems with sub-NUMA cluster enabled and are then in the
> + * RDT_RESOURCE_NODE resource.

"RDT_RESOURCE_NODE" is not defined yet. It will be defined in patch #3.
Maybe better to move this comment change to patch #3 to avoid confusion
on RDT_RESOURCE_NODE.

Further, the current comment calls out MBM because MBM is not obviously
related to L3. But with SNC, I think we need to call out SNC node for
all monitor events, not just MBM.

Maybe something like this?
    /*
     * Initialize the event list for the resource.
     *
     * Monitor events can either be part of RDT_RESOURCE_L3 resource,
     * or they may be per NUMA node on systems with sub-NUMA cluster
     * enabled and are then in the RDT_RESOURCE_NODE resource.
     *
     * Note that MBM events are also part of RDT_RESOURCE_L3 or
     * RDT_RESOURCE_NODE resource
     * because as per the SDM the total and local memory bandwidth
     * are enumerated as part of L3 monitoring.
     */

>    */
>   static void l3_mon_evt_init(struct rdt_resource *r)
>   {
> @@ -761,6 +765,8 @@ static void l3_mon_evt_init(struct rdt_resource *r)
>   		list_add_tail(&mbm_total_event.list, &r->evt_list);
>   	if (is_mbm_local_enabled())
>   		list_add_tail(&mbm_local_event.list, &r->evt_list);
> +
> +	mon_resource = r;
>   }
>   
>   int rdt_get_mon_l3_config(struct rdt_resource *r)

Thanks.

-Fenghua


* Re: [PATCH 3/7] x86/resctrl: Add a new node-scoped resource to rdt_resources_all[]
  2023-01-26 18:41 ` [PATCH 3/7] x86/resctrl: Add a new node-scoped resource to rdt_resources_all[] Tony Luck
@ 2023-01-27  5:24   ` Yu, Fenghua
  2023-01-27 16:02     ` Peter Newman
  2023-01-27 18:23     ` Luck, Tony
  2023-01-28  2:22   ` Yu, Fenghua
                     ` (2 subsequent siblings)
  3 siblings, 2 replies; 34+ messages in thread
From: Yu, Fenghua @ 2023-01-27  5:24 UTC (permalink / raw)
  To: Luck, Tony, Chatre, Reinette, Peter Newman, Jonathan Corbet, x86
  Cc: Shaopeng Tan, James Morse, Jamie Iles, Babu Moger, linux-kernel,
	linux-doc, patches

Hi, Tony,

On 1/26/23 10:41, Tony Luck wrote:
> Add a placeholder in the array of struct rdt_hw_resource to be used
> for event monitoring of systems with Sub-NUMA Cluster enabled.
> 
> Update get_domain_id() to handle SCOPE_NODE.
> 
> Signed-off-by: Tony Luck <tony.luck@intel.com>
> ---
>   arch/x86/kernel/cpu/resctrl/internal.h |  2 ++
>   arch/x86/kernel/cpu/resctrl/core.c     | 12 ++++++++++++
>   2 files changed, 14 insertions(+)
> 
> diff --git a/arch/x86/kernel/cpu/resctrl/internal.h b/arch/x86/kernel/cpu/resctrl/internal.h
> index 15cea517efaa..39a62babd60b 100644
> --- a/arch/x86/kernel/cpu/resctrl/internal.h
> +++ b/arch/x86/kernel/cpu/resctrl/internal.h
> @@ -409,12 +409,14 @@ enum resctrl_res_level {
>   	RDT_RESOURCE_L3,
>   	RDT_RESOURCE_L2,
>   	RDT_RESOURCE_MBA,
> +	RDT_RESOURCE_NODE,
>   
>   	/* Must be the last */
>   	RDT_NUM_RESOURCES,
>   };
>   
>   enum resctrl_scope {
> +	SCOPE_NODE,
>   	SCOPE_L2_CACHE = 2,
>   	SCOPE_L3_CACHE = 3
>   };
> diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
> index 6914232acf84..19be6fe42ef3 100644
> --- a/arch/x86/kernel/cpu/resctrl/core.c
> +++ b/arch/x86/kernel/cpu/resctrl/core.c
> @@ -100,6 +100,16 @@ struct rdt_hw_resource rdt_resources_all[] = {
>   			.fflags			= RFTYPE_RES_MB,
>   		},
>   	},
> +	[RDT_RESOURCE_NODE] =
> +	{
> +		.r_resctrl = {
> +			.rid			= RDT_RESOURCE_NODE,
> +			.name			= "L3",
"L3" was named as RDT_RESOURCE_L3 already. The duplicate name here may
cause duplicate file names in info dir. Maybe rename it as "L3_NODE"?

> +			.scope			= SCOPE_NODE,
> +			.domains		= domain_init(RDT_RESOURCE_NODE),
> +			.fflags			= RFTYPE_RES_MB,

I'm not sure if fflags is RFTYPE_RES_MB | RFTYPE_RES_L3 for both cache
and MB?

> +		},
> +	},
>   };
>   
>   /*
> @@ -464,6 +474,8 @@ static int arch_domain_mbm_alloc(u32 num_rmid, struct rdt_hw_domain *hw_dom)
>   
>   static int get_domain_id(int cpu, enum resctrl_scope scope)
>   {
> +	if (scope == SCOPE_NODE)
> +		return cpu_to_node(cpu);
>   	return get_cpu_cacheinfo_id(cpu, scope);
>   }
>   

Thanks.

-Fenghua


* Re: [PATCH 3/7] x86/resctrl: Add a new node-scoped resource to rdt_resources_all[]
  2023-01-27  5:24   ` Yu, Fenghua
@ 2023-01-27 16:02     ` Peter Newman
  2023-01-27 18:23     ` Luck, Tony
  1 sibling, 0 replies; 34+ messages in thread
From: Peter Newman @ 2023-01-27 16:02 UTC (permalink / raw)
  To: Yu, Fenghua
  Cc: Luck, Tony, Chatre, Reinette, Jonathan Corbet, x86, Shaopeng Tan,
	James Morse, Jamie Iles, Babu Moger, linux-kernel, linux-doc,
	patches

Hi Tony,

On Fri, Jan 27, 2023 at 6:25 AM Yu, Fenghua <fenghua.yu@intel.com> wrote:
>
> On 1/26/23 10:41, Tony Luck wrote:
> > diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
> > index 6914232acf84..19be6fe42ef3 100644
> > --- a/arch/x86/kernel/cpu/resctrl/core.c
> > +++ b/arch/x86/kernel/cpu/resctrl/core.c
> > @@ -100,6 +100,16 @@ struct rdt_hw_resource rdt_resources_all[] = {
> >                       .fflags                 = RFTYPE_RES_MB,
> >               },
> >       },
> > +     [RDT_RESOURCE_NODE] =
> > +     {
> > +             .r_resctrl = {
> > +                     .rid                    = RDT_RESOURCE_NODE,
> > +                     .name                   = "L3",
> "L3" was named as RDT_RESOURCE_L3 already. The duplicate name here may
> cause duplicate file names in info dir. Maybe rename it as "L3_NODE"?

I'm trying to get some feedback from our own users on whether changing
the directory names would bother them. At least from my own testing, I
did learn to appreciate the interface change a bit more: I needed an
SNC and non-SNC case to correctly predict which mon_data subdirectory
the data would appear in.

I was able to confirm that this change allows bandwidth to be counted
on RMID/CPU combos where it didn't work before on an SNC2
configuration.

If I'm understanding this correctly, it might be helpful to highlight
that the extra resource is needed to allow a different number of L3
domains in L3 monitoring vs allocation.

Thanks!
-Peter

Tested-By: Peter Newman <peternewman@google.com>
Reviewed-By: Peter Newman <peternewman@google.com>


* RE: [PATCH 3/7] x86/resctrl: Add a new node-scoped resource to rdt_resources_all[]
  2023-01-27  5:24   ` Yu, Fenghua
  2023-01-27 16:02     ` Peter Newman
@ 2023-01-27 18:23     ` Luck, Tony
  2023-01-28  2:25       ` Yu, Fenghua
  1 sibling, 1 reply; 34+ messages in thread
From: Luck, Tony @ 2023-01-27 18:23 UTC (permalink / raw)
  To: Yu, Fenghua, Chatre, Reinette, Peter Newman, Jonathan Corbet, x86
  Cc: Shaopeng Tan, James Morse, Jamie Iles, Babu Moger, linux-kernel,
	linux-doc, patches

>> +	[RDT_RESOURCE_NODE] =
>> +	{
>> +		.r_resctrl = {
>> +			.rid			= RDT_RESOURCE_NODE,
>> +			.name			= "L3",

> "L3" was named as RDT_RESOURCE_L3 already. The duplicate name here may
> cause duplicate file names in info dir. Maybe rename it as "L3_NODE"?

I thought the same, and my first implementation used a different string here (I picked
"NODE" rather than "L3_NODE").

But my testers complained that this broke all their existing infrastructure that reads
cache occupancy and memory bandwidth. This string is not just used in the info/
directory, it is also the basis for the directory names in mon_data/

$ tree /sys/fs/resctrl/mon_data
/sys/fs/resctrl/mon_data
├── mon_L3_00
│   ├── llc_occupancy
│   ├── mbm_local_bytes
│   └── mbm_total_bytes
├── mon_L3_01
│   ├── llc_occupancy
│   ├── mbm_local_bytes
│   └── mbm_total_bytes
├── mon_L3_02
│   ├── llc_occupancy
│   ├── mbm_local_bytes
│   └── mbm_total_bytes
└── mon_L3_03
    ├── llc_occupancy
    ├── mbm_local_bytes
    └── mbm_total_bytes

The name using "L3" is still appropriate and accurate.

There isn't a "duplicate file names" problem in the info/ directory because a system
either has SNC disabled, and uses the L3-scoped resource, or has SNC enabled and
uses the node-scoped resource.

-Tony



* Re: [PATCH 3/7] x86/resctrl: Add a new node-scoped resource to rdt_resources_all[]
  2023-01-26 18:41 ` [PATCH 3/7] x86/resctrl: Add a new node-scoped resource to rdt_resources_all[] Tony Luck
  2023-01-27  5:24   ` Yu, Fenghua
@ 2023-01-28  2:22   ` Yu, Fenghua
  2023-01-28  2:36   ` Yu, Fenghua
  2023-02-28 14:27   ` Moger, Babu
  3 siblings, 0 replies; 34+ messages in thread
From: Yu, Fenghua @ 2023-01-28  2:22 UTC (permalink / raw)
  To: Luck, Tony, Chatre, Reinette, Peter Newman, Jonathan Corbet, x86
  Cc: Shaopeng Tan, James Morse, Jamie Iles, Babu Moger, linux-kernel,
	linux-doc, patches

Hi, Tony,

On 1/26/23 10:41, Tony Luck wrote:
> diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
> index 6914232acf84..19be6fe42ef3 100644
> --- a/arch/x86/kernel/cpu/resctrl/core.c
> +++ b/arch/x86/kernel/cpu/resctrl/core.c
> @@ -100,6 +100,16 @@ struct rdt_hw_resource rdt_resources_all[] = {
>   			.fflags			= RFTYPE_RES_MB,
>   		},
>   	},
> +	[RDT_RESOURCE_NODE] =
> +	{
> +		.r_resctrl = {
> +			.rid			= RDT_RESOURCE_NODE,
> +			.name			= "L3",
> +			.scope			= SCOPE_NODE,
> +			.domains		= domain_init(RDT_RESOURCE_NODE),
> +			.fflags			= RFTYPE_RES_MB,

RFTYPE_RES_MB is for the resource to add some files in info/MB.
But the NODE resource doesn't have any files to show in info/MB.

The only file shown for the NODE resource is info/L3_MON/snc_ways. But
this file doesn't need to set fflags.

Maybe there is no need to set fflags, or fflags could be set to 0, to
eliminate confusion?

Thanks.

-Fenghua


* Re: [PATCH 3/7] x86/resctrl: Add a new node-scoped resource to rdt_resources_all[]
  2023-01-27 18:23     ` Luck, Tony
@ 2023-01-28  2:25       ` Yu, Fenghua
  0 siblings, 0 replies; 34+ messages in thread
From: Yu, Fenghua @ 2023-01-28  2:25 UTC (permalink / raw)
  To: Luck, Tony, Chatre, Reinette, Peter Newman, Jonathan Corbet, x86
  Cc: Shaopeng Tan, James Morse, Jamie Iles, Babu Moger, linux-kernel,
	linux-doc, patches

Hi, Tony,

On 1/27/23 10:23, Luck, Tony wrote:
>>> +	[RDT_RESOURCE_NODE] =
>>> +	{
>>> +		.r_resctrl = {
>>> +			.rid			= RDT_RESOURCE_NODE,
>>> +			.name			= "L3",
> 
>> "L3" was named as RDT_RESOURCE_L3 already. The duplicate name here may
>> cause duplicate file names in info dir. Maybe rename it as "L3_NODE"?
> 
> I thought the same, and my first implementation used a different string here (I picked
> "NODE" rather than "L3_NODE").
> 
> But my testers complained that this broke all their existing infrastructure that reads
> cache occupancy and memory bandwidth. This string is not just used in the info/
> directory, it is also the basis for the directory names in mon_data/
> 
> $ tree /sys/fs/resctrl/mon_data
> /sys/fs/resctrl/mon_data
> ├── mon_L3_00
> │   ├── llc_occupancy
> │   ├── mbm_local_bytes
> │   └── mbm_total_bytes
> ├── mon_L3_01
> │   ├── llc_occupancy
> │   ├── mbm_local_bytes
> │   └── mbm_total_bytes
> ├── mon_L3_02
> │   ├── llc_occupancy
> │   ├── mbm_local_bytes
> │   └── mbm_total_bytes
> └── mon_L3_03
>      ├── llc_occupancy
>      ├── mbm_local_bytes
>      └── mbm_total_bytes
> 
> The name using "L3" is still appropriate and accurate.
> 
> There isn't a "duplicate file names" problem in the info/ directory because a system
> either has SNC disabled, and uses the L3-scoped resource, or has SNC enabled and
> uses the node-scoped resource.

That's right.

Thank you for your info!

-Fenghua


* Re: [PATCH 3/7] x86/resctrl: Add a new node-scoped resource to rdt_resources_all[]
  2023-01-26 18:41 ` [PATCH 3/7] x86/resctrl: Add a new node-scoped resource to rdt_resources_all[] Tony Luck
  2023-01-27  5:24   ` Yu, Fenghua
  2023-01-28  2:22   ` Yu, Fenghua
@ 2023-01-28  2:36   ` Yu, Fenghua
  2023-01-30 19:04     ` Luck, Tony
  2023-02-28 14:27   ` Moger, Babu
  3 siblings, 1 reply; 34+ messages in thread
From: Yu, Fenghua @ 2023-01-28  2:36 UTC (permalink / raw)
  To: Luck, Tony, Chatre, Reinette, Peter Newman, Jonathan Corbet, x86
  Cc: Shaopeng Tan, James Morse, Jamie Iles, Babu Moger, linux-kernel,
	linux-doc, patches

Hi, Tony,

On 1/26/23 10:41, Tony Luck wrote:
> diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
> index 6914232acf84..19be6fe42ef3 100644
> --- a/arch/x86/kernel/cpu/resctrl/core.c
> +++ b/arch/x86/kernel/cpu/resctrl/core.c
> @@ -100,6 +100,16 @@ struct rdt_hw_resource rdt_resources_all[] = {
>   			.fflags			= RFTYPE_RES_MB,
>   		},
>   	},
> +	[RDT_RESOURCE_NODE] =
> +	{
> +		.r_resctrl = {
> +			.rid			= RDT_RESOURCE_NODE,
> +			.name			= "L3",
> +			.scope			= SCOPE_NODE,
> +			.domains		= domain_init(RDT_RESOURCE_NODE),
> +			.fflags			= RFTYPE_RES_MB,

RFTYPE_RES_MB is for the resource to add some files in info/MB.
But the NODE resource doesn't have any files to show in info/MB.

The only file shown for the NODE resource is info/L3_MON/snc_ways. But
this file doesn't need to set fflags.

It seems there is no need to set fflags, or fflags could be set to 0,
to eliminate confusion?

Thanks.

-Fenghua


* RE: [PATCH 3/7] x86/resctrl: Add a new node-scoped resource to rdt_resources_all[]
  2023-01-28  2:36   ` Yu, Fenghua
@ 2023-01-30 19:04     ` Luck, Tony
  0 siblings, 0 replies; 34+ messages in thread
From: Luck, Tony @ 2023-01-30 19:04 UTC (permalink / raw)
  To: Yu, Fenghua, Chatre, Reinette, Peter Newman, Jonathan Corbet, x86
  Cc: Shaopeng Tan, James Morse, Jamie Iles, Babu Moger, linux-kernel,
	linux-doc, patches

>> +			.domains		= domain_init(RDT_RESOURCE_NODE),
>> +			.fflags			= RFTYPE_RES_MB,

> RFTYPE_RES_MB is for the resource to add some files in info/MB.
> But the NODE resource doesn't have any files to show in info/MB.
>
> Only shown file for the NODE resource is info/L3_MON/snc_ways. But this 
> file doesn't need to set fflags.
>
> Seems no need to set fflags or fflags=0 to eliminate confusion?

I just cut & pasted from the L3 ... oops. I think you may be right that
an explicit ".fflags = 0" would be best here.

-Tony


* Re: [PATCH 7/7] x86/resctrl: Determine if Sub-NUMA Cluster is enabled and initialize.
  2023-01-26 18:41 ` [PATCH 7/7] x86/resctrl: Determine if Sub-NUMA Cluster is enabled and initialize Tony Luck
@ 2023-02-27 13:30   ` Peter Newman
  2023-03-10 17:30     ` Tony Luck
  2023-02-28 19:51   ` Moger, Babu
       [not found]   ` <85d7e70a-b9c8-6551-b1ac-229b51ee18d7@amd.com>
  2 siblings, 1 reply; 34+ messages in thread
From: Peter Newman @ 2023-02-27 13:30 UTC (permalink / raw)
  To: Tony Luck
  Cc: Fenghua Yu, Reinette Chatre, Jonathan Corbet, x86, Shaopeng Tan,
	James Morse, Jamie Iles, Babu Moger, linux-kernel, linux-doc,
	patches

Hi Tony,

On Thu, Jan 26, 2023 at 7:42 PM Tony Luck <tony.luck@intel.com> wrote:
> +static __init int find_snc_ways(void)
> +{
> +       unsigned long *node_caches;
> +       int cpu, node, ret;
> +
> +       node_caches = kcalloc(BITS_TO_LONGS(nr_node_ids), sizeof(*node_caches), GFP_KERNEL);
> +       if (!node_caches)
> +               return 1;
> +
> +       cpus_read_lock();
> +       for_each_node(node) {

Someone tried this patch on a machine with a CPU-less node...

We need to check for this:

+               if (cpumask_empty(cpumask_of_node(node)))
+                       continue;

> +               cpu = cpumask_first(cpumask_of_node(node));
> +               set_bit(get_cpu_cacheinfo_id(cpu, 3), node_caches);
> +       }
> +       cpus_read_unlock();

Thanks!
-Peter


* Re: [PATCH 3/7] x86/resctrl: Add a new node-scoped resource to rdt_resources_all[]
  2023-01-26 18:41 ` [PATCH 3/7] x86/resctrl: Add a new node-scoped resource to rdt_resources_all[] Tony Luck
                     ` (2 preceding siblings ...)
  2023-01-28  2:36   ` Yu, Fenghua
@ 2023-02-28 14:27   ` Moger, Babu
  2023-02-28 17:05     ` Luck, Tony
  3 siblings, 1 reply; 34+ messages in thread
From: Moger, Babu @ 2023-02-28 14:27 UTC (permalink / raw)
  To: Tony Luck, Fenghua Yu, Reinette Chatre, Peter Newman,
	Jonathan Corbet, x86
  Cc: Shaopeng Tan, James Morse, Jamie Iles, linux-kernel, linux-doc, patches

Hi Tony,
Sorry for the late response. I was looking at your patches.

Do you really need a new resource [RDT_RESOURCE_NODE] to handle this 
feature?

As far as I can see, all that matters is writing/reading the MSRs 
IA32_PQR_ASSOC and IA32_QM_EVTSEL based on cpu index. I think that can 
be done without having the new resource. Let me know if I have 
misunderstood something.

Thanks
Babu


> -----Original Message-----
> From: Tony Luck <tony.luck@intel.com>
> Sent: Thursday, January 26, 2023 12:42 PM
> To: Fenghua Yu <fenghua.yu@intel.com>; Reinette Chatre
> <reinette.chatre@intel.com>; Peter Newman <peternewman@google.com>;
> Jonathan Corbet <corbet@lwn.net>; x86@kernel.org
> Cc: Shaopeng Tan <tan.shaopeng@fujitsu.com>; James Morse
> <james.morse@arm.com>; Jamie Iles <quic_jiles@quicinc.com>; Moger, Babu
> <Babu.Moger@amd.com>; linux-kernel@vger.kernel.org; linux-
> doc@vger.kernel.org; patches@lists.linux.dev; Tony Luck
> <tony.luck@intel.com>
> Subject: [PATCH 3/7] x86/resctrl: Add a new node-scoped resource to
> rdt_resources_all[]
> 
> Add a placeholder in the array of struct rdt_hw_resource to be used for event
> monitoring of systems with Sub-NUMA Cluster enabled.
> 
> Update get_domain_id() to handle SCOPE_NODE.
> 
> Signed-off-by: Tony Luck <tony.luck@intel.com>
> ---
>  arch/x86/kernel/cpu/resctrl/internal.h |  2 ++
>  arch/x86/kernel/cpu/resctrl/core.c     | 12 ++++++++++++
>  2 files changed, 14 insertions(+)
> 
> diff --git a/arch/x86/kernel/cpu/resctrl/internal.h
> b/arch/x86/kernel/cpu/resctrl/internal.h
> index 15cea517efaa..39a62babd60b 100644
> --- a/arch/x86/kernel/cpu/resctrl/internal.h
> +++ b/arch/x86/kernel/cpu/resctrl/internal.h
> @@ -409,12 +409,14 @@ enum resctrl_res_level {
>  	RDT_RESOURCE_L3,
>  	RDT_RESOURCE_L2,
>  	RDT_RESOURCE_MBA,
> +	RDT_RESOURCE_NODE,
> 
>  	/* Must be the last */
>  	RDT_NUM_RESOURCES,
>  };
> 
>  enum resctrl_scope {
> +	SCOPE_NODE,
>  	SCOPE_L2_CACHE = 2,
>  	SCOPE_L3_CACHE = 3
>  };
> diff --git a/arch/x86/kernel/cpu/resctrl/core.c
> b/arch/x86/kernel/cpu/resctrl/core.c
> index 6914232acf84..19be6fe42ef3 100644
> --- a/arch/x86/kernel/cpu/resctrl/core.c
> +++ b/arch/x86/kernel/cpu/resctrl/core.c
> @@ -100,6 +100,16 @@ struct rdt_hw_resource rdt_resources_all[] = {
>  			.fflags			= RFTYPE_RES_MB,
>  		},
>  	},
> +	[RDT_RESOURCE_NODE] =
> +	{
> +		.r_resctrl = {
> +			.rid			= RDT_RESOURCE_NODE,
> +			.name			= "L3",
> +			.scope			= SCOPE_NODE,
> +			.domains		=
> domain_init(RDT_RESOURCE_NODE),
> +			.fflags			= RFTYPE_RES_MB,
> +		},
> +	},
>  };
> 
>  /*
> @@ -464,6 +474,8 @@ static int arch_domain_mbm_alloc(u32 num_rmid,
> struct rdt_hw_domain *hw_dom)
> 
>  static int get_domain_id(int cpu, enum resctrl_scope scope)  {
> +	if (scope == SCOPE_NODE)
> +		return cpu_to_node(cpu);
>  	return get_cpu_cacheinfo_id(cpu, scope);  }
> 
> --
> 2.39.1



* RE: [PATCH 3/7] x86/resctrl: Add a new node-scoped resource to rdt_resources_all[]
  2023-02-28 14:27   ` Moger, Babu
@ 2023-02-28 17:05     ` Luck, Tony
  0 siblings, 0 replies; 34+ messages in thread
From: Luck, Tony @ 2023-02-28 17:05 UTC (permalink / raw)
  To: Moger, Babu, Yu, Fenghua, Chatre, Reinette, Peter Newman,
	Jonathan Corbet, x86
  Cc: Shaopeng Tan, James Morse, Jamie Iles, linux-kernel, linux-doc, patches

Babu,

Thanks for looking at my patches.

> Do you really need a new resource [RDT_RESOURCE_NODE] to handle this 
> feature?

Yes. When Sub-NUMA Cluster mode is enabled, there are separate counts for
each "node" on the socket. This new resource is the key to creating extra
directories in the /sys/fs/resctrl/mon_data/ area so that the memory bandwidth
and cache occupancy can be read for each node, instead of just for each socket.

But there are some other issues with this patch series. A new version
will be posted once they are fixed up.

-Tony


* Re: [PATCH 0/7] x86/resctrl: Add support for Sub-NUMA cluster (SNC) systems
  2023-01-26 18:41 [PATCH 0/7] x86/resctrl: Add support for Sub-NUMA cluster (SNC) systems Tony Luck
                   ` (6 preceding siblings ...)
  2023-01-26 18:41 ` [PATCH 7/7] x86/resctrl: Determine if Sub-NUMA Cluster is enabled and initialize Tony Luck
@ 2023-02-28 17:12 ` James Morse
  2023-02-28 18:04   ` Luck, Tony
  7 siblings, 1 reply; 34+ messages in thread
From: James Morse @ 2023-02-28 17:12 UTC (permalink / raw)
  To: Tony Luck, Fenghua Yu, Reinette Chatre, Peter Newman,
	Jonathan Corbet, x86
  Cc: Shaopeng Tan, Jamie Iles, Babu Moger, linux-kernel, linux-doc, patches

Hi Tony,

On 26/01/2023 18:41, Tony Luck wrote:
> Intel server systems starting with Skylake support a mode that logically
> partitions each socket. E.g. when partitioned two ways, half the cores,
> L3 cache, and memory controllers are allocated to each of the partitions.
> This may reduce average latency to access L3 cache and memory, with the
> tradeoff that only half the L3 cache is available for subnode-local memory
> access.

I couldn't find a description of what happens to the CAT bitmaps or counters.

Presumably the CAT bitmaps are duplicated, so each cluster has its own set, and
the counters aren't - so software has to co-ordinate the use of RMID across the CPUs?


How come cacheinfo isn't modified to report the L3 partitions as separate caches?
Otherwise user-space would assume the full size of the cache is available on any of those
CPUs.
This would avoid an ABI change in resctrl (domain is now the numa node), leaving only the
RMID range code.


Thanks,

James

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 5/7] x86/resctrl: Add a new "snc_ways" file to the monitoring info directory.
  2023-01-26 18:41 ` [PATCH 5/7] x86/resctrl: Add a new "snc_ways" file to the monitoring info directory Tony Luck
@ 2023-02-28 17:13   ` James Morse
  2023-02-28 17:44     ` Luck, Tony
  0 siblings, 1 reply; 34+ messages in thread
From: James Morse @ 2023-02-28 17:13 UTC (permalink / raw)
  To: Tony Luck, Fenghua Yu, Reinette Chatre, Peter Newman,
	Jonathan Corbet, x86
  Cc: Shaopeng Tan, Jamie Iles, Babu Moger, linux-kernel, linux-doc, patches

Hi Tony,

On 26/01/2023 18:41, Tony Luck wrote:
> Make it easy for the user to tell if Sub-NUMA Cluster is enabled by
> providing an info/ file.

I think what this is conveying to user-space is 'domain_id_is_numa_node'.

Does user-space need to know the number of ways?

Will this always be a single number, or will it ever be possible to have an SNC=2 and
SNC=1 package in the same system?


Thanks,

James

> diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> index a0dc64a70d01..392e7a08d083 100644
> --- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> +++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> @@ -997,6 +997,14 @@ static int rdt_num_rmids_show(struct kernfs_open_file *of,
>  	return 0;
>  }
>  
> +static int rdt_snc_ways_show(struct kernfs_open_file *of,
> +			     struct seq_file *seq, void *v)
> +{
> +	seq_printf(seq, "%d\n", snc_ways);
> +
> +	return 0;
> +}
> +
>  static int rdt_mon_features_show(struct kernfs_open_file *of,
>  				 struct seq_file *seq, void *v)
>  {
> @@ -1451,6 +1459,13 @@ static struct rftype res_common_files[] = {
>  		.seq_show	= rdt_num_rmids_show,
>  		.fflags		= RF_MON_INFO,
>  	},
> +	{
> +		.name		= "snc_ways",
> +		.mode		= 0444,
> +		.kf_ops		= &rdtgroup_kf_single_ops,
> +		.seq_show	= rdt_snc_ways_show,
> +		.fflags		= RF_MON_INFO,
> +	},
>  	{
>  		.name		= "cbm_mask",
>  		.mode		= 0444,


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 4/7] x86/resctrl: Add code to setup monitoring at L3 or NODE scope.
  2023-01-26 18:41 ` [PATCH 4/7] x86/resctrl: Add code to setup monitoring at L3 or NODE scope Tony Luck
@ 2023-02-28 17:13   ` James Morse
  2023-02-28 17:28     ` Luck, Tony
  0 siblings, 1 reply; 34+ messages in thread
From: James Morse @ 2023-02-28 17:13 UTC (permalink / raw)
  To: Tony Luck, Fenghua Yu, Reinette Chatre, Peter Newman,
	Jonathan Corbet, x86
  Cc: Shaopeng Tan, Jamie Iles, Babu Moger, linux-kernel, linux-doc, patches

Hi Tony,

On 26/01/2023 18:41, Tony Luck wrote:
> When Sub-NUMA cluster is enabled (snc_ways > 1) use the RDT_RESOURCE_NODE
> instead of RDT_RESOURCE_L3 for all monitoring operations.
> 
> The mon_scale and num_rmid values from CPUID(0xf,0x1),(EBX,ECX) must be
> scaled down by the number of Sub-NUMA Clusters.
> 
> A subsequent change will detect sub-NUMA cluster mode and set
> "snc_ways". For now set to one (meaning each L3 cache spans one
> node).

(I'm looking at decoupling "monitoring is always on RDT_RESOURCE_L3" as a separate thing
 to enabling SNC ... just in case my comments seem strange!)


> diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
> index 19be6fe42ef3..53b2ab37af2f 100644
> --- a/arch/x86/kernel/cpu/resctrl/core.c
> +++ b/arch/x86/kernel/cpu/resctrl/core.c
> @@ -786,7 +791,13 @@ static __init bool get_rdt_alloc_resources(void)
>  
>  static __init bool get_rdt_mon_resources(void)
>  {
> -	struct rdt_resource *r = &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl;
> +	struct rdt_resource *r;
> +
> +	/* When SNC enabled, monitor functions at node instead of L3 cache scope */
> +	if (snc_ways > 1)
> +		r = &rdt_resources_all[RDT_RESOURCE_NODE].r_resctrl;
> +	else
> +		r = &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl;

Could this be hidden in a helper with some name like resctrl_arch_get_mbm_resource()?
You have the same pattern again in rdt_get_tree(). If this gets more complex in the
future, it means its outside the filesystem code, and all in one place.


Thanks,

James


>  	if (rdt_cpu_has(X86_FEATURE_CQM_OCCUP_LLC))
>  		rdt_mon_features |= (1 << QOS_L3_OCCUP_EVENT_ID);

> diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> index a6ba3080e5db..a0dc64a70d01 100644
> --- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> +++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> @@ -2238,7 +2238,10 @@ static int rdt_get_tree(struct fs_context *fc)
>  		static_branch_enable_cpuslocked(&rdt_enable_key);
>  
>  	if (is_mbm_enabled()) {
> -		r = &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl;
> +		if (snc_ways > 1)
> +			r = &rdt_resources_all[RDT_RESOURCE_NODE].r_resctrl;
> +		else
> +			r = &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl;
>  		list_for_each_entry(dom, &r->domains, list)
>  			mbm_setup_overflow_handler(dom, MBM_OVERFLOW_INTERVAL);
>  	}


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 6/7] x86/resctrl: Update documentation with Sub-NUMA cluster changes
  2023-01-26 18:41 ` [PATCH 6/7] x86/resctrl: Update documentation with Sub-NUMA cluster changes Tony Luck
@ 2023-02-28 17:14   ` James Morse
  0 siblings, 0 replies; 34+ messages in thread
From: James Morse @ 2023-02-28 17:14 UTC (permalink / raw)
  To: Tony Luck, Fenghua Yu, Reinette Chatre, Peter Newman,
	Jonathan Corbet, x86
  Cc: Shaopeng Tan, Jamie Iles, Babu Moger, linux-kernel, linux-doc, patches

Hi Tony,

On 26/01/2023 18:41, Tony Luck wrote:
> With Sub-NUMA Cluster mode enabled the scope of monitoring resources is
> per-NODE instead of per-L3 cache. Suffices of directories with "L3" in

(Typo: Suffixes)

> their name refer to Sun-NUMA nodes instead of L3 cache ids.

(Typo: Sub-NUMA)


> diff --git a/Documentation/x86/resctrl.rst b/Documentation/x86/resctrl.rst
> index 71a531061e4e..9043a2d2f2d3 100644
> --- a/Documentation/x86/resctrl.rst
> +++ b/Documentation/x86/resctrl.rst
> @@ -167,6 +167,11 @@ with the following files:
>  		bytes) at which a previously used LLC_occupancy
>  		counter can be considered for re-use.
>  
> +"snc_ways":
> +                A value of "1" marks that SNC mode is disabled.

It would be good to expand the acronym the first time it's used in the documentation.
This gives folk a few more characters to google!


> +                Values of "2" or "4" indicate how many NUMA
> +                nodes share an L3 cache.
> +
>  Finally, in the top level of the "info" directory there is a file
>  named "last_cmd_status". This is reset with every "command" issued
>  via the file system (making new directories or writing to any of the



Thanks,

James

^ permalink raw reply	[flat|nested] 34+ messages in thread

* RE: [PATCH 4/7] x86/resctrl: Add code to setup monitoring at L3 or NODE scope.
  2023-02-28 17:13   ` James Morse
@ 2023-02-28 17:28     ` Luck, Tony
  0 siblings, 0 replies; 34+ messages in thread
From: Luck, Tony @ 2023-02-28 17:28 UTC (permalink / raw)
  To: James Morse, Yu, Fenghua, Chatre, Reinette, Peter Newman,
	Jonathan Corbet, x86
  Cc: Shaopeng Tan, Jamie Iles, Babu Moger, linux-kernel, linux-doc, patches

>> +	/* When SNC enabled, monitor functions at node instead of L3 cache scope */
>> +	if (snc_ways > 1)
>> +		r = &rdt_resources_all[RDT_RESOURCE_NODE].r_resctrl;
>> +	else
>> +		r = &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl;
>
> Could this be hidden in a helper with some name like resctrl_arch_get_mbm_resource()?
> You have the same pattern again in rdt_get_tree(). If this gets more complex in the
> future, it means its outside the filesystem code, and all in one place.

Sounds like a good idea.  Thanks.
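
Something like this, perhaps (untested sketch, using the name you suggested):

static inline struct rdt_resource *resctrl_arch_get_mbm_resource(void)
{
	if (snc_ways > 1)
		return &rdt_resources_all[RDT_RESOURCE_NODE].r_resctrl;

	return &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl;
}

Both get_rdt_mon_resources() and rdt_get_tree() would then just do
"r = resctrl_arch_get_mbm_resource();".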

-Tony

^ permalink raw reply	[flat|nested] 34+ messages in thread

* RE: [PATCH 5/7] x86/resctrl: Add a new "snc_ways" file to the monitoring info directory.
  2023-02-28 17:13   ` James Morse
@ 2023-02-28 17:44     ` Luck, Tony
  2023-03-03 18:32       ` James Morse
  0 siblings, 1 reply; 34+ messages in thread
From: Luck, Tony @ 2023-02-28 17:44 UTC (permalink / raw)
  To: James Morse, Yu, Fenghua, Chatre, Reinette, Peter Newman,
	Jonathan Corbet, x86
  Cc: Shaopeng Tan, Jamie Iles, Babu Moger, linux-kernel, linux-doc, patches

> > Make it easy for the user to tell if Sub-NUMA Cluster is enabled by
> > providing an info/ file.
>
> I think what this is conveying to user-space is 'domain_id_is_numa_node'.

That seems more architecturally neutral. I like it.

> Does user-space need to know the number of ways?

I don't know. Maybe some might. Perhaps there is some better name that
is architecturally neutral, but still has a numerical rather than boolean value?

> Will this always be a single number, or will it ever be possible to have an SNC=2 and
> SNC=1 package in the same system?

I sincerely hope that it is the same value across the system. Currently the
BIOS setup option to enable SNC doesn't have per-socket choices, it is
just an all-or-nothing choice. "2" isn't the only choice for number of SNC
nodes on a socket. "4" is (or will be) a choice.

-Tony

^ permalink raw reply	[flat|nested] 34+ messages in thread

* RE: [PATCH 0/7] x86/resctrl: Add support for Sub-NUMA cluster (SNC) systems
  2023-02-28 17:12 ` [PATCH 0/7] x86/resctrl: Add support for Sub-NUMA cluster (SNC) systems James Morse
@ 2023-02-28 18:04   ` Luck, Tony
  0 siblings, 0 replies; 34+ messages in thread
From: Luck, Tony @ 2023-02-28 18:04 UTC (permalink / raw)
  To: James Morse, Yu, Fenghua, Chatre, Reinette, Peter Newman,
	Jonathan Corbet, x86
  Cc: Shaopeng Tan, Jamie Iles, Babu Moger, linux-kernel, linux-doc, patches

> > Intel server systems starting with Skylake support a mode that logically
> > partitions each socket. E.g. when partitioned two ways, half the cores,
> > L3 cache, and memory controllers are allocated to each of the partitions.
> > This may reduce average latency to access L3 cache and memory, with the
> > tradeoff that only half the L3 cache is available for subnode-local memory
> > access.
>
> I couldn't find a description of what happens to the CAT bitmaps or counters.

No changes to CAT. The cache is partitioned between sub-numa nodes based
on the index, not by dividing the ways. E.g. an 8-way associative 32MB cache is
still 8-way associative in each sub-node, but with 16MB available to each node.

This means users who want a specific amount of cache will need to allocate
more bits in the cache way mask (because each way is half as big). E.g. in
that 8-way 32MB example, two bits of the mask cover 8MB with SNC disabled
(2 x 4MB ways) but only 4MB with SNC=2 (2 x 2MB ways), so four bits would
be needed to get the same 8MB.

> Presumably the CAT bitmaps are duplicated, so each cluster has its own set, and
> the counters aren't - so software has to co-ordinate the use of RMID across the CPUs?

Nope. Still one set of CAT bit maps per socket.

With "N" RMIDs available on a system with SNC disabled, there will be N/2 available
when there are 2 SNC nodes per socket. Processes use values [0 .. N/2).

> How come cacheinfo isn't modified to report the L3 partitions as separate caches?
> Otherwise user-space would assume the full size of the cache is available on any of those
> CPUs.

The size of the cache is perhaps poorly defined in the SNC enabled case. A well
behaved NUMA application that is only accessing memory from its local node will
see an effective cache half the size. But if a process accesses memory from the
other SNC node on the same socket, then it will get allocations in that SNC nodes
half share of the cache.  Accessing memory across inter-socket links will end up
allocating across the whole cache.

Moral: SNC mode is intended for applications that have very well-behaved NUMA
characteristics.

-Tony

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 7/7] x86/resctrl: Determine if Sub-NUMA Cluster is enabled and initialize.
  2023-01-26 18:41 ` [PATCH 7/7] x86/resctrl: Determine if Sub-NUMA Cluster is enabled and initialize Tony Luck
  2023-02-27 13:30   ` Peter Newman
@ 2023-02-28 19:51   ` Moger, Babu
  2023-03-14 20:23     ` Tony Luck
       [not found]   ` <85d7e70a-b9c8-6551-b1ac-229b51ee18d7@amd.com>
  2 siblings, 1 reply; 34+ messages in thread
From: Moger, Babu @ 2023-02-28 19:51 UTC (permalink / raw)
  To: Tony Luck, Fenghua Yu, Reinette Chatre, Peter Newman,
	Jonathan Corbet, x86
  Cc: Shaopeng Tan, James Morse, Jamie Iles, linux-kernel, linux-doc, patches

Hi Tony,


On 1/26/23 12:41, Tony Luck wrote:
> There isn't a simple hardware enumeration to indicate to software that
> a system is running with Sub-NUMA Cluster enabled.
> 
> Compare the number of NUMA nodes with the number of L3 caches to calculate
> the number of Sub-NUMA nodes per L3 cache.
> 
> When Sub-NUMA cluster mode is enabled in BIOS setup the RMID counters
> are distributed equally between the SNC nodes within each socket.
> 
> E.g. if there are 400 RMID counters, and the system is configured with
> two SNC nodes per socket, then RMID counter 0..199 are used on SNC node
> 0 on the socket, and RMID counter 200..399 on SNC node 1.
> 
> Handle this by initializing a per-cpu RMID offset value. Use this
> to calculate the value to write to the RMID field of the IA32_PQR_ASSOC
> MSR during context switch, and also to the IA32_QM_EVTSEL MSR when
> reading RMID event values.
> 
> N.B. this works well for well-behaved NUMA applications that access
> memory predominantly from the local memory node. For applications that
> access memory across multiple nodes it may be necessary for the user
> to read counters for all SNC nodes on a socket and add the values to
> get the actual LLC occupancy or memory bandwidth. Perhaps this isn't
> all that different from applications that span across multiple sockets
> in a legacy system.
> 
> Signed-off-by: Tony Luck <tony.luck@intel.com>
> ---
>  arch/x86/include/asm/resctrl.h        |  4 ++-
>  arch/x86/kernel/cpu/resctrl/core.c    | 43 +++++++++++++++++++++++++--
>  arch/x86/kernel/cpu/resctrl/monitor.c |  2 +-
>  3 files changed, 44 insertions(+), 5 deletions(-)
> 
> diff --git a/arch/x86/include/asm/resctrl.h b/arch/x86/include/asm/resctrl.h
> index 52788f79786f..59b8afd8c53c 100644
> --- a/arch/x86/include/asm/resctrl.h
> +++ b/arch/x86/include/asm/resctrl.h
> @@ -35,6 +35,8 @@ DECLARE_STATIC_KEY_FALSE(rdt_enable_key);
>  DECLARE_STATIC_KEY_FALSE(rdt_alloc_enable_key);
>  DECLARE_STATIC_KEY_FALSE(rdt_mon_enable_key);
>  
> +DECLARE_PER_CPU(int, rmid_offset);
> +
>  /*
>   * __resctrl_sched_in() - Writes the task's CLOSid/RMID to IA32_PQR_MSR
>   *
> @@ -69,7 +71,7 @@ static void __resctrl_sched_in(void)
>  	if (static_branch_likely(&rdt_mon_enable_key)) {
>  		tmp = READ_ONCE(current->rmid);
>  		if (tmp)
> -			rmid = tmp;
> +			rmid = tmp + this_cpu_read(rmid_offset);
>  	}
>  
>  	if (closid != state->cur_closid || rmid != state->cur_rmid) {
> diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
> index 53b2ab37af2f..0ff739375e3b 100644
> --- a/arch/x86/kernel/cpu/resctrl/core.c
> +++ b/arch/x86/kernel/cpu/resctrl/core.c
> @@ -16,6 +16,7 @@
>  
>  #define pr_fmt(fmt)	"resctrl: " fmt
>  
> +#include <linux/cpu.h>
>  #include <linux/slab.h>
>  #include <linux/err.h>
>  #include <linux/cacheinfo.h>
> @@ -484,6 +485,13 @@ static int get_domain_id(int cpu, enum resctrl_scope scope)
>  	return get_cpu_cacheinfo_id(cpu, scope);
>  }
>  
> +DEFINE_PER_CPU(int, rmid_offset);
> +
> +static void set_per_cpu_rmid_offset(int cpu, struct rdt_resource *r)
> +{
> +	this_cpu_write(rmid_offset, (cpu_to_node(cpu) % snc_ways) * r->num_rmid);
> +}
> +
>  /*
>   * domain_add_cpu - Add a cpu to a resource's domain list.
>   *
> @@ -515,6 +523,8 @@ static void domain_add_cpu(int cpu, struct rdt_resource *r)
>  		cpumask_set_cpu(cpu, &d->cpu_mask);
>  		if (r->cache.arch_has_per_cpu_cfg)
>  			rdt_domain_reconfigure_cdp(r);
> +		if (r->mon_capable)
> +			set_per_cpu_rmid_offset(cpu, r);
>  		return;
>  	}
>  
> @@ -533,9 +543,12 @@ static void domain_add_cpu(int cpu, struct rdt_resource *r)
>  		return;
>  	}
>  
> -	if (r->mon_capable && arch_domain_mbm_alloc(r->num_rmid, hw_dom)) {
> -		domain_free(hw_dom);
> -		return;
> +	if (r->mon_capable) {
> +		if (arch_domain_mbm_alloc(r->num_rmid, hw_dom)) {
> +			domain_free(hw_dom);
> +			return;
> +		}
> +		set_per_cpu_rmid_offset(cpu, r);
>  	}
>  
>  	list_add_tail(&d->list, add_pos);
> @@ -845,11 +858,35 @@ static __init bool get_rdt_resources(void)
>  	return (rdt_mon_capable || rdt_alloc_capable);
>  }
>  
> +static __init int find_snc_ways(void)
> +{
> +	unsigned long *node_caches;
> +	int cpu, node, ret;
> +
> +	node_caches = kcalloc(BITS_TO_LONGS(nr_node_ids), sizeof(*node_caches), GFP_KERNEL);
> +	if (!node_caches)
> +		return 1;
> +
> +	cpus_read_lock();
> +	for_each_node(node) {
> +		cpu = cpumask_first(cpumask_of_node(node));
> +		set_bit(get_cpu_cacheinfo_id(cpu, 3), node_caches);
> +	}
> +	cpus_read_unlock();
> +
> +	ret = nr_node_ids / bitmap_weight(node_caches, nr_node_ids);
> +	kfree(node_caches);
> +
> +	return ret;
> +}
> +
>  static __init void rdt_init_res_defs_intel(void)
>  {
>  	struct rdt_hw_resource *hw_res;
>  	struct rdt_resource *r;
>  
> +	snc_ways = find_snc_ways();
> +
>  	for_each_rdt_resource(r) {
>  		hw_res = resctrl_to_arch_res(r);
>  
> diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c
> index 3fc63aa68130..bd5ec348d925 100644
> --- a/arch/x86/kernel/cpu/resctrl/monitor.c
> +++ b/arch/x86/kernel/cpu/resctrl/monitor.c
> @@ -160,7 +160,7 @@ static int __rmid_read(u32 rmid, enum resctrl_event_id eventid, u64 *val)
>  	 * IA32_QM_CTR.Error (bit 63) and IA32_QM_CTR.Unavailable (bit 62)
>  	 * are error bits.
>  	 */
> -	wrmsr(MSR_IA32_QM_EVTSEL, eventid, rmid);
> +	wrmsr(MSR_IA32_QM_EVTSEL, eventid, rmid + this_cpu_read(rmid_offset));

I am thinking out loud here.
When a new monitor group is created, a new RMID is assigned. This is done by
alloc_rmid. It does not know about the rmid_offset details. This will
allocate one of the free RMIDs.

When CPUs are assigned to the group, then the per-cpu pqr_state is updated.
At that point, this RMID becomes the default_rmid for that cpu.

But CPUs can be assigned from two different Sub-NUMA nodes.

Considering same example you mentioned.

E.g. in 2-way Sub-NUMA cluster with 200 RMID counters there are only
100 available counters to the resctrl code. When running on the first
SNC node RMID values 0..99 are used as before. But when running on the
second node, a task that is assigned resctrl rmid=10 must load 10+100
into IA32_PQR_ASSOC to use RMID counter 110.

#mount -t resctrl resctrl /sys/fs/resctrl/
#cd /sys/fs/resctrl/
#mkdir test  (Lets say RMID 1 is allocated)
#cd test
#echo 1 > cpus_list
#echo 101 > cpus_list

In this case, the following code may run on two different RMIDs even
though it was intended to run on same RMID.

wrmsr(MSR_IA32_QM_EVTSEL, eventid, rmid + this_cpu_read(rmid_offset));

Have you thought of this problem?

Thanks
Babu

^ permalink raw reply	[flat|nested] 34+ messages in thread

* RE: [PATCH 7/7] x86/resctrl: Determine if Sub-NUMA Cluster is enabled and initialize.
       [not found]   ` <85d7e70a-b9c8-6551-b1ac-229b51ee18d7@amd.com>
@ 2023-02-28 20:39     ` Luck, Tony
  2023-02-28 22:31       ` Moger, Babu
  0 siblings, 1 reply; 34+ messages in thread
From: Luck, Tony @ 2023-02-28 20:39 UTC (permalink / raw)
  To: babu.moger, Yu, Fenghua, Chatre, Reinette, Peter Newman,
	Jonathan Corbet, x86
  Cc: Shaopeng Tan, James Morse, Jamie Iles, linux-kernel, linux-doc, patches

Babu wrote:
> I am thinking out loud here. Have you thought of addressing this problem?
> When a new monitor group is created, a new RMID is assigned. This is done by alloc_rmid. It does not know about the rmid_offset details. This will allocate one of the free RMIDs.
> When CPUs are assigned to the group, then the per-cpu pqr_state is updated. At that point, this RMID becomes the default_rmid for that cpu.

Good point. This is a gap. I haven't handled assigning CPUs to resctrl groups when SNC is enabled.

I'm not sure this has a solution :-(

-Tony


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 7/7] x86/resctrl: Determine if Sub-NUMA Cluster is enabled and initialize.
  2023-02-28 20:39     ` Luck, Tony
@ 2023-02-28 22:31       ` Moger, Babu
  0 siblings, 0 replies; 34+ messages in thread
From: Moger, Babu @ 2023-02-28 22:31 UTC (permalink / raw)
  To: Luck, Tony, babu.moger, Yu, Fenghua, Chatre, Reinette,
	Peter Newman, Jonathan Corbet, x86
  Cc: Shaopeng Tan, James Morse, Jamie Iles, linux-kernel, linux-doc, patches


On 2/28/2023 2:39 PM, Luck, Tony wrote:
> Babu wrote:
>> I am thinking out loud here. Have you thought of addressing this problem?
>> When a new monitor group is created, a new RMID is assigned. This is done by alloc_rmid. It does not know about the rmid_offset details. This will allocate one of the free RMIDs.
>> When CPUs are assigned to the group, then the per-cpu pqr_state is updated. At that point, this RMID becomes the default_rmid for that cpu.
> Good point. This is a gap. I haven't handled assigning CPUs to resctrl groups when SNC is enabled.

You may need to document it.

Thanks

Babu


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 5/7] x86/resctrl: Add a new "snc_ways" file to the monitoring info directory.
  2023-02-28 17:44     ` Luck, Tony
@ 2023-03-03 18:32       ` James Morse
  0 siblings, 0 replies; 34+ messages in thread
From: James Morse @ 2023-03-03 18:32 UTC (permalink / raw)
  To: Luck, Tony, Yu, Fenghua, Chatre, Reinette, Peter Newman,
	Jonathan Corbet, x86
  Cc: Shaopeng Tan, Jamie Iles, Babu Moger, linux-kernel, linux-doc, patches

Hi Tony,

On 28/02/2023 17:44, Luck, Tony wrote:
>>> Make it easy for the user to tell if Sub-NUMA Cluster is enabled by
>>> providing an info/ file.
>>
>> I think what this is conveying to user-space is 'domain_id_is_numa_node'.
> 
> That seems more architecturally neutral. I like it.
> 
>> Does user-space need to know the number of ways?
> 
> I don't know. Maybe some might. Perhaps there is some better name that
> is architecturally neutral, but still has a numerical rather than boolean value?

If we don't know what user-space needs this for, I'd prefer we don't expose it. It'll be a
problem in the future if there is no single number we can put there.

I don't see what the difference between 2 and 4 would be used for when setting up resctrl,
unless you are choosing which numa-nodes to spread tasks over ... but the numa distance
would be a much better number to base that decision on.

User-space is able to perform the same calculation to find the snc_ways using the
cache/index stuff and node entries in sysfs.
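
(E.g., assuming the cacheinfo "id" files are present in sysfs, something like:

	$ nodes=$(ls -d /sys/devices/system/node/node* | wc -l)
	$ caches=$(cat /sys/devices/system/cpu/cpu*/cache/index3/id | sort -un | wc -l)
	$ echo $((nodes / caches))

would give the same answer as the proposed snc_ways file.)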


>> Will this always be a single number, or will it ever be possible to have an SNC=2 and
>> SNC=1 package in the same system?
> 
> I sincerely hope that it is the same value across the system. Currently the
> BIOS setup option to enable SNC doesn't have per-socket choices, it is
> just an all-or-nothing choice. "2" isn't the only choice for number of SNC
> nodes on a socket. "4" is (or will be) a choice.

Yeah, in the arm world, partners get to make the decision on what is sane. Big-little
means someone could do something that looks like SNC on one cluster, but not another.

If we don't know what user-space needs it for, I'd prefer we don't expose it, just to
avoid giving out rope to shoot ourselves in the foot with.


Thanks,

James

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 7/7] x86/resctrl: Determine if Sub-NUMA Cluster is enabled and initialize.
  2023-02-27 13:30   ` Peter Newman
@ 2023-03-10 17:30     ` Tony Luck
  2023-03-13  9:19       ` Peter Newman
  0 siblings, 1 reply; 34+ messages in thread
From: Tony Luck @ 2023-03-10 17:30 UTC (permalink / raw)
  To: Peter Newman
  Cc: Fenghua Yu, Reinette Chatre, Jonathan Corbet, x86, Shaopeng Tan,
	James Morse, Jamie Iles, Babu Moger, linux-kernel, linux-doc,
	patches, tony.luck

On Mon, Feb 27, 2023 at 02:30:38PM +0100, Peter Newman wrote:
> Hi Tony,
> 
> On Thu, Jan 26, 2023 at 7:42 PM Tony Luck <tony.luck@intel.com> wrote:
> > +static __init int find_snc_ways(void)
> > +{
> > +       unsigned long *node_caches;
> > +       int cpu, node, ret;
> > +
> > +       node_caches = kcalloc(BITS_TO_LONGS(nr_node_ids), sizeof(*node_caches), GFP_KERNEL);
> > +       if (!node_caches)
> > +               return 1;
> > +
> > +       cpus_read_lock();
> > +       for_each_node(node) {
> 
> Someone tried this patch on a machine with a CPU-less node...
> 
> We need to check for this:
> 
> +               if (cpumask_empty(cpumask_of_node(node)))
> +                       continue;
> 
> > +               cpu = cpumask_first(cpumask_of_node(node));
> > +               set_bit(get_cpu_cacheinfo_id(cpu, 3), node_caches);
> > +       }
> > +       cpus_read_unlock();

Peter,

Tell me more about your CPU-less nodes.  Your fix avoids a bad
pointer reference (because cpumask_first() returns cpu >= nr_cpu_ids
for an empty bitmask).

But now I'm worried about whether I have the right values in the
formula:

	nr_node_ids / bitmap_weight(node_caches, nr_node_ids);

This fix avoids counting the L3 from a non-existent CPU, but still
counts the node in the numerator.

Is your CPU-less node a full (non-SNC) node? Like this:

          Socket 0                 Socket 1
    +--------------------+  +--------------------+
    |          .         |  |          .         |
    |  SNC 0.0 . SNC 0.1 |  |  zero    . zero    |
    |          .         |  |  CPUs    . CPUs    |
    |          .         |  |          .         |
    |          .         |  |          .         |
    +--------------------+  +--------------------+
    |      L3 Cache      |  |      L3 Cache      |
    +--------------------+  +--------------------+

I could fix this case by counting how many CPU-less
nodes I find, and reducing the numerator (the denominator
didn't count the L3 cache from socket 1 because there
are no CPUs there)

	(nr_node_ids - n_empty_nodes) / bitmap_weight(node_caches, nr_node_ids);

	=> 2 / 1

But that won't work if your CPU-less node is an SNC node
and the other SNC node in the same socket does have some
CPUs:

          Socket 0                 Socket 1
    +--------------------+  +--------------------+
    |          .         |  |          .         |
    |  SNC 0.0 . SNC 0.1 |  |  zero    . SNC 1.1 |
    |          .         |  |  CPUs    .         |
    |          .         |  |          .         |
    |          .         |  |          .         |
    +--------------------+  +--------------------+
    |      L3 Cache      |  |      L3 Cache      |
    +--------------------+  +--------------------+

This would get 3 / 2 ... i.e. I should still count the
empty node because its cache was counted by its SNC
buddy.

-Tony

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 7/7] x86/resctrl: Determine if Sub-NUMA Cluster is enabled and initialize.
  2023-03-10 17:30     ` Tony Luck
@ 2023-03-13  9:19       ` Peter Newman
  2023-03-13 16:38         ` Luck, Tony
  0 siblings, 1 reply; 34+ messages in thread
From: Peter Newman @ 2023-03-13  9:19 UTC (permalink / raw)
  To: Tony Luck
  Cc: Fenghua Yu, Reinette Chatre, Jonathan Corbet, x86, Shaopeng Tan,
	James Morse, Jamie Iles, Babu Moger, linux-kernel, linux-doc,
	patches

Hi Tony,

On Fri, Mar 10, 2023 at 6:30 PM Tony Luck <tony.luck@intel.com> wrote:
> Tell me more about your CPU-less nodes. Your fix avoids a bad
> pointer reference (because cpumask_first() returns cpu >= nr_cpu_ids
> for an empty bitmask).
>
> But now I'm worried about whether I have the right values in the
> formula:
>
> nr_node_ids / bitmap_weight(node_caches, nr_node_ids);
>
> This fix avoids counting the L3 from a non-existent CPU, but still
> counts the node in the numerator.
>
> Is your CPU-less node a full (non-SNC) node? Like this:
>
>           Socket 0                 Socket 1
>     +--------------------+  +--------------------+
>     |          .         |  |          .         |
>     |  SNC 0.0 . SNC 0.1 |  |  zero    . zero    |
>     |          .         |  |  CPUs    . CPUs    |
>     |          .         |  |          .         |
>     |          .         |  |          .         |
>     +--------------------+  +--------------------+
>     |      L3 Cache      |  |      L3 Cache      |
>     +--------------------+  +--------------------+

In the case I saw, the nodes were AEP DIMMs, so all-memory nodes.

Browsing sysfs, they are listed in has_memory, but not
has_normal_memory or has_cpu.

I imagine CXL.mem would have similar issues.

-Peter

^ permalink raw reply	[flat|nested] 34+ messages in thread

* RE: [PATCH 7/7] x86/resctrl: Determine if Sub-NUMA Cluster is enabled and initialize.
  2023-03-13  9:19       ` Peter Newman
@ 2023-03-13 16:38         ` Luck, Tony
  0 siblings, 0 replies; 34+ messages in thread
From: Luck, Tony @ 2023-03-13 16:38 UTC (permalink / raw)
  To: Peter Newman
  Cc: Yu, Fenghua, Chatre, Reinette, Jonathan Corbet, x86,
	Shaopeng Tan, James Morse, Jamie Iles, Babu Moger, linux-kernel,
	linux-doc, patches

> In the case I saw, the nodes were AEP DIMMs, so all-memory nodes.

Peter,

Thanks. This helps a lot.

Ok. I will add code to count the number of memory-only nodes and subtract
that from the numerator of "nodes / L3-caches".
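
Roughly (untested sketch, with a new mem_only_nodes counter initialized to zero):

	cpus_read_lock();
	for_each_node(node) {
		if (cpumask_empty(cpumask_of_node(node))) {
			mem_only_nodes++;
			continue;
		}
		cpu = cpumask_first(cpumask_of_node(node));
		set_bit(get_cpu_cacheinfo_id(cpu, 3), node_caches);
	}
	cpus_read_unlock();

	ret = (nr_node_ids - mem_only_nodes) / bitmap_weight(node_caches, nr_node_ids);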

I'll ignore the weird case of a memory-only SNC node when other SNC
nodes on the same socket do have CPUs until such time as someone
convinces me that there is a real-world reason to enable SNC and then
disable the CPUs in one node. It would seem much better to keep SNC
turned off so that the remaining CPUs on the socket get access to all
of the L3.

-Tony

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 7/7] x86/resctrl: Determine if Sub-NUMA Cluster is enabled and initialize.
  2023-02-28 19:51   ` Moger, Babu
@ 2023-03-14 20:23     ` Tony Luck
  0 siblings, 0 replies; 34+ messages in thread
From: Tony Luck @ 2023-03-14 20:23 UTC (permalink / raw)
  To: Moger, Babu
  Cc: Fenghua Yu, Reinette Chatre, Peter Newman, Jonathan Corbet, x86,
	Shaopeng Tan, James Morse, Jamie Iles, linux-kernel, linux-doc,
	patches

On Tue, Feb 28, 2023 at 01:51:32PM -0600, Moger, Babu wrote:
> I am thinking out loud here.
> When a new monitor group is created, a new RMID is assigned. This is done by
> alloc_rmid. It does not know about the rmid_offset details. This will
> allocate one of the free RMIDs.
> 
> When CPUs are assigned to the group, then the per-cpu pqr_state is updated.
> At that point, this RMID becomes the default_rmid for that cpu.
> 
> But CPUs can be assigned from two different Sub-NUMA nodes.
> 
> Considering same example you mentioned.
> 
> E.g. in 2-way Sub-NUMA cluster with 200 RMID counters there are only
> 100 available counters to the resctrl code. When running on the first
> SNC node RMID values 0..99 are used as before. But when running on the
> second node, a task that is assigned resctrl rmid=10 must load 10+100
> into IA32_PQR_ASSOC to use RMID counter 110.
> 
> #mount -t resctrl resctrl /sys/fs/resctrl/
> #cd /sys/fs/resctrl/
> #mkdir test  (Lets say RMID 1 is allocated)
> #cd test
> #echo 1 > cpus_list
> #echo 101 > cpus_list
> 
> In this case, the following code may run on two different RMIDs even
> though it was intended to run on same RMID.
> 
> wrmsr(MSR_IA32_QM_EVTSEL, eventid, rmid + this_cpu_read(rmid_offset));
> 
> Have you thought of this problem?

Now I've thought about this. I don't think it is a problem.

With SNC enabled for two nodes per socket the available RMIDs
are divided between the SNC nodes. For some purposes they are
numbered [0 .. N/2), but in other cases they must be viewed as
two separate sets: [0 .. N/2) on the first node and [N/2 .. N)
on the second.

In your example RMID 1 is assigned to the group and you have
one CPU from each node in the group. Processes on CPU1 will
load IA32_PQR_ASSOC.RMID = 1, while processes on CPU101 will
set IA32_PQR_ASSOC.RMID = 101. So counts of memory bandwidth
and cache occupancy will be in two different physical RMID
counters.

To read these back the user needs to lookup which $node each CPU
belongs to and then read from the appropriate
mon_data/mon_L3_$node/{llc_occupancy,mbm_local_bytes,mbm_total_bytes}
file.

$ cat mon_data/mon_L3_00/llc_occupancy # reads RMID=1
$ cat mon_data/mon_L3_01/llc_occupancy # reads RMID=101

-Tony
 

^ permalink raw reply	[flat|nested] 34+ messages in thread

end of thread, other threads:[~2023-03-14 20:23 UTC | newest]

Thread overview: 34+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-01-26 18:41 [PATCH 0/7] x86/resctrl: Add support for Sub-NUMA cluster (SNC) systems Tony Luck
2023-01-26 18:41 ` [PATCH 1/7] x86/resctrl: Refactor in preparation for node-scoped resources Tony Luck
2023-01-26 18:41 ` [PATCH 2/7] x86/resctrl: Remove hard code of RDT_RESOURCE_L3 in monitor.c Tony Luck
2023-01-27  4:51   ` Yu, Fenghua
2023-01-26 18:41 ` [PATCH 3/7] x86/resctrl: Add a new node-scoped resource to rdt_resources_all[] Tony Luck
2023-01-27  5:24   ` Yu, Fenghua
2023-01-27 16:02     ` Peter Newman
2023-01-27 18:23     ` Luck, Tony
2023-01-28  2:25       ` Yu, Fenghua
2023-01-28  2:22   ` Yu, Fenghua
2023-01-28  2:36   ` Yu, Fenghua
2023-01-30 19:04     ` Luck, Tony
2023-02-28 14:27   ` Moger, Babu
2023-02-28 17:05     ` Luck, Tony
2023-01-26 18:41 ` [PATCH 4/7] x86/resctrl: Add code to setup monitoring at L3 or NODE scope Tony Luck
2023-02-28 17:13   ` James Morse
2023-02-28 17:28     ` Luck, Tony
2023-01-26 18:41 ` [PATCH 5/7] x86/resctrl: Add a new "snc_ways" file to the monitoring info directory Tony Luck
2023-02-28 17:13   ` James Morse
2023-02-28 17:44     ` Luck, Tony
2023-03-03 18:32       ` James Morse
2023-01-26 18:41 ` [PATCH 6/7] x86/resctrl: Update documentation with Sub-NUMA cluster changes Tony Luck
2023-02-28 17:14   ` James Morse
2023-01-26 18:41 ` [PATCH 7/7] x86/resctrl: Determine if Sub-NUMA Cluster is enabled and initialize Tony Luck
2023-02-27 13:30   ` Peter Newman
2023-03-10 17:30     ` Tony Luck
2023-03-13  9:19       ` Peter Newman
2023-03-13 16:38         ` Luck, Tony
2023-02-28 19:51   ` Moger, Babu
2023-03-14 20:23     ` Tony Luck
     [not found]   ` <85d7e70a-b9c8-6551-b1ac-229b51ee18d7@amd.com>
2023-02-28 20:39     ` Luck, Tony
2023-02-28 22:31       ` Moger, Babu
2023-02-28 17:12 ` [PATCH 0/7] x86/resctrl: Add support for Sub-NUMA cluster (SNC) systems James Morse
2023-02-28 18:04   ` Luck, Tony
