All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH v2 00/18] x86/resctrl: monitored closid+rmid together, separate arch/fs locking
@ 2023-01-13 17:54 James Morse
  2023-01-13 17:54 ` [PATCH v2 01/18] x86/resctrl: Track the closid with the rmid James Morse
                   ` (18 more replies)
  0 siblings, 19 replies; 73+ messages in thread
From: James Morse @ 2023-01-13 17:54 UTC (permalink / raw)
  To: x86, linux-kernel
  Cc: Fenghua Yu, Reinette Chatre, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, H Peter Anvin, Babu Moger, James Morse,
	shameerali.kolothum.thodi, D Scott Phillips OS, carl, lcherian,
	bobo.shaobowang, tan.shaopeng, xingxin.hx, baolin.wang,
	Jamie Iles, Xin Hao, peternewman

Hello!

(Changes since v1 are noted in each patch)

This series does two things, it changes resctrl to call resctrl_arch_rmid_read()
in a way that works for MPAM, and it separates the locking so that the arch code
and filesystem code don't have to share a mutex. I tried to split this as two
series, but these touch similar call sites, so it would create more work.

(What's MPAM? See the cover letter of the first series. [1])

On x86 the RMID is an independent number. MPAMs equivalent is PMG, but this
isn't an independent number - it extends the PARTID (same as CLOSID) space
with bits that aren't used to select the configuration. The monitors can
then be told to match specific PMG values, allowing monitor-groups to be
created.

But, MPAM expects the monitors to always monitor by PARTID. The
Cache-storage-utilisation counters can only work this way.
(In the MPAM spec not setting the MATCH_PARTID bit is made CONSTRAINED
UNPREDICTABLE - which is Arm's term to mean portable software can't rely on
this)

It gets worse, as some SoCs may have very few PMG bits. I've seen the
datasheet for one that has a single bit of PMG space.

To be usable, MPAM's counters always need the PARTID and the PMG.
For resctrl, this means always making the CLOSID available when the RMID
is used.

To ensure RMID are always unique, this series combines the CLOSID and RMID
into an index, and manages RMID based on that. For x86, the index and RMID
would always be the same.


Currently the architecture specific code in the cpuhp callbacks takes the
rdtgroup_mutex. This means the filesystem code would have to export this
lock, resulting in an ill-defined interface between the two, and the possibility
of cross-architecture lock-ordering head aches.

The second part of this series adds a domain_list_lock to protect writes to the
domain list, and protects the domain list with RCU - or read_cpus_lock().

Use of RCU is to allow lockless readers of the domain list, today resctrl only has
one, rdt_bit_usage_show(). But to get MPAMs monitors working, its very likely
they'll need to be plumbed up to perf. The uncore PMU driver would be a second
lockless reader of the domain list.

This series is based on v6.2-rc3, and can be retrieved from:
https://git.kernel.org/pub/scm/linux/kernel/git/morse/linux.git mpam/monitors_and_locking/v2

Bugs welcome,


Thanks,

James


[1] https://lore.kernel.org/lkml/20210728170637.25610-1-james.morse@arm.com/
[v1] https://lore.kernel.org/all/20221021131204.5581-1-james.morse@arm.com/

James Morse (18):
  x86/resctrl: Track the closid with the rmid
  x86/resctrl: Access per-rmid structures by index
  x86/resctrl: Create helper for RMID allocation and mondata dir
    creation
  x86/resctrl: Move rmid allocation out of mkdir_rdt_prepare()
  x86/resctrl: Allow RMID allocation to be scoped by CLOSID
  x86/resctrl: Allow the allocator to check if a CLOSID can allocate
    clean RMID
  x86/resctrl: Move CLOSID/RMID matching and setting to use helpers
  x86/resctrl: Queue mon_event_read() instead of sending an IPI
  x86/resctrl: Allow resctrl_arch_rmid_read() to sleep
  x86/resctrl: Allow arch to allocate memory needed in
    resctrl_arch_rmid_read()
  x86/resctrl: Make resctrl_mounted checks explicit
  x86/resctrl: Move alloc/mon static keys into helpers
  x86/resctrl: Make rdt_enable_key the arch's decision to switch
  x86/resctrl: Add helpers for system wide mon/alloc capable
  x86/resctrl: Add cpu online callback for resctrl work
  x86/resctrl: Allow overflow/limbo handlers to be scheduled on any-but
    cpu
  x86/resctrl: Add cpu offline callback for resctrl work
  x86/resctrl: Separate arch and fs resctrl locks

 arch/x86/include/asm/resctrl.h            |  90 ++++++
 arch/x86/kernel/cpu/resctrl/core.c        |  71 ++---
 arch/x86/kernel/cpu/resctrl/ctrlmondata.c |  19 +-
 arch/x86/kernel/cpu/resctrl/internal.h    |  24 +-
 arch/x86/kernel/cpu/resctrl/monitor.c     | 342 ++++++++++++++++------
 arch/x86/kernel/cpu/resctrl/pseudo_lock.c |  15 +-
 arch/x86/kernel/cpu/resctrl/rdtgroup.c    | 297 +++++++++++++------
 include/linux/resctrl.h                   |  15 +-
 8 files changed, 637 insertions(+), 236 deletions(-)

-- 
2.30.2


^ permalink raw reply	[flat|nested] 73+ messages in thread

* [PATCH v2 01/18] x86/resctrl: Track the closid with the rmid
  2023-01-13 17:54 [PATCH v2 00/18] x86/resctrl: monitored closid+rmid together, separate arch/fs locking James Morse
@ 2023-01-13 17:54 ` James Morse
  2023-01-13 17:54 ` [PATCH v2 02/18] x86/resctrl: Access per-rmid structures by index James Morse
                   ` (17 subsequent siblings)
  18 siblings, 0 replies; 73+ messages in thread
From: James Morse @ 2023-01-13 17:54 UTC (permalink / raw)
  To: x86, linux-kernel
  Cc: Fenghua Yu, Reinette Chatre, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, H Peter Anvin, Babu Moger, James Morse,
	shameerali.kolothum.thodi, D Scott Phillips OS, carl, lcherian,
	bobo.shaobowang, tan.shaopeng, xingxin.hx, baolin.wang,
	Jamie Iles, Xin Hao, peternewman

x86's RMID are independent of the CLOSID. An RMID can be allocated,
used and freed without considering the CLOSID.

MPAM's equivalent feature is PMG, which is not an independent number,
it extends the CLOSID/PARTID space. For MPAM, only PMG-bits worth of
'RMID' can be allocated for a single CLOSID.
i.e. if there is 1 bit of PMG space, then each CLOSID can have two
monitor groups.

To allow resctrl to disambiguate RMID values for different CLOSID,
everything in resctrl that keeps an RMID value needs to know the CLOSID
too. This will always be ignored on x86.

Tested-by: Shaopeng Tan <tan.shaopeng@fujitsu.com>
Reviewed-by: Xin Hao <xhao@linux.alibaba.com>
Signed-off-by: James Morse <james.morse@arm.com>

---
Is there a better term for 'the unique identifier for a monitor group'.
Using RMID for that here may be confusing...

Changes since v1:
 * Added comment in struct rmid_entry
---
 arch/x86/kernel/cpu/resctrl/internal.h    |  2 +-
 arch/x86/kernel/cpu/resctrl/monitor.c     | 59 ++++++++++++++---------
 arch/x86/kernel/cpu/resctrl/pseudo_lock.c |  4 +-
 arch/x86/kernel/cpu/resctrl/rdtgroup.c    | 12 ++---
 include/linux/resctrl.h                   | 11 ++++-
 5 files changed, 54 insertions(+), 34 deletions(-)

diff --git a/arch/x86/kernel/cpu/resctrl/internal.h b/arch/x86/kernel/cpu/resctrl/internal.h
index 5ebd28e6aa0c..5dbff3035723 100644
--- a/arch/x86/kernel/cpu/resctrl/internal.h
+++ b/arch/x86/kernel/cpu/resctrl/internal.h
@@ -509,7 +509,7 @@ struct rdt_domain *get_domain_from_cpu(int cpu, struct rdt_resource *r);
 int closids_supported(void);
 void closid_free(int closid);
 int alloc_rmid(void);
-void free_rmid(u32 rmid);
+void free_rmid(u32 closid, u32 rmid);
 int rdt_get_mon_l3_config(struct rdt_resource *r);
 void mon_event_count(void *info);
 int rdtgroup_mondata_show(struct seq_file *m, void *arg);
diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c
index efe0c30d3a12..13673cab175a 100644
--- a/arch/x86/kernel/cpu/resctrl/monitor.c
+++ b/arch/x86/kernel/cpu/resctrl/monitor.c
@@ -25,6 +25,12 @@
 #include "internal.h"
 
 struct rmid_entry {
+	/*
+	 * Some architectures's resctrl_arch_rmid_read() needs the CLOSID value
+	 * in order to access the correct monitor. This field provides the
+	 * value to list walkers like __check_limbo(). On x86 this is ignored.
+	 */
+	u32				closid;
 	u32				rmid;
 	int				busy;
 	struct list_head		list;
@@ -136,7 +142,7 @@ static inline u64 get_corrected_mbm_count(u32 rmid, unsigned long val)
 	return val;
 }
 
-static inline struct rmid_entry *__rmid_entry(u32 rmid)
+static inline struct rmid_entry *__rmid_entry(u32 closid, u32 rmid)
 {
 	struct rmid_entry *entry;
 
@@ -166,7 +172,8 @@ static struct arch_mbm_state *get_arch_mbm_state(struct rdt_hw_domain *hw_dom,
 }
 
 void resctrl_arch_reset_rmid(struct rdt_resource *r, struct rdt_domain *d,
-			     u32 rmid, enum resctrl_event_id eventid)
+			     u32 closid, u32 rmid,
+			     enum resctrl_event_id eventid)
 {
 	struct rdt_hw_domain *hw_dom = resctrl_to_arch_dom(d);
 	struct arch_mbm_state *am;
@@ -185,7 +192,8 @@ static u64 mbm_overflow_count(u64 prev_msr, u64 cur_msr, unsigned int width)
 }
 
 int resctrl_arch_rmid_read(struct rdt_resource *r, struct rdt_domain *d,
-			   u32 rmid, enum resctrl_event_id eventid, u64 *val)
+			   u32 closid, u32 rmid, enum resctrl_event_id eventid,
+			   u64 *val)
 {
 	struct rdt_hw_resource *hw_res = resctrl_to_arch_res(r);
 	struct rdt_hw_domain *hw_dom = resctrl_to_arch_dom(d);
@@ -251,9 +259,9 @@ void __check_limbo(struct rdt_domain *d, bool force_free)
 		if (nrmid >= r->num_rmid)
 			break;
 
-		entry = __rmid_entry(nrmid);
+		entry = __rmid_entry(~0, nrmid);	// temporary
 
-		if (resctrl_arch_rmid_read(r, d, entry->rmid,
+		if (resctrl_arch_rmid_read(r, d, entry->closid, entry->rmid,
 					   QOS_L3_OCCUP_EVENT_ID, &val)) {
 			rmid_dirty = true;
 		} else {
@@ -308,7 +316,8 @@ static void add_rmid_to_limbo(struct rmid_entry *entry)
 	cpu = get_cpu();
 	list_for_each_entry(d, &r->domains, list) {
 		if (cpumask_test_cpu(cpu, &d->cpu_mask)) {
-			err = resctrl_arch_rmid_read(r, d, entry->rmid,
+			err = resctrl_arch_rmid_read(r, d, entry->closid,
+						     entry->rmid,
 						     QOS_L3_OCCUP_EVENT_ID,
 						     &val);
 			if (err || val <= resctrl_rmid_realloc_threshold)
@@ -332,7 +341,7 @@ static void add_rmid_to_limbo(struct rmid_entry *entry)
 		list_add_tail(&entry->list, &rmid_free_lru);
 }
 
-void free_rmid(u32 rmid)
+void free_rmid(u32 closid, u32 rmid)
 {
 	struct rmid_entry *entry;
 
@@ -341,7 +350,7 @@ void free_rmid(u32 rmid)
 
 	lockdep_assert_held(&rdtgroup_mutex);
 
-	entry = __rmid_entry(rmid);
+	entry = __rmid_entry(closid, rmid);
 
 	if (is_llc_occupancy_enabled())
 		add_rmid_to_limbo(entry);
@@ -349,15 +358,16 @@ void free_rmid(u32 rmid)
 		list_add_tail(&entry->list, &rmid_free_lru);
 }
 
-static int __mon_event_count(u32 rmid, struct rmid_read *rr)
+static int __mon_event_count(u32 closid, u32 rmid, struct rmid_read *rr)
 {
 	struct mbm_state *m;
 	u64 tval = 0;
 
 	if (rr->first)
-		resctrl_arch_reset_rmid(rr->r, rr->d, rmid, rr->evtid);
+		resctrl_arch_reset_rmid(rr->r, rr->d, closid, rmid, rr->evtid);
 
-	rr->err = resctrl_arch_rmid_read(rr->r, rr->d, rmid, rr->evtid, &tval);
+	rr->err = resctrl_arch_rmid_read(rr->r, rr->d, closid, rmid, rr->evtid,
+					 &tval);
 	if (rr->err)
 		return rr->err;
 
@@ -400,7 +410,7 @@ static int __mon_event_count(u32 rmid, struct rmid_read *rr)
  * __mon_event_count() is compared with the chunks value from the previous
  * invocation. This must be called once per second to maintain values in MBps.
  */
-static void mbm_bw_count(u32 rmid, struct rmid_read *rr)
+static void mbm_bw_count(u32 closid, u32 rmid, struct rmid_read *rr)
 {
 	struct mbm_state *m = &rr->d->mbm_local[rmid];
 	u64 cur_bw, bytes, cur_bytes;
@@ -430,7 +440,7 @@ void mon_event_count(void *info)
 
 	rdtgrp = rr->rgrp;
 
-	ret = __mon_event_count(rdtgrp->mon.rmid, rr);
+	ret = __mon_event_count(rdtgrp->closid, rdtgrp->mon.rmid, rr);
 
 	/*
 	 * For Ctrl groups read data from child monitor groups and
@@ -441,7 +451,8 @@ void mon_event_count(void *info)
 
 	if (rdtgrp->type == RDTCTRL_GROUP) {
 		list_for_each_entry(entry, head, mon.crdtgrp_list) {
-			if (__mon_event_count(entry->mon.rmid, rr) == 0)
+			if (__mon_event_count(rdtgrp->closid, entry->mon.rmid,
+					      rr) == 0)
 				ret = 0;
 		}
 	}
@@ -571,7 +582,8 @@ static void update_mba_bw(struct rdtgroup *rgrp, struct rdt_domain *dom_mbm)
 	}
 }
 
-static void mbm_update(struct rdt_resource *r, struct rdt_domain *d, int rmid)
+static void mbm_update(struct rdt_resource *r, struct rdt_domain *d,
+		       u32 closid, u32 rmid)
 {
 	struct rmid_read rr;
 
@@ -586,12 +598,12 @@ static void mbm_update(struct rdt_resource *r, struct rdt_domain *d, int rmid)
 	if (is_mbm_total_enabled()) {
 		rr.evtid = QOS_L3_MBM_TOTAL_EVENT_ID;
 		rr.val = 0;
-		__mon_event_count(rmid, &rr);
+		__mon_event_count(closid, rmid, &rr);
 	}
 	if (is_mbm_local_enabled()) {
 		rr.evtid = QOS_L3_MBM_LOCAL_EVENT_ID;
 		rr.val = 0;
-		__mon_event_count(rmid, &rr);
+		__mon_event_count(closid, rmid, &rr);
 
 		/*
 		 * Call the MBA software controller only for the
@@ -599,7 +611,7 @@ static void mbm_update(struct rdt_resource *r, struct rdt_domain *d, int rmid)
 		 * the software controller explicitly.
 		 */
 		if (is_mba_sc(NULL))
-			mbm_bw_count(rmid, &rr);
+			mbm_bw_count(closid, rmid, &rr);
 	}
 }
 
@@ -656,11 +668,11 @@ void mbm_handle_overflow(struct work_struct *work)
 	d = container_of(work, struct rdt_domain, mbm_over.work);
 
 	list_for_each_entry(prgrp, &rdt_all_groups, rdtgroup_list) {
-		mbm_update(r, d, prgrp->mon.rmid);
+		mbm_update(r, d, prgrp->closid, prgrp->mon.rmid);
 
 		head = &prgrp->mon.crdtgrp_list;
 		list_for_each_entry(crgrp, head, mon.crdtgrp_list)
-			mbm_update(r, d, crgrp->mon.rmid);
+			mbm_update(r, d, crgrp->closid, crgrp->mon.rmid);
 
 		if (is_mba_sc(NULL))
 			update_mba_bw(prgrp, d);
@@ -703,10 +715,11 @@ static int dom_data_init(struct rdt_resource *r)
 	}
 
 	/*
-	 * RMID 0 is special and is always allocated. It's used for all
-	 * tasks that are not monitored.
+	 * RMID 0 is special and is always allocated. It's used for the
+	 * default_rdtgroup control group, which will be setup later. See
+	 * rdtgroup_setup_root().
 	 */
-	entry = __rmid_entry(0);
+	entry = __rmid_entry(0, 0);
 	list_del(&entry->list);
 
 	return 0;
diff --git a/arch/x86/kernel/cpu/resctrl/pseudo_lock.c b/arch/x86/kernel/cpu/resctrl/pseudo_lock.c
index 524f8ff3e69c..c51932516965 100644
--- a/arch/x86/kernel/cpu/resctrl/pseudo_lock.c
+++ b/arch/x86/kernel/cpu/resctrl/pseudo_lock.c
@@ -738,7 +738,7 @@ int rdtgroup_locksetup_enter(struct rdtgroup *rdtgrp)
 	 * anymore when this group would be used for pseudo-locking. This
 	 * is safe to call on platforms not capable of monitoring.
 	 */
-	free_rmid(rdtgrp->mon.rmid);
+	free_rmid(rdtgrp->closid, rdtgrp->mon.rmid);
 
 	ret = 0;
 	goto out;
@@ -773,7 +773,7 @@ int rdtgroup_locksetup_exit(struct rdtgroup *rdtgrp)
 
 	ret = rdtgroup_locksetup_user_restore(rdtgrp);
 	if (ret) {
-		free_rmid(rdtgrp->mon.rmid);
+		free_rmid(rdtgrp->closid, rdtgrp->mon.rmid);
 		return ret;
 	}
 
diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
index e5a48f05e787..f3b739c52e42 100644
--- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
+++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
@@ -2422,7 +2422,7 @@ static void free_all_child_rdtgrp(struct rdtgroup *rdtgrp)
 
 	head = &rdtgrp->mon.crdtgrp_list;
 	list_for_each_entry_safe(sentry, stmp, head, mon.crdtgrp_list) {
-		free_rmid(sentry->mon.rmid);
+		free_rmid(sentry->closid, sentry->mon.rmid);
 		list_del(&sentry->mon.crdtgrp_list);
 
 		if (atomic_read(&sentry->waitcount) != 0)
@@ -2462,7 +2462,7 @@ static void rmdir_all_sub(void)
 		cpumask_or(&rdtgroup_default.cpu_mask,
 			   &rdtgroup_default.cpu_mask, &rdtgrp->cpu_mask);
 
-		free_rmid(rdtgrp->mon.rmid);
+		free_rmid(rdtgrp->closid, rdtgrp->mon.rmid);
 
 		kernfs_remove(rdtgrp->kn);
 		list_del(&rdtgrp->rdtgroup_list);
@@ -2955,7 +2955,7 @@ static int mkdir_rdt_prepare(struct kernfs_node *parent_kn,
 	return 0;
 
 out_idfree:
-	free_rmid(rdtgrp->mon.rmid);
+	free_rmid(rdtgrp->closid, rdtgrp->mon.rmid);
 out_destroy:
 	kernfs_put(rdtgrp->kn);
 	kernfs_remove(rdtgrp->kn);
@@ -2969,7 +2969,7 @@ static int mkdir_rdt_prepare(struct kernfs_node *parent_kn,
 static void mkdir_rdt_prepare_clean(struct rdtgroup *rgrp)
 {
 	kernfs_remove(rgrp->kn);
-	free_rmid(rgrp->mon.rmid);
+	free_rmid(rgrp->closid, rgrp->mon.rmid);
 	rdtgroup_remove(rgrp);
 }
 
@@ -3118,7 +3118,7 @@ static int rdtgroup_rmdir_mon(struct rdtgroup *rdtgrp, cpumask_var_t tmpmask)
 	update_closid_rmid(tmpmask, NULL);
 
 	rdtgrp->flags = RDT_DELETED;
-	free_rmid(rdtgrp->mon.rmid);
+	free_rmid(rdtgrp->closid, rdtgrp->mon.rmid);
 
 	/*
 	 * Remove the rdtgrp from the parent ctrl_mon group's list
@@ -3164,8 +3164,8 @@ static int rdtgroup_rmdir_ctrl(struct rdtgroup *rdtgrp, cpumask_var_t tmpmask)
 	cpumask_or(tmpmask, tmpmask, &rdtgrp->cpu_mask);
 	update_closid_rmid(tmpmask, NULL);
 
+	free_rmid(rdtgrp->closid, rdtgrp->mon.rmid);
 	closid_free(rdtgrp->closid);
-	free_rmid(rdtgrp->mon.rmid);
 
 	rdtgroup_ctrl_remove(rdtgrp);
 
diff --git a/include/linux/resctrl.h b/include/linux/resctrl.h
index 0cee154abc9f..57d32c3ce06f 100644
--- a/include/linux/resctrl.h
+++ b/include/linux/resctrl.h
@@ -225,6 +225,8 @@ void resctrl_offline_domain(struct rdt_resource *r, struct rdt_domain *d);
  *			      for this resource and domain.
  * @r:			resource that the counter should be read from.
  * @d:			domain that the counter should be read from.
+ * @closid:		closid that matches the rmid. The counter may
+ *			match traffic of both closid and rmid, or rmid only.
  * @rmid:		rmid of the counter to read.
  * @eventid:		eventid to read, e.g. L3 occupancy.
  * @val:		result of the counter read in bytes.
@@ -235,20 +237,25 @@ void resctrl_offline_domain(struct rdt_resource *r, struct rdt_domain *d);
  * 0 on success, or -EIO, -EINVAL etc on error.
  */
 int resctrl_arch_rmid_read(struct rdt_resource *r, struct rdt_domain *d,
-			   u32 rmid, enum resctrl_event_id eventid, u64 *val);
+			   u32 closid, u32 rmid, enum resctrl_event_id eventid,
+			   u64 *val);
+
 
 /**
  * resctrl_arch_reset_rmid() - Reset any private state associated with rmid
  *			       and eventid.
  * @r:		The domain's resource.
  * @d:		The rmid's domain.
+ * @closid:	The closid that matches the rmid. Counters may match both
+ *		closid and rmid, or rmid only.
  * @rmid:	The rmid whose counter values should be reset.
  * @eventid:	The eventid whose counter values should be reset.
  *
  * This can be called from any CPU.
  */
 void resctrl_arch_reset_rmid(struct rdt_resource *r, struct rdt_domain *d,
-			     u32 rmid, enum resctrl_event_id eventid);
+			     u32 closid, u32 rmid,
+			     enum resctrl_event_id eventid);
 
 extern unsigned int resctrl_rmid_realloc_threshold;
 extern unsigned int resctrl_rmid_realloc_limit;
-- 
2.30.2


^ permalink raw reply related	[flat|nested] 73+ messages in thread

* [PATCH v2 02/18] x86/resctrl: Access per-rmid structures by index
  2023-01-13 17:54 [PATCH v2 00/18] x86/resctrl: monitored closid+rmid together, separate arch/fs locking James Morse
  2023-01-13 17:54 ` [PATCH v2 01/18] x86/resctrl: Track the closid with the rmid James Morse
@ 2023-01-13 17:54 ` James Morse
  2023-01-17 18:28   ` Yu, Fenghua
                     ` (2 more replies)
  2023-01-13 17:54 ` [PATCH v2 03/18] x86/resctrl: Create helper for RMID allocation and mondata dir creation James Morse
                   ` (16 subsequent siblings)
  18 siblings, 3 replies; 73+ messages in thread
From: James Morse @ 2023-01-13 17:54 UTC (permalink / raw)
  To: x86, linux-kernel
  Cc: Fenghua Yu, Reinette Chatre, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, H Peter Anvin, Babu Moger, James Morse,
	shameerali.kolothum.thodi, D Scott Phillips OS, carl, lcherian,
	bobo.shaobowang, tan.shaopeng, xingxin.hx, baolin.wang,
	Jamie Iles, Xin Hao, peternewman

Because of the differences between Intel RDT/AMD QoS and Arm's MPAM
monitors, RMID values on arm64 are not unique unless the CLOSID is
also included. Bitmaps like rmid_busy_llc need to be sized by the
number of unique entries for this resource.

Add helpers to encode/decode the CLOSID and RMID to an index. The
domain's busy_rmid_llc and the rmid_ptrs[] array are then sized by
index. On x86, this is always just the RMID. This gives resctrl a
unique value it can use to store monitor values, and allows MPAM to
decode the closid when reading the hardware counters.

Tested-by: Shaopeng Tan <tan.shaopeng@fujitsu.com>
Signed-off-by: James Morse <james.morse@arm.com>
---
Changes since v1:
 * Added X86_BAD_CLOSID macro to make it clear what this value means
 * Added second WARN_ON() for closid checking, and made both _ONCE()
---
 arch/x86/include/asm/resctrl.h         | 24 ++++++++
 arch/x86/kernel/cpu/resctrl/internal.h |  2 +
 arch/x86/kernel/cpu/resctrl/monitor.c  | 79 +++++++++++++++++---------
 arch/x86/kernel/cpu/resctrl/rdtgroup.c |  7 ++-
 4 files changed, 82 insertions(+), 30 deletions(-)

diff --git a/arch/x86/include/asm/resctrl.h b/arch/x86/include/asm/resctrl.h
index 52788f79786f..44d568f3577c 100644
--- a/arch/x86/include/asm/resctrl.h
+++ b/arch/x86/include/asm/resctrl.h
@@ -7,6 +7,13 @@
 #include <linux/sched.h>
 #include <linux/jump_label.h>
 
+/*
+ * This value can never be a valid CLOSID, and is used when mapping a
+ * (closid, rmid) pair to an index and back. On x86 only the RMID is
+ * needed.
+ */
+#define X86_RESCTRL_BAD_CLOSID		~0
+
 /**
  * struct resctrl_pqr_state - State cache for the PQR MSR
  * @cur_rmid:		The cached Resource Monitoring ID
@@ -94,6 +101,23 @@ static inline void resctrl_sched_in(void)
 		__resctrl_sched_in();
 }
 
+static inline u32 resctrl_arch_system_num_rmid_idx(void)
+{
+	/* RMID are independent numbers for x86. num_rmid_idx==num_rmid */
+	return boot_cpu_data.x86_cache_max_rmid + 1;
+}
+
+static inline void resctrl_arch_rmid_idx_decode(u32 idx, u32 *closid, u32 *rmid)
+{
+	*rmid = idx;
+	*closid = X86_RESCTRL_BAD_CLOSID;
+}
+
+static inline u32 resctrl_arch_rmid_idx_encode(u32 closid, u32 rmid)
+{
+	return rmid;
+}
+
 void resctrl_cpu_detect(struct cpuinfo_x86 *c);
 
 #else
diff --git a/arch/x86/kernel/cpu/resctrl/internal.h b/arch/x86/kernel/cpu/resctrl/internal.h
index 5dbff3035723..af71401c57e2 100644
--- a/arch/x86/kernel/cpu/resctrl/internal.h
+++ b/arch/x86/kernel/cpu/resctrl/internal.h
@@ -8,6 +8,8 @@
 #include <linux/fs_context.h>
 #include <linux/jump_label.h>
 
+#include <asm/resctrl.h>
+
 #define L3_QOS_CDP_ENABLE		0x01ULL
 
 #define L2_QOS_CDP_ENABLE		0x01ULL
diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c
index 13673cab175a..dbae380e3d1c 100644
--- a/arch/x86/kernel/cpu/resctrl/monitor.c
+++ b/arch/x86/kernel/cpu/resctrl/monitor.c
@@ -142,12 +142,27 @@ static inline u64 get_corrected_mbm_count(u32 rmid, unsigned long val)
 	return val;
 }
 
-static inline struct rmid_entry *__rmid_entry(u32 closid, u32 rmid)
+/*
+ * x86 and arm64 differ in their handling of monitoring.
+ * x86's RMID are an independent number, there is one RMID '1'.
+ * arm64's PMG extend the PARTID/CLOSID space, there is one RMID '1' for each
+ * CLOSID. The RMID is no longer unique.
+ * To account for this, resctrl uses an index. On x86 this is just the RMID,
+ * on arm64 it encodes the CLOSID and RMID. This gives a unique number.
+ *
+ * The domain's rmid_busy_llc and rmid_ptrs are sized by index. The arch code
+ * must accept an attempt to read every index.
+ */
+static inline struct rmid_entry *__rmid_entry(u32 idx)
 {
 	struct rmid_entry *entry;
+	u32 closid, rmid;
 
-	entry = &rmid_ptrs[rmid];
-	WARN_ON(entry->rmid != rmid);
+	entry = &rmid_ptrs[idx];
+	resctrl_arch_rmid_idx_decode(idx, &closid, &rmid);
+
+	WARN_ON_ONCE(entry->closid != closid);
+	WARN_ON_ONCE(entry->rmid != rmid);
 
 	return entry;
 }
@@ -243,8 +258,9 @@ int resctrl_arch_rmid_read(struct rdt_resource *r, struct rdt_domain *d,
 void __check_limbo(struct rdt_domain *d, bool force_free)
 {
 	struct rdt_resource *r = &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl;
+	u32 idx_limit = resctrl_arch_system_num_rmid_idx();
 	struct rmid_entry *entry;
-	u32 crmid = 1, nrmid;
+	u32 idx, cur_idx = 1;
 	bool rmid_dirty;
 	u64 val = 0;
 
@@ -255,12 +271,11 @@ void __check_limbo(struct rdt_domain *d, bool force_free)
 	 * RMID and move it to the free list when the counter reaches 0.
 	 */
 	for (;;) {
-		nrmid = find_next_bit(d->rmid_busy_llc, r->num_rmid, crmid);
-		if (nrmid >= r->num_rmid)
+		idx = find_next_bit(d->rmid_busy_llc, idx_limit, cur_idx);
+		if (idx >= idx_limit)
 			break;
 
-		entry = __rmid_entry(~0, nrmid);	// temporary
-
+		entry = __rmid_entry(idx);
 		if (resctrl_arch_rmid_read(r, d, entry->closid, entry->rmid,
 					   QOS_L3_OCCUP_EVENT_ID, &val)) {
 			rmid_dirty = true;
@@ -269,19 +284,21 @@ void __check_limbo(struct rdt_domain *d, bool force_free)
 		}
 
 		if (force_free || !rmid_dirty) {
-			clear_bit(entry->rmid, d->rmid_busy_llc);
+			clear_bit(idx, d->rmid_busy_llc);
 			if (!--entry->busy) {
 				rmid_limbo_count--;
 				list_add_tail(&entry->list, &rmid_free_lru);
 			}
 		}
-		crmid = nrmid + 1;
+		cur_idx = idx + 1;
 	}
 }
 
 bool has_busy_rmid(struct rdt_resource *r, struct rdt_domain *d)
 {
-	return find_first_bit(d->rmid_busy_llc, r->num_rmid) != r->num_rmid;
+	u32 idx_limit = resctrl_arch_system_num_rmid_idx();
+
+	return find_first_bit(d->rmid_busy_llc, idx_limit) != idx_limit;
 }
 
 /*
@@ -311,6 +328,9 @@ static void add_rmid_to_limbo(struct rmid_entry *entry)
 	struct rdt_domain *d;
 	int cpu, err;
 	u64 val = 0;
+	u32 idx;
+
+	idx = resctrl_arch_rmid_idx_encode(entry->closid, entry->rmid);
 
 	entry->busy = 0;
 	cpu = get_cpu();
@@ -330,7 +350,7 @@ static void add_rmid_to_limbo(struct rmid_entry *entry)
 		 */
 		if (!has_busy_rmid(r, d))
 			cqm_setup_limbo_handler(d, CQM_LIMBOCHECK_INTERVAL);
-		set_bit(entry->rmid, d->rmid_busy_llc);
+		set_bit(idx, d->rmid_busy_llc);
 		entry->busy++;
 	}
 	put_cpu();
@@ -343,14 +363,16 @@ static void add_rmid_to_limbo(struct rmid_entry *entry)
 
 void free_rmid(u32 closid, u32 rmid)
 {
+	u32 idx = resctrl_arch_rmid_idx_encode(closid, rmid);
 	struct rmid_entry *entry;
 
-	if (!rmid)
-		return;
-
 	lockdep_assert_held(&rdtgroup_mutex);
 
-	entry = __rmid_entry(closid, rmid);
+	/* do not allow the default rmid to be free'd */
+	if (!idx)
+		return;
+
+	entry = __rmid_entry(idx);
 
 	if (is_llc_occupancy_enabled())
 		add_rmid_to_limbo(entry);
@@ -360,6 +382,7 @@ void free_rmid(u32 closid, u32 rmid)
 
 static int __mon_event_count(u32 closid, u32 rmid, struct rmid_read *rr)
 {
+	u32 idx = resctrl_arch_rmid_idx_encode(closid, rmid);
 	struct mbm_state *m;
 	u64 tval = 0;
 
@@ -376,10 +399,10 @@ static int __mon_event_count(u32 closid, u32 rmid, struct rmid_read *rr)
 		rr->val += tval;
 		return 0;
 	case QOS_L3_MBM_TOTAL_EVENT_ID:
-		m = &rr->d->mbm_total[rmid];
+		m = &rr->d->mbm_total[idx];
 		break;
 	case QOS_L3_MBM_LOCAL_EVENT_ID:
-		m = &rr->d->mbm_local[rmid];
+		m = &rr->d->mbm_local[idx];
 		break;
 	default:
 		/*
@@ -412,7 +435,8 @@ static int __mon_event_count(u32 closid, u32 rmid, struct rmid_read *rr)
  */
 static void mbm_bw_count(u32 closid, u32 rmid, struct rmid_read *rr)
 {
-	struct mbm_state *m = &rr->d->mbm_local[rmid];
+	u32 idx = resctrl_arch_rmid_idx_encode(closid, rmid);
+	struct mbm_state *m = &rr->d->mbm_local[idx];
 	u64 cur_bw, bytes, cur_bytes;
 
 	cur_bytes = rr->val;
@@ -502,7 +526,7 @@ static void update_mba_bw(struct rdtgroup *rgrp, struct rdt_domain *dom_mbm)
 {
 	u32 closid, rmid, cur_msr_val, new_msr_val;
 	struct mbm_state *pmbm_data, *cmbm_data;
-	u32 cur_bw, delta_bw, user_bw;
+	u32 cur_bw, delta_bw, user_bw, idx;
 	struct rdt_resource *r_mba;
 	struct rdt_domain *dom_mba;
 	struct list_head *head;
@@ -515,7 +539,8 @@ static void update_mba_bw(struct rdtgroup *rgrp, struct rdt_domain *dom_mbm)
 
 	closid = rgrp->closid;
 	rmid = rgrp->mon.rmid;
-	pmbm_data = &dom_mbm->mbm_local[rmid];
+	idx = resctrl_arch_rmid_idx_encode(closid, rmid);
+	pmbm_data = &dom_mbm->mbm_local[idx];
 
 	dom_mba = get_domain_from_cpu(smp_processor_id(), r_mba);
 	if (!dom_mba) {
@@ -698,19 +723,19 @@ void mbm_setup_overflow_handler(struct rdt_domain *dom, unsigned long delay_ms)
 
 static int dom_data_init(struct rdt_resource *r)
 {
+	u32 nr_idx = resctrl_arch_system_num_rmid_idx();
 	struct rmid_entry *entry = NULL;
-	int i, nr_rmids;
+	int i;
 
-	nr_rmids = r->num_rmid;
-	rmid_ptrs = kcalloc(nr_rmids, sizeof(struct rmid_entry), GFP_KERNEL);
+	rmid_ptrs = kcalloc(nr_idx, sizeof(struct rmid_entry), GFP_KERNEL);
 	if (!rmid_ptrs)
 		return -ENOMEM;
 
-	for (i = 0; i < nr_rmids; i++) {
+	for (i = 0; i < nr_idx; i++) {
 		entry = &rmid_ptrs[i];
 		INIT_LIST_HEAD(&entry->list);
 
-		entry->rmid = i;
+		resctrl_arch_rmid_idx_decode(i, &entry->closid, &entry->rmid);
 		list_add_tail(&entry->list, &rmid_free_lru);
 	}
 
@@ -719,7 +744,7 @@ static int dom_data_init(struct rdt_resource *r)
 	 * default_rdtgroup control group, which will be setup later. See
 	 * rdtgroup_setup_root().
 	 */
-	entry = __rmid_entry(0, 0);
+	entry = __rmid_entry(resctrl_arch_rmid_idx_encode(0, 0));
 	list_del(&entry->list);
 
 	return 0;
diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
index f3b739c52e42..9ce4746778f4 100644
--- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
+++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
@@ -3320,16 +3320,17 @@ void resctrl_offline_domain(struct rdt_resource *r, struct rdt_domain *d)
 
 static int domain_setup_mon_state(struct rdt_resource *r, struct rdt_domain *d)
 {
+	u32 idx_limit = resctrl_arch_system_num_rmid_idx();
 	size_t tsize;
 
 	if (is_llc_occupancy_enabled()) {
-		d->rmid_busy_llc = bitmap_zalloc(r->num_rmid, GFP_KERNEL);
+		d->rmid_busy_llc = bitmap_zalloc(idx_limit, GFP_KERNEL);
 		if (!d->rmid_busy_llc)
 			return -ENOMEM;
 	}
 	if (is_mbm_total_enabled()) {
 		tsize = sizeof(*d->mbm_total);
-		d->mbm_total = kcalloc(r->num_rmid, tsize, GFP_KERNEL);
+		d->mbm_total = kcalloc(idx_limit, tsize, GFP_KERNEL);
 		if (!d->mbm_total) {
 			bitmap_free(d->rmid_busy_llc);
 			return -ENOMEM;
@@ -3337,7 +3338,7 @@ static int domain_setup_mon_state(struct rdt_resource *r, struct rdt_domain *d)
 	}
 	if (is_mbm_local_enabled()) {
 		tsize = sizeof(*d->mbm_local);
-		d->mbm_local = kcalloc(r->num_rmid, tsize, GFP_KERNEL);
+		d->mbm_local = kcalloc(idx_limit, tsize, GFP_KERNEL);
 		if (!d->mbm_local) {
 			bitmap_free(d->rmid_busy_llc);
 			kfree(d->mbm_total);
-- 
2.30.2


^ permalink raw reply related	[flat|nested] 73+ messages in thread

* [PATCH v2 03/18] x86/resctrl: Create helper for RMID allocation and mondata dir creation
  2023-01-13 17:54 [PATCH v2 00/18] x86/resctrl: monitored closid+rmid together, separate arch/fs locking James Morse
  2023-01-13 17:54 ` [PATCH v2 01/18] x86/resctrl: Track the closid with the rmid James Morse
  2023-01-13 17:54 ` [PATCH v2 02/18] x86/resctrl: Access per-rmid structures by index James Morse
@ 2023-01-13 17:54 ` James Morse
  2023-01-13 17:54 ` [PATCH v2 04/18] x86/resctrl: Move rmid allocation out of mkdir_rdt_prepare() James Morse
                   ` (15 subsequent siblings)
  18 siblings, 0 replies; 73+ messages in thread
From: James Morse @ 2023-01-13 17:54 UTC (permalink / raw)
  To: x86, linux-kernel
  Cc: Fenghua Yu, Reinette Chatre, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, H Peter Anvin, Babu Moger, James Morse,
	shameerali.kolothum.thodi, D Scott Phillips OS, carl, lcherian,
	bobo.shaobowang, tan.shaopeng, xingxin.hx, baolin.wang,
	Jamie Iles, Xin Hao, peternewman

RMID are allocated for each monitor or control group directory, because
each of these needs its own RMID. For control groups,
rdtgroup_mkdir_ctrl_mon() later goes on to allocate the CLOSID.

MPAM's equivalent of RMID are not an independent number, so can't be
allocated until the CLOSID is known. An RMID allocation for one CLOSID
may fail, whereas another may succeed depending on how many monitor
groups a control group has.

The RMID allocation needs to move to be after the CLOSID has been
allocated.

To make a subsequent change that does this easier to read, move the RMID
allocation and mondata dir creation to a helper.

Tested-by: Shaopeng Tan <tan.shaopeng@fujitsu.com>
Signed-off-by: James Morse <james.morse@arm.com>
---
 arch/x86/kernel/cpu/resctrl/rdtgroup.c | 42 +++++++++++++++++---------
 1 file changed, 27 insertions(+), 15 deletions(-)

diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
index 9ce4746778f4..841294ad6263 100644
--- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
+++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
@@ -2868,6 +2868,30 @@ static int rdtgroup_init_alloc(struct rdtgroup *rdtgrp)
 	return 0;
 }
 
+static int mkdir_rdt_prepare_rmid_alloc(struct rdtgroup *rdtgrp)
+{
+	int ret;
+
+	if (!rdt_mon_capable)
+		return 0;
+
+	ret = alloc_rmid();
+	if (ret < 0) {
+		rdt_last_cmd_puts("Out of RMIDs\n");
+		return ret;
+	}
+	rdtgrp->mon.rmid = ret;
+
+	ret = mkdir_mondata_all(rdtgrp->kn, rdtgrp, &rdtgrp->mon.mon_data_kn);
+	if (ret) {
+		rdt_last_cmd_puts("kernfs subdir error\n");
+		free_rmid(rdtgrp->closid, rdtgrp->mon.rmid);
+		return ret;
+	}
+
+	return 0;
+}
+
 static int mkdir_rdt_prepare(struct kernfs_node *parent_kn,
 			     const char *name, umode_t mode,
 			     enum rdt_group_type rtype, struct rdtgroup **r)
@@ -2933,20 +2957,10 @@ static int mkdir_rdt_prepare(struct kernfs_node *parent_kn,
 		goto out_destroy;
 	}
 
-	if (rdt_mon_capable) {
-		ret = alloc_rmid();
-		if (ret < 0) {
-			rdt_last_cmd_puts("Out of RMIDs\n");
-			goto out_destroy;
-		}
-		rdtgrp->mon.rmid = ret;
+	ret = mkdir_rdt_prepare_rmid_alloc(rdtgrp);
+	if (ret)
+		goto out_destroy;
 
-		ret = mkdir_mondata_all(kn, rdtgrp, &rdtgrp->mon.mon_data_kn);
-		if (ret) {
-			rdt_last_cmd_puts("kernfs subdir error\n");
-			goto out_idfree;
-		}
-	}
 	kernfs_activate(kn);
 
 	/*
@@ -2954,8 +2968,6 @@ static int mkdir_rdt_prepare(struct kernfs_node *parent_kn,
 	 */
 	return 0;
 
-out_idfree:
-	free_rmid(rdtgrp->closid, rdtgrp->mon.rmid);
 out_destroy:
 	kernfs_put(rdtgrp->kn);
 	kernfs_remove(rdtgrp->kn);
-- 
2.30.2


^ permalink raw reply related	[flat|nested] 73+ messages in thread

* [PATCH v2 04/18] x86/resctrl: Move rmid allocation out of mkdir_rdt_prepare()
  2023-01-13 17:54 [PATCH v2 00/18] x86/resctrl: monitored closid+rmid together, separate arch/fs locking James Morse
                   ` (2 preceding siblings ...)
  2023-01-13 17:54 ` [PATCH v2 03/18] x86/resctrl: Create helper for RMID allocation and mondata dir creation James Morse
@ 2023-01-13 17:54 ` James Morse
  2023-01-17 18:28   ` Yu, Fenghua
  2023-02-02 23:45   ` Reinette Chatre
  2023-01-13 17:54 ` [PATCH v2 05/18] x86/resctrl: Allow RMID allocation to be scoped by CLOSID James Morse
                   ` (14 subsequent siblings)
  18 siblings, 2 replies; 73+ messages in thread
From: James Morse @ 2023-01-13 17:54 UTC (permalink / raw)
  To: x86, linux-kernel
  Cc: Fenghua Yu, Reinette Chatre, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, H Peter Anvin, Babu Moger, James Morse,
	shameerali.kolothum.thodi, D Scott Phillips OS, carl, lcherian,
	bobo.shaobowang, tan.shaopeng, xingxin.hx, baolin.wang,
	Jamie Iles, Xin Hao, peternewman

RMID are allocated for each monitor or control group directory, because
each of these needs its own RMID. For control groups,
rdtgroup_mkdir_ctrl_mon() later goes on to allocate the CLOSID.

MPAM's equivalent of RMID are not an independent number, so can't be
allocated until the closid is known. An RMID allocation for one CLOSID
may fail, whereas another may succeed depending on how many monitor
groups a control group has.

The RMID allocation needs to move to be after the CLOSID has been
allocated.

Move the RMID allocation out of mkdir_rdt_prepare() to occur in its caller,
after the mkdir_rdt_prepare() call. This allows the RMID allocator to
know the CLOSID.

Tested-by: Shaopeng Tan <tan.shaopeng@fujitsu.com>
Signed-off-by: James Morse <james.morse@arm.com>
---
 arch/x86/kernel/cpu/resctrl/rdtgroup.c | 29 +++++++++++++++++++-------
 1 file changed, 22 insertions(+), 7 deletions(-)

diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
index 841294ad6263..c67083a8a5f5 100644
--- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
+++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
@@ -2892,6 +2892,12 @@ static int mkdir_rdt_prepare_rmid_alloc(struct rdtgroup *rdtgrp)
 	return 0;
 }
 
+static void mkdir_rdt_prepare_rmid_free(struct rdtgroup *rgrp)
+{
+	if (rdt_mon_capable)
+		free_rmid(rgrp->closid, rgrp->mon.rmid);
+}
+
 static int mkdir_rdt_prepare(struct kernfs_node *parent_kn,
 			     const char *name, umode_t mode,
 			     enum rdt_group_type rtype, struct rdtgroup **r)
@@ -2957,10 +2963,6 @@ static int mkdir_rdt_prepare(struct kernfs_node *parent_kn,
 		goto out_destroy;
 	}
 
-	ret = mkdir_rdt_prepare_rmid_alloc(rdtgrp);
-	if (ret)
-		goto out_destroy;
-
 	kernfs_activate(kn);
 
 	/*
@@ -2981,7 +2983,6 @@ static int mkdir_rdt_prepare(struct kernfs_node *parent_kn,
 static void mkdir_rdt_prepare_clean(struct rdtgroup *rgrp)
 {
 	kernfs_remove(rgrp->kn);
-	free_rmid(rgrp->closid, rgrp->mon.rmid);
 	rdtgroup_remove(rgrp);
 }
 
@@ -3003,12 +3004,19 @@ static int rdtgroup_mkdir_mon(struct kernfs_node *parent_kn,
 	prgrp = rdtgrp->mon.parent;
 	rdtgrp->closid = prgrp->closid;
 
+	ret = mkdir_rdt_prepare_rmid_alloc(rdtgrp);
+	if (ret) {
+		mkdir_rdt_prepare_clean(rdtgrp);
+		goto out_unlock;
+	}
+
 	/*
 	 * Add the rdtgrp to the list of rdtgrps the parent
 	 * ctrl_mon group has to track.
 	 */
 	list_add_tail(&rdtgrp->mon.crdtgrp_list, &prgrp->mon.crdtgrp_list);
 
+out_unlock:
 	rdtgroup_kn_unlock(parent_kn);
 	return ret;
 }
@@ -3039,10 +3047,15 @@ static int rdtgroup_mkdir_ctrl_mon(struct kernfs_node *parent_kn,
 	ret = 0;
 
 	rdtgrp->closid = closid;
-	ret = rdtgroup_init_alloc(rdtgrp);
-	if (ret < 0)
+
+	ret = mkdir_rdt_prepare_rmid_alloc(rdtgrp);
+	if (ret)
 		goto out_id_free;
 
+	ret = rdtgroup_init_alloc(rdtgrp);
+	if (ret < 0)
+		goto out_rmid_free;
+
 	list_add(&rdtgrp->rdtgroup_list, &rdt_all_groups);
 
 	if (rdt_mon_capable) {
@@ -3061,6 +3074,8 @@ static int rdtgroup_mkdir_ctrl_mon(struct kernfs_node *parent_kn,
 
 out_del_list:
 	list_del(&rdtgrp->rdtgroup_list);
+out_rmid_free:
+	mkdir_rdt_prepare_rmid_free(rdtgrp);
 out_id_free:
 	closid_free(closid);
 out_common_fail:
-- 
2.30.2


^ permalink raw reply related	[flat|nested] 73+ messages in thread

* [PATCH v2 05/18] x86/resctrl: Allow RMID allocation to be scoped by CLOSID
  2023-01-13 17:54 [PATCH v2 00/18] x86/resctrl: monitored closid+rmid together, separate arch/fs locking James Morse
                   ` (3 preceding siblings ...)
  2023-01-13 17:54 ` [PATCH v2 04/18] x86/resctrl: Move rmid allocation out of mkdir_rdt_prepare() James Morse
@ 2023-01-13 17:54 ` James Morse
  2023-01-17 18:53   ` Yu, Fenghua
  2023-02-02 23:45   ` Reinette Chatre
  2023-01-13 17:54 ` [PATCH v2 06/18] x86/resctrl: Allow the allocator to check if a CLOSID can allocate clean RMID James Morse
                   ` (13 subsequent siblings)
  18 siblings, 2 replies; 73+ messages in thread
From: James Morse @ 2023-01-13 17:54 UTC (permalink / raw)
  To: x86, linux-kernel
  Cc: Fenghua Yu, Reinette Chatre, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, H Peter Anvin, Babu Moger, James Morse,
	shameerali.kolothum.thodi, D Scott Phillips OS, carl, lcherian,
	bobo.shaobowang, tan.shaopeng, xingxin.hx, baolin.wang,
	Jamie Iles, Xin Hao, peternewman

MPAMs RMID values are not unique unless the CLOSID is considered as well.

alloc_rmid() expects the RMID to be an independent number.

Pass the CLOSID in to alloc_rmid(). Use this to compare indexes when
allocating. If the CLOSID is not relevant to the index, this ends up
comparing the free RMID with itself, and the first free entry will be
used. With MPAM the CLOSID is included in the index, so this becomes a
walk of the free RMID entries, until one that matches the supplied
CLOSID is found.

Tested-by: Shaopeng Tan <tan.shaopeng@fujitsu.com>
Signed-off-by: James Morse <james.morse@arm.com>
---
 arch/x86/kernel/cpu/resctrl/internal.h    |  2 +-
 arch/x86/kernel/cpu/resctrl/monitor.c     | 44 ++++++++++++++++++-----
 arch/x86/kernel/cpu/resctrl/pseudo_lock.c |  2 +-
 arch/x86/kernel/cpu/resctrl/rdtgroup.c    |  2 +-
 4 files changed, 38 insertions(+), 12 deletions(-)

diff --git a/arch/x86/kernel/cpu/resctrl/internal.h b/arch/x86/kernel/cpu/resctrl/internal.h
index af71401c57e2..013c8fc9fd28 100644
--- a/arch/x86/kernel/cpu/resctrl/internal.h
+++ b/arch/x86/kernel/cpu/resctrl/internal.h
@@ -510,7 +510,7 @@ void rdtgroup_pseudo_lock_remove(struct rdtgroup *rdtgrp);
 struct rdt_domain *get_domain_from_cpu(int cpu, struct rdt_resource *r);
 int closids_supported(void);
 void closid_free(int closid);
-int alloc_rmid(void);
+int alloc_rmid(u32 closid);
 void free_rmid(u32 closid, u32 rmid);
 int rdt_get_mon_l3_config(struct rdt_resource *r);
 void mon_event_count(void *info);
diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c
index dbae380e3d1c..347be3767241 100644
--- a/arch/x86/kernel/cpu/resctrl/monitor.c
+++ b/arch/x86/kernel/cpu/resctrl/monitor.c
@@ -301,25 +301,51 @@ bool has_busy_rmid(struct rdt_resource *r, struct rdt_domain *d)
 	return find_first_bit(d->rmid_busy_llc, idx_limit) != idx_limit;
 }
 
+static struct rmid_entry *resctrl_find_free_rmid(u32 closid)
+{
+	struct rmid_entry *itr;
+	u32 itr_idx, cmp_idx;
+
+	if (list_empty(&rmid_free_lru))
+		return rmid_limbo_count ? ERR_PTR(-EBUSY) : ERR_PTR(-ENOSPC);
+
+	list_for_each_entry(itr, &rmid_free_lru, list) {
+		/*
+		 * get the index of this free RMID, and the index it would need
+		 * to be if it were used with this CLOSID.
+		 * If the CLOSID is irrelevant on this architecture, these will
+		 * always be the same. Otherwise they will only match if this
+		 * RMID can be used with this CLOSID.
+		 */
+		itr_idx = resctrl_arch_rmid_idx_encode(itr->closid, itr->rmid);
+		cmp_idx = resctrl_arch_rmid_idx_encode(closid, itr->rmid);
+
+		if (itr_idx == cmp_idx)
+			return itr;
+	}
+
+	return ERR_PTR(-ENOSPC);
+}
+
 /*
- * As of now the RMIDs allocation is global.
+ * As of now the RMIDs allocation is the same in each domain.
  * However we keep track of which packages the RMIDs
  * are used to optimize the limbo list management.
+ * The closid is ignored on x86.
  */
-int alloc_rmid(void)
+int alloc_rmid(u32 closid)
 {
 	struct rmid_entry *entry;
 
 	lockdep_assert_held(&rdtgroup_mutex);
 
-	if (list_empty(&rmid_free_lru))
-		return rmid_limbo_count ? -EBUSY : -ENOSPC;
+	entry = resctrl_find_free_rmid(closid);
+	if (!IS_ERR(entry)) {
+		list_del(&entry->list);
+		return entry->rmid;
+	}
 
-	entry = list_first_entry(&rmid_free_lru,
-				 struct rmid_entry, list);
-	list_del(&entry->list);
-
-	return entry->rmid;
+	return PTR_ERR(entry);
 }
 
 static void add_rmid_to_limbo(struct rmid_entry *entry)
diff --git a/arch/x86/kernel/cpu/resctrl/pseudo_lock.c b/arch/x86/kernel/cpu/resctrl/pseudo_lock.c
index c51932516965..3b724a40d3a2 100644
--- a/arch/x86/kernel/cpu/resctrl/pseudo_lock.c
+++ b/arch/x86/kernel/cpu/resctrl/pseudo_lock.c
@@ -763,7 +763,7 @@ int rdtgroup_locksetup_exit(struct rdtgroup *rdtgrp)
 	int ret;
 
 	if (rdt_mon_capable) {
-		ret = alloc_rmid();
+		ret = alloc_rmid(rdtgrp->closid);
 		if (ret < 0) {
 			rdt_last_cmd_puts("Out of RMIDs\n");
 			return ret;
diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
index c67083a8a5f5..ac88610a6946 100644
--- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
+++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
@@ -2875,7 +2875,7 @@ static int mkdir_rdt_prepare_rmid_alloc(struct rdtgroup *rdtgrp)
 	if (!rdt_mon_capable)
 		return 0;
 
-	ret = alloc_rmid();
+	ret = alloc_rmid(rdtgrp->closid);
 	if (ret < 0) {
 		rdt_last_cmd_puts("Out of RMIDs\n");
 		return ret;
-- 
2.30.2


^ permalink raw reply related	[flat|nested] 73+ messages in thread

* [PATCH v2 06/18] x86/resctrl: Allow the allocator to check if a CLOSID can allocate clean RMID
  2023-01-13 17:54 [PATCH v2 00/18] x86/resctrl: monitored closid+rmid together, separate arch/fs locking James Morse
                   ` (4 preceding siblings ...)
  2023-01-13 17:54 ` [PATCH v2 05/18] x86/resctrl: Allow RMID allocation to be scoped by CLOSID James Morse
@ 2023-01-13 17:54 ` James Morse
  2023-01-17 18:29   ` Yu, Fenghua
  2023-02-02 23:46   ` Reinette Chatre
  2023-01-13 17:54 ` [PATCH v2 07/18] x86/resctrl: Move CLOSID/RMID matching and setting to use helpers James Morse
                   ` (12 subsequent siblings)
  18 siblings, 2 replies; 73+ messages in thread
From: James Morse @ 2023-01-13 17:54 UTC (permalink / raw)
  To: x86, linux-kernel
  Cc: Fenghua Yu, Reinette Chatre, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, H Peter Anvin, Babu Moger, James Morse,
	shameerali.kolothum.thodi, D Scott Phillips OS, carl, lcherian,
	bobo.shaobowang, tan.shaopeng, xingxin.hx, baolin.wang,
	Jamie Iles, Xin Hao, peternewman

MPAM's PMG bits extend its PARTID space, meaning the same PMG value can be
used for different control groups.

This means once a CLOSID is allocated, all its monitoring ids may still be
dirty, and held in limbo.

Add a helper to allow the CLOSID allocator to check if a CLOSID has dirty
RMID values. This behaviour is enabled by a kconfig option selected by
the architecture, which avoids a pointless search for x86.

Tested-by: Shaopeng Tan <tan.shaopeng@fujitsu.com>
Signed-off-by: James Morse <james.morse@arm.com>

---
Changes since v1:
 * Removed superflous IS_ENABLED().
---
 arch/x86/kernel/cpu/resctrl/internal.h |  1 +
 arch/x86/kernel/cpu/resctrl/monitor.c  | 31 ++++++++++++++++++++++++++
 arch/x86/kernel/cpu/resctrl/rdtgroup.c | 17 ++++++++------
 3 files changed, 42 insertions(+), 7 deletions(-)

diff --git a/arch/x86/kernel/cpu/resctrl/internal.h b/arch/x86/kernel/cpu/resctrl/internal.h
index 013c8fc9fd28..adae6231324f 100644
--- a/arch/x86/kernel/cpu/resctrl/internal.h
+++ b/arch/x86/kernel/cpu/resctrl/internal.h
@@ -509,6 +509,7 @@ int rdtgroup_pseudo_lock_create(struct rdtgroup *rdtgrp);
 void rdtgroup_pseudo_lock_remove(struct rdtgroup *rdtgrp);
 struct rdt_domain *get_domain_from_cpu(int cpu, struct rdt_resource *r);
 int closids_supported(void);
+bool resctrl_closid_is_dirty(u32 closid);
 void closid_free(int closid);
 int alloc_rmid(u32 closid);
 void free_rmid(u32 closid, u32 rmid);
diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c
index 347be3767241..190ac183139e 100644
--- a/arch/x86/kernel/cpu/resctrl/monitor.c
+++ b/arch/x86/kernel/cpu/resctrl/monitor.c
@@ -327,6 +327,37 @@ static struct rmid_entry *resctrl_find_free_rmid(u32 closid)
 	return ERR_PTR(-ENOSPC);
 }
 
+/**
+ * resctrl_closid_is_dirty - Determine if clean RMID can be allocate for this
+ *                           CLOSID.
+ * @closid: The CLOSID that is being queried.
+ *
+ * MPAM's equivalent of RMID are per-CLOSID, meaning a freshly allocate CLOSID
+ * may not be able to allocate clean RMID. To avoid this the allocator will
+ * only return clean CLOSID.
+ */
+bool resctrl_closid_is_dirty(u32 closid)
+{
+	struct rmid_entry *entry;
+	int i;
+
+	lockdep_assert_held(&rdtgroup_mutex);
+
+	if (!IS_ENABLED(CONFIG_RESCTRL_RMID_DEPENDS_ON_CLOSID))
+		return false;
+
+	for (i = 0; i < resctrl_arch_system_num_rmid_idx(); i++) {
+		entry = &rmid_ptrs[i];
+		if (entry->closid != closid)
+			continue;
+
+		if (entry->busy)
+			return true;
+	}
+
+	return false;
+}
+
 /*
  * As of now the RMIDs allocation is the same in each domain.
  * However we keep track of which packages the RMIDs
diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
index ac88610a6946..e1f879e13823 100644
--- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
+++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
@@ -93,7 +93,7 @@ void rdt_last_cmd_printf(const char *fmt, ...)
  * - Our choices on how to configure each resource become progressively more
  *   limited as the number of resources grows.
  */
-static int closid_free_map;
+static unsigned long closid_free_map;
 static int closid_free_map_len;
 
 int closids_supported(void)
@@ -119,14 +119,17 @@ static void closid_init(void)
 
 static int closid_alloc(void)
 {
-	u32 closid = ffs(closid_free_map);
+	u32 closid;
 
-	if (closid == 0)
-		return -ENOSPC;
-	closid--;
-	closid_free_map &= ~(1 << closid);
+	for_each_set_bit(closid, &closid_free_map, closid_free_map_len) {
+		if (resctrl_closid_is_dirty(closid))
+			continue;
 
-	return closid;
+		clear_bit(closid, &closid_free_map);
+		return closid;
+	}
+
+	return -ENOSPC;
 }
 
 void closid_free(int closid)
-- 
2.30.2


^ permalink raw reply related	[flat|nested] 73+ messages in thread

* [PATCH v2 07/18] x86/resctrl: Move CLOSID/RMID matching and setting to use helpers
  2023-01-13 17:54 [PATCH v2 00/18] x86/resctrl: monitored closid+rmid together, separate arch/fs locking James Morse
                   ` (5 preceding siblings ...)
  2023-01-13 17:54 ` [PATCH v2 06/18] x86/resctrl: Allow the allocator to check if a CLOSID can allocate clean RMID James Morse
@ 2023-01-13 17:54 ` James Morse
  2023-01-17 19:10   ` Yu, Fenghua
  2023-02-02 23:47   ` Reinette Chatre
  2023-01-13 17:54 ` [PATCH v2 08/18] x86/resctrl: Queue mon_event_read() instead of sending an IPI James Morse
                   ` (11 subsequent siblings)
  18 siblings, 2 replies; 73+ messages in thread
From: James Morse @ 2023-01-13 17:54 UTC (permalink / raw)
  To: x86, linux-kernel
  Cc: Fenghua Yu, Reinette Chatre, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, H Peter Anvin, Babu Moger, James Morse,
	shameerali.kolothum.thodi, D Scott Phillips OS, carl, lcherian,
	bobo.shaobowang, tan.shaopeng, xingxin.hx, baolin.wang,
	Jamie Iles, Xin Hao, peternewman

When switching tasks, the CLOSID and RMID that the new task should
use are stored in struct task_struct. For x86 the CLOSID known by resctrl,
the value in task_struct, and the value written to the CPU register are
all the same thing.

MPAM's CPU interface has two different PARTID's one for data accesses
the other for instruction fetch. Storing resctrl's CLOSID value in
struct task_struct implies the arch code knows whether resctrl is using
CDP.

Move the matching and setting of the struct task_struct properties
to use helpers. This allows arm64 to store the hardware format of
the register, instead of having to convert it each time.

__rdtgroup_move_task()s use of READ_ONCE()/WRITE_ONCE() ensures torn
values aren't seen as another CPU may schedule the task being moved
while the value is being changed. MPAM has an additional corner-case
here as the PMG bits extend the PARTID space. If the scheduler sees a
new-CLOSID but old-RMID, the task will dirty an RMID that the limbo code
is not watching causing an inaccurate count. x86's RMID are independent
values, so the limbo code will still be watching the old-RMID in this
circumstance.
To avoid this, arm64 needs both the CLOSID/RMID WRITE_ONCE()d together.
Both values must be provided together.

Because MPAM's RMID values are not unique, the CLOSID must be provided
when matching the RMID.

CC: Valentin Schneider <vschneid@redhat.com>
Tested-by: Shaopeng Tan <tan.shaopeng@fujitsu.com>
Signed-off-by: James Morse <james.morse@arm.com>
---
 arch/x86/include/asm/resctrl.h         | 18 ++++++++
 arch/x86/kernel/cpu/resctrl/rdtgroup.c | 57 +++++++++++++++-----------
 2 files changed, 51 insertions(+), 24 deletions(-)

diff --git a/arch/x86/include/asm/resctrl.h b/arch/x86/include/asm/resctrl.h
index 44d568f3577c..d589a82995ac 100644
--- a/arch/x86/include/asm/resctrl.h
+++ b/arch/x86/include/asm/resctrl.h
@@ -95,6 +95,24 @@ static inline unsigned int resctrl_arch_round_mon_val(unsigned int val)
 	return val * scale;
 }
 
+static inline void resctrl_arch_set_closid_rmid(struct task_struct *tsk,
+						u32 closid, u32 rmid)
+{
+	WRITE_ONCE(tsk->closid, closid);
+	WRITE_ONCE(tsk->rmid, rmid);
+}
+
+static inline bool resctrl_arch_match_closid(struct task_struct *tsk, u32 closid)
+{
+	return READ_ONCE(tsk->closid) == closid;
+}
+
+static inline bool resctrl_arch_match_rmid(struct task_struct *tsk, u32 ignored,
+					   u32 rmid)
+{
+	return READ_ONCE(tsk->rmid) == rmid;
+}
+
 static inline void resctrl_sched_in(void)
 {
 	if (static_branch_likely(&rdt_enable_key))
diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
index e1f879e13823..ced7400decae 100644
--- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
+++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
@@ -84,7 +84,7 @@ void rdt_last_cmd_printf(const char *fmt, ...)
  *
  * Using a global CLOSID across all resources has some advantages and
  * some drawbacks:
- * + We can simply set "current->closid" to assign a task to a resource
+ * + We can simply set current's closid to assign a task to a resource
  *   group.
  * + Context switch code can avoid extra memory references deciding which
  *   CLOSID to load into the PQR_ASSOC MSR
@@ -549,14 +549,26 @@ static void update_task_closid_rmid(struct task_struct *t)
 		_update_task_closid_rmid(t);
 }
 
+static bool task_in_rdtgroup(struct task_struct *tsk, struct rdtgroup *rdtgrp)
+{
+	u32 closid, rmid = rdtgrp->mon.rmid;
+
+	if (rdtgrp->type == RDTCTRL_GROUP)
+		closid = rdtgrp->closid;
+	else if (rdtgrp->type == RDTMON_GROUP)
+		closid = rdtgrp->mon.parent->closid;
+	else
+		return false;
+
+	return resctrl_arch_match_closid(tsk, closid) &&
+	       resctrl_arch_match_rmid(tsk, closid, rmid);
+}
+
 static int __rdtgroup_move_task(struct task_struct *tsk,
 				struct rdtgroup *rdtgrp)
 {
 	/* If the task is already in rdtgrp, no need to move the task. */
-	if ((rdtgrp->type == RDTCTRL_GROUP && tsk->closid == rdtgrp->closid &&
-	     tsk->rmid == rdtgrp->mon.rmid) ||
-	    (rdtgrp->type == RDTMON_GROUP && tsk->rmid == rdtgrp->mon.rmid &&
-	     tsk->closid == rdtgrp->mon.parent->closid))
+	if (task_in_rdtgroup(tsk, rdtgrp))
 		return 0;
 
 	/*
@@ -567,19 +579,14 @@ static int __rdtgroup_move_task(struct task_struct *tsk,
 	 * For monitor groups, can move the tasks only from
 	 * their parent CTRL group.
 	 */
-
-	if (rdtgrp->type == RDTCTRL_GROUP) {
-		WRITE_ONCE(tsk->closid, rdtgrp->closid);
-		WRITE_ONCE(tsk->rmid, rdtgrp->mon.rmid);
-	} else if (rdtgrp->type == RDTMON_GROUP) {
-		if (rdtgrp->mon.parent->closid == tsk->closid) {
-			WRITE_ONCE(tsk->rmid, rdtgrp->mon.rmid);
-		} else {
-			rdt_last_cmd_puts("Can't move task to different control group\n");
-			return -EINVAL;
-		}
+	if (rdtgrp->type == RDTMON_GROUP &&
+	    !resctrl_arch_match_closid(tsk, rdtgrp->mon.parent->closid)) {
+		rdt_last_cmd_puts("Can't move task to different control group\n");
+		return -EINVAL;
 	}
 
+	resctrl_arch_set_closid_rmid(tsk, rdtgrp->closid, rdtgrp->mon.rmid);
+
 	/*
 	 * Ensure the task's closid and rmid are written before determining if
 	 * the task is current that will decide if it will be interrupted.
@@ -599,14 +606,15 @@ static int __rdtgroup_move_task(struct task_struct *tsk,
 
 static bool is_closid_match(struct task_struct *t, struct rdtgroup *r)
 {
-	return (rdt_alloc_capable &&
-	       (r->type == RDTCTRL_GROUP) && (t->closid == r->closid));
+	return (rdt_alloc_capable && (r->type == RDTCTRL_GROUP) &&
+		resctrl_arch_match_closid(t, r->closid));
 }
 
 static bool is_rmid_match(struct task_struct *t, struct rdtgroup *r)
 {
-	return (rdt_mon_capable &&
-	       (r->type == RDTMON_GROUP) && (t->rmid == r->mon.rmid));
+	return (rdt_mon_capable && (r->type == RDTMON_GROUP) &&
+		resctrl_arch_match_rmid(t, r->mon.parent->closid,
+					r->mon.rmid));
 }
 
 /**
@@ -802,7 +810,7 @@ int proc_resctrl_show(struct seq_file *s, struct pid_namespace *ns,
 		    rdtg->mode != RDT_MODE_EXCLUSIVE)
 			continue;
 
-		if (rdtg->closid != tsk->closid)
+		if (!resctrl_arch_match_closid(tsk, rdtg->closid))
 			continue;
 
 		seq_printf(s, "res:%s%s\n", (rdtg == &rdtgroup_default) ? "/" : "",
@@ -810,7 +818,8 @@ int proc_resctrl_show(struct seq_file *s, struct pid_namespace *ns,
 		seq_puts(s, "mon:");
 		list_for_each_entry(crg, &rdtg->mon.crdtgrp_list,
 				    mon.crdtgrp_list) {
-			if (tsk->rmid != crg->mon.rmid)
+			if (!resctrl_arch_match_rmid(tsk, crg->mon.parent->closid,
+						     crg->mon.rmid))
 				continue;
 			seq_printf(s, "%s", crg->kn->name);
 			break;
@@ -2401,8 +2410,8 @@ static void rdt_move_group_tasks(struct rdtgroup *from, struct rdtgroup *to,
 	for_each_process_thread(p, t) {
 		if (!from || is_closid_match(t, from) ||
 		    is_rmid_match(t, from)) {
-			WRITE_ONCE(t->closid, to->closid);
-			WRITE_ONCE(t->rmid, to->mon.rmid);
+			resctrl_arch_set_closid_rmid(t, to->closid,
+						     to->mon.rmid);
 
 			/*
 			 * If the task is on a CPU, set the CPU in the mask.
-- 
2.30.2


^ permalink raw reply related	[flat|nested] 73+ messages in thread

* [PATCH v2 08/18] x86/resctrl: Queue mon_event_read() instead of sending an IPI
  2023-01-13 17:54 [PATCH v2 00/18] x86/resctrl: monitored closid+rmid together, separate arch/fs locking James Morse
                   ` (6 preceding siblings ...)
  2023-01-13 17:54 ` [PATCH v2 07/18] x86/resctrl: Move CLOSID/RMID matching and setting to use helpers James Morse
@ 2023-01-13 17:54 ` James Morse
  2023-01-17 18:29   ` Yu, Fenghua
  2023-02-02 23:47   ` Reinette Chatre
  2023-01-13 17:54 ` [PATCH v2 09/18] x86/resctrl: Allow resctrl_arch_rmid_read() to sleep James Morse
                   ` (10 subsequent siblings)
  18 siblings, 2 replies; 73+ messages in thread
From: James Morse @ 2023-01-13 17:54 UTC (permalink / raw)
  To: x86, linux-kernel
  Cc: Fenghua Yu, Reinette Chatre, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, H Peter Anvin, Babu Moger, James Morse,
	shameerali.kolothum.thodi, D Scott Phillips OS, carl, lcherian,
	bobo.shaobowang, tan.shaopeng, xingxin.hx, baolin.wang,
	Jamie Iles, Xin Hao, peternewman

x86 is blessed with an abundance of monitors, one per RMID, that can be
read from any CPU in the domain. MPAMs monitors reside in the MMIO MSC,
the number implemented is up to the manufacturer. This means when there are
fewer monitors than needed, they need to be allocated and freed.

Worse, the domain may be broken up into slices, and the MMIO accesses
for each slice may need performing from different CPUs.

These two details mean MPAMs monitor code needs to be able to sleep, and
IPI another CPU in the domain to read from a resource that has been sliced.

mon_event_read() already invokes mon_event_count() via IPI, which means
this isn't possible.

Change mon_event_read() to schedule mon_event_count() on a remote CPU and
wait, instead of sending an IPI. This function is only used in response to
a user-space filesystem request (not the timing sensitive overflow code).

This allows MPAM to hide the slice behaviour from resctrl, and to keep
the monitor-allocation in monitor.c.

Tested-by: Shaopeng Tan <tan.shaopeng@fujitsu.com>
Signed-off-by: James Morse <james.morse@arm.com>
---
 arch/x86/kernel/cpu/resctrl/ctrlmondata.c | 7 +++++--
 arch/x86/kernel/cpu/resctrl/internal.h    | 2 +-
 arch/x86/kernel/cpu/resctrl/monitor.c     | 6 ++++--
 3 files changed, 10 insertions(+), 5 deletions(-)

diff --git a/arch/x86/kernel/cpu/resctrl/ctrlmondata.c b/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
index 1df0e3262bca..4ee3da6dced7 100644
--- a/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
+++ b/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
@@ -532,8 +532,11 @@ void mon_event_read(struct rmid_read *rr, struct rdt_resource *r,
 		    struct rdt_domain *d, struct rdtgroup *rdtgrp,
 		    int evtid, int first)
 {
+	/* When picking a cpu from cpu_mask, ensure it can't race with cpuhp */
+	lockdep_assert_held(&rdtgroup_mutex);
+
 	/*
-	 * setup the parameters to send to the IPI to read the data.
+	 * setup the parameters to pass to mon_event_count() to read the data.
 	 */
 	rr->rgrp = rdtgrp;
 	rr->evtid = evtid;
@@ -542,7 +545,7 @@ void mon_event_read(struct rmid_read *rr, struct rdt_resource *r,
 	rr->val = 0;
 	rr->first = first;
 
-	smp_call_function_any(&d->cpu_mask, mon_event_count, rr, 1);
+	smp_call_on_cpu(cpumask_any(&d->cpu_mask), mon_event_count, rr, false);
 }
 
 int rdtgroup_mondata_show(struct seq_file *m, void *arg)
diff --git a/arch/x86/kernel/cpu/resctrl/internal.h b/arch/x86/kernel/cpu/resctrl/internal.h
index adae6231324f..1f90a10b75a1 100644
--- a/arch/x86/kernel/cpu/resctrl/internal.h
+++ b/arch/x86/kernel/cpu/resctrl/internal.h
@@ -514,7 +514,7 @@ void closid_free(int closid);
 int alloc_rmid(u32 closid);
 void free_rmid(u32 closid, u32 rmid);
 int rdt_get_mon_l3_config(struct rdt_resource *r);
-void mon_event_count(void *info);
+int mon_event_count(void *info);
 int rdtgroup_mondata_show(struct seq_file *m, void *arg);
 void mon_event_read(struct rmid_read *rr, struct rdt_resource *r,
 		    struct rdt_domain *d, struct rdtgroup *rdtgrp,
diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c
index 190ac183139e..d309b830aeb2 100644
--- a/arch/x86/kernel/cpu/resctrl/monitor.c
+++ b/arch/x86/kernel/cpu/resctrl/monitor.c
@@ -509,10 +509,10 @@ static void mbm_bw_count(u32 closid, u32 rmid, struct rmid_read *rr)
 }
 
 /*
- * This is called via IPI to read the CQM/MBM counters
+ * This is scheduled by mon_event_read() to read the CQM/MBM counters
  * on a domain.
  */
-void mon_event_count(void *info)
+int mon_event_count(void *info)
 {
 	struct rdtgroup *rdtgrp, *entry;
 	struct rmid_read *rr = info;
@@ -545,6 +545,8 @@ void mon_event_count(void *info)
 	 */
 	if (ret == 0)
 		rr->err = 0;
+
+	return 0;
 }
 
 /*
-- 
2.30.2


^ permalink raw reply related	[flat|nested] 73+ messages in thread

* [PATCH v2 09/18] x86/resctrl: Allow resctrl_arch_rmid_read() to sleep
  2023-01-13 17:54 [PATCH v2 00/18] x86/resctrl: monitored closid+rmid together, separate arch/fs locking James Morse
                   ` (7 preceding siblings ...)
  2023-01-13 17:54 ` [PATCH v2 08/18] x86/resctrl: Queue mon_event_read() instead of sending an IPI James Morse
@ 2023-01-13 17:54 ` James Morse
  2023-01-23 13:54   ` Peter Newman
  2023-01-23 15:33   ` Peter Newman
  2023-01-13 17:54 ` [PATCH v2 10/18] x86/resctrl: Allow arch to allocate memory needed in resctrl_arch_rmid_read() James Morse
                   ` (9 subsequent siblings)
  18 siblings, 2 replies; 73+ messages in thread
From: James Morse @ 2023-01-13 17:54 UTC (permalink / raw)
  To: x86, linux-kernel
  Cc: Fenghua Yu, Reinette Chatre, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, H Peter Anvin, Babu Moger, James Morse,
	shameerali.kolothum.thodi, D Scott Phillips OS, carl, lcherian,
	bobo.shaobowang, tan.shaopeng, xingxin.hx, baolin.wang,
	Jamie Iles, Xin Hao, peternewman

MPAM's cache occupancy counters can take a little while to settle once
the monitor has been configured. The maximum settling time is described
to the driver via a firmware table. The value could be large enough
that it makes sense to sleep.

To avoid exposing this to resctrl, it should be hidden behind MPAM's
resctrl_arch_rmid_read(). But add_rmid_to_limbo() calls
resctrl_arch_rmid_read() from a non-preemptible context.

add_rmid_to_limbo() is opportunistically reading the L3 occupancy counter
on this domain to avoid adding the RMID to limbo if this domain's value
has drifted below resctrl_rmid_realloc_threshold since the limbo handler
last ran. Determining 'this domain' involves disabling preeption to
prevent the thread being migrated to CPUs in a different domain between
the check and resctrl_arch_rmid_read() call. The check is skipped
for all remote domains.

Instead, call resctrl_arch_rmid_read() for each domain, and get it to
read the arch specific counter via IPI if its called on a CPU outside
the target domain. By covering remote domains, this change stops the
limbo handler from being started unnecessarily.

This also allows resctrl_arch_rmid_read() to sleep.

Tested-by: Shaopeng Tan <tan.shaopeng@fujitsu.com>
Signed-off-by: James Morse <james.morse@arm.com>
---
The alternative is to remove the counter read from this path altogether,
and assume user-space would never try to re-allocate the last RMID before
the limbo handler runs next.
---
 arch/x86/kernel/cpu/resctrl/monitor.c | 58 ++++++++++++++++++---------
 1 file changed, 38 insertions(+), 20 deletions(-)

diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c
index d309b830aeb2..d6ae4b713801 100644
--- a/arch/x86/kernel/cpu/resctrl/monitor.c
+++ b/arch/x86/kernel/cpu/resctrl/monitor.c
@@ -206,17 +206,19 @@ static u64 mbm_overflow_count(u64 prev_msr, u64 cur_msr, unsigned int width)
 	return chunks >> shift;
 }
 
-int resctrl_arch_rmid_read(struct rdt_resource *r, struct rdt_domain *d,
-			   u32 closid, u32 rmid, enum resctrl_event_id eventid,
-			   u64 *val)
+struct __rmid_read_arg
 {
-	struct rdt_hw_resource *hw_res = resctrl_to_arch_res(r);
-	struct rdt_hw_domain *hw_dom = resctrl_to_arch_dom(d);
-	struct arch_mbm_state *am;
-	u64 msr_val, chunks;
+	u32 rmid;
+	enum resctrl_event_id eventid;
 
-	if (!cpumask_test_cpu(smp_processor_id(), &d->cpu_mask))
-		return -EINVAL;
+	u64 msr_val;
+};
+
+static void __rmid_read(void *arg)
+{
+	enum resctrl_event_id eventid = ((struct __rmid_read_arg *)arg)->eventid;
+	u32 rmid = ((struct __rmid_read_arg *)arg)->rmid;
+	u64 msr_val;
 
 	/*
 	 * As per the SDM, when IA32_QM_EVTSEL.EvtID (bits 7:0) is configured
@@ -229,6 +231,28 @@ int resctrl_arch_rmid_read(struct rdt_resource *r, struct rdt_domain *d,
 	wrmsr(MSR_IA32_QM_EVTSEL, eventid, rmid);
 	rdmsrl(MSR_IA32_QM_CTR, msr_val);
 
+	((struct __rmid_read_arg *)arg)->msr_val = msr_val;
+}
+
+int resctrl_arch_rmid_read(struct rdt_resource *r, struct rdt_domain *d,
+			   u32 closid, u32 rmid, enum resctrl_event_id eventid,
+			   u64 *val)
+{
+	struct rdt_hw_resource *hw_res = resctrl_to_arch_res(r);
+	struct rdt_hw_domain *hw_dom = resctrl_to_arch_dom(d);
+	struct __rmid_read_arg arg;
+	struct arch_mbm_state *am;
+	u64 msr_val, chunks;
+	int err;
+
+	arg.rmid = rmid;
+	arg.eventid = eventid;
+
+	err = smp_call_function_any(&d->cpu_mask, __rmid_read, &arg, true);
+	if (err)
+		return err;
+
+	msr_val = arg.msr_val;
 	if (msr_val & RMID_VAL_ERROR)
 		return -EIO;
 	if (msr_val & RMID_VAL_UNAVAIL)
@@ -383,23 +407,18 @@ static void add_rmid_to_limbo(struct rmid_entry *entry)
 {
 	struct rdt_resource *r = &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl;
 	struct rdt_domain *d;
-	int cpu, err;
 	u64 val = 0;
 	u32 idx;
+	int err;
 
 	idx = resctrl_arch_rmid_idx_encode(entry->closid, entry->rmid);
 
 	entry->busy = 0;
-	cpu = get_cpu();
 	list_for_each_entry(d, &r->domains, list) {
-		if (cpumask_test_cpu(cpu, &d->cpu_mask)) {
-			err = resctrl_arch_rmid_read(r, d, entry->closid,
-						     entry->rmid,
-						     QOS_L3_OCCUP_EVENT_ID,
-						     &val);
-			if (err || val <= resctrl_rmid_realloc_threshold)
-				continue;
-		}
+		err = resctrl_arch_rmid_read(r, d, entry->closid, entry->rmid,
+					     QOS_L3_OCCUP_EVENT_ID, &val);
+		if (err || val <= resctrl_rmid_realloc_threshold)
+			continue;
 
 		/*
 		 * For the first limbo RMID in the domain,
@@ -410,7 +429,6 @@ static void add_rmid_to_limbo(struct rmid_entry *entry)
 		set_bit(idx, d->rmid_busy_llc);
 		entry->busy++;
 	}
-	put_cpu();
 
 	if (entry->busy)
 		rmid_limbo_count++;
-- 
2.30.2


^ permalink raw reply related	[flat|nested] 73+ messages in thread

* [PATCH v2 10/18] x86/resctrl: Allow arch to allocate memory needed in resctrl_arch_rmid_read()
  2023-01-13 17:54 [PATCH v2 00/18] x86/resctrl: monitored closid+rmid together, separate arch/fs locking James Morse
                   ` (8 preceding siblings ...)
  2023-01-13 17:54 ` [PATCH v2 09/18] x86/resctrl: Allow resctrl_arch_rmid_read() to sleep James Morse
@ 2023-01-13 17:54 ` James Morse
  2023-01-13 17:54 ` [PATCH v2 11/18] x86/resctrl: Make resctrl_mounted checks explicit James Morse
                   ` (8 subsequent siblings)
  18 siblings, 0 replies; 73+ messages in thread
From: James Morse @ 2023-01-13 17:54 UTC (permalink / raw)
  To: x86, linux-kernel
  Cc: Fenghua Yu, Reinette Chatre, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, H Peter Anvin, Babu Moger, James Morse,
	shameerali.kolothum.thodi, D Scott Phillips OS, carl, lcherian,
	bobo.shaobowang, tan.shaopeng, xingxin.hx, baolin.wang,
	Jamie Iles, Xin Hao, peternewman

Depending on the number of monitors available, Arm's MPAM may need to
allocate a monitor prior to reading the counter value. Allocating a
contended resource may involve sleeping.

All callers of resctrl_arch_rmid_read() read the counter on more than
one domain. If the monitor is allocated globally, there is no need to
allocate and free it for each call to resctrl_arch_rmid_read().

Add arch hooks for this allocation, which need calling before
resctrl_arch_rmid_read(). The allocated monitor is passed to
resctrl_arch_rmid_read(), then freed again afterwards. The helper
can be called on any CPU, and can sleep.

Tested-by: Shaopeng Tan <tan.shaopeng@fujitsu.com>
Signed-off-by: James Morse <james.morse@arm.com>
---
 arch/x86/include/asm/resctrl.h         | 11 +++++++
 arch/x86/kernel/cpu/resctrl/internal.h |  1 +
 arch/x86/kernel/cpu/resctrl/monitor.c  | 40 +++++++++++++++++++++++---
 include/linux/resctrl.h                |  4 +--
 4 files changed, 50 insertions(+), 6 deletions(-)

diff --git a/arch/x86/include/asm/resctrl.h b/arch/x86/include/asm/resctrl.h
index d589a82995ac..194a1570af7b 100644
--- a/arch/x86/include/asm/resctrl.h
+++ b/arch/x86/include/asm/resctrl.h
@@ -136,6 +136,17 @@ static inline u32 resctrl_arch_rmid_idx_encode(u32 closid, u32 rmid)
 	return rmid;
 }
 
+/* x86 can always read an rmid, nothing needs allocating */
+struct rdt_resource;
+static inline int resctrl_arch_mon_ctx_alloc(struct rdt_resource *r, int evtid)
+{
+	might_sleep();
+	return 0;
+};
+
+static inline void resctrl_arch_mon_ctx_free(struct rdt_resource *r, int evtid,
+					     int ctx) { };
+
 void resctrl_cpu_detect(struct cpuinfo_x86 *c);
 
 #else
diff --git a/arch/x86/kernel/cpu/resctrl/internal.h b/arch/x86/kernel/cpu/resctrl/internal.h
index 1f90a10b75a1..e85e454bec72 100644
--- a/arch/x86/kernel/cpu/resctrl/internal.h
+++ b/arch/x86/kernel/cpu/resctrl/internal.h
@@ -88,6 +88,7 @@ struct rmid_read {
 	bool			first;
 	int			err;
 	u64			val;
+	int			arch_mon_ctx;
 };
 
 extern bool rdt_alloc_capable;
diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c
index d6ae4b713801..4e248f4a5f59 100644
--- a/arch/x86/kernel/cpu/resctrl/monitor.c
+++ b/arch/x86/kernel/cpu/resctrl/monitor.c
@@ -15,6 +15,7 @@
  * Software Developer Manual June 2016, volume 3, section 17.17.
  */
 
+#include <linux/cpu.h>
 #include <linux/module.h>
 #include <linux/sizes.h>
 #include <linux/slab.h>
@@ -236,7 +237,7 @@ static void __rmid_read(void *arg)
 
 int resctrl_arch_rmid_read(struct rdt_resource *r, struct rdt_domain *d,
 			   u32 closid, u32 rmid, enum resctrl_event_id eventid,
-			   u64 *val)
+			   u64 *val, int ignored)
 {
 	struct rdt_hw_resource *hw_res = resctrl_to_arch_res(r);
 	struct rdt_hw_domain *hw_dom = resctrl_to_arch_dom(d);
@@ -285,9 +286,14 @@ void __check_limbo(struct rdt_domain *d, bool force_free)
 	u32 idx_limit = resctrl_arch_system_num_rmid_idx();
 	struct rmid_entry *entry;
 	u32 idx, cur_idx = 1;
+	int arch_mon_ctx;
 	bool rmid_dirty;
 	u64 val = 0;
 
+	arch_mon_ctx = resctrl_arch_mon_ctx_alloc(r, QOS_L3_OCCUP_EVENT_ID);
+	if (arch_mon_ctx < 0)
+		return;
+
 	/*
 	 * Skip RMID 0 and start from RMID 1 and check all the RMIDs that
 	 * are marked as busy for occupancy < threshold. If the occupancy
@@ -301,7 +307,8 @@ void __check_limbo(struct rdt_domain *d, bool force_free)
 
 		entry = __rmid_entry(idx);
 		if (resctrl_arch_rmid_read(r, d, entry->closid, entry->rmid,
-					   QOS_L3_OCCUP_EVENT_ID, &val)) {
+					   QOS_L3_OCCUP_EVENT_ID, &val,
+					   arch_mon_ctx)) {
 			rmid_dirty = true;
 		} else {
 			rmid_dirty = (val >= resctrl_rmid_realloc_threshold);
@@ -316,6 +323,8 @@ void __check_limbo(struct rdt_domain *d, bool force_free)
 		}
 		cur_idx = idx + 1;
 	}
+
+	resctrl_arch_mon_ctx_free(r, QOS_L3_OCCUP_EVENT_ID, arch_mon_ctx);
 }
 
 bool has_busy_rmid(struct rdt_resource *r, struct rdt_domain *d)
@@ -407,16 +416,22 @@ static void add_rmid_to_limbo(struct rmid_entry *entry)
 {
 	struct rdt_resource *r = &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl;
 	struct rdt_domain *d;
+	int arch_mon_ctx;
 	u64 val = 0;
 	u32 idx;
 	int err;
 
 	idx = resctrl_arch_rmid_idx_encode(entry->closid, entry->rmid);
 
+	arch_mon_ctx = resctrl_arch_mon_ctx_alloc(r, QOS_L3_OCCUP_EVENT_ID);
+	if (arch_mon_ctx < 0)
+		return;
+
 	entry->busy = 0;
 	list_for_each_entry(d, &r->domains, list) {
 		err = resctrl_arch_rmid_read(r, d, entry->closid, entry->rmid,
-					     QOS_L3_OCCUP_EVENT_ID, &val);
+					     QOS_L3_OCCUP_EVENT_ID, &val,
+					     arch_mon_ctx);
 		if (err || val <= resctrl_rmid_realloc_threshold)
 			continue;
 
@@ -429,6 +444,7 @@ static void add_rmid_to_limbo(struct rmid_entry *entry)
 		set_bit(idx, d->rmid_busy_llc);
 		entry->busy++;
 	}
+	resctrl_arch_mon_ctx_free(r, QOS_L3_OCCUP_EVENT_ID, arch_mon_ctx);
 
 	if (entry->busy)
 		rmid_limbo_count++;
@@ -465,7 +481,7 @@ static int __mon_event_count(u32 closid, u32 rmid, struct rmid_read *rr)
 		resctrl_arch_reset_rmid(rr->r, rr->d, closid, rmid, rr->evtid);
 
 	rr->err = resctrl_arch_rmid_read(rr->r, rr->d, closid, rmid, rr->evtid,
-					 &tval);
+					 &tval, rr->arch_mon_ctx);
 	if (rr->err)
 		return rr->err;
 
@@ -538,6 +554,9 @@ int mon_event_count(void *info)
 	int ret;
 
 	rdtgrp = rr->rgrp;
+	rr->arch_mon_ctx = resctrl_arch_mon_ctx_alloc(rr->r, rr->evtid);
+	if (rr->arch_mon_ctx < 0)
+		return rr->arch_mon_ctx;
 
 	ret = __mon_event_count(rdtgrp->closid, rdtgrp->mon.rmid, rr);
 
@@ -564,6 +583,8 @@ int mon_event_count(void *info)
 	if (ret == 0)
 		rr->err = 0;
 
+	resctrl_arch_mon_ctx_free(rr->r, rr->evtid, rr->arch_mon_ctx);
+
 	return 0;
 }
 
@@ -700,11 +721,21 @@ static void mbm_update(struct rdt_resource *r, struct rdt_domain *d,
 	if (is_mbm_total_enabled()) {
 		rr.evtid = QOS_L3_MBM_TOTAL_EVENT_ID;
 		rr.val = 0;
+		rr.arch_mon_ctx = resctrl_arch_mon_ctx_alloc(rr.r, rr.evtid);
+		if (rr.arch_mon_ctx < 0)
+			return;
+
 		__mon_event_count(closid, rmid, &rr);
+
+		resctrl_arch_mon_ctx_free(rr.r, rr.evtid, rr.arch_mon_ctx);
 	}
 	if (is_mbm_local_enabled()) {
 		rr.evtid = QOS_L3_MBM_LOCAL_EVENT_ID;
 		rr.val = 0;
+		rr.arch_mon_ctx = resctrl_arch_mon_ctx_alloc(rr.r, rr.evtid);
+		if (rr.arch_mon_ctx < 0)
+			return;
+
 		__mon_event_count(closid, rmid, &rr);
 
 		/*
@@ -714,6 +745,7 @@ static void mbm_update(struct rdt_resource *r, struct rdt_domain *d,
 		 */
 		if (is_mba_sc(NULL))
 			mbm_bw_count(closid, rmid, &rr);
+		resctrl_arch_mon_ctx_free(rr.r, rr.evtid, rr.arch_mon_ctx);
 	}
 }
 
diff --git a/include/linux/resctrl.h b/include/linux/resctrl.h
index 57d32c3ce06f..d90d3dca48e9 100644
--- a/include/linux/resctrl.h
+++ b/include/linux/resctrl.h
@@ -230,6 +230,7 @@ void resctrl_offline_domain(struct rdt_resource *r, struct rdt_domain *d);
  * @rmid:		rmid of the counter to read.
  * @eventid:		eventid to read, e.g. L3 occupancy.
  * @val:		result of the counter read in bytes.
+ * @arch_mon_ctx:	An allocated context from resctrl_arch_mon_ctx_alloc().
  *
  * Call from process context on a CPU that belongs to domain @d.
  *
@@ -238,8 +239,7 @@ void resctrl_offline_domain(struct rdt_resource *r, struct rdt_domain *d);
  */
 int resctrl_arch_rmid_read(struct rdt_resource *r, struct rdt_domain *d,
 			   u32 closid, u32 rmid, enum resctrl_event_id eventid,
-			   u64 *val);
-
+			   u64 *val, int arch_mon_ctx);
 
 /**
  * resctrl_arch_reset_rmid() - Reset any private state associated with rmid
-- 
2.30.2


^ permalink raw reply related	[flat|nested] 73+ messages in thread

* [PATCH v2 11/18] x86/resctrl: Make resctrl_mounted checks explicit
  2023-01-13 17:54 [PATCH v2 00/18] x86/resctrl: monitored closid+rmid together, separate arch/fs locking James Morse
                   ` (9 preceding siblings ...)
  2023-01-13 17:54 ` [PATCH v2 10/18] x86/resctrl: Allow arch to allocate memory needed in resctrl_arch_rmid_read() James Morse
@ 2023-01-13 17:54 ` James Morse
  2023-01-13 17:54 ` [PATCH v2 12/18] x86/resctrl: Move alloc/mon static keys into helpers James Morse
                   ` (7 subsequent siblings)
  18 siblings, 0 replies; 73+ messages in thread
From: James Morse @ 2023-01-13 17:54 UTC (permalink / raw)
  To: x86, linux-kernel
  Cc: Fenghua Yu, Reinette Chatre, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, H Peter Anvin, Babu Moger, James Morse,
	shameerali.kolothum.thodi, D Scott Phillips OS, carl, lcherian,
	bobo.shaobowang, tan.shaopeng, xingxin.hx, baolin.wang,
	Jamie Iles, Xin Hao, peternewman

The rdt_enable_key is switched when resctrl is mounted, and used to
prevent a second mount of the filesystem. It also enables the
architecture's context switch code.

This requires another architecture to have the same set of static-keys,
as resctrl depends on them too.

Make the resctrl_mounted checks explicit: resctrl can keep track of
whether it has been mounted once. This doesn't need to be combined with
whether the arch code is context switching the CLOSID.
Tests against the rdt_mon_enable_key become a test that resctrl is
mounted and that monitoring is enabled.

This will allow the static-key changing to be moved behind resctrl_arch_
calls.

Tested-by: Shaopeng Tan <tan.shaopeng@fujitsu.com>
Signed-off-by: James Morse <james.morse@arm.com>
---
 arch/x86/kernel/cpu/resctrl/internal.h |  1 +
 arch/x86/kernel/cpu/resctrl/monitor.c  |  5 +++--
 arch/x86/kernel/cpu/resctrl/rdtgroup.c | 17 +++++++++++------
 3 files changed, 15 insertions(+), 8 deletions(-)

diff --git a/arch/x86/kernel/cpu/resctrl/internal.h b/arch/x86/kernel/cpu/resctrl/internal.h
index e85e454bec72..65d85e1ef75f 100644
--- a/arch/x86/kernel/cpu/resctrl/internal.h
+++ b/arch/x86/kernel/cpu/resctrl/internal.h
@@ -95,6 +95,7 @@ extern bool rdt_alloc_capable;
 extern bool rdt_mon_capable;
 extern unsigned int rdt_mon_features;
 extern struct list_head resctrl_schema_all;
+extern bool resctrl_mounted;
 
 enum rdt_group_type {
 	RDTCTRL_GROUP = 0,
diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c
index 4e248f4a5f59..4ff258b49e9c 100644
--- a/arch/x86/kernel/cpu/resctrl/monitor.c
+++ b/arch/x86/kernel/cpu/resctrl/monitor.c
@@ -795,7 +795,7 @@ void mbm_handle_overflow(struct work_struct *work)
 
 	mutex_lock(&rdtgroup_mutex);
 
-	if (!static_branch_likely(&rdt_mon_enable_key))
+	if (!resctrl_mounted || !static_branch_likely(&rdt_mon_enable_key))
 		goto out_unlock;
 
 	r = &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl;
@@ -823,8 +823,9 @@ void mbm_setup_overflow_handler(struct rdt_domain *dom, unsigned long delay_ms)
 	unsigned long delay = msecs_to_jiffies(delay_ms);
 	int cpu;
 
-	if (!static_branch_likely(&rdt_mon_enable_key))
+	if (!resctrl_mounted || !static_branch_likely(&rdt_mon_enable_key))
 		return;
+
 	cpu = cpumask_any(&dom->cpu_mask);
 	dom->mbm_work_cpu = cpu;
 	schedule_delayed_work_on(cpu, &dom->mbm_over, delay);
diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
index ced7400decae..370da7077c67 100644
--- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
+++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
@@ -42,6 +42,9 @@ LIST_HEAD(rdt_all_groups);
 /* list of entries for the schemata file */
 LIST_HEAD(resctrl_schema_all);
 
+/* the filesystem can only be mounted once */
+bool resctrl_mounted;
+
 /* Kernel fs node for "info" directory under root */
 static struct kernfs_node *kn_info;
 
@@ -794,7 +797,7 @@ int proc_resctrl_show(struct seq_file *s, struct pid_namespace *ns,
 	mutex_lock(&rdtgroup_mutex);
 
 	/* Return empty if resctrl has not been mounted. */
-	if (!static_branch_unlikely(&rdt_enable_key)) {
+	if (!resctrl_mounted) {
 		seq_puts(s, "res:\nmon:\n");
 		goto unlock;
 	}
@@ -2196,7 +2199,7 @@ static int rdt_get_tree(struct fs_context *fc)
 	/*
 	 * resctrl file system can only be mounted once.
 	 */
-	if (static_branch_unlikely(&rdt_enable_key)) {
+	if (resctrl_mounted) {
 		ret = -EBUSY;
 		goto out;
 	}
@@ -2244,8 +2247,10 @@ static int rdt_get_tree(struct fs_context *fc)
 	if (rdt_mon_capable)
 		static_branch_enable_cpuslocked(&rdt_mon_enable_key);
 
-	if (rdt_alloc_capable || rdt_mon_capable)
+	if (rdt_alloc_capable || rdt_mon_capable) {
 		static_branch_enable_cpuslocked(&rdt_enable_key);
+		resctrl_mounted = true;
+	}
 
 	if (is_mbm_enabled()) {
 		r = &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl;
@@ -2512,6 +2517,7 @@ static void rdt_kill_sb(struct super_block *sb)
 	static_branch_disable_cpuslocked(&rdt_alloc_enable_key);
 	static_branch_disable_cpuslocked(&rdt_mon_enable_key);
 	static_branch_disable_cpuslocked(&rdt_enable_key);
+	resctrl_mounted = false;
 	kernfs_kill_sb(sb);
 	mutex_unlock(&rdtgroup_mutex);
 	cpus_read_unlock();
@@ -3336,7 +3342,7 @@ void resctrl_offline_domain(struct rdt_resource *r, struct rdt_domain *d)
 	 * If resctrl is mounted, remove all the
 	 * per domain monitor data directories.
 	 */
-	if (static_branch_unlikely(&rdt_mon_enable_key))
+	if (resctrl_mounted && static_branch_unlikely(&rdt_mon_enable_key))
 		rmdir_mondata_subdir_allrdtgrp(r, d->id);
 
 	if (is_mbm_enabled())
@@ -3413,8 +3419,7 @@ int resctrl_online_domain(struct rdt_resource *r, struct rdt_domain *d)
 	if (is_llc_occupancy_enabled())
 		INIT_DELAYED_WORK(&d->cqm_limbo, cqm_handle_limbo);
 
-	/* If resctrl is mounted, add per domain monitor data directories. */
-	if (static_branch_unlikely(&rdt_mon_enable_key))
+	if (resctrl_mounted && static_branch_unlikely(&rdt_mon_enable_key))
 		mkdir_mondata_subdir_allrdtgrp(r, d);
 
 	return 0;
-- 
2.30.2


^ permalink raw reply related	[flat|nested] 73+ messages in thread

* [PATCH v2 12/18] x86/resctrl: Move alloc/mon static keys into helpers
  2023-01-13 17:54 [PATCH v2 00/18] x86/resctrl: monitored closid+rmid together, separate arch/fs locking James Morse
                   ` (10 preceding siblings ...)
  2023-01-13 17:54 ` [PATCH v2 11/18] x86/resctrl: Make resctrl_mounted checks explicit James Morse
@ 2023-01-13 17:54 ` James Morse
  2023-01-13 17:54 ` [PATCH v2 13/18] x86/resctrl: Make rdt_enable_key the arch's decision to switch James Morse
                   ` (6 subsequent siblings)
  18 siblings, 0 replies; 73+ messages in thread
From: James Morse @ 2023-01-13 17:54 UTC (permalink / raw)
  To: x86, linux-kernel
  Cc: Fenghua Yu, Reinette Chatre, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, H Peter Anvin, Babu Moger, James Morse,
	shameerali.kolothum.thodi, D Scott Phillips OS, carl, lcherian,
	bobo.shaobowang, tan.shaopeng, xingxin.hx, baolin.wang,
	Jamie Iles, Xin Hao, peternewman

resctrl enables three static keys depending on the features it has enabled.
Another architecture's context switch code may look different, any
static keys that control it should be buried behind helpers.

Move the alloc/mon logic into arch-specific helpers as a preparatory step
for making the rdt_enable_key's status something the arch code decides.

This means other architectures don't have to mirror the static keys.

Tested-by: Shaopeng Tan <tan.shaopeng@fujitsu.com>
Signed-off-by: James Morse <james.morse@arm.com>
---
 arch/x86/include/asm/resctrl.h         | 20 ++++++++++++++++++++
 arch/x86/kernel/cpu/resctrl/internal.h |  5 -----
 arch/x86/kernel/cpu/resctrl/rdtgroup.c |  8 ++++----
 3 files changed, 24 insertions(+), 9 deletions(-)

diff --git a/arch/x86/include/asm/resctrl.h b/arch/x86/include/asm/resctrl.h
index 194a1570af7b..d6889aea5496 100644
--- a/arch/x86/include/asm/resctrl.h
+++ b/arch/x86/include/asm/resctrl.h
@@ -42,6 +42,26 @@ DECLARE_STATIC_KEY_FALSE(rdt_enable_key);
 DECLARE_STATIC_KEY_FALSE(rdt_alloc_enable_key);
 DECLARE_STATIC_KEY_FALSE(rdt_mon_enable_key);
 
+static inline void resctrl_arch_enable_alloc(void)
+{
+	static_branch_enable_cpuslocked(&rdt_alloc_enable_key);
+}
+
+static inline void resctrl_arch_disable_alloc(void)
+{
+	static_branch_disable_cpuslocked(&rdt_alloc_enable_key);
+}
+
+static inline void resctrl_arch_enable_mon(void)
+{
+	static_branch_enable_cpuslocked(&rdt_mon_enable_key);
+}
+
+static inline void resctrl_arch_disable_mon(void)
+{
+	static_branch_disable_cpuslocked(&rdt_mon_enable_key);
+}
+
 /*
  * __resctrl_sched_in() - Writes the task's CLOSid/RMID to IA32_PQR_MSR
  *
diff --git a/arch/x86/kernel/cpu/resctrl/internal.h b/arch/x86/kernel/cpu/resctrl/internal.h
index 65d85e1ef75f..3997386cee89 100644
--- a/arch/x86/kernel/cpu/resctrl/internal.h
+++ b/arch/x86/kernel/cpu/resctrl/internal.h
@@ -47,9 +47,6 @@ static inline struct rdt_fs_context *rdt_fc2context(struct fs_context *fc)
 	return container_of(kfc, struct rdt_fs_context, kfc);
 }
 
-DECLARE_STATIC_KEY_FALSE(rdt_enable_key);
-DECLARE_STATIC_KEY_FALSE(rdt_mon_enable_key);
-
 /**
  * struct mon_evt - Entry in the event list of a resource
  * @evtid:		event id
@@ -405,8 +402,6 @@ extern struct mutex rdtgroup_mutex;
 
 extern struct rdt_hw_resource rdt_resources_all[];
 extern struct rdtgroup rdtgroup_default;
-DECLARE_STATIC_KEY_FALSE(rdt_alloc_enable_key);
-
 extern struct dentry *debugfs_resctrl;
 
 enum resctrl_res_level {
diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
index 370da7077c67..bbad906f573a 100644
--- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
+++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
@@ -2243,9 +2243,9 @@ static int rdt_get_tree(struct fs_context *fc)
 		goto out_psl;
 
 	if (rdt_alloc_capable)
-		static_branch_enable_cpuslocked(&rdt_alloc_enable_key);
+		resctrl_arch_enable_alloc();
 	if (rdt_mon_capable)
-		static_branch_enable_cpuslocked(&rdt_mon_enable_key);
+		resctrl_arch_enable_mon();
 
 	if (rdt_alloc_capable || rdt_mon_capable) {
 		static_branch_enable_cpuslocked(&rdt_enable_key);
@@ -2514,8 +2514,8 @@ static void rdt_kill_sb(struct super_block *sb)
 	rdt_pseudo_lock_release();
 	rdtgroup_default.mode = RDT_MODE_SHAREABLE;
 	schemata_list_destroy();
-	static_branch_disable_cpuslocked(&rdt_alloc_enable_key);
-	static_branch_disable_cpuslocked(&rdt_mon_enable_key);
+	resctrl_arch_disable_alloc();
+	resctrl_arch_disable_mon();
 	static_branch_disable_cpuslocked(&rdt_enable_key);
 	resctrl_mounted = false;
 	kernfs_kill_sb(sb);
-- 
2.30.2


^ permalink raw reply related	[flat|nested] 73+ messages in thread

* [PATCH v2 13/18] x86/resctrl: Make rdt_enable_key the arch's decision to switch
  2023-01-13 17:54 [PATCH v2 00/18] x86/resctrl: monitored closid+rmid together, separate arch/fs locking James Morse
                   ` (11 preceding siblings ...)
  2023-01-13 17:54 ` [PATCH v2 12/18] x86/resctrl: Move alloc/mon static keys into helpers James Morse
@ 2023-01-13 17:54 ` James Morse
  2023-02-02 23:48   ` Reinette Chatre
  2023-01-13 17:54 ` [PATCH v2 14/18] x86/resctrl: Add helpers for system wide mon/alloc capable James Morse
                   ` (5 subsequent siblings)
  18 siblings, 1 reply; 73+ messages in thread
From: James Morse @ 2023-01-13 17:54 UTC (permalink / raw)
  To: x86, linux-kernel
  Cc: Fenghua Yu, Reinette Chatre, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, H Peter Anvin, Babu Moger, James Morse,
	shameerali.kolothum.thodi, D Scott Phillips OS, carl, lcherian,
	bobo.shaobowang, tan.shaopeng, xingxin.hx, baolin.wang,
	Jamie Iles, Xin Hao, peternewman

rdt_enable_key is switched when resctrl is mounted. It was also previously
used to prevent a second mount of the filesystem.

Any other architecture that wants to support resctrl has to provide
identical static keys.

Now that we have helpers for enablign and disabling the alloc/mon keys,
resctrl doesn't need to switch this extra key, it can be done by the arch
code. Use the static-key increment and decrement helpers, and change
resctrl to ensure the calls are balanced.

Tested-by: Shaopeng Tan <tan.shaopeng@fujitsu.com>
Signed-off-by: James Morse <james.morse@arm.com>
---
 arch/x86/include/asm/resctrl.h         |  4 ++++
 arch/x86/kernel/cpu/resctrl/rdtgroup.c | 11 +++++------
 2 files changed, 9 insertions(+), 6 deletions(-)

diff --git a/arch/x86/include/asm/resctrl.h b/arch/x86/include/asm/resctrl.h
index d6889aea5496..5b5ae6d8a343 100644
--- a/arch/x86/include/asm/resctrl.h
+++ b/arch/x86/include/asm/resctrl.h
@@ -45,21 +45,25 @@ DECLARE_STATIC_KEY_FALSE(rdt_mon_enable_key);
 static inline void resctrl_arch_enable_alloc(void)
 {
 	static_branch_enable_cpuslocked(&rdt_alloc_enable_key);
+	static_branch_inc_cpuslocked(&rdt_enable_key);
 }
 
 static inline void resctrl_arch_disable_alloc(void)
 {
 	static_branch_disable_cpuslocked(&rdt_alloc_enable_key);
+	static_branch_dec_cpuslocked(&rdt_enable_key);
 }
 
 static inline void resctrl_arch_enable_mon(void)
 {
 	static_branch_enable_cpuslocked(&rdt_mon_enable_key);
+	static_branch_inc_cpuslocked(&rdt_enable_key);
 }
 
 static inline void resctrl_arch_disable_mon(void)
 {
 	static_branch_disable_cpuslocked(&rdt_mon_enable_key);
+	static_branch_dec_cpuslocked(&rdt_enable_key);
 }
 
 /*
diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
index bbad906f573a..0e22f8361392 100644
--- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
+++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
@@ -2247,10 +2247,8 @@ static int rdt_get_tree(struct fs_context *fc)
 	if (rdt_mon_capable)
 		resctrl_arch_enable_mon();
 
-	if (rdt_alloc_capable || rdt_mon_capable) {
-		static_branch_enable_cpuslocked(&rdt_enable_key);
+	if (rdt_alloc_capable || rdt_mon_capable)
 		resctrl_mounted = true;
-	}
 
 	if (is_mbm_enabled()) {
 		r = &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl;
@@ -2514,9 +2512,10 @@ static void rdt_kill_sb(struct super_block *sb)
 	rdt_pseudo_lock_release();
 	rdtgroup_default.mode = RDT_MODE_SHAREABLE;
 	schemata_list_destroy();
-	resctrl_arch_disable_alloc();
-	resctrl_arch_disable_mon();
-	static_branch_disable_cpuslocked(&rdt_enable_key);
+	if (rdt_alloc_capable)
+		resctrl_arch_disable_alloc();
+	if (rdt_mon_capable)
+		resctrl_arch_disable_mon();
 	resctrl_mounted = false;
 	kernfs_kill_sb(sb);
 	mutex_unlock(&rdtgroup_mutex);
-- 
2.30.2


^ permalink raw reply related	[flat|nested] 73+ messages in thread

* [PATCH v2 14/18] x86/resctrl: Add helpers for system wide mon/alloc capable
  2023-01-13 17:54 [PATCH v2 00/18] x86/resctrl: monitored closid+rmid together, separate arch/fs locking James Morse
                   ` (12 preceding siblings ...)
  2023-01-13 17:54 ` [PATCH v2 13/18] x86/resctrl: Make rdt_enable_key the arch's decision to switch James Morse
@ 2023-01-13 17:54 ` James Morse
  2023-01-25  7:16   ` Shaopeng Tan (Fujitsu)
  2023-01-13 17:54 ` [PATCH v2 15/18] x86/resctrl: Add cpu online callback for resctrl work James Morse
                   ` (4 subsequent siblings)
  18 siblings, 1 reply; 73+ messages in thread
From: James Morse @ 2023-01-13 17:54 UTC (permalink / raw)
  To: x86, linux-kernel
  Cc: Fenghua Yu, Reinette Chatre, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, H Peter Anvin, Babu Moger, James Morse,
	shameerali.kolothum.thodi, D Scott Phillips OS, carl, lcherian,
	bobo.shaobowang, tan.shaopeng, xingxin.hx, baolin.wang,
	Jamie Iles, Xin Hao, peternewman

resctrl reads rdt_alloc_capable or rdt_mon_capable to determine
whether any of the resources support the corresponding features.
resctrl also uses the static-keys that affect the architecture's
context-switch code to determine the same thing.

This forces another architecture to have the same static-keys.

As the static-key is enabled based on the capable flag, and none of
the filesystem uses of these are in the scheduler path, move the
capable flags behind helpers, and use these in the filesystem
code instead of the static-key.

After this change, only the architecture code manages and uses
the static-keys to ensure __resctrl_sched_in() does not need
runtime checks.

This avoids multiple architectures having to define the same
static-keys.

Tested-by: Shaopeng Tan <tan.shaopeng@fujitsu.com>
Signed-off-by: James Morse <james.morse@arm.com>

---
Changes since v1:
 * Added missing conversion in mkdir_rdt_prepare_rmid_free()
---
 arch/x86/include/asm/resctrl.h            | 13 +++++++++
 arch/x86/kernel/cpu/resctrl/internal.h    |  2 --
 arch/x86/kernel/cpu/resctrl/monitor.c     |  4 +--
 arch/x86/kernel/cpu/resctrl/pseudo_lock.c |  6 ++--
 arch/x86/kernel/cpu/resctrl/rdtgroup.c    | 34 +++++++++++------------
 5 files changed, 35 insertions(+), 24 deletions(-)

diff --git a/arch/x86/include/asm/resctrl.h b/arch/x86/include/asm/resctrl.h
index 5b5ae6d8a343..3364d640f791 100644
--- a/arch/x86/include/asm/resctrl.h
+++ b/arch/x86/include/asm/resctrl.h
@@ -38,10 +38,18 @@ struct resctrl_pqr_state {
 
 DECLARE_PER_CPU(struct resctrl_pqr_state, pqr_state);
 
+extern bool rdt_alloc_capable;
+extern bool rdt_mon_capable;
+
 DECLARE_STATIC_KEY_FALSE(rdt_enable_key);
 DECLARE_STATIC_KEY_FALSE(rdt_alloc_enable_key);
 DECLARE_STATIC_KEY_FALSE(rdt_mon_enable_key);
 
+static inline bool resctrl_arch_alloc_capable(void)
+{
+	return rdt_alloc_capable;
+}
+
 static inline void resctrl_arch_enable_alloc(void)
 {
 	static_branch_enable_cpuslocked(&rdt_alloc_enable_key);
@@ -54,6 +62,11 @@ static inline void resctrl_arch_disable_alloc(void)
 	static_branch_dec_cpuslocked(&rdt_enable_key);
 }
 
+static inline bool resctrl_arch_mon_capable(void)
+{
+	return rdt_mon_capable;
+}
+
 static inline void resctrl_arch_enable_mon(void)
 {
 	static_branch_enable_cpuslocked(&rdt_mon_enable_key);
diff --git a/arch/x86/kernel/cpu/resctrl/internal.h b/arch/x86/kernel/cpu/resctrl/internal.h
index 3997386cee89..a1bf97adee2e 100644
--- a/arch/x86/kernel/cpu/resctrl/internal.h
+++ b/arch/x86/kernel/cpu/resctrl/internal.h
@@ -88,8 +88,6 @@ struct rmid_read {
 	int			arch_mon_ctx;
 };
 
-extern bool rdt_alloc_capable;
-extern bool rdt_mon_capable;
 extern unsigned int rdt_mon_features;
 extern struct list_head resctrl_schema_all;
 extern bool resctrl_mounted;
diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c
index 4ff258b49e9c..1a214bd32ed4 100644
--- a/arch/x86/kernel/cpu/resctrl/monitor.c
+++ b/arch/x86/kernel/cpu/resctrl/monitor.c
@@ -795,7 +795,7 @@ void mbm_handle_overflow(struct work_struct *work)
 
 	mutex_lock(&rdtgroup_mutex);
 
-	if (!resctrl_mounted || !static_branch_likely(&rdt_mon_enable_key))
+	if (!resctrl_mounted || !resctrl_arch_mon_capable())
 		goto out_unlock;
 
 	r = &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl;
@@ -823,7 +823,7 @@ void mbm_setup_overflow_handler(struct rdt_domain *dom, unsigned long delay_ms)
 	unsigned long delay = msecs_to_jiffies(delay_ms);
 	int cpu;
 
-	if (!resctrl_mounted || !static_branch_likely(&rdt_mon_enable_key))
+	if (!resctrl_mounted || !resctrl_arch_mon_capable())
 		return;
 
 	cpu = cpumask_any(&dom->cpu_mask);
diff --git a/arch/x86/kernel/cpu/resctrl/pseudo_lock.c b/arch/x86/kernel/cpu/resctrl/pseudo_lock.c
index 3b724a40d3a2..0b4fdb118643 100644
--- a/arch/x86/kernel/cpu/resctrl/pseudo_lock.c
+++ b/arch/x86/kernel/cpu/resctrl/pseudo_lock.c
@@ -567,7 +567,7 @@ static int rdtgroup_locksetup_user_restrict(struct rdtgroup *rdtgrp)
 	if (ret)
 		goto err_cpus;
 
-	if (rdt_mon_capable) {
+	if (resctrl_arch_mon_capable()) {
 		ret = rdtgroup_kn_mode_restrict(rdtgrp, "mon_groups");
 		if (ret)
 			goto err_cpus_list;
@@ -614,7 +614,7 @@ static int rdtgroup_locksetup_user_restore(struct rdtgroup *rdtgrp)
 	if (ret)
 		goto err_cpus;
 
-	if (rdt_mon_capable) {
+	if (resctrl_arch_mon_capable()) {
 		ret = rdtgroup_kn_mode_restore(rdtgrp, "mon_groups", 0777);
 		if (ret)
 			goto err_cpus_list;
@@ -762,7 +762,7 @@ int rdtgroup_locksetup_exit(struct rdtgroup *rdtgrp)
 {
 	int ret;
 
-	if (rdt_mon_capable) {
+	if (resctrl_arch_mon_capable()) {
 		ret = alloc_rmid(rdtgrp->closid);
 		if (ret < 0) {
 			rdt_last_cmd_puts("Out of RMIDs\n");
diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
index 0e22f8361392..44e6d6fbab25 100644
--- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
+++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
@@ -609,13 +609,13 @@ static int __rdtgroup_move_task(struct task_struct *tsk,
 
 static bool is_closid_match(struct task_struct *t, struct rdtgroup *r)
 {
-	return (rdt_alloc_capable && (r->type == RDTCTRL_GROUP) &&
+	return (resctrl_arch_alloc_capable() && (r->type == RDTCTRL_GROUP) &&
 		resctrl_arch_match_closid(t, r->closid));
 }
 
 static bool is_rmid_match(struct task_struct *t, struct rdtgroup *r)
 {
-	return (rdt_mon_capable && (r->type == RDTMON_GROUP) &&
+	return (resctrl_arch_mon_capable() && (r->type == RDTMON_GROUP) &&
 		resctrl_arch_match_rmid(t, r->mon.parent->closid,
 					r->mon.rmid));
 }
@@ -2220,7 +2220,7 @@ static int rdt_get_tree(struct fs_context *fc)
 	if (ret < 0)
 		goto out_schemata_free;
 
-	if (rdt_mon_capable) {
+	if (resctrl_arch_mon_capable()) {
 		ret = mongroup_create_dir(rdtgroup_default.kn,
 					  &rdtgroup_default, "mon_groups",
 					  &kn_mongrp);
@@ -2242,12 +2242,12 @@ static int rdt_get_tree(struct fs_context *fc)
 	if (ret < 0)
 		goto out_psl;
 
-	if (rdt_alloc_capable)
+	if (resctrl_arch_alloc_capable())
 		resctrl_arch_enable_alloc();
-	if (rdt_mon_capable)
+	if (resctrl_arch_mon_capable())
 		resctrl_arch_enable_mon();
 
-	if (rdt_alloc_capable || rdt_mon_capable)
+	if (resctrl_arch_alloc_capable() || resctrl_arch_mon_capable())
 		resctrl_mounted = true;
 
 	if (is_mbm_enabled()) {
@@ -2261,10 +2261,10 @@ static int rdt_get_tree(struct fs_context *fc)
 out_psl:
 	rdt_pseudo_lock_release();
 out_mondata:
-	if (rdt_mon_capable)
+	if (resctrl_arch_mon_capable())
 		kernfs_remove(kn_mondata);
 out_mongrp:
-	if (rdt_mon_capable)
+	if (resctrl_arch_mon_capable())
 		kernfs_remove(kn_mongrp);
 out_info:
 	kernfs_remove(kn_info);
@@ -2512,9 +2512,9 @@ static void rdt_kill_sb(struct super_block *sb)
 	rdt_pseudo_lock_release();
 	rdtgroup_default.mode = RDT_MODE_SHAREABLE;
 	schemata_list_destroy();
-	if (rdt_alloc_capable)
+	if (resctrl_arch_alloc_capable())
 		resctrl_arch_disable_alloc();
-	if (rdt_mon_capable)
+	if (resctrl_arch_mon_capable())
 		resctrl_arch_disable_mon();
 	resctrl_mounted = false;
 	kernfs_kill_sb(sb);
@@ -2889,7 +2889,7 @@ static int mkdir_rdt_prepare_rmid_alloc(struct rdtgroup *rdtgrp)
 {
 	int ret;
 
-	if (!rdt_mon_capable)
+	if (!resctrl_arch_mon_capable())
 		return 0;
 
 	ret = alloc_rmid(rdtgrp->closid);
@@ -2911,7 +2911,7 @@ static int mkdir_rdt_prepare_rmid_alloc(struct rdtgroup *rdtgrp)
 
 static void mkdir_rdt_prepare_rmid_free(struct rdtgroup *rgrp)
 {
-	if (rdt_mon_capable)
+	if (resctrl_arch_mon_capable())
 		free_rmid(rgrp->closid, rgrp->mon.rmid);
 }
 
@@ -3075,7 +3075,7 @@ static int rdtgroup_mkdir_ctrl_mon(struct kernfs_node *parent_kn,
 
 	list_add(&rdtgrp->rdtgroup_list, &rdt_all_groups);
 
-	if (rdt_mon_capable) {
+	if (resctrl_arch_mon_capable()) {
 		/*
 		 * Create an empty mon_groups directory to hold the subset
 		 * of tasks and cpus to monitor.
@@ -3130,14 +3130,14 @@ static int rdtgroup_mkdir(struct kernfs_node *parent_kn, const char *name,
 	 * allocation is supported, add a control and monitoring
 	 * subdirectory
 	 */
-	if (rdt_alloc_capable && parent_kn == rdtgroup_default.kn)
+	if (resctrl_arch_alloc_capable() && parent_kn == rdtgroup_default.kn)
 		return rdtgroup_mkdir_ctrl_mon(parent_kn, name, mode);
 
 	/*
 	 * If RDT monitoring is supported and the parent directory is a valid
 	 * "mon_groups" directory, add a monitoring subdirectory.
 	 */
-	if (rdt_mon_capable && is_mon_groups(parent_kn, name))
+	if (resctrl_arch_mon_capable() && is_mon_groups(parent_kn, name))
 		return rdtgroup_mkdir_mon(parent_kn, name, mode);
 
 	return -EPERM;
@@ -3341,7 +3341,7 @@ void resctrl_offline_domain(struct rdt_resource *r, struct rdt_domain *d)
 	 * If resctrl is mounted, remove all the
 	 * per domain monitor data directories.
 	 */
-	if (resctrl_mounted && static_branch_unlikely(&rdt_mon_enable_key))
+	if (resctrl_mounted && resctrl_arch_mon_capable())
 		rmdir_mondata_subdir_allrdtgrp(r, d->id);
 
 	if (is_mbm_enabled())
@@ -3418,7 +3418,7 @@ int resctrl_online_domain(struct rdt_resource *r, struct rdt_domain *d)
 	if (is_llc_occupancy_enabled())
 		INIT_DELAYED_WORK(&d->cqm_limbo, cqm_handle_limbo);
 
-	if (resctrl_mounted && static_branch_unlikely(&rdt_mon_enable_key))
+	if (resctrl_mounted && resctrl_arch_mon_capable())
 		mkdir_mondata_subdir_allrdtgrp(r, d);
 
 	return 0;
-- 
2.30.2


^ permalink raw reply related	[flat|nested] 73+ messages in thread

* [PATCH v2 15/18] x86/resctrl: Add cpu online callback for resctrl work
  2023-01-13 17:54 [PATCH v2 00/18] x86/resctrl: monitored closid+rmid together, separate arch/fs locking James Morse
                   ` (13 preceding siblings ...)
  2023-01-13 17:54 ` [PATCH v2 14/18] x86/resctrl: Add helpers for system wide mon/alloc capable James Morse
@ 2023-01-13 17:54 ` James Morse
  2023-01-13 17:54 ` [PATCH v2 16/18] x86/resctrl: Allow overflow/limbo handlers to be scheduled on any-but cpu James Morse
                   ` (3 subsequent siblings)
  18 siblings, 0 replies; 73+ messages in thread
From: James Morse @ 2023-01-13 17:54 UTC (permalink / raw)
  To: x86, linux-kernel
  Cc: Fenghua Yu, Reinette Chatre, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, H Peter Anvin, Babu Moger, James Morse,
	shameerali.kolothum.thodi, D Scott Phillips OS, carl, lcherian,
	bobo.shaobowang, tan.shaopeng, xingxin.hx, baolin.wang,
	Jamie Iles, Xin Hao, peternewman

The resctrl architecture specific code may need to create a domain when
a CPU comes online, it also needs to reset the CPUs PQR_ASSOC register.
The resctrl filesystem code needs to update the rdtgroup_default cpu
mask when cpus are brought online.

Currently this is all done in one function, resctrl_online_cpu().
This will need to be split into architecture and filesystem parts
before resctrl can be moved to /fs/.

Pull the rdtgroup_default update work out as a filesystem specific
cpu_online helper. resctrl_online_cpu() is the obvious name for this,
which means the version in core.c needs renaming.

resctrl_online_cpu() is called by the arch code once it has done the
work to add the new cpu to any domains.

In future patches, resctrl_online_cpu() will take the rdtgroup_mutex
itself.

Tested-by: Shaopeng Tan <tan.shaopeng@fujitsu.com>
Signed-off-by: James Morse <james.morse@arm.com>
---
 arch/x86/kernel/cpu/resctrl/core.c     | 11 ++++++-----
 arch/x86/kernel/cpu/resctrl/rdtgroup.c | 10 ++++++++++
 include/linux/resctrl.h                |  1 +
 3 files changed, 17 insertions(+), 5 deletions(-)

diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
index c98e52ff5f20..749d9a450749 100644
--- a/arch/x86/kernel/cpu/resctrl/core.c
+++ b/arch/x86/kernel/cpu/resctrl/core.c
@@ -578,19 +578,20 @@ static void clear_closid_rmid(int cpu)
 	wrmsr(MSR_IA32_PQR_ASSOC, 0, 0);
 }
 
-static int resctrl_online_cpu(unsigned int cpu)
+static int resctrl_arch_online_cpu(unsigned int cpu)
 {
 	struct rdt_resource *r;
+	int err;
 
 	mutex_lock(&rdtgroup_mutex);
 	for_each_capable_rdt_resource(r)
 		domain_add_cpu(cpu, r);
-	/* The cpu is set in default rdtgroup after online. */
-	cpumask_set_cpu(cpu, &rdtgroup_default.cpu_mask);
 	clear_closid_rmid(cpu);
+
+	err = resctrl_online_cpu(cpu);
 	mutex_unlock(&rdtgroup_mutex);
 
-	return 0;
+	return err;
 }
 
 static void clear_childcpus(struct rdtgroup *r, unsigned int cpu)
@@ -917,7 +918,7 @@ static int __init resctrl_late_init(void)
 
 	state = cpuhp_setup_state(CPUHP_AP_ONLINE_DYN,
 				  "x86/resctrl/cat:online:",
-				  resctrl_online_cpu, resctrl_offline_cpu);
+				  resctrl_arch_online_cpu, resctrl_offline_cpu);
 	if (state < 0)
 		return state;
 
diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
index 44e6d6fbab25..25b027d19dbe 100644
--- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
+++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
@@ -3424,6 +3424,16 @@ int resctrl_online_domain(struct rdt_resource *r, struct rdt_domain *d)
 	return 0;
 }
 
+int resctrl_online_cpu(unsigned int cpu)
+{
+	lockdep_assert_held(&rdtgroup_mutex);
+
+	/* The cpu is set in default rdtgroup after online. */
+	cpumask_set_cpu(cpu, &rdtgroup_default.cpu_mask);
+
+	return 0;
+}
+
 /*
  * rdtgroup_init - rdtgroup initialization
  *
diff --git a/include/linux/resctrl.h b/include/linux/resctrl.h
index d90d3dca48e9..7b71c3558d83 100644
--- a/include/linux/resctrl.h
+++ b/include/linux/resctrl.h
@@ -219,6 +219,7 @@ u32 resctrl_arch_get_config(struct rdt_resource *r, struct rdt_domain *d,
 			    u32 closid, enum resctrl_conf_type type);
 int resctrl_online_domain(struct rdt_resource *r, struct rdt_domain *d);
 void resctrl_offline_domain(struct rdt_resource *r, struct rdt_domain *d);
+int resctrl_online_cpu(unsigned int cpu);
 
 /**
  * resctrl_arch_rmid_read() - Read the eventid counter corresponding to rmid
-- 
2.30.2


^ permalink raw reply related	[flat|nested] 73+ messages in thread

* [PATCH v2 16/18] x86/resctrl: Allow overflow/limbo handlers to be scheduled on any-but cpu
  2023-01-13 17:54 [PATCH v2 00/18] x86/resctrl: monitored closid+rmid together, separate arch/fs locking James Morse
                   ` (14 preceding siblings ...)
  2023-01-13 17:54 ` [PATCH v2 15/18] x86/resctrl: Add cpu online callback for resctrl work James Morse
@ 2023-01-13 17:54 ` James Morse
  2023-02-02 23:49   ` Reinette Chatre
  2023-01-13 17:54 ` [PATCH v2 17/18] x86/resctrl: Add cpu offline callback for resctrl work James Morse
                   ` (2 subsequent siblings)
  18 siblings, 1 reply; 73+ messages in thread
From: James Morse @ 2023-01-13 17:54 UTC (permalink / raw)
  To: x86, linux-kernel
  Cc: Fenghua Yu, Reinette Chatre, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, H Peter Anvin, Babu Moger, James Morse,
	shameerali.kolothum.thodi, D Scott Phillips OS, carl, lcherian,
	bobo.shaobowang, tan.shaopeng, xingxin.hx, baolin.wang,
	Jamie Iles, Xin Hao, peternewman

When a cpu is taken offline resctrl may need to move the overflow or
limbo handlers to run on a different CPU.

Once the offline callbacks have been split, cqm_setup_limbo_handler()
will be called while the CPU that is going offline is still present
in the cpu_mask.

Pass the CPU to exclude to cqm_setup_limbo_handler() and
mbm_setup_overflow_handler(). These functions can use cpumask_any_but()
when selecting the CPU. -1 is used to indicate no CPUs need excluding.

Tested-by: Shaopeng Tan <tan.shaopeng@fujitsu.com>
Signed-off-by: James Morse <james.morse@arm.com>
---

Both cpumask_any() and cpumask_any_but() return a value >= nr_cpu_ids
on error. schedule_delayed_work_on() doesn't appear to check. Add the
error handling to be robust. It doesn't look like its possible to hit
this.
---
 arch/x86/kernel/cpu/resctrl/core.c     |  6 ++--
 arch/x86/kernel/cpu/resctrl/internal.h |  6 ++--
 arch/x86/kernel/cpu/resctrl/monitor.c  | 39 +++++++++++++++++++++-----
 arch/x86/kernel/cpu/resctrl/rdtgroup.c |  4 +--
 4 files changed, 42 insertions(+), 13 deletions(-)

diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
index 749d9a450749..a3c171bd2de0 100644
--- a/arch/x86/kernel/cpu/resctrl/core.c
+++ b/arch/x86/kernel/cpu/resctrl/core.c
@@ -557,12 +557,14 @@ static void domain_remove_cpu(int cpu, struct rdt_resource *r)
 	if (r == &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl) {
 		if (is_mbm_enabled() && cpu == d->mbm_work_cpu) {
 			cancel_delayed_work(&d->mbm_over);
-			mbm_setup_overflow_handler(d, 0);
+			/* exclude_cpu=-1 as we already cpumask_clear_cpu()d */
+			mbm_setup_overflow_handler(d, 0, -1);
 		}
 		if (is_llc_occupancy_enabled() && cpu == d->cqm_work_cpu &&
 		    has_busy_rmid(r, d)) {
 			cancel_delayed_work(&d->cqm_limbo);
-			cqm_setup_limbo_handler(d, 0);
+			/* exclude_cpu=-1 as we already cpumask_clear_cpu()d */
+			cqm_setup_limbo_handler(d, 0, -1);
 		}
 	}
 }
diff --git a/arch/x86/kernel/cpu/resctrl/internal.h b/arch/x86/kernel/cpu/resctrl/internal.h
index a1bf97adee2e..d8c7a549b43a 100644
--- a/arch/x86/kernel/cpu/resctrl/internal.h
+++ b/arch/x86/kernel/cpu/resctrl/internal.h
@@ -515,11 +515,13 @@ void mon_event_read(struct rmid_read *rr, struct rdt_resource *r,
 		    struct rdt_domain *d, struct rdtgroup *rdtgrp,
 		    int evtid, int first);
 void mbm_setup_overflow_handler(struct rdt_domain *dom,
-				unsigned long delay_ms);
+				unsigned long delay_ms,
+				int exclude_cpu);
 void mbm_handle_overflow(struct work_struct *work);
 void __init intel_rdt_mbm_apply_quirk(void);
 bool is_mba_sc(struct rdt_resource *r);
-void cqm_setup_limbo_handler(struct rdt_domain *dom, unsigned long delay_ms);
+void cqm_setup_limbo_handler(struct rdt_domain *dom, unsigned long delay_ms,
+			     int exclude_cpu);
 void cqm_handle_limbo(struct work_struct *work);
 bool has_busy_rmid(struct rdt_resource *r, struct rdt_domain *d);
 void __check_limbo(struct rdt_domain *d, bool force_free);
diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c
index 1a214bd32ed4..334fb3f1c6e2 100644
--- a/arch/x86/kernel/cpu/resctrl/monitor.c
+++ b/arch/x86/kernel/cpu/resctrl/monitor.c
@@ -440,7 +440,7 @@ static void add_rmid_to_limbo(struct rmid_entry *entry)
 		 * setup up the limbo worker.
 		 */
 		if (!has_busy_rmid(r, d))
-			cqm_setup_limbo_handler(d, CQM_LIMBOCHECK_INTERVAL);
+			cqm_setup_limbo_handler(d, CQM_LIMBOCHECK_INTERVAL, -1);
 		set_bit(idx, d->rmid_busy_llc);
 		entry->busy++;
 	}
@@ -773,15 +773,27 @@ void cqm_handle_limbo(struct work_struct *work)
 	mutex_unlock(&rdtgroup_mutex);
 }
 
-void cqm_setup_limbo_handler(struct rdt_domain *dom, unsigned long delay_ms)
+/**
+ * cqm_setup_limbo_handler() - Schedule the limbo handler to run for this
+ *                             domain.
+ * @delay_ms:      How far in the future the handler should run.
+ * @exclude_cpu:   Which CPU the handler should not run on, -1 to pick any CPU.
+ */
+void cqm_setup_limbo_handler(struct rdt_domain *dom, unsigned long delay_ms,
+			     int exclude_cpu)
 {
 	unsigned long delay = msecs_to_jiffies(delay_ms);
 	int cpu;
 
-	cpu = cpumask_any(&dom->cpu_mask);
+	if (exclude_cpu == -1)
+		cpu = cpumask_any(&dom->cpu_mask);
+	else
+		cpu = cpumask_any_but(&dom->cpu_mask, exclude_cpu);
+
 	dom->cqm_work_cpu = cpu;
 
-	schedule_delayed_work_on(cpu, &dom->cqm_limbo, delay);
+	if (cpu < nr_cpu_ids)
+		schedule_delayed_work_on(cpu, &dom->cqm_limbo, delay);
 }
 
 void mbm_handle_overflow(struct work_struct *work)
@@ -818,7 +830,14 @@ void mbm_handle_overflow(struct work_struct *work)
 	mutex_unlock(&rdtgroup_mutex);
 }
 
-void mbm_setup_overflow_handler(struct rdt_domain *dom, unsigned long delay_ms)
+/**
+ * mbm_setup_overflow_handler() - Schedule the overflow handler to run for this
+ *                                domain.
+ * @delay_ms:      How far in the future the handler should run.
+ * @exclude_cpu:   Which CPU the handler should not run on, -1 to pick any CPU.
+ */
+void mbm_setup_overflow_handler(struct rdt_domain *dom, unsigned long delay_ms,
+				int exclude_cpu)
 {
 	unsigned long delay = msecs_to_jiffies(delay_ms);
 	int cpu;
@@ -826,9 +845,15 @@ void mbm_setup_overflow_handler(struct rdt_domain *dom, unsigned long delay_ms)
 	if (!resctrl_mounted || !resctrl_arch_mon_capable())
 		return;
 
-	cpu = cpumask_any(&dom->cpu_mask);
+	if (exclude_cpu == -1)
+		cpu = cpumask_any(&dom->cpu_mask);
+	else
+		cpu = cpumask_any_but(&dom->cpu_mask, exclude_cpu);
+
 	dom->mbm_work_cpu = cpu;
-	schedule_delayed_work_on(cpu, &dom->mbm_over, delay);
+
+	if (cpu < nr_cpu_ids)
+		schedule_delayed_work_on(cpu, &dom->mbm_over, delay);
 }
 
 static int dom_data_init(struct rdt_resource *r)
diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
index 25b027d19dbe..2998d20691ea 100644
--- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
+++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
@@ -2253,7 +2253,7 @@ static int rdt_get_tree(struct fs_context *fc)
 	if (is_mbm_enabled()) {
 		r = &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl;
 		list_for_each_entry(dom, &r->domains, list)
-			mbm_setup_overflow_handler(dom, MBM_OVERFLOW_INTERVAL);
+			mbm_setup_overflow_handler(dom, MBM_OVERFLOW_INTERVAL, -1);
 	}
 
 	goto out;
@@ -3412,7 +3412,7 @@ int resctrl_online_domain(struct rdt_resource *r, struct rdt_domain *d)
 
 	if (is_mbm_enabled()) {
 		INIT_DELAYED_WORK(&d->mbm_over, mbm_handle_overflow);
-		mbm_setup_overflow_handler(d, MBM_OVERFLOW_INTERVAL);
+		mbm_setup_overflow_handler(d, MBM_OVERFLOW_INTERVAL, -1);
 	}
 
 	if (is_llc_occupancy_enabled())
-- 
2.30.2


^ permalink raw reply related	[flat|nested] 73+ messages in thread

* [PATCH v2 17/18] x86/resctrl: Add cpu offline callback for resctrl work
  2023-01-13 17:54 [PATCH v2 00/18] x86/resctrl: monitored closid+rmid together, separate arch/fs locking James Morse
                   ` (15 preceding siblings ...)
  2023-01-13 17:54 ` [PATCH v2 16/18] x86/resctrl: Allow overflow/limbo handlers to be scheduled on any-but cpu James Morse
@ 2023-01-13 17:54 ` James Morse
  2023-01-13 17:54 ` [PATCH v2 18/18] x86/resctrl: Separate arch and fs resctrl locks James Morse
  2023-01-25  7:19 ` [PATCH v2 00/18] x86/resctrl: monitored closid+rmid together, separate arch/fs locking Shaopeng Tan (Fujitsu)
  18 siblings, 0 replies; 73+ messages in thread
From: James Morse @ 2023-01-13 17:54 UTC (permalink / raw)
  To: x86, linux-kernel
  Cc: Fenghua Yu, Reinette Chatre, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, H Peter Anvin, Babu Moger, James Morse,
	shameerali.kolothum.thodi, D Scott Phillips OS, carl, lcherian,
	bobo.shaobowang, tan.shaopeng, xingxin.hx, baolin.wang,
	Jamie Iles, Xin Hao, peternewman

The resctrl architecture specific code may need to free a domain when
a CPU goes offline, it also needs to reset the CPUs PQR_ASSOC register.
The resctrl filesystem code needs to move the overflow and limbo work
to run on a different CPU, and clear this CPU from the cpu_mask of
control and monitor groups.

Currently this is all done in core.c and called from
resctrl_offline_cpu(), making the split between architecture and
filesystem code unclear.

Move the filesystem work into a filesystem helper called
resctrl_offline_cpu(), and rename the one in core.c
resctrl_arch_offline_cpu().

The rdtgroup_mutex is unlocked and locked again in the call in
preparation for changing the locking rules for the architecture
code.

resctrl_offline_cpu() is called before any of the resource/domains
are updated, and makes use of the exclude_cpu feature we previously
added.

Tested-by: Shaopeng Tan <tan.shaopeng@fujitsu.com>
Signed-off-by: James Morse <james.morse@arm.com>
---
 arch/x86/kernel/cpu/resctrl/core.c     | 39 ++++----------------------
 arch/x86/kernel/cpu/resctrl/rdtgroup.c | 39 ++++++++++++++++++++++++++
 include/linux/resctrl.h                |  1 +
 3 files changed, 45 insertions(+), 34 deletions(-)

diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
index a3c171bd2de0..7896fcf11df6 100644
--- a/arch/x86/kernel/cpu/resctrl/core.c
+++ b/arch/x86/kernel/cpu/resctrl/core.c
@@ -553,20 +553,6 @@ static void domain_remove_cpu(int cpu, struct rdt_resource *r)
 
 		return;
 	}
-
-	if (r == &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl) {
-		if (is_mbm_enabled() && cpu == d->mbm_work_cpu) {
-			cancel_delayed_work(&d->mbm_over);
-			/* exclude_cpu=-1 as we already cpumask_clear_cpu()d */
-			mbm_setup_overflow_handler(d, 0, -1);
-		}
-		if (is_llc_occupancy_enabled() && cpu == d->cqm_work_cpu &&
-		    has_busy_rmid(r, d)) {
-			cancel_delayed_work(&d->cqm_limbo);
-			/* exclude_cpu=-1 as we already cpumask_clear_cpu()d */
-			cqm_setup_limbo_handler(d, 0, -1);
-		}
-	}
 }
 
 static void clear_closid_rmid(int cpu)
@@ -596,31 +582,15 @@ static int resctrl_arch_online_cpu(unsigned int cpu)
 	return err;
 }
 
-static void clear_childcpus(struct rdtgroup *r, unsigned int cpu)
+static int resctrl_arch_offline_cpu(unsigned int cpu)
 {
-	struct rdtgroup *cr;
-
-	list_for_each_entry(cr, &r->mon.crdtgrp_list, mon.crdtgrp_list) {
-		if (cpumask_test_and_clear_cpu(cpu, &cr->cpu_mask)) {
-			break;
-		}
-	}
-}
-
-static int resctrl_offline_cpu(unsigned int cpu)
-{
-	struct rdtgroup *rdtgrp;
 	struct rdt_resource *r;
 
 	mutex_lock(&rdtgroup_mutex);
+	resctrl_offline_cpu(cpu);
+
 	for_each_capable_rdt_resource(r)
 		domain_remove_cpu(cpu, r);
-	list_for_each_entry(rdtgrp, &rdt_all_groups, rdtgroup_list) {
-		if (cpumask_test_and_clear_cpu(cpu, &rdtgrp->cpu_mask)) {
-			clear_childcpus(rdtgrp, cpu);
-			break;
-		}
-	}
 	clear_closid_rmid(cpu);
 	mutex_unlock(&rdtgroup_mutex);
 
@@ -920,7 +890,8 @@ static int __init resctrl_late_init(void)
 
 	state = cpuhp_setup_state(CPUHP_AP_ONLINE_DYN,
 				  "x86/resctrl/cat:online:",
-				  resctrl_arch_online_cpu, resctrl_offline_cpu);
+				  resctrl_arch_online_cpu,
+				  resctrl_arch_offline_cpu);
 	if (state < 0)
 		return state;
 
diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
index 2998d20691ea..aa75c42e0faa 100644
--- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
+++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
@@ -3434,6 +3434,45 @@ int resctrl_online_cpu(unsigned int cpu)
 	return 0;
 }
 
+static void clear_childcpus(struct rdtgroup *r, unsigned int cpu)
+{
+	struct rdtgroup *cr;
+
+	list_for_each_entry(cr, &r->mon.crdtgrp_list, mon.crdtgrp_list) {
+		if (cpumask_test_and_clear_cpu(cpu, &cr->cpu_mask))
+			break;
+	}
+}
+
+void resctrl_offline_cpu(unsigned int cpu)
+{
+	struct rdt_domain *d;
+	struct rdtgroup *rdtgrp;
+	struct rdt_resource *l3 = &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl;
+
+	lockdep_assert_held(&rdtgroup_mutex);
+
+	list_for_each_entry(rdtgrp, &rdt_all_groups, rdtgroup_list) {
+		if (cpumask_test_and_clear_cpu(cpu, &rdtgrp->cpu_mask)) {
+			clear_childcpus(rdtgrp, cpu);
+			break;
+		}
+	}
+
+	d = get_domain_from_cpu(cpu, l3);
+	if (d) {
+		if (is_mbm_enabled() && cpu == d->mbm_work_cpu) {
+			cancel_delayed_work(&d->mbm_over);
+			mbm_setup_overflow_handler(d, 0, cpu);
+		}
+		if (is_llc_occupancy_enabled() && cpu == d->cqm_work_cpu &&
+		    has_busy_rmid(l3, d)) {
+			cancel_delayed_work(&d->cqm_limbo);
+			cqm_setup_limbo_handler(d, 0, cpu);
+		}
+	}
+}
+
 /*
  * rdtgroup_init - rdtgroup initialization
  *
diff --git a/include/linux/resctrl.h b/include/linux/resctrl.h
index 7b71c3558d83..0d21e7ba7c1d 100644
--- a/include/linux/resctrl.h
+++ b/include/linux/resctrl.h
@@ -220,6 +220,7 @@ u32 resctrl_arch_get_config(struct rdt_resource *r, struct rdt_domain *d,
 int resctrl_online_domain(struct rdt_resource *r, struct rdt_domain *d);
 void resctrl_offline_domain(struct rdt_resource *r, struct rdt_domain *d);
 int resctrl_online_cpu(unsigned int cpu);
+void resctrl_offline_cpu(unsigned int cpu);
 
 /**
  * resctrl_arch_rmid_read() - Read the eventid counter corresponding to rmid
-- 
2.30.2


^ permalink raw reply related	[flat|nested] 73+ messages in thread

* [PATCH v2 18/18] x86/resctrl: Separate arch and fs resctrl locks
  2023-01-13 17:54 [PATCH v2 00/18] x86/resctrl: monitored closid+rmid together, separate arch/fs locking James Morse
                   ` (16 preceding siblings ...)
  2023-01-13 17:54 ` [PATCH v2 17/18] x86/resctrl: Add cpu offline callback for resctrl work James Morse
@ 2023-01-13 17:54 ` James Morse
  2023-02-02 23:50   ` Reinette Chatre
  2023-01-25  7:19 ` [PATCH v2 00/18] x86/resctrl: monitored closid+rmid together, separate arch/fs locking Shaopeng Tan (Fujitsu)
  18 siblings, 1 reply; 73+ messages in thread
From: James Morse @ 2023-01-13 17:54 UTC (permalink / raw)
  To: x86, linux-kernel
  Cc: Fenghua Yu, Reinette Chatre, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, H Peter Anvin, Babu Moger, James Morse,
	shameerali.kolothum.thodi, D Scott Phillips OS, carl, lcherian,
	bobo.shaobowang, tan.shaopeng, xingxin.hx, baolin.wang,
	Jamie Iles, Xin Hao, peternewman

resctrl has one mutex that is taken by the architecture specific code,
and the filesystem parts. The two interact via cpuhp, where the
architecture code updates the domain list. Filesystem handlers that
walk the domains list should not run concurrently with the cpuhp
callback modifying the list.

Exposing a lock from the filesystem code means the interface is not
cleanly defined, and creates the possibility of cross-architecture
lock ordering headaches. The interaction only exists so that certain
filesystem paths are serialised against cpu hotplug. The cpu hotplug
code already has a mechanism to do this using cpus_read_lock().

MPAM's monitors have an overflow interrupt, so it needs to be possible
to walk the domains list in irq context. RCU is ideal for this,
but some paths need to be able to sleep to allocate memory.

Because resctrl_{on,off}line_cpu() take the rdtgroup_mutex as part
of a cpuhp callback, cpus_read_lock() must always be taken first.
rdtgroup_schemata_write() already does this.

All but one of the filesystem code's domain list walkers are
currently protected by the rdtgroup_mutex taken in
rdtgroup_kn_lock_live(). The exception is rdt_bit_usage_show()
which takes the lock directly.

Make the domain list protected by RCU. An architecture-specific
lock prevents concurrent writers. rdt_bit_usage_show() can
walk the domain list under rcu_read_lock().
The other filesystem list walkers need to be able to sleep.
Add cpus_read_lock() to rdtgroup_kn_lock_live() so that the
cpuhp callbacks can't be invoked when file system operations are
occurring.

Add lockdep_assert_cpus_held() in the cases where the
rdtgroup_kn_lock_live() call isn't obvious.

Resctrl's domain online/offline calls now need to take the
rdtgroup_mutex themselves.

Tested-by: Shaopeng Tan <tan.shaopeng@fujitsu.com>
Signed-off-by: James Morse <james.morse@arm.com>
---
 arch/x86/kernel/cpu/resctrl/core.c        | 33 ++++++++------
 arch/x86/kernel/cpu/resctrl/ctrlmondata.c | 14 ++++--
 arch/x86/kernel/cpu/resctrl/monitor.c     |  3 ++
 arch/x86/kernel/cpu/resctrl/pseudo_lock.c |  3 ++
 arch/x86/kernel/cpu/resctrl/rdtgroup.c    | 54 ++++++++++++++++++++---
 include/linux/resctrl.h                   |  2 +-
 6 files changed, 84 insertions(+), 25 deletions(-)

diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
index 7896fcf11df6..dc1ba580c4db 100644
--- a/arch/x86/kernel/cpu/resctrl/core.c
+++ b/arch/x86/kernel/cpu/resctrl/core.c
@@ -25,8 +25,14 @@
 #include <asm/resctrl.h>
 #include "internal.h"
 
-/* Mutex to protect rdtgroup access. */
-DEFINE_MUTEX(rdtgroup_mutex);
+/*
+ * rdt_domain structures are kfree()d when their last cpu goes offline,
+ * and allocated when the first cpu in a new domain comes online.
+ * The rdt_resource's domain list is updated when this happens. The domain
+ * list is protected by RCU, but callers can also take the cpus_read_lock()
+ * to prevent modification if they need to sleep. All writers take this mutex:
+ */
+static DEFINE_MUTEX(domain_list_lock);
 
 /*
  * The cached resctrl_pqr_state is strictly per CPU and can never be
@@ -483,6 +489,8 @@ static void domain_add_cpu(int cpu, struct rdt_resource *r)
 	struct rdt_domain *d;
 	int err;
 
+	lockdep_assert_held(&domain_list_lock);
+
 	d = rdt_find_domain(r, id, &add_pos);
 	if (IS_ERR(d)) {
 		pr_warn("Couldn't find cache id for CPU %d\n", cpu);
@@ -516,11 +524,12 @@ static void domain_add_cpu(int cpu, struct rdt_resource *r)
 		return;
 	}
 
-	list_add_tail(&d->list, add_pos);
+	list_add_tail_rcu(&d->list, add_pos);
 
 	err = resctrl_online_domain(r, d);
 	if (err) {
-		list_del(&d->list);
+		list_del_rcu(&d->list);
+		synchronize_rcu();
 		domain_free(hw_dom);
 	}
 }
@@ -541,7 +550,8 @@ static void domain_remove_cpu(int cpu, struct rdt_resource *r)
 	cpumask_clear_cpu(cpu, &d->cpu_mask);
 	if (cpumask_empty(&d->cpu_mask)) {
 		resctrl_offline_domain(r, d);
-		list_del(&d->list);
+		list_del_rcu(&d->list);
+		synchronize_rcu();
 
 		/*
 		 * rdt_domain "d" is going to be freed below, so clear
@@ -569,30 +579,27 @@ static void clear_closid_rmid(int cpu)
 static int resctrl_arch_online_cpu(unsigned int cpu)
 {
 	struct rdt_resource *r;
-	int err;
 
-	mutex_lock(&rdtgroup_mutex);
+	mutex_lock(&domain_list_lock);
 	for_each_capable_rdt_resource(r)
 		domain_add_cpu(cpu, r);
 	clear_closid_rmid(cpu);
+	mutex_unlock(&domain_list_lock);
 
-	err = resctrl_online_cpu(cpu);
-	mutex_unlock(&rdtgroup_mutex);
-
-	return err;
+	return resctrl_online_cpu(cpu);
 }
 
 static int resctrl_arch_offline_cpu(unsigned int cpu)
 {
 	struct rdt_resource *r;
 
-	mutex_lock(&rdtgroup_mutex);
 	resctrl_offline_cpu(cpu);
 
+	mutex_lock(&domain_list_lock);
 	for_each_capable_rdt_resource(r)
 		domain_remove_cpu(cpu, r);
 	clear_closid_rmid(cpu);
-	mutex_unlock(&rdtgroup_mutex);
+	mutex_unlock(&domain_list_lock);
 
 	return 0;
 }
diff --git a/arch/x86/kernel/cpu/resctrl/ctrlmondata.c b/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
index 4ee3da6dced7..6eb74304860f 100644
--- a/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
+++ b/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
@@ -208,6 +208,9 @@ static int parse_line(char *line, struct resctrl_schema *s,
 	struct rdt_domain *d;
 	unsigned long dom_id;
 
+	/* Walking r->domains, ensure it can't race with cpuhp */
+	lockdep_assert_cpus_held();
+
 	if (rdtgrp->mode == RDT_MODE_PSEUDO_LOCKSETUP &&
 	    r->rid == RDT_RESOURCE_MBA) {
 		rdt_last_cmd_puts("Cannot pseudo-lock MBA resource\n");
@@ -313,6 +316,9 @@ int resctrl_arch_update_domains(struct rdt_resource *r, u32 closid)
 	int cpu;
 	u32 idx;
 
+	/* Walking r->domains, ensure it can't race with cpuhp */
+	lockdep_assert_cpus_held();
+
 	if (!zalloc_cpumask_var(&cpu_mask, GFP_KERNEL))
 		return -ENOMEM;
 
@@ -383,11 +389,9 @@ ssize_t rdtgroup_schemata_write(struct kernfs_open_file *of,
 		return -EINVAL;
 	buf[nbytes - 1] = '\0';
 
-	cpus_read_lock();
 	rdtgrp = rdtgroup_kn_lock_live(of->kn);
 	if (!rdtgrp) {
 		rdtgroup_kn_unlock(of->kn);
-		cpus_read_unlock();
 		return -ENOENT;
 	}
 	rdt_last_cmd_clear();
@@ -451,7 +455,6 @@ ssize_t rdtgroup_schemata_write(struct kernfs_open_file *of,
 
 out:
 	rdtgroup_kn_unlock(of->kn);
-	cpus_read_unlock();
 	return ret ?: nbytes;
 }
 
@@ -471,6 +474,9 @@ static void show_doms(struct seq_file *s, struct resctrl_schema *schema, int clo
 	bool sep = false;
 	u32 ctrl_val;
 
+	/* Walking r->domains, ensure it can't race with cpuhp */
+	lockdep_assert_cpus_held();
+
 	seq_printf(s, "%*s:", max_name_width, schema->name);
 	list_for_each_entry(dom, &r->domains, list) {
 		if (sep)
@@ -533,7 +539,7 @@ void mon_event_read(struct rmid_read *rr, struct rdt_resource *r,
 		    int evtid, int first)
 {
 	/* When picking a cpu from cpu_mask, ensure it can't race with cpuhp */
-	lockdep_assert_held(&rdtgroup_mutex);
+	lockdep_assert_cpus_held();
 
 	/*
 	 * setup the parameters to pass to mon_event_count() to read the data.
diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c
index 334fb3f1c6e2..a2d5cf9052e4 100644
--- a/arch/x86/kernel/cpu/resctrl/monitor.c
+++ b/arch/x86/kernel/cpu/resctrl/monitor.c
@@ -421,6 +421,9 @@ static void add_rmid_to_limbo(struct rmid_entry *entry)
 	u32 idx;
 	int err;
 
+	/* Walking r->domains, ensure it can't race with cpuhp */
+	lockdep_assert_cpus_held();
+
 	idx = resctrl_arch_rmid_idx_encode(entry->closid, entry->rmid);
 
 	arch_mon_ctx = resctrl_arch_mon_ctx_alloc(r, QOS_L3_OCCUP_EVENT_ID);
diff --git a/arch/x86/kernel/cpu/resctrl/pseudo_lock.c b/arch/x86/kernel/cpu/resctrl/pseudo_lock.c
index 0b4fdb118643..f8864626d593 100644
--- a/arch/x86/kernel/cpu/resctrl/pseudo_lock.c
+++ b/arch/x86/kernel/cpu/resctrl/pseudo_lock.c
@@ -830,6 +830,9 @@ bool rdtgroup_pseudo_locked_in_hierarchy(struct rdt_domain *d)
 	struct rdt_domain *d_i;
 	bool ret = false;
 
+	/* Walking r->domains, ensure it can't race with cpuhp */
+	lockdep_assert_cpus_held();
+
 	if (!zalloc_cpumask_var(&cpu_with_psl, GFP_KERNEL))
 		return true;
 
diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
index aa75c42e0faa..391ac3c56680 100644
--- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
+++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
@@ -35,6 +35,10 @@
 DEFINE_STATIC_KEY_FALSE(rdt_enable_key);
 DEFINE_STATIC_KEY_FALSE(rdt_mon_enable_key);
 DEFINE_STATIC_KEY_FALSE(rdt_alloc_enable_key);
+
+/* Mutex to protect rdtgroup access. */
+DEFINE_MUTEX(rdtgroup_mutex);
+
 static struct kernfs_root *rdt_root;
 struct rdtgroup rdtgroup_default;
 LIST_HEAD(rdt_all_groups);
@@ -929,7 +933,8 @@ static int rdt_bit_usage_show(struct kernfs_open_file *of,
 
 	mutex_lock(&rdtgroup_mutex);
 	hw_shareable = r->cache.shareable_bits;
-	list_for_each_entry(dom, &r->domains, list) {
+	rcu_read_lock();
+	list_for_each_entry_rcu(dom, &r->domains, list) {
 		if (sep)
 			seq_putc(seq, ';');
 		sw_shareable = 0;
@@ -985,8 +990,10 @@ static int rdt_bit_usage_show(struct kernfs_open_file *of,
 		}
 		sep = true;
 	}
+	rcu_read_unlock();
 	seq_putc(seq, '\n');
 	mutex_unlock(&rdtgroup_mutex);
+
 	return 0;
 }
 
@@ -1226,6 +1233,9 @@ static bool rdtgroup_mode_test_exclusive(struct rdtgroup *rdtgrp)
 	struct rdt_domain *d;
 	u32 ctrl;
 
+	/* Walking r->domains, ensure it can't race with cpuhp */
+	lockdep_assert_cpus_held();
+
 	list_for_each_entry(s, &resctrl_schema_all, list) {
 		r = s->res;
 		if (r->rid == RDT_RESOURCE_MBA)
@@ -1859,6 +1869,9 @@ static int set_cache_qos_cfg(int level, bool enable)
 	struct rdt_domain *d;
 	int cpu;
 
+	/* Walking r->domains, ensure it can't race with cpuhp */
+	lockdep_assert_cpus_held();
+
 	if (level == RDT_RESOURCE_L3)
 		update = l3_qos_cfg_update;
 	else if (level == RDT_RESOURCE_L2)
@@ -2051,6 +2064,7 @@ struct rdtgroup *rdtgroup_kn_lock_live(struct kernfs_node *kn)
 	atomic_inc(&rdtgrp->waitcount);
 	kernfs_break_active_protection(kn);
 
+	cpus_read_lock();
 	mutex_lock(&rdtgroup_mutex);
 
 	/* Was this group deleted while we waited? */
@@ -2068,6 +2082,7 @@ void rdtgroup_kn_unlock(struct kernfs_node *kn)
 		return;
 
 	mutex_unlock(&rdtgroup_mutex);
+	cpus_read_unlock();
 
 	if (atomic_dec_and_test(&rdtgrp->waitcount) &&
 	    (rdtgrp->flags & RDT_DELETED)) {
@@ -2364,6 +2379,9 @@ static int reset_all_ctrls(struct rdt_resource *r)
 	struct rdt_domain *d;
 	int i, cpu;
 
+	/* Walking r->domains, ensure it can't race with cpuhp */
+	lockdep_assert_cpus_held();
+
 	if (!zalloc_cpumask_var(&cpu_mask, GFP_KERNEL))
 		return -ENOMEM;
 
@@ -2644,6 +2662,9 @@ static int mkdir_mondata_subdir_alldom(struct kernfs_node *parent_kn,
 	struct rdt_domain *dom;
 	int ret;
 
+	/* Walking r->domains, ensure it can't race with cpuhp */
+	lockdep_assert_cpus_held();
+
 	list_for_each_entry(dom, &r->domains, list) {
 		ret = mkdir_mondata_subdir(parent_kn, dom, r, prgrp);
 		if (ret)
@@ -3327,7 +3348,8 @@ static void domain_destroy_mon_state(struct rdt_domain *d)
 	kfree(d->mbm_local);
 }
 
-void resctrl_offline_domain(struct rdt_resource *r, struct rdt_domain *d)
+static void _resctrl_offline_domain(struct rdt_resource *r,
+				    struct rdt_domain *d)
 {
 	lockdep_assert_held(&rdtgroup_mutex);
 
@@ -3362,6 +3384,13 @@ void resctrl_offline_domain(struct rdt_resource *r, struct rdt_domain *d)
 	domain_destroy_mon_state(d);
 }
 
+void resctrl_offline_domain(struct rdt_resource *r, struct rdt_domain *d)
+{
+	mutex_lock(&rdtgroup_mutex);
+	_resctrl_offline_domain(r, d);
+	mutex_unlock(&rdtgroup_mutex);
+}
+
 static int domain_setup_mon_state(struct rdt_resource *r, struct rdt_domain *d)
 {
 	u32 idx_limit = resctrl_arch_system_num_rmid_idx();
@@ -3393,7 +3422,7 @@ static int domain_setup_mon_state(struct rdt_resource *r, struct rdt_domain *d)
 	return 0;
 }
 
-int resctrl_online_domain(struct rdt_resource *r, struct rdt_domain *d)
+static int _resctrl_online_domain(struct rdt_resource *r, struct rdt_domain *d)
 {
 	int err;
 
@@ -3424,12 +3453,23 @@ int resctrl_online_domain(struct rdt_resource *r, struct rdt_domain *d)
 	return 0;
 }
 
+int resctrl_online_domain(struct rdt_resource *r, struct rdt_domain *d)
+{
+	int err;
+
+	mutex_lock(&rdtgroup_mutex);
+	err = _resctrl_online_domain(r, d);
+	mutex_unlock(&rdtgroup_mutex);
+
+	return err;
+}
+
 int resctrl_online_cpu(unsigned int cpu)
 {
-	lockdep_assert_held(&rdtgroup_mutex);
-
+	mutex_lock(&rdtgroup_mutex);
 	/* The cpu is set in default rdtgroup after online. */
 	cpumask_set_cpu(cpu, &rdtgroup_default.cpu_mask);
+	mutex_unlock(&rdtgroup_mutex);
 
 	return 0;
 }
@@ -3450,8 +3490,7 @@ void resctrl_offline_cpu(unsigned int cpu)
 	struct rdtgroup *rdtgrp;
 	struct rdt_resource *l3 = &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl;
 
-	lockdep_assert_held(&rdtgroup_mutex);
-
+	mutex_lock(&rdtgroup_mutex);
 	list_for_each_entry(rdtgrp, &rdt_all_groups, rdtgroup_list) {
 		if (cpumask_test_and_clear_cpu(cpu, &rdtgrp->cpu_mask)) {
 			clear_childcpus(rdtgrp, cpu);
@@ -3471,6 +3510,7 @@ void resctrl_offline_cpu(unsigned int cpu)
 			cqm_setup_limbo_handler(d, 0, cpu);
 		}
 	}
+	mutex_unlock(&rdtgroup_mutex);
 }
 
 /*
diff --git a/include/linux/resctrl.h b/include/linux/resctrl.h
index 0d21e7ba7c1d..4fedd45738a5 100644
--- a/include/linux/resctrl.h
+++ b/include/linux/resctrl.h
@@ -153,7 +153,7 @@ struct resctrl_schema;
  * @cache_level:	Which cache level defines scope of this resource
  * @cache:		Cache allocation related data
  * @membw:		If the component has bandwidth controls, their properties.
- * @domains:		All domains for this resource
+ * @domains:		RCU list of all domains for this resource
  * @name:		Name to use in "schemata" file.
  * @data_width:		Character width of data when displaying
  * @default_ctrl:	Specifies default cache cbm or memory B/W percent.
-- 
2.30.2


^ permalink raw reply related	[flat|nested] 73+ messages in thread

* RE: [PATCH v2 02/18] x86/resctrl: Access per-rmid structures by index
  2023-01-13 17:54 ` [PATCH v2 02/18] x86/resctrl: Access per-rmid structures by index James Morse
@ 2023-01-17 18:28   ` Yu, Fenghua
  2023-03-03 18:33     ` James Morse
  2023-01-17 18:29   ` Yu, Fenghua
  2023-02-02 23:44   ` Reinette Chatre
  2 siblings, 1 reply; 73+ messages in thread
From: Yu, Fenghua @ 2023-01-17 18:28 UTC (permalink / raw)
  To: James Morse, x86, linux-kernel
  Cc: Chatre, Reinette, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	H Peter Anvin, Babu Moger, shameerali.kolothum.thodi,
	D Scott Phillips OS, carl, lcherian, bobo.shaobowang,
	tan.shaopeng, xingxin.hx, baolin.wang, Jamie Iles, Xin Hao,
	peternewman

Hi, James,

> Because of the differences between Intel RDT/AMD QoS and Arm's MPAM
> monitors, RMID values on arm64 are not unique unless the CLOSID is also
> included. Bitmaps like rmid_busy_llc need to be sized by the number of unique
> entries for this resource.
> 
> Add helpers to encode/decode the CLOSID and RMID to an index. The domain's
> busy_rmid_llc and the rmid_ptrs[] array are then sized by index. On x86, this is
> always just the RMID. This gives resctrl a unique value it can use to store
> monitor values, and allows MPAM to decode the closid when reading the
> hardware counters.
> 
> Tested-by: Shaopeng Tan <tan.shaopeng@fujitsu.com>
> Signed-off-by: James Morse <james.morse@arm.com>
> ---
> Changes since v1:
>  * Added X86_BAD_CLOSID macro to make it clear what this value means
>  * Added second WARN_ON() for closid checking, and made both _ONCE()
> ---
>  arch/x86/include/asm/resctrl.h         | 24 ++++++++
>  arch/x86/kernel/cpu/resctrl/internal.h |  2 +
> arch/x86/kernel/cpu/resctrl/monitor.c  | 79 +++++++++++++++++---------
> arch/x86/kernel/cpu/resctrl/rdtgroup.c |  7 ++-
>  4 files changed, 82 insertions(+), 30 deletions(-)
> 
> diff --git a/arch/x86/include/asm/resctrl.h b/arch/x86/include/asm/resctrl.h
> index 52788f79786f..44d568f3577c 100644
> --- a/arch/x86/include/asm/resctrl.h
> +++ b/arch/x86/include/asm/resctrl.h
> @@ -7,6 +7,13 @@
>  #include <linux/sched.h>
>  #include <linux/jump_label.h>
> 
> +/*
> + * This value can never be a valid CLOSID, and is used when mapping a
> + * (closid, rmid) pair to an index and back. On x86 only the RMID is
> + * needed.
> + */
> +#define X86_RESCTRL_BAD_CLOSID		~0
> +
>  /**
>   * struct resctrl_pqr_state - State cache for the PQR MSR
>   * @cur_rmid:		The cached Resource Monitoring ID
> @@ -94,6 +101,23 @@ static inline void resctrl_sched_in(void)
>  		__resctrl_sched_in();
>  }
> 
> +static inline u32 resctrl_arch_system_num_rmid_idx(void)
> +{
> +	/* RMID are independent numbers for x86. num_rmid_idx==num_rmid
> */

Is it better to change the comment to something like:
+       /* RMIDs are independent of CLOSIDs and number of RMIDs is fixed. */

> +	return boot_cpu_data.x86_cache_max_rmid + 1; }
> +
> +static inline void resctrl_arch_rmid_idx_decode(u32 idx, u32 *closid,
> +u32 *rmid) {
> +	*rmid = idx;
> +	*closid = X86_RESCTRL_BAD_CLOSID;
> +}
> +
> +static inline u32 resctrl_arch_rmid_idx_encode(u32 closid, u32 rmid) {
> +	return rmid;
> +}
> +
>  void resctrl_cpu_detect(struct cpuinfo_x86 *c);
> 
>  #else
> diff --git a/arch/x86/kernel/cpu/resctrl/internal.h
> b/arch/x86/kernel/cpu/resctrl/internal.h
> index 5dbff3035723..af71401c57e2 100644
> --- a/arch/x86/kernel/cpu/resctrl/internal.h
> +++ b/arch/x86/kernel/cpu/resctrl/internal.h
> @@ -8,6 +8,8 @@
>  #include <linux/fs_context.h>
>  #include <linux/jump_label.h>
> 
Please remove this blank line.

> +#include <asm/resctrl.h>
> +
>  #define L3_QOS_CDP_ENABLE		0x01ULL
> 
>  #define L2_QOS_CDP_ENABLE		0x01ULL
> diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c
> b/arch/x86/kernel/cpu/resctrl/monitor.c
> index 13673cab175a..dbae380e3d1c 100644
> --- a/arch/x86/kernel/cpu/resctrl/monitor.c
> +++ b/arch/x86/kernel/cpu/resctrl/monitor.c
> @@ -142,12 +142,27 @@ static inline u64 get_corrected_mbm_count(u32 rmid,
> unsigned long val)
>  	return val;
>  }
> 
> -static inline struct rmid_entry *__rmid_entry(u32 closid, u32 rmid)
> +/*
> + * x86 and arm64 differ in their handling of monitoring.
> + * x86's RMID are an independent number, there is one RMID '1'.
What do you mean by "one RMID '1'"?
Should this sentence be "x86's RMID is an independent number."?

> + * arm64's PMG extend the PARTID/CLOSID space, there is one RMID '1'
> +for each
> + * CLOSID. The RMID is no longer unique.
> + * To account for this, resctrl uses an index. On x86 this is just the
> +RMID,
> + * on arm64 it encodes the CLOSID and RMID. This gives a unique number.
> + *
> + * The domain's rmid_busy_llc and rmid_ptrs are sized by index. The
> +arch code
> + * must accept an attempt to read every index.
> + */
> +static inline struct rmid_entry *__rmid_entry(u32 idx)
>  {
>  	struct rmid_entry *entry;
> +	u32 closid, rmid;
> 
> -	entry = &rmid_ptrs[rmid];
> -	WARN_ON(entry->rmid != rmid);
> +	entry = &rmid_ptrs[idx];
> +	resctrl_arch_rmid_idx_decode(idx, &closid, &rmid);
> +
> +	WARN_ON_ONCE(entry->closid != closid);
> +	WARN_ON_ONCE(entry->rmid != rmid);
> 
>  	return entry;
>  }
> @@ -243,8 +258,9 @@ int resctrl_arch_rmid_read(struct rdt_resource *r, struct
> rdt_domain *d,  void __check_limbo(struct rdt_domain *d, bool force_free)  {
>  	struct rdt_resource *r =
> &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl;
> +	u32 idx_limit = resctrl_arch_system_num_rmid_idx();
>  	struct rmid_entry *entry;
> -	u32 crmid = 1, nrmid;
> +	u32 idx, cur_idx = 1;
>  	bool rmid_dirty;
>  	u64 val = 0;
> 
> @@ -255,12 +271,11 @@ void __check_limbo(struct rdt_domain *d, bool
> force_free)
>  	 * RMID and move it to the free list when the counter reaches 0.
>  	 */
>  	for (;;) {
> -		nrmid = find_next_bit(d->rmid_busy_llc, r->num_rmid, crmid);
> -		if (nrmid >= r->num_rmid)
> +		idx = find_next_bit(d->rmid_busy_llc, idx_limit, cur_idx);
> +		if (idx >= idx_limit)
>  			break;
> 
> -		entry = __rmid_entry(~0, nrmid);	// temporary
> -
> +		entry = __rmid_entry(idx);
>  		if (resctrl_arch_rmid_read(r, d, entry->closid, entry->rmid,
>  					   QOS_L3_OCCUP_EVENT_ID, &val)) {
>  			rmid_dirty = true;
> @@ -269,19 +284,21 @@ void __check_limbo(struct rdt_domain *d, bool
> force_free)
>  		}
> 
>  		if (force_free || !rmid_dirty) {
> -			clear_bit(entry->rmid, d->rmid_busy_llc);
> +			clear_bit(idx, d->rmid_busy_llc);
>  			if (!--entry->busy) {
>  				rmid_limbo_count--;
>  				list_add_tail(&entry->list, &rmid_free_lru);
>  			}
>  		}
> -		crmid = nrmid + 1;
> +		cur_idx = idx + 1;
>  	}
>  }
> 
>  bool has_busy_rmid(struct rdt_resource *r, struct rdt_domain *d)  {
> -	return find_first_bit(d->rmid_busy_llc, r->num_rmid) != r->num_rmid;
> +	u32 idx_limit = resctrl_arch_system_num_rmid_idx();
> +
> +	return find_first_bit(d->rmid_busy_llc, idx_limit) != idx_limit;
>  }
> 
>  /*
> @@ -311,6 +328,9 @@ static void add_rmid_to_limbo(struct rmid_entry *entry)
>  	struct rdt_domain *d;
>  	int cpu, err;
>  	u64 val = 0;
> +	u32 idx;
> +
> +	idx = resctrl_arch_rmid_idx_encode(entry->closid, entry->rmid);
> 
>  	entry->busy = 0;
>  	cpu = get_cpu();
> @@ -330,7 +350,7 @@ static void add_rmid_to_limbo(struct rmid_entry *entry)
>  		 */
>  		if (!has_busy_rmid(r, d))
>  			cqm_setup_limbo_handler(d,
> CQM_LIMBOCHECK_INTERVAL);
> -		set_bit(entry->rmid, d->rmid_busy_llc);
> +		set_bit(idx, d->rmid_busy_llc);
>  		entry->busy++;
>  	}
>  	put_cpu();
> @@ -343,14 +363,16 @@ static void add_rmid_to_limbo(struct rmid_entry
> *entry)
> 
>  void free_rmid(u32 closid, u32 rmid)
>  {
> +	u32 idx = resctrl_arch_rmid_idx_encode(closid, rmid);
>  	struct rmid_entry *entry;
> 
> -	if (!rmid)
> -		return;
> -
>  	lockdep_assert_held(&rdtgroup_mutex);
> 
> -	entry = __rmid_entry(closid, rmid);
> +	/* do not allow the default rmid to be free'd */
> +	if (!idx)
> +		return;
> +
> +	entry = __rmid_entry(idx);
> 
>  	if (is_llc_occupancy_enabled())
>  		add_rmid_to_limbo(entry);
> @@ -360,6 +382,7 @@ void free_rmid(u32 closid, u32 rmid)
> 
>  static int __mon_event_count(u32 closid, u32 rmid, struct rmid_read *rr)  {
> +	u32 idx = resctrl_arch_rmid_idx_encode(closid, rmid);
>  	struct mbm_state *m;
>  	u64 tval = 0;
> 
> @@ -376,10 +399,10 @@ static int __mon_event_count(u32 closid, u32 rmid,
> struct rmid_read *rr)
>  		rr->val += tval;
>  		return 0;
>  	case QOS_L3_MBM_TOTAL_EVENT_ID:
> -		m = &rr->d->mbm_total[rmid];
> +		m = &rr->d->mbm_total[idx];
>  		break;
>  	case QOS_L3_MBM_LOCAL_EVENT_ID:
> -		m = &rr->d->mbm_local[rmid];
> +		m = &rr->d->mbm_local[idx];
>  		break;
>  	default:
>  		/*
> @@ -412,7 +435,8 @@ static int __mon_event_count(u32 closid, u32 rmid,
> struct rmid_read *rr)
>   */
>  static void mbm_bw_count(u32 closid, u32 rmid, struct rmid_read *rr)  {
> -	struct mbm_state *m = &rr->d->mbm_local[rmid];
> +	u32 idx = resctrl_arch_rmid_idx_encode(closid, rmid);
> +	struct mbm_state *m = &rr->d->mbm_local[idx];
>  	u64 cur_bw, bytes, cur_bytes;
> 
>  	cur_bytes = rr->val;
> @@ -502,7 +526,7 @@ static void update_mba_bw(struct rdtgroup *rgrp, struct
> rdt_domain *dom_mbm)  {
>  	u32 closid, rmid, cur_msr_val, new_msr_val;
>  	struct mbm_state *pmbm_data, *cmbm_data;
> -	u32 cur_bw, delta_bw, user_bw;
> +	u32 cur_bw, delta_bw, user_bw, idx;
>  	struct rdt_resource *r_mba;
>  	struct rdt_domain *dom_mba;
>  	struct list_head *head;
> @@ -515,7 +539,8 @@ static void update_mba_bw(struct rdtgroup *rgrp, struct
> rdt_domain *dom_mbm)
> 
>  	closid = rgrp->closid;
>  	rmid = rgrp->mon.rmid;
> -	pmbm_data = &dom_mbm->mbm_local[rmid];
> +	idx = resctrl_arch_rmid_idx_encode(closid, rmid);
> +	pmbm_data = &dom_mbm->mbm_local[idx];
> 
>  	dom_mba = get_domain_from_cpu(smp_processor_id(), r_mba);
>  	if (!dom_mba) {
> @@ -698,19 +723,19 @@ void mbm_setup_overflow_handler(struct
> rdt_domain *dom, unsigned long delay_ms)
> 
>  static int dom_data_init(struct rdt_resource *r)  {
> +	u32 nr_idx = resctrl_arch_system_num_rmid_idx();
>  	struct rmid_entry *entry = NULL;
> -	int i, nr_rmids;
> +	int i;
> 
> -	nr_rmids = r->num_rmid;
> -	rmid_ptrs = kcalloc(nr_rmids, sizeof(struct rmid_entry), GFP_KERNEL);
> +	rmid_ptrs = kcalloc(nr_idx, sizeof(struct rmid_entry), GFP_KERNEL);
>  	if (!rmid_ptrs)
>  		return -ENOMEM;
> 
> -	for (i = 0; i < nr_rmids; i++) {
> +	for (i = 0; i < nr_idx; i++) {
>  		entry = &rmid_ptrs[i];
>  		INIT_LIST_HEAD(&entry->list);
> 
> -		entry->rmid = i;
> +		resctrl_arch_rmid_idx_decode(i, &entry->closid, &entry->rmid);
>  		list_add_tail(&entry->list, &rmid_free_lru);
>  	}
> 
> @@ -719,7 +744,7 @@ static int dom_data_init(struct rdt_resource *r)
>  	 * default_rdtgroup control group, which will be setup later. See
>  	 * rdtgroup_setup_root().
>  	 */
> -	entry = __rmid_entry(0, 0);
> +	entry = __rmid_entry(resctrl_arch_rmid_idx_encode(0, 0));

Better change to:
+	entry = __rmid_entry(resctrl_arch_rmid_idx_encode(RESCTRL_BAD_CLOSID, 0));
because this explicitly tells CLOSID is invalid here on X86.

>  	list_del(&entry->list);
> 
>  	return 0;
> diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> index f3b739c52e42..9ce4746778f4 100644
> --- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> +++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> @@ -3320,16 +3320,17 @@ void resctrl_offline_domain(struct rdt_resource *r,
> struct rdt_domain *d)
> 
>  static int domain_setup_mon_state(struct rdt_resource *r, struct rdt_domain
> *d)  {
> +	u32 idx_limit = resctrl_arch_system_num_rmid_idx();
>  	size_t tsize;
> 
>  	if (is_llc_occupancy_enabled()) {
> -		d->rmid_busy_llc = bitmap_zalloc(r->num_rmid, GFP_KERNEL);
> +		d->rmid_busy_llc = bitmap_zalloc(idx_limit, GFP_KERNEL);
>  		if (!d->rmid_busy_llc)
>  			return -ENOMEM;
>  	}
>  	if (is_mbm_total_enabled()) {
>  		tsize = sizeof(*d->mbm_total);
> -		d->mbm_total = kcalloc(r->num_rmid, tsize, GFP_KERNEL);
> +		d->mbm_total = kcalloc(idx_limit, tsize, GFP_KERNEL);
>  		if (!d->mbm_total) {
>  			bitmap_free(d->rmid_busy_llc);
>  			return -ENOMEM;
> @@ -3337,7 +3338,7 @@ static int domain_setup_mon_state(struct
> rdt_resource *r, struct rdt_domain *d)
>  	}
>  	if (is_mbm_local_enabled()) {
>  		tsize = sizeof(*d->mbm_local);
> -		d->mbm_local = kcalloc(r->num_rmid, tsize, GFP_KERNEL);
> +		d->mbm_local = kcalloc(idx_limit, tsize, GFP_KERNEL);
>  		if (!d->mbm_local) {
>  			bitmap_free(d->rmid_busy_llc);
>  			kfree(d->mbm_total);
> --
> 2.30.2

Thanks.

-Fenghua

^ permalink raw reply	[flat|nested] 73+ messages in thread

* RE: [PATCH v2 04/18] x86/resctrl: Move rmid allocation out of mkdir_rdt_prepare()
  2023-01-13 17:54 ` [PATCH v2 04/18] x86/resctrl: Move rmid allocation out of mkdir_rdt_prepare() James Morse
@ 2023-01-17 18:28   ` Yu, Fenghua
  2023-02-02 23:45   ` Reinette Chatre
  1 sibling, 0 replies; 73+ messages in thread
From: Yu, Fenghua @ 2023-01-17 18:28 UTC (permalink / raw)
  To: James Morse, x86, linux-kernel
  Cc: Chatre, Reinette, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	H Peter Anvin, Babu Moger, shameerali.kolothum.thodi,
	D Scott Phillips OS, carl, lcherian, bobo.shaobowang,
	tan.shaopeng, xingxin.hx, baolin.wang, Jamie Iles, Xin Hao,
	peternewman

Hi, James,

> RMID are allocated for each monitor or control group directory, because each
> of these needs its own RMID. For control groups,
> rdtgroup_mkdir_ctrl_mon() later goes on to allocate the CLOSID.
> 
> MPAM's equivalent of RMID are not an independent number, so can't be

s/are/is/?

> allocated until the closid is known. An RMID allocation for one CLOSID may fail,

s/closid/CLOSID/

> whereas another may succeed depending on how many monitor groups a
> control group has.
> 
> The RMID allocation needs to move to be after the CLOSID has been allocated.
> 
> Move the RMID allocation out of mkdir_rdt_prepare() to occur in its caller, after
> the mkdir_rdt_prepare() call. This allows the RMID allocator to know the CLOSID.
> 
> Tested-by: Shaopeng Tan <tan.shaopeng@fujitsu.com>
> Signed-off-by: James Morse <james.morse@arm.com>
> ---
>  arch/x86/kernel/cpu/resctrl/rdtgroup.c | 29 +++++++++++++++++++-------
>  1 file changed, 22 insertions(+), 7 deletions(-)
> 
> diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> index 841294ad6263..c67083a8a5f5 100644
> --- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> +++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> @@ -2892,6 +2892,12 @@ static int mkdir_rdt_prepare_rmid_alloc(struct
> rdtgroup *rdtgrp)
>  	return 0;
>  }
> 
> +static void mkdir_rdt_prepare_rmid_free(struct rdtgroup *rgrp) {
> +	if (rdt_mon_capable)
> +		free_rmid(rgrp->closid, rgrp->mon.rmid); }
> +
>  static int mkdir_rdt_prepare(struct kernfs_node *parent_kn,
>  			     const char *name, umode_t mode,
>  			     enum rdt_group_type rtype, struct rdtgroup **r)
> @@ -2957,10 +2963,6 @@ static int mkdir_rdt_prepare(struct kernfs_node
> *parent_kn,
>  		goto out_destroy;
>  	}
> 
> -	ret = mkdir_rdt_prepare_rmid_alloc(rdtgrp);
> -	if (ret)
> -		goto out_destroy;
> -
>  	kernfs_activate(kn);
> 
>  	/*
> @@ -2981,7 +2983,6 @@ static int mkdir_rdt_prepare(struct kernfs_node
> *parent_kn,  static void mkdir_rdt_prepare_clean(struct rdtgroup *rgrp)  {
>  	kernfs_remove(rgrp->kn);
> -	free_rmid(rgrp->closid, rgrp->mon.rmid);
>  	rdtgroup_remove(rgrp);
>  }
> 
> @@ -3003,12 +3004,19 @@ static int rdtgroup_mkdir_mon(struct kernfs_node
> *parent_kn,
>  	prgrp = rdtgrp->mon.parent;
>  	rdtgrp->closid = prgrp->closid;
> 
> +	ret = mkdir_rdt_prepare_rmid_alloc(rdtgrp);
> +	if (ret) {
> +		mkdir_rdt_prepare_clean(rdtgrp);
> +		goto out_unlock;
> +	}
> +
>  	/*
>  	 * Add the rdtgrp to the list of rdtgrps the parent
>  	 * ctrl_mon group has to track.
>  	 */
>  	list_add_tail(&rdtgrp->mon.crdtgrp_list, &prgrp->mon.crdtgrp_list);
> 
> +out_unlock:
>  	rdtgroup_kn_unlock(parent_kn);
>  	return ret;
>  }
> @@ -3039,10 +3047,15 @@ static int rdtgroup_mkdir_ctrl_mon(struct
> kernfs_node *parent_kn,
>  	ret = 0;
> 
>  	rdtgrp->closid = closid;
> -	ret = rdtgroup_init_alloc(rdtgrp);
> -	if (ret < 0)
> +
> +	ret = mkdir_rdt_prepare_rmid_alloc(rdtgrp);
> +	if (ret)
>  		goto out_id_free;
> 
> +	ret = rdtgroup_init_alloc(rdtgrp);
> +	if (ret < 0)
> +		goto out_rmid_free;
> +
>  	list_add(&rdtgrp->rdtgroup_list, &rdt_all_groups);
> 
>  	if (rdt_mon_capable) {
> @@ -3061,6 +3074,8 @@ static int rdtgroup_mkdir_ctrl_mon(struct
> kernfs_node *parent_kn,
> 
>  out_del_list:
>  	list_del(&rdtgrp->rdtgroup_list);
> +out_rmid_free:
> +	mkdir_rdt_prepare_rmid_free(rdtgrp);
>  out_id_free:
>  	closid_free(closid);
>  out_common_fail:
> --
> 2.30.2

Thanks.

-Fenghua

^ permalink raw reply	[flat|nested] 73+ messages in thread

* RE: [PATCH v2 06/18] x86/resctrl: Allow the allocator to check if a CLOSID can allocate clean RMID
  2023-01-13 17:54 ` [PATCH v2 06/18] x86/resctrl: Allow the allocator to check if a CLOSID can allocate clean RMID James Morse
@ 2023-01-17 18:29   ` Yu, Fenghua
  2023-03-03 18:35     ` James Morse
  2023-02-02 23:46   ` Reinette Chatre
  1 sibling, 1 reply; 73+ messages in thread
From: Yu, Fenghua @ 2023-01-17 18:29 UTC (permalink / raw)
  To: James Morse, x86, linux-kernel
  Cc: Chatre, Reinette, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	H Peter Anvin, Babu Moger, shameerali.kolothum.thodi,
	D Scott Phillips OS, carl, lcherian, bobo.shaobowang,
	tan.shaopeng, xingxin.hx, baolin.wang, Jamie Iles, Xin Hao,
	peternewman

Hi, James,

> MPAM's PMG bits extend its PARTID space, meaning the same PMG value can be
> used for different control groups.
> 
> This means once a CLOSID is allocated, all its monitoring ids may still be dirty,
> and held in limbo.
> 
> Add a helper to allow the CLOSID allocator to check if a CLOSID has dirty RMID
> values. This behaviour is enabled by a kconfig option selected by the
> architecture, which avoids a pointless search for x86.
> 
> Tested-by: Shaopeng Tan <tan.shaopeng@fujitsu.com>
> Signed-off-by: James Morse <james.morse@arm.com>
> 
> ---
> Changes since v1:
>  * Removed superflous IS_ENABLED().
> ---
>  arch/x86/kernel/cpu/resctrl/internal.h |  1 +
> arch/x86/kernel/cpu/resctrl/monitor.c  | 31 ++++++++++++++++++++++++++
> arch/x86/kernel/cpu/resctrl/rdtgroup.c | 17 ++++++++------
>  3 files changed, 42 insertions(+), 7 deletions(-)
> 
> diff --git a/arch/x86/kernel/cpu/resctrl/internal.h
> b/arch/x86/kernel/cpu/resctrl/internal.h
> index 013c8fc9fd28..adae6231324f 100644
> --- a/arch/x86/kernel/cpu/resctrl/internal.h
> +++ b/arch/x86/kernel/cpu/resctrl/internal.h
> @@ -509,6 +509,7 @@ int rdtgroup_pseudo_lock_create(struct rdtgroup
> *rdtgrp);  void rdtgroup_pseudo_lock_remove(struct rdtgroup *rdtgrp);  struct
> rdt_domain *get_domain_from_cpu(int cpu, struct rdt_resource *r);  int
> closids_supported(void);
> +bool resctrl_closid_is_dirty(u32 closid);
>  void closid_free(int closid);
>  int alloc_rmid(u32 closid);
>  void free_rmid(u32 closid, u32 rmid);
> diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c
> b/arch/x86/kernel/cpu/resctrl/monitor.c
> index 347be3767241..190ac183139e 100644
> --- a/arch/x86/kernel/cpu/resctrl/monitor.c
> +++ b/arch/x86/kernel/cpu/resctrl/monitor.c
> @@ -327,6 +327,37 @@ static struct rmid_entry *resctrl_find_free_rmid(u32
> closid)
>  	return ERR_PTR(-ENOSPC);
>  }
> 
> +/**
> + * resctrl_closid_is_dirty - Determine if clean RMID can be allocate for this

s/allocate/allocated/

> + *                           CLOSID.
> + * @closid: The CLOSID that is being queried.
> + *
> + * MPAM's equivalent of RMID are per-CLOSID, meaning a freshly allocate

s/allocate/allocated/

> +CLOSID
> + * may not be able to allocate clean RMID. To avoid this the allocator
> +will
> + * only return clean CLOSID.
> + */
> +bool resctrl_closid_is_dirty(u32 closid) {
> +	struct rmid_entry *entry;
> +	int i;
> +
> +	lockdep_assert_held(&rdtgroup_mutex);

It's better to move lockdep_asser_held() after if (!IS_ENABLE()).
Then compiler might optimize this function to empty on X86.

> +
> +	if (!IS_ENABLED(CONFIG_RESCTRL_RMID_DEPENDS_ON_CLOSID))
> +		return false;
> +
> +	for (i = 0; i < resctrl_arch_system_num_rmid_idx(); i++) {
> +		entry = &rmid_ptrs[i];
> +		if (entry->closid != closid)
> +			continue;
> +
> +		if (entry->busy)
> +			return true;
> +	}
> +
> +	return false;
> +}
> +
>  /*
>   * As of now the RMIDs allocation is the same in each domain.
>   * However we keep track of which packages the RMIDs diff --git
> a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> index ac88610a6946..e1f879e13823 100644
> --- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> +++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> @@ -93,7 +93,7 @@ void rdt_last_cmd_printf(const char *fmt, ...)
>   * - Our choices on how to configure each resource become progressively more
>   *   limited as the number of resources grows.
>   */
> -static int closid_free_map;
> +static unsigned long closid_free_map;
>  static int closid_free_map_len;
> 
>  int closids_supported(void)
> @@ -119,14 +119,17 @@ static void closid_init(void)
> 
>  static int closid_alloc(void)
>  {
> -	u32 closid = ffs(closid_free_map);
> +	u32 closid;
> 
> -	if (closid == 0)
> -		return -ENOSPC;
> -	closid--;
> -	closid_free_map &= ~(1 << closid);
> +	for_each_set_bit(closid, &closid_free_map, closid_free_map_len) {
> +		if (resctrl_closid_is_dirty(closid))
> +			continue;
> 
> -	return closid;
> +		clear_bit(closid, &closid_free_map);
> +		return closid;
> +	}
> +
> +	return -ENOSPC;
>  }
> 
>  void closid_free(int closid)
> --
> 2.30.2

Thanks.

-Fenghua

^ permalink raw reply	[flat|nested] 73+ messages in thread

* RE: [PATCH v2 02/18] x86/resctrl: Access per-rmid structures by index
  2023-01-13 17:54 ` [PATCH v2 02/18] x86/resctrl: Access per-rmid structures by index James Morse
  2023-01-17 18:28   ` Yu, Fenghua
@ 2023-01-17 18:29   ` Yu, Fenghua
  2023-02-02 23:44   ` Reinette Chatre
  2 siblings, 0 replies; 73+ messages in thread
From: Yu, Fenghua @ 2023-01-17 18:29 UTC (permalink / raw)
  To: James Morse, x86, linux-kernel
  Cc: Chatre, Reinette, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	H Peter Anvin, Babu Moger, shameerali.kolothum.thodi,
	D Scott Phillips OS, carl, lcherian, bobo.shaobowang,
	tan.shaopeng, xingxin.hx, baolin.wang, Jamie Iles, Xin Hao,
	peternewman

Hi, James,

> Because of the differences between Intel RDT/AMD QoS and Arm's MPAM
> monitors, RMID values on arm64 are not unique unless the CLOSID is also
> included. Bitmaps like rmid_busy_llc need to be sized by the number of unique
> entries for this resource.
> 
> Add helpers to encode/decode the CLOSID and RMID to an index. The domain's
> busy_rmid_llc and the rmid_ptrs[] array are then sized by index. On x86, this is
> always just the RMID. This gives resctrl a unique value it can use to store
> monitor values, and allows MPAM to decode the closid when reading the
> hardware counters.
> 
> Tested-by: Shaopeng Tan <tan.shaopeng@fujitsu.com>
> Signed-off-by: James Morse <james.morse@arm.com>
> ---
> Changes since v1:
>  * Added X86_BAD_CLOSID macro to make it clear what this value means
>  * Added second WARN_ON() for closid checking, and made both _ONCE()
> ---
>  arch/x86/include/asm/resctrl.h         | 24 ++++++++
>  arch/x86/kernel/cpu/resctrl/internal.h |  2 +
> arch/x86/kernel/cpu/resctrl/monitor.c  | 79 +++++++++++++++++---------
> arch/x86/kernel/cpu/resctrl/rdtgroup.c |  7 ++-
>  4 files changed, 82 insertions(+), 30 deletions(-)
> 
> diff --git a/arch/x86/include/asm/resctrl.h b/arch/x86/include/asm/resctrl.h
> index 52788f79786f..44d568f3577c 100644
> --- a/arch/x86/include/asm/resctrl.h
> +++ b/arch/x86/include/asm/resctrl.h
> @@ -7,6 +7,13 @@
>  #include <linux/sched.h>
>  #include <linux/jump_label.h>
> 
> +/*
> + * This value can never be a valid CLOSID, and is used when mapping a
> + * (closid, rmid) pair to an index and back. On x86 only the RMID is
> + * needed.
> + */
> +#define X86_RESCTRL_BAD_CLOSID		~0
> +
>  /**
>   * struct resctrl_pqr_state - State cache for the PQR MSR
>   * @cur_rmid:		The cached Resource Monitoring ID
> @@ -94,6 +101,23 @@ static inline void resctrl_sched_in(void)
>  		__resctrl_sched_in();
>  }
> 
> +static inline u32 resctrl_arch_system_num_rmid_idx(void)
> +{
> +	/* RMID are independent numbers for x86. num_rmid_idx==num_rmid
> */
> +	return boot_cpu_data.x86_cache_max_rmid + 1; }
> +
> +static inline void resctrl_arch_rmid_idx_decode(u32 idx, u32 *closid,
> +u32 *rmid) {
> +	*rmid = idx;
> +	*closid = X86_RESCTRL_BAD_CLOSID;
> +}
> +
> +static inline u32 resctrl_arch_rmid_idx_encode(u32 closid, u32 rmid) {

s/u32 closid/u32 ignored/, please.

> +	return rmid;
> +}
> +

Thanks.

-Fenghua

^ permalink raw reply	[flat|nested] 73+ messages in thread

* RE: [PATCH v2 08/18] x86/resctrl: Queue mon_event_read() instead of sending an IPI
  2023-01-13 17:54 ` [PATCH v2 08/18] x86/resctrl: Queue mon_event_read() instead of sending an IPI James Morse
@ 2023-01-17 18:29   ` Yu, Fenghua
  2023-03-06 11:32     ` James Morse
  2023-02-02 23:47   ` Reinette Chatre
  1 sibling, 1 reply; 73+ messages in thread
From: Yu, Fenghua @ 2023-01-17 18:29 UTC (permalink / raw)
  To: James Morse, x86, linux-kernel
  Cc: Chatre, Reinette, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	H Peter Anvin, Babu Moger, shameerali.kolothum.thodi,
	D Scott Phillips OS, carl, lcherian, bobo.shaobowang,
	tan.shaopeng, xingxin.hx, baolin.wang, Jamie Iles, Xin Hao,
	peternewman

Hi, James,

> x86 is blessed with an abundance of monitors, one per RMID, that can be read
> from any CPU in the domain. MPAMs monitors reside in the MMIO MSC, the
> number implemented is up to the manufacturer. This means when there are
> fewer monitors than needed, they need to be allocated and freed.
> 
> Worse, the domain may be broken up into slices, and the MMIO accesses for
> each slice may need performing from different CPUs.
> 
> These two details mean MPAMs monitor code needs to be able to sleep, and IPI
> another CPU in the domain to read from a resource that has been sliced.
> 
> mon_event_read() already invokes mon_event_count() via IPI, which means this
> isn't possible.
> 
> Change mon_event_read() to schedule mon_event_count() on a remote CPU
> and wait, instead of sending an IPI. This function is only used in response to a
> user-space filesystem request (not the timing sensitive overflow code).

But mkdir mon group needs mon_event_count() to reset RMID state.
If mon_event_count() is called much later, the RMID state may be used
before it's reset. E.g. prev_msr might be non-0 value. That will cause
overflow code failure.

Seems this may happen on both x86 and arm64. At least need to make sure
RMID state reset happens before it's used.

> 
> This allows MPAM to hide the slice behaviour from resctrl, and to keep the
> monitor-allocation in monitor.c.
> 
> Tested-by: Shaopeng Tan <tan.shaopeng@fujitsu.com>
> Signed-off-by: James Morse <james.morse@arm.com>
> ---
>  arch/x86/kernel/cpu/resctrl/ctrlmondata.c | 7 +++++--
>  arch/x86/kernel/cpu/resctrl/internal.h    | 2 +-
>  arch/x86/kernel/cpu/resctrl/monitor.c     | 6 ++++--
>  3 files changed, 10 insertions(+), 5 deletions(-)
> 
> diff --git a/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
> b/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
> index 1df0e3262bca..4ee3da6dced7 100644
> --- a/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
> +++ b/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
> @@ -532,8 +532,11 @@ void mon_event_read(struct rmid_read *rr, struct
> rdt_resource *r,
>  		    struct rdt_domain *d, struct rdtgroup *rdtgrp,
>  		    int evtid, int first)
>  {
> +	/* When picking a cpu from cpu_mask, ensure it can't race with cpuhp
> */
> +	lockdep_assert_held(&rdtgroup_mutex);
> +
>  	/*
> -	 * setup the parameters to send to the IPI to read the data.
> +	 * setup the parameters to pass to mon_event_count() to read the data.
>  	 */
>  	rr->rgrp = rdtgrp;
>  	rr->evtid = evtid;
> @@ -542,7 +545,7 @@ void mon_event_read(struct rmid_read *rr, struct
> rdt_resource *r,
>  	rr->val = 0;
>  	rr->first = first;
> 
> -	smp_call_function_any(&d->cpu_mask, mon_event_count, rr, 1);
> +	smp_call_on_cpu(cpumask_any(&d->cpu_mask), mon_event_count, rr,
> +false);
>  }
> 
>  int rdtgroup_mondata_show(struct seq_file *m, void *arg) diff --git
> a/arch/x86/kernel/cpu/resctrl/internal.h
> b/arch/x86/kernel/cpu/resctrl/internal.h
> index adae6231324f..1f90a10b75a1 100644
> --- a/arch/x86/kernel/cpu/resctrl/internal.h
> +++ b/arch/x86/kernel/cpu/resctrl/internal.h
> @@ -514,7 +514,7 @@ void closid_free(int closid);  int alloc_rmid(u32 closid);
> void free_rmid(u32 closid, u32 rmid);  int rdt_get_mon_l3_config(struct
> rdt_resource *r); -void mon_event_count(void *info);
> +int mon_event_count(void *info);
>  int rdtgroup_mondata_show(struct seq_file *m, void *arg);  void
> mon_event_read(struct rmid_read *rr, struct rdt_resource *r,
>  		    struct rdt_domain *d, struct rdtgroup *rdtgrp, diff --git
> a/arch/x86/kernel/cpu/resctrl/monitor.c
> b/arch/x86/kernel/cpu/resctrl/monitor.c
> index 190ac183139e..d309b830aeb2 100644
> --- a/arch/x86/kernel/cpu/resctrl/monitor.c
> +++ b/arch/x86/kernel/cpu/resctrl/monitor.c
> @@ -509,10 +509,10 @@ static void mbm_bw_count(u32 closid, u32 rmid,
> struct rmid_read *rr)  }
> 
>  /*
> - * This is called via IPI to read the CQM/MBM counters
> + * This is scheduled by mon_event_read() to read the CQM/MBM counters
>   * on a domain.
>   */
> -void mon_event_count(void *info)
> +int mon_event_count(void *info)
>  {
>  	struct rdtgroup *rdtgrp, *entry;
>  	struct rmid_read *rr = info;
> @@ -545,6 +545,8 @@ void mon_event_count(void *info)
>  	 */
>  	if (ret == 0)
>  		rr->err = 0;
> +
> +	return 0;
>  }
> 
>  /*
> --
> 2.30.2

Thanks.

-Fenghua

^ permalink raw reply	[flat|nested] 73+ messages in thread

* RE: [PATCH v2 05/18] x86/resctrl: Allow RMID allocation to be scoped by CLOSID
  2023-01-13 17:54 ` [PATCH v2 05/18] x86/resctrl: Allow RMID allocation to be scoped by CLOSID James Morse
@ 2023-01-17 18:53   ` Yu, Fenghua
  2023-03-03 18:34     ` James Morse
  2023-02-02 23:45   ` Reinette Chatre
  1 sibling, 1 reply; 73+ messages in thread
From: Yu, Fenghua @ 2023-01-17 18:53 UTC (permalink / raw)
  To: James Morse, x86, linux-kernel
  Cc: Chatre, Reinette, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	H Peter Anvin, Babu Moger, shameerali.kolothum.thodi,
	D Scott Phillips OS, carl, lcherian, bobo.shaobowang,
	tan.shaopeng, xingxin.hx, baolin.wang, Jamie Iles, Xin Hao,
	peternewman

Hi, James,

> MPAMs RMID values are not unique unless the CLOSID is considered as well.
> 
> alloc_rmid() expects the RMID to be an independent number.
> 
> Pass the CLOSID in to alloc_rmid(). Use this to compare indexes when allocating.
> If the CLOSID is not relevant to the index, this ends up comparing the free RMID
> with itself, and the first free entry will be used. With MPAM the CLOSID is
> included in the index, so this becomes a walk of the free RMID entries, until one
> that matches the supplied CLOSID is found.
> 
> Tested-by: Shaopeng Tan <tan.shaopeng@fujitsu.com>
> Signed-off-by: James Morse <james.morse@arm.com>
> ---
>  arch/x86/kernel/cpu/resctrl/internal.h    |  2 +-
>  arch/x86/kernel/cpu/resctrl/monitor.c     | 44 ++++++++++++++++++-----
>  arch/x86/kernel/cpu/resctrl/pseudo_lock.c |  2 +-
>  arch/x86/kernel/cpu/resctrl/rdtgroup.c    |  2 +-
>  4 files changed, 38 insertions(+), 12 deletions(-)
> 
> diff --git a/arch/x86/kernel/cpu/resctrl/internal.h
> b/arch/x86/kernel/cpu/resctrl/internal.h
> index af71401c57e2..013c8fc9fd28 100644
> --- a/arch/x86/kernel/cpu/resctrl/internal.h
> +++ b/arch/x86/kernel/cpu/resctrl/internal.h
> @@ -510,7 +510,7 @@ void rdtgroup_pseudo_lock_remove(struct rdtgroup
> *rdtgrp);  struct rdt_domain *get_domain_from_cpu(int cpu, struct rdt_resource
> *r);  int closids_supported(void);  void closid_free(int closid); -int
> alloc_rmid(void);
> +int alloc_rmid(u32 closid);
>  void free_rmid(u32 closid, u32 rmid);
>  int rdt_get_mon_l3_config(struct rdt_resource *r);  void mon_event_count(void
> *info); diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c
> b/arch/x86/kernel/cpu/resctrl/monitor.c
> index dbae380e3d1c..347be3767241 100644
> --- a/arch/x86/kernel/cpu/resctrl/monitor.c
> +++ b/arch/x86/kernel/cpu/resctrl/monitor.c
> @@ -301,25 +301,51 @@ bool has_busy_rmid(struct rdt_resource *r, struct
> rdt_domain *d)
>  	return find_first_bit(d->rmid_busy_llc, idx_limit) != idx_limit;  }
> 
> +static struct rmid_entry *resctrl_find_free_rmid(u32 closid) {
> +	struct rmid_entry *itr;
> +	u32 itr_idx, cmp_idx;
> +
> +	if (list_empty(&rmid_free_lru))
> +		return rmid_limbo_count ? ERR_PTR(-EBUSY) : ERR_PTR(-
> ENOSPC);
> +
> +	list_for_each_entry(itr, &rmid_free_lru, list) {
> +		/*
> +		 * get the index of this free RMID, and the index it would need
> +		 * to be if it were used with this CLOSID.
> +		 * If the CLOSID is irrelevant on this architecture, these will
> +		 * always be the same. Otherwise they will only match if this
> +		 * RMID can be used with this CLOSID.
> +		 */
> +		itr_idx = resctrl_arch_rmid_idx_encode(itr->closid, itr->rmid);
> +		cmp_idx = resctrl_arch_rmid_idx_encode(closid, itr->rmid);
> +
> +		if (itr_idx == cmp_idx)
> +			return itr;

Finding free rmid may be called frequently depending on usage.

It would be better to have a simpler and faster arch helper that finds the itr on x86.
Something like:
struct rmid_entry *resctrl_arch_rmid_matchd(u32 ignored, u32 ignored)
{
	return list_entry_first(resctrl_free_lru, itr, list);
}

Arm64 implements the complex case going through the rmid_free_lru list in the patch.

> +	}
> +
> +	return ERR_PTR(-ENOSPC);
> +}
> +
>  /*
> - * As of now the RMIDs allocation is global.
> + * As of now the RMIDs allocation is the same in each domain.
>   * However we keep track of which packages the RMIDs
>   * are used to optimize the limbo list management.
> + * The closid is ignored on x86.
>   */
> -int alloc_rmid(void)
> +int alloc_rmid(u32 closid)
>  {
>  	struct rmid_entry *entry;
> 
>  	lockdep_assert_held(&rdtgroup_mutex);
> 
> -	if (list_empty(&rmid_free_lru))
> -		return rmid_limbo_count ? -EBUSY : -ENOSPC;
> +	entry = resctrl_find_free_rmid(closid);
> +	if (!IS_ERR(entry)) {
> +		list_del(&entry->list);
> +		return entry->rmid;
> +	}
> 
> -	entry = list_first_entry(&rmid_free_lru,
> -				 struct rmid_entry, list);
> -	list_del(&entry->list);
> -
> -	return entry->rmid;
> +	return PTR_ERR(entry);
>  }
> 
>  static void add_rmid_to_limbo(struct rmid_entry *entry) diff --git
> a/arch/x86/kernel/cpu/resctrl/pseudo_lock.c
> b/arch/x86/kernel/cpu/resctrl/pseudo_lock.c
> index c51932516965..3b724a40d3a2 100644
> --- a/arch/x86/kernel/cpu/resctrl/pseudo_lock.c
> +++ b/arch/x86/kernel/cpu/resctrl/pseudo_lock.c
> @@ -763,7 +763,7 @@ int rdtgroup_locksetup_exit(struct rdtgroup *rdtgrp)
>  	int ret;
> 
>  	if (rdt_mon_capable) {
> -		ret = alloc_rmid();
> +		ret = alloc_rmid(rdtgrp->closid);
>  		if (ret < 0) {
>  			rdt_last_cmd_puts("Out of RMIDs\n");
>  			return ret;
> diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> index c67083a8a5f5..ac88610a6946 100644
> --- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> +++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> @@ -2875,7 +2875,7 @@ static int mkdir_rdt_prepare_rmid_alloc(struct
> rdtgroup *rdtgrp)
>  	if (!rdt_mon_capable)
>  		return 0;
> 
> -	ret = alloc_rmid();
> +	ret = alloc_rmid(rdtgrp->closid);
>  	if (ret < 0) {
>  		rdt_last_cmd_puts("Out of RMIDs\n");
>  		return ret;
> --
> 2.30.2

Thanks.

-Fenghua

^ permalink raw reply	[flat|nested] 73+ messages in thread

* RE: [PATCH v2 07/18] x86/resctrl: Move CLOSID/RMID matching and setting to use helpers
  2023-01-13 17:54 ` [PATCH v2 07/18] x86/resctrl: Move CLOSID/RMID matching and setting to use helpers James Morse
@ 2023-01-17 19:10   ` Yu, Fenghua
  2023-03-03 18:37     ` James Morse
  2023-02-02 23:47   ` Reinette Chatre
  1 sibling, 1 reply; 73+ messages in thread
From: Yu, Fenghua @ 2023-01-17 19:10 UTC (permalink / raw)
  To: James Morse, x86, linux-kernel
  Cc: Chatre, Reinette, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	H Peter Anvin, Babu Moger, shameerali.kolothum.thodi,
	D Scott Phillips OS, carl, lcherian, bobo.shaobowang,
	tan.shaopeng, xingxin.hx, baolin.wang, Jamie Iles, Xin Hao,
	peternewman

Hi, James,

> When switching tasks, the CLOSID and RMID that the new task should use are
> stored in struct task_struct. For x86 the CLOSID known by resctrl, the value in
> task_struct, and the value written to the CPU register are all the same thing.
> 
> MPAM's CPU interface has two different PARTID's one for data accesses the
> other for instruction fetch. Storing resctrl's CLOSID value in struct task_struct
> implies the arch code knows whether resctrl is using CDP.
> 
> Move the matching and setting of the struct task_struct properties to use
> helpers. This allows arm64 to store the hardware format of the register, instead
> of having to convert it each time.
> 
> __rdtgroup_move_task()s use of READ_ONCE()/WRITE_ONCE() ensures torn
> values aren't seen as another CPU may schedule the task being moved while the
> value is being changed. MPAM has an additional corner-case here as the PMG
> bits extend the PARTID space. If the scheduler sees a new-CLOSID but old-RMID,
> the task will dirty an RMID that the limbo code is not watching causing an
> inaccurate count. x86's RMID are independent values, so the limbo code will still
> be watching the old-RMID in this circumstance.
> To avoid this, arm64 needs both the CLOSID/RMID WRITE_ONCE()d together.
> Both values must be provided together.
> 
> Because MPAM's RMID values are not unique, the CLOSID must be provided
> when matching the RMID.
> 
> CC: Valentin Schneider <vschneid@redhat.com>
> Tested-by: Shaopeng Tan <tan.shaopeng@fujitsu.com>
> Signed-off-by: James Morse <james.morse@arm.com>
> ---
>  arch/x86/include/asm/resctrl.h         | 18 ++++++++
>  arch/x86/kernel/cpu/resctrl/rdtgroup.c | 57 +++++++++++++++-----------
>  2 files changed, 51 insertions(+), 24 deletions(-)
> 
> diff --git a/arch/x86/include/asm/resctrl.h b/arch/x86/include/asm/resctrl.h
> index 44d568f3577c..d589a82995ac 100644
> --- a/arch/x86/include/asm/resctrl.h
> +++ b/arch/x86/include/asm/resctrl.h
> @@ -95,6 +95,24 @@ static inline unsigned int
> resctrl_arch_round_mon_val(unsigned int val)
>  	return val * scale;
>  }
> 
> +static inline void resctrl_arch_set_closid_rmid(struct task_struct *tsk,
> +						u32 closid, u32 rmid)
> +{
> +	WRITE_ONCE(tsk->closid, closid);
> +	WRITE_ONCE(tsk->rmid, rmid);
> +}
> +
> +static inline bool resctrl_arch_match_closid(struct task_struct *tsk,
> +u32 closid) {
> +	return READ_ONCE(tsk->closid) == closid; }
> +
> +static inline bool resctrl_arch_match_rmid(struct task_struct *tsk, u32 ignored,
> +					   u32 rmid)
> +{
> +	return READ_ONCE(tsk->rmid) == rmid;
> +}
> +
>  static inline void resctrl_sched_in(void)  {
>  	if (static_branch_likely(&rdt_enable_key))
> diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> index e1f879e13823..ced7400decae 100644
> --- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> +++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> @@ -84,7 +84,7 @@ void rdt_last_cmd_printf(const char *fmt, ...)
>   *
>   * Using a global CLOSID across all resources has some advantages and
>   * some drawbacks:
> - * + We can simply set "current->closid" to assign a task to a resource
> + * + We can simply set current's closid to assign a task to a resource
>   *   group.

Seems this change doesn't gain anything. Maybe this change can be removed?

>   * + Context switch code can avoid extra memory references deciding which
>   *   CLOSID to load into the PQR_ASSOC MSR
> @@ -549,14 +549,26 @@ static void update_task_closid_rmid(struct
> task_struct *t)
>  		_update_task_closid_rmid(t);
>  }
> 
> +static bool task_in_rdtgroup(struct task_struct *tsk, struct rdtgroup
> +*rdtgrp) {
> +	u32 closid, rmid = rdtgrp->mon.rmid;
> +
> +	if (rdtgrp->type == RDTCTRL_GROUP)
> +		closid = rdtgrp->closid;
> +	else if (rdtgrp->type == RDTMON_GROUP)
> +		closid = rdtgrp->mon.parent->closid;
> +	else
> +		return false;
> +
> +	return resctrl_arch_match_closid(tsk, closid) &&
> +	       resctrl_arch_match_rmid(tsk, closid, rmid); }
> +
>  static int __rdtgroup_move_task(struct task_struct *tsk,
>  				struct rdtgroup *rdtgrp)
>  {
>  	/* If the task is already in rdtgrp, no need to move the task. */
> -	if ((rdtgrp->type == RDTCTRL_GROUP && tsk->closid == rdtgrp->closid
> &&
> -	     tsk->rmid == rdtgrp->mon.rmid) ||
> -	    (rdtgrp->type == RDTMON_GROUP && tsk->rmid == rdtgrp-
> >mon.rmid &&
> -	     tsk->closid == rdtgrp->mon.parent->closid))
> +	if (task_in_rdtgroup(tsk, rdtgrp))
>  		return 0;
> 
>  	/*
> @@ -567,19 +579,14 @@ static int __rdtgroup_move_task(struct task_struct
> *tsk,
>  	 * For monitor groups, can move the tasks only from
>  	 * their parent CTRL group.
>  	 */
> -
> -	if (rdtgrp->type == RDTCTRL_GROUP) {
> -		WRITE_ONCE(tsk->closid, rdtgrp->closid);
> -		WRITE_ONCE(tsk->rmid, rdtgrp->mon.rmid);
> -	} else if (rdtgrp->type == RDTMON_GROUP) {
> -		if (rdtgrp->mon.parent->closid == tsk->closid) {
> -			WRITE_ONCE(tsk->rmid, rdtgrp->mon.rmid);
> -		} else {
> -			rdt_last_cmd_puts("Can't move task to different
> control group\n");
> -			return -EINVAL;
> -		}
> +	if (rdtgrp->type == RDTMON_GROUP &&
> +	    !resctrl_arch_match_closid(tsk, rdtgrp->mon.parent->closid)) {
> +		rdt_last_cmd_puts("Can't move task to different control
> group\n");
> +		return -EINVAL;
>  	}
> 
> +	resctrl_arch_set_closid_rmid(tsk, rdtgrp->closid, rdtgrp->mon.rmid);
> +
>  	/*
>  	 * Ensure the task's closid and rmid are written before determining if
>  	 * the task is current that will decide if it will be interrupted.
> @@ -599,14 +606,15 @@ static int __rdtgroup_move_task(struct task_struct
> *tsk,
> 
>  static bool is_closid_match(struct task_struct *t, struct rdtgroup *r)  {
> -	return (rdt_alloc_capable &&
> -	       (r->type == RDTCTRL_GROUP) && (t->closid == r->closid));
> +	return (rdt_alloc_capable && (r->type == RDTCTRL_GROUP) &&
> +		resctrl_arch_match_closid(t, r->closid));
>  }
> 
>  static bool is_rmid_match(struct task_struct *t, struct rdtgroup *r)  {
> -	return (rdt_mon_capable &&
> -	       (r->type == RDTMON_GROUP) && (t->rmid == r->mon.rmid));
> +	return (rdt_mon_capable && (r->type == RDTMON_GROUP) &&
> +		resctrl_arch_match_rmid(t, r->mon.parent->closid,
> +					r->mon.rmid));
>  }
> 
>  /**
> @@ -802,7 +810,7 @@ int proc_resctrl_show(struct seq_file *s, struct
> pid_namespace *ns,
>  		    rdtg->mode != RDT_MODE_EXCLUSIVE)
>  			continue;
> 
> -		if (rdtg->closid != tsk->closid)
> +		if (!resctrl_arch_match_closid(tsk, rdtg->closid))
>  			continue;
> 
>  		seq_printf(s, "res:%s%s\n", (rdtg == &rdtgroup_default) ? "/" : "",
> @@ -810,7 +818,8 @@ int proc_resctrl_show(struct seq_file *s, struct
> pid_namespace *ns,
>  		seq_puts(s, "mon:");
>  		list_for_each_entry(crg, &rdtg->mon.crdtgrp_list,
>  				    mon.crdtgrp_list) {
> -			if (tsk->rmid != crg->mon.rmid)
> +			if (!resctrl_arch_match_rmid(tsk, crg->mon.parent-
> >closid,
> +						     crg->mon.rmid))
>  				continue;
>  			seq_printf(s, "%s", crg->kn->name);
>  			break;
> @@ -2401,8 +2410,8 @@ static void rdt_move_group_tasks(struct rdtgroup
> *from, struct rdtgroup *to,
>  	for_each_process_thread(p, t) {
>  		if (!from || is_closid_match(t, from) ||
>  		    is_rmid_match(t, from)) {
> -			WRITE_ONCE(t->closid, to->closid);
> -			WRITE_ONCE(t->rmid, to->mon.rmid);
> +			resctrl_arch_set_closid_rmid(t, to->closid,
> +						     to->mon.rmid);
> 
>  			/*
>  			 * If the task is on a CPU, set the CPU in the mask.
> --
> 2.30.2

Thanks.

-Fenghua

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH v2 09/18] x86/resctrl: Allow resctrl_arch_rmid_read() to sleep
  2023-01-13 17:54 ` [PATCH v2 09/18] x86/resctrl: Allow resctrl_arch_rmid_read() to sleep James Morse
@ 2023-01-23 13:54   ` Peter Newman
  2023-03-06 11:33     ` James Morse
  2023-01-23 15:33   ` Peter Newman
  1 sibling, 1 reply; 73+ messages in thread
From: Peter Newman @ 2023-01-23 13:54 UTC (permalink / raw)
  To: James Morse
  Cc: x86, linux-kernel, Fenghua Yu, Reinette Chatre, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, H Peter Anvin, Babu Moger,
	shameerali.kolothum.thodi, D Scott Phillips OS, carl, lcherian,
	bobo.shaobowang, tan.shaopeng, xingxin.hx, baolin.wang,
	Jamie Iles, Xin Hao

Hi James,

On Fri, Jan 13, 2023 at 6:56 PM James Morse <james.morse@arm.com> wrote:
> diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c
> index d309b830aeb2..d6ae4b713801 100644
> --- a/arch/x86/kernel/cpu/resctrl/monitor.c
> +++ b/arch/x86/kernel/cpu/resctrl/monitor.c
> @@ -206,17 +206,19 @@ static u64 mbm_overflow_count(u64 prev_msr, u64 cur_msr, unsigned int width)
>         return chunks >> shift;
>  }
>
> -int resctrl_arch_rmid_read(struct rdt_resource *r, struct rdt_domain *d,
> -                          u32 closid, u32 rmid, enum resctrl_event_id eventid,
> -                          u64 *val)
> +struct __rmid_read_arg
>  {
> -       struct rdt_hw_resource *hw_res = resctrl_to_arch_res(r);
> -       struct rdt_hw_domain *hw_dom = resctrl_to_arch_dom(d);
> -       struct arch_mbm_state *am;
> -       u64 msr_val, chunks;
> +       u32 rmid;
> +       enum resctrl_event_id eventid;
>
> -       if (!cpumask_test_cpu(smp_processor_id(), &d->cpu_mask))
> -               return -EINVAL;
> +       u64 msr_val;
> +};
> +
> +static void __rmid_read(void *arg)
> +{
> +       enum resctrl_event_id eventid = ((struct __rmid_read_arg *)arg)->eventid;
> +       u32 rmid = ((struct __rmid_read_arg *)arg)->rmid;
> +       u64 msr_val;
>
>         /*
>          * As per the SDM, when IA32_QM_EVTSEL.EvtID (bits 7:0) is configured
> @@ -229,6 +231,28 @@ int resctrl_arch_rmid_read(struct rdt_resource *r, struct rdt_domain *d,
>         wrmsr(MSR_IA32_QM_EVTSEL, eventid, rmid);
>         rdmsrl(MSR_IA32_QM_CTR, msr_val);
>
> +       ((struct __rmid_read_arg *)arg)->msr_val = msr_val;
> +}
> +
> +int resctrl_arch_rmid_read(struct rdt_resource *r, struct rdt_domain *d,
> +                          u32 closid, u32 rmid, enum resctrl_event_id eventid,
> +                          u64 *val)
> +{
> +       struct rdt_hw_resource *hw_res = resctrl_to_arch_res(r);
> +       struct rdt_hw_domain *hw_dom = resctrl_to_arch_dom(d);
> +       struct __rmid_read_arg arg;
> +       struct arch_mbm_state *am;
> +       u64 msr_val, chunks;
> +       int err;
> +
> +       arg.rmid = rmid;
> +       arg.eventid = eventid;
> +
> +       err = smp_call_function_any(&d->cpu_mask, __rmid_read, &arg, true);
> +       if (err)
> +               return err;
> +
> +       msr_val = arg.msr_val;

These changes are conflicting now after v6.2-rc4 due to my recent
changes in resctrl_arch_rmid_read(), which include my own
reintroduction of __rmid_read():

https://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git/commit/?id=2a81160d29d65b5876ab3f824fda99ae0219f05e

Fortunately it looks like our respective versions of __rmid_read()
aren't too much different from the original, but __rmid_read() does
have a new call site in resctrl_arch_reset_rmid() to record initial
event counts.

-Peter

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH v2 09/18] x86/resctrl: Allow resctrl_arch_rmid_read() to sleep
  2023-01-13 17:54 ` [PATCH v2 09/18] x86/resctrl: Allow resctrl_arch_rmid_read() to sleep James Morse
  2023-01-23 13:54   ` Peter Newman
@ 2023-01-23 15:33   ` Peter Newman
  2023-03-06 11:33     ` James Morse
  1 sibling, 1 reply; 73+ messages in thread
From: Peter Newman @ 2023-01-23 15:33 UTC (permalink / raw)
  To: James Morse
  Cc: x86, linux-kernel, Fenghua Yu, Reinette Chatre, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, H Peter Anvin, Babu Moger,
	shameerali.kolothum.thodi, D Scott Phillips OS, carl, lcherian,
	bobo.shaobowang, tan.shaopeng, xingxin.hx, baolin.wang,
	Jamie Iles, Xin Hao

On Fri, Jan 13, 2023 at 6:56 PM James Morse <james.morse@arm.com> wrote:
> MPAM's cache occupancy counters can take a little while to settle once
> the monitor has been configured. The maximum settling time is described
> to the driver via a firmware table. The value could be large enough
> that it makes sense to sleep.

Would it be easier to return an error when reading the occupancy count
too soon after configuration? On Intel it is already normal for counter
reads to fail on newly-allocated RMIDs.

-Peter

^ permalink raw reply	[flat|nested] 73+ messages in thread

* RE: [PATCH v2 14/18] x86/resctrl: Add helpers for system wide mon/alloc capable
  2023-01-13 17:54 ` [PATCH v2 14/18] x86/resctrl: Add helpers for system wide mon/alloc capable James Morse
@ 2023-01-25  7:16   ` Shaopeng Tan (Fujitsu)
  2023-03-06 11:34     ` James Morse
  0 siblings, 1 reply; 73+ messages in thread
From: Shaopeng Tan (Fujitsu) @ 2023-01-25  7:16 UTC (permalink / raw)
  To: 'James Morse', x86, linux-kernel
  Cc: Fenghua Yu, Reinette Chatre, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, H Peter Anvin, Babu Moger,
	shameerali.kolothum.thodi, D Scott Phillips OS, carl, lcherian,
	bobo.shaobowang, xingxin.hx, baolin.wang, Jamie Iles, Xin Hao,
	peternewman

> resctrl reads rdt_alloc_capable or rdt_mon_capable to determine whether any
> of the resources support the corresponding features.
> resctrl also uses the static-keys that affect the architecture's context-switch
> code to determine the same thing.
> 
> This forces another architecture to have the same static-keys.
> 
> As the static-key is enabled based on the capable flag, and none of the
> filesystem uses of these are in the scheduler path, move the capable flags
> behind helpers, and use these in the filesystem code instead of the static-key.
> 
> After this change, only the architecture code manages and uses the static-keys
> to ensure __resctrl_sched_in() does not need runtime checks.
> 
> This avoids multiple architectures having to define the same static-keys.
> 
> Tested-by: Shaopeng Tan <tan.shaopeng@fujitsu.com>
> Signed-off-by: James Morse <james.morse@arm.com>
> 
> ---
> Changes since v1:
>  * Added missing conversion in mkdir_rdt_prepare_rmid_free()
> ---
>  arch/x86/include/asm/resctrl.h            | 13 +++++++++
>  arch/x86/kernel/cpu/resctrl/internal.h    |  2 --
>  arch/x86/kernel/cpu/resctrl/monitor.c     |  4 +--
>  arch/x86/kernel/cpu/resctrl/pseudo_lock.c |  6 ++--
>  arch/x86/kernel/cpu/resctrl/rdtgroup.c    | 34 +++++++++++------------
>  5 files changed, 35 insertions(+), 24 deletions(-)
> 
> diff --git a/arch/x86/include/asm/resctrl.h b/arch/x86/include/asm/resctrl.h
> index 5b5ae6d8a343..3364d640f791 100644
> --- a/arch/x86/include/asm/resctrl.h
> +++ b/arch/x86/include/asm/resctrl.h
> @@ -38,10 +38,18 @@ struct resctrl_pqr_state {
> 
>  DECLARE_PER_CPU(struct resctrl_pqr_state, pqr_state);
> 
> +extern bool rdt_alloc_capable;
> +extern bool rdt_mon_capable;
> +
>  DECLARE_STATIC_KEY_FALSE(rdt_enable_key);
>  DECLARE_STATIC_KEY_FALSE(rdt_alloc_enable_key);
>  DECLARE_STATIC_KEY_FALSE(rdt_mon_enable_key);
> 
> +static inline bool resctrl_arch_alloc_capable(void) {
> +	return rdt_alloc_capable;
> +}
> +
>  static inline void resctrl_arch_enable_alloc(void)  {
>  	static_branch_enable_cpuslocked(&rdt_alloc_enable_key);
> @@ -54,6 +62,11 @@ static inline void resctrl_arch_disable_alloc(void)
>  	static_branch_dec_cpuslocked(&rdt_enable_key);
>  }
> 
> +static inline bool resctrl_arch_mon_capable(void) {
> +	return rdt_mon_capable;
> +}
> +
>  static inline void resctrl_arch_enable_mon(void)  {
>  	static_branch_enable_cpuslocked(&rdt_mon_enable_key);
> diff --git a/arch/x86/kernel/cpu/resctrl/internal.h
> b/arch/x86/kernel/cpu/resctrl/internal.h
> index 3997386cee89..a1bf97adee2e 100644
> --- a/arch/x86/kernel/cpu/resctrl/internal.h
> +++ b/arch/x86/kernel/cpu/resctrl/internal.h
> @@ -88,8 +88,6 @@ struct rmid_read {
>  	int			arch_mon_ctx;
>  };
> 
> -extern bool rdt_alloc_capable;
> -extern bool rdt_mon_capable;
>  extern unsigned int rdt_mon_features;
>  extern struct list_head resctrl_schema_all;  extern bool resctrl_mounted; diff
> --git a/arch/x86/kernel/cpu/resctrl/monitor.c
> b/arch/x86/kernel/cpu/resctrl/monitor.c
> index 4ff258b49e9c..1a214bd32ed4 100644
> --- a/arch/x86/kernel/cpu/resctrl/monitor.c
> +++ b/arch/x86/kernel/cpu/resctrl/monitor.c
> @@ -795,7 +795,7 @@ void mbm_handle_overflow(struct work_struct *work)
> 
>  	mutex_lock(&rdtgroup_mutex);
> 
> -	if (!resctrl_mounted || !static_branch_likely(&rdt_mon_enable_key))
> +	if (!resctrl_mounted || !resctrl_arch_mon_capable())
>  		goto out_unlock;
> 
>  	r = &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl;
> @@ -823,7 +823,7 @@ void mbm_setup_overflow_handler(struct rdt_domain
> *dom, unsigned long delay_ms)
>  	unsigned long delay = msecs_to_jiffies(delay_ms);
>  	int cpu;
> 
> -	if (!resctrl_mounted || !static_branch_likely(&rdt_mon_enable_key))
> +	if (!resctrl_mounted || !resctrl_arch_mon_capable())
>  		return;
> 
>  	cpu = cpumask_any(&dom->cpu_mask);
> diff --git a/arch/x86/kernel/cpu/resctrl/pseudo_lock.c
> b/arch/x86/kernel/cpu/resctrl/pseudo_lock.c
> index 3b724a40d3a2..0b4fdb118643 100644
> --- a/arch/x86/kernel/cpu/resctrl/pseudo_lock.c
> +++ b/arch/x86/kernel/cpu/resctrl/pseudo_lock.c
> @@ -567,7 +567,7 @@ static int rdtgroup_locksetup_user_restrict(struct
> rdtgroup *rdtgrp)
>  	if (ret)
>  		goto err_cpus;
> 
> -	if (rdt_mon_capable) {
> +	if (resctrl_arch_mon_capable()) {
>  		ret = rdtgroup_kn_mode_restrict(rdtgrp, "mon_groups");
>  		if (ret)
>  			goto err_cpus_list;
> @@ -614,7 +614,7 @@ static int rdtgroup_locksetup_user_restore(struct
> rdtgroup *rdtgrp)
>  	if (ret)
>  		goto err_cpus;
> 
> -	if (rdt_mon_capable) {
> +	if (resctrl_arch_mon_capable()) {
>  		ret = rdtgroup_kn_mode_restore(rdtgrp, "mon_groups", 0777);
>  		if (ret)
>  			goto err_cpus_list;
> @@ -762,7 +762,7 @@ int rdtgroup_locksetup_exit(struct rdtgroup *rdtgrp)  {
>  	int ret;
> 
> -	if (rdt_mon_capable) {
> +	if (resctrl_arch_mon_capable()) {
>  		ret = alloc_rmid(rdtgrp->closid);
>  		if (ret < 0) {
>  			rdt_last_cmd_puts("Out of RMIDs\n"); diff --git
> a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> index 0e22f8361392..44e6d6fbab25 100644
> --- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> +++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> @@ -609,13 +609,13 @@ static int __rdtgroup_move_task(struct task_struct
> *tsk,
> 
>  static bool is_closid_match(struct task_struct *t, struct rdtgroup *r)  {
> -	return (rdt_alloc_capable && (r->type == RDTCTRL_GROUP) &&
> +	return (resctrl_arch_alloc_capable() && (r->type ==
> RDTCTRL_GROUP) &&
>  		resctrl_arch_match_closid(t, r->closid));  }
> 
>  static bool is_rmid_match(struct task_struct *t, struct rdtgroup *r)  {
> -	return (rdt_mon_capable && (r->type == RDTMON_GROUP) &&
> +	return (resctrl_arch_mon_capable() && (r->type ==
> RDTMON_GROUP) &&
>  		resctrl_arch_match_rmid(t, r->mon.parent->closid,
>  					r->mon.rmid));
>  }
> @@ -2220,7 +2220,7 @@ static int rdt_get_tree(struct fs_context *fc)
>  	if (ret < 0)
>  		goto out_schemata_free;
> 
> -	if (rdt_mon_capable) {
> +	if (resctrl_arch_mon_capable()) {
>  		ret = mongroup_create_dir(rdtgroup_default.kn,
>  					  &rdtgroup_default, "mon_groups",
>  					  &kn_mongrp);
> @@ -2242,12 +2242,12 @@ static int rdt_get_tree(struct fs_context *fc)
>  	if (ret < 0)
>  		goto out_psl;
> 
> -	if (rdt_alloc_capable)
> +	if (resctrl_arch_alloc_capable())
>  		resctrl_arch_enable_alloc();
> -	if (rdt_mon_capable)
> +	if (resctrl_arch_mon_capable())
>  		resctrl_arch_enable_mon();
> 
> -	if (rdt_alloc_capable || rdt_mon_capable)
> +	if (resctrl_arch_alloc_capable() || resctrl_arch_mon_capable())
>  		resctrl_mounted = true;
> 
>  	if (is_mbm_enabled()) {
> @@ -2261,10 +2261,10 @@ static int rdt_get_tree(struct fs_context *fc)
>  out_psl:
>  	rdt_pseudo_lock_release();
>  out_mondata:
> -	if (rdt_mon_capable)
> +	if (resctrl_arch_mon_capable())
>  		kernfs_remove(kn_mondata);
>  out_mongrp:
> -	if (rdt_mon_capable)
> +	if (resctrl_arch_mon_capable())
>  		kernfs_remove(kn_mongrp);
>  out_info:
>  	kernfs_remove(kn_info);
> @@ -2512,9 +2512,9 @@ static void rdt_kill_sb(struct super_block *sb)
>  	rdt_pseudo_lock_release();
>  	rdtgroup_default.mode = RDT_MODE_SHAREABLE;
>  	schemata_list_destroy();
> -	if (rdt_alloc_capable)
> +	if (resctrl_arch_alloc_capable())
>  		resctrl_arch_disable_alloc();
> -	if (rdt_mon_capable)
> +	if (resctrl_arch_mon_capable())
>  		resctrl_arch_disable_mon();
>  	resctrl_mounted = false;
>  	kernfs_kill_sb(sb);
> @@ -2889,7 +2889,7 @@ static int mkdir_rdt_prepare_rmid_alloc(struct
> rdtgroup *rdtgrp)  {
>  	int ret;
> 
> -	if (!rdt_mon_capable)
> +	if (!resctrl_arch_mon_capable())
>  		return 0;
> 
>  	ret = alloc_rmid(rdtgrp->closid);
> @@ -2911,7 +2911,7 @@ static int mkdir_rdt_prepare_rmid_alloc(struct
> rdtgroup *rdtgrp)
> 
>  static void mkdir_rdt_prepare_rmid_free(struct rdtgroup *rgrp)  {
> -	if (rdt_mon_capable)
> +	if (resctrl_arch_mon_capable())
>  		free_rmid(rgrp->closid, rgrp->mon.rmid);  }
> 
> @@ -3075,7 +3075,7 @@ static int rdtgroup_mkdir_ctrl_mon(struct
> kernfs_node *parent_kn,
> 
>  	list_add(&rdtgrp->rdtgroup_list, &rdt_all_groups);
> 
> -	if (rdt_mon_capable) {
> +	if (resctrl_arch_mon_capable()) {
>  		/*
>  		 * Create an empty mon_groups directory to hold the subset
>  		 * of tasks and cpus to monitor.
> @@ -3130,14 +3130,14 @@ static int rdtgroup_mkdir(struct kernfs_node
> *parent_kn, const char *name,
>  	 * allocation is supported, add a control and monitoring
>  	 * subdirectory
>  	 */
> -	if (rdt_alloc_capable && parent_kn == rdtgroup_default.kn)
> +	if (resctrl_arch_alloc_capable() && parent_kn == rdtgroup_default.kn)
>  		return rdtgroup_mkdir_ctrl_mon(parent_kn, name, mode);
> 
>  	/*
>  	 * If RDT monitoring is supported and the parent directory is a valid
>  	 * "mon_groups" directory, add a monitoring subdirectory.
>  	 */
> -	if (rdt_mon_capable && is_mon_groups(parent_kn, name))
> +	if (resctrl_arch_mon_capable() && is_mon_groups(parent_kn, name))
>  		return rdtgroup_mkdir_mon(parent_kn, name, mode);
> 
>  	return -EPERM;
> @@ -3341,7 +3341,7 @@ void resctrl_offline_domain(struct rdt_resource *r,
> struct rdt_domain *d)
>  	 * If resctrl is mounted, remove all the
>  	 * per domain monitor data directories.
>  	 */
> -	if (resctrl_mounted && static_branch_unlikely(&rdt_mon_enable_key))
> +	if (resctrl_mounted && resctrl_arch_mon_capable())
>  		rmdir_mondata_subdir_allrdtgrp(r, d->id);
> 
>  	if (is_mbm_enabled())
> @@ -3418,7 +3418,7 @@ int resctrl_online_domain(struct rdt_resource *r,
> struct rdt_domain *d)
>  	if (is_llc_occupancy_enabled())
>  		INIT_DELAYED_WORK(&d->cqm_limbo, cqm_handle_limbo);
> 
> -	if (resctrl_mounted && static_branch_unlikely(&rdt_mon_enable_key))
> +	if (resctrl_mounted && resctrl_arch_mon_capable())
>  		mkdir_mondata_subdir_allrdtgrp(r, d);
> 
>  	return 0;
> --
> 2.30.2

Reviewed-by: Shaopeng Tan <tan.shaopeng@fujitsu.com>

^ permalink raw reply	[flat|nested] 73+ messages in thread

* RE: [PATCH v2 00/18] x86/resctrl: monitored closid+rmid together, separate arch/fs locking
  2023-01-13 17:54 [PATCH v2 00/18] x86/resctrl: monitored closid+rmid together, separate arch/fs locking James Morse
                   ` (17 preceding siblings ...)
  2023-01-13 17:54 ` [PATCH v2 18/18] x86/resctrl: Separate arch and fs resctrl locks James Morse
@ 2023-01-25  7:19 ` Shaopeng Tan (Fujitsu)
  18 siblings, 0 replies; 73+ messages in thread
From: Shaopeng Tan (Fujitsu) @ 2023-01-25  7:19 UTC (permalink / raw)
  To: 'James Morse', x86, linux-kernel
  Cc: Fenghua Yu, Reinette Chatre, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, H Peter Anvin, Babu Moger,
	shameerali.kolothum.thodi, D Scott Phillips OS, carl, lcherian,
	bobo.shaobowang, xingxin.hx, baolin.wang, Jamie Iles, Xin Hao,
	peternewman

Hi James,

Tested this patch series(v2) on Intel(R) Xeon(R) Gold 6254 CPU with resctrl kselftest.
No problem.

Thanks
Shaopeng TAN


^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH v2 02/18] x86/resctrl: Access per-rmid structures by index
  2023-01-13 17:54 ` [PATCH v2 02/18] x86/resctrl: Access per-rmid structures by index James Morse
  2023-01-17 18:28   ` Yu, Fenghua
  2023-01-17 18:29   ` Yu, Fenghua
@ 2023-02-02 23:44   ` Reinette Chatre
  2023-03-03 18:33     ` James Morse
  2 siblings, 1 reply; 73+ messages in thread
From: Reinette Chatre @ 2023-02-02 23:44 UTC (permalink / raw)
  To: James Morse, x86, linux-kernel
  Cc: Fenghua Yu, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	H Peter Anvin, Babu Moger, shameerali.kolothum.thodi,
	D Scott Phillips OS, carl, lcherian, bobo.shaobowang,
	tan.shaopeng, xingxin.hx, baolin.wang, Jamie Iles, Xin Hao,
	peternewman

Hi James,

On 1/13/2023 9:54 AM, James Morse wrote:
> Because of the differences between Intel RDT/AMD QoS and Arm's MPAM
> monitors, RMID values on arm64 are not unique unless the CLOSID is
> also included. Bitmaps like rmid_busy_llc need to be sized by the
> number of unique entries for this resource.
> 
> Add helpers to encode/decode the CLOSID and RMID to an index. The
> domain's busy_rmid_llc and the rmid_ptrs[] array are then sized by

busy_rmid_llc -> rmid_busy_llc ?

Could you please also mention the MBM state impacted?

> index. On x86, this is always just the RMID. This gives resctrl a
> unique value it can use to store monitor values, and allows MPAM to
> decode the closid when reading the hardware counters.
> 
> Tested-by: Shaopeng Tan <tan.shaopeng@fujitsu.com>
> Signed-off-by: James Morse <james.morse@arm.com>
> ---
> Changes since v1:
>  * Added X86_BAD_CLOSID macro to make it clear what this value means
>  * Added second WARN_ON() for closid checking, and made both _ONCE()
> ---
>  arch/x86/include/asm/resctrl.h         | 24 ++++++++
>  arch/x86/kernel/cpu/resctrl/internal.h |  2 +
>  arch/x86/kernel/cpu/resctrl/monitor.c  | 79 +++++++++++++++++---------
>  arch/x86/kernel/cpu/resctrl/rdtgroup.c |  7 ++-
>  4 files changed, 82 insertions(+), 30 deletions(-)
> 
> diff --git a/arch/x86/include/asm/resctrl.h b/arch/x86/include/asm/resctrl.h
> index 52788f79786f..44d568f3577c 100644
> --- a/arch/x86/include/asm/resctrl.h
> +++ b/arch/x86/include/asm/resctrl.h
> @@ -7,6 +7,13 @@
>  #include <linux/sched.h>
>  #include <linux/jump_label.h>
>  
> +/*
> + * This value can never be a valid CLOSID, and is used when mapping a
> + * (closid, rmid) pair to an index and back. On x86 only the RMID is
> + * needed.
> + */
> +#define X86_RESCTRL_BAD_CLOSID		~0

Should this be moved to previous patch where first usage of ~0 appears?

Also, not having a size creates opportunity for inconsistencies. How
about ((u32)~0) ?

> +
>  /**
>   * struct resctrl_pqr_state - State cache for the PQR MSR
>   * @cur_rmid:		The cached Resource Monitoring ID
> @@ -94,6 +101,23 @@ static inline void resctrl_sched_in(void)
>  		__resctrl_sched_in();
>  }
>  
> +static inline u32 resctrl_arch_system_num_rmid_idx(void)
> +{
> +	/* RMID are independent numbers for x86. num_rmid_idx==num_rmid */
> +	return boot_cpu_data.x86_cache_max_rmid + 1;
> +}

It seems that this helper and its subsequent usage eliminates the
need for struct rdt_resource::num_rmid? Are any users left?

Reinette

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH v2 04/18] x86/resctrl: Move rmid allocation out of mkdir_rdt_prepare()
  2023-01-13 17:54 ` [PATCH v2 04/18] x86/resctrl: Move rmid allocation out of mkdir_rdt_prepare() James Morse
  2023-01-17 18:28   ` Yu, Fenghua
@ 2023-02-02 23:45   ` Reinette Chatre
  2023-03-03 18:33     ` James Morse
  1 sibling, 1 reply; 73+ messages in thread
From: Reinette Chatre @ 2023-02-02 23:45 UTC (permalink / raw)
  To: James Morse, x86, linux-kernel
  Cc: Fenghua Yu, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	H Peter Anvin, Babu Moger, shameerali.kolothum.thodi,
	D Scott Phillips OS, carl, lcherian, bobo.shaobowang,
	tan.shaopeng, xingxin.hx, baolin.wang, Jamie Iles, Xin Hao,
	peternewman

Hi James,

On 1/13/2023 9:54 AM, James Morse wrote:
> RMID are allocated for each monitor or control group directory, because
> each of these needs its own RMID. For control groups,
> rdtgroup_mkdir_ctrl_mon() later goes on to allocate the CLOSID.
> 
> MPAM's equivalent of RMID are not an independent number, so can't be
> allocated until the closid is known. An RMID allocation for one CLOSID

Could you please be consistent with CLOSID vs closid (also RMID vs rmid)?
When reading through the series and seeing the switch it is not clear if
text refers to same concept.

> may fail, whereas another may succeed depending on how many monitor
> groups a control group has.
> 
> The RMID allocation needs to move to be after the CLOSID has been
> allocated.
> 
> Move the RMID allocation out of mkdir_rdt_prepare() to occur in its caller,
> after the mkdir_rdt_prepare() call. This allows the RMID allocator to
> know the CLOSID.
> 
> Tested-by: Shaopeng Tan <tan.shaopeng@fujitsu.com>
> Signed-off-by: James Morse <james.morse@arm.com>
> ---
>  arch/x86/kernel/cpu/resctrl/rdtgroup.c | 29 +++++++++++++++++++-------
>  1 file changed, 22 insertions(+), 7 deletions(-)
> 
> diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> index 841294ad6263..c67083a8a5f5 100644
> --- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> +++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> @@ -2892,6 +2892,12 @@ static int mkdir_rdt_prepare_rmid_alloc(struct rdtgroup *rdtgrp)
>  	return 0;
>  }
>  
> +static void mkdir_rdt_prepare_rmid_free(struct rdtgroup *rgrp)
> +{
> +	if (rdt_mon_capable)
> +		free_rmid(rgrp->closid, rgrp->mon.rmid);
> +}
> +
>  static int mkdir_rdt_prepare(struct kernfs_node *parent_kn,
>  			     const char *name, umode_t mode,
>  			     enum rdt_group_type rtype, struct rdtgroup **r)
> @@ -2957,10 +2963,6 @@ static int mkdir_rdt_prepare(struct kernfs_node *parent_kn,
>  		goto out_destroy;
>  	}
>  
> -	ret = mkdir_rdt_prepare_rmid_alloc(rdtgrp);
> -	if (ret)
> -		goto out_destroy;
> -
>  	kernfs_activate(kn);
>  

This moves the creation of the monitoring related files/directories to later, but leaves
the kernfs_activate() that activates the node and make it visible to user space. Should
this activation be moved?

Reinette

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH v2 05/18] x86/resctrl: Allow RMID allocation to be scoped by CLOSID
  2023-01-13 17:54 ` [PATCH v2 05/18] x86/resctrl: Allow RMID allocation to be scoped by CLOSID James Morse
  2023-01-17 18:53   ` Yu, Fenghua
@ 2023-02-02 23:45   ` Reinette Chatre
  2023-03-03 18:34     ` James Morse
  1 sibling, 1 reply; 73+ messages in thread
From: Reinette Chatre @ 2023-02-02 23:45 UTC (permalink / raw)
  To: James Morse, x86, linux-kernel
  Cc: Fenghua Yu, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	H Peter Anvin, Babu Moger, shameerali.kolothum.thodi,
	D Scott Phillips OS, carl, lcherian, bobo.shaobowang,
	tan.shaopeng, xingxin.hx, baolin.wang, Jamie Iles, Xin Hao,
	peternewman

Hi James,

On 1/13/2023 9:54 AM, James Morse wrote:
> MPAMs RMID values are not unique unless the CLOSID is considered as well.
> 
> alloc_rmid() expects the RMID to be an independent number.
> 
> Pass the CLOSID in to alloc_rmid(). Use this to compare indexes when
> allocating. If the CLOSID is not relevant to the index, this ends up
> comparing the free RMID with itself, and the first free entry will be
> used. With MPAM the CLOSID is included in the index, so this becomes a
> walk of the free RMID entries, until one that matches the supplied
> CLOSID is found.
> 
> Tested-by: Shaopeng Tan <tan.shaopeng@fujitsu.com>
> Signed-off-by: James Morse <james.morse@arm.com>

...

>  /*
> - * As of now the RMIDs allocation is global.
> + * As of now the RMIDs allocation is the same in each domain.

Could you please elaborate what is meant/intended with this change
(global vs per domain)? From the changelog a comment that RMID
allocation is the same in each resource group for MPAM may be
expected but per domain is not clear to me.

Reinette

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH v2 06/18] x86/resctrl: Allow the allocator to check if a CLOSID can allocate clean RMID
  2023-01-13 17:54 ` [PATCH v2 06/18] x86/resctrl: Allow the allocator to check if a CLOSID can allocate clean RMID James Morse
  2023-01-17 18:29   ` Yu, Fenghua
@ 2023-02-02 23:46   ` Reinette Chatre
  2023-03-03 18:36     ` James Morse
  1 sibling, 1 reply; 73+ messages in thread
From: Reinette Chatre @ 2023-02-02 23:46 UTC (permalink / raw)
  To: James Morse, x86, linux-kernel
  Cc: Fenghua Yu, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	H Peter Anvin, Babu Moger, shameerali.kolothum.thodi,
	D Scott Phillips OS, carl, lcherian, bobo.shaobowang,
	tan.shaopeng, xingxin.hx, baolin.wang, Jamie Iles, Xin Hao,
	peternewman

Hi James,

On 1/13/2023 9:54 AM, James Morse wrote:
> MPAM's PMG bits extend its PARTID space, meaning the same PMG value can be
> used for different control groups.
> 
> This means once a CLOSID is allocated, all its monitoring ids may still be
> dirty, and held in limbo.
> 
> Add a helper to allow the CLOSID allocator to check if a CLOSID has dirty
> RMID values. This behaviour is enabled by a kconfig option selected by
> the architecture, which avoids a pointless search for x86.
> 
> Tested-by: Shaopeng Tan <tan.shaopeng@fujitsu.com>
> Signed-off-by: James Morse <james.morse@arm.com>
> 
> ---
> Changes since v1:
>  * Removed superflous IS_ENABLED().
> ---
>  arch/x86/kernel/cpu/resctrl/internal.h |  1 +
>  arch/x86/kernel/cpu/resctrl/monitor.c  | 31 ++++++++++++++++++++++++++
>  arch/x86/kernel/cpu/resctrl/rdtgroup.c | 17 ++++++++------
>  3 files changed, 42 insertions(+), 7 deletions(-)
> 
> diff --git a/arch/x86/kernel/cpu/resctrl/internal.h b/arch/x86/kernel/cpu/resctrl/internal.h
> index 013c8fc9fd28..adae6231324f 100644
> --- a/arch/x86/kernel/cpu/resctrl/internal.h
> +++ b/arch/x86/kernel/cpu/resctrl/internal.h
> @@ -509,6 +509,7 @@ int rdtgroup_pseudo_lock_create(struct rdtgroup *rdtgrp);
>  void rdtgroup_pseudo_lock_remove(struct rdtgroup *rdtgrp);
>  struct rdt_domain *get_domain_from_cpu(int cpu, struct rdt_resource *r);
>  int closids_supported(void);
> +bool resctrl_closid_is_dirty(u32 closid);
>  void closid_free(int closid);
>  int alloc_rmid(u32 closid);
>  void free_rmid(u32 closid, u32 rmid);
> diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c
> index 347be3767241..190ac183139e 100644
> --- a/arch/x86/kernel/cpu/resctrl/monitor.c
> +++ b/arch/x86/kernel/cpu/resctrl/monitor.c
> @@ -327,6 +327,37 @@ static struct rmid_entry *resctrl_find_free_rmid(u32 closid)
>  	return ERR_PTR(-ENOSPC);
>  }
>  
> +/**
> + * resctrl_closid_is_dirty - Determine if clean RMID can be allocate for this
> + *                           CLOSID.
> + * @closid: The CLOSID that is being queried.
> + *
> + * MPAM's equivalent of RMID are per-CLOSID, meaning a freshly allocate CLOSID
> + * may not be able to allocate clean RMID. To avoid this the allocator will
> + * only return clean CLOSID.
> + */
> +bool resctrl_closid_is_dirty(u32 closid)
> +{
> +	struct rmid_entry *entry;
> +	int i;
> +
> +	lockdep_assert_held(&rdtgroup_mutex);
> +
> +	if (!IS_ENABLED(CONFIG_RESCTRL_RMID_DEPENDS_ON_CLOSID))
> +		return false;

Why is a config option chosen? Is this not something that can be set in the
architecture specific code using a global in the form matching existing related
items like "arch_has..." or "arch_needs..."? 

> +
> +	for (i = 0; i < resctrl_arch_system_num_rmid_idx(); i++) {
> +		entry = &rmid_ptrs[i];
> +		if (entry->closid != closid)
> +			continue;
> +
> +		if (entry->busy)
> +			return true;
> +	}
> +
> +	return false;
> +}

If I understand this correctly resctrl_closid_is_dirty() will return true if
_any_ RMID/PMG associated with a CLOSID is in use. That is, a CLOSID may be
able to support 100s of PMG but if only one of them is busy then the CLOSID
will be considered unusable ("dirty"). It sounds to me that there could be scenarios
when CLOSID could be considered unavailable while there are indeed sufficient
resources?

The function comment states "Determine if clean RMID can be allocate for this
CLOSID." - if I understand correctly it behaves more like "Determine if all
RMID associated with CLOSID are available".

Reinette

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH v2 07/18] x86/resctrl: Move CLOSID/RMID matching and setting to use helpers
  2023-01-13 17:54 ` [PATCH v2 07/18] x86/resctrl: Move CLOSID/RMID matching and setting to use helpers James Morse
  2023-01-17 19:10   ` Yu, Fenghua
@ 2023-02-02 23:47   ` Reinette Chatre
  2023-03-06 11:32     ` James Morse
  1 sibling, 1 reply; 73+ messages in thread
From: Reinette Chatre @ 2023-02-02 23:47 UTC (permalink / raw)
  To: James Morse, x86, linux-kernel
  Cc: Fenghua Yu, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	H Peter Anvin, Babu Moger, shameerali.kolothum.thodi,
	D Scott Phillips OS, carl, lcherian, bobo.shaobowang,
	tan.shaopeng, xingxin.hx, baolin.wang, Jamie Iles, Xin Hao,
	peternewman

Hi James,

On 1/13/2023 9:54 AM, James Morse wrote:

...

> @@ -567,19 +579,14 @@ static int __rdtgroup_move_task(struct task_struct *tsk,
>  	 * For monitor groups, can move the tasks only from
>  	 * their parent CTRL group.
>  	 */
> -
> -	if (rdtgrp->type == RDTCTRL_GROUP) {
> -		WRITE_ONCE(tsk->closid, rdtgrp->closid);
> -		WRITE_ONCE(tsk->rmid, rdtgrp->mon.rmid);
> -	} else if (rdtgrp->type == RDTMON_GROUP) {
> -		if (rdtgrp->mon.parent->closid == tsk->closid) {
> -			WRITE_ONCE(tsk->rmid, rdtgrp->mon.rmid);
> -		} else {
> -			rdt_last_cmd_puts("Can't move task to different control group\n");
> -			return -EINVAL;
> -		}
> +	if (rdtgrp->type == RDTMON_GROUP &&
> +	    !resctrl_arch_match_closid(tsk, rdtgrp->mon.parent->closid)) {
> +		rdt_last_cmd_puts("Can't move task to different control group\n");
> +		return -EINVAL;
>  	}
>  
> +	resctrl_arch_set_closid_rmid(tsk, rdtgrp->closid, rdtgrp->mon.rmid);

This does not use the intended closid when rdtgrp->type == RDTMON_GROUP.

Reinette

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH v2 08/18] x86/resctrl: Queue mon_event_read() instead of sending an IPI
  2023-01-13 17:54 ` [PATCH v2 08/18] x86/resctrl: Queue mon_event_read() instead of sending an IPI James Morse
  2023-01-17 18:29   ` Yu, Fenghua
@ 2023-02-02 23:47   ` Reinette Chatre
  2023-03-06 11:33     ` James Morse
  1 sibling, 1 reply; 73+ messages in thread
From: Reinette Chatre @ 2023-02-02 23:47 UTC (permalink / raw)
  To: James Morse, x86, linux-kernel
  Cc: Fenghua Yu, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	H Peter Anvin, Babu Moger, shameerali.kolothum.thodi,
	D Scott Phillips OS, carl, lcherian, bobo.shaobowang,
	tan.shaopeng, xingxin.hx, baolin.wang, Jamie Iles, Xin Hao,
	peternewman

Hi James,

On 1/13/2023 9:54 AM, James Morse wrote:
> x86 is blessed with an abundance of monitors, one per RMID, that can be
> read from any CPU in the domain. MPAMs monitors reside in the MMIO MSC,
> the number implemented is up to the manufacturer. This means when there are
> fewer monitors than needed, they need to be allocated and freed.
> 
> Worse, the domain may be broken up into slices, and the MMIO accesses
> for each slice may need performing from different CPUs.
> 
> These two details mean MPAMs monitor code needs to be able to sleep, and
> IPI another CPU in the domain to read from a resource that has been sliced.
> 
> mon_event_read() already invokes mon_event_count() via IPI, which means
> this isn't possible.
> 
> Change mon_event_read() to schedule mon_event_count() on a remote CPU and
> wait, instead of sending an IPI. This function is only used in response to
> a user-space filesystem request (not the timing sensitive overflow code).
> 
> This allows MPAM to hide the slice behaviour from resctrl, and to keep
> the monitor-allocation in monitor.c.
> 
> Tested-by: Shaopeng Tan <tan.shaopeng@fujitsu.com>
> Signed-off-by: James Morse <james.morse@arm.com>
> ---
>  arch/x86/kernel/cpu/resctrl/ctrlmondata.c | 7 +++++--
>  arch/x86/kernel/cpu/resctrl/internal.h    | 2 +-
>  arch/x86/kernel/cpu/resctrl/monitor.c     | 6 ++++--
>  3 files changed, 10 insertions(+), 5 deletions(-)
> 
> diff --git a/arch/x86/kernel/cpu/resctrl/ctrlmondata.c b/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
> index 1df0e3262bca..4ee3da6dced7 100644
> --- a/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
> +++ b/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
> @@ -532,8 +532,11 @@ void mon_event_read(struct rmid_read *rr, struct rdt_resource *r,
>  		    struct rdt_domain *d, struct rdtgroup *rdtgrp,
>  		    int evtid, int first)
>  {
> +	/* When picking a cpu from cpu_mask, ensure it can't race with cpuhp */

Please consistently use CPU instead of cpu.

> +	lockdep_assert_held(&rdtgroup_mutex);
> +
>  	/*
> -	 * setup the parameters to send to the IPI to read the data.
> +	 * setup the parameters to pass to mon_event_count() to read the data.
>  	 */
>  	rr->rgrp = rdtgrp;
>  	rr->evtid = evtid;
> @@ -542,7 +545,7 @@ void mon_event_read(struct rmid_read *rr, struct rdt_resource *r,
>  	rr->val = 0;
>  	rr->first = first;
>  
> -	smp_call_function_any(&d->cpu_mask, mon_event_count, rr, 1);
> +	smp_call_on_cpu(cpumask_any(&d->cpu_mask), mon_event_count, rr, false);

This would be problematic for the use cases where single tasks are run on
adaptive-tick CPUs. If an adaptive-tick CPU is chosen to run the function then
it may never run. Real-time environments are target usage of resctrl (with examples
in the documentation).

Reinette

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH v2 13/18] x86/resctrl: Make rdt_enable_key the arch's decision to switch
  2023-01-13 17:54 ` [PATCH v2 13/18] x86/resctrl: Make rdt_enable_key the arch's decision to switch James Morse
@ 2023-02-02 23:48   ` Reinette Chatre
  0 siblings, 0 replies; 73+ messages in thread
From: Reinette Chatre @ 2023-02-02 23:48 UTC (permalink / raw)
  To: James Morse, x86, linux-kernel
  Cc: Fenghua Yu, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	H Peter Anvin, Babu Moger, shameerali.kolothum.thodi,
	D Scott Phillips OS, carl, lcherian, bobo.shaobowang,
	tan.shaopeng, xingxin.hx, baolin.wang, Jamie Iles, Xin Hao,
	peternewman

Hi James,

On 1/13/2023 9:54 AM, James Morse wrote:
> rdt_enable_key is switched when resctrl is mounted. It was also previously
> used to prevent a second mount of the filesystem.
> 
> Any other architecture that wants to support resctrl has to provide
> identical static keys.
> 
> Now that we have helpers for enablign and disabling the alloc/mon keys,

Please remove all usage of "we" in changelog and code comments.

enablign -> enabling

Reinette



^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH v2 16/18] x86/resctrl: Allow overflow/limbo handlers to be scheduled on any-but cpu
  2023-01-13 17:54 ` [PATCH v2 16/18] x86/resctrl: Allow overflow/limbo handlers to be scheduled on any-but cpu James Morse
@ 2023-02-02 23:49   ` Reinette Chatre
  2023-03-06 11:34     ` James Morse
  0 siblings, 1 reply; 73+ messages in thread
From: Reinette Chatre @ 2023-02-02 23:49 UTC (permalink / raw)
  To: James Morse, x86, linux-kernel
  Cc: Fenghua Yu, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	H Peter Anvin, Babu Moger, shameerali.kolothum.thodi,
	D Scott Phillips OS, carl, lcherian, bobo.shaobowang,
	tan.shaopeng, xingxin.hx, baolin.wang, Jamie Iles, Xin Hao,
	peternewman

Hi James,

On 1/13/2023 9:54 AM, James Morse wrote:
> When a cpu is taken offline resctrl may need to move the overflow or
> limbo handlers to run on a different CPU.

cpu -> CPU

> 
> Once the offline callbacks have been split, cqm_setup_limbo_handler()
> will be called while the CPU that is going offline is still present
> in the cpu_mask.
> 
> Pass the CPU to exclude to cqm_setup_limbo_handler() and
> mbm_setup_overflow_handler(). These functions can use cpumask_any_but()
> when selecting the CPU. -1 is used to indicate no CPUs need excluding.
> 
> Tested-by: Shaopeng Tan <tan.shaopeng@fujitsu.com>
> Signed-off-by: James Morse <james.morse@arm.com>
> ---
> 
> Both cpumask_any() and cpumask_any_but() return a value >= nr_cpu_ids
> on error. schedule_delayed_work_on() doesn't appear to check. Add the
> error handling to be robust. It doesn't look like its possible to hit
> this.
> ---
>  arch/x86/kernel/cpu/resctrl/core.c     |  6 ++--
>  arch/x86/kernel/cpu/resctrl/internal.h |  6 ++--
>  arch/x86/kernel/cpu/resctrl/monitor.c  | 39 +++++++++++++++++++++-----
>  arch/x86/kernel/cpu/resctrl/rdtgroup.c |  4 +--
>  4 files changed, 42 insertions(+), 13 deletions(-)
> 
> diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
> index 749d9a450749..a3c171bd2de0 100644
> --- a/arch/x86/kernel/cpu/resctrl/core.c
> +++ b/arch/x86/kernel/cpu/resctrl/core.c
> @@ -557,12 +557,14 @@ static void domain_remove_cpu(int cpu, struct rdt_resource *r)
>  	if (r == &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl) {
>  		if (is_mbm_enabled() && cpu == d->mbm_work_cpu) {
>  			cancel_delayed_work(&d->mbm_over);
> -			mbm_setup_overflow_handler(d, 0);
> +			/* exclude_cpu=-1 as we already cpumask_clear_cpu()d */

Please do not use "we".

> +			mbm_setup_overflow_handler(d, 0, -1);
>  		}
>  		if (is_llc_occupancy_enabled() && cpu == d->cqm_work_cpu &&
>  		    has_busy_rmid(r, d)) {
>  			cancel_delayed_work(&d->cqm_limbo);
> -			cqm_setup_limbo_handler(d, 0);
> +			/* exclude_cpu=-1 as we already cpumask_clear_cpu()d */
> +			cqm_setup_limbo_handler(d, 0, -1);
>  		}
>  	}
>  }
> diff --git a/arch/x86/kernel/cpu/resctrl/internal.h b/arch/x86/kernel/cpu/resctrl/internal.h
> index a1bf97adee2e..d8c7a549b43a 100644
> --- a/arch/x86/kernel/cpu/resctrl/internal.h
> +++ b/arch/x86/kernel/cpu/resctrl/internal.h
> @@ -515,11 +515,13 @@ void mon_event_read(struct rmid_read *rr, struct rdt_resource *r,
>  		    struct rdt_domain *d, struct rdtgroup *rdtgrp,
>  		    int evtid, int first);
>  void mbm_setup_overflow_handler(struct rdt_domain *dom,
> -				unsigned long delay_ms);
> +				unsigned long delay_ms,
> +				int exclude_cpu);
>  void mbm_handle_overflow(struct work_struct *work);
>  void __init intel_rdt_mbm_apply_quirk(void);
>  bool is_mba_sc(struct rdt_resource *r);
> -void cqm_setup_limbo_handler(struct rdt_domain *dom, unsigned long delay_ms);
> +void cqm_setup_limbo_handler(struct rdt_domain *dom, unsigned long delay_ms,
> +			     int exclude_cpu);
>  void cqm_handle_limbo(struct work_struct *work);
>  bool has_busy_rmid(struct rdt_resource *r, struct rdt_domain *d);
>  void __check_limbo(struct rdt_domain *d, bool force_free);
> diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c
> index 1a214bd32ed4..334fb3f1c6e2 100644
> --- a/arch/x86/kernel/cpu/resctrl/monitor.c
> +++ b/arch/x86/kernel/cpu/resctrl/monitor.c
> @@ -440,7 +440,7 @@ static void add_rmid_to_limbo(struct rmid_entry *entry)
>  		 * setup up the limbo worker.
>  		 */
>  		if (!has_busy_rmid(r, d))
> -			cqm_setup_limbo_handler(d, CQM_LIMBOCHECK_INTERVAL);
> +			cqm_setup_limbo_handler(d, CQM_LIMBOCHECK_INTERVAL, -1);
>  		set_bit(idx, d->rmid_busy_llc);
>  		entry->busy++;
>  	}
> @@ -773,15 +773,27 @@ void cqm_handle_limbo(struct work_struct *work)
>  	mutex_unlock(&rdtgroup_mutex);
>  }
>  
> -void cqm_setup_limbo_handler(struct rdt_domain *dom, unsigned long delay_ms)
> +/**
> + * cqm_setup_limbo_handler() - Schedule the limbo handler to run for this
> + *                             domain.
> + * @delay_ms:      How far in the future the handler should run.
> + * @exclude_cpu:   Which CPU the handler should not run on, -1 to pick any CPU.
> + */
> +void cqm_setup_limbo_handler(struct rdt_domain *dom, unsigned long delay_ms,
> +			     int exclude_cpu)
>  {
>  	unsigned long delay = msecs_to_jiffies(delay_ms);
>  	int cpu;
>  
> -	cpu = cpumask_any(&dom->cpu_mask);
> +	if (exclude_cpu == -1)
> +		cpu = cpumask_any(&dom->cpu_mask);
> +	else
> +		cpu = cpumask_any_but(&dom->cpu_mask, exclude_cpu);
> +
>  	dom->cqm_work_cpu = cpu;
>  

This assignment is unexpected considering the error handling that follows.
cqm_work_cpu can thus be >= nr_cpu_ids. I assume it is to help during
domain remove where the CPU being removed is checked against this value?
If indeed this invalid CPU assignment is done in support of future code
path, could you please add a comment to help explain this assignment?

Reinette

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH v2 18/18] x86/resctrl: Separate arch and fs resctrl locks
  2023-01-13 17:54 ` [PATCH v2 18/18] x86/resctrl: Separate arch and fs resctrl locks James Morse
@ 2023-02-02 23:50   ` Reinette Chatre
  2023-03-06 11:34     ` James Morse
  0 siblings, 1 reply; 73+ messages in thread
From: Reinette Chatre @ 2023-02-02 23:50 UTC (permalink / raw)
  To: James Morse, x86, linux-kernel
  Cc: Fenghua Yu, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	H Peter Anvin, Babu Moger, shameerali.kolothum.thodi,
	D Scott Phillips OS, carl, lcherian, bobo.shaobowang,
	tan.shaopeng, xingxin.hx, baolin.wang, Jamie Iles, Xin Hao,
	peternewman

Hi James,

On 1/13/2023 9:54 AM, James Morse wrote:
> resctrl has one mutex that is taken by the architecture specific code,
> and the filesystem parts. The two interact via cpuhp, where the
> architecture code updates the domain list. Filesystem handlers that
> walk the domains list should not run concurrently with the cpuhp
> callback modifying the list.
> 
> Exposing a lock from the filesystem code means the interface is not
> cleanly defined, and creates the possibility of cross-architecture
> lock ordering headaches. The interaction only exists so that certain
> filesystem paths are serialised against cpu hotplug. The cpu hotplug
> code already has a mechanism to do this using cpus_read_lock().
> 
> MPAM's monitors have an overflow interrupt, so it needs to be possible
> to walk the domains list in irq context. RCU is ideal for this,
> but some paths need to be able to sleep to allocate memory.
> 
> Because resctrl_{on,off}line_cpu() take the rdtgroup_mutex as part
> of a cpuhp callback, cpus_read_lock() must always be taken first.
> rdtgroup_schemata_write() already does this.
> 
> All but one of the filesystem code's domain list walkers are
> currently protected by the rdtgroup_mutex taken in
> rdtgroup_kn_lock_live(). The exception is rdt_bit_usage_show()
> which takes the lock directly.

The new BMEC code also. You can find it on tip's x86/cache branch,
see mbm_total_bytes_config_write() and mbm_local_bytes_config_write().

> 
> Make the domain list protected by RCU. An architecture-specific
> lock prevents concurrent writers. rdt_bit_usage_show() can
> walk the domain list under rcu_read_lock().
> The other filesystem list walkers need to be able to sleep.
> Add cpus_read_lock() to rdtgroup_kn_lock_live() so that the
> cpuhp callbacks can't be invoked when file system operations are
> occurring.
> 
> Add lockdep_assert_cpus_held() in the cases where the
> rdtgroup_kn_lock_live() call isn't obvious.
> 
> Resctrl's domain online/offline calls now need to take the
> rdtgroup_mutex themselves.
> 
> Tested-by: Shaopeng Tan <tan.shaopeng@fujitsu.com>
> Signed-off-by: James Morse <james.morse@arm.com>
> ---
>  arch/x86/kernel/cpu/resctrl/core.c        | 33 ++++++++------
>  arch/x86/kernel/cpu/resctrl/ctrlmondata.c | 14 ++++--
>  arch/x86/kernel/cpu/resctrl/monitor.c     |  3 ++
>  arch/x86/kernel/cpu/resctrl/pseudo_lock.c |  3 ++
>  arch/x86/kernel/cpu/resctrl/rdtgroup.c    | 54 ++++++++++++++++++++---
>  include/linux/resctrl.h                   |  2 +-
>  6 files changed, 84 insertions(+), 25 deletions(-)
> 
> diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
> index 7896fcf11df6..dc1ba580c4db 100644
> --- a/arch/x86/kernel/cpu/resctrl/core.c
> +++ b/arch/x86/kernel/cpu/resctrl/core.c
> @@ -25,8 +25,14 @@
>  #include <asm/resctrl.h>
>  #include "internal.h"
>  
> -/* Mutex to protect rdtgroup access. */
> -DEFINE_MUTEX(rdtgroup_mutex);
> +/*
> + * rdt_domain structures are kfree()d when their last cpu goes offline,
> + * and allocated when the first cpu in a new domain comes online.
> + * The rdt_resource's domain list is updated when this happens. The domain
> + * list is protected by RCU, but callers can also take the cpus_read_lock()
> + * to prevent modification if they need to sleep. All writers take this mutex:

Using "callers can" is not specific (compare to "callers should"). Please provide
clear guidance on how the locks should be used. Reader may wonder "why take cpus_read_lock()
to prevent modification, why not just take the mutex to prevent modification?"

> + */
> +static DEFINE_MUTEX(domain_list_lock);
>  
>  /*
>   * The cached resctrl_pqr_state is strictly per CPU and can never be
> @@ -483,6 +489,8 @@ static void domain_add_cpu(int cpu, struct rdt_resource *r)
>  	struct rdt_domain *d;
>  	int err;
>  
> +	lockdep_assert_held(&domain_list_lock);
> +
>  	d = rdt_find_domain(r, id, &add_pos);
>  	if (IS_ERR(d)) {
>  		pr_warn("Couldn't find cache id for CPU %d\n", cpu);
> @@ -516,11 +524,12 @@ static void domain_add_cpu(int cpu, struct rdt_resource *r)
>  		return;
>  	}
>  
> -	list_add_tail(&d->list, add_pos);
> +	list_add_tail_rcu(&d->list, add_pos);
>  
>  	err = resctrl_online_domain(r, d);
>  	if (err) {
> -		list_del(&d->list);
> +		list_del_rcu(&d->list);
> +		synchronize_rcu();
>  		domain_free(hw_dom);
>  	}
>  }
> @@ -541,7 +550,8 @@ static void domain_remove_cpu(int cpu, struct rdt_resource *r)
>  	cpumask_clear_cpu(cpu, &d->cpu_mask);
>  	if (cpumask_empty(&d->cpu_mask)) {
>  		resctrl_offline_domain(r, d);
> -		list_del(&d->list);
> +		list_del_rcu(&d->list);
> +		synchronize_rcu();
>  
>  		/*
>  		 * rdt_domain "d" is going to be freed below, so clear

Should domain_remove_cpu() also get a "lockdep_assert_held(&domain_list_lock)"?

> @@ -569,30 +579,27 @@ static void clear_closid_rmid(int cpu)
>  static int resctrl_arch_online_cpu(unsigned int cpu)
>  {
>  	struct rdt_resource *r;
> -	int err;
>  
> -	mutex_lock(&rdtgroup_mutex);
> +	mutex_lock(&domain_list_lock);
>  	for_each_capable_rdt_resource(r)
>  		domain_add_cpu(cpu, r);
>  	clear_closid_rmid(cpu);
> +	mutex_unlock(&domain_list_lock);

Why is clear_closid_rmid(cpu) protected by mutex?

>  
> -	err = resctrl_online_cpu(cpu);
> -	mutex_unlock(&rdtgroup_mutex);
> -
> -	return err;
> +	return resctrl_online_cpu(cpu);
>  }
>  
>  static int resctrl_arch_offline_cpu(unsigned int cpu)
>  {
>  	struct rdt_resource *r;
>  
> -	mutex_lock(&rdtgroup_mutex);
>  	resctrl_offline_cpu(cpu);
>  
> +	mutex_lock(&domain_list_lock);
>  	for_each_capable_rdt_resource(r)
>  		domain_remove_cpu(cpu, r);
>  	clear_closid_rmid(cpu);
> -	mutex_unlock(&rdtgroup_mutex);
> +	mutex_unlock(&domain_list_lock);

Same

>  
>  	return 0;
>  }

Reinette

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH v2 02/18] x86/resctrl: Access per-rmid structures by index
  2023-01-17 18:28   ` Yu, Fenghua
@ 2023-03-03 18:33     ` James Morse
  0 siblings, 0 replies; 73+ messages in thread
From: James Morse @ 2023-03-03 18:33 UTC (permalink / raw)
  To: Yu, Fenghua, x86, linux-kernel
  Cc: Chatre, Reinette, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	H Peter Anvin, Babu Moger, shameerali.kolothum.thodi,
	D Scott Phillips OS, carl, lcherian, bobo.shaobowang,
	tan.shaopeng, xingxin.hx, baolin.wang, Jamie Iles, Xin Hao,
	peternewman

Hi Fenghua,

On 17/01/2023 18:28, Yu, Fenghua wrote:
>> Because of the differences between Intel RDT/AMD QoS and Arm's MPAM
>> monitors, RMID values on arm64 are not unique unless the CLOSID is also
>> included. Bitmaps like rmid_busy_llc need to be sized by the number of unique
>> entries for this resource.
>>
>> Add helpers to encode/decode the CLOSID and RMID to an index. The domain's
>> busy_rmid_llc and the rmid_ptrs[] array are then sized by index. On x86, this is
>> always just the RMID. This gives resctrl a unique value it can use to store
>> monitor values, and allows MPAM to decode the closid when reading the
>> hardware counters.

>> diff --git a/arch/x86/include/asm/resctrl.h b/arch/x86/include/asm/resctrl.h
>> index 52788f79786f..44d568f3577c 100644
>> --- a/arch/x86/include/asm/resctrl.h
>> +++ b/arch/x86/include/asm/resctrl.h
>> @@ -94,6 +101,23 @@ static inline void resctrl_sched_in(void)
>>  		__resctrl_sched_in();
>>  }
>>
>> +static inline u32 resctrl_arch_system_num_rmid_idx(void)
>> +{
>> +	/* RMID are independent numbers for x86. num_rmid_idx==num_rmid
>> */

> Is it better to change the comment to something like:
> +       /* RMIDs are independent of CLOSIDs and number of RMIDs is fixed. */

I agree the reference to x86 looks a bit funny, but its to explain why this 'idx'
concept exists before the MPAM code is merged that needs it.

I don't think it helps to say the number of RMID is fixed on x86, for MPAM there is no
direct equivalent, but everything there is fixed too. Its not the fixed property that this
is needed for, but the 'no equivalent'.


>> +	return boot_cpu_data.x86_cache_max_rmid + 1; }
>> +

>> diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c
>> b/arch/x86/kernel/cpu/resctrl/monitor.c
>> index 13673cab175a..dbae380e3d1c 100644
>> --- a/arch/x86/kernel/cpu/resctrl/monitor.c
>> +++ b/arch/x86/kernel/cpu/resctrl/monitor.c
>> @@ -142,12 +142,27 @@ static inline u64 get_corrected_mbm_count(u32 rmid,
>> unsigned long val)
>>  	return val;
>>  }
>>
>> -static inline struct rmid_entry *__rmid_entry(u32 closid, u32 rmid)
>> +/*
>> + * x86 and arm64 differ in their handling of monitoring.
>> + * x86's RMID are an independent number, there is one RMID '1'.

> What do you mean by "one RMID '1'"?

If I could snoop the traffic on the interconnect, there would only be one source of
traffic with RMID value 1.

In contrast, if you did this on an arm machine, there would be numerous sources of traffic
with PMG value 1. This is because PMG isn't an independent number, it isn't equivalent to
RMID.


> Should this sentence be "x86's RMID is an independent number."?

Does it help to rephrase it as:
| * x86's RMID are an independent number, there is only one source of traffic
| * an RMID value of '1'.
| * arm64's PMG extend the PARTID/CLOSID space, there are multiple sources of
| * traffic with a PMG value of '1', one for each CLOSID, meaining the RMID
| * value is no longer unique.

?

>> + * arm64's PMG extend the PARTID/CLOSID space, there is one RMID '1'
>> +for each
>> + * CLOSID. The RMID is no longer unique.
>> + * To account for this, resctrl uses an index. On x86 this is just the
>> +RMID,
>> + * on arm64 it encodes the CLOSID and RMID. This gives a unique number.
>> + *
>> + * The domain's rmid_busy_llc and rmid_ptrs are sized by index. The
>> +arch code
>> + * must accept an attempt to read every index.
>> + */
>> +static inline struct rmid_entry *__rmid_entry(u32 idx)
>>  {
>>  	struct rmid_entry *entry;
>> +	u32 closid, rmid;
>>
>> -	entry = &rmid_ptrs[rmid];
>> -	WARN_ON(entry->rmid != rmid);
>> +	entry = &rmid_ptrs[idx];
>> +	resctrl_arch_rmid_idx_decode(idx, &closid, &rmid);
>> +
>> +	WARN_ON_ONCE(entry->closid != closid);
>> +	WARN_ON_ONCE(entry->rmid != rmid);
>>
>>  	return entry;
>>  }

>> @@ -719,7 +744,7 @@ static int dom_data_init(struct rdt_resource *r)
>>  	 * default_rdtgroup control group, which will be setup later. See
>>  	 * rdtgroup_setup_root().
>>  	 */
>> -	entry = __rmid_entry(0, 0);
>> +	entry = __rmid_entry(resctrl_arch_rmid_idx_encode(0, 0));

> Better change to:
> +	entry = __rmid_entry(resctrl_arch_rmid_idx_encode(RESCTRL_BAD_CLOSID, 0));
> because this explicitly tells CLOSID is invalid here on X86.

Its not invalid, its the reserved value that MSR_IA32_PQR_ASSOC is set to when a task is
in the default control group, and when a CPU first comes online.

Reserving CLOSID ~0 would be ignored on x86, but hit a WARN_ON_ONCE() on arm64 because ~0
isn't a valid closid. The X86 in 'X86_RESCTRL_BAD_CLOSID' is the hint it should only
appear in x86 specific code!

How about:
|        idx = resctrl_arch_rmid_idx_encode(RESCTRL_RESERVED_CLOSID, 0);
|        entry = __rmid_entry(idx);

?

> 
>>  	list_del(&entry->list);
>>
>>  	return 0;


Thanks,

James

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH v2 02/18] x86/resctrl: Access per-rmid structures by index
  2023-02-02 23:44   ` Reinette Chatre
@ 2023-03-03 18:33     ` James Morse
  0 siblings, 0 replies; 73+ messages in thread
From: James Morse @ 2023-03-03 18:33 UTC (permalink / raw)
  To: Reinette Chatre, x86, linux-kernel
  Cc: Fenghua Yu, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	H Peter Anvin, Babu Moger, shameerali.kolothum.thodi,
	D Scott Phillips OS, carl, lcherian, bobo.shaobowang,
	tan.shaopeng, xingxin.hx, baolin.wang, Jamie Iles, Xin Hao,
	peternewman

Hi Reinette,

On 02/02/2023 23:44, Reinette Chatre wrote:
> On 1/13/2023 9:54 AM, James Morse wrote:
>> Because of the differences between Intel RDT/AMD QoS and Arm's MPAM
>> monitors, RMID values on arm64 are not unique unless the CLOSID is
>> also included. Bitmaps like rmid_busy_llc need to be sized by the
>> number of unique entries for this resource.
>>
>> Add helpers to encode/decode the CLOSID and RMID to an index. The
>> domain's busy_rmid_llc and the rmid_ptrs[] array are then sized by
> 
> busy_rmid_llc -> rmid_busy_llc ?
> 
> Could you please also mention the MBM state impacted?

Yup, this paragraph reworded as:
| Add helpers to encode/decode the CLOSID and RMID to an index. The
| domain's rmid_busy__llc and the rmid_ptrs[] array are then sized by
| index, as are the domain mbm_local and mbm_total arrays.
| On x86, the index is always just the RMID, so all these structures
| remain the same size.
|
| The index gives resctrl a unique value it can use to store monitor
| values, and allows MPAM to decode the closid when reading the hardware
| counters.

>> diff --git a/arch/x86/include/asm/resctrl.h b/arch/x86/include/asm/resctrl.h
>> index 52788f79786f..44d568f3577c 100644
>> --- a/arch/x86/include/asm/resctrl.h
>> +++ b/arch/x86/include/asm/resctrl.h
>> @@ -7,6 +7,13 @@
>>  #include <linux/sched.h>
>>  #include <linux/jump_label.h>
>>  
>> +/*
>> + * This value can never be a valid CLOSID, and is used when mapping a
>> + * (closid, rmid) pair to an index and back. On x86 only the RMID is
>> + * needed.
>> + */
>> +#define X86_RESCTRL_BAD_CLOSID		~0

> Should this be moved to previous patch where first usage of ~0 appears?

Makes sense,


> Also, not having a size creates opportunity for inconsistencies. How
> about ((u32)~0) ?

Yes, the compilers secret expectations on sizes always catch me out!


>> +
>>  /**
>>   * struct resctrl_pqr_state - State cache for the PQR MSR
>>   * @cur_rmid:		The cached Resource Monitoring ID
>> @@ -94,6 +101,23 @@ static inline void resctrl_sched_in(void)
>>  		__resctrl_sched_in();
>>  }
>>  
>> +static inline u32 resctrl_arch_system_num_rmid_idx(void)
>> +{
>> +	/* RMID are independent numbers for x86. num_rmid_idx==num_rmid */
>> +	return boot_cpu_data.x86_cache_max_rmid + 1;
>> +}

> It seems that this helper and its subsequent usage eliminates the
> need for struct rdt_resource::num_rmid? Are any users left?

The only user in the filesystem parts of resctrl is rdt_num_rmids_show(), which exposes
the value to user-space. The value is unfortunately meaningless on MPAM systems, but as
its user-space ABI, it has to stay.

The remaining users in the x86 specific code:
domain_add_cpu() continues to use r->num_rmid as the arch code can know its the same
number as num_idx, and rdt_get_mon_l3_config() calculates the value.


Thanks,

James

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH v2 04/18] x86/resctrl: Move rmid allocation out of mkdir_rdt_prepare()
  2023-02-02 23:45   ` Reinette Chatre
@ 2023-03-03 18:33     ` James Morse
  0 siblings, 0 replies; 73+ messages in thread
From: James Morse @ 2023-03-03 18:33 UTC (permalink / raw)
  To: Reinette Chatre, x86, linux-kernel
  Cc: Fenghua Yu, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	H Peter Anvin, Babu Moger, shameerali.kolothum.thodi,
	D Scott Phillips OS, carl, lcherian, bobo.shaobowang,
	tan.shaopeng, xingxin.hx, baolin.wang, Jamie Iles, Xin Hao,
	peternewman

Hi Reinette,

On 02/02/2023 23:45, Reinette Chatre wrote:
> On 1/13/2023 9:54 AM, James Morse wrote:
>> RMID are allocated for each monitor or control group directory, because
>> each of these needs its own RMID. For control groups,
>> rdtgroup_mkdir_ctrl_mon() later goes on to allocate the CLOSID.
>>
>> MPAM's equivalent of RMID are not an independent number, so can't be
>> allocated until the closid is known. An RMID allocation for one CLOSID

> Could you please be consistent with CLOSID vs closid (also RMID vs rmid)?
> When reading through the series and seeing the switch it is not clear if
> text refers to same concept.

Yup, I'm trying, but there will be some that slip through.


>> may fail, whereas another may succeed depending on how many monitor
>> groups a control group has.
>>
>> The RMID allocation needs to move to be after the CLOSID has been
>> allocated.
>>
>> Move the RMID allocation out of mkdir_rdt_prepare() to occur in its caller,
>> after the mkdir_rdt_prepare() call. This allows the RMID allocator to
>> know the CLOSID.

>> diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
>> index 841294ad6263..c67083a8a5f5 100644
>> --- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
>> +++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c

>> @@ -2957,10 +2963,6 @@ static int mkdir_rdt_prepare(struct kernfs_node *parent_kn,
>>  		goto out_destroy;
>>  	}
>>  
>> -	ret = mkdir_rdt_prepare_rmid_alloc(rdtgrp);
>> -	if (ret)
>> -		goto out_destroy;
>> -
>>  	kernfs_activate(kn);
>>  
> 
> This moves the creation of the monitoring related files/directories to later, but leaves
> the kernfs_activate() that activates the node and make it visible to user space. Should
> this activation be moved?

I hadn't properly grasped what that was doing Yes, I've moved it to after the
mkdir_rdt_prepare_rmid_alloc() calls in the two callers.


Thanks,

James

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH v2 05/18] x86/resctrl: Allow RMID allocation to be scoped by CLOSID
  2023-01-17 18:53   ` Yu, Fenghua
@ 2023-03-03 18:34     ` James Morse
  0 siblings, 0 replies; 73+ messages in thread
From: James Morse @ 2023-03-03 18:34 UTC (permalink / raw)
  To: Yu, Fenghua, x86, linux-kernel
  Cc: Chatre, Reinette, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	H Peter Anvin, Babu Moger, shameerali.kolothum.thodi,
	D Scott Phillips OS, carl, lcherian, bobo.shaobowang,
	tan.shaopeng, xingxin.hx, baolin.wang, Jamie Iles, Xin Hao,
	peternewman

Hi Fenghua,

On 17/01/2023 18:53, Yu, Fenghua wrote:
>> MPAMs RMID values are not unique unless the CLOSID is considered as well.
>>
>> alloc_rmid() expects the RMID to be an independent number.
>>
>> Pass the CLOSID in to alloc_rmid(). Use this to compare indexes when allocating.
>> If the CLOSID is not relevant to the index, this ends up comparing the free RMID
>> with itself, and the first free entry will be used. With MPAM the CLOSID is
>> included in the index, so this becomes a walk of the free RMID entries, until one
>> that matches the supplied CLOSID is found.


>> b/arch/x86/kernel/cpu/resctrl/monitor.c
>> index dbae380e3d1c..347be3767241 100644
>> --- a/arch/x86/kernel/cpu/resctrl/monitor.c
>> +++ b/arch/x86/kernel/cpu/resctrl/monitor.c
>> @@ -301,25 +301,51 @@ bool has_busy_rmid(struct rdt_resource *r, struct
>> rdt_domain *d)
>>  	return find_first_bit(d->rmid_busy_llc, idx_limit) != idx_limit;  }
>>
>> +static struct rmid_entry *resctrl_find_free_rmid(u32 closid) {
>> +	struct rmid_entry *itr;
>> +	u32 itr_idx, cmp_idx;
>> +
>> +	if (list_empty(&rmid_free_lru))
>> +		return rmid_limbo_count ? ERR_PTR(-EBUSY) : ERR_PTR(-
>> ENOSPC);
>> +
>> +	list_for_each_entry(itr, &rmid_free_lru, list) {
>> +		/*
>> +		 * get the index of this free RMID, and the index it would need
>> +		 * to be if it were used with this CLOSID.
>> +		 * If the CLOSID is irrelevant on this architecture, these will
>> +		 * always be the same. Otherwise they will only match if this
>> +		 * RMID can be used with this CLOSID.
>> +		 */
>> +		itr_idx = resctrl_arch_rmid_idx_encode(itr->closid, itr->rmid);
>> +		cmp_idx = resctrl_arch_rmid_idx_encode(closid, itr->rmid);
>> +
>> +		if (itr_idx == cmp_idx)
>> +			return itr;
> 
> Finding free rmid may be called frequently depending on usage.
> 
> It would be better to have a simpler and faster arch helper that finds the itr on x86.
> Something like:
> struct rmid_entry *resctrl_arch_rmid_matchd(u32 ignored, u32 ignored)
> {
> 	return list_entry_first(resctrl_free_lru, itr, list);
> }
> 
> Arm64 implements the complex case going through the rmid_free_lru list in the patch.

The trick here is that one degenerates into the other:

>> +	list_for_each_entry(itr, &rmid_free_lru, list) {

The first time round the loop, this is equivalent to:
| itr = list_entry_first(&rmid_free_lru, itr, list);


>> +		/*
>> +		 * get the index of this free RMID, and the index it would need
>> +		 * to be if it were used with this CLOSID.
>> +		 * If the CLOSID is irrelevant on this architecture, these will
>> +		 * always be the same. Otherwise they will only match if this
>> +		 * RMID can be used with this CLOSID.
>> +		 */
>> +		itr_idx = resctrl_arch_rmid_idx_encode(itr->closid, itr->rmid);

On x86, after inline-ing this is:
| itr_idx = itr->rmid

>> +		cmp_idx = resctrl_arch_rmid_idx_encode(closid, itr->rmid);

and this is:
| cmp_idx = itr->rmid

>> +		if (itr_idx == cmp_idx)
>> +			return itr;

So now any half decent compiler can spot that this condition is always true and the loop
only ever runs once, and the whole thing reduces to what you wanted it to be.

This saves exposing things that should be private to the filesystem code and having
per-arch helpers to mess with it.

The commit message described this, I'll expand the comment in the loop to be:
|		 * If the CLOSID is irrelevant on this architecture, these will
|		 * always be the same meaning the compiler can reduce this loop
|		 * to a single list_entry_first() call.


Thanks,

James

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH v2 05/18] x86/resctrl: Allow RMID allocation to be scoped by CLOSID
  2023-02-02 23:45   ` Reinette Chatre
@ 2023-03-03 18:34     ` James Morse
  2023-03-10 19:57       ` Reinette Chatre
  0 siblings, 1 reply; 73+ messages in thread
From: James Morse @ 2023-03-03 18:34 UTC (permalink / raw)
  To: Reinette Chatre, x86, linux-kernel
  Cc: Fenghua Yu, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	H Peter Anvin, Babu Moger, shameerali.kolothum.thodi,
	D Scott Phillips OS, carl, lcherian, bobo.shaobowang,
	tan.shaopeng, xingxin.hx, baolin.wang, Jamie Iles, Xin Hao,
	peternewman

Hi Reinette,

On 02/02/2023 23:45, Reinette Chatre wrote:
> On 1/13/2023 9:54 AM, James Morse wrote:
>> MPAMs RMID values are not unique unless the CLOSID is considered as well.
>>
>> alloc_rmid() expects the RMID to be an independent number.
>>
>> Pass the CLOSID in to alloc_rmid(). Use this to compare indexes when
>> allocating. If the CLOSID is not relevant to the index, this ends up
>> comparing the free RMID with itself, and the first free entry will be
>> used. With MPAM the CLOSID is included in the index, so this becomes a
>> walk of the free RMID entries, until one that matches the supplied
>> CLOSID is found.
>>
>> Tested-by: Shaopeng Tan <tan.shaopeng@fujitsu.com>
>> Signed-off-by: James Morse <james.morse@arm.com>
> 
> ...
> 
>>  /*
>> - * As of now the RMIDs allocation is global.
>> + * As of now the RMIDs allocation is the same in each domain.

> Could you please elaborate what is meant/intended with this change
> (global vs per domain)? From the changelog a comment that RMID
> allocation is the same in each resource group for MPAM may be
> expected but per domain is not clear to me.

This is badly worded. It's referring to the limbo list management, while RMID=7 isn't
unique on MPAM, the struct rmid_entry used in two domains will be the same because the
CLOSID doesn't change. This means its still sufficient to move around the struct
rmid_entry to manage the limbo list.

I think this had me confused because 'as of now' implies the RMID won't always be globally
allocated, and MPAM has non-unique RMID/PMG values which are a different kind of global.


I'll change this to read:
/*
 * For MPAM the RMID value is not unique, and has to be considered with
 * the CLOSID. The (CLOSID, RMID) pair is allocated on all domains, which
 * allows all domains to be managed by a single limbo list.
 * Each domain also has a rmid_busy_llc to reduce the work of the limbo handler.
 */

(seeing as the function doesn't touch rmid_budy_llc, or refer to it by name)

Thanks,

James

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH v2 06/18] x86/resctrl: Allow the allocator to check if a CLOSID can allocate clean RMID
  2023-01-17 18:29   ` Yu, Fenghua
@ 2023-03-03 18:35     ` James Morse
  0 siblings, 0 replies; 73+ messages in thread
From: James Morse @ 2023-03-03 18:35 UTC (permalink / raw)
  To: Yu, Fenghua, x86, linux-kernel
  Cc: Chatre, Reinette, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	H Peter Anvin, Babu Moger, shameerali.kolothum.thodi,
	D Scott Phillips OS, carl, lcherian, bobo.shaobowang,
	tan.shaopeng, xingxin.hx, baolin.wang, Jamie Iles, Xin Hao,
	peternewman

Hi Fenghua,

On 17/01/2023 18:29, Yu, Fenghua wrote:
>> MPAM's PMG bits extend its PARTID space, meaning the same PMG value can be
>> used for different control groups.
>>
>> This means once a CLOSID is allocated, all its monitoring ids may still be dirty,
>> and held in limbo.
>>
>> Add a helper to allow the CLOSID allocator to check if a CLOSID has dirty RMID
>> values. This behaviour is enabled by a kconfig option selected by the
>> architecture, which avoids a pointless search for x86.

>> diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c
>> b/arch/x86/kernel/cpu/resctrl/monitor.c
>> index 347be3767241..190ac183139e 100644
>> --- a/arch/x86/kernel/cpu/resctrl/monitor.c
>> +++ b/arch/x86/kernel/cpu/resctrl/monitor.c
>> @@ -327,6 +327,37 @@ static struct rmid_entry *resctrl_find_free_rmid(u32
>> closid)
>>  	return ERR_PTR(-ENOSPC);
>>  }
>>
>> +/**
>> + * resctrl_closid_is_dirty - Determine if clean RMID can be allocate for this
> 
> s/allocate/allocated/
> 
>> + *                           CLOSID.
>> + * @closid: The CLOSID that is being queried.
>> + *
>> + * MPAM's equivalent of RMID are per-CLOSID, meaning a freshly allocate
> 
> s/allocate/allocated/

(Both fixed, thanks)


>> +CLOSID
>> + * may not be able to allocate clean RMID. To avoid this the allocator
>> +will
>> + * only return clean CLOSID.
>> + */
>> +bool resctrl_closid_is_dirty(u32 closid) {
>> +	struct rmid_entry *entry;
>> +	int i;
>> +
>> +	lockdep_assert_held(&rdtgroup_mutex);
> 
> It's better to move lockdep_asser_held() after if (!IS_ENABLE()).
> Then compiler might optimize this function to empty on X86.

If you compile without lockdep it will be empty!
Is anyone worried about performance with lockdep enabled?

The reason for it being here is documentation and for the runtime check if you run with
lockdep. Having it here is so that new code that only runs on x86 (with lockdep) also
checks this, even though it doesn't have CONFIG_RESCTRL_RMID_DEPENDS_ON_CLOSID.

I'd prefer to keep it so we can catch bugs early. Lockdep isn't on by default.


>> +
>> +	if (!IS_ENABLED(CONFIG_RESCTRL_RMID_DEPENDS_ON_CLOSID))
>> +		return false;
>> +
>> +	for (i = 0; i < resctrl_arch_system_num_rmid_idx(); i++) {
>> +		entry = &rmid_ptrs[i];
>> +		if (entry->closid != closid)
>> +			continue;
>> +
>> +		if (entry->busy)
>> +			return true;
>> +	}
>> +
>> +	return false;
>> +}


Thanks,

James

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH v2 06/18] x86/resctrl: Allow the allocator to check if a CLOSID can allocate clean RMID
  2023-02-02 23:46   ` Reinette Chatre
@ 2023-03-03 18:36     ` James Morse
  2023-03-10 19:59       ` Reinette Chatre
  0 siblings, 1 reply; 73+ messages in thread
From: James Morse @ 2023-03-03 18:36 UTC (permalink / raw)
  To: Reinette Chatre, x86, linux-kernel
  Cc: Fenghua Yu, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	H Peter Anvin, Babu Moger, shameerali.kolothum.thodi,
	D Scott Phillips OS, carl, lcherian, bobo.shaobowang,
	tan.shaopeng, xingxin.hx, baolin.wang, Jamie Iles, Xin Hao,
	peternewman

Hi Reinette,

On 02/02/2023 23:46, Reinette Chatre wrote:
> On 1/13/2023 9:54 AM, James Morse wrote:
>> MPAM's PMG bits extend its PARTID space, meaning the same PMG value can be
>> used for different control groups.
>>
>> This means once a CLOSID is allocated, all its monitoring ids may still be
>> dirty, and held in limbo.
>>
>> Add a helper to allow the CLOSID allocator to check if a CLOSID has dirty
>> RMID values. This behaviour is enabled by a kconfig option selected by
>> the architecture, which avoids a pointless search for x86.

>> diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c
>> index 347be3767241..190ac183139e 100644
>> --- a/arch/x86/kernel/cpu/resctrl/monitor.c
>> +++ b/arch/x86/kernel/cpu/resctrl/monitor.c
>> @@ -327,6 +327,37 @@ static struct rmid_entry *resctrl_find_free_rmid(u32 closid)
>>  	return ERR_PTR(-ENOSPC);
>>  }
>>  
>> +/**
>> + * resctrl_closid_is_dirty - Determine if clean RMID can be allocate for this
>> + *                           CLOSID.
>> + * @closid: The CLOSID that is being queried.
>> + *
>> + * MPAM's equivalent of RMID are per-CLOSID, meaning a freshly allocate CLOSID
>> + * may not be able to allocate clean RMID. To avoid this the allocator will
>> + * only return clean CLOSID.
>> + */
>> +bool resctrl_closid_is_dirty(u32 closid)
>> +{
>> +	struct rmid_entry *entry;
>> +	int i;
>> +
>> +	lockdep_assert_held(&rdtgroup_mutex);
>> +
>> +	if (!IS_ENABLED(CONFIG_RESCTRL_RMID_DEPENDS_ON_CLOSID))
>> +		return false;

> Why is a config option chosen? Is this not something that can be set in the
> architecture specific code using a global in the form matching existing related
> items like "arch_has..." or "arch_needs..."?

It doesn't vary by platform, so making it a runtime variable would mean x86 has to carry
this extra code around, even though it will never use it. Done like this, the compiler can
dead-code eliminate the below checks and embed the constant return value in all the callers.


> 
>> +
>> +	for (i = 0; i < resctrl_arch_system_num_rmid_idx(); i++) {
>> +		entry = &rmid_ptrs[i];
>> +		if (entry->closid != closid)
>> +			continue;
>> +
>> +		if (entry->busy)
>> +			return true;
>> +	}
>> +
>> +	return false;
>> +}
> 
> If I understand this correctly resctrl_closid_is_dirty() will return true if
> _any_ RMID/PMG associated with a CLOSID is in use. That is, a CLOSID may be
> able to support 100s of PMG but if only one of them is busy then the CLOSID
> will be considered unusable ("dirty"). It sounds to me that there could be scenarios
> when CLOSID could be considered unavailable while there are indeed sufficient
> resources?

You are right this could happen. I guess the better approach would be to prefer the
cleanest CLOSID that has a clean PMG=0. User-space may not be able to allocate all the
monitor groups immediately, but that would be preferable to failing the control group
creation.

But as this code doesn't get built until the MPAM driver is merged, I'd like to keep it to
an absolute minimum. This would be more than is needed for MPAM to have close to resctrl
feature-parity, so I'd prefer to do this as an improvement once the MPAM driver is upstream.

(also in this category is better use of MPAM's monitors and labelling traffic from the iommu)


> The function comment states "Determine if clean RMID can be allocate for this
> CLOSID." - if I understand correctly it behaves more like "Determine if all
> RMID associated with CLOSID are available".

Yes, I'll fix that.


Thanks!

James

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH v2 07/18] x86/resctrl: Move CLOSID/RMID matching and setting to use helpers
  2023-01-17 19:10   ` Yu, Fenghua
@ 2023-03-03 18:37     ` James Morse
  0 siblings, 0 replies; 73+ messages in thread
From: James Morse @ 2023-03-03 18:37 UTC (permalink / raw)
  To: Yu, Fenghua, x86, linux-kernel
  Cc: Chatre, Reinette, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	H Peter Anvin, Babu Moger, shameerali.kolothum.thodi,
	D Scott Phillips OS, carl, lcherian, bobo.shaobowang,
	tan.shaopeng, xingxin.hx, baolin.wang, Jamie Iles, Xin Hao,
	peternewman

Hi Fenghua,

On 17/01/2023 19:10, Yu, Fenghua wrote:
>> When switching tasks, the CLOSID and RMID that the new task should use are
>> stored in struct task_struct. For x86 the CLOSID known by resctrl, the value in
>> task_struct, and the value written to the CPU register are all the same thing.
>>
>> MPAM's CPU interface has two different PARTID's one for data accesses the
>> other for instruction fetch. Storing resctrl's CLOSID value in struct task_struct
>> implies the arch code knows whether resctrl is using CDP.
>>
>> Move the matching and setting of the struct task_struct properties to use
>> helpers. This allows arm64 to store the hardware format of the register, instead
>> of having to convert it each time.
>>
>> __rdtgroup_move_task()s use of READ_ONCE()/WRITE_ONCE() ensures torn
>> values aren't seen as another CPU may schedule the task being moved while the
>> value is being changed. MPAM has an additional corner-case here as the PMG
>> bits extend the PARTID space. If the scheduler sees a new-CLOSID but old-RMID,
>> the task will dirty an RMID that the limbo code is not watching causing an
>> inaccurate count. x86's RMID are independent values, so the limbo code will still
>> be watching the old-RMID in this circumstance.
>> To avoid this, arm64 needs both the CLOSID/RMID WRITE_ONCE()d together.
>> Both values must be provided together.
>>
>> Because MPAM's RMID values are not unique, the CLOSID must be provided
>> when matching the RMID.

>> diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
>> b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
>> index e1f879e13823..ced7400decae 100644
>> --- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
>> +++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
>> @@ -84,7 +84,7 @@ void rdt_last_cmd_printf(const char *fmt, ...)
>>   *
>>   * Using a global CLOSID across all resources has some advantages and
>>   * some drawbacks:
>> - * + We can simply set "current->closid" to assign a task to a resource
>> + * + We can simply set current's closid to assign a task to a resource
>>   *   group.
> 
> Seems this change doesn't gain anything. Maybe this change can be removed?

After this patch the CLOSID might not be in current at all, this comment would be the only
thing that suggests it is. I'd prefer not to suggest anyone access 'current->closid'
directly in resctrl, as such code wouldn't compile on arm64.

This is a 'saves bugs in the future' change.


>>   * + Context switch code can avoid extra memory references deciding which
>>   *   CLOSID to load into the PQR_ASSOC MSR

(please trim your replies!)


Thanks,

James

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH v2 07/18] x86/resctrl: Move CLOSID/RMID matching and setting to use helpers
  2023-02-02 23:47   ` Reinette Chatre
@ 2023-03-06 11:32     ` James Morse
  2023-03-08 10:30       ` Peter Newman
  2023-03-10 20:00       ` Reinette Chatre
  0 siblings, 2 replies; 73+ messages in thread
From: James Morse @ 2023-03-06 11:32 UTC (permalink / raw)
  To: Reinette Chatre, x86, linux-kernel
  Cc: Fenghua Yu, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	H Peter Anvin, Babu Moger, shameerali.kolothum.thodi,
	D Scott Phillips OS, carl, lcherian, bobo.shaobowang,
	tan.shaopeng, xingxin.hx, baolin.wang, Jamie Iles, Xin Hao,
	peternewman

Hi Reinette,

On 02/02/2023 23:47, Reinette Chatre wrote:
> On 1/13/2023 9:54 AM, James Morse wrote:
> 
> ...
> 
>> @@ -567,19 +579,14 @@ static int __rdtgroup_move_task(struct task_struct *tsk,
>>  	 * For monitor groups, can move the tasks only from
>>  	 * their parent CTRL group.
>>  	 */
>> -
>> -	if (rdtgrp->type == RDTCTRL_GROUP) {
>> -		WRITE_ONCE(tsk->closid, rdtgrp->closid);
>> -		WRITE_ONCE(tsk->rmid, rdtgrp->mon.rmid);
>> -	} else if (rdtgrp->type == RDTMON_GROUP) {
>> -		if (rdtgrp->mon.parent->closid == tsk->closid) {
>> -			WRITE_ONCE(tsk->rmid, rdtgrp->mon.rmid);
>> -		} else {
>> -			rdt_last_cmd_puts("Can't move task to different control group\n");
>> -			return -EINVAL;
>> -		}
>> +	if (rdtgrp->type == RDTMON_GROUP &&
>> +	    !resctrl_arch_match_closid(tsk, rdtgrp->mon.parent->closid)) {
>> +		rdt_last_cmd_puts("Can't move task to different control group\n");
>> +		return -EINVAL;
>>  	}
>>  
>> +	resctrl_arch_set_closid_rmid(tsk, rdtgrp->closid, rdtgrp->mon.rmid);
> 
> This does not use the intended closid when rdtgrp->type == RDTMON_GROUP.

Yes, it should be rdtgrp->mon.parent->closid.

rdtgroup_mkdir_mon() initialises them to be the same, I guess its Peter's monitor-group
rename that means this could get the wrong value?

I've fixed it as:
---------%<---------
diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
index c7392d10dc5b..30d8961b833c 100644
--- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
+++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
@@ -585,7 +585,12 @@ static int __rdtgroup_move_task(struct task_struct *tsk,
                return -EINVAL;
        }

-       resctrl_arch_set_closid_rmid(tsk, rdtgrp->closid, rdtgrp->mon.rmid);
+       if (rdtgrp->type == RDTMON_GROUP)
+               resctrl_arch_set_closid_rmid(tsk, rdtgrp->mon.parent->closid,
+                                            rdtgrp->mon.rmid);
+       else
+               resctrl_arch_set_closid_rmid(tsk, rdtgrp->closid,
+                                            rdtgrp->mon.rmid);

        /*
         * Ensure the task's closid and rmid are written before determining if
---------%<---------


Thanks,

James

^ permalink raw reply related	[flat|nested] 73+ messages in thread

* Re: [PATCH v2 08/18] x86/resctrl: Queue mon_event_read() instead of sending an IPI
  2023-01-17 18:29   ` Yu, Fenghua
@ 2023-03-06 11:32     ` James Morse
  2023-03-10 20:00       ` Reinette Chatre
  0 siblings, 1 reply; 73+ messages in thread
From: James Morse @ 2023-03-06 11:32 UTC (permalink / raw)
  To: Yu, Fenghua, x86, linux-kernel
  Cc: Chatre, Reinette, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	H Peter Anvin, Babu Moger, shameerali.kolothum.thodi,
	D Scott Phillips OS, carl, lcherian, bobo.shaobowang,
	tan.shaopeng, xingxin.hx, baolin.wang, Jamie Iles, Xin Hao,
	peternewman

Hi Fenghua,

On 17/01/2023 18:29, Yu, Fenghua wrote:
>> x86 is blessed with an abundance of monitors, one per RMID, that can be read
>> from any CPU in the domain. MPAMs monitors reside in the MMIO MSC, the
>> number implemented is up to the manufacturer. This means when there are
>> fewer monitors than needed, they need to be allocated and freed.
>>
>> Worse, the domain may be broken up into slices, and the MMIO accesses for
>> each slice may need performing from different CPUs.
>>
>> These two details mean MPAMs monitor code needs to be able to sleep, and IPI
>> another CPU in the domain to read from a resource that has been sliced.
>>
>> mon_event_read() already invokes mon_event_count() via IPI, which means this
>> isn't possible.
>>
>> Change mon_event_read() to schedule mon_event_count() on a remote CPU
>> and wait, instead of sending an IPI. This function is only used in response to a
>> user-space filesystem request (not the timing sensitive overflow code).

> But mkdir mon group needs mon_event_count() to reset RMID state.
> If mon_event_count() is called much later, the RMID state may be used
> before it's reset. E.g. prev_msr might be non-0 value. That will cause
> overflow code failure.
> 
> Seems this may happen on both x86 and arm64. At least need to make sure
> RMID state reset happens before it's used.

On the architecture side, there is a patch from Peter that records the MSR value on the
architecture side when an RMID is reset/re-allocated. 2a81160d29d6 ("x86/resctrl: Fix
event counts regression in reused RMIDs")

For the filesystem, the 'first' value is passed through and handled by the CPU that reads
the MSR. I don't see what problem any extra delay due to scheduling would cause.


Thanks,

James

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH v2 08/18] x86/resctrl: Queue mon_event_read() instead of sending an IPI
  2023-02-02 23:47   ` Reinette Chatre
@ 2023-03-06 11:33     ` James Morse
  2023-03-08 16:09       ` James Morse
  0 siblings, 1 reply; 73+ messages in thread
From: James Morse @ 2023-03-06 11:33 UTC (permalink / raw)
  To: Reinette Chatre, x86, linux-kernel
  Cc: Fenghua Yu, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	H Peter Anvin, Babu Moger, shameerali.kolothum.thodi,
	D Scott Phillips OS, carl, lcherian, bobo.shaobowang,
	tan.shaopeng, xingxin.hx, baolin.wang, Jamie Iles, Xin Hao,
	peternewman

Hi Reinette,

On 02/02/2023 23:47, Reinette Chatre wrote:
> Hi James,
> 
> On 1/13/2023 9:54 AM, James Morse wrote:
>> x86 is blessed with an abundance of monitors, one per RMID, that can be
>> read from any CPU in the domain. MPAMs monitors reside in the MMIO MSC,
>> the number implemented is up to the manufacturer. This means when there are
>> fewer monitors than needed, they need to be allocated and freed.
>>
>> Worse, the domain may be broken up into slices, and the MMIO accesses
>> for each slice may need performing from different CPUs.
>>
>> These two details mean MPAMs monitor code needs to be able to sleep, and
>> IPI another CPU in the domain to read from a resource that has been sliced.
>>
>> mon_event_read() already invokes mon_event_count() via IPI, which means
>> this isn't possible.
>>
>> Change mon_event_read() to schedule mon_event_count() on a remote CPU and
>> wait, instead of sending an IPI. This function is only used in response to
>> a user-space filesystem request (not the timing sensitive overflow code).
>>
>> This allows MPAM to hide the slice behaviour from resctrl, and to keep
>> the monitor-allocation in monitor.c.

>> diff --git a/arch/x86/kernel/cpu/resctrl/ctrlmondata.c b/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
>> index 1df0e3262bca..4ee3da6dced7 100644
>> --- a/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
>> +++ b/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
>> @@ -532,8 +532,11 @@ void mon_event_read(struct rmid_read *rr, struct rdt_resource *r,
>>  		    struct rdt_domain *d, struct rdtgroup *rdtgrp,
>>  		    int evtid, int first)
>>  {
>> +	/* When picking a cpu from cpu_mask, ensure it can't race with cpuhp */
> 
> Please consistently use CPU instead of cpu.

(done)


>> +	lockdep_assert_held(&rdtgroup_mutex);
>> +
>>  	/*
>> -	 * setup the parameters to send to the IPI to read the data.
>> +	 * setup the parameters to pass to mon_event_count() to read the data.
>>  	 */
>>  	rr->rgrp = rdtgrp;
>>  	rr->evtid = evtid;
>> @@ -542,7 +545,7 @@ void mon_event_read(struct rmid_read *rr, struct rdt_resource *r,
>>  	rr->val = 0;
>>  	rr->first = first;
>>  
>> -	smp_call_function_any(&d->cpu_mask, mon_event_count, rr, 1);
>> +	smp_call_on_cpu(cpumask_any(&d->cpu_mask), mon_event_count, rr, false);

> This would be problematic for the use cases where single tasks are run on
> adaptive-tick CPUs. If an adaptive-tick CPU is chosen to run the function then
> it may never run. Real-time environments are target usage of resctrl (with examples
> in the documentation).

Interesting. I can't find an IPI wakeup under smp_call_on_cpu() ... I wonder what else
this breaks!

Resctrl doesn't consider the nohz-cpus when doing any of this work, or when setting up the
limbo or overflow timer work.

I think the right thing to do here is add some cpumask_any_housekeeping() helper to avoid
nohz-full CPUs where possible, and fall back to an IPI if all the CPUs in a domain are
nohz-full.

Ideally cpumask_any() would do this but it isn't possible without allocating memory.
If I can reproduce this problem, I'll propose adding the behaviour to
smp_call_function_any(), probably overloading 'wait' to return an error if all the target
CPUs are nohz-full.


This means those MPAM systems that need this can't use nohz_full, or at least need a
housekeeping CPU that can access each MSC. The driver already considers whether interrupts
are masked for this stuff because of the prototype perf support, which those machines
wouldn't be able to support either.

(I'll look at hooking this up to the limbo/overflow code too)


Thanks,

James

untested and incomplete hunk of code:
------------%<------------
	cpu = get_cpu();
	if (cpumask_test_cpu(cpu, &d->cpu_mask)) {
		smp_call_function_single(cpu, smp_mon_event_count, rr, true);
		put_cpu();
	} else {
		put_cpu();

		/*
		 * If all the CPUs available use nohz_full, fall back to an IPI.
		 * Some MPAM systems will be unable to use their counters in this case.
		 */
		cpu = cpumask_any_housekeeping(&d->cpu_mask);
		if (tick_nohz_full_cpu(cpu))
			smp_call_function_single(cpu, smp_mon_event_count, rr, true);
		else
			smp_call_on_cpu(cpu, mon_event_count, rr, false);
	}
------------%<------------

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH v2 09/18] x86/resctrl: Allow resctrl_arch_rmid_read() to sleep
  2023-01-23 13:54   ` Peter Newman
@ 2023-03-06 11:33     ` James Morse
  0 siblings, 0 replies; 73+ messages in thread
From: James Morse @ 2023-03-06 11:33 UTC (permalink / raw)
  To: Peter Newman
  Cc: x86, linux-kernel, Fenghua Yu, Reinette Chatre, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, H Peter Anvin, Babu Moger,
	shameerali.kolothum.thodi, D Scott Phillips OS, carl, lcherian,
	bobo.shaobowang, tan.shaopeng, xingxin.hx, baolin.wang,
	Jamie Iles, Xin Hao

Hi Peter,

On 23/01/2023 13:54, Peter Newman wrote:
> On Fri, Jan 13, 2023 at 6:56 PM James Morse <james.morse@arm.com> wrote:
>> diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c
>> index d309b830aeb2..d6ae4b713801 100644
>> --- a/arch/x86/kernel/cpu/resctrl/monitor.c
>> +++ b/arch/x86/kernel/cpu/resctrl/monitor.c
>> @@ -206,17 +206,19 @@ static u64 mbm_overflow_count(u64 prev_msr, u64 cur_msr, unsigned int width)
>>         return chunks >> shift;
>>  }
>>
>> -int resctrl_arch_rmid_read(struct rdt_resource *r, struct rdt_domain *d,
>> -                          u32 closid, u32 rmid, enum resctrl_event_id eventid,
>> -                          u64 *val)
>> +struct __rmid_read_arg
>>  {
>> -       struct rdt_hw_resource *hw_res = resctrl_to_arch_res(r);
>> -       struct rdt_hw_domain *hw_dom = resctrl_to_arch_dom(d);
>> -       struct arch_mbm_state *am;
>> -       u64 msr_val, chunks;
>> +       u32 rmid;
>> +       enum resctrl_event_id eventid;
>>
>> -       if (!cpumask_test_cpu(smp_processor_id(), &d->cpu_mask))
>> -               return -EINVAL;
>> +       u64 msr_val;
>> +};
>> +
>> +static void __rmid_read(void *arg)
>> +{
>> +       enum resctrl_event_id eventid = ((struct __rmid_read_arg *)arg)->eventid;
>> +       u32 rmid = ((struct __rmid_read_arg *)arg)->rmid;
>> +       u64 msr_val;
>>
>>         /*
>>          * As per the SDM, when IA32_QM_EVTSEL.EvtID (bits 7:0) is configured
>> @@ -229,6 +231,28 @@ int resctrl_arch_rmid_read(struct rdt_resource *r, struct rdt_domain *d,
>>         wrmsr(MSR_IA32_QM_EVTSEL, eventid, rmid);
>>         rdmsrl(MSR_IA32_QM_CTR, msr_val);
>>
>> +       ((struct __rmid_read_arg *)arg)->msr_val = msr_val;
>> +}
>> +
>> +int resctrl_arch_rmid_read(struct rdt_resource *r, struct rdt_domain *d,
>> +                          u32 closid, u32 rmid, enum resctrl_event_id eventid,
>> +                          u64 *val)
>> +{
>> +       struct rdt_hw_resource *hw_res = resctrl_to_arch_res(r);
>> +       struct rdt_hw_domain *hw_dom = resctrl_to_arch_dom(d);
>> +       struct __rmid_read_arg arg;
>> +       struct arch_mbm_state *am;
>> +       u64 msr_val, chunks;
>> +       int err;
>> +
>> +       arg.rmid = rmid;
>> +       arg.eventid = eventid;
>> +
>> +       err = smp_call_function_any(&d->cpu_mask, __rmid_read, &arg, true);
>> +       if (err)
>> +               return err;
>> +
>> +       msr_val = arg.msr_val;
> 
> These changes are conflicting now after v6.2-rc4 due to my recent
> changes in resctrl_arch_rmid_read(), which include my own
> reintroduction of __rmid_read():
> 
> https://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git/commit/?id=2a81160d29d65b5876ab3f824fda99ae0219f05e
> 
> Fortunately it looks like our respective versions of __rmid_read()
> aren't too much different from the original, but __rmid_read() does
> have a new call site in resctrl_arch_reset_rmid() to record initial
> event counts.

Yup, this is the normal headache when rebasing over other changes.
Thanks for fixing that thing - I thought the 'first' behaviour in the filesystem code
covered it, but clearly it doesn't.


Thanks,

James

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH v2 09/18] x86/resctrl: Allow resctrl_arch_rmid_read() to sleep
  2023-01-23 15:33   ` Peter Newman
@ 2023-03-06 11:33     ` James Morse
  2023-03-06 13:14       ` Peter Newman
  0 siblings, 1 reply; 73+ messages in thread
From: James Morse @ 2023-03-06 11:33 UTC (permalink / raw)
  To: Peter Newman
  Cc: x86, linux-kernel, Fenghua Yu, Reinette Chatre, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, H Peter Anvin, Babu Moger,
	shameerali.kolothum.thodi, D Scott Phillips OS, carl, lcherian,
	bobo.shaobowang, tan.shaopeng, xingxin.hx, baolin.wang,
	Jamie Iles, Xin Hao

Hi Peter,

On 23/01/2023 15:33, Peter Newman wrote:
> On Fri, Jan 13, 2023 at 6:56 PM James Morse <james.morse@arm.com> wrote:
>> MPAM's cache occupancy counters can take a little while to settle once
>> the monitor has been configured. The maximum settling time is described
>> to the driver via a firmware table. The value could be large enough
>> that it makes sense to sleep.
> 
> Would it be easier to return an error when reading the occupancy count
> too soon after configuration? On Intel it is already normal for counter
> reads to fail on newly-allocated RMIDs.

For x86, you have as many counters as there are RMIDs, so there is no issue just accessing
the counter.

With MPAM there may be as few as 1 monitor for the CSU (cache storage utilisation)
counter, which needs to be multiplexed between different PARTID to find the cache
occupancy (This works for CSU because its a stable count, it doesn't work for the
bandwidth monitors)
On such a platform the monitor needs to be allocated and programmed before it reads a
value for a particular PARTID/CLOSID. If you had two threads trying to read the same
counter, they could interleave perfectly to prevent either thread managing to read a value.
The 'not ready' time is advertised in a firmware table, and the driver will wait at most
that long before giving up and returning an error.

Clearly 1 monitor is a corner case, and I hope no-one ever builds that. But if there are
fewer monitors than there are PARTID*PMG you get the same problem, (you just need more
threads reading the counters)


Thanks,

James

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH v2 14/18] x86/resctrl: Add helpers for system wide mon/alloc capable
  2023-01-25  7:16   ` Shaopeng Tan (Fujitsu)
@ 2023-03-06 11:34     ` James Morse
  0 siblings, 0 replies; 73+ messages in thread
From: James Morse @ 2023-03-06 11:34 UTC (permalink / raw)
  To: Shaopeng Tan (Fujitsu), x86, linux-kernel
  Cc: Fenghua Yu, Reinette Chatre, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, H Peter Anvin, Babu Moger,
	shameerali.kolothum.thodi, D Scott Phillips OS, carl, lcherian,
	bobo.shaobowang, xingxin.hx, baolin.wang, Jamie Iles, Xin Hao,
	peternewman

On 25/01/2023 07:16, Shaopeng Tan (Fujitsu) wrote:
>> resctrl reads rdt_alloc_capable or rdt_mon_capable to determine whether any
>> of the resources support the corresponding features.
>> resctrl also uses the static-keys that affect the architecture's context-switch
>> code to determine the same thing.
>>
>> This forces another architecture to have the same static-keys.
>>
>> As the static-key is enabled based on the capable flag, and none of the
>> filesystem uses of these are in the scheduler path, move the capable flags
>> behind helpers, and use these in the filesystem code instead of the static-key.
>>
>> After this change, only the architecture code manages and uses the static-keys
>> to ensure __resctrl_sched_in() does not need runtime checks.
>>
>> This avoids multiple architectures having to define the same static-keys.

> Reviewed-by: Shaopeng Tan <tan.shaopeng@fujitsu.com>

Thanks!


James

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH v2 16/18] x86/resctrl: Allow overflow/limbo handlers to be scheduled on any-but cpu
  2023-02-02 23:49   ` Reinette Chatre
@ 2023-03-06 11:34     ` James Morse
  0 siblings, 0 replies; 73+ messages in thread
From: James Morse @ 2023-03-06 11:34 UTC (permalink / raw)
  To: Reinette Chatre, x86, linux-kernel
  Cc: Fenghua Yu, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	H Peter Anvin, Babu Moger, shameerali.kolothum.thodi,
	D Scott Phillips OS, carl, lcherian, bobo.shaobowang,
	tan.shaopeng, xingxin.hx, baolin.wang, Jamie Iles, Xin Hao,
	peternewman

Hi Reinette,

On 02/02/2023 23:49, Reinette Chatre wrote:
> On 1/13/2023 9:54 AM, James Morse wrote:
>> When a cpu is taken offline resctrl may need to move the overflow or
>> limbo handlers to run on a different CPU.
>> Once the offline callbacks have been split, cqm_setup_limbo_handler()
>> will be called while the CPU that is going offline is still present
>> in the cpu_mask.
>>
>> Pass the CPU to exclude to cqm_setup_limbo_handler() and
>> mbm_setup_overflow_handler(). These functions can use cpumask_any_but()
>> when selecting the CPU. -1 is used to indicate no CPUs need excluding.

>> diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c
>> index 1a214bd32ed4..334fb3f1c6e2 100644
>> --- a/arch/x86/kernel/cpu/resctrl/monitor.c
>> +++ b/arch/x86/kernel/cpu/resctrl/monitor.c

>> @@ -773,15 +773,27 @@ void cqm_handle_limbo(struct work_struct *work)
>>  	mutex_unlock(&rdtgroup_mutex);
>>  }
>>  
>> -void cqm_setup_limbo_handler(struct rdt_domain *dom, unsigned long delay_ms)
>> +/**
>> + * cqm_setup_limbo_handler() - Schedule the limbo handler to run for this
>> + *                             domain.
>> + * @delay_ms:      How far in the future the handler should run.
>> + * @exclude_cpu:   Which CPU the handler should not run on, -1 to pick any CPU.
>> + */
>> +void cqm_setup_limbo_handler(struct rdt_domain *dom, unsigned long delay_ms,
>> +			     int exclude_cpu)
>>  {
>>  	unsigned long delay = msecs_to_jiffies(delay_ms);
>>  	int cpu;
>>  
>> -	cpu = cpumask_any(&dom->cpu_mask);
>> +	if (exclude_cpu == -1)
>> +		cpu = cpumask_any(&dom->cpu_mask);
>> +	else
>> +		cpu = cpumask_any_but(&dom->cpu_mask, exclude_cpu);
>> +
>>  	dom->cqm_work_cpu = cpu;
>>  
> 
> This assignment is unexpected considering the error handling that follows.
> cqm_work_cpu can thus be >= nr_cpu_ids. I assume it is to help during
> domain remove where the CPU being removed is checked against this value?
> If indeed this invalid CPU assignment is done in support of future code
> path, could you please add a comment to help explain this assignment?

Looks like I ignored it because in the last-man-standing case, the domain is going to get
free()d anyway ... but I couldn't find a 'cpu >= nr_cpu_ids' check under
schedule_delayed_work_on() hence the error handling.

I'll move the dom->mbm_work_cpu under the nr_cpu_ids check too so that it doesn't look funny.


Thanks,

James

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH v2 18/18] x86/resctrl: Separate arch and fs resctrl locks
  2023-02-02 23:50   ` Reinette Chatre
@ 2023-03-06 11:34     ` James Morse
  2023-03-11  0:22       ` Reinette Chatre
  0 siblings, 1 reply; 73+ messages in thread
From: James Morse @ 2023-03-06 11:34 UTC (permalink / raw)
  To: Reinette Chatre, x86, linux-kernel
  Cc: Fenghua Yu, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	H Peter Anvin, Babu Moger, shameerali.kolothum.thodi,
	D Scott Phillips OS, carl, lcherian, bobo.shaobowang,
	tan.shaopeng, xingxin.hx, baolin.wang, Jamie Iles, Xin Hao,
	peternewman

Hi Reinette,

On 02/02/2023 23:50, Reinette Chatre wrote:
> On 1/13/2023 9:54 AM, James Morse wrote:
>> resctrl has one mutex that is taken by the architecture specific code,
>> and the filesystem parts. The two interact via cpuhp, where the
>> architecture code updates the domain list. Filesystem handlers that
>> walk the domains list should not run concurrently with the cpuhp
>> callback modifying the list.
>>
>> Exposing a lock from the filesystem code means the interface is not
>> cleanly defined, and creates the possibility of cross-architecture
>> lock ordering headaches. The interaction only exists so that certain
>> filesystem paths are serialised against cpu hotplug. The cpu hotplug
>> code already has a mechanism to do this using cpus_read_lock().
>>
>> MPAM's monitors have an overflow interrupt, so it needs to be possible
>> to walk the domains list in irq context. RCU is ideal for this,
>> but some paths need to be able to sleep to allocate memory.
>>
>> Because resctrl_{on,off}line_cpu() take the rdtgroup_mutex as part
>> of a cpuhp callback, cpus_read_lock() must always be taken first.
>> rdtgroup_schemata_write() already does this.
>>
>> All but one of the filesystem code's domain list walkers are
>> currently protected by the rdtgroup_mutex taken in
>> rdtgroup_kn_lock_live(). The exception is rdt_bit_usage_show()
>> which takes the lock directly.
> 
> The new BMEC code also. You can find it on tip's x86/cache branch,
> see mbm_total_bytes_config_write() and mbm_local_bytes_config_write().
> 
>>
>> Make the domain list protected by RCU. An architecture-specific
>> lock prevents concurrent writers. rdt_bit_usage_show() can
>> walk the domain list under rcu_read_lock().
>> The other filesystem list walkers need to be able to sleep.
>> Add cpus_read_lock() to rdtgroup_kn_lock_live() so that the
>> cpuhp callbacks can't be invoked when file system operations are
>> occurring.
>>
>> Add lockdep_assert_cpus_held() in the cases where the
>> rdtgroup_kn_lock_live() call isn't obvious.
>>
>> Resctrl's domain online/offline calls now need to take the
>> rdtgroup_mutex themselves.


>> diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
>> index 7896fcf11df6..dc1ba580c4db 100644
>> --- a/arch/x86/kernel/cpu/resctrl/core.c
>> +++ b/arch/x86/kernel/cpu/resctrl/core.c
>> @@ -25,8 +25,14 @@
>>  #include <asm/resctrl.h>
>>  #include "internal.h"
>>  
>> -/* Mutex to protect rdtgroup access. */
>> -DEFINE_MUTEX(rdtgroup_mutex);
>> +/*
>> + * rdt_domain structures are kfree()d when their last cpu goes offline,
>> + * and allocated when the first cpu in a new domain comes online.
>> + * The rdt_resource's domain list is updated when this happens. The domain
>> + * list is protected by RCU, but callers can also take the cpus_read_lock()
>> + * to prevent modification if they need to sleep. All writers take this mutex:
> 
> Using "callers can" is not specific (compare to "callers should"). Please provide
> clear guidance on how the locks should be used. Reader may wonder "why take cpus_read_lock()
> to prevent modification, why not just take the mutex to prevent modification?"

'if they need to sleep' is the answer to this. I think a certain amount of background
knowledge can be assumed. My aim here wasn't to write an essay, but indicate not all
readers do the same thing. This is already the case in resctrl, and the MPAM pmu stuff
makes that worse.

Is this more robust:
| * rdt_domain structures are kfree()d when their last cpu goes offline,
| * and allocated when the first cpu in a new domain comes online.
| * The rdt_resource's domain list is updated when this happens. Readers of
| * the domain list must either take cpus_read_lock(), or rely on an RCU
| * read-side critical section, to avoid observing concurrent modification.
| * For information about RCU, see Docuemtation/RCU/rcu.rst.
| * All writers take this mutex:

?


>> @@ -541,7 +550,8 @@ static void domain_remove_cpu(int cpu, struct rdt_resource *r)
>>  	cpumask_clear_cpu(cpu, &d->cpu_mask);
>>  	if (cpumask_empty(&d->cpu_mask)) {
>>  		resctrl_offline_domain(r, d);
>> -		list_del(&d->list);
>> +		list_del_rcu(&d->list);
>> +		synchronize_rcu();
>>  
>>  		/*
>>  		 * rdt_domain "d" is going to be freed below, so clear

> Should domain_remove_cpu() also get a "lockdep_assert_held(&domain_list_lock)"?

Yes, not sure why I didn't do that!


>> @@ -569,30 +579,27 @@ static void clear_closid_rmid(int cpu)
>>  static int resctrl_arch_online_cpu(unsigned int cpu)
>>  {
>>  	struct rdt_resource *r;
>> -	int err;
>>  
>> -	mutex_lock(&rdtgroup_mutex);
>> +	mutex_lock(&domain_list_lock);
>>  	for_each_capable_rdt_resource(r)
>>  		domain_add_cpu(cpu, r);
>>  	clear_closid_rmid(cpu);
>> +	mutex_unlock(&domain_list_lock);

> Why is clear_closid_rmid(cpu) protected by mutex?

It doesn't need to be, its just an artefact of changing the lock, then moving the
filesystem calls out. (its doesn't need to be protected by rdtgroup_mutex today).

If you don't think its churn, I'll move it to make it clearer.


Thanks,

James

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH v2 09/18] x86/resctrl: Allow resctrl_arch_rmid_read() to sleep
  2023-03-06 11:33     ` James Morse
@ 2023-03-06 13:14       ` Peter Newman
  2023-03-08 17:45         ` James Morse
  0 siblings, 1 reply; 73+ messages in thread
From: Peter Newman @ 2023-03-06 13:14 UTC (permalink / raw)
  To: James Morse
  Cc: x86, linux-kernel, Fenghua Yu, Reinette Chatre, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, H Peter Anvin, Babu Moger,
	shameerali.kolothum.thodi, D Scott Phillips OS, carl, lcherian,
	bobo.shaobowang, tan.shaopeng, xingxin.hx, baolin.wang,
	Jamie Iles, Xin Hao, Stephane Eranian

Hi James,

On Mon, Mar 6, 2023 at 12:34 PM James Morse <james.morse@arm.com> wrote:
> On 23/01/2023 15:33, Peter Newman wrote:
> > On Fri, Jan 13, 2023 at 6:56 PM James Morse <james.morse@arm.com> wrote:
> >> MPAM's cache occupancy counters can take a little while to settle once
> >> the monitor has been configured. The maximum settling time is described
> >> to the driver via a firmware table. The value could be large enough
> >> that it makes sense to sleep.
> >
> > Would it be easier to return an error when reading the occupancy count
> > too soon after configuration? On Intel it is already normal for counter
> > reads to fail on newly-allocated RMIDs.
>
> For x86, you have as many counters as there are RMIDs, so there is no issue just accessing
> the counter.

I should have said AMD instead of Intel, because their implementations
have far fewer counters than RMIDs.

>
> With MPAM there may be as few as 1 monitor for the CSU (cache storage utilisation)
> counter, which needs to be multiplexed between different PARTID to find the cache
> occupancy (This works for CSU because its a stable count, it doesn't work for the
> bandwidth monitors)
> On such a platform the monitor needs to be allocated and programmed before it reads a
> value for a particular PARTID/CLOSID. If you had two threads trying to read the same
> counter, they could interleave perfectly to prevent either thread managing to read a value.
> The 'not ready' time is advertised in a firmware table, and the driver will wait at most
> that long before giving up and returning an error.

Likewise, on AMD, a repeating sequence of tasks which are LRU in terms
of counter -> RMID allocation could prevent RMID event reads from ever
returning a value.

The main difference I see with MPAM is that software allocates the
counters instead of hardware, but the overall behavior sounds the same.

The part I object to is introducing the wait to the counter read because
existing software already expects an immediate error when reading a
counter too soon. To produce accurate data, these readings are usually
read at intervals of multiple seconds.

Instead, when configuring a counter, could you use the firmware table
value to compute the time when the counter will next be valid and return
errors on read requests received before that?

-Peter

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH v2 07/18] x86/resctrl: Move CLOSID/RMID matching and setting to use helpers
  2023-03-06 11:32     ` James Morse
@ 2023-03-08 10:30       ` Peter Newman
  2023-03-10 20:00       ` Reinette Chatre
  1 sibling, 0 replies; 73+ messages in thread
From: Peter Newman @ 2023-03-08 10:30 UTC (permalink / raw)
  To: James Morse
  Cc: Reinette Chatre, x86, linux-kernel, Fenghua Yu, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, H Peter Anvin, Babu Moger,
	shameerali.kolothum.thodi, D Scott Phillips OS, carl, lcherian,
	bobo.shaobowang, tan.shaopeng, xingxin.hx, baolin.wang,
	Jamie Iles, Xin Hao

Hi James,

On Mon, Mar 6, 2023 at 12:32 PM James Morse <james.morse@arm.com> wrote:
> On 02/02/2023 23:47, Reinette Chatre wrote:
> > On 1/13/2023 9:54 AM, James Morse wrote:
> >> +    resctrl_arch_set_closid_rmid(tsk, rdtgrp->closid, rdtgrp->mon.rmid);
> >
> > This does not use the intended closid when rdtgrp->type == RDTMON_GROUP.
>
> Yes, it should be rdtgrp->mon.parent->closid.
>
> rdtgroup_mkdir_mon() initialises them to be the same, I guess its Peter's monitor-group
> rename that means this could get the wrong value?

I noticed this earlier. The next revision of my MON group rename patch
series updates rdtgrp->closid to that of the new parent so that for
MON groups, rdtgrp->closid == rdtgrp->mon.parent->closid continues to
hold.

It looks like rdt_move_group_tasks() assumes this, as it always sets
t->closid to rdtgrp->closid.

-Peter

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH v2 08/18] x86/resctrl: Queue mon_event_read() instead of sending an IPI
  2023-03-06 11:33     ` James Morse
@ 2023-03-08 16:09       ` James Morse
  2023-03-10 20:06         ` Reinette Chatre
  0 siblings, 1 reply; 73+ messages in thread
From: James Morse @ 2023-03-08 16:09 UTC (permalink / raw)
  To: Reinette Chatre, x86, linux-kernel
  Cc: Fenghua Yu, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	H Peter Anvin, Babu Moger, shameerali.kolothum.thodi,
	D Scott Phillips OS, carl, lcherian, bobo.shaobowang,
	tan.shaopeng, xingxin.hx, baolin.wang, Jamie Iles, Xin Hao,
	peternewman

Hi Reinette,

On 06/03/2023 11:33, James Morse wrote:
> On 02/02/2023 23:47, Reinette Chatre wrote:
>> On 1/13/2023 9:54 AM, James Morse wrote:
>>> x86 is blessed with an abundance of monitors, one per RMID, that can be
>>> read from any CPU in the domain. MPAMs monitors reside in the MMIO MSC,
>>> the number implemented is up to the manufacturer. This means when there are
>>> fewer monitors than needed, they need to be allocated and freed.
>>>
>>> Worse, the domain may be broken up into slices, and the MMIO accesses
>>> for each slice may need performing from different CPUs.
>>>
>>> These two details mean MPAMs monitor code needs to be able to sleep, and
>>> IPI another CPU in the domain to read from a resource that has been sliced.
>>>
>>> mon_event_read() already invokes mon_event_count() via IPI, which means
>>> this isn't possible.
>>>
>>> Change mon_event_read() to schedule mon_event_count() on a remote CPU and
>>> wait, instead of sending an IPI. This function is only used in response to
>>> a user-space filesystem request (not the timing sensitive overflow code).
>>>
>>> This allows MPAM to hide the slice behaviour from resctrl, and to keep
>>> the monitor-allocation in monitor.c.

>>> diff --git a/arch/x86/kernel/cpu/resctrl/ctrlmondata.c b/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
>>> index 1df0e3262bca..4ee3da6dced7 100644
>>> --- a/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
>>> +++ b/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
>>> @@ -542,7 +545,7 @@ void mon_event_read(struct rmid_read *rr, struct rdt_resource *r,
>>>  	rr->val = 0;
>>>  	rr->first = first;
>>>  
>>> -	smp_call_function_any(&d->cpu_mask, mon_event_count, rr, 1);
>>> +	smp_call_on_cpu(cpumask_any(&d->cpu_mask), mon_event_count, rr, false);
> 
>> This would be problematic for the use cases where single tasks are run on
>> adaptive-tick CPUs. If an adaptive-tick CPU is chosen to run the function then
>> it may never run. Real-time environments are target usage of resctrl (with examples
>> in the documentation).
> 
> Interesting. I can't find an IPI wakeup under smp_call_on_cpu() ... I wonder what else
> this breaks!
> 
> Resctrl doesn't consider the nohz-cpus when doing any of this work, or when setting up the
> limbo or overflow timer work.
> 
> I think the right thing to do here is add some cpumask_any_housekeeping() helper to avoid
> nohz-full CPUs where possible, and fall back to an IPI if all the CPUs in a domain are
> nohz-full.
> 
> Ideally cpumask_any() would do this but it isn't possible without allocating memory.
> If I can reproduce this problem,  ...

... I haven't been able to reproduce this.

With "nohz_full=1 isolcpus=nohz,domain,1" on the command-line I can still
smp_call_on_cpu() on cpu-1 even when its running a SCHED_FIFO task that spins in
user-space as much as possible.

This looks to be down to "sched: RT throttling activated", which seems to be to prevent RT
CPU hogs from blocking kernel work. From Peter's comments at [0], it looks like running
tasks 100% in user-space isn't a realistic use-case.

Given that, I think resctrl should use smp_call_on_cpu() to avoid interrupting a nohz_full
CPUs, and the limbo/overflow code should equally avoid these CPUs. If work does get
scheduled on those CPUs, it is expected to run eventually.


Thanks,

James

[0] https://lore.kernel.org/all/20130823110254.GU31370@twins.programming.kicks-ass.net/

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH v2 09/18] x86/resctrl: Allow resctrl_arch_rmid_read() to sleep
  2023-03-06 13:14       ` Peter Newman
@ 2023-03-08 17:45         ` James Morse
  2023-03-09 13:41           ` Peter Newman
  0 siblings, 1 reply; 73+ messages in thread
From: James Morse @ 2023-03-08 17:45 UTC (permalink / raw)
  To: Peter Newman
  Cc: x86, linux-kernel, Fenghua Yu, Reinette Chatre, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, H Peter Anvin, Babu Moger,
	shameerali.kolothum.thodi, D Scott Phillips OS, carl, lcherian,
	bobo.shaobowang, tan.shaopeng, xingxin.hx, baolin.wang,
	Jamie Iles, Xin Hao, Stephane Eranian

Hi Peter,

On 06/03/2023 13:14, Peter Newman wrote:
> On Mon, Mar 6, 2023 at 12:34 PM James Morse <james.morse@arm.com> wrote:
>> On 23/01/2023 15:33, Peter Newman wrote:
>>> On Fri, Jan 13, 2023 at 6:56 PM James Morse <james.morse@arm.com> wrote:
>>>> MPAM's cache occupancy counters can take a little while to settle once
>>>> the monitor has been configured. The maximum settling time is described
>>>> to the driver via a firmware table. The value could be large enough
>>>> that it makes sense to sleep.
>>>
>>> Would it be easier to return an error when reading the occupancy count
>>> too soon after configuration? On Intel it is already normal for counter
>>> reads to fail on newly-allocated RMIDs.
>>
>> For x86, you have as many counters as there are RMIDs, so there is no issue just accessing
>> the counter.
> 
> I should have said AMD instead of Intel, because their implementations
> have far fewer counters than RMIDs.

Right, I assume Intel and AMD behaved in the same way here.


>> With MPAM there may be as few as 1 monitor for the CSU (cache storage utilisation)
>> counter, which needs to be multiplexed between different PARTID to find the cache
>> occupancy (This works for CSU because its a stable count, it doesn't work for the
>> bandwidth monitors)
>> On such a platform the monitor needs to be allocated and programmed before it reads a
>> value for a particular PARTID/CLOSID. If you had two threads trying to read the same
>> counter, they could interleave perfectly to prevent either thread managing to read a value.
>> The 'not ready' time is advertised in a firmware table, and the driver will wait at most
>> that long before giving up and returning an error.

> Likewise, on AMD, a repeating sequence of tasks which are LRU in terms
> of counter -> RMID allocation could prevent RMID event reads from ever
> returning a value.

Isn't that a terrible user-space interface? "If someone else is reading a similar file,
neither of you make progress".


> The main difference I see with MPAM is that software allocates the
> counters instead of hardware, but the overall behavior sounds the same.
> 
> The part I object to is introducing the wait to the counter read because
> existing software already expects an immediate error when reading a
> counter too soon. To produce accurate data, these readings are usually
> read at intervals of multiple seconds.


> Instead, when configuring a counter, could you use the firmware table
> value to compute the time when the counter will next be valid and return
> errors on read requests received before that?

The monitor might get re-allocated, re-programmed and become valid for a different
PARTID+PMG in the mean time. I don't think these things should remain allocated over a
return to user-space. Without doing that I don't see how we can return-early and make
progress.

How long should a CSU monitor remain allocated to a PARTID+PMG? Currently its only for the
duration of the read() syscall on the file.


The problem with MPAM is too much of it is optional. This particular behaviour is only
valid for CSU monitors, (llc_occupancy), and then, only if your hardware designers didn't
have a value to hand when the monitor is programmed, and need to do a scan of the cache to
come up with a result. The retry is only triggered if the hardware sets NRDY.
This is also only necessary if there aren't enough monitors for every RMID/(PARTID*PMG) to
have its own. If there were enough, the monitors could be allocated and programmed at
startup, and the whole thing becomes cheaper to access.

If a hardware platform needs time to do this, it has to come from somewhere. I don't think
maintaining an epoch based list of which monitor secretly belongs to a PARTID+PMG in the
hope user-space reads the file again 'quickly enough' is going to be maintainable.

If returning errors early is an important use-case, I can suggest ensuring the MPAM driver
allocates CSU monitors up-front if there are enough (today it only does this for MBWU
monitors). We then have to hope that folk who care about this also build hardware
platforms with enough monitors.


Thanks,

James

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH v2 09/18] x86/resctrl: Allow resctrl_arch_rmid_read() to sleep
  2023-03-08 17:45         ` James Morse
@ 2023-03-09 13:41           ` Peter Newman
  2023-03-09 17:35             ` James Morse
  0 siblings, 1 reply; 73+ messages in thread
From: Peter Newman @ 2023-03-09 13:41 UTC (permalink / raw)
  To: James Morse
  Cc: x86, linux-kernel, Fenghua Yu, Reinette Chatre, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, H Peter Anvin, Babu Moger,
	shameerali.kolothum.thodi, D Scott Phillips OS, carl, lcherian,
	bobo.shaobowang, tan.shaopeng, xingxin.hx, baolin.wang,
	Jamie Iles, Xin Hao, Stephane Eranian

Hi James,

On Wed, Mar 8, 2023 at 6:45 PM James Morse <james.morse@arm.com> wrote:
> On 06/03/2023 13:14, Peter Newman wrote:
> > On Mon, Mar 6, 2023 at 12:34 PM James Morse <james.morse@arm.com> wrote:
>
> > Instead, when configuring a counter, could you use the firmware table
> > value to compute the time when the counter will next be valid and return
> > errors on read requests received before that?
>
> The monitor might get re-allocated, re-programmed and become valid for a different
> PARTID+PMG in the mean time. I don't think these things should remain allocated over a
> return to user-space. Without doing that I don't see how we can return-early and make
> progress.
>
> How long should a CSU monitor remain allocated to a PARTID+PMG? Currently its only for the
> duration of the read() syscall on the file.
>
>
> The problem with MPAM is too much of it is optional. This particular behaviour is only
> valid for CSU monitors, (llc_occupancy), and then, only if your hardware designers didn't
> have a value to hand when the monitor is programmed, and need to do a scan of the cache to
> come up with a result. The retry is only triggered if the hardware sets NRDY.
> This is also only necessary if there aren't enough monitors for every RMID/(PARTID*PMG) to
> have its own. If there were enough, the monitors could be allocated and programmed at
> startup, and the whole thing becomes cheaper to access.
>
> If a hardware platform needs time to do this, it has to come from somewhere. I don't think
> maintaining an epoch based list of which monitor secretly belongs to a PARTID+PMG in the
> hope user-space reads the file again 'quickly enough' is going to be maintainable.
>
> If returning errors early is an important use-case, I can suggest ensuring the MPAM driver
> allocates CSU monitors up-front if there are enough (today it only does this for MBWU
> monitors). We then have to hope that folk who care about this also build hardware
> platforms with enough monitors.

Thanks, this makes more sense now. Since CSU data isn't cumulative, I
see how synchronously collecting a snapshot is useful in this situation.
I was more concerned about understanding the need for the new behavior
than getting errors back quickly.

However, I do want to be sure that MBWU counters will never be silently
deallocated because we will never be able to trust the data unless we
know that the counter has been watching the group's tasks for the
entirety of the measurement window.

Unlike on AMD, MPAM allows software to control which PARTID+PMG the
monitoring hardware is watching. Could we instead make the user
explicitly request the mbm_{total,local}_bytes events be allocated to
monitoring groups after creating them? Or even just allocating the
events on monitoring group creation only when they're available could
also be marginably usable if a single user agent is managing rdtgroups.

Thanks!
-Peter

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH v2 09/18] x86/resctrl: Allow resctrl_arch_rmid_read() to sleep
  2023-03-09 13:41           ` Peter Newman
@ 2023-03-09 17:35             ` James Morse
  2023-03-10  9:28               ` Peter Newman
  0 siblings, 1 reply; 73+ messages in thread
From: James Morse @ 2023-03-09 17:35 UTC (permalink / raw)
  To: Peter Newman
  Cc: x86, linux-kernel, Fenghua Yu, Reinette Chatre, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, H Peter Anvin, Babu Moger,
	shameerali.kolothum.thodi, D Scott Phillips OS, carl, lcherian,
	bobo.shaobowang, tan.shaopeng, xingxin.hx, baolin.wang,
	Jamie Iles, Xin Hao, Stephane Eranian

Hi Peter,

On 09/03/2023 13:41, Peter Newman wrote:
> On Wed, Mar 8, 2023 at 6:45 PM James Morse <james.morse@arm.com> wrote:
>> On 06/03/2023 13:14, Peter Newman wrote:
>>> On Mon, Mar 6, 2023 at 12:34 PM James Morse <james.morse@arm.com> wrote:
>>
>>> Instead, when configuring a counter, could you use the firmware table
>>> value to compute the time when the counter will next be valid and return
>>> errors on read requests received before that?
>>
>> The monitor might get re-allocated, re-programmed and become valid for a different
>> PARTID+PMG in the mean time. I don't think these things should remain allocated over a
>> return to user-space. Without doing that I don't see how we can return-early and make
>> progress.
>>
>> How long should a CSU monitor remain allocated to a PARTID+PMG? Currently its only for the
>> duration of the read() syscall on the file.
>>
>>
>> The problem with MPAM is too much of it is optional. This particular behaviour is only
>> valid for CSU monitors, (llc_occupancy), and then, only if your hardware designers didn't
>> have a value to hand when the monitor is programmed, and need to do a scan of the cache to
>> come up with a result. The retry is only triggered if the hardware sets NRDY.
>> This is also only necessary if there aren't enough monitors for every RMID/(PARTID*PMG) to
>> have its own. If there were enough, the monitors could be allocated and programmed at
>> startup, and the whole thing becomes cheaper to access.
>>
>> If a hardware platform needs time to do this, it has to come from somewhere. I don't think
>> maintaining an epoch based list of which monitor secretly belongs to a PARTID+PMG in the
>> hope user-space reads the file again 'quickly enough' is going to be maintainable.
>>
>> If returning errors early is an important use-case, I can suggest ensuring the MPAM driver
>> allocates CSU monitors up-front if there are enough (today it only does this for MBWU
>> monitors). We then have to hope that folk who care about this also build hardware
>> platforms with enough monitors.
> 
> Thanks, this makes more sense now. Since CSU data isn't cumulative, I
> see how synchronously collecting a snapshot is useful in this situation.
> I was more concerned about understanding the need for the new behavior
> than getting errors back quickly.
> 
> However, I do want to be sure that MBWU counters will never be silently
> deallocated because we will never be able to trust the data unless we
> know that the counter has been watching the group's tasks for the
> entirety of the measurement window.

Absolutely.

The MPAM driver requires the number of monitors to match the value of
resctrl_arch_system_num_rmid_idx(), otherwise 'mbm_local' won't be offered via resctrl.
(see class_has_usable_mbwu() in [0])

If the files exist in resctrl, then a monitor was reserved for this PARTID+PMG, and won't
get allocated for anything else.


[0]
https://git.kernel.org/pub/scm/linux/kernel/git/morse/linux.git/commit/?h=mpam/snapshot/v6.2&id=f28d3fefdcf7022a49f62752acbecf180ea7d32f


> Unlike on AMD, MPAM allows software to control which PARTID+PMG the
> monitoring hardware is watching. Could we instead make the user
> explicitly request the mbm_{total,local}_bytes events be allocated to
> monitoring groups after creating them? Or even just allocating the
> events on monitoring group creation only when they're available could
> also be marginably usable if a single user agent is managing rdtgroups.

Hmmmm, what would that look like to user-space?

I'm against inventing anything new here until there is feature-parity where possible
upstream. It's a walk, then run kind of thing.

I worry that extra steps to setup the monitoring on MPAM:resctrl will be missing or broken
in many (all?) software projects if they're not also required on Intel:resctrl.

My plan for hardware with insufficient counters is to make the counters accessible via
perf, and do that in a way that works on Intel and AMD too.


Thanks,

James

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH v2 09/18] x86/resctrl: Allow resctrl_arch_rmid_read() to sleep
  2023-03-09 17:35             ` James Morse
@ 2023-03-10  9:28               ` Peter Newman
  2023-03-20 17:12                 ` James Morse
  0 siblings, 1 reply; 73+ messages in thread
From: Peter Newman @ 2023-03-10  9:28 UTC (permalink / raw)
  To: James Morse
  Cc: x86, linux-kernel, Fenghua Yu, Reinette Chatre, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, H Peter Anvin, Babu Moger,
	shameerali.kolothum.thodi, D Scott Phillips OS, carl, lcherian,
	bobo.shaobowang, tan.shaopeng, xingxin.hx, baolin.wang,
	Jamie Iles, Xin Hao, Stephane Eranian

Hi James,

On Thu, Mar 9, 2023 at 6:35 PM James Morse <james.morse@arm.com> wrote:
> On 09/03/2023 13:41, Peter Newman wrote:
> > However, I do want to be sure that MBWU counters will never be silently
> > deallocated because we will never be able to trust the data unless we
> > know that the counter has been watching the group's tasks for the
> > entirety of the measurement window.
>
> Absolutely.
>
> The MPAM driver requires the number of monitors to match the value of
> resctrl_arch_system_num_rmid_idx(), otherwise 'mbm_local' won't be offered via resctrl.
> (see class_has_usable_mbwu() in [0])
>
> If the files exist in resctrl, then a monitor was reserved for this PARTID+PMG, and won't
> get allocated for anything else.
>
>
> [0]
> https://git.kernel.org/pub/scm/linux/kernel/git/morse/linux.git/commit/?h=mpam/snapshot/v6.2&id=f28d3fefdcf7022a49f62752acbecf180ea7d32f
>
>
> > Unlike on AMD, MPAM allows software to control which PARTID+PMG the
> > monitoring hardware is watching. Could we instead make the user
> > explicitly request the mbm_{total,local}_bytes events be allocated to
> > monitoring groups after creating them? Or even just allocating the
> > events on monitoring group creation only when they're available could
> > also be marginably usable if a single user agent is managing rdtgroups.
>
> Hmmmm, what would that look like to user-space?
>
> I'm against inventing anything new here until there is feature-parity where possible
> upstream. It's a walk, then run kind of thing.
>
> I worry that extra steps to setup the monitoring on MPAM:resctrl will be missing or broken
> in many (all?) software projects if they're not also required on Intel:resctrl.
>
> My plan for hardware with insufficient counters is to make the counters accessible via
> perf, and do that in a way that works on Intel and AMD too.

In the interest of enabling MPAM functionality, I think the low-effort
approach is to only allocate an MBWU monitor to a newly-created MON or
CTRL_MON group if one is available. On Intel and AMD, the resources are
simply always available.

The downside on monitor-poor (or PARTID-rich) hardware is the user gets
maximually-featureful monitoring groups first, whether they want them or
not, but I think it's workable. Perhaps in a later change we can make an
interface to prevent monitors from being allocated to new groups or one
to release them when they're not needed after group creation.

At least in this approach there's still a way to use MBWU with resctrl
when systems have more PARTIDs than monitors.

This also seems like less work than making resctrl able to interface
with the perf subsystem.

-Peter

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH v2 05/18] x86/resctrl: Allow RMID allocation to be scoped by CLOSID
  2023-03-03 18:34     ` James Morse
@ 2023-03-10 19:57       ` Reinette Chatre
  0 siblings, 0 replies; 73+ messages in thread
From: Reinette Chatre @ 2023-03-10 19:57 UTC (permalink / raw)
  To: James Morse, x86, linux-kernel
  Cc: Fenghua Yu, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	H Peter Anvin, Babu Moger, shameerali.kolothum.thodi,
	D Scott Phillips OS, carl, lcherian, bobo.shaobowang,
	tan.shaopeng, xingxin.hx, baolin.wang, Jamie Iles, Xin Hao,
	peternewman

Hi James,

On 3/3/2023 10:34 AM, James Morse wrote:
> On 02/02/2023 23:45, Reinette Chatre wrote:
>> On 1/13/2023 9:54 AM, James Morse wrote:

...

>>>  /*
>>> - * As of now the RMIDs allocation is global.
>>> + * As of now the RMIDs allocation is the same in each domain.
> 
>> Could you please elaborate what is meant/intended with this change
>> (global vs per domain)? From the changelog a comment that RMID
>> allocation is the same in each resource group for MPAM may be
>> expected but per domain is not clear to me.
> 
> This is badly worded. It's referring to the limbo list management, while RMID=7 isn't
> unique on MPAM, the struct rmid_entry used in two domains will be the same because the
> CLOSID doesn't change. This means its still sufficient to move around the struct
> rmid_entry to manage the limbo list.
> 
> I think this had me confused because 'as of now' implies the RMID won't always be globally
> allocated, and MPAM has non-unique RMID/PMG values which are a different kind of global.
> 
> 
> I'll change this to read:
> /*
>  * For MPAM the RMID value is not unique, and has to be considered with
>  * the CLOSID. The (CLOSID, RMID) pair is allocated on all domains, which
>  * allows all domains to be managed by a single limbo list.
>  * Each domain also has a rmid_busy_llc to reduce the work of the limbo handler.
>  */
> 
> (seeing as the function doesn't touch rmid_budy_llc, or refer to it by name)
> 

Thank you. This is easier to understand.

Reinette

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH v2 06/18] x86/resctrl: Allow the allocator to check if a CLOSID can allocate clean RMID
  2023-03-03 18:36     ` James Morse
@ 2023-03-10 19:59       ` Reinette Chatre
  2023-03-20 17:12         ` James Morse
  0 siblings, 1 reply; 73+ messages in thread
From: Reinette Chatre @ 2023-03-10 19:59 UTC (permalink / raw)
  To: James Morse, x86, linux-kernel
  Cc: Fenghua Yu, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	H Peter Anvin, Babu Moger, shameerali.kolothum.thodi,
	D Scott Phillips OS, carl, lcherian, bobo.shaobowang,
	tan.shaopeng, xingxin.hx, baolin.wang, Jamie Iles, Xin Hao,
	peternewman

Hi James,

On 3/3/2023 10:36 AM, James Morse wrote:
> Hi Reinette,
> 
> On 02/02/2023 23:46, Reinette Chatre wrote:
>> On 1/13/2023 9:54 AM, James Morse wrote:

...

>>> +bool resctrl_closid_is_dirty(u32 closid)
>>> +{
>>> +	struct rmid_entry *entry;
>>> +	int i;
>>> +
>>> +	lockdep_assert_held(&rdtgroup_mutex);
>>> +
>>> +	if (!IS_ENABLED(CONFIG_RESCTRL_RMID_DEPENDS_ON_CLOSID))
>>> +		return false;
> 
>> Why is a config option chosen? Is this not something that can be set in the
>> architecture specific code using a global in the form matching existing related
>> items like "arch_has..." or "arch_needs..."?
> 
> It doesn't vary by platform, so making it a runtime variable would mean x86 has to carry
> this extra code around, even though it will never use it. Done like this, the compiler can
> dead-code eliminate the below checks and embed the constant return value in all the callers.

This is fair. I missed any other mention of this option in this series so I
assume this will be a config that will be "select"ed automatically without
users needing to think about whether it is needed?

>>> +
>>> +	for (i = 0; i < resctrl_arch_system_num_rmid_idx(); i++) {
>>> +		entry = &rmid_ptrs[i];
>>> +		if (entry->closid != closid)
>>> +			continue;
>>> +
>>> +		if (entry->busy)
>>> +			return true;
>>> +	}
>>> +
>>> +	return false;
>>> +}
>>
>> If I understand this correctly resctrl_closid_is_dirty() will return true if
>> _any_ RMID/PMG associated with a CLOSID is in use. That is, a CLOSID may be
>> able to support 100s of PMG but if only one of them is busy then the CLOSID
>> will be considered unusable ("dirty"). It sounds to me that there could be scenarios
>> when CLOSID could be considered unavailable while there are indeed sufficient
>> resources?
> 
> You are right this could happen. I guess the better approach would be to prefer the
> cleanest CLOSID that has a clean PMG=0. User-space may not be able to allocate all the
> monitor groups immediately, but that would be preferable to failing the control group
> creation.
> 
> But as this code doesn't get built until the MPAM driver is merged, I'd like to keep it to
> an absolute minimum. This would be more than is needed for MPAM to have close to resctrl
> feature-parity, so I'd prefer to do this as an improvement once the MPAM driver is upstream.
> 
> (also in this category is better use of MPAM's monitors and labelling traffic from the iommu)
> 
> 
>> The function comment states "Determine if clean RMID can be allocate for this
>> CLOSID." - if I understand correctly it behaves more like "Determine if all
>> RMID associated with CLOSID are available".
> 
> Yes, I'll fix that.

I understand your reasoning for the solution chosen. Would you be ok to expand on
the function comment more to document the intentions that you summarize above (eg. "This
is the absolute minimum solution that will be/should be/could be improved ...")?

Reinette


^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH v2 07/18] x86/resctrl: Move CLOSID/RMID matching and setting to use helpers
  2023-03-06 11:32     ` James Morse
  2023-03-08 10:30       ` Peter Newman
@ 2023-03-10 20:00       ` Reinette Chatre
  1 sibling, 0 replies; 73+ messages in thread
From: Reinette Chatre @ 2023-03-10 20:00 UTC (permalink / raw)
  To: James Morse, x86, linux-kernel
  Cc: Fenghua Yu, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	H Peter Anvin, Babu Moger, shameerali.kolothum.thodi,
	D Scott Phillips OS, carl, lcherian, bobo.shaobowang,
	tan.shaopeng, xingxin.hx, baolin.wang, Jamie Iles, Xin Hao,
	peternewman

Hi James,

On 3/6/2023 3:32 AM, James Morse wrote:
> On 02/02/2023 23:47, Reinette Chatre wrote:
>> On 1/13/2023 9:54 AM, James Morse wrote:
>>
>> ...
>>
>>> @@ -567,19 +579,14 @@ static int __rdtgroup_move_task(struct task_struct *tsk,
>>>  	 * For monitor groups, can move the tasks only from
>>>  	 * their parent CTRL group.
>>>  	 */
>>> -
>>> -	if (rdtgrp->type == RDTCTRL_GROUP) {
>>> -		WRITE_ONCE(tsk->closid, rdtgrp->closid);
>>> -		WRITE_ONCE(tsk->rmid, rdtgrp->mon.rmid);
>>> -	} else if (rdtgrp->type == RDTMON_GROUP) {
>>> -		if (rdtgrp->mon.parent->closid == tsk->closid) {
>>> -			WRITE_ONCE(tsk->rmid, rdtgrp->mon.rmid);
>>> -		} else {
>>> -			rdt_last_cmd_puts("Can't move task to different control group\n");
>>> -			return -EINVAL;
>>> -		}
>>> +	if (rdtgrp->type == RDTMON_GROUP &&
>>> +	    !resctrl_arch_match_closid(tsk, rdtgrp->mon.parent->closid)) {
>>> +		rdt_last_cmd_puts("Can't move task to different control group\n");
>>> +		return -EINVAL;
>>>  	}
>>>  
>>> +	resctrl_arch_set_closid_rmid(tsk, rdtgrp->closid, rdtgrp->mon.rmid);
>>
>> This does not use the intended closid when rdtgrp->type == RDTMON_GROUP.
> 
> Yes, it should be rdtgrp->mon.parent->closid.
> 
> rdtgroup_mkdir_mon() initialises them to be the same

Even so, I do find the code to be confusing when it jumps from one to
the other within the same function.

>, I guess its Peter's monitor-group
> rename that means this could get the wrong value?
> 
> I've fixed it as:
> ---------%<---------
> diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> index c7392d10dc5b..30d8961b833c 100644
> --- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> +++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> @@ -585,7 +585,12 @@ static int __rdtgroup_move_task(struct task_struct *tsk,
>                 return -EINVAL;
>         }
> 
> -       resctrl_arch_set_closid_rmid(tsk, rdtgrp->closid, rdtgrp->mon.rmid);
> +       if (rdtgrp->type == RDTMON_GROUP)
> +               resctrl_arch_set_closid_rmid(tsk, rdtgrp->mon.parent->closid,
> +                                            rdtgrp->mon.rmid);
> +       else
> +               resctrl_arch_set_closid_rmid(tsk, rdtgrp->closid,
> +                                            rdtgrp->mon.rmid);
> 
>         /*
>          * Ensure the task's closid and rmid are written before determining if
> ---------%<---------
> 
> 

ok, thank you.

Reinette


^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH v2 08/18] x86/resctrl: Queue mon_event_read() instead of sending an IPI
  2023-03-06 11:32     ` James Morse
@ 2023-03-10 20:00       ` Reinette Chatre
  0 siblings, 0 replies; 73+ messages in thread
From: Reinette Chatre @ 2023-03-10 20:00 UTC (permalink / raw)
  To: James Morse, Yu, Fenghua, x86, linux-kernel
  Cc: Thomas Gleixner, Ingo Molnar, Borislav Petkov, H Peter Anvin,
	Babu Moger, shameerali.kolothum.thodi, D Scott Phillips OS, carl,
	lcherian, bobo.shaobowang, tan.shaopeng, xingxin.hx, baolin.wang,
	Jamie Iles, Xin Hao, peternewman

Hi James and Fenghua,

On 3/6/2023 3:32 AM, James Morse wrote:
> On 17/01/2023 18:29, Yu, Fenghua wrote:
>>> x86 is blessed with an abundance of monitors, one per RMID, that can be read
>>> from any CPU in the domain. MPAMs monitors reside in the MMIO MSC, the
>>> number implemented is up to the manufacturer. This means when there are
>>> fewer monitors than needed, they need to be allocated and freed.
>>>
>>> Worse, the domain may be broken up into slices, and the MMIO accesses for
>>> each slice may need performing from different CPUs.
>>>
>>> These two details mean MPAMs monitor code needs to be able to sleep, and IPI
>>> another CPU in the domain to read from a resource that has been sliced.
>>>
>>> mon_event_read() already invokes mon_event_count() via IPI, which means this
>>> isn't possible.
>>>
>>> Change mon_event_read() to schedule mon_event_count() on a remote CPU
>>> and wait, instead of sending an IPI. This function is only used in response to a
>>> user-space filesystem request (not the timing sensitive overflow code).
> 
>> But mkdir mon group needs mon_event_count() to reset RMID state.
>> If mon_event_count() is called much later, the RMID state may be used
>> before it's reset. E.g. prev_msr might be non-0 value. That will cause
>> overflow code failure.
>>
>> Seems this may happen on both x86 and arm64. At least need to make sure
>> RMID state reset happens before it's used.
> 
> On the architecture side, there is a patch from Peter that records the MSR value on the
> architecture side when an RMID is reset/re-allocated. 2a81160d29d6 ("x86/resctrl: Fix
> event counts regression in reused RMIDs")
> 
> For the filesystem, the 'first' value is passed through and handled by the CPU that reads
> the MSR. I don't see what problem any extra delay due to scheduling would cause.
> 

Both the monitor directory creation and the overflow handler rely on the
rdtgroup mutex so I do not see how an overflow code failure could sneak in here.

Reinette

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH v2 08/18] x86/resctrl: Queue mon_event_read() instead of sending an IPI
  2023-03-08 16:09       ` James Morse
@ 2023-03-10 20:06         ` Reinette Chatre
  2023-03-20 17:12           ` James Morse
  0 siblings, 1 reply; 73+ messages in thread
From: Reinette Chatre @ 2023-03-10 20:06 UTC (permalink / raw)
  To: James Morse, x86, linux-kernel
  Cc: Fenghua Yu, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	H Peter Anvin, Babu Moger, shameerali.kolothum.thodi,
	D Scott Phillips OS, carl, lcherian, bobo.shaobowang,
	tan.shaopeng, xingxin.hx, baolin.wang, Jamie Iles, Xin Hao,
	peternewman

Hi James,

On 3/8/2023 8:09 AM, James Morse wrote:
> Hi Reinette,
> 
> On 06/03/2023 11:33, James Morse wrote:
>> On 02/02/2023 23:47, Reinette Chatre wrote:
>>> On 1/13/2023 9:54 AM, James Morse wrote:
>>>> x86 is blessed with an abundance of monitors, one per RMID, that can be
>>>> read from any CPU in the domain. MPAMs monitors reside in the MMIO MSC,
>>>> the number implemented is up to the manufacturer. This means when there are
>>>> fewer monitors than needed, they need to be allocated and freed.
>>>>
>>>> Worse, the domain may be broken up into slices, and the MMIO accesses
>>>> for each slice may need performing from different CPUs.
>>>>
>>>> These two details mean MPAMs monitor code needs to be able to sleep, and
>>>> IPI another CPU in the domain to read from a resource that has been sliced.
>>>>
>>>> mon_event_read() already invokes mon_event_count() via IPI, which means
>>>> this isn't possible.
>>>>
>>>> Change mon_event_read() to schedule mon_event_count() on a remote CPU and
>>>> wait, instead of sending an IPI. This function is only used in response to
>>>> a user-space filesystem request (not the timing sensitive overflow code).
>>>>
>>>> This allows MPAM to hide the slice behaviour from resctrl, and to keep
>>>> the monitor-allocation in monitor.c.
> 
>>>> diff --git a/arch/x86/kernel/cpu/resctrl/ctrlmondata.c b/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
>>>> index 1df0e3262bca..4ee3da6dced7 100644
>>>> --- a/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
>>>> +++ b/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
>>>> @@ -542,7 +545,7 @@ void mon_event_read(struct rmid_read *rr, struct rdt_resource *r,
>>>>  	rr->val = 0;
>>>>  	rr->first = first;
>>>>  
>>>> -	smp_call_function_any(&d->cpu_mask, mon_event_count, rr, 1);
>>>> +	smp_call_on_cpu(cpumask_any(&d->cpu_mask), mon_event_count, rr, false);
>>
>>> This would be problematic for the use cases where single tasks are run on
>>> adaptive-tick CPUs. If an adaptive-tick CPU is chosen to run the function then
>>> it may never run. Real-time environments are target usage of resctrl (with examples
>>> in the documentation).
>>
>> Interesting. I can't find an IPI wakeup under smp_call_on_cpu() ... I wonder what else
>> this breaks!
>>
>> Resctrl doesn't consider the nohz-cpus when doing any of this work, or when setting up the
>> limbo or overflow timer work.
>>
>> I think the right thing to do here is add some cpumask_any_housekeeping() helper to avoid
>> nohz-full CPUs where possible, and fall back to an IPI if all the CPUs in a domain are
>> nohz-full.
>>
>> Ideally cpumask_any() would do this but it isn't possible without allocating memory.
>> If I can reproduce this problem,  ...
> 
> ... I haven't been able to reproduce this.
> 
> With "nohz_full=1 isolcpus=nohz,domain,1" on the command-line I can still
> smp_call_on_cpu() on cpu-1 even when its running a SCHED_FIFO task that spins in
> user-space as much as possible.
> 
> This looks to be down to "sched: RT throttling activated", which seems to be to prevent RT
> CPU hogs from blocking kernel work. From Peter's comments at [0], it looks like running
> tasks 100% in user-space isn't a realistic use-case.
> 
> Given that, I think resctrl should use smp_call_on_cpu() to avoid interrupting a nohz_full
> CPUs, and the limbo/overflow code should equally avoid these CPUs. If work does get
> scheduled on those CPUs, it is expected to run eventually.

From what I understand the email you point to, and I assume your testing,
used the system defaults (SCHED_FIFO gets 0.95s out of 1s).

Users are not constrained by these defaults. Please see
Documentation/scheduler/sched-rt-group.rst

It is thus possible for tightly controlled task to have a CPU dedicated to
it for great lengths or even forever. Ideally written in a way to manage power
and thermal constraints.

In the current behavior, users can use resctrl to read the data at any time
and expect to understand consequences of such action. 

It seems to me that there may be scenarios under which this change could
result in a read of data to never return? As you indicated it is expected
to run eventually, but that would be dictated by the RT scheduling period
that can be up to about 35 minutes (or "no limit" prompting me to make the
"never return" statement).

I do not see at this time that limbo/overflow should avoid these CPUs. Limbo
could be avoided from user space. I have not hear about overflow impacting
such workloads negatively.

Reinette

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH v2 18/18] x86/resctrl: Separate arch and fs resctrl locks
  2023-03-06 11:34     ` James Morse
@ 2023-03-11  0:22       ` Reinette Chatre
  2023-03-20 17:12         ` James Morse
  0 siblings, 1 reply; 73+ messages in thread
From: Reinette Chatre @ 2023-03-11  0:22 UTC (permalink / raw)
  To: James Morse, x86, linux-kernel
  Cc: Fenghua Yu, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	H Peter Anvin, Babu Moger, shameerali.kolothum.thodi,
	D Scott Phillips OS, carl, lcherian, bobo.shaobowang,
	tan.shaopeng, xingxin.hx, baolin.wang, Jamie Iles, Xin Hao,
	peternewman

Hi James,

On 3/6/2023 3:34 AM, James Morse wrote:
> Hi Reinette,
> 
> On 02/02/2023 23:50, Reinette Chatre wrote:
>> On 1/13/2023 9:54 AM, James Morse wrote:
>>> resctrl has one mutex that is taken by the architecture specific code,
>>> and the filesystem parts. The two interact via cpuhp, where the
>>> architecture code updates the domain list. Filesystem handlers that
>>> walk the domains list should not run concurrently with the cpuhp
>>> callback modifying the list.
>>>
>>> Exposing a lock from the filesystem code means the interface is not
>>> cleanly defined, and creates the possibility of cross-architecture
>>> lock ordering headaches. The interaction only exists so that certain
>>> filesystem paths are serialised against cpu hotplug. The cpu hotplug
>>> code already has a mechanism to do this using cpus_read_lock().
>>>
>>> MPAM's monitors have an overflow interrupt, so it needs to be possible
>>> to walk the domains list in irq context. RCU is ideal for this,
>>> but some paths need to be able to sleep to allocate memory.
>>>
>>> Because resctrl_{on,off}line_cpu() take the rdtgroup_mutex as part
>>> of a cpuhp callback, cpus_read_lock() must always be taken first.
>>> rdtgroup_schemata_write() already does this.
>>>
>>> All but one of the filesystem code's domain list walkers are
>>> currently protected by the rdtgroup_mutex taken in
>>> rdtgroup_kn_lock_live(). The exception is rdt_bit_usage_show()
>>> which takes the lock directly.
>>
>> The new BMEC code also. You can find it on tip's x86/cache branch,
>> see mbm_total_bytes_config_write() and mbm_local_bytes_config_write().
>>
>>>
>>> Make the domain list protected by RCU. An architecture-specific
>>> lock prevents concurrent writers. rdt_bit_usage_show() can
>>> walk the domain list under rcu_read_lock().
>>> The other filesystem list walkers need to be able to sleep.
>>> Add cpus_read_lock() to rdtgroup_kn_lock_live() so that the
>>> cpuhp callbacks can't be invoked when file system operations are
>>> occurring.
>>>
>>> Add lockdep_assert_cpus_held() in the cases where the
>>> rdtgroup_kn_lock_live() call isn't obvious.
>>>
>>> Resctrl's domain online/offline calls now need to take the
>>> rdtgroup_mutex themselves.
> 
> 
>>> diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
>>> index 7896fcf11df6..dc1ba580c4db 100644
>>> --- a/arch/x86/kernel/cpu/resctrl/core.c
>>> +++ b/arch/x86/kernel/cpu/resctrl/core.c
>>> @@ -25,8 +25,14 @@
>>>  #include <asm/resctrl.h>
>>>  #include "internal.h"
>>>  
>>> -/* Mutex to protect rdtgroup access. */
>>> -DEFINE_MUTEX(rdtgroup_mutex);
>>> +/*
>>> + * rdt_domain structures are kfree()d when their last cpu goes offline,
>>> + * and allocated when the first cpu in a new domain comes online.
>>> + * The rdt_resource's domain list is updated when this happens. The domain
>>> + * list is protected by RCU, but callers can also take the cpus_read_lock()
>>> + * to prevent modification if they need to sleep. All writers take this mutex:
>>
>> Using "callers can" is not specific (compare to "callers should"). Please provide
>> clear guidance on how the locks should be used. Reader may wonder "why take cpus_read_lock()
>> to prevent modification, why not just take the mutex to prevent modification?"
> 
> 'if they need to sleep' is the answer to this. I think a certain amount of background
> knowledge can be assumed. My aim here wasn't to write an essay, but indicate not all
> readers do the same thing. This is already the case in resctrl, and the MPAM pmu stuff
> makes that worse.
> 
> Is this more robust:
> | * rdt_domain structures are kfree()d when their last cpu goes offline,
> | * and allocated when the first cpu in a new domain comes online.
> | * The rdt_resource's domain list is updated when this happens. Readers of
> | * the domain list must either take cpus_read_lock(), or rely on an RCU
> | * read-side critical section, to avoid observing concurrent modification.
> | * For information about RCU, see Docuemtation/RCU/rcu.rst.
> | * All writers take this mutex:
> 
> ?

Yes, I do think this is more robust. Since you do mention, "'if they need to sleep'
is the answer to this", how about "... must take cpus_read_lock() if they need to
sleep, or otherwise rely on an RCU read-side critical section, ..."? I do not
think it is necessary to provide a link to the documentation. If you do prefer
to keep it, please note the typo.

Also, please cpu -> CPU. 

>>> @@ -569,30 +579,27 @@ static void clear_closid_rmid(int cpu)
>>>  static int resctrl_arch_online_cpu(unsigned int cpu)
>>>  {
>>>  	struct rdt_resource *r;
>>> -	int err;
>>>  
>>> -	mutex_lock(&rdtgroup_mutex);
>>> +	mutex_lock(&domain_list_lock);
>>>  	for_each_capable_rdt_resource(r)
>>>  		domain_add_cpu(cpu, r);
>>>  	clear_closid_rmid(cpu);
>>> +	mutex_unlock(&domain_list_lock);
> 
>> Why is clear_closid_rmid(cpu) protected by mutex?
> 
> It doesn't need to be, its just an artefact of changing the lock, then moving the
> filesystem calls out. (its doesn't need to be protected by rdtgroup_mutex today).
> 
> If you don't think its churn, I'll move it to make it clearer.
> 

I do not see a problem with keeping the lock/unlock as before but
if you do find that you can make the locking clearer then
please do.

Reinette




^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH v2 06/18] x86/resctrl: Allow the allocator to check if a CLOSID can allocate clean RMID
  2023-03-10 19:59       ` Reinette Chatre
@ 2023-03-20 17:12         ` James Morse
  0 siblings, 0 replies; 73+ messages in thread
From: James Morse @ 2023-03-20 17:12 UTC (permalink / raw)
  To: Reinette Chatre, x86, linux-kernel
  Cc: Fenghua Yu, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	H Peter Anvin, Babu Moger, shameerali.kolothum.thodi,
	D Scott Phillips OS, carl, lcherian, bobo.shaobowang,
	tan.shaopeng, xingxin.hx, baolin.wang, Jamie Iles, Xin Hao,
	peternewman

Hi Reinette,

On 10/03/2023 19:59, Reinette Chatre wrote:
> On 3/3/2023 10:36 AM, James Morse wrote:
>> On 02/02/2023 23:46, Reinette Chatre wrote:
>>> On 1/13/2023 9:54 AM, James Morse wrote:
> 
> ...
> 
>>>> +bool resctrl_closid_is_dirty(u32 closid)
>>>> +{
>>>> +	struct rmid_entry *entry;
>>>> +	int i;
>>>> +
>>>> +	lockdep_assert_held(&rdtgroup_mutex);
>>>> +
>>>> +	if (!IS_ENABLED(CONFIG_RESCTRL_RMID_DEPENDS_ON_CLOSID))
>>>> +		return false;
>>
>>> Why is a config option chosen? Is this not something that can be set in the
>>> architecture specific code using a global in the form matching existing related
>>> items like "arch_has..." or "arch_needs..."?
>>
>> It doesn't vary by platform, so making it a runtime variable would mean x86 has to carry
>> this extra code around, even though it will never use it. Done like this, the compiler can
>> dead-code eliminate the below checks and embed the constant return value in all the callers.

> This is fair. I missed any other mention of this option in this series so I
> assume this will be a config that will be "select"ed automatically without
> users needing to think about whether it is needed?

Yes, MPAM platforms would unconditionally enable it, as it reflects how MPAM works. [0]
Users would never need to know it exists.


>>>> +	for (i = 0; i < resctrl_arch_system_num_rmid_idx(); i++) {
>>>> +		entry = &rmid_ptrs[i];
>>>> +		if (entry->closid != closid)
>>>> +			continue;
>>>> +
>>>> +		if (entry->busy)
>>>> +			return true;
>>>> +	}
>>>> +
>>>> +	return false;
>>>> +}
>>>
>>> If I understand this correctly resctrl_closid_is_dirty() will return true if
>>> _any_ RMID/PMG associated with a CLOSID is in use. That is, a CLOSID may be
>>> able to support 100s of PMG but if only one of them is busy then the CLOSID
>>> will be considered unusable ("dirty"). It sounds to me that there could be scenarios
>>> when CLOSID could be considered unavailable while there are indeed sufficient
>>> resources?
>>
>> You are right this could happen. I guess the better approach would be to prefer the
>> cleanest CLOSID that has a clean PMG=0. User-space may not be able to allocate all the
>> monitor groups immediately, but that would be preferable to failing the control group
>> creation.
>>
>> But as this code doesn't get built until the MPAM driver is merged, I'd like to keep it to
>> an absolute minimum. This would be more than is needed for MPAM to have close to resctrl
>> feature-parity, so I'd prefer to do this as an improvement once the MPAM driver is upstream.
>>
>> (also in this category is better use of MPAM's monitors and labelling traffic from the iommu)
>>
>>
>>> The function comment states "Determine if clean RMID can be allocate for this
>>> CLOSID." - if I understand correctly it behaves more like "Determine if all
>>> RMID associated with CLOSID are available".
>>
>> Yes, I'll fix that.
> 
> I understand your reasoning for the solution chosen. Would you be ok to expand on
> the function comment more to document the intentions that you summarize above (eg. "This
> is the absolute minimum solution that will be/should be/could be improved ...")?

Sure thing,


Thanks,

James

[0]
https://git.kernel.org/pub/scm/linux/kernel/git/morse/linux.git/tree/drivers/platform/mpam/Kconfig?h=mpam/snapshot/v6.2&id=ef6b11256ba2cceaff846c96183e8eb6019642d7

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH v2 08/18] x86/resctrl: Queue mon_event_read() instead of sending an IPI
  2023-03-10 20:06         ` Reinette Chatre
@ 2023-03-20 17:12           ` James Morse
  0 siblings, 0 replies; 73+ messages in thread
From: James Morse @ 2023-03-20 17:12 UTC (permalink / raw)
  To: Reinette Chatre, x86, linux-kernel
  Cc: Fenghua Yu, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	H Peter Anvin, Babu Moger, shameerali.kolothum.thodi,
	D Scott Phillips OS, carl, lcherian, bobo.shaobowang,
	tan.shaopeng, xingxin.hx, baolin.wang, Jamie Iles, Xin Hao,
	peternewman

Hi Reinette,

On 10/03/2023 20:06, Reinette Chatre wrote:
> On 3/8/2023 8:09 AM, James Morse wrote:
>> On 06/03/2023 11:33, James Morse wrote:
>>> On 02/02/2023 23:47, Reinette Chatre wrote:
>>>> On 1/13/2023 9:54 AM, James Morse wrote:
>>>>> x86 is blessed with an abundance of monitors, one per RMID, that can be
>>>>> read from any CPU in the domain. MPAMs monitors reside in the MMIO MSC,
>>>>> the number implemented is up to the manufacturer. This means when there are
>>>>> fewer monitors than needed, they need to be allocated and freed.
>>>>>
>>>>> Worse, the domain may be broken up into slices, and the MMIO accesses
>>>>> for each slice may need performing from different CPUs.
>>>>>
>>>>> These two details mean MPAMs monitor code needs to be able to sleep, and
>>>>> IPI another CPU in the domain to read from a resource that has been sliced.
>>>>>
>>>>> mon_event_read() already invokes mon_event_count() via IPI, which means
>>>>> this isn't possible.
>>>>>
>>>>> Change mon_event_read() to schedule mon_event_count() on a remote CPU and
>>>>> wait, instead of sending an IPI. This function is only used in response to
>>>>> a user-space filesystem request (not the timing sensitive overflow code).
>>>>>
>>>>> This allows MPAM to hide the slice behaviour from resctrl, and to keep
>>>>> the monitor-allocation in monitor.c.
>>
>>>>> diff --git a/arch/x86/kernel/cpu/resctrl/ctrlmondata.c b/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
>>>>> index 1df0e3262bca..4ee3da6dced7 100644
>>>>> --- a/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
>>>>> +++ b/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
>>>>> @@ -542,7 +545,7 @@ void mon_event_read(struct rmid_read *rr, struct rdt_resource *r,
>>>>>  	rr->val = 0;
>>>>>  	rr->first = first;
>>>>>  
>>>>> -	smp_call_function_any(&d->cpu_mask, mon_event_count, rr, 1);
>>>>> +	smp_call_on_cpu(cpumask_any(&d->cpu_mask), mon_event_count, rr, false);
>>>
>>>> This would be problematic for the use cases where single tasks are run on
>>>> adaptive-tick CPUs. If an adaptive-tick CPU is chosen to run the function then
>>>> it may never run. Real-time environments are target usage of resctrl (with examples
>>>> in the documentation).
>>>
>>> Interesting. I can't find an IPI wakeup under smp_call_on_cpu() ... I wonder what else
>>> this breaks!
>>>
>>> Resctrl doesn't consider the nohz-cpus when doing any of this work, or when setting up the
>>> limbo or overflow timer work.
>>>
>>> I think the right thing to do here is add some cpumask_any_housekeeping() helper to avoid
>>> nohz-full CPUs where possible, and fall back to an IPI if all the CPUs in a domain are
>>> nohz-full.
>>>
>>> Ideally cpumask_any() would do this but it isn't possible without allocating memory.
>>> If I can reproduce this problem,  ...
>>
>> ... I haven't been able to reproduce this.
>>
>> With "nohz_full=1 isolcpus=nohz,domain,1" on the command-line I can still
>> smp_call_on_cpu() on cpu-1 even when its running a SCHED_FIFO task that spins in
>> user-space as much as possible.
>>
>> This looks to be down to "sched: RT throttling activated", which seems to be to prevent RT
>> CPU hogs from blocking kernel work. From Peter's comments at [0], it looks like running
>> tasks 100% in user-space isn't a realistic use-case.
>>
>> Given that, I think resctrl should use smp_call_on_cpu() to avoid interrupting a nohz_full
>> CPUs, and the limbo/overflow code should equally avoid these CPUs. If work does get
>> scheduled on those CPUs, it is expected to run eventually.

> From what I understand the email you point to, and I assume your testing,
> used the system defaults (SCHED_FIFO gets 0.95s out of 1s).
> 
> Users are not constrained by these defaults. Please see
> Documentation/scheduler/sched-rt-group.rst

Aha, I didn't find thiese to change. But I note most things down here say:
| Fiddling with these settings can result in an unstable system, the knobs are
| root only and assumes root knows what he is doing.

on them.


> It is thus possible for tightly controlled task to have a CPU dedicated to
> it for great lengths or even forever. Ideally written in a way to manage power
> and thermal constraints.
> 
> In the current behavior, users can use resctrl to read the data at any time
> and expect to understand consequences of such action. 

Those consequences are that resctrl might pick that CPU as the victim of an IPI, so the
time taken to read the counters goes missing from user-space.


> It seems to me that there may be scenarios under which this change could
> result in a read of data to never return? As you indicated it is expected
> to run eventually, but that would be dictated by the RT scheduling period
> that can be up to about 35 minutes (or "no limit" prompting me to make the
> "never return" statement).

Surely not interrupting an RT task is a better state of affairs. User-space can't know
which CPU resctrl is going to IPI.
If this means resctrl doesn't work properly, I'd file that under 'can result in an
unstable system' as quoted above.

I think the best solution here is for resctrl to assume there is one housekeeping CPU per
domain, (e.g. for processing offloaded RCU callbacks), and that it should prefer that CPU
for all cross-call work. This avoids ever interrupting RT tasks.

If you feel strongly that all CPUs in a domain could be dedicated 100% to user-space work,
can always use an IPI when nohz_full is in use, (and hope for the best on the CPU choice).
This will prevent a class of MPAM platforms from using their monitors with nohz_full,
which I'm fine with as the conditions are detectable.


> I do not see at this time that limbo/overflow should avoid these CPUs. Limbo
> could be avoided from user space. I have not hear about overflow impacting
> such workloads negatively.

Its got all the same properties. The limbo/overflow work picks a CPU to run on, it may
pick a nohz_full CPU. I suspect no-one has complained is because this 100%-in-userspace is
a niche sport.

Again, I think the best solution here is for the limbo/overflow code to prefer
housekeeping CPUs for all their work. This is what I've done for v3.


Thanks,

James

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH v2 09/18] x86/resctrl: Allow resctrl_arch_rmid_read() to sleep
  2023-03-10  9:28               ` Peter Newman
@ 2023-03-20 17:12                 ` James Morse
  2023-03-22 13:21                   ` Peter Newman
  0 siblings, 1 reply; 73+ messages in thread
From: James Morse @ 2023-03-20 17:12 UTC (permalink / raw)
  To: Peter Newman
  Cc: x86, linux-kernel, Fenghua Yu, Reinette Chatre, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, H Peter Anvin, Babu Moger,
	shameerali.kolothum.thodi, D Scott Phillips OS, carl, lcherian,
	bobo.shaobowang, tan.shaopeng, xingxin.hx, baolin.wang,
	Jamie Iles, Xin Hao, Stephane Eranian

Hi Peter,

On 10/03/2023 09:28, Peter Newman wrote:
> On Thu, Mar 9, 2023 at 6:35 PM James Morse <james.morse@arm.com> wrote:
>> On 09/03/2023 13:41, Peter Newman wrote:
>>> However, I do want to be sure that MBWU counters will never be silently
>>> deallocated because we will never be able to trust the data unless we
>>> know that the counter has been watching the group's tasks for the
>>> entirety of the measurement window.
>>
>> Absolutely.
>>
>> The MPAM driver requires the number of monitors to match the value of
>> resctrl_arch_system_num_rmid_idx(), otherwise 'mbm_local' won't be offered via resctrl.
>> (see class_has_usable_mbwu() in [0])
>>
>> If the files exist in resctrl, then a monitor was reserved for this PARTID+PMG, and won't
>> get allocated for anything else.
>>
>>
>> [0]
>> https://git.kernel.org/pub/scm/linux/kernel/git/morse/linux.git/commit/?h=mpam/snapshot/v6.2&id=f28d3fefdcf7022a49f62752acbecf180ea7d32f
>>
>>
>>> Unlike on AMD, MPAM allows software to control which PARTID+PMG the
>>> monitoring hardware is watching. Could we instead make the user
>>> explicitly request the mbm_{total,local}_bytes events be allocated to
>>> monitoring groups after creating them? Or even just allocating the
>>> events on monitoring group creation only when they're available could
>>> also be marginably usable if a single user agent is managing rdtgroups.
>>
>> Hmmmm, what would that look like to user-space?
>>
>> I'm against inventing anything new here until there is feature-parity where possible
>> upstream. It's a walk, then run kind of thing.
>>
>> I worry that extra steps to setup the monitoring on MPAM:resctrl will be missing or broken
>> in many (all?) software projects if they're not also required on Intel:resctrl.
>>
>> My plan for hardware with insufficient counters is to make the counters accessible via
>> perf, and do that in a way that works on Intel and AMD too.

> In the interest of enabling MPAM functionality, I think the low-effort
> approach is to only allocate an MBWU monitor to a newly-created MON or
> CTRL_MON group if one is available. On Intel and AMD, the resources are
> simply always available.

I agree its low-effort, but I think the result is not worth having.

What does user-space get when it reads 'mbm_total_bytes'? Returning an error here sucks.
How is user-space supposed to identify the groups it wants to monitor, and those it
doesn't care about?

Taking "the only way to win is not to play" means the MPAM driver will only offer those
'mbm_total_bytes' files if they are going to work in the same way they do today. (as you
said, on Intel and AMD the resources are simply always available).

I agree those files have always been able to return errors - but I've never managed to
make the Intel system I have do it... so I bet user-space doesn't expect errors here.
(let alone persistent errors)


My fix for those systems that don't have enough monitors is to expose the counters via
perf, which lets user-space say which groups it wants to monitor (and when!). To make it
easier to use I've done it so the perf stuff works on Intel and AMD too.


> The downside on monitor-poor (or PARTID-rich) hardware is the user gets
> maximually-featureful monitoring groups first, whether they want them or
> not, but I think it's workable.

Really? Depending on the order groups are created in is a terrible user-space interface!
If someone needing to be monitored comes along later, you have do delete the existing
groups ... wait for limbo to do its thing ... then re-create them in some new order.


> Perhaps in a later change we can make an
> interface to prevent monitors from being allocated to new groups or one
> to release them when they're not needed after group creation.
> 
> At least in this approach there's still a way to use MBWU with resctrl
> when systems have more PARTIDs than monitors.
> 
> This also seems like less work than making resctrl able to interface
> with the perf subsystem.

Not correctly supporting resctrl (the mbm_total_files exist, but persistently return an
error) means some user-space users of this thing are broken on those systems.
Adding extra knobs to indicate when the underlying monitor hardware needs to be allocated
is in practice going to be missing if its only needed for MPAM, and is most likely to
bit-rot as codebases that do use it don't regularly test it.


This patch to allow resctrl_arch_rmid_read() to sleep is about MPAM's CSU NRDY and the
high likelyhood that folk build systems where MSCs are sliced up and private to something
smaller than the resctrl:domain. Without the perf support, this would still be necessary.

The changes needed for perf support are to make resctrl_arch_rmid_read() re-entrant, and
for the domain list to be protected by RCU. Neither of these are as onerous as changes to
the user-space interface, and the associated risk of breaking programs that work on other
platforms.


Thanks,

James

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH v2 18/18] x86/resctrl: Separate arch and fs resctrl locks
  2023-03-11  0:22       ` Reinette Chatre
@ 2023-03-20 17:12         ` James Morse
  0 siblings, 0 replies; 73+ messages in thread
From: James Morse @ 2023-03-20 17:12 UTC (permalink / raw)
  To: Reinette Chatre, x86, linux-kernel
  Cc: Fenghua Yu, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	H Peter Anvin, Babu Moger, shameerali.kolothum.thodi,
	D Scott Phillips OS, carl, lcherian, bobo.shaobowang,
	tan.shaopeng, xingxin.hx, baolin.wang, Jamie Iles, Xin Hao,
	peternewman

Hi Reinette

On 11/03/2023 00:22, Reinette Chatre wrote:
> On 3/6/2023 3:34 AM, James Morse wrote:
>> On 02/02/2023 23:50, Reinette Chatre wrote:
>>> On 1/13/2023 9:54 AM, James Morse wrote:
>>>> resctrl has one mutex that is taken by the architecture specific code,
>>>> and the filesystem parts. The two interact via cpuhp, where the
>>>> architecture code updates the domain list. Filesystem handlers that
>>>> walk the domains list should not run concurrently with the cpuhp
>>>> callback modifying the list.
>>>>
>>>> Exposing a lock from the filesystem code means the interface is not
>>>> cleanly defined, and creates the possibility of cross-architecture
>>>> lock ordering headaches. The interaction only exists so that certain
>>>> filesystem paths are serialised against cpu hotplug. The cpu hotplug
>>>> code already has a mechanism to do this using cpus_read_lock().
>>>>
>>>> MPAM's monitors have an overflow interrupt, so it needs to be possible
>>>> to walk the domains list in irq context. RCU is ideal for this,
>>>> but some paths need to be able to sleep to allocate memory.
>>>>
>>>> Because resctrl_{on,off}line_cpu() take the rdtgroup_mutex as part
>>>> of a cpuhp callback, cpus_read_lock() must always be taken first.
>>>> rdtgroup_schemata_write() already does this.
>>>>
>>>> All but one of the filesystem code's domain list walkers are
>>>> currently protected by the rdtgroup_mutex taken in
>>>> rdtgroup_kn_lock_live(). The exception is rdt_bit_usage_show()
>>>> which takes the lock directly.
>>>
>>> The new BMEC code also. You can find it on tip's x86/cache branch,
>>> see mbm_total_bytes_config_write() and mbm_local_bytes_config_write().
>>>
>>>>
>>>> Make the domain list protected by RCU. An architecture-specific
>>>> lock prevents concurrent writers. rdt_bit_usage_show() can
>>>> walk the domain list under rcu_read_lock().
>>>> The other filesystem list walkers need to be able to sleep.
>>>> Add cpus_read_lock() to rdtgroup_kn_lock_live() so that the
>>>> cpuhp callbacks can't be invoked when file system operations are
>>>> occurring.
>>>>
>>>> Add lockdep_assert_cpus_held() in the cases where the
>>>> rdtgroup_kn_lock_live() call isn't obvious.
>>>>
>>>> Resctrl's domain online/offline calls now need to take the
>>>> rdtgroup_mutex themselves.

>>>> diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
>>>> index 7896fcf11df6..dc1ba580c4db 100644
>>>> --- a/arch/x86/kernel/cpu/resctrl/core.c
>>>> +++ b/arch/x86/kernel/cpu/resctrl/core.c
>>>> @@ -25,8 +25,14 @@
>>>>  #include <asm/resctrl.h>
>>>>  #include "internal.h"
>>>>  
>>>> -/* Mutex to protect rdtgroup access. */
>>>> -DEFINE_MUTEX(rdtgroup_mutex);
>>>> +/*
>>>> + * rdt_domain structures are kfree()d when their last cpu goes offline,
>>>> + * and allocated when the first cpu in a new domain comes online.
>>>> + * The rdt_resource's domain list is updated when this happens. The domain
>>>> + * list is protected by RCU, but callers can also take the cpus_read_lock()
>>>> + * to prevent modification if they need to sleep. All writers take this mutex:
>>>
>>> Using "callers can" is not specific (compare to "callers should"). Please provide
>>> clear guidance on how the locks should be used. Reader may wonder "why take cpus_read_lock()
>>> to prevent modification, why not just take the mutex to prevent modification?"
>>
>> 'if they need to sleep' is the answer to this. I think a certain amount of background
>> knowledge can be assumed. My aim here wasn't to write an essay, but indicate not all
>> readers do the same thing. This is already the case in resctrl, and the MPAM pmu stuff
>> makes that worse.
>>
>> Is this more robust:
>> | * rdt_domain structures are kfree()d when their last cpu goes offline,
>> | * and allocated when the first cpu in a new domain comes online.
>> | * The rdt_resource's domain list is updated when this happens. Readers of
>> | * the domain list must either take cpus_read_lock(), or rely on an RCU
>> | * read-side critical section, to avoid observing concurrent modification.
>> | * For information about RCU, see Docuemtation/RCU/rcu.rst.
>> | * All writers take this mutex:
>>
>> ?
> 
> Yes, I do think this is more robust. Since you do mention, "'if they need to sleep'
> is the answer to this", how about "... must take cpus_read_lock() if they need to
> sleep, or otherwise rely on an RCU read-side critical section, ..."?

Yes, I've changed this to
| * The rdt_resource's domain list is updated when this happens. Readers of
| * the domain list must either take cpus_read_lock() if they need to sleep,
| * or rely on an RCU read-side critical section, to avoid observing concurrent
| * modification.


> I do not
> think it is necessary to provide a link to the documentation. If you do prefer
> to keep it, please note the typo.

I'll drop that then.

> Also, please cpu -> CPU. 

Fixed.


>>>> @@ -569,30 +579,27 @@ static void clear_closid_rmid(int cpu)
>>>>  static int resctrl_arch_online_cpu(unsigned int cpu)
>>>>  {
>>>>  	struct rdt_resource *r;
>>>> -	int err;
>>>>  
>>>> -	mutex_lock(&rdtgroup_mutex);
>>>> +	mutex_lock(&domain_list_lock);
>>>>  	for_each_capable_rdt_resource(r)
>>>>  		domain_add_cpu(cpu, r);
>>>>  	clear_closid_rmid(cpu);
>>>> +	mutex_unlock(&domain_list_lock);
>>
>>> Why is clear_closid_rmid(cpu) protected by mutex?
>>
>> It doesn't need to be, its just an artefact of changing the lock, then moving the
>> filesystem calls out. (its doesn't need to be protected by rdtgroup_mutex today).
>>
>> If you don't think its churn, I'll move it to make it clearer.

> I do not see a problem with keeping the lock/unlock as before but
> if you do find that you can make the locking clearer then
> please do.

Done,


Thanks,

James

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH v2 09/18] x86/resctrl: Allow resctrl_arch_rmid_read() to sleep
  2023-03-20 17:12                 ` James Morse
@ 2023-03-22 13:21                   ` Peter Newman
  0 siblings, 0 replies; 73+ messages in thread
From: Peter Newman @ 2023-03-22 13:21 UTC (permalink / raw)
  To: James Morse
  Cc: x86, linux-kernel, Fenghua Yu, Reinette Chatre, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, H Peter Anvin, Babu Moger,
	shameerali.kolothum.thodi, D Scott Phillips OS, carl, lcherian,
	bobo.shaobowang, tan.shaopeng, xingxin.hx, baolin.wang,
	Jamie Iles, Xin Hao, Stephane Eranian

Hi James,

On Mon, Mar 20, 2023 at 6:12 PM James Morse <james.morse@arm.com> wrote:
> On 10/03/2023 09:28, Peter Newman wrote:
> > In the interest of enabling MPAM functionality, I think the low-effort
> > approach is to only allocate an MBWU monitor to a newly-created MON or
> > CTRL_MON group if one is available. On Intel and AMD, the resources are
> > simply always available.
>
> I agree its low-effort, but I think the result is not worth having.
>
> What does user-space get when it reads 'mbm_total_bytes'? Returning an error here sucks.
> How is user-space supposed to identify the groups it wants to monitor, and those it
> doesn't care about?
>
> Taking "the only way to win is not to play" means the MPAM driver will only offer those
> 'mbm_total_bytes' files if they are going to work in the same way they do today. (as you
> said, on Intel and AMD the resources are simply always available).

I told you that only Intel so far has resources for all RMIDs. AMD
implementations allocate MBW monitors on demand, even reallocating ones
that are actively in use.

> I agree those files have always been able to return errors - but I've never managed to
> make the Intel system I have do it... so I bet user-space doesn't expect errors here.
> (let alone persistent errors)

Find some AMD hardware. It's very easy to get persistent errors due to
no counters being allocated for an RMID:

(this is an 'AMD Ryzen Threadripper PRO 3995WX 64-Cores')

# cd /sys/fs/resctrl/mon_groups
# mkdir test
# cat test/mon_data/*/mbm_total_bytes
Unavailable
Unavailable
Unavailable
Unavailable
Unavailable
Unavailable
Unavailable
Unavailable
Unavailable
Unavailable
Unavailable
Unavailable
Unavailable
Unavailable
Unavailable
Unavailable
# cat test/mon_data/*/mbm_total_bytes
Unavailable
Unavailable
Unavailable
Unavailable
Unavailable
Unavailable
Unavailable
Unavailable
Unavailable
Unavailable
Unavailable
Unavailable
Unavailable
Unavailable
Unavailable
Unavailable


> This patch to allow resctrl_arch_rmid_read() to sleep is about MPAM's CSU NRDY and the
> high likelyhood that folk build systems where MSCs are sliced up and private to something
> smaller than the resctrl:domain. Without the perf support, this would still be necessary.

I was worried about the blocking more when I thought you were doing it
for MBWU monitoring. Serializing access to limited CSU monitors makes
more sense.

> The changes needed for perf support are to make resctrl_arch_rmid_read() re-entrant, and
> for the domain list to be protected by RCU. Neither of these are as onerous as changes to
> the user-space interface, and the associated risk of breaking programs that work on other
> platforms.

I went ahead and tried to rebase my reliable-MBM-on-AMD changes onto
your series and they seemed to work with less difficulty than I was
expecting, so I'll try to stop worrying about the churn of this series
now.

-Peter

^ permalink raw reply	[flat|nested] 73+ messages in thread

end of thread, other threads:[~2023-03-22 13:22 UTC | newest]

Thread overview: 73+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-01-13 17:54 [PATCH v2 00/18] x86/resctrl: monitored closid+rmid together, separate arch/fs locking James Morse
2023-01-13 17:54 ` [PATCH v2 01/18] x86/resctrl: Track the closid with the rmid James Morse
2023-01-13 17:54 ` [PATCH v2 02/18] x86/resctrl: Access per-rmid structures by index James Morse
2023-01-17 18:28   ` Yu, Fenghua
2023-03-03 18:33     ` James Morse
2023-01-17 18:29   ` Yu, Fenghua
2023-02-02 23:44   ` Reinette Chatre
2023-03-03 18:33     ` James Morse
2023-01-13 17:54 ` [PATCH v2 03/18] x86/resctrl: Create helper for RMID allocation and mondata dir creation James Morse
2023-01-13 17:54 ` [PATCH v2 04/18] x86/resctrl: Move rmid allocation out of mkdir_rdt_prepare() James Morse
2023-01-17 18:28   ` Yu, Fenghua
2023-02-02 23:45   ` Reinette Chatre
2023-03-03 18:33     ` James Morse
2023-01-13 17:54 ` [PATCH v2 05/18] x86/resctrl: Allow RMID allocation to be scoped by CLOSID James Morse
2023-01-17 18:53   ` Yu, Fenghua
2023-03-03 18:34     ` James Morse
2023-02-02 23:45   ` Reinette Chatre
2023-03-03 18:34     ` James Morse
2023-03-10 19:57       ` Reinette Chatre
2023-01-13 17:54 ` [PATCH v2 06/18] x86/resctrl: Allow the allocator to check if a CLOSID can allocate clean RMID James Morse
2023-01-17 18:29   ` Yu, Fenghua
2023-03-03 18:35     ` James Morse
2023-02-02 23:46   ` Reinette Chatre
2023-03-03 18:36     ` James Morse
2023-03-10 19:59       ` Reinette Chatre
2023-03-20 17:12         ` James Morse
2023-01-13 17:54 ` [PATCH v2 07/18] x86/resctrl: Move CLOSID/RMID matching and setting to use helpers James Morse
2023-01-17 19:10   ` Yu, Fenghua
2023-03-03 18:37     ` James Morse
2023-02-02 23:47   ` Reinette Chatre
2023-03-06 11:32     ` James Morse
2023-03-08 10:30       ` Peter Newman
2023-03-10 20:00       ` Reinette Chatre
2023-01-13 17:54 ` [PATCH v2 08/18] x86/resctrl: Queue mon_event_read() instead of sending an IPI James Morse
2023-01-17 18:29   ` Yu, Fenghua
2023-03-06 11:32     ` James Morse
2023-03-10 20:00       ` Reinette Chatre
2023-02-02 23:47   ` Reinette Chatre
2023-03-06 11:33     ` James Morse
2023-03-08 16:09       ` James Morse
2023-03-10 20:06         ` Reinette Chatre
2023-03-20 17:12           ` James Morse
2023-01-13 17:54 ` [PATCH v2 09/18] x86/resctrl: Allow resctrl_arch_rmid_read() to sleep James Morse
2023-01-23 13:54   ` Peter Newman
2023-03-06 11:33     ` James Morse
2023-01-23 15:33   ` Peter Newman
2023-03-06 11:33     ` James Morse
2023-03-06 13:14       ` Peter Newman
2023-03-08 17:45         ` James Morse
2023-03-09 13:41           ` Peter Newman
2023-03-09 17:35             ` James Morse
2023-03-10  9:28               ` Peter Newman
2023-03-20 17:12                 ` James Morse
2023-03-22 13:21                   ` Peter Newman
2023-01-13 17:54 ` [PATCH v2 10/18] x86/resctrl: Allow arch to allocate memory needed in resctrl_arch_rmid_read() James Morse
2023-01-13 17:54 ` [PATCH v2 11/18] x86/resctrl: Make resctrl_mounted checks explicit James Morse
2023-01-13 17:54 ` [PATCH v2 12/18] x86/resctrl: Move alloc/mon static keys into helpers James Morse
2023-01-13 17:54 ` [PATCH v2 13/18] x86/resctrl: Make rdt_enable_key the arch's decision to switch James Morse
2023-02-02 23:48   ` Reinette Chatre
2023-01-13 17:54 ` [PATCH v2 14/18] x86/resctrl: Add helpers for system wide mon/alloc capable James Morse
2023-01-25  7:16   ` Shaopeng Tan (Fujitsu)
2023-03-06 11:34     ` James Morse
2023-01-13 17:54 ` [PATCH v2 15/18] x86/resctrl: Add cpu online callback for resctrl work James Morse
2023-01-13 17:54 ` [PATCH v2 16/18] x86/resctrl: Allow overflow/limbo handlers to be scheduled on any-but cpu James Morse
2023-02-02 23:49   ` Reinette Chatre
2023-03-06 11:34     ` James Morse
2023-01-13 17:54 ` [PATCH v2 17/18] x86/resctrl: Add cpu offline callback for resctrl work James Morse
2023-01-13 17:54 ` [PATCH v2 18/18] x86/resctrl: Separate arch and fs resctrl locks James Morse
2023-02-02 23:50   ` Reinette Chatre
2023-03-06 11:34     ` James Morse
2023-03-11  0:22       ` Reinette Chatre
2023-03-20 17:12         ` James Morse
2023-01-25  7:19 ` [PATCH v2 00/18] x86/resctrl: monitored closid+rmid together, separate arch/fs locking Shaopeng Tan (Fujitsu)

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.