linux-kernel.vger.kernel.org archive mirror
* [PATCH v1 0/6] x86/resctrl: Avoid searching tasklist during mongrp_reparent
@ 2024-03-25 17:27 Peter Newman
  2024-03-25 17:27 ` [PATCH v1 1/6] x86/resctrl: Move __resctrl_sched_in() out-of-line Peter Newman
                   ` (7 more replies)
  0 siblings, 8 replies; 27+ messages in thread
From: Peter Newman @ 2024-03-25 17:27 UTC (permalink / raw)
  To: Fenghua Yu, Reinette Chatre, James Morse
  Cc: Stephane Eranian, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, x86, H. Peter Anvin, Peter Zijlstra, Juri Lelli,
	Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Daniel Bristot de Oliveira, Valentin Schneider,
	Uros Bizjak, Mike Rapoport, Kirill A. Shutemov, Rick Edgecombe,
	Xin Li, Babu Moger, Shaopeng Tan, Maciej Wieczor-Retman,
	Jens Axboe, Christian Brauner, Oleg Nesterov, Andrew Morton,
	Tycho Andersen, Nicholas Piggin, Beau Belgrave,
	Matthew Wilcox (Oracle),
	linux-kernel, Peter Newman

Hi Reinette,

I've been working with users of the recently-added mongroup rename
operation[1] who have observed the impact of back-to-back operations on
latency-sensitive, thread pool-based services. Because changing a
resctrl group's CLOSID (or RMID) requires all task_structs in the system
to be inspected with the tasklist_lock read-locked, a series of move
operations can block thread creation for long periods, as creating
threads needs to write-lock the tasklist_lock.

To avoid this issue, this series replaces the CLOSID and RMID values
cached in the task_struct with a pointer to the task's rdtgroup, through
which the current CLOSID and RMID can be obtained indirectly during a
context switch. Updating a group's ID then only requires the current
task to be switched back in on all CPUs. On server hosts with very large
thread counts, this is much less disruptive than making thread creation
globally unavailable. However, this is less desirable on CPU-isolated,
realtime workloads, so I am interested in suggestions on how to reach a
compromise for the two use cases.

Also note that the soft-ABMC[2] proposal is based on swapping the RMID
values in mongroups when monitors are assigned to them, which will
result in the same slowdown as was encountered with re-parenting
monitoring groups.

Pointers to rdtgroups in the task_struct have been used internally at
Google for a few years to support an alternate CLOSID management
technique[3], which was replaced by mongroup rename. Production use of
this technique revealed an issue: threads in an exiting process become
unreachable via for_each_process_thread() before destruction, but can
still be switched in and out. Consequently, an rmdir operation can fail
to remove all references to an rdtgroup before freeing it, and a later
context switch can dereference the stale pointer, writing invalid
CLOSID/RMID values into MSRs and causing an exception. The workaround
is to clear a task's rdtgroup pointer when it exits.

As a benefit, pointers to rdtgroups are architecture-independent,
resulting in __resctrl_sched_in() and more of the task assignment code
becoming portable, simplifying the effort of supporting MPAM. Using a
single pointer allows the task's current monitor and control groups to
be determined atomically.

Similar changes have been used internally on a kernel supporting MPAM.
Upon request, I can provide the required changes to the MPAM-resctrl
interface based on James Morse's latest MPAM snapshot[4] for reference.

Thanks!
-Peter

[1] https://lore.kernel.org/r/20230419125015.693566-3-peternewman@google.com
[2] https://lore.kernel.org/lkml/CALPaoChhKJiMAueFtgCTc7ffO++S5DJCySmxqf9ZDmhR9RQapw@mail.gmail.com
[3] https://lore.kernel.org/lkml/CALPaoCj-zav4x6H3ffXo_O+RFan8Qb-uLy-DdtkaQTfuxY4a0w@mail.gmail.com
[4] https://git.kernel.org/pub/scm/linux/kernel/git/morse/linux.git/log/?h=mpam/snapshot/v6.7-rc2

Peter Newman (6):
  x86/resctrl: Move __resctrl_sched_in() out-of-line
  x86/resctrl: Add hook for releasing task_struct references
  x86/resctrl: Disallow mongroup rename on MPAM
  x86/resctrl: Use rdtgroup pointer to indicate task membership
  x86/resctrl: Abstract PQR_ASSOC from generic code
  x86/resctrl: Don't search tasklist in mongroup rename

 arch/x86/include/asm/resctrl.h            |  93 --------
 arch/x86/kernel/cpu/resctrl/core.c        |  14 +-
 arch/x86/kernel/cpu/resctrl/internal.h    |  17 ++
 arch/x86/kernel/cpu/resctrl/pseudo_lock.c |   4 +-
 arch/x86/kernel/cpu/resctrl/rdtgroup.c    | 279 +++++++++++++---------
 arch/x86/kernel/process_32.c              |   2 +-
 arch/x86/kernel/process_64.c              |   2 +-
 include/linux/resctrl.h                   |  38 +++
 include/linux/sched.h                     |   3 +-
 kernel/exit.c                             |   3 +
 10 files changed, 239 insertions(+), 216 deletions(-)


base-commit: 4cece764965020c22cff7665b18a012006359095
-- 
2.44.0.396.g6e790dbe36-goog


^ permalink raw reply	[flat|nested] 27+ messages in thread

* [PATCH v1 1/6] x86/resctrl: Move __resctrl_sched_in() out-of-line
  2024-03-25 17:27 [PATCH v1 0/6] x86/resctrl: Avoid searching tasklist during mongrp_reparent Peter Newman
@ 2024-03-25 17:27 ` Peter Newman
  2024-04-04 23:09   ` Reinette Chatre
  2024-03-25 17:27 ` [PATCH v1 2/6] x86/resctrl: Add hook for releasing task_struct references Peter Newman
                   ` (6 subsequent siblings)
  7 siblings, 1 reply; 27+ messages in thread
From: Peter Newman @ 2024-03-25 17:27 UTC (permalink / raw)
  To: Fenghua Yu, Reinette Chatre, James Morse
  Cc: Stephane Eranian, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, x86, H. Peter Anvin, Peter Zijlstra, Juri Lelli,
	Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Daniel Bristot de Oliveira, Valentin Schneider,
	Uros Bizjak, Mike Rapoport, Kirill A. Shutemov, Rick Edgecombe,
	Xin Li, Babu Moger, Shaopeng Tan, Maciej Wieczor-Retman,
	Jens Axboe, Christian Brauner, Oleg Nesterov, Andrew Morton,
	Tycho Andersen, Nicholas Piggin, Beau Belgrave,
	Matthew Wilcox (Oracle),
	linux-kernel, Peter Newman

__resctrl_sched_in() is unable to dereference a struct rdtgroup pointer
when defined inline because rdtgroup is a private structure defined in
internal.h.

This function is defined inline to avoid impacting context-switch
performance for the majority of users, who aren't using resctrl at all.
That benefit can be preserved without giving the inline code access to
internal resctrl data structures.

The logic of performing an out-of-line call to __resctrl_sched_in() only
when resctrl is mounted is architecture-independent, so the inline
definition of resctrl_sched_in() can be moved into linux/resctrl.h.

Signed-off-by: Peter Newman <peternewman@google.com>
---
 arch/x86/include/asm/resctrl.h         | 75 --------------------------
 arch/x86/kernel/cpu/resctrl/internal.h | 24 +++++++++
 arch/x86/kernel/cpu/resctrl/rdtgroup.c | 41 ++++++++++++++
 arch/x86/kernel/process_32.c           |  2 +-
 arch/x86/kernel/process_64.c           |  2 +-
 include/linux/resctrl.h                | 21 ++++++++
 6 files changed, 88 insertions(+), 77 deletions(-)

diff --git a/arch/x86/include/asm/resctrl.h b/arch/x86/include/asm/resctrl.h
index 12dbd2588ca7..99ba8c0dc155 100644
--- a/arch/x86/include/asm/resctrl.h
+++ b/arch/x86/include/asm/resctrl.h
@@ -14,30 +14,6 @@
  */
 #define X86_RESCTRL_EMPTY_CLOSID         ((u32)~0)
 
-/**
- * struct resctrl_pqr_state - State cache for the PQR MSR
- * @cur_rmid:		The cached Resource Monitoring ID
- * @cur_closid:	The cached Class Of Service ID
- * @default_rmid:	The user assigned Resource Monitoring ID
- * @default_closid:	The user assigned cached Class Of Service ID
- *
- * The upper 32 bits of MSR_IA32_PQR_ASSOC contain closid and the
- * lower 10 bits rmid. The update to MSR_IA32_PQR_ASSOC always
- * contains both parts, so we need to cache them. This also
- * stores the user configured per cpu CLOSID and RMID.
- *
- * The cache also helps to avoid pointless updates if the value does
- * not change.
- */
-struct resctrl_pqr_state {
-	u32			cur_rmid;
-	u32			cur_closid;
-	u32			default_rmid;
-	u32			default_closid;
-};
-
-DECLARE_PER_CPU(struct resctrl_pqr_state, pqr_state);
-
 extern bool rdt_alloc_capable;
 extern bool rdt_mon_capable;
 
@@ -79,50 +55,6 @@ static inline void resctrl_arch_disable_mon(void)
 	static_branch_dec_cpuslocked(&rdt_enable_key);
 }
 
-/*
- * __resctrl_sched_in() - Writes the task's CLOSid/RMID to IA32_PQR_MSR
- *
- * Following considerations are made so that this has minimal impact
- * on scheduler hot path:
- * - This will stay as no-op unless we are running on an Intel SKU
- *   which supports resource control or monitoring and we enable by
- *   mounting the resctrl file system.
- * - Caches the per cpu CLOSid/RMID values and does the MSR write only
- *   when a task with a different CLOSid/RMID is scheduled in.
- * - We allocate RMIDs/CLOSids globally in order to keep this as
- *   simple as possible.
- * Must be called with preemption disabled.
- */
-static inline void __resctrl_sched_in(struct task_struct *tsk)
-{
-	struct resctrl_pqr_state *state = this_cpu_ptr(&pqr_state);
-	u32 closid = state->default_closid;
-	u32 rmid = state->default_rmid;
-	u32 tmp;
-
-	/*
-	 * If this task has a closid/rmid assigned, use it.
-	 * Else use the closid/rmid assigned to this cpu.
-	 */
-	if (static_branch_likely(&rdt_alloc_enable_key)) {
-		tmp = READ_ONCE(tsk->closid);
-		if (tmp)
-			closid = tmp;
-	}
-
-	if (static_branch_likely(&rdt_mon_enable_key)) {
-		tmp = READ_ONCE(tsk->rmid);
-		if (tmp)
-			rmid = tmp;
-	}
-
-	if (closid != state->cur_closid || rmid != state->cur_rmid) {
-		state->cur_closid = closid;
-		state->cur_rmid = rmid;
-		wrmsr(MSR_IA32_PQR_ASSOC, rmid, closid);
-	}
-}
-
 static inline unsigned int resctrl_arch_round_mon_val(unsigned int val)
 {
 	unsigned int scale = boot_cpu_data.x86_cache_occ_scale;
@@ -150,12 +82,6 @@ static inline bool resctrl_arch_match_rmid(struct task_struct *tsk, u32 ignored,
 	return READ_ONCE(tsk->rmid) == rmid;
 }
 
-static inline void resctrl_sched_in(struct task_struct *tsk)
-{
-	if (static_branch_likely(&rdt_enable_key))
-		__resctrl_sched_in(tsk);
-}
-
 static inline u32 resctrl_arch_system_num_rmid_idx(void)
 {
 	/* RMID are independent numbers for x86. num_rmid_idx == num_rmid */
@@ -188,7 +114,6 @@ void resctrl_cpu_detect(struct cpuinfo_x86 *c);
 
 #else
 
-static inline void resctrl_sched_in(struct task_struct *tsk) {}
 static inline void resctrl_cpu_detect(struct cpuinfo_x86 *c) {}
 
 #endif /* CONFIG_X86_CPU_RESCTRL */
diff --git a/arch/x86/kernel/cpu/resctrl/internal.h b/arch/x86/kernel/cpu/resctrl/internal.h
index c99f26ebe7a6..56a68e542572 100644
--- a/arch/x86/kernel/cpu/resctrl/internal.h
+++ b/arch/x86/kernel/cpu/resctrl/internal.h
@@ -331,6 +331,30 @@ struct rftype {
 			 char *buf, size_t nbytes, loff_t off);
 };
 
+/**
+ * struct resctrl_pqr_state - State cache for the PQR MSR
+ * @cur_rmid:		The cached Resource Monitoring ID
+ * @cur_closid:	The cached Class Of Service ID
+ * @default_rmid:	The user assigned Resource Monitoring ID
+ * @default_closid:	The user assigned cached Class Of Service ID
+ *
+ * The upper 32 bits of MSR_IA32_PQR_ASSOC contain closid and the
+ * lower 10 bits rmid. The update to MSR_IA32_PQR_ASSOC always
+ * contains both parts, so we need to cache them. This also
+ * stores the user configured per cpu CLOSID and RMID.
+ *
+ * The cache also helps to avoid pointless updates if the value does
+ * not change.
+ */
+struct resctrl_pqr_state {
+	u32			cur_rmid;
+	u32			cur_closid;
+	u32			default_rmid;
+	u32			default_closid;
+};
+
+DECLARE_PER_CPU(struct resctrl_pqr_state, pqr_state);
+
 /**
  * struct mbm_state - status for each MBM counter in each domain
  * @prev_bw_bytes: Previous bytes value read for bandwidth calculation
diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
index 011e17efb1a6..5d599d99f94b 100644
--- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
+++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
@@ -334,6 +334,47 @@ static int rdtgroup_cpus_show(struct kernfs_open_file *of,
 	return ret;
 }
 
+/*
+ * __resctrl_sched_in() - Writes the task's control and monitor IDs into the CPU
+ *
+ * Following considerations are made so that this has minimal impact
+ * on scheduler hot path:
+ * - Caches the per cpu CLOSid/RMID values and does the MSR write only
+ *   when a task with a different CLOSid/RMID is scheduled in.
+ * - We allocate RMIDs/CLOSids globally in order to keep this as
+ *   simple as possible.
+ * Must be called with preemption disabled.
+ */
+void __resctrl_sched_in(struct task_struct *tsk)
+{
+	struct resctrl_pqr_state *state = this_cpu_ptr(&pqr_state);
+	u32 closid = state->default_closid;
+	u32 rmid = state->default_rmid;
+	u32 tmp;
+
+	/*
+	 * If this task has a closid/rmid assigned, use it.
+	 * Else use the closid/rmid assigned to this cpu.
+	 */
+	if (static_branch_likely(&rdt_alloc_enable_key)) {
+		tmp = READ_ONCE(tsk->closid);
+		if (tmp)
+			closid = tmp;
+	}
+
+	if (static_branch_likely(&rdt_mon_enable_key)) {
+		tmp = READ_ONCE(tsk->rmid);
+		if (tmp)
+			rmid = tmp;
+	}
+
+	if (closid != state->cur_closid || rmid != state->cur_rmid) {
+		state->cur_closid = closid;
+		state->cur_rmid = rmid;
+		wrmsr(MSR_IA32_PQR_ASSOC, rmid, closid);
+	}
+}
+
 /*
  * This is safe against resctrl_sched_in() called from __switch_to()
  * because __switch_to() is executed with interrupts disabled. A local call
diff --git a/arch/x86/kernel/process_32.c b/arch/x86/kernel/process_32.c
index 0917c7f25720..8f92a87d381d 100644
--- a/arch/x86/kernel/process_32.c
+++ b/arch/x86/kernel/process_32.c
@@ -38,6 +38,7 @@
 #include <linux/io.h>
 #include <linux/kdebug.h>
 #include <linux/syscalls.h>
+#include <linux/resctrl.h>
 
 #include <asm/ldt.h>
 #include <asm/processor.h>
@@ -51,7 +52,6 @@
 #include <asm/debugreg.h>
 #include <asm/switch_to.h>
 #include <asm/vm86.h>
-#include <asm/resctrl.h>
 #include <asm/proto.h>
 
 #include "process.h"
diff --git a/arch/x86/kernel/process_64.c b/arch/x86/kernel/process_64.c
index 7062b84dd467..d442269bb25b 100644
--- a/arch/x86/kernel/process_64.c
+++ b/arch/x86/kernel/process_64.c
@@ -40,6 +40,7 @@
 #include <linux/ftrace.h>
 #include <linux/syscalls.h>
 #include <linux/iommu.h>
+#include <linux/resctrl.h>
 
 #include <asm/processor.h>
 #include <asm/pkru.h>
@@ -53,7 +54,6 @@
 #include <asm/switch_to.h>
 #include <asm/xen/hypervisor.h>
 #include <asm/vdso.h>
-#include <asm/resctrl.h>
 #include <asm/unistd.h>
 #include <asm/fsgsbase.h>
 #include <asm/fred.h>
diff --git a/include/linux/resctrl.h b/include/linux/resctrl.h
index a365f67131ec..62d607939a73 100644
--- a/include/linux/resctrl.h
+++ b/include/linux/resctrl.h
@@ -304,4 +304,25 @@ void resctrl_arch_reset_rmid_all(struct rdt_resource *r, struct rdt_domain *d);
 extern unsigned int resctrl_rmid_realloc_threshold;
 extern unsigned int resctrl_rmid_realloc_limit;
 
+DECLARE_STATIC_KEY_FALSE(rdt_enable_key);
+
+void __resctrl_sched_in(struct task_struct *tsk);
+
+/*
+ * resctrl_sched_in() - Assigns the incoming task's control/monitor IDs to the
+ *			current CPU
+ *
+ * To minimize impact to the scheduler hot path, this will stay as no-op unless
+ * running on a system supporting resctrl and the filesystem is mounted.
+ *
+ * Must be called with preemption disabled.
+ */
+static inline void resctrl_sched_in(struct task_struct *tsk)
+{
+#ifdef CONFIG_X86_CPU_RESCTRL
+	if (static_branch_likely(&rdt_enable_key))
+		__resctrl_sched_in(tsk);
+#endif
+}
+
 #endif /* _RESCTRL_H */
-- 
2.44.0.396.g6e790dbe36-goog



* [PATCH v1 2/6] x86/resctrl: Add hook for releasing task_struct references
  2024-03-25 17:27 [PATCH v1 0/6] x86/resctrl: Avoid searching tasklist during mongrp_reparent Peter Newman
  2024-03-25 17:27 ` [PATCH v1 1/6] x86/resctrl: Move __resctrl_sched_in() out-of-line Peter Newman
@ 2024-03-25 17:27 ` Peter Newman
  2024-04-04 23:10   ` Reinette Chatre
  2024-03-25 17:27 ` [PATCH v1 3/6] x86/resctrl: Disallow mongroup rename on MPAM Peter Newman
                   ` (5 subsequent siblings)
  7 siblings, 1 reply; 27+ messages in thread
From: Peter Newman @ 2024-03-25 17:27 UTC (permalink / raw)
  To: Fenghua Yu, Reinette Chatre, James Morse
  Cc: Stephane Eranian, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, x86, H. Peter Anvin, Peter Zijlstra, Juri Lelli,
	Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Daniel Bristot de Oliveira, Valentin Schneider,
	Uros Bizjak, Mike Rapoport, Kirill A. Shutemov, Rick Edgecombe,
	Xin Li, Babu Moger, Shaopeng Tan, Maciej Wieczor-Retman,
	Jens Axboe, Christian Brauner, Oleg Nesterov, Andrew Morton,
	Tycho Andersen, Nicholas Piggin, Beau Belgrave,
	Matthew Wilcox (Oracle),
	linux-kernel, Peter Newman

In order for the task_struct to hold references to rdtgroups, it must be
possible to release these references before a concurrent deletion causes
them to be freed.

It is not possible for resctrl code to do this with
for_each_process_thread() because the task can still switch in after it
has been removed from the tasklist, at which point the task_struct could
be referring to freed memory.

Signed-off-by: Peter Newman <peternewman@google.com>
---
 arch/x86/kernel/cpu/resctrl/rdtgroup.c | 10 ++++++++++
 include/linux/resctrl.h                |  6 ++++++
 kernel/exit.c                          |  3 +++
 3 files changed, 19 insertions(+)

diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
index 5d599d99f94b..9b1969e4235a 100644
--- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
+++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
@@ -2931,6 +2931,16 @@ static void rdt_move_group_tasks(struct rdtgroup *from, struct rdtgroup *to,
 	read_unlock(&tasklist_lock);
 }
 
+/**
+ * exit_resctrl() - called at thread destruction to release resources
+ *
+ * This hook is called just before the task is removed from the global tasklist
+ * and still reachable via for_each_process_thread().
+ */
+void exit_resctrl(struct task_struct *tsk)
+{
+}
+
 static void free_all_child_rdtgrp(struct rdtgroup *rdtgrp)
 {
 	struct rdtgroup *sentry, *stmp;
diff --git a/include/linux/resctrl.h b/include/linux/resctrl.h
index 62d607939a73..b2af1fbc7aa1 100644
--- a/include/linux/resctrl.h
+++ b/include/linux/resctrl.h
@@ -325,4 +325,10 @@ static inline void resctrl_sched_in(struct task_struct *tsk)
 #endif
 }
 
+#ifdef CONFIG_X86_CPU_RESCTRL
+void exit_resctrl(struct task_struct *tsk);
+#else
+static inline void exit_resctrl(struct task_struct *tsk) {}
+#endif
+
 #endif /* _RESCTRL_H */
diff --git a/kernel/exit.c b/kernel/exit.c
index 41a12630cbbc..ccdc90ff6d71 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -70,6 +70,7 @@
 #include <linux/sysfs.h>
 #include <linux/user_events.h>
 #include <linux/uaccess.h>
+#include <linux/resctrl.h>
 
 #include <uapi/linux/wait.h>
 
@@ -862,6 +863,8 @@ void __noreturn do_exit(long code)
 	tsk->exit_code = code;
 	taskstats_exit(tsk, group_dead);
 
+	exit_resctrl(tsk);
+
 	exit_mm();
 
 	if (group_dead)
-- 
2.44.0.396.g6e790dbe36-goog



* [PATCH v1 3/6] x86/resctrl: Disallow mongroup rename on MPAM
  2024-03-25 17:27 [PATCH v1 0/6] x86/resctrl: Avoid searching tasklist during mongrp_reparent Peter Newman
  2024-03-25 17:27 ` [PATCH v1 1/6] x86/resctrl: Move __resctrl_sched_in() out-of-line Peter Newman
  2024-03-25 17:27 ` [PATCH v1 2/6] x86/resctrl: Add hook for releasing task_struct references Peter Newman
@ 2024-03-25 17:27 ` Peter Newman
  2024-04-04 23:11   ` Reinette Chatre
  2024-03-25 17:27 ` [PATCH v1 4/6] x86/resctrl: Use rdtgroup pointer to indicate task membership Peter Newman
                   ` (4 subsequent siblings)
  7 siblings, 1 reply; 27+ messages in thread
From: Peter Newman @ 2024-03-25 17:27 UTC (permalink / raw)
  To: Fenghua Yu, Reinette Chatre, James Morse
  Cc: Stephane Eranian, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, x86, H. Peter Anvin, Peter Zijlstra, Juri Lelli,
	Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Daniel Bristot de Oliveira, Valentin Schneider,
	Uros Bizjak, Mike Rapoport, Kirill A. Shutemov, Rick Edgecombe,
	Xin Li, Babu Moger, Shaopeng Tan, Maciej Wieczor-Retman,
	Jens Axboe, Christian Brauner, Oleg Nesterov, Andrew Morton,
	Tycho Andersen, Nicholas Piggin, Beau Belgrave,
	Matthew Wilcox (Oracle),
	linux-kernel, Peter Newman

Moving a monitoring group to a different parent control group assumes
that the monitors will not be impacted. This is not the case on MPAM,
where the PMG is an extension of the PARTID.

Detect this situation by requiring that the change in CLOSID not affect
the result of resctrl_arch_rmid_idx_encode(); otherwise return
-EOPNOTSUPP.

Signed-off-by: Peter Newman <peternewman@google.com>
---
 arch/x86/kernel/cpu/resctrl/rdtgroup.c | 13 +++++++++++++
 1 file changed, 13 insertions(+)

diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
index 9b1969e4235a..8d6979dbfd02 100644
--- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
+++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
@@ -3879,6 +3879,19 @@ static int rdtgroup_rename(struct kernfs_node *kn,
 		goto out;
 	}
 
+	/*
+	 * If changing the CLOSID impacts the RMID, this operation is not
+	 * supported.
+	 */
+	if (resctrl_arch_rmid_idx_encode(rdtgrp->mon.parent->closid,
+					 rdtgrp->mon.rmid) !=
+	    resctrl_arch_rmid_idx_encode(new_prdtgrp->closid,
+					 rdtgrp->mon.rmid)) {
+		rdt_last_cmd_puts("changing parent control group not supported\n");
+		ret = -EOPNOTSUPP;
+		goto out;
+	}
+
 	/*
 	 * If the MON group is monitoring CPUs, the CPUs must be assigned to the
 	 * current parent CTRL_MON group and therefore cannot be assigned to
-- 
2.44.0.396.g6e790dbe36-goog


^ permalink raw reply related	[flat|nested] 27+ messages in thread

* [PATCH v1 4/6] x86/resctrl: Use rdtgroup pointer to indicate task membership
  2024-03-25 17:27 [PATCH v1 0/6] x86/resctrl: Avoid searching tasklist during mongrp_reparent Peter Newman
                   ` (2 preceding siblings ...)
  2024-03-25 17:27 ` [PATCH v1 3/6] x86/resctrl: Disallow mongroup rename on MPAM Peter Newman
@ 2024-03-25 17:27 ` Peter Newman
  2024-04-04 23:15   ` Reinette Chatre
  2024-03-25 17:27 ` [PATCH v1 5/6] x86/resctrl: Abstract PQR_ASSOC from generic code Peter Newman
                   ` (3 subsequent siblings)
  7 siblings, 1 reply; 27+ messages in thread
From: Peter Newman @ 2024-03-25 17:27 UTC (permalink / raw)
  To: Fenghua Yu, Reinette Chatre, James Morse
  Cc: Stephane Eranian, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, x86, H. Peter Anvin, Peter Zijlstra, Juri Lelli,
	Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Daniel Bristot de Oliveira, Valentin Schneider,
	Uros Bizjak, Mike Rapoport, Kirill A. Shutemov, Rick Edgecombe,
	Xin Li, Babu Moger, Shaopeng Tan, Maciej Wieczor-Retman,
	Jens Axboe, Christian Brauner, Oleg Nesterov, Andrew Morton,
	Tycho Andersen, Nicholas Piggin, Beau Belgrave,
	Matthew Wilcox (Oracle),
	linux-kernel, Peter Newman

Caching the CLOSID and RMID values in all member tasks makes changing
either ID for a group expensive, as all task_structs must be inspected
while read-locking the tasklist_lock.

A single rdtgroup reference from the task_struct can indicate the
mongroup and ctrl group membership of a task. In the case of mongroups,
the parent pointer can be used to determine the CLOSID indirectly,
avoiding the need for invalidating a cached CLOSID in all task_structs.

This also solves the problem of tearing PARTID/PMG values in MPAM, as
the parent pointer of a mongroup does not change. Therefore an atomic
read of the rdt_group pointer provides a consistent view of current
mongroup and control group membership, making __resctrl_sched_in()
portable.

Care must be taken to ensure that __resctrl_sched_in() does not
dereference a pointer to a freed rdtgroup struct. Tasks may no longer be
reachable via for_each_process_thread() but can still be switched in, so
update the rdt_group pointer before the thread is removed from the
tasklist.

Co-developed-by: Stephane Eranian <eranian@google.com>
Signed-off-by: Stephane Eranian <eranian@google.com>
Signed-off-by: Peter Newman <peternewman@google.com>
---
 arch/x86/include/asm/resctrl.h         |  18 ---
 arch/x86/kernel/cpu/resctrl/core.c     |   3 +-
 arch/x86/kernel/cpu/resctrl/internal.h |  13 +-
 arch/x86/kernel/cpu/resctrl/rdtgroup.c | 205 +++++++++++++------------
 include/linux/sched.h                  |   3 +-
 5 files changed, 110 insertions(+), 132 deletions(-)

diff --git a/arch/x86/include/asm/resctrl.h b/arch/x86/include/asm/resctrl.h
index 99ba8c0dc155..be4afbc6180f 100644
--- a/arch/x86/include/asm/resctrl.h
+++ b/arch/x86/include/asm/resctrl.h
@@ -64,24 +64,6 @@ static inline unsigned int resctrl_arch_round_mon_val(unsigned int val)
 	return val * scale;
 }
 
-static inline void resctrl_arch_set_closid_rmid(struct task_struct *tsk,
-						u32 closid, u32 rmid)
-{
-	WRITE_ONCE(tsk->closid, closid);
-	WRITE_ONCE(tsk->rmid, rmid);
-}
-
-static inline bool resctrl_arch_match_closid(struct task_struct *tsk, u32 closid)
-{
-	return READ_ONCE(tsk->closid) == closid;
-}
-
-static inline bool resctrl_arch_match_rmid(struct task_struct *tsk, u32 ignored,
-					   u32 rmid)
-{
-	return READ_ONCE(tsk->rmid) == rmid;
-}
-
 static inline u32 resctrl_arch_system_num_rmid_idx(void)
 {
 	/* RMID are independent numbers for x86. num_rmid_idx == num_rmid */
diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
index 83e40341583e..ae5878d748fc 100644
--- a/arch/x86/kernel/cpu/resctrl/core.c
+++ b/arch/x86/kernel/cpu/resctrl/core.c
@@ -600,8 +600,7 @@ static void clear_closid_rmid(int cpu)
 {
 	struct resctrl_pqr_state *state = this_cpu_ptr(&pqr_state);
 
-	state->default_closid = RESCTRL_RESERVED_CLOSID;
-	state->default_rmid = RESCTRL_RESERVED_RMID;
+	state->default_group = &rdtgroup_default;
 	state->cur_closid = RESCTRL_RESERVED_CLOSID;
 	state->cur_rmid = RESCTRL_RESERVED_RMID;
 	wrmsr(MSR_IA32_PQR_ASSOC, RESCTRL_RESERVED_RMID,
diff --git a/arch/x86/kernel/cpu/resctrl/internal.h b/arch/x86/kernel/cpu/resctrl/internal.h
index 56a68e542572..0ba0d2428780 100644
--- a/arch/x86/kernel/cpu/resctrl/internal.h
+++ b/arch/x86/kernel/cpu/resctrl/internal.h
@@ -334,14 +334,8 @@ struct rftype {
 /**
  * struct resctrl_pqr_state - State cache for the PQR MSR
  * @cur_rmid:		The cached Resource Monitoring ID
- * @cur_closid:	The cached Class Of Service ID
- * @default_rmid:	The user assigned Resource Monitoring ID
- * @default_closid:	The user assigned cached Class Of Service ID
- *
- * The upper 32 bits of MSR_IA32_PQR_ASSOC contain closid and the
- * lower 10 bits rmid. The update to MSR_IA32_PQR_ASSOC always
- * contains both parts, so we need to cache them. This also
- * stores the user configured per cpu CLOSID and RMID.
+ * @cur_closid:		The cached Class Of Service ID
+ * @default_group:	The user assigned rdtgroup
  *
  * The cache also helps to avoid pointless updates if the value does
  * not change.
@@ -349,8 +343,7 @@ struct rftype {
 struct resctrl_pqr_state {
 	u32			cur_rmid;
 	u32			cur_closid;
-	u32			default_rmid;
-	u32			default_closid;
+	struct rdtgroup		*default_group;
 };
 
 DECLARE_PER_CPU(struct resctrl_pqr_state, pqr_state);
diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
index 8d6979dbfd02..badf181c8cbb 100644
--- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
+++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
@@ -348,25 +348,55 @@ static int rdtgroup_cpus_show(struct kernfs_open_file *of,
 void __resctrl_sched_in(struct task_struct *tsk)
 {
 	struct resctrl_pqr_state *state = this_cpu_ptr(&pqr_state);
-	u32 closid = state->default_closid;
-	u32 rmid = state->default_rmid;
-	u32 tmp;
+	u32 closid = state->cur_closid;
+	u32 rmid = state->cur_rmid;
+	struct rdtgroup *rgrp;
 
 	/*
-	 * If this task has a closid/rmid assigned, use it.
-	 * Else use the closid/rmid assigned to this cpu.
+	 * A task's group assignment can change concurrently, but the CLOSID or
+	 * RMID assigned to a group cannot change.
 	 */
+	rgrp = READ_ONCE(tsk->rdt_group);
+	if (!rgrp || rgrp == &rdtgroup_default)
+		/*
+		 * If this task is a member of a control or monitoring group,
+		 * use the IDs assigned to these groups. Else use the
+		 * closid/rmid assigned to this cpu.
+		 */
+		rgrp = state->default_group;
+
+	/*
+	 * Context switches are possible before the cpuonline handler
+	 * initializes default_group.
+	 */
+	if (!rgrp)
+		rgrp = &rdtgroup_default;
+
 	if (static_branch_likely(&rdt_alloc_enable_key)) {
-		tmp = READ_ONCE(tsk->closid);
-		if (tmp)
-			closid = tmp;
+		/*
+		 * If the task is assigned to a monitoring group, the CLOSID is
+		 * determined by the parent control group.
+		 */
+		if (rgrp->type == RDTMON_GROUP) {
+			if (!WARN_ON(!rgrp->mon.parent))
+				/*
+				 * The parent rdtgroup cannot be freed until
+				 * after the mon group is freed. In the event
+				 * that the parent rdtgroup is removed (by
+				 * rdtgroup_rmdir_ctrl()), rdt_mon_group would
+				 * be redirected to rdtgroup_default, followed
+				 * by a full barrier and synchronous IPI
+				 * broadcast before proceeding to free the
+				 * group.
+				 */
+				closid = rgrp->mon.parent->closid;
+		} else {
+			closid = rgrp->closid;
+		}
 	}
 
-	if (static_branch_likely(&rdt_mon_enable_key)) {
-		tmp = READ_ONCE(tsk->rmid);
-		if (tmp)
-			rmid = tmp;
-	}
+	if (static_branch_likely(&rdt_mon_enable_key))
+		rmid = rgrp->mon.rmid;
 
 	if (closid != state->cur_closid || rmid != state->cur_rmid) {
 		state->cur_closid = closid;
@@ -385,10 +415,8 @@ static void update_cpu_closid_rmid(void *info)
 {
 	struct rdtgroup *r = info;
 
-	if (r) {
-		this_cpu_write(pqr_state.default_closid, r->closid);
-		this_cpu_write(pqr_state.default_rmid, r->mon.rmid);
-	}
+	if (r)
+		this_cpu_write(pqr_state.default_group, r);
 
 	/*
 	 * We cannot unconditionally write the MSR because the current
@@ -624,49 +652,61 @@ static void update_task_closid_rmid(struct task_struct *t)
 
 static bool task_in_rdtgroup(struct task_struct *tsk, struct rdtgroup *rdtgrp)
 {
-	u32 closid, rmid = rdtgrp->mon.rmid;
+	struct rdtgroup *task_group = READ_ONCE(tsk->rdt_group);
 
-	if (rdtgrp->type == RDTCTRL_GROUP)
-		closid = rdtgrp->closid;
-	else if (rdtgrp->type == RDTMON_GROUP)
-		closid = rdtgrp->mon.parent->closid;
-	else
-		return false;
+	lockdep_assert_held(&rdtgroup_mutex);
+
+	/* Uninitialized rdt_group pointer implies rdtgroup_default. */
+	if (!task_group)
+		task_group = &rdtgroup_default;
+
+	if (rdtgrp == task_group)
+		return true;
+
+	/* Tasks in child mongroups are members of the parent ctrlmon group. */
+	if (task_group->type == RDTMON_GROUP &&
+	    task_group->mon.parent == rdtgrp)
+		return true;
 
-	return resctrl_arch_match_closid(tsk, closid) &&
-	       resctrl_arch_match_rmid(tsk, closid, rmid);
+	return false;
 }
 
 static int __rdtgroup_move_task(struct task_struct *tsk,
 				struct rdtgroup *rdtgrp)
 {
+	struct rdtgroup *task_group = READ_ONCE(tsk->rdt_group);
+
 	/* If the task is already in rdtgrp, no need to move the task. */
 	if (task_in_rdtgroup(tsk, rdtgrp))
 		return 0;
 
 	/*
-	 * Set the task's closid/rmid before the PQR_ASSOC MSR can be
-	 * updated by them.
+	 * NULL is used in the task_struct so it can be overridden by a CPU's
+	 * default_group
+	 */
+	if (!task_group)
+		task_group = &rdtgroup_default;
+
+	/*
+	 * Set the task's group before the CPU can be updated by them.
 	 *
 	 * For ctrl_mon groups, move both closid and rmid.
 	 * For monitor groups, can move the tasks only from
-	 * their parent CTRL group.
+	 * their parent CTRL group or another mon group under the same parent.
 	 */
-	if (rdtgrp->type == RDTMON_GROUP &&
-	    !resctrl_arch_match_closid(tsk, rdtgrp->mon.parent->closid)) {
+	if (rdtgrp->type == RDTCTRL_GROUP) {
+		WRITE_ONCE(tsk->rdt_group, rdtgrp);
+	} else if (rdtgrp->type == RDTMON_GROUP &&
+		   (task_group == rdtgrp->mon.parent ||
+		    task_group->mon.parent == rdtgrp->mon.parent)) {
+		WRITE_ONCE(tsk->rdt_group, rdtgrp);
+	} else {
 		rdt_last_cmd_puts("Can't move task to different control group\n");
 		return -EINVAL;
 	}
 
-	if (rdtgrp->type == RDTMON_GROUP)
-		resctrl_arch_set_closid_rmid(tsk, rdtgrp->mon.parent->closid,
-					     rdtgrp->mon.rmid);
-	else
-		resctrl_arch_set_closid_rmid(tsk, rdtgrp->closid,
-					     rdtgrp->mon.rmid);
-
 	/*
-	 * Ensure the task's closid and rmid are written before determining if
+	 * Ensure the task's group is written before determining if
 	 * the task is current that will decide if it will be interrupted.
 	 * This pairs with the full barrier between the rq->curr update and
 	 * resctrl_sched_in() during context switch.
@@ -684,19 +724,6 @@ static int __rdtgroup_move_task(struct task_struct *tsk,
 	return 0;
 }
 
-static bool is_closid_match(struct task_struct *t, struct rdtgroup *r)
-{
-	return (resctrl_arch_alloc_capable() && (r->type == RDTCTRL_GROUP) &&
-		resctrl_arch_match_closid(t, r->closid));
-}
-
-static bool is_rmid_match(struct task_struct *t, struct rdtgroup *r)
-{
-	return (resctrl_arch_mon_capable() && (r->type == RDTMON_GROUP) &&
-		resctrl_arch_match_rmid(t, r->mon.parent->closid,
-					r->mon.rmid));
-}
-
 /**
  * rdtgroup_tasks_assigned - Test if tasks have been assigned to resource group
  * @r: Resource group
@@ -712,7 +739,7 @@ int rdtgroup_tasks_assigned(struct rdtgroup *r)
 
 	rcu_read_lock();
 	for_each_process_thread(p, t) {
-		if (is_closid_match(t, r) || is_rmid_match(t, r)) {
+		if (task_in_rdtgroup(t, r)) {
 			ret = 1;
 			break;
 		}
@@ -830,7 +857,7 @@ static void show_rdt_tasks(struct rdtgroup *r, struct seq_file *s)
 
 	rcu_read_lock();
 	for_each_process_thread(p, t) {
-		if (is_closid_match(t, r) || is_rmid_match(t, r)) {
+		if (task_in_rdtgroup(t, r)) {
 			pid = task_pid_vnr(t);
 			if (pid)
 				seq_printf(s, "%d\n", pid);
@@ -924,53 +951,34 @@ int proc_resctrl_show(struct seq_file *s, struct pid_namespace *ns,
 		      struct pid *pid, struct task_struct *tsk)
 {
 	struct rdtgroup *rdtg;
-	int ret = 0;
-
-	mutex_lock(&rdtgroup_mutex);
+	struct rdtgroup *crg;
+	struct rdtgroup *mrg;
 
 	/* Return empty if resctrl has not been mounted. */
 	if (!resctrl_mounted) {
 		seq_puts(s, "res:\nmon:\n");
-		goto unlock;
+		return 0;
 	}
 
-	list_for_each_entry(rdtg, &rdt_all_groups, rdtgroup_list) {
-		struct rdtgroup *crg;
+	mutex_lock(&rdtgroup_mutex);
 
-		/*
-		 * Task information is only relevant for shareable
-		 * and exclusive groups.
-		 */
-		if (rdtg->mode != RDT_MODE_SHAREABLE &&
-		    rdtg->mode != RDT_MODE_EXCLUSIVE)
-			continue;
+	rdtg = READ_ONCE(tsk->rdt_group);
+	if (!rdtg)
+		rdtg = &rdtgroup_default;
 
-		if (!resctrl_arch_match_closid(tsk, rdtg->closid))
-			continue;
+	mrg = rdtg;
+	crg = rdtg;
+	if (rdtg->type == RDTMON_GROUP)
+		crg = rdtg->mon.parent;
+
+	seq_printf(s, "res:%s%s\n", (crg == &rdtgroup_default) ? "/" : "",
+		   crg->kn->name);
+	seq_printf(s, "mon:%s%s\n", (mrg == &rdtgroup_default) ? "/" : "",
+		   mrg->kn->name);
 
-		seq_printf(s, "res:%s%s\n", (rdtg == &rdtgroup_default) ? "/" : "",
-			   rdtg->kn->name);
-		seq_puts(s, "mon:");
-		list_for_each_entry(crg, &rdtg->mon.crdtgrp_list,
-				    mon.crdtgrp_list) {
-			if (!resctrl_arch_match_rmid(tsk, crg->mon.parent->closid,
-						     crg->mon.rmid))
-				continue;
-			seq_printf(s, "%s", crg->kn->name);
-			break;
-		}
-		seq_putc(s, '\n');
-		goto unlock;
-	}
-	/*
-	 * The above search should succeed. Otherwise return
-	 * with an error.
-	 */
-	ret = -ENOENT;
-unlock:
 	mutex_unlock(&rdtgroup_mutex);
 
-	return ret;
+	return 0;
 }
 #endif
 
@@ -2904,13 +2912,11 @@ static void rdt_move_group_tasks(struct rdtgroup *from, struct rdtgroup *to,
 
 	read_lock(&tasklist_lock);
 	for_each_process_thread(p, t) {
-		if (!from || is_closid_match(t, from) ||
-		    is_rmid_match(t, from)) {
-			resctrl_arch_set_closid_rmid(t, to->closid,
-						     to->mon.rmid);
+		if (!from || task_in_rdtgroup(t, from)) {
+			WRITE_ONCE(t->rdt_group, to);
 
 			/*
-			 * Order the closid/rmid stores above before the loads
+			 * Order the group store above before the loads
 			 * in task_curr(). This pairs with the full barrier
 			 * between the rq->curr update and resctrl_sched_in()
 			 * during context switch.
@@ -2939,6 +2945,7 @@ static void rdt_move_group_tasks(struct rdtgroup *from, struct rdtgroup *to,
  */
 void exit_resctrl(struct task_struct *tsk)
 {
+	WRITE_ONCE(tsk->rdt_group, &rdtgroup_default);
 }
 
 static void free_all_child_rdtgrp(struct rdtgroup *rdtgrp)
@@ -3681,7 +3688,7 @@ static int rdtgroup_rmdir_mon(struct rdtgroup *rdtgrp, cpumask_var_t tmpmask)
 
 	/* Update per cpu rmid of the moved CPUs first */
 	for_each_cpu(cpu, &rdtgrp->cpu_mask)
-		per_cpu(pqr_state.default_rmid, cpu) = prdtgrp->mon.rmid;
+		per_cpu(pqr_state.default_group, cpu) = prdtgrp;
 	/*
 	 * Update the MSR on moved CPUs and CPUs which have moved
 	 * task running on them.
@@ -3724,10 +3731,8 @@ static int rdtgroup_rmdir_ctrl(struct rdtgroup *rdtgrp, cpumask_var_t tmpmask)
 		   &rdtgroup_default.cpu_mask, &rdtgrp->cpu_mask);
 
 	/* Update per cpu closid and rmid of the moved CPUs first */
-	for_each_cpu(cpu, &rdtgrp->cpu_mask) {
-		per_cpu(pqr_state.default_closid, cpu) = rdtgroup_default.closid;
-		per_cpu(pqr_state.default_rmid, cpu) = rdtgroup_default.mon.rmid;
-	}
+	for_each_cpu(cpu, &rdtgrp->cpu_mask)
+		per_cpu(pqr_state.default_group, cpu) = &rdtgroup_default;
 
 	/*
 	 * Update the MSR on moved CPUs and CPUs which have moved
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 3c2abbc587b4..d07d7a80006b 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1236,8 +1236,7 @@ struct task_struct {
 	struct list_head		cg_list;
 #endif
 #ifdef CONFIG_X86_CPU_RESCTRL
-	u32				closid;
-	u32				rmid;
+	struct rdtgroup			*rdt_group;
 #endif
 #ifdef CONFIG_FUTEX
 	struct robust_list_head __user	*robust_list;
-- 
2.44.0.396.g6e790dbe36-goog


^ permalink raw reply related	[flat|nested] 27+ messages in thread

* [PATCH v1 5/6] x86/resctrl: Abstract PQR_ASSOC from generic code
  2024-03-25 17:27 [PATCH v1 0/6] x86/resctrl: Avoid searching tasklist during mongrp_reparent Peter Newman
                   ` (3 preceding siblings ...)
  2024-03-25 17:27 ` [PATCH v1 4/6] x86/resctrl: Use rdtgroup pointer to indicate task membership Peter Newman
@ 2024-03-25 17:27 ` Peter Newman
  2024-04-04 23:16   ` Reinette Chatre
                     ` (2 more replies)
  2024-03-25 17:27 ` [PATCH v1 6/6] x86/resctrl: Don't search tasklist in mongroup rename Peter Newman
                   ` (2 subsequent siblings)
  7 siblings, 3 replies; 27+ messages in thread
From: Peter Newman @ 2024-03-25 17:27 UTC (permalink / raw)
  To: Fenghua Yu, Reinette Chatre, James Morse
  Cc: Stephane Eranian, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, x86, H. Peter Anvin, Peter Zijlstra, Juri Lelli,
	Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Daniel Bristot de Oliveira, Valentin Schneider,
	Uros Bizjak, Mike Rapoport, Kirill A. Shutemov, Rick Edgecombe,
	Xin Li, Babu Moger, Shaopeng Tan, Maciej Wieczor-Retman,
	Jens Axboe, Christian Brauner, Oleg Nesterov, Andrew Morton,
	Tycho Andersen, Nicholas Piggin, Beau Belgrave,
	Matthew Wilcox (Oracle),
	linux-kernel, Peter Newman

While CLOSID and RMID originated in RDT, the concept applies to other
architectures, as it's standard to write allocation and monitoring IDs
into per-CPU registers.

 - Rename resctrl_pqr_state and pqr_state to be more
   architecture-neutral.

 - Introduce resctrl_arch_update_cpu() to replace the explicit write to
   MSR_IA32_PQR_ASSOC in __resctrl_sched_in(). In the case of MPAM,
   PARTID[_I,D] and PMG are a simple function of closid, rmid, and an
   internal global.

 - Update terminology containing explicit references to the PQR_ASSOC
   register.

Signed-off-by: Peter Newman <peternewman@google.com>
---
 arch/x86/kernel/cpu/resctrl/core.c        | 11 ++++++++---
 arch/x86/kernel/cpu/resctrl/internal.h    |  6 +++---
 arch/x86/kernel/cpu/resctrl/pseudo_lock.c |  4 ++--
 arch/x86/kernel/cpu/resctrl/rdtgroup.c    | 18 +++++++++---------
 include/linux/resctrl.h                   | 11 +++++++++++
 5 files changed, 33 insertions(+), 17 deletions(-)

diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
index ae5878d748fc..4cc584754f8b 100644
--- a/arch/x86/kernel/cpu/resctrl/core.c
+++ b/arch/x86/kernel/cpu/resctrl/core.c
@@ -37,12 +37,12 @@
 static DEFINE_MUTEX(domain_list_lock);
 
 /*
- * The cached resctrl_pqr_state is strictly per CPU and can never be
+ * The cached resctrl_cpu_state is strictly per CPU and can never be
  * updated from a remote CPU. Functions which modify the state
  * are called with interrupts disabled and no preemption, which
  * is sufficient for the protection.
  */
-DEFINE_PER_CPU(struct resctrl_pqr_state, pqr_state);
+DEFINE_PER_CPU(struct resctrl_cpu_state, resctrl_state);
 
 /*
  * Used to store the max resource name width and max resource data width
@@ -309,6 +309,11 @@ static void rdt_get_cdp_l2_config(void)
 	rdt_get_cdp_config(RDT_RESOURCE_L2);
 }
 
+void resctrl_arch_update_cpu(u32 ctrl_id, u32 mon_id)
+{
+	wrmsr(MSR_IA32_PQR_ASSOC, mon_id, ctrl_id);
+}
+
 static void
 mba_wrmsr_amd(struct rdt_domain *d, struct msr_param *m, struct rdt_resource *r)
 {
@@ -598,7 +603,7 @@ static void domain_remove_cpu(int cpu, struct rdt_resource *r)
 
 static void clear_closid_rmid(int cpu)
 {
-	struct resctrl_pqr_state *state = this_cpu_ptr(&pqr_state);
+	struct resctrl_cpu_state *state = this_cpu_ptr(&resctrl_state);
 
 	state->default_group = &rdtgroup_default;
 	state->cur_closid = RESCTRL_RESERVED_CLOSID;
diff --git a/arch/x86/kernel/cpu/resctrl/internal.h b/arch/x86/kernel/cpu/resctrl/internal.h
index 0ba0d2428780..e30f42744ac7 100644
--- a/arch/x86/kernel/cpu/resctrl/internal.h
+++ b/arch/x86/kernel/cpu/resctrl/internal.h
@@ -332,7 +332,7 @@ struct rftype {
 };
 
 /**
- * struct resctrl_pqr_state - State cache for the PQR MSR
+ * struct resctrl_cpu_state - State cache for allocation/monitoring group IDs
  * @cur_rmid:		The cached Resource Monitoring ID
  * @cur_closid:		The cached Class Of Service ID
  * @default_group:	The user assigned rdtgroup
@@ -340,13 +340,13 @@ struct rftype {
  * The cache also helps to avoid pointless updates if the value does
  * not change.
  */
-struct resctrl_pqr_state {
+struct resctrl_cpu_state {
 	u32			cur_rmid;
 	u32			cur_closid;
 	struct rdtgroup		*default_group;
 };
 
-DECLARE_PER_CPU(struct resctrl_pqr_state, pqr_state);
+DECLARE_PER_CPU(struct resctrl_cpu_state, resctrl_state);
 
 /**
  * struct mbm_state - status for each MBM counter in each domain
diff --git a/arch/x86/kernel/cpu/resctrl/pseudo_lock.c b/arch/x86/kernel/cpu/resctrl/pseudo_lock.c
index 884b88e25141..ca1805a566cb 100644
--- a/arch/x86/kernel/cpu/resctrl/pseudo_lock.c
+++ b/arch/x86/kernel/cpu/resctrl/pseudo_lock.c
@@ -480,8 +480,8 @@ static int pseudo_lock_fn(void *_rdtgrp)
 	 */
 	saved_msr = __rdmsr(MSR_MISC_FEATURE_CONTROL);
 	__wrmsr(MSR_MISC_FEATURE_CONTROL, prefetch_disable_bits, 0x0);
-	closid_p = this_cpu_read(pqr_state.cur_closid);
-	rmid_p = this_cpu_read(pqr_state.cur_rmid);
+	closid_p = this_cpu_read(resctrl_state.cur_closid);
+	rmid_p = this_cpu_read(resctrl_state.cur_rmid);
 	mem_r = plr->kmem;
 	size = plr->size;
 	line_size = plr->line_size;
diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
index badf181c8cbb..bd067f7ed5b6 100644
--- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
+++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
@@ -112,7 +112,7 @@ void rdt_staged_configs_clear(void)
  * + We can simply set current's closid to assign a task to a resource
  *   group.
  * + Context switch code can avoid extra memory references deciding which
- *   CLOSID to load into the PQR_ASSOC MSR
+ *   CLOSID to load into the CPU
  * - We give up some options in configuring resource groups across multi-socket
  *   systems.
  * - Our choices on how to configure each resource become progressively more
@@ -347,7 +347,7 @@ static int rdtgroup_cpus_show(struct kernfs_open_file *of,
  */
 void __resctrl_sched_in(struct task_struct *tsk)
 {
-	struct resctrl_pqr_state *state = this_cpu_ptr(&pqr_state);
+	struct resctrl_cpu_state *state = this_cpu_ptr(&resctrl_state);
 	u32 closid = state->cur_closid;
 	u32 rmid = state->cur_rmid;
 	struct rdtgroup *rgrp;
@@ -401,7 +401,7 @@ void __resctrl_sched_in(struct task_struct *tsk)
 	if (closid != state->cur_closid || rmid != state->cur_rmid) {
 		state->cur_closid = closid;
 		state->cur_rmid = rmid;
-		wrmsr(MSR_IA32_PQR_ASSOC, rmid, closid);
+		resctrl_arch_update_cpu(closid, rmid);
 	}
 }
 
@@ -416,7 +416,7 @@ static void update_cpu_closid_rmid(void *info)
 	struct rdtgroup *r = info;
 
 	if (r)
-		this_cpu_write(pqr_state.default_group, r);
+		this_cpu_write(resctrl_state.default_group, r);
 
 	/*
 	 * We cannot unconditionally write the MSR because the current
@@ -635,8 +635,8 @@ static void rdtgroup_remove(struct rdtgroup *rdtgrp)
 static void _update_task_closid_rmid(void *task)
 {
 	/*
-	 * If the task is still current on this CPU, update PQR_ASSOC MSR.
-	 * Otherwise, the MSR is updated when the task is scheduled in.
+	 * If the task is still current on this CPU, update the current ctrl
+	 * group. Otherwise, the CPU is updated when the task is scheduled in.
 	 */
 	if (task == current)
 		resctrl_sched_in(task);
@@ -3005,7 +3005,7 @@ static void rmdir_all_sub(void)
 		else
 			rdtgroup_remove(rdtgrp);
 	}
-	/* Notify online CPUs to update per cpu storage and PQR_ASSOC MSR */
+	/* Update online CPUs to propagate group membership changes. */
 	update_closid_rmid(cpu_online_mask, &rdtgroup_default);
 
 	kernfs_remove(kn_info);
@@ -3688,7 +3688,7 @@ static int rdtgroup_rmdir_mon(struct rdtgroup *rdtgrp, cpumask_var_t tmpmask)
 
 	/* Update per cpu rmid of the moved CPUs first */
 	for_each_cpu(cpu, &rdtgrp->cpu_mask)
-		per_cpu(pqr_state.default_group, cpu) = prdtgrp;
+		per_cpu(resctrl_state.default_group, cpu) = prdtgrp;
 	/*
 	 * Update the MSR on moved CPUs and CPUs which have moved
 	 * task running on them.
@@ -3732,7 +3732,7 @@ static int rdtgroup_rmdir_ctrl(struct rdtgroup *rdtgrp, cpumask_var_t tmpmask)
 
 	/* Update per cpu closid and rmid of the moved CPUs first */
 	for_each_cpu(cpu, &rdtgrp->cpu_mask)
-		per_cpu(pqr_state.default_group, cpu) = &rdtgroup_default;
+		per_cpu(resctrl_state.default_group, cpu) = &rdtgroup_default;
 
 	/*
 	 * Update the MSR on moved CPUs and CPUs which have moved
diff --git a/include/linux/resctrl.h b/include/linux/resctrl.h
index b2af1fbc7aa1..a6b1b13cc769 100644
--- a/include/linux/resctrl.h
+++ b/include/linux/resctrl.h
@@ -306,6 +306,17 @@ extern unsigned int resctrl_rmid_realloc_limit;
 
 DECLARE_STATIC_KEY_FALSE(rdt_enable_key);
 
+/*
+ * resctrl_arch_update_cpu() - Make control and monitoring group IDs effective
+ *			       on the current CPU
+ *
+ * @ctrl_id:	An identifier for the control group which is to be used on the
+ *		current CPU.
+ * @mon_id:	An identifier for the monitoring group which is to be used on
+ *		the current CPU.
+ */
+void resctrl_arch_update_cpu(u32 ctrl_id, u32 mon_id);
+
 void __resctrl_sched_in(struct task_struct *tsk);
 
 /*
-- 
2.44.0.396.g6e790dbe36-goog


^ permalink raw reply related	[flat|nested] 27+ messages in thread

* [PATCH v1 6/6] x86/resctrl: Don't search tasklist in mongroup rename
  2024-03-25 17:27 [PATCH v1 0/6] x86/resctrl: Avoid searching tasklist during mongrp_reparent Peter Newman
                   ` (4 preceding siblings ...)
  2024-03-25 17:27 ` [PATCH v1 5/6] x86/resctrl: Abstract PQR_ASSOC from generic code Peter Newman
@ 2024-03-25 17:27 ` Peter Newman
  2024-04-04 23:18   ` Reinette Chatre
  2024-04-04 23:09 ` [PATCH v1 0/6] x86/resctrl: Avoid searching tasklist during mongrp_reparent Reinette Chatre
  2024-04-11 22:34 ` Moger, Babu
  7 siblings, 1 reply; 27+ messages in thread
From: Peter Newman @ 2024-03-25 17:27 UTC (permalink / raw)
  To: Fenghua Yu, Reinette Chatre, James Morse
  Cc: Stephane Eranian, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, x86, H. Peter Anvin, Peter Zijlstra, Juri Lelli,
	Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Daniel Bristot de Oliveira, Valentin Schneider,
	Uros Bizjak, Mike Rapoport, Kirill A. Shutemov, Rick Edgecombe,
	Xin Li, Babu Moger, Shaopeng Tan, Maciej Wieczor-Retman,
	Jens Axboe, Christian Brauner, Oleg Nesterov, Andrew Morton,
	Tycho Andersen, Nicholas Piggin, Beau Belgrave,
	Matthew Wilcox (Oracle),
	linux-kernel, Peter Newman

Iterating over all task_structs while read-locking the tasklist_lock
results in significant task creation/destruction latency. Back-to-back
move operations can thus be disastrous to the responsiveness of
threadpool-based services.

Now that the CLOSID is determined indirectly through a reference to the
task's current rdtgroup, it is no longer necessary to update the CLOSID
in all tasks belonging to the moved mongroup. The context switch handler
just needs to be prepared for concurrent writes to the parent pointer.

Signed-off-by: Peter Newman <peternewman@google.com>
---
 arch/x86/kernel/cpu/resctrl/rdtgroup.c | 30 +++++++-------------------
 1 file changed, 8 insertions(+), 22 deletions(-)

diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
index bd067f7ed5b6..a007c0ec478f 100644
--- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
+++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
@@ -388,8 +388,11 @@ void __resctrl_sched_in(struct task_struct *tsk)
 				 * by a full barrier and synchronous IPI
 				 * broadcast before proceeding to free the
 				 * group.
+				 *
+				 * parent can be concurrently updated to a new
+				 * group as a result of mongrp_reparent().
 				 */
-				closid = rgrp->mon.parent->closid;
+				closid = READ_ONCE(rgrp->mon.parent)->closid;
 		} else {
 			closid = rgrp->closid;
 		}
@@ -3809,8 +3812,7 @@ static int rdtgroup_rmdir(struct kernfs_node *kn)
  * Monitoring data for the group is unaffected by this operation.
  */
 static void mongrp_reparent(struct rdtgroup *rdtgrp,
-			    struct rdtgroup *new_prdtgrp,
-			    cpumask_var_t cpus)
+			    struct rdtgroup *new_prdtgrp)
 {
 	struct rdtgroup *prdtgrp = rdtgrp->mon.parent;
 
@@ -3825,13 +3827,10 @@ static void mongrp_reparent(struct rdtgroup *rdtgrp,
 	list_move_tail(&rdtgrp->mon.crdtgrp_list,
 		       &new_prdtgrp->mon.crdtgrp_list);
 
-	rdtgrp->mon.parent = new_prdtgrp;
+	WRITE_ONCE(rdtgrp->mon.parent, new_prdtgrp);
 	rdtgrp->closid = new_prdtgrp->closid;
 
-	/* Propagate updated closid to all tasks in this group. */
-	rdt_move_group_tasks(rdtgrp, rdtgrp, cpus);
-
-	update_closid_rmid(cpus, NULL);
+	update_closid_rmid(cpu_online_mask, NULL);
 }
 
 static int rdtgroup_rename(struct kernfs_node *kn,
@@ -3839,7 +3838,6 @@ static int rdtgroup_rename(struct kernfs_node *kn,
 {
 	struct rdtgroup *new_prdtgrp;
 	struct rdtgroup *rdtgrp;
-	cpumask_var_t tmpmask;
 	int ret;
 
 	rdtgrp = kernfs_to_rdtgroup(kn);
@@ -3909,16 +3907,6 @@ static int rdtgroup_rename(struct kernfs_node *kn,
 		goto out;
 	}
 
-	/*
-	 * Allocate the cpumask for use in mongrp_reparent() to avoid the
-	 * possibility of failing to allocate it after kernfs_rename() has
-	 * succeeded.
-	 */
-	if (!zalloc_cpumask_var(&tmpmask, GFP_KERNEL)) {
-		ret = -ENOMEM;
-		goto out;
-	}
-
 	/*
 	 * Perform all input validation and allocations needed to ensure
 	 * mongrp_reparent() will succeed before calling kernfs_rename(),
@@ -3927,9 +3915,7 @@ static int rdtgroup_rename(struct kernfs_node *kn,
 	 */
 	ret = kernfs_rename(kn, new_parent, new_name);
 	if (!ret)
-		mongrp_reparent(rdtgrp, new_prdtgrp, tmpmask);
-
-	free_cpumask_var(tmpmask);
+		mongrp_reparent(rdtgrp, new_prdtgrp);
 
 out:
 	mutex_unlock(&rdtgroup_mutex);
-- 
2.44.0.396.g6e790dbe36-goog


^ permalink raw reply related	[flat|nested] 27+ messages in thread

* Re: [PATCH v1 0/6] x86/resctrl: Avoid searching tasklist during mongrp_reparent
  2024-03-25 17:27 [PATCH v1 0/6] x86/resctrl: Avoid searching tasklist during mongrp_reparent Peter Newman
                   ` (5 preceding siblings ...)
  2024-03-25 17:27 ` [PATCH v1 6/6] x86/resctrl: Don't search tasklist in mongroup rename Peter Newman
@ 2024-04-04 23:09 ` Reinette Chatre
  2024-04-05 21:30   ` Peter Newman
  2024-04-11 22:34 ` Moger, Babu
  7 siblings, 1 reply; 27+ messages in thread
From: Reinette Chatre @ 2024-04-04 23:09 UTC (permalink / raw)
  To: Peter Newman, Fenghua Yu, James Morse
  Cc: Stephane Eranian, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, x86, H. Peter Anvin, Peter Zijlstra, Juri Lelli,
	Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Daniel Bristot de Oliveira, Valentin Schneider,
	Uros Bizjak, Mike Rapoport, Kirill A. Shutemov, Rick Edgecombe,
	Xin Li, Babu Moger, Shaopeng Tan, Maciej Wieczor-Retman,
	Jens Axboe, Christian Brauner, Oleg Nesterov, Andrew Morton,
	Tycho Andersen, Nicholas Piggin, Beau Belgrave,
	Matthew Wilcox (Oracle),
	linux-kernel

Hi Peter,

On 3/25/2024 10:27 AM, Peter Newman wrote:
> Hi Reinette,
> 
> I've been working with users of the recently-added mongroup rename
> operation[1] who have observed the impact of back-to-back operations on
> latency-sensitive, thread pool-based services. Because changing a
> resctrl group's CLOSID (or RMID) requires all task_structs in the system
> to be inspected with the tasklist_lock read-locked, a series of move
> operations can block out thread creation for long periods of time, as
> creating threads needs to write-lock the tasklist_lock.

Could you please give some insight into the delays experienced? "long
periods of time" mean different things to different people and this
series seems to get more ominous as is progresses with the cover letter
starting with "long periods of time" and by the time the final patch
appears it has become "disastrous".

> 
> To avoid this issue, this series replaces the CLOSID and RMID values
> cached in the task_struct with a pointer to the task's rdtgroup, through
> which the current CLOSID and RMID can be obtained indirectly during a

On a high level this seems fair but using a pointer to the rdtgroup in the
task_struct means that the scheduling code needs to dereference that pointer
without any lock held. The changes do take great care and it
would be valuable to clearly document how these accesses are safe. (please
see patch #4 and #6)

> context switch. Updating a group's ID then only requires the current
> task to be switched back in on all CPUs. On server hosts with very large
> thread counts, this is much less disruptive than making thread creation
> globally unavailable. However, this is less desirable on CPU-isolated,
> realtime workloads, so I am interested in suggestions on how to reach a
> compromise for the two use cases.

As I understand this only impacts moving a monitor group? To me this sounds
like a user space triggered event associated with a particular use case that
may not be relevant to the workloads that you refer to. I think this could be
something that can be documented for users with this type of workloads.
(please see patch #6)

> 
> Also note that the soft-ABMC[2] proposal is based on swapping the RMID
> values in mongroups when monitors are assigned to them, which will
> result in the same slowdown as was encountered with re-parenting
> monitoring groups.
> 
> Using pointers to rdtgroups from the task_struct been used internally at

"have been used internally"?

> Google for a few years to support an alternate CLOSID management
> technique[3], which was replaced by mongroup rename. Usage of this
> technique during production revealed an issue where threads in an
> exiting process become unreachable via for_each_process_thread() before
> destruction, but can still switch in and out. Consequently, an rmdir
> operation can fail to remove references to an rdtgroup before it frees
> it, causing the pointer to freed memory to be dereferenced after the
> structure is freed. This would result in invalid CLOSID/RMID values
> being written into MSRs, causing an exception. The workaround for this
> is to clear a task's rdtgroup pointer when it exits.

This does not sound like a problem unique to this new implementation but the
"invalid CLOSID/RMID values being written into MSRs" sounds like something
that happens today? Probably not worth pulling out for this reason because
in its current form the exiting process will keep using the original
CLOSID/RMID that have no issue being written to MSRs.

Reinette

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v1 1/6] x86/resctrl: Move __resctrl_sched_in() out-of-line
  2024-03-25 17:27 ` [PATCH v1 1/6] x86/resctrl: Move __resctrl_sched_in() out-of-line Peter Newman
@ 2024-04-04 23:09   ` Reinette Chatre
  2024-04-05 22:04     ` Peter Newman
  0 siblings, 1 reply; 27+ messages in thread
From: Reinette Chatre @ 2024-04-04 23:09 UTC (permalink / raw)
  To: Peter Newman, Fenghua Yu, James Morse
  Cc: Stephane Eranian, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, x86, H. Peter Anvin, Peter Zijlstra, Juri Lelli,
	Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Daniel Bristot de Oliveira, Valentin Schneider,
	Uros Bizjak, Mike Rapoport, Kirill A. Shutemov, Rick Edgecombe,
	Xin Li, Babu Moger, Shaopeng Tan, Maciej Wieczor-Retman,
	Jens Axboe, Christian Brauner, Oleg Nesterov, Andrew Morton,
	Tycho Andersen, Nicholas Piggin, Beau Belgrave,
	Matthew Wilcox (Oracle),
	linux-kernel

Hi Peter,

On 3/25/2024 10:27 AM, Peter Newman wrote:
> __resctrl_sched_in() is unable to dereference a struct rdtgroup pointer
> when defined inline because rdtgroup is a private structure defined in
> internal.h.

Being inline has nothing to do with whether it can reference a struct rdtgroup
pointer, no?

> 
> This function is defined inline to avoid impacting context switch
> performance for the majority of users who aren't using resctrl at all.
> These benefits can already be realized without access to internal
> resctrl data structures.
> 
> The logic of performing an out-of-line call to __resctrl_sched_in() only
> when resctrl is mounted is architecture-independent, so the inline
> definition of resctrl_sched_in() can be moved into linux/resctrl.h.
> 
> Signed-off-by: Peter Newman <peternewman@google.com>

...

> diff --git a/arch/x86/kernel/process_32.c b/arch/x86/kernel/process_32.c
> index 0917c7f25720..8f92a87d381d 100644
> --- a/arch/x86/kernel/process_32.c
> +++ b/arch/x86/kernel/process_32.c
> @@ -38,6 +38,7 @@
>  #include <linux/io.h>
>  #include <linux/kdebug.h>
>  #include <linux/syscalls.h>
> +#include <linux/resctrl.h>
>  
>  #include <asm/ldt.h>
>  #include <asm/processor.h>
> @@ -51,7 +52,6 @@
>  #include <asm/debugreg.h>
>  #include <asm/switch_to.h>
>  #include <asm/vm86.h>
> -#include <asm/resctrl.h>
>  #include <asm/proto.h>
>  
>  #include "process.h"
> diff --git a/arch/x86/kernel/process_64.c b/arch/x86/kernel/process_64.c
> index 7062b84dd467..d442269bb25b 100644
> --- a/arch/x86/kernel/process_64.c
> +++ b/arch/x86/kernel/process_64.c
> @@ -40,6 +40,7 @@
>  #include <linux/ftrace.h>
>  #include <linux/syscalls.h>
>  #include <linux/iommu.h>
> +#include <linux/resctrl.h>
>  
>  #include <asm/processor.h>
>  #include <asm/pkru.h>

With a change like this we should be very careful about what is included when
the kernel is not built with resctrl and in its current form linux/resctrl.h is
not ready for this.

If CONFIG_X86_CPU_RESCTRL is not set linux/resctrl.h should have the bare minimum,
just like asm/resctrl.h has.

> @@ -53,7 +54,6 @@
>  #include <asm/switch_to.h>
>  #include <asm/xen/hypervisor.h>
>  #include <asm/vdso.h>
> -#include <asm/resctrl.h>
>  #include <asm/unistd.h>
>  #include <asm/fsgsbase.h>
>  #include <asm/fred.h>
> diff --git a/include/linux/resctrl.h b/include/linux/resctrl.h
> index a365f67131ec..62d607939a73 100644
> --- a/include/linux/resctrl.h
> +++ b/include/linux/resctrl.h
> @@ -304,4 +304,25 @@ void resctrl_arch_reset_rmid_all(struct rdt_resource *r, struct rdt_domain *d);
>  extern unsigned int resctrl_rmid_realloc_threshold;
>  extern unsigned int resctrl_rmid_realloc_limit;
>  
> +DECLARE_STATIC_KEY_FALSE(rdt_enable_key);
> +
> +void __resctrl_sched_in(struct task_struct *tsk);
> +
> +/*
> + * resctrl_sched_in() - Assigns the incoming task's control/monitor IDs to the
> + *			current CPU
> + *
> + * To minimize impact to the scheduler hot path, this will stay as no-op unless
> + * running on a system supporting resctrl and the filesystem is mounted.
> + *
> + * Must be called with preemption disabled.
> + */
> +static inline void resctrl_sched_in(struct task_struct *tsk)
> +{
> +#ifdef CONFIG_X86_CPU_RESCTRL
> +	if (static_branch_likely(&rdt_enable_key))
> +		__resctrl_sched_in(tsk);
> +#endif
> +}
> +

include/linux/resctrl.h should rather be divided to accommodate code
as below:

#ifdef CONFIG_X86_CPU_RESCTRL

static inline void resctrl_sched_in(struct task_struct *tsk)
{
	if (static_branch_likely(&rdt_enable_key))
		__resctrl_sched_in(tsk);
}

#else

static inline void resctrl_sched_in(struct task_struct *tsk) {}

#endif

so that core code does not get anything unnecessary when CONFIG_X86_CPU_RESCTRL
is not set.

Reinette


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v1 2/6] x86/resctrl: Add hook for releasing task_struct references
  2024-03-25 17:27 ` [PATCH v1 2/6] x86/resctrl: Add hook for releasing task_struct references Peter Newman
@ 2024-04-04 23:10   ` Reinette Chatre
  0 siblings, 0 replies; 27+ messages in thread
From: Reinette Chatre @ 2024-04-04 23:10 UTC (permalink / raw)
  To: Peter Newman, Fenghua Yu, James Morse
  Cc: Stephane Eranian, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, x86, H. Peter Anvin, Peter Zijlstra, Juri Lelli,
	Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Daniel Bristot de Oliveira, Valentin Schneider,
	Uros Bizjak, Mike Rapoport, Kirill A. Shutemov, Rick Edgecombe,
	Xin Li, Babu Moger, Shaopeng Tan, Maciej Wieczor-Retman,
	Jens Axboe, Christian Brauner, Oleg Nesterov, Andrew Morton,
	Tycho Andersen, Nicholas Piggin, Beau Belgrave,
	Matthew Wilcox (Oracle),
	linux-kernel

Hi Peter,

On 3/25/2024 10:27 AM, Peter Newman wrote:
> In order for the task_struct to hold references to rdtgroups, it must be
> possible to release these references before a concurrent deletion causes
> them to be freed.

This is very hard to parse (for me). I found your description in the
cover letter to be much easier to understand.
Considering the core area of code changed with this patch the
changelog needs to be specific on what is needed and why.

I'd also like to recommend that the subject prefix is changed to "exit: ..."
to highlight to folks that this crosses into a different area of the kernel.

It is not obvious to me how changes are routed to this exit code but we
can start by highlighting this so it does not appear we want to sneak it in.

> 
> It is not possible for resctrl code to do this with
> for_each_process_thread() because the task can still switch in after it
> has been removed from the tasklist, at which point the task_struct could
> be referring to freed memory.
> 
> Signed-off-by: Peter Newman <peternewman@google.com>
> ---
>  arch/x86/kernel/cpu/resctrl/rdtgroup.c | 10 ++++++++++
>  include/linux/resctrl.h                |  6 ++++++
>  kernel/exit.c                          |  3 +++
>  3 files changed, 19 insertions(+)
> 
> diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> index 5d599d99f94b..9b1969e4235a 100644
> --- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> +++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> @@ -2931,6 +2931,16 @@ static void rdt_move_group_tasks(struct rdtgroup *from, struct rdtgroup *to,
>  	read_unlock(&tasklist_lock);
>  }
>  
> +/**
> + * exit_resctrl() - called at thread destruction to release resources
> + *
> + * This hook is called just before the task is removed from the global tasklist
> + * and still reachable via for_each_process_thread().
> + */
> +void exit_resctrl(struct task_struct *tsk)
> +{
> +}
> +
>  static void free_all_child_rdtgrp(struct rdtgroup *rdtgrp)
>  {
>  	struct rdtgroup *sentry, *stmp;
> diff --git a/include/linux/resctrl.h b/include/linux/resctrl.h
> index 62d607939a73..b2af1fbc7aa1 100644
> --- a/include/linux/resctrl.h
> +++ b/include/linux/resctrl.h
> @@ -325,4 +325,10 @@ static inline void resctrl_sched_in(struct task_struct *tsk)
>  #endif
>  }
>  
> +#ifdef CONFIG_X86_CPU_RESCTRL
> +void exit_resctrl(struct task_struct *tsk);
> +#else
> +static inline void exit_resctrl(struct task_struct *tsk) {}
> +#endif

Scattering these ifdefs in the header file is not ideal. I think when there
are just the two sections, as I mentioned in the previous patch, this will look better.

> +
>  #endif /* _RESCTRL_H */
> diff --git a/kernel/exit.c b/kernel/exit.c
> index 41a12630cbbc..ccdc90ff6d71 100644
> --- a/kernel/exit.c
> +++ b/kernel/exit.c
> @@ -70,6 +70,7 @@
>  #include <linux/sysfs.h>
>  #include <linux/user_events.h>
>  #include <linux/uaccess.h>
> +#include <linux/resctrl.h>
>  
>  #include <uapi/linux/wait.h>
>  
> @@ -862,6 +863,8 @@ void __noreturn do_exit(long code)
>  	tsk->exit_code = code;
>  	taskstats_exit(tsk, group_dead);
>  
> +	exit_resctrl(tsk);
> +

This seems fair but I am not familiar with the customs in this area of the kernel.
I see both exit_xxx() and xxx_task_exit() being used.

Reinette

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v1 3/6] x86/resctrl: Disallow mongroup rename on MPAM
  2024-03-25 17:27 ` [PATCH v1 3/6] x86/resctrl: Disallow mongroup rename on MPAM Peter Newman
@ 2024-04-04 23:11   ` Reinette Chatre
  2024-04-05 22:10     ` Peter Newman
  0 siblings, 1 reply; 27+ messages in thread
From: Reinette Chatre @ 2024-04-04 23:11 UTC (permalink / raw)
  To: Peter Newman, Fenghua Yu, James Morse
  Cc: Stephane Eranian, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, x86, H. Peter Anvin, Peter Zijlstra, Juri Lelli,
	Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Daniel Bristot de Oliveira, Valentin Schneider,
	Uros Bizjak, Mike Rapoport, Kirill A. Shutemov, Rick Edgecombe,
	Xin Li, Babu Moger, Shaopeng Tan, Maciej Wieczor-Retman,
	Jens Axboe, Christian Brauner, Oleg Nesterov, Andrew Morton,
	Tycho Andersen, Nicholas Piggin, Beau Belgrave,
	Matthew Wilcox (Oracle),
	linux-kernel

Hi Peter,

On 3/25/2024 10:27 AM, Peter Newman wrote:
> Moving a monitoring group to a different parent control assumes that the
> monitors will not be impacted. This is not the case on MPAM where the
> PMG is an extension of the PARTID.
> 
> Detect this situation by requiring the change in CLOSID not to affect
> the result of resctrl_arch_rmid_idx_encode(), otherwise return
> -EOPNOTSUPP.
> 

Thanks for catching this. This seems out of place in this series. It sounds
more like an independent fix that should go in separately.

> Signed-off-by: Peter Newman <peternewman@google.com>
> ---
>  arch/x86/kernel/cpu/resctrl/rdtgroup.c | 13 +++++++++++++
>  1 file changed, 13 insertions(+)
> 
> diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> index 9b1969e4235a..8d6979dbfd02 100644
> --- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> +++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> @@ -3879,6 +3879,19 @@ static int rdtgroup_rename(struct kernfs_node *kn,
>  		goto out;
>  	}
>  
> +	/*
> +	 * If changing the CLOSID impacts the RMID, this operation is not
> +	 * supported.
> +	 */
> +	if (resctrl_arch_rmid_idx_encode(rdtgrp->mon.parent->closid,
> +					 rdtgrp->mon.rmid) !=
> +	    resctrl_arch_rmid_idx_encode(new_prdtgrp->closid,
> +					 rdtgrp->mon.rmid)) {
> +		rdt_last_cmd_puts("changing parent control group not supported\n");

Please start message with capital letter.

> +		ret = -EOPNOTSUPP;
> +		goto out;
> +	}
> +
>  	/*
>  	 * If the MON group is monitoring CPUs, the CPUs must be assigned to the
>  	 * current parent CTRL_MON group and therefore cannot be assigned to

Reinette

* Re: [PATCH v1 4/6] x86/resctrl: Use rdtgroup pointer to indicate task membership
  2024-03-25 17:27 ` [PATCH v1 4/6] x86/resctrl: Use rdtgroup pointer to indicate task membership Peter Newman
@ 2024-04-04 23:15   ` Reinette Chatre
  0 siblings, 0 replies; 27+ messages in thread
From: Reinette Chatre @ 2024-04-04 23:15 UTC (permalink / raw)
  To: Peter Newman, Fenghua Yu, James Morse
  Cc: Stephane Eranian, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, x86, H. Peter Anvin, Peter Zijlstra, Juri Lelli,
	Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Daniel Bristot de Oliveira, Valentin Schneider,
	Uros Bizjak, Mike Rapoport, Kirill A. Shutemov, Rick Edgecombe,
	Xin Li, Babu Moger, Shaopeng Tan, Maciej Wieczor-Retman,
	Jens Axboe, Christian Brauner, Oleg Nesterov, Andrew Morton,
	Tycho Andersen, Nicholas Piggin, Beau Belgrave,
	Matthew Wilcox (Oracle),
	linux-kernel

Hi Peter,

On 3/25/2024 10:27 AM, Peter Newman wrote:
> Caching the CLOSID and RMID values in all member tasks makes changing
> either ID for a group expensive, as all task_structs must be inspected
> while read-locking the tasklist_lock.
> 
> A single rdtgroup reference from the task_struct can indicate the
> mongroup and ctrl group membership of a task. In the case of mongroups,
> the parent pointer can be used to determine the CLOSID indirectly,
> avoiding the need for invalidating a cached CLOSID in all task_structs.
> 
> This also solves the problem of tearing PARTID/PMG values in MPAM, as
> the parent pointer of a mongroup does not change. Therefore an atomic
> read of the rdt_group pointer provides a consistent view of current
> mongroup and control group membership, making __resctrl_sched_in()
> portable.
> 
> Care must be taken to ensure that __resctrl_sched_in() does not
> dereference a pointer to a freed rdtgroup struct.

"care must be taken" ... could this be expanded to describe how care 
is taken?

> Tasks may no longer be
> reachable via for_each_process_thread() but can still be switched in, so
> update the rdt_group pointer before the thread is removed from the
> tasklist.
> 
> Co-developed-by: Stephane Eranian <eranian@google.com>
> Signed-off-by: Stephane Eranian <eranian@google.com>
> Signed-off-by: Peter Newman <peternewman@google.com>
> ---
>  arch/x86/include/asm/resctrl.h         |  18 ---
>  arch/x86/kernel/cpu/resctrl/core.c     |   3 +-
>  arch/x86/kernel/cpu/resctrl/internal.h |  13 +-
>  arch/x86/kernel/cpu/resctrl/rdtgroup.c | 205 +++++++++++++------------
>  include/linux/sched.h                  |   3 +-
>  5 files changed, 110 insertions(+), 132 deletions(-)
> 
> diff --git a/arch/x86/include/asm/resctrl.h b/arch/x86/include/asm/resctrl.h
> index 99ba8c0dc155..be4afbc6180f 100644
> --- a/arch/x86/include/asm/resctrl.h
> +++ b/arch/x86/include/asm/resctrl.h
> @@ -64,24 +64,6 @@ static inline unsigned int resctrl_arch_round_mon_val(unsigned int val)
>  	return val * scale;
>  }
>  
> -static inline void resctrl_arch_set_closid_rmid(struct task_struct *tsk,
> -						u32 closid, u32 rmid)
> -{
> -	WRITE_ONCE(tsk->closid, closid);
> -	WRITE_ONCE(tsk->rmid, rmid);
> -}
> -
> -static inline bool resctrl_arch_match_closid(struct task_struct *tsk, u32 closid)
> -{
> -	return READ_ONCE(tsk->closid) == closid;
> -}
> -
> -static inline bool resctrl_arch_match_rmid(struct task_struct *tsk, u32 ignored,
> -					   u32 rmid)
> -{
> -	return READ_ONCE(tsk->rmid) == rmid;
> -}
> -
>  static inline u32 resctrl_arch_system_num_rmid_idx(void)
>  {
>  	/* RMID are independent numbers for x86. num_rmid_idx == num_rmid */
> diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
> index 83e40341583e..ae5878d748fc 100644
> --- a/arch/x86/kernel/cpu/resctrl/core.c
> +++ b/arch/x86/kernel/cpu/resctrl/core.c
> @@ -600,8 +600,7 @@ static void clear_closid_rmid(int cpu)
>  {
>  	struct resctrl_pqr_state *state = this_cpu_ptr(&pqr_state);
>  
> -	state->default_closid = RESCTRL_RESERVED_CLOSID;
> -	state->default_rmid = RESCTRL_RESERVED_RMID;
> +	state->default_group = &rdtgroup_default;
>  	state->cur_closid = RESCTRL_RESERVED_CLOSID;
>  	state->cur_rmid = RESCTRL_RESERVED_RMID;
>  	wrmsr(MSR_IA32_PQR_ASSOC, RESCTRL_RESERVED_RMID,
> diff --git a/arch/x86/kernel/cpu/resctrl/internal.h b/arch/x86/kernel/cpu/resctrl/internal.h
> index 56a68e542572..0ba0d2428780 100644
> --- a/arch/x86/kernel/cpu/resctrl/internal.h
> +++ b/arch/x86/kernel/cpu/resctrl/internal.h
> @@ -334,14 +334,8 @@ struct rftype {
>  /**
>   * struct resctrl_pqr_state - State cache for the PQR MSR
>   * @cur_rmid:		The cached Resource Monitoring ID
> - * @cur_closid:	The cached Class Of Service ID
> - * @default_rmid:	The user assigned Resource Monitoring ID
> - * @default_closid:	The user assigned cached Class Of Service ID
> - *
> - * The upper 32 bits of MSR_IA32_PQR_ASSOC contain closid and the
> - * lower 10 bits rmid. The update to MSR_IA32_PQR_ASSOC always
> - * contains both parts, so we need to cache them. This also
> - * stores the user configured per cpu CLOSID and RMID.
> + * @cur_closid:		The cached Class Of Service ID
> + * @default_group:	The user assigned rdtgroup

What is meant by "user assigned rdtgroup"? How about "Resource group
to which this CPU is assigned." (still sounds awkward, please feel free to
improve). It may help if the header is "Per-CPU state cache ...". 

The naming is potentially confusing though because there is already
rdtgroup_default ... and frequently assigning default_group to
rdtgroup_default gets confusing quickly. How about "cpu_group" or
even "this_cpu_group"?

>   *
>   * The cache also helps to avoid pointless updates if the value does
>   * not change.
> @@ -349,8 +343,7 @@ struct rftype {
>  struct resctrl_pqr_state {
>  	u32			cur_rmid;
>  	u32			cur_closid;
> -	u32			default_rmid;
> -	u32			default_closid;
> +	struct rdtgroup		*default_group;
>  };
>  
>  DECLARE_PER_CPU(struct resctrl_pqr_state, pqr_state);
> diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> index 8d6979dbfd02..badf181c8cbb 100644
> --- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> +++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> @@ -348,25 +348,55 @@ static int rdtgroup_cpus_show(struct kernfs_open_file *of,
>  void __resctrl_sched_in(struct task_struct *tsk)
>  {
>  	struct resctrl_pqr_state *state = this_cpu_ptr(&pqr_state);
> -	u32 closid = state->default_closid;
> -	u32 rmid = state->default_rmid;
> -	u32 tmp;
> +	u32 closid = state->cur_closid;
> +	u32 rmid = state->cur_rmid;
> +	struct rdtgroup *rgrp;
>  
>  	/*
> -	 * If this task has a closid/rmid assigned, use it.
> -	 * Else use the closid/rmid assigned to this cpu.
> +	 * A task's group assignment can change concurrently, but the CLOSID or
> +	 * RMID assigned to a group cannot change.
>  	 */
> +	rgrp = READ_ONCE(tsk->rdt_group);
> +	if (!rgrp || rgrp == &rdtgroup_default)
> +		/*
> +		 * If this task is a member of a control or monitoring group,
> +		 * use the IDs assigned to these groups. Else use the
> +		 * closid/rmid assigned to this cpu.

Please consistently use caps for acronyms.

> +		 */
> +		rgrp = state->default_group;
> +
> +	/*
> +	 * Context switches are possible before the cpuonline handler
> +	 * initializes default_group.
> +	 */
> +	if (!rgrp)
> +		rgrp = &rdtgroup_default;
> +
>  	if (static_branch_likely(&rdt_alloc_enable_key)) {
> -		tmp = READ_ONCE(tsk->closid);
> -		if (tmp)
> -			closid = tmp;
> +		/*
> +		 * If the task is assigned to a monitoring group, the CLOSID is
> +		 * determined by the parent control group.
> +		 */
> +		if (rgrp->type == RDTMON_GROUP) {
> +			if (!WARN_ON(!rgrp->mon.parent))
> +				/*
> +				 * The parent rdtgroup cannot be freed until
> +				 * after the mon group is freed. In the event
> +				 * that the parent rdtgroup is removed (by
> +				 * rdtgroup_rmdir_ctrl()), rdt_mon_group would
> +				 * be redirected to rdtgroup_default, followed
> +				 * by a full barrier and synchronous IPI
> +				 * broadcast before proceeding to free the
> +				 * group.
> +				 */
> +				closid = rgrp->mon.parent->closid;
> +		} else {
> +			closid = rgrp->closid;
> +		}
>  	}
>  
> -	if (static_branch_likely(&rdt_mon_enable_key)) {
> -		tmp = READ_ONCE(tsk->rmid);
> -		if (tmp)
> -			rmid = tmp;
> -	}
> +	if (static_branch_likely(&rdt_mon_enable_key))
> +		rmid = rgrp->mon.rmid;
>  
>  	if (closid != state->cur_closid || rmid != state->cur_rmid) {
>  		state->cur_closid = closid;
> @@ -385,10 +415,8 @@ static void update_cpu_closid_rmid(void *info)
>  {
>  	struct rdtgroup *r = info;
>  
> -	if (r) {
> -		this_cpu_write(pqr_state.default_closid, r->closid);
> -		this_cpu_write(pqr_state.default_rmid, r->mon.rmid);
> -	}
> +	if (r)
> +		this_cpu_write(pqr_state.default_group, r);
>  
>  	/*
>  	 * We cannot unconditionally write the MSR because the current
> @@ -624,49 +652,61 @@ static void update_task_closid_rmid(struct task_struct *t)
>  
>  static bool task_in_rdtgroup(struct task_struct *tsk, struct rdtgroup *rdtgrp)
>  {
> -	u32 closid, rmid = rdtgrp->mon.rmid;
> +	struct rdtgroup *task_group = READ_ONCE(tsk->rdt_group);
>  
> -	if (rdtgrp->type == RDTCTRL_GROUP)
> -		closid = rdtgrp->closid;
> -	else if (rdtgrp->type == RDTMON_GROUP)
> -		closid = rdtgrp->mon.parent->closid;
> -	else
> -		return false;
> +	lockdep_assert_held(&rdtgroup_mutex);
> +
> +	/* Uninitialized rdt_group pointer implies rdtgroup_default. */
> +	if (!task_group)
> +		task_group = &rdtgroup_default;
> +
> +	if (rdtgrp == task_group)
> +		return true;
> +
> +	/* Tasks in child mongroups are members of the parent ctrlmon group. */
> +	if (task_group->type == RDTMON_GROUP &&
> +	    task_group->mon.parent == rdtgrp)
> +		return true;

This does not maintain existing behavior. Note that current behavior
(captured in removed snippet below) is that CLOSID _and_ RMID must
match. The consequence seems to be that the only user of this function,
which is to support moving of tasks, can no longer move a task
from one monitor group to another within the same control group.

>  
> -	return resctrl_arch_match_closid(tsk, closid) &&
> -	       resctrl_arch_match_rmid(tsk, closid, rmid);
> +	return false;
>  }
>  
>  static int __rdtgroup_move_task(struct task_struct *tsk,
>  				struct rdtgroup *rdtgrp)
>  {
> +	struct rdtgroup *task_group = READ_ONCE(tsk->rdt_group);
> +
>  	/* If the task is already in rdtgrp, no need to move the task. */
>  	if (task_in_rdtgroup(tsk, rdtgrp))
>  		return 0;
>  
>  	/*
> -	 * Set the task's closid/rmid before the PQR_ASSOC MSR can be
> -	 * updated by them.
> +	 * NULL is used in the task_struct so it can be overridden by a CPU's
> +	 * default_group
> +	 */
> +	if (!task_group)
> +		task_group = &rdtgroup_default;
> +
> +	/*
> +	 * Set the task's group before the CPU can be updated by them.
>  	 *
>  	 * For ctrl_mon groups, move both closid and rmid.
>  	 * For monitor groups, can move the tasks only from
> -	 * their parent CTRL group.
> +	 * their parent CTRL group or another mon group under the same parent.
>  	 */
> -	if (rdtgrp->type == RDTMON_GROUP &&
> -	    !resctrl_arch_match_closid(tsk, rdtgrp->mon.parent->closid)) {
> +	if (rdtgrp->type == RDTCTRL_GROUP) {
> +		WRITE_ONCE(tsk->rdt_group, rdtgrp);
> +	} else if (rdtgrp->type == RDTMON_GROUP &&
> +		   (task_group == rdtgrp->mon.parent ||
> +		    task_group->mon.parent == rdtgrp->mon.parent)) {
> +		WRITE_ONCE(tsk->rdt_group, rdtgrp);
> +	} else {
>  		rdt_last_cmd_puts("Can't move task to different control group\n");
>  		return -EINVAL;
>  	}
>  
> -	if (rdtgrp->type == RDTMON_GROUP)
> -		resctrl_arch_set_closid_rmid(tsk, rdtgrp->mon.parent->closid,
> -					     rdtgrp->mon.rmid);
> -	else
> -		resctrl_arch_set_closid_rmid(tsk, rdtgrp->closid,
> -					     rdtgrp->mon.rmid);
> -
>  	/*
> -	 * Ensure the task's closid and rmid are written before determining if
> +	 * Ensure the task's group is written before determining if
>  	 * the task is current that will decide if it will be interrupted.
>  	 * This pairs with the full barrier between the rq->curr update and
>  	 * resctrl_sched_in() during context switch.
> @@ -684,19 +724,6 @@ static int __rdtgroup_move_task(struct task_struct *tsk,
>  	return 0;
>  }
>  
> -static bool is_closid_match(struct task_struct *t, struct rdtgroup *r)
> -{
> -	return (resctrl_arch_alloc_capable() && (r->type == RDTCTRL_GROUP) &&
> -		resctrl_arch_match_closid(t, r->closid));
> -}
> -
> -static bool is_rmid_match(struct task_struct *t, struct rdtgroup *r)
> -{
> -	return (resctrl_arch_mon_capable() && (r->type == RDTMON_GROUP) &&
> -		resctrl_arch_match_rmid(t, r->mon.parent->closid,
> -					r->mon.rmid));
> -}
> -
>  /**
>   * rdtgroup_tasks_assigned - Test if tasks have been assigned to resource group
>   * @r: Resource group
> @@ -712,7 +739,7 @@ int rdtgroup_tasks_assigned(struct rdtgroup *r)
>  
>  	rcu_read_lock();
>  	for_each_process_thread(p, t) {
> -		if (is_closid_match(t, r) || is_rmid_match(t, r)) {
> +		if (task_in_rdtgroup(t, r)) {

Looks like task_in_rdtgroup() was modified to combine these two checks but care needs
to be taken for existing usage of task_in_rdtgroup().

>  			ret = 1;
>  			break;
>  		}
> @@ -830,7 +857,7 @@ static void show_rdt_tasks(struct rdtgroup *r, struct seq_file *s)
>  
>  	rcu_read_lock();
>  	for_each_process_thread(p, t) {
> -		if (is_closid_match(t, r) || is_rmid_match(t, r)) {
> +		if (task_in_rdtgroup(t, r)) {
>  			pid = task_pid_vnr(t);
>  			if (pid)
>  				seq_printf(s, "%d\n", pid);
> @@ -924,53 +951,34 @@ int proc_resctrl_show(struct seq_file *s, struct pid_namespace *ns,
>  		      struct pid *pid, struct task_struct *tsk)
>  {
>  	struct rdtgroup *rdtg;
> -	int ret = 0;
> -
> -	mutex_lock(&rdtgroup_mutex);
> +	struct rdtgroup *crg;
> +	struct rdtgroup *mrg;
>  
>  	/* Return empty if resctrl has not been mounted. */
>  	if (!resctrl_mounted) {

resctrl_mounted needs to be checked with rdtgroup_mutex held, no?

>  		seq_puts(s, "res:\nmon:\n");
> -		goto unlock;
> +		return 0;
>  	}
>  
> -	list_for_each_entry(rdtg, &rdt_all_groups, rdtgroup_list) {
> -		struct rdtgroup *crg;
> +	mutex_lock(&rdtgroup_mutex);
>  

...

> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index 3c2abbc587b4..d07d7a80006b 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -1236,8 +1236,7 @@ struct task_struct {
>  	struct list_head		cg_list;
>  #endif
>  #ifdef CONFIG_X86_CPU_RESCTRL
> -	u32				closid;
> -	u32				rmid;
> +	struct rdtgroup			*rdt_group;

(I thought about whether this should be an architecture-independent name
considering the goals of this series and the visibility of task_struct... but
on the other hand we are still stuck with that name for the struct.)

>  #endif
>  #ifdef CONFIG_FUTEX
>  	struct robust_list_head __user	*robust_list;

Reinette

* Re: [PATCH v1 5/6] x86/resctrl: Abstract PQR_ASSOC from generic code
  2024-03-25 17:27 ` [PATCH v1 5/6] x86/resctrl: Abstract PQR_ASSOC from generic code Peter Newman
@ 2024-04-04 23:16   ` Reinette Chatre
  2024-04-05 17:15   ` Reinette Chatre
  2024-04-07 19:21   ` Reinette Chatre
  2 siblings, 0 replies; 27+ messages in thread
From: Reinette Chatre @ 2024-04-04 23:16 UTC (permalink / raw)
  To: Peter Newman, Fenghua Yu, James Morse
  Cc: Stephane Eranian, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, x86, H. Peter Anvin, Peter Zijlstra, Juri Lelli,
	Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Daniel Bristot de Oliveira, Valentin Schneider,
	Uros Bizjak, Mike Rapoport, Kirill A. Shutemov, Rick Edgecombe,
	Xin Li, Babu Moger, Shaopeng Tan, Maciej Wieczor-Retman,
	Jens Axboe, Christian Brauner, Oleg Nesterov, Andrew Morton,
	Tycho Andersen, Nicholas Piggin, Beau Belgrave,
	Matthew Wilcox (Oracle),
	linux-kernel

Hi Peter,

On 3/25/2024 10:27 AM, Peter Newman wrote:
> While CLOSID and RMID originated in RDT, the concept applies to other
> architectures, as it's standard to write allocation and monitoring IDs
> into per-CPU registers.
> 
>  - Rename resctrl_pqr_state and pqr_state to be more
>    architecturally-neutral.

I think it will be helpful to also introduce the PQR register, otherwise
it is not clear how this rename is motivated by the first paragraph.

> 
>  - Introduce resctrl_arch_update_cpu() to replace the explicit write to
>    MSR_IA32_PQR_ASSOC in __resctrl_sched_in(). In the case of MPAM,
>    PARTID[_I,D] and PMG are a simple function of closid, rmid, and an
>    internal global.

It is not obvious where this is going ... it sounds as though MPAM will
need three parameters for this function?

> 
>  - Update terminology containing explicit references to the PQR_ASSOC
>    register.
> 
> Signed-off-by: Peter Newman <peternewman@google.com>
> ---
>  arch/x86/kernel/cpu/resctrl/core.c        | 11 ++++++++---
>  arch/x86/kernel/cpu/resctrl/internal.h    |  6 +++---
>  arch/x86/kernel/cpu/resctrl/pseudo_lock.c |  4 ++--
>  arch/x86/kernel/cpu/resctrl/rdtgroup.c    | 18 +++++++++---------
>  include/linux/resctrl.h                   | 11 +++++++++++
>  5 files changed, 33 insertions(+), 17 deletions(-)
> 
> diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
> index ae5878d748fc..4cc584754f8b 100644
> --- a/arch/x86/kernel/cpu/resctrl/core.c
> +++ b/arch/x86/kernel/cpu/resctrl/core.c
> @@ -37,12 +37,12 @@
>  static DEFINE_MUTEX(domain_list_lock);
>  
>  /*
> - * The cached resctrl_pqr_state is strictly per CPU and can never be
> + * The cached resctrl_cpu_state is strictly per CPU and can never be
>   * updated from a remote CPU. Functions which modify the state
>   * are called with interrupts disabled and no preemption, which
>   * is sufficient for the protection.
>   */
> -DEFINE_PER_CPU(struct resctrl_pqr_state, pqr_state);
> +DEFINE_PER_CPU(struct resctrl_cpu_state, resctrl_state);
>  
>  /*
>   * Used to store the max resource name width and max resource data width
> @@ -309,6 +309,11 @@ static void rdt_get_cdp_l2_config(void)
>  	rdt_get_cdp_config(RDT_RESOURCE_L2);
>  }
>  
> +void resctrl_arch_update_cpu(u32 ctrl_id, u32 mon_id)

We already started using the names ctrl_hw_id and mon_hw_id. Could this be consistent
with that?

> +{
> +	wrmsr(MSR_IA32_PQR_ASSOC, mon_id, ctrl_id);
> +}
> +
>  static void
>  mba_wrmsr_amd(struct rdt_domain *d, struct msr_param *m, struct rdt_resource *r)
>  {
> @@ -598,7 +603,7 @@ static void domain_remove_cpu(int cpu, struct rdt_resource *r)
>  
>  static void clear_closid_rmid(int cpu)
>  {
> -	struct resctrl_pqr_state *state = this_cpu_ptr(&pqr_state);
> +	struct resctrl_cpu_state *state = this_cpu_ptr(&resctrl_state);
>  
>  	state->default_group = &rdtgroup_default;
>  	state->cur_closid = RESCTRL_RESERVED_CLOSID;
> diff --git a/arch/x86/kernel/cpu/resctrl/internal.h b/arch/x86/kernel/cpu/resctrl/internal.h
> index 0ba0d2428780..e30f42744ac7 100644
> --- a/arch/x86/kernel/cpu/resctrl/internal.h
> +++ b/arch/x86/kernel/cpu/resctrl/internal.h
> @@ -332,7 +332,7 @@ struct rftype {
>  };
>  
>  /**
> - * struct resctrl_pqr_state - State cache for the PQR MSR
> + * struct resctrl_cpu_state - State cache for allocation/monitoring group IDs
>   * @cur_rmid:		The cached Resource Monitoring ID
>   * @cur_closid:		The cached Class Of Service ID
>   * @default_group:	The user assigned rdtgroup
> @@ -340,13 +340,13 @@ struct rftype {
>   * The cache also helps to avoid pointless updates if the value does
>   * not change.
>   */
> -struct resctrl_pqr_state {
> +struct resctrl_cpu_state {
>  	u32			cur_rmid;
>  	u32			cur_closid;
>  	struct rdtgroup		*default_group;
>  };
>  
> -DECLARE_PER_CPU(struct resctrl_pqr_state, pqr_state);
> +DECLARE_PER_CPU(struct resctrl_cpu_state, resctrl_state);
>  
>  /**
>   * struct mbm_state - status for each MBM counter in each domain
> diff --git a/arch/x86/kernel/cpu/resctrl/pseudo_lock.c b/arch/x86/kernel/cpu/resctrl/pseudo_lock.c
> index 884b88e25141..ca1805a566cb 100644
> --- a/arch/x86/kernel/cpu/resctrl/pseudo_lock.c
> +++ b/arch/x86/kernel/cpu/resctrl/pseudo_lock.c
> @@ -480,8 +480,8 @@ static int pseudo_lock_fn(void *_rdtgrp)
>  	 */
>  	saved_msr = __rdmsr(MSR_MISC_FEATURE_CONTROL);
>  	__wrmsr(MSR_MISC_FEATURE_CONTROL, prefetch_disable_bits, 0x0);
> -	closid_p = this_cpu_read(pqr_state.cur_closid);
> -	rmid_p = this_cpu_read(pqr_state.cur_rmid);
> +	closid_p = this_cpu_read(resctrl_state.cur_closid);
> +	rmid_p = this_cpu_read(resctrl_state.cur_rmid);
>  	mem_r = plr->kmem;
>  	size = plr->size;
>  	line_size = plr->line_size;
> diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> index badf181c8cbb..bd067f7ed5b6 100644
> --- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> +++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> @@ -112,7 +112,7 @@ void rdt_staged_configs_clear(void)
>   * + We can simply set current's closid to assign a task to a resource
>   *   group.
>   * + Context switch code can avoid extra memory references deciding which
> - *   CLOSID to load into the PQR_ASSOC MSR
> + *   CLOSID to load into the CPU
>   * - We give up some options in configuring resource groups across multi-socket
>   *   systems.
>   * - Our choices on how to configure each resource become progressively more
> @@ -347,7 +347,7 @@ static int rdtgroup_cpus_show(struct kernfs_open_file *of,
>   */
>  void __resctrl_sched_in(struct task_struct *tsk)
>  {
> -	struct resctrl_pqr_state *state = this_cpu_ptr(&pqr_state);
> +	struct resctrl_cpu_state *state = this_cpu_ptr(&resctrl_state);
>  	u32 closid = state->cur_closid;
>  	u32 rmid = state->cur_rmid;
>  	struct rdtgroup *rgrp;
> @@ -401,7 +401,7 @@ void __resctrl_sched_in(struct task_struct *tsk)
>  	if (closid != state->cur_closid || rmid != state->cur_rmid) {
>  		state->cur_closid = closid;
>  		state->cur_rmid = rmid;
> -		wrmsr(MSR_IA32_PQR_ASSOC, rmid, closid);
> +		resctrl_arch_update_cpu(closid, rmid);
>  	}
>  }
>  
> @@ -416,7 +416,7 @@ static void update_cpu_closid_rmid(void *info)
>  	struct rdtgroup *r = info;
>  
>  	if (r)
> -		this_cpu_write(pqr_state.default_group, r);
> +		this_cpu_write(resctrl_state.default_group, r);
>  
>  	/*
>  	 * We cannot unconditionally write the MSR because the current
> @@ -635,8 +635,8 @@ static void rdtgroup_remove(struct rdtgroup *rdtgrp)
>  static void _update_task_closid_rmid(void *task)
>  {
>  	/*
> -	 * If the task is still current on this CPU, update PQR_ASSOC MSR.
> -	 * Otherwise, the MSR is updated when the task is scheduled in.
> +	 * If the task is still current on this CPU, update the current ctrl
> +	 * group. Otherwise, the CPU is updated when the task is scheduled in.

I think control and/or monitor group could be updated, not just control group?

>  	 */
>  	if (task == current)
>  		resctrl_sched_in(task);
> @@ -3005,7 +3005,7 @@ static void rmdir_all_sub(void)
>  		else
>  			rdtgroup_remove(rdtgrp);
>  	}
> -	/* Notify online CPUs to update per cpu storage and PQR_ASSOC MSR */
> +	/* Update online CPUs to propagate group membership changes. */
>  	update_closid_rmid(cpu_online_mask, &rdtgroup_default);
>  
>  	kernfs_remove(kn_info);
> @@ -3688,7 +3688,7 @@ static int rdtgroup_rmdir_mon(struct rdtgroup *rdtgrp, cpumask_var_t tmpmask)
>  
>  	/* Update per cpu rmid of the moved CPUs first */
>  	for_each_cpu(cpu, &rdtgrp->cpu_mask)
> -		per_cpu(pqr_state.default_group, cpu) = prdtgrp;
> +		per_cpu(resctrl_state.default_group, cpu) = prdtgrp;
>  	/*
>  	 * Update the MSR on moved CPUs and CPUs which have moved
>  	 * task running on them.
> @@ -3732,7 +3732,7 @@ static int rdtgroup_rmdir_ctrl(struct rdtgroup *rdtgrp, cpumask_var_t tmpmask)
>  
>  	/* Update per cpu closid and rmid of the moved CPUs first */
>  	for_each_cpu(cpu, &rdtgrp->cpu_mask)
> -		per_cpu(pqr_state.default_group, cpu) = &rdtgroup_default;
> +		per_cpu(resctrl_state.default_group, cpu) = &rdtgroup_default;
>  
>  	/*
>  	 * Update the MSR on moved CPUs and CPUs which have moved
> diff --git a/include/linux/resctrl.h b/include/linux/resctrl.h
> index b2af1fbc7aa1..a6b1b13cc769 100644
> --- a/include/linux/resctrl.h
> +++ b/include/linux/resctrl.h
> @@ -306,6 +306,17 @@ extern unsigned int resctrl_rmid_realloc_limit;
>  
>  DECLARE_STATIC_KEY_FALSE(rdt_enable_key);
>  
> +/*

For valid kerneldoc this needs to be /**

> + * resctrl_arch_update_cpu() - Make control and monitoring group IDs effective
> + *			       on the current CPU
> + *
> + * @ctrl_id:	An identifier for the control group which is to be used on the
> + *		current CPU.

Same comment here about naming.

> + * @mon_id:	An identifier for the monitoring group which is to be used on
> + *		the current CPU.
> + */
> +void resctrl_arch_update_cpu(u32 ctrl_id, u32 mon_id);
> +
>  void __resctrl_sched_in(struct task_struct *tsk);
>  
>  /*

Reinette

* Re: [PATCH v1 6/6] x86/resctrl: Don't search tasklist in mongroup rename
  2024-03-25 17:27 ` [PATCH v1 6/6] x86/resctrl: Don't search tasklist in mongroup rename Peter Newman
@ 2024-04-04 23:18   ` Reinette Chatre
  0 siblings, 0 replies; 27+ messages in thread
From: Reinette Chatre @ 2024-04-04 23:18 UTC (permalink / raw)
  To: Peter Newman, Fenghua Yu, James Morse
  Cc: Stephane Eranian, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, x86, H. Peter Anvin, Peter Zijlstra, Juri Lelli,
	Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Daniel Bristot de Oliveira, Valentin Schneider,
	Uros Bizjak, Mike Rapoport, Kirill A. Shutemov, Rick Edgecombe,
	Xin Li, Babu Moger, Shaopeng Tan, Maciej Wieczor-Retman,
	Jens Axboe, Christian Brauner, Oleg Nesterov, Andrew Morton,
	Tycho Andersen, Nicholas Piggin, Beau Belgrave,
	Matthew Wilcox (Oracle),
	linux-kernel

Hi Peter,

On 3/25/2024 10:27 AM, Peter Newman wrote:
> Iterating over all task_structs while read-locking the tasklist_lock
> results in significant task creation/destruction latency. Back-to-back
> move operations can thus be disastrous to the responsiveness of
> threadpool-based services.

Please be specific with claims.

> 
> Now that the CLOSID is determined indirectly through a reference to the
> task's current rdtgroup, it is not longer necessary to update the CLOSID
> in all tasks belonging to the moved mongroup. The context switch handler
> just needs to be prepared for concurrent writes to the parent pointer.

(insert text explanation how context switch handler is prepared for
concurrent writes)

> 
> Signed-off-by: Peter Newman <peternewman@google.com>
> ---
>  arch/x86/kernel/cpu/resctrl/rdtgroup.c | 30 +++++++-------------------
>  1 file changed, 8 insertions(+), 22 deletions(-)
> 
> diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> index bd067f7ed5b6..a007c0ec478f 100644
> --- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> +++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> @@ -388,8 +388,11 @@ void __resctrl_sched_in(struct task_struct *tsk)
>  				 * by a full barrier and synchronous IPI
>  				 * broadcast before proceeding to free the
>  				 * group.
> +				 *
> +				 * parent can be concurrently updated to a new
> +				 * group as a result of mongrp_reparent().
>  				 */
> -				closid = rgrp->mon.parent->closid;
> +				closid = READ_ONCE(rgrp->mon.parent)->closid;
>  		} else {
>  			closid = rgrp->closid;
>  		}
> @@ -3809,8 +3812,7 @@ static int rdtgroup_rmdir(struct kernfs_node *kn)
>   * Monitoring data for the group is unaffected by this operation.
>   */
>  static void mongrp_reparent(struct rdtgroup *rdtgrp,
> -			    struct rdtgroup *new_prdtgrp,
> -			    cpumask_var_t cpus)
> +			    struct rdtgroup *new_prdtgrp)
>  {
>  	struct rdtgroup *prdtgrp = rdtgrp->mon.parent;
>  
> @@ -3825,13 +3827,10 @@ static void mongrp_reparent(struct rdtgroup *rdtgrp,
>  	list_move_tail(&rdtgrp->mon.crdtgrp_list,
>  		       &new_prdtgrp->mon.crdtgrp_list);
>  
> -	rdtgrp->mon.parent = new_prdtgrp;
> +	WRITE_ONCE(rdtgrp->mon.parent, new_prdtgrp);
>  	rdtgrp->closid = new_prdtgrp->closid;
>  
> -	/* Propagate updated closid to all tasks in this group. */
> -	rdt_move_group_tasks(rdtgrp, rdtgrp, cpus);
> -
> -	update_closid_rmid(cpus, NULL);
> +	update_closid_rmid(cpu_online_mask, NULL);

This deserves a mention in changelog.

There is a section in the documentation, "Resource alloc and monitor groups"
that describes moving monitor groups. Unless you receive better ideas to address
the concern about this impact on CPU-isolated realtime workloads I would like
to suggest that you add a snippet there about the consequences of a move.
 
Reinette

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v1 5/6] x86/resctrl: Abstract PQR_ASSOC from generic code
  2024-03-25 17:27 ` [PATCH v1 5/6] x86/resctrl: Abstract PQR_ASSOC from generic code Peter Newman
  2024-04-04 23:16   ` Reinette Chatre
@ 2024-04-05 17:15   ` Reinette Chatre
  2024-04-07 19:21   ` Reinette Chatre
  2 siblings, 0 replies; 27+ messages in thread
From: Reinette Chatre @ 2024-04-05 17:15 UTC (permalink / raw)
  To: Peter Newman, Fenghua Yu, James Morse
  Cc: Stephane Eranian, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, x86, H. Peter Anvin, Peter Zijlstra, Juri Lelli,
	Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Daniel Bristot de Oliveira, Valentin Schneider,
	Uros Bizjak, Mike Rapoport, Kirill A. Shutemov, Rick Edgecombe,
	Xin Li, Babu Moger, Shaopeng Tan, Maciej Wieczor-Retman,
	Jens Axboe, Christian Brauner, Oleg Nesterov, Andrew Morton,
	Tycho Andersen, Nicholas Piggin, Beau Belgrave,
	Matthew Wilcox (Oracle),
	linux-kernel

Hi Peter,

On 3/25/2024 10:27 AM, Peter Newman wrote:
> While CLOSID and RMID originated in RDT, the concept applies to other
> architectures, as it's standard to write allocation and monitoring IDs
> into per-CPU registers.
> 
>  - Rename resctrl_pqr_state and pqr_state to be more
>    architecturally-neutral.
> 
>  - Introduce resctrl_arch_update_cpu() to replace the explicit write to
>    MSR_IA32_PQR_ASSOC in __resctrl_sched_in(). In the case of MPAM,
>    PARTID[_I,D] and PMG are a simple function of closid, rmid, and an
>    internal global.
> 
>  - Update terminology containing explicit references to the PQR_ASSOC
>    register.

fyi ... I just noticed while looking at the MPAM patches that there
is an instance in the comments of update_closid_rmid() where this
was misspelled (PGR_ASSOC) and thus not picked up by your rename
script.

Reinette

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v1 0/6] x86/resctrl: Avoid searching tasklist during mongrp_reparent
  2024-04-04 23:09 ` [PATCH v1 0/6] x86/resctrl: Avoid searching tasklist during mongrp_reparent Reinette Chatre
@ 2024-04-05 21:30   ` Peter Newman
  2024-04-07 19:13     ` Reinette Chatre
  0 siblings, 1 reply; 27+ messages in thread
From: Peter Newman @ 2024-04-05 21:30 UTC (permalink / raw)
  To: Reinette Chatre
  Cc: Fenghua Yu, James Morse, Stephane Eranian, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, Dave Hansen, x86, H. Peter Anvin,
	Peter Zijlstra, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, Mel Gorman,
	Daniel Bristot de Oliveira, Valentin Schneider, Uros Bizjak,
	Mike Rapoport, Kirill A. Shutemov, Rick Edgecombe, Xin Li,
	Babu Moger, Shaopeng Tan, Maciej Wieczor-Retman, Jens Axboe,
	Christian Brauner, Oleg Nesterov, Andrew Morton, Tycho Andersen,
	Nicholas Piggin, Beau Belgrave, Matthew Wilcox (Oracle),
	linux-kernel

Hi Reinette,

On Thu, Apr 4, 2024 at 4:09 PM Reinette Chatre
<reinette.chatre@intel.com> wrote:
> On 3/25/2024 10:27 AM, Peter Newman wrote:
> > I've been working with users of the recently-added mongroup rename
> > operation[1] who have observed the impact of back-to-back operations on
> > latency-sensitive, thread pool-based services. Because changing a
> > resctrl group's CLOSID (or RMID) requires all task_structs in the system
> > to be inspected with the tasklist_lock read-locked, a series of move
> > operations can block out thread creation for long periods of time, as
> > creating threads needs to write-lock the tasklist_lock.
>
> Could you please give some insight into the delays experienced? "long
> periods of time" mean different things to different people and this
> series seems to get more ominous as it progresses with the cover letter
> starting with "long periods of time" and by the time the final patch
> appears it has become "disastrous".

There was an incident where 99.999p tail latencies of a service
increased from 100 milliseconds to over 2 seconds when the container
manager switched from our legacy downstream CLOSID-reuse technique[1]
to mongrp_rename().

A more focused study that benchmarked creating 128 threads with
pthread_create() on a production host found that moving mongroups
unrelated to any of the benchmark threads increased the completion CPU
time from 30ms to 183ms. Profiling showed that the average contention
time on the tasklist_lock was about 70ms while mongroup move
operations were taking place.

It's difficult for me to access real production workloads, but I
estimated a crude figure by measuring the task time of "wc -l
/sys/fs/resctrl" with perf stat on a relatively idle Intel(R) Xeon(R)
Platinum 8273CL CPU @ 2.20GHz. As I increased the thread count, it
converged to a line where every additional 1000 threads added about 1
millisecond.

Incorporating kernfs_rename() into the solution for changing a group's
class of service also contributes a lot of overhead (about 90% of a
mongroup rename seems to be spent here), but the global impact is far
less than that of the tasklist_lock contention.


>
> >
> > To avoid this issue, this series replaces the CLOSID and RMID values
> > cached in the task_struct with a pointer to the task's rdtgroup, through
> > which the current CLOSID and RMID can be obtained indirectly during a
>
> On a high level this seems fair but using a pointer to the rdtgroup in the
> task_struct means that the scheduling code needs to dereference that pointer
> without any lock held. The changes do take great care and it
> would be valuable to clearly document how these accesses are safe. (please
> see patch #4 and #6)

I'll clarify that the existing technique for revoking a CLOSID or RMID
through a blocking IPI solves most of the problem already.

>
> > context switch. Updating a group's ID then only requires the current
> > task to be switched back in on all CPUs. On server hosts with very large
> > thread counts, this is much less disruptive than making thread creation
> > globally unavailable. However, this is less desirable on CPU-isolated,
> > realtime workloads, so I am interested in suggestions on how to reach a
> > compromise for the two use cases.
>
> As I understand this only impacts moving a monitor group? To me this sounds
> like a user space triggered event associated with a particular use case that
> may not be relevant to the workloads that you refer to. I think this could be
> something that can be documented for users with this type of workloads.
> (please see patch #6)

All of the existing rmdir cases seem to have the same problem, but
they must not be used frequently enough for any concerns to be raised.

It seems routine for a host's workload to be increased until memory
bandwidth saturates, so allocation restrictions are applied and
removed rather frequently.

>
> >
> > Also note that the soft-ABMC[2] proposal is based on swapping the RMID
> > values in mongroups when monitors are assigned to them, which will
> > result in the same slowdown as was encountered with re-parenting
> > monitoring groups.
> >
> > Using pointers to rdtgroups from the task_struct been used internally at
>
> "have been used internally"?

Thanks, I'll fix that.

>
> > Google for a few years to support an alternate CLOSID management
> > technique[3], which was replaced by mongroup rename. Usage of this
> > technique during production revealed an issue where threads in an
> > exiting process become unreachable via for_each_process_thread() before
> > destruction, but can still switch in and out. Consequently, an rmdir
> > operation can fail to remove references to an rdtgroup before it frees
> > it, causing the pointer to freed memory to be dereferenced after the
> > structure is freed. This would result in invalid CLOSID/RMID values
> > being written into MSRs, causing an exception. The workaround for this
> > is to clear a task's rdtgroup pointer when it exits.
>
> This does not sound like a problem unique to this new implementation but the
> "invalid CLOSID/RMID values being written into MSRs" sounds like something
> that happens today? Probably not worth pulling out for this reason because
> in its current form the exiting process will keep using the original
> CLOSID/RMID that have no issue being written to MSRs.

Today the values are at worst stale and in the range of valid CLOSIDs
and RMIDs. If __resctrl_sched_in() dereferences a freed struct
rdtgroup pointer, the resulting value could be an arbitrary u32, the
vast majority of which are not valid CLOSID/RMID values.

-Peter


[1] https://lore.kernel.org/lkml/CALPaoCj-zav4x6H3ffXo_O+RFan8Qb-uLy-DdtkaQTfuxY4a0w@mail.gmail.com/

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v1 1/6] x86/resctrl: Move __resctrl_sched_in() out-of-line
  2024-04-04 23:09   ` Reinette Chatre
@ 2024-04-05 22:04     ` Peter Newman
  2024-04-07 19:21       ` Reinette Chatre
  0 siblings, 1 reply; 27+ messages in thread
From: Peter Newman @ 2024-04-05 22:04 UTC (permalink / raw)
  To: Reinette Chatre
  Cc: Fenghua Yu, James Morse, Stephane Eranian, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, Dave Hansen, x86, H. Peter Anvin,
	Peter Zijlstra, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, Mel Gorman,
	Daniel Bristot de Oliveira, Valentin Schneider, Uros Bizjak,
	Mike Rapoport, Kirill A. Shutemov, Rick Edgecombe, Xin Li,
	Babu Moger, Shaopeng Tan, Maciej Wieczor-Retman, Jens Axboe,
	Christian Brauner, Oleg Nesterov, Andrew Morton, Tycho Andersen,
	Nicholas Piggin, Beau Belgrave, Matthew Wilcox (Oracle),
	linux-kernel

Hi Reinette,

On Thu, Apr 4, 2024 at 4:09 PM Reinette Chatre
<reinette.chatre@intel.com> wrote:
>
> Hi Peter,
>
> On 3/25/2024 10:27 AM, Peter Newman wrote:
> > __resctrl_sched_in() is unable to dereference a struct rdtgroup pointer
> > when defined inline because rdtgroup is a private structure defined in
> > internal.h.
>
> Being inline has nothing to do with whether it can reference a struct rdtgroup
> pointer, no?

No, but it has a lot to do with whether it can dereference a struct
rdtgroup pointer in order to obtain a CLOSID or RMID, as doing so
inline would pull the definitions of struct rdtgroup and struct
mongroup into an external header. Before doing that, I would want to
make sure implementing __resctrl_sched_in() inline is actually adding value.

>
> >
> > This function is defined inline to avoid impacting context switch
> > performance for the majority of users who aren't using resctrl at all.
> > These benefits can already be realized without access to internal
> > resctrl data structures.
> >
> > The logic of performing an out-of-line call to __resctrl_sched_in() only
> > when resctrl is mounted is architecture-independent, so the inline
> > definition of resctrl_sched_in() can be moved into linux/resctrl.h.
> >
> > Signed-off-by: Peter Newman <peternewman@google.com>
>
> ...
>
> > diff --git a/arch/x86/kernel/process_32.c b/arch/x86/kernel/process_32.c
> > index 0917c7f25720..8f92a87d381d 100644
> > --- a/arch/x86/kernel/process_32.c
> > +++ b/arch/x86/kernel/process_32.c
> > @@ -38,6 +38,7 @@
> >  #include <linux/io.h>
> >  #include <linux/kdebug.h>
> >  #include <linux/syscalls.h>
> > +#include <linux/resctrl.h>
> >
> >  #include <asm/ldt.h>
> >  #include <asm/processor.h>
> > @@ -51,7 +52,6 @@
> >  #include <asm/debugreg.h>
> >  #include <asm/switch_to.h>
> >  #include <asm/vm86.h>
> > -#include <asm/resctrl.h>
> >  #include <asm/proto.h>
> >
> >  #include "process.h"
> > diff --git a/arch/x86/kernel/process_64.c b/arch/x86/kernel/process_64.c
> > index 7062b84dd467..d442269bb25b 100644
> > --- a/arch/x86/kernel/process_64.c
> > +++ b/arch/x86/kernel/process_64.c
> > @@ -40,6 +40,7 @@
> >  #include <linux/ftrace.h>
> >  #include <linux/syscalls.h>
> >  #include <linux/iommu.h>
> > +#include <linux/resctrl.h>
> >
> >  #include <asm/processor.h>
> >  #include <asm/pkru.h>
>
> With a change like this we should be very careful about what is included when
> the kernel is not built with resctrl and in its current form linux/resctrl.h is
> not ready for this.
>
> If CONFIG_X86_CPU_RESCTRL is not set linux/resctrl.h should have the bare minimum,
> just like asm/resctrl.h has.
>
> > @@ -53,7 +54,6 @@
> >  #include <asm/switch_to.h>
> >  #include <asm/xen/hypervisor.h>
> >  #include <asm/vdso.h>
> > -#include <asm/resctrl.h>
> >  #include <asm/unistd.h>
> >  #include <asm/fsgsbase.h>
> >  #include <asm/fred.h>
> > diff --git a/include/linux/resctrl.h b/include/linux/resctrl.h
> > index a365f67131ec..62d607939a73 100644
> > --- a/include/linux/resctrl.h
> > +++ b/include/linux/resctrl.h
> > @@ -304,4 +304,25 @@ void resctrl_arch_reset_rmid_all(struct rdt_resource *r, struct rdt_domain *d);
> >  extern unsigned int resctrl_rmid_realloc_threshold;
> >  extern unsigned int resctrl_rmid_realloc_limit;
> >
> > +DECLARE_STATIC_KEY_FALSE(rdt_enable_key);
> > +
> > +void __resctrl_sched_in(struct task_struct *tsk);
> > +
> > +/*
> > + * resctrl_sched_in() - Assigns the incoming task's control/monitor IDs to the
> > + *                   current CPU
> > + *
> > + * To minimize impact to the scheduler hot path, this will stay as no-op unless
> > + * running on a system supporting resctrl and the filesystem is mounted.
> > + *
> > + * Must be called with preemption disabled.
> > + */
> > +static inline void resctrl_sched_in(struct task_struct *tsk)
> > +{
> > +#ifdef CONFIG_X86_CPU_RESCTRL
> > +     if (static_branch_likely(&rdt_enable_key))
> > +             __resctrl_sched_in(tsk);
> > +#endif
> > +}
> > +
>
> include/linux/resctrl.h should rather be divided to accommodate code
> as below:
>
> #ifdef CONFIG_X86_CPU_RESCTRL
>
> static inline void resctrl_sched_in(struct task_struct *tsk)
> {
>         if (static_branch_likely(&rdt_enable_key))
>                 __resctrl_sched_in(tsk);
> }
>
> #else
>
> static inline void resctrl_sched_in(struct task_struct *tsk) {}
>
> #endif
>
> so that core code does not get anything unnecessary when CONFIG_X86_CPU_RESCTRL
> is not set.

Will do.

Thanks!
-Peter

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v1 3/6] x86/resctrl: Disallow mongroup rename on MPAM
  2024-04-04 23:11   ` Reinette Chatre
@ 2024-04-05 22:10     ` Peter Newman
  0 siblings, 0 replies; 27+ messages in thread
From: Peter Newman @ 2024-04-05 22:10 UTC (permalink / raw)
  To: Reinette Chatre
  Cc: Fenghua Yu, James Morse, Stephane Eranian, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, Dave Hansen, x86, H. Peter Anvin,
	Peter Zijlstra, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, Mel Gorman,
	Daniel Bristot de Oliveira, Valentin Schneider, Uros Bizjak,
	Mike Rapoport, Kirill A. Shutemov, Rick Edgecombe, Xin Li,
	Babu Moger, Shaopeng Tan, Maciej Wieczor-Retman, Jens Axboe,
	Christian Brauner, Oleg Nesterov, Andrew Morton, Tycho Andersen,
	Nicholas Piggin, Beau Belgrave, Matthew Wilcox (Oracle),
	linux-kernel

Hi Reinette,

On Thu, Apr 4, 2024 at 4:11 PM Reinette Chatre
<reinette.chatre@intel.com> wrote:
>
> Hi Peter,
>
> On 3/25/2024 10:27 AM, Peter Newman wrote:
> > Moving a monitoring group to a different parent control assumes that the
> > monitors will not be impacted. This is not the case on MPAM where the
> > PMG is an extension of the PARTID.
> >
> > Detect this situation by requiring the change in CLOSID not to affect
> > the result of resctrl_arch_rmid_idx_encode(), otherwise return
> > -EOPNOTSUPP.
> >
>
> Thanks for catching this. This seems out of place in this series. It sounds
> more like an independent fix that should go in separately.

I asserted in a comment in a patch later in the series that the
mongroup parent pointer never changes on MPAM, then decided to follow
up on whether it was actually true, so it's only here because this
series depends on it. I'll post it again separately with the fix you
requested below.

Thanks!
-Peter

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v1 0/6] x86/resctrl: Avoid searching tasklist during mongrp_reparent
  2024-04-05 21:30   ` Peter Newman
@ 2024-04-07 19:13     ` Reinette Chatre
  0 siblings, 0 replies; 27+ messages in thread
From: Reinette Chatre @ 2024-04-07 19:13 UTC (permalink / raw)
  To: Peter Newman
  Cc: Fenghua Yu, James Morse, Stephane Eranian, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, Dave Hansen, x86, H. Peter Anvin,
	Peter Zijlstra, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, Mel Gorman,
	Daniel Bristot de Oliveira, Valentin Schneider, Uros Bizjak,
	Mike Rapoport, Kirill A. Shutemov, Rick Edgecombe, Xin Li,
	Babu Moger, Shaopeng Tan, Maciej Wieczor-Retman, Jens Axboe,
	Christian Brauner, Oleg Nesterov, Andrew Morton, Tycho Andersen,
	Nicholas Piggin, Beau Belgrave, Matthew Wilcox (Oracle),
	linux-kernel

Hi Peter,

On 4/5/2024 2:30 PM, Peter Newman wrote:
> Hi Reinette,
> 
> On Thu, Apr 4, 2024 at 4:09 PM Reinette Chatre
> <reinette.chatre@intel.com> wrote:
>> On 3/25/2024 10:27 AM, Peter Newman wrote:
>>> I've been working with users of the recently-added mongroup rename
>>> operation[1] who have observed the impact of back-to-back operations on
>>> latency-sensitive, thread pool-based services. Because changing a
>>> resctrl group's CLOSID (or RMID) requires all task_structs in the system
>>> to be inspected with the tasklist_lock read-locked, a series of move
>>> operations can block out thread creation for long periods of time, as
>>> creating threads needs to write-lock the tasklist_lock.
>>
>> Could you please give some insight into the delays experienced? "long
>> periods of time" mean different things to different people and this
>> series seems to get more ominous as it progresses with the cover letter
>> starting with "long periods of time" and by the time the final patch
>> appears it has become "disastrous".
> 
> There was an incident where 99.999p tail latencies of a service
> increased from 100 milliseconds to over 2 seconds when the container
> manager switched from our legacy downstream CLOSID-reuse technique[1]
> to mongrp_rename().
> 
> A more focused study that benchmarked creating 128 threads with
> pthread_create() on a production host found that moving mongroups
> unrelated to any of the benchmark threads increased the completion CPU
> time from 30ms to 183ms. Profiling showed that the average contention
> time on the tasklist_lock was about 70ms while mongroup move
> operations were taking place.
> 
> It's difficult for me to access real production workloads, but I
> estimated a crude figure by measuring the task time of "wc -l
> /sys/fs/resctrl" with perf stat on a relatively idle Intel(R) Xeon(R)
> Platinum 8273CL CPU @ 2.20GHz. As I increased the thread count, it
> converged to a line where every additional 1000 threads added about 1
> millisecond.

Thank you very much for capturing this. Could you please include this in
next posting? This data motivates this work significantly more than
terms that are not measurable.

> Incorporating kernfs_rename() into the solution for changing a group's
> class of service also contributes a lot of overhead (about 90% of a
> mongroup rename seems to be spent here), but the global impact is far
> less than that of the tasklist_lock contention.

Is the kernfs_rename() overhead in an acceptable range?

>>> context switch. Updating a group's ID then only requires the current
>>> task to be switched back in on all CPUs. On server hosts with very large
>>> thread counts, this is much less disruptive than making thread creation
>>> globally unavailable. However, this is less desirable on CPU-isolated,
>>> realtime workloads, so I am interested in suggestions on how to reach a
>>> compromise for the two use cases.
>>
>> As I understand this only impacts moving a monitor group? To me this sounds
>> like a user space triggered event associated with a particular use case that
>> may not be relevant to the workloads that you refer to. I think this could be
>> something that can be documented for users with this type of workloads.
>> (please see patch #6)
> 
> All of the existing rmdir cases seem to have the same problem, but
> they must not be used frequently enough for any concerns to be raised.
> 
> It seems routine for a host's workload to be increased until memory
> bandwidth saturates, so allocation restrictions are applied and
> removed rather frequently.

Thank you. 

Reinette

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v1 1/6] x86/resctrl: Move __resctrl_sched_in() out-of-line
  2024-04-05 22:04     ` Peter Newman
@ 2024-04-07 19:21       ` Reinette Chatre
  2024-04-08 19:05         ` Peter Newman
  0 siblings, 1 reply; 27+ messages in thread
From: Reinette Chatre @ 2024-04-07 19:21 UTC (permalink / raw)
  To: Peter Newman
  Cc: Fenghua Yu, James Morse, Stephane Eranian, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, Dave Hansen, x86, H. Peter Anvin,
	Peter Zijlstra, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, Mel Gorman,
	Daniel Bristot de Oliveira, Valentin Schneider, Uros Bizjak,
	Mike Rapoport, Kirill A. Shutemov, Rick Edgecombe, Xin Li,
	Babu Moger, Shaopeng Tan, Maciej Wieczor-Retman, Jens Axboe,
	Christian Brauner, Oleg Nesterov, Andrew Morton, Tycho Andersen,
	Nicholas Piggin, Beau Belgrave, Matthew Wilcox (Oracle),
	linux-kernel

Hi Peter,

On 4/5/2024 3:04 PM, Peter Newman wrote:
> Hi Reinette,
> 
> On Thu, Apr 4, 2024 at 4:09 PM Reinette Chatre
> <reinette.chatre@intel.com> wrote:
>>
>> Hi Peter,
>>
>> On 3/25/2024 10:27 AM, Peter Newman wrote:
>>> __resctrl_sched_in() is unable to dereference a struct rdtgroup pointer
>>> when defined inline because rdtgroup is a private structure defined in
>>> internal.h.
>>
>> Being inline has nothing to do with whether it can reference a struct rdtgroup
>> pointer, no?
> 
> No, but it has a lot to do with whether it can de-reference a struct
> rdtgroup pointer in order to obtain a CLOSID or RMID, as doing so
> would pull the definitions for struct rdtgroup and struct mongroup
> into an external header. Before doing so, I would want to make sure
> implementing  __resctrl_sched_in() inline is actually adding value.

I expect that each architecture would need architecture-specific task
switching code, and by pointing to the rdtgroup from the task_struct every
architecture would need to know how to reach the CLOSID/RMID within.

Having architectures reach into the private fs data is not ideal,
so we should consider making just the portion of rdtgroup information
required to support the switching code accessible to the architectures, not the
entire rdtgroup and mongroup structures.

...

>>> +static inline void resctrl_sched_in(struct task_struct *tsk)
>>> +{
>>> +#ifdef CONFIG_X86_CPU_RESCTRL
>>> +     if (static_branch_likely(&rdt_enable_key))
>>> +             __resctrl_sched_in(tsk);
>>> +#endif
>>> +}
>>> +
>>
>> include/linux/resctrl.h should rather be divided to accommodate code
>> as below:
>>
>> #ifdef CONFIG_X86_CPU_RESCTRL
>>
>> static inline void resctrl_sched_in(struct task_struct *tsk)
>> {
>>         if (static_branch_likely(&rdt_enable_key))
>>                 __resctrl_sched_in(tsk);
>> }
>>
>> #else
>>
>> static inline void resctrl_sched_in(struct task_struct *tsk) {}
>>
>> #endif
>>
>> so that core code does not get anything unnecessary when CONFIG_X86_CPU_RESCTRL
>> is not set.
> 
> Will do.

I think this needs more thought. rdt_enable_key is x86 specific now and should not
be in the fs code. Every architecture will have its own task switch code, with
__resctrl_sched_in() belonging to x86 and thus not something to be directly called
from the fs code.

Reinette


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v1 5/6] x86/resctrl: Abstract PQR_ASSOC from generic code
  2024-03-25 17:27 ` [PATCH v1 5/6] x86/resctrl: Abstract PQR_ASSOC from generic code Peter Newman
  2024-04-04 23:16   ` Reinette Chatre
  2024-04-05 17:15   ` Reinette Chatre
@ 2024-04-07 19:21   ` Reinette Chatre
  2 siblings, 0 replies; 27+ messages in thread
From: Reinette Chatre @ 2024-04-07 19:21 UTC (permalink / raw)
  To: Peter Newman, Fenghua Yu, James Morse
  Cc: Stephane Eranian, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, x86, H. Peter Anvin, Peter Zijlstra, Juri Lelli,
	Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Daniel Bristot de Oliveira, Valentin Schneider,
	Uros Bizjak, Mike Rapoport, Kirill A. Shutemov, Rick Edgecombe,
	Xin Li, Babu Moger, Shaopeng Tan, Maciej Wieczor-Retman,
	Jens Axboe, Christian Brauner, Oleg Nesterov, Andrew Morton,
	Tycho Andersen, Nicholas Piggin, Beau Belgrave,
	Matthew Wilcox (Oracle),
	linux-kernel

Hi Peter,

On 3/25/2024 10:27 AM, Peter Newman wrote:
> While CLOSID and RMID originated in RDT, the concept applies to other
> architectures, as it's standard to write allocation and monitoring IDs
> into per-CPU registers.
> 
>  - Rename resctrl_pqr_state and pqr_state to be more
>    architecturally-neutral.
> 
>  - Introduce resctrl_arch_update_cpu() to replace the explicit write to
>    MSR_IA32_PQR_ASSOC in __resctrl_sched_in(). In the case of MPAM,
>    PARTID[_I,D] and PMG are a simple function of closid, rmid, and an
>    internal global.
> 
>  - Update terminology containing explicit references to the PQR_ASSOC
>    register.
> 
> Signed-off-by: Peter Newman <peternewman@google.com>
> ---
>  arch/x86/kernel/cpu/resctrl/core.c        | 11 ++++++++---
>  arch/x86/kernel/cpu/resctrl/internal.h    |  6 +++---
>  arch/x86/kernel/cpu/resctrl/pseudo_lock.c |  4 ++--
>  arch/x86/kernel/cpu/resctrl/rdtgroup.c    | 18 +++++++++---------
>  include/linux/resctrl.h                   | 11 +++++++++++
>  5 files changed, 33 insertions(+), 17 deletions(-)

Looks like __rdtgroup_move_task() still refers to the PQR MSR in comments
and it eventually ends up in fs code.

Also found in arch/x86/kernel/process_{32,64}.c:
	/* Load the Intel cache allocation PQR MSR. */

Reinette

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v1 1/6] x86/resctrl: Move __resctrl_sched_in() out-of-line
  2024-04-07 19:21       ` Reinette Chatre
@ 2024-04-08 19:05         ` Peter Newman
  2024-04-08 20:59           ` Reinette Chatre
  0 siblings, 1 reply; 27+ messages in thread
From: Peter Newman @ 2024-04-08 19:05 UTC (permalink / raw)
  To: Reinette Chatre
  Cc: Fenghua Yu, James Morse, Stephane Eranian, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, Dave Hansen, x86, H. Peter Anvin,
	Peter Zijlstra, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, Mel Gorman,
	Daniel Bristot de Oliveira, Valentin Schneider, Uros Bizjak,
	Mike Rapoport, Kirill A. Shutemov, Rick Edgecombe, Xin Li,
	Babu Moger, Shaopeng Tan, Maciej Wieczor-Retman, Jens Axboe,
	Christian Brauner, Oleg Nesterov, Andrew Morton, Tycho Andersen,
	Nicholas Piggin, Beau Belgrave, Matthew Wilcox (Oracle),
	linux-kernel

Hi Reinette,

On Sun, Apr 7, 2024 at 12:21 PM Reinette Chatre
<reinette.chatre@intel.com> wrote:
>
> I think this needs more thought. rdt_enable_key is x86 specific now and should not
> be in the fs code. Every architecture will have its own task switch code, with
> __resctrl_sched_in() belonging to x86 and thus not something to be directly called
> from the fs code.

I think we will need to discuss whether __resctrl_sched_in() is really
arch or FS code following the changes in this series. This series
removes the explicit MSR access and James has already provided inline
wrappers for the mon and alloc enable keys[1], so I can only assume
they are architecture-independent in concept.

Thanks!
-Peter

[1] https://lore.kernel.org/r/20240213184438.16675-18-james.morse@arm.com

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v1 1/6] x86/resctrl: Move __resctrl_sched_in() out-of-line
  2024-04-08 19:05         ` Peter Newman
@ 2024-04-08 20:59           ` Reinette Chatre
  2024-04-08 21:41             ` Peter Newman
  0 siblings, 1 reply; 27+ messages in thread
From: Reinette Chatre @ 2024-04-08 20:59 UTC (permalink / raw)
  To: Peter Newman
  Cc: Fenghua Yu, James Morse, Stephane Eranian, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, Dave Hansen, x86, H. Peter Anvin,
	Peter Zijlstra, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, Mel Gorman,
	Daniel Bristot de Oliveira, Valentin Schneider, Uros Bizjak,
	Mike Rapoport, Kirill A. Shutemov, Rick Edgecombe, Xin Li,
	Babu Moger, Shaopeng Tan, Maciej Wieczor-Retman, Jens Axboe,
	Christian Brauner, Oleg Nesterov, Andrew Morton, Tycho Andersen,
	Nicholas Piggin, Beau Belgrave, Matthew Wilcox (Oracle),
	linux-kernel

Hi Peter,

On 4/8/2024 12:05 PM, Peter Newman wrote:
> Hi Reinette,
> 
> On Sun, Apr 7, 2024 at 12:21 PM Reinette Chatre
> <reinette.chatre@intel.com> wrote:
>>
>> I think this needs more thought. rdt_enable_key is x86 specific now and should not
>> be in the fs code. Every architecture will have its own task switch code, with
>> __resctrl_sched_in() belonging to x86 and thus not something to be directly called
>> from the fs code.
> 
> I think we will need to discuss whether __resctrl_sched_in() is really
> arch or FS code following the changes in this series. This series
> removes the explicit MSR access and James has already provided inline

Indeed, I missed how resctrl_arch_update_cpu() introduced in
"x86/resctrl: Abstract PQR_ASSOC from generic code" results in
__resctrl_sched_in() being architecture agnostic.

(sidenote ... it is not obvious to me why resctrl_arch_update_cpu()
is not consistently used, for example, in clear_closid_rmid(),
and it also seems a better candidate to be inline within
arch/x86/include/asm/resctrl.h)

> wrappers for the mon and alloc enable keys[1], so I can only assume
> they are architecture-independent in concept.

The wrappers are intended to be architecture-independent, but the keys
are architecture dependent. Quoting the commit you linked to:
"This means other architectures don't have to mirror the static keys"

Reinette
 
> https://lore.kernel.org/r/20240213184438.16675-18-james.morse@arm.com

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v1 1/6] x86/resctrl: Move __resctrl_sched_in() out-of-line
  2024-04-08 20:59           ` Reinette Chatre
@ 2024-04-08 21:41             ` Peter Newman
  2024-04-09  3:53               ` Reinette Chatre
  0 siblings, 1 reply; 27+ messages in thread
From: Peter Newman @ 2024-04-08 21:41 UTC (permalink / raw)
  To: Reinette Chatre
  Cc: Fenghua Yu, James Morse, Stephane Eranian, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, Dave Hansen, x86, H. Peter Anvin,
	Peter Zijlstra, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, Mel Gorman,
	Daniel Bristot de Oliveira, Valentin Schneider, Uros Bizjak,
	Mike Rapoport, Kirill A. Shutemov, Rick Edgecombe, Xin Li,
	Babu Moger, Shaopeng Tan, Maciej Wieczor-Retman, Jens Axboe,
	Christian Brauner, Oleg Nesterov, Andrew Morton, Tycho Andersen,
	Nicholas Piggin, Beau Belgrave, Matthew Wilcox (Oracle),
	linux-kernel

Hi Reinette,

On Mon, Apr 8, 2024 at 1:59 PM Reinette Chatre
<reinette.chatre@intel.com> wrote:
>
> Hi Peter,
>
> On 4/8/2024 12:05 PM, Peter Newman wrote:
> > Hi Reinette,
> >
> > On Sun, Apr 7, 2024 at 12:21 PM Reinette Chatre
> > <reinette.chatre@intel.com> wrote:
> >>
> >> I think this needs more thought. rdt_enable_key is x86 specific now and should not
> >> be in the fs code. Every architecture will have its own task switch code, with
> >> __resctrl_sched_in() belonging to x86 and thus not something to be directly called
> >> from the fs code.
> >
> > I think we will need to discuss whether __resctrl_sched_in() is really
> > arch or FS code following the changes in this series. This series
> > removes the explicit MSR access and James has already provided inline
>
> Indeed, I missed how resctrl_arch_update_cpu() introduced in
> "x86/resctrl: Abstract PQR_ASSOC from generic code" results in
> __resctrl_sched_in() being architecture agnostic.
>
> (sidenotes ...  it is not obvious to me why resctrl_arch_update_cpu()
> is not consistently used, for example, in clear_closid_rmid(),
> and it also seems a better candidate to be inline within
> arch/x86/include/asm/resctrl.h)

I must have renamed resctrl_pqr_state to resctrl_cpu_state after I
last looked over clear_closid_rmid(). Looking at the function again,
it is clearly portable now and should definitely use
resctrl_arch_update_cpu(). I need to look over the accesses to
resctrl_cpu_state again to reconsider whether they're
architecture-dependent.
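
As a rough userspace model of the split (all names here are
illustrative stand-ins, not the actual kernel definitions from this
series), the generic sched-in path reads the IDs through the task's
group pointer and hands them to a single arch hook, which is the only
architecture-specific step:

```c
#include <assert.h>

/* Illustrative model only: names loosely follow this series. */
struct rdtgroup_model {
	unsigned int closid;
	unsigned int rmid;
};

struct task_model {
	struct rdtgroup_model *rdtgrp;	/* replaces cached closid/rmid */
};

/* Per-CPU state the arch hook would program into hardware. */
struct cpu_state_model {
	unsigned int cur_closid;
	unsigned int cur_rmid;
};

/* Stand-in for resctrl_arch_update_cpu(): the only arch-specific step;
 * on x86 this would be the PQR_ASSOC MSR write. */
static void arch_update_cpu_model(struct cpu_state_model *state,
				  unsigned int closid, unsigned int rmid)
{
	state->cur_closid = closid;
	state->cur_rmid = rmid;
}

/* Generic sched-in path: reads IDs through the group pointer, so a
 * group ID change needs no tasklist walk, only rescheduling the
 * currently running tasks. */
static void sched_in_model(struct cpu_state_model *state,
			   struct task_model *task)
{
	struct rdtgroup_model *grp = task->rdtgrp;

	if (grp)
		arch_update_cpu_model(state, grp->closid, grp->rmid);
	else
		arch_update_cpu_model(state, 0, 0);	/* default group */
}
```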

>
> > wrappers for the mon and alloc enable keys[1], so I can only assume
> > they are architecture-independent in concept.
>
> The wrappers are intended to be architecture-independent, but the keys
> are architecture dependent. Quoting the commit you linked to:
> "This means other architectures don't have to mirror the static keys"

The static keys seem to exist mainly for the benefit of
__resctrl_sched_in(), so if it becomes architecture-agnostic, I think
it would make sense for the static keys to move into the FS code with
it. I don't see any other usage of these keys in the code that
remained under arch/x86 in James' latest series.
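
To illustrate the point (a userspace sketch with hypothetical names;
the kernel uses static_branch_likely() so the disabled case costs only
a patched nop, which a plain bool cannot model), the keys exist solely
to let the sched-in path bail out cheaply when no group is active:

```c
#include <assert.h>
#include <stdbool.h>

/* Stand-ins for the alloc/mon enable keys (illustrative names). */
static bool alloc_enabled_model;
static bool mon_enabled_model;

static int sched_in_work;	/* counts times real work was done */

/* Guard mirrors __resctrl_sched_in(): do nothing unless allocation or
 * monitoring is active, so an unconfigured system pays ~nothing per
 * context switch. */
static void resctrl_sched_in_model(void)
{
	if (!alloc_enabled_model && !mon_enabled_model)
		return;
	sched_in_work++;	/* would update CLOSID/RMID here */
}
```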

-Peter

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v1 1/6] x86/resctrl: Move __resctrl_sched_in() out-of-line
  2024-04-08 21:41             ` Peter Newman
@ 2024-04-09  3:53               ` Reinette Chatre
  0 siblings, 0 replies; 27+ messages in thread
From: Reinette Chatre @ 2024-04-09  3:53 UTC (permalink / raw)
  To: Peter Newman
  Cc: Fenghua Yu, James Morse, Stephane Eranian, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, Dave Hansen, x86, H. Peter Anvin,
	Peter Zijlstra, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, Mel Gorman,
	Daniel Bristot de Oliveira, Valentin Schneider, Uros Bizjak,
	Mike Rapoport, Kirill A. Shutemov, Rick Edgecombe, Xin Li,
	Babu Moger, Shaopeng Tan, Maciej Wieczor-Retman, Jens Axboe,
	Christian Brauner, Oleg Nesterov, Andrew Morton, Tycho Andersen,
	Nicholas Piggin, Beau Belgrave, Matthew Wilcox (Oracle),
	linux-kernel

Hi Peter,

On 4/8/2024 2:41 PM, Peter Newman wrote:
> On Mon, Apr 8, 2024 at 1:59 PM Reinette Chatre
> <reinette.chatre@intel.com> wrote:
>> On 4/8/2024 12:05 PM, Peter Newman wrote:
>>> On Sun, Apr 7, 2024 at 12:21 PM Reinette Chatre
>>> <reinette.chatre@intel.com> wrote:

>>> wrappers for the mon and alloc enable keys[1], so I can only assume
>>> they are architecture-independent in concept.
>>
>> The wrappers are intended to be architecture-independent, but the keys
>> are architecture dependent. Quoting the commit you linked to:
>> "This means other architectures don't have to mirror the static keys"
> 
> The static keys seem to exist mainly for the benefit of
> __resctrl_sched_in(), so if it becomes architecture-agnostic, I think
> it would make sense for the static keys to move into the FS code with
> it. I don't see any other usage of these keys in the code that
> remained under arch/x86 on James' latest series.

Good point. James untangled the static keys from their multiple usages,
with only context switching remaining.
If their only use remains to avoid runtime checks in __resctrl_sched_in()
then it sounds reasonable that the static keys follow
__resctrl_sched_in() to FS code.

Reinette


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v1 0/6] x86/resctrl: Avoid searching tasklist during mongrp_reparent
  2024-03-25 17:27 [PATCH v1 0/6] x86/resctrl: Avoid searching tasklist during mongrp_reparent Peter Newman
                   ` (6 preceding siblings ...)
  2024-04-04 23:09 ` [PATCH v1 0/6] x86/resctrl: Avoid searching tasklist during mongrp_reparent Reinette Chatre
@ 2024-04-11 22:34 ` Moger, Babu
  2024-04-11 22:44   ` Peter Newman
  7 siblings, 1 reply; 27+ messages in thread
From: Moger, Babu @ 2024-04-11 22:34 UTC (permalink / raw)
  To: Peter Newman, Fenghua Yu, Reinette Chatre, James Morse
  Cc: Stephane Eranian, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, x86, H. Peter Anvin, Peter Zijlstra, Juri Lelli,
	Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Daniel Bristot de Oliveira, Valentin Schneider,
	Uros Bizjak, Mike Rapoport, Kirill A. Shutemov, Rick Edgecombe,
	Xin Li, Babu Moger, Shaopeng Tan, Maciej Wieczor-Retman,
	Jens Axboe, Christian Brauner, Oleg Nesterov, Andrew Morton,
	Tycho Andersen, Nicholas Piggin, Beau Belgrave,
	Matthew Wilcox (Oracle),
	linux-kernel

Hi Peter,

On 3/25/2024 12:27 PM, Peter Newman wrote:
> Hi Reinette,
>
> I've been working with users of the recently-added mongroup rename
> operation[1] who have observed the impact of back-to-back operations on
> latency-sensitive, thread pool-based services. Because changing a
> resctrl group's CLOSID (or RMID) requires all task_structs in the system
> to be inspected with the tasklist_lock read-locked, a series of move
> operations can block out thread creation for long periods of time, as
> creating threads needs to write-lock the tasklist_lock.
>
> To avoid this issue, this series replaces the CLOSID and RMID values
> cached in the task_struct with a pointer to the task's rdtgroup, through
> which the current CLOSID and RMID can be obtained indirectly during a
> context switch. Updating a group's ID then only requires the current
> task to be switched back in on all CPUs. On server hosts with very large
> thread counts, this is much less disruptive than making thread creation
> globally unavailable. However, this is less desirable on CPU-isolated,
> realtime workloads, so I am interested in suggestions on how to reach a
> compromise for the two use cases.

Before going this route, have you thought about your original solution 
where CONTROL_MON groups share the same CLOSID?

[3] https://lore.kernel.org/lkml/CALPaoCj-zav4x6H3ffXo_O+RFan8Qb-uLy-DdtkaQTfuxY4a0w@mail.gmail.com

Maybe it is less disruptive than changing the context switch code. Thoughts?

thanks
Babu

>
> Also note that the soft-ABMC[2] proposal is based on swapping the RMID
> values in mongroups when monitors are assigned to them, which will
> result in the same slowdown as was encountered with re-parenting
> monitoring groups.
>
> Using pointers to rdtgroups from the task_struct has been used internally at
> Google for a few years to support an alternate CLOSID management
> technique[3], which was replaced by mongroup rename. Usage of this
> technique during production revealed an issue where threads in an
> exiting process become unreachable via for_each_process_thread() before
> destruction, but can still switch in and out. Consequently, an rmdir
> operation can fail to remove references to an rdtgroup before it frees
> it, causing the pointer to freed memory to be dereferenced after the
> structure is freed. This would result in invalid CLOSID/RMID values
> being written into MSRs, causing an exception. The workaround for this
> is to clear a task's rdtgroup pointer when it exits.
>
> As a benefit, pointers to rdtgroups are architecture-independent,
> resulting in __resctrl_sched_in() and more of the task assignment code
> becoming portable, simplifying the effort of supporting MPAM. Using a
> single pointer allows the task's current monitor and control groups to
> be determined atomically.
>
> Similar changes have been used internally on a kernel supporting MPAM.
> Upon request, I can provide the required changes to the MPAM-resctrl
> interface based on James Morse's latest MPAM snapshot[4] for reference.
>
> Thanks!
> -Peter
>
> [1] https://lore.kernel.org/r/20230419125015.693566-3-peternewman@google.com
> [2] https://lore.kernel.org/lkml/CALPaoChhKJiMAueFtgCTc7ffO++S5DJCySmxqf9ZDmhR9RQapw@mail.gmail.com
> [3] https://lore.kernel.org/lkml/CALPaoCj-zav4x6H3ffXo_O+RFan8Qb-uLy-DdtkaQTfuxY4a0w@mail.gmail.com
> [4] https://git.kernel.org/pub/scm/linux/kernel/git/morse/linux.git/log/?h=mpam/snapshot/v6.7-rc2
>
> Peter Newman (6):
>    x86/resctrl: Move __resctrl_sched_in() out-of-line
>    x86/resctrl: Add hook for releasing task_struct references
>    x86/resctrl: Disallow mongroup rename on MPAM
>    x86/resctrl: Use rdtgroup pointer to indicate task membership
>    x86/resctrl: Abstract PQR_ASSOC from generic code
>    x86/resctrl: Don't search tasklist in mongroup rename
>
>   arch/x86/include/asm/resctrl.h            |  93 --------
>   arch/x86/kernel/cpu/resctrl/core.c        |  14 +-
>   arch/x86/kernel/cpu/resctrl/internal.h    |  17 ++
>   arch/x86/kernel/cpu/resctrl/pseudo_lock.c |   4 +-
>   arch/x86/kernel/cpu/resctrl/rdtgroup.c    | 279 +++++++++++++---------
>   arch/x86/kernel/process_32.c              |   2 +-
>   arch/x86/kernel/process_64.c              |   2 +-
>   include/linux/resctrl.h                   |  38 +++
>   include/linux/sched.h                     |   3 +-
>   kernel/exit.c                             |   3 +
>   10 files changed, 239 insertions(+), 216 deletions(-)
>
>
> base-commit: 4cece764965020c22cff7665b18a012006359095

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v1 0/6] x86/resctrl: Avoid searching tasklist during mongrp_reparent
  2024-04-11 22:34 ` Moger, Babu
@ 2024-04-11 22:44   ` Peter Newman
  0 siblings, 0 replies; 27+ messages in thread
From: Peter Newman @ 2024-04-11 22:44 UTC (permalink / raw)
  To: babu.moger
  Cc: Fenghua Yu, Reinette Chatre, James Morse, Stephane Eranian,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	H. Peter Anvin, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Daniel Bristot de Oliveira, Valentin Schneider, Uros Bizjak,
	Mike Rapoport, Kirill A. Shutemov, Rick Edgecombe, Xin Li,
	Shaopeng Tan, Maciej Wieczor-Retman, Jens Axboe,
	Christian Brauner, Oleg Nesterov, Andrew Morton, Tycho Andersen,
	Nicholas Piggin, Beau Belgrave, Matthew Wilcox (Oracle),
	linux-kernel

Hi Babu,

On Thu, Apr 11, 2024 at 3:34 PM Moger, Babu <bmoger@amd.com> wrote:
>
> Hi Peter,
>
> On 3/25/2024 12:27 PM, Peter Newman wrote:
> > To avoid this issue, this series replaces the CLOSID and RMID values
> > cached in the task_struct with a pointer to the task's rdtgroup, through
> > which the current CLOSID and RMID can be obtained indirectly during a
> > context switch. Updating a group's ID then only requires the current
> > task to be switched back in on all CPUs. On server hosts with very large
> > thread counts, this is much less disruptive than making thread creation
> > globally unavailable. However, this is less desirable on CPU-isolated,
> > realtime workloads, so I am interested in suggestions on how to reach a
> > compromise for the two use cases.
>
> Before going this route, have you thought about your original solution
> where CONTROL_MON groups sharing the same CLOSID?
>
> [3] https://lore.kernel.org/lkml/CALPaoCj-zav4x6H3ffXo_O+RFan8Qb-uLy-DdtkaQTfuxY4a0w@mail.gmail.com
>
> May be it is less disruptive than changing the context switch code.. Thoughts?

If I had ever posted that series, it would have contained the very
same changes to the context switch code, because that solution causes
rdtgroup->closid fields to change on almost every schemata write.
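
Sketched as a toy userspace model (hypothetical names and sizes, just
to contrast the two strategies): with IDs cached per task, every ID
change must visit every task under tasklist_lock, whereas with a group
pointer a single store updates them all and tasks pick it up at their
next sched-in:

```c
#include <assert.h>

#define NTASKS 4	/* hypothetical task count */

/* Strategy A: IDs cached per task -- an ID change must visit every
 * task (the tasklist_lock walk this series avoids). */
static unsigned int cached_closid[NTASKS];

static int update_cached(unsigned int new_closid)
{
	int visited = 0;

	for (int i = 0; i < NTASKS; i++) {
		cached_closid[i] = new_closid;
		visited++;
	}
	return visited;
}

/* Strategy B: tasks hold a group pointer -- one store updates all of
 * them; no per-task walk is needed. */
struct grp_model { unsigned int closid; };
static struct grp_model grp_model;
static struct grp_model *task_grp[NTASKS];	/* all point at grp_model */

static int update_via_pointer(unsigned int new_closid)
{
	grp_model.closid = new_closid;	/* seen at each task's next sched-in */
	return 1;			/* one write, no tasklist walk */
}
```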

-Peter

^ permalink raw reply	[flat|nested] 27+ messages in thread

end of thread, other threads:[~2024-04-11 22:44 UTC | newest]

Thread overview: 27+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2024-03-25 17:27 [PATCH v1 0/6] x86/resctrl: Avoid searching tasklist during mongrp_reparent Peter Newman
2024-03-25 17:27 ` [PATCH v1 1/6] x86/resctrl: Move __resctrl_sched_in() out-of-line Peter Newman
2024-04-04 23:09   ` Reinette Chatre
2024-04-05 22:04     ` Peter Newman
2024-04-07 19:21       ` Reinette Chatre
2024-04-08 19:05         ` Peter Newman
2024-04-08 20:59           ` Reinette Chatre
2024-04-08 21:41             ` Peter Newman
2024-04-09  3:53               ` Reinette Chatre
2024-03-25 17:27 ` [PATCH v1 2/6] x86/resctrl: Add hook for releasing task_struct references Peter Newman
2024-04-04 23:10   ` Reinette Chatre
2024-03-25 17:27 ` [PATCH v1 3/6] x86/resctrl: Disallow mongroup rename on MPAM Peter Newman
2024-04-04 23:11   ` Reinette Chatre
2024-04-05 22:10     ` Peter Newman
2024-03-25 17:27 ` [PATCH v1 4/6] x86/resctrl: Use rdtgroup pointer to indicate task membership Peter Newman
2024-04-04 23:15   ` Reinette Chatre
2024-03-25 17:27 ` [PATCH v1 5/6] x86/resctrl: Abstract PQR_ASSOC from generic code Peter Newman
2024-04-04 23:16   ` Reinette Chatre
2024-04-05 17:15   ` Reinette Chatre
2024-04-07 19:21   ` Reinette Chatre
2024-03-25 17:27 ` [PATCH v1 6/6] x86/resctrl: Don't search tasklist in mongroup rename Peter Newman
2024-04-04 23:18   ` Reinette Chatre
2024-04-04 23:09 ` [PATCH v1 0/6] x86/resctrl: Avoid searching tasklist during mongrp_reparent Reinette Chatre
2024-04-05 21:30   ` Peter Newman
2024-04-07 19:13     ` Reinette Chatre
2024-04-11 22:34 ` Moger, Babu
2024-04-11 22:44   ` Peter Newman

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).