* [PATCH v4 0/2] x86/resctrl: Fix task CLOSID update race
@ 2022-11-29 11:10 Peter Newman
  2022-11-29 11:10 ` [PATCH v4 1/2] x86/resctrl: Update task closid/rmid with task_call_func() Peter Newman
  2022-11-29 11:10 ` [PATCH v4 2/2] x86/resctrl: IPI all online CPUs for group updates Peter Newman
  0 siblings, 2 replies; 13+ messages in thread
From: Peter Newman @ 2022-11-29 11:10 UTC (permalink / raw)
  To: reinette.chatre, fenghua.yu
  Cc: bp, derkling, eranian, hpa, james.morse, jannh, kpsingh,
	linux-kernel, mingo, tglx, x86, Peter Newman

Hi Reinette, Fenghua,

I've fixed the wording in changelogs and code comments throughout and
clarified the explanations as Reinette had requested.

The patch series addresses the IPI race we discussed in the container
move RFD thread[1]. The changelog in the patches should also provide a
good description.

Updates in v4:
 - Reorder the patches so that the justification for sending more IPIs
   can reference the patch fixing __rdtgroup_move_task().
 - Correct tense of wording used in changelog and comments
Updates in v3:
 - Split the handling of multi-task and single-task operations into
   separate patches, now that they're handled differently.
 - Clarify the justification in the commit message, including moving
   some of it out of an inline code comment.
Updates in v2:
 - Following Reinette's suggestion: use task_call_func() for single
   task, IPI broadcast for group movements.
 - Rebased to v6.1-rc4

v1: https://lore.kernel.org/lkml/20221103141641.3055981-1-peternewman@google.com/
v2: https://lore.kernel.org/lkml/20221110135346.2209839-1-peternewman@google.com/
v3: https://lore.kernel.org/lkml/20221115141953.816851-1-peternewman@google.com/

Thanks!
-Peter

[1] https://lore.kernel.org/all/CALPaoCg2-9ARbK+MEgdvdcjJtSy_2H6YeRkLrT97zgy8Aro3Vg@mail.gmail.com/

Peter Newman (2):
  x86/resctrl: Update task closid/rmid with task_call_func()
  x86/resctrl: IPI all online CPUs for group updates

 arch/x86/kernel/cpu/resctrl/rdtgroup.c | 130 ++++++++++++-------------
 1 file changed, 60 insertions(+), 70 deletions(-)


base-commit: eb7081409f94a9a8608593d0fb63a1aa3d6f95d8
--
2.38.1.584.g0f3c55d4c2-goog



* [PATCH v4 1/2] x86/resctrl: Update task closid/rmid with task_call_func()
  2022-11-29 11:10 [PATCH v4 0/2] x86/resctrl: Fix task CLOSID update race Peter Newman
@ 2022-11-29 11:10 ` Peter Newman
  2022-12-06 18:56   ` Reinette Chatre
  2022-11-29 11:10 ` [PATCH v4 2/2] x86/resctrl: IPI all online CPUs for group updates Peter Newman
  1 sibling, 1 reply; 13+ messages in thread
From: Peter Newman @ 2022-11-29 11:10 UTC (permalink / raw)
  To: reinette.chatre, fenghua.yu
  Cc: bp, derkling, eranian, hpa, james.morse, jannh, kpsingh,
	linux-kernel, mingo, tglx, x86, Peter Newman

When the user moves a running task to a new rdtgroup using the tasks
file interface, the resulting change in CLOSID/RMID must be immediately
propagated to the PQR_ASSOC MSR on the task's CPU.

It is possible for a task to wake up or migrate while it is being moved
to a new group. If __rdtgroup_move_task() fails to observe that a task
has begun running or misses that it migrated to a new CPU, the task will
continue to use the old CLOSID or RMID until it switches in again.

__rdtgroup_move_task() assumes that if the task migrates off of its CPU
before it can IPI the task, then the task has already observed the
updated CLOSID/RMID. Because this is done locklessly and an x86 CPU can
delay stores until after loads, the following incorrect scenarios are
possible:

 1. __rdtgroup_move_task() stores the new closid and rmid in
    the task structure after it loads task_curr() and task_cpu().
 2. resctrl_sched_in() loads t->{closid,rmid} before the calling context
    switch stores new task_curr() and task_cpu() values.

Use task_call_func() in __rdtgroup_move_task() to serialize updates to
the closid and rmid fields in the task_struct with context switch.

Signed-off-by: Peter Newman <peternewman@google.com>
Reviewed-by: James Morse <james.morse@arm.com>
---
 arch/x86/kernel/cpu/resctrl/rdtgroup.c | 78 ++++++++++++++++----------
 1 file changed, 47 insertions(+), 31 deletions(-)

diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
index e5a48f05e787..59b7ffcd53bb 100644
--- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
+++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
@@ -528,6 +528,31 @@ static void rdtgroup_remove(struct rdtgroup *rdtgrp)
 	kfree(rdtgrp);
 }
 
+static int update_locked_task_closid_rmid(struct task_struct *t, void *arg)
+{
+	struct rdtgroup *rdtgrp = arg;
+
+	/*
+	 * Although task_call_func() serializes the writes below with the paired
+	 * reads in resctrl_sched_in(), resctrl_sched_in() still needs
+	 * READ_ONCE() due to rdt_move_group_tasks(), so use WRITE_ONCE() here
+	 * to conform.
+	 */
+	if (rdtgrp->type == RDTCTRL_GROUP) {
+		WRITE_ONCE(t->closid, rdtgrp->closid);
+		WRITE_ONCE(t->rmid, rdtgrp->mon.rmid);
+	} else if (rdtgrp->type == RDTMON_GROUP) {
+		WRITE_ONCE(t->rmid, rdtgrp->mon.rmid);
+	}
+
+	/*
+	 * If the task is current on a CPU, the PQR_ASSOC MSR needs to be
+	 * updated to make the resource group go into effect. If the task is not
+	 * current, the MSR will be updated when the task is scheduled in.
+	 */
+	return task_curr(t);
+}
+
 static void _update_task_closid_rmid(void *task)
 {
 	/*
@@ -538,10 +563,24 @@ static void _update_task_closid_rmid(void *task)
 		resctrl_sched_in();
 }
 
-static void update_task_closid_rmid(struct task_struct *t)
+static void update_task_closid_rmid(struct task_struct *t,
+				    struct rdtgroup *rdtgrp)
 {
-	if (IS_ENABLED(CONFIG_SMP) && task_curr(t))
-		smp_call_function_single(task_cpu(t), _update_task_closid_rmid, t, 1);
+	/*
+	 * Serialize the closid and rmid update with context switch. If
+	 * task_call_func() indicates that the task was running during
+	 * update_locked_task_closid_rmid(), then interrupt it.
+	 */
+	if (task_call_func(t, update_locked_task_closid_rmid, rdtgrp) &&
+	    IS_ENABLED(CONFIG_SMP))
+		/*
+		 * If the task has migrated away from the CPU indicated by
+		 * task_cpu() below, then it has already switched in on the
+		 * new CPU using the updated closid and rmid and the call below
+		 * is unnecessary, but harmless.
+		 */
+		smp_call_function_single(task_cpu(t),
+					 _update_task_closid_rmid, t, 1);
 	else
 		_update_task_closid_rmid(t);
 }
@@ -557,39 +596,16 @@ static int __rdtgroup_move_task(struct task_struct *tsk,
 		return 0;
 
 	/*
-	 * Set the task's closid/rmid before the PQR_ASSOC MSR can be
-	 * updated by them.
-	 *
-	 * For ctrl_mon groups, move both closid and rmid.
 	 * For monitor groups, can move the tasks only from
 	 * their parent CTRL group.
 	 */
-
-	if (rdtgrp->type == RDTCTRL_GROUP) {
-		WRITE_ONCE(tsk->closid, rdtgrp->closid);
-		WRITE_ONCE(tsk->rmid, rdtgrp->mon.rmid);
-	} else if (rdtgrp->type == RDTMON_GROUP) {
-		if (rdtgrp->mon.parent->closid == tsk->closid) {
-			WRITE_ONCE(tsk->rmid, rdtgrp->mon.rmid);
-		} else {
-			rdt_last_cmd_puts("Can't move task to different control group\n");
-			return -EINVAL;
-		}
+	if (rdtgrp->type == RDTMON_GROUP &&
+	    rdtgrp->mon.parent->closid != tsk->closid) {
+		rdt_last_cmd_puts("Can't move task to different control group\n");
+		return -EINVAL;
 	}
 
-	/*
-	 * Ensure the task's closid and rmid are written before determining if
-	 * the task is current that will decide if it will be interrupted.
-	 */
-	barrier();
-
-	/*
-	 * By now, the task's closid and rmid are set. If the task is current
-	 * on a CPU, the PQR_ASSOC MSR needs to be updated to make the resource
-	 * group go into effect. If the task is not current, the MSR will be
-	 * updated when the task is scheduled in.
-	 */
-	update_task_closid_rmid(tsk);
+	update_task_closid_rmid(tsk, rdtgrp);
 
 	return 0;
 }
-- 
2.38.1.584.g0f3c55d4c2-goog



* [PATCH v4 2/2] x86/resctrl: IPI all online CPUs for group updates
  2022-11-29 11:10 [PATCH v4 0/2] x86/resctrl: Fix task CLOSID update race Peter Newman
  2022-11-29 11:10 ` [PATCH v4 1/2] x86/resctrl: Update task closid/rmid with task_call_func() Peter Newman
@ 2022-11-29 11:10 ` Peter Newman
  2022-12-06 18:57   ` Reinette Chatre
  1 sibling, 1 reply; 13+ messages in thread
From: Peter Newman @ 2022-11-29 11:10 UTC (permalink / raw)
  To: reinette.chatre, fenghua.yu
  Cc: bp, derkling, eranian, hpa, james.morse, jannh, kpsingh,
	linux-kernel, mingo, tglx, x86, Peter Newman

Removing a CTRL_MON or MON group directory moves all tasks to the parent
group. The rmdir implementation therefore interrupts any running
tasks which were in the deleted group to update their CLOSID/RMID to
those of the parent.

The rmdir operation iterates over all tasks in the deleted group while
read-locking the tasklist_lock to ensure that no newly-created child
tasks remain in the deleted group. Calling task_call_func() to perform
the updates on every task in the deleted group, similar to the recent
fix in __rdtgroup_move_task(), would result in a much longer
tasklist_lock critical section.

To avoid this, stop attempting to construct a precise mask of CPUs
hosting the moved tasks in rdt_move_group_tasks(). Its callers instead
perform the PQR_ASSOC MSR update on all online CPUs to ensure all
affected tasks are notified.

To measure the impact of the rdt_move_group_tasks() implementation
options, the following command was run in an rdtgroup to produce a
1600-task workload:

 # mkdir /sys/fs/resctrl/test
 # echo $$ > /sys/fs/resctrl/test/tasks
 # perf bench sched messaging -g 40 -l 100000

Results collected using:

 # perf stat rmdir /sys/fs/resctrl/test

CPU: Intel(R) Xeon(R) Platinum P-8136 CPU @ 2.00GHz (112 threads)

Calling task_call_func() on all tasks in the deleted group increased
task-clock time from 1.54 to 2.35 ms, while the IPI broadcast reduced
the time to 1.31 ms.

Restructuring resctrl groups is assumed to be a rare act of system-level
reconfiguration by the user, so the impact of additional IPIs resulting
from this change to a CPU-isolated workload is not a concern.

Signed-off-by: Peter Newman <peternewman@google.com>
Reviewed-by: James Morse <james.morse@arm.com>
---
 arch/x86/kernel/cpu/resctrl/rdtgroup.c | 52 +++++++-------------------
 1 file changed, 13 insertions(+), 39 deletions(-)

diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
index 59b7ffcd53bb..4a3c0b315484 100644
--- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
+++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
@@ -2401,12 +2401,10 @@ static int reset_all_ctrls(struct rdt_resource *r)
  * Move tasks from one to the other group. If @from is NULL, then all tasks
  * in the systems are moved unconditionally (used for teardown).
  *
- * If @mask is not NULL the cpus on which moved tasks are running are set
- * in that mask so the update smp function call is restricted to affected
- * cpus.
+ * Following this operation, the caller should update PQR_ASSOC MSR and per-CPU
+ * storage on all online CPUs.
  */
-static void rdt_move_group_tasks(struct rdtgroup *from, struct rdtgroup *to,
-				 struct cpumask *mask)
+static void rdt_move_group_tasks(struct rdtgroup *from, struct rdtgroup *to)
 {
 	struct task_struct *p, *t;
 
@@ -2416,16 +2414,6 @@ static void rdt_move_group_tasks(struct rdtgroup *from, struct rdtgroup *to,
 		    is_rmid_match(t, from)) {
 			WRITE_ONCE(t->closid, to->closid);
 			WRITE_ONCE(t->rmid, to->mon.rmid);
-
-			/*
-			 * If the task is on a CPU, set the CPU in the mask.
-			 * The detection is inaccurate as tasks might move or
-			 * schedule before the smp function call takes place.
-			 * In such a case the function call is pointless, but
-			 * there is no other side effect.
-			 */
-			if (IS_ENABLED(CONFIG_SMP) && mask && task_curr(t))
-				cpumask_set_cpu(task_cpu(t), mask);
 		}
 	}
 	read_unlock(&tasklist_lock);
@@ -2456,7 +2444,7 @@ static void rmdir_all_sub(void)
 	struct rdtgroup *rdtgrp, *tmp;
 
 	/* Move all tasks to the default resource group */
-	rdt_move_group_tasks(NULL, &rdtgroup_default, NULL);
+	rdt_move_group_tasks(NULL, &rdtgroup_default);
 
 	list_for_each_entry_safe(rdtgrp, tmp, &rdt_all_groups, rdtgroup_list) {
 		/* Free any child rmids */
@@ -3115,23 +3103,19 @@ static int rdtgroup_mkdir(struct kernfs_node *parent_kn, const char *name,
 	return -EPERM;
 }
 
-static int rdtgroup_rmdir_mon(struct rdtgroup *rdtgrp, cpumask_var_t tmpmask)
+static int rdtgroup_rmdir_mon(struct rdtgroup *rdtgrp)
 {
 	struct rdtgroup *prdtgrp = rdtgrp->mon.parent;
 	int cpu;
 
 	/* Give any tasks back to the parent group */
-	rdt_move_group_tasks(rdtgrp, prdtgrp, tmpmask);
+	rdt_move_group_tasks(rdtgrp, prdtgrp);
 
 	/* Update per cpu rmid of the moved CPUs first */
 	for_each_cpu(cpu, &rdtgrp->cpu_mask)
 		per_cpu(pqr_state.default_rmid, cpu) = prdtgrp->mon.rmid;
-	/*
-	 * Update the MSR on moved CPUs and CPUs which have moved
-	 * task running on them.
-	 */
-	cpumask_or(tmpmask, tmpmask, &rdtgrp->cpu_mask);
-	update_closid_rmid(tmpmask, NULL);
+
+	update_closid_rmid(cpu_online_mask, NULL);
 
 	rdtgrp->flags = RDT_DELETED;
 	free_rmid(rdtgrp->mon.rmid);
@@ -3156,12 +3140,12 @@ static int rdtgroup_ctrl_remove(struct rdtgroup *rdtgrp)
 	return 0;
 }
 
-static int rdtgroup_rmdir_ctrl(struct rdtgroup *rdtgrp, cpumask_var_t tmpmask)
+static int rdtgroup_rmdir_ctrl(struct rdtgroup *rdtgrp)
 {
 	int cpu;
 
 	/* Give any tasks back to the default group */
-	rdt_move_group_tasks(rdtgrp, &rdtgroup_default, tmpmask);
+	rdt_move_group_tasks(rdtgrp, &rdtgroup_default);
 
 	/* Give any CPUs back to the default group */
 	cpumask_or(&rdtgroup_default.cpu_mask,
@@ -3173,12 +3157,7 @@ static int rdtgroup_rmdir_ctrl(struct rdtgroup *rdtgrp, cpumask_var_t tmpmask)
 		per_cpu(pqr_state.default_rmid, cpu) = rdtgroup_default.mon.rmid;
 	}
 
-	/*
-	 * Update the MSR on moved CPUs and CPUs which have moved
-	 * task running on them.
-	 */
-	cpumask_or(tmpmask, tmpmask, &rdtgrp->cpu_mask);
-	update_closid_rmid(tmpmask, NULL);
+	update_closid_rmid(cpu_online_mask, NULL);
 
 	closid_free(rdtgrp->closid);
 	free_rmid(rdtgrp->mon.rmid);
@@ -3197,12 +3176,8 @@ static int rdtgroup_rmdir(struct kernfs_node *kn)
 {
 	struct kernfs_node *parent_kn = kn->parent;
 	struct rdtgroup *rdtgrp;
-	cpumask_var_t tmpmask;
 	int ret = 0;
 
-	if (!zalloc_cpumask_var(&tmpmask, GFP_KERNEL))
-		return -ENOMEM;
-
 	rdtgrp = rdtgroup_kn_lock_live(kn);
 	if (!rdtgrp) {
 		ret = -EPERM;
@@ -3222,18 +3197,17 @@ static int rdtgroup_rmdir(struct kernfs_node *kn)
 		    rdtgrp->mode == RDT_MODE_PSEUDO_LOCKED) {
 			ret = rdtgroup_ctrl_remove(rdtgrp);
 		} else {
-			ret = rdtgroup_rmdir_ctrl(rdtgrp, tmpmask);
+			ret = rdtgroup_rmdir_ctrl(rdtgrp);
 		}
 	} else if (rdtgrp->type == RDTMON_GROUP &&
 		 is_mon_groups(parent_kn, kn->name)) {
-		ret = rdtgroup_rmdir_mon(rdtgrp, tmpmask);
+		ret = rdtgroup_rmdir_mon(rdtgrp);
 	} else {
 		ret = -EPERM;
 	}
 
 out:
 	rdtgroup_kn_unlock(kn);
-	free_cpumask_var(tmpmask);
 	return ret;
 }
 
-- 
2.38.1.584.g0f3c55d4c2-goog



* Re: [PATCH v4 1/2] x86/resctrl: Update task closid/rmid with task_call_func()
  2022-11-29 11:10 ` [PATCH v4 1/2] x86/resctrl: Update task closid/rmid with task_call_func() Peter Newman
@ 2022-12-06 18:56   ` Reinette Chatre
  2022-12-07 10:58     ` Peter Newman
  0 siblings, 1 reply; 13+ messages in thread
From: Reinette Chatre @ 2022-12-06 18:56 UTC (permalink / raw)
  To: Peter Newman, fenghua.yu
  Cc: bp, derkling, eranian, hpa, james.morse, jannh, kpsingh,
	linux-kernel, mingo, tglx, x86

Hi Peter,

On 11/29/2022 3:10 AM, Peter Newman wrote:
> When the user moves a running task to a new rdtgroup using the tasks
> file interface, the resulting change in CLOSID/RMID must be immediately
> propagated to the PQR_ASSOC MSR on the task's CPU.
> 
> It is possible for a task to wake up or migrate while it is being moved
> to a new group. If __rdtgroup_move_task() fails to observe that a task
> has begun running or misses that it migrated to a new CPU, the task will
> continue to use the old CLOSID or RMID until it switches in again.
> 
> __rdtgroup_move_task() assumes that if the task migrates off of its CPU
> before it can IPI the task, then the task has already observed the
> updated CLOSID/RMID. Because this is done locklessly and an x86 CPU can
> delay stores until after loads, the following incorrect scenarios are
> possible:
> 
>  1. __rdtgroup_move_task() stores the new closid and rmid in
>     the task structure after it loads task_curr() and task_cpu().

Stating how this scenario encounters the problem would help
so perhaps something like (please feel free to change):
"If the task starts running between a reordered task_curr() check and
the CLOSID/RMID update then it will start running with the old CLOSID/RMID
until it is switched again because __rdtgroup_move_task() failed to determine 
that it needs to be interrupted to obtain the new CLOSID/RMID."

>  2. resctrl_sched_in() loads t->{closid,rmid} before the calling context
>     switch stores new task_curr() and task_cpu() values.

This scenario is not clear to me. Could you please provide more detail about it?
I was trying to follow the context_switch() flow and resctrl_sched_in() is
one of the last things done (context_switch()->switch_to()->resctrl_sched_in()).
From what I can tell rq->curr, as used by task_curr() is set before
even context_switch() is called ... and since the next task is picked from
the CPU's runqueue (and set_task_cpu() sets the task's cpu when moved to
a runqueue) it seems to me that the value used by task_cpu() would also
be set early (before context_switch() is called). It is thus not clear to
me how the above reordering could occur so an example would help a lot.

> 
> Use task_call_func() in __rdtgroup_move_task() to serialize updates to
> the closid and rmid fields in the task_struct with context switch.

Is there a reason why there is a switch from the all caps CLOSID/RMID
at the beginning to lowercase here?

> Signed-off-by: Peter Newman <peternewman@google.com>
> Reviewed-by: James Morse <james.morse@arm.com>
> ---
>  arch/x86/kernel/cpu/resctrl/rdtgroup.c | 78 ++++++++++++++++----------
>  1 file changed, 47 insertions(+), 31 deletions(-)
> 
> diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> index e5a48f05e787..59b7ffcd53bb 100644
> --- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> +++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> @@ -528,6 +528,31 @@ static void rdtgroup_remove(struct rdtgroup *rdtgrp)
>  	kfree(rdtgrp);
>  }
>  
> +static int update_locked_task_closid_rmid(struct task_struct *t, void *arg)
> +{
> +	struct rdtgroup *rdtgrp = arg;
> +
> +	/*
> +	 * Although task_call_func() serializes the writes below with the paired
> +	 * reads in resctrl_sched_in(), resctrl_sched_in() still needs
> +	 * READ_ONCE() due to rdt_move_group_tasks(), so use WRITE_ONCE() here
> +	 * to conform.
> +	 */
> +	if (rdtgrp->type == RDTCTRL_GROUP) {
> +		WRITE_ONCE(t->closid, rdtgrp->closid);
> +		WRITE_ONCE(t->rmid, rdtgrp->mon.rmid);
> +	} else if (rdtgrp->type == RDTMON_GROUP) {
> +		WRITE_ONCE(t->rmid, rdtgrp->mon.rmid);
> +	}
> +
> +	/*
> +	 * If the task is current on a CPU, the PQR_ASSOC MSR needs to be
> +	 * updated to make the resource group go into effect. If the task is not
> +	 * current, the MSR will be updated when the task is scheduled in.
> +	 */
> +	return task_curr(t);
> +}
> +
>  static void _update_task_closid_rmid(void *task)
>  {
>  	/*
> @@ -538,10 +563,24 @@ static void _update_task_closid_rmid(void *task)
>  		resctrl_sched_in();
>  }
>  
> -static void update_task_closid_rmid(struct task_struct *t)
> +static void update_task_closid_rmid(struct task_struct *t,
> +				    struct rdtgroup *rdtgrp)
>  {
> -	if (IS_ENABLED(CONFIG_SMP) && task_curr(t))
> -		smp_call_function_single(task_cpu(t), _update_task_closid_rmid, t, 1);
> +	/*
> +	 * Serialize the closid and rmid update with context switch. If
> +	 * task_call_func() indicates that the task was running during
> +	 * update_locked_task_closid_rmid(), then interrupt it.
> +	 */
> +	if (task_call_func(t, update_locked_task_closid_rmid, rdtgrp) &&
> +	    IS_ENABLED(CONFIG_SMP))
> +		/*
> +		 * If the task has migrated away from the CPU indicated by
> +		 * task_cpu() below, then it has already switched in on the
> +		 * new CPU using the updated closid and rmid and the call below
> +		 * is unnecessary, but harmless.
> +		 */
> +		smp_call_function_single(task_cpu(t),
> +					 _update_task_closid_rmid, t, 1);
>  	else
>  		_update_task_closid_rmid(t);
>  }
> @@ -557,39 +596,16 @@ static int __rdtgroup_move_task(struct task_struct *tsk,
>  		return 0;
>  
>  	/*
> -	 * Set the task's closid/rmid before the PQR_ASSOC MSR can be
> -	 * updated by them.
> -	 *
> -	 * For ctrl_mon groups, move both closid and rmid.
>  	 * For monitor groups, can move the tasks only from
>  	 * their parent CTRL group.
>  	 */
> -
> -	if (rdtgrp->type == RDTCTRL_GROUP) {
> -		WRITE_ONCE(tsk->closid, rdtgrp->closid);
> -		WRITE_ONCE(tsk->rmid, rdtgrp->mon.rmid);
> -	} else if (rdtgrp->type == RDTMON_GROUP) {
> -		if (rdtgrp->mon.parent->closid == tsk->closid) {
> -			WRITE_ONCE(tsk->rmid, rdtgrp->mon.rmid);
> -		} else {
> -			rdt_last_cmd_puts("Can't move task to different control group\n");
> -			return -EINVAL;
> -		}
> +	if (rdtgrp->type == RDTMON_GROUP &&
> +	    rdtgrp->mon.parent->closid != tsk->closid) {
> +		rdt_last_cmd_puts("Can't move task to different control group\n");
> +		return -EINVAL;
>  	}
>  
> -	/*
> -	 * Ensure the task's closid and rmid are written before determining if
> -	 * the task is current that will decide if it will be interrupted.
> -	 */
> -	barrier();
> -
> -	/*
> -	 * By now, the task's closid and rmid are set. If the task is current
> -	 * on a CPU, the PQR_ASSOC MSR needs to be updated to make the resource
> -	 * group go into effect. If the task is not current, the MSR will be
> -	 * updated when the task is scheduled in.
> -	 */
> -	update_task_closid_rmid(tsk);
> +	update_task_closid_rmid(tsk, rdtgrp);
>  
>  	return 0;
>  }

The change looks good to me.

Reinette


* Re: [PATCH v4 2/2] x86/resctrl: IPI all online CPUs for group updates
  2022-11-29 11:10 ` [PATCH v4 2/2] x86/resctrl: IPI all online CPUs for group updates Peter Newman
@ 2022-12-06 18:57   ` Reinette Chatre
  2022-12-07 11:04     ` Peter Newman
  0 siblings, 1 reply; 13+ messages in thread
From: Reinette Chatre @ 2022-12-06 18:57 UTC (permalink / raw)
  To: Peter Newman, fenghua.yu
  Cc: bp, derkling, eranian, hpa, james.morse, jannh, kpsingh,
	linux-kernel, mingo, tglx, x86

Hi Peter,

On 11/29/2022 3:10 AM, Peter Newman wrote:
> Removing a CTRL_MON or MON group directory moves all tasks to the parent
> group. The rmdir implementation therefore interrupts any running
> tasks which were in the deleted group to update their CLOSID/RMID to
> those of the parent.
> 
> The rmdir operation iterates over all tasks in the deleted group while
> read-locking the tasklist_lock to ensure that no newly-created child
> tasks remain in the deleted group.

The above describes the current behavior. This is great context. What
follows in the changelog is a description of different fixes. This is
unexpected because there is no description of a problem with the current
behavior.

Could you please describe the problem with the current implementation? Next
you could state the two possible solutions and then I think the reader would
be ready to parse what is written below.


> Calling task_call_func() to perform
> the updates on every task in the deleted group, similar to the recent
> fix in __rdtgroup_move_task(), would result in a much longer
> tasklist_lock critical section.


I do still think it would help to state that this additional locking
does not help to provide a precise CPU mask, especially since
the next paragraph may be interpreted to mean that a precise CPU mask
is lost by giving up the additional locking.

> To avoid this, stop attempting to construct a precise mask of CPUs
> hosting the moved tasks in rdt_move_group_tasks(). Its callers instead
> perform the PQR_ASSOC MSR update on all online CPUs to ensure all
> affected tasks are notified.
> 
> To measure the impact of the rdt_move_group_tasks() implementation
> options, the following command was run in an rdtgroup to produce a
> 1600-task workload:
> 
>  # mkdir /sys/fs/resctrl/test
>  # echo $$ > /sys/fs/resctrl/test/tasks
>  # perf bench sched messaging -g 40 -l 100000
> 
> Results collected using:
> 
>  # perf stat rmdir /sys/fs/resctrl/test
> 
> CPU: Intel(R) Xeon(R) Platinum P-8136 CPU @ 2.00GHz (112 threads)
> 
> Calling task_call_func() on all tasks in the deleted group increased
> task-clock time from 1.54 to 2.35 ms, while the IPI broadcast reduced
> the time to 1.31 ms.

Thank you very much for doing this testing.

> 
> Restructuring resctrl groups is assumed to be a rare act of system-level
> reconfiguration by the user, so the impact of additional IPIs resulting
> from this change to a CPU-isolated workload is not a concern.
> 
> Signed-off-by: Peter Newman <peternewman@google.com>
> Reviewed-by: James Morse <james.morse@arm.com>
> ---
>  arch/x86/kernel/cpu/resctrl/rdtgroup.c | 52 +++++++-------------------
>  1 file changed, 13 insertions(+), 39 deletions(-)
> 
> diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> index 59b7ffcd53bb..4a3c0b315484 100644
> --- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> +++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> @@ -2401,12 +2401,10 @@ static int reset_all_ctrls(struct rdt_resource *r)
>   * Move tasks from one to the other group. If @from is NULL, then all tasks
>   * in the systems are moved unconditionally (used for teardown).
>   *
> - * If @mask is not NULL the cpus on which moved tasks are running are set
> - * in that mask so the update smp function call is restricted to affected
> - * cpus.
> + * Following this operation, the caller should update PQR_ASSOC MSR and per-CPU
> + * storage on all online CPUs.
>   */
> -static void rdt_move_group_tasks(struct rdtgroup *from, struct rdtgroup *to,
> -				 struct cpumask *mask)
> +static void rdt_move_group_tasks(struct rdtgroup *from, struct rdtgroup *to)
>  {
>  	struct task_struct *p, *t;
>  
> @@ -2416,16 +2414,6 @@ static void rdt_move_group_tasks(struct rdtgroup *from, struct rdtgroup *to,
>  		    is_rmid_match(t, from)) {
>  			WRITE_ONCE(t->closid, to->closid);
>  			WRITE_ONCE(t->rmid, to->mon.rmid);
> -
> -			/*
> -			 * If the task is on a CPU, set the CPU in the mask.
> -			 * The detection is inaccurate as tasks might move or
> -			 * schedule before the smp function call takes place.
> -			 * In such a case the function call is pointless, but
> -			 * there is no other side effect.
> -			 */
> -			if (IS_ENABLED(CONFIG_SMP) && mask && task_curr(t))
> -				cpumask_set_cpu(task_cpu(t), mask);
>  		}
>  	}
>  	read_unlock(&tasklist_lock);
> @@ -2456,7 +2444,7 @@ static void rmdir_all_sub(void)
>  	struct rdtgroup *rdtgrp, *tmp;
>  
>  	/* Move all tasks to the default resource group */
> -	rdt_move_group_tasks(NULL, &rdtgroup_default, NULL);
> +	rdt_move_group_tasks(NULL, &rdtgroup_default);
>  
>  	list_for_each_entry_safe(rdtgrp, tmp, &rdt_all_groups, rdtgroup_list) {
>  		/* Free any child rmids */
> @@ -3115,23 +3103,19 @@ static int rdtgroup_mkdir(struct kernfs_node *parent_kn, const char *name,
>  	return -EPERM;
>  }
>  
> -static int rdtgroup_rmdir_mon(struct rdtgroup *rdtgrp, cpumask_var_t tmpmask)
> +static int rdtgroup_rmdir_mon(struct rdtgroup *rdtgrp)
>  {
>  	struct rdtgroup *prdtgrp = rdtgrp->mon.parent;
>  	int cpu;
>  
>  	/* Give any tasks back to the parent group */
> -	rdt_move_group_tasks(rdtgrp, prdtgrp, tmpmask);
> +	rdt_move_group_tasks(rdtgrp, prdtgrp);
>  
>  	/* Update per cpu rmid of the moved CPUs first */
>  	for_each_cpu(cpu, &rdtgrp->cpu_mask)
>  		per_cpu(pqr_state.default_rmid, cpu) = prdtgrp->mon.rmid;
> -	/*
> -	 * Update the MSR on moved CPUs and CPUs which have moved
> -	 * task running on them.
> -	 */
> -	cpumask_or(tmpmask, tmpmask, &rdtgrp->cpu_mask);
> -	update_closid_rmid(tmpmask, NULL);
> +
> +	update_closid_rmid(cpu_online_mask, NULL);
>  
>  	rdtgrp->flags = RDT_DELETED;
>  	free_rmid(rdtgrp->mon.rmid);
> @@ -3156,12 +3140,12 @@ static int rdtgroup_ctrl_remove(struct rdtgroup *rdtgrp)
>  	return 0;
>  }
>  
> -static int rdtgroup_rmdir_ctrl(struct rdtgroup *rdtgrp, cpumask_var_t tmpmask)
> +static int rdtgroup_rmdir_ctrl(struct rdtgroup *rdtgrp)
>  {
>  	int cpu;
>  
>  	/* Give any tasks back to the default group */
> -	rdt_move_group_tasks(rdtgrp, &rdtgroup_default, tmpmask);
> +	rdt_move_group_tasks(rdtgrp, &rdtgroup_default);
>  
>  	/* Give any CPUs back to the default group */
>  	cpumask_or(&rdtgroup_default.cpu_mask,
> @@ -3173,12 +3157,7 @@ static int rdtgroup_rmdir_ctrl(struct rdtgroup *rdtgrp, cpumask_var_t tmpmask)
>  		per_cpu(pqr_state.default_rmid, cpu) = rdtgroup_default.mon.rmid;
>  	}
>  
> -	/*
> -	 * Update the MSR on moved CPUs and CPUs which have moved
> -	 * task running on them.
> -	 */
> -	cpumask_or(tmpmask, tmpmask, &rdtgrp->cpu_mask);
> -	update_closid_rmid(tmpmask, NULL);
> +	update_closid_rmid(cpu_online_mask, NULL);
>  
>  	closid_free(rdtgrp->closid);
>  	free_rmid(rdtgrp->mon.rmid);
> @@ -3197,12 +3176,8 @@ static int rdtgroup_rmdir(struct kernfs_node *kn)
>  {
>  	struct kernfs_node *parent_kn = kn->parent;
>  	struct rdtgroup *rdtgrp;
> -	cpumask_var_t tmpmask;
>  	int ret = 0;
>  
> -	if (!zalloc_cpumask_var(&tmpmask, GFP_KERNEL))
> -		return -ENOMEM;
> -
>  	rdtgrp = rdtgroup_kn_lock_live(kn);
>  	if (!rdtgrp) {
>  		ret = -EPERM;
> @@ -3222,18 +3197,17 @@ static int rdtgroup_rmdir(struct kernfs_node *kn)
>  		    rdtgrp->mode == RDT_MODE_PSEUDO_LOCKED) {
>  			ret = rdtgroup_ctrl_remove(rdtgrp);
>  		} else {
> -			ret = rdtgroup_rmdir_ctrl(rdtgrp, tmpmask);
> +			ret = rdtgroup_rmdir_ctrl(rdtgrp);
>  		}
>  	} else if (rdtgrp->type == RDTMON_GROUP &&
>  		 is_mon_groups(parent_kn, kn->name)) {
> -		ret = rdtgroup_rmdir_mon(rdtgrp, tmpmask);
> +		ret = rdtgroup_rmdir_mon(rdtgrp);
>  	} else {
>  		ret = -EPERM;
>  	}
>  
>  out:
>  	rdtgroup_kn_unlock(kn);
> -	free_cpumask_var(tmpmask);
>  	return ret;
>  }
>  

The change looks good to me.

Reinette


* Re: [PATCH v4 1/2] x86/resctrl: Update task closid/rmid with task_call_func()
  2022-12-06 18:56   ` Reinette Chatre
@ 2022-12-07 10:58     ` Peter Newman
  2022-12-07 18:38       ` Reinette Chatre
  0 siblings, 1 reply; 13+ messages in thread
From: Peter Newman @ 2022-12-07 10:58 UTC (permalink / raw)
  To: Reinette Chatre
  Cc: fenghua.yu, bp, derkling, eranian, hpa, james.morse, jannh,
	kpsingh, linux-kernel, mingo, tglx, x86

Hi Reinette,

On Tue, Dec 6, 2022 at 7:57 PM Reinette Chatre
<reinette.chatre@intel.com> wrote:
> On 11/29/2022 3:10 AM, Peter Newman wrote:
> > When the user moves a running task to a new rdtgroup using the tasks
> > file interface, the resulting change in CLOSID/RMID must be immediately
> > propagated to the PQR_ASSOC MSR on the task's CPU.
> >
> > It is possible for a task to wake up or migrate while it is being moved
> > to a new group. If __rdtgroup_move_task() fails to observe that a task
> > has begun running or misses that it migrated to a new CPU, the task will
> > continue to use the old CLOSID or RMID until it switches in again.
> >
> > __rdtgroup_move_task() assumes that if the task migrates off of its CPU
> > before it can IPI the task, then the task has already observed the
> > updated CLOSID/RMID. Because this is done locklessly and an x86 CPU can
> > delay stores until after loads, the following incorrect scenarios are
> > possible:
> >
> >  1. __rdtgroup_move_task() stores the new closid and rmid in
> >     the task structure after it loads task_curr() and task_cpu().
>
> Stating how this scenario encounters the problem would help
> so perhaps something like (please feel free to change):
> "If the task starts running between a reordered task_curr() check and
> the CLOSID/RMID update then it will start running with the old CLOSID/RMID
> until it is switched again because __rdtgroup_move_task() failed to determine
> that it needs to be interrupted to obtain the new CLOSID/RMID."

That is largely what I was trying to state in paragraph 2 above, though
at a higher level. I hoped the paragraph following it would do enough to
connect the high-level description with the low-level problem scenarios.

>
> >  2. resctrl_sched_in() loads t->{closid,rmid} before the calling context
> >     switch stores new task_curr() and task_cpu() values.
>
> This scenario is not clear to me. Could you please provide more detail about it?
> I was trying to follow the context_switch() flow and resctrl_sched_in() is
> one of the last things done (context_switch()->switch_to()->resctrl_sched_in()).
> From what I can tell rq->curr, as used by task_curr() is set before
> even context_switch() is called ... and since the next task is picked from
> the CPU's runqueue (and set_task_cpu() sets the task's cpu when moved to
> a runqueue) it seems to me that the value used by task_cpu() would also
> be set early (before context_switch() is called). It is thus not clear to
> me how the above reordering could occur so an example would help a lot.

Perhaps in both scenarios I didn't make it clear that reordering in the
CPU can cause the incorrect behavior rather than the program order. In
this explanation, items 1. and 2. are supposed to be completing the
sentence ending with a ':' at the end of paragraph 3, so I thought that
would keep focus on the CPU.

I had assumed that the ordering requirements were well-understood, since
they're stated in existing code comments a few times, and that making a
case for how the expected ordering could be violated would be enough,
but I'm happy to draw up a side-by-side example.

>
> >
> > Use task_call_func() in __rdtgroup_move_task() to serialize updates to
> > the closid and rmid fields in the task_struct with context switch.
>
> Is there a reason why there is a switch from the all caps CLOSID/RMID
> at the beginning to lowercase here?

It's because I referred to the task_struct fields explicitly here.

-Peter


* Re: [PATCH v4 2/2] x86/resctrl: IPI all online CPUs for group updates
  2022-12-06 18:57   ` Reinette Chatre
@ 2022-12-07 11:04     ` Peter Newman
  0 siblings, 0 replies; 13+ messages in thread
From: Peter Newman @ 2022-12-07 11:04 UTC (permalink / raw)
  To: Reinette Chatre
  Cc: fenghua.yu, bp, derkling, eranian, hpa, james.morse, jannh,
	kpsingh, linux-kernel, mingo, tglx, x86

Hi Reinette,

On Tue, Dec 6, 2022 at 7:57 PM Reinette Chatre
<reinette.chatre@intel.com> wrote:
> On 11/29/2022 3:10 AM, Peter Newman wrote:
> > Removing a CTRL_MON or MON group directory moves all tasks to the parent
> > group. The rmdir implementation therefore interrupts any running
> > tasks which were in the deleted group to update their CLOSID/RMID to
> > those of the parent.
> >
> > The rmdir operation iterates over all tasks in the deleted group while
> > read-locking the tasklist_lock to ensure that no newly-created child
> > tasks remain in the deleted group.
>
> The above describes the current behavior. This is great context. What
> follows in the changelog is a description of different fixes. This is
> unexpected because there is no description of a problem with the current
> behavior.
>
> Could you please describe the problem with the current implementation? Next
> you could state the two possible solutions and then I think the reader would
> be ready to parse what is written below.

Ok

> > Calling task_call_func() to perform
> > the updates on every task in the deleted group, similar to the recent
> > fix in __rdtgroup_move_task(), would result in a much longer
> > tasklist_lock critical section.
>
>
> I do still think it would help to state that this additional locking
> does not help to provide a precise CPU mask, especially since
> the next paragraph may be interpreted to mean that a precise CPU mask
> is lost by giving up the additional locking.

Yes, that's a very good point, and I'm afraid I've already made you
reiterate it once before. I will make sure to work it into the next
revision.

-Peter


* Re: [PATCH v4 1/2] x86/resctrl: Update task closid/rmid with task_call_func()
  2022-12-07 10:58     ` Peter Newman
@ 2022-12-07 18:38       ` Reinette Chatre
  2022-12-08 22:30         ` Peter Newman
  0 siblings, 1 reply; 13+ messages in thread
From: Reinette Chatre @ 2022-12-07 18:38 UTC (permalink / raw)
  To: Peter Newman
  Cc: fenghua.yu, bp, derkling, eranian, hpa, james.morse, jannh,
	kpsingh, linux-kernel, mingo, tglx, x86

Hi Peter,

On 12/7/2022 2:58 AM, Peter Newman wrote:
> Hi Reinette,
> 
> On Tue, Dec 6, 2022 at 7:57 PM Reinette Chatre
> <reinette.chatre@intel.com> wrote:
>> On 11/29/2022 3:10 AM, Peter Newman wrote:
>>> When the user moves a running task to a new rdtgroup using the tasks
>>> file interface, the resulting change in CLOSID/RMID must be immediately
>>> propagated to the PQR_ASSOC MSR on the task's CPU.
>>>
>>> It is possible for a task to wake up or migrate while it is being moved
>>> to a new group. If __rdtgroup_move_task() fails to observe that a task
>>> has begun running or misses that it migrated to a new CPU, the task will
>>> continue to use the old CLOSID or RMID until it switches in again.
>>>
>>> __rdtgroup_move_task() assumes that if the task migrates off of its CPU
>>> before it can IPI the task, then the task has already observed the
>>> updated CLOSID/RMID. Because this is done locklessly and an x86 CPU can
>>> delay stores until after loads, the following incorrect scenarios are
>>> possible:
>>>
>>>  1. __rdtgroup_move_task() stores the new closid and rmid in
>>>     the task structure after it loads task_curr() and task_cpu().
>>
>> Stating how this scenario encounters the problem would help
>> so perhaps something like (please feel free to change):
>> "If the task starts running between a reordered task_curr() check and
>> the CLOSID/RMID update then it will start running with the old CLOSID/RMID
>> until it is switched again because __rdtgroup_move_task() failed to determine
>> that it needs to be interrupted to obtain the new CLOSID/RMID."
> 
> That is largely what I was trying to state in paragraph 2 above, though
> at a higher level. I hoped the paragraph following it would do enough to
> connect the high-level description with the low-level problem scenarios.

There is no need to require the reader to connect various snippets to create
a problematic scenario themselves. The changelog should make the problem
obvious. I understand that it is what you wanted to say; that is why I moved
existing snippets to form a coherent problem scenario. It is ok if you do not
like the way I wrote it; it was only an example of how it can be done.

>>>  2. resctrl_sched_in() loads t->{closid,rmid} before the calling context
>>>     switch stores new task_curr() and task_cpu() values.
>>
>> This scenario is not clear to me. Could you please provide more detail about it?
>> I was trying to follow the context_switch() flow and resctrl_sched_in() is
>> one of the last things done (context_switch()->switch_to()->resctrl_sched_in()).
>> From what I can tell rq->curr, as used by task_curr() is set before
>> even context_switch() is called ... and since the next task is picked from
>> the CPU's runqueue (and set_task_cpu() sets the task's cpu when moved to
>> a runqueue) it seems to me that the value used by task_cpu() would also
>> be set early (before context_switch() is called). It is thus not clear to
>> me how the above reordering could occur so an example would help a lot.
> 
> Perhaps in both scenarios I didn't make it clear that reordering in the
> CPU can cause the incorrect behavior rather than the program order. In
> this explanation, items 1. and 2. are supposed to be completing the
> sentence ending with a ':' at the end of paragraph 3, so I thought that
> would keep focus on the CPU.

You did make it clear that the cause is reordering in the CPU. I am just
not able to see where the reordering is occurring in your scenario (2).

> I had assumed that the ordering requirements were well-understood, since
> they're stated in existing code comments a few times, and that making a
> case for how the expected ordering could be violated would be enough,
> but I'm happy to draw up a side-by-side example.

Please do. Could you start by highlighting which resctrl_sched_in()
you are referring to? I am trying to dissect (2) with the given information:
Through "the calling context switch" the scenario is written to create
understanding that it refers to:
context_switch()->switch_to()->resctrl_sched_in() - so the calling context
switch is the first in the above call path ... where does it (context_switch())
store the new task_curr() and task_cpu() values and how does that reorder with
resctrl_sched_in() further down in call path?

>>> Use task_call_func() in __rdtgroup_move_task() to serialize updates to
>>> the closid and rmid fields in the task_struct with context switch.
>>
>> Is there a reason why there is a switch from the all caps CLOSID/RMID
>> at the beginning to lowercase here?
> 
> It's because I referred to the task_struct fields explicitly here.

You can use task_struct::closid and task_struct::rmid to make this clear.

Reinette



* Re: [PATCH v4 1/2] x86/resctrl: Update task closid/rmid with task_call_func()
  2022-12-07 18:38       ` Reinette Chatre
@ 2022-12-08 22:30         ` Peter Newman
  2022-12-09 23:54           ` Reinette Chatre
  0 siblings, 1 reply; 13+ messages in thread
From: Peter Newman @ 2022-12-08 22:30 UTC (permalink / raw)
  To: reinette.chatre
  Cc: bp, derkling, eranian, fenghua.yu, hpa, james.morse, jannh,
	kpsingh, linux-kernel, mingo, peternewman, tglx, x86

Hi Reinette,

On Wed, Dec 7, 2022 at 7:41 PM Reinette Chatre <reinette.chatre@intel.com> wrote:
> On 12/7/2022 2:58 AM, Peter Newman wrote:
> >>>  2. resctrl_sched_in() loads t->{closid,rmid} before the calling context
> >>>     switch stores new task_curr() and task_cpu() values.
> >>
> >> This scenario is not clear to me. Could you please provide more detail about it?
> >> I was trying to follow the context_switch() flow and resctrl_sched_in() is
> >> one of the last things done (context_switch()->switch_to()->resctrl_sched_in()).
> >> From what I can tell rq->curr, as used by task_curr() is set before
> >> even context_switch() is called ... and since the next task is picked from
> >> the CPU's runqueue (and set_task_cpu() sets the task's cpu when moved to
> >> a runqueue) it seems to me that the value used by task_cpu() would also
> >> be set early (before context_switch() is called). It is thus not clear to
> >> me how the above reordering could occur so an example would help a lot.
> >
> > Perhaps in both scenarios I didn't make it clear that reordering in the
> > CPU can cause the incorrect behavior rather than the program order. In
> > this explanation, items 1. and 2. are supposed to be completing the
> > sentence ending with a ':' at the end of paragraph 3, so I thought that
> > would keep focus on the CPU.
>
> You did make it clear that the cause is reordering in the CPU. I am just
> not able to see where the reordering is occurring in your scenario (2).

It will all come down to whether it can get from updating rq->curr to
reading task_struct::{closid,rmid} without encountering a full barrier.

I'll go into the details below.

> Please do. Could you start by highlighting which resctrl_sched_in()
> you are referring to? I am trying to dissect (2) with the given information:
> Through "the calling context switch" the scenario is written to create
> understanding that it refers to:
> context_switch()->switch_to()->resctrl_sched_in() - so the calling context
> switch is the first in the above call path ... where does it (context_switch())
> store the new task_curr() and task_cpu() values and how does that reorder with
> resctrl_sched_in() further down in call path?

Yes, the rq->curr update is actually in __schedule(). I was probably
still thinking it was in prepare_task_switch() (called from
context_switch()) because of older kernels where __rdtgroup_move_task()
is still reading task_struct::on_cpu.

There is an interesting code comment under the rq->curr update site in
__schedule():

	/*
	 * RCU users of rcu_dereference(rq->curr) may not see
	 * changes to task_struct made by pick_next_task().
	 */
	RCU_INIT_POINTER(rq->curr, next);
	/*
	 * The membarrier system call requires each architecture
	 * to have a full memory barrier after updating
	 * rq->curr, before returning to user-space.
	 *
	 * Here are the schemes providing that barrier on the
	 * various architectures:
	 * - mm ? switch_mm() : mmdrop() for x86, s390, sparc, PowerPC.
	 *   switch_mm() rely on membarrier_arch_switch_mm() on PowerPC.
	 * - finish_lock_switch() for weakly-ordered
	 *   architectures where spin_unlock is a full barrier,
	 * - switch_to() for arm64 (weakly-ordered, spin_unlock
	 *   is a RELEASE barrier),
	 */

Based on this, I believe (2) doesn't happen on x86 because switch_mm()
provides the required ordering.

On arm64, it won't happen as long as it calls its resctrl_sched_in()
after the dsb(ish) in __switch_to(), which seems to be the case:

https://git.kernel.org/pub/scm/linux/kernel/git/morse/linux.git/tree/arch/arm64/kernel/process.c?h=mpam/snapshot/v6.0-rc1#n561

Based on this, I'll just sketch out the first scenario below and drop
(2) from the changelog. This also implies that the group update cases
can use a single smp_mb() to provide all the necessary ordering, because
there's a full barrier on context switch for it to pair with, so I don't
need to broadcast IPI anymore.  I don't know whether task_call_func() is
faster than an smp_mb(). I'll take some measurements to see.

The presumed behavior is that if __rdtgroup_move_task() does not yet see t1
running, then t1 observes the updated values:

CPU 0                                   CPU 1
-----                                   -----
(t1->{closid,rmid} -> {1,1})            (rq->curr -> t0)

__rdtgroup_move_task():
  t1->{closid,rmid} <- {2,2}
  curr <- t1->cpu->rq->curr
                                        __schedule():
                                          rq->curr <- t1
                                        resctrl_sched_in():
                                          t1->{closid,rmid} -> {2,2}
  if (curr == t1) // false
    IPI(t1->cpu)

In (1), CPU 0 is allowed to store {closid,rmid} after reading whether t1
is current:

CPU 0                                   CPU 1
-----                                   -----
__rdtgroup_move_task():
  curr <- t1->cpu->rq->curr
                                        __schedule():
                                          rq->curr <- t1
                                        resctrl_sched_in():
                                          t1->{closid,rmid} -> {1,1}
  t1->{closid,rmid} <- {2,2}
  if (curr == t1) // false
   IPI(t1->cpu)

Please let me know if these diagrams clarify things.

-Peter


* Re: [PATCH v4 1/2] x86/resctrl: Update task closid/rmid with task_call_func()
  2022-12-08 22:30         ` Peter Newman
@ 2022-12-09 23:54           ` Reinette Chatre
  2022-12-12 17:36             ` Peter Newman
  0 siblings, 1 reply; 13+ messages in thread
From: Reinette Chatre @ 2022-12-09 23:54 UTC (permalink / raw)
  To: Peter Newman
  Cc: bp, derkling, eranian, fenghua.yu, hpa, james.morse, jannh,
	kpsingh, linux-kernel, mingo, tglx, x86

Hi Peter,

On 12/8/2022 2:30 PM, Peter Newman wrote:
> On Wed, Dec 7, 2022 at 7:41 PM Reinette Chatre <reinette.chatre@intel.com> wrote:
>> On 12/7/2022 2:58 AM, Peter Newman wrote:
>>>>>  2. resctrl_sched_in() loads t->{closid,rmid} before the calling context
>>>>>     switch stores new task_curr() and task_cpu() values.

...

> 
> Based on this, I'll just sketch out the first scenario below and drop
> (2) from the changelog. This also implies that the group update cases

ok, thank you for doing that analysis.

> can use a single smp_mb() to provide all the necessary ordering, because
> there's a full barrier on context switch for it to pair with, so I don't
> need to broadcast IPI anymore.  I don't know whether task_call_func() is

This is not clear to me because rdt_move_group_tasks() seems to have the
same code as shown below as vulnerable to re-ordering. Only difference
is that it uses the "//false" checks to set a bit in the cpumask for a
later IPI instead of an immediate IPI.

> faster than an smp_mb(). I'll take some measurements to see.
> 
> The presumed behavior is that if __rdtgroup_move_task() does not yet see t1
> running, then t1 observes the updated values:
> 
> CPU 0                                   CPU 1
> -----                                   -----
> (t1->{closid,rmid} -> {1,1})            (rq->curr -> t0)
> 
> __rdtgroup_move_task():
>   t1->{closid,rmid} <- {2,2}
>   curr <- t1->cpu->rq->curr
>                                         __schedule():
>                                           rq->curr <- t1
>                                         resctrl_sched_in():
>                                           t1->{closid,rmid} -> {2,2}
>   if (curr == t1) // false
>     IPI(t1->cpu)

I understand that the test is false when it may be expected to be true, but
there does not seem to be a problem because of that. t1 was scheduled in with
the correct CLOSID/RMID and its CPU did not get an unnecessary IPI.

> In (1), CPU 0 is allowed to store {closid,rmid} after reading whether t1
> is current:
> 
> CPU 0                                   CPU 1
> -----                                   -----
> __rdtgroup_move_task():
>   curr <- t1->cpu->rq->curr
>                                         __schedule():
>                                           rq->curr <- t1
>                                         resctrl_sched_in():
>                                           t1->{closid,rmid} -> {1,1}
>   t1->{closid,rmid} <- {2,2}
>   if (curr == t1) // false
>    IPI(t1->cpu)

Yes, this I understand to be the problematic scenario.
 
> Please let me know if these diagrams clarify things.

They do, thank you very much.

Reinette


* Re: [PATCH v4 1/2] x86/resctrl: Update task closid/rmid with task_call_func()
  2022-12-09 23:54           ` Reinette Chatre
@ 2022-12-12 17:36             ` Peter Newman
  2022-12-13 18:33               ` Reinette Chatre
  0 siblings, 1 reply; 13+ messages in thread
From: Peter Newman @ 2022-12-12 17:36 UTC (permalink / raw)
  To: reinette.chatre
  Cc: bp, derkling, eranian, fenghua.yu, hpa, james.morse, jannh,
	kpsingh, linux-kernel, mingo, peternewman, tglx, x86

Hi Reinette,

On Sat, Dec 10, 2022 at 12:54 AM Reinette Chatre <reinette.chatre@intel.com> wrote:
> On 12/8/2022 2:30 PM, Peter Newman wrote:
> > Based on this, I'll just sketch out the first scenario below and drop
> > (2) from the changelog. This also implies that the group update cases
>
> ok, thank you for doing that analysis.
>
> > can use a single smp_mb() to provide all the necessary ordering, because
> > there's a full barrier on context switch for it to pair with, so I don't
> > need to broadcast IPI anymore.  I don't know whether task_call_func() is
>
> This is not clear to me because rdt_move_group_tasks() seems to have the
> same code as shown below as vulnerable to re-ordering. Only difference
> is that it uses the "//false" checks to set a bit in the cpumask for a
> later IPI instead of an immediate IPI.

An smp_mb() between writing the new task_struct::{closid,rmid} and
calling task_curr() would prevent the reordering I described, but I
was worried about the cost of executing a full barrier for every
matching task.

I tried something like this:

for_each_process_thread(p, t) {
	if (!from || is_closid_match(t, from) ||
	    is_rmid_match(t, from)) {
		WRITE_ONCE(t->closid, to->closid);
		WRITE_ONCE(t->rmid, to->mon.rmid);
		/* group moves are serialized by rdt */
		t->resctrl_dirty = true;
	}
}
if (IS_ENABLED(CONFIG_SMP) && mask) {
	/* Order t->{closid,rmid} stores before loads in task_curr() */
	smp_mb();
	for_each_process_thread(p, t) {
		if (t->resctrl_dirty) {
			if (task_curr(t))
				cpumask_set_cpu(task_cpu(t), mask);
			t->resctrl_dirty = false;
		}
	}
}

I repeated the `perf bench sched messaging -g 40 -l100000 ` benchmark
from before[1] to compare this with the baseline, and found that it
only increased the time to delete the benchmark's group from 1.65ms to
1.66ms, so it's an alternative to what I last posted.

I could do something similar in the single-task move, but I don't think
it makes much of a performance difference in that case. It's only a win
for the group move because the synchronization cost doesn't grow with
the group size.

[1] https://lore.kernel.org/lkml/20221129111055.953833-3-peternewman@google.com/


>
> > faster than an smp_mb(). I'll take some measurements to see.
> >
> > The presumed behavior is that if __rdtgroup_move_task() does not yet see t1
> > running, then t1 observes the updated values:
> >
> > CPU 0                                   CPU 1
> > -----                                   -----
> > (t1->{closid,rmid} -> {1,1})            (rq->curr -> t0)
> >
> > __rdtgroup_move_task():
> >   t1->{closid,rmid} <- {2,2}
> >   curr <- t1->cpu->rq->curr
> >                                         __schedule():
> >                                           rq->curr <- t1
> >                                         resctrl_sched_in():
> >                                           t1->{closid,rmid} -> {2,2}
> >   if (curr == t1) // false
> >     IPI(t1->cpu)
>
> I understand that the test is false when it may be expected to be true, but
> there does not seem to be a problem because of that. t1 was scheduled in with
> the correct CLOSID/RMID and its CPU did not get an unnecessary IPI.

Yes, this one was reminding the reader of the correct behavior. I can
just leave it out.

-Peter


* Re: [PATCH v4 1/2] x86/resctrl: Update task closid/rmid with task_call_func()
  2022-12-12 17:36             ` Peter Newman
@ 2022-12-13 18:33               ` Reinette Chatre
  2022-12-14 10:05                 ` Peter Newman
  0 siblings, 1 reply; 13+ messages in thread
From: Reinette Chatre @ 2022-12-13 18:33 UTC (permalink / raw)
  To: Peter Newman
  Cc: bp, derkling, eranian, fenghua.yu, hpa, james.morse, jannh,
	kpsingh, linux-kernel, mingo, tglx, x86

Hi Peter,

On 12/12/2022 9:36 AM, Peter Newman wrote:
> On Sat, Dec 10, 2022 at 12:54 AM Reinette Chatre <reinette.chatre@intel.com> wrote:
>> On 12/8/2022 2:30 PM, Peter Newman wrote:
>>> Based on this, I'll just sketch out the first scenario below and drop
>>> (2) from the changelog. This also implies that the group update cases
>>
>> ok, thank you for doing that analysis.
>>
>>> can use a single smp_mb() to provide all the necessary ordering, because
>>> there's a full barrier on context switch for it to pair with, so I don't
>>> need to broadcast IPI anymore.  I don't know whether task_call_func() is
>>
>> This is not clear to me because rdt_move_group_tasks() seems to have the
>> same code as shown below as vulnerable to re-ordering. Only difference
>> is that it uses the "//false" checks to set a bit in the cpumask for a
>> later IPI instead of an immediate IPI.
> 
> An smp_mb() between writing the new task_struct::{closid,rmid} and
> calling task_curr() would prevent the reordering I described, but I
> was worried about the cost of executing a full barrier for every
> matching task.

So for moving a single task the solution may just be to change
the current barrier() to smp_mb()?

> 
> I tried something like this:
> 
> for_each_process_thread(p, t) {
> 	if (!from || is_closid_match(t, from) ||
> 	    is_rmid_match(t, from)) {
> 		WRITE_ONCE(t->closid, to->closid);
> 		WRITE_ONCE(t->rmid, to->mon.rmid);
> 		/* group moves are serialized by rdt */
> 		t->resctrl_dirty = true;
> 	}
> }
> if (IS_ENABLED(CONFIG_SMP) && mask) {
> 	/* Order t->{closid,rmid} stores before loads in task_curr() */
> 	smp_mb();
> 	for_each_process_thread(p, t) {
> 		if (t->resctrl_dirty) {
> 			if (task_curr(t))
> 				cpumask_set_cpu(task_cpu(t), mask);
> 			t->resctrl_dirty = false;
> 		}
> 	}
> }
> 

struct task_struct would not welcome a new member dedicated to resctrl's
rare use for convenience. Another option may be to use a flag within
the variables themselves, but that seems like a significant complication
(the flag would need to be dealt with during scheduling) for which the
benefit is not clear to me. I would prefer that we start with the simplest
solution first (I see that as IPI to all CPUs). The changelog needs a clear
description of the problem to be solved and the solution chosen, noting
the tradeoffs with other possible solutions. You can submit that, as an RFC
if the "IPI to all CPUs" remains a concern, after which we can bring that
submission to the attention of the experts, who would then have the
information needed to point us in the right direction.

Reinette


* Re: [PATCH v4 1/2] x86/resctrl: Update task closid/rmid with task_call_func()
  2022-12-13 18:33               ` Reinette Chatre
@ 2022-12-14 10:05                 ` Peter Newman
  0 siblings, 0 replies; 13+ messages in thread
From: Peter Newman @ 2022-12-14 10:05 UTC (permalink / raw)
  To: Reinette Chatre
  Cc: bp, derkling, eranian, fenghua.yu, hpa, james.morse, jannh,
	kpsingh, linux-kernel, mingo, tglx, x86

Hi Reinette,

On Tue, Dec 13, 2022 at 7:34 PM Reinette Chatre
<reinette.chatre@intel.com> wrote:
> On 12/12/2022 9:36 AM, Peter Newman wrote:
> > On Sat, Dec 10, 2022 at 12:54 AM Reinette Chatre <reinette.chatre@intel.com> wrote:
> >> On 12/8/2022 2:30 PM, Peter Newman wrote:
> >>> Based on this, I'll just sketch out the first scenario below and drop
> >>> (2) from the changelog. This also implies that the group update cases
> >>
> >> ok, thank you for doing that analysis.
> >>
> >>> can use a single smp_mb() to provide all the necessary ordering, because
> >>> there's a full barrier on context switch for it to pair with, so I don't
> >>> need to broadcast IPI anymore.  I don't know whether task_call_func() is
> >>
> >> This is not clear to me because rdt_move_group_tasks() seems to have the
> >> same code as shown below as vulnerable to re-ordering. Only difference
> >> is that it uses the "//false" checks to set a bit in the cpumask for a
> >> later IPI instead of an immediate IPI.
> >
> > An smp_mb() between writing the new task_struct::{closid,rmid} and
> > calling task_curr() would prevent the reordering I described, but I
> > was worried about the cost of executing a full barrier for every
> > matching task.
>
> So for moving a single task the solution may just be to change
> the current barrier() to smp_mb()?

Yes, that's a simpler change, so I'll just do that instead.
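
For reference, a minimal sketch of what that variant could look like in the
current __rdtgroup_move_task() (untested, and eliding the monitor-group
checks; only the barrier strength changes relative to the existing code):

	WRITE_ONCE(tsk->closid, rdtgrp->closid);
	WRITE_ONCE(tsk->rmid, rdtgrp->mon.rmid);

	/*
	 * Order the closid/rmid stores above before the loads in
	 * task_curr() and task_cpu(); pairs with the full barrier
	 * between the rq->curr update and resctrl_sched_in() during
	 * context switch.
	 */
	smp_mb();

	update_task_closid_rmid(tsk);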

>
> >
> > I tried something like this:
> >
> > for_each_process_thread(p, t) {
> >       if (!from || is_closid_match(t, from) ||
> >           is_rmid_match(t, from)) {
> >               WRITE_ONCE(t->closid, to->closid);
> >               WRITE_ONCE(t->rmid, to->mon.rmid);
> >               /* group moves are serialized by rdt */
> >               t->resctrl_dirty = true;
> >       }
> > }
> > if (IS_ENABLED(CONFIG_SMP) && mask) {
> >       /* Order t->{closid,rmid} stores before loads in task_curr() */
> >       smp_mb();
> >       for_each_process_thread(p, t) {
> >               if (t->resctrl_dirty) {
> >                       if (task_curr(t))
> >                               cpumask_set_cpu(task_cpu(t), mask);
> >                       t->resctrl_dirty = false;
> >               }
> >       }
> > }
> >
>
> struct task_struct would not welcome a new member dedicated to resctrl's
> rare use for convenience. Another option may be to use a flag within
> the variables themselves, but that seems like a significant complication
> (the flag would need to be dealt with during scheduling) for which the
> benefit is not clear to me. I would prefer that we start with the simplest
> solution first (I see that as IPI to all CPUs). The changelog needs a clear
> description of the problem to be solved and the solution chosen, noting
> the tradeoffs with other possible solutions. You can submit that, as an RFC
> if the "IPI to all CPUs" remains a concern, after which we can bring that
> submission to the attention of the experts, who would then have the
> information needed to point us in the right direction.

To be complete, I did the benchmark again with the simple addition of
an smp_mb() on every iteration with a matching CLOSID/RMID and found
that it didn't result in a substantial performance impact (1.57ms ->
1.65ms).
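
The variant measured here is roughly the following (an untested sketch
against the existing rdt_move_group_tasks() loop; the per-iteration
smp_mb() is the only intended change):

	for_each_process_thread(p, t) {
		if (!from || is_closid_match(t, from) ||
		    is_rmid_match(t, from)) {
			WRITE_ONCE(t->closid, to->closid);
			WRITE_ONCE(t->rmid, to->mon.rmid);

			/*
			 * Order the stores above before the task_curr()
			 * and task_cpu() loads below.
			 */
			smp_mb();

			if (IS_ENABLED(CONFIG_SMP) && mask && task_curr(t))
				cpumask_set_cpu(task_cpu(t), mask);
		}
	}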

This isn't as significant as the 2x slowdown I saw when using
task_call_func(), so maybe task_call_func() is just really expensive.
That's more reason to just upgrade the barrier in the single-task move
case.

While I agree with your points on the IPI broadcast, it seems like a
discussion I would prefer to just avoid given these measurements.

-Peter

