* [PATCH 0/3] x86/resctrl: Fix a few issues in moving a task to a resource group
@ 2020-12-03 23:25 Reinette Chatre
  2020-12-03 23:25 ` [PATCH 1/3] x86/resctrl: Move setting task's active CPU in a mask into helpers Reinette Chatre
From: Reinette Chatre @ 2020-12-03 23:25 UTC (permalink / raw)
  To: tglx, fenghua.yu, bp, tony.luck
  Cc: kuo-lang.tseng, shakeelb, valentin.schneider, mingo, babu.moger,
	james.morse, hpa, x86, linux-kernel, Reinette Chatre

When a user writes a task id to a resctrl "tasks" file, the task is moved
to the resource group to which that "tasks" file belongs. This primarily
involves updating the task's closid and rmid in its task_struct and, for a
running task, setting these new values in the PQR_ASSOC register, which
reflects the active closid and rmid on a CPU.
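
For reference, the scheduled-in update is roughly the following
(a simplified sketch of __resctrl_sched_in(); static branch checks
elided):

	struct resctrl_pqr_state *state = this_cpu_ptr(&pqr_state);
	u32 closid = current->closid ? current->closid : state->default_closid;
	u32 rmid = current->rmid ? current->rmid : state->default_rmid;

	/* Only write the MSR when the active values actually change. */
	if (closid != state->cur_closid || rmid != state->cur_rmid) {
		state->cur_closid = closid;
		state->cur_rmid = rmid;
		wrmsr(IA32_PQR_ASSOC, rmid, closid);
	}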

Moving a task to a new resource group is currently accomplished by
updating its task_struct and queueing the work that updates the PQR_ASSOC
register. If the task is running, the queued work is run as soon as
possible; if the task is not running, the work is run when the task exits
the kernel and returns to user mode (task_work_add(..., ..., TWA_RESUME)).

Updating the PQR_ASSOC MSR as soon as possible on the CPU a moved task
is running on is the right thing to do. Queueing work for a task that is
not running is unnecessary (the PQR_ASSOC MSR is already updated when the
task is scheduled in) and wastes system resources because of the way in
which it is implemented: work to update the PQR_ASSOC register is queued
every time the user writes a task id to the "tasks" file, even if the task
already belongs to the resource group. This can result in multiple pending
work items associated with a single task, even if they are all identical
and even though only a single update with the most recent values is
needed. Specifically, even if a task is moved between different resource
groups while it is sleeping, only the last move is relevant, yet a work
item is queued during each move.
This unnecessary queueing of work items can waste significant system
resources, especially for tasks that sleep for a long time. For example,
as demonstrated by Shakeel Butt in [1], repeatedly writing the same task
id to the "tasks" file can quickly consume significant memory. The same
problem (wasted system resources) occurs when moving a task between
different resource groups. As pointed out by Valentin Schneider in [2],
there is an additional issue with the way in which the queueing of work
is done: the task_struct update is currently done after the work is
queued, resulting in a race where the register update may run before the
data needed by the update is available.
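
For illustration, a hypothetical user-space loop in the spirit of the
reproducer in [1] (the resource group path and pid are examples only):

	/*
	 * Hypothetical sketch: each write queues another work item for
	 * the target task, so while that task sleeps the pending
	 * callbacks, and the memory backing them, grow without bound.
	 */
	int fd = open("/sys/fs/resctrl/group1/tasks", O_WRONLY);
	char buf[16];
	int len = snprintf(buf, sizeof(buf), "%d", pid);

	for (;;)
		write(fd, buf, len);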

This series fixes all the above issues related to the queueing of the updates
to the PQR_ASSOC register.

There is already a mechanism within resctrl for moving tasks that can
guide the solution. Tasks are also moved between resource groups when a
resource group is removed, an action likewise initiated by the user, via
"rmdir" of the resource group directory. In this case resctrl moves all
tasks belonging to the removed group to the default resource group (in the
case of a control group) or to the parent resource group (in the case of a
monitor group). These task moves are handled synchronously by resctrl with
an immediate update of the PQR_ASSOC register on the CPU on which the task
is currently running.

This fix follows the same update mechanism used when resource groups are
removed. The task's closid and rmid are updated in its task_struct as
before. Instead of queueing work to update the PQR_ASSOC register with the
new values, the update is done immediately on the CPU where the task is
currently running. If the task is not running, no action is needed since
the register will be updated when the task is scheduled in.
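
In outline, the new flow (using the helpers added in patches 1 and 2) is:

	int cpu;

	/* Update the task's own state first ... */
	tsk->closid = rdtgrp->closid;
	tsk->rmid = rdtgrp->mon.rmid;

	/*
	 * ... then, only if the task is currently running on a CPU, IPI
	 * that CPU. The handler re-checks that the task is still current
	 * there and, if so, calls resctrl_sched_in() to reload PQR_ASSOC.
	 */
	if (task_on_cpu(tsk, &cpu))
		smp_call_function_single(cpu, _update_task_closid_rmid, tsk, 1);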

After patch 1 does some preparation, patch 2 updates the PQR_ASSOC MSR
synchronously instead of in a callback. Patch 3 fixes the issue of
unnecessary work when a task move is not needed (when a user writes a task
id to the "tasks" file of the resource group to which the task already
belongs) by adding a sanity check that avoids costly move operations
within the same resource group.

Valentin's series in [2] ends by adding memory barriers to support the
updating of the task_struct from one CPU and the usage of the task_struct data
from another CPU. This work is still needed and as discussed with Valentin in
that thread the work would be re-evaluated by him after seeing how this series
turns out.

[1]: https://lore.kernel.org/lkml/CALvZod7E9zzHwenzf7objzGKsdBmVwTgEJ0nPgs0LUFU3SN5Pw@mail.gmail.com/
[2]: https://lore.kernel.org/lkml/20201123022433.17905-1-valentin.schneider@arm.com

Fenghua Yu (3):
  x86/resctrl: Move setting task's active CPU in a mask into helpers
  x86/resctrl: Update PQR_ASSOC MSR synchronously when moving task to
    resource group
  x86/resctrl: Don't move a task to the same resource group

 arch/x86/kernel/cpu/resctrl/rdtgroup.c | 159 +++++++++++++------------
 1 file changed, 82 insertions(+), 77 deletions(-)


base-commit: b65054597872ce3aefbc6a666385eabdf9e288da
-- 
2.26.2



* [PATCH 1/3] x86/resctrl: Move setting task's active CPU in a mask into helpers
  2020-12-03 23:25 [PATCH 0/3] x86/resctrl: Fix a few issues in moving a task to a resource group Reinette Chatre
@ 2020-12-03 23:25 ` Reinette Chatre
  2020-12-07 18:29   ` Borislav Petkov
  2020-12-09 16:47   ` James Morse
  2020-12-03 23:25 ` [PATCH 2/3] x86/resctrl: Update PQR_ASSOC MSR synchronously when moving task to resource group Reinette Chatre
From: Reinette Chatre @ 2020-12-03 23:25 UTC (permalink / raw)
  To: tglx, fenghua.yu, bp, tony.luck
  Cc: kuo-lang.tseng, shakeelb, valentin.schneider, mingo, babu.moger,
	james.morse, hpa, x86, linux-kernel, Reinette Chatre, stable

From: Fenghua Yu <fenghua.yu@intel.com>

The code of setting the CPU on which a task is running in a CPU mask is
moved into a couple of helpers. The new helper task_on_cpu() will be
reused shortly.

Signed-off-by: Fenghua Yu <fenghua.yu@intel.com>
Signed-off-by: Reinette Chatre <reinette.chatre@intel.com>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Cc: stable@vger.kernel.org
---
 arch/x86/kernel/cpu/resctrl/rdtgroup.c | 47 +++++++++++++++++++-------
 1 file changed, 34 insertions(+), 13 deletions(-)

diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
index 6f4ca4bea625..68db7d2dec8f 100644
--- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
+++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
@@ -525,6 +525,38 @@ static void rdtgroup_remove(struct rdtgroup *rdtgrp)
 	kfree(rdtgrp);
 }
 
+#ifdef CONFIG_SMP
+/* Get the CPU if the task is on it. */
+static bool task_on_cpu(struct task_struct *t, int *cpu)
+{
+	/*
+	 * This is safe on x86 w/o barriers as the ordering of writing to
+	 * task_cpu() and t->on_cpu is reverse to the reading here. The
+	 * detection is inaccurate as tasks might move or schedule before
+	 * the smp function call takes place. In such a case the function
+	 * call is pointless, but there is no other side effect.
+	 */
+	if (t->on_cpu) {
+		*cpu = task_cpu(t);
+
+		return true;
+	}
+
+	return false;
+}
+
+static void set_task_cpumask(struct task_struct *t, struct cpumask *mask)
+{
+	int cpu;
+
+	if (mask && task_on_cpu(t, &cpu))
+		cpumask_set_cpu(cpu, mask);
+}
+#else
+static inline void
+set_task_cpumask(struct task_struct *t, struct cpumask *mask) { }
+#endif
+
 struct task_move_callback {
 	struct callback_head	work;
 	struct rdtgroup		*rdtgrp;
@@ -2327,19 +2359,8 @@ static void rdt_move_group_tasks(struct rdtgroup *from, struct rdtgroup *to,
 			t->closid = to->closid;
 			t->rmid = to->mon.rmid;
 
-#ifdef CONFIG_SMP
-			/*
-			 * This is safe on x86 w/o barriers as the ordering
-			 * of writing to task_cpu() and t->on_cpu is
-			 * reverse to the reading here. The detection is
-			 * inaccurate as tasks might move or schedule
-			 * before the smp function call takes place. In
-			 * such a case the function call is pointless, but
-			 * there is no other side effect.
-			 */
-			if (mask && t->on_cpu)
-				cpumask_set_cpu(task_cpu(t), mask);
-#endif
+			/* If the task is on a CPU, set the CPU in the mask. */
+			set_task_cpumask(t, mask);
 		}
 	}
 	read_unlock(&tasklist_lock);
-- 
2.26.2



* [PATCH 2/3] x86/resctrl: Update PQR_ASSOC MSR synchronously when moving task to resource group
  2020-12-03 23:25 [PATCH 0/3] x86/resctrl: Fix a few issues in moving a task to a resource group Reinette Chatre
  2020-12-03 23:25 ` [PATCH 1/3] x86/resctrl: Move setting task's active CPU in a mask into helpers Reinette Chatre
@ 2020-12-03 23:25 ` Reinette Chatre
  2020-12-09 16:51   ` James Morse
  2020-12-11 20:46   ` Valentin Schneider
  2020-12-03 23:25 ` [PATCH 3/3] x86/resctrl: Don't move a task to the same " Reinette Chatre
From: Reinette Chatre @ 2020-12-03 23:25 UTC (permalink / raw)
  To: tglx, fenghua.yu, bp, tony.luck
  Cc: kuo-lang.tseng, shakeelb, valentin.schneider, mingo, babu.moger,
	james.morse, hpa, x86, linux-kernel, Reinette Chatre, stable

From: Fenghua Yu <fenghua.yu@intel.com>

Currently when moving a task to a resource group the PQR_ASSOC MSR
is updated with the new closid and rmid in an added task callback.
If the task is running the work is run as soon as possible. If the
task is not running the work is executed later in the kernel exit path
when the kernel returns to the task again.

Updating the PQR_ASSOC MSR as soon as possible on the CPU a moved task
is running is the right thing to do. Queueing work for a task that is
not running is unnecessary (the PQR_ASSOC MSR is already updated when the
task is scheduled in) and causes system resource waste with the way in
which it is implemented: Work to update the PQR_ASSOC register is queued
every time the user writes a task id to the "tasks" file, even if the task
already belongs to the resource group. This could result in multiple pending
work items associated with a single task even if they are all identical and
even though only a single update with most recent values is needed.
Specifically, even if a task is moved between different resource groups
while it is sleeping, only the last move is relevant, yet a work item is
queued during each move.
This unnecessary queueing of work items could result in significant system
resource waste, especially on tasks sleeping for a long time. For example,
as demonstrated by Shakeel Butt in [1] writing the same task id to the
"tasks" file can quickly consume significant memory. The same problem
(wasted system resources) occurs when moving a task between different
resource groups.

As pointed out by Valentin Schneider in [2] there is an additional issue with
the way in which the queueing of work is done in that the task_struct update
is currently done after the work is queued, resulting in a race with the
register update possibly done before the data needed by the update is available.

To solve these issues, the PQR_ASSOC MSR is updated in a synchronous way
right after the new closid and rmid are ready during the task movement,
only if the task is running. If a moved task is not running nothing is
done since the PQR_ASSOC MSR will be updated next time the task is scheduled.
This is the same way used to update the register when tasks are moved as
part of resource group removal.

[1] https://lore.kernel.org/lkml/CALvZod7E9zzHwenzf7objzGKsdBmVwTgEJ0nPgs0LUFU3SN5Pw@mail.gmail.com/
[2] https://lore.kernel.org/lkml/20201123022433.17905-1-valentin.schneider@arm.com

Fixes: e02737d5b826 ("x86/intel_rdt: Add tasks files")
Reported-by: Shakeel Butt <shakeelb@google.com>
Reported-by: Valentin Schneider <valentin.schneider@arm.com>
Signed-off-by: Fenghua Yu <fenghua.yu@intel.com>
Signed-off-by: Reinette Chatre <reinette.chatre@intel.com>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Cc: stable@vger.kernel.org
---
 arch/x86/kernel/cpu/resctrl/rdtgroup.c | 123 ++++++++++---------------
 1 file changed, 50 insertions(+), 73 deletions(-)

diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
index 68db7d2dec8f..9d62f1fadcc3 100644
--- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
+++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
@@ -525,6 +525,16 @@ static void rdtgroup_remove(struct rdtgroup *rdtgrp)
 	kfree(rdtgrp);
 }
 
+static void _update_task_closid_rmid(void *task)
+{
+	/*
+	 * If the task is still current on this CPU, update PQR_ASSOC MSR.
+	 * Otherwise, the MSR is updated when the task is scheduled in.
+	 */
+	if (task == current)
+		resctrl_sched_in();
+}
+
 #ifdef CONFIG_SMP
 /* Get the CPU if the task is on it. */
 static bool task_on_cpu(struct task_struct *t, int *cpu)
@@ -552,94 +562,61 @@ static void set_task_cpumask(struct task_struct *t, struct cpumask *mask)
 	if (mask && task_on_cpu(t, &cpu))
 		cpumask_set_cpu(cpu, mask);
 }
-#else
-static inline void
-set_task_cpumask(struct task_struct *t, struct cpumask *mask) { }
-#endif
-
-struct task_move_callback {
-	struct callback_head	work;
-	struct rdtgroup		*rdtgrp;
-};
 
-static void move_myself(struct callback_head *head)
+static void update_task_closid_rmid(struct task_struct *t)
 {
-	struct task_move_callback *callback;
-	struct rdtgroup *rdtgrp;
-
-	callback = container_of(head, struct task_move_callback, work);
-	rdtgrp = callback->rdtgrp;
-
-	/*
-	 * If resource group was deleted before this task work callback
-	 * was invoked, then assign the task to root group and free the
-	 * resource group.
-	 */
-	if (atomic_dec_and_test(&rdtgrp->waitcount) &&
-	    (rdtgrp->flags & RDT_DELETED)) {
-		current->closid = 0;
-		current->rmid = 0;
-		rdtgroup_remove(rdtgrp);
-	}
+	int cpu;
 
-	if (unlikely(current->flags & PF_EXITING))
-		goto out;
+	if (task_on_cpu(t, &cpu))
+		smp_call_function_single(cpu, _update_task_closid_rmid, t, 1);
+}
 
-	preempt_disable();
-	/* update PQR_ASSOC MSR to make resource group go into effect */
-	resctrl_sched_in();
-	preempt_enable();
+#else
+static inline void
+set_task_cpumask(struct task_struct *t, struct cpumask *mask) { }
 
-out:
-	kfree(callback);
+static void update_task_closid_rmid(struct task_struct *t)
+{
+	_update_task_closid_rmid(t);
 }
+#endif
 
 static int __rdtgroup_move_task(struct task_struct *tsk,
 				struct rdtgroup *rdtgrp)
 {
-	struct task_move_callback *callback;
-	int ret;
-
-	callback = kzalloc(sizeof(*callback), GFP_KERNEL);
-	if (!callback)
-		return -ENOMEM;
-	callback->work.func = move_myself;
-	callback->rdtgrp = rdtgrp;
-
 	/*
-	 * Take a refcount, so rdtgrp cannot be freed before the
-	 * callback has been invoked.
+	 * Set the task's closid/rmid before the PQR_ASSOC MSR can be
+	 * updated by them.
+	 *
+	 * For ctrl_mon groups, move both closid and rmid.
+	 * For monitor groups, can move the tasks only from
+	 * their parent CTRL group.
 	 */
-	atomic_inc(&rdtgrp->waitcount);
-	ret = task_work_add(tsk, &callback->work, TWA_RESUME);
-	if (ret) {
-		/*
-		 * Task is exiting. Drop the refcount and free the callback.
-		 * No need to check the refcount as the group cannot be
-		 * deleted before the write function unlocks rdtgroup_mutex.
-		 */
-		atomic_dec(&rdtgrp->waitcount);
-		kfree(callback);
-		rdt_last_cmd_puts("Task exited\n");
-	} else {
-		/*
-		 * For ctrl_mon groups move both closid and rmid.
-		 * For monitor groups, can move the tasks only from
-		 * their parent CTRL group.
-		 */
-		if (rdtgrp->type == RDTCTRL_GROUP) {
-			tsk->closid = rdtgrp->closid;
+
+	if (rdtgrp->type == RDTCTRL_GROUP) {
+		tsk->closid = rdtgrp->closid;
+		tsk->rmid = rdtgrp->mon.rmid;
+	} else if (rdtgrp->type == RDTMON_GROUP) {
+		if (rdtgrp->mon.parent->closid == tsk->closid) {
 			tsk->rmid = rdtgrp->mon.rmid;
-		} else if (rdtgrp->type == RDTMON_GROUP) {
-			if (rdtgrp->mon.parent->closid == tsk->closid) {
-				tsk->rmid = rdtgrp->mon.rmid;
-			} else {
-				rdt_last_cmd_puts("Can't move task to different control group\n");
-				ret = -EINVAL;
-			}
+		} else {
+			rdt_last_cmd_puts("Can't move task to different control group\n");
+			return -EINVAL;
 		}
+	} else {
+		rdt_last_cmd_puts("Invalid resource group type\n");
+		return -EINVAL;
 	}
-	return ret;
+
+	/*
+	 * By now, the task's closid and rmid are set. If the task is current
+	 * on a CPU, the PQR_ASSOC MSR needs to be updated to make the resource
+	 * group go into effect. If the task is not current, the MSR will be
+	 * updated when the task is scheduled in.
+	 */
+	update_task_closid_rmid(tsk);
+
+	return 0;
 }
 
 static bool is_closid_match(struct task_struct *t, struct rdtgroup *r)
-- 
2.26.2



* [PATCH 3/3] x86/resctrl: Don't move a task to the same resource group
  2020-12-03 23:25 [PATCH 0/3] x86/resctrl: Fix a few issues in moving a task to a resource group Reinette Chatre
  2020-12-03 23:25 ` [PATCH 1/3] x86/resctrl: Move setting task's active CPU in a mask into helpers Reinette Chatre
  2020-12-03 23:25 ` [PATCH 2/3] x86/resctrl: Update PQR_ASSOC MSR synchronously when moving task to resource group Reinette Chatre
@ 2020-12-03 23:25 ` Reinette Chatre
  2020-12-11 20:46 ` [PATCH 0/3] x86/resctrl: Fix a few issues in moving a task to a " Valentin Schneider
  2020-12-17 12:19 ` [PATCH 4/3] x86/intel_rdt: Apply READ_ONCE/WRITE_ONCE to task_struct .rmid & .closid Valentin Schneider
From: Reinette Chatre @ 2020-12-03 23:25 UTC (permalink / raw)
  To: tglx, fenghua.yu, bp, tony.luck
  Cc: kuo-lang.tseng, shakeelb, valentin.schneider, mingo, babu.moger,
	james.morse, hpa, x86, linux-kernel, Reinette Chatre, stable

From: Fenghua Yu <fenghua.yu@intel.com>

Shakeel Butt reported in [1] that a user can request a task to be moved
to a resource group even if the task is already in that group. Doing the
move operation in this case just wastes time, and the operation can be
costly since it may send an IPI to a different CPU.

Add a sanity check to ensure that the move operation only happens when
the task is not already in the resource group.

[1] https://lore.kernel.org/lkml/CALvZod7E9zzHwenzf7objzGKsdBmVwTgEJ0nPgs0LUFU3SN5Pw@mail.gmail.com/

Fixes: e02737d5b826 ("x86/intel_rdt: Add tasks files")
Reported-by: Shakeel Butt <shakeelb@google.com>
Signed-off-by: Fenghua Yu <fenghua.yu@intel.com>
Signed-off-by: Reinette Chatre <reinette.chatre@intel.com>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Cc: stable@vger.kernel.org
---
 arch/x86/kernel/cpu/resctrl/rdtgroup.c | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
index 9d62f1fadcc3..d3523032265c 100644
--- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
+++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
@@ -584,6 +584,13 @@ static void update_task_closid_rmid(struct task_struct *t)
 static int __rdtgroup_move_task(struct task_struct *tsk,
 				struct rdtgroup *rdtgrp)
 {
+	/* If the task is already in rdtgrp, no need to move the task. */
+	if ((rdtgrp->type == RDTCTRL_GROUP && tsk->closid == rdtgrp->closid &&
+	     tsk->rmid == rdtgrp->mon.rmid) ||
+	    (rdtgrp->type == RDTMON_GROUP && tsk->rmid == rdtgrp->mon.rmid &&
+	     tsk->closid == rdtgrp->mon.parent->closid))
+		return 0;
+
 	/*
 	 * Set the task's closid/rmid before the PQR_ASSOC MSR can be
 	 * updated by them.
-- 
2.26.2



* Re: [PATCH 1/3] x86/resctrl: Move setting task's active CPU in a mask into helpers
  2020-12-03 23:25 ` [PATCH 1/3] x86/resctrl: Move setting task's active CPU in a mask into helpers Reinette Chatre
@ 2020-12-07 18:29   ` Borislav Petkov
  2020-12-07 21:24     ` Reinette Chatre
  2020-12-09 16:47   ` James Morse
From: Borislav Petkov @ 2020-12-07 18:29 UTC (permalink / raw)
  To: Reinette Chatre
  Cc: tglx, fenghua.yu, tony.luck, kuo-lang.tseng, shakeelb,
	valentin.schneider, mingo, babu.moger, james.morse, hpa, x86,
	linux-kernel, stable

On Thu, Dec 03, 2020 at 03:25:48PM -0800, Reinette Chatre wrote:
> From: Fenghua Yu <fenghua.yu@intel.com>
> 
> The code of setting the CPU on which a task is running in a CPU mask is
> moved into a couple of helpers.

Pls read section "2) Describe your changes" in
Documentation/process/submitting-patches.rst for more details.

More specifically:

"Describe your changes in imperative mood, e.g. "make xyzzy do frotz"
instead of "[This patch] makes xyzzy do frotz" or "[I] changed xyzzy
to do frotz", as if you are giving orders to the codebase to change
its behaviour."

> The new helper task_on_cpu() will be reused shortly.

"reused shortly"? I don't think so.

> 
> Signed-off-by: Fenghua Yu <fenghua.yu@intel.com>
> Signed-off-by: Reinette Chatre <reinette.chatre@intel.com>
> Reviewed-by: Tony Luck <tony.luck@intel.com>
> Cc: stable@vger.kernel.org

Fixes?

I guess the same commit from the other two:

Fixes: e02737d5b826 ("x86/intel_rdt: Add tasks files")

?

> ---
>  arch/x86/kernel/cpu/resctrl/rdtgroup.c | 47 +++++++++++++++++++-------
>  1 file changed, 34 insertions(+), 13 deletions(-)
> 
> diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> index 6f4ca4bea625..68db7d2dec8f 100644
> --- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> +++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> @@ -525,6 +525,38 @@ static void rdtgroup_remove(struct rdtgroup *rdtgrp)
>  	kfree(rdtgrp);
>  }
>  
> +#ifdef CONFIG_SMP
> +/* Get the CPU if the task is on it. */
> +static bool task_on_cpu(struct task_struct *t, int *cpu)
> +{
> +	/*
> +	 * This is safe on x86 w/o barriers as the ordering of writing to
> +	 * task_cpu() and t->on_cpu is reverse to the reading here. The
> +	 * detection is inaccurate as tasks might move or schedule before
> +	 * the smp function call takes place. In such a case the function
> +	 * call is pointless, but there is no other side effect.
> +	 */
> +	if (t->on_cpu) {
> +		*cpu = task_cpu(t);

Why have an I/O parameter when you can make it simply:

static int task_on_cpu(struct task_struct *t)
{
	if (t->on_cpu)
		return task_cpu(t);

	return -1;
}

> +
> +		return true;
> +	}
> +
> +	return false;
> +}
> +
> +static void set_task_cpumask(struct task_struct *t, struct cpumask *mask)
> +{
> +	int cpu;
> +
> +	if (mask && task_on_cpu(t, &cpu))
> +		cpumask_set_cpu(cpu, mask);

And that you can turn into:

	if (!mask)
		return;

	cpu = task_on_cpu(t);
	if (cpu < 0)
		return;

	cpumask_set_cpu(cpu, mask);

Readable and simple.

Hmm?

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette


* Re: [PATCH 1/3] x86/resctrl: Move setting task's active CPU in a mask into helpers
  2020-12-07 18:29   ` Borislav Petkov
@ 2020-12-07 21:24     ` Reinette Chatre
  2020-12-08  9:49       ` Borislav Petkov
From: Reinette Chatre @ 2020-12-07 21:24 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: tglx, fenghua.yu, tony.luck, kuo-lang.tseng, shakeelb,
	valentin.schneider, mingo, babu.moger, james.morse, hpa, x86,
	linux-kernel, stable

Hi Borislav,

Thank you very much for your review.

On 12/7/2020 10:29 AM, Borislav Petkov wrote:
> On Thu, Dec 03, 2020 at 03:25:48PM -0800, Reinette Chatre wrote:
>> From: Fenghua Yu <fenghua.yu@intel.com>
>>
>> The code of setting the CPU on which a task is running in a CPU mask is
>> moved into a couple of helpers.
> 
> Pls read section "2) Describe your changes" in
> Documentation/process/submitting-patches.rst for more details.
> 
> More specifically:
> 
> "Describe your changes in imperative mood, e.g. "make xyzzy do frotz"
> instead of "[This patch] makes xyzzy do frotz" or "[I] changed xyzzy
> to do frotz", as if you are giving orders to the codebase to change
> its behaviour."
> 
>> The new helper task_on_cpu() will be reused shortly.
> 
> "reused shortly"? I don't think so.


How about:
"Move the setting of the CPU on which a task is running in a CPU mask 
into a couple of helpers.

There is no functional change. This is a preparatory change for the fix 
in the following patch from where the Fixes tag is copied."

> 
>>
>> Signed-off-by: Fenghua Yu <fenghua.yu@intel.com>
>> Signed-off-by: Reinette Chatre <reinette.chatre@intel.com>
>> Reviewed-by: Tony Luck <tony.luck@intel.com>
>> Cc: stable@vger.kernel.org
> 
> Fixes?
> 
> I guess the same commit from the other two:
> 
> Fixes: e02737d5b826 ("x86/intel_rdt: Add tasks files")
> 
> ?

Correct. I will add it. The addition to the commit message above aims to 
explain a Fixes tag to a patch with no functional changes.


>> ---
>>   arch/x86/kernel/cpu/resctrl/rdtgroup.c | 47 +++++++++++++++++++-------
>>   1 file changed, 34 insertions(+), 13 deletions(-)
>>
>> diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
>> index 6f4ca4bea625..68db7d2dec8f 100644
>> --- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
>> +++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
>> @@ -525,6 +525,38 @@ static void rdtgroup_remove(struct rdtgroup *rdtgrp)
>>   	kfree(rdtgrp);
>>   }
>>   
>> +#ifdef CONFIG_SMP
>> +/* Get the CPU if the task is on it. */
>> +static bool task_on_cpu(struct task_struct *t, int *cpu)
>> +{
>> +	/*
>> +	 * This is safe on x86 w/o barriers as the ordering of writing to
>> +	 * task_cpu() and t->on_cpu is reverse to the reading here. The
>> +	 * detection is inaccurate as tasks might move or schedule before
>> +	 * the smp function call takes place. In such a case the function
>> +	 * call is pointless, but there is no other side effect.
>> +	 */
>> +	if (t->on_cpu) {
>> +		*cpu = task_cpu(t);
> 
> Why have an I/O parameter when you can make it simply:
> 
> static int task_on_cpu(struct task_struct *t)
> {
> 	if (t->on_cpu)
> 		return task_cpu(t);
> 
> 	return -1;
> }
> 
>> +
>> +		return true;
>> +	}
>> +
>> +	return false;
>> +}
>> +
>> +static void set_task_cpumask(struct task_struct *t, struct cpumask *mask)
>> +{
>> +	int cpu;
>> +
>> +	if (mask && task_on_cpu(t, &cpu))
>> +		cpumask_set_cpu(cpu, mask);
> 
> And that you can turn into:
> 
> 	if (!mask)
> 		return;
> 
> 	cpu = task_on_cpu(t);
> 	if (cpu < 0)
> 		return;
> 
> 	cpumask_set_cpu(cpu, mask);
> 
> Readable and simple.
> 
> Hmm?
> 

Will do. Thank you very much.

Reinette


* Re: [PATCH 1/3] x86/resctrl: Move setting task's active CPU in a mask into helpers
  2020-12-07 21:24     ` Reinette Chatre
@ 2020-12-08  9:49       ` Borislav Petkov
  2020-12-08 16:35         ` Reinette Chatre
From: Borislav Petkov @ 2020-12-08  9:49 UTC (permalink / raw)
  To: Reinette Chatre
  Cc: tglx, fenghua.yu, tony.luck, kuo-lang.tseng, shakeelb,
	valentin.schneider, mingo, babu.moger, james.morse, hpa, x86,
	linux-kernel, stable

On Mon, Dec 07, 2020 at 01:24:51PM -0800, Reinette Chatre wrote:
> How about:
> "Move the setting of the CPU on which a task is running in a CPU mask into a
> couple of helpers.
> 
> There is no functional change. This is a preparatory change for the fix in
> the following patch from where the Fixes tag is copied."

Almost. Just not call it a "following patch" because once this is
applied, the following one might be a different one depending on the
ordering a git command has requested. So a "later patch" would be
probably better.

> Correct. I will add it. The addition to the commit message above aims to
> explain a Fixes tag to a patch with no functional changes.

Yes but you need to tell the stable people somehow that this one is a
prerequisite and that they should pick it up too.

Unless you can reorg your code this way that you don't need patch 1...

Thx.

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette


* Re: [PATCH 1/3] x86/resctrl: Move setting task's active CPU in a mask into helpers
  2020-12-08  9:49       ` Borislav Petkov
@ 2020-12-08 16:35         ` Reinette Chatre
From: Reinette Chatre @ 2020-12-08 16:35 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: tglx, fenghua.yu, tony.luck, kuo-lang.tseng, shakeelb,
	valentin.schneider, mingo, babu.moger, james.morse, hpa, x86,
	linux-kernel, stable

Hi Borislav,

On 12/8/2020 1:49 AM, Borislav Petkov wrote:
> On Mon, Dec 07, 2020 at 01:24:51PM -0800, Reinette Chatre wrote:
>> How about:
>> "Move the setting of the CPU on which a task is running in a CPU mask into a
>> couple of helpers.
>>
>> There is no functional change. This is a preparatory change for the fix in
>> the following patch from where the Fixes tag is copied."
> 
> Almost. Just not call it a "following patch" because once this is
> applied, the following one might be a different one depending on the
> ordering a git command has requested. So a "later patch" would be
> probably better.

Indeed, will do. Thank you.

> 
>> Correct. I will add it. The addition to the commit message above aims to
>> explain a Fixes tag to a patch with no functional changes.
> 
> Yes but you need to tell the stable people somehow that this one is a
> prerequisite and that they should pick it up too.

Right. Thanks for guiding here.

> 
> Unless you can reorg your code this way that you don't need patch 1...

I think that the current organization, with patch 1 containing the 
preparatory work without functional changes, makes the fix in patch 2 
easier to review. I thus plan to keep the code organization as is while 
surely following your suggestion on how to support the stable team.

Thank you very much

Reinette



* Re: [PATCH 1/3] x86/resctrl: Move setting task's active CPU in a mask into helpers
  2020-12-03 23:25 ` [PATCH 1/3] x86/resctrl: Move setting task's active CPU in a mask into helpers Reinette Chatre
  2020-12-07 18:29   ` Borislav Petkov
@ 2020-12-09 16:47   ` James Morse
  2020-12-10  0:21     ` Reinette Chatre
From: James Morse @ 2020-12-09 16:47 UTC (permalink / raw)
  To: Reinette Chatre, fenghua.yu
  Cc: tglx, bp, tony.luck, kuo-lang.tseng, shakeelb,
	valentin.schneider, mingo, babu.moger, hpa, x86, linux-kernel,
	stable

Hi Reinette, Fenghua,

On 03/12/2020 23:25, Reinette Chatre wrote:
> From: Fenghua Yu <fenghua.yu@intel.com>
> 
> The code of setting the CPU on which a task is running in a CPU mask is
> moved into a couple of helpers. The new helper task_on_cpu() will be
> reused shortly.

> diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> index 6f4ca4bea625..68db7d2dec8f 100644
> --- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> +++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> @@ -525,6 +525,38 @@ static void rdtgroup_remove(struct rdtgroup *rdtgrp)

> +#ifdef CONFIG_SMP

(using IS_ENABLED(CONFIG_SMP) lets the compiler check all the code in one go, then
dead-code-remove the stuff that will never happen... it's also easier on the eye!)


> +/* Get the CPU if the task is on it. */
> +static bool task_on_cpu(struct task_struct *t, int *cpu)
> +{
> +	/*
> +	 * This is safe on x86 w/o barriers as the ordering of writing to
> +	 * task_cpu() and t->on_cpu is reverse to the reading here. The
> +	 * detection is inaccurate as tasks might move or schedule before
> +	 * the smp function call takes place. In such a case the function
> +	 * call is pointless, but there is no other side effect.
> +	 */

> +	if (t->on_cpu) {

kernel/sched/core.c calls out that there can be two tasks on one CPU with this set.
(grep astute)
I think that means this series will falsely match the old task for a CPU while the
scheduler is running, and IPI it unnecessarily.

task_curr() is the helper that knows not to do this.
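
(For reference, task_curr() in kernel/sched/core.c is:

	inline int task_curr(const struct task_struct *p)
	{
		return cpu_curr(task_cpu(p)) == p;
	}

i.e. it checks that the task really is what the CPU is running right now,
not merely that t->on_cpu is still set.)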


> +		*cpu = task_cpu(t);
> +
> +		return true;
> +	}
> +
> +	return false;
> +}


Thanks,

James


* Re: [PATCH 2/3] x86/resctrl: Update PQR_ASSOC MSR synchronously when moving task to resource group
  2020-12-03 23:25 ` [PATCH 2/3] x86/resctrl: Update PQR_ASSOC MSR synchronously when moving task to resource group Reinette Chatre
@ 2020-12-09 16:51   ` James Morse
  2020-12-10  0:22     ` Reinette Chatre
  2020-12-11 20:46   ` Valentin Schneider
From: James Morse @ 2020-12-09 16:51 UTC (permalink / raw)
  To: Reinette Chatre, fenghua.yu
  Cc: tglx, bp, tony.luck, kuo-lang.tseng, shakeelb,
	valentin.schneider, mingo, babu.moger, hpa, x86, linux-kernel,
	stable

Hi Reinette, Fenghua,

Subject nit: I think 'use IPI instead of task_work() to update PQR_ASSOC_MSR' conveys the
guts of this change more quickly!

On 03/12/2020 23:25, Reinette Chatre wrote:
> From: Fenghua Yu <fenghua.yu@intel.com>
> 
> Currently when moving a task to a resource group the PQR_ASSOC MSR
> is updated with the new closid and rmid in an added task callback.
> If the task is running the work is run as soon as possible. If the
> task is not running the work is executed later

> in the kernel exit path when the kernel returns to the task again.

kernel exit makes me think of user-space... is it enough to just say:
"by __switch_to() when task is next run"?


> Updating the PQR_ASSOC MSR as soon as possible on the CPU a moved task
> is running is the right thing to do. Queueing work for a task that is
> not running is unnecessary (the PQR_ASSOC MSR is already updated when the
> task is scheduled in) and causing system resource waste with the way in
> which it is implemented: Work to update the PQR_ASSOC register is queued
> every time the user writes a task id to the "tasks" file, even if the task
> already belongs to the resource group. This could result in multiple pending
> work items associated with a single task even if they are all identical and
> even though only a single update with most recent values is needed.
> Specifically, even if a task is moved between different resource groups
> while it is sleeping, only the last move is relevant, yet a work item is
> queued during each move.
> This unnecessary queueing of work items could result in significant system
> resource waste, especially on tasks sleeping for a long time. For example,
> as demonstrated by Shakeel Butt in [1] writing the same task id to the
> "tasks" file can quickly consume significant memory. The same problem
> (wasted system resources) occurs when moving a task between different
> resource groups.
> 
> As pointed out by Valentin Schneider in [2] there is an additional issue with
> the way in which the queueing of work is done in that the task_struct update
> is currently done after the work is queued, resulting in a race with the
> register update possibly done before the data needed by the update is available.
> 
> To solve these issues, the PQR_ASSOC MSR is updated in a synchronous way
> right after the new closid and rmid are ready during the task movement,
> only if the task is running. If a moved task is not running nothing is
> done since the PQR_ASSOC MSR will be updated next time the task is scheduled.
> This is the same way used to update the register when tasks are moved as
> part of resource group removal.

(as t->on_cpu is already used...)

Reviewed-by: James Morse <james.morse@arm.com>


> diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> index 68db7d2dec8f..9d62f1fadcc3 100644
> --- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> +++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> @@ -525,6 +525,16 @@ static void rdtgroup_remove(struct rdtgroup *rdtgrp)


> +static void update_task_closid_rmid(struct task_struct *t)
>  {
> +	int cpu;
>  
> +	if (task_on_cpu(t, &cpu))
> +		smp_call_function_single(cpu, _update_task_closid_rmid, t, 1);


I think:
|	if (task_curr(t))
|		smp_call_function_single(task_cpu(t), _update_task_closid_rmid, t, 1);

here would make for an easier backport as it doesn't depend on the previous patch.


> +}

[...]

>  static int __rdtgroup_move_task(struct task_struct *tsk,
>  				struct rdtgroup *rdtgrp)
>  {

> +	if (rdtgrp->type == RDTCTRL_GROUP) {
> +		tsk->closid = rdtgrp->closid;
> +		tsk->rmid = rdtgrp->mon.rmid;
> +	} else if (rdtgrp->type == RDTMON_GROUP) {

[...]

> +	} else {

> +		rdt_last_cmd_puts("Invalid resource group type\n");
> +		return -EINVAL;

Wouldn't this be a kernel bug?
I'd have thought there would be a WARN_ON_ONCE() here to make it clear this isn't
user-space's fault!


>  	}
> -	return ret;
> +
> +	/*
> +	 * By now, the task's closid and rmid are set. If the task is current
> +	 * on a CPU, the PQR_ASSOC MSR needs to be updated to make the resource
> +	 * group go into effect. If the task is not current, the MSR will be
> +	 * updated when the task is scheduled in.
> +	 */
> +	update_task_closid_rmid(tsk);
> +
> +	return 0;
>  }


Thanks,

James


* Re: [PATCH 1/3] x86/resctrl: Move setting task's active CPU in a mask into helpers
  2020-12-09 16:47   ` James Morse
@ 2020-12-10  0:21     ` Reinette Chatre
From: Reinette Chatre @ 2020-12-10  0:21 UTC (permalink / raw)
  To: James Morse, fenghua.yu
  Cc: tglx, bp, tony.luck, kuo-lang.tseng, shakeelb,
	valentin.schneider, mingo, babu.moger, hpa, x86, linux-kernel,
	stable

Hi James,

Thank you very much for your review.

On 12/9/2020 8:47 AM, James Morse wrote:
> Hi Reinette, Fenghua,
> 
> On 03/12/2020 23:25, Reinette Chatre wrote:
>> From: Fenghua Yu <fenghua.yu@intel.com>
>>
>> The code of setting the CPU on which a task is running in a CPU mask is
>> moved into a couple of helpers. The new helper task_on_cpu() will be
>> reused shortly.
> 
>> diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
>> index 6f4ca4bea625..68db7d2dec8f 100644
>> --- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
>> +++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
>> @@ -525,6 +525,38 @@ static void rdtgroup_remove(struct rdtgroup *rdtgrp)
> 
>> +#ifdef CONFIG_SMP
> 
> (using IS_ENABLED(CONFIG_SMP) lets the compiler check all the code in one go, then
> dead-code-remove the stuff that will never happen... it's also easier on the eye!)

Thank you for catching this. New fix (see below) uses this.

>> +/* Get the CPU if the task is on it. */
>> +static bool task_on_cpu(struct task_struct *t, int *cpu)
>> +{
>> +	/*
>> +	 * This is safe on x86 w/o barriers as the ordering of writing to
>> +	 * task_cpu() and t->on_cpu is reverse to the reading here. The
>> +	 * detection is inaccurate as tasks might move or schedule before
>> +	 * the smp function call takes place. In such a case the function
>> +	 * call is pointless, but there is no other side effect.
>> +	 */
> 
>> +	if (t->on_cpu) {
> 
> kernel/sched/core.c calls out that there can be two tasks on one CPU with this set.
> (grep astute)
> I think that means this series will falsely match the old task for a CPU while the
> scheduler is running, and IPI it unnecessarily.
> 
> task_curr() is the helper that knows not to do this.
> 

Thank you very much for catching this. I did not know this. This exposes 
an issue with the current implementation of moving tasks as part of 
directory removal. I now plan to replace this patch with a new fix to 
address this new issue you exposed: the fix will replace the current 
usage of t->on_cpu with task_curr(). Since I also follow your suggestion 
for patch #2 this original patch is no longer needed, which is something 
Borislav also suggested but I could not see a way to do it at the time.

This new fix does seem to fall into the "This could be a problem.." 
category of issues referred to in stable-kernel-rules.rst so while I 
plan on adding a Fixes tag I plan to not cc the stable team on this one. 
I am unsure about the right thing to do here so if you have an opinion I 
would appreciate it.

What do you think of this replacement patch (that will be moved to end 
of series)?

Reinette

----8<------
x86/resctrl: Replace t->on_cpu with task_curr() to prevent unnecessary IPI

James reported in [1] that there could be two tasks running on the same CPU
with t->on_cpu set. Using t->on_cpu as a test of whether a task is running
on a CPU may thus match the old task for a CPU while the scheduler is
running, and IPI it unnecessarily.

task_curr() is the correct helper to use. While doing so, move the #ifdef
check of the CONFIG_SMP symbol to a C conditional that determines whether
this helper should be used, to ensure the code is always checked for
correctness by the compiler.

[1] https://lore.kernel.org/lkml/a782d2f3-d2f6-795f-f4b1-9462205fd581@arm.com

Fixes: 0efc89be9471 ("x86/intel_rdt: Update task closid immediately on CPU in rmdir and unmount")
Reported-by: James Morse <james.morse@arm.com>
Signed-off-by: Reinette Chatre <reinette.chatre@intel.com>
---
  arch/x86/kernel/cpu/resctrl/rdtgroup.c | 14 ++------------
  1 file changed, 2 insertions(+), 12 deletions(-)

diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
index 5e5a49f38fa1..c64fb37f0aec 100644
--- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
+++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
@@ -2317,19 +2317,9 @@ static void rdt_move_group_tasks(struct rdtgroup *from, struct rdtgroup *to,
  			t->closid = to->closid;
  			t->rmid = to->mon.rmid;

-#ifdef CONFIG_SMP
-			/*
-			 * This is safe on x86 w/o barriers as the ordering
-			 * of writing to task_cpu() and t->on_cpu is
-			 * reverse to the reading here. The detection is
-			 * inaccurate as tasks might move or schedule
-			 * before the smp function call takes place. In
-			 * such a case the function call is pointless, but
-			 * there is no other side effect.
-			 */
-			if (mask && t->on_cpu)
+			/* If the task is on a CPU, set the CPU in the mask. */
+			if (IS_ENABLED(CONFIG_SMP) && mask && task_curr(t))
  				cpumask_set_cpu(task_cpu(t), mask);
-#endif
  		}
  	}
  	read_unlock(&tasklist_lock);

* Re: [PATCH 2/3] x86/resctrl: Update PQR_ASSOC MSR synchronously when moving task to resource group
  2020-12-09 16:51   ` James Morse
@ 2020-12-10  0:22     ` Reinette Chatre
From: Reinette Chatre @ 2020-12-10  0:22 UTC (permalink / raw)
  To: James Morse, fenghua.yu
  Cc: tglx, bp, tony.luck, kuo-lang.tseng, shakeelb,
	valentin.schneider, mingo, babu.moger, hpa, x86, linux-kernel,
	stable

Hi James,

On 12/9/2020 8:51 AM, James Morse wrote:
> Hi Reinette, Fenghua,
> 
> Subject nit: I think 'use IPI instead of task_work() to update PQR_ASSOC_MSR' conveys the
> guts of this change more quickly!

Sure. Thank you. A small change is that I plan to write "PQR_ASSOC MSR" 
instead to closer match the name.

> 
> On 03/12/2020 23:25, Reinette Chatre wrote:
>> From: Fenghua Yu <fenghua.yu@intel.com>
>>
>> Currently when moving a task to a resource group the PQR_ASSOC MSR
>> is updated with the new closid and rmid in an added task callback.
>> If the task is running the work is run as soon as possible. If the
>> task is not running the work is executed later
> 
>> in the kernel exit path when the kernel returns to the task again.
> 
> kernel exit makes me think of user-space... is it enough to just say:
> "by __switch_to() when task is next run"?

I do not think that would be accurate. The paragraph to which you are 
proposing the change describes the current context, before the fix, when 
task_work_add() is still used to update PQR_ASSOC. The "kernel exit" 
text you refer to is quite close to task_work_add()'s comments and 
indeed refers to the work that is run before returning to user space. If 
a function name would make things clearer, perhaps add 
exit_to_user_mode_loop() instead? Changing the text to __switch_to() 
does not reflect the context described here: __switch_to() is what is 
called when the task is context switched in, and that is from where 
resctrl_sched_in() is called, not where the queued work described here 
runs.

>> Updating the PQR_ASSOC MSR as soon as possible on the CPU a moved task
>> is running is the right thing to do. Queueing work for a task that is
>> not running is unnecessary (the PQR_ASSOC MSR is already updated when the
>> task is scheduled in) and causing system resource waste with the way in
>> which it is implemented: Work to update the PQR_ASSOC register is queued
>> every time the user writes a task id to the "tasks" file, even if the task
>> already belongs to the resource group. This could result in multiple pending
>> work items associated with a single task even if they are all identical and
>> even though only a single update with most recent values is needed.
>> Specifically, even if a task is moved between different resource groups
>> while it is sleeping, only the last move is relevant, yet a work item is
>> queued during each move.
>> This unnecessary queueing of work items could result in significant system
>> resource waste, especially on tasks sleeping for a long time. For example,
>> as demonstrated by Shakeel Butt in [1] writing the same task id to the
>> "tasks" file can quickly consume significant memory. The same problem
>> (wasted system resources) occurs when moving a task between different
>> resource groups.
>>
>> As pointed out by Valentin Schneider in [2] there is an additional issue with
>> the way in which the queueing of work is done in that the task_struct update
>> is currently done after the work is queued, resulting in a race with the
>> register update possibly done before the data needed by the update is available.
>>
>> To solve these issues, the PQR_ASSOC MSR is updated in a synchronous way
>> right after the new closid and rmid are ready during the task movement,
>> only if the task is running. If a moved task is not running nothing is
>> done since the PQR_ASSOC MSR will be updated next time the task is scheduled.
>> This is the same way used to update the register when tasks are moved as
>> part of resource group removal.
> 
> (as t->on_cpu is already used...)
> 
> Reviewed-by: James Morse <james.morse@arm.com>

Thank you very much. I do plan to follow your suggestion below though.


>> diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
>> index 68db7d2dec8f..9d62f1fadcc3 100644
>> --- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
>> +++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
>> @@ -525,6 +525,16 @@ static void rdtgroup_remove(struct rdtgroup *rdtgrp)
> 
> 
>> +static void update_task_closid_rmid(struct task_struct *t)
>>   {
>> +	int cpu;
>>   
>> +	if (task_on_cpu(t, &cpu))
>> +		smp_call_function_single(cpu, _update_task_closid_rmid, t, 1);
> 
> 
> I think:
> |	if (task_curr(t))
> |		smp_call_function_single(task_cpu(t), _update_task_closid_rmid, t, 1);
> 
> here would make for an easier backport as it doesn't depend on the previous patch.
> 

Will do, thank you very much.

The previous patch has now become a bugfix in its own right after you 
pointed out the issue with using t->on_cpu. In addressing that I plan to 
remove the helpers found in patch #1 so backporting should continue to 
be easier.

> 
>> +}
> 
> [...]
> 
>>   static int __rdtgroup_move_task(struct task_struct *tsk,
>>   				struct rdtgroup *rdtgrp)
>>   {
> 
>> +	if (rdtgrp->type == RDTCTRL_GROUP) {
>> +		tsk->closid = rdtgrp->closid;
>> +		tsk->rmid = rdtgrp->mon.rmid;
>> +	} else if (rdtgrp->type == RDTMON_GROUP) {
> 
> [...]
> 
>> +	} else {
> 
>> +		rdt_last_cmd_puts("Invalid resource group type\n");
>> +		return -EINVAL;
> 
> Wouldn't this be a kernel bug?
> I'd have thought there would be a WARN_ON_ONCE() here to make it clear this isn't
> user-space's fault!

You are right, this would be a kernel bug. I'll add a WARN_ON_ONCE().

> 
> 
>>   	}
>> -	return ret;
>> +
>> +	/*
>> +	 * By now, the task's closid and rmid are set. If the task is current
>> +	 * on a CPU, the PQR_ASSOC MSR needs to be updated to make the resource
>> +	 * group go into effect. If the task is not current, the MSR will be
>> +	 * updated when the task is scheduled in.
>> +	 */
>> +	update_task_closid_rmid(tsk);
>> +
>> +	return 0;
>>   }
> 

Thank you very much.

Reinette



* Re: [PATCH 2/3] x86/resctrl: Update PQR_ASSOC MSR synchronously when moving task to resource group
  2020-12-03 23:25 ` [PATCH 2/3] x86/resctrl: Update PQR_ASSOC MSR synchronously when moving task to resource group Reinette Chatre
  2020-12-09 16:51   ` James Morse
@ 2020-12-11 20:46   ` Valentin Schneider
  2020-12-14 18:41     ` Reinette Chatre
From: Valentin Schneider @ 2020-12-11 20:46 UTC (permalink / raw)
  To: Reinette Chatre
  Cc: tglx, fenghua.yu, bp, tony.luck, kuo-lang.tseng, shakeelb, mingo,
	babu.moger, james.morse, hpa, x86, linux-kernel, stable


On 03/12/20 23:25, Reinette Chatre wrote:
> Fixes: e02737d5b826 ("x86/intel_rdt: Add tasks files")
> Reported-by: Shakeel Butt <shakeelb@google.com>
> Reported-by: Valentin Schneider <valentin.schneider@arm.com>
> Signed-off-by: Fenghua Yu <fenghua.yu@intel.com>
> Signed-off-by: Reinette Chatre <reinette.chatre@intel.com>
> Reviewed-by: Tony Luck <tony.luck@intel.com>
> Cc: stable@vger.kernel.org

Some pedantic comments below; with James' task_curr() + task_cpu()
suggestion:

Reviewed-by: Valentin Schneider <valentin.schneider@arm.com>

> ---
>  static int __rdtgroup_move_task(struct task_struct *tsk,
>                               struct rdtgroup *rdtgrp)
>  {
> -	struct task_move_callback *callback;
> -	int ret;
> -
> -	callback = kzalloc(sizeof(*callback), GFP_KERNEL);
> -	if (!callback)
> -		return -ENOMEM;
> -	callback->work.func = move_myself;
> -	callback->rdtgrp = rdtgrp;
> -
>       /*
> -	 * Take a refcount, so rdtgrp cannot be freed before the
> -	 * callback has been invoked.
> +	 * Set the task's closid/rmid before the PQR_ASSOC MSR can be
> +	 * updated by them.
> +	 *
> +	 * For ctrl_mon groups, move both closid and rmid.
> +	 * For monitor groups, can move the tasks only from
> +	 * their parent CTRL group.
>        */
> -	atomic_inc(&rdtgrp->waitcount);
> -	ret = task_work_add(tsk, &callback->work, TWA_RESUME);
> -	if (ret) {
> -		/*
> -		 * Task is exiting. Drop the refcount and free the callback.
> -		 * No need to check the refcount as the group cannot be
> -		 * deleted before the write function unlocks rdtgroup_mutex.
> -		 */
> -		atomic_dec(&rdtgrp->waitcount);
> -		kfree(callback);
> -		rdt_last_cmd_puts("Task exited\n");
> -	} else {
> -		/*
> -		 * For ctrl_mon groups move both closid and rmid.
> -		 * For monitor groups, can move the tasks only from
> -		 * their parent CTRL group.
> -		 */
> -		if (rdtgrp->type == RDTCTRL_GROUP) {
> -			tsk->closid = rdtgrp->closid;
> +
> +	if (rdtgrp->type == RDTCTRL_GROUP) {
> +		tsk->closid = rdtgrp->closid;
> +		tsk->rmid = rdtgrp->mon.rmid;
> +	} else if (rdtgrp->type == RDTMON_GROUP) {
> +		if (rdtgrp->mon.parent->closid == tsk->closid) {
>                       tsk->rmid = rdtgrp->mon.rmid;
> -		} else if (rdtgrp->type == RDTMON_GROUP) {
> -			if (rdtgrp->mon.parent->closid == tsk->closid) {
> -				tsk->rmid = rdtgrp->mon.rmid;
> -			} else {
> -				rdt_last_cmd_puts("Can't move task to different control group\n");
> -				ret = -EINVAL;
> -			}
> +		} else {
> +			rdt_last_cmd_puts("Can't move task to different control group\n");
> +			return -EINVAL;
>               }
> +	} else {
> +		rdt_last_cmd_puts("Invalid resource group type\n");
> +		return -EINVAL;
>       }

James already pointed out this should be a WARN_ON_ONCE(), but is that the
right place to assert rdtgrp->type validity?

I see only a single assignment to rdtgrp->type in mkdir_rdt_prepare();
could we fail the group creation there instead if the passed rtype is
invalid?

> -	return ret;
> +
> +	/*
> +	 * By now, the task's closid and rmid are set. If the task is current
> +	 * on a CPU, the PQR_ASSOC MSR needs to be updated to make the resource
> +	 * group go into effect. If the task is not current, the MSR will be
> +	 * updated when the task is scheduled in.
> +	 */
> +	update_task_closid_rmid(tsk);

We need the above writes to be compile-ordered before the IPI is sent.
There *is* a preempt_disable() down in smp_call_function_single() that
gives us the required barrier(); can we deem that sufficient, or would we
want one before update_task_closid_rmid() for the sake of clarity?
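
Something like the below is what I have in mind, if we wanted the
ordering to be explicit rather than relying on that preempt_disable()
(illustrative sketch only):

	tsk->closid = rdtgrp->closid;
	tsk->rmid = rdtgrp->mon.rmid;

	/* Order the stores above before the call that may send the IPI */
	barrier();

	update_task_closid_rmid(tsk);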

> +
> +	return 0;
>  }
>
>  static bool is_closid_match(struct task_struct *t, struct rdtgroup *r)


* Re: [PATCH 0/3] x86/resctrl: Fix a few issues in moving a task to a resource group
  2020-12-03 23:25 [PATCH 0/3] x86/resctrl: Fix a few issues in moving a task to a resource group Reinette Chatre
  2020-12-03 23:25 ` [PATCH 3/3] x86/resctrl: Don't move a task to the same " Reinette Chatre
@ 2020-12-11 20:46 ` Valentin Schneider
  2020-12-14 18:38   ` Reinette Chatre
  2020-12-17 12:19 ` [PATCH 4/3] x86/intel_rdt: Apply READ_ONCE/WRITE_ONCE to task_struct .rmid & .closid Valentin Schneider
From: Valentin Schneider @ 2020-12-11 20:46 UTC (permalink / raw)
  To: Reinette Chatre
  Cc: tglx, fenghua.yu, bp, tony.luck, kuo-lang.tseng, shakeelb, mingo,
	babu.moger, james.morse, hpa, x86, linux-kernel


Hi Reinette,

On 03/12/20 23:25, Reinette Chatre wrote:
> Valentin's series in [2] ends by adding memory barriers to support the
> updating of the task_struct from one CPU and the usage of the task_struct data
> from another CPU. This work is still needed and as discussed with Valentin in
> that thread the work would be re-evaluated by him after seeing how this series
> turns out.
>

So the "problematic" pattern is still there: a context switch can happen
concurrently with a write to the switching-to task's {closid, rmid}.
Accesses to these fields would thus need to be wrapped by READ_ONCE() &
WRITE_ONCE().

Thinking a bit more (too much?) about it, we could limit ourselves to
wrapping only reads not protected by the rdtgroup_mutex: the only two
task_struct {closid, rmid} writers are
- rdtgroup_move_task()
- rdt_move_group_tasks()
and they are both invoked while holding said mutex. Thus, a reader holding
the mutex cannot race with a write, so load tearing ought to be safe.
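
Concretely, the wrapping would look something like this (sketch):

	/* writers, both invoked with rdtgroup_mutex held */
	WRITE_ONCE(tsk->closid, rdtgrp->closid);
	WRITE_ONCE(tsk->rmid, rdtgrp->mon.rmid);

	/* lockless reader, in __resctrl_sched_in() */
	closid = READ_ONCE(current->closid);
	rmid = READ_ONCE(current->rmid);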

> [1]: https://lore.kernel.org/lkml/CALvZod7E9zzHwenzf7objzGKsdBmVwTgEJ0nPgs0LUFU3SN5Pw@mail.gmail.com/
> [2]: https://lore.kernel.org/lkml/20201123022433.17905-1-valentin.schneider@arm.com
>
> Fenghua Yu (3):
>   x86/resctrl: Move setting task's active CPU in a mask into helpers
>   x86/resctrl: Update PQR_ASSOC MSR synchronously when moving task to
>     resource group
>   x86/resctrl: Don't move a task to the same resource group
>
>  arch/x86/kernel/cpu/resctrl/rdtgroup.c | 159 +++++++++++++------------
>  1 file changed, 82 insertions(+), 77 deletions(-)
>
>
> base-commit: b65054597872ce3aefbc6a666385eabdf9e288da


* Re: [PATCH 0/3] x86/resctrl: Fix a few issues in moving a task to a resource group
  2020-12-11 20:46 ` [PATCH 0/3] x86/resctrl: Fix a few issues in moving a task to a " Valentin Schneider
@ 2020-12-14 18:38   ` Reinette Chatre
  2020-12-16 17:41     ` Valentin Schneider
From: Reinette Chatre @ 2020-12-14 18:38 UTC (permalink / raw)
  To: Valentin Schneider
  Cc: tglx, fenghua.yu, bp, tony.luck, kuo-lang.tseng, shakeelb, mingo,
	babu.moger, james.morse, hpa, x86, linux-kernel

Hi Valentin,

On 12/11/2020 12:46 PM, Valentin Schneider wrote:
> On 03/12/20 23:25, Reinette Chatre wrote:
>> Valentin's series in [2] ends by adding memory barriers to support the
>> updating of the task_struct from one CPU and the usage of the task_struct data
>> from another CPU. This work is still needed and as discussed with Valentin in
>> that thread the work would be re-evaluated by him after seeing how this series
>> turns out.
>>

Thank you very much for taking a look.

> 
> So the "problematic" pattern is still there: a context switch can happen
> concurrently with a write to the switching-to task's {closid, rmid}.
> Accesses to these fields would thus need to be wrapped by READ_ONCE() &
> WRITE_ONCE().

ok.

> 
> Thinking a bit more (too much?) about it, we could limit ourselves to
> wrapping only reads not protected by the rdtgroup_mutex: the only two
> task_struct {closid, rmid} writers are
> - rdtgroup_move_task()
> - rdt_move_group_tasks()
> and they are both invoked while holding said mutex. Thus, a reader holding
> the mutex cannot race with a write, so plain loads are safe there.

The reads that are not protected by the rdtgroup_mutex can be found in 
__resctrl_sched_in(). It thus sounds to me like your proposed changes to 
this function found in your patch [1] are what is needed? It is not clear 
to me how the pairing would work in this case though. If I understand 
correctly the goal is for the writes to the closid/rmid in the functions 
you mention above to be paired with the reads in resctrl_sched_in() and 
it is not clear how adding a single READ_ONCE() would accomplish this 
pairing by itself.

It is also not entirely clear to me what the problematic scenario could 
be. If I understand correctly, the risk is (as you explained in your 
commit message), that a CPU could have its {closid, rmid} fields read 
locally (resctrl_sched_in()) while they are concurrently being written 
to from another CPU (in rdtgroup_move_task() and rdt_move_group_tasks() 
as you state above). If this happens then a task being moved may be 
scheduled in with its old closid/rmid. The update of closid/rmid in 
rdtgroup_move_task()/rdt_move_group_tasks() is followed by 
smp_call_function_xx() where the registers are updated with preemption 
disabled and thus protected against __switch_to. If a task was thus 
incorrectly scheduled in with old closid/rmid, would it not be corrected 
at this point?
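
(For reference, a simplified sketch of the flow I am describing; the
IPI callback name here is illustrative, not the actual symbol:)

  tsk->closid = rdtgrp->closid;      /* task_struct updated while */
  tsk->rmid = rdtgrp->mon.rmid;      /* holding rdtgroup_mutex    */
  update_task_closid_rmid(tsk);
    /* -> if the task is current on a CPU:
     *      smp_call_function_single(cpu, update_pqr_assoc, tsk, 1);
     *    the callback updates the PQR_ASSOC MSR with preemption
     *    disabled, so it cannot interleave with __switch_to. */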

Thank you

Reinette


[1] 
https://lore.kernel.org/lkml/20201123022433.17905-4-valentin.schneider@arm.com/


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 2/3] x86/resctrl: Update PQR_ASSOC MSR synchronously when moving task to resource group
  2020-12-11 20:46   ` Valentin Schneider
@ 2020-12-14 18:41     ` Reinette Chatre
  2020-12-16 17:41       ` Valentin Schneider
  0 siblings, 1 reply; 22+ messages in thread
From: Reinette Chatre @ 2020-12-14 18:41 UTC (permalink / raw)
  To: Valentin Schneider
  Cc: tglx, fenghua.yu, bp, tony.luck, kuo-lang.tseng, shakeelb, mingo,
	babu.moger, james.morse, hpa, x86, linux-kernel, stable

Hi Valentin,

On 12/11/2020 12:46 PM, Valentin Schneider wrote:
> 
> On 03/12/20 23:25, Reinette Chatre wrote:
>> Fixes: e02737d5b826 ("x86/intel_rdt: Add tasks files")
>> Reported-by: Shakeel Butt <shakeelb@google.com>
>> Reported-by: Valentin Schneider <valentin.schneider@arm.com>
>> Signed-off-by: Fenghua Yu <fenghua.yu@intel.com>
>> Signed-off-by: Reinette Chatre <reinette.chatre@intel.com>
>> Reviewed-by: Tony Luck <tony.luck@intel.com>
>> Cc: stable@vger.kernel.org
> 
> Some pedantic comments below; with James' task_curr() + task_cpu()
> suggestion:
> 
> Reviewed-by: Valentin Schneider <valentin.schneider@arm.com>

Thank you very much.


>> ---
...

>> +	if (rdtgrp->type == RDTCTRL_GROUP) {
>> +		tsk->closid = rdtgrp->closid;
>> +		tsk->rmid = rdtgrp->mon.rmid;
>> +	} else if (rdtgrp->type == RDTMON_GROUP) {
>> +		if (rdtgrp->mon.parent->closid == tsk->closid) {
>>                        tsk->rmid = rdtgrp->mon.rmid;
>> -		} else if (rdtgrp->type == RDTMON_GROUP) {
>> -			if (rdtgrp->mon.parent->closid == tsk->closid) {
>> -				tsk->rmid = rdtgrp->mon.rmid;
>> -			} else {
>> -				rdt_last_cmd_puts("Can't move task to different control group\n");
>> -				ret = -EINVAL;
>> -			}
>> +		} else {
>> +			rdt_last_cmd_puts("Can't move task to different control group\n");
>> +			return -EINVAL;
>>                }
>> +	} else {
>> +		rdt_last_cmd_puts("Invalid resource group type\n");
>> +		return -EINVAL;
>>        }
> 
> James already pointed out this should be a WARN_ON_ONCE(), but is that the
> right place to assert rdtgrp->type validity?
> 
> I see only a single assignment to rdtgrp->type in mkdir_rdt_prepare();
> could we fail the group creation there instead if the passed rtype is
> invalid?

Yes, there is that single assignment in mkdir_rdt_prepare(), and from 
how mkdir_rdt_prepare() is called it is only ever called with 
RDTMON_GROUP or RDTCTRL_GROUP. This additional error checking was added 
as part of this fix but is unrelated to the fix itself. Since you and 
James both pointed out flaws with it and it is unnecessary, I will 
remove it here and keep the original checking, which remains sufficient.

>> -	return ret;
>> +
>> +	/*
>> +	 * By now, the task's closid and rmid are set. If the task is current
>> +	 * on a CPU, the PQR_ASSOC MSR needs to be updated to make the resource
>> +	 * group go into effect. If the task is not current, the MSR will be
>> +	 * updated when the task is scheduled in.
>> +	 */
>> +	update_task_closid_rmid(tsk);
> 
> We need the above writes to be compile-ordered before the IPI is sent.
> There *is* a preempt_disable() down in smp_call_function_single() that
> gives us the required barrier(), can we deem that sufficient or would we
> want one before update_task_closid_rmid() for the sake of clarity?
> 

Apologies, it is not clear to me why the preempt_disable() would be 
insufficient. If it is not then there may be a few other areas (where 
resctrl calls smp_call_function_xxx()) that need to be re-evaluated.

Thank you very much

Reinette


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 0/3] x86/resctrl: Fix a few issues in moving a task to a resource group
  2020-12-14 18:38   ` Reinette Chatre
@ 2020-12-16 17:41     ` Valentin Schneider
  2020-12-16 18:26       ` Reinette Chatre
  0 siblings, 1 reply; 22+ messages in thread
From: Valentin Schneider @ 2020-12-16 17:41 UTC (permalink / raw)
  To: Reinette Chatre
  Cc: tglx, fenghua.yu, bp, tony.luck, kuo-lang.tseng, shakeelb, mingo,
	babu.moger, james.morse, hpa, x86, linux-kernel


On 14/12/20 18:38, Reinette Chatre wrote:
>> Thinking a bit more (too much?) about it, we could limit ourselves to
>> wrapping only reads not protected by the rdtgroup_mutex: the only two
>> task_struct {closid, rmid} writers are
>> - rdtgroup_move_task()
>> - rdt_move_group_tasks()
>> and they are both invoked while holding said mutex. Thus, a reader holding
>> the mutex cannot race with a write, so plain loads are safe there.
>
> The reads that are not protected by the rdtgroup_mutex can be found in
> __resctrl_sched_in(). It thus sounds to me like your proposed changes to
> this function found in your patch [1] are what is needed?

Right.

> It is not clear
> to me how the pairing would work in this case though. If I understand
> correctly the goal is for the writes to the closid/rmid in the functions
> you mention above to be paired with the reads in resctrl_sched_in() and
> it is not clear how adding a single READ_ONCE() would accomplish this
> pairing by itself.
>

So all the writes would need WRITE_ONCE(), but not all reads would require
a READ_ONCE() (those that can't race with writes shouldn't need them).

I'll go and update that patch so that you can bundle it with v2 of this
series.

> It is also not entirely clear to me what the problematic scenario could
> be. If I understand correctly, the risk is (as you explained in your
> commit message), that a CPU could have its {closid, rmid} fields read
> locally (resctrl_sched_in()) while they are concurrently being written
> to from another CPU (in rdtgroup_move_task() and rdt_move_group_tasks()
> as you state above). If this happens then a task being moved may be
> scheduled in with its old closid/rmid.

Worse, it may be scheduled with a mangled closid/rmid if the read in
resctrl_sched_in() is torn (i.e. compiled as a sequence of multiple
smaller-sized loads). This is one of the things READ_ONCE() / WRITE_ONCE()
try to address.
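
As a contrived illustration (no sane compiler is likely to do this for
an aligned u32, but plain accesses give it the licence to), the load

  u32 closid = current->closid;

may be compiled as two half-sized loads:

  u16 lo = *(u16 *)&current->closid;        /* sees the old value */
  u16 hi = *((u16 *)&current->closid + 1);  /* sees the new value */
  u32 closid = ((u32)hi << 16) | lo;        /* neither old nor new */

so a concurrent store slipping in between the two yields a value that
was never actually written.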

> The update of closid/rmid in
> rdtgroup_move_task()/rdt_move_group_tasks() is followed by
> smp_call_function_xx() where the registers are updated with preemption
> disabled and thus protected against __switch_to. If a task was thus
> incorrectly scheduled in with old closid/rmid, would it not be corrected
> at this point?
>

Excluding load/store tearing, then yes, the above works fine.

> Thank you
>
> Reinette
>
>
> [1]
> https://lore.kernel.org/lkml/20201123022433.17905-4-valentin.schneider@arm.com/

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 2/3] x86/resctrl: Update PQR_ASSOC MSR synchronously when moving task to resource group
  2020-12-14 18:41     ` Reinette Chatre
@ 2020-12-16 17:41       ` Valentin Schneider
  2020-12-16 18:26         ` Reinette Chatre
  0 siblings, 1 reply; 22+ messages in thread
From: Valentin Schneider @ 2020-12-16 17:41 UTC (permalink / raw)
  To: Reinette Chatre
  Cc: tglx, fenghua.yu, bp, tony.luck, kuo-lang.tseng, shakeelb, mingo,
	babu.moger, james.morse, hpa, x86, linux-kernel, stable


On 14/12/20 18:41, Reinette Chatre wrote:
>>> -	return ret;
>>> +
>>> +	/*
>>> +	 * By now, the task's closid and rmid are set. If the task is current
>>> +	 * on a CPU, the PQR_ASSOC MSR needs to be updated to make the resource
>>> +	 * group go into effect. If the task is not current, the MSR will be
>>> +	 * updated when the task is scheduled in.
>>> +	 */
>>> +	update_task_closid_rmid(tsk);
>>
>> We need the above writes to be compile-ordered before the IPI is sent.
>> There *is* a preempt_disable() down in smp_call_function_single() that
>> gives us the required barrier(), can we deem that sufficient or would we
>> want one before update_task_closid_rmid() for the sake of clarity?
>>
>
> Apologies, it is not clear to me why the preempt_disable() would be
> insufficient. If it is not then there may be a few other areas (where
> resctrl calls smp_call_function_xxx()) that need to be re-evaluated.

So that's part paranoia and part nonsense from my end - the contents of
smp_call() shouldn't matter here.

If we distill the code to:

  tsk->closid = x;

  if (task_curr(tsk))
      smp_call(...);

It is somewhat far fetched, but AFAICT this can be compiled as:

  if (task_curr(tsk)) {
      tsk->closid = x;
      smp_call(...);
  } else {
      tsk->closid = x;
  }

IOW, there could be a sequence where the closid write is ordered *after*
the task_curr() read. With

  tsk->closid = x;

  barrier();

  if (task_curr(tsk))
      smp_call(...);

that explicitly cannot happen.

>
> Thank you very much
>
> Reinette

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 0/3] x86/resctrl: Fix a few issues in moving a task to a resource group
  2020-12-16 17:41     ` Valentin Schneider
@ 2020-12-16 18:26       ` Reinette Chatre
  0 siblings, 0 replies; 22+ messages in thread
From: Reinette Chatre @ 2020-12-16 18:26 UTC (permalink / raw)
  To: Valentin Schneider
  Cc: tglx, fenghua.yu, bp, tony.luck, kuo-lang.tseng, shakeelb, mingo,
	babu.moger, james.morse, hpa, x86, linux-kernel

Hi Valentin,

On 12/16/2020 9:41 AM, Valentin Schneider wrote:
> 
> On 14/12/20 18:38, Reinette Chatre wrote:
>>> Thinking a bit more (too much?) about it, we could limit ourselves to
>>> wrapping only reads not protected by the rdtgroup_mutex: the only two
>>> task_struct {closid, rmid} writers are
>>> - rdtgroup_move_task()
>>> - rdt_move_group_tasks()
>>> and they are both invoked while holding said mutex. Thus, a reader holding
>>> the mutex cannot race with a write, so plain loads are safe there.
>>
>> The reads that are not protected by the rdtgroup_mutex can be found in
>> __resctrl_sched_in(). It thus sounds to me like your proposed changes to
>> this function found in your patch [1] are what is needed?
> 
> Right.
> 
>> It is not clear
>> to me how the pairing would work in this case though. If I understand
>> correctly the goal is for the writes to the closid/rmid in the functions
>> you mention above to be paired with the reads in resctrl_sched_in() and
>> it is not clear how adding a single READ_ONCE() would accomplish this
>> pairing by itself.
>>
> 
> So all the writes would need WRITE_ONCE(), but not all reads would require
> a READ_ONCE() (those that can't race with writes shouldn't need them).

Got it. I misunderstood your previous response, mistakenly thinking it 
meant that there would be only READ_ONCE() reads without paired 
WRITE_ONCE() writes. Thanks for clearing that up.

> I'll go and update that patch so that you can bundle it with v2 of this
> series.

Thank you so much.

>> It is also not entirely clear to me what the problematic scenario could
>> be. If I understand correctly, the risk is (as you explained in your
>> commit message), that a CPU could have its {closid, rmid} fields read
>> locally (resctrl_sched_in()) while they are concurrently being written
>> to from another CPU (in rdtgroup_move_task() and rdt_move_group_tasks()
>> as you state above). If this happens then a task being moved may be
>> scheduled in with its old closid/rmid.
> 
> Worse, it may be scheduled with a mangled closid/rmid if the read in
> resctrl_sched_in() is torn (i.e. compiled as a sequence of multiple
> smaller-sized loads). This is one of the things READ_ONCE() / WRITE_ONCE()
> try to address.

I see. This area is unfamiliar to me; thank you very much for catching 
this and helping to get it right.

> 
>> The update of closid/rmid in
>> rdtgroup_move_task()/rdt_move_group_tasks() is followed by
>> smp_call_function_xx() where the registers are updated with preemption
>> disabled and thus protected against __switch_to. If a task was thus
>> incorrectly scheduled in with old closid/rmid, would it not be corrected
>> at this point?
>>
> 
> Excluding load/store tearing, then yes, the above works fine.
> 

Thank you for this sanity check.

Reinette


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 2/3] x86/resctrl: Update PQR_ASSOC MSR synchronously when moving task to resource group
  2020-12-16 17:41       ` Valentin Schneider
@ 2020-12-16 18:26         ` Reinette Chatre
  2020-12-17 10:39           ` Valentin Schneider
  0 siblings, 1 reply; 22+ messages in thread
From: Reinette Chatre @ 2020-12-16 18:26 UTC (permalink / raw)
  To: Valentin Schneider
  Cc: tglx, fenghua.yu, bp, tony.luck, kuo-lang.tseng, shakeelb, mingo,
	babu.moger, james.morse, hpa, x86, linux-kernel, stable

Hi Valentin,

On 12/16/2020 9:41 AM, Valentin Schneider wrote:
> 
> On 14/12/20 18:41, Reinette Chatre wrote:
>>>> -	return ret;
>>>> +
>>>> +	/*
>>>> +	 * By now, the task's closid and rmid are set. If the task is current
>>>> +	 * on a CPU, the PQR_ASSOC MSR needs to be updated to make the resource
>>>> +	 * group go into effect. If the task is not current, the MSR will be
>>>> +	 * updated when the task is scheduled in.
>>>> +	 */
>>>> +	update_task_closid_rmid(tsk);
>>>
>>> We need the above writes to be compile-ordered before the IPI is sent.
>>> There *is* a preempt_disable() down in smp_call_function_single() that
>>> gives us the required barrier(), can we deem that sufficient or would we
>>> want one before update_task_closid_rmid() for the sake of clarity?
>>>
>>
>> Apologies, it is not clear to me why the preempt_disable() would be
>> insufficient. If it is not then there may be a few other areas (where
>> resctrl calls smp_call_function_xxx()) that need to be re-evaluated.
> 
> So that's part paranoia and part nonsense from my end - the contents of
> smp_call() shouldn't matter here.
> 
> If we distill the code to:
> 
>    tsk->closid = x;
> 
>    if (task_curr(tsk))
>        smp_call(...);
> 
> It is somewhat far fetched, but AFAICT this can be compiled as:
> 
>    if (task_curr(tsk)) {
>        tsk->closid = x;
>        smp_call(...);
>    } else {
>        tsk->closid = x;
>    }
> 
> IOW, there could be a sequence where the closid write is ordered *after*
> the task_curr() read.

Could you please elaborate why it would be an issue if the closid write 
is ordered after the task_curr() read? task_curr() does not depend on 
the closid.

> With
> 
>    tsk->closid = x;
> 
>    barrier();
> 
>    if (task_curr(tsk))
>        smp_call(...);
> 
> that explicitely cannot happen.
> 


Reinette

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 2/3] x86/resctrl: Update PQR_ASSOC MSR synchronously when moving task to resource group
  2020-12-16 18:26         ` Reinette Chatre
@ 2020-12-17 10:39           ` Valentin Schneider
  0 siblings, 0 replies; 22+ messages in thread
From: Valentin Schneider @ 2020-12-17 10:39 UTC (permalink / raw)
  To: Reinette Chatre
  Cc: tglx, fenghua.yu, bp, tony.luck, kuo-lang.tseng, shakeelb, mingo,
	babu.moger, james.morse, hpa, x86, linux-kernel, stable


On 16/12/20 18:26, Reinette Chatre wrote:
> Hi Valentin,
>> So that's part paranoia and part nonsense from my end - the contents of
>> smp_call() shouldn't matter here.
>>
>> If we distill the code to:
>>
>>    tsk->closid = x;
>>
>>    if (task_curr(tsk))
>>        smp_call(...);
>>
>> It is somewhat far fetched, but AFAICT this can be compiled as:
>>
>>    if (task_curr(tsk)) {
>>        tsk->closid = x;
>>        smp_call(...);
>>    } else {
>>        tsk->closid = x;
>>    }
>>
>> IOW, there could be a sequence where the closid write is ordered *after*
>> the task_curr() read.
>
> Could you please elaborate why it would be an issue if the closid write
> is ordered after the task_curr() read? task_curr() does not depend on
> the closid.
>

IMO the 'task_curr()' check only makes sense if it happens *after* the
write, the logic being: 'closid/rmid has been written to, does the task now
need interrupting?'

In the above 'else' clause, task_curr() would need to be re-evaluated after
the write: the task wasn't current *before* the write, but nothing
guarantees this still holds *after* the write.
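
As an illustration, with the reordered compilation above the following
interleaving becomes possible:

  CPU0 (rdtgroup_move_task)          CPU1
  -------------------------          ----
  task_curr(tsk) -> false
                                     tsk is scheduled in;
                                     __resctrl_sched_in() loads the
                                     old closid
  tsk->closid = x;
  /* no IPI is sent */

and tsk keeps running with the stale closid until its next context
switch.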

>> With
>>
>>    tsk->closid = x;
>>
>>    barrier();
>>
>>    if (task_curr(tsk))
>>        smp_call(...);
>>
>> that explicitly cannot happen.
>>
>
>
> Reinette

^ permalink raw reply	[flat|nested] 22+ messages in thread

* [PATCH 4/3] x86/intel_rdt: Apply READ_ONCE/WRITE_ONCE to task_struct .rmid & .closid
  2020-12-03 23:25 [PATCH 0/3] x86/resctrl: Fix a few issues in moving a task to a resource group Reinette Chatre
                   ` (3 preceding siblings ...)
  2020-12-11 20:46 ` [PATCH 0/3] x86/resctrl: Fix a few issues in moving a task to a " Valentin Schneider
@ 2020-12-17 12:19 ` Valentin Schneider
  4 siblings, 0 replies; 22+ messages in thread
From: Valentin Schneider @ 2020-12-17 12:19 UTC (permalink / raw)
  To: linux-kernel, x86
  Cc: tglx, fenghua.yu, bp, tony.luck, kuo-lang.tseng, shakeelb, mingo,
	babu.moger, james.morse, hpa, Reinette Chatre

A CPU's current task can have its {closid, rmid} fields read locally while
they are being concurrently written to from another CPU. This can happen
anytime __resctrl_sched_in() races with either __rdtgroup_move_task() or
rdt_move_group_tasks().

Prevent load / store tearing for those accesses by giving them the
READ_ONCE() / WRITE_ONCE() treatment.

Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
---
 arch/x86/include/asm/resctrl.h         | 11 +++++++----
 arch/x86/kernel/cpu/resctrl/rdtgroup.c | 10 +++++-----
 2 files changed, 12 insertions(+), 9 deletions(-)

diff --git a/arch/x86/include/asm/resctrl.h b/arch/x86/include/asm/resctrl.h
index 07603064df8f..d60ed0668a59 100644
--- a/arch/x86/include/asm/resctrl.h
+++ b/arch/x86/include/asm/resctrl.h
@@ -56,19 +56,22 @@ static void __resctrl_sched_in(void)
 	struct resctrl_pqr_state *state = this_cpu_ptr(&pqr_state);
 	u32 closid = state->default_closid;
 	u32 rmid = state->default_rmid;
+	u32 tmp;
 
 	/*
 	 * If this task has a closid/rmid assigned, use it.
 	 * Else use the closid/rmid assigned to this cpu.
 	 */
 	if (static_branch_likely(&rdt_alloc_enable_key)) {
-		if (current->closid)
-			closid = current->closid;
+		tmp = READ_ONCE(current->closid);
+		if (tmp)
+			closid = tmp;
 	}
 
 	if (static_branch_likely(&rdt_mon_enable_key)) {
-		if (current->rmid)
-			rmid = current->rmid;
+		tmp = READ_ONCE(current->rmid);
+		if (tmp)
+			rmid = tmp;
 	}
 
 	if (closid != state->cur_closid || rmid != state->cur_rmid) {
diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
index 523660a68d9a..791a7064edea 100644
--- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
+++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
@@ -601,11 +601,11 @@ static int __rdtgroup_move_task(struct task_struct *tsk,
 	 */
 
 	if (rdtgrp->type == RDTCTRL_GROUP) {
-		tsk->closid = rdtgrp->closid;
-		tsk->rmid = rdtgrp->mon.rmid;
+		WRITE_ONCE(tsk->closid, rdtgrp->closid);
+		WRITE_ONCE(tsk->rmid, rdtgrp->mon.rmid);
 	} else if (rdtgrp->type == RDTMON_GROUP) {
 		if (rdtgrp->mon.parent->closid == tsk->closid) {
-			tsk->rmid = rdtgrp->mon.rmid;
+			WRITE_ONCE(tsk->rmid, rdtgrp->mon.rmid);
 		} else {
 			rdt_last_cmd_puts("Can't move task to different control group\n");
 			return -EINVAL;
@@ -2345,8 +2345,8 @@ static void rdt_move_group_tasks(struct rdtgroup *from, struct rdtgroup *to,
 	for_each_process_thread(p, t) {
 		if (!from || is_closid_match(t, from) ||
 		    is_rmid_match(t, from)) {
-			t->closid = to->closid;
-			t->rmid = to->mon.rmid;
+			WRITE_ONCE(t->closid, to->closid);
+			WRITE_ONCE(t->rmid, to->mon.rmid);
 
 			/* If the task is on a CPU, set the CPU in the mask. */
 			set_task_cpumask(t, mask);
-- 
2.27.0


^ permalink raw reply related	[flat|nested] 22+ messages in thread

end of thread, other threads:[~2020-12-17 12:23 UTC | newest]

Thread overview: 22+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-12-03 23:25 [PATCH 0/3] x86/resctrl: Fix a few issues in moving a task to a resource group Reinette Chatre
2020-12-03 23:25 ` [PATCH 1/3] x86/resctrl: Move setting task's active CPU in a mask into helpers Reinette Chatre
2020-12-07 18:29   ` Borislav Petkov
2020-12-07 21:24     ` Reinette Chatre
2020-12-08  9:49       ` Borislav Petkov
2020-12-08 16:35         ` Reinette Chatre
2020-12-09 16:47   ` James Morse
2020-12-10  0:21     ` Reinette Chatre
2020-12-03 23:25 ` [PATCH 2/3] x86/resctrl: Update PQR_ASSOC MSR synchronously when moving task to resource group Reinette Chatre
2020-12-09 16:51   ` James Morse
2020-12-10  0:22     ` Reinette Chatre
2020-12-11 20:46   ` Valentin Schneider
2020-12-14 18:41     ` Reinette Chatre
2020-12-16 17:41       ` Valentin Schneider
2020-12-16 18:26         ` Reinette Chatre
2020-12-17 10:39           ` Valentin Schneider
2020-12-03 23:25 ` [PATCH 3/3] x86/resctrl: Don't move a task to the same " Reinette Chatre
2020-12-11 20:46 ` [PATCH 0/3] x86/resctrl: Fix a few issues in moving a task to a " Valentin Schneider
2020-12-14 18:38   ` Reinette Chatre
2020-12-16 17:41     ` Valentin Schneider
2020-12-16 18:26       ` Reinette Chatre
2020-12-17 12:19 ` [PATCH 4/3] x86/intel_rdt: Apply READ_ONCE/WRITE_ONCE to task_struct .rmid & .closid Valentin Schneider
