* [PATCH v2 0/3] x86/intel_rdt: task_work vs task_struct rmid/closid write race
From: Valentin Schneider @ 2020-11-23  2:24 UTC
  To: linux-kernel, x86
  Cc: Fenghua Yu, Reinette Chatre, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, H. Peter Anvin, James Morse

Hi folks,

This is a small cleanup + a fix for a race I stumbled upon while staring at
resctrl stuff.

Briefly tested on a Xeon Gold 5120 (m2.xlarge.x86 on Equinix) by bouncing
a few tasks around control groups.

Revisions
=========

v1 -> v2
--------
Empty git diff between the two, but:

o Collected Reviewed-by
o Reworded changelogs (James)
o Split READ_ONCE/WRITE_ONCE changes into separate patch (James)
  (@James: note that I had the audacity to slap your RB to that new patch,
   given that you already reviewed the diff. Hopefully you won't hate the
   CL too much!) 

Cheers,
Valentin

Valentin Schneider (3):
  x86/intel_rdt: Check monitor group vs control group membership earlier
  x86/intel_rdt: Plug task_work vs task_struct {rmid,closid} update race
  x86/intel_rdt: Apply READ_ONCE/WRITE_ONCE to task_struct .rmid &
    .closid

 arch/x86/include/asm/resctrl.h         | 11 ++--
 arch/x86/kernel/cpu/resctrl/rdtgroup.c | 71 ++++++++++++++++----------
 2 files changed, 50 insertions(+), 32 deletions(-)

--
2.27.0



* [PATCH v2 1/3] x86/intel_rdt: Check monitor group vs control group membership earlier
From: Valentin Schneider @ 2020-11-23  2:24 UTC
  To: linux-kernel, x86
  Cc: James Morse, Fenghua Yu, Reinette Chatre, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, H. Peter Anvin

A task can only be moved between monitor groups if both groups belong to
the same control group. However, this is checked fairly late: by that
time we have already appended a task_work callback, the execution of
which will be useless (there are no closid/rmid updates to handle,
barring concurrent writes).

Check the validity of the move earlier, so that the kzalloc() /
task_work_add() can be skipped when they were never going to be needed.

Reviewed-by: James Morse <James.Morse@arm.com>
Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
---
 arch/x86/kernel/cpu/resctrl/rdtgroup.c | 21 +++++++++++++--------
 1 file changed, 13 insertions(+), 8 deletions(-)

diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
index af323e2e3100..b6b5b95df833 100644
--- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
+++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
@@ -581,12 +581,7 @@ static int __rdtgroup_move_task(struct task_struct *tsk,
 			tsk->closid = rdtgrp->closid;
 			tsk->rmid = rdtgrp->mon.rmid;
 		} else if (rdtgrp->type == RDTMON_GROUP) {
-			if (rdtgrp->mon.parent->closid == tsk->closid) {
-				tsk->rmid = rdtgrp->mon.rmid;
-			} else {
-				rdt_last_cmd_puts("Can't move task to different control group\n");
-				ret = -EINVAL;
-			}
+			tsk->rmid = rdtgrp->mon.rmid;
 		}
 	}
 	return ret;
@@ -673,9 +668,19 @@ static int rdtgroup_move_task(pid_t pid, struct rdtgroup *rdtgrp,
 	rcu_read_unlock();
 
 	ret = rdtgroup_task_write_permission(tsk, of);
-	if (!ret)
-		ret = __rdtgroup_move_task(tsk, rdtgrp);
+	if (ret)
+		goto out;
 
+	if (rdtgrp->type == RDTMON_GROUP &&
+	    rdtgrp->mon.parent->closid != tsk->closid) {
+		rdt_last_cmd_puts("Can't move task to different control group\n");
+		ret = -EINVAL;
+		goto out;
+	}
+
+	ret = __rdtgroup_move_task(tsk, rdtgrp);
+
+out:
 	put_task_struct(tsk);
 	return ret;
 }
-- 
2.27.0



* [PATCH v2 2/3] x86/intel_rdt: Plug task_work vs task_struct {rmid,closid} update race
From: Valentin Schneider @ 2020-11-23  2:24 UTC
  To: linux-kernel, x86
  Cc: James Morse, Fenghua Yu, Reinette Chatre, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, H. Peter Anvin

Upon moving a task to a new control / monitor group, said task's
{closid, rmid} fields are updated *after* triggering the move_myself()
task_work callback. If the triggering thread gets preempted, or if the
targeted task was already on its way to return to userspace, then
move_myself() might be executed before the relevant task's {closid, rmid}
fields have been updated.

Update the task_struct's {closid, rmid} tuple *before* invoking
task_work_add(). Highlight the required ordering with a pair of comments.

Fixes: e02737d5b826 ("x86/intel_rdt: Add tasks files")
Reviewed-by: James Morse <James.Morse@arm.com>
Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
---
 arch/x86/kernel/cpu/resctrl/rdtgroup.c | 34 ++++++++++++++++----------
 1 file changed, 21 insertions(+), 13 deletions(-)

diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
index b6b5b95df833..f62d81104fd0 100644
--- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
+++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
@@ -524,6 +524,8 @@ static void move_myself(struct callback_head *head)
 	 * If resource group was deleted before this task work callback
 	 * was invoked, then assign the task to root group and free the
 	 * resource group.
+	 *
+	 * See pairing atomic_inc() in __rdtgroup_move_task()
 	 */
 	if (atomic_dec_and_test(&rdtgrp->waitcount) &&
 	    (rdtgrp->flags & RDT_DELETED)) {
@@ -553,14 +555,32 @@ static int __rdtgroup_move_task(struct task_struct *tsk,
 	callback = kzalloc(sizeof(*callback), GFP_KERNEL);
 	if (!callback)
 		return -ENOMEM;
-	callback->work.func = move_myself;
+
+	init_task_work(&callback->work, move_myself);
 	callback->rdtgrp = rdtgrp;
 
+	/*
+	 * For ctrl_mon groups move both closid and rmid.
+	 * For monitor groups, can move the tasks only from
+	 * their parent CTRL group.
+	 */
+	if (rdtgrp->type == RDTCTRL_GROUP)
+		tsk->closid = rdtgrp->closid;
+	tsk->rmid = rdtgrp->mon.rmid;
+
 	/*
 	 * Take a refcount, so rdtgrp cannot be freed before the
 	 * callback has been invoked.
+	 *
+	 * Also ensures above {closid, rmid} writes are observed by
+	 * move_myself(), as it can run immediately after task_work_add().
+	 * Otherwise old values may be loaded, and the move will only actually
+	 * happen at the next context switch.
+	 *
+	 * Pairs with atomic_dec() in move_myself().
 	 */
 	atomic_inc(&rdtgrp->waitcount);
+
 	ret = task_work_add(tsk, &callback->work, TWA_RESUME);
 	if (ret) {
 		/*
@@ -571,18 +591,6 @@ static int __rdtgroup_move_task(struct task_struct *tsk,
 		atomic_dec(&rdtgrp->waitcount);
 		kfree(callback);
 		rdt_last_cmd_puts("Task exited\n");
-	} else {
-		/*
-		 * For ctrl_mon groups move both closid and rmid.
-		 * For monitor groups, can move the tasks only from
-		 * their parent CTRL group.
-		 */
-		if (rdtgrp->type == RDTCTRL_GROUP) {
-			tsk->closid = rdtgrp->closid;
-			tsk->rmid = rdtgrp->mon.rmid;
-		} else if (rdtgrp->type == RDTMON_GROUP) {
-			tsk->rmid = rdtgrp->mon.rmid;
-		}
 	}
 	return ret;
 }
-- 
2.27.0



* [PATCH v2 3/3] x86/intel_rdt: Apply READ_ONCE/WRITE_ONCE to task_struct .rmid & .closid
From: Valentin Schneider @ 2020-11-23  2:24 UTC
  To: linux-kernel, x86
  Cc: James Morse, Fenghua Yu, Reinette Chatre, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, H. Peter Anvin

A CPU's current task can have its {closid, rmid} fields read locally while
they are being concurrently written to from another CPU. This can happen
anytime __resctrl_sched_in() races with e.g. a __rdtgroup_move_task() call
on a different CPU targeting the first CPU's current task.

Prevent load / store tearing for those accesses by giving them the
READ_ONCE() / WRITE_ONCE() treatment.

Reviewed-by: James Morse <James.Morse@arm.com>
Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
---
 arch/x86/include/asm/resctrl.h         | 11 +++++++----
 arch/x86/kernel/cpu/resctrl/rdtgroup.c | 24 +++++++++++++-----------
 2 files changed, 20 insertions(+), 15 deletions(-)

diff --git a/arch/x86/include/asm/resctrl.h b/arch/x86/include/asm/resctrl.h
index 07603064df8f..d60ed0668a59 100644
--- a/arch/x86/include/asm/resctrl.h
+++ b/arch/x86/include/asm/resctrl.h
@@ -56,19 +56,22 @@ static void __resctrl_sched_in(void)
 	struct resctrl_pqr_state *state = this_cpu_ptr(&pqr_state);
 	u32 closid = state->default_closid;
 	u32 rmid = state->default_rmid;
+	u32 tmp;
 
 	/*
 	 * If this task has a closid/rmid assigned, use it.
 	 * Else use the closid/rmid assigned to this cpu.
 	 */
 	if (static_branch_likely(&rdt_alloc_enable_key)) {
-		if (current->closid)
-			closid = current->closid;
+		tmp = READ_ONCE(current->closid);
+		if (tmp)
+			closid = tmp;
 	}
 
 	if (static_branch_likely(&rdt_mon_enable_key)) {
-		if (current->rmid)
-			rmid = current->rmid;
+		tmp = READ_ONCE(current->rmid);
+		if (tmp)
+			rmid = tmp;
 	}
 
 	if (closid != state->cur_closid || rmid != state->cur_rmid) {
diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
index f62d81104fd0..135a51529f70 100644
--- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
+++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
@@ -529,8 +529,8 @@ static void move_myself(struct callback_head *head)
 	 */
 	if (atomic_dec_and_test(&rdtgrp->waitcount) &&
 	    (rdtgrp->flags & RDT_DELETED)) {
-		current->closid = 0;
-		current->rmid = 0;
+		WRITE_ONCE(current->closid, 0);
+		WRITE_ONCE(current->rmid, 0);
 		kfree(rdtgrp);
 	}
 
@@ -565,8 +565,8 @@ static int __rdtgroup_move_task(struct task_struct *tsk,
 	 * their parent CTRL group.
 	 */
 	if (rdtgrp->type == RDTCTRL_GROUP)
-		tsk->closid = rdtgrp->closid;
-	tsk->rmid = rdtgrp->mon.rmid;
+		WRITE_ONCE(tsk->closid, rdtgrp->closid);
+	WRITE_ONCE(tsk->rmid, rdtgrp->mon.rmid);
 
 	/*
 	 * Take a refcount, so rdtgrp cannot be freed before the
@@ -598,13 +598,15 @@ static int __rdtgroup_move_task(struct task_struct *tsk,
 static bool is_closid_match(struct task_struct *t, struct rdtgroup *r)
 {
 	return (rdt_alloc_capable &&
-	       (r->type == RDTCTRL_GROUP) && (t->closid == r->closid));
+		(r->type == RDTCTRL_GROUP) &&
+		(READ_ONCE(t->closid) == r->closid));
 }
 
 static bool is_rmid_match(struct task_struct *t, struct rdtgroup *r)
 {
 	return (rdt_mon_capable &&
-	       (r->type == RDTMON_GROUP) && (t->rmid == r->mon.rmid));
+		(r->type == RDTMON_GROUP) &&
+		(READ_ONCE(t->rmid) == r->mon.rmid));
 }
 
 /**
@@ -680,7 +682,7 @@ static int rdtgroup_move_task(pid_t pid, struct rdtgroup *rdtgrp,
 		goto out;
 
 	if (rdtgrp->type == RDTMON_GROUP &&
-	    rdtgrp->mon.parent->closid != tsk->closid) {
+	    rdtgrp->mon.parent->closid != READ_ONCE(tsk->closid)) {
 		rdt_last_cmd_puts("Can't move task to different control group\n");
 		ret = -EINVAL;
 		goto out;
@@ -810,7 +812,7 @@ int proc_resctrl_show(struct seq_file *s, struct pid_namespace *ns,
 		    rdtg->mode != RDT_MODE_EXCLUSIVE)
 			continue;
 
-		if (rdtg->closid != tsk->closid)
+		if (rdtg->closid != READ_ONCE(tsk->closid))
 			continue;
 
 		seq_printf(s, "res:%s%s\n", (rdtg == &rdtgroup_default) ? "/" : "",
@@ -818,7 +820,7 @@ int proc_resctrl_show(struct seq_file *s, struct pid_namespace *ns,
 		seq_puts(s, "mon:");
 		list_for_each_entry(crg, &rdtg->mon.crdtgrp_list,
 				    mon.crdtgrp_list) {
-			if (tsk->rmid != crg->mon.rmid)
+			if (READ_ONCE(tsk->rmid) != crg->mon.rmid)
 				continue;
 			seq_printf(s, "%s", crg->kn->name);
 			break;
@@ -2336,8 +2338,8 @@ static void rdt_move_group_tasks(struct rdtgroup *from, struct rdtgroup *to,
 	for_each_process_thread(p, t) {
 		if (!from || is_closid_match(t, from) ||
 		    is_rmid_match(t, from)) {
-			t->closid = to->closid;
-			t->rmid = to->mon.rmid;
+			WRITE_ONCE(t->closid, to->closid);
+			WRITE_ONCE(t->rmid, to->mon.rmid);
 
 #ifdef CONFIG_SMP
 			/*
-- 
2.27.0



* Re: [PATCH v2 0/3] x86/intel_rdt: task_work vs task_struct rmid/closid write race
From: Reinette Chatre @ 2020-11-24 21:37 UTC
  To: Valentin Schneider, linux-kernel, x86
  Cc: Fenghua Yu, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	H. Peter Anvin, James Morse

Hi Valentin,

On 11/22/2020 6:24 PM, Valentin Schneider wrote:
> This is a small cleanup + a fix for a race I stumbled upon while staring at
> resctrl stuff.
> 
> Briefly tested on a Xeon Gold 5120 (m2.xlarge.x86 on Equinix) by bouncing
> a few tasks around control groups.
> 

...

Thank you very much for taking this on. Unfortunately this race is one 
of a few issues with the way in which moving tasks to a new resource 
group is handled.

Other issues are:

1.
Until the queued work is run, the moved task runs with old (and even
invalid in the case when its original resource group has been removed)
closid and rmid.

2.
Work to update the PQR_ASSOC register is queued every time the user 
writes a task id to the "tasks" file, even if the task already belongs 
to the resource group and in addition to any other pending work for that 
task. This could result in multiple pending work items associated with a 
single task even if they are all identical and even though only a single 
update with most recent values is needed. This could result in 
significant system resource waste, especially on tasks sleeping for a 
long time.

Fenghua solved these issues by replacing the callback with a synchronous 
update, similar to how tasks are currently moved when a resource group 
is deleted. We plan to submit this work next week.

This new solution will make patch 1 and 2 of this series unnecessary. As 
I understand it patch 3 would still be a welcome addition but would 
require changes. As you prefer you could either submit patch 3 on its 
own for the code as it is now and we will rework the task related 
changes on top of that, or you could wait for the task related changes 
to land first?

> 
> Valentin Schneider (3):
>    x86/intel_rdt: Check monitor group vs control group membership earlier
>    x86/intel_rdt: Plug task_work vs task_struct {rmid,closid} update race
>    x86/intel_rdt: Apply READ_ONCE/WRITE_ONCE to task_struct .rmid &
>      .closid
> 

Thank you very much

Reinette



* Re: [PATCH v2 0/3] x86/intel_rdt: task_work vs task_struct rmid/closid write race
From: Valentin Schneider @ 2020-11-25 15:01 UTC
  To: Reinette Chatre
  Cc: linux-kernel, x86, Fenghua Yu, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, H. Peter Anvin, James Morse


Hi Reinette,

On 24/11/20 21:37, Reinette Chatre wrote:
> Hi Valentin,
>
> On 11/22/2020 6:24 PM, Valentin Schneider wrote:
>> This is a small cleanup + a fix for a race I stumbled upon while staring at
>> resctrl stuff.
>>
>> Briefly tested on a Xeon Gold 5120 (m2.xlarge.x86 on Equinix) by bouncing
>> a few tasks around control groups.
>>
>
> ...
>
> Thank you very much for taking this on. Unfortunately this race is one
> of a few issues with the way in which moving tasks to a new resource
> group is handled.
>
> Other issues are:
>
> 1.
> Until the queued work is run, the moved task runs with old (and even
> invalid in the case when its original resource group has been removed)
> closid and rmid.
>

For a userspace task, that queued work should be run as soon as possible
(& relevant). If said task is currently running, then task_work_add() will
lead to an IPI; the other cases (task moving itself or not currently
running) are covered by the return to userspace path.

Kernel threads however are a prickly matter because they quite explicitly
don't have this return to userspace - they only run their task_work
callbacks on exit. So we currently have to wait for those kthreads to go
through a context switch to update the relevant register, but I don't
see any other alternative that wouldn't involve interrupting every other
CPU (the kthread could move between us triggering some remote work and its
previous CPU receiving the IPI).
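
For reference, the pattern the current code uses is just this (minimal
sketch from memory, error handling elided; my_callback() is a
placeholder name):

	struct callback_head *work = kzalloc(sizeof(*work), GFP_KERNEL);

	init_task_work(work, my_callback);
	/*
	 * TWA_RESUME sets TIF_NOTIFY_RESUME on @tsk and kicks (IPIs) its
	 * CPU if it is currently running; my_callback() then runs in
	 * @tsk's context on its way back to userspace.
	 */
	if (task_work_add(tsk, work, TWA_RESUME))
		kfree(work);	/* -ESRCH: @tsk is already exiting */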

> 2.
> Work to update the PQR_ASSOC register is queued every time the user
> writes a task id to the "tasks" file, even if the task already belongs
> to the resource group and in addition to any other pending work for that
> task. This could result in multiple pending work items associated with a
> single task even if they are all identical and even though only a single
> update with most recent values is needed. This could result in
> significant system resource waste, especially on tasks sleeping for a
> long time.
>
> Fenghua solved these issues by replacing the callback with a synchronous
> update, similar to how tasks are currently moved when a resource group
> is deleted. We plan to submit this work next week.
>
> This new solution will make patch 1 and 2 of this series unnecessary. As
> I understand it patch 3 would still be a welcome addition but would
> require changes. As you prefer you could either submit patch 3 on its
> own for the code as it is now and we will rework the task related
> changes on top of that, or you could wait for the task related changes
> to land first?
>

Please do Cc me on those - I'll re-evaluate the need for patch 3 then.

Thanks!

>>
>> Valentin Schneider (3):
>>    x86/intel_rdt: Check monitor group vs control group membership earlier
>>    x86/intel_rdt: Plug task_work vs task_struct {rmid,closid} update race
>>    x86/intel_rdt: Apply READ_ONCE/WRITE_ONCE to task_struct .rmid &
>>      .closid
>>
>
> Thank you very much
>
> Reinette


* Re: [PATCH v2 0/3] x86/intel_rdt: task_work vs task_struct rmid/closid write race
From: Reinette Chatre @ 2020-11-25 17:20 UTC
  To: Valentin Schneider
  Cc: linux-kernel, x86, Fenghua Yu, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, H. Peter Anvin, James Morse

Hi Valentin,

On 11/25/2020 7:01 AM, Valentin Schneider wrote:
> On 24/11/20 21:37, Reinette Chatre wrote:
>> On 11/22/2020 6:24 PM, Valentin Schneider wrote:
>>> This is a small cleanup + a fix for a race I stumbled upon while staring at
>>> resctrl stuff.
>>>
>>> Briefly tested on a Xeon Gold 5120 (m2.xlarge.x86 on Equinix) by bouncing
>>> a few tasks around control groups.
>>>
>>
>> ...
>>
>> Thank you very much for taking this on. Unfortunately this race is one
>> of a few issues with the way in which moving tasks to a new resource
>> group is handled.
>>
>> Other issues are:
>>
>> 1.
>> Until the queued work is run, the moved task runs with old (and even
>> invalid in the case when its original resource group has been removed)
>> closid and rmid.
>>
> 
> For a userspace task, that queued work should be run as soon as possible
> (& relevant). If said task is currently running, then task_work_add() will
> lead to an IPI;
> the other cases (task moving itself or not currently
> running) are covered by the return to userspace path.

At this time the work is added with the TWA_RESUME flag so the running 
task does not get a signal. I tried to follow the task_work_add() path 
as if it were changed to use TWA_SIGNAL instead, and (I may be 
misunderstanding) it seems to me that a sleeping task will be woken (if 
it is TASK_INTERRUPTIBLE)? That is unnecessary. The goal of this work is 
only to change the CPU register to indicate the active closid/rmid so it 
is unnecessary to wake a process to do that; it only needs to be done 
next time the task is scheduled in (which is already done with the 
resctrl_sched_in() call in __switch_to()). If a task is not running all 
that is needed is to change the closid/rmid in its task_struct to be 
used next time it is scheduled in.

In the new solution, after updating closid/rmid in the task_struct, the 
CPU register is updated via smp_call_function_single() on the CPU the task 
is running on. Nothing is done for tasks not running; the next time they are 
scheduled in, the CPU's register will be updated to reflect the task's 
closid/rmid. Moving to the smp_call_function_xxx() API would also bring 
this update in line with how other register updates are already done in 
resctrl.
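
Roughly, the idea is (hand-wavy sketch, not Fenghua's actual patch;
update_cpu_closid_rmid() is a made-up name):

	static void update_cpu_closid_rmid(void *info)
	{
		/* On the target CPU: reload PQR_ASSOC from current's fields */
		resctrl_sched_in();
	}

	WRITE_ONCE(tsk->closid, rdtgrp->closid);
	WRITE_ONCE(tsk->rmid, rdtgrp->mon.rmid);

	/*
	 * Racy if the task migrates in the meantime, but then the register
	 * is rewritten from task_struct at the next context switch anyway.
	 */
	if (task_curr(tsk))
		smp_call_function_single(task_cpu(tsk),
					 update_cpu_closid_rmid, NULL, 1);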

> Kernel threads however are a prickly matter because they quite explicitly
> don't have this return to userspace - they only run their task_work
> callbacks on exit. So we currently have to wait for those kthreads to go
> through a context switch to update the relevant register, but I don't
> see any other alternative that wouldn't involve interrupting every other
> CPU (the kthread could move between us triggering some remote work and its
> previous CPU receiving the IPI).

This seems ok? In the new solution the closid/rmid would be updated in 
task_struct and a smp_call_function_single() attempted on the CPU where 
the kthread is running. If the kthread is no longer running at the time 
the function is called the CPU register will not be changed. I assume 
the kthread move would include a context switch that would result in the 
register change (__switch_to()->resctrl_sched_in()) for the kthread to 
run with its new closid/rmid after the move.

>> 2.
>> Work to update the PQR_ASSOC register is queued every time the user
>> writes a task id to the "tasks" file, even if the task already belongs
>> to the resource group and in addition to any other pending work for that
>> task. This could result in multiple pending work items associated with a
>> single task even if they are all identical and even though only a single
>> update with most recent values is needed. This could result in
>> significant system resource waste, especially on tasks sleeping for a
>> long time.
>>
>> Fenghua solved these issues by replacing the callback with a synchronous
>> update, similar to how tasks are currently moved when a resource group
>> is deleted. We plan to submit this work next week.
>>
>> This new solution will make patch 1 and 2 of this series unnecessary. As
>> I understand it patch 3 would still be a welcome addition but would
>> require changes. As you prefer you could either submit patch 3 on its
>> own for the code as it is now and we will rework the task related
>> changes on top of that, or you could wait for the task related changes
>> to land first?
>>
> 
> Please do Cc me on those - I'll re-evaluate the need for patch 3 then.

Will do. I appreciate you taking a look.

Thank you very much

Reinette


* Re: [PATCH v2 0/3] x86/intel_rdt: task_work vs task_struct rmid/closid write race
From: Valentin Schneider @ 2020-11-25 18:39 UTC
  To: Reinette Chatre
  Cc: linux-kernel, x86, Fenghua Yu, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, H. Peter Anvin, James Morse


On 25/11/20 17:20, Reinette Chatre wrote:
> Hi Valentin,
>>>
>>> Until the queued work is run, the moved task runs with old (and even
>>> invalid in the case when its original resource group has been removed)
>>> closid and rmid.
>>>
>>
>> For a userspace task, that queued work should be run as soon as possible
>> (& relevant). If said task is currently running, then task_work_add() will
>> lead to an IPI;
>> the other cases (task moving itself or not currently
>> running) are covered by the return to userspace path.
>
> At this time the work is added with the TWA_RESUME flag so the running
> task does not get a signal. I tried to follow the task_work_add() path
> as if it were changed to use TWA_SIGNAL instead, and (I may be
> misunderstanding) it seems to me that a sleeping task will be woken (if
> it is TASK_INTERRUPTIBLE)? That is unnecessary. The goal of this work is
> only to change the CPU register to indicate the active closid/rmid so it
> is unnecessary to wake a process to do that; it only needs to be done
> next time the task is scheduled in (which is already done with the
> resctrl_sched_in() call in __switch_to()). If a task is not running all
> that is needed is to change the closid/rmid in its task_struct to be
> used next time it is scheduled in.
>

The (default) TWA_RESUME ensures the targeted (userspace) task gets kicked
if it is currently running, and doesn't perturb any CPU otherwise;
see set_notify_resume() + exit_to_user_mode_loop() (or do_notify_resume()
on arm64).
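
The gist of it, going from memory (simplified from kernel/task_work.c
and the tracehook bits):

	/* task_work_add(task, work, TWA_RESUME) ends with: */
	case TWA_RESUME:
		set_notify_resume(task);
		break;

	/* ...and set_notify_resume() is roughly: */
	if (!test_and_set_tsk_thread_flag(task, TIF_NOTIFY_RESUME))
		kick_process(task);	/* IPI only if @task is running */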

> In the new solution, after updating closid/rmid in the task_struct, the
> CPU register is updated via smp_call_function_single() on the CPU the task
> is running on. Nothing is done for tasks not running; the next time they are
> scheduled in, the CPU's register will be updated to reflect the task's
> closid/rmid. Moving to the smp_call_function_xxx() API would also bring
> this update in line with how other register updates are already done in
> resctrl.
>
>> Kernel threads however are a prickly matter because they quite explicitly
>> don't have this return to userspace - they only run their task_work
>> callbacks on exit. So we currently have to wait for those kthreads to go
>> through a context switch to update the relevant register, but I don't
>> see any other alternative that wouldn't involve interrupting every other
>> CPU (the kthread could move between us triggering some remote work and its
>> previous CPU receiving the IPI).
>
> This seems ok? In the new solution the closid/rmid would be updated in
> task_struct and a smp_call_function_single() attempted on the CPU where
> the kthread is running. If the kthread is no longer running at the time
> the function is called the CPU register will not be changed.

Right, if the update happens before triggering the remote work then that
should all work. I was stuck thinking about keeping the update contained
within the remote work itself to prevent any other races (i.e. patch 3).

Anywho, that's enough speculation from me, I'll just sit tight and see what
comes next!

> I assume
> the kthread move would include a context switch that would result in the
> register change (__switch_to()->resctrl_sched_in()) for the kthread to
> run with its new closid/rmid after the move.
>


* Re: [PATCH v2 0/3] x86/intel_rdt: task_work vs task_struct rmid/closid write race
From: Reinette Chatre @ 2020-11-25 19:06 UTC
  To: Valentin Schneider
  Cc: linux-kernel, x86, Fenghua Yu, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, H. Peter Anvin, James Morse

Hi Valentin,

On 11/25/2020 10:39 AM, Valentin Schneider wrote:
> On 25/11/20 17:20, Reinette Chatre wrote:
>>>> Until the queued work is run, the moved task runs with old (and even
>>>> invalid in the case when its original resource group has been removed)
>>>> closid and rmid.
>>>>
>>>
>>> For a userspace task, that queued work should be run as soon as possible
>>> (& relevant). If said task is currently running, then task_work_add() will
>>> lead to an IPI;
>>> the other cases (task moving itself or not currently
>>> running) are covered by the return to userspace path.
>>
>> At this time the work is added with the TWA_RESUME flag so the running
>> task does not get a signal. I tried to follow the task_work_add() path
>> as if it were changed to use TWA_SIGNAL instead, and (I may be
>> misunderstanding) it seems to me that a sleeping task will be woken (if
>> it is TASK_INTERRUPTIBLE)? That is unnecessary. The goal of this work is
>> only to change the CPU register to indicate the active closid/rmid so it
>> is unnecessary to wake a process to do that; it only needs to be done
>> next time the task is scheduled in (which is already done with the
>> resctrl_sched_in() call in __switch_to()). If a task is not running all
>> that is needed is to change the closid/rmid in its task_struct to be
>> used next time it is scheduled in.
>>
> 
> The (default) TWA_RESUME ensures the targeted (userspace) task gets kicked
> if it is currently running, and doesn't perturb any CPU otherwise;
> see set_notify_resume() + exit_to_user_mode_loop() (or do_notify_resume()
> on arm64)

I missed that, thanks. The first issue is thus not a problem. Thank you 
very much for clearing this up. Queueing work for tasks that are not 
running remains unnecessary and simplifying this with a targeted 
smp_call_function addresses that (while also taking care of the other 
issues with using the queued work).

>> In the new solution, after updating closid/rmid in the task_struct, the
>> CPU register is updated via smp_call_function_single() on the CPU the task
>> is running on. Nothing is done for tasks not running; the next time they are
>> scheduled in, the CPU's register will be updated to reflect the task's
>> closid/rmid. Moving to the smp_call_function_xxx() API would also bring
>> this update in line with how other register updates are already done in
>> resctrl.
>>
>>> Kernel threads however are a prickly matter because they quite explicitly
>>> don't have this return to userspace - they only run their task_work
>>> callbacks on exit. So we currently have to wait for those kthreads to go
>>> through a context switch to update the relevant register, but I don't
>>> see any other alternative that wouldn't involve interrupting every other
>>> CPU (the kthread could move between us triggering some remote work and its
>>> previous CPU receiving the IPI).
>>
>> This seems ok? In the new solution the closid/rmid would be updated in
>> task_struct and a smp_call_function_single() attempted on the CPU where
>> the kthread is running. If the kthread is no longer running at the time
>> the function is called the CPU register will not be changed.
> 
> Right, if the update happens before triggering the remote work then that
> should all work. I was stuck thinking about keeping the update contained
> within the remote work itself to prevent any other races (i.e. patch 3).

Are you saying that the task_struct update as well as register update 
should both be done in the remote work? I think I may be 
misunderstanding though. Currently, with your entire series applied, the 
update to task_struct is done before the remote work (which only changes 
the register) is queued. The new solution would also first update the 
task_struct and then the remote work (this time with smp_call_function) 
will just update the register.

From what I understand, your work in patch 3 would continue to be 
welcome with the new solution that will also update the task_struct and 
then trigger the remote work to just update the register.

> Anywho, that's enough speculation from me, I'll just sit tight and see what
> comes next!
> 

Reinette

>> I assume
>> the kthread move would include a context switch that would result in the
>> register change (__switch_to()->resctrl_sched_in()) for the kthread to
>> run with its new closid/rmid after the move.
>>

* Re: [PATCH v2 0/3] x86/intel_rdt: task_work vs task_struct rmid/closid write race
From: Valentin Schneider @ 2020-11-25 23:23 UTC
  To: Reinette Chatre
  Cc: linux-kernel, x86, Fenghua Yu, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, H. Peter Anvin, James Morse


On 25/11/20 19:06, Reinette Chatre wrote:
> Hi Valentin,
>
> On 11/25/2020 10:39 AM, Valentin Schneider wrote:
>> The (default) TWA_RESUME ensures the targeted (userspace) task gets kicked
>> if it is currently running, and doesn't perturb any CPU otherwise;
>> see set_notify_resume() + exit_to_user_mode_loop() (or do_notify_resume()
>> on arm64)
>
> I missed that, thanks. The first issue is thus not a problem. Thank you
> very much for clearing this up. Queueing work for tasks that are not
> running remains unnecessary and simplifying this with a targeted
> smp_call_function addresses that (while also taking care of the other
> issues with using the queued work).
>

Right.

>>> In the new solution, after updating closid/rmid in the task_struct, the
>>> CPU register is updated via smp_call_function_single() on a CPU the task
>>> is running. Nothing is done for tasks not running, next time they are
>>> scheduled in the CPU's register will be updated to reflect the task's
>>> closid/rmid. Moving to the smp_call_function_xxx() API would also bring
>>> this update in line with how other register updates are already done in
>>> resctrl.
>>>
>>>> Kernel threads however are a prickly matter because they quite explicitly
>>>> don't have this return to userspace - they only run their task_work
>>>> callbacks on exit. So we currently have to wait for those kthreads to go
>>>> through a context switch to update the relevant register, but I don't
>>>> see any other alternative that wouldn't involve interrupting every other
>>>> CPU (the kthread could move between us triggering some remote work and its
>>>> previous CPU receiving the IPI).
>>>
>>> This seems ok? In the new solution the closid/rmid would be updated in
>>> task_struct and a smp_call_function_single() attempted on the CPU where
>>> the kthread is running. If the kthread is no longer running at the time
>>> the function is called the CPU register will not be changed.
>>
>> Right, if the update happens before triggering the remote work then that
>> should all work. I was stuck thinking about keeping the update contained
>> within the remote work itself to prevent any other races (i.e. patch 3).
>
> Are you saying that the task_struct update as well as register update
> should both be done in the remote work? I think I may be
> misunderstanding though.

It would simplify the concurrency aspect - if the {closid, rmid} update is
always done in the targeted task's context, then there can be no races
between an update (write) and a context switch (read). Sadly I don't see a
nice way to do this for kthreads, so I think it'll have to be update +
smp_call.
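
Something like the below would've been ideal (sketch only, and it still
doesn't help with kthreads, since they don't run task_work on their way
back to userspace):

	static void move_myself(struct callback_head *head)
	{
		struct task_move_callback *cb =
			container_of(head, struct task_move_callback, work);

		/*
		 * Only current ever writes its own {closid, rmid}, so
		 * __resctrl_sched_in() has no concurrent writer to race
		 * with.
		 */
		current->closid = cb->rdtgrp->closid;
		current->rmid = cb->rdtgrp->mon.rmid;
		...
	}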


> Currently, with your entire series applied, the
> update to task_struct is done before the remote work is queued that only
> changes the register. The new solution would also first update the
> task_struct and then the remote work (this time with smp_call_function)
> will just update the register.
>
> From what I understand, your work in patch 3 would continue to be
> welcome with the new solution that will also update the task_struct and
> then trigger the remote work to just update the register.
>

That's how I see it as well ATM.

>> Anywho, that's enough speculation from me, I'll just sit tight and see what
>> comes next!
>>
>
> Reinette
>
>>> I assume
>>> the kthread move would include a context switch that would result in the
>>> register change (__switch_to()->resctrl_sched_in()) for the kthread to
>>> run with its new closid/rmid after the move.
>>>
>
>
> Reinette
