* [PATCH 0/1] x86/resctrl: fix task CLOSID update race
@ 2022-11-03 14:16 Peter Newman
  2022-11-03 14:16 ` [PATCH 1/1] x86/resctrl: serialize task CLOSID update with task_call_func() Peter Newman
  2022-11-08 18:49 ` [PATCH 0/1] x86/resctrl: fix task CLOSID update race Reinette Chatre
  0 siblings, 2 replies; 4+ messages in thread
From: Peter Newman @ 2022-11-03 14:16 UTC (permalink / raw)
  To: Reinette Chatre, Fenghua Yu
  Cc: Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H. Peter Anvin, linux-kernel, jannh, eranian, kpsingh, derkling,
	james.morse, Peter Newman

Hi Reinette, Fenghua,

Below is my patch to address the IPI race we discussed in the container
move RFD thread[1].

The patch below uses the new task_call_func() interface to serialize
updating closid and rmid with any context switch of the task. AFAICT,
the implementation of this function acts like a mutex with context
switch, but I'm not certain whether it is intended to be one. If this is
not how task_call_func() is meant to be used, I will instead move the
code performing the update under sched/ where it can be done holding the
task_rq_lock() explicitly, as Reinette has suggested before[2].
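
For reference, a minimal sketch of the pattern, with hypothetical
helper names (simplified from the actual patch below):

	/*
	 * AIUI, task_call_func() runs the callback while the task is
	 * pinned by its pi/rq locks, so the plain stores below cannot
	 * race with the loads in resctrl_sched_in() on another CPU.
	 */
	static int set_task_ids(struct task_struct *t, void *arg)
	{
		struct rdtgroup *rdtgrp = arg;

		t->closid = rdtgrp->closid;
		t->rmid = rdtgrp->mon.rmid;

		/* Returned to the task_call_func() caller. */
		return task_curr(t);
	}

	/* IPI the task's CPU only if it was running during the update. */
	if (task_call_func(tsk, set_task_ids, rdtgrp))
		smp_call_function_single(task_cpu(tsk),
					 _update_task_closid_rmid, tsk, 1);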

From my own measurements, this change will roughly double the time to
complete a mass-move operation, such as rmdir on an rdtgroup with a
large task list. But to the best of my knowledge, these large-scale
reconfigurations of the control groups are infrequent, and the baseline
I'm measuring against is racy anyway.

What's still unclear to me is whether, when processing a large task
list, obtaining the pi/rq locks for thousands of tasks (all while
read-locking the tasklist_lock) is better than just blindly notifying
all CPUs. My guess is that the situation where notifying all CPUs would
be better is uncommon for most users, and probably more likely in
Google's use case than most others, as we have a use case for moving
large container jobs to a different MBA group.
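
For comparison, blindly notifying all CPUs would look roughly like the
following sketch (hypothetical, not part of this patch; resched_pqr()
is a made-up helper):

	static void resched_pqr(void *unused)
	{
		/* Re-evaluate the running task's closid/rmid locally. */
		resctrl_sched_in();
	}

	read_lock(&tasklist_lock);
	for_each_process_thread(p, t) {
		if (is_closid_match(t, from) || is_rmid_match(t, from)) {
			WRITE_ONCE(t->closid, to->closid);
			WRITE_ONCE(t->rmid, to->mon.rmid);
		}
	}
	read_unlock(&tasklist_lock);

	/* One broadcast instead of per-task locking and targeted IPIs. */
	on_each_cpu(resched_pqr, NULL, 1);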

Thanks!
-Peter

[1] https://lore.kernel.org/all/CALPaoCg2-9ARbK+MEgdvdcjJtSy_2H6YeRkLrT97zgy8Aro3Vg@mail.gmail.com/
[2] https://lore.kernel.org/lkml/d3c06fa3-83a4-7ade-6b08-3a7259aa6c4b@intel.com/

Peter Newman (1):
  x86/resctrl: serialize task CLOSID update with task_call_func()

 arch/x86/include/asm/resctrl.h         | 11 ++--
 arch/x86/kernel/cpu/resctrl/rdtgroup.c | 83 +++++++++++++++-----------
 2 files changed, 51 insertions(+), 43 deletions(-)

-- 
2.38.1.273.g43a17bfeac-goog



* [PATCH 1/1] x86/resctrl: serialize task CLOSID update with task_call_func()
  2022-11-03 14:16 [PATCH 0/1] x86/resctrl: fix task CLOSID update race Peter Newman
@ 2022-11-03 14:16 ` Peter Newman
  2022-11-08 18:49 ` [PATCH 0/1] x86/resctrl: fix task CLOSID update race Reinette Chatre
  1 sibling, 0 replies; 4+ messages in thread
From: Peter Newman @ 2022-11-03 14:16 UTC (permalink / raw)
  To: Reinette Chatre, Fenghua Yu
  Cc: Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H. Peter Anvin, linux-kernel, jannh, eranian, kpsingh, derkling,
	james.morse, Peter Newman

When determining whether tasks need to be interrupted due to a
closid/rmid change, it is possible for the task in question to migrate
or wake up concurrently without observing the updated closid/rmid
values.

This is because stores updating the closid and rmid in the task
structure could reorder with the loads in task_curr() and task_cpu().
Similar reordering also impacts resctrl_sched_in(), where reading the
updated values could reorder with prior stores to t->cpu or rq->curr.
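
A possible interleaving that loses the update (CPU numbering
hypothetical):

  CPU 0 (group move)                CPU 1
  ------------------                ------------------
  t->closid = new_closid            rq->curr = t  /* switch in t */
  task_curr(t) == false             resctrl_sched_in():
    (the store above not yet          reads t->closid, still sees
     visible; stale rq->curr           the old value
     observed, so no IPI sent)
                                    t keeps running with the old
                                    CLOSID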

Instead, use task_call_func() to serialize updates to the closid and
rmid fields in the task_struct with context switch. This also removes
the need for READ_ONCE()/WRITE_ONCE() when accessing the fields.

Signed-off-by: Peter Newman <peternewman@google.com>
---
 arch/x86/include/asm/resctrl.h         | 11 ++--
 arch/x86/kernel/cpu/resctrl/rdtgroup.c | 83 +++++++++++++++-----------
 2 files changed, 51 insertions(+), 43 deletions(-)

diff --git a/arch/x86/include/asm/resctrl.h b/arch/x86/include/asm/resctrl.h
index d24b04ebf950..b712106e8f81 100644
--- a/arch/x86/include/asm/resctrl.h
+++ b/arch/x86/include/asm/resctrl.h
@@ -56,22 +56,19 @@ static void __resctrl_sched_in(void)
 	struct resctrl_pqr_state *state = this_cpu_ptr(&pqr_state);
 	u32 closid = state->default_closid;
 	u32 rmid = state->default_rmid;
-	u32 tmp;
 
 	/*
 	 * If this task has a closid/rmid assigned, use it.
 	 * Else use the closid/rmid assigned to this cpu.
 	 */
 	if (static_branch_likely(&rdt_alloc_enable_key)) {
-		tmp = READ_ONCE(current->closid);
-		if (tmp)
-			closid = tmp;
+		if (current->closid)
+			closid = current->closid;
 	}
 
 	if (static_branch_likely(&rdt_mon_enable_key)) {
-		tmp = READ_ONCE(current->rmid);
-		if (tmp)
-			rmid = tmp;
+		if (current->rmid)
+			rmid = current->rmid;
 	}
 
 	if (closid != state->cur_closid || rmid != state->cur_rmid) {
diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
index e5a48f05e787..97cfa841f296 100644
--- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
+++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
@@ -538,12 +538,38 @@ static void _update_task_closid_rmid(void *task)
 		resctrl_sched_in();
 }
 
-static void update_task_closid_rmid(struct task_struct *t)
+static int update_locked_task_closid_rmid(struct task_struct *t, void *arg)
 {
-	if (IS_ENABLED(CONFIG_SMP) && task_curr(t))
-		smp_call_function_single(task_cpu(t), _update_task_closid_rmid, t, 1);
-	else
-		_update_task_closid_rmid(t);
+	struct rdtgroup *rdtgrp = arg;
+
+	/*
+	 * We assume task_call_func() has provided the necessary serialization
+	 * with resctrl_sched_in().
+	 */
+	if (rdtgrp->type == RDTCTRL_GROUP) {
+		t->closid = rdtgrp->closid;
+		t->rmid = rdtgrp->mon.rmid;
+	} else if (rdtgrp->type == RDTMON_GROUP) {
+		t->rmid = rdtgrp->mon.rmid;
+	}
+
+	/*
+	 * If the task is current on a CPU, the PQR_ASSOC MSR needs to be
+	 * updated to make the resource group go into effect. If the task is not
+	 * current, the MSR will be updated when the task is scheduled in.
+	 */
+	return task_curr(t);
+}
+
+static bool update_task_closid_rmid(struct task_struct *t,
+				    struct rdtgroup *rdtgrp)
+{
+	/*
+	 * Serialize the closid and rmid update with context switch. If this
+	 * function indicates that the task was running, then it needs to be
+	 * interrupted to install the new closid and rmid.
+	 */
+	return task_call_func(t, update_locked_task_closid_rmid, rdtgrp);
 }
 
 static int __rdtgroup_move_task(struct task_struct *tsk,
@@ -557,39 +583,26 @@ static int __rdtgroup_move_task(struct task_struct *tsk,
 		return 0;
 
 	/*
-	 * Set the task's closid/rmid before the PQR_ASSOC MSR can be
-	 * updated by them.
-	 *
-	 * For ctrl_mon groups, move both closid and rmid.
 	 * For monitor groups, can move the tasks only from
 	 * their parent CTRL group.
 	 */
-
-	if (rdtgrp->type == RDTCTRL_GROUP) {
-		WRITE_ONCE(tsk->closid, rdtgrp->closid);
-		WRITE_ONCE(tsk->rmid, rdtgrp->mon.rmid);
-	} else if (rdtgrp->type == RDTMON_GROUP) {
-		if (rdtgrp->mon.parent->closid == tsk->closid) {
-			WRITE_ONCE(tsk->rmid, rdtgrp->mon.rmid);
-		} else {
-			rdt_last_cmd_puts("Can't move task to different control group\n");
-			return -EINVAL;
-		}
+	if (rdtgrp->type == RDTMON_GROUP &&
+	    rdtgrp->mon.parent->closid != tsk->closid) {
+		rdt_last_cmd_puts("Can't move task to different control group\n");
+		return -EINVAL;
 	}
 
-	/*
-	 * Ensure the task's closid and rmid are written before determining if
-	 * the task is current that will decide if it will be interrupted.
-	 */
-	barrier();
-
-	/*
-	 * By now, the task's closid and rmid are set. If the task is current
-	 * on a CPU, the PQR_ASSOC MSR needs to be updated to make the resource
-	 * group go into effect. If the task is not current, the MSR will be
-	 * updated when the task is scheduled in.
-	 */
-	update_task_closid_rmid(tsk);
+	if (update_task_closid_rmid(tsk, rdtgrp) && IS_ENABLED(CONFIG_SMP))
+		/*
+		 * If the task has migrated away from the CPU indicated by
+		 * task_cpu() below, then it has already switched in on the
+		 * new CPU using the updated closid and rmid, and the call
+		 * below is unnecessary, but harmless.
+		 */
+		smp_call_function_single(task_cpu(tsk),
+					 _update_task_closid_rmid, tsk, 1);
+	else
+		_update_task_closid_rmid(tsk);
 
 	return 0;
 }
@@ -2398,8 +2411,6 @@ static void rdt_move_group_tasks(struct rdtgroup *from, struct rdtgroup *to,
 	for_each_process_thread(p, t) {
 		if (!from || is_closid_match(t, from) ||
 		    is_rmid_match(t, from)) {
-			WRITE_ONCE(t->closid, to->closid);
-			WRITE_ONCE(t->rmid, to->mon.rmid);
 
 			/*
 			 * If the task is on a CPU, set the CPU in the mask.
@@ -2408,7 +2419,7 @@ static void rdt_move_group_tasks(struct rdtgroup *from, struct rdtgroup *to,
 			 * In such a case the function call is pointless, but
 			 * there is no other side effect.
 			 */
-			if (IS_ENABLED(CONFIG_SMP) && mask && task_curr(t))
+			if (update_task_closid_rmid(t, to) && mask)
 				cpumask_set_cpu(task_cpu(t), mask);
 		}
 	}
-- 
2.38.1.273.g43a17bfeac-goog



* Re: [PATCH 0/1] x86/resctrl: fix task CLOSID update race
  2022-11-03 14:16 [PATCH 0/1] x86/resctrl: fix task CLOSID update race Peter Newman
  2022-11-03 14:16 ` [PATCH 1/1] x86/resctrl: serialize task CLOSID update with task_call_func() Peter Newman
@ 2022-11-08 18:49 ` Reinette Chatre
  2022-11-10 15:28   ` Peter Newman
  1 sibling, 1 reply; 4+ messages in thread
From: Reinette Chatre @ 2022-11-08 18:49 UTC (permalink / raw)
  To: Peter Newman, Fenghua Yu
  Cc: Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H. Peter Anvin, linux-kernel, jannh, eranian, kpsingh, derkling,
	james.morse

Hi Peter,

On 11/3/2022 7:16 AM, Peter Newman wrote:
> Below is my patch to address the IPI race we discussed in the container
> move RFD thread[1].

Thank you very much for taking this on.

> 
> The patch below uses the new task_call_func() interface to serialize
> updating closid and rmid with any context switch of the task. AFAICT,
> the implementation of this function acts like a mutex with context
> switch, but I'm not certain whether it is intended to be one. If this is
> not how task_call_func() is meant to be used, I will instead move the
> code performing the update under sched/ where it can be done holding the
> task_rq_lock() explicitly, as Reinette has suggested before[2].
> 
> From my own measurements, this change will roughly double the time to
> complete a mass-move operation, such as rmdir on an rdtgroup with a
> large task list. But to the best of my knowledge, these large-scale
> reconfigurations of the control groups are infrequent, and the baseline
> I'm measuring against is racy anyway.
> 
> What's still unclear to me is whether, when processing a large task
> list, obtaining the pi/rq locks for thousands of tasks (all while
> read-locking the tasklist_lock) is better than just blindly notifying
> all CPUs. My guess is that the situation where notifying all CPUs would
> be better is uncommon for most users, and probably more likely in
> Google's use case than most others, as we have a use case for moving
> large container jobs to a different MBA group.
> 

It was unclear to me also, so I asked for advice and learned that, in
general, sending extra IPIs is not evil. There is precedent for sending
unnecessary IPIs, for example, in the TLB flushing code, where it is
common to land in the TLB flush IPI handler and learn that the TLB does
not need to be flushed. It is also worth highlighting that the
user-initiated resctrl flow in question is rare when compared to TLB
flushes.

From what I understand, even with the extra locking and resulting
delays that task_call_func() incurs to avoid unnecessary IPIs, it is
still possible to send unnecessary IPIs, because the information about
where the modified tasks are running may be stale by the time the IPIs
are sent. To me it seems that the risk of stale information increases
with the size of the moved task group. The benefit of using
task_call_func() when moving a group of tasks is thus not clear to me.
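
For illustration, one such window (hypothetical timing):

  CPU doing the move                task t
  ------------------                ------------------
  task_call_func() reports t
  running on CPU 3
                                    t migrates to CPU 7 and
                                    switches in with the new
                                    closid/rmid
  IPI to CPU 3 finds current != t:
  a harmless no-op, but the IPI
  was still unnecessary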

I do not see it as an either/or though. I think that using task_call_func()
to serialize with context switching is a good idea when moving a single
task. Sending IPIs to all CPUs in this case seems overkill. On the other
hand, when moving a group of tasks, I think that notifying all CPUs would
be simpler. The current code already ensures that it does not modify the
PQR register unnecessarily. I would really like to learn more about this
from the experts, but at this point I am most comfortable with such a
solution and look forward to learning more when it is presented to the
x86 maintainers for inclusion.

Reinette


* Re: [PATCH 0/1] x86/resctrl: fix task CLOSID update race
  2022-11-08 18:49 ` [PATCH 0/1] x86/resctrl: fix task CLOSID update race Reinette Chatre
@ 2022-11-10 15:28   ` Peter Newman
  0 siblings, 0 replies; 4+ messages in thread
From: Peter Newman @ 2022-11-10 15:28 UTC (permalink / raw)
  To: Reinette Chatre
  Cc: Fenghua Yu, Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H. Peter Anvin, linux-kernel, jannh, eranian, kpsingh, derkling,
	james.morse

Hi Reinette,

On Tue, Nov 8, 2022 at 7:50 PM Reinette Chatre
<reinette.chatre@intel.com> wrote:
> I do not see it as an either/or though. I think that using task_call_func()
> to serialize with context switching is a good idea when moving a single
> task. Sending IPIs to all CPUs in this case seems overkill. On the other hand,
> when moving a group of tasks I think that notifying all CPUs would be
> simpler. The current code already ensures that it does not modify the
> PQR register unnecessarily. I would really like to learn more about this
> from the experts but at this point I am most comfortable with such a
> solution and look forward to to learning from the experts when it is
> presented to the x86 maintainers for inclusion.

Great, that should allow me to post my mon group rename patch
independently of this one. As it is written today, it depends on this
patch to reliably notify the moved tasks' CPUs.

Thanks!
-Peter
