* [patch 0/5] sched: Miscellaneous RT related tweaks
@ 2021-09-28 12:24 Thomas Gleixner
  2021-09-28 12:24 ` [patch 1/5] sched: Limit the number of task migrations per batch on RT Thomas Gleixner
                   ` (4 more replies)
  0 siblings, 5 replies; 22+ messages in thread
From: Thomas Gleixner @ 2021-09-28 12:24 UTC (permalink / raw)
  To: LKML
  Cc: Peter Zijlstra, Ingo Molnar, Masami Hiramatsu, Sebastian Andrzej Siewior

RT enabled kernels have a few issues with the inner workings of the
scheduler:

   - The remote TTWU_QUEUE mechanism leads to 5x larger maximum latencies

   - The batched migration limit of 32 tasks causes large latencies

   - The cleanup of kprobes, the vmapped stacks of dead tasks and mmdrop()
     are latency sources and eventually call into code paths which take
     regular spinlocks from within the scheduler core, which has preemption
     disabled.

The following series cleans this up. It is also available from git:

    git://git.kernel.org/pub/scm/linux/kernel/git/tglx/devel.git sched

applied on top of the previous might_sleep() cleanups:

    https://lore.kernel.org/r/20210923164145.466686140@linutronix.de

Thanks,

	tglx
---
 include/linux/mm_types.h |    4 ++++
 include/linux/sched/mm.h |   20 ++++++++++++++++++++
 kernel/exit.c            |    7 +++++++
 kernel/fork.c            |   18 +++++++++++++++++-
 kernel/kprobes.c         |    8 ++++----
 kernel/sched/core.c      |   16 +++++++++-------
 kernel/sched/features.h  |    5 +++++
 7 files changed, 66 insertions(+), 12 deletions(-)
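
A note on the gating pattern the series relies on: where the value itself
differs, the patches use a compile-time #ifdef CONFIG_PREEMPT_RT; where only
a branch is RT specific, they use IS_ENABLED(CONFIG_PREEMPT_RT) so the !RT
build still compiles and type-checks the code while the optimizer drops it.
A minimal sketch of both idioms (illustrative only, not part of the series):

#include <linux/sched.h>
#include <linux/sched/task_stack.h>

/* Compile-time selection: the initializer itself differs on RT. */
#ifdef CONFIG_PREEMPT_RT
static const unsigned int example_batch_limit = 8;
#else
static const unsigned int example_batch_limit = 32;
#endif

/*
 * IS_ENABLED() keeps both branches visible to the compiler; the unused
 * one is dead-code eliminated but still type-checked on !RT builds.
 */
static void example_delayed_cleanup(struct task_struct *tsk)
{
	if (IS_ENABLED(CONFIG_PREEMPT_RT))
		put_task_stack(tsk);
}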




* [patch 1/5] sched: Limit the number of task migrations per batch on RT
  2021-09-28 12:24 [patch 0/5] sched: Miscellaneous RT related tweaks Thomas Gleixner
@ 2021-09-28 12:24 ` Thomas Gleixner
  2021-10-01 15:05   ` [tip: sched/core] " tip-bot2 for Thomas Gleixner
  2021-10-05 14:11   ` tip-bot2 for Thomas Gleixner
  2021-09-28 12:24 ` [patch 2/5] sched: Disable TTWU_QUEUE " Thomas Gleixner
                   ` (3 subsequent siblings)
  4 siblings, 2 replies; 22+ messages in thread
From: Thomas Gleixner @ 2021-09-28 12:24 UTC (permalink / raw)
  To: LKML
  Cc: Peter Zijlstra, Ingo Molnar, Masami Hiramatsu, Sebastian Andrzej Siewior

Batched task migrations are a source for large latencies as they keep the
scheduler from running while processing the migrations.

Limit the batch size to 8 instead of 32 when running on an RT enabled
kernel.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
 kernel/sched/core.c |    4 ++++
 1 file changed, 4 insertions(+)
---
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -74,7 +74,11 @@ const_debug unsigned int sysctl_sched_fe
  * Number of tasks to iterate in a single balance run.
  * Limited because this is done with IRQs disabled.
  */
+#ifdef CONFIG_PREEMPT_RT
+const_debug unsigned int sysctl_sched_nr_migrate = 8;
+#else
 const_debug unsigned int sysctl_sched_nr_migrate = 32;
+#endif
 
 /*
  * period over which we measure -rt task CPU usage in us.



* [patch 2/5] sched: Disable TTWU_QUEUE on RT
  2021-09-28 12:24 [patch 0/5] sched: Miscellaneous RT related tweaks Thomas Gleixner
  2021-09-28 12:24 ` [patch 1/5] sched: Limit the number of task migrations per batch on RT Thomas Gleixner
@ 2021-09-28 12:24 ` Thomas Gleixner
  2021-10-01 15:05   ` [tip: sched/core] " tip-bot2 for Thomas Gleixner
  2021-10-05 14:11   ` tip-bot2 for Thomas Gleixner
  2021-09-28 12:24 ` [patch 3/5] sched: Move kprobes cleanup out of finish_task_switch() Thomas Gleixner
                   ` (2 subsequent siblings)
  4 siblings, 2 replies; 22+ messages in thread
From: Thomas Gleixner @ 2021-09-28 12:24 UTC (permalink / raw)
  To: LKML
  Cc: Peter Zijlstra, Ingo Molnar, Masami Hiramatsu, Sebastian Andrzej Siewior

The queued remote wakeup mechanism has turned out to be suboptimal for RT
enabled kernels. The maximum latencies go up by a factor of > 5x in certain
scenarios.

This is caused by either long wake lists or by a large number of TTWU IPIs
which are processed back to back.

Disable it for RT.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
 kernel/sched/features.h |    5 +++++
 1 file changed, 5 insertions(+)
---
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -46,11 +46,16 @@ SCHED_FEAT(DOUBLE_TICK, false)
  */
 SCHED_FEAT(NONTASK_CAPACITY, true)
 
+#ifdef CONFIG_PREEMPT_RT
+SCHED_FEAT(TTWU_QUEUE, false)
+#else
+
 /*
  * Queue remote wakeups on the target CPU and process them
  * using the scheduler IPI. Reduces rq->lock contention/bounces.
  */
 SCHED_FEAT(TTWU_QUEUE, true)
+#endif
 
 /*
  * When doing wakeups, attempt to limit superfluous scans of the LLC domain.



* [patch 3/5] sched: Move kprobes cleanup out of finish_task_switch()
  2021-09-28 12:24 [patch 0/5] sched: Miscellaneous RT related tweaks Thomas Gleixner
  2021-09-28 12:24 ` [patch 1/5] sched: Limit the number of task migrations per batch on RT Thomas Gleixner
  2021-09-28 12:24 ` [patch 2/5] sched: Disable TTWU_QUEUE " Thomas Gleixner
@ 2021-09-28 12:24 ` Thomas Gleixner
  2021-10-01 15:05   ` [tip: sched/core] " tip-bot2 for Thomas Gleixner
  2021-10-05 14:11   ` tip-bot2 for Thomas Gleixner
  2021-09-28 12:24 ` [patch 4/5] sched: Delay task stack freeing on RT Thomas Gleixner
  2021-09-28 12:24 ` [patch 5/5] sched: Move mmdrop to RCU " Thomas Gleixner
  4 siblings, 2 replies; 22+ messages in thread
From: Thomas Gleixner @ 2021-09-28 12:24 UTC (permalink / raw)
  To: LKML
  Cc: Peter Zijlstra, Ingo Molnar, Masami Hiramatsu, Sebastian Andrzej Siewior

Doing cleanups in the tail of schedule() is a latency punishment for the
incoming task. The point of invoking kprobe_flush_task() for a dead task
is that the instances are returned and cannot leak when __schedule() is
kprobed.

Move it into the delayed cleanup.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
---
 kernel/exit.c       |    2 ++
 kernel/kprobes.c    |    8 ++++----
 kernel/sched/core.c |    6 ------
 3 files changed, 6 insertions(+), 10 deletions(-)

--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -64,6 +64,7 @@
 #include <linux/rcuwait.h>
 #include <linux/compat.h>
 #include <linux/io_uring.h>
+#include <linux/kprobes.h>
 
 #include <linux/uaccess.h>
 #include <asm/unistd.h>
@@ -168,6 +169,7 @@ static void delayed_put_task_struct(stru
 {
 	struct task_struct *tsk = container_of(rhp, struct task_struct, rcu);
 
+	kprobe_flush_task(tsk);
 	perf_event_delayed_put(tsk);
 	trace_sched_process_free(tsk);
 	put_task_struct(tsk);
--- a/kernel/kprobes.c
+++ b/kernel/kprobes.c
@@ -1250,10 +1250,10 @@ void kprobe_busy_end(void)
 }
 
 /*
- * This function is called from finish_task_switch when task tk becomes dead,
- * so that we can recycle any function-return probe instances associated
- * with this task. These left over instances represent probed functions
- * that have been called but will never return.
+ * This function is called from delayed_put_task_struct() when a task is
+ * dead and cleaned up to recycle any function-return probe instances
+ * associated with this task. These left over instances represent probed
+ * functions that have been called but will never return.
  */
 void kprobe_flush_task(struct task_struct *tk)
 {
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4846,12 +4846,6 @@ static struct rq *finish_task_switch(str
 		if (prev->sched_class->task_dead)
 			prev->sched_class->task_dead(prev);
 
-		/*
-		 * Remove function-return probe instances associated with this
-		 * task and put them back on the free list.
-		 */
-		kprobe_flush_task(prev);
-
 		/* Task is done with its stack. */
 		put_task_stack(prev);
 



* [patch 4/5] sched: Delay task stack freeing on RT
  2021-09-28 12:24 [patch 0/5] sched: Miscellaneous RT related tweaks Thomas Gleixner
                   ` (2 preceding siblings ...)
  2021-09-28 12:24 ` [patch 3/5] sched: Move kprobes cleanup out of finish_task_switch() Thomas Gleixner
@ 2021-09-28 12:24 ` Thomas Gleixner
  2021-09-29 11:54   ` Peter Zijlstra
  2021-09-28 12:24 ` [patch 5/5] sched: Move mmdrop to RCU " Thomas Gleixner
  4 siblings, 1 reply; 22+ messages in thread
From: Thomas Gleixner @ 2021-09-28 12:24 UTC (permalink / raw)
  To: LKML
  Cc: Peter Zijlstra, Ingo Molnar, Sebastian Andrzej Siewior, Masami Hiramatsu

From: Sebastian Andrzej Siewior <bigeasy@linutronix.de>

Anything which is done on behalf of a dead task at the end of
finish_task_switch() is preventing the incoming task from doing useful
work. While it is beneficial for fork-heavy workloads to recycle the task
stack quickly, this is a latency source for real-time tasks.

Therefore delay the stack cleanup on RT enabled kernels.

Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
 kernel/exit.c       |    5 +++++
 kernel/fork.c       |    5 ++++-
 kernel/sched/core.c |    8 ++++++--
 3 files changed, 15 insertions(+), 3 deletions(-)

--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -172,6 +172,11 @@ static void delayed_put_task_struct(stru
 	kprobe_flush_task(tsk);
 	perf_event_delayed_put(tsk);
 	trace_sched_process_free(tsk);
+
+	/* RT enabled kernels delay freeing the VMAP'ed task stack */
+	if (IS_ENABLED(CONFIG_PREEMPT_RT))
+		put_task_stack(tsk);
+
 	put_task_struct(tsk);
 }
 
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -289,7 +289,10 @@ static inline void free_thread_stack(str
 			return;
 		}
 
-		vfree_atomic(tsk->stack);
+		if (!IS_ENABLED(CONFIG_PREEMPT_RT))
+			vfree_atomic(tsk->stack);
+		else
+			vfree(tsk->stack);
 		return;
 	}
 #endif
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4846,8 +4846,12 @@ static struct rq *finish_task_switch(str
 		if (prev->sched_class->task_dead)
 			prev->sched_class->task_dead(prev);
 
-		/* Task is done with its stack. */
-		put_task_stack(prev);
+		/*
+		 * Release VMAP'ed task stack immediately for reuse. On RT
+		 * enabled kernels this is delayed for latency reasons.
+		 */
+		if (!IS_ENABLED(CONFIG_PREEMPT_RT))
+			put_task_stack(prev);
 
 		put_task_struct_rcu_user(prev);
 	}



* [patch 5/5] sched: Move mmdrop to RCU on RT
  2021-09-28 12:24 [patch 0/5] sched: Miscellaneous RT related tweaks Thomas Gleixner
                   ` (3 preceding siblings ...)
  2021-09-28 12:24 ` [patch 4/5] sched: Delay task stack freeing on RT Thomas Gleixner
@ 2021-09-28 12:24 ` Thomas Gleixner
  2021-09-29 12:02   ` Peter Zijlstra
                     ` (2 more replies)
  4 siblings, 3 replies; 22+ messages in thread
From: Thomas Gleixner @ 2021-09-28 12:24 UTC (permalink / raw)
  To: LKML
  Cc: Peter Zijlstra, Ingo Molnar, Masami Hiramatsu, Sebastian Andrzej Siewior

mmdrop() is invoked from finish_task_switch() by the incoming task to drop
the mm which was handed over by the previous task. mmdrop() can be quite
expensive which prevents an incoming real-time task from getting useful
work done.

Provide mmdrop_sched() which maps to mmdrop() on !RT kernels. On RT kernels
it delegates the eventually required invocation of __mmdrop() to RCU.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
 include/linux/mm_types.h |    4 ++++
 include/linux/sched/mm.h |   20 ++++++++++++++++++++
 kernel/fork.c            |   13 +++++++++++++
 kernel/sched/core.c      |    2 +-
 4 files changed, 38 insertions(+), 1 deletion(-)
---
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -12,6 +12,7 @@
 #include <linux/completion.h>
 #include <linux/cpumask.h>
 #include <linux/uprobes.h>
+#include <linux/rcupdate.h>
 #include <linux/page-flags-layout.h>
 #include <linux/workqueue.h>
 #include <linux/seqlock.h>
@@ -572,6 +573,9 @@ struct mm_struct {
 		bool tlb_flush_batched;
 #endif
 		struct uprobes_state uprobes_state;
+#ifdef CONFIG_PREEMPT_RT
+		struct rcu_head delayed_drop;
+#endif
 #ifdef CONFIG_HUGETLB_PAGE
 		atomic_long_t hugetlb_usage;
 #endif
--- a/include/linux/sched/mm.h
+++ b/include/linux/sched/mm.h
@@ -49,6 +49,26 @@ static inline void mmdrop(struct mm_stru
 		__mmdrop(mm);
 }
 
+#ifdef CONFIG_PREEMPT_RT
+extern void __mmdrop_delayed(struct rcu_head *rhp);
+
+/*
+ * Invoked from finish_task_switch(). Delegates the heavy lifting on RT
+ * kernels via RCU.
+ */
+static inline void mmdrop_sched(struct mm_struct *mm)
+{
+	/* Provides a full memory barrier. See mmdrop() */
+	if (atomic_dec_and_test(&mm->mm_count))
+		call_rcu(&mm->delayed_drop, __mmdrop_delayed);
+}
+#else
+static inline void mmdrop_sched(struct mm_struct *mm)
+{
+	mmdrop(mm);
+}
+#endif
+
 /**
  * mmget() - Pin the address space associated with a &struct mm_struct.
  * @mm: The address space to pin.
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -708,6 +708,19 @@ void __mmdrop(struct mm_struct *mm)
 }
 EXPORT_SYMBOL_GPL(__mmdrop);
 
+#ifdef CONFIG_PREEMPT_RT
+/*
+ * RCU callback for delayed mm drop. Not strictly RCU, but call_rcu() is
+ * by far the least expensive way to do that.
+ */
+void __mmdrop_delayed(struct rcu_head *rhp)
+{
+	struct mm_struct *mm = container_of(rhp, struct mm_struct, delayed_drop);
+
+	__mmdrop(mm);
+}
+#endif
+
 static void mmdrop_async_fn(struct work_struct *work)
 {
 	struct mm_struct *mm;
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4840,7 +4840,7 @@ static struct rq *finish_task_switch(str
 	 */
 	if (mm) {
 		membarrier_mm_sync_core_before_usermode(mm);
-		mmdrop(mm);
+		mmdrop_sched(mm);
 	}
 	if (unlikely(prev_state == TASK_DEAD)) {
 		if (prev->sched_class->task_dead)



* Re: [patch 4/5] sched: Delay task stack freeing on RT
  2021-09-28 12:24 ` [patch 4/5] sched: Delay task stack freeing on RT Thomas Gleixner
@ 2021-09-29 11:54   ` Peter Zijlstra
  2021-10-01 16:12     ` Andy Lutomirski
  0 siblings, 1 reply; 22+ messages in thread
From: Peter Zijlstra @ 2021-09-29 11:54 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: LKML, Ingo Molnar, Sebastian Andrzej Siewior, Masami Hiramatsu,
	Andy Lutomirski

On Tue, Sep 28, 2021 at 02:24:30PM +0200, Thomas Gleixner wrote:

> --- a/kernel/exit.c
> +++ b/kernel/exit.c
> @@ -172,6 +172,11 @@ static void delayed_put_task_struct(stru
>  	kprobe_flush_task(tsk);
>  	perf_event_delayed_put(tsk);
>  	trace_sched_process_free(tsk);
> +
> +	/* RT enabled kernels delay freeing the VMAP'ed task stack */
> +	if (IS_ENABLED(CONFIG_PREEMPT_RT))
> +		put_task_stack(tsk);
> +
>  	put_task_struct(tsk);
>  }

> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -4846,8 +4846,12 @@ static struct rq *finish_task_switch(str
>  		if (prev->sched_class->task_dead)
>  			prev->sched_class->task_dead(prev);
>  
> -		/* Task is done with its stack. */
> -		put_task_stack(prev);
> +		/*
> +		 * Release VMAP'ed task stack immediately for reuse. On RT
> +		 * enabled kernels this is delayed for latency reasons.
> +		 */
> +		if (!IS_ENABLED(CONFIG_PREEMPT_RT))
> +			put_task_stack(prev);
>  
>  		put_task_struct_rcu_user(prev);
>  	}


Having this logic split across two files seems unfortunate and prone to
'accidents'. Is there a real down-side to unconditionally doing it in
delayed_put_task_struct() ?

/me goes out for lunch... meanwhile tglx points at: 68f24b08ee89.

Bah.. Andy?


* Re: [patch 5/5] sched: Move mmdrop to RCU on RT
  2021-09-28 12:24 ` [patch 5/5] sched: Move mmdrop to RCU " Thomas Gleixner
@ 2021-09-29 12:02   ` Peter Zijlstra
  2021-09-29 13:05     ` Thomas Gleixner
  2021-10-01 15:05   ` [tip: sched/core] " tip-bot2 for Thomas Gleixner
  2021-10-05 14:11   ` tip-bot2 for Thomas Gleixner
  2 siblings, 1 reply; 22+ messages in thread
From: Peter Zijlstra @ 2021-09-29 12:02 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: LKML, Ingo Molnar, Masami Hiramatsu, Sebastian Andrzej Siewior

On Tue, Sep 28, 2021 at 02:24:32PM +0200, Thomas Gleixner wrote:

> --- a/include/linux/sched/mm.h
> +++ b/include/linux/sched/mm.h
> @@ -49,6 +49,26 @@ static inline void mmdrop(struct mm_stru
>  		__mmdrop(mm);
>  }
>  
> +#ifdef CONFIG_PREEMPT_RT
> +extern void __mmdrop_delayed(struct rcu_head *rhp);
> +
> +/*
> + * Invoked from finish_task_switch(). Delegates the heavy lifting on RT
> + * kernels via RCU.
> + */
> +static inline void mmdrop_sched(struct mm_struct *mm)
> +{
> +	/* Provides a full memory barrier. See mmdrop() */
> +	if (atomic_dec_and_test(&mm->mm_count))
> +		call_rcu(&mm->delayed_drop, __mmdrop_delayed);
> +}
> +#else
> +static inline void mmdrop_sched(struct mm_struct *mm)
> +{
> +	mmdrop(mm);
> +}
> +#endif
> +
>  /**
>   * mmget() - Pin the address space associated with a &struct mm_struct.
>   * @mm: The address space to pin.
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -708,6 +708,19 @@ void __mmdrop(struct mm_struct *mm)
>  }
>  EXPORT_SYMBOL_GPL(__mmdrop);
>  
> +#ifdef CONFIG_PREEMPT_RT
> +/*
> + * RCU callback for delayed mm drop. Not strictly RCU, but call_rcu() is
> + * by far the least expensive way to do that.
> + */
> +void __mmdrop_delayed(struct rcu_head *rhp)
> +{
> +	struct mm_struct *mm = container_of(rhp, struct mm_struct, delayed_drop);
> +
> +	__mmdrop(mm);
> +}
> +#endif

Would you mind terribly if I fold this into mm.h as a static inline ?

The only risk that carries is that if mmdrop_sched() is called from
multiple translation units (it is not) we get multiple instances of this
function, but possibly even !LTO linkers can fix that for us.



* Re: [patch 5/5] sched: Move mmdrop to RCU on RT
  2021-09-29 12:02   ` Peter Zijlstra
@ 2021-09-29 13:05     ` Thomas Gleixner
  0 siblings, 0 replies; 22+ messages in thread
From: Thomas Gleixner @ 2021-09-29 13:05 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: LKML, Ingo Molnar, Masami Hiramatsu, Sebastian Andrzej Siewior

On Wed, Sep 29 2021 at 14:02, Peter Zijlstra wrote:
> On Tue, Sep 28, 2021 at 02:24:32PM +0200, Thomas Gleixner wrote:
>> +#ifdef CONFIG_PREEMPT_RT
>> +/*
>> + * RCU callback for delayed mm drop. Not strictly RCU, but call_rcu() is
>> + * by far the least expensive way to do that.
>> + */
>> +void __mmdrop_delayed(struct rcu_head *rhp)
>> +{
>> +	struct mm_struct *mm = container_of(rhp, struct mm_struct, delayed_drop);
>> +
>> +	__mmdrop(mm);
>> +}
>> +#endif
>
> Would you mind terribly if I fold this into mm.h as a static inline ?
>
> The only risk that carries is that if mmdrop_sched() is called from
> multiple translation units (it is not) we get multiple instances of this
> function, but possibly even !LTO linkers can fix that for us.

No preference here.


* [tip: sched/core] sched: Disable TTWU_QUEUE on RT
  2021-09-28 12:24 ` [patch 2/5] sched: Disable TTWU_QUEUE " Thomas Gleixner
@ 2021-10-01 15:05   ` tip-bot2 for Thomas Gleixner
  2021-10-05 14:11   ` tip-bot2 for Thomas Gleixner
  1 sibling, 0 replies; 22+ messages in thread
From: tip-bot2 for Thomas Gleixner @ 2021-10-01 15:05 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: Thomas Gleixner, Peter Zijlstra (Intel), x86, linux-kernel

The following commit has been merged into the sched/core branch of tip:

Commit-ID:     e3865866752e1b9fd26383f548dea58334fe6eba
Gitweb:        https://git.kernel.org/tip/e3865866752e1b9fd26383f548dea58334fe6eba
Author:        Thomas Gleixner <tglx@linutronix.de>
AuthorDate:    Tue, 28 Sep 2021 14:24:27 +02:00
Committer:     Peter Zijlstra <peterz@infradead.org>
CommitterDate: Fri, 01 Oct 2021 13:58:07 +02:00

sched: Disable TTWU_QUEUE on RT

The queued remote wakeup mechanism has turned out to be suboptimal for RT
enabled kernels. The maximum latencies go up by a factor of > 5x in certain
scenarios.

This is caused by either long wake lists or by a large number of TTWU IPIs
which are processed back to back.

Disable it for RT.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20210928122411.482262764@linutronix.de
---
 kernel/sched/features.h | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/kernel/sched/features.h b/kernel/sched/features.h
index 7f8dace..1cf435b 100644
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -46,11 +46,16 @@ SCHED_FEAT(DOUBLE_TICK, false)
  */
 SCHED_FEAT(NONTASK_CAPACITY, true)
 
+#ifdef CONFIG_PREEMPT_RT
+SCHED_FEAT(TTWU_QUEUE, false)
+#else
+
 /*
  * Queue remote wakeups on the target CPU and process them
  * using the scheduler IPI. Reduces rq->lock contention/bounces.
  */
 SCHED_FEAT(TTWU_QUEUE, true)
+#endif
 
 /*
  * When doing wakeups, attempt to limit superfluous scans of the LLC domain.


* [tip: sched/core] sched: Move kprobes cleanup out of finish_task_switch()
  2021-09-28 12:24 ` [patch 3/5] sched: Move kprobes cleanup out of finish_task_switch() Thomas Gleixner
@ 2021-10-01 15:05   ` tip-bot2 for Thomas Gleixner
  2021-10-05 14:11   ` tip-bot2 for Thomas Gleixner
  1 sibling, 0 replies; 22+ messages in thread
From: tip-bot2 for Thomas Gleixner @ 2021-10-01 15:05 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: Thomas Gleixner, Peter Zijlstra (Intel), x86, linux-kernel

The following commit has been merged into the sched/core branch of tip:

Commit-ID:     d428aac9dff0a0d217f3449884ac958dbf3f232b
Gitweb:        https://git.kernel.org/tip/d428aac9dff0a0d217f3449884ac958dbf3f232b
Author:        Thomas Gleixner <tglx@linutronix.de>
AuthorDate:    Tue, 28 Sep 2021 14:24:28 +02:00
Committer:     Peter Zijlstra <peterz@infradead.org>
CommitterDate: Fri, 01 Oct 2021 13:58:07 +02:00

sched: Move kprobes cleanup out of finish_task_switch()

Doing cleanups in the tail of schedule() is a latency punishment for the
incoming task. The point of invoking kprobe_flush_task() for a dead task
is that the instances are returned and cannot leak when __schedule() is
kprobed.

Move it into the delayed cleanup.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20210928122411.537994026@linutronix.de
---
 kernel/exit.c       | 2 ++
 kernel/kprobes.c    | 8 ++++----
 kernel/sched/core.c | 6 ------
 3 files changed, 6 insertions(+), 10 deletions(-)

diff --git a/kernel/exit.c b/kernel/exit.c
index fd1c041..df281b4 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -64,6 +64,7 @@
 #include <linux/rcuwait.h>
 #include <linux/compat.h>
 #include <linux/io_uring.h>
+#include <linux/kprobes.h>
 
 #include <linux/uaccess.h>
 #include <asm/unistd.h>
@@ -169,6 +170,7 @@ static void delayed_put_task_struct(struct rcu_head *rhp)
 {
 	struct task_struct *tsk = container_of(rhp, struct task_struct, rcu);
 
+	kprobe_flush_task(tsk);
 	perf_event_delayed_put(tsk);
 	trace_sched_process_free(tsk);
 	put_task_struct(tsk);
diff --git a/kernel/kprobes.c b/kernel/kprobes.c
index 745f08f..89194e5 100644
--- a/kernel/kprobes.c
+++ b/kernel/kprobes.c
@@ -1256,10 +1256,10 @@ void kprobe_busy_end(void)
 }
 
 /*
- * This function is called from finish_task_switch when task tk becomes dead,
- * so that we can recycle any function-return probe instances associated
- * with this task. These left over instances represent probed functions
- * that have been called but will never return.
+ * This function is called from delayed_put_task_struct() when a task is
+ * dead and cleaned up to recycle any function-return probe instances
+ * associated with this task. These left over instances represent probed
+ * functions that have been called but will never return.
  */
 void kprobe_flush_task(struct task_struct *tk)
 {
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 8d844d0..8e49b17 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4783,12 +4783,6 @@ static struct rq *finish_task_switch(struct task_struct *prev)
 		if (prev->sched_class->task_dead)
 			prev->sched_class->task_dead(prev);
 
-		/*
-		 * Remove function-return probe instances associated with this
-		 * task and put them back on the free list.
-		 */
-		kprobe_flush_task(prev);
-
 		/* Task is done with its stack. */
 		put_task_stack(prev);
 


* [tip: sched/core] sched: Limit the number of task migrations per batch on RT
  2021-09-28 12:24 ` [patch 1/5] sched: Limit the number of task migrations per batch on RT Thomas Gleixner
@ 2021-10-01 15:05   ` tip-bot2 for Thomas Gleixner
  2021-10-05 14:11   ` tip-bot2 for Thomas Gleixner
  1 sibling, 0 replies; 22+ messages in thread
From: tip-bot2 for Thomas Gleixner @ 2021-10-01 15:05 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: Thomas Gleixner, Peter Zijlstra (Intel), x86, linux-kernel

The following commit has been merged into the sched/core branch of tip:

Commit-ID:     92add3a897e9e923acde0f2c5e69705818076d69
Gitweb:        https://git.kernel.org/tip/92add3a897e9e923acde0f2c5e69705818076d69
Author:        Thomas Gleixner <tglx@linutronix.de>
AuthorDate:    Tue, 28 Sep 2021 14:24:25 +02:00
Committer:     Peter Zijlstra <peterz@infradead.org>
CommitterDate: Fri, 01 Oct 2021 13:58:07 +02:00

sched: Limit the number of task migrations per batch on RT

Batched task migrations are a source for large latencies as they keep the
scheduler from running while processing the migrations.

Limit the batch size to 8 instead of 32 when running on an RT enabled
kernel.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20210928122411.425097596@linutronix.de
---
 kernel/sched/core.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index bb70a07..8d844d0 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -74,7 +74,11 @@ __read_mostly int sysctl_resched_latency_warn_once = 1;
  * Number of tasks to iterate in a single balance run.
  * Limited because this is done with IRQs disabled.
  */
+#ifdef CONFIG_PREEMPT_RT
+const_debug unsigned int sysctl_sched_nr_migrate = 8;
+#else
 const_debug unsigned int sysctl_sched_nr_migrate = 32;
+#endif
 
 /*
  * period over which we measure -rt task CPU usage in us.


* [tip: sched/core] sched: Move mmdrop to RCU on RT
  2021-09-28 12:24 ` [patch 5/5] sched: Move mmdrop to RCU " Thomas Gleixner
  2021-09-29 12:02   ` Peter Zijlstra
@ 2021-10-01 15:05   ` tip-bot2 for Thomas Gleixner
  2021-10-05 14:11   ` tip-bot2 for Thomas Gleixner
  2 siblings, 0 replies; 22+ messages in thread
From: tip-bot2 for Thomas Gleixner @ 2021-10-01 15:05 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: Thomas Gleixner, Peter Zijlstra (Intel), x86, linux-kernel

The following commit has been merged into the sched/core branch of tip:

Commit-ID:     df89544263cd98ffcef1318b3bf18509b9420c8a
Gitweb:        https://git.kernel.org/tip/df89544263cd98ffcef1318b3bf18509b9420c8a
Author:        Thomas Gleixner <tglx@linutronix.de>
AuthorDate:    Tue, 28 Sep 2021 14:24:32 +02:00
Committer:     Peter Zijlstra <peterz@infradead.org>
CommitterDate: Fri, 01 Oct 2021 13:58:06 +02:00

sched: Move mmdrop to RCU on RT

mmdrop() is invoked from finish_task_switch() by the incoming task to drop
the mm which was handed over by the previous task. mmdrop() can be quite
expensive which prevents an incoming real-time task from getting useful
work done.

Provide mmdrop_sched() which maps to mmdrop() on !RT kernels. On RT kernels
it delegates the eventually required invocation of __mmdrop() to RCU.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20210928122411.648582026@linutronix.de
---
 include/linux/mm_types.h |  4 ++++
 include/linux/sched/mm.h | 29 +++++++++++++++++++++++++++++
 kernel/sched/core.c      |  2 +-
 3 files changed, 34 insertions(+), 1 deletion(-)

diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 8f0fb62..09a2885 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -12,6 +12,7 @@
 #include <linux/completion.h>
 #include <linux/cpumask.h>
 #include <linux/uprobes.h>
+#include <linux/rcupdate.h>
 #include <linux/page-flags-layout.h>
 #include <linux/workqueue.h>
 #include <linux/seqlock.h>
@@ -567,6 +568,9 @@ struct mm_struct {
 		bool tlb_flush_batched;
 #endif
 		struct uprobes_state uprobes_state;
+#ifdef CONFIG_PREEMPT_RT
+		struct rcu_head delayed_drop;
+#endif
 #ifdef CONFIG_HUGETLB_PAGE
 		atomic_long_t hugetlb_usage;
 #endif
diff --git a/include/linux/sched/mm.h b/include/linux/sched/mm.h
index e24b1fe..0d81060 100644
--- a/include/linux/sched/mm.h
+++ b/include/linux/sched/mm.h
@@ -49,6 +49,35 @@ static inline void mmdrop(struct mm_struct *mm)
 		__mmdrop(mm);
 }
 
+#ifdef CONFIG_PREEMPT_RT
+/*
+ * RCU callback for delayed mm drop. Not strictly RCU, but call_rcu() is
+ * by far the least expensive way to do that.
+ */
+static inline void __mmdrop_delayed(struct rcu_head *rhp)
+{
+	struct mm_struct *mm = container_of(rhp, struct mm_struct, delayed_drop);
+
+	__mmdrop(mm);
+}
+
+/*
+ * Invoked from finish_task_switch(). Delegates the heavy lifting on RT
+ * kernels via RCU.
+ */
+static inline void mmdrop_sched(struct mm_struct *mm)
+{
+	/* Provides a full memory barrier. See mmdrop() */
+	if (atomic_dec_and_test(&mm->mm_count))
+		call_rcu(&mm->delayed_drop, __mmdrop_delayed);
+}
+#else
+static inline void mmdrop_sched(struct mm_struct *mm)
+{
+	mmdrop(mm);
+}
+#endif
+
 /**
  * mmget() - Pin the address space associated with a &struct mm_struct.
  * @mm: The address space to pin.
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index b36b5d7..bb70a07 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4773,7 +4773,7 @@ static struct rq *finish_task_switch(struct task_struct *prev)
 	 */
 	if (mm) {
 		membarrier_mm_sync_core_before_usermode(mm);
-		mmdrop(mm);
+		mmdrop_sched(mm);
 	}
 	if (unlikely(prev_state == TASK_DEAD)) {
 		if (prev->sched_class->task_dead)


* Re: [patch 4/5] sched: Delay task stack freeing on RT
  2021-09-29 11:54   ` Peter Zijlstra
@ 2021-10-01 16:12     ` Andy Lutomirski
  2021-10-01 17:24       ` Thomas Gleixner
  0 siblings, 1 reply; 22+ messages in thread
From: Andy Lutomirski @ 2021-10-01 16:12 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Thomas Gleixner, LKML, Ingo Molnar, Sebastian Andrzej Siewior,
	Masami Hiramatsu

On Wed, Sep 29, 2021 at 4:54 AM Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Tue, Sep 28, 2021 at 02:24:30PM +0200, Thomas Gleixner wrote:
>
> > --- a/kernel/exit.c
> > +++ b/kernel/exit.c
> > @@ -172,6 +172,11 @@ static void delayed_put_task_struct(stru
> >       kprobe_flush_task(tsk);
> >       perf_event_delayed_put(tsk);
> >       trace_sched_process_free(tsk);
> > +
> > +     /* RT enabled kernels delay freeing the VMAP'ed task stack */
> > +     if (IS_ENABLED(CONFIG_PREEMPT_RT))
> > +             put_task_stack(tsk);
> > +
> >       put_task_struct(tsk);
> >  }
>
> > --- a/kernel/sched/core.c
> > +++ b/kernel/sched/core.c
> > @@ -4846,8 +4846,12 @@ static struct rq *finish_task_switch(str
> >               if (prev->sched_class->task_dead)
> >                       prev->sched_class->task_dead(prev);
> >
> > -             /* Task is done with its stack. */
> > -             put_task_stack(prev);
> > +             /*
> > +              * Release VMAP'ed task stack immediately for reuse. On RT
> > +              * enabled kernels this is delayed for latency reasons.
> > +              */
> > +             if (!IS_ENABLED(CONFIG_PREEMPT_RT))
> > +                     put_task_stack(prev);
> >
> >               put_task_struct_rcu_user(prev);
> >       }
>
>
> Having this logic split across two files seems unfortunate and prone to
> 'accidents'. Is there a real down-side to unconditionally doing it in
> delayed_put_task_struct() ?
>
> /me goes out for lunch... meanwhile tglx points at: 68f24b08ee89.
>
> Bah.. Andy?

Could we make whatever we do here unconditional?  And what actually
causes the latency?  If it's vfree, shouldn't the existing use of
vfree_atomic() in free_thread_stack() handle it?  Or is it the
accounting?
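
(For context on the vfree_atomic() point: plain vfree() may sleep, so
vfree_atomic() defers the real free to workqueue context. A simplified
sketch of that deferral pattern follows; it is not the actual mm/vmalloc.c
implementation and the names are made up for illustration.)

#include <linux/llist.h>
#include <linux/vmalloc.h>
#include <linux/workqueue.h>

static LLIST_HEAD(deferred_vfree_list);

static void deferred_vfree_work(struct work_struct *work)
{
	/* Workqueue context: vfree() is allowed to sleep here. */
	struct llist_node *node = llist_del_all(&deferred_vfree_list);

	while (node) {
		struct llist_node *next = node->next;

		vfree(node);
		node = next;
	}
}

static DECLARE_WORK(deferred_vfree_worker, deferred_vfree_work);

/*
 * Callable from atomic context: the vmalloc'ed area itself is reused as
 * the llist node, so queueing it needs no allocation and takes no locks.
 */
static void example_vfree_atomic(void *addr)
{
	if (llist_add((struct llist_node *)addr, &deferred_vfree_list))
		schedule_work(&deferred_vfree_worker);
}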


-- 
Andy Lutomirski
AMA Capital Management, LLC


* Re: [patch 4/5] sched: Delay task stack freeing on RT
  2021-10-01 16:12     ` Andy Lutomirski
@ 2021-10-01 17:24       ` Thomas Gleixner
  2021-10-01 18:48         ` Andy Lutomirski
  0 siblings, 1 reply; 22+ messages in thread
From: Thomas Gleixner @ 2021-10-01 17:24 UTC (permalink / raw)
  To: Andy Lutomirski, Peter Zijlstra
  Cc: LKML, Ingo Molnar, Sebastian Andrzej Siewior, Masami Hiramatsu

On Fri, Oct 01 2021 at 09:12, Andy Lutomirski wrote:
> On Wed, Sep 29, 2021 at 4:54 AM Peter Zijlstra <peterz@infradead.org> wrote:
>> Having this logic split across two files seems unfortunate and prone to
>> 'accidents'. Is there a real down-side to unconditionally doing it in
>> delayed_put_task_struct() ?
>>
>> /me goes out for lunch... meanwhile tglx points at: 68f24b08ee89.
>>
>> Bah.. Andy?
>
> Could we make whatever we do here unconditional?

Sure. I just was unsure about your reasoning in 68f24b08ee89.

> And what actually causes the latency?  If it's vfree, shouldn't the
> existing use of vfree_atomic() in free_thread_stack() handle it?  Or
> is it the accounting?

The accounting muck because it can go into the allocator and sleep in
the worst case, which is nasty even on !RT kernels.

But thinking some more, there is actually a way nastier issue on RT in
the following case:

CPU 0                           CPU 1
  T1                            
  spin_lock(L1)
  rt_mutex_lock()
      schedule()

  T2
     do_exit()
     do_task_dead()             spin_unlock(L1)
                                   wake(T1)
     __schedule()                           
       switch_to(T1)
       finish_task_switch()
         put_task_stack()
           account()
             ....
             spin_lock(L2)

So if L1 == L2 or L1 and L2 have a reverse dependency then this can just
deadlock.

We've never observed that, but the above case is obviously hard to
hit. Nevertheless it's there.

Thanks,

        tglx


* Re: [patch 4/5] sched: Delay task stack freeing on RT
  2021-10-01 17:24       ` Thomas Gleixner
@ 2021-10-01 18:48         ` Andy Lutomirski
  2021-10-01 19:02           ` Andy Lutomirski
  0 siblings, 1 reply; 22+ messages in thread
From: Andy Lutomirski @ 2021-10-01 18:48 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Peter Zijlstra, LKML, Ingo Molnar, Sebastian Andrzej Siewior,
	Masami Hiramatsu

On Fri, Oct 1, 2021 at 10:24 AM Thomas Gleixner <tglx@linutronix.de> wrote:
>
> On Fri, Oct 01 2021 at 09:12, Andy Lutomirski wrote:
> > On Wed, Sep 29, 2021 at 4:54 AM Peter Zijlstra <peterz@infradead.org> wrote:
> >> Having this logic split across two files seems unfortunate and prone to
> >> 'accidents'. Is there a real down-side to unconditionally doing it in
> >> delayed_put_task_struct() ?
> >>
> >> /me goes out for lunch... meanwhile tglx points at: 68f24b08ee89.
> >>
> >> Bah.. Andy?
> >
> > Could we make whatever we do here unconditional?
>
> Sure. I just was unsure about your reasoning in 68f24b08ee89.

Mmm, right.  The reasoning is that there are a lot of workloads that
frequently wait for a task to exit and immediately start a new task --
most shell scripts, for example.  I think I tested this with the
following amazing workload:

while true; do true; done

and we want to reuse the same stack each time from the cached stack
lookaside list instead of vfreeing and vmallocing a stack each time.
Deferring the release to the lookaside list breaks it.  Although I
suppose the fact that it works well right now is a bit fragile --
we're waking the parent (sh, etc) before releasing the stack, but
nothing gets to run until the stack is released.
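
(For readers without 68f24b08ee89 at hand: the lookaside cache referred to
here is a small per-CPU array of parked stacks in kernel/fork.c. Below is a
trimmed-down sketch with invented helper names, not the exact code.)

#include <linux/percpu.h>
#include <linux/vmalloc.h>

#define NR_CACHED_STACKS 2
static DEFINE_PER_CPU(struct vm_struct *, cached_stacks[NR_CACHED_STACKS]);

/* On fork: try to grab a previously parked stack instead of vmalloc(). */
static struct vm_struct *example_try_get_cached_stack(void)
{
	int i;

	for (i = 0; i < NR_CACHED_STACKS; i++) {
		struct vm_struct *s = this_cpu_xchg(cached_stacks[i], NULL);

		if (s)
			return s;
	}
	return NULL;	/* fall back to allocating a fresh stack */
}

/*
 * On release: park the stack for reuse; only when all slots are occupied
 * does it go to the (deferred) vfree path.
 */
static bool example_try_cache_stack(struct vm_struct *vm)
{
	int i;

	for (i = 0; i < NR_CACHED_STACKS; i++) {
		if (this_cpu_cmpxchg(cached_stacks[i], NULL, vm) == NULL)
			return true;
	}
	return false;
}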

>
> > And what actually causes the latency?  If it's vfree, shouldn't the
> > existing use of vfree_atomic() in free_thread_stack() handle it?  Or
> > is it the accounting?
>
> The accounting muck because it can go into the allocator and sleep in
> the worst case, which is nasty even on !RT kernels.

Wait, unaccounting memory can go into the allocator?  That seems quite nasty.

>
> But thinking some more, there is actually a way nastier issue on RT in
> the following case:
>
> CPU 0                           CPU 1
>   T1
>   spin_lock(L1)
>   rt_mutex_lock()
>       schedule()
>
>   T2
>      do_exit()
>      do_task_dead()             spin_unlock(L1)
>                                    wake(T1)
>      __schedule()
>        switch_to(T1)
>        finish_task_switch()
>          put_task_stack()
>            account()
>              ....
>              spin_lock(L2)
>
> So if L1 == L2 or L1 and L2 have a reverse dependency then this can just
> deadlock.
>
> We've never observed that, but the above case is obviously hard to
> hit. Nevertheless it's there.

Hmm.

ISTM it would be conceptually cleaner for do_exit() to handle its own freeing
in its own preemptible context.  Obviously that can't really work,
since we can't free a task_struct or a task stack while we're running
on it.  But I wonder if we could approximate it by putting this work
in a workqueue so that it all runs in a normal schedulable context.
To make the shell script case work nicely, we want to release the task
stack before notifying anyone waiting for the dying task to exit, but
maybe that's doable.  It could involve some nasty exit_signal hackery,
though.


* Re: [patch 4/5] sched: Delay task stack freeing on RT
  2021-10-01 18:48         ` Andy Lutomirski
@ 2021-10-01 19:02           ` Andy Lutomirski
  2021-10-01 20:54             ` Thomas Gleixner
  0 siblings, 1 reply; 22+ messages in thread
From: Andy Lutomirski @ 2021-10-01 19:02 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Thomas Gleixner, Peter Zijlstra, LKML, Ingo Molnar,
	Sebastian Andrzej Siewior, Masami Hiramatsu

On Fri, Oct 1, 2021 at 11:48 AM Andy Lutomirski <luto@kernel.org> wrote:
>
> On Fri, Oct 1, 2021 at 10:24 AM Thomas Gleixner <tglx@linutronix.de> wrote:
> >

> ISTM it would be conceptually cleaner for do_exit() to handle its own freeing
> in its own preemptible context.  Obviously that can't really work,
> since we can't free a task_struct or a task stack while we're running
> on it.  But I wonder if we could approximate it by putting this work
> in a workqueue so that it all runs in a normal schedulable context.
> To make the shell script case work nicely, we want to release the task
> stack before notifying anyone waiting for the dying task to exit, but
> maybe that's doable.  It could involve some nasty exit_signal hackery,
> though.

I'm making this way more complicated than it needs to be.  How about
we unaccount the task stack in do_exit and release it for real in
finish_task_switch()?  Other than accounting, free_thread_stack
doesn't take any locks.

--Andy


* Re: [patch 4/5] sched: Delay task stack freeing on RT
  2021-10-01 19:02           ` Andy Lutomirski
@ 2021-10-01 20:54             ` Thomas Gleixner
  0 siblings, 0 replies; 22+ messages in thread
From: Thomas Gleixner @ 2021-10-01 20:54 UTC (permalink / raw)
  To: Andy Lutomirski, Andy Lutomirski
  Cc: Peter Zijlstra, LKML, Ingo Molnar, Sebastian Andrzej Siewior,
	Masami Hiramatsu

On Fri, Oct 01 2021 at 12:02, Andy Lutomirski wrote:
> On Fri, Oct 1, 2021 at 11:48 AM Andy Lutomirski <luto@kernel.org> wrote:
>>
>> On Fri, Oct 1, 2021 at 10:24 AM Thomas Gleixner <tglx@linutronix.de> wrote:
>> >
>
>> ISTM it would be conceptually for do_exit() to handle its own freeing
>> in its own preemptible context.  Obviously that can't really work,
>> since we can't free a task_struct or a task stack while we're running
>> on it.  But I wonder if we could approximate it by putting this work
>> in a workqueue so that it all runs in a normal schedulable context.
>> To make the shell script case work nicely, we want to release the task
>> stack before notifying anyone waiting for the dying task to exit, but
>> maybe that's doable.  It could involve some nasty exit_signal hackery,
>> though.
>
> I'm making this way more complicated than it needs to be.  How about
> we unaccount the task stack in do_exit and release it for real in
> finish_task_switch()?  Other than accounting, free_thread_stack
> doesn't take any locks.

Right.
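
(To make the agreed direction concrete, a rough sketch of that split as it
might look inside kernel/fork.c. The helper names are hypothetical;
account_kernel_stack() and free_thread_stack() are existing fork.c
internals, and the bodies are schematic rather than a finished patch.)

/*
 * Invoked from do_exit(), still fully preemptible: the accounting is the
 * part that may take regular locks or end up in the allocator, so undo
 * it here.
 */
static void example_exit_task_stack_account(struct task_struct *tsk)
{
	account_kernel_stack(tsk, -1);
}

/*
 * Invoked via finish_task_switch() with preemption disabled: only the
 * lock-free release remains, i.e. parking the stack in the per-CPU cache
 * or queueing it for a deferred vfree().
 */
static void example_release_task_stack_final(struct task_struct *tsk)
{
	free_thread_stack(tsk);		/* assumes accounting already dropped */
}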



* [tip: sched/core] sched: Move kprobes cleanup out of finish_task_switch()
  2021-09-28 12:24 ` [patch 3/5] sched: Move kprobes cleanup out of finish_task_switch() Thomas Gleixner
  2021-10-01 15:05   ` [tip: sched/core] " tip-bot2 for Thomas Gleixner
@ 2021-10-05 14:11   ` tip-bot2 for Thomas Gleixner
  1 sibling, 0 replies; 22+ messages in thread
From: tip-bot2 for Thomas Gleixner @ 2021-10-05 14:11 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: Thomas Gleixner, Peter Zijlstra (Intel), x86, linux-kernel

The following commit has been merged into the sched/core branch of tip:

Commit-ID:     670721c7bd2a6e16e40db29b2707a27bdecd6928
Gitweb:        https://git.kernel.org/tip/670721c7bd2a6e16e40db29b2707a27bdecd6928
Author:        Thomas Gleixner <tglx@linutronix.de>
AuthorDate:    Tue, 28 Sep 2021 14:24:28 +02:00
Committer:     Peter Zijlstra <peterz@infradead.org>
CommitterDate: Tue, 05 Oct 2021 15:52:14 +02:00

sched: Move kprobes cleanup out of finish_task_switch()

Doing cleanups in the tail of schedule() is a latency punishment for the
incoming task. The point of invoking kprobe_flush_task() for a dead task
is that the instances are returned and cannot leak when __schedule() is
kprobed.

Move it into the delayed cleanup.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20210928122411.537994026@linutronix.de
---
 kernel/exit.c       | 2 ++
 kernel/kprobes.c    | 8 ++++----
 kernel/sched/core.c | 6 ------
 3 files changed, 6 insertions(+), 10 deletions(-)

diff --git a/kernel/exit.c b/kernel/exit.c
index 91a43e5..6385132 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -64,6 +64,7 @@
 #include <linux/rcuwait.h>
 #include <linux/compat.h>
 #include <linux/io_uring.h>
+#include <linux/kprobes.h>
 
 #include <linux/uaccess.h>
 #include <asm/unistd.h>
@@ -168,6 +169,7 @@ static void delayed_put_task_struct(struct rcu_head *rhp)
 {
 	struct task_struct *tsk = container_of(rhp, struct task_struct, rcu);
 
+	kprobe_flush_task(tsk);
 	perf_event_delayed_put(tsk);
 	trace_sched_process_free(tsk);
 	put_task_struct(tsk);
diff --git a/kernel/kprobes.c b/kernel/kprobes.c
index 790a573..9a38e75 100644
--- a/kernel/kprobes.c
+++ b/kernel/kprobes.c
@@ -1250,10 +1250,10 @@ void kprobe_busy_end(void)
 }
 
 /*
- * This function is called from finish_task_switch when task tk becomes dead,
- * so that we can recycle any function-return probe instances associated
- * with this task. These left over instances represent probed functions
- * that have been called but will never return.
+ * This function is called from delayed_put_task_struct() when a task is
+ * dead and cleaned up to recycle any function-return probe instances
+ * associated with this task. These left over instances represent probed
+ * functions that have been called but will never return.
  */
 void kprobe_flush_task(struct task_struct *tk)
 {
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 749284f..e33b03c 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4846,12 +4846,6 @@ static struct rq *finish_task_switch(struct task_struct *prev)
 		if (prev->sched_class->task_dead)
 			prev->sched_class->task_dead(prev);
 
-		/*
-		 * Remove function-return probe instances associated with this
-		 * task and put them back on the free list.
-		 */
-		kprobe_flush_task(prev);
-
 		/* Task is done with its stack. */
 		put_task_stack(prev);
 


* [tip: sched/core] sched: Disable TTWU_QUEUE on RT
  2021-09-28 12:24 ` [patch 2/5] sched: Disable TTWU_QUEUE " Thomas Gleixner
  2021-10-01 15:05   ` [tip: sched/core] " tip-bot2 for Thomas Gleixner
@ 2021-10-05 14:11   ` tip-bot2 for Thomas Gleixner
  1 sibling, 0 replies; 22+ messages in thread
From: tip-bot2 for Thomas Gleixner @ 2021-10-05 14:11 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: Thomas Gleixner, Peter Zijlstra (Intel), x86, linux-kernel

The following commit has been merged into the sched/core branch of tip:

Commit-ID:     539fbb5be0da56ffa1434b4f56521a0522bd1d61
Gitweb:        https://git.kernel.org/tip/539fbb5be0da56ffa1434b4f56521a0522bd1d61
Author:        Thomas Gleixner <tglx@linutronix.de>
AuthorDate:    Tue, 28 Sep 2021 14:24:27 +02:00
Committer:     Peter Zijlstra <peterz@infradead.org>
CommitterDate: Tue, 05 Oct 2021 15:52:12 +02:00

sched: Disable TTWU_QUEUE on RT

The queued remote wakeup mechanism has turned out to be suboptimal for RT
enabled kernels. The maximum latencies go up by a factor of > 5x in certain
scenarios.

This is caused by either long wake lists or by a large number of TTWU IPIs
which are processed back to back.

Disable it for RT.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20210928122411.482262764@linutronix.de
---
 kernel/sched/features.h | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/kernel/sched/features.h b/kernel/sched/features.h
index 7f8dace..1cf435b 100644
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -46,11 +46,16 @@ SCHED_FEAT(DOUBLE_TICK, false)
  */
 SCHED_FEAT(NONTASK_CAPACITY, true)
 
+#ifdef CONFIG_PREEMPT_RT
+SCHED_FEAT(TTWU_QUEUE, false)
+#else
+
 /*
  * Queue remote wakeups on the target CPU and process them
  * using the scheduler IPI. Reduces rq->lock contention/bounces.
  */
 SCHED_FEAT(TTWU_QUEUE, true)
+#endif
 
 /*
  * When doing wakeups, attempt to limit superfluous scans of the LLC domain.


* [tip: sched/core] sched: Move mmdrop to RCU on RT
  2021-09-28 12:24 ` [patch 5/5] sched: Move mmdrop to RCU " Thomas Gleixner
  2021-09-29 12:02   ` Peter Zijlstra
  2021-10-01 15:05   ` [tip: sched/core] " tip-bot2 for Thomas Gleixner
@ 2021-10-05 14:11   ` tip-bot2 for Thomas Gleixner
  2 siblings, 0 replies; 22+ messages in thread
From: tip-bot2 for Thomas Gleixner @ 2021-10-05 14:11 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: Thomas Gleixner, Peter Zijlstra (Intel), x86, linux-kernel

The following commit has been merged into the sched/core branch of tip:

Commit-ID:     8d491de6edc27138806cae6e8eca455beb325b62
Gitweb:        https://git.kernel.org/tip/8d491de6edc27138806cae6e8eca455beb325b62
Author:        Thomas Gleixner <tglx@linutronix.de>
AuthorDate:    Tue, 28 Sep 2021 14:24:32 +02:00
Committer:     Peter Zijlstra <peterz@infradead.org>
CommitterDate: Tue, 05 Oct 2021 15:52:09 +02:00

sched: Move mmdrop to RCU on RT

mmdrop() is invoked from finish_task_switch() by the incoming task to drop
the mm which was handed over by the previous task. mmdrop() can be quite
expensive which prevents an incoming real-time task from getting useful
work done.

Provide mmdrop_sched() which maps to mmdrop() on !RT kernels. On RT kernels
it delegates the eventually required invocation of __mmdrop() to RCU.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20210928122411.648582026@linutronix.de
---
 include/linux/mm_types.h |  4 ++++
 include/linux/sched/mm.h | 29 +++++++++++++++++++++++++++++
 kernel/sched/core.c      |  2 +-
 3 files changed, 34 insertions(+), 1 deletion(-)

diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 7f8ee09..e9672de 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -12,6 +12,7 @@
 #include <linux/completion.h>
 #include <linux/cpumask.h>
 #include <linux/uprobes.h>
+#include <linux/rcupdate.h>
 #include <linux/page-flags-layout.h>
 #include <linux/workqueue.h>
 #include <linux/seqlock.h>
@@ -572,6 +573,9 @@ struct mm_struct {
 		bool tlb_flush_batched;
 #endif
 		struct uprobes_state uprobes_state;
+#ifdef CONFIG_PREEMPT_RT
+		struct rcu_head delayed_drop;
+#endif
 #ifdef CONFIG_HUGETLB_PAGE
 		atomic_long_t hugetlb_usage;
 #endif
diff --git a/include/linux/sched/mm.h b/include/linux/sched/mm.h
index 5561486..aca874d 100644
--- a/include/linux/sched/mm.h
+++ b/include/linux/sched/mm.h
@@ -49,6 +49,35 @@ static inline void mmdrop(struct mm_struct *mm)
 		__mmdrop(mm);
 }
 
+#ifdef CONFIG_PREEMPT_RT
+/*
+ * RCU callback for delayed mm drop. Not strictly RCU, but call_rcu() is
+ * by far the least expensive way to do that.
+ */
+static inline void __mmdrop_delayed(struct rcu_head *rhp)
+{
+	struct mm_struct *mm = container_of(rhp, struct mm_struct, delayed_drop);
+
+	__mmdrop(mm);
+}
+
+/*
+ * Invoked from finish_task_switch(). Delegates the heavy lifting on RT
+ * kernels via RCU.
+ */
+static inline void mmdrop_sched(struct mm_struct *mm)
+{
+	/* Provides a full memory barrier. See mmdrop() */
+	if (atomic_dec_and_test(&mm->mm_count))
+		call_rcu(&mm->delayed_drop, __mmdrop_delayed);
+}
+#else
+static inline void mmdrop_sched(struct mm_struct *mm)
+{
+	mmdrop(mm);
+}
+#endif
+
 /**
  * mmget() - Pin the address space associated with a &struct mm_struct.
  * @mm: The address space to pin.
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 95f4e16..9eaeba6 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4836,7 +4836,7 @@ static struct rq *finish_task_switch(struct task_struct *prev)
 	 */
 	if (mm) {
 		membarrier_mm_sync_core_before_usermode(mm);
-		mmdrop(mm);
+		mmdrop_sched(mm);
 	}
 	if (unlikely(prev_state == TASK_DEAD)) {
 		if (prev->sched_class->task_dead)


* [tip: sched/core] sched: Limit the number of task migrations per batch on RT
  2021-09-28 12:24 ` [patch 1/5] sched: Limit the number of task migrations per batch on RT Thomas Gleixner
  2021-10-01 15:05   ` [tip: sched/core] " tip-bot2 for Thomas Gleixner
@ 2021-10-05 14:11   ` tip-bot2 for Thomas Gleixner
  1 sibling, 0 replies; 22+ messages in thread
From: tip-bot2 for Thomas Gleixner @ 2021-10-05 14:11 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: Thomas Gleixner, Peter Zijlstra (Intel), x86, linux-kernel

The following commit has been merged into the sched/core branch of tip:

Commit-ID:     691925f3ddccea832cf2d162dc277d2623a816e3
Gitweb:        https://git.kernel.org/tip/691925f3ddccea832cf2d162dc277d2623a816e3
Author:        Thomas Gleixner <tglx@linutronix.de>
AuthorDate:    Tue, 28 Sep 2021 14:24:25 +02:00
Committer:     Peter Zijlstra <peterz@infradead.org>
CommitterDate: Tue, 05 Oct 2021 15:52:11 +02:00

sched: Limit the number of task migrations per batch on RT

Batched task migrations are a source for large latencies as they keep the
scheduler from running while processing the migrations.

Limit the batch size to 8 instead of 32 when running on an RT enabled
kernel.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20210928122411.425097596@linutronix.de
---
 kernel/sched/core.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 9eaeba6..749284f 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -74,7 +74,11 @@ __read_mostly int sysctl_resched_latency_warn_once = 1;
  * Number of tasks to iterate in a single balance run.
  * Limited because this is done with IRQs disabled.
  */
+#ifdef CONFIG_PREEMPT_RT
+const_debug unsigned int sysctl_sched_nr_migrate = 8;
+#else
 const_debug unsigned int sysctl_sched_nr_migrate = 32;
+#endif
 
 /*
  * period over which we measure -rt task CPU usage in us.


Thread overview: 22+ messages (newest: 2021-10-05 14:12 UTC)
2021-09-28 12:24 [patch 0/5] sched: Miscellaneous RT related tweaks Thomas Gleixner
2021-09-28 12:24 ` [patch 1/5] sched: Limit the number of task migrations per batch on RT Thomas Gleixner
2021-10-01 15:05   ` [tip: sched/core] " tip-bot2 for Thomas Gleixner
2021-10-05 14:11   ` tip-bot2 for Thomas Gleixner
2021-09-28 12:24 ` [patch 2/5] sched: Disable TTWU_QUEUE " Thomas Gleixner
2021-10-01 15:05   ` [tip: sched/core] " tip-bot2 for Thomas Gleixner
2021-10-05 14:11   ` tip-bot2 for Thomas Gleixner
2021-09-28 12:24 ` [patch 3/5] sched: Move kprobes cleanup out of finish_task_switch() Thomas Gleixner
2021-10-01 15:05   ` [tip: sched/core] " tip-bot2 for Thomas Gleixner
2021-10-05 14:11   ` tip-bot2 for Thomas Gleixner
2021-09-28 12:24 ` [patch 4/5] sched: Delay task stack freeing on RT Thomas Gleixner
2021-09-29 11:54   ` Peter Zijlstra
2021-10-01 16:12     ` Andy Lutomirski
2021-10-01 17:24       ` Thomas Gleixner
2021-10-01 18:48         ` Andy Lutomirski
2021-10-01 19:02           ` Andy Lutomirski
2021-10-01 20:54             ` Thomas Gleixner
2021-09-28 12:24 ` [patch 5/5] sched: Move mmdrop to RCU " Thomas Gleixner
2021-09-29 12:02   ` Peter Zijlstra
2021-09-29 13:05     ` Thomas Gleixner
2021-10-01 15:05   ` [tip: sched/core] " tip-bot2 for Thomas Gleixner
2021-10-05 14:11   ` tip-bot2 for Thomas Gleixner
