* [PATCH v3 0/3] livepatch,sched: Add livepatch task switching to cond_resched()
@ 2023-02-24 16:49 Josh Poimboeuf
  2023-02-24 16:49 ` [PATCH v3 1/3] livepatch: Skip task_call_func() for current task Josh Poimboeuf
                   ` (4 more replies)
  0 siblings, 5 replies; 12+ messages in thread
From: Josh Poimboeuf @ 2023-02-24 16:49 UTC (permalink / raw)
  To: live-patching
  Cc: linux-kernel, Seth Forshee, Peter Zijlstra, Song Liu,
	Mark Rutland, Petr Mladek, Joe Lawrence, Miroslav Benes,
	Jiri Kosina, Ingo Molnar, Rik van Riel

v3:
- Add barriers (pmladek)
- Update comments

v2:
- Avoid calling klp_cond_resched_disable() in klp_cancel_transition()
- Fix race in klp_reverse_transition()

Fix patching stalls caused by busy kthreads.

Josh Poimboeuf (3):
  livepatch: Skip task_call_func() for current task
  livepatch,sched: Add livepatch task switching to cond_resched()
  vhost: Fix livepatch timeouts in vhost_worker()

 drivers/vhost/vhost.c           |   3 +-
 include/linux/livepatch.h       |   1 +
 include/linux/livepatch_sched.h |  29 ++++++++
 include/linux/sched.h           |  20 ++++--
 kernel/livepatch/core.c         |   1 +
 kernel/livepatch/transition.c   | 113 +++++++++++++++++++++++++++-----
 kernel/sched/core.c             |  64 +++++++++++++++---
 7 files changed, 200 insertions(+), 31 deletions(-)
 create mode 100644 include/linux/livepatch_sched.h

-- 
2.39.1



* [PATCH v3 1/3] livepatch: Skip task_call_func() for current task
  2023-02-24 16:49 [PATCH v3 0/3] livepatch,sched: Add livepatch task switching to cond_resched() Josh Poimboeuf
@ 2023-02-24 16:49 ` Josh Poimboeuf
  2023-02-24 16:50 ` [PATCH v3 2/3] livepatch,sched: Add livepatch task switching to cond_resched() Josh Poimboeuf
                   ` (3 subsequent siblings)
  4 siblings, 0 replies; 12+ messages in thread
From: Josh Poimboeuf @ 2023-02-24 16:49 UTC (permalink / raw)
  To: live-patching
  Cc: linux-kernel, Seth Forshee, Peter Zijlstra, Song Liu,
	Mark Rutland, Petr Mladek, Joe Lawrence, Miroslav Benes,
	Jiri Kosina, Ingo Molnar, Rik van Riel

The current task doesn't need the scheduler's protection to unwind its
own stack: task_call_func() only exists to keep a remote task from
running (and thus changing its stack) while it is inspected, which
can't happen to current while it is doing the unwinding itself.  Call
klp_check_and_switch_task() directly in that case.

Tested-by: Seth Forshee (DigitalOcean) <sforshee@kernel.org>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
---
 kernel/livepatch/transition.c | 6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/kernel/livepatch/transition.c b/kernel/livepatch/transition.c
index f1b25ec581e0..4d1f443778f7 100644
--- a/kernel/livepatch/transition.c
+++ b/kernel/livepatch/transition.c
@@ -307,7 +307,11 @@ static bool klp_try_switch_task(struct task_struct *task)
 	 * functions.  If all goes well, switch the task to the target patch
 	 * state.
 	 */
-	ret = task_call_func(task, klp_check_and_switch_task, &old_name);
+	if (task == current)
+		ret = klp_check_and_switch_task(current, &old_name);
+	else
+		ret = task_call_func(task, klp_check_and_switch_task, &old_name);
+
 	switch (ret) {
 	case 0:		/* success */
 		break;
-- 
2.39.1



* [PATCH v3 2/3] livepatch,sched: Add livepatch task switching to cond_resched()
  2023-02-24 16:49 [PATCH v3 0/3] livepatch,sched: Add livepatch task switching to cond_resched() Josh Poimboeuf
  2023-02-24 16:49 ` [PATCH v3 1/3] livepatch: Skip task_call_func() for current task Josh Poimboeuf
@ 2023-02-24 16:50 ` Josh Poimboeuf
  2023-02-27 15:55   ` Petr Mladek
  2023-02-24 16:50 ` [PATCH v3 3/3] vhost: Fix livepatch timeouts in vhost_worker() Josh Poimboeuf
                   ` (2 subsequent siblings)
  4 siblings, 1 reply; 12+ messages in thread
From: Josh Poimboeuf @ 2023-02-24 16:50 UTC (permalink / raw)
  To: live-patching
  Cc: linux-kernel, Seth Forshee, Peter Zijlstra, Song Liu,
	Mark Rutland, Petr Mladek, Joe Lawrence, Miroslav Benes,
	Jiri Kosina, Ingo Molnar, Rik van Riel

There have been reports [1][2] of live patches failing to complete
within a reasonable amount of time due to CPU-bound kthreads.

Fix it by patching tasks in cond_resched().

There are four different flavors of cond_resched(), depending on the
kernel configuration.  Hook into all of them.

A more elegant solution might be to use a preempt notifier.  However,
non-ORC unwinders can't unwind a preempted task reliably.

[1] https://lore.kernel.org/lkml/20220507174628.2086373-1-song@kernel.org/
[2] https://lkml.kernel.org/lkml/20230120-vhost-klp-switching-v1-0-7c2b65519c43@kernel.org
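
As a quick map of those four flavors (paraphrased from the diff below, so
treat it as a summary rather than literal kernel code; all names are ones
the patch itself uses):

/*
 * 1) PREEMPT_DYNAMIC + HAVE_PREEMPT_DYNAMIC_CALL:
 *      cond_resched() is a static call; while a transition is in
 *      progress it is redirected to klp_cond_resched(), which calls
 *      __klp_sched_try_switch() before __cond_resched().
 *
 * 2) PREEMPT_DYNAMIC + HAVE_PREEMPT_DYNAMIC_KEY:
 *      dynamic_cond_resched() calls klp_sched_try_switch() (guarded by
 *      klp_sched_try_switch_key) before checking sk_dynamic_cond_resched.
 *
 * 3) Neither PREEMPTION nor PREEMPT_DYNAMIC (voluntary preemption):
 *      _cond_resched() calls klp_sched_try_switch(), then __cond_resched().
 *
 * 4) PREEMPTION without PREEMPT_DYNAMIC (cond_resched() is normally a
 *    no-op):
 *      _cond_resched() still calls klp_sched_try_switch() so busy
 *      kthreads get a patching opportunity even though nothing needs to
 *      reschedule.
 */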

Tested-by: Seth Forshee (DigitalOcean) <sforshee@kernel.org>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
---
 include/linux/livepatch.h       |   1 +
 include/linux/livepatch_sched.h |  29 +++++++++
 include/linux/sched.h           |  20 ++++--
 kernel/livepatch/core.c         |   1 +
 kernel/livepatch/transition.c   | 107 +++++++++++++++++++++++++++-----
 kernel/sched/core.c             |  64 ++++++++++++++++---
 6 files changed, 194 insertions(+), 28 deletions(-)
 create mode 100644 include/linux/livepatch_sched.h

diff --git a/include/linux/livepatch.h b/include/linux/livepatch.h
index 293e29960c6e..9b9b38e89563 100644
--- a/include/linux/livepatch.h
+++ b/include/linux/livepatch.h
@@ -13,6 +13,7 @@
 #include <linux/ftrace.h>
 #include <linux/completion.h>
 #include <linux/list.h>
+#include <linux/livepatch_sched.h>
 
 #if IS_ENABLED(CONFIG_LIVEPATCH)
 
diff --git a/include/linux/livepatch_sched.h b/include/linux/livepatch_sched.h
new file mode 100644
index 000000000000..013794fb5da0
--- /dev/null
+++ b/include/linux/livepatch_sched.h
@@ -0,0 +1,29 @@
+/* SPDX-License-Identifier: GPL-2.0-or-later */
+#ifndef _LINUX_LIVEPATCH_SCHED_H_
+#define _LINUX_LIVEPATCH_SCHED_H_
+
+#include <linux/jump_label.h>
+#include <linux/static_call_types.h>
+
+#ifdef CONFIG_LIVEPATCH
+
+void __klp_sched_try_switch(void);
+
+#if !defined(CONFIG_PREEMPT_DYNAMIC) || !defined(CONFIG_HAVE_PREEMPT_DYNAMIC_CALL)
+
+DECLARE_STATIC_KEY_FALSE(klp_sched_try_switch_key);
+
+static __always_inline void klp_sched_try_switch(void)
+{
+	if (static_branch_unlikely(&klp_sched_try_switch_key))
+		__klp_sched_try_switch();
+}
+
+#endif /* !CONFIG_PREEMPT_DYNAMIC || !CONFIG_HAVE_PREEMPT_DYNAMIC_CALL */
+
+#else /* !CONFIG_LIVEPATCH */
+static inline void klp_sched_try_switch(void) {}
+static inline void __klp_sched_try_switch(void) {}
+#endif /* CONFIG_LIVEPATCH */
+
+#endif /* _LINUX_LIVEPATCH_SCHED_H_ */
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 853d08f7562b..bd1e6f02facb 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -36,6 +36,7 @@
 #include <linux/seqlock.h>
 #include <linux/kcsan.h>
 #include <linux/rv.h>
+#include <linux/livepatch_sched.h>
 #include <asm/kmap_size.h>
 
 /* task_struct member predeclarations (sorted alphabetically): */
@@ -2064,6 +2065,9 @@ extern int __cond_resched(void);
 
 #if defined(CONFIG_PREEMPT_DYNAMIC) && defined(CONFIG_HAVE_PREEMPT_DYNAMIC_CALL)
 
+void sched_dynamic_klp_enable(void);
+void sched_dynamic_klp_disable(void);
+
 DECLARE_STATIC_CALL(cond_resched, __cond_resched);
 
 static __always_inline int _cond_resched(void)
@@ -2072,6 +2076,7 @@ static __always_inline int _cond_resched(void)
 }
 
 #elif defined(CONFIG_PREEMPT_DYNAMIC) && defined(CONFIG_HAVE_PREEMPT_DYNAMIC_KEY)
+
 extern int dynamic_cond_resched(void);
 
 static __always_inline int _cond_resched(void)
@@ -2079,20 +2084,25 @@ static __always_inline int _cond_resched(void)
 	return dynamic_cond_resched();
 }
 
-#else
+#else /* !CONFIG_PREEMPTION */
 
 static inline int _cond_resched(void)
 {
+	klp_sched_try_switch();
 	return __cond_resched();
 }
 
-#endif /* CONFIG_PREEMPT_DYNAMIC */
+#endif /* PREEMPT_DYNAMIC && CONFIG_HAVE_PREEMPT_DYNAMIC_CALL */
 
-#else
+#else /* CONFIG_PREEMPTION && !CONFIG_PREEMPT_DYNAMIC */
 
-static inline int _cond_resched(void) { return 0; }
+static inline int _cond_resched(void)
+{
+	klp_sched_try_switch();
+	return 0;
+}
 
-#endif /* !defined(CONFIG_PREEMPTION) || defined(CONFIG_PREEMPT_DYNAMIC) */
+#endif /* !CONFIG_PREEMPTION || CONFIG_PREEMPT_DYNAMIC */
 
 #define cond_resched() ({			\
 	__might_resched(__FILE__, __LINE__, 0);	\
diff --git a/kernel/livepatch/core.c b/kernel/livepatch/core.c
index 201f0c0482fb..3f79265dd6e5 100644
--- a/kernel/livepatch/core.c
+++ b/kernel/livepatch/core.c
@@ -33,6 +33,7 @@
  *
  * - klp_ftrace_handler()
  * - klp_update_patch_state()
+ * - __klp_sched_try_switch()
  */
 DEFINE_MUTEX(klp_mutex);
 
diff --git a/kernel/livepatch/transition.c b/kernel/livepatch/transition.c
index 4d1f443778f7..2662f2efb164 100644
--- a/kernel/livepatch/transition.c
+++ b/kernel/livepatch/transition.c
@@ -9,6 +9,7 @@
 
 #include <linux/cpu.h>
 #include <linux/stacktrace.h>
+#include <linux/static_call.h>
 #include "core.h"
 #include "patch.h"
 #include "transition.h"
@@ -24,6 +25,25 @@ static int klp_target_state = KLP_UNDEFINED;
 
 static unsigned int klp_signals_cnt;
 
+/*
+ * When a livepatch is in progress, enable klp stack checking in
+ * cond_resched().  This helps CPU-bound kthreads get patched.
+ */
+#if defined(CONFIG_PREEMPT_DYNAMIC) && defined(CONFIG_HAVE_PREEMPT_DYNAMIC_CALL)
+
+#define klp_cond_resched_enable() sched_dynamic_klp_enable()
+#define klp_cond_resched_disable() sched_dynamic_klp_disable()
+
+#else /* !CONFIG_PREEMPT_DYNAMIC || !CONFIG_HAVE_PREEMPT_DYNAMIC_CALL */
+
+DEFINE_STATIC_KEY_FALSE(klp_sched_try_switch_key);
+EXPORT_SYMBOL(klp_sched_try_switch_key);
+
+#define klp_cond_resched_enable() static_branch_enable(&klp_sched_try_switch_key)
+#define klp_cond_resched_disable() static_branch_disable(&klp_sched_try_switch_key)
+
+#endif /* CONFIG_PREEMPT_DYNAMIC && CONFIG_HAVE_PREEMPT_DYNAMIC_CALL */
+
 /*
  * This work can be performed periodically to finish patching or unpatching any
  * "straggler" tasks which failed to transition in the first attempt.
@@ -172,8 +192,8 @@ void klp_update_patch_state(struct task_struct *task)
 	 * barrier (smp_rmb) for two cases:
 	 *
 	 * 1) Enforce the order of the TIF_PATCH_PENDING read and the
-	 *    klp_target_state read.  The corresponding write barrier is in
-	 *    klp_init_transition().
+	 *    klp_target_state read.  The corresponding write barriers are in
+	 *    klp_init_transition() and klp_reverse_transition().
 	 *
 	 * 2) Enforce the order of the TIF_PATCH_PENDING read and a future read
 	 *    of func->transition, if klp_ftrace_handler() is called later on
@@ -338,6 +358,44 @@ static bool klp_try_switch_task(struct task_struct *task)
 	return !ret;
 }
 
+void __klp_sched_try_switch(void)
+{
+	if (likely(!klp_patch_pending(current)))
+		return;
+
+	/*
+	 * This function is called from cond_resched() which is called in many
+	 * places throughout the kernel.  Using the klp_mutex here might
+	 * deadlock.
+	 *
+	 * Instead, disable preemption to prevent racing with other callers of
+	 * klp_try_switch_task().  Thanks to task_call_func() they won't be
+	 * able to switch this task while it's running.
+	 */
+	preempt_disable();
+
+	/*
+	 * Make sure current didn't get patched between the above check and
+	 * preempt_disable().
+	 */
+	if (unlikely(!klp_patch_pending(current)))
+		goto out;
+
+	/*
+	 * Enforce the order of the TIF_PATCH_PENDING read above and the
+	 * klp_target_state read in klp_try_switch_task().  The corresponding
+	 * write barriers are in klp_init_transition() and
+	 * klp_reverse_transition().
+	 */
+	smp_rmb();
+
+	klp_try_switch_task(current);
+
+out:
+	preempt_enable();
+}
+EXPORT_SYMBOL(__klp_sched_try_switch);
+
 /*
  * Sends a fake signal to all non-kthread tasks with TIF_PATCH_PENDING set.
  * Kthreads with TIF_PATCH_PENDING set are woken up.
@@ -444,7 +502,8 @@ void klp_try_complete_transition(void)
 		return;
 	}
 
-	/* we're done, now cleanup the data structures */
+	/* Done!  Now cleanup the data structures. */
+	klp_cond_resched_disable();
 	patch = klp_transition_patch;
 	klp_complete_transition();
 
@@ -496,6 +555,8 @@ void klp_start_transition(void)
 			set_tsk_thread_flag(task, TIF_PATCH_PENDING);
 	}
 
+	klp_cond_resched_enable();
+
 	klp_signals_cnt = 0;
 }
 
@@ -551,8 +612,9 @@ void klp_init_transition(struct klp_patch *patch, int state)
 	 * see a func in transition with a task->patch_state of KLP_UNDEFINED.
 	 *
 	 * Also enforce the order of the klp_target_state write and future
-	 * TIF_PATCH_PENDING writes to ensure klp_update_patch_state() doesn't
-	 * set a task->patch_state to KLP_UNDEFINED.
+	 * TIF_PATCH_PENDING writes to ensure klp_update_patch_state() and
+	 * __klp_sched_try_switch() don't set a task->patch_state to
+	 * KLP_UNDEFINED.
 	 */
 	smp_wmb();
 
@@ -588,14 +650,10 @@ void klp_reverse_transition(void)
 		 klp_target_state == KLP_PATCHED ? "patching to unpatching" :
 						   "unpatching to patching");
 
-	klp_transition_patch->enabled = !klp_transition_patch->enabled;
-
-	klp_target_state = !klp_target_state;
-
 	/*
 	 * Clear all TIF_PATCH_PENDING flags to prevent races caused by
-	 * klp_update_patch_state() running in parallel with
-	 * klp_start_transition().
+	 * klp_update_patch_state() or __klp_sched_try_switch() running in
+	 * parallel with the reverse transition.
 	 */
 	read_lock(&tasklist_lock);
 	for_each_process_thread(g, task)
@@ -605,9 +663,28 @@ void klp_reverse_transition(void)
 	for_each_possible_cpu(cpu)
 		clear_tsk_thread_flag(idle_task(cpu), TIF_PATCH_PENDING);
 
-	/* Let any remaining calls to klp_update_patch_state() complete */
+	/*
+	 * Make sure all existing invocations of klp_update_patch_state() and
+	 * __klp_sched_try_switch() see the cleared TIF_PATCH_PENDING before
+	 * starting the reverse transition.
+	 */
 	klp_synchronize_transition();
 
+	/*
+	 * All patching has stopped, now re-initialize the global variables to
+	 * prepare for the reverse transition.
+	 */
+	klp_transition_patch->enabled = !klp_transition_patch->enabled;
+	klp_target_state = !klp_target_state;
+
+	/*
+	 * Enforce the order of the klp_target_state write and the
+	 * TIF_PATCH_PENDING writes in klp_start_transition() to ensure
+	 * klp_update_patch_state() and __klp_sched_try_switch() don't set
+	 * task->patch_state to the wrong value.
+	 */
+	smp_wmb();
+
 	klp_start_transition();
 }
 
@@ -621,9 +698,9 @@ void klp_copy_process(struct task_struct *child)
 	 * the task flag up to date with the parent here.
 	 *
 	 * The operation is serialized against all klp_*_transition()
-	 * operations by the tasklist_lock. The only exception is
-	 * klp_update_patch_state(current), but we cannot race with
-	 * that because we are current.
+	 * operations by the tasklist_lock. The only exceptions are
+	 * klp_update_patch_state(current) and __klp_sched_try_switch(), but we
+	 * cannot race with them because we are current.
 	 */
 	if (test_tsk_thread_flag(current, TIF_PATCH_PENDING))
 		set_tsk_thread_flag(child, TIF_PATCH_PENDING);
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index e838feb6adc5..895d2a1fdcb3 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -8487,6 +8487,7 @@ EXPORT_STATIC_CALL_TRAMP(might_resched);
 static DEFINE_STATIC_KEY_FALSE(sk_dynamic_cond_resched);
 int __sched dynamic_cond_resched(void)
 {
+	klp_sched_try_switch();
 	if (!static_branch_unlikely(&sk_dynamic_cond_resched))
 		return 0;
 	return __cond_resched();
@@ -8635,13 +8636,17 @@ int sched_dynamic_mode(const char *str)
 #error "Unsupported PREEMPT_DYNAMIC mechanism"
 #endif
 
-void sched_dynamic_update(int mode)
+DEFINE_MUTEX(sched_dynamic_mutex);
+static bool klp_override;
+
+static void __sched_dynamic_update(int mode)
 {
 	/*
 	 * Avoid {NONE,VOLUNTARY} -> FULL transitions from ever ending up in
 	 * the ZERO state, which is invalid.
 	 */
-	preempt_dynamic_enable(cond_resched);
+	if (!klp_override)
+		preempt_dynamic_enable(cond_resched);
 	preempt_dynamic_enable(might_resched);
 	preempt_dynamic_enable(preempt_schedule);
 	preempt_dynamic_enable(preempt_schedule_notrace);
@@ -8649,36 +8654,79 @@ void sched_dynamic_update(int mode)
 
 	switch (mode) {
 	case preempt_dynamic_none:
-		preempt_dynamic_enable(cond_resched);
+		if (!klp_override)
+			preempt_dynamic_enable(cond_resched);
 		preempt_dynamic_disable(might_resched);
 		preempt_dynamic_disable(preempt_schedule);
 		preempt_dynamic_disable(preempt_schedule_notrace);
 		preempt_dynamic_disable(irqentry_exit_cond_resched);
-		pr_info("Dynamic Preempt: none\n");
+		if (mode != preempt_dynamic_mode)
+			pr_info("Dynamic Preempt: none\n");
 		break;
 
 	case preempt_dynamic_voluntary:
-		preempt_dynamic_enable(cond_resched);
+		if (!klp_override)
+			preempt_dynamic_enable(cond_resched);
 		preempt_dynamic_enable(might_resched);
 		preempt_dynamic_disable(preempt_schedule);
 		preempt_dynamic_disable(preempt_schedule_notrace);
 		preempt_dynamic_disable(irqentry_exit_cond_resched);
-		pr_info("Dynamic Preempt: voluntary\n");
+		if (mode != preempt_dynamic_mode)
+			pr_info("Dynamic Preempt: voluntary\n");
 		break;
 
 	case preempt_dynamic_full:
-		preempt_dynamic_disable(cond_resched);
+		if (!klp_override)
+			preempt_dynamic_disable(cond_resched);
 		preempt_dynamic_disable(might_resched);
 		preempt_dynamic_enable(preempt_schedule);
 		preempt_dynamic_enable(preempt_schedule_notrace);
 		preempt_dynamic_enable(irqentry_exit_cond_resched);
-		pr_info("Dynamic Preempt: full\n");
+		if (mode != preempt_dynamic_mode)
+			pr_info("Dynamic Preempt: full\n");
 		break;
 	}
 
 	preempt_dynamic_mode = mode;
 }
 
+void sched_dynamic_update(int mode)
+{
+	mutex_lock(&sched_dynamic_mutex);
+	__sched_dynamic_update(mode);
+	mutex_unlock(&sched_dynamic_mutex);
+}
+
+#ifdef CONFIG_HAVE_PREEMPT_DYNAMIC_CALL
+
+static int klp_cond_resched(void)
+{
+	__klp_sched_try_switch();
+	return __cond_resched();
+}
+
+void sched_dynamic_klp_enable(void)
+{
+	mutex_lock(&sched_dynamic_mutex);
+
+	klp_override = true;
+	static_call_update(cond_resched, klp_cond_resched);
+
+	mutex_unlock(&sched_dynamic_mutex);
+}
+
+void sched_dynamic_klp_disable(void)
+{
+	mutex_lock(&sched_dynamic_mutex);
+
+	klp_override = false;
+	__sched_dynamic_update(preempt_dynamic_mode);
+
+	mutex_unlock(&sched_dynamic_mutex);
+}
+
+#endif /* CONFIG_HAVE_PREEMPT_DYNAMIC_CALL */
+
 static int __init setup_preempt_mode(char *str)
 {
 	int mode = sched_dynamic_mode(str);
-- 
2.39.1
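
Putting the pieces of this patch together, the life cycle of the new hook
is roughly the following (a condensed walk-through of the diff above, not
additional code):

/*
 * 1. klp_start_transition() sets TIF_PATCH_PENDING on all not-yet-switched
 *    tasks and then calls klp_cond_resched_enable(), which is either a
 *    static_call_update() to klp_cond_resched() or a static_branch_enable()
 *    of klp_sched_try_switch_key, depending on the configuration.
 *
 * 2. A busy kthread eventually calls cond_resched(), which now reaches
 *    __klp_sched_try_switch().
 *
 * 3. __klp_sched_try_switch() re-checks TIF_PATCH_PENDING under
 *    preempt_disable(), issues the smp_rmb(), and calls
 *    klp_try_switch_task(current), which does the reliable stack check and
 *    updates task->patch_state.
 *
 * 4. Once every task has switched, klp_try_complete_transition() calls
 *    klp_cond_resched_disable(), restoring the cond_resched() behavior of
 *    the selected preemption mode.
 */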



* [PATCH v3 3/3] vhost: Fix livepatch timeouts in vhost_worker()
  2023-02-24 16:49 [PATCH v3 0/3] livepatch,sched: Add livepatch task switching to cond_resched() Josh Poimboeuf
  2023-02-24 16:49 ` [PATCH v3 1/3] livepatch: Skip task_call_func() for current task Josh Poimboeuf
  2023-02-24 16:50 ` [PATCH v3 2/3] livepatch,sched: Add livepatch task switching to cond_resched() Josh Poimboeuf
@ 2023-02-24 16:50 ` Josh Poimboeuf
  2023-02-28 16:39 ` [PATCH v3 0/3] livepatch,sched: Add livepatch task switching to cond_resched() Seth Forshee
  2023-03-30 16:11 ` Miroslav Benes
  4 siblings, 0 replies; 12+ messages in thread
From: Josh Poimboeuf @ 2023-02-24 16:50 UTC (permalink / raw)
  To: live-patching
  Cc: linux-kernel, Seth Forshee, Peter Zijlstra, Song Liu,
	Mark Rutland, Petr Mladek, Joe Lawrence, Miroslav Benes,
	Jiri Kosina, Ingo Molnar, Rik van Riel

Livepatch timeouts were reported due to busy vhost_worker() kthreads.
Now that cond_resched() can do livepatch task switching, use
cond_resched() in vhost_worker().  That's the better way to
conditionally call schedule() anyway.

Reported-by: Seth Forshee (DigitalOcean) <sforshee@kernel.org>
Link: https://lkml.kernel.org/lkml/20230120-vhost-klp-switching-v1-0-7c2b65519c43@kernel.org
Tested-by: Seth Forshee (DigitalOcean) <sforshee@kernel.org>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
---
 drivers/vhost/vhost.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index 43c9770b86e5..87e3cf12da1c 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -363,8 +363,7 @@ static int vhost_worker(void *data)
 			kcov_remote_start_common(dev->kcov_handle);
 			work->fn(work);
 			kcov_remote_stop();
-			if (need_resched())
-				schedule();
+			cond_resched();
 		}
 	}
 	kthread_unuse_mm(dev->mm);
-- 
2.39.1



* Re: [PATCH v3 2/3] livepatch,sched: Add livepatch task switching to cond_resched()
  2023-02-24 16:50 ` [PATCH v3 2/3] livepatch,sched: Add livepatch task switching to cond_resched() Josh Poimboeuf
@ 2023-02-27 15:55   ` Petr Mladek
  2023-02-28 16:56     ` Josh Poimboeuf
  0 siblings, 1 reply; 12+ messages in thread
From: Petr Mladek @ 2023-02-27 15:55 UTC (permalink / raw)
  To: Josh Poimboeuf
  Cc: live-patching, linux-kernel, Seth Forshee, Peter Zijlstra,
	Song Liu, Mark Rutland, Joe Lawrence, Miroslav Benes,
	Jiri Kosina, Ingo Molnar, Rik van Riel

On Fri 2023-02-24 08:50:00, Josh Poimboeuf wrote:
> There have been reports [1][2] of live patches failing to complete
> within a reasonable amount of time due to CPU-bound kthreads.
> 
> Fix it by patching tasks in cond_resched().
> 
> There are four different flavors of cond_resched(), depending on the
> kernel configuration.  Hook into all of them.
> 
> A more elegant solution might be to use a preempt notifier.  However,
> non-ORC unwinders can't unwind a preempted task reliably.
> 
> [1] https://lore.kernel.org/lkml/20220507174628.2086373-1-song@kernel.org/
> [2] https://lkml.kernel.org/lkml/20230120-vhost-klp-switching-v1-0-7c2b65519c43@kernel.org
> 
> Tested-by: Seth Forshee (DigitalOcean) <sforshee@kernel.org>
> Reviewed-by: Petr Mladek <pmladek@suse.com>
> Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>

Just for the record, I have double-checked the changes against v2
and everything looks good to me.

Best Regards,
Petr


* Re: [PATCH v3 0/3] livepatch,sched: Add livepatch task switching to cond_resched()
  2023-02-24 16:49 [PATCH v3 0/3] livepatch,sched: Add livepatch task switching to cond_resched() Josh Poimboeuf
                   ` (2 preceding siblings ...)
  2023-02-24 16:50 ` [PATCH v3 3/3] vhost: Fix livepatch timeouts in vhost_worker() Josh Poimboeuf
@ 2023-02-28 16:39 ` Seth Forshee
  2023-03-30 16:11 ` Miroslav Benes
  4 siblings, 0 replies; 12+ messages in thread
From: Seth Forshee @ 2023-02-28 16:39 UTC (permalink / raw)
  To: Josh Poimboeuf
  Cc: live-patching, linux-kernel, Peter Zijlstra, Song Liu,
	Mark Rutland, Petr Mladek, Joe Lawrence, Miroslav Benes,
	Jiri Kosina, Ingo Molnar, Rik van Riel

On Fri, Feb 24, 2023 at 08:49:58AM -0800, Josh Poimboeuf wrote:
> v3:
> - Add barriers (pmladek)
> - Update comments
> 
> v2:
> - Avoid calling klp_cond_resched_disable() in klp_cancel_transition()
> - Fix race in klp_reverse_transition()
> 
> Fix patching stalls caused by busy kthreads.

I ran the v3 patches through my reproducer, still working great.

Thanks,
Seth


* Re: [PATCH v3 2/3] livepatch,sched: Add livepatch task switching to cond_resched()
  2023-02-27 15:55   ` Petr Mladek
@ 2023-02-28 16:56     ` Josh Poimboeuf
  2023-03-03 14:00       ` Petr Mladek
  0 siblings, 1 reply; 12+ messages in thread
From: Josh Poimboeuf @ 2023-02-28 16:56 UTC (permalink / raw)
  To: Petr Mladek
  Cc: live-patching, linux-kernel, Seth Forshee, Peter Zijlstra,
	Song Liu, Mark Rutland, Joe Lawrence, Miroslav Benes,
	Jiri Kosina, Ingo Molnar, Rik van Riel

On Mon, Feb 27, 2023 at 04:55:47PM +0100, Petr Mladek wrote:
> On Fri 2023-02-24 08:50:00, Josh Poimboeuf wrote:
> > There have been reports [1][2] of live patches failing to complete
> > within a reasonable amount of time due to CPU-bound kthreads.
> > 
> > Fix it by patching tasks in cond_resched().
> > 
> > There are four different flavors of cond_resched(), depending on the
> > kernel configuration.  Hook into all of them.
> > 
> > A more elegant solution might be to use a preempt notifier.  However,
> > non-ORC unwinders can't unwind a preempted task reliably.
> > 
> > [1] https://lore.kernel.org/lkml/20220507174628.2086373-1-song@kernel.org/
> > [2] https://lkml.kernel.org/lkml/20230120-vhost-klp-switching-v1-0-7c2b65519c43@kernel.org
> > 
> > Tested-by: Seth Forshee (DigitalOcean) <sforshee@kernel.org>
> > Reviewed-by: Petr Mladek <pmladek@suse.com>
> > Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
> 
> Just for the record, I have double-checked the changes against v2
> and everything looks good to me.

Whoops, so I found another little surprise:

static int klp_check_stack(struct task_struct *task, const char **oldname)
{
        static unsigned long entries[MAX_STACK_ENTRIES];
	^^^^^^

That entries array is shared between the klp_mutex owner and all
cond_resched() callers.

MAX_STACK_ENTRIES is 100, which seems excessive.  If we halved that, the
array would be "only" 400 bytes, which is *almost* reasonable to
allocate on the stack?

Alternatively we could have a percpu entries array... :-/

-- 
Josh
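
To illustrate the hazard described above (a sketch only, not kernel code;
T1 stands for an arbitrary task being checked):

/*
 *   CPU0: klp_mutex owner, in klp_try_complete_transition()
 *           klp_check_stack(T1)      -> fills the shared entries[]
 *
 *   CPU1: any task in cond_resched() -> __klp_sched_try_switch()
 *           klp_check_stack(current) -> overwrites the same entries[]
 *
 * Each caller may end up walking a snapshot that the other has partially
 * overwritten, so the reliable-stack check can succeed or fail spuriously.
 */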


* Re: [PATCH v3 2/3] livepatch,sched: Add livepatch task switching to cond_resched()
  2023-02-28 16:56     ` Josh Poimboeuf
@ 2023-03-03 14:00       ` Petr Mladek
  2023-03-13 23:33         ` [PATCH 0.5/3] livepatch: Convert stack entries array to percpu Josh Poimboeuf
  0 siblings, 1 reply; 12+ messages in thread
From: Petr Mladek @ 2023-03-03 14:00 UTC (permalink / raw)
  To: Josh Poimboeuf
  Cc: live-patching, linux-kernel, Seth Forshee, Peter Zijlstra,
	Song Liu, Mark Rutland, Joe Lawrence, Miroslav Benes,
	Jiri Kosina, Ingo Molnar, Rik van Riel

On Tue 2023-02-28 08:56:08, Josh Poimboeuf wrote:
> On Mon, Feb 27, 2023 at 04:55:47PM +0100, Petr Mladek wrote:
> > On Fri 2023-02-24 08:50:00, Josh Poimboeuf wrote:
> > > There have been reports [1][2] of live patches failing to complete
> > > within a reasonable amount of time due to CPU-bound kthreads.
> > > 
> > > Fix it by patching tasks in cond_resched().
> > > 
> > > There are four different flavors of cond_resched(), depending on the
> > > kernel configuration.  Hook into all of them.
> > > 
> > > A more elegant solution might be to use a preempt notifier.  However,
> > > non-ORC unwinders can't unwind a preempted task reliably.
> > > 
> > > [1] https://lore.kernel.org/lkml/20220507174628.2086373-1-song@kernel.org/
> > > [2] https://lkml.kernel.org/lkml/20230120-vhost-klp-switching-v1-0-7c2b65519c43@kernel.org
> > > 
> > > Tested-by: Seth Forshee (DigitalOcean) <sforshee@kernel.org>
> > > Reviewed-by: Petr Mladek <pmladek@suse.com>
> > > Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
> > 
> > Just for the record, I have double-checked the changes against v2
> > and everything looks good to me.
> 
> Whoops, so I found another little surprise:
> 
> static int klp_check_stack(struct task_struct *task, const char **oldname)
> {
>         static unsigned long entries[MAX_STACK_ENTRIES];
> 	^^^^^^
> 
> That entries array is shared between the klp_mutex owner and all
> cond_resched() callers.

Huh, great catch!

> MAX_STACK_ENTRIES is 100, which seems excessive.  If we halved that, the
> array would be "only" 400 bytes, which is *almost* reasonable to
> allocate on the stack?

It is just for the stack in the process context. Right?

I think that I have never seen a stack with over 50 entries. And in
the worst case, a bigger amount of entries would "just" result in
a non-reliable stack which might be acceptable.

It looks acceptable to me.

> Alternatively we could have a percpu entries array... :-/

That said, percpu entries would be fine as well. It sounds like
a good price for the livepatching feature. I think that livepatching
is used on big systems anyway.

I slightly prefer the per-cpu solution.

Best Regards,
Petr


* [PATCH 0.5/3] livepatch: Convert stack entries array to percpu
  2023-03-03 14:00       ` Petr Mladek
@ 2023-03-13 23:33         ` Josh Poimboeuf
  2023-03-14 10:50           ` Petr Mladek
  0 siblings, 1 reply; 12+ messages in thread
From: Josh Poimboeuf @ 2023-03-13 23:33 UTC (permalink / raw)
  To: Petr Mladek
  Cc: live-patching, linux-kernel, Seth Forshee, Peter Zijlstra,
	Song Liu, Mark Rutland, Joe Lawrence, Miroslav Benes,
	Jiri Kosina, Ingo Molnar, Rik van Riel

On Fri, Mar 03, 2023 at 03:00:13PM +0100, Petr Mladek wrote:
> > MAX_STACK_ENTRIES is 100, which seems excessive.  If we halved that, the
> > array would be "only" 400 bytes, which is *almost* reasonable to
> > allocate on the stack?
> 
> It is just for the stack in the process context. Right?
> 
> I think that I have never seen a stack with over 50 entries. And in
> the worst case, a bigger amount of entries would "just" result in
> a non-reliable stack which might be acceptable.
> 
> It looks acceptable to me.
> 
> > Alternatively we could have a percpu entries array... :-/
> 
> That said, percpu entries would be fine as well. It sounds like
> a good price for the livepatching feature. I think that livepatching
> is used on big systems anyway.
> 
> I slightly prefer the per-cpu solution.

Booting a kernel with PREEMPT+LOCKDEP gave me a high-water mark of 60+
stack entries, seen when probing a device.  I decided not to mess with
MAX_STACK_ENTRIES, and instead just convert the entries to percpu.  This
patch could be inserted at the beginning of the set.

---8<---

Subject: [PATCH 0.5/3] livepatch: Convert stack entries array to percpu

The entries array in klp_check_stack() is static local because it's too
big to be reasonably allocated on the stack.  Serialized access is
enforced by the klp_mutex.

In preparation for calling klp_check_stack() without the mutex (from
cond_resched), convert it to a percpu variable.

Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
---
 kernel/livepatch/transition.c | 9 +++++++--
 1 file changed, 7 insertions(+), 2 deletions(-)

diff --git a/kernel/livepatch/transition.c b/kernel/livepatch/transition.c
index f1b25ec581e0..135fc73e2e5d 100644
--- a/kernel/livepatch/transition.c
+++ b/kernel/livepatch/transition.c
@@ -14,6 +14,8 @@
 #include "transition.h"
 
 #define MAX_STACK_ENTRIES  100
+DEFINE_PER_CPU(unsigned long[MAX_STACK_ENTRIES], klp_stack_entries);
+
 #define STACK_ERR_BUF_SIZE 128
 
 #define SIGNALS_TIMEOUT 15
@@ -240,12 +242,15 @@ static int klp_check_stack_func(struct klp_func *func, unsigned long *entries,
  */
 static int klp_check_stack(struct task_struct *task, const char **oldname)
 {
-	static unsigned long entries[MAX_STACK_ENTRIES];
+	unsigned long *entries = this_cpu_ptr(klp_stack_entries);
 	struct klp_object *obj;
 	struct klp_func *func;
 	int ret, nr_entries;
 
-	ret = stack_trace_save_tsk_reliable(task, entries, ARRAY_SIZE(entries));
+	/* Protect 'klp_stack_entries' */
+	lockdep_assert_preemption_disabled();
+
+	ret = stack_trace_save_tsk_reliable(task, entries, MAX_STACK_ENTRIES);
 	if (ret < 0)
 		return -EINVAL;
 	nr_entries = ret;
-- 
2.39.2



* Re: [PATCH 0.5/3] livepatch: Convert stack entries array to percpu
  2023-03-13 23:33         ` [PATCH 0.5/3] livepatch: Convert stack entries array to percpu Josh Poimboeuf
@ 2023-03-14 10:50           ` Petr Mladek
  2023-03-15  0:26             ` Josh Poimboeuf
  0 siblings, 1 reply; 12+ messages in thread
From: Petr Mladek @ 2023-03-14 10:50 UTC (permalink / raw)
  To: Josh Poimboeuf
  Cc: live-patching, linux-kernel, Seth Forshee, Peter Zijlstra,
	Song Liu, Mark Rutland, Joe Lawrence, Miroslav Benes,
	Jiri Kosina, Ingo Molnar, Rik van Riel

On Mon 2023-03-13 16:33:46, Josh Poimboeuf wrote:
> On Fri, Mar 03, 2023 at 03:00:13PM +0100, Petr Mladek wrote:
> > > MAX_STACK_ENTRIES is 100, which seems excessive.  If we halved that, the
> > > array would be "only" 400 bytes, which is *almost* reasonable to
> > > allocate on the stack?
> > 
> > It is just for the stack in the process context. Right?
> > 
> > I think that I have never seen a stack with over 50 entries. And in
> > the worst case, a bigger amount of entries would "just" result in
> > a non-reliable stack which might be acceptable.
> > 
> > It looks acceptable to me.
> > 
> > > Alternatively we could have a percpu entries array... :-/
> > 
> > That said, percpu entries would be fine as well. It sounds like
> > a good price for the livepatching feature. I think that livepatching
> > is used on big systems anyway.
> > 
> > I slightly prefer the per-cpu solution.
> 
> Booting a kernel with PREEMPT+LOCKDEP gave me a high-water mark of 60+
> stack entries, seen when probing a device.  I decided not to mess with
> MAX_STACK_ENTRIES, and instead just convert the entries to percpu.  This
> patch could be inserted at the beginning of the set.

Good to know.

> 
> ---8<---
> 
> Subject: [PATCH 0.5/3] livepatch: Convert stack entries array to percpu
> 
> --- a/kernel/livepatch/transition.c
> +++ b/kernel/livepatch/transition.c
> @@ -240,12 +242,15 @@ static int klp_check_stack_func(struct klp_func *func, unsigned long *entries,
>   */
>  static int klp_check_stack(struct task_struct *task, const char **oldname)
>  {
> -	static unsigned long entries[MAX_STACK_ENTRIES];
> +	unsigned long *entries = this_cpu_ptr(klp_stack_entries);
>  	struct klp_object *obj;
>  	struct klp_func *func;
>  	int ret, nr_entries;
>  
> -	ret = stack_trace_save_tsk_reliable(task, entries, ARRAY_SIZE(entries));
> +	/* Protect 'klp_stack_entries' */
> +	lockdep_assert_preemption_disabled();

I'm thinking about adding:

	/*
	 * Stay on the safe side even when cond_resched() is called from
	 * an IRQ context by mistake.
	 */
	if (!in_task())
		return -EINVAL;

Or is this prevented another way, please?

> +
> +	ret = stack_trace_save_tsk_reliable(task, entries, MAX_STACK_ENTRIES);
>  	if (ret < 0)
>  		return -EINVAL;
>  	nr_entries = ret;

Otherwise, it looks good to me.

Best Regards,
Petr


* Re: [PATCH 0.5/3] livepatch: Convert stack entries array to percpu
  2023-03-14 10:50           ` Petr Mladek
@ 2023-03-15  0:26             ` Josh Poimboeuf
  0 siblings, 0 replies; 12+ messages in thread
From: Josh Poimboeuf @ 2023-03-15  0:26 UTC (permalink / raw)
  To: Petr Mladek
  Cc: live-patching, linux-kernel, Seth Forshee, Peter Zijlstra,
	Song Liu, Mark Rutland, Joe Lawrence, Miroslav Benes,
	Jiri Kosina, Ingo Molnar, Rik van Riel

On Tue, Mar 14, 2023 at 11:50:21AM +0100, Petr Mladek wrote:
> >  static int klp_check_stack(struct task_struct *task, const char **oldname)
> >  {
> > -	static unsigned long entries[MAX_STACK_ENTRIES];
> > +	unsigned long *entries = this_cpu_ptr(klp_stack_entries);
> >  	struct klp_object *obj;
> >  	struct klp_func *func;
> >  	int ret, nr_entries;
> >  
> > -	ret = stack_trace_save_tsk_reliable(task, entries, ARRAY_SIZE(entries));
> > +	/* Protect 'klp_stack_entries' */
> > +	lockdep_assert_preemption_disabled();
> 
> I'm thinking about adding:
> 
> 	/*
> 	 * Stay on the safe side even when cond_resched() is called from
> 	 * an IRQ context by mistake.
> 	 */
> 	if (!in_task())
> 		return -EINVAL;
> 
> Or is this prevented another way, please?

An IRQ handler trying to schedule would be a pretty major bug, no?

I think there are already several different checks for that.  For
example, the __might_resched() call in cond_resched().

-- 
Josh
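
For reference, the wrapper in question looks roughly like this (its first
line also appears in the sched.h hunk earlier in this thread; quoted from
memory, so treat it as approximate):

#define cond_resched() ({			\
	__might_resched(__FILE__, __LINE__, 0);	\
	_cond_resched();			\
})

__might_resched() is the check that produces the "BUG: sleeping function
called from invalid context" warning when cond_resched() is used from
atomic or IRQ context.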


* Re: [PATCH v3 0/3] livepatch,sched: Add livepatch task switching to cond_resched()
  2023-02-24 16:49 [PATCH v3 0/3] livepatch,sched: Add livepatch task switching to cond_resched() Josh Poimboeuf
                   ` (3 preceding siblings ...)
  2023-02-28 16:39 ` [PATCH v3 0/3] livepatch,sched: Add livepatch task switching to cond_resched() Seth Forshee
@ 2023-03-30 16:11 ` Miroslav Benes
  4 siblings, 0 replies; 12+ messages in thread
From: Miroslav Benes @ 2023-03-30 16:11 UTC (permalink / raw)
  To: Josh Poimboeuf
  Cc: live-patching, linux-kernel, Seth Forshee, Peter Zijlstra,
	Song Liu, Mark Rutland, Petr Mladek, Joe Lawrence, Jiri Kosina,
	Ingo Molnar, Rik van Riel

On Fri, 24 Feb 2023, Josh Poimboeuf wrote:

> v3:
> - Add barriers (pmladek)
> - Update comments
> 
> v2:
> - Avoid calling klp_cond_resched_disable() in klp_cancel_transition()
> - Fix race in klp_reverse_transition()
> 
> Fix patching stalls caused by busy kthreads.
> 
> Josh Poimboeuf (3):
>   livepatch: Skip task_call_func() for current task
>   livepatch,sched: Add livepatch task switching to cond_resched()
>   vhost: Fix livepatch timeouts in vhost_worker()
> 
>  drivers/vhost/vhost.c           |   3 +-
>  include/linux/livepatch.h       |   1 +
>  include/linux/livepatch_sched.h |  29 ++++++++
>  include/linux/sched.h           |  20 ++++--
>  kernel/livepatch/core.c         |   1 +
>  kernel/livepatch/transition.c   | 113 +++++++++++++++++++++++++++-----
>  kernel/sched/core.c             |  64 +++++++++++++++---
>  7 files changed, 200 insertions(+), 31 deletions(-)
>  create mode 100644 include/linux/livepatch_sched.h

Late, so just recording it here...

Acked-by: Miroslav Benes <mbenes@suse.cz>

Thanks for improving the situation, Josh.

M
