* [PATCH 0/5] irq_work: PREEMPT_RT bits
From: Sebastian Andrzej Siewior @ 2021-09-27 21:19 UTC
  To: linux-kernel; +Cc: Peter Zijlstra, Thomas Gleixner

These are the PREEMPT_RT bits for irq_work. The annotation for
IRQ_WORK_HARD_IRQ was already merged except for one missed annotation.
Patch #4 introduces the required processing split between callbacks
which are really invoked from hard-irq context and those which are
delayed until the softirq timer tick. This has been done because quite
a few of them acquire locks which need to sleep on PREEMPT_RT and
therefore must not be invoked in hardirq context. We can not delay all
of them since a few need to be invoked in hardirq context in order to
work properly (NOHZ, scheduler, …).
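
For illustration, the distinction boils down to how a work item is
initialized. A rough sketch (the two callback names are made up):

	/* Runs in hardirq context even on PREEMPT_RT: must not sleep. */
	static struct irq_work hard_work = IRQ_WORK_INIT_HARD(nohz_like_cb);

	/* Default: on PREEMPT_RT deferred to the softirq timer tick, so
	 * the callback may acquire spinlock_t and call kfree(). */
	static struct irq_work soft_work = IRQ_WORK_INIT(cleanup_like_cb);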

Sebastian




* [PATCH 1/5] sched/rt: Annotate the RT balancing logic irqwork as IRQ_WORK_HARD_IRQ
From: Sebastian Andrzej Siewior @ 2021-09-27 21:19 UTC
  To: linux-kernel
  Cc: Peter Zijlstra, Thomas Gleixner, Sebastian Andrzej Siewior,
	Ingo Molnar, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, Mel Gorman,
	Daniel Bristot de Oliveira

The push-IPI logic for RT tasks expects to be invoked from hardirq
context. One reason is that an RT task on the remote CPU would block
the softirq processing on PREEMPT_RT and thus prevent the RT tasks from
being pulled / balanced as intended.

Annotate root_domain::rto_push_work as IRQ_WORK_HARD_IRQ.
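
For reference, IRQ_WORK_INIT_HARD() is the initializer which sets the
IRQ_WORK_HARD_IRQ flag, along the lines of:

	#define IRQ_WORK_INIT_HARD(_func) __IRQ_WORK_INIT(_func, IRQ_WORK_HARD_IRQ)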

Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Ben Segall <bsegall@google.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Daniel Bristot de Oliveira <bristot@redhat.com>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
---
 kernel/sched/topology.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index 4e8698e62f075..3d0157bd4e144 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -526,7 +526,7 @@ static int init_rootdomain(struct root_domain *rd)
 #ifdef HAVE_RT_PUSH_IPI
 	rd->rto_cpu = -1;
 	raw_spin_lock_init(&rd->rto_lock);
-	init_irq_work(&rd->rto_push_work, rto_push_irq_work_func);
+	rd->rto_push_work = IRQ_WORK_INIT_HARD(rto_push_irq_work_func);
 #endif
 
 	rd->visit_gen = 0;
-- 
2.33.0



* [PATCH 2/5] irq_work: Ensure that irq_work runs in in-IRQ context.
From: Sebastian Andrzej Siewior @ 2021-09-27 21:19 UTC
  To: linux-kernel; +Cc: Peter Zijlstra, Thomas Gleixner, Sebastian Andrzej Siewior

The irq-work callback should be invoked in hardirq context and some
callbacks rely on this behaviour. At the time irq_work_run_list() runs,
interrupts should be disabled, but the important part is that the
callback is invoked from in-IRQ context.
The "disabled interrupts" check can also be satisfied by disabling
interrupts from a kworker, which is not the intended context.

Ensure that the callback is invoked from hardirq context and not just
with interrupts disabled.
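
To illustrate the difference, consider this made-up example (not part
of the patch):

	static void example_work_fn(struct work_struct *unused)
	{
		unsigned long flags;

		local_irq_save(flags);
		/* irqs_disabled() is true here, so the old check passes,
		 * but in_hardirq() is false: this is process context. */
		local_irq_restore(flags);
	}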

Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
---
 kernel/irq_work.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/irq_work.c b/kernel/irq_work.c
index db8c248ebc8c8..caf2edffa20d5 100644
--- a/kernel/irq_work.c
+++ b/kernel/irq_work.c
@@ -167,7 +167,7 @@ static void irq_work_run_list(struct llist_head *list)
 	struct irq_work *work, *tmp;
 	struct llist_node *llnode;
 
-	BUG_ON(!irqs_disabled());
+	BUG_ON(!in_hardirq());
 
 	if (llist_empty(list))
 		return;
-- 
2.33.0



* [PATCH 3/5] irq_work: Allow irq_work_sync() to sleep if irq_work() has no IRQ support.
From: Sebastian Andrzej Siewior @ 2021-09-27 21:19 UTC
  To: linux-kernel; +Cc: Peter Zijlstra, Thomas Gleixner, Sebastian Andrzej Siewior

Enqueueing an irq_work instantly triggers an interrupt if the
architecture supports it. Otherwise the work will be processed on the
next timer tick. In the worst case irq_work_sync() could spin for up to
a jiffy.

irq_work_sync() is usually used in tear-down context which is fully
preemptible. Based on review, irq_work_sync() is invoked from
preemptible context and there is only one waiter at a time. This
qualifies it to use rcuwait for synchronisation.

Let irq_work_sync() synchronize via rcuwait if the architecture
processes irqwork via the timer tick.
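
The rcuwait pattern in a nutshell (a sketch; cond_is_true() is a
placeholder):

	struct rcuwait w;

	rcuwait_init(&w);

	/* Waiter side; must be preemptible, at most one waiter: */
	rcuwait_wait_event(&w, cond_is_true(), TASK_UNINTERRUPTIBLE);

	/* Waker side: */
	rcuwait_wake_up(&w);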

Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
---
 include/linux/irq_work.h |  3 +++
 kernel/irq_work.c        | 10 ++++++++++
 2 files changed, 13 insertions(+)

diff --git a/include/linux/irq_work.h b/include/linux/irq_work.h
index ec2a47a81e423..b48955e9c920e 100644
--- a/include/linux/irq_work.h
+++ b/include/linux/irq_work.h
@@ -3,6 +3,7 @@
 #define _LINUX_IRQ_WORK_H
 
 #include <linux/smp_types.h>
+#include <linux/rcuwait.h>
 
 /*
  * An entry can be in one of four states:
@@ -16,11 +17,13 @@
 struct irq_work {
 	struct __call_single_node node;
 	void (*func)(struct irq_work *);
+	struct rcuwait irqwait;
 };
 
 #define __IRQ_WORK_INIT(_func, _flags) (struct irq_work){	\
 	.node = { .u_flags = (_flags), },			\
 	.func = (_func),					\
+	.irqwait = __RCUWAIT_INITIALIZER(irqwait),		\
 }
 
 #define IRQ_WORK_INIT(_func) __IRQ_WORK_INIT(_func, 0)
diff --git a/kernel/irq_work.c b/kernel/irq_work.c
index caf2edffa20d5..853af2cee3612 100644
--- a/kernel/irq_work.c
+++ b/kernel/irq_work.c
@@ -160,6 +160,9 @@ void irq_work_single(void *arg)
 	 * else claimed it meanwhile.
 	 */
 	(void)atomic_cmpxchg(&work->node.a_flags, flags, flags & ~IRQ_WORK_BUSY);
+
+	if (!arch_irq_work_has_interrupt())
+		rcuwait_wake_up(&work->irqwait);
 }
 
 static void irq_work_run_list(struct llist_head *list)
@@ -204,6 +207,13 @@ void irq_work_tick(void)
 void irq_work_sync(struct irq_work *work)
 {
 	lockdep_assert_irqs_enabled();
+	might_sleep();
+
+	if (!arch_irq_work_has_interrupt()) {
+		rcuwait_wait_event(&work->irqwait, !irq_work_is_busy(work),
+				   TASK_UNINTERRUPTIBLE);
+		return;
+	}
 
 	while (irq_work_is_busy(work))
 		cpu_relax();
-- 
2.33.0



* [PATCH 4/5] irq_work: Handle some irq_work in SOFTIRQ on PREEMPT_RT
From: Sebastian Andrzej Siewior @ 2021-09-27 21:19 UTC
  To: linux-kernel; +Cc: Peter Zijlstra, Thomas Gleixner, Sebastian Andrzej Siewior

The irq_work callback is invoked in hard IRQ context. By default all
callbacks are scheduled for invocation right away (if supported by the
architecture) except for the ones marked IRQ_WORK_LAZY, which are
delayed until the next timer tick.

Some of the callbacks acquire locks (spinlock_t, rwlock_t) which are
transformed into sleeping locks on PREEMPT_RT and must not be acquired
in hard IRQ context.
Turning these locks into variants which can be acquired in this context
would lead to other problems, such as increased latencies if everything
in the chain uses IRQ-off locks. It would also not solve all the
issues: one callback has been noticed which invokes kref_put(), whose
release callback invokes kfree(), and kfree() can not be invoked in
hardirq context.

Some callbacks are required to be invoked in hardirq context even on
PREEMPT_RT in order to work properly. This includes for instance the
NO_HZ callback which needs to be able to observe the idle context.

The callbacks which are required to run in hardirq context have already
been marked. Use this information to split the callbacks across the two
lists on PREEMPT_RT:
- lazy_list
  Work items which are not marked with IRQ_WORK_HARD_IRQ will be added
  to this list. Callbacks on this list will be invoked from the timer
  softirq handler. The handler here may acquire sleeping locks such as
  spinlock_t and invoke kfree().

- raised_list
  Work items which are marked with IRQ_WORK_HARD_IRQ will be added to
  this list. They will be invoked in hardirq context and must not
  acquire any sleeping locks.
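
As an example, a (made-up) callback of the first kind could then look
like this on PREEMPT_RT:

	/* Hypothetical callback, invoked from the timer softirq on RT. */
	static void foo_cleanup_cb(struct irq_work *work)
	{
		struct foo *f = container_of(work, struct foo, work);

		spin_lock(&f->lock);	/* sleeping lock on RT: fine here */
		list_del(&f->node);
		spin_unlock(&f->lock);
		kfree(f);		/* must not happen in hardirq on RT */
	}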

[bigeasy: folded in tglx's irq_work_tick_soft() which splits
          irq_work_tick() into a hard and soft variant. Collected fixes
          over time from Steven Rostedt and Mike Galbraith.]
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
---
 include/linux/irq_work.h |  6 +++++
 kernel/irq_work.c        | 58 ++++++++++++++++++++++++++++++++--------
 kernel/time/timer.c      |  2 ++
 3 files changed, 55 insertions(+), 11 deletions(-)

diff --git a/include/linux/irq_work.h b/include/linux/irq_work.h
index b48955e9c920e..d65e34c8d0fd9 100644
--- a/include/linux/irq_work.h
+++ b/include/linux/irq_work.h
@@ -67,4 +67,10 @@ static inline void irq_work_run(void) { }
 static inline void irq_work_single(void *arg) { }
 #endif
 
+#if defined(CONFIG_IRQ_WORK) && defined(CONFIG_PREEMPT_RT)
+void irq_work_tick_soft(void);
+#else
+static inline void irq_work_tick_soft(void) { }
+#endif
+
 #endif /* _LINUX_IRQ_WORK_H */
diff --git a/kernel/irq_work.c b/kernel/irq_work.c
index 853af2cee3612..55c4206b3ad6f 100644
--- a/kernel/irq_work.c
+++ b/kernel/irq_work.c
@@ -18,6 +18,7 @@
 #include <linux/cpu.h>
 #include <linux/notifier.h>
 #include <linux/smp.h>
+#include <linux/interrupt.h>
 #include <asm/processor.h>
 #include <linux/kasan.h>
 
@@ -52,13 +53,27 @@ void __weak arch_irq_work_raise(void)
 /* Enqueue on current CPU, work must already be claimed and preempt disabled */
 static void __irq_work_queue_local(struct irq_work *work)
 {
-	/* If the work is "lazy", handle it from next tick if any */
-	if (atomic_read(&work->node.a_flags) & IRQ_WORK_LAZY) {
-		if (llist_add(&work->node.llist, this_cpu_ptr(&lazy_list)) &&
-		    tick_nohz_tick_stopped())
-			arch_irq_work_raise();
-	} else {
-		if (llist_add(&work->node.llist, this_cpu_ptr(&raised_list)))
+	struct llist_head *list;
+	bool lazy_work;
+	int work_flags;
+
+	work_flags = atomic_read(&work->node.a_flags);
+	if (work_flags & IRQ_WORK_LAZY)
+		lazy_work = true;
+	else if (IS_ENABLED(CONFIG_PREEMPT_RT) &&
+		 !(work_flags & IRQ_WORK_HARD_IRQ))
+		lazy_work = true;
+	else
+		lazy_work = false;
+
+	if (lazy_work)
+		list = this_cpu_ptr(&lazy_list);
+	else
+		list = this_cpu_ptr(&raised_list);
+
+	if (llist_add(&work->node.llist, list)) {
+		/* If the work is "lazy", handle it from next tick if any */
+		if (!lazy_work || tick_nohz_tick_stopped())
 			arch_irq_work_raise();
 	}
 }
@@ -104,7 +119,13 @@ bool irq_work_queue_on(struct irq_work *work, int cpu)
 	if (cpu != smp_processor_id()) {
 		/* Arch remote IPI send/receive backend aren't NMI safe */
 		WARN_ON_ONCE(in_nmi());
-		__smp_call_single_queue(cpu, &work->node.llist);
+
+		if (IS_ENABLED(CONFIG_PREEMPT_RT) && !(atomic_read(&work->node.a_flags) & IRQ_WORK_HARD_IRQ)) {
+			if (llist_add(&work->node.llist, &per_cpu(lazy_list, cpu)))
+				arch_send_call_function_single_ipi(cpu);
+		} else {
+			__smp_call_single_queue(cpu, &work->node.llist);
+		}
 	} else {
 		__irq_work_queue_local(work);
 	}
@@ -121,7 +142,6 @@ bool irq_work_needs_cpu(void)
 
 	raised = this_cpu_ptr(&raised_list);
 	lazy = this_cpu_ptr(&lazy_list);
-
 	if (llist_empty(raised) || arch_irq_work_has_interrupt())
 		if (llist_empty(lazy))
 			return false;
@@ -170,7 +190,11 @@ static void irq_work_run_list(struct llist_head *list)
 	struct irq_work *work, *tmp;
 	struct llist_node *llnode;
 
-	BUG_ON(!in_hardirq());
+	/*
+	 * On PREEMPT_RT IRQ-work may run in SOFTIRQ context if it is not marked
+	 * explicitly that it needs to run in hardirq context.
+	 */
+	BUG_ON(!in_hardirq() && !IS_ENABLED(CONFIG_PREEMPT_RT));
 
 	if (llist_empty(list))
 		return;
@@ -187,7 +211,10 @@ static void irq_work_run_list(struct llist_head *list)
 void irq_work_run(void)
 {
 	irq_work_run_list(this_cpu_ptr(&raised_list));
-	irq_work_run_list(this_cpu_ptr(&lazy_list));
+	if (!IS_ENABLED(CONFIG_PREEMPT_RT))
+		irq_work_run_list(this_cpu_ptr(&lazy_list));
+	else if (!llist_empty(this_cpu_ptr(&lazy_list)))
+		raise_softirq(TIMER_SOFTIRQ);
 }
 EXPORT_SYMBOL_GPL(irq_work_run);
 
@@ -197,8 +224,17 @@ void irq_work_tick(void)
 
 	if (!llist_empty(raised) && !arch_irq_work_has_interrupt())
 		irq_work_run_list(raised);
+
+	if (!IS_ENABLED(CONFIG_PREEMPT_RT))
+		irq_work_run_list(this_cpu_ptr(&lazy_list));
+}
+
+#if defined(CONFIG_IRQ_WORK) && defined(CONFIG_PREEMPT_RT)
+void irq_work_tick_soft(void)
+{
 	irq_work_run_list(this_cpu_ptr(&lazy_list));
 }
+#endif
 
 /*
  * Synchronize against the irq_work @entry, ensures the entry is not
diff --git a/kernel/time/timer.c b/kernel/time/timer.c
index e3d2c23c413d4..fb235b3e91b3e 100644
--- a/kernel/time/timer.c
+++ b/kernel/time/timer.c
@@ -1744,6 +1744,8 @@ static __latent_entropy void run_timer_softirq(struct softirq_action *h)
 {
 	struct timer_base *base = this_cpu_ptr(&timer_bases[BASE_STD]);
 
+	irq_work_tick_soft();
+
 	__run_timers(base);
 	if (IS_ENABLED(CONFIG_NO_HZ_COMMON))
 		__run_timers(this_cpu_ptr(&timer_bases[BASE_DEF]));
-- 
2.33.0



* [PATCH 5/5] irq_work: Also rcuwait for !IRQ_WORK_HARD_IRQ on PREEMPT_RT
From: Sebastian Andrzej Siewior @ 2021-09-27 21:19 UTC
  To: linux-kernel; +Cc: Peter Zijlstra, Thomas Gleixner, Sebastian Andrzej Siewior

On PREEMPT_RT most items are processed as LAZY via softirq context.
Avoid spin-waiting for them because the task invoking irq_work_sync()
could run at a higher priority and thus never allow the irq-work to
complete.

Use rcuwait additionally for !IRQ_WORK_HARD_IRQ irq_work items on
PREEMPT_RT.
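
A tear-down path can then simply wait and free; a sketch (struct foo is
made up):

	/* Fully preemptible tear-down context; may sleep on PREEMPT_RT. */
	irq_work_sync(&f->work);
	kfree(f);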

Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
---
 include/linux/irq_work.h | 5 +++++
 kernel/irq_work.c        | 6 ++++--
 2 files changed, 9 insertions(+), 2 deletions(-)

diff --git a/include/linux/irq_work.h b/include/linux/irq_work.h
index d65e34c8d0fd9..3ae8ba3fad3af 100644
--- a/include/linux/irq_work.h
+++ b/include/linux/irq_work.h
@@ -49,6 +49,11 @@ static inline bool irq_work_is_busy(struct irq_work *work)
 	return atomic_read(&work->node.a_flags) & IRQ_WORK_BUSY;
 }
 
+static inline bool irq_work_is_hard(struct irq_work *work)
+{
+	return atomic_read(&work->node.a_flags) & IRQ_WORK_HARD_IRQ;
+}
+
 bool irq_work_queue(struct irq_work *work);
 bool irq_work_queue_on(struct irq_work *work, int cpu);
 
diff --git a/kernel/irq_work.c b/kernel/irq_work.c
index 55c4206b3ad6f..ee27f56381ee2 100644
--- a/kernel/irq_work.c
+++ b/kernel/irq_work.c
@@ -181,7 +181,8 @@ void irq_work_single(void *arg)
 	 */
 	(void)atomic_cmpxchg(&work->node.a_flags, flags, flags & ~IRQ_WORK_BUSY);
 
-	if (!arch_irq_work_has_interrupt())
+	if ((IS_ENABLED(CONFIG_PREEMPT_RT) && !irq_work_is_hard(work)) ||
+	    !arch_irq_work_has_interrupt())
 		rcuwait_wake_up(&work->irqwait);
 }
 
@@ -245,7 +246,8 @@ void irq_work_sync(struct irq_work *work)
 	lockdep_assert_irqs_enabled();
 	might_sleep();
 
-	if (!arch_irq_work_has_interrupt()) {
+	if ((IS_ENABLED(CONFIG_PREEMPT_RT) && !irq_work_is_hard(work)) ||
+	    !arch_irq_work_has_interrupt()) {
 		rcuwait_wait_event(&work->irqwait, !irq_work_is_busy(work),
 				   TASK_UNINTERRUPTIBLE);
 		return;
-- 
2.33.0



* Re: [PATCH 4/5] irq_work: Handle some irq_work in SOFTIRQ on PREEMPT_RT
From: Peter Zijlstra @ 2021-09-30  9:07 UTC
  To: Sebastian Andrzej Siewior; +Cc: linux-kernel, Thomas Gleixner, Jiri Olsa

On Mon, Sep 27, 2021 at 11:19:18PM +0200, Sebastian Andrzej Siewior wrote:
> The irq_work callback is invoked in hard IRQ context. By default all
> callbacks are scheduled for invocation right away (if supported by the
> architecture) except for the ones marked IRQ_WORK_LAZY, which are
> delayed until the next timer tick.
> 
> Some of the callbacks acquire locks (spinlock_t, rwlock_t) which are
> transformed into sleeping locks on PREEMPT_RT and must not be acquired
> in hard IRQ context.
> Turning these locks into variants which can be acquired in this context
> would lead to other problems, such as increased latencies if everything
> in the chain uses IRQ-off locks. It would also not solve all the
> issues: one callback has been noticed which invokes kref_put(), whose
> release callback invokes kfree(), and kfree() can not be invoked in
> hardirq context.
> 
> Some callbacks are required to be invoked in hardirq context even on
> PREEMPT_RT in order to work properly. This includes for instance the
> NO_HZ callback which needs to be able to observe the idle context.
> 
> The callbacks which are required to run in hardirq context have already
> been marked. Use this information to split the callbacks across the two
> lists on PREEMPT_RT:
> - lazy_list
>   Work items which are not marked with IRQ_WORK_HARD_IRQ will be added
>   to this list. Callbacks on this list will be invoked from the timer
>   softirq handler. The handler here may acquire sleeping locks such as
>   spinlock_t and invoke kfree().
> 
> - raised_list
>   Work items which are marked with IRQ_WORK_HARD_IRQ will be added to
>   this list. They will be invoked in hardirq context and must not
>   acquire any sleeping locks.
> 
> [bigeasy: folded in tglx's irq_work_tick_soft() which splits
>           irq_work_tick() into a hard and soft variant. Collected fixes
>           over time from Steven Rostedt and Mike Galbraith.]
> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>

IIRC we have existing problems in -RT due to this irq_work softirq muck.

I think the problem was something Jolsa found a while ago, where perf
defers to an irq_work (from NMI context) and that irq_work wants to
deliver signals, which it can't on -RT, so the whole thing gets punted
to softirq. With the end-result that if you self-profile RT tasks,
things come apart or something.

There might have been others as well, I don't know. But generally I
think we want *less* softirq, not more.




* Re: [PATCH 4/5] irq_work: Handle some irq_work in SOFTIRQ on PREEMPT_RT
From: Sebastian Andrzej Siewior @ 2021-09-30  9:53 UTC
  To: Peter Zijlstra; +Cc: linux-kernel, Thomas Gleixner, Jiri Olsa

On 2021-09-30 11:07:18 [+0200], Peter Zijlstra wrote:
> 
> IIRC we have existing problems in -RT due to this irq_work softirq muck.

We have existing problems in -RT due to irq_work being used without
knowing the consequences.

> I think the problem was something Jolsa found a while ago, where perf
> defers to an irq_work (from NMI context) and that irq_work wants to
> deliver signals, which it can't on -RT, so the whole thing gets punted
> to softirq. With the end-result that if you self-profile RT tasks,
> things come apart or something.

For signals (at least on x86) we have this ARCH_RT_DELAYS_SIGNAL_SEND thingy
where the signal is delayed until exit_to_user_mode_loop().

perf_pending_event() is the only non-HARD on RT (on the perf side). I
think that is due to perf_event_wakeup() where we have wake_up_all() and
read_lock_irqsave().

> There might have been others as well, I don't know. But generally I
> think we want *less* softirq, not more.

I agree. The anonymous softirqs concept brings problems of its own.
But what should I do with things like these:
- kernel/trace/ring_buffer.c rb_wake_up_waiters()
  kernel/bpf/ringbuf.c bpf_ringbuf_notify()
   wake_up_all()

- drivers/acpi/apei/ghes.c ghes_proc_in_irq()
  spinlock_t
 
- drivers/dma-buf/dma-fence-array.c irq_dma_fence_array_work()
  spinlock_t, callbacks, potential kfree().

- drivers/gpu/drm/i915/gt/intel_breadcrumbs.c signal_irq_work()
  spinlock_t, rpm_put() -> wake_up_var(), callbacks, kref_put() like
  constructs which may free memory.

I didn't look at _all_ of them but just a few. And the only one I looked
at and didn't add to the list was
    drivers/edac/igen6_edac.c ecclog_irq_work_cb()

which simply reads PCI registers (which acquires only raw_spinlock_t;
although, now that I looked around, those are raw_spinlock_t only on
x86) and does schedule_work(). All harmless.

I *think* the irq_work in printk is going to go away once John is done
with it. But there are way more of these things in the kernel now
compared to when I first pushed them into softirq because they were
causing trouble.

Sebastian


* Re: [PATCH 4/5] irq_work: Handle some irq_work in SOFTIRQ on PREEMPT_RT
From: Peter Zijlstra @ 2021-09-30 14:39 UTC
  To: Sebastian Andrzej Siewior; +Cc: linux-kernel, Thomas Gleixner, Jiri Olsa

On Thu, Sep 30, 2021 at 11:53:48AM +0200, Sebastian Andrzej Siewior wrote:
> On 2021-09-30 11:07:18 [+0200], Peter Zijlstra wrote:
> > 
> > IIRC we have existing problems in -RT due to this irq_work softirq muck.
> 
> We have existing problems in -RT due to irq_work being used without
> knowing the consequences.

Obviously :-)

> > I think the problem was something Jolsa found a while ago, where perf
> > defers to an irq_work (from NMI context) and that irq_work wants to
> > deliver signals, which it can't on -RT, so the whole thing gets punted
> > to softirq. With the end-result that if you self-profile RT tasks,
> > things come apart or something.
> 
> For signals (at least on x86) we have this ARCH_RT_DELAYS_SIGNAL_SEND thingy
> where the signal is delayed until exit_to_user_mode_loop().

Yeah, I think that is what started much of the entry rework.. the signal
rework is still pending.

> perf_pending_event() is the only non-HARD on RT (on the perf side). I
> think that is due to perf_event_wakeup() where we have wake_up_all() and

Right, and that is exactly the problem, that needs to run at a higher
prio than the task that needs it, but softirq makes that 'difficult'.

One possible 'solution' would be to, instead of softirq, run the thing
as a kthread (worker or otherwise) such that userspace can at least set
the priority and has a small chance of making it work.

Running them all at the same prio still sucks (much like the single
net-RX thing), but at least a kthread is somewhat controllable.

> read_lock_irqsave().

That one is really vexing, that really is just signal delivery to self
but even when signal stuff is fixed, we're stuck behind that fasync
rwlock :/


* Re: [PATCH 4/5] irq_work: Handle some irq_work in SOFTIRQ on PREEMPT_RT
From: Sebastian Andrzej Siewior @ 2021-09-30 16:38 UTC
  To: Peter Zijlstra; +Cc: linux-kernel, Thomas Gleixner, Jiri Olsa

On 2021-09-30 16:39:51 [+0200], Peter Zijlstra wrote:
> > > I think the problem was something Jolsa found a while ago, where perf
> > > defers to an irq_work (from NMI context) and that irq_work wants to
> > > deliver signals, which it can't on -RT, so the whole thing gets punted
> > > to softirq. With the end-result that if you self-profile RT tasks,
> > > things come apart or something.
> > 
> > For signals (at least on x86) we have this ARCH_RT_DELAYS_SIGNAL_SEND thingy
> > where the signal is delayed until exit_to_user_mode_loop().
> 
> Yeah, I think that is what started much of the entry rework.. the signal
> rework is still pending.

posix timers were also guilty here :)

> > perf_pending_event() is the only non-HARD on RT (on the perf side). I
> > think that is due to perf_event_wakeup() where we have wake_up_all() and
> 
> Right, and that is exactly the problem, that needs to run at a higher
> prio than the task that needs it, but softirq makes that 'difficult'.
> 
> One possible 'solution' would be to, instead of softirq, run the thing
> as a kthread (worker or otherwise) such that userspace can at least set
> the priority and has a small chance of making it work.
>
> Running them all at the same prio still sucks (much like the single
> net-RX thing), but at least a kthread is somewhat controllable.

I could replace the softirq processing with a per-CPU thread. This
should work. But I would have to (still) delay the wake-up of the thread
to the timer tick - or - we try the wake from the irqwork-self-IPI. I
just don't know how many will arrive back-to-back. The RCU callback
(rcu_preempt_deferred_qs_handler()) pops up a lot. By my naive guesswork
I would say that the irqwork is not needed since preempt-enable
somewhere should do the needed scheduling. But then commit
  0864f057b050b ("rcu: Use irq_work to get scheduler's attention in clean context")

claims it is not enough.
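
Something like this perhaps (a completely untested sketch against the
smpboot per-CPU thread API, names invented):

	static DEFINE_PER_CPU(struct task_struct *, irq_workd);

	static int irq_workd_should_run(unsigned int cpu)
	{
		return !llist_empty(this_cpu_ptr(&lazy_list));
	}

	static void run_irq_workd(unsigned int cpu)
	{
		irq_work_run_list(this_cpu_ptr(&lazy_list));
	}

	static struct smp_hotplug_thread irq_work_threads = {
		.store			= &irq_workd,
		.thread_should_run	= irq_workd_should_run,
		.thread_fn		= run_irq_workd,
		.thread_comm		= "irq_work/%u",
	};

	/* registered via smpboot_register_percpu_thread(&irq_work_threads);
	 * woken from hardirq via wake_up_process(__this_cpu_read(irq_workd)) */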

> > read_lock_irqsave().
> 
> That one is really vexing, that really is just signal delivery to self
> but even when signal stuff is fixed, we're stuck behind that fasync
> rwlock :/

Yea. We are already in a RCU section and then this.

Sebastian


* Re: [PATCH 4/5] irq_work: Handle some irq_work in SOFTIRQ on PREEMPT_RT
From: Peter Zijlstra @ 2021-10-01 10:32 UTC
  To: Sebastian Andrzej Siewior; +Cc: linux-kernel, Thomas Gleixner, Jiri Olsa

On Thu, Sep 30, 2021 at 06:38:58PM +0200, Sebastian Andrzej Siewior wrote:
> On 2021-09-30 16:39:51 [+0200], Peter Zijlstra wrote:

> > Running them all at the same prio still sucks (much like the single
> > net-RX thing), but at least a kthread is somewhat controllable.
> 
> I could replace the softirq processing with a per-CPU thread. This
> should work. But I would have to (still) delay the wake-up of the thread
> to the timer tick - or - we try the wake from the irqwork-self-IPI.

That, just wake the thread from the hardirq.

> I
> just don't know how many will arrive back-to-back. The RCU callback
> (rcu_preempt_deferred_qs_handler()) pops up a lot. By my naive guesswork
> I would say that the irqwork is not needed since preempt-enable
> somewhere should do the needed scheduling. But then commit
>   0864f057b050b ("rcu: Use irq_work to get scheduler's attention in clean context")
> 
> claims it is not enough.

Oh gawd, that was something really nasty. I'm not sure that Changelog
captures it all (at least I'm not sure I fully understand the problem
when reading it again).

But basically that thing wants to reschedule, but suffers the same
problem as:

	preempt_disable();

	<TIF_NEED_RESCHED gets set>

	local_irq_disable();
	preempt_enable();
	  // cannea schedule because IRQs are disabled
	local_irq_enable();
	// lost a reschedule


Yes, that will _eventually_ reschedule, but it violates the PREEMPT
rules because there is an unspecified amount of time until it actually
does reschedule.

So what RCU does there is basically trigger a self-IPI, which guarantees
that we reschedule after IRQs are finally enabled, which then triggers a
resched.
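
IIRC the RCU side does something along these lines (going from memory):

	/* rcu_read_unlock_special(), when the QS can't be reported directly: */
	init_irq_work(&rdp->defer_qs_iw, rcu_preempt_deferred_qs_handler);
	rdp->defer_qs_iw_pending = true;
	irq_work_queue_on(&rdp->defer_qs_iw, rdp->cpu);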

I see no problem marking that particular irq_work as HARD tho, it really
doesn't do anything (other than tell RCU the GP is no longer blocked)
and triggering the return-from-interrupt path.

There's also a fun comment in perf_lock_task_context() that possibly
predates the above RCU fix.


* Re: [PATCH 4/5] irq_work: Handle some irq_work in SOFTIRQ on PREEMPT_RT
From: Sebastian Andrzej Siewior @ 2021-10-01 12:08 UTC
  To: Peter Zijlstra; +Cc: linux-kernel, Thomas Gleixner, Jiri Olsa

On 2021-10-01 12:32:38 [+0200], Peter Zijlstra wrote:
> On Thu, Sep 30, 2021 at 06:38:58PM +0200, Sebastian Andrzej Siewior wrote:
> > On 2021-09-30 16:39:51 [+0200], Peter Zijlstra wrote:
> 
> > > Running them all at the same prio still sucks (much like the single
> > > net-RX thing), but at least a kthread is somewhat controllable.
> > 
> > I could replace the softirq processing with a per-CPU thread. This
> > should work. But I would have to (still) delay the wake-up of the thread
> > to the timer tick - or - we try the wake from the irqwork-self-IPI.
> 
> That, just wake the thread from the hardirq.

"just". Let me do that and see how bad it gets ;)

> > I
> > just don't know how many will arrive back-to-back. The RCU callback
> > (rcu_preempt_deferred_qs_handler()) pops up a lot. By my naive guesswork
> > I would say that the irqwork is not needed since preempt-enable
> > somewhere should do the needed scheduling. But then commit
> >   0864f057b050b ("rcu: Use irq_work to get scheduler's attention in clean context")
> > 
> > claims it is not enough.
> 
> Oh gawd, that was something really nasty. I'm not sure that Changelog
> captures it all (at least I'm not sure I fully understand the problem
> when reading it again).
> 
> But basically that thing wants to reschedule, but suffers the same
> problem as:
> 
> 	preempt_disable();
> 
> 	<TIF_NEED_RESCHED gets set>
> 
> 	local_irq_disable();
> 	preempt_enable();
> 	  // cannea schedule because IRQs are disabled
> 	local_irq_enable();
> 	// lost a reschedule
> 
> 
> Yes, that will _eventually_ reschedule, but it violates the PREEMPT
> rules because there is an unspecified amount of time until it actually
> does reschedule.

Yeah but buh. We could let local_irq_enable/restore() check that
need-resched bit if the above is considered pretty and supported _or_
start to yell if it is not. A middle way would be to trigger that
self-IPI in such a case. I mean everyone suffers from that lost
reschedule and, if I'm not mistaken, you don't receive a remote wakeup
because the remote CPU notices need-resched bit and assumes that it is
about to be handled. So RCU isn't special here.

> So what RCU does there is basically trigger a self-IPI, which guarantees
> that we reschedule after IRQs are finally enabled, which then triggers a
> resched.
> 
> I see no problem marking that particular irq_work as HARD tho, it really
> doesn't do anything (other than tell RCU the GP is no longer blocked)
> and triggering the return-from-interrupt path.

Hmm. Your Highness. I'm going back to my peasant village to build the
thread you asked for. I will look into this. I see two of those
irq-work things: the scheduler thingy and this one.

Thanks.

> There's also a fun comment in perf_lock_task_context() that possibly
> predates the above RCU fix.

Sebastian


* Re: [PATCH 4/5] irq_work: Handle some irq_work in SOFTIRQ on PREEMPT_RT
From: Peter Zijlstra @ 2021-10-01 13:40 UTC
  To: Sebastian Andrzej Siewior
  Cc: linux-kernel, Thomas Gleixner, Jiri Olsa, Paul McKenney

On Fri, Oct 01, 2021 at 02:08:55PM +0200, Sebastian Andrzej Siewior wrote:
> On 2021-10-01 12:32:38 [+0200], Peter Zijlstra wrote:

> > But basically that thing wants to reschedule, but suffers the same
> > problem as:
> > 
> > 	preempt_disable();
> > 
> > 	<TIF_NEED_RESCHED gets set>
> > 
> > 	local_irq_disable();
> > 	preempt_enable();
> > 	  // cannea schedule because IRQs are disabled
> > 	local_irq_enable();
> > 	// lost a reschedule
> > 
> > 
> > Yes, that will _eventually_ reschedule, but it violates the PREEMPT
> > rules because there is an unspecified amount of time until it actually
> > does reschedule.
> 
> Yeah but buh. We could let local_irq_enable/restore() check that
> need-resched bit if the above is considered pretty and supported _or_
> start to yell if it is not. A middle way would be to trigger that
> self-IPI in such a case. I mean everyone suffers from that lost
> reschedule and, if I'm not mistaken, you don't receive a remote wakeup
> because the remote CPU notices need-resched bit and assumes that it is
> about to be handled. So RCU isn't special here.

Mostly the above pattern isn't 'allowed', but it does tend to happen
with RCU quite a bit.

As per the perf code, I'm actually fine if RCU wouldn't do this. But
Paul feels that he needs to cater for it -- doesn't want to surprise his
users.

Fixing this in local_irq_enable() would blow up the code quite a bit.

I'm not sure it's something we can sanely warn about either; the case
for the remote reschedule IPI could cause false positives.


* Re: [PATCH 2/5] irq_work: Ensure that irq_work runs in in-IRQ context.
From: Sebastian Andrzej Siewior @ 2021-10-05 15:48 UTC
  To: linux-kernel; +Cc: Peter Zijlstra, Thomas Gleixner

On 2021-09-27 23:19:16 [+0200], To linux-kernel@vger.kernel.org wrote:
> The irq-work callback should be invoked in hardirq context and some
> callbacks rely on this behaviour. At the time irq_work_run_list() runs,
> interrupts should be disabled, but the important part is that the
> callback is invoked from in-IRQ context.
> The "disabled interrupts" check can also be satisfied by disabling
> interrupts from a kworker, which is not the intended context.
> 
> Ensure that the callback is invoked from hardirq context and not just
> with interrupts disabled.

As noted by lkp, this triggers from smpcfd_dying_cpu().
Do we care enough to change this or should I rather drop this one?

> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
> ---
>  kernel/irq_work.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/kernel/irq_work.c b/kernel/irq_work.c
> index db8c248ebc8c8..caf2edffa20d5 100644
> --- a/kernel/irq_work.c
> +++ b/kernel/irq_work.c
> @@ -167,7 +167,7 @@ static void irq_work_run_list(struct llist_head *list)
>  	struct irq_work *work, *tmp;
>  	struct llist_node *llnode;
>  
> -	BUG_ON(!irqs_disabled());
> +	BUG_ON(!in_hardirq());
>  
>  	if (llist_empty(list))
>  		return;

Sebastian


* Re: [PATCH 2/5] irq_work: Ensure that irq_work runs in in-IRQ context.
From: Peter Zijlstra @ 2021-10-05 20:11 UTC
  To: Sebastian Andrzej Siewior; +Cc: linux-kernel, Thomas Gleixner

On Tue, Oct 05, 2021 at 05:48:27PM +0200, Sebastian Andrzej Siewior wrote:
> On 2021-09-27 23:19:16 [+0200], To linux-kernel@vger.kernel.org wrote:
> > The irq-work callback should be invoked in hardirq context and some
> > callbacks rely on this behaviour. At the time irq_work_run_list() runs,
> > interrupts should be disabled, but the important part is that the
> > callback is invoked from in-IRQ context.
> > The "disabled interrupts" check can also be satisfied by disabling
> > interrupts from a kworker, which is not the intended context.
> > 
> > Ensure that the callback is invoked from hardirq context and not just
> > with interrupts disabled.
> 
> As noted by lkp, this triggers from smpcfd_dying_cpu().

It lives then? I don't think I've had it report on my trees in about a
week :/

> Do we care enough to change this or should I rather drop this one?

Drop it for now I suppose...
