* [PATCH v3 0/4] perf: Make SIGTRAP and __perf_pending_irq() work on RT.
@ 2024-03-22  6:48 Sebastian Andrzej Siewior
  2024-03-22  6:48 ` [PATCH v3 1/4] perf: Move irq_work_queue() where the event is prepared Sebastian Andrzej Siewior
                   ` (3 more replies)
  0 siblings, 4 replies; 25+ messages in thread
From: Sebastian Andrzej Siewior @ 2024-03-22  6:48 UTC (permalink / raw)
  To: linux-perf-users, linux-kernel
  Cc: Adrian Hunter, Alexander Shishkin, Arnaldo Carvalho de Melo,
	Ian Rogers, Ingo Molnar, Jiri Olsa, Marco Elver, Mark Rutland,
	Namhyung Kim, Peter Zijlstra, Thomas Gleixner

Hi,

Arnaldo reported that "perf test sigtrap" fails on PREEMPT_RT. Sending
the signal gets delayed until event_sched_out(), which then uses
task_work_add() for its delivery. This breaks on PREEMPT_RT because the
signal is delivered with preemption disabled.

While looking at this, I also stumbled upon __perf_pending_irq(), which
requires interrupts to be disabled, but that is not the case on
PREEMPT_RT.

This series aims to address both issues without introducing new issues
at the same time ;)
Any testing is appreciated.

v2…v3: https://lore.kernel.org/all/20240312180814.3373778-1-bigeasy@linutronix.de/
    - Marco suggested adding a few comments
      - Added a comment to __perf_event_overflow() to explain why irq_work
        is raised in the in_nmi() case.
      - Added a comment to perf_event_exit_event() to explain why the
        pending event is deleted.

v1…v2: https://lore.kernel.org/all/20240308175810.2894694-1-bigeasy@linutronix.de/
    - Marco pointed me to the testsuite that showed two problems:
      - Delayed task_work from NMI / missing events.
        Fixed by triggering a dummy irq_work to enforce an interrupt for
        the exit-to-userland path, which checks task_work.
      - Increased ref-count on cleanup / during exec.
        Mostly addressed by the former change. There is still a window
        if the NMI occurs during execve(). This is addressed by removing
        the task_work before free_event().
      The testsuite (remove_on_exec) sometimes fails if the event/
      SIGTRAP is sent before the sighandler is installed.

Sebastian



* [PATCH v3 1/4] perf: Move irq_work_queue() where the event is prepared.
  2024-03-22  6:48 [PATCH v3 0/4] perf: Make SIGTRAP and __perf_pending_irq() work on RT Sebastian Andrzej Siewior
@ 2024-03-22  6:48 ` Sebastian Andrzej Siewior
  2024-03-22  6:48 ` [PATCH v3 2/4] perf: Enqueue SIGTRAP always via task_work Sebastian Andrzej Siewior
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 25+ messages in thread
From: Sebastian Andrzej Siewior @ 2024-03-22  6:48 UTC (permalink / raw)
  To: linux-perf-users, linux-kernel
  Cc: Adrian Hunter, Alexander Shishkin, Arnaldo Carvalho de Melo,
	Ian Rogers, Ingo Molnar, Jiri Olsa, Marco Elver, Mark Rutland,
	Namhyung Kim, Peter Zijlstra, Thomas Gleixner,
	Sebastian Andrzej Siewior, Arnaldo Carvalho de Melo

Only if perf_event::pending_sigtrap is zero is the irq_work accounted
for by incrementing perf_event_context::nr_pending. The member
perf_event::pending_addr might be overwritten by a subsequent event if
the signal has not yet been delivered; this is expected. The irq_work
will not be enqueued again because it has a check to only be enqueued
once.

Move irq_work_queue() to where the counter is incremented and
perf_event::pending_sigtrap is set to make it more obvious that the
irq_work is scheduled once.
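
For reference, the "only enqueued once" behaviour is provided by
irq_work_claim(); roughly (simplified from kernel/irq_work.c, not the
exact upstream code):

	bool irq_work_queue(struct irq_work *work)
	{
		/* Claiming fails if IRQ_WORK_PENDING is already set. */
		if (!irq_work_claim(work))
			return false;

		/* First enqueue: add to the per-CPU list, raise the IPI if needed. */
		__irq_work_queue_local(work);
		return true;
	}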

Tested-by: Marco Elver <elver@google.com>
Tested-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
---
 kernel/events/core.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/events/core.c b/kernel/events/core.c
index f0f0f71213a1d..c7a0274c662c8 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -9595,6 +9595,7 @@ static int __perf_event_overflow(struct perf_event *event,
 		if (!event->pending_sigtrap) {
 			event->pending_sigtrap = pending_id;
 			local_inc(&event->ctx->nr_pending);
+			irq_work_queue(&event->pending_irq);
 		} else if (event->attr.exclude_kernel && valid_sample) {
 			/*
 			 * Should not be able to return to user space without
@@ -9614,7 +9615,6 @@ static int __perf_event_overflow(struct perf_event *event,
 		event->pending_addr = 0;
 		if (valid_sample && (data->sample_flags & PERF_SAMPLE_ADDR))
 			event->pending_addr = data->addr;
-		irq_work_queue(&event->pending_irq);
 	}
 
 	READ_ONCE(event->overflow_handler)(event, data, regs);
-- 
2.43.0



* [PATCH v3 2/4] perf: Enqueue SIGTRAP always via task_work.
  2024-03-22  6:48 [PATCH v3 0/4] perf: Make SIGTRAP and __perf_pending_irq() work on RT Sebastian Andrzej Siewior
  2024-03-22  6:48 ` [PATCH v3 1/4] perf: Move irq_work_queue() where the event is prepared Sebastian Andrzej Siewior
@ 2024-03-22  6:48 ` Sebastian Andrzej Siewior
  2024-04-08 21:29   ` Frederic Weisbecker
  2024-03-22  6:48 ` [PATCH v3 3/4] perf: Remove perf_swevent_get_recursion_context() from perf_pending_task() Sebastian Andrzej Siewior
  2024-03-22  6:48 ` [PATCH v3 4/4] perf: Split __perf_pending_irq() out of perf_pending_irq() Sebastian Andrzej Siewior
  3 siblings, 1 reply; 25+ messages in thread
From: Sebastian Andrzej Siewior @ 2024-03-22  6:48 UTC (permalink / raw)
  To: linux-perf-users, linux-kernel
  Cc: Adrian Hunter, Alexander Shishkin, Arnaldo Carvalho de Melo,
	Ian Rogers, Ingo Molnar, Jiri Olsa, Marco Elver, Mark Rutland,
	Namhyung Kim, Peter Zijlstra, Thomas Gleixner,
	Sebastian Andrzej Siewior, Arnaldo Carvalho de Melo

A signal is delivered by raising an irq_work, which works from any
context including NMI. The irq_work can be delayed if the architecture
does not provide an interrupt vector. In order not to lose a signal, the
signal is injected via task_work during event_sched_out().

Instead of going via irq_work, the signal can be added directly via
task_work. The signal is sent to current and can be enqueued on its
return path to userland instead of triggering irq_work. A dummy IRQ is
required in the NMI case to ensure the task_work is handled before
returning to userland; irq_work is used for this. An alternative would
be to just raise an interrupt, as arch_send_call_function_single_ipi()
does.

During testing with `remove_on_exec' it became visible that the event
can be enqueued via NMI during execve(). The task_work must not be kept
because free_event() will complain later. Also, the new task will not
have a sighandler installed.

Queue the signal via task_work. Remove perf_event::pending_sigtrap and
use perf_event::pending_work instead. Raise an irq_work in the NMI case
as a dummy interrupt. Remove the task_work if the event is freed.

Tested-by: Marco Elver <elver@google.com>
Tested-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
---
 include/linux/perf_event.h |  3 +-
 kernel/events/core.c       | 58 ++++++++++++++++++++++----------------
 2 files changed, 34 insertions(+), 27 deletions(-)

diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index d2a15c0c6f8a9..24ac6765146c7 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -781,7 +781,6 @@ struct perf_event {
 	unsigned int			pending_wakeup;
 	unsigned int			pending_kill;
 	unsigned int			pending_disable;
-	unsigned int			pending_sigtrap;
 	unsigned long			pending_addr;	/* SIGTRAP */
 	struct irq_work			pending_irq;
 	struct callback_head		pending_task;
@@ -959,7 +958,7 @@ struct perf_event_context {
 	struct rcu_head			rcu_head;
 
 	/*
-	 * Sum (event->pending_sigtrap + event->pending_work)
+	 * Sum (event->pending_work + event->pending_work)
 	 *
 	 * The SIGTRAP is targeted at ctx->task, as such it won't do changing
 	 * that until the signal is delivered.
diff --git a/kernel/events/core.c b/kernel/events/core.c
index c7a0274c662c8..e0b2da8de485f 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -2283,21 +2283,6 @@ event_sched_out(struct perf_event *event, struct perf_event_context *ctx)
 		state = PERF_EVENT_STATE_OFF;
 	}
 
-	if (event->pending_sigtrap) {
-		bool dec = true;
-
-		event->pending_sigtrap = 0;
-		if (state != PERF_EVENT_STATE_OFF &&
-		    !event->pending_work) {
-			event->pending_work = 1;
-			dec = false;
-			WARN_ON_ONCE(!atomic_long_inc_not_zero(&event->refcount));
-			task_work_add(current, &event->pending_task, TWA_RESUME);
-		}
-		if (dec)
-			local_dec(&event->ctx->nr_pending);
-	}
-
 	perf_event_set_state(event, state);
 
 	if (!is_software_event(event))
@@ -6741,11 +6726,6 @@ static void __perf_pending_irq(struct perf_event *event)
 	 * Yay, we hit home and are in the context of the event.
 	 */
 	if (cpu == smp_processor_id()) {
-		if (event->pending_sigtrap) {
-			event->pending_sigtrap = 0;
-			perf_sigtrap(event);
-			local_dec(&event->ctx->nr_pending);
-		}
 		if (event->pending_disable) {
 			event->pending_disable = 0;
 			perf_event_disable_local(event);
@@ -9592,14 +9572,23 @@ static int __perf_event_overflow(struct perf_event *event,
 
 		if (regs)
 			pending_id = hash32_ptr((void *)instruction_pointer(regs)) ?: 1;
-		if (!event->pending_sigtrap) {
-			event->pending_sigtrap = pending_id;
+		if (!event->pending_work) {
+			event->pending_work = pending_id;
 			local_inc(&event->ctx->nr_pending);
-			irq_work_queue(&event->pending_irq);
+			WARN_ON_ONCE(!atomic_long_inc_not_zero(&event->refcount));
+			task_work_add(current, &event->pending_task, TWA_RESUME);
+			/*
+			 * The NMI path returns directly to userland. The
+			 * irq_work is raised as a dummy interrupt to ensure
+			 * regular return path to user is taken and task_work
+			 * is processed.
+			 */
+			if (in_nmi())
+				irq_work_queue(&event->pending_irq);
 		} else if (event->attr.exclude_kernel && valid_sample) {
 			/*
 			 * Should not be able to return to user space without
-			 * consuming pending_sigtrap; with exceptions:
+			 * consuming pending_work; with exceptions:
 			 *
 			 *  1. Where !exclude_kernel, events can overflow again
 			 *     in the kernel without returning to user space.
@@ -9609,7 +9598,7 @@ static int __perf_event_overflow(struct perf_event *event,
 			 *     To approximate progress (with false negatives),
 			 *     check 32-bit hash of the current IP.
 			 */
-			WARN_ON_ONCE(event->pending_sigtrap != pending_id);
+			WARN_ON_ONCE(event->pending_work != pending_id);
 		}
 
 		event->pending_addr = 0;
@@ -13049,6 +13038,13 @@ static void sync_child_event(struct perf_event *child_event)
 		     &parent_event->child_total_time_running);
 }
 
+static bool task_work_cb_match(struct callback_head *cb, void *data)
+{
+	struct perf_event *event = container_of(cb, struct perf_event, pending_task);
+
+	return event == data;
+}
+
 static void
 perf_event_exit_event(struct perf_event *event, struct perf_event_context *ctx)
 {
@@ -13088,6 +13084,18 @@ perf_event_exit_event(struct perf_event *event, struct perf_event_context *ctx)
 		 * Kick perf_poll() for is_event_hup();
 		 */
 		perf_event_wakeup(parent_event);
+		/*
+		 * Cancel pending task_work and update counters if it has not
+		 * yet been delivered to userland. free_event() expects the
+		 * reference counter at one and keeping the event around until
+		 * the task returns to userland can be a unexpected if there is
+		 * no signal handler registered.
+		 */
+		if (event->pending_work &&
+		    task_work_cancel_match(current, task_work_cb_match, event)) {
+			put_event(event);
+			local_dec(&event->ctx->nr_pending);
+		}
 		free_event(event);
 		put_event(parent_event);
 		return;
-- 
2.43.0



* [PATCH v3 3/4] perf: Remove perf_swevent_get_recursion_context() from perf_pending_task().
  2024-03-22  6:48 [PATCH v3 0/4] perf: Make SIGTRAP and __perf_pending_irq() work on RT Sebastian Andrzej Siewior
  2024-03-22  6:48 ` [PATCH v3 1/4] perf: Move irq_work_queue() where the event is prepared Sebastian Andrzej Siewior
  2024-03-22  6:48 ` [PATCH v3 2/4] perf: Enqueue SIGTRAP always via task_work Sebastian Andrzej Siewior
@ 2024-03-22  6:48 ` Sebastian Andrzej Siewior
  2024-04-08 22:06   ` Frederic Weisbecker
  2024-03-22  6:48 ` [PATCH v3 4/4] perf: Split __perf_pending_irq() out of perf_pending_irq() Sebastian Andrzej Siewior
  3 siblings, 1 reply; 25+ messages in thread
From: Sebastian Andrzej Siewior @ 2024-03-22  6:48 UTC (permalink / raw)
  To: linux-perf-users, linux-kernel
  Cc: Adrian Hunter, Alexander Shishkin, Arnaldo Carvalho de Melo,
	Ian Rogers, Ingo Molnar, Jiri Olsa, Marco Elver, Mark Rutland,
	Namhyung Kim, Peter Zijlstra, Thomas Gleixner,
	Sebastian Andrzej Siewior, Arnaldo Carvalho de Melo

perf_swevent_get_recursion_context() is supposed to avoid recursion.
This requires staying on the same CPU in order to increment and
decrement the same counter, which is ensured by using preempt_disable().
Having preemption disabled while sending a signal leads to locking
problems on PREEMPT_RT because sighand, a spinlock_t, becomes a sleeping
lock.

This callback runs in task context and currently only delivers a signal
to "itself". No recursion protection is required in this context.

Remove the recursion protection in perf_pending_task().
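
For context, a simplified sketch of the per-CPU counter behind
perf_swevent_get_recursion_context() (the real helper lives in
kernel/events/internal.h); there is one counter per CPU and per
interrupt context level, which is why the task must not migrate between
the get and the put:

	static inline int get_recursion_context(u8 *recursion)
	{
		/* 0 == task, 1 == softirq, 2 == hardirq, 3 == NMI */
		unsigned char rctx = interrupt_context_level();

		if (recursion[rctx])
			return -1;	/* recursing: caller bails out */

		recursion[rctx]++;
		barrier();

		return rctx;
	}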

Tested-by: Marco Elver <elver@google.com>
Tested-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Reported-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
---
 kernel/events/core.c | 12 ------------
 1 file changed, 12 deletions(-)

diff --git a/kernel/events/core.c b/kernel/events/core.c
index e0b2da8de485f..5400f7ed2f98b 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -6785,14 +6785,6 @@ static void perf_pending_irq(struct irq_work *entry)
 static void perf_pending_task(struct callback_head *head)
 {
 	struct perf_event *event = container_of(head, struct perf_event, pending_task);
-	int rctx;
-
-	/*
-	 * If we 'fail' here, that's OK, it means recursion is already disabled
-	 * and we won't recurse 'further'.
-	 */
-	preempt_disable_notrace();
-	rctx = perf_swevent_get_recursion_context();
 
 	if (event->pending_work) {
 		event->pending_work = 0;
@@ -6800,10 +6792,6 @@ static void perf_pending_task(struct callback_head *head)
 		local_dec(&event->ctx->nr_pending);
 	}
 
-	if (rctx >= 0)
-		perf_swevent_put_recursion_context(rctx);
-	preempt_enable_notrace();
-
 	put_event(event);
 }
 
-- 
2.43.0



* [PATCH v3 4/4] perf: Split __perf_pending_irq() out of perf_pending_irq()
  2024-03-22  6:48 [PATCH v3 0/4] perf: Make SIGTRAP and __perf_pending_irq() work on RT Sebastian Andrzej Siewior
                   ` (2 preceding siblings ...)
  2024-03-22  6:48 ` [PATCH v3 3/4] perf: Remove perf_swevent_get_recursion_context() from perf_pending_task() Sebastian Andrzej Siewior
@ 2024-03-22  6:48 ` Sebastian Andrzej Siewior
  3 siblings, 0 replies; 25+ messages in thread
From: Sebastian Andrzej Siewior @ 2024-03-22  6:48 UTC (permalink / raw)
  To: linux-perf-users, linux-kernel
  Cc: Adrian Hunter, Alexander Shishkin, Arnaldo Carvalho de Melo,
	Ian Rogers, Ingo Molnar, Jiri Olsa, Marco Elver, Mark Rutland,
	Namhyung Kim, Peter Zijlstra, Thomas Gleixner,
	Sebastian Andrzej Siewior, Arnaldo Carvalho de Melo

perf_pending_irq() invokes perf_event_wakeup() and __perf_pending_irq().
The former is in charge of waking any tasks which wait to be woken up
while the latter disables perf events.

While perf_pending_irq() is an irq_work, its callback is invoked in
thread context on PREEMPT_RT. This is needed because the waking
functions (wake_up_all(), kill_fasync()) acquire sleeping locks which
must not be used with interrupts disabled.
Disabling events, as done by __perf_pending_irq(), expects a hardirq
context and disabled interrupts. This requirement is not fulfilled on
PREEMPT_RT.

Split the functionality based on perf_event::pending_disable into a
separate irq_work named `pending_disable_irq' and invoke it in hardirq
context on PREEMPT_RT. Rename the split-out callback to
perf_pending_disable().

Tested-by: Marco Elver <elver@google.com>
Tested-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
---
 include/linux/perf_event.h |  1 +
 kernel/events/core.c       | 31 +++++++++++++++++++++++--------
 2 files changed, 24 insertions(+), 8 deletions(-)

diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 24ac6765146c7..c1c6600541657 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -783,6 +783,7 @@ struct perf_event {
 	unsigned int			pending_disable;
 	unsigned long			pending_addr;	/* SIGTRAP */
 	struct irq_work			pending_irq;
+	struct irq_work			pending_disable_irq;
 	struct callback_head		pending_task;
 	unsigned int			pending_work;
 
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 5400f7ed2f98b..7266265ed8cc3 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -2449,7 +2449,7 @@ static void __perf_event_disable(struct perf_event *event,
  * hold the top-level event's child_mutex, so any descendant that
  * goes to exit will block in perf_event_exit_event().
  *
- * When called from perf_pending_irq it's OK because event->ctx
+ * When called from perf_pending_disable it's OK because event->ctx
  * is the current context on this CPU and preemption is disabled,
  * hence we can't get into perf_event_task_sched_out for this context.
  */
@@ -2489,7 +2489,7 @@ EXPORT_SYMBOL_GPL(perf_event_disable);
 void perf_event_disable_inatomic(struct perf_event *event)
 {
 	event->pending_disable = 1;
-	irq_work_queue(&event->pending_irq);
+	irq_work_queue(&event->pending_disable_irq);
 }
 
 #define MAX_INTERRUPTS (~0ULL)
@@ -5175,6 +5175,7 @@ static void perf_addr_filters_splice(struct perf_event *event,
 static void _free_event(struct perf_event *event)
 {
 	irq_work_sync(&event->pending_irq);
+	irq_work_sync(&event->pending_disable_irq);
 
 	unaccount_event(event);
 
@@ -6711,7 +6712,7 @@ static void perf_sigtrap(struct perf_event *event)
 /*
  * Deliver the pending work in-event-context or follow the context.
  */
-static void __perf_pending_irq(struct perf_event *event)
+static void __perf_pending_disable(struct perf_event *event)
 {
 	int cpu = READ_ONCE(event->oncpu);
 
@@ -6749,11 +6750,26 @@ static void __perf_pending_irq(struct perf_event *event)
 	 *				  irq_work_queue(); // FAILS
 	 *
 	 *  irq_work_run()
-	 *    perf_pending_irq()
+	 *    perf_pending_disable()
 	 *
 	 * But the event runs on CPU-B and wants disabling there.
 	 */
-	irq_work_queue_on(&event->pending_irq, cpu);
+	irq_work_queue_on(&event->pending_disable_irq, cpu);
+}
+
+static void perf_pending_disable(struct irq_work *entry)
+{
+	struct perf_event *event = container_of(entry, struct perf_event, pending_disable_irq);
+	int rctx;
+
+	/*
+	 * If we 'fail' here, that's OK, it means recursion is already disabled
+	 * and we won't recurse 'further'.
+	 */
+	rctx = perf_swevent_get_recursion_context();
+	__perf_pending_disable(event);
+	if (rctx >= 0)
+		perf_swevent_put_recursion_context(rctx);
 }
 
 static void perf_pending_irq(struct irq_work *entry)
@@ -6776,8 +6792,6 @@ static void perf_pending_irq(struct irq_work *entry)
 		perf_event_wakeup(event);
 	}
 
-	__perf_pending_irq(event);
-
 	if (rctx >= 0)
 		perf_swevent_put_recursion_context(rctx);
 }
@@ -9572,7 +9586,7 @@ static int __perf_event_overflow(struct perf_event *event,
 			 * is processed.
 			 */
 			if (in_nmi())
-				irq_work_queue(&event->pending_irq);
+				irq_work_queue(&event->pending_disable_irq);
 		} else if (event->attr.exclude_kernel && valid_sample) {
 			/*
 			 * Should not be able to return to user space without
@@ -11912,6 +11926,7 @@ perf_event_alloc(struct perf_event_attr *attr, int cpu,
 
 	init_waitqueue_head(&event->waitq);
 	init_irq_work(&event->pending_irq, perf_pending_irq);
+	event->pending_disable_irq = IRQ_WORK_INIT_HARD(perf_pending_disable);
 	init_task_work(&event->pending_task, perf_pending_task);
 
 	mutex_init(&event->mmap_mutex);
-- 
2.43.0



* Re: [PATCH v3 2/4] perf: Enqueue SIGTRAP always via task_work.
  2024-03-22  6:48 ` [PATCH v3 2/4] perf: Enqueue SIGTRAP always via task_work Sebastian Andrzej Siewior
@ 2024-04-08 21:29   ` Frederic Weisbecker
  2024-04-09  8:57     ` Sebastian Andrzej Siewior
  0 siblings, 1 reply; 25+ messages in thread
From: Frederic Weisbecker @ 2024-04-08 21:29 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior
  Cc: linux-perf-users, linux-kernel, Adrian Hunter,
	Alexander Shishkin, Arnaldo Carvalho de Melo, Ian Rogers,
	Ingo Molnar, Jiri Olsa, Marco Elver, Mark Rutland, Namhyung Kim,
	Peter Zijlstra, Thomas Gleixner, Arnaldo Carvalho de Melo

On Fri, Mar 22, 2024 at 07:48:22AM +0100, Sebastian Andrzej Siewior wrote:
> A signal is delivered by raising irq_work() which works from any context
> including NMI. irq_work() can be delayed if the architecture does not
> provide an interrupt vector. In order not to lose a signal, the signal
> is injected via task_work during event_sched_out().
> 
> Instead going via irq_work, the signal could be added directly via
> task_work. The signal is sent to current and can be enqueued on its
> return path to userland instead of triggering irq_work. A dummy IRQ is
> required in the NMI case to ensure the task_work is handled before
> returning to user land. For this irq_work is used. An alternative would
> be just raising an interrupt like arch_send_call_function_single_ipi().
> 
> During testing with `remove_on_exec' it become visible that the event
> can be enqueued via NMI during execve(). The task_work must not be kept
> because free_event() will complain later. Also the new task will not
> have a sighandler installed.
> 
> Queue signal via task_work. Remove perf_event::pending_sigtrap and
> and use perf_event::pending_work instead. Raise irq_work in the NMI case
> for a dummy interrupt. Remove the task_work if the event is freed.
> 
> Tested-by: Marco Elver <elver@google.com>
> Tested-by: Arnaldo Carvalho de Melo <acme@redhat.com>
> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>

It clashes a bit with a series I have posted. Let's see the details:

> ---
>  include/linux/perf_event.h |  3 +-
>  kernel/events/core.c       | 58 ++++++++++++++++++++++----------------
>  2 files changed, 34 insertions(+), 27 deletions(-)
> 
> diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
> index d2a15c0c6f8a9..24ac6765146c7 100644
> --- a/include/linux/perf_event.h
> +++ b/include/linux/perf_event.h
> @@ -781,7 +781,6 @@ struct perf_event {
>  	unsigned int			pending_wakeup;
>  	unsigned int			pending_kill;
>  	unsigned int			pending_disable;
> -	unsigned int			pending_sigtrap;
>  	unsigned long			pending_addr;	/* SIGTRAP */
>  	struct irq_work			pending_irq;
>  	struct callback_head		pending_task;
> @@ -959,7 +958,7 @@ struct perf_event_context {
>  	struct rcu_head			rcu_head;
>  
>  	/*
> -	 * Sum (event->pending_sigtrap + event->pending_work)
> +	 * Sum (event->pending_work + event->pending_work)
>  	 *
>  	 * The SIGTRAP is targeted at ctx->task, as such it won't do changing
>  	 * that until the signal is delivered.
> diff --git a/kernel/events/core.c b/kernel/events/core.c
> index c7a0274c662c8..e0b2da8de485f 100644
> --- a/kernel/events/core.c
> +++ b/kernel/events/core.c
> @@ -2283,21 +2283,6 @@ event_sched_out(struct perf_event *event, struct perf_event_context *ctx)
>  		state = PERF_EVENT_STATE_OFF;
>  	}
>  
> -	if (event->pending_sigtrap) {
> -		bool dec = true;
> -
> -		event->pending_sigtrap = 0;
> -		if (state != PERF_EVENT_STATE_OFF &&
> -		    !event->pending_work) {
> -			event->pending_work = 1;
> -			dec = false;
> -			WARN_ON_ONCE(!atomic_long_inc_not_zero(&event->refcount));
> -			task_work_add(current, &event->pending_task, TWA_RESUME);
> -		}
> -		if (dec)
> -			local_dec(&event->ctx->nr_pending);
> -	}
> -
>  	perf_event_set_state(event, state);
>  
>  	if (!is_software_event(event))
> @@ -6741,11 +6726,6 @@ static void __perf_pending_irq(struct perf_event *event)
>  	 * Yay, we hit home and are in the context of the event.
>  	 */
>  	if (cpu == smp_processor_id()) {
> -		if (event->pending_sigtrap) {
> -			event->pending_sigtrap = 0;
> -			perf_sigtrap(event);
> -			local_dec(&event->ctx->nr_pending);
> -		}
>  		if (event->pending_disable) {
>  			event->pending_disable = 0;
>  			perf_event_disable_local(event);
> @@ -9592,14 +9572,23 @@ static int __perf_event_overflow(struct perf_event *event,
>  
>  		if (regs)
>  			pending_id = hash32_ptr((void *)instruction_pointer(regs)) ?: 1;
> -		if (!event->pending_sigtrap) {
> -			event->pending_sigtrap = pending_id;
> +		if (!event->pending_work) {
> +			event->pending_work = pending_id;
>  			local_inc(&event->ctx->nr_pending);
> -			irq_work_queue(&event->pending_irq);
> +			WARN_ON_ONCE(!atomic_long_inc_not_zero(&event->refcount));
> +			task_work_add(current, &event->pending_task, TWA_RESUME);

If the overflow happens between exit_task_work() and perf_event_exit_task(),
you're leaking the event. (This was there before this patch).
See:
	https://lore.kernel.org/all/202403310406.TPrIela8-lkp@intel.com/T/#m5e6c8ebbef04ab9a1d7f05340cd3e2716a9a8c39

> +			/*
> +			 * The NMI path returns directly to userland. The
> +			 * irq_work is raised as a dummy interrupt to ensure
> +			 * regular return path to user is taken and task_work
> +			 * is processed.
> +			 */
> +			if (in_nmi())
> +				irq_work_queue(&event->pending_irq);
>  		} else if (event->attr.exclude_kernel && valid_sample) {
>  			/*
>  			 * Should not be able to return to user space without
> -			 * consuming pending_sigtrap; with exceptions:
> +			 * consuming pending_work; with exceptions:
>  			 *
>  			 *  1. Where !exclude_kernel, events can overflow again
>  			 *     in the kernel without returning to user space.
> @@ -9609,7 +9598,7 @@ static int __perf_event_overflow(struct perf_event *event,
>  			 *     To approximate progress (with false negatives),
>  			 *     check 32-bit hash of the current IP.
>  			 */
> -			WARN_ON_ONCE(event->pending_sigtrap != pending_id);
> +			WARN_ON_ONCE(event->pending_work != pending_id);
>  		}
>  
>  		event->pending_addr = 0;
> @@ -13049,6 +13038,13 @@ static void sync_child_event(struct perf_event *child_event)
>  		     &parent_event->child_total_time_running);
>  }
>  
> +static bool task_work_cb_match(struct callback_head *cb, void *data)
> +{
> +	struct perf_event *event = container_of(cb, struct perf_event, pending_task);
> +
> +	return event == data;
> +}

I suggest we introduce a proper API to cancel an actual callback head, see:

https://lore.kernel.org/all/202403310406.TPrIela8-lkp@intel.com/T/#mbfac417463018394f9d80c68c7f2cafe9d066a4b
https://lore.kernel.org/all/202403310406.TPrIela8-lkp@intel.com/T/#m0a347249a462523358724085f2489ce9ed91e640

> +
>  static void
>  perf_event_exit_event(struct perf_event *event, struct perf_event_context *ctx)
>  {
> @@ -13088,6 +13084,18 @@ perf_event_exit_event(struct perf_event *event, struct perf_event_context *ctx)
>  		 * Kick perf_poll() for is_event_hup();
>  		 */
>  		perf_event_wakeup(parent_event);
> +		/*
> +		 * Cancel pending task_work and update counters if it has not
> +		 * yet been delivered to userland. free_event() expects the
> +		 * reference counter at one and keeping the event around until
> +		 * the task returns to userland can be a unexpected if there is
> +		 * no signal handler registered.
> +		 */
> +		if (event->pending_work &&
> +		    task_work_cancel_match(current, task_work_cb_match, event)) {
> +			put_event(event);
> +			local_dec(&event->ctx->nr_pending);
> +		}

So exiting task, privileged exec and also exit on exec call into this before
releasing the children.

And parents rely on put_event() from file close + the task work.

But what about remote release of children on file close?
See perf_event_release_kernel() directly calling free_event() on them.

One possible fix is to avoid the reference count game around task work
and flush them on free_event().

See here:

https://lore.kernel.org/all/202403310406.TPrIela8-lkp@intel.com/T/#m63c28147d8ac06b21c64d7784d49f892e06c0e50

Thanks.

>  		free_event(event);
>  		put_event(parent_event);
>  		return;
> -- 
> 2.43.0
> 
> 


* Re: [PATCH v3 3/4] perf: Remove perf_swevent_get_recursion_context() from perf_pending_task().
  2024-03-22  6:48 ` [PATCH v3 3/4] perf: Remove perf_swevent_get_recursion_context() from perf_pending_task() Sebastian Andrzej Siewior
@ 2024-04-08 22:06   ` Frederic Weisbecker
  2024-04-09  6:25     ` Sebastian Andrzej Siewior
  0 siblings, 1 reply; 25+ messages in thread
From: Frederic Weisbecker @ 2024-04-08 22:06 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior
  Cc: linux-perf-users, linux-kernel, Adrian Hunter,
	Alexander Shishkin, Arnaldo Carvalho de Melo, Ian Rogers,
	Ingo Molnar, Jiri Olsa, Marco Elver, Mark Rutland, Namhyung Kim,
	Peter Zijlstra, Thomas Gleixner, Arnaldo Carvalho de Melo

On Fri, Mar 22, 2024 at 07:48:23AM +0100, Sebastian Andrzej Siewior wrote:
> perf_swevent_get_recursion_context() is supposed to avoid recursion.
> This requires to remain on the same CPU in order to decrement/ increment
> the same counter. This is done by using preempt_disable(). Having
> preemption disabled while sending a signal leads to locking problems on
> PREEMPT_RT because sighand, a spinlock_t, becomes a sleeping lock.
> 
> This callback runs in task context and currently delivers only a signal
> to "itself". Any kind of recusrion protection in this context is not
> required.
> 
> Remove recursion protection in perf_pending_task().
> 
> Tested-by: Marco Elver <elver@google.com>
> Tested-by: Arnaldo Carvalho de Melo <acme@redhat.com>
> Reported-by: Arnaldo Carvalho de Melo <acme@redhat.com>
> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
> ---
>  kernel/events/core.c | 12 ------------
>  1 file changed, 12 deletions(-)
> 
> diff --git a/kernel/events/core.c b/kernel/events/core.c
> index e0b2da8de485f..5400f7ed2f98b 100644
> --- a/kernel/events/core.c
> +++ b/kernel/events/core.c
> @@ -6785,14 +6785,6 @@ static void perf_pending_irq(struct irq_work *entry)
>  static void perf_pending_task(struct callback_head *head)
>  {
>  	struct perf_event *event = container_of(head, struct perf_event, pending_task);
> -	int rctx;
> -
> -	/*
> -	 * If we 'fail' here, that's OK, it means recursion is already disabled
> -	 * and we won't recurse 'further'.
> -	 */
> -	preempt_disable_notrace();
> -	rctx = perf_swevent_get_recursion_context();
>  
>  	if (event->pending_work) {
>  		event->pending_work = 0;
> @@ -6800,10 +6792,6 @@ static void perf_pending_task(struct callback_head *head)
>  		local_dec(&event->ctx->nr_pending);
>  	}
>  
> -	if (rctx >= 0)
> -		perf_swevent_put_recursion_context(rctx);
> -	preempt_enable_notrace();

Well, if a software event happens during perf_sigtrap(), the task work
may be requeued endlessly and the task may get stuck in task_work_run()...

> -
>  	put_event(event);
>  }
>  
> -- 
> 2.43.0
> 
> 


* Re: [PATCH v3 3/4] perf: Remove perf_swevent_get_recursion_context() from perf_pending_task().
  2024-04-08 22:06   ` Frederic Weisbecker
@ 2024-04-09  6:25     ` Sebastian Andrzej Siewior
  2024-04-09 10:35       ` Frederic Weisbecker
  0 siblings, 1 reply; 25+ messages in thread
From: Sebastian Andrzej Siewior @ 2024-04-09  6:25 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: linux-perf-users, linux-kernel, Adrian Hunter,
	Alexander Shishkin, Arnaldo Carvalho de Melo, Ian Rogers,
	Ingo Molnar, Jiri Olsa, Marco Elver, Mark Rutland, Namhyung Kim,
	Peter Zijlstra, Thomas Gleixner, Arnaldo Carvalho de Melo

On 2024-04-09 00:06:00 [+0200], Frederic Weisbecker wrote:
> > diff --git a/kernel/events/core.c b/kernel/events/core.c
> > index e0b2da8de485f..5400f7ed2f98b 100644
> > --- a/kernel/events/core.c
> > +++ b/kernel/events/core.c
> > @@ -6785,14 +6785,6 @@ static void perf_pending_irq(struct irq_work *entry)
> >  static void perf_pending_task(struct callback_head *head)
> >  {
> >  	struct perf_event *event = container_of(head, struct perf_event, pending_task);
> > -	int rctx;
> > -
> > -	/*
> > -	 * If we 'fail' here, that's OK, it means recursion is already disabled
> > -	 * and we won't recurse 'further'.
> > -	 */
> > -	preempt_disable_notrace();
> > -	rctx = perf_swevent_get_recursion_context();
> >  
> >  	if (event->pending_work) {
> >  		event->pending_work = 0;
> > @@ -6800,10 +6792,6 @@ static void perf_pending_task(struct callback_head *head)
> >  		local_dec(&event->ctx->nr_pending);
> >  	}
> >  
> > -	if (rctx >= 0)
> > -		perf_swevent_put_recursion_context(rctx);
> > -	preempt_enable_notrace();
> 
> Well, if a software event happens during perf_sigtrap(), the task work
> may be requeued endlessly and the task may get stuck in task_work_run()...

The last time I checked it had no users in the task context. How would
that happen?

Sebastian


* Re: [PATCH v3 2/4] perf: Enqueue SIGTRAP always via task_work.
  2024-04-08 21:29   ` Frederic Weisbecker
@ 2024-04-09  8:57     ` Sebastian Andrzej Siewior
  2024-04-09 12:36       ` Frederic Weisbecker
  0 siblings, 1 reply; 25+ messages in thread
From: Sebastian Andrzej Siewior @ 2024-04-09  8:57 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: linux-perf-users, linux-kernel, Adrian Hunter,
	Alexander Shishkin, Arnaldo Carvalho de Melo, Ian Rogers,
	Ingo Molnar, Jiri Olsa, Marco Elver, Mark Rutland, Namhyung Kim,
	Peter Zijlstra, Thomas Gleixner, Arnaldo Carvalho de Melo

On 2024-04-08 23:29:03 [+0200], Frederic Weisbecker wrote:
> > index c7a0274c662c8..e0b2da8de485f 100644
> > --- a/kernel/events/core.c
> > +++ b/kernel/events/core.c
> > @@ -2283,21 +2283,6 @@ event_sched_out(struct perf_event *event, struct perf_event_context *ctx)
> >  		state = PERF_EVENT_STATE_OFF;
> >  	}
> >  
> > -	if (event->pending_sigtrap) {
> > -		bool dec = true;
> > -
> > -		event->pending_sigtrap = 0;
> > -		if (state != PERF_EVENT_STATE_OFF &&
> > -		    !event->pending_work) {
> > -			event->pending_work = 1;
> > -			dec = false;
> > -			WARN_ON_ONCE(!atomic_long_inc_not_zero(&event->refcount));
> > -			task_work_add(current, &event->pending_task, TWA_RESUME);
> > -		}
> > -		if (dec)
> > -			local_dec(&event->ctx->nr_pending);
> > -	}
> > -
> >  	perf_event_set_state(event, state);
> >  
> >  	if (!is_software_event(event))
> > @@ -6741,11 +6726,6 @@ static void __perf_pending_irq(struct perf_event *event)
> >  	 * Yay, we hit home and are in the context of the event.
> >  	 */
> >  	if (cpu == smp_processor_id()) {
> > -		if (event->pending_sigtrap) {
> > -			event->pending_sigtrap = 0;
> > -			perf_sigtrap(event);
> > -			local_dec(&event->ctx->nr_pending);
> > -		}
> >  		if (event->pending_disable) {
> >  			event->pending_disable = 0;
> >  			perf_event_disable_local(event);
> > @@ -9592,14 +9572,23 @@ static int __perf_event_overflow(struct perf_event *event,
> >  
> >  		if (regs)
> >  			pending_id = hash32_ptr((void *)instruction_pointer(regs)) ?: 1;
> > -		if (!event->pending_sigtrap) {
> > -			event->pending_sigtrap = pending_id;
> > +		if (!event->pending_work) {
> > +			event->pending_work = pending_id;
> >  			local_inc(&event->ctx->nr_pending);
> > -			irq_work_queue(&event->pending_irq);
> > +			WARN_ON_ONCE(!atomic_long_inc_not_zero(&event->refcount));
> > +			task_work_add(current, &event->pending_task, TWA_RESUME);
> 
> If the overflow happens between exit_task_work() and perf_event_exit_task(),
> you're leaking the event. (This was there before this patch).
> See:
> 	https://lore.kernel.org/all/202403310406.TPrIela8-lkp@intel.com/T/#m5e6c8ebbef04ab9a1d7f05340cd3e2716a9a8c39

Okay.

> > +			/*
> > +			 * The NMI path returns directly to userland. The
> > +			 * irq_work is raised as a dummy interrupt to ensure
> > +			 * regular return path to user is taken and task_work
> > +			 * is processed.
> > +			 */
> > +			if (in_nmi())
> > +				irq_work_queue(&event->pending_irq);
> >  		} else if (event->attr.exclude_kernel && valid_sample) {
> >  			/*
> >  			 * Should not be able to return to user space without
> > -			 * consuming pending_sigtrap; with exceptions:
> > +			 * consuming pending_work; with exceptions:
> >  			 *
> >  			 *  1. Where !exclude_kernel, events can overflow again
> >  			 *     in the kernel without returning to user space.
> > @@ -9609,7 +9598,7 @@ static int __perf_event_overflow(struct perf_event *event,
> >  			 *     To approximate progress (with false negatives),
> >  			 *     check 32-bit hash of the current IP.
> >  			 */
> > -			WARN_ON_ONCE(event->pending_sigtrap != pending_id);
> > +			WARN_ON_ONCE(event->pending_work != pending_id);
> >  		}
> >  
> >  		event->pending_addr = 0;
> > @@ -13049,6 +13038,13 @@ static void sync_child_event(struct perf_event *child_event)
> >  		     &parent_event->child_total_time_running);
> >  }
> >  
> > +static bool task_work_cb_match(struct callback_head *cb, void *data)
> > +{
> > +	struct perf_event *event = container_of(cb, struct perf_event, pending_task);
> > +
> > +	return event == data;
> > +}
> 
> I suggest we introduce a proper API to cancel an actual callback head, see:
> 
> https://lore.kernel.org/all/202403310406.TPrIela8-lkp@intel.com/T/#mbfac417463018394f9d80c68c7f2cafe9d066a4b
> https://lore.kernel.org/all/202403310406.TPrIela8-lkp@intel.com/T/#m0a347249a462523358724085f2489ce9ed91e640

This rework would work.

> >  static void
> >  perf_event_exit_event(struct perf_event *event, struct perf_event_context *ctx)
> >  {
> > @@ -13088,6 +13084,18 @@ perf_event_exit_event(struct perf_event *event, struct perf_event_context *ctx)
> >  		 * Kick perf_poll() for is_event_hup();
> >  		 */
> >  		perf_event_wakeup(parent_event);
> > +		/*
> > +		 * Cancel pending task_work and update counters if it has not
> > +		 * yet been delivered to userland. free_event() expects the
> > +		 * reference counter at one and keeping the event around until
> > +		 * the task returns to userland can be a unexpected if there is
> > +		 * no signal handler registered.
> > +		 */
> > +		if (event->pending_work &&
> > +		    task_work_cancel_match(current, task_work_cb_match, event)) {
> > +			put_event(event);
> > +			local_dec(&event->ctx->nr_pending);
> > +		}
> 
> So exiting task, privileged exec and also exit on exec call into this before
> releasing the children.
> 
> And parents rely on put_event() from file close + the task work.
> 
> But what about remote release of children on file close?
> See perf_event_release_kernel() directly calling free_event() on them.

Interesting things you are presenting. I had events popping up at random
even after the task had decided that it won't go back to userland to
handle them, so letting the event be freed looked like the only option…

> One possible fix is to avoid the reference count game around task work
> and flush them on free_event().
> 
> See here:
> 
> https://lore.kernel.org/all/202403310406.TPrIela8-lkp@intel.com/T/#m63c28147d8ac06b21c64d7784d49f892e06c0e50

That wake_up() within preempt_disable() section breaks on RT.

How do we go on from here?

> Thanks.

Sebastian


* Re: [PATCH v3 3/4] perf: Remove perf_swevent_get_recursion_context() from perf_pending_task().
  2024-04-09  6:25     ` Sebastian Andrzej Siewior
@ 2024-04-09 10:35       ` Frederic Weisbecker
  2024-04-09 10:54         ` Sebastian Andrzej Siewior
  0 siblings, 1 reply; 25+ messages in thread
From: Frederic Weisbecker @ 2024-04-09 10:35 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior
  Cc: linux-perf-users, linux-kernel, Adrian Hunter,
	Alexander Shishkin, Arnaldo Carvalho de Melo, Ian Rogers,
	Ingo Molnar, Jiri Olsa, Marco Elver, Mark Rutland, Namhyung Kim,
	Peter Zijlstra, Thomas Gleixner, Arnaldo Carvalho de Melo

On Tue, Apr 09, 2024 at 08:25:01AM +0200, Sebastian Andrzej Siewior wrote:
> On 2024-04-09 00:06:00 [+0200], Frederic Weisbecker wrote:
> > > diff --git a/kernel/events/core.c b/kernel/events/core.c
> > > index e0b2da8de485f..5400f7ed2f98b 100644
> > > --- a/kernel/events/core.c
> > > +++ b/kernel/events/core.c
> > > @@ -6785,14 +6785,6 @@ static void perf_pending_irq(struct irq_work *entry)
> > >  static void perf_pending_task(struct callback_head *head)
> > >  {
> > >  	struct perf_event *event = container_of(head, struct perf_event, pending_task);
> > > -	int rctx;
> > > -
> > > -	/*
> > > -	 * If we 'fail' here, that's OK, it means recursion is already disabled
> > > -	 * and we won't recurse 'further'.
> > > -	 */
> > > -	preempt_disable_notrace();
> > > -	rctx = perf_swevent_get_recursion_context();
> > >  
> > >  	if (event->pending_work) {
> > >  		event->pending_work = 0;
> > > @@ -6800,10 +6792,6 @@ static void perf_pending_task(struct callback_head *head)
> > >  		local_dec(&event->ctx->nr_pending);
> > >  	}
> > >  
> > > -	if (rctx >= 0)
> > > -		perf_swevent_put_recursion_context(rctx);
> > > -	preempt_enable_notrace();
> > 
> > Well, if a software event happens during perf_sigtrap(), the task work
> > may be requeued endlessly and the task may get stuck in task_work_run()...
> 
> The last time I checked it had no users in the task context. How would
> that happen?

I guess many tracepoint events would do the trick. Such as trace_lock_acquire()
for example.

Thanks.


* Re: [PATCH v3 3/4] perf: Remove perf_swevent_get_recursion_context() from perf_pending_task().
  2024-04-09 10:35       ` Frederic Weisbecker
@ 2024-04-09 10:54         ` Sebastian Andrzej Siewior
  2024-04-09 12:00           ` Frederic Weisbecker
  0 siblings, 1 reply; 25+ messages in thread
From: Sebastian Andrzej Siewior @ 2024-04-09 10:54 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: linux-perf-users, linux-kernel, Adrian Hunter,
	Alexander Shishkin, Arnaldo Carvalho de Melo, Ian Rogers,
	Ingo Molnar, Jiri Olsa, Marco Elver, Mark Rutland, Namhyung Kim,
	Peter Zijlstra, Thomas Gleixner, Arnaldo Carvalho de Melo

On 2024-04-09 12:35:46 [+0200], Frederic Weisbecker wrote:
> > > > @@ -6800,10 +6792,6 @@ static void perf_pending_task(struct callback_head *head)
> > > >  		local_dec(&event->ctx->nr_pending);
> > > >  	}
> > > >  
> > > > -	if (rctx >= 0)
> > > > -		perf_swevent_put_recursion_context(rctx);
> > > > -	preempt_enable_notrace();
> > > 
> > > Well, if a software event happens during perf_sigtrap(), the task work
> > > may be requeued endlessly and the task may get stuck in task_work_run()...
> > 
> > The last time I checked it had no users in the task context. How would
> > that happen?
> 
> I guess many tracepoint events would do the trick. Such as trace_lock_acquire()
> for example.

So perf_trace_buf_alloc() is invoked from that tracepoint and avoids
the recursion. And any trace event from within perf_sigtrap() would
trigger the endless loop?

> Thanks.

Sebastian


* Re: [PATCH v3 3/4] perf: Remove perf_swevent_get_recursion_context() from perf_pending_task().
  2024-04-09 10:54         ` Sebastian Andrzej Siewior
@ 2024-04-09 12:00           ` Frederic Weisbecker
  2024-04-09 13:33             ` Sebastian Andrzej Siewior
  0 siblings, 1 reply; 25+ messages in thread
From: Frederic Weisbecker @ 2024-04-09 12:00 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior
  Cc: linux-perf-users, linux-kernel, Adrian Hunter,
	Alexander Shishkin, Arnaldo Carvalho de Melo, Ian Rogers,
	Ingo Molnar, Jiri Olsa, Marco Elver, Mark Rutland, Namhyung Kim,
	Peter Zijlstra, Thomas Gleixner, Arnaldo Carvalho de Melo

On Tue, Apr 09, 2024 at 12:54:05PM +0200, Sebastian Andrzej Siewior wrote:
> On 2024-04-09 12:35:46 [+0200], Frederic Weisbecker wrote:
> > > > > @@ -6800,10 +6792,6 @@ static void perf_pending_task(struct callback_head *head)
> > > > >  		local_dec(&event->ctx->nr_pending);
> > > > >  	}
> > > > >  
> > > > > -	if (rctx >= 0)
> > > > > -		perf_swevent_put_recursion_context(rctx);
> > > > > -	preempt_enable_notrace();
> > > > 
> > > > Well, if a software event happens during perf_sigtrap(), the task work
> > > > may be requeued endlessly and the task may get stuck in task_work_run()...
> > > 
> > > The last time I checked it had no users in the task context. How would
> > > that happen?
> > 
> > I guess many tracepoint events would do the trick. Such as trace_lock_acquire()
> > for example.
> 
> So the perf_trace_buf_alloc() is invoked from that trace point and
> avoids the recursion. And any trace event from within perf_sigtrap()
> would trigger the endless loop?

Not sure I'm following:

1) event->perf_event_overflow() -> task_work_add()
//return to userspace
2) task_work_run() -> perf_pending_task() -> perf_sigtrap() -> tracepoint event
   -> perf_event_overflow() -> task_work_add()
3) task_work_run() -> perf_pending_task() -> etc...

What am I missing?

Thanks.


* Re: [PATCH v3 2/4] perf: Enqueue SIGTRAP always via task_work.
  2024-04-09  8:57     ` Sebastian Andrzej Siewior
@ 2024-04-09 12:36       ` Frederic Weisbecker
  2024-04-09 13:47         ` Sebastian Andrzej Siewior
  0 siblings, 1 reply; 25+ messages in thread
From: Frederic Weisbecker @ 2024-04-09 12:36 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior
  Cc: linux-perf-users, linux-kernel, Adrian Hunter,
	Alexander Shishkin, Arnaldo Carvalho de Melo, Ian Rogers,
	Ingo Molnar, Jiri Olsa, Marco Elver, Mark Rutland, Namhyung Kim,
	Peter Zijlstra, Thomas Gleixner, Arnaldo Carvalho de Melo

On Tue, Apr 09, 2024 at 10:57:32AM +0200, Sebastian Andrzej Siewior wrote:
> > >  static void
> > >  perf_event_exit_event(struct perf_event *event, struct perf_event_context *ctx)
> > >  {
> > > @@ -13088,6 +13084,18 @@ perf_event_exit_event(struct perf_event *event, struct perf_event_context *ctx)
> > >  		 * Kick perf_poll() for is_event_hup();
> > >  		 */
> > >  		perf_event_wakeup(parent_event);
> > > +		/*
> > > +		 * Cancel pending task_work and update counters if it has not
> > > +		 * yet been delivered to userland. free_event() expects the
> > > +		 * reference counter at one and keeping the event around until
> > > +		 * the task returns to userland can be a unexpected if there is
> > > +		 * no signal handler registered.
> > > +		 */
> > > +		if (event->pending_work &&
> > > +		    task_work_cancel_match(current, task_work_cb_match, event)) {
> > > +			put_event(event);
> > > +			local_dec(&event->ctx->nr_pending);
> > > +		}
> > 
> > So exiting task, privileged exec and also exit on exec call into this before
> > releasing the children.
> > 
> > And parents rely on put_event() from file close + the task work.
> > 
> > But what about remote release of children on file close?
> > See perf_event_release_kernel() directly calling free_event() on them.
> 
> Interesting things you are presenting. I had events popping up at random
> even after the task decided that it won't go back to userland to handle
> it so letting it free looked like the only option…
> 
> > One possible fix is to avoid the reference count game around task work
> > and flush them on free_event().
> > 
> > See here:
> > 
> > https://lore.kernel.org/all/202403310406.TPrIela8-lkp@intel.com/T/#m63c28147d8ac06b21c64d7784d49f892e06c0e50
> 
> That wake_up() within preempt_disable() section breaks on RT.

Ah, but the wake-up still wants to go inside recursion protection somehow or
it could generate task_work loop again due to tracepoint events...

Although... the wake up occurs only when the event is dead after all...

> How do we go on from here?

I'd tend to think you need my patchset first because the problems it
fixes were not easily visible as long as there was an irq work to take
care of things most of the time. But once you rely on task_work only then
these become a real problem. Especially the sync against perf_release().

Thanks.

> 
> > Thanks.
> 
> Sebastian


* Re: [PATCH v3 3/4] perf: Remove perf_swevent_get_recursion_context() from perf_pending_task().
  2024-04-09 12:00           ` Frederic Weisbecker
@ 2024-04-09 13:33             ` Sebastian Andrzej Siewior
  2024-04-10 10:38               ` Frederic Weisbecker
  0 siblings, 1 reply; 25+ messages in thread
From: Sebastian Andrzej Siewior @ 2024-04-09 13:33 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: linux-perf-users, linux-kernel, Adrian Hunter,
	Alexander Shishkin, Arnaldo Carvalho de Melo, Ian Rogers,
	Ingo Molnar, Jiri Olsa, Marco Elver, Mark Rutland, Namhyung Kim,
	Peter Zijlstra, Thomas Gleixner, Arnaldo Carvalho de Melo

On 2024-04-09 14:00:49 [+0200], Frederic Weisbecker wrote:
> On Tue, Apr 09, 2024 at 12:54:05PM +0200, Sebastian Andrzej Siewior wrote:
> > On 2024-04-09 12:35:46 [+0200], Frederic Weisbecker wrote:
> > > > > > @@ -6800,10 +6792,6 @@ static void perf_pending_task(struct callback_head *head)
> > > > > >  		local_dec(&event->ctx->nr_pending);
> > > > > >  	}
> > > > > >  
> > > > > > -	if (rctx >= 0)
> > > > > > -		perf_swevent_put_recursion_context(rctx);
> > > > > > -	preempt_enable_notrace();
> > > > > 
> > > > > Well, if a software event happens during perf_sigtrap(), the task work
> > > > > may be requeued endlessly and the task may get stuck in task_work_run()...
> > > > 
> > > > The last time I checked it had no users in the task context. How would
> > > > that happen?
> > > 
> > > I guess many tracepoint events would do the trick. Such as trace_lock_acquire()
> > > for example.
> > 
> > So the perf_trace_buf_alloc() is invoked from that trace point and
> > avoids the recursion. And any trace event from within perf_sigtrap()
> > would trigger the endless loop?
> 
> No sure I'm following:
> 
> 1) event->perf_event_overflow() -> task_work_add()
> //return to userspace
> 2) task_work_run() -> perf_pending_task() -> perf_sigtrap() -> tracepoint event
>    -> perf_event_overflow() -> task_work_add()
> 3) task_work_run() -> perf_pending_task() -> etc...
>
> What am I missing?

Yes, that is what I tried to say. Anyway, I misunderstood the concept
before. That means we need to keep that counter here and a
migrate_disable() is needed to avoid CPU migration which is sad.

> Thanks.

Sebastian


* Re: [PATCH v3 2/4] perf: Enqueue SIGTRAP always via task_work.
  2024-04-09 12:36       ` Frederic Weisbecker
@ 2024-04-09 13:47         ` Sebastian Andrzej Siewior
  2024-04-10 11:37           ` Frederic Weisbecker
  0 siblings, 1 reply; 25+ messages in thread
From: Sebastian Andrzej Siewior @ 2024-04-09 13:47 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: linux-perf-users, linux-kernel, Adrian Hunter,
	Alexander Shishkin, Arnaldo Carvalho de Melo, Ian Rogers,
	Ingo Molnar, Jiri Olsa, Marco Elver, Mark Rutland, Namhyung Kim,
	Peter Zijlstra, Thomas Gleixner, Arnaldo Carvalho de Melo

On 2024-04-09 14:36:51 [+0200], Frederic Weisbecker wrote:
> > That wake_up() within preempt_disable() section breaks on RT.
> 
> Ah, but the wake-up still wants to go inside recursion protection somehow or
> it could generate task_work loop again due to tracepoint events...

okay.

> Although... the wake up occurs only when the event is dead after all...

corner case or not, it has to work, right?

> > How do we go on from here?
> 
> I'd tend to think you need my patchset first because the problems it
> fixes were not easily visible as long as there was an irq work to take
> care of things most of the time. But once you rely on task_work only then
> these become a real problem. Especially the sync against perf_release().

I don't mind rebasing on top of your series. But defaulting to task_work
is not an option then?

RT-wise the irq_work is not handled in hardirq context because of the
locks it acquires; it is handled in a thread instead. Depending on its
priority, the task (receiving the event) may run before the irq_work
thread. Therefore the task_work looked neat because the event would be
handled _before_ the task returned to userland.
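
(For context: on PREEMPT_RT only irq_work flagged as "hard" is run from
the actual hardirq, everything else is deferred to the per-CPU irq_work
kthread. Roughly, using the two cases from this series:

	/* deferred to the irq_work kthread on PREEMPT_RT: */
	init_irq_work(&event->pending_irq, perf_pending_irq);

	/* still run from hardirq context, even on PREEMPT_RT: */
	event->pending_disable_irq = IRQ_WORK_INIT_HARD(perf_pending_disable);

The second form is what patch 4 uses for the disable path.)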

Couldn't we either flush _or_ remove the task_work in perf_release()?

> Thanks.
Sebastian


* Re: [PATCH v3 3/4] perf: Remove perf_swevent_get_recursion_context() from perf_pending_task().
  2024-04-09 13:33             ` Sebastian Andrzej Siewior
@ 2024-04-10 10:38               ` Frederic Weisbecker
  2024-04-10 12:51                 ` Sebastian Andrzej Siewior
  0 siblings, 1 reply; 25+ messages in thread
From: Frederic Weisbecker @ 2024-04-10 10:38 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior
  Cc: linux-perf-users, linux-kernel, Adrian Hunter,
	Alexander Shishkin, Arnaldo Carvalho de Melo, Ian Rogers,
	Ingo Molnar, Jiri Olsa, Marco Elver, Mark Rutland, Namhyung Kim,
	Peter Zijlstra, Thomas Gleixner, Arnaldo Carvalho de Melo

On Tue, Apr 09, 2024 at 03:33:36PM +0200, Sebastian Andrzej Siewior wrote:
> On 2024-04-09 14:00:49 [+0200], Frederic Weisbecker wrote:
> > On Tue, Apr 09, 2024 at 12:54:05PM +0200, Sebastian Andrzej Siewior wrote:
> > > On 2024-04-09 12:35:46 [+0200], Frederic Weisbecker wrote:
> > > > > > > @@ -6800,10 +6792,6 @@ static void perf_pending_task(struct callback_head *head)
> > > > > > >  		local_dec(&event->ctx->nr_pending);
> > > > > > >  	}
> > > > > > >  
> > > > > > > -	if (rctx >= 0)
> > > > > > > -		perf_swevent_put_recursion_context(rctx);
> > > > > > > -	preempt_enable_notrace();
> > > > > > 
> > > > > > Well, if a software event happens during perf_sigtrap(), the task work
> > > > > > may be requeued endlessly and the task may get stuck in task_work_run()...
> > > > > 
> > > > > The last time I checked it had no users in the task context. How would
> > > > > that happen?
> > > > 
> > > > I guess many tracepoint events would do the trick. Such as trace_lock_acquire()
> > > > for example.
> > > 
> > > So the perf_trace_buf_alloc() is invoked from that trace point and
> > > avoids the recursion. And any trace event from within perf_sigtrap()
> > > would trigger the endless loop?
> > 
> > No sure I'm following:
> > 
> > 1) event->perf_event_overflow() -> task_work_add()
> > //return to userspace
> > 2) task_work_run() -> perf_pending_task() -> perf_sigtrap() -> tracepoint event
> >    -> perf_event_overflow() -> task_work_add()
> > 3) task_work_run() -> perf_pending_task() -> etc...
> >
> > What am I missing?
> 
> Yes, that is what I tried to say.

Oh ok.. :o)

> Anyway, I misunderstood the concept
> before. That means we need to keep that counter here and a
> migrate_disable() is needed to avoid CPU migration which is sad.

I fear that won't work either. The task is then pinned but another
task can come up on that CPU and its software events will be ignored...

Some alternatives:

_ Clear event->pending_work = 0 only after perf_sigtrap(), preventing an
event in there from adding a new task work (see the sketch below). We may
miss a signal though...

_ Make the recursion context per task on -RT...
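
A rough sketch of the first alternative, against perf_pending_task() as
it looks after this series (illustration only, not a tested patch):

	if (event->pending_work) {
		perf_sigtrap(event);
		/*
		 * Clearing pending_work only after perf_sigtrap() means a
		 * software event overflowing inside perf_sigtrap() still sees
		 * pending_work != 0 and does not re-add the task work, at the
		 * price of possibly losing that nested signal.
		 */
		event->pending_work = 0;
		local_dec(&event->ctx->nr_pending);
	}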

> > Thanks.
> 
> Sebastian


* Re: [PATCH v3 2/4] perf: Enqueue SIGTRAP always via task_work.
  2024-04-09 13:47         ` Sebastian Andrzej Siewior
@ 2024-04-10 11:37           ` Frederic Weisbecker
  2024-04-10 13:47             ` Sebastian Andrzej Siewior
  0 siblings, 1 reply; 25+ messages in thread
From: Frederic Weisbecker @ 2024-04-10 11:37 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior
  Cc: linux-perf-users, linux-kernel, Adrian Hunter,
	Alexander Shishkin, Arnaldo Carvalho de Melo, Ian Rogers,
	Ingo Molnar, Jiri Olsa, Marco Elver, Mark Rutland, Namhyung Kim,
	Peter Zijlstra, Thomas Gleixner, Arnaldo Carvalho de Melo

On Tue, Apr 09, 2024 at 03:47:29PM +0200, Sebastian Andrzej Siewior wrote:
> On 2024-04-09 14:36:51 [+0200], Frederic Weisbecker wrote:
> > > That wake_up() within preempt_disable() section breaks on RT.
> > 
> > Ah, but the wake-up still wants to go inside recursion protection somehow or
> > it could generate task_work loop again due to tracepoint events...
> 
> okay.
> 
> > Although... the wake up occurs only when the event is dead after all...
> 
> corner case or not, it has to work, right?

Yep.

> 
> > > How do we go on from here?
> > 
> > I'd tend to think you need my patchset first because the problems it
> > fixes were not easily visible as long as there was an irq_work to take
> > care of things most of the time. But once you rely on task_work only,
> > these become a real problem. Especially the sync against perf_release().
> 
> I don't mind rebasing on top of your series. But defaulting to task_work
> is not an option then?
> 
> RT-wise the irq_work is not handled in hardirq context because of the
> locks it acquires; it is handled in a thread instead. Depending on its
> priority, the task (receiving the event) may run before the irq_work
> thread. Therefore the task_work looked neat because the event would be
> handled _before_ the task returned to userland.

I see.
 
> Couldn't we either flush _or_ remove the task_work in perf_release()?

Right, so the problem in perf_release() is that we may be dealing with task works
of tasks other than current. In that case, task_work_cancel() is fine if it
succeeds. But if it fails, you don't have the guarantee that the task work
isn't concurrently running or about to run. And you have no way to know about
that. So then you need some sort of flushing indeed.

Thanks.

> > Thanks.
> Sebastian

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v3 3/4] perf: Remove perf_swevent_get_recursion_context() from perf_pending_task().
  2024-04-10 10:38               ` Frederic Weisbecker
@ 2024-04-10 12:51                 ` Sebastian Andrzej Siewior
  2024-04-10 13:58                   ` Frederic Weisbecker
  0 siblings, 1 reply; 25+ messages in thread
From: Sebastian Andrzej Siewior @ 2024-04-10 12:51 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: linux-perf-users, linux-kernel, Adrian Hunter,
	Alexander Shishkin, Arnaldo Carvalho de Melo, Ian Rogers,
	Ingo Molnar, Jiri Olsa, Marco Elver, Mark Rutland, Namhyung Kim,
	Peter Zijlstra, Thomas Gleixner, Arnaldo Carvalho de Melo

On 2024-04-10 12:38:50 [+0200], Frederic Weisbecker wrote:
> > Anyway, I misunderstood the concept
> > before. That means we need to keep that counter here and a
> > migrate_disable() is needed to avoid CPU migration which is sad.
> 
> I fear that won't work either. The task is then pinned but another
> task can come up on that CPU and its software events will be ignored...

oh, right.

> Some alternatives:
> 
> _ Clear event->pending_work = 0 after perf_sigtrap(), preventing an
> event in there from adding a new task work. We may miss a signal though...
> 
> _ Make the recursion context per task on -RT...

The per-task counter would indeed be the easiest thing to do. But then
only for task context, right?

But why would we miss a signal if we clear event->pending_work late?
Isn't clearing it late the same as clearing it in
perf_swevent_put_recursion_context()?
If clearing pending_work late works, I would prefer to avoid yet another
per-task counter if possible.

> > > Thanks.

Sebastian

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v3 2/4] perf: Enqueue SIGTRAP always via task_work.
  2024-04-10 11:37           ` Frederic Weisbecker
@ 2024-04-10 13:47             ` Sebastian Andrzej Siewior
  2024-04-10 14:00               ` Frederic Weisbecker
  0 siblings, 1 reply; 25+ messages in thread
From: Sebastian Andrzej Siewior @ 2024-04-10 13:47 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: linux-perf-users, linux-kernel, Adrian Hunter,
	Alexander Shishkin, Arnaldo Carvalho de Melo, Ian Rogers,
	Ingo Molnar, Jiri Olsa, Marco Elver, Mark Rutland, Namhyung Kim,
	Peter Zijlstra, Thomas Gleixner, Arnaldo Carvalho de Melo

On 2024-04-10 13:37:05 [+0200], Frederic Weisbecker wrote:
> > Couldn't we either flush _or_ remove the task_work in perf_release()?
> 
> Right, so the problem in perf_release() is that we may be dealing with task works
> of tasks other than current. In that case, task_work_cancel() is fine if it
> succeeds. But if it fails, you don't have the guarantee that the task work
> isn't concurrently running or about to run. And you have no way to know about
> that. So then you need some sort of flushing indeed.

Since perf_release() is preemptible, a wait/sleep for completion would be
best (instead of flushing).

> Thanks.
> 
> > > Thanks.

Sebastian

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v3 3/4] perf: Remove perf_swevent_get_recursion_context() from perf_pending_task().
  2024-04-10 12:51                 ` Sebastian Andrzej Siewior
@ 2024-04-10 13:58                   ` Frederic Weisbecker
  0 siblings, 0 replies; 25+ messages in thread
From: Frederic Weisbecker @ 2024-04-10 13:58 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior
  Cc: linux-perf-users, linux-kernel, Adrian Hunter,
	Alexander Shishkin, Arnaldo Carvalho de Melo, Ian Rogers,
	Ingo Molnar, Jiri Olsa, Marco Elver, Mark Rutland, Namhyung Kim,
	Peter Zijlstra, Thomas Gleixner, Arnaldo Carvalho de Melo

Le Wed, Apr 10, 2024 at 02:51:26PM +0200, Sebastian Andrzej Siewior a écrit :
> On 2024-04-10 12:38:50 [+0200], Frederic Weisbecker wrote:
> > Some alternatives:
> > 
> > _ Clear event->pending_work = 0 after perf_sigtrap(), preventing an
> > event in there from adding a new task work. We may miss a signal though...
> > 
> > _ Make the recursion context per task on -RT...
> 
> The per-task counter would indeed be the easiest thing to do. But then
> only for task context, right?

It should work for CPU context as well. A context switch shouldn't be
considered recursion. Hopefully...
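
Roughly, a per-task variant could look like this (sketch only; the
task_struct field is invented here, mirroring the task level of the
existing per-CPU recursion protection):

static inline int perf_task_recursion_enter(void)
{
	if (current->perf_task_recursion)
		return -1;	/* already inside a software event on this task */
	current->perf_task_recursion = 1;
	barrier();
	return 0;
}

static inline void perf_task_recursion_exit(void)
{
	barrier();
	current->perf_task_recursion = 0;
}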

> 
> But why would we miss a signal if we clear event->pending_work late?
> Isn't clearing it late the same as clearing it in
> perf_swevent_put_recursion_context()?

Not exactly. perf_swevent_get_recursion_context() prevents a new software event
from being recorded altogether; it is completely ignored. Whereas clearing
event->pending_work late still records new software events but drops the signal.

> If clearing pending_work late works, I would prefer to avoid yet another
> per-task counter if possible.

Yeah I know :-/

Thanks.

> 
> > > > Thanks.
> 
> Sebastian

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v3 2/4] perf: Enqueue SIGTRAP always via task_work.
  2024-04-10 13:47             ` Sebastian Andrzej Siewior
@ 2024-04-10 14:00               ` Frederic Weisbecker
  2024-04-10 14:06                 ` Sebastian Andrzej Siewior
  0 siblings, 1 reply; 25+ messages in thread
From: Frederic Weisbecker @ 2024-04-10 14:00 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior
  Cc: linux-perf-users, linux-kernel, Adrian Hunter,
	Alexander Shishkin, Arnaldo Carvalho de Melo, Ian Rogers,
	Ingo Molnar, Jiri Olsa, Marco Elver, Mark Rutland, Namhyung Kim,
	Peter Zijlstra, Thomas Gleixner, Arnaldo Carvalho de Melo

Le Wed, Apr 10, 2024 at 03:47:02PM +0200, Sebastian Andrzej Siewior a écrit :
> On 2024-04-10 13:37:05 [+0200], Frederic Weisbecker wrote:
> > > Couldn't we either flush _or_ remove the task_work in perf_release()?
> > 
> > Right, so the problem in perf_release() is that we may be dealing with task works
> > of tasks other than current. In that case, task_work_cancel() is fine if it
> > succeeds. But if it fails, you don't have the guarantee that the task work
> > isn't concurrently running or about to run. And you have no way to know about
> > that. So then you need some sort of flushing indeed.
> 
> Since perf_release() is preemptible, a wait/sleep for completion would be
> best (instead of flushing).

Like this then?

https://lore.kernel.org/all/202403310406.TPrIela8-lkp@intel.com/T/#m63c28147d8ac06b21c64d7784d49f892e06c0e50

> > Thanks.
> > 
> > > > Thanks.
> 
> Sebastian

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v3 2/4] perf: Enqueue SIGTRAP always via task_work.
  2024-04-10 14:00               ` Frederic Weisbecker
@ 2024-04-10 14:06                 ` Sebastian Andrzej Siewior
  2024-04-10 14:42                   ` Frederic Weisbecker
  0 siblings, 1 reply; 25+ messages in thread
From: Sebastian Andrzej Siewior @ 2024-04-10 14:06 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: linux-perf-users, linux-kernel, Adrian Hunter,
	Alexander Shishkin, Arnaldo Carvalho de Melo, Ian Rogers,
	Ingo Molnar, Jiri Olsa, Marco Elver, Mark Rutland, Namhyung Kim,
	Peter Zijlstra, Thomas Gleixner, Arnaldo Carvalho de Melo

On 2024-04-10 16:00:17 [+0200], Frederic Weisbecker wrote:
> Le Wed, Apr 10, 2024 at 03:47:02PM +0200, Sebastian Andrzej Siewior a écrit :
> > On 2024-04-10 13:37:05 [+0200], Frederic Weisbecker wrote:
> > > > Couldn't we either flush _or_ remove the task_work in perf_release()?
> > > 
> > > Right, so the problem in perf_release() is that we may be dealing with task works
> > > of tasks other than current. In that case, task_work_cancel() is fine if it
> > > succeeds. But if it fails, you don't have the guarantee that the task work
> > > isn't concurrently running or about to run. And you have no way to know about
> > > that. So then you need some sort of flushing indeed.
> > 
> > Since perf_release() is preemptible, a wait/sleep for completion would be
> > best (instead of flushing).
> 
> Like this then?
> 
> https://lore.kernel.org/all/202403310406.TPrIela8-lkp@intel.com/T/#m63c28147d8ac06b21c64d7784d49f892e06c0e50

Kind of, yes. Do we have more than one waiter? If not, maybe that
rcuwait would work then.
Otherwise (>1 waiter) we did establish that we may need a per-task
counter for recursion handling so preempt-disable shouldn't be a problem
then. The pending_work_wq must not be used outside of task context (means
no hardirq or something like that).
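
For the single-waiter case, a minimal sketch of how the pairing could look
(pending_work_wait and the helper names are made up for the sketch, not
taken from the linked patch; rcuwait_init() would be done at event
allocation time):

#include <linux/rcuwait.h>
#include <linux/sched.h>

struct pending_sigtrap {
	unsigned int	pending_work;
	struct rcuwait	pending_work_wait;
};

/* task-work side: runs in the target task before it returns to user space */
static void pending_sigtrap_done(struct pending_sigtrap *p)
{
	WRITE_ONCE(p->pending_work, 0);
	rcuwait_wake_up(&p->pending_work_wait);	/* wakes the sole waiter, if any */
}

/* perf_release() side: preemptible, so sleeping until completion is fine */
static void pending_sigtrap_wait(struct pending_sigtrap *p)
{
	rcuwait_wait_event(&p->pending_work_wait,
			   !READ_ONCE(p->pending_work), TASK_UNINTERRUPTIBLE);
}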

Sebastian

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v3 2/4] perf: Enqueue SIGTRAP always via task_work.
  2024-04-10 14:06                 ` Sebastian Andrzej Siewior
@ 2024-04-10 14:42                   ` Frederic Weisbecker
  2024-04-10 14:48                     ` Sebastian Andrzej Siewior
  0 siblings, 1 reply; 25+ messages in thread
From: Frederic Weisbecker @ 2024-04-10 14:42 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior
  Cc: linux-perf-users, linux-kernel, Adrian Hunter,
	Alexander Shishkin, Arnaldo Carvalho de Melo, Ian Rogers,
	Ingo Molnar, Jiri Olsa, Marco Elver, Mark Rutland, Namhyung Kim,
	Peter Zijlstra, Thomas Gleixner, Arnaldo Carvalho de Melo

Le Wed, Apr 10, 2024 at 04:06:33PM +0200, Sebastian Andrzej Siewior a écrit :
> On 2024-04-10 16:00:17 [+0200], Frederic Weisbecker wrote:
> > Le Wed, Apr 10, 2024 at 03:47:02PM +0200, Sebastian Andrzej Siewior a écrit :
> > > On 2024-04-10 13:37:05 [+0200], Frederic Weisbecker wrote:
> > > > > Couldn't we either flush _or_ remove the task_work in perf_release()?
> > > > 
> > > > Right, so the problem in perf_release() is that we may be dealing with task works
> > > > of tasks other than current. In that case, task_work_cancel() is fine if it
> > > > succeeds. But if it fails, you don't have the guarantee that the task work
> > > > isn't concurrently running or about to run. And you have no way to know about
> > > > that. So then you need some sort of flushing indeed.
> > > 
> > > Since perf_release() is preemptible, a wait/sleep for completion would be
> > > best (instead of flushing).
> > 
> > Like this then?
> > 
> > https://lore.kernel.org/all/202403310406.TPrIela8-lkp@intel.com/T/#m63c28147d8ac06b21c64d7784d49f892e06c0e50
> 
> Kind of, yes. Do we have more than one waiter? If not, maybe that
> rcuwait would work then.

Indeed there is only one waiter so that should work. Would
that be something you can call while preemption is disabled?

Thanks.

> Otherwise (>1 waiter) we did establish that we may need a per-task
> counter for recursion handling so preempt-disable shouldn't be a problem
> then. The pending_work_wq must not be used outside of task context (means
> no hardirq or something like that).
> 
> Sebastian

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v3 2/4] perf: Enqueue SIGTRAP always via task_work.
  2024-04-10 14:42                   ` Frederic Weisbecker
@ 2024-04-10 14:48                     ` Sebastian Andrzej Siewior
  2024-04-10 14:50                       ` Frederic Weisbecker
  0 siblings, 1 reply; 25+ messages in thread
From: Sebastian Andrzej Siewior @ 2024-04-10 14:48 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: linux-perf-users, linux-kernel, Adrian Hunter,
	Alexander Shishkin, Arnaldo Carvalho de Melo, Ian Rogers,
	Ingo Molnar, Jiri Olsa, Marco Elver, Mark Rutland, Namhyung Kim,
	Peter Zijlstra, Thomas Gleixner, Arnaldo Carvalho de Melo

On 2024-04-10 16:42:56 [+0200], Frederic Weisbecker wrote:
> > > Like this then?
> > > 
> > > https://lore.kernel.org/all/202403310406.TPrIela8-lkp@intel.com/T/#m63c28147d8ac06b21c64d7784d49f892e06c0e50
> > 
> > Kind of, yes. Do we have more than one waiter? If not, maybe that
> > rcuwait would work then.
> 
> Indeed there is only one waiter so that should work. Would
> that be something you can call while preemption is disabled?

rcuwait_wake_up() only does wake_up_process(), which is fine.
wake_up() does spin_lock_irqsave(), which is a no-go.

On the other hand that preempt-disable needs to go anyway due to
perf_sigtrap(). But a slim wake is a slim wake ;)

> Thanks.

Sebastian

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v3 2/4] perf: Enqueue SIGTRAP always via task_work.
  2024-04-10 14:48                     ` Sebastian Andrzej Siewior
@ 2024-04-10 14:50                       ` Frederic Weisbecker
  0 siblings, 0 replies; 25+ messages in thread
From: Frederic Weisbecker @ 2024-04-10 14:50 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior
  Cc: linux-perf-users, linux-kernel, Adrian Hunter,
	Alexander Shishkin, Arnaldo Carvalho de Melo, Ian Rogers,
	Ingo Molnar, Jiri Olsa, Marco Elver, Mark Rutland, Namhyung Kim,
	Peter Zijlstra, Thomas Gleixner, Arnaldo Carvalho de Melo

Le Wed, Apr 10, 2024 at 04:48:21PM +0200, Sebastian Andrzej Siewior a écrit :
> On 2024-04-10 16:42:56 [+0200], Frederic Weisbecker wrote:
> > > > Like this then?
> > > > 
> > > > https://lore.kernel.org/all/202403310406.TPrIela8-lkp@intel.com/T/#m63c28147d8ac06b21c64d7784d49f892e06c0e50
> > > 
> > > Kind of, yes. Do we have more than one waiter? If not, maybe that
> > > rcuwait would work then.
> > 
> > Indeed there is only one waiter so that should work. Would
> > that be something you can call while preemption is disabled?
> 
> rcuwait_wake_up() only does wake_up_process(), which is fine.
> wake_up() does spin_lock_irqsave(), which is a no-go.

Duh!

> On the other hand that preempt-disable needs to go anyway due to
> perf_sigtrap(). But a slim wake is a slim wake ;)

Sure thing :)

> > Thanks.
> 
> Sebastian

^ permalink raw reply	[flat|nested] 25+ messages in thread

end of thread, other threads:[~2024-04-10 14:50 UTC | newest]

Thread overview: 25+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2024-03-22  6:48 [PATCH v3 0/4] perf: Make SIGTRAP and __perf_pending_irq() work on RT Sebastian Andrzej Siewior
2024-03-22  6:48 ` [PATCH v3 1/4] perf: Move irq_work_queue() where the event is prepared Sebastian Andrzej Siewior
2024-03-22  6:48 ` [PATCH v3 2/4] perf: Enqueue SIGTRAP always via task_work Sebastian Andrzej Siewior
2024-04-08 21:29   ` Frederic Weisbecker
2024-04-09  8:57     ` Sebastian Andrzej Siewior
2024-04-09 12:36       ` Frederic Weisbecker
2024-04-09 13:47         ` Sebastian Andrzej Siewior
2024-04-10 11:37           ` Frederic Weisbecker
2024-04-10 13:47             ` Sebastian Andrzej Siewior
2024-04-10 14:00               ` Frederic Weisbecker
2024-04-10 14:06                 ` Sebastian Andrzej Siewior
2024-04-10 14:42                   ` Frederic Weisbecker
2024-04-10 14:48                     ` Sebastian Andrzej Siewior
2024-04-10 14:50                       ` Frederic Weisbecker
2024-03-22  6:48 ` [PATCH v3 3/4] perf: Remove perf_swevent_get_recursion_context() from perf_pending_task() Sebastian Andrzej Siewior
2024-04-08 22:06   ` Frederic Weisbecker
2024-04-09  6:25     ` Sebastian Andrzej Siewior
2024-04-09 10:35       ` Frederic Weisbecker
2024-04-09 10:54         ` Sebastian Andrzej Siewior
2024-04-09 12:00           ` Frederic Weisbecker
2024-04-09 13:33             ` Sebastian Andrzej Siewior
2024-04-10 10:38               ` Frederic Weisbecker
2024-04-10 12:51                 ` Sebastian Andrzej Siewior
2024-04-10 13:58                   ` Frederic Weisbecker
2024-03-22  6:48 ` [PATCH v3 4/4] perf: Split __perf_pending_irq() out of perf_pending_irq() Sebastian Andrzej Siewior

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.