linux-kernel.vger.kernel.org archive mirror
* [RFC] trace: Add option for polling ring buffers
@ 2021-05-19 17:57 Nicolas Saenz Julienne
  2021-05-19 18:07 ` Matthew Wilcox
  2021-05-28 17:32 ` Steven Rostedt
  0 siblings, 2 replies; 6+ messages in thread
From: Nicolas Saenz Julienne @ 2021-05-19 17:57 UTC (permalink / raw)
  To: rostedt, linux-doc, linux-kernel
  Cc: mingo, corbet, mtosatti, Nicolas Saenz Julienne

To minimize trace's effect on isolated CPUs, that is, CPUs where only a
handful of processes, or a single one, are allowed to run, introduce a
new trace option: 'poll-rb'.

This option changes the heuristic used to wait for data on trace
buffers. The default one, based on wait queues, will trigger an IPI[1]
on the CPU responsible for new data, which will take care of waking up
the trace gathering process (generally trace-cmd). With 'poll-rb', we
instead poll (as in busy-wait) the ring buffers from the trace
gathering process, relieving the CPUs writing trace data of any wakeup
work.

This wakeup work, although negligible in the vast majority of workloads,
may cause unwarranted latencies on systems running trace on isolated
CPUs. This is made worse on PREEMPT_RT kernels, as they defer the IPI
handling into a kernel thread, forcing unwarranted context switches on
otherwise extremely busy CPUs.

To illustrate this, I traced with PREEMPT_RT=y on an isolated CPU with
a single process pinned to it (NO_HZ_FULL=y, and plenty more isolation
options enabled), and saw:
  - 50-100us latency spikes with the default trace-cmd options
  - 10-14us latency spikes with 'poll-rb'
  - 8-11us latency spikes with no tracing at all

The obvious drawback of 'poll-rb' is that it puts more pressure on the
housekeeping CPUs, wasting cycles. Hence the notice in the
documentation discouraging its use in general.

[1] The IPI, in this case an irq_work, is needed since trace might run
in NMI context, which is not suitable for wake-ups.

Signed-off-by: Nicolas Saenz Julienne <nsaenzju@redhat.com>
---
 Documentation/trace/ftrace.rst | 10 +++++
 include/linux/ring_buffer.h    |  2 +-
 kernel/trace/ring_buffer.c     | 76 +++++++++++++++++++---------------
 kernel/trace/trace.c           |  2 +-
 kernel/trace/trace.h           |  1 +
 5 files changed, 55 insertions(+), 36 deletions(-)

diff --git a/Documentation/trace/ftrace.rst b/Documentation/trace/ftrace.rst
index 62c98e9bbdd9..224ed8652101 100644
--- a/Documentation/trace/ftrace.rst
+++ b/Documentation/trace/ftrace.rst
@@ -1248,6 +1248,16 @@ Here are the available options:
 	with the "nop" tracer is the same as just enabling the
 	"branch" tracer.
 
+  poll-rb
+        Use polling instead of waiting for new data in ring buffers. The wait
+        heuristic will trigger IPIs on the CPU responsible for new data, which
+        can generate false positives when tracing low latency workloads (or the
+        IPIs themselves). This is made worse on PREEMPT_RT kernels, as they
+        defer the IPI handling into a kernel thread, forcing unwarranted
+        context switches on otherwise extremely busy processes.
+
+        When in doubt, don't enable this; it'll just waste CPU cycles.
+
 .. tip:: Some tracers have their own options. They only appear in this
        file when the tracer is active. They always appear in the
        options directory.
diff --git a/include/linux/ring_buffer.h b/include/linux/ring_buffer.h
index dac53fd3afea..b492061d369b 100644
--- a/include/linux/ring_buffer.h
+++ b/include/linux/ring_buffer.h
@@ -98,7 +98,7 @@ __ring_buffer_alloc(unsigned long size, unsigned flags, struct lock_class_key *k
 	__ring_buffer_alloc((size), (flags), &__key);	\
 })
 
-int ring_buffer_wait(struct trace_buffer *buffer, int cpu, int full);
+int ring_buffer_wait(struct trace_buffer *buffer, int cpu, int full, bool poll);
 __poll_t ring_buffer_poll_wait(struct trace_buffer *buffer, int cpu,
 			  struct file *filp, poll_table *poll_table);
 
diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
index 2c0ee6484990..7840489623b2 100644
--- a/kernel/trace/ring_buffer.c
+++ b/kernel/trace/ring_buffer.c
@@ -892,14 +892,17 @@ static void rb_wake_up_waiters(struct irq_work *work)
  * @buffer: buffer to wait on
  * @cpu: the cpu buffer to wait on
  * @full: wait until the percentage of pages are available, if @cpu != RING_BUFFER_ALL_CPUS
+ * @poll: don't use wait queues, poll the ring buffer
  *
  * If @cpu == RING_BUFFER_ALL_CPUS then the task will wake up as soon
  * as data is added to any of the @buffer's cpu buffers. Otherwise
  * it will wait for data to be added to a specific cpu buffer.
  */
-int ring_buffer_wait(struct trace_buffer *buffer, int cpu, int full)
+int ring_buffer_wait(struct trace_buffer *buffer, int cpu, int full, bool poll)
 {
 	struct ring_buffer_per_cpu *cpu_buffer;
+	struct wait_queue_head *wq_head;
+	bool *waiters_pending;
 	DEFINE_WAIT(wait);
 	struct rb_irq_work *work;
 	int ret = 0;
@@ -920,37 +923,41 @@ int ring_buffer_wait(struct trace_buffer *buffer, int cpu, int full)
 		work = &cpu_buffer->irq_work;
 	}
 
+	if (full) {
+		wq_head = &work->full_waiters;
+		waiters_pending = &work->full_waiters_pending;
+	} else {
+		wq_head = &work->waiters;
+		waiters_pending = &work->waiters_pending;
+	}
 
 	while (true) {
-		if (full)
-			prepare_to_wait(&work->full_waiters, &wait, TASK_INTERRUPTIBLE);
-		else
-			prepare_to_wait(&work->waiters, &wait, TASK_INTERRUPTIBLE);
+		if (!poll) {
+			prepare_to_wait(wq_head, &wait, TASK_INTERRUPTIBLE);
 
-		/*
-		 * The events can happen in critical sections where
-		 * checking a work queue can cause deadlocks.
-		 * After adding a task to the queue, this flag is set
-		 * only to notify events to try to wake up the queue
-		 * using irq_work.
-		 *
-		 * We don't clear it even if the buffer is no longer
-		 * empty. The flag only causes the next event to run
-		 * irq_work to do the work queue wake up. The worse
-		 * that can happen if we race with !trace_empty() is that
-		 * an event will cause an irq_work to try to wake up
-		 * an empty queue.
-		 *
-		 * There's no reason to protect this flag either, as
-		 * the work queue and irq_work logic will do the necessary
-		 * synchronization for the wake ups. The only thing
-		 * that is necessary is that the wake up happens after
-		 * a task has been queued. It's OK for spurious wake ups.
-		 */
-		if (full)
-			work->full_waiters_pending = true;
-		else
-			work->waiters_pending = true;
+			/*
+			 * The events can happen in critical sections where
+			 * checking a work queue can cause deadlocks.  After
+			 * adding a task to the queue, this flag is set only to
+			 * notify events to try to wake up the queue using
+			 * irq_work.
+			 *
+			 * We don't clear it even if the buffer is no longer
+			 * empty. The flag only causes the next event to run
+			 * irq_work to do the work queue wake up. The worse
+			 * that can happen if we race with !trace_empty() is
+			 * that an event will cause an irq_work to try to wake
+			 * up an empty queue.
+			 *
+			 * There's no reason to protect this flag either, as
+			 * the work queue and irq_work logic will do the
+			 * necessary synchronization for the wake ups. The only
+			 * thing that is necessary is that the wake up happens
+			 * after a task has been queued. It's OK for spurious
+			 * wake ups.
+			 */
+			*waiters_pending = true;
+		}
 
 		if (signal_pending(current)) {
 			ret = -EINTR;
@@ -983,13 +990,14 @@ int ring_buffer_wait(struct trace_buffer *buffer, int cpu, int full)
 				break;
 		}
 
-		schedule();
+		if (!poll)
+			schedule();
+		else
+			schedule_timeout_interruptible(1);
 	}
 
-	if (full)
-		finish_wait(&work->full_waiters, &wait);
-	else
-		finish_wait(&work->waiters, &wait);
+	if (!poll)
+		finish_wait(wq_head, &wait);
 
 	return ret;
 }
diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
index 1c4e702133e8..8408b0f50519 100644
--- a/kernel/trace/trace.c
+++ b/kernel/trace/trace.c
@@ -1887,7 +1887,7 @@ static int wait_on_pipe(struct trace_iterator *iter, int full)
 		return 0;
 
 	return ring_buffer_wait(iter->array_buffer->buffer, iter->cpu_file,
-				full);
+				full, iter->tr->trace_flags & TRACE_ITER_POLL_RB);
 }
 
 #ifdef CONFIG_FTRACE_STARTUP_TEST
diff --git a/kernel/trace/trace.h b/kernel/trace/trace.h
index cd80d046c7a5..045bef803549 100644
--- a/kernel/trace/trace.h
+++ b/kernel/trace/trace.h
@@ -1198,6 +1198,7 @@ extern int trace_get_user(struct trace_parser *parser, const char __user *ubuf,
 		C(EVENT_FORK,		"event-fork"),		\
 		C(PAUSE_ON_TRACE,	"pause-on-trace"),	\
 		C(HASH_PTR,		"hash-ptr"),	/* Print hashed pointer */ \
+		C(POLL_RB,		"poll-rb"),		\
 		FUNCTION_FLAGS					\
 		FGRAPH_FLAGS					\
 		STACK_FLAGS					\
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 6+ messages in thread

* Re: [RFC] trace: Add option for polling ring buffers
  2021-05-19 17:57 [RFC] trace: Add option for polling ring buffers Nicolas Saenz Julienne
@ 2021-05-19 18:07 ` Matthew Wilcox
  2021-05-19 19:33   ` Marcelo Tosatti
  2021-05-20  8:57   ` Nicolas Saenz Julienne
  2021-05-28 17:32 ` Steven Rostedt
  1 sibling, 2 replies; 6+ messages in thread
From: Matthew Wilcox @ 2021-05-19 18:07 UTC (permalink / raw)
  To: Nicolas Saenz Julienne
  Cc: rostedt, linux-doc, linux-kernel, mingo, corbet, mtosatti

On Wed, May 19, 2021 at 07:57:55PM +0200, Nicolas Saenz Julienne wrote:
> To minimize trace's effect on isolated CPUs, that is, CPUs where only a
> handful of processes, or a single one, are allowed to run, introduce a
> new trace option: 'poll-rb'.

maybe this should take a parameter in ms (us?) saying how frequently
to poll?  it seems like a reasonable assumption that somebody running in
this kind of RT environment would be able to judge how often their
monitoring task needs to collect data.

> [1] The IPI, in this case, an irq_work, is needed since trace might run
> in NMI context. Which is not suitable for wake-ups.

could we also consider a try-wakeup which would not succeed if in NMI
context?  or are there situations where we only gather data in NMI
context, and so would never succeed in waking up?  if so, maybe
schedule the irq_work every 1000 failures to wake up.



* Re: [RFC] trace: Add option for polling ring buffers
  2021-05-19 18:07 ` Matthew Wilcox
@ 2021-05-19 19:33   ` Marcelo Tosatti
  2021-05-20  8:57   ` Nicolas Saenz Julienne
  1 sibling, 0 replies; 6+ messages in thread
From: Marcelo Tosatti @ 2021-05-19 19:33 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Nicolas Saenz Julienne, rostedt, linux-doc, linux-kernel, mingo, corbet


Hi Willy,

On Wed, May 19, 2021 at 07:07:54PM +0100, Matthew Wilcox wrote:
> On Wed, May 19, 2021 at 07:57:55PM +0200, Nicolas Saenz Julienne wrote:
> > To minimize trace's effect on isolated CPUs, that is, CPUs where only
> > a handful of processes, or a single one, are allowed to run, introduce
> > a new trace option: 'poll-rb'.
> 
> maybe this should take a parameter in ms (us?) saying how frequently
> to poll?  it seems like a reasonable assumption that somebody running in
> this kind of RT environment would be able to judge how often their
> monitoring task needs to collect data.

+1 (yes please).

> > [1] The IPI, in this case, an irq_work, is needed since trace might run
> > in NMI context. Which is not suitable for wake-ups.
> 
> could we also consider a try-wakeup which would not succeed if in NMI
> context?  or are there situations where we only gather data in NMI
> context, and so would never succeed in waking up?  if so, maybe
> schedule the irq_work every 1000 failures to wake up.

We'd like to reduce overhead on the isolated (as in isolcpus=) CPUs as 
much as possible (but yes this option was suggested).



* Re: [RFC] trace: Add option for polling ring buffers
  2021-05-19 18:07 ` Matthew Wilcox
  2021-05-19 19:33   ` Marcelo Tosatti
@ 2021-05-20  8:57   ` Nicolas Saenz Julienne
  1 sibling, 0 replies; 6+ messages in thread
From: Nicolas Saenz Julienne @ 2021-05-20  8:57 UTC (permalink / raw)
  To: Matthew Wilcox; +Cc: rostedt, linux-doc, linux-kernel, mingo, corbet, mtosatti

Hi Matthew, thanks for your comments.

On Wed, 2021-05-19 at 19:07 +0100, Matthew Wilcox wrote:
> On Wed, May 19, 2021 at 07:57:55PM +0200, Nicolas Saenz Julienne wrote:
> > To minimize trace's effect on isolated CPUs, that is, CPUs where only
> > a handful of processes, or a single one, are allowed to run, introduce
> > a new trace option: 'poll-rb'.
> 
> maybe this should take a parameter in ms (us?) saying how frequently
> to poll?  it seems like a reasonable assumption that somebody running in
> this kind of RT environment would be able to judge how often their
> monitoring task needs to collect data.

I'll look into it.

> > [1] The IPI, in this case, an irq_work, is needed since trace might run
> > in NMI context. Which is not suitable for wake-ups.
> 
> could we also consider a try-wakeup which would not succeed if in NMI
> context?

Yes, in a similar vein, my original idea was to defer the wakeup work to
a non-isolated CPU using irq_work_on(). But that irq_work flavor is not
NMI safe (nor is any other IPI mechanism targeting other CPUs).

> or are there situations where we only gather data in NMI
> context, and so would never succeed in waking up?

Yes, that's a use case; for example, 'trace-cmd record -e nmi'.

> if so, maybe schedule the irq_work every 1000 failures to wake up.

You'd be generating latency spikes nonetheless, which might eventually
break the isolated application's latency requirements.

As Marcelo said, the less code we run on the isolated CPU, the better.

-- 
Nicolás Sáenz



* Re: [RFC] trace: Add option for polling ring buffers
  2021-05-19 17:57 [RFC] trace: Add option for polling ring buffers Nicolas Saenz Julienne
  2021-05-19 18:07 ` Matthew Wilcox
@ 2021-05-28 17:32 ` Steven Rostedt
  2021-06-02  9:38   ` Nicolas Saenz Julienne
  1 sibling, 1 reply; 6+ messages in thread
From: Steven Rostedt @ 2021-05-28 17:32 UTC (permalink / raw)
  To: Nicolas Saenz Julienne; +Cc: linux-doc, linux-kernel, mingo, corbet, mtosatti

On Wed, 19 May 2021 19:57:55 +0200
Nicolas Saenz Julienne <nsaenzju@redhat.com> wrote:

> To minimize trace's effect on isolated CPUs, that is, CPUs where only a
> handful of processes, or a single one, are allowed to run, introduce a
> new trace option: 'poll-rb'.
> 
> This option changes the heuristic used to wait for data on trace
> buffers. The default one, based on wait queues, will trigger an IPI[1]
> on the CPU responsible for new data, which will take care of waking up
> the trace gathering process (generally trace-cmd). Whereas with
> 'poll-rb' we will poll (as in busy-wait) the ring buffers from the trace
> gathering process, releasing the CPUs writing trace data from doing any
> wakeup work.
> 
> This wakeup work, although negligible in the vast majority of workloads,
> may cause unwarranted latencies on systems running trace on isolated
> CPUs. This is made worse on PREEMPT_RT kernels, as they defer the IPI
> handling into a kernel thread, forcing unwarranted context switches on
> otherwise extremely busy CPUs.
> 
> To illustrate this, tracing with PREEMPT_RT=y on an isolated CPU with a
> single process pinned to it (NO_HZ_FULL=y, and plenty more isolation
> options enabled). I see:
>   - 50-100us latency spikes with the default trace-cmd options
>   - 14-10us latency spikes with 'poll-rb'
>   - 11-8us latency spikes with no tracing at all
> 
> The obvious drawback of 'poll-rb' is putting more pressure on the
> housekeeping CPUs. Wasting cycles. Hence the notice in the documentation
> discouraging its use in general.
> 
> [1] The IPI, in this case, an irq_work, is needed since trace might run
> in NMI context. Which is not suitable for wake-ups.

Can't this simply be done in user-space?

Set the reading of the trace buffers to O_NONBLOCK and it won't wait for
buffering to happen, which should prevent it from causing the IPI wake ups.

If you need this for trace-cmd, we can add a --poll option that would do
this.

-- Steve


* Re: [RFC] trace: Add option for polling ring buffers
  2021-05-28 17:32 ` Steven Rostedt
@ 2021-06-02  9:38   ` Nicolas Saenz Julienne
  0 siblings, 0 replies; 6+ messages in thread
From: Nicolas Saenz Julienne @ 2021-06-02  9:38 UTC (permalink / raw)
  To: Steven Rostedt; +Cc: linux-doc, linux-kernel, mingo, corbet, mtosatti

Hi Steven, thanks for having a look at this.

On Fri, 2021-05-28 at 13:32 -0400, Steven Rostedt wrote:
> On Wed, 19 May 2021 19:57:55 +0200
> Nicolas Saenz Julienne <nsaenzju@redhat.com> wrote:
> 
> > To minimize trace's effect on isolated CPUs, that is, CPUs where only
> > a handful of processes, or a single one, are allowed to run, introduce
> > a new trace option: 'poll-rb'.
> > 
> > This option changes the heuristic used to wait for data on trace
> > buffers. The default one, based on wait queues, will trigger an IPI[1]
> > on the CPU responsible for new data, which will take care of waking up
> > the trace gathering process (generally trace-cmd). Whereas with
> > 'poll-rb' we will poll (as in busy-wait) the ring buffers from the trace
> > gathering process, releasing the CPUs writing trace data from doing any
> > wakeup work.
> > 
> > This wakeup work, although negligible in the vast majority of workloads,
> > may cause unwarranted latencies on systems running trace on isolated
> > CPUs. This is made worse on PREEMPT_RT kernels, as they defer the IPI
> > handling into a kernel thread, forcing unwarranted context switches on
> > otherwise extremely busy CPUs.
> > 
> > To illustrate this, tracing with PREEMPT_RT=y on an isolated CPU with a
> > single process pinned to it (NO_HZ_FULL=y, and plenty more isolation
> > options enabled). I see:
> >   - 50-100us latency spikes with the default trace-cmd options
> >   - 14-10us latency spikes with 'poll-rb'
> >   - 11-8us latency spikes with no tracing at all
> > 
> > The obvious drawback of 'poll-rb' is putting more pressure on the
> > housekeeping CPUs. Wasting cycles. Hence the notice in the documentation
> > discouraging its use in general.
> > 
> > [1] The IPI, in this case, an irq_work, is needed since trace might run
> > in NMI context. Which is not suitable for wake-ups.
> 
> Can't this simply be done in user-space?
>
> Set the reading of the trace buffers to O_NONBLOCK and it won't wait for
> buffering to happen, and should prevent it from causing the IPI wake ups.

Yes, I hadn't thought of O_NONBLOCK...

> If you need this for trace-cmd, we can add a --poll option that would do
> this.

I just sent a patch with my attempt at implementing --poll.

-- 
Nicolás Sáenz

