* [PATCH] ring-buffer: More precise time stamps for nested writes

From: Suresh E. Warrier <warrier@linux.vnet.ibm.com>
Date: 2015-03-24 23:10 UTC
To: linux-kernel
Cc: Steven Rostedt, Frederic Weisbecker, Ingo Molnar, Paul Mackerras

When tracing the behavior of multiple fio jobs running in parallel, our
performance team observed that some scsi_dispatch_cmd_done events appeared
to occur earlier, often several microseconds earlier, than their associated
scsi_dispatch_cmd_start events in the trace records. Other interrupt events
were also observed to have earlier timestamps than the events that caused
them.

This incorrect chronological ordering of trace records occurs because ALL
nested writes have the same time stamp as the first writer, that is, the
first writer in the stack that was preempted. In workloads where interrupts
occur frequently, the first writer can stay preempted across multiple
interrupts, and nested trace events can record time stamps that are many
microseconds earlier than their actual value, resulting in the wrong
ordering.

For example, a scsi_dispatch_cmd_start on CPU A occurs, say at time t1, and
the corresponding scsi_dispatch_cmd_done occurs on another CPU B, say at
time t2. However, CPU B is in the middle of a nested write, with the first
writer having recorded its event at time t0, where t0 < t1 < t2.
Eventually, we get out of all the nested writes, the first writer commits
its trace event with a time of t0, and the nested scsi_dispatch_cmd_done
gets assigned the same timestamp t0. In the trace report, the
scsi_dispatch_cmd_done thus appears to have occurred before the
scsi_dispatch_cmd_start. In some cases, on larger systems with multiple fio
jobs running and interrupts being sent to a single CPU, we have noticed
more than 400 trace events all with the same timestamp.
The reason all nested writes have the same time stamp is that all nested
trace events are assigned a time delta of 0. A correct time delta cannot be
computed for them because the first interrupted write has not been
committed yet, and so the commit timestamp in the CPU buffer does not
record the time stamp of the just-prior event.

One way of fixing this is to keep a timestamp of the last event we reserved
space for in the ring buffer, so that we can compute the correct time delta
for each event. We also need a short critical section where we cannot be
interrupted, in which we:

 1. Read the current time
 2. Compute the time delta for this event from the last time stamp
 3. Update the value of the last time stamp
 4. Reserve space in the ring buffer for the event

Although the ring buffer design is coded to avoid disabling interrupts, in
this case there does not appear to be a practical way to solve this problem
without disabling interrupts for a short time. To accommodate those
architectures where disabling interrupts is expensive, we introduce a new
kernel configuration option called
CONFIG_RING_BUFFER_NESTED_WRITE_TIMESTAMP_UPDATE. On architectures where
local_irq_save/restore is cheap, this should be set to Y in order to get
accurate timestamps. Other architectures can set it to N to avoid the
overhead of local_irq_save/restore at the cost of getting inaccurate
timestamps for nested writes.
Signed-off-by: Suresh Warrier <warrier@linux.vnet.ibm.com>
---
 kernel/trace/Kconfig       | 15 ++++++++++++
 kernel/trace/ring_buffer.c | 59 +++++++++++++++++++++++++++++++++++++++++++---
 2 files changed, 71 insertions(+), 3 deletions(-)

diff --git a/kernel/trace/Kconfig b/kernel/trace/Kconfig
index 015f85a..8a8e185 100644
--- a/kernel/trace/Kconfig
+++ b/kernel/trace/Kconfig
@@ -573,6 +573,21 @@ config RING_BUFFER_STARTUP_TEST
 
 	  If unsure, say N
 
+config RING_BUFFER_NESTED_WRITE_TIMESTAMP_UPDATE
+	bool "Ring buffer update timestamps for nested writes"
+	depends on RING_BUFFER
+	default y if PPC64
+	help
+	  This option allows nested writes to have updated time stamps
+	  to correspond to the time that the event occurred instead of
+	  having the same time stamp as the first write that was
+	  interrupted.
+
+	  The option requires a fast IRQ enable/disable mechanism
+	  to work efficiently.
+
+	  If unsure, say N
+
 endif # FTRACE
 
 endif # TRACING_SUPPORT

diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
index 3d9fee3..bb18d44 100644
--- a/kernel/trace/ring_buffer.c
+++ b/kernel/trace/ring_buffer.c
@@ -482,6 +482,7 @@ struct ring_buffer_per_cpu {
 	unsigned long		read_bytes;
 	u64			write_stamp;
 	u64			read_stamp;
+	u64			last_stamp;
 	/* ring buffer pages to update, > 0 to add, < 0 to remove */
 	int			nr_pages_to_update;
 	struct list_head	new_pages; /* new pages to add */
@@ -2008,9 +2009,11 @@ rb_update_event(struct ring_buffer_per_cpu *cpu_buffer,
 		struct ring_buffer_event *event, unsigned length,
 		int add_timestamp, u64 delta)
 {
+#ifndef CONFIG_RING_BUFFER_NESTED_WRITE_TIMESTAMP_UPDATE
 	/* Only a commit updates the timestamp */
 	if (unlikely(!rb_event_is_commit(cpu_buffer, event)))
 		delta = 0;
+#endif
 
 	/*
 	 * If we need to add a timestamp, then we
@@ -2520,6 +2523,49 @@ static inline void rb_end_commit(struct ring_buffer_per_cpu *cpu_buffer)
 	}
 }
 
+#ifdef CONFIG_RING_BUFFER_NESTED_WRITE_TIMESTAMP_UPDATE
+/**
+ * get_write_timestamp:
+ *	Must be called before we read the current time stamp.
+ *	It returns a pointer to the location of the last
+ *	time stamp.
+ * put_write_timestamp:
+ *	Must be called after we reserve space in the ring
+ *	buffer. It saves the timestamp in the specified
+ *	timestamp location (passed in as *pstamp).
+ */
+static u64 *get_write_timestamp(struct ring_buffer_per_cpu *cpu_buffer,
+				unsigned long *flags)
+{
+	/*
+	 * Ensure that we are not preempted until after we update
+	 * the write timestamp.
+	 */
+	local_irq_save(*flags);
+	return &cpu_buffer->last_stamp;
+}
+
+static void put_write_timestamp(u64 *pstamp, u64 ts, unsigned long *flags)
+{
+	/*
+	 * Update time stamp to match timestamp of latest event on buffer
+	 * before reenabling interrupts
+	 */
+	*pstamp = ts;
+	local_irq_restore(*flags);
+}
+#else
+static u64 *get_write_timestamp(struct ring_buffer_per_cpu *cpu_buffer,
+				unsigned long *flags)
+{
+	return &cpu_buffer->write_stamp;
+}
+
+static void put_write_timestamp(u64 *pstamp, u64 ts, unsigned long *flags)
+{
+}
+#endif
+
 static struct ring_buffer_event *
 rb_reserve_next_event(struct ring_buffer *buffer,
 		      struct ring_buffer_per_cpu *cpu_buffer,
@@ -2530,6 +2576,8 @@ rb_reserve_next_event(struct ring_buffer *buffer,
 	int nr_loops = 0;
 	int add_timestamp;
 	u64 diff;
+	u64 *pstamp;
+	unsigned long flags;
 
 	rb_start_commit(cpu_buffer);
 
@@ -2565,14 +2613,16 @@ rb_reserve_next_event(struct ring_buffer *buffer,
 	if (RB_WARN_ON(cpu_buffer, ++nr_loops > 1000))
 		goto out_fail;
 
+	pstamp = get_write_timestamp(cpu_buffer, &flags);
+
 	ts = rb_time_stamp(cpu_buffer->buffer);
-	diff = ts - cpu_buffer->write_stamp;
+	diff = ts - *pstamp;
 
 	/* make sure this diff is calculated here */
 	barrier();
 
 	/* Did the write stamp get updated already? */
-	if (likely(ts >= cpu_buffer->write_stamp)) {
+	if (likely(ts >= *pstamp)) {
 		delta = diff;
 		if (unlikely(test_time_stamp(delta))) {
 			int local_clock_stable = 1;
@@ -2583,7 +2633,7 @@ rb_reserve_next_event(struct ring_buffer *buffer,
 			  KERN_WARNING "Delta way too big! %llu ts=%llu write stamp = %llu\n%s",
 			  (unsigned long long)delta,
 			  (unsigned long long)ts,
-			  (unsigned long long)cpu_buffer->write_stamp,
+			  (unsigned long long)*pstamp,
 			  local_clock_stable ? "" :
 			  "If you just came from a suspend/resume,\n"
 			  "please switch to the trace global clock:\n"
@@ -2594,6 +2644,9 @@ rb_reserve_next_event(struct ring_buffer *buffer,
 
 	event = __rb_reserve_next(cpu_buffer, length, ts, delta,
 				  add_timestamp);
+
+	put_write_timestamp(pstamp, ts, &flags);
+
 	if (unlikely(PTR_ERR(event) == -EAGAIN))
 		goto again;

-- 
1.8.3.4
* Re: [PATCH] ring-buffer: More precise time stamps for nested writes

From: Steven Rostedt
Date: 2015-03-24 23:19 UTC
To: Suresh E. Warrier
Cc: linux-kernel, Frederic Weisbecker, Ingo Molnar, Paul Mackerras

On Tue, 24 Mar 2015 18:10:05 -0500

> Although the ring buffer design is coded to avoid disabling
> interrupts, in this case there does not appear to be a practical way
> to solve this problem without disabling interrupts for a short time.
> To accommodate those architectures where disabling interrupts is
> expensive, we introduce a new kernel configuration option called
> CONFIG_RING_BUFFER_NESTED_WRITE_TIMESTAMP_UPDATE. On architectures
> where local_irq_save/restore is cheap, this should be set to Y,
> in order to get accurate timestamps. Other architectures can set it
> to N to avoid the overhead of local_irq_save/restore at the cost of
> getting inaccurate timestamps for nested writes.

There is no architecture where disabling interrupts is cheap. Actually,
enabling them is the killer. Doing function tracing shows the impact of
this rather well, as it would disable and enable interrupts for every
function call, which can get rather expensive in the sum of things, and
it does skew the timings and can make it more difficult to debug
heisenbugs.

That said, I feel your pain. I've had some ideas about doing this
without disabling interrupts. But for now, what can be done is to have
a flag that is set that will implement this or not, using
static_branch() to implement it such that when it's off it has no
effect.

In the mean time, I can go and revisit trying to have better timings
with nested writes.

-- Steve
* Re: [PATCH] ring-buffer: More precise time stamps for nested writes

From: Suresh E. Warrier
Date: 2015-03-27 5:38 UTC
To: Steven Rostedt
Cc: linux-kernel, Frederic Weisbecker, Ingo Molnar, Paul Mackerras

On 03/24/2015 06:19 PM, Steven Rostedt wrote:
> On Tue, 24 Mar 2015 18:10:05 -0500
> ..
> ..
> There is no architecture where disabling interrupts is cheap. Actually,
> enabling them is the killer. Doing function tracing shows the impact of
> this rather well, as it would disable and enable interrupts for every
> function call, which can get rather expensive in the sum of things, and
> it does skew the timings and can make it more difficult to debug
> heisenbugs.

On PPC64, enabling and disabling soft IRQs is simply a matter of
setting/clearing a per-CPU global variable (in the per-CPU PACA
structure), so it shouldn't be expensive unless an interrupt is pending
when we re-enable, which should be rare. I ran a series of tests on a
Power8 box, and even with a cold cache, both the local_irq_save and
local_irq_restore routines only took a few nanoseconds, although the
restore was a little more expensive than the save. (The restore took
about 80-160 ns on a cold cache and 20-30 ns on a warm one.)

> That said, I feel your pain. I've had some ideas about doing this
> without disabling interrupts.

That is excellent!

> But for now, what can be done is to have
> a flag that is set that will implement this or not. Using
> static_branch() to implement it such that when its off it has no effect.

Are you recommending that for now I use a static_branch() instead of a
CONFIG option to fix this? I could do that, but the resulting code will
either be messier to read (with several if-condition checks) or will
require some duplication of code. My assumption is that the new CONFIG
option, when disabled, should have negligible impact since the compiler
inlines the functions.

-suresh

> In the mean time, I can go and revisit trying to have better timings
> with nested writes.
>
> -- Steve
* Re: [PATCH] ring-buffer: More precise time stamps for nested writes

From: Steven Rostedt
Date: 2015-03-27 16:04 UTC
To: Suresh E. Warrier
Cc: linux-kernel, Frederic Weisbecker, Ingo Molnar, Paul Mackerras

On Fri, 27 Mar 2015 00:38:43 -0500
"Suresh E. Warrier" <warrier@linux.vnet.ibm.com> wrote:

> > But for now, what can be done is to have
> > a flag that is set that will implement this or not. Using
> > static_branch() to implement it such that when its off it has no effect.
>
> Are you recommending that for now I use a static_branch() instead
> of a CONFIG option to fix this? I could do that but the resulting
> code will either be messier to read (with several if condition checks)
> or will require some duplication of code. My assumption is that the
> new CONFIG option when disabled should have negligible impact since
> the compiler inlines the functions.

It can be done cleanly if you encapsulate it properly.

Too bad I'm not going on any trips soon. This is a project I would work
on on the plane.

-- Steve
* Re: [PATCH] ring-buffer: More precise time stamps for nested writes

From: Paul Mackerras <paulus@samba.org>
Date: 2015-03-27 22:14 UTC
To: Steven Rostedt
Cc: Suresh E. Warrier, linux-kernel, Frederic Weisbecker, Ingo Molnar

On Fri, Mar 27, 2015 at 12:04:15PM -0400, Steven Rostedt wrote:
> On Fri, 27 Mar 2015 00:38:43 -0500
> "Suresh E. Warrier" <warrier@linux.vnet.ibm.com> wrote:
>
> > > But for now, what can be done is to have
> > > a flag that is set that will implement this or not. Using
> > > static_branch() to implement it such that when its off it has no effect.
> >
> > Are you recommending that for now I use a static_branch() instead
> > of a CONFIG option to fix this? I could do that but the resulting
> > code will either be messier to read (with several if condition checks)
> > or will require some duplication of code. My assumption is that the
> > new CONFIG option when disabled should have negligible impact since
> > the compiler inlines the functions.
>
> It can be done cleanly if you encapsulate it properly.

Sure, but what is the advantage to using a static branch? When would
you ever want a single kernel image that could run either way
depending on what machine it was running on?

> Too bad I'm not going on any trips soon. This is a project I would work
> on on the plane.

:)

Paul.
* Re: [PATCH] ring-buffer: More precise time stamps for nested writes

From: Steven Rostedt
Date: 2015-03-27 22:17 UTC
To: Paul Mackerras
Cc: Suresh E. Warrier, linux-kernel, Frederic Weisbecker, Ingo Molnar

On Sat, 28 Mar 2015 09:14:30 +1100
Paul Mackerras <paulus@samba.org> wrote:

> > It can be done cleanly if you encapsulate it properly.
>
> Sure, but what is the advantage to using a static branch? When would
> you ever want a single kernel image that could run either way
> depending on what machine it was running on?

For x86 you don't want it, unless you want to see interrupts
serialized, where sometimes you do. It's not just an arch thing.

-- Steve