* [PATCH 0/2] ring-buffer: More precise timestamps for nested writes
       [not found] <1428459449-10664-1-git-send-email-warrier@linux.vnet.ibm.com>
@ 2015-04-14  2:29 ` Suresh E. Warrier
  2015-04-14  2:31   ` [PATCH 1/2] ring-buffer: Introduce precise nested timestamp sysfs attribute Suresh E. Warrier
  0 siblings, 1 reply; 8+ messages in thread
From: Suresh E. Warrier @ 2015-04-14  2:29 UTC (permalink / raw)
  To: lkml; +Cc: Steven Rostedt, Frederic Weisbecker, Ingo Molnar, Paul Mackerras

This patch set provides a way to record precise timestamps for nested
writes, along with a tunable to enable the behavior.

Nested writes occur when a trace writer is preempted, usually by an
interrupt, and the interrupt handler also writes its own trace events.
On large systems with lots of interrupts, it has been observed that
several hundred trace events can be recorded with the same timestamp.

Fixing this requires a small critical section in the trace write code
in which interrupts are disabled. Disabling interrupts is an expensive
operation on many architectures, hence the need for a kernel tunable
that lets admins enable the feature only when needed.

Suresh Warrier (2):
  ring-buffer: Introduce precise nested timestamp sysfs attribute
  ring-buffer: More precise timestamps for nested writes

 Documentation/trace/ftrace.txt |  8 ++++
 include/linux/ring_buffer.h    |  9 +++++
 kernel/trace/ring_buffer.c     | 85 ++++++++++++++++++++++++++++++++++++++++--
 kernel/trace/trace.c           | 51 +++++++++++++++++++++++++
 4 files changed, 149 insertions(+), 4 deletions(-)

-- 
1.8.3.4





* [PATCH 1/2] ring-buffer: Introduce precise nested timestamp sysfs attribute
  2015-04-14  2:29 ` [PATCH 0/2] ring-buffer: More precise timestamps for nested writes Suresh E. Warrier
@ 2015-04-14  2:31   ` Suresh E. Warrier
  2015-04-14  2:38     ` [PATCH 2/2] More precise timestamps for nested writes Suresh E. Warrier
  0 siblings, 1 reply; 8+ messages in thread
From: Suresh E. Warrier @ 2015-04-14  2:31 UTC (permalink / raw)
  To: lkml; +Cc: Steven Rostedt, Frederic Weisbecker, Ingo Molnar, Paul Mackerras

Introduce a new tunable, nested_precise_ts, that allows precise
timestamps to be recorded for nested trace writes.

The tunable is /sys/kernel/debug/tracing/nested_precise_ts. By default
it is zero and the behavior is disabled. Set the tunable to 1 in order
to get accurate timestamps.
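
For illustration, the expected interaction with the tunable looks like
this (assuming debugfs is mounted at the usual /sys/kernel/debug):

	# cat /sys/kernel/debug/tracing/nested_precise_ts
	0
	# echo 1 > /sys/kernel/debug/tracing/nested_precise_ts
	# cat /sys/kernel/debug/tracing/nested_precise_ts
	1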

Signed-off-by: Suresh Warrier <warrier@linux.vnet.ibm.com>
---
 Documentation/trace/ftrace.txt |  8 +++++++
 include/linux/ring_buffer.h    |  9 ++++++++
 kernel/trace/ring_buffer.c     | 14 ++++++++++++
 kernel/trace/trace.c           | 51 ++++++++++++++++++++++++++++++++++++++++++
 4 files changed, 82 insertions(+)

diff --git a/Documentation/trace/ftrace.txt b/Documentation/trace/ftrace.txt
index bfe8c29..4d1e83b 100644
--- a/Documentation/trace/ftrace.txt
+++ b/Documentation/trace/ftrace.txt
@@ -264,6 +264,14 @@ of ftrace. Here is a list of some of the key files:
 	one will show only the first kernel function that is called
 	from user space.

+  nested_precise_ts:
+
+	When set, this generates precise timestamps even for nested
+	writers, which otherwise all get the same timestamp as the
+	initially preempted writer. However, this will require briefly
+	disabling interrupts on the local CPU for each trace event, which
+	can be expensive on some architectures.
+
   printk_formats:

 	This is for tools that read the raw format files. If an event in
diff --git a/include/linux/ring_buffer.h b/include/linux/ring_buffer.h
index 49a4d6f..682147a 100644
--- a/include/linux/ring_buffer.h
+++ b/include/linux/ring_buffer.h
@@ -5,6 +5,7 @@
 #include <linux/mm.h>
 #include <linux/seq_file.h>
 #include <linux/poll.h>
+#include <linux/jump_label.h>

 struct ring_buffer;
 struct ring_buffer_iter;
@@ -97,6 +98,14 @@ __ring_buffer_alloc(unsigned long size, unsigned flags, struct lock_class_key *k
 	__ring_buffer_alloc((size), (flags), &__key);	\
 })

+extern struct static_key __precise_nested_write_ts;
+static inline bool rb_precise_nested_write_ts(void)
+{
+	return static_key_false(&__precise_nested_write_ts);
+}
+void rb_enable_precise_nested_write_ts(void);
+void rb_disable_precise_nested_write_ts(void);
+
 int ring_buffer_wait(struct ring_buffer *buffer, int cpu);
 int ring_buffer_poll_wait(struct ring_buffer *buffer, int cpu,
 			  struct file *filp, poll_table *poll_table);
diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
index 3d9fee3..c9b3005 100644
--- a/kernel/trace/ring_buffer.c
+++ b/kernel/trace/ring_buffer.c
@@ -2520,6 +2520,20 @@ static inline void rb_end_commit(struct ring_buffer_per_cpu *cpu_buffer)
 	}
 }

+struct static_key __precise_nested_write_ts;
+
+void rb_enable_precise_nested_write_ts(void)
+{
+	if (!rb_precise_nested_write_ts())
+		static_key_slow_inc(&__precise_nested_write_ts);
+}
+
+void rb_disable_precise_nested_write_ts(void)
+{
+	if (rb_precise_nested_write_ts())
+		static_key_slow_dec(&__precise_nested_write_ts);
+}
+
 static struct ring_buffer_event *
 rb_reserve_next_event(struct ring_buffer *buffer,
 		      struct ring_buffer_per_cpu *cpu_buffer,
diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
index 18cdf91..f27dcde 100644
--- a/kernel/trace/trace.c
+++ b/kernel/trace/trace.c
@@ -3486,6 +3486,7 @@ static const char readme_msg[] =
 	"\t\t\t  Remove sub-buffer with rmdir\n"
 	"  trace_options\t\t- Set format or modify how tracing happens\n"
 	"\t\t\t  Disable an option by adding a suffix 'no' to the option name\n"
+	"  nested_precise_ts\t- Enable/disable precise timestamps for nested writes\n"
 #ifdef CONFIG_DYNAMIC_FTRACE
 	"\n  available_filter_functions - list of functions that can be filtered on\n"
 	"  set_ftrace_filter\t- echo function name in here to only trace these functions\n"
@@ -4680,6 +4681,45 @@ static int tracing_clock_open(struct inode *inode, struct file *file)
 	return ret;
 }

+static ssize_t
+tracing_nested_precise_read(struct file *filp, char __user *ubuf,
+	       size_t cnt, loff_t *ppos)
+{
+	char buf[64];
+	int r;
+
+	r = rb_precise_nested_write_ts() ? 1 : 0;
+	r = sprintf(buf, "%d\n", r);
+
+	return simple_read_from_buffer(ubuf, cnt, ppos, buf, r);
+}
+
+static ssize_t
+tracing_nested_precise_write(struct file *filp, const char __user *ubuf,
+		size_t cnt, loff_t *ppos)
+{
+	struct trace_array *tr = filp->private_data;
+	struct ring_buffer *buffer = tr->trace_buffer.buffer;
+	unsigned long val;
+	int ret;
+
+	ret = kstrtoul_from_user(ubuf, cnt, 10, &val);
+	if (ret)
+		return ret;
+
+	if (buffer) {
+		mutex_lock(&trace_types_lock);
+		if (val)
+			rb_enable_precise_nested_write_ts();
+		else
+			rb_disable_precise_nested_write_ts();
+		mutex_unlock(&trace_types_lock);
+	}
+
+	(*ppos)++;
+
+	return cnt;
+}
+
 struct ftrace_buffer_info {
 	struct trace_iterator	iter;
 	void			*spare;
@@ -4910,6 +4950,14 @@ static const struct file_operations trace_clock_fops = {
 	.write		= tracing_clock_write,
 };

+static const struct file_operations tracing_nested_precise_fops = {
+	.open		= tracing_open_generic_tr,
+	.read		= tracing_nested_precise_read,
+	.write		= tracing_nested_precise_write,
+	.llseek		= generic_file_llseek,
+	.release	= tracing_release_generic_tr,
+};
+
 #ifdef CONFIG_TRACER_SNAPSHOT
 static const struct file_operations snapshot_fops = {
 	.open		= tracing_snapshot_open,
@@ -6160,6 +6208,9 @@ init_tracer_debugfs(struct trace_array *tr, struct dentry *d_tracer)
 	trace_create_file("trace_clock", 0644, d_tracer, tr,
 			  &trace_clock_fops);

+	trace_create_file("nested_precise_ts", 0644, d_tracer,
+			  tr, &tracing_nested_precise_fops);
+
 	trace_create_file("tracing_on", 0644, d_tracer,
 			  tr, &rb_simple_fops);

-- 
1.8.3.4



* [PATCH 2/2] More precise timestamps for nested writes
  2015-04-14  2:31   ` [PATCH 1/2] ring-buffer: Introduce precise nested timestamp sysfs attribute Suresh E. Warrier
@ 2015-04-14  2:38     ` Suresh E. Warrier
  2015-04-14 17:13       ` Peter Zijlstra
  2015-04-15  9:16       ` Peter Zijlstra
  0 siblings, 2 replies; 8+ messages in thread
From: Suresh E. Warrier @ 2015-04-14  2:38 UTC (permalink / raw)
  To: lkml; +Cc: Steven Rostedt, Frederic Weisbecker, Ingo Molnar, Paul Mackerras

When tracing the behavior of multiple fio jobs running in parallel,
our performance team observed that some scsi_dispatch_cmd_done events
appeared to occur earlier, often several microseconds earlier, than
their associated scsi_dispatch_cmd_start events in the trace records.
Other interrupt events were also observed to have earlier timestamps
than the events that caused them.

This incorrect chronological ordering of trace records occurs because
ALL nested writes get the same timestamp as the first writer, that is,
the writer in the preemption stack that was interrupted first. In
workloads where interrupts occur frequently, the first writer can stay
preempted across multiple interrupts, and nested trace events can
record timestamps that are many microseconds earlier than their actual
values, resulting in the wrong ordering.

For example, a scsi_dispatch_cmd_start on CPU A occurs, say at time
t1, and the corresponding scsi_dispatch_cmd_done occurs on another
CPU B, say at time t2. However, CPU B is in the middle of a nested
write, with the first writer having recorded its event at time t0,
where t0 < t1 < t2. Eventually, we get out of all the nested writes;
the first writer commits its trace event with a time of t0, and the
nested scsi_dispatch_cmd_done gets assigned the same timestamp t0.
In the trace report, the scsi_dispatch_cmd_done thus appears to have
occurred before the scsi_dispatch_cmd_start.

In some cases, on larger systems with multiple fio jobs running and
interrupts being sent to a single CPU, we have noticed more than
400 trace events all with the same timestamp.

The reason all nested writes have the same timestamp is that all
nested trace events are assigned a time delta of 0. A correct time
delta cannot be computed for them because the first interrupted write
has not been committed yet, so the commit timestamp in the CPU buffer
does not yet reflect the timestamp of the immediately preceding event.

One way of fixing this is to keep a timestamp of the last event we
reserved space for in the ring buffer, so that we can compute the
correct time delta for each event. We also need a short critical
section where we cannot be interrupted, in which we (see the sketch
after this list):
 1. Read the current time
 2. Compute the time delta for this event from the last timestamp
 3. Update the value of the last timestamp
 4. Reserve space in the ring buffer for the event
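
As a minimal sketch (illustrative only; the actual patch threads these
steps through get_write_timestamp()/put_write_timestamp()), the body of
the critical section inside rb_reserve_next_event() amounts to:

	unsigned long flags;
	u64 ts, delta;

	local_irq_save(flags);			/* enter critical section */
	ts = rb_time_stamp(cpu_buffer->buffer);	/* 1. read current time */
	delta = ts - cpu_buffer->last_stamp;	/* 2. delta from last event */
	cpu_buffer->last_stamp = ts;		/* 3. update last time stamp */
	event = __rb_reserve_next(cpu_buffer, length, ts,
				  delta, add_timestamp); /* 4. reserve space */
	local_irq_restore(flags);		/* leave critical section */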

Although the ring buffer design is carefully coded to avoid disabling
interrupts, in this case there does not appear to be a practical way
to solve this problem without disabling interrupts for a short time.
To accommodate those architectures where disabling interrupts is
expensive, a kernel tunable (/sys/kernel/debug/tracing/nested_precise_ts)
needs to be set to 1 to enable this behavior and get accurate
timestamps.

Signed-off-by: Suresh Warrier <warrier@linux.vnet.ibm.com>
---
 kernel/trace/ring_buffer.c | 71 +++++++++++++++++++++++++++++++++++++++++++---
 1 file changed, 67 insertions(+), 4 deletions(-)

diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
index c9b3005..0a2d862 100644
--- a/kernel/trace/ring_buffer.c
+++ b/kernel/trace/ring_buffer.c
@@ -482,6 +482,7 @@ struct ring_buffer_per_cpu {
 	unsigned long			read_bytes;
 	u64				write_stamp;
 	u64				read_stamp;
+	u64				last_stamp;
 	/* ring buffer pages to update, > 0 to add, < 0 to remove */
 	int				nr_pages_to_update;
 	struct list_head		new_pages; /* new pages to add */
@@ -2010,7 +2011,8 @@ rb_update_event(struct ring_buffer_per_cpu *cpu_buffer,
 {
 	/* Only a commit updates the timestamp */
 	if (unlikely(!rb_event_is_commit(cpu_buffer, event)))
-		delta = 0;
+		if (!rb_precise_nested_write_ts())
+			delta = 0;
 
 	/*
 	 * If we need to add a timestamp, then we
@@ -2534,6 +2536,60 @@ void rb_disable_precise_nested_write_ts(void)
 		static_key_slow_dec(&__precise_nested_write_ts);
 }
 
+/**
+ * get_write_timestamp:
+ *	Must be called before we read the current time stamp.
+ *	It returns a pointer to the location of the last
+ *	time stamp to be used in the delta calculation.
+ *	If precise nested write timestamps are enabled, it first
+ *	disables interrupts on the current processor so that
+ *	we can reserve space on the buffer and save the event's
+ *	timestamp without being preempted.
+ *
+ * put_write_timestamp:
+ *	Must be called after we reserve space in the ring buffer.
+ *	If precise nested write timestamps are enabled, it saves the
+ *	timestamp in the specified timestamp location (passed in
+ *	as *pstamp) so that the nested writers always have a valid
+ *	timestamp to compute the timestamp deltas for their events.
+ *	This must be done before we re-enable interrupts.
+ *
+ */
+static u64 *get_write_timestamp(struct ring_buffer_per_cpu *cpu_buffer,
+				unsigned long *flags)
+{
+	if (rb_precise_nested_write_ts()) {
+		/*
+		 * Ensure that we are not preempted until after we update
+		 * the write timestamp.
+		 */
+		local_irq_save(*flags);
+		return &cpu_buffer->last_stamp;
+	} else {
+		return &cpu_buffer->write_stamp;
+	}
+}
+
+static void put_write_timestamp(struct ring_buffer_per_cpu *cpu_buffer,
+				u64 *pstamp,
+				u64 ts,
+				unsigned long *flags)
+{
+	/*
+	 * Rather than call rb_precise_nested_write_ts() (a NOP in
+	 * the default case) to check whether we need to restore
+	 * interrupts, explicitly test whether pstamp points to the
+	 * last_stamp field. This handles the case where the
+	 * nested_precise_ts sysfs tunable was flipped between the
+	 * get_write_timestamp() and put_write_timestamp() calls.
+	 */
+	if (&cpu_buffer->last_stamp == pstamp) {
+		*pstamp = ts;
+		local_irq_restore(*flags);
+	}
+}
+
 static struct ring_buffer_event *
 rb_reserve_next_event(struct ring_buffer *buffer,
 		      struct ring_buffer_per_cpu *cpu_buffer,
@@ -2544,6 +2600,8 @@ rb_reserve_next_event(struct ring_buffer *buffer,
 	int nr_loops = 0;
 	int add_timestamp;
 	u64 diff;
+	u64 *pstamp;
+	unsigned long flags = 0;	/* silence GCC uninitialized warning */
 
 	rb_start_commit(cpu_buffer);
 
@@ -2579,14 +2637,16 @@ rb_reserve_next_event(struct ring_buffer *buffer,
 	if (RB_WARN_ON(cpu_buffer, ++nr_loops > 1000))
 		goto out_fail;
 
+	pstamp = get_write_timestamp(cpu_buffer, &flags);
+
 	ts = rb_time_stamp(cpu_buffer->buffer);
-	diff = ts - cpu_buffer->write_stamp;
+	diff = ts - *pstamp;
 
 	/* make sure this diff is calculated here */
 	barrier();
 
 	/* Did the write stamp get updated already? */
-	if (likely(ts >= cpu_buffer->write_stamp)) {
+	if (likely(ts >= *pstamp)) {
 		delta = diff;
 		if (unlikely(test_time_stamp(delta))) {
 			int local_clock_stable = 1;
@@ -2597,7 +2657,7 @@ rb_reserve_next_event(struct ring_buffer *buffer,
 				  KERN_WARNING "Delta way too big! %llu ts=%llu write stamp = %llu\n%s",
 				  (unsigned long long)delta,
 				  (unsigned long long)ts,
-				  (unsigned long long)cpu_buffer->write_stamp,
+				  (unsigned long long)*pstamp,
 				  local_clock_stable ? "" :
 				  "If you just came from a suspend/resume,\n"
 				  "please switch to the trace global clock:\n"
@@ -2608,6 +2668,9 @@ rb_reserve_next_event(struct ring_buffer *buffer,
 
 	event = __rb_reserve_next(cpu_buffer, length, ts,
 				  delta, add_timestamp);
+
+	put_write_timestamp(cpu_buffer, pstamp, ts, &flags);
+
 	if (unlikely(PTR_ERR(event) == -EAGAIN))
 		goto again;
 
-- 
1.8.3.4



* Re: [PATCH 2/2] More precise timestamps for nested writes
  2015-04-14  2:38     ` [PATCH 2/2] More precise timestamps for nested writes Suresh E. Warrier
@ 2015-04-14 17:13       ` Peter Zijlstra
  2015-04-15  0:33         ` Suresh E. Warrier
  2015-04-15  9:16       ` Peter Zijlstra
  1 sibling, 1 reply; 8+ messages in thread
From: Peter Zijlstra @ 2015-04-14 17:13 UTC (permalink / raw)
  To: Suresh E. Warrier
  Cc: lkml, Steven Rostedt, Frederic Weisbecker, Ingo Molnar, Paul Mackerras

On Mon, Apr 13, 2015 at 09:38:01PM -0500, Suresh E. Warrier wrote:
> +static u64 *get_write_timestamp(struct ring_buffer_per_cpu *cpu_buffer,
> +				unsigned long *flags)
> +{
> +	if (rb_precise_nested_write_ts()) {
> +		/*
> +		 * Ensure that we are not preempted until after we update
> +		 * the write timestamp.
> +		 */
> +		local_irq_save(*flags);
> +		return &cpu_buffer->last_stamp;

Yeah, ever hear about NMIs? This isn't going to work.

> +	} else {
> +		return &cpu_buffer->write_stamp;
> +	}
> +}


* Re: [PATCH 2/2] More precise timestamps for nested writes
  2015-04-14 17:13       ` Peter Zijlstra
@ 2015-04-15  0:33         ` Suresh E. Warrier
  2015-04-15  8:45           ` Peter Zijlstra
  0 siblings, 1 reply; 8+ messages in thread
From: Suresh E. Warrier @ 2015-04-15  0:33 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: lkml, Steven Rostedt, Frederic Weisbecker, Ingo Molnar, Paul Mackerras

On 04/14/2015 12:13 PM, Peter Zijlstra wrote:
> On Mon, Apr 13, 2015 at 09:38:01PM -0500, Suresh E. Warrier wrote:
>> +static u64 *get_write_timestamp(struct ring_buffer_per_cpu *cpu_buffer,
>> +				unsigned long *flags)
>> +{
>> +	if (rb_precise_nested_write_ts()) {
>> +		/*
>> +		 * Ensure that we are not preempted until after we update
>> +		 * the write timestamp.
>> +		 */
>> +		local_irq_save(*flags);
>> +		return &cpu_buffer->last_stamp;
> 
> Yeah, ever hear about NMIs? This isn't going to work.

That is a good point! If an NMI can come in and start running a handler
that can generate a trace event, this code is indeed broken.

Some architectures, like PowerPC, don't have NMIs the way Intel does,
so I hadn't thought of that. Thanks for catching that!

Let me update the patch to handle NMIs - trace events from NMI code 
cannot be made precise (the behavior will be the same as without the
patch).

-suresh



* Re: [PATCH 2/2] More precise timestamps for nested writes
  2015-04-15  0:33         ` Suresh E. Warrier
@ 2015-04-15  8:45           ` Peter Zijlstra
  0 siblings, 0 replies; 8+ messages in thread
From: Peter Zijlstra @ 2015-04-15  8:45 UTC (permalink / raw)
  To: Suresh E. Warrier
  Cc: lkml, Steven Rostedt, Frederic Weisbecker, Ingo Molnar, Paul Mackerras

On Tue, Apr 14, 2015 at 07:33:00PM -0500, Suresh E. Warrier wrote:
> On 04/14/2015 12:13 PM, Peter Zijlstra wrote:
> > On Mon, Apr 13, 2015 at 09:38:01PM -0500, Suresh E. Warrier wrote:
> >> +static u64 *get_write_timestamp(struct ring_buffer_per_cpu *cpu_buffer,
> >> +				unsigned long *flags)
> >> +{
> >> +	if (rb_precise_nested_write_ts()) {
> >> +		/*
> >> +		 * Ensure that we are not preempted until after we update
> >> +		 * the write timestamp.
> >> +		 */
> >> +		local_irq_save(*flags);
> >> +		return &cpu_buffer->last_stamp;
> > 
> > Yeah, ever hear about NMIs? This isn't going to work.
> 
> That is a good point! If a NMI can come in and start running a handler
> that can generate a trace event, this code is indeed broken.
> 
> Some architectures like PowerPC don't have NMIs like Intel and so 
> I hadn't thought of that. Thanks for catching that!

Learn your arch better; Power actually has the nesting level.
local_irq_disable() is a software disable, only if an actual interrupt
comes in while that flag is set do you disable the irq in hardware.

Now Paul Mackerras (ab)used this to allow PMU interrupts while soft
disabled, which effectively gets you NMI nesting on Power.




* Re: [PATCH 2/2] More precise timestamps for nested writes
  2015-04-14  2:38     ` [PATCH 2/2] More precise timestamps for nested writes Suresh E. Warrier
  2015-04-14 17:13       ` Peter Zijlstra
@ 2015-04-15  9:16       ` Peter Zijlstra
  2015-04-15 16:53         ` Suresh E. Warrier
  1 sibling, 1 reply; 8+ messages in thread
From: Peter Zijlstra @ 2015-04-15  9:16 UTC (permalink / raw)
  To: Suresh E. Warrier
  Cc: lkml, Steven Rostedt, Frederic Weisbecker, Ingo Molnar, Paul Mackerras

On Mon, Apr 13, 2015 at 09:38:01PM -0500, Suresh E. Warrier wrote:

> diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
> index c9b3005..0a2d862 100644
> --- a/kernel/trace/ring_buffer.c
> +++ b/kernel/trace/ring_buffer.c
> @@ -482,6 +482,7 @@ struct ring_buffer_per_cpu {
>  	unsigned long			read_bytes;
>  	u64				write_stamp;
>  	u64				read_stamp;
> +	u64				last_stamp;
>  	/* ring buffer pages to update, > 0 to add, < 0 to remove */
>  	int				nr_pages_to_update;
>  	struct list_head		new_pages; /* new pages to add */

So what is wrong with something like:

u64 rb_get_delta(struct ring_buffer_per_cpu *cpu_buffer)
{
	u64 last, now;

	for (;;) {
		last = cpu_buffer->last_stamp;
		now = cpu_buffer->buffer->clock();

		if (cmpxchg_local(&cpu_buffer->last_stamp, last, now) == last)
			break;

		cpu_relax();
	}

	return now - last;
}

Of course, LL/SC on power stinks, but on most archs this is actually
fairly fast.


* Re: [PATCH 2/2] More precise timestamps for nested writes
  2015-04-15  9:16       ` Peter Zijlstra
@ 2015-04-15 16:53         ` Suresh E. Warrier
  0 siblings, 0 replies; 8+ messages in thread
From: Suresh E. Warrier @ 2015-04-15 16:53 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: lkml, Steven Rostedt, Frederic Weisbecker, Ingo Molnar, Paul Mackerras



On 04/15/2015 04:16 AM, Peter Zijlstra wrote:
> On Mon, Apr 13, 2015 at 09:38:01PM -0500, Suresh E. Warrier wrote:
> 
>> diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
>> index c9b3005..0a2d862 100644
>> --- a/kernel/trace/ring_buffer.c
>> +++ b/kernel/trace/ring_buffer.c
>> @@ -482,6 +482,7 @@ struct ring_buffer_per_cpu {
>>  	unsigned long			read_bytes;
>>  	u64				write_stamp;
>>  	u64				read_stamp;
>> +	u64				last_stamp;
>>  	/* ring buffer pages to update, > 0 to add, < 0 to remove */
>>  	int				nr_pages_to_update;
>>  	struct list_head		new_pages; /* new pages to add */
> 
> So what is wrong with something like:
> 
> u64 rb_get_delta(struct ring_buffer_per_cpu *cpu_buffer)
> {
> 	u64 last, now;
> 
> 	for (;;) {
> 		last = cpu_buffer->last_stamp;
> 		now = cpu_buffer->buffer->clock();
> 
> 		if (cmpxchg_local(&cpu_buffer->last_stamp, last, now) == last)
> 			break;
> 
> 		cpu_relax();
> 	}
> 
> 	return now - last;
> }
> 
> Of course, LL/SC on power stinks, but on most archs this is actually
> fairly fast.
> 

I am not sure it will help. The delta we assign to the current
event has to be the difference between the "actual" timestamps of the
last event in the buffer and the current event. If we can't be sure
of the delta, we can let it be zero, which is the current behavior
for nested writes.

Thus, one way to fix my patch to handle NMIs is to recheck the
last_stamp after the event has been reserved in the buffer. If it
has changed since we computed the delta, we can simply reset the
delta to zero. This should only happen if an NMI occurs within the
short critical section and it also does a trace.
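
In sketch form (hypothetical code, not from a posted patch; the direct
write to event->time_delta stands in for whatever helper a real fix
would use):

	u64 last = cpu_buffer->last_stamp;

	ts = rb_time_stamp(cpu_buffer->buffer);
	delta = ts - last;
	event = __rb_reserve_next(cpu_buffer, length, ts,
				  delta, add_timestamp);
	/*
	 * An NMI that traced in between has moved last_stamp, so the
	 * delta computed above is unreliable; fall back to a zero
	 * delta, the pre-patch behavior for nested writes.
	 */
	if (cpu_buffer->last_stamp != last)
		event->time_delta = 0;
	cpu_buffer->last_stamp = ts;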

-suresh


