linux-kernel.vger.kernel.org archive mirror
* [RFC][PATCH] ring-buffer: Have nested events still record running time stamp
@ 2020-06-25 13:44 Steven Rostedt
  2020-06-25 13:53 ` Mathieu Desnoyers
                   ` (2 more replies)
  0 siblings, 3 replies; 24+ messages in thread
From: Steven Rostedt @ 2020-06-25 13:44 UTC (permalink / raw)
  To: LKML
  Cc: Ingo Molnar, Peter Zijlstra, Thomas Gleixner, Masami Hiramatsu,
	Arnaldo Carvalho de Melo, Jiri Olsa, Namhyung Kim,
	Yordan Karadzhov, Tzvetomir Stoyanov, Mathieu Desnoyers,
	Tom Zanussi, Jason Behmer, Julia Lawall, Clark Williams,
	Daniel Bristot de Oliveira, Daniel Wagner, Darren Hart,
	Jonathan Corbet, Suresh E. Warrier

From: "Steven Rostedt (VMware)" <rostedt@goodmis.org>

[ SEVEN YEAR PROBLEM SOLVED! ]

Up until now, if an event is interrupted by an interrupt while it is
being recorded, and that interrupt records events of its own, the time
of those nested events will all be the same. This is because events
only record the delta of the time since the previous event (or the
beginning of a page), and updating the timekeeping for nested events is
extremely racy. After years of thinking about this and several failed
attempts, I finally have a solution to this puzzle.
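
To make the delta scheme concrete, here is a minimal sketch of what a
reader conceptually does when walking a page (hypothetical helper
names, not the actual reader code):

	u64 ts = page->time_stamp;	/* full time stamp at page start */

	for_each_event_on_page(page, event) {
		ts += event->time_delta; /* delta from the previous event */
		process(event, ts);
	}

When nested events all record a delta of zero, every one of them lands
on the same reconstructed ts, which is the problem being solved here.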

The problem is that you need to atomically calculate the delta, update
the time stamp you made the delta from, and then record it into the
buffer, all while an interrupt can come in at any point and do the same
thing. This is easy to solve with heavyweight atomics, but that would
be detrimental to the performance of the ring buffer. The current state
of affairs sacrifices the time deltas of nested events for performance.
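
To see why, here is a minimal sketch of the naive (broken) sequence; an
illustration of the race, not code from this patch:

	ts = clock();
	delta = ts - write_stamp;	/* an interrupt can land here... */
	write_stamp = ts;		/* ...or here, and a nested event
					 * then computes its delta from a
					 * stale write_stamp */
	event->time_delta = delta;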

The previous attempts at solving this puzzle failed because I was
trying to completely avoid slow atomic operations like cmpxchg. I
finally came to the conclusion that always avoiding cmpxchg is not
possible, which is why those attempts kept failing. But it is possible
to pick one path (the most common case) and avoid cmpxchg there, making
it the "fast path". The most common case is that an event will not be
interrupted and have other events added into it. An event can detect if
it has interrupted another event, and for those cases we can make it
the slow path and use the heavy operations like cmpxchg.
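
Condensed, the resulting structure has this shape (a sketch of the
design detailed below):

	w = local_read(write_tail);
	...
	write = local_add_return(len, write_tail);
	tail = write - len;
	if (tail == w) {
		/* fast path: nothing interrupted us between the read
		 * and the reserve; plain loads and stores are enough */
	} else {
		/* slow path: we interrupted another event mid-record;
		 * fall back to the heavy operations to fix up stamps */
	}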

One more player was added to the game that made this possible: the
"absolute timestamp" (by Tom Zanussi), which allows us to inject a full
59-bit time stamp. (Of course, this breaks if a machine runs for more
than 18 years without a reboot!)
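
(To check that figure: 2^59 ns = 576,460,752,303,423,488 ns, or about
5.76 * 10^8 seconds; dividing by roughly 3.156 * 10^7 seconds per year
gives about 18.3 years.)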

There are barrier() placements around for paranoia's sake, even where
they are not needed because of other atomic functions nearby. But those
should not hurt; where they are not needed, they are basically a nop.

Note, this also makes the race window much smaller, which means there
are fewer slow paths to slow down the performance.

Here's the design of this solution:

 All this is per cpu, and only needs to worry about nested events (not
 parallel events).

The players:

 write_tail: The index in the buffer where new events can be written to.
     It is incremented via local_add() to reserve space for a new event.

 before_stamp: A time stamp set by all events before reserving space.

 write_stamp: A time stamp updated by events after it has successfully
     reserved space.

 next_write: A copy of "write_tail" used to help with races.
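
In the patch below, these players map onto the per-CPU buffer state
roughly as follows (paraphrased from the diff; the "write_tail" index
lives in tail_page->write there):

	struct ring_buffer_per_cpu {
		...
		unsigned long	next_write;	/* expected index of next write */
		local64_t	write_stamp;	/* stamp of last event that
						 * successfully reserved space */
		u64		before_stamp;	/* stamp all events set up front */
		...
	};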

	/* Save the current position of write */
 [A]	w = local_read(write_tail);
	barrier();
	/* Read both before and write stamps before touching anything */
	before = READ_ONCE(before_stamp);
	after = local_read(write_stamp);
	barrier();

	/*
	 * If before and after are the same, then this event is not
	 * preempting a time update. If it is, then reserve space for adding
	 * a full time stamp (this can turn into a time extend which is
	 * just an extended time delta but fill up the extra space).
	 */
	if (after != before)
		abs = true;

	ts = clock();

	/* Now update the before_stamp (everyone does this!) */
 [B]	WRITE_ONCE(before_stamp, ts);

	/* Read the current next_write and update it to what we want write
	 * to be after we reserve space. */
 	next = READ_ONCE(next_write);
	WRITE_ONCE(next_write, w + len);

	/* Now reserve space on the buffer */
 [C]	write = local_add_return(len, write_tail);

	/* Set tail to be where this event's data is */
	tail = write - len;

 	if (w == tail) {

		/* Nothing preempted this between A and C */
 [D]		local_set(write_stamp, ts);
		barrier();
 [E]		save_before = READ_ONCE(before_stamp);

 		if (!abs) {
			/* This did not preempt a time update */
			delta = ts - a;
		} else {
			/* slow path */
			if (w == next) {
				/* This event preempted the previous event
				 * just after it reserved its buffer.
				 * It's timestamp should be "before_stamp" */
				delta = ts - before;
				abs = 0; /* No need for that full time stamp */
			} else {
				delta = ts; /* The full time stamp will be in use */
			}
		}
		if (ts != save_before) {
			/* slow path - Was preempted between C and E */
			/* The update to write_stamp could have overwritten the update to
			 * it by the preempting event, but before and after should be
			 * the same for all completed top events */
			after = local_read(write_stamp);
			if (save_before > after)
				local_cmpxchg(write_stamp, after, save_before);
		}
	} else {
		/* slow path - Preempted between A and C */

		after = local_read(write_stamp);
		temp_ts = clock();
		barrier();
 [F]		if (write == local_read(write_tail) && after < temp_ts) {
			/* This was not preempted between C and F
			 * The last write_stamp is still valid for the previous event
			 * in the buffer. */
			delta = temp_ts - after;
			/* OK to keep this new time stamp */
			ts = temp_ts;
		} else {
			/* Preempted between C and F
			 * Well, there's no use to try to know what the time stamp
			 * is for the previous event. Just set delta to zero and
			 * be the same time as that event that preempted us before
			 * the reservation of the buffer. */

			delta = 0;
		}
		/* No need to use full timestamps here */
		abs = 0;
	}
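
As a concrete walk-through of the fast path (all numbers invented for
illustration): suppose before_stamp == write_stamp == 1000, so abs
stays false. clock() returns 1040, before_stamp becomes 1040, and the
event reserves its space. Nothing interrupted it, so w == tail:
write_stamp is set to 1040, save_before reads back 1040 == ts, and the
event records delta = 1040 - 1000 = 40 from the previous event, all
without a single cmpxchg.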

Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
---
 kernel/trace/ring_buffer.c | 274 +++++++++++++++++++++++++------------
 1 file changed, 188 insertions(+), 86 deletions(-)

diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
index b8e1ca48be50..a5c3f1608de8 100644
--- a/kernel/trace/ring_buffer.c
+++ b/kernel/trace/ring_buffer.c
@@ -418,6 +418,17 @@ struct rb_event_info {
 	int			add_timestamp;
 };
 
+/*
+ * Used for the add_timestamp
+ *  NONE
+ *  NORMAL - may be for either time extend or absolute
+ *  FORCE - force a full time stamp.
+ */
+enum {
+	RB_ADD_STAMP_NONE,
+	RB_ADD_STAMP_NORMAL,
+	RB_ADD_STAMP_FORCE
+};
 /*
  * Used for which event context the event is in.
  *  NMI     = 0
@@ -470,8 +481,10 @@ struct ring_buffer_per_cpu {
 	size_t				shortest_full;
 	unsigned long			read;
 	unsigned long			read_bytes;
-	u64				write_stamp;
+	unsigned long			next_write;
+	local64_t			write_stamp;
 	u64				read_stamp;
+	u64				before_stamp;
 	/* ring buffer pages to update, > 0 to add, < 0 to remove */
 	long				nr_pages_to_update;
 	struct list_head		new_pages; /* new pages to add */
@@ -2416,16 +2429,13 @@ rb_update_event(struct ring_buffer_per_cpu *cpu_buffer,
 	unsigned length = info->length;
 	u64 delta = info->delta;
 
-	/* Only a commit updates the timestamp */
-	if (unlikely(!rb_event_is_commit(cpu_buffer, event)))
-		delta = 0;
-
 	/*
 	 * If we need to add a timestamp, then we
 	 * add it to the start of the reserved space.
 	 */
 	if (unlikely(info->add_timestamp)) {
-		bool abs = ring_buffer_time_stamp_abs(cpu_buffer->buffer);
+		bool abs = info->add_timestamp == RB_ADD_STAMP_FORCE ||
+			ring_buffer_time_stamp_abs(cpu_buffer->buffer);
 
 		event = rb_add_time_stamp(event, info->delta, abs);
 		length -= RB_LEN_TIME_EXTEND;
@@ -2480,6 +2490,39 @@ static inline bool sched_clock_stable(void)
 }
 #endif
 
+static __always_inline bool
+rb_event_is_commit(struct ring_buffer_per_cpu *cpu_buffer,
+		   struct ring_buffer_event *event)
+{
+	unsigned long addr = (unsigned long)event;
+	unsigned long index;
+
+	index = rb_event_index(event);
+	addr &= PAGE_MASK;
+
+	return cpu_buffer->commit_page->page == (void *)addr &&
+		rb_commit_index(cpu_buffer) == index;
+}
+
+static u64 rb_time_delta(struct ring_buffer_event *event)
+{
+	switch (event->type_len) {
+	case RINGBUF_TYPE_PADDING:
+		return 0;
+
+	case RINGBUF_TYPE_TIME_EXTEND:
+		return ring_buffer_event_time_stamp(event);
+
+	case RINGBUF_TYPE_TIME_STAMP:
+		return 0;
+
+	case RINGBUF_TYPE_DATA:
+		return event->time_delta;
+	default:
+		return 0;
+	}
+}
+
 static inline int
 rb_try_to_discard(struct ring_buffer_per_cpu *cpu_buffer,
 		  struct ring_buffer_event *event)
@@ -2488,6 +2531,8 @@ rb_try_to_discard(struct ring_buffer_per_cpu *cpu_buffer,
 	struct buffer_page *bpage;
 	unsigned long index;
 	unsigned long addr;
+	u64 write_stamp;
+	u64 delta;
 
 	new_index = rb_event_index(event);
 	old_index = new_index + rb_event_ts_length(event);
@@ -2496,10 +2541,32 @@ rb_try_to_discard(struct ring_buffer_per_cpu *cpu_buffer,
 
 	bpage = READ_ONCE(cpu_buffer->tail_page);
 
+	delta = rb_time_delta(event);
+
+	write_stamp = local64_read(&cpu_buffer->write_stamp);
+
+	/* Make sure the write stamp is read before testing the location */
+	barrier();
+
 	if (bpage->page == (void *)addr && rb_page_write(bpage) == old_index) {
 		unsigned long write_mask =
 			local_read(&bpage->write) & ~RB_WRITE_MASK;
 		unsigned long event_length = rb_event_length(event);
+		u64 ret;
+
+		ret = local64_cmpxchg(&cpu_buffer->write_stamp, write_stamp, write_stamp - delta);
+		/* Something came in, can't discard */
+		if (ret != write_stamp)
+			return 0;
+
+		/*
+		 * If an event were to come in now, it would see that the
+		 * write_stamp and the before_stamp are different, and assume
+		 * that this event just added itself before updating
+		 * the write stamp. The preempting event will fix the
+		 * write stamp for us, and use the before stamp as its delta.
+		 */
+
 		/*
 		 * This is on the tail page. It is possible that
 		 * a write could come in and move the tail page
@@ -2551,10 +2618,6 @@ rb_set_commit_to_write(struct ring_buffer_per_cpu *cpu_buffer)
 		local_set(&cpu_buffer->commit_page->page->commit,
 			  rb_page_write(cpu_buffer->commit_page));
 		rb_inc_page(cpu_buffer, &cpu_buffer->commit_page);
-		/* Only update the write stamp if the page has an event */
-		if (rb_page_write(cpu_buffer->commit_page))
-			cpu_buffer->write_stamp =
-				cpu_buffer->commit_page->page->time_stamp;
 		/* add barrier to keep gcc from optimizing too much */
 		barrier();
 	}
@@ -2626,54 +2689,10 @@ static inline void rb_event_discard(struct ring_buffer_event *event)
 		event->time_delta = 1;
 }
 
-static __always_inline bool
-rb_event_is_commit(struct ring_buffer_per_cpu *cpu_buffer,
-		   struct ring_buffer_event *event)
-{
-	unsigned long addr = (unsigned long)event;
-	unsigned long index;
-
-	index = rb_event_index(event);
-	addr &= PAGE_MASK;
-
-	return cpu_buffer->commit_page->page == (void *)addr &&
-		rb_commit_index(cpu_buffer) == index;
-}
-
-static __always_inline void
-rb_update_write_stamp(struct ring_buffer_per_cpu *cpu_buffer,
-		      struct ring_buffer_event *event)
-{
-	u64 delta;
-
-	/*
-	 * The event first in the commit queue updates the
-	 * time stamp.
-	 */
-	if (rb_event_is_commit(cpu_buffer, event)) {
-		/*
-		 * A commit event that is first on a page
-		 * updates the write timestamp with the page stamp
-		 */
-		if (!rb_event_index(event))
-			cpu_buffer->write_stamp =
-				cpu_buffer->commit_page->page->time_stamp;
-		else if (event->type_len == RINGBUF_TYPE_TIME_EXTEND) {
-			delta = ring_buffer_event_time_stamp(event);
-			cpu_buffer->write_stamp += delta;
-		} else if (event->type_len == RINGBUF_TYPE_TIME_STAMP) {
-			delta = ring_buffer_event_time_stamp(event);
-			cpu_buffer->write_stamp = delta;
-		} else
-			cpu_buffer->write_stamp += event->time_delta;
-	}
-}
-
 static void rb_commit(struct ring_buffer_per_cpu *cpu_buffer,
 		      struct ring_buffer_event *event)
 {
 	local_inc(&cpu_buffer->entries);
-	rb_update_write_stamp(cpu_buffer, event);
 	rb_end_commit(cpu_buffer);
 }
 
@@ -2872,13 +2891,13 @@ rb_handle_timestamp(struct ring_buffer_per_cpu *cpu_buffer,
 		  KERN_WARNING "Delta way too big! %llu ts=%llu write stamp = %llu\n%s",
 		  (unsigned long long)info->delta,
 		  (unsigned long long)info->ts,
-		  (unsigned long long)cpu_buffer->write_stamp,
+		  (unsigned long long)local64_read(&cpu_buffer->write_stamp),
 		  sched_clock_stable() ? "" :
 		  "If you just came from a suspend/resume,\n"
 		  "please switch to the trace global clock:\n"
 		  "  echo global > /sys/kernel/debug/tracing/trace_clock\n"
 		  "or add trace_clock=global to the kernel command line\n");
-	info->add_timestamp = 1;
+	info->add_timestamp = RB_ADD_STAMP_NORMAL;
 }
 
 static struct ring_buffer_event *
@@ -2887,8 +2906,22 @@ __rb_reserve_next(struct ring_buffer_per_cpu *cpu_buffer,
 {
 	struct ring_buffer_event *event;
 	struct buffer_page *tail_page;
-	unsigned long tail, write;
+	unsigned long tail, write, w, next;
+	u64 delta, before, after;
+
+	/* Don't let the compiler play games with cpu_buffer->tail_page */
+	tail_page = info->tail_page = READ_ONCE(cpu_buffer->tail_page);
 
+ /*A*/	w = local_read(&tail_page->write) & RB_WRITE_MASK;
+	barrier();
+	before = READ_ONCE(cpu_buffer->before_stamp);
+	after = local64_read(&cpu_buffer->write_stamp);
+	/*
+	 * If preempting an event time update, we may need absolute timestamp.
+	 * Don't bother if this is the start of a new page (w == 0).
+	 */
+	if (unlikely(before != after && w))
+		info->add_timestamp = RB_ADD_STAMP_FORCE;
 	/*
 	 * If the time delta since the last event is too big to
 	 * hold in the time field of the event, then we append a
@@ -2897,25 +2930,101 @@ __rb_reserve_next(struct ring_buffer_per_cpu *cpu_buffer,
 	if (unlikely(info->add_timestamp))
 		info->length += RB_LEN_TIME_EXTEND;
 
-	/* Don't let the compiler play games with cpu_buffer->tail_page */
-	tail_page = info->tail_page = READ_ONCE(cpu_buffer->tail_page);
-	write = local_add_return(info->length, &tail_page->write);
+	next = READ_ONCE(cpu_buffer->next_write);
+	WRITE_ONCE(cpu_buffer->next_write, w + info->length);
+
+	info->ts = rb_time_stamp(cpu_buffer->buffer);
+ /*B*/	WRITE_ONCE(cpu_buffer->before_stamp, info->ts);
+
+ /*C*/	write = local_add_return(info->length, &tail_page->write);
 
 	/* set write to only the index of the write */
 	write &= RB_WRITE_MASK;
+
 	tail = write - info->length;
 
+	/* See if we shot past the end of this buffer page */
+	if (unlikely(write > BUF_PAGE_SIZE)) {
+		if (tail != w) {
+			/* before and after may now be different, fix it up */
+			before = READ_ONCE(cpu_buffer->before_stamp);
+			after = local64_read(&cpu_buffer->write_stamp);
+			if (before != after)
+				(void)cmpxchg(&cpu_buffer->before_stamp, before, after);
+		}
+		return rb_move_tail(cpu_buffer, tail, info);
+	}
+
+	if (likely(tail == w)) {
+		u64 save_before;
+
+		/* Nothing preempted us between A and C */
+ /*D*/		local64_set(&cpu_buffer->write_stamp, info->ts);
+		barrier();
+ /*E*/		save_before = READ_ONCE(cpu_buffer->before_stamp);
+		if (likely(info->add_timestamp != RB_ADD_STAMP_FORCE)) {
+			/* This did not preempt any time update */
+			info->delta = info->ts - after;
+		} else {
+			/* SLOW PATH */
+			if (w == next) {
+				/* before is the time stamp of the last event */
+				info->delta = info->ts - before;
+				info->add_timestamp = RB_ADD_STAMP_NORMAL;
+			} else {
+				/* Last event time stamp is lost, inject absolute. */
+				info->delta = info->ts;
+			}
+		}
+		barrier();
+		if (unlikely(info->ts != save_before)) {
+			/* SLOW PATH - Preempted between C and E */
+
+			after = local64_read(&cpu_buffer->write_stamp);
+			/* Write stamp must only go forward */
+			if (save_before > after) {
+				/*
+				 * We do not care about the result, only that
+				 * it gets updated atomically.
+				 */
+				(void)local64_cmpxchg(&cpu_buffer->write_stamp, after, save_before);
+			}
+		}
+	} else {
+		u64 ts;
+		/* SLOW PATH - Preempted between A and C */
+		after = local64_read(&cpu_buffer->write_stamp);
+		ts = rb_time_stamp(cpu_buffer->buffer);
+		barrier();
+ /*E*/		if (write == (local_read(&tail_page->write) & RB_WRITE_MASK) &&
+		    after < ts) {
+			/* Nothing came after this event between C and E */
+			info->delta = ts - after;
+			(void)local64_cmpxchg(&cpu_buffer->write_stamp, after, info->ts);
+			info->ts = ts;
+		} else {
+			/*
+			 * Preempted beween C and E:
+			 * Lost the previous events time stamp. Just set the
+			 * delta to zero, and this will be the same time as
+			 * the veent this event preempted. And the events that
+			 * came after this will still be correct (as they would
+			 * have built their delta on the previous event.
+			 */
+			info->delta = 0;
+		}
+		if (info->add_timestamp == RB_ADD_STAMP_FORCE)
+			info->add_timestamp = RB_ADD_STAMP_NORMAL;
+	}
+
 	/*
 	 * If this is the first commit on the page, then it has the same
 	 * timestamp as the page itself.
 	 */
-	if (!tail && !ring_buffer_time_stamp_abs(cpu_buffer->buffer))
+	if (unlikely(!tail && info->add_timestamp != RB_ADD_STAMP_FORCE &&
+		     !ring_buffer_time_stamp_abs(cpu_buffer->buffer)))
 		info->delta = 0;
 
-	/* See if we shot pass the end of this buffer page */
-	if (unlikely(write > BUF_PAGE_SIZE))
-		return rb_move_tail(cpu_buffer, tail, info);
-
 	/* We reserved something on the buffer */
 
 	event = __rb_page_index(tail_page, tail);
@@ -2933,6 +3042,9 @@ __rb_reserve_next(struct ring_buffer_per_cpu *cpu_buffer,
 	/* account for these added bytes */
 	local_add(info->length, &cpu_buffer->entries_bytes);
 
+	/* This will be used to update write stamp */
+	info->delta = delta;
+
 	return event;
 }
 
@@ -2944,9 +3056,10 @@ rb_reserve_next_event(struct trace_buffer *buffer,
 	struct ring_buffer_event *event;
 	struct rb_event_info info;
 	int nr_loops = 0;
-	u64 diff;
+	u64 write_stamp;
 
 	rb_start_commit(cpu_buffer);
+	/* The commit page can not change after this */
 
 #ifdef CONFIG_RING_BUFFER_ALLOW_SWAP
 	/*
@@ -2965,7 +3078,7 @@ rb_reserve_next_event(struct trace_buffer *buffer,
 
 	info.length = rb_calculate_event_length(length);
  again:
-	info.add_timestamp = 0;
+	info.add_timestamp = RB_ADD_STAMP_NONE;
 	info.delta = 0;
 
 	/*
@@ -2980,18 +3093,14 @@ rb_reserve_next_event(struct trace_buffer *buffer,
 	if (RB_WARN_ON(cpu_buffer, ++nr_loops > 1000))
 		goto out_fail;
 
+	write_stamp = local64_read(&cpu_buffer->write_stamp);
 	info.ts = rb_time_stamp(cpu_buffer->buffer);
-	diff = info.ts - cpu_buffer->write_stamp;
-
-	/* make sure this diff is calculated here */
-	barrier();
 
 	if (ring_buffer_time_stamp_abs(buffer)) {
 		info.delta = info.ts;
 		rb_handle_timestamp(cpu_buffer, &info);
-	} else /* Did the write stamp get updated already? */
-		if (likely(info.ts >= cpu_buffer->write_stamp)) {
-		info.delta = diff;
+	} else {
+		info.delta = info.ts - write_stamp ;
 		if (unlikely(test_time_stamp(info.delta)))
 			rb_handle_timestamp(cpu_buffer, &info);
 	}
@@ -3004,11 +3113,8 @@ rb_reserve_next_event(struct trace_buffer *buffer,
 		goto again;
 	}
 
-	if (!event)
-		goto out_fail;
-
-	return event;
-
+	if (likely(event))
+		return event;
  out_fail:
 	rb_end_commit(cpu_buffer);
 	return NULL;
@@ -3154,11 +3260,6 @@ void ring_buffer_discard_commit(struct trace_buffer *buffer,
 	if (rb_try_to_discard(cpu_buffer, event))
 		goto out;
 
-	/*
-	 * The commit is still visible by the reader, so we
-	 * must still update the timestamp.
-	 */
-	rb_update_write_stamp(cpu_buffer, event);
  out:
 	rb_end_commit(cpu_buffer);
 
@@ -4475,8 +4576,9 @@ rb_reset_cpu(struct ring_buffer_per_cpu *cpu_buffer)
 	cpu_buffer->read = 0;
 	cpu_buffer->read_bytes = 0;
 
-	cpu_buffer->write_stamp = 0;
-	cpu_buffer->read_stamp = 0;
+	local64_set(&cpu_buffer->write_stamp, 0);
+	cpu_buffer->before_stamp = 0;
+	cpu_buffer->next_write = 0;
 
 	cpu_buffer->lost_events = 0;
 	cpu_buffer->last_overrun = 0;
-- 
2.25.4



* Re: [RFC][PATCH] ring-buffer: Have nested events still record running time stamp
  2020-06-25 13:44 [RFC][PATCH] ring-buffer: Have nested events still record running time stamp Steven Rostedt
@ 2020-06-25 13:53 ` Mathieu Desnoyers
  2020-06-25 14:37   ` Steven Rostedt
  2020-06-25 17:55 ` Mathieu Desnoyers
  2020-06-25 18:09 ` Steven Rostedt
  2 siblings, 1 reply; 24+ messages in thread
From: Mathieu Desnoyers @ 2020-06-25 13:53 UTC (permalink / raw)
  To: rostedt
  Cc: linux-kernel, Ingo Molnar, Peter Zijlstra, Thomas Gleixner,
	Masami Hiramatsu, Arnaldo Carvalho de Melo, Jiri Olsa,
	Namhyung Kim, Yordan Karadzhov, Tzvetomir Stoyanov, Tom Zanussi,
	Jason Behmer, Julia Lawall, Clark Williams, bristot,
	Daniel Wagner, Darren Hart, Jonathan Corbet, Suresh E. Warrier

----- On Jun 25, 2020, at 9:44 AM, rostedt rostedt@goodmis.org wrote:

> From: "Steven Rostedt (VMware)" <rostedt@goodmis.org>
> 
> [ SEVEN YEAR PROBLEM SOLVED! ]
> 
> Up until now, if an event is interrupted by an interrupt while it is
> being recorded, and that interrupt records events of its own, the time
> of those nested events will all be the same. This is because events
> only record the delta of the time since the previous event (or the
> beginning of a page), and updating the timekeeping for nested events is
> extremely racy. After years of thinking about this and several failed
> attempts, I finally have a solution to this puzzle.

Out of curiosity, considering that LTTng has solved this problem 10+ years ago
with a simpler concurrency-friendly time-stamping model, why not simply use it
rather than add complexity to the current ftrace timestamp scheme ?

Thanks,

Mathieu


-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com


* Re: [RFC][PATCH] ring-buffer: Have nested events still record running time stamp
  2020-06-25 13:53 ` Mathieu Desnoyers
@ 2020-06-25 14:37   ` Steven Rostedt
  2020-06-25 16:42     ` Korben Rusek
  0 siblings, 1 reply; 24+ messages in thread
From: Steven Rostedt @ 2020-06-25 14:37 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: linux-kernel, Ingo Molnar, Peter Zijlstra, Thomas Gleixner,
	Masami Hiramatsu, Arnaldo Carvalho de Melo, Jiri Olsa,
	Namhyung Kim, Yordan Karadzhov, Tzvetomir Stoyanov, Tom Zanussi,
	Jason Behmer, Julia Lawall, Clark Williams, bristot,
	Daniel Wagner, Darren Hart, Jonathan Corbet, Suresh E. Warrier

On Thu, 25 Jun 2020 09:53:15 -0400 (EDT)
Mathieu Desnoyers <mathieu.desnoyers@efficios.com> wrote:

> ----- On Jun 25, 2020, at 9:44 AM, rostedt rostedt@goodmis.org wrote:
> 
> > From: "Steven Rostedt (VMware)" <rostedt@goodmis.org>
> > 
> > [ SEVEN YEAR PROBLEM SOLVED! ]
> > 
> > > Up until now, if an event is interrupted by an interrupt while it is
> > > being recorded, and that interrupt records events of its own, the time
> > > of those nested events will all be the same. This is because events
> > > only record the delta of the time since the previous event (or the
> > > beginning of a page), and updating the timekeeping for nested events is
> > > extremely racy. After years of thinking about this and several failed
> > > attempts, I finally have a solution to this puzzle.
> 
> Out of curiosity, considering that LTTng has solved this problem 10+ years ago
> with a simpler concurrency-friendly time-stamping model, why not simply use it
> rather than add complexity to the current ftrace timestamp scheme ?

Because it requires updating all the tools that read this from user
space.

I found a solution that works, so why change it and cause the backward
compatibility pain now?

-- Steve


* Re: [RFC][PATCH] ring-buffer: Have nested events still record running time stamp
  2020-06-25 14:37   ` Steven Rostedt
@ 2020-06-25 16:42     ` Korben Rusek
  2020-06-25 18:12       ` Steven Rostedt
  0 siblings, 1 reply; 24+ messages in thread
From: Korben Rusek @ 2020-06-25 16:42 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Mathieu Desnoyers, linux-kernel, Ingo Molnar, Peter Zijlstra,
	Thomas Gleixner, Masami Hiramatsu, Arnaldo Carvalho de Melo,
	Jiri Olsa, Namhyung Kim, Yordan Karadzhov, Tzvetomir Stoyanov,
	Tom Zanussi, Jason Behmer, Julia Lawall, Clark Williams, bristot,
	Daniel Wagner, Darren Hart, Jonathan Corbet, Suresh E. Warrier

On Thu, Jun 25, 2020 at 7:38 AM Steven Rostedt <rostedt@goodmis.org> wrote:
>
> On Thu, 25 Jun 2020 09:53:15 -0400 (EDT)
> Mathieu Desnoyers <mathieu.desnoyers@efficios.com> wrote:
>
> > ----- On Jun 25, 2020, at 9:44 AM, rostedt rostedt@goodmis.org wrote:
> >
> > > From: "Steven Rostedt (VMware)" <rostedt@goodmis.org>
> > >
> > > [ SEVEN YEAR PROBLEM SOLVED! ]
> > >
> > > Up until now, if an event is interrupted by an interrupt while it is
> > > being recorded, and that interrupt records events of its own, the time
> > > of those nested events will all be the same. This is because events
> > > only record the delta of the time since the previous event (or the
> > > beginning of a page), and updating the timekeeping for nested events is
> > > extremely racy. After years of thinking about this and several failed
> > > attempts, I finally have a solution to this puzzle.
> >
> > Out of curiosity, considering that LTTng has solved this problem 10+ years ago
> > with a simpler concurrency-friendly time-stamping model, why not simply use it
> > rather than add complexity to the current ftrace timestamp scheme ?
>
> Because it requires updating all the tools that read this from user
> space.
>
> I found a solution that works, so why change it and cause the backward
> compatibility pain now?
>
> -- Steve

Great work! I'm not exactly qualified to review the code, but the
logic seems correct. I'm curious how unlikely a zero delta is now and
how you quantify it. Also does it negate the patch that I emailed out
last week that adds a `force_abs_timestamp` trace/option in an attempt
to get around this particular issue?

In reading through, I did notice a couple simple typos in the comments
that are probably worth pointing out:

> If preempting an event time update, we may need absolute timestamp.

Not a big deal, but it should be "may need *an* absolute timestamp"

> * Preempted beween C and E:
> * Lost the previous events time stamp. Just set the
> * delta to zero, and this will be the same time as
> * the veent this event preempted. And the events that
> * came after this will still be correct (as they would
> * have built their delta on the previous event.

Should be "the *event* this event preempted." It also needs a
parenthesis at the end of the comment to close the parenthetical
statement.

Thanks, Korben


* Re: [RFC][PATCH] ring-buffer: Have nested events still record running time stamp
  2020-06-25 13:44 [RFC][PATCH] ring-buffer: Have nested events still record running time stamp Steven Rostedt
  2020-06-25 13:53 ` Mathieu Desnoyers
@ 2020-06-25 17:55 ` Mathieu Desnoyers
  2020-06-25 18:35   ` Steven Rostedt
                     ` (2 more replies)
  2020-06-25 18:09 ` Steven Rostedt
  2 siblings, 3 replies; 24+ messages in thread
From: Mathieu Desnoyers @ 2020-06-25 17:55 UTC (permalink / raw)
  To: rostedt
  Cc: linux-kernel, Ingo Molnar, Peter Zijlstra, Thomas Gleixner,
	Masami Hiramatsu, Arnaldo Carvalho de Melo, Jiri Olsa,
	Namhyung Kim, Yordan Karadzhov, Tzvetomir Stoyanov, Tom Zanussi,
	Jason Behmer, Julia Lawall, Clark Williams, bristot,
	Daniel Wagner, Darren Hart, Jonathan Corbet, Suresh E. Warrier

----- On Jun 25, 2020, at 9:44 AM, rostedt rostedt@goodmis.org wrote:

> From: "Steven Rostedt (VMware)" <rostedt@goodmis.org>
> 
[...]
> Here's the design of this solution:
> 
> All this is per cpu, and only needs to worry about nested events (not
> parallel events).
> 
> The players:
> 
> write_tail: The index in the buffer where new events can be written to.
>     It is incremented via local_add() to reserve space for a new event.
> 
> before_stamp: A time stamp set by all events before reserving space.
> 
> write_stamp: A time stamp updated by events after it has successfully
>     reserved space.

What are the types used for before_stamp and write_stamp ? If those
are 64-bit integers, how does sharing them between nested contexts
work on 32-bit architectures ?

> 
> next_write: A copy of "write_tail" used to help with races.
> 
>	/* Save the current position of write */
> [A]	w = local_read(write_tail);
>	barrier();
>	/* Read both before and write stamps before touching anything */
>	before = READ_ONCE(before_stamp);
>	after = local_read(write_stamp);
>	barrier();
> 
>	/*
>	 * If before and after are the same, then this event is not
>	 * preempting a time update. If it is, then reserve space for adding

You should use the term "interrupt" rather than "preempting", because as you
stated yourself, this algorithm only works with nested interrupts, not with
preemption.

>	 * a full time stamp (this can turn into a time extend which is
>	 * just an extended time delta but fill up the extra space).
>	 */
>	if (after != before)
>		abs = true;
> 
>	ts = clock();
> 
>	/* Now update the before_stamp (everyone does this!) */
> [B]	WRITE_ONCE(before_stamp, ts);
> 
>	/* Read the current next_write and update it to what we want write
>	 * to be after we reserve space. */
> 	next = READ_ONCE(next_write);
>	WRITE_ONCE(next_write, w + len);
> 
>	/* Now reserve space on the buffer */
> [C]	write = local_add_return(len, write_tail);

So the reservation is not "just" an add instruction, it's actually an xadd on
x86. Is that really faster than a cmpxchg ?

> 
>	/* Set tail to be where this event's data is */
>	tail = write - len;
> 
> 	if (w == tail) {
> 
>		/* Nothing preempted this between A and C */
> [D]		local_set(write_stamp, ts);
>		barrier();
> [E]		save_before = READ_ONCE(before_stamp);
> 
> 		if (!abs) {
>			/* This did not preempt a time update */
>			delta = ts - a;

What does "a" refer to ? What triggers its update ?

>		} else {
>			/* slow path */
>			if (w == next) {

If it's a slow path, why try to play games with a delta timestamp ? Space
should not be an issue for a slow path like this. It would be simpler to just
use the full timestamp here.

>				/* This event preempted the previous event
>				 * just after it reserved its buffer.

You mean nesting after [C] but before [D].

>				 * It's timestamp should be "before_stamp" */

It's -> Its

>				delta = ts - before;
>				abs = 0; /* No need for that full time stamp */

What worries me about this "optimization" is what happens in a multi-nesting scenario.
If we just fall back to a full timestamp as soon as nesting is detected, things become
really simple for multiple nesting. However, using a delta timestamp adds a lot of
additional states to consider, and those are not discussed here.

More specifically, one scenario I'm concerned about is:

Nesting level 0                      Nesting level 1                      Nesting level 2

[A] w = local_read(write_tail);
    before = READ_ONCE(before_stamp);
[B] WRITE_ONCE(before_stamp, ts);
    WRITE_ONCE(next_write, w + len);
[C] write = local_add_return(len,
    write_tail);
   <interrupted>
                                    [A] w = local_read(write_tail);
                                        before = READ_ONCE(before_stamp);
                                    [B] WRITE_ONCE(before_stamp, ts);
                                    <interrupted>
                                                                          [A] w = local_read(write_tail);
                                                                              before = READ_ONCE(before_stamp);
                                                                          [B] WRITE_ONCE(before_stamp, ts);
                                                                              WRITE_ONCE(next_write, w + len);
                                                                          [C] write = local_add_return(len,
                                                                              write_tail);
                                                                              if (w == tail) { -> true
                                                                                if (!abs) { -> false
                                                                                  if (w == next) { -> true

Nesting level 2 then ends up using the "before_stamp" of nesting level 1 to generate
the time delta, but the delta is really between the event from nesting level 0 and the
event at nesting level 2, because nesting level 1 will only record its event _after_
the event of nesting level 2.

>			} else {
>				delta = ts; /* The full time stamp will be in use */
>			}
>		}
>		if (ts != save_before) {
>			/* slow path - Was preempted between C and E */
>			/* The update to write_stamp could have overwritten the update to
>			 * it by the preempting event,

This sentence is hard to parse.

>                          but before and after should be
>			 * the same for all completed top events */
>			after = local_read(write_stamp);
>			if (save_before > after)
>				local_cmpxchg(write_stamp, after, save_before);

What is the type of write_stamp, and how does it work on 32-bit architectures with
NMIs ?

I'm not entirely sure I understand what this cmpxchg tries to accomplish. Moreover,
what happens if, right after an interrupt nests between [C] and [E], another interrupt
tries to write into the buffer just before that cmpxchg ?

>		}
>	} else {
>		/* slow path - Preempted between A and C */
> 
>		after = local_read(write_stamp);
>		temp_ts = clock();
>		barrier();
> [F]		if (write == local_read(write_tail) && after < temp_ts) {
>			/* This was not preempted between C and F
>			 * The last write_stamp is still valid for the previous event
>			 * in the buffer. */
>			delta = temp_ts - after;
>			/* OK to keep this new time stamp */
>			ts = temp_ts;
>		} else {
>			/* Preempted between C and F
>			 * Well, there's no use to try to know what the time stamp
>			 * is for the previous event. Just set delta to zero and
>			 * be the same time as that event that preempted us before
>			 * the reservation of the buffer. */
> 
>			delta = 0;

If this is a rare case, why not just use a full timestamp instead ?

Thanks,

Mathieu

>		}
>		/* No need to use full timestamps here */
>		abs = 0;
>	}
> -- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com


* Re: [RFC][PATCH] ring-buffer: Have nested events still record running time stamp
  2020-06-25 13:44 [RFC][PATCH] ring-buffer: Have nested events still record running time stamp Steven Rostedt
  2020-06-25 13:53 ` Mathieu Desnoyers
  2020-06-25 17:55 ` Mathieu Desnoyers
@ 2020-06-25 18:09 ` Steven Rostedt
  2 siblings, 0 replies; 24+ messages in thread
From: Steven Rostedt @ 2020-06-25 18:09 UTC (permalink / raw)
  To: LKML
  Cc: Ingo Molnar, Peter Zijlstra, Thomas Gleixner, Masami Hiramatsu,
	Arnaldo Carvalho de Melo, Jiri Olsa, Namhyung Kim,
	Yordan Karadzhov, Tzvetomir Stoyanov, Mathieu Desnoyers,
	Tom Zanussi, Jason Behmer, Julia Lawall, Clark Williams,
	Daniel Bristot de Oliveira, Daniel Wagner, Darren Hart,
	Jonathan Corbet, Suresh E. Warrier

After running my ring buffer benchmark on this update, it went from
55ns to 85ns per event. A 30ns increase! At first I freaked out, but
then noticed that I unnecessarily take the time stamp counter twice,
which is the most expensive operation.

Here's a diff to my patch (I'll send a v2 later) that removes the
double taking of the time stamp.

It also removes the double testing of the absolute value. I have
another patch that consolidates that with the info add_timestamp field,
but that patch won't be folded into this one.

-- Steve

diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
index a5c3f1608de8..7370c483c6cd 100644
--- a/kernel/trace/ring_buffer.c
+++ b/kernel/trace/ring_buffer.c
@@ -2908,6 +2908,7 @@ __rb_reserve_next(struct ring_buffer_per_cpu *cpu_buffer,
 	struct buffer_page *tail_page;
 	unsigned long tail, write, w, next;
 	u64 delta, before, after;
+	bool abs = false;
 
 	/* Don't let the compiler play games with cpu_buffer->tail_page */
 	tail_page = info->tail_page = READ_ONCE(cpu_buffer->tail_page);
@@ -2916,6 +2917,19 @@ __rb_reserve_next(struct ring_buffer_per_cpu *cpu_buffer,
 	barrier();
 	before = READ_ONCE(cpu_buffer->before_stamp);
 	after = local64_read(&cpu_buffer->write_stamp);
+	barrier();
+	info->ts = rb_time_stamp(cpu_buffer->buffer);
+
+	if (ring_buffer_time_stamp_abs(cpu_buffer->buffer)) {
+		info->delta = info->ts;
+		abs = true;
+	} else {
+		info->delta = info->ts - after;
+	}
+
+	if (unlikely(test_time_stamp(info->delta)))
+		rb_handle_timestamp(cpu_buffer, info);
+
 	/*
 	 * If preempting an event time update, we may need absolute timestamp.
 	 * Don't bother if this is the start of a new page (w == 0).
@@ -2933,7 +2947,6 @@ __rb_reserve_next(struct ring_buffer_per_cpu *cpu_buffer,
 	next = READ_ONCE(cpu_buffer->next_write);
 	WRITE_ONCE(cpu_buffer->next_write, w + info->length);
 
-	info->ts = rb_time_stamp(cpu_buffer->buffer);
  /*B*/	WRITE_ONCE(cpu_buffer->before_stamp, info->ts);
 
  /*C*/	write = local_add_return(info->length, &tail_page->write);
@@ -3021,8 +3034,7 @@ __rb_reserve_next(struct ring_buffer_per_cpu *cpu_buffer,
 	 * If this is the first commit on the page, then it has the same
 	 * timestamp as the page itself.
 	 */
-	if (unlikely(!tail && info->add_timestamp != RB_ADD_STAMP_FORCE &&
-		     !ring_buffer_time_stamp_abs(cpu_buffer->buffer)))
+	if (unlikely(!tail && info->add_timestamp != RB_ADD_STAMP_FORCE && !abs))
 		info->delta = 0;
 
 	/* We reserved something on the buffer */
@@ -3056,7 +3068,6 @@ rb_reserve_next_event(struct trace_buffer *buffer,
 	struct ring_buffer_event *event;
 	struct rb_event_info info;
 	int nr_loops = 0;
-	u64 write_stamp;
 
 	rb_start_commit(cpu_buffer);
 	/* The commit page can not change after this */
@@ -3093,18 +3104,6 @@ rb_reserve_next_event(struct trace_buffer *buffer,
 	if (RB_WARN_ON(cpu_buffer, ++nr_loops > 1000))
 		goto out_fail;
 
-	write_stamp = local64_read(&cpu_buffer->write_stamp);
-	info.ts = rb_time_stamp(cpu_buffer->buffer);
-
-	if (ring_buffer_time_stamp_abs(buffer)) {
-		info.delta = info.ts;
-		rb_handle_timestamp(cpu_buffer, &info);
-	} else {
-		info.delta = info.ts - write_stamp ;
-		if (unlikely(test_time_stamp(info.delta)))
-			rb_handle_timestamp(cpu_buffer, &info);
-	}
-
 	event = __rb_reserve_next(cpu_buffer, &info);
 
 	if (unlikely(PTR_ERR(event) == -EAGAIN)) {


* Re: [RFC][PATCH] ring-buffer: Have nested events still record running time stamp
  2020-06-25 16:42     ` Korben Rusek
@ 2020-06-25 18:12       ` Steven Rostedt
  0 siblings, 0 replies; 24+ messages in thread
From: Steven Rostedt @ 2020-06-25 18:12 UTC (permalink / raw)
  To: Korben Rusek
  Cc: Mathieu Desnoyers, linux-kernel, Ingo Molnar, Peter Zijlstra,
	Thomas Gleixner, Masami Hiramatsu, Arnaldo Carvalho de Melo,
	Jiri Olsa, Namhyung Kim, Yordan Karadzhov, Tzvetomir Stoyanov,
	Tom Zanussi, Jason Behmer, Julia Lawall, Clark Williams, bristot,
	Daniel Wagner, Darren Hart, Jonathan Corbet, Suresh E. Warrier

On Thu, 25 Jun 2020 09:42:34 -0700
Korben Rusek <korben@google.com> wrote:

> Great work! I'm not exactly qualified to review the code, but the
> logic seems correct. I'm curious how unlikely a zero delta is now and
> how you quantify it. Also does it negate the patch that I emailed out

Actually, in all my stress testing (where I also add nested
trace_printk()s to read what is happening), I was never once able to
trigger the zero delta path! I only tested it by adding code to inject
the event to force the given race condition.

Note, zero deltas are still there between absolute time stamps and
start of page, but that's still different than a zero delta from the
previous event.

> last week that adds a `force_abs_timestamp` trace/option in an attempt
> to get around this particular issue?
> 
> In reading through, I did notice a couple simple typos in the comments
> that are probably worth pointing out:

Thanks.

-- Steve


> 
> > If preempting an event time update, we may need absolute timestamp.  
> 
> Not a big deal, but it should be "may need *an* absolute timestamp"
> 
> > * Preempted beween C and E:
> > * Lost the previous events time stamp. Just set the
> > * delta to zero, and this will be the same time as
> > * the veent this event preempted. And the events that
> > * came after this will still be correct (as they would
> > * have built their delta on the previous event.  
> 
> Should be "the *event* this event preempted." It also needs a
> parenthesis at the end of the comment to close the parenthetical
> statement.
> 
> Thanks, Korben



* Re: [RFC][PATCH] ring-buffer: Have nested events still record running time stamp
  2020-06-25 17:55 ` Mathieu Desnoyers
@ 2020-06-25 18:35   ` Steven Rostedt
  2020-06-25 19:35     ` Mathieu Desnoyers
  2020-06-25 19:04   ` Steven Rostedt
  2020-06-25 19:09   ` Steven Rostedt
  2 siblings, 1 reply; 24+ messages in thread
From: Steven Rostedt @ 2020-06-25 18:35 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: linux-kernel, Ingo Molnar, Peter Zijlstra, Thomas Gleixner,
	Masami Hiramatsu, Arnaldo Carvalho de Melo, Jiri Olsa,
	Namhyung Kim, Yordan Karadzhov, Tzvetomir Stoyanov, Tom Zanussi,
	Jason Behmer, Julia Lawall, Clark Williams, bristot,
	Daniel Wagner, Darren Hart, Jonathan Corbet, Suresh E. Warrier

On Thu, 25 Jun 2020 13:55:07 -0400 (EDT)
Mathieu Desnoyers <mathieu.desnoyers@efficios.com> wrote

> > Here's the design of this solution:
> > 
> > All this is per cpu, and only needs to worry about nested events (not
> > parallel events).
> > 
> > The players:
> > 
> > write_tail: The index in the buffer where new events can be written to.
> >     It is incremented via local_add() to reserve space for a new event.
> > 
> > before_stamp: A time stamp set by all events before reserving space.
> > 
> > write_stamp: A time stamp updated by events after it has successfully
> >     reserved space.  
> 
> What are the types used for before_stamp and write_stamp ? If those
> are 64-bit integers, how does sharing them between nested contexts
> work on 32-bit architectures ?

Well, write_stamp is updated via local64, which I believe handles this
for us. I probably should make before_stamp handle it as well.

> 
> > 
> > next_write: A copy of "write_tail" used to help with races.
> > 
> >	/* Save the current position of write */
> > [A]	w = local_read(write_tail);
> >	barrier();
> >	/* Read both before and write stamps before touching
> >	anything */ before = READ_ONCE(before_stamp);
> >	after = local_read(write_stamp);
> >	barrier();
> > 
> >	/*
> >	 * If before and after are the same, then this event is not
> >	 * preempting a time update. If it is, then reserve space
> >	for adding  
> 
> You should use the term "interrupt" rather than "preempting", because
> as you stated yourself, this algorithm only works with nested
> interrupts, not with preemption.

The two terms are basically interchangeable here, but for consistency,
I will update it. Thanks.


> 
> >	 * a full time stamp (this can turn into a time extend which
> >	is
> >	 * just an extended time delta but fill up the extra space).
> >	 */
> >	if (after != before)
> >		abs = true;
> > 
> >	ts = clock();
> > 
> >	/* Now update the before_stamp (everyone does this!) */
> > [B]	WRITE_ONCE(before_stamp, ts);
> > 
> >	/* Read the current next_write and update it to what we want
> >	write
> >	 * to be after we reserve space. */
> > 	next = READ_ONCE(next_write);
> >	WRITE_ONCE(next_write, w + len);
> > 
> >	/* Now reserve space on the buffer */
> > [C]	write = local_add_return(len, write_tail);  
> 
> So the reservation is not "just" an add instruction, it's actually an
> xadd on x86. Is that really faster than a cmpxchg ?

I believe the answer is still yes. But I can run some benchmarks to
make sure.

> 
> > 
> >	/* Set tail to be where this event's data is */
> >	tail = write - len;
> > 
> > 	if (w == tail) {
> > 
> >		/* Nothing preempted this between A and C */
> > [D]		local_set(write_stamp, ts);
> >		barrier();
> > [E]		save_before = READ_ONCE(before_stamp);
> > 
> > 		if (!abs) {
> >			/* This did not preempt a time update */
> >			delta = ts - a;  
> 
> What does "a" refer to ? What triggers its update ?

Oops. When I first wrote this, I used "a" and "b" for "after" and
"before" and had "after" and "before" be "after_stamp" and
"before_stamp". I missed this update. Nice catch.


> 
> >		} else {
> >			/* slow path */
> >			if (w == next) {  
> 
> If it's a slow path, why try to play games with a delta timestamp ?
> Space should not be an issue for a slow path like this. It would be
> simpler to just use the full timestamp here.

Hmm, you may be right. Previous iterations of this code had a distinct
difference here, but after restructuring it, I don't think that
distinction is valid anymore. If anything, having this at least lets me
validate that it's doing what I believe it should be doing (as I added
a ton of trace_printk()s into this).

> 
> >				/* This event preempted the previous
> >	event
> >				 * just after it reserved its
> >	buffer.  
> 
> You mean nesting after [C] but before [D].

Yes. I can add that for clarity, but perhaps I don't need that if I
merge the two.

> 
> >				 * It's timestamp should be
> >	"before_stamp" */  
> 
> It's -> Its

;-)

My email client messed up your formatting of the rest of the email, so
I'll send a separate reply.

-- Steve


* Re: [RFC][PATCH] ring-buffer: Have nested events still record running time stamp
  2020-06-25 17:55 ` Mathieu Desnoyers
  2020-06-25 18:35   ` Steven Rostedt
@ 2020-06-25 19:04   ` Steven Rostedt
  2020-06-25 19:58     ` Mathieu Desnoyers
  2020-06-25 19:09   ` Steven Rostedt
  2 siblings, 1 reply; 24+ messages in thread
From: Steven Rostedt @ 2020-06-25 19:04 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: linux-kernel, Ingo Molnar, Peter Zijlstra, Thomas Gleixner,
	Masami Hiramatsu, Arnaldo Carvalho de Melo, Jiri Olsa,
	Namhyung Kim, Yordan Karadzhov, Tzvetomir Stoyanov, Tom Zanussi,
	Jason Behmer, Julia Lawall, Clark Williams, bristot,
	Daniel Wagner, Darren Hart, Jonathan Corbet, Suresh E. Warrier

On Thu, 25 Jun 2020 13:55:07 -0400 (EDT)
Mathieu Desnoyers <mathieu.desnoyers@efficios.com> wrote:


> What worries me about this "optimization" is what happens in a multi-nesting scenario.
> If we just fall back to a full timestamp as soon as nesting is detected, things become
> really simple for multiple nesting. However, using a delta timestamp adds a lot of
> additional states to consider, and those are not discussed here.
> 
> More specifically, one scenario I'm concerned about is:
> 
> Nesting level 0                      Nesting level 1                      Nesting level 2
> 
> [A] w = local_read(write_tail);
>     before = READ_ONCE(before_stamp);
> [B] WRITE_ONCE(before_stamp, ts);
>     WRITE_ONCE(next_write, w + len);
> [C] write = local_add_return(len,
>     write_tail);
>    <interrupted>
>                                     [A] w = local_read(write_tail);
>                                         before = READ_ONCE(before_stamp);
>                                     [B] WRITE_ONCE(before_stamp, ts);
>                                     <interrupted>
>                                                                           [A] w = local_read(write_tail);
>                                                                               before = READ_ONCE(before_stamp);
>                                                                           [B] WRITE_ONCE(before_stamp, ts);
>                                                                               WRITE_ONCE(next_write, w + len);
>                                                                           [C] write = local_add_return(len,
>                                                                               write_tail);
>                                                                               if (w == tail) { -> true
>                                                                                 if (!abs) { -> false
>                                                                                   if (w == next) { -> true
> 
> Nesting level 2 then ends up using the "before_stamp" of nesting level 1 to generate
> the time delta, but the delta is really between the event from nesting level 0 and the
> event at nesting level 2, because nesting level 1 will only record its event _after_
> the event of nesting level 2.

You may have a valid point. But if we do your suggestion of just
punting in this path and use a full timestamp, then there's nothing to
worry about here. Right?

> 
> >			} else {
> >				delta = ts; /* The full time stamp will be in use */
> >			}
> >		}
> >		if (ts != save_before) {
> >			/* slow path - Was preempted between C and E */
> >			/* The update to write_stamp could have overwritten the update to
> >			 * it by the preempting event,  
> 
> This sentence is hard to parse.

Ah, that needs better wording: "An interrupt of this event could have
written to the write_stamp before this interrupt wrote to it. Which
means this event could have made the write_stamp go backwards. Need to
put back the write stamp using the before stamp, as that can only be
updated by events that interrupt this current one."


> 
> >                          but before and after should be
> >			 * the same for all completed top events */
> >			after = local_read(write_stamp);
> >			if (save_before > after)
> >				local_cmpxchg(write_stamp, after, save_before);  
> 
> What is the type of write_stamp, and how does it work on 32-bit architectures with
> NMIs ?
> 

local64_cmpxchg(). I left it out of the change log, as this is the
algorithm, not all the details. See the patch for those ;-)

> I'm not entirely sure I understand what this cmpxchg tries to
> accomplish. Moreover, what happens if, right after an interrupt nests
> between [C] and [E], another interrupt tries to write into the buffer
> just before that cmpxchg ?

We are just fixing up this event's write into write_stamp. We only worry
about being interrupted between C and D, as then we set the write stamp
backward. But any interrupting event will also update the before time
stamp. That is, the before_stamp can only be the last time stamp of an
event that interrupted us here. By doing the cmpxchg, we only update
the write stamp to the before stamp of the last event to interrupt this
event and if the cmpxchg fails, that means another event came in and
updated it to its own time stamp which is what we want it to be!

> 
> >		}
> >	} else {
> >		/* slow path - Preempted between A and C */
> > 
> >		after = local_read(write_stamp);
> >		temp_ts = clock();
> >		barrier();
> > [F]		if (write == local_read(write_tail) && after < temp_ts) {
> >			/* This was not preempted between C and F
> >			 * The last write_stamp is still valid for the previous event
> >			 * in the buffer. */
> >			delta = temp_ts - after;
> >			/* OK to keep this new time stamp */
> >			ts = temp_ts;
> >		} else {
> >			/* Preempted between C and F
> >			 * Well, there's no use to try to know what
> >	the time stamp
> >			 * is for the previous event. Just set delta
> >	to zero and
> >			 * be the same time as that event that
> >	preempted us before
> >			 * the reservation of the buffer. */
> > 
> >			delta = 0;  
> 
> If this is a rare case, why not just use a full timestamp instead ?

Because the current time stamp is wrong.

This is what happened:

	nesting 0							    nesting 1
	---------							    ---------
        < start event 1 >
 [A]	w = local_read(write_tail);
	ts = clock(); [ time 1 ]

 									< start event 2 >
									    ts = clock(); [ time 2 ]
									[C] write = local_add_return();
									<finish>

 [C]	write = local_add_return();

									< start event 3 >
									    ts = clock(); [ time 3]
									[C] write = local_add_return();
									<finish>


 [F]	if (write == local_read(write_tail) && after < temp_ts) {

	} else {

The recorded time stamps are this:

   time 1 < time 2 < time 3

The buffer would look like this:

  ----+------------------+------------------+------------------+----
  ... | time 2 : event 2 | time ? : event 1 | time 3 : event 3 | ...
  ----+------------------+------------------+------------------+----


What do you put in for event 1's time? Its time stamp is before time
2, but the recorded event is after event 2. We can't use event 3 as a
time.

But in this case, it really doesn't matter in practice. Why? because
what is the time of that middle event? How is that useful information?

For example, you record a syscall event.



 syscall() {
 	[...]

 +	trace_syscall() {
 |		< event 1 >
 |			-------------> [interrupt]
 |				 	 < event 2 >
 |					 < time 2 >
 |			<------------
 |		< time 1>
 |			-------------> [interrupt]
 |				 	 < event 3 >
 |					 < time 3 >
 |			<------------
 +	}

	[...]
}

What's the difference if event 1's time stamp is time 2 or time 3?
The event that it is representing is paused for that entire path.

It's not really useful to get that "exact" because there's no real
meaning of what that "exact" is. The vertical line there is the call to
record a trace event. The event that is being represented by the trace
event is paused for all three time stamps.

-- Steve


* Re: [RFC][PATCH] ring-buffer: Have nested events still record running time stamp
  2020-06-25 17:55 ` Mathieu Desnoyers
  2020-06-25 18:35   ` Steven Rostedt
  2020-06-25 19:04   ` Steven Rostedt
@ 2020-06-25 19:09   ` Steven Rostedt
  2020-06-25 20:03     ` Mathieu Desnoyers
  2 siblings, 1 reply; 24+ messages in thread
From: Steven Rostedt @ 2020-06-25 19:09 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: linux-kernel, Ingo Molnar, Peter Zijlstra, Thomas Gleixner,
	Masami Hiramatsu, Arnaldo Carvalho de Melo, Jiri Olsa,
	Namhyung Kim, Yordan Karadzhov, Tzvetomir Stoyanov, Tom Zanussi,
	Jason Behmer, Julia Lawall, Clark Williams, bristot,
	Daniel Wagner, Darren Hart, Jonathan Corbet, Suresh E. Warrier

On Thu, 25 Jun 2020 13:55:07 -0400 (EDT)
Mathieu Desnoyers <mathieu.desnoyers@efficios.com> wrote:

> What are the types used for before_stamp and write_stamp ? If those
> are 64-bit integers, how does sharing them between nested contexts
> work on 32-bit architectures ?

BTW, I forgot to state in either of my two emails, THANK YOU!

I really appreciate the thorough review of this patch, and I welcome
the challenges you give me to make sure I can verify all that I claim
can happen.

-- Steve


* Re: [RFC][PATCH] ring-buffer: Have nested events still record running time stamp
  2020-06-25 18:35   ` Steven Rostedt
@ 2020-06-25 19:35     ` Mathieu Desnoyers
  2020-06-25 19:58       ` Steven Rostedt
                         ` (2 more replies)
  0 siblings, 3 replies; 24+ messages in thread
From: Mathieu Desnoyers @ 2020-06-25 19:35 UTC (permalink / raw)
  To: rostedt
  Cc: linux-kernel, Ingo Molnar, Peter Zijlstra, Thomas Gleixner,
	Masami Hiramatsu, Arnaldo Carvalho de Melo, Jiri Olsa,
	Namhyung Kim, Yordan Karadzhov, Tzvetomir Stoyanov, Tom Zanussi,
	Jason Behmer, Julia Lawall, Clark Williams, bristot,
	Daniel Wagner, Darren Hart, Jonathan Corbet, Suresh E. Warrier

----- On Jun 25, 2020, at 2:35 PM, rostedt rostedt@goodmis.org wrote:

> On Thu, 25 Jun 2020 13:55:07 -0400 (EDT)
> Mathieu Desnoyers <mathieu.desnoyers@efficios.com> wrote
> 
>> > Here's the design of this solution:
>> > 
>> > All this is per cpu, and only needs to worry about nested events (not
>> > parallel events).
>> > 
>> > The players:
>> > 
>> > write_tail: The index in the buffer where new events can be written to.
>> >     It is incremented via local_add() to reserve space for a new event.
>> > 
>> > before_stamp: A time stamp set by all events before reserving space.
>> > 
>> > write_stamp: A time stamp updated by events after it has successfully
>> >     reserved space.
>> 
>> What are the types used for before_stamp and write_stamp ? If those
>> are 64-bit integers, how does sharing them between nested contexts
>> work on 32-bit architectures ?
> 
> Well, write_stamp is updated via local64, which I believe handles this
> for us. I probably should make before_stamp handle it as well.

By looking at the local64 headers, it appears that 32-bit relies on atomic64,
which on x86 is implemented with LOCK; cmpxchg8b for 586+ (which is AFAIK
painfully slow) and with cli/sti for 386/486 (which is not nmi-safe).

For all other 32-bit architectures, the generic atomic64.h implements 64-bit
atomics using spinlocks with irqs off, which seems to also bring considerable
overhead, in addition to being non-reentrant with respect to NMI-like
interrupts, e.g. FIQ on ARM32.
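
(For reference, a minimal sketch of what that generic spinlock fallback
looks like; this is loosely modeled on include/asm-generic/atomic64.h,
with the lock hashing simplified to a single lock for illustration:

	/* the real code hashes the atomic64_t address into a small
	 * array of raw spinlocks; one lock keeps the sketch short */
	static DEFINE_RAW_SPINLOCK(atomic64_fallback_lock);

	s64 generic_atomic64_read(const atomic64_t *v)
	{
		unsigned long flags;
		s64 val;

		raw_spin_lock_irqsave(&atomic64_fallback_lock, flags);
		val = v->counter;
		raw_spin_unlock_irqrestore(&atomic64_fallback_lock, flags);
		return val;
	}

An NMI or FIQ that arrives while the lock is held and tries to take it
again would spin forever, hence the reentrancy concern above.)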

That seems at odds with the performance constraints of ftrace's ring buffer.

Those performance and reentrancy concerns are why I always stick to local_t
(long), and never use a full 64-bit type for anything that has to
do with concurrent store/load between execution contexts in lttng.

> 
> 
>> 
>> >	 * a full time stamp (this can turn into a time extend which is
>> >	 * just an extended time delta but fill up the extra space).
>> >	 */
>> >	if (after != before)
>> >		abs = true;
>> > 
>> >	ts = clock();
>> > 
>> >	/* Now update the before_stamp (everyone does this!) */
>> > [B]	WRITE_ONCE(before_stamp, ts);
>> > 
>> >	/* Read the current next_write and update it to what we want write
>> >	 * to be after we reserve space. */
>> > 	next = READ_ONCE(next_write);
>> >	WRITE_ONCE(next_write, w + len);
>> > 
>> >	/* Now reserve space on the buffer */
>> > [C]	write = local_add_return(len, write_tail);
>> 
>> So the reservation is not "just" an add instruction, it's actually an
>> xadd on x86. Is that really faster than a cmpxchg ?
> 
> I believe the answer is still yes. But I can run some benchmarks to
> make sure.

This would be interesting to see, because if xadd and cmpxchg have
similar overhead, then going for a cmpxchg-loop for the space
reservation could vastly decrease the overall complexity of this
timestamp+space reservation algorithm.

If we decrease the complexity of the fast paths, we can then reduce the
number of tests and branches which need to be performed and the amount
of instruction cache being used, and this can lead to performance
improvements. So it's not only good for making things more easily
verifiable, it also improves performance.
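
(For concreteness, the cmpxchg-loop reservation being suggested would
look roughly like the sketch below; tail_page->write and len are the
names used elsewhere in the thread, and Steve benchmarks essentially
this loop later on:

	unsigned long expect, write;

	do {
		expect = local_read(&tail_page->write);
		write = expect + len;
	} while (local_cmpxchg(&tail_page->write, expect, write) != expect);

A real version would also re-read the clock inside the loop, so a
retry picks up a fresh timestamp.)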

[...]
> 
> 
>> 
>> >		} else {
>> >			/* slow path */
>> >			if (w == next) {
>> 
>> If it's a slow path, why try to play games with a delta timestamp ?
>> Space should not be an issue for a slow path like this. It would be
>> simpler to just use the full timestamp here.
> 
> Hmm, you may be right. Previous iterations of this code had a distinct
> difference here, but after restructuring it, I don't think that
> distinction is valid anymore. If anything, having this at least lets me
> validate that it's doing what I believe it should be doing (as I added
> a ton of trace_printk()s into this).

Good. Simple is good. :-)

> 
>> 
>> >				/* This event preempted the previous event
>> >				 * just after it reserved its buffer.
>> 
>> You mean nesting after [C] but before [D].
> 
> Yes. I can add that for clarity, but perhaps I don't need that if I
> merge the two.

OK

> 
>> 
>> >				 * It's timestamp should be "before_stamp" */
>> 
>> It's -> Its
> 
> ;-)
> 
> My email client messed up your formatting of the rest of the email, so
> I'll send a separate reply.

OK,

Thanks!

Mathieu


-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFC][PATCH] ring-buffer: Have nested events still record running time stamp
  2020-06-25 19:04   ` Steven Rostedt
@ 2020-06-25 19:58     ` Mathieu Desnoyers
  2020-06-25 20:42       ` Steven Rostedt
  0 siblings, 1 reply; 24+ messages in thread
From: Mathieu Desnoyers @ 2020-06-25 19:58 UTC (permalink / raw)
  To: rostedt
  Cc: linux-kernel, Ingo Molnar, Peter Zijlstra, Thomas Gleixner,
	Masami Hiramatsu, Arnaldo Carvalho de Melo, Jiri Olsa,
	Namhyung Kim, Yordan Karadzhov, Tzvetomir Stoyanov, Tom Zanussi,
	Jason Behmer, Julia Lawall, Clark Williams, bristot,
	Daniel Wagner, Darren Hart, Jonathan Corbet, Suresh E. Warrier

----- On Jun 25, 2020, at 3:04 PM, rostedt rostedt@goodmis.org wrote:

> On Thu, 25 Jun 2020 13:55:07 -0400 (EDT)
> Mathieu Desnoyers <mathieu.desnoyers@efficios.com> wrote:
> 
> 
>> What worries me about this "optimization" is what happens in a
>> multi-nesting scenario. If we just fall back to a full timestamp as
>> soon as nesting is detected, things become really simple for multiple
>> nesting. However, using a delta timestamp adds a lot of additional
>> states to consider, and those are not discussed here.
>> 
>> More specifically, one scenario I'm concerned about is:
>> 
>> Nesting level 0                      Nesting level 1                      Nesting level 2
>> 
>> [A] w = local_read(write_tail);
>>     before = READ_ONCE(before_stamp);
>> [B] WRITE_ONCE(before_stamp, ts);
>>     WRITE_ONCE(next_write, w + len);
>> [C] write = local_add_return(len,
>>     write_tail);
>>    <interrupted>
>>                                     [A] w = local_read(write_tail);
>>                                         before = READ_ONCE(before_stamp);
>>                                     [B] WRITE_ONCE(before_stamp, ts);
>>                                     <interrupted>
>>                                                                           [A] w = local_read(write_tail);
>>                                                                               before = READ_ONCE(before_stamp);
>>                                                                           [B] WRITE_ONCE(before_stamp, ts);
>>                                                                               WRITE_ONCE(next_write, w + len);
>>                                                                           [C] write = local_add_return(len,
>>                                                                               write_tail);
>>                                                                               if (w == tail) { -> true
>>                                                                                 if (!abs) { -> false
>>                                                                                   if (w == next) { -> true
>> 
>> Nesting level 2 then ends up using the "before_stamp" of nesting level 1
>> to generate the time delta, but the delta is really between the event
>> from nesting level 0 and the event at nesting level 2, because nesting
>> level 1 will only record its event _after_ the event of nesting level 2.
> 
> You may have a valid point. But if we do your suggestion of just
> punting in this path and use a full timestamp, then there's nothing to
> worry about here. Right?

Yes, this problematic scenario is just there to illustrate my point.

> 
>> 
>> >			} else {
>> >				delta = ts; /* The full time stamp will be in use */
>> >			}
>> >		}
>> >		if (ts != save_before) {
>> >			/* slow path - Was preempted between C and E */
>> >			/* The update to write_stamp could have overwritten the update to
>> >			 * it by the preempting event,
>> 
>> This sentence is hard to parse.
> 
> Ah, that needs better wording: "An interrupt of this event could have
> written to the write_stamp before this interrupt wrote to it.

I would have thought the part "before this interrupt wrote to it" would
actually be "before this event wrote to it" (?) Am I confused ?

> Which
> means this event could have made the write_stamp go backwards. Need to
> put back the write stamp using the before stamp, as that can only be
> updated by events that interrupt this current one."
> 
> 
>> 
>> >                          but before and after should be
>> >			 * the same for all completed top events */
>> >			after = local_read(write_stamp);
>> >			if (save_before > after)
>> >				local_cmpxchg(write_stamp, after, save_before);
>> 
>> What is the type of write_stamp, and how does it work on 32-bit
>> architectures with NMIs ?
>> 
> 
> local64_cmpxchg(), I left it out of change log, as this is the
> algorithm not all the details. See the patch for those ;-)

I'm still concerned about local64 on 32-bit, as per my prior email.

> 
>> I'm not entirely sure I understand what this cmpxchg tries to
>> accomplish. Moreover, what happens if, right after an interrupt nests
>> between [C] and [E], another interrupt tries to write into the buffer
>> before that cmpxchg ?
> 
> We are just fixing up this event's write into write_stamp. We only worry
> about being interrupted between C and D, as then we set the write stamp
> backward. But any interrupting event will also update the before time
> stamp. That is, the before_stamp can only be the last time stamp of an
> event that interrupted us here. By doing the cmpxchg, we only update
> the write stamp to the before stamp of the last event to interrupt this
> event, and if the cmpxchg fails, that means another event came in and
> updated it to its own time stamp, which is what we want it to be!

This seems to assume that before_stamp is always written to in a monotonic
fashion except in this very specific corner-case. But what happens if this
algorithm is interrupted between reading the clock() current time and
writing to before_stamp ? This can happen both for the outer event and
at various nesting levels as well. I'm concerned that this scenario
would also cause before_stamp to go backwards, but would not be handled
correctly.
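
(For concreteness, the interleaving being described is roughly:

	outer event:  ts = clock();                    [time 10]
	    -> interrupted before WRITE_ONCE(before_stamp, ts):
	        nested event:  ts = clock();           [time 11]
	        nested event:  WRITE_ONCE(before_stamp, 11);
	        ... nested event completes ...
	    <- resume:
	outer event:  WRITE_ONCE(before_stamp, 10);

leaving before_stamp going backwards, from 11 to 10.)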

> 
>> 
>> >		}
>> >	} else {
>> >		/* slow path - Preempted between A and C */
>> > 
>> >		after = local_read(write_stamp);
>> >		temp_ts = clock();
>> >		barrier();
>> > [F]		if (write == local_read(write_tail) && after < temp_ts) {
>> >			/* This was not preempted since C and F
>> >			 * The last write_stamp is still valid for the previous event
>> >			 * in the buffer. */
>> >			delta = temp_ts - after;
>> >			/* OK to keep this new time stamp */
>> >			ts = temp_ts;
>> >		} else {
>> >			/* Preempted between C and F
>> >			 * Well, there's no use to try to know what the
>> >			 * time stamp is for the previous event. Just set
>> >			 * delta to zero and be the same time as that event
>> >			 * that preempted us before the reservation of the
>> >			 * buffer. */
>> > 
>> >			delta = 0;
>> 
>> If this is a rare case, why not just use a full timestamp instead ?
> 
> Because the current time stamp is wrong.
> 
> This is what happened:
> 
>	nesting 0							    nesting 1
>	---------							    ---------
>        < start event 1 >
> [A]	w = local_read(write_tail);
>	ts = clock(); [ time 1 ]
> 
> 									< start event 2 >
>									    ts = clock(); [ time 2 ]
>									[C] write = local_add_return();
>									<finish>
> 
> [C]	write = local_add_return();
> 
>									< start event 3 >
>									    ts = clock(); [ time 3]
>									[C] write = local_add_return();
>									<finish>
> 
> 
> [F]	if (write == local_read(write_tail) && after < temp_ts) {
> 
>	} else {
> 
> The recorded time stamps are this:
> 
>   time 1 < time 2 < time 3
> 
> The buffer would look like this:
> 
>  ----+------------------+------------------+------------------+----
>  ... | time 2 : event 2 | time ? : event 1 | time 3 : event 3 | ...
>  ----+------------------+------------------+------------------+----
> 
> 
> What do you put in for event 1's time? Its time stamp is before time
> 2, but the recorded event is after event 2. We can't use event 3's
> time either.

I solve this by using cmpxchg instead of xadd for [C], and retry space
reservation with a new timestamp. ;-)

> 
> But in this case, it really doesn't matter in practice. Why? because
> what is the time of that middle event? How is that useful information?
> 
> For example, you record a syscall event.
> 
> 
> 
> syscall() {
> 	[...]
> 
> +	trace_syscall() {
> |		< event 1 >
> |			-------------> [interrupt]
> |				 	 < event 2 >
> |					 < time 2 >
> |			<------------
> |		< time 1 >
> |			-------------> [interrupt]
> |				 	 < event 3 >
> |					 < time 3 >
> |			<------------
> +	}
> 
>	[...]
> }
> 
> What's the difference if event 1's time stamp is time 2 or time 3?
> The event that it is representing is paused for that entire path.
> 
> It's not really useful to get that "exact" because there's no real
> meaning of what that "exact" is. The vertical line there is the call to
> record a trace event. The event that is being represented by the trace
> event is paused for all three time stamps.

Agreed on the fact that the difference does not matter much in practice.

Can we have this scenario at an extra nesting level ? Basically
we add an outermost event 0 at nesting level 0, and shift events 1, 2, 3
to nesting levels 1, 2 in your scenario ? If it is possible, will the
special-case handling also trigger ?

Thanks,

Mathieu


-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFC][PATCH] ring-buffer: Have nested events still record running time stamp
  2020-06-25 19:35     ` Mathieu Desnoyers
@ 2020-06-25 19:58       ` Steven Rostedt
  2020-06-26  2:36       ` Steven Rostedt
  2020-06-30  0:21       ` Steven Rostedt
  2 siblings, 0 replies; 24+ messages in thread
From: Steven Rostedt @ 2020-06-25 19:58 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: linux-kernel, Ingo Molnar, Peter Zijlstra, Thomas Gleixner,
	Masami Hiramatsu, Arnaldo Carvalho de Melo, Jiri Olsa,
	Namhyung Kim, Yordan Karadzhov, Tzvetomir Stoyanov, Tom Zanussi,
	Jason Behmer, Julia Lawall, Clark Williams, bristot,
	Daniel Wagner, Darren Hart, Jonathan Corbet, Suresh E. Warrier

On Thu, 25 Jun 2020 15:35:02 -0400 (EDT)
Mathieu Desnoyers <mathieu.desnoyers@efficios.com> wrote:

> > 
> > Well, write_stamp is updated via local64, which I believe handles this
> > for us. I probably should make before_stamp handle it as well.  
> 
> By looking at the local64 headers, it appears that 32-bit relies on atomic64,
> which on x86 is implemented with LOCK; cmpxchg8b for 586+ (which is AFAIK
> painfully slow) and with cli/sti for 386/486 (which is not nmi-safe).
> 
> For all other 32-bit architectures, the generic atomic64.h implements 64-bit
> atomics using spinlocks with irqs off, which seems to also bring considerable
> overhead, in addition to being non-reentrant with respect to NMI-like
> interrupts, e.g. FIQ on ARM32.
> 
> That seems at odds with the performance constraints of ftrace's ring buffer.
> 
> Those performance and reentrancy concerns are why I always stick to local_t
> (long), and never use a full 64-bit type for anything that has to
> do with concurrent store/load between execution contexts in lttng.

If this is an issue, I'm sure I can make my own wrappers for
"time_local()", and implement something like what you probably do.
Because we only need to worry about wrapping the lower 32 bit number,
and that only happens every 4 seconds or so (2^32 ns is about 4.3 s).
But that is an implementation detail; it doesn't affect the overall
design correctness.

But it is something I should definitely look into.

> 
> > 
> >   
> >>   
> >> >	 * a full time stamp (this can turn into a time extend which is
> >> >	 * just an extended time delta but fill up the extra space).
> >> >	 */
> >> >	if (after != before)
> >> >		abs = true;
> >> > 
> >> >	ts = clock();
> >> > 
> >> >	/* Now update the before_stamp (everyone does this!) */
> >> > [B]	WRITE_ONCE(before_stamp, ts);
> >> > 
> >> >	/* Read the current next_write and update it to what we want write
> >> >	 * to be after we reserve space. */
> >> > 	next = READ_ONCE(next_write);
> >> >	WRITE_ONCE(next_write, w + len);
> >> > 
> >> >	/* Now reserve space on the buffer */
> >> > [C]	write = local_add_return(len, write_tail);  
> >> 
> >> So the reservation is not "just" an add instruction, it's actually an
> >> xadd on x86. Is that really faster than a cmpxchg ?  
> > 
> > I believe the answer is still yes. But I can run some benchmarks to
> > make sure.  
> 
> This would be interesting to see, because if xadd and cmpxchg have
> similar overhead, then going for a cmpxchg-loop for the space
> reservation could vastly decrease the overall complexity of this
> timestamp+space reservation algorithm.

It would most likely cause userspace breakage, and that would be a show
stopper.

But still good to see.

Thanks for the review.

-- Steve

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFC][PATCH] ring-buffer: Have nested events still record running time stamp
  2020-06-25 19:09   ` Steven Rostedt
@ 2020-06-25 20:03     ` Mathieu Desnoyers
  0 siblings, 0 replies; 24+ messages in thread
From: Mathieu Desnoyers @ 2020-06-25 20:03 UTC (permalink / raw)
  To: rostedt
  Cc: linux-kernel, Ingo Molnar, Peter Zijlstra, Thomas Gleixner,
	Masami Hiramatsu, Arnaldo Carvalho de Melo, Jiri Olsa,
	Namhyung Kim, Yordan Karadzhov, Tzvetomir Stoyanov, Tom Zanussi,
	Jason Behmer, Julia Lawall, Clark Williams, bristot,
	Daniel Wagner, Darren Hart, Jonathan Corbet, Suresh E. Warrier

----- On Jun 25, 2020, at 3:09 PM, rostedt rostedt@goodmis.org wrote:

> On Thu, 25 Jun 2020 13:55:07 -0400 (EDT)
> Mathieu Desnoyers <mathieu.desnoyers@efficios.com> wrote:
> 
>> What are the types used for before_stamp and write_stamp ? If those
>> are 64-bit integers, how does sharing them between nested contexts
>> work on 32-bit architectures ?
> 
> BTW, I forgot to state in either of my two emails, THANK YOU!
> 
> I really appreciate the thorough review of this patch, and I welcome
> the challenges you give me to make sure I can verify all that I claim
> can happen.

You're welcome! I believe these discussions we've been having for the past
10+ years on ring buffer design have helped improve our respective projects.

I'm glad to be of service!

Thanks,

Mathieu


-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFC][PATCH] ring-buffer: Have nested events still record running time stamp
  2020-06-25 19:58     ` Mathieu Desnoyers
@ 2020-06-25 20:42       ` Steven Rostedt
  0 siblings, 0 replies; 24+ messages in thread
From: Steven Rostedt @ 2020-06-25 20:42 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: linux-kernel, Ingo Molnar, Peter Zijlstra, Thomas Gleixner,
	Masami Hiramatsu, Arnaldo Carvalho de Melo, Jiri Olsa,
	Namhyung Kim, Yordan Karadzhov, Tzvetomir Stoyanov, Tom Zanussi,
	Jason Behmer, Julia Lawall, Clark Williams, bristot,
	Daniel Wagner, Darren Hart, Jonathan Corbet, Suresh E. Warrier

On Thu, 25 Jun 2020 15:58:32 -0400 (EDT)
Mathieu Desnoyers <mathieu.desnoyers@efficios.com> wrote:

> >> >		if (ts != save_before) {
> >> >			/* slow path - Was preempted between C and E */
> >> >			/* The update to write_stamp could have overwritten the update to
> >> >			 * it by the preempting event,  
> >> 
> >> This sentence is hard to parse.  
> > 
> > Ah, that needs better wording: "An interrupt of this event could have
> > written to the write_stamp before this interrupt wrote to it.  
> 
> I would have thought the part "before this interrupt wrote to it" would
> actually be "before this event wrote to it" (?) Am I confused ?

No, this is what I get for replying while in a meeting ;-)

Yes, it's "before this event".

> 
> > Which
> > means this event could have made the write_stamp go backwards. Need to
> > put back the write stamp using the before stamp, as that can only be
> > updated by events that interrupt this current one."
> > 
> >   
> >>   
> >> >                          but before and after should be
> >> >			 * the same for all completed top events */
> >> >			after = local_read(write_stamp);
> >> >			if (save_before > after)
> >> >				local_cmpxchg(write_stamp, after, save_before);  
> >> 
> >> What is the type of write_stamp, and how does it work on 32-bit
> >> architectures with NMIs ?
> >>   
> > 
> > local64_cmpxchg(), I left it out of change log, as this is the
> > algorithm not all the details. See the patch for those ;-)  
> 
> I'm still concerned about local64 on 32-bit, as per my prior email.

Yes. And I think we could make a local_time() wrapper of local_t to do
these updates for us. They don't need to be totally atomic, just appear
to be.

> 
> >   
> >> I'm not entirely sure I understand what this cmpxchg tries to
> >> accomplish. Moreover, what happens if, right after an interrupt nests
> >> between [C] and [E], another interrupt tries to write into the buffer
> >> before that cmpxchg ?
> > 
> > We are just fixing up this event's write into write_stamp. We only worry
> > about being interrupted between C and D, as then we set the write stamp
> > backward. But any interrupting event will also update the before time
> > stamp. That is, the before_stamp can only be the last time stamp of an
> > event that interrupted us here. By doing the cmpxchg, we only update
> > the write stamp to the before stamp of the last event to interrupt this
> > event, and if the cmpxchg fails, that means another event came in and
> > updated it to its own time stamp, which is what we want it to be!
> 
> This seems to assume that before_stamp is always written to in a monotonic
> fashion except in this very specific corner-case. But what happens if this
> algorithm is interrupted between reading the clock() current time and
> writing to before_stamp ? This can happen both for the outer event and
> at various nesting levels as well. I'm concerned that this scenario
> would also cause before_stamp to go backwards, but would not be handled
> correctly.

For this version I'm using local64_t for before_stamp. I'll look into
other ways to update it for 32 bit. Sequence updates should be easy to
implement (if it hasn't been done already).


> > 
> > The buffer would look like this:
> > 
> >  ----+------------------+------------------+------------------+----
> >  ... | time 2 : event 2 | time ? : event 1 | time 3 : event 3 | ...
> >  ----+------------------+------------------+------------------+----
> > 
> > 
> > What do you put in for event 1's time. It's time stamp is before time
> > 2, but the recorded event is after event 2. We can't use event 3 as a
> > time.  
> 
> I solve this by using cmpxchg instead of xadd for [C], and retry space
> reservation with a new timestamp. ;-)

Not if it breaks user space. 10 years ago I may have considered a
change like that, but not today.


> 
> Can we have this scenario at an extra nesting level ? Basically
> we add an outermost event 0 at nesting level 0, and shift events 1, 2, 3
> to nesting levels 1, 2 in your scenario ? If it is possible, will the
> special-case handling also trigger ?

I'm not sure I understand what you mean. The outer most event only
cares that its timestamp:

 1) Is in order with respect to other events in the buffer
 2) Exists between the time the event started recording and the time it
    finished recording, as the caller doesn't care where the timestamp
    lies within that event.

This is true even if this trace happens within a nesting event.

-- Steve

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFC][PATCH] ring-buffer: Have nested events still record running time stamp
  2020-06-25 19:35     ` Mathieu Desnoyers
  2020-06-25 19:58       ` Steven Rostedt
@ 2020-06-26  2:36       ` Steven Rostedt
  2020-06-26  3:35         ` Steven Rostedt
  2020-06-30  0:21       ` Steven Rostedt
  2 siblings, 1 reply; 24+ messages in thread
From: Steven Rostedt @ 2020-06-26  2:36 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: linux-kernel, Ingo Molnar, Peter Zijlstra, Thomas Gleixner,
	Masami Hiramatsu, Arnaldo Carvalho de Melo, Jiri Olsa,
	Namhyung Kim, Yordan Karadzhov, Tzvetomir Stoyanov, Tom Zanussi,
	Jason Behmer, Julia Lawall, Clark Williams, bristot,
	Daniel Wagner, Darren Hart, Jonathan Corbet, Suresh E. Warrier

On Thu, 25 Jun 2020 15:35:02 -0400 (EDT)
Mathieu Desnoyers <mathieu.desnoyers@efficios.com> wrote:

> Those performance and reentrancy concerns are why I always stick to local_t
> (long), and never use a full 64-bit type for anything that has to
> do with concurrent store/load between execution contexts in lttng.

Adding this patch slows the recording down by around 7 nanoseconds
(average 55ns brought to 62ns), but I can wrap it so that it stays
local64_t for 64 bit and uses this for 32 bit.

I added three primitives:

 bool rb_time_read(rb_time_t *t, u64 *val)

It loads val with t if t wasn't being updated at the time. If it was, it
returns false and val is not touched; otherwise it returns true and val
holds a real value (either from before, or updated by an interrupt).

 void rb_time_set(rb_time_t *t, u64 val)

This will update t with val, and won't return until it has done so. It
will overwrite any update that interrupted the process.

 bool rb_time_cmpxchg(rb_time_t *t, u64 expect, u64 set);

This will attempt to do a cmpxchg on t: if t matches "expect" it will
write "set". If it gets interrupted, it will detect that and return
false. Otherwise it returns true if it fully changed the value.
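
(A minimal usage sketch of the three primitives as described above; the
call sites and values here are hypothetical, not taken from the patch
below:

	rb_time_t t;
	u64 val;

	if (rb_time_read(&t, &val)) {
		/* val is a consistent 64-bit value */
	}

	rb_time_set(&t, 12345ULL);	/* always completes */

	if (rb_time_cmpxchg(&t, val, 12345ULL)) {
		/* t atomically went from val to the new value */
	}
)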


This is all I need, and is rather trivial to implement. The trick is
that the fields of rb_time_t are always written in this order:

  start_cnt
  top
  bottom
  end_cnt

This means that end_cnt will never be greater than start_cnt. If we see
that it is, it means that the process was interrupted by another
update. The process that interrupted a write would complete its write.
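
(To make the detection concrete, a hypothetical interleaving on one CPU:

	writer:  start_cnt = 5
	writer:  top = <new top>
	    -> interrupted by a nested writer:
	        start_cnt = 6, top, bottom, end_cnt = 6
	    <- resume:
	writer:  bottom = <new bottom>
	writer:  end_cnt = 5

A reader now sees start_cnt == 6 but end_cnt == 5; the counters do not
match, so the value is known to be torn, while the interrupted writer's
loop sees that start_cnt changed and rewrites the whole thing.)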

I don't plan on applying this, but if it turns out that local64_t is a
problem on 32 bit archs (and I hear complaints), I'll add this (but
still use local64_t on 64 bit archs).

-- Steve

Index: linux-trace.git/kernel/trace/ring_buffer.c
===================================================================
--- linux-trace.git.orig/kernel/trace/ring_buffer.c
+++ linux-trace.git/kernel/trace/ring_buffer.c
@@ -446,6 +446,22 @@ enum {
 	RB_CTX_MAX
 };
 
+struct rb_time_struct {
+	local_t		start_cnt;
+	local_t		top;
+	local_t		bottom;
+	local_t		end_cnt;
+};
+
+struct rb_time_read {
+	unsigned long	start_cnt;
+	unsigned long	top;
+	unsigned long	bottom;
+	unsigned long	end_cnt;
+};
+
+typedef struct rb_time_struct rb_time_t;
+
 /*
  * head_page == tail_page && head == tail then buffer is empty.
  */
@@ -482,8 +498,8 @@ struct ring_buffer_per_cpu {
 	unsigned long			read;
 	unsigned long			read_bytes;
 	unsigned long			next_write;
-	local64_t			write_stamp;
-	local64_t			before_stamp;
+	rb_time_t			write_stamp;
+	rb_time_t			before_stamp;
 	u64				read_stamp;
 	/* ring buffer pages to update, > 0 to add, < 0 to remove */
 	long				nr_pages_to_update;
@@ -526,6 +542,78 @@ struct ring_buffer_iter {
 	int				missed_events;
 };
 
+static bool rb_time_read(rb_time_t *t, u64 *ret)
+{
+	struct rb_time_read r;
+	u64 val;
+
+	do {
+		r.start_cnt = local_read(&t->start_cnt);
+		r.top = local_read(&t->top);
+		r.bottom = local_read(&t->bottom);
+		r.end_cnt = local_read(&t->end_cnt);
+	} while (r.start_cnt < r.end_cnt);
+
+	if (r.start_cnt != r.end_cnt)
+		return false;
+
+	val = r.top;
+	val <<= 32;
+	val |= r.bottom;
+	*ret = val;
+	return true;
+}
+
+static inline void rb_time_read_set(struct rb_time_read *r, u64 val)
+{
+	r->top = (unsigned long)(val >> 32);
+	r->bottom = (unsigned long)(val & ((1ULL << 32) - 1));
+}
+
+static void rb_time_set(rb_time_t *t, u64 val)
+{
+	struct rb_time_read r;
+
+	rb_time_read_set(&r, val);
+
+	do {
+		r.start_cnt = local_inc_return(&t->start_cnt);
+		local_set(&t->top, r.top);
+		local_set(&t->bottom, r.bottom);
+		local_set(&t->end_cnt, r.start_cnt);
+	} while (r.start_cnt != local_read(&t->start_cnt));
+}
+
+static bool rb_time_read_cmpxchg(local_t *l, unsigned long expect, unsigned long set)
+{
+	unsigned long ret;
+
+	ret = local_cmpxchg(l, expect, set);
+	return ret == expect;
+}
+
+static bool rb_time_cmpxchg(rb_time_t *t, u64 expect, u64 set)
+{
+	struct rb_time_read e, s;
+
+	rb_time_read_set(&e, expect);
+	rb_time_read_set(&s, set);
+
+	e.start_cnt = local_read(&t->start_cnt);
+	e.end_cnt = local_read(&t->end_cnt);
+
+	s.start_cnt = e.start_cnt + 1;
+	s.end_cnt = e.start_cnt;
+
+	if (!rb_time_read_cmpxchg(&t->start_cnt, e.start_cnt, s.start_cnt))
+		return false;
+	if (!rb_time_read_cmpxchg(&t->top, e.top, s.top))
+		return false;
+	if (!rb_time_read_cmpxchg(&t->bottom, e.bottom, s.bottom))
+		return false;
+	return rb_time_read_cmpxchg(&t->end_cnt, e.end_cnt, s.end_cnt);
+}
+
 /**
  * ring_buffer_nr_pages - get the number of buffer pages in the ring buffer
  * @buffer: The ring_buffer to get the number of pages from
@@ -2543,7 +2631,8 @@ rb_try_to_discard(struct ring_buffer_per
 
 	delta = rb_time_delta(event);
 
-	write_stamp = local64_read(&cpu_buffer->write_stamp);
+	if (!rb_time_read(&cpu_buffer->write_stamp, &write_stamp))
+		return 0;
 
 	/* Make sure the write stamp is read before testing the location */
 	barrier();
@@ -2552,11 +2641,10 @@ rb_try_to_discard(struct ring_buffer_per
 		unsigned long write_mask =
 			local_read(&bpage->write) & ~RB_WRITE_MASK;
 		unsigned long event_length = rb_event_length(event);
-		u64 ret;
 
-		ret = local64_cmpxchg(&cpu_buffer->write_stamp, write_stamp, write_stamp - delta);
 		/* Something came in, can't discard */
-		if (ret != write_stamp)
+		if (!rb_time_cmpxchg(&cpu_buffer->write_stamp,
+				     write_stamp, write_stamp - delta))
 			return 0;
 
 		/*
@@ -2887,11 +2975,13 @@ static noinline void
 rb_handle_timestamp(struct ring_buffer_per_cpu *cpu_buffer,
 		    struct rb_event_info *info)
 {
+	u64 write_stamp;
+
 	WARN_ONCE(info->delta > (1ULL << 59),
 		  KERN_WARNING "Delta way too big! %llu ts=%llu write stamp = %llu\n%s",
 		  (unsigned long long)info->delta,
 		  (unsigned long long)info->ts,
-		  (unsigned long long)local64_read(&cpu_buffer->write_stamp),
+		  (unsigned long long)(rb_time_read(&cpu_buffer->write_stamp, &write_stamp) ? write_stamp : 0),
 		  sched_clock_stable() ? "" :
 		  "If you just came from a suspend/resume,\n"
 		  "please switch to the trace global clock:\n"
@@ -2909,14 +2999,16 @@ __rb_reserve_next(struct ring_buffer_per
 	unsigned long tail, write, w, next;
 	u64 delta, before, after;
 	bool abs = false;
+	bool a_ok;
+	bool b_ok;
 
 	/* Don't let the compiler play games with cpu_buffer->tail_page */
 	tail_page = info->tail_page = READ_ONCE(cpu_buffer->tail_page);
 
  /*A*/	w = local_read(&tail_page->write) & RB_WRITE_MASK;
 	barrier();
-	before = local64_read(&cpu_buffer->before_stamp);
-	after = local64_read(&cpu_buffer->write_stamp);
+	b_ok = rb_time_read(&cpu_buffer->before_stamp, &before);
+	a_ok = rb_time_read(&cpu_buffer->write_stamp, &after);
 	barrier();
 	info->ts = rb_time_stamp(cpu_buffer->buffer);
 
@@ -2934,7 +3026,7 @@ __rb_reserve_next(struct ring_buffer_per
 	 * If interrupting an event time update, we may need an absolute timestamp.
 	 * Don't bother if this is the start of a new page (w == 0).
 	 */
-	if (unlikely(before != after && w))
+	if (unlikely(!a_ok || !b_ok || (before != after && w)))
 		info->add_timestamp = RB_ADD_STAMP_FORCE;
 	/*
 	 * If the time delta since the last event is too big to
@@ -2947,7 +3039,7 @@ __rb_reserve_next(struct ring_buffer_per
 	next = READ_ONCE(cpu_buffer->next_write);
 	WRITE_ONCE(cpu_buffer->next_write, w + info->length);
 
- /*B*/	local64_set(&cpu_buffer->before_stamp, info->ts);
+ /*B*/	rb_time_set(&cpu_buffer->before_stamp, info->ts);
 
  /*C*/	write = local_add_return(info->length, &tail_page->write);
 
@@ -2960,21 +3052,23 @@ __rb_reserve_next(struct ring_buffer_per
 	if (unlikely(write > BUF_PAGE_SIZE)) {
 		if (tail != w) {
 			/* before and after may now different, fix it up*/
-			before = local64_read(&cpu_buffer->before_stamp);
-			after = local64_read(&cpu_buffer->write_stamp);
-			if (before != after)
-				(void)local64_cmpxchg(&cpu_buffer->before_stamp, before, after);
+			b_ok = rb_time_read(&cpu_buffer->before_stamp, &before);
+			a_ok = rb_time_read(&cpu_buffer->write_stamp, &after);
+			if (a_ok && b_ok && before != after)
+				(void)rb_time_cmpxchg(&cpu_buffer->before_stamp, before, after);
 		}
 		return rb_move_tail(cpu_buffer, tail, info);
 	}
 
 	if (likely(tail == w)) {
 		u64 save_before;
+		bool s_ok;
 
 		/* Nothing interrupted us between A and C */
- /*D*/		local64_set(&cpu_buffer->write_stamp, info->ts);
+ /*D*/		rb_time_set(&cpu_buffer->write_stamp, info->ts);
 		barrier();
- /*E*/		save_before = local64_read(&cpu_buffer->before_stamp);
+ /*E*/		s_ok = rb_time_read(&cpu_buffer->before_stamp, &save_before);
+		RB_WARN_ON(cpu_buffer, !s_ok);
 		if (likely(info->add_timestamp != RB_ADD_STAMP_FORCE))
 			/* This did not interrupt any time update */
 			info->delta = info->ts - after;
@@ -2985,27 +3079,29 @@ __rb_reserve_next(struct ring_buffer_per
 		if (unlikely(info->ts != save_before)) {
 			/* SLOW PATH - Interrupted between C and E */
 
-			after = local64_read(&cpu_buffer->write_stamp);
+			a_ok = rb_time_read(&cpu_buffer->write_stamp, &after);
+			RB_WARN_ON(cpu_buffer, !a_ok);
+
 			/* Write stamp must only go forward */
 			if (save_before > after) {
 				/*
 				 * We do not care about the result, only that
 				 * it gets updated atomically.
 				 */
-				(void)local64_cmpxchg(&cpu_buffer->write_stamp, after, save_before);
+				(void)rb_time_cmpxchg(&cpu_buffer->write_stamp, after, save_before);
 			}
 		}
 	} else {
 		u64 ts;
 		/* SLOW PATH - Interrupted between A and C */
-		after = local64_read(&cpu_buffer->write_stamp);
+		a_ok = rb_time_read(&cpu_buffer->write_stamp, &after);
 		ts = rb_time_stamp(cpu_buffer->buffer);
 		barrier();
  /*E*/		if (write == (local_read(&tail_page->write) & RB_WRITE_MASK) &&
 		    after < ts) {
 			/* Nothing came after this event between C and E */
 			info->delta = ts - after;
-			(void)local64_cmpxchg(&cpu_buffer->write_stamp, after, info->ts);
+			(void)rb_time_cmpxchg(&cpu_buffer->write_stamp, after, info->ts);
 			info->ts = ts;
 		} else {
 			/*
@@ -4567,8 +4663,8 @@ rb_reset_cpu(struct ring_buffer_per_cpu
 	cpu_buffer->read = 0;
 	cpu_buffer->read_bytes = 0;
 
-	local64_set(&cpu_buffer->write_stamp, 0);
-	local64_set(&cpu_buffer->before_stamp, 0);
+	rb_time_set(&cpu_buffer->write_stamp, 0);
+	rb_time_set(&cpu_buffer->before_stamp, 0);
 	cpu_buffer->next_write = 0;
 
 	cpu_buffer->lost_events = 0;

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFC][PATCH] ring-buffer: Have nested events still record running time stamp
  2020-06-26  2:36       ` Steven Rostedt
@ 2020-06-26  3:35         ` Steven Rostedt
  2020-06-26 13:58           ` Steven Rostedt
  0 siblings, 1 reply; 24+ messages in thread
From: Steven Rostedt @ 2020-06-26  3:35 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: linux-kernel, Ingo Molnar, Peter Zijlstra, Thomas Gleixner,
	Masami Hiramatsu, Arnaldo Carvalho de Melo, Jiri Olsa,
	Namhyung Kim, Yordan Karadzhov, Tzvetomir Stoyanov, Tom Zanussi,
	Jason Behmer, Julia Lawall, Clark Williams, bristot,
	Daniel Wagner, Darren Hart, Jonathan Corbet, Suresh E. Warrier

On Thu, 25 Jun 2020 22:36:11 -0400
Steven Rostedt <rostedt@goodmis.org> wrote:

> +static void rb_time_set(rb_time_t *t, u64 val)
> +{
> +	struct rb_time_read r;
> +
> +	rb_time_read_set(&r, val);
> +
> +	do {
> +		r.start_cnt = local_inc_return(&t->start_cnt);
> +		local_set(&t->top, r.top);
> +		local_set(&t->bottom, r.bottom);
> +		local_set(&t->end_cnt, r.start_cnt);
> +	} while (r.start_cnt != local_read(&t->start_cnt));
> +}
> +
> +static bool rb_time_read_cmpxchg(local_t *l, unsigned long expect, unsigned long set)
> +{
> +	unsigned long ret;
> +
> +	ret = local_cmpxchg(l, expect, set);
> +	return ret == expect;
> +}
> +
> +static bool rb_time_cmpxchg(rb_time_t *t, u64 expect, u64 set)
> +{
> +	struct rb_time_read e, s;
> +
> +	rb_time_read_set(&e, expect);
> +	rb_time_read_set(&s, set);
> +
> +	e.start_cnt = local_read(&t->start_cnt);
> +	e.end_cnt = local_read(&t->end_cnt);
> +
> +	s.start_cnt = e.start_cnt + 1;
> +	s.end_cnt = e.start_cnt;
> +
> +	if (!rb_time_read_cmpxchg(&t->start_cnt, e.start_cnt, s.start_cnt))
> +		return false;
> +	if (!rb_time_read_cmpxchg(&t->top, e.top, s.top))
> +		return false;
> +	if (!rb_time_read_cmpxchg(&t->bottom, e.bottom, s.bottom))
> +		return false;
> +	return rb_time_read_cmpxchg(&t->end_cnt, e.end_cnt, s.end_cnt);
> +}
> +

I have to think about this more, as I think there's a flaw in this
cmpxchg algorithm.

-- Steve


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFC][PATCH] ring-buffer: Have nested events still record running time stamp
  2020-06-26  3:35         ` Steven Rostedt
@ 2020-06-26 13:58           ` Steven Rostedt
  2020-06-26 18:13             ` Mathieu Desnoyers
  0 siblings, 1 reply; 24+ messages in thread
From: Steven Rostedt @ 2020-06-26 13:58 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: linux-kernel, Ingo Molnar, Peter Zijlstra, Thomas Gleixner,
	Masami Hiramatsu, Arnaldo Carvalho de Melo, Jiri Olsa,
	Namhyung Kim, Yordan Karadzhov, Tzvetomir Stoyanov, Tom Zanussi,
	Jason Behmer, Julia Lawall, Clark Williams, bristot,
	Daniel Wagner, Darren Hart, Jonathan Corbet, Suresh E. Warrier

On Thu, 25 Jun 2020 23:35:52 -0400
Steven Rostedt <rostedt@goodmis.org> wrote:

> I have to think about this more, as I think there's a flaw in this
> cmpxchg algorithm.

Bah, that was all wrong. I need to incorporate the seq counter into the
time stamp. Luckily I only use 59 bits for a full time stamp (2^59 ns
is about 18 years), which gives me 5 bits to play with, and all I
need is 4 (2 for top and 2 for bottom). Now I can make the timestamp 60
bits with 30 bits from the top and 30 from the bottom, using two
bits in each of those as a sequence counter.

If, after reading top and bottom, the seq bits match, then the two
halves can be put together as a valid number. Now I have this:


struct rb_time_struct {
	local_t		cnt;
	local_t		top;
	local_t		bottom;
};

typedef struct rb_time_struct rb_time_t;
	
#define RB_TIME_SHIFT	30
#define RB_TIME_VAL_MASK ((1 << RB_TIME_SHIFT) - 1)

static int rb_time_cnt(unsigned long val)
{
	return (val >> RB_TIME_SHIFT) & 3;
}

static u64 rb_time_val(unsigned long top, unsigned long bottom)
{
	u64 val;

	val = top & RB_TIME_VAL_MASK;
	val <<= RB_TIME_SHIFT;
	val |= bottom & RB_TIME_VAL_MASK;

	return val;
}

static inline bool __rb_time_read(rb_time_t *t, u64 *ret, unsigned long *cnt)
{
	unsigned long top, bottom;

	top = local_read(&t->top);
	bottom = local_read(&t->bottom);

	*cnt = rb_time_cnt(top);

	if (*cnt != rb_time_cnt(bottom))
		return false;

	*ret = rb_time_val(top, bottom);
	return true;
}

static bool rb_time_read(rb_time_t *t, u64 *ret)
{
	unsigned long cnt;

	return __rb_time_read(t, ret, &cnt);
}

static inline unsigned long rb_time_val_cnt(unsigned long val, unsigned long cnt)
{
	return (val & RB_TIME_VAL_MASK) | ((cnt & 3) << RB_TIME_SHIFT);
}

static inline void rb_time_split(u64 val, unsigned long *top, unsigned long *bottom)
{
	*top = (unsigned long)((val >> RB_TIME_SHIFT) & RB_TIME_VAL_MASK);
	*bottom = (unsigned long)(val & RB_TIME_VAL_MASK);
}

static void rb_time_val_set(local_t *t, unsigned long val, unsigned long cnt)
{
	val = rb_time_val_cnt(val, cnt);
	local_set(t, val);
}

static void rb_time_set(rb_time_t *t, u64 val)
{
	unsigned long cnt, top, bottom;

	rb_time_split(val, &top, &bottom);
	do {
		cnt = local_inc_return(&t->cnt);
		rb_time_val_set(&t->top, top, cnt);
		rb_time_val_set(&t->bottom, bottom, cnt);
	} while (cnt != local_read(&t->cnt));
}

static bool rb_time_read_cmpxchg(local_t *l, unsigned long expect, unsigned long set)
{
	unsigned long ret;

	ret = local_cmpxchg(l, expect, set);
	return ret == expect;
}

static bool rb_time_cmpxchg(rb_time_t *t, u64 expect, u64 set)
{
	unsigned long cnt, top, bottom;
	unsigned long cnt2, top2, bottom2;
	u64 val;

	 if (!__rb_time_read(t, &val, &cnt))
		 return false;

	 if (val != expect)
		 return false;

	 cnt2 = local_read(&t->cnt);
	 if (cnt2 != cnt)
		 return false;
	 cnt2++;

	 rb_time_split(val, &top, &bottom);
	 top = rb_time_val_cnt(top, cnt);
	 bottom = rb_time_val_cnt(bottom, cnt);

	 rb_time_split(set, &top2, &bottom2);
	 top2 = rb_time_val_cnt(top2, cnt2);
	 bottom2 = rb_time_val_cnt(bottom2, cnt2);
	 
	if (!rb_time_read_cmpxchg(&t->cnt, cnt, cnt2))
		return false;
	if (!rb_time_read_cmpxchg(&t->top, top, top2))
		return false;
	if (!rb_time_read_cmpxchg(&t->bottom, bottom, bottom2))
		return false;
	return true;
}

Maybe I'll implement this for BITS_PER_LONG == 32.

-- Steve

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFC][PATCH] ring-buffer: Have nested events still record running time stamp
  2020-06-26 13:58           ` Steven Rostedt
@ 2020-06-26 18:13             ` Mathieu Desnoyers
  2020-06-26 18:58               ` Steven Rostedt
  0 siblings, 1 reply; 24+ messages in thread
From: Mathieu Desnoyers @ 2020-06-26 18:13 UTC (permalink / raw)
  To: rostedt
  Cc: linux-kernel, Ingo Molnar, Peter Zijlstra, Thomas Gleixner,
	Masami Hiramatsu, Arnaldo Carvalho de Melo, Jiri Olsa,
	Namhyung Kim, Yordan Karadzhov, Tzvetomir Stoyanov, Tom Zanussi,
	Jason Behmer, Julia Lawall, Clark Williams, bristot,
	Daniel Wagner, Darren Hart, Jonathan Corbet, Suresh E. Warrier

----- On Jun 26, 2020, at 9:58 AM, rostedt rostedt@goodmis.org wrote:

> On Thu, 25 Jun 2020 23:35:52 -0400
> Steven Rostedt <rostedt@goodmis.org> wrote:
> 
>> I have to think about this more, as I think there's a flaw in this
>> cmpxchg algorithm.
> 
> Bah, that was all wrong. I need to incorporate the seq counter into the
> time stamp. Luckily I only use 59 bits for a full time stamp (2^59 ns
> is about 18 years), which gives me 5 bits to play with, and all I
> need is 4 (2 for top and 2 for bottom). Now I can make the timestamp 60
> bits with 30 bits from the top and 30 from the bottom, using two
> bits in each of those as a sequence counter.
> 
> If, after reading top and bottom, the seq bits match, then the two
> halves can be put together as a valid number. Now I have this:

[...]
> 
> static inline bool __rb_time_read(rb_time_t *t, u64 *ret, unsigned long *cnt)
> {
>	unsigned long top, bottom;
> 
>	top = local_read(&t->top);
>	bottom = local_read(&t->bottom);
> 
>	*cnt = rb_time_cnt(top);
> 
>	if (*cnt != rb_time_cnt(bottom))
>		return false;
> 
>	*ret = rb_time_val(top, bottom);
>	return true;
> }

If __rb_time_read or rb_time_cmpxchg are used in an interrupt over
rb_time_set (between setting top and bottom), those will never succeed.

How is this case handled ?

Thanks,

Mathieu

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFC][PATCH] ring-buffer: Have nested events still record running time stamp
  2020-06-26 18:13             ` Mathieu Desnoyers
@ 2020-06-26 18:58               ` Steven Rostedt
  2020-06-26 19:39                 ` Steven Rostedt
  0 siblings, 1 reply; 24+ messages in thread
From: Steven Rostedt @ 2020-06-26 18:58 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: linux-kernel, Ingo Molnar, Peter Zijlstra, Thomas Gleixner,
	Masami Hiramatsu, Arnaldo Carvalho de Melo, Jiri Olsa,
	Namhyung Kim, Yordan Karadzhov, Tzvetomir Stoyanov, Tom Zanussi,
	Jason Behmer, Julia Lawall, Clark Williams, bristot,
	Daniel Wagner, Darren Hart, Jonathan Corbet, Suresh E. Warrier

On Fri, 26 Jun 2020 14:13:50 -0400 (EDT)
Mathieu Desnoyers <mathieu.desnoyers@efficios.com> wrote:

> > 
> > static inline bool __rb_time_read(rb_time_t *t, u64 *ret, unsigned long *cnt)
> > {
> >	unsigned long top, bottom;
> > 
> >	top = local_read(&t->top);
> >	bottom = local_read(&t->bottom);
> > 
> >	*cnt = rb_time_cnt(top);
> > 
> >	if (*cnt != rb_time_cnt(bottom))
> >		return false;
> > 
> >	*ret = rb_time_val(top, bottom);
> >	return true;
> > }  

BTW, I've decided to include this in my final patch series which I'm
finishing up now.

> 
> If __rb_time_read or rb_time_cmpxchg are used in an interrupt over
> rb_time_set (between setting top and bottom), those will never succeed.
> 
> How is this case handled ?
> 

The three places using cmpxchg:

In the discard code. This is only called when nesting has already
happened (as it uses per_cpu buffers first, since they are faster),
thus it requires a double nest:


	if (bpage->page == (void *)addr && rb_page_write(bpage) == old_index) {
		unsigned long write_mask =
			local_read(&bpage->write) & ~RB_WRITE_MASK;
		unsigned long event_length = rb_event_length(event);

		/* Something came in, can't discard */
		if (!rb_time_cmpxchg(&cpu_buffer->write_stamp,
				     write_stamp, write_stamp - delta))
			return 0;


It fails to do the discard and fills the buffer with "padding".

Second usage:

		/* SLOW PATH - Interrupted between A and C */
		a_ok = rb_time_read(&cpu_buffer->write_stamp, &after);
		ts = rb_time_stamp(cpu_buffer->buffer);
		barrier();
 /*E*/		if (write == (local_read(&tail_page->write) & RB_WRITE_MASK) &&
		    a_ok && after < ts) {
			/* Nothing came after this event between C and E */
			info->delta = ts - after;
			(void)rb_time_cmpxchg(&cpu_buffer->write_stamp, after, info->ts);
			info->ts = ts;
		} else {
			info->delta = 0;

Not only did this event interrupt something, but something interrupted
this event. Punt and just use a delta of zero.

Third usage:

		if (unlikely(info->ts != save_before)) {
			/* SLOW PATH - Interrupted between C and E */

			a_ok = rb_time_read(&cpu_buffer->write_stamp, &after);
			RB_WARN_ON(cpu_buffer, !a_ok);

			/* Write stamp must only go forward */
			if (save_before > after) {
				/*
				 * We do not care about the result, only that
				 * it gets updated atomically.
				 */
				(void)rb_time_cmpxchg(&cpu_buffer->write_stamp, after, save_before);
			}
		}

This is where we want to update the write stamp to the latest value,
which this event already wrote to the write_stamp. Notice there's even
a warning that will cause the ring buffer to crash (fail) if it's not
OK, which I triggered when I forgot to handle the case where reading
could get corrupted by being interrupted. I fixed that soon after ;-)

As for the save_before: this event writes to the before_stamp after
reading it, so the before_stamp had better be OK (and there's a warning
there too).

Here are the read locations:

In the discard as well. If it can't read it, it will just fill with padding.

	if (!rb_time_read(&cpu_buffer->write_stamp, &write_stamp))
		return 0;


On reserving the event:

	barrier();
	b_ok = rb_time_read(&cpu_buffer->before_stamp, &before);
	a_ok = rb_time_read(&cpu_buffer->write_stamp, &after);
	barrier();

[..]
	if (unlikely(!a_ok || !b_ok || (before != after && w)))
		info->add_timestamp |= RB_ADD_STAMP_FORCE | RB_ADD_STAMP_EXTEND;

It treats these failed reads as interrupting the event and uses the
full time stamp. Thanks to your suggestion about combining the case
where I went back to the delta, I don't need these values anymore.

	/* See if we shot pass the end of this buffer page */
	if (unlikely(write > BUF_PAGE_SIZE)) {
		if (tail != w) {
			/* before and after may now different, fix it up*/
			b_ok = rb_time_read(&cpu_buffer->before_stamp, &before);
			a_ok = rb_time_read(&cpu_buffer->write_stamp, &after);
			if (a_ok && b_ok && before != after)
				(void)rb_time_cmpxchg(&cpu_buffer->before_stamp, before, after);
		}
		return rb_move_tail(cpu_buffer, tail, info);
	}

This is when we overrun the sub buffer. This is more of a clean up,
because I found that since we always write to before_stamp, and then
go back to the page, it triggered them as being different. The only
reason we want them to be the same is to prevent an unnecessary time
stamp at the beginning of the page, which already stores a full time
stamp.

 /*E*/		s_ok = rb_time_read(&cpu_buffer->before_stamp, &save_before);
		RB_WARN_ON(cpu_buffer, !s_ok);

As mentioned, this before_stamp was written to by this event, and is
now used to know if it was interrupted since.

I think I hit all the locations, and that is how it handles not being
able to read it. The important thing is, it must be able to read it if
it was written by itself or an event that interrupted it. And it must
be able to know if it is invalid, and that can only happen if the event
interrupted the writing of it.

Cheers!

-- Steve

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFC][PATCH] ring-buffer: Have nested events still record running time stamp
  2020-06-26 18:58               ` Steven Rostedt
@ 2020-06-26 19:39                 ` Steven Rostedt
  0 siblings, 0 replies; 24+ messages in thread
From: Steven Rostedt @ 2020-06-26 19:39 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: linux-kernel, Ingo Molnar, Peter Zijlstra, Thomas Gleixner,
	Masami Hiramatsu, Arnaldo Carvalho de Melo, Jiri Olsa,
	Namhyung Kim, Yordan Karadzhov, Tzvetomir Stoyanov, Tom Zanussi,
	Jason Behmer, Julia Lawall, Clark Williams, bristot,
	Daniel Wagner, Darren Hart, Jonathan Corbet, Suresh E. Warrier

On Fri, 26 Jun 2020 14:58:19 -0400
Steven Rostedt <rostedt@goodmis.org> wrote:

> Second usage:
> 
> 		/* SLOW PATH - Interrupted between A and C */
> 		a_ok = rb_time_read(&cpu_buffer->write_stamp, &after);
> 		ts = rb_time_stamp(cpu_buffer->buffer);
> 		barrier();
>  /*E*/		if (write == (local_read(&tail_page->write) & RB_WRITE_MASK) &&
> 		    a_ok && after < ts) {
> 			/* Nothing came after this event between C and E */
> 			info->delta = ts - after;
> 			(void)rb_time_cmpxchg(&cpu_buffer->write_stamp, after, info->ts);
> 			info->ts = ts;
> 		} else {
> 			info->delta = 0;

Actually, I don't think a_ok can ever be false here. An uninterrupted
event will leave with both before_stamp and write_stamp valid, as an
uninterrupted event will write to both (and an rb_time_t is only
invalid from reading an interrupted update).

To get to this path we have:

	w = local_read(write_tail);

	<--- Interrupt event (makes write_stamp valid!)

	write = local_add_return(write_tail, length);

	w != write - length;


	[..]

	a_ok = rb_time_read(write_stamp);

Must always be valid!

-- Steve

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFC][PATCH] ring-buffer: Have nested events still record running time stamp
  2020-06-25 19:35     ` Mathieu Desnoyers
  2020-06-25 19:58       ` Steven Rostedt
  2020-06-26  2:36       ` Steven Rostedt
@ 2020-06-30  0:21       ` Steven Rostedt
  2020-06-30  3:13         ` Mathieu Desnoyers
  2 siblings, 1 reply; 24+ messages in thread
From: Steven Rostedt @ 2020-06-30  0:21 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: linux-kernel, Ingo Molnar, Peter Zijlstra, Thomas Gleixner,
	Masami Hiramatsu, Arnaldo Carvalho de Melo, Jiri Olsa,
	Namhyung Kim, Yordan Karadzhov, Tzvetomir Stoyanov, Tom Zanussi,
	Jason Behmer, Julia Lawall, Clark Williams, bristot,
	Daniel Wagner, Darren Hart, Jonathan Corbet, Suresh E. Warrier,
	Joel Fernandes

On Thu, 25 Jun 2020 15:35:02 -0400 (EDT)
Mathieu Desnoyers <mathieu.desnoyers@efficios.com> wrote:

> >> So the reservation is not "just" an add instruction, it's actually an
> >> xadd on x86. Is that really faster than a cmpxchg ?  
> > 
> > I believe the answer is still yes. But I can run some benchmarks to
> > make sure.  
> 
> This would be interesting to see, because if xadd and cmpxchg have
> similar overhead, then going for a cmpxchg-loop for the space
> reservation could vastly decrease the overall complexity of this
> timestamp+space reservation algorithm.

So I tested this on three different machines to see how it compared. I
applied this patch:

diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
index c2c0d25ea004..8a7d62375eb3 100644
--- a/kernel/trace/ring_buffer.c
+++ b/kernel/trace/ring_buffer.c
@@ -3161,7 +3161,19 @@ __rb_reserve_next(struct ring_buffer_per_cpu *cpu_buffer,
 
  /*B*/	rb_time_set(&cpu_buffer->before_stamp, info->ts);
 
+#if 0
  /*C*/	write = local_add_return(info->length, &tail_page->write);
+#else
+	{
+		unsigned long expect, result;
+
+		do {
+			expect = local_read(&tail_page->write);
+			write = expect + info->length;
+			result = local_cmpxchg(&tail_page->write, expect, write);
+		} while (expect != result);
+	}
+#endif
 
 	/* set write to only the index of the write */
 	write &= RB_WRITE_MASK;


And tested before and after the results.

To test, I used my fedora 31 kernel config, and enabled
CONFIG_TRACEPOINT_BENCHMARK.

When that trace event is enabled, a kthread is kicked off that runs in
a loop (see kernel/trace/trace_benchmark.c) that basically does:

	local_irq_disable();
	start = trace_clock_local();
	trace_benchmark_event();
	end = trace_clock_local();
	local_irq_enable();

It takes the delta of end - start. As the first time through is cold
cache, it records that separately; after that it creates a string that
lists the delta of the last event recorded (as it can't compute the
delta before it records the end time stamp), the first time stamp (the
cold cache one), the max, min, and running average. It also records a
running standard deviation. All numbers are in nanoseconds.

The test was simply doing:

 # trace-cmd record -e benchmark:benchmark_event -o trace-tp-bench-{inc|cmpxchg}.dat sleep 3
 # trace-cmd report trace-tp-bench-{inc|cmpxchg}.dat | tail

The reason for the tail is to let the machine settle; as x86 has all
sorts of crazy optimizations, there's a lot of noise in the beginning.

Anyway, here are the results from my three machines, and feel free to
test this out yourself.

Going from my fastest to slowest machines:

Machine 1:

 model name	: Intel(R) Core(TM) i7-3770 CPU @ 3.40GHz
 (4 core / 8 logical)

# trace-cmd report trace-tp-bench-inc.dat |tail 
 event_benchmark-2072  [003]  2234.664661 benchmark_event:      last=92 first=2794 max=11707 min=81 avg=93 std=49 std^2=2469
 event_benchmark-2072  [003]  2234.664662 benchmark_event:      last=92 first=2794 max=11707 min=81 avg=93 std=49 std^2=2469
 event_benchmark-2072  [003]  2234.664662 benchmark_event:      last=88 first=2794 max=11707 min=81 avg=93 std=49 std^2=2469
 event_benchmark-2072  [003]  2234.664663 benchmark_event:      last=85 first=2794 max=11707 min=81 avg=93 std=49 std^2=2469
 event_benchmark-2072  [003]  2234.664663 benchmark_event:      last=89 first=2794 max=11707 min=81 avg=93 std=49 std^2=2469
 event_benchmark-2072  [003]  2234.664664 benchmark_event:      last=86 first=2794 max=11707 min=81 avg=93 std=49 std^2=2469
 event_benchmark-2072  [003]  2234.664664 benchmark_event:      last=88 first=2794 max=11707 min=81 avg=93 std=49 std^2=2469
 event_benchmark-2072  [003]  2234.664665 benchmark_event:      last=171 first=2794 max=11707 min=81 avg=93 std=49 std^2=2469
 event_benchmark-2072  [003]  2234.664665 benchmark_event:      last=88 first=2794 max=11707 min=81 avg=93 std=49 std^2=2469
 event_benchmark-2072  [003]  2234.664666 benchmark_event:      last=87 first=2794 max=11707 min=81 avg=93 std=49 std^2=2469


# trace-cmd report trace-tp-bench-cmpxchg.dat |tail 
 event_benchmark-1924  [000]   499.515711 benchmark_event:      last=309 first=2955 max=10636 min=81 avg=93 std=73 std^2=5367
 event_benchmark-1924  [000]   499.515711 benchmark_event:      last=96 first=2955 max=10636 min=81 avg=93 std=73 std^2=5367
 event_benchmark-1924  [000]   499.515712 benchmark_event:      last=189 first=2955 max=10636 min=81 avg=93 std=73 std^2=5367
 event_benchmark-1924  [000]   499.515713 benchmark_event:      last=134 first=2955 max=10636 min=81 avg=93 std=73 std^2=5367
 event_benchmark-1924  [000]   499.515713 benchmark_event:      last=92 first=2955 max=10636 min=81 avg=93 std=73 std^2=5367
 event_benchmark-1924  [000]   499.515714 benchmark_event:      last=93 first=2955 max=10636 min=81 avg=93 std=73 std^2=5367
 event_benchmark-1924  [000]   499.515714 benchmark_event:      last=92 first=2955 max=10636 min=81 avg=93 std=73 std^2=5367
 event_benchmark-1924  [000]   499.515715 benchmark_event:      last=89 first=2955 max=10636 min=81 avg=93 std=73 std^2=5367
 event_benchmark-1924  [000]   499.515715 benchmark_event:      last=86 first=2955 max=10636 min=81 avg=93 std=73 std^2=5367
 event_benchmark-1924  [000]   499.515716 benchmark_event:      last=107 first=2955 max=10636 min=81 avg=93 std=73 std^2=5367


Machine 2:

model name	: Intel(R) Core(TM) i3-4130 CPU @ 3.40GHz
 (2 core / 4 logical)

# trace-cmd report trace-tp-bench-inc.dat |tail 
 event_benchmark-3179  [001]   606.466907: benchmark_event:      last=132 first=1463 max=1938 min=115 avg=140 std=36 std^2=1341
 event_benchmark-3179  [001]   606.466907: benchmark_event:      last=133 first=1463 max=1938 min=115 avg=140 std=36 std^2=1341
 event_benchmark-3179  [001]   606.466908: benchmark_event:      last=136 first=1463 max=1938 min=115 avg=140 std=36 std^2=1341
 event_benchmark-3179  [001]   606.466909: benchmark_event:      last=135 first=1463 max=1938 min=115 avg=140 std=36 std^2=1341
 event_benchmark-3179  [001]   606.466909: benchmark_event:      last=135 first=1463 max=1938 min=115 avg=140 std=36 std^2=1341
 event_benchmark-3179  [001]   606.466910: benchmark_event:      last=133 first=1463 max=1938 min=115 avg=140 std=36 std^2=1341
 event_benchmark-3179  [001]   606.466911: benchmark_event:      last=131 first=1463 max=1938 min=115 avg=140 std=36 std^2=1341
 event_benchmark-3179  [001]   606.466912: benchmark_event:      last=132 first=1463 max=1938 min=115 avg=140 std=36 std^2=1341
 event_benchmark-3179  [001]   606.466913: benchmark_event:      last=134 first=1463 max=1938 min=115 avg=140 std=36 std^2=1341
 event_benchmark-3179  [001]   606.466913: benchmark_event:      last=223 first=1463 max=1938 min=115 avg=140 std=36 std^2=1341


# trace-cmd report trace-tp-bench-cmpxchg.dat |tail 
 event_benchmark-3169  [001]   477.024485: benchmark_event:      last=102 first=1820 max=12210 min=95 avg=266 std=75 std^2=5768
 event_benchmark-3169  [001]   477.024486: benchmark_event:      last=103 first=1820 max=12210 min=95 avg=266 std=75 std^2=5768
 event_benchmark-3169  [001]   477.024486: benchmark_event:      last=131 first=1820 max=12210 min=95 avg=266 std=75 std^2=5768
 event_benchmark-3169  [001]   477.024487: benchmark_event:      last=104 first=1820 max=12210 min=95 avg=266 std=75 std^2=5768
 event_benchmark-3169  [001]   477.024487: benchmark_event:      last=181 first=1820 max=12210 min=95 avg=266 std=75 std^2=5768
 event_benchmark-3169  [001]   477.024488: benchmark_event:      last=195 first=1820 max=12210 min=95 avg=266 std=75 std^2=5768
 event_benchmark-3169  [001]   477.024489: benchmark_event:      last=155 first=1820 max=12210 min=95 avg=266 std=75 std^2=5768
 event_benchmark-3169  [001]   477.024489: benchmark_event:      last=102 first=1820 max=12210 min=95 avg=266 std=75 std^2=5768
 event_benchmark-3169  [001]   477.024490: benchmark_event:      last=166 first=1820 max=12210 min=95 avg=266 std=75 std^2=5768
 event_benchmark-3169  [001]   477.024490: benchmark_event:      last=256 first=1820 max=12210 min=95 avg=266 std=75 std^2=5768


Machine 3:

model name	: Intel(R) Core(TM)2 Quad  CPU   Q9450  @ 2.66GHz
 (4 cores / no hyperthreading)

# trace-cmd report trace-tp-bench-inc.dat| tail
 event_benchmark-2686  [000]   860.217674: benchmark_event:      last=195 first=2527 max=4571 min=186 avg=264 std=144 std^2=20943
 event_benchmark-2686  [000]   860.217675: benchmark_event:      last=196 first=2527 max=4571 min=186 avg=264 std=144 std^2=20943
 event_benchmark-2686  [000]   860.217676: benchmark_event:      last=196 first=2527 max=4571 min=186 avg=264 std=144 std^2=20943
 event_benchmark-2686  [000]   860.217677: benchmark_event:      last=193 first=2527 max=4571 min=186 avg=264 std=144 std^2=20943
 event_benchmark-2686  [000]   860.217679: benchmark_event:      last=195 first=2527 max=4571 min=186 avg=264 std=144 std^2=20943
 event_benchmark-2686  [000]   860.217680: benchmark_event:      last=195 first=2527 max=4571 min=186 avg=264 std=144 std^2=20943
 event_benchmark-2686  [000]   860.217681: benchmark_event:      last=195 first=2527 max=4571 min=186 avg=264 std=144 std^2=20943
 event_benchmark-2686  [000]   860.217682: benchmark_event:      last=193 first=2527 max=4571 min=186 avg=264 std=144 std^2=20943
 event_benchmark-2686  [000]   860.217683: benchmark_event:      last=196 first=2527 max=4571 min=186 avg=264 std=144 std^2=20943
 event_benchmark-2686  [000]   860.217684: benchmark_event:      last=195 first=2527 max=4571 min=186 avg=264 std=144 std^2=20943


# trace-cmd report trace-tp-bench-cmpxchg.dat| tail
 event_benchmark-2393  [003]   380.342603: benchmark_event:      last=199 first=2707 max=3215 min=189 avg=297 std=163 std^2=26612
 event_benchmark-2393  [003]   380.342604: benchmark_event:      last=196 first=2707 max=3215 min=189 avg=297 std=163 std^2=26612
 event_benchmark-2393  [003]   380.342606: benchmark_event:      last=201 first=2707 max=3215 min=189 avg=297 std=163 std^2=26612
 event_benchmark-2393  [003]   380.342607: benchmark_event:      last=195 first=2707 max=3215 min=189 avg=297 std=163 std^2=26612
 event_benchmark-2393  [003]   380.342608: benchmark_event:      last=195 first=2707 max=3215 min=189 avg=297 std=163 std^2=26612
 event_benchmark-2393  [003]   380.342609: benchmark_event:      last=198 first=2707 max=3215 min=189 avg=297 std=163 std^2=26612
 event_benchmark-2393  [003]   380.342610: benchmark_event:      last=199 first=2707 max=3215 min=189 avg=297 std=163 std^2=26612
 event_benchmark-2393  [003]   380.342611: benchmark_event:      last=196 first=2707 max=3215 min=189 avg=297 std=163 std^2=26612
 event_benchmark-2393  [003]   380.342613: benchmark_event:      last=196 first=2707 max=3215 min=189 avg=297 std=163 std^2=26612
 event_benchmark-2393  [003]   380.342614: benchmark_event:      last=199 first=2707 max=3215 min=189 avg=297 std=163 std^2=26612


Not that big of a difference, but there is one. And the standard
deviation appears to be better, which would make sense, as
local_add_return() needs no retry on collision the way a cmpxchg loop
does.
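
(To make the comparison concrete: a minimal sketch of the two
reservation strategies, built on the kernel's local_t primitives from
<asm/local.h>. The helper functions are illustrative only, not the
actual ring-buffer code.)

	/* xadd path: one atomic op, never retries */
	static unsigned long reserve_xadd(local_t *write_tail,
					  unsigned long length)
	{
		/*
		 * local_add_return() is a single atomic xadd on x86: the
		 * returned value is unique to this caller, so a nested
		 * event that interrupts us simply gets the next range.
		 */
		return local_add_return(length, write_tail);
	}

	/* cmpxchg path: must loop whenever a nested event wins the race */
	static unsigned long reserve_cmpxchg(local_t *write_tail,
					     unsigned long length)
	{
		unsigned long expect, write, result;

		do {
			expect = local_read(write_tail);
			write = expect + length;
			/*
			 * If a nested event updated write_tail between the
			 * read and the cmpxchg, result != expect and we
			 * retry; those retries are what widen the deviation.
			 */
			result = local_cmpxchg(write_tail, expect, write);
		} while (result != expect);

		return write;
	}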

-- Steve

^ permalink raw reply related	[flat|nested] 24+ messages in thread

* Re: [RFC][PATCH] ring-buffer: Have nested events still record running time stamp
  2020-06-30  0:21       ` Steven Rostedt
@ 2020-06-30  3:13         ` Mathieu Desnoyers
  2020-06-30  3:26           ` Steven Rostedt
  0 siblings, 1 reply; 24+ messages in thread
From: Mathieu Desnoyers @ 2020-06-30  3:13 UTC (permalink / raw)
  To: rostedt
  Cc: linux-kernel, Ingo Molnar, Peter Zijlstra, Thomas Gleixner,
	Masami Hiramatsu, Arnaldo Carvalho de Melo, Jiri Olsa,
	Namhyung Kim, Yordan Karadzhov, Tzvetomir Stoyanov, Tom Zanussi,
	Jason Behmer, Julia Lawall, Clark Williams, bristot,
	Daniel Wagner, Darren Hart, Jonathan Corbet, Suresh E. Warrier,
	Joel Fernandes, Google

----- On Jun 29, 2020, at 8:21 PM, rostedt rostedt@goodmis.org wrote:

> On Thu, 25 Jun 2020 15:35:02 -0400 (EDT)
> Mathieu Desnoyers <mathieu.desnoyers@efficios.com> wrote:
> 
>> >> So the reservation is not "just" an add instruction, it's actually an
>> >> xadd on x86. Is that really faster than a cmpxchg?
>> > 
>> > I believe the answer is still yes. But I can run some benchmarks to
>> > make sure.
>> 
>> This would be interesting to see, because if xadd and cmpxchg have
>> similar overhead, then going for a cmpxchg-loop for the space
>> reservation could vastly decrease the overall complexity of this
>> timestamp+space reservation algorithm.
> 
> So I tested this on three different machines to see how it compared. I
> applied this patch:
> 
> diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
> index c2c0d25ea004..8a7d62375eb3 100644
> --- a/kernel/trace/ring_buffer.c
> +++ b/kernel/trace/ring_buffer.c
> @@ -3161,7 +3161,19 @@ __rb_reserve_next(struct ring_buffer_per_cpu *cpu_buffer,
> 
>  /*B*/	rb_time_set(&cpu_buffer->before_stamp, info->ts);
> 
> +#if 0
>  /*C*/	write = local_add_return(info->length, &tail_page->write);
> +#else
> +	{
> +		unsigned long expect, result;
> +
> +		do {
> +			expect = local_read(&tail_page->write);
> +			write = expect + info->length;
> +			result = local_cmpxchg(&tail_page->write, expect, write);
> +		} while (expect != result);
> +	}
> +#endif
> 
> 	/* set write to only the index of the write */
> 	write &= RB_WRITE_MASK;
> 
> 
> And compared the results before and after.

[...]
> 
> Not that big of a difference, but there is one. And the standard
> deviation appears to be better, which would make sense, as
> local_add_return() needs no retry on collision the way a cmpxchg loop
> does.

I tried running your recommended test on my machine many times. It
appears that the avg and std dev results vary a lot between runs of the
exact same configuration:

Model name:          Intel(R) Xeon(R) CPU E5-2630 v3 @ 2.40GHz

* With xadd (existing code):

 event_benchmark-1828  [016]    71.309475: benchmark_event:      last=190 first=4200 max=234226 min=185 avg=256 std=333 std^2=111437
 event_benchmark-1828  [016]    71.309476: benchmark_event:      last=192 first=4200 max=234226 min=185 avg=256 std=333 std^2=111437
 event_benchmark-1828  [016]    71.309476: benchmark_event:      last=195 first=4200 max=234226 min=185 avg=256 std=333 std^2=111437
 event_benchmark-1828  [016]    71.309477: benchmark_event:      last=193 first=4200 max=234226 min=185 avg=256 std=333 std^2=111437
 event_benchmark-1828  [016]    71.309478: benchmark_event:      last=190 first=4200 max=234226 min=185 avg=256 std=333 std^2=111437
 event_benchmark-1828  [016]    71.309479: benchmark_event:      last=190 first=4200 max=234226 min=185 avg=256 std=333 std^2=111437
 event_benchmark-1828  [016]    71.309479: benchmark_event:      last=192 first=4200 max=234226 min=185 avg=256 std=333 std^2=111437
 event_benchmark-1828  [016]    71.309480: benchmark_event:      last=192 first=4200 max=234226 min=185 avg=256 std=333 std^2=111437
 event_benchmark-1828  [016]    71.309481: benchmark_event:      last=195 first=4200 max=234226 min=185 avg=256 std=333 std^2=111437
 event_benchmark-1828  [016]    71.309481: benchmark_event:      last=193 first=4200 max=234226 min=185 avg=256 std=333 std^2=111437


 event_benchmark-1865  [003]   162.955546: benchmark_event:      last=489 first=4386 max=29000 min=185 avg=243 std=161 std^2=26057
 event_benchmark-1865  [003]   162.955546: benchmark_event:      last=205 first=4386 max=29000 min=185 avg=243 std=161 std^2=26057
 event_benchmark-1865  [003]   162.955547: benchmark_event:      last=204 first=4386 max=29000 min=185 avg=243 std=161 std^2=26057
 event_benchmark-1865  [003]   162.955548: benchmark_event:      last=204 first=4386 max=29000 min=185 avg=243 std=161 std^2=26057
 event_benchmark-1865  [003]   162.955549: benchmark_event:      last=203 first=4386 max=29000 min=185 avg=243 std=161 std^2=26057
 event_benchmark-1865  [003]   162.955549: benchmark_event:      last=206 first=4386 max=29000 min=185 avg=243 std=161 std^2=26057
 event_benchmark-1865  [003]   162.955550: benchmark_event:      last=203 first=4386 max=29000 min=185 avg=243 std=161 std^2=26057
 event_benchmark-1865  [003]   162.955551: benchmark_event:      last=205 first=4386 max=29000 min=185 avg=243 std=161 std^2=26057
 event_benchmark-1865  [003]   162.955552: benchmark_event:      last=205 first=4386 max=29000 min=185 avg=243 std=161 std^2=26057
 event_benchmark-1865  [003]   162.955552: benchmark_event:      last=205 first=4386 max=29000 min=185 avg=243 std=161 std^2=26057

 event_benchmark-1904  [003]   202.847955: benchmark_event:      last=205 first=5230 max=27647 min=185 avg=239 std=150 std^2=22605
 event_benchmark-1904  [003]   202.847956: benchmark_event:      last=500 first=5230 max=27647 min=185 avg=239 std=150 std^2=22605
 event_benchmark-1904  [003]   202.847957: benchmark_event:      last=355 first=5230 max=27647 min=185 avg=239 std=150 std^2=22605
 event_benchmark-1904  [003]   202.847958: benchmark_event:      last=340 first=5230 max=27647 min=185 avg=239 std=150 std^2=22605
 event_benchmark-1904  [003]   202.847959: benchmark_event:      last=462 first=5230 max=27647 min=185 avg=239 std=150 std^2=22605
 event_benchmark-1904  [003]   202.847960: benchmark_event:      last=278 first=5230 max=27647 min=185 avg=239 std=150 std^2=22605
 event_benchmark-1904  [003]   202.847961: benchmark_event:      last=243 first=5230 max=27647 min=185 avg=239 std=150 std^2=22605
 event_benchmark-1904  [003]   202.847962: benchmark_event:      last=205 first=5230 max=27647 min=185 avg=239 std=150 std^2=22605
 event_benchmark-1904  [003]   202.847963: benchmark_event:      last=470 first=5230 max=27647 min=185 avg=239 std=150 std^2=22605
 event_benchmark-1904  [003]   202.847964: benchmark_event:      last=364 first=5230 max=27647 min=185 avg=239 std=150 std^2=22605

 event_benchmark-1941  [003]   234.577797: benchmark_event:      last=204 first=5117 max=17387 min=185 avg=233 std=141 std^2=19989
 event_benchmark-1941  [003]   234.577798: benchmark_event:      last=445 first=5117 max=17387 min=185 avg=233 std=141 std^2=19989
 event_benchmark-1941  [003]   234.577799: benchmark_event:      last=693 first=5117 max=17387 min=185 avg=233 std=141 std^2=19989
 event_benchmark-1941  [003]   234.577800: benchmark_event:      last=293 first=5117 max=17387 min=185 avg=233 std=141 std^2=19989
 event_benchmark-1941  [003]   234.577801: benchmark_event:      last=205 first=5117 max=17387 min=185 avg=233 std=141 std^2=19989
 event_benchmark-1941  [003]   234.577802: benchmark_event:      last=202 first=5117 max=17387 min=185 avg=233 std=141 std^2=19989
 event_benchmark-1941  [003]   234.577802: benchmark_event:      last=203 first=5117 max=17387 min=185 avg=233 std=141 std^2=19989
 event_benchmark-1941  [003]   234.577803: benchmark_event:      last=204 first=5117 max=17387 min=185 avg=233 std=141 std^2=19989
 event_benchmark-1941  [003]   234.577804: benchmark_event:      last=204 first=5117 max=17387 min=185 avg=233 std=141 std^2=19989
 event_benchmark-1941  [003]   234.577805: benchmark_event:      last=205 first=5117 max=17387 min=185 avg=233 std=141 std^2=19989


 event_benchmark-1978  [003]   279.276351: benchmark_event:      last=205 first=4583 max=17638 min=185 avg=241 std=163 std^2=26885
 event_benchmark-1978  [003]   279.276352: benchmark_event:      last=203 first=4583 max=17638 min=185 avg=241 std=163 std^2=26885
 event_benchmark-1978  [003]   279.276353: benchmark_event:      last=207 first=4583 max=17638 min=185 avg=241 std=163 std^2=26885
 event_benchmark-1978  [003]   279.276354: benchmark_event:      last=207 first=4583 max=17638 min=185 avg=241 std=163 std^2=26885
 event_benchmark-1978  [003]   279.276355: benchmark_event:      last=205 first=4583 max=17638 min=185 avg=241 std=163 std^2=26885
 event_benchmark-1978  [003]   279.276355: benchmark_event:      last=205 first=4583 max=17638 min=185 avg=241 std=163 std^2=26885
 event_benchmark-1978  [003]   279.276356: benchmark_event:      last=204 first=4583 max=17638 min=185 avg=241 std=163 std^2=26885
 event_benchmark-1978  [003]   279.276357: benchmark_event:      last=206 first=4583 max=17638 min=185 avg=241 std=163 std^2=26885
 event_benchmark-1978  [003]   279.276358: benchmark_event:      last=286 first=4583 max=17638 min=185 avg=241 std=163 std^2=26885
 event_benchmark-1978  [003]   279.276359: benchmark_event:      last=202 first=4583 max=17638 min=185 avg=241 std=163 std^2=26885


 event_benchmark-2015  [003]   309.974432: benchmark_event:      last=202 first=5090 max=22623 min=185 avg=241 std=152 std^2=23323
 event_benchmark-2015  [003]   309.974433: benchmark_event:      last=204 first=5090 max=22623 min=185 avg=241 std=152 std^2=23323
 event_benchmark-2015  [003]   309.974434: benchmark_event:      last=205 first=5090 max=22623 min=185 avg=241 std=152 std^2=23323
 event_benchmark-2015  [003]   309.974435: benchmark_event:      last=206 first=5090 max=22623 min=185 avg=241 std=152 std^2=23323
 event_benchmark-2015  [003]   309.974435: benchmark_event:      last=205 first=5090 max=22623 min=185 avg=241 std=152 std^2=23323
 event_benchmark-2015  [003]   309.974436: benchmark_event:      last=203 first=5090 max=22623 min=185 avg=241 std=152 std^2=23323
 event_benchmark-2015  [003]   309.974437: benchmark_event:      last=205 first=5090 max=22623 min=185 avg=241 std=152 std^2=23323
 event_benchmark-2015  [003]   309.974438: benchmark_event:      last=275 first=5090 max=22623 min=185 avg=241 std=152 std^2=23323
 event_benchmark-2015  [003]   309.974439: benchmark_event:      last=205 first=5090 max=22623 min=185 avg=241 std=152 std^2=23323
 event_benchmark-2015  [003]   309.974440: benchmark_event:      last=600 first=5090 max=22623 min=185 avg=241 std=152 std^2=23323

-> With this configuration, the avg varies between 233 and 256 ns across
runs of your benchmark, and the std. dev. varies between 141 and 333 ns.
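
(Note that avg, std and std^2 in these lines are cumulative over the
whole run, which is why one early burst of slow events, such as the
max=234226 outlier in the first run above, inflates std for every later
line. A minimal sketch of this kind of running accumulation follows; the
exact arithmetic in kernel/trace/trace_benchmark.c may differ:)

	/* u64 as in the kernel's <linux/types.h> */
	struct bm_stats {
		u64 count;
		u64 total;		/* sum of deltas */
		u64 total2;		/* sum of squared deltas */
		u64 first, last, min, max;
	};

	static void bm_record(struct bm_stats *s, u64 delta)
	{
		s->last = delta;
		if (!s->count++) {
			s->first = delta;
			s->min = s->max = delta;
		}
		s->total += delta;
		s->total2 += delta * delta;
		if (delta < s->min)
			s->min = delta;
		if (delta > s->max)
			s->max = delta;
		/*
		 * Reporting then derives:
		 *   avg   = total / count
		 *   std^2 = total2 / count - avg * avg    (the variance)
		 *   std   = integer square root of std^2
		 * e.g. std=333 for std^2=111437 in the first run above.
		 */
	}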


* With your cmpxchg modification:

 event_benchmark-1256  [010]    56.943734: benchmark_event:      last=210 first=5210 max=83243 min=187 avg=265 std=201 std^2=40524
 event_benchmark-1256  [010]    56.943736: benchmark_event:      last=208 first=5210 max=83243 min=187 avg=265 std=201 std^2=40524
 event_benchmark-1256  [010]    56.943737: benchmark_event:      last=1223 first=5210 max=83243 min=187 avg=265 std=201 std^2=40524
 event_benchmark-1256  [010]    56.943738: benchmark_event:      last=676 first=5210 max=83243 min=187 avg=265 std=201 std^2=40524
 event_benchmark-1256  [010]    56.943739: benchmark_event:      last=337 first=5210 max=83243 min=187 avg=265 std=201 std^2=40524
 event_benchmark-1256  [010]    56.943740: benchmark_event:      last=455 first=5210 max=83243 min=187 avg=265 std=201 std^2=40524
 event_benchmark-1256  [010]    56.943741: benchmark_event:      last=280 first=5210 max=83243 min=187 avg=265 std=201 std^2=40524
 event_benchmark-1256  [010]    56.943742: benchmark_event:      last=246 first=5210 max=83243 min=187 avg=265 std=201 std^2=40524
 event_benchmark-1256  [010]    56.943743: benchmark_event:      last=205 first=5210 max=83243 min=187 avg=265 std=201 std^2=40524
 event_benchmark-1256  [010]    56.943743: benchmark_event:      last=207 first=5210 max=83243 min=187 avg=265 std=201 std^2=40524

 event_benchmark-1293  [025]    84.352468: benchmark_event:      last=198 first=5446 max=25969 min=188 avg=241 std=168 std^2=28469
 event_benchmark-1293  [025]    84.352469: benchmark_event:      last=198 first=5446 max=25969 min=188 avg=241 std=168 std^2=28469
 event_benchmark-1293  [025]    84.352470: benchmark_event:      last=338 first=5446 max=25969 min=188 avg=241 std=168 std^2=28469
 event_benchmark-1293  [025]    84.352471: benchmark_event:      last=197 first=5446 max=25969 min=188 avg=241 std=168 std^2=28469
 event_benchmark-1293  [025]    84.352471: benchmark_event:      last=196 first=5446 max=25969 min=188 avg=241 std=168 std^2=28469
 event_benchmark-1293  [025]    84.352472: benchmark_event:      last=197 first=5446 max=25969 min=188 avg=241 std=168 std^2=28469
 event_benchmark-1293  [025]    84.352473: benchmark_event:      last=198 first=5446 max=25969 min=188 avg=241 std=168 std^2=28469
 event_benchmark-1293  [025]    84.352474: benchmark_event:      last=197 first=5446 max=25969 min=188 avg=241 std=168 std^2=28469
 event_benchmark-1293  [025]    84.352474: benchmark_event:      last=197 first=5446 max=25969 min=188 avg=241 std=168 std^2=28469
 event_benchmark-1293  [025]    84.352475: benchmark_event:      last=196 first=5446 max=25969 min=188 avg=241 std=168 std^2=28469


 event_benchmark-1330  [014]   111.729937: benchmark_event:      last=473 first=3724 max=31525 min=187 avg=245 std=160 std^2=25626
 event_benchmark-1330  [014]   111.729939: benchmark_event:      last=1085 first=3724 max=31525 min=187 avg=245 std=160 std^2=25627
 event_benchmark-1330  [014]   111.729940: benchmark_event:      last=602 first=3724 max=31525 min=187 avg=245 std=160 std^2=25627
 event_benchmark-1330  [014]   111.729942: benchmark_event:      last=602 first=3724 max=31525 min=187 avg=245 std=160 std^2=25627
 event_benchmark-1330  [014]   111.729943: benchmark_event:      last=393 first=3724 max=31525 min=187 avg=245 std=160 std^2=25627
 event_benchmark-1330  [014]   111.729944: benchmark_event:      last=558 first=3724 max=31525 min=187 avg=245 std=160 std^2=25627
 event_benchmark-1330  [014]   111.729946: benchmark_event:      last=410 first=3724 max=31525 min=187 avg=245 std=160 std^2=25627
 event_benchmark-1330  [014]   111.729947: benchmark_event:      last=408 first=3724 max=31525 min=187 avg=245 std=160 std^2=25627
 event_benchmark-1330  [014]   111.729949: benchmark_event:      last=428 first=3724 max=31525 min=187 avg=245 std=160 std^2=25627
 event_benchmark-1330  [014]   111.729950: benchmark_event:      last=717 first=3724 max=31525 min=187 avg=245 std=160 std^2=25627

 event_benchmark-1367  [025]   143.251162: benchmark_event:      last=209 first=3706 max=19397 min=187 avg=244 std=146 std^2=21567
 event_benchmark-1367  [025]   143.251163: benchmark_event:      last=420 first=3706 max=19397 min=187 avg=244 std=146 std^2=21567
 event_benchmark-1367  [025]   143.251164: benchmark_event:      last=413 first=3706 max=19397 min=187 avg=244 std=146 std^2=21567
 event_benchmark-1367  [025]   143.251165: benchmark_event:      last=209 first=3706 max=19397 min=187 avg=244 std=146 std^2=21567
 event_benchmark-1367  [025]   143.251165: benchmark_event:      last=205 first=3706 max=19397 min=187 avg=244 std=146 std^2=21567
 event_benchmark-1367  [025]   143.251166: benchmark_event:      last=209 first=3706 max=19397 min=187 avg=244 std=146 std^2=21567
 event_benchmark-1367  [025]   143.251167: benchmark_event:      last=208 first=3706 max=19397 min=187 avg=244 std=146 std^2=21567
 event_benchmark-1367  [025]   143.251168: benchmark_event:      last=207 first=3706 max=19397 min=187 avg=244 std=146 std^2=21567
 event_benchmark-1367  [025]   143.251169: benchmark_event:      last=208 first=3706 max=19397 min=187 avg=244 std=146 std^2=21567
 event_benchmark-1367  [025]   143.251169: benchmark_event:      last=210 first=3706 max=19397 min=187 avg=244 std=146 std^2=21567

 event_benchmark-1404  [025]   182.750612: benchmark_event:      last=207 first=3727 max=18550 min=187 avg=239 std=138 std^2=19319
 event_benchmark-1404  [025]   182.750612: benchmark_event:      last=207 first=3727 max=18550 min=187 avg=239 std=138 std^2=19319
 event_benchmark-1404  [025]   182.750613: benchmark_event:      last=207 first=3727 max=18550 min=187 avg=239 std=138 std^2=19319
 event_benchmark-1404  [025]   182.750614: benchmark_event:      last=210 first=3727 max=18550 min=187 avg=239 std=138 std^2=19319
 event_benchmark-1404  [025]   182.750615: benchmark_event:      last=209 first=3727 max=18550 min=187 avg=239 std=138 std^2=19319
 event_benchmark-1404  [025]   182.750616: benchmark_event:      last=208 first=3727 max=18550 min=187 avg=239 std=138 std^2=19319
 event_benchmark-1404  [025]   182.750616: benchmark_event:      last=206 first=3727 max=18550 min=187 avg=239 std=138 std^2=19319
 event_benchmark-1404  [025]   182.750617: benchmark_event:      last=207 first=3727 max=18550 min=187 avg=239 std=138 std^2=19319
 event_benchmark-1404  [025]   182.750618: benchmark_event:      last=206 first=3727 max=18550 min=187 avg=239 std=138 std^2=19319
 event_benchmark-1404  [025]   182.750619: benchmark_event:      last=209 first=3727 max=18550 min=187 avg=239 std=138 std^2=19319

-> With this configuration, the avg varies between 239 and 265 ns across
runs, and the std. dev. varies between 138 and 201 ns.
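
(Collapsing the two sets of runs into ranges, using the numbers from the
summaries above:)

               avg (ns)      std dev (ns)
   xadd        233 - 256     141 - 333
   cmpxchg     239 - 265     138 - 201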

In the benchmark results you collected, config #1 appeared to show little
delta between xadd and cmpxchg, whereas configs #2 and #3 showed some
difference in avg. Did you try running this test many times on each
machine to see whether the result was indeed stable?

Thanks,

Mathieu

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFC][PATCH] ring-buffer: Have nested events still record running time stamp
  2020-06-30  3:13         ` Mathieu Desnoyers
@ 2020-06-30  3:26           ` Steven Rostedt
  0 siblings, 0 replies; 24+ messages in thread
From: Steven Rostedt @ 2020-06-30  3:26 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: linux-kernel, Ingo Molnar, Peter Zijlstra, Thomas Gleixner,
	Masami Hiramatsu, Arnaldo Carvalho de Melo, Jiri Olsa,
	Namhyung Kim, Yordan Karadzhov, Tzvetomir Stoyanov, Tom Zanussi,
	Jason Behmer, Julia Lawall, Clark Williams, bristot,
	Daniel Wagner, Darren Hart, Jonathan Corbet, Suresh E. Warrier,
	Joel Fernandes, Google

On Mon, 29 Jun 2020 23:13:32 -0400 (EDT)
Mathieu Desnoyers <mathieu.desnoyers@efficios.com> wrote:

> -> With this configuration, the avg varies between 239 and 265 ns across
> runs, and the std. dev. varies between 138 and 201 ns.
> 
> In the benchmark results you collected, config #1 appeared to show little
> delta between xadd and cmpxchg, whereas configs #2 and #3 showed some
> difference in avg. Did you try running this test many times on each
> machine to see whether the result was indeed stable?


Yes, but as we discussed on IRC, my machines are much smaller than this
one, which makes the results a bit more deterministic. I also run them
with distro configs; when debugging is enabled, I get much more erratic
results.

-- Steve


^ permalink raw reply	[flat|nested] 24+ messages in thread

end of thread

Thread overview: 24+ messages
2020-06-25 13:44 [RFC][PATCH] ring-buffer: Have nested events still record running time stamp Steven Rostedt
2020-06-25 13:53 ` Mathieu Desnoyers
2020-06-25 14:37   ` Steven Rostedt
2020-06-25 16:42     ` Korben Rusek
2020-06-25 18:12       ` Steven Rostedt
2020-06-25 17:55 ` Mathieu Desnoyers
2020-06-25 18:35   ` Steven Rostedt
2020-06-25 19:35     ` Mathieu Desnoyers
2020-06-25 19:58       ` Steven Rostedt
2020-06-26  2:36       ` Steven Rostedt
2020-06-26  3:35         ` Steven Rostedt
2020-06-26 13:58           ` Steven Rostedt
2020-06-26 18:13             ` Mathieu Desnoyers
2020-06-26 18:58               ` Steven Rostedt
2020-06-26 19:39                 ` Steven Rostedt
2020-06-30  0:21       ` Steven Rostedt
2020-06-30  3:13         ` Mathieu Desnoyers
2020-06-30  3:26           ` Steven Rostedt
2020-06-25 19:04   ` Steven Rostedt
2020-06-25 19:58     ` Mathieu Desnoyers
2020-06-25 20:42       ` Steven Rostedt
2020-06-25 19:09   ` Steven Rostedt
2020-06-25 20:03     ` Mathieu Desnoyers
2020-06-25 18:09 ` Steven Rostedt
