Re: Nested events with zero deltas, can use absolute timestamps instead?

All of lore.kernel.org
 help / color / mirror / Atom feed

* Re: Nested events with zero deltas, can use absolute timestamps instead?
       [not found] <CAMmhGqKc27W03roONYXhmwB0dtz5Z8nGoS2MLSsKJ3Zotv5-JA@mail.gmail.com>
@ 2019-03-29 16:52 ` Steven Rostedt
  2019-04-01 22:54   ` Jason Behmer
  0 siblings, 1 reply; 11+ messages in thread
From: Steven Rostedt @ 2019-03-29 16:52 UTC (permalink / raw)
  To: Jason Behmer; +Cc: tom.zanussi, linux-kernel

On Fri, 29 Mar 2019 09:30:36 -0700
Jason Behmer <jbehmer@google.com> wrote:

> Hi Steven and Tom,
> I recently ran into a problem with correlating timestamps across CPUs and
> finally was able to track it down to the fact that nested events are given
> zero deltas, and so are placed earlier in time than they actually occurred.
> 
> I'm on an older version and wanted to check if this problem still exists on
> head and it appears that it does, but I also ran into Tom's commits adding
> the ability to use absolute timestamps instead of deltas.  It seems like
> this option could fix the problem.
> 
> It seems like ideally nested events could just always use these absolute
> timestamps instead of zero deltas, as it appears the implementation of
> absolute timestamps is flexible enough to have a mix of deltas and
> absolutes in the stream of events.  A fix that would require less work
> would be for me to just turn on absolute timestamps for my entire trace,
> but I see that the file to do that isn't yet writable to support that.
> 
> Before trying to do either of these things I wanted to run it by both of
> you to see if there's something I'm missing here.  What do you think?
> 

Yes this is an issue I need to fix. I have some ideas on how to handle
this. I may need to go about looking into them again.

It may be possible to add an option to have all nested events use the
full timestamp as well.

I'll have to think about this a bit more.

-- Steve

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: Nested events with zero deltas, can use absolute timestamps instead?
  2019-03-29 16:52 ` Nested events with zero deltas, can use absolute timestamps instead? Steven Rostedt
@ 2019-04-01 22:54   ` Jason Behmer
  2019-04-02  2:20     ` Steven Rostedt
  0 siblings, 1 reply; 11+ messages in thread
From: Jason Behmer @ 2019-04-01 22:54 UTC (permalink / raw)
  To: Steven Rostedt; +Cc: tom.zanussi, linux-kernel

[-- Attachment #1: Type: text/plain, Size: 3878 bytes --]

On Fri, Mar 29, 2019 at 9:53 AM Steven Rostedt <rostedt@goodmis.org> wrote:
>
> On Fri, 29 Mar 2019 09:30:36 -0700
> Jason Behmer <jbehmer@google.com> wrote:
>
> > Hi Steven and Tom,
> > I recently ran into a problem with correlating timestamps across CPUs and
> > finally was able to track it down to the fact that nested events are given
> > zero deltas, and so are placed earlier in time than they actually occurred.
> >
> > I'm on an older version and wanted to check if this problem still exists on
> > head and it appears that it does, but I also ran into Tom's commits adding
> > the ability to use absolute timestamps instead of deltas.  It seems like
> > this option could fix the problem.
> >
> > It seems like ideally nested events could just always use these absolute
> > timestamps instead of zero deltas, as it appears the implementation of
> > absolute timestamps is flexible enough to have a mix of deltas and
> > absolutes in the stream of events.  A fix that would require less work
> > would be for me to just turn on absolute timestamps for my entire trace,
> > but I see that the file to do that isn't yet writable to support that.
> >
> > Before trying to do either of these things I wanted to run it by both of
> > you to see if there's something I'm missing here.  What do you think?
> >
>
> Yes this is an issue I need to fix. I have some ideas on how to handle
> this. I may need to go about looking into them again.
>
> It may be possible to add an option to have all nested events use the
> full timestamp as well.
>
> I'll have to think about this a bit more.
>
> -- Steve

The concurrency model is still a little bit unclear to me as I'm new
to this codebase.  So I'm having some trouble reasoning about what
operations are safe at one point on the ring buffer.    It seems like
we can't be preempted in general, just interrupts?  And the events for
the events emitted by interrupts will be fully processed before
getting back to the event pointed at by the commit pointer?  If this
is true I think the approach below (and prototyped in the attached
patch against head) might work and would love feedback.  If not, this
problem is way harder.

We detect nested events by checking our event pointer against the
commit pointer.  This is safe because we reserve our event space
atomically in the buffer, leading to an ordering of events we can
depend on.  But to add a TIME_STAMP event we need to reserve more
space before we even have an event pointer, so we need to know
something about the ordering of events before we've actually
atomically reserved ours.  We could check if the write pointer is set
to the commit pointer, and if it isn't we know we're a nested event.
But, someone could update the write pointer and/or commit pointer
between the time we check it and the time we atomically reserve our
space in the buffer.  However, I think maybe this is ok.

If we see that the write pointer is not equal to the commit pointer,
then we're in an interrupt, and the only thing that could update the
commit pointer is the original event emitting code that was
interrupted, which can't run again until we're finished.  And the only
thing that can update the write pointer is further interrupts of us,
which will advance the write pointer further away from commit, leaving
our decision to allocate a TIME_STAMP event as valid.

If we see that the write pointer is equal to the commit pointer, then
anything that interrupts us before we move the write pointer will see
that same state and will need to, before returning to us, commit their
event and set commit to their new write pointer, which will make our
decision valid once again.

There's a lot of assumptions in there that I'd love to be checked on
as I'm new to this code base.  For example I haven't read the read
path at all and have no idea if it interacts with this at all.

Thanks,
Jason

[-- Attachment #2: 0001-Adding-TIME_STAMP-events-to-all-nested-events.patch --]
[-- Type: text/x-patch, Size: 6959 bytes --]

From 4be4e74032336522691bf2e751e40ae679cc885b Mon Sep 17 00:00:00 2001
From: Jason Behmer <jbehmer@google.com>
Date: Mon, 1 Apr 2019 15:15:13 -0700
Subject: [PATCH] Adding TIME_STAMP events to all nested events.

---
 kernel/trace/ring_buffer.c | 96 +++++++++++++++++++++++++++-----------
 1 file changed, 70 insertions(+), 26 deletions(-)

diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
index 41b6f96e5366..07a408ed7248 100644
--- a/kernel/trace/ring_buffer.c
+++ b/kernel/trace/ring_buffer.c
@@ -425,7 +425,8 @@ struct rb_event_info {
 	u64			delta;
 	unsigned long		length;
 	struct buffer_page	*tail_page;
-	int			add_timestamp;
+	bool			add_time_extend;
+	bool			add_time_stamp;
 };
 
 /*
@@ -2313,15 +2314,12 @@ rb_move_tail(struct ring_buffer_per_cpu *cpu_buffer,
 
 /* Slow path, do not inline */
 static noinline struct ring_buffer_event *
-rb_add_time_stamp(struct ring_buffer_event *event, u64 delta, bool abs)
+rb_add_time_extend(struct ring_buffer_event *event, u64 delta)
 {
-	if (abs)
-		event->type_len = RINGBUF_TYPE_TIME_STAMP;
-	else
-		event->type_len = RINGBUF_TYPE_TIME_EXTEND;
+	event->type_len = RINGBUF_TYPE_TIME_EXTEND;
 
 	/* Not the first event on the page, or not delta? */
-	if (abs || rb_event_index(event)) {
+	if (rb_event_index(event)) {
 		event->time_delta = delta & TS_MASK;
 		event->array[0] = delta >> TS_SHIFT;
 	} else {
@@ -2333,6 +2331,21 @@ rb_add_time_stamp(struct ring_buffer_event *event, u64 delta, bool abs)
 	return skip_time_extend(event);
 }
 
+/* Slow path, do not inline */
+static noinline struct ring_buffer_event *
+rb_add_time_stamp(struct ring_buffer_event *event, u64 timestamp)
+{
+	event->type_len = RINGBUF_TYPE_TIME_STAMP;
+        /*
+         * Store a full timestamp in the time_delta field and the first data
+         * element.
+         */
+	event->time_delta = timestamp & TS_MASK;
+	event->array[0] = timestamp >> TS_SHIFT;
+
+	return skip_time_extend(event);
+}
+
 static inline bool rb_event_is_commit(struct ring_buffer_per_cpu *cpu_buffer,
 				     struct ring_buffer_event *event);
 
@@ -2360,15 +2373,17 @@ rb_update_event(struct ring_buffer_per_cpu *cpu_buffer,
 		delta = 0;
 
 	/*
-	 * If we need to add a timestamp, then we
+	 * If we need to add a TIME_EXTEND or TIME_STAMP, then we
 	 * add it to the start of the reserved space.
 	 */
-	if (unlikely(info->add_timestamp)) {
-		bool abs = ring_buffer_time_stamp_abs(cpu_buffer->buffer);
-
-		event = rb_add_time_stamp(event, info->delta, abs);
+	if (unlikely(info->add_time_extend)) {
+		event = rb_add_time_extend(event, info->delta);
 		length -= RB_LEN_TIME_EXTEND;
 		delta = 0;
+	} else if (unlikely(info->add_time_stamp)) {
+		event = rb_add_time_stamp(event, info->delta);
+		length -= RB_LEN_TIME_STAMP;
+		delta = 0;
 	}
 
 	event->time_delta = delta;
@@ -2520,6 +2535,14 @@ rb_set_commit_to_write(struct ring_buffer_per_cpu *cpu_buffer)
 		goto again;
 }
 
+static __always_inline bool
+rb_commit_is_write(struct ring_buffer_per_cpu *cpu_buffer)
+{
+  return cpu_buffer->commit_page == cpu_buffer->tail_page &&
+      rb_commit_index(cpu_buffer) == rb_page_write(cpu_buffer->commit_page);
+
+}
+
 static __always_inline void rb_end_commit(struct ring_buffer_per_cpu *cpu_buffer)
 {
 	unsigned long commits;
@@ -2805,19 +2828,23 @@ EXPORT_SYMBOL_GPL(ring_buffer_unlock_commit);
 
 static noinline void
 rb_handle_timestamp(struct ring_buffer_per_cpu *cpu_buffer,
-		    struct rb_event_info *info)
+		    struct rb_event_info *info, int abs)
 {
 	WARN_ONCE(info->delta > (1ULL << 59),
-		  KERN_WARNING "Delta way too big! %llu ts=%llu write stamp = %llu\n%s",
+		  KERN_WARNING "Delta way too big! %llu ts=%llu write stamp = %llu abs=%d\n%s",
 		  (unsigned long long)info->delta,
 		  (unsigned long long)info->ts,
 		  (unsigned long long)cpu_buffer->write_stamp,
+		  abs,
 		  sched_clock_stable() ? "" :
 		  "If you just came from a suspend/resume,\n"
 		  "please switch to the trace global clock:\n"
 		  "  echo global > /sys/kernel/debug/tracing/trace_clock\n"
 		  "or add trace_clock=global to the kernel command line\n");
-	info->add_timestamp = 1;
+	if (abs)
+		info->add_time_stamp = true;
+	else
+		info->add_time_extend = true;
 }
 
 static struct ring_buffer_event *
@@ -2828,13 +2855,20 @@ __rb_reserve_next(struct ring_buffer_per_cpu *cpu_buffer,
 	struct buffer_page *tail_page;
 	unsigned long tail, write;
 
-	/*
-	 * If the time delta since the last event is too big to
-	 * hold in the time field of the event, then we append a
-	 * TIME EXTEND event ahead of the data event.
-	 */
-	if (unlikely(info->add_timestamp))
+	if (unlikely(info->add_time_extend))
+		/*
+		 * If the time delta since the last event is too big to
+		 * hold in the time field of the event, then we append a
+		 * TIME EXTEND event ahead of the data event.
+		 */
 		info->length += RB_LEN_TIME_EXTEND;
+	else if (unlikely(info->add_time_stamp))
+		/*
+		 * If using absolute time stamps for this event, then we append
+                 * a TIME STAMP event ahead of the data event to fit the full
+                 * timestamp.
+		 */
+		info->length += RB_LEN_TIME_STAMP;
 
 	/* Don't let the compiler play games with cpu_buffer->tail_page */
 	tail_page = info->tail_page = READ_ONCE(cpu_buffer->tail_page);
@@ -2848,7 +2882,7 @@ __rb_reserve_next(struct ring_buffer_per_cpu *cpu_buffer,
 	 * If this is the first commit on the page, then it has the same
 	 * timestamp as the page itself.
 	 */
-	if (!tail && !ring_buffer_time_stamp_abs(cpu_buffer->buffer))
+	if (!tail && !unlikely(info->add_time_stamp))
 		info->delta = 0;
 
 	/* See if we shot pass the end of this buffer page */
@@ -2904,7 +2938,8 @@ rb_reserve_next_event(struct ring_buffer *buffer,
 
 	info.length = rb_calculate_event_length(length);
  again:
-	info.add_timestamp = 0;
+	info.add_time_extend = false;
+	info.add_time_stamp = false;
 	info.delta = 0;
 
 	/*
@@ -2927,19 +2962,28 @@ rb_reserve_next_event(struct ring_buffer *buffer,
 
 	if (ring_buffer_time_stamp_abs(buffer)) {
 		info.delta = info.ts;
-		rb_handle_timestamp(cpu_buffer, &info);
+		rb_handle_timestamp(cpu_buffer, &info, true);
+	} else if (!rb_commit_is_write(cpu_buffer)) {
+		/*
+		 * Use absolute timestamps for nested events, where we can't
+		 * compute a delta.
+		 */
+		info.delta = info.ts;
+		rb_handle_timestamp(cpu_buffer, &info, true);
 	} else /* Did the write stamp get updated already? */
 		if (likely(info.ts >= cpu_buffer->write_stamp)) {
 		info.delta = diff;
 		if (unlikely(test_time_stamp(info.delta)))
-			rb_handle_timestamp(cpu_buffer, &info);
+			rb_handle_timestamp(cpu_buffer, &info, false);
 	}
 
 	event = __rb_reserve_next(cpu_buffer, &info);
 
 	if (unlikely(PTR_ERR(event) == -EAGAIN)) {
-		if (info.add_timestamp)
+		if (info.add_time_extend)
 			info.length -= RB_LEN_TIME_EXTEND;
+		else if (info.add_time_stamp)
+			info.length -= RB_LEN_TIME_STAMP;
 		goto again;
 	}
 
-- 
2.21.0.392.gf8f6787159e-goog


^ permalink raw reply related	[flat|nested] 11+ messages in thread

* Re: Nested events with zero deltas, can use absolute timestamps instead?
  2019-04-01 22:54   ` Jason Behmer
@ 2019-04-02  2:20     ` Steven Rostedt
  2019-04-02  4:42       ` Jason Behmer
  0 siblings, 1 reply; 11+ messages in thread
From: Steven Rostedt @ 2019-04-02  2:20 UTC (permalink / raw)
  To: Jason Behmer; +Cc: tom.zanussi, linux-kernel

On Mon, 1 Apr 2019 15:54:20 -0700
Jason Behmer <jbehmer@google.com> wrote:

> The concurrency model is still a little bit unclear to me as I'm new
> to this codebase.  So I'm having some trouble reasoning about what
> operations are safe at one point on the ring buffer.    It seems like
> we can't be preempted in general, just interrupts?  And the events for
> the events emitted by interrupts will be fully processed before
> getting back to the event pointed at by the commit pointer?  If this
> is true I think the approach below (and prototyped in the attached
> patch against head) might work and would love feedback.  If not, this
> problem is way harder.
> 
> We detect nested events by checking our event pointer against the
> commit pointer.  This is safe because we reserve our event space
> atomically in the buffer, leading to an ordering of events we can
> depend on.  But to add a TIME_STAMP event we need to reserve more
> space before we even have an event pointer, so we need to know
> something about the ordering of events before we've actually
> atomically reserved ours.  We could check if the write pointer is set
> to the commit pointer, and if it isn't we know we're a nested event.
> But, someone could update the write pointer and/or commit pointer
> between the time we check it and the time we atomically reserve our
> space in the buffer.  However, I think maybe this is ok.
> 
> If we see that the write pointer is not equal to the commit pointer,
> then we're in an interrupt, and the only thing that could update the
> commit pointer is the original event emitting code that was
> interrupted, which can't run again until we're finished.  And the only
> thing that can update the write pointer is further interrupts of us,
> which will advance the write pointer further away from commit, leaving
> our decision to allocate a TIME_STAMP event as valid.
> 
> If we see that the write pointer is equal to the commit pointer, then
> anything that interrupts us before we move the write pointer will see
> that same state and will need to, before returning to us, commit their
> event and set commit to their new write pointer, which will make our
> decision valid once again.
> 
> There's a lot of assumptions in there that I'd love to be checked on
> as I'm new to this code base.  For example I haven't read the read
> path at all and have no idea if it interacts with this at all.

I think you pretty much got the idea correct. The issue is what to put
into the extra timestamp value. As the time we record the timestamp
compared to the time we allocate the space for the timestamp is not
atomic. And we can't have time go backwards :-(

		|                                  |
commit 	--->	+----------------------------------+
		| TS offset from previous event    |  (A)
		+----------------------------------+
		| outer event data                 |
<interrupt>	+----------------------------------+
		| extended TS                      |  (B)
		+----------------------------------+
		| interrupt event data             |
		+----------------------------------+
 head	--->	|                                  |


	TS = rdstc();
	A = reserve_ring_buffer
	*A = TS

interrupt:
	TS = rdtsc();
	B = reserve_ring_buffer
	*B = TS


What's important is what we store in A and B

	TS = rdtsc();
		<interrupt> --->
					TS = rdstc()
					(this is first commit!)
					A = reserver_ring_buffer
					*A = TS
					(finish commit)
		<----
	A = reserver_ring_buffer
	*A = TS

You can see how the recording of the timestamp and writing it gets
complex. Also it gets more complex when we use deltas and not direct writes.

Now we may be able to handle this if we take the timestamp before doing
anything, and if it's nested, take it again (which should guarantee
that it's after the previous timestamp)

Now of course the question is, how do we update the write stamp that we
will use to compute new "deltas"? Or we just use absolute timestamps to
the end of the page, and start over again, when we start a new page
that isn't nested.

But see where the complexity comes from?

-- Steve

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: Nested events with zero deltas, can use absolute timestamps instead?
  2019-04-02  2:20     ` Steven Rostedt
@ 2019-04-02  4:42       ` Jason Behmer
  2019-05-24 14:17         ` Jason Behmer
  0 siblings, 1 reply; 11+ messages in thread
From: Jason Behmer @ 2019-04-02  4:42 UTC (permalink / raw)
  To: Steven Rostedt; +Cc: tom.zanussi, linux-kernel

On Mon, Apr 1, 2019 at 7:21 PM Steven Rostedt <rostedt@goodmis.org> wrote:
>
> On Mon, 1 Apr 2019 15:54:20 -0700
> Jason Behmer <jbehmer@google.com> wrote:
>
> > The concurrency model is still a little bit unclear to me as I'm new
> > to this codebase.  So I'm having some trouble reasoning about what
> > operations are safe at one point on the ring buffer.    It seems like
> > we can't be preempted in general, just interrupts?  And the events for
> > the events emitted by interrupts will be fully processed before
> > getting back to the event pointed at by the commit pointer?  If this
> > is true I think the approach below (and prototyped in the attached
> > patch against head) might work and would love feedback.  If not, this
> > problem is way harder.
> >
> > We detect nested events by checking our event pointer against the
> > commit pointer.  This is safe because we reserve our event space
> > atomically in the buffer, leading to an ordering of events we can
> > depend on.  But to add a TIME_STAMP event we need to reserve more
> > space before we even have an event pointer, so we need to know
> > something about the ordering of events before we've actually
> > atomically reserved ours.  We could check if the write pointer is set
> > to the commit pointer, and if it isn't we know we're a nested event.
> > But, someone could update the write pointer and/or commit pointer
> > between the time we check it and the time we atomically reserve our
> > space in the buffer.  However, I think maybe this is ok.
> >
> > If we see that the write pointer is not equal to the commit pointer,
> > then we're in an interrupt, and the only thing that could update the
> > commit pointer is the original event emitting code that was
> > interrupted, which can't run again until we're finished.  And the only
> > thing that can update the write pointer is further interrupts of us,
> > which will advance the write pointer further away from commit, leaving
> > our decision to allocate a TIME_STAMP event as valid.
> >
> > If we see that the write pointer is equal to the commit pointer, then
> > anything that interrupts us before we move the write pointer will see
> > that same state and will need to, before returning to us, commit their
> > event and set commit to their new write pointer, which will make our
> > decision valid once again.
> >
> > There's a lot of assumptions in there that I'd love to be checked on
> > as I'm new to this code base.  For example I haven't read the read
> > path at all and have no idea if it interacts with this at all.
>
> I think you pretty much got the idea correct. The issue is what to put
> into the extra timestamp value. As the time we record the timestamp
> compared to the time we allocate the space for the timestamp is not
> atomic. And we can't have time go backwards :-(
>
>                 |                                  |
> commit  --->    +----------------------------------+
>                 | TS offset from previous event    |  (A)
>                 +----------------------------------+
>                 | outer event data                 |
> <interrupt>     +----------------------------------+
>                 | extended TS                      |  (B)
>                 +----------------------------------+
>                 | interrupt event data             |
>                 +----------------------------------+
>  head   --->    |                                  |
>
>
>         TS = rdstc();
>         A = reserve_ring_buffer
>         *A = TS
>
> interrupt:
>         TS = rdtsc();
>         B = reserve_ring_buffer
>         *B = TS
>
>
> What's important is what we store in A and B
>
>         TS = rdtsc();
>                 <interrupt> --->
>                                         TS = rdstc()
>                                         (this is first commit!)
>                                         A = reserver_ring_buffer
>                                         *A = TS
>                                         (finish commit)
>                 <----
>         A = reserver_ring_buffer
>         *A = TS
>
> You can see how the recording of the timestamp and writing it gets
> complex. Also it gets more complex when we use deltas and not direct writes.
>
> Now we may be able to handle this if we take the timestamp before doing
> anything, and if it's nested, take it again (which should guarantee
> that it's after the previous timestamp)
>
> Now of course the question is, how do we update the write stamp that we
> will use to compute new "deltas"? Or we just use absolute timestamps to
> the end of the page, and start over again, when we start a new page
> that isn't nested.
>
> But see where the complexity comes from?
>
> -- Steve

Ah, yes, I was too focused on the first problem of if we could even
reserve the space due to concurrency restrictions.

First, I had assumed there weren't restrictions on timestamps needing
to not go backwards between consecutive events in the buffers, as it
seems the current implementation of absolute timestamps should have
this issue.  It also seems incredibly difficult to solve.  Even with
your suggestion of taking the timestamp a second time if you're nested
- can't you end up with multiple layers of nesting?  So how do you
make sure that the third level nested event has a timestamp after the
second level nested event?  It seems like maybe there's some very
complicated way you can make this work given that it seems like your
stack depth here is limited here to One of each of normal, nmi,
softirq, and hardirq at a time (is that right?).

And second, I forgot entirely about updating the write stamp.  It
looks like my prototype patch as written will still just update the
write stamp to the outer event's timestamp, whereas really logically
we'd like to set it to the last thing in the buffer when we commit.
It looks like rb_set_commit_to_write already leapfrogs write_stamp to
the next page timestamp every time it commits a full page.  Ideally
we'd be able to also do this when we get to the tail page and update
commit to its write index.  Because, I think, nothing will be diffing
off of write_stamp until we set commit to be back to write.  But of
course within a page we have no clue what the timestamp of the event
previous to the write pointer is.

And with ordering problems, you can't even just try the brute force
approach of having all of the nested events update write_stamp on
commit if their ts is larger than the current write_stamp (although
even if you did want to do that it seems like you'd have to do some
more complicate atomic access to write stamp as you would no longer
have the nice property of it only being updated by one piece of code
at a time).

As for using absolute timestamps for the rest of the page, it seems
like in many use cases that would end up being functionally very
similar (with a far more complex implementation) to just turning on
absolute timestamps for everything.  This of course depends entirely
on the trace points you're gathering, but some common ones, like
sched_* seem to get called fairly often from an interrupt context even
on an idle system.


Also, I was actually going to send out a patch to try to make the
timestamp_mode file writable, as an easy short-term fix for getting
absolute time stamps.  However, it looks like the current hook to turn
it on is doing some ref counting on enabling/disabling some histogram
listeners.  Combining that with a user-writable config file seems a
little tricky in terms of semantics, which do we want to take
priority?  It seems like a user changing a config an accidentally
clobbering the histogram listeners isn't great, but neither is giving
back a user cryptic errors when they try to write the file and
histogram listeners are configured.  Any thoughts on what to do there
and how to implement it?

Thanks,
Jason

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: Nested events with zero deltas, can use absolute timestamps instead?
  2019-04-02  4:42       ` Jason Behmer
@ 2019-05-24 14:17         ` Jason Behmer
  2019-05-24 15:00           ` Steven Rostedt
  0 siblings, 1 reply; 11+ messages in thread
From: Jason Behmer @ 2019-05-24 14:17 UTC (permalink / raw)
  To: Steven Rostedt; +Cc: tom.zanussi, linux-kernel

On Mon, Apr 1, 2019 at 9:42 PM Jason Behmer <jbehmer@google.com> wrote:
>
> On Mon, Apr 1, 2019 at 7:21 PM Steven Rostedt <rostedt@goodmis.org> wrote:
> >
> > On Mon, 1 Apr 2019 15:54:20 -0700
> > Jason Behmer <jbehmer@google.com> wrote:
> >
> > > The concurrency model is still a little bit unclear to me as I'm new
> > > to this codebase.  So I'm having some trouble reasoning about what
> > > operations are safe at one point on the ring buffer.    It seems like
> > > we can't be preempted in general, just interrupts?  And the events for
> > > the events emitted by interrupts will be fully processed before
> > > getting back to the event pointed at by the commit pointer?  If this
> > > is true I think the approach below (and prototyped in the attached
> > > patch against head) might work and would love feedback.  If not, this
> > > problem is way harder.
> > >
> > > We detect nested events by checking our event pointer against the
> > > commit pointer.  This is safe because we reserve our event space
> > > atomically in the buffer, leading to an ordering of events we can
> > > depend on.  But to add a TIME_STAMP event we need to reserve more
> > > space before we even have an event pointer, so we need to know
> > > something about the ordering of events before we've actually
> > > atomically reserved ours.  We could check if the write pointer is set
> > > to the commit pointer, and if it isn't we know we're a nested event.
> > > But, someone could update the write pointer and/or commit pointer
> > > between the time we check it and the time we atomically reserve our
> > > space in the buffer.  However, I think maybe this is ok.
> > >
> > > If we see that the write pointer is not equal to the commit pointer,
> > > then we're in an interrupt, and the only thing that could update the
> > > commit pointer is the original event emitting code that was
> > > interrupted, which can't run again until we're finished.  And the only
> > > thing that can update the write pointer is further interrupts of us,
> > > which will advance the write pointer further away from commit, leaving
> > > our decision to allocate a TIME_STAMP event as valid.
> > >
> > > If we see that the write pointer is equal to the commit pointer, then
> > > anything that interrupts us before we move the write pointer will see
> > > that same state and will need to, before returning to us, commit their
> > > event and set commit to their new write pointer, which will make our
> > > decision valid once again.
> > >
> > > There's a lot of assumptions in there that I'd love to be checked on
> > > as I'm new to this code base.  For example I haven't read the read
> > > path at all and have no idea if it interacts with this at all.
> >
> > I think you pretty much got the idea correct. The issue is what to put
> > into the extra timestamp value. As the time we record the timestamp
> > compared to the time we allocate the space for the timestamp is not
> > atomic. And we can't have time go backwards :-(
> >
> >                 |                                  |
> > commit  --->    +----------------------------------+
> >                 | TS offset from previous event    |  (A)
> >                 +----------------------------------+
> >                 | outer event data                 |
> > <interrupt>     +----------------------------------+
> >                 | extended TS                      |  (B)
> >                 +----------------------------------+
> >                 | interrupt event data             |
> >                 +----------------------------------+
> >  head   --->    |                                  |
> >
> >
> >         TS = rdstc();
> >         A = reserve_ring_buffer
> >         *A = TS
> >
> > interrupt:
> >         TS = rdtsc();
> >         B = reserve_ring_buffer
> >         *B = TS
> >
> >
> > What's important is what we store in A and B
> >
> >         TS = rdtsc();
> >                 <interrupt> --->
> >                                         TS = rdstc()
> >                                         (this is first commit!)
> >                                         A = reserver_ring_buffer
> >                                         *A = TS
> >                                         (finish commit)
> >                 <----
> >         A = reserver_ring_buffer
> >         *A = TS
> >
> > You can see how the recording of the timestamp and writing it gets
> > complex. Also it gets more complex when we use deltas and not direct writes.
> >
> > Now we may be able to handle this if we take the timestamp before doing
> > anything, and if it's nested, take it again (which should guarantee
> > that it's after the previous timestamp)
> >
> > Now of course the question is, how do we update the write stamp that we
> > will use to compute new "deltas"? Or we just use absolute timestamps to
> > the end of the page, and start over again, when we start a new page
> > that isn't nested.
> >
> > But see where the complexity comes from?
> >
> > -- Steve
>
> Ah, yes, I was too focused on the first problem of if we could even
> reserve the space due to concurrency restrictions.
>
> First, I had assumed there weren't restrictions on timestamps needing
> to not go backwards between consecutive events in the buffers, as it
> seems the current implementation of absolute timestamps should have
> this issue.  It also seems incredibly difficult to solve.  Even with
> your suggestion of taking the timestamp a second time if you're nested
> - can't you end up with multiple layers of nesting?  So how do you
> make sure that the third level nested event has a timestamp after the
> second level nested event?  It seems like maybe there's some very
> complicated way you can make this work given that it seems like your
> stack depth here is limited here to One of each of normal, nmi,
> softirq, and hardirq at a time (is that right?).
>
> And second, I forgot entirely about updating the write stamp.  It
> looks like my prototype patch as written will still just update the
> write stamp to the outer event's timestamp, whereas really logically
> we'd like to set it to the last thing in the buffer when we commit.
> It looks like rb_set_commit_to_write already leapfrogs write_stamp to
> the next page timestamp every time it commits a full page.  Ideally
> we'd be able to also do this when we get to the tail page and update
> commit to its write index.  Because, I think, nothing will be diffing
> off of write_stamp until we set commit to be back to write.  But of
> course within a page we have no clue what the timestamp of the event
> previous to the write pointer is.
>
> And with ordering problems, you can't even just try the brute force
> approach of having all of the nested events update write_stamp on
> commit if their ts is larger than the current write_stamp (although
> even if you did want to do that it seems like you'd have to do some
> more complicate atomic access to write stamp as you would no longer
> have the nice property of it only being updated by one piece of code
> at a time).
>
> As for using absolute timestamps for the rest of the page, it seems
> like in many use cases that would end up being functionally very
> similar (with a far more complex implementation) to just turning on
> absolute timestamps for everything.  This of course depends entirely
> on the trace points you're gathering, but some common ones, like
> sched_* seem to get called fairly often from an interrupt context even
> on an idle system.
>
>
> Also, I was actually going to send out a patch to try to make the
> timestamp_mode file writable, as an easy short-term fix for getting
> absolute time stamps.  However, it looks like the current hook to turn
> it on is doing some ref counting on enabling/disabling some histogram
> listeners.  Combining that with a user-writable config file seems a
> little tricky in terms of semantics, which do we want to take
> priority?  It seems like a user changing a config an accidentally
> clobbering the histogram listeners isn't great, but neither is giving
> back a user cryptic errors when they try to write the file and
> histogram listeners are configured.  Any thoughts on what to do there
> and how to implement it?
>
> Thanks,
> Jason

Hi Steven,
Your other email reminded me of this thread.  The easy "fix" we
decided to pursue was to simply turn on absolute timestamps for all
events and use up the extra space, which in our particular application
isn't a huge deal.  We haven't yet gotten around to trying to send a
patch for plumbing user-configurable absolute timestamps, but as noted
immediately above, the configuration for timestamp_mode is actually a
bit tricky to implement with the existing histogram ref counting.  The
way I was thinking about dealing with that was to have a separate bool
to indicate the state the user has indicated they want, and then you
have to work through all the possible combinations of behavior:

If user absolute timestamps is false, all behavior is exactly as today.
If user absolute timestamps is true, histogram refs transitioning 0->1
is a no-op, as is histogram refs transitioning 1->0.
If histogram refs are 0 and user absolute timestamps transition
false->true or true->false, they get what they want.
If histogram refs are >0 and user absolute timestamps transition
false->true, it's a no-op.

And the confusing one:
If histogram refs are >0 and user absolute timestamps transitions
true->false we can't turn off absolute timestamps and screw up the
histograms, so we return an error.  But user absolute timestamps is
now false, which means when histogram refs transitions back to 0, it
will turn off absolute timestamps.

What do you think of that?

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: Nested events with zero deltas, can use absolute timestamps instead?
  2019-05-24 14:17         ` Jason Behmer
@ 2019-05-24 15:00           ` Steven Rostedt
  2019-05-24 15:11             ` Jason Behmer
  0 siblings, 1 reply; 11+ messages in thread
From: Steven Rostedt @ 2019-05-24 15:00 UTC (permalink / raw)
  To: Jason Behmer; +Cc: tom.zanussi, linux-kernel

On Fri, 24 May 2019 07:17:15 -0700
Jason Behmer <jbehmer@google.com> wrote:


> Hi Steven,
> Your other email reminded me of this thread.  The easy "fix" we
> decided to pursue was to simply turn on absolute timestamps for all
> events and use up the extra space, which in our particular application
> isn't a huge deal.  We haven't yet gotten around to trying to send a
> patch for plumbing user-configurable absolute timestamps, but as noted
> immediately above, the configuration for timestamp_mode is actually a
> bit tricky to implement with the existing histogram ref counting.  The
> way I was thinking about dealing with that was to have a separate bool
> to indicate the state the user has indicated they want, and then you
> have to work through all the possible combinations of behavior:
> 
> If user absolute timestamps is false, all behavior is exactly as today.
> If user absolute timestamps is true, histogram refs transitioning 0->1
> is a no-op, as is histogram refs transitioning 1->0.
> If histogram refs are 0 and user absolute timestamps transition
> false->true or true->false, they get what they want.
> If histogram refs are >0 and user absolute timestamps transition
> false->true, it's a no-op.
> 
> And the confusing one:
> If histogram refs are >0 and user absolute timestamps transitions
> true->false we can't turn off absolute timestamps and screw up the
> histograms, so we return an error.  But user absolute timestamps is
> now false, which means when histogram refs transitions back to 0, it
> will turn off absolute timestamps.
> 
> What do you think of that?

I don't think that's confusing if its well documented. Have the user
flag called "force_absolute_timestamps", that way it's not something
that the user will think that we wont have absolute timestamps if it is
zero. Have the documentation say:

 Various utilities within the tracing system require that the ring
 buffer uses absolute timestamps. But you may force the ring buffer to
 always use it, which will give you unique timings with nested tracing
 at the cost of more usage in the ring buffer.

-- Steve

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: Nested events with zero deltas, can use absolute timestamps instead?
  2019-05-24 15:00           ` Steven Rostedt
@ 2019-05-24 15:11             ` Jason Behmer
  2019-05-24 15:24               ` Steven Rostedt
  0 siblings, 1 reply; 11+ messages in thread
From: Jason Behmer @ 2019-05-24 15:11 UTC (permalink / raw)
  To: Steven Rostedt; +Cc: tom.zanussi, linux-kernel

On Fri, May 24, 2019 at 8:00 AM Steven Rostedt <rostedt@goodmis.org> wrote:
>
> On Fri, 24 May 2019 07:17:15 -0700
> Jason Behmer <jbehmer@google.com> wrote:
>
>
> > Hi Steven,
> > Your other email reminded me of this thread.  The easy "fix" we
> > decided to pursue was to simply turn on absolute timestamps for all
> > events and use up the extra space, which in our particular application
> > isn't a huge deal.  We haven't yet gotten around to trying to send a
> > patch for plumbing user-configurable absolute timestamps, but as noted
> > immediately above, the configuration for timestamp_mode is actually a
> > bit tricky to implement with the existing histogram ref counting.  The
> > way I was thinking about dealing with that was to have a separate bool
> > to indicate the state the user has indicated they want, and then you
> > have to work through all the possible combinations of behavior:
> >
> > If user absolute timestamps is false, all behavior is exactly as today.
> > If user absolute timestamps is true, histogram refs transitioning 0->1
> > is a no-op, as is histogram refs transitioning 1->0.
> > If histogram refs are 0 and user absolute timestamps transition
> > false->true or true->false, they get what they want.
> > If histogram refs are >0 and user absolute timestamps transition
> > false->true, it's a no-op.
> >
> > And the confusing one:
> > If histogram refs are >0 and user absolute timestamps transitions
> > true->false we can't turn off absolute timestamps and screw up the
> > histograms, so we return an error.  But user absolute timestamps is
> > now false, which means when histogram refs transitions back to 0, it
> > will turn off absolute timestamps.
> >
> > What do you think of that?
>
> I don't think that's confusing if its well documented. Have the user
> flag called "force_absolute_timestamps", that way it's not something
> that the user will think that we wont have absolute timestamps if it is
> zero. Have the documentation say:
>
>  Various utilities within the tracing system require that the ring
>  buffer uses absolute timestamps. But you may force the ring buffer to
>  always use it, which will give you unique timings with nested tracing
>  at the cost of more usage in the ring buffer.
>
> -- Steve

Ah, I was thinking of doing this within the existing timestamp_mode
config file.  Having a separate file does make it much less confusing.

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: Nested events with zero deltas, can use absolute timestamps instead?
  2019-05-24 15:11             ` Jason Behmer
@ 2019-05-24 15:24               ` Steven Rostedt
  2019-05-24 15:48                 ` Jason Behmer
  0 siblings, 1 reply; 11+ messages in thread
From: Steven Rostedt @ 2019-05-24 15:24 UTC (permalink / raw)
  To: Jason Behmer; +Cc: tom.zanussi, linux-kernel

On Fri, 24 May 2019 08:11:12 -0700
Jason Behmer <jbehmer@google.com> wrote:

> > > What do you think of that?  
> >
> > I don't think that's confusing if its well documented. Have the user
> > flag called "force_absolute_timestamps", that way it's not something
> > that the user will think that we wont have absolute timestamps if it is
> > zero. Have the documentation say:
> >
> >  Various utilities within the tracing system require that the ring
> >  buffer uses absolute timestamps. But you may force the ring buffer to
> >  always use it, which will give you unique timings with nested tracing
> >  at the cost of more usage in the ring buffer.
> >
> > -- Steve  
> 
> Ah, I was thinking of doing this within the existing timestamp_mode
> config file.  Having a separate file does make it much less confusing.

Not a separate file, but a new tracing option.

-- Steve

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: Nested events with zero deltas, can use absolute timestamps instead?
  2019-05-24 15:24               ` Steven Rostedt
@ 2019-05-24 15:48                 ` Jason Behmer
  2019-05-24 15:54                   ` Steven Rostedt
  0 siblings, 1 reply; 11+ messages in thread
From: Jason Behmer @ 2019-05-24 15:48 UTC (permalink / raw)
  To: Steven Rostedt; +Cc: tom.zanussi, linux-kernel

On Fri, May 24, 2019 at 8:25 AM Steven Rostedt <rostedt@goodmis.org> wrote:
>
> On Fri, 24 May 2019 08:11:12 -0700
> Jason Behmer <jbehmer@google.com> wrote:
>
> > > > What do you think of that?
> > >
> > > I don't think that's confusing if its well documented. Have the user
> > > flag called "force_absolute_timestamps", that way it's not something
> > > that the user will think that we wont have absolute timestamps if it is
> > > zero. Have the documentation say:
> > >
> > >  Various utilities within the tracing system require that the ring
> > >  buffer uses absolute timestamps. But you may force the ring buffer to
> > >  always use it, which will give you unique timings with nested tracing
> > >  at the cost of more usage in the ring buffer.
> > >
> > > -- Steve
> >
> > Ah, I was thinking of doing this within the existing timestamp_mode
> > config file.  Having a separate file does make it much less confusing.
>
> Not a separate file, but a new tracing option.
>
> -- Steve

Sorry, I'm not sure I follow.  By new tracing option do you mean a new
option in the timestamp_mode file?  I guess in that case would that
still be the only writable option?  You could write 1/0 to the file
which would turn on/off force_absolute_timestamps, and reading the
file would show which of absolute, delta, and force_absolute was set?
Or did you mean something else by tracing option?

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: Nested events with zero deltas, can use absolute timestamps instead?
  2019-05-24 15:48                 ` Jason Behmer
@ 2019-05-24 15:54                   ` Steven Rostedt
  2019-05-24 15:58                     ` Steven Rostedt
  0 siblings, 1 reply; 11+ messages in thread
From: Steven Rostedt @ 2019-05-24 15:54 UTC (permalink / raw)
  To: Jason Behmer; +Cc: tom.zanussi, linux-kernel

On Fri, 24 May 2019 08:48:44 -0700
Jason Behmer <jbehmer@google.com> wrote:

> Sorry, I'm not sure I follow.  By new tracing option do you mean a new
> option in the timestamp_mode file?  I guess in that case would that
> still be the only writable option?  You could write 1/0 to the file
> which would turn on/off force_absolute_timestamps, and reading the
> file would show which of absolute, delta, and force_absolute was set?
> Or did you mean something else by tracing option?

I mean the options in tracefs/options/*

Do a search for OVERWRITE in kernel/trace/ and see what it does.

-- Steve

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: Nested events with zero deltas, can use absolute timestamps instead?
  2019-05-24 15:54                   ` Steven Rostedt
@ 2019-05-24 15:58                     ` Steven Rostedt
  0 siblings, 0 replies; 11+ messages in thread
From: Steven Rostedt @ 2019-05-24 15:58 UTC (permalink / raw)
  To: Jason Behmer; +Cc: tom.zanussi, linux-kernel

On Fri, 24 May 2019 11:54:28 -0400
Steven Rostedt <rostedt@goodmis.org> wrote:

> I mean the options in tracefs/options/*
> 

Come to think of it, the timestamp_mode probably should have been an
option instead.

-- Steve

^ permalink raw reply	[flat|nested] 11+ messages in thread

end of thread, other threads:[~2019-05-24 15:58 UTC | newest]

Thread overview: 11+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
     [not found] <CAMmhGqKc27W03roONYXhmwB0dtz5Z8nGoS2MLSsKJ3Zotv5-JA@mail.gmail.com>
2019-03-29 16:52 ` Nested events with zero deltas, can use absolute timestamps instead? Steven Rostedt
2019-04-01 22:54   ` Jason Behmer
2019-04-02  2:20     ` Steven Rostedt
2019-04-02  4:42       ` Jason Behmer
2019-05-24 14:17         ` Jason Behmer
2019-05-24 15:00           ` Steven Rostedt
2019-05-24 15:11             ` Jason Behmer
2019-05-24 15:24               ` Steven Rostedt
2019-05-24 15:48                 ` Jason Behmer
2019-05-24 15:54                   ` Steven Rostedt
2019-05-24 15:58                     ` Steven Rostedt

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.