* [PATCH] Perf: Correct Assumptions about Sample Timestamps in Passes
@ 2013-11-14  8:07 Joseph Schuchart
  2013-11-14  8:39 ` Ingo Molnar
  0 siblings, 1 reply; 24+ messages in thread
From: Joseph Schuchart @ 2013-11-14  8:07 UTC (permalink / raw)
  To: Peter Zijlstra, Paul Mackerras, Ingo Molnar, Arnaldo Carvalho de Melo
  Cc: thomas.ilsche, Joseph Schuchart, linux-kernel

[-- Attachment #1: Type: text/plain, Size: 2937 bytes --]

Good morning,

We came across a problem in perf script which causes it to abort reading
a file produced with perf record, complaining about timestamps being
earlier than the last flushed timeslice ("Warning: Timestamp below last
timeslice flush").

While investigating the issue, we found that the assumptions about the
relation of timestamps between different CPUs and the passes in which
the data is read from the kernel might not be entirely valid. The
current model (as is described in util/session.c lines 552 to 590)
states that it is safe to store the maximum timestamp seen in pass n
(T_n) and use it to flush the samples after pass n+1, assuming that
there will be samples with timestamps >=T_n for all CPUs in pass n+1.

However, based on the description given in the source file, we can come
up with an example that breaks this assumption:

 *    ============ PASS n =================
 *       CPU 0         |   CPU 1
 *                     |
 *    cnt1 timestamps  |   cnt2 timestamps
 *          1          |         2
 *          -          |         3
 *          -          |         4  <--- max recorded
 *
 *    ============ PASS n + 1 ==============
 *       CPU 0         |   CPU 1
 *                     |
 *    cnt1 timestamps  |   cnt2 timestamps
 *          2          |         5
 *          -          |         6
 *          -          |         7 <---- max recorded
 *
 *      Flush all events below timestamp 4
 *
 *    ============ PASS n + 2 ==============
 *       CPU 0         |   CPU 1
 *                     |
 *    cnt1 timestamps  |   cnt2 timestamps
 *          3          |         8
 *    --> This event will break the assumption and lead to the error described above.
 *          4          |         9
 *          -          |         10
 *

Note that the trace depicted above is perfectly legal (to our
understanding) and we have seen such a trace coming out of perf record.
Unfortunately, we can only reproduce it when running a large proprietary
application for which we cannot provide the trace data. Any attempt to
come up with a small reproducer failed so far. I hope the above
description of the problem is sufficient, though.

The attached patch modifies the session to keep a record of the maximum
timestamp seen per CPU; after reading pass n, samples are flushed up to
the minimum of these per-CPU maxima. The change is small and does not
interfere with any other functionality of perf. The patch is based on
the current Linux version in git.

Please let me know if you need any further information.

Signed-off-by: Joseph Schuchart <joseph.schuchart@tu-dresden.de>

Thanks,
Joseph
-- 
Dipl. Inf. Joseph Schuchart
Computer Scientist

Technische Universität Dresden
Center for Information Services and High Performance Computing (ZIH)
01062 Dresden, Germany

Phone: +49 351 463-36494
Fax: +49 351 463-3773
E-Mail: joseph.schuchart@tu-dresden.de

[-- Attachment #2: perf_timestamp_pass.diff --]
[-- Type: text/x-patch, Size: 4813 bytes --]

diff --git a/tools/perf/util/session.c b/tools/perf/util/session.c
index f36d24a..1d9c81e 100644
--- a/tools/perf/util/session.c
+++ b/tools/perf/util/session.c
@@ -70,6 +70,7 @@ static void perf_session__destroy_kernel_maps(struct perf_session *session)
 struct perf_session *perf_session__new(struct perf_data_file *file,
 				       bool repipe, struct perf_tool *tool)
 {
+	int i;
 	struct perf_session *session = zalloc(sizeof(*session));
 
 	if (!session)
@@ -79,6 +80,8 @@ struct perf_session *perf_session__new(struct perf_data_file *file,
 	INIT_LIST_HEAD(&session->ordered_samples.samples);
 	INIT_LIST_HEAD(&session->ordered_samples.sample_cache);
 	INIT_LIST_HEAD(&session->ordered_samples.to_free);
+	for (i = 0; i < MAX_NR_CPUS; i++)
+		session->ordered_samples.max_timestamps[i] = ULLONG_MAX;
 	machines__init(&session->machines);
 
 	if (file) {
@@ -549,15 +552,24 @@ static int flush_sample_queue(struct perf_session *s,
 	return 0;
 }
 
+static inline void set_next_flush(struct perf_session *session)
+{
+	int i;
+	u64 min_max_timestamp = session->ordered_samples.max_timestamps[0];
+	for (i = 1; i < MAX_NR_CPUS; i++) {
+		if (min_max_timestamp > session->ordered_samples.max_timestamps[i])
+			min_max_timestamp = session->ordered_samples.max_timestamps[i];
+	}
+	session->ordered_samples.next_flush = min_max_timestamp;
+}
+
 /*
  * When perf record finishes a pass on every buffers, it records this pseudo
  * event.
- * We record the max timestamp t found in the pass n.
- * Assuming these timestamps are monotonic across cpus, we know that if
- * a buffer still has events with timestamps below t, they will be all
- * available and then read in the pass n + 1.
- * Hence when we start to read the pass n + 2, we can safely flush every
- * events with timestamps below t.
+ * We record the max timestamp t_i for every cpu found in the pass n and
+ * flush up to min(t_i) after reading the samples of this pass.
+ * All other samples need to be stored until min(t_i) changes
+ * during an upcoming pass.
  *
  *    ============ PASS n =================
  *       CPU 0         |   CPU 1
@@ -565,7 +577,8 @@ static int flush_sample_queue(struct perf_session *s,
  *    cnt1 timestamps  |   cnt2 timestamps
  *          1          |         2
  *          2          |         3
- *          -          |         4  <--- max recorded
+ *          -          |         4
+ *     Flush all events up to timestamp 2
  *
  *    ============ PASS n + 1 ==============
  *       CPU 0         |   CPU 1
@@ -573,29 +586,28 @@ static int flush_sample_queue(struct perf_session *s,
  *    cnt1 timestamps  |   cnt2 timestamps
  *          3          |         5
  *          4          |         6
- *          5          |         7 <---- max recorded
+ *          5          |         7
  *
- *      Flush every events below timestamp 4
+ *      Flush all events up to timestamp 5
  *
  *    ============ PASS n + 2 ==============
  *       CPU 0         |   CPU 1
  *                     |
  *    cnt1 timestamps  |   cnt2 timestamps
  *          6          |         8
- *          7          |         9
+ *          -          |         9
  *          -          |         10
  *
- *      Flush every events below timestamp 7
+ *      Flush all events up to timestamp 6
  *      etc...
  */
 static int process_finished_round(struct perf_tool *tool,
 				  union perf_event *event __maybe_unused,
 				  struct perf_session *session)
 {
-	int ret = flush_sample_queue(session, tool);
-	if (!ret)
-		session->ordered_samples.next_flush = session->ordered_samples.max_timestamp;
-
+	int ret;
+	set_next_flush(session);
+	ret = flush_sample_queue(session, tool);
 	return ret;
 }
 
@@ -612,7 +624,6 @@ static void __queue_event(struct sample_queue *new, struct perf_session *s)
 
 	if (!sample) {
 		list_add(&new->list, &os->samples);
-		os->max_timestamp = timestamp;
 		return;
 	}
 
@@ -626,7 +637,6 @@ static void __queue_event(struct sample_queue *new, struct perf_session *s)
 			p = sample->list.next;
 			if (p == &os->samples) {
 				list_add_tail(&new->list, &os->samples);
-				os->max_timestamp = timestamp;
 				return;
 			}
 			sample = list_entry(p, struct sample_queue, list);
@@ -685,6 +695,8 @@ int perf_session_queue_event(struct perf_session *s, union perf_event *event,
 
 	__queue_event(new, s);
 
+	os->max_timestamps[sample->cpu] = timestamp;
+
 	return 0;
 }
 
diff --git a/tools/perf/util/session.h b/tools/perf/util/session.h
index 50f6409..6e3bc16 100644
--- a/tools/perf/util/session.h
+++ b/tools/perf/util/session.h
@@ -18,7 +18,7 @@ struct thread;
 struct ordered_samples {
 	u64			last_flush;
 	u64			next_flush;
-	u64			max_timestamp;
+	u64			max_timestamps[MAX_NR_CPUS];
 	struct list_head	samples;
 	struct list_head	sample_cache;
 	struct list_head	to_free;

^ permalink raw reply related	[flat|nested] 24+ messages in thread

* Re: [PATCH] Perf: Correct Assumptions about Sample Timestamps in Passes
  2013-11-14  8:07 [PATCH] Perf: Correct Assumptions about Sample Timestamps in Passes Joseph Schuchart
@ 2013-11-14  8:39 ` Ingo Molnar
  2013-11-14  8:59   ` Joseph Schuchart
  0 siblings, 1 reply; 24+ messages in thread
From: Ingo Molnar @ 2013-11-14  8:39 UTC (permalink / raw)
  To: Joseph Schuchart
  Cc: Peter Zijlstra, Paul Mackerras, Ingo Molnar,
	Arnaldo Carvalho de Melo, thomas.ilsche, linux-kernel,
	Frédéric Weisbecker, David Ahern


* Joseph Schuchart <joseph.schuchart@tu-dresden.de> wrote:

> @@ -549,15 +552,24 @@ static int flush_sample_queue(struct perf_session *s,
>  	return 0;
>  }
>  
> +static inline void set_next_flush(struct perf_session *session)
> +{
> +	int i;
> +	u64 min_max_timestamp = session->ordered_samples.max_timestamps[0];
> +	for (i = 1; i < MAX_NR_CPUS; i++) {
> +		if (min_max_timestamp > session->ordered_samples.max_timestamps[i])
> +			min_max_timestamp = session->ordered_samples.max_timestamps[i];
> +	}
> +	session->ordered_samples.next_flush = min_max_timestamp;
> +}

>  static int process_finished_round(struct perf_tool *tool,
>  				  union perf_event *event __maybe_unused,
>  				  struct perf_session *session)
>  {
> -	int ret = flush_sample_queue(session, tool);
> -	if (!ret)
> -		session->ordered_samples.next_flush = session->ordered_samples.max_timestamp;
> -
> +	int ret;
> +	set_next_flush(session);
> +	ret = flush_sample_queue(session, tool);

Just a quick side note, while I realize that you are 
(rightfully!) concerned about correctness primarily, if that loop 
over MAX_NR_CPUS executes often enough then this might hurt 
performance:

   perf.h:#define MAX_NR_CPUS                      256

So it might be better to maintain a rolling min_max_timestamp in 
this place:

+       os->max_timestamps[sample->cpu] = timestamp;

?

If done that way then AFAICS we could even eliminate the 
->max_timestamps[NR_CPUS] array.

Thanks,

	Ingo


* Re: [PATCH] Perf: Correct Assumptions about Sample Timestamps in Passes
  2013-11-14  8:39 ` Ingo Molnar
@ 2013-11-14  8:59   ` Joseph Schuchart
  2013-11-14 10:05     ` Ingo Molnar
  0 siblings, 1 reply; 24+ messages in thread
From: Joseph Schuchart @ 2013-11-14  8:59 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Peter Zijlstra, Paul Mackerras, Ingo Molnar,
	Arnaldo Carvalho de Melo, thomas.ilsche, linux-kernel,
	Frédéric Weisbecker, David Ahern

> Just a quick side note, while I realize that you are 
> (rightfully!) concerned about correctness primarily, if that loop 
> over MAX_NR_CPUS executes often enough then this might hurt 
> performance:
> 
>    perf.h:#define MAX_NR_CPUS                      256
> 
> So it might be better to maintain a rolling min_max_timestamp in 
> this place:
> 
> +       os->max_timestamps[sample->cpu] = timestamp;
> 
> ?
> 
> If done that way then AFAICS we could even eliminate the 
> ->max_timestamps[NR_CPUS] array.

I can understand your performance concerns. However, I am not sure how
we can determine the minimal max_timestamp of all cpus without storing
the information on a per-cpu basis first. Accumulating it on the fly
would only lead to a global max_timestamp. If we had any information
about the number of CPUs, we could limit the number of iterations in
set_next_flush() to a minimum, though. However, that would require
keeping track of the maximum cpu id while reading the samples. Or is
there an easy way to determine the actual number of cpus in the trace?

Thanks
Joseph

> 
> Thanks,
> 
> 	Ingo
> 


* Re: [PATCH] Perf: Correct Assumptions about Sample Timestamps in Passes
  2013-11-14  8:59   ` Joseph Schuchart
@ 2013-11-14 10:05     ` Ingo Molnar
  2013-11-14 14:26       ` David Ahern
  2013-11-21 14:55       ` Joseph Schuchart
  0 siblings, 2 replies; 24+ messages in thread
From: Ingo Molnar @ 2013-11-14 10:05 UTC (permalink / raw)
  To: Joseph Schuchart
  Cc: Peter Zijlstra, Paul Mackerras, Ingo Molnar,
	Arnaldo Carvalho de Melo, thomas.ilsche, linux-kernel,
	Frédéric Weisbecker, David Ahern

* Joseph Schuchart <joseph.schuchart@tu-dresden.de> wrote:


> > Just a quick side note, while I realize that you are 
> > (rightfully!) concerned about correctness primarily, if that loop 
> > over MAX_NR_CPUS executes often enough then this might hurt 
> > performance:
> > 
> >    perf.h:#define MAX_NR_CPUS                      256
> > 
> > So it might be better to maintain a rolling min_max_timestamp in 
> > this place:
> > 
> > +       os->max_timestamps[sample->cpu] = timestamp;
> > 
> > ?
> > 
> > If done that way then AFAICS we could even eliminate the 
> > ->max_timestamps[NR_CPUS] array.
> 
> I can understand your performance concerns. However, I am not 
> sure how we can determine the minimal max_timestamp of all cpus 
> without storing the information on a per-cpu basis first. 
> Accumulating it on the fly would only lead to a global 
> max_timestamp. [...]

Ok. So this:

+static inline void set_next_flush(struct perf_session *session)
+{
+       int i;
+       u64 min_max_timestamp = session->ordered_samples.max_timestamps[0];
+       for (i = 1; i < MAX_NR_CPUS; i++) {
+               if (min_max_timestamp > session->ordered_samples.max_timestamps[i])
+                       min_max_timestamp = session->ordered_samples.max_timestamps[i];
+       }
+       session->ordered_samples.next_flush = min_max_timestamp;
+}

which should IMHO be written in a bit clearer form as:

static inline void set_next_flush(struct perf_session *session)
{
	u64 *timestamps = session->ordered_samples.max_timestamps;
	u64 min_timestamp = timestamps[0];
 	int i;

	for (i = 1; i < MAX_NR_CPUS; i++) {
		if (min_timestamp > timestamps[i])
			min_timestamp = timestamps[i];
	}

	session->ordered_samples.next_flush = min_timestamp;
}

calculates the minimum of the max_timestamps[] array, right?

Now, the max_timestamps[] array gets modified only in a single 
place, from the sample timestamps, via:

	os->max_timestamps[sample->cpu] = timestamp;

My suggestion was an identity transformation: to calculate the 
minimum of the array when the max_timestamps[] array is modified. 
A new minimum happens if the freshly written value is smaller 
than the current minimum.

I.e. the max_timestamps[] array itself is redundant, and we just 
have to update a rolling minimum - which is a (session-) global 
minimum - which is equivalent to the more complex minimum 
calculation in your patch.

What am I missing?

Thanks,

	Ingo


* Re: [PATCH] Perf: Correct Assumptions about Sample Timestamps in Passes
  2013-11-14 10:05     ` Ingo Molnar
@ 2013-11-14 14:26       ` David Ahern
  2013-11-14 14:44         ` Peter Zijlstra
  2013-11-21 14:55       ` Joseph Schuchart
  1 sibling, 1 reply; 24+ messages in thread
From: David Ahern @ 2013-11-14 14:26 UTC (permalink / raw)
  To: Ingo Molnar, Joseph Schuchart, Peter Zijlstra, Frédéric Weisbecker
  Cc: Paul Mackerras, Ingo Molnar, Arnaldo Carvalho de Melo,
	thomas.ilsche, linux-kernel

On 11/14/13, 3:05 AM, Ingo Molnar wrote:

> What am I missing?

I have spent quite a bit of time on this problem as well. I think 
the flush time needs to be based on the start time of each round, not 
the minimum time observed across mmaps. I have tried the minimum 
timestamp route and it still fails often enough to be annoying.

See builtin-kvm.c, perf_kvm__mmap_read(). The problem is that it takes 
time to move from mmap to mmap and samples can come in behind you - an 
mmap that has already been scanned can receive a sample with a timestamp 
less than what was determined to be the lowest minimum for the samples 
actually read. 'perf kvm stat live' in a nested environment is a stress 
test for the problem.

I believe reading perf_clock at the start of each round and using that 
as the flush time will fix the problem (to the degree that perf_clock is 
monotonic across all cpus and sockets).

But, right now we have no means of reading the perf_clock timestamp in 
userspace.

David


* Re: [PATCH] Perf: Correct Assumptions about Sample Timestamps in Passes
  2013-11-14 14:26       ` David Ahern
@ 2013-11-14 14:44         ` Peter Zijlstra
  2013-11-14 15:02           ` David Ahern
  0 siblings, 1 reply; 24+ messages in thread
From: Peter Zijlstra @ 2013-11-14 14:44 UTC (permalink / raw)
  To: David Ahern
  Cc: Ingo Molnar, Joseph Schuchart, Frédéric Weisbecker,
	Paul Mackerras, Ingo Molnar, Arnaldo Carvalho de Melo,
	thomas.ilsche, linux-kernel

On Thu, Nov 14, 2013 at 07:26:06AM -0700, David Ahern wrote:
> On 11/14/13, 3:05 AM, Ingo Molnar wrote:
> 
> >What am I missing?
> 
> I have spent quite a bit of time on this problem on this well. I think the
> flush time needs to be based on the start time of each round, not the
> minimum time observed across mmaps. I have tried the minimum time stamp
> route and it still fails often enough to be annoying.
> 
> See builtin-kvm.c, perf_kvm__mmap_read(). The problem is that it takes time
> to move from mmap to mmap and sample can come in behind you - an mmap that
> has already be scanned with a timestamp less than what is determined to be
> the lowest minimum for the samples actually read. 'perf kvm stat live' in a
> nested environment is a stress test for the problem.

In which case you need the sliding sort window to be n*buf_size, where n
is the number of buffers flushed into the one file. Or move to one file
per buffer and merge sort the lot, buffers should be monotonic.




* Re: [PATCH] Perf: Correct Assumptions about Sample Timestamps in Passes
  2013-11-14 14:44         ` Peter Zijlstra
@ 2013-11-14 15:02           ` David Ahern
  2013-11-14 15:25             ` Peter Zijlstra
  0 siblings, 1 reply; 24+ messages in thread
From: David Ahern @ 2013-11-14 15:02 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Joseph Schuchart, Frédéric Weisbecker,
	Paul Mackerras, Ingo Molnar, Arnaldo Carvalho de Melo,
	thomas.ilsche, linux-kernel

On 11/14/13, 7:44 AM, Peter Zijlstra wrote:
> On Thu, Nov 14, 2013 at 07:26:06AM -0700, David Ahern wrote:
>> On 11/14/13, 3:05 AM, Ingo Molnar wrote:
>>
>>> What am I missing?
>>
>> I have spent quite a bit of time on this problem on this well. I think the
>> flush time needs to be based on the start time of each round, not the
>> minimum time observed across mmaps. I have tried the minimum time stamp
>> route and it still fails often enough to be annoying.
>>
>> See builtin-kvm.c, perf_kvm__mmap_read(). The problem is that it takes time
>> to move from mmap to mmap and sample can come in behind you - an mmap that
>> has already be scanned with a timestamp less than what is determined to be
>> the lowest minimum for the samples actually read. 'perf kvm stat live' in a
>> nested environment is a stress test for the problem.
>
> In which case you need the sliding sort window to be n*buf_size, where n
> is the number of buffers flushed into the one file. Or move to one file
> per buffer and merge sort the lot, buffers should be monotonic.

The one file per buffer does not work for live commands -- like perf-top 
and kvm-stat-live. perf-trace is not using the ordered samples code, but 
I think it needs to - especially for system wide.

David



* Re: [PATCH] Perf: Correct Assumptions about Sample Timestamps in Passes
  2013-11-14 15:02           ` David Ahern
@ 2013-11-14 15:25             ` Peter Zijlstra
  0 siblings, 0 replies; 24+ messages in thread
From: Peter Zijlstra @ 2013-11-14 15:25 UTC (permalink / raw)
  To: David Ahern
  Cc: Ingo Molnar, Joseph Schuchart, Frédéric Weisbecker,
	Paul Mackerras, Ingo Molnar, Arnaldo Carvalho de Melo,
	thomas.ilsche, linux-kernel

On Thu, Nov 14, 2013 at 08:02:48AM -0700, David Ahern wrote:
> On 11/14/13, 7:44 AM, Peter Zijlstra wrote:
> >On Thu, Nov 14, 2013 at 07:26:06AM -0700, David Ahern wrote:
> >>On 11/14/13, 3:05 AM, Ingo Molnar wrote:
> >>
> >>>What am I missing?
> >>
> >>I have spent quite a bit of time on this problem on this well. I think the
> >>flush time needs to be based on the start time of each round, not the
> >>minimum time observed across mmaps. I have tried the minimum time stamp
> >>route and it still fails often enough to be annoying.
> >>
> >>See builtin-kvm.c, perf_kvm__mmap_read(). The problem is that it takes time
> >>to move from mmap to mmap and sample can come in behind you - an mmap that
> >>has already be scanned with a timestamp less than what is determined to be
> >>the lowest minimum for the samples actually read. 'perf kvm stat live' in a
> >>nested environment is a stress test for the problem.
> >
> >In which case you need the sliding sort window to be n*buf_size, where n
> >is the number of buffers flushed into the one file. Or move to one file
> >per buffer and merge sort the lot, buffers should be monotonic.
> 
> The one file per buffer does not work for live commands -- like perf-top and
> kvm-stat-live. perf-trace is not using the ordered samples code, but I think
> it needs to - especially for system wide.

Why would you need files for live things? Just merge sort the buffers
directly.


* Re: [PATCH] Perf: Correct Assumptions about Sample Timestamps in Passes
  2013-11-14 10:05     ` Ingo Molnar
  2013-11-14 14:26       ` David Ahern
@ 2013-11-21 14:55       ` Joseph Schuchart
  2013-11-27 13:51         ` Ingo Molnar
  1 sibling, 1 reply; 24+ messages in thread
From: Joseph Schuchart @ 2013-11-21 14:55 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Peter Zijlstra, Paul Mackerras, Ingo Molnar,
	Arnaldo Carvalho de Melo, thomas.ilsche, linux-kernel,
	Frédéric Weisbecker, David Ahern

Sorry for my delayed reply, it's been a busy week and I really wanted to
give Ingo's idea below some thought. Please find my comments below.

>>> If done that way then AFAICS we could even eliminate the 
>>> ->max_timestamps[NR_CPUS] array.
>>
>> I can understand your performance concerns. However, I am not 
>> sure how we can determine the minimal max_timestamp of all cpus 
>> without storing the information on a per-cpu basis first. 
>> Accumulating it on the fly would only lead to a global 
>> max_timestamp. [...]
> 
> Ok. So this:
> 
> +static inline void set_next_flush(struct perf_session *session)
> +{
> +       int i;
> +       u64 min_max_timestamp = session->ordered_samples.max_timestamps[0];
> +       for (i = 1; i < MAX_NR_CPUS; i++) {
> +               if (min_max_timestamp > session->ordered_samples.max_timestamps[i])
> +                       min_max_timestamp = session->ordered_samples.max_timestamps[i];
> +       }
> +       session->ordered_samples.next_flush = min_max_timestamp;
> +}
> 
> which should IMHO be written in a bit clearer form as:
> 
> static inline void set_next_flush(struct perf_session *session)
> {
> 	u64 *timestamps = session->ordered_samples.max_timestamps;
> 	u64 min_timestamp = timestamps[0];
>  	int i;
> 
> 	for (i = 1; i < MAX_NR_CPUS; i++) {
> 		if (min_timestamp > timestamps[i])
> 			min_timestamp = timestamps[i];
> 	}
> 
> 	session->ordered_samples.next_flush = min_timestamp;
> }

Agreed.

> 
> calculates the minimum of the max_timestamps[] array, right?
> 
> Now, the max_timestamps[] array gets modified only in a single 
> place, from the sample timestamps, via:
> 
> 	os->max_timestamps[sample->cpu] = timestamp;
> 
> My suggestion was an identity transformation: to calculate the 
> minimum of the array when the max_timestamps[] array is modified. 
> A new minimum happens if the freshly written value is smaller 
> than the current minimum.

I am really not sure how this would work since I don't see where we
could make progress while flushing if the max_timestamp is only replaced
with a smaller one but is never increased. IMO, it is necessary to
distinguish between timestamps of different cpus to determine the
next_flush timestamp across all cpus in each pass. I have tried to come
up with a working implementation of your idea but do not see a way to
safely increase the value of max_timestamp (without making assumptions
about the order of timestamps between cpus and passes).

Thinking about it a little more I see two catches in my solution:
(1) a cpu is only considered in determining the next_flush timestamp if
at least one event of this cpu has been processed already. Otherwise the
cpu is unknown and its first timestamp could be below the already
flushed timeslice timestamp. This can also happen in the current
implementation.
(2) Flush progress is only guaranteed while events from all cpus are
read. As soon as there are no events left for a cpu which already had
events in this session, the next_flush will remain at the timestamp of
the last pass which had events from all cpus until the final flush. I am
not sure how likely this is. This does not happen in the current
implementation.

As for David's suggestion to read perf_clock() at the start of every
round as a flush timestamp: from looking at the comments in
kernel/sched/clock.c, local_clock() (which is called by perf_clock) is
not guaranteed to be monotonic across cpus so there could still be
earlier timestamps coming late. Or am I missing something there?

Simply increasing the window to more than two passes in the current
implementation might lower the chance of hitting this problem but does
not guarantee success since passes can be very short (only a few events
from a few cpus) or even be empty.

Cheers
Joseph


* Re: [PATCH] Perf: Correct Assumptions about Sample Timestamps in Passes
  2013-11-21 14:55       ` Joseph Schuchart
@ 2013-11-27 13:51         ` Ingo Molnar
  2013-12-20 12:27           ` Joseph Schuchart
  0 siblings, 1 reply; 24+ messages in thread
From: Ingo Molnar @ 2013-11-27 13:51 UTC (permalink / raw)
  To: Joseph Schuchart
  Cc: Peter Zijlstra, Paul Mackerras, Ingo Molnar,
	Arnaldo Carvalho de Melo, thomas.ilsche, linux-kernel,
	Frédéric Weisbecker, David Ahern


* Joseph Schuchart <joseph.schuchart@tu-dresden.de> wrote:

> Sorry for my delayed reply, it's been a busy week and I really wanted to
> give Ingo's idea below some thought. Please find my comments below.
> 
> >>> If done that way then AFAICS we could even eliminate the 
> >>> ->max_timestamps[NR_CPUS] array.
> >>
> >> I can understand your performance concerns. However, I am not 
> >> sure how we can determine the minimal max_timestamp of all cpus 
> >> without storing the information on a per-cpu basis first. 
> >> Accumulating it on the fly would only lead to a global 
> >> max_timestamp. [...]
> > 
> > Ok. So this:
> > 
> > +static inline void set_next_flush(struct perf_session *session)
> > +{
> > +       int i;
> > +       u64 min_max_timestamp = session->ordered_samples.max_timestamps[0];
> > +       for (i = 1; i < MAX_NR_CPUS; i++) {
> > +               if (min_max_timestamp > session->ordered_samples.max_timestamps[i])
> > +                       min_max_timestamp = session->ordered_samples.max_timestamps[i];
> > +       }
> > +       session->ordered_samples.next_flush = min_max_timestamp;
> > +}
> > 
> > which should IMHO be written in a bit clearer form as:
> > 
> > static inline void set_next_flush(struct perf_session *session)
> > {
> > 	u64 *timestamps = session->ordered_samples.max_timestamps;
> > 	u64 min_timestamp = timestamps[0];
> >  	int i;
> > 
> > 	for (i = 1; i < MAX_NR_CPUS; i++) {
> > 		if (min_timestamp > timestamps[i])
> > 			min_timestamp = timestamps[i];
> > 	}
> > 
> > 	session->ordered_samples.next_flush = min_timestamp;
> > }
> 
> Agreed.
> 
> > 
> > calculates the minimum of the max_timestamps[] array, right?
> > 
> > Now, the max_timestamps[] array gets modified only in a single 
> > place, from the sample timestamps, via:
> > 
> > 	os->max_timestamps[sample->cpu] = timestamp;
> > 
> > My suggestion was an identity transformation: to calculate the 
> > minimum of the array when the max_timestamps[] array is modified. 
> > A new minimum happens if the freshly written value is smaller 
> > than the current minimum.
> 
> I am really not sure how this would work since I don't see where we 
> could make progress while flushing if the max_timestamp is only 
> replaced with a smaller one but is never increased. IMO, it is 
> necessary to distinguish between timestamps of different cpus to 
> determine the next_flush timestamp across all cpus in each pass. I 
> have tried to come up with a working implementation of your idea but 
> do not see a way to safely increase the value of max_timestamp 
> (without making assumptions about the order of timestamps between 
> cpus and passes).

Mine isn't really an 'idea' - I did to the code what I see as an 
identity transformation, a change that does not change the principle or 
the working of the code.

And after the identity transformation your code seems to have 
simplified down significantly - at which point I was wondering.

If what I did is _not_ an identity transformation then please point 
out where I break your logic. (It might easily be some really simple 
misunderstanding on my part.)

Thanks,

	Ingo


* Re: [PATCH] Perf: Correct Assumptions about Sample Timestamps in Passes
  2013-11-27 13:51         ` Ingo Molnar
@ 2013-12-20 12:27           ` Joseph Schuchart
  2013-12-20 17:09             ` David Ahern
  0 siblings, 1 reply; 24+ messages in thread
From: Joseph Schuchart @ 2013-12-20 12:27 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Peter Zijlstra, Paul Mackerras, Ingo Molnar,
	Arnaldo Carvalho de Melo, thomas.ilsche, linux-kernel,
	Frédéric Weisbecker, David Ahern



On 27.11.2013 14:51, Ingo Molnar wrote:
> 
> * Joseph Schuchart <joseph.schuchart@tu-dresden.de> wrote:
> 
>> Sorry for my delayed reply, it's been a busy week and I really wanted to
>> give Ingo's idea below some thought. Please find my comments below.
>>
>>>>> If done that way then AFAICS we could even eliminate the 
>>>>> ->max_timestamps[NR_CPUS] array.
>>>>
>>>> I can understand your performance concerns. However, I am not 
>>>> sure how we can determine the minimal max_timestamp of all cpus 
>>>> without storing the information on a per-cpu basis first. 
>>>> Accumulating it on the fly would only lead to a global 
>>>> max_timestamp. [...]
>>>
>>> Ok. So this:
>>>
>>> +static inline void set_next_flush(struct perf_session *session)
>>> +{
>>> +       int i;
>>> +       u64 min_max_timestamp = session->ordered_samples.max_timestamps[0];
>>> +       for (i = 1; i < MAX_NR_CPUS; i++) {
>>> +               if (min_max_timestamp > session->ordered_samples.max_timestamps[i])
>>> +                       min_max_timestamp = session->ordered_samples.max_timestamps[i];
>>> +       }
>>> +       session->ordered_samples.next_flush = min_max_timestamp;
>>> +}
>>>
>>> which should IMHO be written in a bit clearer form as:
>>>
>>> static inline void set_next_flush(struct perf_session *session)
>>> {
>>> 	u64 *timestamps = session->ordered_samples.max_timestamps;
>>> 	u64 min_timestamp = timestamps[0];
>>>  	int i;
>>>
>>> 	for (i = 1; i < MAX_NR_CPUS; i++) {
>>> 		if (min_timestamp > timestamps[i])
>>> 			min_timestamp = timestamps[i];
>>> 	}
>>>
>>> 	session->ordered_samples.next_flush = min_timestamp;
>>> }
>>
>> Agreed.
>>
>>>
>>> calculates the minimum of the max_timestamps[] array, right?
>>>
>>> Now, the max_timestamps[] array gets modified only in a single 
>>> place, from the sample timestamps, via:
>>>
>>> 	os->max_timestamps[sample->cpu] = timestamp;
>>>
>>> My suggestion was an identity transformation: to calculate the 
>>> minimum of the array when the max_timestamps[] array is modified. 
>>> A new minimum happens if the freshly written value is smaller 
>>> than the current minimum.
>>
>> I am really not sure how this would work since I don't see where we 
>> could make progress while flushing if the max_timestamp is only 
>> replaced with a smaller one but is never increased. IMO, it is 
>> necessary to distinguish between timestamps of different cpus to 
>> determine the next_flush timestamp across all cpus in each pass. I 
>> have tried to come up with a working implementation of your idea but 
>> do not see a way to safely increase the value of max_timestamp 
>> (without making assumptions about the order of timestamps between 
>> cpus and passes).
> 
> Mine isn't really an 'idea' - I did to the code what I see as an identity 
> transformation, a change that does not change the principle or the 
> working of the code.
> 
> And after the identity transformation your code seems to have 
> simplified down significantly - at which point I was wondering.
> 
> If what I did is _not_ an identity transformation then please point 
> out where I break your logic. (it might easily be some really simple 
> misunderstanding on my part.)

I know this comes late, but: As far as I can see, your change does not
preserve the logic of the code I suggested. The idea was to first gather
all the maximum timestamps of all cpus (that is, the last timestamp seen
on each cpu) and then determine the minimum of these maxima. These are
two distinct steps that I think cannot be combined in one update. Your
change would only compute the minimum of all timestamps without
determining the maximum of each cpu, and hence no progress is
guaranteed. Sorry for not being more precise earlier, I hope this
explanation is clearer now :)

Please let me know if you think my explanation is incorrect at some
point or if anything is still unclear.

Thanks
Joseph

> 
> Thanks,
> 
> 	Ingo
> 

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH] Perf: Correct Assumptions about Sample Timestamps in Passes
  2013-12-20 12:27           ` Joseph Schuchart
@ 2013-12-20 17:09             ` David Ahern
  2013-12-23 13:10               ` Frederic Weisbecker
  0 siblings, 1 reply; 24+ messages in thread
From: David Ahern @ 2013-12-20 17:09 UTC (permalink / raw)
  To: Joseph Schuchart, Ingo Molnar
  Cc: Peter Zijlstra, Paul Mackerras, Ingo Molnar,
	Arnaldo Carvalho de Melo, thomas.ilsche, linux-kernel,
	Frédéric Weisbecker

On 12/20/13, 5:27 AM, Joseph Schuchart wrote:
> I know this comes late, but: As far as I can see, your change does not
> preserve the logic of the code I suggested. The idea was to first gather
> all the maximum timestamps of all cpus (that is, the last timestamp seen
> on each cpu) and then determine the minimum of these maxima. These are
> two distinct steps that I think cannot be combined in one update. Your

  A number of people have reported similar problems -- timestamps below 
last flush time. This approach would solve that problem for data 
processed from files, so it would be a good improvement.

David


* Re: [PATCH] Perf: Correct Assumptions about Sample Timestamps in Passes
  2013-12-20 17:09             ` David Ahern
@ 2013-12-23 13:10               ` Frederic Weisbecker
  2013-12-23 14:44                 ` David Ahern
  0 siblings, 1 reply; 24+ messages in thread
From: Frederic Weisbecker @ 2013-12-23 13:10 UTC (permalink / raw)
  To: David Ahern
  Cc: Joseph Schuchart, Ingo Molnar, Peter Zijlstra, Paul Mackerras,
	Ingo Molnar, Arnaldo Carvalho de Melo, thomas.ilsche,
	linux-kernel

On Fri, Dec 20, 2013 at 10:09:53AM -0700, David Ahern wrote:
> On 12/20/13, 5:27 AM, Joseph Schuchart wrote:
> >I know this comes late, but: As far as I can see, your change does not
> >preserve the logic of the code I suggested. The idea was to first gather
> >all the maximum timestamps of all cpus (that is, the last timestamp seen
> >on each cpu) and then determine the minimum of these maxima. These are
> >two distinct steps that I think cannot be combined in one update. Your
> 
>  A number of people have reported similar problems -- timestamps
> below last flush time. This approach would solve that problem for
> data processed from files, so it would be a good improvement.

Could this be close to what you're looking for?

https://lkml.org/lkml/2012/2/18/53


* Re: [PATCH] Perf: Correct Assumptions about Sample Timestamps in Passes
  2013-12-23 13:10               ` Frederic Weisbecker
@ 2013-12-23 14:44                 ` David Ahern
  2013-12-26 15:14                   ` Frederic Weisbecker
  0 siblings, 1 reply; 24+ messages in thread
From: David Ahern @ 2013-12-23 14:44 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Joseph Schuchart, Ingo Molnar, Peter Zijlstra, Paul Mackerras,
	Ingo Molnar, Arnaldo Carvalho de Melo, thomas.ilsche,
	linux-kernel

On 12/23/13, 8:10 AM, Frederic Weisbecker wrote:
> On Fri, Dec 20, 2013 at 10:09:53AM -0700, David Ahern wrote:
>> On 12/20/13, 5:27 AM, Joseph Schuchart wrote:
>>> I know this comes late, but: As far as I can see, your change does not
>>> preserve the logic of the code I suggested. The idea was to first gather
>>> all the maximum timestamps of all cpus (that is, the last timestamp seen
>>> on each cpu) and then determine the minimum of these maxima. These are
>>> two distinct steps that I think cannot be combined in one update. Your
>>
>>   A number of people have reported similar problems -- timestamps
>> below last flush time. This approach would solve that problem for
>> data processed from files, so it would be a good improvement.
>
> Could it be near what you're looking for?
>
> https://lkml.org/lkml/2012/2/18/53
>

Forgot about that patch. It is similar to what Joseph wants for 
analyzing a file.

I was carrying that patch while working on perf-kvm-stat-live last Fall. 
It does not solve the problem for live commands, so I ended up dropping 
it and going with local (to the command) hacks. I still think for live 
commands getting a perf_clock timestamp at the start of a round and 
using that as the flush time will work best.

David


* Re: [PATCH] Perf: Correct Assumptions about Sample Timestamps in Passes
  2013-12-23 14:44                 ` David Ahern
@ 2013-12-26 15:14                   ` Frederic Weisbecker
  2013-12-26 15:24                     ` David Ahern
  0 siblings, 1 reply; 24+ messages in thread
From: Frederic Weisbecker @ 2013-12-26 15:14 UTC (permalink / raw)
  To: David Ahern
  Cc: Joseph Schuchart, Ingo Molnar, Peter Zijlstra, Paul Mackerras,
	Ingo Molnar, Arnaldo Carvalho de Melo, thomas.ilsche,
	linux-kernel

On Mon, Dec 23, 2013 at 09:44:25AM -0500, David Ahern wrote:
> On 12/23/13, 8:10 AM, Frederic Weisbecker wrote:
> >On Fri, Dec 20, 2013 at 10:09:53AM -0700, David Ahern wrote:
> >>On 12/20/13, 5:27 AM, Joseph Schuchart wrote:
> >>>I know this comes late, but: As far as I can see, your change does not
> >>>preserve the logic of the code I suggested. The idea was to first gather
> >>>all the maximum timestamps of all cpus (that is, the last timestamp seen
> >>>on each cpu) and then determine the minimum of these maxima. These are
> >>>two distinct steps that I think cannot be combined in one update. Your
> >>
> >>  A number of people have reported similar problems -- timestamps
> >>below last flush time. This approach would solve that problem for
> >>data processed from files, so it would be a good improvement.
> >
> >Could it be near what you're looking for?
> >
> >https://lkml.org/lkml/2012/2/18/53
> >
> 
> Forgot about that patch. It is similar to what Joseph wants for
> analyzing a file.
> 
> I was carrying that patch while working on perf-kvm-stat-live last
> Fall. It does not solve the problem for live commands, so ended up
> dropping it and going with local (to the command) hacks. I still
> think for live commands getting a perf_clock timestamp at the start
> of a round and using that as the flush time will work best.

Ok, but how would you fetch this perf clock timestamp, with an explicit read?

In the meantime, I can fix and post my old patch, which should solve at least
the perf.data based event stream.


* Re: [PATCH] Perf: Correct Assumptions about Sample Timestamps in Passes
  2013-12-26 15:14                   ` Frederic Weisbecker
@ 2013-12-26 15:24                     ` David Ahern
  2013-12-26 15:30                       ` Frederic Weisbecker
  0 siblings, 1 reply; 24+ messages in thread
From: David Ahern @ 2013-12-26 15:24 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Joseph Schuchart, Ingo Molnar, Peter Zijlstra, Paul Mackerras,
	Ingo Molnar, Arnaldo Carvalho de Melo, thomas.ilsche,
	linux-kernel

On 12/26/13, 10:14 AM, Frederic Weisbecker wrote:
>> I was carrying that patch while working on perf-kvm-stat-live last
>> Fall. It does not solve the problem for live commands, so ended up
>> dropping it and going with local (to the command) hacks. I still
>> think for live commands getting a perf_clock timestamp at the start
>> of a round and using that as the flush time will work best.
>
> Ok, but how would you fetch this perf clock timestamp, with an explicit read?

One option is exporting perf_clock to userspace.

Preferably a built-in option, https://lkml.org/lkml/2013/12/10/781, but 
I can go the module route if I have to: 
https://github.com/dsahern/linux/blob/perf-full-monty/README.ahern

David


* Re: [PATCH] Perf: Correct Assumptions about Sample Timestamps in Passes
  2013-12-26 15:24                     ` David Ahern
@ 2013-12-26 15:30                       ` Frederic Weisbecker
  2014-01-01 18:37                         ` David Ahern
  0 siblings, 1 reply; 24+ messages in thread
From: Frederic Weisbecker @ 2013-12-26 15:30 UTC (permalink / raw)
  To: David Ahern
  Cc: Joseph Schuchart, Ingo Molnar, Peter Zijlstra, Paul Mackerras,
	Ingo Molnar, Arnaldo Carvalho de Melo, thomas.ilsche,
	linux-kernel

On Thu, Dec 26, 2013 at 10:24:03AM -0500, David Ahern wrote:
> On 12/26/13, 10:14 AM, Frederic Weisbecker wrote:
> >>I was carrying that patch while working on perf-kvm-stat-live last
> >>Fall. It does not solve the problem for live commands, so ended up
> >>dropping it and going with local (to the command) hacks. I still
> >>think for live commands getting a perf_clock timestamp at the start
> >>of a round and using that as the flush time will work best.
> >
> >Ok, but how would you fetch this perf clock timestamp, with an explicit read?
> 
> One option is exporting perf_clock to userspace.
> 
> Preferably a built-in option, https://lkml.org/lkml/2013/12/10/781,
> but I can go the module route if I have to:
> https://github.com/dsahern/linux/blob/perf-full-monty/README.ahern

I see, that's one of the more verbose threads I've seen lately ;)


* Re: [PATCH] Perf: Correct Assumptions about Sample Timestamps in Passes
  2013-12-26 15:30                       ` Frederic Weisbecker
@ 2014-01-01 18:37                         ` David Ahern
  2014-01-03 22:07                           ` Frederic Weisbecker
  0 siblings, 1 reply; 24+ messages in thread
From: David Ahern @ 2014-01-01 18:37 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Joseph Schuchart, Ingo Molnar, Peter Zijlstra, Paul Mackerras,
	Ingo Molnar, Arnaldo Carvalho de Melo, thomas.ilsche,
	linux-kernel

On 12/26/13, 8:30 AM, Frederic Weisbecker wrote:
> On Thu, Dec 26, 2013 at 10:24:03AM -0500, David Ahern wrote:
>> On 12/26/13, 10:14 AM, Frederic Weisbecker wrote:
>>>> I was carrying that patch while working on perf-kvm-stat-live last
>>>> Fall. It does not solve the problem for live commands, so ended up
>>>> dropping it and going with local (to the command) hacks. I still
>>>> think for live commands getting a perf_clock timestamp at the start
>>>> of a round and using that as the flush time will work best.

For perf-kvm-stat-live using perf_clock value at the start of the round 
as the flush time works beautifully:

https://github.com/dsahern/linux/commit/ba8b7b63d5dbdc95aedbbafa670c2232e0cc07a2

Never once failed with "Warning: Timestamp below last timeslice flush" 
error.

David


* Re: [PATCH] Perf: Correct Assumptions about Sample Timestamps in Passes
  2014-01-01 18:37                         ` David Ahern
@ 2014-01-03 22:07                           ` Frederic Weisbecker
  2014-01-03 22:45                             ` David Ahern
  0 siblings, 1 reply; 24+ messages in thread
From: Frederic Weisbecker @ 2014-01-03 22:07 UTC (permalink / raw)
  To: David Ahern
  Cc: Joseph Schuchart, Ingo Molnar, Peter Zijlstra, Paul Mackerras,
	Ingo Molnar, Arnaldo Carvalho de Melo, thomas.ilsche,
	linux-kernel

On Wed, Jan 01, 2014 at 11:37:55AM -0700, David Ahern wrote:
> On 12/26/13, 8:30 AM, Frederic Weisbecker wrote:
> >On Thu, Dec 26, 2013 at 10:24:03AM -0500, David Ahern wrote:
> >>On 12/26/13, 10:14 AM, Frederic Weisbecker wrote:
> >>>>I was carrying that patch while working on perf-kvm-stat-live last
> >>>>Fall. It does not solve the problem for live commands, so ended up
> >>>>dropping it and going with local (to the command) hacks. I still
> >>>>think for live commands getting a perf_clock timestamp at the start
> >>>>of a round and using that as the flush time will work best.
> 
> For perf-kvm-stat-live using perf_clock value at the start of the
> round as the flush time works beautifully:
> 
> https://github.com/dsahern/linux/commit/ba8b7b63d5dbdc95aedbbafa670c2232e0cc07a2
> 
> Never once failed with "Warning: Timestamp below last timeslice
> flush" error.

I'm not sure I understand why we need that. Why doesn't it work by simply flushing
events prior to the earliest timestamp among every CPU's last event?

I can see one remaining issue when an event interrupts another on a CPU. If the
interrupt happens between perf_prepare_sample() -> perf_clock() and
perf_output_begin(), we may have locally non-monotonic timestamps in a CPU buffer.

That could be solved with a heuristic similar to yours: flush events up to a few
millisecs before the barrier, since interrupts are supposed to be short. Or we could
move the perf_clock() event snapshot to perf_output_sample() to make sure that the
event space is reserved before we get the timestamp, so that interrupting events,
which have later timestamps, are guaranteed to be past the interrupted event in
the stream.


* Re: [PATCH] Perf: Correct Assumptions about Sample Timestamps in Passes
  2014-01-03 22:07                           ` Frederic Weisbecker
@ 2014-01-03 22:45                             ` David Ahern
  2014-01-04 15:05                               ` Frederic Weisbecker
  0 siblings, 1 reply; 24+ messages in thread
From: David Ahern @ 2014-01-03 22:45 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Joseph Schuchart, Ingo Molnar, Peter Zijlstra, Paul Mackerras,
	Ingo Molnar, Arnaldo Carvalho de Melo, thomas.ilsche,
	linux-kernel

On 1/3/14, 3:07 PM, Frederic Weisbecker wrote:
> On Wed, Jan 01, 2014 at 11:37:55AM -0700, David Ahern wrote:
>> On 12/26/13, 8:30 AM, Frederic Weisbecker wrote:
>>> On Thu, Dec 26, 2013 at 10:24:03AM -0500, David Ahern wrote:
>>>> On 12/26/13, 10:14 AM, Frederic Weisbecker wrote:
>>>>>> I was carrying that patch while working on perf-kvm-stat-live last
>>>>>> Fall. It does not solve the problem for live commands, so ended up
>>>>>> dropping it and going with local (to the command) hacks. I still
>>>>>> think for live commands getting a perf_clock timestamp at the start
>>>>>> of a round and using that as the flush time will work best.
>>
>> For perf-kvm-stat-live using perf_clock value at the start of the
>> round as the flush time works beautifully:
>>
>> https://github.com/dsahern/linux/commit/ba8b7b63d5dbdc95aedbbafa670c2232e0cc07a2
>>
>> Never once failed with "Warning: Timestamp below last timeslice
>> flush" error.
>
> I'm not sure I understand why we need that. Why doesn't it work by simply flushing
> events prior to the earliest timestamp among every CPU's last event?

Here's one scenario. Consider N-mmaps:

        |----- t_flush
        v
0   -----|---x------------------------
1   -----|----|------------------------
...      |
N   -----|-------ssss-|-----------------

      t_start t_1 ... t_N

You start a round at some time -- t_start. By starting a round it means 
you go to mmap 0 and check for events, then mmap 1, ..., mmap N. It 
takes a finite amount of time to move from one mmap to another.

Assume there are no events on mmap 0, 1, ... N-1 but samples are 
generated in mmap N. In the time it takes to move forward from 0 to N, a 
sample can be generated for mmap 0 and written to the buffer - the 'x' 
above. It now carries a timestamp lower than the samples on any other 
mmap, and out pops the flush error.

perf-kvm can have over 650,000 events per second and those tend to come 
in bunches on a single mmap. So even if you go for a "max of the min 
times across mmaps" it is often wrong.

The non-perf_clock logic in perf-kvm uses the min time across all mmaps 
and even it occasionally fails with the flush error.

David



* Re: [PATCH] Perf: Correct Assumptions about Sample Timestamps in Passes
  2014-01-03 22:45                             ` David Ahern
@ 2014-01-04 15:05                               ` Frederic Weisbecker
  2014-01-08 21:48                                 ` David Ahern
  0 siblings, 1 reply; 24+ messages in thread
From: Frederic Weisbecker @ 2014-01-04 15:05 UTC (permalink / raw)
  To: David Ahern
  Cc: Joseph Schuchart, Ingo Molnar, Peter Zijlstra, Paul Mackerras,
	Ingo Molnar, Arnaldo Carvalho de Melo, thomas.ilsche,
	linux-kernel

On Fri, Jan 03, 2014 at 03:45:36PM -0700, David Ahern wrote:
> On 1/3/14, 3:07 PM, Frederic Weisbecker wrote:
> >I'm not sure I understand why we need that. Why doesn't it work by simply flushing
> >events prior to the earliest timestamp among every CPUs last event?
> 
> Here's one scenario. Consider N-mmaps:
> 
>        |----- t_flush
>        v
> 0   -----|---x------------------------
> 1   -----|----|------------------------
> ...      |
> N   -----|-------ssss-|-----------------
> 
>      t_start t_1 ... t_N
> 
> You start a round at some time -- t_start. By starting a round it
> means you go to mmap 0 and check for events, then mmap 1, ..., mmap
> N. It takes a finite amount of time to move from one mmap to
> another.
> 
> Assume there are no events on mmap 0, 1, ... N-1 but samples are
> generated in mmap N. In the time it takes to move forward from 0 to
> N, a sample can be generated for mmap 0 and written to the buffer -
> the 'x' above. It now contains a timestamp < than samples on any
> other mmap and out pops the flush error.

Lets reformulate as following. I'm copy-pasting the example in session.c
but adapting it to your scenario.

 *    ============ PASS n =================
 *       CPU 0         |   CPU 1
 *                     |
 *          -          |         2
 *          -          |         3
 *          -          |         4  <--- max recorded
 *
 *    ============ PASS n + 1 ==============
 *       CPU 0         |   CPU 1
 *                     |
 *          1          |         5
 *          -          |         6
 *          -          |         7 <---- max recorded
 *
 *      Flush every events below timestamp 4
 *

So in the first round, CPU 0 has no event by the time we read it. Then while
we read the events from CPU 1 (val 2, 3, 4), a new one comes in concurrently
in CPU 0 (val 1, which matches 'x' in your example). We missed it due to the
linear mmap read on all buffers, so we'll get it on the second round.

We find it in the second round, where CPU 1 also has new events. At this point
we know that if CPU 0 had events up to timestamp 4, we would have seen all of
them, because we read the CPU 0 buffer in PASS n + 1 after we read the CPU 1
buffer in PASS n.

Of course that's what happens in a perfect world, under the assumption that
ordering is all correct, that event write/commit doesn't take too long to
complete, and that perf_clock() is globally monotonic (which it is not, IIUC).
But a small heuristic correction on the timestamp barrier should prevent
issues with that.

So this is how the code behaves currently, and it should handle a case like
the above. Now there is still the problem of:

1) local timestamps not moving forward (could it happen when events come in a
storm, for example when they overflow multiple times at once and the clock is
not granular enough?)
Anyway this should be solved with the patch that takes the earliest last event
on all CPU buffers instead of the maximum of a round as a barrier.

2) local timestamps not monotonic due to interrupting events. This could be
fixed in the kernel by moving the perf_clock() snapshot into
perf_output_sample().


* Re: [PATCH] Perf: Correct Assumptions about Sample Timestamps in Passes
  2014-01-04 15:05                               ` Frederic Weisbecker
@ 2014-01-08 21:48                                 ` David Ahern
  2014-01-09 15:19                                   ` Frederic Weisbecker
  0 siblings, 1 reply; 24+ messages in thread
From: David Ahern @ 2014-01-08 21:48 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Joseph Schuchart, Ingo Molnar, Peter Zijlstra, Paul Mackerras,
	Ingo Molnar, Arnaldo Carvalho de Melo, thomas.ilsche,
	linux-kernel

On 1/4/14, 8:05 AM, Frederic Weisbecker wrote:
> On Fri, Jan 03, 2014 at 03:45:36PM -0700, David Ahern wrote:
>> On 1/3/14, 3:07 PM, Frederic Weisbecker wrote:
>>> I'm not sure I understand why we need that. Why doesn't it work by simply flushing
>>> events prior to the earliest timestamp among every CPU's last event?
>>
>> Here's one scenario. Consider N-mmaps:
>>
>>         |----- t_flush
>>         v
>> 0   -----|---x------------------------
>> 1   -----|----|------------------------
>> ...      |
>> N   -----|-------ssss-|-----------------
>>
>>       t_start t_1 ... t_N
>>
>> You start a round at some time -- t_start. By starting a round it
>> means you go to mmap 0 and check for events, then mmap 1, ..., mmap
>> N. It takes a finite amount of time to move from one mmap to
>> another.
>>
>> Assume there are no events on mmap 0, 1, ... N-1 but samples are
>> generated in mmap N. In the time it takes to move forward from 0 to
>> N, a sample can be generated for mmap 0 and written to the buffer -
>> the 'x' above. It now contains a timestamp < than samples on any
>> other mmap and out pops the flush error.
>
> Lets reformulate as following. I'm copy-pasting the example in session.c
> but adapting it to your scenario.
>
>   *    ============ PASS n =================
>   *       CPU 0         |   CPU 1
>   *                     |
>   *          -          |         2
>   *          -          |         3
>   *          -          |         4  <--- max recorded
>   *
>   *    ============ PASS n + 1 ==============
>   *       CPU 0         |   CPU 1
>   *                     |
>   *          1          |         5
>   *          -          |         6
>   *          -          |         7 <---- max recorded
>   *
>   *      Flush every events below timestamp 4
>   *
>
> So in the first round, CPU 0 has no event by the time we read it. Then while
> we read the events from CPU 1 (val 2,3,4), a new one comes in concurrently
> in CPU 0 (val 1, which matches 'x' in your example). We missed it due to the
> linear mmap read on all buffers so we'll get it on the second round.
>
> We find it out in the second round, CPU 1 has also new events. At this time we know that
> if CPU 0 had events up to timestamp 4, we should have seen all of them because
> we read CPU 0 buffer in PASS n + 1 after we read CPU 1 buffer on PASS n.
>
> Of course that's what happens in a perfect world with the assumption that ordering
> is all correct, that events write/commit doesn't take too much time to complete,
> that perf_clock() is globally monotonic (and it's not IIUC). But a little heuristical
> correction on the timestamp barrier should prevent from issues with that.
>
> So this is how the code behaves currently and it should handle a case like above.

The existing code does not work. Your unstable tsc patch did not work. I 
have not tried Joseph's patch. Are you proposing that one or do you have 
something else in mind?

> Now there is still the problem of:
>
> 1) local timestamps not moving forward (could it happen when events happen in storm,
> when they overflow multiple times in once for example, and clock is not granular
> enough?)

Even at 650k events/sec I am not seeing this problem.

> Anyway this should be solved with the patch that takes the earliest last event on all
> CPU buffer instead of the maximum of a round as a barrier.
>
> 2) local timestamps not monotonic due to interrupting events. This could be fixed
> in the kernel with moving perf_clock() snapshot in perf_output_sample().
>

For perf-kvm the events are all tracepoints, so there should not be a 
problem of overlap due to interruption.

David


* Re: [PATCH] Perf: Correct Assumptions about Sample Timestamps in Passes
  2014-01-08 21:48                                 ` David Ahern
@ 2014-01-09 15:19                                   ` Frederic Weisbecker
  2014-01-12 15:46                                     ` David Ahern
  0 siblings, 1 reply; 24+ messages in thread
From: Frederic Weisbecker @ 2014-01-09 15:19 UTC (permalink / raw)
  To: David Ahern
  Cc: Joseph Schuchart, Ingo Molnar, Peter Zijlstra, Paul Mackerras,
	Ingo Molnar, Arnaldo Carvalho de Melo, thomas.ilsche,
	linux-kernel

On Wed, Jan 08, 2014 at 02:48:37PM -0700, David Ahern wrote:
> The existing code does not work. Your unstable tsc patch did not
> work. I have not tried Joseph's patch. Are you proposing that one or
> do you have something else in mind?

I think we should integrate Joseph's patch (or mine, or some mixup, I mean
they do about the same IIRC) as it solves known and understood bugs in any case.

Then we need to check what is the real issue in your case.

> 
> >Now there is still the problem of:
> >
> >1) local timestamps not moving forward (could it happen when events happen in storm,
> >when they overflow multiple times in once for example, and clock is not granular
> >enough?)
> 
> Even at 650k events/sec I am not seeing this problem.

Yeah, it happens mostly when a single event, supposed to overflow on a period
of 1, triggers with a higher period. This is the case of the sched stat
runtime tracepoint, for example, because it is a weighted tracepoint (see
perf_count). So it demuxes into gazillions of events all having very close
timestamps. But normal tracepoints shouldn't have this problem.

> 
> >Anyway this should be solved with the patch that takes the earliest last event on all
> >CPU buffer instead of the maximum of a round as a barrier.
> >
> >2) local timestamps not monotonic due to interrupting events. This could be fixed
> >in the kernel with moving perf_clock() snapshot in perf_output_sample().
> >
> 
> For perf-kvm the events are all tracepoints, so there should not be
> a problem of overlap due to interruption.

Nope, I'm curious what kind of issue happens with kvm events. Could you send me a perf.data
that has this ordering problem?

Thanks.


* Re: [PATCH] Perf: Correct Assumptions about Sample Timestamps in Passes
  2014-01-09 15:19                                   ` Frederic Weisbecker
@ 2014-01-12 15:46                                     ` David Ahern
  0 siblings, 0 replies; 24+ messages in thread
From: David Ahern @ 2014-01-12 15:46 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Joseph Schuchart, Ingo Molnar, Peter Zijlstra, Paul Mackerras,
	Ingo Molnar, Arnaldo Carvalho de Melo, thomas.ilsche,
	linux-kernel

On 1/9/14, 8:19 AM, Frederic Weisbecker wrote:
> On Wed, Jan 08, 2014 at 02:48:37PM -0700, David Ahern wrote:
>> The existing code does not work. Your unstable tsc patch did not
>> work. I have not tried Joseph's patch. Are you proposing that one or
>> do you have something else in mind?
>
> I think we should integrate Joseph's patch (or mine, or some mixup, I mean
> they do about the same IIRC) as it solves known and understood bugs in any case.
>
> Then we need to check what is the real issue in your case.
>
>>
>>> Now there is still the problem of:
>>>
>>> 1) local timestamps not moving forward (could it happen when events happen in storm,
>>> when they overflow multiple times in once for example, and clock is not granular
>>> enough?)
>>
>> Even at 650k events/sec I am not seeing this problem.
>
> Yeah it happens mostly when a single event, supposed to overflow on period of 1, trigger
> with a higher period. This is the case of sched stat runtime tracepoints for example
> because it is a weighted tracepoint (see perf_count). So it demux into gazillions of
> events all having very close timestamps. But normal tracepoints shouldn't have this problem.
>
>>
>>> Anyway this should be solved with the patch that takes the earliest last event on all
>>> CPU buffer instead of the maximum of a round as a barrier.
>>>
>>> 2) local timestamps not monotonic due to interrupting events. This could be fixed
>>> in the kernel with moving perf_clock() snapshot in perf_output_sample().
>>>
>>
>> For perf-kvm the events are all tracepoints, so there should not be
>> a problem of overlap due to interruption.
>
> Nope, I'm curious what kind of issue happens with kvm events. Could you send me a perf.data
> that has this ordering problem?

There's no data file in this case; it's perf kvm stat live. For this use 
case the flush code in session.c causes the 'timestamp below last flush' 
error really quickly. That's what led me down the road of tweaking the 
flush time in builtin-kvm.c. With the existing tweaks perf-kvm-stat-live 
runs for a lot longer between timeslice failures. With my perf_clock 
based flush time I have yet to hit the 'Timestamp below last timeslice 
flush' error.

David




end of thread

Thread overview: 24+ messages
2013-11-14  8:07 [PATCH] Perf: Correct Assumptions about Sample Timestamps in Passes Joseph Schuchart
2013-11-14  8:39 ` Ingo Molnar
2013-11-14  8:59   ` Joseph Schuchart
2013-11-14 10:05     ` Ingo Molnar
2013-11-14 14:26       ` David Ahern
2013-11-14 14:44         ` Peter Zijlstra
2013-11-14 15:02           ` David Ahern
2013-11-14 15:25             ` Peter Zijlstra
2013-11-21 14:55       ` Joseph Schuchart
2013-11-27 13:51         ` Ingo Molnar
2013-12-20 12:27           ` Joseph Schuchart
2013-12-20 17:09             ` David Ahern
2013-12-23 13:10               ` Frederic Weisbecker
2013-12-23 14:44                 ` David Ahern
2013-12-26 15:14                   ` Frederic Weisbecker
2013-12-26 15:24                     ` David Ahern
2013-12-26 15:30                       ` Frederic Weisbecker
2014-01-01 18:37                         ` David Ahern
2014-01-03 22:07                           ` Frederic Weisbecker
2014-01-03 22:45                             ` David Ahern
2014-01-04 15:05                               ` Frederic Weisbecker
2014-01-08 21:48                                 ` David Ahern
2014-01-09 15:19                                   ` Frederic Weisbecker
2014-01-12 15:46                                     ` David Ahern
