Re: [PATCH] perf tools: Fix ordering with unstable tsc

From: Stephane Eranian <eranian@google.com>
To: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>,
	LKML <linux-kernel@vger.kernel.org>,
	David Ahern <dsahern@gmail.com>,
	Peter Zijlstra <peterz@infradead.org>,
	Ingo Molnar <mingo@elte.hu>
Subject: Re: [PATCH] perf tools: Fix ordering with unstable tsc
Date: Wed, 22 Feb 2012 16:35:54 +0100
Message-ID: <CABPqkBRDLrK=GvXWTZob_EQ2dqnWJg5GeWMYFj0N5t8z265Y9Q@mail.gmail.com>
In-Reply-To: <1329583837-7469-1-git-send-email-fweisbec@gmail.com>


On Sat, Feb 18, 2012 at 5:50 PM, Frederic Weisbecker <fweisbec@gmail.com> wrote:
> On a system with a TSC considered as unstable, one can encounter this
> kind of warning:
>
>     $ perf sched rec pong 2
>     $ perf sched lat
>     Warning: Timestamp below last timeslice flush
>
> This happens when trace events trigger with a potentially high period,
> such as sched_stat_sleep, sched_stat_runtime, sched_stat_wait, etc.
> The perf event core then implements that weight by sending as many events
> as the given period, for example as many as the time the task has been
> sleeping in the sched_stat_sleep event.
>
> If this happens while irqs are disabled with an unstable tsc and this takes
> more time than a jiffy, then the timestamps of the events get stuck at
> the value of that next jiffy because sched_clock_local() bounds the timestamp
> to that maximum. The local timer tick is supposed to update that boundary but
> it can't given that irqs are disabled.
>
> We can then meet this kind of scenario in perf record:
>
> ===== CPU 0 =====      ==== CPU 1 ====
>
>              PASS n
>     ...                    ...
>      1                      1
>      1                      2
>      1                      3 <-- max recorded
>
>           finished round event
>            PASS n + 1
>
>      1                      4
>      1                      5
>      1                      6
>
>           finished round event
>            PASS n + 2
>
>      1                      7
>     ...                    ...
>
> CPU 0 is stuck sending events with irqs disabled and with the stale
> timestamp. When we reorder the events, for perf script for example,
> we flush all the events before timestamp 3 when we reach PASS n + 2,
> considering that we can no longer see timestamps below 3.
> But we still do have timestamps below 3 on PASS n + 2.
>
> To solve that issue, instead of considering that timestamps are globally
> monotonic, we assume they are locally monotonic. Instead of recording
> the max timestamp on each pass, we check the max one per CPU on each
> pass and keep the smallest over these as the new barrier up to which
> we flush the events on the PASS n + 2. This still relies on a bit of
> global monotonicity because if some CPU doesn't have events in PASS n,
> we expect it not to have events in PASS n + 2 past the barrier recorded
> in PASS n. So this is still not a totally robust ordering but it's still
> better than what we had before.
>
> The only way to have a deterministic and solid ordering will be to use
> per cpu perf.data files.
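
To make the scenario above concrete, here is a minimal standalone sketch of the two barrier choices. It is not the perf code: the function names and the NO_EVENT marker are made up for illustration, and it assumes each CPU's last timestamp in a pass is also its max (local monotonicity).

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical marker meaning "this CPU queued no event during the pass". */
#define NO_EVENT UINT64_MAX

/* Old heuristic: the barrier is the largest timestamp seen in the pass. */
static uint64_t barrier_global_max(const uint64_t *last_ts, size_t nr_cpus)
{
	uint64_t max = 0;

	for (size_t i = 0; i < nr_cpus; i++)
		if (last_ts[i] != NO_EVENT && last_ts[i] > max)
			max = last_ts[i];
	return max;
}

/* Patched heuristic: the barrier is the smallest per-CPU last timestamp. */
static uint64_t barrier_percpu_min(const uint64_t *last_ts, size_t nr_cpus)
{
	uint64_t min = NO_EVENT;

	for (size_t i = 0; i < nr_cpus; i++)
		if (last_ts[i] < min)
			min = last_ts[i];
	/* NO_EVENT here means no CPU had events: keep the previous barrier. */
	return min;
}
```

With the PASS n values from the diagram (CPU 0 ends at 1, CPU 1 ends at 3), the global max gives a barrier of 3, so the events CPU 0 still stamps with 1 in PASS n + 2 arrive below it. The per-CPU min gives 1, and those late events are no longer below the barrier.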
>
> Reported-by: Stephane Eranian <eranian@google.com>
> Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
> Cc: David Ahern <dsahern@gmail.com>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Cc: Stephane Eranian <eranian@google.com>
> Cc: Ingo Molnar <mingo@elte.hu>
> Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
> ---
>  tools/perf/util/evsel.c   |    5 +-
>  tools/perf/util/session.c |  146 +++++++++++++++++++++++++++++++++-----------
>  tools/perf/util/session.h |    3 +-
>  3 files changed, 115 insertions(+), 39 deletions(-)
>
> diff --git a/tools/perf/util/evsel.c b/tools/perf/util/evsel.c
> index 302d49a..1c8eb4b 100644
> --- a/tools/perf/util/evsel.c
> +++ b/tools/perf/util/evsel.c
> @@ -119,9 +119,12 @@ void perf_evsel__config(struct perf_evsel *evsel, struct perf_record_opts *opts)
>  	if (opts->raw_samples) {
>  		attr->sample_type	|= PERF_SAMPLE_TIME;
>  		attr->sample_type	|= PERF_SAMPLE_RAW;
> -		attr->sample_type	|= PERF_SAMPLE_CPU;
>  	}
>
I don't get this bit here. You may want CPU information when capturing
in raw + per-thread mode.


> +	/* Need to know the CPU for tools that need to order events */
> +	if (attr->sample_type & PERF_SAMPLE_TIME)
> +		attr->sample_type	|= PERF_SAMPLE_CPU;
> +
>  	if (opts->no_delay) {
>  		attr->watermark = 0;
>  		attr->wakeup_events = 1;
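
Reading the two evsel.c hunks together, the flag logic can be sketched standalone as below. This is only an illustration under assumptions: the SKETCH_* names and sketch_config() are hypothetical, the bit values are meant to mirror PERF_SAMPLE_TIME, PERF_SAMPLE_CPU and PERF_SAMPLE_RAW from linux/perf_event.h, and the rest of perf_evsel__config() is left out.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical local copies of the sample_type bits used in the hunks. */
enum {
	SKETCH_SAMPLE_TIME = 1U << 2,
	SKETCH_SAMPLE_CPU  = 1U << 7,
	SKETCH_SAMPLE_RAW  = 1U << 10,
};

/* Sketch of the sample_type handling with both hunks applied. */
static uint64_t sketch_config(uint64_t sample_type, bool raw_samples)
{
	if (raw_samples) {
		sample_type |= SKETCH_SAMPLE_TIME;
		sample_type |= SKETCH_SAMPLE_RAW;
	}

	/* Any configuration that records time also records the CPU. */
	if (sample_type & SKETCH_SAMPLE_TIME)
		sample_type |= SKETCH_SAMPLE_CPU;

	return sample_type;
}
```

In this sketch raw mode still ends up with the CPU bit, because raw sets TIME first. Whether the real function behaves the same depends on where the TIME bit is set relative to the new check in the full source.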
> diff --git a/tools/perf/util/session.c b/tools/perf/util/session.c
> index 9f833cf..f297342 100644
> --- a/tools/perf/util/session.c
> +++ b/tools/perf/util/session.c
> @@ -494,6 +494,8 @@ static void perf_session_free_sample_buffers(struct perf_session *session)
>  		list_del(&sq->list);
>  		free(sq);
>  	}
> +
> +	free(os->last_cpu_timestamp);
>  }
>
>  static int perf_session_deliver_event(struct perf_session *session,
> @@ -549,56 +551,89 @@ static void flush_sample_queue(struct perf_session *s,
>  }
>
>  /*
> - * When perf record finishes a pass on every buffers, it records this pseudo
> - * event.
> - * We record the max timestamp t found in the pass n.
> - * Assuming these timestamps are monotonic across cpus, we know that if
> - * a buffer still has events with timestamps below t, they will be all
> - * available and then read in the pass n + 1.
> - * Hence when we start to read the pass n + 2, we can safely flush every
> - * events with timestamps below t.
> + * We make the assumption that timestamps are not globally monotonic but locally
> + * non-strictly monotonic. In practice, this is because if we are dealing with a
> + * machine with unstable TSC, the kernel bounds the result of the tsc between
> + * last_tick_time < tsc < next_tick_time. Thus, if a CPU disables interrupts for more
> + * than one jiffy, all of its timestamps will be equal to next_tick_time after we
> + * cross that jiffy, without any further progress whereas the other CPUs continue
> + * with normal timestamps. This can happen if a CPU sends crazillions of events
> + * while interrupts are disabled. But there are potentially other random scenarios
> + * with unstable TSC that drive us to assume the monotonicity of time only per CPU
> + * and not globally.
> + *
> + * To solve this, when perf record finishes a round of writes on every buffer, it
> + * records a pseudo event named "finished round". The frame of events that happen
> + * between two finished rounds is called a "pass".
> + * We record the max timestamp T[cpu] per CPU found over the events in the pass n.
> + * Then when we finish a round, we iterate over these T[cpu] and keep the smallest
> + * one: min(T).
> + *
> + * Assuming these timestamps are locally monotonic (non strictly), we can flush all
> + * queued events having a timestamp below min(T) when we start to process PASS n + 1.
> + * But we actually wait until we start PASS n + 2 in case a CPU did not have any
> + * event in PASS n but came in PASS n + 1 with events below min(T). We truly
> + * hope no CPU will come with events below min(T) after pass n + 1. This
> + * heuristically relies on some minimal global consistency. This should work in most
> + * real world cases; the only way to ensure a truly safe ordering with regular
> + * flush will be to switch to per CPU record files.
> + *
>   *
> - *    ============ PASS n =================
> - *       CPU 0         |   CPU 1
> - *                     |
> - *    cnt1 timestamps  |   cnt2 timestamps
> - *          1          |         2
> - *          2          |         3
> - *          -          |         4  <--- max recorded
> + *    ========================== PASS n ============================
> + *       CPU 0                   |   CPU 1
> + *                               |
> + *    cnt1 timestamps            |   cnt2 timestamps
> + *          1                    |         2
> + *          2 <--- max recorded  |         3
> + *          -                    |         4 <--- max recorded
> + *                          min(T) = 2
>   *
> - *    ============ PASS n + 1 ==============
> - *       CPU 0         |   CPU 1
> - *                     |
> - *    cnt1 timestamps  |   cnt2 timestamps
> - *          3          |         5
> - *          4          |         6
> - *          5          |         7 <---- max recorded
> + *    ========================== PASS n + 1 ========================
> + *       CPU 0                   |   CPU 1
> + *                               |
> + *    cnt1 timestamps            |   cnt2 timestamps
> + *          3                    |         5
> + *          4                    |         6
> + *          5 <--- max recorded  |         7 <---- max recorded
> + *                          min(T) = 5
>   *
> - *      Flush every events below timestamp 4
> + *                Flush every events below timestamp 2
>   *
> - *    ============ PASS n + 2 ==============
> - *       CPU 0         |   CPU 1
> - *                     |
> - *    cnt1 timestamps  |   cnt2 timestamps
> - *          6          |         8
> - *          7          |         9
> - *          -          |         10
> + *    ========================== PASS n + 2 ========================
> + *       CPU 0                   |   CPU 1
> + *                               |
> + *    cnt1 timestamps            |   cnt2 timestamps
> + *          6                    |         8
> + *          7                    |         9
> + *          -                    |         10
>   *
> - *      Flush every events below timestamp 7
> - *      etc...
> + *                Flush every events below timestamp 5, etc...
>   */
>  static int process_finished_round(struct perf_tool *tool,
>  				  union perf_event *event __used,
>  				  struct perf_session *session)
>  {
> +	unsigned int i;
> +	u64 min = ULLONG_MAX;
> +	struct ordered_samples *os = &session->ordered_samples;
> +
>  	flush_sample_queue(session, tool);
> -	session->ordered_samples.next_flush = session->ordered_samples.max_timestamp;
> +
> +	for (i = 0; i < session->nr_cpus; i++) {
> +		if (os->last_cpu_timestamp[i] < min)
> +			min = os->last_cpu_timestamp[i];
> +
> +		os->last_cpu_timestamp[i] = ULLONG_MAX;
> +	}
> +
> +	if (min != ULLONG_MAX)
> +		os->next_flush = min;
>
>  	return 0;
>  }
>
>  /* The queue is ordered by time */
> -static void __queue_event(struct sample_queue *new, struct perf_session *s)
> +static void __queue_event(struct sample_queue *new, struct perf_session *s, int cpu)
>  {
>  	struct ordered_samples *os = &s->ordered_samples;
>  	struct sample_queue *sample = os->last_sample;
> @@ -607,10 +642,10 @@ static void __queue_event(struct sample_queue *new, struct perf_session *s)
>
>  	++os->nr_samples;
>  	os->last_sample = new;
> +	os->last_cpu_timestamp[cpu] = timestamp;
>
>  	if (!sample) {
>  		list_add(&new->list, &os->samples);
> -		os->max_timestamp = timestamp;
>  		return;
>  	}
>
> @@ -624,7 +659,6 @@ static void __queue_event(struct sample_queue *new, struct perf_session *s)
>  			p = sample->list.next;
>  			if (p == &os->samples) {
>  				list_add_tail(&new->list, &os->samples);
> -				os->max_timestamp = timestamp;
>  				return;
>  			}
>  			sample = list_entry(p, struct sample_queue, list);
> @@ -643,6 +677,34 @@ static void __queue_event(struct sample_queue *new, struct perf_session *s)
>  	}
>  }
>
> +static int alloc_cpus_timestamp_array(struct perf_session *s,
> +				      struct perf_sample *sample,
> +				      struct ordered_samples *os)
> +{
> +	int i;
> +	int nr_cpus;
> +
> +	if (sample->cpu < s->nr_cpus)
> +		return 0;
> +
> +	nr_cpus = sample->cpu + 1;
> +
> +	if (!os->last_cpu_timestamp)
> +		os->last_cpu_timestamp = malloc(sizeof(u64) * nr_cpus);
> +	else
> +		os->last_cpu_timestamp = realloc(os->last_cpu_timestamp,
> +						 sizeof(u64) * nr_cpus);
> +	if (!os->last_cpu_timestamp)
> +		return -ENOMEM;
> +
> +	for (i = s->nr_cpus; i < nr_cpus; i++)
> +		os->last_cpu_timestamp[i] = ULLONG_MAX;
> +
> +	s->nr_cpus = nr_cpus;
> +
> +	return 0;
> +}
> +
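
For comparison, a grow-on-demand helper along these lines could keep the old array when realloc() fails, instead of overwriting the only pointer to it. This is a sketch under assumptions, not a proposed replacement: struct ts_array and ts_array_grow() are hypothetical stand-ins for the session fields and the function above.

```c
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical stand-in for os->last_cpu_timestamp and s->nr_cpus. */
struct ts_array {
	uint64_t *ts;
	unsigned int nr;
};

/* Make sure slot 'cpu' exists; new slots start at UINT64_MAX ("no event"). */
static int ts_array_grow(struct ts_array *a, unsigned int cpu)
{
	unsigned int nr = cpu + 1;
	unsigned int i;
	uint64_t *tmp;

	if (cpu < a->nr)
		return 0;

	/* realloc(NULL, size) behaves like malloc(size), so one call is enough. */
	tmp = realloc(a->ts, sizeof(*tmp) * nr);
	if (!tmp)
		return -ENOMEM;	/* a->ts is untouched and can still be freed */

	for (i = a->nr; i < nr; i++)
		tmp[i] = UINT64_MAX;

	a->ts = tmp;
	a->nr = nr;
	return 0;
}
```

Using a temporary pointer means the existing free() in the teardown path still sees the old buffer after an allocation failure.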
>  #define MAX_SAMPLE_BUFFER	(64 * 1024 / sizeof(struct sample_queue))
>
>  static int perf_session_queue_event(struct perf_session *s, union perf_event *event,
> @@ -652,6 +714,12 @@ static int perf_session_queue_event(struct perf_session *s, union perf_event *ev
>  	struct list_head *sc = &os->sample_cache;
>  	u64 timestamp = sample->time;
>  	struct sample_queue *new;
> +	int err;
> +
> +	if (!(s->sample_type & PERF_SAMPLE_CPU)) {
> +		pr_err("Warning: Need to record CPU on samples for ordering\n");
> +		return -EINVAL;
> +	}
>
>  	if (!timestamp || timestamp == ~0ULL)
>  		return -ETIME;
> @@ -661,6 +729,10 @@ static int perf_session_queue_event(struct perf_session *s, union perf_event *ev
>  		return -EINVAL;
>  	}
>
> +	err = alloc_cpus_timestamp_array(s, sample, os);
> +	if (err)
> +		return err;
> +
>  	if (!list_empty(sc)) {
>  		new = list_entry(sc->next, struct sample_queue, list);
>  		list_del(&new->list);
> @@ -681,7 +753,7 @@ static int perf_session_queue_event(struct perf_session *s, union perf_event *ev
>  	new->file_offset = file_offset;
>  	new->event = event;
>
> -	__queue_event(new, s);
> +	__queue_event(new, s, sample->cpu);
>
>  	return 0;
>  }
> diff --git a/tools/perf/util/session.h b/tools/perf/util/session.h
> index c8d9017..642591b 100644
> --- a/tools/perf/util/session.h
> +++ b/tools/perf/util/session.h
> @@ -16,7 +16,7 @@ struct thread;
>  struct ordered_samples {
>  	u64			last_flush;
>  	u64			next_flush;
> -	u64			max_timestamp;
> +	u64			*last_cpu_timestamp;
>  	struct list_head	samples;
>  	struct list_head	sample_cache;
>  	struct list_head	to_free;
> @@ -50,6 +50,7 @@ struct perf_session {
>  	int			cwdlen;
>  	char			*cwd;
>  	struct ordered_samples	ordered_samples;
> +	unsigned int		nr_cpus;
>  	char			filename[1];
>  };
>
> --
> 1.7.5.4
>