From: Alexey Budankov
Subject: [PATCH v5 2/4] perf/core: addressing 4x slowdown during per-process profiling of STREAM benchmark on Intel Xeon Phi
To: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo, Alexander Shishkin
Cc: Andi Kleen, Kan Liang, Dmitri Prokhorov, Valery Cherepennikov, Mark Rutland, David Carrillo-Cisneros, Stephane Eranian, linux-kernel
Organization: Intel Corp.
Message-ID: <17d78db1-0756-bb59-0d9e-08b764504514@linux.intel.com>
Date: Fri, 30 Jun 2017 13:15:29 +0300

perf/core: use context tstamp_data for skipped events on mux interrupt

By default, the userspace perf tool opens per-cpu task-bound events
when sampling, so for N logical events requested by the user, the tool
will open N * NR_CPUS events.

In the kernel, we mux events with a hrtimer, periodically rotating the
flexible group list and trying to schedule each group in turn. We skip
groups whose cpu filter doesn't match. So when we get unlucky, we can
walk N * (NR_CPUS - 1) groups pointlessly for each hrtimer invocation.

This has been observed to result in significant overhead when running
the STREAM benchmark on 272-core Xeon Phi systems.

One way to avoid this is to place our events into an rb tree sorted by
CPU filter, so that our hrtimer can skip to the current CPU's list and
ignore everything else.

However, skipped events still need their tstamp_* fields maintained
properly across group switches. To implement that, a tstamp_data object
is introduced in the event context, and skipped events' tstamp pointers
refer to that object, whose timings are updated only once by
update_context_time() on every mux hrtimer interrupt. Thus iterating
over skipped events can be avoided while their tstamp_* timings are
still kept properly updated.

Signed-off-by: Alexey Budankov
---
 include/linux/perf_event.h | 36 ++++++++++++++++++----------
 kernel/events/core.c       | 58 ++++++++++++++++++++++++++++------------------
 2 files changed, 60 insertions(+), 34 deletions(-)

1. separated the tstamp_enabled, tstamp_running and tstamp_stopped
   fields into a struct perf_event_tstamp type with corresponding
   enabled, running and stopped fields;
2. introduced a tstamp pointer in the perf_event type and tstamp_data
   objects in the perf_event and perf_event_context types;
3. updated event_sched_out(), ctx_pinned_sched_in(),
   ctx_flexible_sched_in() and perf_event_alloc() to properly maintain
   the tstamp pointer;
4. implemented updating of the context's tstamp_data times in
   update_context_time();
5. updated references throughout the code to accommodate the new data
   layout of the perf_event and perf_event_context objects.
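To make the shared tstamp_data idea described above concrete, here is a
minimal user-space sketch (illustrative only; the type names struct
tstamp/context/event and the functions mux_tick()/sched_in()/sched_out()
are simplified stand-ins, not the kernel structures or the exact patch
code): an event whose cpu filter does not match is parked on a single
per-context timestamp object, one update per mux hrtimer tick keeps every
parked event current, and the event takes a private snapshot again when
it is scheduled out.

/* tstamp_sketch.c: illustrative only, simplified stand-ins for the kernel's
 * perf_event/perf_event_context timing fields introduced by this patch. */
#include <stdio.h>
#include <stdint.h>

struct tstamp {
	uint64_t enabled;	/* notional time the event was enabled */
	uint64_t running;	/* notional time the event was scheduled on */
	uint64_t stopped;	/* notional time the event was scheduled off */
};

struct context {
	uint64_t time;			/* context time */
	struct tstamp tstamp_data;	/* shared timings for skipped events */
};

struct event {
	int cpu;			/* cpu filter */
	struct tstamp *tstamp;		/* -> own tstamp_data or -> ctx's */
	struct tstamp tstamp_data;	/* private timings */
};

/* What update_context_time() does in spirit: a single update per mux
 * hrtimer tick covers every event parked on ctx->tstamp_data. */
static void mux_tick(struct context *ctx, uint64_t now)
{
	ctx->time = now;
	ctx->tstamp_data.running += ctx->time - ctx->tstamp_data.stopped;
	ctx->tstamp_data.stopped = ctx->time;
}

/* Like ctx_pinned/flexible_sched_in(): a filtered-out event is redirected
 * to the context's shared timings instead of being walked individually. */
static void sched_in(struct context *ctx, struct event *e, int this_cpu)
{
	if (e->cpu != this_cpu) {
		if (e->tstamp != &ctx->tstamp_data)
			e->tstamp = &ctx->tstamp_data;
		return;
	}
	e->tstamp->running = ctx->time;	/* matching event gets scheduled */
}

/* Like event_sched_out(): snapshot the shared timings back into the
 * event's private copy before it stops following the context. */
static void sched_out(struct context *ctx, struct event *e)
{
	if (e->tstamp != &e->tstamp_data) {
		e->tstamp_data = *e->tstamp;
		e->tstamp = &e->tstamp_data;
	}
	e->tstamp->stopped = ctx->time;
}

int main(void)
{
	struct context ctx = { 0 };
	struct event ev = { .cpu = 1 };

	ev.tstamp = &ev.tstamp_data;	/* as perf_event_alloc() now does */

	sched_in(&ctx, &ev, 0);		/* cpu filter mismatch: park on ctx */
	mux_tick(&ctx, 1000);		/* one write per tick ... */
	mux_tick(&ctx, 2000);		/* ... no walk over skipped events */
	sched_out(&ctx, &ev);		/* private snapshot restored */

	printf("running=%llu stopped=%llu\n",
	       (unsigned long long)ev.tstamp->running,
	       (unsigned long long)ev.tstamp->stopped);
	return 0;
}

Note the restore in sched_out(): once an event stops following the
context it must not keep pointing at timings that continue to advance,
which is what the event_sched_out() hunk below guards against.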
diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index dcedfd7..7b2cddf 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -550,6 +550,22 @@ struct pmu_event_list {
 	struct list_head	list;
 };
 
+struct perf_event_tstamp {
+	/*
+	 * These are timestamps used for computing total_time_enabled
+	 * and total_time_running when the event is in INACTIVE or
+	 * ACTIVE state, measured in nanoseconds from an arbitrary point
+	 * in time.
+	 * enabled: the notional time when the event was enabled
+	 * running: the notional time when the event was scheduled on
+	 * stopped: in INACTIVE state, the notional time when the
+	 *	    event was scheduled off.
+	 */
+	u64 enabled;
+	u64 running;
+	u64 stopped;
+};
+
 /**
  * struct perf_event - performance event kernel representation:
  */
@@ -631,19 +647,11 @@ struct perf_event {
 	u64				total_time_running;
 
 	/*
-	 * These are timestamps used for computing total_time_enabled
-	 * and total_time_running when the event is in INACTIVE or
-	 * ACTIVE state, measured in nanoseconds from an arbitrary point
-	 * in time.
-	 * tstamp_enabled: the notional time when the event was enabled
-	 * tstamp_running: the notional time when the event was scheduled on
-	 * tstamp_stopped: in INACTIVE state, the notional time when the
-	 *	event was scheduled off.
+	 * tstamp points to the tstamp_data object below or to the object
+	 * located at the event context;
 	 */
-	u64				tstamp_enabled;
-	u64				tstamp_running;
-	u64				tstamp_stopped;
-
+	struct perf_event_tstamp	*tstamp;
+	struct perf_event_tstamp	tstamp_data;
 	/*
 	 * timestamp shadows the actual context timing but it can
 	 * be safely used in NMI interrupt context. It reflects the
@@ -787,6 +795,10 @@ struct perf_event_context {
 	 */
 	u64				time;
 	u64				timestamp;
+	/*
+	 * Context cache for filtered out events;
+	 */
+	struct perf_event_tstamp	tstamp_data;
 
 	/*
 	 * These fields let us detect when two contexts have both
diff --git a/kernel/events/core.c b/kernel/events/core.c
index d3bc445..a9920ee 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -865,10 +865,10 @@ perf_cgroup_mark_enabled(struct perf_event *event,
 
 	event->cgrp_defer_enabled = 0;
-	event->tstamp_enabled = tstamp - event->total_time_enabled;
+	event->tstamp->enabled = tstamp - event->total_time_enabled;
 	list_for_each_entry(sub, &event->sibling_list, group_entry) {
 		if (sub->state >= PERF_EVENT_STATE_INACTIVE) {
-			sub->tstamp_enabled = tstamp - sub->total_time_enabled;
+			sub->tstamp->enabled = tstamp - sub->total_time_enabled;
 			sub->cgrp_defer_enabled = 0;
 		}
 	}
@@ -1378,6 +1378,9 @@ static void update_context_time(struct perf_event_context *ctx)
 
 	ctx->time += now - ctx->timestamp;
 	ctx->timestamp = now;
+
+	ctx->tstamp_data.running += ctx->time - ctx->tstamp_data.stopped;
+	ctx->tstamp_data.stopped = ctx->time;
 }
 
 static u64 perf_event_time(struct perf_event *event)
@@ -1419,16 +1422,16 @@ static void update_event_times(struct perf_event *event)
 	else if (ctx->is_active)
 		run_end = ctx->time;
 	else
-		run_end = event->tstamp_stopped;
-	event->total_time_enabled = run_end - event->tstamp_enabled;
+		run_end = event->tstamp->stopped;
+	event->total_time_enabled = run_end - event->tstamp->enabled;
 
 	if (event->state == PERF_EVENT_STATE_INACTIVE)
-		run_end = event->tstamp_stopped;
+		run_end = event->tstamp->stopped;
 	else
 		run_end = perf_event_time(event);
 
-	event->total_time_running = run_end - event->tstamp_running;
+	event->total_time_running = run_end - event->tstamp->running;
 
 }
@@ -2037,9 +2040,13 @@ event_sched_out(struct perf_event *event,
 	 */
 	if (event->state == PERF_EVENT_STATE_INACTIVE &&
 	    !event_filter_match(event)) {
-		delta = tstamp - event->tstamp_stopped;
-		event->tstamp_running += delta;
-		event->tstamp_stopped = tstamp;
+		delta = tstamp - event->tstamp->stopped;
+		event->tstamp->running += delta;
+		event->tstamp->stopped = tstamp;
+		if (event->tstamp != &event->tstamp_data) {
+			event->tstamp_data = *event->tstamp;
+			event->tstamp = &event->tstamp_data;
+		}
 	}
 
 	if (event->state != PERF_EVENT_STATE_ACTIVE)
@@ -2047,7 +2054,7 @@ event_sched_out(struct perf_event *event,
 	perf_pmu_disable(event->pmu);
 
-	event->tstamp_stopped = tstamp;
+	event->tstamp->stopped = tstamp;
 	event->pmu->del(event, 0);
 	event->oncpu = -1;
 	event->state = PERF_EVENT_STATE_INACTIVE;
@@ -2338,7 +2345,7 @@ event_sched_in(struct perf_event *event,
 		goto out;
 	}
 
-	event->tstamp_running += tstamp - event->tstamp_stopped;
+	event->tstamp->running += tstamp - event->tstamp->stopped;
 
 	if (!is_software_event(event))
 		cpuctx->active_oncpu++;
@@ -2410,8 +2417,8 @@ group_sched_in(struct perf_event *group_event,
 			simulate = true;
 
 		if (simulate) {
-			event->tstamp_running += now - event->tstamp_stopped;
-			event->tstamp_stopped = now;
+			event->tstamp->running += now - event->tstamp->stopped;
+			event->tstamp->stopped = now;
 		} else {
 			event_sched_out(event, cpuctx, ctx);
 		}
@@ -2463,9 +2470,9 @@ static void add_event_to_ctx(struct perf_event *event,
 
 	list_add_event(event, ctx);
 	perf_group_attach(event);
-	event->tstamp_enabled = tstamp;
-	event->tstamp_running = tstamp;
-	event->tstamp_stopped = tstamp;
+	event->tstamp->enabled = tstamp;
+	event->tstamp->running = tstamp;
+	event->tstamp->stopped = tstamp;
 }
 
 static void ctx_sched_out(struct perf_event_context *ctx,
@@ -2710,10 +2717,10 @@ static void __perf_event_mark_enabled(struct perf_event *event)
 	u64 tstamp = perf_event_time(event);
 
 	event->state = PERF_EVENT_STATE_INACTIVE;
-	event->tstamp_enabled = tstamp - event->total_time_enabled;
+	event->tstamp->enabled = tstamp - event->total_time_enabled;
 	list_for_each_entry(sub, &event->sibling_list, group_entry) {
 		if (sub->state >= PERF_EVENT_STATE_INACTIVE)
-			sub->tstamp_enabled = tstamp - sub->total_time_enabled;
+			sub->tstamp->enabled = tstamp - sub->total_time_enabled;
 	}
 }
 
@@ -3308,8 +3315,11 @@ ctx_pinned_sched_in(struct perf_event *event, void *data)
 	if (event->state <= PERF_EVENT_STATE_OFF)
 		return 0;
 
-	if (!event_filter_match(event))
+	if (!event_filter_match(event)) {
+		if (event->tstamp != &params->ctx->tstamp_data)
+			event->tstamp = &params->ctx->tstamp_data;
 		return 0;
+	}
 
 	/* may need to reset tstamp_enabled */
 	if (is_cgroup_event(event))
@@ -3342,8 +3352,11 @@ ctx_flexible_sched_in(struct perf_event *event, void *data)
 	 * Listen to the 'cpu' scheduling filter constraint
 	 * of events:
 	 */
-	if (!event_filter_match(event))
+	if (!event_filter_match(event)) {
+		if (event->tstamp != &params->ctx->tstamp_data)
+			event->tstamp = &params->ctx->tstamp_data;
 		return 0;
+	}
 
 	/* may need to reset tstamp_enabled */
 	if (is_cgroup_event(event))
@@ -5100,8 +5113,8 @@ static void calc_timer_values(struct perf_event *event,
 
 	*now = perf_clock();
 	ctx_time = event->shadow_ctx_time + *now;
-	*enabled = ctx_time - event->tstamp_enabled;
-	*running = ctx_time - event->tstamp_running;
+	*enabled = ctx_time - event->tstamp->enabled;
+	*running = ctx_time - event->tstamp->running;
 }
 
 static void perf_event_init_userpage(struct perf_event *event)
@@ -9652,6 +9665,7 @@ perf_event_alloc(struct perf_event_attr *attr, int cpu,
 	raw_spin_lock_init(&event->addr_filters.lock);
 	atomic_long_set(&event->refcount, 1);
+	event->tstamp		= &event->tstamp_data;
 	event->cpu		= cpu;
 	event->attr		= *attr;
 	event->group_leader	= group_leader;
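For completeness, the timings this bookkeeping maintains are what user
space ultimately observes as time_enabled/time_running through read(),
via the calc_timer_values() and update_event_times() paths touched
above. Below is a minimal self-monitoring example, not part of the
patch; it assumes a Linux system where perf_event_open() is permitted
for the calling user, and the file name read_times.c is just an
illustrative label.

/* read_times.c: illustrative only, not part of the patch. */
#define _GNU_SOURCE
#include <stdio.h>
#include <stdint.h>
#include <string.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <linux/perf_event.h>

struct read_format {
	uint64_t value;		/* counter value */
	uint64_t time_enabled;	/* ns the event was enabled */
	uint64_t time_running;	/* ns the event was actually on the PMU */
};

int main(void)
{
	struct perf_event_attr attr;
	struct read_format rf;
	int fd;

	memset(&attr, 0, sizeof(attr));
	attr.size = sizeof(attr);
	attr.type = PERF_TYPE_HARDWARE;
	attr.config = PERF_COUNT_HW_CPU_CYCLES;
	attr.exclude_kernel = 1;	/* friendlier to perf_event_paranoid */
	attr.read_format = PERF_FORMAT_TOTAL_TIME_ENABLED |
			   PERF_FORMAT_TOTAL_TIME_RUNNING;

	/* pid = 0, cpu = -1: self-monitoring on any CPU (the perf tool
	 * instead opens one such task-bound event per CPU when sampling) */
	fd = syscall(__NR_perf_event_open, &attr, 0, -1, -1, 0);
	if (fd < 0) {
		perror("perf_event_open");
		return 1;
	}

	for (volatile unsigned long i = 0; i < 100000000UL; i++)
		;	/* burn some cycles */

	if (read(fd, &rf, sizeof(rf)) != (ssize_t)sizeof(rf)) {
		perror("read");
		close(fd);
		return 1;
	}
	printf("cycles=%llu time_enabled=%llu ns time_running=%llu ns\n",
	       (unsigned long long)rf.value,
	       (unsigned long long)rf.time_enabled,
	       (unsigned long long)rf.time_running);
	close(fd);
	return 0;
}

With a single event this typically reports time_running equal to
time_enabled; the gap between the two under event multiplexing is what
the timestamp bookkeeping changed by this patch exists to account for.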