linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH]: perf/core: addressing 4x slowdown during per-process profiling of STREAM benchmark on Intel Xeon Phi
@ 2017-05-26 22:13 Alexey Budankov
  2017-05-27 11:19 ` [PATCH v2]: perf/core: addressing 4x slowdown during per-process, " Alexey Budankov
  2017-05-29 12:03 ` [PATCH]: perf/core: addressing 4x slowdown during per-process " Alexander Shishkin
  0 siblings, 2 replies; 25+ messages in thread
From: Alexey Budankov @ 2017-05-26 22:13 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo
  Cc: Alexander Shishkin, Andi Kleen, Kan Liang, Dmitri Prokhorov,
	Valery Cherepennikov, David Carrillo-Cisneros, Stephane Eranian,
	Mark Rutland, linux-kernel


perf/core: addressing 4x slowdown during per-process profiling of
            STREAM benchmark on Intel Xeon Phi

Motivation:

The issue manifests like 4x slowdown when profiling single thread
STREAM benchmark on Intel Xeon Phi running RHEL7.2 (Intel MPSS distro).
Perf profiling is done in per-process mode and involves about 30 core
events. In case the benchmark is OpenMP based and runs under profiling
in 272 threads the overhead becomes even more dramatic:
512.144s against 36.192s (with this patch).

Solution:

cpu indexed trees for perf_event_context::pinned_group and
perf_event_context::flexible_group lists are introduced. Every tree
node keeps a list of groups allocated for the same cpu. A tree
references only groups located at the appropriate group list. The tree
provides capability to iterate over groups allocated for a specific
cpu only, what is exactly required by multiplexing timer interrupt
handler. The handler runs per-cpu and enables/disables groups using
group_sched_in()/group_sched_out() that call event_filter_match()
function filtering out groups allocated for cpus different from the
one executing the handler. Additionally for every filtered out group
group_sched_out() updates tstamps values to the current interrupt
time. This updating work is now done only once by
update_context_time() called by ctx_sched_out() before cpu groups
iteration. For this trick to work it is required that tstamps of
filtered out events would point to perf_event_context::tstamp_data
object instead of perf_event::tstam_data ones, as it is initialized
from an event allocation. tstamps references are switched by
group_sched_in()/group_sched_out() every time a group is checked for
its suitability for currently running cpu. When a thread enters some
cpu on a context switch a long run through pinned and flexible groups
is performed by perf_event_sched_in(, mux=0) with new parameter mux
set to 0 and filtered out groups tstamps are switched to
perf_event_context::tstamp_data object. Then a series of multiplexing
interrupts happens and the handler rotates the flexible groups calling
ctx_sched_out(,mux=1)/perf_event_sched_in(,mux=1) iterating over the
cpu tree lists only and avoiding long runs through the complete group
lists. This is where speedup comes from. Eventually when the thread
leaves the cpu ctx_sched_out(,mux=0) is called restoring tstamps
pointers to the events' perf_event::tstamp_data objects.

Locking:

The new perf_event_context::pinned_tree and
perf_event_context::flexible_tree are protected by
perf_evet_context::lock and perf_event_context::mutex, the same
locking schema as for appropriate perf_event_context::pinned_group and
perf_event_context::flexible_group. Indirect perf_event::tstamp
reference and perf_event::tstam_data object are protected by
perf_event::lock as earlier for original perf_event::tstamp_* fields.

Added API:

Objects:
1. struct perf_event_tstamp:
    - enabled
    - running
    - stopped
2. struct perf_event:
    - group_node
    - group_list
    - group_list_entry
    - tstamp
    - tstamp_data
3. struct perf_event_context:
    - pinned_tree
    - flexible_tree
    - tstamp_data

Functions:
1. insert a group into a tree using event->cpu as a key:
         int perf_cpu_tree_insert(struct rb_root *tree,
                 struct perf_event *event);
2. delete a group from a tree, if the group is directly attached
    to a tree it also detaches all groups on the groups
    group_list list:
         int perf_cpu_tree_delete(struct rb_root *tree,
                 struct perf_event *event);
3. find group_list list by a cpu key:
         struct list_head * perf_cpu_tree_find(struct rb_root *tree,
                 int cpu);
4. enable a pinned group on a cpu:
         void ctx_sched_in_flexible_group(
                 struct perf_event_context *ctx,
                 struct perf_cpu_context *cpuctx,
                 struct perf_event *group, int *can_add_hw);
5. enable a flexible group on a cpu; calls ctx_sched_in_flexible_group
    and updates group->state to ERROR in case of failure:
         void ctx_sched_in_pinned_group(struct perf_event_context *ctx,
                 struct perf_cpu_context *cpuctx,
                 struct perf_event *group);
6. enable per-cpu pinned tree's groups on a cpu:
         void ctx_pinned_sched_in_groups(struct perf_event_context *ctx,
                 struct perf_cpu_context *cpuctx,
                 struct list_head *groups);
7. enable per-cpu flexible tree's groups on a cpu:
         void ctx_flexible_sched_in_groups(
                 struct perf_event_context *ctx,
                 struct perf_cpu_context *cpuctx,
                 struct list_head *groups);
8. disable per-cpu tree's groups on a cpu:
         void ctx_sched_out_groups(struct perf_event_context *ctx,
                 struct perf_cpu_context *cpuctx,
                 struct list_head *groups);
9. get a group tree based on event->attr.pinned attribute value:
         struct rb_root * ctx_cpu_tree(struct perf_event *event,
                 struct perf_event_context *ctx);

Modified API:

Objects:
1. struct perf_event
2. struct perf_event_context

Functions:
1. perf_event_alloc()
2. add_event_to_ctx()
3. perf_event_enable_on_exec()
4. __perf_install_in_context()
5. __perf_event_init_context()
6. __perf_event_mark_enabled()
7. __perf_event_enable()
8. __perf_event_task_sched_out()
9. ctx_group_list()
10. list_add_event()
11. list_del_event()
12. perf_event_context_sched_in()
13. cpu_ctx_sched_in()
14. cpu_ctx_sched_out()
15. ctx_sched_in()
16. ctx_sched_out()
17. ctx_resched()
18. ctx_pinned_sched_in()
19. ctx_flexible_sched_in()
20. group_sched_in()
21. event_sched_in()
22. event_sched_out()
23. rotate_ctx()
24. perf_rotate_context()
25. update_context_time()
26. update_event_times()
27. calc_timer_values()
28. perf_cgroup_switch()
29. perf_cgroup_mark_enabled()

Signed-off-by: Alexey Budankov <alexey.budankov@linux.intel.com>
---
  include/linux/perf_event.h |  63 +++++--
  kernel/events/core.c       | 442 
++++++++++++++++++++++++++++++++++-----------
  2 files changed, 384 insertions(+), 121 deletions(-)

diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 24a6358..6bd380f 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -550,6 +550,22 @@ struct pmu_event_list {
  	struct list_head	list;
  };

+struct perf_event_tstamp {
+	/*
+	 * These are timestamps used for computing total_time_enabled
+	 * and total_time_running when the event is in INACTIVE or
+	 * ACTIVE state, measured in nanoseconds from an arbitrary point
+	 * in time.
+	 * enabled: the notional time when the event was enabled
+	 * running: the notional time when the event was scheduled on
+	 * stopped: in INACTIVE state, the notional time when the
+	 *	event was scheduled off.
+	 */
+	u64 enabled;
+	u64 running;
+	u64 stopped;
+};
+
  /**
   * struct perf_event - performance event kernel representation:
   */
@@ -571,6 +587,27 @@ struct perf_event {
  	 * either sufficies for read.
  	 */
  	struct list_head		group_entry;
+	/*
+	 * Node on the pinned or flexible tree located at the event context;
+	 * the node may be empty in case its event is not directly attached
+	 * to the tree but to group_list list of the event directly
+	 * attached to the tree;
+	 */
+	struct rb_node			group_node;
+	/*
+	 * List keeps groups allocated for the same cpu;
+	 * the list may be empty in case its event is not directly
+	 * attached to the tree but to group_list list of the event directly
+	 * attached to the tree;
+	 */
+	struct list_head		group_list;
+	/*
+	 * Entry into the group_list list above;
+	 * the entry may be attached to the self group_list list above
+	 * in case the event is directly attached to the pinned or
+	 * flexible tree;
+	 */
+	struct list_head		group_list_entry;
  	struct list_head		sibling_list;

  	/*
@@ -611,18 +648,11 @@ struct perf_event {
  	u64				total_time_running;

  	/*
-	 * These are timestamps used for computing total_time_enabled
-	 * and total_time_running when the event is in INACTIVE or
-	 * ACTIVE state, measured in nanoseconds from an arbitrary point
-	 * in time.
-	 * tstamp_enabled: the notional time when the event was enabled
-	 * tstamp_running: the notional time when the event was scheduled on
-	 * tstamp_stopped: in INACTIVE state, the notional time when the
-	 *	event was scheduled off.
+	 * tstamp points to the tstamp_data object below or to the object
+	 * located at the event context;
  	 */
-	u64				tstamp_enabled;
-	u64				tstamp_running;
-	u64				tstamp_stopped;
+	struct perf_event_tstamp	*tstamp;
+	struct perf_event_tstamp	tstamp_data;

  	/*
  	 * timestamp shadows the actual context timing but it can
@@ -742,7 +772,17 @@ struct perf_event_context {

  	struct list_head		active_ctx_list;
  	struct list_head		pinned_groups;
+	/*
+	 * Cpu tree for pinned groups; keeps event's group_node nodes
+	 * of attached flexible groups;
+	 */
+	struct rb_root			pinned_tree;
  	struct list_head		flexible_groups;
+	/*
+	 * Cpu tree for flexible groups; keeps event's group_node nodes
+	 * of attached flexible groups;
+	 */
+	struct rb_root			flexible_tree;
  	struct list_head		event_list;
  	int				nr_events;
  	int				nr_active;
@@ -758,6 +798,7 @@ struct perf_event_context {
  	 */
  	u64				time;
  	u64				timestamp;
+	struct perf_event_tstamp	tstamp_data;

  	/*
  	 * These fields let us detect when two contexts have both
diff --git a/kernel/events/core.c b/kernel/events/core.c
index f8c27d3..9168406 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -555,11 +555,11 @@ void perf_sample_event_took(u64 sample_len_ns)
  static atomic64_t perf_event_id;

  static void cpu_ctx_sched_out(struct perf_cpu_context *cpuctx,
-			      enum event_type_t event_type);
+			      enum event_type_t event_type, int mux);

  static void cpu_ctx_sched_in(struct perf_cpu_context *cpuctx,
  			     enum event_type_t event_type,
-			     struct task_struct *task);
+			     struct task_struct *task, int mux);

  static void update_context_time(struct perf_event_context *ctx);
  static u64 perf_event_time(struct perf_event *event);
@@ -701,6 +701,7 @@ static void perf_cgroup_switch(struct task_struct 
*task, int mode)
  	struct perf_cpu_context *cpuctx;
  	struct list_head *list;
  	unsigned long flags;
+	int mux = 0;

  	/*
  	 * Disable interrupts and preemption to avoid this CPU's
@@ -716,7 +717,7 @@ static void perf_cgroup_switch(struct task_struct 
*task, int mode)
  		perf_pmu_disable(cpuctx->ctx.pmu);

  		if (mode & PERF_CGROUP_SWOUT) {
-			cpu_ctx_sched_out(cpuctx, EVENT_ALL);
+			cpu_ctx_sched_out(cpuctx, EVENT_ALL, mux);
  			/*
  			 * must not be done before ctxswout due
  			 * to event_filter_match() in event_sched_out()
@@ -735,7 +736,7 @@ static void perf_cgroup_switch(struct task_struct 
*task, int mode)
  			 */
  			cpuctx->cgrp = perf_cgroup_from_task(task,
  							     &cpuctx->ctx);
-			cpu_ctx_sched_in(cpuctx, EVENT_ALL, task);
+			cpu_ctx_sched_in(cpuctx, EVENT_ALL, task, mux);
  		}
  		perf_pmu_enable(cpuctx->ctx.pmu);
  		perf_ctx_unlock(cpuctx, cpuctx->task_ctx);
@@ -865,10 +866,10 @@ perf_cgroup_mark_enabled(struct perf_event *event,

  	event->cgrp_defer_enabled = 0;

-	event->tstamp_enabled = tstamp - event->total_time_enabled;
+	event->tstamp->enabled = tstamp - event->total_time_enabled;
  	list_for_each_entry(sub, &event->sibling_list, group_entry) {
  		if (sub->state >= PERF_EVENT_STATE_INACTIVE) {
-			sub->tstamp_enabled = tstamp - sub->total_time_enabled;
+			sub->tstamp->enabled = tstamp - sub->total_time_enabled;
  			sub->cgrp_defer_enabled = 0;
  		}
  	}
@@ -1383,6 +1384,9 @@ static void update_context_time(struct 
perf_event_context *ctx)

  	ctx->time += now - ctx->timestamp;
  	ctx->timestamp = now;
+
+	ctx->tstamp_data.running += ctx->time - ctx->tstamp_data.stopped;
+	ctx->tstamp_data.stopped = ctx->time;
  }

  static u64 perf_event_time(struct perf_event *event)
@@ -1424,16 +1428,16 @@ static void update_event_times(struct perf_event 
*event)
  	else if (ctx->is_active)
  		run_end = ctx->time;
  	else
-		run_end = event->tstamp_stopped;
+		run_end = event->tstamp->stopped;

-	event->total_time_enabled = run_end - event->tstamp_enabled;
+	event->total_time_enabled = run_end - event->tstamp->enabled;

  	if (event->state == PERF_EVENT_STATE_INACTIVE)
-		run_end = event->tstamp_stopped;
+		run_end = event->tstamp->stopped;
  	else
  		run_end = perf_event_time(event);

-	event->total_time_running = run_end - event->tstamp_running;
+	event->total_time_running = run_end - event->tstamp->running;

  }

@@ -1472,6 +1476,186 @@ ctx_group_list(struct perf_event *event, struct 
perf_event_context *ctx)
  		return &ctx->flexible_groups;
  }

+static int
+perf_cpu_tree_insert(struct rb_root *tree, struct perf_event *event)
+{
+	struct rb_node **node;
+	struct rb_node *parent;
+
+	if (!tree || !event)
+		return 0;
+
+	node = &tree->rb_node;
+	parent = *node;
+
+	while (*node) {
+		struct perf_event *node_event =	container_of(*node,
+				struct perf_event, group_node);
+
+		parent = *node;
+
+		if (event->cpu < node_event->cpu) {
+			node = &((*node)->rb_left);
+		} else if (event->cpu > node_event->cpu) {
+			node = &((*node)->rb_right);
+		} else {
+			list_add_tail(&event->group_list_entry,
+					&node_event->group_list);
+			return 2;
+		}
+	}
+
+	list_add_tail(&event->group_list_entry, &event->group_list);
+
+	rb_link_node(&event->group_node, parent, node);
+	rb_insert_color(&event->group_node, tree);
+
+	return 1;
+}
+
+static int
+perf_cpu_tree_delete(struct rb_root *tree, struct perf_event *event)
+{
+	if (!tree || !event)
+		return 0;
+
+	if (RB_EMPTY_NODE(&event->group_node)) {
+		list_del_init(&event->group_list_entry);
+	} else {
+		struct perf_event *list_event, *list_next;
+
+		rb_erase(&event->group_node, tree);
+		list_for_each_entry_safe(list_event, list_next,
+				&event->group_list, group_list_entry)
+			list_del_init(&list_event->group_list_entry);
+	}
+
+	return 1;
+}
+
+static struct list_head *
+perf_cpu_tree_find(struct rb_root *tree, int cpu)
+{
+	struct rb_node *node;
+
+	if (!tree)
+		return NULL;
+
+	node = tree->rb_node;
+
+	while (node) {
+		struct perf_event *node_event = container_of(node,
+				struct perf_event, group_node);
+
+		if (cpu < node_event->cpu)
+			node = node->rb_left;
+		else if (cpu > node_event->cpu)
+			node = node->rb_right;
+		else
+			return &node_event->group_list;
+	}
+
+	return NULL;
+}
+
+static struct rb_root *
+ctx_cpu_tree(struct perf_event *event, struct perf_event_context *ctx)
+{
+	if (event->attr.pinned)
+		return &ctx->pinned_tree;
+	else
+		return &ctx->flexible_tree;
+}
+
+static inline int
+event_filter_match(struct perf_event *event);
+
+static int
+group_can_go_on(struct perf_event *event,
+		struct perf_cpu_context *cpuctx,
+		int can_add_hw);
+
+static int
+group_sched_in(struct perf_event *group_event,
+	       struct perf_cpu_context *cpuctx,
+	       struct perf_event_context *ctx);
+
+static void
+ctx_sched_in_flexible_group(struct perf_event_context *ctx,
+		    struct perf_cpu_context *cpuctx,
+		    struct perf_event *event, int *can_add_hw)
+{
+	if (event->state <= PERF_EVENT_STATE_OFF)
+		return;
+	if (!event_filter_match(event)) {
+		if (event->tstamp != &ctx->tstamp_data)
+			event->tstamp = &ctx->tstamp_data;
+		return;
+	}
+
+	/* may need to reset tstamp_enabled */
+	if (is_cgroup_event(event))
+		perf_cgroup_mark_enabled(event, ctx);
+
+	if (group_can_go_on(event, cpuctx, *can_add_hw)) {
+		if (group_sched_in(event, cpuctx, ctx))
+			*can_add_hw = 0;
+	}
+}
+
+static void
+ctx_sched_in_pinned_group(struct perf_event_context *ctx,
+		    struct perf_cpu_context *cpuctx,
+		    struct perf_event *event)
+{
+	int can_add_hw = 1;
+
+	ctx_sched_in_flexible_group(ctx, cpuctx, event, &can_add_hw);
+	if (!can_add_hw) {
+		update_group_times(event);
+		event->state = PERF_EVENT_STATE_ERROR;
+	}
+}
+
+static void
+ctx_pinned_sched_in_groups(struct perf_event_context *ctx,
+		    struct perf_cpu_context *cpuctx,
+		    struct list_head *groups)
+{
+	struct perf_event *group;
+
+	list_for_each_entry(group, groups, group_list_entry)
+		ctx_sched_in_pinned_group(ctx, cpuctx, group);
+}
+
+static void
+ctx_flexible_sched_in_groups(struct perf_event_context *ctx,
+		      struct perf_cpu_context *cpuctx,
+		      struct list_head *groups)
+{
+	struct perf_event *group;
+	int can_add_hw = 1;
+
+	list_for_each_entry(group, groups, group_list_entry)
+		ctx_sched_in_flexible_group(ctx, cpuctx, group, &can_add_hw);
+}
+
+static void
+group_sched_out(struct perf_event *group_event,
+		struct perf_cpu_context *cpuctx,
+		struct perf_event_context *ctx);
+
+static void
+ctx_sched_out_groups(struct perf_event_context *ctx,
+		struct perf_cpu_context *cpuctx,
+		struct list_head *groups)
+{
+	struct perf_event *group;
+
+	list_for_each_entry(group, groups, group_list_entry)
+		group_sched_out(group, cpuctx, ctx);
+}
+
  /*
   * Add a event from the lists for its context.
   * Must be called with ctx->mutex and ctx->lock held.
@@ -1491,11 +1675,15 @@ list_add_event(struct perf_event *event, struct 
perf_event_context *ctx)
  	 */
  	if (event->group_leader == event) {
  		struct list_head *list;
+		struct rb_root *tree;

  		event->group_caps = event->event_caps;

  		list = ctx_group_list(event, ctx);
  		list_add_tail(&event->group_entry, list);
+
+		tree = ctx_cpu_tree(event, ctx);
+		perf_cpu_tree_insert(tree, event);
  	}

  	list_update_cgroup_event(event, ctx, true);
@@ -1685,8 +1873,14 @@ list_del_event(struct perf_event *event, struct 
perf_event_context *ctx)

  	list_del_rcu(&event->event_entry);

-	if (event->group_leader == event)
+	if (event->group_leader == event) {
+		struct rb_root *tree;
+
+		tree = ctx_cpu_tree(event, ctx);
+		perf_cpu_tree_delete(tree, event);
+
  		list_del_init(&event->group_entry);
+	}

  	update_group_times(event);

@@ -1811,9 +2005,9 @@ event_sched_out(struct perf_event *event,
  	 */
  	if (event->state == PERF_EVENT_STATE_INACTIVE &&
  	    !event_filter_match(event)) {
-		delta = tstamp - event->tstamp_stopped;
-		event->tstamp_running += delta;
-		event->tstamp_stopped = tstamp;
+		delta = tstamp - event->tstamp->stopped;
+		event->tstamp->running += delta;
+		event->tstamp->stopped = tstamp;
  	}

  	if (event->state != PERF_EVENT_STATE_ACTIVE)
@@ -1821,7 +2015,7 @@ event_sched_out(struct perf_event *event,

  	perf_pmu_disable(event->pmu);

-	event->tstamp_stopped = tstamp;
+	event->tstamp->stopped = tstamp;
  	event->pmu->del(event, 0);
  	event->oncpu = -1;
  	event->state = PERF_EVENT_STATE_INACTIVE;
@@ -2096,7 +2290,7 @@ event_sched_in(struct perf_event *event,
  		goto out;
  	}

-	event->tstamp_running += tstamp - event->tstamp_stopped;
+	event->tstamp->running += tstamp - event->tstamp->stopped;

  	if (!is_software_event(event))
  		cpuctx->active_oncpu++;
@@ -2168,8 +2362,8 @@ group_sched_in(struct perf_event *group_event,
  			simulate = true;

  		if (simulate) {
-			event->tstamp_running += now - event->tstamp_stopped;
-			event->tstamp_stopped = now;
+			event->tstamp->running += now - event->tstamp->stopped;
+			event->tstamp->stopped = now;
  		} else {
  			event_sched_out(event, cpuctx, ctx);
  		}
@@ -2221,43 +2415,45 @@ static void add_event_to_ctx(struct perf_event 
*event,

  	list_add_event(event, ctx);
  	perf_group_attach(event);
-	event->tstamp_enabled = tstamp;
-	event->tstamp_running = tstamp;
-	event->tstamp_stopped = tstamp;
+	event->tstamp->enabled = tstamp;
+	event->tstamp->running = tstamp;
+	event->tstamp->stopped = tstamp;
  }

  static void ctx_sched_out(struct perf_event_context *ctx,
  			  struct perf_cpu_context *cpuctx,
-			  enum event_type_t event_type);
+			  enum event_type_t event_type, int mux);
  static void
  ctx_sched_in(struct perf_event_context *ctx,
  	     struct perf_cpu_context *cpuctx,
  	     enum event_type_t event_type,
-	     struct task_struct *task);
+	     struct task_struct *task, int mux);

  static void task_ctx_sched_out(struct perf_cpu_context *cpuctx,
  			       struct perf_event_context *ctx,
  			       enum event_type_t event_type)
  {
+	int mux = 0;
+
  	if (!cpuctx->task_ctx)
  		return;

  	if (WARN_ON_ONCE(ctx != cpuctx->task_ctx))
  		return;

-	ctx_sched_out(ctx, cpuctx, event_type);
+	ctx_sched_out(ctx, cpuctx, event_type, mux);
  }

  static void perf_event_sched_in(struct perf_cpu_context *cpuctx,
  				struct perf_event_context *ctx,
-				struct task_struct *task)
+				struct task_struct *task, int mux)
  {
-	cpu_ctx_sched_in(cpuctx, EVENT_PINNED, task);
+	cpu_ctx_sched_in(cpuctx, EVENT_PINNED, task, mux);
  	if (ctx)
-		ctx_sched_in(ctx, cpuctx, EVENT_PINNED, task);
-	cpu_ctx_sched_in(cpuctx, EVENT_FLEXIBLE, task);
+		ctx_sched_in(ctx, cpuctx, EVENT_PINNED, task, mux);
+	cpu_ctx_sched_in(cpuctx, EVENT_FLEXIBLE, task, mux);
  	if (ctx)
-		ctx_sched_in(ctx, cpuctx, EVENT_FLEXIBLE, task);
+		ctx_sched_in(ctx, cpuctx, EVENT_FLEXIBLE, task, mux);
  }

  /*
@@ -2281,6 +2477,7 @@ static void ctx_resched(struct perf_cpu_context 
*cpuctx,
  {
  	enum event_type_t ctx_event_type = event_type & EVENT_ALL;
  	bool cpu_event = !!(event_type & EVENT_CPU);
+	int mux = 0;

  	/*
  	 * If pinned groups are involved, flexible groups also need to be
@@ -2301,11 +2498,11 @@ static void ctx_resched(struct perf_cpu_context 
*cpuctx,
  	 *  - otherwise, do nothing more.
  	 */
  	if (cpu_event)
-		cpu_ctx_sched_out(cpuctx, ctx_event_type);
+		cpu_ctx_sched_out(cpuctx, ctx_event_type, mux);
  	else if (ctx_event_type & EVENT_PINNED)
-		cpu_ctx_sched_out(cpuctx, EVENT_FLEXIBLE);
+		cpu_ctx_sched_out(cpuctx, EVENT_FLEXIBLE, mux);

-	perf_event_sched_in(cpuctx, task_ctx, current);
+	perf_event_sched_in(cpuctx, task_ctx, current, mux);
  	perf_pmu_enable(cpuctx->ctx.pmu);
  }

@@ -2323,6 +2520,7 @@ static int  __perf_install_in_context(void *info)
  	struct perf_event_context *task_ctx = cpuctx->task_ctx;
  	bool reprogram = true;
  	int ret = 0;
+	int mux = 0;

  	raw_spin_lock(&cpuctx->ctx.lock);
  	if (ctx->task) {
@@ -2349,7 +2547,7 @@ static int  __perf_install_in_context(void *info)
  	}

  	if (reprogram) {
-		ctx_sched_out(ctx, cpuctx, EVENT_TIME);
+		ctx_sched_out(ctx, cpuctx, EVENT_TIME, mux);
  		add_event_to_ctx(event, ctx);
  		ctx_resched(cpuctx, task_ctx, get_event_type(event));
  	} else {
@@ -2468,10 +2666,10 @@ static void __perf_event_mark_enabled(struct 
perf_event *event)
  	u64 tstamp = perf_event_time(event);

  	event->state = PERF_EVENT_STATE_INACTIVE;
-	event->tstamp_enabled = tstamp - event->total_time_enabled;
+	event->tstamp->enabled = tstamp - event->total_time_enabled;
  	list_for_each_entry(sub, &event->sibling_list, group_entry) {
  		if (sub->state >= PERF_EVENT_STATE_INACTIVE)
-			sub->tstamp_enabled = tstamp - sub->total_time_enabled;
+			sub->tstamp->enabled = tstamp - sub->total_time_enabled;
  	}
  }

@@ -2485,13 +2683,14 @@ static void __perf_event_enable(struct 
perf_event *event,
  {
  	struct perf_event *leader = event->group_leader;
  	struct perf_event_context *task_ctx;
+	int mux = 0;

  	if (event->state >= PERF_EVENT_STATE_INACTIVE ||
  	    event->state <= PERF_EVENT_STATE_ERROR)
  		return;

  	if (ctx->is_active)
-		ctx_sched_out(ctx, cpuctx, EVENT_TIME);
+		ctx_sched_out(ctx, cpuctx, EVENT_TIME, mux);

  	__perf_event_mark_enabled(event);

@@ -2501,7 +2700,7 @@ static void __perf_event_enable(struct perf_event 
*event,
  	if (!event_filter_match(event)) {
  		if (is_cgroup_event(event))
  			perf_cgroup_defer_enabled(event);
-		ctx_sched_in(ctx, cpuctx, EVENT_TIME, current);
+		ctx_sched_in(ctx, cpuctx, EVENT_TIME, current, mux);
  		return;
  	}

@@ -2510,7 +2709,7 @@ static void __perf_event_enable(struct perf_event 
*event,
  	 * then don't put it on unless the group is on.
  	 */
  	if (leader != event && leader->state != PERF_EVENT_STATE_ACTIVE) {
-		ctx_sched_in(ctx, cpuctx, EVENT_TIME, current);
+		ctx_sched_in(ctx, cpuctx, EVENT_TIME, current, mux);
  		return;
  	}

@@ -2706,10 +2905,12 @@ EXPORT_SYMBOL_GPL(perf_event_refresh);

  static void ctx_sched_out(struct perf_event_context *ctx,
  			  struct perf_cpu_context *cpuctx,
-			  enum event_type_t event_type)
+			  enum event_type_t event_type, int mux)
  {
  	int is_active = ctx->is_active;
  	struct perf_event *event;
+	struct list_head *groups;
+	int cpu = smp_processor_id();

  	lockdep_assert_held(&ctx->lock);

@@ -2756,13 +2957,34 @@ static void ctx_sched_out(struct 
perf_event_context *ctx,

  	perf_pmu_disable(ctx->pmu);
  	if (is_active & EVENT_PINNED) {
-		list_for_each_entry(event, &ctx->pinned_groups, group_entry)
-			group_sched_out(event, cpuctx, ctx);
+		if (mux) {
+			groups = perf_cpu_tree_find(&ctx->pinned_tree, -1);
+			if (groups)
+				ctx_sched_out_groups(ctx, cpuctx, groups);
+			groups = perf_cpu_tree_find(&ctx->pinned_tree, cpu);
+			if (groups)
+				ctx_sched_out_groups(ctx, cpuctx, groups);
+		} else {
+			list_for_each_entry(event, &ctx->pinned_groups,
+					group_entry)
+				group_sched_out(event, cpuctx, ctx);
+		}
  	}

  	if (is_active & EVENT_FLEXIBLE) {
-		list_for_each_entry(event, &ctx->flexible_groups, group_entry)
-			group_sched_out(event, cpuctx, ctx);
+		if (mux) {
+			groups = perf_cpu_tree_find(&ctx->flexible_tree, -1);
+			if (groups)
+				ctx_sched_out_groups(ctx, cpuctx, groups);
+			groups = perf_cpu_tree_find(&ctx->flexible_tree, cpu);
+			if (groups)
+				ctx_sched_out_groups(ctx, cpuctx, groups);
+
+		} else {
+			list_for_each_entry(event, &ctx->flexible_groups,
+					group_entry)
+				group_sched_out(event, cpuctx, ctx);
+		}
  	}
  	perf_pmu_enable(ctx->pmu);
  }
@@ -3051,9 +3273,9 @@ void __perf_event_task_sched_out(struct 
task_struct *task,
   * Called with IRQs disabled
   */
  static void cpu_ctx_sched_out(struct perf_cpu_context *cpuctx,
-			      enum event_type_t event_type)
+			      enum event_type_t event_type, int mux)
  {
-	ctx_sched_out(&cpuctx->ctx, cpuctx, event_type);
+	ctx_sched_out(&cpuctx->ctx, cpuctx, event_type, mux);
  }

  static void
@@ -3061,29 +3283,8 @@ ctx_pinned_sched_in(struct perf_event_context *ctx,
  		    struct perf_cpu_context *cpuctx)
  {
  	struct perf_event *event;
-
-	list_for_each_entry(event, &ctx->pinned_groups, group_entry) {
-		if (event->state <= PERF_EVENT_STATE_OFF)
-			continue;
-		if (!event_filter_match(event))
-			continue;
-
-		/* may need to reset tstamp_enabled */
-		if (is_cgroup_event(event))
-			perf_cgroup_mark_enabled(event, ctx);
-
-		if (group_can_go_on(event, cpuctx, 1))
-			group_sched_in(event, cpuctx, ctx);
-
-		/*
-		 * If this pinned group hasn't been scheduled,
-		 * put it in error state.
-		 */
-		if (event->state == PERF_EVENT_STATE_INACTIVE) {
-			update_group_times(event);
-			event->state = PERF_EVENT_STATE_ERROR;
-		}
-	}
+	list_for_each_entry(event, &ctx->pinned_groups, group_entry)
+		ctx_sched_in_pinned_group(ctx, cpuctx, event);
  }

  static void
@@ -3092,37 +3293,19 @@ ctx_flexible_sched_in(struct perf_event_context 
*ctx,
  {
  	struct perf_event *event;
  	int can_add_hw = 1;
-
-	list_for_each_entry(event, &ctx->flexible_groups, group_entry) {
-		/* Ignore events in OFF or ERROR state */
-		if (event->state <= PERF_EVENT_STATE_OFF)
-			continue;
-		/*
-		 * Listen to the 'cpu' scheduling filter constraint
-		 * of events:
-		 */
-		if (!event_filter_match(event))
-			continue;
-
-		/* may need to reset tstamp_enabled */
-		if (is_cgroup_event(event))
-			perf_cgroup_mark_enabled(event, ctx);
-
-		if (group_can_go_on(event, cpuctx, can_add_hw)) {
-			if (group_sched_in(event, cpuctx, ctx))
-				can_add_hw = 0;
-		}
-	}
+	list_for_each_entry(event, &ctx->flexible_groups, group_entry)
+		ctx_sched_in_flexible_group(ctx, cpuctx, event, &can_add_hw);
  }

  static void
  ctx_sched_in(struct perf_event_context *ctx,
  	     struct perf_cpu_context *cpuctx,
  	     enum event_type_t event_type,
-	     struct task_struct *task)
+	     struct task_struct *task, int mux)
  {
  	int is_active = ctx->is_active;
-	u64 now;
+	struct list_head *groups;
+	int cpu = smp_processor_id();

  	lockdep_assert_held(&ctx->lock);

@@ -3141,8 +3324,7 @@ ctx_sched_in(struct perf_event_context *ctx,

  	if (is_active & EVENT_TIME) {
  		/* start ctx time */
-		now = perf_clock();
-		ctx->timestamp = now;
+		ctx->timestamp = perf_clock();
  		perf_cgroup_set_timestamp(task, ctx);
  	}

@@ -3150,27 +3332,50 @@ ctx_sched_in(struct perf_event_context *ctx,
  	 * First go through the list and put on any pinned groups
  	 * in order to give them the best chance of going on.
  	 */
-	if (is_active & EVENT_PINNED)
-		ctx_pinned_sched_in(ctx, cpuctx);
+	if (is_active & EVENT_PINNED) {
+		if (mux) {
+			groups = perf_cpu_tree_find(&ctx->pinned_tree, -1);
+			if (groups)
+				ctx_pinned_sched_in_groups(ctx, cpuctx, groups);
+			groups = perf_cpu_tree_find(&ctx->pinned_tree, cpu);
+			if (groups)
+				ctx_pinned_sched_in_groups(ctx, cpuctx, groups);
+		} else {
+			ctx_pinned_sched_in(ctx, cpuctx);
+		}
+	}

  	/* Then walk through the lower prio flexible groups */
-	if (is_active & EVENT_FLEXIBLE)
-		ctx_flexible_sched_in(ctx, cpuctx);
+	if (is_active & EVENT_FLEXIBLE) {
+		if (mux) {
+			groups = perf_cpu_tree_find(&ctx->flexible_tree, -1);
+			if (groups)
+				ctx_flexible_sched_in_groups(ctx, cpuctx,
+						groups);
+			groups = perf_cpu_tree_find(&ctx->flexible_tree, cpu);
+			if (groups)
+				ctx_flexible_sched_in_groups(ctx, cpuctx,
+						groups);
+		} else {
+			ctx_flexible_sched_in(ctx, cpuctx);
+		}
+	}
  }

  static void cpu_ctx_sched_in(struct perf_cpu_context *cpuctx,
  			     enum event_type_t event_type,
-			     struct task_struct *task)
+			     struct task_struct *task, int mux)
  {
  	struct perf_event_context *ctx = &cpuctx->ctx;

-	ctx_sched_in(ctx, cpuctx, event_type, task);
+	ctx_sched_in(ctx, cpuctx, event_type, task, mux);
  }

  static void perf_event_context_sched_in(struct perf_event_context *ctx,
  					struct task_struct *task)
  {
  	struct perf_cpu_context *cpuctx;
+	int mux = 0;

  	cpuctx = __get_cpu_context(ctx);
  	if (cpuctx->task_ctx == ctx)
@@ -3187,8 +3392,8 @@ static void perf_event_context_sched_in(struct 
perf_event_context *ctx,
  	 * events, no need to flip the cpuctx's events around.
  	 */
  	if (!list_empty(&ctx->pinned_groups))
-		cpu_ctx_sched_out(cpuctx, EVENT_FLEXIBLE);
-	perf_event_sched_in(cpuctx, ctx, task);
+		cpu_ctx_sched_out(cpuctx, EVENT_FLEXIBLE, mux);
+	perf_event_sched_in(cpuctx, ctx, task, mux);
  	perf_pmu_enable(ctx->pmu);
  	perf_ctx_unlock(cpuctx, ctx);
  }
@@ -3421,14 +3626,24 @@ static void rotate_ctx(struct perf_event_context 
*ctx)
  	 * Rotate the first entry last of non-pinned groups. Rotation might be
  	 * disabled by the inheritance code.
  	 */
-	if (!ctx->rotate_disable)
-		list_rotate_left(&ctx->flexible_groups);
+	if (!ctx->rotate_disable) {
+		struct list_head *groups;
+
+		groups = perf_cpu_tree_find(&ctx->flexible_tree, -1);
+		if (groups)
+			list_rotate_left(groups);
+		groups = perf_cpu_tree_find(&ctx->flexible_tree,
+				smp_processor_id());
+		if (groups)
+			list_rotate_left(groups);
+	}
  }

  static int perf_rotate_context(struct perf_cpu_context *cpuctx)
  {
  	struct perf_event_context *ctx = NULL;
  	int rotate = 0;
+	int mux = 1;

  	if (cpuctx->ctx.nr_events) {
  		if (cpuctx->ctx.nr_events != cpuctx->ctx.nr_active)
@@ -3447,15 +3662,15 @@ static int perf_rotate_context(struct 
perf_cpu_context *cpuctx)
  	perf_ctx_lock(cpuctx, cpuctx->task_ctx);
  	perf_pmu_disable(cpuctx->ctx.pmu);

-	cpu_ctx_sched_out(cpuctx, EVENT_FLEXIBLE);
+	cpu_ctx_sched_out(cpuctx, EVENT_FLEXIBLE, mux);
  	if (ctx)
-		ctx_sched_out(ctx, cpuctx, EVENT_FLEXIBLE);
+		ctx_sched_out(ctx, cpuctx, EVENT_FLEXIBLE, mux);

  	rotate_ctx(&cpuctx->ctx);
  	if (ctx)
  		rotate_ctx(ctx);

-	perf_event_sched_in(cpuctx, ctx, current);
+	perf_event_sched_in(cpuctx, ctx, current, mux);

  	perf_pmu_enable(cpuctx->ctx.pmu);
  	perf_ctx_unlock(cpuctx, cpuctx->task_ctx);
@@ -3507,6 +3722,7 @@ static void perf_event_enable_on_exec(int ctxn)
  	struct perf_event *event;
  	unsigned long flags;
  	int enabled = 0;
+	int mux = 0;

  	local_irq_save(flags);
  	ctx = current->perf_event_ctxp[ctxn];
@@ -3515,7 +3731,7 @@ static void perf_event_enable_on_exec(int ctxn)

  	cpuctx = __get_cpu_context(ctx);
  	perf_ctx_lock(cpuctx, ctx);
-	ctx_sched_out(ctx, cpuctx, EVENT_TIME);
+	ctx_sched_out(ctx, cpuctx, EVENT_TIME, mux);
  	list_for_each_entry(event, &ctx->event_list, event_entry) {
  		enabled |= event_enable_on_exec(event, ctx);
  		event_type |= get_event_type(event);
@@ -3528,7 +3744,7 @@ static void perf_event_enable_on_exec(int ctxn)
  		clone_ctx = unclone_ctx(ctx);
  		ctx_resched(cpuctx, ctx, event_type);
  	} else {
-		ctx_sched_in(ctx, cpuctx, EVENT_TIME, current);
+		ctx_sched_in(ctx, cpuctx, EVENT_TIME, current, mux);
  	}
  	perf_ctx_unlock(cpuctx, ctx);

@@ -3749,7 +3965,9 @@ static void __perf_event_init_context(struct 
perf_event_context *ctx)
  	mutex_init(&ctx->mutex);
  	INIT_LIST_HEAD(&ctx->active_ctx_list);
  	INIT_LIST_HEAD(&ctx->pinned_groups);
+	ctx->pinned_tree = RB_ROOT;
  	INIT_LIST_HEAD(&ctx->flexible_groups);
+	ctx->flexible_tree = RB_ROOT;
  	INIT_LIST_HEAD(&ctx->event_list);
  	atomic_set(&ctx->refcount, 1);
  }
@@ -4848,8 +5066,8 @@ static void calc_timer_values(struct perf_event 
*event,

  	*now = perf_clock();
  	ctx_time = event->shadow_ctx_time + *now;
-	*enabled = ctx_time - event->tstamp_enabled;
-	*running = ctx_time - event->tstamp_running;
+	*enabled = ctx_time - event->tstamp->enabled;
+	*running = ctx_time - event->tstamp->running;
  }

  static void perf_event_init_userpage(struct perf_event *event)
@@ -9357,6 +9575,9 @@ perf_event_alloc(struct perf_event_attr *attr, int 
cpu,
  	INIT_LIST_HEAD(&event->child_list);

  	INIT_LIST_HEAD(&event->group_entry);
+	RB_CLEAR_NODE(&event->group_node);
+	INIT_LIST_HEAD(&event->group_list);
+	INIT_LIST_HEAD(&event->group_list_entry);
  	INIT_LIST_HEAD(&event->event_entry);
  	INIT_LIST_HEAD(&event->sibling_list);
  	INIT_LIST_HEAD(&event->rb_entry);
@@ -9372,6 +9593,7 @@ perf_event_alloc(struct perf_event_attr *attr, int 
cpu,
  	raw_spin_lock_init(&event->addr_filters.lock);

  	atomic_long_set(&event->refcount, 1);
+	event->tstamp		= &event->tstamp_data;
  	event->cpu		= cpu;
  	event->attr		= *attr;
  	event->group_leader	= group_leader;

^ permalink raw reply related	[flat|nested] 25+ messages in thread

* [PATCH v2]: perf/core: addressing 4x slowdown during per-process, profiling of STREAM benchmark on Intel Xeon Phi
  2017-05-26 22:13 [PATCH]: perf/core: addressing 4x slowdown during per-process profiling of STREAM benchmark on Intel Xeon Phi Alexey Budankov
@ 2017-05-27 11:19 ` Alexey Budankov
  2017-05-29  7:45   ` Peter Zijlstra
                     ` (2 more replies)
  2017-05-29 12:03 ` [PATCH]: perf/core: addressing 4x slowdown during per-process " Alexander Shishkin
  1 sibling, 3 replies; 25+ messages in thread
From: Alexey Budankov @ 2017-05-27 11:19 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo
  Cc: Alexander Shishkin, Andi Kleen, Kan Liang, Dmitri Prokhorov,
	Valery Cherepennikov, David Carrillo-Cisneros, Stephane Eranian,
	Mark Rutland, linux-kernel

Motivation:

The issue manifests like 4x slowdown when profiling single thread STREAM
benchmark on Intel Xeon Phi running RHEL7.2 (Intel MPSS distribution).
Perf profiling is done in per-process mode and involves about 30 core
events. In case the benchmark is OpenMP based and runs under profiling in
272 threads the overhead becomes even more dramatic: 512.144s against
36.192s (with this patch).

Solution:

cpu indexed trees for perf_event_context::pinned_groups and
perf_event_context::flexible_groups lists are introduced. Every tree node
keeps a list of groups allocated for the same cpu. A tree references only
groups located at the appropriate group list. The tree provides capability
to iterate over groups allocated for a specific cpu only, what is exactly
required by multiplexing timer interrupt handler. The handler runs per-cpu
and enables/disables groups using group_sched_in()/group_sched_out() that
call event_filter_match() function filtering out groups allocated for cpus
different from the one executing the handler. Additionally for every
filtered out group group_sched_out() updates tstamps values to the current
interrupt time. This updating work is now done only once by
update_context_time() called by ctx_sched_out() before cpu groups
iteration. For this trick to work it is required that tstamps of filtered
out events would point to perf_event_context::tstamp_data object instead
of perf_event::tstamp_data ones, as it is initialized from an event
allocation. tstamps references are switched by
group_sched_in()/group_sched_out() every time a group is checked for its
suitability for currently running cpu. When a thread enters some cpu on
a context switch a long run through pinned and flexible groups is
performed by perf_event_sched_in(, mux=0) with new parameter mux set to 0
and filtered out groups tstamps are switched to
perf_event_context::tstamp_data object. Then a series of multiplexing
interrupts happens and the handler rotates the flexible groups calling
ctx_sched_out(,mux=1)/perf_event_sched_in(,mux=1) iterating over the cpu
tree lists only and avoiding long runs through the complete group lists.
This is where speedup comes from. Eventually when the thread leaves the cpu
ctx_sched_out(,mux=0) is called restoring tstamps pointers to the events'
perf_event::tstamp_data objects.

Locking:

The new perf_event_context::pinned_tree and
perf_event_context::flexible_tree are protected by perf_evet_context::lock
and perf_event_context::mutex, the same locking schema as for appropriate
perf_event_context::pinned_group and perf_event_context::flexible_group.
Indirect perf_event::tstamp reference and perf_event::tstamp_data object
are protected by perf_event::lock as earlier for original
perf_event::tstamp_* fields.

Added API:

Objects:
1. struct perf_event_tstamp:
    - enabled
    - running
    - stopped
2. struct perf_event:
    - group_node
    - group_list
    - group_list_entry
    - tstamp
    - tstamp_data
3. struct perf_event_context:
    - pinned_tree
    - flexible_tree
    - tstamp_data

Functions:
1. insert a group into a tree using event->cpu as a key:
	int perf_cpu_tree_insert(struct rb_root *tree,
		struct perf_event *event);
2. delete a group from a tree, if the group is directly attached
    to a tree it also detaches all groups on the groups
    group_list list:
	int perf_cpu_tree_delete(struct rb_root *tree,
		struct perf_event *event);
3. find group_list list by a cpu key:
         struct list_head * perf_cpu_tree_find(struct rb_root *tree,
		int cpu);
4. enable a pinned group on a cpu:
         void ctx_sched_in_flexible_group(struct perf_event_context *ctx,
		struct perf_cpu_context *cpuctx,
		struct perf_event *group, int *can_add_hw);
5. enable a flexible group on a cpu; calls ctx_sched_in_flexible_group
    and updates group->state to ERROR in case of failure:
	void ctx_sched_in_pinned_group(struct perf_event_context *ctx,
		struct perf_cpu_context *cpuctx,
		struct perf_event *group);
6. enable per-cpu pinned tree's groups on a cpu:
	void ctx_pinned_sched_in_groups(struct perf_event_context *ctx,
		struct perf_cpu_context *cpuctx,
		struct list_head *groups);
7. enable per-cpu flexible tree's groups on a cpu:
	void ctx_flexible_sched_in_groups(
		struct perf_event_context *ctx,
		struct perf_cpu_context *cpuctx,
		struct list_head *groups);
8. disable per-cpu tree's groups on a cpu:
	void ctx_sched_out_groups(struct perf_event_context *ctx,
		struct perf_cpu_context *cpuctx,
		struct list_head *groups);
9. get a group tree based on event->attr.pinned attribute value:
	struct rb_root * ctx_cpu_tree(struct perf_event *event,
		struct perf_event_context *ctx);

Modified API:

Objects:
1. struct perf_event
2. struct perf_event_context

Functions:
1. perf_event_alloc()
2. add_event_to_ctx()
3. perf_event_enable_on_exec()
4. __perf_install_in_context()
5. __perf_event_init_context()
6. __perf_event_mark_enabled()
7. __perf_event_enable()
8. __perf_event_task_sched_out()
9. ctx_group_list()
10. list_add_event()
11. list_del_event()
12. perf_event_context_sched_in()
13. cpu_ctx_sched_in()
14. cpu_ctx_sched_out()
15. ctx_sched_in()
16. ctx_sched_out()
17. ctx_resched()
18. ctx_pinned_sched_in()
19. ctx_flexible_sched_in()
20. group_sched_in()
21. event_sched_in()
22. event_sched_out()
23. rotate_ctx()
24. perf_rotate_context()
25. update_context_time()
26. update_event_times()
27. calc_timer_values()
28. perf_cgroup_switch()
29. perf_cgroup_mark_enabled()

Signed-off-by: Alexey Budankov <alexey.budankov@linux.intel.com>
---
  include/linux/perf_event.h |  63 +++++--
  kernel/events/core.c       | 446 
++++++++++++++++++++++++++++++++++-----------
  2 files changed, 388 insertions(+), 121 deletions(-)

diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 24a6358..6bd380f 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -550,6 +550,22 @@ struct pmu_event_list {
  	struct list_head	list;
  };

+struct perf_event_tstamp {
+	/*
+	 * These are timestamps used for computing total_time_enabled
+	 * and total_time_running when the event is in INACTIVE or
+	 * ACTIVE state, measured in nanoseconds from an arbitrary point
+	 * in time.
+	 * enabled: the notional time when the event was enabled
+	 * running: the notional time when the event was scheduled on
+	 * stopped: in INACTIVE state, the notional time when the
+	 *	event was scheduled off.
+	 */
+	u64 enabled;
+	u64 running;
+	u64 stopped;
+};
+
  /**
   * struct perf_event - performance event kernel representation:
   */
@@ -571,6 +587,27 @@ struct perf_event {
  	 * either sufficies for read.
  	 */
  	struct list_head		group_entry;
+	/*
+	 * Node on the pinned or flexible tree located at the event context;
+	 * the node may be empty in case its event is not directly attached
+	 * to the tree but to group_list list of the event directly
+	 * attached to the tree;
+	 */
+	struct rb_node			group_node;
+	/*
+	 * List keeps groups allocated for the same cpu;
+	 * the list may be empty in case its event is not directly
+	 * attached to the tree but to group_list list of the event directly
+	 * attached to the tree;
+	 */
+	struct list_head		group_list;
+	/*
+	 * Entry into the group_list list above;
+	 * the entry may be attached to the self group_list list above
+	 * in case the event is directly attached to the pinned or
+	 * flexible tree;
+	 */
+	struct list_head		group_list_entry;
  	struct list_head		sibling_list;

  	/*
@@ -611,18 +648,11 @@ struct perf_event {
  	u64				total_time_running;

  	/*
-	 * These are timestamps used for computing total_time_enabled
-	 * and total_time_running when the event is in INACTIVE or
-	 * ACTIVE state, measured in nanoseconds from an arbitrary point
-	 * in time.
-	 * tstamp_enabled: the notional time when the event was enabled
-	 * tstamp_running: the notional time when the event was scheduled on
-	 * tstamp_stopped: in INACTIVE state, the notional time when the
-	 *	event was scheduled off.
+	 * tstamp points to the tstamp_data object below or to the object
+	 * located at the event context;
  	 */
-	u64				tstamp_enabled;
-	u64				tstamp_running;
-	u64				tstamp_stopped;
+	struct perf_event_tstamp	*tstamp;
+	struct perf_event_tstamp	tstamp_data;

  	/*
  	 * timestamp shadows the actual context timing but it can
@@ -742,7 +772,17 @@ struct perf_event_context {

  	struct list_head		active_ctx_list;
  	struct list_head		pinned_groups;
+	/*
+	 * Cpu tree for pinned groups; keeps event's group_node nodes
+	 * of attached flexible groups;
+	 */
+	struct rb_root			pinned_tree;
  	struct list_head		flexible_groups;
+	/*
+	 * Cpu tree for flexible groups; keeps event's group_node nodes
+	 * of attached flexible groups;
+	 */
+	struct rb_root			flexible_tree;
  	struct list_head		event_list;
  	int				nr_events;
  	int				nr_active;
@@ -758,6 +798,7 @@ struct perf_event_context {
  	 */
  	u64				time;
  	u64				timestamp;
+	struct perf_event_tstamp	tstamp_data;

  	/*
  	 * These fields let us detect when two contexts have both
diff --git a/kernel/events/core.c b/kernel/events/core.c
index f8c27d3..92f797a 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -555,11 +555,11 @@ void perf_sample_event_took(u64 sample_len_ns)
  static atomic64_t perf_event_id;

  static void cpu_ctx_sched_out(struct perf_cpu_context *cpuctx,
-			      enum event_type_t event_type);
+			      enum event_type_t event_type, int mux);

  static void cpu_ctx_sched_in(struct perf_cpu_context *cpuctx,
  			     enum event_type_t event_type,
-			     struct task_struct *task);
+			     struct task_struct *task, int mux);

  static void update_context_time(struct perf_event_context *ctx);
  static u64 perf_event_time(struct perf_event *event);
@@ -701,6 +701,7 @@ static void perf_cgroup_switch(struct task_struct 
*task, int mode)
  	struct perf_cpu_context *cpuctx;
  	struct list_head *list;
  	unsigned long flags;
+	int mux = 0;

  	/*
  	 * Disable interrupts and preemption to avoid this CPU's
@@ -716,7 +717,7 @@ static void perf_cgroup_switch(struct task_struct 
*task, int mode)
  		perf_pmu_disable(cpuctx->ctx.pmu);

  		if (mode & PERF_CGROUP_SWOUT) {
-			cpu_ctx_sched_out(cpuctx, EVENT_ALL);
+			cpu_ctx_sched_out(cpuctx, EVENT_ALL, mux);
  			/*
  			 * must not be done before ctxswout due
  			 * to event_filter_match() in event_sched_out()
@@ -735,7 +736,7 @@ static void perf_cgroup_switch(struct task_struct 
*task, int mode)
  			 */
  			cpuctx->cgrp = perf_cgroup_from_task(task,
  							     &cpuctx->ctx);
-			cpu_ctx_sched_in(cpuctx, EVENT_ALL, task);
+			cpu_ctx_sched_in(cpuctx, EVENT_ALL, task, mux);
  		}
  		perf_pmu_enable(cpuctx->ctx.pmu);
  		perf_ctx_unlock(cpuctx, cpuctx->task_ctx);
@@ -865,10 +866,10 @@ perf_cgroup_mark_enabled(struct perf_event *event,

  	event->cgrp_defer_enabled = 0;

-	event->tstamp_enabled = tstamp - event->total_time_enabled;
+	event->tstamp->enabled = tstamp - event->total_time_enabled;
  	list_for_each_entry(sub, &event->sibling_list, group_entry) {
  		if (sub->state >= PERF_EVENT_STATE_INACTIVE) {
-			sub->tstamp_enabled = tstamp - sub->total_time_enabled;
+			sub->tstamp->enabled = tstamp - sub->total_time_enabled;
  			sub->cgrp_defer_enabled = 0;
  		}
  	}
@@ -1383,6 +1384,9 @@ static void update_context_time(struct 
perf_event_context *ctx)

  	ctx->time += now - ctx->timestamp;
  	ctx->timestamp = now;
+
+	ctx->tstamp_data.running += ctx->time - ctx->tstamp_data.stopped;
+	ctx->tstamp_data.stopped = ctx->time;
  }

  static u64 perf_event_time(struct perf_event *event)
@@ -1424,16 +1428,16 @@ static void update_event_times(struct perf_event 
*event)
  	else if (ctx->is_active)
  		run_end = ctx->time;
  	else
-		run_end = event->tstamp_stopped;
+		run_end = event->tstamp->stopped;

-	event->total_time_enabled = run_end - event->tstamp_enabled;
+	event->total_time_enabled = run_end - event->tstamp->enabled;

  	if (event->state == PERF_EVENT_STATE_INACTIVE)
-		run_end = event->tstamp_stopped;
+		run_end = event->tstamp->stopped;
  	else
  		run_end = perf_event_time(event);

-	event->total_time_running = run_end - event->tstamp_running;
+	event->total_time_running = run_end - event->tstamp->running;

  }

@@ -1472,6 +1476,186 @@ ctx_group_list(struct perf_event *event, struct 
perf_event_context *ctx)
  		return &ctx->flexible_groups;
  }

+static int
+perf_cpu_tree_insert(struct rb_root *tree, struct perf_event *event)
+{
+	struct rb_node **node;
+	struct rb_node *parent;
+
+	if (!tree || !event)
+		return 0;
+
+	node = &tree->rb_node;
+	parent = *node;
+
+	while (*node) {
+		struct perf_event *node_event =	container_of(*node,
+				struct perf_event, group_node);
+
+		parent = *node;
+
+		if (event->cpu < node_event->cpu) {
+			node = &((*node)->rb_left);
+		} else if (event->cpu > node_event->cpu) {
+			node = &((*node)->rb_right);
+		} else {
+			list_add_tail(&event->group_list_entry,
+					&node_event->group_list);
+			return 2;
+		}
+	}
+
+	list_add_tail(&event->group_list_entry, &event->group_list);
+
+	rb_link_node(&event->group_node, parent, node);
+	rb_insert_color(&event->group_node, tree);
+
+	return 1;
+}
+
+static int
+perf_cpu_tree_delete(struct rb_root *tree, struct perf_event *event)
+{
+	if (!tree || !event)
+		return 0;
+
+	if (RB_EMPTY_NODE(&event->group_node)) {
+		list_del_init(&event->group_list_entry);
+	} else {
+		struct perf_event *list_event, *list_next;
+
+		rb_erase(&event->group_node, tree);
+		list_for_each_entry_safe(list_event, list_next,
+				&event->group_list, group_list_entry)
+			list_del_init(&list_event->group_list_entry);
+	}
+
+	return 1;
+}
+
+static struct list_head *
+perf_cpu_tree_find(struct rb_root *tree, int cpu)
+{
+	struct rb_node *node;
+
+	if (!tree)
+		return NULL;
+
+	node = tree->rb_node;
+
+	while (node) {
+		struct perf_event *node_event = container_of(node,
+				struct perf_event, group_node);
+
+		if (cpu < node_event->cpu)
+			node = node->rb_left;
+		else if (cpu > node_event->cpu)
+			node = node->rb_right;
+		else
+			return &node_event->group_list;
+	}
+
+	return NULL;
+}
+
+static struct rb_root *
+ctx_cpu_tree(struct perf_event *event, struct perf_event_context *ctx)
+{
+	if (event->attr.pinned)
+		return &ctx->pinned_tree;
+	else
+		return &ctx->flexible_tree;
+}
+
+static inline int
+event_filter_match(struct perf_event *event);
+
+static int
+group_can_go_on(struct perf_event *event,
+		struct perf_cpu_context *cpuctx,
+		int can_add_hw);
+
+static int
+group_sched_in(struct perf_event *group_event,
+	       struct perf_cpu_context *cpuctx,
+	       struct perf_event_context *ctx);
+
+static void
+ctx_sched_in_flexible_group(struct perf_event_context *ctx,
+		    struct perf_cpu_context *cpuctx,
+		    struct perf_event *event, int *can_add_hw)
+{
+	if (event->state <= PERF_EVENT_STATE_OFF)
+		return;
+	if (!event_filter_match(event)) {
+		if (event->tstamp != &ctx->tstamp_data)
+			event->tstamp = &ctx->tstamp_data;
+		return;
+	}
+
+	/* may need to reset tstamp_enabled */
+	if (is_cgroup_event(event))
+		perf_cgroup_mark_enabled(event, ctx);
+
+	if (group_can_go_on(event, cpuctx, *can_add_hw)) {
+		if (group_sched_in(event, cpuctx, ctx))
+			*can_add_hw = 0;
+	}
+}
+
+static void
+ctx_sched_in_pinned_group(struct perf_event_context *ctx,
+		    struct perf_cpu_context *cpuctx,
+		    struct perf_event *event)
+{
+	int can_add_hw = 1;
+
+	ctx_sched_in_flexible_group(ctx, cpuctx, event, &can_add_hw);
+	if (!can_add_hw) {
+		update_group_times(event);
+		event->state = PERF_EVENT_STATE_ERROR;
+	}
+}
+
+static void
+ctx_pinned_sched_in_groups(struct perf_event_context *ctx,
+		    struct perf_cpu_context *cpuctx,
+		    struct list_head *groups)
+{
+	struct perf_event *group;
+
+	list_for_each_entry(group, groups, group_list_entry)
+		ctx_sched_in_pinned_group(ctx, cpuctx, group);
+}
+
+static void
+ctx_flexible_sched_in_groups(struct perf_event_context *ctx,
+		      struct perf_cpu_context *cpuctx,
+		      struct list_head *groups)
+{
+	struct perf_event *group;
+	int can_add_hw = 1;
+
+	list_for_each_entry(group, groups, group_list_entry)
+		ctx_sched_in_flexible_group(ctx, cpuctx, group, &can_add_hw);
+}
+
+static void
+group_sched_out(struct perf_event *group_event,
+		struct perf_cpu_context *cpuctx,
+		struct perf_event_context *ctx);
+
+static void
+ctx_sched_out_groups(struct perf_event_context *ctx,
+		struct perf_cpu_context *cpuctx,
+		struct list_head *groups)
+{
+	struct perf_event *group;
+
+	list_for_each_entry(group, groups, group_list_entry)
+		group_sched_out(group, cpuctx, ctx);
+}
+
  /*
   * Add a event from the lists for its context.
   * Must be called with ctx->mutex and ctx->lock held.
@@ -1491,11 +1675,15 @@ list_add_event(struct perf_event *event, struct 
perf_event_context *ctx)
  	 */
  	if (event->group_leader == event) {
  		struct list_head *list;
+		struct rb_root *tree;

  		event->group_caps = event->event_caps;

  		list = ctx_group_list(event, ctx);
  		list_add_tail(&event->group_entry, list);
+
+		tree = ctx_cpu_tree(event, ctx);
+		perf_cpu_tree_insert(tree, event);
  	}

  	list_update_cgroup_event(event, ctx, true);
@@ -1685,8 +1873,14 @@ list_del_event(struct perf_event *event, struct 
perf_event_context *ctx)

  	list_del_rcu(&event->event_entry);

-	if (event->group_leader == event)
+	if (event->group_leader == event) {
+		struct rb_root *tree;
+
+		tree = ctx_cpu_tree(event, ctx);
+		perf_cpu_tree_delete(tree, event);
+
  		list_del_init(&event->group_entry);
+	}

  	update_group_times(event);

@@ -1811,9 +2005,13 @@ event_sched_out(struct perf_event *event,
  	 */
  	if (event->state == PERF_EVENT_STATE_INACTIVE &&
  	    !event_filter_match(event)) {
-		delta = tstamp - event->tstamp_stopped;
-		event->tstamp_running += delta;
-		event->tstamp_stopped = tstamp;
+		delta = tstamp - event->tstamp->stopped;
+		event->tstamp->running += delta;
+		event->tstamp->stopped = tstamp;
+		if (event->tstamp != &event->tstamp_data) {
+			event->tstamp_data = *event->tstamp;
+			event->tstamp = &event->tstamp_data;
+		}
  	}

  	if (event->state != PERF_EVENT_STATE_ACTIVE)
@@ -1821,7 +2019,7 @@ event_sched_out(struct perf_event *event,

  	perf_pmu_disable(event->pmu);

-	event->tstamp_stopped = tstamp;
+	event->tstamp->stopped = tstamp;
  	event->pmu->del(event, 0);
  	event->oncpu = -1;
  	event->state = PERF_EVENT_STATE_INACTIVE;
@@ -2096,7 +2294,7 @@ event_sched_in(struct perf_event *event,
  		goto out;
  	}

-	event->tstamp_running += tstamp - event->tstamp_stopped;
+	event->tstamp->running += tstamp - event->tstamp->stopped;

  	if (!is_software_event(event))
  		cpuctx->active_oncpu++;
@@ -2168,8 +2366,8 @@ group_sched_in(struct perf_event *group_event,
  			simulate = true;

  		if (simulate) {
-			event->tstamp_running += now - event->tstamp_stopped;
-			event->tstamp_stopped = now;
+			event->tstamp->running += now - event->tstamp->stopped;
+			event->tstamp->stopped = now;
  		} else {
  			event_sched_out(event, cpuctx, ctx);
  		}
@@ -2221,43 +2419,45 @@ static void add_event_to_ctx(struct perf_event 
*event,

  	list_add_event(event, ctx);
  	perf_group_attach(event);
-	event->tstamp_enabled = tstamp;
-	event->tstamp_running = tstamp;
-	event->tstamp_stopped = tstamp;
+	event->tstamp->enabled = tstamp;
+	event->tstamp->running = tstamp;
+	event->tstamp->stopped = tstamp;
  }

  static void ctx_sched_out(struct perf_event_context *ctx,
  			  struct perf_cpu_context *cpuctx,
-			  enum event_type_t event_type);
+			  enum event_type_t event_type, int mux);
  static void
  ctx_sched_in(struct perf_event_context *ctx,
  	     struct perf_cpu_context *cpuctx,
  	     enum event_type_t event_type,
-	     struct task_struct *task);
+	     struct task_struct *task, int mux);

  static void task_ctx_sched_out(struct perf_cpu_context *cpuctx,
  			       struct perf_event_context *ctx,
  			       enum event_type_t event_type)
  {
+	int mux = 0;
+
  	if (!cpuctx->task_ctx)
  		return;

  	if (WARN_ON_ONCE(ctx != cpuctx->task_ctx))
  		return;

-	ctx_sched_out(ctx, cpuctx, event_type);
+	ctx_sched_out(ctx, cpuctx, event_type, mux);
  }

  static void perf_event_sched_in(struct perf_cpu_context *cpuctx,
  				struct perf_event_context *ctx,
-				struct task_struct *task)
+				struct task_struct *task, int mux)
  {
-	cpu_ctx_sched_in(cpuctx, EVENT_PINNED, task);
+	cpu_ctx_sched_in(cpuctx, EVENT_PINNED, task, mux);
  	if (ctx)
-		ctx_sched_in(ctx, cpuctx, EVENT_PINNED, task);
-	cpu_ctx_sched_in(cpuctx, EVENT_FLEXIBLE, task);
+		ctx_sched_in(ctx, cpuctx, EVENT_PINNED, task, mux);
+	cpu_ctx_sched_in(cpuctx, EVENT_FLEXIBLE, task, mux);
  	if (ctx)
-		ctx_sched_in(ctx, cpuctx, EVENT_FLEXIBLE, task);
+		ctx_sched_in(ctx, cpuctx, EVENT_FLEXIBLE, task, mux);
  }

  /*
@@ -2281,6 +2481,7 @@ static void ctx_resched(struct perf_cpu_context 
*cpuctx,
  {
  	enum event_type_t ctx_event_type = event_type & EVENT_ALL;
  	bool cpu_event = !!(event_type & EVENT_CPU);
+	int mux = 0;

  	/*
  	 * If pinned groups are involved, flexible groups also need to be
@@ -2301,11 +2502,11 @@ static void ctx_resched(struct perf_cpu_context 
*cpuctx,
  	 *  - otherwise, do nothing more.
  	 */
  	if (cpu_event)
-		cpu_ctx_sched_out(cpuctx, ctx_event_type);
+		cpu_ctx_sched_out(cpuctx, ctx_event_type, mux);
  	else if (ctx_event_type & EVENT_PINNED)
-		cpu_ctx_sched_out(cpuctx, EVENT_FLEXIBLE);
+		cpu_ctx_sched_out(cpuctx, EVENT_FLEXIBLE, mux);

-	perf_event_sched_in(cpuctx, task_ctx, current);
+	perf_event_sched_in(cpuctx, task_ctx, current, mux);
  	perf_pmu_enable(cpuctx->ctx.pmu);
  }

@@ -2323,6 +2524,7 @@ static int  __perf_install_in_context(void *info)
  	struct perf_event_context *task_ctx = cpuctx->task_ctx;
  	bool reprogram = true;
  	int ret = 0;
+	int mux = 0;

  	raw_spin_lock(&cpuctx->ctx.lock);
  	if (ctx->task) {
@@ -2349,7 +2551,7 @@ static int  __perf_install_in_context(void *info)
  	}

  	if (reprogram) {
-		ctx_sched_out(ctx, cpuctx, EVENT_TIME);
+		ctx_sched_out(ctx, cpuctx, EVENT_TIME, mux);
  		add_event_to_ctx(event, ctx);
  		ctx_resched(cpuctx, task_ctx, get_event_type(event));
  	} else {
@@ -2468,10 +2670,10 @@ static void __perf_event_mark_enabled(struct 
perf_event *event)
  	u64 tstamp = perf_event_time(event);

  	event->state = PERF_EVENT_STATE_INACTIVE;
-	event->tstamp_enabled = tstamp - event->total_time_enabled;
+	event->tstamp->enabled = tstamp - event->total_time_enabled;
  	list_for_each_entry(sub, &event->sibling_list, group_entry) {
  		if (sub->state >= PERF_EVENT_STATE_INACTIVE)
-			sub->tstamp_enabled = tstamp - sub->total_time_enabled;
+			sub->tstamp->enabled = tstamp - sub->total_time_enabled;
  	}
  }

@@ -2485,13 +2687,14 @@ static void __perf_event_enable(struct 
perf_event *event,
  {
  	struct perf_event *leader = event->group_leader;
  	struct perf_event_context *task_ctx;
+	int mux = 0;

  	if (event->state >= PERF_EVENT_STATE_INACTIVE ||
  	    event->state <= PERF_EVENT_STATE_ERROR)
  		return;

  	if (ctx->is_active)
-		ctx_sched_out(ctx, cpuctx, EVENT_TIME);
+		ctx_sched_out(ctx, cpuctx, EVENT_TIME, mux);

  	__perf_event_mark_enabled(event);

@@ -2501,7 +2704,7 @@ static void __perf_event_enable(struct perf_event 
*event,
  	if (!event_filter_match(event)) {
  		if (is_cgroup_event(event))
  			perf_cgroup_defer_enabled(event);
-		ctx_sched_in(ctx, cpuctx, EVENT_TIME, current);
+		ctx_sched_in(ctx, cpuctx, EVENT_TIME, current, mux);
  		return;
  	}

@@ -2510,7 +2713,7 @@ static void __perf_event_enable(struct perf_event 
*event,
  	 * then don't put it on unless the group is on.
  	 */
  	if (leader != event && leader->state != PERF_EVENT_STATE_ACTIVE) {
-		ctx_sched_in(ctx, cpuctx, EVENT_TIME, current);
+		ctx_sched_in(ctx, cpuctx, EVENT_TIME, current, mux);
  		return;
  	}

@@ -2706,10 +2909,12 @@ EXPORT_SYMBOL_GPL(perf_event_refresh);

  static void ctx_sched_out(struct perf_event_context *ctx,
  			  struct perf_cpu_context *cpuctx,
-			  enum event_type_t event_type)
+			  enum event_type_t event_type, int mux)
  {
  	int is_active = ctx->is_active;
  	struct perf_event *event;
+	struct list_head *groups;
+	int cpu = smp_processor_id();

  	lockdep_assert_held(&ctx->lock);

@@ -2756,13 +2961,34 @@ static void ctx_sched_out(struct 
perf_event_context *ctx,

  	perf_pmu_disable(ctx->pmu);
  	if (is_active & EVENT_PINNED) {
-		list_for_each_entry(event, &ctx->pinned_groups, group_entry)
-			group_sched_out(event, cpuctx, ctx);
+		if (mux) {
+			groups = perf_cpu_tree_find(&ctx->pinned_tree, -1);
+			if (groups)
+				ctx_sched_out_groups(ctx, cpuctx, groups);
+			groups = perf_cpu_tree_find(&ctx->pinned_tree, cpu);
+			if (groups)
+				ctx_sched_out_groups(ctx, cpuctx, groups);
+		} else {
+			list_for_each_entry(event, &ctx->pinned_groups,
+					group_entry)
+				group_sched_out(event, cpuctx, ctx);
+		}
  	}

  	if (is_active & EVENT_FLEXIBLE) {
-		list_for_each_entry(event, &ctx->flexible_groups, group_entry)
-			group_sched_out(event, cpuctx, ctx);
+		if (mux) {
+			groups = perf_cpu_tree_find(&ctx->flexible_tree, -1);
+			if (groups)
+				ctx_sched_out_groups(ctx, cpuctx, groups);
+			groups = perf_cpu_tree_find(&ctx->flexible_tree, cpu);
+			if (groups)
+				ctx_sched_out_groups(ctx, cpuctx, groups);
+
+		} else {
+			list_for_each_entry(event, &ctx->flexible_groups,
+					group_entry)
+				group_sched_out(event, cpuctx, ctx);
+		}
  	}
  	perf_pmu_enable(ctx->pmu);
  }
@@ -3051,9 +3277,9 @@ void __perf_event_task_sched_out(struct 
task_struct *task,
   * Called with IRQs disabled
   */
  static void cpu_ctx_sched_out(struct perf_cpu_context *cpuctx,
-			      enum event_type_t event_type)
+			      enum event_type_t event_type, int mux)
  {
-	ctx_sched_out(&cpuctx->ctx, cpuctx, event_type);
+	ctx_sched_out(&cpuctx->ctx, cpuctx, event_type, mux);
  }

  static void
@@ -3061,29 +3287,8 @@ ctx_pinned_sched_in(struct perf_event_context *ctx,
  		    struct perf_cpu_context *cpuctx)
  {
  	struct perf_event *event;
-
-	list_for_each_entry(event, &ctx->pinned_groups, group_entry) {
-		if (event->state <= PERF_EVENT_STATE_OFF)
-			continue;
-		if (!event_filter_match(event))
-			continue;
-
-		/* may need to reset tstamp_enabled */
-		if (is_cgroup_event(event))
-			perf_cgroup_mark_enabled(event, ctx);
-
-		if (group_can_go_on(event, cpuctx, 1))
-			group_sched_in(event, cpuctx, ctx);
-
-		/*
-		 * If this pinned group hasn't been scheduled,
-		 * put it in error state.
-		 */
-		if (event->state == PERF_EVENT_STATE_INACTIVE) {
-			update_group_times(event);
-			event->state = PERF_EVENT_STATE_ERROR;
-		}
-	}
+	list_for_each_entry(event, &ctx->pinned_groups, group_entry)
+		ctx_sched_in_pinned_group(ctx, cpuctx, event);
  }

  static void
@@ -3092,37 +3297,19 @@ ctx_flexible_sched_in(struct perf_event_context 
*ctx,
  {
  	struct perf_event *event;
  	int can_add_hw = 1;
-
-	list_for_each_entry(event, &ctx->flexible_groups, group_entry) {
-		/* Ignore events in OFF or ERROR state */
-		if (event->state <= PERF_EVENT_STATE_OFF)
-			continue;
-		/*
-		 * Listen to the 'cpu' scheduling filter constraint
-		 * of events:
-		 */
-		if (!event_filter_match(event))
-			continue;
-
-		/* may need to reset tstamp_enabled */
-		if (is_cgroup_event(event))
-			perf_cgroup_mark_enabled(event, ctx);
-
-		if (group_can_go_on(event, cpuctx, can_add_hw)) {
-			if (group_sched_in(event, cpuctx, ctx))
-				can_add_hw = 0;
-		}
-	}
+	list_for_each_entry(event, &ctx->flexible_groups, group_entry)
+		ctx_sched_in_flexible_group(ctx, cpuctx, event, &can_add_hw);
  }

  static void
  ctx_sched_in(struct perf_event_context *ctx,
  	     struct perf_cpu_context *cpuctx,
  	     enum event_type_t event_type,
-	     struct task_struct *task)
+	     struct task_struct *task, int mux)
  {
  	int is_active = ctx->is_active;
-	u64 now;
+	struct list_head *groups;
+	int cpu = smp_processor_id();

  	lockdep_assert_held(&ctx->lock);

@@ -3141,8 +3328,7 @@ ctx_sched_in(struct perf_event_context *ctx,

  	if (is_active & EVENT_TIME) {
  		/* start ctx time */
-		now = perf_clock();
-		ctx->timestamp = now;
+		ctx->timestamp = perf_clock();
  		perf_cgroup_set_timestamp(task, ctx);
  	}

@@ -3150,27 +3336,50 @@ ctx_sched_in(struct perf_event_context *ctx,
  	 * First go through the list and put on any pinned groups
  	 * in order to give them the best chance of going on.
  	 */
-	if (is_active & EVENT_PINNED)
-		ctx_pinned_sched_in(ctx, cpuctx);
+	if (is_active & EVENT_PINNED) {
+		if (mux) {
+			groups = perf_cpu_tree_find(&ctx->pinned_tree, -1);
+			if (groups)
+				ctx_pinned_sched_in_groups(ctx, cpuctx, groups);
+			groups = perf_cpu_tree_find(&ctx->pinned_tree, cpu);
+			if (groups)
+				ctx_pinned_sched_in_groups(ctx, cpuctx, groups);
+		} else {
+			ctx_pinned_sched_in(ctx, cpuctx);
+		}
+	}

  	/* Then walk through the lower prio flexible groups */
-	if (is_active & EVENT_FLEXIBLE)
-		ctx_flexible_sched_in(ctx, cpuctx);
+	if (is_active & EVENT_FLEXIBLE) {
+		if (mux) {
+			groups = perf_cpu_tree_find(&ctx->flexible_tree, -1);
+			if (groups)
+				ctx_flexible_sched_in_groups(ctx, cpuctx,
+						groups);
+			groups = perf_cpu_tree_find(&ctx->flexible_tree, cpu);
+			if (groups)
+				ctx_flexible_sched_in_groups(ctx, cpuctx,
+						groups);
+		} else {
+			ctx_flexible_sched_in(ctx, cpuctx);
+		}
+	}
  }

  static void cpu_ctx_sched_in(struct perf_cpu_context *cpuctx,
  			     enum event_type_t event_type,
-			     struct task_struct *task)
+			     struct task_struct *task, int mux)
  {
  	struct perf_event_context *ctx = &cpuctx->ctx;

-	ctx_sched_in(ctx, cpuctx, event_type, task);
+	ctx_sched_in(ctx, cpuctx, event_type, task, mux);
  }

  static void perf_event_context_sched_in(struct perf_event_context *ctx,
  					struct task_struct *task)
  {
  	struct perf_cpu_context *cpuctx;
+	int mux = 0;

  	cpuctx = __get_cpu_context(ctx);
  	if (cpuctx->task_ctx == ctx)
@@ -3187,8 +3396,8 @@ static void perf_event_context_sched_in(struct 
perf_event_context *ctx,
  	 * events, no need to flip the cpuctx's events around.
  	 */
  	if (!list_empty(&ctx->pinned_groups))
-		cpu_ctx_sched_out(cpuctx, EVENT_FLEXIBLE);
-	perf_event_sched_in(cpuctx, ctx, task);
+		cpu_ctx_sched_out(cpuctx, EVENT_FLEXIBLE, mux);
+	perf_event_sched_in(cpuctx, ctx, task, mux);
  	perf_pmu_enable(ctx->pmu);
  	perf_ctx_unlock(cpuctx, ctx);
  }
@@ -3421,14 +3630,24 @@ static void rotate_ctx(struct perf_event_context 
*ctx)
  	 * Rotate the first entry last of non-pinned groups. Rotation might be
  	 * disabled by the inheritance code.
  	 */
-	if (!ctx->rotate_disable)
-		list_rotate_left(&ctx->flexible_groups);
+	if (!ctx->rotate_disable) {
+		struct list_head *groups;
+
+		groups = perf_cpu_tree_find(&ctx->flexible_tree, -1);
+		if (groups)
+			list_rotate_left(groups);
+		groups = perf_cpu_tree_find(&ctx->flexible_tree,
+				smp_processor_id());
+		if (groups)
+			list_rotate_left(groups);
+	}
  }

  static int perf_rotate_context(struct perf_cpu_context *cpuctx)
  {
  	struct perf_event_context *ctx = NULL;
  	int rotate = 0;
+	int mux = 1;

  	if (cpuctx->ctx.nr_events) {
  		if (cpuctx->ctx.nr_events != cpuctx->ctx.nr_active)
@@ -3447,15 +3666,15 @@ static int perf_rotate_context(struct 
perf_cpu_context *cpuctx)
  	perf_ctx_lock(cpuctx, cpuctx->task_ctx);
  	perf_pmu_disable(cpuctx->ctx.pmu);

-	cpu_ctx_sched_out(cpuctx, EVENT_FLEXIBLE);
+	cpu_ctx_sched_out(cpuctx, EVENT_FLEXIBLE, mux);
  	if (ctx)
-		ctx_sched_out(ctx, cpuctx, EVENT_FLEXIBLE);
+		ctx_sched_out(ctx, cpuctx, EVENT_FLEXIBLE, mux);

  	rotate_ctx(&cpuctx->ctx);
  	if (ctx)
  		rotate_ctx(ctx);

-	perf_event_sched_in(cpuctx, ctx, current);
+	perf_event_sched_in(cpuctx, ctx, current, mux);

  	perf_pmu_enable(cpuctx->ctx.pmu);
  	perf_ctx_unlock(cpuctx, cpuctx->task_ctx);
@@ -3507,6 +3726,7 @@ static void perf_event_enable_on_exec(int ctxn)
  	struct perf_event *event;
  	unsigned long flags;
  	int enabled = 0;
+	int mux = 0;

  	local_irq_save(flags);
  	ctx = current->perf_event_ctxp[ctxn];
@@ -3515,7 +3735,7 @@ static void perf_event_enable_on_exec(int ctxn)

  	cpuctx = __get_cpu_context(ctx);
  	perf_ctx_lock(cpuctx, ctx);
-	ctx_sched_out(ctx, cpuctx, EVENT_TIME);
+	ctx_sched_out(ctx, cpuctx, EVENT_TIME, mux);
  	list_for_each_entry(event, &ctx->event_list, event_entry) {
  		enabled |= event_enable_on_exec(event, ctx);
  		event_type |= get_event_type(event);
@@ -3528,7 +3748,7 @@ static void perf_event_enable_on_exec(int ctxn)
  		clone_ctx = unclone_ctx(ctx);
  		ctx_resched(cpuctx, ctx, event_type);
  	} else {
-		ctx_sched_in(ctx, cpuctx, EVENT_TIME, current);
+		ctx_sched_in(ctx, cpuctx, EVENT_TIME, current, mux);
  	}
  	perf_ctx_unlock(cpuctx, ctx);

@@ -3749,7 +3969,9 @@ static void __perf_event_init_context(struct 
perf_event_context *ctx)
  	mutex_init(&ctx->mutex);
  	INIT_LIST_HEAD(&ctx->active_ctx_list);
  	INIT_LIST_HEAD(&ctx->pinned_groups);
+	ctx->pinned_tree = RB_ROOT;
  	INIT_LIST_HEAD(&ctx->flexible_groups);
+	ctx->flexible_tree = RB_ROOT;
  	INIT_LIST_HEAD(&ctx->event_list);
  	atomic_set(&ctx->refcount, 1);
  }
@@ -4848,8 +5070,8 @@ static void calc_timer_values(struct perf_event 
*event,

  	*now = perf_clock();
  	ctx_time = event->shadow_ctx_time + *now;
-	*enabled = ctx_time - event->tstamp_enabled;
-	*running = ctx_time - event->tstamp_running;
+	*enabled = ctx_time - event->tstamp->enabled;
+	*running = ctx_time - event->tstamp->running;
  }

  static void perf_event_init_userpage(struct perf_event *event)
@@ -9357,6 +9579,9 @@ perf_event_alloc(struct perf_event_attr *attr, int 
cpu,
  	INIT_LIST_HEAD(&event->child_list);

  	INIT_LIST_HEAD(&event->group_entry);
+	RB_CLEAR_NODE(&event->group_node);
+	INIT_LIST_HEAD(&event->group_list);
+	INIT_LIST_HEAD(&event->group_list_entry);
  	INIT_LIST_HEAD(&event->event_entry);
  	INIT_LIST_HEAD(&event->sibling_list);
  	INIT_LIST_HEAD(&event->rb_entry);
@@ -9372,6 +9597,7 @@ perf_event_alloc(struct perf_event_attr *attr, int 
cpu,
  	raw_spin_lock_init(&event->addr_filters.lock);

  	atomic_long_set(&event->refcount, 1);
+	event->tstamp		= &event->tstamp_data;
  	event->cpu		= cpu;
  	event->attr		= *attr;
  	event->group_leader	= group_leader;

^ permalink raw reply related	[flat|nested] 25+ messages in thread

* Re: [PATCH v2]: perf/core: addressing 4x slowdown during per-process, profiling of STREAM benchmark on Intel Xeon Phi
  2017-05-27 11:19 ` [PATCH v2]: perf/core: addressing 4x slowdown during per-process, " Alexey Budankov
@ 2017-05-29  7:45   ` Peter Zijlstra
  2017-05-29  9:24     ` Alexey Budankov
  2017-05-29  7:46   ` Peter Zijlstra
  2017-05-31 21:33   ` David Carrillo-Cisneros
  2 siblings, 1 reply; 25+ messages in thread
From: Peter Zijlstra @ 2017-05-29  7:45 UTC (permalink / raw)
  To: Alexey Budankov
  Cc: Ingo Molnar, Arnaldo Carvalho de Melo, Alexander Shishkin,
	Andi Kleen, Kan Liang, Dmitri Prokhorov, Valery Cherepennikov,
	David Carrillo-Cisneros, Stephane Eranian, Mark Rutland,
	linux-kernel

On Sat, May 27, 2017 at 02:19:51PM +0300, Alexey Budankov wrote:
> Solution:
> 
> cpu indexed trees for perf_event_context::pinned_groups and
> perf_event_context::flexible_groups lists are introduced. Every tree node
> keeps a list of groups allocated for the same cpu. A tree references only
> groups located at the appropriate group list. The tree provides capability
> to iterate over groups allocated for a specific cpu only, what is exactly
> required by multiplexing timer interrupt handler. The handler runs per-cpu
> and enables/disables groups using group_sched_in()/group_sched_out() that
> call event_filter_match() function filtering out groups allocated for cpus
> different from the one executing the handler. Additionally for every
> filtered out group group_sched_out() updates tstamps values to the current
> interrupt time. This updating work is now done only once by
> update_context_time() called by ctx_sched_out() before cpu groups
> iteration. For this trick to work it is required that tstamps of filtered
> out events would point to perf_event_context::tstamp_data object instead
> of perf_event::tstamp_data ones, as it is initialized from an event
> allocation. tstamps references are switched by
> group_sched_in()/group_sched_out() every time a group is checked for its
> suitability for currently running cpu. When a thread enters some cpu on
> a context switch a long run through pinned and flexible groups is
> performed by perf_event_sched_in(, mux=0) with new parameter mux set to 0
> and filtered out groups tstamps are switched to
> perf_event_context::tstamp_data object. Then a series of multiplexing
> interrupts happens and the handler rotates the flexible groups calling
> ctx_sched_out(,mux=1)/perf_event_sched_in(,mux=1) iterating over the cpu
> tree lists only and avoiding long runs through the complete group lists.
> This is where speedup comes from. Eventually when the thread leaves the cpu
> ctx_sched_out(,mux=0) is called restoring tstamps pointers to the events'
> perf_event::tstamp_data objects.

This is unreadable.. Please use whitespace.


> Added API:
> 
> Objects:
> 1. struct perf_event_tstamp:
>    - enabled
>    - running
>    - stopped
> 2. struct perf_event:
>    - group_node
>    - group_list
>    - group_list_entry
>    - tstamp
>    - tstamp_data
> 3. struct perf_event_context:
>    - pinned_tree
>    - flexible_tree
>    - tstamp_data
> 
> Functions:
> 1. insert a group into a tree using event->cpu as a key:
> 	int perf_cpu_tree_insert(struct rb_root *tree,
> 		struct perf_event *event);
> 2. delete a group from a tree, if the group is directly attached
>    to a tree it also detaches all groups on the groups
>    group_list list:
> 	int perf_cpu_tree_delete(struct rb_root *tree,
> 		struct perf_event *event);
> 3. find group_list list by a cpu key:
>         struct list_head * perf_cpu_tree_find(struct rb_root *tree,
> 		int cpu);
> 4. enable a pinned group on a cpu:
>         void ctx_sched_in_flexible_group(struct perf_event_context *ctx,
> 		struct perf_cpu_context *cpuctx,
> 		struct perf_event *group, int *can_add_hw);
> 5. enable a flexible group on a cpu; calls ctx_sched_in_flexible_group
>    and updates group->state to ERROR in case of failure:
> 	void ctx_sched_in_pinned_group(struct perf_event_context *ctx,
> 		struct perf_cpu_context *cpuctx,
> 		struct perf_event *group);
> 6. enable per-cpu pinned tree's groups on a cpu:
> 	void ctx_pinned_sched_in_groups(struct perf_event_context *ctx,
> 		struct perf_cpu_context *cpuctx,
> 		struct list_head *groups);
> 7. enable per-cpu flexible tree's groups on a cpu:
> 	void ctx_flexible_sched_in_groups(
> 		struct perf_event_context *ctx,
> 		struct perf_cpu_context *cpuctx,
> 		struct list_head *groups);
> 8. disable per-cpu tree's groups on a cpu:
> 	void ctx_sched_out_groups(struct perf_event_context *ctx,
> 		struct perf_cpu_context *cpuctx,
> 		struct list_head *groups);
> 9. get a group tree based on event->attr.pinned attribute value:
> 	struct rb_root * ctx_cpu_tree(struct perf_event *event,
> 		struct perf_event_context *ctx);
> 
> Modified API:
> 
> Objects:
> 1. struct perf_event
> 2. struct perf_event_context
> 
> Functions:
> 1. perf_event_alloc()
> 2. add_event_to_ctx()
> 3. perf_event_enable_on_exec()
> 4. __perf_install_in_context()
> 5. __perf_event_init_context()
> 6. __perf_event_mark_enabled()
> 7. __perf_event_enable()
> 8. __perf_event_task_sched_out()
> 9. ctx_group_list()
> 10. list_add_event()
> 11. list_del_event()
> 12. perf_event_context_sched_in()
> 13. cpu_ctx_sched_in()
> 14. cpu_ctx_sched_out()
> 15. ctx_sched_in()
> 16. ctx_sched_out()
> 17. ctx_resched()
> 18. ctx_pinned_sched_in()
> 19. ctx_flexible_sched_in()
> 20. group_sched_in()
> 21. event_sched_in()
> 22. event_sched_out()
> 23. rotate_ctx()
> 24. perf_rotate_context()
> 25. update_context_time()
> 26. update_event_times()
> 27. calc_timer_values()
> 28. perf_cgroup_switch()
> 29. perf_cgroup_mark_enabled()

Yeah, this doesn't go into a changelog. Have you _ever_ seen a changelog
with such crud in?

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v2]: perf/core: addressing 4x slowdown during per-process, profiling of STREAM benchmark on Intel Xeon Phi
  2017-05-27 11:19 ` [PATCH v2]: perf/core: addressing 4x slowdown during per-process, " Alexey Budankov
  2017-05-29  7:45   ` Peter Zijlstra
@ 2017-05-29  7:46   ` Peter Zijlstra
  2017-05-29  9:15     ` Alexey Budankov
  2017-05-31 21:33   ` David Carrillo-Cisneros
  2 siblings, 1 reply; 25+ messages in thread
From: Peter Zijlstra @ 2017-05-29  7:46 UTC (permalink / raw)
  To: Alexey Budankov
  Cc: Ingo Molnar, Arnaldo Carvalho de Melo, Alexander Shishkin,
	Andi Kleen, Kan Liang, Dmitri Prokhorov, Valery Cherepennikov,
	David Carrillo-Cisneros, Stephane Eranian, Mark Rutland,
	linux-kernel

On Sat, May 27, 2017 at 02:19:51PM +0300, Alexey Budankov wrote:
> @@ -571,6 +587,27 @@ struct perf_event {
>  	 * either sufficies for read.
>  	 */
>  	struct list_head		group_entry;
> +	/*
> +	 * Node on the pinned or flexible tree located at the event context;
> +	 * the node may be empty in case its event is not directly attached
> +	 * to the tree but to group_list list of the event directly
> +	 * attached to the tree;
> +	 */
> +	struct rb_node			group_node;
> +	/*
> +	 * List keeps groups allocated for the same cpu;
> +	 * the list may be empty in case its event is not directly
> +	 * attached to the tree but to group_list list of the event directly
> +	 * attached to the tree;
> +	 */
> +	struct list_head		group_list;
> +	/*
> +	 * Entry into the group_list list above;
> +	 * the entry may be attached to the self group_list list above
> +	 * in case the event is directly attached to the pinned or
> +	 * flexible tree;
> +	 */
> +	struct list_head		group_list_entry;
>  	struct list_head		sibling_list;
> 
>  	/*

> @@ -742,7 +772,17 @@ struct perf_event_context {
> 
>  	struct list_head		active_ctx_list;
>  	struct list_head		pinned_groups;
> +	/*
> +	 * Cpu tree for pinned groups; keeps event's group_node nodes
> +	 * of attached flexible groups;
> +	 */
> +	struct rb_root			pinned_tree;
>  	struct list_head		flexible_groups;
> +	/*
> +	 * Cpu tree for flexible groups; keeps event's group_node nodes
> +	 * of attached flexible groups;
> +	 */
> +	struct rb_root			flexible_tree;
>  	struct list_head		event_list;
>  	int				nr_events;
>  	int				nr_active;
> @@ -758,6 +798,7 @@ struct perf_event_context {
>  	 */
>  	u64				time;
>  	u64				timestamp;
> +	struct perf_event_tstamp	tstamp_data;
> 
>  	/*
>  	 * These fields let us detect when two contexts have both


So why do we now have a list _and_ a tree for the same entries?

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v2]: perf/core: addressing 4x slowdown during per-process, profiling of STREAM benchmark on Intel Xeon Phi
  2017-05-29  7:46   ` Peter Zijlstra
@ 2017-05-29  9:15     ` Alexey Budankov
  2017-05-29 10:43       ` Peter Zijlstra
  0 siblings, 1 reply; 25+ messages in thread
From: Alexey Budankov @ 2017-05-29  9:15 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Arnaldo Carvalho de Melo, Alexander Shishkin,
	Andi Kleen, Kan Liang, Dmitri Prokhorov, Valery Cherepennikov,
	David Carrillo-Cisneros, Stephane Eranian, Mark Rutland,
	linux-kernel

On 29.05.2017 10:46, Peter Zijlstra wrote:
> On Sat, May 27, 2017 at 02:19:51PM +0300, Alexey Budankov wrote:
>> @@ -571,6 +587,27 @@ struct perf_event {
>>   	 * either sufficies for read.
>>   	 */
>>   	struct list_head		group_entry;
>> +	/*
>> +	 * Node on the pinned or flexible tree located at the event context;
>> +	 * the node may be empty in case its event is not directly attached
>> +	 * to the tree but to group_list list of the event directly
>> +	 * attached to the tree;
>> +	 */
>> +	struct rb_node			group_node;
>> +	/*
>> +	 * List keeps groups allocated for the same cpu;
>> +	 * the list may be empty in case its event is not directly
>> +	 * attached to the tree but to group_list list of the event directly
>> +	 * attached to the tree;
>> +	 */
>> +	struct list_head		group_list;
>> +	/*
>> +	 * Entry into the group_list list above;
>> +	 * the entry may be attached to the self group_list list above
>> +	 * in case the event is directly attached to the pinned or
>> +	 * flexible tree;
>> +	 */
>> +	struct list_head		group_list_entry;
>>   	struct list_head		sibling_list;
>>
>>   	/*
> 
>> @@ -742,7 +772,17 @@ struct perf_event_context {
>>
>>   	struct list_head		active_ctx_list;
>>   	struct list_head		pinned_groups;
>> +	/*
>> +	 * Cpu tree for pinned groups; keeps event's group_node nodes
>> +	 * of attached flexible groups;
>> +	 */
>> +	struct rb_root			pinned_tree;
>>   	struct list_head		flexible_groups;
>> +	/*
>> +	 * Cpu tree for flexible groups; keeps event's group_node nodes
>> +	 * of attached flexible groups;
>> +	 */
>> +	struct rb_root			flexible_tree;
>>   	struct list_head		event_list;
>>   	int				nr_events;
>>   	int				nr_active;
>> @@ -758,6 +798,7 @@ struct perf_event_context {
>>   	 */
>>   	u64				time;
>>   	u64				timestamp;
>> +	struct perf_event_tstamp	tstamp_data;
>>
>>   	/*
>>   	 * These fields let us detect when two contexts have both
> 
> 
> So why do we now have a list _and_ a tree for the same entries?
We need groups list to iterate through all groups configured for 
collection and we need the tree to quickly iterate through the groups 
allocated for a particular CPU only.
> 
> 

-Alexey

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v2]: perf/core: addressing 4x slowdown during per-process, profiling of STREAM benchmark on Intel Xeon Phi
  2017-05-29  7:45   ` Peter Zijlstra
@ 2017-05-29  9:24     ` Alexey Budankov
  2017-05-29 10:33       ` Peter Zijlstra
  0 siblings, 1 reply; 25+ messages in thread
From: Alexey Budankov @ 2017-05-29  9:24 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Arnaldo Carvalho de Melo, Alexander Shishkin,
	Andi Kleen, Kan Liang, Dmitri Prokhorov, Valery Cherepennikov,
	David Carrillo-Cisneros, Stephane Eranian, Mark Rutland,
	linux-kernel

On 29.05.2017 10:45, Peter Zijlstra wrote:
> On Sat, May 27, 2017 at 02:19:51PM +0300, Alexey Budankov wrote:
>> Solution:
>>
>> cpu indexed trees for perf_event_context::pinned_groups and
>> perf_event_context::flexible_groups lists are introduced. Every tree node
>> keeps a list of groups allocated for the same cpu. A tree references only
>> groups located at the appropriate group list. The tree provides capability
>> to iterate over groups allocated for a specific cpu only, what is exactly
>> required by multiplexing timer interrupt handler. The handler runs per-cpu
>> and enables/disables groups using group_sched_in()/group_sched_out() that
>> call event_filter_match() function filtering out groups allocated for cpus
>> different from the one executing the handler. Additionally for every
>> filtered out group group_sched_out() updates tstamps values to the current
>> interrupt time. This updating work is now done only once by
>> update_context_time() called by ctx_sched_out() before cpu groups
>> iteration. For this trick to work it is required that tstamps of filtered
>> out events would point to perf_event_context::tstamp_data object instead
>> of perf_event::tstamp_data ones, as it is initialized from an event
>> allocation. tstamps references are switched by
>> group_sched_in()/group_sched_out() every time a group is checked for its
>> suitability for currently running cpu. When a thread enters some cpu on
>> a context switch a long run through pinned and flexible groups is
>> performed by perf_event_sched_in(, mux=0) with new parameter mux set to 0
>> and filtered out groups tstamps are switched to
>> perf_event_context::tstamp_data object. Then a series of multiplexing
>> interrupts happens and the handler rotates the flexible groups calling
>> ctx_sched_out(,mux=1)/perf_event_sched_in(,mux=1) iterating over the cpu
>> tree lists only and avoiding long runs through the complete group lists.
>> This is where speedup comes from. Eventually when the thread leaves the cpu
>> ctx_sched_out(,mux=0) is called restoring tstamps pointers to the events'
>> perf_event::tstamp_data objects.
> 
> This is unreadable.. Please use whitespace.

Do you mean do NOT use whitespaces? Could you explain in more detail 
what you mean?

> 
> 
>> Added API:
>>
>> Objects:
>> 1. struct perf_event_tstamp:
>>     - enabled
>>     - running
>>     - stopped
>> 2. struct perf_event:
>>     - group_node
>>     - group_list
>>     - group_list_entry
>>     - tstamp
>>     - tstamp_data
>> 3. struct perf_event_context:
>>     - pinned_tree
>>     - flexible_tree
>>     - tstamp_data
>>
>> Functions:
>> 1. insert a group into a tree using event->cpu as a key:
>> 	int perf_cpu_tree_insert(struct rb_root *tree,
>> 		struct perf_event *event);
>> 2. delete a group from a tree, if the group is directly attached
>>     to a tree it also detaches all groups on the groups
>>     group_list list:
>> 	int perf_cpu_tree_delete(struct rb_root *tree,
>> 		struct perf_event *event);
>> 3. find group_list list by a cpu key:
>>          struct list_head * perf_cpu_tree_find(struct rb_root *tree,
>> 		int cpu);
>> 4. enable a pinned group on a cpu:
>>          void ctx_sched_in_flexible_group(struct perf_event_context *ctx,
>> 		struct perf_cpu_context *cpuctx,
>> 		struct perf_event *group, int *can_add_hw);
>> 5. enable a flexible group on a cpu; calls ctx_sched_in_flexible_group
>>     and updates group->state to ERROR in case of failure:
>> 	void ctx_sched_in_pinned_group(struct perf_event_context *ctx,
>> 		struct perf_cpu_context *cpuctx,
>> 		struct perf_event *group);
>> 6. enable per-cpu pinned tree's groups on a cpu:
>> 	void ctx_pinned_sched_in_groups(struct perf_event_context *ctx,
>> 		struct perf_cpu_context *cpuctx,
>> 		struct list_head *groups);
>> 7. enable per-cpu flexible tree's groups on a cpu:
>> 	void ctx_flexible_sched_in_groups(
>> 		struct perf_event_context *ctx,
>> 		struct perf_cpu_context *cpuctx,
>> 		struct list_head *groups);
>> 8. disable per-cpu tree's groups on a cpu:
>> 	void ctx_sched_out_groups(struct perf_event_context *ctx,
>> 		struct perf_cpu_context *cpuctx,
>> 		struct list_head *groups);
>> 9. get a group tree based on event->attr.pinned attribute value:
>> 	struct rb_root * ctx_cpu_tree(struct perf_event *event,
>> 		struct perf_event_context *ctx);
>>
>> Modified API:
>>
>> Objects:
>> 1. struct perf_event
>> 2. struct perf_event_context
>>
>> Functions:
>> 1. perf_event_alloc()
>> 2. add_event_to_ctx()
>> 3. perf_event_enable_on_exec()
>> 4. __perf_install_in_context()
>> 5. __perf_event_init_context()
>> 6. __perf_event_mark_enabled()
>> 7. __perf_event_enable()
>> 8. __perf_event_task_sched_out()
>> 9. ctx_group_list()
>> 10. list_add_event()
>> 11. list_del_event()
>> 12. perf_event_context_sched_in()
>> 13. cpu_ctx_sched_in()
>> 14. cpu_ctx_sched_out()
>> 15. ctx_sched_in()
>> 16. ctx_sched_out()
>> 17. ctx_resched()
>> 18. ctx_pinned_sched_in()
>> 19. ctx_flexible_sched_in()
>> 20. group_sched_in()
>> 21. event_sched_in()
>> 22. event_sched_out()
>> 23. rotate_ctx()
>> 24. perf_rotate_context()
>> 25. update_context_time()
>> 26. update_event_times()
>> 27. calc_timer_values()
>> 28. perf_cgroup_switch()
>> 29. perf_cgroup_mark_enabled()
> 
> Yeah, this doesn't go into a changelog. Have you _ever_ seen a changelog
> with such crud in?
> 

I have not. Could you please advise how to format this information to be 
suitable for changelog, if it's required at all?

Thanks,
Alexey

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v2]: perf/core: addressing 4x slowdown during per-process, profiling of STREAM benchmark on Intel Xeon Phi
  2017-05-29  9:24     ` Alexey Budankov
@ 2017-05-29 10:33       ` Peter Zijlstra
  2017-05-29 10:46         ` Alexey Budankov
  0 siblings, 1 reply; 25+ messages in thread
From: Peter Zijlstra @ 2017-05-29 10:33 UTC (permalink / raw)
  To: Alexey Budankov
  Cc: Ingo Molnar, Arnaldo Carvalho de Melo, Alexander Shishkin,
	Andi Kleen, Kan Liang, Dmitri Prokhorov, Valery Cherepennikov,
	David Carrillo-Cisneros, Stephane Eranian, Mark Rutland,
	linux-kernel

On Mon, May 29, 2017 at 12:24:53PM +0300, Alexey Budankov wrote:
> On 29.05.2017 10:45, Peter Zijlstra wrote:
> > On Sat, May 27, 2017 at 02:19:51PM +0300, Alexey Budankov wrote:
> > > Solution:
> > > 
> > > cpu indexed trees for perf_event_context::pinned_groups and
> > > perf_event_context::flexible_groups lists are introduced. Every tree node
> > > keeps a list of groups allocated for the same cpu. A tree references only
> > > groups located at the appropriate group list. The tree provides capability
> > > to iterate over groups allocated for a specific cpu only, what is exactly
> > > required by multiplexing timer interrupt handler. The handler runs per-cpu
> > > and enables/disables groups using group_sched_in()/group_sched_out() that
> > > call event_filter_match() function filtering out groups allocated for cpus
> > > different from the one executing the handler. Additionally for every
> > > filtered out group group_sched_out() updates tstamps values to the current
> > > interrupt time. This updating work is now done only once by
> > > update_context_time() called by ctx_sched_out() before cpu groups
> > > iteration. For this trick to work it is required that tstamps of filtered
> > > out events would point to perf_event_context::tstamp_data object instead
> > > of perf_event::tstamp_data ones, as it is initialized from an event
> > > allocation. tstamps references are switched by
> > > group_sched_in()/group_sched_out() every time a group is checked for its
> > > suitability for currently running cpu. When a thread enters some cpu on
> > > a context switch a long run through pinned and flexible groups is
> > > performed by perf_event_sched_in(, mux=0) with new parameter mux set to 0
> > > and filtered out groups tstamps are switched to
> > > perf_event_context::tstamp_data object. Then a series of multiplexing
> > > interrupts happens and the handler rotates the flexible groups calling
> > > ctx_sched_out(,mux=1)/perf_event_sched_in(,mux=1) iterating over the cpu
> > > tree lists only and avoiding long runs through the complete group lists.
> > > This is where speedup comes from. Eventually when the thread leaves the cpu
> > > ctx_sched_out(,mux=0) is called restoring tstamps pointers to the events'
> > > perf_event::tstamp_data objects.
> > 
> > This is unreadable.. Please use whitespace.
> 
> Do you mean do NOT use whitespaces? Could you explain in more detail what
> you mean?

No, add _more_ whitespace. Use things like paragraphs and such. Reading
a massive blob of text like that is painful. Also use full and complete
sentences. For example, the very first sentence:

 'cpu indexed trees for perf_event_context::pinned_groups and
 perf_event_context::flexible_groups lists are introduced.'

feels incomplete and leaves one wondering what for etc.. And it only
gets worse.


> > Yeah, this doesn't go into a changelog. Have you _ever_ seen a changelog
> > with such crud in?
> > 
> 
> I have not. Could you please advise how to format this information to be
> suitable for changelog, if it's required at all?

Don't include it. It should be fairly obvious from the diff itself what
changed after all.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v2]: perf/core: addressing 4x slowdown during per-process, profiling of STREAM benchmark on Intel Xeon Phi
  2017-05-29  9:15     ` Alexey Budankov
@ 2017-05-29 10:43       ` Peter Zijlstra
  2017-05-29 10:56         ` Alexey Budankov
  0 siblings, 1 reply; 25+ messages in thread
From: Peter Zijlstra @ 2017-05-29 10:43 UTC (permalink / raw)
  To: Alexey Budankov
  Cc: Ingo Molnar, Arnaldo Carvalho de Melo, Alexander Shishkin,
	Andi Kleen, Kan Liang, Dmitri Prokhorov, Valery Cherepennikov,
	David Carrillo-Cisneros, Stephane Eranian, Mark Rutland,
	linux-kernel

On Mon, May 29, 2017 at 12:15:14PM +0300, Alexey Budankov wrote:
> On 29.05.2017 10:46, Peter Zijlstra wrote:
> > On Sat, May 27, 2017 at 02:19:51PM +0300, Alexey Budankov wrote:

> > > @@ -742,7 +772,17 @@ struct perf_event_context {
> > > 
> > >   	struct list_head		active_ctx_list;
> > >   	struct list_head		pinned_groups;
> > > +	/*
> > > +	 * Cpu tree for pinned groups; keeps event's group_node nodes
> > > +	 * of attached flexible groups;
> > > +	 */
> > > +	struct rb_root			pinned_tree;
> > >   	struct list_head		flexible_groups;
> > > +	/*
> > > +	 * Cpu tree for flexible groups; keeps event's group_node nodes
> > > +	 * of attached flexible groups;
> > > +	 */
> > > +	struct rb_root			flexible_tree;
> > >   	struct list_head		event_list;
> > >   	int				nr_events;
> > >   	int				nr_active;
> > > @@ -758,6 +798,7 @@ struct perf_event_context {
> > >   	 */
> > >   	u64				time;
> > >   	u64				timestamp;
> > > +	struct perf_event_tstamp	tstamp_data;
> > > 
> > >   	/*
> > >   	 * These fields let us detect when two contexts have both
> > 
> > 
> > So why do we now have a list _and_ a tree for the same entries?

> We need groups list to iterate through all groups configured for collection
> and we need the tree to quickly iterate through the groups allocated for a
> particular CPU only.

*confused*, what?

Why can't the tree do both?

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v2]: perf/core: addressing 4x slowdown during per-process, profiling of STREAM benchmark on Intel Xeon Phi
  2017-05-29 10:33       ` Peter Zijlstra
@ 2017-05-29 10:46         ` Alexey Budankov
  0 siblings, 0 replies; 25+ messages in thread
From: Alexey Budankov @ 2017-05-29 10:46 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Arnaldo Carvalho de Melo, Alexander Shishkin,
	Andi Kleen, Kan Liang, Dmitri Prokhorov, Valery Cherepennikov,
	David Carrillo-Cisneros, Stephane Eranian, Mark Rutland,
	linux-kernel

On 29.05.2017 13:33, Peter Zijlstra wrote:
> On Mon, May 29, 2017 at 12:24:53PM +0300, Alexey Budankov wrote:
>> On 29.05.2017 10:45, Peter Zijlstra wrote:
>>> On Sat, May 27, 2017 at 02:19:51PM +0300, Alexey Budankov wrote:
>>>> Solution:
>>>>
>>>> cpu indexed trees for perf_event_context::pinned_groups and
>>>> perf_event_context::flexible_groups lists are introduced. Every tree node
>>>> keeps a list of groups allocated for the same cpu. A tree references only
>>>> groups located at the appropriate group list. The tree provides capability
>>>> to iterate over groups allocated for a specific cpu only, what is exactly
>>>> required by multiplexing timer interrupt handler. The handler runs per-cpu
>>>> and enables/disables groups using group_sched_in()/group_sched_out() that
>>>> call event_filter_match() function filtering out groups allocated for cpus
>>>> different from the one executing the handler. Additionally for every
>>>> filtered out group group_sched_out() updates tstamps values to the current
>>>> interrupt time. This updating work is now done only once by
>>>> update_context_time() called by ctx_sched_out() before cpu groups
>>>> iteration. For this trick to work it is required that tstamps of filtered
>>>> out events would point to perf_event_context::tstamp_data object instead
>>>> of perf_event::tstamp_data ones, as it is initialized from an event
>>>> allocation. tstamps references are switched by
>>>> group_sched_in()/group_sched_out() every time a group is checked for its
>>>> suitability for currently running cpu. When a thread enters some cpu on
>>>> a context switch a long run through pinned and flexible groups is
>>>> performed by perf_event_sched_in(, mux=0) with new parameter mux set to 0
>>>> and filtered out groups tstamps are switched to
>>>> perf_event_context::tstamp_data object. Then a series of multiplexing
>>>> interrupts happens and the handler rotates the flexible groups calling
>>>> ctx_sched_out(,mux=1)/perf_event_sched_in(,mux=1) iterating over the cpu
>>>> tree lists only and avoiding long runs through the complete group lists.
>>>> This is where speedup comes from. Eventually when the thread leaves the cpu
>>>> ctx_sched_out(,mux=0) is called restoring tstamps pointers to the events'
>>>> perf_event::tstamp_data objects.
>>>
>>> This is unreadable.. Please use whitespace.
>>
>> Do you mean do NOT use whitespaces? Could you explain in more detail what
>> you mean?
> 
> No, add _more_ whitespace. Use things like paragraphs and such. Reading
> a massive blob of text like that is painful. Also use full and complete
> sentences. For example, the very first sentence:
> 
>   'cpu indexed trees for perf_event_context::pinned_groups and
>   perf_event_context::flexible_groups lists are introduced.'
> 
> feels incomplete and leaves one wondering what for etc.. And it only
> gets worse.
> 

Makes sense. Will do. Thanks.

> 
>>> Yeah, this doesn't go into a changelog. Have you _ever_ seen a changelog
>>> with such crud in?
>>>
>>
>> I have not. Could you please advise how to format this information to be
>> suitable for changelog, if it's required at all?
> 
> Don't include it. It should be fairly obvious from the diff itself what
> changed after all.
> 

Ok. Clear.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v2]: perf/core: addressing 4x slowdown during per-process, profiling of STREAM benchmark on Intel Xeon Phi
  2017-05-29 10:43       ` Peter Zijlstra
@ 2017-05-29 10:56         ` Alexey Budankov
  2017-05-29 11:23           ` Peter Zijlstra
  0 siblings, 1 reply; 25+ messages in thread
From: Alexey Budankov @ 2017-05-29 10:56 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Arnaldo Carvalho de Melo, Alexander Shishkin,
	Andi Kleen, Kan Liang, Dmitri Prokhorov, Valery Cherepennikov,
	David Carrillo-Cisneros, Stephane Eranian, Mark Rutland,
	linux-kernel

On 29.05.2017 13:43, Peter Zijlstra wrote:
> On Mon, May 29, 2017 at 12:15:14PM +0300, Alexey Budankov wrote:
>> On 29.05.2017 10:46, Peter Zijlstra wrote:
>>> On Sat, May 27, 2017 at 02:19:51PM +0300, Alexey Budankov wrote:
> 
>>>> @@ -742,7 +772,17 @@ struct perf_event_context {
>>>>
>>>>    	struct list_head		active_ctx_list;
>>>>    	struct list_head		pinned_groups;
>>>> +	/*
>>>> +	 * Cpu tree for pinned groups; keeps event's group_node nodes
>>>> +	 * of attached flexible groups;
>>>> +	 */
>>>> +	struct rb_root			pinned_tree;
>>>>    	struct list_head		flexible_groups;
>>>> +	/*
>>>> +	 * Cpu tree for flexible groups; keeps event's group_node nodes
>>>> +	 * of attached flexible groups;
>>>> +	 */
>>>> +	struct rb_root			flexible_tree;
>>>>    	struct list_head		event_list;
>>>>    	int				nr_events;
>>>>    	int				nr_active;
>>>> @@ -758,6 +798,7 @@ struct perf_event_context {
>>>>    	 */
>>>>    	u64				time;
>>>>    	u64				timestamp;
>>>> +	struct perf_event_tstamp	tstamp_data;
>>>>
>>>>    	/*
>>>>    	 * These fields let us detect when two contexts have both
>>>
>>>
>>> So why do we now have a list _and_ a tree for the same entries?
> 
>> We need groups list to iterate through all groups configured for collection
>> and we need the tree to quickly iterate through the groups allocated for a
>> particular CPU only.
> 
> *confused*, what?
> 
> Why can't the tree do both?
> 

Well, indeed, the tree provides such capability too. However switching 
to the full tree iteration in cases where we now go through _groups 
lists will enlarge the patch, what is probably is not a big deal. Do you 
think it is worth implementing the switch?

Thanks,
Alexey

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v2]: perf/core: addressing 4x slowdown during per-process, profiling of STREAM benchmark on Intel Xeon Phi
  2017-05-29 10:56         ` Alexey Budankov
@ 2017-05-29 11:23           ` Peter Zijlstra
  2017-05-29 11:45             ` Alexey Budankov
  0 siblings, 1 reply; 25+ messages in thread
From: Peter Zijlstra @ 2017-05-29 11:23 UTC (permalink / raw)
  To: Alexey Budankov
  Cc: Ingo Molnar, Arnaldo Carvalho de Melo, Alexander Shishkin,
	Andi Kleen, Kan Liang, Dmitri Prokhorov, Valery Cherepennikov,
	David Carrillo-Cisneros, Stephane Eranian, Mark Rutland,
	linux-kernel

On Mon, May 29, 2017 at 01:56:05PM +0300, Alexey Budankov wrote:
> On 29.05.2017 13:43, Peter Zijlstra wrote:

> > Why can't the tree do both?
> > 
> 
> Well, indeed, the tree provides such capability too. However switching to
> the full tree iteration in cases where we now go through _groups lists will
> enlarge the patch, what is probably is not a big deal. Do you think it is
> worth implementing the switch?

Do it as a series of patches, where patch 1 introduces the tree, patches
2 through n convert the list users into tree users, and patch n+1
removes the list.

I think its good to not have duplicate data structures if we can avoid
it.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v2]: perf/core: addressing 4x slowdown during per-process, profiling of STREAM benchmark on Intel Xeon Phi
  2017-05-29 11:23           ` Peter Zijlstra
@ 2017-05-29 11:45             ` Alexey Budankov
  2017-06-15 17:42               ` Alexey Budankov
  0 siblings, 1 reply; 25+ messages in thread
From: Alexey Budankov @ 2017-05-29 11:45 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Arnaldo Carvalho de Melo, Alexander Shishkin,
	Andi Kleen, Kan Liang, Dmitri Prokhorov, Valery Cherepennikov,
	David Carrillo-Cisneros, Stephane Eranian, Mark Rutland,
	linux-kernel

On 29.05.2017 14:23, Peter Zijlstra wrote:
> On Mon, May 29, 2017 at 01:56:05PM +0300, Alexey Budankov wrote:
>> On 29.05.2017 13:43, Peter Zijlstra wrote:
> 
>>> Why can't the tree do both?
>>>
>>
>> Well, indeed, the tree provides such capability too. However switching to
>> the full tree iteration in cases where we now go through _groups lists will
>> enlarge the patch, what is probably is not a big deal. Do you think it is
>> worth implementing the switch?
> 
> Do it as a series of patches, where patch 1 introduces the tree, patches
> 2 through n convert the list users into tree users, and patch n+1
> removes the list.

Well ok, let's do that additionally but please expect delay in delivery 
(I am OOO till Jun 14).

> 
> I think its good to not have duplicate data structures if we can avoid
> it.
> 

yeah, makes sense.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH]: perf/core: addressing 4x slowdown during per-process profiling of STREAM benchmark on Intel Xeon Phi
  2017-05-26 22:13 [PATCH]: perf/core: addressing 4x slowdown during per-process profiling of STREAM benchmark on Intel Xeon Phi Alexey Budankov
  2017-05-27 11:19 ` [PATCH v2]: perf/core: addressing 4x slowdown during per-process, " Alexey Budankov
@ 2017-05-29 12:03 ` Alexander Shishkin
  2017-05-29 13:43   ` Alexey Budankov
  1 sibling, 1 reply; 25+ messages in thread
From: Alexander Shishkin @ 2017-05-29 12:03 UTC (permalink / raw)
  To: Alexey Budankov, Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo
  Cc: Andi Kleen, Kan Liang, Dmitri Prokhorov, Valery Cherepennikov,
	David Carrillo-Cisneros, Stephane Eranian, Mark Rutland,
	linux-kernel

Alexey Budankov <alexey.budankov@linux.intel.com> writes:

Here (above the function) you could include a comment describing what
happens when this is called, locking considerations, etc.

> +static int
> +perf_cpu_tree_insert(struct rb_root *tree, struct perf_event *event)
> +{
> +	struct rb_node **node;
> +	struct rb_node *parent;
> +
> +	if (!tree || !event)
> +		return 0;

I don't think this should be happening, should it? And either way you
probably don't want to return 0 here, unless you're using !0 for
success.

> +
> +	node = &tree->rb_node;
> +	parent = *node;
> +
> +	while (*node) {
> +		struct perf_event *node_event =	container_of(*node,
> +				struct perf_event, group_node);
> +
> +		parent = *node;
> +
> +		if (event->cpu < node_event->cpu) {
> +			node = &((*node)->rb_left);

this would be the same as node = &parent->rb_left, right?

> +		} else if (event->cpu > node_event->cpu) {
> +			node = &((*node)->rb_right);
> +		} else {
> +			list_add_tail(&event->group_list_entry,
> +					&node_event->group_list);

So why is this better than simply having per-cpu event lists plus one
for per-thread events?

Also,

> +			return 2;

2?

> +		}
> +	}
> +
> +	list_add_tail(&event->group_list_entry, &event->group_list);
> +
> +	rb_link_node(&event->group_node, parent, node);
> +	rb_insert_color(&event->group_node, tree);
> +
> +	return 1;

Oh, you are using !0 for success. I guess it's a good thing you're not
actually checking its return code at the call site.

Regards,
--
Alex

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH]: perf/core: addressing 4x slowdown during per-process profiling of STREAM benchmark on Intel Xeon Phi
  2017-05-29 12:03 ` [PATCH]: perf/core: addressing 4x slowdown during per-process " Alexander Shishkin
@ 2017-05-29 13:43   ` Alexey Budankov
  2017-05-29 15:22     ` Peter Zijlstra
  2017-05-30  8:29     ` Alexander Shishkin
  0 siblings, 2 replies; 25+ messages in thread
From: Alexey Budankov @ 2017-05-29 13:43 UTC (permalink / raw)
  To: Alexander Shishkin, Peter Zijlstra, Ingo Molnar,
	Arnaldo Carvalho de Melo
  Cc: Andi Kleen, Kan Liang, Dmitri Prokhorov, Valery Cherepennikov,
	David Carrillo-Cisneros, Stephane Eranian, Mark Rutland,
	linux-kernel

On 29.05.2017 15:03, Alexander Shishkin wrote:
> Alexey Budankov <alexey.budankov@linux.intel.com> writes:
> 
> Here (above the function) you could include a comment describing what
> happens when this is called, locking considerations, etc.

I can put the short description from the initial thread message here. 
Would it be sufficient?

> 
>> +static int
>> +perf_cpu_tree_insert(struct rb_root *tree, struct perf_event *event)
>> +{
>> +	struct rb_node **node;
>> +	struct rb_node *parent;
>> +
>> +	if (!tree || !event)
>> +		return 0;
> 
> I don't think this should be happening, should it? And either way you
> probably don't want to return 0 here, unless you're using !0 for
> success.

As you might notice already, currently return codes of the tree API are 
not checked all other the implementation. I suggest replacing that int 
error code by void and simplify the stuff.

> 
>> +
>> +	node = &tree->rb_node;
>> +	parent = *node;
>> +
>> +	while (*node) {
>> +		struct perf_event *node_event =	container_of(*node,
>> +				struct perf_event, group_node);
>> +
>> +		parent = *node;
>> +
>> +		if (event->cpu < node_event->cpu) {
>> +			node = &((*node)->rb_left);
> 
> this would be the same as node = &parent->rb_left, right?

Please ask more. node is the leaf node and parent is the parent of the 
node at the end of cycle. We need the both to insert a new node into a 
tree.

> 
>> +		} else if (event->cpu > node_event->cpu) {
>> +			node = &((*node)->rb_right);
>> +		} else {
>> +			list_add_tail(&event->group_list_entry,
>> +					&node_event->group_list);
> 
> So why is this better than simply having per-cpu event lists plus one
> for per-thread events?

Good question. Choice of data structure and layout depends on the 
operations applied to the data so keeping groups as a tree simplifies 
and improves the implementation in terms of scalability and performance. 
Please ask more if any.

> 
> Also,
> 
>> +			return 2;
> 
> 2?

Answered above.

> 
>> +		}
>> +	}
>> +
>> +	list_add_tail(&event->group_list_entry, &event->group_list);
>> +
>> +	rb_link_node(&event->group_node, parent, node);
>> +	rb_insert_color(&event->group_node, tree);
>> +
>> +	return 1;
> 
> Oh, you are using !0 for success. I guess it's a good thing you're not
> actually checking its return code at the call site.

Answered above.

> 
> Regards,
> --
> Alex
> 

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH]: perf/core: addressing 4x slowdown during per-process profiling of STREAM benchmark on Intel Xeon Phi
  2017-05-29 13:43   ` Alexey Budankov
@ 2017-05-29 15:22     ` Peter Zijlstra
  2017-05-29 15:29       ` Peter Zijlstra
  2017-05-30  8:29     ` Alexander Shishkin
  1 sibling, 1 reply; 25+ messages in thread
From: Peter Zijlstra @ 2017-05-29 15:22 UTC (permalink / raw)
  To: Alexey Budankov
  Cc: Alexander Shishkin, Ingo Molnar, Arnaldo Carvalho de Melo,
	Andi Kleen, Kan Liang, Dmitri Prokhorov, Valery Cherepennikov,
	David Carrillo-Cisneros, Stephane Eranian, Mark Rutland,
	linux-kernel

On Mon, May 29, 2017 at 04:43:09PM +0300, Alexey Budankov wrote:
> On 29.05.2017 15:03, Alexander Shishkin wrote:
> > Alexey Budankov <alexey.budankov@linux.intel.com> writes:

> > > +		} else if (event->cpu > node_event->cpu) {
> > > +			node = &((*node)->rb_right);
> > > +		} else {
> > > +			list_add_tail(&event->group_list_entry,
> > > +					&node_event->group_list);
> > 
> > So why is this better than simply having per-cpu event lists plus one
> > for per-thread events?
> 
> Good question. Choice of data structure and layout depends on the operations
> applied to the data so keeping groups as a tree simplifies and improves the
> implementation in terms of scalability and performance. Please ask more if
> any.

Since these lists are per context, and each task can have a context,
you'd end up with per-task-per-cpu memory, which is something we'd like
to avoid (some archs have very limited per-cpu memory space etc..).

Also, we'd like to have that tree for other reasons, like for instance
that heterogeneous PMU crud ARM has. Also, with a tree we can easier do
time based round-robin scheduling,

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH]: perf/core: addressing 4x slowdown during per-process profiling of STREAM benchmark on Intel Xeon Phi
  2017-05-29 15:22     ` Peter Zijlstra
@ 2017-05-29 15:29       ` Peter Zijlstra
  2017-05-29 16:41         ` Alexey Budankov
  0 siblings, 1 reply; 25+ messages in thread
From: Peter Zijlstra @ 2017-05-29 15:29 UTC (permalink / raw)
  To: Alexey Budankov
  Cc: Alexander Shishkin, Ingo Molnar, Arnaldo Carvalho de Melo,
	Andi Kleen, Kan Liang, Dmitri Prokhorov, Valery Cherepennikov,
	David Carrillo-Cisneros, Stephane Eranian, Mark Rutland,
	linux-kernel

On Mon, May 29, 2017 at 05:22:54PM +0200, Peter Zijlstra wrote:
> On Mon, May 29, 2017 at 04:43:09PM +0300, Alexey Budankov wrote:
> > On 29.05.2017 15:03, Alexander Shishkin wrote:
> > > Alexey Budankov <alexey.budankov@linux.intel.com> writes:
> 
> > > > +		} else if (event->cpu > node_event->cpu) {
> > > > +			node = &((*node)->rb_right);
> > > > +		} else {
> > > > +			list_add_tail(&event->group_list_entry,
> > > > +					&node_event->group_list);
> > > 
> > > So why is this better than simply having per-cpu event lists plus one
> > > for per-thread events?
> > 
> > Good question. Choice of data structure and layout depends on the operations
> > applied to the data so keeping groups as a tree simplifies and improves the
> > implementation in terms of scalability and performance. Please ask more if
> > any.
> 
> Since these lists are per context, and each task can have a context,
> you'd end up with per-task-per-cpu memory, which is something we'd like
> to avoid (some archs have very limited per-cpu memory space etc..).
> 
> Also, we'd like to have that tree for other reasons, like for instance
> that heterogeneous PMU crud ARM has. Also, with a tree we can easier do
> time based round-robin scheduling,
> 

Oh and in general multi-PMU stuff, aside from hetero PMU becomes much
easier.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH]: perf/core: addressing 4x slowdown during per-process profiling of STREAM benchmark on Intel Xeon Phi
  2017-05-29 15:29       ` Peter Zijlstra
@ 2017-05-29 16:41         ` Alexey Budankov
  0 siblings, 0 replies; 25+ messages in thread
From: Alexey Budankov @ 2017-05-29 16:41 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Alexander Shishkin, Ingo Molnar, Arnaldo Carvalho de Melo,
	Andi Kleen, Kan Liang, Dmitri Prokhorov, Valery Cherepennikov,
	David Carrillo-Cisneros, Stephane Eranian, Mark Rutland,
	linux-kernel

On 29.05.2017 18:29, Peter Zijlstra wrote:
> On Mon, May 29, 2017 at 05:22:54PM +0200, Peter Zijlstra wrote:
>> On Mon, May 29, 2017 at 04:43:09PM +0300, Alexey Budankov wrote:
>>> On 29.05.2017 15:03, Alexander Shishkin wrote:
>>>> Alexey Budankov <alexey.budankov@linux.intel.com> writes:
>>
>>>>> +		} else if (event->cpu > node_event->cpu) {
>>>>> +			node = &((*node)->rb_right);
>>>>> +		} else {
>>>>> +			list_add_tail(&event->group_list_entry,
>>>>> +					&node_event->group_list);
>>>>
>>>> So why is this better than simply having per-cpu event lists plus one
>>>> for per-thread events?
>>>
>>> Good question. Choice of data structure and layout depends on the operations
>>> applied to the data so keeping groups as a tree simplifies and improves the
>>> implementation in terms of scalability and performance. Please ask more if
>>> any.
>>
>> Since these lists are per context, and each task can have a context,
>> you'd end up with per-task-per-cpu memory, which is something we'd like
>> to avoid (some archs have very limited per-cpu memory space etc..).

Aw, yeah. Memory consumption does matter in the kernel space.

>>
>> Also, we'd like to have that tree for other reasons, like for instance
>> that heterogeneous PMU crud ARM has. Also, with a tree we can easier do
>> time based round-robin scheduling,
>>
> 
> Oh and in general multi-PMU stuff, aside from hetero PMU becomes much
> easier.
> 

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH]: perf/core: addressing 4x slowdown during per-process profiling of STREAM benchmark on Intel Xeon Phi
  2017-05-29 13:43   ` Alexey Budankov
  2017-05-29 15:22     ` Peter Zijlstra
@ 2017-05-30  8:29     ` Alexander Shishkin
  2017-06-14 10:07       ` Alexey Budankov
  1 sibling, 1 reply; 25+ messages in thread
From: Alexander Shishkin @ 2017-05-30  8:29 UTC (permalink / raw)
  To: Alexey Budankov, Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo
  Cc: Andi Kleen, Kan Liang, Dmitri Prokhorov, Valery Cherepennikov,
	David Carrillo-Cisneros, Stephane Eranian, Mark Rutland,
	linux-kernel

Alexey Budankov <alexey.budankov@linux.intel.com> writes:

> On 29.05.2017 15:03, Alexander Shishkin wrote:
>> Alexey Budankov <alexey.budankov@linux.intel.com> writes:
>> 
>> Here (above the function) you could include a comment describing what
>> happens when this is called, locking considerations, etc.
>
> I can put the short description from the initial thread message here. 
> Would it be sufficient?

Sure, this is where API descriptions fit better than in commit messages.

>
>> 
>>> +static int
>>> +perf_cpu_tree_insert(struct rb_root *tree, struct perf_event *event)
>>> +{
>>> +	struct rb_node **node;
>>> +	struct rb_node *parent;
>>> +
>>> +	if (!tree || !event)
>>> +		return 0;
>> 
>> I don't think this should be happening, should it? And either way you
>> probably don't want to return 0 here, unless you're using !0 for
>> success.
>
> As you might notice already, currently return codes of the tree API are 
> not checked all other the implementation. I suggest replacing that int 
> error code by void and simplify the stuff.

Your call. But I'd still either drop the redundant checks or wrap them
in WARN_ON_ONCE().

>
>> 
>>> +
>>> +	node = &tree->rb_node;
>>> +	parent = *node;
>>> +
>>> +	while (*node) {
>>> +		struct perf_event *node_event =	container_of(*node,
>>> +				struct perf_event, group_node);
>>> +
>>> +		parent = *node;
>>> +
>>> +		if (event->cpu < node_event->cpu) {
>>> +			node = &((*node)->rb_left);
>> 
>> this would be the same as node = &parent->rb_left, right?
>
> Please ask more.

Side note: between commit message, comments and the actual code, in an
ideal situation one doesn't have to 'ask' anything, because everything
is already clear. Not the case here.

> node is the leaf node and parent is the parent of the 
> node at the end of cycle. We need the both to insert a new node into a 
> tree.

Not sure I understand. You'd still have both.

>
>> 
>>> +		} else if (event->cpu > node_event->cpu) {
>>> +			node = &((*node)->rb_right);
>>> +		} else {
>>> +			list_add_tail(&event->group_list_entry,
>>> +					&node_event->group_list);
>> 
>> So why is this better than simply having per-cpu event lists plus one
>> for per-thread events?
>
> Good question. Choice of data structure and layout depends on the 
> operations applied to the data so keeping groups as a tree simplifies 
> and improves the implementation in terms of scalability and performance. 
> Please ask more if any.

Please be more specific on how scalability and performance are
improved. In general, try to avoid vagues statements like "this is
better for performance".

Thanks,
--
Alex

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v2]: perf/core: addressing 4x slowdown during per-process, profiling of STREAM benchmark on Intel Xeon Phi
  2017-05-27 11:19 ` [PATCH v2]: perf/core: addressing 4x slowdown during per-process, " Alexey Budankov
  2017-05-29  7:45   ` Peter Zijlstra
  2017-05-29  7:46   ` Peter Zijlstra
@ 2017-05-31 21:33   ` David Carrillo-Cisneros
  2017-06-14 11:27     ` Alexey Budankov
  2 siblings, 1 reply; 25+ messages in thread
From: David Carrillo-Cisneros @ 2017-05-31 21:33 UTC (permalink / raw)
  To: Alexey Budankov
  Cc: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Alexander Shishkin, Andi Kleen, Kan Liang, Dmitri Prokhorov,
	Valery Cherepennikov, Stephane Eranian, Mark Rutland,
	linux-kernel, Arun Kalyanasundaram

On Sat, May 27, 2017 at 4:19 AM, Alexey Budankov
<alexey.budankov@linux.intel.com> wrote:
> Motivation:
>
> The issue manifests like 4x slowdown when profiling single thread STREAM
> benchmark on Intel Xeon Phi running RHEL7.2 (Intel MPSS distribution).
> Perf profiling is done in per-process mode and involves about 30 core
> events. In case the benchmark is OpenMP based and runs under profiling in
> 272 threads the overhead becomes even more dramatic: 512.144s against
> 36.192s (with this patch).

How long does it take without any perf monitoring? Could you provide
more details about the benchmark? how many CPUs are being monitored?

SNIP
> different from the one executing the handler. Additionally for every
> filtered out group group_sched_out() updates tstamps values to the current
> interrupt time. This updating work is now done only once by
> update_context_time() called by ctx_sched_out() before cpu groups
> iteration.

I don't see this. e.g.:
in your patch task_ctx_sched_out calls ctx_sched_out with mux == 0,
that path does the exact same thing as before your patch.

I understand why you want to move the event's times to a different
structure and keep a pointer in the event, but I don't see that you
are avoiding the update of the times of unscheduled events.

>
>  static u64 perf_event_time(struct perf_event *event)
> @@ -1424,16 +1428,16 @@ static void update_event_times(struct perf_event
> *event)
>         else if (ctx->is_active)
>                 run_end = ctx->time;
>         else
> -               run_end = event->tstamp_stopped;
> +               run_end = event->tstamp->stopped;
>
> -       event->total_time_enabled = run_end - event->tstamp_enabled;
> +       event->total_time_enabled = run_end - event->tstamp->enabled;
>
>         if (event->state == PERF_EVENT_STATE_INACTIVE)
> -               run_end = event->tstamp_stopped;
> +               run_end = event->tstamp->stopped;
>         else
>                 run_end = perf_event_time(event);
>
> -       event->total_time_running = run_end - event->tstamp_running;
> +       event->total_time_running = run_end - event->tstamp->running;

FWICT, this is run for ALL events in context with matching CPU.


SNIP
>  }
> @@ -3051,9 +3277,9 @@ void __perf_event_task_sched_out(struct task_struct
> *task,
>   * Called with IRQs disabled
>   */
>  static void cpu_ctx_sched_out(struct perf_cpu_context *cpuctx,
> -                             enum event_type_t event_type)
> +                             enum event_type_t event_type, int mux)
>  {
> -       ctx_sched_out(&cpuctx->ctx, cpuctx, event_type);
> +       ctx_sched_out(&cpuctx->ctx, cpuctx, event_type, mux);
>  }
>
>  static void
> @@ -3061,29 +3287,8 @@ ctx_pinned_sched_in(struct perf_event_context *ctx,
>                     struct perf_cpu_context *cpuctx)
>  {
>         struct perf_event *event;
> -
> -       list_for_each_entry(event, &ctx->pinned_groups, group_entry) {
> -               if (event->state <= PERF_EVENT_STATE_OFF)
> -                       continue;
> -               if (!event_filter_match(event))
> -                       continue;

Could we remove or simplify the tests in event_filter_match since the
rb-tree filters by cpu?

> -
> -               /* may need to reset tstamp_enabled */
> -               if (is_cgroup_event(event))
> -                       perf_cgroup_mark_enabled(event, ctx);
> -
> -               if (group_can_go_on(event, cpuctx, 1))
> -                       group_sched_in(event, cpuctx, ctx);
> -
> -               /*
> -                * If this pinned group hasn't been scheduled,
> -                * put it in error state.
> -                */
> -               if (event->state == PERF_EVENT_STATE_INACTIVE) {
> -                       update_group_times(event);
> -                       event->state = PERF_EVENT_STATE_ERROR;
> -               }
> -       }
> +       list_for_each_entry(event, &ctx->pinned_groups, group_entry)
> +               ctx_sched_in_pinned_group(ctx, cpuctx, event);
>  }
>
>  static void
> @@ -3092,37 +3297,19 @@ ctx_flexible_sched_in(struct perf_event_context
> *ctx,
>  {
>         struct perf_event *event;
>         int can_add_hw = 1;
> -
> -       list_for_each_entry(event, &ctx->flexible_groups, group_entry) {
> -               /* Ignore events in OFF or ERROR state */
> -               if (event->state <= PERF_EVENT_STATE_OFF)
> -                       continue;
> -               /*
> -                * Listen to the 'cpu' scheduling filter constraint
> -                * of events:
> -                */
> -               if (!event_filter_match(event))
> -                       continue;
same as before.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH]: perf/core: addressing 4x slowdown during per-process profiling of STREAM benchmark on Intel Xeon Phi
  2017-05-30  8:29     ` Alexander Shishkin
@ 2017-06-14 10:07       ` Alexey Budankov
  2017-06-15 17:44         ` Alexey Budankov
  0 siblings, 1 reply; 25+ messages in thread
From: Alexey Budankov @ 2017-06-14 10:07 UTC (permalink / raw)
  To: Alexander Shishkin, Peter Zijlstra, Ingo Molnar,
	Arnaldo Carvalho de Melo
  Cc: Andi Kleen, Kan Liang, Dmitri Prokhorov, Valery Cherepennikov,
	David Carrillo-Cisneros, Stephane Eranian, Mark Rutland,
	linux-kernel

On 30.05.2017 11:29, Alexander Shishkin wrote:
> Alexey Budankov <alexey.budankov@linux.intel.com> writes:
> 
>> On 29.05.2017 15:03, Alexander Shishkin wrote:
>>> Alexey Budankov <alexey.budankov@linux.intel.com> writes:
>>>
>>> Here (above the function) you could include a comment describing what
>>> happens when this is called, locking considerations, etc.
>>
>> I can put the short description from the initial thread message here.
>> Would it be sufficient?
> 
> Sure, this is where API descriptions fit better than in commit messages.
> 
>>
>>>
>>>> +static int
>>>> +perf_cpu_tree_insert(struct rb_root *tree, struct perf_event *event)
>>>> +{
>>>> +	struct rb_node **node;
>>>> +	struct rb_node *parent;
>>>> +
>>>> +	if (!tree || !event)
>>>> +		return 0;
>>>
>>> I don't think this should be happening, should it? And either way you
>>> probably don't want to return 0 here, unless you're using !0 for
>>> success.
>>
>> As you might notice already, currently return codes of the tree API are
>> not checked all other the implementation. I suggest replacing that int
>> error code by void and simplify the stuff.
> 
> Your call. But I'd still either drop the redundant checks or wrap them
> in WARN_ON_ONCE().

Ok. WARN_ON_ONCE() then.

> 
>>
>>>
>>>> +
>>>> +	node = &tree->rb_node;
>>>> +	parent = *node;
>>>> +
>>>> +	while (*node) {
>>>> +		struct perf_event *node_event =	container_of(*node,
>>>> +				struct perf_event, group_node);
>>>> +
>>>> +		parent = *node;
>>>> +
>>>> +		if (event->cpu < node_event->cpu) {
>>>> +			node = &((*node)->rb_left);
>>>
>>> this would be the same as node = &parent->rb_left, right?

Yes, that is right.

>>
>> Please ask more.
> 
> Side note: between commit message, comments and the actual code, in an
> ideal situation one doesn't have to 'ask' anything, because everything
> is already clear. Not the case here.
> 
>> node is the leaf node and parent is the parent of the
>> node at the end of cycle. We need the both to insert a new node into a
>> tree.
> 
> Not sure I understand. You'd still have both.
> 
>>
>>>
>>>> +		} else if (event->cpu > node_event->cpu) {
>>>> +			node = &((*node)->rb_right);
>>>> +		} else {
>>>> +			list_add_tail(&event->group_list_entry,
>>>> +					&node_event->group_list);
>>>
>>> So why is this better than simply having per-cpu event lists plus one
>>> for per-thread events?
>>
>> Good question. Choice of data structure and layout depends on the
>> operations applied to the data so keeping groups as a tree simplifies
>> and improves the implementation in terms of scalability and performance.
>> Please ask more if any.
> 
> Please be more specific on how scalability and performance are
> improved. In general, try to avoid vagues statements like "this is
> better for performance".

Accepted. Peter already provided more specifics on this. Thanks.

> 
> Thanks,
> --
> Alex
> 

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v2]: perf/core: addressing 4x slowdown during per-process, profiling of STREAM benchmark on Intel Xeon Phi
  2017-05-31 21:33   ` David Carrillo-Cisneros
@ 2017-06-14 11:27     ` Alexey Budankov
  0 siblings, 0 replies; 25+ messages in thread
From: Alexey Budankov @ 2017-06-14 11:27 UTC (permalink / raw)
  To: David Carrillo-Cisneros
  Cc: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Alexander Shishkin, Andi Kleen, Kan Liang, Dmitri Prokhorov,
	Valery Cherepennikov, Stephane Eranian, Mark Rutland,
	linux-kernel, Arun Kalyanasundaram

On 01.06.2017 0:33, David Carrillo-Cisneros wrote:
> On Sat, May 27, 2017 at 4:19 AM, Alexey Budankov
> <alexey.budankov@linux.intel.com> wrote:
>> Motivation:
>>
>> The issue manifests like 4x slowdown when profiling single thread STREAM
>> benchmark on Intel Xeon Phi running RHEL7.2 (Intel MPSS distribution).
>> Perf profiling is done in per-process mode and involves about 30 core
>> events. In case the benchmark is OpenMP based and runs under profiling in
>> 272 threads the overhead becomes even more dramatic: 512.144s against
>> 36.192s (with this patch).
> 
> How long does it take without any perf monitoring? Could you provide
> more details about the benchmark? how many CPUs are being monitored?

STREAM runs about 10 seconds without Perf monitoring on Intel KNL in 272 
threads (272 CPUs are monitored). More on the benchmark is here: 
https://www.cs.virginia.edu/stream/

> 
> SNIP
>> different from the one executing the handler. Additionally for every
>> filtered out group group_sched_out() updates tstamps values to the current
>> interrupt time. This updating work is now done only once by
>> update_context_time() called by ctx_sched_out() before cpu groups
>> iteration.
> 
> I don't see this. e.g.:

It is done here:

@@ -1383,6 +1384,9 @@ static void update_context_time(struct 
perf_event_context *ctx)

      ctx->time += now - ctx->timestamp;
      ctx->timestamp = now;
+
+    ctx->tstamp_data.running += ctx->time - ctx->tstamp_data.stopped;
+    ctx->tstamp_data.stopped = ctx->time;
  }

Unscheduled events tstamps are updated all at once in 
update_context_time() called from ctx_sched_out().

> in your patch task_ctx_sched_out calls ctx_sched_out with mux == 0,
> that path does the exact same thing as before your patch. >
> I understand why you want to move the event's times to a different
> structure and keep a pointer in the event, but I don't see that you
> are avoiding the update of the times of unscheduled events.
> 
>>
>>   static u64 perf_event_time(struct perf_event *event)
>> @@ -1424,16 +1428,16 @@ static void update_event_times(struct perf_event
>> *event)
>>          else if (ctx->is_active)
>>                  run_end = ctx->time;
>>          else
>> -               run_end = event->tstamp_stopped;
>> +               run_end = event->tstamp->stopped;
>>
>> -       event->total_time_enabled = run_end - event->tstamp_enabled;
>> +       event->total_time_enabled = run_end - event->tstamp->enabled;
>>
>>          if (event->state == PERF_EVENT_STATE_INACTIVE)
>> -               run_end = event->tstamp_stopped;
>> +               run_end = event->tstamp->stopped;
>>          else
>>                  run_end = perf_event_time(event);
>>
>> -       event->total_time_running = run_end - event->tstamp_running;
>> +       event->total_time_running = run_end - event->tstamp->running;
> 
> FWICT, this is run for ALL events in context with matching CPU.
> 
> 
> SNIP
>>   }
>> @@ -3051,9 +3277,9 @@ void __perf_event_task_sched_out(struct task_struct
>> *task,
>>    * Called with IRQs disabled
>>    */
>>   static void cpu_ctx_sched_out(struct perf_cpu_context *cpuctx,
>> -                             enum event_type_t event_type)
>> +                             enum event_type_t event_type, int mux)
>>   {
>> -       ctx_sched_out(&cpuctx->ctx, cpuctx, event_type);
>> +       ctx_sched_out(&cpuctx->ctx, cpuctx, event_type, mux);
>>   }
>>
>>   static void
>> @@ -3061,29 +3287,8 @@ ctx_pinned_sched_in(struct perf_event_context *ctx,
>>                      struct perf_cpu_context *cpuctx)
>>   {
>>          struct perf_event *event;
>> -
>> -       list_for_each_entry(event, &ctx->pinned_groups, group_entry) {
>> -               if (event->state <= PERF_EVENT_STATE_OFF)
>> -                       continue;
>> -               if (!event_filter_match(event))
>> -                       continue;
> 
> Could we remove or simplify the tests in event_filter_match since the
> rb-tree filters by cpu?

Why do you want to remove or simplify event_filter_match()?

> 
>> -
>> -               /* may need to reset tstamp_enabled */
>> -               if (is_cgroup_event(event))
>> -                       perf_cgroup_mark_enabled(event, ctx);
>> -
>> -               if (group_can_go_on(event, cpuctx, 1))
>> -                       group_sched_in(event, cpuctx, ctx);
>> -
>> -               /*
>> -                * If this pinned group hasn't been scheduled,
>> -                * put it in error state.
>> -                */
>> -               if (event->state == PERF_EVENT_STATE_INACTIVE) {
>> -                       update_group_times(event);
>> -                       event->state = PERF_EVENT_STATE_ERROR;
>> -               }
>> -       }
>> +       list_for_each_entry(event, &ctx->pinned_groups, group_entry)
>> +               ctx_sched_in_pinned_group(ctx, cpuctx, event);
>>   }
>>
>>   static void
>> @@ -3092,37 +3297,19 @@ ctx_flexible_sched_in(struct perf_event_context
>> *ctx,
>>   {
>>          struct perf_event *event;
>>          int can_add_hw = 1;
>> -
>> -       list_for_each_entry(event, &ctx->flexible_groups, group_entry) {
>> -               /* Ignore events in OFF or ERROR state */
>> -               if (event->state <= PERF_EVENT_STATE_OFF)
>> -                       continue;
>> -               /*
>> -                * Listen to the 'cpu' scheduling filter constraint
>> -                * of events:
>> -                */
>> -               if (!event_filter_match(event))
>> -                       continue;
> same as before.
> 

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v2]: perf/core: addressing 4x slowdown during per-process, profiling of STREAM benchmark on Intel Xeon Phi
  2017-05-29 11:45             ` Alexey Budankov
@ 2017-06-15 17:42               ` Alexey Budankov
  2017-06-21 15:39                 ` Alexey Budankov
  0 siblings, 1 reply; 25+ messages in thread
From: Alexey Budankov @ 2017-06-15 17:42 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Arnaldo Carvalho de Melo, Alexander Shishkin,
	Andi Kleen, Kan Liang, Dmitri Prokhorov, Valery Cherepennikov,
	David Carrillo-Cisneros, Stephane Eranian, Mark Rutland,
	linux-kernel

On 29.05.2017 14:45, Alexey Budankov wrote:
> On 29.05.2017 14:23, Peter Zijlstra wrote:
>> On Mon, May 29, 2017 at 01:56:05PM +0300, Alexey Budankov wrote:
>>> On 29.05.2017 13:43, Peter Zijlstra wrote:
>>
>>>> Why can't the tree do both?
>>>>
>>>
>>> Well, indeed, the tree provides such capability too. However 
>>> switching to
>>> the full tree iteration in cases where we now go through _groups 
>>> lists will
>>> enlarge the patch, what is probably is not a big deal. Do you think 
>>> it is
>>> worth implementing the switch?
>>
>> Do it as a series of patches, where patch 1 introduces the tree, patches
>> 2 through n convert the list users into tree users, and patch n+1
>> removes the list.
> 
> Well ok, let's do that additionally but please expect delay in delivery 
> (I am OOO till Jun 14).

addressed in v3.

> 
>>
>> I think its good to not have duplicate data structures if we can avoid
>> it.
>>
> 
> yeah, makes sense.
> 
> 
> 

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH]: perf/core: addressing 4x slowdown during per-process profiling of STREAM benchmark on Intel Xeon Phi
  2017-06-14 10:07       ` Alexey Budankov
@ 2017-06-15 17:44         ` Alexey Budankov
  0 siblings, 0 replies; 25+ messages in thread
From: Alexey Budankov @ 2017-06-15 17:44 UTC (permalink / raw)
  To: Alexander Shishkin, Peter Zijlstra, Ingo Molnar,
	Arnaldo Carvalho de Melo
  Cc: Andi Kleen, Kan Liang, Dmitri Prokhorov, Valery Cherepennikov,
	David Carrillo-Cisneros, Stephane Eranian, Mark Rutland,
	linux-kernel

On 14.06.2017 13:07, Alexey Budankov wrote:
> On 30.05.2017 11:29, Alexander Shishkin wrote:
>> Alexey Budankov <alexey.budankov@linux.intel.com> writes:
>>
>>> On 29.05.2017 15:03, Alexander Shishkin wrote:
>>>> Alexey Budankov <alexey.budankov@linux.intel.com> writes:
>>>>
>>>> Here (above the function) you could include a comment describing what
>>>> happens when this is called, locking considerations, etc.
>>>
>>> I can put the short description from the initial thread message here.
>>> Would it be sufficient?
>>
>> Sure, this is where API descriptions fit better than in commit messages.

addressed in v3.

>>
>>>
>>>>
>>>>> +static int
>>>>> +perf_cpu_tree_insert(struct rb_root *tree, struct perf_event *event)
>>>>> +{
>>>>> +    struct rb_node **node;
>>>>> +    struct rb_node *parent;
>>>>> +
>>>>> +    if (!tree || !event)
>>>>> +        return 0;
>>>>
>>>> I don't think this should be happening, should it? And either way you
>>>> probably don't want to return 0 here, unless you're using !0 for
>>>> success.
>>>
>>> As you might notice already, currently return codes of the tree API are
>>> not checked all other the implementation. I suggest replacing that int
>>> error code by void and simplify the stuff.
>>
>> Your call. But I'd still either drop the redundant checks or wrap them
>> in WARN_ON_ONCE().
> 
> Ok. WARN_ON_ONCE() then.

addressed in v3.

> 
>>
>>>
>>>>
>>>>> +
>>>>> +    node = &tree->rb_node;
>>>>> +    parent = *node;
>>>>> +
>>>>> +    while (*node) {
>>>>> +        struct perf_event *node_event =    container_of(*node,
>>>>> +                struct perf_event, group_node);
>>>>> +
>>>>> +        parent = *node;
>>>>> +
>>>>> +        if (event->cpu < node_event->cpu) {
>>>>> +            node = &((*node)->rb_left);
>>>>
>>>> this would be the same as node = &parent->rb_left, right?
> 
> Yes, that is right.

addressed in v3.

> 
>>>
>>> Please ask more.
>>
>> Side note: between commit message, comments and the actual code, in an
>> ideal situation one doesn't have to 'ask' anything, because everything
>> is already clear. Not the case here.
>>
>>> node is the leaf node and parent is the parent of the
>>> node at the end of cycle. We need the both to insert a new node into a
>>> tree.
>>
>> Not sure I understand. You'd still have both.
>>
>>>
>>>>
>>>>> +        } else if (event->cpu > node_event->cpu) {
>>>>> +            node = &((*node)->rb_right);
>>>>> +        } else {
>>>>> +            list_add_tail(&event->group_list_entry,
>>>>> +                    &node_event->group_list);
>>>>
>>>> So why is this better than simply having per-cpu event lists plus one
>>>> for per-thread events?
>>>
>>> Good question. Choice of data structure and layout depends on the
>>> operations applied to the data so keeping groups as a tree simplifies
>>> and improves the implementation in terms of scalability and performance.
>>> Please ask more if any.
>>
>> Please be more specific on how scalability and performance are
>> improved. In general, try to avoid vagues statements like "this is
>> better for performance".
> 
> Accepted. Peter already provided more specifics on this. Thanks.
> 
>>
>> Thanks,
>> -- 
>> Alex
>>
> 

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v2]: perf/core: addressing 4x slowdown during per-process, profiling of STREAM benchmark on Intel Xeon Phi
  2017-06-15 17:42               ` Alexey Budankov
@ 2017-06-21 15:39                 ` Alexey Budankov
  2017-06-30 10:22                   ` Alexey Budankov
  0 siblings, 1 reply; 25+ messages in thread
From: Alexey Budankov @ 2017-06-21 15:39 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Arnaldo Carvalho de Melo, Alexander Shishkin,
	Andi Kleen, Kan Liang, Dmitri Prokhorov, Valery Cherepennikov,
	David Carrillo-Cisneros, Stephane Eranian, Mark Rutland,
	linux-kernel


Hi,

On 15.06.2017 20:42, Alexey Budankov wrote:
> On 29.05.2017 14:45, Alexey Budankov wrote:
>> On 29.05.2017 14:23, Peter Zijlstra wrote:
>>> On Mon, May 29, 2017 at 01:56:05PM +0300, Alexey Budankov wrote:
>>>> On 29.05.2017 13:43, Peter Zijlstra wrote:
>>>
>>>>> Why can't the tree do both?
>>>>>
>>>>
>>>> Well, indeed, the tree provides such capability too. However 
>>>> switching to
>>>> the full tree iteration in cases where we now go through _groups 
>>>> lists will
>>>> enlarge the patch, what is probably is not a big deal. Do you think 
>>>> it is
>>>> worth implementing the switch?
>>>
>>> Do it as a series of patches, where patch 1 introduces the tree, patches
>>> 2 through n convert the list users into tree users, and patch n+1
>>> removes the list.
>>
>> Well ok, let's do that additionally but please expect delay in 
>> delivery (I am OOO till Jun 14).
> 
> addressed in v3.
> 
>>
>>>
>>> I think its good to not have duplicate data structures if we can avoid
>>> it.
>>>
>>
>> yeah, makes sense.
>>
>>
>>
> 
> 

After straightforward switch from struct list_head to struct rb_tree for 
flexible_groups I now get dmesg dumps on rb tree corruptions. That 
happens when iterating thru tree instead of thru list. No additional
synchronization for the tree access was added. It looks like there are
some assumptions on the list_head type in the implementation itself.

Are there any ideas on why that corruptions may happen?

I still suggest isolating event groups into a separate object (please 
see patch v4-1/4):

struct perf_event_groups {
	struct rb_root	 tree;
	struct list_head list;
};

struct perf_event_context {
...
struct perf_event_groups pinned_groups;
struct perf_event_groups flexible_groups;

and implementing new API for the object:

perf_event_groups_empty()
perf_event_groups_init()
perf_event_groups_insert()
perf_event_groups_delete()
perf_event_groups_rotate(..., int cpu)
perf_event_groups_iterate_cpu(..., int cpu)
perf_event_groups_iterate()

so that perf_event_groups_iterate() would go thru list but leaving
the opportunity of iteration thru tree for a separate patch because
complete transition to rb trees may incur synchronization overhead in 
runtime.

Thanks,
Alexey

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v2]: perf/core: addressing 4x slowdown during per-process, profiling of STREAM benchmark on Intel Xeon Phi
  2017-06-21 15:39                 ` Alexey Budankov
@ 2017-06-30 10:22                   ` Alexey Budankov
  0 siblings, 0 replies; 25+ messages in thread
From: Alexey Budankov @ 2017-06-30 10:22 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Arnaldo Carvalho de Melo, Alexander Shishkin,
	Andi Kleen, Kan Liang, Dmitri Prokhorov, Valery Cherepennikov,
	David Carrillo-Cisneros, Stephane Eranian, Mark Rutland,
	linux-kernel

Hi Peter,

On 21.06.2017 18:39, Alexey Budankov wrote:
> 
> Hi,
> 
> On 15.06.2017 20:42, Alexey Budankov wrote:
>> On 29.05.2017 14:45, Alexey Budankov wrote:
>>> On 29.05.2017 14:23, Peter Zijlstra wrote:
>>>> On Mon, May 29, 2017 at 01:56:05PM +0300, Alexey Budankov wrote:
>>>>> On 29.05.2017 13:43, Peter Zijlstra wrote:
>>>>
>>>>>> Why can't the tree do both?
>>>>>>
>>>>>
>>>>> Well, indeed, the tree provides such capability too. However switching to
>>>>> the full tree iteration in cases where we now go through _groups lists will
>>>>> enlarge the patch, what is probably is not a big deal. Do you think it is
>>>>> worth implementing the switch?
>>>>
>>>> Do it as a series of patches, where patch 1 introduces the tree, patches
>>>> 2 through n convert the list users into tree users, and patch n+1
>>>> removes the list.
>>>
>>> Well ok, let's do that additionally but please expect delay in delivery (I am OOO till Jun 14).
>>
>> addressed in v3.
>>
>>>
>>>>
>>>> I think its good to not have duplicate data structures if we can avoid
>>>> it.
>>>>
>>>
>>> yeah, makes sense.
>>>
>>>
>>>
>>
>>
> 
> After straightforward switch from struct list_head to struct rb_tree for flexible_groups I now get dmesg dumps on rb tree corruptions. That happens when iterating thru tree instead of thru list. No additional
> synchronization for the tree access was added. It looks like there are
> some assumptions on the list_head type in the implementation itself.
> 
> Are there any ideas on why that corruptions may happen?
> 
> I still suggest isolating event groups into a separate object (please see patch v4-1/4):
> 
> struct perf_event_groups {
>     struct rb_root     tree;
>     struct list_head list;
> };
> 
> struct perf_event_context {
> ...
> struct perf_event_groups pinned_groups;
> struct perf_event_groups flexible_groups;
> 
> and implementing new API for the object:
> 
> perf_event_groups_empty()
> perf_event_groups_init()
> perf_event_groups_insert()
> perf_event_groups_delete()
> perf_event_groups_rotate(..., int cpu)
> perf_event_groups_iterate_cpu(..., int cpu)
> perf_event_groups_iterate()
> 
> so that perf_event_groups_iterate() would go thru list but leaving
> the opportunity of iteration thru tree for a separate patch because
> complete transition to rb trees may incur synchronization overhead in runtime.

Completely got rid of list and tree duplication in patch v5 4/4.
Please see here: 

[PATCH v5 4/4] perf/core: addressing 4x slowdown during per-process 
                          profiling of STREAM benchmark on Intel Xeon Phi

> 
> Thanks,
> Alexey
> 

Thanks,
Alexey

^ permalink raw reply	[flat|nested] 25+ messages in thread

end of thread, other threads:[~2017-06-30 10:22 UTC | newest]

Thread overview: 25+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-05-26 22:13 [PATCH]: perf/core: addressing 4x slowdown during per-process profiling of STREAM benchmark on Intel Xeon Phi Alexey Budankov
2017-05-27 11:19 ` [PATCH v2]: perf/core: addressing 4x slowdown during per-process, " Alexey Budankov
2017-05-29  7:45   ` Peter Zijlstra
2017-05-29  9:24     ` Alexey Budankov
2017-05-29 10:33       ` Peter Zijlstra
2017-05-29 10:46         ` Alexey Budankov
2017-05-29  7:46   ` Peter Zijlstra
2017-05-29  9:15     ` Alexey Budankov
2017-05-29 10:43       ` Peter Zijlstra
2017-05-29 10:56         ` Alexey Budankov
2017-05-29 11:23           ` Peter Zijlstra
2017-05-29 11:45             ` Alexey Budankov
2017-06-15 17:42               ` Alexey Budankov
2017-06-21 15:39                 ` Alexey Budankov
2017-06-30 10:22                   ` Alexey Budankov
2017-05-31 21:33   ` David Carrillo-Cisneros
2017-06-14 11:27     ` Alexey Budankov
2017-05-29 12:03 ` [PATCH]: perf/core: addressing 4x slowdown during per-process " Alexander Shishkin
2017-05-29 13:43   ` Alexey Budankov
2017-05-29 15:22     ` Peter Zijlstra
2017-05-29 15:29       ` Peter Zijlstra
2017-05-29 16:41         ` Alexey Budankov
2017-05-30  8:29     ` Alexander Shishkin
2017-06-14 10:07       ` Alexey Budankov
2017-06-15 17:44         ` Alexey Budankov

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).