* [PATCH v6 0/3] perf/core: addressing 4x slowdown during per-process profiling of STREAM benchmark on Intel Xeon Phi
@ 2017-08-02  8:11 Alexey Budankov
  2017-08-02  8:13 ` [PATCH v6 1/3] perf/core: use rb trees for pinned/flexible groups Alexey Budankov
                   ` (3 more replies)
  0 siblings, 4 replies; 76+ messages in thread
From: Alexey Budankov @ 2017-08-02  8:11 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Alexander Shishkin
  Cc: Andi Kleen, Kan Liang, Dmitri Prokhorov, Valery Cherepennikov,
	Mark Rutland, Stephane Eranian, David Carrillo-Cisneros,
	linux-kernel

Hi,

By default, the userspace perf tool opens per-cpu task-bound events
when sampling, so for N logical events requested by the user, the tool
will open N * NR_CPUS events.

In the kernel, we mux events with an hrtimer, periodically rotating the
flexible group list and trying to schedule each group in turn. We skip
groups whose cpu filter doesn't match. So when we get unlucky, we can
walk N * (NR_CPUS - 1) groups pointlessly for each hrtimer invocation.

This has been observed to cause significant overhead, a roughly 4x
slowdown of per-process profiling, when running the STREAM benchmark
on Xeon Phi systems with 272 logical CPUs, where each requested event
can cost up to 271 useless group visits per hrtimer tick.

One way to avoid this is to place our events into an rb tree sorted by
CPU, so that our hrtimer can skip to the current CPU's list and ignore
everything else. 

This patch set moves event groups into rb trees and implements 
skipping to the current CPU's list on hrtimer interrupt.
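
Conceptually, each context keeps its pinned and flexible groups in an
rb tree keyed by event->cpu, with the groups sharing a CPU strung on a
per-node list. A minimal sketch of the lookup the hrtimer path relies
on (the helper name here is illustrative; the patches below open-code
this search in perf_event_groups_rotate() and
perf_event_groups_iterate_cpu()):

static struct perf_event *
perf_event_groups_find(struct rb_root *groups, int cpu)
{
	struct rb_node *node = groups->rb_node;
	struct perf_event *node_event;

	/* Plain rb tree search keyed by event->cpu. */
	while (node) {
		node_event = container_of(node, struct perf_event, group_node);

		if (cpu < node_event->cpu)
			node = node->rb_left;
		else if (cpu > node_event->cpu)
			node = node->rb_right;
		else
			return node_event; /* head of this CPU's group_list */
	}

	return NULL;
}

The hrtimer handler then only needs to visit the node keyed by -1
(CPU-agnostic groups) and the node keyed by smp_processor_id(),
instead of walking every group in the context.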

The patch set was tested on Xeon Phi using perf_fuzzer and tests 
from here: https://github.com/deater/perf_event_tests

The patches are meant to be applied in the order given; the work is
split into three parts only to simplify review.

Thanks,
Alexey

---
 Alexey Budankov (3):
	perf/core: use rb trees for pinned/flexible groups
	perf/core: use context tstamp_data for skipped events on mux interrupt
	perf/core: add mux switch to skip to the current CPU's events list on mux interrupt

 include/linux/perf_event.h |  54 +++--
 kernel/events/core.c       | 584 +++++++++++++++++++++++++++++++++------------
 2 files changed, 473 insertions(+), 165 deletions(-)

^ permalink raw reply	[flat|nested] 76+ messages in thread

* [PATCH v6 1/3] perf/core: use rb trees for pinned/flexible groups
  2017-08-02  8:11 [PATCH v6 0/3] perf/core: addressing 4x slowdown during per-process profiling of STREAM benchmark on Intel Xeon Phi Alexey Budankov
@ 2017-08-02  8:13 ` Alexey Budankov
  2017-08-03 13:00   ` Peter Zijlstra
  2017-08-02  8:15 ` [PATCH v6 2/3]: perf/core: use context tstamp_data for skipped events on mux interrupt Alexey Budankov
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 76+ messages in thread
From: Alexey Budankov @ 2017-08-02  8:13 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Alexander Shishkin
  Cc: Andi Kleen, Kan Liang, Dmitri Prokhorov, Valery Cherepennikov,
	Mark Rutland, Stephane Eranian, David Carrillo-Cisneros,
	linux-kernel

This patch moves event groups into rb trees sorted by CPU, so that the
multiplexing hrtimer interrupt handler can skip to the current CPU's
list and ignore groups allocated for the other CPUs.

A new API for manipulating event groups in the trees is implemented,
and the existing code is adapted to use it.

Because the perf_event_groups_iterate() API runs a callback for every
event group in a tree, adopting it introduces some code that packs and
unpacks the arguments of existing functions and adjusts their calling
signatures, e.g. ctx_pinned_sched_in(), ctx_flexible_sched_in() and
inherit_task_group().
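
For example, ctx_sched_out() below stops walking the group list
directly; it packs its arguments into a group_sched_params object and
hands group_sched_out() to the iterator through a thin callback:

struct group_sched_params {
	struct perf_cpu_context *cpuctx;
	struct perf_event_context *ctx;
	int can_add_hw;
};

static int
group_sched_out_callback(struct perf_event *event, void *data)
{
	struct group_sched_params *params = data;

	group_sched_out(event, params->cpuctx, params->ctx);

	return 0;
}

	/* in ctx_sched_out(): */
	perf_event_groups_iterate(&ctx->pinned_groups,
			group_sched_out_callback, &params);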

Signed-off-by: Alexey Budankov <alexey.budankov@linux.intel.com>
---
 include/linux/perf_event.h |  18 ++-
 kernel/events/core.c       | 389 +++++++++++++++++++++++++++++++++------------
 2 files changed, 306 insertions(+), 101 deletions(-)

diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index a3b873f..282f121 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -572,6 +572,20 @@ struct perf_event {
 	 */
 	struct list_head		group_entry;
 	struct list_head		sibling_list;
+	/*
+	 * Node on the pinned or flexible tree located at the event context;
+	 * the node may be empty in case its event is not directly attached
+	 * to the tree but to group_list list of the event directly
+	 * attached to the tree;
+	 */
+	struct rb_node			group_node;
+	/*
+	 * List keeps groups allocated for the same cpu;
+	 * the list may be empty in case its event is not directly
+	 * attached to the tree but to group_list list of the event directly
+	 * attached to the tree;
+	 */
+	struct list_head		group_list;
 
 	/*
 	 * We need storage to track the entries in perf_pmu_migrate_context; we
@@ -741,8 +755,8 @@ struct perf_event_context {
 	struct mutex			mutex;
 
 	struct list_head		active_ctx_list;
-	struct list_head		pinned_groups;
-	struct list_head		flexible_groups;
+	struct rb_root			pinned_groups;
+	struct rb_root			flexible_groups;
 	struct list_head		event_list;
 	int				nr_events;
 	int				nr_active;
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 426c2ff..0a4f619 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -1466,8 +1466,12 @@ static enum event_type_t get_event_type(struct perf_event *event)
 	return event_type;
 }
 
-static struct list_head *
-ctx_group_list(struct perf_event *event, struct perf_event_context *ctx)
+/*
+ * Extract pinned or flexible groups from the context
+ * based on event attrs bits;
+ */
+static struct rb_root *
+get_event_groups(struct perf_event *event, struct perf_event_context *ctx)
 {
 	if (event->attr.pinned)
 		return &ctx->pinned_groups;
@@ -1475,6 +1479,160 @@ ctx_group_list(struct perf_event *event, struct perf_event_context *ctx)
 		return &ctx->flexible_groups;
 }
 
+static void
+perf_event_groups_insert(struct rb_root *groups,
+		struct perf_event *event);
+
+static void
+perf_event_groups_delete(struct rb_root *groups,
+		struct perf_event *event);
+
+/*
+ * Helper function to insert event into the pinned or
+ * flexible groups;
+ */
+static void
+add_event_to_groups(struct perf_event *event, struct perf_event_context *ctx)
+{
+	struct rb_root *groups;
+
+	groups = get_event_groups(event, ctx);
+	perf_event_groups_insert(groups, event);
+}
+
+/*
+ * Helper function to delete event from its groups;
+ */
+static void
+del_event_from_groups(struct perf_event *event, struct perf_event_context *ctx)
+{
+	struct rb_root *groups;
+
+	groups = get_event_groups(event, ctx);
+	perf_event_groups_delete(groups, event);
+}
+
+/*
+ * Insert a group into a tree using event->cpu as a key. If event->cpu node
+ * is already attached to the tree then the event is added to the attached
+ * group's group_list list.
+ */
+static void
+perf_event_groups_insert(struct rb_root *groups,
+		struct perf_event *event)
+{
+	struct rb_node **node;
+	struct rb_node *parent;
+	struct perf_event *node_event;
+
+	node = &groups->rb_node;
+	parent = *node;
+
+	while (*node) {
+		parent = *node;
+		node_event = container_of(*node,
+				struct perf_event, group_node);
+
+		if (event->cpu < node_event->cpu) {
+			node = &parent->rb_left;
+		} else if (event->cpu > node_event->cpu) {
+			node = &parent->rb_right;
+		} else {
+			list_add_tail(&event->group_entry,
+					&node_event->group_list);
+			return;
+		}
+	}
+
+	list_add_tail(&event->group_entry, &event->group_list);
+
+	rb_link_node(&event->group_node, parent, node);
+	rb_insert_color(&event->group_node, groups);
+}
+
+/*
+ * Delete a group from a tree. If the group is directly attached to the tree
+ * it also detaches all groups on the group's group_list list.
+ */
+static void
+perf_event_groups_delete(struct rb_root *groups,
+		struct perf_event *event)
+{
+	struct perf_event *next;
+
+	list_del_init(&event->group_entry);
+
+	if (!RB_EMPTY_NODE(&event->group_node)) {
+		if (!RB_EMPTY_ROOT(groups)) {
+			if (list_empty(&event->group_list)) {
+				rb_erase(&event->group_node, groups);
+			} else {
+				next = list_first_entry(&event->group_list,
+						struct perf_event, group_entry);
+				list_replace_init(&event->group_list,
+						&next->group_list);
+				rb_replace_node(&event->group_node,
+						&next->group_node, groups);
+			}
+		}
+		RB_CLEAR_NODE(&event->group_node);
+	}
+}
+
+/*
+ * Find group list by a cpu key and rotate it.
+ */
+static void
+perf_event_groups_rotate(struct rb_root *groups, int cpu)
+{
+	struct rb_node *node;
+	struct perf_event *node_event;
+
+	node = groups->rb_node;
+
+	while (node) {
+		node_event = container_of(node,
+				struct perf_event, group_node);
+
+		if (cpu < node_event->cpu) {
+			node = node->rb_left;
+		} else if (cpu > node_event->cpu) {
+			node = node->rb_right;
+		} else {
+			list_rotate_left(&node_event->group_list);
+			break;
+		}
+	}
+}
+
+typedef int(*perf_event_groups_iterate_f)(struct perf_event *, void *);
+
+/*
+ * Iterate event groups and call provided callback for every group in the tree.
+ * Iteration stops if the callback returns non zero.
+ */
+static int
+perf_event_groups_iterate(struct rb_root *groups,
+		perf_event_groups_iterate_f callback, void *data)
+{
+	int ret = 0;
+	struct rb_node *node;
+	struct perf_event *node_event, *event;
+
+	for (node = rb_first(groups); node; node = rb_next(node)) {
+		node_event = container_of(node,	struct perf_event, group_node);
+		list_for_each_entry(event, &node_event->group_list,
+				group_entry) {
+			ret = callback(event, data);
+			if (ret) {
+				return ret;
+			}
+		}
+	}
+
+	return ret;
+}
+
 /*
  * Add a event from the lists for its context.
  * Must be called with ctx->mutex and ctx->lock held.
@@ -1493,12 +1651,8 @@ list_add_event(struct perf_event *event, struct perf_event_context *ctx)
 	 * perf_group_detach can, at all times, locate all siblings.
 	 */
 	if (event->group_leader == event) {
-		struct list_head *list;
-
 		event->group_caps = event->event_caps;
-
-		list = ctx_group_list(event, ctx);
-		list_add_tail(&event->group_entry, list);
+		add_event_to_groups(event, ctx);
 	}
 
 	list_update_cgroup_event(event, ctx, true);
@@ -1689,7 +1843,7 @@ list_del_event(struct perf_event *event, struct perf_event_context *ctx)
 	list_del_rcu(&event->event_entry);
 
 	if (event->group_leader == event)
-		list_del_init(&event->group_entry);
+		del_event_from_groups(event, ctx);
 
 	update_group_times(event);
 
@@ -1730,22 +1884,22 @@ static void perf_group_detach(struct perf_event *event)
 		goto out;
 	}
 
-	if (!list_empty(&event->group_entry))
-		list = &event->group_entry;
-
 	/*
 	 * If this was a group event with sibling events then
 	 * upgrade the siblings to singleton events by adding them
 	 * to whatever list we are on.
 	 */
 	list_for_each_entry_safe(sibling, tmp, &event->sibling_list, group_entry) {
-		if (list)
-			list_move_tail(&sibling->group_entry, list);
 		sibling->group_leader = sibling;
 
 		/* Inherit group flags from the previous leader */
 		sibling->group_caps = event->group_caps;
 
+		if (!list_empty(&event->group_entry)) {
+			list_del_init(&sibling->group_entry);
+			add_event_to_groups(sibling, event->ctx);
+		}
+
 		WARN_ON_ONCE(sibling->ctx != event->ctx);
 	}
 
@@ -1869,6 +2023,22 @@ group_sched_out(struct perf_event *group_event,
 		cpuctx->exclusive = 0;
 }
 
+struct group_sched_params {
+	struct perf_cpu_context *cpuctx;
+	struct perf_event_context *ctx;
+	int can_add_hw;
+};
+
+static int
+group_sched_out_callback(struct perf_event *event, void *data)
+{
+	struct group_sched_params *params = data;
+
+	group_sched_out(event, params->cpuctx, params->ctx);
+
+	return 0;
+}
+
 #define DETACH_GROUP	0x01UL
 
 /*
@@ -2712,7 +2882,10 @@ static void ctx_sched_out(struct perf_event_context *ctx,
 			  enum event_type_t event_type)
 {
 	int is_active = ctx->is_active;
-	struct perf_event *event;
+	struct group_sched_params params = {
+			.cpuctx = cpuctx,
+			.ctx = ctx
+	};
 
 	lockdep_assert_held(&ctx->lock);
 
@@ -2759,13 +2932,13 @@ static void ctx_sched_out(struct perf_event_context *ctx,
 
 	perf_pmu_disable(ctx->pmu);
 	if (is_active & EVENT_PINNED) {
-		list_for_each_entry(event, &ctx->pinned_groups, group_entry)
-			group_sched_out(event, cpuctx, ctx);
+		perf_event_groups_iterate(&ctx->pinned_groups,
+				group_sched_out_callback, &params);
 	}
 
 	if (is_active & EVENT_FLEXIBLE) {
-		list_for_each_entry(event, &ctx->flexible_groups, group_entry)
-			group_sched_out(event, cpuctx, ctx);
+		perf_event_groups_iterate(&ctx->flexible_groups,
+				group_sched_out_callback, &params);
 	}
 	perf_pmu_enable(ctx->pmu);
 }
@@ -3059,63 +3232,60 @@ static void cpu_ctx_sched_out(struct perf_cpu_context *cpuctx,
 	ctx_sched_out(&cpuctx->ctx, cpuctx, event_type);
 }
 
-static void
-ctx_pinned_sched_in(struct perf_event_context *ctx,
-		    struct perf_cpu_context *cpuctx)
+static int
+ctx_pinned_sched_in(struct perf_event *event, void *data)
 {
-	struct perf_event *event;
+	struct group_sched_params *params = data;
 
-	list_for_each_entry(event, &ctx->pinned_groups, group_entry) {
-		if (event->state <= PERF_EVENT_STATE_OFF)
-			continue;
-		if (!event_filter_match(event))
-			continue;
+	if (event->state <= PERF_EVENT_STATE_OFF)
+		return 0;
+	if (!event_filter_match(event))
+		return 0;
 
-		/* may need to reset tstamp_enabled */
-		if (is_cgroup_event(event))
-			perf_cgroup_mark_enabled(event, ctx);
+	/* may need to reset tstamp_enabled */
+	if (is_cgroup_event(event))
+		perf_cgroup_mark_enabled(event, params->ctx);
 
-		if (group_can_go_on(event, cpuctx, 1))
-			group_sched_in(event, cpuctx, ctx);
+	if (group_can_go_on(event, params->cpuctx, 1))
+		group_sched_in(event, params->cpuctx, params->ctx);
 
-		/*
-		 * If this pinned group hasn't been scheduled,
-		 * put it in error state.
-		 */
-		if (event->state == PERF_EVENT_STATE_INACTIVE) {
-			update_group_times(event);
-			event->state = PERF_EVENT_STATE_ERROR;
-		}
+	/*
+	 * If this pinned group hasn't been scheduled,
+	 * put it in error state.
+	 */
+	if (event->state == PERF_EVENT_STATE_INACTIVE) {
+		update_group_times(event);
+		event->state = PERF_EVENT_STATE_ERROR;
 	}
+
+	return 0;
 }
 
-static void
-ctx_flexible_sched_in(struct perf_event_context *ctx,
-		      struct perf_cpu_context *cpuctx)
+static int
+ctx_flexible_sched_in(struct perf_event *event, void *data)
 {
-	struct perf_event *event;
-	int can_add_hw = 1;
+	struct group_sched_params *params = data;
 
-	list_for_each_entry(event, &ctx->flexible_groups, group_entry) {
-		/* Ignore events in OFF or ERROR state */
-		if (event->state <= PERF_EVENT_STATE_OFF)
-			continue;
-		/*
-		 * Listen to the 'cpu' scheduling filter constraint
-		 * of events:
-		 */
-		if (!event_filter_match(event))
-			continue;
+	/* Ignore events in OFF or ERROR state */
+	if (event->state <= PERF_EVENT_STATE_OFF)
+		return 0;
+	/*
+	 * Listen to the 'cpu' scheduling filter constraint
+	 * of events:
+	 */
+	if (!event_filter_match(event))
+		return 0;
 
-		/* may need to reset tstamp_enabled */
-		if (is_cgroup_event(event))
-			perf_cgroup_mark_enabled(event, ctx);
+	/* may need to reset tstamp_enabled */
+	if (is_cgroup_event(event))
+		perf_cgroup_mark_enabled(event, params->ctx);
 
-		if (group_can_go_on(event, cpuctx, can_add_hw)) {
-			if (group_sched_in(event, cpuctx, ctx))
-				can_add_hw = 0;
-		}
+	if (group_can_go_on(event, params->cpuctx, params->can_add_hw)) {
+		if (group_sched_in(event, params->cpuctx, params->ctx))
+			params->can_add_hw = 0;
 	}
+
+	return 0;
 }
 
 static void
@@ -3125,7 +3295,10 @@ ctx_sched_in(struct perf_event_context *ctx,
 	     struct task_struct *task)
 {
 	int is_active = ctx->is_active;
-	u64 now;
+	struct group_sched_params params = {
+			.cpuctx = cpuctx,
+			.ctx = ctx
+	};
 
 	lockdep_assert_held(&ctx->lock);
 
@@ -3144,7 +3317,7 @@ ctx_sched_in(struct perf_event_context *ctx,
 
 	if (is_active & EVENT_TIME) {
 		/* start ctx time */
-		now = perf_clock();
+		u64 now = perf_clock();
 		ctx->timestamp = now;
 		perf_cgroup_set_timestamp(task, ctx);
 	}
@@ -3154,11 +3327,15 @@ ctx_sched_in(struct perf_event_context *ctx,
 	 * in order to give them the best chance of going on.
 	 */
 	if (is_active & EVENT_PINNED)
-		ctx_pinned_sched_in(ctx, cpuctx);
+		perf_event_groups_iterate(&ctx->pinned_groups,
+				ctx_pinned_sched_in, &params);
 
 	/* Then walk through the lower prio flexible groups */
-	if (is_active & EVENT_FLEXIBLE)
-		ctx_flexible_sched_in(ctx, cpuctx);
+	if (is_active & EVENT_FLEXIBLE) {
+		params.can_add_hw = 1;
+		perf_event_groups_iterate(&ctx->flexible_groups,
+				ctx_flexible_sched_in, &params);
+	}
 }
 
 static void cpu_ctx_sched_in(struct perf_cpu_context *cpuctx,
@@ -3189,7 +3366,7 @@ static void perf_event_context_sched_in(struct perf_event_context *ctx,
 	 * However, if task's ctx is not carrying any pinned
 	 * events, no need to flip the cpuctx's events around.
 	 */
-	if (!list_empty(&ctx->pinned_groups))
+	if (!RB_EMPTY_ROOT(&ctx->pinned_groups))
 		cpu_ctx_sched_out(cpuctx, EVENT_FLEXIBLE);
 	perf_event_sched_in(cpuctx, ctx, task);
 	perf_pmu_enable(ctx->pmu);
@@ -3424,8 +3601,12 @@ static void rotate_ctx(struct perf_event_context *ctx)
 	 * Rotate the first entry last of non-pinned groups. Rotation might be
 	 * disabled by the inheritance code.
 	 */
-	if (!ctx->rotate_disable)
-		list_rotate_left(&ctx->flexible_groups);
+	if (!ctx->rotate_disable) {
+		int cpu = smp_processor_id();
+
+		perf_event_groups_rotate(&ctx->flexible_groups, -1);
+		perf_event_groups_rotate(&ctx->flexible_groups, cpu);
+	}
 }
 
 static int perf_rotate_context(struct perf_cpu_context *cpuctx)
@@ -3764,8 +3945,8 @@ static void __perf_event_init_context(struct perf_event_context *ctx)
 	raw_spin_lock_init(&ctx->lock);
 	mutex_init(&ctx->mutex);
 	INIT_LIST_HEAD(&ctx->active_ctx_list);
-	INIT_LIST_HEAD(&ctx->pinned_groups);
-	INIT_LIST_HEAD(&ctx->flexible_groups);
+	ctx->pinned_groups = RB_ROOT;
+	ctx->flexible_groups = RB_ROOT;
 	INIT_LIST_HEAD(&ctx->event_list);
 	atomic_set(&ctx->refcount, 1);
 }
@@ -9372,6 +9553,8 @@ perf_event_alloc(struct perf_event_attr *attr, int cpu,
 	INIT_LIST_HEAD(&event->group_entry);
 	INIT_LIST_HEAD(&event->event_entry);
 	INIT_LIST_HEAD(&event->sibling_list);
+	RB_CLEAR_NODE(&event->group_node);
+	INIT_LIST_HEAD(&event->group_list);
 	INIT_LIST_HEAD(&event->rb_entry);
 	INIT_LIST_HEAD(&event->active_entry);
 	INIT_LIST_HEAD(&event->addr_filters.list);
@@ -10786,6 +10969,14 @@ static int inherit_group(struct perf_event *parent_event,
 	return 0;
 }
 
+struct inherit_task_group_params {
+	struct task_struct *parent;
+	struct perf_event_context *parent_ctx;
+	struct task_struct *child;
+	int ctxn;
+	int inherited_all;
+};
+
 /*
  * Creates the child task context and tries to inherit the event-group.
  *
@@ -10798,20 +10989,18 @@ static int inherit_group(struct perf_event *parent_event,
  *  - <0 on error
  */
 static int
-inherit_task_group(struct perf_event *event, struct task_struct *parent,
-		   struct perf_event_context *parent_ctx,
-		   struct task_struct *child, int ctxn,
-		   int *inherited_all)
+inherit_task_group(struct perf_event *event, void *data)
 {
 	int ret;
 	struct perf_event_context *child_ctx;
+	struct inherit_task_group_params *params = data;
 
 	if (!event->attr.inherit) {
-		*inherited_all = 0;
+		params->inherited_all = 0;
 		return 0;
 	}
 
-	child_ctx = child->perf_event_ctxp[ctxn];
+	child_ctx = params->child->perf_event_ctxp[params->ctxn];
 	if (!child_ctx) {
 		/*
 		 * This is executed from the parent task context, so
@@ -10819,18 +11008,19 @@ inherit_task_group(struct perf_event *event, struct task_struct *parent,
 		 * First allocate and initialize a context for the
 		 * child.
 		 */
-		child_ctx = alloc_perf_context(parent_ctx->pmu, child);
+		child_ctx = alloc_perf_context(params->parent_ctx->pmu,
+				params->child);
 		if (!child_ctx)
 			return -ENOMEM;
 
-		child->perf_event_ctxp[ctxn] = child_ctx;
+		params->child->perf_event_ctxp[params->ctxn] = child_ctx;
 	}
 
-	ret = inherit_group(event, parent, parent_ctx,
-			    child, child_ctx);
+	ret = inherit_group(event, params->parent, params->parent_ctx,
+			    params->child, child_ctx);
 
 	if (ret)
-		*inherited_all = 0;
+		params->inherited_all = 0;
 
 	return ret;
 }
@@ -10842,11 +11032,15 @@ static int perf_event_init_context(struct task_struct *child, int ctxn)
 {
 	struct perf_event_context *child_ctx, *parent_ctx;
 	struct perf_event_context *cloned_ctx;
-	struct perf_event *event;
 	struct task_struct *parent = current;
-	int inherited_all = 1;
 	unsigned long flags;
 	int ret = 0;
+	struct inherit_task_group_params params = {
+			.parent = parent,
+			.child = child,
+			.ctxn = ctxn,
+			.inherited_all = 1
+	};
 
 	if (likely(!parent->perf_event_ctxp[ctxn]))
 		return 0;
@@ -10859,6 +11053,8 @@ static int perf_event_init_context(struct task_struct *child, int ctxn)
 	if (!parent_ctx)
 		return 0;
 
+	params.parent_ctx = parent_ctx;
+
 	/*
 	 * No need to check if parent_ctx != NULL here; since we saw
 	 * it non-NULL earlier, the only reason for it to become NULL
@@ -10876,13 +11072,10 @@ static int perf_event_init_context(struct task_struct *child, int ctxn)
 	 * We dont have to disable NMIs - we are only looking at
 	 * the list, not manipulating it:
 	 */
-	list_for_each_entry(event, &parent_ctx->pinned_groups, group_entry) {
-		ret = inherit_task_group(event, parent, parent_ctx,
-					 child, ctxn, &inherited_all);
-		if (ret)
-			goto out_unlock;
-	}
-
+	ret = perf_event_groups_iterate(&parent_ctx->pinned_groups,
+			inherit_task_group, &params);
+	if (ret)
+		goto out_unlock;
 	/*
 	 * We can't hold ctx->lock when iterating the ->flexible_group list due
 	 * to allocations, but we need to prevent rotation because
@@ -10892,19 +11085,17 @@ static int perf_event_init_context(struct task_struct *child, int ctxn)
 	parent_ctx->rotate_disable = 1;
 	raw_spin_unlock_irqrestore(&parent_ctx->lock, flags);
 
-	list_for_each_entry(event, &parent_ctx->flexible_groups, group_entry) {
-		ret = inherit_task_group(event, parent, parent_ctx,
-					 child, ctxn, &inherited_all);
-		if (ret)
-			goto out_unlock;
-	}
+	ret = perf_event_groups_iterate(&parent_ctx->flexible_groups,
+			inherit_task_group, &params);
+	if (ret)
+		goto out_unlock;
 
 	raw_spin_lock_irqsave(&parent_ctx->lock, flags);
 	parent_ctx->rotate_disable = 0;
 
 	child_ctx = child->perf_event_ctxp[ctxn];
 
-	if (child_ctx && inherited_all) {
+	if (child_ctx && params.inherited_all) {
 		/*
 		 * Mark the child context as a clone of the parent
 		 * context, or of whatever the parent is a clone of.

^ permalink raw reply related	[flat|nested] 76+ messages in thread

* [PATCH v6 2/3]: perf/core: use context tstamp_data for skipped events on mux interrupt
  2017-08-02  8:11 [PATCH v6 0/3] perf/core: addressing 4x slowdown during per-process profiling of STREAM benchmark on Intel Xeon Phi Alexey Budankov
  2017-08-02  8:13 ` [PATCH v6 1/3] perf/core: use rb trees for pinned/flexible groups Alexey Budankov
@ 2017-08-02  8:15 ` Alexey Budankov
  2017-08-03 13:04   ` Peter Zijlstra
                     ` (2 more replies)
  2017-08-02  8:16 ` [PATCH v6 3/3]: perf/core: add mux switch to skip to the current CPU's events list on mux interrupt Alexey Budankov
  2017-08-18  5:17 ` [PATCH v7 0/2] perf/core: addressing 4x slowdown during per-process profiling of STREAM benchmark on Intel Xeon Phi Alexey Budankov
  3 siblings, 3 replies; 76+ messages in thread
From: Alexey Budankov @ 2017-08-02  8:15 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Alexander Shishkin
  Cc: Andi Kleen, Kan Liang, Dmitri Prokhorov, Valery Cherepennikov,
	Mark Rutland, Stephane Eranian, David Carrillo-Cisneros,
	linux-kernel

Event groups allocated for CPUs other than the one handling the
multiplexing hrtimer interrupt may be skipped by the interrupt handler;
however, the events' tstamp_enabled, tstamp_running and tstamp_stopped
fields still need to be updated to keep the timings correct.

To implement that, a tstamp_data object is introduced in the event
context, and a skipped event's tstamp pointer is switched between its
own tstamp_data object and the context's one.

The context object's timings are updated by update_context_time() on
every multiplexing hrtimer interrupt, so all events referencing the
context object get their timings updated at once.

An event group's tstamp is switched to the context object, and back to
its own object, when the group does not pass event_filter_match() on
thread context switch in and out.
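
The switch itself is just a pointer assignment plus a snapshot copy;
condensed from the hunks below:

	/*
	 * When sched in skips a filtered-out event, the event starts
	 * sharing the context's timestamps:
	 */
	if (event->tstamp != &ctx->tstamp_data)
		event->tstamp = &ctx->tstamp_data;

	/*
	 * When it is later scheduled out, it takes a private
	 * snapshot again:
	 */
	if (event->tstamp != &event->tstamp_data) {
		event->tstamp_data = *event->tstamp;
		event->tstamp = &event->tstamp_data;
	}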

Signed-off-by: Alexey Budankov <alexey.budankov@linux.intel.com>
---
 include/linux/perf_event.h | 36 ++++++++++++++++++----------
 kernel/events/core.c       | 58 ++++++++++++++++++++++++++++------------------
 2 files changed, 60 insertions(+), 34 deletions(-)

diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 282f121..69d60f2 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -550,6 +550,22 @@ struct pmu_event_list {
 	struct list_head	list;
 };
 
+struct perf_event_tstamp {
+	/*
+	 * These are timestamps used for computing total_time_enabled
+	 * and total_time_running when the event is in INACTIVE or
+	 * ACTIVE state, measured in nanoseconds from an arbitrary point
+	 * in time.
+	 * enabled: the notional time when the event was enabled
+	 * running: the notional time when the event was scheduled on
+	 * stopped: in INACTIVE state, the notional time when the
+	 *    event was scheduled off.
+	 */
+	u64 enabled;
+	u64 running;
+	u64 stopped;
+};
+
 /**
  * struct perf_event - performance event kernel representation:
  */
@@ -625,19 +641,11 @@ struct perf_event {
 	u64				total_time_running;
 
 	/*
-	 * These are timestamps used for computing total_time_enabled
-	 * and total_time_running when the event is in INACTIVE or
-	 * ACTIVE state, measured in nanoseconds from an arbitrary point
-	 * in time.
-	 * tstamp_enabled: the notional time when the event was enabled
-	 * tstamp_running: the notional time when the event was scheduled on
-	 * tstamp_stopped: in INACTIVE state, the notional time when the
-	 *	event was scheduled off.
+	 * tstamp points to the tstamp_data object below or to the object
+	 * located at the event context;
 	 */
-	u64				tstamp_enabled;
-	u64				tstamp_running;
-	u64				tstamp_stopped;
-
+	struct perf_event_tstamp	*tstamp;
+	struct perf_event_tstamp	tstamp_data;
 	/*
 	 * timestamp shadows the actual context timing but it can
 	 * be safely used in NMI interrupt context. It reflects the
@@ -772,6 +780,10 @@ struct perf_event_context {
 	 */
 	u64				time;
 	u64				timestamp;
+	/*
+	 * Context cache for filtered out events;
+	 */
+	struct perf_event_tstamp	tstamp_data;
 
 	/*
 	 * These fields let us detect when two contexts have both
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 0a4f619..5ccb8a2 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -866,10 +866,10 @@ perf_cgroup_mark_enabled(struct perf_event *event,
 
 	event->cgrp_defer_enabled = 0;
 
-	event->tstamp_enabled = tstamp - event->total_time_enabled;
+	event->tstamp->enabled = tstamp - event->total_time_enabled;
 	list_for_each_entry(sub, &event->sibling_list, group_entry) {
 		if (sub->state >= PERF_EVENT_STATE_INACTIVE) {
-			sub->tstamp_enabled = tstamp - sub->total_time_enabled;
+			sub->tstamp->enabled = tstamp - sub->total_time_enabled;
 			sub->cgrp_defer_enabled = 0;
 		}
 	}
@@ -1379,6 +1379,9 @@ static void update_context_time(struct perf_event_context *ctx)
 
 	ctx->time += now - ctx->timestamp;
 	ctx->timestamp = now;
+
+	ctx->tstamp_data.running += ctx->time - ctx->tstamp_data.stopped;
+	ctx->tstamp_data.stopped = ctx->time;
 }
 
 static u64 perf_event_time(struct perf_event *event)
@@ -1420,16 +1423,16 @@ static void update_event_times(struct perf_event *event)
 	else if (ctx->is_active)
 		run_end = ctx->time;
 	else
-		run_end = event->tstamp_stopped;
+		run_end = event->tstamp->stopped;
 
-	event->total_time_enabled = run_end - event->tstamp_enabled;
+	event->total_time_enabled = run_end - event->tstamp->enabled;
 
 	if (event->state == PERF_EVENT_STATE_INACTIVE)
-		run_end = event->tstamp_stopped;
+		run_end = event->tstamp->stopped;
 	else
 		run_end = perf_event_time(event);
 
-	event->total_time_running = run_end - event->tstamp_running;
+	event->total_time_running = run_end - event->tstamp->running;
 
 }
 
@@ -1968,9 +1971,13 @@ event_sched_out(struct perf_event *event,
 	 */
 	if (event->state == PERF_EVENT_STATE_INACTIVE &&
 	    !event_filter_match(event)) {
-		delta = tstamp - event->tstamp_stopped;
-		event->tstamp_running += delta;
-		event->tstamp_stopped = tstamp;
+		delta = tstamp - event->tstamp->stopped;
+		event->tstamp->running += delta;
+		event->tstamp->stopped = tstamp;
+		if (event->tstamp != &event->tstamp_data) {
+			event->tstamp_data = *event->tstamp;
+			event->tstamp = &event->tstamp_data;
+		}
 	}
 
 	if (event->state != PERF_EVENT_STATE_ACTIVE)
@@ -1978,7 +1985,7 @@ event_sched_out(struct perf_event *event,
 
 	perf_pmu_disable(event->pmu);
 
-	event->tstamp_stopped = tstamp;
+	event->tstamp->stopped = tstamp;
 	event->pmu->del(event, 0);
 	event->oncpu = -1;
 	event->state = PERF_EVENT_STATE_INACTIVE;
@@ -2269,7 +2276,7 @@ event_sched_in(struct perf_event *event,
 		goto out;
 	}
 
-	event->tstamp_running += tstamp - event->tstamp_stopped;
+	event->tstamp->running += tstamp - event->tstamp->stopped;
 
 	if (!is_software_event(event))
 		cpuctx->active_oncpu++;
@@ -2341,8 +2348,8 @@ group_sched_in(struct perf_event *group_event,
 			simulate = true;
 
 		if (simulate) {
-			event->tstamp_running += now - event->tstamp_stopped;
-			event->tstamp_stopped = now;
+			event->tstamp->running += now - event->tstamp->stopped;
+			event->tstamp->stopped = now;
 		} else {
 			event_sched_out(event, cpuctx, ctx);
 		}
@@ -2394,9 +2401,9 @@ static void add_event_to_ctx(struct perf_event *event,
 
 	list_add_event(event, ctx);
 	perf_group_attach(event);
-	event->tstamp_enabled = tstamp;
-	event->tstamp_running = tstamp;
-	event->tstamp_stopped = tstamp;
+	event->tstamp->enabled = tstamp;
+	event->tstamp->running = tstamp;
+	event->tstamp->stopped = tstamp;
 }
 
 static void ctx_sched_out(struct perf_event_context *ctx,
@@ -2641,10 +2648,10 @@ static void __perf_event_mark_enabled(struct perf_event *event)
 	u64 tstamp = perf_event_time(event);
 
 	event->state = PERF_EVENT_STATE_INACTIVE;
-	event->tstamp_enabled = tstamp - event->total_time_enabled;
+	event->tstamp->enabled = tstamp - event->total_time_enabled;
 	list_for_each_entry(sub, &event->sibling_list, group_entry) {
 		if (sub->state >= PERF_EVENT_STATE_INACTIVE)
-			sub->tstamp_enabled = tstamp - sub->total_time_enabled;
+			sub->tstamp->enabled = tstamp - sub->total_time_enabled;
 	}
 }
 
@@ -3239,8 +3246,11 @@ ctx_pinned_sched_in(struct perf_event *event, void *data)
 
 	if (event->state <= PERF_EVENT_STATE_OFF)
 		return 0;
-	if (!event_filter_match(event))
+	if (!event_filter_match(event)) {
+		if (event->tstamp != &params->ctx->tstamp_data)
+			event->tstamp = &params->ctx->tstamp_data;
 		return 0;
+	}
 
 	/* may need to reset tstamp_enabled */
 	if (is_cgroup_event(event))
@@ -3273,8 +3283,11 @@ ctx_flexible_sched_in(struct perf_event *event, void *data)
 	 * Listen to the 'cpu' scheduling filter constraint
 	 * of events:
 	 */
-	if (!event_filter_match(event))
+	if (!event_filter_match(event)) {
+		if (event->tstamp != &params->ctx->tstamp_data)
+			event->tstamp = &params->ctx->tstamp_data;
 		return 0;
+	}
 
 	/* may need to reset tstamp_enabled */
 	if (is_cgroup_event(event))
@@ -5042,8 +5055,8 @@ static void calc_timer_values(struct perf_event *event,
 
 	*now = perf_clock();
 	ctx_time = event->shadow_ctx_time + *now;
-	*enabled = ctx_time - event->tstamp_enabled;
-	*running = ctx_time - event->tstamp_running;
+	*enabled = ctx_time - event->tstamp->enabled;
+	*running = ctx_time - event->tstamp->running;
 }
 
 static void perf_event_init_userpage(struct perf_event *event)
@@ -9568,6 +9581,7 @@ perf_event_alloc(struct perf_event_attr *attr, int cpu,
 	raw_spin_lock_init(&event->addr_filters.lock);
 
 	atomic_long_set(&event->refcount, 1);
+	event->tstamp		= &event->tstamp_data;
 	event->cpu		= cpu;
 	event->attr		= *attr;
 	event->group_leader	= group_leader;

^ permalink raw reply related	[flat|nested] 76+ messages in thread

* [PATCH v6 3/3]: perf/core: add mux switch to skip to the current CPU's events list on mux interrupt
  2017-08-02  8:11 [PATCH v6 0/3] perf/core: addressing 4x slowdown during per-process profiling of STREAM benchmark on Intel Xeon Phi Alexey Budankov
  2017-08-02  8:13 ` [PATCH v6 1/3] perf/core: use rb trees for pinned/flexible groups Alexey Budankov
  2017-08-02  8:15 ` [PATCH v6 2/3]: perf/core: use context tstamp_data for skipped events on mux interrupt Alexey Budankov
@ 2017-08-02  8:16 ` Alexey Budankov
  2017-08-18  5:17 ` [PATCH v7 0/2] perf/core: addressing 4x slowdown during per-process profiling of STREAM benchmark on Intel Xeon Phi Alexey Budankov
  3 siblings, 0 replies; 76+ messages in thread
From: Alexey Budankov @ 2017-08-02  8:16 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Alexander Shishkin
  Cc: Andi Kleen, Kan Liang, Dmitri Prokhorov, Valery Cherepennikov,
	Mark Rutland, Stephane Eranian, David Carrillo-Cisneros,
	linux-kernel

This patch implements a mux switch that makes the multiplexing hrtimer
interrupt handler skip to the current CPU's events list, and adapts the
existing code to the switch.

The perf_event_groups_iterate_cpu() API is introduced to iterate over a
specific CPU's group list while skipping groups allocated for the other
CPUs.
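
With mux set, the scheduling paths visit only the CPU-agnostic (-1)
sub-tree and the current CPU's sub-tree instead of the whole tree;
condensed from the ctx_sched_in() hunk below:

	if (mux) {
		perf_event_groups_iterate_cpu(&ctx->pinned_groups, -1,
				ctx_pinned_sched_in, &params);
		perf_event_groups_iterate_cpu(&ctx->pinned_groups, cpu,
				ctx_pinned_sched_in, &params);
	} else {
		perf_event_groups_iterate(&ctx->pinned_groups,
				ctx_pinned_sched_in, &params);
	}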

Signed-off-by: Alexey Budankov <alexey.budankov@linux.intel.com>
---
 kernel/events/core.c | 159 ++++++++++++++++++++++++++++++++++++++-------------
 1 file changed, 118 insertions(+), 41 deletions(-)

diff --git a/kernel/events/core.c b/kernel/events/core.c
index 5ccb8a2..61f370e 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -556,11 +556,11 @@ void perf_sample_event_took(u64 sample_len_ns)
 static atomic64_t perf_event_id;
 
 static void cpu_ctx_sched_out(struct perf_cpu_context *cpuctx,
-			      enum event_type_t event_type);
+			      enum event_type_t event_type, int mux);
 
 static void cpu_ctx_sched_in(struct perf_cpu_context *cpuctx,
 			     enum event_type_t event_type,
-			     struct task_struct *task);
+			     struct task_struct *task, int mux);
 
 static void update_context_time(struct perf_event_context *ctx);
 static u64 perf_event_time(struct perf_event *event);
@@ -702,6 +702,7 @@ static void perf_cgroup_switch(struct task_struct *task, int mode)
 	struct perf_cpu_context *cpuctx;
 	struct list_head *list;
 	unsigned long flags;
+	int mux = 0;
 
 	/*
 	 * Disable interrupts and preemption to avoid this CPU's
@@ -717,7 +718,7 @@ static void perf_cgroup_switch(struct task_struct *task, int mode)
 		perf_pmu_disable(cpuctx->ctx.pmu);
 
 		if (mode & PERF_CGROUP_SWOUT) {
-			cpu_ctx_sched_out(cpuctx, EVENT_ALL);
+			cpu_ctx_sched_out(cpuctx, EVENT_ALL, mux);
 			/*
 			 * must not be done before ctxswout due
 			 * to event_filter_match() in event_sched_out()
@@ -736,7 +737,7 @@ static void perf_cgroup_switch(struct task_struct *task, int mode)
 			 */
 			cpuctx->cgrp = perf_cgroup_from_task(task,
 							     &cpuctx->ctx);
-			cpu_ctx_sched_in(cpuctx, EVENT_ALL, task);
+			cpu_ctx_sched_in(cpuctx, EVENT_ALL, task, mux);
 		}
 		perf_pmu_enable(cpuctx->ctx.pmu);
 		perf_ctx_unlock(cpuctx, cpuctx->task_ctx);
@@ -1611,6 +1612,36 @@ perf_event_groups_rotate(struct rb_root *groups, int cpu)
 typedef int(*perf_event_groups_iterate_f)(struct perf_event *, void *);
 
 /*
+ * Find group_list list by a cpu key and call provided callback for every
+ * group on the list.
+ */
+static void
+perf_event_groups_iterate_cpu(struct rb_root *groups, int cpu,
+		perf_event_groups_iterate_f callback, void *data)
+{
+	struct rb_node *node;
+	struct perf_event *event, *node_event;
+
+	node = groups->rb_node;
+
+	while (node) {
+		node_event = container_of(node,
+				struct perf_event, group_node);
+
+		if (cpu < node_event->cpu) {
+			node = node->rb_left;
+		} else if (cpu > node_event->cpu) {
+			node = node->rb_right;
+		} else {
+			list_for_each_entry(event, &node_event->group_list,
+					group_entry)
+				callback(event, data);
+			break;
+		}
+	}
+}
+
+/*
  * Iterate event groups and call provided callback for every group in the tree.
  * Iteration stops if the callback returns non zero.
  */
@@ -1866,7 +1897,6 @@ list_del_event(struct perf_event *event, struct perf_event_context *ctx)
 static void perf_group_detach(struct perf_event *event)
 {
 	struct perf_event *sibling, *tmp;
-	struct list_head *list = NULL;
 
 	lockdep_assert_held(&event->ctx->lock);
 
@@ -2408,36 +2438,38 @@ static void add_event_to_ctx(struct perf_event *event,
 
 static void ctx_sched_out(struct perf_event_context *ctx,
 			  struct perf_cpu_context *cpuctx,
-			  enum event_type_t event_type);
+			  enum event_type_t event_type, int mux);
 static void
 ctx_sched_in(struct perf_event_context *ctx,
 	     struct perf_cpu_context *cpuctx,
 	     enum event_type_t event_type,
-	     struct task_struct *task);
+	     struct task_struct *task, int mux);
 
 static void task_ctx_sched_out(struct perf_cpu_context *cpuctx,
 			       struct perf_event_context *ctx,
 			       enum event_type_t event_type)
 {
+	int mux = 0;
+
 	if (!cpuctx->task_ctx)
 		return;
 
 	if (WARN_ON_ONCE(ctx != cpuctx->task_ctx))
 		return;
 
-	ctx_sched_out(ctx, cpuctx, event_type);
+	ctx_sched_out(ctx, cpuctx, event_type, mux);
 }
 
 static void perf_event_sched_in(struct perf_cpu_context *cpuctx,
 				struct perf_event_context *ctx,
-				struct task_struct *task)
+				struct task_struct *task, int mux)
 {
-	cpu_ctx_sched_in(cpuctx, EVENT_PINNED, task);
+	cpu_ctx_sched_in(cpuctx, EVENT_PINNED, task, mux);
 	if (ctx)
-		ctx_sched_in(ctx, cpuctx, EVENT_PINNED, task);
-	cpu_ctx_sched_in(cpuctx, EVENT_FLEXIBLE, task);
+		ctx_sched_in(ctx, cpuctx, EVENT_PINNED, task, mux);
+	cpu_ctx_sched_in(cpuctx, EVENT_FLEXIBLE, task, mux);
 	if (ctx)
-		ctx_sched_in(ctx, cpuctx, EVENT_FLEXIBLE, task);
+		ctx_sched_in(ctx, cpuctx, EVENT_FLEXIBLE, task, mux);
 }
 
 /*
@@ -2461,6 +2493,7 @@ static void ctx_resched(struct perf_cpu_context *cpuctx,
 {
 	enum event_type_t ctx_event_type = event_type & EVENT_ALL;
 	bool cpu_event = !!(event_type & EVENT_CPU);
+	int mux = 0;
 
 	/*
 	 * If pinned groups are involved, flexible groups also need to be
@@ -2481,11 +2514,11 @@ static void ctx_resched(struct perf_cpu_context *cpuctx,
 	 *  - otherwise, do nothing more.
 	 */
 	if (cpu_event)
-		cpu_ctx_sched_out(cpuctx, ctx_event_type);
+		cpu_ctx_sched_out(cpuctx, ctx_event_type, mux);
 	else if (ctx_event_type & EVENT_PINNED)
-		cpu_ctx_sched_out(cpuctx, EVENT_FLEXIBLE);
+		cpu_ctx_sched_out(cpuctx, EVENT_FLEXIBLE, mux);
 
-	perf_event_sched_in(cpuctx, task_ctx, current);
+	perf_event_sched_in(cpuctx, task_ctx, current, mux);
 	perf_pmu_enable(cpuctx->ctx.pmu);
 }
 
@@ -2503,6 +2536,7 @@ static int  __perf_install_in_context(void *info)
 	struct perf_event_context *task_ctx = cpuctx->task_ctx;
 	bool reprogram = true;
 	int ret = 0;
+	int mux = 0;
 
 	raw_spin_lock(&cpuctx->ctx.lock);
 	if (ctx->task) {
@@ -2529,7 +2563,7 @@ static int  __perf_install_in_context(void *info)
 	}
 
 	if (reprogram) {
-		ctx_sched_out(ctx, cpuctx, EVENT_TIME);
+		ctx_sched_out(ctx, cpuctx, EVENT_TIME, mux);
 		add_event_to_ctx(event, ctx);
 		ctx_resched(cpuctx, task_ctx, get_event_type(event));
 	} else {
@@ -2665,13 +2699,14 @@ static void __perf_event_enable(struct perf_event *event,
 {
 	struct perf_event *leader = event->group_leader;
 	struct perf_event_context *task_ctx;
+	int mux = 0;
 
 	if (event->state >= PERF_EVENT_STATE_INACTIVE ||
 	    event->state <= PERF_EVENT_STATE_ERROR)
 		return;
 
 	if (ctx->is_active)
-		ctx_sched_out(ctx, cpuctx, EVENT_TIME);
+		ctx_sched_out(ctx, cpuctx, EVENT_TIME, mux);
 
 	__perf_event_mark_enabled(event);
 
@@ -2681,7 +2716,7 @@ static void __perf_event_enable(struct perf_event *event,
 	if (!event_filter_match(event)) {
 		if (is_cgroup_event(event))
 			perf_cgroup_defer_enabled(event);
-		ctx_sched_in(ctx, cpuctx, EVENT_TIME, current);
+		ctx_sched_in(ctx, cpuctx, EVENT_TIME, current, mux);
 		return;
 	}
 
@@ -2690,7 +2725,7 @@ static void __perf_event_enable(struct perf_event *event,
 	 * then don't put it on unless the group is on.
 	 */
 	if (leader != event && leader->state != PERF_EVENT_STATE_ACTIVE) {
-		ctx_sched_in(ctx, cpuctx, EVENT_TIME, current);
+		ctx_sched_in(ctx, cpuctx, EVENT_TIME, current, mux);
 		return;
 	}
 
@@ -2886,13 +2921,14 @@ EXPORT_SYMBOL_GPL(perf_event_refresh);
 
 static void ctx_sched_out(struct perf_event_context *ctx,
 			  struct perf_cpu_context *cpuctx,
-			  enum event_type_t event_type)
+			  enum event_type_t event_type, int mux)
 {
 	int is_active = ctx->is_active;
 	struct group_sched_params params = {
 			.cpuctx = cpuctx,
 			.ctx = ctx
 	};
+	int cpu = smp_processor_id();
 
 	lockdep_assert_held(&ctx->lock);
 
@@ -2939,13 +2975,27 @@ static void ctx_sched_out(struct perf_event_context *ctx,
 
 	perf_pmu_disable(ctx->pmu);
 	if (is_active & EVENT_PINNED) {
-		perf_event_groups_iterate(&ctx->pinned_groups,
-				group_sched_out_callback, &params);
+		if (mux) {
+			perf_event_groups_iterate_cpu(&ctx->pinned_groups, -1,
+					group_sched_out_callback, &params);
+			perf_event_groups_iterate_cpu(&ctx->pinned_groups, cpu,
+					group_sched_out_callback, &params);
+		} else {
+			perf_event_groups_iterate(&ctx->pinned_groups,
+					group_sched_out_callback, &params);
+		}
 	}
 
 	if (is_active & EVENT_FLEXIBLE) {
-		perf_event_groups_iterate(&ctx->flexible_groups,
-				group_sched_out_callback, &params);
+		if (mux) {
+			perf_event_groups_iterate_cpu(&ctx->flexible_groups, -1,
+					group_sched_out_callback, &params);
+			perf_event_groups_iterate_cpu(&ctx->flexible_groups, cpu,
+					group_sched_out_callback, &params);
+		} else {
+			perf_event_groups_iterate(&ctx->flexible_groups,
+					group_sched_out_callback, &params);
+		}
 	}
 	perf_pmu_enable(ctx->pmu);
 }
@@ -3234,9 +3284,9 @@ void __perf_event_task_sched_out(struct task_struct *task,
  * Called with IRQs disabled
  */
 static void cpu_ctx_sched_out(struct perf_cpu_context *cpuctx,
-			      enum event_type_t event_type)
+			      enum event_type_t event_type, int mux)
 {
-	ctx_sched_out(&cpuctx->ctx, cpuctx, event_type);
+	ctx_sched_out(&cpuctx->ctx, cpuctx, event_type, mux);
 }
 
 static int
@@ -3305,13 +3355,14 @@ static void
 ctx_sched_in(struct perf_event_context *ctx,
 	     struct perf_cpu_context *cpuctx,
 	     enum event_type_t event_type,
-	     struct task_struct *task)
+	     struct task_struct *task, int mux)
 {
 	int is_active = ctx->is_active;
 	struct group_sched_params params = {
 			.cpuctx = cpuctx,
 			.ctx = ctx
 	};
+	int cpu = smp_processor_id();
 
 	lockdep_assert_held(&ctx->lock);
 
@@ -3339,31 +3390,55 @@ ctx_sched_in(struct perf_event_context *ctx,
 	 * First go through the list and put on any pinned groups
 	 * in order to give them the best chance of going on.
 	 */
-	if (is_active & EVENT_PINNED)
-		perf_event_groups_iterate(&ctx->pinned_groups,
-				ctx_pinned_sched_in, &params);
+	if (is_active & EVENT_PINNED) {
+		if (mux) {
+			perf_event_groups_iterate_cpu(&ctx->pinned_groups,
+					-1, ctx_pinned_sched_in,
+					&params);
+			perf_event_groups_iterate_cpu(&ctx->pinned_groups,
+					cpu, ctx_pinned_sched_in,
+					&params);
+		} else {
+			perf_event_groups_iterate(&ctx->pinned_groups,
+					ctx_pinned_sched_in,
+					&params);
+		}
+	}
 
 	/* Then walk through the lower prio flexible groups */
 	if (is_active & EVENT_FLEXIBLE) {
-		params.can_add_hw = 1;
-		perf_event_groups_iterate(&ctx->flexible_groups,
-				ctx_flexible_sched_in, &params);
+		if (mux) {
+			params.can_add_hw = 1;
+			perf_event_groups_iterate_cpu(&ctx->flexible_groups,
+					-1, ctx_flexible_sched_in,
+					&params);
+			params.can_add_hw = 1;
+			perf_event_groups_iterate_cpu(&ctx->flexible_groups,
+					cpu, ctx_flexible_sched_in,
+					&params);
+		} else {
+			params.can_add_hw = 1;
+			perf_event_groups_iterate(&ctx->flexible_groups,
+					ctx_flexible_sched_in,
+					&params);
+		}
 	}
 }
 
 static void cpu_ctx_sched_in(struct perf_cpu_context *cpuctx,
 			     enum event_type_t event_type,
-			     struct task_struct *task)
+			     struct task_struct *task, int mux)
 {
 	struct perf_event_context *ctx = &cpuctx->ctx;
 
-	ctx_sched_in(ctx, cpuctx, event_type, task);
+	ctx_sched_in(ctx, cpuctx, event_type, task, mux);
 }
 
 static void perf_event_context_sched_in(struct perf_event_context *ctx,
 					struct task_struct *task)
 {
 	struct perf_cpu_context *cpuctx;
+	int mux = 0;
 
 	cpuctx = __get_cpu_context(ctx);
 	if (cpuctx->task_ctx == ctx)
@@ -3626,6 +3701,7 @@ static int perf_rotate_context(struct perf_cpu_context *cpuctx)
 {
 	struct perf_event_context *ctx = NULL;
 	int rotate = 0;
+	int mux = 1;
 
 	if (cpuctx->ctx.nr_events) {
 		if (cpuctx->ctx.nr_events != cpuctx->ctx.nr_active)
@@ -3644,15 +3720,15 @@ static int perf_rotate_context(struct perf_cpu_context *cpuctx)
 	perf_ctx_lock(cpuctx, cpuctx->task_ctx);
 	perf_pmu_disable(cpuctx->ctx.pmu);
 
-	cpu_ctx_sched_out(cpuctx, EVENT_FLEXIBLE);
+	cpu_ctx_sched_out(cpuctx, EVENT_FLEXIBLE, mux);
 	if (ctx)
-		ctx_sched_out(ctx, cpuctx, EVENT_FLEXIBLE);
+		ctx_sched_out(ctx, cpuctx, EVENT_FLEXIBLE, mux);
 
 	rotate_ctx(&cpuctx->ctx);
 	if (ctx)
 		rotate_ctx(ctx);
 
-	perf_event_sched_in(cpuctx, ctx, current);
+	perf_event_sched_in(cpuctx, ctx, current, mux);
 
 	perf_pmu_enable(cpuctx->ctx.pmu);
 	perf_ctx_unlock(cpuctx, cpuctx->task_ctx);
@@ -3704,6 +3780,7 @@ static void perf_event_enable_on_exec(int ctxn)
 	struct perf_event *event;
 	unsigned long flags;
 	int enabled = 0;
+	int mux = 0;
 
 	local_irq_save(flags);
 	ctx = current->perf_event_ctxp[ctxn];
@@ -3712,7 +3789,7 @@ static void perf_event_enable_on_exec(int ctxn)
 
 	cpuctx = __get_cpu_context(ctx);
 	perf_ctx_lock(cpuctx, ctx);
-	ctx_sched_out(ctx, cpuctx, EVENT_TIME);
+	ctx_sched_out(ctx, cpuctx, EVENT_TIME, mux);
 	list_for_each_entry(event, &ctx->event_list, event_entry) {
 		enabled |= event_enable_on_exec(event, ctx);
 		event_type |= get_event_type(event);
@@ -3725,7 +3802,7 @@ static void perf_event_enable_on_exec(int ctxn)
 		clone_ctx = unclone_ctx(ctx);
 		ctx_resched(cpuctx, ctx, event_type);
 	} else {
-		ctx_sched_in(ctx, cpuctx, EVENT_TIME, current);
+		ctx_sched_in(ctx, cpuctx, EVENT_TIME, current, mux);
 	}
 	perf_ctx_unlock(cpuctx, ctx);
 

^ permalink raw reply related	[flat|nested] 76+ messages in thread

* Re: [PATCH v6 1/3] perf/core: use rb trees for pinned/flexible groups
  2017-08-02  8:13 ` [PATCH v6 1/3] perf/core: use rb trees for pinned/flexible groups Alexey Budankov
@ 2017-08-03 13:00   ` Peter Zijlstra
  2017-08-03 20:30     ` Alexey Budankov
  0 siblings, 1 reply; 76+ messages in thread
From: Peter Zijlstra @ 2017-08-03 13:00 UTC (permalink / raw)
  To: Alexey Budankov
  Cc: Ingo Molnar, Arnaldo Carvalho de Melo, Alexander Shishkin,
	Andi Kleen, Kan Liang, Dmitri Prokhorov, Valery Cherepennikov,
	Mark Rutland, Stephane Eranian, David Carrillo-Cisneros,
	linux-kernel

On Wed, Aug 02, 2017 at 11:13:54AM +0300, Alexey Budankov wrote:
> This patch moves event groups into rb tree sorted by CPU, so that 
> multiplexing hrtimer interrupt handler would be able skipping to the current 
> CPU's list and ignore groups allocated for the other CPUs.
> 
> New API for manipulating event groups in the trees is implemented as well 
> as adoption on the API in the current implementation.
> 
> Because perf_event_groups_iterate() API provides capability to execute 
> a callback for every event group in a tree, adoption of the API introduces
> some code that packs and unpacks arguments of functions existing in 
> the implementation as well as adjustments of their calling signatures
> e.g. ctx_pinned_sched_in(), ctx_flexible_sched_in() and inherit_task_group().

This does not speak of why we need group_list.

> Signed-off-by: Alexey Budankov <alexey.budankov@linux.intel.com>
> ---
>  include/linux/perf_event.h |  18 ++-
>  kernel/events/core.c       | 389 +++++++++++++++++++++++++++++++++------------
>  2 files changed, 306 insertions(+), 101 deletions(-)
> 
> diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
> index a3b873f..282f121 100644
> --- a/include/linux/perf_event.h
> +++ b/include/linux/perf_event.h
> @@ -572,6 +572,20 @@ struct perf_event {
>  	 */
>  	struct list_head		group_entry;
>  	struct list_head		sibling_list;
> +	/*
> +	 * Node on the pinned or flexible tree located at the event context;
> +	 * the node may be empty in case its event is not directly attached
> +	 * to the tree but to group_list list of the event directly
> +	 * attached to the tree;
> +	 */
> +	struct rb_node			group_node;
> +	/*
> +	 * List keeps groups allocated for the same cpu;
> +	 * the list may be empty in case its event is not directly
> +	 * attached to the tree but to group_list list of the event directly
> +	 * attached to the tree;
> +	 */
> +	struct list_head		group_list;
>  
>  	/*
>  	 * We need storage to track the entries in perf_pmu_migrate_context; we
> @@ -741,8 +755,8 @@ struct perf_event_context {
>  	struct mutex			mutex;
>  
>  	struct list_head		active_ctx_list;
> -	struct list_head		pinned_groups;
> -	struct list_head		flexible_groups;
> +	struct rb_root			pinned_groups;
> +	struct rb_root			flexible_groups;
>  	struct list_head		event_list;
>  	int				nr_events;
>  	int				nr_active;
> diff --git a/kernel/events/core.c b/kernel/events/core.c
> index 426c2ff..0a4f619 100644
> --- a/kernel/events/core.c
> +++ b/kernel/events/core.c
> @@ -1466,8 +1466,12 @@ static enum event_type_t get_event_type(struct perf_event *event)
>  	return event_type;
>  }
>  
> -static struct list_head *
> -ctx_group_list(struct perf_event *event, struct perf_event_context *ctx)
> +/*
> + * Extract pinned or flexible groups from the context
> + * based on event attrs bits;
> + */
> +static struct rb_root *
> +get_event_groups(struct perf_event *event, struct perf_event_context *ctx)
>  {
>  	if (event->attr.pinned)
>  		return &ctx->pinned_groups;
> @@ -1475,6 +1479,160 @@ ctx_group_list(struct perf_event *event, struct perf_event_context *ctx)
>  		return &ctx->flexible_groups;
>  }
>  
> +static void
> +perf_event_groups_insert(struct rb_root *groups,
> +		struct perf_event *event);
> +
> +static void
> +perf_event_groups_delete(struct rb_root *groups,
> +		struct perf_event *event);

Can't we do away with these fwd declarations by simple reordering of the
function definitions?

> +/*
> + * Helper function to insert event into the pinned or
> + * flexible groups;
> + */
> +static void
> +add_event_to_groups(struct perf_event *event, struct perf_event_context *ctx)
> +{
> +	struct rb_root *groups;
> +
> +	groups = get_event_groups(event, ctx);
> +	perf_event_groups_insert(groups, event);
> +}
> +
> +/*
> + * Helper function to delete event from its groups;
> + */
> +static void
> +del_event_from_groups(struct perf_event *event, struct perf_event_context *ctx)
> +{
> +	struct rb_root *groups;
> +
> +	groups = get_event_groups(event, ctx);
> +	perf_event_groups_delete(groups, event);
> +}
> +
> +/*
> + * Insert a group into a tree using event->cpu as a key. If event->cpu node
> + * is already attached to the tree then the event is added to the attached
> + * group's group_list list.
> + */
> +static void
> +perf_event_groups_insert(struct rb_root *groups,
> +		struct perf_event *event)
> +{
> +	struct rb_node **node;
> +	struct rb_node *parent;
> +	struct perf_event *node_event;
> +
> +	node = &groups->rb_node;
> +	parent = *node;
> +
> +	while (*node) {
> +		parent = *node;
> +		node_event = container_of(*node,
> +				struct perf_event, group_node);
> +
> +		if (event->cpu < node_event->cpu) {
> +			node = &parent->rb_left;
> +		} else if (event->cpu > node_event->cpu) {
> +			node = &parent->rb_right;

I would much prefer you use a comparator like:

static __always_inline int
perf_event_less(struct perf_event *left, struct perf_event *right)
{
	if (left->cpu < right->cpu)
		return 1;

	return 0;
}

That way we can add additional ordering. Specifically, ARM also wants
things ordered on PMU for their big.LITTLE stuff.
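
Something like (a sketch; comparing PMUs by pointer is only for
illustration):

static __always_inline int
perf_event_less(struct perf_event *left, struct perf_event *right)
{
	if (left->cpu < right->cpu)
		return 1;
	if (left->cpu > right->cpu)
		return 0;
	if (left->pmu < right->pmu)
		return 1;

	return 0;
}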

> +		} else {
> +			list_add_tail(&event->group_entry,
> +					&node_event->group_list);
> +			return;

Urgh, so this is what you want that list for... why not keep duplicates
in the tree itself and iterate that?

> +		}
> +	}
> +
> +	list_add_tail(&event->group_entry, &event->group_list);
> +
> +	rb_link_node(&event->group_node, parent, node);
> +	rb_insert_color(&event->group_node, groups);
> +}

> +/*
> + * Find group list by a cpu key and rotate it.
> + */
> +static void
> +perf_event_groups_rotate(struct rb_root *groups, int cpu)
> +{
> +	struct rb_node *node;
> +	struct perf_event *node_event;
> +
> +	node = groups->rb_node;
> +
> +	while (node) {
> +		node_event = container_of(node,
> +				struct perf_event, group_node);
> +
> +		if (cpu < node_event->cpu) {
> +			node = node->rb_left;
> +		} else if (cpu > node_event->cpu) {
> +			node = node->rb_right;
> +		} else {
> +			list_rotate_left(&node_event->group_list);
> +			break;
> +		}
> +	}
> +}

Ah, you worry about how to rotate inside a tree?

You can do that by adding (run)time based ordering, and you'll end up
with a runtime based scheduler.

A trivial variant keeps a simple counter per tree that is incremented
for each rotation. That should end up with the events ordered exactly
like the list. And if you have that comparator like above, expressing
that additional ordering becomes simple ;-)

Something like:

struct group {
  u64 vtime;
  rb_tree tree;
};

bool event_less(left, right)
{
  if (left->cpu < right->cpu)
    return true;

  if (left->cpu > right->cpu)
    return false;

  if (left->vtime < right->vtime)
    return true;

  return false;
}

insert_group(group, event, tail)
{
  if (tail)
    event->vtime = ++group->vtime;

  tree_insert(&group->tree, event);
}

Then every time you use insert_group(.tail=1) it goes to the end of that
CPU's 'list'.


The added benefit is that it then becomes fairly simple to improve upon
the RR scheduling, which suffers a bunch of boundary conditions where
the task runtimes mis-align with the rotation window.

> +typedef int(*perf_event_groups_iterate_f)(struct perf_event *, void *);

We already have perf_iterate_f, the only difference appears to be that
this has a return value. Surely these can be unified.

> +/*
> + * Iterate event groups and call provided callback for every group in the tree.
> + * Iteration stops if the callback returns non zero.
> + */
> +static int
> +perf_event_groups_iterate(struct rb_root *groups,
> +		perf_event_groups_iterate_f callback, void *data)
> +{
> +	int ret = 0;
> +	struct rb_node *node;
> +	struct perf_event *node_event, *event;

In general we prefer variable definitions to be ordered on line length,
longest first. So the exact opposite of what you have here.

> +
> +	for (node = rb_first(groups); node; node = rb_next(node)) {
> +		node_event = container_of(node,	struct perf_event, group_node);
> +		list_for_each_entry(event, &node_event->group_list,
> +				group_entry) {
> +			ret = callback(event, data);
> +			if (ret) {
> +				return ret;
> +			}
> +		}
> +	}
> +
> +	return ret;
> +}
> +
>  /*
>   * Add a event from the lists for its context.
>   * Must be called with ctx->mutex and ctx->lock held.

> @@ -1869,6 +2023,22 @@ group_sched_out(struct perf_event *group_event,
>  		cpuctx->exclusive = 0;
>  }
>  
> +struct group_sched_params {
> +	struct perf_cpu_context *cpuctx;
> +	struct perf_event_context *ctx;
> +	int can_add_hw;
> +};
> +
> +static int
> +group_sched_out_callback(struct perf_event *event, void *data)
> +{
> +	struct group_sched_params *params = data;
> +
> +	group_sched_out(event, params->cpuctx, params->ctx);
> +
> +	return 0;
> +}

Right, C sucks.. or possibly you've chosen the wrong pattern.

So the alternative is something like:

#define for_each_group_event(event, group, cpu, pmu, field)	\
	for (event = rb_entry_safe(group_first(group, cpu, pmu),\
				   typeof(*event), field);	\
	     event && event->cpu == cpu && event->pmu == pmu;	\
	     event = rb_entry_safe(rb_next(&event->field),	\
				   typeof(*event), field))


And then you can write things like:

	for_each_group_event(event, group, cpu, pmu, group_node)
		group_sched_out(event, cpuctx, ctx);
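
group_first() is left implicit above; one way to sketch it, assuming
the tree is ordered by {cpu, pmu} as suggested earlier, is a
leftmost-match search:

static struct perf_event *
group_first(struct rb_root *groups, int cpu, struct pmu *pmu)
{
	struct rb_node *node = groups->rb_node;
	struct perf_event *event, *match = NULL;

	while (node) {
		event = rb_entry(node, struct perf_event, group_node);

		if (cpu < event->cpu ||
		    (cpu == event->cpu && pmu < event->pmu)) {
			node = node->rb_left;
		} else if (cpu > event->cpu ||
			   (cpu == event->cpu && pmu > event->pmu)) {
			node = node->rb_right;
		} else {
			/* remember the match, keep looking left for the first one */
			match = event;
			node = node->rb_left;
		}
	}

	return match;
}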

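The group_first() helper the macro relies on isn't spelled out here; assuming
the {cpu, vtime} ordering sketched above (and ignoring the pmu part of the
ordering for brevity), a leftmost-match lookup could look roughly like:

static struct rb_node *
group_first(struct rb_root *groups, int cpu)
{
	struct rb_node *node = groups->rb_node, *match = NULL;
	struct perf_event *event;

	while (node) {
		event = container_of(node, struct perf_event, group_node);

		if (cpu < event->cpu) {
			node = node->rb_left;
		} else if (cpu > event->cpu) {
			node = node->rb_right;
		} else {
			/* subtree match, keep going left for the first entry */
			match = node;
			node = node->rb_left;
		}
	}

	return match;
}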

> +
>  #define DETACH_GROUP	0x01UL
>  
>  /*
> @@ -2712,7 +2882,10 @@ static void ctx_sched_out(struct perf_event_context *ctx,
>  			  enum event_type_t event_type)
>  {
>  	int is_active = ctx->is_active;
> -	struct perf_event *event;
> +	struct group_sched_params params = {
> +			.cpuctx = cpuctx,
> +			.ctx = ctx
> +	};
>  
>  	lockdep_assert_held(&ctx->lock);
>  
> @@ -2759,13 +2932,13 @@ static void ctx_sched_out(struct perf_event_context *ctx,
>  
>  	perf_pmu_disable(ctx->pmu);
>  	if (is_active & EVENT_PINNED) {
> -		list_for_each_entry(event, &ctx->pinned_groups, group_entry)
> -			group_sched_out(event, cpuctx, ctx);
> +		perf_event_groups_iterate(&ctx->pinned_groups,
> +				group_sched_out_callback, &params);

So here I would expect to not iterate events where event->cpu !=
smp_processor_id() (and ideally not where event->pmu != ctx->pmu).

>  	}
>  
>  	if (is_active & EVENT_FLEXIBLE) {
> -		list_for_each_entry(event, &ctx->flexible_groups, group_entry)
> -			group_sched_out(event, cpuctx, ctx);
> +		perf_event_groups_iterate(&ctx->flexible_groups,
> +				group_sched_out_callback, &params);

Idem.

>  	}
>  	perf_pmu_enable(ctx->pmu);
>  }


I think the rest of the patch is just plumbing to make the above useful.
Let me know if I missed something of value.

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH v6 2/3]: perf/core: use context tstamp_data for skipped events on mux interrupt
  2017-08-02  8:15 ` [PATCH v6 2/3]: perf/core: use context tstamp_data for skipped events on mux interrupt Alexey Budankov
@ 2017-08-03 13:04   ` Peter Zijlstra
  2017-08-03 14:00   ` Peter Zijlstra
  2017-08-03 15:00   ` Peter Zijlstra
  2 siblings, 0 replies; 76+ messages in thread
From: Peter Zijlstra @ 2017-08-03 13:04 UTC (permalink / raw)
  To: Alexey Budankov
  Cc: Ingo Molnar, Arnaldo Carvalho de Melo, Alexander Shishkin,
	Andi Kleen, Kan Liang, Dmitri Prokhorov, Valery Cherepennikov,
	Mark Rutland, Stephane Eranian, David Carrillo-Cisneros,
	linux-kernel

On Wed, Aug 02, 2017 at 11:15:39AM +0300, Alexey Budankov wrote:
> Event groups allocated for CPU's different from the one that handles multiplexing
> hrtimer interrupt may be skipped by interrupt handler however the events  
> tstamp_enabled, tstamp_running and tstamp_stopped fields still need to be updated 
> to have correct timings.
> 
> To implement that tstamp_data object is introduced at the event context
> and the skipped events' tstamps pointers are switched between self and context
> tstamp_data objects.
> 
> The context object timings are updated by update_context_time() on every 
> multiplexing hrtimer interrupt so all events referencing the context object get its 
> timings properly updated all at once.
> 
> Event groups tstamps are switched to the context object and back to self object 
> if they don't pass thru event_filter_match() on thread context switch in and out.

FWIW, changelog lines should be <=72 characters (like normal emails). All
sane editors can do this for you. Also, you have weird trailing
whitespace in your messages.

The above then ends up like:


Event groups allocated for CPU's different from the one that handles
multiplexing hrtimer interrupt may be skipped by interrupt handler
however the events  tstamp_enabled, tstamp_running and tstamp_stopped
fields still need to be updated to have correct timings.

To implement that tstamp_data object is introduced at the event context
and the skipped events' tstamps pointers are switched between self and
context tstamp_data objects.

The context object timings are updated by update_context_time() on every
multiplexing hrtimer interrupt so all events referencing the context
object get its timings properly updated all at once.

Event groups tstamps are switched to the context object and back to self
object if they don't pass thru event_filter_match() on thread context
switch in and out.

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH v6 2/3]: perf/core: use context tstamp_data for skipped events on mux interrupt
  2017-08-02  8:15 ` [PATCH v6 2/3]: perf/core: use context tstamp_data for skipped events on mux interrupt Alexey Budankov
  2017-08-03 13:04   ` Peter Zijlstra
@ 2017-08-03 14:00   ` Peter Zijlstra
  2017-08-03 15:58     ` Alexey Budankov
  2017-08-03 15:00   ` Peter Zijlstra
  2 siblings, 1 reply; 76+ messages in thread
From: Peter Zijlstra @ 2017-08-03 14:00 UTC (permalink / raw)
  To: Alexey Budankov
  Cc: Ingo Molnar, Arnaldo Carvalho de Melo, Alexander Shishkin,
	Andi Kleen, Kan Liang, Dmitri Prokhorov, Valery Cherepennikov,
	Mark Rutland, Stephane Eranian, David Carrillo-Cisneros,
	linux-kernel

On Wed, Aug 02, 2017 at 11:15:39AM +0300, Alexey Budankov wrote:
> +struct perf_event_tstamp {
> +	/*
> +	 * These are timestamps used for computing total_time_enabled
> +	 * and total_time_running when the event is in INACTIVE or
> +	 * ACTIVE state, measured in nanoseconds from an arbitrary point
> +	 * in time.
> +	 * enabled: the notional time when the event was enabled
> +	 * running: the notional time when the event was scheduled on
> +	 * stopped: in INACTIVE state, the notional time when the
> +	 *    event was scheduled off.
> +	 */
> +	u64 enabled;
> +	u64 running;
> +	u64 stopped;
> +};


So I have the below (untested) patch, also see:

  https://lkml.kernel.org/r/20170802171051.zlq5rgx3jqkkxpg7@hirez.programming.kicks-ass.net

And I don't think I fully agree with your description of running.
Despite its name, tstamp_running is not in fact a time stamp afaict. It's
more like an accumulator of running, but with an offset of stopped.

I'm always completely confused by the way this timekeeping is done.

---
Subject: perf: Fix time on IOC_ENABLE
From: Peter Zijlstra <peterz@infradead.org>
Date: Thu Aug 3 15:42:09 CEST 2017

Vince reported that when we do IOC_ENABLE/IOC_DISABLE while the task
is SIGSTOP'ed state the timestamps go wobbly.

It turns out we indeed fail to correctly account time while in 'OFF'
state and doing IOC_ENABLE without getting scheduled in exposes the
problem.

Further thinking about this problem, it occurred to me that we can
suffer a similar fate when we migrate an uncore event between CPUs.
The perf_event_install() on the 'new' CPU will do add_event_to_ctx()
which will reset all the time stamp, resulting in a subsequent
update_event_times() to overwrite the total_time_* fields with smaller
values.

Reported-by: Vince Weaver <vincent.weaver@maine.edu>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 kernel/events/core.c |   36 +++++++++++++++++++++++++++++++-----
 1 file changed, 31 insertions(+), 5 deletions(-)

--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -2217,6 +2217,33 @@ static int group_can_go_on(struct perf_e
 	return can_add_hw;
 }
 
+/*
+ * Complement to update_event_times(). This computes the tstamp_* values to
+ * continue 'enabled' state from @now. And effectively discards the time
+ * between the prior tstamp_stopped and now (as we were in the OFF state, or
+ * just switched (context) time base).
+ *
+ * This further assumes '@event->state == INACTIVE' (we just came from OFF) and
+ * cannot have been scheduled in yet. And going into INACTIVE state means
+ * '@event->tstamp_stopped = @now'.
+ *
+ * Thus given the rules of update_event_times():
+ *
+ *   total_time_enabled = tstamp_stopped - tstamp_enabled
+ *   total_time_running = tstamp_stopped - tstamp_running
+ *
+ * We can insert 'tstamp_stopped == now' and reverse them to compute new
+ * tstamp_* values.
+ */
+static void __perf_event_enable_time(struct perf_event *event, u64 now)
+{
+	WARN_ON_ONCE(event->state != PERF_EVENT_STATE_INACTIVE);
+
+	event->tstamp_stopped = now;
+	event->tstamp_enabled = now - event->total_time_enabled;
+	event->tstamp_running = now - event->total_time_running;
+}
+
 static void add_event_to_ctx(struct perf_event *event,
 			       struct perf_event_context *ctx)
 {
@@ -2224,9 +2251,7 @@ static void add_event_to_ctx(struct perf
 
 	list_add_event(event, ctx);
 	perf_group_attach(event);
-	event->tstamp_enabled = tstamp;
-	event->tstamp_running = tstamp;
-	event->tstamp_stopped = tstamp;
+	__perf_event_enable_time(event, tstamp);
 }
 
 static void ctx_sched_out(struct perf_event_context *ctx,
@@ -2471,10 +2496,11 @@ static void __perf_event_mark_enabled(st
 	u64 tstamp = perf_event_time(event);
 
 	event->state = PERF_EVENT_STATE_INACTIVE;
-	event->tstamp_enabled = tstamp - event->total_time_enabled;
+	__perf_event_enable_time(event, tstamp);
 	list_for_each_entry(sub, &event->sibling_list, group_entry) {
+		/* XXX should not be > INACTIVE if event isn't */
 		if (sub->state >= PERF_EVENT_STATE_INACTIVE)
-			sub->tstamp_enabled = tstamp - sub->total_time_enabled;
+			__perf_event_enable_time(sub, tstamp);
 	}
 }
 

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH v6 2/3]: perf/core: use context tstamp_data for skipped events on mux interrupt
  2017-08-02  8:15 ` [PATCH v6 2/3]: perf/core: use context tstamp_data for skipped events on mux interrupt Alexey Budankov
  2017-08-03 13:04   ` Peter Zijlstra
  2017-08-03 14:00   ` Peter Zijlstra
@ 2017-08-03 15:00   ` Peter Zijlstra
  2017-08-03 18:47     ` Alexey Budankov
  2017-08-10 15:57     ` Alexey Budankov
  2 siblings, 2 replies; 76+ messages in thread
From: Peter Zijlstra @ 2017-08-03 15:00 UTC (permalink / raw)
  To: Alexey Budankov
  Cc: Ingo Molnar, Arnaldo Carvalho de Melo, Alexander Shishkin,
	Andi Kleen, Kan Liang, Dmitri Prokhorov, Valery Cherepennikov,
	Mark Rutland, Stephane Eranian, David Carrillo-Cisneros,
	linux-kernel

On Wed, Aug 02, 2017 at 11:15:39AM +0300, Alexey Budankov wrote:
> @@ -772,6 +780,10 @@ struct perf_event_context {
>  	 */
>  	u64				time;
>  	u64				timestamp;
> +	/*
> +	 * Context cache for filtered out events;
> +	 */
> +	struct perf_event_tstamp	tstamp_data;
>  
>  	/*
>  	 * These fields let us detect when two contexts have both


> @@ -1379,6 +1379,9 @@ static void update_context_time(struct perf_event_context *ctx)
>  
>  	ctx->time += now - ctx->timestamp;
>  	ctx->timestamp = now;
> +
> +	ctx->tstamp_data.running += ctx->time - ctx->tstamp_data.stopped;
> +	ctx->tstamp_data.stopped = ctx->time;
>  }
>  
>  static u64 perf_event_time(struct perf_event *event)

It appears to me we have some redundancy here.


> @@ -1968,9 +1971,13 @@ event_sched_out(struct perf_event *event,
>  	 */
>  	if (event->state == PERF_EVENT_STATE_INACTIVE &&
>  	    !event_filter_match(event)) {
> +		delta = tstamp - event->tstamp->stopped;
> +		event->tstamp->running += delta;
> +		event->tstamp->stopped = tstamp;
> +		if (event->tstamp != &event->tstamp_data) {
> +			event->tstamp_data = *event->tstamp;

This,

> +			event->tstamp = &event->tstamp_data;
> +		}
>  	}
>  
>  	if (event->state != PERF_EVENT_STATE_ACTIVE)


> @@ -3239,8 +3246,11 @@ ctx_pinned_sched_in(struct perf_event *event, void *data)
>  
>  	if (event->state <= PERF_EVENT_STATE_OFF)
>  		return 0;
> -	if (!event_filter_match(event))
> +	if (!event_filter_match(event)) {
> +		if (event->tstamp != &params->ctx->tstamp_data)
> +			event->tstamp = &params->ctx->tstamp_data;

this and

>  		return 0;
> +	}
>  
>  	/* may need to reset tstamp_enabled */
>  	if (is_cgroup_event(event))
> @@ -3273,8 +3283,11 @@ ctx_flexible_sched_in(struct perf_event *event, void *data)
>  	 * Listen to the 'cpu' scheduling filter constraint
>  	 * of events:
>  	 */
> -	if (!event_filter_match(event))
> +	if (!event_filter_match(event)) {
> +		if (event->tstamp != &params->ctx->tstamp_data)
> +			event->tstamp = &params->ctx->tstamp_data;

this..

>  		return 0;
> +	}
>  
>  	/* may need to reset tstamp_enabled */
>  	if (is_cgroup_event(event))


These are the magic spots, right? And I'm not convinced it's right.

Suppose I have two events in my context, and I created them 1 minute
apart. Then their respective tstamp_enabled are 1 minute apart as well.
But the above doesn't seem to preserve that difference.

A similar argument can be made for running I think. That is a per event
value and cannot be passed along to the ctx and back.
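
To make that concrete with made-up numbers: if event A was enabled at t=0 and
event B at t=60s, then A->tstamp_enabled == 0 and B->tstamp_enabled == 60e9.
Once both are filtered out they share ctx->tstamp_data, and when they are
switched back each copies the very same context values, so the 60 second
offset between their tstamp_enabled values is gone.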

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH v6 2/3]: perf/core: use context tstamp_data for skipped events on mux interrupt
  2017-08-03 14:00   ` Peter Zijlstra
@ 2017-08-03 15:58     ` Alexey Budankov
  2017-08-04 12:36       ` Peter Zijlstra
  0 siblings, 1 reply; 76+ messages in thread
From: Alexey Budankov @ 2017-08-03 15:58 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Arnaldo Carvalho de Melo, Alexander Shishkin,
	Andi Kleen, Kan Liang, Dmitri Prokhorov, Valery Cherepennikov,
	Mark Rutland, Stephane Eranian, David Carrillo-Cisneros,
	linux-kernel

On 03.08.2017 17:00, Peter Zijlstra wrote:
> On Wed, Aug 02, 2017 at 11:15:39AM +0300, Alexey Budankov wrote:
>> +struct perf_event_tstamp {
>> +	/*
>> +	 * These are timestamps used for computing total_time_enabled
>> +	 * and total_time_running when the event is in INACTIVE or
>> +	 * ACTIVE state, measured in nanoseconds from an arbitrary point
>> +	 * in time.
>> +	 * enabled: the notional time when the event was enabled
>> +	 * running: the notional time when the event was scheduled on
>> +	 * stopped: in INACTIVE state, the notional time when the
>> +	 *    event was scheduled off.
>> +	 */
>> +	u64 enabled;
>> +	u64 running;
>> +	u64 stopped;
>> +};
> 
> 
> So I have the below (untested) patch, also see:
> 
>   https://lkml.kernel.org/r/20170802171051.zlq5rgx3jqkkxpg7@hirez.programming.kicks-ass.net
> 
> And I don't think I fully agree with your description of running.

I copied this comment from the previous place without any change.

> Despite its name, tstamp_running is not in fact a time stamp afaict. It's
> more like an accumulator of running, but with an offset of stopped.

I see tstamp_running as something that needs to be subtracted from the timestamp,
e.g. when update_context_time() is called, to get the event's correct total timings:

total_time_enabled = timestamp - enabled
total_time_running = timestamp - running

E.g. for the case of a single thread and a single event, running on a
dual-core machine for 10 ticks, spending half of the time on each core, we have:

For the first core event instance:

10 = total_time_enabled = timestamp[110] - enabled[100]
5  = total_time_running = timestamp[110] - running[100 + 1 + 1 + 1 + 1 + 1]

"+ 1" above for every tick the event instance doesn't get through perf_event_filter(),
in particular when an event instance is for a CPU different from the one that
schedules the instance.

So 5/10 = 0.5 - 50% of the time the event is running on the first core. The same holds for the second core.

When we sum up the instances' times we get the value reported to the user:

50%(first core) + 50%(second core) = 100% of event run time - no multiplexing case.

Without a thread migration we would have:

For the first core running thread:

10 = total_time_enabled = timestamp[110] - enabled[100]
10 = total_time_running = timestamp[110] - running[100]

10/10 = 1 - 100%

For the second core:

10 = total_time_enabled = timestamp[110] - enabled[100]
0  = total_time_running = timestamp[110] - running[100 + 1 + 1 + 1 + 1 + 1 + 1 + 1 + 1 + 1 + 1]

0/10 = 0 - 0%

100% + 0% == 100% of event run time

From this perspective the tstamp_running field indeed accumulates some time,
but is more like tstamp_eligible_to_run, so:

	total_time_running == elapsed - tstamp_eligible_to_run

> 
> I'm always completely confused by the way this timekeeping is done.
> 
> ---
> Subject: perf: Fix time on IOC_ENABLE
> From: Peter Zijlstra <peterz@infradead.org>
> Date: Thu Aug 3 15:42:09 CEST 2017
> 
> Vince reported that when we do IOC_ENABLE/IOC_DISABLE while the task
> is SIGSTOP'ed state the timestamps go wobbly.
> 
> It turns out we indeed fail to correctly account time while in 'OFF'
> state and doing IOC_ENABLE without getting scheduled in exposes the
> problem.
> 
> Further thinking about this problem, it occurred to me that we can
> suffer a similar fate when we migrate an uncore event between CPUs.
> The perf_event_install() on the 'new' CPU will do add_event_to_ctx()
> which will reset all the time stamp, resulting in a subsequent
> update_event_times() to overwrite the total_time_* fields with smaller
> values.
> 
> Reported-by: Vince Weaver <vincent.weaver@maine.edu>
> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> ---
>  kernel/events/core.c |   36 +++++++++++++++++++++++++++++++-----
>  1 file changed, 31 insertions(+), 5 deletions(-)
> 
> --- a/kernel/events/core.c
> +++ b/kernel/events/core.c
> @@ -2217,6 +2217,33 @@ static int group_can_go_on(struct perf_e
>  	return can_add_hw;
>  }
>  
> +/*
> + * Complement to update_event_times(). This computes the tstamp_* values to
> + * continue 'enabled' state from @now. And effectively discards the time
> + * between the prior tstamp_stopped and now (as we were in the OFF state, or
> + * just switched (context) time base).
> + *
> + * This further assumes '@event->state == INACTIVE' (we just came from OFF) and
> + * cannot have been scheduled in yet. And going into INACTIVE state means
> + * '@event->tstamp_stopped = @now'.
> + *
> + * Thus given the rules of update_event_times():
> + *
> + *   total_time_enabled = tstamp_stopped - tstamp_enabled
> + *   total_time_running = tstamp_stopped - tstamp_running
> + *
> + * We can insert 'tstamp_stopped == now' and reverse them to compute new
> + * tstamp_* values.
> + */
> +static void __perf_event_enable_time(struct perf_event *event, u64 now)
> +{
> +	WARN_ON_ONCE(event->state != PERF_EVENT_STATE_INACTIVE);
> +
> +	event->tstamp_stopped = now;
> +	event->tstamp_enabled = now - event->total_time_enabled;
> +	event->tstamp_running = now - event->total_time_running;
> +}
> +
>  static void add_event_to_ctx(struct perf_event *event,
>  			       struct perf_event_context *ctx)
>  {
> @@ -2224,9 +2251,7 @@ static void add_event_to_ctx(struct perf
>  
>  	list_add_event(event, ctx);
>  	perf_group_attach(event);
> -	event->tstamp_enabled = tstamp;
> -	event->tstamp_running = tstamp;
> -	event->tstamp_stopped = tstamp;
> +	__perf_event_enable_time(event, tstamp);
>  }
>  
>  static void ctx_sched_out(struct perf_event_context *ctx,
> @@ -2471,10 +2496,11 @@ static void __perf_event_mark_enabled(st
>  	u64 tstamp = perf_event_time(event);
>  
>  	event->state = PERF_EVENT_STATE_INACTIVE;
> -	event->tstamp_enabled = tstamp - event->total_time_enabled;
> +	__perf_event_enable_time(event, tstamp);
>  	list_for_each_entry(sub, &event->sibling_list, group_entry) {
> +		/* XXX should not be > INACTIVE if event isn't */
>  		if (sub->state >= PERF_EVENT_STATE_INACTIVE)
> -			sub->tstamp_enabled = tstamp - sub->total_time_enabled;
> +			__perf_event_enable_time(sub, tstamp);
>  	}
>  }
>  
> 

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH v6 2/3]: perf/core: use context tstamp_data for skipped events on mux interrupt
  2017-08-03 15:00   ` Peter Zijlstra
@ 2017-08-03 18:47     ` Alexey Budankov
  2017-08-04 12:35       ` Peter Zijlstra
  2017-08-10 15:57     ` Alexey Budankov
  1 sibling, 1 reply; 76+ messages in thread
From: Alexey Budankov @ 2017-08-03 18:47 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Arnaldo Carvalho de Melo, Alexander Shishkin,
	Andi Kleen, Kan Liang, Dmitri Prokhorov, Valery Cherepennikov,
	Mark Rutland, Stephane Eranian, David Carrillo-Cisneros,
	linux-kernel

On 03.08.2017 18:00, Peter Zijlstra wrote:
> On Wed, Aug 02, 2017 at 11:15:39AM +0300, Alexey Budankov wrote:
>> @@ -772,6 +780,10 @@ struct perf_event_context {
>>  	 */
>>  	u64				time;
>>  	u64				timestamp;
>> +	/*
>> +	 * Context cache for filtered out events;
>> +	 */
>> +	struct perf_event_tstamp	tstamp_data;
>>  
>>  	/*
>>  	 * These fields let us detect when two contexts have both
> 
> 
>> @@ -1379,6 +1379,9 @@ static void update_context_time(struct perf_event_context *ctx)
>>  
>>  	ctx->time += now - ctx->timestamp;
>>  	ctx->timestamp = now;
>> +
>> +	ctx->tstamp_data.running += ctx->time - ctx->tstamp_data.stopped;
>> +	ctx->tstamp_data.stopped = ctx->time;
>>  }
>>  
>>  static u64 perf_event_time(struct perf_event *event)
> 
> It appears to me we have some redundancy here.
> 
> 
>> @@ -1968,9 +1971,13 @@ event_sched_out(struct perf_event *event,
>>  	 */
>>  	if (event->state == PERF_EVENT_STATE_INACTIVE &&
>>  	    !event_filter_match(event)) {
>> +		delta = tstamp - event->tstamp->stopped;
>> +		event->tstamp->running += delta;
>> +		event->tstamp->stopped = tstamp;
>> +		if (event->tstamp != &event->tstamp_data) {
>> +			event->tstamp_data = *event->tstamp;
> 
> This,
> 
>> +			event->tstamp = &event->tstamp_data;
>> +		}
>>  	}
>>  
>>  	if (event->state != PERF_EVENT_STATE_ACTIVE)
> 
> 
>> @@ -3239,8 +3246,11 @@ ctx_pinned_sched_in(struct perf_event *event, void *data)
>>  
>>  	if (event->state <= PERF_EVENT_STATE_OFF)
>>  		return 0;
>> -	if (!event_filter_match(event))
>> +	if (!event_filter_match(event)) {
>> +		if (event->tstamp != &params->ctx->tstamp_data)
>> +			event->tstamp = &params->ctx->tstamp_data;
> 
> this and
> 
>>  		return 0;
>> +	}
>>  
>>  	/* may need to reset tstamp_enabled */
>>  	if (is_cgroup_event(event))
>> @@ -3273,8 +3283,11 @@ ctx_flexible_sched_in(struct perf_event *event, void *data)
>>  	 * Listen to the 'cpu' scheduling filter constraint
>>  	 * of events:
>>  	 */
>> -	if (!event_filter_match(event))
>> +	if (!event_filter_match(event)) {
>> +		if (event->tstamp != &params->ctx->tstamp_data)
>> +			event->tstamp = &params->ctx->tstamp_data;
> 
> this..
> 
>>  		return 0;
>> +	}
>>  
>>  	/* may need to reset tstamp_enabled */
>>  	if (is_cgroup_event(event))
> 
> 
> These are the magic spots, right? And I'm not convinced it's right.
> 
> Suppose I have two events in my context, and I created them 1 minute
> apart. Then their respective tstamp_enabled are 1 minute apart as well.
> But the above doesn't seem to preserve that difference.
> 
> A similar argument can be made for running I think. That is a per event
> value and cannot be passed along to the ctx and back.

Aww, I see your point and it challenges my initial assumptions. 
Let me think thru the case more. There must be some solution. Thanks!

> 
> 
>

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH v6 1/3] perf/core: use rb trees for pinned/flexible groups
  2017-08-03 13:00   ` Peter Zijlstra
@ 2017-08-03 20:30     ` Alexey Budankov
  2017-08-04 14:36       ` Peter Zijlstra
  2017-08-04 14:53       ` Peter Zijlstra
  0 siblings, 2 replies; 76+ messages in thread
From: Alexey Budankov @ 2017-08-03 20:30 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Arnaldo Carvalho de Melo, Alexander Shishkin,
	Andi Kleen, Kan Liang, Dmitri Prokhorov, Valery Cherepennikov,
	Mark Rutland, Stephane Eranian, David Carrillo-Cisneros,
	linux-kernel

On 03.08.2017 16:00, Peter Zijlstra wrote:
> On Wed, Aug 02, 2017 at 11:13:54AM +0300, Alexey Budankov wrote:
>> This patch moves event groups into rb tree sorted by CPU, so that 
>> multiplexing hrtimer interrupt handler would be able skipping to the current 
>> CPU's list and ignore groups allocated for the other CPUs.
>>
>> New API for manipulating event groups in the trees is implemented as well 
>> as adoption on the API in the current implementation.
>>
>> Because perf_event_groups_iterate() API provides capability to execute 
>> a callback for every event group in a tree, adoption of the API introduces
>> some code that packs and unpacks arguments of functions existing in 
>> the implementation as well as adjustments of their calling signatures
>> e.g. ctx_pinned_sched_in(), ctx_flexible_sched_in() and inherit_task_group().
> 
> This does not speak of why we need group_list.
> 
>> Signed-off-by: Alexey Budankov <alexey.budankov@linux.intel.com>
>> ---
>>  include/linux/perf_event.h |  18 ++-
>>  kernel/events/core.c       | 389 +++++++++++++++++++++++++++++++++------------
>>  2 files changed, 306 insertions(+), 101 deletions(-)
>>
>> diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
>> index a3b873f..282f121 100644
>> --- a/include/linux/perf_event.h
>> +++ b/include/linux/perf_event.h
>> @@ -572,6 +572,20 @@ struct perf_event {
>>  	 */
>>  	struct list_head		group_entry;
>>  	struct list_head		sibling_list;
>> +	/*
>> +	 * Node on the pinned or flexible tree located at the event context;
>> +	 * the node may be empty in case its event is not directly attached
>> +	 * to the tree but to group_list list of the event directly
>> +	 * attached to the tree;
>> +	 */
>> +	struct rb_node			group_node;
>> +	/*
>> +	 * List keeps groups allocated for the same cpu;
>> +	 * the list may be empty in case its event is not directly
>> +	 * attached to the tree but to group_list list of the event directly
>> +	 * attached to the tree;
>> +	 */
>> +	struct list_head		group_list;
>>  
>>  	/*
>>  	 * We need storage to track the entries in perf_pmu_migrate_context; we
>> @@ -741,8 +755,8 @@ struct perf_event_context {
>>  	struct mutex			mutex;
>>  
>>  	struct list_head		active_ctx_list;
>> -	struct list_head		pinned_groups;
>> -	struct list_head		flexible_groups;
>> +	struct rb_root			pinned_groups;
>> +	struct rb_root			flexible_groups;
>>  	struct list_head		event_list;
>>  	int				nr_events;
>>  	int				nr_active;
>> diff --git a/kernel/events/core.c b/kernel/events/core.c
>> index 426c2ff..0a4f619 100644
>> --- a/kernel/events/core.c
>> +++ b/kernel/events/core.c
>> @@ -1466,8 +1466,12 @@ static enum event_type_t get_event_type(struct perf_event *event)
>>  	return event_type;
>>  }
>>  
>> -static struct list_head *
>> -ctx_group_list(struct perf_event *event, struct perf_event_context *ctx)
>> +/*
>> + * Extract pinned or flexible groups from the context
>> + * based on event attrs bits;
>> + */
>> +static struct rb_root *
>> +get_event_groups(struct perf_event *event, struct perf_event_context *ctx)
>>  {
>>  	if (event->attr.pinned)
>>  		return &ctx->pinned_groups;
>> @@ -1475,6 +1479,160 @@ ctx_group_list(struct perf_event *event, struct perf_event_context *ctx)
>>  		return &ctx->flexible_groups;
>>  }
>>  
>> +static void
>> +perf_event_groups_insert(struct rb_root *groups,
>> +		struct perf_event *event);
>> +
>> +static void
>> +perf_event_groups_delete(struct rb_root *groups,
>> +		struct perf_event *event);
> 
> Can't we do away with these fwd declarations by simple reordering of the
> function definitions?

Ok. I will clean this up.

> 
>> +/*
>> + * Helper function to insert event into the pinned or
>> + * flexible groups;
>> + */
>> +static void
>> +add_event_to_groups(struct perf_event *event, struct perf_event_context *ctx)
>> +{
>> +	struct rb_root *groups;
>> +
>> +	groups = get_event_groups(event, ctx);
>> +	perf_event_groups_insert(groups, event);
>> +}
>> +
>> +/*
>> + * Helper function to delete event from its groups;
>> + */
>> +static void
>> +del_event_from_groups(struct perf_event *event, struct perf_event_context *ctx)
>> +{
>> +	struct rb_root *groups;
>> +
>> +	groups = get_event_groups(event, ctx);
>> +	perf_event_groups_delete(groups, event);
>> +}
>> +
>> +/*
>> + * Insert a group into a tree using event->cpu as a key. If event->cpu node
>> + * is already attached to the tree then the event is added to the attached
>> + * group's group_list list.
>> + */
>> +static void
>> +perf_event_groups_insert(struct rb_root *groups,
>> +		struct perf_event *event)
>> +{
>> +	struct rb_node **node;
>> +	struct rb_node *parent;
>> +	struct perf_event *node_event;
>> +
>> +	node = &groups->rb_node;
>> +	parent = *node;
>> +
>> +	while (*node) {
>> +		parent = *node;
>> +		node_event = container_of(*node,
>> +				struct perf_event, group_node);
>> +
>> +		if (event->cpu < node_event->cpu) {
>> +			node = &parent->rb_left;
>> +		} else if (event->cpu > node_event->cpu) {
>> +			node = &parent->rb_right;
> 
> I would much prefer you use a comparator like:

Ok. I Will do.

> 
> static always_inline int
> perf_event_less(struct perf_event *left, struct perf_event *right)
> {
> 	if (left->cpu < right->cpu)
> 		return 1;
> 
> 	return 0;
> }
> 
> That way we can add additional order. Specifically, ARM also wants things
> ordered on PMU for their big.LITTLE stuff.
> 
>> +		} else {
>> +			list_add_tail(&event->group_entry,
>> +					&node_event->group_list);
>> +			return;
> 
> Urgh, so this is what you want that list for... why not keep duplicates
> in the tree itself and iterate that?
> 
>> +		}
>> +	}
>> +
>> +	list_add_tail(&event->group_entry, &event->group_list);
>> +
>> +	rb_link_node(&event->group_node, parent, node);
>> +	rb_insert_color(&event->group_node, groups);
>> +}
> 
>> +/*
>> + * Find group list by a cpu key and rotate it.
>> + */
>> +static void
>> +perf_event_groups_rotate(struct rb_root *groups, int cpu)
>> +{
>> +	struct rb_node *node;
>> +	struct perf_event *node_event;
>> +
>> +	node = groups->rb_node;
>> +
>> +	while (node) {
>> +		node_event = container_of(node,
>> +				struct perf_event, group_node);
>> +
>> +		if (cpu < node_event->cpu) {
>> +			node = node->rb_left;
>> +		} else if (cpu > node_event->cpu) {
>> +			node = node->rb_right;
>> +		} else {
>> +			list_rotate_left(&node_event->group_list);
>> +			break;
>> +		}
>> +	}
>> +}
> 
> Ah, you worry about how to rotate inside a tree?

Exactly.

> 
> You can do that by adding (run)time based ordering, and you'll end up
> with a runtime based scheduler.

Do you mean replacing a CPU indexed rb_tree of lists with 
a CPU indexed rb_tree of counter indexed rb_trees?

> 
> A trivial variant keeps a simple counter per tree that is incremented
> for each rotation. That should end up with the events ordered exactly
> like the list. And if you have that comparator like above, expressing
> that additional ordering becomes simple ;-)
> 
> Something like:
> 
> struct group {
>   u64 vtime;
>   rb_tree tree;
> };
> 
> bool event_less(left, right)
> {
>   if (left->cpu < right->cpu)
>     return true;
> 
>   if (left->cpu > right->cpu)
>     return false;
> 
>   if (left->vtime < right->vtime)
>     return true;
> 
>   return false;
> }
> 
> insert_group(group, event, tail)
> {
>   if (tail)
>     event->vtime = ++group->vtime;
> 
>   tree_insert(&group->root, event);
> }
> 
> Then every time you use insert_group(.tail=1) it goes to the end of that
> CPU's 'list'.
> 

Could you elaborate more on how to implement rotation?

Do you mean rotation of the rb_tree, so that the iteration order
of the counter indexed rb_tree would coincide with the iteration
order of a list after rotation?

And then I would still need a struct rb_root group_tree in the perf_event structure:

+ /*
+  * Node on the pinned or flexible tree located at the event context;
+  * the node may be empty in case its event is not directly attached
+  * to the tree but to group_list list of the event directly
+  * attached to the tree;
+  */
+ struct rb_node			group_node;
+ /*
+  * Tree keeps groups allocated for the same cpu;
+  * the tree may be empty in case its event is not directly
+  * attached to the context tree but to the group_tree of the event
+  * directly attached to the context tree;
+  */
+ struct rb_root			group_tree;

> 
> The added benefit is that it then becomes fairly simple to improve upon
> the RR scheduling, which suffers a bunch of boundary conditions where
> the task runtimes mis-align with the rotation window.
> 
>> +typedef int(*perf_event_groups_iterate_f)(struct perf_event *, void *);
> 
> We already have perf_iterate_f, the only difference appears to be that
> this has a return value. Surely these can be unified.

Ok. I will unify at some point.

> 
>> +/*
>> + * Iterate event groups and call provided callback for every group in the tree.
>> + * Iteration stops if the callback returns non zero.
>> + */
>> +static int
>> +perf_event_groups_iterate(struct rb_root *groups,
>> +		perf_event_groups_iterate_f callback, void *data)
>> +{
>> +	int ret = 0;
>> +	struct rb_node *node;
>> +	struct perf_event *node_event, *event;
> 
> In general we prefer variable definitions to be ordered on line length,
> longest first. So the exact opposite of what you have here.

Accepted.

> 
>> +
>> +	for (node = rb_first(groups); node; node = rb_next(node)) {
>> +		node_event = container_of(node,	struct perf_event, group_node);
>> +		list_for_each_entry(event, &node_event->group_list,
>> +				group_entry) {
>> +			ret = callback(event, data);
>> +			if (ret) {
>> +				return ret;
>> +			}
>> +		}
>> +	}
>> +
>> +	return ret;
>> +}
>> +
>>  /*
>>   * Add a event from the lists for its context.
>>   * Must be called with ctx->mutex and ctx->lock held.
> 
>> @@ -1869,6 +2023,22 @@ group_sched_out(struct perf_event *group_event,
>>  		cpuctx->exclusive = 0;
>>  }
>>  
>> +struct group_sched_params {
>> +	struct perf_cpu_context *cpuctx;
>> +	struct perf_event_context *ctx;
>> +	int can_add_hw;
>> +};
>> +
>> +static int
>> +group_sched_out_callback(struct perf_event *event, void *data)
>> +{
>> +	struct group_sched_params *params = data;
>> +
>> +	group_sched_out(event, params->cpuctx, params->ctx);
>> +
>> +	return 0;
>> +}
> 
> Right, C sucks.. or possibly you've chosen the wrong pattern.

A consequence of integrating the new API with a callback in its signature :)

> 
> So the alternative is something like:
> 
> #define for_each_group_event(event, group, cpu, pmu, field)	\
> 	for (event = rb_entry_safe(group_first(group, cpu, pmu),\
> 				   typeof(*event), field);	\
> 	     event && event->cpu == cpu && event->pmu == pmu;	\
> 	     event = rb_entry_safe(rb_next(&event->field),	\
> 				   typeof(*event), field))
> 
> 
> And then you can write things like:
> 
> 	for_each_group_event(event, group, cpu, pmu, group_node)
> 		group_sched_out(event, cpuctx, ctx);
> 
> 
>> +
>>  #define DETACH_GROUP	0x01UL
>>  
>>  /*
>> @@ -2712,7 +2882,10 @@ static void ctx_sched_out(struct perf_event_context *ctx,
>>  			  enum event_type_t event_type)
>>  {
>>  	int is_active = ctx->is_active;
>> -	struct perf_event *event;
>> +	struct group_sched_params params = {
>> +			.cpuctx = cpuctx,
>> +			.ctx = ctx
>> +	};
>>  
>>  	lockdep_assert_held(&ctx->lock);
>>  
>> @@ -2759,13 +2932,13 @@ static void ctx_sched_out(struct perf_event_context *ctx,
>>  
>>  	perf_pmu_disable(ctx->pmu);
>>  	if (is_active & EVENT_PINNED) {
>> -		list_for_each_entry(event, &ctx->pinned_groups, group_entry)
>> -			group_sched_out(event, cpuctx, ctx);
>> +		perf_event_groups_iterate(&ctx->pinned_groups,
>> +				group_sched_out_callback, &params);
> 
> So here I would expect to not iterate events where event->cpu !=
> smp_processor_id() (and ideally not where event->pmu != ctx->pmu).
>

We still need to iterate through all groups on thread context switch in
and out, and from the multiplexing timer interrupt handler we need the
cpu == -1 list (software events) in addition to the smp_processor_id() list.
 
>>  	}
>>  
>>  	if (is_active & EVENT_FLEXIBLE) {
>> -		list_for_each_entry(event, &ctx->flexible_groups, group_entry)
>> -			group_sched_out(event, cpuctx, ctx);
>> +		perf_event_groups_iterate(&ctx->flexible_groups,
>> +				group_sched_out_callback, &params);
> 
> Idem.
> 
>>  	}
>>  	perf_pmu_enable(ctx->pmu);
>>  }
> 
> 
> I think the rest of the patch is just plumbing to make the above useful.
> Let me know if I missed something of value.
> 

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH v6 2/3]: perf/core: use context tstamp_data for skipped events on mux interrupt
  2017-08-03 18:47     ` Alexey Budankov
@ 2017-08-04 12:35       ` Peter Zijlstra
  2017-08-04 12:51         ` Peter Zijlstra
  2017-08-04 14:23         ` Alexey Budankov
  0 siblings, 2 replies; 76+ messages in thread
From: Peter Zijlstra @ 2017-08-04 12:35 UTC (permalink / raw)
  To: Alexey Budankov
  Cc: Ingo Molnar, Arnaldo Carvalho de Melo, Alexander Shishkin,
	Andi Kleen, Kan Liang, Dmitri Prokhorov, Valery Cherepennikov,
	Mark Rutland, Stephane Eranian, David Carrillo-Cisneros,
	linux-kernel

On Thu, Aug 03, 2017 at 09:47:56PM +0300, Alexey Budankov wrote:
> On 03.08.2017 18:00, Peter Zijlstra wrote:

> > These are the magic spots, right? And I'm not convinced it's right.
> > 
> > Suppose I have two events in my context, and I created them 1 minute
> > apart. Then their respective tstamp_enabled are 1 minute apart as well.
> > But the above doesn't seem to preserve that difference.
> > 
> > A similar argument can be made for running I think. That is a per event
> > value and cannot be passed along to the ctx and back.
> 
> Aww, I see your point and it challenges my initial assumptions. 
> Let me think thru the case more. There must be some solution. Thanks!

So the sensible thing is probably to rewrite the entire time tracking to
make more sense. OTOH that's also the riskiest.

Something like:

__update_state_and_time(event, new_state)
{
	u64 delta, now = perf_event_time(event);
	int old_state = event->state;

	event->tstamp = now;
	event->state  = new_state;

	delta = now - event->tstamp;
	switch (new_state) {
	case STATE_ACTIVE:
		WARN_ON_ONCE(old_state != STATE_INACTIVE);
		event->total_time_enabled += delta;
		break;

	case STATE_INACTIVE:
		switch (old_state) {
		case STATE_OFF:
			/* ignore the OFF -> INACTIVE period */
			break;

		case STATE_ACTIVE:
			event->total_time_enabled += delta;
			event->total_time_running += delta;
			break;

		default:
			WARN_ONCE();
		}
		break;

	case STATE_OFF:
		WARN_ON_ONCE(old_state != STATE_INACTIVE)
		event->total_time_enabled += delta;
		break;
	}
}

__read_current_times(event, u64 *enabled, u64 *running)
{
	u64 delta, now = perf_event_time(event);

	delta = now - event->tstamp;

	*enabled = event->total_time_enabled;
	if (event->state >= STATE_INACTIVE)
		*enabled += delta;
	*running = event->total_time_running
	if (event->state == STATE_ACTIVE)
		*running += delta;
}

perhaps? That instantly solves the problem I think, because now we don't
need to update inactive events. But maybe I missed some, could you
verify?

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH v6 2/3]: perf/core: use context tstamp_data for skipped events on mux interrupt
  2017-08-03 15:58     ` Alexey Budankov
@ 2017-08-04 12:36       ` Peter Zijlstra
  0 siblings, 0 replies; 76+ messages in thread
From: Peter Zijlstra @ 2017-08-04 12:36 UTC (permalink / raw)
  To: Alexey Budankov
  Cc: Ingo Molnar, Arnaldo Carvalho de Melo, Alexander Shishkin,
	Andi Kleen, Kan Liang, Dmitri Prokhorov, Valery Cherepennikov,
	Mark Rutland, Stephane Eranian, David Carrillo-Cisneros,
	linux-kernel

On Thu, Aug 03, 2017 at 06:58:41PM +0300, Alexey Budankov wrote:
> On 03.08.2017 17:00, Peter Zijlstra wrote:

> > And I don't think I fully agree with your description of running.
> 
> I copied this comment from the previous place without any change.

Ah, my bad, I often look at patches with all - lines stripped out.

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH v6 2/3]: perf/core: use context tstamp_data for skipped events on mux interrupt
  2017-08-04 12:35       ` Peter Zijlstra
@ 2017-08-04 12:51         ` Peter Zijlstra
  2017-08-04 14:25           ` Alexey Budankov
  2017-08-04 14:23         ` Alexey Budankov
  1 sibling, 1 reply; 76+ messages in thread
From: Peter Zijlstra @ 2017-08-04 12:51 UTC (permalink / raw)
  To: Alexey Budankov
  Cc: Ingo Molnar, Arnaldo Carvalho de Melo, Alexander Shishkin,
	Andi Kleen, Kan Liang, Dmitri Prokhorov, Valery Cherepennikov,
	Mark Rutland, Stephane Eranian, David Carrillo-Cisneros,
	linux-kernel

On Fri, Aug 04, 2017 at 02:35:34PM +0200, Peter Zijlstra wrote:
> Something like:
> 
> __update_state_and_time(event, new_state)
> {
> 	u64 delta, now = perf_event_time(event);
> 	int old_state = event->state;
> 
> 	event->tstamp = now;
> 	event->state  = new_state;
> 
> 	delta = now - event->tstamp;
Obv should go above the tstamp assignment

> 	switch (new_state) {
> 	case STATE_ACTIVE:
> 		WARN_ON_ONCE(old_state != STATE_INACTIVE);
> 		event->total_time_enabled += delta;
> 		break;
> 
> 	case STATE_INACTIVE:
> 		switch (old_state) {
> 		case STATE_OFF:
> 			/* ignore the OFF -> INACTIVE period */
> 			break;
> 
> 		case STATE_ACTIVE:
> 			event->total_time_enabled += delta;
> 			event->total_time_running += delta;
> 			break;
> 
> 		default:
> 			WARN_ONCE();
> 		}
> 		break;
> 
> 	case STATE_OFF:
> 		WARN_ON_ONCE(old_state != STATE_INACTIVE)
> 		event->total_time_enabled += delta;
> 		break;
> 	}
> }

So that's a straight fwd state machine that deals with:

  OFF <-> INACTIVE <-> ACTIVE

but I think something like:

__update_state_and_time(event, new_state)
{
	u64 delta, now = perf_event_time(event);
	int old_state = event->state;

	delta = now - event->tstamp;
	event->tstamp = now;
	event->state  = new_state;

	if (old_state == STATE_OFF)
		return;

	event->total_time_enabled += delta;

	if (old_state == STATE_ACTIVE)
		event->total_time_running += delta;
}

is equivalent and generates smaller code.. but again, double check (also
it doesn't validate the state transitions).
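
As a quick sanity check of this simplified version (made-up timestamps, using
the __read_current_times() helper from the previous mail), assume
perf_event_time() returns 0, 2 and 5 at the successive transitions and 9 at
read time:

	u64 enabled, running;

	__update_state_and_time(event, STATE_INACTIVE); /* t=0: from OFF, nothing accumulated */
	__update_state_and_time(event, STATE_ACTIVE);   /* t=2: total_time_enabled += 2 */
	__update_state_and_time(event, STATE_INACTIVE); /* t=5: enabled += 3, running += 3 */

	/* read at t=9, still INACTIVE: enabled = 5 + 4 = 9, running = 3 */
	__read_current_times(event, &enabled, &running);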

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH v6 2/3]: perf/core: use context tstamp_data for skipped events on mux interrupt
  2017-08-04 12:35       ` Peter Zijlstra
  2017-08-04 12:51         ` Peter Zijlstra
@ 2017-08-04 14:23         ` Alexey Budankov
  1 sibling, 0 replies; 76+ messages in thread
From: Alexey Budankov @ 2017-08-04 14:23 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Arnaldo Carvalho de Melo, Alexander Shishkin,
	Andi Kleen, Kan Liang, Dmitri Prokhorov, Valery Cherepennikov,
	Mark Rutland, Stephane Eranian, David Carrillo-Cisneros,
	linux-kernel

On 04.08.2017 15:35, Peter Zijlstra wrote:
> On Thu, Aug 03, 2017 at 09:47:56PM +0300, Alexey Budankov wrote:
>> On 03.08.2017 18:00, Peter Zijlstra wrote:
> 
>>> These are the magic spots, right? And I'm not convinced it's right.
>>>
>>> Suppose I have two events in my context, and I created them 1 minute
>>> apart. Then their respective tstamp_enabled are 1 minute apart as well.
>>> But the above doesn't seem to preserve that difference.
>>>
>>> A similar argument can be made for running I think. That is a per event
>>> value and cannot be passed along to the ctx and back.
>>
>> Aww, I see your point and it challenges my initial assumptions. 
>> Let me think thru the case more. There must be some solution. Thanks!
> 
> So the sensible thing is probably to rewrite the entire time tracking to
> make more sense. OTOH that's also the riskiest.
> 
> Something like:
> 
> __update_state_and_time(event, new_state)
> {
> 	u64 delta, now = perf_event_time(event);
> 	int old_state = event->state;
> 
> 	event->tstamp = now;
> 	event->state  = new_state;
> 
> 	delta = now - event->tstamp;
> 	switch (new_state) {
> 	case STATE_ACTIVE:
> 		WARN_ON_ONCE(old_state != STATE_INACTIVE);
> 		event->total_time_enabled += delta;
> 		break;
> 
> 	case STATE_INACTIVE:
> 		switch (old_state) {
> 		case STATE_OFF:
> 			/* ignore the OFF -> INACTIVE period */
> 			break;
> 
> 		case STATE_ACTIVE:
> 			event->total_time_enabled += delta;
> 			event->total_time_running += delta;
> 			break;
> 
> 		default:
> 			WARN_ONCE();
> 		}
> 		break;
> 
> 	case STATE_OFF:
> 		WARN_ON_ONCE(old_state != STATE_INACTIVE)
> 		event->total_time_enabled += delta;
> 		break;
> 	}
> }
> 
> __read_current_times(event, u64 *enabled, u64 *running)
> {
> 	u64 delta, now = perf_event_time(event);
> 
> 	delta = now - event->tstamp;
> 
> 	*enabled = event->total_time_enabled;
> 	if (event->state >= STATE_INACTIVE)
> 		*enabled += delta;
> 	*running = event->total_time_running
> 	if (event->state == STATE_ACTIVE)
> 		*running += delta;
> }
> 
> perhaps? That instantly solves the problem I think, because now we don't
> need to update inactive events. But maybe I missed some, could you
> verify?

Thanks for the input. I will check it.

> 
> 

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH v6 2/3]: perf/core: use context tstamp_data for skipped events on mux interrupt
  2017-08-04 12:51         ` Peter Zijlstra
@ 2017-08-04 14:25           ` Alexey Budankov
  0 siblings, 0 replies; 76+ messages in thread
From: Alexey Budankov @ 2017-08-04 14:25 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Arnaldo Carvalho de Melo, Alexander Shishkin,
	Andi Kleen, Kan Liang, Dmitri Prokhorov, Valery Cherepennikov,
	Mark Rutland, Stephane Eranian, David Carrillo-Cisneros,
	linux-kernel

On 04.08.2017 15:51, Peter Zijlstra wrote:
> On Fri, Aug 04, 2017 at 02:35:34PM +0200, Peter Zijlstra wrote:
>> Something like:
>>
>> __update_state_and_time(event, new_state)
>> {
>> 	u64 delta, now = perf_event_time(event);
>> 	int old_state = event->state;
>>
>> 	event->tstamp = now;
>> 	event->state  = new_state;
>>
>> 	delta = now - event->tstamp;
> Obv should go above the tstamp assignment
> 
>> 	switch (new_state) {
>> 	case STATE_ACTIVE:
>> 		WARN_ON_ONCE(old_state != STATE_INACTIVE);
>> 		event->total_time_enabled += delta;
>> 		break;
>>
>> 	case STATE_INACTIVE:
>> 		switch (old_state) {
>> 		case STATE_OFF:
>> 			/* ignore the OFF -> INACTIVE period */
>> 			break;
>>
>> 		case STATE_ACTIVE:
>> 			event->total_time_enabled += delta;
>> 			event->total_time_running += delta;
>> 			break;
>>
>> 		default:
>> 			WARN_ONCE();
>> 		}
>> 		break;
>>
>> 	case STATE_OFF:
>> 		WARN_ON_ONCE(old_state != STATE_INACTIVE)
>> 		event->total_time_enabled += delta;
>> 		break;
>> 	}
>> }
> 
> So that's a straight fwd state machine that deals with:
> 
>   OFF <-> INACTIVE <-> ACTIVE
> 
> but I think something like:
> 
> __update_state_and_time(event, new_state)
> {
> 	u64 delta, now = perf_event_time(event);
> 	int old_state = event->state;
> 
> 	delta = now - event->tstamp;
> 	event->tstamp = now;
> 	event->state  = new_state;
> 
> 	if (old_state == STATE_OFF)
> 		return;
> 
> 	event->total_time_enabled += delta;
> 
> 	if (old_state == STATE_ACTIVE)
> 		event->total_time_running += delta;
> }
> 
> is equivalent and generates smaller code.. but again, double check (also
> it doesn't validate the state transitions).

Accepted.

>

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH v6 1/3] perf/core: use rb trees for pinned/flexible groups
  2017-08-03 20:30     ` Alexey Budankov
@ 2017-08-04 14:36       ` Peter Zijlstra
  2017-08-07  7:17         ` Alexey Budankov
  2017-08-04 14:53       ` Peter Zijlstra
  1 sibling, 1 reply; 76+ messages in thread
From: Peter Zijlstra @ 2017-08-04 14:36 UTC (permalink / raw)
  To: Alexey Budankov
  Cc: Ingo Molnar, Arnaldo Carvalho de Melo, Alexander Shishkin,
	Andi Kleen, Kan Liang, Dmitri Prokhorov, Valery Cherepennikov,
	Mark Rutland, Stephane Eranian, David Carrillo-Cisneros,
	linux-kernel

On Thu, Aug 03, 2017 at 11:30:09PM +0300, Alexey Budankov wrote:
> On 03.08.2017 16:00, Peter Zijlstra wrote:
> > On Wed, Aug 02, 2017 at 11:13:54AM +0300, Alexey Budankov wrote:

> >> +/*
> >> + * Find group list by a cpu key and rotate it.
> >> + */
> >> +static void
> >> +perf_event_groups_rotate(struct rb_root *groups, int cpu)
> >> +{
> >> +	struct rb_node *node;
> >> +	struct perf_event *node_event;
> >> +
> >> +	node = groups->rb_node;
> >> +
> >> +	while (node) {
> >> +		node_event = container_of(node,
> >> +				struct perf_event, group_node);
> >> +
> >> +		if (cpu < node_event->cpu) {
> >> +			node = node->rb_left;
> >> +		} else if (cpu > node_event->cpu) {
> >> +			node = node->rb_right;
> >> +		} else {
> >> +			list_rotate_left(&node_event->group_list);
> >> +			break;
> >> +		}
> >> +	}
> >> +}
> > 
> > Ah, you worry about how to rotate inside a tree?
> 
> Exactly.
> 
> > 
> > You can do that by adding (run)time based ordering, and you'll end up
> > with a runtime based scheduler.
> 
> Do you mean replacing a CPU indexed rb_tree of lists with 
> a CPU indexed rb_tree of counter indexed rb_trees?

No, single tree, just more complicated ordering rules.

> > A trivial variant keeps a simple counter per tree that is incremented
> > for each rotation. That should end up with the events ordered exactly
> > like the list. And if you have that comparator like above, expressing
> > that additional ordering becomes simple ;-)
> > 
> > Something like:
> > 
> > struct group {
> >   u64 vtime;
> >   rb_tree tree;
> > };
> > 
> > bool event_less(left, right)
> > {
> >   if (left->cpu < right->cpu)
> >     return true;
> > 
> >   if (left->cpu > right->cpu)
> >     return false;
> > 
> >   if (left->vtime < right->vtime)
> >     return true;
> > 
> >   return false;
> > }
> > 
> > insert_group(group, event, tail)
> > {
> >   if (tail)
> >     event->vtime = ++group->vtime;
> > 
> >   tree_insert(&group->root, event);
> > }
> > 
> > Then every time you use insert_group(.tail=1) it goes to the end of that
> > CPU's 'list'.
> > 
> 
> Could you elaborate more on how to implement rotation?

It's almost all there, but let me write a complete replacement for your
perf_event_group_rotate() above.

/* find the leftmost event matching @cpu */
/* XXX not sure how to best parametrise a subtree search, */
/* again, C sucks... */
struct perf_event *__group_find_cpu(group, cpu)
{
	struct rb_node *node = group->tree.rb_node;
	struct perf_event *event, *match = NULL;

	while (node) {
		event = container_of(node, struct perf_event, group_node);

		if (cpu > event->cpu) {
			node = node->rb_right;
		} else if (cpu < event->cpu) {
			node = node->rb_left;
		} else {
			/*
			 * subtree match, try left subtree for a
			 * 'smaller' match.
			 */
			match = event;
			node = node->rb_left;
		}
	}

	return match;
}

void perf_event_group_rotate(group, int cpu)
{
	struct perf_event *event = __group_find_cpu(group, cpu);

	if (!event)
		return;

	tree_delete(&group->tree, event);
	insert_group(group, event, 1);
}

So we have a tree ordered by {cpu,vtime} and what we do is find the
leftmost {cpu} entry, that is the smallest vtime entry for that cpu. We
then take it out and re-insert it with a vtime number larger than any
other, which places it as the rightmost entry for that cpu.


So given:

       {1,1}
       / \
    {0,5} {1,2}
   / \        \
{0,1} {0,6}  {1,4}


__group_find_cpu(.cpu=1) will return {1,1} as being the leftmost entry
with cpu=1. We'll then remove it, update its vtime to 7 and re-insert,
resulting in something like:

       {1,2}
       / \
    {0,5} {1,4}
   / \        \
{0,1} {0,6}  {1,7}
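
Spelled out against the actual rbtree API, with the tree_delete() placeholder
becoming rb_erase() and re-using the insert_group(.tail=1) helper from the
earlier sketch, the rotate would then be something like:

void perf_event_group_rotate(struct group *group, int cpu)
{
	struct perf_event *event = __group_find_cpu(group, cpu);

	if (!event)
		return;

	rb_erase(&event->group_node, &group->tree);
	insert_group(group, event, 1);	/* tail: gets the largest vtime for @cpu */
}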

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH v6 1/3] perf/core: use rb trees for pinned/flexible groups
  2017-08-03 20:30     ` Alexey Budankov
  2017-08-04 14:36       ` Peter Zijlstra
@ 2017-08-04 14:53       ` Peter Zijlstra
  2017-08-07 15:22         ` Alexey Budankov
  1 sibling, 1 reply; 76+ messages in thread
From: Peter Zijlstra @ 2017-08-04 14:53 UTC (permalink / raw)
  To: Alexey Budankov
  Cc: Ingo Molnar, Arnaldo Carvalho de Melo, Alexander Shishkin,
	Andi Kleen, Kan Liang, Dmitri Prokhorov, Valery Cherepennikov,
	Mark Rutland, Stephane Eranian, David Carrillo-Cisneros,
	linux-kernel

On Thu, Aug 03, 2017 at 11:30:09PM +0300, Alexey Budankov wrote:
> On 03.08.2017 16:00, Peter Zijlstra wrote:
> > On Wed, Aug 02, 2017 at 11:13:54AM +0300, Alexey Budankov wrote:

> >> @@ -2759,13 +2932,13 @@ static void ctx_sched_out(struct perf_event_context *ctx,
> >>  
> >>  	perf_pmu_disable(ctx->pmu);
> >>  	if (is_active & EVENT_PINNED) {
> >> -		list_for_each_entry(event, &ctx->pinned_groups, group_entry)
> >> -			group_sched_out(event, cpuctx, ctx);
> >> +		perf_event_groups_iterate(&ctx->pinned_groups,
> >> +				group_sched_out_callback, &params);
> > 
> > So here I would expect to not iterate events where event->cpu !=
> > smp_processor_id() (and ideally not where event->pmu != ctx->pmu).
> >
> 
> We still need to iterate through all groups on thread context switch in
> and out, and from the multiplexing timer interrupt handler we need the
> cpu == -1 list (software events) in addition to the smp_processor_id() list.

Well, just doing the @cpu=-1 and @cpu=this_cpu subtrees is less work
than iterating _everything_, right?

The rest will not survive event_filter_match() anyway, so iterating them
is a complete waste of time, and once we have them in a tree, it's actually
easy to find this subset.
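
Concretely, with the tree ordered on cpu that boils down to two subtree walks
per context, e.g. (sketch, re-using the for_each_group_event() helper
suggested earlier in the thread; the ctx_pinned_sched_out() name is only a
hypothetical helper for illustration):

static void ctx_pinned_sched_out(struct perf_event_context *ctx,
				 struct perf_cpu_context *cpuctx)
{
	struct perf_event *event;
	int cpu = smp_processor_id();

	/* software events */
	for_each_group_event(event, &ctx->pinned_groups, -1, ctx->pmu, group_node)
		group_sched_out(event, cpuctx, ctx);

	/* events bound to this CPU */
	for_each_group_event(event, &ctx->pinned_groups, cpu, ctx->pmu, group_node)
		group_sched_out(event, cpuctx, ctx);
}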

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH v6 1/3] perf/core: use rb trees for pinned/flexible groups
  2017-08-04 14:36       ` Peter Zijlstra
@ 2017-08-07  7:17         ` Alexey Budankov
  2017-08-07  8:39           ` Peter Zijlstra
  2017-08-15 17:28           ` Alexey Budankov
  0 siblings, 2 replies; 76+ messages in thread
From: Alexey Budankov @ 2017-08-07  7:17 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Arnaldo Carvalho de Melo, Alexander Shishkin,
	Andi Kleen, Kan Liang, Dmitri Prokhorov, Valery Cherepennikov,
	Mark Rutland, Stephane Eranian, David Carrillo-Cisneros,
	linux-kernel

On 04.08.2017 17:36, Peter Zijlstra wrote:
> On Thu, Aug 03, 2017 at 11:30:09PM +0300, Alexey Budankov wrote:
>> On 03.08.2017 16:00, Peter Zijlstra wrote:
>>> On Wed, Aug 02, 2017 at 11:13:54AM +0300, Alexey Budankov wrote:
> 
>>>> +/*
>>>> + * Find group list by a cpu key and rotate it.
>>>> + */
>>>> +static void
>>>> +perf_event_groups_rotate(struct rb_root *groups, int cpu)
>>>> +{
>>>> +	struct rb_node *node;
>>>> +	struct perf_event *node_event;
>>>> +
>>>> +	node = groups->rb_node;
>>>> +
>>>> +	while (node) {
>>>> +		node_event = container_of(node,
>>>> +				struct perf_event, group_node);
>>>> +
>>>> +		if (cpu < node_event->cpu) {
>>>> +			node = node->rb_left;
>>>> +		} else if (cpu > node_event->cpu) {
>>>> +			node = node->rb_right;
>>>> +		} else {
>>>> +			list_rotate_left(&node_event->group_list);
>>>> +			break;
>>>> +		}
>>>> +	}
>>>> +}
>>>
>>> Ah, you worry about how to rotate inside a tree?
>>
>> Exactly.
>>
>>>
>>> You can do that by adding (run)time based ordering, and you'll end up
>>> with a runtime based scheduler.
>>
>> Do you mean replacing a CPU indexed rb_tree of lists with 
>> a CPU indexed rb_tree of counter indexed rb_trees?
> 
> No, single tree, just more complicated ordering rules.
> 
>>> A trivial variant keeps a simple counter per tree that is incremented
>>> for each rotation. That should end up with the events ordered exactly
>>> like the list. And if you have that comparator like above, expressing
>>> that additional ordering becomes simple ;-)
>>>
>>> Something like:
>>>
>>> struct group {
>>>   u64 vtime;
>>>   rb_tree tree;
>>> };
>>>
>>> bool event_less(left, right)
>>> {
>>>   if (left->cpu < right->cpu)
>>>     return true;
>>>
>>>   if (left->cpu > right->cpu)
>>>     return false;
>>>
>>>   if (left->vtime < right->vtime)
>>>     return true;
>>>
>>>   return false;
>>> }
>>>
>>> insert_group(group, event, tail)
>>> {
>>>   if (tail)
>>>     event->vtime = ++group->vtime;
>>>
>>>   tree_insert(&group->root, event);
>>> }
>>>
>>> Then every time you use insert_group(.tail=1) it goes to the end of that
>>> CPU's 'list'.
>>>
>>
>> Could you elaborate more on how to implement rotation?
> 
> It's almost all there, but let me write a complete replacement for your
> perf_event_group_rotate() above.
> 
> /* find the leftmost event matching @cpu */
> /* XXX not sure how to best parametrise a subtree search, */
> /* again, C sucks... */
> struct perf_event *__group_find_cpu(group, cpu)
> {
> 	struct rb_node *node = group->tree.rb_node;
> 	struct perf_event *event, *match = NULL;
> 
> 	while (node) {
> 		event = container_of(node, struct perf_event, group_node);
> 
> 		if (cpu > event->cpu) {
> 			node = node->rb_right;
> 		} else if (cpu < event->cpu) {
> 			node = node->rb_left;
> 		} else {
> 			/*
> 			 * subtree match, try left subtree for a
> 			 * 'smaller' match.
> 			 */
> 			match = event;
> 			node = node->rb_left;
> 		}
> 	}
> 
> 	return match;
> }
> 
> void perf_event_group_rotate(group, int cpu)
> {
> 	struct perf_event *event = __group_find_cpu(group, cpu);
> 
> 	if (!event)
> 		return;
> 
> 	tree_delete(&group->tree, event);
> 	insert_group(group, event, 1);
> }
> 
> So we have a tree ordered by {cpu,vtime} and what we do is find the
> leftmost {cpu} entry, that is the smallest vtime entry for that cpu. We
> then take it out and re-insert it with a vtime number larger than any
> other, which places it as the rightmost entry for that cpu.
> 
> 
> So given:
> 
>        {1,1}
>        / \
>     {0,5} {1,2}
>    / \        \
> {0,1} {0,6}  {1,4}
> 
> 
> __group_find_cpu(.cpu=1) will return {1,1} as being the leftmost entry
> with cpu=1. We'll then remove it, update its vtime to 7 and re-insert,
> resulting in something like:
> 
>        {1,2}
>        / \
>     {0,5} {1,4}
>    / \        \
> {0,1} {0,6}  {1,7}
> 

Makes sense. The implementation becomes a bit simpler. The drawback
may be several rotations of a potentially big tree on the critical path,
instead of updating four pointers as in the tree-of-lists case.

> 
> 
> 
> 

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH v6 1/3] perf/core: use rb trees for pinned/flexible groups
  2017-08-07  7:17         ` Alexey Budankov
@ 2017-08-07  8:39           ` Peter Zijlstra
  2017-08-07  9:13             ` Peter Zijlstra
  2017-08-15 17:28           ` Alexey Budankov
  1 sibling, 1 reply; 76+ messages in thread
From: Peter Zijlstra @ 2017-08-07  8:39 UTC (permalink / raw)
  To: Alexey Budankov
  Cc: Ingo Molnar, Arnaldo Carvalho de Melo, Alexander Shishkin,
	Andi Kleen, Kan Liang, Dmitri Prokhorov, Valery Cherepennikov,
	Mark Rutland, Stephane Eranian, David Carrillo-Cisneros,
	linux-kernel

On Mon, Aug 07, 2017 at 10:17:46AM +0300, Alexey Budankov wrote:
> Makes sense. The implementation becomes a bit simpler. The drawbacks 
> may be several rotations of potentially big tree on the critical path, 
> instead of updating four pointers in case of the tree of lists.

Yes, but like said, it allows implementing a better scheduler than RR,
allowing us to fix rotation artifacts where task runtimes are near the
rotation window.

A slightly more complicated, but also interesting, scheduling problem is
the per-cpu flexible vs the per-task flexible. Ideally we'd rotate them
at the same priority based on service, without strictly prioritizing the
per-cpu events.

Again, that is something that should be possible once we have a more
capable event scheduler.


So yes, cons and pros.. :-)

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH v6 1/3] perf/core: use rb trees for pinned/flexible groups
  2017-08-07  8:39           ` Peter Zijlstra
@ 2017-08-07  9:13             ` Peter Zijlstra
  2017-08-07 15:32               ` Alexey Budankov
  0 siblings, 1 reply; 76+ messages in thread
From: Peter Zijlstra @ 2017-08-07  9:13 UTC (permalink / raw)
  To: Alexey Budankov
  Cc: Ingo Molnar, Arnaldo Carvalho de Melo, Alexander Shishkin,
	Andi Kleen, Kan Liang, Dmitri Prokhorov, Valery Cherepennikov,
	Mark Rutland, Stephane Eranian, David Carrillo-Cisneros,
	linux-kernel

On Mon, Aug 07, 2017 at 10:39:13AM +0200, Peter Zijlstra wrote:
> On Mon, Aug 07, 2017 at 10:17:46AM +0300, Alexey Budankov wrote:
> > Makes sense. The implementation becomes a bit simpler. The drawbacks 
> > may be several rotations of potentially big tree on the critical path, 
> > instead of updating four pointers in case of the tree of lists.
> 
> Yes, but like said, it allows implementing a better scheduler than RR,
> allowing us to fix rotation artifacts where task runtimes are near the
> rotation window.
> 
> A slightly more complicated, but also interested scheduling problem is
> the per-cpu flexible vs the per-task flexible. Ideally we'd rotate them
> at the same priority based on service, without strictly prioritizing the
> per-cpu events.
> 
> Again, that is something that should be possible once we have a more
> capable event scheduler.
> 
> 
> So yes, cons and pros.. :-)

Also, I think for AVL tree you could do the erase and (re)insert
combined and then rebalance in one go, not sure RB allows the same
thing, but it might be fun looking into.

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH v6 1/3] perf/core: use rb trees for pinned/flexible groups
  2017-08-04 14:53       ` Peter Zijlstra
@ 2017-08-07 15:22         ` Alexey Budankov
  0 siblings, 0 replies; 76+ messages in thread
From: Alexey Budankov @ 2017-08-07 15:22 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Arnaldo Carvalho de Melo, Alexander Shishkin,
	Andi Kleen, Kan Liang, Dmitri Prokhorov, Valery Cherepennikov,
	Mark Rutland, Stephane Eranian, David Carrillo-Cisneros,
	linux-kernel

On 04.08.2017 17:53, Peter Zijlstra wrote:
> On Thu, Aug 03, 2017 at 11:30:09PM +0300, Alexey Budankov wrote:
>> On 03.08.2017 16:00, Peter Zijlstra wrote:
>>> On Wed, Aug 02, 2017 at 11:13:54AM +0300, Alexey Budankov wrote:
> 
>>>> @@ -2759,13 +2932,13 @@ static void ctx_sched_out(struct perf_event_context *ctx,
>>>>  
>>>>  	perf_pmu_disable(ctx->pmu);
>>>>  	if (is_active & EVENT_PINNED) {
>>>> -		list_for_each_entry(event, &ctx->pinned_groups, group_entry)
>>>> -			group_sched_out(event, cpuctx, ctx);
>>>> +		perf_event_groups_iterate(&ctx->pinned_groups,
>>>> +				group_sched_out_callback, &params);
>>>
>>> So here I would expect to not iterate events where event->cpu !=
>>> smp_processor_id() (and ideally not where event->pmu != ctx->pmu).
>>>
>>
>> We still need to iterate thru all groups on thread context switch in 
>> and out as well as iterate thru cpu == -1 list (software events) additionally 
>> to smp_processor_id() list from multiplexing timer interrupt handler.
> 
> Well, just doing the @cpu=-1 and @cpu=this_cpu subtrees is less work
> than iterating _everything_, right?

Right. That is actually the aim of this whole patch set - to avoid iterating "_everything_".

> 
> The rest will not survive event_filter_match() anyway, so iterating them
> is complete waste of time, and once we have them in a tree, its actually
> easy to find this subset.
> 

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH v6 1/3] perf/core: use rb trees for pinned/flexible groups
  2017-08-07  9:13             ` Peter Zijlstra
@ 2017-08-07 15:32               ` Alexey Budankov
  2017-08-07 15:55                 ` Peter Zijlstra
  0 siblings, 1 reply; 76+ messages in thread
From: Alexey Budankov @ 2017-08-07 15:32 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Arnaldo Carvalho de Melo, Alexander Shishkin,
	Andi Kleen, Kan Liang, Dmitri Prokhorov, Valery Cherepennikov,
	Mark Rutland, Stephane Eranian, David Carrillo-Cisneros,
	linux-kernel

On 07.08.2017 12:13, Peter Zijlstra wrote:
> On Mon, Aug 07, 2017 at 10:39:13AM +0200, Peter Zijlstra wrote:
>> On Mon, Aug 07, 2017 at 10:17:46AM +0300, Alexey Budankov wrote:
>>> Makes sense. The implementation becomes a bit simpler. The drawbacks 
>>> may be several rotations of potentially big tree on the critical path, 
>>> instead of updating four pointers in case of the tree of lists.
>>
>> Yes, but like said, it allows implementing a better scheduler than RR,
>> allowing us to fix rotation artifacts where task runtimes are near the
>> rotation window.

Could you elaborate more on the artifacts, or maybe share some link to the theory?

>>
>> A slightly more complicated, but also interested scheduling problem is
>> the per-cpu flexible vs the per-task flexible. Ideally we'd rotate them
>> at the same priority based on service, without strictly prioritizing the
>> per-cpu events.
>>
>> Again, that is something that should be possible once we have a more
>> capable event scheduler.
>>
>>
>> So yes, cons and pros.. :-)
> 
> Also, I think for AVL tree you could do the erase and (re)insert
> combined and then rebalance in one go, not sure RB allows the same
> thing, but it might be fun looking into.

Not sure if AVL is more practical here. You get better balancing, which 
gives you faster average search at the price of longer modifications, 
so yes, need to measure and compare ... :-)

> 

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH v6 1/3] perf/core: use rb trees for pinned/flexible groups
  2017-08-07 15:32               ` Alexey Budankov
@ 2017-08-07 15:55                 ` Peter Zijlstra
  2017-08-07 16:27                   ` Alexey Budankov
  0 siblings, 1 reply; 76+ messages in thread
From: Peter Zijlstra @ 2017-08-07 15:55 UTC (permalink / raw)
  To: Alexey Budankov
  Cc: Ingo Molnar, Arnaldo Carvalho de Melo, Alexander Shishkin,
	Andi Kleen, Kan Liang, Dmitri Prokhorov, Valery Cherepennikov,
	Mark Rutland, Stephane Eranian, David Carrillo-Cisneros,
	linux-kernel

On Mon, Aug 07, 2017 at 06:32:16PM +0300, Alexey Budankov wrote:
> On 07.08.2017 12:13, Peter Zijlstra wrote:
> > On Mon, Aug 07, 2017 at 10:39:13AM +0200, Peter Zijlstra wrote:
> >> On Mon, Aug 07, 2017 at 10:17:46AM +0300, Alexey Budankov wrote:
> >>> Makes sense. The implementation becomes a bit simpler. The drawbacks 
> >>> may be several rotations of potentially big tree on the critical path, 
> >>> instead of updating four pointers in case of the tree of lists.
> >>
> >> Yes, but like said, it allows implementing a better scheduler than RR,
> >> allowing us to fix rotation artifacts where task runtimes are near the
> >> rotation window.
> 
> Could you elaborate more on the artifacts or my be share some link to the theory?

In the extreme, if you construct your program such that you'll never get
hit by the tick (this used to be a popular measure to hide yourself from
time accounting), you'll never rotate the counters, even though you can
rack up quite a lot of runtime.

By doing a runtime based scheduler, instead of a tick based RR, we'll
still get rotation, and the tick will only function as a forced
reprogram point.
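
A sketch of what that ordering could key on, assuming the event's
accumulated total_time_running (or whatever service accounting the
scheduler actually keeps) can serve as the secondary key:

static bool event_less_service(struct perf_event *l, struct perf_event *r)
{
	if (l->cpu != r->cpu)
		return l->cpu < r->cpu;
	/* least-serviced group sorts leftmost, so it is scheduled first */
	return l->total_time_running < r->total_time_running;
}

The obvious catch is that the key changes while the event is queued,
so a runtime update would need an erase plus re-insert of the node.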

> >> A slightly more complicated, but also interested scheduling problem is
> >> the per-cpu flexible vs the per-task flexible. Ideally we'd rotate them
> >> at the same priority based on service, without strictly prioritizing the
> >> per-cpu events.
> >>
> >> Again, that is something that should be possible once we have a more
> >> capable event scheduler.
> >>
> >>
> >> So yes, cons and pros.. :-)
> > 
> > Also, I think for AVL tree you could do the erase and (re)insert
> > combined and then rebalance in one go, not sure RB allows the same
> > thing, but it might be fun looking into.
> 
> Not sure if AVL is more practical here. You get better balancing what gives 
> you faster average search for the price of longer modifications 
> so yes, need to measure and compare ... :-)

Oh, I wasn't suggesting using AVL (the last thing we need is another
balanced tree in the kernel), I was merely wondering if you could do
compound/bulk updates on RB as you can with AVL.

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH v6 1/3] perf/core: use rb trees for pinned/flexible groups
  2017-08-07 15:55                 ` Peter Zijlstra
@ 2017-08-07 16:27                   ` Alexey Budankov
  2017-08-07 16:57                     ` Peter Zijlstra
  0 siblings, 1 reply; 76+ messages in thread
From: Alexey Budankov @ 2017-08-07 16:27 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Arnaldo Carvalho de Melo, Alexander Shishkin,
	Andi Kleen, Kan Liang, Dmitri Prokhorov, Valery Cherepennikov,
	Mark Rutland, Stephane Eranian, David Carrillo-Cisneros,
	linux-kernel

On 07.08.2017 18:55, Peter Zijlstra wrote:
> On Mon, Aug 07, 2017 at 06:32:16PM +0300, Alexey Budankov wrote:
>> On 07.08.2017 12:13, Peter Zijlstra wrote:
>>> On Mon, Aug 07, 2017 at 10:39:13AM +0200, Peter Zijlstra wrote:
>>>> On Mon, Aug 07, 2017 at 10:17:46AM +0300, Alexey Budankov wrote:
>>>>> Makes sense. The implementation becomes a bit simpler. The drawbacks 
>>>>> may be several rotations of potentially big tree on the critical path, 
>>>>> instead of updating four pointers in case of the tree of lists.
>>>>
>>>> Yes, but like said, it allows implementing a better scheduler than RR,
>>>> allowing us to fix rotation artifacts where task runtimes are near the
>>>> rotation window.
>>
>> Could you elaborate more on the artifacts or my be share some link to the theory?
> 
> In the extreme, if you construct your program such that you'll never get
> hit by the tick (this used to be a popular measure to hide yourself from
> time accounting)

Well, that seems a weird thing to me. Never run longer than one tick? 
I could imagine some I/O bound code that quickly serves some short 
messages and spends all the other time waiting for incoming requests. 
Not sure if CPU events monitoring is helpful in this case.

> , you'll never rotate the counters, even though you can
> rack up quite a lot of runtime.> 
> By doing a runtime based scheduler, instead of a tick based RR, we'll
> still get rotation, and the tick will only function as a forced
> reprogram point.
> 
>>>> A slightly more complicated, but also interested scheduling problem is
>>>> the per-cpu flexible vs the per-task flexible. Ideally we'd rotate them
>>>> at the same priority based on service, without strictly prioritizing the
>>>> per-cpu events.
>>>>
>>>> Again, that is something that should be possible once we have a more
>>>> capable event scheduler.
>>>>
>>>>
>>>> So yes, cons and pros.. :-)
>>>
>>> Also, I think for AVL tree you could do the erase and (re)insert
>>> combined and then rebalance in one go, not sure RB allows the same
>>> thing, but it might be fun looking into.
>>
>> Not sure if AVL is more practical here. You get better balancing what gives 
>> you faster average search for the price of longer modifications 
>> so yes, need to measure and compare ... :-)
> 
> Oh, I wasn't suggesting using AVL (the last thing we need is another
> balanced tree in the kernel), I was merely wondering if you could do
> compound/bulk updates on RB as you can with AVL.

Aww, I see.

> 

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH v6 1/3] perf/core: use rb trees for pinned/flexible groups
  2017-08-07 16:27                   ` Alexey Budankov
@ 2017-08-07 16:57                     ` Peter Zijlstra
  2017-08-07 17:39                       ` Andi Kleen
  2017-08-07 18:13                       ` Alexey Budankov
  0 siblings, 2 replies; 76+ messages in thread
From: Peter Zijlstra @ 2017-08-07 16:57 UTC (permalink / raw)
  To: Alexey Budankov
  Cc: Ingo Molnar, Arnaldo Carvalho de Melo, Alexander Shishkin,
	Andi Kleen, Kan Liang, Dmitri Prokhorov, Valery Cherepennikov,
	Mark Rutland, Stephane Eranian, David Carrillo-Cisneros,
	linux-kernel

On Mon, Aug 07, 2017 at 07:27:30PM +0300, Alexey Budankov wrote:
> On 07.08.2017 18:55, Peter Zijlstra wrote:

> > In the extreme, if you construct your program such that you'll never get
> > hit by the tick (this used to be a popular measure to hide yourself from
> > time accounting)
> 
> Well, some weird thing for me. Never run longer than one tick? 
> I could imaging some I/O bound code that would fast serve some short 
> messages, all the other time waiting for incoming requests.
> Not sure if CPU events monitoring is helpful in this case.

Like I said, in the extreme. Typically it's less weird.

Another example is scheduling a very constrained counter/group along
with a bunch of simple events such that the group will only succeed in
scheduling when it's first. In this case it will get only 1/nr_events
of the time with RR, as opposed to the other/simple events that will get
nr_counters/nr_events of the time.

By making it runtime based, the constrained thing will more often be
head of list and acquire equal total runtime to the other events.
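
For a concrete (made-up) instance: with 4 counters and 8 flexible events,
one of which is a group needing all 4 counters, RR schedules the big group
only when it happens to be at the head of the list, i.e. on 1 of 8
rotations, while each simple event fits alongside others and is on the PMU
roughly 4 of every 8 rotations. Ordering by accumulated service instead
keeps pulling the starved group to the front until its runtime catches up.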

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH v6 1/3] perf/core: use rb trees for pinned/flexible groups
  2017-08-07 16:57                     ` Peter Zijlstra
@ 2017-08-07 17:39                       ` Andi Kleen
  2017-08-07 18:12                         ` Peter Zijlstra
  2017-08-07 18:13                       ` Alexey Budankov
  1 sibling, 1 reply; 76+ messages in thread
From: Andi Kleen @ 2017-08-07 17:39 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Alexey Budankov, Ingo Molnar, Arnaldo Carvalho de Melo,
	Alexander Shishkin, Kan Liang, Dmitri Prokhorov,
	Valery Cherepennikov, Mark Rutland, Stephane Eranian,
	David Carrillo-Cisneros, linux-kernel

On Mon, Aug 07, 2017 at 06:57:11PM +0200, Peter Zijlstra wrote:
> On Mon, Aug 07, 2017 at 07:27:30PM +0300, Alexey Budankov wrote:
> > On 07.08.2017 18:55, Peter Zijlstra wrote:
> 
> > > In the extreme, if you construct your program such that you'll never get
> > > hit by the tick (this used to be a popular measure to hide yourself from
> > > time accounting)
> > 
> > Well, some weird thing for me. Never run longer than one tick? 
> > I could imaging some I/O bound code that would fast serve some short 
> > messages, all the other time waiting for incoming requests.
> > Not sure if CPU events monitoring is helpful in this case.
> 
> Like I said, in extreme. Typically its less weird.
> 
> Another example is scheduling a very constrained counter/group along
> with a bunch of simple events such that the group will only succeed to
> schedule when its the first. In this case it will get only 1/nr_events
> time with RR, as opposed to the other/simple events that will get
> nr_counters/nr_events time.
> 
> By making it runtime based, the constrained thing will more often be
> head of list and acquire equal total runtime to the other events.

I'm not sure Alexey's patch kit will be able to solve every possible
problem with the event scheduler. Trying to fix everything at 
the same time is usually difficult. 

It would seem better to mainly focus on the scaling problem for now
(which is essentially a show stopper bug for one platform)
and then tackle other problems later once that is solved.

-Andi

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH v6 1/3] perf/core: use rb trees for pinned/flexible groups
  2017-08-07 17:39                       ` Andi Kleen
@ 2017-08-07 18:12                         ` Peter Zijlstra
  0 siblings, 0 replies; 76+ messages in thread
From: Peter Zijlstra @ 2017-08-07 18:12 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Alexey Budankov, Ingo Molnar, Arnaldo Carvalho de Melo,
	Alexander Shishkin, Kan Liang, Dmitri Prokhorov,
	Valery Cherepennikov, Mark Rutland, Stephane Eranian,
	David Carrillo-Cisneros, linux-kernel

On Mon, Aug 07, 2017 at 10:39:55AM -0700, Andi Kleen wrote:
> I'm not sure Alexey's patch kit will be able to solve every possible
> problem with the event scheduler. Trying to fix everything at 
> the same time is usually difficult. 

I didn't say he should solve this. Just said that putting everything in
a tree enables solving it.

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH v6 1/3] perf/core: use rb trees for pinned/flexible groups
  2017-08-07 16:57                     ` Peter Zijlstra
  2017-08-07 17:39                       ` Andi Kleen
@ 2017-08-07 18:13                       ` Alexey Budankov
  1 sibling, 0 replies; 76+ messages in thread
From: Alexey Budankov @ 2017-08-07 18:13 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Arnaldo Carvalho de Melo, Alexander Shishkin,
	Andi Kleen, Kan Liang, Dmitri Prokhorov, Valery Cherepennikov,
	Mark Rutland, Stephane Eranian, David Carrillo-Cisneros,
	linux-kernel

On 07.08.2017 19:57, Peter Zijlstra wrote:
> On Mon, Aug 07, 2017 at 07:27:30PM +0300, Alexey Budankov wrote:
>> On 07.08.2017 18:55, Peter Zijlstra wrote:
> 
>>> In the extreme, if you construct your program such that you'll never get
>>> hit by the tick (this used to be a popular measure to hide yourself from
>>> time accounting)
>>
>> Well, some weird thing for me. Never run longer than one tick? 
>> I could imaging some I/O bound code that would fast serve some short 
>> messages, all the other time waiting for incoming requests.
>> Not sure if CPU events monitoring is helpful in this case.
> 
> Like I said, in extreme. Typically its less weird.> 
> Another example is scheduling a very constrained counter/group along
> with a bunch of simple events such that the group will only succeed to
> schedule when its the first. In this case it will get only 1/nr_events
> time with RR, as opposed to the other/simple events that will get
> nr_counters/nr_events time.
> 
> By making it runtime based, the constrained thing will more often be
> head of list and acquire equal total runtime to the other events.

I see, and what could be the triggering condition for runtime-based 
scheduling of groups as an alternative to the hrtimer signal?

> 

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH v6 2/3]: perf/core: use context tstamp_data for skipped events on mux interrupt
  2017-08-03 15:00   ` Peter Zijlstra
  2017-08-03 18:47     ` Alexey Budankov
@ 2017-08-10 15:57     ` Alexey Budankov
  2017-08-22 20:47       ` Peter Zijlstra
  1 sibling, 1 reply; 76+ messages in thread
From: Alexey Budankov @ 2017-08-10 15:57 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Arnaldo Carvalho de Melo, Alexander Shishkin,
	Andi Kleen, Kan Liang, Dmitri Prokhorov, Valery Cherepennikov,
	Mark Rutland, Stephane Eranian, David Carrillo-Cisneros,
	linux-kernel

[-- Attachment #1: Type: text/plain, Size: 6163 bytes --]

Well, Ok.

I re-implemented this patch of the patch set and also implemented a unit 
test that IMHO mimics the case mentioned above, to check if it solves the issue.

The test is a single-threaded application that creates 272 fds for each of 
two INSTRUCTIONS_RETIRED events, covering the 272 cores of Intel Xeon Phi. 

The test executes 1 million instructions several times while the events are 
enabled and reads the events' counts, TOTAL_ENABLED and TOTAL_RUNNING, 
via the read() system call.

The first event is allocated, enabled and disabled at the beginning and 
the end of the test execution phase, whereas the second event's allocation and
measurements fall within the measurement interval of the first event.
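
For reference, with that read_format each read() returns three u64 values
(see perf_event_open(2)); the struct below shows the implied layout and the
helper is an illustrative way to extrapolate a multiplexed count, not part
of the attached test:

struct read_format {
	unsigned long long value;		/* raw count		*/
	unsigned long long time_enabled;	/* TOTAL_TIME_ENABLED	*/
	unsigned long long time_running;	/* TOTAL_TIME_RUNNING	*/
};

static unsigned long long scaled_count(const struct read_format *rf)
{
	if (!rf->time_running)
		return 0;
	/* extrapolate for the time the event was scheduled out */
	return (unsigned long long)
		((double)rf->value * rf->time_enabled / rf->time_running);
}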

Below is what I am getting now when running the test on the patched kernel:

EID	CPU	COUNT		T_ENABLED	T_RUNNING	 SCALE
--- 0 enabled ---
--- million instructions ---
0	26	1334		1666671		137366		  8.24
0	94	138637		1735872		527098		 30.37
0	162	874166		1823695		1083785		 59.43
--- million instructions ---
0	26	1334		3328832		137366		  4.13
0	94	1164146		3318599		2109825		 63.58
0	162	874166		3390716		1083785		 31.96
--- million instructions ---
0	26	1334		4835955		137366		  2.84
0	94	2189671		4820750		3611976		 74.93
0	162	874166		4934513		1083785		 21.96
--- 1 enabled ---
--- million instructions ---
0	26	1334		22661310	137366		  0.61
0	94	3248450		22667165	21458391	 94.67
0	162	874166		22742990	1083785		  4.77
1	94	1033387		2150307		2150307		100.00
--- million instructions ---
0	26	1334		24878504	137366		  0.55
0	94	4289784		24869644	23660870	 95.14
0	162	874166		24943564	1083785		  4.34
1	94	2074675		4370708		4370708		100.00
--- 1 disabled ---
--- million instructions ---
0	26	1334		27681611	137366		  0.50
0	94	5337752		27675968	26467194	 95.63
0	162	874166		27749528	1083785		  3.91
1	94	2089278		5024218		5024218		100.00
--- 0 disabled ---
--- million instructions ---
0	26	1334		29004868	137366		  0.47
0	94	5372734		28964075	27755301	 95.83
0	162	874166		29005751	1083785		  3.74
1	94	2089278		5024218		5024218		100.00

The output demonstrates that the test thread migrated twice during execution,
so several fds were employed for measuring the number of executed instructions.

The output also shows that the T_RUNNING values of the events are updated and
maintained so that the sum of the SCALE values for every event stays near 100% 
(no multiplexing) after every million-instruction run.
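
As a sanity check against the last block above: event 0's SCALE values sum
to 0.47 + 95.83 + 3.74 ~= 100, and event 1 stays at 100.00 on its single
CPU, i.e. neither event ends up being multiplexed.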

Unit test code is attached for convenience.

The key thing in the patch is explicit updating of tstamp fields for
INACTIVE events in update_event_times().

Signed-off-by: Alexey Budankov <alexey.budankov@linux.intel.com>
---
kernel/events/core.c | 52 +++++++++++++++++++++++++++++++---------------------
 1 file changed, 31 insertions(+), 21 deletions(-)

diff --git a/kernel/events/core.c b/kernel/events/core.c
index 0a4f619..d195fdc 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -1391,6 +1391,27 @@ static u64 perf_event_time(struct perf_event *event)
 	return ctx ? ctx->time : 0;
 }
 
+void perf_event_tstamp_init(struct perf_event *event)
+{
+	u64 tstamp = perf_event_time(event);
+
+	event->tstamp_enabled = tstamp;
+	event->tstamp_running = tstamp;
+	event->tstamp_stopped = tstamp;
+}
+
+void perf_event_tstamp_update(struct perf_event *event)
+{
+	u64 tstamp, delta;
+
+	tstamp = perf_event_time(event);
+
+	delta = tstamp - event->tstamp_stopped;
+
+	event->tstamp_running += delta;
+	event->tstamp_stopped = tstamp;
+}
+
 /*
  * Update the total_time_enabled and total_time_running fields for a event.
  */
@@ -1405,6 +1426,9 @@ static void update_event_times(struct perf_event *event)
 	    event->group_leader->state < PERF_EVENT_STATE_INACTIVE)
 		return;
 
+	if (event->state == PERF_EVENT_STATE_INACTIVE)
+		perf_event_tstamp_update(event);
+
 	/*
 	 * in cgroup mode, time_enabled represents
 	 * the time the event was enabled AND active
@@ -1430,7 +1454,6 @@ static void update_event_times(struct perf_event *event)
 		run_end = perf_event_time(event);
 
 	event->total_time_running = run_end - event->tstamp_running;
-
 }
 
 /*
@@ -1954,9 +1977,6 @@ event_sched_out(struct perf_event *event,
 		  struct perf_cpu_context *cpuctx,
 		  struct perf_event_context *ctx)
 {
-	u64 tstamp = perf_event_time(event);
-	u64 delta;
-
 	WARN_ON_ONCE(event->ctx != ctx);
 	lockdep_assert_held(&ctx->lock);
 
@@ -1967,18 +1987,15 @@ event_sched_out(struct perf_event *event,
 	 * via read() for time_enabled, time_running:
 	 */
 	if (event->state == PERF_EVENT_STATE_INACTIVE &&
-	    !event_filter_match(event)) {
-		delta = tstamp - event->tstamp_stopped;
-		event->tstamp_running += delta;
-		event->tstamp_stopped = tstamp;
-	}
+	    !event_filter_match(event))
+		perf_event_tstamp_update(event);
 
 	if (event->state != PERF_EVENT_STATE_ACTIVE)
 		return;
 
 	perf_pmu_disable(event->pmu);
 
-	event->tstamp_stopped = tstamp;
+	event->tstamp_stopped = perf_event_time(event);
 	event->pmu->del(event, 0);
 	event->oncpu = -1;
 	event->state = PERF_EVENT_STATE_INACTIVE;
@@ -2294,7 +2311,6 @@ group_sched_in(struct perf_event *group_event,
 {
 	struct perf_event *event, *partial_group = NULL;
 	struct pmu *pmu = ctx->pmu;
-	u64 now = ctx->time;
 	bool simulate = false;
 
 	if (group_event->state == PERF_EVENT_STATE_OFF)
@@ -2340,12 +2356,10 @@ group_sched_in(struct perf_event *group_event,
 		if (event == partial_group)
 			simulate = true;
 
-		if (simulate) {
-			event->tstamp_running += now - event->tstamp_stopped;
-			event->tstamp_stopped = now;
-		} else {
+		if (simulate)
+			perf_event_tstamp_update(event);
+		else
 			event_sched_out(event, cpuctx, ctx);
-		}
 	}
 	event_sched_out(group_event, cpuctx, ctx);
 
@@ -2390,13 +2404,9 @@ static int group_can_go_on(struct perf_event *event,
 static void add_event_to_ctx(struct perf_event *event,
 			       struct perf_event_context *ctx)
 {
-	u64 tstamp = perf_event_time(event);
-
 	list_add_event(event, ctx);
 	perf_group_attach(event);
-	event->tstamp_enabled = tstamp;
-	event->tstamp_running = tstamp;
-	event->tstamp_stopped = tstamp;
+	perf_event_tstamp_init(event);
 }
 
 static void ctx_sched_out(struct perf_event_context *ctx,

[-- Attachment #2: check_multiplexing_read.c --]
[-- Type: text/plain, Size: 3294 bytes --]

/* check_multiplexing_read.c						*/

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#include <unistd.h>
#include <errno.h>

#include <sys/mman.h>

#include <sys/ioctl.h>
#include <asm/unistd.h>

#include "perf_event.h"
#include "test_utils.h"
#include "perf_helpers.h"
#include "instructions_testcode.h"

#define NUM_CPUS 272
#define NUM_EVENTS 2

int fd[NUM_EVENTS][NUM_CPUS];
long long events[NUM_EVENTS]={
	PERF_COUNT_HW_CPU_CYCLES,
	PERF_COUNT_HW_INSTRUCTIONS
};

static long long base_results[NUM_CPUS][3];

#define TIME_ENABLED	1
#define TIME_RUNNING	2

static int test_routine(void) {

	int i,result;

	printf("--- million instructions ---\n");

	for(i=0;i<1;i++) {
		result=instructions_million();
	}

	return result;
}

/* Open one event of type @config on every logical CPU; returns 0 on success. */
int alloc_events(long long config, int fd[NUM_CPUS])
{
	int j;

	struct perf_event_attr pe;

	for(j=0;j<NUM_CPUS;j++) {
		memset(&pe,0,sizeof(struct perf_event_attr));
		pe.type=PERF_TYPE_HARDWARE;
		pe.size=sizeof(struct perf_event_attr);
		pe.config=config;
		pe.disabled=1;
		pe.exclude_kernel=1;
		pe.exclude_hv=1;
		pe.read_format=PERF_FORMAT_TOTAL_TIME_ENABLED |
				PERF_FORMAT_TOTAL_TIME_RUNNING;

		fd[j]=perf_event_open(&pe,0,j,-1,0);
		if (fd[j]<0) {
			fprintf(stderr,"Failed adding mpx event 0 %s\n",
				strerror(errno));
			return -1;
		}
	}

	return 0;
}

void free_events(int fd[NUM_CPUS])
{
	int j;
	for(j=0;j<NUM_CPUS;j++) {
		close(fd[j]);
	}
}

void enable_events(int fd[NUM_CPUS])
{
	int j,ret;
	for(j=0;j<NUM_CPUS;j++) {
		ret=ioctl(fd[j], PERF_EVENT_IOC_ENABLE,0);
		if (ret<0) {
			fprintf(stderr,"Error starting event fd[%d]\n",j);
		}
	}
}

void disable_events(int fd[NUM_CPUS])
{
	int j,ret;
	for(j=0;j<NUM_CPUS;j++) {
		ret=ioctl(fd[j], PERF_EVENT_IOC_DISABLE,0);
		if (ret<0) {
			fprintf(stderr,"Error stopping event fd[%d]\n",j);
		}
	}
}

void read_events(int i, int fd[NUM_CPUS])
{
	int j, ret;
	for(j=0;j<NUM_CPUS;j++) {
		ret=read(fd[j],&base_results[j],3*sizeof(long long));
		if (ret<3*sizeof(long long)) {
			fprintf(stderr,"Event fd[0][%d] unexpected read size %d\n",j,ret);
			return;
		}
		if (base_results[j][0])
			printf("%d\t%d\t%lld\t\t\t%lld\t\t\t%lld\t\t\t%.2f\n", i, j,
				base_results[j][0],
				base_results[j][TIME_ENABLED],
				base_results[j][TIME_RUNNING],
				(double)base_results[j][TIME_RUNNING]/
				(double)base_results[j][TIME_ENABLED] * 100.);
	}
}

int main(int argc, char** argv) {

	int ret,quiet,i,j;

	struct perf_event_attr pe;
	
	char test_string[]="Testing ...";

	printf("\nEID\tCPU\tCOUNT\t\t\tT_ENABLED\t\t\tT_RUNNING\t\t\tSCALE\n");

	alloc_events(events[1], fd[0]);
	enable_events(fd[0]);
	printf("--- 1 enabled\n");

	test_routine();
	read_events(0, fd[0]);

	test_routine();
	read_events(0, fd[0]);

	test_routine();
	read_events(0, fd[0]);

	alloc_events(events[1], fd[1]);
	enable_events(fd[1]);
	printf("--- 1 enabled\n");

	test_routine();
	read_events(0, fd[0]);
	read_events(1, fd[1]);

	test_routine();
	read_events(0, fd[0]);
	read_events(1, fd[1]);

	disable_events(fd[1]);
	printf("--- 1 disabled\n");

	test_routine();
	read_events(0,fd[0]);
	read_events(1,fd[1]);

	disable_events(fd[0]);
	printf("--- 0 disabled\n");
	test_routine();

	read_events(0,fd[0]);
	read_events(1,fd[1]);

	free_events(fd[1]);
	free_events(fd[0]);

	test_pass(test_string);

	return 0;
}

^ permalink raw reply related	[flat|nested] 76+ messages in thread

* Re: [PATCH v6 1/3] perf/core: use rb trees for pinned/flexible groups
  2017-08-07  7:17         ` Alexey Budankov
  2017-08-07  8:39           ` Peter Zijlstra
@ 2017-08-15 17:28           ` Alexey Budankov
  2017-08-23 13:39             ` Alexander Shishkin
                               ` (2 more replies)
  1 sibling, 3 replies; 76+ messages in thread
From: Alexey Budankov @ 2017-08-15 17:28 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Arnaldo Carvalho de Melo, Alexander Shishkin,
	Andi Kleen, Kan Liang, Dmitri Prokhorov, Valery Cherepennikov,
	Mark Rutland, Stephane Eranian, David Carrillo-Cisneros,
	linux-kernel

Hi Peter,

On 07.08.2017 10:17, Alexey Budankov wrote:
> On 04.08.2017 17:36, Peter Zijlstra wrote:
>> On Thu, Aug 03, 2017 at 11:30:09PM +0300, Alexey Budankov wrote:
>>> On 03.08.2017 16:00, Peter Zijlstra wrote:
>>>> On Wed, Aug 02, 2017 at 11:13:54AM +0300, Alexey Budankov wrote:
>>
>>>>> +/*
>>>>> + * Find group list by a cpu key and rotate it.
>>>>> + */
>>>>> +static void
>>>>> +perf_event_groups_rotate(struct rb_root *groups, int cpu)
>>>>> +{
>>>>> +	struct rb_node *node;
>>>>> +	struct perf_event *node_event;
>>>>> +
>>>>> +	node = groups->rb_node;
>>>>> +
>>>>> +	while (node) {
>>>>> +		node_event = container_of(node,
>>>>> +				struct perf_event, group_node);
>>>>> +
>>>>> +		if (cpu < node_event->cpu) {
>>>>> +			node = node->rb_left;
>>>>> +		} else if (cpu > node_event->cpu) {
>>>>> +			node = node->rb_right;
>>>>> +		} else {
>>>>> +			list_rotate_left(&node_event->group_list);
>>>>> +			break;
>>>>> +		}
>>>>> +	}
>>>>> +}
>>>>
>>>> Ah, you worry about how to rotate inside a tree?
>>>
>>> Exactly.
>>>
>>>>
>>>> You can do that by adding (run)time based ordering, and you'll end up
>>>> with a runtime based scheduler.
>>>
>>> Do you mean replacing a CPU indexed rb_tree of lists with 
>>> an CPU indexed rb_tree of counter indexed rb_trees?
>>
>> No, single tree, just more complicated ordering rules.
>>
>>>> A trivial variant keeps a simple counter per tree that is incremented
>>>> for each rotation. That should end up with the events ordered exactly
>>>> like the list. And if you have that comparator like above, expressing
>>>> that additional ordering becomes simple ;-)
>>>>
>>>> Something like:
>>>>
>>>> struct group {
>>>>   u64 vtime;
>>>>   rb_tree tree;
>>>> };
>>>>
>>>> bool event_less(left, right)
>>>> {
>>>>   if (left->cpu < right->cpu)
>>>>     return true;
>>>>
>>>>   if (left->cpu > right_cpu)
>>>>     return false;
>>>>
>>>>   if (left->vtime < right->vtime)
>>>>     return true;
>>>>
>>>>   return false;
>>>> }
>>>>
>>>> insert_group(group, event, tail)
>>>> {
>>>>   if (tail)
>>>>     event->vtime = ++group->vtime;
>>>>
>>>>   tree_insert(&group->root, event);
>>>> }
>>>>
>>>> Then every time you use insert_group(.tail=1) it goes to the end of that
>>>> CPU's 'list'.
>>>>
>>>
>>> Could you elaborate more on how to implement rotation?
>>
>> Its almost all there, but let me write a complete replacement for your
>> perf_event_group_rotate() above.
>>
>> /* find the leftmost event matching @cpu */
>> /* XXX not sure how to best parametrise a subtree search, */
>> /* again, C sucks... */
>> struct perf_event *__group_find_cpu(group, cpu)
>> {
>> 	struct rb_node *node = group->tree.rb_node;
>> 	struct perf_event *event, *match = NULL;
>>
>> 	while (node) {
>> 		event = container_of(node, struct perf_event, group_node);
>>
>> 		if (cpu > event->cpu) {
>> 			node = node->rb_right;
>> 		} else if (cpu < event->cpu) {
>> 			node = node->rb_left;
>> 		} else {
>> 			/*
>> 			 * subtree match, try left subtree for a
>> 			 * 'smaller' match.
>> 			 */
>> 			match = event;
>> 			node = node->rb_left;
>> 		}
>> 	}
>>
>> 	return match;
>> }
>>
>> void perf_event_group_rotate(group, int cpu)
>> {
>> 	struct perf_event *event = __group_find_cpu(cpu);
>>
>> 	if (!event)
>> 		return;
>>
>> 	tree_delete(&group->tree, event);
>> 	insert_group(group, event, 1);
>> }
>>
>> So we have a tree ordered by {cpu,vtime} and what we do is find the
>> leftmost {cpu} entry, that is the smallest vtime entry for that cpu. We
>> then take it out and re-insert it with a vtime number larger than any
>> other, which places it as the rightmost entry for that cpu.
>>
>>
>> So given:
>>
>>        {1,1}
>>        / \
>>     {0,5} {1,2}
>>    / \        \
>> {0,1} {0,6}  {1,4}
>>
>>
>> __group_find_cpu(.cpu=1) will return {1,1} as being the leftmost entry
>> with cpu=1. We'll then remove it, update its vtime to 7 and re-insert.
>> resulting in something like:
>>
>>        {1,2}
>>        / \
>>     {0,5} {1,4}
>>    / \        \
>> {0,1} {0,6}  {1,7}
>>
> 
> Makes sense. The implementation becomes a bit simpler. The drawbacks 
> may be several rotations of potentially big tree on the critical path, 
> instead of updating four pointers in case of the tree of lists.

I implemented the approach you had suggested (as I understood it),
tested it and got results that are drastically different from what 
I am getting for the tree of lists. Specifically I did:

1. keeping all groups in the same single tree by employing a 64-bit index
   in addition to the CPU key;
   
2. implementing a special _less() function and rotation by re-inserting
   the group with an incremented index;

3. replacing the API with a callback in the signature by a macro,
   perf_event_groups_for_each();

Employing all that shrunk the total patch size; however, I am still 
struggling with correctness issues.

Now I have figured out that events with the same cpu are not always located 
under a single subtree root; it depends on the order of insertion, 
e.g. with insertion order 01,02,03,14,15,16 we get this:

     02
    /  \
   01  14
      /  \
     03  15
           \
           16

and it is unclear how to iterate the cpu==0 part of the tree in this case.

Iterating cpu specific subtree like this:

#define for_each_group_event(event, group, cpu, pmu, field)	 \
	for (event = rb_entry_safe(group_first(group, cpu, pmu), \
				   typeof(*event), field);	 \
	     event && event->cpu == cpu && event->pmu == pmu;	 \
	     event = rb_entry_safe(rb_next(&event->field),	 \
				   typeof(*event), field))

misses event==03 in the case above, and I guess this is where I lose 
samples in my testing.
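
One way to make such a walk see every node with the same cpu (assuming the
tree is ordered strictly by {cpu, index} and pmu is only a filter) would be
to start from the leftmost matching node and follow rb_next() until the cpu
key changes, since an in-order walk visits all equal-cpu nodes consecutively
regardless of where rebalancing places them. If the pmu check has to stay,
it would need to skip non-matching events rather than terminate the loop.
A sketch, with illustrative names:

static struct perf_event *
group_first(struct rb_root *groups, int cpu)
{
	struct rb_node *node = groups->rb_node;
	struct perf_event *event, *match = NULL;

	while (node) {
		event = rb_entry(node, struct perf_event, group_node);

		if (cpu < event->cpu) {
			node = node->rb_left;
		} else if (cpu > event->cpu) {
			node = node->rb_right;
		} else {
			/* remember the match, keep looking left for a smaller index */
			match = event;
			node = node->rb_left;
		}
	}

	return match;
}

#define for_each_cpu_group_event(event, groups, cpu)			\
	for (event = group_first(groups, cpu);				\
	     event && event->cpu == (cpu);				\
	     event = rb_entry_safe(rb_next(&event->group_node),	\
				   struct perf_event, group_node))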

Please advise how to proceed.

Thanks,
Alexey

> 
>>
>>
>>
>>
> 
> 

^ permalink raw reply	[flat|nested] 76+ messages in thread

* [PATCH v7 0/2] perf/core: addressing 4x slowdown during per-process profiling of STREAM benchmark on Intel Xeon Phi
  2017-08-02  8:11 [PATCH v6 0/3] perf/core: addressing 4x slowdown during per-process profiling of STREAM benchmark on Intel Xeon Phi Alexey Budankov
                   ` (2 preceding siblings ...)
  2017-08-02  8:16 ` [PATCH v6 3/3]: perf/core: add mux switch to skip to the current CPU's events list on mux interrupt Alexey Budankov
@ 2017-08-18  5:17 ` Alexey Budankov
  2017-08-18  5:21   ` [PATCH v7 1/2] perf/core: use rb trees for pinned/flexible groups Alexey Budankov
                     ` (2 more replies)
  3 siblings, 3 replies; 76+ messages in thread
From: Alexey Budankov @ 2017-08-18  5:17 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Alexander Shishkin
  Cc: Andi Kleen, Kan Liang, Dmitri Prokhorov, Valery Cherepennikov,
	Mark Rutland, Stephane Eranian, David Carrillo-Cisneros,
	linux-kernel

Hi,

This patch set v7 moves event groups into rb trees and implements 
skipping to the current CPU's list on hrtimer interrupt.

Events allocated for the same CPU are still kept in a linked list
of the event directly attached to the tree, because it is unclear 
how to implement fast iteration through events allocated for 
the same CPU when they are all attached to a tree employing an 
additional 64-bit index as a secondary tree key.

The patch set addresses feedback captured previously. Specifically,
the API with a callback in the signature is replaced by a macro, which
reduced the size of the adapting changes.

Patches in the set are expected to be applied one after another in 
the mentioned order and they are logically split into two parts 
to simplify the review process.

For more background details and feedback of the patch set please 
refer to v6 and older.

Thanks,
Alexey

---
 Alexey Budankov (2):
	perf/core: use rb trees for pinned/flexible groups
	perf/core: add mux switch to skip to the current CPU's events list on mux interrupt

 include/linux/perf_event.h |  19 +-
 kernel/events/core.c       | 463 ++++++++++++++++++++++++++++++++++-----------
 2 files changed, 364 insertions(+), 118 deletions(-)

^ permalink raw reply	[flat|nested] 76+ messages in thread

* [PATCH v7 1/2] perf/core: use rb trees for pinned/flexible groups
  2017-08-18  5:17 ` [PATCH v7 0/2] perf/core: addressing 4x slowdown during per-process profiling of STREAM benchmark on Intel Xeon Phi Alexey Budankov
@ 2017-08-18  5:21   ` Alexey Budankov
  2017-08-23 11:17     ` Alexander Shishkin
  2017-08-18  5:22   ` [PATCH v7 2/2] perf/core: add mux switch to skip to the current CPU's events list on mux interrupt Alexey Budankov
  2017-08-22 20:21   ` [PATCH v7 0/2] perf/core: addressing 4x slowdown during per-process profiling of STREAM benchmark on Intel Xeon Phi Peter Zijlstra
  2 siblings, 1 reply; 76+ messages in thread
From: Alexey Budankov @ 2017-08-18  5:21 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Alexander Shishkin
  Cc: Andi Kleen, Kan Liang, Dmitri Prokhorov, Valery Cherepennikov,
	Mark Rutland, Stephane Eranian, David Carrillo-Cisneros,
	linux-kernel

This patch moves event groups into an rb tree sorted by CPU, so that the 
multiplexing hrtimer interrupt handler is able to skip to the current 
CPU's list and ignore groups allocated for the other CPUs.

A new API for manipulating event groups in the trees is implemented, as well 
as adoption of the API in the current implementation.

Signed-off-by: Alexey Budankov <alexey.budankov@linux.intel.com>
---
 include/linux/perf_event.h |  19 ++-
 kernel/events/core.c       | 314 +++++++++++++++++++++++++++++++++------------
 2 files changed, 249 insertions(+), 84 deletions(-)

diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index b14095b..cc07904 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -572,7 +572,20 @@ struct perf_event {
 	 */
 	struct list_head		group_entry;
 	struct list_head		sibling_list;
-
+	/*
+	 * Node on the pinned or flexible tree located at the event context;
+	 * the node may be empty in case its event is not directly attached
+	 * to the tree but to group_list list of the event directly
+	 * attached to the tree;
+	 */
+	struct rb_node			group_node;
+	/*
+	 * List keeps groups allocated for the same cpu;
+	 * the list may be empty in case its event is not directly
+	 * attached to the tree but to group_list list of the event directly
+	 * attached to the tree;
+	 */
+	struct list_head		group_list;
 	/*
 	 * We need storage to track the entries in perf_pmu_migrate_context; we
 	 * cannot use the event_entry because of RCU and we want to keep the
@@ -741,8 +754,8 @@ struct perf_event_context {
 	struct mutex			mutex;
 
 	struct list_head		active_ctx_list;
-	struct list_head		pinned_groups;
-	struct list_head		flexible_groups;
+	struct rb_root			pinned_groups;
+	struct rb_root			flexible_groups;
 	struct list_head		event_list;
 	int				nr_events;
 	int				nr_active;
diff --git a/kernel/events/core.c b/kernel/events/core.c
index d704e23..08ccfb2 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -1466,8 +1466,12 @@ static enum event_type_t get_event_type(struct perf_event *event)
 	return event_type;
 }
 
-static struct list_head *
-ctx_group_list(struct perf_event *event, struct perf_event_context *ctx)
+/*
+ * Extract pinned or flexible groups from the context
+ * based on event attrs bits;
+ */
+static struct rb_root *
+get_event_groups(struct perf_event *event, struct perf_event_context *ctx)
 {
 	if (event->attr.pinned)
 		return &ctx->pinned_groups;
@@ -1476,6 +1480,143 @@ ctx_group_list(struct perf_event *event, struct perf_event_context *ctx)
 }
 
 /*
+ * Insert a group into a tree using event->cpu as a key. If event->cpu node
+ * is already attached to the tree then the event is added to the attached
+ * group's group_list list.
+ */
+static void
+perf_event_groups_insert(struct rb_root *groups, struct perf_event *event)
+{
+	struct perf_event *node_event;
+	struct rb_node *parent;
+	struct rb_node **node;
+
+	node = &groups->rb_node;
+	parent = *node;
+
+	while (*node) {
+		parent = *node;
+		node_event = container_of(*node,
+				struct perf_event, group_node);
+
+		if (event->cpu < node_event->cpu) {
+			node = &parent->rb_left;
+		} else if (event->cpu > node_event->cpu) {
+			node = &parent->rb_right;
+		} else {
+			list_add_tail(&event->group_entry,
+					&node_event->group_list);
+			return;
+		}
+	}
+
+	list_add_tail(&event->group_entry, &event->group_list);
+
+	rb_link_node(&event->group_node, parent, node);
+	rb_insert_color(&event->group_node, groups);
+}
+
+/*
+ * Helper function to insert event into the pinned or
+ * flexible groups;
+ */
+static void
+add_event_to_groups(struct perf_event *event, struct perf_event_context *ctx)
+{
+	struct rb_root *groups;
+
+	groups = get_event_groups(event, ctx);
+	perf_event_groups_insert(groups, event);
+}
+
+/*
+ * Delete a group from a tree. If the group is directly attached to the tree
+ * it is replaced by the next group on the group's group_list.
+ */
+static void
+perf_event_groups_delete(struct rb_root *groups, struct perf_event *event)
+{
+	list_del_init(&event->group_entry);
+
+	if (!RB_EMPTY_NODE(&event->group_node)) {
+		if (!RB_EMPTY_ROOT(groups)) {
+			if (list_empty(&event->group_list)) {
+				rb_erase(&event->group_node, groups);
+			} else {
+				struct perf_event *next =
+					list_first_entry(&event->group_list,
+						struct perf_event, group_entry);
+				list_replace_init(&event->group_list,
+						&next->group_list);
+				rb_replace_node(&event->group_node,
+						&next->group_node, groups);
+			}
+		}
+		RB_CLEAR_NODE(&event->group_node);
+	}
+}
+
+/*
+ * Helper function to delete event from its groups;
+ */
+static void
+del_event_from_groups(struct perf_event *event, struct perf_event_context *ctx)
+{
+	struct rb_root *groups;
+
+	groups = get_event_groups(event, ctx);
+	perf_event_groups_delete(groups, event);
+}
+
+/*
+ * Find group_list list by a cpu key.
+ */
+static struct list_head *
+perf_event_groups_get_list(struct rb_root *groups, int cpu)
+{
+	struct perf_event *node_event;
+	struct rb_node *node;
+
+	node = groups->rb_node;
+
+	while (node) {
+		node_event = container_of(node,
+				struct perf_event, group_node);
+
+		if (cpu < node_event->cpu) {
+			node = node->rb_left;
+		} else if (cpu > node_event->cpu) {
+			node = node->rb_right;
+		} else {
+			return &node_event->group_list;
+		}
+	}
+
+	return NULL;
+}
+
+/*
+ * Find group list by a cpu key and rotate it.
+ */
+static void
+perf_event_groups_rotate(struct rb_root *groups, int cpu)
+{
+	struct list_head *group_list =
+			perf_event_groups_get_list(groups, cpu);
+
+	if (group_list)
+		list_rotate_left(group_list);
+}
+
+/*
+ * Iterate event groups thru the whole tree.
+ */
+#define perf_event_groups_for_each(event, iter, tree, node, list, link) \
+	   for (iter = rb_first(tree); iter; iter = rb_next(iter))	\
+		list_for_each_entry(event, &(rb_entry(iter,		\
+			typeof(*event), node)->list), link)
+
+/*
  * Add a event from the lists for its context.
  * Must be called with ctx->mutex and ctx->lock held.
  */
@@ -1493,12 +1634,8 @@ list_add_event(struct perf_event *event, struct perf_event_context *ctx)
 	 * perf_group_detach can, at all times, locate all siblings.
 	 */
 	if (event->group_leader == event) {
-		struct list_head *list;
-
 		event->group_caps = event->event_caps;
-
-		list = ctx_group_list(event, ctx);
-		list_add_tail(&event->group_entry, list);
+		add_event_to_groups(event, ctx);
 	}
 
 	list_update_cgroup_event(event, ctx, true);
@@ -1689,7 +1826,7 @@ list_del_event(struct perf_event *event, struct perf_event_context *ctx)
 	list_del_rcu(&event->event_entry);
 
 	if (event->group_leader == event)
-		list_del_init(&event->group_entry);
+		del_event_from_groups(event, ctx);
 
 	update_group_times(event);
 
@@ -1709,7 +1846,6 @@ list_del_event(struct perf_event *event, struct perf_event_context *ctx)
 static void perf_group_detach(struct perf_event *event)
 {
 	struct perf_event *sibling, *tmp;
-	struct list_head *list = NULL;
 
 	lockdep_assert_held(&event->ctx->lock);
 
@@ -1730,22 +1866,22 @@ static void perf_group_detach(struct perf_event *event)
 		goto out;
 	}
 
-	if (!list_empty(&event->group_entry))
-		list = &event->group_entry;
-
 	/*
 	 * If this was a group event with sibling events then
 	 * upgrade the siblings to singleton events by adding them
 	 * to whatever list we are on.
 	 */
 	list_for_each_entry_safe(sibling, tmp, &event->sibling_list, group_entry) {
-		if (list)
-			list_move_tail(&sibling->group_entry, list);
 		sibling->group_leader = sibling;
 
 		/* Inherit group flags from the previous leader */
 		sibling->group_caps = event->group_caps;
 
+		if (!list_empty(&event->group_entry)) {
+			list_del_init(&sibling->group_entry);
+			add_event_to_groups(sibling, event->ctx);
+		}
+
 		WARN_ON_ONCE(sibling->ctx != event->ctx);
 	}
 
@@ -2744,7 +2880,7 @@ static void ctx_sched_out(struct perf_event_context *ctx,
 {
 	int is_active = ctx->is_active;
 	struct perf_event *event;
-
+	struct rb_node *node;
 	lockdep_assert_held(&ctx->lock);
 
 	if (likely(!ctx->nr_events)) {
@@ -2789,15 +2925,19 @@ static void ctx_sched_out(struct perf_event_context *ctx,
 		return;
 
 	perf_pmu_disable(ctx->pmu);
-	if (is_active & EVENT_PINNED) {
-		list_for_each_entry(event, &ctx->pinned_groups, group_entry)
+
+	if (is_active & EVENT_PINNED)
+		perf_event_groups_for_each(event, node,
+				&ctx->pinned_groups, group_node,
+				group_list, group_entry)
 			group_sched_out(event, cpuctx, ctx);
-	}
 
-	if (is_active & EVENT_FLEXIBLE) {
-		list_for_each_entry(event, &ctx->flexible_groups, group_entry)
+	if (is_active & EVENT_FLEXIBLE)
+		perf_event_groups_for_each(event, node,
+				&ctx->flexible_groups, group_node,
+				group_list, group_entry)
 			group_sched_out(event, cpuctx, ctx);
-	}
+
 	perf_pmu_enable(ctx->pmu);
 }
 
@@ -3091,61 +3231,55 @@ static void cpu_ctx_sched_out(struct perf_cpu_context *cpuctx,
 }
 
 static void
-ctx_pinned_sched_in(struct perf_event_context *ctx,
-		    struct perf_cpu_context *cpuctx)
+ctx_pinned_sched_in(struct perf_event *event,
+		   struct perf_cpu_context *cpuctx,
+		   struct perf_event_context *ctx)
 {
-	struct perf_event *event;
-
-	list_for_each_entry(event, &ctx->pinned_groups, group_entry) {
-		if (event->state <= PERF_EVENT_STATE_OFF)
-			continue;
-		if (!event_filter_match(event))
-			continue;
+	if (event->state <= PERF_EVENT_STATE_OFF)
+		return;
+	if (!event_filter_match(event))
+		return;
 
-		/* may need to reset tstamp_enabled */
-		if (is_cgroup_event(event))
-			perf_cgroup_mark_enabled(event, ctx);
+	/* may need to reset tstamp_enabled */
+	if (is_cgroup_event(event))
+		perf_cgroup_mark_enabled(event, ctx);
 
-		if (group_can_go_on(event, cpuctx, 1))
-			group_sched_in(event, cpuctx, ctx);
+	if (group_can_go_on(event, cpuctx, 1))
+		group_sched_in(event, cpuctx, ctx);
 
-		/*
-		 * If this pinned group hasn't been scheduled,
-		 * put it in error state.
-		 */
-		if (event->state == PERF_EVENT_STATE_INACTIVE) {
-			update_group_times(event);
-			event->state = PERF_EVENT_STATE_ERROR;
-		}
+	/*
+	 * If this pinned group hasn't been scheduled,
+	 * put it in error state.
+	 */
+	if (event->state == PERF_EVENT_STATE_INACTIVE) {
+		update_group_times(event);
+		event->state = PERF_EVENT_STATE_ERROR;
 	}
 }
 
 static void
-ctx_flexible_sched_in(struct perf_event_context *ctx,
-		      struct perf_cpu_context *cpuctx)
+ctx_flexible_sched_in(struct perf_event *event,
+		     struct perf_cpu_context *cpuctx,
+		     struct perf_event_context *ctx,
+		     int *can_add_hw)
 {
-	struct perf_event *event;
-	int can_add_hw = 1;
-
-	list_for_each_entry(event, &ctx->flexible_groups, group_entry) {
-		/* Ignore events in OFF or ERROR state */
-		if (event->state <= PERF_EVENT_STATE_OFF)
-			continue;
-		/*
-		 * Listen to the 'cpu' scheduling filter constraint
-		 * of events:
-		 */
-		if (!event_filter_match(event))
-			continue;
+	/* Ignore events in OFF or ERROR state */
+	if (event->state <= PERF_EVENT_STATE_OFF)
+		return;
+	/*
+	 * Listen to the 'cpu' scheduling filter constraint
+	 * of events:
+	 */
+	if (!event_filter_match(event))
+		return;
 
-		/* may need to reset tstamp_enabled */
-		if (is_cgroup_event(event))
-			perf_cgroup_mark_enabled(event, ctx);
+	/* may need to reset tstamp_enabled */
+	if (is_cgroup_event(event))
+		perf_cgroup_mark_enabled(event, ctx);
 
-		if (group_can_go_on(event, cpuctx, can_add_hw)) {
-			if (group_sched_in(event, cpuctx, ctx))
-				can_add_hw = 0;
-		}
+	if (group_can_go_on(event, cpuctx, *can_add_hw)) {
+		if (group_sched_in(event, cpuctx, ctx))
+			*can_add_hw = 0;
 	}
 }
 
@@ -3156,7 +3290,8 @@ ctx_sched_in(struct perf_event_context *ctx,
 	     struct task_struct *task)
 {
 	int is_active = ctx->is_active;
-	u64 now;
+	struct perf_event *event;
+	struct rb_node *node;
 
 	lockdep_assert_held(&ctx->lock);
 
@@ -3175,7 +3310,7 @@ ctx_sched_in(struct perf_event_context *ctx,
 
 	if (is_active & EVENT_TIME) {
 		/* start ctx time */
-		now = perf_clock();
+		u64 now = perf_clock();
 		ctx->timestamp = now;
 		perf_cgroup_set_timestamp(task, ctx);
 	}
@@ -3185,11 +3320,19 @@ ctx_sched_in(struct perf_event_context *ctx,
 	 * in order to give them the best chance of going on.
 	 */
 	if (is_active & EVENT_PINNED)
-		ctx_pinned_sched_in(ctx, cpuctx);
+		perf_event_groups_for_each(event, node,
+				&ctx->pinned_groups, group_node,
+				group_list, group_entry)
+			ctx_pinned_sched_in(event, cpuctx, ctx);
 
 	/* Then walk through the lower prio flexible groups */
-	if (is_active & EVENT_FLEXIBLE)
-		ctx_flexible_sched_in(ctx, cpuctx);
+	if (is_active & EVENT_FLEXIBLE) {
+		int can_add_hw = 1;
+		perf_event_groups_for_each(event, node,
+				&ctx->flexible_groups, group_node,
+				group_list, group_entry)
+			ctx_flexible_sched_in(event, cpuctx, ctx, &can_add_hw);
+	}
 }
 
 static void cpu_ctx_sched_in(struct perf_cpu_context *cpuctx,
@@ -3227,7 +3370,7 @@ static void perf_event_context_sched_in(struct perf_event_context *ctx,
 	 * However, if task's ctx is not carrying any pinned
 	 * events, no need to flip the cpuctx's events around.
 	 */
-	if (!list_empty(&ctx->pinned_groups))
+	if (!RB_EMPTY_ROOT(&ctx->pinned_groups))
 		cpu_ctx_sched_out(cpuctx, EVENT_FLEXIBLE);
 	perf_event_sched_in(cpuctx, ctx, task);
 	perf_pmu_enable(ctx->pmu);
@@ -3464,8 +3607,12 @@ static void rotate_ctx(struct perf_event_context *ctx)
 	 * Rotate the first entry last of non-pinned groups. Rotation might be
 	 * disabled by the inheritance code.
 	 */
-	if (!ctx->rotate_disable)
-		list_rotate_left(&ctx->flexible_groups);
+	if (!ctx->rotate_disable) {
+		int sw = -1, cpu = smp_processor_id();
+
+		perf_event_groups_rotate(&ctx->flexible_groups, sw);
+		perf_event_groups_rotate(&ctx->flexible_groups, cpu);
+	}
 }
 
 static int perf_rotate_context(struct perf_cpu_context *cpuctx)
@@ -3804,8 +3951,8 @@ static void __perf_event_init_context(struct perf_event_context *ctx)
 	raw_spin_lock_init(&ctx->lock);
 	mutex_init(&ctx->mutex);
 	INIT_LIST_HEAD(&ctx->active_ctx_list);
-	INIT_LIST_HEAD(&ctx->pinned_groups);
-	INIT_LIST_HEAD(&ctx->flexible_groups);
+	ctx->pinned_groups = RB_ROOT;
+	ctx->flexible_groups = RB_ROOT;
 	INIT_LIST_HEAD(&ctx->event_list);
 	atomic_set(&ctx->refcount, 1);
 }
@@ -9412,6 +9559,8 @@ perf_event_alloc(struct perf_event_attr *attr, int cpu,
 	INIT_LIST_HEAD(&event->group_entry);
 	INIT_LIST_HEAD(&event->event_entry);
 	INIT_LIST_HEAD(&event->sibling_list);
+	RB_CLEAR_NODE(&event->group_node);
+	INIT_LIST_HEAD(&event->group_list);
 	INIT_LIST_HEAD(&event->rb_entry);
 	INIT_LIST_HEAD(&event->active_entry);
 	INIT_LIST_HEAD(&event->addr_filters.list);
@@ -10839,9 +10988,9 @@ static int inherit_group(struct perf_event *parent_event,
  */
 static int
 inherit_task_group(struct perf_event *event, struct task_struct *parent,
-		   struct perf_event_context *parent_ctx,
-		   struct task_struct *child, int ctxn,
-		   int *inherited_all)
+		  struct perf_event_context *parent_ctx,
+		  struct task_struct *child, int ctxn,
+		  int *inherited_all)
 {
 	int ret;
 	struct perf_event_context *child_ctx;
@@ -10859,7 +11008,7 @@ inherit_task_group(struct perf_event *event, struct task_struct *parent,
 		 * First allocate and initialize a context for the
 		 * child.
 		 */
-		child_ctx = alloc_perf_context(parent_ctx->pmu, child);
+		child_ctx = alloc_perf_context(parent_ctx->pmu,	child);
 		if (!child_ctx)
 			return -ENOMEM;
 
@@ -10883,6 +11032,7 @@ static int perf_event_init_context(struct task_struct *child, int ctxn)
 	struct perf_event_context *child_ctx, *parent_ctx;
 	struct perf_event_context *cloned_ctx;
 	struct perf_event *event;
+	struct rb_node *node;
 	struct task_struct *parent = current;
 	int inherited_all = 1;
 	unsigned long flags;
@@ -10916,7 +11066,8 @@ static int perf_event_init_context(struct task_struct *child, int ctxn)
 	 * We dont have to disable NMIs - we are only looking at
 	 * the list, not manipulating it:
 	 */
-	list_for_each_entry(event, &parent_ctx->pinned_groups, group_entry) {
+	perf_event_groups_for_each(event, node,	&parent_ctx->pinned_groups,
+			group_node, group_list, group_entry) {
 		ret = inherit_task_group(event, parent, parent_ctx,
 					 child, ctxn, &inherited_all);
 		if (ret)
@@ -10932,7 +11083,8 @@ static int perf_event_init_context(struct task_struct *child, int ctxn)
 	parent_ctx->rotate_disable = 1;
 	raw_spin_unlock_irqrestore(&parent_ctx->lock, flags);
 
-	list_for_each_entry(event, &parent_ctx->flexible_groups, group_entry) {
+	perf_event_groups_for_each(event, node,	&parent_ctx->flexible_groups,
+			group_node, group_list, group_entry) {
 		ret = inherit_task_group(event, parent, parent_ctx,
 					 child, ctxn, &inherited_all);
 		if (ret)

^ permalink raw reply related	[flat|nested] 76+ messages in thread

* [PATCH v7 2/2] perf/core: add mux switch to skip to the current CPU's events list on mux interrupt
  2017-08-18  5:17 ` [PATCH v7 0/2] perf/core: addressing 4x slowdown during per-process profiling of STREAM benchmark on Intel Xeon Phi Alexey Budankov
  2017-08-18  5:21   ` [PATCH v7 1/2] perf/core: use rb trees for pinned/flexible groups Alexey Budankov
@ 2017-08-18  5:22   ` Alexey Budankov
  2017-08-23 11:54     ` Alexander Shishkin
  2017-08-22 20:21   ` [PATCH v7 0/2] perf/core: addressing 4x slowdown during per-process profiling of STREAM benchmark on Intel Xeon Phi Peter Zijlstra
  2 siblings, 1 reply; 76+ messages in thread
From: Alexey Budankov @ 2017-08-18  5:22 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Alexander Shishkin
  Cc: Andi Kleen, Kan Liang, Dmitri Prokhorov, Valery Cherepennikov,
	Mark Rutland, Stephane Eranian, David Carrillo-Cisneros,
	linux-kernel

This patch implements a mux switch that triggers skipping to the 
current CPU's events list in the multiplexing hrtimer interrupt 
handler, as well as adoption of the switch in the existing 
implementation.

The perf_event_groups_iterate_cpu() API is introduced to implement 
iteration through a specific CPU's groups list, skipping groups 
allocated for the other CPUs.
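
The new perf_event_groups_for_each_cpu() macro below relies on
perf_event_groups_get_list() from patch 1/2, which is not quoted here;
a minimal sketch of that lookup, assuming the group_node/group_list
fields added in patch 1/2, could look like this:

static struct list_head *
perf_event_groups_get_list(struct rb_root *tree, int cpu)
{
	struct rb_node *node = tree->rb_node;
	struct perf_event *event;

	/* binary search by the cpu key of the in-tree group leaders */
	while (node) {
		event = container_of(node, struct perf_event, group_node);

		if (cpu < event->cpu)
			node = node->rb_left;
		else if (cpu > event->cpu)
			node = node->rb_right;
		else
			return &event->group_list; /* same-cpu groups list */
	}

	return NULL; /* no groups allocated for this cpu */
}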

Signed-off-by: Alexey Budankov <alexey.budankov@linux.intel.com>
---
 kernel/events/core.c | 193 ++++++++++++++++++++++++++++++++++++---------------
 1 file changed, 137 insertions(+), 56 deletions(-)

diff --git a/kernel/events/core.c b/kernel/events/core.c
index 08ccfb2..aeb0f81 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -556,11 +556,11 @@ void perf_sample_event_took(u64 sample_len_ns)
 static atomic64_t perf_event_id;
 
 static void cpu_ctx_sched_out(struct perf_cpu_context *cpuctx,
-			      enum event_type_t event_type);
+			      enum event_type_t event_type, int mux);
 
 static void cpu_ctx_sched_in(struct perf_cpu_context *cpuctx,
 			     enum event_type_t event_type,
-			     struct task_struct *task);
+			     struct task_struct *task, int mux);
 
 static void update_context_time(struct perf_event_context *ctx);
 static u64 perf_event_time(struct perf_event *event);
@@ -702,6 +702,7 @@ static void perf_cgroup_switch(struct task_struct *task, int mode)
 	struct perf_cpu_context *cpuctx;
 	struct list_head *list;
 	unsigned long flags;
+	int mux = 0;
 
 	/*
 	 * Disable interrupts and preemption to avoid this CPU's
@@ -717,7 +718,7 @@ static void perf_cgroup_switch(struct task_struct *task, int mode)
 		perf_pmu_disable(cpuctx->ctx.pmu);
 
 		if (mode & PERF_CGROUP_SWOUT) {
-			cpu_ctx_sched_out(cpuctx, EVENT_ALL);
+			cpu_ctx_sched_out(cpuctx, EVENT_ALL, mux);
 			/*
 			 * must not be done before ctxswout due
 			 * to event_filter_match() in event_sched_out()
@@ -736,7 +737,7 @@ static void perf_cgroup_switch(struct task_struct *task, int mode)
 			 */
 			cpuctx->cgrp = perf_cgroup_from_task(task,
 							     &cpuctx->ctx);
-			cpu_ctx_sched_in(cpuctx, EVENT_ALL, task);
+			cpu_ctx_sched_in(cpuctx, EVENT_ALL, task, mux);
 		}
 		perf_pmu_enable(cpuctx->ctx.pmu);
 		perf_ctx_unlock(cpuctx, cpuctx->task_ctx);
@@ -1613,8 +1614,16 @@ perf_event_groups_rotate(struct rb_root *groups, int cpu)
  */
 #define perf_event_groups_for_each(event, iter, tree, node, list, link) \
 	   for (iter = rb_first(tree); iter; iter = rb_next(iter))	\
-		list_for_each_entry(event, &(rb_entry(iter,		\
-			typeof(*event), node)->list), link)
+		list_for_each_entry(event, &(rb_entry(iter, 		\
+				typeof(*event), node)->list), link)
+
+/*
+ * Iterate event groups related to specific cpu.
+ */
+#define perf_event_groups_for_each_cpu(event, cpu, tree, list, link)	\
+		list = perf_event_groups_get_list(tree, cpu);		\
+		if (list)						\
+			list_for_each_entry(event, list, link)
 
 /*
  * Add a event from the lists for its context.
@@ -2397,36 +2406,38 @@ static void add_event_to_ctx(struct perf_event *event,
 
 static void ctx_sched_out(struct perf_event_context *ctx,
 			  struct perf_cpu_context *cpuctx,
-			  enum event_type_t event_type);
+			  enum event_type_t event_type, int mux);
 static void
 ctx_sched_in(struct perf_event_context *ctx,
 	     struct perf_cpu_context *cpuctx,
 	     enum event_type_t event_type,
-	     struct task_struct *task);
+	     struct task_struct *task, int mux);
 
 static void task_ctx_sched_out(struct perf_cpu_context *cpuctx,
 			       struct perf_event_context *ctx,
 			       enum event_type_t event_type)
 {
+	int mux = 0;
+
 	if (!cpuctx->task_ctx)
 		return;
 
 	if (WARN_ON_ONCE(ctx != cpuctx->task_ctx))
 		return;
 
-	ctx_sched_out(ctx, cpuctx, event_type);
+	ctx_sched_out(ctx, cpuctx, event_type, mux);
 }
 
 static void perf_event_sched_in(struct perf_cpu_context *cpuctx,
 				struct perf_event_context *ctx,
-				struct task_struct *task)
+				struct task_struct *task, int mux)
 {
-	cpu_ctx_sched_in(cpuctx, EVENT_PINNED, task);
+	cpu_ctx_sched_in(cpuctx, EVENT_PINNED, task, mux);
 	if (ctx)
-		ctx_sched_in(ctx, cpuctx, EVENT_PINNED, task);
-	cpu_ctx_sched_in(cpuctx, EVENT_FLEXIBLE, task);
+		ctx_sched_in(ctx, cpuctx, EVENT_PINNED, task, mux);
+	cpu_ctx_sched_in(cpuctx, EVENT_FLEXIBLE, task, mux);
 	if (ctx)
-		ctx_sched_in(ctx, cpuctx, EVENT_FLEXIBLE, task);
+		ctx_sched_in(ctx, cpuctx, EVENT_FLEXIBLE, task, mux);
 }
 
 /*
@@ -2450,6 +2461,7 @@ static void ctx_resched(struct perf_cpu_context *cpuctx,
 {
 	enum event_type_t ctx_event_type = event_type & EVENT_ALL;
 	bool cpu_event = !!(event_type & EVENT_CPU);
+	int mux = 0;
 
 	/*
 	 * If pinned groups are involved, flexible groups also need to be
@@ -2470,11 +2482,11 @@ static void ctx_resched(struct perf_cpu_context *cpuctx,
 	 *  - otherwise, do nothing more.
 	 */
 	if (cpu_event)
-		cpu_ctx_sched_out(cpuctx, ctx_event_type);
+		cpu_ctx_sched_out(cpuctx, ctx_event_type, mux);
 	else if (ctx_event_type & EVENT_PINNED)
-		cpu_ctx_sched_out(cpuctx, EVENT_FLEXIBLE);
+		cpu_ctx_sched_out(cpuctx, EVENT_FLEXIBLE, mux);
 
-	perf_event_sched_in(cpuctx, task_ctx, current);
+	perf_event_sched_in(cpuctx, task_ctx, current, mux);
 	perf_pmu_enable(cpuctx->ctx.pmu);
 }
 
@@ -2491,7 +2503,7 @@ static int  __perf_install_in_context(void *info)
 	struct perf_cpu_context *cpuctx = __get_cpu_context(ctx);
 	struct perf_event_context *task_ctx = cpuctx->task_ctx;
 	bool reprogram = true;
-	int ret = 0;
+	int ret = 0, mux =0;
 
 	raw_spin_lock(&cpuctx->ctx.lock);
 	if (ctx->task) {
@@ -2518,7 +2530,7 @@ static int  __perf_install_in_context(void *info)
 	}
 
 	if (reprogram) {
-		ctx_sched_out(ctx, cpuctx, EVENT_TIME);
+		ctx_sched_out(ctx, cpuctx, EVENT_TIME, mux);
 		add_event_to_ctx(event, ctx);
 		ctx_resched(cpuctx, task_ctx, get_event_type(event));
 	} else {
@@ -2655,13 +2667,14 @@ static void __perf_event_enable(struct perf_event *event,
 {
 	struct perf_event *leader = event->group_leader;
 	struct perf_event_context *task_ctx;
+	int mux = 0;
 
 	if (event->state >= PERF_EVENT_STATE_INACTIVE ||
 	    event->state <= PERF_EVENT_STATE_ERROR)
 		return;
 
 	if (ctx->is_active)
-		ctx_sched_out(ctx, cpuctx, EVENT_TIME);
+		ctx_sched_out(ctx, cpuctx, EVENT_TIME, mux);
 
 	__perf_event_mark_enabled(event);
 
@@ -2671,7 +2684,7 @@ static void __perf_event_enable(struct perf_event *event,
 	if (!event_filter_match(event)) {
 		if (is_cgroup_event(event))
 			perf_cgroup_defer_enabled(event);
-		ctx_sched_in(ctx, cpuctx, EVENT_TIME, current);
+		ctx_sched_in(ctx, cpuctx, EVENT_TIME, current, mux);
 		return;
 	}
 
@@ -2680,7 +2693,7 @@ static void __perf_event_enable(struct perf_event *event,
 	 * then don't put it on unless the group is on.
 	 */
 	if (leader != event && leader->state != PERF_EVENT_STATE_ACTIVE) {
-		ctx_sched_in(ctx, cpuctx, EVENT_TIME, current);
+		ctx_sched_in(ctx, cpuctx, EVENT_TIME, current, mux);
 		return;
 	}
 
@@ -2876,11 +2889,13 @@ EXPORT_SYMBOL_GPL(perf_event_refresh);
 
 static void ctx_sched_out(struct perf_event_context *ctx,
 			  struct perf_cpu_context *cpuctx,
-			  enum event_type_t event_type)
+			  enum event_type_t event_type, int mux)
 {
 	int is_active = ctx->is_active;
+	struct list_head *group_list;
 	struct perf_event *event;
 	struct rb_node *node;
+	int sw = -1, cpu = smp_processor_id();
 	lockdep_assert_held(&ctx->lock);
 
 	if (likely(!ctx->nr_events)) {
@@ -2926,17 +2941,47 @@ static void ctx_sched_out(struct perf_event_context *ctx,
 
 	perf_pmu_disable(ctx->pmu);
 
-	if (is_active & EVENT_PINNED)
-		perf_event_groups_for_each(event, node,
-				&ctx->pinned_groups, group_node,
-				group_list, group_entry)
-			group_sched_out(event, cpuctx, ctx);
+	if (is_active & EVENT_PINNED) {
+		if (mux) {
+			perf_event_groups_for_each_cpu(event, cpu,
+					&ctx->pinned_groups,
+					group_list, group_entry) {
+					group_sched_out(event, cpuctx, ctx);
+			}
+			perf_event_groups_for_each_cpu(event, sw,
+					&ctx->pinned_groups,
+					group_list, group_entry) {
+					group_sched_out(event, cpuctx, ctx);
+			}
+		} else {
+			perf_event_groups_for_each(event, node,
+					&ctx->pinned_groups, group_node,
+					group_list, group_entry) {
+					group_sched_out(event, cpuctx, ctx);
+			}
+		}
+	}
 
-	if (is_active & EVENT_FLEXIBLE)
-		perf_event_groups_for_each(event, node,
-				&ctx->flexible_groups, group_node,
-				group_list, group_entry)
-			group_sched_out(event, cpuctx, ctx);
+	if (is_active & EVENT_FLEXIBLE) {
+		if (mux) {
+			perf_event_groups_for_each_cpu(event, cpu,
+					&ctx->flexible_groups,
+					group_list, group_entry) {
+					group_sched_out(event, cpuctx, ctx);
+			}
+			perf_event_groups_for_each_cpu(event, sw,
+					&ctx->flexible_groups,
+					group_list, group_entry) {
+					group_sched_out(event, cpuctx, ctx);
+			}
+		} else {
+			perf_event_groups_for_each(event, node,
+					&ctx->flexible_groups, group_node,
+					group_list, group_entry) {
+					group_sched_out(event, cpuctx, ctx);
+			}
+		}
+	}
 
 	perf_pmu_enable(ctx->pmu);
 }
@@ -3225,9 +3270,9 @@ void __perf_event_task_sched_out(struct task_struct *task,
  * Called with IRQs disabled
  */
 static void cpu_ctx_sched_out(struct perf_cpu_context *cpuctx,
-			      enum event_type_t event_type)
+			      enum event_type_t event_type, int mux)
 {
-	ctx_sched_out(&cpuctx->ctx, cpuctx, event_type);
+	ctx_sched_out(&cpuctx->ctx, cpuctx, event_type, mux);
 }
 
 static void
@@ -3287,11 +3332,13 @@ static void
 ctx_sched_in(struct perf_event_context *ctx,
 	     struct perf_cpu_context *cpuctx,
 	     enum event_type_t event_type,
-	     struct task_struct *task)
+	     struct task_struct *task, int mux)
 {
 	int is_active = ctx->is_active;
+	struct list_head *group_list;
 	struct perf_event *event;
 	struct rb_node *node;
+	int sw = -1, cpu = smp_processor_id();
 
 	lockdep_assert_held(&ctx->lock);
 
@@ -3319,35 +3366,69 @@ ctx_sched_in(struct perf_event_context *ctx,
 	 * First go through the list and put on any pinned groups
 	 * in order to give them the best chance of going on.
 	 */
-	if (is_active & EVENT_PINNED)
-		perf_event_groups_for_each(event, node,
-				&ctx->pinned_groups, group_node,
-				group_list, group_entry)
-			ctx_pinned_sched_in(event, cpuctx, ctx);
+	if (is_active & EVENT_PINNED) {
+		if (mux) {
+			perf_event_groups_for_each_cpu(event, sw,
+					&ctx->pinned_groups,
+					group_list, group_entry) {
+					ctx_pinned_sched_in(event, cpuctx, ctx);
+			}
+			perf_event_groups_for_each_cpu(event, cpu,
+					&ctx->pinned_groups,
+					group_list, group_entry) {
+					ctx_pinned_sched_in(event, cpuctx, ctx);
+			}
+		} else {
+			perf_event_groups_for_each(event, node,
+					&ctx->pinned_groups, group_node,
+					group_list, group_entry) {
+					ctx_pinned_sched_in(event, cpuctx, ctx);
+			}
+		}
+	}
 
 	/* Then walk through the lower prio flexible groups */
 	if (is_active & EVENT_FLEXIBLE) {
 		int can_add_hw = 1;
-		perf_event_groups_for_each(event, node,
-				&ctx->flexible_groups, group_node,
-				group_list, group_entry)
-			ctx_flexible_sched_in(event, cpuctx, ctx, &can_add_hw);
+		if (mux) {
+			perf_event_groups_for_each_cpu(event, sw,
+					&ctx->flexible_groups,
+					group_list, group_entry) {
+					ctx_flexible_sched_in(event, cpuctx,
+							ctx, &can_add_hw);
+			}
+			can_add_hw = 1;
+			perf_event_groups_for_each_cpu(event, cpu,
+					&ctx->flexible_groups,
+					group_list, group_entry) {
+					ctx_flexible_sched_in(event, cpuctx,
+							ctx, &can_add_hw);
+			}
+		} else {
+			perf_event_groups_for_each(event, node,
+					&ctx->flexible_groups, group_node,
+					group_list, group_entry) {
+					ctx_flexible_sched_in(event, cpuctx,
+							ctx, &can_add_hw);
+			}
+		}
 	}
 }
 
 static void cpu_ctx_sched_in(struct perf_cpu_context *cpuctx,
 			     enum event_type_t event_type,
-			     struct task_struct *task)
+			     struct task_struct *task, int mux)
 {
 	struct perf_event_context *ctx = &cpuctx->ctx;
 
-	ctx_sched_in(ctx, cpuctx, event_type, task);
+	ctx_sched_in(ctx, cpuctx, event_type, task, mux);
 }
 
 static void perf_event_context_sched_in(struct perf_event_context *ctx,
 					struct task_struct *task)
 {
 	struct perf_cpu_context *cpuctx;
+	int mux = 0;
 
 	cpuctx = __get_cpu_context(ctx);
 	if (cpuctx->task_ctx == ctx)
@@ -3371,8 +3452,8 @@ static void perf_event_context_sched_in(struct perf_event_context *ctx,
 	 * events, no need to flip the cpuctx's events around.
 	 */
 	if (!RB_EMPTY_ROOT(&ctx->pinned_groups))
-		cpu_ctx_sched_out(cpuctx, EVENT_FLEXIBLE);
-	perf_event_sched_in(cpuctx, ctx, task);
+		cpu_ctx_sched_out(cpuctx, EVENT_FLEXIBLE, mux);
+	perf_event_sched_in(cpuctx, ctx, task, mux);
 	perf_pmu_enable(ctx->pmu);
 
 unlock:
@@ -3618,7 +3699,7 @@ static void rotate_ctx(struct perf_event_context *ctx)
 static int perf_rotate_context(struct perf_cpu_context *cpuctx)
 {
 	struct perf_event_context *ctx = NULL;
-	int rotate = 0;
+	int rotate = 0, mux = 1;
 
 	if (cpuctx->ctx.nr_events) {
 		if (cpuctx->ctx.nr_events != cpuctx->ctx.nr_active)
@@ -3637,15 +3718,15 @@ static int perf_rotate_context(struct perf_cpu_context *cpuctx)
 	perf_ctx_lock(cpuctx, cpuctx->task_ctx);
 	perf_pmu_disable(cpuctx->ctx.pmu);
 
-	cpu_ctx_sched_out(cpuctx, EVENT_FLEXIBLE);
+	cpu_ctx_sched_out(cpuctx, EVENT_FLEXIBLE, mux);
 	if (ctx)
-		ctx_sched_out(ctx, cpuctx, EVENT_FLEXIBLE);
+		ctx_sched_out(ctx, cpuctx, EVENT_FLEXIBLE, mux);
 
 	rotate_ctx(&cpuctx->ctx);
 	if (ctx)
 		rotate_ctx(ctx);
 
-	perf_event_sched_in(cpuctx, ctx, current);
+	perf_event_sched_in(cpuctx, ctx, current, mux);
 
 	perf_pmu_enable(cpuctx->ctx.pmu);
 	perf_ctx_unlock(cpuctx, cpuctx->task_ctx);
@@ -3696,7 +3777,7 @@ static void perf_event_enable_on_exec(int ctxn)
 	struct perf_cpu_context *cpuctx;
 	struct perf_event *event;
 	unsigned long flags;
-	int enabled = 0;
+	int enabled = 0, mux = 0;
 
 	local_irq_save(flags);
 	ctx = current->perf_event_ctxp[ctxn];
@@ -3705,7 +3786,7 @@ static void perf_event_enable_on_exec(int ctxn)
 
 	cpuctx = __get_cpu_context(ctx);
 	perf_ctx_lock(cpuctx, ctx);
-	ctx_sched_out(ctx, cpuctx, EVENT_TIME);
+	ctx_sched_out(ctx, cpuctx, EVENT_TIME, mux);
 	list_for_each_entry(event, &ctx->event_list, event_entry) {
 		enabled |= event_enable_on_exec(event, ctx);
 		event_type |= get_event_type(event);
@@ -3718,7 +3799,7 @@ static void perf_event_enable_on_exec(int ctxn)
 		clone_ctx = unclone_ctx(ctx);
 		ctx_resched(cpuctx, ctx, event_type);
 	} else {
-		ctx_sched_in(ctx, cpuctx, EVENT_TIME, current);
+		ctx_sched_in(ctx, cpuctx, EVENT_TIME, current, mux);
 	}
 	perf_ctx_unlock(cpuctx, ctx);
 

^ permalink raw reply related	[flat|nested] 76+ messages in thread

* Re: [PATCH v7 0/2] perf/core: addressing 4x slowdown during per-process profiling of STREAM benchmark on Intel Xeon Phi
  2017-08-18  5:17 ` [PATCH v7 0/2] perf/core: addressing 4x slowdown during per-process profiling of STREAM benchmark on Intel Xeon Phi Alexey Budankov
  2017-08-18  5:21   ` [PATCH v7 1/2] perf/core: use rb trees for pinned/flexible groups Alexey Budankov
  2017-08-18  5:22   ` [PATCH v7 2/2] perf/core: add mux switch to skip to the current CPU's events list on mux interrupt Alexey Budankov
@ 2017-08-22 20:21   ` Peter Zijlstra
  2017-08-23  8:54     ` Alexey Budankov
  2017-08-31 10:12     ` Alexey Budankov
  2 siblings, 2 replies; 76+ messages in thread
From: Peter Zijlstra @ 2017-08-22 20:21 UTC (permalink / raw)
  To: Alexey Budankov
  Cc: Ingo Molnar, Arnaldo Carvalho de Melo, Alexander Shishkin,
	Andi Kleen, Kan Liang, Dmitri Prokhorov, Valery Cherepennikov,
	Mark Rutland, Stephane Eranian, David Carrillo-Cisneros,
	linux-kernel

On Fri, Aug 18, 2017 at 08:17:15AM +0300, Alexey Budankov wrote:
> Hi,

Please don't post new versions in reply to old versions, that gets them
lost in thread sorted views.

> This patch set v7 moves event groups into rb trees and implements 
> skipping to the current CPU's list on hrtimer interrupt.

Does this depend on your timekeeping rework posted in that v6 thread?
If so, I would have expected to see that as part of these patches, if
not, I'm confused, because part of the problem was that we currently
need to update times for events we don't want to schedule etc..

> Events allocated for the same CPU are still kept in a linked list
> of the event directly attached to the tree because it is unclear 
> how to implement fast iteration thru events allocated for 
> the same CPU when they are all attached to a tree employing 
> additional 64bit index as a secondary treee key.

Finding the CPU subtree and rb_next() wasn't good?

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH v6 2/3]: perf/core: use context tstamp_data for skipped events on mux interrupt
  2017-08-10 15:57     ` Alexey Budankov
@ 2017-08-22 20:47       ` Peter Zijlstra
  2017-08-23  8:54         ` Alexey Budankov
  0 siblings, 1 reply; 76+ messages in thread
From: Peter Zijlstra @ 2017-08-22 20:47 UTC (permalink / raw)
  To: Alexey Budankov
  Cc: Ingo Molnar, Arnaldo Carvalho de Melo, Alexander Shishkin,
	Andi Kleen, Kan Liang, Dmitri Prokhorov, Valery Cherepennikov,
	Mark Rutland, Stephane Eranian, David Carrillo-Cisneros,
	linux-kernel

On Thu, Aug 10, 2017 at 06:57:43PM +0300, Alexey Budankov wrote:
> The key thing in the patch is explicit updating of tstamp fields for
> INACTIVE events in update_event_times().

> @@ -1405,6 +1426,9 @@ static void update_event_times(struct perf_event *event)
>  	    event->group_leader->state < PERF_EVENT_STATE_INACTIVE)
>  		return;
>  
> +	if (event->state == PERF_EVENT_STATE_INACTIVE)
> +		perf_event_tstamp_update(event);
> +
>  	/*
>  	 * in cgroup mode, time_enabled represents
>  	 * the time the event was enabled AND active

But why!? I thought the whole point was to not need to do this.

The thing I outlined earlier would only need to update timestamps when
events change state and at no other point in time.

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH v7 0/2] perf/core: addressing 4x slowdown during per-process profiling of STREAM benchmark on Intel Xeon Phi
  2017-08-22 20:21   ` [PATCH v7 0/2] perf/core: addressing 4x slowdown during per-process profiling of STREAM benchmark on Intel Xeon Phi Peter Zijlstra
@ 2017-08-23  8:54     ` Alexey Budankov
  2017-08-31 10:12     ` Alexey Budankov
  1 sibling, 0 replies; 76+ messages in thread
From: Alexey Budankov @ 2017-08-23  8:54 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Arnaldo Carvalho de Melo, Alexander Shishkin,
	Andi Kleen, Kan Liang, Dmitri Prokhorov, Valery Cherepennikov,
	Mark Rutland, Stephane Eranian, David Carrillo-Cisneros,
	linux-kernel

On 22.08.2017 23:21, Peter Zijlstra wrote:
> On Fri, Aug 18, 2017 at 08:17:15AM +0300, Alexey Budankov wrote:
>> Hi,
> 
> Please don't post new versions in reply to old versions, that gets them
> lost in thread sorted views.

Accepted. Actually I followed these recommendations:

	https://git-scm.com/docs/git-send-email

and they say it needs to be like this:

[PATCH 0/2] Here is what I did...
  [PATCH 1/2] Clean up and tests
  [PATCH 2/2] Implementation
  [PATCH v2 0/3] Here is a reroll
    [PATCH v2 1/3] Clean up
    [PATCH v2 2/3] New tests
    [PATCH v2 3/3] Implementation

So I made v7 a descendant of v6.

Do you prefer that the new version be at the same level as the previous 
versions, like this?

[PATCH 0/2] Here is what I did...
  [PATCH 1/2] Clean up and tests
  [PATCH 2/2] Implementation
[PATCH v2 0/3] Here is a reroll
  [PATCH v2 1/3] Clean up
  [PATCH v2 2/3] New tests
  [PATCH v2 3/3] Implementation

> 
>> This patch set v7 moves event groups into rb trees and implements 
>> skipping to the current CPU's list on hrtimer interrupt.
> 
> Does this depend on your timekeeping rework posted in that v6 thread?

This v7 includes the timekeeping rework, so it is a complete patch set 
addressing your earlier concerns. The set of changes became 
smaller, so we are going the right way.

I tested v7 through several nights on Xeon Phi under the fuzzer like this:
for ((i=0;i<1000;i=i+1)) do ./perf_fuzzer; done
and there were no crashes or hangs in the perf code; the machine is still alive.

> If so, I would have expected to see that as part of these patches, if
> not, I'm confused, because part of the problem was that we currently
> need to update times for events we don't want to schedule etc..

Yes, you are right. We need to update times for those events and 
we still do, but on demand - from the read() syscall. Doing so we can skip 
slow iteration over the whole bunch of events and get a performance boost.

We can skip updating the times on every timer interrupt and do it only
on a read() call and on thread context switch out, when
update_event_times() is actually called.

> 
>> Events allocated for the same CPU are still kept in a linked list
>> of the event directly attached to the tree because it is unclear 
>> how to implement fast iteration thru events allocated for 
>> the same CPU when they are all attached to a tree employing 
>> additional 64bit index as a secondary treee key.
> 
> Finding the CPU subtree and rb_next() wasn't good?

I implemented the approach you had suggested (as I understood it),
tested it, and got results that are drastically different from what 
I am getting with the tree of lists. Specifically, I did:

1. keeping all groups in the same single tree by employing a 64-bit index
   in addition to the CPU key;

2. implementing a special _less() function and rotation by re-inserting
   the group with an incremented index (see the sketch after this list);

3. replacing the API that took a callback in its signature with the
   perf_event_groups_for_each() macro;
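
In C terms, items 1 and 2 correspond roughly to the sketch below, which
is a rendering of Peter's earlier pseudocode rather than the exact code
I tested; the vtime field and the perf_event_groups wrapper are assumed
names:

struct perf_event_groups {
	struct rb_root	root;
	u64		vtime;	/* monotonically growing insertion index */
};

static bool event_less(struct perf_event *left, struct perf_event *right)
{
	if (left->cpu != right->cpu)
		return left->cpu < right->cpu;

	return left->vtime < right->vtime;	/* secondary key */
}

static void group_insert(struct perf_event_groups *groups,
			 struct perf_event *event)
{
	struct rb_node **pos = &groups->root.rb_node, *parent = NULL;
	struct perf_event *node;

	event->vtime = ++groups->vtime;	/* new groups go to the cpu's tail */

	while (*pos) {
		parent = *pos;
		node = container_of(parent, struct perf_event, group_node);
		if (event_less(event, node))
			pos = &parent->rb_left;
		else
			pos = &parent->rb_right;
	}

	rb_link_node(&event->group_node, parent, pos);
	rb_insert_color(&event->group_node, &groups->root);
}

/* rotation becomes erase + re-insert with a fresh vtime */
static void group_rotate(struct perf_event_groups *groups,
			 struct perf_event *event)
{
	rb_erase(&event->group_node, &groups->root);
	group_insert(groups, event);
}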

Employing all of that shrank the total patch size; however, I am still 
struggling with correctness issues.

Now I have figured out that not all indexed events are always located under 
the root node with the same cpu; it depends on the order of insertion,
e.g. with insertion order 01,02,03,14,15,16 we get this:

     02
    /  \
   01  14
      /  \
     03  15
           \
           16

and it is unclear how to iterate cpu==0 part of tree in this case.

Iterating a cpu-specific subtree like this:

#define for_each_group_event(event, group, cpu, pmu, field)	 \
	for (event = rb_entry_safe(group_first(group, cpu, pmu), \
				   typeof(*event), field);	 \
	     event && event->cpu == cpu && event->pmu == pmu;	 \
	     event = rb_entry_safe(rb_next(&event->field),	 \
				   typeof(*event), field))

misses event==03 for the case above, and I guess this is where I lose 
samples in my testing.
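
For reference, group_first() is not shown above; assuming the tree is
ordered by cpu and pmu before the index (the sketch earlier in this
mail omitted the pmu key for brevity), it is meant to return the
leftmost node matching the cpu/pmu key, roughly:

static struct rb_node *group_first(struct perf_event_groups *groups,
				   int cpu, struct pmu *pmu)
{
	struct rb_node *node = groups->root.rb_node, *match = NULL;
	struct perf_event *event;

	while (node) {
		event = container_of(node, struct perf_event, group_node);

		if (cpu < event->cpu ||
		    (cpu == event->cpu && pmu < event->pmu)) {
			node = node->rb_left;
		} else if (cpu > event->cpu ||
			   (cpu == event->cpu && pmu > event->pmu)) {
			node = node->rb_right;
		} else {
			match = node;	/* remember, keep going left */
			node = node->rb_left;
		}
	}

	return match;
}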

> 
> 
> 

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH v6 2/3]: perf/core: use context tstamp_data for skipped events on mux interrupt
  2017-08-22 20:47       ` Peter Zijlstra
@ 2017-08-23  8:54         ` Alexey Budankov
  2017-08-31 17:18           ` [RFC][PATCH] perf: Rewrite enabled/running timekeeping Peter Zijlstra
  0 siblings, 1 reply; 76+ messages in thread
From: Alexey Budankov @ 2017-08-23  8:54 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Arnaldo Carvalho de Melo, Alexander Shishkin,
	Andi Kleen, Kan Liang, Dmitri Prokhorov, Valery Cherepennikov,
	Mark Rutland, Stephane Eranian, David Carrillo-Cisneros,
	linux-kernel

On 22.08.2017 23:47, Peter Zijlstra wrote:
> On Thu, Aug 10, 2017 at 06:57:43PM +0300, Alexey Budankov wrote:
>> The key thing in the patch is explicit updating of tstamp fields for
>> INACTIVE events in update_event_times().
> 
>> @@ -1405,6 +1426,9 @@ static void update_event_times(struct perf_event *event)
>>  	    event->group_leader->state < PERF_EVENT_STATE_INACTIVE)
>>  		return;
>>  
>> +	if (event->state == PERF_EVENT_STATE_INACTIVE)
>> +		perf_event_tstamp_update(event);
>> +
>>  	/*
>>  	 * in cgroup mode, time_enabled represents
>>  	 * the time the event was enabled AND active
> 
> But why!? I thought the whole point was to not need to do this.

update_event_times() is not called from the timer interrupt handler, 
thus it is not on the critical path which is optimized in this patch set.

But update_event_times() is called in the context of the read() syscall, so
this is the place where we can update event times for INACTIVE events 
instead of in the timer interrupt.

Also, update_event_times() is called on thread context switch out, so
event times also get updated when the thread migrates to another CPU.

> 
> The thing I outlined earlier would only need to update timestamps when
> events change state and at no other point in time.

But we may still request times while the event is in the INACTIVE state 
through the read() syscall, and the event timings need to be up to date. 
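
A minimal, self-contained sketch of that on-demand accounting idea
(the names below are made up for illustration; the patch itself keeps
this state in the event/context tstamp fields):

#include <stdint.h>
#include <stdbool.h>

struct demo_times {
	uint64_t total_enabled;	/* ns the event has been enabled */
	uint64_t total_running;	/* ns the event has been on the PMU */
	uint64_t last_update;	/* when the totals were last folded */
	bool	 running;	/* currently scheduled in */
};

/*
 * Called from read() or at context switch out instead of from every
 * multiplexing hrtimer interrupt: fold the elapsed time into the totals.
 */
static void demo_update_times(struct demo_times *t, uint64_t now)
{
	uint64_t delta = now - t->last_update;

	t->total_enabled += delta;
	if (t->running)
		t->total_running += delta;
	t->last_update = now;
}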

> 

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH v7 1/2] perf/core: use rb trees for pinned/flexible groups
  2017-08-18  5:21   ` [PATCH v7 1/2] perf/core: use rb trees for pinned/flexible groups Alexey Budankov
@ 2017-08-23 11:17     ` Alexander Shishkin
  2017-08-23 17:23       ` Alexey Budankov
  0 siblings, 1 reply; 76+ messages in thread
From: Alexander Shishkin @ 2017-08-23 11:17 UTC (permalink / raw)
  To: Alexey Budankov, Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo
  Cc: Andi Kleen, Kan Liang, Dmitri Prokhorov, Valery Cherepennikov,
	Mark Rutland, Stephane Eranian, David Carrillo-Cisneros,
	linux-kernel

Alexey Budankov <alexey.budankov@linux.intel.com> writes:

> @@ -3091,61 +3231,55 @@ static void cpu_ctx_sched_out(struct perf_cpu_context *cpuctx,
>  }
>  
>  static void
> -ctx_pinned_sched_in(struct perf_event_context *ctx,
> -		    struct perf_cpu_context *cpuctx)
> +ctx_pinned_sched_in(struct perf_event *event,
> +		   struct perf_cpu_context *cpuctx,
> +		   struct perf_event_context *ctx)

If you're doing this, you also need to rename the function, because it
now schedules in one event and not a context. But better just keep it as
is.

>  {
> -	struct perf_event *event;
> -
> -	list_for_each_entry(event, &ctx->pinned_groups, group_entry) {

Why not put your new iterator here in place of the old one, instead of
moving things around? Because what follows is hard to read and is also
completely unnecessary:

> -		if (event->state <= PERF_EVENT_STATE_OFF)
> -			continue;
> -		if (!event_filter_match(event))
> -			continue;
> +	if (event->state <= PERF_EVENT_STATE_OFF)
> +		return;
> +	if (!event_filter_match(event))
> +		return;

like this,

>  
> -		/* may need to reset tstamp_enabled */
> -		if (is_cgroup_event(event))
> -			perf_cgroup_mark_enabled(event, ctx);
> +	/* may need to reset tstamp_enabled */
> +	if (is_cgroup_event(event))
> +		perf_cgroup_mark_enabled(event, ctx);

or this

>  
> -		if (group_can_go_on(event, cpuctx, 1))
> -			group_sched_in(event, cpuctx, ctx);
> +	if (group_can_go_on(event, cpuctx, 1))
> +		group_sched_in(event, cpuctx, ctx);

etc, etc.

> @@ -3156,7 +3290,8 @@ ctx_sched_in(struct perf_event_context *ctx,
>  	     struct task_struct *task)
>  {
>  	int is_active = ctx->is_active;
> -	u64 now;

Why?

> +	struct perf_event *event;
> +	struct rb_node *node;
>  
>  	lockdep_assert_held(&ctx->lock);
>  
> @@ -3175,7 +3310,7 @@ ctx_sched_in(struct perf_event_context *ctx,
>  
>  	if (is_active & EVENT_TIME) {
>  		/* start ctx time */
> -		now = perf_clock();
> +		u64 now = perf_clock();

Why?

>  		ctx->timestamp = now;
>  		perf_cgroup_set_timestamp(task, ctx);
>  	}
> @@ -3185,11 +3320,19 @@ ctx_sched_in(struct perf_event_context *ctx,
>  	 * in order to give them the best chance of going on.
>  	 */
>  	if (is_active & EVENT_PINNED)
> -		ctx_pinned_sched_in(ctx, cpuctx);
> +		perf_event_groups_for_each(event, node,
> +				&ctx->pinned_groups, group_node,
> +				group_list, group_entry)
> +			ctx_pinned_sched_in(event, cpuctx, ctx);

So this perf_event_groups_for_each() can just move into
ctx_*_sched_in(), can't it?
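
That is, keep the original signature and just swap the list walk for
the tree walk, along these lines (body abridged to the part quoted
above):

static void
ctx_pinned_sched_in(struct perf_event_context *ctx,
		    struct perf_cpu_context *cpuctx)
{
	struct perf_event *event;
	struct rb_node *node;

	perf_event_groups_for_each(event, node, &ctx->pinned_groups,
				   group_node, group_list, group_entry) {
		if (event->state <= PERF_EVENT_STATE_OFF)
			continue;
		if (!event_filter_match(event))
			continue;

		/* may need to reset tstamp_enabled */
		if (is_cgroup_event(event))
			perf_cgroup_mark_enabled(event, ctx);

		if (group_can_go_on(event, cpuctx, 1))
			group_sched_in(event, cpuctx, ctx);

		/*
		 * the original error-state handling for pinned groups
		 * that could not be scheduled stays here
		 */
	}
}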

Regards,
--
Alex

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH v7 2/2] perf/core: add mux switch to skip to the current CPU's events list on mux interrupt
  2017-08-18  5:22   ` [PATCH v7 2/2] perf/core: add mux switch to skip to the current CPU's events list on mux interrupt Alexey Budankov
@ 2017-08-23 11:54     ` Alexander Shishkin
  2017-08-23 18:12       ` Alexey Budankov
  0 siblings, 1 reply; 76+ messages in thread
From: Alexander Shishkin @ 2017-08-23 11:54 UTC (permalink / raw)
  To: Alexey Budankov, Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo
  Cc: Andi Kleen, Kan Liang, Dmitri Prokhorov, Valery Cherepennikov,
	Mark Rutland, Stephane Eranian, David Carrillo-Cisneros,
	linux-kernel

Alexey Budankov <alexey.budankov@linux.intel.com> writes:

> This patch implements mux switch that triggers skipping to the 
> current CPU's events list at mulitplexing hrtimer interrupt 
> handler as well as adoption of the switch in the existing 
> implementation.
>
> perf_event_groups_iterate_cpu() API is introduced to implement 
> iteration thru the certain CPU groups list skipping groups 

"through"

> allocated for the other CPUs.
>
> Signed-off-by: Alexey Budankov <alexey.budankov@linux.intel.com>
> ---
>  kernel/events/core.c | 193 ++++++++++++++++++++++++++++++++++++---------------
>  1 file changed, 137 insertions(+), 56 deletions(-)
>
> diff --git a/kernel/events/core.c b/kernel/events/core.c
> index 08ccfb2..aeb0f81 100644
> --- a/kernel/events/core.c
> +++ b/kernel/events/core.c
> @@ -556,11 +556,11 @@ void perf_sample_event_took(u64 sample_len_ns)
>  static atomic64_t perf_event_id;
>  
>  static void cpu_ctx_sched_out(struct perf_cpu_context *cpuctx,
> -			      enum event_type_t event_type);
> +			      enum event_type_t event_type, int mux);
>  
>  static void cpu_ctx_sched_in(struct perf_cpu_context *cpuctx,
>  			     enum event_type_t event_type,
> -			     struct task_struct *task);
> +			     struct task_struct *task, int mux);
>  
>  static void update_context_time(struct perf_event_context *ctx);
>  static u64 perf_event_time(struct perf_event *event);
> @@ -702,6 +702,7 @@ static void perf_cgroup_switch(struct task_struct *task, int mode)
>  	struct perf_cpu_context *cpuctx;
>  	struct list_head *list;
>  	unsigned long flags;
> +	int mux = 0;
>  
>  	/*
>  	 * Disable interrupts and preemption to avoid this CPU's
> @@ -717,7 +718,7 @@ static void perf_cgroup_switch(struct task_struct *task, int mode)
>  		perf_pmu_disable(cpuctx->ctx.pmu);
>  
>  		if (mode & PERF_CGROUP_SWOUT) {
> -			cpu_ctx_sched_out(cpuctx, EVENT_ALL);
> +			cpu_ctx_sched_out(cpuctx, EVENT_ALL, mux);
>  			/*
>  			 * must not be done before ctxswout due
>  			 * to event_filter_match() in event_sched_out()
> @@ -736,7 +737,7 @@ static void perf_cgroup_switch(struct task_struct *task, int mode)
>  			 */
>  			cpuctx->cgrp = perf_cgroup_from_task(task,
>  							     &cpuctx->ctx);
> -			cpu_ctx_sched_in(cpuctx, EVENT_ALL, task);
> +			cpu_ctx_sched_in(cpuctx, EVENT_ALL, task, mux);

'mux' is always zero in this function, isn't it?

>  		}
>  		perf_pmu_enable(cpuctx->ctx.pmu);
>  		perf_ctx_unlock(cpuctx, cpuctx->task_ctx);
> @@ -1613,8 +1614,16 @@ perf_event_groups_rotate(struct rb_root *groups, int cpu)
>   */
>  #define perf_event_groups_for_each(event, iter, tree, node, list, link) \
>  	   for (iter = rb_first(tree); iter; iter = rb_next(iter))	\
> -		list_for_each_entry(event, &(rb_entry(iter,		\
> -			typeof(*event), node)->list), link)
> +		list_for_each_entry(event, &(rb_entry(iter, 		\
> +				typeof(*event), node)->list), link)

Is this an indentation change? What is it doing here?

> +
> +/*
> + * Iterate event groups related to specific cpu.
> + */
> +#define perf_event_groups_for_each_cpu(event, cpu, tree, list, link)	\
> +		list = perf_event_groups_get_list(tree, cpu);		\
> +		if (list)						\
> +			list_for_each_entry(event, list, link)

..or not, if there's no list.

>  
>  /*
>   * Add a event from the lists for its context.
> @@ -2397,36 +2406,38 @@ static void add_event_to_ctx(struct perf_event *event,
>  
>  static void ctx_sched_out(struct perf_event_context *ctx,
>  			  struct perf_cpu_context *cpuctx,
> -			  enum event_type_t event_type);
> +			  enum event_type_t event_type, int mux);
>  static void
>  ctx_sched_in(struct perf_event_context *ctx,
>  	     struct perf_cpu_context *cpuctx,
>  	     enum event_type_t event_type,
> -	     struct task_struct *task);
> +	     struct task_struct *task, int mux);
>  
>  static void task_ctx_sched_out(struct perf_cpu_context *cpuctx,
>  			       struct perf_event_context *ctx,
>  			       enum event_type_t event_type)
>  {
> +	int mux = 0;
> +
>  	if (!cpuctx->task_ctx)
>  		return;
>  
>  	if (WARN_ON_ONCE(ctx != cpuctx->task_ctx))
>  		return;
>  
> -	ctx_sched_out(ctx, cpuctx, event_type);
> +	ctx_sched_out(ctx, cpuctx, event_type, mux);

Just use 0.

>  }
>  
>  static void perf_event_sched_in(struct perf_cpu_context *cpuctx,
>  				struct perf_event_context *ctx,
> -				struct task_struct *task)
> +				struct task_struct *task, int mux)
>  {
> -	cpu_ctx_sched_in(cpuctx, EVENT_PINNED, task);
> +	cpu_ctx_sched_in(cpuctx, EVENT_PINNED, task, mux);
>  	if (ctx)
> -		ctx_sched_in(ctx, cpuctx, EVENT_PINNED, task);
> -	cpu_ctx_sched_in(cpuctx, EVENT_FLEXIBLE, task);
> +		ctx_sched_in(ctx, cpuctx, EVENT_PINNED, task, mux);
> +	cpu_ctx_sched_in(cpuctx, EVENT_FLEXIBLE, task, mux);
>  	if (ctx)
> -		ctx_sched_in(ctx, cpuctx, EVENT_FLEXIBLE, task);
> +		ctx_sched_in(ctx, cpuctx, EVENT_FLEXIBLE, task, mux);
>  }
>  
>  /*
> @@ -2450,6 +2461,7 @@ static void ctx_resched(struct perf_cpu_context *cpuctx,
>  {
>  	enum event_type_t ctx_event_type = event_type & EVENT_ALL;
>  	bool cpu_event = !!(event_type & EVENT_CPU);
> +	int mux = 0;
>  
>  	/*
>  	 * If pinned groups are involved, flexible groups also need to be
> @@ -2470,11 +2482,11 @@ static void ctx_resched(struct perf_cpu_context *cpuctx,
>  	 *  - otherwise, do nothing more.
>  	 */
>  	if (cpu_event)
> -		cpu_ctx_sched_out(cpuctx, ctx_event_type);
> +		cpu_ctx_sched_out(cpuctx, ctx_event_type, mux);
>  	else if (ctx_event_type & EVENT_PINNED)
> -		cpu_ctx_sched_out(cpuctx, EVENT_FLEXIBLE);
> +		cpu_ctx_sched_out(cpuctx, EVENT_FLEXIBLE, mux);
>  
> -	perf_event_sched_in(cpuctx, task_ctx, current);
> +	perf_event_sched_in(cpuctx, task_ctx, current, mux);

Also mux==0 in all cases in this function.

>  	perf_pmu_enable(cpuctx->ctx.pmu);
>  }
>  
> @@ -2491,7 +2503,7 @@ static int  __perf_install_in_context(void *info)
>  	struct perf_cpu_context *cpuctx = __get_cpu_context(ctx);
>  	struct perf_event_context *task_ctx = cpuctx->task_ctx;
>  	bool reprogram = true;
> -	int ret = 0;
> +	int ret = 0, mux =0;
>  
>  	raw_spin_lock(&cpuctx->ctx.lock);
>  	if (ctx->task) {
> @@ -2518,7 +2530,7 @@ static int  __perf_install_in_context(void *info)
>  	}
>  
>  	if (reprogram) {
> -		ctx_sched_out(ctx, cpuctx, EVENT_TIME);
> +		ctx_sched_out(ctx, cpuctx, EVENT_TIME, mux);
>  		add_event_to_ctx(event, ctx);
>  		ctx_resched(cpuctx, task_ctx, get_event_type(event));
>  	} else {
> @@ -2655,13 +2667,14 @@ static void __perf_event_enable(struct perf_event *event,
>  {
>  	struct perf_event *leader = event->group_leader;
>  	struct perf_event_context *task_ctx;
> +	int mux = 0;
>  
>  	if (event->state >= PERF_EVENT_STATE_INACTIVE ||
>  	    event->state <= PERF_EVENT_STATE_ERROR)
>  		return;
>  
>  	if (ctx->is_active)
> -		ctx_sched_out(ctx, cpuctx, EVENT_TIME);
> +		ctx_sched_out(ctx, cpuctx, EVENT_TIME, mux);
>  
>  	__perf_event_mark_enabled(event);
>  
> @@ -2671,7 +2684,7 @@ static void __perf_event_enable(struct perf_event *event,
>  	if (!event_filter_match(event)) {
>  		if (is_cgroup_event(event))
>  			perf_cgroup_defer_enabled(event);
> -		ctx_sched_in(ctx, cpuctx, EVENT_TIME, current);
> +		ctx_sched_in(ctx, cpuctx, EVENT_TIME, current, mux);
>  		return;
>  	}
>  
> @@ -2680,7 +2693,7 @@ static void __perf_event_enable(struct perf_event *event,
>  	 * then don't put it on unless the group is on.
>  	 */
>  	if (leader != event && leader->state != PERF_EVENT_STATE_ACTIVE) {
> -		ctx_sched_in(ctx, cpuctx, EVENT_TIME, current);
> +		ctx_sched_in(ctx, cpuctx, EVENT_TIME, current, mux);

And here.

>  		return;
>  	}
>  
> @@ -2876,11 +2889,13 @@ EXPORT_SYMBOL_GPL(perf_event_refresh);
>  
>  static void ctx_sched_out(struct perf_event_context *ctx,
>  			  struct perf_cpu_context *cpuctx,
> -			  enum event_type_t event_type)
> +			  enum event_type_t event_type, int mux)
>  {
>  	int is_active = ctx->is_active;
> +	struct list_head *group_list;
>  	struct perf_event *event;
>  	struct rb_node *node;
> +	int sw = -1, cpu = smp_processor_id();

Same thing seems to be happening with 'sw'.

>  	lockdep_assert_held(&ctx->lock);
>  
>  	if (likely(!ctx->nr_events)) {
> @@ -2926,17 +2941,47 @@ static void ctx_sched_out(struct perf_event_context *ctx,
>  
>  	perf_pmu_disable(ctx->pmu);
>  
> -	if (is_active & EVENT_PINNED)
> -		perf_event_groups_for_each(event, node,
> -				&ctx->pinned_groups, group_node,
> -				group_list, group_entry)
> -			group_sched_out(event, cpuctx, ctx);
> +	if (is_active & EVENT_PINNED) {
> +		if (mux) {

So it's 'rotate', really.

> +			perf_event_groups_for_each_cpu(event, cpu,
> +					&ctx->pinned_groups,
> +					group_list, group_entry) {
> +					group_sched_out(event, cpuctx, ctx);
> +			}
> +			perf_event_groups_for_each_cpu(event, sw,
> +					&ctx->pinned_groups,
> +					group_list, group_entry) {
> +					group_sched_out(event, cpuctx, ctx);
> +			}
> +		} else {
> +			perf_event_groups_for_each(event, node,
> +					&ctx->pinned_groups, group_node,
> +					group_list, group_entry) {
> +					group_sched_out(event, cpuctx, ctx);
> +			}
> +		}
> +	}
>  
> -	if (is_active & EVENT_FLEXIBLE)
> -		perf_event_groups_for_each(event, node,
> -				&ctx->flexible_groups, group_node,
> -				group_list, group_entry)
> -			group_sched_out(event, cpuctx, ctx);
> +	if (is_active & EVENT_FLEXIBLE) {
> +		if (mux) {
> +			perf_event_groups_for_each_cpu(event, cpu,
> +					&ctx->flexible_groups,
> +					group_list, group_entry) {
> +					group_sched_out(event, cpuctx, ctx);
> +			}
> +			perf_event_groups_for_each_cpu(event, sw,
> +					&ctx->flexible_groups,
> +					group_list, group_entry) {
> +					group_sched_out(event, cpuctx, ctx);
> +			}
> +		} else {
> +			perf_event_groups_for_each(event, node,
> +					&ctx->flexible_groups, group_node,
> +					group_list, group_entry) {
> +					group_sched_out(event, cpuctx, ctx);
> +			}
> +		}
> +	}
>  
>  	perf_pmu_enable(ctx->pmu);
>  }
> @@ -3225,9 +3270,9 @@ void __perf_event_task_sched_out(struct task_struct *task,
>   * Called with IRQs disabled
>   */
>  static void cpu_ctx_sched_out(struct perf_cpu_context *cpuctx,
> -			      enum event_type_t event_type)
> +			      enum event_type_t event_type, int mux)
>  {
> -	ctx_sched_out(&cpuctx->ctx, cpuctx, event_type);
> +	ctx_sched_out(&cpuctx->ctx, cpuctx, event_type, mux);
>  }
>  
>  static void
> @@ -3287,11 +3332,13 @@ static void
>  ctx_sched_in(struct perf_event_context *ctx,
>  	     struct perf_cpu_context *cpuctx,
>  	     enum event_type_t event_type,
> -	     struct task_struct *task)
> +	     struct task_struct *task, int mux)
>  {
>  	int is_active = ctx->is_active;
> +	struct list_head *group_list;
>  	struct perf_event *event;
>  	struct rb_node *node;
> +	int sw = -1, cpu = smp_processor_id();
>  
>  	lockdep_assert_held(&ctx->lock);
>  
> @@ -3319,35 +3366,69 @@ ctx_sched_in(struct perf_event_context *ctx,
>  	 * First go through the list and put on any pinned groups
>  	 * in order to give them the best chance of going on.
>  	 */
> -	if (is_active & EVENT_PINNED)
> -		perf_event_groups_for_each(event, node,
> -				&ctx->pinned_groups, group_node,
> -				group_list, group_entry)
> -			ctx_pinned_sched_in(event, cpuctx, ctx);
> +	if (is_active & EVENT_PINNED) {
> +		if (mux) {
> +			perf_event_groups_for_each_cpu(event, sw,
> +					&ctx->pinned_groups,
> +					group_list, group_entry) {
> +					ctx_pinned_sched_in(event, cpuctx, ctx);
> +			}
> +			perf_event_groups_for_each_cpu(event, cpu,
> +					&ctx->pinned_groups,
> +					group_list, group_entry) {
> +					ctx_pinned_sched_in(event, cpuctx, ctx);
> +			}
> +		} else {
> +			perf_event_groups_for_each(event, node,
> +					&ctx->pinned_groups, group_node,
> +					group_list, group_entry) {
> +					ctx_pinned_sched_in(event, cpuctx, ctx);
> +			}
> +		}
> +	}
>  
>  	/* Then walk through the lower prio flexible groups */
>  	if (is_active & EVENT_FLEXIBLE) {
>  		int can_add_hw = 1;
> -		perf_event_groups_for_each(event, node,
> -				&ctx->flexible_groups, group_node,
> -				group_list, group_entry)
> -			ctx_flexible_sched_in(event, cpuctx, ctx, &can_add_hw);
> +		if (mux) {
> +			perf_event_groups_for_each_cpu(event, sw,
> +					&ctx->flexible_groups,
> +					group_list, group_entry) {
> +					ctx_flexible_sched_in(event, cpuctx,
> +							ctx, &can_add_hw);
> +			}
> +			can_add_hw = 1;
> +			perf_event_groups_for_each_cpu(event, cpu,
> +					&ctx->flexible_groups,
> +					group_list, group_entry) {
> +					ctx_flexible_sched_in(event, cpuctx,
> +							ctx, &can_add_hw);
> +			}
> +		} else {
> +			perf_event_groups_for_each(event, node,
> +					&ctx->flexible_groups, group_node,
> +					group_list, group_entry) {
> +					ctx_flexible_sched_in(event, cpuctx,
> +							ctx, &can_add_hw);
> +			}
> +		}
>  	}
>  }
>  
>  static void cpu_ctx_sched_in(struct perf_cpu_context *cpuctx,
>  			     enum event_type_t event_type,
> -			     struct task_struct *task)
> +			     struct task_struct *task, int mux)
>  {
>  	struct perf_event_context *ctx = &cpuctx->ctx;
>  
> -	ctx_sched_in(ctx, cpuctx, event_type, task);
> +	ctx_sched_in(ctx, cpuctx, event_type, task, mux);
>  }
>  
>  static void perf_event_context_sched_in(struct perf_event_context *ctx,
>  					struct task_struct *task)
>  {
>  	struct perf_cpu_context *cpuctx;
> +	int mux = 0;
>  
>  	cpuctx = __get_cpu_context(ctx);
>  	if (cpuctx->task_ctx == ctx)
> @@ -3371,8 +3452,8 @@ static void perf_event_context_sched_in(struct perf_event_context *ctx,
>  	 * events, no need to flip the cpuctx's events around.
>  	 */
>  	if (!RB_EMPTY_ROOT(&ctx->pinned_groups))
> -		cpu_ctx_sched_out(cpuctx, EVENT_FLEXIBLE);
> -	perf_event_sched_in(cpuctx, ctx, task);
> +		cpu_ctx_sched_out(cpuctx, EVENT_FLEXIBLE, mux);
> +	perf_event_sched_in(cpuctx, ctx, task, mux);
>  	perf_pmu_enable(ctx->pmu);
>  
>  unlock:
> @@ -3618,7 +3699,7 @@ static void rotate_ctx(struct perf_event_context *ctx)
>  static int perf_rotate_context(struct perf_cpu_context *cpuctx)
>  {
>  	struct perf_event_context *ctx = NULL;
> -	int rotate = 0;
> +	int rotate = 0, mux = 1;
>  
>  	if (cpuctx->ctx.nr_events) {
>  		if (cpuctx->ctx.nr_events != cpuctx->ctx.nr_active)
> @@ -3637,15 +3718,15 @@ static int perf_rotate_context(struct perf_cpu_context *cpuctx)
>  	perf_ctx_lock(cpuctx, cpuctx->task_ctx);
>  	perf_pmu_disable(cpuctx->ctx.pmu);
>  
> -	cpu_ctx_sched_out(cpuctx, EVENT_FLEXIBLE);
> +	cpu_ctx_sched_out(cpuctx, EVENT_FLEXIBLE, mux);

It's '1'.

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH v6 1/3] perf/core: use rb trees for pinned/flexible groups
  2017-08-15 17:28           ` Alexey Budankov
@ 2017-08-23 13:39             ` Alexander Shishkin
  2017-08-23 14:18               ` Alexey Budankov
  2017-08-29 13:51             ` Alexander Shishkin
  2017-08-31 10:12             ` Alexey Budankov
  2 siblings, 1 reply; 76+ messages in thread
From: Alexander Shishkin @ 2017-08-23 13:39 UTC (permalink / raw)
  To: Alexey Budankov, Peter Zijlstra
  Cc: Ingo Molnar, Arnaldo Carvalho de Melo, Andi Kleen, Kan Liang,
	Dmitri Prokhorov, Valery Cherepennikov, Mark Rutland,
	Stephane Eranian, David Carrillo-Cisneros, linux-kernel

Alexey Budankov <alexey.budankov@linux.intel.com> writes:

>>>>> bool event_less(left, right)
>>>>> {
>>>>>   if (left->cpu < right->cpu)
>>>>>     return true;
>>>>>
>>>>>   if (left->cpu > right_cpu)
>>>>>     return false;
>>>>>
>>>>>   if (left->vtime < right->vtime)
>>>>>     return true;
>>>>>
>>>>>   return false;
>>>>> }
>>>>>
>>>>> insert_group(group, event, tail)
>>>>> {
>>>>>   if (tail)
>>>>>     event->vtime = ++group->vtime;
>>>>>
>>>>>   tree_insert(&group->root, event);
>>>>> }

[ ... ]

> 2. implementing special _less() function and rotation by re-inserting
>    group with incremented index;
>

[ ... ]

> Now I figured that not all indexed events are always located under 
> the root with the same cpu, and it depends on the order of insertion
> e.g. with insertion order 01,02,03,14,15,16 we get this:
>
>      02
>     /  \
>    01  14
>       /  \
>      03  15
>            \
>            16

How did you arrive at this? Seeing the actual code would help, because
this is not the ordering we're looking for. With Peter's earlier example
(quoted above) it shouldn't look like this.

Regards,
--
Alex

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH v6 1/3] perf/core: use rb trees for pinned/flexible groups
  2017-08-23 13:39             ` Alexander Shishkin
@ 2017-08-23 14:18               ` Alexey Budankov
  0 siblings, 0 replies; 76+ messages in thread
From: Alexey Budankov @ 2017-08-23 14:18 UTC (permalink / raw)
  To: Alexander Shishkin, Peter Zijlstra
  Cc: Ingo Molnar, Arnaldo Carvalho de Melo, Andi Kleen, Kan Liang,
	Dmitri Prokhorov, Valery Cherepennikov, Mark Rutland,
	Stephane Eranian, David Carrillo-Cisneros, linux-kernel

On 23.08.2017 16:39, Alexander Shishkin wrote:
> Alexey Budankov <alexey.budankov@linux.intel.com> writes:
> 
>>>>>> bool event_less(left, right)
>>>>>> {
>>>>>>   if (left->cpu < right->cpu)
>>>>>>     return true;
>>>>>>
>>>>>>   if (left->cpu > right_cpu)
>>>>>>     return false;
>>>>>>
>>>>>>   if (left->vtime < right->vtime)
>>>>>>     return true;
>>>>>>
>>>>>>   return false;
>>>>>> }
>>>>>>
>>>>>> insert_group(group, event, tail)
>>>>>> {
>>>>>>   if (tail)
>>>>>>     event->vtime = ++group->vtime;
>>>>>>
>>>>>>   tree_insert(&group->root, event);
>>>>>> }
> 
> [ ... ]
> 
>> 2. implementing special _less() function and rotation by re-inserting
>>    group with incremented index;
>>
> 
> [ ... ]
> 
>> Now I figured that not all indexed events are always located under 
>> the root with the same cpu, and it depends on the order of insertion
>> e.g. with insertion order 01,02,03,14,15,16 we get this:
>>
>>      02
>>     /  \
>>    01  14
>>       /  \
>>      03  15
>>            \
>>            16
> 
> How did you arrive at this? Seeing the actual code would help, because
> this is not the ordering we're looking for. With Peter's earlier example
> (quoted above) it shouldn't look like this.

I implemented the solution Peter suggested. Then, while testing, I noticed
a considerable difference in the number of collected samples when multiplexing 
events, in comparison to the version with the tree of lists. 

I then looked for a fast way to emulate the idea of a virtual index as 
a secondary key and found this RB tree emulator:

https://www.cs.usfca.edu/~galles/visualization/RedBlack.html

and it showed me the picture I mentioned above:

      02
     /  \
    01  14
       /  \
      03  15
            \
            16

I understand it is not 100% proof that the index idea doesn't work; 
however, it means that in order to apply the idea to this patch, 
some more changes are required in addition to what Peter 
shared earlier.

> 
> Regards,
> --
> Alex
> 
> 
> 

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH v7 1/2] perf/core: use rb trees for pinned/flexible groups
  2017-08-23 11:17     ` Alexander Shishkin
@ 2017-08-23 17:23       ` Alexey Budankov
  0 siblings, 0 replies; 76+ messages in thread
From: Alexey Budankov @ 2017-08-23 17:23 UTC (permalink / raw)
  To: Alexander Shishkin, Peter Zijlstra, Ingo Molnar,
	Arnaldo Carvalho de Melo
  Cc: Andi Kleen, Kan Liang, Dmitri Prokhorov, Valery Cherepennikov,
	Mark Rutland, Stephane Eranian, David Carrillo-Cisneros,
	linux-kernel

On 23.08.2017 14:17, Alexander Shishkin wrote:
> Alexey Budankov <alexey.budankov@linux.intel.com> writes:
> 
>> @@ -3091,61 +3231,55 @@ static void cpu_ctx_sched_out(struct perf_cpu_context *cpuctx,
>>  }
>>  
>>  static void
>> -ctx_pinned_sched_in(struct perf_event_context *ctx,
>> -		    struct perf_cpu_context *cpuctx)
>> +ctx_pinned_sched_in(struct perf_event *event,
>> +		   struct perf_cpu_context *cpuctx,
>> +		   struct perf_event_context *ctx)
> 
> If you're doing this, you also need to rename the function, because it
> now schedules in one event and not a context. But better just keep it as
> is.
> 
>>  {
>> -	struct perf_event *event;
>> -
>> -	list_for_each_entry(event, &ctx->pinned_groups, group_entry) {
> 
> Why not put your new iterator here in place of the old one, instead of
> moving things around? Because what follows is hard to read and is also
> completely unnecessary:
> 
>> -		if (event->state <= PERF_EVENT_STATE_OFF)
>> -			continue;
>> -		if (!event_filter_match(event))
>> -			continue;
>> +	if (event->state <= PERF_EVENT_STATE_OFF)
>> +		return;
>> +	if (!event_filter_match(event))
>> +		return;
> 
> like this,
> 
>>  
>> -		/* may need to reset tstamp_enabled */
>> -		if (is_cgroup_event(event))
>> -			perf_cgroup_mark_enabled(event, ctx);
>> +	/* may need to reset tstamp_enabled */
>> +	if (is_cgroup_event(event))
>> +		perf_cgroup_mark_enabled(event, ctx);
> 
> or this
> 
>>  
>> -		if (group_can_go_on(event, cpuctx, 1))
>> -			group_sched_in(event, cpuctx, ctx);
>> +	if (group_can_go_on(event, cpuctx, 1))
>> +		group_sched_in(event, cpuctx, ctx);
> 
> etc, etc.
> 
>> @@ -3156,7 +3290,8 @@ ctx_sched_in(struct perf_event_context *ctx,
>>  	     struct task_struct *task)
>>  {
>>  	int is_active = ctx->is_active;
>> -	u64 now;
> 
> Why?

Shortened the scope/lifetime of this variable.
It is declared, defined and initialized closer to the place of use.

> 
>> +	struct perf_event *event;
>> +	struct rb_node *node;
>>  
>>  	lockdep_assert_held(&ctx->lock);
>>  
>> @@ -3175,7 +3310,7 @@ ctx_sched_in(struct perf_event_context *ctx,
>>  
>>  	if (is_active & EVENT_TIME) {
>>  		/* start ctx time */
>> -		now = perf_clock();
>> +		u64 now = perf_clock();
> 
> Why?> 
>>  		ctx->timestamp = now;
>>  		perf_cgroup_set_timestamp(task, ctx);
>>  	}
>> @@ -3185,11 +3320,19 @@ ctx_sched_in(struct perf_event_context *ctx,
>>  	 * in order to give them the best chance of going on.
>>  	 */
>>  	if (is_active & EVENT_PINNED)
>> -		ctx_pinned_sched_in(ctx, cpuctx);
>> +		perf_event_groups_for_each(event, node,
>> +				&ctx->pinned_groups, group_node,
>> +				group_list, group_entry)
>> +			ctx_pinned_sched_in(event, cpuctx, ctx);
> 
> So this perf_event_groups_for_each() can just move into
> ctx_*_sched_in(), can't it?

Yes. Makes sense. Addressed it in v8. Thanks!

> 
> Regards,
> --
> Alex
> 

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH v7 2/2] perf/core: add mux switch to skip to the current CPU's events list on mux interrupt
  2017-08-23 11:54     ` Alexander Shishkin
@ 2017-08-23 18:12       ` Alexey Budankov
  0 siblings, 0 replies; 76+ messages in thread
From: Alexey Budankov @ 2017-08-23 18:12 UTC (permalink / raw)
  To: Alexander Shishkin, Peter Zijlstra, Ingo Molnar,
	Arnaldo Carvalho de Melo
  Cc: Andi Kleen, Kan Liang, Dmitri Prokhorov, Valery Cherepennikov,
	Mark Rutland, Stephane Eranian, David Carrillo-Cisneros,
	linux-kernel

On 23.08.2017 14:54, Alexander Shishkin wrote:
> Alexey Budankov <alexey.budankov@linux.intel.com> writes:
> 
>> This patch implements mux switch that triggers skipping to the 
>> current CPU's events list at mulitplexing hrtimer interrupt 
>> handler as well as adoption of the switch in the existing 
>> implementation.
>>
>> perf_event_groups_iterate_cpu() API is introduced to implement 
>> iteration thru the certain CPU groups list skipping groups 
> 
> "through"
> 
>> allocated for the other CPUs.
>>
>> Signed-off-by: Alexey Budankov <alexey.budankov@linux.intel.com>
>> ---
>>  kernel/events/core.c | 193 ++++++++++++++++++++++++++++++++++++---------------
>>  1 file changed, 137 insertions(+), 56 deletions(-)
>>
>> diff --git a/kernel/events/core.c b/kernel/events/core.c
>> index 08ccfb2..aeb0f81 100644
>> --- a/kernel/events/core.c
>> +++ b/kernel/events/core.c
>> @@ -556,11 +556,11 @@ void perf_sample_event_took(u64 sample_len_ns)
>>  static atomic64_t perf_event_id;
>>  
>>  static void cpu_ctx_sched_out(struct perf_cpu_context *cpuctx,
>> -			      enum event_type_t event_type);
>> +			      enum event_type_t event_type, int mux);
>>  
>>  static void cpu_ctx_sched_in(struct perf_cpu_context *cpuctx,
>>  			     enum event_type_t event_type,
>> -			     struct task_struct *task);
>> +			     struct task_struct *task, int mux);
>>  
>>  static void update_context_time(struct perf_event_context *ctx);
>>  static u64 perf_event_time(struct perf_event *event);
>> @@ -702,6 +702,7 @@ static void perf_cgroup_switch(struct task_struct *task, int mode)
>>  	struct perf_cpu_context *cpuctx;
>>  	struct list_head *list;
>>  	unsigned long flags;
>> +	int mux = 0;
>>  
>>  	/*
>>  	 * Disable interrupts and preemption to avoid this CPU's
>> @@ -717,7 +718,7 @@ static void perf_cgroup_switch(struct task_struct *task, int mode)
>>  		perf_pmu_disable(cpuctx->ctx.pmu);
>>  
>>  		if (mode & PERF_CGROUP_SWOUT) {
>> -			cpu_ctx_sched_out(cpuctx, EVENT_ALL);
>> +			cpu_ctx_sched_out(cpuctx, EVENT_ALL, mux);
>>  			/*
>>  			 * must not be done before ctxswout due
>>  			 * to event_filter_match() in event_sched_out()
>> @@ -736,7 +737,7 @@ static void perf_cgroup_switch(struct task_struct *task, int mode)
>>  			 */
>>  			cpuctx->cgrp = perf_cgroup_from_task(task,
>>  							     &cpuctx->ctx);
>> -			cpu_ctx_sched_in(cpuctx, EVENT_ALL, task);
>> +			cpu_ctx_sched_in(cpuctx, EVENT_ALL, task, mux);
> 
> 'mux' is always zero in this function, isn't it?
> 
>>  		}
>>  		perf_pmu_enable(cpuctx->ctx.pmu);
>>  		perf_ctx_unlock(cpuctx, cpuctx->task_ctx);
>> @@ -1613,8 +1614,16 @@ perf_event_groups_rotate(struct rb_root *groups, int cpu)
>>   */
>>  #define perf_event_groups_for_each(event, iter, tree, node, list, link) \
>>  	   for (iter = rb_first(tree); iter; iter = rb_next(iter))	\
>> -		list_for_each_entry(event, &(rb_entry(iter,		\
>> -			typeof(*event), node)->list), link)
>> +		list_for_each_entry(event, &(rb_entry(iter, 		\
>> +				typeof(*event), node)->list), link)
> 
> Is this an indentation change? What is it doing here?
> 
>> +
>> +/*
>> + * Iterate event groups related to specific cpu.
>> + */
>> +#define perf_event_groups_for_each_cpu(event, cpu, tree, list, link)	\
>> +		list = perf_event_groups_get_list(tree, cpu);		\
>> +		if (list)						\
>> +			list_for_each_entry(event, list, link)
> 
> ..or not, if there's no list.
> 
>>  
>>  /*
>>   * Add a event from the lists for its context.
>> @@ -2397,36 +2406,38 @@ static void add_event_to_ctx(struct perf_event *event,
>>  
>>  static void ctx_sched_out(struct perf_event_context *ctx,
>>  			  struct perf_cpu_context *cpuctx,
>> -			  enum event_type_t event_type);
>> +			  enum event_type_t event_type, int mux);
>>  static void
>>  ctx_sched_in(struct perf_event_context *ctx,
>>  	     struct perf_cpu_context *cpuctx,
>>  	     enum event_type_t event_type,
>> -	     struct task_struct *task);
>> +	     struct task_struct *task, int mux);
>>  
>>  static void task_ctx_sched_out(struct perf_cpu_context *cpuctx,
>>  			       struct perf_event_context *ctx,
>>  			       enum event_type_t event_type)
>>  {
>> +	int mux = 0;
>> +
>>  	if (!cpuctx->task_ctx)
>>  		return;
>>  
>>  	if (WARN_ON_ONCE(ctx != cpuctx->task_ctx))
>>  		return;
>>  
>> -	ctx_sched_out(ctx, cpuctx, event_type);
>> +	ctx_sched_out(ctx, cpuctx, event_type, mux);
> 
> Just use 0.

Well, I intentionally named this ugly switch via a local variable 
to add clarity to the code, so I simply wonder - why do you prefer 
the unnamed variant?

> 
>>  }
>>  
>>  static void perf_event_sched_in(struct perf_cpu_context *cpuctx,
>>  				struct perf_event_context *ctx,
>> -				struct task_struct *task)
>> +				struct task_struct *task, int mux)
>>  {
>> -	cpu_ctx_sched_in(cpuctx, EVENT_PINNED, task);
>> +	cpu_ctx_sched_in(cpuctx, EVENT_PINNED, task, mux);
>>  	if (ctx)
>> -		ctx_sched_in(ctx, cpuctx, EVENT_PINNED, task);
>> -	cpu_ctx_sched_in(cpuctx, EVENT_FLEXIBLE, task);
>> +		ctx_sched_in(ctx, cpuctx, EVENT_PINNED, task, mux);
>> +	cpu_ctx_sched_in(cpuctx, EVENT_FLEXIBLE, task, mux);
>>  	if (ctx)
>> -		ctx_sched_in(ctx, cpuctx, EVENT_FLEXIBLE, task);
>> +		ctx_sched_in(ctx, cpuctx, EVENT_FLEXIBLE, task, mux);
>>  }
>>  
>>  /*
>> @@ -2450,6 +2461,7 @@ static void ctx_resched(struct perf_cpu_context *cpuctx,
>>  {
>>  	enum event_type_t ctx_event_type = event_type & EVENT_ALL;
>>  	bool cpu_event = !!(event_type & EVENT_CPU);
>> +	int mux = 0;
>>  
>>  	/*
>>  	 * If pinned groups are involved, flexible groups also need to be
>> @@ -2470,11 +2482,11 @@ static void ctx_resched(struct perf_cpu_context *cpuctx,
>>  	 *  - otherwise, do nothing more.
>>  	 */
>>  	if (cpu_event)
>> -		cpu_ctx_sched_out(cpuctx, ctx_event_type);
>> +		cpu_ctx_sched_out(cpuctx, ctx_event_type, mux);
>>  	else if (ctx_event_type & EVENT_PINNED)
>> -		cpu_ctx_sched_out(cpuctx, EVENT_FLEXIBLE);
>> +		cpu_ctx_sched_out(cpuctx, EVENT_FLEXIBLE, mux);
>>  
>> -	perf_event_sched_in(cpuctx, task_ctx, current);
>> +	perf_event_sched_in(cpuctx, task_ctx, current, mux);
> 
> Also mux==0 in all cases in this function.
> 
>>  	perf_pmu_enable(cpuctx->ctx.pmu);
>>  }
>>  
>> @@ -2491,7 +2503,7 @@ static int  __perf_install_in_context(void *info)
>>  	struct perf_cpu_context *cpuctx = __get_cpu_context(ctx);
>>  	struct perf_event_context *task_ctx = cpuctx->task_ctx;
>>  	bool reprogram = true;
>> -	int ret = 0;
>> +	int ret = 0, mux =0;
>>  
>>  	raw_spin_lock(&cpuctx->ctx.lock);
>>  	if (ctx->task) {
>> @@ -2518,7 +2530,7 @@ static int  __perf_install_in_context(void *info)
>>  	}
>>  
>>  	if (reprogram) {
>> -		ctx_sched_out(ctx, cpuctx, EVENT_TIME);
>> +		ctx_sched_out(ctx, cpuctx, EVENT_TIME, mux);
>>  		add_event_to_ctx(event, ctx);
>>  		ctx_resched(cpuctx, task_ctx, get_event_type(event));
>>  	} else {
>> @@ -2655,13 +2667,14 @@ static void __perf_event_enable(struct perf_event *event,
>>  {
>>  	struct perf_event *leader = event->group_leader;
>>  	struct perf_event_context *task_ctx;
>> +	int mux = 0;
>>  
>>  	if (event->state >= PERF_EVENT_STATE_INACTIVE ||
>>  	    event->state <= PERF_EVENT_STATE_ERROR)
>>  		return;
>>  
>>  	if (ctx->is_active)
>> -		ctx_sched_out(ctx, cpuctx, EVENT_TIME);
>> +		ctx_sched_out(ctx, cpuctx, EVENT_TIME, mux);
>>  
>>  	__perf_event_mark_enabled(event);
>>  
>> @@ -2671,7 +2684,7 @@ static void __perf_event_enable(struct perf_event *event,
>>  	if (!event_filter_match(event)) {
>>  		if (is_cgroup_event(event))
>>  			perf_cgroup_defer_enabled(event);
>> -		ctx_sched_in(ctx, cpuctx, EVENT_TIME, current);
>> +		ctx_sched_in(ctx, cpuctx, EVENT_TIME, current, mux);
>>  		return;
>>  	}
>>  
>> @@ -2680,7 +2693,7 @@ static void __perf_event_enable(struct perf_event *event,
>>  	 * then don't put it on unless the group is on.
>>  	 */
>>  	if (leader != event && leader->state != PERF_EVENT_STATE_ACTIVE) {
>> -		ctx_sched_in(ctx, cpuctx, EVENT_TIME, current);
>> +		ctx_sched_in(ctx, cpuctx, EVENT_TIME, current, mux);
> 
> And here.
> 
>>  		return;
>>  	}
>>  
>> @@ -2876,11 +2889,13 @@ EXPORT_SYMBOL_GPL(perf_event_refresh);
>>  
>>  static void ctx_sched_out(struct perf_event_context *ctx,
>>  			  struct perf_cpu_context *cpuctx,
>> -			  enum event_type_t event_type)
>> +			  enum event_type_t event_type, int mux)
>>  {
>>  	int is_active = ctx->is_active;
>> +	struct list_head *group_list;
>>  	struct perf_event *event;
>>  	struct rb_node *node;
>> +	int sw = -1, cpu = smp_processor_id();
> 
> Same thing seems to be happening with 'sw'.
> 
>>  	lockdep_assert_held(&ctx->lock);
>>  
>>  	if (likely(!ctx->nr_events)) {
>> @@ -2926,17 +2941,47 @@ static void ctx_sched_out(struct perf_event_context *ctx,
>>  
>>  	perf_pmu_disable(ctx->pmu);
>>  
>> -	if (is_active & EVENT_PINNED)
>> -		perf_event_groups_for_each(event, node,
>> -				&ctx->pinned_groups, group_node,
>> -				group_list, group_entry)
>> -			group_sched_out(event, cpuctx, ctx);
>> +	if (is_active & EVENT_PINNED) {
>> +		if (mux) {
> 
> So it's 'rotate', really.

Which 'rotate' do you mean? If you mean the local variable from 
perf_rotate_context(), then yes - it may be passed into this function 
call from there as the mux value, but logically the two are still different.

> 
>> +			perf_event_groups_for_each_cpu(event, cpu,
>> +					&ctx->pinned_groups,
>> +					group_list, group_entry) {
>> +					group_sched_out(event, cpuctx, ctx);
>> +			}
>> +			perf_event_groups_for_each_cpu(event, sw,
>> +					&ctx->pinned_groups,
>> +					group_list, group_entry) {
>> +					group_sched_out(event, cpuctx, ctx);
>> +			}
>> +		} else {
>> +			perf_event_groups_for_each(event, node,
>> +					&ctx->pinned_groups, group_node,
>> +					group_list, group_entry) {
>> +					group_sched_out(event, cpuctx, ctx);
>> +			}
>> +		}
>> +	}
>>  
>> -	if (is_active & EVENT_FLEXIBLE)
>> -		perf_event_groups_for_each(event, node,
>> -				&ctx->flexible_groups, group_node,
>> -				group_list, group_entry)
>> -			group_sched_out(event, cpuctx, ctx);
>> +	if (is_active & EVENT_FLEXIBLE) {
>> +		if (mux) {
>> +			perf_event_groups_for_each_cpu(event, cpu,
>> +					&ctx->flexible_groups,
>> +					group_list, group_entry) {
>> +					group_sched_out(event, cpuctx, ctx);
>> +			}
>> +			perf_event_groups_for_each_cpu(event, sw,
>> +					&ctx->flexible_groups,
>> +					group_list, group_entry) {
>> +					group_sched_out(event, cpuctx, ctx);
>> +			}
>> +		} else {
>> +			perf_event_groups_for_each(event, node,
>> +					&ctx->flexible_groups, group_node,
>> +					group_list, group_entry) {
>> +					group_sched_out(event, cpuctx, ctx);
>> +			}
>> +		}
>> +	}
>>  
>>  	perf_pmu_enable(ctx->pmu);
>>  }
>> @@ -3225,9 +3270,9 @@ void __perf_event_task_sched_out(struct task_struct *task,
>>   * Called with IRQs disabled
>>   */
>>  static void cpu_ctx_sched_out(struct perf_cpu_context *cpuctx,
>> -			      enum event_type_t event_type)
>> +			      enum event_type_t event_type, int mux)
>>  {
>> -	ctx_sched_out(&cpuctx->ctx, cpuctx, event_type);
>> +	ctx_sched_out(&cpuctx->ctx, cpuctx, event_type, mux);
>>  }
>>  
>>  static void
>> @@ -3287,11 +3332,13 @@ static void
>>  ctx_sched_in(struct perf_event_context *ctx,
>>  	     struct perf_cpu_context *cpuctx,
>>  	     enum event_type_t event_type,
>> -	     struct task_struct *task)
>> +	     struct task_struct *task, int mux)
>>  {
>>  	int is_active = ctx->is_active;
>> +	struct list_head *group_list;
>>  	struct perf_event *event;
>>  	struct rb_node *node;
>> +	int sw = -1, cpu = smp_processor_id();
>>  
>>  	lockdep_assert_held(&ctx->lock);
>>  
>> @@ -3319,35 +3366,69 @@ ctx_sched_in(struct perf_event_context *ctx,
>>  	 * First go through the list and put on any pinned groups
>>  	 * in order to give them the best chance of going on.
>>  	 */
>> -	if (is_active & EVENT_PINNED)
>> -		perf_event_groups_for_each(event, node,
>> -				&ctx->pinned_groups, group_node,
>> -				group_list, group_entry)
>> -			ctx_pinned_sched_in(event, cpuctx, ctx);
>> +	if (is_active & EVENT_PINNED) {
>> +		if (mux) {
>> +			perf_event_groups_for_each_cpu(event, sw,
>> +					&ctx->pinned_groups,
>> +					group_list, group_entry) {
>> +					ctx_pinned_sched_in(event, cpuctx, ctx);
>> +			}
>> +			perf_event_groups_for_each_cpu(event, cpu,
>> +					&ctx->pinned_groups,
>> +					group_list, group_entry) {
>> +					ctx_pinned_sched_in(event, cpuctx, ctx);
>> +			}
>> +		} else {
>> +			perf_event_groups_for_each(event, node,
>> +					&ctx->pinned_groups, group_node,
>> +					group_list, group_entry) {
>> +					ctx_pinned_sched_in(event, cpuctx, ctx);
>> +			}
>> +		}
>> +	}
>>  
>>  	/* Then walk through the lower prio flexible groups */
>>  	if (is_active & EVENT_FLEXIBLE) {
>>  		int can_add_hw = 1;
>> -		perf_event_groups_for_each(event, node,
>> -				&ctx->flexible_groups, group_node,
>> -				group_list, group_entry)
>> -			ctx_flexible_sched_in(event, cpuctx, ctx, &can_add_hw);
>> +		if (mux) {
>> +			perf_event_groups_for_each_cpu(event, sw,
>> +					&ctx->flexible_groups,
>> +					group_list, group_entry) {
>> +					ctx_flexible_sched_in(event, cpuctx,
>> +							ctx, &can_add_hw);
>> +			}
>> +			can_add_hw = 1;
>> +			perf_event_groups_for_each_cpu(event, cpu,
>> +					&ctx->flexible_groups,
>> +					group_list, group_entry) {
>> +					ctx_flexible_sched_in(event, cpuctx,
>> +							ctx, &can_add_hw);
>> +			}
>> +		} else {
>> +			perf_event_groups_for_each(event, node,
>> +					&ctx->flexible_groups, group_node,
>> +					group_list, group_entry) {
>> +					ctx_flexible_sched_in(event, cpuctx,
>> +							ctx, &can_add_hw);
>> +			}
>> +		}
>>  	}
>>  }
>>  
>>  static void cpu_ctx_sched_in(struct perf_cpu_context *cpuctx,
>>  			     enum event_type_t event_type,
>> -			     struct task_struct *task)
>> +			     struct task_struct *task, int mux)
>>  {
>>  	struct perf_event_context *ctx = &cpuctx->ctx;
>>  
>> -	ctx_sched_in(ctx, cpuctx, event_type, task);
>> +	ctx_sched_in(ctx, cpuctx, event_type, task, mux);
>>  }
>>  
>>  static void perf_event_context_sched_in(struct perf_event_context *ctx,
>>  					struct task_struct *task)
>>  {
>>  	struct perf_cpu_context *cpuctx;
>> +	int mux = 0;
>>  
>>  	cpuctx = __get_cpu_context(ctx);
>>  	if (cpuctx->task_ctx == ctx)
>> @@ -3371,8 +3452,8 @@ static void perf_event_context_sched_in(struct perf_event_context *ctx,
>>  	 * events, no need to flip the cpuctx's events around.
>>  	 */
>>  	if (!RB_EMPTY_ROOT(&ctx->pinned_groups))
>> -		cpu_ctx_sched_out(cpuctx, EVENT_FLEXIBLE);
>> -	perf_event_sched_in(cpuctx, ctx, task);
>> +		cpu_ctx_sched_out(cpuctx, EVENT_FLEXIBLE, mux);
>> +	perf_event_sched_in(cpuctx, ctx, task, mux);
>>  	perf_pmu_enable(ctx->pmu);
>>  
>>  unlock:
>> @@ -3618,7 +3699,7 @@ static void rotate_ctx(struct perf_event_context *ctx)
>>  static int perf_rotate_context(struct perf_cpu_context *cpuctx)
>>  {
>>  	struct perf_event_context *ctx = NULL;
>> -	int rotate = 0;
>> +	int rotate = 0, mux = 1;
>>  
>>  	if (cpuctx->ctx.nr_events) {
>>  		if (cpuctx->ctx.nr_events != cpuctx->ctx.nr_active)
>> @@ -3637,15 +3718,15 @@ static int perf_rotate_context(struct perf_cpu_context *cpuctx)
>>  	perf_ctx_lock(cpuctx, cpuctx->task_ctx);
>>  	perf_pmu_disable(cpuctx->ctx.pmu);
>>  
>> -	cpu_ctx_sched_out(cpuctx, EVENT_FLEXIBLE);
>> +	cpu_ctx_sched_out(cpuctx, EVENT_FLEXIBLE, mux);
> 
> It's '1'.
> 
> 

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH v6 1/3] perf/core: use rb trees for pinned/flexible groups
  2017-08-15 17:28           ` Alexey Budankov
  2017-08-23 13:39             ` Alexander Shishkin
@ 2017-08-29 13:51             ` Alexander Shishkin
  2017-08-30  8:30               ` Alexey Budankov
  2017-08-31 10:12             ` Alexey Budankov
  2 siblings, 1 reply; 76+ messages in thread
From: Alexander Shishkin @ 2017-08-29 13:51 UTC (permalink / raw)
  To: Alexey Budankov, Peter Zijlstra
  Cc: Ingo Molnar, Arnaldo Carvalho de Melo, Andi Kleen, Kan Liang,
	Dmitri Prokhorov, Valery Cherepennikov, Mark Rutland,
	Stephane Eranian, David Carrillo-Cisneros, linux-kernel

Alexey Budankov <alexey.budankov@linux.intel.com> writes:

> Now I figured that not all indexed events are always located under 
> the root with the same cpu, and it depends on the order of insertion
> e.g. with insertion order 01,02,03,14,15,16 we get this:
>
>      02
>     /  \
>    01  14
>       /  \
>      03  15
>            \
>            16
>
> and it is unclear how to iterate cpu==0 part of tree in this case.

Using this example, rb_next() should take you through the nodes in this
order (assuming you start with 01): 01, 02, 03, 14, etc. So you iterate
while event->cpu==cpu using rb_next() and you should be fine.
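
Roughly like this, that is (a sketch only; group_node is assumed to be the
rb_node member, and group_first() whatever helper returns the leftmost entry
for @cpu):

	struct perf_event *event;

	for (event = group_first(groups, cpu);		/* leftmost entry for @cpu */
	     event && event->cpu == cpu;
	     event = rb_entry_safe(rb_next(&event->group_node),
				   struct perf_event, group_node)) {
		/* each group belonging to this cpu, in tree order */
	}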

> Iterating cpu specific subtree like this:
>
> #define for_each_group_event(event, group, cpu, pmu, field)	 \
> 	for (event = rb_entry_safe(group_first(group, cpu, pmu), \
> 				   typeof(*event), field);	 \
> 	     event && event->cpu == cpu && event->pmu == pmu;	 \
> 	     event = rb_entry_safe(rb_next(&event->field),	 \
> 				   typeof(*event), field))

Afaict, this assumes that you are also ordering on event->pmu, which
should be reflected in your _less function. And also assuming that
group_first() is doing the right thing. Can we see the code?
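
By ordering on the pmu I mean something along these lines in the _less
function (a sketch only; the pmu is compared by pointer value purely to get a
stable ordering, and 'vtime' stands in for whatever rotation index the tree
ends up using):

static bool
perf_event_groups_less(struct perf_event *left, struct perf_event *right)
{
	if (left->cpu < right->cpu)
		return true;
	if (left->cpu > right->cpu)
		return false;

	/* secondary key: the pmu, ordered by pointer value only */
	if (left->pmu < right->pmu)
		return true;
	if (left->pmu > right->pmu)
		return false;

	/* and finally the rotation index */
	return left->vtime < right->vtime;
}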

Regards,
--
Alex

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH v6 1/3] perf/core: use rb trees for pinned/flexible groups
  2017-08-29 13:51             ` Alexander Shishkin
@ 2017-08-30  8:30               ` Alexey Budankov
  2017-08-30 10:18                 ` Alexander Shishkin
  2017-08-30 11:16                 ` Alexey Budankov
  0 siblings, 2 replies; 76+ messages in thread
From: Alexey Budankov @ 2017-08-30  8:30 UTC (permalink / raw)
  To: Alexander Shishkin, Peter Zijlstra
  Cc: Ingo Molnar, Arnaldo Carvalho de Melo, Andi Kleen, Kan Liang,
	Dmitri Prokhorov, Valery Cherepennikov, Mark Rutland,
	Stephane Eranian, David Carrillo-Cisneros, linux-kernel

On 29.08.2017 16:51, Alexander Shishkin wrote:
> Alexey Budankov <alexey.budankov@linux.intel.com> writes:
> 
>> Now I figured that not all indexed events are always located under 
>> the root with the same cpu, and it depends on the order of insertion
>> e.g. with insertion order 01,02,03,14,15,16 we get this:
>>
>>      02
>>     /  \
>>    01  14
>>       /  \
>>      03  15
>>            \
>>            16
>>
>> and it is unclear how to iterate cpu==0 part of tree in this case.
> 
> Using this example, rb_next() should take you through the nodes in this
> order (assuming you start with 01): 01, 02, 03, 14, etc. So you iterate
> while event->cpu==cpu using rb_next() and you should be fine.

Well, indeed we get the leftmost leaf (03) from rb_next() for the case above.

> 
>> Iterating cpu specific subtree like this:
>>
>> #define for_each_group_event(event, group, cpu, pmu, field)	 \
>> 	for (event = rb_entry_safe(group_first(group, cpu, pmu), \
>> 				   typeof(*event), field);	 \
>> 	     event && event->cpu == cpu && event->pmu == pmu;	 \
>> 	     event = rb_entry_safe(rb_next(&event->field),	 \
>> 				   typeof(*event), field))
> 
> Afaict, this assumes that you are also ordering on event->pmu, which
> should be reflected in your _less function. And also assuming that
> group_first() is doing the right thing. Can we see the code?

I didn't do ordering by PMU for this patch set. What's more, I implemented 
groups_first() like this:

static struct perf_event *
perf_event_groups_first(struct perf_event_groups *groups, int cpu)
{
	struct perf_event *node_event = NULL;
	struct rb_node *node = NULL;

	node = groups->tree.rb_node;

	while (node) {
		node_event = container_of(node,
				struct perf_event, group_node);

		if (cpu < node_event->cpu) {
			node = node->rb_left;
		} else if (cpu > node_event->cpu) {
			node = node->rb_right;
		} else {
			node = node->rb_left;
		}
	}

	return node_event;
}

and it doesn't work as expected for the case above with cpu == 1.

I corrected the code above to this:

static struct perf_event *
perf_event_groups_first(struct perf_event_groups *groups, int cpu)
{
	struct perf_event *node_event = NULL, *match = NULL;
	struct rb_node *node = NULL;

	node = groups->tree.rb_node;

	while (node) {
		node_event = container_of(node,
				struct perf_event, group_node);

		if (cpu < node_event->cpu) {
			node = node->rb_left;
		} else if (cpu > node_event->cpu) {
			node = node->rb_right;
		} else {
			match = node_event;
			node = node->rb_left;
		}
	}

	return match;
}

but now I am struggling with silent oopses which, I guess, are not 
related to multiplexing at all.

Please have a look at v8 in the meantime. It addresses your comments on v7.

> 
> Regards,
> --
> Alex
> 

Thanks,
Alexey

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH v6 1/3] perf/core: use rb trees for pinned/flexible groups
  2017-08-30  8:30               ` Alexey Budankov
@ 2017-08-30 10:18                 ` Alexander Shishkin
  2017-08-30 10:30                   ` Alexey Budankov
  2017-08-30 11:16                 ` Alexey Budankov
  1 sibling, 1 reply; 76+ messages in thread
From: Alexander Shishkin @ 2017-08-30 10:18 UTC (permalink / raw)
  To: Alexey Budankov, Peter Zijlstra
  Cc: Ingo Molnar, Arnaldo Carvalho de Melo, Andi Kleen, Kan Liang,
	Dmitri Prokhorov, Valery Cherepennikov, Mark Rutland,
	Stephane Eranian, David Carrillo-Cisneros, linux-kernel

Alexey Budankov <alexey.budankov@linux.intel.com> writes:

>>> Iterating cpu specific subtree like this:
>>>
>>> #define for_each_group_event(event, group, cpu, pmu, field)	 \
>>> 	for (event = rb_entry_safe(group_first(group, cpu, pmu), \
>>> 				   typeof(*event), field);	 \
>>> 	     event && event->cpu == cpu && event->pmu == pmu;	 \
>>> 	     event = rb_entry_safe(rb_next(&event->field),	 \
>>> 				   typeof(*event), field))
>> 
>> Afaict, this assumes that you are also ordering on event->pmu, which
>> should be reflected in your _less function. And also assuming that
>> group_first() is doing the right thing. Can we see the code?
>
> I didn't do ordering by PMU for this patch set. Yet more I implemented 
> groups_first() like this:

Your iterator (quoted above) begs to differ.

Regards,
--
Alex

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH v6 1/3] perf/core: use rb trees for pinned/flexible groups
  2017-08-30 10:18                 ` Alexander Shishkin
@ 2017-08-30 10:30                   ` Alexey Budankov
  2017-08-30 11:13                     ` Alexander Shishkin
  0 siblings, 1 reply; 76+ messages in thread
From: Alexey Budankov @ 2017-08-30 10:30 UTC (permalink / raw)
  To: Alexander Shishkin, Peter Zijlstra
  Cc: Ingo Molnar, Arnaldo Carvalho de Melo, Andi Kleen, Kan Liang,
	Dmitri Prokhorov, Valery Cherepennikov, Mark Rutland,
	Stephane Eranian, David Carrillo-Cisneros, linux-kernel

On 30.08.2017 13:18, Alexander Shishkin wrote:
> Alexey Budankov <alexey.budankov@linux.intel.com> writes:
> 
>>>> Iterating cpu specific subtree like this:
>>>>
>>>> #define for_each_group_event(event, group, cpu, pmu, field)	 \
>>>> 	for (event = rb_entry_safe(group_first(group, cpu, pmu), \
>>>> 				   typeof(*event), field);	 \
>>>> 	     event && event->cpu == cpu && event->pmu == pmu;	 \
>>>> 	     event = rb_entry_safe(rb_next(&event->field),	 \
>>>> 				   typeof(*event), field))
>>>
>>> Afaict, this assumes that you are also ordering on event->pmu, which
>>> should be reflected in your _less function. And also assuming that
>>> group_first() is doing the right thing. Can we see the code?
>>
>> I didn't do ordering by PMU for this patch set. Yet more I implemented 
>> groups_first() like this:
> 
> Your iterator (quoted above) begs to differ.

What do you specifically mean? I am doing iterations like this:

/*
 * Iterate event groups thru the whole tree.
 */
#define perf_event_groups_for_each(event, groups, node)			\
	for (event = rb_entry_safe(rb_first(&((groups)->tree)),		\
			typeof(*event), node); event; 			\
			event = rb_entry_safe(rb_next(&event->node),	\
				typeof(*event), node))

/*
 * Iterate event groups with cpu == cpu_id.
 */
#define perf_event_groups_for_each_cpu(event, key, groups, node)	\
	for (event = perf_event_groups_first(groups, key);		\
		event && event->cpu == key;				\
		event = rb_entry_safe(rb_next(&event->node),		\
				typeof(*event), node))
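
so a call site that wants only the current cpu's groups ends up looking
roughly like this (simplified sketch):

	int cpu = smp_processor_id();

	perf_event_groups_for_each_cpu(event, cpu,
			&ctx->pinned_groups, group_node) {
		ctx_pinned_sched_in(event, cpuctx, ctx);
	}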

> 
> Regards,
> --
> Alex
> 

Thanks,
Alexey

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH v6 1/3] perf/core: use rb trees for pinned/flexible groups
  2017-08-30 10:30                   ` Alexey Budankov
@ 2017-08-30 11:13                     ` Alexander Shishkin
  0 siblings, 0 replies; 76+ messages in thread
From: Alexander Shishkin @ 2017-08-30 11:13 UTC (permalink / raw)
  To: Alexey Budankov, Peter Zijlstra
  Cc: Ingo Molnar, Arnaldo Carvalho de Melo, Andi Kleen, Kan Liang,
	Dmitri Prokhorov, Valery Cherepennikov, Mark Rutland,
	Stephane Eranian, David Carrillo-Cisneros, linux-kernel

Alexey Budankov <alexey.budankov@linux.intel.com> writes:

> On 30.08.2017 13:18, Alexander Shishkin wrote:
>> Alexey Budankov <alexey.budankov@linux.intel.com> writes:
>> 
>>>>> Iterating cpu specific subtree like this:
>>>>>
>>>>> #define for_each_group_event(event, group, cpu, pmu, field)	 \
>>>>> 	for (event = rb_entry_safe(group_first(group, cpu, pmu), \
>>>>> 				   typeof(*event), field);	 \
>>>>> 	     event && event->cpu == cpu && event->pmu == pmu;	 \
>>>>> 	     event = rb_entry_safe(rb_next(&event->field),	 \
>>>>> 				   typeof(*event), field))
>>>>
>>>> Afaict, this assumes that you are also ordering on event->pmu, which
>>>> should be reflected in your _less function. And also assuming that
>>>> group_first() is doing the right thing. Can we see the code?
>>>
>>> I didn't do ordering by PMU for this patch set. Yet more I implemented 
>>> groups_first() like this:
>> 
>> Your iterator (quoted above) begs to differ.
>
> What do you specifically mean? I am doing iterations like this:

I mean the code that you've shown before, which is quoted above. It's
difficult to tell why something's not working if you don't show the
code.

Regards,
--
Alex

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH v6 1/3] perf/core: use rb trees for pinned/flexible groups
  2017-08-30  8:30               ` Alexey Budankov
  2017-08-30 10:18                 ` Alexander Shishkin
@ 2017-08-30 11:16                 ` Alexey Budankov
  2017-08-31 10:12                   ` Alexey Budankov
  1 sibling, 1 reply; 76+ messages in thread
From: Alexey Budankov @ 2017-08-30 11:16 UTC (permalink / raw)
  To: Alexander Shishkin, Peter Zijlstra
  Cc: Ingo Molnar, Arnaldo Carvalho de Melo, Andi Kleen, Kan Liang,
	Dmitri Prokhorov, Valery Cherepennikov, Mark Rutland,
	Stephane Eranian, David Carrillo-Cisneros, linux-kernel

On 30.08.2017 11:30, Alexey Budankov wrote:
> On 29.08.2017 16:51, Alexander Shishkin wrote:
>> Alexey Budankov <alexey.budankov@linux.intel.com> writes:
>>
>>> Now I figured that not all indexed events are always located under 
>>> the root with the same cpu, and it depends on the order of insertion
>>> e.g. with insertion order 01,02,03,14,15,16 we get this:
>>>
>>>      02
>>>     /  \
>>>    01  14
>>>       /  \
>>>      03  15
>>>            \
>>>            16
>>>
>>> and it is unclear how to iterate cpu==0 part of tree in this case.
>>
>> Using this example, rb_next() should take you through the nodes in this
>> order (assuming you start with 01): 01, 02, 03, 14, etc. So you iterate
>> while event->cpu==cpu using rb_next() and you should be fine.
> 
> Well, indeed we get the most left leaf (03) in rb_next() for the case above.
> 
>>
>>> Iterating cpu specific subtree like this:
>>>
>>> #define for_each_group_event(event, group, cpu, pmu, field)	 \
>>> 	for (event = rb_entry_safe(group_first(group, cpu, pmu), \
>>> 				   typeof(*event), field);	 \
>>> 	     event && event->cpu == cpu && event->pmu == pmu;	 \
>>> 	     event = rb_entry_safe(rb_next(&event->field),	 \
>>> 				   typeof(*event), field))
>>
>> Afaict, this assumes that you are also ordering on event->pmu, which
>> should be reflected in your _less function. And also assuming that
>> group_first() is doing the right thing. Can we see the code?
> 
> I didn't do ordering by PMU for this patch set. Yet more I implemented 
> groups_first() like this:
> 
> static struct perf_event *
> perf_event_groups_first(struct perf_event_groups *groups, int cpu)
> {
> 	struct perf_event *node_event = NULL;
> 	struct rb_node *node = NULL;
> 
> 	node = groups->tree.rb_node;
> 
> 	while (node) {
> 		node_event = container_of(node,
> 				struct perf_event, group_node);
> 
> 		if (cpu < node_event->cpu) {
> 			node = node->rb_left;
> 		} else if (cpu > node_event->cpu) {
> 			node = node->rb_right;
> 		} else {
> 			node = node->rb_left;
> 		}
> 	}
> 
> 	return node_event;
> }
> 
> and it doesn't work as expected for case above with cpu == 1.
> 
> I corrected the code above to this:
> 
> static struct perf_event *
> perf_event_groups_first(struct perf_event_groups *groups, int cpu)
> {
> 	struct perf_event *node_event = NULL, *match = NULL;
> 	struct rb_node *node = NULL;
> 
> 	node = groups->tree.rb_node;
> 
> 	while (node) {
> 		node_event = container_of(node,
> 				struct perf_event, group_node);
> 
> 		if (cpu < node_event->cpu) {
> 			node = node->rb_left;
> 		} else if (cpu > node_event->cpu) {
> 			node = node->rb_right;
> 		} else {
> 			match = node_event;
> 			node = node->rb_left;
> 		}
> 	}
> 
> 	return match;
> }
> 
> but now struggling with silent oopses which I guess are not 
> related to multiplexing at all.

I added logging to the code and now see this in the dmesg output:

[  175.743879] BUG: unable to handle kernel paging request at 00007fe2a90d1a54
[  175.743899] IP: __task_pid_nr_ns+0x3b/0x90
[  175.743903] PGD 2f317ca067 
[  175.743906] P4D 2f317ca067 
[  175.743910] PUD 0 

[  175.743926] Oops: 0000 [#1] SMP
[  175.743931] Modules linked in: fuse xt_CHECKSUM iptable_mangle ipt_MASQUERADE nf_nat_masquerade_ipv4 iptable_nat nf_nat_ipv4 nf_nat nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack libcrc32c tun bridge stp llc ebtable_filter ebtables ip6table_filter ip6_tables nfsv3 rpcsec_gss_krb5 nfsv4 cmac arc4 md4 nls_utf8 cifs nfs ccm dns_resolver fscache rpcrdma ib_isert iscsi_target_mod ib_iser libiscsi scsi_transport_iscsi ib_srpt target_core_mod ib_srp scsi_transport_srp ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm intel_rapl sb_edac x86_pkg_temp_thermal intel_powerclamp coretemp hfi1 crct10dif_pclmul crc32_pclmul ghash_clmulni_intel intel_cstate rdmavt joydev ipmi_ssif intel_uncore ib_core ipmi_si ipmi_devintf intel_rapl_perf iTCO_wdt iTCO_vendor_support pcspkr tpm_tis tpm_tis_core
[  175.744088]  mei_me tpm i2c_i801 ipmi_msghandler lpc_ich mei shpchp wmi acpi_power_meter acpi_pad nfsd auth_rpcgss nfs_acl lockd grace sunrpc mgag200 drm_kms_helper ttm drm igb crc32c_intel ptp pps_core dca i2c_algo_bit
[  175.744156] CPU: 12 PID: 8272 Comm: perf Not tainted 4.13.0-rc4-v7.3.3+ #13
[  175.744160] Hardware name: Intel Corporation S7200AP/S7200AP, BIOS S72C610.86B.01.01.0190.080520162104 08/05/2016
[  175.744165] task: ffff90c47d4d0000 task.stack: ffffae42d8fb0000
[  175.744177] RIP: 0010:__task_pid_nr_ns+0x3b/0x90
[  175.744181] RSP: 0018:ffff90c4bbd05ae0 EFLAGS: 00010046
[  175.744190] RAX: 0000000000000000 RBX: ffff90c47d4d0000 RCX: 00007fe2a90d1a50
[  175.744204] RDX: ffffffffbee4ed20 RSI: 0000000000000000 RDI: ffff90c47d4d0000
[  175.744209] RBP: ffff90c4bbd05ae0 R08: 0000000000281a93 R09: 0000000000000000
[  175.744213] R10: 0000000000000005 R11: 0000000000000000 R12: ffff90c46d25d800
[  175.744218] R13: ffff90c4bbd05c40 R14: ffff90c47d4d0000 R15: ffff90c46d25d800
[  175.744224] FS:  0000000000000000(0000) GS:ffff90c4bbd00000(0000) knlGS:0000000000000000
[  175.744228] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  175.744232] CR2: 00007fe2a90d1a54 CR3: 0000002f33bd2000 CR4: 00000000001406e0
[  175.744236] Call Trace:
[  175.744243]  <NMI>
[  175.744259]  perf_event_pid_type+0x27/0x40
[  175.744272]  __perf_event_header__init_id+0xb5/0xd0
[  175.744284]  perf_prepare_sample+0x54/0x360
[  175.744293]  perf_event_output_forward+0x2f/0x80
[  175.744304]  ? sched_clock+0xb/0x10
[  175.744315]  ? sched_clock_cpu+0x11/0xb0
[  175.744334]  __perf_event_overflow+0x54/0xe0
[  175.744346]  perf_event_overflow+0x14/0x20
[  175.744356]  intel_pmu_handle_irq+0x203/0x4b0
[  175.744379]  perf_event_nmi_handler+0x2d/0x50
[  175.744391]  nmi_handle+0x61/0x110
[  175.744402]  default_do_nmi+0x44/0x110
[  175.744411]  do_nmi+0x113/0x190
[  175.744421]  end_repeat_nmi+0x1a/0x1e
[  175.744432] RIP: 0010:native_write_msr+0x6/0x30
[  175.744437] RSP: 0018:ffffae42d8fb3c18 EFLAGS: 00000002
[  175.744452] RAX: 0000000000000003 RBX: ffff90c4bbd0a360 RCX: 000000000000038f
[  175.744457] RDX: 0000000000000007 RSI: 0000000000000003 RDI: 000000000000038f
[  175.744461] RBP: ffffae42d8fb3c30 R08: 0000000000281a93 R09: 0000000000000000
[  175.744464] R10: 0000000000000005 R11: 0000000000000000 R12: 0000000000000000
[  175.744468] R13: ffff90c4bbd0a360 R14: ffff90c4bbd0a584 R15: 0000000000000001
[  175.744485]  ? native_write_msr+0x6/0x30
[  175.744495]  ? native_write_msr+0x6/0x30
[  175.744504]  </NMI>
[  175.744514]  ? __intel_pmu_enable_all.isra.13+0x4f/0x80
[  175.744524]  intel_pmu_enable_all+0x10/0x20
[  175.744534]  x86_pmu_enable+0x263/0x2f0
[  175.744545]  perf_pmu_enable+0x22/0x30
[  175.744554]  ctx_resched+0x74/0xb0
[  175.744568]  perf_event_exec+0x17e/0x1e0
[  175.744584]  setup_new_exec+0x72/0x180
[  175.744595]  load_elf_binary+0x39f/0x15ea
[  175.744610]  ? get_user_pages_remote+0x83/0x1f0
[  175.744620]  ? __check_object_size+0x164/0x1a0
[  175.744632]  ? __check_object_size+0x164/0x1a0
[  175.744642]  ? _copy_from_user+0x33/0x70
[  175.744654]  search_binary_handler+0x9e/0x1e0
[  175.744664]  do_execveat_common.isra.31+0x53d/0x700
[  175.744677]  SyS_execve+0x3a/0x50
[  175.744691]  do_syscall_64+0x67/0x150
[  175.744702]  entry_SYSCALL64_slow_path+0x25/0x25
[  175.744709] RIP: 0033:0x7fe2a6cb77a7
[  175.744713] RSP: 002b:00007ffc554690b8 EFLAGS: 00000202 ORIG_RAX: 000000000000003b
[  175.744721] RAX: ffffffffffffffda RBX: 00007ffc5546b900 RCX: 00007fe2a6cb77a7
[  175.744725] RDX: 000000000151dd70 RSI: 00007ffc5546b900 RDI: 00007ffc5546d59e
[  175.744735] RBP: 00007ffc55469140 R08: 00007ffc554690a0 R09: 00007ffc55468f50
[  175.744740] R10: 00007ffc55468ed0 R11: 0000000000000202 R12: 000000000151dd70
[  175.744748] R13: 000000000082e740 R14: 0000000000000000 R15: 00007ffc5546d59e
[  175.744757] Code: bf 78 09 00 00 00 74 46 85 f6 75 35 89 f6 48 8d 04 76 48 8d 84 c7 70 09 00 00 48 8b 48 08 48 85 c9 74 2b 8b b2 30 08 00 00 31 c0 <3b> 71 04 77 0f 48 c1 e6 05 48 8d 4c 31 30 48 3b 51 08 74 0b 5d 
[  175.744910] RIP: __task_pid_nr_ns+0x3b/0x90 RSP: ffff90c4bbd05ae0
[  175.744914] CR2: 00007fe2a90d1a54

> 
> Please look at v8 for a while. It addresses your comments for v7.
> 
>>
>> Regards,
>> --
>> Alex
>>
> 
> Thanks,
> Alexey
> 

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH v6 1/3] perf/core: use rb trees for pinned/flexible groups
  2017-08-30 11:16                 ` Alexey Budankov
@ 2017-08-31 10:12                   ` Alexey Budankov
  0 siblings, 0 replies; 76+ messages in thread
From: Alexey Budankov @ 2017-08-31 10:12 UTC (permalink / raw)
  To: Alexander Shishkin, Peter Zijlstra
  Cc: Ingo Molnar, Arnaldo Carvalho de Melo, Andi Kleen, Kan Liang,
	Dmitri Prokhorov, Valery Cherepennikov, Mark Rutland,
	Stephane Eranian, David Carrillo-Cisneros, linux-kernel

On 30.08.2017 14:16, Alexey Budankov wrote:
> On 30.08.2017 11:30, Alexey Budankov wrote:
>> On 29.08.2017 16:51, Alexander Shishkin wrote:
>>> Alexey Budankov <alexey.budankov@linux.intel.com> writes:
>>>
>>>> Now I figured that not all indexed events are always located under 
>>>> the root with the same cpu, and it depends on the order of insertion
>>>> e.g. with insertion order 01,02,03,14,15,16 we get this:
>>>>
>>>>      02
>>>>     /  \
>>>>    01  14
>>>>       /  \
>>>>      03  15
>>>>            \
>>>>            16
>>>>
>>>> and it is unclear how to iterate cpu==0 part of tree in this case.
>>>
>>> Using this example, rb_next() should take you through the nodes in this
>>> order (assuming you start with 01): 01, 02, 03, 14, etc. So you iterate
>>> while event->cpu==cpu using rb_next() and you should be fine.
>>
>> Well, indeed we get the most left leaf (03) in rb_next() for the case above.
>>
>>>
>>>> Iterating cpu specific subtree like this:
>>>>
>>>> #define for_each_group_event(event, group, cpu, pmu, field)	 \
>>>> 	for (event = rb_entry_safe(group_first(group, cpu, pmu), \
>>>> 				   typeof(*event), field);	 \
>>>> 	     event && event->cpu == cpu && event->pmu == pmu;	 \
>>>> 	     event = rb_entry_safe(rb_next(&event->field),	 \
>>>> 				   typeof(*event), field))
>>>
>>> Afaict, this assumes that you are also ordering on event->pmu, which
>>> should be reflected in your _less function. And also assuming that
>>> group_first() is doing the right thing. Can we see the code?
>>
>> I didn't do ordering by PMU for this patch set. Yet more I implemented 
>> groups_first() like this:
>>
>> static struct perf_event *
>> perf_event_groups_first(struct perf_event_groups *groups, int cpu)
>> {
>> 	struct perf_event *node_event = NULL;
>> 	struct rb_node *node = NULL;
>>
>> 	node = groups->tree.rb_node;
>>
>> 	while (node) {
>> 		node_event = container_of(node,
>> 				struct perf_event, group_node);
>>
>> 		if (cpu < node_event->cpu) {
>> 			node = node->rb_left;
>> 		} else if (cpu > node_event->cpu) {
>> 			node = node->rb_right;
>> 		} else {
>> 			node = node->rb_left;
>> 		}
>> 	}
>>
>> 	return node_event;
>> }
>>
>> and it doesn't work as expected for case above with cpu == 1.
>>
>> I corrected the code above to this:
>>
>> static struct perf_event *
>> perf_event_groups_first(struct perf_event_groups *groups, int cpu)
>> {
>> 	struct perf_event *node_event = NULL, *match = NULL;
>> 	struct rb_node *node = NULL;
>>
>> 	node = groups->tree.rb_node;
>>
>> 	while (node) {
>> 		node_event = container_of(node,
>> 				struct perf_event, group_node);
>>
>> 		if (cpu < node_event->cpu) {
>> 			node = node->rb_left;
>> 		} else if (cpu > node_event->cpu) {
>> 			node = node->rb_right;
>> 		} else {
>> 			match = node_event;
>> 			node = node->rb_left;
>> 		}
>> 	}
>>
>> 	return match;
>> }
>>
>> but now struggling with silent oopses which I guess are not 
>> related to multiplexing at all.
> 
> Added logging into the code and now see this in dmesg output:
> 
> [  175.743879] BUG: unable to handle kernel paging request at 00007fe2a90d1a54
> [  175.743899] IP: __task_pid_nr_ns+0x3b/0x90
> [  175.743903] PGD 2f317ca067 
> [  175.743906] P4D 2f317ca067 
> [  175.743910] PUD 0 
> 
> [  175.743926] Oops: 0000 [#1] SMP
> [  175.743931] Modules linked in: fuse xt_CHECKSUM iptable_mangle ipt_MASQUERADE nf_nat_masquerade_ipv4 iptable_nat nf_nat_ipv4 nf_nat nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack libcrc32c tun bridge stp llc ebtable_filter ebtables ip6table_filter ip6_tables nfsv3 rpcsec_gss_krb5 nfsv4 cmac arc4 md4 nls_utf8 cifs nfs ccm dns_resolver fscache rpcrdma ib_isert iscsi_target_mod ib_iser libiscsi scsi_transport_iscsi ib_srpt target_core_mod ib_srp scsi_transport_srp ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm intel_rapl sb_edac x86_pkg_temp_thermal intel_powerclamp coretemp hfi1 crct10dif_pclmul crc32_pclmul ghash_clmulni_intel intel_cstate rdmavt joydev ipmi_ssif intel_uncore ib_core ipmi_si ipmi_devintf intel_rapl_perf iTCO_wdt iTCO_vendor_support pcspkr tpm_tis tpm_tis_core
> [  175.744088]  mei_me tpm i2c_i801 ipmi_msghandler lpc_ich mei shpchp wmi acpi_power_meter acpi_pad nfsd auth_rpcgss nfs_acl lockd grace sunrpc mgag200 drm_kms_helper ttm drm igb crc32c_intel ptp pps_core dca i2c_algo_bit
> [  175.744156] CPU: 12 PID: 8272 Comm: perf Not tainted 4.13.0-rc4-v7.3.3+ #13
> [  175.744160] Hardware name: Intel Corporation S7200AP/S7200AP, BIOS S72C610.86B.01.01.0190.080520162104 08/05/2016
> [  175.744165] task: ffff90c47d4d0000 task.stack: ffffae42d8fb0000
> [  175.744177] RIP: 0010:__task_pid_nr_ns+0x3b/0x90
> [  175.744181] RSP: 0018:ffff90c4bbd05ae0 EFLAGS: 00010046
> [  175.744190] RAX: 0000000000000000 RBX: ffff90c47d4d0000 RCX: 00007fe2a90d1a50
> [  175.744204] RDX: ffffffffbee4ed20 RSI: 0000000000000000 RDI: ffff90c47d4d0000
> [  175.744209] RBP: ffff90c4bbd05ae0 R08: 0000000000281a93 R09: 0000000000000000
> [  175.744213] R10: 0000000000000005 R11: 0000000000000000 R12: ffff90c46d25d800
> [  175.744218] R13: ffff90c4bbd05c40 R14: ffff90c47d4d0000 R15: ffff90c46d25d800
> [  175.744224] FS:  0000000000000000(0000) GS:ffff90c4bbd00000(0000) knlGS:0000000000000000
> [  175.744228] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [  175.744232] CR2: 00007fe2a90d1a54 CR3: 0000002f33bd2000 CR4: 00000000001406e0
> [  175.744236] Call Trace:
> [  175.744243]  <NMI>
> [  175.744259]  perf_event_pid_type+0x27/0x40
> [  175.744272]  __perf_event_header__init_id+0xb5/0xd0
> [  175.744284]  perf_prepare_sample+0x54/0x360
> [  175.744293]  perf_event_output_forward+0x2f/0x80
> [  175.744304]  ? sched_clock+0xb/0x10
> [  175.744315]  ? sched_clock_cpu+0x11/0xb0
> [  175.744334]  __perf_event_overflow+0x54/0xe0
> [  175.744346]  perf_event_overflow+0x14/0x20
> [  175.744356]  intel_pmu_handle_irq+0x203/0x4b0
> [  175.744379]  perf_event_nmi_handler+0x2d/0x50
> [  175.744391]  nmi_handle+0x61/0x110
> [  175.744402]  default_do_nmi+0x44/0x110
> [  175.744411]  do_nmi+0x113/0x190
> [  175.744421]  end_repeat_nmi+0x1a/0x1e
> [  175.744432] RIP: 0010:native_write_msr+0x6/0x30
> [  175.744437] RSP: 0018:ffffae42d8fb3c18 EFLAGS: 00000002
> [  175.744452] RAX: 0000000000000003 RBX: ffff90c4bbd0a360 RCX: 000000000000038f
> [  175.744457] RDX: 0000000000000007 RSI: 0000000000000003 RDI: 000000000000038f
> [  175.744461] RBP: ffffae42d8fb3c30 R08: 0000000000281a93 R09: 0000000000000000
> [  175.744464] R10: 0000000000000005 R11: 0000000000000000 R12: 0000000000000000
> [  175.744468] R13: ffff90c4bbd0a360 R14: ffff90c4bbd0a584 R15: 0000000000000001
> [  175.744485]  ? native_write_msr+0x6/0x30
> [  175.744495]  ? native_write_msr+0x6/0x30
> [  175.744504]  </NMI>
> [  175.744514]  ? __intel_pmu_enable_all.isra.13+0x4f/0x80
> [  175.744524]  intel_pmu_enable_all+0x10/0x20
> [  175.744534]  x86_pmu_enable+0x263/0x2f0
> [  175.744545]  perf_pmu_enable+0x22/0x30
> [  175.744554]  ctx_resched+0x74/0xb0
> [  175.744568]  perf_event_exec+0x17e/0x1e0
> [  175.744584]  setup_new_exec+0x72/0x180
> [  175.744595]  load_elf_binary+0x39f/0x15ea
> [  175.744610]  ? get_user_pages_remote+0x83/0x1f0
> [  175.744620]  ? __check_object_size+0x164/0x1a0
> [  175.744632]  ? __check_object_size+0x164/0x1a0
> [  175.744642]  ? _copy_from_user+0x33/0x70
> [  175.744654]  search_binary_handler+0x9e/0x1e0
> [  175.744664]  do_execveat_common.isra.31+0x53d/0x700
> [  175.744677]  SyS_execve+0x3a/0x50
> [  175.744691]  do_syscall_64+0x67/0x150
> [  175.744702]  entry_SYSCALL64_slow_path+0x25/0x25
> [  175.744709] RIP: 0033:0x7fe2a6cb77a7
> [  175.744713] RSP: 002b:00007ffc554690b8 EFLAGS: 00000202 ORIG_RAX: 000000000000003b
> [  175.744721] RAX: ffffffffffffffda RBX: 00007ffc5546b900 RCX: 00007fe2a6cb77a7
> [  175.744725] RDX: 000000000151dd70 RSI: 00007ffc5546b900 RDI: 00007ffc5546d59e
> [  175.744735] RBP: 00007ffc55469140 R08: 00007ffc554690a0 R09: 00007ffc55468f50
> [  175.744740] R10: 00007ffc55468ed0 R11: 0000000000000202 R12: 000000000151dd70
> [  175.744748] R13: 000000000082e740 R14: 0000000000000000 R15: 00007ffc5546d59e
> [  175.744757] Code: bf 78 09 00 00 00 74 46 85 f6 75 35 89 f6 48 8d 04 76 48 8d 84 c7 70 09 00 00 48 8b 48 08 48 85 c9 74 2b 8b b2 30 08 00 00 31 c0 <3b> 71 04 77 0f 48 c1 e6 05 48 8d 4c 31 30 48 3b 51 08 74 0b 5d 
> [  175.744910] RIP: __task_pid_nr_ns+0x3b/0x90 RSP: ffff90c4bbd05ae0
> [  175.744914] CR2: 00007fe2a90d1a54

I eventually managed to overcome the difficulties with the implementation
of an rb_tree indexed by {cpu,index} for event groups, so please
see the v9 patches.

> 
>>
>> Please look at v8 for a while. It addresses your comments for v7.
>>
>>>
>>> Regards,
>>> --
>>> Alex
>>>
>>
>> Thanks,
>> Alexey
>>
> 
> 

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH v7 0/2] perf/core: addressing 4x slowdown during per-process profiling of STREAM benchmark on Intel Xeon Phi
  2017-08-22 20:21   ` [PATCH v7 0/2] perf/core: addressing 4x slowdown during per-process profiling of STREAM benchmark on Intel Xeon Phi Peter Zijlstra
  2017-08-23  8:54     ` Alexey Budankov
@ 2017-08-31 10:12     ` Alexey Budankov
  1 sibling, 0 replies; 76+ messages in thread
From: Alexey Budankov @ 2017-08-31 10:12 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Arnaldo Carvalho de Melo, Alexander Shishkin,
	Andi Kleen, Kan Liang, Dmitri Prokhorov, Valery Cherepennikov,
	Mark Rutland, Stephane Eranian, David Carrillo-Cisneros,
	linux-kernel

Hi,
On 22.08.2017 23:21, Peter Zijlstra wrote:
> On Fri, Aug 18, 2017 at 08:17:15AM +0300, Alexey Budankov wrote:
>> Hi,
> 
> Please don't post new versions in reply to old versions, that gets them
> lost in thread sorted views.
> 
>> This patch set v7 moves event groups into rb trees and implements 
>> skipping to the current CPU's list on hrtimer interrupt.
> 
> Does this depend on your timekeeping rework posted in that v6 thread?
> If so, I would have expected to see that as part of these patches, if
> not, I'm confused, because part of the problem was that we currently
> need to update times for events we don't want to schedule etc..
> 
>> Events allocated for the same CPU are still kept in a linked list
>> of the event directly attached to the tree because it is unclear 
>> how to implement fast iteration thru events allocated for 
>> the same CPU when they are all attached to a tree employing 
>> additional 64bit index as a secondary tree key.
> 
> Finding the CPU subtree and rb_next() wasn't good?

I eventually managed to overcome the difficulties with the implementation
of an rb_tree indexed by {cpu,index} for event groups, so please
see the v9 patches.

> 
> 
> 

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH v6 1/3] perf/core: use rb trees for pinned/flexible groups
  2017-08-15 17:28           ` Alexey Budankov
  2017-08-23 13:39             ` Alexander Shishkin
  2017-08-29 13:51             ` Alexander Shishkin
@ 2017-08-31 10:12             ` Alexey Budankov
  2 siblings, 0 replies; 76+ messages in thread
From: Alexey Budankov @ 2017-08-31 10:12 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Arnaldo Carvalho de Melo, Alexander Shishkin,
	Andi Kleen, Kan Liang, Dmitri Prokhorov, Valery Cherepennikov,
	Mark Rutland, Stephane Eranian, David Carrillo-Cisneros,
	linux-kernel

On 15.08.2017 20:28, Alexey Budankov wrote:
> Hi Peter,
> 
> On 07.08.2017 10:17, Alexey Budankov wrote:
>> On 04.08.2017 17:36, Peter Zijlstra wrote:
>>> On Thu, Aug 03, 2017 at 11:30:09PM +0300, Alexey Budankov wrote:
>>>> On 03.08.2017 16:00, Peter Zijlstra wrote:
>>>>> On Wed, Aug 02, 2017 at 11:13:54AM +0300, Alexey Budankov wrote:
>>>
>>>>>> +/*
>>>>>> + * Find group list by a cpu key and rotate it.
>>>>>> + */
>>>>>> +static void
>>>>>> +perf_event_groups_rotate(struct rb_root *groups, int cpu)
>>>>>> +{
>>>>>> +	struct rb_node *node;
>>>>>> +	struct perf_event *node_event;
>>>>>> +
>>>>>> +	node = groups->rb_node;
>>>>>> +
>>>>>> +	while (node) {
>>>>>> +		node_event = container_of(node,
>>>>>> +				struct perf_event, group_node);
>>>>>> +
>>>>>> +		if (cpu < node_event->cpu) {
>>>>>> +			node = node->rb_left;
>>>>>> +		} else if (cpu > node_event->cpu) {
>>>>>> +			node = node->rb_right;
>>>>>> +		} else {
>>>>>> +			list_rotate_left(&node_event->group_list);
>>>>>> +			break;
>>>>>> +		}
>>>>>> +	}
>>>>>> +}
>>>>>
>>>>> Ah, you worry about how to rotate inside a tree?
>>>>
>>>> Exactly.
>>>>
>>>>>
>>>>> You can do that by adding (run)time based ordering, and you'll end up
>>>>> with a runtime based scheduler.
>>>>
>>>> Do you mean replacing a CPU indexed rb_tree of lists with 
>>>> an CPU indexed rb_tree of counter indexed rb_trees?
>>>
>>> No, single tree, just more complicated ordering rules.
>>>
>>>>> A trivial variant keeps a simple counter per tree that is incremented
>>>>> for each rotation. That should end up with the events ordered exactly
>>>>> like the list. And if you have that comparator like above, expressing
>>>>> that additional ordering becomes simple ;-)
>>>>>
>>>>> Something like:
>>>>>
>>>>> struct group {
>>>>>   u64 vtime;
>>>>>   rb_tree tree;
>>>>> };
>>>>>
>>>>> bool event_less(left, right)
>>>>> {
>>>>>   if (left->cpu < right->cpu)
>>>>>     return true;
>>>>>
>>>>>   if (left->cpu > right_cpu)
>>>>>     return false;
>>>>>
>>>>>   if (left->vtime < right->vtime)
>>>>>     return true;
>>>>>
>>>>>   return false;
>>>>> }
>>>>>
>>>>> insert_group(group, event, tail)
>>>>> {
>>>>>   if (tail)
>>>>>     event->vtime = ++group->vtime;
>>>>>
>>>>>   tree_insert(&group->root, event);
>>>>> }
>>>>>
>>>>> Then every time you use insert_group(.tail=1) it goes to the end of that
>>>>> CPU's 'list'.
>>>>>
>>>>
>>>> Could you elaborate more on how to implement rotation?
>>>
>>> Its almost all there, but let me write a complete replacement for your
>>> perf_event_group_rotate() above.
>>>
>>> /* find the leftmost event matching @cpu */
>>> /* XXX not sure how to best parametrise a subtree search, */
>>> /* again, C sucks... */
>>> struct perf_event *__group_find_cpu(group, cpu)
>>> {
>>> 	struct rb_node *node = group->tree.rb_node;
>>> 	struct perf_event *event, *match = NULL;
>>>
>>> 	while (node) {
>>> 		event = container_of(node, struct perf_event, group_node);
>>>
>>> 		if (cpu > event->cpu) {
>>> 			node = node->rb_right;
>>> 		} else if (cpu < event->cpu) {
>>> 			node = node->rb_left;
>>> 		} else {
>>> 			/*
>>> 			 * subtree match, try left subtree for a
>>> 			 * 'smaller' match.
>>> 			 */
>>> 			match = event;
>>> 			node = node->rb_left;
>>> 		}
>>> 	}
>>>
>>> 	return match;
>>> }
>>>
>>> void perf_event_group_rotate(group, int cpu)
>>> {
>>> 	struct perf_event *event = __group_find_cpu(cpu);
>>>
>>> 	if (!event)
>>> 		return;
>>>
>>> 	tree_delete(&group->tree, event);
>>> 	insert_group(group, event, 1);
>>> }
>>>
>>> So we have a tree ordered by {cpu,vtime} and what we do is find the
>>> leftmost {cpu} entry, that is the smallest vtime entry for that cpu. We
>>> then take it out and re-insert it with a vtime number larger than any
>>> other, which places it as the rightmost entry for that cpu.
>>>
>>>
>>> So given:
>>>
>>>        {1,1}
>>>        / \
>>>     {0,5} {1,2}
>>>    / \        \
>>> {0,1} {0,6}  {1,4}
>>>
>>>
>>> __group_find_cpu(.cpu=1) will return {1,1} as being the leftmost entry
>>> with cpu=1. We'll then remove it, update its vtime to 7 and re-insert.
>>> resulting in something like:
>>>
>>>        {1,2}
>>>        / \
>>>     {0,5} {1,4}
>>>    / \        \
>>> {0,1} {0,6}  {1,7}
>>>
>>
>> Makes sense. The implementation becomes a bit simpler. The drawbacks 
>> may be several rotations of potentially big tree on the critical path, 
>> instead of updating four pointers in case of the tree of lists.
> 
> I implemented the approach you had suggested (as I understood it),
> tested it and got results that are drastically different from what 
> I am getting for the tree of lists. Specifically I did:
> 
> 1. keeping all groups in the same single tree by employing a 64-bit index
>    additionally to CPU key;
>    
> 2. implementing special _less() function and rotation by re-inserting
>    group with incremented index;
> 
> 3. replacing API with a callback in the signature by a macro
>    perf_event_groups_for_each();
> 
> Employing all that shrunk the total patch size, however I am still 
> struggling with the correctness issues.
> 
> Now I figured that not all indexed events are always located under 
> the root with the same cpu, and it depends on the order of insertion
> e.g. with insertion order 01,02,03,14,15,16 we get this:
> 
>      02
>     /  \
>    01  14
>       /  \
>      03  15
>            \
>            16
> 
> and it is unclear how to iterate cpu==0 part of tree in this case.
> 
> Iterating cpu specific subtree like this:
> 
> #define for_each_group_event(event, group, cpu, pmu, field)	 \
> 	for (event = rb_entry_safe(group_first(group, cpu, pmu), \
> 				   typeof(*event), field);	 \
> 	     event && event->cpu == cpu && event->pmu == pmu;	 \
> 	     event = rb_entry_safe(rb_next(&event->field),	 \
> 				   typeof(*event), field))
> 
> misses event==03 for the case above and I guess this is where I lose 
> samples in my testing. 

I eventually managed to overcome the difficulties with the implementation
of an rb_tree indexed by {cpu,index} for event groups, so please
see the v9 patches.

> 
> Please advise how to proceed.
> 
> Thanks,
> Alexey
> 
>>
>>>
>>>
>>>
>>>
>>
>>
> 
> 

^ permalink raw reply	[flat|nested] 76+ messages in thread

* [RFC][PATCH] perf: Rewrite enabled/running timekeeping
  2017-08-23  8:54         ` Alexey Budankov
@ 2017-08-31 17:18           ` Peter Zijlstra
  2017-08-31 19:51             ` Stephane Eranian
                               ` (4 more replies)
  0 siblings, 5 replies; 76+ messages in thread
From: Peter Zijlstra @ 2017-08-31 17:18 UTC (permalink / raw)
  To: Alexey Budankov
  Cc: Ingo Molnar, Arnaldo Carvalho de Melo, Alexander Shishkin,
	Andi Kleen, Kan Liang, Dmitri Prokhorov, Valery Cherepennikov,
	Mark Rutland, Stephane Eranian, David Carrillo-Cisneros,
	linux-kernel, Vince Weaver, Thomas Gleixner

On Wed, Aug 23, 2017 at 11:54:15AM +0300, Alexey Budankov wrote:
> On 22.08.2017 23:47, Peter Zijlstra wrote:
> > On Thu, Aug 10, 2017 at 06:57:43PM +0300, Alexey Budankov wrote:
> >> The key thing in the patch is explicit updating of tstamp fields for
> >> INACTIVE events in update_event_times().
> > 
> >> @@ -1405,6 +1426,9 @@ static void update_event_times(struct perf_event *event)
> >>  	    event->group_leader->state < PERF_EVENT_STATE_INACTIVE)
> >>  		return;
> >>  
> >> +	if (event->state == PERF_EVENT_STATE_INACTIVE)
> >> +		perf_event_tstamp_update(event);
> >> +
> >>  	/*
> >>  	 * in cgroup mode, time_enabled represents
> >>  	 * the time the event was enabled AND active
> > 
> > But why!? I thought the whole point was to not need to do this.
> 
> update_event_times() is not called from timer interrupt handler 
> thus it is not on the critical path which is optimized in this patch set.
> 
> But update_event_times() is called in the context of read() syscall so
> this is the place where we may update event times for INACTIVE events 
> instead of timer interrupt.
> 
> Also update_event_times() is called on thread context switch out so
> we get event times also updated when the thread migrates to other CPU.
> 
> > 
> > The thing I outlined earlier would only need to update timestamps when
> > events change state and at no other point in time.
> 
> But we still may request times while event is in INACTIVE state 
> thru read() syscall and event timings need to be up-to-date. 

Sure, read() also updates.

So the below completely rewrites timekeeping (and probably breaks
world) but does away with the need to touch events that don't get
scheduled.

Esp the cgroup stuff is entirely untested since I simply don't know how
to operate that. I did run Vince's tests on it, and I think it doesn't
regress, but I'm near a migraine so I can't really see straight atm.

Vince, Stephane, could you guys have a peek?

(There's a few other bits in; I'll break it up into patches and write
comments and Changelogs later. I think it can be split into some 5
patches.)

The basic idea is really simple: we have a single timestamp and,
depending on the state, we update enabled/running. This obviously only
requires updates when we change state and when we need up-to-date
timestamps (read).

No more weird and wonderful mind-bending interaction between 3 different
timestamps with arcane update rules.
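
In code the whole rule boils down to this (condensed from
__perf_update_times()/perf_event_update_time() below):

	delta = now - event->tstamp;
	if (state >= PERF_EVENT_STATE_INACTIVE)
		event->total_time_enabled += delta;
	if (state >= PERF_EVENT_STATE_ACTIVE)
		event->total_time_running += delta;
	event->tstamp = now;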

---
 include/linux/perf_event.h |  25 +-
 kernel/events/core.c       | 551 ++++++++++++++++-----------------------------
 2 files changed, 192 insertions(+), 384 deletions(-)

diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 8e22f24ded6a..2a6ae48a1a96 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -485,9 +485,9 @@ struct perf_addr_filters_head {
 };
 
 /**
- * enum perf_event_active_state - the states of a event
+ * enum perf_event_state - the states of a event
  */
-enum perf_event_active_state {
+enum perf_event_state {
 	PERF_EVENT_STATE_DEAD		= -4,
 	PERF_EVENT_STATE_EXIT		= -3,
 	PERF_EVENT_STATE_ERROR		= -2,
@@ -578,7 +578,7 @@ struct perf_event {
 	struct pmu			*pmu;
 	void				*pmu_private;
 
-	enum perf_event_active_state	state;
+	enum perf_event_state		state;
 	unsigned int			attach_state;
 	local64_t			count;
 	atomic64_t			child_count;
@@ -588,26 +588,10 @@ struct perf_event {
 	 * has been enabled (i.e. eligible to run, and the task has
 	 * been scheduled in, if this is a per-task event)
 	 * and running (scheduled onto the CPU), respectively.
-	 *
-	 * They are computed from tstamp_enabled, tstamp_running and
-	 * tstamp_stopped when the event is in INACTIVE or ACTIVE state.
 	 */
 	u64				total_time_enabled;
 	u64				total_time_running;
-
-	/*
-	 * These are timestamps used for computing total_time_enabled
-	 * and total_time_running when the event is in INACTIVE or
-	 * ACTIVE state, measured in nanoseconds from an arbitrary point
-	 * in time.
-	 * tstamp_enabled: the notional time when the event was enabled
-	 * tstamp_running: the notional time when the event was scheduled on
-	 * tstamp_stopped: in INACTIVE state, the notional time when the
-	 *	event was scheduled off.
-	 */
-	u64				tstamp_enabled;
-	u64				tstamp_running;
-	u64				tstamp_stopped;
+	u64				tstamp;
 
 	/*
 	 * timestamp shadows the actual context timing but it can
@@ -699,7 +683,6 @@ struct perf_event {
 
 #ifdef CONFIG_CGROUP_PERF
 	struct perf_cgroup		*cgrp; /* cgroup event is attach to */
-	int				cgrp_defer_enabled;
 #endif
 
 	struct list_head		sb_list;
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 294f1927f944..e968b3eab9c7 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -582,6 +582,70 @@ static inline u64 perf_event_clock(struct perf_event *event)
 	return event->clock();
 }
 
+/*
+ * XXX comment about timekeeping goes here
+ */
+
+static __always_inline enum perf_event_state
+__perf_effective_state(struct perf_event *event)
+{
+	struct perf_event *leader = event->group_leader;
+
+	if (leader->state <= PERF_EVENT_STATE_OFF)
+		return leader->state;
+
+	return event->state;
+}
+
+static __always_inline void
+__perf_update_times(struct perf_event *event, u64 now, u64 *enabled, u64 *running)
+{
+	enum perf_event_state state = __perf_effective_state(event);
+	u64 delta = now - event->tstamp;
+
+	*enabled = event->total_time_enabled;
+	if (state >= PERF_EVENT_STATE_INACTIVE)
+		*enabled += delta;
+
+	*running = event->total_time_running;
+	if (state >= PERF_EVENT_STATE_ACTIVE)
+		*running += delta;
+}
+
+static void perf_event_update_time(struct perf_event *event)
+{
+	u64 now = perf_event_time(event);
+
+	__perf_update_times(event, now, &event->total_time_enabled,
+					&event->total_time_running);
+	event->tstamp = now;
+}
+
+static void perf_event_update_sibling_time(struct perf_event *leader)
+{
+	struct perf_event *sibling;
+
+	list_for_each_entry(sibling, &leader->sibling_list, group_entry)
+		perf_event_update_time(sibling);
+}
+
+static void
+perf_event_set_state(struct perf_event *event, enum perf_event_state state)
+{
+	if (event->state == state)
+		return;
+
+	perf_event_update_time(event);
+	/*
+	 * If a group leader gets enabled/disabled all its siblings
+	 * are affected too.
+	 */
+	if ((event->state < 0) ^ (state < 0))
+		perf_event_update_sibling_time(event);
+
+	WRITE_ONCE(event->state, state);
+}
+
 #ifdef CONFIG_CGROUP_PERF
 
 static inline bool
@@ -841,40 +905,6 @@ perf_cgroup_set_shadow_time(struct perf_event *event, u64 now)
 	event->shadow_ctx_time = now - t->timestamp;
 }
 
-static inline void
-perf_cgroup_defer_enabled(struct perf_event *event)
-{
-	/*
-	 * when the current task's perf cgroup does not match
-	 * the event's, we need to remember to call the
-	 * perf_mark_enable() function the first time a task with
-	 * a matching perf cgroup is scheduled in.
-	 */
-	if (is_cgroup_event(event) && !perf_cgroup_match(event))
-		event->cgrp_defer_enabled = 1;
-}
-
-static inline void
-perf_cgroup_mark_enabled(struct perf_event *event,
-			 struct perf_event_context *ctx)
-{
-	struct perf_event *sub;
-	u64 tstamp = perf_event_time(event);
-
-	if (!event->cgrp_defer_enabled)
-		return;
-
-	event->cgrp_defer_enabled = 0;
-
-	event->tstamp_enabled = tstamp - event->total_time_enabled;
-	list_for_each_entry(sub, &event->sibling_list, group_entry) {
-		if (sub->state >= PERF_EVENT_STATE_INACTIVE) {
-			sub->tstamp_enabled = tstamp - sub->total_time_enabled;
-			sub->cgrp_defer_enabled = 0;
-		}
-	}
-}
-
 /*
  * Update cpuctx->cgrp so that it is set when first cgroup event is added and
  * cleared when last cgroup event is removed.
@@ -973,17 +1003,6 @@ static inline u64 perf_cgroup_event_time(struct perf_event *event)
 }
 
 static inline void
-perf_cgroup_defer_enabled(struct perf_event *event)
-{
-}
-
-static inline void
-perf_cgroup_mark_enabled(struct perf_event *event,
-			 struct perf_event_context *ctx)
-{
-}
-
-static inline void
 list_update_cgroup_event(struct perf_event *event,
 			 struct perf_event_context *ctx, bool add)
 {
@@ -1396,60 +1415,6 @@ static u64 perf_event_time(struct perf_event *event)
 	return ctx ? ctx->time : 0;
 }
 
-/*
- * Update the total_time_enabled and total_time_running fields for a event.
- */
-static void update_event_times(struct perf_event *event)
-{
-	struct perf_event_context *ctx = event->ctx;
-	u64 run_end;
-
-	lockdep_assert_held(&ctx->lock);
-
-	if (event->state < PERF_EVENT_STATE_INACTIVE ||
-	    event->group_leader->state < PERF_EVENT_STATE_INACTIVE)
-		return;
-
-	/*
-	 * in cgroup mode, time_enabled represents
-	 * the time the event was enabled AND active
-	 * tasks were in the monitored cgroup. This is
-	 * independent of the activity of the context as
-	 * there may be a mix of cgroup and non-cgroup events.
-	 *
-	 * That is why we treat cgroup events differently
-	 * here.
-	 */
-	if (is_cgroup_event(event))
-		run_end = perf_cgroup_event_time(event);
-	else if (ctx->is_active)
-		run_end = ctx->time;
-	else
-		run_end = event->tstamp_stopped;
-
-	event->total_time_enabled = run_end - event->tstamp_enabled;
-
-	if (event->state == PERF_EVENT_STATE_INACTIVE)
-		run_end = event->tstamp_stopped;
-	else
-		run_end = perf_event_time(event);
-
-	event->total_time_running = run_end - event->tstamp_running;
-
-}
-
-/*
- * Update total_time_enabled and total_time_running for all events in a group.
- */
-static void update_group_times(struct perf_event *leader)
-{
-	struct perf_event *event;
-
-	update_event_times(leader);
-	list_for_each_entry(event, &leader->sibling_list, group_entry)
-		update_event_times(event);
-}
-
 static enum event_type_t get_event_type(struct perf_event *event)
 {
 	struct perf_event_context *ctx = event->ctx;
@@ -1492,6 +1457,8 @@ list_add_event(struct perf_event *event, struct perf_event_context *ctx)
 	WARN_ON_ONCE(event->attach_state & PERF_ATTACH_CONTEXT);
 	event->attach_state |= PERF_ATTACH_CONTEXT;
 
+	event->tstamp = perf_event_time(event);
+
 	/*
 	 * If we're a stand alone event or group leader, we go to the context
 	 * list, group events are kept attached to the group so that
@@ -1699,8 +1666,6 @@ list_del_event(struct perf_event *event, struct perf_event_context *ctx)
 	if (event->group_leader == event)
 		list_del_init(&event->group_entry);
 
-	update_group_times(event);
-
 	/*
 	 * If event was in error state, then keep it
 	 * that way, otherwise bogus counts will be
@@ -1709,7 +1674,7 @@ list_del_event(struct perf_event *event, struct perf_event_context *ctx)
 	 * of the event
 	 */
 	if (event->state > PERF_EVENT_STATE_OFF)
-		event->state = PERF_EVENT_STATE_OFF;
+		perf_event_set_state(event, PERF_EVENT_STATE_OFF);
 
 	ctx->generation++;
 }
@@ -1808,38 +1773,24 @@ event_sched_out(struct perf_event *event,
 		  struct perf_cpu_context *cpuctx,
 		  struct perf_event_context *ctx)
 {
-	u64 tstamp = perf_event_time(event);
-	u64 delta;
+	enum perf_event_state state = PERF_EVENT_STATE_INACTIVE;
 
 	WARN_ON_ONCE(event->ctx != ctx);
 	lockdep_assert_held(&ctx->lock);
 
-	/*
-	 * An event which could not be activated because of
-	 * filter mismatch still needs to have its timings
-	 * maintained, otherwise bogus information is return
-	 * via read() for time_enabled, time_running:
-	 */
-	if (event->state == PERF_EVENT_STATE_INACTIVE &&
-	    !event_filter_match(event)) {
-		delta = tstamp - event->tstamp_stopped;
-		event->tstamp_running += delta;
-		event->tstamp_stopped = tstamp;
-	}
-
 	if (event->state != PERF_EVENT_STATE_ACTIVE)
 		return;
 
 	perf_pmu_disable(event->pmu);
 
-	event->tstamp_stopped = tstamp;
 	event->pmu->del(event, 0);
 	event->oncpu = -1;
-	event->state = PERF_EVENT_STATE_INACTIVE;
+
 	if (event->pending_disable) {
 		event->pending_disable = 0;
-		event->state = PERF_EVENT_STATE_OFF;
+		state = PERF_EVENT_STATE_OFF;
 	}
+	perf_event_set_state(event, state);
 
 	if (!is_software_event(event))
 		cpuctx->active_oncpu--;
@@ -1859,7 +1810,9 @@ group_sched_out(struct perf_event *group_event,
 		struct perf_event_context *ctx)
 {
 	struct perf_event *event;
-	int state = group_event->state;
+
+	if (group_event->state != PERF_EVENT_STATE_ACTIVE)
+		return;
 
 	perf_pmu_disable(ctx->pmu);
 
@@ -1873,7 +1826,7 @@ group_sched_out(struct perf_event *group_event,
 
 	perf_pmu_enable(ctx->pmu);
 
-	if (state == PERF_EVENT_STATE_ACTIVE && group_event->attr.exclusive)
+	if (group_event->attr.exclusive)
 		cpuctx->exclusive = 0;
 }
 
@@ -1893,6 +1846,11 @@ __perf_remove_from_context(struct perf_event *event,
 {
 	unsigned long flags = (unsigned long)info;
 
+	if (ctx->is_active & EVENT_TIME) {
+		update_context_time(ctx);
+		update_cgrp_time_from_cpuctx(cpuctx);
+	}
+
 	event_sched_out(event, cpuctx, ctx);
 	if (flags & DETACH_GROUP)
 		perf_group_detach(event);
@@ -1955,14 +1913,17 @@ static void __perf_event_disable(struct perf_event *event,
 	if (event->state < PERF_EVENT_STATE_INACTIVE)
 		return;
 
-	update_context_time(ctx);
-	update_cgrp_time_from_event(event);
-	update_group_times(event);
+	if (ctx->is_active & EVENT_TIME) {
+		update_context_time(ctx);
+		update_cgrp_time_from_cpuctx(cpuctx);
+	}
+
 	if (event == event->group_leader)
 		group_sched_out(event, cpuctx, ctx);
 	else
 		event_sched_out(event, cpuctx, ctx);
-	event->state = PERF_EVENT_STATE_OFF;
+
+	perf_event_set_state(event, PERF_EVENT_STATE_OFF);
 }
 
 /*
@@ -2019,8 +1980,7 @@ void perf_event_disable_inatomic(struct perf_event *event)
 }
 
 static void perf_set_shadow_time(struct perf_event *event,
-				 struct perf_event_context *ctx,
-				 u64 tstamp)
+				 struct perf_event_context *ctx)
 {
 	/*
 	 * use the correct time source for the time snapshot
@@ -2048,9 +2008,9 @@ static void perf_set_shadow_time(struct perf_event *event,
 	 * is cleaner and simpler to understand.
 	 */
 	if (is_cgroup_event(event))
-		perf_cgroup_set_shadow_time(event, tstamp);
+		perf_cgroup_set_shadow_time(event, event->tstamp);
 	else
-		event->shadow_ctx_time = tstamp - ctx->timestamp;
+		event->shadow_ctx_time = event->tstamp - ctx->timestamp;
 }
 
 #define MAX_INTERRUPTS (~0ULL)
@@ -2063,7 +2023,6 @@ event_sched_in(struct perf_event *event,
 		 struct perf_cpu_context *cpuctx,
 		 struct perf_event_context *ctx)
 {
-	u64 tstamp = perf_event_time(event);
 	int ret = 0;
 
 	lockdep_assert_held(&ctx->lock);
@@ -2077,7 +2036,7 @@ event_sched_in(struct perf_event *event,
 	 * is visible.
 	 */
 	smp_wmb();
-	WRITE_ONCE(event->state, PERF_EVENT_STATE_ACTIVE);
+	perf_event_set_state(event, PERF_EVENT_STATE_ACTIVE);
 
 	/*
 	 * Unthrottle events, since we scheduled we might have missed several
@@ -2089,26 +2048,19 @@ event_sched_in(struct perf_event *event,
 		event->hw.interrupts = 0;
 	}
 
-	/*
-	 * The new state must be visible before we turn it on in the hardware:
-	 */
-	smp_wmb();
-
 	perf_pmu_disable(event->pmu);
 
-	perf_set_shadow_time(event, ctx, tstamp);
+	perf_set_shadow_time(event, ctx);
 
 	perf_log_itrace_start(event);
 
 	if (event->pmu->add(event, PERF_EF_START)) {
-		event->state = PERF_EVENT_STATE_INACTIVE;
+		perf_event_set_state(event, PERF_EVENT_STATE_INACTIVE);
 		event->oncpu = -1;
 		ret = -EAGAIN;
 		goto out;
 	}
 
-	event->tstamp_running += tstamp - event->tstamp_stopped;
-
 	if (!is_software_event(event))
 		cpuctx->active_oncpu++;
 	if (!ctx->nr_active++)
@@ -2132,8 +2084,6 @@ group_sched_in(struct perf_event *group_event,
 {
 	struct perf_event *event, *partial_group = NULL;
 	struct pmu *pmu = ctx->pmu;
-	u64 now = ctx->time;
-	bool simulate = false;
 
 	if (group_event->state == PERF_EVENT_STATE_OFF)
 		return 0;
@@ -2163,27 +2113,13 @@ group_sched_in(struct perf_event *group_event,
 	/*
 	 * Groups can be scheduled in as one unit only, so undo any
 	 * partial group before returning:
-	 * The events up to the failed event are scheduled out normally,
-	 * tstamp_stopped will be updated.
-	 *
-	 * The failed events and the remaining siblings need to have
-	 * their timings updated as if they had gone thru event_sched_in()
-	 * and event_sched_out(). This is required to get consistent timings
-	 * across the group. This also takes care of the case where the group
-	 * could never be scheduled by ensuring tstamp_stopped is set to mark
-	 * the time the event was actually stopped, such that time delta
-	 * calculation in update_event_times() is correct.
+	 * The events up to the failed event are scheduled out normally.
 	 */
 	list_for_each_entry(event, &group_event->sibling_list, group_entry) {
 		if (event == partial_group)
-			simulate = true;
+			break;
 
-		if (simulate) {
-			event->tstamp_running += now - event->tstamp_stopped;
-			event->tstamp_stopped = now;
-		} else {
-			event_sched_out(event, cpuctx, ctx);
-		}
+		event_sched_out(event, cpuctx, ctx);
 	}
 	event_sched_out(group_event, cpuctx, ctx);
 
@@ -2225,46 +2161,11 @@ static int group_can_go_on(struct perf_event *event,
 	return can_add_hw;
 }
 
-/*
- * Complement to update_event_times(). This computes the tstamp_* values to
- * continue 'enabled' state from @now, and effectively discards the time
- * between the prior tstamp_stopped and now (as we were in the OFF state, or
- * just switched (context) time base).
- *
- * This further assumes '@event->state == INACTIVE' (we just came from OFF) and
- * cannot have been scheduled in yet. And going into INACTIVE state means
- * '@event->tstamp_stopped = @now'.
- *
- * Thus given the rules of update_event_times():
- *
- *   total_time_enabled = tstamp_stopped - tstamp_enabled
- *   total_time_running = tstamp_stopped - tstamp_running
- *
- * We can insert 'tstamp_stopped == now' and reverse them to compute new
- * tstamp_* values.
- */
-static void __perf_event_enable_time(struct perf_event *event, u64 now)
-{
-	WARN_ON_ONCE(event->state != PERF_EVENT_STATE_INACTIVE);
-
-	event->tstamp_stopped = now;
-	event->tstamp_enabled = now - event->total_time_enabled;
-	event->tstamp_running = now - event->total_time_running;
-}
-
 static void add_event_to_ctx(struct perf_event *event,
 			       struct perf_event_context *ctx)
 {
-	u64 tstamp = perf_event_time(event);
-
 	list_add_event(event, ctx);
 	perf_group_attach(event);
-	/*
-	 * We can be called with event->state == STATE_OFF when we create with
-	 * .disabled = 1. In that case the IOC_ENABLE will call this function.
-	 */
-	if (event->state == PERF_EVENT_STATE_INACTIVE)
-		__perf_event_enable_time(event, tstamp);
 }
 
 static void ctx_sched_out(struct perf_event_context *ctx,
@@ -2496,28 +2397,6 @@ perf_install_in_context(struct perf_event_context *ctx,
 }
 
 /*
- * Put a event into inactive state and update time fields.
- * Enabling the leader of a group effectively enables all
- * the group members that aren't explicitly disabled, so we
- * have to update their ->tstamp_enabled also.
- * Note: this works for group members as well as group leaders
- * since the non-leader members' sibling_lists will be empty.
- */
-static void __perf_event_mark_enabled(struct perf_event *event)
-{
-	struct perf_event *sub;
-	u64 tstamp = perf_event_time(event);
-
-	event->state = PERF_EVENT_STATE_INACTIVE;
-	__perf_event_enable_time(event, tstamp);
-	list_for_each_entry(sub, &event->sibling_list, group_entry) {
-		/* XXX should not be > INACTIVE if event isn't */
-		if (sub->state >= PERF_EVENT_STATE_INACTIVE)
-			__perf_event_enable_time(sub, tstamp);
-	}
-}
-
-/*
  * Cross CPU call to enable a performance event
  */
 static void __perf_event_enable(struct perf_event *event,
@@ -2535,14 +2414,12 @@ static void __perf_event_enable(struct perf_event *event,
 	if (ctx->is_active)
 		ctx_sched_out(ctx, cpuctx, EVENT_TIME);
 
-	__perf_event_mark_enabled(event);
+	perf_event_set_state(event, PERF_EVENT_STATE_INACTIVE);
 
 	if (!ctx->is_active)
 		return;
 
 	if (!event_filter_match(event)) {
-		if (is_cgroup_event(event))
-			perf_cgroup_defer_enabled(event);
 		ctx_sched_in(ctx, cpuctx, EVENT_TIME, current);
 		return;
 	}
@@ -2862,18 +2739,10 @@ static void __perf_event_sync_stat(struct perf_event *event,
 	 * we know the event must be on the current CPU, therefore we
 	 * don't need to use it.
 	 */
-	switch (event->state) {
-	case PERF_EVENT_STATE_ACTIVE:
+	if (event->state == PERF_EVENT_STATE_ACTIVE)
 		event->pmu->read(event);
-		/* fall-through */
 
-	case PERF_EVENT_STATE_INACTIVE:
-		update_event_times(event);
-		break;
-
-	default:
-		break;
-	}
+	perf_event_update_time(event);
 
 	/*
 	 * In order to keep per-task stats reliable we need to flip the event
@@ -3110,10 +2979,6 @@ ctx_pinned_sched_in(struct perf_event_context *ctx,
 		if (!event_filter_match(event))
 			continue;
 
-		/* may need to reset tstamp_enabled */
-		if (is_cgroup_event(event))
-			perf_cgroup_mark_enabled(event, ctx);
-
 		if (group_can_go_on(event, cpuctx, 1))
 			group_sched_in(event, cpuctx, ctx);
 
@@ -3121,10 +2986,8 @@ ctx_pinned_sched_in(struct perf_event_context *ctx,
 		 * If this pinned group hasn't been scheduled,
 		 * put it in error state.
 		 */
-		if (event->state == PERF_EVENT_STATE_INACTIVE) {
-			update_group_times(event);
-			event->state = PERF_EVENT_STATE_ERROR;
-		}
+		if (event->state == PERF_EVENT_STATE_INACTIVE)
+			perf_event_set_state(event, PERF_EVENT_STATE_ERROR);
 	}
 }
 
@@ -3146,10 +3009,6 @@ ctx_flexible_sched_in(struct perf_event_context *ctx,
 		if (!event_filter_match(event))
 			continue;
 
-		/* may need to reset tstamp_enabled */
-		if (is_cgroup_event(event))
-			perf_cgroup_mark_enabled(event, ctx);
-
 		if (group_can_go_on(event, cpuctx, can_add_hw)) {
 			if (group_sched_in(event, cpuctx, ctx))
 				can_add_hw = 0;
@@ -3541,7 +3400,7 @@ static int event_enable_on_exec(struct perf_event *event,
 	if (event->state >= PERF_EVENT_STATE_INACTIVE)
 		return 0;
 
-	__perf_event_mark_enabled(event);
+	perf_event_set_state(event, PERF_EVENT_STATE_INACTIVE);
 
 	return 1;
 }
@@ -3590,12 +3449,6 @@ static void perf_event_enable_on_exec(int ctxn)
 		put_ctx(clone_ctx);
 }
 
-struct perf_read_data {
-	struct perf_event *event;
-	bool group;
-	int ret;
-};
-
 static int __perf_event_read_cpu(struct perf_event *event, int event_cpu)
 {
 	u16 local_pkg, event_pkg;
@@ -3613,64 +3466,6 @@ static int __perf_event_read_cpu(struct perf_event *event, int event_cpu)
 	return event_cpu;
 }
 
-/*
- * Cross CPU call to read the hardware event
- */
-static void __perf_event_read(void *info)
-{
-	struct perf_read_data *data = info;
-	struct perf_event *sub, *event = data->event;
-	struct perf_event_context *ctx = event->ctx;
-	struct perf_cpu_context *cpuctx = __get_cpu_context(ctx);
-	struct pmu *pmu = event->pmu;
-
-	/*
-	 * If this is a task context, we need to check whether it is
-	 * the current task context of this cpu.  If not it has been
-	 * scheduled out before the smp call arrived.  In that case
-	 * event->count would have been updated to a recent sample
-	 * when the event was scheduled out.
-	 */
-	if (ctx->task && cpuctx->task_ctx != ctx)
-		return;
-
-	raw_spin_lock(&ctx->lock);
-	if (ctx->is_active) {
-		update_context_time(ctx);
-		update_cgrp_time_from_event(event);
-	}
-
-	update_event_times(event);
-	if (event->state != PERF_EVENT_STATE_ACTIVE)
-		goto unlock;
-
-	if (!data->group) {
-		pmu->read(event);
-		data->ret = 0;
-		goto unlock;
-	}
-
-	pmu->start_txn(pmu, PERF_PMU_TXN_READ);
-
-	pmu->read(event);
-
-	list_for_each_entry(sub, &event->sibling_list, group_entry) {
-		update_event_times(sub);
-		if (sub->state == PERF_EVENT_STATE_ACTIVE) {
-			/*
-			 * Use sibling's PMU rather than @event's since
-			 * sibling could be on different (eg: software) PMU.
-			 */
-			sub->pmu->read(sub);
-		}
-	}
-
-	data->ret = pmu->commit_txn(pmu);
-
-unlock:
-	raw_spin_unlock(&ctx->lock);
-}
-
 static inline u64 perf_event_count(struct perf_event *event)
 {
 	return local64_read(&event->count) + atomic64_read(&event->child_count);
@@ -3733,63 +3528,81 @@ int perf_event_read_local(struct perf_event *event, u64 *value)
 	return ret;
 }
 
-static int perf_event_read(struct perf_event *event, bool group)
+struct perf_read_data {
+	struct perf_event *event;
+	bool group;
+	int ret;
+};
+
+static void __perf_event_read(struct perf_event *event,
+			      struct perf_cpu_context *cpuctx,
+			      struct perf_event_context *ctx,
+			      void *data)
 {
-	int event_cpu, ret = 0;
+	struct perf_read_data *prd = data;
+	struct pmu *pmu = event->pmu;
+	struct perf_event *sibling;
 
-	/*
-	 * If event is enabled and currently active on a CPU, update the
-	 * value in the event structure:
-	 */
-	if (event->state == PERF_EVENT_STATE_ACTIVE) {
-		struct perf_read_data data = {
-			.event = event,
-			.group = group,
-			.ret = 0,
-		};
+	if (ctx->is_active & EVENT_TIME) {
+		update_context_time(ctx);
+		update_cgrp_time_from_cpuctx(cpuctx);
+	}
 
-		event_cpu = READ_ONCE(event->oncpu);
-		if ((unsigned)event_cpu >= nr_cpu_ids)
-			return 0;
+	perf_event_update_time(event);
+	if (prd->group)
+		perf_event_update_sibling_time(event);
 
-		preempt_disable();
-		event_cpu = __perf_event_read_cpu(event, event_cpu);
+	if (event->state != PERF_EVENT_STATE_ACTIVE)
+		return;
 
+	if (!prd->group) {
+		pmu->read(event);
+		prd->ret = 0;
+		return;
+	}
+
+	pmu->start_txn(pmu, PERF_PMU_TXN_READ);
+
+	pmu->read(event);
+	list_for_each_entry(sibling, &event->sibling_list, group_entry) {
+		if (sibling->state == PERF_EVENT_STATE_ACTIVE) {
+			/*
+			 * Use sibling's PMU rather than @event's since
+			 * sibling could be on different (eg: software) PMU.
+			 */
+			sibling->pmu->read(sibling);
+		}
+	}
+
+	prd->ret = pmu->commit_txn(pmu);
+}
+
+static int perf_event_read(struct perf_event *event, bool group)
+{
+	struct perf_read_data prd = {
+		.event = event,
+		.group = group,
+		.ret = 0,
+	};
+
+	if (event->ctx->task) {
+		event_function_call(event, __perf_event_read, &prd);
+	} else {
 		/*
-		 * Purposely ignore the smp_call_function_single() return
-		 * value.
-		 *
-		 * If event_cpu isn't a valid CPU it means the event got
-		 * scheduled out and that will have updated the event count.
-		 *
-		 * Therefore, either way, we'll have an up-to-date event count
-		 * after this.
-		 */
-		(void)smp_call_function_single(event_cpu, __perf_event_read, &data, 1);
-		preempt_enable();
-		ret = data.ret;
-	} else if (event->state == PERF_EVENT_STATE_INACTIVE) {
-		struct perf_event_context *ctx = event->ctx;
-		unsigned long flags;
-
-		raw_spin_lock_irqsave(&ctx->lock, flags);
-		/*
-		 * may read while context is not active
-		 * (e.g., thread is blocked), in that case
-		 * we cannot update context time
+		 * For uncore events (which are per definition per-cpu)
+		 * allow a different read CPU from event->cpu.
 		 */
-		if (ctx->is_active) {
-			update_context_time(ctx);
-			update_cgrp_time_from_event(event);
-		}
-		if (group)
-			update_group_times(event);
-		else
-			update_event_times(event);
-		raw_spin_unlock_irqrestore(&ctx->lock, flags);
+		struct event_function_struct efs = {
+			.event = event,
+			.func = __perf_event_read,
+			.data = &prd,
+		};
+		int cpu = __perf_event_read_cpu(event, event->cpu);
+
+		cpu_function_call(cpu, event_function, &efs);
 	}
 
-	return ret;
+	return prd.ret;
 }
 
 /*
@@ -4388,7 +4201,7 @@ static int perf_release(struct inode *inode, struct file *file)
 	return 0;
 }
 
-u64 perf_event_read_value(struct perf_event *event, u64 *enabled, u64 *running)
+static u64 __perf_event_read_value(struct perf_event *event, u64 *enabled, u64 *running)
 {
 	struct perf_event *child;
 	u64 total = 0;
@@ -4416,6 +4229,18 @@ u64 perf_event_read_value(struct perf_event *event, u64 *enabled, u64 *running)
 
 	return total;
 }
+
+u64 perf_event_read_value(struct perf_event *event, u64 *enabled, u64 *running)
+{
+	struct perf_event_context *ctx;
+	u64 count;
+
+	ctx = perf_event_ctx_lock(event);
+	count = __perf_event_read_value(event, enabled, running);
+	perf_event_ctx_unlock(event, ctx);
+
+	return count;
+}
 EXPORT_SYMBOL_GPL(perf_event_read_value);
 
 static int __perf_read_group_add(struct perf_event *leader,
@@ -4431,6 +4256,8 @@ static int __perf_read_group_add(struct perf_event *leader,
 	if (ret)
 		return ret;
 
+	raw_spin_lock_irqsave(&ctx->lock, flags);
+
 	/*
 	 * Since we co-schedule groups, {enabled,running} times of siblings
 	 * will be identical to those of the leader, so we only publish one
@@ -4453,8 +4280,6 @@ static int __perf_read_group_add(struct perf_event *leader,
 	if (read_format & PERF_FORMAT_ID)
 		values[n++] = primary_event_id(leader);
 
-	raw_spin_lock_irqsave(&ctx->lock, flags);
-
 	list_for_each_entry(sub, &leader->sibling_list, group_entry) {
 		values[n++] += perf_event_count(sub);
 		if (read_format & PERF_FORMAT_ID)
@@ -4518,7 +4343,7 @@ static int perf_read_one(struct perf_event *event,
 	u64 values[4];
 	int n = 0;
 
-	values[n++] = perf_event_read_value(event, &enabled, &running);
+	values[n++] = __perf_event_read_value(event, &enabled, &running);
 	if (read_format & PERF_FORMAT_TOTAL_TIME_ENABLED)
 		values[n++] = enabled;
 	if (read_format & PERF_FORMAT_TOTAL_TIME_RUNNING)
@@ -4897,8 +4722,7 @@ static void calc_timer_values(struct perf_event *event,
 
 	*now = perf_clock();
 	ctx_time = event->shadow_ctx_time + *now;
-	*enabled = ctx_time - event->tstamp_enabled;
-	*running = ctx_time - event->tstamp_running;
+	__perf_update_times(event, ctx_time, enabled, running);
 }
 
 static void perf_event_init_userpage(struct perf_event *event)
@@ -10516,7 +10340,7 @@ perf_event_exit_event(struct perf_event *child_event,
 	if (parent_event)
 		perf_group_detach(child_event);
 	list_del_event(child_event, child_ctx);
-	child_event->state = PERF_EVENT_STATE_EXIT; /* is_event_hup() */
+	perf_event_set_state(child_event, PERF_EVENT_STATE_EXIT); /* is_event_hup() */
 	raw_spin_unlock_irq(&child_ctx->lock);
 
 	/*
@@ -10754,7 +10578,7 @@ inherit_event(struct perf_event *parent_event,
 	      struct perf_event *group_leader,
 	      struct perf_event_context *child_ctx)
 {
-	enum perf_event_active_state parent_state = parent_event->state;
+	enum perf_event_state parent_state = parent_event->state;
 	struct perf_event *child_event;
 	unsigned long flags;
 
@@ -11090,6 +10914,7 @@ static void __perf_event_exit_context(void *__info)
 	struct perf_event *event;
 
 	raw_spin_lock(&ctx->lock);
+	ctx_sched_out(ctx, cpuctx, EVENT_TIME);
 	list_for_each_entry(event, &ctx->event_list, event_entry)
 		__perf_remove_from_context(event, cpuctx, ctx, (void *)DETACH_GROUP);
 	raw_spin_unlock(&ctx->lock);

^ permalink raw reply related	[flat|nested] 76+ messages in thread

* Re: [RFC][PATCH] perf: Rewrite enabled/running timekeeping
  2017-08-31 17:18           ` [RFC][PATCH] perf: Rewrite enabled/running timekeeping Peter Zijlstra
@ 2017-08-31 19:51             ` Stephane Eranian
  2017-09-05  7:51               ` Stephane Eranian
  2017-09-01 10:45             ` Alexey Budankov
                               ` (3 subsequent siblings)
  4 siblings, 1 reply; 76+ messages in thread
From: Stephane Eranian @ 2017-08-31 19:51 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Alexey Budankov, Ingo Molnar, Arnaldo Carvalho de Melo,
	Alexander Shishkin, Andi Kleen, Kan Liang, Dmitri Prokhorov,
	Valery Cherepennikov, Mark Rutland, David Carrillo-Cisneros,
	linux-kernel, Vince Weaver, Thomas Gleixner

Hi,

On Thu, Aug 31, 2017 at 10:18 AM, Peter Zijlstra <peterz@infradead.org> wrote:
> On Wed, Aug 23, 2017 at 11:54:15AM +0300, Alexey Budankov wrote:
>> On 22.08.2017 23:47, Peter Zijlstra wrote:
>> > On Thu, Aug 10, 2017 at 06:57:43PM +0300, Alexey Budankov wrote:
>> >> The key thing in the patch is explicit updating of tstamp fields for
>> >> INACTIVE events in update_event_times().
>> >
>> >> @@ -1405,6 +1426,9 @@ static void update_event_times(struct perf_event *event)
>> >>        event->group_leader->state < PERF_EVENT_STATE_INACTIVE)
>> >>            return;
>> >>
>> >> +  if (event->state == PERF_EVENT_STATE_INACTIVE)
>> >> +          perf_event_tstamp_update(event);
>> >> +
>> >>    /*
>> >>     * in cgroup mode, time_enabled represents
>> >>     * the time the event was enabled AND active
>> >
>> > But why!? I thought the whole point was to not need to do this.
>>
>> update_event_times() is not called from the timer interrupt handler,
>> thus it is not on the critical path that is optimized by this patch set.
>>
>> But update_event_times() is called in the context of the read() syscall,
>> so that is the place where we can update event times for INACTIVE events
>> instead of in the timer interrupt.
>>
>> update_event_times() is also called on thread context switch out, so
>> event times are updated as well when the thread migrates to another CPU.
>>
>> >
>> > The thing I outlined earlier would only need to update timestamps when
>> > events change state and at no other point in time.
>>
>> But we may still request times through the read() syscall while the event
>> is in the INACTIVE state, and the event timings need to be up to date.
>
> Sure, read() also updates.
>
> So the below completely rewrites timekeeping (and probably breaks
> world) but does away with the need to touch events that don't get
> scheduled.
>
> Esp the cgroup stuff is entirely untested since I simply don't know how
> to operate that. I did run Vince's tests on it, and I think it doesn't
> regress, but I'm near a migraine so I can't really see straight atm.
>
> Vince, Stephane, could you guys have a peek?
>
Okay, I will run some tests with cgroups on my systems.
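
For reference, the kind of cgroup-bound counter such a test exercises can
be opened roughly as below; this is a minimal user-space sketch, where the
perf_event cgroup mount point and the "test" group name are assumptions,
not something taken from the patch itself.

/* Open a cgroup-bound cycles counter on CPU 0 and read back its count
 * together with time_enabled/time_running. Cgroup path and group name
 * are illustrative assumptions. */
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <linux/perf_event.h>

int main(void)
{
	struct perf_event_attr attr;
	uint64_t buf[3];	/* value, time_enabled, time_running */
	int cgrp_fd, ev_fd;

	cgrp_fd = open("/sys/fs/cgroup/perf_event/test", O_RDONLY);
	if (cgrp_fd < 0)
		return 1;

	memset(&attr, 0, sizeof(attr));
	attr.size = sizeof(attr);
	attr.type = PERF_TYPE_HARDWARE;
	attr.config = PERF_COUNT_HW_CPU_CYCLES;
	attr.read_format = PERF_FORMAT_TOTAL_TIME_ENABLED |
			   PERF_FORMAT_TOTAL_TIME_RUNNING;

	/* With PERF_FLAG_PID_CGROUP the pid argument is a cgroup fd;
	 * cgroup events must be per-cpu, hence cpu = 0 here. */
	ev_fd = syscall(__NR_perf_event_open, &attr, cgrp_fd, 0, -1,
			PERF_FLAG_PID_CGROUP);
	if (ev_fd < 0)
		return 1;

	sleep(1);
	if (read(ev_fd, buf, sizeof(buf)) == sizeof(buf))
		printf("count=%llu enabled=%llu running=%llu\n",
		       (unsigned long long)buf[0],
		       (unsigned long long)buf[1],
		       (unsigned long long)buf[2]);
	return 0;
}

The time_enabled/time_running values reported here are exactly the totals
whose bookkeeping the rewrite below changes.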

> (There are a few other bits in there; I'll break it up into patches and
> write comments and changelogs later. I think it can be split into some
> 5 patches.)
>
> The basic idea is really simple: we have a single timestamp and,
> depending on the state, we update enabled/running. This obviously only
> requires updates when we change state and when we need up-to-date
> timestamps (read).
>
> No more weird and wonderful mind-bending interaction between 3 different
> timestamps with arcane update rules.
>
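To illustrate the rule described above, here is a minimal stand-alone
sketch of the same bookkeeping in user-space C; the struct and helper
names are illustrative stand-ins, not the kernel's perf_event fields,
and they loosely mirror what perf_event_update_time() and
perf_event_set_state() do in the patch below.

#include <stdint.h>

enum evt_state { EVT_OFF = 0, EVT_INACTIVE = 1, EVT_ACTIVE = 2 };

struct evt {
	enum evt_state	state;
	uint64_t	tstamp;			/* time of last state change */
	uint64_t	total_time_enabled;
	uint64_t	total_time_running;
};

/* Fold the time since the last state change into the totals. */
static void evt_update_time(struct evt *e, uint64_t now)
{
	uint64_t delta = now - e->tstamp;

	if (e->state >= EVT_INACTIVE)
		e->total_time_enabled += delta;
	if (e->state >= EVT_ACTIVE)
		e->total_time_running += delta;
	e->tstamp = now;
}

/* Only state changes (and reads) need to touch the totals. */
static void evt_set_state(struct evt *e, enum evt_state s, uint64_t now)
{
	evt_update_time(e, now);
	e->state = s;
}

A read() then just folds in the elapsed time before reporting the totals,
so events that never change state never need to be touched on the hrtimer
path.
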
> ---
>  include/linux/perf_event.h |  25 +-
>  kernel/events/core.c       | 551 ++++++++++++++++-----------------------------
>  2 files changed, 192 insertions(+), 384 deletions(-)
>
> diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
> index 8e22f24ded6a..2a6ae48a1a96 100644
> --- a/include/linux/perf_event.h
> +++ b/include/linux/perf_event.h
> @@ -485,9 +485,9 @@ struct perf_addr_filters_head {
>  };
>
>  /**
> - * enum perf_event_active_state - the states of a event
> + * enum perf_event_state - the states of a event
>   */
> -enum perf_event_active_state {
> +enum perf_event_state {
>         PERF_EVENT_STATE_DEAD           = -4,
>         PERF_EVENT_STATE_EXIT           = -3,
>         PERF_EVENT_STATE_ERROR          = -2,
> @@ -578,7 +578,7 @@ struct perf_event {
>         struct pmu                      *pmu;
>         void                            *pmu_private;
>
> -       enum perf_event_active_state    state;
> +       enum perf_event_state           state;
>         unsigned int                    attach_state;
>         local64_t                       count;
>         atomic64_t                      child_count;
> @@ -588,26 +588,10 @@ struct perf_event {
>          * has been enabled (i.e. eligible to run, and the task has
>          * been scheduled in, if this is a per-task event)
>          * and running (scheduled onto the CPU), respectively.
> -        *
> -        * They are computed from tstamp_enabled, tstamp_running and
> -        * tstamp_stopped when the event is in INACTIVE or ACTIVE state.
>          */
>         u64                             total_time_enabled;
>         u64                             total_time_running;
> -
> -       /*
> -        * These are timestamps used for computing total_time_enabled
> -        * and total_time_running when the event is in INACTIVE or
> -        * ACTIVE state, measured in nanoseconds from an arbitrary point
> -        * in time.
> -        * tstamp_enabled: the notional time when the event was enabled
> -        * tstamp_running: the notional time when the event was scheduled on
> -        * tstamp_stopped: in INACTIVE state, the notional time when the
> -        *      event was scheduled off.
> -        */
> -       u64                             tstamp_enabled;
> -       u64                             tstamp_running;
> -       u64                             tstamp_stopped;
> +       u64                             tstamp;
>
>         /*
>          * timestamp shadows the actual context timing but it can
> @@ -699,7 +683,6 @@ struct perf_event {
>
>  #ifdef CONFIG_CGROUP_PERF
>         struct perf_cgroup              *cgrp; /* cgroup event is attach to */
> -       int                             cgrp_defer_enabled;
>  #endif
>
>         struct list_head                sb_list;
> diff --git a/kernel/events/core.c b/kernel/events/core.c
> index 294f1927f944..e968b3eab9c7 100644
> --- a/kernel/events/core.c
> +++ b/kernel/events/core.c
> @@ -582,6 +582,70 @@ static inline u64 perf_event_clock(struct perf_event *event)
>         return event->clock();
>  }
>
> +/*
> + * XXX comment about timekeeping goes here
> + */
> +
> +static __always_inline enum perf_event_state
> +__perf_effective_state(struct perf_event *event)
> +{
> +       struct perf_event *leader = event->group_leader;
> +
> +       if (leader->state <= PERF_EVENT_STATE_OFF)
> +               return leader->state;
> +
> +       return event->state;
> +}
> +
> +static __always_inline void
> +__perf_update_times(struct perf_event *event, u64 now, u64 *enabled, u64 *running)
> +{
> +       enum perf_event_state state = __perf_effective_state(event);
> +       u64 delta = now - event->tstamp;
> +
> +       *enabled = event->total_time_enabled;
> +       if (state >= PERF_EVENT_STATE_INACTIVE)
> +               *enabled += delta;
> +
> +       *running = event->total_time_running;
> +       if (state >= PERF_EVENT_STATE_ACTIVE)
> +               *running += delta;
> +}
> +
> +static void perf_event_update_time(struct perf_event *event)
> +{
> +       u64 now = perf_event_time(event);
> +
> +       __perf_update_times(event, now, &event->total_time_enabled,
> +                                       &event->total_time_running);
> +       event->tstamp = now;
> +}
> +
> +static void perf_event_update_sibling_time(struct perf_event *leader)
> +{
> +       struct perf_event *sibling;
> +
> +       list_for_each_entry(sibling, &leader->sibling_list, group_entry)
> +               perf_event_update_time(sibling);
> +}
> +
> +static void
> +perf_event_set_state(struct perf_event *event, enum perf_event_state state)
> +{
> +       if (event->state == state)
> +               return;
> +
> +       perf_event_update_time(event);
> +       /*
> +        * If a group leader gets enabled/disabled all its siblings
> +        * are affected too.
> +        */
> +       if ((event->state < 0) ^ (state < 0))
> +               perf_event_update_sibling_time(event);
> +
> +       WRITE_ONCE(event->state, state);
> +}
> +
>  #ifdef CONFIG_CGROUP_PERF
>
>  static inline bool
> @@ -841,40 +905,6 @@ perf_cgroup_set_shadow_time(struct perf_event *event, u64 now)
>         event->shadow_ctx_time = now - t->timestamp;
>  }
>
> -static inline void
> -perf_cgroup_defer_enabled(struct perf_event *event)
> -{
> -       /*
> -        * when the current task's perf cgroup does not match
> -        * the event's, we need to remember to call the
> -        * perf_mark_enable() function the first time a task with
> -        * a matching perf cgroup is scheduled in.
> -        */
> -       if (is_cgroup_event(event) && !perf_cgroup_match(event))
> -               event->cgrp_defer_enabled = 1;
> -}
> -
> -static inline void
> -perf_cgroup_mark_enabled(struct perf_event *event,
> -                        struct perf_event_context *ctx)
> -{
> -       struct perf_event *sub;
> -       u64 tstamp = perf_event_time(event);
> -
> -       if (!event->cgrp_defer_enabled)
> -               return;
> -
> -       event->cgrp_defer_enabled = 0;
> -
> -       event->tstamp_enabled = tstamp - event->total_time_enabled;
> -       list_for_each_entry(sub, &event->sibling_list, group_entry) {
> -               if (sub->state >= PERF_EVENT_STATE_INACTIVE) {
> -                       sub->tstamp_enabled = tstamp - sub->total_time_enabled;
> -                       sub->cgrp_defer_enabled = 0;
> -               }
> -       }
> -}
> -
>  /*
>   * Update cpuctx->cgrp so that it is set when first cgroup event is added and
>   * cleared when last cgroup event is removed.
> @@ -973,17 +1003,6 @@ static inline u64 perf_cgroup_event_time(struct perf_event *event)
>  }
>
>  static inline void
> -perf_cgroup_defer_enabled(struct perf_event *event)
> -{
> -}
> -
> -static inline void
> -perf_cgroup_mark_enabled(struct perf_event *event,
> -                        struct perf_event_context *ctx)
> -{
> -}
> -
> -static inline void
>  list_update_cgroup_event(struct perf_event *event,
>                          struct perf_event_context *ctx, bool add)
>  {
> @@ -1396,60 +1415,6 @@ static u64 perf_event_time(struct perf_event *event)
>         return ctx ? ctx->time : 0;
>  }
>
> -/*
> - * Update the total_time_enabled and total_time_running fields for a event.
> - */
> -static void update_event_times(struct perf_event *event)
> -{
> -       struct perf_event_context *ctx = event->ctx;
> -       u64 run_end;
> -
> -       lockdep_assert_held(&ctx->lock);
> -
> -       if (event->state < PERF_EVENT_STATE_INACTIVE ||
> -           event->group_leader->state < PERF_EVENT_STATE_INACTIVE)
> -               return;
> -
> -       /*
> -        * in cgroup mode, time_enabled represents
> -        * the time the event was enabled AND active
> -        * tasks were in the monitored cgroup. This is
> -        * independent of the activity of the context as
> -        * there may be a mix of cgroup and non-cgroup events.
> -        *
> -        * That is why we treat cgroup events differently
> -        * here.
> -        */
> -       if (is_cgroup_event(event))
> -               run_end = perf_cgroup_event_time(event);
> -       else if (ctx->is_active)
> -               run_end = ctx->time;
> -       else
> -               run_end = event->tstamp_stopped;
> -
> -       event->total_time_enabled = run_end - event->tstamp_enabled;
> -
> -       if (event->state == PERF_EVENT_STATE_INACTIVE)
> -               run_end = event->tstamp_stopped;
> -       else
> -               run_end = perf_event_time(event);
> -
> -       event->total_time_running = run_end - event->tstamp_running;
> -
> -}
> -
> -/*
> - * Update total_time_enabled and total_time_running for all events in a group.
> - */
> -static void update_group_times(struct perf_event *leader)
> -{
> -       struct perf_event *event;
> -
> -       update_event_times(leader);
> -       list_for_each_entry(event, &leader->sibling_list, group_entry)
> -               update_event_times(event);
> -}
> -
>  static enum event_type_t get_event_type(struct perf_event *event)
>  {
>         struct perf_event_context *ctx = event->ctx;
> @@ -1492,6 +1457,8 @@ list_add_event(struct perf_event *event, struct perf_event_context *ctx)
>         WARN_ON_ONCE(event->attach_state & PERF_ATTACH_CONTEXT);
>         event->attach_state |= PERF_ATTACH_CONTEXT;
>
> +       event->tstamp = perf_event_time(event);
> +
>         /*
>          * If we're a stand alone event or group leader, we go to the context
>          * list, group events are kept attached to the group so that
> @@ -1699,8 +1666,6 @@ list_del_event(struct perf_event *event, struct perf_event_context *ctx)
>         if (event->group_leader == event)
>                 list_del_init(&event->group_entry);
>
> -       update_group_times(event);
> -
>         /*
>          * If event was in error state, then keep it
>          * that way, otherwise bogus counts will be
> @@ -1709,7 +1674,7 @@ list_del_event(struct perf_event *event, struct perf_event_context *ctx)
>          * of the event
>          */
>         if (event->state > PERF_EVENT_STATE_OFF)
> -               event->state = PERF_EVENT_STATE_OFF;
> +               perf_event_set_state(event, PERF_EVENT_STATE_OFF);
>
>         ctx->generation++;
>  }
> @@ -1808,38 +1773,24 @@ event_sched_out(struct perf_event *event,
>                   struct perf_cpu_context *cpuctx,
>                   struct perf_event_context *ctx)
>  {
> -       u64 tstamp = perf_event_time(event);
> -       u64 delta;
> +       enum perf_event_state state = PERF_EVENT_STATE_INACTIVE;
>
>         WARN_ON_ONCE(event->ctx != ctx);
>         lockdep_assert_held(&ctx->lock);
>
> -       /*
> -        * An event which could not be activated because of
> -        * filter mismatch still needs to have its timings
> -        * maintained, otherwise bogus information is return
> -        * via read() for time_enabled, time_running:
> -        */
> -       if (event->state == PERF_EVENT_STATE_INACTIVE &&
> -           !event_filter_match(event)) {
> -               delta = tstamp - event->tstamp_stopped;
> -               event->tstamp_running += delta;
> -               event->tstamp_stopped = tstamp;
> -       }
> -
>         if (event->state != PERF_EVENT_STATE_ACTIVE)
>                 return;
>
>         perf_pmu_disable(event->pmu);
>
> -       event->tstamp_stopped = tstamp;
>         event->pmu->del(event, 0);
>         event->oncpu = -1;
> -       event->state = PERF_EVENT_STATE_INACTIVE;
> +
>         if (event->pending_disable) {
>                 event->pending_disable = 0;
> -               event->state = PERF_EVENT_STATE_OFF;
> +               state = PERF_EVENT_STATE_OFF;
>         }
> +       perf_event_set_state(event, state);
>
>         if (!is_software_event(event))
>                 cpuctx->active_oncpu--;
> @@ -1859,7 +1810,9 @@ group_sched_out(struct perf_event *group_event,
>                 struct perf_event_context *ctx)
>  {
>         struct perf_event *event;
> -       int state = group_event->state;
> +
> +       if (group_event->state != PERF_EVENT_STATE_ACTIVE)
> +               return;
>
>         perf_pmu_disable(ctx->pmu);
>
> @@ -1873,7 +1826,7 @@ group_sched_out(struct perf_event *group_event,
>
>         perf_pmu_enable(ctx->pmu);
>
> -       if (state == PERF_EVENT_STATE_ACTIVE && group_event->attr.exclusive)
> +       if (group_event->attr.exclusive)
>                 cpuctx->exclusive = 0;
>  }
>
> @@ -1893,6 +1846,11 @@ __perf_remove_from_context(struct perf_event *event,
>  {
>         unsigned long flags = (unsigned long)info;
>
> +       if (ctx->is_active & EVENT_TIME) {
> +               update_context_time(ctx);
> +               update_cgrp_time_from_cpuctx(cpuctx);
> +       }
> +
>         event_sched_out(event, cpuctx, ctx);
>         if (flags & DETACH_GROUP)
>                 perf_group_detach(event);
> @@ -1955,14 +1913,17 @@ static void __perf_event_disable(struct perf_event *event,
>         if (event->state < PERF_EVENT_STATE_INACTIVE)
>                 return;
>
> -       update_context_time(ctx);
> -       update_cgrp_time_from_event(event);
> -       update_group_times(event);
> +       if (ctx->is_active & EVENT_TIME) {
> +               update_context_time(ctx);
> +               update_cgrp_time_from_cpuctx(cpuctx);
> +       }
> +
>         if (event == event->group_leader)
>                 group_sched_out(event, cpuctx, ctx);
>         else
>                 event_sched_out(event, cpuctx, ctx);
> -       event->state = PERF_EVENT_STATE_OFF;
> +
> +       perf_event_set_state(event, PERF_EVENT_STATE_OFF);
>  }
>
>  /*
> @@ -2019,8 +1980,7 @@ void perf_event_disable_inatomic(struct perf_event *event)
>  }
>
>  static void perf_set_shadow_time(struct perf_event *event,
> -                                struct perf_event_context *ctx,
> -                                u64 tstamp)
> +                                struct perf_event_context *ctx)
>  {
>         /*
>          * use the correct time source for the time snapshot
> @@ -2048,9 +2008,9 @@ static void perf_set_shadow_time(struct perf_event *event,
>          * is cleaner and simpler to understand.
>          */
>         if (is_cgroup_event(event))
> -               perf_cgroup_set_shadow_time(event, tstamp);
> +               perf_cgroup_set_shadow_time(event, event->tstamp);
>         else
> -               event->shadow_ctx_time = tstamp - ctx->timestamp;
> +               event->shadow_ctx_time = event->tstamp - ctx->timestamp;
>  }
>
>  #define MAX_INTERRUPTS (~0ULL)
> @@ -2063,7 +2023,6 @@ event_sched_in(struct perf_event *event,
>                  struct perf_cpu_context *cpuctx,
>                  struct perf_event_context *ctx)
>  {
> -       u64 tstamp = perf_event_time(event);
>         int ret = 0;
>
>         lockdep_assert_held(&ctx->lock);
> @@ -2077,7 +2036,7 @@ event_sched_in(struct perf_event *event,
>          * is visible.
>          */
>         smp_wmb();
> -       WRITE_ONCE(event->state, PERF_EVENT_STATE_ACTIVE);
> +       perf_event_set_state(event, PERF_EVENT_STATE_ACTIVE);
>
>         /*
>          * Unthrottle events, since we scheduled we might have missed several
> @@ -2089,26 +2048,19 @@ event_sched_in(struct perf_event *event,
>                 event->hw.interrupts = 0;
>         }
>
> -       /*
> -        * The new state must be visible before we turn it on in the hardware:
> -        */
> -       smp_wmb();
> -
>         perf_pmu_disable(event->pmu);
>
> -       perf_set_shadow_time(event, ctx, tstamp);
> +       perf_set_shadow_time(event, ctx);
>
>         perf_log_itrace_start(event);
>
>         if (event->pmu->add(event, PERF_EF_START)) {
> -               event->state = PERF_EVENT_STATE_INACTIVE;
> +               perf_event_set_state(event, PERF_EVENT_STATE_INACTIVE);
>                 event->oncpu = -1;
>                 ret = -EAGAIN;
>                 goto out;
>         }
>
> -       event->tstamp_running += tstamp - event->tstamp_stopped;
> -
>         if (!is_software_event(event))
>                 cpuctx->active_oncpu++;
>         if (!ctx->nr_active++)
> @@ -2132,8 +2084,6 @@ group_sched_in(struct perf_event *group_event,
>  {
>         struct perf_event *event, *partial_group = NULL;
>         struct pmu *pmu = ctx->pmu;
> -       u64 now = ctx->time;
> -       bool simulate = false;
>
>         if (group_event->state == PERF_EVENT_STATE_OFF)
>                 return 0;
> @@ -2163,27 +2113,13 @@ group_sched_in(struct perf_event *group_event,
>         /*
>          * Groups can be scheduled in as one unit only, so undo any
>          * partial group before returning:
> -        * The events up to the failed event are scheduled out normally,
> -        * tstamp_stopped will be updated.
> -        *
> -        * The failed events and the remaining siblings need to have
> -        * their timings updated as if they had gone thru event_sched_in()
> -        * and event_sched_out(). This is required to get consistent timings
> -        * across the group. This also takes care of the case where the group
> -        * could never be scheduled by ensuring tstamp_stopped is set to mark
> -        * the time the event was actually stopped, such that time delta
> -        * calculation in update_event_times() is correct.
> +        * The events up to the failed event are scheduled out normally.
>          */
>         list_for_each_entry(event, &group_event->sibling_list, group_entry) {
>                 if (event == partial_group)
> -                       simulate = true;
> +                       break;
>
> -               if (simulate) {
> -                       event->tstamp_running += now - event->tstamp_stopped;
> -                       event->tstamp_stopped = now;
> -               } else {
> -                       event_sched_out(event, cpuctx, ctx);
> -               }
> +               event_sched_out(event, cpuctx, ctx);
>         }
>         event_sched_out(group_event, cpuctx, ctx);
>
> @@ -2225,46 +2161,11 @@ static int group_can_go_on(struct perf_event *event,
>         return can_add_hw;
>  }
>
> -/*
> - * Complement to update_event_times(). This computes the tstamp_* values to
> - * continue 'enabled' state from @now, and effectively discards the time
> - * between the prior tstamp_stopped and now (as we were in the OFF state, or
> - * just switched (context) time base).
> - *
> - * This further assumes '@event->state == INACTIVE' (we just came from OFF) and
> - * cannot have been scheduled in yet. And going into INACTIVE state means
> - * '@event->tstamp_stopped = @now'.
> - *
> - * Thus given the rules of update_event_times():
> - *
> - *   total_time_enabled = tstamp_stopped - tstamp_enabled
> - *   total_time_running = tstamp_stopped - tstamp_running
> - *
> - * We can insert 'tstamp_stopped == now' and reverse them to compute new
> - * tstamp_* values.
> - */
> -static void __perf_event_enable_time(struct perf_event *event, u64 now)
> -{
> -       WARN_ON_ONCE(event->state != PERF_EVENT_STATE_INACTIVE);
> -
> -       event->tstamp_stopped = now;
> -       event->tstamp_enabled = now - event->total_time_enabled;
> -       event->tstamp_running = now - event->total_time_running;
> -}
> -
>  static void add_event_to_ctx(struct perf_event *event,
>                                struct perf_event_context *ctx)
>  {
> -       u64 tstamp = perf_event_time(event);
> -
>         list_add_event(event, ctx);
>         perf_group_attach(event);
> -       /*
> -        * We can be called with event->state == STATE_OFF when we create with
> -        * .disabled = 1. In that case the IOC_ENABLE will call this function.
> -        */
> -       if (event->state == PERF_EVENT_STATE_INACTIVE)
> -               __perf_event_enable_time(event, tstamp);
>  }
>
>  static void ctx_sched_out(struct perf_event_context *ctx,
> @@ -2496,28 +2397,6 @@ perf_install_in_context(struct perf_event_context *ctx,
>  }
>
>  /*
> - * Put a event into inactive state and update time fields.
> - * Enabling the leader of a group effectively enables all
> - * the group members that aren't explicitly disabled, so we
> - * have to update their ->tstamp_enabled also.
> - * Note: this works for group members as well as group leaders
> - * since the non-leader members' sibling_lists will be empty.
> - */
> -static void __perf_event_mark_enabled(struct perf_event *event)
> -{
> -       struct perf_event *sub;
> -       u64 tstamp = perf_event_time(event);
> -
> -       event->state = PERF_EVENT_STATE_INACTIVE;
> -       __perf_event_enable_time(event, tstamp);
> -       list_for_each_entry(sub, &event->sibling_list, group_entry) {
> -               /* XXX should not be > INACTIVE if event isn't */
> -               if (sub->state >= PERF_EVENT_STATE_INACTIVE)
> -                       __perf_event_enable_time(sub, tstamp);
> -       }
> -}
> -
> -/*
>   * Cross CPU call to enable a performance event
>   */
>  static void __perf_event_enable(struct perf_event *event,
> @@ -2535,14 +2414,12 @@ static void __perf_event_enable(struct perf_event *event,
>         if (ctx->is_active)
>                 ctx_sched_out(ctx, cpuctx, EVENT_TIME);
>
> -       __perf_event_mark_enabled(event);
> +       perf_event_set_state(event, PERF_EVENT_STATE_INACTIVE);
>
>         if (!ctx->is_active)
>                 return;
>
>         if (!event_filter_match(event)) {
> -               if (is_cgroup_event(event))
> -                       perf_cgroup_defer_enabled(event);
>                 ctx_sched_in(ctx, cpuctx, EVENT_TIME, current);
>                 return;
>         }
> @@ -2862,18 +2739,10 @@ static void __perf_event_sync_stat(struct perf_event *event,
>          * we know the event must be on the current CPU, therefore we
>          * don't need to use it.
>          */
> -       switch (event->state) {
> -       case PERF_EVENT_STATE_ACTIVE:
> +       if (event->state == PERF_EVENT_STATE_ACTIVE)
>                 event->pmu->read(event);
> -               /* fall-through */
>
> -       case PERF_EVENT_STATE_INACTIVE:
> -               update_event_times(event);
> -               break;
> -
> -       default:
> -               break;
> -       }
> +       perf_event_update_time(event);
>
>         /*
>          * In order to keep per-task stats reliable we need to flip the event
> @@ -3110,10 +2979,6 @@ ctx_pinned_sched_in(struct perf_event_context *ctx,
>                 if (!event_filter_match(event))
>                         continue;
>
> -               /* may need to reset tstamp_enabled */
> -               if (is_cgroup_event(event))
> -                       perf_cgroup_mark_enabled(event, ctx);
> -
>                 if (group_can_go_on(event, cpuctx, 1))
>                         group_sched_in(event, cpuctx, ctx);
>
> @@ -3121,10 +2986,8 @@ ctx_pinned_sched_in(struct perf_event_context *ctx,
>                  * If this pinned group hasn't been scheduled,
>                  * put it in error state.
>                  */
> -               if (event->state == PERF_EVENT_STATE_INACTIVE) {
> -                       update_group_times(event);
> -                       event->state = PERF_EVENT_STATE_ERROR;
> -               }
> +               if (event->state == PERF_EVENT_STATE_INACTIVE)
> +                       perf_event_set_state(event, PERF_EVENT_STATE_ERROR);
>         }
>  }
>
> @@ -3146,10 +3009,6 @@ ctx_flexible_sched_in(struct perf_event_context *ctx,
>                 if (!event_filter_match(event))
>                         continue;
>
> -               /* may need to reset tstamp_enabled */
> -               if (is_cgroup_event(event))
> -                       perf_cgroup_mark_enabled(event, ctx);
> -
>                 if (group_can_go_on(event, cpuctx, can_add_hw)) {
>                         if (group_sched_in(event, cpuctx, ctx))
>                                 can_add_hw = 0;
> @@ -3541,7 +3400,7 @@ static int event_enable_on_exec(struct perf_event *event,
>         if (event->state >= PERF_EVENT_STATE_INACTIVE)
>                 return 0;
>
> -       __perf_event_mark_enabled(event);
> +       perf_event_set_state(event, PERF_EVENT_STATE_INACTIVE);
>
>         return 1;
>  }
> @@ -3590,12 +3449,6 @@ static void perf_event_enable_on_exec(int ctxn)
>                 put_ctx(clone_ctx);
>  }
>
> -struct perf_read_data {
> -       struct perf_event *event;
> -       bool group;
> -       int ret;
> -};
> -
>  static int __perf_event_read_cpu(struct perf_event *event, int event_cpu)
>  {
>         u16 local_pkg, event_pkg;
> @@ -3613,64 +3466,6 @@ static int __perf_event_read_cpu(struct perf_event *event, int event_cpu)
>         return event_cpu;
>  }
>
> -/*
> - * Cross CPU call to read the hardware event
> - */
> -static void __perf_event_read(void *info)
> -{
> -       struct perf_read_data *data = info;
> -       struct perf_event *sub, *event = data->event;
> -       struct perf_event_context *ctx = event->ctx;
> -       struct perf_cpu_context *cpuctx = __get_cpu_context(ctx);
> -       struct pmu *pmu = event->pmu;
> -
> -       /*
> -        * If this is a task context, we need to check whether it is
> -        * the current task context of this cpu.  If not it has been
> -        * scheduled out before the smp call arrived.  In that case
> -        * event->count would have been updated to a recent sample
> -        * when the event was scheduled out.
> -        */
> -       if (ctx->task && cpuctx->task_ctx != ctx)
> -               return;
> -
> -       raw_spin_lock(&ctx->lock);
> -       if (ctx->is_active) {
> -               update_context_time(ctx);
> -               update_cgrp_time_from_event(event);
> -       }
> -
> -       update_event_times(event);
> -       if (event->state != PERF_EVENT_STATE_ACTIVE)
> -               goto unlock;
> -
> -       if (!data->group) {
> -               pmu->read(event);
> -               data->ret = 0;
> -               goto unlock;
> -       }
> -
> -       pmu->start_txn(pmu, PERF_PMU_TXN_READ);
> -
> -       pmu->read(event);
> -
> -       list_for_each_entry(sub, &event->sibling_list, group_entry) {
> -               update_event_times(sub);
> -               if (sub->state == PERF_EVENT_STATE_ACTIVE) {
> -                       /*
> -                        * Use sibling's PMU rather than @event's since
> -                        * sibling could be on different (eg: software) PMU.
> -                        */
> -                       sub->pmu->read(sub);
> -               }
> -       }
> -
> -       data->ret = pmu->commit_txn(pmu);
> -
> -unlock:
> -       raw_spin_unlock(&ctx->lock);
> -}
> -
>  static inline u64 perf_event_count(struct perf_event *event)
>  {
>         return local64_read(&event->count) + atomic64_read(&event->child_count);
> @@ -3733,63 +3528,81 @@ int perf_event_read_local(struct perf_event *event, u64 *value)
>         return ret;
>  }
>
> -static int perf_event_read(struct perf_event *event, bool group)
> +struct perf_read_data {
> +       struct perf_event *event;
> +       bool group;
> +       int ret;
> +};
> +
> +static void __perf_event_read(struct perf_event *event,
> +                             struct perf_cpu_context *cpuctx,
> +                             struct perf_event_context *ctx,
> +                             void *data)
>  {
> -       int event_cpu, ret = 0;
> +       struct perf_read_data *prd = data;
> +       struct pmu *pmu = event->pmu;
> +       struct perf_event *sibling;
>
> -       /*
> -        * If event is enabled and currently active on a CPU, update the
> -        * value in the event structure:
> -        */
> -       if (event->state == PERF_EVENT_STATE_ACTIVE) {
> -               struct perf_read_data data = {
> -                       .event = event,
> -                       .group = group,
> -                       .ret = 0,
> -               };
> +       if (ctx->is_active & EVENT_TIME) {
> +               update_context_time(ctx);
> +               update_cgrp_time_from_cpuctx(cpuctx);
> +       }
>
> -               event_cpu = READ_ONCE(event->oncpu);
> -               if ((unsigned)event_cpu >= nr_cpu_ids)
> -                       return 0;
> +       perf_event_update_time(event);
> +       if (prd->group)
> +               perf_event_update_sibling_time(event);
>
> -               preempt_disable();
> -               event_cpu = __perf_event_read_cpu(event, event_cpu);
> +       if (event->state != PERF_EVENT_STATE_ACTIVE)
> +               return;
>
> +       if (!prd->group) {
> +               pmu->read(event);
> +               prd->ret = 0;
> +               return;
> +       }
> +
> +       pmu->start_txn(pmu, PERF_PMU_TXN_READ);
> +
> +       pmu->read(event);
> +       list_for_each_entry(sibling, &event->sibling_list, group_entry) {
> +               if (sibling->state == PERF_EVENT_STATE_ACTIVE) {
> +                       /*
> +                        * Use sibling's PMU rather than @event's since
> +                        * sibling could be on different (eg: software) PMU.
> +                        */
> +                       sibling->pmu->read(sibling);
> +               }
> +       }
> +
> +       prd->ret = pmu->commit_txn(pmu);
> +}
> +
> +static int perf_event_read(struct perf_event *event, bool group)
> +{
> +       struct perf_read_data prd = {
> +               .event = event,
> +               .group = group,
> +               .ret = 0,
> +       };
> +
> +       if (event->ctx->task) {
> +               event_function_call(event, __perf_event_read, &prd);
> +       } else {
>                 /*
> -                * Purposely ignore the smp_call_function_single() return
> -                * value.
> -                *
> -                * If event_cpu isn't a valid CPU it means the event got
> -                * scheduled out and that will have updated the event count.
> -                *
> -                * Therefore, either way, we'll have an up-to-date event count
> -                * after this.
> -                */
> -               (void)smp_call_function_single(event_cpu, __perf_event_read, &data, 1);
> -               preempt_enable();
> -               ret = data.ret;
> -       } else if (event->state == PERF_EVENT_STATE_INACTIVE) {
> -               struct perf_event_context *ctx = event->ctx;
> -               unsigned long flags;
> -
> -               raw_spin_lock_irqsave(&ctx->lock, flags);
> -               /*
> -                * may read while context is not active
> -                * (e.g., thread is blocked), in that case
> -                * we cannot update context time
> +                * For uncore events (which are per definition per-cpu)
> +                * allow a different read CPU from event->cpu.
>                  */
> -               if (ctx->is_active) {
> -                       update_context_time(ctx);
> -                       update_cgrp_time_from_event(event);
> -               }
> -               if (group)
> -                       update_group_times(event);
> -               else
> -                       update_event_times(event);
> -               raw_spin_unlock_irqrestore(&ctx->lock, flags);
> +               struct event_function_struct efs = {
> +                       .event = event,
> +                       .func = __perf_event_read,
> +                       .data = &prd,
> +               };
> +               int cpu = __perf_event_read_cpu(event, event->cpu);
> +
> +               cpu_function_call(cpu, event_function, &efs);
>         }
>
> -       return ret;
> +       return prd.ret;
>  }
>
>  /*
> @@ -4388,7 +4201,7 @@ static int perf_release(struct inode *inode, struct file *file)
>         return 0;
>  }
>
> -u64 perf_event_read_value(struct perf_event *event, u64 *enabled, u64 *running)
> +static u64 __perf_event_read_value(struct perf_event *event, u64 *enabled, u64 *running)
>  {
>         struct perf_event *child;
>         u64 total = 0;
> @@ -4416,6 +4229,18 @@ u64 perf_event_read_value(struct perf_event *event, u64 *enabled, u64 *running)
>
>         return total;
>  }
> +
> +u64 perf_event_read_value(struct perf_event *event, u64 *enabled, u64 *running)
> +{
> +       struct perf_event_context *ctx;
> +       u64 count;
> +
> +       ctx = perf_event_ctx_lock(event);
> +       count = __perf_event_read_value(event, enabled, running);
> +       perf_event_ctx_unlock(event, ctx);
> +
> +       return count;
> +}
>  EXPORT_SYMBOL_GPL(perf_event_read_value);
>
>  static int __perf_read_group_add(struct perf_event *leader,
> @@ -4431,6 +4256,8 @@ static int __perf_read_group_add(struct perf_event *leader,
>         if (ret)
>                 return ret;
>
> +       raw_spin_lock_irqsave(&ctx->lock, flags);
> +
>         /*
>          * Since we co-schedule groups, {enabled,running} times of siblings
>          * will be identical to those of the leader, so we only publish one
> @@ -4453,8 +4280,6 @@ static int __perf_read_group_add(struct perf_event *leader,
>         if (read_format & PERF_FORMAT_ID)
>                 values[n++] = primary_event_id(leader);
>
> -       raw_spin_lock_irqsave(&ctx->lock, flags);
> -
>         list_for_each_entry(sub, &leader->sibling_list, group_entry) {
>                 values[n++] += perf_event_count(sub);
>                 if (read_format & PERF_FORMAT_ID)
> @@ -4518,7 +4343,7 @@ static int perf_read_one(struct perf_event *event,
>         u64 values[4];
>         int n = 0;
>
> -       values[n++] = perf_event_read_value(event, &enabled, &running);
> +       values[n++] = __perf_event_read_value(event, &enabled, &running);
>         if (read_format & PERF_FORMAT_TOTAL_TIME_ENABLED)
>                 values[n++] = enabled;
>         if (read_format & PERF_FORMAT_TOTAL_TIME_RUNNING)
> @@ -4897,8 +4722,7 @@ static void calc_timer_values(struct perf_event *event,
>
>         *now = perf_clock();
>         ctx_time = event->shadow_ctx_time + *now;
> -       *enabled = ctx_time - event->tstamp_enabled;
> -       *running = ctx_time - event->tstamp_running;
> +       __perf_update_times(event, ctx_time, enabled, running);
>  }
>
>  static void perf_event_init_userpage(struct perf_event *event)
> @@ -10516,7 +10340,7 @@ perf_event_exit_event(struct perf_event *child_event,
>         if (parent_event)
>                 perf_group_detach(child_event);
>         list_del_event(child_event, child_ctx);
> -       child_event->state = PERF_EVENT_STATE_EXIT; /* is_event_hup() */
> +       perf_event_set_state(child_event, PERF_EVENT_STATE_EXIT); /* is_event_hup() */
>         raw_spin_unlock_irq(&child_ctx->lock);
>
>         /*
> @@ -10754,7 +10578,7 @@ inherit_event(struct perf_event *parent_event,
>               struct perf_event *group_leader,
>               struct perf_event_context *child_ctx)
>  {
> -       enum perf_event_active_state parent_state = parent_event->state;
> +       enum perf_event_state parent_state = parent_event->state;
>         struct perf_event *child_event;
>         unsigned long flags;
>
> @@ -11090,6 +10914,7 @@ static void __perf_event_exit_context(void *__info)
>         struct perf_event *event;
>
>         raw_spin_lock(&ctx->lock);
> +       ctx_sched_out(ctx, cpuctx, EVENT_TIME);
>         list_for_each_entry(event, &ctx->event_list, event_entry)
>                 __perf_remove_from_context(event, cpuctx, ctx, (void *)DETACH_GROUP);
>         raw_spin_unlock(&ctx->lock);

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [RFC][PATCH] perf: Rewrite enabled/running timekeeping
  2017-08-31 17:18           ` [RFC][PATCH] perf: Rewrite enabled/running timekeeping Peter Zijlstra
  2017-08-31 19:51             ` Stephane Eranian
@ 2017-09-01 10:45             ` Alexey Budankov
  2017-09-01 12:31               ` Peter Zijlstra
  2017-09-01 11:17             ` Alexey Budankov
                               ` (2 subsequent siblings)
  4 siblings, 1 reply; 76+ messages in thread
From: Alexey Budankov @ 2017-09-01 10:45 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Arnaldo Carvalho de Melo, Alexander Shishkin,
	Andi Kleen, Kan Liang, Dmitri Prokhorov, Valery Cherepennikov,
	Mark Rutland, Stephane Eranian, David Carrillo-Cisneros,
	linux-kernel, Vince Weaver, Thomas Gleixner

On 31.08.2017 20:18, Peter Zijlstra wrote:
> On Wed, Aug 23, 2017 at 11:54:15AM +0300, Alexey Budankov wrote:
>> On 22.08.2017 23:47, Peter Zijlstra wrote:
>>> On Thu, Aug 10, 2017 at 06:57:43PM +0300, Alexey Budankov wrote:
>>>> The key thing in the patch is explicit updating of tstamp fields for
>>>> INACTIVE events in update_event_times().
>>>
>>>> @@ -1405,6 +1426,9 @@ static void update_event_times(struct perf_event *event)
>>>>  	    event->group_leader->state < PERF_EVENT_STATE_INACTIVE)
>>>>  		return;
>>>>  
>>>> +	if (event->state == PERF_EVENT_STATE_INACTIVE)
>>>> +		perf_event_tstamp_update(event);
>>>> +
>>>>  	/*
>>>>  	 * in cgroup mode, time_enabled represents
>>>>  	 * the time the event was enabled AND active
>>>
>>> But why!? I thought the whole point was to not need to do this.
>>
>> update_event_times() is not called from the timer interrupt handler, so
>> it is not on the critical path that is optimized in this patch set.
>>
>> But update_event_times() is called in the context of the read() syscall, so
>> this is the place where we may update event times for INACTIVE events
>> instead of in the timer interrupt.
>>
>> Also, update_event_times() is called on thread context switch out, so
>> event times also get updated when the thread migrates to another CPU.
>>
>>>
>>> The thing I outlined earlier would only need to update timestamps when
>>> events change state and at no other point in time.
>>
>> But we may still request times while the event is in the INACTIVE state
>> through the read() syscall, and the event timings need to be up-to-date.
> 
> Sure, read() also updates.
> 
> So the below completely rewrites timekeeping (and probably breaks
> world) but does away with the need to touch events that don't get
> scheduled.
> 
> Esp the cgroup stuff is entirely untested since I simply don't know how
> to operate that. I did run Vince's tests on it, and I think it doesn't
> regress, but I'm near a migraine so I can't really see straight atm.
> 
> Vince, Stephane, could you guys have a peek?
> 
> (There's a few other bits in, I'll break it up into patches and write
> comments and Changelogs later, I think it can be split into some 5
> patches).
> 
> The basic idea is really simple: we have a single timestamp and,
> depending on the state, we update enabled/running. This obviously only
> requires updates when we change state and when we need up-to-date
> timestamps (read).
> 
> No more weird and wonderful mind-bending interaction between 3 different
> timestamps with arcane update rules.

Well, this looks like an "opposite" approach to event timekeeping
compared with what we currently have.

Do you want this rework before or after the current patch set?
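
Just to double-check my reading, here is a minimal standalone sketch (plain
userspace C, made-up names, no locking, no group/leader or cgroup handling)
of how I understand the single-timestamp accounting. It only restates the
__perf_update_times()/perf_event_set_state() idea from your patch for
illustration; it is not the kernel code itself:

#include <stdint.h>
#include <stdio.h>

enum ev_state { EV_OFF = -1, EV_INACTIVE = 0, EV_ACTIVE = 1 };

struct ev {
	enum ev_state	state;
	uint64_t	tstamp;	/* single timestamp: last state change or read */
	uint64_t	total_time_enabled;
	uint64_t	total_time_running;
};

/* Fold the time since ->tstamp into the accumulators, depending on state. */
static void ev_update_times(struct ev *e, uint64_t now)
{
	uint64_t delta = now - e->tstamp;

	if (e->state >= EV_INACTIVE)
		e->total_time_enabled += delta;
	if (e->state >= EV_ACTIVE)
		e->total_time_running += delta;

	e->tstamp = now;
}

/* Time only needs touching on state changes ... */
static void ev_set_state(struct ev *e, enum ev_state state, uint64_t now)
{
	if (e->state == state)
		return;

	ev_update_times(e, now);
	e->state = state;
}

/* ... and on read. */
static void ev_read(struct ev *e, uint64_t now,
		    uint64_t *enabled, uint64_t *running)
{
	ev_update_times(e, now);
	*enabled = e->total_time_enabled;
	*running = e->total_time_running;
}

int main(void)
{
	struct ev e = { .state = EV_OFF };
	uint64_t enabled, running;

	ev_set_state(&e, EV_INACTIVE, 100);	/* enabled at t=100 */
	ev_set_state(&e, EV_ACTIVE,   150);	/* scheduled in at t=150 */
	ev_set_state(&e, EV_INACTIVE, 180);	/* scheduled out at t=180 */
	ev_read(&e, 200, &enabled, &running);

	/* expect enabled=100 (100..200) and running=30 (150..180) */
	printf("enabled=%llu running=%llu\n",
	       (unsigned long long)enabled, (unsigned long long)running);
	return 0;
}

If that matches your intent, then INACTIVE events are never touched from the
multiplexing path at all, only on state changes and read().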

> 
> ---
>  include/linux/perf_event.h |  25 +-
>  kernel/events/core.c       | 551 ++++++++++++++++-----------------------------
>  2 files changed, 192 insertions(+), 384 deletions(-)
> 
> diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
> index 8e22f24ded6a..2a6ae48a1a96 100644
> --- a/include/linux/perf_event.h
> +++ b/include/linux/perf_event.h
> @@ -485,9 +485,9 @@ struct perf_addr_filters_head {
>  };
>  
>  /**
> - * enum perf_event_active_state - the states of a event
> + * enum perf_event_state - the states of a event
>   */
> -enum perf_event_active_state {
> +enum perf_event_state {
>  	PERF_EVENT_STATE_DEAD		= -4,
>  	PERF_EVENT_STATE_EXIT		= -3,
>  	PERF_EVENT_STATE_ERROR		= -2,
> @@ -578,7 +578,7 @@ struct perf_event {
>  	struct pmu			*pmu;
>  	void				*pmu_private;
>  
> -	enum perf_event_active_state	state;
> +	enum perf_event_state		state;
>  	unsigned int			attach_state;
>  	local64_t			count;
>  	atomic64_t			child_count;
> @@ -588,26 +588,10 @@ struct perf_event {
>  	 * has been enabled (i.e. eligible to run, and the task has
>  	 * been scheduled in, if this is a per-task event)
>  	 * and running (scheduled onto the CPU), respectively.
> -	 *
> -	 * They are computed from tstamp_enabled, tstamp_running and
> -	 * tstamp_stopped when the event is in INACTIVE or ACTIVE state.
>  	 */
>  	u64				total_time_enabled;
>  	u64				total_time_running;
> -
> -	/*
> -	 * These are timestamps used for computing total_time_enabled
> -	 * and total_time_running when the event is in INACTIVE or
> -	 * ACTIVE state, measured in nanoseconds from an arbitrary point
> -	 * in time.
> -	 * tstamp_enabled: the notional time when the event was enabled
> -	 * tstamp_running: the notional time when the event was scheduled on
> -	 * tstamp_stopped: in INACTIVE state, the notional time when the
> -	 *	event was scheduled off.
> -	 */
> -	u64				tstamp_enabled;
> -	u64				tstamp_running;
> -	u64				tstamp_stopped;
> +	u64				tstamp;
>  
>  	/*
>  	 * timestamp shadows the actual context timing but it can
> @@ -699,7 +683,6 @@ struct perf_event {
>  
>  #ifdef CONFIG_CGROUP_PERF
>  	struct perf_cgroup		*cgrp; /* cgroup event is attach to */
> -	int				cgrp_defer_enabled;
>  #endif
>  
>  	struct list_head		sb_list;
> diff --git a/kernel/events/core.c b/kernel/events/core.c
> index 294f1927f944..e968b3eab9c7 100644
> --- a/kernel/events/core.c
> +++ b/kernel/events/core.c
> @@ -582,6 +582,70 @@ static inline u64 perf_event_clock(struct perf_event *event)
>  	return event->clock();
>  }
>  
> +/*
> + * XXX comment about timekeeping goes here
> + */
> +
> +static __always_inline enum perf_event_state
> +__perf_effective_state(struct perf_event *event)
> +{
> +	struct perf_event *leader = event->group_leader;
> +
> +	if (leader->state <= PERF_EVENT_STATE_OFF)
> +		return leader->state;
> +
> +	return event->state;
> +}
> +
> +static __always_inline void
> +__perf_update_times(struct perf_event *event, u64 now, u64 *enabled, u64 *running)
> +{
> +	enum perf_event_state state = __perf_effective_state(event);
> +	u64 delta = now - event->tstamp;
> +
> +	*enabled = event->total_time_enabled;
> +	if (state >= PERF_EVENT_STATE_INACTIVE)
> +		*enabled += delta;
> +
> +	*running = event->total_time_running;
> +	if (state >= PERF_EVENT_STATE_ACTIVE)
> +		*running += delta;
> +}
> +
> +static void perf_event_update_time(struct perf_event *event)
> +{
> +	u64 now = perf_event_time(event);
> +
> +	__perf_update_times(event, now, &event->total_time_enabled,
> +					&event->total_time_running);
> +	event->tstamp = now;
> +}
> +
> +static void perf_event_update_sibling_time(struct perf_event *leader)
> +{
> +	struct perf_event *sibling;
> +
> +	list_for_each_entry(sibling, &leader->sibling_list, group_entry)
> +		perf_event_update_time(sibling);
> +}
> +
> +static void
> +perf_event_set_state(struct perf_event *event, enum perf_event_state state)
> +{
> +	if (event->state == state)
> +		return;
> +
> +	perf_event_update_time(event);
> +	/*
> +	 * If a group leader gets enabled/disabled all its siblings
> +	 * are affected too.
> +	 */
> +	if ((event->state < 0) ^ (state < 0))
> +		perf_event_update_sibling_time(event);
> +
> +	WRITE_ONCE(event->state, state);
> +}
> +
>  #ifdef CONFIG_CGROUP_PERF
>  
>  static inline bool
> @@ -841,40 +905,6 @@ perf_cgroup_set_shadow_time(struct perf_event *event, u64 now)
>  	event->shadow_ctx_time = now - t->timestamp;
>  }
>  
> -static inline void
> -perf_cgroup_defer_enabled(struct perf_event *event)
> -{
> -	/*
> -	 * when the current task's perf cgroup does not match
> -	 * the event's, we need to remember to call the
> -	 * perf_mark_enable() function the first time a task with
> -	 * a matching perf cgroup is scheduled in.
> -	 */
> -	if (is_cgroup_event(event) && !perf_cgroup_match(event))
> -		event->cgrp_defer_enabled = 1;
> -}
> -
> -static inline void
> -perf_cgroup_mark_enabled(struct perf_event *event,
> -			 struct perf_event_context *ctx)
> -{
> -	struct perf_event *sub;
> -	u64 tstamp = perf_event_time(event);
> -
> -	if (!event->cgrp_defer_enabled)
> -		return;
> -
> -	event->cgrp_defer_enabled = 0;
> -
> -	event->tstamp_enabled = tstamp - event->total_time_enabled;
> -	list_for_each_entry(sub, &event->sibling_list, group_entry) {
> -		if (sub->state >= PERF_EVENT_STATE_INACTIVE) {
> -			sub->tstamp_enabled = tstamp - sub->total_time_enabled;
> -			sub->cgrp_defer_enabled = 0;
> -		}
> -	}
> -}
> -
>  /*
>   * Update cpuctx->cgrp so that it is set when first cgroup event is added and
>   * cleared when last cgroup event is removed.
> @@ -973,17 +1003,6 @@ static inline u64 perf_cgroup_event_time(struct perf_event *event)
>  }
>  
>  static inline void
> -perf_cgroup_defer_enabled(struct perf_event *event)
> -{
> -}
> -
> -static inline void
> -perf_cgroup_mark_enabled(struct perf_event *event,
> -			 struct perf_event_context *ctx)
> -{
> -}
> -
> -static inline void
>  list_update_cgroup_event(struct perf_event *event,
>  			 struct perf_event_context *ctx, bool add)
>  {
> @@ -1396,60 +1415,6 @@ static u64 perf_event_time(struct perf_event *event)
>  	return ctx ? ctx->time : 0;
>  }
>  
> -/*
> - * Update the total_time_enabled and total_time_running fields for a event.
> - */
> -static void update_event_times(struct perf_event *event)
> -{
> -	struct perf_event_context *ctx = event->ctx;
> -	u64 run_end;
> -
> -	lockdep_assert_held(&ctx->lock);
> -
> -	if (event->state < PERF_EVENT_STATE_INACTIVE ||
> -	    event->group_leader->state < PERF_EVENT_STATE_INACTIVE)
> -		return;
> -
> -	/*
> -	 * in cgroup mode, time_enabled represents
> -	 * the time the event was enabled AND active
> -	 * tasks were in the monitored cgroup. This is
> -	 * independent of the activity of the context as
> -	 * there may be a mix of cgroup and non-cgroup events.
> -	 *
> -	 * That is why we treat cgroup events differently
> -	 * here.
> -	 */
> -	if (is_cgroup_event(event))
> -		run_end = perf_cgroup_event_time(event);
> -	else if (ctx->is_active)
> -		run_end = ctx->time;
> -	else
> -		run_end = event->tstamp_stopped;
> -
> -	event->total_time_enabled = run_end - event->tstamp_enabled;
> -
> -	if (event->state == PERF_EVENT_STATE_INACTIVE)
> -		run_end = event->tstamp_stopped;
> -	else
> -		run_end = perf_event_time(event);
> -
> -	event->total_time_running = run_end - event->tstamp_running;
> -
> -}
> -
> -/*
> - * Update total_time_enabled and total_time_running for all events in a group.
> - */
> -static void update_group_times(struct perf_event *leader)
> -{
> -	struct perf_event *event;
> -
> -	update_event_times(leader);
> -	list_for_each_entry(event, &leader->sibling_list, group_entry)
> -		update_event_times(event);
> -}
> -
>  static enum event_type_t get_event_type(struct perf_event *event)
>  {
>  	struct perf_event_context *ctx = event->ctx;
> @@ -1492,6 +1457,8 @@ list_add_event(struct perf_event *event, struct perf_event_context *ctx)
>  	WARN_ON_ONCE(event->attach_state & PERF_ATTACH_CONTEXT);
>  	event->attach_state |= PERF_ATTACH_CONTEXT;
>  
> +	event->tstamp = perf_event_time(event);
> +
>  	/*
>  	 * If we're a stand alone event or group leader, we go to the context
>  	 * list, group events are kept attached to the group so that
> @@ -1699,8 +1666,6 @@ list_del_event(struct perf_event *event, struct perf_event_context *ctx)
>  	if (event->group_leader == event)
>  		list_del_init(&event->group_entry);
>  
> -	update_group_times(event);
> -
>  	/*
>  	 * If event was in error state, then keep it
>  	 * that way, otherwise bogus counts will be
> @@ -1709,7 +1674,7 @@ list_del_event(struct perf_event *event, struct perf_event_context *ctx)
>  	 * of the event
>  	 */
>  	if (event->state > PERF_EVENT_STATE_OFF)
> -		event->state = PERF_EVENT_STATE_OFF;
> +		perf_event_set_state(event, PERF_EVENT_STATE_OFF);
>  
>  	ctx->generation++;
>  }
> @@ -1808,38 +1773,24 @@ event_sched_out(struct perf_event *event,
>  		  struct perf_cpu_context *cpuctx,
>  		  struct perf_event_context *ctx)
>  {
> -	u64 tstamp = perf_event_time(event);
> -	u64 delta;
> +	enum perf_event_state state = PERF_EVENT_STATE_INACTIVE;
>  
>  	WARN_ON_ONCE(event->ctx != ctx);
>  	lockdep_assert_held(&ctx->lock);
>  
> -	/*
> -	 * An event which could not be activated because of
> -	 * filter mismatch still needs to have its timings
> -	 * maintained, otherwise bogus information is return
> -	 * via read() for time_enabled, time_running:
> -	 */
> -	if (event->state == PERF_EVENT_STATE_INACTIVE &&
> -	    !event_filter_match(event)) {
> -		delta = tstamp - event->tstamp_stopped;
> -		event->tstamp_running += delta;
> -		event->tstamp_stopped = tstamp;
> -	}
> -
>  	if (event->state != PERF_EVENT_STATE_ACTIVE)
>  		return;
>  
>  	perf_pmu_disable(event->pmu);
>  
> -	event->tstamp_stopped = tstamp;
>  	event->pmu->del(event, 0);
>  	event->oncpu = -1;
> -	event->state = PERF_EVENT_STATE_INACTIVE;
> +
>  	if (event->pending_disable) {
>  		event->pending_disable = 0;
> -		event->state = PERF_EVENT_STATE_OFF;
> +		state = PERF_EVENT_STATE_OFF;
>  	}
> +	perf_event_set_state(event, state);
>  
>  	if (!is_software_event(event))
>  		cpuctx->active_oncpu--;
> @@ -1859,7 +1810,9 @@ group_sched_out(struct perf_event *group_event,
>  		struct perf_event_context *ctx)
>  {
>  	struct perf_event *event;
> -	int state = group_event->state;
> +
> +	if (group_event->state != PERF_EVENT_STATE_ACTIVE)
> +		return;
>  
>  	perf_pmu_disable(ctx->pmu);
>  
> @@ -1873,7 +1826,7 @@ group_sched_out(struct perf_event *group_event,
>  
>  	perf_pmu_enable(ctx->pmu);
>  
> -	if (state == PERF_EVENT_STATE_ACTIVE && group_event->attr.exclusive)
> +	if (group_event->attr.exclusive)
>  		cpuctx->exclusive = 0;
>  }
>  
> @@ -1893,6 +1846,11 @@ __perf_remove_from_context(struct perf_event *event,
>  {
>  	unsigned long flags = (unsigned long)info;
>  
> +	if (ctx->is_active & EVENT_TIME) {
> +		update_context_time(ctx);
> +		update_cgrp_time_from_cpuctx(cpuctx);
> +	}
> +
>  	event_sched_out(event, cpuctx, ctx);
>  	if (flags & DETACH_GROUP)
>  		perf_group_detach(event);
> @@ -1955,14 +1913,17 @@ static void __perf_event_disable(struct perf_event *event,
>  	if (event->state < PERF_EVENT_STATE_INACTIVE)
>  		return;
>  
> -	update_context_time(ctx);
> -	update_cgrp_time_from_event(event);
> -	update_group_times(event);
> +	if (ctx->is_active & EVENT_TIME) {
> +		update_context_time(ctx);
> +		update_cgrp_time_from_cpuctx(cpuctx);
> +	}
> +
>  	if (event == event->group_leader)
>  		group_sched_out(event, cpuctx, ctx);
>  	else
>  		event_sched_out(event, cpuctx, ctx);
> -	event->state = PERF_EVENT_STATE_OFF;
> +
> +	perf_event_set_state(event, PERF_EVENT_STATE_OFF);
>  }
>  
>  /*
> @@ -2019,8 +1980,7 @@ void perf_event_disable_inatomic(struct perf_event *event)
>  }
>  
>  static void perf_set_shadow_time(struct perf_event *event,
> -				 struct perf_event_context *ctx,
> -				 u64 tstamp)
> +				 struct perf_event_context *ctx)
>  {
>  	/*
>  	 * use the correct time source for the time snapshot
> @@ -2048,9 +2008,9 @@ static void perf_set_shadow_time(struct perf_event *event,
>  	 * is cleaner and simpler to understand.
>  	 */
>  	if (is_cgroup_event(event))
> -		perf_cgroup_set_shadow_time(event, tstamp);
> +		perf_cgroup_set_shadow_time(event, event->tstamp);
>  	else
> -		event->shadow_ctx_time = tstamp - ctx->timestamp;
> +		event->shadow_ctx_time = event->tstamp - ctx->timestamp;
>  }
>  
>  #define MAX_INTERRUPTS (~0ULL)
> @@ -2063,7 +2023,6 @@ event_sched_in(struct perf_event *event,
>  		 struct perf_cpu_context *cpuctx,
>  		 struct perf_event_context *ctx)
>  {
> -	u64 tstamp = perf_event_time(event);
>  	int ret = 0;
>  
>  	lockdep_assert_held(&ctx->lock);
> @@ -2077,7 +2036,7 @@ event_sched_in(struct perf_event *event,
>  	 * is visible.
>  	 */
>  	smp_wmb();
> -	WRITE_ONCE(event->state, PERF_EVENT_STATE_ACTIVE);
> +	perf_event_set_state(event, PERF_EVENT_STATE_ACTIVE);
>  
>  	/*
>  	 * Unthrottle events, since we scheduled we might have missed several
> @@ -2089,26 +2048,19 @@ event_sched_in(struct perf_event *event,
>  		event->hw.interrupts = 0;
>  	}
>  
> -	/*
> -	 * The new state must be visible before we turn it on in the hardware:
> -	 */
> -	smp_wmb();
> -
>  	perf_pmu_disable(event->pmu);
>  
> -	perf_set_shadow_time(event, ctx, tstamp);
> +	perf_set_shadow_time(event, ctx);
>  
>  	perf_log_itrace_start(event);
>  
>  	if (event->pmu->add(event, PERF_EF_START)) {
> -		event->state = PERF_EVENT_STATE_INACTIVE;
> +		perf_event_set_state(event, PERF_EVENT_STATE_INACTIVE);
>  		event->oncpu = -1;
>  		ret = -EAGAIN;
>  		goto out;
>  	}
>  
> -	event->tstamp_running += tstamp - event->tstamp_stopped;
> -
>  	if (!is_software_event(event))
>  		cpuctx->active_oncpu++;
>  	if (!ctx->nr_active++)
> @@ -2132,8 +2084,6 @@ group_sched_in(struct perf_event *group_event,
>  {
>  	struct perf_event *event, *partial_group = NULL;
>  	struct pmu *pmu = ctx->pmu;
> -	u64 now = ctx->time;
> -	bool simulate = false;
>  
>  	if (group_event->state == PERF_EVENT_STATE_OFF)
>  		return 0;
> @@ -2163,27 +2113,13 @@ group_sched_in(struct perf_event *group_event,
>  	/*
>  	 * Groups can be scheduled in as one unit only, so undo any
>  	 * partial group before returning:
> -	 * The events up to the failed event are scheduled out normally,
> -	 * tstamp_stopped will be updated.
> -	 *
> -	 * The failed events and the remaining siblings need to have
> -	 * their timings updated as if they had gone thru event_sched_in()
> -	 * and event_sched_out(). This is required to get consistent timings
> -	 * across the group. This also takes care of the case where the group
> -	 * could never be scheduled by ensuring tstamp_stopped is set to mark
> -	 * the time the event was actually stopped, such that time delta
> -	 * calculation in update_event_times() is correct.
> +	 * The events up to the failed event are scheduled out normally.
>  	 */
>  	list_for_each_entry(event, &group_event->sibling_list, group_entry) {
>  		if (event == partial_group)
> -			simulate = true;
> +			break;
>  
> -		if (simulate) {
> -			event->tstamp_running += now - event->tstamp_stopped;
> -			event->tstamp_stopped = now;
> -		} else {
> -			event_sched_out(event, cpuctx, ctx);
> -		}
> +		event_sched_out(event, cpuctx, ctx);
>  	}
>  	event_sched_out(group_event, cpuctx, ctx);
>  
> @@ -2225,46 +2161,11 @@ static int group_can_go_on(struct perf_event *event,
>  	return can_add_hw;
>  }
>  
> -/*
> - * Complement to update_event_times(). This computes the tstamp_* values to
> - * continue 'enabled' state from @now, and effectively discards the time
> - * between the prior tstamp_stopped and now (as we were in the OFF state, or
> - * just switched (context) time base).
> - *
> - * This further assumes '@event->state == INACTIVE' (we just came from OFF) and
> - * cannot have been scheduled in yet. And going into INACTIVE state means
> - * '@event->tstamp_stopped = @now'.
> - *
> - * Thus given the rules of update_event_times():
> - *
> - *   total_time_enabled = tstamp_stopped - tstamp_enabled
> - *   total_time_running = tstamp_stopped - tstamp_running
> - *
> - * We can insert 'tstamp_stopped == now' and reverse them to compute new
> - * tstamp_* values.
> - */
> -static void __perf_event_enable_time(struct perf_event *event, u64 now)
> -{
> -	WARN_ON_ONCE(event->state != PERF_EVENT_STATE_INACTIVE);
> -
> -	event->tstamp_stopped = now;
> -	event->tstamp_enabled = now - event->total_time_enabled;
> -	event->tstamp_running = now - event->total_time_running;
> -}
> -
>  static void add_event_to_ctx(struct perf_event *event,
>  			       struct perf_event_context *ctx)
>  {
> -	u64 tstamp = perf_event_time(event);
> -
>  	list_add_event(event, ctx);
>  	perf_group_attach(event);
> -	/*
> -	 * We can be called with event->state == STATE_OFF when we create with
> -	 * .disabled = 1. In that case the IOC_ENABLE will call this function.
> -	 */
> -	if (event->state == PERF_EVENT_STATE_INACTIVE)
> -		__perf_event_enable_time(event, tstamp);
>  }
>  
>  static void ctx_sched_out(struct perf_event_context *ctx,
> @@ -2496,28 +2397,6 @@ perf_install_in_context(struct perf_event_context *ctx,
>  }
>  
>  /*
> - * Put a event into inactive state and update time fields.
> - * Enabling the leader of a group effectively enables all
> - * the group members that aren't explicitly disabled, so we
> - * have to update their ->tstamp_enabled also.
> - * Note: this works for group members as well as group leaders
> - * since the non-leader members' sibling_lists will be empty.
> - */
> -static void __perf_event_mark_enabled(struct perf_event *event)
> -{
> -	struct perf_event *sub;
> -	u64 tstamp = perf_event_time(event);
> -
> -	event->state = PERF_EVENT_STATE_INACTIVE;
> -	__perf_event_enable_time(event, tstamp);
> -	list_for_each_entry(sub, &event->sibling_list, group_entry) {
> -		/* XXX should not be > INACTIVE if event isn't */
> -		if (sub->state >= PERF_EVENT_STATE_INACTIVE)
> -			__perf_event_enable_time(sub, tstamp);
> -	}
> -}
> -
> -/*
>   * Cross CPU call to enable a performance event
>   */
>  static void __perf_event_enable(struct perf_event *event,
> @@ -2535,14 +2414,12 @@ static void __perf_event_enable(struct perf_event *event,
>  	if (ctx->is_active)
>  		ctx_sched_out(ctx, cpuctx, EVENT_TIME);
>  
> -	__perf_event_mark_enabled(event);
> +	perf_event_set_state(event, PERF_EVENT_STATE_INACTIVE);
>  
>  	if (!ctx->is_active)
>  		return;
>  
>  	if (!event_filter_match(event)) {
> -		if (is_cgroup_event(event))
> -			perf_cgroup_defer_enabled(event);
>  		ctx_sched_in(ctx, cpuctx, EVENT_TIME, current);
>  		return;
>  	}
> @@ -2862,18 +2739,10 @@ static void __perf_event_sync_stat(struct perf_event *event,
>  	 * we know the event must be on the current CPU, therefore we
>  	 * don't need to use it.
>  	 */
> -	switch (event->state) {
> -	case PERF_EVENT_STATE_ACTIVE:
> +	if (event->state == PERF_EVENT_STATE_ACTIVE)
>  		event->pmu->read(event);
> -		/* fall-through */
>  
> -	case PERF_EVENT_STATE_INACTIVE:
> -		update_event_times(event);
> -		break;
> -
> -	default:
> -		break;
> -	}
> +	perf_event_update_time(event);
>  
>  	/*
>  	 * In order to keep per-task stats reliable we need to flip the event
> @@ -3110,10 +2979,6 @@ ctx_pinned_sched_in(struct perf_event_context *ctx,
>  		if (!event_filter_match(event))
>  			continue;
>  
> -		/* may need to reset tstamp_enabled */
> -		if (is_cgroup_event(event))
> -			perf_cgroup_mark_enabled(event, ctx);
> -
>  		if (group_can_go_on(event, cpuctx, 1))
>  			group_sched_in(event, cpuctx, ctx);
>  
> @@ -3121,10 +2986,8 @@ ctx_pinned_sched_in(struct perf_event_context *ctx,
>  		 * If this pinned group hasn't been scheduled,
>  		 * put it in error state.
>  		 */
> -		if (event->state == PERF_EVENT_STATE_INACTIVE) {
> -			update_group_times(event);
> -			event->state = PERF_EVENT_STATE_ERROR;
> -		}
> +		if (event->state == PERF_EVENT_STATE_INACTIVE)
> +			perf_event_set_state(event, PERF_EVENT_STATE_ERROR);
>  	}
>  }
>  
> @@ -3146,10 +3009,6 @@ ctx_flexible_sched_in(struct perf_event_context *ctx,
>  		if (!event_filter_match(event))
>  			continue;
>  
> -		/* may need to reset tstamp_enabled */
> -		if (is_cgroup_event(event))
> -			perf_cgroup_mark_enabled(event, ctx);
> -
>  		if (group_can_go_on(event, cpuctx, can_add_hw)) {
>  			if (group_sched_in(event, cpuctx, ctx))
>  				can_add_hw = 0;
> @@ -3541,7 +3400,7 @@ static int event_enable_on_exec(struct perf_event *event,
>  	if (event->state >= PERF_EVENT_STATE_INACTIVE)
>  		return 0;
>  
> -	__perf_event_mark_enabled(event);
> +	perf_event_set_state(event, PERF_EVENT_STATE_INACTIVE);
>  
>  	return 1;
>  }
> @@ -3590,12 +3449,6 @@ static void perf_event_enable_on_exec(int ctxn)
>  		put_ctx(clone_ctx);
>  }
>  
> -struct perf_read_data {
> -	struct perf_event *event;
> -	bool group;
> -	int ret;
> -};
> -
>  static int __perf_event_read_cpu(struct perf_event *event, int event_cpu)
>  {
>  	u16 local_pkg, event_pkg;
> @@ -3613,64 +3466,6 @@ static int __perf_event_read_cpu(struct perf_event *event, int event_cpu)
>  	return event_cpu;
>  }
>  
> -/*
> - * Cross CPU call to read the hardware event
> - */
> -static void __perf_event_read(void *info)
> -{
> -	struct perf_read_data *data = info;
> -	struct perf_event *sub, *event = data->event;
> -	struct perf_event_context *ctx = event->ctx;
> -	struct perf_cpu_context *cpuctx = __get_cpu_context(ctx);
> -	struct pmu *pmu = event->pmu;
> -
> -	/*
> -	 * If this is a task context, we need to check whether it is
> -	 * the current task context of this cpu.  If not it has been
> -	 * scheduled out before the smp call arrived.  In that case
> -	 * event->count would have been updated to a recent sample
> -	 * when the event was scheduled out.
> -	 */
> -	if (ctx->task && cpuctx->task_ctx != ctx)
> -		return;
> -
> -	raw_spin_lock(&ctx->lock);
> -	if (ctx->is_active) {
> -		update_context_time(ctx);
> -		update_cgrp_time_from_event(event);
> -	}
> -
> -	update_event_times(event);
> -	if (event->state != PERF_EVENT_STATE_ACTIVE)
> -		goto unlock;
> -
> -	if (!data->group) {
> -		pmu->read(event);
> -		data->ret = 0;
> -		goto unlock;
> -	}
> -
> -	pmu->start_txn(pmu, PERF_PMU_TXN_READ);
> -
> -	pmu->read(event);
> -
> -	list_for_each_entry(sub, &event->sibling_list, group_entry) {
> -		update_event_times(sub);
> -		if (sub->state == PERF_EVENT_STATE_ACTIVE) {
> -			/*
> -			 * Use sibling's PMU rather than @event's since
> -			 * sibling could be on different (eg: software) PMU.
> -			 */
> -			sub->pmu->read(sub);
> -		}
> -	}
> -
> -	data->ret = pmu->commit_txn(pmu);
> -
> -unlock:
> -	raw_spin_unlock(&ctx->lock);
> -}
> -
>  static inline u64 perf_event_count(struct perf_event *event)
>  {
>  	return local64_read(&event->count) + atomic64_read(&event->child_count);
> @@ -3733,63 +3528,81 @@ int perf_event_read_local(struct perf_event *event, u64 *value)
>  	return ret;
>  }
>  
> -static int perf_event_read(struct perf_event *event, bool group)
> +struct perf_read_data {
> +	struct perf_event *event;
> +	bool group;
> +	int ret;
> +};
> +
> +static void __perf_event_read(struct perf_event *event,
> +			      struct perf_cpu_context *cpuctx,
> +			      struct perf_event_context *ctx,
> +			      void *data)
>  {
> -	int event_cpu, ret = 0;
> +	struct perf_read_data *prd = data;
> +	struct pmu *pmu = event->pmu;
> +	struct perf_event *sibling;
>  
> -	/*
> -	 * If event is enabled and currently active on a CPU, update the
> -	 * value in the event structure:
> -	 */
> -	if (event->state == PERF_EVENT_STATE_ACTIVE) {
> -		struct perf_read_data data = {
> -			.event = event,
> -			.group = group,
> -			.ret = 0,
> -		};
> +	if (ctx->is_active & EVENT_TIME) {
> +		update_context_time(ctx);
> +		update_cgrp_time_from_cpuctx(cpuctx);
> +	}
>  
> -		event_cpu = READ_ONCE(event->oncpu);
> -		if ((unsigned)event_cpu >= nr_cpu_ids)
> -			return 0;
> +	perf_event_update_time(event);
> +	if (prd->group)
> +		perf_event_update_sibling_time(event);
>  
> -		preempt_disable();
> -		event_cpu = __perf_event_read_cpu(event, event_cpu);
> +	if (event->state != PERF_EVENT_STATE_ACTIVE)
> +		return;
>  
> +	if (!prd->group) {
> +		pmu->read(event);
> +		prd->ret = 0;
> +		return;
> +	}
> +
> +	pmu->start_txn(pmu, PERF_PMU_TXN_READ);
> +
> +	pmu->read(event);
> +	list_for_each_entry(sibling, &event->sibling_list, group_entry) {
> +		if (sibling->state == PERF_EVENT_STATE_ACTIVE) {
> +			/*
> +			 * Use sibling's PMU rather than @event's since
> +			 * sibling could be on different (eg: software) PMU.
> +			 */
> +			sibling->pmu->read(sibling);
> +		}
> +	}
> +
> +	prd->ret = pmu->commit_txn(pmu);
> +}
> +
> +static int perf_event_read(struct perf_event *event, bool group)
> +{
> +	struct perf_read_data prd = {
> +		.event = event,
> +		.group = group,
> +		.ret = 0,
> +	};
> +
> +	if (event->ctx->task) {
> +		event_function_call(event, __perf_event_read, &prd);
> +	} else {
>  		/*
> -		 * Purposely ignore the smp_call_function_single() return
> -		 * value.
> -		 *
> -		 * If event_cpu isn't a valid CPU it means the event got
> -		 * scheduled out and that will have updated the event count.
> -		 *
> -		 * Therefore, either way, we'll have an up-to-date event count
> -		 * after this.
> -		 */
> -		(void)smp_call_function_single(event_cpu, __perf_event_read, &data, 1);
> -		preempt_enable();
> -		ret = data.ret;
> -	} else if (event->state == PERF_EVENT_STATE_INACTIVE) {
> -		struct perf_event_context *ctx = event->ctx;
> -		unsigned long flags;
> -
> -		raw_spin_lock_irqsave(&ctx->lock, flags);
> -		/*
> -		 * may read while context is not active
> -		 * (e.g., thread is blocked), in that case
> -		 * we cannot update context time
> +		 * For uncore events (which are per definition per-cpu)
> +		 * allow a different read CPU from event->cpu.
>  		 */
> -		if (ctx->is_active) {
> -			update_context_time(ctx);
> -			update_cgrp_time_from_event(event);
> -		}
> -		if (group)
> -			update_group_times(event);
> -		else
> -			update_event_times(event);
> -		raw_spin_unlock_irqrestore(&ctx->lock, flags);
> +		struct event_function_struct efs = {
> +			.event = event,
> +			.func = __perf_event_read,
> +			.data = &prd,
> +		};
> +		int cpu = __perf_event_read_cpu(event, event->cpu);
> +
> +		cpu_function_call(cpu, event_function, &efs);
>  	}
>  
> -	return ret;
> +	return prd.ret;
>  }
>  
>  /*
> @@ -4388,7 +4201,7 @@ static int perf_release(struct inode *inode, struct file *file)
>  	return 0;
>  }
>  
> -u64 perf_event_read_value(struct perf_event *event, u64 *enabled, u64 *running)
> +static u64 __perf_event_read_value(struct perf_event *event, u64 *enabled, u64 *running)
>  {
>  	struct perf_event *child;
>  	u64 total = 0;
> @@ -4416,6 +4229,18 @@ u64 perf_event_read_value(struct perf_event *event, u64 *enabled, u64 *running)
>  
>  	return total;
>  }
> +
> +u64 perf_event_read_value(struct perf_event *event, u64 *enabled, u64 *running)
> +{
> +	struct perf_event_context *ctx;
> +	u64 count;
> +
> +	ctx = perf_event_ctx_lock(event);
> +	count = __perf_event_read_value(event, enabled, running);
> +	perf_event_ctx_unlock(event, ctx);
> +
> +	return count;
> +}
>  EXPORT_SYMBOL_GPL(perf_event_read_value);
>  
>  static int __perf_read_group_add(struct perf_event *leader,
> @@ -4431,6 +4256,8 @@ static int __perf_read_group_add(struct perf_event *leader,
>  	if (ret)
>  		return ret;
>  
> +	raw_spin_lock_irqsave(&ctx->lock, flags);
> +
>  	/*
>  	 * Since we co-schedule groups, {enabled,running} times of siblings
>  	 * will be identical to those of the leader, so we only publish one
> @@ -4453,8 +4280,6 @@ static int __perf_read_group_add(struct perf_event *leader,
>  	if (read_format & PERF_FORMAT_ID)
>  		values[n++] = primary_event_id(leader);
>  
> -	raw_spin_lock_irqsave(&ctx->lock, flags);
> -
>  	list_for_each_entry(sub, &leader->sibling_list, group_entry) {
>  		values[n++] += perf_event_count(sub);
>  		if (read_format & PERF_FORMAT_ID)
> @@ -4518,7 +4343,7 @@ static int perf_read_one(struct perf_event *event,
>  	u64 values[4];
>  	int n = 0;
>  
> -	values[n++] = perf_event_read_value(event, &enabled, &running);
> +	values[n++] = __perf_event_read_value(event, &enabled, &running);
>  	if (read_format & PERF_FORMAT_TOTAL_TIME_ENABLED)
>  		values[n++] = enabled;
>  	if (read_format & PERF_FORMAT_TOTAL_TIME_RUNNING)
> @@ -4897,8 +4722,7 @@ static void calc_timer_values(struct perf_event *event,
>  
>  	*now = perf_clock();
>  	ctx_time = event->shadow_ctx_time + *now;
> -	*enabled = ctx_time - event->tstamp_enabled;
> -	*running = ctx_time - event->tstamp_running;
> +	__perf_update_times(event, ctx_time, enabled, running);
>  }
>  
>  static void perf_event_init_userpage(struct perf_event *event)
> @@ -10516,7 +10340,7 @@ perf_event_exit_event(struct perf_event *child_event,
>  	if (parent_event)
>  		perf_group_detach(child_event);
>  	list_del_event(child_event, child_ctx);
> -	child_event->state = PERF_EVENT_STATE_EXIT; /* is_event_hup() */
> +	perf_event_set_state(child_event, PERF_EVENT_STATE_EXIT); /* is_event_hup() */
>  	raw_spin_unlock_irq(&child_ctx->lock);
>  
>  	/*
> @@ -10754,7 +10578,7 @@ inherit_event(struct perf_event *parent_event,
>  	      struct perf_event *group_leader,
>  	      struct perf_event_context *child_ctx)
>  {
> -	enum perf_event_active_state parent_state = parent_event->state;
> +	enum perf_event_state parent_state = parent_event->state;
>  	struct perf_event *child_event;
>  	unsigned long flags;
>  
> @@ -11090,6 +10914,7 @@ static void __perf_event_exit_context(void *__info)
>  	struct perf_event *event;
>  
>  	raw_spin_lock(&ctx->lock);
> +	ctx_sched_out(ctx, cpuctx, EVENT_TIME);
>  	list_for_each_entry(event, &ctx->event_list, event_entry)
>  		__perf_remove_from_context(event, cpuctx, ctx, (void *)DETACH_GROUP);
>  	raw_spin_unlock(&ctx->lock);
> 

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [RFC][PATCH] perf: Rewrite enabled/running timekeeping
  2017-08-31 17:18           ` [RFC][PATCH] perf: Rewrite enabled/running timekeeping Peter Zijlstra
  2017-08-31 19:51             ` Stephane Eranian
  2017-09-01 10:45             ` Alexey Budankov
@ 2017-09-01 11:17             ` Alexey Budankov
  2017-09-01 12:42               ` Peter Zijlstra
  2017-09-01 21:03             ` Vince Weaver
  2017-09-04 10:46             ` Alexey Budankov
  4 siblings, 1 reply; 76+ messages in thread
From: Alexey Budankov @ 2017-09-01 11:17 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Arnaldo Carvalho de Melo, Alexander Shishkin,
	Andi Kleen, Kan Liang, Dmitri Prokhorov, Valery Cherepennikov,
	Mark Rutland, Stephane Eranian, David Carrillo-Cisneros,
	linux-kernel, Vince Weaver, Thomas Gleixner

On 31.08.2017 20:18, Peter Zijlstra wrote:
> On Wed, Aug 23, 2017 at 11:54:15AM +0300, Alexey Budankov wrote:
>> On 22.08.2017 23:47, Peter Zijlstra wrote:
>>> On Thu, Aug 10, 2017 at 06:57:43PM +0300, Alexey Budankov wrote:
>>>> The key thing in the patch is explicit updating of tstamp fields for
>>>> INACTIVE events in update_event_times().
>>>
>>>> @@ -1405,6 +1426,9 @@ static void update_event_times(struct perf_event *event)
>>>>  	    event->group_leader->state < PERF_EVENT_STATE_INACTIVE)
>>>>  		return;
>>>>  
>>>> +	if (event->state == PERF_EVENT_STATE_INACTIVE)
>>>> +		perf_event_tstamp_update(event);
>>>> +
>>>>  	/*
>>>>  	 * in cgroup mode, time_enabled represents
>>>>  	 * the time the event was enabled AND active
>>>
>>> But why!? I thought the whole point was to not need to do this.
>>
>> update_event_times() is not called from the timer interrupt handler, so
>> it is not on the critical path that is optimized in this patch set.
>>
>> But update_event_times() is called in the context of the read() syscall, so
>> this is the place where we may update event times for INACTIVE events
>> instead of in the timer interrupt.
>>
>> Also, update_event_times() is called on thread context switch out, so
>> event times also get updated when the thread migrates to another CPU.
>>
>>>
>>> The thing I outlined earlier would only need to update timestamps when
>>> events change state and at no other point in time.
>>
>> But we may still request times while the event is in the INACTIVE state
>> through the read() syscall, and the event timings need to be up-to-date.
> 
> Sure, read() also updates.
> 
> So the below completely rewrites timekeeping (and probably breaks
> world) but does away with the need to touch events that don't get
> scheduled.
> 
> Esp the cgroup stuff is entirely untested since I simply don't know how
> to operate that. I did run Vince's tests on it, and I think it doesn't
> regress, but I'm near a migraine so I can't really see straight atm.
> 
> Vince, Stephane, could you guys have a peek?
> 
> (There's a few other bits in, I'll break it up into patches and write
> comments and Changelogs later, I think it can be split into some 5
> patches).
> 
> The basic idea is really simple: we have a single timestamp and,
> depending on the state, we update enabled/running. This obviously only
> requires updates when we change state and when we need up-to-date
> timestamps (read).
> 
> No more weird and wonderful mind-bending interaction between 3 different
> timestamps with arcane update rules.
> 
> ---
>  include/linux/perf_event.h |  25 +-
>  kernel/events/core.c       | 551 ++++++++++++++++-----------------------------
>  2 files changed, 192 insertions(+), 384 deletions(-)
> 

I tried to apply it on top of this:

perf/core 1b2f76d77a277bb70d38ad0991ed7f16bbc115a9 [origin/perf/core] Merge tag 'perf-core-for-mingo-4.14-20170829' of git://git.kernel.org/pub/scm/linux/kernel/git/acme/linux into perf/core

but it failed:

Checking patch include/linux/perf_event.h...
Hunk #1 succeeded at 498 (offset 13 lines).
Hunk #2 succeeded at 591 (offset 13 lines).
Hunk #3 succeeded at 601 (offset 13 lines).
Hunk #4 succeeded at 696 (offset 13 lines).
Checking patch kernel/events/core.c...
error: while searching for:
	return event_cpu;
}

/*
 * Cross CPU call to read the hardware event
 */
static void __perf_event_read(void *info)
{
	struct perf_read_data *data = info;
	struct perf_event *sub, *event = data->event;
	struct perf_event_context *ctx = event->ctx;
	struct perf_cpu_context *cpuctx = __get_cpu_context(ctx);
	struct pmu *pmu = event->pmu;

	/*
	 * If this is a task context, we need to check whether it is
	 * the current task context of this cpu.  If not it has been
	 * scheduled out before the smp call arrived.  In that case
	 * event->count would have been updated to a recent sample
	 * when the event was scheduled out.
	 */
	if (ctx->task && cpuctx->task_ctx != ctx)
		return;

	raw_spin_lock(&ctx->lock);
	if (ctx->is_active) {
		update_context_time(ctx);
		update_cgrp_time_from_event(event);
	}

	update_event_times(event);
	if (event->state != PERF_EVENT_STATE_ACTIVE)
		goto unlock;

	if (!data->group) {
		pmu->read(event);
		data->ret = 0;
		goto unlock;
	}

	pmu->start_txn(pmu, PERF_PMU_TXN_READ);

	pmu->read(event);

	list_for_each_entry(sub, &event->sibling_list, group_entry) {
		update_event_times(sub);
		if (sub->state == PERF_EVENT_STATE_ACTIVE) {
			/*
			 * Use sibling's PMU rather than @event's since
			 * sibling could be on different (eg: software) PMU.
			 */
			sub->pmu->read(sub);
		}
	}

	data->ret = pmu->commit_txn(pmu);

unlock:
	raw_spin_unlock(&ctx->lock);
}

static inline u64 perf_event_count(struct perf_event *event)
{
	return local64_read(&event->count) + atomic64_read(&event->child_count);

error: patch failed: kernel/events/core.c:3613
error: kernel/events/core.c: patch does not apply

> diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
> index 8e22f24ded6a..2a6ae48a1a96 100644
> --- a/include/linux/perf_event.h
> +++ b/include/linux/perf_event.h
> @@ -485,9 +485,9 @@ struct perf_addr_filters_head {
>  };
>  
>  /**
> - * enum perf_event_active_state - the states of a event
> + * enum perf_event_state - the states of a event
>   */
> -enum perf_event_active_state {
> +enum perf_event_state {
>  	PERF_EVENT_STATE_DEAD		= -4,
>  	PERF_EVENT_STATE_EXIT		= -3,
>  	PERF_EVENT_STATE_ERROR		= -2,
> @@ -578,7 +578,7 @@ struct perf_event {
>  	struct pmu			*pmu;
>  	void				*pmu_private;
>  
> -	enum perf_event_active_state	state;
> +	enum perf_event_state		state;
>  	unsigned int			attach_state;
>  	local64_t			count;
>  	atomic64_t			child_count;
> @@ -588,26 +588,10 @@ struct perf_event {
>  	 * has been enabled (i.e. eligible to run, and the task has
>  	 * been scheduled in, if this is a per-task event)
>  	 * and running (scheduled onto the CPU), respectively.
> -	 *
> -	 * They are computed from tstamp_enabled, tstamp_running and
> -	 * tstamp_stopped when the event is in INACTIVE or ACTIVE state.
>  	 */
>  	u64				total_time_enabled;
>  	u64				total_time_running;
> -
> -	/*
> -	 * These are timestamps used for computing total_time_enabled
> -	 * and total_time_running when the event is in INACTIVE or
> -	 * ACTIVE state, measured in nanoseconds from an arbitrary point
> -	 * in time.
> -	 * tstamp_enabled: the notional time when the event was enabled
> -	 * tstamp_running: the notional time when the event was scheduled on
> -	 * tstamp_stopped: in INACTIVE state, the notional time when the
> -	 *	event was scheduled off.
> -	 */
> -	u64				tstamp_enabled;
> -	u64				tstamp_running;
> -	u64				tstamp_stopped;
> +	u64				tstamp;
>  
>  	/*
>  	 * timestamp shadows the actual context timing but it can
> @@ -699,7 +683,6 @@ struct perf_event {
>  
>  #ifdef CONFIG_CGROUP_PERF
>  	struct perf_cgroup		*cgrp; /* cgroup event is attach to */
> -	int				cgrp_defer_enabled;
>  #endif
>  
>  	struct list_head		sb_list;
> diff --git a/kernel/events/core.c b/kernel/events/core.c
> index 294f1927f944..e968b3eab9c7 100644
> --- a/kernel/events/core.c
> +++ b/kernel/events/core.c
> @@ -582,6 +582,70 @@ static inline u64 perf_event_clock(struct perf_event *event)
>  	return event->clock();
>  }
>  
> +/*
> + * XXX comment about timekeeping goes here
> + */
> +
> +static __always_inline enum perf_event_state
> +__perf_effective_state(struct perf_event *event)
> +{
> +	struct perf_event *leader = event->group_leader;
> +
> +	if (leader->state <= PERF_EVENT_STATE_OFF)
> +		return leader->state;
> +
> +	return event->state;
> +}
> +
> +static __always_inline void
> +__perf_update_times(struct perf_event *event, u64 now, u64 *enabled, u64 *running)
> +{
> +	enum perf_event_state state = __perf_effective_state(event);
> +	u64 delta = now - event->tstamp;
> +
> +	*enabled = event->total_time_enabled;
> +	if (state >= PERF_EVENT_STATE_INACTIVE)
> +		*enabled += delta;
> +
> +	*running = event->total_time_running;
> +	if (state >= PERF_EVENT_STATE_ACTIVE)
> +		*running += delta;
> +}
> +
> +static void perf_event_update_time(struct perf_event *event)
> +{
> +	u64 now = perf_event_time(event);
> +
> +	__perf_update_times(event, now, &event->total_time_enabled,
> +					&event->total_time_running);
> +	event->tstamp = now;
> +}
> +
> +static void perf_event_update_sibling_time(struct perf_event *leader)
> +{
> +	struct perf_event *sibling;
> +
> +	list_for_each_entry(sibling, &leader->sibling_list, group_entry)
> +		perf_event_update_time(sibling);
> +}
> +
> +static void
> +perf_event_set_state(struct perf_event *event, enum perf_event_state state)
> +{
> +	if (event->state == state)
> +		return;
> +
> +	perf_event_update_time(event);
> +	/*
> +	 * If a group leader gets enabled/disabled all its siblings
> +	 * are affected too.
> +	 */
> +	if ((event->state < 0) ^ (state < 0))
> +		perf_event_update_sibling_time(event);
> +
> +	WRITE_ONCE(event->state, state);
> +}
> +
>  #ifdef CONFIG_CGROUP_PERF
>  
>  static inline bool
> @@ -841,40 +905,6 @@ perf_cgroup_set_shadow_time(struct perf_event *event, u64 now)
>  	event->shadow_ctx_time = now - t->timestamp;
>  }
>  
> -static inline void
> -perf_cgroup_defer_enabled(struct perf_event *event)
> -{
> -	/*
> -	 * when the current task's perf cgroup does not match
> -	 * the event's, we need to remember to call the
> -	 * perf_mark_enable() function the first time a task with
> -	 * a matching perf cgroup is scheduled in.
> -	 */
> -	if (is_cgroup_event(event) && !perf_cgroup_match(event))
> -		event->cgrp_defer_enabled = 1;
> -}
> -
> -static inline void
> -perf_cgroup_mark_enabled(struct perf_event *event,
> -			 struct perf_event_context *ctx)
> -{
> -	struct perf_event *sub;
> -	u64 tstamp = perf_event_time(event);
> -
> -	if (!event->cgrp_defer_enabled)
> -		return;
> -
> -	event->cgrp_defer_enabled = 0;
> -
> -	event->tstamp_enabled = tstamp - event->total_time_enabled;
> -	list_for_each_entry(sub, &event->sibling_list, group_entry) {
> -		if (sub->state >= PERF_EVENT_STATE_INACTIVE) {
> -			sub->tstamp_enabled = tstamp - sub->total_time_enabled;
> -			sub->cgrp_defer_enabled = 0;
> -		}
> -	}
> -}
> -
>  /*
>   * Update cpuctx->cgrp so that it is set when first cgroup event is added and
>   * cleared when last cgroup event is removed.
> @@ -973,17 +1003,6 @@ static inline u64 perf_cgroup_event_time(struct perf_event *event)
>  }
>  
>  static inline void
> -perf_cgroup_defer_enabled(struct perf_event *event)
> -{
> -}
> -
> -static inline void
> -perf_cgroup_mark_enabled(struct perf_event *event,
> -			 struct perf_event_context *ctx)
> -{
> -}
> -
> -static inline void
>  list_update_cgroup_event(struct perf_event *event,
>  			 struct perf_event_context *ctx, bool add)
>  {
> @@ -1396,60 +1415,6 @@ static u64 perf_event_time(struct perf_event *event)
>  	return ctx ? ctx->time : 0;
>  }
>  
> -/*
> - * Update the total_time_enabled and total_time_running fields for a event.
> - */
> -static void update_event_times(struct perf_event *event)
> -{
> -	struct perf_event_context *ctx = event->ctx;
> -	u64 run_end;
> -
> -	lockdep_assert_held(&ctx->lock);
> -
> -	if (event->state < PERF_EVENT_STATE_INACTIVE ||
> -	    event->group_leader->state < PERF_EVENT_STATE_INACTIVE)
> -		return;
> -
> -	/*
> -	 * in cgroup mode, time_enabled represents
> -	 * the time the event was enabled AND active
> -	 * tasks were in the monitored cgroup. This is
> -	 * independent of the activity of the context as
> -	 * there may be a mix of cgroup and non-cgroup events.
> -	 *
> -	 * That is why we treat cgroup events differently
> -	 * here.
> -	 */
> -	if (is_cgroup_event(event))
> -		run_end = perf_cgroup_event_time(event);
> -	else if (ctx->is_active)
> -		run_end = ctx->time;
> -	else
> -		run_end = event->tstamp_stopped;
> -
> -	event->total_time_enabled = run_end - event->tstamp_enabled;
> -
> -	if (event->state == PERF_EVENT_STATE_INACTIVE)
> -		run_end = event->tstamp_stopped;
> -	else
> -		run_end = perf_event_time(event);
> -
> -	event->total_time_running = run_end - event->tstamp_running;
> -
> -}
> -
> -/*
> - * Update total_time_enabled and total_time_running for all events in a group.
> - */
> -static void update_group_times(struct perf_event *leader)
> -{
> -	struct perf_event *event;
> -
> -	update_event_times(leader);
> -	list_for_each_entry(event, &leader->sibling_list, group_entry)
> -		update_event_times(event);
> -}
> -
>  static enum event_type_t get_event_type(struct perf_event *event)
>  {
>  	struct perf_event_context *ctx = event->ctx;
> @@ -1492,6 +1457,8 @@ list_add_event(struct perf_event *event, struct perf_event_context *ctx)
>  	WARN_ON_ONCE(event->attach_state & PERF_ATTACH_CONTEXT);
>  	event->attach_state |= PERF_ATTACH_CONTEXT;
>  
> +	event->tstamp = perf_event_time(event);
> +
>  	/*
>  	 * If we're a stand alone event or group leader, we go to the context
>  	 * list, group events are kept attached to the group so that
> @@ -1699,8 +1666,6 @@ list_del_event(struct perf_event *event, struct perf_event_context *ctx)
>  	if (event->group_leader == event)
>  		list_del_init(&event->group_entry);
>  
> -	update_group_times(event);
> -
>  	/*
>  	 * If event was in error state, then keep it
>  	 * that way, otherwise bogus counts will be
> @@ -1709,7 +1674,7 @@ list_del_event(struct perf_event *event, struct perf_event_context *ctx)
>  	 * of the event
>  	 */
>  	if (event->state > PERF_EVENT_STATE_OFF)
> -		event->state = PERF_EVENT_STATE_OFF;
> +		perf_event_set_state(event, PERF_EVENT_STATE_OFF);
>  
>  	ctx->generation++;
>  }
> @@ -1808,38 +1773,24 @@ event_sched_out(struct perf_event *event,
>  		  struct perf_cpu_context *cpuctx,
>  		  struct perf_event_context *ctx)
>  {
> -	u64 tstamp = perf_event_time(event);
> -	u64 delta;
> +	enum perf_event_state state = PERF_EVENT_STATE_INACTIVE;
>  
>  	WARN_ON_ONCE(event->ctx != ctx);
>  	lockdep_assert_held(&ctx->lock);
>  
> -	/*
> -	 * An event which could not be activated because of
> -	 * filter mismatch still needs to have its timings
> -	 * maintained, otherwise bogus information is return
> -	 * via read() for time_enabled, time_running:
> -	 */
> -	if (event->state == PERF_EVENT_STATE_INACTIVE &&
> -	    !event_filter_match(event)) {
> -		delta = tstamp - event->tstamp_stopped;
> -		event->tstamp_running += delta;
> -		event->tstamp_stopped = tstamp;
> -	}
> -
>  	if (event->state != PERF_EVENT_STATE_ACTIVE)
>  		return;
>  
>  	perf_pmu_disable(event->pmu);
>  
> -	event->tstamp_stopped = tstamp;
>  	event->pmu->del(event, 0);
>  	event->oncpu = -1;
> -	event->state = PERF_EVENT_STATE_INACTIVE;
> +
>  	if (event->pending_disable) {
>  		event->pending_disable = 0;
> -		event->state = PERF_EVENT_STATE_OFF;
> +		state = PERF_EVENT_STATE_OFF;
>  	}
> +	perf_event_set_state(event, state);
>  
>  	if (!is_software_event(event))
>  		cpuctx->active_oncpu--;
> @@ -1859,7 +1810,9 @@ group_sched_out(struct perf_event *group_event,
>  		struct perf_event_context *ctx)
>  {
>  	struct perf_event *event;
> -	int state = group_event->state;
> +
> +	if (group_event->state != PERF_EVENT_STATE_ACTIVE)
> +		return;
>  
>  	perf_pmu_disable(ctx->pmu);
>  
> @@ -1873,7 +1826,7 @@ group_sched_out(struct perf_event *group_event,
>  
>  	perf_pmu_enable(ctx->pmu);
>  
> -	if (state == PERF_EVENT_STATE_ACTIVE && group_event->attr.exclusive)
> +	if (group_event->attr.exclusive)
>  		cpuctx->exclusive = 0;
>  }
>  
> @@ -1893,6 +1846,11 @@ __perf_remove_from_context(struct perf_event *event,
>  {
>  	unsigned long flags = (unsigned long)info;
>  
> +	if (ctx->is_active & EVENT_TIME) {
> +		update_context_time(ctx);
> +		update_cgrp_time_from_cpuctx(cpuctx);
> +	}
> +
>  	event_sched_out(event, cpuctx, ctx);
>  	if (flags & DETACH_GROUP)
>  		perf_group_detach(event);
> @@ -1955,14 +1913,17 @@ static void __perf_event_disable(struct perf_event *event,
>  	if (event->state < PERF_EVENT_STATE_INACTIVE)
>  		return;
>  
> -	update_context_time(ctx);
> -	update_cgrp_time_from_event(event);
> -	update_group_times(event);
> +	if (ctx->is_active & EVENT_TIME) {
> +		update_context_time(ctx);
> +		update_cgrp_time_from_cpuctx(cpuctx);
> +	}
> +
>  	if (event == event->group_leader)
>  		group_sched_out(event, cpuctx, ctx);
>  	else
>  		event_sched_out(event, cpuctx, ctx);
> -	event->state = PERF_EVENT_STATE_OFF;
> +
> +	perf_event_set_state(event, PERF_EVENT_STATE_OFF);
>  }
>  
>  /*
> @@ -2019,8 +1980,7 @@ void perf_event_disable_inatomic(struct perf_event *event)
>  }
>  
>  static void perf_set_shadow_time(struct perf_event *event,
> -				 struct perf_event_context *ctx,
> -				 u64 tstamp)
> +				 struct perf_event_context *ctx)
>  {
>  	/*
>  	 * use the correct time source for the time snapshot
> @@ -2048,9 +2008,9 @@ static void perf_set_shadow_time(struct perf_event *event,
>  	 * is cleaner and simpler to understand.
>  	 */
>  	if (is_cgroup_event(event))
> -		perf_cgroup_set_shadow_time(event, tstamp);
> +		perf_cgroup_set_shadow_time(event, event->tstamp);
>  	else
> -		event->shadow_ctx_time = tstamp - ctx->timestamp;
> +		event->shadow_ctx_time = event->tstamp - ctx->timestamp;
>  }
>  
>  #define MAX_INTERRUPTS (~0ULL)
> @@ -2063,7 +2023,6 @@ event_sched_in(struct perf_event *event,
>  		 struct perf_cpu_context *cpuctx,
>  		 struct perf_event_context *ctx)
>  {
> -	u64 tstamp = perf_event_time(event);
>  	int ret = 0;
>  
>  	lockdep_assert_held(&ctx->lock);
> @@ -2077,7 +2036,7 @@ event_sched_in(struct perf_event *event,
>  	 * is visible.
>  	 */
>  	smp_wmb();
> -	WRITE_ONCE(event->state, PERF_EVENT_STATE_ACTIVE);
> +	perf_event_set_state(event, PERF_EVENT_STATE_ACTIVE);
>  
>  	/*
>  	 * Unthrottle events, since we scheduled we might have missed several
> @@ -2089,26 +2048,19 @@ event_sched_in(struct perf_event *event,
>  		event->hw.interrupts = 0;
>  	}
>  
> -	/*
> -	 * The new state must be visible before we turn it on in the hardware:
> -	 */
> -	smp_wmb();
> -
>  	perf_pmu_disable(event->pmu);
>  
> -	perf_set_shadow_time(event, ctx, tstamp);
> +	perf_set_shadow_time(event, ctx);
>  
>  	perf_log_itrace_start(event);
>  
>  	if (event->pmu->add(event, PERF_EF_START)) {
> -		event->state = PERF_EVENT_STATE_INACTIVE;
> +		perf_event_set_state(event, PERF_EVENT_STATE_INACTIVE);
>  		event->oncpu = -1;
>  		ret = -EAGAIN;
>  		goto out;
>  	}
>  
> -	event->tstamp_running += tstamp - event->tstamp_stopped;
> -
>  	if (!is_software_event(event))
>  		cpuctx->active_oncpu++;
>  	if (!ctx->nr_active++)
> @@ -2132,8 +2084,6 @@ group_sched_in(struct perf_event *group_event,
>  {
>  	struct perf_event *event, *partial_group = NULL;
>  	struct pmu *pmu = ctx->pmu;
> -	u64 now = ctx->time;
> -	bool simulate = false;
>  
>  	if (group_event->state == PERF_EVENT_STATE_OFF)
>  		return 0;
> @@ -2163,27 +2113,13 @@ group_sched_in(struct perf_event *group_event,
>  	/*
>  	 * Groups can be scheduled in as one unit only, so undo any
>  	 * partial group before returning:
> -	 * The events up to the failed event are scheduled out normally,
> -	 * tstamp_stopped will be updated.
> -	 *
> -	 * The failed events and the remaining siblings need to have
> -	 * their timings updated as if they had gone thru event_sched_in()
> -	 * and event_sched_out(). This is required to get consistent timings
> -	 * across the group. This also takes care of the case where the group
> -	 * could never be scheduled by ensuring tstamp_stopped is set to mark
> -	 * the time the event was actually stopped, such that time delta
> -	 * calculation in update_event_times() is correct.
> +	 * The events up to the failed event are scheduled out normally.
>  	 */
>  	list_for_each_entry(event, &group_event->sibling_list, group_entry) {
>  		if (event == partial_group)
> -			simulate = true;
> +			break;
>  
> -		if (simulate) {
> -			event->tstamp_running += now - event->tstamp_stopped;
> -			event->tstamp_stopped = now;
> -		} else {
> -			event_sched_out(event, cpuctx, ctx);
> -		}
> +		event_sched_out(event, cpuctx, ctx);
>  	}
>  	event_sched_out(group_event, cpuctx, ctx);
>  
> @@ -2225,46 +2161,11 @@ static int group_can_go_on(struct perf_event *event,
>  	return can_add_hw;
>  }
>  
> -/*
> - * Complement to update_event_times(). This computes the tstamp_* values to
> - * continue 'enabled' state from @now, and effectively discards the time
> - * between the prior tstamp_stopped and now (as we were in the OFF state, or
> - * just switched (context) time base).
> - *
> - * This further assumes '@event->state == INACTIVE' (we just came from OFF) and
> - * cannot have been scheduled in yet. And going into INACTIVE state means
> - * '@event->tstamp_stopped = @now'.
> - *
> - * Thus given the rules of update_event_times():
> - *
> - *   total_time_enabled = tstamp_stopped - tstamp_enabled
> - *   total_time_running = tstamp_stopped - tstamp_running
> - *
> - * We can insert 'tstamp_stopped == now' and reverse them to compute new
> - * tstamp_* values.
> - */
> -static void __perf_event_enable_time(struct perf_event *event, u64 now)
> -{
> -	WARN_ON_ONCE(event->state != PERF_EVENT_STATE_INACTIVE);
> -
> -	event->tstamp_stopped = now;
> -	event->tstamp_enabled = now - event->total_time_enabled;
> -	event->tstamp_running = now - event->total_time_running;
> -}
> -
>  static void add_event_to_ctx(struct perf_event *event,
>  			       struct perf_event_context *ctx)
>  {
> -	u64 tstamp = perf_event_time(event);
> -
>  	list_add_event(event, ctx);
>  	perf_group_attach(event);
> -	/*
> -	 * We can be called with event->state == STATE_OFF when we create with
> -	 * .disabled = 1. In that case the IOC_ENABLE will call this function.
> -	 */
> -	if (event->state == PERF_EVENT_STATE_INACTIVE)
> -		__perf_event_enable_time(event, tstamp);
>  }
>  
>  static void ctx_sched_out(struct perf_event_context *ctx,
> @@ -2496,28 +2397,6 @@ perf_install_in_context(struct perf_event_context *ctx,
>  }
>  
>  /*
> - * Put a event into inactive state and update time fields.
> - * Enabling the leader of a group effectively enables all
> - * the group members that aren't explicitly disabled, so we
> - * have to update their ->tstamp_enabled also.
> - * Note: this works for group members as well as group leaders
> - * since the non-leader members' sibling_lists will be empty.
> - */
> -static void __perf_event_mark_enabled(struct perf_event *event)
> -{
> -	struct perf_event *sub;
> -	u64 tstamp = perf_event_time(event);
> -
> -	event->state = PERF_EVENT_STATE_INACTIVE;
> -	__perf_event_enable_time(event, tstamp);
> -	list_for_each_entry(sub, &event->sibling_list, group_entry) {
> -		/* XXX should not be > INACTIVE if event isn't */
> -		if (sub->state >= PERF_EVENT_STATE_INACTIVE)
> -			__perf_event_enable_time(sub, tstamp);
> -	}
> -}
> -
> -/*
>   * Cross CPU call to enable a performance event
>   */
>  static void __perf_event_enable(struct perf_event *event,
> @@ -2535,14 +2414,12 @@ static void __perf_event_enable(struct perf_event *event,
>  	if (ctx->is_active)
>  		ctx_sched_out(ctx, cpuctx, EVENT_TIME);
>  
> -	__perf_event_mark_enabled(event);
> +	perf_event_set_state(event, PERF_EVENT_STATE_INACTIVE);
>  
>  	if (!ctx->is_active)
>  		return;
>  
>  	if (!event_filter_match(event)) {
> -		if (is_cgroup_event(event))
> -			perf_cgroup_defer_enabled(event);
>  		ctx_sched_in(ctx, cpuctx, EVENT_TIME, current);
>  		return;
>  	}
> @@ -2862,18 +2739,10 @@ static void __perf_event_sync_stat(struct perf_event *event,
>  	 * we know the event must be on the current CPU, therefore we
>  	 * don't need to use it.
>  	 */
> -	switch (event->state) {
> -	case PERF_EVENT_STATE_ACTIVE:
> +	if (event->state == PERF_EVENT_STATE_ACTIVE)
>  		event->pmu->read(event);
> -		/* fall-through */
>  
> -	case PERF_EVENT_STATE_INACTIVE:
> -		update_event_times(event);
> -		break;
> -
> -	default:
> -		break;
> -	}
> +	perf_event_update_time(event);
>  
>  	/*
>  	 * In order to keep per-task stats reliable we need to flip the event
> @@ -3110,10 +2979,6 @@ ctx_pinned_sched_in(struct perf_event_context *ctx,
>  		if (!event_filter_match(event))
>  			continue;
>  
> -		/* may need to reset tstamp_enabled */
> -		if (is_cgroup_event(event))
> -			perf_cgroup_mark_enabled(event, ctx);
> -
>  		if (group_can_go_on(event, cpuctx, 1))
>  			group_sched_in(event, cpuctx, ctx);
>  
> @@ -3121,10 +2986,8 @@ ctx_pinned_sched_in(struct perf_event_context *ctx,
>  		 * If this pinned group hasn't been scheduled,
>  		 * put it in error state.
>  		 */
> -		if (event->state == PERF_EVENT_STATE_INACTIVE) {
> -			update_group_times(event);
> -			event->state = PERF_EVENT_STATE_ERROR;
> -		}
> +		if (event->state == PERF_EVENT_STATE_INACTIVE)
> +			perf_event_set_state(event, PERF_EVENT_STATE_ERROR);
>  	}
>  }
>  
> @@ -3146,10 +3009,6 @@ ctx_flexible_sched_in(struct perf_event_context *ctx,
>  		if (!event_filter_match(event))
>  			continue;
>  
> -		/* may need to reset tstamp_enabled */
> -		if (is_cgroup_event(event))
> -			perf_cgroup_mark_enabled(event, ctx);
> -
>  		if (group_can_go_on(event, cpuctx, can_add_hw)) {
>  			if (group_sched_in(event, cpuctx, ctx))
>  				can_add_hw = 0;
> @@ -3541,7 +3400,7 @@ static int event_enable_on_exec(struct perf_event *event,
>  	if (event->state >= PERF_EVENT_STATE_INACTIVE)
>  		return 0;
>  
> -	__perf_event_mark_enabled(event);
> +	perf_event_set_state(event, PERF_EVENT_STATE_INACTIVE);
>  
>  	return 1;
>  }
> @@ -3590,12 +3449,6 @@ static void perf_event_enable_on_exec(int ctxn)
>  		put_ctx(clone_ctx);
>  }
>  
> -struct perf_read_data {
> -	struct perf_event *event;
> -	bool group;
> -	int ret;
> -};
> -
>  static int __perf_event_read_cpu(struct perf_event *event, int event_cpu)
>  {
>  	u16 local_pkg, event_pkg;
> @@ -3613,64 +3466,6 @@ static int __perf_event_read_cpu(struct perf_event *event, int event_cpu)
>  	return event_cpu;
>  }
>  
> -/*
> - * Cross CPU call to read the hardware event
> - */
> -static void __perf_event_read(void *info)
> -{
> -	struct perf_read_data *data = info;
> -	struct perf_event *sub, *event = data->event;
> -	struct perf_event_context *ctx = event->ctx;
> -	struct perf_cpu_context *cpuctx = __get_cpu_context(ctx);
> -	struct pmu *pmu = event->pmu;
> -
> -	/*
> -	 * If this is a task context, we need to check whether it is
> -	 * the current task context of this cpu.  If not it has been
> -	 * scheduled out before the smp call arrived.  In that case
> -	 * event->count would have been updated to a recent sample
> -	 * when the event was scheduled out.
> -	 */
> -	if (ctx->task && cpuctx->task_ctx != ctx)
> -		return;
> -
> -	raw_spin_lock(&ctx->lock);
> -	if (ctx->is_active) {
> -		update_context_time(ctx);
> -		update_cgrp_time_from_event(event);
> -	}
> -
> -	update_event_times(event);
> -	if (event->state != PERF_EVENT_STATE_ACTIVE)
> -		goto unlock;
> -
> -	if (!data->group) {
> -		pmu->read(event);
> -		data->ret = 0;
> -		goto unlock;
> -	}
> -
> -	pmu->start_txn(pmu, PERF_PMU_TXN_READ);
> -
> -	pmu->read(event);
> -
> -	list_for_each_entry(sub, &event->sibling_list, group_entry) {
> -		update_event_times(sub);
> -		if (sub->state == PERF_EVENT_STATE_ACTIVE) {
> -			/*
> -			 * Use sibling's PMU rather than @event's since
> -			 * sibling could be on different (eg: software) PMU.
> -			 */
> -			sub->pmu->read(sub);
> -		}
> -	}
> -
> -	data->ret = pmu->commit_txn(pmu);
> -
> -unlock:
> -	raw_spin_unlock(&ctx->lock);
> -}
> -
>  static inline u64 perf_event_count(struct perf_event *event)
>  {
>  	return local64_read(&event->count) + atomic64_read(&event->child_count);
> @@ -3733,63 +3528,81 @@ int perf_event_read_local(struct perf_event *event, u64 *value)
>  	return ret;
>  }
>  
> -static int perf_event_read(struct perf_event *event, bool group)
> +struct perf_read_data {
> +	struct perf_event *event;
> +	bool group;
> +	int ret;
> +};
> +
> +static void __perf_event_read(struct perf_event *event,
> +			      struct perf_cpu_context *cpuctx,
> +			      struct perf_event_context *ctx,
> +			      void *data)
>  {
> -	int event_cpu, ret = 0;
> +	struct perf_read_data *prd = data;
> +	struct pmu *pmu = event->pmu;
> +	struct perf_event *sibling;
>  
> -	/*
> -	 * If event is enabled and currently active on a CPU, update the
> -	 * value in the event structure:
> -	 */
> -	if (event->state == PERF_EVENT_STATE_ACTIVE) {
> -		struct perf_read_data data = {
> -			.event = event,
> -			.group = group,
> -			.ret = 0,
> -		};
> +	if (ctx->is_active & EVENT_TIME) {
> +		update_context_time(ctx);
> +		update_cgrp_time_from_cpuctx(cpuctx);
> +	}
>  
> -		event_cpu = READ_ONCE(event->oncpu);
> -		if ((unsigned)event_cpu >= nr_cpu_ids)
> -			return 0;
> +	perf_event_update_time(event);
> +	if (prd->group)
> +		perf_event_update_sibling_time(event);
>  
> -		preempt_disable();
> -		event_cpu = __perf_event_read_cpu(event, event_cpu);
> +	if (event->state != PERF_EVENT_STATE_ACTIVE)
> +		return;
>  
> +	if (!prd->group) {
> +		pmu->read(event);
> +		prd->ret = 0;
> +		return;
> +	}
> +
> +	pmu->start_txn(pmu, PERF_PMU_TXN_READ);
> +
> +	pmu->read(event);
> +	list_for_each_entry(sibling, &event->sibling_list, group_entry) {
> +		if (sibling->state == PERF_EVENT_STATE_ACTIVE) {
> +			/*
> +			 * Use sibling's PMU rather than @event's since
> +			 * sibling could be on different (eg: software) PMU.
> +			 */
> +			sibling->pmu->read(sibling);
> +		}
> +	}
> +
> +	prd->ret = pmu->commit_txn(pmu);
> +}
> +
> +static int perf_event_read(struct perf_event *event, bool group)
> +{
> +	struct perf_read_data prd = {
> +		.event = event,
> +		.group = group,
> +		.ret = 0,
> +	};
> +
> +	if (event->ctx->task) {
> +		event_function_call(event, __perf_event_read, &prd);
> +	} else {
>  		/*
> -		 * Purposely ignore the smp_call_function_single() return
> -		 * value.
> -		 *
> -		 * If event_cpu isn't a valid CPU it means the event got
> -		 * scheduled out and that will have updated the event count.
> -		 *
> -		 * Therefore, either way, we'll have an up-to-date event count
> -		 * after this.
> -		 */
> -		(void)smp_call_function_single(event_cpu, __perf_event_read, &data, 1);
> -		preempt_enable();
> -		ret = data.ret;
> -	} else if (event->state == PERF_EVENT_STATE_INACTIVE) {
> -		struct perf_event_context *ctx = event->ctx;
> -		unsigned long flags;
> -
> -		raw_spin_lock_irqsave(&ctx->lock, flags);
> -		/*
> -		 * may read while context is not active
> -		 * (e.g., thread is blocked), in that case
> -		 * we cannot update context time
> +		 * For uncore events (which are per definition per-cpu)
> +		 * allow a different read CPU from event->cpu.
>  		 */
> -		if (ctx->is_active) {
> -			update_context_time(ctx);
> -			update_cgrp_time_from_event(event);
> -		}
> -		if (group)
> -			update_group_times(event);
> -		else
> -			update_event_times(event);
> -		raw_spin_unlock_irqrestore(&ctx->lock, flags);
> +		struct event_function_struct efs = {
> +			.event = event,
> +			.func = __perf_event_read,
> +			.data = &prd,
> +		};
> +		int cpu = __perf_event_read_cpu(event, event->cpu);
> +
> +		cpu_function_call(cpu, event_function, &efs);
>  	}
>  
> -	return ret;
> +	return prd.ret;
>  }
>  
>  /*
> @@ -4388,7 +4201,7 @@ static int perf_release(struct inode *inode, struct file *file)
>  	return 0;
>  }
>  
> -u64 perf_event_read_value(struct perf_event *event, u64 *enabled, u64 *running)
> +static u64 __perf_event_read_value(struct perf_event *event, u64 *enabled, u64 *running)
>  {
>  	struct perf_event *child;
>  	u64 total = 0;
> @@ -4416,6 +4229,18 @@ u64 perf_event_read_value(struct perf_event *event, u64 *enabled, u64 *running)
>  
>  	return total;
>  }
> +
> +u64 perf_event_read_value(struct perf_event *event, u64 *enabled, u64 *running)
> +{
> +	struct perf_event_context *ctx;
> +	u64 count;
> +
> +	ctx = perf_event_ctx_lock(event);
> +	count = __perf_event_read_value(event, enabled, running);
> +	perf_event_ctx_unlock(event, ctx);
> +
> +	return count;
> +}
>  EXPORT_SYMBOL_GPL(perf_event_read_value);
>  
>  static int __perf_read_group_add(struct perf_event *leader,
> @@ -4431,6 +4256,8 @@ static int __perf_read_group_add(struct perf_event *leader,
>  	if (ret)
>  		return ret;
>  
> +	raw_spin_lock_irqsave(&ctx->lock, flags);
> +
>  	/*
>  	 * Since we co-schedule groups, {enabled,running} times of siblings
>  	 * will be identical to those of the leader, so we only publish one
> @@ -4453,8 +4280,6 @@ static int __perf_read_group_add(struct perf_event *leader,
>  	if (read_format & PERF_FORMAT_ID)
>  		values[n++] = primary_event_id(leader);
>  
> -	raw_spin_lock_irqsave(&ctx->lock, flags);
> -
>  	list_for_each_entry(sub, &leader->sibling_list, group_entry) {
>  		values[n++] += perf_event_count(sub);
>  		if (read_format & PERF_FORMAT_ID)
> @@ -4518,7 +4343,7 @@ static int perf_read_one(struct perf_event *event,
>  	u64 values[4];
>  	int n = 0;
>  
> -	values[n++] = perf_event_read_value(event, &enabled, &running);
> +	values[n++] = __perf_event_read_value(event, &enabled, &running);
>  	if (read_format & PERF_FORMAT_TOTAL_TIME_ENABLED)
>  		values[n++] = enabled;
>  	if (read_format & PERF_FORMAT_TOTAL_TIME_RUNNING)
> @@ -4897,8 +4722,7 @@ static void calc_timer_values(struct perf_event *event,
>  
>  	*now = perf_clock();
>  	ctx_time = event->shadow_ctx_time + *now;
> -	*enabled = ctx_time - event->tstamp_enabled;
> -	*running = ctx_time - event->tstamp_running;
> +	__perf_update_times(event, ctx_time, enabled, running);
>  }
>  
>  static void perf_event_init_userpage(struct perf_event *event)
> @@ -10516,7 +10340,7 @@ perf_event_exit_event(struct perf_event *child_event,
>  	if (parent_event)
>  		perf_group_detach(child_event);
>  	list_del_event(child_event, child_ctx);
> -	child_event->state = PERF_EVENT_STATE_EXIT; /* is_event_hup() */
> +	perf_event_set_state(child_event, PERF_EVENT_STATE_EXIT); /* is_event_hup() */
>  	raw_spin_unlock_irq(&child_ctx->lock);
>  
>  	/*
> @@ -10754,7 +10578,7 @@ inherit_event(struct perf_event *parent_event,
>  	      struct perf_event *group_leader,
>  	      struct perf_event_context *child_ctx)
>  {
> -	enum perf_event_active_state parent_state = parent_event->state;
> +	enum perf_event_state parent_state = parent_event->state;
>  	struct perf_event *child_event;
>  	unsigned long flags;
>  
> @@ -11090,6 +10914,7 @@ static void __perf_event_exit_context(void *__info)
>  	struct perf_event *event;
>  
>  	raw_spin_lock(&ctx->lock);
> +	ctx_sched_out(ctx, cpuctx, EVENT_TIME);
>  	list_for_each_entry(event, &ctx->event_list, event_entry)
>  		__perf_remove_from_context(event, cpuctx, ctx, (void *)DETACH_GROUP);
>  	raw_spin_unlock(&ctx->lock);
> 

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [RFC][PATCH] perf: Rewrite enabled/running timekeeping
  2017-09-01 10:45             ` Alexey Budankov
@ 2017-09-01 12:31               ` Peter Zijlstra
  0 siblings, 0 replies; 76+ messages in thread
From: Peter Zijlstra @ 2017-09-01 12:31 UTC (permalink / raw)
  To: Alexey Budankov
  Cc: Ingo Molnar, Arnaldo Carvalho de Melo, Alexander Shishkin,
	Andi Kleen, Kan Liang, Dmitri Prokhorov, Valery Cherepennikov,
	Mark Rutland, Stephane Eranian, David Carrillo-Cisneros,
	linux-kernel, Vince Weaver, Thomas Gleixner

On Fri, Sep 01, 2017 at 01:45:17PM +0300, Alexey Budankov wrote:
> Well, this looks like an "opposite" approach to event timekeeping in 
> comparison to what we currently have. 

I would say 'sane' approach. The current thing is horrible.

> Do you want this rework before or after the current patch set?

Before I would think, because the whole point of the rb-tree thing is to
not touch all events all the time. And you can only do that after you
fix that timekeeping.
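
Roughly, the effect we are after is something like the toy sketch below:
the mux timer only ever walks the groups filed under its own CPU plus the
"any CPU" bucket. The toy_* names and the plain per-CPU lists are made up
for illustration only; the real patches use an rb-tree keyed by CPU.

#include <stdio.h>

#define TOY_NR_CPUS	4
#define TOY_ANY_CPU	TOY_NR_CPUS	/* bucket for per-task (cpu == -1) groups */

struct toy_group {
	const char		*name;
	struct toy_group	*next;
};

/* one bucket per CPU plus one for "any CPU" groups */
static struct toy_group *buckets[TOY_NR_CPUS + 1];

static void toy_add_group(struct toy_group *g, int cpu)
{
	int b = (cpu < 0) ? TOY_ANY_CPU : cpu;

	g->next = buckets[b];
	buckets[b] = g;
}

/* what the mux hrtimer would do: only touch this CPU's groups */
static void toy_rotate(int this_cpu)
{
	struct toy_group *g;

	for (g = buckets[this_cpu]; g; g = g->next)
		printf("cpu%d: schedule %s\n", this_cpu, g->name);
	for (g = buckets[TOY_ANY_CPU]; g; g = g->next)
		printf("cpu%d: schedule %s\n", this_cpu, g->name);
}

int main(void)
{
	struct toy_group g0 = { "ev-cpu0" }, g1 = { "ev-cpu1" }, gt = { "ev-task" };

	toy_add_group(&g0, 0);
	toy_add_group(&g1, 1);
	toy_add_group(&gt, -1);

	toy_rotate(0);	/* the cpu1 group is never visited */
	return 0;
}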

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [RFC][PATCH] perf: Rewrite enabled/running timekeeping
  2017-09-01 11:17             ` Alexey Budankov
@ 2017-09-01 12:42               ` Peter Zijlstra
  0 siblings, 0 replies; 76+ messages in thread
From: Peter Zijlstra @ 2017-09-01 12:42 UTC (permalink / raw)
  To: Alexey Budankov
  Cc: Ingo Molnar, Arnaldo Carvalho de Melo, Alexander Shishkin,
	Andi Kleen, Kan Liang, Dmitri Prokhorov, Valery Cherepennikov,
	Mark Rutland, Stephane Eranian, David Carrillo-Cisneros,
	linux-kernel, Vince Weaver, Thomas Gleixner

On Fri, Sep 01, 2017 at 02:17:17PM +0300, Alexey Budankov wrote:
> > No more weird and wonderful mind bending interaction between 3 different
> > timestamps with arcane update rules.
> > 
> > ---
> >  include/linux/perf_event.h |  25 +-
> >  kernel/events/core.c       | 551 ++++++++++++++++-----------------------------
> >  2 files changed, 192 insertions(+), 384 deletions(-)
> > 
> 
> Tried to apply on top of this:
> 
> perf/core 1b2f76d77a277bb70d38ad0991ed7f16bbc115a9 [origin/perf/core] Merge tag 'perf-core-for-mingo-4.14-20170829' of git://git.kernel.org/pub/scm/linux/kernel/git/acme/linux into perf/core

Applies on top of tip/master without issue.

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [RFC][PATCH] perf: Rewrite enabled/running timekeeping
  2017-08-31 17:18           ` [RFC][PATCH] perf: Rewrite enabled/running timekeeping Peter Zijlstra
                               ` (2 preceding siblings ...)
  2017-09-01 11:17             ` Alexey Budankov
@ 2017-09-01 21:03             ` Vince Weaver
  2017-09-04 10:46             ` Alexey Budankov
  4 siblings, 0 replies; 76+ messages in thread
From: Vince Weaver @ 2017-09-01 21:03 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Alexey Budankov, Ingo Molnar, Arnaldo Carvalho de Melo,
	Alexander Shishkin, Andi Kleen, Kan Liang, Dmitri Prokhorov,
	Valery Cherepennikov, Mark Rutland, Stephane Eranian,
	David Carrillo-Cisneros, linux-kernel, Thomas Gleixner

On Thu, 31 Aug 2017, Peter Zijlstra wrote:

> So the below completely rewrites timekeeping (and probably breaks
> world) but does away with the need to touch events that don't get
> scheduled.
> 
> Esp the cgroup stuff is entirely untested since I simply don't know how
> to operate that. I did run Vince's tests on it, and I think it doesn't
> regress, but I'm near a migraine so I can't really see straight atm.
> 
> Vince, Stephane, could you guys have a peek?

I have to admit that I *always* got lost trying to figure out the old 
code, so I might not be the best person to review the changes.

I did try running the perf_event_tests on a few machines and they all pass.

I also ran the PAPI tests and a few of the multiplexing tests fail about 
10% of the time but I think they also fail 10% of the time with the old 
code too.  I need to figure out why that's happening but it's likely a 
PAPI issue not a kernel one.

Vince

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [RFC][PATCH] perf: Rewrite enabled/running timekeeping
  2017-08-31 17:18           ` [RFC][PATCH] perf: Rewrite enabled/running timekeeping Peter Zijlstra
                               ` (3 preceding siblings ...)
  2017-09-01 21:03             ` Vince Weaver
@ 2017-09-04 10:46             ` Alexey Budankov
  2017-09-04 12:08               ` Peter Zijlstra
  4 siblings, 1 reply; 76+ messages in thread
From: Alexey Budankov @ 2017-09-04 10:46 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Arnaldo Carvalho de Melo, Alexander Shishkin,
	Andi Kleen, Kan Liang, Dmitri Prokhorov, Valery Cherepennikov,
	Mark Rutland, Stephane Eranian, David Carrillo-Cisneros,
	linux-kernel, Vince Weaver, Thomas Gleixner

Hi,
On 31.08.2017 20:18, Peter Zijlstra wrote:
> On Wed, Aug 23, 2017 at 11:54:15AM +0300, Alexey Budankov wrote:
>> On 22.08.2017 23:47, Peter Zijlstra wrote:
>>> On Thu, Aug 10, 2017 at 06:57:43PM +0300, Alexey Budankov wrote:
>>>> The key thing in the patch is explicit updating of tstamp fields for
>>>> INACTIVE events in update_event_times().
>>>
>>>> @@ -1405,6 +1426,9 @@ static void update_event_times(struct perf_event *event)
>>>>  	    event->group_leader->state < PERF_EVENT_STATE_INACTIVE)
>>>>  		return;
>>>>  
>>>> +	if (event->state == PERF_EVENT_STATE_INACTIVE)
>>>> +		perf_event_tstamp_update(event);
>>>> +
>>>>  	/*
>>>>  	 * in cgroup mode, time_enabled represents
>>>>  	 * the time the event was enabled AND active
>>>
>>> But why!? I thought the whole point was to not need to do this.
>>
>> update_event_times() is not called from timer interrupt handler 
>> thus it is not on the critical path which is optimized in this patch set.
>>
>> But update_event_times() is called in the context of read() syscall so
>> this is the place where we may update event times for INACTIVE events 
>> instead of timer interrupt.
>>
>> Also update_event_times() is called on thread context switch out so
>> we get event times also updated when the thread migrates to other CPU.
>>
>>>
>>> The thing I outlined earlier would only need to update timestamps when
>>> events change state and at no other point in time.
>>
>> But we still may request times while event is in INACTIVE state 
>> thru read() syscall and event timings need to be up-to-date. 
> 
> Sure, read() also updates.
> 
> So the below completely rewrites timekeeping (and probably breaks
> world) but does away with the need to touch events that don't get
> scheduled.

We still need to, and do, iterate through all events at some points, e.g. on context switches.

> 
> Esp the cgroup stuff is entirely untested since I simply don't know how
> to operate that. I did run Vince's tests on it, and I think it doesn't
> regress, but I'm near a migraine so I can't really see straight atm.
> 
> Vince, Stephane, could you guys have a peek?
> 
> (There's a few other bits in, I'll break up into patches and write
> comments and Changelogs later, I think its can be split in some 5
> patches).
> 
> The basic idea is really simple, we have a single timestamp and
> depending on the state we update enabled/running. This obviously only
> requires updates when we change state and when we need up-to-date
> timestamps (read).

I would prefer to have this rework in an FSM similar to that below, so that 
the state transitions and the corresponding tstamp, total_time_enabled 
and total_time_running manipulation logic would be consolidated in 
one place, in adjacent lines of code.

From the table below, the event->state FSM is not as simple as it may seem 
at first sight, so in order to avoid regressions after the rework we had 
better keep that in mind and explicitly implement the allowed and disallowed
state transitions.

    A	  	I	    O	       E	   X	      D          U

A   Te+,Tr+     Te+,Tr+     Te+,Tr+    Te+,Tr+     Te+,Tr+    Te+,Tr+    ---
    ts 	        ts          ts         ts          ts         ts

I   Te+,ts      Te+,ts      Te+,ts     Te+,ts      Te+,ts     Te+,ts     ---

O   Te=0,Tr=0,  Te=0,Tr=0,  Te=0,Tr=0  Te=0,Tr=0   Te=0,Tr=0  Te=0,Tr=0  ---
    ts          ts          ts         ts          ts         ts

E   Te=0,Tr=0,  Te=0,Tr=0,  Te=0,Tr=0  Te=0,Tr=0   Te=0,Tr=0  Te=0,Tr=0  ---
    ts          ts          ts         ts          ts         ts

X   ---         ---         ---        ---         ---        ---        ---

D   ---         ---         ---        ---         ---        ---        ---

U   ---         Te=0,Tr=0   Te=0,Tr=0  ---         ---        ---        ---
                ts          ts          

LEGEND:

U - allocation, A - ACTIVE, I - INACTIVE, O - OFF, 
E - ERROR, X - EXIT, D - DEAD,

Te=0  - event->total_time_enabled  = 0
Te+   - event->total_time_enabled += delta

Tr=0  - event->total_time_running  = 0
Tr+   - event->total_time_running += delta

ts    - event->tstamp = perf_event_time(event)

static void
perf_event_change_state(struct perf_event *event, enum perf_event_state state)
{
        u64 delta = 0;
	u64 now = perf_event_time(event);

        delta = now - event->tstamp;
	event->tstamp = now;

	switch(event->state)
	{
	case A:
		switch(state)
		{
		case A:
                        ...
			break;
		case I:
			event->total_time_enabled += delta;
			event->total_time_running += delta;
			event->state = state;
			break;
		case O:
			...
			break;
		case E:
			...
	...
	case I:
		...
		break;
	...
	}
}

---

Regards,
Alexey

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [RFC][PATCH] perf: Rewrite enabled/running timekeeping
  2017-09-04 10:46             ` Alexey Budankov
@ 2017-09-04 12:08               ` Peter Zijlstra
  2017-09-04 14:56                 ` Alexey Budankov
  0 siblings, 1 reply; 76+ messages in thread
From: Peter Zijlstra @ 2017-09-04 12:08 UTC (permalink / raw)
  To: Alexey Budankov
  Cc: Ingo Molnar, Arnaldo Carvalho de Melo, Alexander Shishkin,
	Andi Kleen, Kan Liang, Dmitri Prokhorov, Valery Cherepennikov,
	Mark Rutland, Stephane Eranian, David Carrillo-Cisneros,
	linux-kernel, Vince Weaver, Thomas Gleixner

On Mon, Sep 04, 2017 at 01:46:45PM +0300, Alexey Budankov wrote:
> > So the below completely rewrites timekeeping (and probably breaks
> > world) but does away with the need to touch events that don't get
> > scheduled.
> 
> We still need to, and do, iterate through all events at some points, e.g. on context switches.

Why do we _need_ to? On ctx switch we should stop iteration for a PMU
once we fail to schedule an event, same as for rotation.
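
As a toy illustration of the "stop once scheduling fails" idea (the toy_*
names and the counter bookkeeping are made up here and stand in for the
real group_sched_in() path): once the first group does not fit, the loop
breaks and the remaining groups are never touched at all, so they need no
per-event time updates either.

#include <stdbool.h>
#include <stdio.h>

struct toy_group {
	const char	*name;
	int		counters_needed;
};

/* pretend PMU with a fixed number of free hardware counters */
static int toy_counters_free = 4;

static bool toy_group_sched_in(struct toy_group *g)
{
	if (g->counters_needed > toy_counters_free)
		return false;
	toy_counters_free -= g->counters_needed;
	return true;
}

int main(void)
{
	struct toy_group groups[] = {
		{ "g0", 2 }, { "g1", 3 }, { "g2", 1 }, { "g3", 1 },
	};
	int i;

	for (i = 0; i < 4; i++) {
		if (!toy_group_sched_in(&groups[i])) {
			printf("%s does not fit, stop here\n", groups[i].name);
			break;	/* g2 and g3 are never touched */
		}
		printf("%s scheduled\n", groups[i].name);
	}
	return 0;
}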

> > The basic idea is really simple, we have a single timestamp and
> > depending on the state we update enabled/running. This obviously only
> > requires updates when we change state and when we need up-to-date
> > timestamps (read).
> 
> I would prefer to have this rework in an FSM similar to that below, so that 
> the state transitions and the corresponding tstamp, total_time_enabled 
> and total_time_running manipulation logic would be consolidated in 
> one place, in adjacent lines of code.
> 
> From the table below, the event->state FSM is not as simple as it may seem 
> at first sight, so in order to avoid regressions after the rework we had 
> better keep that in mind and explicitly implement the allowed and disallowed
> state transitions.

Maybe if we introduce something like CONFIG_PERF_DEBUG, but I fear that
for normal operation that's all fairly horrible overhead.
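
Purely as an illustration of what such a check could look like when it is
compiled out for normal builds: the CONFIG_PERF_DEBUG name is hypothetical,
and the toy states and the contents of the allowed[][] table below are
placeholders, not the real perf FSM.

#include <stdbool.h>
#include <stdio.h>

#define CONFIG_PERF_DEBUG 1

enum toy_state { T_OFF, T_ERROR, T_INACTIVE, T_ACTIVE, T_NR };

#ifdef CONFIG_PERF_DEBUG
/* allowed[from][to]: placeholder table of permitted direct transitions */
static const bool allowed[T_NR][T_NR] = {
	[T_OFF]		= { [T_INACTIVE] = true },
	[T_ERROR]	= { [T_OFF] = true, [T_INACTIVE] = true },
	[T_INACTIVE]	= { [T_ACTIVE] = true, [T_OFF] = true, [T_ERROR] = true },
	[T_ACTIVE]	= { [T_INACTIVE] = true, [T_OFF] = true },
};

static void toy_check_transition(enum toy_state from, enum toy_state to)
{
	if (!allowed[from][to])
		fprintf(stderr, "unexpected transition %d -> %d\n", from, to);
}
#else
static void toy_check_transition(enum toy_state from, enum toy_state to) { }
#endif

int main(void)
{
	toy_check_transition(T_OFF, T_INACTIVE);	/* allowed, stays silent */
	toy_check_transition(T_OFF, T_ACTIVE);		/* flagged by the check */
	return 0;
}

With the option off the helper becomes an empty stub, which is the usual way
to keep such checks out of the fast path.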

>     A	  	I	    O	       E	   X	      D          U
> 
> A   Te+,Tr+     Te+,Tr+     Te+,Tr+    Te+,Tr+     Te+,Tr+    Te+,Tr+    ---
>     ts 	        ts          ts         ts          ts         ts
> 
> I   Te+,ts      Te+,ts      Te+,ts     Te+,ts      Te+,ts     Te+,ts     ---
> 
> O   Te=0,Tr=0,  Te=0,Tr=0,  Te=0,Tr=0  Te=0,Tr=0   Te=0,Tr=0  Te=0,Tr=0  ---
>     ts          ts          ts         ts          ts         ts
> 
> E   Te=0,Tr=0,  Te=0,Tr=0,  Te=0,Tr=0  Te=0,Tr=0   Te=0,Tr=0  Te=0,Tr=0  ---
>     ts          ts          ts         ts          ts         ts
> 
> X   ---         ---         ---        ---         ---        ---        ---
> 
> D   ---         ---         ---        ---         ---        ---        ---
> 
> U   ---         Te=0,Tr=0   Te=0,Tr=0  ---         ---        ---        ---
>                 ts          ts          
> 
> LEGEND:
> 
> U - allocation, A - ACTIVE, I - INACTIVE, O - OFF, 
> E - ERROR, X - EXIT, D - DEAD,

Not sure we care about the different <0 values, they're all effectively
OFF.

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [RFC][PATCH] perf: Rewrite enabled/running timekeeping
  2017-09-04 12:08               ` Peter Zijlstra
@ 2017-09-04 14:56                 ` Alexey Budankov
  2017-09-04 15:41                   ` Peter Zijlstra
  0 siblings, 1 reply; 76+ messages in thread
From: Alexey Budankov @ 2017-09-04 14:56 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Arnaldo Carvalho de Melo, Alexander Shishkin,
	Andi Kleen, Kan Liang, Dmitri Prokhorov, Valery Cherepennikov,
	Mark Rutland, Stephane Eranian, David Carrillo-Cisneros,
	linux-kernel, Vince Weaver, Thomas Gleixner

On 04.09.2017 15:08, Peter Zijlstra wrote:
> On Mon, Sep 04, 2017 at 01:46:45PM +0300, Alexey Budankov wrote:
>>> So the below completely rewrites timekeeping (and probably breaks
>>> world) but does away with the need to touch events that don't get
>>> scheduled.
>>
>> We still need to, and do, iterate through all events at some points, e.g. on context switches.
> 
> Why do we _need_ to?

We do so in the current implementation with several tstamp_* fields.

> On ctx switch we should stop iteration for a PMU once we fail to schedule an event, same as for rotation.
> 
>>> The basic idea is really simple, we have a single timestamp and
>>> depending on the state we update enabled/running. This obviously only
>>> requires updates when we change state and when we need up-to-date
>>> timestamps (read).
>>
>> I would prefer to have this rework in an FSM similar to that below, so that 
>> the state transitions and the corresponding tstamp, total_time_enabled 
>> and total_time_running manipulation logic would be consolidated in 
>> one place, in adjacent lines of code.
>>
>> From the table below, the event->state FSM is not as simple as it may seem 
>> at first sight, so in order to avoid regressions after the rework we had 
>> better keep that in mind and explicitly implement the allowed and disallowed
>> state transitions.
> 
> Maybe if we introduce something like CONFIG_PERF_DEBUG, but I fear that
> for normal operation that's all fairly horrible overhead.
> 
>>     A	  	I	    O	       E	   X	      D     U
>>
>> A   Te+,Tr+     Te+,Tr+     Te+,Tr+    Te+,Tr+     Te+,Tr+    Te+,Tr+    ---
>>     ts 	        ts          ts         ts          ts         ts
>>
>> I   Te+,ts      Te+,ts      Te+,ts     Te+,ts      Te+,ts     Te+,ts     ---
>>
>> O   Te=0,Tr=0,  Te=0,Tr=0,  Te=0,Tr=0  Te=0,Tr=0   Te=0,Tr=0  Te=0,Tr=0  ---
>>     ts          ts          ts         ts          ts         ts
>>
>> E   Te=0,Tr=0,  Te=0,Tr=0,  Te=0,Tr=0  Te=0,Tr=0   Te=0,Tr=0  Te=0,Tr=0  ---
>>     ts          ts          ts         ts          ts         ts
>>
>> X   ---         ---         ---        ---         ---        ---        ---
>>
>> D   ---         ---         ---        ---         ---        ---        ---
>>
>> U   ---         Te=0,Tr=0   Te=0,Tr=0  ---         ---        ---        ---
>>                 ts          ts          
>>
>> LEGEND:
>>
>> U - allocation, A - ACTIVE, I - INACTIVE, O - OFF, 
>> E - ERROR, X - EXIT, D - DEAD,
> 
> Not sure we care about the different <0 values, they're all effectively
> OFF.

We still need to care about the proper initial state of the timings when moving up to a >=0 state.

> 
> 
> 

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [RFC][PATCH] perf: Rewrite enabled/running timekeeping
  2017-09-04 14:56                 ` Alexey Budankov
@ 2017-09-04 15:41                   ` Peter Zijlstra
  2017-09-04 15:58                     ` Peter Zijlstra
  2017-09-05 10:17                     ` Alexey Budankov
  0 siblings, 2 replies; 76+ messages in thread
From: Peter Zijlstra @ 2017-09-04 15:41 UTC (permalink / raw)
  To: Alexey Budankov
  Cc: Ingo Molnar, Arnaldo Carvalho de Melo, Alexander Shishkin,
	Andi Kleen, Kan Liang, Dmitri Prokhorov, Valery Cherepennikov,
	Mark Rutland, Stephane Eranian, David Carrillo-Cisneros,
	linux-kernel, Vince Weaver, Thomas Gleixner

On Mon, Sep 04, 2017 at 05:56:06PM +0300, Alexey Budankov wrote:
> On 04.09.2017 15:08, Peter Zijlstra wrote:
> > On Mon, Sep 04, 2017 at 01:46:45PM +0300, Alexey Budankov wrote:
> >>> So the below completely rewrites timekeeping (and probably breaks
> >>> world) but does away with the need to touch events that don't get
> >>> scheduled.
> >>
> >> We still need to, and do, iterate through all events at some points, e.g. on context switches.
> > 
> > Why do we _need_ to?
> 
> We do so in the current implementation with several tstamp_* fields.

Right, but we want to stop doing so asap :-)


> >> U - allocation, A - ACTIVE, I - INACTIVE, O - OFF, 
> >> E - ERROR, X - EXIT, D - DEAD,
> > 
> > Not sure we care about the different <0 values, they're all effectively
> > OFF.
> 
> We still need to care about the proper initial state of the timings when moving up to a >=0 state.

Very true. I'm not sure I fully covered that, let me see if there's
something sensible to do for that.

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [RFC][PATCH] perf: Rewrite enabled/running timekeeping
  2017-09-04 15:41                   ` Peter Zijlstra
@ 2017-09-04 15:58                     ` Peter Zijlstra
  2017-09-05 10:17                     ` Alexey Budankov
  1 sibling, 0 replies; 76+ messages in thread
From: Peter Zijlstra @ 2017-09-04 15:58 UTC (permalink / raw)
  To: Alexey Budankov
  Cc: Ingo Molnar, Arnaldo Carvalho de Melo, Alexander Shishkin,
	Andi Kleen, Kan Liang, Dmitri Prokhorov, Valery Cherepennikov,
	Mark Rutland, Stephane Eranian, David Carrillo-Cisneros,
	linux-kernel, Vince Weaver, Thomas Gleixner

On Mon, Sep 04, 2017 at 05:41:45PM +0200, Peter Zijlstra wrote:

> > >> U - allocation, A - ACTIVE, I - INACTIVE, O - OFF, 
> > >> E - ERROR, X - EXIT, D - DEAD,
> > > 
> > > Not sure we care about the different <0 values, they're all effectively
> > > OFF.
> > 
> > We still need to care about the proper initial state of the timings when moving up to a >=0 state.
> 
> Very true. I'm not sure I fully covered that, let me see if there's
> something sensible to do for that.


So given this:

static __always_inline enum perf_event_state
__perf_effective_state(struct perf_event *event)
{
	struct perf_event *leader = event->group_leader;

	if (leader->state <= PERF_EVENT_STATE_OFF)
		return leader->state;

	return event->state;
}

static __always_inline void
__perf_update_times(struct perf_event *event, u64 now, u64 *enabled, u64 *running)
{
	enum perf_event_state state = __perf_effective_state(event);
	u64 delta = now - event->tstamp;

	*enabled = event->total_time_enabled;
	if (state >= PERF_EVENT_STATE_INACTIVE)
		*enabled += delta;

	*running = event->total_time_running;
	if (state >= PERF_EVENT_STATE_ACTIVE)
		*running += delta;
}

static void perf_event_update_time(struct perf_event *event)
{
	u64 now = perf_event_time(event);

	__perf_update_times(event, now, &event->total_time_enabled,
					&event->total_time_running);
	event->tstamp = now;
}

static void perf_event_update_sibling_time(struct perf_event *leader)
{
	struct perf_event *sibling;

	list_for_each_entry(sibling, &leader->sibling_list, group_entry)
		perf_event_update_time(sibling);
}

static void
perf_event_set_state(struct perf_event *event, enum perf_event_state state)
{
	if (event->state == state)
		return;

	perf_event_update_time(event);
	/*
	 * If a group leader gets enabled/disabled all its siblings
	 * are affected too.
	 */
	if ((event->state < 0) ^ (state < 0))
		perf_event_update_sibling_time(event);

	WRITE_ONCE(event->state, state);
}


If event->state < 0, and we do perf_event_set_state(event, INACTIVE)
then perf_event_update_time() will not add to enabled, not add to
running, but set ->tstamp = now.

So I think it DTRT.
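
To make that concrete, here is a standalone user-space trace of the same
accounting rules (the toy_* types are stand-ins, the group-leader
indirection of __perf_effective_state() is left out, and the timestamps
are arbitrary numbers):

#include <stdio.h>

enum toy_state {		/* ordering mirrors perf_event_state */
	TOY_ERROR	= -2,
	TOY_OFF		= -1,
	TOY_INACTIVE	=  0,
	TOY_ACTIVE	=  1,
};

struct toy_event {
	enum toy_state		state;
	unsigned long long	tstamp;
	unsigned long long	total_time_enabled;
	unsigned long long	total_time_running;
};

/* same rule as __perf_update_times() + perf_event_update_time() */
static void toy_update_time(struct toy_event *e, unsigned long long now)
{
	unsigned long long delta = now - e->tstamp;

	if (e->state >= TOY_INACTIVE)
		e->total_time_enabled += delta;
	if (e->state >= TOY_ACTIVE)
		e->total_time_running += delta;
	e->tstamp = now;
}

/* same rule as perf_event_set_state(), minus the sibling update */
static void toy_set_state(struct toy_event *e, enum toy_state state,
			  unsigned long long now)
{
	if (e->state == state)
		return;
	toy_update_time(e, now);	/* close out the old state */
	e->state = state;		/* new state starts at e->tstamp == now */
}

int main(void)
{
	/* event has been sitting in ERROR since t=100 */
	struct toy_event e = { .state = TOY_ERROR, .tstamp = 100 };

	toy_set_state(&e, TOY_INACTIVE, 250);	/* ERROR time 100..250 is discarded */
	toy_set_state(&e, TOY_ACTIVE, 300);	/* enabled accrues 250..300 */
	toy_update_time(&e, 400);		/* a read at t=400 */

	/* prints enabled=150 running=100 */
	printf("enabled=%llu running=%llu\n",
	       e.total_time_enabled, e.total_time_running);
	return 0;
}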

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [RFC][PATCH] perf: Rewrite enabled/running timekeeping
  2017-08-31 19:51             ` Stephane Eranian
@ 2017-09-05  7:51               ` Stephane Eranian
  2017-09-05  9:44                 ` Peter Zijlstra
  0 siblings, 1 reply; 76+ messages in thread
From: Stephane Eranian @ 2017-09-05  7:51 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Alexey Budankov, Ingo Molnar, Arnaldo Carvalho de Melo,
	Alexander Shishkin, Andi Kleen, Kan Liang, Dmitri Prokhorov,
	Valery Cherepennikov, Mark Rutland, David Carrillo-Cisneros,
	linux-kernel, Vince Weaver, Thomas Gleixner

On Thu, Aug 31, 2017 at 12:51 PM, Stephane Eranian <eranian@google.com> wrote:
> Hi,
>
> On Thu, Aug 31, 2017 at 10:18 AM, Peter Zijlstra <peterz@infradead.org> wrote:
>> On Wed, Aug 23, 2017 at 11:54:15AM +0300, Alexey Budankov wrote:
>>> On 22.08.2017 23:47, Peter Zijlstra wrote:
>>> > On Thu, Aug 10, 2017 at 06:57:43PM +0300, Alexey Budankov wrote:
>>> >> The key thing in the patch is explicit updating of tstamp fields for
>>> >> INACTIVE events in update_event_times().
>>> >
>>> >> @@ -1405,6 +1426,9 @@ static void update_event_times(struct perf_event *event)
>>> >>        event->group_leader->state < PERF_EVENT_STATE_INACTIVE)
>>> >>            return;
>>> >>
>>> >> +  if (event->state == PERF_EVENT_STATE_INACTIVE)
>>> >> +          perf_event_tstamp_update(event);
>>> >> +
>>> >>    /*
>>> >>     * in cgroup mode, time_enabled represents
>>> >>     * the time the event was enabled AND active
>>> >
>>> > But why!? I thought the whole point was to not need to do this.
>>>
>>> update_event_times() is not called from timer interrupt handler
>>> thus it is not on the critical path which is optimized in this patch set.
>>>
>>> But update_event_times() is called in the context of read() syscall so
>>> this is the place where we may update event times for INACTIVE events
>>> instead of timer interrupt.
>>>
>>> Also update_event_times() is called on thread context switch out so
>>> we get event times also updated when the thread migrates to other CPU.
>>>
>>> >
>>> > The thing I outlined earlier would only need to update timestamps when
>>> > events change state and at no other point in time.
>>>
>>> But we still may request times while event is in INACTIVE state
>>> thru read() syscall and event timings need to be up-to-date.
>>
>> Sure, read() also updates.
>>
>> So the below completely rewrites timekeeping (and probably breaks
>> world) but does away with the need to touch events that don't get
>> scheduled.
>>
>> Esp the cgroup stuff is entirely untested since I simply don't know how
>> to operate that. I did run Vince's tests on it, and I think it doesn't
>> regress, but I'm near a migraine so I can't really see straight atm.
>>
>> Vince, Stephane, could you guys have a peek?
>>
> okay, I will run some tests with cgroups on my systems.
>
I ran some cgroup tests, including multiplexing, and so far it appears to work
normally.
It is easy to create a cgroup and move a shell into it:
$ mount -t cgroup none /sys/fs/cgroups
$ cd /sys/fs/cgroups/perf_events
$ mkdir memtoy
$ cd memtoy
$ echo $$ >tasks

At this point your shell is part of the cgroup.
Then you can use perf to monitor globally or inside the cgroup:

$ perf stat -a -e cycles,cycles -G memtoy -I 1000 sleep 1000

That monitors cycles on all CPUs twice, once only when a member
of the cgroup memtoy runs, and the other globally.



>> (There's a few other bits in, I'll break up into patches and write
>> comments and Changelogs later, I think its can be split in some 5
>> patches).
>>
>> The basic idea is really simple, we have a single timestamp and
>> depending on the state we update enabled/running. This obviously only
>> requires updates when we change state and when we need up-to-date
>> timestamps (read).
>>
>> No more weird and wonderful mind bending interaction between 3 different
>> timestamps with arcane update rules.
>>
>> ---
>>  include/linux/perf_event.h |  25 +-
>>  kernel/events/core.c       | 551 ++++++++++++++++-----------------------------
>>  2 files changed, 192 insertions(+), 384 deletions(-)
>>
>> diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
>> index 8e22f24ded6a..2a6ae48a1a96 100644
>> --- a/include/linux/perf_event.h
>> +++ b/include/linux/perf_event.h
>> @@ -485,9 +485,9 @@ struct perf_addr_filters_head {
>>  };
>>
>>  /**
>> - * enum perf_event_active_state - the states of a event
>> + * enum perf_event_state - the states of a event
>>   */
>> -enum perf_event_active_state {
>> +enum perf_event_state {
>>         PERF_EVENT_STATE_DEAD           = -4,
>>         PERF_EVENT_STATE_EXIT           = -3,
>>         PERF_EVENT_STATE_ERROR          = -2,
>> @@ -578,7 +578,7 @@ struct perf_event {
>>         struct pmu                      *pmu;
>>         void                            *pmu_private;
>>
>> -       enum perf_event_active_state    state;
>> +       enum perf_event_state           state;
>>         unsigned int                    attach_state;
>>         local64_t                       count;
>>         atomic64_t                      child_count;
>> @@ -588,26 +588,10 @@ struct perf_event {
>>          * has been enabled (i.e. eligible to run, and the task has
>>          * been scheduled in, if this is a per-task event)
>>          * and running (scheduled onto the CPU), respectively.
>> -        *
>> -        * They are computed from tstamp_enabled, tstamp_running and
>> -        * tstamp_stopped when the event is in INACTIVE or ACTIVE state.
>>          */
>>         u64                             total_time_enabled;
>>         u64                             total_time_running;
>> -
>> -       /*
>> -        * These are timestamps used for computing total_time_enabled
>> -        * and total_time_running when the event is in INACTIVE or
>> -        * ACTIVE state, measured in nanoseconds from an arbitrary point
>> -        * in time.
>> -        * tstamp_enabled: the notional time when the event was enabled
>> -        * tstamp_running: the notional time when the event was scheduled on
>> -        * tstamp_stopped: in INACTIVE state, the notional time when the
>> -        *      event was scheduled off.
>> -        */
>> -       u64                             tstamp_enabled;
>> -       u64                             tstamp_running;
>> -       u64                             tstamp_stopped;
>> +       u64                             tstamp;
>>
>>         /*
>>          * timestamp shadows the actual context timing but it can
>> @@ -699,7 +683,6 @@ struct perf_event {
>>
>>  #ifdef CONFIG_CGROUP_PERF
>>         struct perf_cgroup              *cgrp; /* cgroup event is attach to */
>> -       int                             cgrp_defer_enabled;
>>  #endif
>>
>>         struct list_head                sb_list;
>> diff --git a/kernel/events/core.c b/kernel/events/core.c
>> index 294f1927f944..e968b3eab9c7 100644
>> --- a/kernel/events/core.c
>> +++ b/kernel/events/core.c
>> @@ -582,6 +582,70 @@ static inline u64 perf_event_clock(struct perf_event *event)
>>         return event->clock();
>>  }
>>
>> +/*
>> + * XXX comment about timekeeping goes here
>> + */
>> +
>> +static __always_inline enum perf_event_state
>> +__perf_effective_state(struct perf_event *event)
>> +{
>> +       struct perf_event *leader = event->group_leader;
>> +
>> +       if (leader->state <= PERF_EVENT_STATE_OFF)
>> +               return leader->state;
>> +
>> +       return event->state;
>> +}
>> +
>> +static __always_inline void
>> +__perf_update_times(struct perf_event *event, u64 now, u64 *enabled, u64 *running)
>> +{
>> +       enum perf_event_state state = __perf_effective_state(event);
>> +       u64 delta = now - event->tstamp;
>> +
>> +       *enabled = event->total_time_enabled;
>> +       if (state >= PERF_EVENT_STATE_INACTIVE)
>> +               *enabled += delta;
>> +
>> +       *running = event->total_time_running;
>> +       if (state >= PERF_EVENT_STATE_ACTIVE)
>> +               *running += delta;
>> +}
>> +
>> +static void perf_event_update_time(struct perf_event *event)
>> +{
>> +       u64 now = perf_event_time(event);
>> +
>> +       __perf_update_times(event, now, &event->total_time_enabled,
>> +                                       &event->total_time_running);
>> +       event->tstamp = now;
>> +}
>> +
>> +static void perf_event_update_sibling_time(struct perf_event *leader)
>> +{
>> +       struct perf_event *sibling;
>> +
>> +       list_for_each_entry(sibling, &leader->sibling_list, group_entry)
>> +               perf_event_update_time(sibling);
>> +}
>> +
>> +static void
>> +perf_event_set_state(struct perf_event *event, enum perf_event_state state)
>> +{
>> +       if (event->state == state)
>> +               return;
>> +
>> +       perf_event_update_time(event);
>> +       /*
>> +        * If a group leader gets enabled/disabled all its siblings
>> +        * are affected too.
>> +        */
>> +       if ((event->state < 0) ^ (state < 0))
>> +               perf_event_update_sibling_time(event);
>> +
>> +       WRITE_ONCE(event->state, state);
>> +}
>> +
>>  #ifdef CONFIG_CGROUP_PERF
>>
>>  static inline bool
>> @@ -841,40 +905,6 @@ perf_cgroup_set_shadow_time(struct perf_event *event, u64 now)
>>         event->shadow_ctx_time = now - t->timestamp;
>>  }
>>
>> -static inline void
>> -perf_cgroup_defer_enabled(struct perf_event *event)
>> -{
>> -       /*
>> -        * when the current task's perf cgroup does not match
>> -        * the event's, we need to remember to call the
>> -        * perf_mark_enable() function the first time a task with
>> -        * a matching perf cgroup is scheduled in.
>> -        */
>> -       if (is_cgroup_event(event) && !perf_cgroup_match(event))
>> -               event->cgrp_defer_enabled = 1;
>> -}
>> -
>> -static inline void
>> -perf_cgroup_mark_enabled(struct perf_event *event,
>> -                        struct perf_event_context *ctx)
>> -{
>> -       struct perf_event *sub;
>> -       u64 tstamp = perf_event_time(event);
>> -
>> -       if (!event->cgrp_defer_enabled)
>> -               return;
>> -
>> -       event->cgrp_defer_enabled = 0;
>> -
>> -       event->tstamp_enabled = tstamp - event->total_time_enabled;
>> -       list_for_each_entry(sub, &event->sibling_list, group_entry) {
>> -               if (sub->state >= PERF_EVENT_STATE_INACTIVE) {
>> -                       sub->tstamp_enabled = tstamp - sub->total_time_enabled;
>> -                       sub->cgrp_defer_enabled = 0;
>> -               }
>> -       }
>> -}
>> -
>>  /*
>>   * Update cpuctx->cgrp so that it is set when first cgroup event is added and
>>   * cleared when last cgroup event is removed.
>> @@ -973,17 +1003,6 @@ static inline u64 perf_cgroup_event_time(struct perf_event *event)
>>  }
>>
>>  static inline void
>> -perf_cgroup_defer_enabled(struct perf_event *event)
>> -{
>> -}
>> -
>> -static inline void
>> -perf_cgroup_mark_enabled(struct perf_event *event,
>> -                        struct perf_event_context *ctx)
>> -{
>> -}
>> -
>> -static inline void
>>  list_update_cgroup_event(struct perf_event *event,
>>                          struct perf_event_context *ctx, bool add)
>>  {
>> @@ -1396,60 +1415,6 @@ static u64 perf_event_time(struct perf_event *event)
>>         return ctx ? ctx->time : 0;
>>  }
>>
>> -/*
>> - * Update the total_time_enabled and total_time_running fields for a event.
>> - */
>> -static void update_event_times(struct perf_event *event)
>> -{
>> -       struct perf_event_context *ctx = event->ctx;
>> -       u64 run_end;
>> -
>> -       lockdep_assert_held(&ctx->lock);
>> -
>> -       if (event->state < PERF_EVENT_STATE_INACTIVE ||
>> -           event->group_leader->state < PERF_EVENT_STATE_INACTIVE)
>> -               return;
>> -
>> -       /*
>> -        * in cgroup mode, time_enabled represents
>> -        * the time the event was enabled AND active
>> -        * tasks were in the monitored cgroup. This is
>> -        * independent of the activity of the context as
>> -        * there may be a mix of cgroup and non-cgroup events.
>> -        *
>> -        * That is why we treat cgroup events differently
>> -        * here.
>> -        */
>> -       if (is_cgroup_event(event))
>> -               run_end = perf_cgroup_event_time(event);
>> -       else if (ctx->is_active)
>> -               run_end = ctx->time;
>> -       else
>> -               run_end = event->tstamp_stopped;
>> -
>> -       event->total_time_enabled = run_end - event->tstamp_enabled;
>> -
>> -       if (event->state == PERF_EVENT_STATE_INACTIVE)
>> -               run_end = event->tstamp_stopped;
>> -       else
>> -               run_end = perf_event_time(event);
>> -
>> -       event->total_time_running = run_end - event->tstamp_running;
>> -
>> -}
>> -
>> -/*
>> - * Update total_time_enabled and total_time_running for all events in a group.
>> - */
>> -static void update_group_times(struct perf_event *leader)
>> -{
>> -       struct perf_event *event;
>> -
>> -       update_event_times(leader);
>> -       list_for_each_entry(event, &leader->sibling_list, group_entry)
>> -               update_event_times(event);
>> -}
>> -
>>  static enum event_type_t get_event_type(struct perf_event *event)
>>  {
>>         struct perf_event_context *ctx = event->ctx;
>> @@ -1492,6 +1457,8 @@ list_add_event(struct perf_event *event, struct perf_event_context *ctx)
>>         WARN_ON_ONCE(event->attach_state & PERF_ATTACH_CONTEXT);
>>         event->attach_state |= PERF_ATTACH_CONTEXT;
>>
>> +       event->tstamp = perf_event_time(event);
>> +
>>         /*
>>          * If we're a stand alone event or group leader, we go to the context
>>          * list, group events are kept attached to the group so that
>> @@ -1699,8 +1666,6 @@ list_del_event(struct perf_event *event, struct perf_event_context *ctx)
>>         if (event->group_leader == event)
>>                 list_del_init(&event->group_entry);
>>
>> -       update_group_times(event);
>> -
>>         /*
>>          * If event was in error state, then keep it
>>          * that way, otherwise bogus counts will be
>> @@ -1709,7 +1674,7 @@ list_del_event(struct perf_event *event, struct perf_event_context *ctx)
>>          * of the event
>>          */
>>         if (event->state > PERF_EVENT_STATE_OFF)
>> -               event->state = PERF_EVENT_STATE_OFF;
>> +               perf_event_set_state(event, PERF_EVENT_STATE_OFF);
>>
>>         ctx->generation++;
>>  }
>> @@ -1808,38 +1773,24 @@ event_sched_out(struct perf_event *event,
>>                   struct perf_cpu_context *cpuctx,
>>                   struct perf_event_context *ctx)
>>  {
>> -       u64 tstamp = perf_event_time(event);
>> -       u64 delta;
>> +       enum perf_event_state state = PERF_EVENT_STATE_INACTIVE;
>>
>>         WARN_ON_ONCE(event->ctx != ctx);
>>         lockdep_assert_held(&ctx->lock);
>>
>> -       /*
>> -        * An event which could not be activated because of
>> -        * filter mismatch still needs to have its timings
>> -        * maintained, otherwise bogus information is return
>> -        * via read() for time_enabled, time_running:
>> -        */
>> -       if (event->state == PERF_EVENT_STATE_INACTIVE &&
>> -           !event_filter_match(event)) {
>> -               delta = tstamp - event->tstamp_stopped;
>> -               event->tstamp_running += delta;
>> -               event->tstamp_stopped = tstamp;
>> -       }
>> -
>>         if (event->state != PERF_EVENT_STATE_ACTIVE)
>>                 return;
>>
>>         perf_pmu_disable(event->pmu);
>>
>> -       event->tstamp_stopped = tstamp;
>>         event->pmu->del(event, 0);
>>         event->oncpu = -1;
>> -       event->state = PERF_EVENT_STATE_INACTIVE;
>> +
>>         if (event->pending_disable) {
>>                 event->pending_disable = 0;
>> -               event->state = PERF_EVENT_STATE_OFF;
>> +               state = PERF_EVENT_STATE_OFF;
>>         }
>> +       perf_event_set_state(event, state);
>>
>>         if (!is_software_event(event))
>>                 cpuctx->active_oncpu--;
>> @@ -1859,7 +1810,9 @@ group_sched_out(struct perf_event *group_event,
>>                 struct perf_event_context *ctx)
>>  {
>>         struct perf_event *event;
>> -       int state = group_event->state;
>> +
>> +       if (group_event->state != PERF_EVENT_STATE_ACTIVE)
>> +               return;
>>
>>         perf_pmu_disable(ctx->pmu);
>>
>> @@ -1873,7 +1826,7 @@ group_sched_out(struct perf_event *group_event,
>>
>>         perf_pmu_enable(ctx->pmu);
>>
>> -       if (state == PERF_EVENT_STATE_ACTIVE && group_event->attr.exclusive)
>> +       if (group_event->attr.exclusive)
>>                 cpuctx->exclusive = 0;
>>  }
>>
>> @@ -1893,6 +1846,11 @@ __perf_remove_from_context(struct perf_event *event,
>>  {
>>         unsigned long flags = (unsigned long)info;
>>
>> +       if (ctx->is_active & EVENT_TIME) {
>> +               update_context_time(ctx);
>> +               update_cgrp_time_from_cpuctx(cpuctx);
>> +       }
>> +
>>         event_sched_out(event, cpuctx, ctx);
>>         if (flags & DETACH_GROUP)
>>                 perf_group_detach(event);
>> @@ -1955,14 +1913,17 @@ static void __perf_event_disable(struct perf_event *event,
>>         if (event->state < PERF_EVENT_STATE_INACTIVE)
>>                 return;
>>
>> -       update_context_time(ctx);
>> -       update_cgrp_time_from_event(event);
>> -       update_group_times(event);
>> +       if (ctx->is_active & EVENT_TIME) {
>> +               update_context_time(ctx);
>> +               update_cgrp_time_from_cpuctx(cpuctx);
>> +       }
>> +
>>         if (event == event->group_leader)
>>                 group_sched_out(event, cpuctx, ctx);
>>         else
>>                 event_sched_out(event, cpuctx, ctx);
>> -       event->state = PERF_EVENT_STATE_OFF;
>> +
>> +       perf_event_set_state(event, PERF_EVENT_STATE_OFF);
>>  }
>>
>>  /*
>> @@ -2019,8 +1980,7 @@ void perf_event_disable_inatomic(struct perf_event *event)
>>  }
>>
>>  static void perf_set_shadow_time(struct perf_event *event,
>> -                                struct perf_event_context *ctx,
>> -                                u64 tstamp)
>> +                                struct perf_event_context *ctx)
>>  {
>>         /*
>>          * use the correct time source for the time snapshot
>> @@ -2048,9 +2008,9 @@ static void perf_set_shadow_time(struct perf_event *event,
>>          * is cleaner and simpler to understand.
>>          */
>>         if (is_cgroup_event(event))
>> -               perf_cgroup_set_shadow_time(event, tstamp);
>> +               perf_cgroup_set_shadow_time(event, event->tstamp);
>>         else
>> -               event->shadow_ctx_time = tstamp - ctx->timestamp;
>> +               event->shadow_ctx_time = event->tstamp - ctx->timestamp;
>>  }
>>
>>  #define MAX_INTERRUPTS (~0ULL)
>> @@ -2063,7 +2023,6 @@ event_sched_in(struct perf_event *event,
>>                  struct perf_cpu_context *cpuctx,
>>                  struct perf_event_context *ctx)
>>  {
>> -       u64 tstamp = perf_event_time(event);
>>         int ret = 0;
>>
>>         lockdep_assert_held(&ctx->lock);
>> @@ -2077,7 +2036,7 @@ event_sched_in(struct perf_event *event,
>>          * is visible.
>>          */
>>         smp_wmb();
>> -       WRITE_ONCE(event->state, PERF_EVENT_STATE_ACTIVE);
>> +       perf_event_set_state(event, PERF_EVENT_STATE_ACTIVE);
>>
>>         /*
>>          * Unthrottle events, since we scheduled we might have missed several
>> @@ -2089,26 +2048,19 @@ event_sched_in(struct perf_event *event,
>>                 event->hw.interrupts = 0;
>>         }
>>
>> -       /*
>> -        * The new state must be visible before we turn it on in the hardware:
>> -        */
>> -       smp_wmb();
>> -
>>         perf_pmu_disable(event->pmu);
>>
>> -       perf_set_shadow_time(event, ctx, tstamp);
>> +       perf_set_shadow_time(event, ctx);
>>
>>         perf_log_itrace_start(event);
>>
>>         if (event->pmu->add(event, PERF_EF_START)) {
>> -               event->state = PERF_EVENT_STATE_INACTIVE;
>> +               perf_event_set_state(event, PERF_EVENT_STATE_INACTIVE);
>>                 event->oncpu = -1;
>>                 ret = -EAGAIN;
>>                 goto out;
>>         }
>>
>> -       event->tstamp_running += tstamp - event->tstamp_stopped;
>> -
>>         if (!is_software_event(event))
>>                 cpuctx->active_oncpu++;
>>         if (!ctx->nr_active++)
>> @@ -2132,8 +2084,6 @@ group_sched_in(struct perf_event *group_event,
>>  {
>>         struct perf_event *event, *partial_group = NULL;
>>         struct pmu *pmu = ctx->pmu;
>> -       u64 now = ctx->time;
>> -       bool simulate = false;
>>
>>         if (group_event->state == PERF_EVENT_STATE_OFF)
>>                 return 0;
>> @@ -2163,27 +2113,13 @@ group_sched_in(struct perf_event *group_event,
>>         /*
>>          * Groups can be scheduled in as one unit only, so undo any
>>          * partial group before returning:
>> -        * The events up to the failed event are scheduled out normally,
>> -        * tstamp_stopped will be updated.
>> -        *
>> -        * The failed events and the remaining siblings need to have
>> -        * their timings updated as if they had gone thru event_sched_in()
>> -        * and event_sched_out(). This is required to get consistent timings
>> -        * across the group. This also takes care of the case where the group
>> -        * could never be scheduled by ensuring tstamp_stopped is set to mark
>> -        * the time the event was actually stopped, such that time delta
>> -        * calculation in update_event_times() is correct.
>> +        * The events up to the failed event are scheduled out normally.
>>          */
>>         list_for_each_entry(event, &group_event->sibling_list, group_entry) {
>>                 if (event == partial_group)
>> -                       simulate = true;
>> +                       break;
>>
>> -               if (simulate) {
>> -                       event->tstamp_running += now - event->tstamp_stopped;
>> -                       event->tstamp_stopped = now;
>> -               } else {
>> -                       event_sched_out(event, cpuctx, ctx);
>> -               }
>> +               event_sched_out(event, cpuctx, ctx);
>>         }
>>         event_sched_out(group_event, cpuctx, ctx);
>>
>> @@ -2225,46 +2161,11 @@ static int group_can_go_on(struct perf_event *event,
>>         return can_add_hw;
>>  }
>>
>> -/*
>> - * Complement to update_event_times(). This computes the tstamp_* values to
>> - * continue 'enabled' state from @now, and effectively discards the time
>> - * between the prior tstamp_stopped and now (as we were in the OFF state, or
>> - * just switched (context) time base).
>> - *
>> - * This further assumes '@event->state == INACTIVE' (we just came from OFF) and
>> - * cannot have been scheduled in yet. And going into INACTIVE state means
>> - * '@event->tstamp_stopped = @now'.
>> - *
>> - * Thus given the rules of update_event_times():
>> - *
>> - *   total_time_enabled = tstamp_stopped - tstamp_enabled
>> - *   total_time_running = tstamp_stopped - tstamp_running
>> - *
>> - * We can insert 'tstamp_stopped == now' and reverse them to compute new
>> - * tstamp_* values.
>> - */
>> -static void __perf_event_enable_time(struct perf_event *event, u64 now)
>> -{
>> -       WARN_ON_ONCE(event->state != PERF_EVENT_STATE_INACTIVE);
>> -
>> -       event->tstamp_stopped = now;
>> -       event->tstamp_enabled = now - event->total_time_enabled;
>> -       event->tstamp_running = now - event->total_time_running;
>> -}
>> -
>>  static void add_event_to_ctx(struct perf_event *event,
>>                                struct perf_event_context *ctx)
>>  {
>> -       u64 tstamp = perf_event_time(event);
>> -
>>         list_add_event(event, ctx);
>>         perf_group_attach(event);
>> -       /*
>> -        * We can be called with event->state == STATE_OFF when we create with
>> -        * .disabled = 1. In that case the IOC_ENABLE will call this function.
>> -        */
>> -       if (event->state == PERF_EVENT_STATE_INACTIVE)
>> -               __perf_event_enable_time(event, tstamp);
>>  }
>>
>>  static void ctx_sched_out(struct perf_event_context *ctx,
>> @@ -2496,28 +2397,6 @@ perf_install_in_context(struct perf_event_context *ctx,
>>  }
>>
>>  /*
>> - * Put a event into inactive state and update time fields.
>> - * Enabling the leader of a group effectively enables all
>> - * the group members that aren't explicitly disabled, so we
>> - * have to update their ->tstamp_enabled also.
>> - * Note: this works for group members as well as group leaders
>> - * since the non-leader members' sibling_lists will be empty.
>> - */
>> -static void __perf_event_mark_enabled(struct perf_event *event)
>> -{
>> -       struct perf_event *sub;
>> -       u64 tstamp = perf_event_time(event);
>> -
>> -       event->state = PERF_EVENT_STATE_INACTIVE;
>> -       __perf_event_enable_time(event, tstamp);
>> -       list_for_each_entry(sub, &event->sibling_list, group_entry) {
>> -               /* XXX should not be > INACTIVE if event isn't */
>> -               if (sub->state >= PERF_EVENT_STATE_INACTIVE)
>> -                       __perf_event_enable_time(sub, tstamp);
>> -       }
>> -}
>> -
>> -/*
>>   * Cross CPU call to enable a performance event
>>   */
>>  static void __perf_event_enable(struct perf_event *event,
>> @@ -2535,14 +2414,12 @@ static void __perf_event_enable(struct perf_event *event,
>>         if (ctx->is_active)
>>                 ctx_sched_out(ctx, cpuctx, EVENT_TIME);
>>
>> -       __perf_event_mark_enabled(event);
>> +       perf_event_set_state(event, PERF_EVENT_STATE_INACTIVE);
>>
>>         if (!ctx->is_active)
>>                 return;
>>
>>         if (!event_filter_match(event)) {
>> -               if (is_cgroup_event(event))
>> -                       perf_cgroup_defer_enabled(event);
>>                 ctx_sched_in(ctx, cpuctx, EVENT_TIME, current);
>>                 return;
>>         }
>> @@ -2862,18 +2739,10 @@ static void __perf_event_sync_stat(struct perf_event *event,
>>          * we know the event must be on the current CPU, therefore we
>>          * don't need to use it.
>>          */
>> -       switch (event->state) {
>> -       case PERF_EVENT_STATE_ACTIVE:
>> +       if (event->state == PERF_EVENT_STATE_ACTIVE)
>>                 event->pmu->read(event);
>> -               /* fall-through */
>>
>> -       case PERF_EVENT_STATE_INACTIVE:
>> -               update_event_times(event);
>> -               break;
>> -
>> -       default:
>> -               break;
>> -       }
>> +       perf_event_update_time(event);
>>
>>         /*
>>          * In order to keep per-task stats reliable we need to flip the event
>> @@ -3110,10 +2979,6 @@ ctx_pinned_sched_in(struct perf_event_context *ctx,
>>                 if (!event_filter_match(event))
>>                         continue;
>>
>> -               /* may need to reset tstamp_enabled */
>> -               if (is_cgroup_event(event))
>> -                       perf_cgroup_mark_enabled(event, ctx);
>> -
>>                 if (group_can_go_on(event, cpuctx, 1))
>>                         group_sched_in(event, cpuctx, ctx);
>>
>> @@ -3121,10 +2986,8 @@ ctx_pinned_sched_in(struct perf_event_context *ctx,
>>                  * If this pinned group hasn't been scheduled,
>>                  * put it in error state.
>>                  */
>> -               if (event->state == PERF_EVENT_STATE_INACTIVE) {
>> -                       update_group_times(event);
>> -                       event->state = PERF_EVENT_STATE_ERROR;
>> -               }
>> +               if (event->state == PERF_EVENT_STATE_INACTIVE)
>> +                       perf_event_set_state(event, PERF_EVENT_STATE_ERROR);
>>         }
>>  }
>>
>> @@ -3146,10 +3009,6 @@ ctx_flexible_sched_in(struct perf_event_context *ctx,
>>                 if (!event_filter_match(event))
>>                         continue;
>>
>> -               /* may need to reset tstamp_enabled */
>> -               if (is_cgroup_event(event))
>> -                       perf_cgroup_mark_enabled(event, ctx);
>> -
>>                 if (group_can_go_on(event, cpuctx, can_add_hw)) {
>>                         if (group_sched_in(event, cpuctx, ctx))
>>                                 can_add_hw = 0;
>> @@ -3541,7 +3400,7 @@ static int event_enable_on_exec(struct perf_event *event,
>>         if (event->state >= PERF_EVENT_STATE_INACTIVE)
>>                 return 0;
>>
>> -       __perf_event_mark_enabled(event);
>> +       perf_event_set_state(event, PERF_EVENT_STATE_INACTIVE);
>>
>>         return 1;
>>  }
>> @@ -3590,12 +3449,6 @@ static void perf_event_enable_on_exec(int ctxn)
>>                 put_ctx(clone_ctx);
>>  }
>>
>> -struct perf_read_data {
>> -       struct perf_event *event;
>> -       bool group;
>> -       int ret;
>> -};
>> -
>>  static int __perf_event_read_cpu(struct perf_event *event, int event_cpu)
>>  {
>>         u16 local_pkg, event_pkg;
>> @@ -3613,64 +3466,6 @@ static int __perf_event_read_cpu(struct perf_event *event, int event_cpu)
>>         return event_cpu;
>>  }
>>
>> -/*
>> - * Cross CPU call to read the hardware event
>> - */
>> -static void __perf_event_read(void *info)
>> -{
>> -       struct perf_read_data *data = info;
>> -       struct perf_event *sub, *event = data->event;
>> -       struct perf_event_context *ctx = event->ctx;
>> -       struct perf_cpu_context *cpuctx = __get_cpu_context(ctx);
>> -       struct pmu *pmu = event->pmu;
>> -
>> -       /*
>> -        * If this is a task context, we need to check whether it is
>> -        * the current task context of this cpu.  If not it has been
>> -        * scheduled out before the smp call arrived.  In that case
>> -        * event->count would have been updated to a recent sample
>> -        * when the event was scheduled out.
>> -        */
>> -       if (ctx->task && cpuctx->task_ctx != ctx)
>> -               return;
>> -
>> -       raw_spin_lock(&ctx->lock);
>> -       if (ctx->is_active) {
>> -               update_context_time(ctx);
>> -               update_cgrp_time_from_event(event);
>> -       }
>> -
>> -       update_event_times(event);
>> -       if (event->state != PERF_EVENT_STATE_ACTIVE)
>> -               goto unlock;
>> -
>> -       if (!data->group) {
>> -               pmu->read(event);
>> -               data->ret = 0;
>> -               goto unlock;
>> -       }
>> -
>> -       pmu->start_txn(pmu, PERF_PMU_TXN_READ);
>> -
>> -       pmu->read(event);
>> -
>> -       list_for_each_entry(sub, &event->sibling_list, group_entry) {
>> -               update_event_times(sub);
>> -               if (sub->state == PERF_EVENT_STATE_ACTIVE) {
>> -                       /*
>> -                        * Use sibling's PMU rather than @event's since
>> -                        * sibling could be on different (eg: software) PMU.
>> -                        */
>> -                       sub->pmu->read(sub);
>> -               }
>> -       }
>> -
>> -       data->ret = pmu->commit_txn(pmu);
>> -
>> -unlock:
>> -       raw_spin_unlock(&ctx->lock);
>> -}
>> -
>>  static inline u64 perf_event_count(struct perf_event *event)
>>  {
>>         return local64_read(&event->count) + atomic64_read(&event->child_count);
>> @@ -3733,63 +3528,81 @@ int perf_event_read_local(struct perf_event *event, u64 *value)
>>         return ret;
>>  }
>>
>> -static int perf_event_read(struct perf_event *event, bool group)
>> +struct perf_read_data {
>> +       struct perf_event *event;
>> +       bool group;
>> +       int ret;
>> +};
>> +
>> +static void __perf_event_read(struct perf_event *event,
>> +                             struct perf_cpu_context *cpuctx,
>> +                             struct perf_event_context *ctx,
>> +                             void *data)
>>  {
>> -       int event_cpu, ret = 0;
>> +       struct perf_read_data *prd = data;
>> +       struct pmu *pmu = event->pmu;
>> +       struct perf_event *sibling;
>>
>> -       /*
>> -        * If event is enabled and currently active on a CPU, update the
>> -        * value in the event structure:
>> -        */
>> -       if (event->state == PERF_EVENT_STATE_ACTIVE) {
>> -               struct perf_read_data data = {
>> -                       .event = event,
>> -                       .group = group,
>> -                       .ret = 0,
>> -               };
>> +       if (ctx->is_active & EVENT_TIME) {
>> +               update_context_time(ctx);
>> +               update_cgrp_time_from_cpuctx(cpuctx);
>> +       }
>>
>> -               event_cpu = READ_ONCE(event->oncpu);
>> -               if ((unsigned)event_cpu >= nr_cpu_ids)
>> -                       return 0;
>> +       perf_event_update_time(event);
>> +       if (prd->group)
>> +               perf_event_update_sibling_time(event);
>>
>> -               preempt_disable();
>> -               event_cpu = __perf_event_read_cpu(event, event_cpu);
>> +       if (event->state != PERF_EVENT_STATE_ACTIVE)
>> +               return;
>>
>> +       if (!prd->group) {
>> +               pmu->read(event);
>> +               prd->ret = 0;
>> +               return;
>> +       }
>> +
>> +       pmu->start_txn(pmu, PERF_PMU_TXN_READ);
>> +
>> +       pmu->read(event);
>> +       list_for_each_entry(sibling, &event->sibling_list, group_entry) {
>> +               if (sibling->state == PERF_EVENT_STATE_ACTIVE) {
>> +                       /*
>> +                        * Use sibling's PMU rather than @event's since
>> +                        * sibling could be on different (eg: software) PMU.
>> +                        */
>> +                       sibling->pmu->read(sibling);
>> +               }
>> +       }
>> +
>> +       prd->ret = pmu->commit_txn(pmu);
>> +}
>> +
>> +static int perf_event_read(struct perf_event *event, bool group)
>> +{
>> +       struct perf_read_data prd = {
>> +               .event = event,
>> +               .group = group,
>> +               .ret = 0,
>> +       };
>> +
>> +       if (event->ctx->task) {
>> +               event_function_call(event, __perf_event_read, &prd);
>> +       } else {
>>                 /*
>> -                * Purposely ignore the smp_call_function_single() return
>> -                * value.
>> -                *
>> -                * If event_cpu isn't a valid CPU it means the event got
>> -                * scheduled out and that will have updated the event count.
>> -                *
>> -                * Therefore, either way, we'll have an up-to-date event count
>> -                * after this.
>> -                */
>> -               (void)smp_call_function_single(event_cpu, __perf_event_read, &data, 1);
>> -               preempt_enable();
>> -               ret = data.ret;
>> -       } else if (event->state == PERF_EVENT_STATE_INACTIVE) {
>> -               struct perf_event_context *ctx = event->ctx;
>> -               unsigned long flags;
>> -
>> -               raw_spin_lock_irqsave(&ctx->lock, flags);
>> -               /*
>> -                * may read while context is not active
>> -                * (e.g., thread is blocked), in that case
>> -                * we cannot update context time
>> +                * For uncore events (which are per definition per-cpu)
>> +                * allow a different read CPU from event->cpu.
>>                  */
>> -               if (ctx->is_active) {
>> -                       update_context_time(ctx);
>> -                       update_cgrp_time_from_event(event);
>> -               }
>> -               if (group)
>> -                       update_group_times(event);
>> -               else
>> -                       update_event_times(event);
>> -               raw_spin_unlock_irqrestore(&ctx->lock, flags);
>> +               struct event_function_struct efs = {
>> +                       .event = event,
>> +                       .func = __perf_event_read,
>> +                       .data = &prd,
>> +               };
>> +               int cpu = __perf_event_read_cpu(event, event->cpu);
>> +
>> +               cpu_function_call(cpu, event_function, &efs);
>>         }
>>
>> -       return ret;
>> +       return prd.ret;
>>  }
>>
>>  /*
>> @@ -4388,7 +4201,7 @@ static int perf_release(struct inode *inode, struct file *file)
>>         return 0;
>>  }
>>
>> -u64 perf_event_read_value(struct perf_event *event, u64 *enabled, u64 *running)
>> +static u64 __perf_event_read_value(struct perf_event *event, u64 *enabled, u64 *running)
>>  {
>>         struct perf_event *child;
>>         u64 total = 0;
>> @@ -4416,6 +4229,18 @@ u64 perf_event_read_value(struct perf_event *event, u64 *enabled, u64 *running)
>>
>>         return total;
>>  }
>> +
>> +u64 perf_event_read_value(struct perf_event *event, u64 *enabled, u64 *running)
>> +{
>> +       struct perf_event_context *ctx;
>> +       u64 count;
>> +
>> +       ctx = perf_event_ctx_lock(event);
>> +       count = __perf_event_read_value(event, enabled, running);
>> +       perf_event_ctx_unlock(event, ctx);
>> +
>> +       return count;
>> +}
>>  EXPORT_SYMBOL_GPL(perf_event_read_value);
>>
>>  static int __perf_read_group_add(struct perf_event *leader,
>> @@ -4431,6 +4256,8 @@ static int __perf_read_group_add(struct perf_event *leader,
>>         if (ret)
>>                 return ret;
>>
>> +       raw_spin_lock_irqsave(&ctx->lock, flags);
>> +
>>         /*
>>          * Since we co-schedule groups, {enabled,running} times of siblings
>>          * will be identical to those of the leader, so we only publish one
>> @@ -4453,8 +4280,6 @@ static int __perf_read_group_add(struct perf_event *leader,
>>         if (read_format & PERF_FORMAT_ID)
>>                 values[n++] = primary_event_id(leader);
>>
>> -       raw_spin_lock_irqsave(&ctx->lock, flags);
>> -
>>         list_for_each_entry(sub, &leader->sibling_list, group_entry) {
>>                 values[n++] += perf_event_count(sub);
>>                 if (read_format & PERF_FORMAT_ID)
>> @@ -4518,7 +4343,7 @@ static int perf_read_one(struct perf_event *event,
>>         u64 values[4];
>>         int n = 0;
>>
>> -       values[n++] = perf_event_read_value(event, &enabled, &running);
>> +       values[n++] = __perf_event_read_value(event, &enabled, &running);
>>         if (read_format & PERF_FORMAT_TOTAL_TIME_ENABLED)
>>                 values[n++] = enabled;
>>         if (read_format & PERF_FORMAT_TOTAL_TIME_RUNNING)
>> @@ -4897,8 +4722,7 @@ static void calc_timer_values(struct perf_event *event,
>>
>>         *now = perf_clock();
>>         ctx_time = event->shadow_ctx_time + *now;
>> -       *enabled = ctx_time - event->tstamp_enabled;
>> -       *running = ctx_time - event->tstamp_running;
>> +       __perf_update_times(event, ctx_time, enabled, running);
>>  }
>>
>>  static void perf_event_init_userpage(struct perf_event *event)
>> @@ -10516,7 +10340,7 @@ perf_event_exit_event(struct perf_event *child_event,
>>         if (parent_event)
>>                 perf_group_detach(child_event);
>>         list_del_event(child_event, child_ctx);
>> -       child_event->state = PERF_EVENT_STATE_EXIT; /* is_event_hup() */
>> +       perf_event_set_state(child_event, PERF_EVENT_STATE_EXIT); /* is_event_hup() */
>>         raw_spin_unlock_irq(&child_ctx->lock);
>>
>>         /*
>> @@ -10754,7 +10578,7 @@ inherit_event(struct perf_event *parent_event,
>>               struct perf_event *group_leader,
>>               struct perf_event_context *child_ctx)
>>  {
>> -       enum perf_event_active_state parent_state = parent_event->state;
>> +       enum perf_event_state parent_state = parent_event->state;
>>         struct perf_event *child_event;
>>         unsigned long flags;
>>
>> @@ -11090,6 +10914,7 @@ static void __perf_event_exit_context(void *__info)
>>         struct perf_event *event;
>>
>>         raw_spin_lock(&ctx->lock);
>> +       ctx_sched_out(ctx, cpuctx, EVENT_TIME);
>>         list_for_each_entry(event, &ctx->event_list, event_entry)
>>                 __perf_remove_from_context(event, cpuctx, ctx, (void *)DETACH_GROUP);
>>         raw_spin_unlock(&ctx->lock);

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [RFC][PATCH] perf: Rewrite enabled/running timekeeping
  2017-09-05  7:51               ` Stephane Eranian
@ 2017-09-05  9:44                 ` Peter Zijlstra
  0 siblings, 0 replies; 76+ messages in thread
From: Peter Zijlstra @ 2017-09-05  9:44 UTC (permalink / raw)
  To: Stephane Eranian
  Cc: Alexey Budankov, Ingo Molnar, Arnaldo Carvalho de Melo,
	Alexander Shishkin, Andi Kleen, Kan Liang, Dmitri Prokhorov,
	Valery Cherepennikov, Mark Rutland, David Carrillo-Cisneros,
	linux-kernel, Vince Weaver, Thomas Gleixner

On Tue, Sep 05, 2017 at 12:51:35AM -0700, Stephane Eranian wrote:
> >> Esp the cgroup stuff is entirely untested since I simply don't know how
> >> to operate that. I did run Vince's tests on it, and I think it doesn't
> >> regress, but I'm near a migraine so I can't really see straight atm.
> >>
> >> Vince, Stephane, could you guys have a peek?
> >>
> > okay, I will run some tests with cgroups on my systems.
> >
> I ran some cgroups tests, including multiplexing and so far it appears to work
> normally.

Shiny!

> It is easy to create a cgroup and move a shell into it:
> $ mount -t cgroup -o perf_event none /sys/fs/cgroup/perf_event
> $ cd /sys/fs/cgroup/perf_event
> $ mkdir memtoy
> $ cd memtoy
> $ echo $$ >tasks
> 
> At this point your shell is part of the cgroup.
> Then you can use perf to monitor globally or inside the cgroup:
> 
> $ perf stat -a -e cycles,cycles -G memtoy -I 1000 sleep 1000
> 
> That monitors cycles on all CPUs twice, once only when a member
> of the cgroup memtoy runs, and the other globally.

Right, thanks!

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [RFC][PATCH] perf: Rewrite enabled/running timekeeping
  2017-09-04 15:41                   ` Peter Zijlstra
  2017-09-04 15:58                     ` Peter Zijlstra
@ 2017-09-05 10:17                     ` Alexey Budankov
  2017-09-05 11:19                       ` Peter Zijlstra
  2017-09-05 12:06                       ` Alexey Budankov
  1 sibling, 2 replies; 76+ messages in thread
From: Alexey Budankov @ 2017-09-05 10:17 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Arnaldo Carvalho de Melo, Alexander Shishkin,
	Andi Kleen, Kan Liang, Dmitri Prokhorov, Valery Cherepennikov,
	Mark Rutland, Stephane Eranian, David Carrillo-Cisneros,
	linux-kernel, Vince Weaver, Thomas Gleixner

On 04.09.2017 18:41, Peter Zijlstra wrote:
> On Mon, Sep 04, 2017 at 05:56:06PM +0300, Alexey Budankov wrote:
>> On 04.09.2017 15:08, Peter Zijlstra wrote:
>>> On Mon, Sep 04, 2017 at 01:46:45PM +0300, Alexey Budankov wrote:
>>>>> So the below completely rewrites timekeeping (and probably breaks
>>>>> world) but does away with the need to touch events that don't get
>>>>> scheduled.
>>>>
>>>> We still need and do iterate thru all events at some points e.g. on context switches.
>>>
>>> Why do we _need_ to?
>>
>> We do so in the current implementation with several tstamp_* fields.
> 
> Right, but we want to stop doing so asap :-)
> 

Well, I see your point :). It turns out that with straightforward timekeeping
we can also avoid iterating the whole tree on context switches, in addition to
the RB-tree-based iteration and rotation on the hrtimer interrupt. That brings
even more performance, and the rotation switch can be avoided.
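
To make that concrete, here is a minimal sketch (not part of the series; the
function name is illustrative) of what such a scheduler-side walk looks like
once groups are keyed by (cpu, group_index): only the software (-1) sub-tree
and the current CPU's sub-tree are touched. The actual code is in the
ctx_sched_out()/ctx_*_sched_in() hunks of the patch below and reuses the
perf_event_groups_for_each_cpu() iterator introduced there:

static void ctx_sched_out_cpu_only(struct perf_event_context *ctx,
				   struct perf_cpu_context *cpuctx,
				   struct perf_event_groups *groups)
{
	int sw = -1, cpu = smp_processor_id();
	struct perf_event *event;

	/* software events are keyed with cpu == -1 */
	perf_event_groups_for_each_cpu(event, sw, groups, group_node)
		group_sched_out(event, cpuctx, ctx);

	/* then only the groups bound to the current CPU */
	perf_event_groups_for_each_cpu(event, cpu, groups, group_node)
		group_sched_out(event, cpuctx, ctx);
}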

However, we can't completely get rid of whole-tree iterations because of the
inheritance code that runs on fork in perf_event_init_context(), here:

perf_event_groups_for_each(event, &parent_ctx->pinned_groups, group_node) {
		ret = inherit_task_group(event, parent, parent_ctx,
					 child, ctxn, &inherited_all);
		if (ret)
			goto out_unlock;
}

and here:

perf_event_groups_for_each(event, &parent_ctx->flexible_groups, group_node) {
		ret = inherit_task_group(event, parent, parent_ctx,
					 child, ctxn, &inherited_all);
		if (ret)
			goto out_unlock;
}

Below is the patch set put on top of the timekeeping rework. 

It is intended for the tip/master branch.

---
 include/linux/perf_event.h |  40 +--
 kernel/events/core.c       | 839 +++++++++++++++++++++++----------------------
 2 files changed, 448 insertions(+), 431 deletions(-)

diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 8e22f24..0365371 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -485,9 +485,9 @@ struct perf_addr_filters_head {
 };
 
 /**
- * enum perf_event_active_state - the states of a event
+ * enum perf_event_state - the states of a event
  */
-enum perf_event_active_state {
+enum perf_event_state {
 	PERF_EVENT_STATE_DEAD		= -4,
 	PERF_EVENT_STATE_EXIT		= -3,
 	PERF_EVENT_STATE_ERROR		= -2,
@@ -557,7 +557,11 @@ struct perf_event {
 	 */
 	struct list_head		group_entry;
 	struct list_head		sibling_list;
-
+	/*
+	 * Node on the pinned or flexible tree located at the event context;
+	 */
+	struct rb_node			group_node;
+	u64				group_index;
 	/*
 	 * We need storage to track the entries in perf_pmu_migrate_context; we
 	 * cannot use the event_entry because of RCU and we want to keep the
@@ -578,7 +582,7 @@ struct perf_event {
 	struct pmu			*pmu;
 	void				*pmu_private;
 
-	enum perf_event_active_state	state;
+	enum perf_event_state		state;
 	unsigned int			attach_state;
 	local64_t			count;
 	atomic64_t			child_count;
@@ -588,26 +592,10 @@ struct perf_event {
 	 * has been enabled (i.e. eligible to run, and the task has
 	 * been scheduled in, if this is a per-task event)
 	 * and running (scheduled onto the CPU), respectively.
-	 *
-	 * They are computed from tstamp_enabled, tstamp_running and
-	 * tstamp_stopped when the event is in INACTIVE or ACTIVE state.
 	 */
 	u64				total_time_enabled;
 	u64				total_time_running;
-
-	/*
-	 * These are timestamps used for computing total_time_enabled
-	 * and total_time_running when the event is in INACTIVE or
-	 * ACTIVE state, measured in nanoseconds from an arbitrary point
-	 * in time.
-	 * tstamp_enabled: the notional time when the event was enabled
-	 * tstamp_running: the notional time when the event was scheduled on
-	 * tstamp_stopped: in INACTIVE state, the notional time when the
-	 *	event was scheduled off.
-	 */
-	u64				tstamp_enabled;
-	u64				tstamp_running;
-	u64				tstamp_stopped;
+	u64				tstamp;
 
 	/*
 	 * timestamp shadows the actual context timing but it can
@@ -699,13 +687,17 @@ struct perf_event {
 
 #ifdef CONFIG_CGROUP_PERF
 	struct perf_cgroup		*cgrp; /* cgroup event is attach to */
-	int				cgrp_defer_enabled;
 #endif
 
 	struct list_head		sb_list;
 #endif /* CONFIG_PERF_EVENTS */
 };
 
+struct perf_event_groups {
+	struct rb_root	tree;
+	u64		index;
+};
+
 /**
  * struct perf_event_context - event context structure
  *
@@ -726,8 +718,8 @@ struct perf_event_context {
 	struct mutex			mutex;
 
 	struct list_head		active_ctx_list;
-	struct list_head		pinned_groups;
-	struct list_head		flexible_groups;
+	struct perf_event_groups	pinned_groups;
+	struct perf_event_groups	flexible_groups;
 	struct list_head		event_list;
 	int				nr_events;
 	int				nr_active;
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 294f192..3e8eef8 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -582,6 +582,70 @@ static inline u64 perf_event_clock(struct perf_event *event)
 	return event->clock();
 }
 
+/*
+ * XXX comment about timekeeping goes here
+ */
+
+static __always_inline enum perf_event_state
+__perf_effective_state(struct perf_event *event)
+{
+	struct perf_event *leader = event->group_leader;
+
+	if (leader->state <= PERF_EVENT_STATE_OFF)
+		return leader->state;
+
+	return event->state;
+}
+
+static __always_inline void
+__perf_update_times(struct perf_event *event, u64 now, u64 *enabled, u64 *running)
+{
+	enum perf_event_state state = __perf_effective_state(event);
+	u64 delta = now - event->tstamp;
+
+	*enabled = event->total_time_enabled;
+	if (state >= PERF_EVENT_STATE_INACTIVE)
+		*enabled += delta;
+
+	*running = event->total_time_running;
+	if (state >= PERF_EVENT_STATE_ACTIVE)
+		*running += delta;
+}
+
+static void perf_event_update_time(struct perf_event *event)
+{
+	u64 now = perf_event_time(event);
+
+	__perf_update_times(event, now, &event->total_time_enabled,
+					&event->total_time_running);
+	event->tstamp = now;
+}
+
+static void perf_event_update_sibling_time(struct perf_event *leader)
+{
+	struct perf_event *sibling;
+
+	list_for_each_entry(sibling, &leader->sibling_list, group_entry)
+		perf_event_update_time(sibling);
+}
+
+static void
+perf_event_set_state(struct perf_event *event, enum perf_event_state state)
+{
+	if (event->state == state)
+		return;
+
+	perf_event_update_time(event);
+	/*
+	 * If a group leader gets enabled/disabled all its siblings
+	 * are affected too.
+	 */
+	if ((event->state < 0) ^ (state < 0))
+		perf_event_update_sibling_time(event);
+
+	WRITE_ONCE(event->state, state);
+}
+
 #ifdef CONFIG_CGROUP_PERF
 
 static inline bool
@@ -841,40 +905,6 @@ perf_cgroup_set_shadow_time(struct perf_event *event, u64 now)
 	event->shadow_ctx_time = now - t->timestamp;
 }
 
-static inline void
-perf_cgroup_defer_enabled(struct perf_event *event)
-{
-	/*
-	 * when the current task's perf cgroup does not match
-	 * the event's, we need to remember to call the
-	 * perf_mark_enable() function the first time a task with
-	 * a matching perf cgroup is scheduled in.
-	 */
-	if (is_cgroup_event(event) && !perf_cgroup_match(event))
-		event->cgrp_defer_enabled = 1;
-}
-
-static inline void
-perf_cgroup_mark_enabled(struct perf_event *event,
-			 struct perf_event_context *ctx)
-{
-	struct perf_event *sub;
-	u64 tstamp = perf_event_time(event);
-
-	if (!event->cgrp_defer_enabled)
-		return;
-
-	event->cgrp_defer_enabled = 0;
-
-	event->tstamp_enabled = tstamp - event->total_time_enabled;
-	list_for_each_entry(sub, &event->sibling_list, group_entry) {
-		if (sub->state >= PERF_EVENT_STATE_INACTIVE) {
-			sub->tstamp_enabled = tstamp - sub->total_time_enabled;
-			sub->cgrp_defer_enabled = 0;
-		}
-	}
-}
-
 /*
  * Update cpuctx->cgrp so that it is set when first cgroup event is added and
  * cleared when last cgroup event is removed.
@@ -973,17 +1003,6 @@ static inline u64 perf_cgroup_event_time(struct perf_event *event)
 }
 
 static inline void
-perf_cgroup_defer_enabled(struct perf_event *event)
-{
-}
-
-static inline void
-perf_cgroup_mark_enabled(struct perf_event *event,
-			 struct perf_event_context *ctx)
-{
-}
-
-static inline void
 list_update_cgroup_event(struct perf_event *event,
 			 struct perf_event_context *ctx, bool add)
 {
@@ -1396,91 +1415,212 @@ static u64 perf_event_time(struct perf_event *event)
 	return ctx ? ctx->time : 0;
 }
 
-/*
- * Update the total_time_enabled and total_time_running fields for a event.
- */
-static void update_event_times(struct perf_event *event)
+static enum event_type_t get_event_type(struct perf_event *event)
 {
 	struct perf_event_context *ctx = event->ctx;
-	u64 run_end;
+	enum event_type_t event_type;
 
 	lockdep_assert_held(&ctx->lock);
 
-	if (event->state < PERF_EVENT_STATE_INACTIVE ||
-	    event->group_leader->state < PERF_EVENT_STATE_INACTIVE)
-		return;
-
 	/*
-	 * in cgroup mode, time_enabled represents
-	 * the time the event was enabled AND active
-	 * tasks were in the monitored cgroup. This is
-	 * independent of the activity of the context as
-	 * there may be a mix of cgroup and non-cgroup events.
-	 *
-	 * That is why we treat cgroup events differently
-	 * here.
+	 * It's 'group type', really, because if our group leader is
+	 * pinned, so are we.
 	 */
-	if (is_cgroup_event(event))
-		run_end = perf_cgroup_event_time(event);
-	else if (ctx->is_active)
-		run_end = ctx->time;
-	else
-		run_end = event->tstamp_stopped;
+	if (event->group_leader != event)
+		event = event->group_leader;
 
-	event->total_time_enabled = run_end - event->tstamp_enabled;
+	event_type = event->attr.pinned ? EVENT_PINNED : EVENT_FLEXIBLE;
+	if (!ctx->task)
+		event_type |= EVENT_CPU;
+
+	return event_type;
+}
 
-	if (event->state == PERF_EVENT_STATE_INACTIVE)
-		run_end = event->tstamp_stopped;
+/*
+ * Helper function to initialize group leader event;
+ */
+void init_event_group(struct perf_event *event)
+{
+	RB_CLEAR_NODE(&event->group_node);
+	event->group_index = 0;
+}
+
+/*
+ * Extract pinned or flexible groups from the context
+ * based on event attrs bits;
+ */
+static struct perf_event_groups *
+get_event_groups(struct perf_event *event, struct perf_event_context *ctx)
+{
+	if (event->attr.pinned)
+		return &ctx->pinned_groups;
 	else
-		run_end = perf_event_time(event);
+		return &ctx->flexible_groups;
+}
+
+/*
+ * Helper function to initialize the perf event groups object;
+ */
+void perf_event_groups_init(struct perf_event_groups *groups)
+{
+	groups->tree = RB_ROOT;
+	groups->index = 0;
+}
 
-	event->total_time_running = run_end - event->tstamp_running;
+/*
+ * Comparison function for event groups;
+ * implements a compound key that sorts first by CPU and then by the
+ * virtual group_index, which provides a stable order for rotating
+ * groups on the same CPU;
+ */
+int perf_event_groups_less(struct perf_event *left, struct perf_event *right)
+{
+	if (left->cpu < right->cpu) {
+		return 1;
+	} else if (left->cpu > right->cpu) {
+		return 0;
+	} else {
+		if (left->group_index < right->group_index) {
+			return 1;
+		} else if (left->group_index > right->group_index) {
+			return 0;
+		} else {
+			return 0;
+		}
+	}
+}
+/*
+ * Insert a group into the tree using event->cpu as the primary key and the
+ * monotonically increasing group_index as the secondary key, so that groups
+ * on the same CPU keep their relative insertion order.
+ */
+static void
+perf_event_groups_insert(struct perf_event_groups *groups,
+		struct perf_event *event)
+{
+	struct perf_event *node_event;
+	struct rb_node *parent;
+	struct rb_node **node;
+
+	event->group_index = ++groups->index;
+
+	node = &groups->tree.rb_node;
+	parent = *node;
+
+	while (*node) {
+		parent = *node;
+		node_event = container_of(*node,
+				struct perf_event, group_node);
+
+		if (perf_event_groups_less(event, node_event))
+			node = &parent->rb_left;
+		else
+			node = &parent->rb_right;
+	}
 
+	rb_link_node(&event->group_node, parent, node);
+	rb_insert_color(&event->group_node, &groups->tree);
 }
 
 /*
- * Update total_time_enabled and total_time_running for all events in a group.
+ * Helper function to insert event into the pinned or
+ * flexible groups;
  */
-static void update_group_times(struct perf_event *leader)
+static void
+add_event_to_groups(struct perf_event *event, struct perf_event_context *ctx)
 {
-	struct perf_event *event;
+	struct perf_event_groups *groups;
 
-	update_event_times(leader);
-	list_for_each_entry(event, &leader->sibling_list, group_entry)
-		update_event_times(event);
+	groups = get_event_groups(event, ctx);
+	perf_event_groups_insert(groups, event);
 }
 
-static enum event_type_t get_event_type(struct perf_event *event)
+/*
+ * Delete a group from the tree and reinitialize its group_node so that
+ * RB_EMPTY_NODE() sees it as detached.
+ */
+static void
+perf_event_groups_delete(struct perf_event_groups *groups,
+		struct perf_event *event)
 {
-	struct perf_event_context *ctx = event->ctx;
-	enum event_type_t event_type;
+	if (!RB_EMPTY_NODE(&event->group_node) &&
+	    !RB_EMPTY_ROOT(&groups->tree))
+		rb_erase(&event->group_node, &groups->tree);
 
-	lockdep_assert_held(&ctx->lock);
+	init_event_group(event);
+}
 
-	/*
-	 * It's 'group type', really, because if our group leader is
-	 * pinned, so are we.
-	 */
-	if (event->group_leader != event)
-		event = event->group_leader;
+/*
+ * Helper function to delete event from its groups;
+ */
+static void
+del_event_from_groups(struct perf_event *event, struct perf_event_context *ctx)
+{
+	struct perf_event_groups *groups;
 
-	event_type = event->attr.pinned ? EVENT_PINNED : EVENT_FLEXIBLE;
-	if (!ctx->task)
-		event_type |= EVENT_CPU;
+	groups = get_event_groups(event, ctx);
+	perf_event_groups_delete(groups, event);
+}
 
-	return event_type;
+/*
+ * Get the group with the least group_index for the given cpu key from the tree;
+ */
+static struct perf_event *
+perf_event_groups_first(struct perf_event_groups *groups, int cpu)
+{
+	struct perf_event *node_event = NULL, *match = NULL;
+	struct rb_node *node = groups->tree.rb_node;
+
+	while (node) {
+		node_event = container_of(node,
+				struct perf_event, group_node);
+
+		if (cpu < node_event->cpu) {
+			node = node->rb_left;
+		} else if (cpu > node_event->cpu) {
+			node = node->rb_right;
+		} else {
+			match = node_event;
+			node = node->rb_left;
+		}
+	}
+
+	return match;
 }
 
-static struct list_head *
-ctx_group_list(struct perf_event *event, struct perf_event_context *ctx)
+/*
+ * Rotate the groups for the given cpu key: the first group is re-inserted last.
+ */
+static void
+perf_event_groups_rotate(struct perf_event_groups *groups, int cpu)
 {
-	if (event->attr.pinned)
-		return &ctx->pinned_groups;
-	else
-		return &ctx->flexible_groups;
+	struct perf_event *event =
+			perf_event_groups_first(groups, cpu);
+
+	if (event) {
+		perf_event_groups_delete(groups, event);
+		perf_event_groups_insert(groups, event);
+	}
 }
 
 /*
+ * Iterate event groups through the whole tree.
+ */
+#define perf_event_groups_for_each(event, groups, node)		\
+	for (event = rb_entry_safe(rb_first(&((groups)->tree)),	\
+				typeof(*event), node); event;	\
+		event = rb_entry_safe(rb_next(&event->node),	\
+				typeof(*event), node))
+/*
+ * Iterate event groups with cpu == key.
+ */
+#define perf_event_groups_for_each_cpu(event, key, groups, node) \
+	for (event = perf_event_groups_first(groups, key);	 \
+		event && event->cpu == key;			 \
+		event = rb_entry_safe(rb_next(&event->node),	 \
+				typeof(*event), node))
+
+/*
  * Add a event from the lists for its context.
  * Must be called with ctx->mutex and ctx->lock held.
  */
@@ -1492,18 +1632,16 @@ list_add_event(struct perf_event *event, struct perf_event_context *ctx)
 	WARN_ON_ONCE(event->attach_state & PERF_ATTACH_CONTEXT);
 	event->attach_state |= PERF_ATTACH_CONTEXT;
 
+	event->tstamp = perf_event_time(event);
+
 	/*
 	 * If we're a stand alone event or group leader, we go to the context
 	 * list, group events are kept attached to the group so that
 	 * perf_group_detach can, at all times, locate all siblings.
 	 */
 	if (event->group_leader == event) {
-		struct list_head *list;
-
 		event->group_caps = event->event_caps;
-
-		list = ctx_group_list(event, ctx);
-		list_add_tail(&event->group_entry, list);
+		add_event_to_groups(event, ctx);
 	}
 
 	list_update_cgroup_event(event, ctx, true);
@@ -1697,9 +1835,7 @@ list_del_event(struct perf_event *event, struct perf_event_context *ctx)
 	list_del_rcu(&event->event_entry);
 
 	if (event->group_leader == event)
-		list_del_init(&event->group_entry);
-
-	update_group_times(event);
+		del_event_from_groups(event, ctx);
 
 	/*
 	 * If event was in error state, then keep it
@@ -1709,7 +1845,7 @@ list_del_event(struct perf_event *event, struct perf_event_context *ctx)
 	 * of the event
 	 */
 	if (event->state > PERF_EVENT_STATE_OFF)
-		event->state = PERF_EVENT_STATE_OFF;
+		perf_event_set_state(event, PERF_EVENT_STATE_OFF);
 
 	ctx->generation++;
 }
@@ -1717,7 +1853,6 @@ list_del_event(struct perf_event *event, struct perf_event_context *ctx)
 static void perf_group_detach(struct perf_event *event)
 {
 	struct perf_event *sibling, *tmp;
-	struct list_head *list = NULL;
 
 	lockdep_assert_held(&event->ctx->lock);
 
@@ -1738,22 +1873,23 @@ static void perf_group_detach(struct perf_event *event)
 		goto out;
 	}
 
-	if (!list_empty(&event->group_entry))
-		list = &event->group_entry;
-
 	/*
 	 * If this was a group event with sibling events then
 	 * upgrade the siblings to singleton events by adding them
 	 * to whatever list we are on.
 	 */
 	list_for_each_entry_safe(sibling, tmp, &event->sibling_list, group_entry) {
-		if (list)
-			list_move_tail(&sibling->group_entry, list);
+
 		sibling->group_leader = sibling;
 
 		/* Inherit group flags from the previous leader */
 		sibling->group_caps = event->group_caps;
 
+		if (!RB_EMPTY_NODE(&event->group_node)) {
+			list_del_init(&sibling->group_entry);
+			add_event_to_groups(sibling, event->ctx);
+		}
+
 		WARN_ON_ONCE(sibling->ctx != event->ctx);
 	}
 
@@ -1808,38 +1944,24 @@ event_sched_out(struct perf_event *event,
 		  struct perf_cpu_context *cpuctx,
 		  struct perf_event_context *ctx)
 {
-	u64 tstamp = perf_event_time(event);
-	u64 delta;
+	enum perf_event_state state = PERF_EVENT_STATE_INACTIVE;
 
 	WARN_ON_ONCE(event->ctx != ctx);
 	lockdep_assert_held(&ctx->lock);
 
-	/*
-	 * An event which could not be activated because of
-	 * filter mismatch still needs to have its timings
-	 * maintained, otherwise bogus information is return
-	 * via read() for time_enabled, time_running:
-	 */
-	if (event->state == PERF_EVENT_STATE_INACTIVE &&
-	    !event_filter_match(event)) {
-		delta = tstamp - event->tstamp_stopped;
-		event->tstamp_running += delta;
-		event->tstamp_stopped = tstamp;
-	}
-
 	if (event->state != PERF_EVENT_STATE_ACTIVE)
 		return;
 
 	perf_pmu_disable(event->pmu);
 
-	event->tstamp_stopped = tstamp;
 	event->pmu->del(event, 0);
 	event->oncpu = -1;
-	event->state = PERF_EVENT_STATE_INACTIVE;
+
 	if (event->pending_disable) {
 		event->pending_disable = 0;
-		event->state = PERF_EVENT_STATE_OFF;
+		state = PERF_EVENT_STATE_OFF;
 	}
+	perf_event_set_state(event, state);
 
 	if (!is_software_event(event))
 		cpuctx->active_oncpu--;
@@ -1859,7 +1981,9 @@ group_sched_out(struct perf_event *group_event,
 		struct perf_event_context *ctx)
 {
 	struct perf_event *event;
-	int state = group_event->state;
+
+	if (group_event->state != PERF_EVENT_STATE_ACTIVE)
+		return;
 
 	perf_pmu_disable(ctx->pmu);
 
@@ -1873,7 +1997,7 @@ group_sched_out(struct perf_event *group_event,
 
 	perf_pmu_enable(ctx->pmu);
 
-	if (state == PERF_EVENT_STATE_ACTIVE && group_event->attr.exclusive)
+	if (group_event->attr.exclusive)
 		cpuctx->exclusive = 0;
 }
 
@@ -1893,6 +2017,11 @@ __perf_remove_from_context(struct perf_event *event,
 {
 	unsigned long flags = (unsigned long)info;
 
+	if (ctx->is_active & EVENT_TIME) {
+		update_context_time(ctx);
+		update_cgrp_time_from_cpuctx(cpuctx);
+	}
+
 	event_sched_out(event, cpuctx, ctx);
 	if (flags & DETACH_GROUP)
 		perf_group_detach(event);
@@ -1955,14 +2084,17 @@ static void __perf_event_disable(struct perf_event *event,
 	if (event->state < PERF_EVENT_STATE_INACTIVE)
 		return;
 
-	update_context_time(ctx);
-	update_cgrp_time_from_event(event);
-	update_group_times(event);
+	if (ctx->is_active & EVENT_TIME) {
+		update_context_time(ctx);
+		update_cgrp_time_from_cpuctx(cpuctx);
+	}
+
 	if (event == event->group_leader)
 		group_sched_out(event, cpuctx, ctx);
 	else
 		event_sched_out(event, cpuctx, ctx);
-	event->state = PERF_EVENT_STATE_OFF;
+
+	perf_event_set_state(event, PERF_EVENT_STATE_OFF);
 }
 
 /*
@@ -2019,8 +2151,7 @@ void perf_event_disable_inatomic(struct perf_event *event)
 }
 
 static void perf_set_shadow_time(struct perf_event *event,
-				 struct perf_event_context *ctx,
-				 u64 tstamp)
+				 struct perf_event_context *ctx)
 {
 	/*
 	 * use the correct time source for the time snapshot
@@ -2048,9 +2179,9 @@ static void perf_set_shadow_time(struct perf_event *event,
 	 * is cleaner and simpler to understand.
 	 */
 	if (is_cgroup_event(event))
-		perf_cgroup_set_shadow_time(event, tstamp);
+		perf_cgroup_set_shadow_time(event, event->tstamp);
 	else
-		event->shadow_ctx_time = tstamp - ctx->timestamp;
+		event->shadow_ctx_time = event->tstamp - ctx->timestamp;
 }
 
 #define MAX_INTERRUPTS (~0ULL)
@@ -2063,7 +2194,6 @@ event_sched_in(struct perf_event *event,
 		 struct perf_cpu_context *cpuctx,
 		 struct perf_event_context *ctx)
 {
-	u64 tstamp = perf_event_time(event);
 	int ret = 0;
 
 	lockdep_assert_held(&ctx->lock);
@@ -2077,7 +2207,7 @@ event_sched_in(struct perf_event *event,
 	 * is visible.
 	 */
 	smp_wmb();
-	WRITE_ONCE(event->state, PERF_EVENT_STATE_ACTIVE);
+	perf_event_set_state(event, PERF_EVENT_STATE_ACTIVE);
 
 	/*
 	 * Unthrottle events, since we scheduled we might have missed several
@@ -2089,26 +2219,19 @@ event_sched_in(struct perf_event *event,
 		event->hw.interrupts = 0;
 	}
 
-	/*
-	 * The new state must be visible before we turn it on in the hardware:
-	 */
-	smp_wmb();
-
 	perf_pmu_disable(event->pmu);
 
-	perf_set_shadow_time(event, ctx, tstamp);
+	perf_set_shadow_time(event, ctx);
 
 	perf_log_itrace_start(event);
 
 	if (event->pmu->add(event, PERF_EF_START)) {
-		event->state = PERF_EVENT_STATE_INACTIVE;
+		perf_event_set_state(event, PERF_EVENT_STATE_INACTIVE);
 		event->oncpu = -1;
 		ret = -EAGAIN;
 		goto out;
 	}
 
-	event->tstamp_running += tstamp - event->tstamp_stopped;
-
 	if (!is_software_event(event))
 		cpuctx->active_oncpu++;
 	if (!ctx->nr_active++)
@@ -2132,8 +2255,6 @@ group_sched_in(struct perf_event *group_event,
 {
 	struct perf_event *event, *partial_group = NULL;
 	struct pmu *pmu = ctx->pmu;
-	u64 now = ctx->time;
-	bool simulate = false;
 
 	if (group_event->state == PERF_EVENT_STATE_OFF)
 		return 0;
@@ -2163,27 +2284,13 @@ group_sched_in(struct perf_event *group_event,
 	/*
 	 * Groups can be scheduled in as one unit only, so undo any
 	 * partial group before returning:
-	 * The events up to the failed event are scheduled out normally,
-	 * tstamp_stopped will be updated.
-	 *
-	 * The failed events and the remaining siblings need to have
-	 * their timings updated as if they had gone thru event_sched_in()
-	 * and event_sched_out(). This is required to get consistent timings
-	 * across the group. This also takes care of the case where the group
-	 * could never be scheduled by ensuring tstamp_stopped is set to mark
-	 * the time the event was actually stopped, such that time delta
-	 * calculation in update_event_times() is correct.
+	 * The events up to the failed event are scheduled out normally.
 	 */
 	list_for_each_entry(event, &group_event->sibling_list, group_entry) {
 		if (event == partial_group)
-			simulate = true;
+			break;
 
-		if (simulate) {
-			event->tstamp_running += now - event->tstamp_stopped;
-			event->tstamp_stopped = now;
-		} else {
-			event_sched_out(event, cpuctx, ctx);
-		}
+		event_sched_out(event, cpuctx, ctx);
 	}
 	event_sched_out(group_event, cpuctx, ctx);
 
@@ -2225,46 +2332,27 @@ static int group_can_go_on(struct perf_event *event,
 	return can_add_hw;
 }
 
-/*
- * Complement to update_event_times(). This computes the tstamp_* values to
- * continue 'enabled' state from @now, and effectively discards the time
- * between the prior tstamp_stopped and now (as we were in the OFF state, or
- * just switched (context) time base).
- *
- * This further assumes '@event->state == INACTIVE' (we just came from OFF) and
- * cannot have been scheduled in yet. And going into INACTIVE state means
- * '@event->tstamp_stopped = @now'.
- *
- * Thus given the rules of update_event_times():
- *
- *   total_time_enabled = tstamp_stopped - tstamp_enabled
- *   total_time_running = tstamp_stopped - tstamp_running
- *
- * We can insert 'tstamp_stopped == now' and reverse them to compute new
- * tstamp_* values.
- */
-static void __perf_event_enable_time(struct perf_event *event, u64 now)
+static int
+flexible_group_sched_in(struct perf_event *event,
+			struct perf_event_context *ctx,
+		        struct perf_cpu_context *cpuctx,
+			int *can_add_hw)
 {
-	WARN_ON_ONCE(event->state != PERF_EVENT_STATE_INACTIVE);
+	if (event->state <= PERF_EVENT_STATE_OFF || !event_filter_match(event))
+		return 0;
+
+	if (group_can_go_on(event, cpuctx, *can_add_hw))
+		if (group_sched_in(event, cpuctx, ctx))
+			*can_add_hw = 0;
 
-	event->tstamp_stopped = now;
-	event->tstamp_enabled = now - event->total_time_enabled;
-	event->tstamp_running = now - event->total_time_running;
+	return 1;
 }
 
 static void add_event_to_ctx(struct perf_event *event,
 			       struct perf_event_context *ctx)
 {
-	u64 tstamp = perf_event_time(event);
-
 	list_add_event(event, ctx);
 	perf_group_attach(event);
-	/*
-	 * We can be called with event->state == STATE_OFF when we create with
-	 * .disabled = 1. In that case the IOC_ENABLE will call this function.
-	 */
-	if (event->state == PERF_EVENT_STATE_INACTIVE)
-		__perf_event_enable_time(event, tstamp);
 }
 
 static void ctx_sched_out(struct perf_event_context *ctx,
@@ -2496,28 +2584,6 @@ perf_install_in_context(struct perf_event_context *ctx,
 }
 
 /*
- * Put a event into inactive state and update time fields.
- * Enabling the leader of a group effectively enables all
- * the group members that aren't explicitly disabled, so we
- * have to update their ->tstamp_enabled also.
- * Note: this works for group members as well as group leaders
- * since the non-leader members' sibling_lists will be empty.
- */
-static void __perf_event_mark_enabled(struct perf_event *event)
-{
-	struct perf_event *sub;
-	u64 tstamp = perf_event_time(event);
-
-	event->state = PERF_EVENT_STATE_INACTIVE;
-	__perf_event_enable_time(event, tstamp);
-	list_for_each_entry(sub, &event->sibling_list, group_entry) {
-		/* XXX should not be > INACTIVE if event isn't */
-		if (sub->state >= PERF_EVENT_STATE_INACTIVE)
-			__perf_event_enable_time(sub, tstamp);
-	}
-}
-
-/*
  * Cross CPU call to enable a performance event
  */
 static void __perf_event_enable(struct perf_event *event,
@@ -2535,14 +2601,12 @@ static void __perf_event_enable(struct perf_event *event,
 	if (ctx->is_active)
 		ctx_sched_out(ctx, cpuctx, EVENT_TIME);
 
-	__perf_event_mark_enabled(event);
+	perf_event_set_state(event, PERF_EVENT_STATE_INACTIVE);
 
 	if (!ctx->is_active)
 		return;
 
 	if (!event_filter_match(event)) {
-		if (is_cgroup_event(event))
-			perf_cgroup_defer_enabled(event);
 		ctx_sched_in(ctx, cpuctx, EVENT_TIME, current);
 		return;
 	}
@@ -2750,6 +2814,7 @@ static void ctx_sched_out(struct perf_event_context *ctx,
 			  struct perf_cpu_context *cpuctx,
 			  enum event_type_t event_type)
 {
+	int sw = -1, cpu = smp_processor_id();
 	int is_active = ctx->is_active;
 	struct perf_event *event;
 
@@ -2798,12 +2863,20 @@ static void ctx_sched_out(struct perf_event_context *ctx,
 
 	perf_pmu_disable(ctx->pmu);
 	if (is_active & EVENT_PINNED) {
-		list_for_each_entry(event, &ctx->pinned_groups, group_entry)
+		perf_event_groups_for_each_cpu(event, cpu,
+				&ctx->pinned_groups, group_node)
+			group_sched_out(event, cpuctx, ctx);
+		perf_event_groups_for_each_cpu(event, sw,
+				&ctx->pinned_groups, group_node)
 			group_sched_out(event, cpuctx, ctx);
 	}
 
 	if (is_active & EVENT_FLEXIBLE) {
-		list_for_each_entry(event, &ctx->flexible_groups, group_entry)
+		perf_event_groups_for_each_cpu(event, cpu,
+				&ctx->flexible_groups, group_node)
+			group_sched_out(event, cpuctx, ctx);
+		perf_event_groups_for_each_cpu(event, sw,
+				&ctx->flexible_groups, group_node)
 			group_sched_out(event, cpuctx, ctx);
 	}
 	perf_pmu_enable(ctx->pmu);
@@ -2862,18 +2935,10 @@ static void __perf_event_sync_stat(struct perf_event *event,
 	 * we know the event must be on the current CPU, therefore we
 	 * don't need to use it.
 	 */
-	switch (event->state) {
-	case PERF_EVENT_STATE_ACTIVE:
+	if (event->state == PERF_EVENT_STATE_ACTIVE)
 		event->pmu->read(event);
-		/* fall-through */
 
-	case PERF_EVENT_STATE_INACTIVE:
-		update_event_times(event);
-		break;
-
-	default:
-		break;
-	}
+	perf_event_update_time(event);
 
 	/*
 	 * In order to keep per-task stats reliable we need to flip the event
@@ -3102,28 +3167,28 @@ static void
 ctx_pinned_sched_in(struct perf_event_context *ctx,
 		    struct perf_cpu_context *cpuctx)
 {
+	int sw = -1, cpu = smp_processor_id();
 	struct perf_event *event;
+	int can_add_hw;
+
+	perf_event_groups_for_each_cpu(event, sw,
+			&ctx->pinned_groups, group_node) {
+		can_add_hw = 1;
+		if (flexible_group_sched_in(event, ctx, cpuctx, &can_add_hw)) {
+			if (event->state == PERF_EVENT_STATE_INACTIVE)
+				perf_event_set_state(event,
+						PERF_EVENT_STATE_ERROR);
+		}
+	}
 
-	list_for_each_entry(event, &ctx->pinned_groups, group_entry) {
-		if (event->state <= PERF_EVENT_STATE_OFF)
-			continue;
-		if (!event_filter_match(event))
-			continue;
-
-		/* may need to reset tstamp_enabled */
-		if (is_cgroup_event(event))
-			perf_cgroup_mark_enabled(event, ctx);
-
-		if (group_can_go_on(event, cpuctx, 1))
-			group_sched_in(event, cpuctx, ctx);
-
-		/*
-		 * If this pinned group hasn't been scheduled,
-		 * put it in error state.
-		 */
-		if (event->state == PERF_EVENT_STATE_INACTIVE) {
-			update_group_times(event);
-			event->state = PERF_EVENT_STATE_ERROR;
+	can_add_hw = 1;
+	perf_event_groups_for_each_cpu(event, cpu,
+			&ctx->pinned_groups, group_node) {
+		can_add_hw = 1;
+		if (flexible_group_sched_in(event, ctx, cpuctx, &can_add_hw)) {
+			if (event->state == PERF_EVENT_STATE_INACTIVE)
+				perf_event_set_state(event,
+						PERF_EVENT_STATE_ERROR);
 		}
 	}
 }
@@ -3132,29 +3197,18 @@ static void
 ctx_flexible_sched_in(struct perf_event_context *ctx,
 		      struct perf_cpu_context *cpuctx)
 {
+	int sw = -1, cpu = smp_processor_id();
 	struct perf_event *event;
 	int can_add_hw = 1;
 
-	list_for_each_entry(event, &ctx->flexible_groups, group_entry) {
-		/* Ignore events in OFF or ERROR state */
-		if (event->state <= PERF_EVENT_STATE_OFF)
-			continue;
-		/*
-		 * Listen to the 'cpu' scheduling filter constraint
-		 * of events:
-		 */
-		if (!event_filter_match(event))
-			continue;
-
-		/* may need to reset tstamp_enabled */
-		if (is_cgroup_event(event))
-			perf_cgroup_mark_enabled(event, ctx);
+	perf_event_groups_for_each_cpu(event, sw,
+			&ctx->flexible_groups, group_node)
+		flexible_group_sched_in(event, ctx, cpuctx, &can_add_hw);
 
-		if (group_can_go_on(event, cpuctx, can_add_hw)) {
-			if (group_sched_in(event, cpuctx, ctx))
-				can_add_hw = 0;
-		}
-	}
+	can_add_hw = 1;
+	perf_event_groups_for_each_cpu(event, cpu,
+			&ctx->flexible_groups, group_node)
+		flexible_group_sched_in(event, ctx, cpuctx, &can_add_hw);
 }
 
 static void
@@ -3235,7 +3289,7 @@ static void perf_event_context_sched_in(struct perf_event_context *ctx,
 	 * However, if task's ctx is not carrying any pinned
 	 * events, no need to flip the cpuctx's events around.
 	 */
-	if (!list_empty(&ctx->pinned_groups))
+	if (!RB_EMPTY_ROOT(&ctx->pinned_groups.tree))
 		cpu_ctx_sched_out(cpuctx, EVENT_FLEXIBLE);
 	perf_event_sched_in(cpuctx, ctx, task);
 	perf_pmu_enable(ctx->pmu);
@@ -3472,8 +3526,12 @@ static void rotate_ctx(struct perf_event_context *ctx)
 	 * Rotate the first entry last of non-pinned groups. Rotation might be
 	 * disabled by the inheritance code.
 	 */
-	if (!ctx->rotate_disable)
-		list_rotate_left(&ctx->flexible_groups);
+	if (!ctx->rotate_disable) {
+		int sw = -1, cpu = smp_processor_id();
+
+		perf_event_groups_rotate(&ctx->flexible_groups, sw);
+		perf_event_groups_rotate(&ctx->flexible_groups, cpu);
+	}
 }
 
 static int perf_rotate_context(struct perf_cpu_context *cpuctx)
@@ -3541,7 +3599,7 @@ static int event_enable_on_exec(struct perf_event *event,
 	if (event->state >= PERF_EVENT_STATE_INACTIVE)
 		return 0;
 
-	__perf_event_mark_enabled(event);
+	perf_event_set_state(event, PERF_EVENT_STATE_INACTIVE);
 
 	return 1;
 }
@@ -3590,12 +3648,6 @@ static void perf_event_enable_on_exec(int ctxn)
 		put_ctx(clone_ctx);
 }
 
-struct perf_read_data {
-	struct perf_event *event;
-	bool group;
-	int ret;
-};
-
 static int __perf_event_read_cpu(struct perf_event *event, int event_cpu)
 {
 	u16 local_pkg, event_pkg;
@@ -3613,64 +3665,6 @@ static int __perf_event_read_cpu(struct perf_event *event, int event_cpu)
 	return event_cpu;
 }
 
-/*
- * Cross CPU call to read the hardware event
- */
-static void __perf_event_read(void *info)
-{
-	struct perf_read_data *data = info;
-	struct perf_event *sub, *event = data->event;
-	struct perf_event_context *ctx = event->ctx;
-	struct perf_cpu_context *cpuctx = __get_cpu_context(ctx);
-	struct pmu *pmu = event->pmu;
-
-	/*
-	 * If this is a task context, we need to check whether it is
-	 * the current task context of this cpu.  If not it has been
-	 * scheduled out before the smp call arrived.  In that case
-	 * event->count would have been updated to a recent sample
-	 * when the event was scheduled out.
-	 */
-	if (ctx->task && cpuctx->task_ctx != ctx)
-		return;
-
-	raw_spin_lock(&ctx->lock);
-	if (ctx->is_active) {
-		update_context_time(ctx);
-		update_cgrp_time_from_event(event);
-	}
-
-	update_event_times(event);
-	if (event->state != PERF_EVENT_STATE_ACTIVE)
-		goto unlock;
-
-	if (!data->group) {
-		pmu->read(event);
-		data->ret = 0;
-		goto unlock;
-	}
-
-	pmu->start_txn(pmu, PERF_PMU_TXN_READ);
-
-	pmu->read(event);
-
-	list_for_each_entry(sub, &event->sibling_list, group_entry) {
-		update_event_times(sub);
-		if (sub->state == PERF_EVENT_STATE_ACTIVE) {
-			/*
-			 * Use sibling's PMU rather than @event's since
-			 * sibling could be on different (eg: software) PMU.
-			 */
-			sub->pmu->read(sub);
-		}
-	}
-
-	data->ret = pmu->commit_txn(pmu);
-
-unlock:
-	raw_spin_unlock(&ctx->lock);
-}
-
 static inline u64 perf_event_count(struct perf_event *event)
 {
 	return local64_read(&event->count) + atomic64_read(&event->child_count);
@@ -3733,63 +3727,81 @@ int perf_event_read_local(struct perf_event *event, u64 *value)
 	return ret;
 }
 
-static int perf_event_read(struct perf_event *event, bool group)
+struct perf_read_data {
+	struct perf_event *event;
+	bool group;
+	int ret;
+};
+
+static void __perf_event_read(struct perf_event *event,
+			      struct perf_cpu_context *cpuctx,
+			      struct perf_event_context *ctx,
+			      void *data)
 {
-	int event_cpu, ret = 0;
+	struct perf_read_data *prd = data;
+	struct pmu *pmu = event->pmu;
+	struct perf_event *sibling;
 
-	/*
-	 * If event is enabled and currently active on a CPU, update the
-	 * value in the event structure:
-	 */
-	if (event->state == PERF_EVENT_STATE_ACTIVE) {
-		struct perf_read_data data = {
-			.event = event,
-			.group = group,
-			.ret = 0,
-		};
+	if (ctx->is_active & EVENT_TIME) {
+		update_context_time(ctx);
+		update_cgrp_time_from_cpuctx(cpuctx);
+	}
 
-		event_cpu = READ_ONCE(event->oncpu);
-		if ((unsigned)event_cpu >= nr_cpu_ids)
-			return 0;
+	perf_event_update_time(event);
+	if (prd->group)
+		perf_event_update_sibling_time(event);
 
-		preempt_disable();
-		event_cpu = __perf_event_read_cpu(event, event_cpu);
+	if (event->state != PERF_EVENT_STATE_ACTIVE)
+		return;
 
+	if (!prd->group) {
+		pmu->read(event);
+		prd->ret = 0;
+		return;
+	}
+
+	pmu->start_txn(pmu, PERF_PMU_TXN_READ);
+
+	pmu->read(event);
+	list_for_each_entry(sibling, &event->sibling_list, group_entry) {
+		if (sibling->state == PERF_EVENT_STATE_ACTIVE) {
+			/*
+			 * Use sibling's PMU rather than @event's since
+			 * sibling could be on different (eg: software) PMU.
+			 */
+			sibling->pmu->read(sibling);
+		}
+	}
+
+	prd->ret = pmu->commit_txn(pmu);
+}
+
+static int perf_event_read(struct perf_event *event, bool group)
+{
+	struct perf_read_data prd = {
+		.event = event,
+		.group = group,
+		.ret = 0,
+	};
+
+	if (event->ctx->task) {
+		event_function_call(event, __perf_event_read, &prd);
+	} else {
 		/*
-		 * Purposely ignore the smp_call_function_single() return
-		 * value.
-		 *
-		 * If event_cpu isn't a valid CPU it means the event got
-		 * scheduled out and that will have updated the event count.
-		 *
-		 * Therefore, either way, we'll have an up-to-date event count
-		 * after this.
-		 */
-		(void)smp_call_function_single(event_cpu, __perf_event_read, &data, 1);
-		preempt_enable();
-		ret = data.ret;
-	} else if (event->state == PERF_EVENT_STATE_INACTIVE) {
-		struct perf_event_context *ctx = event->ctx;
-		unsigned long flags;
-
-		raw_spin_lock_irqsave(&ctx->lock, flags);
-		/*
-		 * may read while context is not active
-		 * (e.g., thread is blocked), in that case
-		 * we cannot update context time
+		 * For uncore events (which are per definition per-cpu)
+		 * allow a different read CPU from event->cpu.
 		 */
-		if (ctx->is_active) {
-			update_context_time(ctx);
-			update_cgrp_time_from_event(event);
-		}
-		if (group)
-			update_group_times(event);
-		else
-			update_event_times(event);
-		raw_spin_unlock_irqrestore(&ctx->lock, flags);
+		struct event_function_struct efs = {
+			.event = event,
+			.func = __perf_event_read,
+			.data = &prd,
+		};
+		int cpu = __perf_event_read_cpu(event, event->cpu);
+
+		cpu_function_call(cpu, event_function, &efs);
 	}
 
-	return ret;
+	return prd.ret;
 }
 
 /*
@@ -3800,8 +3812,8 @@ static void __perf_event_init_context(struct perf_event_context *ctx)
 	raw_spin_lock_init(&ctx->lock);
 	mutex_init(&ctx->mutex);
 	INIT_LIST_HEAD(&ctx->active_ctx_list);
-	INIT_LIST_HEAD(&ctx->pinned_groups);
-	INIT_LIST_HEAD(&ctx->flexible_groups);
+	perf_event_groups_init(&ctx->pinned_groups);
+	perf_event_groups_init(&ctx->flexible_groups);
 	INIT_LIST_HEAD(&ctx->event_list);
 	atomic_set(&ctx->refcount, 1);
 }
@@ -4388,7 +4400,7 @@ static int perf_release(struct inode *inode, struct file *file)
 	return 0;
 }
 
-u64 perf_event_read_value(struct perf_event *event, u64 *enabled, u64 *running)
+static u64 __perf_event_read_value(struct perf_event *event, u64 *enabled, u64 *running)
 {
 	struct perf_event *child;
 	u64 total = 0;
@@ -4416,6 +4428,18 @@ u64 perf_event_read_value(struct perf_event *event, u64 *enabled, u64 *running)
 
 	return total;
 }
+
+u64 perf_event_read_value(struct perf_event *event, u64 *enabled, u64 *running)
+{
+	struct perf_event_context *ctx;
+	u64 count;
+
+	ctx = perf_event_ctx_lock(event);
+	count = __perf_event_read_value(event, enabled, running);
+	perf_event_ctx_unlock(event, ctx);
+
+	return count;
+}
 EXPORT_SYMBOL_GPL(perf_event_read_value);
 
 static int __perf_read_group_add(struct perf_event *leader,
@@ -4431,6 +4455,8 @@ static int __perf_read_group_add(struct perf_event *leader,
 	if (ret)
 		return ret;
 
+	raw_spin_lock_irqsave(&ctx->lock, flags);
+
 	/*
 	 * Since we co-schedule groups, {enabled,running} times of siblings
 	 * will be identical to those of the leader, so we only publish one
@@ -4453,8 +4479,6 @@ static int __perf_read_group_add(struct perf_event *leader,
 	if (read_format & PERF_FORMAT_ID)
 		values[n++] = primary_event_id(leader);
 
-	raw_spin_lock_irqsave(&ctx->lock, flags);
-
 	list_for_each_entry(sub, &leader->sibling_list, group_entry) {
 		values[n++] += perf_event_count(sub);
 		if (read_format & PERF_FORMAT_ID)
@@ -4518,7 +4542,7 @@ static int perf_read_one(struct perf_event *event,
 	u64 values[4];
 	int n = 0;
 
-	values[n++] = perf_event_read_value(event, &enabled, &running);
+	values[n++] = __perf_event_read_value(event, &enabled, &running);
 	if (read_format & PERF_FORMAT_TOTAL_TIME_ENABLED)
 		values[n++] = enabled;
 	if (read_format & PERF_FORMAT_TOTAL_TIME_RUNNING)
@@ -4897,8 +4921,7 @@ static void calc_timer_values(struct perf_event *event,
 
 	*now = perf_clock();
 	ctx_time = event->shadow_ctx_time + *now;
-	*enabled = ctx_time - event->tstamp_enabled;
-	*running = ctx_time - event->tstamp_running;
+	__perf_update_times(event, ctx_time, enabled, running);
 }
 
 static void perf_event_init_userpage(struct perf_event *event)
@@ -9456,6 +9479,7 @@ perf_event_alloc(struct perf_event_attr *attr, int cpu,
 	INIT_LIST_HEAD(&event->group_entry);
 	INIT_LIST_HEAD(&event->event_entry);
 	INIT_LIST_HEAD(&event->sibling_list);
+	init_event_group(event);
 	INIT_LIST_HEAD(&event->rb_entry);
 	INIT_LIST_HEAD(&event->active_entry);
 	INIT_LIST_HEAD(&event->addr_filters.list);
@@ -10516,7 +10540,7 @@ perf_event_exit_event(struct perf_event *child_event,
 	if (parent_event)
 		perf_group_detach(child_event);
 	list_del_event(child_event, child_ctx);
-	child_event->state = PERF_EVENT_STATE_EXIT; /* is_event_hup() */
+	perf_event_set_state(child_event, PERF_EVENT_STATE_EXIT); /* is_event_hup() */
 	raw_spin_unlock_irq(&child_ctx->lock);
 
 	/*
@@ -10754,7 +10778,7 @@ inherit_event(struct perf_event *parent_event,
 	      struct perf_event *group_leader,
 	      struct perf_event_context *child_ctx)
 {
-	enum perf_event_active_state parent_state = parent_event->state;
+	enum perf_event_state parent_state = parent_event->state;
 	struct perf_event *child_event;
 	unsigned long flags;
 
@@ -10966,7 +10990,7 @@ static int perf_event_init_context(struct task_struct *child, int ctxn)
 	 * We dont have to disable NMIs - we are only looking at
 	 * the list, not manipulating it:
 	 */
-	list_for_each_entry(event, &parent_ctx->pinned_groups, group_entry) {
+	perf_event_groups_for_each(event, &parent_ctx->pinned_groups, group_node) {
 		ret = inherit_task_group(event, parent, parent_ctx,
 					 child, ctxn, &inherited_all);
 		if (ret)
@@ -10982,7 +11006,7 @@ static int perf_event_init_context(struct task_struct *child, int ctxn)
 	parent_ctx->rotate_disable = 1;
 	raw_spin_unlock_irqrestore(&parent_ctx->lock, flags);
 
-	list_for_each_entry(event, &parent_ctx->flexible_groups, group_entry) {
+	perf_event_groups_for_each(event, &parent_ctx->flexible_groups, group_node) {
 		ret = inherit_task_group(event, parent, parent_ctx,
 					 child, ctxn, &inherited_all);
 		if (ret)
@@ -11090,6 +11114,7 @@ static void __perf_event_exit_context(void *__info)
 	struct perf_event *event;
 
 	raw_spin_lock(&ctx->lock);
+	ctx_sched_out(ctx, cpuctx, EVENT_TIME);
 	list_for_each_entry(event, &ctx->event_list, event_entry)
 		__perf_remove_from_context(event, cpuctx, ctx, (void *)DETACH_GROUP);
 	raw_spin_unlock(&ctx->lock);

^ permalink raw reply related	[flat|nested] 76+ messages in thread

* Re: [RFC][PATCH] perf: Rewrite enabled/running timekeeping
  2017-09-05 10:17                     ` Alexey Budankov
@ 2017-09-05 11:19                       ` Peter Zijlstra
  2017-09-11  6:55                         ` Alexey Budankov
  2017-09-05 12:06                       ` Alexey Budankov
  1 sibling, 1 reply; 76+ messages in thread
From: Peter Zijlstra @ 2017-09-05 11:19 UTC (permalink / raw)
  To: Alexey Budankov
  Cc: Ingo Molnar, Arnaldo Carvalho de Melo, Alexander Shishkin,
	Andi Kleen, Kan Liang, Dmitri Prokhorov, Valery Cherepennikov,
	Mark Rutland, Stephane Eranian, David Carrillo-Cisneros,
	linux-kernel, Vince Weaver, Thomas Gleixner

On Tue, Sep 05, 2017 at 01:17:39PM +0300, Alexey Budankov wrote:
> However we can't completely get rid of whole tree iterations because of 
> inheritance code on forks in perf_event_init_context() here:

Right, fork() / inherit needs to iterate the full thing, nothing to be
done about that.

I'll go make proper patches for that timekeeping rewrite and then have a
look at your patches.

^ permalink raw reply	[flat|nested] 76+ messages in thread
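
The contrast being discussed here -- the mux path only needs the groups keyed to
the current CPU (plus the cpu == -1 software groups), while fork()/inherit must
still visit every group -- can be sketched in a few lines of userspace C. This is
a minimal illustration only, assuming a flat array in place of the kernel's
rbtree; the helper names and sample groups below are invented and are not the
kernel implementation:

/* group.c -- illustrative only; build with: cc -std=c99 -Wall group.c */
#include <stdio.h>

struct group {
	int cpu;			/* -1 means "software, any cpu" */
	unsigned long long idx;		/* insertion order, like group_index */
	const char *name;
};

static unsigned long long next_idx = 5;	/* highest idx used in main() below */

/* "first group for @cpu": the matching group with the lowest idx */
static struct group *first_for_cpu(struct group *g, int n, int cpu)
{
	struct group *best = NULL;

	for (int i = 0; i < n; i++)
		if (g[i].cpu == cpu && (!best || g[i].idx < best->idx))
			best = &g[i];
	return best;
}

/*
 * Rotation: give the current head of @cpu's sub-list a fresh, largest
 * index so it moves to that sub-list's tail; other cpus are untouched.
 */
static void rotate_cpu(struct group *g, int n, int cpu)
{
	struct group *head = first_for_cpu(g, n, cpu);

	if (head)
		head->idx = ++next_idx;
}

int main(void)
{
	struct group g[] = {
		{ -1, 1, "sw-A"   },
		{  0, 2, "cpu0-A" },
		{  1, 3, "cpu1-A" },
		{  0, 4, "cpu0-B" },
		{  1, 5, "cpu1-B" },
	};
	int n = sizeof(g) / sizeof(g[0]);

	/* hot path on cpu 0: only the cpu == 0 and cpu == -1 keys are touched */
	rotate_cpu(g, n, 0);
	rotate_cpu(g, n, -1);
	printf("next on cpu 0: %s\n", first_for_cpu(g, n, 0)->name);	/* cpu0-B */
	printf("cpu 1 head unchanged: %s\n", first_for_cpu(g, n, 1)->name); /* cpu1-A */

	/* inherit path: still has to visit every group, whatever its cpu key */
	for (int i = 0; i < n; i++)
		printf("visit %s (cpu %d, idx %llu)\n", g[i].name, g[i].cpu, g[i].idx);

	return 0;
}
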

* Re: [RFC][PATCH] perf: Rewrite enabled/running timekeeping
  2017-09-05 10:17                     ` Alexey Budankov
  2017-09-05 11:19                       ` Peter Zijlstra
@ 2017-09-05 12:06                       ` Alexey Budankov
  2017-09-05 12:59                         ` Peter Zijlstra
  2017-09-05 16:03                         ` Peter Zijlstra
  1 sibling, 2 replies; 76+ messages in thread
From: Alexey Budankov @ 2017-09-05 12:06 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Arnaldo Carvalho de Melo, Alexander Shishkin,
	Andi Kleen, Kan Liang, Dmitri Prokhorov, Valery Cherepennikov,
	Mark Rutland, Stephane Eranian, David Carrillo-Cisneros,
	linux-kernel, Vince Weaver, Thomas Gleixner

Hi,
On 05.09.2017 13:17, Alexey Budankov wrote:
> On 04.09.2017 18:41, Peter Zijlstra wrote:
>> On Mon, Sep 04, 2017 at 05:56:06PM +0300, Alexey Budankov wrote:
>>> On 04.09.2017 15:08, Peter Zijlstra wrote:
>>>> On Mon, Sep 04, 2017 at 01:46:45PM +0300, Alexey Budankov wrote:
>>>>>> So the below completely rewrites timekeeping (and probably breaks
>>>>>> world) but does away with the need to touch events that don't get
>>>>>> scheduled.
>>>>>
>>>>> We still need and do iterate thru all events at some points e.g. on context switches.
>>>>
>>>> Why do we _need_ to?
>>>
>>> We do so in the current implementation with several tstamp_* fields.
>>
>> Right, but we want to stop doing so asap :-)
>>
> 
> Well, I see your point :). It turns out that with straightforward timekeeping 
> we can also avoid whole-tree iteration on context switches, in addition to the 
> RB-tree-based iterations and rotations on hrtimer interrupt. That brings even 
> more performance, and the rotation switch can be avoided.
> 
> However we can't completely get rid of whole tree iterations because of 
> inheritance code on forks in perf_event_init_context() here:
> 
> perf_event_groups_for_each(event, &parent_ctx->pinned_groups, group_node) {
> 		ret = inherit_task_group(event, parent, parent_ctx,
> 					 child, ctxn, &inherited_all);
> 		if (ret)
> 			goto out_unlock;
> }
> 
> and here:
> 
> perf_event_groups_for_each(event, &parent_ctx->flexible_groups, group_node) {
> 		ret = inherit_task_group(event, parent, parent_ctx,
> 					 child, ctxn, &inherited_all);
> 		if (ret)
> 			goto out_unlock;
> }
> 
> Below is the patch set put on top of the timekeeping rework. 
> 
> It is for tip/master branch.
> 
> ---
>  include/linux/perf_event.h |  40 +--
>  kernel/events/core.c       | 839 +++++++++++++++++++++++----------------------
>  2 files changed, 448 insertions(+), 431 deletions(-)
> 

Got this under perf_fuzzer on Xeon Phi (KNL):

[ 6614.226280] ------------[ cut here ]------------
[ 6614.226305] WARNING: CPU: 45 PID: 43385 at kernel/events/core.c:239 event_function+0xb3/0xe0
[ 6614.226310] Modules linked in: btrfs xor raid6_pq ufs hfsplus hfs minix vfat msdos fat jfs xfs reiserfs binfmt_misc xt_CHECKSUM iptable_mangle fuse ipt_MASQUERADE nf_nat_masquerade_ipv4 iptable_nat nf_nat_ipv4 nf_nat nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack libcrc32c tun bridge stp llc ebtable_filter ebtables ip6table_filter ip6_tables cmac arc4 md4 nls_utf8 nfsv3 cifs rpcsec_gss_krb5 nfsv4 nfs ccm dns_resolver fscache rpcrdma ib_isert iscsi_target_mod ib_iser libiscsi scsi_transport_iscsi ib_srpt target_core_mod ib_srp scsi_transport_srp ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm intel_rapl hfi1 sb_edac nfsd x86_pkg_temp_thermal intel_powerclamp rdmavt ipmi_ssif coretemp iTCO_wdt joydev ib_core crct10dif_pclmul iTCO_vendor_support crc32_pclmul ipmi_si ghash_clmulni_intel
[ 6614.226444]  auth_rpcgss ipmi_devintf mei_me tpm_tis intel_cstate mei tpm_tis_core intel_uncore pcspkr intel_rapl_perf tpm ipmi_msghandler shpchp nfs_acl lpc_ich lockd i2c_i801 wmi grace acpi_power_meter acpi_pad sunrpc mgag200 drm_kms_helper ttm drm igb crc32c_intel ptp pps_core dca i2c_algo_bit
[ 6614.226501] CPU: 45 PID: 43385 Comm: perf_fuzzer Not tainted 4.13.0-v11.1.2+ #5
[ 6614.226506] Hardware name: Intel Corporation S7200AP/S7200AP, BIOS S72C610.86B.01.01.0190.080520162104 08/05/2016
[ 6614.226511] task: ffff8d0f5866c000 task.stack: ffffa6b05aae8000
[ 6614.226517] RIP: 0010:event_function+0xb3/0xe0
[ 6614.226522] RSP: 0018:ffffa6b05aaebc30 EFLAGS: 00010087
[ 6614.226528] RAX: 0000000000000000 RBX: ffffc6b03c545a90 RCX: ffffa6b05aaebd00
[ 6614.226532] RDX: 0000000000000000 RSI: ffffa6b05aaebca0 RDI: ffffc6b03c545a98
[ 6614.226536] RBP: ffffa6b05aaebc58 R08: 000000000001f8e0 R09: ffff8d0fa99fd120
[ 6614.226540] R10: ffff8d0fbb407900 R11: 0000000000000000 R12: ffffc6b03ba05a90
[ 6614.226544] R13: 0000000000000000 R14: ffffa6b05aaebd48 R15: ffff8d0f9c48a000
[ 6614.226550] FS:  00007f05cff85740(0000) GS:ffff8d0fbc540000(0000) knlGS:0000000000000000
[ 6614.226555] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 6614.226559] CR2: 00000000007211c0 CR3: 0000002f269fa000 CR4: 00000000001407e0
[ 6614.226563] Call Trace:
[ 6614.226577]  remote_function+0x3b/0x50
[ 6614.226585]  generic_exec_single+0x9a/0xd0
[ 6614.226592]  smp_call_function_single+0xc8/0x100
[ 6614.226599]  cpu_function_call+0x43/0x60
[ 6614.226606]  ? cpu_clock_event_read+0x10/0x10
[ 6614.226612]  perf_event_read+0xc7/0xe0
[ 6614.226619]  ? perf_install_in_context+0xf0/0xf0
[ 6614.226625]  __perf_read_group_add+0x25/0x180
[ 6614.226632]  perf_read+0xcb/0x2b0
[ 6614.226640]  __vfs_read+0x37/0x160
[ 6614.226648]  ? security_file_permission+0x9d/0xc0
[ 6614.226655]  vfs_read+0x8c/0x130
[ 6614.226661]  SyS_read+0x55/0xc0
[ 6614.226670]  do_syscall_64+0x67/0x180
[ 6614.226678]  entry_SYSCALL64_slow_path+0x25/0x25
[ 6614.226684] RIP: 0033:0x7f05cfaad980
[ 6614.226688] RSP: 002b:00007fff1b562b48 EFLAGS: 00000246 ORIG_RAX: 0000000000000000
[ 6614.226694] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f05cfaad980
[ 6614.226699] RDX: 0000000000004fcf RSI: 0000000000735680 RDI: 0000000000000003
[ 6614.226703] RBP: 00007fff1b562b60 R08: 00007f05cfd800f4 R09: 00007f05cfd80140
[ 6614.226707] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000401980
[ 6614.226711] R13: 00007fff1b564f60 R14: 0000000000000000 R15: 0000000000000000
[ 6614.226716] Code: e2 48 89 de 4c 89 ff 41 ff 56 08 31 c0 4d 85 ed 74 05 41 c6 45 08 00 c6 43 08 00 5b 41 5c 41 5d 41 5e 41 5f 5d c3 49 39 dc 74 cf <0f> ff eb cb 0f ff 0f 1f 80 00 00 00 00 e9 78 ff ff ff 0f ff 66 
[ 6614.226812] ---[ end trace ff12704813059a28 ]---

static int event_function(void *info)
{
	struct event_function_struct *efs = info;
	struct perf_event *event = efs->event;
	struct perf_event_context *ctx = event->ctx;
	struct perf_cpu_context *cpuctx = __get_cpu_context(ctx);
	struct perf_event_context *task_ctx = cpuctx->task_ctx;
	int ret = 0;

	WARN_ON_ONCE(!irqs_disabled());

	perf_ctx_lock(cpuctx, task_ctx);
	/*
	 * Since we do the IPI call without holding ctx->lock things can have
	 * changed, double check we hit the task we set out to hit.
	 */
	if (ctx->task) {
		if (ctx->task != current) {
			ret = -ESRCH;
			goto unlock;
		}

		/*
		 * We only use event_function_call() on established contexts,
		 * and event_function() is only ever called when active (or
		 * rather, we'll have bailed in task_function_call() or the
		 * above ctx->task != current test), therefore we must have
		 * ctx->is_active here.
		 */
		WARN_ON_ONCE(!ctx->is_active);
		/*
		 * And since we have ctx->is_active, cpuctx->task_ctx must
		 * match.
		 */
		WARN_ON_ONCE(task_ctx != ctx);
	} else {
===>		WARN_ON_ONCE(&cpuctx->ctx != ctx);
	}

	efs->func(event, cpuctx, ctx, efs->data);
unlock:
	perf_ctx_unlock(cpuctx, task_ctx);

	return ret;
}

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [RFC][PATCH] perf: Rewrite enabled/running timekeeping
  2017-09-05 12:06                       ` Alexey Budankov
@ 2017-09-05 12:59                         ` Peter Zijlstra
  2017-09-05 16:03                         ` Peter Zijlstra
  1 sibling, 0 replies; 76+ messages in thread
From: Peter Zijlstra @ 2017-09-05 12:59 UTC (permalink / raw)
  To: Alexey Budankov
  Cc: Ingo Molnar, Arnaldo Carvalho de Melo, Alexander Shishkin,
	Andi Kleen, Kan Liang, Dmitri Prokhorov, Valery Cherepennikov,
	Mark Rutland, Stephane Eranian, David Carrillo-Cisneros,
	linux-kernel, Vince Weaver, Thomas Gleixner

On Tue, Sep 05, 2017 at 03:06:26PM +0300, Alexey Budankov wrote:
> [ 6614.226305] WARNING: CPU: 45 PID: 43385 at kernel/events/core.c:239 event_function+0xb3/0xe0

> [ 6614.226563] Call Trace:
> [ 6614.226577]  remote_function+0x3b/0x50
> [ 6614.226585]  generic_exec_single+0x9a/0xd0
> [ 6614.226592]  smp_call_function_single+0xc8/0x100
> [ 6614.226599]  cpu_function_call+0x43/0x60
> [ 6614.226606]  ? cpu_clock_event_read+0x10/0x10
> [ 6614.226612]  perf_event_read+0xc7/0xe0
> [ 6614.226619]  ? perf_install_in_context+0xf0/0xf0
> [ 6614.226625]  __perf_read_group_add+0x25/0x180
> [ 6614.226632]  perf_read+0xcb/0x2b0
> [ 6614.226640]  __vfs_read+0x37/0x160
> [ 6614.226648]  ? security_file_permission+0x9d/0xc0
> [ 6614.226655]  vfs_read+0x8c/0x130
> [ 6614.226661]  SyS_read+0x55/0xc0
> [ 6614.226670]  do_syscall_64+0x67/0x180
> [ 6614.226678]  entry_SYSCALL64_slow_path+0x25/0x25


Hmm.. must be the perf_event_read() rewrite, let me stare at that.

Thanks for testing.

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [RFC][PATCH] perf: Rewrite enabled/running timekeeping
  2017-09-05 12:06                       ` Alexey Budankov
  2017-09-05 12:59                         ` Peter Zijlstra
@ 2017-09-05 16:03                         ` Peter Zijlstra
  2017-09-06 13:48                           ` Alexey Budankov
  2017-09-08  8:47                           ` Alexey Budankov
  1 sibling, 2 replies; 76+ messages in thread
From: Peter Zijlstra @ 2017-09-05 16:03 UTC (permalink / raw)
  To: Alexey Budankov
  Cc: Ingo Molnar, Arnaldo Carvalho de Melo, Alexander Shishkin,
	Andi Kleen, Kan Liang, Dmitri Prokhorov, Valery Cherepennikov,
	Mark Rutland, Stephane Eranian, David Carrillo-Cisneros,
	linux-kernel, Vince Weaver, Thomas Gleixner

On Tue, Sep 05, 2017 at 03:06:26PM +0300, Alexey Budankov wrote:
> [ 6614.226305] WARNING: CPU: 45 PID: 43385 at kernel/events/core.c:239 event_function+0xb3/0xe0

I think I avoided that problem by not radically rewriting
perf_event_read() but fixing it instead:

  https://git.kernel.org/pub/scm/linux/kernel/git/peterz/queue.git/commit/?h=perf/core&id=8ad650955ede95e4a6fd6afbda2a0b37d4af9c29

Full tree at:

  git://git.kernel.org/pub/scm/linux/kernel/git/peterz/queue.git perf/core


Very minimally tested so far, I'll continue tomorrow.

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [RFC][PATCH] perf: Rewrite enabled/running timekeeping
  2017-09-05 16:03                         ` Peter Zijlstra
@ 2017-09-06 13:48                           ` Alexey Budankov
  2017-09-08  8:47                           ` Alexey Budankov
  1 sibling, 0 replies; 76+ messages in thread
From: Alexey Budankov @ 2017-09-06 13:48 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Arnaldo Carvalho de Melo, Alexander Shishkin,
	Andi Kleen, Kan Liang, Dmitri Prokhorov, Valery Cherepennikov,
	Mark Rutland, Stephane Eranian, David Carrillo-Cisneros,
	linux-kernel, Vince Weaver, Thomas Gleixner

On 05.09.2017 19:03, Peter Zijlstra wrote:
> On Tue, Sep 05, 2017 at 03:06:26PM +0300, Alexey Budankov wrote:
>> [ 6614.226305] WARNING: CPU: 45 PID: 43385 at kernel/events/core.c:239 event_function+0xb3/0xe0
> 
> I think I avoided that problem by not radically rewriting
> perf_event_read() but fixing it instead:
> 
>   https://git.kernel.org/pub/scm/linux/kernel/git/peterz/queue.git/commit/?h=perf/core&id=8ad650955ede95e4a6fd6afbda2a0b37d4af9c29
> 
> Full tree at:
> 
>   git://git.kernel.org/pub/scm/linux/kernel/git/peterz/queue.git perf/core
> 
> 
> Very minimally tested so far, I'll continue tomorrow.
> 
No access to:

git://git.kernel.org/pub/scm/linux/kernel/git/peterz/queue.git perf/core 

for some reason. Also tried:

https://git.kernel.org/pub/scm/linux/kernel/git/peterz/queue.git perf/core
https://kernel.googlesource.com/pub/scm/linux/kernel/git/peterz/queue.git perf/core

with no luck.

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [RFC][PATCH] perf: Rewrite enabled/running timekeeping
  2017-09-05 16:03                         ` Peter Zijlstra
  2017-09-06 13:48                           ` Alexey Budankov
@ 2017-09-08  8:47                           ` Alexey Budankov
  2018-03-12 17:43                             ` [tip:perf/core] perf/cor: Use RB trees for pinned/flexible groups tip-bot for Alexey Budankov
  1 sibling, 1 reply; 76+ messages in thread
From: Alexey Budankov @ 2017-09-08  8:47 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Arnaldo Carvalho de Melo, Alexander Shishkin,
	Andi Kleen, Kan Liang, Dmitri Prokhorov, Valery Cherepennikov,
	Mark Rutland, Stephane Eranian, David Carrillo-Cisneros,
	linux-kernel, Vince Weaver, Thomas Gleixner

On 05.09.2017 19:03, Peter Zijlstra wrote:
> On Tue, Sep 05, 2017 at 03:06:26PM +0300, Alexey Budankov wrote:
>> [ 6614.226305] WARNING: CPU: 45 PID: 43385 at kernel/events/core.c:239 event_function+0xb3/0xe0
> 
> I think I avoided that problem by not radically rewriting
> perf_event_read() but fixing it instead:
> 
>   https://git.kernel.org/pub/scm/linux/kernel/git/peterz/queue.git/commit/?h=perf/core&id=8ad650955ede95e4a6fd6afbda2a0b37d4af9c29
> 
> Full tree at:
> 
>   git://git.kernel.org/pub/scm/linux/kernel/git/peterz/queue.git perf/core
> 
> 
> Very minimally tested so far, I'll continue tomorrow.
> 

The patch set v9 on top of peterz/queue perf/core repository above:

---
 include/linux/perf_event.h |  16 ++-
 kernel/events/core.c       | 307 +++++++++++++++++++++++++++++++++++++--------
 2 files changed, 267 insertions(+), 56 deletions(-)

diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 2a6ae48..92cda40 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -557,7 +557,11 @@ struct perf_event {
 	 */
 	struct list_head		group_entry;
 	struct list_head		sibling_list;
-
+	/*
+	 * Node on the pinned or flexible tree located at the event context;
+	 */
+	struct rb_node			group_node;
+	u64				group_index;
 	/*
 	 * We need storage to track the entries in perf_pmu_migrate_context; we
 	 * cannot use the event_entry because of RCU and we want to keep the
@@ -689,6 +693,12 @@ struct perf_event {
 #endif /* CONFIG_PERF_EVENTS */
 };
 
+
+struct perf_event_groups {
+	struct rb_root	tree;
+	u64		index;
+};
+
 /**
  * struct perf_event_context - event context structure
  *
@@ -709,8 +719,8 @@ struct perf_event_context {
 	struct mutex			mutex;
 
 	struct list_head		active_ctx_list;
-	struct list_head		pinned_groups;
-	struct list_head		flexible_groups;
+	struct perf_event_groups	pinned_groups;
+	struct perf_event_groups	flexible_groups;
 	struct list_head		event_list;
 	int				nr_events;
 	int				nr_active;
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 56e9214..8158f1d 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -1454,8 +1454,21 @@ static enum event_type_t get_event_type(struct perf_event *event)
 	return event_type;
 }
 
-static struct list_head *
-ctx_group_list(struct perf_event *event, struct perf_event_context *ctx)
+/*
+ * Helper function to initialize group leader event;
+ */
+void init_event_group(struct perf_event *event)
+{
+	RB_CLEAR_NODE(&event->group_node);
+	event->group_index = 0;
+}
+
+/*
+ * Extract pinned or flexible groups from the context
+ * based on event attrs bits;
+ */
+static struct perf_event_groups *
+get_event_groups(struct perf_event *event, struct perf_event_context *ctx)
 {
 	if (event->attr.pinned)
 		return &ctx->pinned_groups;
@@ -1464,6 +1477,169 @@ static enum event_type_t get_event_type(struct perf_event *event)
 }
 
 /*
+ * Helper function to initializes perf event groups object;
+ */
+void perf_event_groups_init(struct perf_event_groups *groups)
+{
+	groups->tree = RB_ROOT;
+	groups->index = 0;
+}
+
+/*
+ * Compare function for event groups;
+ * Implements complex key that first sorts by CPU and then by
+ * virtual index which provides ordering when rotating
+ * groups for the same CPU;
+ */
+int perf_event_groups_less(struct perf_event *left, struct perf_event *right)
+{
+	if (left->cpu < right->cpu) {
+		return 1;
+	} else if (left->cpu > right->cpu) {
+		return 0;
+	} else {
+		if (left->group_index < right->group_index) {
+			return 1;
+		} else if(left->group_index > right->group_index) {
+			return 0;
+		} else {
+			return 0;
+		}
+	}
+}
+
+/*
+ * Insert a group into a tree using event->cpu as a key. If event->cpu node
+ * is already attached to the tree then the event is added to the attached
+ * group's group_list list.
+ */
+static void
+perf_event_groups_insert(struct perf_event_groups *groups,
+		struct perf_event *event)
+{
+	struct perf_event *node_event;
+	struct rb_node *parent;
+	struct rb_node **node;
+
+	event->group_index = ++groups->index;
+
+	node = &groups->tree.rb_node;
+	parent = *node;
+
+	while (*node) {
+		parent = *node;
+		node_event = container_of(*node,
+				struct perf_event, group_node);
+
+		if (perf_event_groups_less(event, node_event))
+			node = &parent->rb_left;
+		else
+			node = &parent->rb_right;
+	}
+
+	rb_link_node(&event->group_node, parent, node);
+	rb_insert_color(&event->group_node, &groups->tree);
+}
+
+/*
+ * Helper function to insert event into the pinned or
+ * flexible groups;
+ */
+static void
+add_event_to_groups(struct perf_event *event, struct perf_event_context *ctx)
+{
+	struct perf_event_groups *groups;
+
+	groups = get_event_groups(event, ctx);
+	perf_event_groups_insert(groups, event);
+}
+
+/*
+ * Delete a group from a tree. If the group is directly attached to the tree
+ * it also detaches all groups on the group's group_list list.
+ */
+static void
+perf_event_groups_delete(struct perf_event_groups *groups,
+		struct perf_event *event)
+{
+	if (!RB_EMPTY_NODE(&event->group_node) &&
+	    !RB_EMPTY_ROOT(&groups->tree))
+		rb_erase(&event->group_node, &groups->tree);
+
+	init_event_group(event);
+}
+
+/*
+ * Helper function to delete event from its groups;
+ */
+static void
+del_event_from_groups(struct perf_event *event, struct perf_event_context *ctx)
+{
+	struct perf_event_groups *groups;
+
+	groups = get_event_groups(event, ctx);
+	perf_event_groups_delete(groups, event);
+}
+
+/*
+ * Get a group by a cpu key from groups tree with the least group_index;
+ */
+static struct perf_event *
+perf_event_groups_first(struct perf_event_groups *groups, int cpu)
+{
+	struct perf_event *node_event = NULL, *match = NULL;
+	struct rb_node *node = groups->tree.rb_node;
+
+	while (node) {
+		node_event = container_of(node,
+				struct perf_event, group_node);
+
+		if (cpu < node_event->cpu) {
+			node = node->rb_left;
+		} else if (cpu > node_event->cpu) {
+			node = node->rb_right;
+		} else {
+			match = node_event;
+			node = node->rb_left;
+		}
+	}
+
+	return match;
+}
+
+/*
+ * Find group list by a cpu key and rotate it.
+ */
+static void
+perf_event_groups_rotate(struct perf_event_groups *groups, int cpu)
+{
+	struct perf_event *event =
+			perf_event_groups_first(groups, cpu);
+
+	if (event) {
+		perf_event_groups_delete(groups, event);
+		perf_event_groups_insert(groups, event);
+	}
+}
+
+/*
+ * Iterate event groups thru the whole tree.
+ */
+#define perf_event_groups_for_each(event, groups, node)		\
+	for (event = rb_entry_safe(rb_first(&((groups)->tree)),	\
+				typeof(*event), node); event;	\
+		event = rb_entry_safe(rb_next(&event->node),	\
+				typeof(*event), node))
+/*
+ * Iterate event groups with cpu == key.
+ */
+#define perf_event_groups_for_each_cpu(event, key, groups, node) \
+	for (event = perf_event_groups_first(groups, key);	 \
+		event && event->cpu == key;			 \
+		event = rb_entry_safe(rb_next(&event->node),	 \
+				typeof(*event), node))
+
+/*
  * Add a event from the lists for its context.
  * Must be called with ctx->mutex and ctx->lock held.
  */
@@ -1483,12 +1659,8 @@ static enum event_type_t get_event_type(struct perf_event *event)
 	 * perf_group_detach can, at all times, locate all siblings.
 	 */
 	if (event->group_leader == event) {
-		struct list_head *list;
-
 		event->group_caps = event->event_caps;
-
-		list = ctx_group_list(event, ctx);
-		list_add_tail(&event->group_entry, list);
+		add_event_to_groups(event, ctx);
 	}
 
 	list_update_cgroup_event(event, ctx, true);
@@ -1682,7 +1854,7 @@ static void perf_group_attach(struct perf_event *event)
 	list_del_rcu(&event->event_entry);
 
 	if (event->group_leader == event)
-		list_del_init(&event->group_entry);
+		del_event_from_groups(event, ctx);
 
 	/*
 	 * If event was in error state, then keep it
@@ -1700,7 +1872,6 @@ static void perf_group_attach(struct perf_event *event)
 static void perf_group_detach(struct perf_event *event)
 {
 	struct perf_event *sibling, *tmp;
-	struct list_head *list = NULL;
 
 	lockdep_assert_held(&event->ctx->lock);
 
@@ -1721,22 +1892,23 @@ static void perf_group_detach(struct perf_event *event)
 		goto out;
 	}
 
-	if (!list_empty(&event->group_entry))
-		list = &event->group_entry;
-
 	/*
 	 * If this was a group event with sibling events then
 	 * upgrade the siblings to singleton events by adding them
 	 * to whatever list we are on.
 	 */
 	list_for_each_entry_safe(sibling, tmp, &event->sibling_list, group_entry) {
-		if (list)
-			list_move_tail(&sibling->group_entry, list);
+
 		sibling->group_leader = sibling;
 
 		/* Inherit group flags from the previous leader */
 		sibling->group_caps = event->group_caps;
 
+		if (!RB_EMPTY_NODE(&event->group_node)) {
+			list_del_init(&sibling->group_entry);
+			add_event_to_groups(sibling, event->ctx);
+		}
+
 		WARN_ON_ONCE(sibling->ctx != event->ctx);
 	}
 
@@ -2180,6 +2352,22 @@ static int group_can_go_on(struct perf_event *event,
 	return can_add_hw;
 }
 
+static int
+flexible_group_sched_in(struct perf_event *event,
+			struct perf_event_context *ctx,
+		        struct perf_cpu_context *cpuctx,
+			int *can_add_hw)
+{
+	if (event->state <= PERF_EVENT_STATE_OFF || !event_filter_match(event))
+		return 0;
+
+	if (group_can_go_on(event, cpuctx, *can_add_hw))
+		if (group_sched_in(event, cpuctx, ctx))
+			*can_add_hw = 0;
+
+	return 1;
+}
+
 static void add_event_to_ctx(struct perf_event *event,
 			       struct perf_event_context *ctx)
 {
@@ -2646,6 +2834,7 @@ static void ctx_sched_out(struct perf_event_context *ctx,
 			  struct perf_cpu_context *cpuctx,
 			  enum event_type_t event_type)
 {
+	int sw = -1, cpu = smp_processor_id();
 	int is_active = ctx->is_active;
 	struct perf_event *event;
 
@@ -2694,12 +2883,20 @@ static void ctx_sched_out(struct perf_event_context *ctx,
 
 	perf_pmu_disable(ctx->pmu);
 	if (is_active & EVENT_PINNED) {
-		list_for_each_entry(event, &ctx->pinned_groups, group_entry)
+		perf_event_groups_for_each_cpu(event, cpu,
+				&ctx->pinned_groups, group_node)
+			group_sched_out(event, cpuctx, ctx);
+		perf_event_groups_for_each_cpu(event, sw,
+				&ctx->pinned_groups, group_node)
 			group_sched_out(event, cpuctx, ctx);
 	}
 
 	if (is_active & EVENT_FLEXIBLE) {
-		list_for_each_entry(event, &ctx->flexible_groups, group_entry)
+		perf_event_groups_for_each_cpu(event, cpu,
+				&ctx->flexible_groups, group_node)
+			group_sched_out(event, cpuctx, ctx);
+		perf_event_groups_for_each_cpu(event, sw,
+				&ctx->flexible_groups, group_node)
 			group_sched_out(event, cpuctx, ctx);
 	}
 	perf_pmu_enable(ctx->pmu);
@@ -2990,23 +3187,28 @@ static void cpu_ctx_sched_out(struct perf_cpu_context *cpuctx,
 ctx_pinned_sched_in(struct perf_event_context *ctx,
 		    struct perf_cpu_context *cpuctx)
 {
+	int sw = -1, cpu = smp_processor_id();
 	struct perf_event *event;
+	int can_add_hw;
+
+	perf_event_groups_for_each_cpu(event, sw,
+			&ctx->pinned_groups, group_node) {
+		can_add_hw = 1;
+		if (flexible_group_sched_in(event, ctx, cpuctx, &can_add_hw)) {
+			if (event->state == PERF_EVENT_STATE_INACTIVE)
+				perf_event_set_state(event,
+						PERF_EVENT_STATE_ERROR);
+		}
+	}
 
-	list_for_each_entry(event, &ctx->pinned_groups, group_entry) {
-		if (event->state <= PERF_EVENT_STATE_OFF)
-			continue;
-		if (!event_filter_match(event))
-			continue;
-
-		if (group_can_go_on(event, cpuctx, 1))
-			group_sched_in(event, cpuctx, ctx);
-
-		/*
-		 * If this pinned group hasn't been scheduled,
-		 * put it in error state.
-		 */
-		if (event->state == PERF_EVENT_STATE_INACTIVE)
-			perf_event_set_state(event, PERF_EVENT_STATE_ERROR);
+	perf_event_groups_for_each_cpu(event, cpu,
+			&ctx->pinned_groups, group_node) {
+		can_add_hw = 1;
+		if (flexible_group_sched_in(event, ctx, cpuctx, &can_add_hw)) {
+			if (event->state == PERF_EVENT_STATE_INACTIVE)
+				perf_event_set_state(event,
+						PERF_EVENT_STATE_ERROR);
+		}
 	}
 }
 
@@ -3014,25 +3216,19 @@ static void cpu_ctx_sched_out(struct perf_cpu_context *cpuctx,
 ctx_flexible_sched_in(struct perf_event_context *ctx,
 		      struct perf_cpu_context *cpuctx)
 {
+	int sw = -1, cpu = smp_processor_id();
 	struct perf_event *event;
 	int can_add_hw = 1;
 
-	list_for_each_entry(event, &ctx->flexible_groups, group_entry) {
-		/* Ignore events in OFF or ERROR state */
-		if (event->state <= PERF_EVENT_STATE_OFF)
-			continue;
-		/*
-		 * Listen to the 'cpu' scheduling filter constraint
-		 * of events:
-		 */
-		if (!event_filter_match(event))
-			continue;
+	perf_event_groups_for_each_cpu(event, sw,
+			&ctx->flexible_groups, group_node)
+		flexible_group_sched_in(event, ctx, cpuctx, &can_add_hw);
+
+	can_add_hw = 1;
+	perf_event_groups_for_each_cpu(event, cpu,
+			&ctx->flexible_groups, group_node)
+		flexible_group_sched_in(event, ctx, cpuctx, &can_add_hw);
 
-		if (group_can_go_on(event, cpuctx, can_add_hw)) {
-			if (group_sched_in(event, cpuctx, ctx))
-				can_add_hw = 0;
-		}
-	}
 }
 
 static void
@@ -3113,7 +3309,7 @@ static void perf_event_context_sched_in(struct perf_event_context *ctx,
 	 * However, if task's ctx is not carrying any pinned
 	 * events, no need to flip the cpuctx's events around.
 	 */
-	if (!list_empty(&ctx->pinned_groups))
+	if (!RB_EMPTY_ROOT(&ctx->pinned_groups.tree))
 		cpu_ctx_sched_out(cpuctx, EVENT_FLEXIBLE);
 	perf_event_sched_in(cpuctx, ctx, task);
 	perf_pmu_enable(ctx->pmu);
@@ -3350,8 +3546,12 @@ static void rotate_ctx(struct perf_event_context *ctx)
 	 * Rotate the first entry last of non-pinned groups. Rotation might be
 	 * disabled by the inheritance code.
 	 */
-	if (!ctx->rotate_disable)
-		list_rotate_left(&ctx->flexible_groups);
+	if (!ctx->rotate_disable) {
+		int sw = -1, cpu = smp_processor_id();
+
+		perf_event_groups_rotate(&ctx->flexible_groups, sw);
+		perf_event_groups_rotate(&ctx->flexible_groups, cpu);
+	}
 }
 
 static int perf_rotate_context(struct perf_cpu_context *cpuctx)
@@ -3698,8 +3898,8 @@ static void __perf_event_init_context(struct perf_event_context *ctx)
 	raw_spin_lock_init(&ctx->lock);
 	mutex_init(&ctx->mutex);
 	INIT_LIST_HEAD(&ctx->active_ctx_list);
-	INIT_LIST_HEAD(&ctx->pinned_groups);
-	INIT_LIST_HEAD(&ctx->flexible_groups);
+	perf_event_groups_init(&ctx->pinned_groups);
+	perf_event_groups_init(&ctx->flexible_groups);
 	INIT_LIST_HEAD(&ctx->event_list);
 	atomic_set(&ctx->refcount, 1);
 }
@@ -9370,6 +9570,7 @@ static void account_event(struct perf_event *event)
 	INIT_LIST_HEAD(&event->group_entry);
 	INIT_LIST_HEAD(&event->event_entry);
 	INIT_LIST_HEAD(&event->sibling_list);
+	init_event_group(event);
 	INIT_LIST_HEAD(&event->rb_entry);
 	INIT_LIST_HEAD(&event->active_entry);
 	INIT_LIST_HEAD(&event->addr_filters.list);
@@ -10880,7 +11081,7 @@ static int perf_event_init_context(struct task_struct *child, int ctxn)
 	 * We dont have to disable NMIs - we are only looking at
 	 * the list, not manipulating it:
 	 */
-	list_for_each_entry(event, &parent_ctx->pinned_groups, group_entry) {
+	perf_event_groups_for_each(event, &parent_ctx->pinned_groups, group_node) {
 		ret = inherit_task_group(event, parent, parent_ctx,
 					 child, ctxn, &inherited_all);
 		if (ret)
@@ -10896,7 +11097,7 @@ static int perf_event_init_context(struct task_struct *child, int ctxn)
 	parent_ctx->rotate_disable = 1;
 	raw_spin_unlock_irqrestore(&parent_ctx->lock, flags);
 
-	list_for_each_entry(event, &parent_ctx->flexible_groups, group_entry) {
+	perf_event_groups_for_each(event, &parent_ctx->flexible_groups, group_node) {
 		ret = inherit_task_group(event, parent, parent_ctx,
 					 child, ctxn, &inherited_all);
 		if (ret)

^ permalink raw reply related	[flat|nested] 76+ messages in thread

* Re: [RFC][PATCH] perf: Rewrite enabled/running timekeeping
  2017-09-05 11:19                       ` Peter Zijlstra
@ 2017-09-11  6:55                         ` Alexey Budankov
  0 siblings, 0 replies; 76+ messages in thread
From: Alexey Budankov @ 2017-09-11  6:55 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Arnaldo Carvalho de Melo, Alexander Shishkin,
	Andi Kleen, Kan Liang, Dmitri Prokhorov, Valery Cherepennikov,
	Mark Rutland, Stephane Eranian, David Carrillo-Cisneros,
	linux-kernel, Vince Weaver, Thomas Gleixner

Hi,
On 05.09.2017 14:19, Peter Zijlstra wrote:
> On Tue, Sep 05, 2017 at 01:17:39PM +0300, Alexey Budankov wrote:
>> However we can't completely get rid of whole tree iterations because of 
>> inheritance code on forks in perf_event_init_context() here:
> 
> Right, fork() / inherit needs to iterate the full thing, nothing to be
> done about that.
> 
> I'll go make proper patches for that timekeeping rewrite and then have a
> look at your patches.
> 

Is there any progress so far? The latest patch version is here: 

https://lkml.org/lkml/2017/9/8/118

^ permalink raw reply	[flat|nested] 76+ messages in thread

* [tip:perf/core] perf/cor: Use RB trees for pinned/flexible groups
  2017-09-08  8:47                           ` Alexey Budankov
@ 2018-03-12 17:43                             ` tip-bot for Alexey Budankov
  0 siblings, 0 replies; 76+ messages in thread
From: tip-bot for Alexey Budankov @ 2018-03-12 17:43 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: Dmitry.Prohorov, eranian, torvalds, acme, vincent.weaver, hpa,
	tglx, mark.rutland, valery.cherepennikov, peterz, davidcc, mingo,
	alexander.shishkin, jolsa, kan.liang, alexey.budankov,
	linux-kernel

Commit-ID:  8e1a2031e4b556b01ca53cd1fb2d83d811a6605b
Gitweb:     https://git.kernel.org/tip/8e1a2031e4b556b01ca53cd1fb2d83d811a6605b
Author:     Alexey Budankov <alexey.budankov@linux.intel.com>
AuthorDate: Fri, 8 Sep 2017 11:47:03 +0300
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Mon, 12 Mar 2018 15:28:49 +0100

perf/cor: Use RB trees for pinned/flexible groups

Change event groups into RB trees sorted by CPU and then by a 64-bit
index, so that the multiplexing hrtimer interrupt handler can skip to
the current CPU's list and ignore groups allocated for the other CPUs.

A new API for manipulating event groups in the trees is implemented,
and the existing code is converted to use it.

pinned_group_sched_in() and flexible_group_sched_in() are introduced
to consolidate the code that schedules in a whole group from the
pinned and flexible groups, respectively.

Signed-off-by: Alexey Budankov <alexey.budankov@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Mark Rutland <mark.rutland@arm.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: David Carrillo-Cisneros <davidcc@google.com>
Cc: Dmitri Prokhorov <Dmitry.Prohorov@intel.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Kan Liang <kan.liang@intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Valery Cherepennikov <valery.cherepennikov@intel.com>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Cc: linux-kernel@vger.kernel.org
Link: http://lkml.kernel.org/r/372f9c8b-0cfe-4240-e44d-83d863d40813@linux.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 include/linux/perf_event.h |  16 ++-
 kernel/events/core.c       | 307 +++++++++++++++++++++++++++++++++++++--------
 2 files changed, 267 insertions(+), 56 deletions(-)

diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 7546822a1d74..6e3f854a34d8 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -558,7 +558,11 @@ struct perf_event {
 	 */
 	struct list_head		group_entry;
 	struct list_head		sibling_list;
-
+	/*
+	 * Node on the pinned or flexible tree located at the event context;
+	 */
+	struct rb_node			group_node;
+	u64				group_index;
 	/*
 	 * We need storage to track the entries in perf_pmu_migrate_context; we
 	 * cannot use the event_entry because of RCU and we want to keep the
@@ -690,6 +694,12 @@ struct perf_event {
 #endif /* CONFIG_PERF_EVENTS */
 };
 
+
+struct perf_event_groups {
+	struct rb_root	tree;
+	u64		index;
+};
+
 /**
  * struct perf_event_context - event context structure
  *
@@ -710,8 +720,8 @@ struct perf_event_context {
 	struct mutex			mutex;
 
 	struct list_head		active_ctx_list;
-	struct list_head		pinned_groups;
-	struct list_head		flexible_groups;
+	struct perf_event_groups	pinned_groups;
+	struct perf_event_groups	flexible_groups;
 	struct list_head		event_list;
 	int				nr_events;
 	int				nr_active;
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 8b6a2774e084..c9fee3640f40 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -1460,8 +1460,21 @@ static enum event_type_t get_event_type(struct perf_event *event)
 	return event_type;
 }
 
-static struct list_head *
-ctx_group_list(struct perf_event *event, struct perf_event_context *ctx)
+/*
+ * Helper function to initialize group leader event;
+ */
+void init_event_group(struct perf_event *event)
+{
+	RB_CLEAR_NODE(&event->group_node);
+	event->group_index = 0;
+}
+
+/*
+ * Extract pinned or flexible groups from the context
+ * based on event attrs bits;
+ */
+static struct perf_event_groups *
+get_event_groups(struct perf_event *event, struct perf_event_context *ctx)
 {
 	if (event->attr.pinned)
 		return &ctx->pinned_groups;
@@ -1469,6 +1482,169 @@ ctx_group_list(struct perf_event *event, struct perf_event_context *ctx)
 		return &ctx->flexible_groups;
 }
 
+/*
+ * Helper function to initializes perf event groups object;
+ */
+void perf_event_groups_init(struct perf_event_groups *groups)
+{
+	groups->tree = RB_ROOT;
+	groups->index = 0;
+}
+
+/*
+ * Compare function for event groups;
+ * Implements complex key that first sorts by CPU and then by
+ * virtual index which provides ordering when rotating
+ * groups for the same CPU;
+ */
+int perf_event_groups_less(struct perf_event *left, struct perf_event *right)
+{
+	if (left->cpu < right->cpu) {
+		return 1;
+	} else if (left->cpu > right->cpu) {
+		return 0;
+	} else {
+		if (left->group_index < right->group_index) {
+			return 1;
+		} else if(left->group_index > right->group_index) {
+			return 0;
+		} else {
+			return 0;
+		}
+	}
+}
+
+/*
+ * Insert a group into a tree using event->cpu as a key. If event->cpu node
+ * is already attached to the tree then the event is added to the attached
+ * group's group_list list.
+ */
+static void
+perf_event_groups_insert(struct perf_event_groups *groups,
+		struct perf_event *event)
+{
+	struct perf_event *node_event;
+	struct rb_node *parent;
+	struct rb_node **node;
+
+	event->group_index = ++groups->index;
+
+	node = &groups->tree.rb_node;
+	parent = *node;
+
+	while (*node) {
+		parent = *node;
+		node_event = container_of(*node,
+				struct perf_event, group_node);
+
+		if (perf_event_groups_less(event, node_event))
+			node = &parent->rb_left;
+		else
+			node = &parent->rb_right;
+	}
+
+	rb_link_node(&event->group_node, parent, node);
+	rb_insert_color(&event->group_node, &groups->tree);
+}
+
+/*
+ * Helper function to insert event into the pinned or
+ * flexible groups;
+ */
+static void
+add_event_to_groups(struct perf_event *event, struct perf_event_context *ctx)
+{
+	struct perf_event_groups *groups;
+
+	groups = get_event_groups(event, ctx);
+	perf_event_groups_insert(groups, event);
+}
+
+/*
+ * Delete a group from a tree. If the group is directly attached to the tree
+ * it also detaches all groups on the group's group_list list.
+ */
+static void
+perf_event_groups_delete(struct perf_event_groups *groups,
+		struct perf_event *event)
+{
+	if (!RB_EMPTY_NODE(&event->group_node) &&
+	    !RB_EMPTY_ROOT(&groups->tree))
+		rb_erase(&event->group_node, &groups->tree);
+
+	init_event_group(event);
+}
+
+/*
+ * Helper function to delete event from its groups;
+ */
+static void
+del_event_from_groups(struct perf_event *event, struct perf_event_context *ctx)
+{
+	struct perf_event_groups *groups;
+
+	groups = get_event_groups(event, ctx);
+	perf_event_groups_delete(groups, event);
+}
+
+/*
+ * Get a group by a cpu key from groups tree with the least group_index;
+ */
+static struct perf_event *
+perf_event_groups_first(struct perf_event_groups *groups, int cpu)
+{
+	struct perf_event *node_event = NULL, *match = NULL;
+	struct rb_node *node = groups->tree.rb_node;
+
+	while (node) {
+		node_event = container_of(node,
+				struct perf_event, group_node);
+
+		if (cpu < node_event->cpu) {
+			node = node->rb_left;
+		} else if (cpu > node_event->cpu) {
+			node = node->rb_right;
+		} else {
+			match = node_event;
+			node = node->rb_left;
+		}
+	}
+
+	return match;
+}
+
+/*
+ * Find group list by a cpu key and rotate it.
+ */
+static void
+perf_event_groups_rotate(struct perf_event_groups *groups, int cpu)
+{
+	struct perf_event *event =
+			perf_event_groups_first(groups, cpu);
+
+	if (event) {
+		perf_event_groups_delete(groups, event);
+		perf_event_groups_insert(groups, event);
+	}
+}
+
+/*
+ * Iterate event groups thru the whole tree.
+ */
+#define perf_event_groups_for_each(event, groups, node)		\
+	for (event = rb_entry_safe(rb_first(&((groups)->tree)),	\
+				typeof(*event), node); event;	\
+		event = rb_entry_safe(rb_next(&event->node),	\
+				typeof(*event), node))
+/*
+ * Iterate event groups with cpu == key.
+ */
+#define perf_event_groups_for_each_cpu(event, key, groups, node) \
+	for (event = perf_event_groups_first(groups, key);	 \
+		event && event->cpu == key;			 \
+		event = rb_entry_safe(rb_next(&event->node),	 \
+				typeof(*event), node))
+
 /*
  * Add a event from the lists for its context.
  * Must be called with ctx->mutex and ctx->lock held.
@@ -1489,12 +1665,8 @@ list_add_event(struct perf_event *event, struct perf_event_context *ctx)
 	 * perf_group_detach can, at all times, locate all siblings.
 	 */
 	if (event->group_leader == event) {
-		struct list_head *list;
-
 		event->group_caps = event->event_caps;
-
-		list = ctx_group_list(event, ctx);
-		list_add_tail(&event->group_entry, list);
+		add_event_to_groups(event, ctx);
 	}
 
 	list_update_cgroup_event(event, ctx, true);
@@ -1688,7 +1860,7 @@ list_del_event(struct perf_event *event, struct perf_event_context *ctx)
 	list_del_rcu(&event->event_entry);
 
 	if (event->group_leader == event)
-		list_del_init(&event->group_entry);
+		del_event_from_groups(event, ctx);
 
 	/*
 	 * If event was in error state, then keep it
@@ -1706,7 +1878,6 @@ list_del_event(struct perf_event *event, struct perf_event_context *ctx)
 static void perf_group_detach(struct perf_event *event)
 {
 	struct perf_event *sibling, *tmp;
-	struct list_head *list = NULL;
 
 	lockdep_assert_held(&event->ctx->lock);
 
@@ -1727,22 +1898,23 @@ static void perf_group_detach(struct perf_event *event)
 		goto out;
 	}
 
-	if (!list_empty(&event->group_entry))
-		list = &event->group_entry;
-
 	/*
 	 * If this was a group event with sibling events then
 	 * upgrade the siblings to singleton events by adding them
 	 * to whatever list we are on.
 	 */
 	list_for_each_entry_safe(sibling, tmp, &event->sibling_list, group_entry) {
-		if (list)
-			list_move_tail(&sibling->group_entry, list);
+
 		sibling->group_leader = sibling;
 
 		/* Inherit group flags from the previous leader */
 		sibling->group_caps = event->group_caps;
 
+		if (!RB_EMPTY_NODE(&event->group_node)) {
+			list_del_init(&sibling->group_entry);
+			add_event_to_groups(sibling, event->ctx);
+		}
+
 		WARN_ON_ONCE(sibling->ctx != event->ctx);
 	}
 
@@ -2186,6 +2358,22 @@ static int group_can_go_on(struct perf_event *event,
 	return can_add_hw;
 }
 
+static int
+flexible_group_sched_in(struct perf_event *event,
+			struct perf_event_context *ctx,
+		        struct perf_cpu_context *cpuctx,
+			int *can_add_hw)
+{
+	if (event->state <= PERF_EVENT_STATE_OFF || !event_filter_match(event))
+		return 0;
+
+	if (group_can_go_on(event, cpuctx, *can_add_hw))
+		if (group_sched_in(event, cpuctx, ctx))
+			*can_add_hw = 0;
+
+	return 1;
+}
+
 static void add_event_to_ctx(struct perf_event *event,
 			       struct perf_event_context *ctx)
 {
@@ -2652,6 +2840,7 @@ static void ctx_sched_out(struct perf_event_context *ctx,
 			  struct perf_cpu_context *cpuctx,
 			  enum event_type_t event_type)
 {
+	int sw = -1, cpu = smp_processor_id();
 	int is_active = ctx->is_active;
 	struct perf_event *event;
 
@@ -2700,12 +2889,20 @@ static void ctx_sched_out(struct perf_event_context *ctx,
 
 	perf_pmu_disable(ctx->pmu);
 	if (is_active & EVENT_PINNED) {
-		list_for_each_entry(event, &ctx->pinned_groups, group_entry)
+		perf_event_groups_for_each_cpu(event, cpu,
+				&ctx->pinned_groups, group_node)
+			group_sched_out(event, cpuctx, ctx);
+		perf_event_groups_for_each_cpu(event, sw,
+				&ctx->pinned_groups, group_node)
 			group_sched_out(event, cpuctx, ctx);
 	}
 
 	if (is_active & EVENT_FLEXIBLE) {
-		list_for_each_entry(event, &ctx->flexible_groups, group_entry)
+		perf_event_groups_for_each_cpu(event, cpu,
+				&ctx->flexible_groups, group_node)
+			group_sched_out(event, cpuctx, ctx);
+		perf_event_groups_for_each_cpu(event, sw,
+				&ctx->flexible_groups, group_node)
 			group_sched_out(event, cpuctx, ctx);
 	}
 	perf_pmu_enable(ctx->pmu);
@@ -2996,23 +3193,28 @@ static void
 ctx_pinned_sched_in(struct perf_event_context *ctx,
 		    struct perf_cpu_context *cpuctx)
 {
+	int sw = -1, cpu = smp_processor_id();
 	struct perf_event *event;
+	int can_add_hw;
+
+	perf_event_groups_for_each_cpu(event, sw,
+			&ctx->pinned_groups, group_node) {
+		can_add_hw = 1;
+		if (flexible_group_sched_in(event, ctx, cpuctx, &can_add_hw)) {
+			if (event->state == PERF_EVENT_STATE_INACTIVE)
+				perf_event_set_state(event,
+						PERF_EVENT_STATE_ERROR);
+		}
+	}
 
-	list_for_each_entry(event, &ctx->pinned_groups, group_entry) {
-		if (event->state <= PERF_EVENT_STATE_OFF)
-			continue;
-		if (!event_filter_match(event))
-			continue;
-
-		if (group_can_go_on(event, cpuctx, 1))
-			group_sched_in(event, cpuctx, ctx);
-
-		/*
-		 * If this pinned group hasn't been scheduled,
-		 * put it in error state.
-		 */
-		if (event->state == PERF_EVENT_STATE_INACTIVE)
-			perf_event_set_state(event, PERF_EVENT_STATE_ERROR);
+	perf_event_groups_for_each_cpu(event, cpu,
+			&ctx->pinned_groups, group_node) {
+		can_add_hw = 1;
+		if (flexible_group_sched_in(event, ctx, cpuctx, &can_add_hw)) {
+			if (event->state == PERF_EVENT_STATE_INACTIVE)
+				perf_event_set_state(event,
+						PERF_EVENT_STATE_ERROR);
+		}
 	}
 }
 
@@ -3020,25 +3222,19 @@ static void
 ctx_flexible_sched_in(struct perf_event_context *ctx,
 		      struct perf_cpu_context *cpuctx)
 {
+	int sw = -1, cpu = smp_processor_id();
 	struct perf_event *event;
 	int can_add_hw = 1;
 
-	list_for_each_entry(event, &ctx->flexible_groups, group_entry) {
-		/* Ignore events in OFF or ERROR state */
-		if (event->state <= PERF_EVENT_STATE_OFF)
-			continue;
-		/*
-		 * Listen to the 'cpu' scheduling filter constraint
-		 * of events:
-		 */
-		if (!event_filter_match(event))
-			continue;
+	perf_event_groups_for_each_cpu(event, sw,
+			&ctx->flexible_groups, group_node)
+		flexible_group_sched_in(event, ctx, cpuctx, &can_add_hw);
+
+	can_add_hw = 1;
+	perf_event_groups_for_each_cpu(event, cpu,
+			&ctx->flexible_groups, group_node)
+		flexible_group_sched_in(event, ctx, cpuctx, &can_add_hw);
 
-		if (group_can_go_on(event, cpuctx, can_add_hw)) {
-			if (group_sched_in(event, cpuctx, ctx))
-				can_add_hw = 0;
-		}
-	}
 }
 
 static void
@@ -3119,7 +3315,7 @@ static void perf_event_context_sched_in(struct perf_event_context *ctx,
 	 * However, if task's ctx is not carrying any pinned
 	 * events, no need to flip the cpuctx's events around.
 	 */
-	if (!list_empty(&ctx->pinned_groups))
+	if (!RB_EMPTY_ROOT(&ctx->pinned_groups.tree))
 		cpu_ctx_sched_out(cpuctx, EVENT_FLEXIBLE);
 	perf_event_sched_in(cpuctx, ctx, task);
 	perf_pmu_enable(ctx->pmu);
@@ -3356,8 +3552,12 @@ static void rotate_ctx(struct perf_event_context *ctx)
 	 * Rotate the first entry last of non-pinned groups. Rotation might be
 	 * disabled by the inheritance code.
 	 */
-	if (!ctx->rotate_disable)
-		list_rotate_left(&ctx->flexible_groups);
+	if (!ctx->rotate_disable) {
+		int sw = -1, cpu = smp_processor_id();
+
+		perf_event_groups_rotate(&ctx->flexible_groups, sw);
+		perf_event_groups_rotate(&ctx->flexible_groups, cpu);
+	}
 }
 
 static int perf_rotate_context(struct perf_cpu_context *cpuctx)
@@ -3715,8 +3915,8 @@ static void __perf_event_init_context(struct perf_event_context *ctx)
 	raw_spin_lock_init(&ctx->lock);
 	mutex_init(&ctx->mutex);
 	INIT_LIST_HEAD(&ctx->active_ctx_list);
-	INIT_LIST_HEAD(&ctx->pinned_groups);
-	INIT_LIST_HEAD(&ctx->flexible_groups);
+	perf_event_groups_init(&ctx->pinned_groups);
+	perf_event_groups_init(&ctx->flexible_groups);
 	INIT_LIST_HEAD(&ctx->event_list);
 	atomic_set(&ctx->refcount, 1);
 }
@@ -9561,6 +9761,7 @@ perf_event_alloc(struct perf_event_attr *attr, int cpu,
 	INIT_LIST_HEAD(&event->group_entry);
 	INIT_LIST_HEAD(&event->event_entry);
 	INIT_LIST_HEAD(&event->sibling_list);
+	init_event_group(event);
 	INIT_LIST_HEAD(&event->rb_entry);
 	INIT_LIST_HEAD(&event->active_entry);
 	INIT_LIST_HEAD(&event->addr_filters.list);
@@ -11085,7 +11286,7 @@ static int perf_event_init_context(struct task_struct *child, int ctxn)
 	 * We dont have to disable NMIs - we are only looking at
 	 * the list, not manipulating it:
 	 */
-	list_for_each_entry(event, &parent_ctx->pinned_groups, group_entry) {
+	perf_event_groups_for_each(event, &parent_ctx->pinned_groups, group_node) {
 		ret = inherit_task_group(event, parent, parent_ctx,
 					 child, ctxn, &inherited_all);
 		if (ret)
@@ -11101,7 +11302,7 @@ static int perf_event_init_context(struct task_struct *child, int ctxn)
 	parent_ctx->rotate_disable = 1;
 	raw_spin_unlock_irqrestore(&parent_ctx->lock, flags);
 
-	list_for_each_entry(event, &parent_ctx->flexible_groups, group_entry) {
+	perf_event_groups_for_each(event, &parent_ctx->flexible_groups, group_node) {
 		ret = inherit_task_group(event, parent, parent_ctx,
 					 child, ctxn, &inherited_all);
 		if (ret)

^ permalink raw reply related	[flat|nested] 76+ messages in thread
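
A rough, purely illustrative sizing of what the tree buys on the multiplexing
path (the numbers below are made up for the sake of the arithmetic, not
measurements from this thread): take a task context holding G = 8192 flexible
groups opened per-CPU across C = 256 logical CPUs, i.e. about G/C = 32 groups
keyed to any one CPU. Before this commit, each mux tick's ctx_flexible_sched_in()
walked the whole list and relied on event_filter_match() to reject the other
CPUs' groups, touching all 8192 entries. With the (cpu, group_index) RB tree,
perf_event_groups_first() descends roughly log2(8192) = 13 nodes to the leftmost
matching group, and the walk then visits only the ~32 groups keyed to the current
CPU plus any cpu == -1 software groups.
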

end of thread, other threads:[~2018-03-12 17:43 UTC | newest]

Thread overview: 76+ messages
2017-08-02  8:11 [PATCH v6 0/3] perf/core: addressing 4x slowdown during per-process profiling of STREAM benchmark on Intel Xeon Phi Alexey Budankov
2017-08-02  8:13 ` [PATCH v6 1/3] perf/core: use rb trees for pinned/flexible groups Alexey Budankov
2017-08-03 13:00   ` Peter Zijlstra
2017-08-03 20:30     ` Alexey Budankov
2017-08-04 14:36       ` Peter Zijlstra
2017-08-07  7:17         ` Alexey Budankov
2017-08-07  8:39           ` Peter Zijlstra
2017-08-07  9:13             ` Peter Zijlstra
2017-08-07 15:32               ` Alexey Budankov
2017-08-07 15:55                 ` Peter Zijlstra
2017-08-07 16:27                   ` Alexey Budankov
2017-08-07 16:57                     ` Peter Zijlstra
2017-08-07 17:39                       ` Andi Kleen
2017-08-07 18:12                         ` Peter Zijlstra
2017-08-07 18:13                       ` Alexey Budankov
2017-08-15 17:28           ` Alexey Budankov
2017-08-23 13:39             ` Alexander Shishkin
2017-08-23 14:18               ` Alexey Budankov
2017-08-29 13:51             ` Alexander Shishkin
2017-08-30  8:30               ` Alexey Budankov
2017-08-30 10:18                 ` Alexander Shishkin
2017-08-30 10:30                   ` Alexey Budankov
2017-08-30 11:13                     ` Alexander Shishkin
2017-08-30 11:16                 ` Alexey Budankov
2017-08-31 10:12                   ` Alexey Budankov
2017-08-31 10:12             ` Alexey Budankov
2017-08-04 14:53       ` Peter Zijlstra
2017-08-07 15:22         ` Alexey Budankov
2017-08-02  8:15 ` [PATCH v6 2/3]: perf/core: use context tstamp_data for skipped events on mux interrupt Alexey Budankov
2017-08-03 13:04   ` Peter Zijlstra
2017-08-03 14:00   ` Peter Zijlstra
2017-08-03 15:58     ` Alexey Budankov
2017-08-04 12:36       ` Peter Zijlstra
2017-08-03 15:00   ` Peter Zijlstra
2017-08-03 18:47     ` Alexey Budankov
2017-08-04 12:35       ` Peter Zijlstra
2017-08-04 12:51         ` Peter Zijlstra
2017-08-04 14:25           ` Alexey Budankov
2017-08-04 14:23         ` Alexey Budankov
2017-08-10 15:57     ` Alexey Budankov
2017-08-22 20:47       ` Peter Zijlstra
2017-08-23  8:54         ` Alexey Budankov
2017-08-31 17:18           ` [RFC][PATCH] perf: Rewrite enabled/running timekeeping Peter Zijlstra
2017-08-31 19:51             ` Stephane Eranian
2017-09-05  7:51               ` Stephane Eranian
2017-09-05  9:44                 ` Peter Zijlstra
2017-09-01 10:45             ` Alexey Budankov
2017-09-01 12:31               ` Peter Zijlstra
2017-09-01 11:17             ` Alexey Budankov
2017-09-01 12:42               ` Peter Zijlstra
2017-09-01 21:03             ` Vince Weaver
2017-09-04 10:46             ` Alexey Budankov
2017-09-04 12:08               ` Peter Zijlstra
2017-09-04 14:56                 ` Alexey Budankov
2017-09-04 15:41                   ` Peter Zijlstra
2017-09-04 15:58                     ` Peter Zijlstra
2017-09-05 10:17                     ` Alexey Budankov
2017-09-05 11:19                       ` Peter Zijlstra
2017-09-11  6:55                         ` Alexey Budankov
2017-09-05 12:06                       ` Alexey Budankov
2017-09-05 12:59                         ` Peter Zijlstra
2017-09-05 16:03                         ` Peter Zijlstra
2017-09-06 13:48                           ` Alexey Budankov
2017-09-08  8:47                           ` Alexey Budankov
2018-03-12 17:43                             ` [tip:perf/core] perf/cor: Use RB trees for pinned/flexible groups tip-bot for Alexey Budankov
2017-08-02  8:16 ` [PATCH v6 3/3]: perf/core: add mux switch to skip to the current CPU's events list on mux interrupt Alexey Budankov
2017-08-18  5:17 ` [PATCH v7 0/2] perf/core: addressing 4x slowdown during per-process profiling of STREAM benchmark on Intel Xeon Phi Alexey Budankov
2017-08-18  5:21   ` [PATCH v7 1/2] perf/core: use rb trees for pinned/flexible groups Alexey Budankov
2017-08-23 11:17     ` Alexander Shishkin
2017-08-23 17:23       ` Alexey Budankov
2017-08-18  5:22   ` [PATCH v7 2/2] perf/core: add mux switch to skip to the current CPU's events list on mux interrupt Alexey Budankov
2017-08-23 11:54     ` Alexander Shishkin
2017-08-23 18:12       ` Alexey Budankov
2017-08-22 20:21   ` [PATCH v7 0/2] perf/core: addressing 4x slowdown during per-process profiling of STREAM benchmark on Intel Xeon Phi Peter Zijlstra
2017-08-23  8:54     ` Alexey Budankov
2017-08-31 10:12     ` Alexey Budankov
