linux-kernel.vger.kernel.org archive mirror
* [PATCH v8] perf: Sharing PMU counters across compatible events
@ 2019-12-07  0:24 Song Liu
  2019-12-11 18:50 ` Song Liu
                   ` (3 more replies)
  0 siblings, 4 replies; 14+ messages in thread
From: Song Liu @ 2019-12-07  0:24 UTC (permalink / raw)
  To: linux-kernel
  Cc: kernel-team, Song Liu, Peter Zijlstra, Arnaldo Carvalho de Melo,
	Jiri Olsa, Alexey Budankov, Namhyung Kim, Tejun Heo

This patch enables PMU counter sharing. When multiple perf_events are
counting the same metric, they can share the hardware PMU counter. We
call these events "compatible events".

PMU sharing is limited to events within the same perf_event_context
(ctx). When an event is installed or enabled, the ctx is searched for
compatible events. This is implemented in perf_event_setup_dup(). One of
these compatible events is picked as the master (stored in
event->dup_master). Similarly, when the event is removed or disabled,
perf_event_remove_dup() is used to clean up the sharing.

A new state PERF_EVENT_STATE_ENABLED is introduced for the master event.
This state is used when the slave event is ACTIVE, but the master event
is not.

On the critical paths (add, del, read), sharing PMU counters doesn't
increase the complexity. Helper functions event_pmu_[add|del|read]() are
introduced to cover these cases. All of these functions have O(1) time
complexity.
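
For example (a user-space sketch for illustration only, not part of this
patch; the event choice is arbitrary), two counting events opened with
identical attributes on the same task are "compatible" and could be
backed by a single hardware counter:

	#include <linux/perf_event.h>
	#include <string.h>
	#include <sys/syscall.h>
	#include <sys/types.h>
	#include <unistd.h>

	static int open_insn_counter(pid_t pid)
	{
		struct perf_event_attr attr;

		memset(&attr, 0, sizeof(attr));
		attr.size = sizeof(attr);
		attr.type = PERF_TYPE_HARDWARE;			/* same type ...  */
		attr.config = PERF_COUNT_HW_INSTRUCTIONS;	/* ... and config */
		attr.disabled = 1;
		/* counting only: no sample_period, so not a sampling event */

		return syscall(__NR_perf_event_open, &attr, pid, -1, -1, 0);
	}

	/* two such fds on the same task land in the same ctx and can share */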

Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Alexey Budankov <alexey.budankov@linux.intel.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: Song Liu <songliubraving@fb.com>

---
Changes in v8:
Fix issues with task event (Jiri).
Fix issues with event inherit.
Fix mmap'ed events, i.e. perf test 4 (kernel test bot).

Changes in v7:
Major rewrite to avoid allocating extra master event.
---
 include/linux/perf_event.h |  14 +-
 kernel/events/core.c       | 342 ++++++++++++++++++++++++++++++++++---
 2 files changed, 327 insertions(+), 29 deletions(-)

diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 6d4c22aee384..7d49f9251621 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -547,7 +547,9 @@ enum perf_event_state {
 	PERF_EVENT_STATE_ERROR		= -2,
 	PERF_EVENT_STATE_OFF		= -1,
 	PERF_EVENT_STATE_INACTIVE	=  0,
-	PERF_EVENT_STATE_ACTIVE		=  1,
+	/* the hw PMC is enabled, but this event is not counting */
+	PERF_EVENT_STATE_ENABLED	=  1,
+	PERF_EVENT_STATE_ACTIVE		=  2,
 };
 
 struct file;
@@ -750,6 +752,16 @@ struct perf_event {
 	void *security;
 #endif
 	struct list_head		sb_list;
+
+	/* for PMU sharing */
+	struct perf_event		*dup_master;
+	/* check event_sync_dup_count() for the use of dup_base_* */
+	u64				dup_base_count;
+	u64				dup_base_child_count;
+	/* when this event is the master, read from master_*count */
+	local64_t			master_count;
+	atomic64_t			master_child_count;
+	int				dup_active_count;
 #endif /* CONFIG_PERF_EVENTS */
 };
 
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 4ff86d57f9e5..7a4bc3860dfa 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -1657,6 +1657,140 @@ perf_event_groups_next(struct perf_event *event)
 		event = rb_entry_safe(rb_next(&event->group_node),	\
 				typeof(*event), group_node))
 
+static inline bool perf_event_can_share(struct perf_event *event)
+{
+	/* only share hardware counting events */
+	return !is_software_event(event) && !is_sampling_event(event);
+}
+
+/*
+ * Returns whether the two events can share a PMU counter.
+ *
+ * Note: This function does NOT check perf_event_can_share() for
+ * the two events; both should be checked before calling this function.
+ */
+static inline bool perf_event_compatible(struct perf_event *event_a,
+					 struct perf_event *event_b)
+{
+	return event_a->attr.type == event_b->attr.type &&
+		event_a->attr.config == event_b->attr.config &&
+		event_a->attr.config1 == event_b->attr.config1 &&
+		event_a->attr.config2 == event_b->attr.config2;
+}
+
+/* prepare the dup_master; this event is its own dup_master */
+static void perf_event_init_dup_master(struct perf_event *event)
+{
+	event->dup_master = event;
+	/*
+	 * dup_master->count is used by the hw PMC, and shared with other
+	 * events, so we have to read from dup_master->master_count. Copy
+	 * event->count to event->master_count.
+	 *
+	 * Same logic for child_count and master_child_count.
+	 */
+	local64_set(&event->master_count, local64_read(&event->count));
+	atomic64_set(&event->master_child_count,
+		     atomic64_read(&event->child_count));
+
+	event->dup_active_count = 0;
+}
+
+/* tear down dup_master, no more sharing for this event */
+static void perf_event_exit_dup_master(struct perf_event *event)
+{
+	WARN_ON_ONCE(event->dup_active_count);
+
+	event->dup_master = NULL;
+	/* restore event->count and event->child_count */
+	local64_set(&event->count, local64_read(&event->master_count));
+	atomic64_set(&event->child_count,
+		     atomic64_read(&event->master_child_count));
+}
+
+/* After adding an event to the ctx, try to find compatible event(s). */
+static void perf_event_setup_dup(struct perf_event *event,
+				 struct perf_event_context *ctx)
+
+{
+	struct perf_event *tmp;
+
+	if (event->dup_master ||
+	    event->state != PERF_EVENT_STATE_INACTIVE ||
+	    !perf_event_can_share(event))
+		return;
+
+	/* look for dup with other events */
+	list_for_each_entry(tmp, &ctx->event_list, event_entry) {
+		WARN_ON_ONCE(tmp->state > PERF_EVENT_STATE_INACTIVE);
+
+		if (tmp == event ||
+		    tmp->state != PERF_EVENT_STATE_INACTIVE ||
+		    !perf_event_can_share(tmp) ||
+		    !perf_event_compatible(event, tmp))
+			continue;
+
+		/* first dup, pick tmp as the master */
+		if (!tmp->dup_master)
+			perf_event_init_dup_master(tmp);
+
+		event->dup_master = tmp->dup_master;
+		break;
+	}
+}
+
+/* Remove dup_master for the event */
+static void perf_event_remove_dup(struct perf_event *event,
+				  struct perf_event_context *ctx)
+
+{
+	struct perf_event *tmp, *new_master;
+	int count;
+
+	/* no sharing */
+	if (!event->dup_master)
+		return;
+
+	WARN_ON_ONCE(event->state != PERF_EVENT_STATE_INACTIVE &&
+		     event->state != PERF_EVENT_STATE_OFF);
+
+	/* this event is not the master */
+	if (event->dup_master != event) {
+		event->dup_master = NULL;
+		return;
+	}
+
+	/* this event is the master */
+	perf_event_exit_dup_master(event);
+
+	count = 0;
+	new_master = NULL;
+	list_for_each_entry(tmp, &ctx->event_list, event_entry) {
+		WARN_ON_ONCE(tmp->state > PERF_EVENT_STATE_INACTIVE);
+		if (tmp->dup_master == event) {
+			count++;
+			if (!new_master)
+				new_master = tmp;
+		}
+	}
+
+	if (!count)
+		return;
+
+	if (count == 1) {
+		/* no more sharing */
+		new_master->dup_master = NULL;
+		return;
+	}
+
+	perf_event_init_dup_master(new_master);
+
+	/* switch to new_master */
+	list_for_each_entry(tmp, &ctx->event_list, event_entry)
+		if (tmp->dup_master == event)
+			tmp->dup_master = new_master;
+}
+
 /*
  * Add an event from the lists for its context.
  * Must be called with ctx->mutex and ctx->lock held.
@@ -2084,6 +2218,98 @@ event_filter_match(struct perf_event *event)
 	       perf_cgroup_match(event) && pmu_filter_match(event);
 }
 
+/* PMU sharing aware version of event->pmu->add() */
+static int event_pmu_add(struct perf_event *event,
+			 struct perf_event_context *ctx)
+{
+	struct perf_event *master;
+	int ret;
+
+	/* no sharing, just do event->pmu->add() */
+	if (!event->dup_master)
+		return event->pmu->add(event, PERF_EF_START);
+
+	master = event->dup_master;
+
+	if (!master->dup_active_count) {
+		ret = event->pmu->add(master, PERF_EF_START);
+		if (ret)
+			return ret;
+
+		if (master != event)
+			perf_event_set_state(master, PERF_EVENT_STATE_ENABLED);
+	}
+
+	master->dup_active_count++;
+	master->pmu->read(master);
+	event->dup_base_count = local64_read(&master->count);
+	event->dup_base_child_count = atomic64_read(&master->child_count);
+	return 0;
+}
+
+/*
+ * Sync the count from event->dup_master to this event; called from
+ * event_pmu_read() and event_pmu_del().
+ */
+static void event_sync_dup_count(struct perf_event *event,
+				 struct perf_event *master)
+{
+	u64 new_count;
+	u64 new_child_count;
+
+	WARN_ON_ONCE(event->state != PERF_EVENT_STATE_ACTIVE);
+
+	event->pmu->read(master);
+	new_count = local64_read(&master->count);
+	new_child_count = atomic64_read(&master->child_count);
+
+	if (event == master) {
+		local64_add(new_count - event->dup_base_count,
+			    &event->master_count);
+		atomic64_add(new_child_count - event->dup_base_child_count,
+			     &event->master_child_count);
+	} else {
+		local64_add(new_count - event->dup_base_count, &event->count);
+		atomic64_add(new_child_count - event->dup_base_child_count,
+			     &event->child_count);
+	}
+
+	/* save dup_base_* for next sync */
+	event->dup_base_count = new_count;
+	event->dup_base_child_count = new_child_count;
+}
+
+/* PMU sharing aware version of event->pmu->del() */
+static void event_pmu_del(struct perf_event *event,
+			  struct perf_event_context *ctx)
+{
+	struct perf_event *master;
+
+	if (event->dup_master == NULL) {
+		event->pmu->del(event, 0);
+		return;
+	}
+
+	master = event->dup_master;
+	event_sync_dup_count(event, master);
+	if (--master->dup_active_count == 0) {
+		event->pmu->del(master, 0);
+		perf_event_set_state(master, PERF_EVENT_STATE_INACTIVE);
+	} else if (master == event) {
+		perf_event_set_state(master, PERF_EVENT_STATE_ENABLED);
+	}
+}
+
+/* PMU sharing aware version of event->pmu->read() */
+static void event_pmu_read(struct perf_event *event)
+{
+	if (event->dup_master == NULL) {
+		event->pmu->read(event);
+		return;
+	}
+	event_sync_dup_count(event, event->dup_master);
+}
+
 static void
 event_sched_out(struct perf_event *event,
 		  struct perf_cpu_context *cpuctx,
@@ -2106,7 +2332,7 @@ event_sched_out(struct perf_event *event,
 
 	perf_pmu_disable(event->pmu);
 
-	event->pmu->del(event, 0);
+	event_pmu_del(event, ctx);
 	event->oncpu = -1;
 
 	if (READ_ONCE(event->pending_disable) >= 0) {
@@ -2155,6 +2381,16 @@ group_sched_out(struct perf_event *group_event,
 
 #define DETACH_GROUP	0x01UL
 
+static void ctx_sched_out(struct perf_event_context *ctx,
+			  struct perf_cpu_context *cpuctx,
+			  enum event_type_t event_type);
+
+static void ctx_resched(struct perf_cpu_context *cpuctx,
+			struct perf_event_context *task_ctx,
+			enum event_type_t event_type,
+			struct perf_event *event_add_dup,
+			struct perf_event *event_del_dup);
+
 /*
  * Cross CPU call to remove a performance event
  *
@@ -2174,6 +2410,14 @@ __perf_remove_from_context(struct perf_event *event,
 		update_cgrp_time_from_cpuctx(cpuctx);
 	}
 
+	if (event->dup_master == event) {
+		if (ctx->is_active)
+			ctx_resched(cpuctx, cpuctx->task_ctx,
+				    get_event_type(event), NULL, event);
+		else
+			perf_event_remove_dup(event, ctx);
+	}
+
 	event_sched_out(event, cpuctx, ctx);
 	if (flags & DETACH_GROUP)
 		perf_group_detach(event);
@@ -2241,6 +2485,14 @@ static void __perf_event_disable(struct perf_event *event,
 		update_cgrp_time_from_event(event);
 	}
 
+	if (event->dup_master == event) {
+		if (ctx->is_active)
+			ctx_resched(cpuctx, cpuctx->task_ctx,
+				    get_event_type(event), NULL, event);
+		else
+			perf_event_remove_dup(event, ctx);
+	}
+
 	if (event == event->group_leader)
 		group_sched_out(event, cpuctx, ctx);
 	else
@@ -2379,7 +2631,7 @@ event_sched_in(struct perf_event *event,
 
 	perf_log_itrace_start(event);
 
-	if (event->pmu->add(event, PERF_EF_START)) {
+	if (event_pmu_add(event, ctx)) {
 		perf_event_set_state(event, PERF_EVENT_STATE_INACTIVE);
 		event->oncpu = -1;
 		ret = -EAGAIN;
@@ -2493,9 +2745,6 @@ static void add_event_to_ctx(struct perf_event *event,
 	perf_group_attach(event);
 }
 
-static void ctx_sched_out(struct perf_event_context *ctx,
-			  struct perf_cpu_context *cpuctx,
-			  enum event_type_t event_type);
 static void
 ctx_sched_in(struct perf_event_context *ctx,
 	     struct perf_cpu_context *cpuctx,
@@ -2544,7 +2793,9 @@ static void perf_event_sched_in(struct perf_cpu_context *cpuctx,
  */
 static void ctx_resched(struct perf_cpu_context *cpuctx,
 			struct perf_event_context *task_ctx,
-			enum event_type_t event_type)
+			enum event_type_t event_type,
+			struct perf_event *event_add_dup,
+			struct perf_event *event_del_dup)
 {
 	enum event_type_t ctx_event_type;
 	bool cpu_event = !!(event_type & EVENT_CPU);
@@ -2574,6 +2825,18 @@ static void ctx_resched(struct perf_cpu_context *cpuctx,
 	else if (ctx_event_type & EVENT_PINNED)
 		cpu_ctx_sched_out(cpuctx, EVENT_FLEXIBLE);
 
+	if (event_add_dup) {
+		if (event_add_dup->ctx->is_active)
+			ctx_sched_out(event_add_dup->ctx, cpuctx, EVENT_ALL);
+		perf_event_setup_dup(event_add_dup, event_add_dup->ctx);
+	}
+
+	if (event_del_dup) {
+		if (event_del_dup->ctx->is_active)
+			ctx_sched_out(event_del_dup->ctx, cpuctx, EVENT_ALL);
+		perf_event_remove_dup(event_del_dup, event_del_dup->ctx);
+	}
+
 	perf_event_sched_in(cpuctx, task_ctx, current);
 	perf_pmu_enable(cpuctx->ctx.pmu);
 }
@@ -2584,7 +2847,7 @@ void perf_pmu_resched(struct pmu *pmu)
 	struct perf_event_context *task_ctx = cpuctx->task_ctx;
 
 	perf_ctx_lock(cpuctx, task_ctx);
-	ctx_resched(cpuctx, task_ctx, EVENT_ALL|EVENT_CPU);
+	ctx_resched(cpuctx, task_ctx, EVENT_ALL|EVENT_CPU, NULL, NULL);
 	perf_ctx_unlock(cpuctx, task_ctx);
 }
 
@@ -2642,9 +2905,11 @@ static int  __perf_install_in_context(void *info)
 	if (reprogram) {
 		ctx_sched_out(ctx, cpuctx, EVENT_TIME);
 		add_event_to_ctx(event, ctx);
-		ctx_resched(cpuctx, task_ctx, get_event_type(event));
+		ctx_resched(cpuctx, task_ctx, get_event_type(event),
+			    event, NULL);
 	} else {
 		add_event_to_ctx(event, ctx);
+		perf_event_setup_dup(event, ctx);
 	}
 
 unlock:
@@ -2789,8 +3054,10 @@ static void __perf_event_enable(struct perf_event *event,
 
 	perf_event_set_state(event, PERF_EVENT_STATE_INACTIVE);
 
-	if (!ctx->is_active)
+	if (!ctx->is_active) {
+		perf_event_setup_dup(event, ctx);
 		return;
+	}
 
 	if (!event_filter_match(event)) {
 		ctx_sched_in(ctx, cpuctx, EVENT_TIME, current);
@@ -2801,7 +3068,7 @@ static void __perf_event_enable(struct perf_event *event,
 	 * If the event is in a group and isn't the group leader,
 	 * then don't put it on unless the group is on.
 	 */
-	if (leader != event && leader->state != PERF_EVENT_STATE_ACTIVE) {
+	if (leader != event && leader->state <= PERF_EVENT_STATE_INACTIVE) {
 		ctx_sched_in(ctx, cpuctx, EVENT_TIME, current);
 		return;
 	}
@@ -2810,7 +3077,8 @@ static void __perf_event_enable(struct perf_event *event,
 	if (ctx->task)
 		WARN_ON_ONCE(task_ctx != ctx);
 
-	ctx_resched(cpuctx, task_ctx, get_event_type(event));
+	ctx_resched(cpuctx, task_ctx, get_event_type(event),
+		    event, NULL);
 }
 
 /*
@@ -3148,8 +3416,8 @@ static void __perf_event_sync_stat(struct perf_event *event,
 	 * we know the event must be on the current CPU, therefore we
 	 * don't need to use it.
 	 */
-	if (event->state == PERF_EVENT_STATE_ACTIVE)
-		event->pmu->read(event);
+	if (event->state > PERF_EVENT_STATE_INACTIVE)
+		event_pmu_read(event);
 
 	perf_event_update_time(event);
 
@@ -3953,7 +4221,7 @@ static void perf_event_enable_on_exec(int ctxn)
 	 */
 	if (enabled) {
 		clone_ctx = unclone_ctx(ctx);
-		ctx_resched(cpuctx, ctx, event_type);
+		ctx_resched(cpuctx, ctx, event_type, NULL, NULL);
 	} else {
 		ctx_sched_in(ctx, cpuctx, EVENT_TIME, current);
 	}
@@ -4024,22 +4292,22 @@ static void __perf_event_read(void *info)
 		goto unlock;
 
 	if (!data->group) {
-		pmu->read(event);
+		event_pmu_read(event);
 		data->ret = 0;
 		goto unlock;
 	}
 
 	pmu->start_txn(pmu, PERF_PMU_TXN_READ);
 
-	pmu->read(event);
+	event_pmu_read(event);
 
 	for_each_sibling_event(sub, event) {
-		if (sub->state == PERF_EVENT_STATE_ACTIVE) {
+		if (sub->state > PERF_EVENT_STATE_INACTIVE) {
 			/*
 			 * Use sibling's PMU rather than @event's since
 			 * sibling could be on different (eg: software) PMU.
 			 */
-			sub->pmu->read(sub);
+			event_pmu_read(sub);
 		}
 	}
 
@@ -4051,6 +4319,9 @@ static void __perf_event_read(void *info)
 
 static inline u64 perf_event_count(struct perf_event *event)
 {
+	if (event->dup_master == event)
+		return local64_read(&event->master_count) +
+			atomic64_read(&event->master_child_count);
 	return local64_read(&event->count) + atomic64_read(&event->child_count);
 }
 
@@ -4109,9 +4380,12 @@ int perf_event_read_local(struct perf_event *event, u64 *value,
 	 * oncpu == -1).
 	 */
 	if (event->oncpu == smp_processor_id())
-		event->pmu->read(event);
+		event_pmu_read(event);
 
-	*value = local64_read(&event->count);
+	if (event->dup_master == event)
+		*value = local64_read(&event->master_count);
+	else
+		*value = local64_read(&event->count);
 	if (enabled || running) {
 		u64 now = event->shadow_ctx_time + perf_clock();
 		u64 __enabled, __running;
@@ -4138,7 +4412,7 @@ static int perf_event_read(struct perf_event *event, bool group)
 	 * value in the event structure:
 	 */
 again:
-	if (state == PERF_EVENT_STATE_ACTIVE) {
+	if (state > PERF_EVENT_STATE_INACTIVE) {
 		struct perf_read_data data;
 
 		/*
@@ -6488,8 +6762,8 @@ static void perf_output_read_group(struct perf_output_handle *handle,
 		values[n++] = running;
 
 	if ((leader != event) &&
-	    (leader->state == PERF_EVENT_STATE_ACTIVE))
-		leader->pmu->read(leader);
+	    (leader->state > PERF_EVENT_STATE_INACTIVE))
+		event_pmu_read(leader);
 
 	values[n++] = perf_event_count(leader);
 	if (read_format & PERF_FORMAT_ID)
@@ -6501,8 +6775,8 @@ static void perf_output_read_group(struct perf_output_handle *handle,
 		n = 0;
 
 		if ((sub != event) &&
-		    (sub->state == PERF_EVENT_STATE_ACTIVE))
-			sub->pmu->read(sub);
+		    (sub->state > PERF_EVENT_STATE_INACTIVE))
+			event_pmu_read(sub);
 
 		values[n++] = perf_event_count(sub);
 		if (read_format & PERF_FORMAT_ID)
@@ -9800,10 +10074,10 @@ static enum hrtimer_restart perf_swevent_hrtimer(struct hrtimer *hrtimer)
 
 	event = container_of(hrtimer, struct perf_event, hw.hrtimer);
 
-	if (event->state != PERF_EVENT_STATE_ACTIVE)
+	if (event->state <= PERF_EVENT_STATE_INACTIVE)
 		return HRTIMER_NORESTART;
 
-	event->pmu->read(event);
+	event_pmu_read(event);
 
 	perf_sample_data_init(&data, 0, event->hw.last_period);
 	regs = get_irq_regs();
@@ -11492,9 +11766,17 @@ SYSCALL_DEFINE5(perf_event_open,
 		perf_remove_from_context(group_leader, 0);
 		put_ctx(gctx);
 
+		/*
+		 * move_group only happens for sw events, moving from a sw ctx
+		 * to a hw ctx. The sw events should not have a valid
+		 * dup_master, so it is not necessary to handle dup events here.
+		 */
+		WARN_ON_ONCE(group_leader->dup_master);
+
 		for_each_sibling_event(sibling, group_leader) {
 			perf_remove_from_context(sibling, 0);
 			put_ctx(gctx);
+			WARN_ON_ONCE(sibling->dup_master);
 		}
 
 		/*
@@ -11761,7 +12043,10 @@ static void sync_child_event(struct perf_event *child_event,
 	/*
 	 * Add back the child's count to the parent's count:
 	 */
-	atomic64_add(child_val, &parent_event->child_count);
+	if (parent_event->dup_master == parent_event)
+		atomic64_add(child_val, &parent_event->master_child_count);
+	else
+		atomic64_add(child_val, &parent_event->child_count);
 	atomic64_add(child_event->total_time_enabled,
 		     &parent_event->child_total_time_enabled);
 	atomic64_add(child_event->total_time_running,
@@ -12140,6 +12425,7 @@ inherit_event(struct perf_event *parent_event,
 	 */
 	raw_spin_lock_irqsave(&child_ctx->lock, flags);
 	add_event_to_ctx(child_event, child_ctx);
+	perf_event_setup_dup(child_event, child_ctx);
 	raw_spin_unlock_irqrestore(&child_ctx->lock, flags);
 
 	/*
-- 
2.17.1



* Re: [PATCH v8] perf: Sharing PMU counters across compatible events
  2019-12-07  0:24 [PATCH v8] perf: Sharing PMU counters across compatible events Song Liu
@ 2019-12-11 18:50 ` Song Liu
  2019-12-12 13:42 ` Peter Zijlstra
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 14+ messages in thread
From: Song Liu @ 2019-12-11 18:50 UTC (permalink / raw)
  To: open list
  Cc: Kernel Team, Peter Zijlstra, Arnaldo Carvalho de Melo, Jiri Olsa,
	Alexey Budankov, Namhyung Kim, Tejun Heo

Hi Peter and Jiri,

> On Dec 6, 2019, at 4:24 PM, Song Liu <songliubraving@fb.com> wrote:
> 
> This patch tries to enable PMU sharing. When multiple perf_events are
> counting the same metric, they can share the hardware PMU counter. We
> call these events as "compatible events".
> 
> The PMU sharing are limited to events within the same perf_event_context
> (ctx). When a event is installed or enabled, search the ctx for compatible
> events. This is implemented in perf_event_setup_dup(). One of these
> compatible events are picked as the master (stored in event->dup_master).
> Similarly, when the event is removed or disabled, perf_event_remove_dup()
> is used to clean up sharing.
> 
> A new state PERF_EVENT_STATE_ENABLED is introduced for the master event.
> This state is used when the slave event is ACTIVE, but the master event
> is not.
> 
> On the critical paths (add, del read), sharing PMU counters doesn't
> increase the complexity. Helper functions event_pmu_[add|del|read]() are
> introduced to cover these cases. All these functions have O(1) time
> complexity.
> 
> Cc: Peter Zijlstra <peterz@infradead.org>
> Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
> Cc: Jiri Olsa <jolsa@kernel.org>
> Cc: Alexey Budankov <alexey.budankov@linux.intel.com>
> Cc: Namhyung Kim <namhyung@kernel.org>
> Cc: Tejun Heo <tj@kernel.org>
> Signed-off-by: Song Liu <songliubraving@fb.com>
> 
> ---
> Changes in v8:
> Fix issues with task event (Jiri).
> Fix issues with event inherit.
> Fix mmap'ed events, i.e. perf test 4 (kernel test bot).
> 
> Changes in v7:
> Major rewrite to avoid allocating extra master event.

Could you please share your feedback on this version?

Thanks,
Song



* Re: [PATCH v8] perf: Sharing PMU counters across compatible events
  2019-12-07  0:24 [PATCH v8] perf: Sharing PMU counters across compatible events Song Liu
  2019-12-11 18:50 ` Song Liu
@ 2019-12-12 13:42 ` Peter Zijlstra
  2019-12-12 15:18   ` Song Liu
  2019-12-12 13:49 ` Peter Zijlstra
  2019-12-12 15:39 ` Peter Zijlstra
  3 siblings, 1 reply; 14+ messages in thread
From: Peter Zijlstra @ 2019-12-12 13:42 UTC (permalink / raw)
  To: Song Liu
  Cc: linux-kernel, kernel-team, Arnaldo Carvalho de Melo, Jiri Olsa,
	Alexey Budankov, Namhyung Kim, Tejun Heo

On Fri, Dec 06, 2019 at 04:24:47PM -0800, Song Liu wrote:
> +/* Remove dup_master for the event */
> +static void perf_event_remove_dup(struct perf_event *event,
> +				  struct perf_event_context *ctx)
> +
> +{
> +	struct perf_event *tmp, *new_master;
> +	int count;
> +
> +	/* no sharing */
> +	if (!event->dup_master)
> +		return;
> +
> +	WARN_ON_ONCE(event->state != PERF_EVENT_STATE_INACTIVE &&
> +		     event->state != PERF_EVENT_STATE_OFF);
> +
> +	/* this event is not the master */
> +	if (event->dup_master != event) {
> +		event->dup_master = NULL;
> +		return;
> +	}
> +
> +	/* this event is the master */
> +	perf_event_exit_dup_master(event);
> +
> +	count = 0;
> +	new_master = NULL;
> +	list_for_each_entry(tmp, &ctx->event_list, event_entry) {
> +		WARN_ON_ONCE(tmp->state > PERF_EVENT_STATE_INACTIVE);
> +		if (tmp->dup_master == event) {
> +			count++;
> +			if (!new_master)
> +				new_master = tmp;
> +		}
> +	}
> +
> +	if (!count)
> +		return;
> +
> +	if (count == 1) {
> +		/* no more sharing */
> +		new_master->dup_master = NULL;
> +		return;
> +	}
> +
> +	perf_event_init_dup_master(new_master);
> +
> +	/* switch to new_master */
> +	list_for_each_entry(tmp, &ctx->event_list, event_entry)
> +		if (tmp->dup_master == event)
> +			tmp->dup_master = new_master;
> +}

I'm thinking you can do that in a single iteration:

	list_for_each_entry(tmp, &ctx->event_list, event_entry) {
		if (tmp->dup_master != event)
			continue;

		if (!new_master)
			new_master = tmp;

		tmp->dup_master = new_master;
		count++;
	}

	if (count == 1)
		new_master->dup_master = NULL;
	else
		perf_event_init_dup_master(new_master);

Hmm?


* Re: [PATCH v8] perf: Sharing PMU counters across compatible events
  2019-12-07  0:24 [PATCH v8] perf: Sharing PMU counters across compatible events Song Liu
  2019-12-11 18:50 ` Song Liu
  2019-12-12 13:42 ` Peter Zijlstra
@ 2019-12-12 13:49 ` Peter Zijlstra
  2019-12-12 15:37   ` Song Liu
  2019-12-12 15:39 ` Peter Zijlstra
  3 siblings, 1 reply; 14+ messages in thread
From: Peter Zijlstra @ 2019-12-12 13:49 UTC (permalink / raw)
  To: Song Liu
  Cc: linux-kernel, kernel-team, Arnaldo Carvalho de Melo, Jiri Olsa,
	Alexey Budankov, Namhyung Kim, Tejun Heo

On Fri, Dec 06, 2019 at 04:24:47PM -0800, Song Liu wrote:

> @@ -750,6 +752,16 @@ struct perf_event {
>  	void *security;
>  #endif
>  	struct list_head		sb_list;
> +
> +	/* for PMU sharing */
> +	struct perf_event		*dup_master;
> +	/* check event_sync_dup_count() for the use of dup_base_* */
> +	u64				dup_base_count;
> +	u64				dup_base_child_count;
> +	/* when this event is master,  read from master*count */
> +	local64_t			master_count;
> +	atomic64_t			master_child_count;
> +	int				dup_active_count;
>  #endif /* CONFIG_PERF_EVENTS */
>  };

> +/* PMU sharing aware version of event->pmu->add() */
> +static int event_pmu_add(struct perf_event *event,
> +			 struct perf_event_context *ctx)
> +{
> +	struct perf_event *master;
> +	int ret;
> +
> +	/* no sharing, just do event->pmu->add() */
> +	if (!event->dup_master)
> +		return event->pmu->add(event, PERF_EF_START);

Possibly we should look at the location of perf_event::dup_master so
that it sits in a hot cacheline, because I'm thinking you just added a
guaranteed miss here.

> +
> +	master = event->dup_master;
> +
> +	if (!master->dup_active_count) {
> +		ret = event->pmu->add(master, PERF_EF_START);
> +		if (ret)
> +			return ret;
> +
> +		if (master != event)
> +			perf_event_set_state(master, PERF_EVENT_STATE_ENABLED);
> +	}
> +
> +	master->dup_active_count++;
> +	master->pmu->read(master);
> +	event->dup_base_count = local64_read(&master->count);
> +	event->dup_base_child_count = atomic64_read(&master->child_count);
> +	return 0;
> +}

> +/* PMU sharing aware version of event->pmu->del() */
> +static void event_pmu_del(struct perf_event *event,
> +			  struct perf_event_context *ctx)
> +{
> +	struct perf_event *master;
> +
> +	if (event->dup_master == NULL) {
> +		event->pmu->del(event, 0);
> +		return;
> +	}

How about you write it exactly like the add version:

	if (!event->dup_master)
		return event->pmu->del(event, 0);

?

> +
> +	master = event->dup_master;
> +	event_sync_dup_count(event, master);
> +	if (--master->dup_active_count == 0) {
> +		event->pmu->del(master, 0);
> +		perf_event_set_state(master, PERF_EVENT_STATE_INACTIVE);
> +	} else if (master == event) {
> +		perf_event_set_state(master, PERF_EVENT_STATE_ENABLED);
> +	}
> +}
> +
> +/* PMU sharing aware version of event->pmu->read() */
> +static void event_pmu_read(struct perf_event *event)
> +{
> +	if (event->dup_master == NULL) {
> +		event->pmu->read(event);
> +		return;
> +	}

And here too.

> +	event_sync_dup_count(event, event->dup_master);
> +}


* Re: [PATCH v8] perf: Sharing PMU counters across compatible events
  2019-12-12 13:42 ` Peter Zijlstra
@ 2019-12-12 15:18   ` Song Liu
  0 siblings, 0 replies; 14+ messages in thread
From: Song Liu @ 2019-12-12 15:18 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: open list, Kernel Team, Arnaldo Carvalho de Melo, Jiri Olsa,
	Alexey Budankov, Namhyung Kim, Tejun Heo



> On Dec 12, 2019, at 5:42 AM, Peter Zijlstra <peterz@infradead.org> wrote:
> 
> On Fri, Dec 06, 2019 at 04:24:47PM -0800, Song Liu wrote:
>> +/* Remove dup_master for the event */
>> +static void perf_event_remove_dup(struct perf_event *event,
>> +				  struct perf_event_context *ctx)
>> +
>> +{
>> +	struct perf_event *tmp, *new_master;
>> +	int count;
>> +
>> +	/* no sharing */
>> +	if (!event->dup_master)
>> +		return;
>> +
>> +	WARN_ON_ONCE(event->state != PERF_EVENT_STATE_INACTIVE &&
>> +		     event->state != PERF_EVENT_STATE_OFF);
>> +
>> +	/* this event is not the master */
>> +	if (event->dup_master != event) {
>> +		event->dup_master = NULL;
>> +		return;
>> +	}
>> +
>> +	/* this event is the master */
>> +	perf_event_exit_dup_master(event);
>> +
>> +	count = 0;
>> +	new_master = NULL;
>> +	list_for_each_entry(tmp, &ctx->event_list, event_entry) {
>> +		WARN_ON_ONCE(tmp->state > PERF_EVENT_STATE_INACTIVE);
>> +		if (tmp->dup_master == event) {
>> +			count++;
>> +			if (!new_master)
>> +				new_master = tmp;
>> +		}
>> +	}
>> +
>> +	if (!count)
>> +		return;
>> +
>> +	if (count == 1) {
>> +		/* no more sharing */
>> +		new_master->dup_master = NULL;
>> +		return;
>> +	}
>> +
>> +	perf_event_init_dup_master(new_master);
>> +
>> +	/* switch to new_master */
>> +	list_for_each_entry(tmp, &ctx->event_list, event_entry)
>> +		if (tmp->dup_master == event)
>> +			tmp->dup_master = new_master;
>> +}
> 
> I'm thinking you can do that in a single iteration:
> 
> 	list_for_each_entry(tmp, &ctx->event_list, event_entry) {
> 		if (tmp->dup_master != event)
> 			continue;
> 
> 		if (!new_master)
> 			new_master = tmp;
> 
> 		tmp->dup_master = new_master;
> 		count++;
> 	}
> 
> 	if (count == 1)
> 		new_master->dup_master = NULL;
> 	else
> 		perf_event_init_dup_master(new_master);
> 
> Hmm?

This should work. Let me fix in v9. 

Thanks,
Song


* Re: [PATCH v8] perf: Sharing PMU counters across compatible events
  2019-12-12 13:49 ` Peter Zijlstra
@ 2019-12-12 15:37   ` Song Liu
  2019-12-12 15:44     ` Peter Zijlstra
  0 siblings, 1 reply; 14+ messages in thread
From: Song Liu @ 2019-12-12 15:37 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: open list, Kernel Team, Arnaldo Carvalho de Melo, Jiri Olsa,
	Alexey Budankov, Namhyung Kim, Tejun Heo



> On Dec 12, 2019, at 5:49 AM, Peter Zijlstra <peterz@infradead.org> wrote:
> 
> On Fri, Dec 06, 2019 at 04:24:47PM -0800, Song Liu wrote:
> 
>> @@ -750,6 +752,16 @@ struct perf_event {
>> 	void *security;
>> #endif
>> 	struct list_head		sb_list;
>> +
>> +	/* for PMU sharing */
>> +	struct perf_event		*dup_master;
>> +	/* check event_sync_dup_count() for the use of dup_base_* */
>> +	u64				dup_base_count;
>> +	u64				dup_base_child_count;
>> +	/* when this event is master,  read from master*count */
>> +	local64_t			master_count;
>> +	atomic64_t			master_child_count;
>> +	int				dup_active_count;
>> #endif /* CONFIG_PERF_EVENTS */
>> };
> 
>> +/* PMU sharing aware version of event->pmu->add() */
>> +static int event_pmu_add(struct perf_event *event,
>> +			 struct perf_event_context *ctx)
>> +{
>> +	struct perf_event *master;
>> +	int ret;
>> +
>> +	/* no sharing, just do event->pmu->add() */
>> +	if (!event->dup_master)
>> +		return event->pmu->add(event, PERF_EF_START);
> 
> Possibly we should look at the location of perf_event::dup_master to be
> in a hot cacheline. Because I'm thinking you just added a guaranteed
> miss here.

I am not quite sure about the best location for these. How about:

diff --git i/include/linux/perf_event.h w/include/linux/perf_event.h
index 7d49f9251621..218cc7f75775 100644
--- i/include/linux/perf_event.h
+++ w/include/linux/perf_event.h
@@ -643,6 +643,16 @@ struct perf_event {
        local64_t                       count;
        atomic64_t                      child_count;

+       /* for PMU sharing */
+       struct perf_event               *dup_master;
+       /* check event_sync_dup_count() for the use of dup_base_* */
+       u64                             dup_base_count;
+       u64                             dup_base_child_count;
+       /* when this event is master,  read from master*count */
+       local64_t                       master_count;
+       atomic64_t                      master_child_count;
+       int                             dup_active_count;
+
        /*
         * These are the total time in nanoseconds that the event
         * has been enabled (i.e. eligible to run, and the task has

?

> 
>> +
>> +	master = event->dup_master;
>> +
>> +	if (!master->dup_active_count) {
>> +		ret = event->pmu->add(master, PERF_EF_START);
>> +		if (ret)
>> +			return ret;
>> +
>> +		if (master != event)
>> +			perf_event_set_state(master, PERF_EVENT_STATE_ENABLED);
>> +	}
>> +
>> +	master->dup_active_count++;
>> +	master->pmu->read(master);
>> +	event->dup_base_count = local64_read(&master->count);
>> +	event->dup_base_child_count = atomic64_read(&master->child_count);
>> +	return 0;
>> +}
> 
>> +/* PMU sharing aware version of event->pmu->del() */
>> +static void event_pmu_del(struct perf_event *event,
>> +			  struct perf_event_context *ctx)
>> +{
>> +	struct perf_event *master;
>> +
>> +	if (event->dup_master == NULL) {
>> +		event->pmu->del(event, 0);
>> +		return;
>> +	}
> 
> How about you write it exactly like the add version:
> 
> 	if (!event->dup_master)
> 		return event->pmu->del(event, 0);
> 
> ?

Sure, will fix in v9. 

Thanks,
Song
> 
>> +
>> +	master = event->dup_master;
>> +	event_sync_dup_count(event, master);
>> +	if (--master->dup_active_count == 0) {
>> +		event->pmu->del(master, 0);
>> +		perf_event_set_state(master, PERF_EVENT_STATE_INACTIVE);
>> +	} else if (master == event) {
>> +		perf_event_set_state(master, PERF_EVENT_STATE_ENABLED);
>> +	}
>> +}
>> +
>> +/* PMU sharing aware version of event->pmu->read() */
>> +static void event_pmu_read(struct perf_event *event)
>> +{
>> +	if (event->dup_master == NULL) {
>> +		event->pmu->read(event);
>> +		return;
>> +	}
> 
> And here too.
> 
>> +	event_sync_dup_count(event, event->dup_master);
>> +}



* Re: [PATCH v8] perf: Sharing PMU counters across compatible events
  2019-12-07  0:24 [PATCH v8] perf: Sharing PMU counters across compatible events Song Liu
                   ` (2 preceding siblings ...)
  2019-12-12 13:49 ` Peter Zijlstra
@ 2019-12-12 15:39 ` Peter Zijlstra
  2019-12-12 15:45   ` Song Liu
  3 siblings, 1 reply; 14+ messages in thread
From: Peter Zijlstra @ 2019-12-12 15:39 UTC (permalink / raw)
  To: Song Liu
  Cc: linux-kernel, kernel-team, Arnaldo Carvalho de Melo, Jiri Olsa,
	Alexey Budankov, Namhyung Kim, Tejun Heo

On Fri, Dec 06, 2019 at 04:24:47PM -0800, Song Liu wrote:

> @@ -2174,6 +2410,14 @@ __perf_remove_from_context(struct perf_event *event,
>  		update_cgrp_time_from_cpuctx(cpuctx);
>  	}
>  
> +	if (event->dup_master == event) {
> +		if (ctx->is_active)
> +			ctx_resched(cpuctx, cpuctx->task_ctx,
> +				    get_event_type(event), NULL, event);
> +		else
> +			perf_event_remove_dup(event, ctx);
> +	}
> +
>  	event_sched_out(event, cpuctx, ctx);
>  	if (flags & DETACH_GROUP)
>  		perf_group_detach(event);
> @@ -2241,6 +2485,14 @@ static void __perf_event_disable(struct perf_event *event,
>  		update_cgrp_time_from_event(event);
>  	}
>  
> +	if (event->dup_master == event) {
> +		if (ctx->is_active)
> +			ctx_resched(cpuctx, cpuctx->task_ctx,
> +				    get_event_type(event), NULL, event);
> +		else
> +			perf_event_remove_dup(event, ctx);
> +	}
> +
>  	if (event == event->group_leader)
>  		group_sched_out(event, cpuctx, ctx);
>  	else

> @@ -2544,7 +2793,9 @@ static void perf_event_sched_in(struct perf_cpu_context *cpuctx,
>   */
>  static void ctx_resched(struct perf_cpu_context *cpuctx,
>  			struct perf_event_context *task_ctx,
> -			enum event_type_t event_type)
> +			enum event_type_t event_type,
> +			struct perf_event *event_add_dup,
> +			struct perf_event *event_del_dup)
>  {
>  	enum event_type_t ctx_event_type;
>  	bool cpu_event = !!(event_type & EVENT_CPU);
> @@ -2574,6 +2825,18 @@ static void ctx_resched(struct perf_cpu_context *cpuctx,
>  	else if (ctx_event_type & EVENT_PINNED)
>  		cpu_ctx_sched_out(cpuctx, EVENT_FLEXIBLE);
>  
> +	if (event_add_dup) {
> +		if (event_add_dup->ctx->is_active)
> +			ctx_sched_out(event_add_dup->ctx, cpuctx, EVENT_ALL);
> +		perf_event_setup_dup(event_add_dup, event_add_dup->ctx);
> +	}
> +
> +	if (event_del_dup) {
> +		if (event_del_dup->ctx->is_active)
> +			ctx_sched_out(event_del_dup->ctx, cpuctx, EVENT_ALL);
> +		perf_event_remove_dup(event_del_dup, event_del_dup->ctx);
> +	}
> +
>  	perf_event_sched_in(cpuctx, task_ctx, current);
>  	perf_pmu_enable(cpuctx->ctx.pmu);
>  }

Yuck!

Why do you do a full reschedule when you take out a master?


* Re: [PATCH v8] perf: Sharing PMU counters across compatible events
  2019-12-12 15:37   ` Song Liu
@ 2019-12-12 15:44     ` Peter Zijlstra
  2019-12-12 15:47       ` Song Liu
  0 siblings, 1 reply; 14+ messages in thread
From: Peter Zijlstra @ 2019-12-12 15:44 UTC (permalink / raw)
  To: Song Liu
  Cc: open list, Kernel Team, Arnaldo Carvalho de Melo, Jiri Olsa,
	Alexey Budankov, Namhyung Kim, Tejun Heo

On Thu, Dec 12, 2019 at 03:37:05PM +0000, Song Liu wrote:
> 
> 
> > On Dec 12, 2019, at 5:49 AM, Peter Zijlstra <peterz@infradead.org> wrote:
> > 
> > On Fri, Dec 06, 2019 at 04:24:47PM -0800, Song Liu wrote:
> > 
> >> @@ -750,6 +752,16 @@ struct perf_event {
> >> 	void *security;
> >> #endif
> >> 	struct list_head		sb_list;
> >> +
> >> +	/* for PMU sharing */
> >> +	struct perf_event		*dup_master;
> >> +	/* check event_sync_dup_count() for the use of dup_base_* */
> >> +	u64				dup_base_count;
> >> +	u64				dup_base_child_count;
> >> +	/* when this event is master,  read from master*count */
> >> +	local64_t			master_count;
> >> +	atomic64_t			master_child_count;
> >> +	int				dup_active_count;
> >> #endif /* CONFIG_PERF_EVENTS */
> >> };
> > 
> >> +/* PMU sharing aware version of event->pmu->add() */
> >> +static int event_pmu_add(struct perf_event *event,
> >> +			 struct perf_event_context *ctx)
> >> +{
> >> +	struct perf_event *master;
> >> +	int ret;
> >> +
> >> +	/* no sharing, just do event->pmu->add() */
> >> +	if (!event->dup_master)
> >> +		return event->pmu->add(event, PERF_EF_START);
> > 
> > Possibly we should look at the location of perf_event::dup_master to be
> > in a hot cacheline. Because I'm thinking you just added a guaranteed
> > miss here.
> 
> I am not quite sure the best location for these. How about:
> 
> diff --git i/include/linux/perf_event.h w/include/linux/perf_event.h
> index 7d49f9251621..218cc7f75775 100644
> --- i/include/linux/perf_event.h
> +++ w/include/linux/perf_event.h
> @@ -643,6 +643,16 @@ struct perf_event {
>         local64_t                       count;
>         atomic64_t                      child_count;
> 
> +       /* for PMU sharing */
> +       struct perf_event               *dup_master;
> +       /* check event_sync_dup_count() for the use of dup_base_* */
> +       u64                             dup_base_count;
> +       u64                             dup_base_child_count;
> +       /* when this event is master,  read from master*count */
> +       local64_t                       master_count;
> +       atomic64_t                      master_child_count;
> +       int                             dup_active_count;
> +
>         /*
>          * These are the total time in nanoseconds that the event
>          * has been enabled (i.e. eligible to run, and the task has

Ah, no. Only put dup_master somewhere hot. The rest is not that
important. For instance, you can put it right next to event->pmu,
because that's going to be used right next to it, right?


* Re: [PATCH v8] perf: Sharing PMU counters across compatible events
  2019-12-12 15:39 ` Peter Zijlstra
@ 2019-12-12 15:45   ` Song Liu
  2019-12-12 16:00     ` Song Liu
  2019-12-12 18:52     ` Peter Zijlstra
  0 siblings, 2 replies; 14+ messages in thread
From: Song Liu @ 2019-12-12 15:45 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: open list, Kernel Team, Arnaldo Carvalho de Melo, Jiri Olsa,
	Alexey Budankov, Namhyung Kim, Tejun Heo



> On Dec 12, 2019, at 7:39 AM, Peter Zijlstra <peterz@infradead.org> wrote:
> 
> On Fri, Dec 06, 2019 at 04:24:47PM -0800, Song Liu wrote:
> 
>> @@ -2174,6 +2410,14 @@ __perf_remove_from_context(struct perf_event *event,
>> 		update_cgrp_time_from_cpuctx(cpuctx);
>> 	}
>> 
>> +	if (event->dup_master == event) {
>> +		if (ctx->is_active)
>> +			ctx_resched(cpuctx, cpuctx->task_ctx,
>> +				    get_event_type(event), NULL, event);
>> +		else
>> +			perf_event_remove_dup(event, ctx);
>> +	}
>> +
>> 	event_sched_out(event, cpuctx, ctx);
>> 	if (flags & DETACH_GROUP)
>> 		perf_group_detach(event);
>> @@ -2241,6 +2485,14 @@ static void __perf_event_disable(struct perf_event *event,
>> 		update_cgrp_time_from_event(event);
>> 	}
>> 
>> +	if (event->dup_master == event) {
>> +		if (ctx->is_active)
>> +			ctx_resched(cpuctx, cpuctx->task_ctx,
>> +				    get_event_type(event), NULL, event);
>> +		else
>> +			perf_event_remove_dup(event, ctx);
>> +	}
>> +
>> 	if (event == event->group_leader)
>> 		group_sched_out(event, cpuctx, ctx);
>> 	else
> 
>> @@ -2544,7 +2793,9 @@ static void perf_event_sched_in(struct perf_cpu_context *cpuctx,
>>  */
>> static void ctx_resched(struct perf_cpu_context *cpuctx,
>> 			struct perf_event_context *task_ctx,
>> -			enum event_type_t event_type)
>> +			enum event_type_t event_type,
>> +			struct perf_event *event_add_dup,
>> +			struct perf_event *event_del_dup)
>> {
>> 	enum event_type_t ctx_event_type;
>> 	bool cpu_event = !!(event_type & EVENT_CPU);
>> @@ -2574,6 +2825,18 @@ static void ctx_resched(struct perf_cpu_context *cpuctx,
>> 	else if (ctx_event_type & EVENT_PINNED)
>> 		cpu_ctx_sched_out(cpuctx, EVENT_FLEXIBLE);
>> 
>> +	if (event_add_dup) {
>> +		if (event_add_dup->ctx->is_active)
>> +			ctx_sched_out(event_add_dup->ctx, cpuctx, EVENT_ALL);
>> +		perf_event_setup_dup(event_add_dup, event_add_dup->ctx);
>> +	}
>> +
>> +	if (event_del_dup) {
>> +		if (event_del_dup->ctx->is_active)
>> +			ctx_sched_out(event_del_dup->ctx, cpuctx, EVENT_ALL);
>> +		perf_event_remove_dup(event_del_dup, event_del_dup->ctx);
>> +	}
>> +
>> 	perf_event_sched_in(cpuctx, task_ctx, current);
>> 	perf_pmu_enable(cpuctx->ctx.pmu);
>> }
> 
> Yuck!
> 
> Why do you do a full reschedule when you take out a master?

If there are active slaves using this master, we need to schedule them
out before removing the master.

We can improve the check though. We only need to do it if the master
is in state PERF_EVENT_STATE_ENABLED. 

Or we can add a different function to only schedule out slaves. 

Thanks,
Song



* Re: [PATCH v8] perf: Sharing PMU counters across compatible events
  2019-12-12 15:44     ` Peter Zijlstra
@ 2019-12-12 15:47       ` Song Liu
  0 siblings, 0 replies; 14+ messages in thread
From: Song Liu @ 2019-12-12 15:47 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: open list, Kernel Team, Arnaldo Carvalho de Melo, Jiri Olsa,
	Alexey Budankov, Namhyung Kim, Tejun Heo



> On Dec 12, 2019, at 7:44 AM, Peter Zijlstra <peterz@infradead.org> wrote:
> 
> On Thu, Dec 12, 2019 at 03:37:05PM +0000, Song Liu wrote:
>> 
>> 
>>> On Dec 12, 2019, at 5:49 AM, Peter Zijlstra <peterz@infradead.org> wrote:
>>> 
>>> On Fri, Dec 06, 2019 at 04:24:47PM -0800, Song Liu wrote:
>>> 
>>>> @@ -750,6 +752,16 @@ struct perf_event {
>>>> 	void *security;
>>>> #endif
>>>> 	struct list_head		sb_list;
>>>> +
>>>> +	/* for PMU sharing */
>>>> +	struct perf_event		*dup_master;
>>>> +	/* check event_sync_dup_count() for the use of dup_base_* */
>>>> +	u64				dup_base_count;
>>>> +	u64				dup_base_child_count;
>>>> +	/* when this event is master,  read from master*count */
>>>> +	local64_t			master_count;
>>>> +	atomic64_t			master_child_count;
>>>> +	int				dup_active_count;
>>>> #endif /* CONFIG_PERF_EVENTS */
>>>> };
>>> 
>>>> +/* PMU sharing aware version of event->pmu->add() */
>>>> +static int event_pmu_add(struct perf_event *event,
>>>> +			 struct perf_event_context *ctx)
>>>> +{
>>>> +	struct perf_event *master;
>>>> +	int ret;
>>>> +
>>>> +	/* no sharing, just do event->pmu->add() */
>>>> +	if (!event->dup_master)
>>>> +		return event->pmu->add(event, PERF_EF_START);
>>> 
>>> Possibly we should look at the location of perf_event::dup_master to be
>>> in a hot cacheline. Because I'm thinking you just added a guaranteed
>>> miss here.
>> 
>> I am not quite sure the best location for these. How about:
>> 
>> diff --git i/include/linux/perf_event.h w/include/linux/perf_event.h
>> index 7d49f9251621..218cc7f75775 100644
>> --- i/include/linux/perf_event.h
>> +++ w/include/linux/perf_event.h
>> @@ -643,6 +643,16 @@ struct perf_event {
>>        local64_t                       count;
>>        atomic64_t                      child_count;
>> 
>> +       /* for PMU sharing */
>> +       struct perf_event               *dup_master;
>> +       /* check event_sync_dup_count() for the use of dup_base_* */
>> +       u64                             dup_base_count;
>> +       u64                             dup_base_child_count;
>> +       /* when this event is master,  read from master*count */
>> +       local64_t                       master_count;
>> +       atomic64_t                      master_child_count;
>> +       int                             dup_active_count;
>> +
>>        /*
>>         * These are the total time in nanoseconds that the event
>>         * has been enabled (i.e. eligible to run, and the task has
> 
> Ah, no. Only put dup_master somewhere hot. The rest is not that
> important. For instance, you can put it right next to event->pmu,
> because that's going to be used right next to it, right?

I see. Will fix it that way.
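
Maybe something like this (untested sketch; the surrounding context lines
are from memory and only approximate, and the other dup_* fields stay
where they were):

--- i/include/linux/perf_event.h
+++ w/include/linux/perf_event.h
@@ struct perf_event {
 	struct pmu			*pmu;
+	/* for PMU sharing; keep next to ->pmu, they are accessed together */
+	struct perf_event		*dup_master;
 	void				*pmu_private;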

Thanks,
Song



* Re: [PATCH v8] perf: Sharing PMU counters across compatible events
  2019-12-12 15:45   ` Song Liu
@ 2019-12-12 16:00     ` Song Liu
  2019-12-12 18:01       ` Song Liu
  2019-12-12 18:52     ` Peter Zijlstra
  1 sibling, 1 reply; 14+ messages in thread
From: Song Liu @ 2019-12-12 16:00 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: open list, Kernel Team, Arnaldo Carvalho de Melo, Jiri Olsa,
	Alexey Budankov, Namhyung Kim, Tejun Heo



> On Dec 12, 2019, at 7:45 AM, Song Liu <songliubraving@fb.com> wrote:
> 
> 
> 
>> On Dec 12, 2019, at 7:39 AM, Peter Zijlstra <peterz@infradead.org> wrote:
>> 
>> On Fri, Dec 06, 2019 at 04:24:47PM -0800, Song Liu wrote:
>> 
>>> @@ -2174,6 +2410,14 @@ __perf_remove_from_context(struct perf_event *event,
>>> 		update_cgrp_time_from_cpuctx(cpuctx);
>>> 	}
>>> 
>>> +	if (event->dup_master == event) {
>>> +		if (ctx->is_active)
>>> +			ctx_resched(cpuctx, cpuctx->task_ctx,
>>> +				    get_event_type(event), NULL, event);
>>> +		else
>>> +			perf_event_remove_dup(event, ctx);
>>> +	}
>>> +
>>> 	event_sched_out(event, cpuctx, ctx);
>>> 	if (flags & DETACH_GROUP)
>>> 		perf_group_detach(event);
>>> @@ -2241,6 +2485,14 @@ static void __perf_event_disable(struct perf_event *event,
>>> 		update_cgrp_time_from_event(event);
>>> 	}
>>> 
>>> +	if (event->dup_master == event) {
>>> +		if (ctx->is_active)
>>> +			ctx_resched(cpuctx, cpuctx->task_ctx,
>>> +				    get_event_type(event), NULL, event);
>>> +		else
>>> +			perf_event_remove_dup(event, ctx);
>>> +	}
>>> +
>>> 	if (event == event->group_leader)
>>> 		group_sched_out(event, cpuctx, ctx);
>>> 	else
>> 
>>> @@ -2544,7 +2793,9 @@ static void perf_event_sched_in(struct perf_cpu_context *cpuctx,
>>> */
>>> static void ctx_resched(struct perf_cpu_context *cpuctx,
>>> 			struct perf_event_context *task_ctx,
>>> -			enum event_type_t event_type)
>>> +			enum event_type_t event_type,
>>> +			struct perf_event *event_add_dup,
>>> +			struct perf_event *event_del_dup)
>>> {
>>> 	enum event_type_t ctx_event_type;
>>> 	bool cpu_event = !!(event_type & EVENT_CPU);
>>> @@ -2574,6 +2825,18 @@ static void ctx_resched(struct perf_cpu_context *cpuctx,
>>> 	else if (ctx_event_type & EVENT_PINNED)
>>> 		cpu_ctx_sched_out(cpuctx, EVENT_FLEXIBLE);
>>> 
>>> +	if (event_add_dup) {
>>> +		if (event_add_dup->ctx->is_active)
>>> +			ctx_sched_out(event_add_dup->ctx, cpuctx, EVENT_ALL);
>>> +		perf_event_setup_dup(event_add_dup, event_add_dup->ctx);
>>> +	}
>>> +
>>> +	if (event_del_dup) {
>>> +		if (event_del_dup->ctx->is_active)
>>> +			ctx_sched_out(event_del_dup->ctx, cpuctx, EVENT_ALL);
>>> +		perf_event_remove_dup(event_del_dup, event_del_dup->ctx);
>>> +	}
>>> +
>>> 	perf_event_sched_in(cpuctx, task_ctx, current);
>>> 	perf_pmu_enable(cpuctx->ctx.pmu);
>>> }
>> 
>> Yuck!
>> 
>> Why do you do a full reschedule when you take out a master?
> 
> If there is active slave using this master, we need to schedule out
> them before removing the master. 
> 
> We can improve the check though. We only need to do it if the master
> is in state PERF_EVENT_STATE_ENABLED. 
> 
> Or we can add a different function to only schedule out slaves. 

It is tricky to only schedule out slaves, because a slave could be in
a group. If we don't reschedule all events, we need to make sure that
"swapping the master" always succeeds.

Song


* Re: [PATCH v8] perf: Sharing PMU counters across compatible events
  2019-12-12 16:00     ` Song Liu
@ 2019-12-12 18:01       ` Song Liu
  0 siblings, 0 replies; 14+ messages in thread
From: Song Liu @ 2019-12-12 18:01 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: open list, Kernel Team, Arnaldo Carvalho de Melo, Jiri Olsa,
	Alexey Budankov, Namhyung Kim, Tejun Heo

Hi Peter,

> On Dec 12, 2019, at 8:00 AM, Song Liu <songliubraving@fb.com> wrote:
> 
> 
> 
>> On Dec 12, 2019, at 7:45 AM, Song Liu <songliubraving@fb.com> wrote:
>> 
>> 
>> 
>>> On Dec 12, 2019, at 7:39 AM, Peter Zijlstra <peterz@infradead.org> wrote:
>>> 
>>> On Fri, Dec 06, 2019 at 04:24:47PM -0800, Song Liu wrote:
>>> 
>>>> @@ -2174,6 +2410,14 @@ __perf_remove_from_context(struct perf_event *event,
>>>> 		update_cgrp_time_from_cpuctx(cpuctx);
>>>> 	}
>>>> 
>>>> +	if (event->dup_master == event) {
>>>> +		if (ctx->is_active)
>>>> +			ctx_resched(cpuctx, cpuctx->task_ctx,
>>>> +				    get_event_type(event), NULL, event);
>>>> +		else
>>>> +			perf_event_remove_dup(event, ctx);
>>>> +	}
>>>> +
>>>> 	event_sched_out(event, cpuctx, ctx);
>>>> 	if (flags & DETACH_GROUP)
>>>> 		perf_group_detach(event);
>>>> @@ -2241,6 +2485,14 @@ static void __perf_event_disable(struct perf_event *event,
>>>> 		update_cgrp_time_from_event(event);
>>>> 	}
>>>> 
>>>> +	if (event->dup_master == event) {
>>>> +		if (ctx->is_active)
>>>> +			ctx_resched(cpuctx, cpuctx->task_ctx,
>>>> +				    get_event_type(event), NULL, event);
>>>> +		else
>>>> +			perf_event_remove_dup(event, ctx);
>>>> +	}
>>>> +
>>>> 	if (event == event->group_leader)
>>>> 		group_sched_out(event, cpuctx, ctx);
>>>> 	else
>>> 
>>>> @@ -2544,7 +2793,9 @@ static void perf_event_sched_in(struct perf_cpu_context *cpuctx,
>>>> */
>>>> static void ctx_resched(struct perf_cpu_context *cpuctx,
>>>> 			struct perf_event_context *task_ctx,
>>>> -			enum event_type_t event_type)
>>>> +			enum event_type_t event_type,
>>>> +			struct perf_event *event_add_dup,
>>>> +			struct perf_event *event_del_dup)
>>>> {
>>>> 	enum event_type_t ctx_event_type;
>>>> 	bool cpu_event = !!(event_type & EVENT_CPU);
>>>> @@ -2574,6 +2825,18 @@ static void ctx_resched(struct perf_cpu_context *cpuctx,
>>>> 	else if (ctx_event_type & EVENT_PINNED)
>>>> 		cpu_ctx_sched_out(cpuctx, EVENT_FLEXIBLE);
>>>> 
>>>> +	if (event_add_dup) {
>>>> +		if (event_add_dup->ctx->is_active)
>>>> +			ctx_sched_out(event_add_dup->ctx, cpuctx, EVENT_ALL);
>>>> +		perf_event_setup_dup(event_add_dup, event_add_dup->ctx);
>>>> +	}
>>>> +
>>>> +	if (event_del_dup) {
>>>> +		if (event_del_dup->ctx->is_active)
>>>> +			ctx_sched_out(event_del_dup->ctx, cpuctx, EVENT_ALL);
>>>> +		perf_event_remove_dup(event_del_dup, event_del_dup->ctx);
>>>> +	}
>>>> +
>>>> 	perf_event_sched_in(cpuctx, task_ctx, current);
>>>> 	perf_pmu_enable(cpuctx->ctx.pmu);
>>>> }
>>> 
>>> Yuck!
>>> 
>>> Why do you do a full reschedule when you take out a master?
>> 
>> If there is active slave using this master, we need to schedule out
>> them before removing the master. 
>> 
>> We can improve the check though. We only need to do it if the master
>> is in state PERF_EVENT_STATE_ENABLED. 
>> 
>> Or we can add a different function to only schedule out slaves. 
> 
> It is tricky to only schedule out slaves, because the slave could be in
> a group. If we don't reschedule all events, we need to make sure that
> "swapping master" always succeed. 

What would you suggest for this one? Maybe we can keep this as-is and 
optimize later? 

Thanks,
Song



* Re: [PATCH v8] perf: Sharing PMU counters across compatible events
  2019-12-12 15:45   ` Song Liu
  2019-12-12 16:00     ` Song Liu
@ 2019-12-12 18:52     ` Peter Zijlstra
  2019-12-12 23:49       ` Song Liu
  1 sibling, 1 reply; 14+ messages in thread
From: Peter Zijlstra @ 2019-12-12 18:52 UTC (permalink / raw)
  To: Song Liu
  Cc: open list, Kernel Team, Arnaldo Carvalho de Melo, Jiri Olsa,
	Alexey Budankov, Namhyung Kim, Tejun Heo

On Thu, Dec 12, 2019 at 03:45:49PM +0000, Song Liu wrote:
> > On Dec 12, 2019, at 7:39 AM, Peter Zijlstra <peterz@infradead.org> wrote:

> > Yuck!
> > 
> > Why do you do a full reschedule when you take out a master?
> 
> If there is active slave using this master, we need to schedule out
> them before removing the master. 
> 
> We can improve the check though. We only need to do it if the master
> is in state PERF_EVENT_STATE_ENABLED. 
> 
> Or we can add a different function to only schedule out slaves. 

So I've been thinking: this is because an NMI from another event can
come in and then do a PERF_SAMPLE_READ which covers our event, right?

AFAICT every other case will run under ctx->lock, which we own at this
point.

So can't we:

 1 - stop the current master (such that the counts are frozen)
 2 - pick the new master
 3 - initialize the new master (such that the counts match)
 4 - set the new master on all other events
 5 - start the new master (counters run again)

Then, no matter where the NMI lands, it will always find either the old
or the new master and their counts will match.

You really don't need to stop all events.
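
Something like the below, maybe (completely untested sketch that just
transcribes the five steps above; "old_master" is the master going away,
pick_new_master() is a made-up helper, and this assumes ctx->lock is
held):

	struct perf_event *tmp, *new_master;

	/* 1: stop the old master so its counts are frozen */
	old_master->pmu->stop(old_master, PERF_EF_UPDATE);

	/* 2 + 3: pick the new master and initialize it so the counts match */
	new_master = pick_new_master(ctx, old_master);
	perf_event_init_dup_master(new_master);

	/* 4: point every other dup event at the new master
	 *    (their dup_base_* would need resyncing too) */
	list_for_each_entry(tmp, &ctx->event_list, event_entry)
		if (tmp->dup_master == old_master)
			tmp->dup_master = new_master;

	/* 5: start the new master; the counters run again */
	new_master->pmu->start(new_master, PERF_EF_RELOAD);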


* Re: [PATCH v8] perf: Sharing PMU counters across compatible events
  2019-12-12 18:52     ` Peter Zijlstra
@ 2019-12-12 23:49       ` Song Liu
  0 siblings, 0 replies; 14+ messages in thread
From: Song Liu @ 2019-12-12 23:49 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: open list, Kernel Team, Arnaldo Carvalho de Melo, Jiri Olsa,
	Alexey Budankov, Namhyung Kim, Tejun Heo



> On Dec 12, 2019, at 10:52 AM, Peter Zijlstra <peterz@infradead.org> wrote:
> 
> On Thu, Dec 12, 2019 at 03:45:49PM +0000, Song Liu wrote:
>>> On Dec 12, 2019, at 7:39 AM, Peter Zijlstra <peterz@infradead.org> wrote:
> 
>>> Yuck!
>>> 
>>> Why do you do a full reschedule when you take out a master?
>> 
>> If there is active slave using this master, we need to schedule out
>> them before removing the master. 
>> 
>> We can improve the check though. We only need to do it if the master
>> is in state PERF_EVENT_STATE_ENABLED. 
>> 
>> Or we can add a different function to only schedule out slaves. 
> 
> So I've been thinking, this is because an NMI from another event can
> come in and then does PERF_SAMPLE_READ which covers our event, right?
> 
> AFAICT every other case will run under ctx->lock, which we own at this
> point.

Right, we hold ctx->lock here, so it should be safe in most cases.

> 
> So can't we:
> 
> 1 - stop the current master (such that the counts are frozen)
> 2 - pick the new master
> 3 - initialize the new master (such that the counts match)
> 4 - set the new master on all other events
> 5 - start the new master (counters run again)
> 
> Then, no matter where the NMI lands, it will always find either the old
> or the new master and their counts will match.
> 
> You really don't need to stop all events.

I think this should work. Let me try it. 

Thanks for the suggestion,
Song

