Re: [PATCH v6] perf: Sharing PMU counters across compatible events

From: Song Liu <songliubraving@fb.com>
To: Peter Zijlstra <peterz@infradead.org>
Cc: open list <linux-kernel@vger.kernel.org>,
	Kernel Team <Kernel-team@fb.com>,
	"acme@kernel.org" <acme@kernel.org>,
	"Arnaldo Carvalho de Melo" <acme@redhat.com>,
	Jiri Olsa <jolsa@kernel.org>,
	Alexey Budankov <alexey.budankov@linux.intel.com>,
	Namhyung Kim <namhyung@kernel.org>, "Tejun Heo" <tj@kernel.org>
Subject: Re: [PATCH v6] perf: Sharing PMU counters across compatible events
Date: Thu, 31 Oct 2019 16:29:16 +0000
Message-ID: <19AE6C78-C54C-4C37-BBD2-0396BB97A474@fb.com>
In-Reply-To: <20191031124332.GQ4131@hirez.programming.kicks-ass.net>

> On Oct 31, 2019, at 5:43 AM, Peter Zijlstra <peterz@infradead.org> wrote:
> 
> On Wed, Sep 18, 2019 at 10:23:14PM -0700, Song Liu wrote:
>> This patch tries to enable PMU sharing. To make perf event scheduling
>> fast, we use special data structures.
>> 
>> An array of "struct perf_event_dup" is added to the perf_event_context,
>> to remember all the duplicated events under this ctx. Each event
>> under this ctx has a "dup_id" pointing to its perf_event_dup. Compatible
>> events under the same ctx share the same perf_event_dup. The following
>> figure shows a simplified version of the data structure.
>> 
>>      ctx ->  perf_event_dup -> master
>>                     ^
>>                     |
>>         perf_event /|
>>                     |
>>         perf_event /
>> 
>> Connections between perf_event and perf_event_dup are built when events
>> are added to or removed from the ctx, so they are not on the critical
>> path of schedule or perf_rotate_context().
>> 
>> On the critical paths (add, del, read), sharing PMU counters doesn't
>> increase the complexity. Helper functions event_pmu_[add|del|read]() are
>> introduced to cover these cases. All these functions have O(1) time
>> complexity.
>> 
>> We allocate a separate perf_event for perf_event_dup->master. This needs
>> extra attention, because perf_event_alloc() may sleep. To allocate the
>> master event properly, a new pointer, tmp_master, is added to perf_event.
>> tmp_master carries a separate perf_event into list_[add|del]_event().
>> The master event has valid ->ctx and holds ctx->refcount.
> 
> That is really nasty and expensive, it basically means every !sampling
> event carries a double allocation.
> 
> Why can't we use one of the actual events as master?

I think we can use one of the events as master. We need to be careful when
the master event is removed, but it should be doable. Let me try.
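
One way the hand-off on master removal could look, as a user-space sketch
only. Everything here (struct toy_event, the "next" list, and
toy_replace_master()) is hypothetical and models just the pointer fix-up.
It assumes, as in the event_pmu_add() example quoted below, that an event
whose dup_master points to itself is the one that owns the counter. Moving
the hardware counter and syncing counts are deliberately not modeled.

```c
#include <stddef.h>

/* Hypothetical stand-in for struct perf_event; not the kernel layout. */
struct toy_event {
	struct toy_event *dup_master;	/* points to itself when it owns the counter */
	struct toy_event *next;		/* hypothetical list of events in the ctx */
};

/*
 * Called when 'old' (currently a master) is removed from the ctx.
 * Promotes the first remaining follower of 'old' and repoints the other
 * followers at it. Returns the new master, or NULL if 'old' had no
 * followers.
 */
static struct toy_event *toy_replace_master(struct toy_event *head,
					    struct toy_event *old)
{
	struct toy_event *new_master = NULL;
	struct toy_event *pos;

	for (pos = head; pos; pos = pos->next) {
		if (pos == old || pos->dup_master != old)
			continue;
		if (!new_master)
			new_master = pos;	/* first follower is promoted */
		pos->dup_master = new_master;
	}

	/* The removed event goes back to being a standalone event. */
	old->dup_master = old;
	return new_master;
}
```

The walk is O(n) in the number of events in the ctx, but it would run only
on removal, not on the add/del/read paths.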

> 
>> +/*
>> + * Sharing PMU across compatible events
>> + *
>> + * If two perf_events in the same perf_event_context are counting the same
>> + * hardware events (instructions, cycles, etc.), they could share the
>> + * hardware PMU counter.
>> + *
>> + * When a perf_event is added to the ctx (list_add_event), it is compared
>> + * against other events in the ctx. If they can share the PMU counter,
>> + * a perf_event_dup is allocated to represent the sharing.
>> + *
>> + * Each perf_event_dup has a virtual master event, which is called by
>> + * pmu->add() and pmu->del(). We cannot call perf_event_alloc() in
>> + * list_add_event(), so it is allocated and carried by event->tmp_master
>> + * into list_add_event().
>> + *
>> + * Virtual master in different cases/paths:
>> + *
>> + * < I > perf_event_open() -> close() path:
>> + *
>> + * 1. Allocated by perf_event_alloc() in sys_perf_event_open();
>> + * 2. event->tmp_master->ctx assigned in perf_install_in_context();
>> + * 3.a. if used by ctx->dup_events, freed in perf_event_release_kernel();
>> + * 3.b. if not used by ctx->dup_events, freed in perf_event_open().
>> + *
>> + * < II > inherit_event() path:
>> + *
>> + * 1. Allocated by perf_event_alloc() in inherit_event();
>> + * 2. tmp_master->ctx assigned in inherit_event();
>> + * 3.a. if used by ctx->dup_events, freed in perf_event_release_kernel();
>> + * 3.b. if not used by ctx->dup_events, freed in inherit_event().
>> + *
>> + * < III > perf_pmu_migrate_context() path:
>> + * all dup_events removed during migration (no sharing after the move).
>> + *
>> + * < IV > perf_event_create_kernel_counter() path:
>> + * not supported yet.
>> + */
>> +struct perf_event_dup {
>> +	/*
>> +	 * master event being called by pmu->add() and pmu->del().
>> +	 * This event is allocated with perf_event_alloc(). When
>> +	 * attached to a ctx, this event should hold ctx->refcount.
>> +	 */
>> +	struct perf_event       *master;
>> +	/* number of events in the ctx that share the master */
>> +	int			total_event_count;
>> +	/* number of active events of the master */
>> +	int			active_event_count;
>> +};
>> +
>> +#define MAX_PERF_EVENT_DUP_PER_CTX 4
>> /**
>>  * struct perf_event_context - event context structure
>>  *
>> @@ -791,6 +849,9 @@ struct perf_event_context {
>> #endif
>> 	void				*task_ctx_data; /* pmu specific data */
>> 	struct rcu_head			rcu_head;
>> +
>> +	/* for PMU sharing. array is needed for O(1) access */
>> +	struct perf_event_dup		dup_events[MAX_PERF_EVENT_DUP_PER_CTX];
> 
> Yuck!
> 
> event_pmu_{add,del,read}() appear to be the consumer of this array
> thing, but I'm not seeing why we need it.
> 
> That is, again, why can't we use one of the actual events as master and
> have a dup_master pointer per event and then do something like:
> 
> event_pmu_add()
> {
> 	if (event->dup_master != event)
> 		return;
> 
> 	event->pmu->add(event, PERF_EF_START);
> }
> 
> Such that we only schedule the master events and ignore all duplicates.
> 
> Then on read it can do something like:
> 
> event_pmu_read()
> {
> 	if (event->dup_master == event)
> 		return;
> 
> 	/* use event->dup_master as counter */
> again:
> 	prev_count = local64_read(&hwc->prev_count);
> 	count = local64_read(&event->dup_master->count);
> 	if (local64_cmpxchg(&hwc->prev_count, prev_count, count) != prev_count)
> 		goto again;
> 
> 	delta = count - prev_count;
> 	local64_add(delta, &event->count);
> }
> 
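
For reference, the read side sketched above can be modeled in user space
with C11 atomics standing in for local64_t. This is only a sketch under
stated assumptions: struct toy_event and toy_event_read() are hypothetical
names, and the master's count is assumed to be kept up to date elsewhere
(in the kernel that would be the PMU driver's job).

```c
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical stand-in for struct perf_event; not the kernel layout. */
struct toy_event {
	struct toy_event *dup_master;	/* points to itself for the master */
	_Atomic int64_t count;		/* value reported to the user */
	_Atomic int64_t prev_count;	/* master count seen at the last read */
};

/* Fold the master's progress since the last read into a follower. */
static void toy_event_read(struct toy_event *event)
{
	int64_t prev, now;

	/* The master's count is assumed to be maintained elsewhere. */
	if (event->dup_master == event)
		return;

	prev = atomic_load(&event->prev_count);
	do {
		now = atomic_load(&event->dup_master->count);
		/* On failure, 'prev' is refreshed with the current value. */
	} while (!atomic_compare_exchange_weak(&event->prev_count, &prev, now));

	atomic_fetch_add(&event->count, now - prev);
}
```

A follower that joins while the master reads 100 would start with
prev_count = 100, so only counts accumulated after that point are added
to it.
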
>> };
> 
>> +/* Returns whether a perf_event can share PMU counter with other events */
>> +static inline bool perf_event_can_share(struct perf_event *event)
>> +{
>> +	/* only do sharing for hardware events */
>> +	if (is_software_event(event))
>> +		return false;
>> +
>> +	/*
>> +	 * limit sharing to counting events.
>> +	 * perf-stat sets PERF_SAMPLE_IDENTIFIER for counting events, so
>> +	 * let that in.
>> +	 */
>> +	if (event->attr.sample_type & ~PERF_SAMPLE_IDENTIFIER)
>> +		return false;
> 
> Why is is_sampling_event() not usable?

Hmm... let me try it. Thanks for the pointer. 
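
A sketch of what the check might look like with that change, as a
user-space model. The toy_* names and the "software" flag are hypothetical.
The model assumes is_sampling_event() tests attr.sample_period != 0, which
is how I read include/linux/perf_event.h, so a perf-stat style event with
PERF_SAMPLE_IDENTIFIER set but no sample period would still qualify.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, trimmed-down stand-ins for perf_event_attr/perf_event. */
struct toy_attr {
	uint64_t sample_period;
	uint64_t sample_type;
};

struct toy_event {
	struct toy_attr attr;
	bool software;		/* models is_software_event() */
};

/* Assumed to mirror is_sampling_event(): a non-zero period means sampling. */
static bool toy_is_sampling_event(const struct toy_event *event)
{
	return event->attr.sample_period != 0;
}

/* Only hardware counting (non-sampling) events may share a counter. */
static bool toy_event_can_share(const struct toy_event *event)
{
	if (event->software)
		return false;

	if (toy_is_sampling_event(event))
		return false;

	return true;
}
```

With this form the sample_type mask no longer matters, so the special case
for PERF_SAMPLE_IDENTIFIER would go away.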

> 
>> +
>> +	return true;
>> +}
>> +
>> +/*
>> + * Returns whether the two events can share a PMU counter.
>> + *
>> + * Note: This function does NOT check perf_event_can_share() for
>> + * the two events, they should be checked before this function
>> + */
>> +static inline bool perf_event_compatible(struct perf_event *event_a,
>> +					 struct perf_event *event_b)
>> +{
>> +	return event_a->attr.type == event_b->attr.type &&
>> +		event_a->attr.config == event_b->attr.config &&
>> +		event_a->attr.config1 == event_b->attr.config1 &&
>> +		event_a->attr.config2 == event_b->attr.config2;
>> +}
> 
> Slightly scared by this one.

I feel a little nervous too. Maybe we should memcmp() the two attrs?
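
Roughly like the user-space sketch below. struct toy_attr and
toy_attr_equal() are hypothetical; the "flags" word only stands in for the
bits of the real attr (the exclude_* bits, for example) that the four-field
compare ignores. memcmp() also compares padding bytes, so this only works
if both structures were fully zeroed before being filled in.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical, trimmed-down stand-in for struct perf_event_attr. */
struct toy_attr {
	uint32_t type;
	uint32_t size;
	uint64_t config;
	uint64_t config1;
	uint64_t config2;
	uint64_t flags;		/* stand-in for bits such as exclude_* */
};

/* Two events are compatible only if every byte of the attr matches. */
static bool toy_attr_equal(const struct toy_attr *a, const struct toy_attr *b)
{
	return memcmp(a, b, sizeof(*a)) == 0;
}
```

This is stricter than comparing type/config/config1/config2: any
difference in the attr, relevant to counting or not, disables sharing.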

> 
> 
>> @@ -2612,20 +2828,9 @@ static int  __perf_install_in_context(void *info)
>> 		raw_spin_lock(&task_ctx->lock);
>> 	}
>> 
>> -#ifdef CONFIG_CGROUP_PERF
>> -	if (is_cgroup_event(event)) {
>> -		/*
>> -		 * If the current cgroup doesn't match the event's
>> -		 * cgroup, we should not try to schedule it.
>> -		 */
>> -		struct perf_cgroup *cgrp = perf_cgroup_from_task(current, ctx);
>> -		reprogram = cgroup_is_descendant(cgrp->css.cgroup,
>> -					event->cgrp->css.cgroup);
>> -	}
>> -#endif
> 
> Why is this removed?

Er... I bet I messed this up during a rebase. Sorry.

> 
>> @@ -10986,6 +11198,14 @@ SYSCALL_DEFINE5(perf_event_open,
>> 		goto err_cred;
>> 	}
>> 
>> +	if (perf_event_can_share(event)) {
>> +		event->tmp_master = perf_event_alloc(&event->attr, cpu,
>> +						     task, NULL, NULL,
>> +						     NULL, NULL, -1);
>> +		if (IS_ERR(event->tmp_master))
>> +			event->tmp_master = NULL;
>> +	}
> 
> 
>> @@ -11773,6 +12005,14 @@ inherit_event(struct perf_event *parent_event,
>> 	if (IS_ERR(child_event))
>> 		return child_event;
>> 
>> +	if (perf_event_can_share(child_event)) {
>> +		child_event->tmp_master = perf_event_alloc(&parent_event->attr,
>> +							   parent_event->cpu,
>> +							   child, NULL, NULL,
>> +							   NULL, NULL, -1);
>> +		if (IS_ERR(child_event->tmp_master))
>> +			child_event->tmp_master = NULL;
>> +	}
> 
> So this is terrible!

Let me try to get rid of the double alloc.

Thanks for the feedback!
Song