* [RFC][PATCH] perf: Implement read_group() PMU operation
@ 2015-02-06  2:59 Sukadev Bhattiprolu
  2015-02-12 15:58 ` Peter Zijlstra
  2015-02-22 21:04 ` Cody P Schafer
  0 siblings, 2 replies; 5+ messages in thread
From: Sukadev Bhattiprolu @ 2015-02-06  2:59 UTC (permalink / raw)
  To: Peter Zijlstra, mingo, Michael Ellerman, Anton Blanchard,
	Stephane Eranian
  Cc: Jiri Olsa, Arnaldo Carvalho de Melo, linux-kernel

From: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com>
Date: Thu Feb  5 20:56:20 EST 2015 -0300
Subject: [RFC][PATCH] perf: Implement read_group() PMU operation

This is a lightly tested, exploratory patch to allow PMUs to return
several counters at once. Appreciate any comments :-)

Unlike normal hardware PMCs, the 24x7 counters[1] in Power8 are stored
in memory and accessed via a hypervisor call (HCALL).  A major aspect
of the HCALL is that it allows retrieving _SEVERAL_ counters at once
(unlike regular PMCs, which are read one at a time).

This patch implements a ->read_group() PMU operation that tries to
take advantage of this ability to read several counters at once.  A
PMU that implements the ->read_group() operation would allow users
to retrieve several counters at once and get a more consistent
snapshot.

NOTE: 	This patch has a TODO in h_24x7_event_read_group() in that it
	still does multiple HCALLS. I think that can be optimized 
	independently, once the pmu->read_group() interface itself is
	finalized.

Appreciate comments on the ->read_group interface and on how best to manage
the interfaces between the core and PMU layers - eg: is it OK for the hv-24x7
PMU to walk the ->sibling_list?

[1] Some notes about 24x7 counters:

	Power8 supports 24x7 counters, which differ from traditional PMCs
	in several ways:

	- The 24x7 counters are always on and counting. Rather than
	  start/stop the PMCs, we read/report the _change_ in values
	  in the counters during the execution of the workload.

	- The 24x7 counters are not tied to a task context (they are
	  always on).

	- Rather than reading the event counts from registers, we make
	  a hypervisor call (HCALL) to retrieve counts. The HCALL allows
	  retrieving a large number of counters in a single call.

	- These counters don't generate interrupts when they overflow (so
	  sampling does not apply to these counters).
---	

diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 1d36314..b69fbdf 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -232,6 +232,13 @@ struct pmu {
 	void (*read)			(struct perf_event *event);
 
 	/*
+	 * Read a group of counters.
+	 */
+	int (*read_group)		(struct perf_event *event,
+						u64 *values,
+						int ncounters);
+
+	/*
 	 * Group events scheduling is treated as a transaction, add
 	 * group events as a whole and perform one schedulability test.
 	 * If the test fails, roll back the whole group
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 934687f..026a9d0 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -3549,10 +3549,43 @@ static int perf_event_read_group(struct perf_event *event,
 	struct perf_event *leader = event->group_leader, *sub;
 	int n = 0, size = 0, ret = -EFAULT;
 	struct perf_event_context *ctx = leader->ctx;
+	u64 *valuesp;
 	u64 values[5];
+	int use_group_read;
 	u64 count, enabled, running;
+	struct pmu *pmu = event->pmu;
+
+	/*
+	 * If PMU supports group read and group read is requested,
+	 * allocate memory before taking the mutex.
+	 */
+	use_group_read = 0;
+	if ((read_format & PERF_FORMAT_GROUP) && pmu->read_group) {
+		use_group_read++;
+	}
+
+	if (use_group_read) {
+		valuesp = kzalloc((1 + leader->nr_siblings) * sizeof(u64), GFP_KERNEL);
+		if (!valuesp)
+			return -ENOMEM;
+	}
 
 	mutex_lock(&ctx->mutex);
+
+	if (use_group_read) {
+		ret = pmu->read_group(leader, valuesp, leader->nr_siblings);
+		if (ret >= 0) {
+			size = ret * sizeof(u64);
+
+			ret = size;
+			if (copy_to_user(buf, valuesp, size))
+				ret = -EFAULT;
+		}
+
+		kfree(valuesp);
+		goto unlock;
+	}
+
 	count = perf_event_read_value(leader, &enabled, &running);
 
 	values[n++] = 1 + leader->nr_siblings;
diff --git a/arch/powerpc/perf/hv-24x7.c b/arch/powerpc/perf/hv-24x7.c
index 9445a82..cd48cf0 100644
--- a/arch/powerpc/perf/hv-24x7.c
+++ b/arch/powerpc/perf/hv-24x7.c
@@ -1071,12 +1071,33 @@ static int h_24x7_event_init(struct perf_event *event)
 	struct hv_perf_caps caps;
 	unsigned domain;
 	unsigned long hret;
+	u64 read_format, inv_flags;
 	u64 ct;
 
 	/* Not our event */
 	if (event->attr.type != event->pmu->type)
 		return -ENOENT;
 
+	/*
+	 * We don't support enabled/running times with PERF_FORMAT_GROUP.
+	 * The ->read_group() operation is intended to be used in continuous
+	 * monitoring mode, so these time values are not important at least
+	 * for now.
+	 *
+	 * Not sure if the PERF_FORMAT_ID is useful. Block it for now.
+	 */
+	read_format = event->attr.read_format;
+	inv_flags = PERF_FORMAT_TOTAL_TIME_ENABLED;
+	inv_flags |= PERF_FORMAT_TOTAL_TIME_RUNNING;
+	inv_flags |= PERF_FORMAT_ID;
+
+	if ((read_format & PERF_FORMAT_GROUP) && (read_format & inv_flags)) {
+		pr_devel("%s(): Invalid flags: rf 0x%llx, invf 0x%llx\n",
+				__func__, (unsigned long long)read_format,
+				(unsigned long long)inv_flags);
+		return -EINVAL;
+	}
+
 	/* Unused areas must be 0 */
 	if (event_get_reserved1(event) ||
 	    event_get_reserved2(event) ||
@@ -1181,6 +1202,50 @@ static int h_24x7_event_add(struct perf_event *event, int flags)
 	return 0;
 }
 
+static int h_24x7_event_read_group(struct perf_event *leader, u64 *values,
+				int ncounters)
+{
+	struct perf_event *sub;
+	int n = 0;
+
+	BUG_ON(!(leader->attr.read_format & PERF_FORMAT_GROUP));
+
+	/*
+	 * sys_perf_event_open() for now prevents inheritance with
+	 * PERF_FORMAT_GROUP. Ensure that hasn't changed.
+	 */
+	BUG_ON(!list_empty(&leader->child_list));
+
+	if (ncounters < leader->nr_siblings) {
+		pr_devel("%s(): Insufficient buffer : ns %d, nc %d\n",
+				__func__, leader->nr_siblings, ncounters);
+		return -EINVAL;
+	}
+
+	raw_spin_lock(&leader->ctx->lock);
+
+	if (leader->state == PERF_EVENT_STATE_ACTIVE) {
+		h_24x7_event_update(leader);
+		values[n++] = local64_read(&leader->count);
+	}
+
+	/*
+	 * TODO: For now, make one HCALL per event. We will soon retrieve
+	 * 	 several events with one HCALL.
+	 */
+	list_for_each_entry(sub, &leader->sibling_list, group_entry) {
+		if (sub->state != PERF_EVENT_STATE_ACTIVE)
+			continue;
+
+		h_24x7_event_update(sub);
+		values[n++] = local64_read(&sub->count);
+	}
+
+	raw_spin_unlock(&leader->ctx->lock);
+
+	return n;
+}
+
 static struct pmu h_24x7_pmu = {
 	.task_ctx_nr = perf_invalid_context,
 
@@ -1192,6 +1257,7 @@ static struct pmu h_24x7_pmu = {
 	.start       = h_24x7_event_start,
 	.stop        = h_24x7_event_stop,
 	.read        = h_24x7_event_update,
+	.read_group  = h_24x7_event_read_group,
 };
 
 static int hv_24x7_init(void)



* Re: [RFC][PATCH] perf: Implement read_group() PMU operation
  2015-02-06  2:59 [RFC][PATCH] perf: Implement read_group() PMU operation Sukadev Bhattiprolu
@ 2015-02-12 15:58 ` Peter Zijlstra
  2015-02-17  8:33   ` Sukadev Bhattiprolu
  2015-02-22 21:04 ` Cody P Schafer
  1 sibling, 1 reply; 5+ messages in thread
From: Peter Zijlstra @ 2015-02-12 15:58 UTC (permalink / raw)
  To: Sukadev Bhattiprolu
  Cc: mingo, Michael Ellerman, Anton Blanchard, Stephane Eranian,
	Jiri Olsa, Arnaldo Carvalho de Melo, linux-kernel

On Thu, Feb 05, 2015 at 06:59:15PM -0800, Sukadev Bhattiprolu wrote:
> From: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com>
> Date: Thu Feb  5 20:56:20 EST 2015 -0300
> Subject: [RFC][PATCH] perf: Implement read_group() PMU operation
> 
> This is a lightly tested, exploratory patch to allow PMUs to return
> several counters at once. Appreciate any comments :-)
> 
> Unlike normal hardware PMCs, the 24x7 counters[1] in Power8 are stored
> in memory and accessed via a hypervisor call (HCALL).  A major aspect
> of the HCALL is that it allows retrieving _SEVERAL_ counters at once
> (unlike regular PMCs, which are read one at a time).
> 
> This patch implements a ->read_group() PMU operation that tries to
> take advantage of this ability to read several counters at once.  A
> PMU that implements the ->read_group() operation would allow users
> to retrieve several counters at once and get a more consistent
> snapshot.
> 
> NOTE: 	This patch has a TODO in h_24x7_event_read_group() in that it
> 	still does multiple HCALLS. I think that can be optimized 
> 	independently, once the pmu->read_group() interface itself is
> 	finalized.
> 
> Appreciate comments on the ->read_group interface and on how best to manage
> the interfaces between the core and PMU layers - eg: is it OK for the hv-24x7
> PMU to walk the ->sibling_list?


> --- a/kernel/events/core.c
> +++ b/kernel/events/core.c
> @@ -3549,10 +3549,43 @@ static int perf_event_read_group(struct perf_event *event,

You also want perf_output_read_group().

>  	struct perf_event *leader = event->group_leader, *sub;
>  	int n = 0, size = 0, ret = -EFAULT;
>  	struct perf_event_context *ctx = leader->ctx;
> +	u64 *valuesp;
>  	u64 values[5];
> +	int use_group_read;
>  	u64 count, enabled, running;
> +	struct pmu *pmu = event->pmu;
> +
> +	/*
> +	 * If PMU supports group read and group read is requested,
> +	 * allocate memory before taking the mutex.
> +	 */
> +	use_group_read = 0;
> +	if ((read_format & PERF_FORMAT_GROUP) && pmu->read_group) {
> +		use_group_read++;
> +	}
> +
> +	if (use_group_read) {
> +		valuesp = kzalloc(leader->nr_siblings * sizeof(u64), GFP_KERNEL);
> +		if (!valuesp)
> +			return -ENOMEM;
> +	}

This seems 'sad', the hardware already knows how many it can maximally
use at once and can preallocate, right?

>  
>  	mutex_lock(&ctx->mutex);
> +
> +	if (use_group_read) {
> +		ret = pmu->read_group(leader, valuesp, leader->nr_siblings);
> +		if (ret >= 0) {
> +			size = ret * sizeof(u64);
> +
> +			ret = size;
> +			if (copy_to_user(buf, valuesp, size))
> +				ret = -EFAULT;
> +		}
> +
> +		kfree(valuesp);
> +		goto unlock;
> +	}
> +
>  	count = perf_event_read_value(leader, &enabled, &running);
>  
>  	values[n++] = 1 + leader->nr_siblings;

Since ->read() has a void return value, we can delay its effect, so I'm
currently thinking we might want to extend the transaction interface for
this; give pmu::start_txn() a flags argument to indicate scheduling
(add) or reading (read).

So we'd end up with something like:

	pmu->start_txn(pmu, PMU_TXN_READ);

	leader->read();

	for_each_sibling()
		sibling->read();

	pmu->commit_txn();

after which we can use the values updated by the read calls. The trivial
no-support implementation lets read do its immediate thing like it does
now.

A more complex driver can then collect the actual counter values and
execute one hypercall using its pre-allocated memory.

So no allocations in the core code, and no sibling iterations in the
driver code.

Would that work for you?
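
The transaction-style read described above can be illustrated with a small
userspace model. This is only a sketch under stated assumptions, not kernel
code: every name here (mock_pmu, mock_event, mock_hcall, TXN_READ,
MAX_TXN_EVENTS) is hypothetical, TXN_READ merely stands in for the proposed
PMU_TXN_READ flag, and the "hypercall" is simulated by a plain function.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_TXN_EVENTS 8	/* hypothetical per-PMU limit, preallocated */
#define TXN_NONE 0
#define TXN_READ 1		/* stands in for the proposed PMU_TXN_READ */

struct mock_event {
	unsigned int counter_id;	/* which simulated counter to fetch */
	uint64_t count;			/* last value delivered to the core */
};

struct mock_pmu {
	int txn_flags;
	size_t nr_queued;
	struct mock_event *queued[MAX_TXN_EVENTS]; /* no allocation at read time */
	unsigned int nr_hcalls;			    /* counts simulated hypercalls */
};

/* Simulated hypervisor call: a single call can fetch several counters. */
static void mock_hcall(struct mock_pmu *pmu, struct mock_event **ev, size_t n)
{
	pmu->nr_hcalls++;
	for (size_t i = 0; i < n; i++)
		ev[i]->count = 1000u * ev[i]->counter_id;
}

static void mock_start_txn(struct mock_pmu *pmu, int flags)
{
	pmu->txn_flags = flags;
	pmu->nr_queued = 0;
}

/* Inside a read txn the read is only queued; outside it happens at once. */
static void mock_read(struct mock_pmu *pmu, struct mock_event *ev)
{
	if (pmu->txn_flags == TXN_READ) {
		assert(pmu->nr_queued < MAX_TXN_EVENTS);
		pmu->queued[pmu->nr_queued++] = ev;
		return;
	}
	mock_hcall(pmu, &ev, 1);
}

/* Commit issues one simulated hypercall for everything that was queued. */
static int mock_commit_txn(struct mock_pmu *pmu)
{
	if (pmu->txn_flags == TXN_READ && pmu->nr_queued)
		mock_hcall(pmu, pmu->queued, pmu->nr_queued);
	pmu->txn_flags = TXN_NONE;
	pmu->nr_queued = 0;
	return 0;
}

/* Core side: the core iterates the group, the driver never does. */
static void mock_group_read(struct mock_pmu *pmu, struct mock_event *group,
			    size_t n)
{
	mock_start_txn(pmu, TXN_READ);
	for (size_t i = 0; i < n; i++)
		mock_read(pmu, &group[i]);
	mock_commit_txn(pmu);
}
```

In this model a group of three events costs one simulated hypercall, while a
mock_read() issued outside a transaction still fetches its single value
immediately.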


* Re: [RFC][PATCH] perf: Implement read_group() PMU operation
  2015-02-12 15:58 ` Peter Zijlstra
@ 2015-02-17  8:33   ` Sukadev Bhattiprolu
  2015-02-17 10:03     ` Peter Zijlstra
  0 siblings, 1 reply; 5+ messages in thread
From: Sukadev Bhattiprolu @ 2015-02-17  8:33 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: mingo, Michael Ellerman, Anton Blanchard, Stephane Eranian,
	Jiri Olsa, Arnaldo Carvalho de Melo, linux-kernel

Peter Zijlstra [peterz@infradead.org] wrote:
| > --- a/kernel/events/core.c
| > +++ b/kernel/events/core.c
| > @@ -3549,10 +3549,43 @@ static int perf_event_read_group(struct perf_event *event,
| 
| You also want perf_output_read_group().

Ok. Will look into it. We currently don't support sampling with
the 24x7 counters but we should make sure that the new interface
fits correctly.


| 
| >  	struct perf_event *leader = event->group_leader, *sub;
| >  	int n = 0, size = 0, ret = -EFAULT;
| >  	struct perf_event_context *ctx = leader->ctx;
| > +	u64 *valuesp;
| >  	u64 values[5];
| > +	int use_group_read;
| >  	u64 count, enabled, running;
| > +	struct pmu *pmu = event->pmu;
| > +
| > +	/*
| > +	 * If PMU supports group read and group read is requested,
| > +	 * allocate memory before taking the mutex.
| > +	 */
| > +	use_group_read = 0;
| > +	if ((read_format & PERF_FORMAT_GROUP) && pmu->read_group) {
| > +		use_group_read++;
| > +	}
| > +
| > +	if (use_group_read) {
| > +		valuesp = kzalloc(leader->nr_siblings * sizeof(u64), GFP_KERNEL);
| > +		if (!valuesp)
| > +			return -ENOMEM;
| > +	}
| 
| This seems 'sad', the hardware already knows how many it can maximally
| use at once and can preallocate, right?

Yeah :-) In a subsequent version, I got rid of the allocation by moving
the copy_to_user() into the PMU, but still needed to walk the sibling list.
| 
| >  
| >  	mutex_lock(&ctx->mutex);
| > +
| > +	if (use_group_read) {
| > +		ret = pmu->read_group(leader, valuesp, leader->nr_siblings);
| > +		if (ret >= 0) {
| > +			size = ret * sizeof(u64);
| > +
| > +			ret = size;
| > +			if (copy_to_user(buf, valuesp, size))
| > +				ret = -EFAULT;
| > +		}
| > +
| > +		kfree(valuesp);
| > +		goto unlock;
| > +	}
| > +
| >  	count = perf_event_read_value(leader, &enabled, &running);
| >  
| >  	values[n++] = 1 + leader->nr_siblings;
| 
| Since ->read() has a void return value, we can delay its effect, so I'm
| currently thinking we might want to extend the transaction interface for
| this; give pmu::start_txn() a flags argument to indicate scheduling
| (add) or reading (read).
| 
| So we'd end up with something like:
| 
| 	pmu->start_txn(pmu, PMU_TXN_READ);
| 
| 	leader->read();
| 
| 	for_each_sibling()
| 		sibling->read();
| 
| 	pmu->commit_txn();

So, each of the ->read() calls is really "appending a counter" to a
list of counters that the PMU should read and the values for the counters
(results of the read) are only available after the commit_txn() right? 

In which case, perf_event_read_group() would then follow this commit_txn()
with its "normal" read,  and the PMU would return the result cached during
->commit_txn(). If so, we need a way to invalidate the cached result ?

| 
| after which we can use the values updated by the read calls. The trivial
| no-support implementation lets read do its immediate thing like it does
| now.
| 
| A more complex driver can then collect the actual counter values and
| execute one hypercall using its pre-allocated memory.

The hypercall should happen in ->commit_txn(), right?

| 
| So no allocations in the core code, and no sibling iterations in the
| driver code.
| 
| Would that work for you?

I think it would.

I am working on breaking up the underlying code along start/read/commit
lines, and hope to have it done later this week and then can experiment
more with this interface.

Appreciate the input.

Sukadev



* Re: [RFC][PATCH] perf: Implement read_group() PMU operation
  2015-02-17  8:33   ` Sukadev Bhattiprolu
@ 2015-02-17 10:03     ` Peter Zijlstra
  0 siblings, 0 replies; 5+ messages in thread
From: Peter Zijlstra @ 2015-02-17 10:03 UTC (permalink / raw)
  To: Sukadev Bhattiprolu
  Cc: mingo, Michael Ellerman, Anton Blanchard, Stephane Eranian,
	Jiri Olsa, Arnaldo Carvalho de Melo, linux-kernel

On Tue, Feb 17, 2015 at 12:33:12AM -0800, Sukadev Bhattiprolu wrote:
> Peter Zijlstra [peterz@infradead.org] wrote:
> | > --- a/kernel/events/core.c
> | > +++ b/kernel/events/core.c
> | > @@ -3549,10 +3549,43 @@ static int perf_event_read_group(struct perf_event *event,
> | 
> | You also want perf_output_read_group().
> 
> Ok. Will look into it. We currently don't support sampling with
> the 24x7 counters but we should make sure that the new interface
> fits correctly.

One thing someone 'could' do is group them together with a software
event that _can_ sample, and then use SAMPLE_READ to periodically stuff
values into the buffer.

> | Since ->read() has a void return value, we can delay its effect, so I'm
> | currently thinking we might want to extend the transaction interface for
> | this; give pmu::start_txn() a flags argument to indicate scheduling
> | (add) or reading (read).
> | 
> | So we'd end up with something like:
> | 
> | 	pmu->start_txn(pmu, PMU_TXN_READ);
> | 
> | 	leader->read();
> | 
> | 	for_each_sibling()
> | 		sibling->read();
> | 
> | 	pmu->commit_txn();
> 
> So, each of the ->read() calls is really "appending a counter" to a
> list of counters that the PMU should read and the values for the counters
> (results of the read) are only available after the commit_txn() right?

Correct.

> In which case, perf_event_read_group() would then follow this commit_txn()
> with its "normal" read,  and the PMU would return the result cached during
> ->commit_txn(). If so, we need a way to invalidate the cached result ?

I was thinking of breaking up that code into two loops, once to call
->read() and update states, the second to use the now up-to-date data
and frob it into the stream.

But I must say I've not given it much thought yet. But that way
you're not stuck with this cache and related problems.

> | after which we can use the values updated by the read calls. The trivial
> | no-support implementation lets read do its immediate thing like it does
> | now.
> | 
> | A more complex driver can then collect the actual counter values and
> | execute one hypercall using its pre-allocated memory.
> 
> the hypercall should happen  in the ->commit_txn() right ?

Yah. Of course, if a ->read() is not part of a txn then it must do the
hypercall for just the one value.
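
The two-loop split mentioned above might look roughly like the following
userspace sketch. It is an illustration only: sketch_event, sketch_read and
sketch_read_group are hypothetical names, and the output layout assumes
read_format contains PERF_FORMAT_GROUP alone (the number of events followed
by one value per event, with evs[0] playing the leader).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct sketch_event {
	uint64_t hw_value;	/* stands in for what the PMU would fetch */
	uint64_t count;		/* value cached in the event after ->read() */
};

/* Stand-in for ->read(): bring the cached count up to date. */
static void sketch_read(struct sketch_event *ev)
{
	ev->count = ev->hw_value;
}

/*
 * Returns the number of u64 words written to out, or 0 if out is too small.
 */
static size_t sketch_read_group(struct sketch_event *evs, size_t nr,
				uint64_t *out, size_t out_len)
{
	size_t n = 0;

	if (out_len < nr + 1)
		return 0;

	/* Loop 1: call read on every event so all counts are current. */
	for (size_t i = 0; i < nr; i++)
		sketch_read(&evs[i]);

	/* Loop 2: only now format the up-to-date counts into the buffer. */
	out[n++] = nr;
	for (size_t i = 0; i < nr; i++)
		out[n++] = evs[i].count;

	assert(n == nr + 1);
	return n;
}
```

With a transaction-aware PMU, the first loop would sit between start_txn()
and commit_txn(), so the second loop never has to consult a separate cache.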


* Re: [RFC][PATCH] perf: Implement read_group() PMU operation
  2015-02-06  2:59 [RFC][PATCH] perf: Implement read_group() PMU operation Sukadev Bhattiprolu
  2015-02-12 15:58 ` Peter Zijlstra
@ 2015-02-22 21:04 ` Cody P Schafer
  1 sibling, 0 replies; 5+ messages in thread
From: Cody P Schafer @ 2015-02-22 21:04 UTC (permalink / raw)
  To: Sukadev Bhattiprolu
  Cc: Peter Zijlstra, Ingo Molnar, Michael Ellerman, Anton Blanchard,
	Stephane Eranian, Jiri Olsa, Arnaldo Carvalho de Melo, LKML

On Thu, Feb 5, 2015 at 9:59 PM, Sukadev Bhattiprolu
<sukadev@linux.vnet.ibm.com> wrote:
> From: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com>
> Date: Thu Feb  5 20:56:20 EST 2015 -0300
> Subject: [RFC][PATCH] perf: Implement read_group() PMU operation
>
> This is a lightly tested, exploratory patch to allow PMUs to return
> several counters at once. Appreciate any comments :-)
>

Back when I was fiddling with this, I started looking into changing
the {start,commit,cancel}_txn to operate on (struct perf_event *)
rather than (struct pmu *), and commit_txn would generate the actual
request & reads based on the perf_event's group (sounds similar but
not identical to what Peter's proposed previously).

The key bit I was concerned about was that these "PMUs" aren't
actually physical hw, so it made a bit more sense to pin the grouping
to a group rather than a txn over a PMU.

[Of course, I never did confirm if that actually fit with how perf was
modeling txns]
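
For comparison, a transaction pinned to the event group rather than to the
PMU could be modelled as below. This is a hedged userspace sketch, not the
code referred to above: grp_event, grp_start_txn, grp_commit_txn and
grp_fake_request are hypothetical names, and the request to the hypervisor is
simulated.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define GRP_MAX_EVENTS 8	/* hypothetical cap on leader + siblings */

struct grp_event {
	unsigned int counter_id;
	uint64_t count;
	struct grp_event *siblings;	/* only meaningful on the leader */
	size_t nr_siblings;
	int txn_open;			/* txn state lives on the group leader */
};

static unsigned int grp_nr_requests;	/* counts simulated requests */

/* Simulated request: fills in every event of the group in one go. */
static void grp_fake_request(struct grp_event **evs, size_t n)
{
	grp_nr_requests++;
	for (size_t i = 0; i < n; i++)
		evs[i]->count = 10u * evs[i]->counter_id;
}

static void grp_start_txn(struct grp_event *leader)
{
	leader->txn_open = 1;
}

/* Commit builds the request from the leader's group, then issues it once. */
static int grp_commit_txn(struct grp_event *leader)
{
	struct grp_event *req[GRP_MAX_EVENTS];
	size_t n = 0;

	if (!leader->txn_open || leader->nr_siblings + 1 > GRP_MAX_EVENTS)
		return -1;

	req[n++] = leader;
	for (size_t i = 0; i < leader->nr_siblings; i++)
		req[n++] = &leader->siblings[i];

	assert(n <= GRP_MAX_EVENTS);
	grp_fake_request(req, n);
	leader->txn_open = 0;
	return 0;
}
```

The trade-off in this model is that the commit step walks the group itself,
which is the kind of sibling iteration in driver code that the PMU-level
transaction proposal avoids.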
