[RFC PATCH 0/3] perf: show package power consumption in perf


* [RFC PATCH 0/3] perf: show package power consumption in perf
@ 2010-08-18  7:59 Zhang Rui
  2010-08-18 12:25 ` Peter Zijlstra
  0 siblings, 1 reply; 20+ messages in thread
From: Zhang Rui @ 2010-08-18  7:59 UTC (permalink / raw)
  To: peterz
  Cc: LKML, mingo, robert.richter, acme, paulus, dzickus, gorcunov,
	fweisbec, Lin Ming, Brown, Len, Matthew Garrett, Zhang, Rui

Hi, all,

RAPL (Running Average Power Limit) is a new feature on some new
processors that provides mechanisms to enforce a power consumption limit.

Generally speaking, with RAPL the OS can set a power budget for a
certain time window and let the hardware throttle the processor
P/T-states to meet this energy limit.

RAPL also provides a new MSR, i.e. MSR_PKG_ENERGY_STATUS, which reports
the total amount of energy consumed by the package.
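
As a rough illustration of how a raw MSR_PKG_ENERGY_STATUS reading could be turned into Joules, here is a minimal sketch. It assumes, following Intel's SDM rather than anything stated in this thread, that the energy status unit (ESU) lives in bits 12:8 of MSR_RAPL_POWER_UNIT and that one counter increment equals 1/2^ESU Joules. The helper names are hypothetical and no MSR is actually read here.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper: extract the energy status unit (ESU) exponent.
 * Assumption (Intel SDM, not this thread): ESU is bits 12:8 of
 * MSR_RAPL_POWER_UNIT.
 */
unsigned int rapl_energy_unit_shift(uint64_t power_unit_msr)
{
	return (unsigned int)((power_unit_msr >> 8) & 0x1f);
}

/* Convert a raw energy count to Joules: one count is 1/2^ESU Joules. */
double rapl_raw_to_joules(uint64_t raw, unsigned int esu)
{
	assert(esu < 32);	/* ESU is a 5-bit field */
	return (double)raw / (double)(1ULL << esu);
}
```

With an ESU of 16, for example, a raw delta of 202 * 65536 counts corresponds to 202 Joules.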

I'm not sure whether to support RAPL or not, but anyway, it sounds like a
good idea to export the energy status in perf.

So this patch set introduces a new perf pmu and an event that shows the
energy consumed by the package.

Here is what I get after applying the three patches,

#./perf stat -e energy test
Performance counter stats for 'test':

	202	Joules cost by package
7.926001238	seconds time elapsed

Note that this patch set is based on Peter's perf-pmu branch,
git://git.kernel.org/pub/scm/linux/kernel/git/peterz/linux-2.6-perf.git
which provides better interfaces to register/unregister a new pmu.

Any comments are welcome. :)

thanks,
rui

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [RFC PATCH 0/3] perf: show package power consumption in perf
  2010-08-18  7:59 [RFC PATCH 0/3] perf: show package power consumption in perf Zhang Rui
@ 2010-08-18 12:25 ` Peter Zijlstra
  2010-08-18 12:41   ` Matt Fleming
  2010-08-19  2:43   ` Lin Ming
  0 siblings, 2 replies; 20+ messages in thread
From: Peter Zijlstra @ 2010-08-18 12:25 UTC (permalink / raw)
  To: Zhang Rui
  Cc: LKML, mingo, robert.richter, acme, paulus, dzickus, gorcunov,
	fweisbec, Lin Ming, Brown, Len, Matthew Garrett, Matt Fleming

On Wed, 2010-08-18 at 15:59 +0800, Zhang Rui wrote:
> Hi, all,
> 
> RAPL(running average power limit) is a new feature which provides
> mechanisms to enforce power consumption limit, on some new processors.
> 
> Generally speaking, by using RAPL, OS can set a power budget in a
> certain time window, and let Hardware to throttle the processor
> P/T-state to meet this energy limitation.
> 
> RAPL also provides a new MSR, i.e. MSR_PKG_ENERGY_STATUS, which reports
> the total amount of energy consumed by the package.
> 
> I'm not sure if to support RAPL or not, but anyway, it sounds like a
> good idea to export the energy status in perf.
> 
> So a new perf pmu and event to show the package energy consumed is
> introduced in this patch.
> 
> Here is what I get after applying the three patches,
> 
> #./perf stat -e energy test
> Performance counter stats for 'test':
> 
> 	202	Joules cost by package
> 7.926001238	seconds time elapsed
> 
> 
> Note that this patch set is made based on Peter's perf-pmu branch,
> git://git.kernel.org/pub/scm/linux/kernel/git/peterz/linux-2.6-perf.git
>  which provides better interfaces to register/unregister a new pmu.
> 
> any comment are welcome. :)


Nice,.. however:

 - if it is a pure read-only counter without sampling support,
   expose it as such, don't fudge in the hrtimer stuff. Simply
   fail to create a sampling event.

   SH has the same problem for its 'normal' PMU, the solution is
   to use event groups, Matt was looking at adding support to
   perf-record for that, if creating a sampling event fails, fall
   back to {hrtimer, $event} groups.

 - since it's a free-running, non-configurable counter, you can indeed
   act like it's a 'software' event in that you can schedule consumers
   without constraints, however I don't think the PERF_COUNT_SW_* space
   is the right way to expose this counter.

   Better would be to use the sysfs stuff Lin has been working on (for
   which I still need to catch up on the latest discussions), it would
   then be tied to the pmu instance and appear/disappear when you load/
   unload the module.

   However for testing purposes I see why you'd want to have _an_
   interface :-)

 - it would be nice if you'd write the cpu detection a bit more readably,
   also, it looks like you forgot to check x86_vendor == X86_VENDOR_INTEL.

> +static int __init intel_rapl_init(void)
> +{
> +	/*
> +	 * RAPL features are only supported on processors have a CPUID
> +	 * signature with DisplayFamily_DisplayModel of 06_2AH, 06_2DH
> +	 */
> +	if (boot_cpu_data.x86 != 0x06 ||
> +	    (boot_cpu_data.x86_model != 0x2A &&
> +	    boot_cpu_data.x86_model != 0x2D))
> +		return -ENODEV;
> +
> +	if (rapl_check_unit())
> +		return -ENODEV;
> +
> +	perf_pmu_register(&rapl_pmu);
> +	return 0;
> +}

Maybe something like (see intel_pmu_init() for example):

  if (boot_cpu_data.x86_vendor != X86_VENDOR_INTEL)
    return -ENODEV;

  if (boot_cpu_data.x86 != 0x06)
    return -ENODEV;

  switch (boot_cpu_data.x86_model) {
  case 0x2A: /* sandybridge ?! 32nm */
  case 0x2D: /* othermodel 32nm */
    break;

  default:
    return -ENODEV;
  }

Which again reminds me to ask Intel for a comprehensive x86_model list,
please?

Alternatively, you can create a X86_FEATURE_RAPL and simply use
boot_cpu_has(X86_FEATURE_RAPL) (much like intel_ds_init() has).
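
The vendor/family/model check suggested above can be sketched as a standalone, testable function. This is only an illustration: the function name and its plain parameters are hypothetical stand-ins for the kernel's boot_cpu_data fields, and the model list is just the two models named in the patch comment.

```c
#include <stdbool.h>

/*
 * Hypothetical userspace stand-in for the init-time check.
 * vendor_is_intel models boot_cpu_data.x86_vendor == X86_VENDOR_INTEL,
 * family models boot_cpu_data.x86, model models boot_cpu_data.x86_model.
 */
bool rapl_model_supported(bool vendor_is_intel, unsigned int family,
			  unsigned int model)
{
	if (!vendor_is_intel)
		return false;

	if (family != 0x06)
		return false;

	switch (model) {
	case 0x2A:	/* models listed in the patch comment */
	case 0x2D:
		return true;
	default:
		return false;
	}
}
```

Keeping each condition on its own line makes it easy to add further models as new case labels.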


* Re: [RFC PATCH 0/3] perf: show package power consumption in perf
  2010-08-18 12:25 ` Peter Zijlstra
@ 2010-08-18 12:41   ` Matt Fleming
  2010-08-19  3:28     ` Lin Ming
  2010-08-19  2:43   ` Lin Ming
  1 sibling, 1 reply; 20+ messages in thread
From: Matt Fleming @ 2010-08-18 12:41 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Zhang Rui, LKML, mingo, robert.richter, acme, paulus, dzickus,
	gorcunov, fweisbec, Lin Ming, Brown, Len, Matthew Garrett

On Wed, Aug 18, 2010 at 02:25:29PM +0200, Peter Zijlstra wrote:
> On Wed, 2010-08-18 at 15:59 +0800, Zhang Rui wrote:
> > Hi, all,
> > 
> > RAPL(running average power limit) is a new feature which provides
> > mechanisms to enforce power consumption limit, on some new processors.
> > 
> > Generally speaking, by using RAPL, OS can set a power budget in a
> > certain time window, and let Hardware to throttle the processor
> > P/T-state to meet this energy limitation.
> > 
> > RAPL also provides a new MSR, i.e. MSR_PKG_ENERGY_STATUS, which reports
> > the total amount of energy consumed by the package.
> > 
> > I'm not sure if to support RAPL or not, but anyway, it sounds like a
> > good idea to export the energy status in perf.
> > 
> > So a new perf pmu and event to show the package energy consumed is
> > introduced in this patch.
> > 
> > Here is what I get after applying the three patches,
> > 
> > #./perf stat -e energy test
> > Performance counter stats for 'test':
> > 
> > 	202	Joules cost by package
> > 7.926001238	seconds time elapsed
> > 
> > 
> > Note that this patch set is made based on Peter's perf-pmu branch,
> > git://git.kernel.org/pub/scm/linux/kernel/git/peterz/linux-2.6-perf.git
> >  which provides better interfaces to register/unregister a new pmu.
> > 
> > any comment are welcome. :)
> 
> 
> Nice,.. however:
> 
>  - if it is a pure read-only counter without sampling support,
>    expose it as such, don't fudge in the hrtimer stuff. Simply
>    fail to create a sampling event.
> 
>    SH has the same problem for its 'normal' PMU, the solution is
>    to use event groups, Matt was looking at adding support to
>    perf-record for that, if creating a sampling event fails, fall
>    back to {hrtimer, $event} groups.

I had a quick look over the patches and Peter is right - the group
events stuff would probably fit quite well here. Unfortunately, due to
holidays and things, I haven't been able to get them finished
yet. I'll get on that ASAP.


* Re: [RFC PATCH 0/3] perf: show package power consumption in perf
  2010-08-18 12:41   ` Matt Fleming
@ 2010-08-19  3:28     ` Lin Ming
  2010-08-19  7:54       ` Matt Fleming
  2010-08-19  9:02       ` Peter Zijlstra
  0 siblings, 2 replies; 20+ messages in thread
From: Lin Ming @ 2010-08-19  3:28 UTC (permalink / raw)
  To: Matt Fleming
  Cc: Peter Zijlstra, Zhang, Rui, LKML, mingo, robert.richter, acme,
	paulus, dzickus, gorcunov, fweisbec, Brown, Len, Matthew Garrett

On Wed, 2010-08-18 at 20:41 +0800, Matt Fleming wrote:
> On Wed, Aug 18, 2010 at 02:25:29PM +0200, Peter Zijlstra wrote:
> > On Wed, 2010-08-18 at 15:59 +0800, Zhang Rui wrote:
> > > Hi, all,
> > > 
> > > RAPL(running average power limit) is a new feature which provides
> > > mechanisms to enforce power consumption limit, on some new processors.
> > > 
> > > Generally speaking, by using RAPL, OS can set a power budget in a
> > > certain time window, and let Hardware to throttle the processor
> > > P/T-state to meet this energy limitation.
> > > 
> > > RAPL also provides a new MSR, i.e. MSR_PKG_ENERGY_STATUS, which reports
> > > the total amount of energy consumed by the package.
> > > 
> > > I'm not sure if to support RAPL or not, but anyway, it sounds like a
> > > good idea to export the energy status in perf.
> > > 
> > > So a new perf pmu and event to show the package energy consumed is
> > > introduced in this patch.
> > > 
> > > Here is what I get after applying the three patches,
> > > 
> > > #./perf stat -e energy test
> > > Performance counter stats for 'test':
> > > 
> > > 	202	Joules cost by package
> > > 7.926001238	seconds time elapsed
> > > 
> > > 
> > > Note that this patch set is made based on Peter's perf-pmu branch,
> > > git://git.kernel.org/pub/scm/linux/kernel/git/peterz/linux-2.6-perf.git
> > >  which provides better interfaces to register/unregister a new pmu.
> > > 
> > > any comment are welcome. :)
> > 
> > 
> > Nice,.. however:
> > 
> >  - if it is a pure read-only counter without sampling support,
> >    expose it as such, don't fudge in the hrtimer stuff. Simply
> >    fail to create a sampling event.
> > 
> >    SH has the same problem for its 'normal' PMU, the solution is
> >    to use event groups, Matt was looking at adding support to
> >    perf-record for that, if creating a sampling event fails, fall
> >    back to {hrtimer, $event} groups.
> 
> I had a quick look over the patches and Peter is right - the group
> events stuff would probably fit quite well here. Unfortunately, due to
> holidays and things, I haven't been able to get them finished
> yet. I'll get on that ASAP.

Hi, Matt

What's the "group events stuff"?
Is there some discussion on LKML or elsewhere I can have a look at?

Thanks,
Lin Ming





* Re: [RFC PATCH 0/3] perf: show package power consumption in perf
  2010-08-19  3:28     ` Lin Ming
@ 2010-08-19  7:54       ` Matt Fleming
  2010-08-19  8:15         ` Lin Ming
  2010-08-19  8:31         ` Zhang Rui
  2010-08-19  9:02       ` Peter Zijlstra
  1 sibling, 2 replies; 20+ messages in thread
From: Matt Fleming @ 2010-08-19  7:54 UTC (permalink / raw)
  To: Lin Ming
  Cc: Peter Zijlstra, Zhang, Rui, LKML, mingo, robert.richter, acme,
	paulus, dzickus, gorcunov, fweisbec, Brown, Len, Matthew Garrett

On Thu, Aug 19, 2010 at 11:28:17AM +0800, Lin Ming wrote:
> On Wed, 2010-08-18 at 20:41 +0800, Matt Fleming wrote:
> > 
> > I had a quick look over the patches and Peter is right - the group
> > events stuff would probably fit quite well here. Unfortunately, due to
> > holidays and things, I haven't been able to get them finished
> > yet. I'll get on that ASAP.
> 
> Hi, Matt
> 
> What's the "group events stuff"?
> Is there some discussion on LKML or elsewhere I can have a look at?
> 
> Thanks,
> Lin Ming

The relevant information can be found here in this thread,
http://lkml.org/lkml/2010/8/4/174. I'm working on some patches for
this but they're not finished yet. I can probably get something to
show by next week.

The discussion started because the performance counters on SH do not
generate an interrupt on overflow, so we need to periodically sample
them. Am I correct in thinking that the energy counters also do not
generate an interrupt on overflow and that's why you wrote the event
as a software event?


* Re: [RFC PATCH 0/3] perf: show package power consumption in perf
  2010-08-19  7:54       ` Matt Fleming
@ 2010-08-19  8:15         ` Lin Ming
  2010-08-19  8:31         ` Zhang Rui
  1 sibling, 0 replies; 20+ messages in thread
From: Lin Ming @ 2010-08-19  8:15 UTC (permalink / raw)
  To: Matt Fleming
  Cc: Peter Zijlstra, Zhang, Rui, LKML, mingo, robert.richter, acme,
	paulus, dzickus, gorcunov, fweisbec, Brown, Len, Matthew Garrett

On Thu, 2010-08-19 at 15:54 +0800, Matt Fleming wrote:
> On Thu, Aug 19, 2010 at 11:28:17AM +0800, Lin Ming wrote:
> > On Wed, 2010-08-18 at 20:41 +0800, Matt Fleming wrote:
> > > 
> > > I had a quick look over the patches and Peter is right - the group
> > > events stuff would probably fit quite well here. Unfortunately, due to
> > > holidays and things, I haven't been able to get them finished
> > > yet. I'll get on that ASAP.
> > 
> > Hi, Matt
> > 
> > What's the "group events stuff"?
> > Is there some discussion on LKML or elsewhere I can have a look at?
> > 
> > Thanks,
> > Lin Ming
> 
> The relevant information can be found here in this thread,
> http://lkml.org/lkml/2010/8/4/174. I'm working on some patches for
> this but they're not finished yet. I can probably get something to
> show by next week.

Thanks.

> 
> The discussion started because the performance counters on SH do not
> generate an interrupt on overflow, so we need to periodically sample
> them. Am I correct in thinking that the energy counters also do not
> generate an interrupt on overflow and that's why you wrote the event
> as a software event?

I think so.

Rui, could you confirm this?




* Re: [RFC PATCH 0/3] perf: show package power consumption in perf
  2010-08-19  7:54       ` Matt Fleming
  2010-08-19  8:15         ` Lin Ming
@ 2010-08-19  8:31         ` Zhang Rui
  2010-08-19  8:32           ` Matt Fleming
  1 sibling, 1 reply; 20+ messages in thread
From: Zhang Rui @ 2010-08-19  8:31 UTC (permalink / raw)
  To: Matt Fleming
  Cc: Lin, Ming M, Peter Zijlstra, LKML, mingo, robert.richter, acme,
	paulus, dzickus, gorcunov, fweisbec, Brown, Len, Matthew Garrett

On Thu, 2010-08-19 at 15:54 +0800, Matt Fleming wrote:
> On Thu, Aug 19, 2010 at 11:28:17AM +0800, Lin Ming wrote:
> > On Wed, 2010-08-18 at 20:41 +0800, Matt Fleming wrote:
> > > 
> > > I had a quick look over the patches and Peter is right - the group
> > > events stuff would probably fit quite well here. Unfortunately, due to
> > > holidays and things, I haven't been able to get them finished
> > > yet. I'll get on that ASAP.
> > 
> > Hi, Matt
> > 
> > What's the "group events stuff"?
> > Is there some discussion on LKML or elsewhere I can have a look at?
> > 
> > Thanks,
> > Lin Ming
> 
> The relevant information can be found here in this thread,
> http://lkml.org/lkml/2010/8/4/174. I'm working on some patches for
> this but they're not finished yet. I can probably get something to
> show by next week.
> 
> The discussion started because the performance counters on SH do not
> generate an interrupt on overflow, so we need to periodically sample
> them. Am I correct in thinking that the energy counters also do not
> generate an interrupt on overflow and that's why you wrote the event
> as a software event?

right.

BTW, I'm not very familiar with the perf tool, and now I'm wondering
whether the periodic sampling is needed at all.
IMO, .start is invoked every time the process is scheduled in,
and .stop is invoked when it's scheduled out. It seems that we just need
to read the energy consumed in .start and .stop, and update the counter
in .stop, right?
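
The .start/.stop idea described above could look roughly like the following sketch. The struct and function names are hypothetical, and the raw counter value is passed in as an argument rather than read from the MSR. The counter is assumed to be 32 bits wide, which the thread does not confirm.

```c
#include <stdint.h>

/* Hypothetical per-event state for a start/stop style counter. */
struct energy_event {
	uint32_t start_raw;	/* raw counter value at schedule-in */
	uint64_t count;		/* energy charged to this event so far */
};

/* Called at schedule-in: remember where the hardware counter stood. */
void energy_event_start(struct energy_event *ev, uint32_t raw_now)
{
	ev->start_raw = raw_now;
}

/* Called at schedule-out: charge only the counts seen while running. */
void energy_event_stop(struct energy_event *ev, uint32_t raw_now)
{
	ev->count += (uint32_t)(raw_now - ev->start_raw);
}
```

Counts that accumulate while the task is scheduled out are not charged to the event. The sketch is only correct if the counter wraps at most once between a start and the matching stop.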

thanks,
rui



* Re: [RFC PATCH 0/3] perf: show package power consumption in perf
  2010-08-19  8:31         ` Zhang Rui
@ 2010-08-19  8:32           ` Matt Fleming
  2010-08-19  9:44             ` Peter Zijlstra
  0 siblings, 1 reply; 20+ messages in thread
From: Matt Fleming @ 2010-08-19  8:32 UTC (permalink / raw)
  To: Zhang Rui
  Cc: Lin, Ming M, Peter Zijlstra, LKML, mingo, robert.richter, acme,
	paulus, dzickus, gorcunov, fweisbec, Brown, Len, Matthew Garrett

On Thu, Aug 19, 2010 at 04:31:54PM +0800, Zhang Rui wrote:
> On Thu, 2010-08-19 at 15:54 +0800, Matt Fleming wrote:
> > On Thu, Aug 19, 2010 at 11:28:17AM +0800, Lin Ming wrote:
> > > On Wed, 2010-08-18 at 20:41 +0800, Matt Fleming wrote:
> > > > 
> > > > I had a quick look over the patches and Peter is right - the group
> > > > events stuff would probably fit quite well here. Unfortunately, due to
> > > > holidays and things, I haven't been able to get them finished
> > > > yet. I'll get on that ASAP.
> > > 
> > > Hi, Matt
> > > 
> > > What's the "group events stuff"?
> > > Is there some discussion on LKML or elsewhere I can have a look at?
> > > 
> > > Thanks,
> > > Lin Ming
> > 
> > The relevant information can be found here in this thread,
> > http://lkml.org/lkml/2010/8/4/174. I'm working on some patches for
> > this but they're not finished yet. I can probably get something to
> > show by next week.
> > 
> > The discussion started because the performance counters on SH do not
> > generate an interrupt on overflow, so we need to periodically sample
> > them. Am I correct in thinking that the energy counters also do not
> > generate an interrupt on overflow and that's why you wrote the event
> > as a software event?
> 
> right.
> 
> BTW, I'm not quite familiar with perf tool, and now I'm wondering if the
> periodically sample is needed.
> because IMO, .start is invoked every time the process is scheduled in,
> and .stop is invoked when it's scheduled out. It seems that we just need
> to read the energy consumed in .start and .stop, and update the counter
> in .stop, right?

How big is the hardware counter? The problem comes when the process is
scheduled in and runs for a long time, e.g. so long that the energy
hardware counter wraps. This is why it's necessary to periodically
sample the counter.


* Re: [RFC PATCH 0/3] perf: show package power consumption in perf
  2010-08-19  8:32           ` Matt Fleming
@ 2010-08-19  9:44             ` Peter Zijlstra
  2010-08-21  1:18               ` Frederic Weisbecker
  0 siblings, 1 reply; 20+ messages in thread
From: Peter Zijlstra @ 2010-08-19  9:44 UTC (permalink / raw)
  To: Matt Fleming
  Cc: Zhang Rui, Lin, Ming M, LKML, mingo, robert.richter, acme,
	paulus, dzickus, gorcunov, fweisbec, Brown, Len, Matthew Garrett

On Thu, 2010-08-19 at 09:32 +0100, Matt Fleming wrote:
> 
> 
> How big is the hardware counter? The problem comes when the process is
> scheduled in and runs for a long time, e.g. so long that the energy
> hardware counter wraps. This is why it's necessary to periodically
> sample the counter.
> 
Long running processes aren't the only case, you could associate an
event with a CPU.

Right, short counters (like SH when not chained) need something to
accumulate deltas into the larger u64. You can indeed use timers for
that, hr or otherwise, but you don't need the swcounter hrtimer
infrastructure for that.
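
Accumulating deltas from a short counter into a larger u64 can be sketched as below. The 32-bit width is an assumption (the thread does not state the width of the energy counter) and the names are hypothetical. Unsigned subtraction modulo 2^32 gives the right delta across a single wrap, so the update only has to run at least once per wrap period, from whatever timer is convenient.

```c
#include <stdint.h>

/* Hypothetical accumulator for a free-running 32-bit hardware counter. */
struct counter_acc {
	uint32_t prev;	/* last raw value observed */
	uint64_t total;	/* 64-bit running total */
};

void counter_acc_init(struct counter_acc *acc, uint32_t raw)
{
	acc->prev = raw;
	acc->total = 0;
}

/* Fold the counts since the last update into the 64-bit total. */
uint64_t counter_acc_update(struct counter_acc *acc, uint32_t raw)
{
	uint32_t delta = (uint32_t)(raw - acc->prev);	/* modulo 2^32 */

	acc->prev = raw;
	acc->total += delta;
	return acc->total;
}
```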


* Re: [RFC PATCH 0/3] perf: show package power consumption in perf
  2010-08-19  9:44             ` Peter Zijlstra
@ 2010-08-21  1:18               ` Frederic Weisbecker
  2010-08-21  9:30                 ` Ingo Molnar
  2010-08-23  9:31                 ` Peter Zijlstra
  0 siblings, 2 replies; 20+ messages in thread
From: Frederic Weisbecker @ 2010-08-21  1:18 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Matt Fleming, Zhang Rui, Lin, Ming M, LKML, mingo,
	robert.richter, acme, paulus, dzickus, gorcunov, Brown, Len,
	Matthew Garrett

On Thu, Aug 19, 2010 at 11:44:45AM +0200, Peter Zijlstra wrote:
> On Thu, 2010-08-19 at 09:32 +0100, Matt Fleming wrote:
> > 
> > 
> > How big is the hardware counter? The problem comes when the process is
> > scheduled in and runs for a long time, e.g. so long that the energy
> > hardware counter wraps. This is why it's necessary to periodically
> > sample the counter.
> > 
> Long running processes aren't the only case, you could associate an
> event with a CPU.



I don't understand what you mean.



> Right, short counters (like SH when not chained) need something to
> accumulate deltas into the larger u64. You can indeed use timers for
> that, hr or otherwise, but you don't need the swcounter hrtimer
> infrastructure for that.


So what is the point in simulating a PMI using an hrtimer? It won't be
based on periods on the interesting counter but on time periods. This
is not how we want the samples. If we want timer-based samples, we can
just launch a separate software timer-based event.

In the case of SH where we need to flush to avoid wraps, I understand, but
otherwise?



* Re: [RFC PATCH 0/3] perf: show package power consumption in perf
  2010-08-21  1:18               ` Frederic Weisbecker
@ 2010-08-21  9:30                 ` Ingo Molnar
  2010-08-23  9:31                 ` Peter Zijlstra
  1 sibling, 0 replies; 20+ messages in thread
From: Ingo Molnar @ 2010-08-21  9:30 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Peter Zijlstra, Matt Fleming, Zhang Rui, Lin, Ming M, LKML,
	robert.richter, acme, paulus, dzickus, gorcunov, Brown, Len,
	Matthew Garrett


* Frederic Weisbecker <fweisbec@gmail.com> wrote:

> > Right, short counters (like SH when not chained) need something to 
> > accumulate deltas into the larger u64. You can indeed use timers for 
> > that, hr or otherwise, but you don't need the swcounter hrtimer 
> > infrastructure for that.
> 
> So what is the point in simulating a PMI using an hrtimer? It won't be 
> based on periods on the interesting counter but on time periods. This 
> is not how we want the samples. If we want timer based samples, we can 
> > just launch a separate software timer based event.

If we then measure the delta of the count during that constant-time 
period, we'll get a 'weight' to consider.

So for example if we sample with a period of every 1000 cache-misses, 
regular same-counter-PMU-IRQ sampling goes like this:

   1000
   1000
   1000
   1000
   1000
   ....

While if we use a hrtimer, we get variations:

   1050
    711
   1539
   2210
    400

But using that variable period as a weight will, statistically, 
compensate for the variation.

It's similar to how the auto-freq code works - that too has variable 
periods (due to the self-adjustment) - which we compensate with weight.
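
The weighting idea can be sketched as follows: each timer sample carries the counter delta since the previous sample, and that delta, rather than a count of 1, is credited to the code location the sample hit. The struct, the function name and the "site" index are hypothetical; this is an illustration of the statistics, not how the perf tool implements it.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical timer sample: where it hit and how much the counter moved. */
struct weighted_sample {
	size_t   site;	/* index of the code location that was sampled */
	uint64_t delta;	/* counter increase since the previous sample */
};

/* Credit each site with the counter deltas of the samples that hit it. */
void weigh_samples(const struct weighted_sample *s, size_t n,
		   uint64_t *weight, size_t nsites)
{
	size_t i;

	for (i = 0; i < nsites; i++)
		weight[i] = 0;

	for (i = 0; i < n; i++) {
		if (s[i].site < nsites)	/* ignore out-of-range sites */
			weight[s[i].site] += s[i].delta;
	}
}
```

With the five variable periods from the example above split over two sites as {0, 1, 0, 1, 0}, site 0 gets 1050 + 1539 + 400 = 2989 and site 1 gets 711 + 2210 = 2921. An unweighted count would have reported 3 samples versus 2.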

Thanks,

	Ingo


* Re: [RFC PATCH 0/3] perf: show package power consumption in perf
  2010-08-21  1:18               ` Frederic Weisbecker
  2010-08-21  9:30                 ` Ingo Molnar
@ 2010-08-23  9:31                 ` Peter Zijlstra
  1 sibling, 0 replies; 20+ messages in thread
From: Peter Zijlstra @ 2010-08-23  9:31 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Matt Fleming, Zhang Rui, Lin, Ming M, LKML, mingo,
	robert.richter, acme, paulus, dzickus, gorcunov, Brown, Len,
	Matthew Garrett

On Sat, 2010-08-21 at 03:18 +0200, Frederic Weisbecker wrote:
> On Thu, Aug 19, 2010 at 11:44:45AM +0200, Peter Zijlstra wrote:
> > On Thu, 2010-08-19 at 09:32 +0100, Matt Fleming wrote:
> > > 
> > > 
> > > How big is the hardware counter? The problem comes when the process is
> > > scheduled in and runs for a long time, e.g. so long that the energy
> > > hardware counter wraps. This is why it's necessary to periodically
> > > sample the counter.
> > > 
> > Long running processes aren't the only case, you could associate an
> > event with a CPU.

> I don't understand what you mean.

perf_event_open(.pid = -1, .cpu = n);

> > Right, short counters (like SH when not chained) need something to
> > accumulate deltas into the larger u64. You can indeed use timers for
> > that, hr or otherwise, but you don't need the swcounter hrtimer
> > infrastructure for that.
> 
> 
> So what is the point in simulating a PMI using an hrtimer? It won't be
> based on periods on the interesting counter but on time periods. This
> is not how we want the samples. If we want timer based samples, we can
> just launch a separate software timer based event.

*sigh* that's exactly what we're doing, we're creating a separate
software hrtimer to create samples, the only thing that's different is
that we put this hrtimer and the hw-counter in a group and let the
hrtimer sample include the hw-counter's value.

If you then weight the samples by the hw-counter delta, you get
something that's more or less related to the thing the hw-counter is
counting.

For counters that do not provide overflow interrupts this is the only
possible way to get anything.

> In the case of SH where we need to flush to avoid wraps, I understand, but
> otherwise?

The wrap issue is totally unrelated.




* Re: [RFC PATCH 0/3] perf: show package power consumption in perf
  2010-08-19  3:28     ` Lin Ming
  2010-08-19  7:54       ` Matt Fleming
@ 2010-08-19  9:02       ` Peter Zijlstra
  2010-08-20  1:44         ` Zhang Rui
  1 sibling, 1 reply; 20+ messages in thread
From: Peter Zijlstra @ 2010-08-19  9:02 UTC (permalink / raw)
  To: Lin Ming
  Cc: Matt Fleming, Zhang, Rui, LKML, mingo, robert.richter, acme,
	paulus, dzickus, gorcunov, fweisbec, Brown, Len, Matthew Garrett

On Thu, 2010-08-19 at 11:28 +0800, Lin Ming wrote:
> On Wed, 2010-08-18 at 20:41 +0800, Matt Fleming wrote:
> > On Wed, Aug 18, 2010 at 02:25:29PM +0200, Peter Zijlstra wrote:
> > > On Wed, 2010-08-18 at 15:59 +0800, Zhang Rui wrote:
> > > > Hi, all,
> > > > 
> > > > RAPL(running average power limit) is a new feature which provides
> > > > mechanisms to enforce power consumption limit, on some new processors.
> > > > 
> > > > Generally speaking, by using RAPL, OS can set a power budget in a
> > > > certain time window, and let Hardware to throttle the processor
> > > > P/T-state to meet this energy limitation.
> > > > 
> > > > RAPL also provides a new MSR, i.e. MSR_PKG_ENERGY_STATUS, which reports
> > > > the total amount of energy consumed by the package.
> > > > 
> > > > I'm not sure if to support RAPL or not, but anyway, it sounds like a
> > > > good idea to export the energy status in perf.
> > > > 
> > > > So a new perf pmu and event to show the package energy consumed is
> > > > introduced in this patch.
> > > > 
> > > > Here is what I get after applying the three patches,
> > > > 
> > > > #./perf stat -e energy test
> > > > Performance counter stats for 'test':
> > > > 
> > > > 	202	Joules cost by package
> > > > 7.926001238	seconds time elapsed
> > > > 
> > > > 
> > > > Note that this patch set is made based on Peter's perf-pmu branch,
> > > > git://git.kernel.org/pub/scm/linux/kernel/git/peterz/linux-2.6-perf.git
> > > >  which provides better interfaces to register/unregister a new pmu.
> > > > 
> > > > any comment are welcome. :)
> > > 
> > > 
> > > Nice,.. however:
> > > 
> > >  - if it is a pure read-only counter without sampling support,
> > >    expose it as such, don't fudge in the hrtimer stuff. Simply
> > >    fail to create a sampling event.
> > > 
> > >    SH has the same problem for its 'normal' PMU, the solution is
> > >    to use event groups, Matt was looking at adding support to
> > >    perf-record for that, if creating a sampling event fails, fall
> > >    back to {hrtimer, $event} groups.
> > 
> > I had a quick look over the patches and Peter is right - the group
> > events stuff would probably fit quite well here. Unfortunately, due to
> > holidays and things, I haven't been able to get them finished
> > yet. I'll get on that ASAP.
> 
> Hi, Matt
> 
> What's the "group events stuff"?
> Is there some discussion on LKML or elsewhere I can have a look at?

it's some obscure perf feature:

 leader = sys_perf_event_open(&hrtimer_attr, pid, cpu, -1, 0);
 sibling = sys_perf_event_open(&rapl_attr, pid, cpu, leader, 0);

will create an event group (which means that both events must be
co-scheduled). If you then provide:

hrtimer_attr.read_format |= PERF_FORMAT_GROUP;
hrtimer_attr.sample_type |= PERF_SAMPLE_READ;

the samples from the hrtimer will contain a field like:

 *      { u64           nr;
 *        { u64         time_enabled; } && PERF_FORMAT_ENABLED
 *        { u64         time_running; } && PERF_FORMAT_RUNNING
 *        { u64         value;
 *          { u64       id;           } && PERF_FORMAT_ID
 *        }             cntr[nr];
 *      } && PERF_FORMAT_GROUP

Which contains both the hrtimer count (ns) and the RAPL count (Joules).

Using that you can compute the RAPL delta between consecutive samples
and use that to weight the sample.


For perf-stat none of this is needed, since it doesn't use sampling
counters anyway ;-).
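
Computing the RAPL delta between consecutive samples could look like the sketch below. It assumes read_format is exactly PERF_FORMAT_GROUP (no ENABLED, RUNNING or ID bits), so each group read is just nr followed by nr values, with the leader first and the sibling second. The struct and function names are hypothetical.

```c
#include <stdint.h>

/*
 * Group read payload when read_format == PERF_FORMAT_GROUP only:
 * { u64 nr; u64 value[nr]; }, fixed here to a two-event group.
 */
struct group_read {
	uint64_t nr;
	uint64_t value[2];	/* [0] = hrtimer leader, [1] = sibling */
};

/*
 * Store the sibling's count delta between two consecutive samples.
 * Returns 0 on success, -1 if either read is not a two-event group.
 */
int sibling_delta(const struct group_read *prev,
		  const struct group_read *cur, uint64_t *delta)
{
	if (prev->nr != 2 || cur->nr != 2)
		return -1;

	*delta = cur->value[1] - prev->value[1];
	return 0;
}
```

The resulting delta is what would be used as the weight of the later sample.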


* Re: [RFC PATCH 0/3] perf: show package power consumption in perf
  2010-08-19  9:02       ` Peter Zijlstra
@ 2010-08-20  1:44         ` Zhang Rui
  2010-08-20  9:34           ` Peter Zijlstra
  0 siblings, 1 reply; 20+ messages in thread
From: Zhang Rui @ 2010-08-20  1:44 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Lin, Ming M, Matt Fleming, LKML, mingo, robert.richter, acme,
	paulus, dzickus, gorcunov, fweisbec, Brown, Len, Matthew Garrett

On Thu, 2010-08-19 at 17:02 +0800, Peter Zijlstra wrote:
> > > > 
> > > >  - if it is a pure read-only counter without sampling support,
> > > >    expose it as such, don't fudge in the hrtimer stuff. Simply
> > > >    fail to create a sampling event.
> > > > 
> > > >    SH has the same problem for its 'normal' PMU, the solution is
> > > >    to use event groups, Matt was looking at adding support to
> > > >    perf-record for that, if creating a sampling event fails, fall
> > > >    back to {hrtimer, $event} groups.
> > > 
> > > I had a quick look over the patches and Peter is right - the group
> > > events stuff would probably fit quite well here. Unfortunately, due to
> > > holidays and things, I haven't been able to get them finished
> > > yet. I'll get on that ASAP.
> > 
> > Hi, Matt
> > 
> > What's the "group events stuff"?
> > Is there some discussion on LKML or elsewhere I can have a look at?
> 
> its some obscure perf feature:
> 
>  leader = sys_perf_event_open(&hrtimer_attr, pid, cpu, 0, 0);
>  sibling = sys_perf_event_open(&rapl_attr, pid, cpu, leader, 0);
> 
> will create an event group (which means that both events require to be
> co-scheduled). If you then provided:
> 
> hrtimer_attr.read_format |= PERF_FORMAT_GROUP;
> hrtimer_attr.sample_type |= PERF_SAMPLE_READ;
> 
hrtimer_attr is only shared in an event group, and rapl needs its own
event group, right?

> the samples from the hrtimer will contain a field like:
> 
>  *      { u64           nr;
>  *        { u64         time_enabled; } && PERF_FORMAT_ENABLED
>  *        { u64         time_running; } && PERF_FORMAT_RUNNING
>  *        { u64         value;
>  *          { u64       id;           } && PERF_FORMAT_ID
>  *        }             cntr[nr];
>  *      } && PERF_FORMAT_GROUP
> 
> Which contains both the hrtimer count (ns) and the RAPL count (Joules).
> 
> Using that you can compute the RAPL delta between consecutive samples
> and use that to weight the sample.
> 
> 
> For perf-stat none of this is needed, since it doesn't use sampling
> counters anyway ;-).

so what do you think the rapl counter should look like in userspace?
showing it in perf-stat looks nice, right? :)

thanks,
rui



* Re: [RFC PATCH 0/3] perf: show package power consumption in perf
  2010-08-20  1:44         ` Zhang Rui
@ 2010-08-20  9:34           ` Peter Zijlstra
  2010-08-20 12:31             ` Ingo Molnar
  0 siblings, 1 reply; 20+ messages in thread
From: Peter Zijlstra @ 2010-08-20  9:34 UTC (permalink / raw)
  To: Zhang Rui
  Cc: Lin, Ming M, Matt Fleming, LKML, mingo, robert.richter, acme,
	paulus, dzickus, gorcunov, fweisbec, Brown, Len, Matthew Garrett

On Fri, 2010-08-20 at 09:44 +0800, Zhang Rui wrote:
> On Thu, 2010-08-19 at 17:02 +0800, Peter Zijlstra wrote:

> > its some obscure perf feature:
> > 
> >  leader = sys_perf_event_open(&hrtimer_attr, pid, cpu, 0, 0);
> >  sibling = sys_perf_event_open(&rapl_attr, pid, cpu, leader, 0);
> > 
> > will create an event group (which means that both events must be
> > co-scheduled). If you then provided:
> > 
> > hrtimer_attr.read_format |= PERF_FORMAT_GROUP;
> > hrtimer_attr.sample_type |= PERF_SAMPLE_READ;
> > 
> hrtimer_attr is only shared in an event group, and rapl needs its own
> event group, right?

Uhm, no. The idea is to group the hrtimer and rapl event in order to
obtain rapl 'samples'.

That is, you get hrtimer samples which include the rapl count. For this
we use the grouping construct where group siblings are always
co-scheduled and can report on each other's count.
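
A rough userspace sketch of that grouping, under a few assumptions: the
hrtimer leader is stood in for by the cpu-clock software event, the
helper names init_timer_leader() and open_sampled_group() are invented
for this example, and the sibling attr (the RAPL event) is left to the
caller because its type/config encoding is not settled yet. Note the
sketch passes -1 as group_fd when opening the leader.

```c
#include <linux/perf_event.h>
#include <stdint.h>
#include <string.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/* Thin wrapper around the raw syscall; libc provides no such function. */
long sys_perf_event_open(struct perf_event_attr *attr, pid_t pid, int cpu,
			 int group_fd, unsigned long flags)
{
	return syscall(SYS_perf_event_open, attr, pid, cpu, group_fd, flags);
}

/*
 * Fill in a timer-driven sampling leader whose samples carry the
 * counts of every event in its group.
 */
void init_timer_leader(struct perf_event_attr *attr, uint64_t period_ns)
{
	memset(attr, 0, sizeof(*attr));
	attr->size = sizeof(*attr);
	attr->type = PERF_TYPE_SOFTWARE;
	attr->config = PERF_COUNT_SW_CPU_CLOCK;	/* assumed hrtimer stand-in */
	attr->sample_period = period_ns;
	attr->sample_type |= PERF_SAMPLE_READ;	/* put counts in each sample */
	attr->read_format |= PERF_FORMAT_GROUP;	/* ...for the whole group */
}

/*
 * Open the leader, then attach the counting event as its sibling.
 * Returns the leader fd (and stores the sibling fd), or -1 on failure.
 */
int open_sampled_group(struct perf_event_attr *timer_attr,
		       struct perf_event_attr *counter_attr,
		       pid_t pid, int cpu, int *sibling_fd)
{
	int leader, sibling;

	leader = (int)sys_perf_event_open(timer_attr, pid, cpu, -1, 0);
	if (leader < 0)
		return -1;

	sibling = (int)sys_perf_event_open(counter_attr, pid, cpu, leader, 0);
	if (sibling < 0) {
		close(leader);
		return -1;
	}

	*sibling_fd = sibling;
	return leader;
}
```

The samples are then read from the leader's mmap buffer; the sibling
itself never needs to generate a sample.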

> so what do you think the rapl counter should look like in userspace?
> showing it in perf-stat looks nice, right? :)

Right, so the userspace interface would be using Lin's sysfs bits, which
I still need to read up on. But the general idea is that each PMU gets a
sysfs representation somewhere in the system topology reflecting its
actual site (RAPL would be CPU local), this sysfs representation would
then also allow you to discover all events it provides.

perf list will then use sysfs to discover all available events, and you
can still use perf stat -e $foo to select it, where foo is some to be
determined string that identifies the thing, maybe something like:
rapl:watts or somesuch (with rapl identifying the pmu and watts the
actual event for that pmu).
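
As a rough illustration of that naming scheme, a "pmu:event" string
could be split as below. This is only a sketch: split_event_spec() is a
hypothetical helper, not an existing perf function, and the final
syntax is still to be determined.

```c
#include <stddef.h>
#include <string.h>

/*
 * Split "pmu:event" (e.g. "rapl:watts") into its two parts.
 * Returns 0 on success, -1 if there is no ':' separator, either part
 * is empty, or a part does not fit in the buffer provided for it.
 */
int split_event_spec(const char *spec,
		     char *pmu, size_t pmu_len,
		     char *event, size_t event_len)
{
	const char *colon = strchr(spec, ':');
	size_t plen, elen;

	if (!colon)
		return -1;

	plen = (size_t)(colon - spec);
	elen = strlen(colon + 1);
	if (plen == 0 || elen == 0 || plen >= pmu_len || elen >= event_len)
		return -1;

	memcpy(pmu, spec, plen);
	pmu[plen] = '\0';
	memcpy(event, colon + 1, elen + 1);	/* includes the trailing NUL */

	return 0;
}
```

The pmu part would then select the sysfs directory to look in, and the
event part the entry below it.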


* Re: [RFC PATCH 0/3] perf: show package power consumption in perf
  2010-08-20  9:34           ` Peter Zijlstra
@ 2010-08-20 12:31             ` Ingo Molnar
  2010-08-20 21:34               ` acme
  0 siblings, 1 reply; 20+ messages in thread
From: Ingo Molnar @ 2010-08-20 12:31 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Zhang Rui, Lin, Ming M, Matt Fleming, LKML, robert.richter, acme,
	paulus, dzickus, gorcunov, fweisbec, Brown, Len, Matthew Garrett,
	Steven Rostedt, Thomas Gleixner


* Peter Zijlstra <peterz@infradead.org> wrote:

> On Fri, 2010-08-20 at 09:44 +0800, Zhang Rui wrote:
> > On Thu, 2010-08-19 at 17:02 +0800, Peter Zijlstra wrote:
> 
> > > its some obscure perf feature:
> > > 
> > >  leader = sys_perf_event_open(&hrtimer_attr, pid, cpu, 0, 0);
> > >  sibling = sys_perf_event_open(&rapl_attr, pid, cpu, leader, 0);
> > > 
> > > will create an event group (which means that both events must be
> > > co-scheduled). If you then provided:
> > > 
> > > hrtimer_attr.read_format |= PERF_FORMAT_GROUP;
> > > hrtimer_attr.sample_type |= PERF_SAMPLE_READ;
> > > 
> > hrtimer_attr is only shared in an event group, and rapl needs its own
> > event group, right?
> 
> Uhm, no. The idea is to group the hrtimer and rapl event in order to
> obtain rapl 'samples'.
> 
> That is, you get hrtimer samples which include the rapl count. For this
> we use the grouping construct where group siblings are always
> co-scheduled and can report on each other's count.
> 
> > so what do you think the rapl counter should look like in userspace?
> > showing it in perf-stat looks nice, right? :)
> 
> Right, so the userspace interface would be using Lin's sysfs bits, which I 
> still need to read up on. But the general idea is that each PMU gets a sysfs 
> representation somewhere in the system topology reflecting its actual site 
> (RAPL would be CPU local), this sysfs representation would then also allow 
> you to discover all events it provides.
> 
> perf list will then use sysfs to discover all available events, and you can 
> still use perf stat -e $foo to select it, where foo is some to be determined 
> string that identifies the thing, maybe something like: rapl:watts or 
> somesuch (with rapl identifying the pmu and watts the actual event for that 
> pmu).

Btw., some 'perf list' thoughts. We could do a:

   perf list --help rapl:watts

Which gives the user some idea of what an event does. Also, a short descriptive
line in the perf list output would be nice:

$ perf list

List of pre-defined events (to be used in -e):

  cpu-cycles OR cycles                       [Hardware event]   # CPU cycles
  instructions                               [Hardware event]   # instructions executed

  ...

  rapl:watts                                 [Tracepoint]       # watts usage

or something like that. Perhaps even a TUI for perf list, to browse between 
event types? (in that case it would probably be useful to make them collapse 
along natural grouping)

We want users/developers to discover new events, see and understand their 
purpose and combine them in not-seen-before ways.

Thanks,

	Ingo


* Re: [RFC PATCH 0/3] perf: show package power consumption in perf
  2010-08-20 12:31             ` Ingo Molnar
@ 2010-08-20 21:34               ` acme
  0 siblings, 0 replies; 20+ messages in thread
From: acme @ 2010-08-20 21:34 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Peter Zijlstra, Zhang Rui, Lin, Ming M, Matt Fleming, LKML,
	robert.richter, paulus, dzickus, gorcunov, fweisbec, Brown, Len,
	Matthew Garrett, Steven Rostedt, Thomas Gleixner

On Fri, Aug 20, 2010 at 02:31:59PM +0200, Ingo Molnar wrote:
> Btw., some 'perf list' thoughts. We could do a:
> 
>    perf list --help rapl:watts
> 
> Which gives the user some idea of what an event does. Also, a short descriptive
> line in the perf list output would be nice:
> 
> $ perf list
> 
> List of pre-defined events (to be used in -e):
> 
>   cpu-cycles OR cycles                       [Hardware event]   # CPU cycles
>   instructions                               [Hardware event]   # instructions executed
> 
>   ...
> 
>   rapl:watts                                 [Tracepoint]       # watts usage
> 
> or something like that. Perhaps even a TUI for perf list, to browse between 
> event types? (in that case it would probably be useful to make them collapse 
> along natural grouping)
> 
> We want users/developers to discover new events, see and understand their 
> purpose and combine them in not-seen-before ways.

Right, record, list, probe, top are on the UI (not just T-UI, see latest
efforts on decoupling from newt/slang) hit-list :)

Moving from one to the other seamlessly, as is possible today for
report and annotate, is the goal.

Now that the UI browser code is more robust and generic that should
happen faster, I think.

- Arnaldo


* Re: [RFC PATCH 0/3] perf: show package power consumption in perf
  2010-08-18 12:25 ` Peter Zijlstra
  2010-08-18 12:41   ` Matt Fleming
@ 2010-08-19  2:43   ` Lin Ming
  2010-08-19  8:54     ` Peter Zijlstra
  1 sibling, 1 reply; 20+ messages in thread
From: Lin Ming @ 2010-08-19  2:43 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Zhang, Rui, LKML, mingo, robert.richter, acme, paulus, dzickus,
	gorcunov, fweisbec, Brown, Len, Matthew Garrett, Matt Fleming

On Wed, 2010-08-18 at 20:25 +0800, Peter Zijlstra wrote:
> On Wed, 2010-08-18 at 15:59 +0800, Zhang Rui wrote:
> > Hi, all,
> > 
> > RAPL(running average power limit) is a new feature which provides
> > mechanisms to enforce power consumption limit, on some new processors.
> > 
> > Generally speaking, by using RAPL, OS can set a power budget in a
> > certain time window, and let Hardware to throttle the processor
> > P/T-state to meet this energy limitation.
> > 
> > RAPL also provides a new MSR, i.e. MSR_PKG_ENERGY_STATUS, which reports
> > the total amount of energy consumed by the package.
> > 
> > I'm not sure if to support RAPL or not, but anyway, it sounds like a
> > good idea to export the energy status in perf.
> > 
> > So a new perf pmu and event to show the package energy consumed is
> > introduced in this patch.
> > 
> > Here is what I get after applying the three patches,
> > 
> > #./perf stat -e energy test
> > Performance counter stats for 'test':
> > 
> > 	202	Joules cost by package
> > 7.926001238	seconds time elapsed
> > 
> > 
> > Note that this patch set is made based on Peter's perf-pmu branch,
> > git://git.kernel.org/pub/scm/linux/kernel/git/peterz/linux-2.6-perf.git
> >  which provides better interfaces to register/unregister a new pmu.
> > 
> > any comment are welcome. :)
> 
> 
> Nice,.. however:
> 
>  - if it is a pure read-only counter without sampling support,
>    expose it as such, don't fudge in the hrtimer stuff. Simply
>    fail to create a sampling event.
> 
>    SH has the same problem for its 'normal' PMU, the solution is
>    to use event groups, Matt was looking at adding support to
>    perf-record for that, if creating a sampling event fails, fall
>    back to {hrtimer, $event} groups.
> 
>  - since its a free-running, non-configurable counter, you can indeed
>    act like its a 'software' event in that you can schedule consumers
>    without constraints, however I don't think the PERF_COUNT_SW_* space
>    is the right way to expose this counter.
> 
>    Better would be to use the sysfs stuff Lin has been working on (for

Sorry that I have no good idea how to export the various tracepoints
events automatically, so this work will take time.

Lin Ming

>    which I still need to catch up on the latest discussions), it would
>    then be tied to the pmu instance and appear/disappear when you load/
>    unload the module.
> 
>    However for testing purposes I see why you'd want to have _a_
>    interface :-)
> 
> - it would be nice if you'd write the cpu detection a bit more readable,
>   also, it looks like you forgot to check x86_vendor == X86_VENDOR_INTEL.
> 
> > +static int __init intel_rapl_init(void)
> > +{
> > +	/*
> > +	 * RAPL features are only supported on processors have a CPUID
> > +	 * signature with DisplayFamily_DisplayModel of 06_2AH, 06_2DH
> > +	 */
> > +	if (boot_cpu_data.x86 != 0x06 ||
> > +	    (boot_cpu_data.x86_model != 0x2A &&
> > +	    boot_cpu_data.x86_model != 0x2D))
> > +		return -ENODEV;
> > +
> > +	if (rapl_check_unit())
> > +		return -ENODEV;
> > +
> > +	perf_pmu_register(&rapl_pmu);
> > +	return 0;
> > +}
> 
> Maybe something like (see intel_pmu_init() for example):
> 
>   if (boot_cpu_data.x86_vendor != X86_VENDOR_INTEL)
>     return -ENODEV;
> 
>   if (boot_cpu_data.x86 != 0x06)
>     return -ENODEV;
> 
>   switch (boot_cpu_data.x86_model) {
>   case 0x2A: /* sandybridge ?! 32nm */
>   case 0x2D: /* othermodel 32nm */
>     break;
> 
>   default:
>     return -ENODEV;
>   }
> 
> Which again reminds me to ask of Intel, a comprehensive x86_model list,
> please?
> 
> Alternatively, you can create a X86_FEATURE_RAPL and simply use
> boot_cpu_has(X86_FEATURE_RAPL) (much like intel_ds_init() has).




* Re: [RFC PATCH 0/3] perf: show package power consumption in perf
  2010-08-19  2:43   ` Lin Ming
@ 2010-08-19  8:54     ` Peter Zijlstra
  2010-08-20  0:21       ` Lin Ming
  0 siblings, 1 reply; 20+ messages in thread
From: Peter Zijlstra @ 2010-08-19  8:54 UTC (permalink / raw)
  To: Lin Ming
  Cc: Zhang, Rui, LKML, mingo, robert.richter, acme, paulus, dzickus,
	gorcunov, fweisbec, Brown, Len, Matthew Garrett, Matt Fleming

On Thu, 2010-08-19 at 10:43 +0800, Lin Ming wrote:
> Sorry that I have no good idea how to export the various tracepoints
> events automatically, so this work will take time.
> 
Well, we could start with just the hardware bits and leave the tracepoint
bits for later, right?


* Re: [RFC PATCH 0/3] perf: show package power consumption in perf
  2010-08-19  8:54     ` Peter Zijlstra
@ 2010-08-20  0:21       ` Lin Ming
  0 siblings, 0 replies; 20+ messages in thread
From: Lin Ming @ 2010-08-20  0:21 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Zhang, Rui, LKML, mingo, robert.richter, acme, paulus, dzickus,
	gorcunov, fweisbec, Brown, Len, Matthew Garrett, Matt Fleming

On Thu, 2010-08-19 at 16:54 +0800, Peter Zijlstra wrote:
> On Thu, 2010-08-19 at 10:43 +0800, Lin Ming wrote:
> > Sorry that I have no good idea how to export the various tracepoints
> > events automatically, so this work will take time.
> > 
> Well, we could start with just the hardware bits and leave the tracepoint
> bits for later, right?

Right. I'll update the patches.




end of thread, other threads:[~2010-08-23  9:31 UTC | newest]

Thread overview: 20+ messages (download: mbox.gz / follow: Atom feed)
2010-08-18  7:59 [RFC PATCH 0/3] perf: show package power consumption in perf Zhang Rui
2010-08-18 12:25 ` Peter Zijlstra
2010-08-18 12:41   ` Matt Fleming
2010-08-19  3:28     ` Lin Ming
2010-08-19  7:54       ` Matt Fleming
2010-08-19  8:15         ` Lin Ming
2010-08-19  8:31         ` Zhang Rui
2010-08-19  8:32           ` Matt Fleming
2010-08-19  9:44             ` Peter Zijlstra
2010-08-21  1:18               ` Frederic Weisbecker
2010-08-21  9:30                 ` Ingo Molnar
2010-08-23  9:31                 ` Peter Zijlstra
2010-08-19  9:02       ` Peter Zijlstra
2010-08-20  1:44         ` Zhang Rui
2010-08-20  9:34           ` Peter Zijlstra
2010-08-20 12:31             ` Ingo Molnar
2010-08-20 21:34               ` acme
2010-08-19  2:43   ` Lin Ming
2010-08-19  8:54     ` Peter Zijlstra
2010-08-20  0:21       ` Lin Ming
