All of lore.kernel.org
 help / color / mirror / Atom feed
* [perf] enable raw OFFCORE_EVENTS for non-perf userspace
@ 2011-08-03 16:05 Vince Weaver
  2011-08-03 17:00 ` Vince Weaver
                   ` (2 more replies)
  0 siblings, 3 replies; 17+ messages in thread
From: Vince Weaver @ 2011-08-03 16:05 UTC (permalink / raw)
  To: linux-kernel
  Cc: Peter Zijlstra, Paul Mackerras, Ingo Molnar, Arnaldo Carvalho de Melo

Hello

I propose we just enable raw OFFCORE_EVENT support and get it over with.

There is a lot of demand for this from PAPI users, and so we encourage 
them to apply the below patch.  PAPI supports this out of the box.

The current "block" against using this feature *DOES NOT WORK*.
It silently fails if you try to use the config1 field to set it.

Even worse, if some previous user has set the OFFCORE_RSP_0 msr
(say by running "perf stat -e LLC-load-misses") then
the msr *stays set* and if you try to set the config1 field on your own
it looks like it worked, but instead it is using whatever value the
kernel last used.

So there's a lot of userspace confusion about this, and you can't even 
reliably tell if the feature is turned off or not because it fails 
silently in unpredictable ways.

Thanks

Vince
vweaver1@eecs.utk.edu

diff --git a/arch/x86/kernel/cpu/perf_event.c b/arch/x86/kernel/cpu/perf_event.c
index 4ee3abf..28f9ca9 100644
--- a/arch/x86/kernel/cpu/perf_event.c
+++ b/arch/x86/kernel/cpu/perf_event.c
@@ -604,12 +604,8 @@ static int x86_setup_perfctr(struct perf_event *event)
 			return -EOPNOTSUPP;
 	}
 
-	/*
-	 * Do not allow config1 (extended registers) to propagate,
-	 * there's no sane user-space generalization yet:
-	 */
 	if (attr->type == PERF_TYPE_RAW)
-		return 0;
+		return x86_pmu_extra_regs(event->attr.config, event);
 
 	if (attr->type == PERF_TYPE_HW_CACHE)
 		return set_ext_hw_attr(hwc, event);


^ permalink raw reply related	[flat|nested] 17+ messages in thread

* Re: [perf] enable raw OFFCORE_EVENTS for non-perf userspace
  2011-08-03 16:05 [perf] enable raw OFFCORE_EVENTS for non-perf userspace Vince Weaver
@ 2011-08-03 17:00 ` Vince Weaver
  2011-08-04 15:55 ` Peter Zijlstra
  2011-11-18 23:34 ` [tip:perf/urgent] perf/x86: Enable raw event access to Intel offcore events tip-bot for Peter Zijlstra
  2 siblings, 0 replies; 17+ messages in thread
From: Vince Weaver @ 2011-08-03 17:00 UTC (permalink / raw)
  To: linux-kernel
  Cc: Peter Zijlstra, Paul Mackerras, Ingo Molnar, Arnaldo Carvalho de Melo


and in case we don't want to be nice to userspace and want to brutally 
suppress anyone not using the "perf" tool, here's a patch that 
returns EINVAL for any raw access that sets config1.

I actually tested to make sure the patch doesn't break PAPI, unlike my 
last attempt.

Currently we "block" access to setting the config1 value from userspace
by silently failing.  This can cause unexpected behavior (especially if
the kernel is using the OFFCORE_RSP_0 msr internally) and also makes it 
impossible to test if OFFCORE_RESPONSE events are being blocked or not.

This hackish patch causes an EINVAL to propogate back to userspace
by setting an invalid value to the config1 field.

I personally think this patch is a step backward, and the right thing to 
do is give users full access to their own hardware.  But the current state 
of affairs is bad for tools like PAPI too.



diff --git a/arch/x86/kernel/cpu/perf_event.c b/arch/x86/kernel/cpu/perf_event.c
index 4ee3abf..72baf25 100644
--- a/arch/x86/kernel/cpu/perf_event.c
+++ b/arch/x86/kernel/cpu/perf_event.c
@@ -608,8 +608,11 @@ static int x86_setup_perfctr(struct perf_event *event)
 	 * Do not allow config1 (extended registers) to propagate,
 	 * there's no sane user-space generalization yet:
 	 */
-	if (attr->type == PERF_TYPE_RAW)
-		return 0;
+	if (attr->type == PERF_TYPE_RAW) {
+		/* I HATE THIS PATCH AND EVERYTHING IT STANDS FOR.  --  VMW */
+		event->attr.config1=0xffffffff;  /* invalid, will be caught */
+		return x86_pmu_extra_regs(event->attr.config, event);
+	}
 
 	if (attr->type == PERF_TYPE_HW_CACHE)
 		return set_ext_hw_attr(hwc, event);


^ permalink raw reply related	[flat|nested] 17+ messages in thread

* Re: [perf] enable raw OFFCORE_EVENTS for non-perf userspace
  2011-08-03 16:05 [perf] enable raw OFFCORE_EVENTS for non-perf userspace Vince Weaver
  2011-08-03 17:00 ` Vince Weaver
@ 2011-08-04 15:55 ` Peter Zijlstra
  2011-08-05  2:24   ` Vince Weaver
                     ` (2 more replies)
  2011-11-18 23:34 ` [tip:perf/urgent] perf/x86: Enable raw event access to Intel offcore events tip-bot for Peter Zijlstra
  2 siblings, 3 replies; 17+ messages in thread
From: Peter Zijlstra @ 2011-08-04 15:55 UTC (permalink / raw)
  To: Vince Weaver
  Cc: linux-kernel, Paul Mackerras, Ingo Molnar, Arnaldo Carvalho de Melo

On Wed, 2011-08-03 at 12:05 -0400, Vince Weaver wrote:
> Hello
> 
> I propose we just enable raw OFFCORE_EVENT support and get it over with.
> 
> There is a lot of demand for this from PAPI users, and so we encourage 
> them to apply the below patch.  PAPI supports this out of the box.
> 
> The current "block" against using this feature *DOES NOT WORK*.
> It silently fails if you try to use the config1 field to set it.
> 
> Even worse, if some previous user has set the OFFCORE_RSP_0 msr
> (say by running "perf stat -e LLC-load-misses") then
> the msr *stays set* and if you try to set the config1 field on your own
> it looks like it worked, but instead it is using whatever value the
> kernel last used.
> 
> So there's a lot of userspace confusion about this, and you can't even 
> reliably tell if the feature is turned off or not because it fails 
> silently in unpredictable ways.


> diff --git a/arch/x86/kernel/cpu/perf_event.c b/arch/x86/kernel/cpu/perf_event.c
> index 4ee3abf..28f9ca9 100644
> --- a/arch/x86/kernel/cpu/perf_event.c
> +++ b/arch/x86/kernel/cpu/perf_event.c
> @@ -604,12 +604,8 @@ static int x86_setup_perfctr(struct perf_event *event)
>  			return -EOPNOTSUPP;
>  	}
>  
> -	/*
> -	 * Do not allow config1 (extended registers) to propagate,
> -	 * there's no sane user-space generalization yet:
> -	 */
>  	if (attr->type == PERF_TYPE_RAW)
> -		return 0;
> +		return x86_pmu_extra_regs(event->attr.config, event);
>  
>  	if (attr->type == PERF_TYPE_HW_CACHE)
>  		return set_ext_hw_attr(hwc, event);


I'm inclined to merge this, aside from snb, the offcore stuff is
actually quite usable now. Ingo can we somehow persuade you?

Anybody who knows how to program the snb offcore please tell. I mean we
have all the code to poke at the right registers, and the SDM lists all
the various bits that go where and a few constraints on how to combine
said bits, but I've really no idea what any of it means.



^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [perf] enable raw OFFCORE_EVENTS for non-perf userspace
  2011-08-04 15:55 ` Peter Zijlstra
@ 2011-08-05  2:24   ` Vince Weaver
  2011-08-05  9:49   ` Ingo Molnar
  2011-11-07 18:30   ` Vince Weaver
  2 siblings, 0 replies; 17+ messages in thread
From: Vince Weaver @ 2011-08-05  2:24 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, Paul Mackerras, Ingo Molnar,
	Arnaldo Carvalho de Melo, Stephane Eranian

On Thu, 4 Aug 2011, Peter Zijlstra wrote:

> On Wed, 2011-08-03 at 12:05 -0400, Vince Weaver wrote:
> 
> > diff --git a/arch/x86/kernel/cpu/perf_event.c b/arch/x86/kernel/cpu/perf_event.c
> > index 4ee3abf..28f9ca9 100644
> > --- a/arch/x86/kernel/cpu/perf_event.c
> > +++ b/arch/x86/kernel/cpu/perf_event.c
> > @@ -604,12 +604,8 @@ static int x86_setup_perfctr(struct perf_event *event)
> >  			return -EOPNOTSUPP;
> >  	}
> >  
> > -	/*
> > -	 * Do not allow config1 (extended registers) to propagate,
> > -	 * there's no sane user-space generalization yet:
> > -	 */
> >  	if (attr->type == PERF_TYPE_RAW)
> > -		return 0;
> > +		return x86_pmu_extra_regs(event->attr.config, event);
> >  
> >  	if (attr->type == PERF_TYPE_HW_CACHE)
> >  		return set_ext_hw_attr(hwc, event);
> 
> 
> I'm inclined to merge this, aside from snb, the offcore stuff is
> actually quite usable now. Ingo can we somehow persuade you?

This would make me happy.  The patch is more or less a revert of the patch 
that disabled support in the first place, but if you need my signed-off-by 
(I forgot it before) here it is

Signed-off-by: Vince Weaver <vweaver1@eecs.utk.edu>

> Anybody who knows how to program the snb offcore please tell. I mean we
> have all the code to poke at the right registers, and the SDM lists all
> the various bits that go where and a few constraints on how to combine
> said bits, but I've really no idea what any of it means.

I've added Stephane as I think he did some work on SandyBridge offcore.

Vince


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [perf] enable raw OFFCORE_EVENTS for non-perf userspace
  2011-08-04 15:55 ` Peter Zijlstra
  2011-08-05  2:24   ` Vince Weaver
@ 2011-08-05  9:49   ` Ingo Molnar
  2011-11-07 18:30   ` Vince Weaver
  2 siblings, 0 replies; 17+ messages in thread
From: Ingo Molnar @ 2011-08-05  9:49 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Vince Weaver, linux-kernel, Paul Mackerras,
	Arnaldo Carvalho de Melo, Arnaldo Carvalho de Melo,
	Frédéric Weisbecker


* Peter Zijlstra <a.p.zijlstra@chello.nl> wrote:

> On Wed, 2011-08-03 at 12:05 -0400, Vince Weaver wrote:
> > Hello
> > 
> > I propose we just enable raw OFFCORE_EVENT support and get it over with.
> > 
> > There is a lot of demand for this from PAPI users, and so we encourage 
> > them to apply the below patch.  PAPI supports this out of the box.
> > 
> > The current "block" against using this feature *DOES NOT WORK*.
> > It silently fails if you try to use the config1 field to set it.
> > 
> > Even worse, if some previous user has set the OFFCORE_RSP_0 msr
> > (say by running "perf stat -e LLC-load-misses") then
> > the msr *stays set* and if you try to set the config1 field on your own
> > it looks like it worked, but instead it is using whatever value the
> > kernel last used.
> > 
> > So there's a lot of userspace confusion about this, and you can't even 
> > reliably tell if the feature is turned off or not because it fails 
> > silently in unpredictable ways.
> 
> 
> > diff --git a/arch/x86/kernel/cpu/perf_event.c b/arch/x86/kernel/cpu/perf_event.c
> > index 4ee3abf..28f9ca9 100644
> > --- a/arch/x86/kernel/cpu/perf_event.c
> > +++ b/arch/x86/kernel/cpu/perf_event.c
> > @@ -604,12 +604,8 @@ static int x86_setup_perfctr(struct perf_event *event)
> >  			return -EOPNOTSUPP;
> >  	}
> >  
> > -	/*
> > -	 * Do not allow config1 (extended registers) to propagate,
> > -	 * there's no sane user-space generalization yet:
> > -	 */
> >  	if (attr->type == PERF_TYPE_RAW)
> > -		return 0;
> > +		return x86_pmu_extra_regs(event->attr.config, event);
> >  
> >  	if (attr->type == PERF_TYPE_HW_CACHE)
> >  		return set_ext_hw_attr(hwc, event);
> 
> 
> I'm inclined to merge this, aside from snb, the offcore stuff is 
> actually quite usable now. Ingo can we somehow persuade you?

Sure, i think we are almost there, the only missing bit i see is to 
add the now generalized NUMA node bits to tools/perf/ so that it can 
be tested easily and then we can enable the raw bits as well.

It was supposed to be added alongside of:

  89d6c0b: perf, arch: Add generic NODE cache events

what happened to that? If someone wants to accelerate the raw events 
enabling then that should be an easy thing to fix.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [perf] enable raw OFFCORE_EVENTS for non-perf userspace
  2011-08-04 15:55 ` Peter Zijlstra
  2011-08-05  2:24   ` Vince Weaver
  2011-08-05  9:49   ` Ingo Molnar
@ 2011-11-07 18:30   ` Vince Weaver
  2 siblings, 0 replies; 17+ messages in thread
From: Vince Weaver @ 2011-11-07 18:30 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, Paul Mackerras, Ingo Molnar, Arnaldo Carvalho de Melo

On Thu, 4 Aug 2011, Peter Zijlstra wrote:

> On Wed, 2011-08-03 at 12:05 -0400, Vince Weaver wrote:
> > Hello
> > 
> > I propose we just enable raw OFFCORE_EVENT support and get it over with.
> > 
> > There is a lot of demand for this from PAPI users, and so we encourage 
> > them to apply the below patch.  PAPI supports this out of the box.
> > 
> > The current "block" against using this feature *DOES NOT WORK*.
> > It silently fails if you try to use the config1 field to set it.
> > 
> > Even worse, if some previous user has set the OFFCORE_RSP_0 msr
> > (say by running "perf stat -e LLC-load-misses") then
> > the msr *stays set* and if you try to set the config1 field on your own
> > it looks like it worked, but instead it is using whatever value the
> > kernel last used.
> > 
> > So there's a lot of userspace confusion about this, and you can't even 
> > reliably tell if the feature is turned off or not because it fails 
> > silently in unpredictable ways.

Is there any hope for either the patch that enables raw offcore_response, 
or the other patch that properly disables it?

Currently as mentioned the OFFCORE_RSP_0 msr value is leaked if a 
generalized event is using offcore response, thus it is not possible to 
properly detect availability of raw offcore response.

Since this MSR leakage has been in at least two released kernels now, it's
tempting to assume it's the published ABI and start hacking up PAPI so 
it can use it as a backdoor to enable RAW offcore response without needing 
the kernel patches.

Vince


^ permalink raw reply	[flat|nested] 17+ messages in thread

* [tip:perf/urgent] perf/x86: Enable raw event access to Intel offcore events
  2011-08-03 16:05 [perf] enable raw OFFCORE_EVENTS for non-perf userspace Vince Weaver
  2011-08-03 17:00 ` Vince Weaver
  2011-08-04 15:55 ` Peter Zijlstra
@ 2011-11-18 23:34 ` tip-bot for Peter Zijlstra
  2011-11-21 18:24   ` Vince Weaver
  2 siblings, 1 reply; 17+ messages in thread
From: tip-bot for Peter Zijlstra @ 2011-11-18 23:34 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: linux-kernel, eranian, hpa, mingo, a.p.zijlstra, tglx, mingo

Commit-ID:  ed13ec58bfe0d5dc95f748e6118432cb0fa283cb
Gitweb:     http://git.kernel.org/tip/ed13ec58bfe0d5dc95f748e6118432cb0fa283cb
Author:     Peter Zijlstra <a.p.zijlstra@chello.nl>
AuthorDate: Mon, 14 Nov 2011 10:03:25 +0100
Committer:  Ingo Molnar <mingo@elte.hu>
CommitDate: Mon, 14 Nov 2011 13:03:44 +0100

perf/x86: Enable raw event access to Intel offcore events

Now that the core offcore support is fixed up (thanks Stephane) and we
have sane generic events utilizing them, re-enable the raw access to
the feature as well.

Note that it doesn't matter if you use event 0x1b7 or 0x1bb to specify
an offcore event, either one works and neither guarantees you'll end
up on a particular offcore MSR.

Based on original patch from: Vince Weaver <vweaver1@eecs.utk.edu>.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Vince Weaver <vweaver1@eecs.utk.edu>.
Cc: Stephane Eranian <eranian@google.com>
Link: http://lkml.kernel.org/r/alpine.DEB.2.00.1108031200390.703@cl320.eecs.utk.edu
Signed-off-by: Ingo Molnar <mingo@elte.hu>
---
 arch/x86/kernel/cpu/perf_event.c |    6 +-----
 1 files changed, 1 insertions(+), 5 deletions(-)

diff --git a/arch/x86/kernel/cpu/perf_event.c b/arch/x86/kernel/cpu/perf_event.c
index ff0e8d4..2bda212 100644
--- a/arch/x86/kernel/cpu/perf_event.c
+++ b/arch/x86/kernel/cpu/perf_event.c
@@ -312,12 +312,8 @@ int x86_setup_perfctr(struct perf_event *event)
 			return -EOPNOTSUPP;
 	}
 
-	/*
-	 * Do not allow config1 (extended registers) to propagate,
-	 * there's no sane user-space generalization yet:
-	 */
 	if (attr->type == PERF_TYPE_RAW)
-		return 0;
+		return x86_pmu_extra_regs(event->attr.config, event);
 
 	if (attr->type == PERF_TYPE_HW_CACHE)
 		return set_ext_hw_attr(hwc, event);

^ permalink raw reply related	[flat|nested] 17+ messages in thread

* Re: [tip:perf/urgent] perf/x86: Enable raw event access to Intel offcore events
  2011-11-18 23:34 ` [tip:perf/urgent] perf/x86: Enable raw event access to Intel offcore events tip-bot for Peter Zijlstra
@ 2011-11-21 18:24   ` Vince Weaver
  2011-11-21 18:52     ` Peter Zijlstra
  0 siblings, 1 reply; 17+ messages in thread
From: Vince Weaver @ 2011-11-21 18:24 UTC (permalink / raw)
  To: mingo, hpa, eranian, linux-kernel, a.p.zijlstra, tglx, mingo

On Fri, 18 Nov 2011, tip-bot for Peter Zijlstra wrote:

> Commit-ID:  ed13ec58bfe0d5dc95f748e6118432cb0fa283cb
> Gitweb:     http://git.kernel.org/tip/ed13ec58bfe0d5dc95f748e6118432cb0fa283cb
> Author:     Peter Zijlstra <a.p.zijlstra@chello.nl>
> AuthorDate: Mon, 14 Nov 2011 10:03:25 +0100
> Committer:  Ingo Molnar <mingo@elte.hu>
> CommitDate: Mon, 14 Nov 2011 13:03:44 +0100
> 
> perf/x86: Enable raw event access to Intel offcore events

awesome.  You can add a "Whined-about-it-at-great-length: Vince Weaver"
line to the commit logs if you want.  Now I'll have to find something else 
to complain about, probably the scedulability issue introduced with the
NMI watchdog.

Thanks,

Vince
vweaver1@eecs.utk.edu


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [tip:perf/urgent] perf/x86: Enable raw event access to Intel offcore events
  2011-11-21 18:24   ` Vince Weaver
@ 2011-11-21 18:52     ` Peter Zijlstra
  2011-11-21 19:01       ` Vince Weaver
  0 siblings, 1 reply; 17+ messages in thread
From: Peter Zijlstra @ 2011-11-21 18:52 UTC (permalink / raw)
  To: Vince Weaver; +Cc: mingo, hpa, eranian, linux-kernel, tglx, mingo

On Mon, 2011-11-21 at 13:24 -0500, Vince Weaver wrote:
> On Fri, 18 Nov 2011, tip-bot for Peter Zijlstra wrote:
> 
> > Commit-ID:  ed13ec58bfe0d5dc95f748e6118432cb0fa283cb
> > Gitweb:     http://git.kernel.org/tip/ed13ec58bfe0d5dc95f748e6118432cb0fa283cb
> > Author:     Peter Zijlstra <a.p.zijlstra@chello.nl>
> > AuthorDate: Mon, 14 Nov 2011 10:03:25 +0100
> > Committer:  Ingo Molnar <mingo@elte.hu>
> > CommitDate: Mon, 14 Nov 2011 13:03:44 +0100
> > 
> > perf/x86: Enable raw event access to Intel offcore events
> 
> awesome.  You can add a "Whined-about-it-at-great-length: Vince Weaver"
> line to the commit logs if you want.  Now I'll have to find something else 
> to complain about, probably the scedulability issue introduced with the
> NMI watchdog.

:-)

But you can disable that thing, right?

echo 0 > /proc/sys/kernel/nmi_watchdog;

or so should do the trick.



^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [tip:perf/urgent] perf/x86: Enable raw event access to Intel offcore events
  2011-11-21 18:52     ` Peter Zijlstra
@ 2011-11-21 19:01       ` Vince Weaver
  2011-11-21 19:04         ` Stephane Eranian
  0 siblings, 1 reply; 17+ messages in thread
From: Vince Weaver @ 2011-11-21 19:01 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: mingo, hpa, eranian, linux-kernel, tglx, mingo

On Mon, 21 Nov 2011, Peter Zijlstra wrote:
> On Mon, 2011-11-21 at 13:24 -0500, Vince Weaver wrote:

> > Now I'll have to find something else 
> > to complain about, probably the scedulability issue introduced with the
> > NMI watchdog.
> 
> But you can disable that thing, right?
> 
> echo 0 > /proc/sys/kernel/nmi_watchdog;
> 
> or so should do the trick.

yes, if the user running PAPI has root.

That's unfortunately not always the case, and sometimes people want to 
have both the NMI watchdog operating and use PAPI simultaneously.

We have a workaround, but it currently disables kernel multiplexing and a 
few other nice features.

Vince
vweaver1@eecs.utk.edu


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [tip:perf/urgent] perf/x86: Enable raw event access to Intel offcore events
  2011-11-21 19:01       ` Vince Weaver
@ 2011-11-21 19:04         ` Stephane Eranian
  2011-11-21 21:04           ` Vince Weaver
  0 siblings, 1 reply; 17+ messages in thread
From: Stephane Eranian @ 2011-11-21 19:04 UTC (permalink / raw)
  To: Vince Weaver; +Cc: Peter Zijlstra, mingo, hpa, linux-kernel, tglx, mingo

Vince,

On Mon, Nov 21, 2011 at 8:01 PM, Vince Weaver <vweaver1@eecs.utk.edu> wrote:
> On Mon, 21 Nov 2011, Peter Zijlstra wrote:
>> On Mon, 2011-11-21 at 13:24 -0500, Vince Weaver wrote:
>
>> > Now I'll have to find something else
>> > to complain about, probably the scedulability issue introduced with the
>> > NMI watchdog.
>>
>> But you can disable that thing, right?
>>
>> echo 0 > /proc/sys/kernel/nmi_watchdog;
>>
>> or so should do the trick.
>
> yes, if the user running PAPI has root.
>
> That's unfortunately not always the case, and sometimes people want to
> have both the NMI watchdog operating and use PAPI simultaneously.
>

I don't understand why you have a problem with NMI watchdog. Multiplexing
allows you to still measure more events than there are counters.

> We have a workaround, but it currently disables kernel multiplexing and a
> few other nice features.
>
> Vince
> vweaver1@eecs.utk.edu
>
>

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [tip:perf/urgent] perf/x86: Enable raw event access to Intel offcore events
  2011-11-21 19:04         ` Stephane Eranian
@ 2011-11-21 21:04           ` Vince Weaver
  2011-11-21 21:39             ` Stephane Eranian
  2011-11-21 22:05             ` Peter Zijlstra
  0 siblings, 2 replies; 17+ messages in thread
From: Vince Weaver @ 2011-11-21 21:04 UTC (permalink / raw)
  To: Stephane Eranian; +Cc: Peter Zijlstra, mingo, hpa, linux-kernel, tglx, mingo

On Mon, 21 Nov 2011, Stephane Eranian wrote:

> > We have a workaround, but it currently disables kernel multiplexing and a
> > few other nice features.
> 
> I don't understand why you have a problem with NMI watchdog. Multiplexing
> allows you to still measure more events than there are counters.

The PAPI code currently uses FORMAT_GROUP and puts as many events as 
possible in a group.  The way we maximize events in a group is to
add events until perf_events indicates a failure.

When NMI watchdog is enabled, a counter is stolen.  Yet the perf_events
code does not account for this.

So say on an AMD machine with 4 counters (3 after one is stolen) 
perf_events lets you add 4 events to an event group, even though only 3 
are available.  It does not report failure upon open or start, only at 
read.  By then it's too late.

We have to work around this, by doing an extra read at open time to verify 
that the event group actually is valid, adding overhead.

Our multiplex code tries to maximize the number of events in a group too.
Currently PAPI works around this by just not doing kernel multiplexing
if a NMI watchdog is detected.  There's probably more elegant solutions 
such as checking with a read there too, or not using FORMAT_GROUP at all, 
but since this is an ABI regression I was hoping it would get fixed 
quickly enough that I wouldn't have to construct better workarounds.

Vince
vweaver1@eecs.utk.edu


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [tip:perf/urgent] perf/x86: Enable raw event access to Intel offcore events
  2011-11-21 21:04           ` Vince Weaver
@ 2011-11-21 21:39             ` Stephane Eranian
  2011-11-21 21:42               ` Peter Zijlstra
  2011-11-21 22:05             ` Peter Zijlstra
  1 sibling, 1 reply; 17+ messages in thread
From: Stephane Eranian @ 2011-11-21 21:39 UTC (permalink / raw)
  To: Vince Weaver; +Cc: Peter Zijlstra, mingo, hpa, linux-kernel, tglx, mingo

On Mon, Nov 21, 2011 at 10:04 PM, Vince Weaver <vweaver1@eecs.utk.edu> wrote:
> On Mon, 21 Nov 2011, Stephane Eranian wrote:
>
>> > We have a workaround, but it currently disables kernel multiplexing and a
>> > few other nice features.
>>
>> I don't understand why you have a problem with NMI watchdog. Multiplexing
>> allows you to still measure more events than there are counters.
>
> The PAPI code currently uses FORMAT_GROUP and puts as many events as
> possible in a group.  The way we maximize events in a group is to
> add events until perf_events indicates a failure.
>
Ok, now I understand your problem.

When you submit events (one by one) the kernel does a scheduling simulation
but only taking into account events in the group. The goal is to
verify that the group
is schedulable. It does not look at events from system-wide sessions
(which have priority).
Doing so would be difficult for per-thread because it would have to
look on all CPUs.
Even if it were to do that, that would be no guarantee of
schedulability  because
by the time the group is actually scheduled, you may now have
system-wide session, i.e.,
your group is not scheduled because there aren't enough counters
leftover from system-wide.

The problem you're describing is not specific to X86 and the NMI
watchdog. It applies to
all architectures and has to do with system-wide pinned events vs.
per-thread group schedulability.
The NMI watchdog on X86 is an example of that. But AFAIK, the watchdog
could be supported
on other architectures as well.

One could argue, you could check is NMI is active and assume you have
one fewer counters
during the scheduling simulation. But I think it is a bit more
complicated than that.

To solve this problem (in the general case), you need to know which
counter is taken (or required)
by ALL pinned system-wide events across all CPUs. Once you've
constructed the bitmap, you can
use it as the basis (used_mask) to try and schedule the group events.

That's a best effort algorithm. It does not provide the guarantee, the
group will be schedulable
during its entire lifetime. It simply makes it more likely it will
run, if you assume no changes on
system-wide pinned events from the moment you create the per-thread
group and the time is
starts counting.


> When NMI watchdog is enabled, a counter is stolen.  Yet the perf_events
> code does not account for this.
>
> So say on an AMD machine with 4 counters (3 after one is stolen)
> perf_events lets you add 4 events to an event group, even though only 3
> are available.  It does not report failure upon open or start, only at
> read.  By then it's too late.
>
> We have to work around this, by doing an extra read at open time to verify
> that the event group actually is valid, adding overhead.
>
> Our multiplex code tries to maximize the number of events in a group too.
> Currently PAPI works around this by just not doing kernel multiplexing
> if a NMI watchdog is detected.  There's probably more elegant solutions
> such as checking with a read there too, or not using FORMAT_GROUP at all,
> but since this is an ABI regression I was hoping it would get fixed
> quickly enough that I wouldn't have to construct better workarounds.
>
> Vince
> vweaver1@eecs.utk.edu
>
>

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [tip:perf/urgent] perf/x86: Enable raw event access to Intel offcore events
  2011-11-21 21:39             ` Stephane Eranian
@ 2011-11-21 21:42               ` Peter Zijlstra
  2011-11-21 21:45                 ` Stephane Eranian
  0 siblings, 1 reply; 17+ messages in thread
From: Peter Zijlstra @ 2011-11-21 21:42 UTC (permalink / raw)
  To: Stephane Eranian; +Cc: Vince Weaver, mingo, hpa, linux-kernel, tglx, mingo

On Mon, 2011-11-21 at 22:39 +0100, Stephane Eranian wrote:
> To solve this problem (in the general case), you need to know which
> counter is taken (or required)
> by ALL pinned system-wide events across all CPUs. Once you've
> constructed the bitmap, you can
> use it as the basis (used_mask) to try and schedule the group events.
> 
Its worse, you also need to consider all task-pinned events if you want
to be complete. That's an even worse problem.

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [tip:perf/urgent] perf/x86: Enable raw event access to Intel offcore events
  2011-11-21 21:42               ` Peter Zijlstra
@ 2011-11-21 21:45                 ` Stephane Eranian
  2011-11-21 21:48                   ` Peter Zijlstra
  0 siblings, 1 reply; 17+ messages in thread
From: Stephane Eranian @ 2011-11-21 21:45 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: Vince Weaver, mingo, hpa, linux-kernel, tglx, mingo

On Mon, Nov 21, 2011 at 10:42 PM, Peter Zijlstra <a.p.zijlstra@chello.nl> wrote:
> On Mon, 2011-11-21 at 22:39 +0100, Stephane Eranian wrote:
>> To solve this problem (in the general case), you need to know which
>> counter is taken (or required)
>> by ALL pinned system-wide events across all CPUs. Once you've
>> constructed the bitmap, you can
>> use it as the basis (used_mask) to try and schedule the group events.
>>
> Its worse, you also need to consider all task-pinned events if you want
> to be complete. That's an even worse problem.
>
But that you know because by the time you do the simulation you already
know the task the events are attached to. You can get to the list of events
already attached to the task. Again this is a best effort, it could well be
that someone else attaches pinned task events just after the simulation.

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [tip:perf/urgent] perf/x86: Enable raw event access to Intel offcore events
  2011-11-21 21:45                 ` Stephane Eranian
@ 2011-11-21 21:48                   ` Peter Zijlstra
  0 siblings, 0 replies; 17+ messages in thread
From: Peter Zijlstra @ 2011-11-21 21:48 UTC (permalink / raw)
  To: Stephane Eranian; +Cc: Vince Weaver, mingo, hpa, linux-kernel, tglx, mingo

On Mon, 2011-11-21 at 22:45 +0100, Stephane Eranian wrote:
> On Mon, Nov 21, 2011 at 10:42 PM, Peter Zijlstra <a.p.zijlstra@chello.nl> wrote:
> > On Mon, 2011-11-21 at 22:39 +0100, Stephane Eranian wrote:
> >> To solve this problem (in the general case), you need to know which
> >> counter is taken (or required)
> >> by ALL pinned system-wide events across all CPUs. Once you've
> >> constructed the bitmap, you can
> >> use it as the basis (used_mask) to try and schedule the group events.
> >>
> > Its worse, you also need to consider all task-pinned events if you want
> > to be complete. That's an even worse problem.
> >
> But that you know because by the time you do the simulation you already
> know the task the events are attached to. You can get to the list of events
> already attached to the task. Again this is a best effort, it could well be
> that someone else attaches pinned task events just after the simulation.

In the PAPI case yes, but what if you want to create a CPU-wide group,
then you need to consider all task-pinned events that could possibly
land on your CPU. Not fun.



^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [tip:perf/urgent] perf/x86: Enable raw event access to Intel offcore events
  2011-11-21 21:04           ` Vince Weaver
  2011-11-21 21:39             ` Stephane Eranian
@ 2011-11-21 22:05             ` Peter Zijlstra
  1 sibling, 0 replies; 17+ messages in thread
From: Peter Zijlstra @ 2011-11-21 22:05 UTC (permalink / raw)
  To: Vince Weaver; +Cc: Stephane Eranian, mingo, hpa, linux-kernel, tglx, mingo

On Mon, 2011-11-21 at 16:04 -0500, Vince Weaver wrote:

> The PAPI code currently uses FORMAT_GROUP and puts as many events as 
> possible in a group.  The way we maximize events in a group is to
> add events until perf_events indicates a failure.
> 
> When NMI watchdog is enabled, a counter is stolen.  Yet the perf_events
> code does not account for this.
> 
> So say on an AMD machine with 4 counters (3 after one is stolen) 
> perf_events lets you add 4 events to an event group, even though only 3 
> are available.  It does not report failure upon open or start, only at 
> read.  By then it's too late.
> 
> We have to work around this, by doing an extra read at open time to verify 
> that the event group actually is valid, adding overhead.

So strictly speaking this is a starvation case, we create a group that
requires the entire PMU but then run a higher priority event that won't
yield resulting in the group never getting scheduled.

Now fully agreed that starvation sucks, but I'm not exactly sure what to
do about it.

What isn't helping either is that all this scheduling code is per
arch :/

^ permalink raw reply	[flat|nested] 17+ messages in thread

end of thread, other threads:[~2011-11-21 22:07 UTC | newest]

Thread overview: 17+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2011-08-03 16:05 [perf] enable raw OFFCORE_EVENTS for non-perf userspace Vince Weaver
2011-08-03 17:00 ` Vince Weaver
2011-08-04 15:55 ` Peter Zijlstra
2011-08-05  2:24   ` Vince Weaver
2011-08-05  9:49   ` Ingo Molnar
2011-11-07 18:30   ` Vince Weaver
2011-11-18 23:34 ` [tip:perf/urgent] perf/x86: Enable raw event access to Intel offcore events tip-bot for Peter Zijlstra
2011-11-21 18:24   ` Vince Weaver
2011-11-21 18:52     ` Peter Zijlstra
2011-11-21 19:01       ` Vince Weaver
2011-11-21 19:04         ` Stephane Eranian
2011-11-21 21:04           ` Vince Weaver
2011-11-21 21:39             ` Stephane Eranian
2011-11-21 21:42               ` Peter Zijlstra
2011-11-21 21:45                 ` Stephane Eranian
2011-11-21 21:48                   ` Peter Zijlstra
2011-11-21 22:05             ` Peter Zijlstra

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.