* [perf] enable raw OFFCORE_EVENTS for non-perf userspace @ 2011-08-03 16:05 Vince Weaver 2011-08-03 17:00 ` Vince Weaver ` (2 more replies) 0 siblings, 3 replies; 17+ messages in thread From: Vince Weaver @ 2011-08-03 16:05 UTC (permalink / raw) To: linux-kernel Cc: Peter Zijlstra, Paul Mackerras, Ingo Molnar, Arnaldo Carvalho de Melo Hello I propose we just enable raw OFFCORE_EVENT support and get it over with. There is a lot of demand for this from PAPI users, and so we encourage them to apply the below patch. PAPI supports this out of the box. The current "block" against using this feature *DOES NOT WORK*. It silently fails if you try to use the config1 field to set it. Even worse, if some previous user has set the OFFCORE_RSP_0 msr (say by running "perf stat -e LLC-load-misses") then the msr *stays set* and if you try to set the config1 field on your own it looks like it worked, but instead it is using whatever value the kernel last used. So there's a lot of userspace confusion about this, and you can't even reliably tell if the feature is turned off or not because it fails silently in unpredictable ways. Thanks Vince vweaver1@eecs.utk.edu diff --git a/arch/x86/kernel/cpu/perf_event.c b/arch/x86/kernel/cpu/perf_event.c index 4ee3abf..28f9ca9 100644 --- a/arch/x86/kernel/cpu/perf_event.c +++ b/arch/x86/kernel/cpu/perf_event.c @@ -604,12 +604,8 @@ static int x86_setup_perfctr(struct perf_event *event) return -EOPNOTSUPP; } - /* - * Do not allow config1 (extended registers) to propagate, - * there's no sane user-space generalization yet: - */ if (attr->type == PERF_TYPE_RAW) - return 0; + return x86_pmu_extra_regs(event->attr.config, event); if (attr->type == PERF_TYPE_HW_CACHE) return set_ext_hw_attr(hwc, event); ^ permalink raw reply related [flat|nested] 17+ messages in thread
* Re: [perf] enable raw OFFCORE_EVENTS for non-perf userspace 2011-08-03 16:05 [perf] enable raw OFFCORE_EVENTS for non-perf userspace Vince Weaver @ 2011-08-03 17:00 ` Vince Weaver 2011-08-04 15:55 ` Peter Zijlstra 2011-11-18 23:34 ` [tip:perf/urgent] perf/x86: Enable raw event access to Intel offcore events tip-bot for Peter Zijlstra 2 siblings, 0 replies; 17+ messages in thread From: Vince Weaver @ 2011-08-03 17:00 UTC (permalink / raw) To: linux-kernel Cc: Peter Zijlstra, Paul Mackerras, Ingo Molnar, Arnaldo Carvalho de Melo and in case we don't want to be nice to userspace and want to brutally suppress anyone not using the "perf" tool, here's a patch that returns EINVAL for any raw access that sets config1. I actually tested to make sure the patch doesn't break PAPI, unlike my last attempt. Currently we "block" access to setting the config1 value from userspace by silently failing. This can cause unexpected behavior (especially if the kernel is using the OFFCORE_RSP_0 msr internally) and also makes it impossible to test if OFFCORE_RESPONSE events are being blocked or not. This hackish patch causes an EINVAL to propogate back to userspace by setting an invalid value to the config1 field. I personally think this patch is a step backward, and the right thing to do is give users full access to their own hardware. But the current state of affairs is bad for tools like PAPI too. diff --git a/arch/x86/kernel/cpu/perf_event.c b/arch/x86/kernel/cpu/perf_event.c index 4ee3abf..72baf25 100644 --- a/arch/x86/kernel/cpu/perf_event.c +++ b/arch/x86/kernel/cpu/perf_event.c @@ -608,8 +608,11 @@ static int x86_setup_perfctr(struct perf_event *event) * Do not allow config1 (extended registers) to propagate, * there's no sane user-space generalization yet: */ - if (attr->type == PERF_TYPE_RAW) - return 0; + if (attr->type == PERF_TYPE_RAW) { + /* I HATE THIS PATCH AND EVERYTHING IT STANDS FOR. -- VMW */ + event->attr.config1=0xffffffff; /* invalid, will be caught */ + return x86_pmu_extra_regs(event->attr.config, event); + } if (attr->type == PERF_TYPE_HW_CACHE) return set_ext_hw_attr(hwc, event); ^ permalink raw reply related [flat|nested] 17+ messages in thread
* Re: [perf] enable raw OFFCORE_EVENTS for non-perf userspace 2011-08-03 16:05 [perf] enable raw OFFCORE_EVENTS for non-perf userspace Vince Weaver 2011-08-03 17:00 ` Vince Weaver @ 2011-08-04 15:55 ` Peter Zijlstra 2011-08-05 2:24 ` Vince Weaver ` (2 more replies) 2011-11-18 23:34 ` [tip:perf/urgent] perf/x86: Enable raw event access to Intel offcore events tip-bot for Peter Zijlstra 2 siblings, 3 replies; 17+ messages in thread From: Peter Zijlstra @ 2011-08-04 15:55 UTC (permalink / raw) To: Vince Weaver Cc: linux-kernel, Paul Mackerras, Ingo Molnar, Arnaldo Carvalho de Melo On Wed, 2011-08-03 at 12:05 -0400, Vince Weaver wrote: > Hello > > I propose we just enable raw OFFCORE_EVENT support and get it over with. > > There is a lot of demand for this from PAPI users, and so we encourage > them to apply the below patch. PAPI supports this out of the box. > > The current "block" against using this feature *DOES NOT WORK*. > It silently fails if you try to use the config1 field to set it. > > Even worse, if some previous user has set the OFFCORE_RSP_0 msr > (say by running "perf stat -e LLC-load-misses") then > the msr *stays set* and if you try to set the config1 field on your own > it looks like it worked, but instead it is using whatever value the > kernel last used. > > So there's a lot of userspace confusion about this, and you can't even > reliably tell if the feature is turned off or not because it fails > silently in unpredictable ways. > diff --git a/arch/x86/kernel/cpu/perf_event.c b/arch/x86/kernel/cpu/perf_event.c > index 4ee3abf..28f9ca9 100644 > --- a/arch/x86/kernel/cpu/perf_event.c > +++ b/arch/x86/kernel/cpu/perf_event.c > @@ -604,12 +604,8 @@ static int x86_setup_perfctr(struct perf_event *event) > return -EOPNOTSUPP; > } > > - /* > - * Do not allow config1 (extended registers) to propagate, > - * there's no sane user-space generalization yet: > - */ > if (attr->type == PERF_TYPE_RAW) > - return 0; > + return x86_pmu_extra_regs(event->attr.config, event); > > if (attr->type == PERF_TYPE_HW_CACHE) > return set_ext_hw_attr(hwc, event); I'm inclined to merge this, aside from snb, the offcore stuff is actually quite usable now. Ingo can we somehow persuade you? Anybody who knows how to program the snb offcore please tell. I mean we have all the code to poke at the right registers, and the SDM lists all the various bits that go where and a few constraints on how to combine said bits, but I've really no idea what any of it means. ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [perf] enable raw OFFCORE_EVENTS for non-perf userspace 2011-08-04 15:55 ` Peter Zijlstra @ 2011-08-05 2:24 ` Vince Weaver 2011-08-05 9:49 ` Ingo Molnar 2011-11-07 18:30 ` Vince Weaver 2 siblings, 0 replies; 17+ messages in thread From: Vince Weaver @ 2011-08-05 2:24 UTC (permalink / raw) To: Peter Zijlstra Cc: linux-kernel, Paul Mackerras, Ingo Molnar, Arnaldo Carvalho de Melo, Stephane Eranian On Thu, 4 Aug 2011, Peter Zijlstra wrote: > On Wed, 2011-08-03 at 12:05 -0400, Vince Weaver wrote: > > > diff --git a/arch/x86/kernel/cpu/perf_event.c b/arch/x86/kernel/cpu/perf_event.c > > index 4ee3abf..28f9ca9 100644 > > --- a/arch/x86/kernel/cpu/perf_event.c > > +++ b/arch/x86/kernel/cpu/perf_event.c > > @@ -604,12 +604,8 @@ static int x86_setup_perfctr(struct perf_event *event) > > return -EOPNOTSUPP; > > } > > > > - /* > > - * Do not allow config1 (extended registers) to propagate, > > - * there's no sane user-space generalization yet: > > - */ > > if (attr->type == PERF_TYPE_RAW) > > - return 0; > > + return x86_pmu_extra_regs(event->attr.config, event); > > > > if (attr->type == PERF_TYPE_HW_CACHE) > > return set_ext_hw_attr(hwc, event); > > > I'm inclined to merge this, aside from snb, the offcore stuff is > actually quite usable now. Ingo can we somehow persuade you? This would make me happy. The patch is more or less a revert of the patch that disabled support in the first place, but if you need my signed-off-by (I forgot it before) here it is Signed-off-by: Vince Weaver <vweaver1@eecs.utk.edu> > Anybody who knows how to program the snb offcore please tell. I mean we > have all the code to poke at the right registers, and the SDM lists all > the various bits that go where and a few constraints on how to combine > said bits, but I've really no idea what any of it means. I've added Stephane as I think he did some work on SandyBridge offcore. Vince ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [perf] enable raw OFFCORE_EVENTS for non-perf userspace 2011-08-04 15:55 ` Peter Zijlstra 2011-08-05 2:24 ` Vince Weaver @ 2011-08-05 9:49 ` Ingo Molnar 2011-11-07 18:30 ` Vince Weaver 2 siblings, 0 replies; 17+ messages in thread From: Ingo Molnar @ 2011-08-05 9:49 UTC (permalink / raw) To: Peter Zijlstra Cc: Vince Weaver, linux-kernel, Paul Mackerras, Arnaldo Carvalho de Melo, Arnaldo Carvalho de Melo, Frédéric Weisbecker * Peter Zijlstra <a.p.zijlstra@chello.nl> wrote: > On Wed, 2011-08-03 at 12:05 -0400, Vince Weaver wrote: > > Hello > > > > I propose we just enable raw OFFCORE_EVENT support and get it over with. > > > > There is a lot of demand for this from PAPI users, and so we encourage > > them to apply the below patch. PAPI supports this out of the box. > > > > The current "block" against using this feature *DOES NOT WORK*. > > It silently fails if you try to use the config1 field to set it. > > > > Even worse, if some previous user has set the OFFCORE_RSP_0 msr > > (say by running "perf stat -e LLC-load-misses") then > > the msr *stays set* and if you try to set the config1 field on your own > > it looks like it worked, but instead it is using whatever value the > > kernel last used. > > > > So there's a lot of userspace confusion about this, and you can't even > > reliably tell if the feature is turned off or not because it fails > > silently in unpredictable ways. > > > > diff --git a/arch/x86/kernel/cpu/perf_event.c b/arch/x86/kernel/cpu/perf_event.c > > index 4ee3abf..28f9ca9 100644 > > --- a/arch/x86/kernel/cpu/perf_event.c > > +++ b/arch/x86/kernel/cpu/perf_event.c > > @@ -604,12 +604,8 @@ static int x86_setup_perfctr(struct perf_event *event) > > return -EOPNOTSUPP; > > } > > > > - /* > > - * Do not allow config1 (extended registers) to propagate, > > - * there's no sane user-space generalization yet: > > - */ > > if (attr->type == PERF_TYPE_RAW) > > - return 0; > > + return x86_pmu_extra_regs(event->attr.config, event); > > > > if (attr->type == PERF_TYPE_HW_CACHE) > > return set_ext_hw_attr(hwc, event); > > > I'm inclined to merge this, aside from snb, the offcore stuff is > actually quite usable now. Ingo can we somehow persuade you? Sure, i think we are almost there, the only missing bit i see is to add the now generalized NUMA node bits to tools/perf/ so that it can be tested easily and then we can enable the raw bits as well. It was supposed to be added alongside of: 89d6c0b: perf, arch: Add generic NODE cache events what happened to that? If someone wants to accelerate the raw events enabling then that should be an easy thing to fix. Thanks, Ingo ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [perf] enable raw OFFCORE_EVENTS for non-perf userspace 2011-08-04 15:55 ` Peter Zijlstra 2011-08-05 2:24 ` Vince Weaver 2011-08-05 9:49 ` Ingo Molnar @ 2011-11-07 18:30 ` Vince Weaver 2 siblings, 0 replies; 17+ messages in thread From: Vince Weaver @ 2011-11-07 18:30 UTC (permalink / raw) To: Peter Zijlstra Cc: linux-kernel, Paul Mackerras, Ingo Molnar, Arnaldo Carvalho de Melo On Thu, 4 Aug 2011, Peter Zijlstra wrote: > On Wed, 2011-08-03 at 12:05 -0400, Vince Weaver wrote: > > Hello > > > > I propose we just enable raw OFFCORE_EVENT support and get it over with. > > > > There is a lot of demand for this from PAPI users, and so we encourage > > them to apply the below patch. PAPI supports this out of the box. > > > > The current "block" against using this feature *DOES NOT WORK*. > > It silently fails if you try to use the config1 field to set it. > > > > Even worse, if some previous user has set the OFFCORE_RSP_0 msr > > (say by running "perf stat -e LLC-load-misses") then > > the msr *stays set* and if you try to set the config1 field on your own > > it looks like it worked, but instead it is using whatever value the > > kernel last used. > > > > So there's a lot of userspace confusion about this, and you can't even > > reliably tell if the feature is turned off or not because it fails > > silently in unpredictable ways. Is there any hope for either the patch that enables raw offcore_response, or the other patch that properly disables it? Currently as mentioned the OFFCORE_RSP_0 msr value is leaked if a generalized event is using offcore response, thus it is not possible to properly detect availability of raw offcore response. Since this MSR leakage has been in at least two released kernels now, it's tempting to assume it's the published ABI and start hacking up PAPI so it can use it as a backdoor to enable RAW offcore response without needing the kernel patches. Vince ^ permalink raw reply [flat|nested] 17+ messages in thread
* [tip:perf/urgent] perf/x86: Enable raw event access to Intel offcore events 2011-08-03 16:05 [perf] enable raw OFFCORE_EVENTS for non-perf userspace Vince Weaver 2011-08-03 17:00 ` Vince Weaver 2011-08-04 15:55 ` Peter Zijlstra @ 2011-11-18 23:34 ` tip-bot for Peter Zijlstra 2011-11-21 18:24 ` Vince Weaver 2 siblings, 1 reply; 17+ messages in thread From: tip-bot for Peter Zijlstra @ 2011-11-18 23:34 UTC (permalink / raw) To: linux-tip-commits Cc: linux-kernel, eranian, hpa, mingo, a.p.zijlstra, tglx, mingo Commit-ID: ed13ec58bfe0d5dc95f748e6118432cb0fa283cb Gitweb: http://git.kernel.org/tip/ed13ec58bfe0d5dc95f748e6118432cb0fa283cb Author: Peter Zijlstra <a.p.zijlstra@chello.nl> AuthorDate: Mon, 14 Nov 2011 10:03:25 +0100 Committer: Ingo Molnar <mingo@elte.hu> CommitDate: Mon, 14 Nov 2011 13:03:44 +0100 perf/x86: Enable raw event access to Intel offcore events Now that the core offcore support is fixed up (thanks Stephane) and we have sane generic events utilizing them, re-enable the raw access to the feature as well. Note that it doesn't matter if you use event 0x1b7 or 0x1bb to specify an offcore event, either one works and neither guarantees you'll end up on a particular offcore MSR. Based on original patch from: Vince Weaver <vweaver1@eecs.utk.edu>. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Vince Weaver <vweaver1@eecs.utk.edu>. Cc: Stephane Eranian <eranian@google.com> Link: http://lkml.kernel.org/r/alpine.DEB.2.00.1108031200390.703@cl320.eecs.utk.edu Signed-off-by: Ingo Molnar <mingo@elte.hu> --- arch/x86/kernel/cpu/perf_event.c | 6 +----- 1 files changed, 1 insertions(+), 5 deletions(-) diff --git a/arch/x86/kernel/cpu/perf_event.c b/arch/x86/kernel/cpu/perf_event.c index ff0e8d4..2bda212 100644 --- a/arch/x86/kernel/cpu/perf_event.c +++ b/arch/x86/kernel/cpu/perf_event.c @@ -312,12 +312,8 @@ int x86_setup_perfctr(struct perf_event *event) return -EOPNOTSUPP; } - /* - * Do not allow config1 (extended registers) to propagate, - * there's no sane user-space generalization yet: - */ if (attr->type == PERF_TYPE_RAW) - return 0; + return x86_pmu_extra_regs(event->attr.config, event); if (attr->type == PERF_TYPE_HW_CACHE) return set_ext_hw_attr(hwc, event); ^ permalink raw reply related [flat|nested] 17+ messages in thread
* Re: [tip:perf/urgent] perf/x86: Enable raw event access to Intel offcore events 2011-11-18 23:34 ` [tip:perf/urgent] perf/x86: Enable raw event access to Intel offcore events tip-bot for Peter Zijlstra @ 2011-11-21 18:24 ` Vince Weaver 2011-11-21 18:52 ` Peter Zijlstra 0 siblings, 1 reply; 17+ messages in thread From: Vince Weaver @ 2011-11-21 18:24 UTC (permalink / raw) To: mingo, hpa, eranian, linux-kernel, a.p.zijlstra, tglx, mingo On Fri, 18 Nov 2011, tip-bot for Peter Zijlstra wrote: > Commit-ID: ed13ec58bfe0d5dc95f748e6118432cb0fa283cb > Gitweb: http://git.kernel.org/tip/ed13ec58bfe0d5dc95f748e6118432cb0fa283cb > Author: Peter Zijlstra <a.p.zijlstra@chello.nl> > AuthorDate: Mon, 14 Nov 2011 10:03:25 +0100 > Committer: Ingo Molnar <mingo@elte.hu> > CommitDate: Mon, 14 Nov 2011 13:03:44 +0100 > > perf/x86: Enable raw event access to Intel offcore events awesome. You can add a "Whined-about-it-at-great-length: Vince Weaver" line to the commit logs if you want. Now I'll have to find something else to complain about, probably the scedulability issue introduced with the NMI watchdog. Thanks, Vince vweaver1@eecs.utk.edu ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [tip:perf/urgent] perf/x86: Enable raw event access to Intel offcore events 2011-11-21 18:24 ` Vince Weaver @ 2011-11-21 18:52 ` Peter Zijlstra 2011-11-21 19:01 ` Vince Weaver 0 siblings, 1 reply; 17+ messages in thread From: Peter Zijlstra @ 2011-11-21 18:52 UTC (permalink / raw) To: Vince Weaver; +Cc: mingo, hpa, eranian, linux-kernel, tglx, mingo On Mon, 2011-11-21 at 13:24 -0500, Vince Weaver wrote: > On Fri, 18 Nov 2011, tip-bot for Peter Zijlstra wrote: > > > Commit-ID: ed13ec58bfe0d5dc95f748e6118432cb0fa283cb > > Gitweb: http://git.kernel.org/tip/ed13ec58bfe0d5dc95f748e6118432cb0fa283cb > > Author: Peter Zijlstra <a.p.zijlstra@chello.nl> > > AuthorDate: Mon, 14 Nov 2011 10:03:25 +0100 > > Committer: Ingo Molnar <mingo@elte.hu> > > CommitDate: Mon, 14 Nov 2011 13:03:44 +0100 > > > > perf/x86: Enable raw event access to Intel offcore events > > awesome. You can add a "Whined-about-it-at-great-length: Vince Weaver" > line to the commit logs if you want. Now I'll have to find something else > to complain about, probably the scedulability issue introduced with the > NMI watchdog. :-) But you can disable that thing, right? echo 0 > /proc/sys/kernel/nmi_watchdog; or so should do the trick. ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [tip:perf/urgent] perf/x86: Enable raw event access to Intel offcore events 2011-11-21 18:52 ` Peter Zijlstra @ 2011-11-21 19:01 ` Vince Weaver 2011-11-21 19:04 ` Stephane Eranian 0 siblings, 1 reply; 17+ messages in thread From: Vince Weaver @ 2011-11-21 19:01 UTC (permalink / raw) To: Peter Zijlstra; +Cc: mingo, hpa, eranian, linux-kernel, tglx, mingo On Mon, 21 Nov 2011, Peter Zijlstra wrote: > On Mon, 2011-11-21 at 13:24 -0500, Vince Weaver wrote: > > Now I'll have to find something else > > to complain about, probably the scedulability issue introduced with the > > NMI watchdog. > > But you can disable that thing, right? > > echo 0 > /proc/sys/kernel/nmi_watchdog; > > or so should do the trick. yes, if the user running PAPI has root. That's unfortunately not always the case, and sometimes people want to have both the NMI watchdog operating and use PAPI simultaneously. We have a workaround, but it currently disables kernel multiplexing and a few other nice features. Vince vweaver1@eecs.utk.edu ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [tip:perf/urgent] perf/x86: Enable raw event access to Intel offcore events 2011-11-21 19:01 ` Vince Weaver @ 2011-11-21 19:04 ` Stephane Eranian 2011-11-21 21:04 ` Vince Weaver 0 siblings, 1 reply; 17+ messages in thread From: Stephane Eranian @ 2011-11-21 19:04 UTC (permalink / raw) To: Vince Weaver; +Cc: Peter Zijlstra, mingo, hpa, linux-kernel, tglx, mingo Vince, On Mon, Nov 21, 2011 at 8:01 PM, Vince Weaver <vweaver1@eecs.utk.edu> wrote: > On Mon, 21 Nov 2011, Peter Zijlstra wrote: >> On Mon, 2011-11-21 at 13:24 -0500, Vince Weaver wrote: > >> > Now I'll have to find something else >> > to complain about, probably the scedulability issue introduced with the >> > NMI watchdog. >> >> But you can disable that thing, right? >> >> echo 0 > /proc/sys/kernel/nmi_watchdog; >> >> or so should do the trick. > > yes, if the user running PAPI has root. > > That's unfortunately not always the case, and sometimes people want to > have both the NMI watchdog operating and use PAPI simultaneously. > I don't understand why you have a problem with NMI watchdog. Multiplexing allows you to still measure more events than there are counters. > We have a workaround, but it currently disables kernel multiplexing and a > few other nice features. > > Vince > vweaver1@eecs.utk.edu > > ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [tip:perf/urgent] perf/x86: Enable raw event access to Intel offcore events 2011-11-21 19:04 ` Stephane Eranian @ 2011-11-21 21:04 ` Vince Weaver 2011-11-21 21:39 ` Stephane Eranian 2011-11-21 22:05 ` Peter Zijlstra 0 siblings, 2 replies; 17+ messages in thread From: Vince Weaver @ 2011-11-21 21:04 UTC (permalink / raw) To: Stephane Eranian; +Cc: Peter Zijlstra, mingo, hpa, linux-kernel, tglx, mingo On Mon, 21 Nov 2011, Stephane Eranian wrote: > > We have a workaround, but it currently disables kernel multiplexing and a > > few other nice features. > > I don't understand why you have a problem with NMI watchdog. Multiplexing > allows you to still measure more events than there are counters. The PAPI code currently uses FORMAT_GROUP and puts as many events as possible in a group. The way we maximize events in a group is to add events until perf_events indicates a failure. When NMI watchdog is enabled, a counter is stolen. Yet the perf_events code does not account for this. So say on an AMD machine with 4 counters (3 after one is stolen) perf_events lets you add 4 events to an event group, even though only 3 are available. It does not report failure upon open or start, only at read. By then it's too late. We have to work around this, by doing an extra read at open time to verify that the event group actually is valid, adding overhead. Our multiplex code tries to maximize the number of events in a group too. Currently PAPI works around this by just not doing kernel multiplexing if a NMI watchdog is detected. There's probably more elegant solutions such as checking with a read there too, or not using FORMAT_GROUP at all, but since this is an ABI regression I was hoping it would get fixed quickly enough that I wouldn't have to construct better workarounds. Vince vweaver1@eecs.utk.edu ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [tip:perf/urgent] perf/x86: Enable raw event access to Intel offcore events 2011-11-21 21:04 ` Vince Weaver @ 2011-11-21 21:39 ` Stephane Eranian 2011-11-21 21:42 ` Peter Zijlstra 2011-11-21 22:05 ` Peter Zijlstra 1 sibling, 1 reply; 17+ messages in thread From: Stephane Eranian @ 2011-11-21 21:39 UTC (permalink / raw) To: Vince Weaver; +Cc: Peter Zijlstra, mingo, hpa, linux-kernel, tglx, mingo On Mon, Nov 21, 2011 at 10:04 PM, Vince Weaver <vweaver1@eecs.utk.edu> wrote: > On Mon, 21 Nov 2011, Stephane Eranian wrote: > >> > We have a workaround, but it currently disables kernel multiplexing and a >> > few other nice features. >> >> I don't understand why you have a problem with NMI watchdog. Multiplexing >> allows you to still measure more events than there are counters. > > The PAPI code currently uses FORMAT_GROUP and puts as many events as > possible in a group. The way we maximize events in a group is to > add events until perf_events indicates a failure. > Ok, now I understand your problem. When you submit events (one by one) the kernel does a scheduling simulation but only taking into account events in the group. The goal is to verify that the group is schedulable. It does not look at events from system-wide sessions (which have priority). Doing so would be difficult for per-thread because it would have to look on all CPUs. Even if it were to do that, that would be no guarantee of schedulability because by the time the group is actually scheduled, you may now have system-wide session, i.e., your group is not scheduled because there aren't enough counters leftover from system-wide. The problem you're describing is not specific to X86 and the NMI watchdog. It applies to all architectures and has to do with system-wide pinned events vs. per-thread group schedulability. The NMI watchdog on X86 is an example of that. But AFAIK, the watchdog could be supported on other architectures as well. One could argue, you could check is NMI is active and assume you have one fewer counters during the scheduling simulation. But I think it is a bit more complicated than that. To solve this problem (in the general case), you need to know which counter is taken (or required) by ALL pinned system-wide events across all CPUs. Once you've constructed the bitmap, you can use it as the basis (used_mask) to try and schedule the group events. That's a best effort algorithm. It does not provide the guarantee, the group will be schedulable during its entire lifetime. It simply makes it more likely it will run, if you assume no changes on system-wide pinned events from the moment you create the per-thread group and the time is starts counting. > When NMI watchdog is enabled, a counter is stolen. Yet the perf_events > code does not account for this. > > So say on an AMD machine with 4 counters (3 after one is stolen) > perf_events lets you add 4 events to an event group, even though only 3 > are available. It does not report failure upon open or start, only at > read. By then it's too late. > > We have to work around this, by doing an extra read at open time to verify > that the event group actually is valid, adding overhead. > > Our multiplex code tries to maximize the number of events in a group too. > Currently PAPI works around this by just not doing kernel multiplexing > if a NMI watchdog is detected. There's probably more elegant solutions > such as checking with a read there too, or not using FORMAT_GROUP at all, > but since this is an ABI regression I was hoping it would get fixed > quickly enough that I wouldn't have to construct better workarounds. > > Vince > vweaver1@eecs.utk.edu > > ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [tip:perf/urgent] perf/x86: Enable raw event access to Intel offcore events 2011-11-21 21:39 ` Stephane Eranian @ 2011-11-21 21:42 ` Peter Zijlstra 2011-11-21 21:45 ` Stephane Eranian 0 siblings, 1 reply; 17+ messages in thread From: Peter Zijlstra @ 2011-11-21 21:42 UTC (permalink / raw) To: Stephane Eranian; +Cc: Vince Weaver, mingo, hpa, linux-kernel, tglx, mingo On Mon, 2011-11-21 at 22:39 +0100, Stephane Eranian wrote: > To solve this problem (in the general case), you need to know which > counter is taken (or required) > by ALL pinned system-wide events across all CPUs. Once you've > constructed the bitmap, you can > use it as the basis (used_mask) to try and schedule the group events. > Its worse, you also need to consider all task-pinned events if you want to be complete. That's an even worse problem. ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [tip:perf/urgent] perf/x86: Enable raw event access to Intel offcore events 2011-11-21 21:42 ` Peter Zijlstra @ 2011-11-21 21:45 ` Stephane Eranian 2011-11-21 21:48 ` Peter Zijlstra 0 siblings, 1 reply; 17+ messages in thread From: Stephane Eranian @ 2011-11-21 21:45 UTC (permalink / raw) To: Peter Zijlstra; +Cc: Vince Weaver, mingo, hpa, linux-kernel, tglx, mingo On Mon, Nov 21, 2011 at 10:42 PM, Peter Zijlstra <a.p.zijlstra@chello.nl> wrote: > On Mon, 2011-11-21 at 22:39 +0100, Stephane Eranian wrote: >> To solve this problem (in the general case), you need to know which >> counter is taken (or required) >> by ALL pinned system-wide events across all CPUs. Once you've >> constructed the bitmap, you can >> use it as the basis (used_mask) to try and schedule the group events. >> > Its worse, you also need to consider all task-pinned events if you want > to be complete. That's an even worse problem. > But that you know because by the time you do the simulation you already know the task the events are attached to. You can get to the list of events already attached to the task. Again this is a best effort, it could well be that someone else attaches pinned task events just after the simulation. ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [tip:perf/urgent] perf/x86: Enable raw event access to Intel offcore events 2011-11-21 21:45 ` Stephane Eranian @ 2011-11-21 21:48 ` Peter Zijlstra 0 siblings, 0 replies; 17+ messages in thread From: Peter Zijlstra @ 2011-11-21 21:48 UTC (permalink / raw) To: Stephane Eranian; +Cc: Vince Weaver, mingo, hpa, linux-kernel, tglx, mingo On Mon, 2011-11-21 at 22:45 +0100, Stephane Eranian wrote: > On Mon, Nov 21, 2011 at 10:42 PM, Peter Zijlstra <a.p.zijlstra@chello.nl> wrote: > > On Mon, 2011-11-21 at 22:39 +0100, Stephane Eranian wrote: > >> To solve this problem (in the general case), you need to know which > >> counter is taken (or required) > >> by ALL pinned system-wide events across all CPUs. Once you've > >> constructed the bitmap, you can > >> use it as the basis (used_mask) to try and schedule the group events. > >> > > Its worse, you also need to consider all task-pinned events if you want > > to be complete. That's an even worse problem. > > > But that you know because by the time you do the simulation you already > know the task the events are attached to. You can get to the list of events > already attached to the task. Again this is a best effort, it could well be > that someone else attaches pinned task events just after the simulation. In the PAPI case yes, but what if you want to create a CPU-wide group, then you need to consider all task-pinned events that could possibly land on your CPU. Not fun. ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [tip:perf/urgent] perf/x86: Enable raw event access to Intel offcore events 2011-11-21 21:04 ` Vince Weaver 2011-11-21 21:39 ` Stephane Eranian @ 2011-11-21 22:05 ` Peter Zijlstra 1 sibling, 0 replies; 17+ messages in thread From: Peter Zijlstra @ 2011-11-21 22:05 UTC (permalink / raw) To: Vince Weaver; +Cc: Stephane Eranian, mingo, hpa, linux-kernel, tglx, mingo On Mon, 2011-11-21 at 16:04 -0500, Vince Weaver wrote: > The PAPI code currently uses FORMAT_GROUP and puts as many events as > possible in a group. The way we maximize events in a group is to > add events until perf_events indicates a failure. > > When NMI watchdog is enabled, a counter is stolen. Yet the perf_events > code does not account for this. > > So say on an AMD machine with 4 counters (3 after one is stolen) > perf_events lets you add 4 events to an event group, even though only 3 > are available. It does not report failure upon open or start, only at > read. By then it's too late. > > We have to work around this, by doing an extra read at open time to verify > that the event group actually is valid, adding overhead. So strictly speaking this is a starvation case, we create a group that requires the entire PMU but then run a higher priority event that won't yield resulting in the group never getting scheduled. Now fully agreed that starvation sucks, but I'm not exactly sure what to do about it. What isn't helping either is that all this scheduling code is per arch :/ ^ permalink raw reply [flat|nested] 17+ messages in thread
end of thread, other threads:[~2011-11-21 22:07 UTC | newest] Thread overview: 17+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2011-08-03 16:05 [perf] enable raw OFFCORE_EVENTS for non-perf userspace Vince Weaver 2011-08-03 17:00 ` Vince Weaver 2011-08-04 15:55 ` Peter Zijlstra 2011-08-05 2:24 ` Vince Weaver 2011-08-05 9:49 ` Ingo Molnar 2011-11-07 18:30 ` Vince Weaver 2011-11-18 23:34 ` [tip:perf/urgent] perf/x86: Enable raw event access to Intel offcore events tip-bot for Peter Zijlstra 2011-11-21 18:24 ` Vince Weaver 2011-11-21 18:52 ` Peter Zijlstra 2011-11-21 19:01 ` Vince Weaver 2011-11-21 19:04 ` Stephane Eranian 2011-11-21 21:04 ` Vince Weaver 2011-11-21 21:39 ` Stephane Eranian 2011-11-21 21:42 ` Peter Zijlstra 2011-11-21 21:45 ` Stephane Eranian 2011-11-21 21:48 ` Peter Zijlstra 2011-11-21 22:05 ` Peter Zijlstra
This is an external index of several public inboxes, see mirroring instructions on how to clone and mirror all data and code used by this external index.