linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* OProfile issues
@ 2007-06-12 15:02 Stephane Eranian
  2007-06-12 18:37 ` Chris Wright
                   ` (2 more replies)
  0 siblings, 3 replies; 28+ messages in thread
From: Stephane Eranian @ 2007-06-12 15:02 UTC (permalink / raw)
  To: oprofile-list; +Cc: wcohen, ak, perfmon, linux-kernel, levon

Hello,

I am working on perfmon2 to allow Oprofile and perfmon2 to co-exist
as suggested by Andi Kleen. I looked at the Oprofile init/shutdown
code and I am puzzled by several things which you might be able to
explain for me. I am looking at 2.6.22-rc3.

Here are the issues:

 * model->fill_in_addresses is called once for all CPUs
   on X86, it does more than just filling in the addresses,
   it also coordinates with the NMI watchdog by reserving
   registers via the reserve_*nmi() interface.

   The problem is that the release of the registers is done
   in model->shutdown() which happens to be executed on every
   CPU. So you end up releasing the registers too many times.
   This is *not* harmless once you start sharing the PMU with
   other subsystems given the way the allocator is designed.

 * allocate_msrs() allocates two tables per CPU. One for the
   counters, the other for the eventsel registers. But then
   nmi_setup() copies the cpu_msrs[0] into cpu_msrs[] of all
   other cpus. This operation overrides the cpu_msrs[].counters
   and cpu_msrs[].controls pointers for all CPUs but CPU0.
   But free_msrs() will free the same tables multiple times. This
   causes a kernel dump when you enable certain kernel debugging
   features. The fix is to copy the content of the counters and
   controls array, not the pointers.

 * the fill_in_addresses() callback for X86 invokes the NMI watchdog
   reserve_*_nmi() register allocation routines. This is done regardless
   of whether the NMI watchdog is active. When the NMI watchdog is not
   active, the allocator will satisfy the allocation for the first MSR
   of each type (counter or control), but then it will reject any
   request for the others. You end up working with a single
   counter/control register.

Are those known bugs?

-- 
-Stephane

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: OProfile issues
  2007-06-12 15:02 OProfile issues Stephane Eranian
@ 2007-06-12 18:37 ` Chris Wright
  2007-06-12 18:38 ` Chuck Ebbert
  2007-06-12 19:07 ` Björn Steinbrink
  2 siblings, 0 replies; 28+ messages in thread
From: Chris Wright @ 2007-06-12 18:37 UTC (permalink / raw)
  To: Stephane Eranian; +Cc: oprofile-list, wcohen, ak, perfmon, linux-kernel, levon

* Stephane Eranian (eranian@hpl.hp.com) wrote:
>  * allocate_msrs() allocates two tables per CPU. One for the
>    counters, the other for the eventsel registers. But then
>    nmi_setup() copies the cpu_msrs[0] into cpu_msrs[] of all
>    other cpus. This operation overrides the cpu_msrs[].counters
>    and cpu_msrs[].controls pointers for all CPUs but CPU0.
>    But free_msrs() will free the same tables multiple times. This
>    causes a kernel dump when you enable certain kernel debugging
>    features. The fix is to copy the content of the counters and
>    controls array, not the pointers.

This is fixed in 2.6.22-rc4 in commit 0939c17c7bcf1c838bea4445b80a6966809a438f

thanks,
-chris

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: OProfile issues
  2007-06-12 15:02 OProfile issues Stephane Eranian
  2007-06-12 18:37 ` Chris Wright
@ 2007-06-12 18:38 ` Chuck Ebbert
  2007-06-12 19:07 ` Björn Steinbrink
  2 siblings, 0 replies; 28+ messages in thread
From: Chuck Ebbert @ 2007-06-12 18:38 UTC (permalink / raw)
  To: eranian; +Cc: oprofile-list, wcohen, ak, perfmon, linux-kernel, levon

On 06/12/2007 11:02 AM, Stephane Eranian wrote:
>  * allocate_msrs() allocates two tables per CPU. One for the
>    counters, the other for the eventsel registers. But then
>    nmi_setup() copies the cpu_msrs[0] into cpu_msrs[] of all
>    other cpus. This operation overrides the cpu_msrs[].counters
>    and cpu_msrs[].controls pointers for all CPUs but CPU0.
>    But free_msrs() will free the same tables multiple times. This
>    causes a kernel dump when you enable certain kernel debugging
>    features. The fix is to copy the content of the counters and
>    controls array, not the pointers.
> 

How old is the kernel you are looking at?

http://git.kernel.org/git/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=0939c17c7bcf1c838bea4445b80a6966809a438f

...fixed this over 10 days ago.

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: OProfile issues
  2007-06-12 15:02 OProfile issues Stephane Eranian
  2007-06-12 18:37 ` Chris Wright
  2007-06-12 18:38 ` Chuck Ebbert
@ 2007-06-12 19:07 ` Björn Steinbrink
  2007-06-13  1:41   ` [PATCH] Separate performance counter reservation from nmi watchdog Björn Steinbrink
  2 siblings, 1 reply; 28+ messages in thread
From: Björn Steinbrink @ 2007-06-12 19:07 UTC (permalink / raw)
  To: Stephane Eranian; +Cc: oprofile-list, wcohen, ak, perfmon, linux-kernel, levon

On 2007.06.12 08:02:46 -0700, Stephane Eranian wrote:
> Hello,
> 
> I am working on perfmon2 to allow Oprofile and perfmon2 to co-exist
> as suggested by Andi Kleen. I looked at the Oprofile init/shutdown
> code and I am puzzled by several things which you might be able to
> explain for me. I am looking at 2.6.22-rc3.
> 
> Here are the issues:
> 
>  * model->fill_in_addresses is called once for all CPUs
>    on X86, it does more than just filling in the addresses,
>    it also coordinates with the NMI watchdog by reserving
>    registers via the reserve_*nmi() interface.
> 
>    The problem is that the release of the registers is done
>    in model->shutdown() which happens to be executed on every
>    CPU. So you end up releasing the registers too many times.
>    This is *not* harmless once you start sharing the PMU with
>    other subsystems given the way the allocator is designed.

Hm, currently it should be ok to move the call to model->shutdown() into
nmi_shutdown(), but you might want to instead set addr to 0 when the
register is released to still allow for per cpu actions in shutdown().

>  * allocate_msrs() allocates two tables per CPU. One for the
>    counters, the other for the eventsel registers. But then
>    nmi_setup() copies the cpu_msrs[0] into cpu_msrs[] of all
>    other cpus. This operation overrides the cpu_msrs[].counters
>    and cpu_msrs[].controls pointers for all CPUs but CPU0.
>    But free_msrs() will free the same tables multiple times. This
>    causes a kernel dump when you enable certain kernel debugging
>    features. The fix is to copy the content of the counters and
>    controls array, not the pointers.

This was fixed in commit 0939c17c7bcf1.

>  * the fill_in_addresses() callback for X86 invokes the NMI watchdog
>    reserve_*_nmi() register allocation routines. This is done regardless
>    of whether the NMI watchdog is active. When the NMI watchdog is not
>    active, the allocator will satisfy the allocation for the first MSR
>    of each type (counter or control), but then it will reject any
>    request for the others. You end up working with a single
>    counter/control register.

Hm, ouch. I'll try to move the reservation parts into a separate system.

Björn

^ permalink raw reply	[flat|nested] 28+ messages in thread

* [PATCH] Separate performance counter reservation from nmi watchdog
  2007-06-12 19:07 ` Björn Steinbrink
@ 2007-06-13  1:41   ` Björn Steinbrink
  2007-06-13 16:46     ` Björn Steinbrink
  0 siblings, 1 reply; 28+ messages in thread
From: Björn Steinbrink @ 2007-06-13  1:41 UTC (permalink / raw)
  To: Stephane Eranian, oprofile-list, wcohen, ak, perfmon,
	linux-kernel, levon, akpm

On 2007.06.12 21:07:30 +0200, Björn Steinbrink wrote:
> On 2007.06.12 08:02:46 -0700, Stephane Eranian wrote:
> >  * the fill_in_addresses() callback for X86 invokes the NMI watchdog
> >    reserve_*_nmi() register allocation routines. This is done regardless
> >    of whether the NMI watchdog is active. When the NMI watchdog is not
> >    active, the allocator will satisfy the allocation for the first MSR
> >    of each type (counter or control), but then it will reject any
> >    request for the others. You end up working with a single
> >    counter/control register.
> 
> Hm, ouch. I'll try to move the reservation parts into a separate system.

Ok, here's the first try. The performance counter reservation system has
been moved into its own file and separated from the nmi watchdog. Also,
the probing rules were relaxed a bit, as they were more restrictive in
the watchdog code than in the oprofile code.

The new code never allows to reserve a performance counter or event
selection register when the probing failed, instead of allowing one
random register to be reserved.

While moving the code around, I noticed that the PerfMon nmi watchdog
actually reserved the wrong MSRs, as the generic reservation function
always took the base register. As the respective attributes are no
longer used as base regs, we can now store the correct value there and
keep using the generic function.

Being unfamiliar with the kernel init process, I simply put the probing
call right before the nmi watchdog initialization, but that's probably
wrong (and dependent on local APIC on i386), so I'd be glad if someone
could point out a better location.

Thanks,
Björn


From: Björn Steinbrink <B.Steinbrink@gmx.de>

Separate the performance counter reservation system from the nmi
watchdog to allow usage of all performance counters even if the nmi
watchdog is not used.

Fixed the reservation of the wrong performance counter for the PerfMon
based nmi watchdog along the way.

Signed-off-by: Björn Steinbrink <B.Steinbrink@gmx.de>
---
diff --git a/arch/i386/kernel/apic.c b/arch/i386/kernel/apic.c
index 67824f3..88b74e3 100644
--- a/arch/i386/kernel/apic.c
+++ b/arch/i386/kernel/apic.c
@@ -1009,6 +1009,7 @@ void __devinit setup_local_APIC(void)
 	value |= (APIC_LVT_MASKED | LOCAL_TIMER_VECTOR);
 	apic_write_around(APIC_LVTT, value);
 
+	probe_performance_counters();
 	setup_apic_nmi_watchdog(NULL);
 	apic_pm_activate();
 }
diff --git a/arch/i386/kernel/cpu/Makefile b/arch/i386/kernel/cpu/Makefile
index 74f27a4..882364d 100644
--- a/arch/i386/kernel/cpu/Makefile
+++ b/arch/i386/kernel/cpu/Makefile
@@ -18,4 +18,4 @@ obj-$(CONFIG_X86_MCE)	+=	mcheck/
 obj-$(CONFIG_MTRR)	+= 	mtrr/
 obj-$(CONFIG_CPU_FREQ)	+=	cpufreq/
 
-obj-$(CONFIG_X86_LOCAL_APIC) += perfctr-watchdog.o
+obj-$(CONFIG_X86_LOCAL_APIC) += perfctr.o perfctr-watchdog.o
diff --git a/arch/i386/kernel/cpu/perfctr-watchdog.c b/arch/i386/kernel/cpu/perfctr-watchdog.c
index 2b04c8f..6187097 100644
--- a/arch/i386/kernel/cpu/perfctr-watchdog.c
+++ b/arch/i386/kernel/cpu/perfctr-watchdog.c
@@ -8,9 +8,7 @@
    Original code for K7/P6 written by Keith Owens */
 
 #include <linux/percpu.h>
-#include <linux/module.h>
 #include <linux/kernel.h>
-#include <linux/bitops.h>
 #include <linux/smp.h>
 #include <linux/nmi.h>
 #include <asm/apic.h>
@@ -36,105 +34,8 @@ struct wd_ops {
 
 static struct wd_ops *wd_ops;
 
-/* this number is calculated from Intel's MSR_P4_CRU_ESCR5 register and it's
- * offset from MSR_P4_BSU_ESCR0.  It will be the max for all platforms (for now)
- */
-#define NMI_MAX_COUNTER_BITS 66
-
-/* perfctr_nmi_owner tracks the ownership of the perfctr registers:
- * evtsel_nmi_owner tracks the ownership of the event selection
- * - different performance counters/ event selection may be reserved for
- *   different subsystems this reservation system just tries to coordinate
- *   things a little
- */
-static DECLARE_BITMAP(perfctr_nmi_owner, NMI_MAX_COUNTER_BITS);
-static DECLARE_BITMAP(evntsel_nmi_owner, NMI_MAX_COUNTER_BITS);
-
 static DEFINE_PER_CPU(struct nmi_watchdog_ctlblk, nmi_watchdog_ctlblk);
 
-/* converts an msr to an appropriate reservation bit */
-static inline unsigned int nmi_perfctr_msr_to_bit(unsigned int msr)
-{
-	return wd_ops ? msr - wd_ops->perfctr : 0;
-}
-
-/* converts an msr to an appropriate reservation bit */
-/* returns the bit offset of the event selection register */
-static inline unsigned int nmi_evntsel_msr_to_bit(unsigned int msr)
-{
-	return wd_ops ? msr - wd_ops->evntsel : 0;
-}
-
-/* checks for a bit availability (hack for oprofile) */
-int avail_to_resrv_perfctr_nmi_bit(unsigned int counter)
-{
-	BUG_ON(counter > NMI_MAX_COUNTER_BITS);
-
-	return (!test_bit(counter, perfctr_nmi_owner));
-}
-
-/* checks the an msr for availability */
-int avail_to_resrv_perfctr_nmi(unsigned int msr)
-{
-	unsigned int counter;
-
-	counter = nmi_perfctr_msr_to_bit(msr);
-	BUG_ON(counter > NMI_MAX_COUNTER_BITS);
-
-	return (!test_bit(counter, perfctr_nmi_owner));
-}
-
-int reserve_perfctr_nmi(unsigned int msr)
-{
-	unsigned int counter;
-
-	counter = nmi_perfctr_msr_to_bit(msr);
-	BUG_ON(counter > NMI_MAX_COUNTER_BITS);
-
-	if (!test_and_set_bit(counter, perfctr_nmi_owner))
-		return 1;
-	return 0;
-}
-
-void release_perfctr_nmi(unsigned int msr)
-{
-	unsigned int counter;
-
-	counter = nmi_perfctr_msr_to_bit(msr);
-	BUG_ON(counter > NMI_MAX_COUNTER_BITS);
-
-	clear_bit(counter, perfctr_nmi_owner);
-}
-
-int reserve_evntsel_nmi(unsigned int msr)
-{
-	unsigned int counter;
-
-	counter = nmi_evntsel_msr_to_bit(msr);
-	BUG_ON(counter > NMI_MAX_COUNTER_BITS);
-
-	if (!test_and_set_bit(counter, evntsel_nmi_owner))
-		return 1;
-	return 0;
-}
-
-void release_evntsel_nmi(unsigned int msr)
-{
-	unsigned int counter;
-
-	counter = nmi_evntsel_msr_to_bit(msr);
-	BUG_ON(counter > NMI_MAX_COUNTER_BITS);
-
-	clear_bit(counter, evntsel_nmi_owner);
-}
-
-EXPORT_SYMBOL(avail_to_resrv_perfctr_nmi);
-EXPORT_SYMBOL(avail_to_resrv_perfctr_nmi_bit);
-EXPORT_SYMBOL(reserve_perfctr_nmi);
-EXPORT_SYMBOL(release_perfctr_nmi);
-EXPORT_SYMBOL(reserve_evntsel_nmi);
-EXPORT_SYMBOL(release_evntsel_nmi);
-
 void disable_lapic_nmi_watchdog(void)
 {
 	BUG_ON(nmi_watchdog != NMI_LOCAL_APIC);
@@ -568,8 +469,8 @@ static struct wd_ops intel_arch_wd_ops = {
 	.setup = setup_intel_arch_watchdog,
 	.rearm = p6_rearm,
 	.stop = single_msr_stop_watchdog,
-	.perfctr = MSR_ARCH_PERFMON_PERFCTR0,
-	.evntsel = MSR_ARCH_PERFMON_EVENTSEL0,
+	.perfctr = MSR_ARCH_PERFMON_PERFCTR1,
+	.evntsel = MSR_ARCH_PERFMON_EVENTSEL1,
 };
 
 static void probe_nmi_watchdog(void)
diff --git a/arch/i386/kernel/cpu/perfctr.c b/arch/i386/kernel/cpu/perfctr.c
new file mode 100644
index 0000000..63e68a3
--- /dev/null
+++ b/arch/i386/kernel/cpu/perfctr.c
@@ -0,0 +1,175 @@
+#include <linux/bitops.h>
+#include <linux/module.h>
+#include <asm/msr-index.h>
+#include <asm/intel_arch_perfmon.h>
+
+struct perfctr_base_regs {
+	unsigned int perfctr;
+	unsigned int evntsel;
+};
+
+static struct perfctr_base_regs *perfctr_base_regs;
+
+static struct perfctr_base_regs k7_base_regs = {
+	.perfctr = MSR_K7_PERFCTR0,
+	.evntsel = MSR_K7_EVNTSEL0
+};
+
+static struct perfctr_base_regs p4_base_regs = {
+	.perfctr = MSR_P4_BPU_PERFCTR0,
+	.evntsel = MSR_P4_BSU_ESCR0
+};
+
+static struct perfctr_base_regs p6_base_regs = {
+	.perfctr = MSR_P6_PERFCTR0,
+	.evntsel = MSR_P6_EVNTSEL0
+};
+
+static struct perfctr_base_regs arch_perfmon_base_regs = {
+	.perfctr = MSR_ARCH_PERFMON_PERFCTR0,
+	.evntsel = MSR_ARCH_PERFMON_EVENTSEL0
+};
+
+/* this number is calculated from Intel's MSR_P4_CRU_ESCR5 register and it's
+ * offset from MSR_P4_BSU_ESCR0.  It will be the max for all platforms (for now)
+ */
+#define NMI_MAX_COUNTER_BITS 66
+
+/* perfctr_nmi_owner tracks the ownership of the perfctr registers:
+ * evtsel_nmi_owner tracks the ownership of the event selection
+ * - different performance counters/ event selection may be reserved for
+ *   different subsystems this reservation system just tries to coordinate
+ *   things a little
+ */
+static DECLARE_BITMAP(perfctr_nmi_owner, NMI_MAX_COUNTER_BITS);
+static DECLARE_BITMAP(evntsel_nmi_owner, NMI_MAX_COUNTER_BITS);
+
+void __devinit probe_performance_counters(void)
+{
+	switch (boot_cpu_data.x86_vendor) {
+	case X86_VENDOR_AMD:
+		if (boot_cpu_data.x86 != 6 && boot_cpu_data.x86 != 15 &&
+		    boot_cpu_data.x86 != 16)
+			return;
+
+		perfctr_base_regs = &k7_base_regs;
+		break;
+
+	case X86_VENDOR_INTEL:
+		if (cpu_has(&boot_cpu_data, X86_FEATURE_ARCH_PERFMON)) {
+			perfctr_base_regs = &arch_perfmon_base_regs;
+			break;
+		}
+		switch (boot_cpu_data.x86) {
+		case 6:
+			perfctr_base_regs = &p6_base_regs;
+			break;
+
+		case 15:
+			perfctr_base_regs = &p4_base_regs;
+			break;
+		}
+		break;
+	}
+}
+
+/* converts an msr to an appropriate reservation bit */
+static inline unsigned int nmi_perfctr_msr_to_bit(unsigned int msr)
+{
+	return msr - perfctr_base_regs->perfctr;
+}
+
+/* converts an msr to an appropriate reservation bit */
+/* returns the bit offset of the event selection register */
+static inline unsigned int nmi_evntsel_msr_to_bit(unsigned int msr)
+{
+	return msr - perfctr_base_regs->evntsel;
+}
+
+/* checks for a bit availability (hack for oprofile) */
+int avail_to_resrv_perfctr_nmi_bit(unsigned int counter)
+{
+	if (!perfctr_base_regs)
+		return 0;
+
+	BUG_ON(counter > NMI_MAX_COUNTER_BITS);
+
+	return (!test_bit(counter, perfctr_nmi_owner));
+}
+
+/* checks the an msr for availability */
+int avail_to_resrv_perfctr_nmi(unsigned int msr)
+{
+	unsigned int counter;
+
+	if (!perfctr_base_regs)
+		return 0;
+
+	counter = nmi_perfctr_msr_to_bit(msr);
+	BUG_ON(counter > NMI_MAX_COUNTER_BITS);
+
+	return (!test_bit(counter, perfctr_nmi_owner));
+}
+
+int reserve_perfctr_nmi(unsigned int msr)
+{
+	unsigned int counter;
+
+	if (!perfctr_base_regs)
+		return 0;
+
+	counter = nmi_perfctr_msr_to_bit(msr);
+	BUG_ON(counter > NMI_MAX_COUNTER_BITS);
+
+	if (!test_and_set_bit(counter, perfctr_nmi_owner))
+		return 1;
+	return 0;
+}
+
+void release_perfctr_nmi(unsigned int msr)
+{
+	unsigned int counter;
+
+	if (!perfctr_base_regs)
+		return 0;
+
+	counter = nmi_perfctr_msr_to_bit(msr);
+	BUG_ON(counter > NMI_MAX_COUNTER_BITS);
+
+	clear_bit(counter, perfctr_nmi_owner);
+}
+
+int reserve_evntsel_nmi(unsigned int msr)
+{
+	unsigned int counter;
+
+	if (!perfctr_base_regs)
+		return 0;
+
+	counter = nmi_evntsel_msr_to_bit(msr);
+	BUG_ON(counter > NMI_MAX_COUNTER_BITS);
+
+	if (!test_and_set_bit(counter, evntsel_nmi_owner))
+		return 1;
+	return 0;
+}
+
+void release_evntsel_nmi(unsigned int msr)
+{
+	unsigned int counter;
+
+	if (!perfctr_base_regs)
+		return 0;
+
+	counter = nmi_evntsel_msr_to_bit(msr);
+	BUG_ON(counter > NMI_MAX_COUNTER_BITS);
+
+	clear_bit(counter, evntsel_nmi_owner);
+}
+
+EXPORT_SYMBOL(avail_to_resrv_perfctr_nmi);
+EXPORT_SYMBOL(avail_to_resrv_perfctr_nmi_bit);
+EXPORT_SYMBOL(reserve_perfctr_nmi);
+EXPORT_SYMBOL(release_perfctr_nmi);
+EXPORT_SYMBOL(reserve_evntsel_nmi);
+EXPORT_SYMBOL(release_evntsel_nmi);
diff --git a/arch/x86_64/kernel/Makefile b/arch/x86_64/kernel/Makefile
index de1de8a..cc7587d 100644
--- a/arch/x86_64/kernel/Makefile
+++ b/arch/x86_64/kernel/Makefile
@@ -9,7 +9,7 @@ obj-y	:= process.o signal.o entry.o traps.o irq.o \
 		x8664_ksyms.o i387.o syscall.o vsyscall.o \
 		setup64.o bootflag.o e820.o reboot.o quirks.o i8237.o \
 		pci-dma.o pci-nommu.o alternative.o hpet.o tsc.o bugs.o \
-		perfctr-watchdog.o
+		perfctr.o perfctr-watchdog.o
 
 obj-$(CONFIG_STACKTRACE)	+= stacktrace.o
 obj-$(CONFIG_X86_MCE)		+= mce.o therm_throt.o
@@ -60,4 +60,5 @@ i8237-y				+= ../../i386/kernel/i8237.o
 msr-$(subst m,y,$(CONFIG_X86_MSR))  += ../../i386/kernel/msr.o
 alternative-y			+= ../../i386/kernel/alternative.o
 pcspeaker-y			+= ../../i386/kernel/pcspeaker.o
+perfctr-y			+= ../../i386/kernel/cpu/perfctr.o
 perfctr-watchdog-y		+= ../../i386/kernel/cpu/perfctr-watchdog.o
diff --git a/arch/x86_64/kernel/apic.c b/arch/x86_64/kernel/apic.c
index 1b0e07b..892ebf8 100644
--- a/arch/x86_64/kernel/apic.c
+++ b/arch/x86_64/kernel/apic.c
@@ -453,6 +453,7 @@ void __cpuinit setup_local_APIC (void)
 			oldvalue, value);
 	}
 
+	probe_performance_counters();
 	nmi_watchdog_default();
 	setup_apic_nmi_watchdog(NULL);
 	apic_pm_activate();
diff --git a/include/asm-i386/nmi.h b/include/asm-i386/nmi.h
index fb1e133..736af6f 100644
--- a/include/asm-i386/nmi.h
+++ b/include/asm-i386/nmi.h
@@ -18,6 +18,7 @@
 int do_nmi_callback(struct pt_regs *regs, int cpu);
 
 extern int nmi_watchdog_enabled;
+extern void probe_performance_counters(void);
 extern int avail_to_resrv_perfctr_nmi_bit(unsigned int);
 extern int avail_to_resrv_perfctr_nmi(unsigned int);
 extern int reserve_perfctr_nmi(unsigned int);
diff --git a/include/asm-x86_64/nmi.h b/include/asm-x86_64/nmi.h
index d0a7f53..d45fc62 100644
--- a/include/asm-x86_64/nmi.h
+++ b/include/asm-x86_64/nmi.h
@@ -46,6 +46,7 @@ extern int unknown_nmi_panic;
 extern int nmi_watchdog_enabled;
 
 extern int check_nmi_watchdog(void);
+extern void probe_performance_counters(void);
 extern int avail_to_resrv_perfctr_nmi_bit(unsigned int);
 extern int avail_to_resrv_perfctr_nmi(unsigned int);
 extern int reserve_perfctr_nmi(unsigned int);

^ permalink raw reply related	[flat|nested] 28+ messages in thread

* Re: [PATCH] Separate performance counter reservation from nmi watchdog
  2007-06-13  1:41   ` [PATCH] Separate performance counter reservation from nmi watchdog Björn Steinbrink
@ 2007-06-13 16:46     ` Björn Steinbrink
  2007-06-18  9:52       ` Stephane Eranian
  0 siblings, 1 reply; 28+ messages in thread
From: Björn Steinbrink @ 2007-06-13 16:46 UTC (permalink / raw)
  To: Stephane Eranian, oprofile-list, wcohen, ak, perfmon,
	linux-kernel, levon, akpm

On 2007.06.13 03:41:36 +0200, Björn Steinbrink wrote:
> On 2007.06.12 21:07:30 +0200, Björn Steinbrink wrote:
> > On 2007.06.12 08:02:46 -0700, Stephane Eranian wrote:
> > >  * the fill_in_addresses() callback for X86 invokes the NMI watchdog
> > >    reserve_*_nmi() register allocation routines. This is done regardless
> > >    of whether the NMI watchdog is active. When the NMI watchdog is not
> > >    active, the allocator will satisfy the allocation for the first MSR
> > >    of each type (counter or control), but then it will reject any
> > >    request for the others. You end up working with a single
> > >    counter/control register.
> > 
> > Hm, ouch. I'll try to move the reservation parts into a separate system.
> 
> Ok, here's the first try. The performance counter reservation system has
> been moved into its own file and separated from the nmi watchdog. Also,
> the probing rules were relaxed a bit, as they were more restrictive in
> the watchdog code than in the oprofile code.
> 
> The new code never allows to reserve a performance counter or event
> selection register when the probing failed, instead of allowing one
> random register to be reserved.
> 
> While moving the code around, I noticed that the PerfMon nmi watchdog
> actually reserved the wrong MSRs, as the generic reservation function
> always took the base register. As the respective attributes are no
> longer used as base regs, we can now store the correct value there and
> keep using the generic function.
> 
> Being unfamiliar with the kernel init process, I simply put the probing
> call right before the nmi watchdog initialization, but that's probably
> wrong (and dependent on local APIC on i386), so I'd be glad if someone
> could point out a better location.

JFYI, this should be 2.6.22 material, as the dependency on the nmi
watchdog being probed was added by the cleanup in commit
09198e68501a7e34737cd9264d266f42429abcdc. So it's a regression over
2.6.21.

> Thanks,
> Björn
> 
> 
> From: Björn Steinbrink <B.Steinbrink@gmx.de>
> 
> Separate the performance counter reservation system from the nmi
> watchdog to allow usage of all performance counters even if the nmi
> watchdog is not used.
> 
> Fixed the reservation of the wrong performance counter for the PerfMon
> based nmi watchdog along the way.
> 
> Signed-off-by: Björn Steinbrink <B.Steinbrink@gmx.de>
> ---
> diff --git a/arch/i386/kernel/apic.c b/arch/i386/kernel/apic.c
> index 67824f3..88b74e3 100644
> --- a/arch/i386/kernel/apic.c
> +++ b/arch/i386/kernel/apic.c
> @@ -1009,6 +1009,7 @@ void __devinit setup_local_APIC(void)
>  	value |= (APIC_LVT_MASKED | LOCAL_TIMER_VECTOR);
>  	apic_write_around(APIC_LVTT, value);
>  
> +	probe_performance_counters();
>  	setup_apic_nmi_watchdog(NULL);
>  	apic_pm_activate();
>  }
> diff --git a/arch/i386/kernel/cpu/Makefile b/arch/i386/kernel/cpu/Makefile
> index 74f27a4..882364d 100644
> --- a/arch/i386/kernel/cpu/Makefile
> +++ b/arch/i386/kernel/cpu/Makefile
> @@ -18,4 +18,4 @@ obj-$(CONFIG_X86_MCE)	+=	mcheck/
>  obj-$(CONFIG_MTRR)	+= 	mtrr/
>  obj-$(CONFIG_CPU_FREQ)	+=	cpufreq/
>  
> -obj-$(CONFIG_X86_LOCAL_APIC) += perfctr-watchdog.o
> +obj-$(CONFIG_X86_LOCAL_APIC) += perfctr.o perfctr-watchdog.o
> diff --git a/arch/i386/kernel/cpu/perfctr-watchdog.c b/arch/i386/kernel/cpu/perfctr-watchdog.c
> index 2b04c8f..6187097 100644
> --- a/arch/i386/kernel/cpu/perfctr-watchdog.c
> +++ b/arch/i386/kernel/cpu/perfctr-watchdog.c
> @@ -8,9 +8,7 @@
>     Original code for K7/P6 written by Keith Owens */
>  
>  #include <linux/percpu.h>
> -#include <linux/module.h>
>  #include <linux/kernel.h>
> -#include <linux/bitops.h>
>  #include <linux/smp.h>
>  #include <linux/nmi.h>
>  #include <asm/apic.h>
> @@ -36,105 +34,8 @@ struct wd_ops {
>  
>  static struct wd_ops *wd_ops;
>  
> -/* this number is calculated from Intel's MSR_P4_CRU_ESCR5 register and it's
> - * offset from MSR_P4_BSU_ESCR0.  It will be the max for all platforms (for now)
> - */
> -#define NMI_MAX_COUNTER_BITS 66
> -
> -/* perfctr_nmi_owner tracks the ownership of the perfctr registers:
> - * evtsel_nmi_owner tracks the ownership of the event selection
> - * - different performance counters/ event selection may be reserved for
> - *   different subsystems this reservation system just tries to coordinate
> - *   things a little
> - */
> -static DECLARE_BITMAP(perfctr_nmi_owner, NMI_MAX_COUNTER_BITS);
> -static DECLARE_BITMAP(evntsel_nmi_owner, NMI_MAX_COUNTER_BITS);
> -
>  static DEFINE_PER_CPU(struct nmi_watchdog_ctlblk, nmi_watchdog_ctlblk);
>  
> -/* converts an msr to an appropriate reservation bit */
> -static inline unsigned int nmi_perfctr_msr_to_bit(unsigned int msr)
> -{
> -	return wd_ops ? msr - wd_ops->perfctr : 0;
> -}
> -
> -/* converts an msr to an appropriate reservation bit */
> -/* returns the bit offset of the event selection register */
> -static inline unsigned int nmi_evntsel_msr_to_bit(unsigned int msr)
> -{
> -	return wd_ops ? msr - wd_ops->evntsel : 0;
> -}
> -
> -/* checks for a bit availability (hack for oprofile) */
> -int avail_to_resrv_perfctr_nmi_bit(unsigned int counter)
> -{
> -	BUG_ON(counter > NMI_MAX_COUNTER_BITS);
> -
> -	return (!test_bit(counter, perfctr_nmi_owner));
> -}
> -
> -/* checks the an msr for availability */
> -int avail_to_resrv_perfctr_nmi(unsigned int msr)
> -{
> -	unsigned int counter;
> -
> -	counter = nmi_perfctr_msr_to_bit(msr);
> -	BUG_ON(counter > NMI_MAX_COUNTER_BITS);
> -
> -	return (!test_bit(counter, perfctr_nmi_owner));
> -}
> -
> -int reserve_perfctr_nmi(unsigned int msr)
> -{
> -	unsigned int counter;
> -
> -	counter = nmi_perfctr_msr_to_bit(msr);
> -	BUG_ON(counter > NMI_MAX_COUNTER_BITS);
> -
> -	if (!test_and_set_bit(counter, perfctr_nmi_owner))
> -		return 1;
> -	return 0;
> -}
> -
> -void release_perfctr_nmi(unsigned int msr)
> -{
> -	unsigned int counter;
> -
> -	counter = nmi_perfctr_msr_to_bit(msr);
> -	BUG_ON(counter > NMI_MAX_COUNTER_BITS);
> -
> -	clear_bit(counter, perfctr_nmi_owner);
> -}
> -
> -int reserve_evntsel_nmi(unsigned int msr)
> -{
> -	unsigned int counter;
> -
> -	counter = nmi_evntsel_msr_to_bit(msr);
> -	BUG_ON(counter > NMI_MAX_COUNTER_BITS);
> -
> -	if (!test_and_set_bit(counter, evntsel_nmi_owner))
> -		return 1;
> -	return 0;
> -}
> -
> -void release_evntsel_nmi(unsigned int msr)
> -{
> -	unsigned int counter;
> -
> -	counter = nmi_evntsel_msr_to_bit(msr);
> -	BUG_ON(counter > NMI_MAX_COUNTER_BITS);
> -
> -	clear_bit(counter, evntsel_nmi_owner);
> -}
> -
> -EXPORT_SYMBOL(avail_to_resrv_perfctr_nmi);
> -EXPORT_SYMBOL(avail_to_resrv_perfctr_nmi_bit);
> -EXPORT_SYMBOL(reserve_perfctr_nmi);
> -EXPORT_SYMBOL(release_perfctr_nmi);
> -EXPORT_SYMBOL(reserve_evntsel_nmi);
> -EXPORT_SYMBOL(release_evntsel_nmi);
> -
>  void disable_lapic_nmi_watchdog(void)
>  {
>  	BUG_ON(nmi_watchdog != NMI_LOCAL_APIC);
> @@ -568,8 +469,8 @@ static struct wd_ops intel_arch_wd_ops = {
>  	.setup = setup_intel_arch_watchdog,
>  	.rearm = p6_rearm,
>  	.stop = single_msr_stop_watchdog,
> -	.perfctr = MSR_ARCH_PERFMON_PERFCTR0,
> -	.evntsel = MSR_ARCH_PERFMON_EVENTSEL0,
> +	.perfctr = MSR_ARCH_PERFMON_PERFCTR1,
> +	.evntsel = MSR_ARCH_PERFMON_EVENTSEL1,
>  };
>  
>  static void probe_nmi_watchdog(void)
> diff --git a/arch/i386/kernel/cpu/perfctr.c b/arch/i386/kernel/cpu/perfctr.c
> new file mode 100644
> index 0000000..63e68a3
> --- /dev/null
> +++ b/arch/i386/kernel/cpu/perfctr.c
> @@ -0,0 +1,175 @@
> +#include <linux/bitops.h>
> +#include <linux/module.h>
> +#include <asm/msr-index.h>
> +#include <asm/intel_arch_perfmon.h>
> +
> +struct perfctr_base_regs {
> +	unsigned int perfctr;
> +	unsigned int evntsel;
> +};
> +
> +static struct perfctr_base_regs *perfctr_base_regs;
> +
> +static struct perfctr_base_regs k7_base_regs = {
> +	.perfctr = MSR_K7_PERFCTR0,
> +	.evntsel = MSR_K7_EVNTSEL0
> +};
> +
> +static struct perfctr_base_regs p4_base_regs = {
> +	.perfctr = MSR_P4_BPU_PERFCTR0,
> +	.evntsel = MSR_P4_BSU_ESCR0
> +};
> +
> +static struct perfctr_base_regs p6_base_regs = {
> +	.perfctr = MSR_P6_PERFCTR0,
> +	.evntsel = MSR_P6_EVNTSEL0
> +};
> +
> +static struct perfctr_base_regs arch_perfmon_base_regs = {
> +	.perfctr = MSR_ARCH_PERFMON_PERFCTR0,
> +	.evntsel = MSR_ARCH_PERFMON_EVENTSEL0
> +};
> +
> +/* this number is calculated from Intel's MSR_P4_CRU_ESCR5 register and it's
> + * offset from MSR_P4_BSU_ESCR0.  It will be the max for all platforms (for now)
> + */
> +#define NMI_MAX_COUNTER_BITS 66
> +
> +/* perfctr_nmi_owner tracks the ownership of the perfctr registers:
> + * evtsel_nmi_owner tracks the ownership of the event selection
> + * - different performance counters/ event selection may be reserved for
> + *   different subsystems this reservation system just tries to coordinate
> + *   things a little
> + */
> +static DECLARE_BITMAP(perfctr_nmi_owner, NMI_MAX_COUNTER_BITS);
> +static DECLARE_BITMAP(evntsel_nmi_owner, NMI_MAX_COUNTER_BITS);
> +
> +void __devinit probe_performance_counters(void)
> +{
> +	switch (boot_cpu_data.x86_vendor) {
> +	case X86_VENDOR_AMD:
> +		if (boot_cpu_data.x86 != 6 && boot_cpu_data.x86 != 15 &&
> +		    boot_cpu_data.x86 != 16)
> +			return;
> +
> +		perfctr_base_regs = &k7_base_regs;
> +		break;
> +
> +	case X86_VENDOR_INTEL:
> +		if (cpu_has(&boot_cpu_data, X86_FEATURE_ARCH_PERFMON)) {
> +			perfctr_base_regs = &arch_perfmon_base_regs;
> +			break;
> +		}
> +		switch (boot_cpu_data.x86) {
> +		case 6:
> +			perfctr_base_regs = &p6_base_regs;
> +			break;
> +
> +		case 15:
> +			perfctr_base_regs = &p4_base_regs;
> +			break;
> +		}
> +		break;
> +	}
> +}
> +
> +/* converts an msr to an appropriate reservation bit */
> +static inline unsigned int nmi_perfctr_msr_to_bit(unsigned int msr)
> +{
> +	return msr - perfctr_base_regs->perfctr;
> +}
> +
> +/* converts an msr to an appropriate reservation bit */
> +/* returns the bit offset of the event selection register */
> +static inline unsigned int nmi_evntsel_msr_to_bit(unsigned int msr)
> +{
> +	return msr - perfctr_base_regs->evntsel;
> +}
> +
> +/* checks for a bit availability (hack for oprofile) */
> +int avail_to_resrv_perfctr_nmi_bit(unsigned int counter)
> +{
> +	if (!perfctr_base_regs)
> +		return 0;
> +
> +	BUG_ON(counter > NMI_MAX_COUNTER_BITS);
> +
> +	return (!test_bit(counter, perfctr_nmi_owner));
> +}
> +
> +/* checks the an msr for availability */
> +int avail_to_resrv_perfctr_nmi(unsigned int msr)
> +{
> +	unsigned int counter;
> +
> +	if (!perfctr_base_regs)
> +		return 0;
> +
> +	counter = nmi_perfctr_msr_to_bit(msr);
> +	BUG_ON(counter > NMI_MAX_COUNTER_BITS);
> +
> +	return (!test_bit(counter, perfctr_nmi_owner));
> +}
> +
> +int reserve_perfctr_nmi(unsigned int msr)
> +{
> +	unsigned int counter;
> +
> +	if (!perfctr_base_regs)
> +		return 0;
> +
> +	counter = nmi_perfctr_msr_to_bit(msr);
> +	BUG_ON(counter > NMI_MAX_COUNTER_BITS);
> +
> +	if (!test_and_set_bit(counter, perfctr_nmi_owner))
> +		return 1;
> +	return 0;
> +}
> +
> +void release_perfctr_nmi(unsigned int msr)
> +{
> +	unsigned int counter;
> +
> +	if (!perfctr_base_regs)
> +		return 0;
> +
> +	counter = nmi_perfctr_msr_to_bit(msr);
> +	BUG_ON(counter > NMI_MAX_COUNTER_BITS);
> +
> +	clear_bit(counter, perfctr_nmi_owner);
> +}
> +
> +int reserve_evntsel_nmi(unsigned int msr)
> +{
> +	unsigned int counter;
> +
> +	if (!perfctr_base_regs)
> +		return 0;
> +
> +	counter = nmi_evntsel_msr_to_bit(msr);
> +	BUG_ON(counter > NMI_MAX_COUNTER_BITS);
> +
> +	if (!test_and_set_bit(counter, evntsel_nmi_owner))
> +		return 1;
> +	return 0;
> +}
> +
> +void release_evntsel_nmi(unsigned int msr)
> +{
> +	unsigned int counter;
> +
> +	if (!perfctr_base_regs)
> +		return 0;
> +
> +	counter = nmi_evntsel_msr_to_bit(msr);
> +	BUG_ON(counter > NMI_MAX_COUNTER_BITS);
> +
> +	clear_bit(counter, evntsel_nmi_owner);
> +}
> +
> +EXPORT_SYMBOL(avail_to_resrv_perfctr_nmi);
> +EXPORT_SYMBOL(avail_to_resrv_perfctr_nmi_bit);
> +EXPORT_SYMBOL(reserve_perfctr_nmi);
> +EXPORT_SYMBOL(release_perfctr_nmi);
> +EXPORT_SYMBOL(reserve_evntsel_nmi);
> +EXPORT_SYMBOL(release_evntsel_nmi);
> diff --git a/arch/x86_64/kernel/Makefile b/arch/x86_64/kernel/Makefile
> index de1de8a..cc7587d 100644
> --- a/arch/x86_64/kernel/Makefile
> +++ b/arch/x86_64/kernel/Makefile
> @@ -9,7 +9,7 @@ obj-y	:= process.o signal.o entry.o traps.o irq.o \
>  		x8664_ksyms.o i387.o syscall.o vsyscall.o \
>  		setup64.o bootflag.o e820.o reboot.o quirks.o i8237.o \
>  		pci-dma.o pci-nommu.o alternative.o hpet.o tsc.o bugs.o \
> -		perfctr-watchdog.o
> +		perfctr.o perfctr-watchdog.o
>  
>  obj-$(CONFIG_STACKTRACE)	+= stacktrace.o
>  obj-$(CONFIG_X86_MCE)		+= mce.o therm_throt.o
> @@ -60,4 +60,5 @@ i8237-y				+= ../../i386/kernel/i8237.o
>  msr-$(subst m,y,$(CONFIG_X86_MSR))  += ../../i386/kernel/msr.o
>  alternative-y			+= ../../i386/kernel/alternative.o
>  pcspeaker-y			+= ../../i386/kernel/pcspeaker.o
> +perfctr-y			+= ../../i386/kernel/cpu/perfctr.o
>  perfctr-watchdog-y		+= ../../i386/kernel/cpu/perfctr-watchdog.o
> diff --git a/arch/x86_64/kernel/apic.c b/arch/x86_64/kernel/apic.c
> index 1b0e07b..892ebf8 100644
> --- a/arch/x86_64/kernel/apic.c
> +++ b/arch/x86_64/kernel/apic.c
> @@ -453,6 +453,7 @@ void __cpuinit setup_local_APIC (void)
>  			oldvalue, value);
>  	}
>  
> +	probe_performance_counters();
>  	nmi_watchdog_default();
>  	setup_apic_nmi_watchdog(NULL);
>  	apic_pm_activate();
> diff --git a/include/asm-i386/nmi.h b/include/asm-i386/nmi.h
> index fb1e133..736af6f 100644
> --- a/include/asm-i386/nmi.h
> +++ b/include/asm-i386/nmi.h
> @@ -18,6 +18,7 @@
>  int do_nmi_callback(struct pt_regs *regs, int cpu);
>  
>  extern int nmi_watchdog_enabled;
> +extern void probe_performance_counters(void);
>  extern int avail_to_resrv_perfctr_nmi_bit(unsigned int);
>  extern int avail_to_resrv_perfctr_nmi(unsigned int);
>  extern int reserve_perfctr_nmi(unsigned int);
> diff --git a/include/asm-x86_64/nmi.h b/include/asm-x86_64/nmi.h
> index d0a7f53..d45fc62 100644
> --- a/include/asm-x86_64/nmi.h
> +++ b/include/asm-x86_64/nmi.h
> @@ -46,6 +46,7 @@ extern int unknown_nmi_panic;
>  extern int nmi_watchdog_enabled;
>  
>  extern int check_nmi_watchdog(void);
> +extern void probe_performance_counters(void);
>  extern int avail_to_resrv_perfctr_nmi_bit(unsigned int);
>  extern int avail_to_resrv_perfctr_nmi(unsigned int);
>  extern int reserve_perfctr_nmi(unsigned int);
> -
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH] Separate performance counter reservation from nmi watchdog
  2007-06-13 16:46     ` Björn Steinbrink
@ 2007-06-18  9:52       ` Stephane Eranian
  2007-06-18 10:32         ` Björn Steinbrink
  0 siblings, 1 reply; 28+ messages in thread
From: Stephane Eranian @ 2007-06-18  9:52 UTC (permalink / raw)
  To: Björn Steinbrink, oprofile-list, wcohen, ak, perfmon,
	linux-kernel, levon, akpm

Hello Bjorn,

Sorry for the delay in my reponse.

> > From: Björn Steinbrink <B.Steinbrink@gmx.de>
> > 
> > Separate the performance counter reservation system from the nmi
> > watchdog to allow usage of all performance counters even if the nmi
> > watchdog is not used.
> > 

I think it is a good idea to separate the counter allocator from NMI
becuase as you said they are/will be use used by more than the NMI
watchdog. Thus, I think you should drop the stirng nmi from the
routines. I would also povide a separate header file for this allocator.


> > Fixed the reservation of the wrong performance counter for the PerfMon
> > based nmi watchdog along the way.
> > 
> > Signed-off-by: Björn Steinbrink <B.Steinbrink@gmx.de>
> > ---
> > diff --git a/arch/i386/kernel/apic.c b/arch/i386/kernel/apic.c
> > index 67824f3..88b74e3 100644
> > --- a/arch/i386/kernel/apic.c
> > +++ b/arch/i386/kernel/apic.c
> > @@ -1009,6 +1009,7 @@ void __devinit setup_local_APIC(void)
> >  	value |= (APIC_LVT_MASKED | LOCAL_TIMER_VECTOR);
> >  	apic_write_around(APIC_LVTT, value);
> >  
> > +	probe_performance_counters();
> >  	setup_apic_nmi_watchdog(NULL);
> >  	apic_pm_activate();
> >  }
> > diff --git a/arch/i386/kernel/cpu/Makefile b/arch/i386/kernel/cpu/Makefile
> > index 74f27a4..882364d 100644
> > --- a/arch/i386/kernel/cpu/Makefile
> > +++ b/arch/i386/kernel/cpu/Makefile
> > @@ -18,4 +18,4 @@ obj-$(CONFIG_X86_MCE)	+=	mcheck/
> >  obj-$(CONFIG_MTRR)	+= 	mtrr/
> >  obj-$(CONFIG_CPU_FREQ)	+=	cpufreq/
> >  
> > -obj-$(CONFIG_X86_LOCAL_APIC) += perfctr-watchdog.o
> > +obj-$(CONFIG_X86_LOCAL_APIC) += perfctr.o perfctr-watchdog.o
> > diff --git a/arch/i386/kernel/cpu/perfctr-watchdog.c b/arch/i386/kernel/cpu/perfctr-watchdog.c
> > index 2b04c8f..6187097 100644
> > --- a/arch/i386/kernel/cpu/perfctr-watchdog.c
> > +++ b/arch/i386/kernel/cpu/perfctr-watchdog.c
> > @@ -8,9 +8,7 @@
> >     Original code for K7/P6 written by Keith Owens */
> >  
> >  #include <linux/percpu.h>
> > -#include <linux/module.h>
> >  #include <linux/kernel.h>
> > -#include <linux/bitops.h>
> >  #include <linux/smp.h>
> >  #include <linux/nmi.h>
> >  #include <asm/apic.h>
> > @@ -36,105 +34,8 @@ struct wd_ops {
> >  
> >  static struct wd_ops *wd_ops;
> >  
> > -/* this number is calculated from Intel's MSR_P4_CRU_ESCR5 register and it's
> > - * offset from MSR_P4_BSU_ESCR0.  It will be the max for all platforms (for now)
> > - */
> > -#define NMI_MAX_COUNTER_BITS 66
> > -
> > -/* perfctr_nmi_owner tracks the ownership of the perfctr registers:
> > - * evtsel_nmi_owner tracks the ownership of the event selection
> > - * - different performance counters/ event selection may be reserved for
> > - *   different subsystems this reservation system just tries to coordinate
> > - *   things a little
> > - */
> > -static DECLARE_BITMAP(perfctr_nmi_owner, NMI_MAX_COUNTER_BITS);
> > -static DECLARE_BITMAP(evntsel_nmi_owner, NMI_MAX_COUNTER_BITS);
> > -
> >  static DEFINE_PER_CPU(struct nmi_watchdog_ctlblk, nmi_watchdog_ctlblk);
> >  
> > -/* converts an msr to an appropriate reservation bit */
> > -static inline unsigned int nmi_perfctr_msr_to_bit(unsigned int msr)
> > -{
> > -	return wd_ops ? msr - wd_ops->perfctr : 0;
> > -}
> > -
> > -/* converts an msr to an appropriate reservation bit */
> > -/* returns the bit offset of the event selection register */
> > -static inline unsigned int nmi_evntsel_msr_to_bit(unsigned int msr)
> > -{
> > -	return wd_ops ? msr - wd_ops->evntsel : 0;
> > -}
> > -
> > -/* checks for a bit availability (hack for oprofile) */
> > -int avail_to_resrv_perfctr_nmi_bit(unsigned int counter)
> > -{
> > -	BUG_ON(counter > NMI_MAX_COUNTER_BITS);
> > -
> > -	return (!test_bit(counter, perfctr_nmi_owner));
> > -}
> > -
> > -/* checks the an msr for availability */
> > -int avail_to_resrv_perfctr_nmi(unsigned int msr)
> > -{
> > -	unsigned int counter;
> > -
> > -	counter = nmi_perfctr_msr_to_bit(msr);
> > -	BUG_ON(counter > NMI_MAX_COUNTER_BITS);
> > -
> > -	return (!test_bit(counter, perfctr_nmi_owner));
> > -}
> > -
> > -int reserve_perfctr_nmi(unsigned int msr)
> > -{
> > -	unsigned int counter;
> > -
> > -	counter = nmi_perfctr_msr_to_bit(msr);
> > -	BUG_ON(counter > NMI_MAX_COUNTER_BITS);
> > -
> > -	if (!test_and_set_bit(counter, perfctr_nmi_owner))
> > -		return 1;
> > -	return 0;
> > -}
> > -
> > -void release_perfctr_nmi(unsigned int msr)
> > -{
> > -	unsigned int counter;
> > -
> > -	counter = nmi_perfctr_msr_to_bit(msr);
> > -	BUG_ON(counter > NMI_MAX_COUNTER_BITS);
> > -
> > -	clear_bit(counter, perfctr_nmi_owner);
> > -}
> > -
> > -int reserve_evntsel_nmi(unsigned int msr)
> > -{
> > -	unsigned int counter;
> > -
> > -	counter = nmi_evntsel_msr_to_bit(msr);
> > -	BUG_ON(counter > NMI_MAX_COUNTER_BITS);
> > -
> > -	if (!test_and_set_bit(counter, evntsel_nmi_owner))
> > -		return 1;
> > -	return 0;
> > -}
> > -
> > -void release_evntsel_nmi(unsigned int msr)
> > -{
> > -	unsigned int counter;
> > -
> > -	counter = nmi_evntsel_msr_to_bit(msr);
> > -	BUG_ON(counter > NMI_MAX_COUNTER_BITS);
> > -
> > -	clear_bit(counter, evntsel_nmi_owner);
> > -}
> > -
> > -EXPORT_SYMBOL(avail_to_resrv_perfctr_nmi);
> > -EXPORT_SYMBOL(avail_to_resrv_perfctr_nmi_bit);
> > -EXPORT_SYMBOL(reserve_perfctr_nmi);
> > -EXPORT_SYMBOL(release_perfctr_nmi);
> > -EXPORT_SYMBOL(reserve_evntsel_nmi);
> > -EXPORT_SYMBOL(release_evntsel_nmi);
> > -
> >  void disable_lapic_nmi_watchdog(void)
> >  {
> >  	BUG_ON(nmi_watchdog != NMI_LOCAL_APIC);
> > @@ -568,8 +469,8 @@ static struct wd_ops intel_arch_wd_ops = {
> >  	.setup = setup_intel_arch_watchdog,
> >  	.rearm = p6_rearm,
> >  	.stop = single_msr_stop_watchdog,
> > -	.perfctr = MSR_ARCH_PERFMON_PERFCTR0,
> > -	.evntsel = MSR_ARCH_PERFMON_EVENTSEL0,
> > +	.perfctr = MSR_ARCH_PERFMON_PERFCTR1,
> > +	.evntsel = MSR_ARCH_PERFMON_EVENTSEL1,
> >  };
> >  
> >  static void probe_nmi_watchdog(void)
> > diff --git a/arch/i386/kernel/cpu/perfctr.c b/arch/i386/kernel/cpu/perfctr.c
> > new file mode 100644
> > index 0000000..63e68a3
> > --- /dev/null
> > +++ b/arch/i386/kernel/cpu/perfctr.c
> > @@ -0,0 +1,175 @@
> > +#include <linux/bitops.h>
> > +#include <linux/module.h>
> > +#include <asm/msr-index.h>
> > +#include <asm/intel_arch_perfmon.h>
> > +
> > +struct perfctr_base_regs {
> > +	unsigned int perfctr;
> > +	unsigned int evntsel;
> > +};
> > +
> > +static struct perfctr_base_regs *perfctr_base_regs;
> > +
> > +static struct perfctr_base_regs k7_base_regs = {
> > +	.perfctr = MSR_K7_PERFCTR0,
> > +	.evntsel = MSR_K7_EVNTSEL0
> > +};
> > +
> > +static struct perfctr_base_regs p4_base_regs = {
> > +	.perfctr = MSR_P4_BPU_PERFCTR0,
> > +	.evntsel = MSR_P4_BSU_ESCR0
> > +};
> > +
> > +static struct perfctr_base_regs p6_base_regs = {
> > +	.perfctr = MSR_P6_PERFCTR0,
> > +	.evntsel = MSR_P6_EVNTSEL0
> > +};
> > +
> > +static struct perfctr_base_regs arch_perfmon_base_regs = {
> > +	.perfctr = MSR_ARCH_PERFMON_PERFCTR0,
> > +	.evntsel = MSR_ARCH_PERFMON_EVENTSEL0
> > +};
> > +
> > +/* this number is calculated from Intel's MSR_P4_CRU_ESCR5 register and it's
> > + * offset from MSR_P4_BSU_ESCR0.  It will be the max for all platforms (for now)
> > + */
> > +#define NMI_MAX_COUNTER_BITS 66
> > +
> > +/* perfctr_nmi_owner tracks the ownership of the perfctr registers:
> > + * evtsel_nmi_owner tracks the ownership of the event selection
> > + * - different performance counters/ event selection may be reserved for
> > + *   different subsystems this reservation system just tries to coordinate
> > + *   things a little
> > + */
> > +static DECLARE_BITMAP(perfctr_nmi_owner, NMI_MAX_COUNTER_BITS);
> > +static DECLARE_BITMAP(evntsel_nmi_owner, NMI_MAX_COUNTER_BITS);
> > +
> > +void __devinit probe_performance_counters(void)
> > +{
> > +	switch (boot_cpu_data.x86_vendor) {
> > +	case X86_VENDOR_AMD:
> > +		if (boot_cpu_data.x86 != 6 && boot_cpu_data.x86 != 15 &&
> > +		    boot_cpu_data.x86 != 16)
> > +			return;
> > +
> > +		perfctr_base_regs = &k7_base_regs;
> > +		break;
> > +
> > +	case X86_VENDOR_INTEL:
> > +		if (cpu_has(&boot_cpu_data, X86_FEATURE_ARCH_PERFMON)) {
> > +			perfctr_base_regs = &arch_perfmon_base_regs;
> > +			break;
> > +		}
> > +		switch (boot_cpu_data.x86) {
> > +		case 6:
> > +			perfctr_base_regs = &p6_base_regs;
> > +			break;
> > +
> > +		case 15:
> > +			perfctr_base_regs = &p4_base_regs;
> > +			break;
> > +		}
> > +		break;
> > +	}
> > +}
> > +
> > +/* converts an msr to an appropriate reservation bit */
> > +static inline unsigned int nmi_perfctr_msr_to_bit(unsigned int msr)
> > +{
> > +	return msr - perfctr_base_regs->perfctr;
> > +}
> > +
> > +/* converts an msr to an appropriate reservation bit */
> > +/* returns the bit offset of the event selection register */
> > +static inline unsigned int nmi_evntsel_msr_to_bit(unsigned int msr)
> > +{
> > +	return msr - perfctr_base_regs->evntsel;
> > +}
> > +
> > +/* checks for a bit availability (hack for oprofile) */
> > +int avail_to_resrv_perfctr_nmi_bit(unsigned int counter)
> > +{
> > +	if (!perfctr_base_regs)
> > +		return 0;
> > +
> > +	BUG_ON(counter > NMI_MAX_COUNTER_BITS);
> > +
> > +	return (!test_bit(counter, perfctr_nmi_owner));
> > +}
> > +
> > +/* checks the an msr for availability */
> > +int avail_to_resrv_perfctr_nmi(unsigned int msr)
> > +{
> > +	unsigned int counter;
> > +
> > +	if (!perfctr_base_regs)
> > +		return 0;
> > +
> > +	counter = nmi_perfctr_msr_to_bit(msr);
> > +	BUG_ON(counter > NMI_MAX_COUNTER_BITS);
> > +
> > +	return (!test_bit(counter, perfctr_nmi_owner));
> > +}
> > +
> > +int reserve_perfctr_nmi(unsigned int msr)
> > +{
> > +	unsigned int counter;
> > +
> > +	if (!perfctr_base_regs)
> > +		return 0;
> > +
> > +	counter = nmi_perfctr_msr_to_bit(msr);
> > +	BUG_ON(counter > NMI_MAX_COUNTER_BITS);
> > +
> > +	if (!test_and_set_bit(counter, perfctr_nmi_owner))
> > +		return 1;
> > +	return 0;
> > +}
> > +
> > +void release_perfctr_nmi(unsigned int msr)
> > +{
> > +	unsigned int counter;
> > +
> > +	if (!perfctr_base_regs)
> > +		return 0;
> > +
> > +	counter = nmi_perfctr_msr_to_bit(msr);
> > +	BUG_ON(counter > NMI_MAX_COUNTER_BITS);
> > +
> > +	clear_bit(counter, perfctr_nmi_owner);
> > +}
> > +
> > +int reserve_evntsel_nmi(unsigned int msr)
> > +{
> > +	unsigned int counter;
> > +
> > +	if (!perfctr_base_regs)
> > +		return 0;
> > +
> > +	counter = nmi_evntsel_msr_to_bit(msr);
> > +	BUG_ON(counter > NMI_MAX_COUNTER_BITS);
> > +
> > +	if (!test_and_set_bit(counter, evntsel_nmi_owner))
> > +		return 1;
> > +	return 0;
> > +}
> > +
> > +void release_evntsel_nmi(unsigned int msr)
> > +{
> > +	unsigned int counter;
> > +
> > +	if (!perfctr_base_regs)
> > +		return 0;
> > +
> > +	counter = nmi_evntsel_msr_to_bit(msr);
> > +	BUG_ON(counter > NMI_MAX_COUNTER_BITS);
> > +
> > +	clear_bit(counter, evntsel_nmi_owner);
> > +}
> > +
> > +EXPORT_SYMBOL(avail_to_resrv_perfctr_nmi);
> > +EXPORT_SYMBOL(avail_to_resrv_perfctr_nmi_bit);
> > +EXPORT_SYMBOL(reserve_perfctr_nmi);
> > +EXPORT_SYMBOL(release_perfctr_nmi);
> > +EXPORT_SYMBOL(reserve_evntsel_nmi);
> > +EXPORT_SYMBOL(release_evntsel_nmi);
> > diff --git a/arch/x86_64/kernel/Makefile b/arch/x86_64/kernel/Makefile
> > index de1de8a..cc7587d 100644
> > --- a/arch/x86_64/kernel/Makefile
> > +++ b/arch/x86_64/kernel/Makefile
> > @@ -9,7 +9,7 @@ obj-y	:= process.o signal.o entry.o traps.o irq.o \
> >  		x8664_ksyms.o i387.o syscall.o vsyscall.o \
> >  		setup64.o bootflag.o e820.o reboot.o quirks.o i8237.o \
> >  		pci-dma.o pci-nommu.o alternative.o hpet.o tsc.o bugs.o \
> > -		perfctr-watchdog.o
> > +		perfctr.o perfctr-watchdog.o
> >  
> >  obj-$(CONFIG_STACKTRACE)	+= stacktrace.o
> >  obj-$(CONFIG_X86_MCE)		+= mce.o therm_throt.o
> > @@ -60,4 +60,5 @@ i8237-y				+= ../../i386/kernel/i8237.o
> >  msr-$(subst m,y,$(CONFIG_X86_MSR))  += ../../i386/kernel/msr.o
> >  alternative-y			+= ../../i386/kernel/alternative.o
> >  pcspeaker-y			+= ../../i386/kernel/pcspeaker.o
> > +perfctr-y			+= ../../i386/kernel/cpu/perfctr.o
> >  perfctr-watchdog-y		+= ../../i386/kernel/cpu/perfctr-watchdog.o
> > diff --git a/arch/x86_64/kernel/apic.c b/arch/x86_64/kernel/apic.c
> > index 1b0e07b..892ebf8 100644
> > --- a/arch/x86_64/kernel/apic.c
> > +++ b/arch/x86_64/kernel/apic.c
> > @@ -453,6 +453,7 @@ void __cpuinit setup_local_APIC (void)
> >  			oldvalue, value);
> >  	}
> >  
> > +	probe_performance_counters();
> >  	nmi_watchdog_default();
> >  	setup_apic_nmi_watchdog(NULL);
> >  	apic_pm_activate();
> > diff --git a/include/asm-i386/nmi.h b/include/asm-i386/nmi.h
> > index fb1e133..736af6f 100644
> > --- a/include/asm-i386/nmi.h
> > +++ b/include/asm-i386/nmi.h
> > @@ -18,6 +18,7 @@
> >  int do_nmi_callback(struct pt_regs *regs, int cpu);
> >  
> >  extern int nmi_watchdog_enabled;
> > +extern void probe_performance_counters(void);
> >  extern int avail_to_resrv_perfctr_nmi_bit(unsigned int);
> >  extern int avail_to_resrv_perfctr_nmi(unsigned int);
> >  extern int reserve_perfctr_nmi(unsigned int);
> > diff --git a/include/asm-x86_64/nmi.h b/include/asm-x86_64/nmi.h
> > index d0a7f53..d45fc62 100644
> > --- a/include/asm-x86_64/nmi.h
> > +++ b/include/asm-x86_64/nmi.h
> > @@ -46,6 +46,7 @@ extern int unknown_nmi_panic;
> >  extern int nmi_watchdog_enabled;
> >  
> >  extern int check_nmi_watchdog(void);
> > +extern void probe_performance_counters(void);
> >  extern int avail_to_resrv_perfctr_nmi_bit(unsigned int);
> >  extern int avail_to_resrv_perfctr_nmi(unsigned int);
> >  extern int reserve_perfctr_nmi(unsigned int);
> > -
> > To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> > the body of a message to majordomo@vger.kernel.org
> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> > Please read the FAQ at  http://www.tux.org/lkml/

-- 

-Stephane

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH] Separate performance counter reservation from nmi watchdog
  2007-06-18  9:52       ` Stephane Eranian
@ 2007-06-18 10:32         ` Björn Steinbrink
       [not found]           ` <11823357571842-git-send-email->
  2007-06-20 10:49           ` [PATCH 0/2] Performance counter allocator separation Björn Steinbrink
  0 siblings, 2 replies; 28+ messages in thread
From: Björn Steinbrink @ 2007-06-18 10:32 UTC (permalink / raw)
  To: Stephane Eranian
  Cc: oprofile-list, wcohen, ak, perfmon, linux-kernel, levon, akpm

On 2007.06.18 02:52:38 -0700, Stephane Eranian wrote:
> Hello Bjorn,
> 
> Sorry for the delay in my reponse.
> 
> > > From: Björn Steinbrink <B.Steinbrink@gmx.de>
> > > 
> > > Separate the performance counter reservation system from the nmi
> > > watchdog to allow usage of all performance counters even if the nmi
> > > watchdog is not used.
> > > 
> 
> I think it is a good idea to separate the counter allocator from NMI
> becuase as you said they are/will be use used by more than the NMI
> watchdog. Thus, I think you should drop the stirng nmi from the
> routines. I would also povide a separate header file for this allocator.

Rationale for both: This was already a (IMHO) rather huge patch for such
a bugfix so I didn't want to change even more code for the sake of
cleanup. I had planned to send a second patch for 2.6.23 that would do
the clean up, should have mentioned that (or included the second patch).

Will resend the patch later today, together with the cleanup patch.
Whoever applies it can then decide if the cleanup should go into 2.6.22
or 2.6.23.

Thanks,
Björn

^ permalink raw reply	[flat|nested] 28+ messages in thread

* [PATCH 1/2] Separate the performance counter allocation from the LAPIC NMI watchdog
       [not found]           ` <11823357571842-git-send-email->
@ 2007-06-20 10:35             ` Björn Steinbrink
  2007-06-20 10:35               ` [PATCH 2/2] Finish separation of the performance counter allocator from the " Björn Steinbrink
  2007-06-20 12:31               ` [PATCH 1/2] Separate the performance counter allocation from the LAPIC " Andi Kleen
  0 siblings, 2 replies; 28+ messages in thread
From: Björn Steinbrink @ 2007-06-20 10:35 UTC (permalink / raw)
  To: eranian
  Cc: oprofile-list, wcohen, ak, perfmon, linux-kernel, levon, akpm,
	mingo, Björn Steinbrink

The performance counter allocator is tied to the LAPIC NMI watchdog,
rendering it unusable if the watchdog is not in use, breaking other
subsystems that try to allocate performance counters.

Fix the performance counter allocator by making it independent of the
LAPIC NMI watchdog. This also fixes a bug in the LAPIC NMI watchdog
which allocated the wrong performance counter on CPUs with PerfMon
support.

Signed-off-by: Björn Steinbrink <B.Steinbrink@gmx.de>
---
 arch/i386/kernel/apic.c                 |    1 +
 arch/i386/kernel/cpu/Makefile           |    2 +-
 arch/i386/kernel/cpu/perfctr-watchdog.c |  103 +-----------------
 arch/i386/kernel/cpu/perfctr.c          |  178 +++++++++++++++++++++++++++++++
 arch/x86_64/kernel/Makefile             |    3 +-
 arch/x86_64/kernel/apic.c               |    1 +
 include/asm-i386/nmi.h                  |    1 +
 include/asm-x86_64/nmi.h                |    1 +
 8 files changed, 187 insertions(+), 103 deletions(-)
 create mode 100644 arch/i386/kernel/cpu/perfctr.c

diff --git a/arch/i386/kernel/apic.c b/arch/i386/kernel/apic.c
index 67824f3..88b74e3 100644
--- a/arch/i386/kernel/apic.c
+++ b/arch/i386/kernel/apic.c
@@ -1009,6 +1009,7 @@ void __devinit setup_local_APIC(void)
 	value |= (APIC_LVT_MASKED | LOCAL_TIMER_VECTOR);
 	apic_write_around(APIC_LVTT, value);
 
+	probe_performance_counters();
 	setup_apic_nmi_watchdog(NULL);
 	apic_pm_activate();
 }
diff --git a/arch/i386/kernel/cpu/Makefile b/arch/i386/kernel/cpu/Makefile
index 74f27a4..882364d 100644
--- a/arch/i386/kernel/cpu/Makefile
+++ b/arch/i386/kernel/cpu/Makefile
@@ -18,4 +18,4 @@ obj-$(CONFIG_X86_MCE)	+=	mcheck/
 obj-$(CONFIG_MTRR)	+= 	mtrr/
 obj-$(CONFIG_CPU_FREQ)	+=	cpufreq/
 
-obj-$(CONFIG_X86_LOCAL_APIC) += perfctr-watchdog.o
+obj-$(CONFIG_X86_LOCAL_APIC) += perfctr.o perfctr-watchdog.o
diff --git a/arch/i386/kernel/cpu/perfctr-watchdog.c b/arch/i386/kernel/cpu/perfctr-watchdog.c
index f0b6763..2ae39eb 100644
--- a/arch/i386/kernel/cpu/perfctr-watchdog.c
+++ b/arch/i386/kernel/cpu/perfctr-watchdog.c
@@ -8,9 +8,7 @@
    Original code for K7/P6 written by Keith Owens */
 
 #include <linux/percpu.h>
-#include <linux/module.h>
 #include <linux/kernel.h>
-#include <linux/bitops.h>
 #include <linux/smp.h>
 #include <linux/nmi.h>
 #include <asm/apic.h>
@@ -36,105 +34,8 @@ struct wd_ops {
 
 static struct wd_ops *wd_ops;
 
-/* this number is calculated from Intel's MSR_P4_CRU_ESCR5 register and it's
- * offset from MSR_P4_BSU_ESCR0.  It will be the max for all platforms (for now)
- */
-#define NMI_MAX_COUNTER_BITS 66
-
-/* perfctr_nmi_owner tracks the ownership of the perfctr registers:
- * evtsel_nmi_owner tracks the ownership of the event selection
- * - different performance counters/ event selection may be reserved for
- *   different subsystems this reservation system just tries to coordinate
- *   things a little
- */
-static DECLARE_BITMAP(perfctr_nmi_owner, NMI_MAX_COUNTER_BITS);
-static DECLARE_BITMAP(evntsel_nmi_owner, NMI_MAX_COUNTER_BITS);
-
 static DEFINE_PER_CPU(struct nmi_watchdog_ctlblk, nmi_watchdog_ctlblk);
 
-/* converts an msr to an appropriate reservation bit */
-static inline unsigned int nmi_perfctr_msr_to_bit(unsigned int msr)
-{
-	return wd_ops ? msr - wd_ops->perfctr : 0;
-}
-
-/* converts an msr to an appropriate reservation bit */
-/* returns the bit offset of the event selection register */
-static inline unsigned int nmi_evntsel_msr_to_bit(unsigned int msr)
-{
-	return wd_ops ? msr - wd_ops->evntsel : 0;
-}
-
-/* checks for a bit availability (hack for oprofile) */
-int avail_to_resrv_perfctr_nmi_bit(unsigned int counter)
-{
-	BUG_ON(counter > NMI_MAX_COUNTER_BITS);
-
-	return (!test_bit(counter, perfctr_nmi_owner));
-}
-
-/* checks the an msr for availability */
-int avail_to_resrv_perfctr_nmi(unsigned int msr)
-{
-	unsigned int counter;
-
-	counter = nmi_perfctr_msr_to_bit(msr);
-	BUG_ON(counter > NMI_MAX_COUNTER_BITS);
-
-	return (!test_bit(counter, perfctr_nmi_owner));
-}
-
-int reserve_perfctr_nmi(unsigned int msr)
-{
-	unsigned int counter;
-
-	counter = nmi_perfctr_msr_to_bit(msr);
-	BUG_ON(counter > NMI_MAX_COUNTER_BITS);
-
-	if (!test_and_set_bit(counter, perfctr_nmi_owner))
-		return 1;
-	return 0;
-}
-
-void release_perfctr_nmi(unsigned int msr)
-{
-	unsigned int counter;
-
-	counter = nmi_perfctr_msr_to_bit(msr);
-	BUG_ON(counter > NMI_MAX_COUNTER_BITS);
-
-	clear_bit(counter, perfctr_nmi_owner);
-}
-
-int reserve_evntsel_nmi(unsigned int msr)
-{
-	unsigned int counter;
-
-	counter = nmi_evntsel_msr_to_bit(msr);
-	BUG_ON(counter > NMI_MAX_COUNTER_BITS);
-
-	if (!test_and_set_bit(counter, evntsel_nmi_owner))
-		return 1;
-	return 0;
-}
-
-void release_evntsel_nmi(unsigned int msr)
-{
-	unsigned int counter;
-
-	counter = nmi_evntsel_msr_to_bit(msr);
-	BUG_ON(counter > NMI_MAX_COUNTER_BITS);
-
-	clear_bit(counter, evntsel_nmi_owner);
-}
-
-EXPORT_SYMBOL(avail_to_resrv_perfctr_nmi);
-EXPORT_SYMBOL(avail_to_resrv_perfctr_nmi_bit);
-EXPORT_SYMBOL(reserve_perfctr_nmi);
-EXPORT_SYMBOL(release_perfctr_nmi);
-EXPORT_SYMBOL(reserve_evntsel_nmi);
-EXPORT_SYMBOL(release_evntsel_nmi);
-
 void disable_lapic_nmi_watchdog(void)
 {
 	BUG_ON(nmi_watchdog != NMI_LOCAL_APIC);
@@ -568,8 +469,8 @@ static struct wd_ops intel_arch_wd_ops = {
 	.setup = setup_intel_arch_watchdog,
 	.rearm = p6_rearm,
 	.stop = single_msr_stop_watchdog,
-	.perfctr = MSR_ARCH_PERFMON_PERFCTR0,
-	.evntsel = MSR_ARCH_PERFMON_EVENTSEL0,
+	.perfctr = MSR_ARCH_PERFMON_PERFCTR1,
+	.evntsel = MSR_ARCH_PERFMON_EVENTSEL1,
 };
 
 static void probe_nmi_watchdog(void)
diff --git a/arch/i386/kernel/cpu/perfctr.c b/arch/i386/kernel/cpu/perfctr.c
new file mode 100644
index 0000000..fece4fc
--- /dev/null
+++ b/arch/i386/kernel/cpu/perfctr.c
@@ -0,0 +1,178 @@
+#include <linux/bitops.h>
+#include <linux/module.h>
+#include <asm/msr-index.h>
+#include <asm/intel_arch_perfmon.h>
+
+struct perfctr_base_regs {
+	unsigned int perfctr;
+	unsigned int evntsel;
+};
+
+static struct perfctr_base_regs *perfctr_base_regs;
+
+static struct perfctr_base_regs k7_base_regs = {
+	.perfctr = MSR_K7_PERFCTR0,
+	.evntsel = MSR_K7_EVNTSEL0
+};
+
+static struct perfctr_base_regs p4_base_regs = {
+	.perfctr = MSR_P4_BPU_PERFCTR0,
+	.evntsel = MSR_P4_BSU_ESCR0
+};
+
+static struct perfctr_base_regs p6_base_regs = {
+	.perfctr = MSR_P6_PERFCTR0,
+	.evntsel = MSR_P6_EVNTSEL0
+};
+
+static struct perfctr_base_regs arch_perfmon_base_regs = {
+	.perfctr = MSR_ARCH_PERFMON_PERFCTR0,
+	.evntsel = MSR_ARCH_PERFMON_EVENTSEL0
+};
+
+/* this number is calculated from Intel's MSR_P4_CRU_ESCR5 register and it's
+ * offset from MSR_P4_BSU_ESCR0.  It will be the max for all platforms (for now)
+ */
+#define NMI_MAX_COUNTER_BITS 66
+
+/* perfctr_nmi_owner tracks the ownership of the perfctr registers:
+ * evtsel_nmi_owner tracks the ownership of the event selection
+ * - different performance counters/ event selection may be reserved for
+ *   different subsystems this reservation system just tries to coordinate
+ *   things a little
+ */
+static DECLARE_BITMAP(perfctr_nmi_owner, NMI_MAX_COUNTER_BITS);
+static DECLARE_BITMAP(evntsel_nmi_owner, NMI_MAX_COUNTER_BITS);
+
+void __devinit probe_performance_counters(void)
+{
+	if (perfctr_base_regs)
+		return;
+
+	switch (boot_cpu_data.x86_vendor) {
+	case X86_VENDOR_AMD:
+		if (boot_cpu_data.x86 != 6 && boot_cpu_data.x86 != 15 &&
+		    boot_cpu_data.x86 != 16)
+			return;
+
+		perfctr_base_regs = &k7_base_regs;
+		break;
+
+	case X86_VENDOR_INTEL:
+		if (cpu_has(&boot_cpu_data, X86_FEATURE_ARCH_PERFMON)) {
+			perfctr_base_regs = &arch_perfmon_base_regs;
+			break;
+		}
+		switch (boot_cpu_data.x86) {
+		case 6:
+			perfctr_base_regs = &p6_base_regs;
+			break;
+
+		case 15:
+			perfctr_base_regs = &p4_base_regs;
+			break;
+		}
+		break;
+	}
+}
+
+/* converts an msr to an appropriate reservation bit */
+static inline unsigned int nmi_perfctr_msr_to_bit(unsigned int msr)
+{
+	return msr - perfctr_base_regs->perfctr;
+}
+
+/* converts an msr to an appropriate reservation bit */
+/* returns the bit offset of the event selection register */
+static inline unsigned int nmi_evntsel_msr_to_bit(unsigned int msr)
+{
+	return msr - perfctr_base_regs->evntsel;
+}
+
+/* checks for a bit availability (hack for oprofile) */
+int avail_to_resrv_perfctr_nmi_bit(unsigned int counter)
+{
+	if (!perfctr_base_regs)
+		return 0;
+
+	BUG_ON(counter > NMI_MAX_COUNTER_BITS);
+
+	return (!test_bit(counter, perfctr_nmi_owner));
+}
+
+/* checks the an msr for availability */
+int avail_to_resrv_perfctr_nmi(unsigned int msr)
+{
+	unsigned int counter;
+
+	if (!perfctr_base_regs)
+		return 0;
+
+	counter = nmi_perfctr_msr_to_bit(msr);
+	BUG_ON(counter > NMI_MAX_COUNTER_BITS);
+
+	return (!test_bit(counter, perfctr_nmi_owner));
+}
+
+int reserve_perfctr_nmi(unsigned int msr)
+{
+	unsigned int counter;
+
+	if (!perfctr_base_regs)
+		return 0;
+
+	counter = nmi_perfctr_msr_to_bit(msr);
+	BUG_ON(counter > NMI_MAX_COUNTER_BITS);
+
+	if (!test_and_set_bit(counter, perfctr_nmi_owner))
+		return 1;
+	return 0;
+}
+
+void release_perfctr_nmi(unsigned int msr)
+{
+	unsigned int counter;
+
+	if (!perfctr_base_regs)
+		return;
+
+	counter = nmi_perfctr_msr_to_bit(msr);
+	BUG_ON(counter > NMI_MAX_COUNTER_BITS);
+
+	clear_bit(counter, perfctr_nmi_owner);
+}
+
+int reserve_evntsel_nmi(unsigned int msr)
+{
+	unsigned int counter;
+
+	if (!perfctr_base_regs)
+		return 0;
+
+	counter = nmi_evntsel_msr_to_bit(msr);
+	BUG_ON(counter > NMI_MAX_COUNTER_BITS);
+
+	if (!test_and_set_bit(counter, evntsel_nmi_owner))
+		return 1;
+	return 0;
+}
+
+void release_evntsel_nmi(unsigned int msr)
+{
+	unsigned int counter;
+
+	if (!perfctr_base_regs)
+		return;
+
+	counter = nmi_evntsel_msr_to_bit(msr);
+	BUG_ON(counter > NMI_MAX_COUNTER_BITS);
+
+	clear_bit(counter, evntsel_nmi_owner);
+}
+
+EXPORT_SYMBOL(avail_to_resrv_perfctr_nmi);
+EXPORT_SYMBOL(avail_to_resrv_perfctr_nmi_bit);
+EXPORT_SYMBOL(reserve_perfctr_nmi);
+EXPORT_SYMBOL(release_perfctr_nmi);
+EXPORT_SYMBOL(reserve_evntsel_nmi);
+EXPORT_SYMBOL(release_evntsel_nmi);
diff --git a/arch/x86_64/kernel/Makefile b/arch/x86_64/kernel/Makefile
index de1de8a..cc7587d 100644
--- a/arch/x86_64/kernel/Makefile
+++ b/arch/x86_64/kernel/Makefile
@@ -9,7 +9,7 @@ obj-y	:= process.o signal.o entry.o traps.o irq.o \
 		x8664_ksyms.o i387.o syscall.o vsyscall.o \
 		setup64.o bootflag.o e820.o reboot.o quirks.o i8237.o \
 		pci-dma.o pci-nommu.o alternative.o hpet.o tsc.o bugs.o \
-		perfctr-watchdog.o
+		perfctr.o perfctr-watchdog.o
 
 obj-$(CONFIG_STACKTRACE)	+= stacktrace.o
 obj-$(CONFIG_X86_MCE)		+= mce.o therm_throt.o
@@ -60,4 +60,5 @@ i8237-y				+= ../../i386/kernel/i8237.o
 msr-$(subst m,y,$(CONFIG_X86_MSR))  += ../../i386/kernel/msr.o
 alternative-y			+= ../../i386/kernel/alternative.o
 pcspeaker-y			+= ../../i386/kernel/pcspeaker.o
+perfctr-y			+= ../../i386/kernel/cpu/perfctr.o
 perfctr-watchdog-y		+= ../../i386/kernel/cpu/perfctr-watchdog.o
diff --git a/arch/x86_64/kernel/apic.c b/arch/x86_64/kernel/apic.c
index 1b0e07b..892ebf8 100644
--- a/arch/x86_64/kernel/apic.c
+++ b/arch/x86_64/kernel/apic.c
@@ -453,6 +453,7 @@ void __cpuinit setup_local_APIC (void)
 			oldvalue, value);
 	}
 
+	probe_performance_counters();
 	nmi_watchdog_default();
 	setup_apic_nmi_watchdog(NULL);
 	apic_pm_activate();
diff --git a/include/asm-i386/nmi.h b/include/asm-i386/nmi.h
index fb1e133..736af6f 100644
--- a/include/asm-i386/nmi.h
+++ b/include/asm-i386/nmi.h
@@ -18,6 +18,7 @@
 int do_nmi_callback(struct pt_regs *regs, int cpu);
 
 extern int nmi_watchdog_enabled;
+extern void probe_performance_counters(void);
 extern int avail_to_resrv_perfctr_nmi_bit(unsigned int);
 extern int avail_to_resrv_perfctr_nmi(unsigned int);
 extern int reserve_perfctr_nmi(unsigned int);
diff --git a/include/asm-x86_64/nmi.h b/include/asm-x86_64/nmi.h
index d0a7f53..d45fc62 100644
--- a/include/asm-x86_64/nmi.h
+++ b/include/asm-x86_64/nmi.h
@@ -46,6 +46,7 @@ extern int unknown_nmi_panic;
 extern int nmi_watchdog_enabled;
 
 extern int check_nmi_watchdog(void);
+extern void probe_performance_counters(void);
 extern int avail_to_resrv_perfctr_nmi_bit(unsigned int);
 extern int avail_to_resrv_perfctr_nmi(unsigned int);
 extern int reserve_perfctr_nmi(unsigned int);
-- 
1.5.2.1


^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [PATCH 2/2] Finish separation of the performance counter allocator from the NMI watchdog
  2007-06-20 10:35             ` [PATCH 1/2] Separate the performance counter allocation from the LAPIC NMI watchdog Björn Steinbrink
@ 2007-06-20 10:35               ` Björn Steinbrink
  2007-06-20 12:31               ` [PATCH 1/2] Separate the performance counter allocation from the LAPIC " Andi Kleen
  1 sibling, 0 replies; 28+ messages in thread
From: Björn Steinbrink @ 2007-06-20 10:35 UTC (permalink / raw)
  To: eranian
  Cc: oprofile-list, wcohen, ak, perfmon, linux-kernel, levon, akpm,
	mingo, Björn Steinbrink

From: Björn Steinbrink <B.Steinbrink@gmx.de>

Remove the nmi part from all names in the performance counter allocator
and move all declarations into their own header file.

Signed-off-by: Björn Steinbrink <B.Steinbrink@gmx.de>
---
 arch/i386/kernel/apic.c                 |    1 +
 arch/i386/kernel/cpu/perfctr-watchdog.c |   27 +++++++-------
 arch/i386/kernel/cpu/perfctr.c          |   58 +++++++++++++++---------------
 arch/i386/oprofile/nmi_int.c            |    4 +-
 arch/i386/oprofile/op_model_athlon.c    |   10 +++---
 arch/i386/oprofile/op_model_p4.c        |   28 +++++++-------
 arch/i386/oprofile/op_model_ppro.c      |   10 +++---
 arch/x86_64/kernel/apic.c               |    1 +
 arch/x86_64/kernel/time.c               |   12 +++---
 include/asm-i386/nmi.h                  |    7 ----
 include/asm-i386/perfctr.h              |   15 ++++++++
 include/asm-x86_64/nmi.h                |    7 ----
 include/asm-x86_64/perfctr.h            |   15 ++++++++
 13 files changed, 107 insertions(+), 88 deletions(-)
 create mode 100644 include/asm-i386/perfctr.h
 create mode 100644 include/asm-x86_64/perfctr.h

diff --git a/arch/i386/kernel/apic.c b/arch/i386/kernel/apic.c
index 88b74e3..16a5140 100644
--- a/arch/i386/kernel/apic.c
+++ b/arch/i386/kernel/apic.c
@@ -38,6 +38,7 @@
 #include <asm/hpet.h>
 #include <asm/i8253.h>
 #include <asm/nmi.h>
+#include <asm/perfctr.h>
 
 #include <mach_apic.h>
 #include <mach_apicdef.h>
diff --git a/arch/i386/kernel/cpu/perfctr-watchdog.c b/arch/i386/kernel/cpu/perfctr-watchdog.c
index 2ae39eb..d8fabd2 100644
--- a/arch/i386/kernel/cpu/perfctr-watchdog.c
+++ b/arch/i386/kernel/cpu/perfctr-watchdog.c
@@ -13,6 +13,7 @@
 #include <linux/nmi.h>
 #include <asm/apic.h>
 #include <asm/intel_arch_perfmon.h>
+#include <asm/perfctr.h>
 
 struct nmi_watchdog_ctlblk {
 	unsigned int cccr_msr;
@@ -165,11 +166,11 @@ static void single_msr_stop_watchdog(void)
 
 static int single_msr_reserve(void)
 {
-	if (!reserve_perfctr_nmi(wd_ops->perfctr))
+	if (!reserve_perfctr(wd_ops->perfctr))
 		return 0;
 
-	if (!reserve_evntsel_nmi(wd_ops->evntsel)) {
-		release_perfctr_nmi(wd_ops->perfctr);
+	if (!reserve_evntsel(wd_ops->evntsel)) {
+		release_perfctr(wd_ops->perfctr);
 		return 0;
 	}
 	return 1;
@@ -177,8 +178,8 @@ static int single_msr_reserve(void)
 
 static void single_msr_unreserve(void)
 {
-	release_evntsel_nmi(wd_ops->evntsel);
-	release_perfctr_nmi(wd_ops->perfctr);
+	release_evntsel(wd_ops->evntsel);
+	release_perfctr(wd_ops->perfctr);
 }
 
 static void single_msr_rearm(struct nmi_watchdog_ctlblk *wd, unsigned nmi_hz)
@@ -352,23 +353,23 @@ static void stop_p4_watchdog(void)
 
 static int p4_reserve(void)
 {
-	if (!reserve_perfctr_nmi(MSR_P4_IQ_PERFCTR0))
+	if (!reserve_perfctr(MSR_P4_IQ_PERFCTR0))
 		return 0;
 #ifdef CONFIG_SMP
-	if (smp_num_siblings > 1 && !reserve_perfctr_nmi(MSR_P4_IQ_PERFCTR1))
+	if (smp_num_siblings > 1 && !reserve_perfctr(MSR_P4_IQ_PERFCTR1))
 		goto fail1;
 #endif
-	if (!reserve_evntsel_nmi(MSR_P4_CRU_ESCR0))
+	if (!reserve_evntsel(MSR_P4_CRU_ESCR0))
 		goto fail2;
 	/* RED-PEN why is ESCR1 not reserved here? */
 	return 1;
  fail2:
 #ifdef CONFIG_SMP
 	if (smp_num_siblings > 1)
-		release_perfctr_nmi(MSR_P4_IQ_PERFCTR1);
+		release_perfctr(MSR_P4_IQ_PERFCTR1);
  fail1:
 #endif
-	release_perfctr_nmi(MSR_P4_IQ_PERFCTR0);
+	release_perfctr(MSR_P4_IQ_PERFCTR0);
 	return 0;
 }
 
@@ -376,10 +377,10 @@ static void p4_unreserve(void)
 {
 #ifdef CONFIG_SMP
 	if (smp_num_siblings > 1)
-		release_perfctr_nmi(MSR_P4_IQ_PERFCTR1);
+		release_perfctr(MSR_P4_IQ_PERFCTR1);
 #endif
-	release_evntsel_nmi(MSR_P4_CRU_ESCR0);
-	release_perfctr_nmi(MSR_P4_IQ_PERFCTR0);
+	release_evntsel(MSR_P4_CRU_ESCR0);
+	release_perfctr(MSR_P4_IQ_PERFCTR0);
 }
 
 static void p4_rearm(struct nmi_watchdog_ctlblk *wd, unsigned nmi_hz)
diff --git a/arch/i386/kernel/cpu/perfctr.c b/arch/i386/kernel/cpu/perfctr.c
index fece4fc..cbf57ce 100644
--- a/arch/i386/kernel/cpu/perfctr.c
+++ b/arch/i386/kernel/cpu/perfctr.c
@@ -35,14 +35,14 @@ static struct perfctr_base_regs arch_perfmon_base_regs = {
  */
 #define NMI_MAX_COUNTER_BITS 66
 
-/* perfctr_nmi_owner tracks the ownership of the perfctr registers:
- * evtsel_nmi_owner tracks the ownership of the event selection
+/* perfctr_owner tracks the ownership of the perfctr registers:
+ * evtsel_owner tracks the ownership of the event selection
  * - different performance counters/ event selection may be reserved for
  *   different subsystems this reservation system just tries to coordinate
  *   things a little
  */
-static DECLARE_BITMAP(perfctr_nmi_owner, NMI_MAX_COUNTER_BITS);
-static DECLARE_BITMAP(evntsel_nmi_owner, NMI_MAX_COUNTER_BITS);
+static DECLARE_BITMAP(perfctr_owner, NMI_MAX_COUNTER_BITS);
+static DECLARE_BITMAP(evntsel_owner, NMI_MAX_COUNTER_BITS);
 
 void __devinit probe_performance_counters(void)
 {
@@ -77,102 +77,102 @@ void __devinit probe_performance_counters(void)
 }
 
 /* converts an msr to an appropriate reservation bit */
-static inline unsigned int nmi_perfctr_msr_to_bit(unsigned int msr)
+static inline unsigned int perfctr_msr_to_bit(unsigned int msr)
 {
 	return msr - perfctr_base_regs->perfctr;
 }
 
 /* converts an msr to an appropriate reservation bit */
 /* returns the bit offset of the event selection register */
-static inline unsigned int nmi_evntsel_msr_to_bit(unsigned int msr)
+static inline unsigned int evntsel_msr_to_bit(unsigned int msr)
 {
 	return msr - perfctr_base_regs->evntsel;
 }
 
 /* checks for a bit availability (hack for oprofile) */
-int avail_to_resrv_perfctr_nmi_bit(unsigned int counter)
+int avail_to_resrv_perfctr_bit(unsigned int counter)
 {
 	if (!perfctr_base_regs)
 		return 0;
 
 	BUG_ON(counter > NMI_MAX_COUNTER_BITS);
 
-	return (!test_bit(counter, perfctr_nmi_owner));
+	return (!test_bit(counter, perfctr_owner));
 }
 
 /* checks the an msr for availability */
-int avail_to_resrv_perfctr_nmi(unsigned int msr)
+int avail_to_resrv_perfctr(unsigned int msr)
 {
 	unsigned int counter;
 
 	if (!perfctr_base_regs)
 		return 0;
 
-	counter = nmi_perfctr_msr_to_bit(msr);
+	counter = perfctr_msr_to_bit(msr);
 	BUG_ON(counter > NMI_MAX_COUNTER_BITS);
 
-	return (!test_bit(counter, perfctr_nmi_owner));
+	return (!test_bit(counter, perfctr_owner));
 }
 
-int reserve_perfctr_nmi(unsigned int msr)
+int reserve_perfctr(unsigned int msr)
 {
 	unsigned int counter;
 
 	if (!perfctr_base_regs)
 		return 0;
 
-	counter = nmi_perfctr_msr_to_bit(msr);
+	counter = perfctr_msr_to_bit(msr);
 	BUG_ON(counter > NMI_MAX_COUNTER_BITS);
 
-	if (!test_and_set_bit(counter, perfctr_nmi_owner))
+	if (!test_and_set_bit(counter, perfctr_owner))
 		return 1;
 	return 0;
 }
 
-void release_perfctr_nmi(unsigned int msr)
+void release_perfctr(unsigned int msr)
 {
 	unsigned int counter;
 
 	if (!perfctr_base_regs)
 		return;
 
-	counter = nmi_perfctr_msr_to_bit(msr);
+	counter = perfctr_msr_to_bit(msr);
 	BUG_ON(counter > NMI_MAX_COUNTER_BITS);
 
-	clear_bit(counter, perfctr_nmi_owner);
+	clear_bit(counter, perfctr_owner);
 }
 
-int reserve_evntsel_nmi(unsigned int msr)
+int reserve_evntsel(unsigned int msr)
 {
 	unsigned int counter;
 
 	if (!perfctr_base_regs)
 		return 0;
 
-	counter = nmi_evntsel_msr_to_bit(msr);
+	counter = evntsel_msr_to_bit(msr);
 	BUG_ON(counter > NMI_MAX_COUNTER_BITS);
 
-	if (!test_and_set_bit(counter, evntsel_nmi_owner))
+	if (!test_and_set_bit(counter, evntsel_owner))
 		return 1;
 	return 0;
 }
 
-void release_evntsel_nmi(unsigned int msr)
+void release_evntsel(unsigned int msr)
 {
 	unsigned int counter;
 
 	if (!perfctr_base_regs)
 		return;
 
-	counter = nmi_evntsel_msr_to_bit(msr);
+	counter = evntsel_msr_to_bit(msr);
 	BUG_ON(counter > NMI_MAX_COUNTER_BITS);
 
-	clear_bit(counter, evntsel_nmi_owner);
+	clear_bit(counter, evntsel_owner);
 }
 
-EXPORT_SYMBOL(avail_to_resrv_perfctr_nmi);
-EXPORT_SYMBOL(avail_to_resrv_perfctr_nmi_bit);
-EXPORT_SYMBOL(reserve_perfctr_nmi);
-EXPORT_SYMBOL(release_perfctr_nmi);
-EXPORT_SYMBOL(reserve_evntsel_nmi);
-EXPORT_SYMBOL(release_evntsel_nmi);
+EXPORT_SYMBOL(avail_to_resrv_perfctr);
+EXPORT_SYMBOL(avail_to_resrv_perfctr_bit);
+EXPORT_SYMBOL(reserve_perfctr);
+EXPORT_SYMBOL(release_perfctr);
+EXPORT_SYMBOL(reserve_evntsel);
+EXPORT_SYMBOL(release_evntsel);
diff --git a/arch/i386/oprofile/nmi_int.c b/arch/i386/oprofile/nmi_int.c
index 11b7a51..b00ae99 100644
--- a/arch/i386/oprofile/nmi_int.c
+++ b/arch/i386/oprofile/nmi_int.c
@@ -15,7 +15,7 @@
 #include <linux/slab.h>
 #include <linux/moduleparam.h>
 #include <linux/kdebug.h>
-#include <asm/nmi.h>
+#include <asm/perfctr.h>
 #include <asm/msr.h>
 #include <asm/apic.h>
  
@@ -324,7 +324,7 @@ static int nmi_create_files(struct super_block * sb, struct dentry * root)
 		 * NOTE:  assumes 1:1 mapping here (that counters are organized
 		 *        sequentially in their struct assignment).
 		 */
-		if (unlikely(!avail_to_resrv_perfctr_nmi_bit(i)))
+		if (unlikely(!avail_to_resrv_perfctr_bit(i)))
 			continue;
 
 		snprintf(buf,  sizeof(buf), "%d", i);
diff --git a/arch/i386/oprofile/op_model_athlon.c b/arch/i386/oprofile/op_model_athlon.c
index 3057a19..b1e9902 100644
--- a/arch/i386/oprofile/op_model_athlon.c
+++ b/arch/i386/oprofile/op_model_athlon.c
@@ -13,7 +13,7 @@
 #include <linux/oprofile.h>
 #include <asm/ptrace.h>
 #include <asm/msr.h>
-#include <asm/nmi.h>
+#include <asm/perfctr.h>
  
 #include "op_x86_model.h"
 #include "op_counter.h"
@@ -45,14 +45,14 @@ static void athlon_fill_in_addresses(struct op_msrs * const msrs)
 	int i;
 
 	for (i=0; i < NUM_COUNTERS; i++) {
-		if (reserve_perfctr_nmi(MSR_K7_PERFCTR0 + i))
+		if (reserve_perfctr(MSR_K7_PERFCTR0 + i))
 			msrs->counters[i].addr = MSR_K7_PERFCTR0 + i;
 		else
 			msrs->counters[i].addr = 0;
 	}
 
 	for (i=0; i < NUM_CONTROLS; i++) {
-		if (reserve_evntsel_nmi(MSR_K7_EVNTSEL0 + i))
+		if (reserve_evntsel(MSR_K7_EVNTSEL0 + i))
 			msrs->controls[i].addr = MSR_K7_EVNTSEL0 + i;
 		else
 			msrs->controls[i].addr = 0;
@@ -160,11 +160,11 @@ static void athlon_shutdown(struct op_msrs const * const msrs)
 
 	for (i = 0 ; i < NUM_COUNTERS ; ++i) {
 		if (CTR_IS_RESERVED(msrs,i))
-			release_perfctr_nmi(MSR_K7_PERFCTR0 + i);
+			release_perfctr(MSR_K7_PERFCTR0 + i);
 	}
 	for (i = 0 ; i < NUM_CONTROLS ; ++i) {
 		if (CTRL_IS_RESERVED(msrs,i))
-			release_evntsel_nmi(MSR_K7_EVNTSEL0 + i);
+			release_evntsel(MSR_K7_EVNTSEL0 + i);
 	}
 }
 
diff --git a/arch/i386/oprofile/op_model_p4.c b/arch/i386/oprofile/op_model_p4.c
index 4792592..6ba5856 100644
--- a/arch/i386/oprofile/op_model_p4.c
+++ b/arch/i386/oprofile/op_model_p4.c
@@ -14,7 +14,7 @@
 #include <asm/ptrace.h>
 #include <asm/fixmap.h>
 #include <asm/apic.h>
-#include <asm/nmi.h>
+#include <asm/perfctr.h>
 
 #include "op_x86_model.h"
 #include "op_counter.h"
@@ -413,7 +413,7 @@ static void p4_fill_in_addresses(struct op_msrs * const msrs)
 	for (i = 0; i < num_counters; ++i) {
 		addr = p4_counters[VIRT_CTR(stag, i)].counter_address;
 		cccraddr = p4_counters[VIRT_CTR(stag, i)].cccr_address;
-		if (reserve_perfctr_nmi(addr)){
+		if (reserve_perfctr(addr)){
 			msrs->counters[i].addr = addr;
 			msrs->controls[i].addr = cccraddr;
 		}
@@ -422,7 +422,7 @@ static void p4_fill_in_addresses(struct op_msrs * const msrs)
 	/* 43 ESCR registers in three or four discontiguous group */
 	for (addr = MSR_P4_BSU_ESCR0 + stag;
 	     addr < MSR_P4_IQ_ESCR0; ++i, addr += addr_increment()) {
-		if (reserve_evntsel_nmi(addr))
+		if (reserve_evntsel(addr))
 			msrs->controls[i].addr = addr;
 	}
 
@@ -431,32 +431,32 @@ static void p4_fill_in_addresses(struct op_msrs * const msrs)
 	if (boot_cpu_data.x86_model >= 0x3) {
 		for (addr = MSR_P4_BSU_ESCR0 + stag;
 		     addr <= MSR_P4_BSU_ESCR1; ++i, addr += addr_increment()) {
-			if (reserve_evntsel_nmi(addr))
+			if (reserve_evntsel(addr))
 				msrs->controls[i].addr = addr;
 		}
 	} else {
 		for (addr = MSR_P4_IQ_ESCR0 + stag;
 		     addr <= MSR_P4_IQ_ESCR1; ++i, addr += addr_increment()) {
-			if (reserve_evntsel_nmi(addr))
+			if (reserve_evntsel(addr))
 				msrs->controls[i].addr = addr;
 		}
 	}
 
 	for (addr = MSR_P4_RAT_ESCR0 + stag;
 	     addr <= MSR_P4_SSU_ESCR0; ++i, addr += addr_increment()) {
-		if (reserve_evntsel_nmi(addr))
+		if (reserve_evntsel(addr))
 			msrs->controls[i].addr = addr;
 	}
 	
 	for (addr = MSR_P4_MS_ESCR0 + stag;
 	     addr <= MSR_P4_TC_ESCR1; ++i, addr += addr_increment()) { 
-		if (reserve_evntsel_nmi(addr))
+		if (reserve_evntsel(addr))
 			msrs->controls[i].addr = addr;
 	}
 	
 	for (addr = MSR_P4_IX_ESCR0 + stag;
 	     addr <= MSR_P4_CRU_ESCR3; ++i, addr += addr_increment()) { 
-		if (reserve_evntsel_nmi(addr))
+		if (reserve_evntsel(addr))
 			msrs->controls[i].addr = addr;
 	}
 
@@ -464,21 +464,21 @@ static void p4_fill_in_addresses(struct op_msrs * const msrs)
 
 	if (num_counters == NUM_COUNTERS_NON_HT) {		
 		/* standard non-HT CPUs handle both remaining ESCRs*/
-		if (reserve_evntsel_nmi(MSR_P4_CRU_ESCR5))
+		if (reserve_evntsel(MSR_P4_CRU_ESCR5))
 			msrs->controls[i++].addr = MSR_P4_CRU_ESCR5;
-		if (reserve_evntsel_nmi(MSR_P4_CRU_ESCR4))
+		if (reserve_evntsel(MSR_P4_CRU_ESCR4))
 			msrs->controls[i++].addr = MSR_P4_CRU_ESCR4;
 
 	} else if (stag == 0) {
 		/* HT CPUs give the first remainder to the even thread, as
 		   the 32nd control register */
-		if (reserve_evntsel_nmi(MSR_P4_CRU_ESCR4))
+		if (reserve_evntsel(MSR_P4_CRU_ESCR4))
 			msrs->controls[i++].addr = MSR_P4_CRU_ESCR4;
 
 	} else {
 		/* and two copies of the second to the odd thread,
 		   for the 22st and 23nd control registers */
-		if (reserve_evntsel_nmi(MSR_P4_CRU_ESCR5)) {
+		if (reserve_evntsel(MSR_P4_CRU_ESCR5)) {
 			msrs->controls[i++].addr = MSR_P4_CRU_ESCR5;
 			msrs->controls[i++].addr = MSR_P4_CRU_ESCR5;
 		}
@@ -684,7 +684,7 @@ static void p4_shutdown(struct op_msrs const * const msrs)
 
 	for (i = 0 ; i < num_counters ; ++i) {
 		if (CTR_IS_RESERVED(msrs,i))
-			release_perfctr_nmi(msrs->counters[i].addr);
+			release_perfctr(msrs->counters[i].addr);
 	}
 	/* some of the control registers are specially reserved in
 	 * conjunction with the counter registers (hence the starting offset).
@@ -692,7 +692,7 @@ static void p4_shutdown(struct op_msrs const * const msrs)
 	 */
 	for (i = num_counters ; i < num_controls ; ++i) {
 		if (CTRL_IS_RESERVED(msrs,i))
-			release_evntsel_nmi(msrs->controls[i].addr);
+			release_evntsel(msrs->controls[i].addr);
 	}
 }
 
diff --git a/arch/i386/oprofile/op_model_ppro.c b/arch/i386/oprofile/op_model_ppro.c
index c554f52..5dfd93a 100644
--- a/arch/i386/oprofile/op_model_ppro.c
+++ b/arch/i386/oprofile/op_model_ppro.c
@@ -14,7 +14,7 @@
 #include <asm/ptrace.h>
 #include <asm/msr.h>
 #include <asm/apic.h>
-#include <asm/nmi.h>
+#include <asm/perfctr.h>
  
 #include "op_x86_model.h"
 #include "op_counter.h"
@@ -47,14 +47,14 @@ static void ppro_fill_in_addresses(struct op_msrs * const msrs)
 	int i;
 
 	for (i=0; i < NUM_COUNTERS; i++) {
-		if (reserve_perfctr_nmi(MSR_P6_PERFCTR0 + i))
+		if (reserve_perfctr(MSR_P6_PERFCTR0 + i))
 			msrs->counters[i].addr = MSR_P6_PERFCTR0 + i;
 		else
 			msrs->counters[i].addr = 0;
 	}
 	
 	for (i=0; i < NUM_CONTROLS; i++) {
-		if (reserve_evntsel_nmi(MSR_P6_EVNTSEL0 + i))
+		if (reserve_evntsel(MSR_P6_EVNTSEL0 + i))
 			msrs->controls[i].addr = MSR_P6_EVNTSEL0 + i;
 		else
 			msrs->controls[i].addr = 0;
@@ -171,11 +171,11 @@ static void ppro_shutdown(struct op_msrs const * const msrs)
 
 	for (i = 0 ; i < NUM_COUNTERS ; ++i) {
 		if (CTR_IS_RESERVED(msrs,i))
-			release_perfctr_nmi(MSR_P6_PERFCTR0 + i);
+			release_perfctr(MSR_P6_PERFCTR0 + i);
 	}
 	for (i = 0 ; i < NUM_CONTROLS ; ++i) {
 		if (CTRL_IS_RESERVED(msrs,i))
-			release_evntsel_nmi(MSR_P6_EVNTSEL0 + i);
+			release_evntsel(MSR_P6_EVNTSEL0 + i);
 	}
 }
 
diff --git a/arch/x86_64/kernel/apic.c b/arch/x86_64/kernel/apic.c
index 892ebf8..c08cf0f 100644
--- a/arch/x86_64/kernel/apic.c
+++ b/arch/x86_64/kernel/apic.c
@@ -33,6 +33,7 @@
 #include <asm/pgalloc.h>
 #include <asm/mach_apic.h>
 #include <asm/nmi.h>
+#include <asm/perfctr.h>
 #include <asm/idle.h>
 #include <asm/proto.h>
 #include <asm/timex.h>
diff --git a/arch/x86_64/kernel/time.c b/arch/x86_64/kernel/time.c
index 4a0895b..d145547 100644
--- a/arch/x86_64/kernel/time.c
+++ b/arch/x86_64/kernel/time.c
@@ -43,7 +43,7 @@
 #include <asm/apic.h>
 #include <asm/hpet.h>
 #include <asm/mpspec.h>
-#include <asm/nmi.h>
+#include <asm/perfctr.h>
 
 static char *timename = NULL;
 
@@ -261,7 +261,7 @@ static unsigned int __init tsc_calibrate_cpu_khz(void)
        unsigned long flags;
 
        for (i = 0; i < 4; i++)
-               if (avail_to_resrv_perfctr_nmi_bit(i))
+               if (avail_to_resrv_perfctr_bit(i))
                        break;
        no_ctr_free = (i == 4);
        if (no_ctr_free) {
@@ -270,8 +270,8 @@ static unsigned int __init tsc_calibrate_cpu_khz(void)
                wrmsrl(MSR_K7_EVNTSEL3, 0);
                rdmsrl(MSR_K7_PERFCTR3, pmc3);
        } else {
-               reserve_perfctr_nmi(MSR_K7_PERFCTR0 + i);
-               reserve_evntsel_nmi(MSR_K7_EVNTSEL0 + i);
+               reserve_perfctr(MSR_K7_PERFCTR0 + i);
+               reserve_evntsel(MSR_K7_EVNTSEL0 + i);
        }
        local_irq_save(flags);
        /* start meauring cycles, incrementing from 0 */
@@ -289,8 +289,8 @@ static unsigned int __init tsc_calibrate_cpu_khz(void)
                wrmsrl(MSR_K7_PERFCTR3, pmc3);
                wrmsrl(MSR_K7_EVNTSEL3, evntsel3);
        } else {
-               release_perfctr_nmi(MSR_K7_PERFCTR0 + i);
-               release_evntsel_nmi(MSR_K7_EVNTSEL0 + i);
+               release_perfctr(MSR_K7_PERFCTR0 + i);
+               release_evntsel(MSR_K7_EVNTSEL0 + i);
        }
 
        return pmc_now * tsc_khz / (tsc_now - tsc_start);
diff --git a/include/asm-i386/nmi.h b/include/asm-i386/nmi.h
index 736af6f..e189531 100644
--- a/include/asm-i386/nmi.h
+++ b/include/asm-i386/nmi.h
@@ -18,13 +18,6 @@
 int do_nmi_callback(struct pt_regs *regs, int cpu);
 
 extern int nmi_watchdog_enabled;
-extern void probe_performance_counters(void);
-extern int avail_to_resrv_perfctr_nmi_bit(unsigned int);
-extern int avail_to_resrv_perfctr_nmi(unsigned int);
-extern int reserve_perfctr_nmi(unsigned int);
-extern void release_perfctr_nmi(unsigned int);
-extern int reserve_evntsel_nmi(unsigned int);
-extern void release_evntsel_nmi(unsigned int);
 
 extern void setup_apic_nmi_watchdog (void *);
 extern void stop_apic_nmi_watchdog (void *);
diff --git a/include/asm-i386/perfctr.h b/include/asm-i386/perfctr.h
new file mode 100644
index 0000000..fbd4993
--- /dev/null
+++ b/include/asm-i386/perfctr.h
@@ -0,0 +1,15 @@
+/*
+ *  linux/include/asm-i386/perfctr.h
+ */
+#ifndef ASM_PERFCTR_H
+#define ASM_PERFCTR_H
+
+extern void probe_performance_counters(void);
+extern int avail_to_resrv_perfctr_bit(unsigned int);
+extern int avail_to_resrv_perfctr(unsigned int);
+extern int reserve_perfctr(unsigned int);
+extern void release_perfctr(unsigned int);
+extern int reserve_evntsel(unsigned int);
+extern void release_evntsel(unsigned int);
+
+#endif /* ASM_PERFCTR_H */
diff --git a/include/asm-x86_64/nmi.h b/include/asm-x86_64/nmi.h
index d45fc62..666462c 100644
--- a/include/asm-x86_64/nmi.h
+++ b/include/asm-x86_64/nmi.h
@@ -46,13 +46,6 @@ extern int unknown_nmi_panic;
 extern int nmi_watchdog_enabled;
 
 extern int check_nmi_watchdog(void);
-extern void probe_performance_counters(void);
-extern int avail_to_resrv_perfctr_nmi_bit(unsigned int);
-extern int avail_to_resrv_perfctr_nmi(unsigned int);
-extern int reserve_perfctr_nmi(unsigned int);
-extern void release_perfctr_nmi(unsigned int);
-extern int reserve_evntsel_nmi(unsigned int);
-extern void release_evntsel_nmi(unsigned int);
 
 extern void setup_apic_nmi_watchdog (void *);
 extern void stop_apic_nmi_watchdog (void *);
diff --git a/include/asm-x86_64/perfctr.h b/include/asm-x86_64/perfctr.h
new file mode 100644
index 0000000..9ff5c2d
--- /dev/null
+++ b/include/asm-x86_64/perfctr.h
@@ -0,0 +1,15 @@
+/*
+ *  linux/include/asm-x86_64/perfctr.h
+ */
+#ifndef ASM_PERFCTR_H
+#define ASM_PERFCTR_H
+
+extern void probe_performance_counters(void);
+extern int avail_to_resrv_perfctr_bit(unsigned int);
+extern int avail_to_resrv_perfctr(unsigned int);
+extern int reserve_perfctr(unsigned int);
+extern void release_perfctr(unsigned int);
+extern int reserve_evntsel(unsigned int);
+extern void release_evntsel(unsigned int);
+
+#endif /* ASM_PERFCTR_H */
-- 
1.5.2.1


^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [PATCH 0/2] Performance counter allocator separation
  2007-06-18 10:32         ` Björn Steinbrink
       [not found]           ` <11823357571842-git-send-email->
@ 2007-06-20 10:49           ` Björn Steinbrink
  1 sibling, 0 replies; 28+ messages in thread
From: Björn Steinbrink @ 2007-06-20 10:49 UTC (permalink / raw)
  To: Stephane Eranian, oprofile-list, wcohen, ak, perfmon,
	linux-kernel, levon, akpm, mingo

[Sorry if anyone gets this mail twice, seems that git-send-email
"forgot" to mask the umlaut in my name for this message, causing a few
servers to reject it.]

These two patches fix the performance counter allocator to work even
when the LAPIC NMI watchdog is disabled. It's split up into two patches
to keep the size of the pure regression fix down and allow the cleanup
to be merged once 2.6.22 is out.

If you prefer an all-in-one patch nevertheless, please let me know.

Thanks,
Björn

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 1/2] Separate the performance counter allocation from the LAPIC NMI watchdog
  2007-06-20 10:35             ` [PATCH 1/2] Separate the performance counter allocation from the LAPIC NMI watchdog Björn Steinbrink
  2007-06-20 10:35               ` [PATCH 2/2] Finish separation of the performance counter allocator from the " Björn Steinbrink
@ 2007-06-20 12:31               ` Andi Kleen
  2007-06-20 12:49                 ` [perfmon] " Stephane Eranian
  2007-06-20 13:18                 ` Björn Steinbrink
  1 sibling, 2 replies; 28+ messages in thread
From: Andi Kleen @ 2007-06-20 12:31 UTC (permalink / raw)
  To: Björn Steinbrink
  Cc: eranian, oprofile-list, wcohen, perfmon, linux-kernel, levon,
	akpm, mingo

On Wednesday 20 June 2007 12:35:56 Björn Steinbrink wrote:
> The performance counter allocator is tied to the LAPIC NMI watchdog,

That's not true. It's completely independent. It just happens to be in
the same file, but it has no direct functional ties to the watchdog.


> Fix the performance counter allocator by making it independent of the
> LAPIC NMI watchdog. This also fixes a bug in the LAPIC NMI watchdog
> which allocated the wrong performance counter on CPUs with PerfMon
> support.

Combining code movement with functional fixes makes it impossible
to review properly. Don't do that please.

-An

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [perfmon] Re: [PATCH 1/2] Separate the performance counter allocation from the LAPIC NMI watchdog
  2007-06-20 12:31               ` [PATCH 1/2] Separate the performance counter allocation from the LAPIC " Andi Kleen
@ 2007-06-20 12:49                 ` Stephane Eranian
  2007-06-20 13:01                   ` Andi Kleen
  2007-06-20 13:18                 ` Björn Steinbrink
  1 sibling, 1 reply; 28+ messages in thread
From: Stephane Eranian @ 2007-06-20 12:49 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Björn Steinbrink, mingo, linux-kernel, levon, perfmon,
	oprofile-list, wcohen, akpm

Andi,

On Wed, Jun 20, 2007 at 02:31:43PM +0200, Andi Kleen wrote:
> On Wednesday 20 June 2007 12:35:56 Björn Steinbrink wrote:
> > The performance counter allocator is tied to the LAPIC NMI watchdog,
> 
> That's not true. It's completely independent. It just happens to be in
> the same file, but it has no direct functional ties to the watchdog.
> 
I agree with you that the allocator is functionally independent of the
watchdog. That is how I'd like to see it and I think we all agree on
that.

Yet in the current implementation, there is a link between the two which
causes the issues I ran into. If you look at:

static inline unsigned int nmi_evntsel_msr_to_bit(unsigned int msr)
{
        return wd_ops ? msr - wd_ops->evntsel : 0;
}

int reserve_evntsel_nmi(unsigned int msr)
{
        unsigned int counter;

        counter = nmi_evntsel_msr_to_bit(msr);
        BUG_ON(counter > NMI_MAX_COUNTER_BITS);
	....
}

You see that if the wd_ops (a watchdog structure) is not initialized
then the allocator collapses all MSRs onto one bit.

Once this is fixed (which is what Bjorn did), then I will agree with you.
For this, the allocator needs to be able to probe the CPU and initialize
its own data structures.

-- 
-Stephane

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [perfmon] Re: [PATCH 1/2] Separate the performance counter allocation from the LAPIC NMI watchdog
  2007-06-20 12:49                 ` [perfmon] " Stephane Eranian
@ 2007-06-20 13:01                   ` Andi Kleen
  2007-06-20 18:33                     ` Björn Steinbrink
  0 siblings, 1 reply; 28+ messages in thread
From: Andi Kleen @ 2007-06-20 13:01 UTC (permalink / raw)
  To: eranian
  Cc: Björn Steinbrink, mingo, linux-kernel, levon, perfmon,
	oprofile-list, wcohen, akpm


> Once this is fixed (which is what Bjorn did), then I will agree with you.
> For this, the allocator needs to be able to probe the CPU and initialize
> its own data structures.

Ok that sounds reasonable. Please someone send a patch that does only 
that.

-Andi

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 1/2] Separate the performance counter allocation from the LAPIC NMI watchdog
  2007-06-20 12:31               ` [PATCH 1/2] Separate the performance counter allocation from the LAPIC " Andi Kleen
  2007-06-20 12:49                 ` [perfmon] " Stephane Eranian
@ 2007-06-20 13:18                 ` Björn Steinbrink
  1 sibling, 0 replies; 28+ messages in thread
From: Björn Steinbrink @ 2007-06-20 13:18 UTC (permalink / raw)
  To: Andi Kleen
  Cc: eranian, oprofile-list, wcohen, perfmon, linux-kernel, levon,
	akpm, mingo

On 2007.06.20 14:31:43 +0200, Andi Kleen wrote:
> On Wednesday 20 June 2007 12:35:56 Björn Steinbrink wrote:
> > The performance counter allocator is tied to the LAPIC NMI watchdog,
> 
> That's not true. It's completely independent. It just happens to be in
> the same file, but it has no direct functional ties to the watchdog.

The allocator relies on the wd_ops structure, which is only initialized
if the LAPIC NMI watchdog is enabled. If it is not enabled, the
functions that convert from absolute perfctr/evntsel addresses to
offsets from the respective base address break, and always return 0.

> > Fix the performance counter allocator by making it independent of the
> > LAPIC NMI watchdog. This also fixes a bug in the LAPIC NMI watchdog
> > which allocated the wrong performance counter on CPUs with PerfMon
> > support.
> 
> Combining code movement with functional fixes makes it impossible
> to review properly. Don't do that please.

Ok.

Björn

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [perfmon] Re: [PATCH 1/2] Separate the performance counter allocation from the LAPIC NMI watchdog
  2007-06-20 13:01                   ` Andi Kleen
@ 2007-06-20 18:33                     ` Björn Steinbrink
  2007-06-20 18:34                       ` [PATCH 1/2] Always probe the " Björn Steinbrink
                                         ` (2 more replies)
  0 siblings, 3 replies; 28+ messages in thread
From: Björn Steinbrink @ 2007-06-20 18:33 UTC (permalink / raw)
  To: Andi Kleen
  Cc: eranian, mingo, linux-kernel, levon, perfmon, oprofile-list,
	wcohen, akpm

On 2007.06.20 15:01:02 +0200, Andi Kleen wrote:
> 
> > Once this is fixed (which is what Bjorn did), then I will agree with you.
> > For this, the allocator needs to be able to probe the CPU and initialize
> > its own data structures.
> 
> Ok that sounds reasonable. Please someone send a patch that does only 
> that.

OK, here come the bugfixes without any restructuring. The first patch
enables unconditional probing of the watchdog. The second makes the
perfmon nmi watchdog reserve the correct perfctr/evntsel.

Björn

^ permalink raw reply	[flat|nested] 28+ messages in thread

* [PATCH 1/2] Always probe the NMI watchdog
  2007-06-20 18:33                     ` Björn Steinbrink
@ 2007-06-20 18:34                       ` Björn Steinbrink
  2007-06-25 19:09                         ` Andrew Morton
  2007-06-20 18:35                       ` [PATCH 2/2] Reserve the right performance counter for the Intel PerfMon " Björn Steinbrink
  2007-06-20 21:59                       ` [perfmon] Re: [PATCH 1/2] Separate the performance counter allocation from the LAPIC " Stephane Eranian
  2 siblings, 1 reply; 28+ messages in thread
From: Björn Steinbrink @ 2007-06-20 18:34 UTC (permalink / raw)
  To: Andi Kleen, eranian, mingo, linux-kernel, levon, perfmon,
	oprofile-list, wcohen, akpm

The performance counter allocator relies on the nmi watchdog being
probed, so we have to do that even if the watchdog is not enabled.

Signed-off-by: Björn Steinbrink <B.Steinbrink@gmx.de>
---
 arch/i386/kernel/cpu/perfctr-watchdog.c |   11 +++++------
 arch/i386/kernel/nmi.c                  |    3 +++
 arch/x86_64/kernel/nmi.c                |    3 +++
 include/asm-i386/nmi.h                  |    1 +
 include/asm-x86_64/nmi.h                |    1 +
 5 files changed, 13 insertions(+), 6 deletions(-)

diff --git a/arch/i386/kernel/cpu/perfctr-watchdog.c b/arch/i386/kernel/cpu/perfctr-watchdog.c
index f0b6763..a12dbcf 100644
--- a/arch/i386/kernel/cpu/perfctr-watchdog.c
+++ b/arch/i386/kernel/cpu/perfctr-watchdog.c
@@ -572,7 +572,7 @@ static struct wd_ops intel_arch_wd_ops = {
 	.evntsel = MSR_ARCH_PERFMON_EVENTSEL0,
 };
 
-static void probe_nmi_watchdog(void)
+void probe_nmi_watchdog(void)
 {
 	switch (boot_cpu_data.x86_vendor) {
 	case X86_VENDOR_AMD:
@@ -610,17 +610,16 @@ static void probe_nmi_watchdog(void)
 
 int lapic_watchdog_init(unsigned nmi_hz)
 {
-	if (!wd_ops) {
-		probe_nmi_watchdog();
-		if (!wd_ops)
-			return -1;
+	if (!wd_ops)
+		return -1;
 
+	/* hack to make sure that we only try to reserver the perfctrs once */
+	if (smp_processor_id() == 0)
 		if (!wd_ops->reserve()) {
 			printk(KERN_ERR
 				"NMI watchdog: cannot reserve perfctrs\n");
 			return -1;
 		}
-	}
 
 	if (!(wd_ops->setup(nmi_hz))) {
 		printk(KERN_ERR "Cannot setup NMI watchdog on CPU %d\n",
diff --git a/arch/i386/kernel/nmi.c b/arch/i386/kernel/nmi.c
index fba121f..8788f91 100644
--- a/arch/i386/kernel/nmi.c
+++ b/arch/i386/kernel/nmi.c
@@ -248,6 +248,9 @@ void setup_apic_nmi_watchdog (void *unused)
 	if ((smp_processor_id() != 0) && (atomic_read(&nmi_active) <= 0))
 		return;
 
+	/* always probe the watchdog, the perfctr allocator requires that */
+	probe_nmi_watchdog();
+
 	switch (nmi_watchdog) {
 	case NMI_LOCAL_APIC:
 		__get_cpu_var(wd_enabled) = 1; /* enable it before to avoid race with handler */
diff --git a/arch/x86_64/kernel/nmi.c b/arch/x86_64/kernel/nmi.c
index 931c64b..3229a26 100644
--- a/arch/x86_64/kernel/nmi.c
+++ b/arch/x86_64/kernel/nmi.c
@@ -255,6 +255,9 @@ void setup_apic_nmi_watchdog(void *unused)
 	if ((smp_processor_id() != 0) && (atomic_read(&nmi_active) <= 0))
 		return;
 
+	/* always probe the watchdog, the perfctr allocator requires that */
+	probe_nmi_watchdog();
+
 	switch (nmi_watchdog) {
 	case NMI_LOCAL_APIC:
 		__get_cpu_var(wd_enabled) = 1;
diff --git a/include/asm-i386/nmi.h b/include/asm-i386/nmi.h
index fb1e133..cf87cd0 100644
--- a/include/asm-i386/nmi.h
+++ b/include/asm-i386/nmi.h
@@ -18,6 +18,7 @@
 int do_nmi_callback(struct pt_regs *regs, int cpu);
 
 extern int nmi_watchdog_enabled;
+extern void probe_nmi_watchdog(void);
 extern int avail_to_resrv_perfctr_nmi_bit(unsigned int);
 extern int avail_to_resrv_perfctr_nmi(unsigned int);
 extern int reserve_perfctr_nmi(unsigned int);
diff --git a/include/asm-x86_64/nmi.h b/include/asm-x86_64/nmi.h
index d0a7f53..f61653b 100644
--- a/include/asm-x86_64/nmi.h
+++ b/include/asm-x86_64/nmi.h
@@ -45,6 +45,7 @@ extern int panic_on_timeout;
 extern int unknown_nmi_panic;
 extern int nmi_watchdog_enabled;
 
+extern void probe_nmi_watchdog(void);
 extern int check_nmi_watchdog(void);
 extern int avail_to_resrv_perfctr_nmi_bit(unsigned int);
 extern int avail_to_resrv_perfctr_nmi(unsigned int);
-- 
1.5.2.2


^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [PATCH 2/2] Reserve the right performance counter for the Intel PerfMon NMI watchdog
  2007-06-20 18:33                     ` Björn Steinbrink
  2007-06-20 18:34                       ` [PATCH 1/2] Always probe the " Björn Steinbrink
@ 2007-06-20 18:35                       ` Björn Steinbrink
  2007-06-20 21:59                       ` [perfmon] Re: [PATCH 1/2] Separate the performance counter allocation from the LAPIC " Stephane Eranian
  2 siblings, 0 replies; 28+ messages in thread
From: Björn Steinbrink @ 2007-06-20 18:35 UTC (permalink / raw)
  To: Andi Kleen, eranian, mingo, linux-kernel, levon, perfmon,
	oprofile-list, wcohen, akpm

The Intel PerfMon NMI watchdog was using the generic reservation
function which always reserves the first performance counter. But the
watchdog actually uses the second performance counter, thus we need a
specialised function.

Signed-off-by: Björn Steinbrink <B.Steinbrink@gmx.de>
---
 arch/i386/kernel/cpu/perfctr-watchdog.c |   22 ++++++++++++++++++++--
 1 files changed, 20 insertions(+), 2 deletions(-)

diff --git a/arch/i386/kernel/cpu/perfctr-watchdog.c b/arch/i386/kernel/cpu/perfctr-watchdog.c
index a12dbcf..efc3232 100644
--- a/arch/i386/kernel/cpu/perfctr-watchdog.c
+++ b/arch/i386/kernel/cpu/perfctr-watchdog.c
@@ -562,9 +562,27 @@ static int setup_intel_arch_watchdog(unsigned nmi_hz)
 	return 1;
 }
 
+static int intel_arch_reserve(void)
+{
+	if (!reserve_perfctr_nmi(MSR_ARCH_PERFMON_PERFCTR1))
+		return 0;
+
+	if (!reserve_evntsel_nmi(MSR_ARCH_PERFMON_EVENTSEL1)) {
+		release_perfctr_nmi(MSR_ARCH_PERFMON_PERFCTR1);
+		return 0;
+	}
+	return 1;
+}
+
+static void intel_arch_unreserve(void)
+{
+	release_evntsel_nmi(MSR_ARCH_PERFMON_EVENTSEL1);
+	release_perfctr_nmi(MSR_ARCH_PERFMON_PERFCTR1);
+}
+
 static struct wd_ops intel_arch_wd_ops = {
-	.reserve = single_msr_reserve,
-	.unreserve = single_msr_unreserve,
+	.reserve = intel_arch_reserve,
+	.unreserve = intel_arch_unreserve,
 	.setup = setup_intel_arch_watchdog,
 	.rearm = p6_rearm,
 	.stop = single_msr_stop_watchdog,
-- 
1.5.2.2


^ permalink raw reply related	[flat|nested] 28+ messages in thread

* Re: [perfmon] Re: [PATCH 1/2] Separate the performance counter allocation from the LAPIC NMI watchdog
  2007-06-20 18:33                     ` Björn Steinbrink
  2007-06-20 18:34                       ` [PATCH 1/2] Always probe the " Björn Steinbrink
  2007-06-20 18:35                       ` [PATCH 2/2] Reserve the right performance counter for the Intel PerfMon " Björn Steinbrink
@ 2007-06-20 21:59                       ` Stephane Eranian
  2007-06-21  8:36                         ` Stephane Eranian
  2 siblings, 1 reply; 28+ messages in thread
From: Stephane Eranian @ 2007-06-20 21:59 UTC (permalink / raw)
  To: Björn Steinbrink, Andi Kleen, mingo, linux-kernel, levon,
	perfmon, oprofile-list, wcohen, akpm

Bjorn,

I ran into one issue related with the new allocator.

In the case of a Core 2 Duo processor, the PMU implements more
than just basic counters. In particular it supports fixed counters
and PEBS where both use another set of MSRs. Those are not within
a 66 bit distance from MSR_ARCH_PERFMON_EVNTSEL0. Thus the allocator
fails with an assertion.

I do know that perfmon is the only consumer of those extended 
features TODAY. Yet I think we need to define the allocator such
that it can work with other "distant" MSRs as well.

On Wed, Jun 20, 2007 at 08:33:15PM +0200, Bj?rn Steinbrink wrote:
> On 2007.06.20 15:01:02 +0200, Andi Kleen wrote:
> > 
> > > Once this is fixed (which is what Bjorn did), then I will agree with you.
> > > For this, the allocator needs to be able to probe the CPU and initialize
> > > its own data structures.
> > 
> > Ok that sounds reasonable. Please someone send a patch that does only 
> > that.
> 
> OK, here come the bugfixes without any restructuring. The first patch
> enables unconditional probing of the watchdog. The second makes the
> perfmon nmi watchdog reserve the correct perfctr/evntsel.
> 
> Björn
> 
> -------------------------------------------------------------------------
> This SF.net email is sponsored by DB2 Express
> Download DB2 Express C - the FREE version of DB2 express and take
> control of your XML. No limits. Just data. Click to get it now.
> http://sourceforge.net/powerbar/db2/
> _______________________________________________
> oprofile-list mailing list
> oprofile-list@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/oprofile-list

-- 

-Stephane

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [perfmon] Re: [PATCH 1/2] Separate the performance counter allocation from the LAPIC NMI watchdog
  2007-06-20 21:59                       ` [perfmon] Re: [PATCH 1/2] Separate the performance counter allocation from the LAPIC " Stephane Eranian
@ 2007-06-21  8:36                         ` Stephane Eranian
  2007-06-22  7:13                           ` Björn Steinbrink
  0 siblings, 1 reply; 28+ messages in thread
From: Stephane Eranian @ 2007-06-21  8:36 UTC (permalink / raw)
  To: Björn Steinbrink, Andi Kleen, mingo, linux-kernel, levon,
	perfmon, oprofile-list, wcohen, akpm

Bjorn,


On Wed, Jun 20, 2007 at 02:59:33PM -0700, Stephane Eranian wrote:
> Bjorn,
> 
> I ran into one issue related with the new allocator.
> 
> In the case of a Core 2 Duo processor, the PMU implements more
> than just basic counters. In particular it supports fixed counters
> and PEBS where both use another set of MSRs. Those are not within
> a 66 bit distance from MSR_ARCH_PERFMON_EVNTSEL0. Thus the allocator
> fails with an assertion.
> 
> I do know that perfmon is the only consumer of those extended 
> features TODAY. Yet I think we need to define the allocator such
> that it can work with other "distant" MSRs as well.
> 

I think that a workaround for this issue could be for the allocator
to grant the requests for registers outside of the range, i.e., register
that it does not see/manage.

> On Wed, Jun 20, 2007 at 08:33:15PM +0200, Bj?rn Steinbrink wrote:
> > On 2007.06.20 15:01:02 +0200, Andi Kleen wrote:
> > > 
> > > > Once this is fixed (which is what Bjorn did), then I will agree with you.
> > > > For this, the allocator needs to be able to probe the CPU and initialize
> > > > its own data structures.
> > > 
> > > Ok that sounds reasonable. Please someone send a patch that does only 
> > > that.
> > 
> > OK, here come the bugfixes without any restructuring. The first patch
> > enables unconditional probing of the watchdog. The second makes the
> > perfmon nmi watchdog reserve the correct perfctr/evntsel.
> > 
> > Björn
> > 
> > -------------------------------------------------------------------------
> > This SF.net email is sponsored by DB2 Express
> > Download DB2 Express C - the FREE version of DB2 express and take
> > control of your XML. No limits. Just data. Click to get it now.
> > http://sourceforge.net/powerbar/db2/
> > _______________________________________________
> > oprofile-list mailing list
> > oprofile-list@lists.sourceforge.net
> > https://lists.sourceforge.net/lists/listinfo/oprofile-list
> 
> -- 
> 
> -Stephane
> _______________________________________________
> perfmon mailing list
> perfmon@linux.hpl.hp.com
> http://www.hpl.hp.com/hosted/linux/mail-archives/perfmon/

-- 

-Stephane

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [perfmon] Re: [PATCH 1/2] Separate the performance counter allocation from the LAPIC NMI watchdog
  2007-06-21  8:36                         ` Stephane Eranian
@ 2007-06-22  7:13                           ` Björn Steinbrink
  2007-06-22 10:02                             ` Stephane Eranian
  0 siblings, 1 reply; 28+ messages in thread
From: Björn Steinbrink @ 2007-06-22  7:13 UTC (permalink / raw)
  To: Stephane Eranian
  Cc: Andi Kleen, mingo, linux-kernel, levon, perfmon, oprofile-list,
	wcohen, akpm

Hi Stephane,

On 2007.06.21 01:36:45 -0700, Stephane Eranian wrote:
> Bjorn,
> 
> 
> On Wed, Jun 20, 2007 at 02:59:33PM -0700, Stephane Eranian wrote:
> > Bjorn,
> > 
> > I ran into one issue related with the new allocator.

Should be the same with 2.6.21 and earlier, the "new" allocator should
do exactly the samething, it just fixes the breakage introduced in the
post-2.6.21 cleanup.

> > In the case of a Core 2 Duo processor, the PMU implements more
> > than just basic counters. In particular it supports fixed counters
> > and PEBS where both use another set of MSRs. Those are not within
> > a 66 bit distance from MSR_ARCH_PERFMON_EVNTSEL0. Thus the allocator
> > fails with an assertion.

How far away are they?

> > 
> > I do know that perfmon is the only consumer of those extended 
> > features TODAY. Yet I think we need to define the allocator such
> > that it can work with other "distant" MSRs as well.
> > 
> 
> I think that a workaround for this issue could be for the allocator
> to grant the requests for registers outside of the range, i.e., register
> that it does not see/manage.

That would also allow multiple subsystems to use them at the same time.
And whoever adds the second user of those MSRs might not be aware of the
just-grant-it policy of the allocator. And bugs that arise due to such
problems will probably become a real PITA to track down.

Unfortunately, I don't see any elegant solution to this atm, and of
course making your code simply circumvent the allocator isn't an option
either.

Thanks,
Björn

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [perfmon] Re: [PATCH 1/2] Separate the performance counter allocation from the LAPIC NMI watchdog
  2007-06-22  7:13                           ` Björn Steinbrink
@ 2007-06-22 10:02                             ` Stephane Eranian
  0 siblings, 0 replies; 28+ messages in thread
From: Stephane Eranian @ 2007-06-22 10:02 UTC (permalink / raw)
  To: Björn Steinbrink, Andi Kleen, mingo, linux-kernel, levon,
	perfmon, oprofile-list, wcohen, akpm

Bjorn,

You have the following registers to consider (for P4/Core):

#define MSR_IA32_PEBS_ENABLE            0x000003f1
#define MSR_CORE_PERF_FIXED_CTR0        0x00000309
#define MSR_CORE_PERF_FIXED_CTR1        0x0000030a
#define MSR_CORE_PERF_FIXED_CTR2        0x0000030b
#define MSR_CORE_PERF_FIXED_CTR_CTRL    0x0000038d
#define MSR_CORE_PERF_GLOBAL_STATUS     0x0000038e
#define MSR_CORE_PERF_GLOBAL_CTRL       0x0000038f
#define MSR_CORE_PERF_GLOBAL_OVF_CTRL   0x00000390


With Barcelona, you'll also have to consider:

#define MSR_AMD64_IBSFETCHCTL           0xc0011030
#define MSR_AMD64_IBSFETCHLINAD         0xc0011031
#define MSR_AMD64_IBSFETCHPHYSAD        0xc0011032
#define MSR_AMD64_IBSOPCTL              0xc0011033
#define MSR_AMD64_IBSOPRIP              0xc0011034
#define MSR_AMD64_IBSOPDATA             0xc0011035
#define MSR_AMD64_IBSOPDATA2            0xc0011036
#define MSR_AMD64_IBSOPDATA3            0xc0011037
#define MSR_AMD64_IBSDCLINAD            0xc0011038
#define MSR_AMD64_IBSDCPHYSAD           0xc0011039
#define MSR_AMD64_IBSCTL                0xc001103A

I think you could organize by groups with a bitmap 
per group and some sorted linked list to keep track
of all of them.


On Fri, Jun 22, 2007 at 09:13:54AM +0200, Bj?rn Steinbrink wrote:
> Hi Stephane,
> 
> On 2007.06.21 01:36:45 -0700, Stephane Eranian wrote:
> > Bjorn,
> > 
> > 
> > On Wed, Jun 20, 2007 at 02:59:33PM -0700, Stephane Eranian wrote:
> > > Bjorn,
> > > 
> > > I ran into one issue related with the new allocator.
> 
> Should be the same with 2.6.21 and earlier, the "new" allocator should
> do exactly the samething, it just fixes the breakage introduced in the
> post-2.6.21 cleanup.
> 
> > > In the case of a Core 2 Duo processor, the PMU implements more
> > > than just basic counters. In particular it supports fixed counters
> > > and PEBS where both use another set of MSRs. Those are not within
> > > a 66 bit distance from MSR_ARCH_PERFMON_EVNTSEL0. Thus the allocator
> > > fails with an assertion.
> 
> How far away are they?
> 
> > > 
> > > I do know that perfmon is the only consumer of those extended 
> > > features TODAY. Yet I think we need to define the allocator such
> > > that it can work with other "distant" MSRs as well.
> > > 
> > 
> > I think that a workaround for this issue could be for the allocator
> > to grant the requests for registers outside of the range, i.e., register
> > that it does not see/manage.
> 
> That would also allow multiple subsystems to use them at the same time.
> And whoever adds the second user of those MSRs might not be aware of the
> just-grant-it policy of the allocator. And bugs that arise due to such
> problems will probably become a real PITA to track down.
> 
> Unfortunately, I don't see any elegant solution to this atm, and of
> course making your code simply circumvent the allocator isn't an option
> either.
> 
> Thanks,
> Björn
> 
> -------------------------------------------------------------------------
> This SF.net email is sponsored by DB2 Express
> Download DB2 Express C - the FREE version of DB2 express and take
> control of your XML. No limits. Just data. Click to get it now.
> http://sourceforge.net/powerbar/db2/
> _______________________________________________
> oprofile-list mailing list
> oprofile-list@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/oprofile-list

-- 

-Stephane

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 1/2] Always probe the NMI watchdog
  2007-06-20 18:34                       ` [PATCH 1/2] Always probe the " Björn Steinbrink
@ 2007-06-25 19:09                         ` Andrew Morton
  2007-06-25 19:36                           ` Andi Kleen
  0 siblings, 1 reply; 28+ messages in thread
From: Andrew Morton @ 2007-06-25 19:09 UTC (permalink / raw)
  To: Björn Steinbrink
  Cc: Andi Kleen, eranian, mingo, linux-kernel, levon, perfmon,
	oprofile-list, wcohen

On Wed, 20 Jun 2007 20:34:48 +0200
Bj__rn Steinbrink <B.Steinbrink@gmx.de> wrote:

> The performance counter allocator relies on the nmi watchdog being
> probed, so we have to do that even if the watchdog is not enabled.
> 

So...  what's the status of this lot?

I've just merged this patch and the second one:

Subject: [PATCH 2/2] Reserve the right performance counter for the Intel PerfMon NMI watchdog
Message-ID: <20070620183551.GC3251@atjola.homenet>

but there was no followup discussion afaict.

Andi, Stephane: acks?

If acked, do we agree that this is 2.6.22 material?

Thanks.

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 1/2] Always probe the NMI watchdog
  2007-06-25 19:09                         ` Andrew Morton
@ 2007-06-25 19:36                           ` Andi Kleen
  2007-06-25 20:01                             ` Stephane Eranian
  2007-06-25 21:06                             ` Björn Steinbrink
  0 siblings, 2 replies; 28+ messages in thread
From: Andi Kleen @ 2007-06-25 19:36 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Björn Steinbrink, eranian, mingo, linux-kernel, levon,
	perfmon, oprofile-list, wcohen

On Monday 25 June 2007 21:09, Andrew Morton wrote:
> On Wed, 20 Jun 2007 20:34:48 +0200
>
> Bj__rn Steinbrink <B.Steinbrink@gmx.de> wrote:
> > The performance counter allocator relies on the nmi watchdog being
> > probed, so we have to do that even if the watchdog is not enabled.
>
> So...  what's the status of this lot?
>
> I've just merged this patch and the second one:
>
> Subject: [PATCH 2/2] Reserve the right performance counter for the Intel
> PerfMon NMI watchdog Message-ID: <20070620183551.GC3251@atjola.homenet>
>
> but there was no followup discussion afaict.
>
> Andi, Stephane: acks?

Yes, although I'm still a little uneasy about the always probe one.

> If acked, do we agree that this is 2.6.22 material?

The first (always probe) is probably .22 material, but needs more testing 
first.

-Andi

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 1/2] Always probe the NMI watchdog
  2007-06-25 19:36                           ` Andi Kleen
@ 2007-06-25 20:01                             ` Stephane Eranian
  2007-06-25 20:36                               ` Andi Kleen
  2007-06-25 21:04                               ` Björn Steinbrink
  2007-06-25 21:06                             ` Björn Steinbrink
  1 sibling, 2 replies; 28+ messages in thread
From: Stephane Eranian @ 2007-06-25 20:01 UTC (permalink / raw)
  To: Andi Kleen
  Cc: b.steinbrink, ingo, linux-kernel, levon, perfmon, oprofile-list,
	wcohen, Stephane Eranian

Hi,
On Mon, Jun 25, 2007 at 09:36:17PM +0200, Andi Kleen wrote:
> On Monday 25 June 2007 21:09, Andrew Morton wrote:
> > On Wed, 20 Jun 2007 20:34:48 +0200
> >
> > Bj__rn Steinbrink <B.Steinbrink@gmx.de> wrote:
> > > The performance counter allocator relies on the nmi watchdog being
> > > probed, so we have to do that even if the watchdog is not enabled.
> >
> > So...  what's the status of this lot?
> >
> > I've just merged this patch and the second one:
> >
> > Subject: [PATCH 2/2] Reserve the right performance counter for the Intel
> > PerfMon NMI watchdog Message-ID: <20070620183551.GC3251@atjola.homenet>
> >
> > but there was no followup discussion afaict.
> >
> > Andi, Stephane: acks?
> 
> Yes, although I'm still a little uneasy about the always probe one.
> 

I looked at the code I have in my tree coming from Bjon's patches and
I am a bit confused by the flow for probing as well.

The register allocator works globally, i.e., you reserve a register
for all CPUs at once.

The probe_nmi_watchdog() routine simply probes the CPU type to initialize
the watchdog data structure (wd_ops). This needs to be done once and for all.
Why put it in a route that is called with on_each_cpu()?

I think the tricky part is that we do want to reserve perfctr1 even though
the NMI watchdog is not active. This comes from the fact that the NMI watchdog
knows about only one counter and if it can't get that one, it probably fails.
By reserving it from the start, we ensure NMI watchdog will work when eventually
activated.

Unlike sharing between Oprofile and perfmon which works by enforcing
mutual exclusion between the two subsystems, the NMI watchdog must work
concurrently with either Oprofile or Perfmon.

Bjorn, did I understand the constraints correctly?

-- 
-Stephane

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 1/2] Always probe the NMI watchdog
  2007-06-25 20:01                             ` Stephane Eranian
@ 2007-06-25 20:36                               ` Andi Kleen
  2007-06-25 21:04                               ` Björn Steinbrink
  1 sibling, 0 replies; 28+ messages in thread
From: Andi Kleen @ 2007-06-25 20:36 UTC (permalink / raw)
  To: eranian
  Cc: b.steinbrink, ingo, linux-kernel, levon, perfmon, oprofile-list, wcohen


> I looked at the code I have in my tree coming from Bjon's patches and
> I am a bit confused by the flow for probing as well.

Yes, it's a little risky. Perhaps it's better to readd the separate CPU switch
from .21 there again for 2.6.22. Ugly, but should be safe

-Andi

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 1/2] Always probe the NMI watchdog
  2007-06-25 20:01                             ` Stephane Eranian
  2007-06-25 20:36                               ` Andi Kleen
@ 2007-06-25 21:04                               ` Björn Steinbrink
  1 sibling, 0 replies; 28+ messages in thread
From: Björn Steinbrink @ 2007-06-25 21:04 UTC (permalink / raw)
  To: Stephane Eranian
  Cc: Andi Kleen, ingo, linux-kernel, levon, perfmon, oprofile-list, wcohen

On 2007.06.25 13:01:58 -0700, Stephane Eranian wrote:
> Hi,
> On Mon, Jun 25, 2007 at 09:36:17PM +0200, Andi Kleen wrote:
> > On Monday 25 June 2007 21:09, Andrew Morton wrote:
> > > On Wed, 20 Jun 2007 20:34:48 +0200
> > >
> > > Bj__rn Steinbrink <B.Steinbrink@gmx.de> wrote:
> > > > The performance counter allocator relies on the nmi watchdog being
> > > > probed, so we have to do that even if the watchdog is not enabled.
> > >
> > > So...  what's the status of this lot?
> > >
> > > I've just merged this patch and the second one:
> > >
> > > Subject: [PATCH 2/2] Reserve the right performance counter for the Intel
> > > PerfMon NMI watchdog Message-ID: <20070620183551.GC3251@atjola.homenet>
> > >
> > > but there was no followup discussion afaict.
> > >
> > > Andi, Stephane: acks?
> > 
> > Yes, although I'm still a little uneasy about the always probe one.
> > 
> 
> I looked at the code I have in my tree coming from Bjon's patches and
> I am a bit confused by the flow for probing as well.
> 
> The register allocator works globally, i.e., you reserve a register
> for all CPUs at once.
> 
> The probe_nmi_watchdog() routine simply probes the CPU type to initialize
> the watchdog data structure (wd_ops). This needs to be done once and for all.
> Why put it in a route that is called with on_each_cpu()?

Ehrm, that's a good question actually... I moved the probing call up
into setup_local_APIC in that other big patch and put a check at the
start of the probing function, so that it is executed only once. No idea
why I did it so weird in this one.

> I think the tricky part is that we do want to reserve perfctr1 even
> though the NMI watchdog is not active. This comes from the fact that
> the NMI watchdog knows about only one counter and if it can't get that
> one, it probably fails.  By reserving it from the start, we ensure NMI
> watchdog will work when eventually activated.

Can you enable it later on at all? It failed for me when I tried,
because it didn't know which hardware to use. Had to pass the kernel
parameter to make the proc files do anything. Seems like it has to be
enable at boot to work at all.

And AFAICT we never unconditionally reserved a perfctr for the watchdog.

> Unlike sharing between Oprofile and perfmon which works by enforcing
> mutual exclusion between the two subsystems, the NMI watchdog must
> work concurrently with either Oprofile or Perfmon.

In 2.6.21 the nmi watchdog, if enabled, just reserved its perfctrs and
everything else had to deal with it. Since the cleanup, the watchdog
will release its perfctr when disabled, so another subsystem can grab
it. But that also means that that other subsystem must release it again
before you can reenable the watchdog.

> Bjorn, did I understand the constraints correctly?

I'll tell you, once I'm sure that I understood them correctly ;-)

Björn

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 1/2] Always probe the NMI watchdog
  2007-06-25 19:36                           ` Andi Kleen
  2007-06-25 20:01                             ` Stephane Eranian
@ 2007-06-25 21:06                             ` Björn Steinbrink
  1 sibling, 0 replies; 28+ messages in thread
From: Björn Steinbrink @ 2007-06-25 21:06 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Andrew Morton, eranian, mingo, linux-kernel, levon, perfmon,
	oprofile-list, wcohen

On 2007.06.25 21:36:17 +0200, Andi Kleen wrote:
> On Monday 25 June 2007 21:09, Andrew Morton wrote:
> > On Wed, 20 Jun 2007 20:34:48 +0200
> >
> > Bj__rn Steinbrink <B.Steinbrink@gmx.de> wrote:
> > > The performance counter allocator relies on the nmi watchdog being
> > > probed, so we have to do that even if the watchdog is not enabled.
> >
> > So...  what's the status of this lot?
> >
> > I've just merged this patch and the second one:
> >
> > Subject: [PATCH 2/2] Reserve the right performance counter for the Intel
> > PerfMon NMI watchdog Message-ID: <20070620183551.GC3251@atjola.homenet>
> >
> > but there was no followup discussion afaict.
> >
> > Andi, Stephane: acks?
> 
> Yes, although I'm still a little uneasy about the always probe one.
> 
> > If acked, do we agree that this is 2.6.22 material?
> 
> The first (always probe) is probably .22 material, but needs more testing 
> first.

Hm, without the second, I expect OProfile to break when the watchdog is
enabled. Alternatively to the patch I sent, we could revert the change
that makes it use perfctr1 instead of perfctr0. Would you prefer that?

Björn

^ permalink raw reply	[flat|nested] 28+ messages in thread

end of thread, other threads:[~2007-06-25 21:06 UTC | newest]

Thread overview: 28+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2007-06-12 15:02 OProfile issues Stephane Eranian
2007-06-12 18:37 ` Chris Wright
2007-06-12 18:38 ` Chuck Ebbert
2007-06-12 19:07 ` Björn Steinbrink
2007-06-13  1:41   ` [PATCH] Separate performance counter reservation from nmi watchdog Björn Steinbrink
2007-06-13 16:46     ` Björn Steinbrink
2007-06-18  9:52       ` Stephane Eranian
2007-06-18 10:32         ` Björn Steinbrink
     [not found]           ` <11823357571842-git-send-email->
2007-06-20 10:35             ` [PATCH 1/2] Separate the performance counter allocation from the LAPIC NMI watchdog Björn Steinbrink
2007-06-20 10:35               ` [PATCH 2/2] Finish separation of the performance counter allocator from the " Björn Steinbrink
2007-06-20 12:31               ` [PATCH 1/2] Separate the performance counter allocation from the LAPIC " Andi Kleen
2007-06-20 12:49                 ` [perfmon] " Stephane Eranian
2007-06-20 13:01                   ` Andi Kleen
2007-06-20 18:33                     ` Björn Steinbrink
2007-06-20 18:34                       ` [PATCH 1/2] Always probe the " Björn Steinbrink
2007-06-25 19:09                         ` Andrew Morton
2007-06-25 19:36                           ` Andi Kleen
2007-06-25 20:01                             ` Stephane Eranian
2007-06-25 20:36                               ` Andi Kleen
2007-06-25 21:04                               ` Björn Steinbrink
2007-06-25 21:06                             ` Björn Steinbrink
2007-06-20 18:35                       ` [PATCH 2/2] Reserve the right performance counter for the Intel PerfMon " Björn Steinbrink
2007-06-20 21:59                       ` [perfmon] Re: [PATCH 1/2] Separate the performance counter allocation from the LAPIC " Stephane Eranian
2007-06-21  8:36                         ` Stephane Eranian
2007-06-22  7:13                           ` Björn Steinbrink
2007-06-22 10:02                             ` Stephane Eranian
2007-06-20 13:18                 ` Björn Steinbrink
2007-06-20 10:49           ` [PATCH 0/2] Performance counter allocator separation Björn Steinbrink

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).