All of lore.kernel.org
 help / color / mirror / Atom feed
* 2.6.31-rc5 regression: x86 MCE malfunction on Thinkpad T42p
@ 2009-08-07 17:09 Johannes Stezenbach
  2009-08-09 10:03 ` Johannes Stezenbach
  2009-08-10 10:31 ` Andi Kleen
  0 siblings, 2 replies; 25+ messages in thread
From: Johannes Stezenbach @ 2009-08-07 17:09 UTC (permalink / raw)
  To: x86; +Cc: linux-kernel, Rafael J. Wysocki

Hi,

I'm currently running linux-2.6.31-rc5-246-g90bc1a6 on
an old Thinkpad T42p.  During boot I get the following:

   Local APIC disabled by BIOS -- you can enable it with "lapic"
   APIC: disable apic facility
   ...
   mce: CPU supports 5 MCE banks
   Disabling lock debugging due to kernel taint
   ------------[ cut here ]------------
   WARNING: at arch/x86/kernel/apic/apic.c:247 native_apic_write_dummy+0x2d/0x39()
   Hardware name: 2373Y4M
   Modules linked in:
   Pid: 0, comm: swapper Tainted: G   M       2.6.31-rc5 #1
   Call Trace:
    [<c10248c1>] warn_slowpath_common+0x60/0x90
    [<c10248fe>] warn_slowpath_null+0xd/0x10
    [<c1013139>] native_apic_write_dummy+0x2d/0x39
    [<c100dcd2>] intel_init_thermal+0xb6/0x144
    [<c100d517>] ? mce_init+0x33/0xb0
    [<c100db4b>] mce_intel_feature_init+0xb/0x4c
    [<c14fc31e>] mcheck_init+0x1e2/0x253
    [<c14faef4>] identify_cpu+0x30b/0x31b
    [<c14d9af0>] identify_boot_cpu+0xd/0x23
    [<c14d9b3c>] check_bugs+0xb/0xd4
    [<c104f929>] ? delayacct_init+0x42/0x49
    [<c14d493c>] start_kernel+0x25e/0x26d
    [<c14d430b>] i386_start_kernel+0x65/0x6a
   ---[ end trace 4eaa2a86a8e2da22 ]---
   ...
   CPU: Intel(R) Pentium(R) M processor 1.80GHz stepping 06


mcelog reports:

   HARDWARE ERROR. This is *NOT* a software problem!
   Please contact your hardware vendor
   MCE 0
   CPU 0 BANK 1 
   TIME 1249662514 Fri Aug  7 18:28:34 2009
   MCG status:
   MCi status:
   Error overflow
   Uncorrected error
   Error enabled
   Processor context corrupt
   MCA: Unknown Error 30
   STATUS f200000000000030 MCGSTATUS 0
   MCGCAP 5 APICID 0 SOCKETID 0 
   CPUID Vendor Intel Family 6 Model 13


In .config I have:

   CONFIG_X86_UP_APIC=y
   # CONFIG_X86_UP_IOAPIC is not set
   CONFIG_X86_LOCAL_APIC=y
   CONFIG_X86_IO_APIC=y
   # CONFIG_X86_REROUTE_FOR_BROKEN_BOOT_IRQS is not set
   CONFIG_X86_MCE=y
   # CONFIG_X86_OLD_MCE is not set
   CONFIG_X86_NEW_MCE=y
   CONFIG_X86_MCE_INTEL=y
   # CONFIG_X86_MCE_AMD is not set
   # CONFIG_X86_ANCIENT_MCE is not set
   CONFIG_X86_MCE_THRESHOLD=y
   # CONFIG_X86_MCE_INJECT is not set
   CONFIG_X86_THERMAL_VECTOR=y


I guess I should try to boot with "lapic"?  But I think
MCE worked without "lapic" in earlier kernels. On a 2.6.29.1
kernel dmesg said:

   Local APIC disabled by BIOS -- you can enable it with "lapic"
   ...
   Intel machine check architecture supported.
   Intel machine check reporting enabled on CPU#0.

2.6.29.1 doesn't log any MCE events, so I doubt this is a HW problem.


Johannes

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: 2.6.31-rc5 regression: x86 MCE malfunction on Thinkpad T42p
  2009-08-07 17:09 2.6.31-rc5 regression: x86 MCE malfunction on Thinkpad T42p Johannes Stezenbach
@ 2009-08-09 10:03 ` Johannes Stezenbach
  2009-08-09 10:34   ` Bartlomiej Zolnierkiewicz
  2009-08-10 10:31 ` Andi Kleen
  1 sibling, 1 reply; 25+ messages in thread
From: Johannes Stezenbach @ 2009-08-09 10:03 UTC (permalink / raw)
  To: x86; +Cc: linux-kernel, Rafael J. Wysocki

On Fri, Aug 07, 2009 at 07:09:42PM +0200, Johannes Stezenbach wrote:
> 
> I'm currently running linux-2.6.31-rc5-246-g90bc1a6 on
> an old Thinkpad T42p.  During boot I get the following:
...
> I guess I should try to boot with "lapic"?  But I think
> MCE worked without "lapic" in earlier kernels. On a 2.6.29.1
> kernel dmesg said:
> 
>    Local APIC disabled by BIOS -- you can enable it with "lapic"
>    ...
>    Intel machine check architecture supported.
>    Intel machine check reporting enabled on CPU#0.
> 
> 2.6.29.1 doesn't log any MCE events, so I doubt this is a HW problem.

I booted with "lapic", then the backtrace is gone
but it still logs machine check events which I think
are bogus since 2.6.29.1 does not log any.  I also
tried 2.6.30, no machine check messages in dmesg.  But
I noticed that mcelog complains about missing /dev/mcelog,
it seems that /dev/mcelog support for 32bit kernels is new.
However, I guess the old kernels should still have printed
a messge to dmesg, right?  "Uncorrected error" + "Processor
context corrupt" sounds pretty serious, but the machine
runs without problems.


These are from 2.6.31-rc5 by running mcelog (dmesg
just says "Machine check events logged"):


HARDWARE ERROR. This is *NOT* a software problem!
Please contact your hardware vendor
MCE 0
CPU 0 BANK 1 
TIME 1249725930 Sat Aug  8 12:05:30 2009
MCG status:
MCi status:
Error overflow
Uncorrected error
Error enabled
Processor context corrupt
MCA: MEMORY CONTROLLER AC_CHANNEL0_ERR
Transaction: Address/Command error
STATUS f2000000000000b0 MCGSTATUS 0
MCGCAP 5 APICID 0 SOCKETID 0 
CPUID Vendor Intel Family 6 Model 13
HARDWARE ERROR. This is *NOT* a software problem!
Please contact your hardware vendor
MCE 1
CPU 0 BANK 1 
TIME 1249728066 Sat Aug  8 12:41:06 2009
MCG status:
MCi status:
Error overflow
Uncorrected error
Error enabled
Processor context corrupt
MCA: Unknown Error 20
STATUS f200000000000020 MCGSTATUS 0
MCGCAP 5 APICID 0 SOCKETID 0 
CPUID Vendor Intel Family 6 Model 13
HARDWARE ERROR. This is *NOT* a software problem!
Please contact your hardware vendor
MCE 2
CPU 0 BANK 1 
TIME 1249747923 Sat Aug  8 18:12:03 2009
MCG status:
MCi status:
Error overflow
Uncorrected error
Error enabled
Processor context corrupt
MCA: Unknown Error 30
STATUS f200000000000030 MCGSTATUS 0
MCGCAP 5 APICID 0 SOCKETID 0 
CPUID Vendor Intel Family 6 Model 13
HARDWARE ERROR. This is *NOT* a software problem!
Please contact your hardware vendor
MCE 3
CPU 0 BANK 1 
TIME 1249765938 Sat Aug  8 23:12:18 2009
MCG status:
MCi status:
Error overflow
Uncorrected error
Error enabled
Processor context corrupt
MCA: Unknown Error 20
STATUS f200000000000020 MCGSTATUS 0
MCGCAP 5 APICID 0 SOCKETID 0 
CPUID Vendor Intel Family 6 Model 13


Johannes

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: 2.6.31-rc5 regression: x86 MCE malfunction on Thinkpad T42p
  2009-08-09 10:03 ` Johannes Stezenbach
@ 2009-08-09 10:34   ` Bartlomiej Zolnierkiewicz
  2009-08-09 16:47     ` Johannes Stezenbach
  0 siblings, 1 reply; 25+ messages in thread
From: Bartlomiej Zolnierkiewicz @ 2009-08-09 10:34 UTC (permalink / raw)
  To: Johannes Stezenbach; +Cc: x86, linux-kernel, Rafael J. Wysocki


On Sunday 09 August 2009 12:03:48 Johannes Stezenbach wrote:
> On Fri, Aug 07, 2009 at 07:09:42PM +0200, Johannes Stezenbach wrote:
> > 
> > I'm currently running linux-2.6.31-rc5-246-g90bc1a6 on
> > an old Thinkpad T42p.  During boot I get the following:
> ...
> > I guess I should try to boot with "lapic"?  But I think
> > MCE worked without "lapic" in earlier kernels. On a 2.6.29.1
> > kernel dmesg said:
> > 
> >    Local APIC disabled by BIOS -- you can enable it with "lapic"
> >    ...
> >    Intel machine check architecture supported.
> >    Intel machine check reporting enabled on CPU#0.
> > 
> > 2.6.29.1 doesn't log any MCE events, so I doubt this is a HW problem.

http://patchwork.kernel.org/patch/37908/

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: 2.6.31-rc5 regression: x86 MCE malfunction on Thinkpad T42p
  2009-08-09 10:34   ` Bartlomiej Zolnierkiewicz
@ 2009-08-09 16:47     ` Johannes Stezenbach
  0 siblings, 0 replies; 25+ messages in thread
From: Johannes Stezenbach @ 2009-08-09 16:47 UTC (permalink / raw)
  To: Bartlomiej Zolnierkiewicz; +Cc: x86, linux-kernel, Rafael J. Wysocki

On Sun, Aug 09, 2009 at 12:34:03PM +0200, Bartlomiej Zolnierkiewicz wrote:
> On Sunday 09 August 2009 12:03:48 Johannes Stezenbach wrote:
> > On Fri, Aug 07, 2009 at 07:09:42PM +0200, Johannes Stezenbach wrote:
> > > 
> > > I'm currently running linux-2.6.31-rc5-246-g90bc1a6 on
> > > an old Thinkpad T42p.  During boot I get the following:
> > ...
> > > 
> > > 2.6.29.1 doesn't log any MCE events, so I doubt this is a HW problem.
> 
> http://patchwork.kernel.org/patch/37908/

This patch fixes the issue for me.


Thanks,
Johannes

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: 2.6.31-rc5 regression: x86 MCE malfunction on Thinkpad T42p
  2009-08-07 17:09 2.6.31-rc5 regression: x86 MCE malfunction on Thinkpad T42p Johannes Stezenbach
  2009-08-09 10:03 ` Johannes Stezenbach
@ 2009-08-10 10:31 ` Andi Kleen
  2009-08-10 12:27   ` Johannes Stezenbach
  2009-08-12 11:59   ` *PING* [PATCH]: x86: mce: fix mce warning with disabled lapic Ingo Molnar
  1 sibling, 2 replies; 25+ messages in thread
From: Andi Kleen @ 2009-08-10 10:31 UTC (permalink / raw)
  To: Johannes Stezenbach; +Cc: x86, linux-kernel, Rafael J. Wysocki

Johannes Stezenbach <js@sig21.net> writes:

> Hi,
>
> I'm currently running linux-2.6.31-rc5-246-g90bc1a6 on
> an old Thinkpad T42p.  During boot I get the following:

Thanks for the report.

>
>    Local APIC disabled by BIOS -- you can enable it with "lapic"
>    APIC: disable apic facility
>    ...
>    mce: CPU supports 5 MCE banks
>    Disabling lock debugging due to kernel taint
>    ------------[ cut here ]------------
>    WARNING: at arch/x86/kernel/apic/apic.c:247 native_apic_write_dummy+0x2d/0x39()
>    Hardware name: 2373Y4M
>    Modules linked in:
>    Pid: 0, comm: swapper Tainted: G   M       2.6.31-rc5 #1

The mcelog below is already worked around with Bart's patch he posted
the link to (it's really a BIOS bug in your case that the BIOS leaves
junks in the machine check registers on boot)

[for the x86 maintainers:]
One thing that would be good to make sure that Bart's patch is queued
for .31 too, not only for .32, since this BIOS problem seems
to be common (already two reports)

But still need to fix that warning too, which is independent
[another .31 candidate]

>    Call Trace:
>     [<c10248c1>] warn_slowpath_common+0x60/0x90
>     [<c10248fe>] warn_slowpath_null+0xd/0x10
>     [<c1013139>] native_apic_write_dummy+0x2d/0x39
>     [<c100dcd2>] intel_init_thermal+0xb6/0x144
>     [<c100d517>] ? mce_init+0x33/0xb0
>     [<c100db4b>] mce_intel_feature_init+0xb/0x4c
>     [<c14fc31e>] mcheck_init+0x1e2/0x253
>     [<c14faef4>] identify_cpu+0x30b/0x31b
>     [<c14d9af0>] identify_boot_cpu+0xd/0x23
>     [<c14d9b3c>] check_bugs+0xb/0xd4
>     [<c104f929>] ? delayacct_init+0x42/0x49
>     [<c14d493c>] start_kernel+0x25e/0x26d
>     [<c14d430b>] i386_start_kernel+0x65/0x6a
>    ---[ end trace 4eaa2a86a8e2da22 ]---


The appended patch should remove the warning. Can you please test it?

> 2.6.29.1 doesn't log any MCE events, so I doubt this is a HW problem.

It actually is a BIOS bug, but not really broken hardware.

-Andi

---

Don't try to enable thermal throttling on 32bit systems without apic

When the local APIC isn't enabled don't try to enable thermal throttling.
The APIC writes would WARN_ON.

Fixes 

>    Disabling lock debugging due to kernel taint
>    ------------[ cut here ]------------
>    WARNING: at arch/x86/kernel/apic/apic.c:247 native_apic_write_dummy+0x2d/0x39()
>    Hardware name: 2373Y4M
>    Modules linked in:
>    Pid: 0, comm: swapper Tainted: G   M       2.6.31-rc5 #1

Originally reported by Johannes Stezenbach 

This is a 2.6.31 candidate because it fixes a regression.

Signed-off-by: Andi Kleen <ak@linux.intel.com>

---
 arch/x86/kernel/cpu/mcheck/therm_throt.c |    3 +++
 1 file changed, 3 insertions(+)

Index: linux/arch/x86/kernel/cpu/mcheck/therm_throt.c
===================================================================
--- linux.orig/arch/x86/kernel/cpu/mcheck/therm_throt.c
+++ linux/arch/x86/kernel/cpu/mcheck/therm_throt.c
@@ -236,6 +236,9 @@ void intel_init_thermal(struct cpuinfo_x
 	int tm2 = 0;
 	u32 l, h;
 
+	if (!cpu_has_apic || disable_apic)
+		return;
+
 	/* Thermal monitoring depends on ACPI and clock modulation*/
 	if (!cpu_has(c, X86_FEATURE_ACPI) || !cpu_has(c, X86_FEATURE_ACC))
 		return;



-- 
ak@linux.intel.com -- Speaking for myself only.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: 2.6.31-rc5 regression: x86 MCE malfunction on Thinkpad T42p
  2009-08-10 10:31 ` Andi Kleen
@ 2009-08-10 12:27   ` Johannes Stezenbach
  2009-08-10 12:32     ` Andi Kleen
  2009-08-12 11:59   ` *PING* [PATCH]: x86: mce: fix mce warning with disabled lapic Ingo Molnar
  1 sibling, 1 reply; 25+ messages in thread
From: Johannes Stezenbach @ 2009-08-10 12:27 UTC (permalink / raw)
  To: Andi Kleen; +Cc: x86, linux-kernel, Rafael J. Wysocki

On Mon, Aug 10, 2009 at 12:31:47PM +0200, Andi Kleen wrote:
> Johannes Stezenbach <js@sig21.net> writes:
> 
> >    Call Trace:
> >     [<c10248c1>] warn_slowpath_common+0x60/0x90
> >     [<c10248fe>] warn_slowpath_null+0xd/0x10
> >     [<c1013139>] native_apic_write_dummy+0x2d/0x39
> >     [<c100dcd2>] intel_init_thermal+0xb6/0x144
> >     [<c100d517>] ? mce_init+0x33/0xb0
> >     [<c100db4b>] mce_intel_feature_init+0xb/0x4c
> >     [<c14fc31e>] mcheck_init+0x1e2/0x253
> >     [<c14faef4>] identify_cpu+0x30b/0x31b
> >     [<c14d9af0>] identify_boot_cpu+0xd/0x23
> >     [<c14d9b3c>] check_bugs+0xb/0xd4
> >     [<c104f929>] ? delayacct_init+0x42/0x49
> >     [<c14d493c>] start_kernel+0x25e/0x26d
> >     [<c14d430b>] i386_start_kernel+0x65/0x6a
> >    ---[ end trace 4eaa2a86a8e2da22 ]---
> 
> 
> The appended patch should remove the warning. Can you please test it?

It works (no more warning), but I feel I'm better off using "lapic"
because it enables thermal monitoring and performance counters.


Thanks,
Johannes

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: 2.6.31-rc5 regression: x86 MCE malfunction on Thinkpad T42p
  2009-08-10 12:27   ` Johannes Stezenbach
@ 2009-08-10 12:32     ` Andi Kleen
  2009-08-10 12:56       ` Johannes Stezenbach
  0 siblings, 1 reply; 25+ messages in thread
From: Andi Kleen @ 2009-08-10 12:32 UTC (permalink / raw)
  To: Johannes Stezenbach; +Cc: Andi Kleen, x86, linux-kernel, Rafael J. Wysocki

> > The appended patch should remove the warning. Can you please test it?
> 
> It works (no more warning), but I feel I'm better off using "lapic"

Thanks for testing.

> because it enables thermal monitoring and performance counters.

When the BIOS doesn't enable it then force enabling lapic might not work.

This could cause either boot failures (obvious) or more subtle
problems like SMM doing something unexpected. Just saying that if you have 
strange problems later first try disabling this option again.

-Andi

-- 
ak@linux.intel.com -- Speaking for myself only.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: 2.6.31-rc5 regression: x86 MCE malfunction on Thinkpad T42p
  2009-08-10 12:32     ` Andi Kleen
@ 2009-08-10 12:56       ` Johannes Stezenbach
  2009-08-10 13:29         ` Ingo Molnar
  0 siblings, 1 reply; 25+ messages in thread
From: Johannes Stezenbach @ 2009-08-10 12:56 UTC (permalink / raw)
  To: Andi Kleen; +Cc: x86, linux-kernel, Rafael J. Wysocki

On Mon, Aug 10, 2009 at 02:32:28PM +0200, Andi Kleen wrote:
> > > The appended patch should remove the warning. Can you please test it?
> > 
> > It works (no more warning), but I feel I'm better off using "lapic"
> 
> Thanks for testing.
> 
> > because it enables thermal monitoring and performance counters.
> 
> When the BIOS doesn't enable it then force enabling lapic might not work.
> 
> This could cause either boot failures (obvious) or more subtle
> problems like SMM doing something unexpected. Just saying that if you have 
> strange problems later first try disabling this option again.

Thanks for the heads-up.  I remember I tried to use oprofile
in the past on this machine and was disappointed that it only
got the timer event.  I'll keep lapic for now unless I see
signs of instability.


Johannes

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: 2.6.31-rc5 regression: x86 MCE malfunction on Thinkpad T42p
  2009-08-10 12:56       ` Johannes Stezenbach
@ 2009-08-10 13:29         ` Ingo Molnar
  2009-08-10 19:26           ` Johannes Stezenbach
  0 siblings, 1 reply; 25+ messages in thread
From: Ingo Molnar @ 2009-08-10 13:29 UTC (permalink / raw)
  To: Johannes Stezenbach; +Cc: Andi Kleen, x86, linux-kernel, Rafael J. Wysocki


* Johannes Stezenbach <js@sig21.net> wrote:

> On Mon, Aug 10, 2009 at 02:32:28PM +0200, Andi Kleen wrote:
> > > > The appended patch should remove the warning. Can you please test it?
> > > 
> > > It works (no more warning), but I feel I'm better off using "lapic"
> > 
> > Thanks for testing.
> > 
> > > because it enables thermal monitoring and performance counters.
> > 
> > When the BIOS doesn't enable it then force enabling lapic might not work.
> > 
> > This could cause either boot failures (obvious) or more subtle 
> > problems like SMM doing something unexpected. Just saying that 
> > if you have strange problems later first try disabling this 
> > option again.
> 
> Thanks for the heads-up.  I remember I tried to use oprofile in 
> the past on this machine and was disappointed that it only got the 
> timer event.  I'll keep lapic for now unless I see signs of 
> instability.

What's the output of something like 'perf stat true', and does 'perf 
top' output something - i.e. do perfcounters work in general? Once 
you get to that stage and it works then it should be fine.

	Ingo

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: 2.6.31-rc5 regression: x86 MCE malfunction on Thinkpad T42p
  2009-08-10 13:29         ` Ingo Molnar
@ 2009-08-10 19:26           ` Johannes Stezenbach
  2009-08-10 19:44             ` Andi Kleen
  2009-08-10 20:14             ` Ingo Molnar
  0 siblings, 2 replies; 25+ messages in thread
From: Johannes Stezenbach @ 2009-08-10 19:26 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: Andi Kleen, x86, linux-kernel, Rafael J. Wysocki

On Mon, Aug 10, 2009 at 03:29:23PM +0200, Ingo Molnar wrote:
> * Johannes Stezenbach <js@sig21.net> wrote:
> > On Mon, Aug 10, 2009 at 02:32:28PM +0200, Andi Kleen wrote:
> > > 
> > > When the BIOS doesn't enable it then force enabling lapic might not work.
> > > 
> > > This could cause either boot failures (obvious) or more subtle 
> > > problems like SMM doing something unexpected. Just saying that 
> > > if you have strange problems later first try disabling this 
> > > option again.
> > 
> > Thanks for the heads-up.  I remember I tried to use oprofile in 
> > the past on this machine and was disappointed that it only got the 
> > timer event.  I'll keep lapic for now unless I see signs of 
> > instability.
> 
> What's the output of something like 'perf stat true', and does 'perf 
> top' output something - i.e. do perfcounters work in general? Once 
> you get to that stage and it works then it should be fine.

# ./perf stat true

 Performance counter stats for 'true':

       0.985808  task-clock-msecs         #      0.779 CPUs 
              0  context-switches         #      0.000 M/sec
              0  CPU-migrations           #      0.000 M/sec
            110  page-faults              #      0.112 M/sec
         583873  cycles                   #    592.279 M/sec
         500937  instructions             #      0.858 IPC  
  <not counted>  cache-references        
  <not counted>  cache-misses            

    0.001265524  seconds time elapsed


# ./perf top
------------------------------------------------------------------------------
   PerfTop:     172 irqs/sec  kernel:43.6% [100000 cycles],  (all, 1 CPUs)
------------------------------------------------------------------------------

             samples    pcnt         RIP          kernel function
  ______     _______   _____   ________________   _______________

               96.00 - 16.4% - 00000000c129be10 : acpi_pm_read
               66.00 - 11.3% - 00000000c116fbb1 : delay_tsc
               59.00 - 10.1% - 00000000c1172a83 : ioread32
               26.00 -  4.4% - 00000000c116f567 : vsnprintf
               21.00 -  3.6% - 00000000c11a4721 : acpi_os_read_port
               20.00 -  3.4% - 00000000c136612c : schedule
               19.00 -  3.2% - 00000000c10054b9 : mask_and_ack_8259A
               18.00 -  3.1% - 00000000c11ce8bc : acpi_idle_enter_bm
               17.00 -  2.9% - 00000000c1090dd1 : do_select
               16.00 -  2.7% - 00000000c133f380 : unix_poll
               14.00 -  2.4% - 00000000c116e2b4 : number
               13.00 -  2.2% - 00000000c1002847 : sysenter_past_esp
               10.00 -  1.7% - 00000000c1085b84 : fget_light
                9.00 -  1.5% - 00000000c13666a7 : preempt_schedule
                9.00 -  1.5% - 00000000c102cf1b : get_next_timer_interrupt
^C


First I tried oprofile while running an endless while loop in bash:

# opreport 
CPU: Pentium M (P6 core), speed 1800 MHz (estimated)
Counted CPU_CLK_UNHALTED events (clocks processor is not halted, and not in a thermal trip) with a unit mask of 0x00 (No unit mask) count 100000
CPU_CLK_UNHALT...|
  samples|      %|
------------------
   282940 76.5545 bash
    78266 21.1763 libc-2.9.so
     1730  0.4681 Xorg
     1069  0.2892 oprofiled

Looks plausible.


But in demsg I got this:

Delta way too big! 18446744022868427516 ts=18446744022868427516 write stamp = 0
------------[ cut here ]------------
WARNING: at kernel/trace/ring_buffer.c:1392 rb_reserve_next_event+0x150/0x309()
Hardware name: 2373Y4M
Modules linked in: ath5k mac80211 ath cfg80211 oprofile bnep sco rfcomm l2cap bluetooth ehci_hcd uhci_hc
Pid: 13478, comm: opcontrol Not tainted 2.6.31-rc5 #5
Call Trace:
 [<c10248dd>] warn_slowpath_common+0x60/0x90
 [<c102491a>] warn_slowpath_null+0xd/0x10
 [<c1054129>] rb_reserve_next_event+0x150/0x309
 [<c1068c06>] ? get_page_from_freelist+0x86/0x35a
 [<c10544f7>] ring_buffer_lock_reserve+0xe7/0x135
 [<f88622c0>] op_cpu_buffer_write_reserve+0x1a/0x4b [oprofile]
 [<f886239d>] op_add_code+0x57/0x98 [oprofile]
 [<c1068c06>] ? get_page_from_freelist+0x86/0x35a
 [<c1367c08>] ? page_fault+0x0/0x8
 [<f8862409>] log_sample+0x2b/0x6c [oprofile]
 [<c1064c76>] ? filemap_fault+0x74/0x32b
 [<f886249c>] oprofile_add_sample+0x3b/0x6b [oprofile]
 [<f88641d8>] ppro_check_ctrs+0x66/0xdb [oprofile]
 [<c1074bde>] ? __do_fault+0x303/0x32f
 [<f8863907>] profile_exceptions_notify+0x1f/0x26 [oprofile]
 [<c103953b>] notifier_call_chain+0x2b/0x55
 [<c10398e3>] __atomic_notifier_call_chain+0x1a/0x3a
 [<c103990f>] atomic_notifier_call_chain+0xc/0xe
 [<c103993e>] notify_die+0x2d/0x2f
 [<c1003c54>] do_nmi+0x63/0x222
 [<c1367d1d>] nmi_stack_correct+0x28/0x2d
 [<c1367c08>] ? page_fault+0x0/0x8
---[ end trace d174f39c63495e01 ]---


Thanks
Johannes

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: 2.6.31-rc5 regression: x86 MCE malfunction on Thinkpad T42p
  2009-08-10 19:26           ` Johannes Stezenbach
@ 2009-08-10 19:44             ` Andi Kleen
  2009-08-10 20:05               ` Robert Richter
  2009-08-10 20:14             ` Ingo Molnar
  1 sibling, 1 reply; 25+ messages in thread
From: Andi Kleen @ 2009-08-10 19:44 UTC (permalink / raw)
  To: Johannes Stezenbach
  Cc: Ingo Molnar, Andi Kleen, x86, linux-kernel, Rafael J. Wysocki,
	robert.richter

> But in demsg I got this:

AFAIK that's already fixed. cc Robert to confirm. Not sure if the fix  made it into 2.6.31 though?

-Andi

> 
> Delta way too big! 18446744022868427516 ts=18446744022868427516 write stamp = 0
> ------------[ cut here ]------------
> WARNING: at kernel/trace/ring_buffer.c:1392 rb_reserve_next_event+0x150/0x309()
> Hardware name: 2373Y4M
> Modules linked in: ath5k mac80211 ath cfg80211 oprofile bnep sco rfcomm l2cap bluetooth ehci_hcd uhci_hc
> Pid: 13478, comm: opcontrol Not tainted 2.6.31-rc5 #5
> Call Trace:
>  [<c10248dd>] warn_slowpath_common+0x60/0x90
>  [<c102491a>] warn_slowpath_null+0xd/0x10
>  [<c1054129>] rb_reserve_next_event+0x150/0x309
>  [<c1068c06>] ? get_page_from_freelist+0x86/0x35a
>  [<c10544f7>] ring_buffer_lock_reserve+0xe7/0x135
>  [<f88622c0>] op_cpu_buffer_write_reserve+0x1a/0x4b [oprofile]
>  [<f886239d>] op_add_code+0x57/0x98 [oprofile]
>  [<c1068c06>] ? get_page_from_freelist+0x86/0x35a
>  [<c1367c08>] ? page_fault+0x0/0x8
>  [<f8862409>] log_sample+0x2b/0x6c [oprofile]
>  [<c1064c76>] ? filemap_fault+0x74/0x32b
>  [<f886249c>] oprofile_add_sample+0x3b/0x6b [oprofile]
>  [<f88641d8>] ppro_check_ctrs+0x66/0xdb [oprofile]
>  [<c1074bde>] ? __do_fault+0x303/0x32f
>  [<f8863907>] profile_exceptions_notify+0x1f/0x26 [oprofile]
>  [<c103953b>] notifier_call_chain+0x2b/0x55
>  [<c10398e3>] __atomic_notifier_call_chain+0x1a/0x3a
>  [<c103990f>] atomic_notifier_call_chain+0xc/0xe
>  [<c103993e>] notify_die+0x2d/0x2f
>  [<c1003c54>] do_nmi+0x63/0x222
>  [<c1367d1d>] nmi_stack_correct+0x28/0x2d
>  [<c1367c08>] ? page_fault+0x0/0x8
> ---[ end trace d174f39c63495e01 ]---
> 
> 
> Thanks
> Johannes
> 

-- 
ak@linux.intel.com -- Speaking for myself only.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: 2.6.31-rc5 regression: x86 MCE malfunction on Thinkpad T42p
  2009-08-10 19:44             ` Andi Kleen
@ 2009-08-10 20:05               ` Robert Richter
  0 siblings, 0 replies; 25+ messages in thread
From: Robert Richter @ 2009-08-10 20:05 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Johannes Stezenbach, Ingo Molnar, x86, linux-kernel,
	Rafael J. Wysocki, Steven Rostedt

On 10.08.09 21:44:13, Andi Kleen wrote:
> > But in demsg I got this:
> 
> AFAIK that's already fixed. cc Robert to confirm. Not sure if the fix  made it into 2.6.31 though?

The warning I fixed was in rb_advance_reader(). Fix is already
upstream:

 193df82 ring-buffer: Fix advance of reader in rb_buffer_peek()

But this seems to be different, a time stamp issue?

CC'ing Steve.

-Robert

> 
> -Andi
> 
> > 
> > Delta way too big! 18446744022868427516 ts=18446744022868427516 write stamp = 0
> > ------------[ cut here ]------------
> > WARNING: at kernel/trace/ring_buffer.c:1392 rb_reserve_next_event+0x150/0x309()
> > Hardware name: 2373Y4M
> > Modules linked in: ath5k mac80211 ath cfg80211 oprofile bnep sco rfcomm l2cap bluetooth ehci_hcd uhci_hc
> > Pid: 13478, comm: opcontrol Not tainted 2.6.31-rc5 #5
> > Call Trace:
> >  [<c10248dd>] warn_slowpath_common+0x60/0x90
> >  [<c102491a>] warn_slowpath_null+0xd/0x10
> >  [<c1054129>] rb_reserve_next_event+0x150/0x309
> >  [<c1068c06>] ? get_page_from_freelist+0x86/0x35a
> >  [<c10544f7>] ring_buffer_lock_reserve+0xe7/0x135
> >  [<f88622c0>] op_cpu_buffer_write_reserve+0x1a/0x4b [oprofile]
> >  [<f886239d>] op_add_code+0x57/0x98 [oprofile]
> >  [<c1068c06>] ? get_page_from_freelist+0x86/0x35a
> >  [<c1367c08>] ? page_fault+0x0/0x8
> >  [<f8862409>] log_sample+0x2b/0x6c [oprofile]
> >  [<c1064c76>] ? filemap_fault+0x74/0x32b
> >  [<f886249c>] oprofile_add_sample+0x3b/0x6b [oprofile]
> >  [<f88641d8>] ppro_check_ctrs+0x66/0xdb [oprofile]
> >  [<c1074bde>] ? __do_fault+0x303/0x32f
> >  [<f8863907>] profile_exceptions_notify+0x1f/0x26 [oprofile]
> >  [<c103953b>] notifier_call_chain+0x2b/0x55
> >  [<c10398e3>] __atomic_notifier_call_chain+0x1a/0x3a
> >  [<c103990f>] atomic_notifier_call_chain+0xc/0xe
> >  [<c103993e>] notify_die+0x2d/0x2f
> >  [<c1003c54>] do_nmi+0x63/0x222
> >  [<c1367d1d>] nmi_stack_correct+0x28/0x2d
> >  [<c1367c08>] ? page_fault+0x0/0x8
> > ---[ end trace d174f39c63495e01 ]---
> > 
> > 
> > Thanks
> > Johannes
> > 
> 
> -- 
> ak@linux.intel.com -- Speaking for myself only.
> 

-- 
Advanced Micro Devices, Inc.
Operating System Research Center
email: robert.richter@amd.com


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: 2.6.31-rc5 regression: x86 MCE malfunction on Thinkpad T42p
  2009-08-10 19:26           ` Johannes Stezenbach
  2009-08-10 19:44             ` Andi Kleen
@ 2009-08-10 20:14             ` Ingo Molnar
  2009-08-10 20:37               ` Johannes Stezenbach
  1 sibling, 1 reply; 25+ messages in thread
From: Ingo Molnar @ 2009-08-10 20:14 UTC (permalink / raw)
  To: Johannes Stezenbach, Robert Richter, Steven Rostedt
  Cc: Andi Kleen, x86, linux-kernel, Rafael J. Wysocki


* Johannes Stezenbach <js@sig21.net> wrote:

> On Mon, Aug 10, 2009 at 03:29:23PM +0200, Ingo Molnar wrote:
> > * Johannes Stezenbach <js@sig21.net> wrote:
> > > On Mon, Aug 10, 2009 at 02:32:28PM +0200, Andi Kleen wrote:
> > > > 
> > > > When the BIOS doesn't enable it then force enabling lapic might not work.
> > > > 
> > > > This could cause either boot failures (obvious) or more subtle 
> > > > problems like SMM doing something unexpected. Just saying that 
> > > > if you have strange problems later first try disabling this 
> > > > option again.
> > > 
> > > Thanks for the heads-up.  I remember I tried to use oprofile in 
> > > the past on this machine and was disappointed that it only got the 
> > > timer event.  I'll keep lapic for now unless I see signs of 
> > > instability.
> > 
> > What's the output of something like 'perf stat true', and does 'perf 
> > top' output something - i.e. do perfcounters work in general? Once 
> > you get to that stage and it works then it should be fine.
> 
> # ./perf stat true
> 
>  Performance counter stats for 'true':
> 
>        0.985808  task-clock-msecs         #      0.779 CPUs 
>               0  context-switches         #      0.000 M/sec
>               0  CPU-migrations           #      0.000 M/sec
>             110  page-faults              #      0.112 M/sec
>          583873  cycles                   #    592.279 M/sec
>          500937  instructions             #      0.858 IPC  
>   <not counted>  cache-references        
>   <not counted>  cache-misses            
> 
>     0.001265524  seconds time elapsed

That looks almost normal - except for cache-references and 
cache-misses that is not counted. Could you send the /proc/cpuinfo 
info please?

> 
> 
> # ./perf top
> ------------------------------------------------------------------------------
>    PerfTop:     172 irqs/sec  kernel:43.6% [100000 cycles],  (all, 1 CPUs)
> ------------------------------------------------------------------------------
> 
>              samples    pcnt         RIP          kernel function
>   ______     _______   _____   ________________   _______________
> 
>                96.00 - 16.4% - 00000000c129be10 : acpi_pm_read
>                66.00 - 11.3% - 00000000c116fbb1 : delay_tsc
>                59.00 - 10.1% - 00000000c1172a83 : ioread32
>                26.00 -  4.4% - 00000000c116f567 : vsnprintf
>                21.00 -  3.6% - 00000000c11a4721 : acpi_os_read_port
>                20.00 -  3.4% - 00000000c136612c : schedule
>                19.00 -  3.2% - 00000000c10054b9 : mask_and_ack_8259A
>                18.00 -  3.1% - 00000000c11ce8bc : acpi_idle_enter_bm
>                17.00 -  2.9% - 00000000c1090dd1 : do_select
>                16.00 -  2.7% - 00000000c133f380 : unix_poll
>                14.00 -  2.4% - 00000000c116e2b4 : number
>                13.00 -  2.2% - 00000000c1002847 : sysenter_past_esp
>                10.00 -  1.7% - 00000000c1085b84 : fget_light
>                 9.00 -  1.5% - 00000000c13666a7 : preempt_schedule
>                 9.00 -  1.5% - 00000000c102cf1b : get_next_timer_interrupt
> ^C

Ok, this looks normal.

> First I tried oprofile while running an endless while loop in bash:
> 
> # opreport 
> CPU: Pentium M (P6 core), speed 1800 MHz (estimated)
> Counted CPU_CLK_UNHALTED events (clocks processor is not halted, and not in a thermal trip) with a unit mask of 0x00 (No unit mask) count 100000
> CPU_CLK_UNHALT...|
>   samples|      %|
> ------------------
>    282940 76.5545 bash
>     78266 21.1763 libc-2.9.so
>      1730  0.4681 Xorg
>      1069  0.2892 oprofiled
> 
> Looks plausible.

Yeah.

> But in demsg I got this:
> 
> Delta way too big! 18446744022868427516 ts=18446744022868427516 write stamp = 0
> ------------[ cut here ]------------
> WARNING: at kernel/trace/ring_buffer.c:1392 rb_reserve_next_event+0x150/0x309()
> Hardware name: 2373Y4M
> Modules linked in: ath5k mac80211 ath cfg80211 oprofile bnep sco rfcomm l2cap bluetooth ehci_hcd uhci_hc
> Pid: 13478, comm: opcontrol Not tainted 2.6.31-rc5 #5
> Call Trace:
>  [<c10248dd>] warn_slowpath_common+0x60/0x90
>  [<c102491a>] warn_slowpath_null+0xd/0x10
>  [<c1054129>] rb_reserve_next_event+0x150/0x309
>  [<c1068c06>] ? get_page_from_freelist+0x86/0x35a
>  [<c10544f7>] ring_buffer_lock_reserve+0xe7/0x135
>  [<f88622c0>] op_cpu_buffer_write_reserve+0x1a/0x4b [oprofile]
>  [<f886239d>] op_add_code+0x57/0x98 [oprofile]
>  [<c1068c06>] ? get_page_from_freelist+0x86/0x35a
>  [<c1367c08>] ? page_fault+0x0/0x8
>  [<f8862409>] log_sample+0x2b/0x6c [oprofile]
>  [<c1064c76>] ? filemap_fault+0x74/0x32b
>  [<f886249c>] oprofile_add_sample+0x3b/0x6b [oprofile]
>  [<f88641d8>] ppro_check_ctrs+0x66/0xdb [oprofile]
>  [<c1074bde>] ? __do_fault+0x303/0x32f
>  [<f8863907>] profile_exceptions_notify+0x1f/0x26 [oprofile]
>  [<c103953b>] notifier_call_chain+0x2b/0x55
>  [<c10398e3>] __atomic_notifier_call_chain+0x1a/0x3a
>  [<c103990f>] atomic_notifier_call_chain+0xc/0xe
>  [<c103993e>] notify_die+0x2d/0x2f
>  [<c1003c54>] do_nmi+0x63/0x222
>  [<c1367d1d>] nmi_stack_correct+0x28/0x2d
>  [<c1367c08>] ? page_fault+0x0/0x8
> ---[ end trace d174f39c63495e01 ]---

That's a new warning i havent seen before - i've Cc:-ed Robert 
(oprofile maintainer) and Steve (ftrace/ring-buffer maintainer) for 
that.

The warning is probably harmless - oprofile sampling still works 
fine, right?

	Ingo

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: 2.6.31-rc5 regression: x86 MCE malfunction on Thinkpad T42p
  2009-08-10 20:14             ` Ingo Molnar
@ 2009-08-10 20:37               ` Johannes Stezenbach
  2009-08-10 21:31                 ` Ingo Molnar
  0 siblings, 1 reply; 25+ messages in thread
From: Johannes Stezenbach @ 2009-08-10 20:37 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Robert Richter, Steven Rostedt, Andi Kleen, x86, linux-kernel,
	Rafael J. Wysocki

On Mon, Aug 10, 2009 at 10:14:06PM +0200, Ingo Molnar wrote:
> * Johannes Stezenbach <js@sig21.net> wrote:
> > 
> > # ./perf stat true
> > 
> >  Performance counter stats for 'true':
> > 
> >        0.985808  task-clock-msecs         #      0.779 CPUs 
> >               0  context-switches         #      0.000 M/sec
> >               0  CPU-migrations           #      0.000 M/sec
> >             110  page-faults              #      0.112 M/sec
> >          583873  cycles                   #    592.279 M/sec
> >          500937  instructions             #      0.858 IPC  
> >   <not counted>  cache-references        
> >   <not counted>  cache-misses            
> > 
> >     0.001265524  seconds time elapsed
> 
> That looks almost normal - except for cache-references and 
> cache-misses that is not counted. Could you send the /proc/cpuinfo 
> info please?

# cat /proc/cpuinfo 
processor	: 0
vendor_id	: GenuineIntel
cpu family	: 6
model		: 13
model name	: Intel(R) Pentium(R) M processor 1.80GHz
stepping	: 6
cpu MHz		: 600.000
cache size	: 2048 KB
fdiv_bug	: no
hlt_bug		: no
f00f_bug	: no
coma_bug	: no
fpu		: yes
fpu_exception	: yes
cpuid level	: 2
wp		: yes
flags		: fpu vme de pse tsc msr mce cx8 apic sep mtrr pge mca cmov clflush dts acpi mmx fxsr sse sse2 ss tm pbe bts est tm2
bogomips	: 1196.15
clflush size	: 64
power management:


> The warning is probably harmless - oprofile sampling still works 
> fine, right?

I haven't done much testing so far, but so far it looks promising.

Could the warning be caused by the cpufreq ondemand governor?
ISTR that one should switch to the performance governor before doing
any profiling, but I forgot for this test.


Johannes

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: 2.6.31-rc5 regression: x86 MCE malfunction on Thinkpad T42p
  2009-08-10 20:37               ` Johannes Stezenbach
@ 2009-08-10 21:31                 ` Ingo Molnar
  2009-08-10 22:13                   ` Johannes Stezenbach
  0 siblings, 1 reply; 25+ messages in thread
From: Ingo Molnar @ 2009-08-10 21:31 UTC (permalink / raw)
  To: Johannes Stezenbach
  Cc: Robert Richter, Steven Rostedt, Andi Kleen, x86, linux-kernel,
	Rafael J. Wysocki


* Johannes Stezenbach <js@sig21.net> wrote:

> On Mon, Aug 10, 2009 at 10:14:06PM +0200, Ingo Molnar wrote:
> > * Johannes Stezenbach <js@sig21.net> wrote:
> > > 
> > > # ./perf stat true
> > > 
> > >  Performance counter stats for 'true':
> > > 
> > >        0.985808  task-clock-msecs         #      0.779 CPUs 
> > >               0  context-switches         #      0.000 M/sec
> > >               0  CPU-migrations           #      0.000 M/sec
> > >             110  page-faults              #      0.112 M/sec
> > >          583873  cycles                   #    592.279 M/sec
> > >          500937  instructions             #      0.858 IPC  
> > >   <not counted>  cache-references        
> > >   <not counted>  cache-misses            
> > > 
> > >     0.001265524  seconds time elapsed
> > 
> > That looks almost normal - except for cache-references and 
> > cache-misses that is not counted. Could you send the /proc/cpuinfo 
> > info please?
> 
> # cat /proc/cpuinfo 
> processor	: 0
> vendor_id	: GenuineIntel
> cpu family	: 6
> model		: 13
> model name	: Intel(R) Pentium(R) M processor 1.80GHz

ah, yes. There's no cache-references/misses, because in 
arch/x86/kernel/cpu/perf_counter.c we have two zero entries:

static const u64 p6_perfmon_event_map[] =
{
  [PERF_COUNT_HW_CPU_CYCLES]            = 0x0079,
  [PERF_COUNT_HW_INSTRUCTIONS]          = 0x00c0,
  [PERF_COUNT_HW_CACHE_REFERENCES]      = 0x0000, <----------
  [PERF_COUNT_HW_CACHE_MISSES]          = 0x0000, <----------
  [PERF_COUNT_HW_BRANCH_INSTRUCTIONS]   = 0x00c4,
  [PERF_COUNT_HW_BRANCH_MISSES]         = 0x00c5,
  [PERF_COUNT_HW_BUS_CYCLES]            = 0x0062,
};

i.e. PERF_COUNT_HW_CACHE_REFERENCES and PERF_COUNT_HW_CACHE_MISSES 
is not filled in yet.

Could you try something like:

    perf stat -e r0f2e true

(0x2e: L2 requests, 0x0f: all units)

if i checked the docs right that counter would give us L2 cache 
stats - does it display non-zero values?

> stepping	: 6
> cpu MHz		: 600.000
> cache size	: 2048 KB
> fdiv_bug	: no
> hlt_bug		: no
> f00f_bug	: no
> coma_bug	: no
> fpu		: yes
> fpu_exception	: yes
> cpuid level	: 2
> wp		: yes
> flags		: fpu vme de pse tsc msr mce cx8 apic sep mtrr pge mca cmov clflush dts acpi mmx fxsr sse sse2 ss tm pbe bts est tm2
> bogomips	: 1196.15
> clflush size	: 64
> power management:
> 
> 
> > The warning is probably harmless - oprofile sampling still works 
> > fine, right?
> 
> I haven't done much testing so far, but so far it looks promising.
> 
> Could the warning be caused by the cpufreq ondemand governor? ISTR 
> that one should switch to the performance governor before doing 
> any profiling, but I forgot for this test.

there might be a connection - it could in theory cause sched_clock() 
transients and confuse the ring-buffer time-stamping.

	Ingo

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: 2.6.31-rc5 regression: x86 MCE malfunction on Thinkpad T42p
  2009-08-10 21:31                 ` Ingo Molnar
@ 2009-08-10 22:13                   ` Johannes Stezenbach
  2009-08-11  9:34                     ` [patch] cache-miss and cache-refs events on P6-mobile CPUs Ingo Molnar
  2009-08-11 15:40                     ` 2.6.31-rc5 regression: x86 MCE malfunction on Thinkpad T42p Johannes Stezenbach
  0 siblings, 2 replies; 25+ messages in thread
From: Johannes Stezenbach @ 2009-08-10 22:13 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Robert Richter, Steven Rostedt, Andi Kleen, x86, linux-kernel,
	Rafael J. Wysocki

On Mon, Aug 10, 2009 at 11:31:33PM +0200, Ingo Molnar wrote:
> * Johannes Stezenbach <js@sig21.net> wrote:
> > 
> > # cat /proc/cpuinfo 
> > processor	: 0
> > vendor_id	: GenuineIntel
> > cpu family	: 6
> > model		: 13
> > model name	: Intel(R) Pentium(R) M processor 1.80GHz
> 
> ah, yes. There's no cache-references/misses, because in 
> arch/x86/kernel/cpu/perf_counter.c we have two zero entries:
> 
> static const u64 p6_perfmon_event_map[] =
> {
>   [PERF_COUNT_HW_CPU_CYCLES]            = 0x0079,
>   [PERF_COUNT_HW_INSTRUCTIONS]          = 0x00c0,
>   [PERF_COUNT_HW_CACHE_REFERENCES]      = 0x0000, <----------
>   [PERF_COUNT_HW_CACHE_MISSES]          = 0x0000, <----------
>   [PERF_COUNT_HW_BRANCH_INSTRUCTIONS]   = 0x00c4,
>   [PERF_COUNT_HW_BRANCH_MISSES]         = 0x00c5,
>   [PERF_COUNT_HW_BUS_CYCLES]            = 0x0062,
> };
> 
> i.e. PERF_COUNT_HW_CACHE_REFERENCES and PERF_COUNT_HW_CACHE_MISSES 
> is not filled in yet.
> 
> Could you try something like:
> 
>     perf stat -e r0f2e true
> 
> (0x2e: L2 requests, 0x0f: all units)
> 
> if i checked the docs right that counter would give us L2 cache 
> stats - does it display non-zero values?

# ./perf stat -e r0f2e true

 Performance counter stats for 'true':

          10584  raw 0xf2e               

    0.001159924  seconds time elapsed

The number also increases for larger programs than "true".

According to /usr/share/oprofile/i386/p6_mobile/events and
http://oprofile.sourceforge.net/docs/intel-p6-mobile-events.php
0x2e + 0x0f is "L2 requests, all units", but I couldn't say how
to count cache references vs. misses.  Or does it work
with unit mask 0x0e vs. 0x01?

# ./perf stat -e r0e2e true

 Performance counter stats for 'true':

          10147  raw 0xe2e               

    0.001121651  seconds time elapsed

# ./perf stat -e r012e true

 Performance counter stats for 'true':

            468  raw 0x12e               

    0.001130870  seconds time elapsed


> > Could the warning be caused by the cpufreq ondemand governor? ISTR 
> > that one should switch to the performance governor before doing 
> > any profiling, but I forgot for this test.
> 
> there might be a connection - it could in theory cause sched_clock() 
> transients and confuse the ring-buffer time-stamping.

I'll try tomorrow after a fresh boot if the warning also appears
with the performance governor.


Thanks
Johannes

^ permalink raw reply	[flat|nested] 25+ messages in thread

* [patch] cache-miss and cache-refs events on P6-mobile CPUs
  2009-08-10 22:13                   ` Johannes Stezenbach
@ 2009-08-11  9:34                     ` Ingo Molnar
  2009-08-11  9:39                       ` Peter Zijlstra
  2009-08-11 15:50                       ` Johannes Stezenbach
  2009-08-11 15:40                     ` 2.6.31-rc5 regression: x86 MCE malfunction on Thinkpad T42p Johannes Stezenbach
  1 sibling, 2 replies; 25+ messages in thread
From: Ingo Molnar @ 2009-08-11  9:34 UTC (permalink / raw)
  To: Johannes Stezenbach
  Cc: linux-kernel, Peter Zijlstra, Steven Rostedt,
	Frédéric Weisbecker, Thomas Gleixner


* Johannes Stezenbach <js@sig21.net> wrote:

> On Mon, Aug 10, 2009 at 11:31:33PM +0200, Ingo Molnar wrote:
> > * Johannes Stezenbach <js@sig21.net> wrote:
> > > 
> > > # cat /proc/cpuinfo 
> > > processor	: 0
> > > vendor_id	: GenuineIntel
> > > cpu family	: 6
> > > model		: 13
> > > model name	: Intel(R) Pentium(R) M processor 1.80GHz
> > 
> > ah, yes. There's no cache-references/misses, because in 
> > arch/x86/kernel/cpu/perf_counter.c we have two zero entries:
> > 
> > static const u64 p6_perfmon_event_map[] =
> > {
> >   [PERF_COUNT_HW_CPU_CYCLES]            = 0x0079,
> >   [PERF_COUNT_HW_INSTRUCTIONS]          = 0x00c0,
> >   [PERF_COUNT_HW_CACHE_REFERENCES]      = 0x0000, <----------
> >   [PERF_COUNT_HW_CACHE_MISSES]          = 0x0000, <----------
> >   [PERF_COUNT_HW_BRANCH_INSTRUCTIONS]   = 0x00c4,
> >   [PERF_COUNT_HW_BRANCH_MISSES]         = 0x00c5,
> >   [PERF_COUNT_HW_BUS_CYCLES]            = 0x0062,
> > };
> > 
> > i.e. PERF_COUNT_HW_CACHE_REFERENCES and PERF_COUNT_HW_CACHE_MISSES 
> > is not filled in yet.
> > 
> > Could you try something like:
> > 
> >     perf stat -e r0f2e true
> > 
> > (0x2e: L2 requests, 0x0f: all units)
> > 
> > if i checked the docs right that counter would give us L2 cache 
> > stats - does it display non-zero values?
> 
> # ./perf stat -e r0f2e true
> 
>  Performance counter stats for 'true':
> 
>           10584  raw 0xf2e               
> 
>     0.001159924  seconds time elapsed
> 
> The number also increases for larger programs than "true".
> 
> According to /usr/share/oprofile/i386/p6_mobile/events and
> http://oprofile.sourceforge.net/docs/intel-p6-mobile-events.php
> 0x2e + 0x0f is "L2 requests, all units", but I couldn't say how
> to count cache references vs. misses.  Or does it work
> with unit mask 0x0e vs. 0x01?
> 
> # ./perf stat -e r0e2e true
> 
>  Performance counter stats for 'true':
> 
>           10147  raw 0xe2e               
> 
>     0.001121651  seconds time elapsed
> 
> # ./perf stat -e r012e true
> 
>  Performance counter stats for 'true':
> 
>             468  raw 0x12e               
> 
>     0.001130870  seconds time elapsed

Ok. That definitely looks like the right event to use.

Could you try the patch below, does it do the trick? Note, since 
there's just two generic counters and perf stat uses four counters, 
you'll need to run longer commands than 'true' or something like:

  perf stat -a sleep 1

or:

  perf stat --repeat 10 /bin/ls -R /usr/bin >/dev/null

to get all counters excercised and time-shared on your CPU.

	Ingo

-------------->
Subject: perf_counter, x86: Add generic cache events to P6-mobile CPUs
From: Ingo Molnar <mingo@elte.hu>
Date: Tue Aug 11 10:26:33 CEST 2009

Johannes Stezenbach reported that 'perf stat' does not count
cache-miss and cache-references events on his Pentium-M based
laptop.

Add the events.

Reported-by: Johannes Stezenbach <js@sig21.net>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
---
 arch/x86/kernel/cpu/perf_counter.c |    4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

Index: linux/arch/x86/kernel/cpu/perf_counter.c
===================================================================
--- linux.orig/arch/x86/kernel/cpu/perf_counter.c
+++ linux/arch/x86/kernel/cpu/perf_counter.c
@@ -116,8 +116,8 @@ static const u64 p6_perfmon_event_map[] 
 {
   [PERF_COUNT_HW_CPU_CYCLES]		= 0x0079,
   [PERF_COUNT_HW_INSTRUCTIONS]		= 0x00c0,
-  [PERF_COUNT_HW_CACHE_REFERENCES]	= 0x0000,
-  [PERF_COUNT_HW_CACHE_MISSES]		= 0x0000,
+  [PERF_COUNT_HW_CACHE_REFERENCES]	= 0x0f2e,
+  [PERF_COUNT_HW_CACHE_MISSES]		= 0x012e,
   [PERF_COUNT_HW_BRANCH_INSTRUCTIONS]	= 0x00c4,
   [PERF_COUNT_HW_BRANCH_MISSES]		= 0x00c5,
   [PERF_COUNT_HW_BUS_CYCLES]		= 0x0062,

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [patch] cache-miss and cache-refs events on P6-mobile CPUs
  2009-08-11  9:34                     ` [patch] cache-miss and cache-refs events on P6-mobile CPUs Ingo Molnar
@ 2009-08-11  9:39                       ` Peter Zijlstra
  2009-08-11 11:06                         ` Ingo Molnar
  2009-08-11 15:50                       ` Johannes Stezenbach
  1 sibling, 1 reply; 25+ messages in thread
From: Peter Zijlstra @ 2009-08-11  9:39 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Johannes Stezenbach, linux-kernel, Steven Rostedt,
	Frédéric Weisbecker, Thomas Gleixner

On Tue, 2009-08-11 at 11:34 +0200, Ingo Molnar wrote:

> @@ -116,8 +116,8 @@ static const u64 p6_perfmon_event_map[] 
>  {
>    [PERF_COUNT_HW_CPU_CYCLES]		= 0x0079,
>    [PERF_COUNT_HW_INSTRUCTIONS]		= 0x00c0,
> -  [PERF_COUNT_HW_CACHE_REFERENCES]	= 0x0000,
> -  [PERF_COUNT_HW_CACHE_MISSES]		= 0x0000,
> +  [PERF_COUNT_HW_CACHE_REFERENCES]	= 0x0f2e,
> +  [PERF_COUNT_HW_CACHE_MISSES]		= 0x012e,

2e is total numer of L2 events,

0f is all mesi states
01 is invalid states

Yes they are numbers, but no they are not full cache refs and misses.


I don't think this is a good patch.


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [patch] cache-miss and cache-refs events on P6-mobile CPUs
  2009-08-11  9:39                       ` Peter Zijlstra
@ 2009-08-11 11:06                         ` Ingo Molnar
  2009-08-11 11:21                           ` Peter Zijlstra
  0 siblings, 1 reply; 25+ messages in thread
From: Ingo Molnar @ 2009-08-11 11:06 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Johannes Stezenbach, linux-kernel, Steven Rostedt,
	Frédéric Weisbecker, Thomas Gleixner


* Peter Zijlstra <a.p.zijlstra@chello.nl> wrote:

> On Tue, 2009-08-11 at 11:34 +0200, Ingo Molnar wrote:
> 
> > @@ -116,8 +116,8 @@ static const u64 p6_perfmon_event_map[] 
> >  {
> >    [PERF_COUNT_HW_CPU_CYCLES]		= 0x0079,
> >    [PERF_COUNT_HW_INSTRUCTIONS]		= 0x00c0,
> > -  [PERF_COUNT_HW_CACHE_REFERENCES]	= 0x0000,
> > -  [PERF_COUNT_HW_CACHE_MISSES]		= 0x0000,
> > +  [PERF_COUNT_HW_CACHE_REFERENCES]	= 0x0f2e,
> > +  [PERF_COUNT_HW_CACHE_MISSES]		= 0x012e,
> 
> 2e is total numer of L2 events,
> 
> 0f is all mesi states
> 01 is invalid states

here's Intel's own description:

 I_STATE 0x01 Counts how many times requests miss the cache.
 MESI    0x0F Counts how many times cache lines in any state are accessed.

so it's pretty close in practice. The only counts that are a bit 
inapplicable are fetches/prefetches it initiates on its own (they 
are included here) - but those too are related to the workload in 
general, so it's good as an approximation.

It's definitely better than 0x00 IMO. What do you think?

	Ingo

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [patch] cache-miss and cache-refs events on P6-mobile CPUs
  2009-08-11 11:06                         ` Ingo Molnar
@ 2009-08-11 11:21                           ` Peter Zijlstra
  0 siblings, 0 replies; 25+ messages in thread
From: Peter Zijlstra @ 2009-08-11 11:21 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Johannes Stezenbach, linux-kernel, Steven Rostedt,
	Frédéric Weisbecker, Thomas Gleixner

On Tue, 2009-08-11 at 13:06 +0200, Ingo Molnar wrote:
> * Peter Zijlstra <a.p.zijlstra@chello.nl> wrote:
> 
> > On Tue, 2009-08-11 at 11:34 +0200, Ingo Molnar wrote:
> > 
> > > @@ -116,8 +116,8 @@ static const u64 p6_perfmon_event_map[] 
> > >  {
> > >    [PERF_COUNT_HW_CPU_CYCLES]		= 0x0079,
> > >    [PERF_COUNT_HW_INSTRUCTIONS]		= 0x00c0,
> > > -  [PERF_COUNT_HW_CACHE_REFERENCES]	= 0x0000,
> > > -  [PERF_COUNT_HW_CACHE_MISSES]		= 0x0000,
> > > +  [PERF_COUNT_HW_CACHE_REFERENCES]	= 0x0f2e,
> > > +  [PERF_COUNT_HW_CACHE_MISSES]		= 0x012e,
> > 
> > 2e is total numer of L2 events,
> > 
> > 0f is all mesi states
> > 01 is invalid states
> 
> here's Intel's own description:
> 
>  I_STATE 0x01 Counts how many times requests miss the cache.
>  MESI    0x0F Counts how many times cache lines in any state are accessed.
> 
> so it's pretty close in practice. The only counts that are a bit 
> inapplicable are fetches/prefetches it initiates on its own (they 
> are included here) - but those too are related to the workload in 
> general, so it's good as an approximation.
> 
> It's definitely better than 0x00 IMO. What do you think?

Well, if they say so. I was thinking that counting I states would count
invalidates due to remote S->{E,M} transitions and invpg ins' and such. 

And hitting an invalidated line is a whole different thing than plain
missing it due to it not being present.

Anyway, if this is the Intel recommended thing for cache misses, who am
I to argue.


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: 2.6.31-rc5 regression: x86 MCE malfunction on Thinkpad T42p
  2009-08-10 22:13                   ` Johannes Stezenbach
  2009-08-11  9:34                     ` [patch] cache-miss and cache-refs events on P6-mobile CPUs Ingo Molnar
@ 2009-08-11 15:40                     ` Johannes Stezenbach
  2009-08-17 14:49                       ` Steven Rostedt
  1 sibling, 1 reply; 25+ messages in thread
From: Johannes Stezenbach @ 2009-08-11 15:40 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Robert Richter, Steven Rostedt, Andi Kleen, x86, linux-kernel,
	Rafael J. Wysocki

On Tue, Aug 11, 2009 at 12:13:07AM +0200, Johannes Stezenbach wrote:
> On Mon, Aug 10, 2009 at 11:31:33PM +0200, Ingo Molnar wrote:
> > * Johannes Stezenbach <js@sig21.net> wrote:
> > > 
> > > Could the warning be caused by the cpufreq ondemand governor? ISTR 
> > > that one should switch to the performance governor before doing 
> > > any profiling, but I forgot for this test.
> > 
> > there might be a connection - it could in theory cause sched_clock() 
> > transients and confuse the ring-buffer time-stamping.
> 
> I'll try tomorrow after a fresh boot if the warning also appears
> with the performance governor.

Nope, cpufreq was not the culprit.  The "Delta way too big!"
warning in rb_reserve_next_event() happens after suspend-to-RAM.


Johannes

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [patch] cache-miss and cache-refs events on P6-mobile CPUs
  2009-08-11  9:34                     ` [patch] cache-miss and cache-refs events on P6-mobile CPUs Ingo Molnar
  2009-08-11  9:39                       ` Peter Zijlstra
@ 2009-08-11 15:50                       ` Johannes Stezenbach
  2009-08-11 16:56                         ` Ingo Molnar
  1 sibling, 1 reply; 25+ messages in thread
From: Johannes Stezenbach @ 2009-08-11 15:50 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: linux-kernel, Peter Zijlstra, Steven Rostedt,
	Frédéric Weisbecker, Thomas Gleixner

On Tue, Aug 11, 2009 at 11:34:05AM +0200, Ingo Molnar wrote:
> 
> Could you try the patch below, does it do the trick? Note, since 
> there's just two generic counters and perf stat uses four counters, 
> you'll need to run longer commands than 'true' or something like:
> 
>   perf stat -a sleep 1
> 
> or:
> 
>   perf stat --repeat 10 /bin/ls -R /usr/bin >/dev/null
> 
> to get all counters excercised and time-shared on your CPU.

OK, here are the results:

# ./perf stat -a true

 Performance counter stats for 'true':

       1.332145  task-clock-msecs         #      1.057 CPUs 
              4  context-switches         #      0.003 M/sec
              0  CPU-migrations           #      0.000 M/sec
            121  page-faults              #      0.091 M/sec
         788333  cycles                   #    591.777 M/sec
         610408  instructions             #      0.774 IPC  
  <not counted>  cache-references        
  <not counted>  cache-misses            

    0.001260775  seconds time elapsed

# ./perf stat -a sleep 1

 Performance counter stats for 'sleep 1':

    1001.358010  task-clock-msecs         #      1.000 CPUs 
            253  context-switches         #      0.000 M/sec
              0  CPU-migrations           #      0.000 M/sec
            145  page-faults              #      0.000 M/sec
        8841883  cycles                   #      8.830 M/sec  (scaled from 85.04%)
        6361047  instructions             #      0.719 IPC    (scaled from 78.03%)
         801847  cache-references         #      0.801 M/sec  (scaled from 14.96%)
          16925  cache-misses             #      0.017 M/sec  (scaled from 21.97%)

    1.001540692  seconds time elapsed

# ./perf stat --repeat 10 /bin/ls -R /usr/bin >/dev/null

 Performance counter stats for '/bin/ls -R /usr/bin' (10 runs):

       8.642859  task-clock-msecs         #      0.939 CPUs    ( +-   2.382% )
              1  context-switches         #      0.000 M/sec   ( +-   0.000% )
              0  CPU-migrations           #      0.000 M/sec   ( +-   0.000% )
            375  page-faults              #      0.043 M/sec   ( +-   0.000% )
       14796621  cycles                   #   1712.005 M/sec   ( +-   0.306% )
       15526008  instructions             #      1.049 IPC     ( +-   0.020% )
  <not counted>  cache-references        
  <not counted>  cache-misses            

    0.009207932  seconds time elapsed   ( +-   4.020% )

# ./perf stat --repeat 10 /bin/ls -lR /usr/bin >/dev/null

 Performance counter stats for '/bin/ls -lR /usr/bin' (10 runs):

      49.894190  task-clock-msecs         #      0.762 CPUs    ( +-   8.231% )
             45  context-switches         #      0.001 M/sec   ( +-  52.174% )
              0  CPU-migrations           #      0.000 M/sec   ( +-   0.000% )
            476  page-faults              #      0.010 M/sec   ( +-   0.000% )
       81036819  cycles                   #   1624.173 M/sec   ( +-   3.166% )  (scaled from 96.03%)
       88822455  instructions             #      1.096 IPC     ( +-   2.108% )  (scaled from 96.42%)
         573228  cache-references         #     11.489 M/sec   ( +-  37.947% )
  <not counted>  cache-misses            

    0.065464625  seconds time elapsed   ( +-  19.349% )

# ./perf stat --repeat 10 /bin/ls -R /usr/lib >/dev/null

 Performance counter stats for '/bin/ls -R /usr/lib' (10 runs):

     136.771643  task-clock-msecs         #      0.075 CPUs    ( +-  19.292% )
            757  context-switches         #      0.006 M/sec   ( +-  56.332% )
              0  CPU-migrations           #      0.000 M/sec   ( +-   0.000% )
            343  page-faults              #      0.003 M/sec   ( +-   0.000% )
      173125500  cycles                   #   1265.800 M/sec   ( +-   5.034% )  (scaled from 78.21%)
      174597635  instructions             #      1.009 IPC     ( +-   1.829% )  (scaled from 78.34%)
        2441413  cache-references         #     17.850 M/sec   ( +-  18.289% )  (scaled from 21.79%)
          21112  cache-misses             #      0.154 M/sec   ( +-  11.183% )  (scaled from 21.66%)

    1.824204481  seconds time elapsed   ( +-  29.781% )


Johannes

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [patch] cache-miss and cache-refs events on P6-mobile CPUs
  2009-08-11 15:50                       ` Johannes Stezenbach
@ 2009-08-11 16:56                         ` Ingo Molnar
  0 siblings, 0 replies; 25+ messages in thread
From: Ingo Molnar @ 2009-08-11 16:56 UTC (permalink / raw)
  To: Johannes Stezenbach
  Cc: linux-kernel, Peter Zijlstra, Steven Rostedt,
	Frédéric Weisbecker, Thomas Gleixner


* Johannes Stezenbach <js@sig21.net> wrote:

> On Tue, Aug 11, 2009 at 11:34:05AM +0200, Ingo Molnar wrote:
> > 
> > Could you try the patch below, does it do the trick? Note, since 
> > there's just two generic counters and perf stat uses four counters, 
> > you'll need to run longer commands than 'true' or something like:
> > 
> >   perf stat -a sleep 1
> > 
> > or:
> > 
> >   perf stat --repeat 10 /bin/ls -R /usr/bin >/dev/null
> > 
> > to get all counters excercised and time-shared on your CPU.
> 
> OK, here are the results:
> 
> # ./perf stat -a true
> 
>  Performance counter stats for 'true':
> 
>        1.332145  task-clock-msecs         #      1.057 CPUs 
>               4  context-switches         #      0.003 M/sec
>               0  CPU-migrations           #      0.000 M/sec
>             121  page-faults              #      0.091 M/sec
>          788333  cycles                   #    591.777 M/sec
>          610408  instructions             #      0.774 IPC  
>   <not counted>  cache-references        
>   <not counted>  cache-misses            
> 
>     0.001260775  seconds time elapsed
> 
> # ./perf stat -a sleep 1
> 
>  Performance counter stats for 'sleep 1':
> 
>     1001.358010  task-clock-msecs         #      1.000 CPUs 
>             253  context-switches         #      0.000 M/sec
>               0  CPU-migrations           #      0.000 M/sec
>             145  page-faults              #      0.000 M/sec
>         8841883  cycles                   #      8.830 M/sec  (scaled from 85.04%)
>         6361047  instructions             #      0.719 IPC    (scaled from 78.03%)
>          801847  cache-references         #      0.801 M/sec  (scaled from 14.96%)
>           16925  cache-misses             #      0.017 M/sec  (scaled from 21.97%)
> 
>     1.001540692  seconds time elapsed
> 
> # ./perf stat --repeat 10 /bin/ls -R /usr/bin >/dev/null
> 
>  Performance counter stats for '/bin/ls -R /usr/bin' (10 runs):
> 
>        8.642859  task-clock-msecs         #      0.939 CPUs    ( +-   2.382% )
>               1  context-switches         #      0.000 M/sec   ( +-   0.000% )
>               0  CPU-migrations           #      0.000 M/sec   ( +-   0.000% )
>             375  page-faults              #      0.043 M/sec   ( +-   0.000% )
>        14796621  cycles                   #   1712.005 M/sec   ( +-   0.306% )
>        15526008  instructions             #      1.049 IPC     ( +-   0.020% )
>   <not counted>  cache-references        
>   <not counted>  cache-misses            
> 
>     0.009207932  seconds time elapsed   ( +-   4.020% )
> 
> # ./perf stat --repeat 10 /bin/ls -lR /usr/bin >/dev/null
> 
>  Performance counter stats for '/bin/ls -lR /usr/bin' (10 runs):
> 
>       49.894190  task-clock-msecs         #      0.762 CPUs    ( +-   8.231% )
>              45  context-switches         #      0.001 M/sec   ( +-  52.174% )
>               0  CPU-migrations           #      0.000 M/sec   ( +-   0.000% )
>             476  page-faults              #      0.010 M/sec   ( +-   0.000% )
>        81036819  cycles                   #   1624.173 M/sec   ( +-   3.166% )  (scaled from 96.03%)
>        88822455  instructions             #      1.096 IPC     ( +-   2.108% )  (scaled from 96.42%)
>          573228  cache-references         #     11.489 M/sec   ( +-  37.947% )
>   <not counted>  cache-misses            
> 
>     0.065464625  seconds time elapsed   ( +-  19.349% )
> 
> # ./perf stat --repeat 10 /bin/ls -R /usr/lib >/dev/null
> 
>  Performance counter stats for '/bin/ls -R /usr/lib' (10 runs):
> 
>      136.771643  task-clock-msecs         #      0.075 CPUs    ( +-  19.292% )
>             757  context-switches         #      0.006 M/sec   ( +-  56.332% )
>               0  CPU-migrations           #      0.000 M/sec   ( +-   0.000% )
>             343  page-faults              #      0.003 M/sec   ( +-   0.000% )
>       173125500  cycles                   #   1265.800 M/sec   ( +-   5.034% )  (scaled from 78.21%)
>       174597635  instructions             #      1.009 IPC     ( +-   1.829% )  (scaled from 78.34%)
>         2441413  cache-references         #     17.850 M/sec   ( +-  18.289% )  (scaled from 21.79%)
>           21112  cache-misses             #      0.154 M/sec   ( +-  11.183% )  (scaled from 21.66%)
> 
>     1.824204481  seconds time elapsed   ( +-  29.781% )

Ok, this all looks pretty good!

	Ingo

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: *PING* [PATCH]: x86: mce: fix mce warning with disabled lapic
  2009-08-10 10:31 ` Andi Kleen
  2009-08-10 12:27   ` Johannes Stezenbach
@ 2009-08-12 11:59   ` Ingo Molnar
  1 sibling, 0 replies; 25+ messages in thread
From: Ingo Molnar @ 2009-08-12 11:59 UTC (permalink / raw)
  To: Andi Kleen; +Cc: Johannes Stezenbach, x86, linux-kernel, Rafael J. Wysocki


* Andi Kleen <andi@firstfloor.org> wrote:

> *PING* can someone please take the patch in there and process the 
> other suggestions?

Firstly, a basic patch submission technical matter: could you please 
stop spamming maintainers of the x86/MCE code with such '*PING*' 
private mails, for patches you never properly submitted to begin 
with?

The proper way to submit an upstream kernel fix, as you should well 
be aware of, is to send a patch with a proper title and to Cc: it to 
lkml and the maintainers affected. You never did that, you only 
posted a for-testing patch into a discussion. Please stop this 
self-important posturing, it's somewhat annoying.

Also, another, patch log quality issue, please credit Johannes 
properly. You put this into the changelog:

> Originally reported by Johannes Stezenbach 
> 
> This is a 2.6.31 candidate because it fixes a regression.
> 
> Signed-off-by: Andi Kleen <ak@linux.intel.com>

The proper way is to put this into the changelog:

  Reported-by: Johannes Stezenbach <js@sig21.net>

And given that Johannes also tested the patch, another line of:

  Tested-by: Johannes Stezenbach <js@sig21.net>

Would be appropriate as well.

Thirdly, we can do better with the fix itself too:

> +++ linux/arch/x86/kernel/cpu/mcheck/therm_throt.c
> @@ -236,6 +236,9 @@ void intel_init_thermal(struct cpuinfo_x
>  	int tm2 = 0;
>  	u32 l, h;
>  
> +	if (!cpu_has_apic || disable_apic)
> +		return;
> +
>  	/* Thermal monitoring depends on ACPI and clock modulation*/
>  	if (!cpu_has(c, X86_FEATURE_ACPI) || !cpu_has(c, X86_FEATURE_ACC))
>  		return;

we already have that X86_FEATURE_ACPI and X86_FEATURE_ACC check and 
a return statement. Would be better to expand that with the APIC 
checks. Plus update the comment to also mention APIC as a 
requirement plus fix the small error in the comment too while at it.

If these problems are fixed i'll apply the fix to tip:x86/urgent.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: 2.6.31-rc5 regression: x86 MCE malfunction on Thinkpad T42p
  2009-08-11 15:40                     ` 2.6.31-rc5 regression: x86 MCE malfunction on Thinkpad T42p Johannes Stezenbach
@ 2009-08-17 14:49                       ` Steven Rostedt
  0 siblings, 0 replies; 25+ messages in thread
From: Steven Rostedt @ 2009-08-17 14:49 UTC (permalink / raw)
  To: Johannes Stezenbach
  Cc: Ingo Molnar, Robert Richter, Andi Kleen, x86, linux-kernel,
	Rafael J. Wysocki


On Tue, 11 Aug 2009, Johannes Stezenbach wrote:

> On Tue, Aug 11, 2009 at 12:13:07AM +0200, Johannes Stezenbach wrote:
> > On Mon, Aug 10, 2009 at 11:31:33PM +0200, Ingo Molnar wrote:
> > > * Johannes Stezenbach <js@sig21.net> wrote:
> > > > 
> > > > Could the warning be caused by the cpufreq ondemand governor? ISTR 
> > > > that one should switch to the performance governor before doing 
> > > > any profiling, but I forgot for this test.
> > > 
> > > there might be a connection - it could in theory cause sched_clock() 
> > > transients and confuse the ring-buffer time-stamping.
> > 
> > I'll try tomorrow after a fresh boot if the warning also appears
> > with the performance governor.
> 
> Nope, cpufreq was not the culprit.  The "Delta way too big!"
> warning in rb_reserve_next_event() happens after suspend-to-RAM.

Yeah, the ring buffer expects the time delta of two different events on 
the same page to be less than 2^59 nanosecs apart. But something must have 
screwed things up, since 2^59 nanosecs is 6671 days (18 years).

Perhaps coming out of suspend to ram changes the timestamp? Does it reset 
the result of sched_clock?

-- Steve


^ permalink raw reply	[flat|nested] 25+ messages in thread

end of thread, other threads:[~2009-08-17 14:49 UTC | newest]

Thread overview: 25+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2009-08-07 17:09 2.6.31-rc5 regression: x86 MCE malfunction on Thinkpad T42p Johannes Stezenbach
2009-08-09 10:03 ` Johannes Stezenbach
2009-08-09 10:34   ` Bartlomiej Zolnierkiewicz
2009-08-09 16:47     ` Johannes Stezenbach
2009-08-10 10:31 ` Andi Kleen
2009-08-10 12:27   ` Johannes Stezenbach
2009-08-10 12:32     ` Andi Kleen
2009-08-10 12:56       ` Johannes Stezenbach
2009-08-10 13:29         ` Ingo Molnar
2009-08-10 19:26           ` Johannes Stezenbach
2009-08-10 19:44             ` Andi Kleen
2009-08-10 20:05               ` Robert Richter
2009-08-10 20:14             ` Ingo Molnar
2009-08-10 20:37               ` Johannes Stezenbach
2009-08-10 21:31                 ` Ingo Molnar
2009-08-10 22:13                   ` Johannes Stezenbach
2009-08-11  9:34                     ` [patch] cache-miss and cache-refs events on P6-mobile CPUs Ingo Molnar
2009-08-11  9:39                       ` Peter Zijlstra
2009-08-11 11:06                         ` Ingo Molnar
2009-08-11 11:21                           ` Peter Zijlstra
2009-08-11 15:50                       ` Johannes Stezenbach
2009-08-11 16:56                         ` Ingo Molnar
2009-08-11 15:40                     ` 2.6.31-rc5 regression: x86 MCE malfunction on Thinkpad T42p Johannes Stezenbach
2009-08-17 14:49                       ` Steven Rostedt
2009-08-12 11:59   ` *PING* [PATCH]: x86: mce: fix mce warning with disabled lapic Ingo Molnar

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.