* kernel oops and panic in acpi_atomic_read under 2.6.39.3.   call trace included
@ 2011-08-17 21:51 Rick Warner
  2011-08-18  7:47 ` Rafael J. Wysocki
  0 siblings, 1 reply; 18+ messages in thread
From: Rick Warner @ 2011-08-17 21:51 UTC (permalink / raw)
  To: linux-kernel; +Cc: Richard Houghton

Hi all,

I am getting a kernel oops/panic on a dual Xeon system that is the master of a 
60-node HPC cluster.  This is happening while stress testing the system, 
including significant network traffic.  The OS is openSUSE 11.4. 

We are running a custom-compiled 2.6.39.3 kernel on the systems due to a bug 
(igb driver related) in the stock kernel provided with 11.4.  After 1-3 days of 
heavy testing, the master node locks up with the caps lock and scroll lock 
keys on the keyboard blinking, with the following output captured via a serial 
console:

[381920.681113] BUG: unable to handle kernel NULL pointer dereference at           
(null)
[381920.689067] IP: [<ffffffff812a7510>] acpi_atomic_read+0xe3/0x120
[381920.695187] PGD 30c27a067 PUD 16efe6067 PMD 0 
[381920.699782] Oops: 0000 [#1] PREEMPT SMP 
[381920.703866] last sysfs file: 
/sys/devices/system/cpu/cpu15/cache/index2/shared_cpu_map
[381920.711868] CPU 6 
[381920.713800] Modules linked in: md5 ipmi_devintf ipmi_si ipmi_msghandler 
nfsd lockd nfs_acl auth_rpcgss sunrpc ipt_MASQUERADE iptable_mangle 
iptable_nat nf_nat nf_conntrack_ipv4 nf_conntrack nf_defrag_ipv4 iptable_filter 
ip_tables x_tables af_packet edd cpufreq_conservative cpufreq_userspace 
cpufreq_powersave acpi_cpufreq mperf xfs dm_mod ioatdma i7core_edac edac_core 
sr_mod cdrom joydev igb i2c_i801 sg button ghes hed iTCO_wdt 
iTCO_vendor_support dca pcspkr ext4 jbd2 crc16 raid456 async_raid6_recov 
async_pq raid6_pq async_xor xor async_memcpy async_tx raid10 raid1 raid0 fan 
processor thermal thermal_sys ata_generic pata_atiixp arcmsr
[381920.771623] 
[381920.773210] Pid: 12701, comm: cluster Not tainted 2.6.39.3-microwaycustom 
#1 Supermicro X8DTH-i/6/iF/6F/X8DTH
[381920.783292] RIP: 0010:[<ffffffff812a7510>]  [<ffffffff812a7510>] 
acpi_atomic_read+0xe3/0x120
[381920.791853] RSP: 0000:ffff88063fc47d98  EFLAGS: 00010046
[381920.797260] RAX: 0000000000000000 RBX: 00000000bf7b5ff0 RCX: ffffffff81a3cdd0
[381920.804486] RDX: 00000000bf7b6010 RSI: 00000000bf7b6000 RDI: 
ffff88062d4b95c0
[381920.811712] RBP: ffff88063fc47dc8 R08: ffff88063fc47d98 R09: 0000000000000002
[381920.818940] R10: 0000000000000083 R11: 0000000000000010 R12: 
ffffc90003044c20
[381920.826168] R13: ffff88063fc47de0 R14: 0000000000000000 R15: 
0000000000000000
[381920.833392] FS:  0000000000000000(0000) GS:ffff88063fc40000(0000) 
knlGS:0000000000000000
[381920.841571] CS:  0010 DS: 002b ES: 002b CR0: 0000000080050033
[381920.847409] CR2: 0000000000000000 CR3: 00000001d2d0f000 CR4: 
00000000000006e0
[381920.854635] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 
0000000000000000
[381920.861861] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[381920.869088] Process cluster (pid: 12701, threadinfo ffff880028284000, task 
ffff880550184380)
[381920.877608] Stack:
[381920.879717]  ffffffff81a3cdd0 00000000bf7b5ff0 ffff88062b2cc140 ffffc90003044ca8
[381920.887288]  ffff88062b2cc140 0000000000000001 ffff88063fc47e08 ffffffffa002b21f
[381920.894862]  0000000000000000 0000000000000000 ffff88062b2cc140 
0000000000000000
[381920.902435] Call Trace:
[381920.904969]  <NMI> 
[381920.907201]  [<ffffffffa002b21f>] ghes_read_estatus+0x2f/0x170 [ghes]
[381920.913726]  [<ffffffffa002b618>] ghes_notify_nmi+0xd8/0x1b0 [ghes]
[381920.920076]  [<ffffffff8151a14f>] notifier_call_chain+0x3f/0x80
[381920.926080]  [<ffffffff8151a1d8>] __atomic_notifier_call_chain+0x48/0x70
[381920.932864]  [<ffffffff8151a211>] atomic_notifier_call_chain+0x11/0x20
[381920.939473]  [<ffffffff8151a24e>] notify_die+0x2e/0x30
[381920.944691]  [<ffffffff81517bd2>] do_nmi+0xa2/0x270
[381920.949649]  [<ffffffff81517550>] nmi+0x20/0x30
[381920.954269]  [<ffffffff8102ae8a>] ? native_write_msr_safe+0xa/0x10
[381920.960536]  <<EOE>> 
[381920.962725]  <IRQ> 
[381920.965132]  [<ffffffff8101131e>] intel_pmu_disable_all+0x3e/0x120
[381920.971399]  [<ffffffff81010d5a>] x86_pmu_disable+0x4a/0x50
[381920.977066]  [<ffffffff810e7f2b>] perf_pmu_disable+0x2b/0x40
[381920.982817]  [<ffffffff810ee828>] perf_event_task_tick+0x218/0x270
[381920.989082]  [<ffffffff81046f4d>] scheduler_tick+0xdd/0x2c0
[381920.994740]  [<ffffffff81067097>] update_process_times+0x67/0x80
[381921.000832]  [<ffffffff81089eef>] tick_sched_timer+0x5f/0xc0
[381921.006578]  [<ffffffff81089e90>] ? tick_nohz_handler+0x100/0x100
[381921.012758]  [<ffffffff8107db3d>] __run_hrtimer+0x12d/0x280
[381921.018414]  [<ffffffff8107dea7>] hrtimer_interrupt+0xb7/0x1e0
[381921.024343]  [<ffffffff810206d7>] smp_apic_timer_interrupt+0x67/0xa0
[381921.030793]  [<ffffffff8151e3f3>] apic_timer_interrupt+0x13/0x20
[381921.036891]  <EOI> 
[381921.038913] Code: fc 10 74 1f 77 08 41 80 fc 08 75 48 eb 0e 41 80 fc 20 74 
17 41 80 fc 40 75 3a eb 15 8a 00 0f b6 c0 eb 11 66 8b 00 0f b7 c0 eb 09 <8b> 
00 89 c0 eb 03 48 8b 00 49 89 45 00 e8 8e 2b e2 ff eb 1b 0f 
[381921.059462] RIP  [<ffffffff812a7510>] acpi_atomic_read+0xe3/0x120
[381921.065669]  RSP <ffff88063fc47d98>
[381921.069242] CR2: 0000000000000000
[381921.072645] ---[ end trace 52697bfc73a34a90 ]---
[381921.077343] Kernel panic - not syncing: Fatal exception in interrupt
[381921.083784] Pid: 12701, comm: cluster Tainted: G      D     2.6.39.3-
microwaycustom #1
[381921.091788] Call Trace:
[381921.094333]  <NMI>  [<ffffffff81513300>] panic+0x9f/0x1da
[381921.099881]  [<ffffffff81517f8c>] oops_end+0xdc/0xf0
[381921.104952]  [<ffffffff81032b91>] no_context+0xf1/0x260
[381921.110277]  [<ffffffff81032e55>] __bad_area_nosemaphore+0x155/0x200
[381921.116739]  [<ffffffff81032f0e>] bad_area_nosemaphore+0xe/0x10
[381921.122766]  [<ffffffff81519f46>] do_page_fault+0x366/0x530
[381921.128440]  [<ffffffff810ee929>] ? __perf_event_overflow+0xa9/0x220
[381921.134896]  [<ffffffff810ef99b>] ? perf_event_update_userpage+0x9b/0xe0
[381921.141698]  [<ffffffff81012249>] ? intel_pmu_enable_all+0xc9/0x1a0
[381921.148061]  [<ffffffff810123ff>] ? x86_perf_event_set_period+0xdf/0x170
[381921.154852]  [<ffffffff81517295>] page_fault+0x25/0x30
[381921.160081]  [<ffffffff812a7510>] ? acpi_atomic_read+0xe3/0x120
[381921.166094]  [<ffffffff812a7486>] ? acpi_atomic_read+0x59/0x120
[381921.172107]  [<ffffffffa002b21f>] ghes_read_estatus+0x2f/0x170 [ghes]
[381921.178639]  [<ffffffffa002b618>] ghes_notify_nmi+0xd8/0x1b0 [ghes]
[381921.185004]  [<ffffffff8151a14f>] notifier_call_chain+0x3f/0x80
[381921.191003]  [<ffffffff8151a1d8>] __atomic_notifier_call_chain+0x48/0x70
[381921.197789]  [<ffffffff8151a211>] atomic_notifier_call_chain+0x11/0x20
[381921.204399]  [<ffffffff8151a24e>] notify_die+0x2e/0x30
[381921.209624]  [<ffffffff81517bd2>] do_nmi+0xa2/0x270
[381921.214584]  [<ffffffff81517550>] nmi+0x20/0x30
[381921.219208]  [<ffffffff8102ae8a>] ? native_write_msr_safe+0xa/0x10
[381921.225470]  <<EOE>>  <IRQ>  [<ffffffff8101131e>] 
intel_pmu_disable_all+0x3e/0x120
[381921.233174]  [<ffffffff81010d5a>] x86_pmu_disable+0x4a/0x50
[381921.238836]  [<ffffffff810e7f2b>] perf_pmu_disable+0x2b/0x40
[381921.244582]  [<ffffffff810ee828>] perf_event_task_tick+0x218/0x270
[381921.250855]  [<ffffffff81046f4d>] scheduler_tick+0xdd/0x2c0
[381921.256520]  [<ffffffff81067097>] update_process_times+0x67/0x80
[381921.262618]  [<ffffffff81089eef>] tick_sched_timer+0x5f/0xc0
[381921.268363]  [<ffffffff81089e90>] ? tick_nohz_handler+0x100/0x100
[381921.274540]  [<ffffffff8107db3d>] __run_hrtimer+0x12d/0x280
[381921.280198]  [<ffffffff8107dea7>] hrtimer_interrupt+0xb7/0x1e0
[381921.286125]  [<ffffffff810206d7>] smp_apic_timer_interrupt+0x67/0xa0
[381921.292578]  [<ffffffff8151e3f3>] apic_timer_interrupt+0x13/0x20
[381921.298673]  <EOI>


I recompiled the kernel again with the tickless feature disabled, since I saw 
"nohz" early in the call trace.  After that, the problem reproduced again, but 
the call trace changed only slightly, showing tick_init_highres instead of 
tick_nohz_handler.  I have that call trace as well if it is desired; it is 
nearly identical, though.
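
For reference, the tickless feature in question is the CONFIG_NO_HZ option, so 
the rebuilt kernel's .config now contains:

  # CONFIG_NO_HZ is not set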

I am currently trying a custom 2.6.36.4 kernel on the system, but it will take 
up to 3 days before I know whether the problem exists there as well.

Any ideas on this?

Thanks,
Rick
-- 
Richard Warner
Lead Systems Integrator
Microway, Inc
(508)732-5517


* Re: kernel oops and panic in acpi_atomic_read under 2.6.39.3.   call trace included
  2011-08-17 21:51 kernel oops and panic in acpi_atomic_read under 2.6.39.3. call trace included Rick Warner
@ 2011-08-18  7:47 ` Rafael J. Wysocki
  2011-08-18 21:43   ` Rafael J. Wysocki
  0 siblings, 1 reply; 18+ messages in thread
From: Rafael J. Wysocki @ 2011-08-18  7:47 UTC (permalink / raw)
  To: Rick Warner
  Cc: linux-kernel, Richard Houghton, ACPI Devel Mailing List,
	Len Brown, Matthew Garrett

Hi,

It's better to report ACPI failures to linux-acpi.

On Wednesday, August 17, 2011, Rick Warner wrote:
> Hi all,
> 
> I am getting a kernel oops/panic on a dual xeon system that is the master of a 
> 60 node HPC cluster.  This is happening while stress testing the system 
> including significant network traffic.  The OS is openSuse 11.4. 
> 
> We are running a custom compiled 2.6.39.3 kernel on the systems due to a bug 
> in the stock kernel 11.4 provided (igb driver related).  After 1-3 days of 
> heavy testing, the master node locks up with the caps lock and scroll lock 
> keys on the keyboard blinking with the following output captured via a serial 
> console:
> 
> [381920.681113] BUG: unable to handle kernel NULL pointer dereference at           
> (null)
> [381920.689067] IP: [<ffffffff812a7510>] acpi_atomic_read+0xe3/0x120
> [381920.695187] PGD 30c27a067 PUD 16efe6067 PMD 0 
> [381920.699782] Oops: 0000 [#1] PREEMPT SMP 
> [381920.703866] last sysfs file: 
> /sys/devices/system/cpu/cpu15/cache/index2/shared_cpu_map
> [381920.711868] CPU 6 
> [381920.713800] Modules linked in: md5 ipmi_devintf ipmi_si ipmi_msghandler 
> nfsd lockd nfs_acl auth_rpcgss sunrpc ipt_MASQUERADE iptable_mangle 
> iptable_nat nf_nat nf_conntrack_ipv4 nf_conntrack nf_defrag_ipv4 iptable_filter 
> ip_tables x_tables af_packet edd cpufreq_conservative cpufreq_userspace 
> cpufreq_powersave acpi_cpufreq mperf xfs dm_mod ioatdma i7core_edac edac_core 
> sr_mod cdrom joydev igb i2c_i801 sg button ghes hed iTCO_wdt 
> iTCO_vendor_support dca pcspkr ext4 jbd2 crc16 raid456 async_raid6_recov 
> async_pq raid6_pq async_xor xor async_memcpy async_tx raid10 raid1 raid0 fan 
> processor thermal thermal_sys ata_generic pata_atiixp arcmsr
> [381920.771623] 
> [381920.773210] Pid: 12701, comm: cluster Not tainted 2.6.39.3-microwaycustom 
> #1 Supermicro X8DTH-i/6/iF/6F/X8DTH
> [381920.783292] RIP: 0010:[<ffffffff812a7510>]  [<ffffffff812a7510>] 
> acpi_atomic_read+0xe3/0x120
> [381920.791853] RSP: 0000:ffff88063fc47d98  EFLAGS: 00010046
> [381920.797260] RAX: 0000000000000000 RBX: 00000000bf7b5ff0 RCX: ffffffff81a3cdd0
> [381920.804486] RDX: 00000000bf7b6010 RSI: 00000000bf7b6000 RDI: 
> ffff88062d4b95c0
> [381920.811712] RBP: ffff88063fc47dc8 R08: ffff88063fc47d98 R09: 0000000000000002
> [381920.818940] R10: 0000000000000083 R11: 0000000000000010 R12: 
> ffffc90003044c20
> [381920.826168] R13: ffff88063fc47de0 R14: 0000000000000000 R15: 
> 0000000000000000
> [381920.833392] FS:  0000000000000000(0000) GS:ffff88063fc40000(0000) 
> knlGS:0000000000000000
> [381920.841571] CS:  0010 DS: 002b ES: 002b CR0: 0000000080050033
> [381920.847409] CR2: 0000000000000000 CR3: 00000001d2d0f000 CR4: 
> 00000000000006e0
> [381920.854635] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 
> 0000000000000000
> [381920.861861] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
> [381920.869088] Process cluster (pid: 12701, threadinfo ffff880028284000, task 
> ffff880550184380)
> [381920.877608] Stack:
> [381920.879717]  ffffffff81a3cdd0 00000000bf7b5ff0 ffff88062b2cc140 ffffc90003044ca8
> [381920.887288]  ffff88062b2cc140 0000000000000001 ffff88063fc47e08 ffffffffa002b21f
> [381920.894862]  0000000000000000 0000000000000000 ffff88062b2cc140 
> 0000000000000000
> [381920.902435] Call Trace:
> [381920.904969]  <NMI> 
> [381920.907201]  [<ffffffffa002b21f>] ghes_read_estatus+0x2f/0x170 [ghes]
> [381920.913726]  [<ffffffffa002b618>] ghes_notify_nmi+0xd8/0x1b0 [ghes]
> [381920.920076]  [<ffffffff8151a14f>] notifier_call_chain+0x3f/0x80
> [381920.926080]  [<ffffffff8151a1d8>] __atomic_notifier_call_chain+0x48/0x70
> [381920.932864]  [<ffffffff8151a211>] atomic_notifier_call_chain+0x11/0x20
> [381920.939473]  [<ffffffff8151a24e>] notify_die+0x2e/0x30
> [381920.944691]  [<ffffffff81517bd2>] do_nmi+0xa2/0x270
> [381920.949649]  [<ffffffff81517550>] nmi+0x20/0x30
> [381920.954269]  [<ffffffff8102ae8a>] ? native_write_msr_safe+0xa/0x10
> [381920.960536]  <<EOE>> 
> [381920.962725]  <IRQ> 
> [381920.965132]  [<ffffffff8101131e>] intel_pmu_disable_all+0x3e/0x120
> [381920.971399]  [<ffffffff81010d5a>] x86_pmu_disable+0x4a/0x50
> [381920.977066]  [<ffffffff810e7f2b>] perf_pmu_disable+0x2b/0x40
> [381920.982817]  [<ffffffff810ee828>] perf_event_task_tick+0x218/0x270
> [381920.989082]  [<ffffffff81046f4d>] scheduler_tick+0xdd/0x2c0
> [381920.994740]  [<ffffffff81067097>] update_process_times+0x67/0x80
> [381921.000832]  [<ffffffff81089eef>] tick_sched_timer+0x5f/0xc0
> [381921.006578]  [<ffffffff81089e90>] ? tick_nohz_handler+0x100/0x100
> [381921.012758]  [<ffffffff8107db3d>] __run_hrtimer+0x12d/0x280
> [381921.018414]  [<ffffffff8107dea7>] hrtimer_interrupt+0xb7/0x1e0
> [381921.024343]  [<ffffffff810206d7>] smp_apic_timer_interrupt+0x67/0xa0
> [381921.030793]  [<ffffffff8151e3f3>] apic_timer_interrupt+0x13/0x20
> [381921.036891]  <EOI> 
> [381921.038913] Code: fc 10 74 1f 77 08 41 80 fc 08 75 48 eb 0e 41 80 fc 20 74 
> 17 41 80 fc 40 75 3a eb 15 8a 00 0f b6 c0 eb 11 66 8b 00 0f b7 c0 eb 09 <8b> 
> 00 89 c0 eb 03 48 8b 00 49 89 45 00 e8 8e 2b e2 ff eb 1b 0f 
> [381921.059462] RIP  [<ffffffff812a7510>] acpi_atomic_read+0xe3/0x120
> [381921.065669]  RSP <ffff88063fc47d98>
> [381921.069242] CR2: 0000000000000000
> [381921.072645] ---[ end trace 52697bfc73a34a90 ]---
> [381921.077343] Kernel panic - not syncing: Fatal exception in interrupt
> [381921.083784] Pid: 12701, comm: cluster Tainted: G      D     2.6.39.3-
> microwaycustom #1
> [381921.091788] Call Trace:
> [381921.094333]  <NMI>  [<ffffffff81513300>] panic+0x9f/0x1da
> [381921.099881]  [<ffffffff81517f8c>] oops_end+0xdc/0xf0
> [381921.104952]  [<ffffffff81032b91>] no_context+0xf1/0x260
> [381921.110277]  [<ffffffff81032e55>] __bad_area_nosemaphore+0x155/0x200
> [381921.116739]  [<ffffffff81032f0e>] bad_area_nosemaphore+0xe/0x10
> [381921.122766]  [<ffffffff81519f46>] do_page_fault+0x366/0x530
> [381921.128440]  [<ffffffff810ee929>] ? __perf_event_overflow+0xa9/0x220
> [381921.134896]  [<ffffffff810ef99b>] ? perf_event_update_userpage+0x9b/0xe0
> [381921.141698]  [<ffffffff81012249>] ? intel_pmu_enable_all+0xc9/0x1a0
> [381921.148061]  [<ffffffff810123ff>] ? x86_perf_event_set_period+0xdf/0x170
> [381921.154852]  [<ffffffff81517295>] page_fault+0x25/0x30
> [381921.160081]  [<ffffffff812a7510>] ? acpi_atomic_read+0xe3/0x120
> [381921.166094]  [<ffffffff812a7486>] ? acpi_atomic_read+0x59/0x120
> [381921.172107]  [<ffffffffa002b21f>] ghes_read_estatus+0x2f/0x170 [ghes]
> [381921.178639]  [<ffffffffa002b618>] ghes_notify_nmi+0xd8/0x1b0 [ghes]
> [381921.185004]  [<ffffffff8151a14f>] notifier_call_chain+0x3f/0x80
> [381921.191003]  [<ffffffff8151a1d8>] __atomic_notifier_call_chain+0x48/0x70
> [381921.197789]  [<ffffffff8151a211>] atomic_notifier_call_chain+0x11/0x20
> [381921.204399]  [<ffffffff8151a24e>] notify_die+0x2e/0x30
> [381921.209624]  [<ffffffff81517bd2>] do_nmi+0xa2/0x270
> [381921.214584]  [<ffffffff81517550>] nmi+0x20/0x30
> [381921.219208]  [<ffffffff8102ae8a>] ? native_write_msr_safe+0xa/0x10
> [381921.225470]  <<EOE>>  <IRQ>  [<ffffffff8101131e>] 
> intel_pmu_disable_all+0x3e/0x120
> [381921.233174]  [<ffffffff81010d5a>] x86_pmu_disable+0x4a/0x50
> [381921.238836]  [<ffffffff810e7f2b>] perf_pmu_disable+0x2b/0x40
> [381921.244582]  [<ffffffff810ee828>] perf_event_task_tick+0x218/0x270
> [381921.250855]  [<ffffffff81046f4d>] scheduler_tick+0xdd/0x2c0
> [381921.256520]  [<ffffffff81067097>] update_process_times+0x67/0x80
> [381921.262618]  [<ffffffff81089eef>] tick_sched_timer+0x5f/0xc0
> [381921.268363]  [<ffffffff81089e90>] ? tick_nohz_handler+0x100/0x100
> [381921.274540]  [<ffffffff8107db3d>] __run_hrtimer+0x12d/0x280
> [381921.280198]  [<ffffffff8107dea7>] hrtimer_interrupt+0xb7/0x1e0
> [381921.286125]  [<ffffffff810206d7>] smp_apic_timer_interrupt+0x67/0xa0
> [381921.292578]  [<ffffffff8151e3f3>] apic_timer_interrupt+0x13/0x20
> [381921.298673]  <EOI>
> 
> 
> I recompiled the kernel again, disabling the tickless feature as I saw "nohz" 
> early in the call trace.  After that, it reproduced again but the call trace 
> only changed slightly having tick_init_highres in there instead of 
> tick_nohz_handler.  I have that call trace if it is desired as well.  It is 
> nearly identical though.
> 
> I am currently trying a custom 2.6.36.4 on the system, but will need up to 3 
> days before I know if the problem exists there as well.
> 
> Any ideas on this?

It looks like a wrong address in ghes_read_estatus().  I'll have a look at it
later today.

Thanks,
Rafael


* Re: kernel oops and panic in acpi_atomic_read under 2.6.39.3.   call trace included
  2011-08-18  7:47 ` Rafael J. Wysocki
@ 2011-08-18 21:43   ` Rafael J. Wysocki
  2011-08-22 14:42     ` rick
  0 siblings, 1 reply; 18+ messages in thread
From: Rafael J. Wysocki @ 2011-08-18 21:43 UTC (permalink / raw)
  To: Rick Warner
  Cc: linux-kernel, Richard Houghton, ACPI Devel Mailing List,
	Len Brown, Matthew Garrett

On Thursday, August 18, 2011, Rafael J. Wysocki wrote:
> Hi,
> 
> It's better to report ACPI failures to linux-acpi.
> 
> On Wednesday, August 17, 2011, Rick Warner wrote:
> > Hi all,
> > 
> > I am getting a kernel oops/panic on a dual xeon system that is the master of a 
> > 60 node HPC cluster.  This is happening while stress testing the system 
> > including significant network traffic.  The OS is openSuse 11.4. 
> > 
> > We are running a custom compiled 2.6.39.3 kernel on the systems due to a bug 
> > in the stock kernel 11.4 provided (igb driver related).  After 1-3 days of 
> > heavy testing, the master node locks up with the caps lock and scroll lock 
> > keys on the keyboard blinking with the following output captured via a serial 
> > console:
> > 
> > [381920.681113] BUG: unable to handle kernel NULL pointer dereference at           
> > (null)
> > [381920.689067] IP: [<ffffffff812a7510>] acpi_atomic_read+0xe3/0x120
> > [381920.695187] PGD 30c27a067 PUD 16efe6067 PMD 0 
> > [381920.699782] Oops: 0000 [#1] PREEMPT SMP 
> > [381920.703866] last sysfs file: 
> > /sys/devices/system/cpu/cpu15/cache/index2/shared_cpu_map
> > [381920.711868] CPU 6 
> > [381920.713800] Modules linked in: md5 ipmi_devintf ipmi_si ipmi_msghandler 
> > nfsd lockd nfs_acl auth_rpcgss sunrpc ipt_MASQUERADE iptable_mangle 
> > iptable_nat nf_nat nf_conntrack_ipv4 nf_conntrack nf_defrag_ipv4 iptable_filter 
> > ip_tables x_tables af_packet edd cpufreq_conservative cpufreq_userspace 
> > cpufreq_powersave acpi_cpufreq mperf xfs dm_mod ioatdma i7core_edac edac_core 
> > sr_mod cdrom joydev igb i2c_i801 sg button ghes hed iTCO_wdt 
> > iTCO_vendor_support dca pcspkr ext4 jbd2 crc16 raid456 async_raid6_recov 
> > async_pq raid6_pq async_xor xor async_memcpy async_tx raid10 raid1 raid0 fan 
> > processor thermal thermal_sys ata_generic pata_atiixp arcmsr
> > [381920.771623] 
> > [381920.773210] Pid: 12701, comm: cluster Not tainted 2.6.39.3-microwaycustom 
> > #1 Supermicro X8DTH-i/6/iF/6F/X8DTH
> > [381920.783292] RIP: 0010:[<ffffffff812a7510>]  [<ffffffff812a7510>] 
> > acpi_atomic_read+0xe3/0x120
> > [381920.791853] RSP: 0000:ffff88063fc47d98  EFLAGS: 00010046
> > [381920.797260] RAX: 0000000000000000 RBX: 00000000bf7b5ff0 RCX: ffffffff81a3cdd0
> > [381920.804486] RDX: 00000000bf7b6010 RSI: 00000000bf7b6000 RDI: 
> > ffff88062d4b95c0
> > [381920.811712] RBP: ffff88063fc47dc8 R08: ffff88063fc47d98 R09: 0000000000000002
> > [381920.818940] R10: 0000000000000083 R11: 0000000000000010 R12: 
> > ffffc90003044c20
> > [381920.826168] R13: ffff88063fc47de0 R14: 0000000000000000 R15: 
> > 0000000000000000
> > [381920.833392] FS:  0000000000000000(0000) GS:ffff88063fc40000(0000) 
> > knlGS:0000000000000000
> > [381920.841571] CS:  0010 DS: 002b ES: 002b CR0: 0000000080050033
> > [381920.847409] CR2: 0000000000000000 CR3: 00000001d2d0f000 CR4: 
> > 00000000000006e0
> > [381920.854635] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 
> > 0000000000000000
> > [381920.861861] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
> > [381920.869088] Process cluster (pid: 12701, threadinfo ffff880028284000, task 
> > ffff880550184380)
> > [381920.877608] Stack:
> > [381920.879717]  ffffffff81a3cdd0 00000000bf7b5ff0 ffff88062b2cc140 ffffc90003044ca8
> > [381920.887288]  ffff88062b2cc140 0000000000000001 ffff88063fc47e08 ffffffffa002b21f
> > [381920.894862]  0000000000000000 0000000000000000 ffff88062b2cc140 
> > 0000000000000000
> > [381920.902435] Call Trace:
> > [381920.904969]  <NMI> 
> > [381920.907201]  [<ffffffffa002b21f>] ghes_read_estatus+0x2f/0x170 [ghes]
> > [381920.913726]  [<ffffffffa002b618>] ghes_notify_nmi+0xd8/0x1b0 [ghes]
> > [381920.920076]  [<ffffffff8151a14f>] notifier_call_chain+0x3f/0x80
> > [381920.926080]  [<ffffffff8151a1d8>] __atomic_notifier_call_chain+0x48/0x70
> > [381920.932864]  [<ffffffff8151a211>] atomic_notifier_call_chain+0x11/0x20
> > [381920.939473]  [<ffffffff8151a24e>] notify_die+0x2e/0x30
> > [381920.944691]  [<ffffffff81517bd2>] do_nmi+0xa2/0x270
> > [381920.949649]  [<ffffffff81517550>] nmi+0x20/0x30
> > [381920.954269]  [<ffffffff8102ae8a>] ? native_write_msr_safe+0xa/0x10
> > [381920.960536]  <<EOE>> 
> > [381920.962725]  <IRQ> 
> > [381920.965132]  [<ffffffff8101131e>] intel_pmu_disable_all+0x3e/0x120
> > [381920.971399]  [<ffffffff81010d5a>] x86_pmu_disable+0x4a/0x50
> > [381920.977066]  [<ffffffff810e7f2b>] perf_pmu_disable+0x2b/0x40
> > [381920.982817]  [<ffffffff810ee828>] perf_event_task_tick+0x218/0x270
> > [381920.989082]  [<ffffffff81046f4d>] scheduler_tick+0xdd/0x2c0
> > [381920.994740]  [<ffffffff81067097>] update_process_times+0x67/0x80
> > [381921.000832]  [<ffffffff81089eef>] tick_sched_timer+0x5f/0xc0
> > [381921.006578]  [<ffffffff81089e90>] ? tick_nohz_handler+0x100/0x100
> > [381921.012758]  [<ffffffff8107db3d>] __run_hrtimer+0x12d/0x280
> > [381921.018414]  [<ffffffff8107dea7>] hrtimer_interrupt+0xb7/0x1e0
> > [381921.024343]  [<ffffffff810206d7>] smp_apic_timer_interrupt+0x67/0xa0
> > [381921.030793]  [<ffffffff8151e3f3>] apic_timer_interrupt+0x13/0x20
> > [381921.036891]  <EOI> 
> > [381921.038913] Code: fc 10 74 1f 77 08 41 80 fc 08 75 48 eb 0e 41 80 fc 20 74 
> > 17 41 80 fc 40 75 3a eb 15 8a 00 0f b6 c0 eb 11 66 8b 00 0f b7 c0 eb 09 <8b> 
> > 00 89 c0 eb 03 48 8b 00 49 89 45 00 e8 8e 2b e2 ff eb 1b 0f 
> > [381921.059462] RIP  [<ffffffff812a7510>] acpi_atomic_read+0xe3/0x120
> > [381921.065669]  RSP <ffff88063fc47d98>
> > [381921.069242] CR2: 0000000000000000
> > [381921.072645] ---[ end trace 52697bfc73a34a90 ]---
> > [381921.077343] Kernel panic - not syncing: Fatal exception in interrupt
> > [381921.083784] Pid: 12701, comm: cluster Tainted: G      D     2.6.39.3-
> > microwaycustom #1
> > [381921.091788] Call Trace:
> > [381921.094333]  <NMI>  [<ffffffff81513300>] panic+0x9f/0x1da
> > [381921.099881]  [<ffffffff81517f8c>] oops_end+0xdc/0xf0
> > [381921.104952]  [<ffffffff81032b91>] no_context+0xf1/0x260
> > [381921.110277]  [<ffffffff81032e55>] __bad_area_nosemaphore+0x155/0x200
> > [381921.116739]  [<ffffffff81032f0e>] bad_area_nosemaphore+0xe/0x10
> > [381921.122766]  [<ffffffff81519f46>] do_page_fault+0x366/0x530
> > [381921.128440]  [<ffffffff810ee929>] ? __perf_event_overflow+0xa9/0x220
> > [381921.134896]  [<ffffffff810ef99b>] ? perf_event_update_userpage+0x9b/0xe0
> > [381921.141698]  [<ffffffff81012249>] ? intel_pmu_enable_all+0xc9/0x1a0
> > [381921.148061]  [<ffffffff810123ff>] ? x86_perf_event_set_period+0xdf/0x170
> > [381921.154852]  [<ffffffff81517295>] page_fault+0x25/0x30
> > [381921.160081]  [<ffffffff812a7510>] ? acpi_atomic_read+0xe3/0x120
> > [381921.166094]  [<ffffffff812a7486>] ? acpi_atomic_read+0x59/0x120
> > [381921.172107]  [<ffffffffa002b21f>] ghes_read_estatus+0x2f/0x170 [ghes]
> > [381921.178639]  [<ffffffffa002b618>] ghes_notify_nmi+0xd8/0x1b0 [ghes]
> > [381921.185004]  [<ffffffff8151a14f>] notifier_call_chain+0x3f/0x80
> > [381921.191003]  [<ffffffff8151a1d8>] __atomic_notifier_call_chain+0x48/0x70
> > [381921.197789]  [<ffffffff8151a211>] atomic_notifier_call_chain+0x11/0x20
> > [381921.204399]  [<ffffffff8151a24e>] notify_die+0x2e/0x30
> > [381921.209624]  [<ffffffff81517bd2>] do_nmi+0xa2/0x270
> > [381921.214584]  [<ffffffff81517550>] nmi+0x20/0x30
> > [381921.219208]  [<ffffffff8102ae8a>] ? native_write_msr_safe+0xa/0x10
> > [381921.225470]  <<EOE>>  <IRQ>  [<ffffffff8101131e>] 
> > intel_pmu_disable_all+0x3e/0x120
> > [381921.233174]  [<ffffffff81010d5a>] x86_pmu_disable+0x4a/0x50
> > [381921.238836]  [<ffffffff810e7f2b>] perf_pmu_disable+0x2b/0x40
> > [381921.244582]  [<ffffffff810ee828>] perf_event_task_tick+0x218/0x270
> > [381921.250855]  [<ffffffff81046f4d>] scheduler_tick+0xdd/0x2c0
> > [381921.256520]  [<ffffffff81067097>] update_process_times+0x67/0x80
> > [381921.262618]  [<ffffffff81089eef>] tick_sched_timer+0x5f/0xc0
> > [381921.268363]  [<ffffffff81089e90>] ? tick_nohz_handler+0x100/0x100
> > [381921.274540]  [<ffffffff8107db3d>] __run_hrtimer+0x12d/0x280
> > [381921.280198]  [<ffffffff8107dea7>] hrtimer_interrupt+0xb7/0x1e0
> > [381921.286125]  [<ffffffff810206d7>] smp_apic_timer_interrupt+0x67/0xa0
> > [381921.292578]  [<ffffffff8151e3f3>] apic_timer_interrupt+0x13/0x20
> > [381921.298673]  <EOI>
> > 
> > 
> > I recompiled the kernel again, disabling the tickless feature as I saw "nohz" 
> > early in the call trace.  After that, it reproduced again but the call trace 
> > only changed slightly having tick_init_highres in there instead of 
> > tick_nohz_handler.  I have that call trace if it is desired as well.  It is 
> > nearly identical though.
> > 
> > I am currently trying a custom 2.6.36.4 on the system, but will need up to 3 
> > days before I know if the problem exists there as well.
> > 
> > Any ideas on this?
> 
> It looks like a wrong address in ghes_read_estatus().  I'll have a look at it
> later today.

Hmm.  Please apply the appended patch and see if the error is still
reproducible (it's on top of the current mainline, so you may need to
adjust it by hand).
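
In case it does not apply cleanly to your 2.6.39.3 tree, a dry run should show 
where it needs adjusting before you apply it for real (the patch file name 
below is only an example):

  $ cd linux-2.6.39.3
  $ patch -p1 --dry-run < ghes-estatus-null-check.patch
  $ patch -p1 < ghes-estatus-null-check.patch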

Thanks,
Rafael

---
 drivers/acpi/apei/ghes.c |    3 +++
 1 file changed, 3 insertions(+)

Index: linux/drivers/acpi/apei/ghes.c
===================================================================
--- linux.orig/drivers/acpi/apei/ghes.c
+++ linux/drivers/acpi/apei/ghes.c
@@ -398,6 +398,9 @@ static int ghes_read_estatus(struct ghes
 	u32 len;
 	int rc;
 
+	if (!g)
+		return -EINVAL;
+
 	rc = acpi_atomic_read(&buf_paddr, &g->error_status_address);
 	if (rc) {
 		if (!silent && printk_ratelimit())


* Re: kernel oops and panic in acpi_atomic_read under 2.6.39.3.   call trace included
  2011-08-18 21:43   ` Rafael J. Wysocki
@ 2011-08-22 14:42     ` rick
  2011-08-22 18:47       ` Rafael J. Wysocki
  0 siblings, 1 reply; 18+ messages in thread
From: rick @ 2011-08-22 14:42 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: linux-kernel, Richard Houghton, ACPI Devel Mailing List,
	Len Brown, Matthew Garrett

> On Thursday, August 18, 2011, Rafael J. Wysocki wrote:
>> Hi,
>>
>> It's better to report ACPI failures to linux-acpi.
>>
>> On Wednesday, August 17, 2011, Rick Warner wrote:
>> > Hi all,
>> >
>> > I am getting a kernel oops/panic on a dual xeon system that is the
>> master of a
>> > 60 node HPC cluster.  This is happening while stress testing the
>> system
>> > including significant network traffic.  The OS is openSuse 11.4.
>> >
>> > We are running a custom compiled 2.6.39.3 kernel on the systems due to
>> a bug
>> > in the stock kernel 11.4 provided (igb driver related).  After 1-3
>> days of
>> > heavy testing, the master node locks up with the caps lock and scroll
>> lock
>> > keys on the keyboard blinking with the following output captured via a
>> serial
>> > console:
>> >
>> > [381920.681113] BUG: unable to handle kernel NULL pointer dereference
>> at
>> > (null)
>> > [381920.689067] IP: [<ffffffff812a7510>] acpi_atomic_read+0xe3/0x120
>> > [381920.695187] PGD 30c27a067 PUD 16efe6067 PMD 0
>> > [381920.699782] Oops: 0000 [#1] PREEMPT SMP
>> > [381920.703866] last sysfs file:
>> > /sys/devices/system/cpu/cpu15/cache/index2/shared_cpu_map
>> > [381920.711868] CPU 6
>> > [381920.713800] Modules linked in: md5 ipmi_devintf ipmi_si
>> ipmi_msghandler
>> > nfsd lockd nfs_acl auth_rpcgss sunrpc ipt_MASQUERADE iptable_mangle
>> > iptable_nat nf_nat nf_conntrack_ipv4 nf_conntrack nf_defrag_ipv4
>> iptable_filter
>> > ip_tables x_tables af_packet edd cpufreq_conservative
>> cpufreq_userspace
>> > cpufreq_powersave acpi_cpufreq mperf xfs dm_mod ioatdma i7core_edac
>> edac_core
>> > sr_mod cdrom joydev igb i2c_i801 sg button ghes hed iTCO_wdt
>> > iTCO_vendor_support dca pcspkr ext4 jbd2 crc16 raid456
>> async_raid6_recov
>> > async_pq raid6_pq async_xor xor async_memcpy async_tx raid10 raid1
>> raid0 fan
>> > processor thermal thermal_sys ata_generic pata_atiixp arcmsr
>> > [381920.771623]
>> > [381920.773210] Pid: 12701, comm: cluster Not tainted
>> 2.6.39.3-microwaycustom
>> > #1 Supermicro X8DTH-i/6/iF/6F/X8DTH
>> > [381920.783292] RIP: 0010:[<ffffffff812a7510>]  [<ffffffff812a7510>]
>> > acpi_atomic_read+0xe3/0x120
>> > [381920.791853] RSP: 0000:ffff88063fc47d98  EFLAGS: 00010046
>> > [381920.797260] RAX: 0000000000000000 RBX: 00000000bf7b5ff0 RCX:
>> ffffffff81a3cdd0
>> > [381920.804486] RDX: 00000000bf7b6010 RSI: 00000000bf7b6000 RDI:
>> > ffff88062d4b95c0
>> > [381920.811712] RBP: ffff88063fc47dc8 R08: ffff88063fc47d98 R09:
>> 0000000000000002
>> > [381920.818940] R10: 0000000000000083 R11: 0000000000000010 R12:
>> > ffffc90003044c20
>> > [381920.826168] R13: ffff88063fc47de0 R14: 0000000000000000 R15:
>> > 0000000000000000
>> > [381920.833392] FS:  0000000000000000(0000) GS:ffff88063fc40000(0000)
>> > knlGS:0000000000000000
>> > [381920.841571] CS:  0010 DS: 002b ES: 002b CR0: 0000000080050033
>> > [381920.847409] CR2: 0000000000000000 CR3: 00000001d2d0f000 CR4:
>> > 00000000000006e0
>> > [381920.854635] DR0: 0000000000000000 DR1: 0000000000000000 DR2:
>> > 0000000000000000
>> > [381920.861861] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7:
>> 0000000000000400
>> > [381920.869088] Process cluster (pid: 12701, threadinfo
>> ffff880028284000, task
>> > ffff880550184380)
>> > [381920.877608] Stack:
>> > [381920.879717]  ffffffff81a3cdd0 00000000bf7b5ff0 ffff88062b2cc140
>> ffffc90003044ca8
>> > [381920.887288]  ffff88062b2cc140 0000000000000001 ffff88063fc47e08
>> ffffffffa002b21f
>> > [381920.894862]  0000000000000000 0000000000000000 ffff88062b2cc140
>> > 0000000000000000
>> > [381920.902435] Call Trace:
>> > [381920.904969]  <NMI>
>> > [381920.907201]  [<ffffffffa002b21f>] ghes_read_estatus+0x2f/0x170
>> [ghes]
>> > [381920.913726]  [<ffffffffa002b618>] ghes_notify_nmi+0xd8/0x1b0
>> [ghes]
>> > [381920.920076]  [<ffffffff8151a14f>] notifier_call_chain+0x3f/0x80
>> > [381920.926080]  [<ffffffff8151a1d8>]
>> __atomic_notifier_call_chain+0x48/0x70
>> > [381920.932864]  [<ffffffff8151a211>]
>> atomic_notifier_call_chain+0x11/0x20
>> > [381920.939473]  [<ffffffff8151a24e>] notify_die+0x2e/0x30
>> > [381920.944691]  [<ffffffff81517bd2>] do_nmi+0xa2/0x270
>> > [381920.949649]  [<ffffffff81517550>] nmi+0x20/0x30
>> > [381920.954269]  [<ffffffff8102ae8a>] ? native_write_msr_safe+0xa/0x10
>> > [381920.960536]  <<EOE>>
>> > [381920.962725]  <IRQ>
>> > [381920.965132]  [<ffffffff8101131e>] intel_pmu_disable_all+0x3e/0x120
>> > [381920.971399]  [<ffffffff81010d5a>] x86_pmu_disable+0x4a/0x50
>> > [381920.977066]  [<ffffffff810e7f2b>] perf_pmu_disable+0x2b/0x40
>> > [381920.982817]  [<ffffffff810ee828>] perf_event_task_tick+0x218/0x270
>> > [381920.989082]  [<ffffffff81046f4d>] scheduler_tick+0xdd/0x2c0
>> > [381920.994740]  [<ffffffff81067097>] update_process_times+0x67/0x80
>> > [381921.000832]  [<ffffffff81089eef>] tick_sched_timer+0x5f/0xc0
>> > [381921.006578]  [<ffffffff81089e90>] ? tick_nohz_handler+0x100/0x100
>> > [381921.012758]  [<ffffffff8107db3d>] __run_hrtimer+0x12d/0x280
>> > [381921.018414]  [<ffffffff8107dea7>] hrtimer_interrupt+0xb7/0x1e0
>> > [381921.024343]  [<ffffffff810206d7>]
>> smp_apic_timer_interrupt+0x67/0xa0
>> > [381921.030793]  [<ffffffff8151e3f3>] apic_timer_interrupt+0x13/0x20
>> > [381921.036891]  <EOI>
>> > [381921.038913] Code: fc 10 74 1f 77 08 41 80 fc 08 75 48 eb 0e 41 80
>> fc 20 74
>> > 17 41 80 fc 40 75 3a eb 15 8a 00 0f b6 c0 eb 11 66 8b 00 0f b7 c0 eb
>> 09 <8b>
>> > 00 89 c0 eb 03 48 8b 00 49 89 45 00 e8 8e 2b e2 ff eb 1b 0f
>> > [381921.059462] RIP  [<ffffffff812a7510>] acpi_atomic_read+0xe3/0x120
>> > [381921.065669]  RSP <ffff88063fc47d98>
>> > [381921.069242] CR2: 0000000000000000
>> > [381921.072645] ---[ end trace 52697bfc73a34a90 ]---
>> > [381921.077343] Kernel panic - not syncing: Fatal exception in
>> interrupt
>> > [381921.083784] Pid: 12701, comm: cluster Tainted: G      D
>> 2.6.39.3-
>> > microwaycustom #1
>> > [381921.091788] Call Trace:
>> > [381921.094333]  <NMI>  [<ffffffff81513300>] panic+0x9f/0x1da
>> > [381921.099881]  [<ffffffff81517f8c>] oops_end+0xdc/0xf0
>> > [381921.104952]  [<ffffffff81032b91>] no_context+0xf1/0x260
>> > [381921.110277]  [<ffffffff81032e55>]
>> __bad_area_nosemaphore+0x155/0x200
>> > [381921.116739]  [<ffffffff81032f0e>] bad_area_nosemaphore+0xe/0x10
>> > [381921.122766]  [<ffffffff81519f46>] do_page_fault+0x366/0x530
>> > [381921.128440]  [<ffffffff810ee929>] ?
>> __perf_event_overflow+0xa9/0x220
>> > [381921.134896]  [<ffffffff810ef99b>] ?
>> perf_event_update_userpage+0x9b/0xe0
>> > [381921.141698]  [<ffffffff81012249>] ?
>> intel_pmu_enable_all+0xc9/0x1a0
>> > [381921.148061]  [<ffffffff810123ff>] ?
>> x86_perf_event_set_period+0xdf/0x170
>> > [381921.154852]  [<ffffffff81517295>] page_fault+0x25/0x30
>> > [381921.160081]  [<ffffffff812a7510>] ? acpi_atomic_read+0xe3/0x120
>> > [381921.166094]  [<ffffffff812a7486>] ? acpi_atomic_read+0x59/0x120
>> > [381921.172107]  [<ffffffffa002b21f>] ghes_read_estatus+0x2f/0x170
>> [ghes]
>> > [381921.178639]  [<ffffffffa002b618>] ghes_notify_nmi+0xd8/0x1b0
>> [ghes]
>> > [381921.185004]  [<ffffffff8151a14f>] notifier_call_chain+0x3f/0x80
>> > [381921.191003]  [<ffffffff8151a1d8>]
>> __atomic_notifier_call_chain+0x48/0x70
>> > [381921.197789]  [<ffffffff8151a211>]
>> atomic_notifier_call_chain+0x11/0x20
>> > [381921.204399]  [<ffffffff8151a24e>] notify_die+0x2e/0x30
>> > [381921.209624]  [<ffffffff81517bd2>] do_nmi+0xa2/0x270
>> > [381921.214584]  [<ffffffff81517550>] nmi+0x20/0x30
>> > [381921.219208]  [<ffffffff8102ae8a>] ? native_write_msr_safe+0xa/0x10
>> > [381921.225470]  <<EOE>>  <IRQ>  [<ffffffff8101131e>]
>> > intel_pmu_disable_all+0x3e/0x120
>> > [381921.233174]  [<ffffffff81010d5a>] x86_pmu_disable+0x4a/0x50
>> > [381921.238836]  [<ffffffff810e7f2b>] perf_pmu_disable+0x2b/0x40
>> > [381921.244582]  [<ffffffff810ee828>] perf_event_task_tick+0x218/0x270
>> > [381921.250855]  [<ffffffff81046f4d>] scheduler_tick+0xdd/0x2c0
>> > [381921.256520]  [<ffffffff81067097>] update_process_times+0x67/0x80
>> > [381921.262618]  [<ffffffff81089eef>] tick_sched_timer+0x5f/0xc0
>> > [381921.268363]  [<ffffffff81089e90>] ? tick_nohz_handler+0x100/0x100
>> > [381921.274540]  [<ffffffff8107db3d>] __run_hrtimer+0x12d/0x280
>> > [381921.280198]  [<ffffffff8107dea7>] hrtimer_interrupt+0xb7/0x1e0
>> > [381921.286125]  [<ffffffff810206d7>]
>> smp_apic_timer_interrupt+0x67/0xa0
>> > [381921.292578]  [<ffffffff8151e3f3>] apic_timer_interrupt+0x13/0x20
>> > [381921.298673]  <EOI>
>> >
>> >
>> > I recompiled the kernel again, disabling the tickless feature as I saw
>> "nohz"
>> > early in the call trace.  After that, it reproduced again but the call
>> trace
>> > only changed slightly having tick_init_highres in there instead of
>> > tick_nohz_handler.  I have that call trace if it is desired as well.
>> It is
>> > nearly identical though.
>> >
>> > I am currently trying a custom 2.6.36.4 on the system, but will need
>> up to 3
>> > days before I know if the problem exists there as well.
>> >
>> > Any ideas on this?
>>
>> It looks like a wrong address in ghes_read_estatus().  I'll have a look
>> at it
>> later today.
>
> Hmm.  Please apply the appended patch and see if the error is still
> reproducible (it's on top of the current mainline, so you may need to
> adjust it by hand).
>
> Thanks,
> Rafael
>
> ---
>  drivers/acpi/apei/ghes.c |    3 +++
>  1 file changed, 3 insertions(+)
>
> Index: linux/drivers/acpi/apei/ghes.c
> ===================================================================
> --- linux.orig/drivers/acpi/apei/ghes.c
> +++ linux/drivers/acpi/apei/ghes.c
> @@ -398,6 +398,9 @@ static int ghes_read_estatus(struct ghes
>  	u32 len;
>  	int rc;
>
> +	if (!g)
> +		return -EINVAL;
> +
>  	rc = acpi_atomic_read(&buf_paddr, &g->error_status_address);
>  	if (rc) {
>  		if (!silent && printk_ratelimit())
>

Thanks for the quick patch!  Unfortunately, the system crashed again with 
the same call trace.  I'll list it below in case there are any differences 
I missed.  I have now added a test for (!&g->error_status_address), just like 
your test on (!g), to verify that it isn't a null pointer either, and will 
start tests back up with that change.  Please let me know if you have any 
other suggestions or patches for me to try.
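
For reference, a rough sketch of the top of ghes_read_estatus() with that 
extra check added on top of your patch (the surrounding context in my 
2.6.39.3 tree may differ slightly from mainline):

	if (!g)
		return -EINVAL;

	/* additionally bail out if the member address itself computes to NULL */
	if (!&g->error_status_address)
		return -EINVAL;

	rc = acpi_atomic_read(&buf_paddr, &g->error_status_address);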

Thanks,
Rick

Call trace:

[59719.446761] BUG: unable to handle kernel NULL pointer dereference at   
       (null)
[59719.454633] IP: [<ffffffff812a211d>] acpi_atomic_read+0x8d/0xcb
[59719.460580] PGD 62c5c0067 PUD 6188ed067 PMD 0
[59719.465080] Oops: 0000 [#1] PREEMPT SMP
[59719.469050] last sysfs file:
/sys/devices/system/cpu/cpu15/cache/index2/shared_cpu_map
[59719.476954] CPU 9
[59719.478794] Modules linked in: md5 nfsd lockd nfs_acl auth_rpcgss
sunrpc ipt_MASQUERADE iptable_mangle iptable_nat nf_nat nf_conntrack_ipv4
nf_conntrack nf_defrag_ipv4 iptable_filter ip_tables x_tables af_packet
edd cpufreq_conservative cpufreq_userspace cpufreq_powersave acpi_cpufreq
mperf xfs dm_mod igb sr_mod cdrom joydev ioatdma pcspkr ghes i2c_i801 sg
dca i7core_edac edac_core hed button iTCO_wdt iTCO_vendor_support ext4
jbd2 crc16 raid456 async_raid6_recov async_pq raid6_pq async_xor xor
async_memcpy async_tx raid10 raid1 raid0 fan processor thermal thermal_sys
ata_generic pata_atiixp arcmsr
[59719.533141]
[59719.534642] Pid: 26831, comm: cluster Not tainted
2.6.39.3-microwaycustom #2 Supermicro X8DTH-i/6/iF/6F/X8DTH
[59719.544616] RIP: 0010:[<ffffffff812a211d>]  [<ffffffff812a211d>]
acpi_atomic_read+0x8d/0xcb
[59719.552984] RSP: 0000:ffff88033fca7da8  EFLAGS: 00010046
[59719.558298] RAX: 0000000000000000 RBX: ffff88033fca7df0 RCX:
00000000bf7b6000
[59719.565430] RDX: 0000000000000000 RSI: 00000000bf7b6010 RDI:
00000000bf7b5ff0
[59719.572561] RBP: ffff88033fca7dd8 R08: 00000000bf7b7000 R09:
0000000000000002
[59719.579694] R10: 0000000000000000 R11: 0000000000000000 R12:
ffffc90003044c20
[59719.586822] R13: 0000000000000000 R14: 00000000bf7b5ff0 R15:
0000000000000000
[59719.593954] FS:  0000000000000000(0000) GS:ffff88033fca0000(0000)
knlGS:0000000000000000
[59719.602037] CS:  0010 DS: 002b ES: 002b CR0: 0000000080050033
[59719.607783] CR2: 0000000000000000 CR3: 000000062bc1b000 CR4:
00000000000006e0
[59719.614913] DR0: 0000000000000000 DR1: 0000000000000000 DR2:
0000000000000000
[59719.622045] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7:
0000000000000400
[59719.629176] Process cluster (pid: 26831, threadinfo ffff88030c960000,
task ffff8803021460c0)
[59719.637601] Stack:
[59719.639616]  0000000000000000 00000000bf7b5ff0 00000000ffffffea
ffff88062c0aa2c0
[59719.647092]  0000000000000001 ffffc90003044ca8 ffff88033fca7e18
ffffffffa0278228
[59719.654579]  0000000000000000 0000000000000000 ffff88062c0aa2c0
0000000000000000
[59719.662063] Call Trace:
[59719.664514]  <NMI>
[59719.666655]  [<ffffffffa0278228>] ghes_read_estatus+0x38/0x180 [ghes]
[59719.673106]  [<ffffffffa027860c>] ghes_notify_nmi+0xbc/0x190 [ghes]
[59719.679383]  [<ffffffff8150ddfd>] notifier_call_chain+0x4d/0x70
[59719.685310]  [<ffffffff8150de63>] __atomic_notifier_call_chain+0x43/0x60
[59719.692015]  [<ffffffff8150de91>] atomic_notifier_call_chain+0x11/0x20
[59719.698548]  [<ffffffff8150dece>] notify_die+0x2e/0x30
[59719.703694]  [<ffffffff8150b4f2>] do_nmi+0xa2/0x260
[59719.708574]  [<ffffffff8150b150>] nmi+0x20/0x30
[59719.713112]  [<ffffffff81029f6a>] ? native_write_msr_safe+0xa/0x10
[59719.719285]  <<EOE>>
[59719.721387]  <IRQ>
[59719.723709]  [<ffffffff81011568>] intel_pmu_disable_all+0x38/0xb0
[59719.729806]  [<ffffffff81010efa>] x86_pmu_disable+0x4a/0x50
[59719.735388]  [<ffffffff810ea842>] perf_event_task_tick+0x1a2/0x2a0
[59719.741571]  [<ffffffff81050750>] scheduler_tick+0x1b0/0x290
[59719.747232]  [<ffffffff81066c29>] update_process_times+0x69/0x80
[59719.753238]  [<ffffffff81088098>] tick_sched_timer+0x58/0x150
[59719.758989]  [<ffffffff8107b7ef>] __run_hrtimer+0x6f/0x250
[59719.764481]  [<ffffffff81088040>] ? tick_init_highres+0x20/0x20
[59719.770409]  [<ffffffff8107bf7a>] hrtimer_interrupt+0xda/0x230
[59719.776248]  [<ffffffff8101f5c6>] smp_apic_timer_interrupt+0x66/0xa0
[59719.782599]  [<ffffffff815120f3>] apic_timer_interrupt+0x13/0x20
[59719.788600]  <EOI>
[59719.790529] Code: fc 10 74 1f 77 08 41 80 fc 08 75 49 eb 0e 41 80 fc 20
74 17 41 80 fc 40 75 3b eb 15 8a 00 0f b6 c0 eb 11 66 8b 00 0f b7 c0 eb 09
<8b> 00 89 c0 eb 03 48 8b 00 48 89 03 e8 62 55 e2 ff eb 1d 41 0f
[59719.810804] RIP  [<ffffffff812a211d>] acpi_atomic_read+0x8d/0xcb
[59719.816836]  RSP <ffff88033fca7da8>
[59719.820324] CR2: 0000000000000000
[59719.823641] ---[ end trace d2c1ec821cf1a607 ]---
[59719.828260] Kernel panic - not syncing: Fatal exception in interrupt
[59719.834614] Pid: 26831, comm: cluster Tainted: G      D    
2.6.39.3-microwaycustom #2
[59719.842520] Call Trace:
[59719.844968]  <NMI>  [<ffffffff815071ee>] panic+0x9b/0x1b0
[59719.850399]  [<ffffffff8150bb4a>] oops_end+0xea/0xf0
[59719.855374]  [<ffffffff81031dc3>] no_context+0xf3/0x260
[59719.860613]  [<ffffffff81032055>] __bad_area_nosemaphore+0x125/0x1e0
[59719.866970]  [<ffffffff8103211e>] bad_area_nosemaphore+0xe/0x10
[59719.872895]  [<ffffffff8150dd10>] do_page_fault+0x500/0x5a0
[59719.878483]  [<ffffffff810eb839>] ? __perf_event_overflow+0x99/0x210
[59719.884838]  [<ffffffff8150ae95>] page_fault+0x25/0x30
[59719.889996]  [<ffffffff812a211d>] ? acpi_atomic_read+0x8d/0xcb
[59719.895838]  [<ffffffff812a20f0>] ? acpi_atomic_read+0x60/0xcb
[59719.901684]  [<ffffffffa0278228>] ghes_read_estatus+0x38/0x180 [ghes]
[59719.908139]  [<ffffffffa027860c>] ghes_notify_nmi+0xbc/0x190 [ghes]
[59719.914417]  [<ffffffff8150ddfd>] notifier_call_chain+0x4d/0x70
[59719.920338]  [<ffffffff8150de63>] __atomic_notifier_call_chain+0x43/0x60
[59719.927047]  [<ffffffff8150de91>] atomic_notifier_call_chain+0x11/0x20
[59719.933575]  [<ffffffff8150dece>] notify_die+0x2e/0x30
[59719.938708]  [<ffffffff8150b4f2>] do_nmi+0xa2/0x260
[59719.943585]  [<ffffffff8150b150>] nmi+0x20/0x30
[59719.948127]  [<ffffffff81029f6a>] ? native_write_msr_safe+0xa/0x10
[59719.954303]  <<EOE>>  <IRQ>  [<ffffffff81011568>]
intel_pmu_disable_all+0x38/0xb0
[59719.961840]  [<ffffffff81010efa>] x86_pmu_disable+0x4a/0x50
[59719.967423]  [<ffffffff810ea842>] perf_event_task_tick+0x1a2/0x2a0
[59719.973609]  [<ffffffff81050750>] scheduler_tick+0x1b0/0x290
[59719.979273]  [<ffffffff81066c29>] update_process_times+0x69/0x80
[59719.985282]  [<ffffffff81088098>] tick_sched_timer+0x58/0x150
[59719.991039]  [<ffffffff8107b7ef>] __run_hrtimer+0x6f/0x250
[59719.996536]  [<ffffffff81088040>] ? tick_init_highres+0x20/0x20
[59720.002456]  [<ffffffff8107bf7a>] hrtimer_interrupt+0xda/0x230
[59720.008289]  [<ffffffff8101f5c6>] smp_apic_timer_interrupt+0x66/0xa0
[59720.014646]  [<ffffffff815120f3>] apic_timer_interrupt+0x13/0x20
[59720.020648]  <EOI>


* Re: kernel oops and panic in acpi_atomic_read under 2.6.39.3.   call trace included
  2011-08-22 14:42     ` rick
@ 2011-08-22 18:47       ` Rafael J. Wysocki
  2011-08-22 20:51         ` Rick Warner
  0 siblings, 1 reply; 18+ messages in thread
From: Rafael J. Wysocki @ 2011-08-22 18:47 UTC (permalink / raw)
  To: rick
  Cc: linux-kernel, Richard Houghton, ACPI Devel Mailing List,
	Len Brown, Matthew Garrett

On Monday, August 22, 2011, rick@microway.com wrote:
> > On Thursday, August 18, 2011, Rafael J. Wysocki wrote:
> >> Hi,
> >>
> >> It's better to report ACPI failures to linux-acpi.
> >>
> >> On Wednesday, August 17, 2011, Rick Warner wrote:
> >> > Hi all,
> >> >
> >> > I am getting a kernel oops/panic on a dual xeon system that is the
> >> master of a
> >> > 60 node HPC cluster.  This is happening while stress testing the
> >> system
> >> > including significant network traffic.  The OS is openSuse 11.4.
> >> >
> >> > We are running a custom compiled 2.6.39.3 kernel on the systems due to
> >> a bug
> >> > in the stock kernel 11.4 provided (igb driver related).  After 1-3
> >> days of
> >> > heavy testing, the master node locks up with the caps lock and scroll
> >> lock
> >> > keys on the keyboard blinking with the following output captured via a
> >> serial
> >> > console:
> >> >
> >> > [381920.681113] BUG: unable to handle kernel NULL pointer dereference
> >> at
> >> > (null)
> >> > [381920.689067] IP: [<ffffffff812a7510>] acpi_atomic_read+0xe3/0x120
> >> > [381920.695187] PGD 30c27a067 PUD 16efe6067 PMD 0
> >> > [381920.699782] Oops: 0000 [#1] PREEMPT SMP
> >> > [381920.703866] last sysfs file:
> >> > /sys/devices/system/cpu/cpu15/cache/index2/shared_cpu_map
> >> > [381920.711868] CPU 6
> >> > [381920.713800] Modules linked in: md5 ipmi_devintf ipmi_si
> >> ipmi_msghandler
> >> > nfsd lockd nfs_acl auth_rpcgss sunrpc ipt_MASQUERADE iptable_mangle
> >> > iptable_nat nf_nat nf_conntrack_ipv4 nf_conntrack nf_defrag_ipv4
> >> iptable_filter
> >> > ip_tables x_tables af_packet edd cpufreq_conservative
> >> cpufreq_userspace
> >> > cpufreq_powersave acpi_cpufreq mperf xfs dm_mod ioatdma i7core_edac
> >> edac_core
> >> > sr_mod cdrom joydev igb i2c_i801 sg button ghes hed iTCO_wdt
> >> > iTCO_vendor_support dca pcspkr ext4 jbd2 crc16 raid456
> >> async_raid6_recov
> >> > async_pq raid6_pq async_xor xor async_memcpy async_tx raid10 raid1
> >> raid0 fan
> >> > processor thermal thermal_sys ata_generic pata_atiixp arcmsr
> >> > [381920.771623]
> >> > [381920.773210] Pid: 12701, comm: cluster Not tainted
> >> 2.6.39.3-microwaycustom
> >> > #1 Supermicro X8DTH-i/6/iF/6F/X8DTH
> >> > [381920.783292] RIP: 0010:[<ffffffff812a7510>]  [<ffffffff812a7510>]
> >> > acpi_atomic_read+0xe3/0x120
> >> > [381920.791853] RSP: 0000:ffff88063fc47d98  EFLAGS: 00010046
> >> > [381920.797260] RAX: 0000000000000000 RBX: 00000000bf7b5ff0 RCX:
> >> ffffffff81a3cdd0
> >> > [381920.804486] RDX: 00000000bf7b6010 RSI: 00000000bf7b6000 RDI:
> >> > ffff88062d4b95c0
> >> > [381920.811712] RBP: ffff88063fc47dc8 R08: ffff88063fc47d98 R09:
> >> 0000000000000002
> >> > [381920.818940] R10: 0000000000000083 R11: 0000000000000010 R12:
> >> > ffffc90003044c20
> >> > [381920.826168] R13: ffff88063fc47de0 R14: 0000000000000000 R15:
> >> > 0000000000000000
> >> > [381920.833392] FS:  0000000000000000(0000) GS:ffff88063fc40000(0000)
> >> > knlGS:0000000000000000
> >> > [381920.841571] CS:  0010 DS: 002b ES: 002b CR0: 0000000080050033
> >> > [381920.847409] CR2: 0000000000000000 CR3: 00000001d2d0f000 CR4:
> >> > 00000000000006e0
> >> > [381920.854635] DR0: 0000000000000000 DR1: 0000000000000000 DR2:
> >> > 0000000000000000
> >> > [381920.861861] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7:
> >> 0000000000000400
> >> > [381920.869088] Process cluster (pid: 12701, threadinfo
> >> ffff880028284000, task
> >> > ffff880550184380)
> >> > [381920.877608] Stack:
> >> > [381920.879717]  ffffffff81a3cdd0 00000000bf7b5ff0 ffff88062b2cc140
> >> ffffc90003044ca8
> >> > [381920.887288]  ffff88062b2cc140 0000000000000001 ffff88063fc47e08
> >> ffffffffa002b21f
> >> > [381920.894862]  0000000000000000 0000000000000000 ffff88062b2cc140
> >> > 0000000000000000
> >> > [381920.902435] Call Trace:
> >> > [381920.904969]  <NMI>
> >> > [381920.907201]  [<ffffffffa002b21f>] ghes_read_estatus+0x2f/0x170
> >> [ghes]
> >> > [381920.913726]  [<ffffffffa002b618>] ghes_notify_nmi+0xd8/0x1b0
> >> [ghes]
> >> > [381920.920076]  [<ffffffff8151a14f>] notifier_call_chain+0x3f/0x80
> >> > [381920.926080]  [<ffffffff8151a1d8>]
> >> __atomic_notifier_call_chain+0x48/0x70
> >> > [381920.932864]  [<ffffffff8151a211>]
> >> atomic_notifier_call_chain+0x11/0x20
> >> > [381920.939473]  [<ffffffff8151a24e>] notify_die+0x2e/0x30
> >> > [381920.944691]  [<ffffffff81517bd2>] do_nmi+0xa2/0x270
> >> > [381920.949649]  [<ffffffff81517550>] nmi+0x20/0x30
> >> > [381920.954269]  [<ffffffff8102ae8a>] ? native_write_msr_safe+0xa/0x10
> >> > [381920.960536]  <<EOE>>
> >> > [381920.962725]  <IRQ>
> >> > [381920.965132]  [<ffffffff8101131e>] intel_pmu_disable_all+0x3e/0x120
> >> > [381920.971399]  [<ffffffff81010d5a>] x86_pmu_disable+0x4a/0x50
> >> > [381920.977066]  [<ffffffff810e7f2b>] perf_pmu_disable+0x2b/0x40
> >> > [381920.982817]  [<ffffffff810ee828>] perf_event_task_tick+0x218/0x270
> >> > [381920.989082]  [<ffffffff81046f4d>] scheduler_tick+0xdd/0x2c0
> >> > [381920.994740]  [<ffffffff81067097>] update_process_times+0x67/0x80
> >> > [381921.000832]  [<ffffffff81089eef>] tick_sched_timer+0x5f/0xc0
> >> > [381921.006578]  [<ffffffff81089e90>] ? tick_nohz_handler+0x100/0x100
> >> > [381921.012758]  [<ffffffff8107db3d>] __run_hrtimer+0x12d/0x280
> >> > [381921.018414]  [<ffffffff8107dea7>] hrtimer_interrupt+0xb7/0x1e0
> >> > [381921.024343]  [<ffffffff810206d7>]
> >> smp_apic_timer_interrupt+0x67/0xa0
> >> > [381921.030793]  [<ffffffff8151e3f3>] apic_timer_interrupt+0x13/0x20
> >> > [381921.036891]  <EOI>
> >> > [381921.038913] Code: fc 10 74 1f 77 08 41 80 fc 08 75 48 eb 0e 41 80
> >> fc 20 74
> >> > 17 41 80 fc 40 75 3a eb 15 8a 00 0f b6 c0 eb 11 66 8b 00 0f b7 c0 eb
> >> 09 <8b>
> >> > 00 89 c0 eb 03 48 8b 00 49 89 45 00 e8 8e 2b e2 ff eb 1b 0f
> >> > [381921.059462] RIP  [<ffffffff812a7510>] acpi_atomic_read+0xe3/0x120
> >> > [381921.065669]  RSP <ffff88063fc47d98>
> >> > [381921.069242] CR2: 0000000000000000
> >> > [381921.072645] ---[ end trace 52697bfc73a34a90 ]---
> >> > [381921.077343] Kernel panic - not syncing: Fatal exception in
> >> interrupt
> >> > [381921.083784] Pid: 12701, comm: cluster Tainted: G      D
> >> 2.6.39.3-
> >> > microwaycustom #1
> >> > [381921.091788] Call Trace:
> >> > [381921.094333]  <NMI>  [<ffffffff81513300>] panic+0x9f/0x1da
> >> > [381921.099881]  [<ffffffff81517f8c>] oops_end+0xdc/0xf0
> >> > [381921.104952]  [<ffffffff81032b91>] no_context+0xf1/0x260
> >> > [381921.110277]  [<ffffffff81032e55>]
> >> __bad_area_nosemaphore+0x155/0x200
> >> > [381921.116739]  [<ffffffff81032f0e>] bad_area_nosemaphore+0xe/0x10
> >> > [381921.122766]  [<ffffffff81519f46>] do_page_fault+0x366/0x530
> >> > [381921.128440]  [<ffffffff810ee929>] ?
> >> __perf_event_overflow+0xa9/0x220
> >> > [381921.134896]  [<ffffffff810ef99b>] ?
> >> perf_event_update_userpage+0x9b/0xe0
> >> > [381921.141698]  [<ffffffff81012249>] ?
> >> intel_pmu_enable_all+0xc9/0x1a0
> >> > [381921.148061]  [<ffffffff810123ff>] ?
> >> x86_perf_event_set_period+0xdf/0x170
> >> > [381921.154852]  [<ffffffff81517295>] page_fault+0x25/0x30
> >> > [381921.160081]  [<ffffffff812a7510>] ? acpi_atomic_read+0xe3/0x120
> >> > [381921.166094]  [<ffffffff812a7486>] ? acpi_atomic_read+0x59/0x120
> >> > [381921.172107]  [<ffffffffa002b21f>] ghes_read_estatus+0x2f/0x170
> >> [ghes]
> >> > [381921.178639]  [<ffffffffa002b618>] ghes_notify_nmi+0xd8/0x1b0
> >> [ghes]
> >> > [381921.185004]  [<ffffffff8151a14f>] notifier_call_chain+0x3f/0x80
> >> > [381921.191003]  [<ffffffff8151a1d8>]
> >> __atomic_notifier_call_chain+0x48/0x70
> >> > [381921.197789]  [<ffffffff8151a211>]
> >> atomic_notifier_call_chain+0x11/0x20
> >> > [381921.204399]  [<ffffffff8151a24e>] notify_die+0x2e/0x30
> >> > [381921.209624]  [<ffffffff81517bd2>] do_nmi+0xa2/0x270
> >> > [381921.214584]  [<ffffffff81517550>] nmi+0x20/0x30
> >> > [381921.219208]  [<ffffffff8102ae8a>] ? native_write_msr_safe+0xa/0x10
> >> > [381921.225470]  <<EOE>>  <IRQ>  [<ffffffff8101131e>]
> >> > intel_pmu_disable_all+0x3e/0x120
> >> > [381921.233174]  [<ffffffff81010d5a>] x86_pmu_disable+0x4a/0x50
> >> > [381921.238836]  [<ffffffff810e7f2b>] perf_pmu_disable+0x2b/0x40
> >> > [381921.244582]  [<ffffffff810ee828>] perf_event_task_tick+0x218/0x270
> >> > [381921.250855]  [<ffffffff81046f4d>] scheduler_tick+0xdd/0x2c0
> >> > [381921.256520]  [<ffffffff81067097>] update_process_times+0x67/0x80
> >> > [381921.262618]  [<ffffffff81089eef>] tick_sched_timer+0x5f/0xc0
> >> > [381921.268363]  [<ffffffff81089e90>] ? tick_nohz_handler+0x100/0x100
> >> > [381921.274540]  [<ffffffff8107db3d>] __run_hrtimer+0x12d/0x280
> >> > [381921.280198]  [<ffffffff8107dea7>] hrtimer_interrupt+0xb7/0x1e0
> >> > [381921.286125]  [<ffffffff810206d7>]
> >> smp_apic_timer_interrupt+0x67/0xa0
> >> > [381921.292578]  [<ffffffff8151e3f3>] apic_timer_interrupt+0x13/0x20
> >> > [381921.298673]  <EOI>
> >> >
> >> >
> >> > I recompiled the kernel again, disabling the tickless feature as I saw
> >> "nohz"
> >> > early in the call trace.  After that, it reproduced again but the call
> >> trace
> >> > only changed slightly having tick_init_highres in there instead of
> >> > tick_nohz_handler.  I have that call trace if it is desired as well.
> >> It is
> >> > nearly identical though.
> >> >
> >> > I am currently trying a custom 2.6.36.4 on the system, but will need
> >> up to 3
> >> > days before I know if the problem exists there as well.
> >> >
> >> > Any ideas on this?
> >>
> >> It looks like a wrong address in ghes_read_estatus().  I'll have a look
> >> at it
> >> later today.
> >
> > Hmm.  Please apply the appended patch and see if the error is still
> > reproducible (it's on top of the current mainline, so you may need to
> > adjust it by hand).
> >
> > Thanks,
> > Rafael
> >
> > ---
> >  drivers/acpi/apei/ghes.c |    3 +++
> >  1 file changed, 3 insertions(+)
> >
> > Index: linux/drivers/acpi/apei/ghes.c
> > ===================================================================
> > --- linux.orig/drivers/acpi/apei/ghes.c
> > +++ linux/drivers/acpi/apei/ghes.c
> > @@ -398,6 +398,9 @@ static int ghes_read_estatus(struct ghes
> >  	u32 len;
> >  	int rc;
> >
> > +	if (!g)
> > +		return -EINVAL;
> > +
> >  	rc = acpi_atomic_read(&buf_paddr, &g->error_status_address);
> >  	if (rc) {
> >  		if (!silent && printk_ratelimit())
> >
> 
> Thanks for the quick patch!  Unfortunately, the system crashed again with
> the same call trace.  I'll list it below in case there are any differences
> I missed.  I have now added a test for (!&g->error_status_address), just
> like your test on (!g), to verify that it isn't a null pointer either, and
> will start tests back up with that change.  Please let me know if you have
> any other suggestions or patches for me to try.

That was a long shot, but perhaps you can check what source code
corresponds to ghes_read_estatus+0x38 in your kernel?
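
Assuming the ghes module was built with debug info (CONFIG_DEBUG_INFO), 
something along these lines should map the offset back to a file and line 
(the .ko path below is the usual install location and may differ on your 
system):

  $ gdb /lib/modules/$(uname -r)/kernel/drivers/acpi/apei/ghes.ko
  (gdb) list *(ghes_read_estatus+0x38)

Alternatively, "objdump -dS ghes.ko" interleaves the source with the 
disassembly, so you can search for <ghes_read_estatus> and look at offset 
0x38 there.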

Rafael


* Re: kernel oops and panic in acpi_atomic_read under 2.6.39.3.   call trace included
  2011-08-22 18:47       ` Rafael J. Wysocki
@ 2011-08-22 20:51         ` Rick Warner
  2011-08-22 21:13           ` Rafael J. Wysocki
  2011-08-23 17:14           ` Don Zickus
  0 siblings, 2 replies; 18+ messages in thread
From: Rick Warner @ 2011-08-22 20:51 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: linux-kernel, Richard Houghton, ACPI Devel Mailing List,
	Len Brown, Matthew Garrett

> > >> Hi,
> > >> 
> > >> It's better to report ACPI failures to linux-acpi.
> > >> 
> > >> On Wednesday, August 17, 2011, Rick Warner wrote:
> > >> > [...]
> > >> 
> > >> It looks like a wrong address in ghes_read_estatus().  I'll have a
> > >> look at it
> > >> later today.
> > > 
> > > Hmm.  Please apply the appended patch and see if the error is still
> > > reproducible (it's on top of the current mainline, so you may need to
> > > adjust it by hand).
> > > 
> > > Thanks,
> > > Rafael
> > > 
> > > ---
> > > 
> > >  drivers/acpi/apei/ghes.c |    3 +++
> > >  1 file changed, 3 insertions(+)
> > > 
> > > Index: linux/drivers/acpi/apei/ghes.c
> > > ===================================================================
> > > --- linux.orig/drivers/acpi/apei/ghes.c
> > > +++ linux/drivers/acpi/apei/ghes.c
> > > @@ -398,6 +398,9 @@ static int ghes_read_estatus(struct ghes
> > > 
> > >  	u32 len;
> > >  	int rc;
> > > 
> > > +	if (!g)
> > > +		return -EINVAL;
> > > +
> > > 
> > >  	rc = acpi_atomic_read(&buf_paddr, &g->error_status_address);
> > >  	if (rc) {
> > >  	
> > >  		if (!silent && printk_ratelimit())
> > 
> > Thanks for the quick patch!  Unfortunately, the system crashed again with
> > the same call trace. I'll list it below in case there are any differences
> > I missed.  I have currently added a test for (!&g->error_status_address)
> > just like your test on (!g) to verify that isn't a null pointer either,
> > and will start tests back up with that change.  Please let me know if you
> > have any other suggestions or patches for me to try.
> 
> That was a long shot, but perhaps you can check what source code
> corresponds to ghes_read_estatus+0x38 in your kernel?
> 
> Rafael
Hi Rafael,

Thanks for the off-list help in getting you this info.

I had already rebuilt the kernel using the change I mentioned earlier (test on 
!&g->error_status_address) since the call trace I got.

I luckily still had a copy of the kernel and modules I built previously using 
just your patch, so I undid my change to the ghes.c source, leaving just your 
patch but not mine so it would match the ghes.ko module I ran on.  This is the 
output of gdb on that ghes.ko now:

(gdb) l *ghes_read_estatus+0x38
0x258 is in ghes_read_estatus (drivers/acpi/apei/ghes.c:296).
warning: Source file is more recent than executable.
291             int rc;
292             if (!g)
293                     return -EINVAL;
294
295             rc = acpi_atomic_read(&buf_paddr, &g->error_status_address);
296             if (rc) {
297                     if (!silent && printk_ratelimit())
298                             pr_warning(FW_WARN GHES_PFX
299     "Failed to read error status block address for hardware error source: 
%d.\n",
300                                        g->header.source_id);

The warning about the source being newer is because of the reverted change in 
the ghes.c source mentioned above.

Thanks,
Rick

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: kernel oops and panic in acpi_atomic_read under 2.6.39.3.   call trace included
  2011-08-22 20:51         ` Rick Warner
@ 2011-08-22 21:13           ` Rafael J. Wysocki
  2011-08-23 17:16             ` rick
  2011-08-23 17:14           ` Don Zickus
  1 sibling, 1 reply; 18+ messages in thread
From: Rafael J. Wysocki @ 2011-08-22 21:13 UTC (permalink / raw)
  To: Rick Warner
  Cc: linux-kernel, Richard Houghton, ACPI Devel Mailing List,
	Len Brown, Matthew Garrett

Hi,

On Monday, August 22, 2011, Rick Warner wrote:
...
> Hi Rafael,
> 
> Thanks for the off-list help in getting you this info.
> 
> I had already rebuilt the kernel using the change I mentioned earlier (test on 
> !&g->error_status_address) since the call trace I got.
> 
> I luckily still had a copy of the kernel and modules I built previously using 
> just your patch, so I undid my change to the ghes.c source, leaving just your 
> patch but not mine so it would match the ghes.ko module I ran on.  This is the 
> output of gdb on that ghes.ko now:
> 
> (gdb) l *ghes_read_estatus+0x38
> 0x258 is in ghes_read_estatus (drivers/acpi/apei/ghes.c:296).
> warning: Source file is more recent than executable.
> 291             int rc;
> 292             if (!g)
> 293                     return -EINVAL;
> 294
> 295             rc = acpi_atomic_read(&buf_paddr, &g->error_status_address);
> 296             if (rc) {
> 297                     if (!silent && printk_ratelimit())
> 298                             pr_warning(FW_WARN GHES_PFX
> 299     "Failed to read error status block address for hardware error source: 
> %d.\n",
> 300                                        g->header.source_id);
> 
> The warning about the source being newer is because of the reverted change in 
> the ghes.c source mentioned above.

OK, since &buf_paddr cannot be NULL, perhaps ghes itself is.  Please check if the
appended patch makes a difference.

Thanks,
Rafael

---
 drivers/acpi/apei/ghes.c |    7 ++++++-
 1 file changed, 6 insertions(+), 1 deletion(-)

Index: linux/drivers/acpi/apei/ghes.c
===================================================================
--- linux.orig/drivers/acpi/apei/ghes.c
+++ linux/drivers/acpi/apei/ghes.c
@@ -393,11 +393,16 @@ static void ghes_copy_tofrom_phys(void *
 
 static int ghes_read_estatus(struct ghes *ghes, int silent)
 {
-	struct acpi_hest_generic *g = ghes->generic;
+	struct acpi_hest_generic *g;
 	u64 buf_paddr;
 	u32 len;
 	int rc;
 
+	if (!ghes || !ghes->generic)
+		return -EINVAL;
+
+	g = ghes->generic;
+
 	rc = acpi_atomic_read(&buf_paddr, &g->error_status_address);
 	if (rc) {
 		if (!silent && printk_ratelimit())

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: kernel oops and panic in acpi_atomic_read under 2.6.39.3.   call trace included
  2011-08-22 20:51         ` Rick Warner
  2011-08-22 21:13           ` Rafael J. Wysocki
@ 2011-08-23 17:14           ` Don Zickus
  2011-08-23 17:24             ` rick
  1 sibling, 1 reply; 18+ messages in thread
From: Don Zickus @ 2011-08-23 17:14 UTC (permalink / raw)
  To: Rick Warner, ying.huang
  Cc: Rafael J. Wysocki, linux-kernel, Richard Houghton,
	ACPI Devel Mailing List, Len Brown, Matthew Garrett

Adding Ying.  He wrote the ghes code.  But if you are using 2.6.39, then the
ghes code probably still has some bugs in it that were fixed in 3.0 and 3.1.
And unless you are using a Nehalem machine or later (Westmere, Sandy
Bridge), ghes won't do much for you.  Can you just disable it in the
config?
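
For example, assuming GHES was built as the "ghes" module (as the module list
in the oops suggests), blacklisting it should be enough for a quick test:

  echo "blacklist ghes" >> /etc/modprobe.d/99-local-blacklist.conf

The file name above is just an example; alternatively the kernel can be
rebuilt with CONFIG_ACPI_APEI_GHES disabled.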

Cheers,
Don

On Mon, Aug 22, 2011 at 04:51:35PM -0400, Rick Warner wrote:
> [...]

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: kernel oops and panic in acpi_atomic_read under 2.6.39.3.   call trace included
  2011-08-22 21:13           ` Rafael J. Wysocki
@ 2011-08-23 17:16             ` rick
  0 siblings, 0 replies; 18+ messages in thread
From: rick @ 2011-08-23 17:16 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: linux-kernel, Richard Houghton, ACPI Devel Mailing List,
	Len Brown, Matthew Garrett

Hi,

> Hi,
>
> On Monday, August 22, 2011, Rick Warner wrote:
> ...
>> [...]
>
> OK, since &buf_paddr cannot be NULL, perhaps ghes itself is.  Please check if the
> appended patch makes a difference.
>
> Thanks,
> Rafael
>
> ---
>  drivers/acpi/apei/ghes.c |    7 ++++++-
>  1 file changed, 6 insertions(+), 1 deletion(-)
>
> Index: linux/drivers/acpi/apei/ghes.c
> ===================================================================
> --- linux.orig/drivers/acpi/apei/ghes.c
> +++ linux/drivers/acpi/apei/ghes.c
> @@ -393,11 +393,16 @@ static void ghes_copy_tofrom_phys(void *
>
>  static int ghes_read_estatus(struct ghes *ghes, int silent)
>  {
> -	struct acpi_hest_generic *g = ghes->generic;
> +	struct acpi_hest_generic *g;
>  	u64 buf_paddr;
>  	u32 len;
>  	int rc;
>
> +	if (!ghes || !ghes->generic)
> +		return -EINVAL;
> +
> +	g = ghes->generic;
> +
>  	rc = acpi_atomic_read(&buf_paddr, &g->error_status_address);
>  	if (rc) {
>  		if (!silent && printk_ratelimit())
>

Unfortunately it had another panic with this patch in place.  Here is the
latest call trace:

[64614.937968] BUG: unable to handle kernel NULL pointer dereference at   
       (null)
[64614.945851] IP: [<ffffffff812a211d>] acpi_atomic_read+0x8d/0xcb
[64614.951817] PGD 2f8d40067 PUD 2f8cf8067 PMD 0
[64614.956346] Oops: 0000 [#1] PREEMPT SMP
[64614.960344] last sysfs file:
/sys/devices/system/cpu/cpu15/cache/index2/shared_cpu_map
[64614.968265] CPU 14
[64614.970203] Modules linked in: md5 nfsd lockd nfs_acl auth_rpcgss
sunrpc ipt_MASQUERADE iptable_mangle iptable_nat nf_nat nf_conntrack_ipv4
nf_conntrack nf_defrag_ipv4 iptable_filter ip_tables x_tables af_packet
edd cpufreq_conservative cpufreq_userspace cpufreq_powersave acpi_cpufreq
mperf xfs dm_mod igb joydev sr_mod cdrom pcspkr sg ioatdma button iTCO_wdt
iTCO_vendor_support dca ghes hed i2c_i801 i7core_edac edac_core ext4 jbd2
crc16 raid456 async_raid6_recov async_pq raid6_pq async_xor xor
async_memcpy async_tx raid10 raid1 raid0 fan processor thermal thermal_sys
ata_generic pata_atiixp arcmsr
[64615.024806]
[64615.026305] Pid: 10723, comm: cluster Not tainted
2.6.39.3-microwaycustom #5 Supermicro X8DTH-i/6/iF/6F/X8DTH
[64615.036291] RIP: 0010:[<ffffffff812a211d>]  [<ffffffff812a211d>]
acpi_atomic_read+0x8d/0xcb
[64615.044671] RSP: 0000:ffff88063fcc7da8  EFLAGS: 00010046
[64615.049994] RAX: 0000000000000000 RBX: ffff88063fcc7df0 RCX:
00000000bf7b6000
[64615.057132] RDX: 0000000000000000 RSI: 00000000bf7b6010 RDI:
00000000bf7b5ff0
[64615.064271] RBP: ffff88063fcc7dd8 R08: 00000000bf7b7000 R09:
0000000000000002
[64615.071411] R10: 0000000000000000 R11: 0000000000000000 R12:
ffffc90003044c20
[64615.078549] R13: 0000000000000000 R14: 00000000bf7b5ff0 R15:
0000000000000000
[64615.085688] FS:  0000000000000000(0000) GS:ffff88063fcc0000(0000)
knlGS:0000000000000000
[64615.093771] CS:  0010 DS: 002b ES: 002b CR0: 0000000080050033
[64615.099517] CR2: 0000000000000000 CR3: 00000003015b1000 CR4:
00000000000006e0
[64615.106658] DR0: 0000000000000000 DR1: 0000000000000000 DR2:
0000000000000000
[64615.113795] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7:
0000000000000400
[64615.120928] Process cluster (pid: 10723, threadinfo ffff8802fb3b6000,
task ffff880301534640)
[64615.129361] Stack:
[64615.131386]  0000000000000000 00000000bf7b5ff0 00000000ffffffea
ffff88032b1c3d40
[64615.138871]  0000000000000001 ffffc90003044ca8 ffff88063fcc7e18
ffffffffa01b7245
[64615.146354]  0000000000000000 0000000000000000 ffff88032b1c3d40
0000000000000000
[64615.153840] Call Trace:
[64615.156293]  <NMI>
[64615.158442]  [<ffffffffa01b7245>] ghes_read_estatus+0x55/0x180 [ghes]
[64615.164900]  [<ffffffffa01b760c>] ghes_notify_nmi+0xbc/0x190 [ghes]
[64615.171182]  [<ffffffff8150ddfd>] notifier_call_chain+0x4d/0x70
[64615.177116]  [<ffffffff8150de63>] __atomic_notifier_call_chain+0x43/0x60
[64615.183824]  [<ffffffff8150de91>] atomic_notifier_call_chain+0x11/0x20
[64615.190373]  [<ffffffff8150dece>] notify_die+0x2e/0x30
[64615.195535]  [<ffffffff8150b4f2>] do_nmi+0xa2/0x260
[64615.200430]  [<ffffffff8150b150>] nmi+0x20/0x30
[64615.204981]  [<ffffffff81029f6a>] ? native_write_msr_safe+0xa/0x10
[64615.211170]  <<EOE>>
[64615.213276]  <IRQ>
[64615.215609]  [<ffffffff81011568>] intel_pmu_disable_all+0x38/0xb0
[64615.221710]  [<ffffffff81010efa>] x86_pmu_disable+0x4a/0x50
[64615.227306]  [<ffffffff810ea842>] perf_event_task_tick+0x1a2/0x2a0
[64615.233495]  [<ffffffff81050750>] scheduler_tick+0x1b0/0x290
[64615.239165]  [<ffffffff81066c29>] update_process_times+0x69/0x80
[64615.245193]  [<ffffffff81088098>] tick_sched_timer+0x58/0x150
[64615.250956]  [<ffffffff8107b7ef>] __run_hrtimer+0x6f/0x250
[64615.256459]  [<ffffffff81088040>] ? tick_init_highres+0x20/0x20
[64615.262393]  [<ffffffff8107bf7a>] hrtimer_interrupt+0xda/0x230
[64615.268244]  [<ffffffff8101f5c6>] smp_apic_timer_interrupt+0x66/0xa0
[64615.274622]  [<ffffffff815120f3>] apic_timer_interrupt+0x13/0x20
[64615.280633]  <EOI>
[64615.282570] Code: fc 10 74 1f 77 08 41 80 fc 08 75 49 eb 0e 41 80 fc 20
74 17 41 80 fc 40 75 3b eb 15 8a 00 0f b6 c0 eb 11 66 8b 00 0f b7 c0 eb 09
<8b> 00 89 c0 eb 03 48 8b 00 48 89 03 e8 62 55 e2 ff eb 1d 41 0f
[64615.303108] RIP  [<ffffffff812a211d>] acpi_atomic_read+0x8d/0xcb
[64615.309163]  RSP <ffff88063fcc7da8>
[64615.312668] CR2: 0000000000000000
[64615.316007] ---[ end trace 3ab5dd3ba3391edf ]---
[64615.320637] Kernel panic - not syncing: Fatal exception in interrupt
[64615.326999] Pid: 10723, comm: cluster Tainted: G      D    
2.6.39.3-microwaycustom #5
[64615.334914] Call Trace:
[64615.337371]  <NMI>  [<ffffffff815071ee>] panic+0x9b/0x1b0
[64615.342837]  [<ffffffff8150bb4a>] oops_end+0xea/0xf0
[64615.347828]  [<ffffffff81031dc3>] no_context+0xf3/0x260
[64615.353081]  [<ffffffff81032055>] __bad_area_nosemaphore+0x125/0x1e0
[64615.359456]  [<ffffffff8103211e>] bad_area_nosemaphore+0xe/0x10
[64615.365389]  [<ffffffff8150dd10>] do_page_fault+0x500/0x5a0
[64615.370985]  [<ffffffff810eb839>] ? __perf_event_overflow+0x99/0x210
[64615.377357]  [<ffffffff8150ae95>] page_fault+0x25/0x30
[64615.382516]  [<ffffffff812a211d>] ? acpi_atomic_read+0x8d/0xcb
[64615.388365]  [<ffffffff812a20f0>] ? acpi_atomic_read+0x60/0xcb
[64615.394224]  [<ffffffffa01b7245>] ghes_read_estatus+0x55/0x180 [ghes]
[64615.400685]  [<ffffffffa01b760c>] ghes_notify_nmi+0xbc/0x190 [ghes]
[64615.406959]  [<ffffffff8150ddfd>] notifier_call_chain+0x4d/0x70
[64615.412887]  [<ffffffff8150de63>] __atomic_notifier_call_chain+0x43/0x60
[64615.419594]  [<ffffffff8150de91>] atomic_notifier_call_chain+0x11/0x20
[64615.426138]  [<ffffffff8150dece>] notify_die+0x2e/0x30
[64615.431292]  [<ffffffff8150b4f2>] do_nmi+0xa2/0x260
[64615.436180]  [<ffffffff8150b150>] nmi+0x20/0x30
[64615.440730]  [<ffffffff81029f6a>] ? native_write_msr_safe+0xa/0x10
[64615.446911]  <<EOE>>  <IRQ>  [<ffffffff81011568>]
intel_pmu_disable_all+0x38/0xb0
[64615.454467]  [<ffffffff81010efa>] x86_pmu_disable+0x4a/0x50
[64615.460050]  [<ffffffff810ea842>] perf_event_task_tick+0x1a2/0x2a0
[64615.466233]  [<ffffffff81050750>] scheduler_tick+0x1b0/0x290
[64615.471908]  [<ffffffff81066c29>] update_process_times+0x69/0x80
[64615.477933]  [<ffffffff81088098>] tick_sched_timer+0x58/0x150
[64615.483691]  [<ffffffff8107b7ef>] __run_hrtimer+0x6f/0x250
[64615.489202]  [<ffffffff81088040>] ? tick_init_highres+0x20/0x20
[64615.495138]  [<ffffffff8107bf7a>] hrtimer_interrupt+0xda/0x230
[64615.500989]  [<ffffffff8101f5c6>] smp_apic_timer_interrupt+0x66/0xa0
[64615.507362]  [<ffffffff815120f3>] apic_timer_interrupt+0x13/0x20
[64615.513375]  <EOI>

What should I try next?

Thanks,
Rick


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: kernel oops and panic in acpi_atomic_read under 2.6.39.3.   call trace included
  2011-08-23 17:14           ` Don Zickus
@ 2011-08-23 17:24             ` rick
  2011-08-24  4:16               ` Huang Ying
  0 siblings, 1 reply; 18+ messages in thread
From: rick @ 2011-08-23 17:24 UTC (permalink / raw)
  To: Don Zickus
  Cc: ying.huang, Rafael J. Wysocki, linux-kernel, Richard Houghton,
	ACPI Devel Mailing List, Len Brown, Matthew Garrett

Hi Don,

This system has Westmere CPUs in it.  I can disable ghes completely if you
think that is best, sure.  I also want to make sure this problem is solved
for everyone going forward, so our customer can upgrade their system in
the future without problems.

The latest patch from Rafael did not resolve the problem unfortunately.  I
posted another message just before this with the current call trace. 
Unless he has another patch or suggestion, I will try rebuilding with ghes
disabled completely.

Thanks,
Rick

> Adding Ying.  He wrote the ghes code.  But if you are using 2.6.39, then the
> ghes code probably still has some bugs in it that were fixed in 3.0 and
> 3.1.
> And unless you are using a Nehalem machine or later (Westmere, Sandy
> Bridge), ghes won't do much for you.  Can you just disable it in the
> config?
>
> Cheers,
> Don
>
> On Mon, Aug 22, 2011 at 04:51:35PM -0400, Rick Warner wrote:
>> [...]
>



^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: kernel oops and panic in acpi_atomic_read under 2.6.39.3.   call trace included
  2011-08-23 17:24             ` rick
@ 2011-08-24  4:16               ` Huang Ying
  2011-08-24 22:18                 ` rick
  0 siblings, 1 reply; 18+ messages in thread
From: Huang Ying @ 2011-08-24  4:16 UTC (permalink / raw)
  To: rick
  Cc: Don Zickus, Rafael J. Wysocki, linux-kernel, Richard Houghton,
	ACPI Devel Mailing List, Len Brown, Matthew Garrett

[-- Attachment #1: Type: text/plain, Size: 428 bytes --]

Hi, Rick,

It appears that the panic occurs in acpi_atomic_read.  I think the most 
likely cause is that the acpi_generic_address is not pre-mapped.  Can 
you try the attached patch?

It prints the generic address registers as they are mapped and as they 
are accessed.  After booting with it, run the following command before 
starting the workload to confirm the mapping messages:

dmesg | grep GHES

Then, when the panic occurs, look for something like

GHES: gar accessed: x, xxxx

in the kernel log.
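
To make the guess concrete: in the NMI path acpi_atomic_read() cannot 
ioremap(), so it resolves the GAR physical address with a lockless 
lookup of the ranges that acpi_pre_map_gar() registered at init time.  
Very roughly -- a simplified sketch from my reading of the 2.6.39 
drivers/acpi/atomicio.c, not the exact code:

/* simplified sketch of the NMI-time read path in drivers/acpi/atomicio.c */
static int acpi_atomic_read_mem(u64 paddr, u64 *val, u32 width)
{
	void __iomem *addr;

	rcu_read_lock();
	/* lockless lookup of the ranges registered by acpi_pre_map_gar() */
	addr = __acpi_ioremap_fast(paddr, width);
	switch (width) {
	case 8:
		*val = readb(addr);
		break;
	case 16:
		*val = readw(addr);
		break;
	case 32:
		*val = readl(addr);	/* a NULL addr here matches your oops */
		break;
	case 64:
		*val = readq(addr);
		break;
	/* error handling for unsupported widths omitted in this sketch */
	}
	rcu_read_unlock();

	return 0;
}

If the GAR was never pre-mapped, or the lookup misses it, the returned 
address is NULL and the read*() above dereferences it, which would 
match the NULL pointer dereference in your trace.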

Best Regards,
Huang Ying


[-- Attachment #2: dbg_ghes.patch --]
[-- Type: text/x-patch, Size: 846 bytes --]

---
 drivers/acpi/apei/ghes.c |    6 ++++++
 1 file changed, 6 insertions(+)

--- a/drivers/acpi/apei/ghes.c
+++ b/drivers/acpi/apei/ghes.c
@@ -299,6 +299,9 @@ static struct ghes *ghes_new(struct acpi
 		return ERR_PTR(-ENOMEM);
 	ghes->generic = generic;
 	rc = acpi_pre_map_gar(&generic->error_status_address);
+	pr_info(GHES_PFX "gar mapped: %d, 0x%llx\n",
+		generic->error_status_address.space_id,
+		generic->error_status_address.address);
 	if (rc)
 		goto err_free;
 	error_block_length = generic->error_block_length;
@@ -398,6 +401,9 @@ static int ghes_read_estatus(struct ghes
 	u32 len;
 	int rc;
 
+	pr_info(GHES_PFX "gar accessed: %d, 0x%llx\n",
+		g->error_status_address.space_id,
+		g->error_status_address.address);
 	rc = acpi_atomic_read(&buf_paddr, &g->error_status_address);
 	if (rc) {
 		if (!silent && printk_ratelimit())

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: kernel oops and panic in acpi_atomic_read under 2.6.39.3.   call trace included
  2011-08-24  4:16               ` Huang Ying
@ 2011-08-24 22:18                 ` rick
  2011-08-25 15:47                   ` rick
  0 siblings, 1 reply; 18+ messages in thread
From: rick @ 2011-08-24 22:18 UTC (permalink / raw)
  To: Huang Ying
  Cc: Don Zickus, Rafael J. Wysocki, linux-kernel, Richard Houghton,
	ACPI Devel Mailing List, Len Brown, Matthew Garrett

Hi Huang,

The original system needs to ship to our customer ASAP, and disabling ghes
is a sufficient stopgap for that machine.  I have therefore set up an
identical system as a temporary master for another cluster to continue
this testing.
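
(For reference, "disabling ghes" on the shipping system just means keeping
the module from loading -- a minimal sketch of what we used, assuming ghes
is built as a module as in this config; the modprobe.d file name is
arbitrary and distro-dependent:)

# /etc/modprobe.d/99-local-blacklist.conf
# keep the ghes module from being auto-loaded
blacklist ghes
# also refuse explicit load requests for it
install ghes /bin/true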

I have applied your patch.  Here is the output of dmesg | grep GHES so far:


[    9.272198] GHES: gar mapped: 0, 0xbf7b5ff0
[    9.280782] GHES: gar mapped: 0, 0xbf7b6200
[    9.285102] [Firmware Warn]: GHES: Poll interval is 0 for generic hardware error source: 1, disabled.

I have the serial console capture active and the stress tests running
again.  I'll reply with the output once I get another panic.
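
(For completeness, the serial capture is the usual setup -- a sketch,
assuming ttyS0 at 115200; adjust for your hardware:)

# appended to the kernel command line in the bootloader config
console=tty0 console=ttyS0,115200n8

with a second machine logging that serial port.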

Thanks!
Rick




^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: kernel oops and panic in acpi_atomic_read under 2.6.39.3.   call trace included
  2011-08-24 22:18                 ` rick
@ 2011-08-25 15:47                   ` rick
  2011-08-26  0:34                     ` Huang Ying
  0 siblings, 1 reply; 18+ messages in thread
From: rick @ 2011-08-25 15:47 UTC (permalink / raw)
  To: Huang Ying
  Cc: Don Zickus, Rafael J. Wysocki, linux-kernel, Richard Houghton,
	ACPI Devel Mailing List, Len Brown, Matthew Garrett

Hi Huang,

My new setup reproduced the panic. However I do not have any gar accessed
messages on it.  The gar mapped messages are in my previous email.  Here
is the latest call trace.  There is no GHES output prior to it:

[30348.824329] BUG: unable to handle kernel NULL pointer dereference at   
       (null)
[30348.832197] IP: [<ffffffff812a211d>] acpi_atomic_read+0x8d/0xcb
[30348.838144] PGD 605984067 PUD 6059de067 PMD 0
[30348.842654] Oops: 0000 [#1] PREEMPT SMP
[30348.846640] last sysfs file:
/sys/devices/system/cpu/cpu15/cache/index2/shared_cpu_map
[30348.854555] CPU 13
[30348.856487] Modules linked in: md5 ipmi_devintf ipmi_si ipmi_msghandler
nfsd lockd nfs_acl auth_rpcgss sunrpc ipt_MASQUERADE iptable_mangle
iptable_nat nf_nat nf_conntrack_ipv4 nf_conntrack nf_defrag_ipv4
iptable_filter ip_tables x_tables af_packet edd cpufreq_conservative
cpufreq_userspace cpufreq_powersave acpi_cpufreq mperf xfs dm_mod igb
joydev ioatdma dca iTCO_wdt iTCO_vendor_support i7core_edac i2c_i801
edac_core ghes button hed sg pcspkr serio_raw ext4 jbd2 crc16 fan
processor thermal thermal_sys ata_generic pata_atiixp arcmsr
[30348.904982]
[30348.906481] Pid: 27462, comm: cluster Not tainted
2.6.39.3-microwaycustom #8 Supermicro X8DTH-i/6/iF/6F/X8DTH
[30348.916458] RIP: 0010:[<ffffffff812a211d>]  [<ffffffff812a211d>]
acpi_atomic_read+0x8d/0xcb
[30348.924825] RSP: 0000:ffff88063fca7da8  EFLAGS: 00010046
[30348.930129] RAX: 0000000000000000 RBX: ffff88063fca7df0 RCX:
00000000bf7b6000
[30348.937251] RDX: 0000000000000000 RSI: 00000000bf7b6010 RDI:
00000000bf7b5ff0
[30348.944374] RBP: ffff88063fca7dd8 R08: 00000000bf7b7000 R09:
0000000000000000
[30348.951497] R10: 000000000000000a R11: 000000000000000b R12:
ffffc90003044c20
[30348.958627] R13: 0000000000000000 R14: 00000000bf7b5ff0 R15:
0000000000000000
[30348.965758] FS:  0000000000000000(0000) GS:ffff88063fca0000(0000)
knlGS:0000000000000000
[30348.973841] CS:  0010 DS: 002b ES: 002b CR0: 0000000080050033
[30348.979586] CR2: 0000000000000000 CR3: 00000006059db000 CR4:
00000000000006e0
[30348.986708] DR0: 0000000000000000 DR1: 0000000000000000 DR2:
0000000000000000
[30348.993838] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7:
0000000000000400
[30349.000961] Process cluster (pid: 27462, threadinfo ffff880605a02000,
task ffff88061e8f8440)
[30349.009387] Stack:
[30349.011403]  0000000000000000 00000000bf7b5ff0 ffff88032ac0a940
ffff88032ac0a940
[30349.018879]  0000000000000001 ffffc90003044ca8 ffff88063fca7e18
ffffffffa0136235
[30349.026366]  0000000000000000 0000000000000000 ffff88032ac0a940
0000000000000000
[30349.033850] Call Trace:
[30349.036300]  <NMI>
[30349.038442]  [<ffffffffa0136235>] ghes_read_estatus+0x45/0x180 [ghes]
[30349.044882]  [<ffffffffa013660c>] ghes_notify_nmi+0xbc/0x190 [ghes]
[30349.051148]  [<ffffffff8150ddfd>] notifier_call_chain+0x4d/0x70
[30349.057065]  [<ffffffff8150de63>] __atomic_notifier_call_chain+0x43/0x60
[30349.063762]  [<ffffffff8150de91>] atomic_notifier_call_chain+0x11/0x20
[30349.070286]  [<ffffffff8150dece>] notify_die+0x2e/0x30
[30349.075415]  [<ffffffff8150b4f2>] do_nmi+0xa2/0x260
[30349.080287]  [<ffffffff8150b150>] nmi+0x20/0x30
[30349.084819]  [<ffffffff81029f6a>] ? native_write_msr_safe+0xa/0x10
[30349.090991]  <<EOE>>
[30349.093094]  <IRQ>
[30349.095424]  [<ffffffff81011568>] intel_pmu_disable_all+0x38/0xb0
[30349.101516]  [<ffffffff81010efa>] x86_pmu_disable+0x4a/0x50
[30349.107093]  [<ffffffff810ea842>] perf_event_task_tick+0x1a2/0x2a0
[30349.113269]  [<ffffffff81050750>] scheduler_tick+0x1b0/0x290
[30349.118932]  [<ffffffff81066c29>] update_process_times+0x69/0x80
[30349.124936]  [<ffffffff81088098>] tick_sched_timer+0x58/0x150
[30349.130680]  [<ffffffff8107b7ef>] __run_hrtimer+0x6f/0x250
[30349.136166]  [<ffffffff81088040>] ? tick_init_highres+0x20/0x20
[30349.142087]  [<ffffffff8107bf7a>] hrtimer_interrupt+0xda/0x230
[30349.147921]  [<ffffffff8101f5c6>] smp_apic_timer_interrupt+0x66/0xa0
[30349.154272]  [<ffffffff815120f3>] apic_timer_interrupt+0x13/0x20
[30349.160272]  <EOI>
[30349.162200] Code: fc 10 74 1f 77 08 41 80 fc 08 75 49 eb 0e 41 80 fc 20
74 17 41 80 fc 40 75 3b eb 15 8a 00 0f b6 c0 eb 11 66 8b 00 0f b7 c0 eb 09
<8b> 00 89 c0 eb 03 48 8b 00 48 89 03 e8 62 55 e2 ff eb 1d 41 0f
[30349.182456] RIP  [<ffffffff812a211d>] acpi_atomic_read+0x8d/0xcb
[30349.188490]  RSP <ffff88063fca7da8>
[30349.191977] CR2: 0000000000000000
[30349.195293] ---[ end trace 316c5d7ea544957e ]---
[30349.199904] Kernel panic - not syncing: Fatal exception in interrupt
[30349.206249] Pid: 27462, comm: cluster Tainted: G      D    
2.6.39.3-microwaycustom #8
[30349.214156] Call Trace:
[30349.216605]  <NMI>  [<ffffffff815071ee>] panic+0x9b/0x1b0
[30349.222034]  [<ffffffff8150bb4a>] oops_end+0xea/0xf0
[30349.226997]  [<ffffffff81031dc3>] no_context+0xf3/0x260
[30349.232220]  [<ffffffff812569de>] ? number+0x31e/0x350
[30349.237360]  [<ffffffff81032055>] __bad_area_nosemaphore+0x125/0x1e0
[30349.243712]  [<ffffffff8103211e>] bad_area_nosemaphore+0xe/0x10
[30349.249633]  [<ffffffff8150dd10>] do_page_fault+0x500/0x5a0
[30349.255205]  [<ffffffff81258e0e>] ? vsnprintf+0x33e/0x5d0
[30349.260605]  [<ffffffff8107cd3a>] ? up+0x2a/0x50
[30349.265228]  [<ffffffff81056da9>] ? console_unlock+0x189/0x1e0
[30349.271057]  [<ffffffff8150ae95>] page_fault+0x25/0x30
[30349.276201]  [<ffffffff812a211d>] ? acpi_atomic_read+0x8d/0xcb
[30349.282029]  [<ffffffff812a20f0>] ? acpi_atomic_read+0x60/0xcb
[30349.287869]  [<ffffffffa0136235>] ghes_read_estatus+0x45/0x180 [ghes]
[30349.294311]  [<ffffffffa013660c>] ghes_notify_nmi+0xbc/0x190 [ghes]
[30349.300575]  [<ffffffff8150ddfd>] notifier_call_chain+0x4d/0x70
[30349.306494]  [<ffffffff8150de63>] __atomic_notifier_call_chain+0x43/0x60
[30349.313192]  [<ffffffff8150de91>] atomic_notifier_call_chain+0x11/0x20
[30349.319715]  [<ffffffff8150dece>] notify_die+0x2e/0x30
[30349.324853]  [<ffffffff8150b4f2>] do_nmi+0xa2/0x260
[30349.329727]  [<ffffffff8150b150>] nmi+0x20/0x30
[30349.334264]  [<ffffffff81029f6a>] ? native_write_msr_safe+0xa/0x10
[30349.340438]  <<EOE>>  <IRQ>  [<ffffffff81011568>]
intel_pmu_disable_all+0x38/0xb0
[30349.347959]  [<ffffffff81010efa>] x86_pmu_disable+0x4a/0x50
[30349.353527]  [<ffffffff810ea842>] perf_event_task_tick+0x1a2/0x2a0
[30349.359705]  [<ffffffff81050750>] scheduler_tick+0x1b0/0x290
[30349.365366]  [<ffffffff81066c29>] update_process_times+0x69/0x80
[30349.371370]  [<ffffffff81088098>] tick_sched_timer+0x58/0x150
[30349.377114]  [<ffffffff8107b7ef>] __run_hrtimer+0x6f/0x250
[30349.382604]  [<ffffffff81088040>] ? tick_init_highres+0x20/0x20
[30349.388518]  [<ffffffff8107bf7a>] hrtimer_interrupt+0xda/0x230
[30349.394355]  [<ffffffff8101f5c6>] smp_apic_timer_interrupt+0x66/0xa0
[30349.400708]  [<ffffffff815120f3>] apic_timer_interrupt+0x13/0x20
[30349.406705]  <EOI>

Thanks,
Rick




^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: kernel oops and panic in acpi_atomic_read under 2.6.39.3.   call trace included
  2011-08-25 15:47                   ` rick
@ 2011-08-26  0:34                     ` Huang Ying
  2011-09-02 23:32                       ` rick
  0 siblings, 1 reply; 18+ messages in thread
From: Huang Ying @ 2011-08-26  0:34 UTC (permalink / raw)
  To: rick
  Cc: Don Zickus, Rafael J. Wysocki, linux-kernel, Richard Houghton,
	ACPI Devel Mailing List, Len Brown, Matthew Garrett

[-- Attachment #1: Type: text/plain, Size: 500 bytes --]

On 08/25/2011 11:47 PM, rick@microway.com wrote:
> Hi Huang,
> 
> My new setup reproduced the panic. However I do not have any gar accessed
> messages on it.  The gar mapped messages are in my previous email.  Here
> is the latest call trace.  There is no GHES output prior to it:
> 

That is weird.  Can you try the attached patch?  If my guess is
correct, there will be no panic, but something like the following will
be in dmesg:

ACPI atomic read mem: addr 0xxxxx mapped to 0
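
For context, the lookup that would have to come up empty for that line
to appear is roughly the following (a simplified paraphrase of
drivers/acpi/atomicio.c, not the exact code):

/* one entry per range pre-mapped by acpi_pre_map_gar() */
struct acpi_iomap {
	struct list_head list;
	void __iomem *vaddr;
	unsigned long size;
	phys_addr_t paddr;
};

static LIST_HEAD(acpi_iomaps);

/* NMI-safe: only walks already-mapped ranges under RCU, never ioremaps */
static void __iomem *__acpi_ioremap_fast(phys_addr_t paddr,
					 unsigned long size)
{
	struct acpi_iomap *map;

	list_for_each_entry_rcu(map, &acpi_iomaps, list) {
		if (map->paddr <= paddr &&
		    map->paddr + map->size >= paddr + size)
			return map->vaddr + (paddr - map->paddr);
	}

	return NULL;	/* not pre-mapped -> the "mapped to 0" line above */
}

It only walks ranges that were mapped ahead of time because ioremap()
cannot be called from NMI context.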

Best Regards,
Huang Ying

[-- Attachment #2: dbg_ghes.patch --]
[-- Type: text/x-patch, Size: 1249 bytes --]

---
 drivers/acpi/apei/ghes.c |    6 ++++++
 drivers/acpi/atomicio.c  |    4 ++++
 2 files changed, 10 insertions(+)

--- a/drivers/acpi/apei/ghes.c
+++ b/drivers/acpi/apei/ghes.c
@@ -299,6 +299,9 @@ static struct ghes *ghes_new(struct acpi
 		return ERR_PTR(-ENOMEM);
 	ghes->generic = generic;
 	rc = acpi_pre_map_gar(&generic->error_status_address);
+	pr_info(GHES_PFX "gar mapped: %d, 0x%llx\n",
+		generic->error_status_address.space_id,
+		generic->error_status_address.address);
 	if (rc)
 		goto err_free;
 	error_block_length = generic->error_block_length;
@@ -398,6 +401,9 @@ static int ghes_read_estatus(struct ghes
 	u32 len;
 	int rc;
 
+	pr_err(GHES_PFX "gar accessed: %d, 0x%llx\n",
+	       g->error_status_address.space_id,
+	       g->error_status_address.address);
 	rc = acpi_atomic_read(&buf_paddr, &g->error_status_address);
 	if (rc) {
 		if (!silent && printk_ratelimit())
--- a/drivers/acpi/atomicio.c
+++ b/drivers/acpi/atomicio.c
@@ -270,6 +270,10 @@ static int acpi_atomic_read_mem(u64 padd
 
 	rcu_read_lock();
 	addr = __acpi_ioremap_fast(paddr, width);
+	pr_err("ACPI atomic read mem: addr 0x%llx mapped to %p\n",
+	       paddr, addr);
+	if (!addr)
+		return -EIO;
 	switch (width) {
 	case 8:
 		*val = readb(addr);

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: kernel oops and panic in acpi_atomic_read under 2.6.39.3.   call trace included
  2011-08-26  0:34                     ` Huang Ying
@ 2011-09-02 23:32                       ` rick
  2011-09-05  2:50                         ` Huang Ying
  0 siblings, 1 reply; 18+ messages in thread
From: rick @ 2011-09-02 23:32 UTC (permalink / raw)
  To: Huang Ying
  Cc: Don Zickus, Rafael J. Wysocki, linux-kernel, Richard Houghton,
	ACPI Devel Mailing List, Len Brown, Matthew Garrett

Hi Huang,

Sorry for the delay in my response.  Hurricane Irene delayed our testing a
bit.

I had to switch the 5620 CPUs I had for 5670s.  After 4 days of running
(it was usually about 2 before) I finally got this output in dmesg:

[337296.365930] GHES: gar accessed: 0, 0xbf7b9370
[337296.365936] ACPI atomic read mem: addr 0xbf7b9370 mapped to ffffc90013ee8370

It is not mapped to 0 as you expected, but it didn't crash this time!

Thanks,
Rick

>



^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: kernel oops and panic in acpi_atomic_read under 2.6.39.3.   call trace included
  2011-09-02 23:32                       ` rick
@ 2011-09-05  2:50                         ` Huang Ying
  2011-09-15 18:35                           ` rick
  0 siblings, 1 reply; 18+ messages in thread
From: Huang Ying @ 2011-09-05  2:50 UTC (permalink / raw)
  To: rick
  Cc: Don Zickus, Rafael J. Wysocki, linux-kernel, Richard Houghton,
	ACPI Devel Mailing List, Len Brown, Matthew Garrett

[-- Attachment #1: Type: text/plain, Size: 652 bytes --]

On 09/03/2011 07:32 AM, rick@microway.com wrote:
> Hi Huang,
> 
> Sorry for the delay in my response.  Hurricane Irene delayed our testing a
> bit.
> 
> I had to switch the 5620 CPUs I had for 5670s.  After 4 days of running
> (it was usually about 2 before) I finally got this output in dmesg:
> 
> [337296.365930] GHES: gar accessed: 0, 0xbf7b9370
> [337296.365936] ACPI atomic read mem: addr 0xbf7b9370 mapped to ffffc90013ee8370
> 
> It is not mapped to 0 as you expected, but it didn't crash this time!

But I don't think this patch fixed the issue; it may just have hidden
it.  Do you have time to try the attached new patch?

Best Regards,
Huang Ying


[-- Attachment #2: dbg_ghes.patch --]
[-- Type: text/x-patch, Size: 1213 bytes --]

---
 drivers/acpi/apei/ghes.c |    6 ++++++
 drivers/acpi/atomicio.c  |    2 ++
 2 files changed, 8 insertions(+)

--- a/drivers/acpi/apei/ghes.c
+++ b/drivers/acpi/apei/ghes.c
@@ -299,6 +299,9 @@ static struct ghes *ghes_new(struct acpi
 		return ERR_PTR(-ENOMEM);
 	ghes->generic = generic;
 	rc = acpi_pre_map_gar(&generic->error_status_address);
+	pr_info(GHES_PFX "gar mapped: %d, %#llx\n",
+		generic->error_status_address.space_id,
+		generic->error_status_address.address);
 	if (rc)
 		goto err_free;
 	error_block_length = generic->error_block_length;
@@ -398,6 +401,9 @@ static int ghes_read_estatus(struct ghes
 	u32 len;
 	int rc;
 
+	pr_err(GHES_PFX "gar accessed: %d, %#llx\n",
+	       g->error_status_address.space_id,
+	       g->error_status_address.address);
 	rc = acpi_atomic_read(&buf_paddr, &g->error_status_address);
 	if (rc) {
 		if (!silent && printk_ratelimit())
--- a/drivers/acpi/atomicio.c
+++ b/drivers/acpi/atomicio.c
@@ -270,6 +270,8 @@ static int acpi_atomic_read_mem(u64 padd
 
 	rcu_read_lock();
 	addr = __acpi_ioremap_fast(paddr, width);
+	if (!addr)
+		panic("ACPI atomic read mem: addr %#llx is not mapped!\n", paddr);
 	switch (width) {
 	case 8:
 		*val = readb(addr);

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: kernel oops and panic in acpi_atomic_read under 2.6.39.3.   call trace included
  2011-09-05  2:50                         ` Huang Ying
@ 2011-09-15 18:35                           ` rick
  2011-09-16  0:20                             ` Huang Ying
  0 siblings, 1 reply; 18+ messages in thread
From: rick @ 2011-09-15 18:35 UTC (permalink / raw)
  To: Huang Ying
  Cc: Don Zickus, Rafael J. Wysocki, linux-kernel, Richard Houghton,
	ACPI Devel Mailing List, Len Brown, Matthew Garrett

Hi Huang,

I've been running for a while now with this new patch.  I have not gotten
any panic output though.  This is the only output I have from GHES:

[22256.616069] GHES: gar accessed: 0, 0xbf7b9370
[102115.873369] GHES: gar accessed: 0, 0xbf7b9370
[175460.888809] GHES: gar accessed: 0, 0xbf7b9370
[223652.976925] GHES: gar accessed: 0, 0xbf7b9370
[284678.016631] GHES: gar accessed: 0, 0xbf7b9370
[321563.363578] GHES: gar accessed: 0, 0xbf7b9370
[500789.731559] GHES: gar accessed: 0, 0xbf7b9370
[539808.495653] GHES: gar accessed: 0, 0xbf7b9370
[545116.419274] GHES: gar accessed: 0, 0xbf7b9370
[686744.665292] GHES: gar accessed: 0, 0xbf7b9370

Do you think changing from 5620 to 5670 CPUs changed the behavior?

Thanks,
Rick





^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: kernel oops and panic in acpi_atomic_read under 2.6.39.3.   call trace included
  2011-09-15 18:35                           ` rick
@ 2011-09-16  0:20                             ` Huang Ying
  0 siblings, 0 replies; 18+ messages in thread
From: Huang Ying @ 2011-09-16  0:20 UTC (permalink / raw)
  To: rick
  Cc: Don Zickus, Rafael J. Wysocki, linux-kernel, Richard Houghton,
	ACPI Devel Mailing List, Len Brown, Matthew Garrett

Hi, Rick,

On 09/16/2011 02:35 AM, rick@microway.com wrote:
> Hi Huang,
> 
> I've been running for a while now with this new patch.  I have not gotten
> any panic output though.  This is the only output I have from GHES:
> 
> [22256.616069] GHES: gar accessed: 0, 0xbf7b9370
> [102115.873369] GHES: gar accessed: 0, 0xbf7b9370
> [175460.888809] GHES: gar accessed: 0, 0xbf7b9370
> [223652.976925] GHES: gar accessed: 0, 0xbf7b9370
> [284678.016631] GHES: gar accessed: 0, 0xbf7b9370
> [321563.363578] GHES: gar accessed: 0, 0xbf7b9370
> [500789.731559] GHES: gar accessed: 0, 0xbf7b9370
> [539808.495653] GHES: gar accessed: 0, 0xbf7b9370
> [545116.419274] GHES: gar accessed: 0, 0xbf7b9370
> [686744.665292] GHES: gar accessed: 0, 0xbf7b9370
> 
> Do you think changing from 5620 to 5670 CPUs changed the behavior?

I don't know why either.

Best Regards,
Huang Ying

^ permalink raw reply	[flat|nested] 18+ messages in thread

end of thread, other threads:[~2011-09-16  0:20 UTC | newest]

Thread overview: 18+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2011-08-17 21:51 kernel oops and panic in acpi_atomic_read under 2.6.39.3. call trace included Rick Warner
2011-08-18  7:47 ` Rafael J. Wysocki
2011-08-18 21:43   ` Rafael J. Wysocki
2011-08-22 14:42     ` rick
2011-08-22 18:47       ` Rafael J. Wysocki
2011-08-22 20:51         ` Rick Warner
2011-08-22 21:13           ` Rafael J. Wysocki
2011-08-23 17:16             ` rick
2011-08-23 17:14           ` Don Zickus
2011-08-23 17:24             ` rick
2011-08-24  4:16               ` Huang Ying
2011-08-24 22:18                 ` rick
2011-08-25 15:47                   ` rick
2011-08-26  0:34                     ` Huang Ying
2011-09-02 23:32                       ` rick
2011-09-05  2:50                         ` Huang Ying
2011-09-15 18:35                           ` rick
2011-09-16  0:20                             ` Huang Ying
