From: Rick Warner <rick@microway.com>
To: linux-kernel@vger.kernel.org
Cc: Richard Houghton <rhoughton@microway.com>
Subject: kernel oops and panic in acpi_atomic_read under 2.6.39.3. call trace included
Date: Wed, 17 Aug 2011 17:51:51 -0400
Message-ID: <201108171751.51648.rick@microway.com>
Hi all,

I am getting a kernel oops/panic on a dual-Xeon system that is the master of a
60-node HPC cluster. It happens while stress testing the system, including
significant network traffic. The OS is openSUSE 11.4. We are running a
custom-compiled 2.6.39.3 kernel because of a bug (igb driver related) in the
stock kernel that ships with 11.4. After 1-3 days of heavy testing, the master
node locks up with the Caps Lock and Scroll Lock keyboard LEDs blinking, and
the following output is captured via a serial console:
[381920.681113] BUG: unable to handle kernel NULL pointer dereference at (null)
[381920.689067] IP: [<ffffffff812a7510>] acpi_atomic_read+0xe3/0x120
[381920.695187] PGD 30c27a067 PUD 16efe6067 PMD 0
[381920.699782] Oops: 0000 [#1] PREEMPT SMP
[381920.703866] last sysfs file: /sys/devices/system/cpu/cpu15/cache/index2/shared_cpu_map
[381920.711868] CPU 6
[381920.713800] Modules linked in: md5 ipmi_devintf ipmi_si ipmi_msghandler
nfsd lockd nfs_acl auth_rpcgss sunrpc ipt_MASQUERADE iptable_mangle
iptable_nat nf_nat nf_conntrack_ipv4 nf_conntrack nf_defrag_ipv4 iptable_filter
ip_tables x_tables af_packet edd cpufreq_conservative cpufreq_userspace
cpufreq_powersave acpi_cpufreq mperf xfs dm_mod ioatdma i7core_edac edac_core
sr_mod cdrom joydev igb i2c_i801 sg button ghes hed iTCO_wdt
iTCO_vendor_support dca pcspkr ext4 jbd2 crc16 raid456 async_raid6_recov
async_pq raid6_pq async_xor xor async_memcpy async_tx raid10 raid1 raid0 fan
processor thermal thermal_sys ata_generic pata_atiixp arcmsr
[381920.771623]
[381920.773210] Pid: 12701, comm: cluster Not tainted 2.6.39.3-microwaycustom #1 Supermicro X8DTH-i/6/iF/6F/X8DTH
[381920.783292] RIP: 0010:[<ffffffff812a7510>] [<ffffffff812a7510>] acpi_atomic_read+0xe3/0x120
[381920.791853] RSP: 0000:ffff88063fc47d98 EFLAGS: 00010046
[381920.797260] RAX: 0000000000000000 RBX: 00000000bf7b5ff0 RCX: ffffffff81a3cdd0
[381920.804486] RDX: 00000000bf7b6010 RSI: 00000000bf7b6000 RDI: ffff88062d4b95c0
[381920.811712] RBP: ffff88063fc47dc8 R08: ffff88063fc47d98 R09: 0000000000000002
[381920.818940] R10: 0000000000000083 R11: 0000000000000010 R12: ffffc90003044c20
[381920.826168] R13: ffff88063fc47de0 R14: 0000000000000000 R15: 0000000000000000
[381920.833392] FS: 0000000000000000(0000) GS:ffff88063fc40000(0000) knlGS:0000000000000000
[381920.841571] CS: 0010 DS: 002b ES: 002b CR0: 0000000080050033
[381920.847409] CR2: 0000000000000000 CR3: 00000001d2d0f000 CR4: 00000000000006e0
[381920.854635] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[381920.861861] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[381920.869088] Process cluster (pid: 12701, threadinfo ffff880028284000, task ffff880550184380)
[381920.877608] Stack:
[381920.879717] ffffffff81a3cdd0 00000000bf7b5ff0 ffff88062b2cc140 ffffc90003044ca8
[381920.887288] ffff88062b2cc140 0000000000000001 ffff88063fc47e08 ffffffffa002b21f
[381920.894862] 0000000000000000 0000000000000000 ffff88062b2cc140 0000000000000000
[381920.902435] Call Trace:
[381920.904969] <NMI>
[381920.907201] [<ffffffffa002b21f>] ghes_read_estatus+0x2f/0x170 [ghes]
[381920.913726] [<ffffffffa002b618>] ghes_notify_nmi+0xd8/0x1b0 [ghes]
[381920.920076] [<ffffffff8151a14f>] notifier_call_chain+0x3f/0x80
[381920.926080] [<ffffffff8151a1d8>] __atomic_notifier_call_chain+0x48/0x70
[381920.932864] [<ffffffff8151a211>] atomic_notifier_call_chain+0x11/0x20
[381920.939473] [<ffffffff8151a24e>] notify_die+0x2e/0x30
[381920.944691] [<ffffffff81517bd2>] do_nmi+0xa2/0x270
[381920.949649] [<ffffffff81517550>] nmi+0x20/0x30
[381920.954269] [<ffffffff8102ae8a>] ? native_write_msr_safe+0xa/0x10
[381920.960536] <<EOE>>
[381920.962725] <IRQ>
[381920.965132] [<ffffffff8101131e>] intel_pmu_disable_all+0x3e/0x120
[381920.971399] [<ffffffff81010d5a>] x86_pmu_disable+0x4a/0x50
[381920.977066] [<ffffffff810e7f2b>] perf_pmu_disable+0x2b/0x40
[381920.982817] [<ffffffff810ee828>] perf_event_task_tick+0x218/0x270
[381920.989082] [<ffffffff81046f4d>] scheduler_tick+0xdd/0x2c0
[381920.994740] [<ffffffff81067097>] update_process_times+0x67/0x80
[381921.000832] [<ffffffff81089eef>] tick_sched_timer+0x5f/0xc0
[381921.006578] [<ffffffff81089e90>] ? tick_nohz_handler+0x100/0x100
[381921.012758] [<ffffffff8107db3d>] __run_hrtimer+0x12d/0x280
[381921.018414] [<ffffffff8107dea7>] hrtimer_interrupt+0xb7/0x1e0
[381921.024343] [<ffffffff810206d7>] smp_apic_timer_interrupt+0x67/0xa0
[381921.030793] [<ffffffff8151e3f3>] apic_timer_interrupt+0x13/0x20
[381921.036891] <EOI>
[381921.038913] Code: fc 10 74 1f 77 08 41 80 fc 08 75 48 eb 0e 41 80 fc 20 74 17 41 80 fc 40 75 3a eb 15 8a 00 0f b6 c0 eb 11 66 8b 00 0f b7 c0 eb 09 <8b> 00 89 c0 eb 03 48 8b 00 49 89 45 00 e8 8e 2b e2 ff eb 1b 0f
[381921.059462] RIP [<ffffffff812a7510>] acpi_atomic_read+0xe3/0x120
[381921.065669] RSP <ffff88063fc47d98>
[381921.069242] CR2: 0000000000000000
[381921.072645] ---[ end trace 52697bfc73a34a90 ]---
[381921.077343] Kernel panic - not syncing: Fatal exception in interrupt
[381921.083784] Pid: 12701, comm: cluster Tainted: G D 2.6.39.3-microwaycustom #1
[381921.091788] Call Trace:
[381921.094333] <NMI> [<ffffffff81513300>] panic+0x9f/0x1da
[381921.099881] [<ffffffff81517f8c>] oops_end+0xdc/0xf0
[381921.104952] [<ffffffff81032b91>] no_context+0xf1/0x260
[381921.110277] [<ffffffff81032e55>] __bad_area_nosemaphore+0x155/0x200
[381921.116739] [<ffffffff81032f0e>] bad_area_nosemaphore+0xe/0x10
[381921.122766] [<ffffffff81519f46>] do_page_fault+0x366/0x530
[381921.128440] [<ffffffff810ee929>] ? __perf_event_overflow+0xa9/0x220
[381921.134896] [<ffffffff810ef99b>] ? perf_event_update_userpage+0x9b/0xe0
[381921.141698] [<ffffffff81012249>] ? intel_pmu_enable_all+0xc9/0x1a0
[381921.148061] [<ffffffff810123ff>] ? x86_perf_event_set_period+0xdf/0x170
[381921.154852] [<ffffffff81517295>] page_fault+0x25/0x30
[381921.160081] [<ffffffff812a7510>] ? acpi_atomic_read+0xe3/0x120
[381921.166094] [<ffffffff812a7486>] ? acpi_atomic_read+0x59/0x120
[381921.172107] [<ffffffffa002b21f>] ghes_read_estatus+0x2f/0x170 [ghes]
[381921.178639] [<ffffffffa002b618>] ghes_notify_nmi+0xd8/0x1b0 [ghes]
[381921.185004] [<ffffffff8151a14f>] notifier_call_chain+0x3f/0x80
[381921.191003] [<ffffffff8151a1d8>] __atomic_notifier_call_chain+0x48/0x70
[381921.197789] [<ffffffff8151a211>] atomic_notifier_call_chain+0x11/0x20
[381921.204399] [<ffffffff8151a24e>] notify_die+0x2e/0x30
[381921.209624] [<ffffffff81517bd2>] do_nmi+0xa2/0x270
[381921.214584] [<ffffffff81517550>] nmi+0x20/0x30
[381921.219208] [<ffffffff8102ae8a>] ? native_write_msr_safe+0xa/0x10
[381921.225470] <<EOE>> <IRQ> [<ffffffff8101131e>] intel_pmu_disable_all+0x3e/0x120
[381921.233174] [<ffffffff81010d5a>] x86_pmu_disable+0x4a/0x50
[381921.238836] [<ffffffff810e7f2b>] perf_pmu_disable+0x2b/0x40
[381921.244582] [<ffffffff810ee828>] perf_event_task_tick+0x218/0x270
[381921.250855] [<ffffffff81046f4d>] scheduler_tick+0xdd/0x2c0
[381921.256520] [<ffffffff81067097>] update_process_times+0x67/0x80
[381921.262618] [<ffffffff81089eef>] tick_sched_timer+0x5f/0xc0
[381921.268363] [<ffffffff81089e90>] ? tick_nohz_handler+0x100/0x100
[381921.274540] [<ffffffff8107db3d>] __run_hrtimer+0x12d/0x280
[381921.280198] [<ffffffff8107dea7>] hrtimer_interrupt+0xb7/0x1e0
[381921.286125] [<ffffffff810206d7>] smp_apic_timer_interrupt+0x67/0xa0
[381921.292578] [<ffffffff8151e3f3>] apic_timer_interrupt+0x13/0x20
[381921.298673] <EOI>
I recompiled the kernel with the tickless (NO_HZ) feature disabled, since I saw
"nohz" early in the call trace. The problem then reproduced again; the call
trace changed only slightly, with tick_init_highres appearing in place of
tick_nohz_handler. I have that call trace as well if it is wanted, but it is
nearly identical.
I am currently trying a custom 2.6.36.4 kernel on the system, but it will take
up to 3 days of testing before I know whether the problem exists there as well.
Any ideas on this?
Thanks,
Rick
--
Richard Warner
Lead Systems Integrator
Microway, Inc
(508)732-5517
Thread overview: 18+ messages
2011-08-17 21:51 Rick Warner [this message]
2011-08-18 7:47 ` kernel oops and panic in acpi_atomic_read under 2.6.39.3. call trace included Rafael J. Wysocki
2011-08-18 21:43 ` Rafael J. Wysocki
2011-08-22 14:42 ` rick
2011-08-22 18:47 ` Rafael J. Wysocki
2011-08-22 20:51 ` Rick Warner
2011-08-22 21:13 ` Rafael J. Wysocki
2011-08-23 17:16 ` rick
2011-08-23 17:14 ` Don Zickus
2011-08-23 17:24 ` rick
2011-08-24 4:16 ` Huang Ying
2011-08-24 22:18 ` rick
2011-08-25 15:47 ` rick
2011-08-26 0:34 ` Huang Ying
2011-09-02 23:32 ` rick
2011-09-05 2:50 ` Huang Ying
2011-09-15 18:35 ` rick
2011-09-16 0:20 ` Huang Ying