From: Rick Warner <rick@microway.com>
To: linux-kernel@vger.kernel.org
Cc: Richard Houghton <rhoughton@microway.com>
Subject: kernel oops and panic in acpi_atomic_read under 2.6.39.3, call trace included
Date: Wed, 17 Aug 2011 17:51:51 -0400
Message-ID: <201108171751.51648.rick@microway.com>

Hi all,

I am getting a kernel oops/panic on a dual-Xeon system that is the master node 
of a 60-node HPC cluster.  It happens while stress testing the system, 
including significant network traffic.  The OS is openSUSE 11.4.

We are running a custom-compiled 2.6.39.3 kernel on the systems because of an 
igb-driver-related bug in the stock kernel shipped with 11.4.  After 1-3 days 
of heavy testing, the master node locks up with the Caps Lock and Scroll Lock 
LEDs blinking, and the following output is captured via a serial console:

[381920.681113] BUG: unable to handle kernel NULL pointer dereference at (null)
[381920.689067] IP: [<ffffffff812a7510>] acpi_atomic_read+0xe3/0x120
[381920.695187] PGD 30c27a067 PUD 16efe6067 PMD 0 
[381920.699782] Oops: 0000 [#1] PREEMPT SMP 
[381920.703866] last sysfs file: /sys/devices/system/cpu/cpu15/cache/index2/shared_cpu_map
[381920.711868] CPU 6 
[381920.713800] Modules linked in: md5 ipmi_devintf ipmi_si ipmi_msghandler nfsd lockd nfs_acl auth_rpcgss sunrpc ipt_MASQUERADE iptable_mangle iptable_nat nf_nat nf_conntrack_ipv4 nf_conntrack nf_defrag_ipv4 iptable_filter ip_tables x_tables af_packet edd cpufreq_conservative cpufreq_userspace cpufreq_powersave acpi_cpufreq mperf xfs dm_mod ioatdma i7core_edac edac_core sr_mod cdrom joydev igb i2c_i801 sg button ghes hed iTCO_wdt iTCO_vendor_support dca pcspkr ext4 jbd2 crc16 raid456 async_raid6_recov async_pq raid6_pq async_xor xor async_memcpy async_tx raid10 raid1 raid0 fan processor thermal thermal_sys ata_generic pata_atiixp arcmsr
[381920.771623] 
[381920.773210] Pid: 12701, comm: cluster Not tainted 2.6.39.3-microwaycustom #1 Supermicro X8DTH-i/6/iF/6F/X8DTH
[381920.783292] RIP: 0010:[<ffffffff812a7510>]  [<ffffffff812a7510>] acpi_atomic_read+0xe3/0x120
[381920.791853] RSP: 0000:ffff88063fc47d98  EFLAGS: 00010046
[381920.797260] RAX: 0000000000000000 RBX: 00000000bf7b5ff0 RCX: ffffffff81a3cdd0
[381920.804486] RDX: 00000000bf7b6010 RSI: 00000000bf7b6000 RDI: ffff88062d4b95c0
[381920.811712] RBP: ffff88063fc47dc8 R08: ffff88063fc47d98 R09: 0000000000000002
[381920.818940] R10: 0000000000000083 R11: 0000000000000010 R12: ffffc90003044c20
[381920.826168] R13: ffff88063fc47de0 R14: 0000000000000000 R15: 0000000000000000
[381920.833392] FS:  0000000000000000(0000) GS:ffff88063fc40000(0000) knlGS:0000000000000000
[381920.841571] CS:  0010 DS: 002b ES: 002b CR0: 0000000080050033
[381920.847409] CR2: 0000000000000000 CR3: 00000001d2d0f000 CR4: 00000000000006e0
[381920.854635] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[381920.861861] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[381920.869088] Process cluster (pid: 12701, threadinfo ffff880028284000, task ffff880550184380)
[381920.877608] Stack:
[381920.879717]  ffffffff81a3cdd0 00000000bf7b5ff0 ffff88062b2cc140 ffffc90003044ca8
[381920.887288]  ffff88062b2cc140 0000000000000001 ffff88063fc47e08 ffffffffa002b21f
[381920.894862]  0000000000000000 0000000000000000 ffff88062b2cc140 0000000000000000
[381920.902435] Call Trace:
[381920.904969]  <NMI> 
[381920.907201]  [<ffffffffa002b21f>] ghes_read_estatus+0x2f/0x170 [ghes]
[381920.913726]  [<ffffffffa002b618>] ghes_notify_nmi+0xd8/0x1b0 [ghes]
[381920.920076]  [<ffffffff8151a14f>] notifier_call_chain+0x3f/0x80
[381920.926080]  [<ffffffff8151a1d8>] __atomic_notifier_call_chain+0x48/0x70
[381920.932864]  [<ffffffff8151a211>] atomic_notifier_call_chain+0x11/0x20
[381920.939473]  [<ffffffff8151a24e>] notify_die+0x2e/0x30
[381920.944691]  [<ffffffff81517bd2>] do_nmi+0xa2/0x270
[381920.949649]  [<ffffffff81517550>] nmi+0x20/0x30
[381920.954269]  [<ffffffff8102ae8a>] ? native_write_msr_safe+0xa/0x10
[381920.960536]  <<EOE>> 
[381920.962725]  <IRQ> 
[381920.965132]  [<ffffffff8101131e>] intel_pmu_disable_all+0x3e/0x120
[381920.971399]  [<ffffffff81010d5a>] x86_pmu_disable+0x4a/0x50
[381920.977066]  [<ffffffff810e7f2b>] perf_pmu_disable+0x2b/0x40
[381920.982817]  [<ffffffff810ee828>] perf_event_task_tick+0x218/0x270
[381920.989082]  [<ffffffff81046f4d>] scheduler_tick+0xdd/0x2c0
[381920.994740]  [<ffffffff81067097>] update_process_times+0x67/0x80
[381921.000832]  [<ffffffff81089eef>] tick_sched_timer+0x5f/0xc0
[381921.006578]  [<ffffffff81089e90>] ? tick_nohz_handler+0x100/0x100
[381921.012758]  [<ffffffff8107db3d>] __run_hrtimer+0x12d/0x280
[381921.018414]  [<ffffffff8107dea7>] hrtimer_interrupt+0xb7/0x1e0
[381921.024343]  [<ffffffff810206d7>] smp_apic_timer_interrupt+0x67/0xa0
[381921.030793]  [<ffffffff8151e3f3>] apic_timer_interrupt+0x13/0x20
[381921.036891]  <EOI> 
[381921.038913] Code: fc 10 74 1f 77 08 41 80 fc 08 75 48 eb 0e 41 80 fc 20 74 17 41 80 fc 40 75 3a eb 15 8a 00 0f b6 c0 eb 11 66 8b 00 0f b7 c0 eb 09 <8b> 00 89 c0 eb 03 48 8b 00 49 89 45 00 e8 8e 2b e2 ff eb 1b 0f 
[381921.059462] RIP  [<ffffffff812a7510>] acpi_atomic_read+0xe3/0x120
[381921.065669]  RSP <ffff88063fc47d98>
[381921.069242] CR2: 0000000000000000
[381921.072645] ---[ end trace 52697bfc73a34a90 ]---
[381921.077343] Kernel panic - not syncing: Fatal exception in interrupt
[381921.083784] Pid: 12701, comm: cluster Tainted: G      D     2.6.39.3-microwaycustom #1
[381921.091788] Call Trace:
[381921.094333]  <NMI>  [<ffffffff81513300>] panic+0x9f/0x1da
[381921.099881]  [<ffffffff81517f8c>] oops_end+0xdc/0xf0
[381921.104952]  [<ffffffff81032b91>] no_context+0xf1/0x260
[381921.110277]  [<ffffffff81032e55>] __bad_area_nosemaphore+0x155/0x200
[381921.116739]  [<ffffffff81032f0e>] bad_area_nosemaphore+0xe/0x10
[381921.122766]  [<ffffffff81519f46>] do_page_fault+0x366/0x530
[381921.128440]  [<ffffffff810ee929>] ? __perf_event_overflow+0xa9/0x220
[381921.134896]  [<ffffffff810ef99b>] ? perf_event_update_userpage+0x9b/0xe0
[381921.141698]  [<ffffffff81012249>] ? intel_pmu_enable_all+0xc9/0x1a0
[381921.148061]  [<ffffffff810123ff>] ? x86_perf_event_set_period+0xdf/0x170
[381921.154852]  [<ffffffff81517295>] page_fault+0x25/0x30
[381921.160081]  [<ffffffff812a7510>] ? acpi_atomic_read+0xe3/0x120
[381921.166094]  [<ffffffff812a7486>] ? acpi_atomic_read+0x59/0x120
[381921.172107]  [<ffffffffa002b21f>] ghes_read_estatus+0x2f/0x170 [ghes]
[381921.178639]  [<ffffffffa002b618>] ghes_notify_nmi+0xd8/0x1b0 [ghes]
[381921.185004]  [<ffffffff8151a14f>] notifier_call_chain+0x3f/0x80
[381921.191003]  [<ffffffff8151a1d8>] __atomic_notifier_call_chain+0x48/0x70
[381921.197789]  [<ffffffff8151a211>] atomic_notifier_call_chain+0x11/0x20
[381921.204399]  [<ffffffff8151a24e>] notify_die+0x2e/0x30
[381921.209624]  [<ffffffff81517bd2>] do_nmi+0xa2/0x270
[381921.214584]  [<ffffffff81517550>] nmi+0x20/0x30
[381921.219208]  [<ffffffff8102ae8a>] ? native_write_msr_safe+0xa/0x10
[381921.225470]  <<EOE>>  <IRQ>  [<ffffffff8101131e>] intel_pmu_disable_all+0x3e/0x120
[381921.233174]  [<ffffffff81010d5a>] x86_pmu_disable+0x4a/0x50
[381921.238836]  [<ffffffff810e7f2b>] perf_pmu_disable+0x2b/0x40
[381921.244582]  [<ffffffff810ee828>] perf_event_task_tick+0x218/0x270
[381921.250855]  [<ffffffff81046f4d>] scheduler_tick+0xdd/0x2c0
[381921.256520]  [<ffffffff81067097>] update_process_times+0x67/0x80
[381921.262618]  [<ffffffff81089eef>] tick_sched_timer+0x5f/0xc0
[381921.268363]  [<ffffffff81089e90>] ? tick_nohz_handler+0x100/0x100
[381921.274540]  [<ffffffff8107db3d>] __run_hrtimer+0x12d/0x280
[381921.280198]  [<ffffffff8107dea7>] hrtimer_interrupt+0xb7/0x1e0
[381921.286125]  [<ffffffff810206d7>] smp_apic_timer_interrupt+0x67/0xa0
[381921.292578]  [<ffffffff8151e3f3>] apic_timer_interrupt+0x13/0x20
[381921.298673]  <EOI>

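For what it's worth, the "Code:" bytes look like the bit-width switch in 
acpi_atomic_read (compares against 0x8/0x10/0x20/0x40 followed by 
byte/word/long/quad loads), and the faulting instruction <8b> 00 is 
"mov (%rax),%eax" with RAX == 0, i.e. a 32-bit read through a NULL pointer.  
Below is my from-memory paraphrase of that path in drivers/acpi/atomicio.c -- 
a sketch, not the exact source, and I may well be misreading it:

	/*
	 * Sketch of acpi_atomic_read()'s memory path in 2.6.39
	 * (drivers/acpi/atomicio.c).  The lookup of the pre-mapped
	 * virtual address for the error status region can fail, and
	 * the result appears to be dereferenced without a NULL check:
	 */
	addr = __acpi_ioremap_fast(paddr, width);	/* NULL if paddr was
							   never pre-mapped */
	switch (width) {
	case 8:
		*val = readb(addr);
		break;
	case 16:
		*val = readw(addr);
		break;
	case 32:
		*val = readl(addr);	/* the faulting "mov (%rax),%eax" */
		break;
	case 64:
		*val = readq(addr);
		break;
	}

Also, RBX=00000000bf7b5ff0 together with RSI=00000000bf7b6000 makes it look 
like the read crosses a page boundary at 0xbf7b6000, in case that matters for 
how the GHES region was pre-mapped.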

Since I saw "nohz" early in the call trace, I recompiled the kernel with the 
tickless feature disabled (see the config delta below).  The crash reproduced 
anyway, with a nearly identical call trace that shows tick_init_highres 
instead of tick_nohz_handler.  I can provide that trace as well if desired.

I am currently trying a custom 2.6.36.4 kernel on the system, but it will 
take up to 3 days before I know whether the problem exists there as well.

Any ideas on this?

Thanks,
Rick
-- 
Richard Warner
Lead Systems Integrator
Microway, Inc
(508)732-5517
