* 3.0.0 Xen pv guest - BUG: Unable to handle kernel paging request in swap_count_continued
@ 2011-08-26 17:42 Peter Sandin
  2011-08-29 14:39 ` kernel BUG at mm/swapfile.c:2527! [was 3.0.0 Xen pv guest - BUG: Unable to handle] Christopher S. Aker
From: Peter Sandin @ 2011-08-26 17:42 UTC (permalink / raw)
  To: LKML; +Cc: xen-devel

We have a number of virtualized Linux instances running under Xen that have been hitting a bug. This issue first cropped up in the 2.6.38 release and we're still seeing cases with the 3.0.0 kernel. On average we're receiving reports of about one instance per day crashing due to this issue. The affected 2.6.39 and 3.0.0 kernels are vanilla kernel.org kernels. The .config file and binary for the affected 3.0.0 kernel can be found at:

http://thesandins.net/xen/3.0.0/

This issue has happened on multiple separate physical machines and different distributions, so it's not a hardware- or distribution-specific issue. The Apache httpd server seems to be the most likely process to trigger this issue. Someone else opened a bug with Apache about this issue, but that bug was closed as not being an Apache issue. That report can be found at:

https://issues.apache.org/bugzilla/show_bug.cgi?id=51325

We inquired about this issue on the Xen-devel list when we first ran into it. That thread can be found at:

http://lists.xensource.com/archives/html/xen-devel/2011-04/msg00230.html

If anyone has any ideas on why this is happening and what we need to do to prevent it, please let us know. The issue has only manifested in customer instances, so we don't have access to other logs from these incidents. However, if anyone has suggestions on tests or methods for replicating this issue, I'd be glad to give those a try on a test instance. The console output from the error is included below:

BUG: unable to handle kernel paging request at f57a63be
IP: [<c01ab854>] swap_count_continued+0x104/0x180
*pdpt = 0000000029d01027 *pde = 00000000008d4067 *pte = 0000000000000000 
Oops: 0000 [#1] SMP 
Modules linked in:

Pid: 2206, comm: apache2 Not tainted 3.0.0-linode35 #1  
EIP: 0061:[<c01ab854>] EFLAGS: 00010246 CPU: 1
EIP is at swap_count_continued+0x104/0x180
EAX: f57a63be EBX: eb9fc4e0 ECX: f57a6000 EDX: 000000be
ESI: ed3d7cc0 EDI: 000000be EBP: 000003be ESP: ea3bddb0
 DS: 007b ES: 007b FS: 00d8 GS: 0033 SS: 0069
Process apache2 (pid: 2206, ti=ea3bc000 task=eaca6410 task.ti=ea3bc000)
Stack:
 ea76dcc0 000013be 000000be ffffffea c01abe22 35a34067 c01040fb 0002a5cb
 40f40067 000013be ea5cb2e0 000277c0 bfc5c000 c01abee4 00000000 c01a068b
 bfc40000 80000007 00000000 00000000 000013be 0000001c e7f402e0 00100173
Call Trace:
 [<c01abe22>] ? __swap_duplicate+0xc2/0x160
 [<c01040fb>] ? pte_mfn_to_pfn+0x8b/0xe0
 [<c01abee4>] ? swap_duplicate+0x14/0x40
 [<c01a068b>] ? copy_pte_range+0x45b/0x500
 [<c01a08c5>] ? copy_page_range+0x195/0x200
 [<c0132756>] ? dup_mmap+0x1c6/0x2c0
 [<c0132b88>] ? dup_mm+0xa8/0x130
 [<c01335fa>] ? copy_process+0x98a/0xb30
 [<c01337ef>] ? do_fork+0x4f/0x280
 [<c010f780>] ? sys_clone+0x30/0x40
 [<c06c000d>] ? ptregs_clone+0x15/0x48
 [<c06bf6f1>] ? syscall_call+0x7/0xb
 [<c06b0000>] ? sctp_backlog_rcv+0xf0/0x100
Code: de 75 dc b8 01 00 00 00 5b 5e 5f 5d c3 66 90 e8 d3 7c f7 ff 8b 5b 18 83 eb 18 39 de 0f 84 7f 00 00 00 89 d8 e8 fe 7e f7 ff 01 e8 <0f> b6 10 80 fa ff 74 dc 80 fa 7f 74 28 83 c2 01 88 10 eb 0c 89
EIP: [<c01ab854>] swap_count_continued+0x104/0x180 SS:ESP 0069:ea3bddb0
CR2: 00000000f57a63be
---[ end trace aa46a9340a0a4bc6 ]---
note: apache2[2206] exited with preempt_count 1
BUG: scheduling while atomic: apache2/2206/0x00000001
Modules linked in:
Pid: 2206, comm: apache2 Tainted: G      D     3.0.0-linode35 #1
Call Trace:
 [<c06bda6a>] ? schedule+0x60a/0x6f0
 [<c0106404>] ? check_events+0x8/0xc
 [<c01063fb>] ? xen_restore_fl_direct_reloc+0x4/0x4
 [<c01775fe>] ? rcu_enter_nohz+0x2e/0xb0
 [<c0139921>] ? irq_exit+0x31/0xa0
 [<c0477bed>] ? xen_evtchn_do_upcall+0x1d/0x30
 [<c0101227>] ? hypercall_page+0x227/0x1000
 [<c0105c27>] ? xen_force_evtchn_callback+0x17/0x30
 [<c0106404>] ? check_events+0x8/0xc
 [<c06bf28d>] ? rwsem_down_failed_common+0x9d/0x110
 [<c06bf353>] ? call_rwsem_down_read_failed+0x7/0xc
 [<c06bea6a>] ? down_read+0xa/0x10
 [<c01683f5>] ? acct_collect+0x35/0x160
 [<c0137fbd>] ? do_exit+0x27d/0x350
 [<c011f170>] ? mm_fault_error+0x130/0x130
 [<c010b7e1>] ? oops_end+0x71/0xa0
 [<c011ef8f>] ? bad_area_nosemaphore+0xf/0x20
 [<c011f3bf>] ? do_page_fault+0x24f/0x3a0
 [<c0105c27>] ? xen_force_evtchn_callback+0x17/0x30
 [<c0106404>] ? check_events+0x8/0xc
 [<c01063fb>] ? xen_restore_fl_direct_reloc+0x4/0x4
 [<c011f170>] ? mm_fault_error+0x130/0x130
 [<c06bfc66>] ? error_code+0x5a/0x60
 [<c012007b>] ? try_preserve_large_page+0x7b/0x340
 [<c011f170>] ? mm_fault_error+0x130/0x130
 [<c01ab854>] ? swap_count_continued+0x104/0x180
 [<c01abe22>] ? __swap_duplicate+0xc2/0x160
 [<c01040fb>] ? pte_mfn_to_pfn+0x8b/0xe0
 [<c01abee4>] ? swap_duplicate+0x14/0x40
 [<c01a068b>] ? copy_pte_range+0x45b/0x500
 [<c01a08c5>] ? copy_page_range+0x195/0x200
 [<c0132756>] ? dup_mmap+0x1c6/0x2c0
 [<c0132b88>] ? dup_mm+0xa8/0x130
 [<c01335fa>] ? copy_process+0x98a/0xb30
 [<c01337ef>] ? do_fork+0x4f/0x280
 [<c010f780>] ? sys_clone+0x30/0x40
 [<c06c000d>] ? ptregs_clone+0x15/0x48
 [<c06bf6f1>] ? syscall_call+0x7/0xb
 [<c06b0000>] ? sctp_backlog_rcv+0xf0/0x100
INFO: rcu_sched_state detected stall on CPU 2 (t=60000 jiffies)

Regards,
Peter
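
For anyone reading the register dump above, here is a minimal sketch of one way to cross-check the faulting address against the registers. It only does arithmetic on values copied from the oops: the address in CR2 is split into a 4 KiB page base and an offset, and those are compared with ECX and EBP. Which kernel variable each register actually held is an assumption that the dump alone does not settle, and the helper names (`parse_registers`, `decode_fault`) are made up for this illustration.

```python
import re

PAGE_SIZE = 4096  # 4 KiB pages on 32-bit x86

# Lines copied from the oops above; only the registers of interest are kept.
OOPS = """\
BUG: unable to handle kernel paging request at f57a63be
EAX: f57a63be EBX: eb9fc4e0 ECX: f57a6000 EDX: 000000be
ESI: ed3d7cc0 EDI: 000000be EBP: 000003be ESP: ea3bddb0
CR2: 00000000f57a63be
"""

REGISTER_RE = re.compile(r"\b(EAX|EBX|ECX|EDX|ESI|EDI|EBP|ESP|CR2): ([0-9a-f]+)")


def parse_registers(text):
    """Return {register name: integer value} for the registers found in text."""
    return {name: int(value, 16) for name, value in REGISTER_RE.findall(text)}


def decode_fault(text):
    """Split the faulting address (CR2) into its page base and in-page offset."""
    regs = parse_registers(text)
    fault = regs["CR2"]
    page_base = fault & ~(PAGE_SIZE - 1)
    offset = fault & (PAGE_SIZE - 1)
    return regs, page_base, offset


if __name__ == "__main__":
    regs, page_base, offset = decode_fault(OOPS)
    print(f"fault address: {regs['CR2']:#010x}")
    print(f"page base:     {page_base:#010x} (ECX = {regs['ECX']:#010x})")
    print(f"page offset:   {offset:#05x} (EBP = {regs['EBP']:#010x})")
    print("EAX equals CR2:", regs["EAX"] == regs["CR2"])
```

With the values above, the page base matches ECX and the offset matches EBP, so the access appears to be a byte read at offset 0x3be into a page whose address was held in ECX. The dump also prints `*pte` as zero for that address.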

* Re: [Xen-devel] Re: kernel BUG at mm/swapfile.c:2527! [was 3.0.0 Xen pv guest - BUG: Unable to handle]
@ 2011-09-17 18:12 Kent Hoxsey
From: Kent Hoxsey @ 2011-09-17 18:12 UTC (permalink / raw)
  To: linux-kernel

I am arriving at this discussion via a pointer on the Amazon AWS forums. There appear to be a number of threads there with people experiencing some version of this issue (high IOwait%, httpd CPU load spike, etc.).

I currently have an AWS instance that appears to experience this problem every day at the same time (1:50pm Pacific), but it has enough CPU horsepower to handle the surge and recover. Since it is part of my production infrastructure, I cannot allow other people to log in, but I can certainly run diagnostics to help identify the issue. If anyone would like to suggest which diagnostics would be helpful, I will try to collect them.

Kent
+++
As an example, the following is a snippet from mpstat during the load spike. Watching top at the same time, all of the httpd processes jump from 0.2% CPU to 15% or more, and then recover together once the spike passes:

08:49:31 PM  CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest   %idle
...
08:49:46 PM  all    2.53    0.00    0.35    0.15    0.00    0.10    0.00    0.00   96.86
08:49:51 PM  all    3.63    0.00    0.52    0.16    0.00    0.10    0.05    0.00   95.53
08:49:56 PM  all    6.77    0.00    3.44    0.22    0.00    0.16    0.00    0.00   89.40
08:50:01 PM  all   53.19    0.00   41.04    0.00    0.00    0.30    5.48    0.00    0.00
08:50:06 PM  all   57.62    0.00   35.25    0.10    0.00    1.19    4.75    0.00    1.09
08:50:11 PM  all   34.85    0.00   19.09   43.20    0.00    0.77    0.46    0.00    1.62
08:50:16 PM  all   50.85    0.00   19.54   26.88    0.00    0.51    0.09    0.00    2.13
08:50:21 PM  all   31.87    0.00   16.87   49.18    0.00    0.45    0.07    0.00    1.57
08:50:26 PM  all   31.83    0.00   15.73   49.63    0.00    0.52    0.00    0.00    2.29
08:50:31 PM  all   30.50    0.00   16.91   51.03    0.00    0.30    0.07    0.00    1.18
08:50:36 PM  all   30.83    0.00   18.24   49.66    0.00    0.22    0.00    0.00    1.04
08:50:41 PM  all   33.58    0.00   15.86   48.47    0.00    0.22    0.07    0.00    1.79
08:50:46 PM  all   51.06    0.00   18.04   24.30    0.00    0.76    2.03    0.00    3.81
08:50:51 PM  all   69.61    0.00   23.73    0.39    0.00    0.39    4.22    0.00    1.67
08:50:56 PM  all   72.11    0.00   21.41    0.00    0.00    0.50    5.98    0.00    0.00
08:51:01 PM  all   71.84    0.00   21.44    0.10    0.00    0.59    5.43    0.00    0.59
08:51:06 PM  all   66.24    0.00   23.71    2.93    0.00    1.17    5.37    0.00    0.59
08:51:11 PM  all   67.97    0.00   22.66    1.95    0.00    0.68    5.27    0.00    1.46
08:51:16 PM  all   68.07    0.00   23.34    2.54    0.00    0.59    4.69    0.00    0.78
08:51:21 PM  all   55.47    0.00    8.56    2.80    0.00    0.25    1.98    0.00   30.95
08:51:26 PM  all   38.44    0.00    2.07    1.24    0.00    0.21    0.07    0.00   57.97
08:51:31 PM  all   12.65    0.00    1.85    0.87    0.00    0.40    0.00    0.00   84.23
08:51:36 PM  all    7.60    0.00    1.14   12.64    0.00    0.22    0.05    0.00   78.35
08:51:41 PM  all   10.29    0.00    0.99    0.12    0.00    0.29    0.00    0.00   88.31
08:51:46 PM  all    6.94    0.00    0.65    0.11    0.00    0.05    0.00    0.00   92.25
08:51:51 PM  all    7.05    0.00    0.86    0.43    0.00    0.11    0.05    0.00   91.49
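
To pull the spike out of a longer capture, something along these lines might help. It is a rough sketch that assumes the 12-hour timestamp layout and column order shown in the sample above; the function names and the thresholds (5% idle, 20% iowait) are arbitrary choices for illustration, not anything mpstat defines.

```python
# Sketch: flag mpstat samples where the machine is nearly saturated.
# Assumes rows of the form "HH:MM:SS AM/PM all <9 percentages>".

COLUMNS = ["usr", "nice", "sys", "iowait", "irq", "soft", "steal", "guest", "idle"]


def parse_mpstat(text):
    """Yield (timestamp, {column: float}) for each 'all' row in mpstat output."""
    for line in text.splitlines():
        fields = line.split()
        # Skip the header, "..." filler, and anything that is not an 'all' row.
        if len(fields) != 3 + len(COLUMNS) or fields[2] != "all":
            continue
        values = dict(zip(COLUMNS, map(float, fields[3:])))
        yield f"{fields[0]} {fields[1]}", values


def busy_samples(text, max_idle=5.0):
    """Return the samples whose %idle is below max_idle."""
    return [(ts, v) for ts, v in parse_mpstat(text) if v["idle"] < max_idle]


# A few rows copied from the capture above.
SAMPLE = """\
08:49:31 PM  CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest   %idle
08:49:56 PM  all    6.77    0.00    3.44    0.22    0.00    0.16    0.00    0.00   89.40
08:50:01 PM  all   53.19    0.00   41.04    0.00    0.00    0.30    5.48    0.00    0.00
08:50:11 PM  all   34.85    0.00   19.09   43.20    0.00    0.77    0.46    0.00    1.62
08:51:21 PM  all   55.47    0.00    8.56    2.80    0.00    0.25    1.98    0.00   30.95
"""

if __name__ == "__main__":
    for ts, v in busy_samples(SAMPLE):
        # Arbitrary split: call a sample iowait-bound above 20% iowait.
        kind = "iowait-bound" if v["iowait"] > 20.0 else "cpu-bound"
        print(ts, kind, f"usr={v['usr']} sys={v['sys']} iowait={v['iowait']}")
```

Separating the CPU-bound samples from the iowait-bound ones makes the phases of the spike easier to see: the capture above starts with high %usr and %sys, moves into a stretch of heavy %iowait, and then returns to high %usr before recovering.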


