All of lore.kernel.org
 help / color / mirror / Atom feed
* Re:Re: Xen-unstable panic: FATAL PAGE FAULT
       [not found] <SNT0-MC2-F12iKC1rdi000797d9@snt0-mc2-f12.Snt0.hotmail.com>
@ 2010-08-26  4:49 ` MaoXiaoyun
  2010-08-26  7:39   ` Keir Fraser
  0 siblings, 1 reply; 39+ messages in thread
From: MaoXiaoyun @ 2010-08-26  4:49 UTC (permalink / raw)
  To: xen-devel


[-- Attachment #1.1: Type: text/plain, Size: 13512 bytes --]


Hi:
 
  This issue can easily be reproduced by continuously and almost concurrently rebooting 12 Xen HVM VMs on a single physical server. The reproduction hits the back trace about 6 to 14 hours after it starts. I have several similar Xen back traces; please refer to the end of the mail. The first three are almost the same, occurring in domain_kill, while the last one occurred in do_multicall. 
 
       Going through the Xen code in /xen-4.0.0/xen/arch/x86/mm.c, it is clear the author was aware of the race between domain_relinquish_resources and the code presented below. It occurred to me to simply move lines 2765 and 2766 before 2764, that is, to move put_page_and_type(page) inside the spin_lock to avoid the race. 
 
2753             /* A page is dirtied when its pin status is set. */
2754             paging_mark_dirty(pg_owner, mfn);
2755 
2756             /* We can race domain destruction (domain_relinquish_resources). */
2757             if ( unlikely(pg_owner != d) )
2758             {
2759                 int drop_ref;
2760                 spin_lock(&pg_owner->page_alloc_lock);
2761                 drop_ref = (pg_owner->is_dying &&
2762                             test_and_clear_bit(_PGT_pinned,
2763                                                &page->u.inuse.type_info));
2764                 spin_unlock(&pg_owner->page_alloc_lock);
2765                 if ( drop_ref )
2766                     put_page_and_type(page);
2767             }
2768 
2769             break;
2770         }
 
       From the results of reproducing on the patched code, the patch appears to work well: the test survived a 48-hour run. But I am not sure of the side effects it brings. 
       I would appreciate it if someone could offer more clues. Thanks in advance. 
 
=============Trace 1: =============
 
(XEN) ----[ Xen-4.0.0  x86_64  debug=y  Not tainted ]----
(XEN) CPU:    0
(XEN) RIP:    e008:[<ffff82c48011617c>] free_heap_pages+0x55a/0x575
(XEN) RFLAGS: 0000000000010286   CONTEXT: hypervisor
(XEN) rax: 0000001fffffffe0   rbx: ffff82f60b8bbfc0   rcx: ffff83063fe01a20
(XEN) rdx: ffff8315ffffffe0   rsi: ffff8315ffffffe0   rdi: 00000000ffffffff
(XEN) rbp: ffff82c48037fc98   rsp: ffff82c48037fc58   r8:  0000000000000000
(XEN) r9:  ffffffffffffffff   r10: ffff82c48020e770   r11: 0000000000000282
(XEN) r12: 00007d0a00000000   r13: 0000000000000000   r14: ffff82f60b8bbfe0
(XEN) r15: 0000000000000001   cr0: 000000008005003b   cr4: 00000000000026f0
(XEN) cr3: 0000000232914000   cr2: ffff8315ffffffe4
(XEN) ds: 0000   es: 0000   fs: 0063   gs: 0000   ss: e010   cs: e008
(XEN) Xen stack trace from rsp=ffff82c48037fc58:
(XEN)    0000000000000016 0000000000000000 00000000000001a2 ffff8304afc40000
(XEN)    0000000000000000 ffff82f60b8bbfe0 00000000000330fe ffff82f60b8bc000
(XEN)    ffff82c48037fcd8 ffff82c48011647e 0000000100000000 ffff82f60b8bbfe0
(XEN)    ffff8304afc40020 0000000000000000 ffff8304afc40000 0000000000000000
(XEN)    ffff82c48037fcf8 ffff82c480160caf ffff8304afc40000 ffff82f60b8bbfe0
(XEN)    ffff82c48037fd68 ffff82c48014deaf 0000000000000ca3 ffff8304afc40fd8
(XEN)    ffff8304afc40fd8 ffff8304afc40fd8 4000000000000000 ffff82c48037ff28
(XEN)    0000000000000000 ffff8304afc40000 ffff8304afc40000 000000000099e000
(XEN)    00000000ffffffda 0000000000000001 ffff82c48037fd98 ffff82c4801504de
(XEN)    ffff8304afc40000 0000000000000000 000000000099e000 00000000ffffffda
(XEN)    ffff82c48037fdb8 ffff82c4801062ee 000000000099e000 fffffffffffffff3
(XEN)    ffff82c48037ff08 ffff82c480104cd7 ffff82c40000f800 0000000000000286
(XEN)    0000000000000286 ffff8300bf76c000 000000ea864b1814 ffff8300bf76c030
(XEN)    ffff83023ff1ded8 ffff83023ff1ded0 ffff82c48037fe38 ffff82c48011c9f5
(XEN)    ffff82c48037ff08 ffff82c480272100 ffff8300bf76c000 ffff82c48037fe48
(XEN)    ffff82c48011f557 ffff82c480272100 0000000600000002 000000004700000a
(XEN)    000000004700bf2c 0000000000000000 000000004700c158 0000000000000000
(XEN)    00002b3b59e7d050 0000000000000000 0000007f00b14140 00002b3b5f257a80
(XEN)    0000000000996380 00002aaaaaad0830 00002b3b5f257a80 00000000009bb690
(XEN)    00002aaaaaad0830 000000398905abf3 000000000078de60 00002b3b5f257aa4
(XEN) Xen call trace:
(XEN)    [<ffff82c48011617c>] free_heap_pages+0x55a/0x575
(XEN)    [<ffff82c48011647e>] free_domheap_pages+0x2e7/0x3ab
(XEN)    [<ffff82c480160caf>] put_page+0x69/0x70
(XEN)    [<ffff82c48014deaf>] relinquish_memory+0x36e/0x499
(XEN)    [<ffff82c4801504de>] domain_relinquish_resources+0x1ac/0x24c
(XEN)    [<ffff82c4801062ee>] domain_kill+0x93/0xe4
(XEN)    [<ffff82c480104cd7>] do_domctl+0xa1c/0x1205
(XEN)    [<ffff82c4801f71bf>] syscall_enter+0xef/0x149
(XEN)    
(XEN) Pagetable walk from ffff8315ffffffe4:
(XEN)  L4[0x106] = 00000000bf589027 5555555555555555
(XEN)  L3[0x057] = 0000000000000000 ffffffffffffffff
(XEN) 
(XEN) ****************************************
(XEN) Panic on CPU 0:
(XEN) FATAL PAGE FAULT
(XEN) [error_code=0002]
(XEN) Faulting linear address: ffff8315ffffffe4
(XEN) ****************************************
(XEN) 
(XEN) Manual reset required ('noreboot' specified)
 
=============Trace 2: =============
 
 (XEN) Xen call trace:
(XEN)    [<ffff82c4801153c3>] free_heap_pages+0x283/0x4a0
(XEN)    [<ffff82c480115732>] free_domheap_pages+0x152/0x380
(XEN)    [<ffff82c48014aa89>] relinquish_memory+0x169/0x500
(XEN)    [<ffff82c48014b2cd>] domain_relinquish_resources+0x1ad/0x280
(XEN)    [<ffff82c480105fe0>] domain_kill+0x80/0xf0
(XEN)    [<ffff82c4801043ce>] do_domctl+0x1be/0x1000
(XEN)    [<ffff82c48010739b>] evtchn_set_pending+0xab/0x1b0
(XEN)    [<ffff82c4801e3169>] syscall_enter+0xa9/0xae
(XEN)    
(XEN) Pagetable walk from ffff8315ffffffe4:
(XEN)  L4[0x106] = 00000000bf569027 5555555555555555
(XEN)  L3[0x057] = 0000000000000000 ffffffffffffffff
(XEN) stdvga.c:147:d60 entering stdvga and caching modes
(XEN) 
(XEN) ****************************************
(XEN) HVM60: VGABios $Id: vgabios.c,v 1.67 2008/01/27 09:44:12 vruppert Exp $
(XEN) Panic on CPU 0:
(XEN) FATAL PAGE FAULT
(XEN) [error_code=0002]
(XEN) Faulting linear address: ffff8315ffffffe4
(XEN) ****************************************
(XEN) 
(XEN) Manual reset required ('noreboot' specified)
 
=============Trace 3: =============
 
 
(XEN) Xen call trace:
(XEN)    [<ffff82c4801153c3>] free_heap_pages+0x283/0x4a0
(XEN)    [<ffff82c480115732>] free_domheap_pages+0x152/0x380
(XEN)    [<ffff82c48014aa89>] relinquish_memory+0x169/0x500
(XEN)    [<ffff82c48014b2cd>] domain_relinquish_resources+0x1ad/0x280
(XEN)    [<ffff82c480105fe0>] domain_kill+0x80/0xf0
(XEN)    [<ffff82c4801043ce>] do_domctl+0x1be/0x1000
(XEN)    [<ffff82c480117804>] csched_acct+0x384/0x430
(XEN)    [<ffff82c4801e3169>] syscall_enter+0xa9/0xae
(XEN)    
(XEN) Pagetable walk from ffff8315ffffffe4:
(XEN)  L4[0x106] = 00000000bf569027 5555555555555555
(XEN)  L3[0x057] = 0000000000000000 ffffffffffffffff
(XEN) 
(XEN) ****************************************
(XEN) Panic on CPU 0:
(XEN) FATAL PAGE FAULT
(XEN) [error_code=0002]
(XEN) Faulting linear address: ffff8315ffffffe4
(XEN) ****************************************
(XEN) 
(XEN) Manual reset required ('noreboot' specified)
 
=============Trace 4: =============
 
 
(XEN) Xen call trace:
(XEN)    [<ffff82c4801153c3>] free_heap_pages+0x283/0x4a0
(XEN)    [<ffff82c480115732>] free_domheap_pages+0x152/0x380
(XEN)    [<ffff82c48015b0e5>] free_page_type+0x4c5/0x670
(XEN)    [<ffff82c48015a218>] get_page+0x28/0xf0
(XEN)    [<ffff82c48015b439>] __put_page_type+0x1a9/0x290
(XEN)    [<ffff82c48016211f>] do_mmuext_op+0xf3f/0x1320
(XEN)    [<ffff82c480113d7e>] do_multicall+0x14e/0x340
(XEN)    [<ffff82c4801e3169>] syscall_enter+0xa9/0xae
(XEN)    
(XEN) Pagetable walk from ffff8315ffffffe4:
(XEN)  L4[0x106] = 00000000bf569027 5555555555555555
(XEN)  L3[0x057] = 0000000000000000 ffffffffffffffff
(XEN) 
(XEN) ****************************************
(XEN) Panic on CPU 1:
(XEN) FATAL PAGE FAULT
(XEN) [error_code=0002]
(XEN) Faulting linear address: ffff8315ffffffe4
(XEN) ****************************************
(XEN) 
(XEN) Manual reset required ('noreboot' specified)
 
 

 

 

 

-----------------------------------------------------

 

On Sun, 07 Feb 2010 11:56:26 +0000, Keir Fraser wrote:

 

>I'll have to decode the backtrace a bit, but I would guess most likely is
>some memory got corrupted along the way, which would be rather nasty. I
>already need to follow up on a report of apparent memory corruption in a
>domU userspace (testing with the 'memtester' utility), so with a bit of luck
>they could be manifestations of the same bug.

>-- Keir

 

On 06/02/2010 22:56, "Mark Hurenkamp" <mark.hurenkamp@xxxxxxxxx> wrote:

>> Hi,
>> 
>> 
>> While playing with my xen server (which is running xen-unstable/linux pvops),
>> it suddenly crashed with the following messages on the serial port.
>> This is a recent version of xen-unstable, but i am a few updates behind.
>> I've seen this only once, so perhaps it is hard to reproduce. I hope this
>> info is still of use to someone.
>> 
>> 
>> Regards,
>> Mark.
>> 
>> 
>> (XEN) tmem: all pools frozen for all domains
>> (XEN) tmem: all pools frozen for all domains
>> (XEN) tmem: all pools thawed for all domains
>> (XEN) tmem: all pools thawed for all domains
>> (XEN) paging.c:170: paging_free_log_dirty_bitmap: used 19 pages for domain 3
>> dirty logging
>> (XEN) ----[ Xen-4.0.0-rc3-pre  x86_64  debug=y  Tainted:    C ]----
>> (XEN) CPU:    2
>> (XEN) RIP:    e008:[<ffff82c4801150c5>] free_heap_pages+0x53a/0x555
>> (XEN) RFLAGS: 0000000000010282   CONTEXT: hypervisor
>> (XEN) rax: ffff82c4803004c0   rbx: ffff82f600ae4b40   rcx: ffff8315ffffffe0
>> (XEN) rdx: 00000000ffffffff   rsi: ffff8315ffffffe0   rdi: ffff82f600000000
>> (XEN) rbp: ffff83013ff27bc8   rsp: ffff83013ff27b68   r8:  0000000000000000
>> (XEN) r9:  0200000000000000   r10: 0000000000000001   r11: 0080000000000000
>> (XEN) r12: ffff82f600ae4b60   r13: 0000000000000000   r14: 00007d0a00000000
>> (XEN) r15: 0000000000000000   cr0: 000000008005003b   cr4: 00000000000026f0
>> (XEN) cr3: 0000000101001000   cr2: ffff8315ffffffe4
>> (XEN) ds: 0000   es: 0000   fs: 0000   gs: 0000   ss: e010   cs: e008
>> (XEN) Xen stack trace from rsp=ffff83013ff27b68:
>> (XEN)    c2c2c2c2c2c2c2c2 0000000000000064 0000000000000000 0000000000000012
>> (XEN)    0000000000000297 000000000000017a ffff82c48011e1e3 0000000000000000
>> (XEN)    ffff83010fc50000 ffff82f600ae4b60 0000000000069f65 ffff82f600ae4b80
>> (XEN)    ffff83013ff27c18 ffff82c4801153ee 0000000000000001 0000000000000001
>> (XEN)    ffff82f600ae49c8 ffff82f600ae4b60 0000000000800727 ffff83013fef0000
>> (XEN)    ffff82f600ae4b60 ffff83010fc50000 ffff83013ff27c38 ffff82c48015d4d0
>> (XEN)    000000000000e010 800000005725b727 ffff83013ff27c78 ffff82c48015f8d8
>> (XEN)    80000000571bf727 ffff8300aae3ac60 ffff83013fef0000 ffff8300aae3b000
>> (XEN)    ffff83013ff27f28 0000000000000000 ffff83013ff27cd8 ffff82c48015eaf4
>> (XEN)    ffff83013ff27d08 ffff82c48015fe3d ffff83013ff27cf8 ffff82c48015d4fe
>> (XEN)    ffff83013ff27cc8 1400000000000001 ffff82f60155c740 ffff82f60155c740
>> (XEN)    ffff83013ff27f28 007fffffffffffff ffff83013ff27d28 ffff82c48015f11c
>> (XEN)    000000003fef0000 ffff82f60155c750 ffff83013ff27d38 ffff83013fef0000
>> (XEN)    0000000000000000 ffffc9000000c2b0 00000000000aae3a ffff83013ff27f28
>> (XEN)    ffff83013ff27d38 ffff82c48015f2f8 ffff83013ff27e38 ffff82c480163a4f
>> (XEN)    ffff83013fef0018 00007ff03fef0000 0000000000000000 ffff82c480264db0
>> (XEN)    ffff82c480264db8 ffff83013ff27f28 ffff83013ff27f28 ffff83013fef0218
>> (XEN)    ffff8300bf524000 ffff83013fef0000 ffff8300bf524000 ffff83013fef0000
>> (XEN)    ffff83013fff3da8 0000000100000002 ffff830100000000 ffff82f60155c740
>> (XEN)    800000008eadf063 ffff880000000001 ffff83013ff27de8 000000003fff3d90
>> (XEN) Xen call trace:
>> (XEN)    [<ffff82c4801150c5>] free_heap_pages+0x53a/0x555
>> (XEN)    [<ffff82c4801153ee>] free_domheap_pages+0x30e/0x3cc
>> (XEN)    [<ffff82c48015d4d0>] put_page+0x6c/0x73
>> (XEN)    [<ffff82c48015f8d8>] put_page_from_l1e+0x19f/0x1b5
>> (XEN)    [<ffff82c48015eaf4>] free_page_type+0x25c/0x7b0
>> (XEN)    [<ffff82c48015f11c>] __put_page_type+0xd4/0x292
>> (XEN)    [<ffff82c48015f2f8>] put_page_type+0xe/0x23
>> (XEN)    [<ffff82c480163a4f>] do_mmuext_op+0x6ff/0x14b8
>> (XEN)    [<ffff82c480114235>] do_multicall+0x285/0x410
>> (XEN)    [<ffff82c4801f01bf>] syscall_enter+0xef/0x149
>> (XEN)
>> (XEN) Pagetable walk from ffff8315ffffffe4:
>> (XEN)  L4[0x106] = 00000000bf4f5027 5555555555555555
>> (XEN)  L3[0x057] = 0000000000000000 ffffffffffffffff
>> (XEN)
>> (XEN) ****************************************
>> (XEN) Panic on CPU 2:
>> (XEN) FATAL PAGE FAULT
>> (XEN) [error_code=0002]
>> (XEN) Faulting linear address: ffff8315ffffffe4
>> (XEN) ****************************************
>> (XEN)
>> (XEN) Reboot in five seconds...
>> 
>> 
>> 
>> _______________________________________________
>> Xen-devel mailing list
>> Xen-devel@xxxxxxxxxxxxxxxxxxx
>> http://lists.xensource.com/xen-devel

 		 	   		  

[-- Attachment #1.2: Type: text/html, Size: 97173 bytes --]

[-- Attachment #2: Type: text/plain, Size: 138 bytes --]

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xensource.com
http://lists.xensource.com/xen-devel

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: Xen-unstable panic: FATAL PAGE FAULT
  2010-08-26  4:49 ` Re:Re: Xen-unstable panic: FATAL PAGE FAULT MaoXiaoyun
@ 2010-08-26  7:39   ` Keir Fraser
  2010-08-26  8:59     ` MaoXiaoyun
  0 siblings, 1 reply; 39+ messages in thread
From: Keir Fraser @ 2010-08-26  7:39 UTC (permalink / raw)
  To: MaoXiaoyun, xen-devel

On 26/08/2010 05:49, "MaoXiaoyun" <tinnycloud@hotmail.com> wrote:

> Hi:
>  
>   This issue can be easily reproduced by continuous and almost concurrently
> reboot 12 Xen HVM VMS on a single physic server. The reproduce hit the back
> trace about 6 to 14 hours after it started. I have several similar Xen back
> traces, please refer to the end of the mail. The first three back traces
> almost the same, they happened in domain_kill, while the last backtrace
> happened in do_multicall.
>  
>        As go through the Xen code, in /xen-4.0.0/xen/arch/x86/mm.c, it shows
> that the author aware of the race competition between
> domain_relinquish_resources and presented code. It occurred me to simply move
> line 2765 and 2766 before 2764, that is move put_page_and_type(page) into the
> spin_lock to avoid competition.

Well, thanks for the detailed bug report: it is good to have a report that
includes an attempt at a fix!

In the below code, the put_page_and_type() is outside the locked region for
good reason. Put_page_and_type() -> put_page() -> free_domheap_pages() which
acquires d->page_alloc_lock. Because we do not use spin_lock_recursive() in
the below code, this recursive acquisition of the lock in
free_domheap_pages() would deadlock!

Now, I do not think this fix really affected your testing anyway, because
the below code is part of the MMUEXT_PIN_... hypercalls, and further is only
triggered when a domain executes one of those hypercalls on *another*
domain's memory. The *only* time that should happen is when dom0 builds a
*PV* VM. So since all your testing is on HVM guests I wouldn't expect the
code in the if() statement below to be executed ever. Well, maybe unless you
are using qemu stub domains, or pvgrub.

But even if the below code is being executed, I don't think your change is a
fix, or anything that should greatly affect the system apart from
introducing a deadlock. Is it instead possible that you somehow were testing
a broken build of Xen before, and simply re-building Xen with your change is
what fixed things? I wonder if the bug stays gone away if you revert your
change and re-build?

If it still appears that your fix is good, I would add tracing to the below
code and find out a bit more about when/why it is being executed.

 -- Keir

> 2753             /* A page is dirtied when its pin status is set. */
> 2754             paging_mark_dirty(pg_owner, mfn);
> 2755 
> 2756             /* We can race domain destruction
> (domain_relinquish_resources). */
> 2757             if ( unlikely(pg_owner != d) )
> 2758             {
> 2759                 int drop_ref;
> 2760                 spin_lock(&pg_owner->page_alloc_lock);
> 2761                 drop_ref = (pg_owner->is_dying &&
> 2762                             test_and_clear_bit(_PGT_pinned,
> 2763             
> &page->u.inuse.type_info));
> 2764                 spin_unlock(&pg_owner->page_alloc_lock);
> 2765                 if ( drop_ref )
> 2766                     put_page_and_type(page);
> 2767             }
> 2768             
> 2769             break;
> 2770         }
>  
>        Form the result of reproduce on patched code, it appears the patch
> worked well since the reproduce succeed during a 48hours long run. But I am
> not sure of the side effects it brings.
>        Appreciate in advance if someone could give more clauses, thx.
>  
> =============Trace 1: =============
>  
> (XEN) ----[ Xen-4.0.0  x86_64  debug=y  Not tainted ]----
> (XEN) CPU:    0
> (XEN) RIP:    e008:[<ffff82c48011617c>] free_heap_pages+0x55a/0x575
> (XEN) RFLAGS: 0000000000010286   CONTEXT: hypervisor
> (XEN) rax: 0000001fffffffe0   rbx: ffff82f60b8bbfc0   rcx: ffff83063fe01a20
> (XEN) rdx: ffff8315ffffffe0   rsi: ffff8315ffffffe0   rdi: 00000000ffffffff
> (XEN) rbp: ffff82c48037fc98   rsp: ffff82c48037fc58   r8:  0000000000000000
> (XEN) r9:  ffffffffffffffff   r10: ffff82c48020e770   r11: 0000000000000282
> (XEN) r12: 00007d0a00000000   r13: 0000000000000000   r14: ffff82f60b8bbfe0
> (XEN) r15: 0000000000000001   cr0: 000000008005003b   cr4: 00000000000026f0
> (XEN) cr3: 0000000232914000   cr2: ffff8315ffffffe4
> (XEN) ds: 0000   es: 0000   fs: 0063   gs: 0000   ss: e010   cs: e008
> (XEN) Xen stack trace from rsp=ffff82c48037fc58:
> (XEN)    0000000000000016 0000000000000000 00000000000001a2 ffff8304afc40000
> (XEN)    0000000000000000 ffff82f60b8bbfe0 00000000000330fe ffff82f60b8bc000
> (XEN)    ffff82c48037fcd8 ffff82c48011647e 0000000100000000 ffff82f60b8bbfe0
> (XEN)    ffff8304afc40020 0000000000000000 ffff8304afc40000 0000000000000000
> (XEN)    ffff82c48037fcf8 ffff82c480160caf ffff8304afc40000 ffff82f60b8bbfe0
> (XEN)    ffff82c48037fd68 ffff82c48014deaf 0000000000000ca3 ffff8304afc40fd8
> (XEN)    ffff8304afc40fd8 ffff8304afc40fd8 4000000000000000 ffff82c48037ff28
> (XEN)    0000000000000000 ffff8304afc40000 ffff8304afc40000 000000000099e000
> (XEN)    00000000ffffffda 0000000000000001 ffff82c48037fd98 ffff82c4801504de
> (XEN)    ffff8304afc40000 0000000000000000 000000000099e000 00000000ffffffda
> (XEN)    ffff82c48037fdb8 ffff82c4801062ee 000000000099e000 fffffffffffffff3
> (XEN)    ffff82c48037ff08 ffff82c480104cd7 ffff82c40000f800 0000000000000286
> (XEN)    0000000000000286 ffff8300bf76c000 000000ea864b1814 ffff8300bf76c030
> (XEN)    ffff83023ff1ded8 ffff83023ff1ded0 ffff82c48037fe38 ffff82c48011c9f5
> (XEN)    ffff82c48037ff08 ffff82c480272100 ffff8300bf76c000 ffff82c48037fe48
> (XEN)    ffff82c48011f557 ffff82c480272100 0000000600000002 000000004700000a
> (XEN)    000000004700bf2c 0000000000000000 000000004700c158 0000000000000000
> (XEN)    00002b3b59e7d050 0000000000000000 0000007f00b14140 00002b3b5f257a80
> (XEN)    0000000000996380 00002aaaaaad0830 00002b3b5f257a80 00000000009bb690
> (XEN)    00002aaaaaad0830 000000398905abf3 000000000078de60 00002b3b5f257aa4
> (XEN) Xen call trace:
> (XEN)    [<ffff82c48011617c>] free_heap_pages+0x55a/0x575
> (XEN)    [<ffff82c48011647e>] free_domheap_pages+0x2e7/0x3ab
> (XEN)    [<ffff82c480160caf>] put_page+0x69/0x70
> (XEN)    [<ffff82c48014deaf>] relinquish_memory+0x36e/0x499
> (XEN)    [<ffff82c4801504de>] domain_relinquish_resources+0x1ac/0x24c
> (XEN)    [<ffff82c4801062ee>] domain_kill+0x93/0xe4
> (XEN)    [<ffff82c480104cd7>] do_domctl+0xa1c/0x1205
> (XEN)    [<ffff82c4801f71bf>] syscall_enter+0xef/0x149
> (XEN)    
> (XEN) Pagetable walk from ffff8315ffffffe4:
> (XEN)  L4[0x106] = 00000000bf589027 5555555555555555
> (XEN)  L3[0x057] = 0000000000000000 ffffffffffffffff
> (XEN) 
> (XEN) ****************************************
> (XEN) Panic on CPU 0:
> (XEN) FATAL PAGE FAULT
> (XEN) [error_code=0002]
> (XEN) Faulting linear address: ffff8315ffffffe4
> (XEN) ****************************************
> (XEN) 
> (XEN) Manual reset required ('noreboot' specified)
>  
> =============Trace 2: =============
>  
>  (XEN) Xen call trace:
> (XEN)    [<ffff82c4801153c3>] free_heap_pages+0x283/0x4a0
> (XEN)    [<ffff82c480115732>] free_domheap_pages+0x152/0x380
> (XEN)    [<ffff82c48014aa89>] relinquish_memory+0x169/0x500
> (XEN)    [<ffff82c48014b2cd>] domain_relinquish_resources+0x1ad/0x280
> (XEN)    [<ffff82c480105fe0>] domain_kill+0x80/0xf0
> (XEN)    [<ffff82c4801043ce>] do_domctl+0x1be/0x1000
> (XEN)    [<ffff82c48010739b>] evtchn_set_pending+0xab/0x1b0
> (XEN)    [<ffff82c4801e3169>] syscall_enter+0xa9/0xae
> (XEN)    
> (XEN) Pagetable walk from ffff8315ffffffe4:
> (XEN)  L4[0x106] = 00000000bf569027 5555555555555555
> (XEN)  L3[0x057] = 0000000000000000 ffffffffffffffff
> (XEN) stdvga.c:147:d60 entering stdvga and caching modes
> (XEN) 
> (XEN) ****************************************
> (XEN) HVM60: VGABios $Id: vgabios.c,v 1.67 2008/01/27 09:44:12 vruppert Exp $
> (XEN) Panic on CPU 0:
> (XEN) FATAL PAGE FAULT
> (XEN) [error_code=0002]
> (XEN) Faulting linear address: ffff8315ffffffe4
> (XEN) ****************************************
> (XEN) 
> (XEN) Manual reset required ('noreboot' specified)
>  
> =============Trace 3: =============
>  
>  
> (XEN) Xen call trace:
> (XEN)    [<ffff82c4801153c3>] free_heap_pages+0x283/0x4a0
> (XEN)    [<ffff82c480115732>] free_domheap_pages+0x152/0x380
> (XEN)    [<ffff82c48014aa89>] relinquish_memory+0x169/0x500
> (XEN)    [<ffff82c48014b2cd>] domain_relinquish_resources+0x1ad/0x280
> (XEN)    [<ffff82c480105fe0>] domain_kill+0x80/0xf0
> (XEN)    [<ffff82c4801043ce>] do_domctl+0x1be/0x1000
> (XEN)    [<ffff82c480117804>] csched_acct+0x384/0x430
> (XEN)    [<ffff82c4801e3169>] syscall_enter+0xa9/0xae
> 

^ permalink raw reply	[flat|nested] 39+ messages in thread

* RE: Xen-unstable panic: FATAL PAGE FAULT
  2010-08-26  7:39   ` Keir Fraser
@ 2010-08-26  8:59     ` MaoXiaoyun
  2010-08-26  9:11       ` Keir Fraser
  0 siblings, 1 reply; 39+ messages in thread
From: MaoXiaoyun @ 2010-08-26  8:59 UTC (permalink / raw)
  To: keir.fraser, xen devel


[-- Attachment #1.1: Type: text/plain, Size: 10176 bytes --]


Thanks for the detailed explanation.

I see the spin_lock problem in the code I referred to, which, as you mentioned, would introduce a deadlock. In fact, during the 48-hour run one VM hung, and in the xm list output its CPU time was quite high (in the tens of thousands), while the other VMs worked fine. I don't know whether this is related to the potential deadlock, since Xen itself still worked.

So a quick question: if we replace the spin_lock with spin_lock_recursive, could we avoid this deadlock?

 

The if statement was executed during the test: I happened to have put a log line there and saw its output. As a matter of fact, the HVMs under test (all Windows 2003) have the PV driver installed; I think that is why the patch takes effect.

Besides, I have been working on this issue for some time; it is unlikely that I made a build mistake, since I have been careful throughout.

Anyway, I plan to kick off two reproductions on two physical servers: one with this patch applied (using spin_lock_recursive instead of spin_lock) and the other with no change, on completely clean code. It would be useful if you have some tracing to add to the test. I will keep you informed.

In addition, my kernel is
2.6.31.13-pvops-patch #1 SMP Tue Aug 24 11:23:51 CST 2010 x86_64 x86_64 x86_64 GNU/Linux
and Xen is 4.0.0.

 

Thanks.

 

 


 
> Date: Thu, 26 Aug 2010 08:39:03 +0100
> Subject: Re: [Xen-devel] Xen-unstable panic: FATAL PAGE FAULT
> From: keir.fraser@eu.citrix.com
> To: tinnycloud@hotmail.com; xen-devel@lists.xensource.com
> 
> On 26/08/2010 05:49, "MaoXiaoyun" <tinnycloud@hotmail.com> wrote:
> 
> > Hi:
> > 
> > This issue can be easily reproduced by continuous and almost concurrently
> > reboot 12 Xen HVM VMS on a single physic server. The reproduce hit the back
> > trace about 6 to 14 hours after it started. I have several similar Xen back
> > traces, please refer to the end of the mail. The first three back traces
> > almost the same, they happened in domain_kill, while the last backtrace
> > happened in do_multicall.
> > 
> > As go through the Xen code, in /xen-4.0.0/xen/arch/x86/mm.c, it shows
> > that the author aware of the race competition between
> > domain_relinquish_resources and presented code. It occurred me to simply move
> > line 2765 and 2766 before 2764, that is move put_page_and_type(page) into the
> > spin_lock to avoid competition.
> 
> Well, thanks for the detailed bug report: it is good to have a report that
> includes an attempt at a fix!
> 
> In the below code, the put_page_and_type() is outside the locked region for
> good reason. Put_page_and_type() -> put_page() -> free_domheap_pages() which
> acquires d->page_alloc_lock. Because we do not use spin_lock_recursive() in
> the below code, this recursive acquisition of the lock in
> free_domheap_pages() would deadlock!
> 
> Now, I do not think this fix really affected your testing anyway, because
> the below code is part of the MMUEXT_PIN_... hypercalls, and further is only
> triggered when a domain executes one of those hypercalls on *another*
> domain's memory. The *only* time that should happen is when dom0 builds a
> *PV* VM. So since all your testing is on HVM guests I wouldn't expect the
> code in the if() statement below to be executed ever. Well, maybe unless you
> are using qemu stub domains, or pvgrub.
> 
> But even if the below code is being executed, I don't think your change is a
> fix, or anything that should greatly affect the system apart from
> introducing a deadlock. Is it instead possible that you somehow were testing
> a broken build of Xen before, and simply re-building Xen with your change is
> what fixed things? I wonder if the bug stays gone away if you revert your
> change and re-build?
> 
> If it still appears that your fix is good, I would add tracing to the below
> code and find out a bit more about when/why it is being executed.
> 
> -- Keir
> 
> > 2753 /* A page is dirtied when its pin status is set. */
> > 2754 paging_mark_dirty(pg_owner, mfn);
> > 2755 
> > 2756 /* We can race domain destruction
> > (domain_relinquish_resources). */
> > 2757 if ( unlikely(pg_owner != d) )
> > 2758 {
> > 2759 int drop_ref;
> > 2760 spin_lock(&pg_owner->page_alloc_lock);
> > 2761 drop_ref = (pg_owner->is_dying &&
> > 2762 test_and_clear_bit(_PGT_pinned,
> > 2763 
> > &page->u.inuse.type_info));
> > 2764 spin_unlock(&pg_owner->page_alloc_lock);
> > 2765 if ( drop_ref )
> > 2766 put_page_and_type(page);
> > 2767 }
> > 2768 
> > 2769 break;
> > 2770 }
> > 
> > Form the result of reproduce on patched code, it appears the patch
> > worked well since the reproduce succeed during a 48hours long run. But I am
> > not sure of the side effects it brings.
> > Appreciate in advance if someone could give more clauses, thx.
> > 
> > =============Trace 1: =============
> > 
> > (XEN) ----[ Xen-4.0.0 x86_64 debug=y Not tainted ]----
> > (XEN) CPU: 0
> > (XEN) RIP: e008:[<ffff82c48011617c>] free_heap_pages+0x55a/0x575
> > (XEN) RFLAGS: 0000000000010286 CONTEXT: hypervisor
> > (XEN) rax: 0000001fffffffe0 rbx: ffff82f60b8bbfc0 rcx: ffff83063fe01a20
> > (XEN) rdx: ffff8315ffffffe0 rsi: ffff8315ffffffe0 rdi: 00000000ffffffff
> > (XEN) rbp: ffff82c48037fc98 rsp: ffff82c48037fc58 r8: 0000000000000000
> > (XEN) r9: ffffffffffffffff r10: ffff82c48020e770 r11: 0000000000000282
> > (XEN) r12: 00007d0a00000000 r13: 0000000000000000 r14: ffff82f60b8bbfe0
> > (XEN) r15: 0000000000000001 cr0: 000000008005003b cr4: 00000000000026f0
> > (XEN) cr3: 0000000232914000 cr2: ffff8315ffffffe4
> > (XEN) ds: 0000 es: 0000 fs: 0063 gs: 0000 ss: e010 cs: e008
> > (XEN) Xen stack trace from rsp=ffff82c48037fc58:
> > (XEN) 0000000000000016 0000000000000000 00000000000001a2 ffff8304afc40000
> > (XEN) 0000000000000000 ffff82f60b8bbfe0 00000000000330fe ffff82f60b8bc000
> > (XEN) ffff82c48037fcd8 ffff82c48011647e 0000000100000000 ffff82f60b8bbfe0
> > (XEN) ffff8304afc40020 0000000000000000 ffff8304afc40000 0000000000000000
> > (XEN) ffff82c48037fcf8 ffff82c480160caf ffff8304afc40000 ffff82f60b8bbfe0
> > (XEN) ffff82c48037fd68 ffff82c48014deaf 0000000000000ca3 ffff8304afc40fd8
> > (XEN) ffff8304afc40fd8 ffff8304afc40fd8 4000000000000000 ffff82c48037ff28
> > (XEN) 0000000000000000 ffff8304afc40000 ffff8304afc40000 000000000099e000
> > (XEN) 00000000ffffffda 0000000000000001 ffff82c48037fd98 ffff82c4801504de
> > (XEN) ffff8304afc40000 0000000000000000 000000000099e000 00000000ffffffda
> > (XEN) ffff82c48037fdb8 ffff82c4801062ee 000000000099e000 fffffffffffffff3
> > (XEN) ffff82c48037ff08 ffff82c480104cd7 ffff82c40000f800 0000000000000286
> > (XEN) 0000000000000286 ffff8300bf76c000 000000ea864b1814 ffff8300bf76c030
> > (XEN) ffff83023ff1ded8 ffff83023ff1ded0 ffff82c48037fe38 ffff82c48011c9f5
> > (XEN) ffff82c48037ff08 ffff82c480272100 ffff8300bf76c000 ffff82c48037fe48
> > (XEN) ffff82c48011f557 ffff82c480272100 0000000600000002 000000004700000a
> > (XEN) 000000004700bf2c 0000000000000000 000000004700c158 0000000000000000
> > (XEN) 00002b3b59e7d050 0000000000000000 0000007f00b14140 00002b3b5f257a80
> > (XEN) 0000000000996380 00002aaaaaad0830 00002b3b5f257a80 00000000009bb690
> > (XEN) 00002aaaaaad0830 000000398905abf3 000000000078de60 00002b3b5f257aa4
> > (XEN) Xen call trace:
> > (XEN) [<ffff82c48011617c>] free_heap_pages+0x55a/0x575
> > (XEN) [<ffff82c48011647e>] free_domheap_pages+0x2e7/0x3ab
> > (XEN) [<ffff82c480160caf>] put_page+0x69/0x70
> > (XEN) [<ffff82c48014deaf>] relinquish_memory+0x36e/0x499
> > (XEN) [<ffff82c4801504de>] domain_relinquish_resources+0x1ac/0x24c
> > (XEN) [<ffff82c4801062ee>] domain_kill+0x93/0xe4
> > (XEN) [<ffff82c480104cd7>] do_domctl+0xa1c/0x1205
> > (XEN) [<ffff82c4801f71bf>] syscall_enter+0xef/0x149
> > (XEN) 
> > (XEN) Pagetable walk from ffff8315ffffffe4:
> > (XEN) L4[0x106] = 00000000bf589027 5555555555555555
> > (XEN) L3[0x057] = 0000000000000000 ffffffffffffffff
> > (XEN) 
> > (XEN) ****************************************
> > (XEN) Panic on CPU 0:
> > (XEN) FATAL PAGE FAULT
> > (XEN) [error_code=0002]
> > (XEN) Faulting linear address: ffff8315ffffffe4
> > (XEN) ****************************************
> > (XEN) 
> > (XEN) Manual reset required ('noreboot' specified)
> > 
> > =============Trace 2: =============
> > 
> > (XEN) Xen call trace:
> > (XEN) [<ffff82c4801153c3>] free_heap_pages+0x283/0x4a0
> > (XEN) [<ffff82c480115732>] free_domheap_pages+0x152/0x380
> > (XEN) [<ffff82c48014aa89>] relinquish_memory+0x169/0x500
> > (XEN) [<ffff82c48014b2cd>] domain_relinquish_resources+0x1ad/0x280
> > (XEN) [<ffff82c480105fe0>] domain_kill+0x80/0xf0
> > (XEN) [<ffff82c4801043ce>] do_domctl+0x1be/0x1000
> > (XEN) [<ffff82c48010739b>] evtchn_set_pending+0xab/0x1b0
> > (XEN) [<ffff82c4801e3169>] syscall_enter+0xa9/0xae
> > (XEN) 
> > (XEN) Pagetable walk from ffff8315ffffffe4:
> > (XEN) L4[0x106] = 00000000bf569027 5555555555555555
> > (XEN) L3[0x057] = 0000000000000000 ffffffffffffffff
> > (XEN) stdvga.c:147:d60 entering stdvga and caching modes
> > (XEN) 
> > (XEN) ****************************************
> > (XEN) HVM60: VGABios $Id: vgabios.c,v 1.67 2008/01/27 09:44:12 vruppert Exp $
> > (XEN) Panic on CPU 0:
> > (XEN) FATAL PAGE FAULT
> > (XEN) [error_code=0002]
> > (XEN) Faulting linear address: ffff8315ffffffe4
> > (XEN) ****************************************
> > (XEN) 
> > (XEN) Manual reset required ('noreboot' specified)
> > 
> > =============Trace 3: =============
> > 
> > 
> > (XEN) Xen call trace:
> > (XEN) [<ffff82c4801153c3>] free_heap_pages+0x283/0x4a0
> > (XEN) [<ffff82c480115732>] free_domheap_pages+0x152/0x380
> > (XEN) [<ffff82c48014aa89>] relinquish_memory+0x169/0x500
> > (XEN) [<ffff82c48014b2cd>] domain_relinquish_resources+0x1ad/0x280
> > (XEN) [<ffff82c480105fe0>] domain_kill+0x80/0xf0
> > (XEN) [<ffff82c4801043ce>] do_domctl+0x1be/0x1000
> > (XEN) [<ffff82c480117804>] csched_acct+0x384/0x430
> > (XEN) [<ffff82c4801e3169>] syscall_enter+0xa9/0xae
> > 
> 
> 
 		 	   		  

[-- Attachment #1.2: Type: text/html, Size: 12181 bytes --]

[-- Attachment #2: Type: text/plain, Size: 138 bytes --]

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xensource.com
http://lists.xensource.com/xen-devel

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: Xen-unstable panic: FATAL PAGE FAULT
  2010-08-26  8:59     ` MaoXiaoyun
@ 2010-08-26  9:11       ` Keir Fraser
  2010-08-30  8:47         ` MaoXiaoyun
  0 siblings, 1 reply; 39+ messages in thread
From: Keir Fraser @ 2010-08-26  9:11 UTC (permalink / raw)
  To: MaoXiaoyun, xen devel

On 26/08/2010 09:59, "MaoXiaoyun" <tinnycloud@hotmail.com> wrote:

> Appreciate for the detail.
>  
> I notice the spin_lock for the code I referred, which as you mentioned will
> introduce a deadlock.
>  In fact, during the 48 hours long run, there was a VM hung, and from the xm
> list command, 
> the cpu time is quite high, to ten thousands, but the other VMS worked fine. I
> don't know whether
> it related to the potential deadlock, since Xen still worked.
>  
> So a quick question is if we replace the spin_lock with spin_lock_recursive,
> could we avoid this deadlock?

Yes. But we don't understand why this change to MMUEXT_PIN_xxx would fix
your observed bug, and without that understanding I wouldn't accept the
change into the tree.

> The if statement was executed during the test since I happend put the log and
> got the output log.

Tell us more. Like, for example, the domain id's of 'd' and 'pg_owner', and
whether they are PV or HVM domains.

> As a matter of fact,  HVMS(all windowns 2003) under my test all are have PV
> driver installed. I think that's why the patch take effects.

Nope. That hypercall is to do with PV pagetable management. An HVM guest
with PV drivers still has HVM pagetable management.

> Besides, I have been working on this issue for sometime, it is not possible I
> made a build mistake
> since I have been carefully all the time.
>  
> Anyway, I plan to kick off two reproduce on two physical servers, one has this
> patch enabled(use spin_lock_recursive
> instead of spin_lock) and the other with no change, completely on clean code.
> It would be useful if u have some
> trace to be added into the test. I will keep you informed.

Whether this fixes your problem is a good data point, but without full
understanding of the bug and why this is the correct and best fix, it will
not be accepted I'm afraid.

 -- Keir

> In addtion, my kernel is
> 2.6.31.13-pvops-patch #1 SMP Tue Aug 24 11:23:51 CST 2010 x86_64 x86_64 x86_64
> GNU/Linux
> Xen is 
> 4.0.0
>  
> Thanks.
>  
>  
> 
>  
>> Date: Thu, 26 Aug 2010 08:39:03 +0100
>> Subject: Re: [Xen-devel] Xen-unstable panic: FATAL PAGE FAULT
>> From: keir.fraser@eu.citrix.com
>> To: tinnycloud@hotmail.com; xen-devel@lists.xensource.com
>> 
>> On 26/08/2010 05:49, "MaoXiaoyun" <tinnycloud@hotmail.com> wrote:
>> 
>>> Hi:
>>> 
>>> This issue can be easily reproduced by continuous and almost concurrently
>>> reboot 12 Xen HVM VMS on a single physic server. The reproduce hit the back
>>> trace about 6 to 14 hours after it started. I have several similar Xen back
>>> traces, please refer to the end of the mail. The first three back traces
>>> almost the same, they happened in domain_kill, while the last backtrace
>>> happened in do_multicall.
>>> 
>>> As go through the Xen code, in /xen-4.0.0/xen/arch/x86/mm.c, it shows
>>> that the author aware of the race competition between
>>> domain_relinquish_resources and presented code. It occurred me to simply
>>> move
>>> line 2765 and 2766 before 2764, that is move put_page_and_type(page) into
>>> the
>>> spin_lock to avoid competition.
>> 
>> Well, thanks for the detailed bug report: it is good to have a report that
>> includes an attempt at a fix!
>> 
>> In the below code, the put_page_and_type() is outside the locked region for
>> good reason. Put_page_and_type() -> put_page() -> free_domheap_pages() which
>> acquires d->page_alloc_lock. Because we do not use spin_lock_recursive() in
>> the below code, this recursive acquisition of the lock in
>> free_domheap_pages() would deadlock!
>> 
>> Now, I do not think this fix really affected your testing anyway, because
>> the below code is part of the MMUEXT_PIN_... hypercalls, and further is only
>> triggered when a domain executes one of those hypercalls on *another*
>> domain's memory. The *only* time that should happen is when dom0 builds a
>> *PV* VM. So since all your testing is on HVM guests I wouldn't expect the
>> code in the if() statement below to be executed ever. Well, maybe unless you
>> are using qemu stub domains, or pvgrub.
>> 
>> But even if the below code is being executed, I don't think your change is a
>> fix, or anything that should greatly affect the system apart from
>> introducing a deadlock. Is it instead possible that you somehow were testing
>> a broken build of Xen before, and simply re-building Xen with your change is
>> what fixed things? I wonder if the bug stays gone away if you revert your
>> change and re-build?
>> 
>> If it still appears that your fix is good, I would add tracing to the below
>> code and find out a bit more about when/why it is being executed.
>> 
>> -- Keir
>> 
>>> 2753 /* A page is dirtied when its pin status is set. */
>>> 2754 paging_mark_dirty(pg_owner, mfn);
>>> 2755 
>>> 2756 /* We can race domain destruction
>>> (domain_relinquish_resources). */
>>> 2757 if ( unlikely(pg_owner != d) )
>>> 2758 {
>>> 2759 int drop_ref;
>>> 2760 spin_lock(&pg_owner->page_alloc_lock);
>>> 2761 drop_ref = (pg_owner->is_dying &&
>>> 2762 test_and_clear_bit(_PGT_pinned,
>>> 2763 
>>> &page->u.inuse.type_info));
>>> 2764 spin_unlock(&pg_owner->page_alloc_lock);
>>> 2765 if ( drop_ref )
>>> 2766 put_page_and_type(page);
>>> 2767 }
>>> 2768 
>>> 2769 break;
>>> 2770 }
>>> 
>>> Form the result of reproduce on patched code, it appears the patch
>>> worked well since the reproduce succeed during a 48hours long run. But I am
>>> not sure of the side effects it brings.
>>> Appreciate in advance if someone could give more clauses, thx.
>>> 
>>> =============Trace 1: =============
>>> 
>>> (XEN) ----[ Xen-4.0.0 x86_64 debug=y Not tainted ]----
>>> (XEN) CPU: 0
>>> (XEN) RIP: e008:[<ffff82c48011617c>] free_heap_pages+0x55a/0x575
>>> (XEN) RFLAGS: 0000000000010286 CONTEXT: hypervisor
>>> (XEN) rax: 0000001fffffffe0 rbx: ffff82f60b8bbfc0 rcx: ffff83063fe01a20
>>> (XEN) rdx: ffff8315ffffffe0 rsi: ffff8315ffffffe0 rdi: 00000000ffffffff
>>> (XEN) rbp: ffff82c48037fc98 rsp: ffff82c48037fc58 r8: 0000000000000000
>>> (XEN) r9: ffffffffffffffff r10: ffff82c48020e770 r11: 0000000000000282
>>> (XEN) r12: 00007d0a00000000 r13: 0000000000000000 r14: ffff82f60b8bbfe0
>>> (XEN) r15: 0000000000000001 cr0: 000000008005003b cr4: 00000000000026f0
>>> (XEN) cr3: 0000000232914000 cr2: ffff8315ffffffe4
>>> (XEN) ds: 0000 es: 0000 fs: 0063 gs: 0000 ss: e010 cs: e008
>>> (XEN) Xen stack trace from rsp=ffff82c48037fc58:
>>> (XEN) 0000000000000016 0000000000000000 00000000000001a2 ffff8304afc40000
>>> (XEN) 0000000000000000 ffff82f60b8bbfe0 00000000000330fe ffff82f60b8bc000
>>> (XEN) ffff82c48037fcd8 ffff82c48011647e 0000000100000000 ffff82f60b8bbfe0
>>> (XEN) ffff8304afc40020 0000000000000000 ffff8304afc40000 0000000000000000
>>> (XEN) ffff82c48037fcf8 ffff82c480160caf ffff8304afc40000 ffff82f60b8bbfe0
>>> (XEN) ffff82c48037fd68 ffff82c48014deaf 0000000000000ca3 ffff8304afc40fd8
>>> (XEN) ffff8304afc40fd8 ffff8304afc40fd8 4000000000000000 ffff82c48037ff28
>>> (XEN) 0000000000000000 ffff8304afc40000 ffff8304afc40000 000000000099e000
>>> (XEN) 00000000ffffffda 0000000000000001 ffff82c48037fd98 ffff82c4801504de
>>> (XEN) ffff8304afc40000 0000000000000000 000000000099e000 00000000ffffffda
>>> (XEN) ffff82c48037fdb8 ffff82c4801062ee 000000000099e000 fffffffffffffff3
>>> (XEN) ffff82c48037ff08 ffff82c480104cd7 ffff82c40000f800 0000000000000286
>>> (XEN) 0000000000000286 ffff8300bf76c000 000000ea864b1814 ffff8300bf76c030
>>> (XEN) ffff83023ff1ded8 ffff83023ff1ded0 ffff82c48037fe38 ffff82c48011c9f5
>>> (XEN) ffff82c48037ff08 ffff82c480272100 ffff8300bf76c000 ffff82c48037fe48
>>> (XEN) ffff82c48011f557 ffff82c480272100 0000000600000002 000000004700000a
>>> (XEN) 000000004700bf2c 0000000000000000 000000004700c158 0000000000000000
>>> (XEN) 00002b3b59e7d050 0000000000000000 0000007f00b14140 00002b3b5f257a80
>>> (XEN) 0000000000996380 00002aaaaaad0830 00002b3b5f257a80 00000000009bb690
>>> (XEN) 00002aaaaaad0830 000000398905abf3 000000000078de60 00002b3b5f257aa4
>>> (XEN) Xen call trace:
>>> (XEN) [<ffff82c48011617c>] free_heap_pages+0x55a/0x575
>>> (XEN) [<ffff82c48011647e>] free_domheap_pages+0x2e7/0x3ab
>>> (XEN) [<ffff82c480160caf>] put_page+0x69/0x70
>>> (XEN) [<ffff82c48014deaf>] relinquish_memory+0x36e/0x499
>>> (XEN) [<ffff82c4801504de>] domain_relinquish_resources+0x1ac/0x24c
>>> (XEN) [<ffff82c4801062ee>] domain_kill+0x93/0xe4
>>> (XEN) [<ffff82c480104cd7>] do_domctl+0xa1c/0x1205
>>> (XEN) [<ffff82c4801f71bf>] syscall_enter+0xef/0x149
>>> (XEN) 
>>> (XEN) Pagetable walk from ffff8315ffffffe4:
>>> (XEN) L4[0x106] = 00000000bf589027 5555555555555555
>>> (XEN) L3[0x057] = 0000000000000000 ffffffffffffffff
>>> (XEN) 
>>> (XEN) ****************************************
>>> (XEN) Panic on CPU 0:
>>> (XEN) FATAL PAGE FAULT
>>> (XEN) [error_code=0002]
>>> (XEN) Faulting linear address: ffff8315ffffffe4
>>> (XEN) ****************************************
>>> (XEN) 
>>> (XEN) Manual reset required ('noreboot' specified)
>>> 
>>> =============Trace 2: =============
>>> 
>>> (XEN) Xen call trace:
>>> (XEN) [<ffff82c4801153c3>] free_heap_pages+0x283/0x4a0
>>> (XEN) [<ffff82c480115732>] free_domheap_pages+0x152/0x380
>>> (XEN) [<ffff82c48014aa89>] relinquish_memory+0x169/0x500
>>> (XEN) [<ffff82c48014b2cd>] domain_relinquish_resources+0x1ad/0x280
>>> (XEN) [<ffff82c480105fe0>] domain_kill+0x80/0xf0
>>> (XEN) [<ffff82c4801043ce>] do_domctl+0x1be/0x1000
>>> (XEN) [<ffff82c48010739b>] evtchn_set_pending+0xab/0x1b0
>>> (XEN) [<ffff82c4801e3169>] syscall_enter+0xa9/0xae
>>> (XEN) 
>>> (XEN) Pagetable walk from ffff8315ffffffe4:
>>> (XEN) L4[0x106] = 00000000bf569027 5555555555555555
>>> (XEN) L3[0x057] = 0000000000000000 ffffffffffffffff
>>> (XEN) stdvga.c:147:d60 entering stdvga and caching modes
>>> (XEN) 
>>> (XEN) ****************************************
>>> (XEN) HVM60: VGABios $Id: vgabios.c,v 1.67 2008/01/27 09:44:12 vruppert Exp
>>> $
>>> (XEN) Panic on CPU 0:
>>> (XEN) FATAL PAGE FAULT
>>> (XEN) [error_code=0002]
>>> (XEN) Faulting linear address: ffff8315ffffffe4
>>> (XEN) ****************************************
>>> (XEN) 
>>> (XEN) Manual reset required ('noreboot' specified)
>>> 
>>> =============Trace 3: =============
>>> 
>>> 
>>> (XEN) Xen call trace:
>>> (XEN) [<ffff82c4801153c3>] free_heap_pages+0x283/0x4a0
>>> (XEN) [<ffff82c480115732>] free_domheap_pages+0x152/0x380
>>> (XEN) [<ffff82c48014aa89>] relinquish_memory+0x169/0x500
>>> (XEN) [<ffff82c48014b2cd>] domain_relinquish_resources+0x1ad/0x280
>>> (XEN) [<ffff82c480105fe0>] domain_kill+0x80/0xf0
>>> (XEN) [<ffff82c4801043ce>] do_domctl+0x1be/0x1000
>>> (XEN) [<ffff82c480117804>] csched_acct+0x384/0x430
>>> (XEN) [<ffff82c4801e3169>] syscall_enter+0xa9/0xae
>>> 
>> 
>> 
>        

^ permalink raw reply	[flat|nested] 39+ messages in thread

* RE: Xen-unstable panic: FATAL PAGE FAULT
  2010-08-26  9:11       ` Keir Fraser
@ 2010-08-30  8:47         ` MaoXiaoyun
  2010-08-30  9:02           ` Keir Fraser
  0 siblings, 1 reply; 39+ messages in thread
From: MaoXiaoyun @ 2010-08-30  8:47 UTC (permalink / raw)
  To: keir.fraser, xen devel


[-- Attachment #1.1: Type: text/plain, Size: 12939 bytes --]


Hi Keir:

       You are right about the if statement execution. After I reran the reproduction
I never saw the log output, so I clearly made a mistake before; my apologies.

        Here is more of what I found.

        1) We kicked off two reproductions. One ran successfully for more than 3 days
on an idle server (idle meaning only the tests were running, no other workload). The
other also ran 3 days idle, but when we put some other load on it (compiling a
kernel, as I recall) the bug showed up. What is strange is that in our earlier tests
the bug normally appeared in less than 24 hours.

        2) Judging from the previous failures, the bug may not be related solely to
VM reboot; other operations (such as tapdisk) might perform unexpected operations on
a domain's pages. Then, when the VM is destroyed, free_heap_pages() walks those pages
and eventually panics. (This would also explain why frequent reboots help expose the
bug earlier.) Is this possible?

        3) Every panic faults on the same address, ffff8315ffffffe4, which is not a
valid page address. I printed the domain's pages in assign_pages(); they all look
like ffff82f60bd64000, so at least the ffff82f60 prefix is consistent.

       I am rather at a loss as to how to go further.  Thanks.
 
> Date: Thu, 26 Aug 2010 10:11:21 +0100
> Subject: Re: [Xen-devel] Xen-unstable panic: FATAL PAGE FAULT
> From: keir.fraser@eu.citrix.com
> To: tinnycloud@hotmail.com; xen-devel@lists.xensource.com
> 
> On 26/08/2010 09:59, "MaoXiaoyun" <tinnycloud@hotmail.com> wrote:
> 
> > Appreciate for the detail.
> > 
> > I notice the spin_lock for the code I referred, which as you mentioned will
> > introduce a deadlock.
> > In fact, during the 48 hours long run, there was a VM hung, and from the xm
> > list command, 
> > the cpu time is quite high, to ten thousands, but the other VMS worked fine. I
> > don't know whether
> > it related to the potential deadlock, since Xen still worked.
> > 
> > So a quick question is if we replace the spin_lock with spin_lock_recursive,
> > could we avoid this deadlock?
> 
> Yes. But we don't understand why this change to MMUEXT_PIN_xxx would fix
> your observed bug, and without that understanding I wouldn't accept the
> change into the tree.
> 
> > The if statement was executed during the test since I happend put the log and
> > got the output log.
> 
> Tell us more. Like, for example, the domain id's of 'd' and 'pg_owner', and
> whether they are PV or HVM domains.
> 
> > As a matter of fact, HVMS(all windowns 2003) under my test all are have PV
> > driver installed. I think that's why the patch take effects.
> 
> Nope. That hypercall is to do with PV pagetable management. An HVM guest
> with PV drivers still has HVM pagetable management.
> 
> > Besides, I have been working on this issue for sometime, it is not possible I
> > made a build mistake
> > since I have been carefully all the time.
> > 
> > Anyway, I plan to kick off two reproduce on two physical servers, one has this
> > patch enabled(use spin_lock_recursive
> > instead of spin_lock) and the other with no change, completely on clean code.
> > It would be useful if u have some
> > trace to be added into the test. I will keep you informed.
> 
> Whether this fixes your problem is a good data point, but without full
> understanding of the bug and why this is the correct and best fix, it will
> not be accepted I'm afraid.
> 
> -- Keir
> 
> > In addtion, my kernel is
> > 2.6.31.13-pvops-patch #1 SMP Tue Aug 24 11:23:51 CST 2010 x86_64 x86_64 x86_64
> > GNU/Linux
> > Xen is 
> > 4.0.0
> > 
> > Thanks.
> > 
> > 
> > 
> > 
> >> Date: Thu, 26 Aug 2010 08:39:03 +0100
> >> Subject: Re: [Xen-devel] Xen-unstable panic: FATAL PAGE FAULT
> >> From: keir.fraser@eu.citrix.com
> >> To: tinnycloud@hotmail.com; xen-devel@lists.xensource.com
> >> 
> >> On 26/08/2010 05:49, "MaoXiaoyun" <tinnycloud@hotmail.com> wrote:
> >> 
> >>> Hi:
> >>> 
> >>> This issue can be easily reproduced by continuous and almost concurrently
> >>> reboot 12 Xen HVM VMS on a single physic server. The reproduce hit the back
> >>> trace about 6 to 14 hours after it started. I have several similar Xen back
> >>> traces, please refer to the end of the mail. The first three back traces
> >>> almost the same, they happened in domain_kill, while the last backtrace
> >>> happened in do_multicall.
> >>> 
> >>> As go through the Xen code, in /xen-4.0.0/xen/arch/x86/mm.c, it shows
> >>> that the author aware of the race competition between
> >>> domain_relinquish_resources and presented code. It occurred me to simply
> >>> move
> >>> line 2765 and 2766 before 2764, that is move put_page_and_type(page) into
> >>> the
> >>> spin_lock to avoid competition.
> >> 
> >> Well, thanks for the detailed bug report: it is good to have a report that
> >> includes an attempt at a fix!
> >> 
> >> In the below code, the put_page_and_type() is outside the locked region for
> >> good reason. Put_page_and_type() -> put_page() -> free_domheap_pages() which
> >> acquires d->page_alloc_lock. Because we do not use spin_lock_recursive() in
> >> the below code, this recursive acquisition of the lock in
> >> free_domheap_pages() would deadlock!
> >> 
> >> Now, I do not think this fix really affected your testing anyway, because
> >> the below code is part of the MMUEXT_PIN_... hypercalls, and further is only
> >> triggered when a domain executes one of those hypercalls on *another*
> >> domain's memory. The *only* time that should happen is when dom0 builds a
> >> *PV* VM. So since all your testing is on HVM guests I wouldn't expect the
> >> code in the if() statement below to be executed ever. Well, maybe unless you
> >> are using qemu stub domains, or pvgrub.
> >> 
> >> But even if the below code is being executed, I don't think your change is a
> >> fix, or anything that should greatly affect the system apart from
> >> introducing a deadlock. Is it instead possible that you somehow were testing
> >> a broken build of Xen before, and simply re-building Xen with your change is
> >> what fixed things? I wonder if the bug stays gone away if you revert your
> >> change and re-build?
> >> 
> >> If it still appears that your fix is good, I would add tracing to the below
> >> code and find out a bit more about when/why it is being executed.
> >> 
> >> -- Keir
> >> 
> >>> 2753 /* A page is dirtied when its pin status is set. */
> >>> 2754 paging_mark_dirty(pg_owner, mfn);
> >>> 2755 
> >>> 2756 /* We can race domain destruction
> >>> (domain_relinquish_resources). */
> >>> 2757 if ( unlikely(pg_owner != d) )
> >>> 2758 {
> >>> 2759 int drop_ref;
> >>> 2760 spin_lock(&pg_owner->page_alloc_lock);
> >>> 2761 drop_ref = (pg_owner->is_dying &&
> >>> 2762 test_and_clear_bit(_PGT_pinned,
> >>> 2763 
> >>> &page->u.inuse.type_info));
> >>> 2764 spin_unlock(&pg_owner->page_alloc_lock);
> >>> 2765 if ( drop_ref )
> >>> 2766 put_page_and_type(page);
> >>> 2767 }
> >>> 2768 
> >>> 2769 break;
> >>> 2770 }
> >>> 
> >>> Form the result of reproduce on patched code, it appears the patch
> >>> worked well since the reproduce succeed during a 48hours long run. But I am
> >>> not sure of the side effects it brings.
> >>> Appreciate in advance if someone could give more clauses, thx.
> >>> 
> >>> =============Trace 1: =============
> >>> 
> >>> (XEN) ----[ Xen-4.0.0 x86_64 debug=y Not tainted ]----
> >>> (XEN) CPU: 0
> >>> (XEN) RIP: e008:[<ffff82c48011617c>] free_heap_pages+0x55a/0x575
> >>> (XEN) RFLAGS: 0000000000010286 CONTEXT: hypervisor
> >>> (XEN) rax: 0000001fffffffe0 rbx: ffff82f60b8bbfc0 rcx: ffff83063fe01a20
> >>> (XEN) rdx: ffff8315ffffffe0 rsi: ffff8315ffffffe0 rdi: 00000000ffffffff
> >>> (XEN) rbp: ffff82c48037fc98 rsp: ffff82c48037fc58 r8: 0000000000000000
> >>> (XEN) r9: ffffffffffffffff r10: ffff82c48020e770 r11: 0000000000000282
> >>> (XEN) r12: 00007d0a00000000 r13: 0000000000000000 r14: ffff82f60b8bbfe0
> >>> (XEN) r15: 0000000000000001 cr0: 000000008005003b cr4: 00000000000026f0
> >>> (XEN) cr3: 0000000232914000 cr2: ffff8315ffffffe4
> >>> (XEN) ds: 0000 es: 0000 fs: 0063 gs: 0000 ss: e010 cs: e008
> >>> (XEN) Xen stack trace from rsp=ffff82c48037fc58:
> >>> (XEN) 0000000000000016 0000000000000000 00000000000001a2 ffff8304afc40000
> >>> (XEN) 0000000000000000 ffff82f60b8bbfe0 00000000000330fe ffff82f60b8bc000
> >>> (XEN) ffff82c48037fcd8 ffff82c48011647e 0000000100000000 ffff82f60b8bbfe0
> >>> (XEN) ffff8304afc40020 0000000000000000 ffff8304afc40000 0000000000000000
> >>> (XEN) ffff82c48037fcf8 ffff82c480160caf ffff8304afc40000 ffff82f60b8bbfe0
> >>> (XEN) ffff82c48037fd68 ffff82c48014deaf 0000000000000ca3 ffff8304afc40fd8
> >>> (XEN) ffff8304afc40fd8 ffff8304afc40fd8 4000000000000000 ffff82c48037ff28
> >>> (XEN) 0000000000000000 ffff8304afc40000 ffff8304afc40000 000000000099e000
> >>> (XEN) 00000000ffffffda 0000000000000001 ffff82c48037fd98 ffff82c4801504de
> >>> (XEN) ffff8304afc40000 0000000000000000 000000000099e000 00000000ffffffda
> >>> (XEN) ffff82c48037fdb8 ffff82c4801062ee 000000000099e000 fffffffffffffff3
> >>> (XEN) ffff82c48037ff08 ffff82c480104cd7 ffff82c40000f800 0000000000000286
> >>> (XEN) 0000000000000286 ffff8300bf76c000 000000ea864b1814 ffff8300bf76c030
> >>> (XEN) ffff83023ff1ded8 ffff83023ff1ded0 ffff82c48037fe38 ffff82c48011c9f5
> >>> (XEN) ffff82c48037ff08 ffff82c480272100 ffff8300bf76c000 ffff82c48037fe48
> >>> (XEN) ffff82c48011f557 ffff82c480272100 0000000600000002 000000004700000a
> >>> (XEN) 000000004700bf2c 0000000000000000 000000004700c158 0000000000000000
> >>> (XEN) 00002b3b59e7d050 0000000000000000 0000007f00b14140 00002b3b5f257a80
> >>> (XEN) 0000000000996380 00002aaaaaad0830 00002b3b5f257a80 00000000009bb690
> >>> (XEN) 00002aaaaaad0830 000000398905abf3 000000000078de60 00002b3b5f257aa4
> >>> (XEN) Xen call trace:
> >>> (XEN) [<ffff82c48011617c>] free_heap_pages+0x55a/0x575
> >>> (XEN) [<ffff82c48011647e>] free_domheap_pages+0x2e7/0x3ab
> >>> (XEN) [<ffff82c480160caf>] put_page+0x69/0x70
> >>> (XEN) [<ffff82c48014deaf>] relinquish_memory+0x36e/0x499
> >>> (XEN) [<ffff82c4801504de>] domain_relinquish_resources+0x1ac/0x24c
> >>> (XEN) [<ffff82c4801062ee>] domain_kill+0x93/0xe4
> >>> (XEN) [<ffff82c480104cd7>] do_domctl+0xa1c/0x1205
> >>> (XEN) [<ffff82c4801f71bf>] syscall_enter+0xef/0x149
> >>> (XEN) 
> >>> (XEN) Pagetable walk from ffff8315ffffffe4:
> >>> (XEN) L4[0x106] = 00000000bf589027 5555555555555555
> >>> (XEN) L3[0x057] = 0000000000000000 ffffffffffffffff
> >>> (XEN) 
> >>> (XEN) ****************************************
> >>> (XEN) Panic on CPU 0:
> >>> (XEN) FATAL PAGE FAULT
> >>> (XEN) [error_code=0002]
> >>> (XEN) Faulting linear address: ffff8315ffffffe4
> >>> (XEN) ****************************************
> >>> (XEN) 
> >>> (XEN) Manual reset required ('noreboot' specified)
> >>> 
> >>> =============Trace 2: =============
> >>> 
> >>> (XEN) Xen call trace:
> >>> (XEN) [<ffff82c4801153c3>] free_heap_pages+0x283/0x4a0
> >>> (XEN) [<ffff82c480115732>] free_domheap_pages+0x152/0x380
> >>> (XEN) [<ffff82c48014aa89>] relinquish_memory+0x169/0x500
> >>> (XEN) [<ffff82c48014b2cd>] domain_relinquish_resources+0x1ad/0x280
> >>> (XEN) [<ffff82c480105fe0>] domain_kill+0x80/0xf0
> >>> (XEN) [<ffff82c4801043ce>] do_domctl+0x1be/0x1000
> >>> (XEN) [<ffff82c48010739b>] evtchn_set_pending+0xab/0x1b0
> >>> (XEN) [<ffff82c4801e3169>] syscall_enter+0xa9/0xae
> >>> (XEN) 
> >>> (XEN) Pagetable walk from ffff8315ffffffe4:
> >>> (XEN) L4[0x106] = 00000000bf569027 5555555555555555
> >>> (XEN) L3[0x057] = 0000000000000000 ffffffffffffffff
> >>> (XEN) stdvga.c:147:d60 entering stdvga and caching modes
> >>> (XEN) 
> >>> (XEN) ****************************************
> >>> (XEN) HVM60: VGABios $Id: vgabios.c,v 1.67 2008/01/27 09:44:12 vruppert Exp
> >>> $
> >>> (XEN) Panic on CPU 0:
> >>> (XEN) FATAL PAGE FAULT
> >>> (XEN) [error_code=0002]
> >>> (XEN) Faulting linear address: ffff8315ffffffe4
> >>> (XEN) ****************************************
> >>> (XEN) 
> >>> (XEN) Manual reset required ('noreboot' specified)
> >>> 
> >>> =============Trace 3: =============
> >>> 
> >>> 
> >>> (XEN) Xen call trace:
> >>> (XEN) [<ffff82c4801153c3>] free_heap_pages+0x283/0x4a0
> >>> (XEN) [<ffff82c480115732>] free_domheap_pages+0x152/0x380
> >>> (XEN) [<ffff82c48014aa89>] relinquish_memory+0x169/0x500
> >>> (XEN) [<ffff82c48014b2cd>] domain_relinquish_resources+0x1ad/0x280
> >>> (XEN) [<ffff82c480105fe0>] domain_kill+0x80/0xf0
> >>> (XEN) [<ffff82c4801043ce>] do_domctl+0x1be/0x1000
> >>> (XEN) [<ffff82c480117804>] csched_acct+0x384/0x430
> >>> (XEN) [<ffff82c4801e3169>] syscall_enter+0xa9/0xae
> >>> 
> >> 
> >> 
> > 
> 
> 
 		 	   		  


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: Xen-unstable panic: FATAL PAGE FAULT
  2010-08-30  8:47         ` MaoXiaoyun
@ 2010-08-30  9:02           ` Keir Fraser
  2010-08-30 13:03             ` MaoXiaoyun
  0 siblings, 1 reply; 39+ messages in thread
From: Keir Fraser @ 2010-08-30  9:02 UTC (permalink / raw)
  To: MaoXiaoyun, xen devel

On 30/08/2010 09:47, "MaoXiaoyun" <tinnycloud@hotmail.com> wrote:

>        3) Every panic pointer to the same address:  ffff8315ffffffe4, which is
> not a valid page address.
> I printted pages of the domain in assign_pages, which all looks like
> ffff82f60bd64000,  at least
> ffff82f60 is the same.

Yes, well you may not be crashing on a supposed page address. Certainly the
page pointer that relinquish_memory() is working on, and passed to
put_page->free_domheap_pages is valid enough to not cause any of those
functions to crash when dereferencing it. At the moment you really have no
idea what is causing free_heap_pages() to crash.

>        A bit of lost direction to go further.  Thanks.

You need to find out which line of code in free_heap_pages() is crashing,
and what variable it is trying to dereference when it crashes. You have a
nice backtrace with an EIP value, so you can 'objdump -d xen-syms' and
search for the EIP in the disassembly. If you have a debug build of Xen you
can even do 'objdump -S xen-syms' and have the disassembly annotated with
corresponding source lines.

Have you seen this on more than one physical machine? If not, have you run
memtest on the offending machine?

 -- Keir

^ permalink raw reply	[flat|nested] 39+ messages in thread

* RE: Xen-unstable panic: FATAL PAGE FAULT
  2010-08-30  9:02           ` Keir Fraser
@ 2010-08-30 13:03             ` MaoXiaoyun
  2010-08-30 13:16               ` Keir Fraser
  0 siblings, 1 reply; 39+ messages in thread
From: MaoXiaoyun @ 2010-08-30 13:03 UTC (permalink / raw)
  To: keir.fraser, xen devel


[-- Attachment #1.1: Type: text/plain, Size: 4301 bytes --]


Thanks for the quick response.

Actually, I did some decoding of the backtrace last Friday.
Using the RIP ffff82c4801153c3, I cut the relevant section out of
"objdump -dS xen-syms" (please see below). It looks as if the crash happens
while walking the domain's page list, which is beyond my understanding. As I
understand it, those domain pages come from the kernel memory zone, they
always reside in physical memory, and their addresses should never have a
chance to change, right?
If so, what is the relationship between all these panics and free_heap_pages()?

Several servers (at least 3) have experienced the same panic in the same test.
Those servers have identical hardware, kernel and Xen configuration.
Right now memtest is running on one server (24GB of memory); it should finish
in a few hours.

------------------------------------------------------------------------------------

static inline void
page_list_del(struct page_info *page, struct page_list_head *head)
{
    struct page_info *next = pdx_to_page(page->list.next);
    struct page_info *prev = pdx_to_page(page->list.prev);
ffff82c4801153b8:	8b 73 04             	mov    0x4(%rbx),%esi
ffff82c4801153bb:	49 8d 0c 06          	lea    (%r14,%rax,1),%rcx
ffff82c4801153bf:	48 8d 05 fa 10 26 00 	lea    2494714(%rip),%rax        # ffff82c4803764c0 <_heap>
ffff82c4801153c6:	48 c1 e1 04          	shl    $0x4,%rcx
ffff82c4801153ca:	4a 03 0c f8          	add    (%rax,%r15,8),%rcx
}
static inline void
page_list_del(struct page_info *page, struct page_list_head *head)
{
    struct page_info *next = pdx_to_page(page->list.next);
ffff82c4801153ce:	8b 03                	mov    (%rbx),%eax
ffff82c4801153d0:	48 c1 e0 05          	shl    $0x5,%rax
ffff82c4801153d4:	48 29 e8             	sub    %rbp,%rax
ffff82c4801153d7:	48 3b 19             	cmp    (%rcx),%rbx
ffff82c4801153da:	0f 84 95 01 00 00    	je     ffff82c480115575 <free_heap_pages+0x405>
    struct page_info *prev = pdx_to_page(page->list.prev);
ffff82c4801153e0:	89 f2                	mov    %esi,%edx
ffff82c4801153e2:	48 c1 e2 05          	shl    $0x5,%rdx
ffff82c4801153e6:	48 29 ea             	sub    %rbp,%rdx
ffff82c4801153e9:	48 3b 59 08          	cmp    0x8(%rcx),%rbx
ffff82c4801153ed:	0f 84 bd 01 00 00    	je     ffff82c4801155b0 <free_heap_pages+0x440>

    if ( !__page_list_del_head(page, head, next, prev) )
    {

------------------------------------------------------------------------------------
 
> Date: Mon, 30 Aug 2010 10:02:05 +0100
> Subject: Re: [Xen-devel] Xen-unstable panic: FATAL PAGE FAULT
> From: keir.fraser@eu.citrix.com
> To: tinnycloud@hotmail.com; xen-devel@lists.xensource.com
> 
> On 30/08/2010 09:47, "MaoXiaoyun" <tinnycloud@hotmail.com> wrote:
> 
> > 3) Every panic pointer to the same address: ffff8315ffffffe4, which is
> > not a valid page address.
> > I printted pages of the domain in assign_pages, which all looks like
> > ffff82f60bd64000, at least
> > ffff82f60 is the same.
> 
> Yes, well you may not be crashing on a supposed page address. Certainly the
> page pointer that relinquish_memory() is working on, and passed to
> put_page->free_domheap_pages is valid enough to not cause any of those
> functions to crash when dereferencing it. At the moment you really have no
> idea what is causing free_heap_pages() to crash.
> 
> > A bit of lost direction to go further. Thanks.
> 
> You need to find out which line of code in free_heap_pages() is crashing,
> and what variable it is trying to dereference when it crashes. You have a
> nice backtrace with an EIP value, so you can 'objdump -d xen-syms' and
> search for the EIP in the disassembly. If you have a debug build of Xen you
> can even do 'objdump -S xen-syms' and have the disassembly annotated with
> corresponding source lines.
> 
> Have you seen this on more than one physical machine? If not, have you run
> memtest on the offending machine?
> 
> -- Keir
> 
> 
 		 	   		  


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: Xen-unstable panic: FATAL PAGE FAULT
  2010-08-30 13:03             ` MaoXiaoyun
@ 2010-08-30 13:16               ` Keir Fraser
  2010-08-31 13:49                 ` MaoXiaoyun
  0 siblings, 1 reply; 39+ messages in thread
From: Keir Fraser @ 2010-08-30 13:16 UTC (permalink / raw)
  To: MaoXiaoyun, xen devel

On 30/08/2010 14:03, "MaoXiaoyun" <tinnycloud@hotmail.com> wrote:

> Appreciate the quick response.

>  
> Actually I have done some decoding of the backtrace last Friday.
> According to the RIP ffff82c4801153c3, I cut the "objdump -dS xen-syms"
> (please see below). It looks like the bug happened on the domain page list

ffff82c4801153c3 isn't the start of an instruction in your below
disassembly. Hence you didn't disassemble exactly the build of Xen which
crashed. It needs to be exactly the same image.

 -- keir

> travels, which is beyond my understanding. Since, in my understanding,
> those domain pages come from the kernel memory zone, they always
> reside in physical memory, and the address shouldn't have the chance
> to be changed, right?
> If so, what is the relationship between all those panic and free_heap_pages?
>  
> Several servers (at least 3) experienced the same panic on the same test.
> Those servers have the identical hardware, kernel and xen configuration.
> Right now, on one server, memtest is running; it shall be finished in a few
> hours.
>  (24G memory)
>  
> ------------------------------------------------------------------------------
> ------
>  169 static inline void
>  170 page_list_del(struct page_info *page, struct page_list_head *head)
>  171 {
>  172     struct page_info *next = pdx_to_page(page->list.next);
>  173     struct page_info *prev = pdx_to_page(page->list.prev);
>  174 ffff82c4801153b8:<++8b 73 04             <++mov    0x4(%rbx),%esi
>  175 ffff82c4801153bb:<++49 8d 0c 06          <++lea    (%r14,%rax,1),%rcx
>  176 ffff82c4801153bf:<++48 8d 05 fa 10 26 00 <++lea    2494714(%rip),%rax
> # ffff82c4803764c0 <_heap>
>  177 ffff82c4801153c6:<++48 c1 e1 04          <++shl    $0x4,%rcx
>  178 ffff82c4801153ca:<++4a 03 0c f8          <++add    (%rax,%r15,8),%rcx
>  179 }
>  180 static inline void
>  181 page_list_del(struct page_info *page, struct page_list_head *head)
>  182 {
>  183     struct page_info *next = pdx_to_page(page->list.next);
>  184 ffff82c4801153ce:<++8b 03                <++mov    (%rbx),%eax
>  185 ffff82c4801153d0:<++48 c1 e0 05          <++shl    $0x5,%rax
>  186 ffff82c4801153d4:<++48 29 e8             <++sub    %rbp,%rax
>  187 ffff82c4801153d7:<++48 3b 19             <++cmp    (%rcx),%rbx
>  188 ffff82c4801153da:<++0f 84 95 01 00 00    <++je     ffff82c480115575
> <free_heap_pages+0x405>
>  189     struct page_info *prev = pdx_to_page(page->list.prev);
>  190 ffff82c4801153e0:<++89 f2                <++mov    %esi,%edx
>  191 ffff82c4801153e2:<++48 c1 e2 05          <++shl    $0x5,%rdx
>  192 ffff82c4801153e6:<++48 29 ea             <++sub    %rbp,%rdx
>  193 ffff82c4801153e9:<++48 3b 59 08          <++cmp    0x8(%rcx),%rbx
>  194 ffff82c4801153ed:<++0f 84 bd 01 00 00    <++je     ffff82c4801155b0
> <free_heap_pages+0x440>
>  195 
>  196     if ( !__page_list_del_head(page, head, next, prev) )
>  197     {
>  198    
> ------------------------------------------------------------------------------
> ------
>  
>> Date: Mon, 30 Aug 2010 10:02:05 +0100
>> Subject: Re: [Xen-devel] Xen-unstable panic: FATAL PAGE FAULT
>> From: keir.fraser@eu.citrix.com
>> To: tinnycloud@hotmail.com; xen-devel@lists.xensource.com
>> 
>> On 30/08/2010 09:47, "MaoXiaoyun" <tinnycloud@hotmail.com> wrote:
>> 
>>> 3) Every panic pointer to the same address: ffff8315ffffffe4, which is
>>> not a valid page address.
>>> I printted pages of the domain in assign_pages, which all looks like
>>> ffff82f60bd64000, at least
>>> ffff82f60 is the same.
>> 
>> Yes, well you may not be crashing on a supposed page address. Certainly the
>> page pointer that relinquish_memory() is working on, and passed to
>> put_page->free_domheap_pages is valid enough to not cause any of those
>> functions to crash when dereferencing it. At the moment you really have no
>> idea what is causing free_heap_pages() to crash.
>> 
>>> A bit of lost direction to go further. Thanks.
>> 
>> You need to find out which line of code in free_heap_pages() is crashing,
>> and what variable it is trying to dereference when it crashes. You have a
>> nice backtrace with an EIP value, so you can 'objdump -d xen-syms' and
>> search for the EIP in the disassembly. If you have a debug build of Xen you
>> can even do 'objdump -S xen-syms' and have the disassembly annotated with
>> corresponding source lines.
>> 
>> Have you seen this on more than one physical machine? If not, have you run
>> memtest on the offending machine?
>> 
>> -- Keir
>> 
>> 

^ permalink raw reply	[flat|nested] 39+ messages in thread

* RE: Xen-unstable panic: FATAL PAGE FAULT
  2010-08-30 13:16               ` Keir Fraser
@ 2010-08-31 13:49                 ` MaoXiaoyun
  2010-08-31 14:49                   ` Keir Fraser
  0 siblings, 1 reply; 39+ messages in thread
From: MaoXiaoyun @ 2010-08-31 13:49 UTC (permalink / raw)
  To: keir.fraser, xen devel


[-- Attachment #1.1: Type: text/plain, Size: 11064 bytes --]


Hi Keir:

 

         Thank you for correcting my mistakes. 

         Here is the latest panic and its objdump.

         I am not familiar with assembly language or the usage of those registers.

         I will spend some more time to get a better understanding.

         What's your opinion?

         btw, the memtest is still running, so far so good, thanks.

 

------------------objdump------------------------------------------------------------------------

 177 ffff82c480115396:<++48 c1 e1 04          <++shl    $0x4,%rcx
 178 ffff82c48011539a:<++4a 03 0c f8          <++add    (%rax,%r15,8),%rcx
 179 }
 180 static inline void
 181 page_list_del(struct page_info *page, struct page_list_head *head)
 182 {
 183     struct page_info *next = pdx_to_page(page->list.next);
 184 ffff82c48011539e:<++8b 03                <++mov    (%rbx),%eax
 185 ffff82c4801153a0:<++48 c1 e0 05          <++shl    $0x5,%rax
 186 ffff82c4801153a4:<++48 29 e8             <++sub    %rbp,%rax
 187 ffff82c4801153a7:<++48 3b 19             <++cmp    (%rcx),%rbx
 188 ffff82c4801153aa:<++0f 84 95 01 00 00    <++je     ffff82c480115545 <free_heap_pages+0x405>
 189     struct page_info *prev = pdx_to_page(page->list.prev);
 190 ffff82c4801153b0:<++89 f2                <++mov    %esi,%edx
 191 ffff82c4801153b2:<++48 c1 e2 05          <++shl    $0x5,%rdx
 192 ffff82c4801153b6:<++48 29 ea             <++sub    %rbp,%rdx
 193 ffff82c4801153b9:<++48 3b 59 08          <++cmp    0x8(%rcx),%rbx
 194 ffff82c4801153bd:<++0f 84 bd 01 00 00    <++je     ffff82c480115580 <free_heap_pages+0x440>
 195 
 196     if ( !__page_list_del_head(page, head, next, prev) )
 197     {
 198         next->list.prev = page->list.prev;
 199 ffff82c4801153c3:<++89 70 04             <++mov    %esi,0x4(%rax)
 200         prev->list.next = page->list.next;
 201 ffff82c4801153c6:<++8b 03                <++mov    (%rbx),%eax                                                                                          
 202 ffff82c4801153c8:<++89 02                <++mov    %eax,(%rdx)
 203 ffff82c4801153ca:<++49 89 dd             <++mov    %rbx,%r13
 204 ffff82c4801153cd:<++41 83 c4 01          <++add    $0x1,%r12d
 205 ffff82c4801153d1:<++41 83 fc 12          <++cmp    $0x12,%r12d
 206 ffff82c4801153d5:<++0f 84 e3 00 00 00    <++je     ffff82c4801154be <free_heap_pages+0x37e>
 207 ffff82c4801153db:<++48 bd 00 00 00 00 0a <++mov    $0x7d0a00000000,%rbp
 208 ffff82c4801153e2:<++7d 00 00
 209 ffff82c4801153e5:<++44 89 e1             <++mov    %r12d,%ecx
 210 ffff82c4801153e8:<++be 01 00 00 00       <++mov    $0x1,%esi
 

 

---------------------------------------------------------------------------------------------------

blktap_sysfs_create: adding attributes for dev ffff880239496c00
(XEN) ----[ Xen-4.0.0  x86_64  debug=n  Not tainted ]----
(XEN) CPU:    2
(XEN) RIP:    e008:[<ffff82c4801153c3>] free_heap_pages+0x283/0x4a0
(XEN) RFLAGS: 0000000000010282   CONTEXT: hypervisor
(XEN) rax: ffff8315ffffffe0   rbx: ffff82f6093b0040   rcx: ffff83063fc01a20
(XEN) rdx: ffff8315ffffffe0   rsi: 00000000ffffffff   rdi: 000000000049d802
(XEN) rbp: 00007d0a00000000   rsp: ffff83023ff37cb8   r8:  0000000000000000
(XEN) r9:  ffffffffffffffff   r10: ffff83060a3c0018   r11: 0000000000000282
(XEN) r12: 0000000000000000   r13: ffff82f6093b0060   r14: 00000000000001a2
(XEN) r15: 0000000000000001   cr0: 000000008005003b   cr4: 00000000000026f0
(XEN) cr3: 000000008da54000   cr2: ffff8315ffffffe4
(XEN) ds: 0000   es: 0000   fs: 0063   gs: 0000   ss: e010   cs: e008
(XEN) Xen stack trace from rsp=ffff83023ff37cb8:
(XEN)    ffff82f6093b7f80 00000000ffffffe0 00000000000001a2 ffff83060a3c0000
(XEN)    0000000000000000 0000000000000001 ffff82f6093b0060 0000000000000000
(XEN)    ffff82f6093b0080 ffff82c480115732 00000001093b7cc0 ffff82f6093b0060
(XEN)    ffff83060a3c0018 0000000000000000 ffff83060a3c0000 ffff83060a3c0fa8
(XEN)    0000000000000000 ffff82c48014aaa6 ffff83060a3c0fa8 ffff83060a3c0fa8
(XEN)    ffff83060a3c0014 4000000000000000 ffff83023ff37f28 ffff83060a3c0018
(XEN)    0000000000000000 ffff83060a3c0000 0000000000305000 0000000000000009
(XEN)    0000000000000009 ffff82c48014b2fd 00ffffffffffffff ffff83060a3c0000
(XEN)    0000000000000000 ffff83023ff37e28 0000000000305000 ffff82c480105fe0
(XEN)    ffff82c480255240 fffffffffffffff3 0000000002599000 ffff82c4801043ce
(XEN)    ffff82c4801447da 0000000000000080 ffff83023ff37f28 0000000000000096
(XEN)    ffff83023ff37f28 00000000000000fc 0000000600000002 00000000023c0031
(XEN)    0000000000000001 00000039890a8e2a 0000003000000018 000000004523af30
(XEN)    000000004523ae70 0000000000000000 00007fc608ea8a70 000000398903c8a4
(XEN)    000000004523af44 0000000000000000 000000004523b158 0000000000000000
(XEN)    0000007f024f6d20 00007fc60a094750 000000000255ff40 00007fc607be5ea8
(XEN)    fffffffffffffff5 0000000000000246 00000039880cc557 0000000000000100
(XEN)    00000039880cc557 0000000000000033 0000000000000246 ffff8300bf562000
(XEN)    ffff8801db8d3e78 000000004523aec0 0000000000305000 0000000000000009
(XEN)    0000000000000009 ffff82c4801e3169 0000000000000009 0000000000000009
(XEN) Xen call trace:
(XEN)    [<ffff82c4801153c3>] free_heap_pages+0x283/0x4a0
(XEN)    [<ffff82c480115732>] free_domheap_pages+0x152/0x380
(XEN)    [<ffff82c48014aaa6>] relinquish_memory+0x186/0x530
(XEN)    [<ffff82c48014b2fd>] domain_relinquish_resources+0x1ad/0x280
(XEN)    [<ffff82c480105fe0>] domain_kill+0x80/0xf0
(XEN)    [<ffff82c4801043ce>] do_domctl+0x1be/0x1000
(XEN)    [<ffff82c4801447da>] __find_next_bit+0x6a/0x70
(XEN)    [<ffff82c4801e3169>] syscall_enter+0xa9/0xae
(XEN)    
(XEN) Pagetable walk from ffff8315ffffffe4:
(XEN)  L4[0x106] = 00000000bf569027 5555555555555555
(XEN)  L3[0x057] = 0000000000000000 ffffffffffffffff
(XEN) 
(XEN) ****************************************
(XEN) Panic on CPU 2:
(XEN) FATAL PAGE FAULT
(XEN) [error_code=0002]
(XEN) Faulting linear address: ffff8315ffffffe4
(XEN) ****************************************
(XEN) 
(XEN) Manual reset required ('noreboot' specified)
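The faulting RIP in this trace (ffff82c4801153c3) lands on the `next->list.prev = page->list.prev` store inside page_list_del(). A much-simplified model of that unlink (plain pointers instead of Xen's compressed pdx links; the NULL check is added here for the sketch, Xen does no such check) shows why a corrupt `next` link faults at exactly that store:

```c
#include <assert.h>
#include <stddef.h>

/* Simplified model of the unlink free_heap_pages() performs via
 * page_list_del(). Xen actually stores compressed pdx indices in the
 * links and decodes them with pdx_to_page(); plain pointers keep the
 * sketch short. */
struct page_info {
    struct page_info *next, *prev;
};

/* Returns 0 on success, -1 if a neighbour link is missing. In the crash
 * above, `next` decoded to the bogus address ffff8315ffffffe0 and the
 * first store below faulted. */
static int page_list_del_checked(struct page_info *page)
{
    struct page_info *next = page->next;
    struct page_info *prev = page->prev;

    if (next == NULL || prev == NULL)
        return -1;

    next->prev = prev;   /* the store at the faulting RIP */
    prev->next = next;
    return 0;
}
```

Consistent with this, the fault address cr2 = ffff8315ffffffe4 is ffff8315ffffffe0 + 4, i.e. the offset of the prev field within the decoded-but-bogus `next` page, matching the `mov %esi,0x4(%rax)` at the crash RIP (rax = ffff8315ffffffe0 in the register dump).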

 

---------------------------------------------------------------------------------------------------
> Date: Mon, 30 Aug 2010 14:16:09 +0100
> Subject: Re: [Xen-devel] Xen-unstable panic: FATAL PAGE FAULT
> From: keir.fraser@eu.citrix.com
> To: tinnycloud@hotmail.com; xen-devel@lists.xensource.com
> 
> On 30/08/2010 14:03, "MaoXiaoyun" <tinnycloud@hotmail.com> wrote:
> 
> > Appreciate for the quick response.
> > 
> > Actually I have done some decoding of the backtrace last Friday.
> > According to the RIP ffff82c4801153c3, I cut the "objdump -dS xen-syms"
> > (please see below). It looks like the bug happened on the domain page list
> 
> ffff82c4801153c3 isn't the start of an instruction in your below
> disassembly. Hence you didn't disassemble exactly the build of Xen which
> crashed. It needs to be exactly the same image.
> 
> -- keir
> 
> > travels, which is beyond my understanding. Since in my understanding,
> > those domain pages come from kernel memory zone, they are always
> > reside in the physical memory, and the address shouldn't have the chance
> > to be changed, right?
> > If so, what is the relationship between all those panic and free_heap_pages?
> > 
> > Several servers (at least 3) experienced the same panic on the same test.
> > Those servers have the identical hardware, kernel and xen configuration.
> > Right now, on one server, memtest is running, shall be finished in a few
> > hours.
> > (24G memory)
> > 
> > ------------------------------------------------------------------------------
> > ------
> > 169 static inline void
> > 170 page_list_del(struct page_info *page, struct page_list_head *head)
> > 171 {
> > 172 struct page_info *next = pdx_to_page(page->list.next);
> > 173 struct page_info *prev = pdx_to_page(page->list.prev);
> > 174 ffff82c4801153b8:<++8b 73 04 <++mov 0x4(%rbx),%esi
> > 175 ffff82c4801153bb:<++49 8d 0c 06 <++lea (%r14,%rax,1),%rcx
> > 176 ffff82c4801153bf:<++48 8d 05 fa 10 26 00 <++lea 2494714(%rip),%rax
> > # ffff82c4803764c0 <_heap>
> > 177 ffff82c4801153c6:<++48 c1 e1 04 <++shl $0x4,%rcx
> > 178 ffff82c4801153ca:<++4a 03 0c f8 <++add (%rax,%r15,8),%rcx
> > 179 }
> > 180 static inline void
> > 181 page_list_del(struct page_info *page, struct page_list_head *head)
> > 182 {
> > 183 struct page_info *next = pdx_to_page(page->list.next);
> > 184 ffff82c4801153ce:<++8b 03 <++mov (%rbx),%eax
> > 185 ffff82c4801153d0:<++48 c1 e0 05 <++shl $0x5,%rax
> > 186 ffff82c4801153d4:<++48 29 e8 <++sub %rbp,%rax
> > 187 ffff82c4801153d7:<++48 3b 19 <++cmp (%rcx),%rbx
> > 188 ffff82c4801153da:<++0f 84 95 01 00 00 <++je ffff82c480115575
> > <free_heap_pages+0x405>
> > 189 struct page_info *prev = pdx_to_page(page->list.prev);
> > 190 ffff82c4801153e0:<++89 f2 <++mov %esi,%edx
> > 191 ffff82c4801153e2:<++48 c1 e2 05 <++shl $0x5,%rdx
> > 192 ffff82c4801153e6:<++48 29 ea <++sub %rbp,%rdx
> > 193 ffff82c4801153e9:<++48 3b 59 08 <++cmp 0x8(%rcx),%rbx
> > 194 ffff82c4801153ed:<++0f 84 bd 01 00 00 <++je ffff82c4801155b0
> > <free_heap_pages+0x440>
> > 195 
> > 196 if ( !__page_list_del_head(page, head, next, prev) )
> > 197 {
> > 198 
> > ------------------------------------------------------------------------------
> > ------
> > 
> >> Date: Mon, 30 Aug 2010 10:02:05 +0100
> >> Subject: Re: [Xen-devel] Xen-unstable panic: FATAL PAGE FAULT
> >> From: keir.fraser@eu.citrix.com
> >> To: tinnycloud@hotmail.com; xen-devel@lists.xensource.com
> >> 
> >> On 30/08/2010 09:47, "MaoXiaoyun" <tinnycloud@hotmail.com> wrote:
> >> 
> >>> 3) Every panic pointer to the same address: ffff8315ffffffe4, which is
> >>> not a valid page address.
> >>> I printted pages of the domain in assign_pages, which all looks like
> >>> ffff82f60bd64000, at least
> >>> ffff82f60 is the same.
> >> 
> >> Yes, well you may not be crashing on a supposed page address. Certainly the
> >> page pointer that relinquish_memory() is working on, and passed to
> >> put_page->free_domheap_pages is valid enough to not cause any of those
> >> functions to crash when dereferencing it. At the moment you really have no
> >> idea what is causing free_heap_pages() to crash.
> >> 
> >>> A bit of lost direction to go further. Thanks.
> >> 
> >> You need to find out which line of code in free_heap_pages() is crashing,
> >> and what variable it is trying to dereference when it crashes. You have a
> >> nice backtrace with an EIP value, so you can 'objdump -d xen-syms' and
> >> search for the EIP in the disassembly. If you have a debug build of Xen you
> >> can even do 'objdump -S xen-syms' and have the disassembly annotated with
> >> corresponding source lines.
> >> 
> >> Have you seen this on more than one physical machine? If not, have you run
> >> memtest on the offending machine?
> >> 
> >> -- Keir
> >> 
> >> 
> > 
> 
> 

[-- Attachment #1.2: Type: text/html, Size: 15935 bytes --]

[-- Attachment #2: Type: text/plain, Size: 138 bytes --]

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xensource.com
http://lists.xensource.com/xen-devel

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: Xen-unstable panic: FATAL PAGE FAULT
  2010-08-31 13:49                 ` MaoXiaoyun
@ 2010-08-31 14:49                   ` Keir Fraser
  2010-08-31 15:00                     ` Keir Fraser
                                       ` (2 more replies)
  0 siblings, 3 replies; 39+ messages in thread
From: Keir Fraser @ 2010-08-31 14:49 UTC (permalink / raw)
  To: MaoXiaoyun, xen devel; +Cc: Jan Beulich

Do you have a line in Xen boot output that starts "PFN compression on bits"?
If so what does it say?

My suspicion is that Jan Beulich's patches to implement a consolidated page
array for sparse memory maps has broken the assumption in some Xen code
that:
 page_to_mfn(mfn_to_page(x)+y) == x+y, for all valid mfns x, and all y up to
some pretty big limit.

Looking in free_heap_pages() I see we do a whole bunch of chunk merging in
our buddy allocator, doing arithmetic on variable 'pg' to find neighbouring
chunks. It's a bit dodgy I suspect.

I'm cc'ing Jan to see what we can get away with in doing arithmetic on
page_info pointers. What's the guaranteed smallest aligned contiguous ranges
of mfn in the frame_table now, Jan? (i.e., ranges in which adjacent
page_info structs relate to adjacent MFNs)

If this is the problem I'm pretty sure we can come up with a patch quite
easily, but depending on the answer to my above question to Jan, we may need
to do some code auditing.

 -- Keir
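The assumption Keir describes can be illustrated with a toy model of PDX compression. This is a sketch, not Xen's real code: mfn_to_pdx/pdx_to_mfn do exist in Xen, but the hole position used here (bits 20..23) is invented for illustration. Two MFNs on either side of the squeezed-out hole become adjacent frame-table entries, so page_to_mfn(mfn_to_page(x)+y) == x+y fails across the boundary:

```c
#include <assert.h>
#include <stdint.h>

/* Toy model of PDX compression: a run of always-zero MFN bits -- here
 * bits 20..23, an invented example -- is squeezed out to form the
 * index (pdx) into the frame table. */
#define HOLE_SHIFT  20
#define HOLE_BITS   4
#define BOTTOM_MASK ((1u << HOLE_SHIFT) - 1)

static uint32_t mfn_to_pdx(uint32_t mfn)
{
    /* keep the low bits, shift the bits above the hole down */
    return (mfn & BOTTOM_MASK) | ((mfn >> HOLE_BITS) & ~BOTTOM_MASK);
}

static uint32_t pdx_to_mfn(uint32_t pdx)
{
    /* inverse: re-expand the bits above the hole */
    return (pdx & BOTTOM_MASK) | ((pdx & ~BOTTOM_MASK) << HOLE_BITS);
}
```

With this layout, MFN 0x000fffff and MFN 0x01000000 occupy adjacent page_info slots (pdx 0xfffff and 0x100000) even though the MFNs are nowhere near adjacent, which is exactly the neighbour-merging pointer arithmetic free_heap_pages() performs.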

On 31/08/2010 14:49, "MaoXiaoyun" <tinnycloud@hotmail.com> wrote:

> Hi Keir:
>  
>          Thank you for correcting my mistakes.
>          Here is the latest panic and its objdump.
>          I am not familiar with assembly language or the usage of those registers.
>          I will spend some more time to get a better understanding.
>          What's your opinion?
>          btw, the memtest is still running, so far so good, thanks.
>  
> ------------------objdump-----------------------------------------------------
> -------------------
>  177 ffff82c480115396:<++48 c1 e1 04          <++shl    $0x4,%rcx
>  178 ffff82c48011539a:<++4a 03 0c f8          <++add    (%rax,%r15,8),%rcx
>  179 }
>  180 static inline void
>  181 page_list_del(struct page_info *page, struct page_list_head *head)
>  182 {
>  183     struct page_info *next = pdx_to_page(page->list.next);
>  184 ffff82c48011539e:<++8b 03                <++mov    (%rbx),%eax
>  185 ffff82c4801153a0:<++48 c1 e0 05          <++shl    $0x5,%rax
>  186 ffff82c4801153a4:<++48 29 e8             <++sub    %rbp,%rax
>  187 ffff82c4801153a7:<++48 3b 19             <++cmp    (%rcx),%rbx
>  188 ffff82c4801153aa:<++0f 84 95 01 00 00    <++je     ffff82c480115545
> <free_heap_pages+0x405>
>  189     struct page_info *prev = pdx_to_page(page->list.prev);
>  190 ffff82c4801153b0:<++89 f2                <++mov    %esi,%edx
>  191 ffff82c4801153b2:<++48 c1 e2 05          <++shl    $0x5,%rdx
>  192 ffff82c4801153b6:<++48 29 ea             <++sub    %rbp,%rdx
>  193 ffff82c4801153b9:<++48 3b 59 08          <++cmp    0x8(%rcx),%rbx
>  194 ffff82c4801153bd:<++0f 84 bd 01 00 00    <++je     ffff82c480115580
> <free_heap_pages+0x440>
>  195 
>  196     if ( !__page_list_del_head(page, head, next, prev) )
>  197     {
>  198         next->list.prev = page->list.prev;
>  199 ffff82c4801153c3:<++89 70 04             <++mov    %esi,0x4(%rax)
>  200         prev->list.next = page->list.next;
>  201 ffff82c4801153c6:<++8b 03                <++mov    (%rbx),%eax
>  202 ffff82c4801153c8:<++89 02                <++mov    %eax,(%rdx)
>  203 ffff82c4801153ca:<++49 89 dd             <++mov    %rbx,%r13
>  204 ffff82c4801153cd:<++41 83 c4 01          <++add    $0x1,%r12d
>  205 ffff82c4801153d1:<++41 83 fc 12          <++cmp    $0x12,%r12d
>  206 ffff82c4801153d5:<++0f 84 e3 00 00 00    <++je     ffff82c4801154be
> <free_heap_pages+0x37e>
>  207 ffff82c4801153db:<++48 bd 00 00 00 00 0a <++mov    $0x7d0a00000000,%rbp
>  208 ffff82c4801153e2:<++7d 00 00
>  209 ffff82c4801153e5:<++44 89 e1             <++mov    %r12d,%ecx
>  210 ffff82c4801153e8:<++be 01 00 00 00       <++mov    $0x1,%esi
>  
>  
> ------------------------------------------------------------------------------
> ---------------------
> blktap_sysfs_create: adding attributes for dev ffff880239496c00
> (XEN) ----[ Xen-4.0.0  x86_64  debug=n  Not tainted ]----
> (XEN) CPU:    2
> (XEN) RIP:    e008:[<ffff82c4801153c3>] free_heap_pages+0x283/0x4a0
> (XEN) RFLAGS: 0000000000010282   CONTEXT: hypervisor
> (XEN) rax: ffff8315ffffffe0   rbx: ffff82f6093b0040   rcx: ffff83063fc01a20
> (XEN) rdx: ffff8315ffffffe0   rsi: 00000000ffffffff   rdi: 000000000049d802
> (XEN) rbp: 00007d0a00000000   rsp: ffff83023ff37cb8   r8:  0000000000000000
> (XEN) r9:  ffffffffffffffff   r10: ffff83060a3c0018   r11: 0000000000000282
> (XEN) r12: 0000000000000000   r13: ffff82f6093b0060   r14: 00000000000001a2
> (XEN) r15: 0000000000000001   cr0: 000000008005003b   cr4: 00000000000026f0
> (XEN) cr3: 000000008da54000   cr2: ffff8315ffffffe4
> (XEN) ds: 0000   es: 0000   fs: 0063   gs: 0000   ss: e010   cs: e008
> (XEN) Xen stack trace from rsp=ffff83023ff37cb8:
> (XEN)    ffff82f6093b7f80 00000000ffffffe0 00000000000001a2 ffff83060a3c0000
> (XEN)    0000000000000000 0000000000000001 ffff82f6093b0060 0000000000000000
> (XEN)    ffff82f6093b0080 ffff82c480115732 00000001093b7cc0 ffff82f6093b0060
> (XEN)    ffff83060a3c0018 0000000000000000 ffff83060a3c0000 ffff83060a3c0fa8
> (XEN)    0000000000000000 ffff82c48014aaa6 ffff83060a3c0fa8 ffff83060a3c0fa8
> (XEN)    ffff83060a3c0014 4000000000000000 ffff83023ff37f28 ffff83060a3c0018
> (XEN)    0000000000000000 ffff83060a3c0000 0000000000305000 0000000000000009
> (XEN)    0000000000000009 ffff82c48014b2fd 00ffffffffffffff ffff83060a3c0000
> (XEN)    0000000000000000 ffff83023ff37e28 0000000000305000 ffff82c480105fe0
> (XEN)    ffff82c480255240 fffffffffffffff3 0000000002599000 ffff82c4801043ce
> (XEN)    ffff82c4801447da 0000000000000080 ffff83023ff37f28 0000000000000096
> (XEN)    ffff83023ff37f28 00000000000000fc 0000000600000002 00000000023c0031
> (XEN)    0000000000000001 00000039890a8e2a 0000003000000018 000000004523af30
> (XEN)    000000004523ae70 0000000000000000 00007fc608ea8a70 000000398903c8a4
> (XEN)    000000004523af44 0000000000000000 000000004523b158 0000000000000000
> (XEN)    0000007f024f6d20 00007fc60a094750 000000000255ff40 00007fc607be5ea8
> (XEN)    fffffffffffffff5 0000000000000246 00000039880cc557 0000000000000100
> (XEN)    00000039880cc557 0000000000000033 0000000000000246 ffff8300bf562000
> (XEN)    ffff8801db8d3e78 000000004523aec0 0000000000305000 0000000000000009
> (XEN)    0000000000000009 ffff82c4801e3169 0000000000000009 0000000000000009
> (XEN) Xen call trace:
> (XEN)    [<ffff82c4801153c3>] free_heap_pages+0x283/0x4a0
> (XEN)    [<ffff82c480115732>] free_domheap_pages+0x152/0x380
> (XEN)    [<ffff82c48014aaa6>] relinquish_memory+0x186/0x530
> (XEN)    [<ffff82c48014b2fd>] domain_relinquish_resources+0x1ad/0x280
> (XEN)    [<ffff82c480105fe0>] domain_kill+0x80/0xf0
> (XEN)    [<ffff82c4801043ce>] do_domctl+0x1be/0x1000
> (XEN)    [<ffff82c4801447da>] __find_next_bit+0x6a/0x70
> (XEN)    [<ffff82c4801e3169>] syscall_enter+0xa9/0xae
> (XEN)    
> (XEN) Pagetable walk from ffff8315ffffffe4:
> (XEN)  L4[0x106] = 00000000bf569027 5555555555555555
> (XEN)  L3[0x057] = 0000000000000000 ffffffffffffffff
> (XEN) 
> (XEN) ****************************************
> (XEN) Panic on CPU 2:
> (XEN) FATAL PAGE FAULT
> (XEN) [error_code=0002]
> (XEN) Faulting linear address: ffff8315ffffffe4
> (XEN) ****************************************
> (XEN) 
> (XEN) Manual reset required ('noreboot' specified)
>  
> ------------------------------------------------------------------------------
> ---------------------
>> Date: Mon, 30 Aug 2010 14:16:09 +0100
>> Subject: Re: [Xen-devel] Xen-unstable panic: FATAL PAGE FAULT
>> From: keir.fraser@eu.citrix.com
>> To: tinnycloud@hotmail.com; xen-devel@lists.xensource.com
>> 
>> On 30/08/2010 14:03, "MaoXiaoyun" <tinnycloud@hotmail.com> wrote:
>> 
>>> Appreciate for the quick response.
>>> 
>>> Actually I have done some decoding of the backtrace last Friday.
>>> According to the RIP ffff82c4801153c3, I cut the "objdump -dS xen-syms"
>>> (please see below). It looks like the bug happened on the domain page list
>> 
>> ffff82c4801153c3 isn't the start of an instruction in your below
>> disassembly. Hence you didn't disassemble exactly the build of Xen which
>> crashed. It needs to be exactly the same image.
>> 
>> -- keir
>> 
>>> travels, which is beyond my understanding. Since in my understanding,
>>> those domain pages come from kernel memory zone, they are always
>>> reside in the physical memory, and the address shouldn't have the chance
>>> to be changed, right?
>>> If so, what is the relationship between all those panic and free_heap_pages?
>>> 
>>> Several servers (at least 3) experienced the same panic on the same test.
>>> Those servers have the identical hardware, kernel and xen configuration.
>>> Right now, on one server, memtest is running, shall be finished in a few
>>> hours.
>>> (24G memory)
>>> 
>>> ----------------------------------------------------------------------------
>>> --
>>> ------
>>> 169 static inline void
>>> 170 page_list_del(struct page_info *page, struct page_list_head *head)
>>> 171 {
>>> 172 struct page_info *next = pdx_to_page(page->list.next);
>>> 173 struct page_info *prev = pdx_to_page(page->list.prev);
>>> 174 ffff82c4801153b8:<++8b 73 04 <++mov 0x4(%rbx),%esi
>>> 175 ffff82c4801153bb:<++49 8d 0c 06 <++lea (%r14,%rax,1),%rcx
>>> 176 ffff82c4801153bf:<++48 8d 05 fa 10 26 00 <++lea 2494714(%rip),%rax
>>> # ffff82c4803764c0 <_heap>
>>> 177 ffff82c4801153c6:<++48 c1 e1 04 <++shl $0x4,%rcx
>>> 178 ffff82c4801153ca:<++4a 03 0c f8 <++add (%rax,%r15,8),%rcx
>>> 179 }
>>> 180 static inline void
>>> 181 page_list_del(struct page_info *page, struct page_list_head *head)
>>> 182 {
>>> 183 struct page_info *next = pdx_to_page(page->list.next);
>>> 184 ffff82c4801153ce:<++8b 03 <++mov (%rbx),%eax
>>> 185 ffff82c4801153d0:<++48 c1 e0 05 <++shl $0x5,%rax
>>> 186 ffff82c4801153d4:<++48 29 e8 <++sub %rbp,%rax
>>> 187 ffff82c4801153d7:<++48 3b 19 <++cmp (%rcx),%rbx
>>> 188 ffff82c4801153da:<++0f 84 95 01 00 00 <++je ffff82c480115575
>>> <free_heap_pages+0x405>
>>> 189 struct page_info *prev = pdx_to_page(page->list.prev);
>>> 190 ffff82c4801153e0:<++89 f2 <++mov %esi,%edx
>>> 191 ffff82c4801153e2:<++48 c1 e2 05 <++shl $0x5,%rdx
>>> 192 ffff82c4801153e6:<++48 29 ea <++sub %rbp,%rdx
>>> 193 ffff82c4801153e9:<++48 3b 59 08 <++cmp 0x8(%rcx),%rbx
>>> 194 ffff82c4801153ed:<++0f 84 bd 01 00 00 <++je ffff82c4801155b0
>>> <free_heap_pages+0x440>
>>> 195 
>>> 196 if ( !__page_list_del_head(page, head, next, prev) )
>>> 197 {
>>> 198 
>>> ----------------------------------------------------------------------------
>>> --
>>> ------
>>> 
>>>> Date: Mon, 30 Aug 2010 10:02:05 +0100
>>>> Subject: Re: [Xen-devel] Xen-unstable panic: FATAL PAGE FAULT
>>>> From: keir.fraser@eu.citrix.com
>>>> To: tinnycloud@hotmail.com; xen-devel@lists.xensource.com
>>>> 
>>>> On 30/08/2010 09:47, "MaoXiaoyun" <tinnycloud@hotmail.com> wrote:
>>>> 
>>>>> 3) Every panic pointer to the same address: ffff8315ffffffe4, which is
>>>>> not a valid page address.
>>>>> I printted pages of the domain in assign_pages, which all looks like
>>>>> ffff82f60bd64000, at least
>>>>> ffff82f60 is the same.
>>>> 
>>>> Yes, well you may not be crashing on a supposed page address. Certainly the
>>>> page pointer that relinquish_memory() is working on, and passed to
>>>> put_page->free_domheap_pages is valid enough to not cause any of those
>>>> functions to crash when dereferencing it. At the moment you really have no
>>>> idea what is causing free_heap_pages() to crash.
>>>> 
>>>>> A bit of lost direction to go further. Thanks.
>>>> 
>>>> You need to find out which line of code in free_heap_pages() is crashing,
>>>> and what variable it is trying to dereference when it crashes. You have a
>>>> nice backtrace with an EIP value, so you can 'objdump -d xen-syms' and
>>>> search for the EIP in the disassembly. If you have a debug build of Xen you
>>>> can even do 'objdump -S xen-syms' and have the disassembly annotated with
>>>> corresponding source lines.
>>>> 
>>>> Have you seen this on more than one physical machine? If not, have you run
>>>> memtest on the offending machine?
>>>> 
>>>> -- Keir
>>>> 
>>>> 
>>> 
>> 
>> 

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: Xen-unstable panic: FATAL PAGE FAULT
  2010-08-31 14:49                   ` Keir Fraser
@ 2010-08-31 15:00                     ` Keir Fraser
  2010-08-31 15:07                     ` Jan Beulich
  2010-09-01  3:17                     ` MaoXiaoyun
  2 siblings, 0 replies; 39+ messages in thread
From: Keir Fraser @ 2010-08-31 15:00 UTC (permalink / raw)
  To: MaoXiaoyun, xen devel; +Cc: Jan Beulich

On 31/08/2010 15:49, "Keir Fraser" <keir.fraser@eu.citrix.com> wrote:

> I'm cc'ing Jan to see what we can get away with in doing arithmetic on
> page_info pointers. What's the guaranteed smallest aligned contiguous ranges
> of mfn in the frame_table now, Jan? (i.e., ranges in which adjacent
> page_info structs relate to adjacent MFNs)
> 
> If this is the problem I'm pretty sure we can come up with a patch quite
> easily, but depending on the answer to my above question to Jan, we may need
> to do some code auditing.

Actually I think we get away with it if it is guaranteed that:
pfn_pdx_bottom_mask >= (1<<MAX_ORDER)-1

I don't see that this is guaranteed by pfn_pdx_hole_setup() but it would be
easy to do, have no real harm on the technique's space saving, and I think
then all of our existing page_info pointer arithmetic would be guaranteed to
just work as it always has done.
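The proposed guarantee is easy to state as a predicate. This sketch assumes MAX_ORDER is 20 (Xen x86's value; it is configuration-dependent): if the uncompressed bottom bits of the pdx cover at least MAX_ORDER bits, no MAX_ORDER-aligned buddy chunk can straddle a compression hole.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed value for this sketch (Xen x86 uses 20). */
#define MAX_ORDER 20

/* pfn_pdx_bottom_mask is the mask of low pdx bits that pass through
 * compression unchanged. If it spans at least MAX_ORDER contiguous low
 * bits, any 2^MAX_ORDER-page chunk stays inside one contiguous run of
 * the frame table, so page_info pointer arithmetic within a chunk is
 * safe. */
static int pdx_arith_safe(uint64_t pfn_pdx_bottom_mask)
{
    return pfn_pdx_bottom_mask >= ((1ull << MAX_ORDER) - 1);
}
```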

Anyway, need to know if you have the line about "PFN compression" in your
'xm dmesg' boot output.

 -- Keir

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: Xen-unstable panic: FATAL PAGE FAULT
  2010-08-31 14:49                   ` Keir Fraser
  2010-08-31 15:00                     ` Keir Fraser
@ 2010-08-31 15:07                     ` Jan Beulich
  2010-08-31 16:01                       ` Keir Fraser
  2010-09-01  3:17                     ` MaoXiaoyun
  2 siblings, 1 reply; 39+ messages in thread
From: Jan Beulich @ 2010-08-31 15:07 UTC (permalink / raw)
  To: Keir Fraser; +Cc: MaoXiaoyun, xen devel

>>> On 31.08.10 at 16:49, Keir Fraser <keir.fraser@eu.citrix.com> wrote:
> I'm cc'ing Jan to see what we can get away with in doing arithmetic on
> page_info pointers. What's the guaranteed smallest aligned contiguous ranges
> of mfn in the frame_table now, Jan? (i.e., ranges in which adjacent
> page_info structs relate to adjacent MFNs)

Any range of struct page_info-s that crosses a 2Mb boundary is
unsafe to make assumptions upon (with 32-byte struct page_info
that means 256Mb of memory, but if struct page_info grows that
range might shrink). If that limit is too low, we might need to
enforce a lower limit on the bit positions on which compression
may be done (possibly at the price of doing less compression).

Jan
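Jan's 256Mb figure follows from simple arithmetic, sketched here: a 2Mb-aligned slice of the frame table holds 2Mb / 32 bytes = 65536 page_info entries, and each entry describes one 4Kb page.

```c
#include <assert.h>
#include <stdint.h>

/* The arithmetic behind Jan's numbers: within a 2Mb-aligned span of the
 * frame table, adjacent page_info structs are guaranteed to describe
 * adjacent MFNs. */
#define FRAMETABLE_SPAN (2ull << 20)  /* 2Mb span of the frame table */
#define PAGE_INFO_SIZE  32ull         /* sizeof(struct page_info) today */
#define PAGE_SIZE       (4ull << 10)  /* 4Kb page */

/* Bytes of RAM covered by one guaranteed-contiguous span:
 * entries per span, times bytes of RAM each entry describes. */
static uint64_t contiguous_bytes(void)
{
    return (FRAMETABLE_SPAN / PAGE_INFO_SIZE) * PAGE_SIZE;
}
```

This also shows the caveat Jan raises: if struct page_info grows to 64 bytes, the same 2Mb span holds half as many entries, shrinking the guaranteed-contiguous range to 128Mb.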

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: Xen-unstable panic: FATAL PAGE FAULT
  2010-08-31 15:07                     ` Jan Beulich
@ 2010-08-31 16:01                       ` Keir Fraser
  2010-08-31 16:22                         ` Jan Beulich
  0 siblings, 1 reply; 39+ messages in thread
From: Keir Fraser @ 2010-08-31 16:01 UTC (permalink / raw)
  To: Jan Beulich; +Cc: MaoXiaoyun, xen devel

On 31/08/2010 16:07, "Jan Beulich" <JBeulich@novell.com> wrote:

>>>> On 31.08.10 at 16:49, Keir Fraser <keir.fraser@eu.citrix.com> wrote:
>> I'm cc'ing Jan to see what we can get away with in doing arithmetic on
>> page_info pointers. What's the guaranteed smallest aligned contiguous ranges
>> of mfn in the frame_table now, Jan? (i.e., ranges in which adjacent
>> page_info structs relate to adjacent MFNs)
> 
> Any range of struct page_info-s that crosses a 2Mb boundary is
> unsafe to make assumptions upon

Where is even that constraint ensured in the code? I can't see it (I would
have assumed that pfn_pdx_hole_setup() would be ensuring it).

 -- Keir

> (with 32-byte struct page_info
> that means 256Mb of memory, but if struct page_info grows that
> range might shrink). If that limit is too low, we might need to
> enforce a lower limit on the bit positions on which compression
> may be done (possibly at the price of doing less compression).

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: Xen-unstable panic: FATAL PAGE FAULT
  2010-08-31 16:01                       ` Keir Fraser
@ 2010-08-31 16:22                         ` Jan Beulich
  2010-08-31 16:35                           ` Keir Fraser
  0 siblings, 1 reply; 39+ messages in thread
From: Jan Beulich @ 2010-08-31 16:22 UTC (permalink / raw)
  To: Keir Fraser; +Cc: MaoXiaoyun, xen devel

>>> On 31.08.10 at 18:01, Keir Fraser <keir.fraser@eu.citrix.com> wrote:
> On 31/08/2010 16:07, "Jan Beulich" <JBeulich@novell.com> wrote:
> 
>>>>> On 31.08.10 at 16:49, Keir Fraser <keir.fraser@eu.citrix.com> wrote:
>>> I'm cc'ing Jan to see what we can get away with in doing arithmetic on
>>> page_info pointers. What's the guaranteed smallest aligned contiguous ranges
>>> of mfn in the frame_table now, Jan? (i.e., ranges in which adjacent
>>> page_info structs relate to adjacent MFNs)
>> 
>> Any range of struct page_info-s that crosses a 2Mb boundary is
>> unsafe to make assumptions upon
> 
> Where is even that constraint ensured in the code? I can't see it (I would
> have assumed that pfn_pdx_hole_setup() would be ensuring it).

That's somewhat implicit: srat_parse_regions() gets passed an
address that is at least BOOTSTRAP_DIRECTMAP_END (i.e. 4G).
Thus srat_parse_regions() starts off with a mask with the lower
32 bits all set (only more bits can get set subsequently). Thus
the earliest zero bit pfn_pdx_hole_setup() can find is bit 20
(due to the >> PAGE_SHIFT in the invocation). Consequently
the smallest chunk where arithmetic is valid really is 4Gb, not
256Mb as I first wrote.

Jan

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: Xen-unstable panic: FATAL PAGE FAULT
  2010-08-31 16:22                         ` Jan Beulich
@ 2010-08-31 16:35                           ` Keir Fraser
  2010-08-31 17:03                             ` Keir Fraser
  2010-09-01  7:54                             ` Jan Beulich
  0 siblings, 2 replies; 39+ messages in thread
From: Keir Fraser @ 2010-08-31 16:35 UTC (permalink / raw)
  To: Jan Beulich; +Cc: MaoXiaoyun, xen devel

On 31/08/2010 17:22, "Jan Beulich" <JBeulich@novell.com> wrote:

>> Where is even that constraint ensured in the code? I can't see it (I would
>> have assumed that pfn_pdx_hole_setup() would be ensuring it).
> 
> That's somewhat implicit: srat_parse_regions() gets passed an
> address that is at least BOOTSTRAP_DIRECTMAP_END (i.e. 4G).
> Thus srat_parse_regions() starts off with a mask with the lower
> 32 bits all set (only more bits can get set subsequently). Thus
> the earliest zero bit pfn_pdx_hole_setup() can find is bit 20
> (due to the >> PAGE_SHIFT in the invocation). Consequently
> the smallest chunk where arithmetic is valid really is 4Gb, not
> 256Mb as I first wrote.

Well, that's a bit too implicit for me. How about we initialise 'j' to
MAX_ORDER in pfn_pdx_hole_setup() with a comment about supporting page_info
pointer arithmetic within allocatable multi-page regions?

Something like the appended (but with a code comment)?

 -- Keir

--- a/xen/arch/x86/x86_64/mm.c    Mon Aug 30 14:59:12 2010 +0100
+++ b/xen/arch/x86/x86_64/mm.c    Tue Aug 31 17:34:34 2010 +0100
@@ -165,7 +165,8 @@
 {
     unsigned int i, j, bottom_shift, hole_shift;
 
-    for ( hole_shift = bottom_shift = j = 0; ; )
+    hole_shift = bottom_shift = 0;
+    for ( j = MAX_ORDER-1; ; )
     {
         i = find_next_zero_bit(&mask, BITS_PER_LONG, j);
         j = find_next_bit(&mask, BITS_PER_LONG, i);

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: Xen-unstable panic: FATAL PAGE FAULT
  2010-08-31 16:35                           ` Keir Fraser
@ 2010-08-31 17:03                             ` Keir Fraser
  2010-09-01  7:17                               ` MaoXiaoyun
  2010-09-01  8:02                               ` Jan Beulich
  2010-09-01  7:54                             ` Jan Beulich
  1 sibling, 2 replies; 39+ messages in thread
From: Keir Fraser @ 2010-08-31 17:03 UTC (permalink / raw)
  To: Jan Beulich; +Cc: MaoXiaoyun, xen devel

On 31/08/2010 17:35, "Keir Fraser" <keir.fraser@eu.citrix.com> wrote:

>> That's somewhat implicit: srat_parse_regions() gets passed an
>> address that is at least BOOTSTRAP_DIRECTMAP_END (i.e. 4G).
>> Thus srat_parse_regions() starts off with a mask with the lower
>> 32 bits all set (only more bits can get set subsequently). Thus
>> the earliest zero bit pfn_pdx_hole_setup() can find is bit 20
>> (due to the >> PAGE_SHIFT in the invocation). Consequently
>> the smallest chunk where arithmetic is valid really is 4Gb, not
>> 256Mb as I first wrote.
> 
> Well, that's a bit too implicit for me. How about we initialise 'j' to
> MAX_ORDER in pfn_pdx_hole_setup() with a comment about supporting page_info
> pointer arithmetic within allocatable multi-page regions?

Well I agree with your logic anyway. So I don't see that this can be the
cause of MaoXiaoyun's bug. At least not directly. But then I'm stumped as to
why the page arithmetic and checks in free_heap_pages are (apparently)
resulting in a page pointer way outside the frame-table region and actually
in the directmap region.

I think an obvious next step would be to get your boot output, MaoXiaoyun.
Can you please post it? And you may as well stop your memtest if you haven't
already. If you've seen the issue on more than one machine then it certainly
isn't due to that kind of hardware failure.

 -- Keir

^ permalink raw reply	[flat|nested] 39+ messages in thread

* RE: Xen-unstable panic: FATAL PAGE FAULT
  2010-08-31 14:49                   ` Keir Fraser
  2010-08-31 15:00                     ` Keir Fraser
  2010-08-31 15:07                     ` Jan Beulich
@ 2010-09-01  3:17                     ` MaoXiaoyun
  2 siblings, 0 replies; 39+ messages in thread
From: MaoXiaoyun @ 2010-09-01  3:17 UTC (permalink / raw)
  To: keir.fraser, xen devel


[-- Attachment #1.1: Type: text/plain, Size: 16150 bytes --]


Thank you for the details.

There is no "PFN compression on bits" line in the Xen boot output. I added some
extra logging and found it returned from xen/arch/x86/x86_64/mm.c, line 183.
Please refer to the boot log below.

I may add some assertions on the page addresses after chunk merging.

Thank you for the mails you forwarded. I will go through all of them later.

--------------------------pfn_pdx_hole_setup-----------------

164 void __init pfn_pdx_hole_setup(unsigned long mask)
 165 {
 166     unsigned int i, j, bottom_shift, hole_shift;
 167     printk("-------in pfn\n");
 168 
 169     for ( hole_shift = bottom_shift = j = 0; ; )
 170     {
 171         i = find_next_zero_bit(&mask, BITS_PER_LONG, j);
 172         j = find_next_bit(&mask, BITS_PER_LONG, i);
 173         if ( j >= BITS_PER_LONG )
 174             break;
 175         if ( j - i > hole_shift )
 176         {
 177             hole_shift = j - i;
 178             bottom_shift = i;
 179         }
 180     }
 181     if ( !hole_shift ){
 182         printk("-------hole shift returned\n");
 183         return;
 184     }
 185     printk("-------in pfn middle \n");
 186 
 187     printk(KERN_INFO "PFN compression on bits %u...%u\n",
 188            bottom_shift, bottom_shift + hole_shift - 1);
 189     printk("----PFN compression on bits %u...%u\n",
 190            bottom_shift, bottom_shift + hole_shift - 1);
 191 
 192     pfn_pdx_hole_shift  = hole_shift;
 193     pfn_pdx_bottom_mask = (1UL << bottom_shift) - 1;
 194     ma_va_bottom_mask   = (PAGE_SIZE << bottom_shift) - 1;
 195     pfn_hole_mask       = ((1UL << hole_shift) - 1) << bottom_shift;
 196     pfn_top_mask        = ~(pfn_pdx_bottom_mask | pfn_hole_mask);
 197     ma_top_mask         = pfn_top_mask << PAGE_SHIFT;
 198 }

 

------------------------------------------xen boot log---------------------

(XEN) Xen-e820 RAM map:
(XEN)  0000000000000000 - 000000000009a800 (usable)
(XEN)  000000000009a800 - 00000000000a0000 (reserved)
(XEN)  00000000000e4bb0 - 0000000000100000 (reserved)
(XEN)  0000000000100000 - 00000000bf790000 (usable)
(XEN)  00000000bf790000 - 00000000bf79e000 (ACPI data)
(XEN)  00000000bf79e000 - 00000000bf7d0000 (ACPI NVS)
(XEN)  00000000bf7d0000 - 00000000bf7e0000 (reserved)
(XEN)  00000000bf7ec000 - 00000000c0000000 (reserved)
(XEN)  00000000e0000000 - 00000000f0000000 (reserved)
(XEN)  00000000fee00000 - 00000000fee01000 (reserved)
(XEN)  00000000fff00000 - 0000000100000000 (reserved)
(XEN)  0000000100000000 - 0000000640000000 (usable)
(XEN) ACPI: RSDP 000F9DD0, 0024 (r2 ACPIAM)
(XEN) ACPI: XSDT BF790100, 005C (r1 112309 XSDT1113 20091123 MSFT       97)
(XEN) ACPI: FACP BF790290, 00F4 (r4 112309 FACP1113 20091123 MSFT       97)
(XEN) ACPI: DSDT BF7904B0, 4D6A (r2  CTSAV CTSAV122      122 INTL 20051117)
(XEN) ACPI: FACS BF79E000, 0040
(XEN) ACPI: APIC BF790390, 00D8 (r2 112309 APIC1113 20091123 MSFT       97)
(XEN) ACPI: MCFG BF790470, 003C (r1 112309 OEMMCFG  20091123 MSFT       97)
(XEN) ACPI: OEMB BF79E040, 007A (r1 112309 OEMB1113 20091123 MSFT       97)
(XEN) ACPI: SRAT BF79A4B0, 01D0 (r1 112309 OEMSRAT         1 INTL        1)
(XEN) ACPI: HPET BF79A680, 0038 (r1 112309 OEMHPET  20091123 MSFT       97)
(XEN) ACPI: SSDT BF7A1A00, 0363 (r1 DpgPmm    CpuPm       12 INTL 20051117)
(XEN) --------------844
(XEN) ---------srat enter
(XEN) ---------prepare enter into pfn
(XEN) -------in pfn
(XEN) -------hole shift returned
(XEN) --------------849
(XEN) System RAM: 24542MB (25131224kB)
(XEN) Domain heap initialised DMA width 31 bits

 

 
> Date: Tue, 31 Aug 2010 15:49:29 +0100
> Subject: Re: [Xen-devel] Xen-unstable panic: FATAL PAGE FAULT
> From: keir.fraser@eu.citrix.com
> To: tinnycloud@hotmail.com; xen-devel@lists.xensource.com
> CC: JBeulich@novell.com
> 
> Do you have a line in Xen boot output that starts "PFN compression on bits"?
> If so what does it say?
> 
> My suspicion is that Jan Beulich's patches to implement a consolidated page
> array for sparse memory maps has broken the assumption in some Xen code
> that:
> page_to_mfn(mfn_to_page(x)+y) == x+y, for all valid mfns x, and all y up to
> some pretty big limit.
> 
> Looking in free_heap_pages() I see we do a whole bunch of chunk merging in
> our buddy allocator, doing arithmetic on variable 'pg' to find neigbouring
> chunks. It's a bit dodgy I suspect.
> 
> I'm cc'ing Jan to see what we can get away with in doing arithmetic on
> page_info pointers. What's the guaranteed smallest aligned contiguous ranges
> of mfn in the frame_table now, Jan? (i.e., ranges in which adjacent
> page_info structs relate to adjacent MFNs)
> 
> If this is the problem I'm pretty sure we can come up with a patch quite
> easily, but depending on the answer to my above question to Jan, we may need
> to do some code auditing.
> 
> -- Keir
> 
> On 31/08/2010 14:49, "MaoXiaoyun" <tinnycloud@hotmail.com> wrote:
> 
> > Hi Keir:
> > 
> > Thank you for correcting my mistakes.
> > Here is the lastest panic and its objdump.
> > I am not familiar with assemble language and those regigsters usage.
> > I will try to spend some other time to get more understandings.
> > What's your opionion?
> > btw, the memtest is still running, so far so good, thanks.
> > 
> > ------------------objdump-----------------------------------------------------
> > -------------------
> > 177 ffff82c480115396:<++48 c1 e1 04 <++shl $0x4,%rcx
> > 178 ffff82c48011539a:<++4a 03 0c f8 <++add (%rax,%r15,8),%rcx
> > 179 }
> > 180 static inline void
> > 181 page_list_del(struct page_info *page, struct page_list_head *head)
> > 182 {
> > 183 struct page_info *next = pdx_to_page(page->list.next);
> > 184 ffff82c48011539e:<++8b 03 <++mov (%rbx),%eax
> > 185 ffff82c4801153a0:<++48 c1 e0 05 <++shl $0x5,%rax
> > 186 ffff82c4801153a4:<++48 29 e8 <++sub %rbp,%rax
> > 187 ffff82c4801153a7:<++48 3b 19 <++cmp (%rcx),%rbx
> > 188 ffff82c4801153aa:<++0f 84 95 01 00 00 <++je ffff82c480115545
> > <free_heap_pages+0x405>
> > 189 struct page_info *prev = pdx_to_page(page->list.prev);
> > 190 ffff82c4801153b0:<++89 f2 <++mov %esi,%edx
> > 191 ffff82c4801153b2:<++48 c1 e2 05 <++shl $0x5,%rdx
> > 192 ffff82c4801153b6:<++48 29 ea <++sub %rbp,%rdx
> > 193 ffff82c4801153b9:<++48 3b 59 08 <++cmp 0x8(%rcx),%rbx
> > 194 ffff82c4801153bd:<++0f 84 bd 01 00 00 <++je ffff82c480115580
> > <free_heap_pages+0x440>
> > 195 
> > 196 if ( !__page_list_del_head(page, head, next, prev) )
> > 197 {
> > 198 next->list.prev = page->list.prev;
> > 199 ffff82c4801153c3:<++89 70 04 <++mov %esi,0x4(%rax)
> > 200 prev->list.next = page->list.next;
> > 201 ffff82c4801153c6:<++8b 03 <++mov (%rbx),%eax
> > 202 ffff82c4801153c8:<++89 02 <++mov %eax,(%rdx)
> > 203 ffff82c4801153ca:<++49 89 dd <++mov %rbx,%r13
> > 204 ffff82c4801153cd:<++41 83 c4 01 <++add $0x1,%r12d
> > 205 ffff82c4801153d1:<++41 83 fc 12 <++cmp $0x12,%r12d
> > 206 ffff82c4801153d5:<++0f 84 e3 00 00 00 <++je ffff82c4801154be
> > <free_heap_pages+0x37e>
> > 207 ffff82c4801153db:<++48 bd 00 00 00 00 0a <++mov $0x7d0a00000000,%rbp
> > 208 ffff82c4801153e2:<++7d 00 00
> > 209 ffff82c4801153e5:<++44 89 e1 <++mov %r12d,%ecx
> > 210 ffff82c4801153e8:<++be 01 00 00 00 <++mov $0x1,%esi
> > 
> > 
> > ------------------------------------------------------------------------------
> > ---------------------
> > blktap_sysfs_create: adding attributes for dev ffff880239496c00
> > (XEN) ----[ Xen-4.0.0 x86_64 debug=n Not tainted ]----
> > (XEN) CPU: 2
> > (XEN) RIP: e008:[<ffff82c4801153c3>] free_heap_pages+0x283/0x4a0
> > (XEN) RFLAGS: 0000000000010282 CONTEXT: hypervisor
> > (XEN) rax: ffff8315ffffffe0 rbx: ffff82f6093b0040 rcx: ffff83063fc01a20
> > (XEN) rdx: ffff8315ffffffe0 rsi: 00000000ffffffff rdi: 000000000049d802
> > (XEN) rbp: 00007d0a00000000 rsp: ffff83023ff37cb8 r8: 0000000000000000
> > (XEN) r9: ffffffffffffffff r10: ffff83060a3c0018 r11: 0000000000000282
> > (XEN) r12: 0000000000000000 r13: ffff82f6093b0060 r14: 00000000000001a2
> > (XEN) r15: 0000000000000001 cr0: 000000008005003b cr4: 00000000000026f0
> > (XEN) cr3: 000000008da54000 cr2: ffff8315ffffffe4
> > (XEN) ds: 0000 es: 0000 fs: 0063 gs: 0000 ss: e010 cs: e008
> > (XEN) Xen stack trace from rsp=ffff83023ff37cb8:
> > (XEN) ffff82f6093b7f80 00000000ffffffe0 00000000000001a2 ffff83060a3c0000
> > (XEN) 0000000000000000 0000000000000001 ffff82f6093b0060 0000000000000000
> > (XEN) ffff82f6093b0080 ffff82c480115732 00000001093b7cc0 ffff82f6093b0060
> > (XEN) ffff83060a3c0018 0000000000000000 ffff83060a3c0000 ffff83060a3c0fa8
> > (XEN) 0000000000000000 ffff82c48014aaa6 ffff83060a3c0fa8 ffff83060a3c0fa8
> > (XEN) ffff83060a3c0014 4000000000000000 ffff83023ff37f28 ffff83060a3c0018
> > (XEN) 0000000000000000 ffff83060a3c0000 0000000000305000 0000000000000009
> > (XEN) 0000000000000009 ffff82c48014b2fd 00ffffffffffffff ffff83060a3c0000
> > (XEN) 0000000000000000 ffff83023ff37e28 0000000000305000 ffff82c480105fe0
> > (XEN) ffff82c480255240 fffffffffffffff3 0000000002599000 ffff82c4801043ce
> > (XEN) ffff82c4801447da 0000000000000080 ffff83023ff37f28 0000000000000096
> > (XEN) ffff83023ff37f28 00000000000000fc 0000000600000002 00000000023c0031
> > (XEN) 0000000000000001 00000039890a8e2a 0000003000000018 000000004523af30
> > (XEN) 000000004523ae70 0000000000000000 00007fc608ea8a70 000000398903c8a4
> > (XEN) 000000004523af44 0000000000000000 000000004523b158 0000000000000000
> > (XEN) 0000007f024f6d20 00007fc60a094750 000000000255ff40 00007fc607be5ea8
> > (XEN) fffffffffffffff5 0000000000000246 00000039880cc557 0000000000000100
> > (XEN) 00000039880cc557 0000000000000033 0000000000000246 ffff8300bf562000
> > (XEN) ffff8801db8d3e78 000000004523aec0 0000000000305000 0000000000000009
> > (XEN) 0000000000000009 ffff82c4801e3169 0000000000000009 0000000000000009
> > (XEN) Xen call trace:
> > (XEN) [<ffff82c4801153c3>] free_heap_pages+0x283/0x4a0
> > (XEN) [<ffff82c480115732>] free_domheap_pages+0x152/0x380
> > (XEN) [<ffff82c48014aaa6>] relinquish_memory+0x186/0x530
> > (XEN) [<ffff82c48014b2fd>] domain_relinquish_resources+0x1ad/0x280
> > (XEN) [<ffff82c480105fe0>] domain_kill+0x80/0xf0
> > (XEN) [<ffff82c4801043ce>] do_domctl+0x1be/0x1000
> > (XEN) [<ffff82c4801447da>] __find_next_bit+0x6a/0x70
> > (XEN) [<ffff82c4801e3169>] syscall_enter+0xa9/0xae
> > (XEN) 
> > (XEN) Pagetable walk from ffff8315ffffffe4:
> > (XEN) L4[0x106] = 00000000bf569027 5555555555555555
> > (XEN) L3[0x057] = 0000000000000000 ffffffffffffffff
> > (XEN) 
> > (XEN) ****************************************
> > (XEN) Panic on CPU 2:
> > (XEN) FATAL PAGE FAULT
> > (XEN) [error_code=0002]
> > (XEN) Faulting linear address: ffff8315ffffffe4
> > (XEN) ****************************************
> > (XEN) 
> > (XEN) Manual reset required ('noreboot' specified)
> > 
> > ------------------------------------------------------------------------------
> > ---------------------
> >> Date: Mon, 30 Aug 2010 14:16:09 +0100
> >> Subject: Re: [Xen-devel] Xen-unstable panic: FATAL PAGE FAULT
> >> From: keir.fraser@eu.citrix.com
> >> To: tinnycloud@hotmail.com; xen-devel@lists.xensource.com
> >> 
> >> On 30/08/2010 14:03, "MaoXiaoyun" <tinnycloud@hotmail.com> wrote:
> >> 
> >>> Appreciate for the quick response.
> >>> 
> >>> Actually I have done some decode on the backtrace last Friday.
> >>> According the RIP ffff82c4801153c3, I cut the "objdump -dS xen-syms"
> >>> (please see below). It looks like the bug happened on the domain page list
> >> 
> >> ffff82c4801153c3 isn't the start of an instruction in your below
> >> disassembly. Hence you didn't disassemble exactly the build of Xen which
> >> crashed. It needs to be exactly the same image.
> >> 
> >> -- keir
> >> 
> >>> travels, which is beyond my understanding. Since in my understanding,
> >>> those domain pages come from kernel memory zone, they are always
> >>> reside in the physical memory, and the address shouldn't have the chance
> >>> to be changed, right?
> >>> If so, what is the relationship between all those panic and free_heap_pages?
> >>> 
> >>> Several servers (at least 3) experienced the same panic on the same test.
> >>> Those servers have the identical hardware, kernel and xen configuration.
> >>> Right now, on one server, memtest is running, shall be finished in a few
> >>> hours.
> >>> (24G memory)
> >>> 
> >>> ----------------------------------------------------------------------------
> >>> --
> >>> ------
> >>> 169 static inline void
> >>> 170 page_list_del(struct page_info *page, struct page_list_head *head)
> >>> 171 {
> >>> 172 struct page_info *next = pdx_to_page(page->list.next);
> >>> 173 struct page_info *prev = pdx_to_page(page->list.prev);
> >>> 174 ffff82c4801153b8:<++8b 73 04 <++mov 0x4(%rbx),%esi
> >>> 175 ffff82c4801153bb:<++49 8d 0c 06 <++lea (%r14,%rax,1),%rcx
> >>> 176 ffff82c4801153bf:<++48 8d 05 fa 10 26 00 <++lea 2494714(%rip),%rax
> >>> # ffff82c4803764c0 <_heap>
> >>> 177 ffff82c4801153c6:<++48 c1 e1 04 <++shl $0x4,%rcx
> >>> 178 ffff82c4801153ca:<++4a 03 0c f8 <++add (%rax,%r15,8),%rcx
> >>> 179 }
> >>> 180 static inline void
> >>> 181 page_list_del(struct page_info *page, struct page_list_head *head)
> >>> 182 {
> >>> 183 struct page_info *next = pdx_to_page(page->list.next);
> >>> 184 ffff82c4801153ce:<++8b 03 <++mov (%rbx),%eax
> >>> 185 ffff82c4801153d0:<++48 c1 e0 05 <++shl $0x5,%rax
> >>> 186 ffff82c4801153d4:<++48 29 e8 <++sub %rbp,%rax
> >>> 187 ffff82c4801153d7:<++48 3b 19 <++cmp (%rcx),%rbx
> >>> 188 ffff82c4801153da:<++0f 84 95 01 00 00 <++je ffff82c480115575
> >>> <free_heap_pages+0x405>
> >>> 189 struct page_info *prev = pdx_to_page(page->list.prev);
> >>> 190 ffff82c4801153e0:<++89 f2 <++mov %esi,%edx
> >>> 191 ffff82c4801153e2:<++48 c1 e2 05 <++shl $0x5,%rdx
> >>> 192 ffff82c4801153e6:<++48 29 ea <++sub %rbp,%rdx
> >>> 193 ffff82c4801153e9:<++48 3b 59 08 <++cmp 0x8(%rcx),%rbx
> >>> 194 ffff82c4801153ed:<++0f 84 bd 01 00 00 <++je ffff82c4801155b0
> >>> <free_heap_pages+0x440>
> >>> 195 
> >>> 196 if ( !__page_list_del_head(page, head, next, prev) )
> >>> 197 {
> >>> 198 
> >>> ----------------------------------------------------------------------------
> >>> --
> >>> ------
> >>> 
> >>>> Date: Mon, 30 Aug 2010 10:02:05 +0100
> >>>> Subject: Re: [Xen-devel] Xen-unstable panic: FATAL PAGE FAULT
> >>>> From: keir.fraser@eu.citrix.com
> >>>> To: tinnycloud@hotmail.com; xen-devel@lists.xensource.com
> >>>> 
> >>>> On 30/08/2010 09:47, "MaoXiaoyun" <tinnycloud@hotmail.com> wrote:
> >>>> 
> >>>>> 3) Every panic pointer to the same address: ffff8315ffffffe4, which is
> >>>>> not a valid page address.
> >>>>> I printted pages of the domain in assign_pages, which all looks like
> >>>>> ffff82f60bd64000, at least
> >>>>> ffff82f60 is the same.
> >>>> 
> >>>> Yes, well you may not be crashing on a supposed page address. Certainly the
> >>>> page pointer that relinquish_memory() is working on, and passed to
> >>>> put_page->free_domheap_pages is valid enough to not cause any of those
> >>>> functions to crash when dereferencing it. At the moment you really have no
> >>>> idea what is causing free_heap_pages() to crash.
> >>>> 
> >>>>> A bit of lost direction to go further. Thanks.
> >>>> 
> >>>> You need to find out which line of code in free_heap_pages() is crashing,
> >>>> and what variable it is trying to dereference when it crashes. You have a
> >>>> nice backtrace with an EIP value, so you can 'objdump -d xen-syms' and
> >>>> search for the EIP in the disassembly. If you have a debug build of Xen you
> >>>> can even do 'objdump -S xen-syms' and have the disassembly annotated with
> >>>> corresponding source lines.
> >>>> 
> >>>> Have you seen this on more than one physical machine? If not, have you run
> >>>> memtest on the offending machine?
> >>>> 
> >>>> -- Keir
> >>>> 
> >>>> 
> >>> 
> >> 
> >> 
> > 
> 
> 

[-- Attachment #1.2: Type: text/html, Size: 21315 bytes --]

[-- Attachment #2: Type: text/plain, Size: 138 bytes --]

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xensource.com
http://lists.xensource.com/xen-devel

^ permalink raw reply	[flat|nested] 39+ messages in thread

* RE: Xen-unstable panic: FATAL PAGE FAULT
  2010-08-31 17:03                             ` Keir Fraser
@ 2010-09-01  7:17                               ` MaoXiaoyun
  2010-09-01  7:40                                 ` Keir Fraser
  2010-09-01  8:05                                 ` Jan Beulich
  2010-09-01  8:02                               ` Jan Beulich
  1 sibling, 2 replies; 39+ messages in thread
From: MaoXiaoyun @ 2010-09-01  7:17 UTC (permalink / raw)
  To: keir.fraser, jbeulich; +Cc: xen devel


[-- Attachment #1.1: Type: text/plain, Size: 2395 bytes --]


As I go through the chunk merge code in free_heap_pages, one thing I'd like
to mention: previously, I printed out all domain pages when allocated, and I
found that the order in assign_pages (/xen-4.0.0/xen/common/page_alloc.c:1087)
is either 0 or 9; later I learned that is because domain U populates its
physmap 2M bytes at a time.

And here in the while statement, the order is compared with MAX_ORDER, which
is 20. I wonder if it might offer some clues.

Thanks.

-------------------------------

 531 
 532     /* Merge chunks as far as possible. */
 533     while ( order < MAX_ORDER )
 534     {
 535         mask = 1UL << order;
 
> Date: Tue, 31 Aug 2010 18:03:41 +0100
> Subject: Re: [Xen-devel] Xen-unstable panic: FATAL PAGE FAULT
> From: keir.fraser@eu.citrix.com
> To: JBeulich@novell.com
> CC: tinnycloud@hotmail.com; xen-devel@lists.xensource.com
> 
> On 31/08/2010 17:35, "Keir Fraser" <keir.fraser@eu.citrix.com> wrote:
> 
> >> That's somewhat implicit: srat_parse_regions() gets passed an
> >> address that is at least BOOTSTRAP_DIRECTMAP_END (i.e. 4G).
> >> Thus srat_parse_regions() starts off with a mask with the lower
> >> 32 bits all set (only more bits can get set subsequently). Thus
> >> the earliest zero bit pfn_pdx_hole_setup() can find is bit 20
> >> (due to the >> PAGE_SHIFT in the invocation). Consequently
> >> the smallest chunk where arithmetic is valid really is 4Gb, not
> >> 256Mb as I first wrote.
> > 
> > Well, that's a bit too implicit for me. How about we initialise 'j' to
> > MAX_ORDER in pfn_pdx_hole_setup() with a comment about supporting page_info
> > pointer arithmetic within allocatable multi-page regions?
> 
> Well I agree with your logic anyway. So I don't see that this can be the
> cause of MaoXiaoyun's bug. At least not directly. But then I'm stumped as to
> why the page arithmetic and checks in free_heap_pages are (apparently)
> resulting in a page pointer way outside the frame-table region and actually
> in the directmap region.
> 
> I think an obvious next step would be to get your boot output, MaoXiaoyun.
> Can you please post it? And you may as well stop your memtest if you haven't
> already. If you've seen the issue on more than one machine then it certainly
> isn't due to that kind of hardware failure.
> 
> -- Keir
> 
> 

[-- Attachment #1.2: Type: text/html, Size: 3090 bytes --]

[-- Attachment #2: Type: text/plain, Size: 138 bytes --]

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xensource.com
http://lists.xensource.com/xen-devel

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: Xen-unstable panic: FATAL PAGE FAULT
  2010-09-01  7:17                               ` MaoXiaoyun
@ 2010-09-01  7:40                                 ` Keir Fraser
  2010-09-01  8:05                                 ` Jan Beulich
  1 sibling, 0 replies; 39+ messages in thread
From: Keir Fraser @ 2010-09-01  7:40 UTC (permalink / raw)
  To: MaoXiaoyun, jbeulich; +Cc: xen devel

On 01/09/2010 08:17, "MaoXiaoyun" <tinnycloud@hotmail.com> wrote:

> As I go through the chunk merge code in free_heap_pages, one thing I'd like
> to mention is, previously, I printed out all domain pages when allocated,
> and I found the order in assign_pages in
> /xen-4.0.0/xen/common/page_alloc.c:1087,
> the order either be 0, or 9, and later I know that is because domain U
> populate physmap
> 2M Bytes everytime.
>  
> And here in the while statement, the order is compared with MAX_ORDER, which
> is 20.
> I wonder if it might have some clues.

Xen's buddy allocator merges pairs of adjacent free chunks up to a maximum
size of 2**20 pages. That merging needs to be careful it doesn't merge off
the end of RAM. I'm just guessing that maybe there's an issue with that on
your fairly large memory system.

 -- Keir

> Thanks.
> -------------------------------
>  531 
>  532     /* Merge chunks as far as possible. */
>  533     while ( order < MAX_ORDER )
>  534     {
>  535         mask = 1UL << order;
>  
>> Date: Tue, 31 Aug 2010 18:03:41 +0100
>> Subject: Re: [Xen-devel] Xen-unstable panic: FATAL PAGE FAULT
>> From: keir.fraser@eu.citrix.com
>> To: JBeulich@novell.com
>> CC: tinnycloud@hotmail.com; xen-devel@lists.xensource.com
>> 
>> On 31/08/2010 17:35, "Keir Fraser" <keir.fraser@eu.citrix.com> wrote:
>> 
>>>> That's somewhat implicit: srat_parse_regions() gets passed an
>>>> address that is at least BOOTSTRAP_DIRECTMAP_END (i.e. 4G).
>>>> Thus srat_parse_regions() starts off with a mask with the lower
>>>> 32 bits all set (only more bits can get set subsequently). Thus
>>>> the earliest zero bit pfn_pdx_hole_setup() can find is bit 20
>>>> (due to the >> PAGE_SHIFT in the invocation). Consequently
>>>> the smallest chunk where arithmetic is valid really is 4Gb, not
>>>> 256Mb as I first wrote.
>>> 
>>> Well, that's a bit too implicit for me. How about we initialise 'j' to
>>> MAX_ORDER in pfn_pdx_hole_setup() with a comment about supporting page_info
>>> pointer arithmetic within allocatable multi-page regions?
>> 
>> Well I agree with your logic anyway. So I don't see that this can be the
>> cause of MaoXiaoyun's bug. At least not directly. But then I'm stumped as to
>> why the page arithmetic and checks in free_heap_pages are (apparently)
>> resulting in a page pointer way outside the frame-table region and actually
>> in the directmap region.
>> 
>> I think an obvious next step would be to get your boot output, MaoXiaoyun.
>> Can you please post it? And you may as well stop your memtest if you haven't
>> already. If you've seen the issue on more than one machine then it certainly
>> isn't due to that kind of hardware failure.
>> 
>> -- Keir
>> 
>> 
>        

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: Xen-unstable panic: FATAL PAGE FAULT
  2010-08-31 16:35                           ` Keir Fraser
  2010-08-31 17:03                             ` Keir Fraser
@ 2010-09-01  7:54                             ` Jan Beulich
  1 sibling, 0 replies; 39+ messages in thread
From: Jan Beulich @ 2010-09-01  7:54 UTC (permalink / raw)
  To: Keir Fraser; +Cc: MaoXiaoyun, xen devel

>>> On 31.08.10 at 18:35, Keir Fraser <keir.fraser@eu.citrix.com> wrote:
> On 31/08/2010 17:22, "Jan Beulich" <JBeulich@novell.com> wrote:
> 
>>> Where is even that constraint ensured in the code? I can't see it (I would
>>> have assumed that pfn_pdx_hole_setup() would be ensuring it).
>> 
>> That's somewhat implicit: srat_parse_regions() gets passed an
>> address that is at least BOOTSTRAP_DIRECTMAP_END (i.e. 4G).
>> Thus srat_parse_regions() starts off with a mask with the lower
>> 32 bits all set (only more bits can get set subsequently). Thus
>> the earliest zero bit pfn_pdx_hole_setup() can find is bit 20
>> (due to the >> PAGE_SHIFT in the invocation). Consequently
>> the smallest chunk where arithmetic is valid really is 4Gb, not
>> 256Mb as I first wrote.
> 
> Well, that's a bit too implicit for me. How about we initialise 'j' to
> MAX_ORDER in pfn_pdx_hole_setup() with a comment about supporting page_info
> pointer arithmetic within allocatable multi-page regions?
> 
> Something like the appended (but with a code comment)?

Yes, that would seem reasonable (and not affecting current behavior).

Jan

> --- a/xen/arch/x86/x86_64/mm.c    Mon Aug 30 14:59:12 2010 +0100
> +++ b/xen/arch/x86/x86_64/mm.c    Tue Aug 31 17:34:34 2010 +0100
> @@ -165,7 +165,8 @@
>  {
>      unsigned int i, j, bottom_shift, hole_shift;
>  
> -    for ( hole_shift = bottom_shift = j = 0; ; )
> +    hole_shift = bottom_shift = 0;
> +    for ( j = MAX_ORDER-1; ; )
>      {
>          i = find_next_zero_bit(&mask, BITS_PER_LONG, j);
>          j = find_next_bit(&mask, BITS_PER_LONG, i);

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: Xen-unstable panic: FATAL PAGE FAULT
  2010-08-31 17:03                             ` Keir Fraser
  2010-09-01  7:17                               ` MaoXiaoyun
@ 2010-09-01  8:02                               ` Jan Beulich
  2010-09-01  8:49                                 ` Keir Fraser
  1 sibling, 1 reply; 39+ messages in thread
From: Jan Beulich @ 2010-09-01  8:02 UTC (permalink / raw)
  To: Keir Fraser; +Cc: MaoXiaoyun, xen devel

>>> On 31.08.10 at 19:03, Keir Fraser <keir.fraser@eu.citrix.com> wrote:
> On 31/08/2010 17:35, "Keir Fraser" <keir.fraser@eu.citrix.com> wrote:
> 
>>> That's somewhat implicit: srat_parse_regions() gets passed an
>>> address that is at least BOOTSTRAP_DIRECTMAP_END (i.e. 4G).
>>> Thus srat_parse_regions() starts off with a mask with the lower
>>> 32 bits all set (only more bits can get set subsequently). Thus
>>> the earliest zero bit pfn_pdx_hole_setup() can find is bit 20
>>> (due to the >> PAGE_SHIFT in the invocation). Consequently
>>> the smallest chunk where arithmetic is valid really is 4GB, not
>>> 256MB as I first wrote.
>> 
>> Well, that's a bit too implicit for me. How about we initialise 'j' to
>> MAX_ORDER in pfn_pdx_hole_setup() with a comment about supporting page_info
>> pointer arithmetic within allocatable multi-page regions?
> 
> Well I agree with your logic anyway. So I don't see that this can be the
> cause of MaoXiaoyun's bug. At least not directly. But then I'm stumped as to
> why the page arithmetic and checks in free_heap_pages are (apparently)
> resulting in a page pointer way outside the frame-table region and actually
> in the directmap region.

There must be some unchecked use of PAGE_LIST_NULL, i.e.
running off a list end without taking notice (0xffff8315ffffffe4
exactly corresponds with that).

Jan

^ permalink raw reply	[flat|nested] 39+ messages in thread

* RE: Xen-unstable panic: FATAL PAGE FAULT
  2010-09-01  7:17                               ` MaoXiaoyun
  2010-09-01  7:40                                 ` Keir Fraser
@ 2010-09-01  8:05                                 ` Jan Beulich
  2010-09-01  8:32                                   ` MaoXiaoyun
  1 sibling, 1 reply; 39+ messages in thread
From: Jan Beulich @ 2010-09-01  8:05 UTC (permalink / raw)
  To: MaoXiaoyun; +Cc: xen devel, keir.fraser

>>> On 01.09.10 at 09:17, MaoXiaoyun <tinnycloud@hotmail.com> wrote:

> As I went through the chunk merge code in free_heap_pages, one thing I'd
> like to mention: previously I printed out all domain pages when they were
> allocated, and I found that the order in assign_pages in
> /xen-4.0.0/xen/common/page_alloc.c:1087 is either 0 or 9; later I learned
> that is because domain U populates its physmap 2MB at a time.
> 
> And here in the while statement, the order is compared with MAX_ORDER,
> which is 20.

Are you sure it's 20? MAX_ORDER should be 18 for x86 afaict.

Jan

^ permalink raw reply	[flat|nested] 39+ messages in thread

* RE: Xen-unstable panic: FATAL PAGE FAULT
  2010-09-01  8:05                                 ` Jan Beulich
@ 2010-09-01  8:32                                   ` MaoXiaoyun
  0 siblings, 0 replies; 39+ messages in thread
From: MaoXiaoyun @ 2010-09-01  8:32 UTC (permalink / raw)
  To: jbeulich; +Cc: xen devel, keir.fraser


[-- Attachment #1.1: Type: text/plain, Size: 1124 bytes --]


Yes, you are right. I printed it out, and it is 18.

Thanks for correcting me.

I am interested in the unchecked PAGE_LIST_NULL assumption in your last mail.
How can I set up a test to verify it?
 
> Date: Wed, 1 Sep 2010 09:05:50 +0100
> From: JBeulich@novell.com
> To: tinnycloud@hotmail.com
> CC: keir.fraser@eu.citrix.com; xen-devel@lists.xensource.com
> Subject: RE: [Xen-devel] Xen-unstable panic: FATAL PAGE FAULT
> 
> >>> On 01.09.10 at 09:17, MaoXiaoyun <tinnycloud@hotmail.com> wrote:
> 
> > As I went through the chunk merge code in free_heap_pages, one thing I'd
> > like to mention: previously I printed out all domain pages when they were
> > allocated, and I found that the order in assign_pages in
> > /xen-4.0.0/xen/common/page_alloc.c:1087 is either 0 or 9; later I learned
> > that is because domain U populates its physmap 2MB at a time.
> > 
> > And here in the while statement, the order is compared with MAX_ORDER,
> > which is 20.
> 
> Are you sure it's 20? MAX_ORDER should be 18 for x86 afaict.
> 
> Jan
> 
 		 	   		  

[-- Attachment #1.2: Type: text/html, Size: 1596 bytes --]

[-- Attachment #2: Type: text/plain, Size: 138 bytes --]

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xensource.com
http://lists.xensource.com/xen-devel

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: Xen-unstable panic: FATAL PAGE FAULT
  2010-09-01  8:02                               ` Jan Beulich
@ 2010-09-01  8:49                                 ` Keir Fraser
  2010-09-01  9:01                                   ` Jan Beulich
                                                     ` (2 more replies)
  0 siblings, 3 replies; 39+ messages in thread
From: Keir Fraser @ 2010-09-01  8:49 UTC (permalink / raw)
  To: Jan Beulich; +Cc: MaoXiaoyun, xen devel

[-- Attachment #1: Type: text/plain, Size: 1459 bytes --]

On 01/09/2010 09:02, "Jan Beulich" <JBeulich@novell.com> wrote:

>> Well I agree with your logic anyway. So I don't see that this can be the
>> cause of MaoXiaoyun's bug. At least not directly. But then I'm stumped as to
>> why the page arithmetic and checks in free_heap_pages are (apparently)
>> resulting in a page pointer way outside the frame-table region and actually
>> in the directmap region.
> 
> There must be some unchecked use of PAGE_LIST_NULL, i.e.
> running off a list end without taking notice (0xffff8315ffffffe4
> exactly corresponds with that).

Okay, my next guess then is that we are deleting a chunk from the wrong list
head. I don't see any check that the adjacent chunks we are considering to
merge are from the same node and zone. I suppose the zone logic does just
work as we're dealing with 2**x aligned and sized regions. But, shouldn't
the merging logic in free_heap_pages be checking that the merging candidate
is from the same NUMA node? I see I have an ASSERTion later in the same
function, but it's too weak and wishful I suspect.

MaoXiaoyun: can you please test with the attached patch? If I'm right, you
will crash on one of the BUG_ON checks that I added, rather than crashing on
a pointer dereference. You may even crash during boot. Anyhow, what is
interesting is whether this patch always makes you crash on BUG_ON before
you would normally crash on pointer dereference. If so this is trivial to
fix.

 Thanks,
 Keir


[-- Attachment #2: 00-bugcheck --]
[-- Type: application/octet-stream, Size: 865 bytes --]

diff -r 573ddf5cc145 xen/common/page_alloc.c
--- a/xen/common/page_alloc.c	Tue Aug 31 19:16:23 2010 +0100
+++ b/xen/common/page_alloc.c	Wed Sep 01 09:41:42 2010 +0100
@@ -581,6 +581,8 @@
                  !page_state_is(pg-mask, free) ||
                  (PFN_ORDER(pg-mask) != order) )
                 break;
+            BUG_ON(page_to_zone(pg-mask) != zone);
+            BUG_ON(phys_to_nid(page_to_maddr(pg-mask)) != node);
             pg -= mask;
             page_list_del(pg, &heap(node, zone, order));
         }
@@ -591,6 +593,8 @@
                  !page_state_is(pg+mask, free) ||
                  (PFN_ORDER(pg+mask) != order) )
                 break;
+            BUG_ON(page_to_zone(pg+mask) != zone);
+            BUG_ON(phys_to_nid(page_to_maddr(pg+mask)) != node);
             page_list_del(pg + mask, &heap(node, zone, order));
         }
 


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: Xen-unstable panic: FATAL PAGE FAULT
  2010-09-01  8:49                                 ` Keir Fraser
@ 2010-09-01  9:01                                   ` Jan Beulich
  2010-09-01  9:28                                     ` Keir Fraser
  2010-09-01  9:48                                     ` MaoXiaoyun
  2010-09-01  9:06                                   ` MaoXiaoyun
  2010-09-01  9:23                                   ` MaoXiaoyun
  2 siblings, 2 replies; 39+ messages in thread
From: Jan Beulich @ 2010-09-01  9:01 UTC (permalink / raw)
  To: Keir Fraser, MaoXiaoyun; +Cc: xen devel

>>> On 01.09.10 at 10:49, Keir Fraser <keir.fraser@eu.citrix.com> wrote:
> Okay, my next guess then is that we are deleting a chunk from the wrong list
> head. I don't see any check that the adjacent chunks we are considering to
> merge are from the same node and zone. I suppose the zone logic does just
> work as we're dealing with 2**x aligned and sized regions. But, shouldn't
> the merging logic in free_heap_pages be checking that the merging candidate
> is from the same NUMA node? I see I have an ASSERTion later in the same
> function, but it's too weak and wishful I suspect.

Hmm, we're keeping a page reserved if node boundaries aren't
well aligned (at the end of init_heap_pages()), so that shouldn't
be possible.

MaoXiaoyun: Would it be possible that we get to see a *full* boot
log (at maximum log level), so we know the characteristics of the
machine?

Jan

^ permalink raw reply	[flat|nested] 39+ messages in thread

* RE: Xen-unstable panic: FATAL PAGE FAULT
  2010-09-01  8:49                                 ` Keir Fraser
  2010-09-01  9:01                                   ` Jan Beulich
@ 2010-09-01  9:06                                   ` MaoXiaoyun
  2010-09-01  9:23                                   ` MaoXiaoyun
  2 siblings, 0 replies; 39+ messages in thread
From: MaoXiaoyun @ 2010-09-01  9:06 UTC (permalink / raw)
  To: keir.fraser, jbeulich; +Cc: xen devel


[-- Attachment #1.1: Type: text/plain, Size: 1864 bytes --]


Thanks Keir. 

 

I will run the test and keep you updated.

  
> Date: Wed, 1 Sep 2010 09:49:18 +0100
> Subject: Re: [Xen-devel] Xen-unstable panic: FATAL PAGE FAULT
> From: keir.fraser@eu.citrix.com
> To: JBeulich@novell.com
> CC: tinnycloud@hotmail.com; xen-devel@lists.xensource.com
> 
> On 01/09/2010 09:02, "Jan Beulich" <JBeulich@novell.com> wrote:
> 
> >> Well I agree with your logic anyway. So I don't see that this can be the
> >> cause of MaoXiaoyun's bug. At least not directly. But then I'm stumped as to
> >> why the page arithmetic and checks in free_heap_pages are (apparently)
> >> resulting in a page pointer way outside the frame-table region and actually
> >> in the directmap region.
> > 
> > There must be some unchecked use of PAGE_LIST_NULL, i.e.
> > running off a list end without taking notice (0xffff8315ffffffe4
> > exactly corresponds with that).
> 
> Okay, my next guess then is that we are deleting a chunk from the wrong list
> head. I don't see any check that the adjacent chunks we are considering to
> merge are from the same node and zone. I suppose the zone logic does just
> work as we're dealing with 2**x aligned and sized regions. But, shouldn't
> the merging logic in free_heap_pages be checking that the merging candidate
> is from the same NUMA node? I see I have an ASSERTion later in the same
> function, but it's too weak and wishful I suspect.
> 
> MaoXiaoyun: can you please test with the attached patch? If I'm right, you
> will crash on one of the BUG_ON checks that I added, rather than crashing on
> a pointer dereference. You may even crash during boot. Anyhow, what is
> interesting is whether this patch always makes you crash on BUG_ON before
> you would normally crash on pointer dereference. If so this is trivial to
> fix.
> 
> Thanks,
> Keir
> 
 		 	   		  

[-- Attachment #1.2: Type: text/html, Size: 2330 bytes --]


^ permalink raw reply	[flat|nested] 39+ messages in thread

* RE: Xen-unstable panic: FATAL PAGE FAULT
  2010-09-01  8:49                                 ` Keir Fraser
  2010-09-01  9:01                                   ` Jan Beulich
  2010-09-01  9:06                                   ` MaoXiaoyun
@ 2010-09-01  9:23                                   ` MaoXiaoyun
  2010-09-01  9:58                                     ` Keir Fraser
  2 siblings, 1 reply; 39+ messages in thread
From: MaoXiaoyun @ 2010-09-01  9:23 UTC (permalink / raw)
  To: keir.fraser, jbeulich; +Cc: xen devel


[-- Attachment #1.1: Type: text/plain, Size: 8106 bytes --]


Well, it did crash on every startup.

Below is what I got.

---------------------------------------------------

root (hd0,0)                                                                    
 Filesystem type is ext2fs, partition type 0x83                                 
kernel /xen-4.0.0.gz msi=1 iommu=off x2apic=off hap=0 dom0_mem=10240M dom0_max_ 
vcpus=4 dom0_vcpus_pin console=com1,vga com1=115200,8n1 conswitch=ax noreboot   
   [Multiboot-elf, <0x100000:0x152000:0x148000>, shtab=0x39a078, entry=0x100000 
]                                                                               
module /vmlinuz-2.6.31.13-pvops-patch ro root=LABEL=/ hda=noprobe console=hvc0  
   [Multiboot-module @ 0x39b000, 0x3214d0 bytes]                                
                                                                                
                                                                                
                                                 __  __            _  _    ___   ___  
 \ \/ /___ _ __   | || |  / _ \ / _ \                                      *    
  \  // _ \ '_ \  | || |_| | | | | | |                                     *    
  /  \  __/ | | | |__   _| |_| | |_| |                                     * *  
 /_/\_\___|_| |_|    |_|(_)___(_)___/ **************************************    
                                      hich entry is highlighted.                
(XEN) Xen version 4.0.0 (root@dev.sd.aliyun.com) (gcc version 4.1.2 20080704 (Red Hat 4.1.2-46)) Wed Sep  1 17:13:35 CST 2010
(XEN) Latest ChangeSet: unavailableto modify the kernel arguments               
(XEN) Command line: msi=1 iommu=off x2apic=off hap=0 dom0_mem=10240M dom0_max_vcpus=4 dom0_vcpus_pin console=com1,vga com1=115200,8n1 conswitch=ax noreboot
(XEN) Video information:                                                        
(XEN)  VGA is text mode 80x25, font 8x16automatically in 3 seconds.             
(XEN)  VBE/DDC methods: none; EDID transfer time: 0 seconds                     
(XEN)  EDID info not retrieved because no DDC retrieval method detected        
(XEN) Disc information:
(XEN)  Found 6 MBR signatures
(XEN)  Found 6 EDD information structures
(XEN) Xen-e820 RAM map:
(XEN)  0000000000000000 - 000000000009a800 (usable)
(XEN)  000000000009a800 - 00000000000a0000 (reserved)
(XEN)  00000000000e4bb0 - 0000000000100000 (reserved)
(XEN)  0000000000100000 - 00000000bf790000 (usable)
(XEN)  00000000bf790000 - 00000000bf79e000 (ACPI data)
(XEN)  00000000bf79e000 - 00000000bf7d0000 (ACPI NVS)
(XEN)  00000000bf7d0000 - 00000000bf7e0000 (reserved)
(XEN)  00000000bf7ec000 - 00000000c0000000 (reserved)
(XEN)  00000000e0000000 - 00000000f0000000 (reserved)
(XEN)  00000000fee00000 - 00000000fee01000 (reserved)
(XEN)  00000000fff00000 - 0000000100000000 (reserved)
(XEN)  0000000100000000 - 0000000640000000 (usable)
(XEN) --------------849
(XEN) --------------849
(XEN) --------------849
(XEN) ACPI: RSDP 000F9DD0, 0024 (r2 ACPIAM)
(XEN) ACPI: XSDT BF790100, 005C (r1 112309 XSDT1113 20091123 MSFT       97)
(XEN) ACPI: FACP BF790290, 00F4 (r4 112309 FACP1113 20091123 MSFT       97)
(XEN) ACPI: DSDT BF7904B0, 4D6A (r2  CTSAV CTSAV122      122 INTL 20051117)
(XEN) ACPI: FACS BF79E000, 0040
(XEN) ACPI: APIC BF790390, 00D8 (r2 112309 APIC1113 20091123 MSFT       97)
(XEN) ACPI: MCFG BF790470, 003C (r1 112309 OEMMCFG  20091123 MSFT       97)
(XEN) ACPI: OEMB BF79E040, 007A (r1 112309 OEMB1113 20091123 MSFT       97)
(XEN) ACPI: SRAT BF79A4B0, 01D0 (r1 112309 OEMSRAT         1 INTL        1)
(XEN) ACPI: HPET BF79A680, 0038 (r1 112309 OEMHPET  20091123 MSFT       97)
(XEN) ACPI: SSDT BF7A1A00, 0363 (r1 DpgPmm    CpuPm       12 INTL 20051117)
(XEN) --------------847
(XEN) ---------srat enter
(XEN) ---------prepare enter into pfn
(XEN) -------in pfn
(XEN) -------hole shift returned
(XEN) --------------849
(XEN) System RAM: 24542MB (25131224kB)

(XEN) Unknown interrupt (cr2=0000000000000000)
(XEN)     00000000000000ab    0000000000000000    ffff82f600004020    00007d0a00000000    ffff82f600004000    0000000000000020    0000000000201000    0000000000000000    ffffffffffffffff    0000000000000000    0000000000000008    0000000000000000    00000000000001ff    00000000000001ff    0000000000000000    ffff82c480115787    000000000000e008    0000000000010002    ffff82c48035fd18    0000000000000000    ffff82c48011536a    0000000000000000    0000000000000000    0000000000000163    0000000900000000    00000000000000ab    0000000000000201    0000000000000000    0000000000000100    ffff82f600004020    0000000000000eff    0000000000000000    ffff82c480115e60    0000000000000000    ffff82f600002020    0000000000001000    0000000000000004    0000000000000080    0000000000000001    ffff82c48020be8d    ffff830000100000    0000000000000008    0000000000000000    0000000000000000    ffffffffffffffff    0000000000000101    ffff82c48022d8fc    0000000000540000    00000000005fde36    0000000000540000    0000000000100000    0000000100000000    0000000000000010    ffff82c48024deb4    ffff82c4802404f7    0000000000000000    0000000000000000    0000000000000000    0000000000000000    0000000000000000    ffff8300bf568ff8    ffff8300bf569ff8    000000000022a630    000000000022a695    0000000000087f00    0000000000000000    ffff830000087fc0    00000000005fde36    000000000087b6d0    0000000000d44000    0000000001000000    0000000000000000    ffffffffffffffff    ffff830000087f00    0000100000000000    0000000800000000    000000010000006e    0000000000000003    00000000000002f8    0000000000000000    0000000000000000    0000000000067ebc    0000000000000000    0000000000000000    0000000000000000    0000000000000000    ffff82c4801000b5    0000000000000000    0000000000000000    0000000000000000    0000000000000000    0000000000000000    0000000000000000    0000000000000000    0000000000000000    0000000000000000    0000000000000000    0000000000000000    0000000000000000    
0000000000000000    0000000000000000    0000000000000000    0000000000000000    0000000000000000    0000000000000000    0000000000000000    0000000000000000    0000000000000000    0000000000000000    0000000000000000    0000000000000000    0000000000000000    0000000000000000    00000000fffff000
 
> Date: Wed, 1 Sep 2010 09:49:18 +0100
> Subject: Re: [Xen-devel] Xen-unstable panic: FATAL PAGE FAULT
> From: keir.fraser@eu.citrix.com
> To: JBeulich@novell.com
> CC: tinnycloud@hotmail.com; xen-devel@lists.xensource.com
> 
> On 01/09/2010 09:02, "Jan Beulich" <JBeulich@novell.com> wrote:
> 
> >> Well I agree with your logic anyway. So I don't see that this can be the
> >> cause of MaoXiaoyun's bug. At least not directly. But then I'm stumped as to
> >> why the page arithmetic and checks in free_heap_pages are (apparently)
> >> resulting in a page pointer way outside the frame-table region and actually
> >> in the directmap region.
> > 
> > There must be some unchecked use of PAGE_LIST_NULL, i.e.
> > running off a list end without taking notice (0xffff8315ffffffe4
> > exactly corresponds with that).
> 
> Okay, my next guess then is that we are deleting a chunk from the wrong list
> head. I don't see any check that the adjacent chunks we are considering to
> merge are from the same node and zone. I suppose the zone logic does just
> work as we're dealing with 2**x aligned and sized regions. But, shouldn't
> the merging logic in free_heap_pages be checking that the merging candidate
> is from the same NUMA node? I see I have an ASSERTion later in the same
> function, but it's too weak and wishful I suspect.
> 
> MaoXiaoyun: can you please test with the attached patch? If I'm right, you
> will crash on one of the BUG_ON checks that I added, rather than crashing on
> a pointer dereference. You may even crash during boot. Anyhow, what is
> interesting is whether this patch always makes you crash on BUG_ON before
> you would normally crash on pointer dereference. If so this is trivial to
> fix.
> 
> Thanks,
> Keir
> 
 		 	   		  

[-- Attachment #1.2: Type: text/html, Size: 14548 bytes --]


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: Xen-unstable panic: FATAL PAGE FAULT
  2010-09-01  9:01                                   ` Jan Beulich
@ 2010-09-01  9:28                                     ` Keir Fraser
  2010-09-01  9:48                                     ` MaoXiaoyun
  1 sibling, 0 replies; 39+ messages in thread
From: Keir Fraser @ 2010-09-01  9:28 UTC (permalink / raw)
  To: Jan Beulich, MaoXiaoyun; +Cc: xen devel

On 01/09/2010 10:01, "Jan Beulich" <JBeulich@novell.com> wrote:

>>>> On 01.09.10 at 10:49, Keir Fraser <keir.fraser@eu.citrix.com> wrote:
>> Okay, my next guess then is that we are deleting a chunk from the wrong list
>> head. I don't see any check that the adjacent chunks we are considering to
>> merge are from the same node and zone. I suppose the zone logic does just
>> work as we're dealing with 2**x aligned and sized regions. But, shouldn't
>> the merging logic in free_heap_pages be checking that the merging candidate
>> is from the same NUMA node? I see I have an ASSERTion later in the same
>> function, but it's too weak and wishful I suspect.
> 
> Hmm, we're keeping a page reserved if node boundaries aren't
> well aligned (at the end of init_heap_pages()), so that shouldn't
> be possible.

Oh yes, that ought to be sufficient really.

 -- Keir

^ permalink raw reply	[flat|nested] 39+ messages in thread

* RE: Xen-unstable panic: FATAL PAGE FAULT
  2010-09-01  9:01                                   ` Jan Beulich
  2010-09-01  9:28                                     ` Keir Fraser
@ 2010-09-01  9:48                                     ` MaoXiaoyun
  2010-09-01 10:09                                       ` Jan Beulich
  1 sibling, 1 reply; 39+ messages in thread
From: MaoXiaoyun @ 2010-09-01  9:48 UTC (permalink / raw)
  To: jbeulich, keir.fraser; +Cc: xen devel


[-- Attachment #1.1: Type: text/plain, Size: 35101 bytes --]


See the log below; is this sufficient? Thanks.

 

---------------------------------------------------

root (hd0,0)                                                                    

Filesystem type is ext2fs, partition type 0x83                                 
kernel /xen-4.0.0.gz msi=1 iommu=off x2apic=off hap=0 dom0_mem=10240M dom0_max_ 
vcpus=4 dom0_vcpus_pin console=com1,vga com1=115200,8n1 conswitch=ax noreboot   
   [Multiboot-elf, <0x100000:0x152000:0x148000>, shtab=0x39a078, entry=0x100000 
]                                                                               
module /vmlinuz-2.6.31.13-pvops-patch ro root=LABEL=/ hda=noprobe console=hvc0  
                                                                                
                                                                                
                                                                                 __  __            _  _    ___   ___  
 \ \/ /___ _ __   | || |  / _ \ / _ \                                      *    
  \  // _ \ '_ \  | || |_| | | | | | |                                     *    
  /  \  __/ | | | |__   _| |_| | |_| |                                     *    
 /_/\_\___|_| |_|    |_|(_)___(_)___/                                      * *  
                                      **************************************    
(XEN) Xen version 4.0.0 (root@dev.sd.hello.com) (gcc version 4.1.2 20080704 (Red Hat 4.1.2-46)) Wed Sep  1 17:39:07 CST 2010
(XEN) Latest ChangeSet: unavailableted OS, 'e' to edit the                      
(XEN) Command line: msi=1 iommu=off x2apic=off hap=0 dom0_mem=10240M dom0_max_vcpus=4 dom0_vcpus_pin console=com1,vga com1=115200,8n1 conswitch=ax noreboot
(XEN) Video information: 'c' for a command-line.                                
(XEN)  VGA is text mode 80x25, font 8x16                                        
(XEN)  VBE/DDC methods: none; EDID transfer time: 0 secondsseconds.             
(XEN)  EDID info not retrieved because no DDC retrieval method detected         
(XEN) Disc information:                                                        
(XEN)  Found 6 MBR signatures  
(XEN)  Found 6 EDD information structures
(XEN) Xen-e820 RAM map:ebooting the system...
(XEN)  0000000000000000 - 000000000009a800 (usable)
(XEN)  000000000009a800 - 00000000000a0000 (reserved)
(XEN)  00000000000e4bb0 - 0000000000100000 (reserved)
(XEN)  0000000000100000 - 00000000bf790000 (usable)
(XEN)  00000000bf790000 - 00000000bf79e000 (ACPI data)
(XEN)  00000000bf79e000 - 00000000bf7d0000 (ACPI NVS)
(XEN)  00000000bf7d0000 - 00000000bf7e0000 (reserved)
(XEN)  00000000bf7ec000 - 00000000c0000000 (reserved)
(XEN)  00000000e0000000 - 00000000f0000000 (reserved)
(XEN)  00000000fee00000 - 00000000fee01000 (reserved)
(XEN)  00000000fff00000 - 0000000100000000 (reserved)
(XEN)  0000000100000000 - 0000000640000000 (usable)
(XEN) ACPI: RSDP 000F9DD0, 0024 (r2 ACPIAM)
(XEN) ACPI: XSDT BF790100, 005C (r1 112309 XSDT1113 20091123 MSFT       97)
(XEN) ACPI: FACP BF790290, 00F4 (r4 112309 FACP1113 20091123 MSFT       97)
(XEN) ACPI: DSDT BF7904B0, 4D6A (r2  CTSAV CTSAV122      122 INTL 20051117)
(XEN) ACPI: FACS BF79E000, 0040
(XEN) ACPI: APIC BF790390, 00D8 (r2 112309 APIC1113 20091123 MSFT       97)
(XEN) ACPI: MCFG BF790470, 003C (r1 112309 OEMMCFG  20091123 MSFT       97)
(XEN) ACPI: OEMB BF79E040, 007A (r1 112309 OEMB1113 20091123 MSFT       97)
(XEN) ACPI: SRAT BF79A4B0, 01D0 (r1 112309 OEMSRAT         1 INTL        1)
(XEN) ACPI: HPET BF79A680, 0038 (r1 112309 OEMHPET  20091123 MSFT       97)
(XEN) ACPI: SSDT BF7A1A00, 0363 (r1 DpgPmm    CpuPm       12 INTL 20051117)
(XEN) System RAM: 24542MB (25131224kB)
(XEN) Domain heap initialised DMA width 31 bits
(XEN) Processor #0 7:10 APIC version 21
(XEN) Processor #16 7:10 APIC version 21
(XEN) Processor #2 7:10 APIC version 21
(XEN) Processor #18 7:10 APIC version 21
(XEN) Processor #4 7:10 APIC version 21
(XEN) Processor #20 7:10 APIC version 21
(XEN) Processor #6 7:10 APIC version 21
(XEN) Processor #22 7:10 APIC version 21
(XEN) Processor #1 7:10 APIC version 21
(XEN) Processor #17 7:10 APIC version 21
(XEN) Processor #3 7:10 APIC version 21
(XEN) Processor #19 7:10 APIC version 21
(XEN) Processor #5 7:10 APIC version 21
(XEN) Processor #21 7:10 APIC version 21
(XEN) Processor #7 7:10 APIC version 21
(XEN) Processor #23 7:10 APIC version 21
(XEN) IOAPIC[0]: apic_id 8, version 32, address 0xfec00000, GSI 0-23
(XEN) IOAPIC[1]: apic_id 9, version 32, address 0xfec8a000, GSI 24-47
(XEN) Enabling APIC mode:  Phys.  Using 2 I/O APICs
(XEN) Using scheduler: SMP Credit Scheduler (credit)
(XEN) Detected 2400.126 MHz processor.
(XEN) Initing memory sharing.
(XEN) VMX: Supported advanced features:
(XEN)  - APIC MMIO access virtualisation
(XEN)  - APIC TPR shadow
(XEN)  - Extended Page Tables (EPT)
(XEN)  - Virtual-Processor Identifiers (VPID)
(XEN)  - Virtual NMI
(XEN)  - MSR direct-access bitmap
(XEN) HVM: ASIDs enabled.
(XEN) HVM: VMX enabled
(XEN) HVM: Hardware Assisted Paging detected.
(XEN) I/O virtualisation disabled
(XEN) MCE: The hardware reports a non fatal, correctable incident occurred on CPU 0.
(XEN) Bank 8: ea1e1200008000b0[               0]
(XEN) Total of 16 processors activated.
(XEN) ENABLING IO-APIC IRQs
(XEN)  -> Using new ACK method
(XEN) TSC is reliable, synchronization unnecessary
(XEN) Platform timer is 14.318MHz HPET
?(XEN) Allocated console ring of 32 KiB.
(XEN) Brought up 16 CPUs
(XEN) *** LOADING DOMAIN 0 ***
(XEN)  Xen  kernel: 64-bit, lsb, compat32
(XEN)  Dom0 kernel: 64-bit, PAE, lsb, paddr 0x1000000 -> 0x1845000
(XEN) PHYSICAL MEMORY ARRANGEMENT:
(XEN)  Dom0 alloc.:   0000000238000000->000000023c000000 (2605056 pages to be allocated)
(XEN) VIRTUAL MEMORY ARRANGEMENT:
(XEN)  Loaded kernel: ffffffff81000000->ffffffff81845000
(XEN)  Init. ramdisk: ffffffff81845000->ffffffff81d9f200
(XEN)  Phys-Mach map: ffffffff81da0000->ffffffff831a0000
(XEN)  Start info:    ffffffff831a0000->ffffffff831a04b4
(XEN)  Page tables:   ffffffff831a1000->ffffffff831be000
(XEN)  Boot stack:    ffffffff831be000->ffffffff831bf000
(XEN)  TOTAL:         ffffffff80000000->ffffffff83400000
(XEN)  ENTRY ADDRESS: ffffffff816d0060
(XEN) Dom0 has maximum 4 VCPUs
(XEN) Scrubbing Free RAM: ............................................................................................................................................done.
(XEN) Xen trace buffers: disabled
(XEN) Std. Loglevel: Errors and warnings
(XEN) Guest Loglevel: Nothing (Rate-limited: Errors and warnings)
(XEN) Xen is relinquishing VGA console.
(XEN) *** Serial input -> Xen (type 'CTRL-a' three times to switch input to DOM0)
(XEN) Freed 156kB init memory.
mapping kernel into physical memory
Xen: setup ISA identity maps
about to get started...
(XEN) ioapic_guest_write: apic=0, pin=2, irq=0
(XEN) ioapic_guest_write: new_entry=000100f0
(XEN) ioapic_guest_write: old_entry=000000f0 pirq=0
(XEN) ioapic_guest_write: Attempt to modify IO-APIC pin for in-use IRQ!
(XEN) irq.c:1445: dom0: pirq 0 or irq 3 already mapped
(XEN) ioapic_guest_write: apic=0, pin=4, irq=4
(XEN) ioapic_guest_write: new_entry=000100f1
(XEN) ioapic_guest_write: old_entry=000000f1 pirq=0
(XEN) ioapic_guest_write: Attempt to modify IO-APIC pin for in-use IRQ!
(XEN) irq.c:1445: dom0: pirq 0 or irq 5 already mapped
(XEN) irq.c:1445: dom0: pirq 0 or irq 6 already mapped
(XEN) irq.c:1445: dom0: pirq 0 or irq 7 already mapped
(XEN) irq.c:1445: dom0: pirq 0 or irq 8 already mapped
(XEN) irq.c:1445: dom0: pirq 0 or irq 9 already mapped
(XEN) irq.c:1445: dom0: pirq 0 or irq 10 already mapped
(XEN) irq.c:1445: dom0: pirq 0 or irq 11 already mapped
(XEN) irq.c:1445: dom0: pirq 0 or irq 12 already mapped
(XEN) irq.c:1445: dom0: pirq 0 or irq 13 already mapped
(XEN) ioapic_guest_write: apic=0, pin=0, irq=0
(XEN) ioapic_guest_write: new_entry=000000f0
(XEN) ioapic_guest_write: old_entry=00010a7d pirq=0
(XEN) ioapic_guest_write: Attempt to modify IO-APIC pin for in-use IRQ!
Initializing cgroup subsys cpuset
Initializing cgroup subsys cpu
Linux version 2.6.31.13-pvops-patch (root@houyi-chunk2.dev.sd.hello.com) (gcc version 4.1.2 20080704 (Red Hat 4.1.2-46)) #1 SMP Tue Aug 24 11:23:51 CST 2010
Command line: ro root=LABEL=/ hda=noprobe console=hvc0
KERNEL supported cpus:
  Intel GenuineIntel
  AMD AuthenticAMD
  Centaur CentaurHauls
xen_release_chunk: looking at area pfn bf7e0-bf7ec: 12 pages freed
xen_release_chunk: looking at area pfn c0000-e0000: 131072 pages freed
xen_release_chunk: looking at area pfn f0000-fec00: 60416 pages freed
xen_release_chunk: looking at area pfn fec01-fec8a: 137 pages freed
xen_release_chunk: looking at area pfn fec8b-fee00: 373 pages freed
xen_release_chunk: looking at area pfn fee01-fff00: 4351 pages freed
released 196361 pages of unused memory
BIOS-provided physical RAM map:
 Xen: 0000000000000000 - 000000000009a800 (usable)
 Xen: 000000000009a800 - 0000000000100000 (reserved)
 Xen: 0000000000100000 - 00000000bf790000 (usable)
 Xen: 00000000bf790000 - 00000000bf79e000 (ACPI data)
 Xen: 00000000bf79e000 - 00000000bf7d0000 (ACPI NVS)
 Xen: 00000000bf7d0000 - 00000000bf7e0000 (reserved)
 Xen: 00000000bf7ec000 - 00000000c0000000 (reserved)
 Xen: 00000000e0000000 - 00000000f0000000 (reserved)
 Xen: 00000000fec00000 - 00000000fec01000 (reserved)
 Xen: 00000000fec8a000 - 00000000fec8b000 (reserved)
 Xen: 00000000fee00000 - 00000000fee01000 (reserved)
 Xen: 00000000fff00000 - 0000000100000000 (reserved)
 Xen: 0000000100000000 - 0000000280000000 (usable)
DMI present.
AMI BIOS detected: BIOS may corrupt low RAM, working around it.
last_pfn = 0x280000 max_arch_pfn = 0x400000000
last_pfn = 0xbf790 max_arch_pfn = 0x400000000
init_memory_mapping: 0000000000000000-00000000bf790000
init_memory_mapping: 0000000100000000-0000000280000000
RAMDISK: 01845000 - 01d9f200
ACPI: RSDP 00000000000f9dd0 00024 (v02 ACPIAM)
ACPI: XSDT 00000000bf790100 0005C (v01 112309 XSDT1113 20091123 MSFT 00000097)
ACPI: FACP 00000000bf790290 000F4 (v04 112309 FACP1113 20091123 MSFT 00000097)
ACPI: DSDT 00000000bf7904b0 04D6A (v02  CTSAV CTSAV122 00000122 INTL 20051117)
ACPI: FACS 00000000bf79e000 00040
ACPI: APIC 00000000bf790390 000D8 (v02 112309 APIC1113 20091123 MSFT 00000097)
ACPI: MCFG 00000000bf790470 0003C (v01 112309 OEMMCFG  20091123 MSFT 00000097)
ACPI: OEMB 00000000bf79e040 0007A (v01 112309 OEMB1113 20091123 MSFT 00000097)
ACPI: SRAT 00000000bf79a4b0 001D0 (v01 112309 OEMSRAT  00000001 INTL 00000001)
ACPI: HPET 00000000bf79a680 00038 (v01 112309 OEMHPET  20091123 MSFT 00000097)
ACPI: SSDT 00000000bf7a1a00 00363 (v01 DpgPmm    CpuPm 00000012 INTL 20051117)
(9 early reservations) ==> bootmem [0000000000 - 0280000000]
  #0 [0000000000 - 0000001000]   BIOS data page ==> [0000000000 - 0000001000]
  #1 [00031a1000 - 00031be000]   XEN PAGETABLES ==> [00031a1000 - 00031be000]
  #2 [0000006000 - 0000008000]       TRAMPOLINE ==> [0000006000 - 0000008000]
  #3 [0001000000 - 0001824630]    TEXT DATA BSS ==> [0001000000 - 0001824630]
  #4 [0001845000 - 0001d9f200]          RAMDISK ==> [0001845000 - 0001d9f200]
  #5 [0001da0000 - 00031a1000]   XEN START INFO ==> [0001da0000 - 00031a1000]
  #6 [0001825000 - 0001825187]              BRK ==> [0001825000 - 0001825187]
  #7 [0000100000 - 00006e0000]          PGTABLE ==> [0000100000 - 00006e0000]
  #8 [00031be000 - 0003dc4000]          PGTABLE ==> [00031be000 - 0003dc4000]
Zone PFN ranges:
  DMA      0x00000010 -> 0x00001000
  DMA32    0x00001000 -> 0x00100000
  Normal   0x00100000 -> 0x00280000
Movable zone start PFN for each node
early_node_map[3] active PFN ranges
    0: 0x00000010 -> 0x0000009a
    0: 0x00000100 -> 0x000bf790
    0: 0x00100000 -> 0x00280000
ACPI: PM-Timer IO Port: 0x808
ACPI: LAPIC (acpi_id[0x01] lapic_id[0x00] enabled)
ACPI: LAPIC (acpi_id[0x02] lapic_id[0x10] enabled)
ACPI: LAPIC (acpi_id[0x03] lapic_id[0x02] enabled)
ACPI: LAPIC (acpi_id[0x04] lapic_id[0x12] enabled)
ACPI: LAPIC (acpi_id[0x05] lapic_id[0x04] enabled)
ACPI: LAPIC (acpi_id[0x06] lapic_id[0x14] enabled)
ACPI: LAPIC (acpi_id[0x07] lapic_id[0x06] enabled)
ACPI: LAPIC (acpi_id[0x08] lapic_id[0x16] enabled)
ACPI: LAPIC (acpi_id[0x09] lapic_id[0x01] enabled)
ACPI: LAPIC (acpi_id[0x0a] lapic_id[0x11] enabled)
ACPI: LAPIC (acpi_id[0x0b] lapic_id[0x03] enabled)
ACPI: LAPIC (acpi_id[0x0c] lapic_id[0x13] enabled)
ACPI: LAPIC (acpi_id[0x0d] lapic_id[0x05] enabled)
ACPI: LAPIC (acpi_id[0x0e] lapic_id[0x15] enabled)
ACPI: LAPIC (acpi_id[0x0f] lapic_id[0x07] enabled)
ACPI: LAPIC (acpi_id[0x10] lapic_id[0x17] enabled)
ACPI: IOAPIC (id[0x08] address[0xfec00000] gsi_base[0])
IOAPIC[0]: apic_id 8, version 32, address 0xfec00000, GSI 0-23
ACPI: IOAPIC (id[0x09] address[0xfec8a000] gsi_base[24])
IOAPIC[1]: apic_id 9, version 32, address 0xfec8a000, GSI 24-47
ACPI: INT_SRC_OVR (bus 0 bus_irq 0 global_irq 2 dfl dfl)
ACPI: INT_SRC_OVR (bus 0 bus_irq 9 global_irq 9 high level)
Using ACPI (MADT) for SMP configuration information
ACPI: HPET id: 0x8086a301 base: 0xfed00000
SMP: Allowing 4 CPUs, 0 hotplug CPUs
PM: Registered nosave memory: 000000000009a000 - 000000000009b000
PM: Registered nosave memory: 000000000009b000 - 0000000000100000
PM: Registered nosave memory: 00000000bf790000 - 00000000bf79e000
PM: Registered nosave memory: 00000000bf79e000 - 00000000bf7d0000
PM: Registered nosave memory: 00000000bf7d0000 - 00000000bf7e0000
PM: Registered nosave memory: 00000000bf7e0000 - 00000000bf7ec000
PM: Registered nosave memory: 00000000bf7ec000 - 00000000c0000000
PM: Registered nosave memory: 00000000c0000000 - 00000000e0000000
PM: Registered nosave memory: 00000000e0000000 - 00000000f0000000
PM: Registered nosave memory: 00000000f0000000 - 00000000fec00000
PM: Registered nosave memory: 00000000fec00000 - 00000000fec01000
PM: Registered nosave memory: 00000000fec01000 - 00000000fec8a000
PM: Registered nosave memory: 00000000fec8a000 - 00000000fec8b000
PM: Registered nosave memory: 00000000fec8b000 - 00000000fee00000
PM: Registered nosave memory: 00000000fee00000 - 00000000fee01000
PM: Registered nosave memory: 00000000fee01000 - 00000000fff00000
PM: Registered nosave memory: 00000000fff00000 - 0000000100000000
Allocating PCI resources starting at c0000000 (gap: c0000000:20000000)
NR_CPUS:64 nr_cpumask_bits:64 nr_cpu_ids:4 nr_node_ids:1
PERCPU: Allocated 22 4k pages, static data 89696 bytes
Xen: using vcpu_info placement
Built 1 zonelists in Zone order, mobility grouping on.  Total pages: 2319671
Kernel command line: ro root=LABEL=/ hda=noprobe console=hvc0
PID hash table entries: 4096 (order: 12, 32768 bytes)
Dentry cache hash table entries: 2097152 (order: 12, 16777216 bytes)
Inode-cache hash table entries: 1048576 (order: 11, 8388608 bytes)
Initializing CPU#0
PCI-DMA: Using Xen software bounce buffering for IO (Xen-SWIOTLB)
Placing 64MB Xen software IO TLB between ffff880020000000 - ffff880024000000
Xen software IO TLB at phys 0x20000000 - 0x24000000
Memory: 9154192k/10485760k available (4104k kernel code, 1057688k absent, 272952k reserved, 2777k data, 480k init)
SLUB: Genslabs=13, HWalign=64, Order=0-3, MinObjects=0, CPUs=4, Nodes=1
Hierarchical RCU implementation.
NR_IRQS:4352 nr_irqs:848
xen_set_ioapic_routing: irq 0 gsi 0 vector 0 ioapic 0 pin 0 triggering 0 polarity 0
xen_set_ioapic_routing: irq 1 gsi 1 vector 1 ioapic 0 pin 1 triggering 0 polarity 0
xen_set_ioapic_routing: irq 3 gsi 3 vector 3 ioapic 0 pin 3 triggering 0 polarity 0
xen_set_ioapic_routing: irq 4 gsi 4 vector 4 ioapic 0 pin 4 triggering 0 polarity 0
xen_set_ioapic_routing: irq 5 gsi 5 vector 5 ioapic 0 pin 5 triggering 0 polarity 0
xen_set_ioapic_routing: irq 6 gsi 6 vector 6 ioapic 0 pin 6 triggering 0 polarity 0
xen_set_ioapic_routing: irq 7 gsi 7 vector 7 ioapic 0 pin 7 triggering 0 polarity 0
xen_set_ioapic_routing: irq 8 gsi 8 vector 8 ioapic 0 pin 8 triggering 0 polarity 0
xen_set_ioapic_routing: irq 9 gsi 9 vector 9 ioapic 0 pin 9 triggering 1 polarity 0
xen_set_ioapic_routing: irq 10 gsi 10 vector 10 ioapic 0 pin 10 triggering 0 polarity 0
xen_set_ioapic_routing: irq 11 gsi 11 vector 11 ioapic 0 pin 11 triggering 0 polarity 0
xen_set_ioapic_routing: irq 12 gsi 12 vector 12 ioapic 0 pin 12 triggering 0 polarity 0
xen_set_ioapic_routing: irq 13 gsi 13 vector 13 ioapic 0 pin 13 triggering 0 polarity 0
xen_set_ioapic_routing: irq 14 gsi 14 vector 14 ioapic 0 pin 14 triggering 0 polarity 0
xen_set_ioapic_routing: irq 15 gsi 15 vector 15 ioapic 0 pin 15 triggering 0 polarity 0
Detected 2400.126 MHz processor.
Console: colour VGA+ 80x25
console [hvc0] enabled
allocated 94371840 bytes of page_cgroup
please try 'cgroup_disable=memory' option if you don't want memory cgroups
installing Xen timer for CPU 0
xen: vcpu_time_info placement not supported
Calibrating delay loop (skipped), value calculated using timer frequency.. 4800.25 BogoMIPS (lpj=2400126)
Security Framework initialized
SELinux:  Initializing.
Mount-cache hash table entries: 256
Initializing cgroup subsys ns
Initializing cgroup subsys cpuacct
Initializing cgroup subsys memory
Initializing cgroup subsys devices
Initializing cgroup subsys freezer
Initializing cgroup subsys net_cls
CPU: L1 I cache: 32K, L1 D cache: 32K
CPU: L2 cache: 256K
CPU: L3 cache: 8192K
CPU: Unsupported number of siblings 16
mce: CPU supports 9 MCE banks
Performance Counters: unsupported p6 CPU model 26 no PMU driver, software counters only.
SMP alternatives: switching to UP code
ACPI: Core revision 20090521
ftrace: converting mcount calls to 0f 1f 44 00 00
ftrace: allocating 24253 entries in 96 pages
installing Xen timer for CPU 1
SMP alternatives: switching to SMP code
Initializing CPU#1
CPU: L1 I cache: 32K, L1 D cache: 32K
CPU: L2 cache: 256K
CPU: L3 cache: 8192K
CPU: Unsupported number of siblings 16
mce: CPU supports 9 MCE banks
installing Xen timer for CPU 2
Initializing CPU#2
CPU: L1 I cache: 32K, L1 D cache: 32K
CPU: L2 cache: 256K
CPU: L3 cache: 8192K
CPU: Unsupported number of siblings 16
mce: CPU supports 9 MCE banks
installing Xen timer for CPU 3
Initializing CPU#3
CPU: L1 I cache: 32K, L1 D cache: 32K
CPU: L2 cache: 256K
CPU: L3 cache: 8192K
CPU: Unsupported number of siblings 16
mce: CPU supports 9 MCE banks
Brought up 4 CPUs
Booting paravirtualized kernel on Xen
Xen version: 4.0.0 (preserve-AD) (dom0)
Grant tables using version 2 layout.
Grant table initialized
regulator: core version 0.5
NET: Registered protocol family 16
xenbus_probe_init ok
ACPI: bus type pci registered
PCI: MCFG configuration 0: base e0000000 segment 0 buses 0 - 255
PCI: MCFG area at e0000000 reserved in E820
PCI: Using MMCONFIG at e0000000 - efffffff
PCI: Using configuration type 1 for base access
bio: create slab <bio-0> at 0
ACPI: Interpreter enabled
ACPI: (supports S0 S5)
ACPI: Using IOAPIC for interrupt routing
ACPI: No dock devices found.
ACPI: PCI Root Bridge [PCI0] (0000:00)
pci 0000:00:00.0: PME# supported from D0 D3hot D3cold
pci 0000:00:00.0: PME# disabled
pci 0000:00:01.0: PME# supported from D0 D3hot D3cold
pci 0000:00:01.0: PME# disabled
pci 0000:00:03.0: PME# supported from D0 D3hot D3cold
pci 0000:00:03.0: PME# disabled
pci 0000:00:07.0: PME# supported from D0 D3hot D3cold
pci 0000:00:07.0: PME# disabled
pci 0000:00:09.0: PME# supported from D0 D3hot D3cold
pci 0000:00:09.0: PME# disabled
pci 0000:00:13.0: PME# supported from D0 D3hot D3cold
pci 0000:00:13.0: PME# disabled
pci 0000:00:1a.7: PME# supported from D0 D3hot D3cold
pci 0000:00:1a.7: PME# disabled
pci 0000:00:1d.7: PME# supported from D0 D3hot D3cold
pci 0000:00:1d.7: PME# disabled
pci 0000:01:00.0: PME# supported from D0 D3hot D3cold
pci 0000:01:00.0: PME# disabled
pci 0000:01:00.1: PME# supported from D0 D3hot D3cold
pci 0000:01:00.1: PME# disabled
pci 0000:00:1e.0: transparent bridge
ACPI: PCI Interrupt Link [LNKA] (IRQs 3 4 6 7 10 11 12 14 *15)
ACPI: PCI Interrupt Link [LNKB] (IRQs *5)
ACPI: PCI Interrupt Link [LNKC] (IRQs 3 4 6 7 10 *11 12 14 15)
ACPI: PCI Interrupt Link [LNKD] (IRQs 3 4 6 7 *10 11 12 14 15)
ACPI: PCI Interrupt Link [LNKE] (IRQs 3 4 6 7 10 11 12 14 15) *0, disabled.
ACPI: PCI Interrupt Link [LNKF] (IRQs 3 4 6 7 10 11 12 14 15) *0, disabled.
ACPI: PCI Interrupt Link [LNKG] (IRQs 3 4 6 7 10 11 12 14 15) *0, disabled.
ACPI: PCI Interrupt Link [LNKH] (IRQs 3 4 6 7 10 11 12 *14 15)
xenbus_probe_backend_init bus registered ok
xen_balloon: Initialising balloon driver with page order 0.
SCSI subsystem initialized
usbcore: registered new interface driver usbfs
usbcore: registered new interface driver hub
usbcore: registered new device driver usb
PCI: Using ACPI for IRQ routing
IO APIC resources couldn't be allocated.
NetLabel: Initializing
NetLabel:  domain hash size = 128
NetLabel:  protocols = UNLABELED CIPSOv4
NetLabel:  unlabeled traffic allowed by default
pnp: PnP ACPI init
ACPI: bus type pnp registered
xen_allocate_pirq: returning irq 8 for gsi 8
xen_set_ioapic_routing: irq 8 gsi 8 vector 8 ioapic 0 pin 8 triggering 0 polarity 0
xen_allocate_pirq: returning irq 13 for gsi 13
xen_set_ioapic_routing: irq 13 gsi 13 vector 13 ioapic 0 pin 13 triggering 0 polarity 0
xen_allocate_pirq: returning irq 4 for gsi 4
xen_set_ioapic_routing: irq 4 gsi 4 vector 4 ioapic 0 pin 4 triggering 0 polarity 0
pnp: PnP ACPI: found 12 devices
ACPI: ACPI bus type pnp unregistered
system 00:01: iomem range 0xfbf00000-0xfbffffff has been reserved
system 00:01: iomem range 0xfc000000-0xfcffffff has been reserved
system 00:01: iomem range 0xfd000000-0xfdffffff has been reserved
system 00:01: iomem range 0xfe000000-0xfebfffff has been reserved
system 00:06: ioport range 0x4d0-0x4d1 has been reserved
system 00:06: ioport range 0x800-0x87f has been reserved
system 00:06: ioport range 0x500-0x57f has been reserved
system 00:06: iomem range 0xfed1c000-0xfed1ffff has been reserved
system 00:06: iomem range 0xfed20000-0xfed3ffff has been reserved
system 00:06: iomem range 0xfed40000-0xfed8ffff has been reserved
system 00:08: iomem range 0xfec00000-0xfec00fff has been reserved
system 00:08: iomem range 0xfee00000-0xfee00fff has been reserved
system 00:0a: iomem range 0xe0000000-0xefffffff has been reserved
system 00:0b: iomem range 0x0-0x9ffff could not be reserved
system 00:0b: iomem range 0xc0000-0xcffff could not be reserved
system 00:0b: iomem range 0xe0000-0xfffff could not be reserved
system 00:0b: iomem range 0x100000-0xbf8fffff could not be reserved
system 00:0b: iomem range 0xfed90000-0xffffffff could not be reserved
PM-Timer failed consistency check  (0x0xffffff) - aborting.
pci 0000:00:01.0: PCI bridge, secondary bus 0000:01
pci 0000:00:01.0:   IO window: disabled
pci 0000:00:01.0:   MEM window: 0xf6000000-0xf9ffffff
pci 0000:00:01.0:   PREFETCH window: disabled
pci 0000:00:03.0: PCI bridge, secondary bus 0000:02
pci 0000:00:03.0:   IO window: disabled
pci 0000:00:03.0:   MEM window: disabled
pci 0000:00:03.0:   PREFETCH window: disabled
pci 0000:00:07.0: PCI bridge, secondary bus 0000:03
pci 0000:00:07.0:   IO window: disabfba00000-0xfbdfffff
pci 0000:00:09.0:   PREFETCH window: disabled
pci 0000:00:1e.0: PCI bridge, secondary bus 0000:05
pci te cache hash table entries: 524288 (order: 10, 4194304 bytes)
TCP established hash table entries: 262144 (order: 10, 4194304 bytes)
TCP bind hash table entries: 65536 (order: 8, 1048576 bytes)
TCP: Hash tables configured (established 262144 bind 65536)
TCP reno registered
NET: Registered protocol family 1
Trying to unpack rootfs image as initramfs...
Freeing initrd memory: 5480k freed
audit: initializing netlink socket (disabled)
type=2000 audit(1283362843.039:1): initialized
HugeTLB registered 2 MB page size, pre-allocated 0 pages
VFS: Disk quotas dquot_6.5.2
Dquot-cache hash table entries: 512 (order 0, 4096 bytes)
msgmni has been set to 17891
alg: No test for stdrng (krng)
Block layer SCSI generic (bsg) driver version 0.4 loaded (major 252)
io scheduler noop registered
io scheduler anticipatory registered
io scheduler deadline registered
io scheduler cfq registered (default)
pci_hotplug: PCI Hot Plug PCI Core version: 0.5
pciehp: PCI Express Hot Plug Controller Driver version: 0.4
acpiphp: ACPI Hot Plug PCI Controller Driver version: 0.5
input: Power Button as /class/input/input0
ACPI: Power Button [PWRF]
input: Power Button as /class/input/input1
ACPI: Power Button [PWRB]
ACPI: SSDT 00000000bf79e0c0 02FB4 (v01 DpgPmm  P001Ist 00000011 INTL 20051117)
ACPI: SSDT 00000000bf7a1080 00980 (v01  PmRef  P001Cst 00003001 INTL 20051117)
ACPI: CPU-1 (power states: C1[C1] C2[C2] C3[C3])
ACPI: CPU-1 (power states: C1[C1] C2[C2] C3[C3])
ACPI: CPU-1 (power states: C1[C1] C2[C2] C3[C3])
ACPI: CPU-1 (power states: C1[C1] C2[C2] C3[C3])
ACPI: CPU-1 (power states: C1[C1] C2[C2] C3[C3])
ACPI: CPU-1 (power states: C1[C1] C2[C2] C3[C3])
ACPI: CPU-1 (power states: C1[C1] C2[C2] C3[C3])
ACPI: CPU-1 (power states: C1[C1] C2[C2] C3[C3])
ACPI: CPU-1 (power states: C1[C1] C2[C2] C3[C3])
ACPI: CPU-1 (power states: C1[C1] C2[C2] C3[C3])
ACPI: CPU-1 (power states: C1[C1] C2[C2] C3[C3])
ACPI: CPU-1 (power states: C1[C1] C2[C2] C3[C3])
ACPI: CPU-1 (power states: C1[C1] C2[C2] C3[C3])
ACPI: CPU-1 (power states: C1[C1] C2[C2] C3[C3])
ACPI: CPU-1 (power states: C1[C1] C2[C2] C3[C3])
ACPI: CPU-1 (power states: C1[C1] C2[C2] C3[C3])
Event-channel device installed.
blktap_device_init: blktap device major 253
blktap_ring_init: blktap ring major: 251
registering netback
hpet_acpi_add: no address or irqs in _CRS
Non-volatile memory driver v1.3
Linux agpgart interface v0.103
Serial: 8250/16550 driver, 4 ports, IRQ sharing enabled
brd: module loaded
input: Macintosh mouse button emulation as /class/input/input2
Fixed MDIO Bus: probed
PNP: No PS/2 controller found. Probing ports directly.
serio: i8042 KBD port at 0x60,0x64 irq 1
serio: i8042 AUX port at 0x60,0x64 irq 12
mice: PS/2 mouse device common for all mice
rtc_cmos 00:03: RTC can wake from S4
rtc_cmos 00:03: rtc core: registered rtc_cmos as rtc0
rtc0: alarms up to one month, y3k, 114 bytes nvram
device-mapper: uevent: version 1.0.3
device-mapper: ioctl: 4.15.0-ioctl (2009-04-01) initialised: dm-devel@redhat.com
cpuidle: using governor lusb 3-1: new full speed USB device using uhci_hcd and address 2
usb 3-1: New USB device found, idVendor=12d1, idProduct=0003
usb 3-1: New USB device strings: Mfr=1, Product=2, SerialNumber=0
usb 3-1: Product: Huawei Keyboard/Mouse V100
usb 3-1: Manufacturer: Huawei Technologies  
usb 3-1: configuration #1 chosen from 1 choice
input: Huawei Technologies   Huawei Keyboard/Mouse V100 as /class/input/input3
generic-usb 0003:12D1:0003.0001: input,hidraw0: USB HID v1.10 Keyboard [Huawei Technologies   Huawei Keyboard/Mouse V100] on usb-0000:00:1a.0-1/input0
input: Huawei Technologies   Huawei Keyboard/Mouse V100 as /class/input/input4
generic-usb 0003:12D1:0003.0002: input,hidraw1: USB HID v1.10 Mouse [Huawei Technologies   Huawei Keyboard/Mouse V100] on usb-0000:00:1a.0-1/input1
scsi0 : ioc0: LSISAS1068E B3, FwRev=011a0000h, Ports=1, MaxQ=266, IRQ=32
mptsas: ioc0: attaching sata device: fw_channel 0, fw_id 0, phy 0, sas_addr 0x1221000000000000
scsi 0:0:0:0: Direct-Access     ATA      ST31000340NS     SN06 PQ: 0 ANSI: 5
sd 0:0:0:0: Attached scsi generic sg0 type 0
mptsas: ioc0: attaching sata device: fw_channel 0, fw_id 1, phy 1, sas_addr 0x1221000001000000
sd 0:0:0:0: [sda] 1953525168 512-byte logical blocks: (1.00 TB/931 GiB)
scsi 0:0:1:0: Direct-Access     ATA      ST31000340NS     SN06 PQ: 0 ANSI: 5
sd 0:0:1:0: Attached scsi generic sg1 type 0
mptsas: ioc0: attaching sata device: fw_channel 0, fw_id 2, phy 2, sas_addr 0x1221000002000000
sd 0:0:1:0: [sdb] 1953525168 512-byte logical blocks: (1.00 TB/931 GiB)
scsi 0:0:2:0: Direct-Access     ATA      ST31000340NS     SN06 PQ: 0 ANSI: 5
sd 0:0:2:0: Attached scsi generic sg2 type 0
mptsas: ioc0: attaching sata device: fw_channel 0, fw_id 3, phy 3, sas_addr 0x1221000003000000
sd 0:0:2:0: [sdc] 1953525168 512-byte logical blocks: (1.00 TB/931 GiB)
scsi 0:0:3:0: Direct-Access     ATA      ST31000340NS     SN06 PQ: 0 ANSI: 5
sd 0:0:3:0: Attached scsi generic sg3 type 0
mptsas: ioc0: attaching sata device: fw_channel 0, fw_id 4, phy 4, sas_addr 0x1221000004000000
sd 0:0:3:0: [sdd] 1953525168 512-byte logical blocks: (1.00 TB/931 GiB)
scsi 0:0:4:0: Direct-Access     ATA      ST31000340NS     SN06 PQ: 0 ANSI: 5
sd 0:0:4:0: Attached scsi generic sg4 type 0
mptsas: ioc0: attaching sata device: fw_channel 0, fw_id 5, phy 5, sas_addr 0x1221000005000000
sd 0:0:4:0: [sde] 1953525168 512-byte logical blocks: (1.00 TB/931 GiB)
scsi 0:0:5:0: Direct-Access     ATA      ST31000340NS     SN06 PQ: 0 ANSI: 5
sd 0:0:5:0: Attached scsi generic sg5 type 0
sd 0:0:5:0: [sdf] 1953525168 512-byte logical blocks: (1.00 TB/931 GiB)
sd 0:0:0:0: [sda] Write Protect is off
sd 0:0:0:0: [sda] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
sd 0:0:1:0: [sdb] Write Protect is off
sd 0:0:1:0: [sdb] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
sd 0:0:2:0: [sdc] Write Protect is off
sd 0:0:2:0: [sdc] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
sd 0:0:3:0: [sdd] Write Protect is off
sd 0:0:3:0: [sdd] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
sd 0:0:4:0: [sde] Write Protect is off
sd 0:0:5:0: [sdf] Write Protect is off
sd 0:0:4:0: [sde] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
sd 0:0:5:0: [sdf] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
 sda:
 sdb:
 sdc:
 sdd: sda1 sda2 sda3 sda4 < sdb1
 sde:
 sdf: sdc1
 sda5 sdd1
 sda6 >
 sdf1
 sde1
sd 0:0:1:0: [sdb] Attached SCSI disk
sd 0:0:2:0: [sdc] Attached SCSI disk
sd 0:0:0:0: [sda] Attached SCSI disk
sd 0:0:3:0: [sdd] Attached SCSI disk
sd 0:0:4:0: [sde] Attached SCSI disk
sd 0:0:5:0: [sdf] Attached SCSI disk
Loading shpchp.ko module
shpchp: Standard Hot Plug PCI Controller Driver version: 0.4
Loading ata_piix.ko module
xen_allocate_pirq: returning irq 19 for gsi 19
xen_set_ioapic_routing: irq 19 gsi 19 vector 19 ioapic 0 pin 19 triggering 1 polarity 1
(XEN) ioapic_guest_write: apic=0, pin=19, irq=19
(XEN) ioapic_guest_write: new_entry=0001a0a8
(XEN) ioapic_guest_write: old_entry=0000a0a8 pirq=19
(XEN) ioapic_guest_write: Attempt to modify IO-APIC pin for in-use IRQ!
ata_piix 0000:00:1f.2: PCI INT B -> GSI 19 (level, low) -> IRQ 19
ata_piix 0000:00:1f.2: MAP [ P0 P2 P1 P3 ]
scsi1 : ata_piix
scsi2 : ata_piix
ata1: SATA max UDMA/133 cmd 0xb400 ctl 0xb080 bmdma 0xa880 irq 19
ata2: SATA max UDMA/133 cmd 0xb000 ctl 0xac00 bmdma 0xa888 irq 19
xen_allocate_pirq: returning irq 19 for gsi 19
xen_set_ioapic_routing: irq 19 gsi 19 vector 19 ioapic 0 pin 19 triggering 1 polarity 1
(XEN) ioapic_guest_write: apic=0, pin=19, irq=19
(XEN) ioapic_guest_write: new_entry=0001a0a8
(XEN) ioapic_guest_write: old_entry=0000a0a8 pirq=19
(XEN) ioapic_guest_write: Attempt to modify IO-APIC pin for in-use IRQ!
ata_piix 0000:00:1f.5: PCI INT B -> GSI 19 (level, low) -> IRQ 19
ata_piix 0000:00:1f.5: MAP [ P0 -- P1 -- ]
scsi3 : ata_piix
scsi4 : ata_piix
ata3: SATA max UDMA/133 cmd 0xc400 ctl 0xc080 bmdma 0xb880 irq 19
ata4: SATA max UDMA/133 cmd 0xc000 ctl 0xbc00 bmdma 0xb888 irq 19
ata3: SATA link down (SStatus 0 SControl 300)
ata4: SATA link down (SStatus 0 SControl 300)
ata1.00: SATA link down (SStatus 0 SControl 300)
ata1.01: SATA link down (SStatus 0 SControl 300)
ata2.00: SATA link down (SStatus 0 SControl 300)
ata2.01: SATA link down (SStatus 0 SControl 300)
Loading aacraid.ko module
Adaptec aacraid driver 1.1-5[2461]-ms
Scanning and configuring dmraid supported devices
Trying to resume from LABEL=SWAP-sda5
No suspend signature on swap, not resuming.
Creating root device.
Mounting root filesystem.
kjournald starting.  Commit interval 5 seconds
EXT3-fs: mounted filesystem with ordered data mode.
Setting up other filesystems.
Setting up new root fs
no fstab.sys, mounting internal defaults
Switching to new root and running init.
unmounting old /dev
unmounting old /proc
unmounting old /sys
SELinux:  Disabled at runtime.
type=1404 audit(1283362855.133:2): selinux=0 auid=4294967295 ses=4294967295
INIT: version 2.86 booting
                Welcome to Red Hat Enterprise Linux Server
                Press 'I' to enter interactive startup.
Cannot access the Hardware Clock via any known method.
Use the --debug option to see the details of our search for an access method.
Setting clock  (localtime): Thu Sep  2 01:40:55 CST 2010 [  OK  ]
Starting udev: (XEN) ioapic_guest_write: apic=0, pin=18, irq=18
(XEN) ioapic_guest_write: new_entry=0001a0a0
(XEN) ioapic_guest_write: old_entry=0000a0a0 pirq=18
(XEN) ioapic_guest_write: Attempt to modify IO-APIC pin for in-use IRQ!
[  OK  ]
Loading default keymap (us): [  OK  ]
Setting hostname houyi-chunk2.dev.sd.hello.com:  [  OK  ]
DM multipath kernel driver version too old
No devices found
Setting up Logical Volume Management: [  OK  ]
Checking filesystems
Checking all file systems.
[/sbin/fsck.ext3 (1) -- /] fsck.ext3 -a /dev/sda3 
/: clean, 506608/38404096 files, 24515093/38399366 blocks
[/sbin/fsck.ext3 (1) -- /apsara] fsck.ext3 -a /dev/sda2 
/apsara: clean, 11/38404096 files, 1250655/38399366 blocks
[/sbin/fsck.ext3 (1) -- /apsarapangu] fsck.ext3 -a /dev/sda6 
/apsarapangu: clean, 52/167018496 files, 27754537/166999683 blocks
[/sbin/fsck.ext3 (1) -- /boot] fsck.ext3 -a /dev/sda1 
/boot: clean, 97/128520 files, 239490/514048 blocks
> Date: Wed, 1 Sep 2010 10:01:57 +0100
> From: JBeulich@novell.com
> To: keir.fraser@eu.citrix.com; tinnycloud@hotmail.com
> CC: xen-devel@lists.xensource.com
> Subject: Re: [Xen-devel] Xen-unstable panic: FATAL PAGE FAULT
> 
> >>> On 01.09.10 at 10:49, Keir Fraser <keir.fraser@eu.citrix.com> wrote:
> > Okay, my next guess then is that we are deleting a chunk from the wrong list
> > head. I don't see any check that the adjacent chunks we are considering to
> > merge are from the same node and zone. I suppose the zone logic does just
> > work as we're dealing with 2**x aligned and sized regions. But, shouldn't
> > the merging logic in free_heap_pages be checking that the merging candidate
> > is from the same NUMA node? I see I have an ASSERTion later in the same
> > function, but it's too weak and wishful I suspect.
> 
> Hmm, we're keeping a page reserved if node boundaries aren't
> well aligned (at the end of init_heap_pages()), so that shouldn't
> be possible.
> 
> MaoXiaoyun: Would it be possible that we get to see a *full* boot
> log (at maximum log level), so we know the characteristics of the
> machine?
> 
> Jan
> 

[-- Attachment #1.2: Type: text/html, Size: 43307 bytes --]

[-- Attachment #2: Type: text/plain, Size: 138 bytes --]

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xensource.com
http://lists.xensource.com/xen-devel

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: Xen-unstable panic: FATAL PAGE FAULT
  2010-09-01  9:23                                   ` MaoXiaoyun
@ 2010-09-01  9:58                                     ` Keir Fraser
  2010-09-01 10:21                                       ` MaoXiaoyun
  2010-09-01 11:32                                       ` MaoXiaoyun
  0 siblings, 2 replies; 39+ messages in thread
From: Keir Fraser @ 2010-09-01  9:58 UTC (permalink / raw)
  To: MaoXiaoyun, jbeulich; +Cc: xen devel

[-- Attachment #1: Type: text/plain, Size: 8420 bytes --]

Hm, well, it is a bit weird. The check in init_heap_pages() ought to prevent
merging across node boundaries. Nonetheless the code is simpler and more
obvious if we put a further merging constraint in free_heap_pages() instead.
It's also more correct, since I'm not sure that the
phys_to_nid(page_to_maddr(pg-1)) in init_heap_pages() won't possibly BUG out
if pg-1 is not a RAM page and is not in a known NUMA node range.

Please give the attached patch a spin. (You should revert the previous
patch, of course).

 Thanks,
 Keir

On 01/09/2010 10:23, "MaoXiaoyun" <tinnycloud@hotmail.com> wrote:

> Well. It did crash on every startup.
>  
> below is what I got.
> ---------------------------------------------------
> root (hd0,0)     
>  Filesystem type is ext2fs, partition type 0x83
> kernel /xen-4.0.0.gz msi=1 iommu=off x2apic=off hap=0 dom0_mem=10240M
> dom0_max_ 
> vcpus=4 dom0_vcpus_pin console=com1,vga com1=115200,8n1 conswitch=ax noreboot
>    [Multiboot-elf, <0x100000:0x152000:0x148000>, shtab=0x39a078,
> entry=0x100000 
> ]                
> module /vmlinuz-2.6.31.13-pvops-patch ro root=LABEL=/ hda=noprobe console=hvc0
>    [Multiboot-module @ 0x39b000, 0x3214d0 bytes]
>                  
>                  
>                                                  __  __            _  _
> ___   ___  
>  \ \/ /___ _ __   | || |  / _ \ / _ \                                      *
>   \  // _ \ '_ \  | || |_| | | | | | |                                     *
>   /  \  __/ | | | |__   _| |_| | |_| |                                     * *
>  /_/\_\___|_| |_|    |_|(_)___(_)___/ **************************************
>                                       hich entry is highlighted.
> (XEN) Xen version 4.0.0 (root@dev.sd.aliyun.com) (gcc version 4.1.2 20080704
> (Red Hat 4.1.2-46)) Wed Sep  1 17:13:35 CST 2010
> (XEN) Latest ChangeSet: unavailableto modify the kernel arguments
> (XEN) Command line: msi=1 iommu=off x2apic=off hap=0 dom0_mem=10240M
> dom0_max_vcpus=4 dom0_vcpus_pin console=com1,vga com1=115200,8n1 conswitch=ax
> noreboot
> (XEN) Video information:
> (XEN)  VGA is text mode 80x25, font 8x16automatically in 3 seconds.
> (XEN)  VBE/DDC methods: none; EDID transfer time: 0 seconds
> (XEN)  EDID info not retrieved because no DDC retrieval method detected
> (XEN) Disc information:
> (XEN)  Found 6 MBR signatures
> (XEN)  Found 6 EDD information structures
> (XEN) Xen-e820 RAM map:
> (XEN)  0000000000000000 - 000000000009a800 (usable)
> (XEN)  000000000009a800 - 00000000000a0000 (reserved)
> (XEN)  00000000000e4bb0 - 0000000000100000 (reserved)
> (XEN)  0000000000100000 - 00000000bf790000 (usable)
> (XEN)  00000000bf790000 - 00000000bf79e000 (ACPI data)
> (XEN)  00000000bf79e000 - 00000000bf7d0000 (ACPI NVS)
> (XEN)  00000000bf7d0000 - 00000000bf7e0000 (reserved)
> (XEN)  00000000bf7ec000 - 00000000c0000000 (reserved)
> (XEN)  00000000e0000000 - 00000000f0000000 (reserved)
> (XEN)  00000000fee00000 - 00000000fee01000 (reserved)
> (XEN)  00000000fff00000 - 0000000100000000 (reserved)
> (XEN)  0000000100000000 - 0000000640000000 (usable)
> (XEN) --------------849
> (XEN) --------------849
> (XEN) --------------849
> (XEN) ACPI: RSDP 000F9DD0, 0024 (r2 ACPIAM)
> (XEN) ACPI: XSDT BF790100, 005C (r1 112309 XSDT1113 20091123 MSFT       97)
> (XEN) ACPI: FACP BF790290, 00F4 (r4 112309 FACP1113 20091123 MSFT       97)
> (XEN) ACPI: DSDT BF7904B0, 4D6A (r2  CTSAV CTSAV122      122 INTL 20051117)
> (XEN) ACPI: FACS BF79E000, 0040
> (XEN) ACPI: APIC BF790390, 00D8 (r2 112309 APIC1113 20091123 MSFT       97)
> (XEN) ACPI: MCFG BF790470, 003C (r1 112309 OEMMCFG  20091123 MSFT       97)
> (XEN) ACPI: OEMB BF79E040, 007A (r1 112309 OEMB1113 20091123 MSFT       97)
> (XEN) ACPI: SRAT BF79A4B0, 01D0 (r1 112309 OEMSRAT         1 INTL        1)
> (XEN) ACPI: HPET BF79A680, 0038 (r1 112309 OEMHPET  20091123 MSFT       97)
> (XEN) ACPI: SSDT BF7A1A00, 0363 (r1 DpgPmm    CpuPm       12 INTL 20051117)
> (XEN) --------------847
> (XEN) ---------srat enter
> (XEN) ---------prepare enter into pfn
> (XEN) -------in pfn
> (XEN) -------hole shift returned
> (XEN) --------------849
> (XEN) System RAM: 24542MB (25131224kB)
> (XEN) Unknown interrupt (cr2=0000000000000000)
> (XEN)     00000000000000ab    0000000000000000    ffff82f600004020
> 00007d0a00000000    ffff82f600004000    0000000000000020    0000000000201000
> 0000000000000000    ffffffffffffffff    0000000000000000    0000000000000008
> 0000000000000000    00000000000001ff    00000000000001ff    0000000000000000
> ffff82c480115787    000000000000e008    0000000000010002    ffff82c48035fd18
> 0000000000000000    ffff82c48011536a    0000000000000000    0000000000000000
> 0000000000000163    0000000900000000    00000000000000ab    0000000000000201
> 0000000000000000    0000000000000100    ffff82f600004020    0000000000000eff
> 0000000000000000    ffff82c480115e60    0000000000000000    ffff82f600002020
> 0000000000001000    0000000000000004    0000000000000080    0000000000000001
> ffff82c48020be8d    ffff830000100000    0000000000000008    0000000000000000
> 0000000000000000    ffffffffffffffff    0000000000000101    ffff82c48022d8fc
> 0000000000540000    00000000005fde36    0000000000540000    0000000000100000
> 0000000100000000    0000000000000010    ffff82c48024deb4    ffff82c4802404f7
> 0000000000000000    0000000000000000    0000000000000000    0000000000000000
> 0000000000000000    ffff8300bf568ff8    ffff8300bf569ff8    000000000022a630
> 000000000022a695    0000000000087f00    0000000000000000    ffff830000087fc0
> 00000000005fde36    000000000087b6d0    0000000000d44000    0000000001000000
> 0000000000000000    ffffffffffffffff    ffff830000087f00    0000100000000000
> 0000000800000000    000000010000006e    0000000000000003    00000000000002f8
> 0000000000000000    0000000000000000    0000000000067ebc    0000000000000000
> 0000000000000000    0000000000000000    0000000000000000    ffff82c4801000b5
> 0000000000000000    0000000000000000    0000000000000000    0000000000000000
> 0000000000000000    0000000000000000    0000000000000000    0000000000000000
> 0000000000000000    0000000000000000    0000000000000000    0000000000000000
> 0000000000000000    0000000000000000    0000000000000000    0000000000000000
> 0000000000000000    0000000000000000    0000000000000000    0000000000000000
> 0000000000000000    0000000000000000    0000000000000000    0000000000000000
> 0000000000000000    0000000000000000    00000000fffff000
>  
>> Date: Wed, 1 Sep 2010 09:49:18 +0100
>> Subject: Re: [Xen-devel] Xen-unstable panic: FATAL PAGE FAULT
>> From: keir.fraser@eu.citrix.com
>> To: JBeulich@novell.com
>> CC: tinnycloud@hotmail.com; xen-devel@lists.xensource.com
>> 
>> On 01/09/2010 09:02, "Jan Beulich" <JBeulich@novell.com> wrote:
>> 
>>>> Well I agree with your logic anyway. So I don't see that this can be the
>>>> cause of MaoXiaoyun's bug. At least not directly. But then I'm stumped as
>>>> to
>>>> why the page arithmetic and checks in free_heap_pages are (apparently)
>>>> resulting in a page pointer way outside the frame-table region and actually
>>>> in the directmap region.
>>> 
>>> There must be some unchecked use of PAGE_LIST_NULL, i.e.
>>> running off a list end without taking notice (0xffff8315ffffffe4
>>> exactly corresponds with that).
>> 
>> Okay, my next guess then is that we are deleting a chunk from the wrong list
>> head. I don't see any check that the adjacent chunks we are considering to
>> merge are from the same node and zone. I suppose the zone logic does just
>> work as we're dealing with 2**x aligned and sized regions. But, shouldn't
>> the merging logic in free_heap_pages be checking that the merging candidate
>> is from the same NUMA node? I see I have an ASSERTion later in the same
>> function, but it's too weak and wishful I suspect.
>> 
>> MaoXiaoyun: can you please test with the attached patch? If I'm right, you
>> will crash on one of the BUG_ON checks that I added, rather than crashing on
>> a pointer dereference. You may even crash during boot. Anyhow, what is
>> interesting is whether this patch always makes you crash on BUG_ON before
>> you would normally crash on pointer dereference. If so this is trivial to
>> fix.
>> 
>> Thanks,
>> Keir
>> 
>        


[-- Attachment #2: 00-freeheap --]
[-- Type: application/octet-stream, Size: 3043 bytes --]

diff -r ae0cd4e5cc01 xen/common/page_alloc.c
--- a/xen/common/page_alloc.c	Wed Sep 01 10:19:14 2010 +0100
+++ b/xen/common/page_alloc.c	Wed Sep 01 10:55:50 2010 +0100
@@ -579,7 +579,8 @@
             /* Merge with predecessor block? */
             if ( !mfn_valid(page_to_mfn(pg-mask)) ||
                  !page_state_is(pg-mask, free) ||
-                 (PFN_ORDER(pg-mask) != order) )
+                 (PFN_ORDER(pg-mask) != order) ||
+                 (phys_to_nid(page_to_maddr(pg-mask)) != node) )
                 break;
             pg -= mask;
             page_list_del(pg, &heap(node, zone, order));
@@ -589,15 +590,13 @@
             /* Merge with successor block? */
             if ( !mfn_valid(page_to_mfn(pg+mask)) ||
                  !page_state_is(pg+mask, free) ||
-                 (PFN_ORDER(pg+mask) != order) )
+                 (PFN_ORDER(pg+mask) != order) ||
+                 (phys_to_nid(page_to_maddr(pg+mask)) != node) )
                 break;
             page_list_del(pg + mask, &heap(node, zone, order));
         }
 
         order++;
-
-        /* After merging, pg should remain in the same node. */
-        ASSERT(phys_to_nid(page_to_maddr(pg)) == node);
     }
 
     PFN_ORDER(pg) = order;
@@ -849,25 +848,22 @@
 static void init_heap_pages(
     struct page_info *pg, unsigned long nr_pages)
 {
-    unsigned int nid_curr, nid_prev;
     unsigned long i;
 
-    nid_prev = phys_to_nid(page_to_maddr(pg-1));
+    for ( i = 0; i < nr_pages; i++ )
+    {
+        unsigned int nid = phys_to_nid(page_to_maddr(pg+i));
 
-    for ( i = 0; i < nr_pages; nid_prev = nid_curr, i++ )
-    {
-        nid_curr = phys_to_nid(page_to_maddr(pg+i));
-
-        if ( unlikely(!avail[nid_curr]) )
+        if ( unlikely(!avail[nid]) )
         {
             unsigned long s = page_to_mfn(pg + i);
             unsigned long e = page_to_mfn(pg + nr_pages - 1) + 1;
-            bool_t use_tail = (nid_curr == phys_to_nid(pfn_to_paddr(e - 1))) &&
+            bool_t use_tail = (nid == phys_to_nid(pfn_to_paddr(e - 1))) &&
                               !(s & ((1UL << MAX_ORDER) - 1)) &&
                               (find_first_set_bit(e) <= find_first_set_bit(s));
             unsigned long n;
 
-            n = init_node_heap(nid_curr, page_to_mfn(pg+i), nr_pages - i,
+            n = init_node_heap(nid, page_to_mfn(pg+i), nr_pages - i,
                                &use_tail);
             BUG_ON(i + n > nr_pages);
             if ( n && !use_tail )
@@ -880,16 +876,7 @@
             nr_pages -= n;
         }
 
-        /*
-         * Free pages of the same node, or if they differ, but are on a
-         * MAX_ORDER alignment boundary (which already get reserved).
-         */
-        if ( (nid_curr == nid_prev) ||
-             !(page_to_mfn(pg+i) & ((1UL << MAX_ORDER) - 1)) )
-            free_heap_pages(pg+i, 0);
-        else
-            printk("Reserving non-aligned node boundary @ mfn %#lx\n",
-                   page_to_mfn(pg+i));
+        free_heap_pages(pg+i, 0);
     }
 }
 

[-- Attachment #3: Type: text/plain, Size: 138 bytes --]

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xensource.com
http://lists.xensource.com/xen-devel

^ permalink raw reply	[flat|nested] 39+ messages in thread

* RE: Xen-unstable panic: FATAL PAGE FAULT
  2010-09-01  9:48                                     ` MaoXiaoyun
@ 2010-09-01 10:09                                       ` Jan Beulich
  0 siblings, 0 replies; 39+ messages in thread
From: Jan Beulich @ 2010-09-01 10:09 UTC (permalink / raw)
  To: keir.fraser, MaoXiaoyun; +Cc: xen devel

>>> On 01.09.10 at 11:48, MaoXiaoyun <tinnycloud@hotmail.com> wrote:
> See below log, is this sufficient ?  Thanks.

Unfortunately not - you missed adding "loglvl=all" to the Xen command
line (the SRAT messages are info-level only, and hence invisible by
default).

Jan

^ permalink raw reply	[flat|nested] 39+ messages in thread

* RE: Xen-unstable panic: FATAL PAGE FAULT
  2010-09-01  9:58                                     ` Keir Fraser
@ 2010-09-01 10:21                                       ` MaoXiaoyun
  2010-09-01 10:25                                         ` Keir Fraser
  2010-09-01 10:34                                         ` Jan Beulich
  2010-09-01 11:32                                       ` MaoXiaoyun
  1 sibling, 2 replies; 39+ messages in thread
From: MaoXiaoyun @ 2010-09-01 10:21 UTC (permalink / raw)
  To: keir.fraser, jbeulich; +Cc: xen devel


[-- Attachment #1.1: Type: text/plain, Size: 11071 bytes --]


Thanks Keir.

 

I myself did the below test in page_alloc.c.

check_page will panic on any page whose address has '3' as its 6th character; the last argument, i, indicates which call site panicked.

 

Below output indicates the panic comes from line 558: the page address is ffff82f600002040, while its next page
is ffff8315ffffffe0, compared to the address in the previous panic (ffff8315ffffffe4), which is very similar.

 

I think this should imply something.

 

---------------------------------------

(XEN) -----------18
(XEN) System RAM: 24542MB (25131224kB)
(XEN) SRAT: No PXM for e820 range: 0000000000000000 - 000000000009a7ff
(XEN) SRAT: SRAT not used.
(XEN) ----------------pgb ffff82f600002040 pg ffff8315ffffffe0, mask 1, order 0, 0
(XEN) 
(XEN) ****************************************
(XEN) Panic on CPU 0:
(XEN) xmao invalid page address assigned 
(XEN) ****************************************
(XEN) 

 

----------------------------------------------------

 485 static int check_page(struct page_info* pgb, struct page_info* pg, unsigned long mask, unsigned int order, int i){
 486 
 487         if((unsigned long)pg & 0x0000020000000000 &&
 488            (unsigned long)pg & 0x0000010000000000
 489               ){
 490                   printk("----------------pgb %p pg %p, mask %lx, order %d, %d\n", pgb, pg, mask, order, i);                                            
 491                   panic("xmao invalid page address assigned \n");
 492              }
 493             return 0;
 494 }

 

549         if ( (page_to_mfn(pg) & mask) )
 550         {
 551             /* Merge with predecessor block? */
 552             if ( !mfn_valid(page_to_mfn(pg-mask)) ||
 553                  !page_state_is(pg-mask, free) ||
 554                  (PFN_ORDER(pg-mask) != order) )
 555                 break;
 556             pg -= mask;
 557 
 558             check_page(pg, pdx_to_page(pg->list.next), mask, order, 0);                                                                                 
 559             check_page(pg, pdx_to_page(pg->list.prev), mask, order, 1);
 560 
 561             page_list_del(pg, &heap(node, zone, order));
 562         }
 563         else
 564         {
 565             /* Merge with successor block? */
 566             if ( !mfn_valid(page_to_mfn(pg+mask)) ||
 567                  !page_state_is(pg+mask, free) ||
 568                  (PFN_ORDER(pg+mask) != order) )
 569                 break;
 570 
 571             pgt = pg + mask;
 572             check_page(pg, pdx_to_page(pgt->list.next), mask, order, 2);
 573             check_page(pg, pdx_to_page(pgt->list.prev), mask, order, 3);
 574 
 
> Date: Wed, 1 Sep 2010 10:58:54 +0100
> Subject: Re: [Xen-devel] Xen-unstable panic: FATAL PAGE FAULT
> From: keir.fraser@eu.citrix.com
> To: tinnycloud@hotmail.com; jbeulich@novell.com
> CC: xen-devel@lists.xensource.com
> 
> Hm, well, it is a bit weird. The check in init_heap_pages() ought to prevent
> merging across node boundaries. Nonetheless the code is simpler and more
> obvious if we put a further merging constraint in free_heap_pages() instead.
> It's also more correct, since I'm not sure that the
> phys_to_nid(page_to_maddr(pg-1)) in init_heap_pages() won't possibly BUG out
> if pg-1 is not a RAM page and is not in a known NUMA node range.
> 
> Please give the attached patch a spin. (You should revert the previous
> patch, of course).
> 
> Thanks,
> Keir
> 
> On 01/09/2010 10:23, "MaoXiaoyun" <tinnycloud@hotmail.com> wrote:
> 
> > Well. It did crash on every startup.
> > 
> > below is what I got.
> > ---------------------------------------------------
> > root (hd0,0) 
> > Filesystem type is ext2fs, partition type 0x83
> > kernel /xen-4.0.0.gz msi=1 iommu=off x2apic=off hap=0 dom0_mem=10240M
> > dom0_max_ 
> > vcpus=4 dom0_vcpus_pin console=com1,vga com1=115200,8n1 conswitch=ax noreboot
> > [Multiboot-elf, <0x100000:0x152000:0x148000>, shtab=0x39a078,
> > entry=0x100000 
> > ] 
> > module /vmlinuz-2.6.31.13-pvops-patch ro root=LABEL=/ hda=noprobe console=hvc0
> > [Multiboot-module @ 0x39b000, 0x3214d0 bytes]
> > 
> > 
> >  __ __ _ _
> > ___ ___ 
> > \ \/ /___ _ __ | || | / _ \ / _ \ *
> > \ // _ \ '_ \ | || |_| | | | | | | *
> > / \ __/ | | | |__ _| |_| | |_| | * *
> > /_/\_\___|_| |_| |_|(_)___(_)___/ **************************************
> > hich entry is highlighted.
> > (XEN) Xen version 4.0.0 (root@dev.sd.aliyun.com) (gcc version 4.1.2 20080704
> > (Red Hat 4.1.2-46)) Wed Sep 1 17:13:35 CST 2010
> > (XEN) Latest ChangeSet: unavailableto modify the kernel arguments
> > (XEN) Command line: msi=1 iommu=off x2apic=off hap=0 dom0_mem=10240M
> > dom0_max_vcpus=4 dom0_vcpus_pin console=com1,vga com1=115200,8n1 conswitch=ax
> > noreboot
> > (XEN) Video information:
> > (XEN) VGA is text mode 80x25, font 8x16automatically in 3 seconds.
> > (XEN) VBE/DDC methods: none; EDID transfer time: 0 seconds
> > (XEN) EDID info not retrieved because no DDC retrieval method detected
> > (XEN) Disc information:
> > (XEN) Found 6 MBR signatures
> > (XEN) Found 6 EDD information structures
> > (XEN) Xen-e820 RAM map:
> > (XEN) 0000000000000000 - 000000000009a800 (usable)
> > (XEN) 000000000009a800 - 00000000000a0000 (reserved)
> > (XEN) 00000000000e4bb0 - 0000000000100000 (reserved)
> > (XEN) 0000000000100000 - 00000000bf790000 (usable)
> > (XEN) 00000000bf790000 - 00000000bf79e000 (ACPI data)
> > (XEN) 00000000bf79e000 - 00000000bf7d0000 (ACPI NVS)
> > (XEN) 00000000bf7d0000 - 00000000bf7e0000 (reserved)
> > (XEN) 00000000bf7ec000 - 00000000c0000000 (reserved)
> > (XEN) 00000000e0000000 - 00000000f0000000 (reserved)
> > (XEN) 00000000fee00000 - 00000000fee01000 (reserved)
> > (XEN) 00000000fff00000 - 0000000100000000 (reserved)
> > (XEN) 0000000100000000 - 0000000640000000 (usable)
> > (XEN) --------------849
> > (XEN) --------------849
> > (XEN) --------------849
> > (XEN) ACPI: RSDP 000F9DD0, 0024 (r2 ACPIAM)
> > (XEN) ACPI: XSDT BF790100, 005C (r1 112309 XSDT1113 20091123 MSFT 97)
> > (XEN) ACPI: FACP BF790290, 00F4 (r4 112309 FACP1113 20091123 MSFT 97)
> > (XEN) ACPI: DSDT BF7904B0, 4D6A (r2 CTSAV CTSAV122 122 INTL 20051117)
> > (XEN) ACPI: FACS BF79E000, 0040
> > (XEN) ACPI: APIC BF790390, 00D8 (r2 112309 APIC1113 20091123 MSFT 97)
> > (XEN) ACPI: MCFG BF790470, 003C (r1 112309 OEMMCFG 20091123 MSFT 97)
> > (XEN) ACPI: OEMB BF79E040, 007A (r1 112309 OEMB1113 20091123 MSFT 97)
> > (XEN) ACPI: SRAT BF79A4B0, 01D0 (r1 112309 OEMSRAT 1 INTL 1)
> > (XEN) ACPI: HPET BF79A680, 0038 (r1 112309 OEMHPET 20091123 MSFT 97)
> > (XEN) ACPI: SSDT BF7A1A00, 0363 (r1 DpgPmm CpuPm 12 INTL 20051117)
> > (XEN) --------------847
> > (XEN) ---------srat enter
> > (XEN) ---------prepare enter into pfn
> > (XEN) -------in pfn
> > (XEN) -------hole shift returned
> > (XEN) --------------849
> > (XEN) System RAM: 24542MB (25131224kB)
> > (XEN) Unknown interrupt (cr2=0000000000000000)
> > (XEN) 00000000000000ab 0000000000000000 ffff82f600004020
> > 00007d0a00000000 ffff82f600004000 0000000000000020 0000000000201000
> > 0000000000000000 ffffffffffffffff 0000000000000000 0000000000000008
> > 0000000000000000 00000000000001ff 00000000000001ff 0000000000000000
> > ffff82c480115787 000000000000e008 0000000000010002 ffff82c48035fd18
> > 0000000000000000 ffff82c48011536a 0000000000000000 0000000000000000
> > 0000000000000163 0000000900000000 00000000000000ab 0000000000000201
> > 0000000000000000 0000000000000100 ffff82f600004020 0000000000000eff
> > 0000000000000000 ffff82c480115e60 0000000000000000 ffff82f600002020
> > 0000000000001000 0000000000000004 0000000000000080 0000000000000001
> > ffff82c48020be8d ffff830000100000 0000000000000008 0000000000000000
> > 0000000000000000 ffffffffffffffff 0000000000000101 ffff82c48022d8fc
> > 0000000000540000 00000000005fde36 0000000000540000 0000000000100000
> > 0000000100000000 0000000000000010 ffff82c48024deb4 ffff82c4802404f7
> > 0000000000000000 0000000000000000 0000000000000000 0000000000000000
> > 0000000000000000 ffff8300bf568ff8 ffff8300bf569ff8 000000000022a630
> > 000000000022a695 0000000000087f00 0000000000000000 ffff830000087fc0
> > 00000000005fde36 000000000087b6d0 0000000000d44000 0000000001000000
> > 0000000000000000 ffffffffffffffff ffff830000087f00 0000100000000000
> > 0000000800000000 000000010000006e 0000000000000003 00000000000002f8
> > 0000000000000000 0000000000000000 0000000000067ebc 0000000000000000
> > 0000000000000000 0000000000000000 0000000000000000 ffff82c4801000b5
> > 0000000000000000 0000000000000000 0000000000000000 0000000000000000
> > 0000000000000000 0000000000000000 0000000000000000 0000000000000000
> > 0000000000000000 0000000000000000 0000000000000000 0000000000000000
> > 0000000000000000 0000000000000000 0000000000000000 0000000000000000
> > 0000000000000000 0000000000000000 0000000000000000 0000000000000000
> > 0000000000000000 0000000000000000 0000000000000000 0000000000000000
> > 0000000000000000 0000000000000000 00000000fffff000
> > 
> >> Date: Wed, 1 Sep 2010 09:49:18 +0100
> >> Subject: Re: [Xen-devel] Xen-unstable panic: FATAL PAGE FAULT
> >> From: keir.fraser@eu.citrix.com
> >> To: JBeulich@novell.com
> >> CC: tinnycloud@hotmail.com; xen-devel@lists.xensource.com
> >> 
> >> On 01/09/2010 09:02, "Jan Beulich" <JBeulich@novell.com> wrote:
> >> 
> >>>> Well I agree with your logic anyway. So I don't see that this can be the
> >>>> cause of MaoXiaoyun's bug. At least not directly. But then I'm stumped as
> >>>> to
> >>>> why the page arithmetic and checks in free_heap_pages are (apparently)
> >>>> resulting in a page pointer way outside the frame-table region and actually
> >>>> in the directmap region.
> >>> 
> >>> There must be some unchecked use of PAGE_LIST_NULL, i.e.
> >>> running off a list end without taking notice (0xffff8315ffffffe4
> >>> exactly corresponds with that).
> >> 
> >> Okay, my next guess then is that we are deleting a chunk from the wrong list
> >> head. I don't see any check that the adjacent chunks we are considering to
> >> merge are from the same node and zone. I suppose the zone logic does just
> >> work as we're dealing with 2**x aligned and sized regions. But, shouldn't
> >> the merging logic in free_heap_pages be checking that the merging candidate
> >> is from the same NUMA node? I see I have an ASSERTion later in the same
> >> function, but it's too weak and wishful I suspect.
> >> 
> >> MaoXiaoyun: can you please test with the attached patch? If I'm right, you
> >> will crash on one of the BUG_ON checks that I added, rather than crashing on
> >> a pointer dereference. You may even crash during boot. Anyhow, what is
> >> interesting is whether this patch always makes you crash on BUG_ON before
> >> you would normally crash on pointer dereference. If so this is trivial to
> >> fix.
> >> 
> >> Thanks,
> >> Keir
> >> 
> > 
> 

[-- Attachment #1.2: Type: text/html, Size: 15500 bytes --]

[-- Attachment #2: Type: text/plain, Size: 138 bytes --]

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xensource.com
http://lists.xensource.com/xen-devel

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: Xen-unstable panic: FATAL PAGE FAULT
  2010-09-01 10:21                                       ` MaoXiaoyun
@ 2010-09-01 10:25                                         ` Keir Fraser
  2010-09-01 10:28                                           ` Keir Fraser
  2010-09-01 10:34                                         ` Jan Beulich
  1 sibling, 1 reply; 39+ messages in thread
From: Keir Fraser @ 2010-09-01 10:25 UTC (permalink / raw)
  To: MaoXiaoyun, jbeulich; +Cc: xen devel

That doesn't imply anything. It is perfectly valid for a page's prev or next
index to be PAGE_LIST_NULL, if that page is not in a list, or if it is at
the head and/or tail of a list.

 -- Keir

On 01/09/2010 11:21, "MaoXiaoyun" <tinnycloud@hotmail.com> wrote:

> Thanks Keir.
>  
> I myself did below test. in page_alloc.c.
> check_page will panic on all pages which the 6th character in its adddress is
> '3', i used to indicate which line paniced.
>  
> Below output indicates the panic comes from line 558, and the page address is
> ffff82f600002040, while its next page
> is ffff8315ffffffe0, compare to the panic address in previous
> panic(ffff8315ffffffe4), which is very similar.
>  
> I think this should imply something.
>  
> ---------------------------------------
> (XEN) -----------18
> (XEN) System RAM: 24542MB (25131224kB)
> (XEN) SRAT: No PXM for e820 range: 0000000000000000 - 000000000009a7ff
> (XEN) SRAT: SRAT not used.
> (XEN) ----------------pgb ffff82f600002040 pg ffff8315ffffffe0, mask 1, order
> 0, 0
> (XEN) 
> (XEN) ****************************************
> (XEN) Panic on CPU 0:
> (XEN) xmao invalid page address assigned
> (XEN) ****************************************
> (XEN) 
>  
> ----------------------------------------------------
>  485 static int check_page(struct page_info* pgb, struct page_info* pg,
> unsigned long mask, unsigned int order, int i){
>  486 
>  487         if((unsigned long)pg & 0x0000020000000000 &&
>  488            (unsigned long)pg & 0x0000010000000000
>  489               ){
>  490                   printk("----------------pgb %p pg %p, mask %lx, order
> %d, %d\n", pgb, pg, mask, order, i);
>  491                   panic("xmao invalid page address assigned \n");
>  492              }
>  493             return 0;
>  494 }
>  
> 549         if ( (page_to_mfn(pg) & mask) )
>  550         {
>  551             /* Merge with predecessor block? */
>  552             if ( !mfn_valid(page_to_mfn(pg-mask)) ||
>  553                  !page_state_is(pg-mask, free) ||
>  554                  (PFN_ORDER(pg-mask) != order) )
>  555                 break;
>  556             pg -= mask;
>  557 
>  558             check_page(pg, pdx_to_page(pg->list.next), mask, order, 0);
>  559             check_page(pg, pdx_to_page(pg->list.prev), mask, order, 1);
>  560 
>  561             page_list_del(pg, &heap(node, zone, order));
>  562         }
>  563         else
>  564         {
>  565             /* Merge with successor block? */
>  566             if ( !mfn_valid(page_to_mfn(pg+mask)) ||
>  567                  !page_state_is(pg+mask, free) ||
>  568                  (PFN_ORDER(pg+mask) != order) )
>  569                 break;
>  570 
>  571             pgt = pg + mask;
>  572             check_page(pg, pdx_to_page(pgt->list.next), mask, order, 2);
>  573             check_page(pg, pdx_to_page(pgt->list.prev), mask, order, 3);
>  574 
>  
>> Date: Wed, 1 Sep 2010 10:58:54 +0100
>> Subject: Re: [Xen-devel] Xen-unstable panic: FATAL PAGE FAULT
>> From: keir.fraser@eu.citrix.com
>> To: tinnycloud@hotmail.com; jbeulich@novell.com
>> CC: xen-devel@lists.xensource.com
>> 
>> Hm, well, it is a bit weird. The check in init_heap_pages() ought to prevent
>> merging across node boundaries. Nonetheless the code is simpler and more
>> obvious if we put a further merging constraint in free_heap_pages() instead.
>> It's also correcter, since I'm not sure that the
>> phys_to_nid(page_to_maddr(pg-1)) in init_heap_pages() won't possibly BUG out
>> if pg-1 is not a RAM page and is not in a known NUMA node range.
>> 
>> Please give the attached patch a spin. (You should revert the previous
>> patch, of course).
>> 
>> Thanks,
>> Keir
>> 
>> On 01/09/2010 10:23, "MaoXiaoyun" <tinnycloud@hotmail.com> wrote:
>> 
>>> Well. It did crash on every startup.
>>> 
>>> below is what I got.
>>> ---------------------------------------------------
>>> root (hd0,0) 
>>> Filesystem type is ext2fs, partition type 0x83
>>> kernel /xen-4.0.0.gz msi=1 iommu=off x2apic=off hap=0 dom0_mem=10240M
>>> dom0_max_ 
>>> vcpus=4 dom0_vcpus_pin console=com1,vga com1=115200,8n1 conswitch=ax
>>> noreboot
>>> [Multiboot-elf, <0x100000:0x152000:0x148000>, shtab=0x39a078,
>>> entry=0x100000 
>>> ] 
>>> module /vmlinuz-2.6.31.13-pvops-patch ro root=LABEL=/ hda=noprobe
>>> console=hvc0
>>> [Multiboot-module @ 0x39b000, 0x3214d0 bytes]
>>> 
>>> 
>>>  __ __ _ _
>>> ___ ___ 
>>> \ \/ /___ _ __ | || | / _ \ / _ \ *
>>> \ // _ \ '_ \ | || |_| | | | | | | *
>>> / \ __/ | | | |__ _| |_| | |_| | * *
>>> /_/\_\___|_| |_| |_|(_)___(_)___/ **************************************
>>> hich entry is highlighted.
>>> (XEN) Xen version 4.0.0 (root@dev.sd.aliyun.com) (gcc version 4.1.2 20080704
>>> (Red Hat 4.1.2-46)) Wed Sep 1 17:13:35 CST 2010
>>> (XEN) Latest ChangeSet: unavailableto modify the kernel arguments
>>> (XEN) Command line: msi=1 iommu=off x2apic=off hap=0 dom0_mem=10240M
>>> dom0_max_vcpus=4 dom0_vcpus_pin console=com1,vga com1=115200,8n1
>>> conswitch=ax
>>> noreboot
>>> (XEN) Video information:
>>> (XEN) VGA is text mode 80x25, font 8x16automatically in 3 seconds.
>>> (XEN) VBE/DDC methods: none; EDID transfer time: 0 seconds
>>> (XEN) EDID info not retrieved because no DDC retrieval method detected
>>> (XEN) Disc information:
>>> (XEN) Found 6 MBR signatures
>>> (XEN) Found 6 EDD information structures
>>> (XEN) Xen-e820 RAM map:
>>> (XEN) 0000000000000000 - 000000000009a800 (usable)
>>> (XEN) 000000000009a800 - 00000000000a0000 (reserved)
>>> (XEN) 00000000000e4bb0 - 0000000000100000 (reserved)
>>> (XEN) 0000000000100000 - 00000000bf790000 (usable)
>>> (XEN) 00000000bf790000 - 00000000bf79e000 (ACPI data)
>>> (XEN) 00000000bf79e000 - 00000000bf7d0000 (ACPI NVS)
>>> (XEN) 00000000bf7d0000 - 00000000bf7e0000 (reserved)
>>> (XEN) 00000000bf7ec000 - 00000000c0000000 (reserved)
>>> (XEN) 00000000e0000000 - 00000000f0000000 (reserved)
>>> (XEN) 00000000fee00000 - 00000000fee01000 (reserved)
>>> (XEN) 00000000fff00000 - 0000000100000000 (reserved)
>>> (XEN) 0000000100000000 - 0000000640000000 (usable)
>>> (XEN) --------------849
>>> (XEN) --------------849
>>> (XEN) --------------849
>>> (XEN) ACPI: RSDP 000F9DD0, 0024 (r2 ACPIAM)
>>> (XEN) ACPI: XSDT BF790100, 005C (r1 112309 XSDT1113 20091123 MSFT 97)
>>> (XEN) ACPI: FACP BF790290, 00F4 (r4 112309 FACP1113 20091123 MSFT 97)
>>> (XEN) ACPI: DSDT BF7904B0, 4D6A (r2 CTSAV CTSAV122 122 INTL 20051117)
>>> (XEN) ACPI: FACS BF79E000, 0040
>>> (XEN) ACPI: APIC BF790390, 00D8 (r2 112309 APIC1113 20091123 MSFT 97)
>>> (XEN) ACPI: MCFG BF790470, 003C (r1 112309 OEMMCFG 20091123 MSFT 97)
>>> (XEN) ACPI: OEMB BF79E040, 007A (r1 112309 OEMB1113 20091123 MSFT 97)
>>> (XEN) ACPI: SRAT BF79A4B0, 01D0 (r1 112309 OEMSRAT 1 INTL 1)
>>> (XEN) ACPI: HPET BF79A680, 0038 (r1 112309 OEMHPET 20091123 MSFT 97)
>>> (XEN) ACPI: SSDT BF7A1A00, 0363 (r1 DpgPmm CpuPm 12 INTL 20051117)
>>> (XEN) --------------847
>>> (XEN) ---------srat enter
>>> (XEN) ---------prepare enter into pfn
>>> (XEN) -------in pfn
>>> (XEN) -------hole shift returned
>>> (XEN) --------------849
>>> (XEN) System RAM: 24542MB (25131224kB)
>>> (XEN) Unknown interrupt (cr2=0000000000000000)
>>> (XEN) 00000000000000ab 0000000000000000 ffff82f600004020
>>> 00007d0a00000000 ffff82f600004000 0000000000000020 0000000000201000
>>> 0000000000000000 ffffffffffffffff 0000000000000000 0000000000000008
>>> 0000000000000000 00000000000001ff 00000000000001ff 0000000000000000
>>> ffff82c480115787 000000000000e008 0000000000010002 ffff82c48035fd18
>>> 0000000000000000 ffff82c48011536a 0000000000000000 0000000000000000
>>> 0000000000000163 0000000900000000 00000000000000ab 0000000000000201
>>> 0000000000000000 0000000000000100 ffff82f600004020 0000000000000eff
>>> 0000000000000000 ffff82c480115e60 0000000000000000 ffff82f600002020
>>> 0000000000001000 0000000000000004 0000000000000080 0000000000000001
>>> ffff82c48020be8d ffff830000100000 0000000000000008 0000000000000000
>>> 0000000000000000 ffffffffffffffff 0000000000000101 ffff82c48022d8fc
>>> 0000000000540000 00000000005fde36 0000000000540000 0000000000100000
>>> 0000000100000000 0000000000000010 ffff82c48024deb4 ffff82c4802404f7
>>> 0000000000000000 0000000000000000 0000000000000000 0000000000000000
>>> 0000000000000000 ffff8300bf568ff8 ffff8300bf569ff8 000000000022a630
>>> 000000000022a695 0000000000087f00 0000000000000000 ffff830000087fc0
>>> 00000000005fde36 000000000087b6d0 0000000000d44000 0000000001000000
>>> 0000000000000000 ffffffffffffffff ffff830000087f00 0000100000000000
>>> 0000000800000000 000000010000006e 0000000000000003 00000000000002f8
>>> 0000000000000000 0000000000000000 0000000000067ebc 0000000000000000
>>> 0000000000000000 0000000000000000 0000000000000000 ffff82c4801000b5
>>> 0000000000000000 0000000000000000 0000000000000000 0000000000000000
>>> 0000000000000000 0000000000000000 0000000000000000 0000000000000000
>>> 0000000000000000 0000000000000000 0000000000000000 0000000000000000
>>> 0000000000000000 0000000000000000 0000000000000000 0000000000000000
>>> 0000000000000000 0000000000000000 0000000000000000 0000000000000000
>>> 0000000000000000 0000000000000000 0000000000000000 0000000000000000
>>> 0000000000000000 0000000000000000 00000000fffff000
>>> 
>>>> Date: Wed, 1 Sep 2010 09:49:18 +0100
>>>> Subject: Re: [Xen-devel] Xen-unstable panic: FATAL PAGE FAULT
>>>> From: keir.fraser@eu.citrix.com
>>>> To: JBeulich@novell.com
>>>> CC: tinnycloud@hotmail.com; xen-devel@lists.xensource.com
>>>> 
>>>> On 01/09/2010 09:02, "Jan Beulich" <JBeulich@novell.com> wrote:
>>>> 
>>>>>> Well I agree with your logic anyway. So I don't see that this can be the
>>>>>> cause of MaoXiaoyun's bug. At least not directly. But then I'm stumped as
>>>>>> to
>>>>>> why the page arithmetic and checks in free_heap_pages are (apparently)
>>>>>> resulting in a page pointer way outside the frame-table region and
>>>>>> actually
>>>>>> in the directmap region.
>>>>> 
>>>>> There must be some unchecked use of PAGE_LIST_NULL, i.e.
>>>>> running off a list end without taking notice (0xffff8315ffffffe4
>>>>> exactly corresponds with that).
>>>> 
>>>> Okay, my next guess then is that we are deleting a chunk from the wrong
>>>> list
>>>> head. I don't see any check that the adjacent chunks we are considering to
>>>> merge are from the same node and zone. I suppose the zone logic does just
>>>> work as we're dealing with 2**x aligned and sized regions. But, shouldn't
>>>> the merging logic in free_heap_pages be checking that the merging candidate
>>>> is from the same NUMA node? I see I have an ASSERTion later in the same
>>>> function, but it's too weak and wishful I suspect.
>>>> 
>>>> MaoXiaoyun: can you please test with the attached patch? If I'm right, you
>>>> will crash on one of the BUG_ON checks that I added, rather than crashing
>>>> on
>>>> a pointer dereference. You may even crash during boot. Anyhow, what is
>>>> interesting is whether this patch always makes you crash on BUG_ON before
>>>> you would normally crash on pointer dereference. If so this is trivial to
>>>> fix.
>>>> 
>>>> Thanks,
>>>> Keir
>>>> 
>>> 
>> 
>        

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: Xen-unstable panic: FATAL PAGE FAULT
  2010-09-01 10:25                                         ` Keir Fraser
@ 2010-09-01 10:28                                           ` Keir Fraser
  0 siblings, 0 replies; 39+ messages in thread
From: Keir Fraser @ 2010-09-01 10:28 UTC (permalink / raw)
  To: MaoXiaoyun, jbeulich; +Cc: xen devel

More interesting would be to turn the BUG_ON statements in my first patch
into if() statements and print out that kind of info before panic()ing. It
would tell us which BUG_ON() fired, the page addresses (and maybe MFNs) and
order, mask, node, and zone info.

 -- Keir

On 01/09/2010 11:25, "Keir Fraser" <keir.fraser@eu.citrix.com> wrote:

> That doesn't imply anything. It is perfectly valid for a page's prev or next
> index to be PAGE_LIST_NULL, if that page is not in a list, or if it is at
> the head and/or tail of a list.
> 
>  -- Keir
> 
> On 01/09/2010 11:21, "MaoXiaoyun" <tinnycloud@hotmail.com> wrote:
> 
>> Thanks Keir.
>> 
>> I myself did below test. in page_alloc.c.
>> check_page will panic on all pages which the 6th character in its adddress is
>> '3', i used to indicate which line paniced.
>> 
>> Below output indicates the panic comes from line 558, and the page address is
>> ffff82f600002040, while its next page
>> is ffff8315ffffffe0, compare to the panic address in previous
>> panic(ffff8315ffffffe4), which is very similar.
>> 
>> I think this should imply something.
>> 
>> ---------------------------------------
>> (XEN) -----------18
>> (XEN) System RAM: 24542MB (25131224kB)
>> (XEN) SRAT: No PXM for e820 range: 0000000000000000 - 000000000009a7ff
>> (XEN) SRAT: SRAT not used.
>> (XEN) ----------------pgb ffff82f600002040 pg ffff8315ffffffe0, mask 1, order
>> 0, 0
>> (XEN)
>> (XEN) ****************************************
>> (XEN) Panic on CPU 0:
>> (XEN) xmao invalid page address assigned
>> (XEN) ****************************************
>> (XEN)
>> 
>> ----------------------------------------------------
>>  485 static int check_page(struct page_info* pgb, struct page_info* pg,
>> unsigned long mask, unsigned int order, int i){
>>  486
>>  487         if((unsigned long)pg & 0x0000020000000000 &&
>>  488            (unsigned long)pg & 0x0000010000000000
>>  489               ){
>>  490                   printk("----------------pgb %p pg %p, mask %lx, order
>> %d, %d\n", pgb, pg, mask, order, i);
>>  491                   panic("xmao invalid page address assigned \n");
>>  492              }
>>  493             return 0;
>>  494 }
>> 
>> 549         if ( (page_to_mfn(pg) & mask) )
>>  550         {
>>  551             /* Merge with predecessor block? */
>>  552             if ( !mfn_valid(page_to_mfn(pg-mask)) ||
>>  553                  !page_state_is(pg-mask, free) ||
>>  554                  (PFN_ORDER(pg-mask) != order) )
>>  555                 break;
>>  556             pg -= mask;
>>  557
>>  558             check_page(pg, pdx_to_page(pg->list.next), mask, order, 0);
>>  559             check_page(pg, pdx_to_page(pg->list.prev), mask, order, 1);
>>  560
>>  561             page_list_del(pg, &heap(node, zone, order));
>>  562         }
>>  563         else
>>  564         {
>>  565             /* Merge with successor block? */
>>  566             if ( !mfn_valid(page_to_mfn(pg+mask)) ||
>>  567                  !page_state_is(pg+mask, free) ||
>>  568                  (PFN_ORDER(pg+mask) != order) )
>>  569                 break;
>>  570
>>  571             pgt = pg + mask;
>>  572             check_page(pg, pdx_to_page(pgt->list.next), mask, order, 2);
>>  573             check_page(pg, pdx_to_page(pgt->list.prev), mask, order, 3);
>>  574
>> 
>>> Date: Wed, 1 Sep 2010 10:58:54 +0100
>>> Subject: Re: [Xen-devel] Xen-unstable panic: FATAL PAGE FAULT
>>> From: keir.fraser@eu.citrix.com
>>> To: tinnycloud@hotmail.com; jbeulich@novell.com
>>> CC: xen-devel@lists.xensource.com
>>> 
>>> Hm, well, it is a bit weird. The check in init_heap_pages() ought to prevent
>>> merging across node boundaries. Nonetheless the code is simpler and more
>>> obvious if we put a further merging constraint in free_heap_pages() instead.
>>> It's also correcter, since I'm not sure that the
>>> phys_to_nid(page_to_maddr(pg-1)) in init_heap_pages() won't possibly BUG out
>>> if pg-1 is not a RAM page and is not in a known NUMA node range.
>>> 
>>> Please give the attached patch a spin. (You should revert the previous
>>> patch, of course).
>>> 
>>> Thanks,
>>> Keir
>>> 
>>> On 01/09/2010 10:23, "MaoXiaoyun" <tinnycloud@hotmail.com> wrote:
>>> 
>>>> Well. It did crash on every startup.
>>>> 
>>>> below is what I got.
>>>> ---------------------------------------------------
>>>> root (hd0,0)
>>>> Filesystem type is ext2fs, partition type 0x83
>>>> kernel /xen-4.0.0.gz msi=1 iommu=off x2apic=off hap=0 dom0_mem=10240M
>>>> dom0_max_
>>>> vcpus=4 dom0_vcpus_pin console=com1,vga com1=115200,8n1 conswitch=ax
>>>> noreboot
>>>> [Multiboot-elf, <0x100000:0x152000:0x148000>, shtab=0x39a078,
>>>> entry=0x100000
>>>> ]
>>>> module /vmlinuz-2.6.31.13-pvops-patch ro root=LABEL=/ hda=noprobe
>>>> console=hvc0
>>>> [Multiboot-module @ 0x39b000, 0x3214d0 bytes]
>>>> 
>>>> 
>>>>  __ __ _ _
>>>> ___ ___
>>>> \ \/ /___ _ __ | || | / _ \ / _ \ *
>>>> \ // _ \ '_ \ | || |_| | | | | | | *
>>>> / \ __/ | | | |__ _| |_| | |_| | * *
>>>> /_/\_\___|_| |_| |_|(_)___(_)___/ **************************************
>>>> hich entry is highlighted.
>>>> (XEN) Xen version 4.0.0 (root@dev.sd.aliyun.com) (gcc version 4.1.2
>>>> 20080704
>>>> (Red Hat 4.1.2-46)) Wed Sep 1 17:13:35 CST 2010
>>>> (XEN) Latest ChangeSet: unavailableto modify the kernel arguments
>>>> (XEN) Command line: msi=1 iommu=off x2apic=off hap=0 dom0_mem=10240M
>>>> dom0_max_vcpus=4 dom0_vcpus_pin console=com1,vga com1=115200,8n1
>>>> conswitch=ax
>>>> noreboot
>>>> (XEN) Video information:
>>>> (XEN) VGA is text mode 80x25, font 8x16automatically in 3 seconds.
>>>> (XEN) VBE/DDC methods: none; EDID transfer time: 0 seconds
>>>> (XEN) EDID info not retrieved because no DDC retrieval method detected
>>>> (XEN) Disc information:
>>>> (XEN) Found 6 MBR signatures
>>>> (XEN) Found 6 EDD information structures
>>>> (XEN) Xen-e820 RAM map:
>>>> (XEN) 0000000000000000 - 000000000009a800 (usable)
>>>> (XEN) 000000000009a800 - 00000000000a0000 (reserved)
>>>> (XEN) 00000000000e4bb0 - 0000000000100000 (reserved)
>>>> (XEN) 0000000000100000 - 00000000bf790000 (usable)
>>>> (XEN) 00000000bf790000 - 00000000bf79e000 (ACPI data)
>>>> (XEN) 00000000bf79e000 - 00000000bf7d0000 (ACPI NVS)
>>>> (XEN) 00000000bf7d0000 - 00000000bf7e0000 (reserved)
>>>> (XEN) 00000000bf7ec000 - 00000000c0000000 (reserved)
>>>> (XEN) 00000000e0000000 - 00000000f0000000 (reserved)
>>>> (XEN) 00000000fee00000 - 00000000fee01000 (reserved)
>>>> (XEN) 00000000fff00000 - 0000000100000000 (reserved)
>>>> (XEN) 0000000100000000 - 0000000640000000 (usable)
>>>> (XEN) --------------849
>>>> (XEN) --------------849
>>>> (XEN) --------------849
>>>> (XEN) ACPI: RSDP 000F9DD0, 0024 (r2 ACPIAM)
>>>> (XEN) ACPI: XSDT BF790100, 005C (r1 112309 XSDT1113 20091123 MSFT 97)
>>>> (XEN) ACPI: FACP BF790290, 00F4 (r4 112309 FACP1113 20091123 MSFT 97)
>>>> (XEN) ACPI: DSDT BF7904B0, 4D6A (r2 CTSAV CTSAV122 122 INTL 20051117)
>>>> (XEN) ACPI: FACS BF79E000, 0040
>>>> (XEN) ACPI: APIC BF790390, 00D8 (r2 112309 APIC1113 20091123 MSFT 97)
>>>> (XEN) ACPI: MCFG BF790470, 003C (r1 112309 OEMMCFG 20091123 MSFT 97)
>>>> (XEN) ACPI: OEMB BF79E040, 007A (r1 112309 OEMB1113 20091123 MSFT 97)
>>>> (XEN) ACPI: SRAT BF79A4B0, 01D0 (r1 112309 OEMSRAT 1 INTL 1)
>>>> (XEN) ACPI: HPET BF79A680, 0038 (r1 112309 OEMHPET 20091123 MSFT 97)
>>>> (XEN) ACPI: SSDT BF7A1A00, 0363 (r1 DpgPmm CpuPm 12 INTL 20051117)
>>>> (XEN) --------------847
>>>> (XEN) ---------srat enter
>>>> (XEN) ---------prepare enter into pfn
>>>> (XEN) -------in pfn
>>>> (XEN) -------hole shift returned
>>>> (XEN) --------------849
>>>> (XEN) System RAM: 24542MB (25131224kB)
>>>> (XEN) Unknown interrupt (cr2=0000000000000000)
>>>> (XEN) 00000000000000ab 0000000000000000 ffff82f600004020
>>>> 00007d0a00000000 ffff82f600004000 0000000000000020 0000000000201000
>>>> 0000000000000000 ffffffffffffffff 0000000000000000 0000000000000008
>>>> 0000000000000000 00000000000001ff 00000000000001ff 0000000000000000
>>>> ffff82c480115787 000000000000e008 0000000000010002 ffff82c48035fd18
>>>> 0000000000000000 ffff82c48011536a 0000000000000000 0000000000000000
>>>> 0000000000000163 0000000900000000 00000000000000ab 0000000000000201
>>>> 0000000000000000 0000000000000100 ffff82f600004020 0000000000000eff
>>>> 0000000000000000 ffff82c480115e60 0000000000000000 ffff82f600002020
>>>> 0000000000001000 0000000000000004 0000000000000080 0000000000000001
>>>> ffff82c48020be8d ffff830000100000 0000000000000008 0000000000000000
>>>> 0000000000000000 ffffffffffffffff 0000000000000101 ffff82c48022d8fc
>>>> 0000000000540000 00000000005fde36 0000000000540000 0000000000100000
>>>> 0000000100000000 0000000000000010 ffff82c48024deb4 ffff82c4802404f7
>>>> 0000000000000000 0000000000000000 0000000000000000 0000000000000000
>>>> 0000000000000000 ffff8300bf568ff8 ffff8300bf569ff8 000000000022a630
>>>> 000000000022a695 0000000000087f00 0000000000000000 ffff830000087fc0
>>>> 00000000005fde36 000000000087b6d0 0000000000d44000 0000000001000000
>>>> 0000000000000000 ffffffffffffffff ffff830000087f00 0000100000000000
>>>> 0000000800000000 000000010000006e 0000000000000003 00000000000002f8
>>>> 0000000000000000 0000000000000000 0000000000067ebc 0000000000000000
>>>> 0000000000000000 0000000000000000 0000000000000000 ffff82c4801000b5
>>>> 0000000000000000 0000000000000000 0000000000000000 0000000000000000
>>>> 0000000000000000 0000000000000000 0000000000000000 0000000000000000
>>>> 0000000000000000 0000000000000000 0000000000000000 0000000000000000
>>>> 0000000000000000 0000000000000000 0000000000000000 0000000000000000
>>>> 0000000000000000 0000000000000000 0000000000000000 0000000000000000
>>>> 0000000000000000 0000000000000000 0000000000000000 0000000000000000
>>>> 0000000000000000 0000000000000000 00000000fffff000
>>>> 
>>>>> Date: Wed, 1 Sep 2010 09:49:18 +0100
>>>>> Subject: Re: [Xen-devel] Xen-unstable panic: FATAL PAGE FAULT
>>>>> From: keir.fraser@eu.citrix.com
>>>>> To: JBeulich@novell.com
>>>>> CC: tinnycloud@hotmail.com; xen-devel@lists.xensource.com
>>>>> 
>>>>> On 01/09/2010 09:02, "Jan Beulich" <JBeulich@novell.com> wrote:
>>>>> 
>>>>>>> Well I agree with your logic anyway. So I don't see that this can be the
>>>>>>> cause of MaoXiaoyun's bug. At least not directly. But then I'm stumped
>>>>>>> as
>>>>>>> to
>>>>>>> why the page arithmetic and checks in free_heap_pages are (apparently)
>>>>>>> resulting in a page pointer way outside the frame-table region and
>>>>>>> actually
>>>>>>> in the directmap region.
>>>>>> 
>>>>>> There must be some unchecked use of PAGE_LIST_NULL, i.e.
>>>>>> running off a list end without taking notice (0xffff8315ffffffe4
>>>>>> exactly corresponds with that).
>>>>> 
>>>>> Okay, my next guess then is that we are deleting a chunk from the wrong
>>>>> list
>>>>> head. I don't see any check that the adjacent chunks we are considering to
>>>>> merge are from the same node and zone. I suppose the zone logic does just
>>>>> work as we're dealing with 2**x aligned and sized regions. But, shouldn't
>>>>> the merging logic in free_heap_pages be checking that the merging
>>>>> candidate
>>>>> is from the same NUMA node? I see I have an ASSERTion later in the same
>>>>> function, but it's too weak and wishful I suspect.
>>>>> 
>>>>> MaoXiaoyun: can you please test with the attached patch? If I'm right, you
>>>>> will crash on one of the BUG_ON checks that I added, rather than crashing
>>>>> on
>>>>> a pointer dereference. You may even crash during boot. Anyhow, what is
>>>>> interesting is whether this patch always makes you crash on BUG_ON before
>>>>> you would normally crash on pointer dereference. If so this is trivial to
>>>>> fix.
>>>>> 
>>>>> Thanks,
>>>>> Keir
>>>>> 
>>>> 
>>> 
>> 
> 
> 
> 
> _______________________________________________
> Xen-devel mailing list
> Xen-devel@lists.xensource.com
> http://lists.xensource.com/xen-devel

^ permalink raw reply	[flat|nested] 39+ messages in thread

* RE: Xen-unstable panic: FATAL PAGE FAULT
  2010-09-01 10:21                                       ` MaoXiaoyun
  2010-09-01 10:25                                         ` Keir Fraser
@ 2010-09-01 10:34                                         ` Jan Beulich
  1 sibling, 0 replies; 39+ messages in thread
From: Jan Beulich @ 2010-09-01 10:34 UTC (permalink / raw)
  To: MaoXiaoyun; +Cc: xen devel, keir.fraser

>>> On 01.09.10 at 12:21, MaoXiaoyun <tinnycloud@hotmail.com> wrote:
> I did the below test myself, in page_alloc.c.
> 
> check_page() panics on any page whose address has '3' as its 6th hex
> character; the `i` argument indicates which call site panicked.
> 
> The output below shows the panic comes from line 558: the page address is
> ffff82f600002040, while its next page is ffff8315ffffffe0, very close to
> the fault address of the previous panic (ffff8315ffffffe4).
> 
>  
> 
> I think this should imply something.

No, you didn't do it right. When merging backwards pg->list.next may
validly be PAGE_LIST_NULL, and hence calling check_page() with it
passed as second argument isn't correct. Similarly when forward
merging, pgt->list.prev may validly be PAGE_LIST_NULL.

Jan

> 
>  
> 
> ---------------------------------------
> 
> (XEN) -----------18
> (XEN) System RAM: 24542MB (25131224kB)
> (XEN) SRAT: No PXM for e820 range: 0000000000000000 - 000000000009a7ff
> (XEN) SRAT: SRAT not used.
> (XEN) ----------------pgb ffff82f600002040 pg ffff8315ffffffe0, mask 1, order 0, 0
> (XEN) 
> (XEN) ****************************************
> (XEN) Panic on CPU 0:
> (XEN) xmao invalid page address assigned 
> (XEN) ****************************************
> (XEN) 
> 
>  
> 
> ----------------------------------------------------
> 
>  485 static int check_page(struct page_info* pgb, struct page_info* pg, 
> unsigned long mask, unsigned int order, int i){
>  486 
>  487         if((unsigned long)pg & 0x0000020000000000 &&
>  488            (unsigned long)pg & 0x0000010000000000
>  489               ){
>  490                   printk("----------------pgb %p pg %p, mask %lx, order %d, %d\n", pgb, 
> pg, mask, order, i);                                            
>  491                   panic("xmao invalid page address assigned \n");
>  492              }
>  493             return 0;
>  494 }
> 
>  
> 
> 549         if ( (page_to_mfn(pg) & mask) )
>  550         {
>  551             /* Merge with predecessor block? */
>  552             if ( !mfn_valid(page_to_mfn(pg-mask)) ||
>  553                  !page_state_is(pg-mask, free) ||
>  554                  (PFN_ORDER(pg-mask) != order) )
>  555                 break;
>  556             pg -= mask;
>  557 
>  558             check_page(pg, pdx_to_page(pg->list.next), mask, order, 0);   
>                                                                               
> 
>  559             check_page(pg, pdx_to_page(pg->list.prev), mask, order, 1);
>  560 
>  561             page_list_del(pg, &heap(node, zone, order));
>  562         }
>  563         else
>  564         {
>  565             /* Merge with successor block? */
>  566             if ( !mfn_valid(page_to_mfn(pg+mask)) ||
>  567                  !page_state_is(pg+mask, free) ||
>  568                  (PFN_ORDER(pg+mask) != order) )
>  569                 break;
>  570 
>  571             pgt = pg + mask;
>  572             check_page(pg, pdx_to_page(pgt->list.next), mask, order, 2);
>  573             check_page(pg, pdx_to_page(pgt->list.prev), mask, order, 3);
>  574 
>  
>> Date: Wed, 1 Sep 2010 10:58:54 +0100
>> Subject: Re: [Xen-devel] Xen-unstable panic: FATAL PAGE FAULT
>> From: keir.fraser@eu.citrix.com 
>> To: tinnycloud@hotmail.com; jbeulich@novell.com 
>> CC: xen-devel@lists.xensource.com 
>> 
>> Hm, well, it is a bit weird. The check in init_heap_pages() ought to prevent
>> merging across node boundaries. Nonetheless the code is simpler and more
>> obvious if we put a further merging constraint in free_heap_pages() instead.
>> It's also correcter, since I'm not sure that the
>> phys_to_nid(page_to_maddr(pg-1)) in init_heap_pages() won't possibly BUG out
>> if pg-1 is not a RAM page and is not in a known NUMA node range.
>> 
>> Please give the attached patch a spin. (You should revert the previous
>> patch, of course).
>> 
>> Thanks,
>> Keir
>> 
>> On 01/09/2010 10:23, "MaoXiaoyun" <tinnycloud@hotmail.com> wrote:
>> 
>> > Well. It did crash on every startup.
>> > 
>> > below is what I got.
>> > ---------------------------------------------------
>> > root (hd0,0) 
>> > Filesystem type is ext2fs, partition type 0x83
>> > kernel /xen-4.0.0.gz msi=1 iommu=off x2apic=off hap=0 dom0_mem=10240M
>> > dom0_max_ 
>> > vcpus=4 dom0_vcpus_pin console=com1,vga com1=115200,8n1 conswitch=ax 
> noreboot
>> > [Multiboot-elf, <0x100000:0x152000:0x148000>, shtab=0x39a078,
>> > entry=0x100000 
>> > ] 
>> > module /vmlinuz-2.6.31.13-pvops-patch ro root=LABEL=/ hda=noprobe console=hvc0
>> > [Multiboot-module @ 0x39b000, 0x3214d0 bytes]
>> > 
>> > 
>> >   __ __ _ _
>> > ___ ___ 
>> > \ \/ /___ _ __ | || | / _ \ / _ \ *
>> > \ // _ \ '_ \ | || |_| | | | | | | *
>> > / \ __/ | | | |__ _| |_| | |_| | * *
>> > /_/\_\___|_| |_| |_|(_)___(_)___/ **************************************
>> > hich entry is highlighted.
>> > (XEN) Xen version 4.0.0 (root@dev.sd.aliyun.com) (gcc version 4.1.2 
> 20080704
>> > (Red Hat 4.1.2-46)) Wed Sep 1 17:13:35 CST 2010
>> > (XEN) Latest ChangeSet: unavailableto modify the kernel arguments
>> > (XEN) Command line: msi=1 iommu=off x2apic=off hap=0 dom0_mem=10240M
>> > dom0_max_vcpus=4 dom0_vcpus_pin console=com1,vga com1=115200,8n1 
> conswitch=ax
>> > noreboot
>> > (XEN) Video information:
>> > (XEN) VGA is text mode 80x25, font 8x16automatically in 3 seconds.
>> > (XEN) VBE/DDC methods: none; EDID transfer time: 0 seconds
>> > (XEN) EDID info not retrieved because no DDC retrieval method detected
>> > (XEN) Disc information:
>> > (XEN) Found 6 MBR signatures
>> > (XEN) Found 6 EDD information structures
>> > (XEN) Xen-e820 RAM map:
>> > (XEN) 0000000000000000 - 000000000009a800 (usable)
>> > (XEN) 000000000009a800 - 00000000000a0000 (reserved)
>> > (XEN) 00000000000e4bb0 - 0000000000100000 (reserved)
>> > (XEN) 0000000000100000 - 00000000bf790000 (usable)
>> > (XEN) 00000000bf790000 - 00000000bf79e000 (ACPI data)
>> > (XEN) 00000000bf79e000 - 00000000bf7d0000 (ACPI NVS)
>> > (XEN) 00000000bf7d0000 - 00000000bf7e0000 (reserved)
>> > (XEN) 00000000bf7ec000 - 00000000c0000000 (reserved)
>> > (XEN) 00000000e0000000 - 00000000f0000000 (reserved)
>> > (XEN) 00000000fee00000 - 00000000fee01000 (reserved)
>> > (XEN) 00000000fff00000 - 0000000100000000 (reserved)
>> > (XEN) 0000000100000000 - 0000000640000000 (usable)
>> > (XEN) --------------849
>> > (XEN) --------------849
>> > (XEN) --------------849
>> > (XEN) ACPI: RSDP 000F9DD0, 0024 (r2 ACPIAM)
>> > (XEN) ACPI: XSDT BF790100, 005C (r1 112309 XSDT1113 20091123 MSFT 97)
>> > (XEN) ACPI: FACP BF790290, 00F4 (r4 112309 FACP1113 20091123 MSFT 97)
>> > (XEN) ACPI: DSDT BF7904B0, 4D6A (r2 CTSAV CTSAV122 122 INTL 20051117)
>> > (XEN) ACPI: FACS BF79E000, 0040
>> > (XEN) ACPI: APIC BF790390, 00D8 (r2 112309 APIC1113 20091123 MSFT 97)
>> > (XEN) ACPI: MCFG BF790470, 003C (r1 112309 OEMMCFG 20091123 MSFT 97)
>> > (XEN) ACPI: OEMB BF79E040, 007A (r1 112309 OEMB1113 20091123 MSFT 97)
>> > (XEN) ACPI: SRAT BF79A4B0, 01D0 (r1 112309 OEMSRAT 1 INTL 1)
>> > (XEN) ACPI: HPET BF79A680, 0038 (r1 112309 OEMHPET 20091123 MSFT 97)
>> > (XEN) ACPI: SSDT BF7A1A00, 0363 (r1 DpgPmm CpuPm 12 INTL 20051117)
>> > (XEN) --------------847
>> > (XEN) ---------srat enter
>> > (XEN) ---------prepare enter into pfn
>> > (XEN) -------in pfn
>> > (XEN) -------hole shift returned
>> > (XEN) --------------849
>> > (XEN) System RAM: 24542MB (25131224kB)
>> > (XEN) Unknown interrupt (cr2=0000000000000000)
>> > (XEN) 00000000000000ab 0000000000000000 ffff82f600004020
>> > 00007d0a00000000 ffff82f600004000 0000000000000020 0000000000201000
>> > 0000000000000000 ffffffffffffffff 0000000000000000 0000000000000008
>> > 0000000000000000 00000000000001ff 00000000000001ff 0000000000000000
>> > ffff82c480115787 000000000000e008 0000000000010002 ffff82c48035fd18
>> > 0000000000000000 ffff82c48011536a 0000000000000000 0000000000000000
>> > 0000000000000163 0000000900000000 00000000000000ab 0000000000000201
>> > 0000000000000000 0000000000000100 ffff82f600004020 0000000000000eff
>> > 0000000000000000 ffff82c480115e60 0000000000000000 ffff82f600002020
>> > 0000000000001000 0000000000000004 0000000000000080 0000000000000001
>> > ffff82c48020be8d ffff830000100000 0000000000000008 0000000000000000
>> > 0000000000000000 ffffffffffffffff 0000000000000101 ffff82c48022d8fc
>> > 0000000000540000 00000000005fde36 0000000000540000 0000000000100000
>> > 0000000100000000 0000000000000010 ffff82c48024deb4 ffff82c4802404f7
>> > 0000000000000000 0000000000000000 0000000000000000 0000000000000000
>> > 0000000000000000 ffff8300bf568ff8 ffff8300bf569ff8 000000000022a630
>> > 000000000022a695 0000000000087f00 0000000000000000 ffff830000087fc0
>> > 00000000005fde36 000000000087b6d0 0000000000d44000 0000000001000000
>> > 0000000000000000 ffffffffffffffff ffff830000087f00 0000100000000000
>> > 0000000800000000 000000010000006e 0000000000000003 00000000000002f8
>> > 0000000000000000 0000000000000000 0000000000067ebc 0000000000000000
>> > 0000000000000000 0000000000000000 0000000000000000 ffff82c4801000b5
>> > 0000000000000000 0000000000000000 0000000000000000 0000000000000000
>> > 0000000000000000 0000000000000000 0000000000000000 0000000000000000
>> > 0000000000000000 0000000000000000 0000000000000000 0000000000000000
>> > 0000000000000000 0000000000000000 0000000000000000 0000000000000000
>> > 0000000000000000 0000000000000000 0000000000000000 0000000000000000
>> > 0000000000000000 0000000000000000 0000000000000000 0000000000000000
>> > 0000000000000000 0000000000000000 00000000fffff000
>> > 
>> >> Date: Wed, 1 Sep 2010 09:49:18 +0100
>> >> Subject: Re: [Xen-devel] Xen-unstable panic: FATAL PAGE FAULT
>> >> From: keir.fraser@eu.citrix.com 
>> >> To: JBeulich@novell.com 
>> >> CC: tinnycloud@hotmail.com; xen-devel@lists.xensource.com 
>> >> 
>> >> On 01/09/2010 09:02, "Jan Beulich" <JBeulich@novell.com> wrote:
>> >> 
>> >>>> Well I agree with your logic anyway. So I don't see that this can be the
>> >>>> cause of MaoXiaoyun's bug. At least not directly. But then I'm stumped as
>> >>>> to
>> >>>> why the page arithmetic and checks in free_heap_pages are (apparently)
>> >>>> resulting in a page pointer way outside the frame-table region and actually
>> >>>> in the directmap region.
>> >>> 
>> >>> There must be some unchecked use of PAGE_LIST_NULL, i.e.
>> >>> running off a list end without taking notice (0xffff8315ffffffe4
>> >>> exactly corresponds with that).
>> >> 
>> >> Okay, my next guess then is that we are deleting a chunk from the wrong 
> list
>> >> head. I don't see any check that the adjacent chunks we are considering to
>> >> merge are from the same node and zone. I suppose the zone logic does just
>> >> work as we're dealing with 2**x aligned and sized regions. But, shouldn't
>> >> the merging logic in free_heap_pages be checking that the merging candidate
>> >> is from the same NUMA node? I see I have an ASSERTion later in the same
>> >> function, but it's too weak and wishful I suspect.
>> >> 
>> >> MaoXiaoyun: can you please test with the attached patch? If I'm right, you
>> >> will crash on one of the BUG_ON checks that I added, rather than crashing 
> on
>> >> a pointer dereference. You may even crash during boot. Anyhow, what is
>> >> interesting is whether this patch always makes you crash on BUG_ON before
>> >> you would normally crash on pointer dereference. If so this is trivial to
>> >> fix.
>> >> 
>> >> Thanks,
>> >> Keir
>> >> 
>> > 
>> 


^ permalink raw reply	[flat|nested] 39+ messages in thread

* RE: Xen-unstable panic: FATAL PAGE FAULT
  2010-09-01  9:58                                     ` Keir Fraser
  2010-09-01 10:21                                       ` MaoXiaoyun
@ 2010-09-01 11:32                                       ` MaoXiaoyun
  1 sibling, 0 replies; 39+ messages in thread
From: MaoXiaoyun @ 2010-09-01 11:32 UTC (permalink / raw)
  To: keir.fraser, jbeulich; +Cc: xen devel


[-- Attachment #1.1: Type: text/plain, Size: 8772 bytes --]


When I put the BUG_ON code into the if() statement, the server can start.

I must have made another silly mistake while manually copying the patch; I apologize.

 

Anyway, I have one server running with the first patch (moved into the if() statement); it should give us

the page address and other information if it panics.

 

Meanwhile, I'll have another server run the second patch.

I'll keep you updated, thanks.


 
> Date: Wed, 1 Sep 2010 10:58:54 +0100
> Subject: Re: [Xen-devel] Xen-unstable panic: FATAL PAGE FAULT
> From: keir.fraser@eu.citrix.com
> To: tinnycloud@hotmail.com; jbeulich@novell.com
> CC: xen-devel@lists.xensource.com
> 
> Hm, well, it is a bit weird. The check in init_heap_pages() ought to prevent
> merging across node boundaries. Nonetheless the code is simpler and more
> obvious if we put a further merging constraint in free_heap_pages() instead.
> It's also correcter, since I'm not sure that the
> phys_to_nid(page_to_maddr(pg-1)) in init_heap_pages() won't possibly BUG out
> if pg-1 is not a RAM page and is not in a known NUMA node range.
> 
> Please give the attached patch a spin. (You should revert the previous
> patch, of course).
> 
> Thanks,
> Keir
> 
> On 01/09/2010 10:23, "MaoXiaoyun" <tinnycloud@hotmail.com> wrote:
> 
> > Well. It did crash on every startup.
> > 
> > below is what I got.
> > ---------------------------------------------------
> > root (hd0,0) 
> > Filesystem type is ext2fs, partition type 0x83
> > kernel /xen-4.0.0.gz msi=1 iommu=off x2apic=off hap=0 dom0_mem=10240M
> > dom0_max_ 
> > vcpus=4 dom0_vcpus_pin console=com1,vga com1=115200,8n1 conswitch=ax noreboot
> > [Multiboot-elf, <0x100000:0x152000:0x148000>, shtab=0x39a078,
> > entry=0x100000 
> > ] 
> > module /vmlinuz-2.6.31.13-pvops-patch ro root=LABEL=/ hda=noprobe console=hvc0
> > [Multiboot-module @ 0x39b000, 0x3214d0 bytes]
> > 
> > 
> >  __ __ _ _
> > ___ ___ 
> > \ \/ /___ _ __ | || | / _ \ / _ \ *
> > \ // _ \ '_ \ | || |_| | | | | | | *
> > / \ __/ | | | |__ _| |_| | |_| | * *
> > /_/\_\___|_| |_| |_|(_)___(_)___/ **************************************
> > hich entry is highlighted.
> > (XEN) Xen version 4.0.0 (root@dev.sd.aliyun.com) (gcc version 4.1.2 20080704
> > (Red Hat 4.1.2-46)) Wed Sep 1 17:13:35 CST 2010
> > (XEN) Latest ChangeSet: unavailableto modify the kernel arguments
> > (XEN) Command line: msi=1 iommu=off x2apic=off hap=0 dom0_mem=10240M
> > dom0_max_vcpus=4 dom0_vcpus_pin console=com1,vga com1=115200,8n1 conswitch=ax
> > noreboot
> > (XEN) Video information:
> > (XEN) VGA is text mode 80x25, font 8x16automatically in 3 seconds.
> > (XEN) VBE/DDC methods: none; EDID transfer time: 0 seconds
> > (XEN) EDID info not retrieved because no DDC retrieval method detected
> > (XEN) Disc information:
> > (XEN) Found 6 MBR signatures
> > (XEN) Found 6 EDD information structures
> > (XEN) Xen-e820 RAM map:
> > (XEN) 0000000000000000 - 000000000009a800 (usable)
> > (XEN) 000000000009a800 - 00000000000a0000 (reserved)
> > (XEN) 00000000000e4bb0 - 0000000000100000 (reserved)
> > (XEN) 0000000000100000 - 00000000bf790000 (usable)
> > (XEN) 00000000bf790000 - 00000000bf79e000 (ACPI data)
> > (XEN) 00000000bf79e000 - 00000000bf7d0000 (ACPI NVS)
> > (XEN) 00000000bf7d0000 - 00000000bf7e0000 (reserved)
> > (XEN) 00000000bf7ec000 - 00000000c0000000 (reserved)
> > (XEN) 00000000e0000000 - 00000000f0000000 (reserved)
> > (XEN) 00000000fee00000 - 00000000fee01000 (reserved)
> > (XEN) 00000000fff00000 - 0000000100000000 (reserved)
> > (XEN) 0000000100000000 - 0000000640000000 (usable)
> > (XEN) --------------849
> > (XEN) --------------849
> > (XEN) --------------849
> > (XEN) ACPI: RSDP 000F9DD0, 0024 (r2 ACPIAM)
> > (XEN) ACPI: XSDT BF790100, 005C (r1 112309 XSDT1113 20091123 MSFT 97)
> > (XEN) ACPI: FACP BF790290, 00F4 (r4 112309 FACP1113 20091123 MSFT 97)
> > (XEN) ACPI: DSDT BF7904B0, 4D6A (r2 CTSAV CTSAV122 122 INTL 20051117)
> > (XEN) ACPI: FACS BF79E000, 0040
> > (XEN) ACPI: APIC BF790390, 00D8 (r2 112309 APIC1113 20091123 MSFT 97)
> > (XEN) ACPI: MCFG BF790470, 003C (r1 112309 OEMMCFG 20091123 MSFT 97)
> > (XEN) ACPI: OEMB BF79E040, 007A (r1 112309 OEMB1113 20091123 MSFT 97)
> > (XEN) ACPI: SRAT BF79A4B0, 01D0 (r1 112309 OEMSRAT 1 INTL 1)
> > (XEN) ACPI: HPET BF79A680, 0038 (r1 112309 OEMHPET 20091123 MSFT 97)
> > (XEN) ACPI: SSDT BF7A1A00, 0363 (r1 DpgPmm CpuPm 12 INTL 20051117)
> > (XEN) --------------847
> > (XEN) ---------srat enter
> > (XEN) ---------prepare enter into pfn
> > (XEN) -------in pfn
> > (XEN) -------hole shift returned
> > (XEN) --------------849
> > (XEN) System RAM: 24542MB (25131224kB)
> > (XEN) Unknown interrupt (cr2=0000000000000000)
> > (XEN) 00000000000000ab 0000000000000000 ffff82f600004020
> > 00007d0a00000000 ffff82f600004000 0000000000000020 0000000000201000
> > 0000000000000000 ffffffffffffffff 0000000000000000 0000000000000008
> > 0000000000000000 00000000000001ff 00000000000001ff 0000000000000000
> > ffff82c480115787 000000000000e008 0000000000010002 ffff82c48035fd18
> > 0000000000000000 ffff82c48011536a 0000000000000000 0000000000000000
> > 0000000000000163 0000000900000000 00000000000000ab 0000000000000201
> > 0000000000000000 0000000000000100 ffff82f600004020 0000000000000eff
> > 0000000000000000 ffff82c480115e60 0000000000000000 ffff82f600002020
> > 0000000000001000 0000000000000004 0000000000000080 0000000000000001
> > ffff82c48020be8d ffff830000100000 0000000000000008 0000000000000000
> > 0000000000000000 ffffffffffffffff 0000000000000101 ffff82c48022d8fc
> > 0000000000540000 00000000005fde36 0000000000540000 0000000000100000
> > 0000000100000000 0000000000000010 ffff82c48024deb4 ffff82c4802404f7
> > 0000000000000000 0000000000000000 0000000000000000 0000000000000000
> > 0000000000000000 ffff8300bf568ff8 ffff8300bf569ff8 000000000022a630
> > 000000000022a695 0000000000087f00 0000000000000000 ffff830000087fc0
> > 00000000005fde36 000000000087b6d0 0000000000d44000 0000000001000000
> > 0000000000000000 ffffffffffffffff ffff830000087f00 0000100000000000
> > 0000000800000000 000000010000006e 0000000000000003 00000000000002f8
> > 0000000000000000 0000000000000000 0000000000067ebc 0000000000000000
> > 0000000000000000 0000000000000000 0000000000000000 ffff82c4801000b5
> > 0000000000000000 0000000000000000 0000000000000000 0000000000000000
> > 0000000000000000 0000000000000000 0000000000000000 0000000000000000
> > 0000000000000000 0000000000000000 0000000000000000 0000000000000000
> > 0000000000000000 0000000000000000 0000000000000000 0000000000000000
> > 0000000000000000 0000000000000000 0000000000000000 0000000000000000
> > 0000000000000000 0000000000000000 0000000000000000 0000000000000000
> > 0000000000000000 0000000000000000 00000000fffff000
> > 
> >> Date: Wed, 1 Sep 2010 09:49:18 +0100
> >> Subject: Re: [Xen-devel] Xen-unstable panic: FATAL PAGE FAULT
> >> From: keir.fraser@eu.citrix.com
> >> To: JBeulich@novell.com
> >> CC: tinnycloud@hotmail.com; xen-devel@lists.xensource.com
> >> 
> >> On 01/09/2010 09:02, "Jan Beulich" <JBeulich@novell.com> wrote:
> >> 
> >>>> Well I agree with your logic anyway. So I don't see that this can be the
> >>>> cause of MaoXiaoyun's bug. At least not directly. But then I'm stumped as
> >>>> to
> >>>> why the page arithmetic and checks in free_heap_pages are (apparently)
> >>>> resulting in a page pointer way outside the frame-table region and actually
> >>>> in the directmap region.
> >>> 
> >>> There must be some unchecked use of PAGE_LIST_NULL, i.e.
> >>> running off a list end without taking notice (0xffff8315ffffffe4
> >>> exactly corresponds with that).
> >> 
> >> Okay, my next guess then is that we are deleting a chunk from the wrong list
> >> head. I don't see any check that the adjacent chunks we are considering to
> >> merge are from the same node and zone. I suppose the zone logic does just
> >> work as we're dealing with 2**x aligned and sized regions. But, shouldn't
> >> the merging logic in free_heap_pages be checking that the merging candidate
> >> is from the same NUMA node? I see I have an ASSERTion later in the same
> >> function, but it's too weak and wishful I suspect.
> >> 
> >> MaoXiaoyun: can you please test with the attached patch? If I'm right, you
> >> will crash on one of the BUG_ON checks that I added, rather than crashing on
> >> a pointer dereference. You may even crash during boot. Anyhow, what is
> >> interesting is whether this patch always makes you crash on BUG_ON before
> >> you would normally crash on pointer dereference. If so this is trivial to
> >> fix.
> >> 
> >> Thanks,
> >> Keir
> >> 
> > 
> 

[-- Attachment #1.2: Type: text/html, Size: 10411 bytes --]

[-- Attachment #2: Type: text/plain, Size: 138 bytes --]

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xensource.com
http://lists.xensource.com/xen-devel

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: Xen-unstable panic: FATAL PAGE FAULT
  2010-02-07 11:56 ` Keir Fraser
@ 2010-04-30 20:52   ` Bastian Blank
  0 siblings, 0 replies; 39+ messages in thread
From: Bastian Blank @ 2010-04-30 20:52 UTC (permalink / raw)
  To: xen-devel

On Sun, Feb 07, 2010 at 11:56:26AM +0000, Keir Fraser wrote:
> I'll have to decode the backtrace a bit, but I would guess most likely is
> some memory got corrupted along the way, which would be rather nasty. I
> already need to follow up on a report of apparent memory corruption in a
> domU userspace (testing with the 'memtester' utility), so with a bit of luck
> they could be maifestations of the same bug.

I saw a similar crash with the 4.0.0 release. The server survived several
memtest86 runs and works (mostly) with Xen 3.4.

(XEN) ----[ Xen-4.0.0  x86_64  debug=n  Tainted:    C ]----
(XEN) CPU:    8
(XEN) RIP:    e008:[<ffff82c480114e26>] free_heap_pages+0x366/0x4b0
(XEN) RFLAGS: 0000000000010286   CONTEXT: hypervisor
(XEN) rax: ffff82c480375070   rbx: ffff82f6143cc240   rcx: ffff8315ffffffe0
(XEN) rdx: ffff8315ffffffe0   rsi: 00000000ffffffff   rdi: 0000000000a1e613
(XEN) rbp: 0000000000000000   rsp: ffff830a2fee7cf8   r8:  0000000000000020
(XEN) r9:  0000000000000001   r10: 0200000000000000   r11: 0080000000000000
(XEN) r12: ffff82f6143cc260   r13: 00007d0a00000000   r14: 0180000000000000
(XEN) r15: 00000000000001b5   cr0: 000000008005003b   cr4: 00000000000006f0
(XEN) cr3: 0000000a1ff80000   cr2: ffff8315ffffffe4
(XEN) ds: 0000   es: 0000   fs: 0000   gs: 0000   ss: e010   cs: e008
(XEN) Xen stack trace from rsp=ffff830a2fee7cf8:
(XEN)    0000000000000000 ffff82c4803764c0 00000000000001b5 00000000143cc240
(XEN)    ffff830840060000 ffff830840060000 ffff82f6143cc240 0000000000000000
(XEN)    ffff830840060014 00000000000ffea5 0000000000000000 ffff82c4801150b8
(XEN)    ffff830840060000 ffff82f6143cc240 ffff8300cfe2c000 0000000000000000
(XEN)    ffff8300cfe2c000 0000000000000001 0010000a1e612067 ffff830840060000
(XEN)    ffff83092f69c9c8 ffff82c48016371f ffff830a2fee7f28 ffff82c48015b9f8
(XEN)    0080000000000000 ffff8800ffa55db0 ffff830a2fee7ed0 ffff830840060000
(XEN)    000000000092f69c 000000000092f69c 000000000092f69c 0000000000000000
(XEN)    00000000125ed380 ffff830840060000 ffff82f6125ed380 0000000000000000
(XEN)    ffff8300cfe2c000 ffff880002939000 ffff830840060000 ffff82c480166b1c
(XEN)    0000000000000001 0000000000000000 ffff83092f69c9c8 ffff830a2fee7f28
(XEN)    0000000000000001 ffff830a2fee7f28 ffff830840060000 0000000000000001
(XEN)    ffff8800ffa55dd8 000000000092f69c 0000000000000000 ffff82c480266008
(XEN)    ffff8800038c7060 0000000000000000 ffff82c480266000 0000000000000000
(XEN)    ffff830a2fee7f28 ffff82c480114129 0000000100000006 ffff830a2fee7f28
(XEN)    ffff830a2fee7f28 ffff82c480266010 0000000000000000 ffff8300cfe2c000
(XEN)    0000000000000000 ffff8800010060a0 ffff88000293a000 0000000000000260
(XEN)    0000000000000000 ffff82c4801e7169 0000000000000000 0000000000000260
(XEN)    ffff88000293a000 ffff8800010060a0 0000000000000000 ffff8800038c7060
(XEN)    0000000000000212 0000000000000000 ffff8800ffc03080 0000000000000000
(XEN) Xen call trace:
(XEN)    [<ffff82c480114e26>] free_heap_pages+0x366/0x4b0
(XEN)    [<ffff82c4801150b8>] free_domheap_pages+0x148/0x380
(XEN)    [<ffff82c48016371f>] mod_l1_entry+0x16f/0x740
(XEN)    [<ffff82c48015b9f8>] get_page+0x28/0xf0
(XEN)    [<ffff82c480166b1c>] __do_update_va_mapping+0x64c/0x6e0
(XEN)    [<ffff82c480114129>] do_multicall+0x189/0x320
(XEN)    [<ffff82c4801e7169>] syscall_enter+0xa9/0xae
(XEN)    
(XEN) Pagetable walk from ffff8315ffffffe4:
(XEN)  L4[0x106] = 00000000cfd69027 5555555555555555
(XEN)  L3[0x057] = 0000000000000000 ffffffffffffffff
(XEN) 
(XEN) ****************************************
(XEN) Panic on CPU 8:
(XEN) FATAL PAGE FAULT
(XEN) [error_code=0002]
(XEN) Faulting linear address: ffff8315ffffffe4
(XEN) ****************************************

> > (XEN) Pagetable walk from ffff8315ffffffe4:
> > (XEN)  L4[0x106] = 00000000bf4f5027 5555555555555555
> > (XEN)  L3[0x057] = 0000000000000000 ffffffffffffffff
> > (XEN)
> > (XEN) ****************************************
> > (XEN) Panic on CPU 2:
> > (XEN) FATAL PAGE FAULT
> > (XEN) [error_code=0002]
> > (XEN) Faulting linear address: ffff8315ffffffe4
> > (XEN) ****************************************

The addresses are even identical.

Bastian

-- 
I have never understood the female capacity to avoid a direct answer to
any question.
		-- Spock, "This Side of Paradise", stardate 3417.3


* Re: Xen-unstable panic: FATAL PAGE FAULT
  2010-02-06 22:56 Mark Hurenkamp
@ 2010-02-07 11:56 ` Keir Fraser
  2010-04-30 20:52   ` Bastian Blank
  0 siblings, 1 reply; 39+ messages in thread
From: Keir Fraser @ 2010-02-07 11:56 UTC (permalink / raw)
  To: Mark Hurenkamp, xen-devel

I'll have to decode the backtrace a bit, but I would guess most likely is
some memory got corrupted along the way, which would be rather nasty. I
already need to follow up on a report of apparent memory corruption in a
domU userspace (testing with the 'memtester' utility), so with a bit of luck
they could be manifestations of the same bug.

 -- Keir

On 06/02/2010 22:56, "Mark Hurenkamp" <mark.hurenkamp@xs4all.nl> wrote:

> Hi,
> 
> 
> While playing with my xen server (which is running xen-unstable/linux pvops),
> it suddenly crashed with the following messages on the serial port.
> This is a recent version of xen-unstable, but I am a few updates behind.
> I've seen this only once, so perhaps it is hard to reproduce. I hope this
> info is still of use to someone.
> 
> 
> Regards,
> Mark.
> 
> 
> (XEN) tmem: all pools frozen for all domains
> (XEN) tmem: all pools frozen for all domains
> (XEN) tmem: all pools thawed for all domains
> (XEN) tmem: all pools thawed for all domains
> (XEN) paging.c:170: paging_free_log_dirty_bitmap: used 19 pages for domain 3
> dirty logging
> (XEN) ----[ Xen-4.0.0-rc3-pre  x86_64  debug=y  Tainted:    C ]----
> (XEN) CPU:    2
> (XEN) RIP:    e008:[<ffff82c4801150c5>] free_heap_pages+0x53a/0x555
> (XEN) RFLAGS: 0000000000010282   CONTEXT: hypervisor
> (XEN) rax: ffff82c4803004c0   rbx: ffff82f600ae4b40   rcx: ffff8315ffffffe0
> (XEN) rdx: 00000000ffffffff   rsi: ffff8315ffffffe0   rdi: ffff82f600000000
> (XEN) rbp: ffff83013ff27bc8   rsp: ffff83013ff27b68   r8:  0000000000000000
> (XEN) r9:  0200000000000000   r10: 0000000000000001   r11: 0080000000000000
> (XEN) r12: ffff82f600ae4b60   r13: 0000000000000000   r14: 00007d0a00000000
> (XEN) r15: 0000000000000000   cr0: 000000008005003b   cr4: 00000000000026f0
> (XEN) cr3: 0000000101001000   cr2: ffff8315ffffffe4
> (XEN) ds: 0000   es: 0000   fs: 0000   gs: 0000   ss: e010   cs: e008
> (XEN) Xen stack trace from rsp=ffff83013ff27b68:
> (XEN)    c2c2c2c2c2c2c2c2 0000000000000064 0000000000000000 0000000000000012
> (XEN)    0000000000000297 000000000000017a ffff82c48011e1e3 0000000000000000
> (XEN)    ffff83010fc50000 ffff82f600ae4b60 0000000000069f65 ffff82f600ae4b80
> (XEN)    ffff83013ff27c18 ffff82c4801153ee 0000000000000001 0000000000000001
> (XEN)    ffff82f600ae49c8 ffff82f600ae4b60 0000000000800727 ffff83013fef0000
> (XEN)    ffff82f600ae4b60 ffff83010fc50000 ffff83013ff27c38 ffff82c48015d4d0
> (XEN)    000000000000e010 800000005725b727 ffff83013ff27c78 ffff82c48015f8d8
> (XEN)    80000000571bf727 ffff8300aae3ac60 ffff83013fef0000 ffff8300aae3b000
> (XEN)    ffff83013ff27f28 0000000000000000 ffff83013ff27cd8 ffff82c48015eaf4
> (XEN)    ffff83013ff27d08 ffff82c48015fe3d ffff83013ff27cf8 ffff82c48015d4fe
> (XEN)    ffff83013ff27cc8 1400000000000001 ffff82f60155c740 ffff82f60155c740
> (XEN)    ffff83013ff27f28 007fffffffffffff ffff83013ff27d28 ffff82c48015f11c
> (XEN)    000000003fef0000 ffff82f60155c750 ffff83013ff27d38 ffff83013fef0000
> (XEN)    0000000000000000 ffffc9000000c2b0 00000000000aae3a ffff83013ff27f28
> (XEN)    ffff83013ff27d38 ffff82c48015f2f8 ffff83013ff27e38 ffff82c480163a4f
> (XEN)    ffff83013fef0018 00007ff03fef0000 0000000000000000 ffff82c480264db0
> (XEN)    ffff82c480264db8 ffff83013ff27f28 ffff83013ff27f28 ffff83013fef0218
> (XEN)    ffff8300bf524000 ffff83013fef0000 ffff8300bf524000 ffff83013fef0000
> (XEN)    ffff83013fff3da8 0000000100000002 ffff830100000000 ffff82f60155c740
> (XEN)    800000008eadf063 ffff880000000001 ffff83013ff27de8 000000003fff3d90
> (XEN) Xen call trace:
> (XEN)    [<ffff82c4801150c5>] free_heap_pages+0x53a/0x555
> (XEN)    [<ffff82c4801153ee>] free_domheap_pages+0x30e/0x3cc
> (XEN)    [<ffff82c48015d4d0>] put_page+0x6c/0x73
> (XEN)    [<ffff82c48015f8d8>] put_page_from_l1e+0x19f/0x1b5
> (XEN)    [<ffff82c48015eaf4>] free_page_type+0x25c/0x7b0
> (XEN)    [<ffff82c48015f11c>] __put_page_type+0xd4/0x292
> (XEN)    [<ffff82c48015f2f8>] put_page_type+0xe/0x23
> (XEN)    [<ffff82c480163a4f>] do_mmuext_op+0x6ff/0x14b8
> (XEN)    [<ffff82c480114235>] do_multicall+0x285/0x410
> (XEN)    [<ffff82c4801f01bf>] syscall_enter+0xef/0x149
> (XEN)
> (XEN) Pagetable walk from ffff8315ffffffe4:
> (XEN)  L4[0x106] = 00000000bf4f5027 5555555555555555
> (XEN)  L3[0x057] = 0000000000000000 ffffffffffffffff
> (XEN)
> (XEN) ****************************************
> (XEN) Panic on CPU 2:
> (XEN) FATAL PAGE FAULT
> (XEN) [error_code=0002]
> (XEN) Faulting linear address: ffff8315ffffffe4
> (XEN) ****************************************
> (XEN)
> (XEN) Reboot in five seconds...
> 
> 
> 
> _______________________________________________
> Xen-devel mailing list
> Xen-devel@lists.xensource.com
> http://lists.xensource.com/xen-devel


* Xen-unstable panic: FATAL PAGE FAULT
@ 2010-02-06 22:56 Mark Hurenkamp
  2010-02-07 11:56 ` Keir Fraser
  0 siblings, 1 reply; 39+ messages in thread
From: Mark Hurenkamp @ 2010-02-06 22:56 UTC (permalink / raw)
  To: xen-devel

Hi,


While playing with my xen server (which is running xen-unstable/linux pvops),
it suddenly crashed with the following messages on the serial port.
This is a recent version of xen-unstable, but I am a few updates behind.
I've seen this only once, so perhaps it is hard to reproduce. I hope this
info is still of use to someone.


Regards,
Mark.


(XEN) tmem: all pools frozen for all domains
(XEN) tmem: all pools frozen for all domains
(XEN) tmem: all pools thawed for all domains
(XEN) tmem: all pools thawed for all domains
(XEN) paging.c:170: paging_free_log_dirty_bitmap: used 19 pages for domain 3 dirty logging
(XEN) ----[ Xen-4.0.0-rc3-pre  x86_64  debug=y  Tainted:    C ]----
(XEN) CPU:    2
(XEN) RIP:    e008:[<ffff82c4801150c5>] free_heap_pages+0x53a/0x555
(XEN) RFLAGS: 0000000000010282   CONTEXT: hypervisor
(XEN) rax: ffff82c4803004c0   rbx: ffff82f600ae4b40   rcx: ffff8315ffffffe0
(XEN) rdx: 00000000ffffffff   rsi: ffff8315ffffffe0   rdi: ffff82f600000000
(XEN) rbp: ffff83013ff27bc8   rsp: ffff83013ff27b68   r8:  0000000000000000
(XEN) r9:  0200000000000000   r10: 0000000000000001   r11: 0080000000000000
(XEN) r12: ffff82f600ae4b60   r13: 0000000000000000   r14: 00007d0a00000000
(XEN) r15: 0000000000000000   cr0: 000000008005003b   cr4: 00000000000026f0
(XEN) cr3: 0000000101001000   cr2: ffff8315ffffffe4
(XEN) ds: 0000   es: 0000   fs: 0000   gs: 0000   ss: e010   cs: e008
(XEN) Xen stack trace from rsp=ffff83013ff27b68:
(XEN)    c2c2c2c2c2c2c2c2 0000000000000064 0000000000000000 0000000000000012
(XEN)    0000000000000297 000000000000017a ffff82c48011e1e3 0000000000000000
(XEN)    ffff83010fc50000 ffff82f600ae4b60 0000000000069f65 ffff82f600ae4b80
(XEN)    ffff83013ff27c18 ffff82c4801153ee 0000000000000001 0000000000000001
(XEN)    ffff82f600ae49c8 ffff82f600ae4b60 0000000000800727 ffff83013fef0000
(XEN)    ffff82f600ae4b60 ffff83010fc50000 ffff83013ff27c38 ffff82c48015d4d0
(XEN)    000000000000e010 800000005725b727 ffff83013ff27c78 ffff82c48015f8d8
(XEN)    80000000571bf727 ffff8300aae3ac60 ffff83013fef0000 ffff8300aae3b000
(XEN)    ffff83013ff27f28 0000000000000000 ffff83013ff27cd8 ffff82c48015eaf4
(XEN)    ffff83013ff27d08 ffff82c48015fe3d ffff83013ff27cf8 ffff82c48015d4fe
(XEN)    ffff83013ff27cc8 1400000000000001 ffff82f60155c740 ffff82f60155c740
(XEN)    ffff83013ff27f28 007fffffffffffff ffff83013ff27d28 ffff82c48015f11c
(XEN)    000000003fef0000 ffff82f60155c750 ffff83013ff27d38 ffff83013fef0000
(XEN)    0000000000000000 ffffc9000000c2b0 00000000000aae3a ffff83013ff27f28
(XEN)    ffff83013ff27d38 ffff82c48015f2f8 ffff83013ff27e38 ffff82c480163a4f
(XEN)    ffff83013fef0018 00007ff03fef0000 0000000000000000 ffff82c480264db0
(XEN)    ffff82c480264db8 ffff83013ff27f28 ffff83013ff27f28 ffff83013fef0218
(XEN)    ffff8300bf524000 ffff83013fef0000 ffff8300bf524000 ffff83013fef0000
(XEN)    ffff83013fff3da8 0000000100000002 ffff830100000000 ffff82f60155c740
(XEN)    800000008eadf063 ffff880000000001 ffff83013ff27de8 000000003fff3d90
(XEN) Xen call trace:
(XEN)    [<ffff82c4801150c5>] free_heap_pages+0x53a/0x555
(XEN)    [<ffff82c4801153ee>] free_domheap_pages+0x30e/0x3cc
(XEN)    [<ffff82c48015d4d0>] put_page+0x6c/0x73
(XEN)    [<ffff82c48015f8d8>] put_page_from_l1e+0x19f/0x1b5
(XEN)    [<ffff82c48015eaf4>] free_page_type+0x25c/0x7b0
(XEN)    [<ffff82c48015f11c>] __put_page_type+0xd4/0x292
(XEN)    [<ffff82c48015f2f8>] put_page_type+0xe/0x23
(XEN)    [<ffff82c480163a4f>] do_mmuext_op+0x6ff/0x14b8
(XEN)    [<ffff82c480114235>] do_multicall+0x285/0x410
(XEN)    [<ffff82c4801f01bf>] syscall_enter+0xef/0x149
(XEN)
(XEN) Pagetable walk from ffff8315ffffffe4:
(XEN)  L4[0x106] = 00000000bf4f5027 5555555555555555
(XEN)  L3[0x057] = 0000000000000000 ffffffffffffffff
(XEN)
(XEN) ****************************************
(XEN) Panic on CPU 2:
(XEN) FATAL PAGE FAULT
(XEN) [error_code=0002]
(XEN) Faulting linear address: ffff8315ffffffe4
(XEN) ****************************************
(XEN)
(XEN) Reboot in five seconds...


end of thread, other threads:[~2010-09-01 11:32 UTC | newest]

Thread overview: 39+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
     [not found] <SNT0-MC2-F12iKC1rdi000797d9@snt0-mc2-f12.Snt0.hotmail.com>
2010-08-26  4:49 ` Re:Re: Xen-unstable panic: FATAL PAGE FAULT MaoXiaoyun
2010-08-26  7:39   ` Keir Fraser
2010-08-26  8:59     ` MaoXiaoyun
2010-08-26  9:11       ` Keir Fraser
2010-08-30  8:47         ` MaoXiaoyun
2010-08-30  9:02           ` Keir Fraser
2010-08-30 13:03             ` MaoXiaoyun
2010-08-30 13:16               ` Keir Fraser
2010-08-31 13:49                 ` MaoXiaoyun
2010-08-31 14:49                   ` Keir Fraser
2010-08-31 15:00                     ` Keir Fraser
2010-08-31 15:07                     ` Jan Beulich
2010-08-31 16:01                       ` Keir Fraser
2010-08-31 16:22                         ` Jan Beulich
2010-08-31 16:35                           ` Keir Fraser
2010-08-31 17:03                             ` Keir Fraser
2010-09-01  7:17                               ` MaoXiaoyun
2010-09-01  7:40                                 ` Keir Fraser
2010-09-01  8:05                                 ` Jan Beulich
2010-09-01  8:32                                   ` MaoXiaoyun
2010-09-01  8:02                               ` Jan Beulich
2010-09-01  8:49                                 ` Keir Fraser
2010-09-01  9:01                                   ` Jan Beulich
2010-09-01  9:28                                     ` Keir Fraser
2010-09-01  9:48                                     ` MaoXiaoyun
2010-09-01 10:09                                       ` Jan Beulich
2010-09-01  9:06                                   ` MaoXiaoyun
2010-09-01  9:23                                   ` MaoXiaoyun
2010-09-01  9:58                                     ` Keir Fraser
2010-09-01 10:21                                       ` MaoXiaoyun
2010-09-01 10:25                                         ` Keir Fraser
2010-09-01 10:28                                           ` Keir Fraser
2010-09-01 10:34                                         ` Jan Beulich
2010-09-01 11:32                                       ` MaoXiaoyun
2010-09-01  7:54                             ` Jan Beulich
2010-09-01  3:17                     ` MaoXiaoyun
2010-02-06 22:56 Mark Hurenkamp
2010-02-07 11:56 ` Keir Fraser
2010-04-30 20:52   ` Bastian Blank
