* Re: kernel BUG at arch/x86/xen/mmu.c:1860!
       [not found] <COL0-MC1-F14hmBzxHs00230882@col0-mc1-f14.Col0.hotmail.com>
@ 2011-04-08 11:24 ` MaoXiaoyun
  2011-04-08 11:46   ` MaoXiaoyun
                     ` (2 more replies)
  0 siblings, 3 replies; 41+ messages in thread
From: MaoXiaoyun @ 2011-04-08 11:24 UTC (permalink / raw)
  To: xen devel; +Cc: jeremy, dave, giamteckchoon, ian.campbell, konrad.wilk


[-- Attachment #1.1: Type: text/plain, Size: 2223 bytes --]


Hi: 
     Unfortunately I hit the exact same bug today, with pvops kernel 2.6.32.36 and Xen 4.0.1.
     The kernel panic and serial logs are attached.
 
     Our test case is quite simple: on a single physical host we start 12 HVMs (Windows 2003),
and each HVM reboots every 10 minutes.
 
     The bug is easy to hit on our 48G machine (within hours), but we haven't hit it on our 24G
machines (we have three 24G machines, and all of them work fine). ----- Could it be related to memory capacity?
 
Looking at the serial output, the Dom0 code is attempting to pin what it thinks
is a "PGT_l3_page_table", but the hypervisor returns -EINVAL because the page actually is a "PGT_writable_page".
 
(XEN) mm.c:2364:d0 Bad type (saw 7400000000000001 != exp 4000000000000000) for mfn 898a41 (pfn 9ca41)
(XEN) mm.c:2733:d0 Error while pinning mfn 898a41
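 
(For reference, the kernel BUG in the attached log fires when this pin hypercall fails. Below is only a rough sketch of the 2.6.32-era pin_pagetable_pfn() named in the backtrace, paraphrased rather than quoted from the tree; constants and the hypercall wrapper are from the Xen public interface headers.)

/*
 * Sketch of arch/x86/xen/mmu.c:pin_pagetable_pfn() circa 2.6.32,
 * paraphrased from memory -- not a verbatim quote of the source.
 */
#include <xen/interface/xen.h>        /* struct mmuext_op, MMUEXT_*  */
#include <asm/xen/hypercall.h>        /* HYPERVISOR_mmuext_op()      */
#include <asm/xen/page.h>             /* pfn_to_mfn()                */

static void pin_pagetable_pfn(unsigned cmd, unsigned long pfn)
{
        struct mmuext_op op;

        op.cmd = cmd;                  /* e.g. MMUEXT_PIN_L1_TABLE   */
        op.arg1.mfn = pfn_to_mfn(pfn); /* pfn -> machine frame       */

        /*
         * If Xen rejects the pin (here -EINVAL, because the frame is
         * still typed PGT_writable_page rather than a page-table type),
         * the kernel cannot continue safely and hits BUG() -- this is
         * the "kernel BUG at arch/x86/xen/mmu.c:1872" in the attached
         * kernel log.
         */
        if (HYPERVISOR_mmuext_op(&op, 1, NULL, DOMID_SELF))
                BUG();
}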
 
Before that, there are quite a lot of abnormal grant table messages, like:
 
(XEN) grant_table.c:1717:d0 Bad grant reference 4294965983
(XEN) grant_table.c:266:d0 Bad flags (0) or dom (0). (expected dom 0)
(XEN) grant_table.c:266:d0 Bad flags (0) or dom (0). (expected dom 0)
(XEN) grant_table.c:266:d0 Bad flags (0) or dom (0). (expected dom 0)
(XEN) grant_table.c:1717:d0 Bad grant reference 4294965888
(XEN) grant_table.c:1717:d0 Bad grant reference 4294965983
 
It looks like something is wrong with the grant table.
 
Many thanks.
 
> From: Jeremy Fitzhardinge <jeremy@goop.org>
> Subject: Re: [Xen-devel] [SPAM] Re: kernel BUG at
> arch/x86/xen/mmu.c:1860! - ideas.
> To: Ian Campbell <Ian.Campbell@citrix.com>
> Cc: Dave Hunter <dave@ivt.com.au>, Teck Choon Giam
> <giamteckchoon@gmail.com>, "xen-devel@lists.xensource.com"
> xen-devel@lists.xensource.com
 
> On 04/06/2011 12:53 AM, Ian Campbell wrote:
> > Please don't top post.
> >
> > On Wed, 2011-04-06 at 00:20 +0100, Dave Hunter wrote:
> >> Is it likely that Debian would release an updated kernel in squeeze with
> >> this configuration? (sorry, this might not be the place to ask).
> > I doubt they will, enabling DEBUG_PAGEALLOC seems very much like a
> > workaround not a solution to me.
> 
> Yes, it will impose a pretty large performance overhead.
> 
> J
> 
> 

 		 	   		  

[-- Attachment #1.2: Type: text/html, Size: 3064 bytes --]

[-- Attachment #2: kernel.txt --]
[-- Type: text/plain, Size: 4390 bytes --]

Apr  8 12:19:47 r14a11017 kernel: ------------[ cut here ]------------
Apr  8 12:19:47 r14a11017 kernel: kernel BUG at arch/x86/xen/mmu.c:1872!
Apr  8 12:19:47 r14a11017 kernel: invalid opcode: 0000 [#1] SMP
Apr  8 12:19:47 r14a11017 kernel: last sysfs file: /sys/hypervisor/properties/capabilities
Apr  8 12:19:47 r14a11017 kernel: CPU 0
Apr  8 12:19:47 r14a11017 kernel: Modules linked in: 8021q garp blktap xen_netback xen_blkback blkback_pagemap nbd bridge stp llc autofs4 ipmi_devintf ipmi_si ipmi_msghandler lockd sunrpc bonding ipv6 xenfs dm_multipath video output sbs sbshc parport_pc lp parport ses enclosure snd_seq_dummy snd_seq_oss bnx2 snd_seq_midi_event serio_raw snd_seq snd_seq_device snd_pcm_oss snd_mixer_oss snd_pcm snd_timer i2c_i801 iTCO_wdt i2c_core snd soundcore snd_page_alloc iTCO_vendor_support pata_acpi ata_generic pcspkr ata_piix shpchp mptsas mptscsih mptbase [last unloaded: freq_table]
Apr  8 12:19:47 r14a11017 kernel: Pid: 15769, comm: sh Not tainted 2.6.32.36xen #1 Tecal RH2285
Apr  8 12:19:47 r14a11017 kernel: RIP: e030:[<ffffffff8100cebc>]  [<ffffffff8100cebc>] pin_pagetable_pfn+0x36/0x3c
Apr  8 12:19:47 r14a11017 kernel: RSP: e02b:ffff88001eb7baa8  EFLAGS: 00010282
Apr  8 12:19:47 r14a11017 kernel: RAX: 00000000ffffffea RBX: 000000000007b307 RCX: 0000000000000001
Apr  8 12:19:47 r14a11017 kernel: RDX: 0000000000000000 RSI: 0000000000000001 RDI: ffff88001eb7baa8
Apr  8 12:19:47 r14a11017 kernel: RBP: ffff88001eb7bac8 R08: 0000000000000420 R09: ffff880000000000
Apr  8 12:19:47 r14a11017 kernel: R10: 0000000000007ff0 R11: ffff88008fc97248 R12: ffff88002840b000
Apr  8 12:19:47 r14a11017 kernel: R13: 000000000007b484 R14: 0000000000000003 R15: ffff88009b090000
Apr  8 12:19:47 r14a11017 kernel: FS:  00007fe8bbc656e0(0000) GS:ffff88002803b000(0000) knlGS:0000000000000000
Apr  8 12:19:47 r14a11017 kernel: CS:  e033 DS: 0000 ES: 0000 CR0: 000000008005003b
Apr  8 12:19:47 r14a11017 kernel: CR2: 00000000006bb338 CR3: 000000007b307000 CR4: 0000000000002660
Apr  8 12:19:47 r14a11017 kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Apr  8 12:19:47 r14a11017 kernel: DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Apr  8 12:19:47 r14a11017 kernel: Process sh (pid: 15769, threadinfo ffff88001eb7a000, task ffff88009b090000)
Apr  8 12:19:47 r14a11017 kernel: Stack:
Apr  8 12:19:47 r14a11017 kernel:  0000000000000000 00000000004b7484 000000011eb7bac8 000000000007b307
Apr  8 12:19:47 r14a11017 kernel: <0> ffff88001eb7baf8 ffffffff8100e8ef ffff88012e4fb100 ffff88000fb5e018
Apr  8 12:19:47 r14a11017 kernel: <0> 000000000007b484 00000000006bb338 ffff88001eb7bb08 ffffffff8100e935
Apr  8 12:19:47 r14a11017 kernel: Call Trace:
Apr  8 12:19:47 r14a11017 kernel:  [<ffffffff8100e8ef>] xen_alloc_ptpage+0x8d/0x96
Apr  8 12:19:47 r14a11017 kernel:  [<ffffffff8100e935>] xen_alloc_pte+0x13/0x15
Apr  8 12:19:47 r14a11017 kernel:  [<ffffffff810eb702>] __pte_alloc+0x7f/0xdc
Apr  8 12:19:47 r14a11017 kernel:  [<ffffffff810e90bd>] ? pmd_offset+0x13/0x3c
Apr  8 12:19:47 r14a11017 kernel:  [<ffffffff810eb818>] handle_mm_fault+0xb9/0x771
Apr  8 12:19:47 r14a11017 kernel:  [<ffffffff810f08fd>] ? vma_link+0x7c/0xa4
Apr  8 12:19:47 r14a11017 kernel:  [<ffffffff810f13b0>] ? mmap_region+0x322/0x42b
Apr  8 12:19:47 r14a11017 kernel:  [<ffffffff8100f169>] ? xen_force_evtchn_callback+0xd/0xf
Apr  8 12:19:47 r14a11017 kernel:  [<ffffffff81449701>] do_page_fault+0x21c/0x288
Apr  8 12:19:47 r14a11017 kernel:  [<ffffffff81447695>] page_fault+0x25/0x30
Apr  8 12:19:47 r14a11017 kernel:  [<ffffffff81222a39>] ? __clear_user+0x33/0x55
Apr  8 12:19:47 r14a11017 kernel:  [<ffffffff81222a1d>] ? __clear_user+0x17/0x55
Apr  8 12:19:47 r14a11017 kernel:  [<ffffffff81222a8b>] clear_user+0x30/0x38
Apr  8 12:19:47 r14a11017 kernel:  [<ffffffff8115139a>] load_elf_binary+0x5d5/0x17ef
Apr  8 12:19:47 r14a11017 kernel:  [<ffffffff811f4648>] ? process_measurement+0xc0/0xd7
Apr  8 12:19:47 r14a11017 kernel:  [<ffffffff81150dc5>] ? load_elf_binary+0x0/0x17ef
Apr  8 12:19:47 r14a11017 kernel:  [<ffffffff81113094>] search_binary_handler+0xc8/0x255
Apr  8 12:19:47 r14a11017 kernel:  [<ffffffff81114362>] do_execve+0x1c3/0x29e
Apr  8 12:19:47 r14a11017 kernel:  [<ffffffff8101155d>] sys_execve+0x43/0x5d
Apr  8 12:19:47 r14a11017 kernel:  [<ffffffff810131ca>] stub_execve+0x6a/0xc0

[-- Attachment #3: serial.txt --]
[-- Type: text/plain, Size: 4360 bytes --]

(XEN) grant_table.c:1717:d0 Bad grant reference 4294965983
(XEN) grant_table.c:266:d0 Bad flags (0) or dom (0). (expected dom 0)
(XEN) grant_table.c:266:d0 Bad flags (0) or dom (0). (expected dom 0)
(XEN) grant_table.c:266:d0 Bad flags (0) or dom (0). (expected dom 0)
(XEN) grant_table.c:1717:d0 Bad grant reference 4294965888
(XEN) grant_table.c:1717:d0 Bad grant reference 4294965983
(XEN) grant_table.c:266:d0 Bad flags (0) or dom (0). (expected dom 0)
(XEN) grant_table.c:266:d0 Bad flags (0) or dom (0). (expected dom 0)
(XEN) grant_table.c:1717:d0 Bad grant reference 4294965983
(XEN) grant_table.c:1717:d0 Bad grant reference 4294965983
(XEN) grant_table.c:266:d0 Bad flags (0) or dom (0). (expected dom 0)
(XEN) grant_table.c:266:d0 Bad flags (0) or dom (0). (expected dom 0)
(XEN) grant_table.c:266:d0 Bad flags (0) or dom (0). (expected dom 0)
(XEN) grant_table.c:266:d0 Bad flags (0) or dom (0). (expected dom 0)
(XEN) grant_table.c:266:d0 Bad flags (0) or dom (0). (expected dom 0)
(XEN) grant_table.c:266:d0 Bad flags (0) or dom (0). (expected dom 0)
(XEN) grant_table.c:266:d0 Bad flags (0) or dom (0). (expected dom 0)
(XEN) printk: 1 messages suppressed.
(XEN) grant_table.c:266:d0 Bad flags (0) or dom (0). (expected dom 0)
(XEN) printk: 1 messages suppressed.
(XEN) grant_table.c:266:d0 Bad flags (0) or dom (0). (expected dom 0)
(XEN) grant_table.c:266:d0 Bad flags (0) or dom (0). (expected dom 0)
(XEN) grant_table.c:266:d0 Bad flags (0) or dom (0). (expected dom 0)
(XEN) printk: 5 messages suppressed.
(XEN) grant_table.c:266:d0 Bad flags (0) or dom (0). (expected dom 0)
(XEN) printk: 14 messages suppressed.
(XEN) grant_table.c:266:d0 Bad flags (0) or dom (0). (expected dom 0)
(XEN) printk: 7 messages suppressed.
(XEN) grant_table.c:266:d0 Bad flags (0) or dom (0). (expected dom 0)
(XEN) printk: 1 messages suppressed.
(XEN) grant_table.c:266:d0 Bad flags (0) or dom (0). (expected dom 0)
(XEN) grant_table.c:1717:d0 Bad grant reference 4294901760
(XEN) printk: 13 messages suppressed.
(XEN) grant_table.c:1717:d0 Bad grant reference 4294901760
(XEN) printk: 59 messages suppressed.
(XEN) grant_table.c:1717:d0 Bad grant reference 4294901760
(XEN) printk: 81 messages suppressed.
(XEN) grant_table.c:1717:d0 Bad grant reference 4294901760
(XEN) printk: 75 messages suppressed.
(XEN) grant_table.c:1717:d0 Bad grant reference 4294901760
(XEN) printk: 79 messages suppressed.
(XEN) grant_table.c:1717:d0 Bad grant reference 4294901760
(XEN) printk: 81 messages suppressed.
(XEN) grant_table.c:1717:d0 Bad grant reference 4294901760
(XEN) printk: 33 messages suppressed.
(XEN) grant_table.c:1717:d0 Bad grant reference 4294901760
(XEN) printk: 9 messages suppressed.
(XEN) grant_table.c:1717:d0 Bad grant reference 4294901765
(XEN) printk: 7 messages suppressed.
(XEN) grant_table.c:1717:d0 Bad grant reference 4294901760
(XEN) printk: 1 messages suppressed.
(XEN) grant_table.c:1717:d0 Bad grant reference 4294901760
(XEN) grant_table.c:1717:d0 Bad grant reference 4294901765
(XEN) grant_table.c:1717:d0 Bad grant reference 4294901760
(XEN) grant_table.c:1717:d0 Bad grant reference 4294901765
(XEN) grant_table.c:1717:d0 Bad grant reference 4294901760
(XEN) grant_table.c:1717:d0 Bad grant reference 4294901765
(XEN) grant_table.c:1717:d0 Bad grant reference 4294901760
(XEN) grant_table.c:1717:d0 Bad grant reference 4294901765
(XEN) printk: 10 messages suppressed.
(XEN) mm.c:2364:d0 Bad type (saw 7400000000000001 != exp 4000 0000 0000 0000) for mfn 898a41 (pfn 9ca41)
(XEN) mm.c:2733:d0 Error while pinning mfn 898a41
                                 8000000000000000 
(XEN) mm.c:2364:d0 Bad type (saw 7400000000000001 != exp 4000000000000000) for mfn 871443 (pfn 75443)
(XEN) mm.c:2733:d0 Error while pinning mfn 871443
(XEN) mm.c:2364:d0 Bad type (saw 7400000000000001 != exp 4000000000000000) for mfn 898a41 (pfn 9ca41)
(XEN) mm.c:2500:d0 Error while installing new baseptr 898a41
(XEN) mm.c:2364:d0 Bad type (saw 7400000000000001 != exp 4000000000000000) for mfn 871443 (pfn 75443)
(XEN) mm.c:2825:d0 Error while installing new mfn 871443
(XEN) mm.c:2364:d0 Bad type (saw 4400000000000001 != exp 7000000000000000) for mfn 899551 (pfn 9d551)
(XEN) mm.c:860:d0 Error getting mfn 899551 (pfn 9d551) from L1 entry 8000000899551063 for l1e_owner=0, pg_owner=0


* RE: kernel BUG at arch/x86/xen/mmu.c:1860!
  2011-04-08 11:24 ` kernel BUG at arch/x86/xen/mmu.c:1860! MaoXiaoyun
@ 2011-04-08 11:46   ` MaoXiaoyun
  2011-04-10  3:57   ` kernel BUG at arch/x86/xen/mmu.c:1872 MaoXiaoyun
  2011-04-10  4:29   ` MaoXiaoyun
  2 siblings, 0 replies; 41+ messages in thread
From: MaoXiaoyun @ 2011-04-08 11:46 UTC (permalink / raw)
  To: xen devel; +Cc: jeremy, dave, giamteckchoon, ian.campbell, konrad.wilk


[-- Attachment #1.1: Type: text/plain, Size: 3144 bytes --]


Hi:
 
     Going through the code against the log, I noticed that the message:
(XEN) grant_table.c:266:d0 Bad flags (0) or dom (0). (expected dom 0)
 
comes from xen/common/grant_table.c:266, which is in the function _set_status_v1(),
so it looks like kernel 2.6.32 uses grant table version 1.
 
Meanwhile, in 2.6.31's drivers/xen/grant-table.c I noticed the function gnttab_request_version(),
which suggests that 2.6.31 requests grant table version 2. This function cannot be found
in 2.6.32.
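 
(For reference, where a kernel negotiates the grant table version at all, it is a single grant-table hypercall. The sketch below is only an illustration of what such a gnttab_request_version() could look like, built on GNTTABOP_set_version / struct gnttab_set_version from the Xen public headers; the function body itself is an assumption, not a quote from either tree.)

#include <xen/interface/grant_table.h> /* GNTTABOP_set_version,
                                          struct gnttab_set_version   */
#include <asm/xen/hypercall.h>         /* HYPERVISOR_grant_table_op() */

static int grant_table_version = 1;    /* default to the v1 layout */

/* Illustrative sketch: ask the hypervisor for grant table v2 and
 * fall back to v1 if the request is rejected or unsupported. */
static void gnttab_request_version(void)
{
        struct gnttab_set_version gsv = { .version = 2 };

        if (HYPERVISOR_grant_table_op(GNTTABOP_set_version, &gsv, 1) == 0 &&
            gsv.version == 2)
                grant_table_version = 2;   /* hypervisor accepted v2 */
        else
                grant_table_version = 1;   /* stay on the v1 entries */
}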
 
Is this correct?
 
Thanks.
 
 
>--------------------------------------------------------------------------------
>From: tinnycloud@hotmail.com
>To: xen-devel@lists.xensource.com
>CC: dave@ivt.com.au; ian.campbell@citrix.com; giamteckchoon@gmail.com; konrad.wilk@oracle.com; jeremy@goop.org
>Subject: Re: kernel BUG at arch/x86/xen/mmu.c:1860!
>Date: Fri, 8 Apr 2011 19:24:35 +0800
>
>Hi: 
>     Unfortunately I met the exactly same bug today. With pvops kernel 2.6.32.36, and xen 4.0.1.
>     Kernel Panic and serial log attached. 
> 
>     Our test cases is quite simple, on a single physical host, we start 12 HVMS(windows 2003),
>each of the HVM reboot every 10minutes. 
> 
>     The bug is easy to hit on our 48G machine(in hours). But We haven't hit the bug in our 24G 
>machine(we have three 24G machine, all works fine.)  -----Is is possible related to Memory capacity?
> 
>Taking a look at the serial output,  the Dom0 code is attempting to pin what it thins 
>is a "PGT_l3_page_table", however the hypervisor returns -EINVAL because it actually  is a "PGT_writable_page". 
> 
>(XEN) mm.c:2364:d0 Bad type (saw 7400000000000001 != exp 4000 0000 0000 0000) for mfn 898a41 (pfn 9ca41)
>(XEN) mm.c:2733:d0 Error while pinning mfn 898a41
> 
>And  before that quite a lot abnormal grant table log like :
> 
>(XEN) grant_table.c:1717:d0 Bad grant reference 4294965983
>(XEN) grant_table.c:266:d0 Bad flags (0) or dom (0). (expected dom 0)
>(XEN) grant_table.c:266:d0 Bad flags (0) or dom (0). (expected dom 0)
>(XEN) grant_table.c:266:d0 Bad flags (0) or dom (0). (expected dom 0)
>(XEN) grant_table.c:1717:d0 Bad grant reference 4294965888
>(XEN) grant_table.c:1717:d0 Bad grant reference 4294965983
> 
>It looks like something wrong with grant table.
> 
>Many thanks.
> 
>> From: Jeremy Fitzhardinge <jeremy@goop.org>
>> Subject: Re: [Xen-devel] [SPAM] Re: kernel BUG at
>> arch/x86/xen/mmu.c:1860! - ideas.
>> To: Ian Campbell <Ian.Campbell@citrix.com>
>> Cc: Dave Hunter <dave@ivt.com.au>, Teck Choon Giam
>> <giamteckchoon@gmail.com>, "xen-devel@lists.xensource.com"
>> xen-devel@lists.xensource.com
> 
>> On 04/06/2011 12:53 AM, Ian Campbell wrote:
>> > Please don't top post.
>> >
>> > On Wed, 2011-04-06 at 00:20 +0100, Dave Hunter wrote:
>> >> Is it likely that Debian would release an updated kernel in squeeze with
>> >> this configuration? (sorry, this might not be the place to ask).
>> > I doubt they will, enabling DEBUG_PAGEALLOC seems very much like a
>> > workaround not a solution to me.
>> 
>> Yes, it will impose a pretty large performance overhead.
>> 
>> J
>> 
>> 
> 		 	   		  

[-- Attachment #1.2: Type: text/html, Size: 4560 bytes --]


* kernel BUG at arch/x86/xen/mmu.c:1872
  2011-04-08 11:24 ` kernel BUG at arch/x86/xen/mmu.c:1860! MaoXiaoyun
  2011-04-08 11:46   ` MaoXiaoyun
@ 2011-04-10  3:57   ` MaoXiaoyun
  2011-04-10  4:29   ` MaoXiaoyun
  2 siblings, 0 replies; 41+ messages in thread
From: MaoXiaoyun @ 2011-04-10  3:57 UTC (permalink / raw)
  To: xen devel; +Cc: jeremy, dave, giamteckchoon, ian.campbell, konrad.wilk


[-- Attachment #1.1: Type: text/plain, Size: 537 bytes --]


Hi Konrad & Jeremy:
 
     I'd like to open this BUG in a new thread, since the old thread has become too long to read.
     
     We recently wanted to upgrade our kernel to 2.6.32, but unfortunately we ran into a kernel crash bug.
Our test case is simple: start 24 Win2003 HVMs on our physical machine, with each HVM rebooting
every 15 minutes. The kernel crashes within half an hour (that is, it crashes when the VMs start for the second time).
 
Our testing went much further.
We tested different kernel versions.
2.6.32.10
2.6.32.10
2.6.32.10
 
      		 	   		  

[-- Attachment #1.2: Type: text/html, Size: 900 bytes --]


* kernel BUG at arch/x86/xen/mmu.c:1872
  2011-04-08 11:24 ` kernel BUG at arch/x86/xen/mmu.c:1860! MaoXiaoyun
  2011-04-08 11:46   ` MaoXiaoyun
  2011-04-10  3:57   ` kernel BUG at arch/x86/xen/mmu.c:1872 MaoXiaoyun
@ 2011-04-10  4:29   ` MaoXiaoyun
  2011-04-10 13:57     ` MaoXiaoyun
  2 siblings, 1 reply; 41+ messages in thread
From: MaoXiaoyun @ 2011-04-10  4:29 UTC (permalink / raw)
  To: xen devel; +Cc: jeremy, dave, giamteckchoon, ian.campbell, konrad.wilk


[-- Attachment #1.1: Type: text/plain, Size: 2514 bytes --]


(Please ignore my last mail; it was sent by mistake.)
 
Hi Konrad & Jeremy:
 
     I'd like to open this BUG in a new thread, since the old thread has become too long to read easily.
     
     We recently wanted to upgrade our kernel to 2.6.32, but unfortunately we ran into a kernel crash bug.
Our test case is simple: start 24 Win2003 HVMs on our physical machine, with each HVM rebooting
every 15 minutes. The kernel crashes within half an hour (that is, it crashes when the VMs start for the second time).
 
Our testing went much further:
we tested the following kernel versions.
2.6.32.10  http://git.kernel.org/?p=linux/kernel/git/jeremy/xen.git;a=commit;h=d945b014ac5df9592c478bf9486d97e8914aab59
2.6.32.11  http://git.kernel.org/?p=linux/kernel/git/jeremy/xen.git;a=commit;h=27f948a3bf365a5bc3d56119637a177d41147815
2.6.32.12  http://git.kernel.org/?p=linux/kernel/git/jeremy/xen.git;a=commit;h=ba739f9abd3f659b907a824af1161926b420a2ce
2.6.32.13  http://git.kernel.org/?p=linux/kernel/git/jeremy/xen.git;a=commit;h=f6fe6583b77a49b569eef1b66c3d761eec2e561b
2.6.32.15  http://git.kernel.org/?p=linux/kernel/git/jeremy/xen.git;a=commit;h=27ed1b0e0dae5f1d5da5c76451bc84cb529128bd
2.6.32.21  http://git.kernel.org/?p=linux/kernel/git/jeremy/xen.git;a=commit;h=69e50db231723596ed8ef9275d0068d6697f466a
 
There are basically three different results we observed.
 
i1) grant table issue
The host still functions, but "xm dmesg" shows abnormal log entries;
please refer to the attached grant table log.
 
i2) kernel crash in a different place
The host dies during the test; after reboot we see nothing abnormal in /var/log/messages.
 
i3) kernel BUG at arch/x86/xen/mmu.c:1872
The host dies during the test; after reboot we see the crash log in /var/log/messages (refer to the attached 2.6.32.36 log).
 
The test results can be summarized in two groups:
 
1) 2.6.32.10
30 machines were involved in the test: three hit issue (i1), two hit issue (i2), and *none* hit issue (i3).
The other machines have run the tests successfully so far, for more than 8 hours.
 
2) 2.6.32.11 or later versions
Each version was tested on 10 machines, and all of them crashed in less than half an hour.
 
Conclusions:
1) The grant table issue exists in all kernel versions.
2) Kernel crashes in other places may exist in all kernel versions, but happen much less frequently (2 out of 30).
3) The major difference is issue (i3); from the tests it looks like it was introduced between
2.6.32.10 and 2.6.32.11.
 
Hope this helps to locate the bug.
Many thanks.
 
      		 	   		  

[-- Attachment #1.2: Type: text/html, Size: 4312 bytes --]

[-- Attachment #2: kernel_crash_at_different_place.txt --]
[-- Type: text/plain, Size: 20116 bytes --]


=========================crash log for machine one in 2.6.32.10 ================================================

INIT: Id "s0" respawning too fast: disabled for 5 minutes
blktap_sysfs_destroy
blktap_sysfs_create: adding attributes for dev ffff8800b7a62600
blktap_sysfs_destroy
blktap_sysfs_create: adding attributes for dev ffff8800abb3d200
blktap_sysfs_destroy
blktap_sysfs_create: adding attributes for dev ffff8800ab3a3000
blktap_sysfs_destroy
blktap_sysfs_create: adding attributes for dev ffff8800a7ef8e00
blktap_sysfs_destroy
blktap_sysfs_create: adding attributes for dev ffff8800bd224e00
blktap_sysfs_destroy
blktap_sysfs_create: adding attributes for dev ffff8800bf09f400
blktap_sysfs_destroy
blktap_sysfs_destroy
blktap_sysfs_create: adding attributes for dev ffff8800a96b3c00
blktap_sysfs_create: adding attributes for dev ffff8800bf09ec00
INIT: Id "s0" respawning too fast: disabled for 5 minutes
blktap_sysfs_destroy
blktap_sysfs_create: adding attributes for dev ffff8800b8daae00
blktap_sysfs_destroy
blktap_sysfs_create: adding attributes for dev ffff8800b0ea5400
blktap_sysfs_destroy
blktap_sysfs_create: adding attributes for dev ffff8800b8dab200
blktap_sysfs_destroy
blktap_sysfs_create: adding attributes for dev ffff8800b8daa600
INIT: Id "s0" respawning too fast: disabled for 5 minutes
blktap_sysfs_destroy
blktap_sysfs_create: adding attributes for dev ffff8800ab933200
blktap_sysfs_destroy
blktap_sysfs_create: adding attributes for dev ffff8800b6514000
BUG: scheduling while atomic: swapper/0/0x10000100
Modules linked in: 8021q garp xen_netback xen_blkback blktap blkback_pagemap nbd bridge stp llc autofs4 ipmi_devintf ipmi_si ipmi_msghandler lockd sunrpc bonding ipv6 xenfs dm_multipath video output sbs sbshc parport_pc lp parport ses enclosure serio_raw bnx2 snd_seq_dummy snd_seq_oss snd_seq_midi_event snd_seq snd_seq_device snd_pcm_oss snd_mixer_oss snd_pcm snd_timer snd pata_acpi soundcore ata_generic snd_page_alloc pcspkr iTCO_wdt iTCO_vendor_support i2c_i801 i2c_core ata_piix shpchp mptsas mptscsih mptbase [last unloaded: freq_table]
CPU 0:
Modules linked in: 8021q garp xen_netback xen_blkback blktap blkback_pagemap nbd bridge stp llc autofs4 ipmi_devintf ipmi_si ipmi_msghandler lockd sunrpc bonding ipv6 xenfs dm_multipath video output sbs sbshc parport_pc lp parport ses enclosure serio_raw bnx2 snd_seq_dummy snd_seq_oss snd_seq_midi_event snd_seq snd_seq_device snd_pcm_oss snd_mixer_oss snd_pcm snd_timer snd pata_acpi soundcore ata_generic snd_page_alloc pcspkr iTCO_wdt iTCO_vendor_support i2c_i801 i2c_core ata_piix shpchp mptsas mptscsih mptbase [last unloaded: freq_table]
Pid: 0, comm: swapper Not tainted 2.6.32.10xen #1 Tecal RH2285          
RIP: e030:[<ffffffff810093aa>]  [<ffffffff810093aa>] hypercall_page+0x3aa/0x1000
RSP: e02b:ffffffff81663ed8  EFLAGS: 00000246
RAX: 0000000000000000 RBX: ffffffff81662000 RCX: ffffffff810093aa
RDX: ffffffff8100f23f RSI: 0000000000000000 RDI: 0000000000000001
RBP: ffffffff81663ef0 R08: 0000000000000000 R09: ffff880028092e08
R10: ffff880159558000 R11: 0000000000000246 R12: 0000000000000000
R13: 6db6db6db6db6db7 R14: ffffffff817d73a0 R15: 0000000000000000
FS:  00007f50215db6e0(0000) GS:ffff88002802c000(0000) knlGS:0000000000000000
CS:  e033 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 00007f449672d000 CR3: 00000000a7b0a000 CR4: 0000000000002660
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Call Trace:
 [<ffffffff8100ebbf>] ? xen_safe_halt+0x10/0x1a
 [<ffffffff8100c102>] xen_idle+0x3c/0x46
 [<ffffffff81010cbd>] cpu_idle+0x5d/0x8c
 [<ffffffff81426742>] rest_init+0x66/0x68
 [<ffffffff8179fd8d>] start_kernel+0x3ef/0x3fb
 [<ffffffff8179f2c3>] x86_64_start_reservations+0xae/0xb2
 [<ffffffff817a2cb3>] xen_start_kernel+0x4c0/0x4c7
divide error: 0000 [#1] SMP 
last sysfs file: /sys/class/net/d342dd/address
CPU 0 
Modules linked in: 8021q garp xen_netback xen_blkback blktap blkback_pagemap nbd bridge stp llc autofs4 ipmi_devintf ipmi_si ipmi_msghandler lockd sunrpc bonding ipv6 xenfs dm_multipath video output sbs sbshc parport_pc lp parport ses enclosure serio_raw bnx2 snd_seq_dummy snd_seq_oss snd_seq_midi_event snd_seq snd_seq_device snd_pcm_oss snd_mixer_oss snd_pcm snd_timer snd pata_acpi soundcore ata_generic snd_page_alloc pcspkr iTCO_wdt iTCO_vendor_support i2c_i801 i2c_core ata_piix shpchp mptsas mptscsih mptbase [last unloaded: freq_table]
Pid: 0, comm: swapper Not tainted 2.6.32.10xen #1 Tecal RH2285          
RIP: e030:[<ffffffff8104ee57>]  [<ffffffff8104ee57>] find_busiest_group+0x37d/0x721
RSP: e02b:ffff88002802fc90  EFLAGS: 00010246
RAX: 0000000000003c00 RBX: 0000000000000000 RCX: ffff880028041501
RDX: 0000000000000000 RSI: 0000000000000040 RDI: 0000000000000040
RBP: ffff88002802fdf0 R08: 0000000000000000 R09: ffff88002803be08
R10: 0000000000000000 R11: 0000000000000001 R12: 0000000000000040
R13: ffff88002803bdf0 R14: ffff88002803bce0 R15: 0000000000000000
FS:  00007f50215db6e0(0000) GS:ffff88002802c000(0000) knlGS:0000000000000000
CS:  e033 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 00007f449672d000 CR3: 00000000a7b0a000 CR4: 0000000000002660
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process swapper (pid: 0, threadinfo ffffffff81662000, task ffffffff816e8980)
Stack:
 00001a0e95f9ec56 ffff88002803ba48 ffff88002802fe4c 000000008100eb79
<0> ffff88002802fe40 000000008100f252 0000000000003c00 0000000000000001
<0> ffff88002803be00 0000000100000010 ffff88002803bdf0 ffffffff8102edf9
Call Trace:
 <IRQ> 
 [<ffffffff8102edf9>] ? pvclock_clocksource_read+0x47/0x80
 [<ffffffff8100f23f>] ? xen_restore_fl_direct_end+0x0/0x1
 [<ffffffff8104fc43>] rebalance_domains+0x17b/0x45b
 [<ffffffff810480f2>] ? wake_up_process+0x15/0x17
 [<ffffffff8104ff63>] run_rebalance_domains+0x40/0xc5
 [<ffffffff81059b9b>] __do_softirq+0xd2/0x194
 [<ffffffff81012eac>] call_softirq+0x1c/0x30
 [<ffffffff81014627>] do_softirq+0x46/0x87
 [<ffffffff81059c98>] irq_exit+0x3b/0x7a
 [<ffffffff812868ab>] xen_evtchn_do_upcall+0x156/0x172
 [<ffffffff81012efe>] xen_do_hypervisor_callback+0x1e/0x30
 <EOI> 
 [<ffffffff810093aa>] ? hypercall_page+0x3aa/0x1000
 [<ffffffff8100f23f>] ? xen_restore_fl_direct_end+0x0/0x1
 [<ffffffff810093aa>] ? hypercall_page+0x3aa/0x1000
 [<ffffffff8100ebbf>] ? xen_safe_halt+0x10/0x1a
 [<ffffffff8100c102>] ? xen_idle+0x3c/0x46
 [<ffffffff81010cbd>] ? cpu_idle+0x5d/0x8c
 [<ffffffff81426742>] ? rest_init+0x66/0x68
 [<ffffffff8179fd8d>] ? start_kernel+0x3ef/0x3fb
 [<ffffffff8179f2c3>] ? x86_64_start_reservations+0xae/0xb2
 [<ffffffff817a2cb3>] ? xen_start_kernel+0x4c0/0x4c7
Code: 83 7d 10 00 74 0c 48 8b 5d 10 c7 03 00 00 00 00 eb 70 41 8b 55 08 48 8b 45 a8 48 89 d3 48 c1 a5 d0 fe ff ff 0a 48 c1 e0 0a 31 d2 <48> f7 f3 48 89 45 a0 48 8b 85 08 ff ff ff 48 29 85 00 ff ff ff 
RIP  [<ffffffff8104ee57>] find_busiest_group+0x37d/0x721
 RSP <ffff88002802fc90>
---[ end trace 7c3e3b64ca341f0a ]---
divide error: 0000 [#2] SMP 
last sysfs file: /sys/class/net/d342dd/address
CPU 2 
Modules linked in: 8021q garp xen_netback xen_blkback blktap blkback_pagemap nbd bridge stp llc autofs4 ipmi_devintf ipmi_si ipmi_msghandler lockd sunrpc bonding ipv6 xenfs dm_multipath video output sbs sbshc parport_pc lp parport ses enclosure serio_raw bnx2 snd_seq_dummy snd_seq_oss snd_seq_midi_event snd_seq snd_seq_device snd_pcm_oss snd_mixer_oss snd_pcm snd_timer snd pata_acpi soundcore ata_generic snd_page_alloc pcspkr iTCO_wdt iTCO_vendor_support i2c_i801 i2c_core ata_piix shpchp mptsas mptscsih mptbase [last unloaded: freq_table]
Pid: 0, comm: swapper Tainted: G      D    2.6.32.10xen #1 Tecal RH2285          
RIP: e030:[<ffffffff8104ee57>]  [<ffffffff8104ee57>] find_busiest_group+0x37d/0x721
RSP: e02b:ffff880028069c90  EFLAGS: 00010246
RAX: 0000000000000000 RBX: 0000000000000000 RCX: ffff880028041500
RDX: 0000000000000000 RSI: 0000000000000040 RDI: 0000000000000040
RBP: ffff880028069df0 R08: 0000000000000000 R09: ffff88002803be08
R10: ffff880028069d28 R11: ffff88015f8f7e48 R12: 0000000000000040
R13: ffff88002803bdf0 R14: ffff880028058ce0 R15: 0000000000000001
FS:  00007faad3ba8730(0000) GS:ffff880028066000(0000) knlGS:0000000000000000
CS:  e033 DS: 002b ES: 002b CR0: 000000008005003b
CR2: 00007fd1e7ed9000 CR3: 000000015510a000 CR4: 0000000000002660
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process swapper (pid: 0, threadinfo ffff88015f8f6000, task ffff88015f8e4410)
Stack:
 00001a0e95fdb97b ffff880028075a48 ffff880028069e4c 000000008100eb79
<0> ffff880028069e40 000000008100f252 0000000000003c00 ffffffff00000000
<0> ffff88002803be00 0000000100000010 ffff880028058df0 0000000000000002
Call Trace:
 <IRQ> 
 [<ffffffff8107e36d>] ? tick_dev_program_event+0x2f/0xa1
 [<ffffffff8104fc43>] rebalance_domains+0x17b/0x45b
 [<ffffffff8104ff98>] run_rebalance_domains+0x75/0xc5
 [<ffffffff81059b9b>] __do_softirq+0xd2/0x194
 [<ffffffff81012eac>] call_softirq+0x1c/0x30
 [<ffffffff81014627>] do_softirq+0x46/0x87
 [<ffffffff81059c98>] irq_exit+0x3b/0x7a
 [<ffffffff812868ab>] xen_evtchn_do_upcall+0x156/0x172
 [<ffffffff81012efe>] xen_do_hypervisor_callback+0x1e/0x30
 <EOI> 
 [<ffffffff810093aa>] ? hypercall_page+0x3aa/0x1000
 [<ffffffff810093aa>] ? hypercall_page+0x3aa/0x1000
 [<ffffffff8100ebbf>] ? xen_safe_halt+0x10/0x1a
 [<ffffffff8100c102>] ? xen_idle+0x3c/0x46
 [<ffffffff81010cbd>] ? cpu_idle+0x5d/0x8c
 [<ffffffff81432d68>] ? cpu_bringup_and_idle+0x13/0x15
Code: 83 7d 10 00 74 0c 48 8b 5d 10 c7 03 00 00 00 00 eb 70 41 8b 55 08 48 8b 45 a8 48 89 d3 48 c1 a5 d0 fe ff ff 0a 48 c1 e0 0a 31 d2 <48> f7 f3 48 89 45 a0 48 8b 85 08 ff ff ff 48 29 85 00 ff ff ff 
RIP  [<ffffffff8104ee57>] find_busiest_group+0x37d/0x721
 RSP <ffff880028069c90>
---[ end trace 7c3e3b64ca341f0b ]---
Kernel panic - not syncing: Fatal exception in interrupt
Pid: 0, comm: swapper Tainted: G      D    2.6.32.10xen #1
Call Trace:
 <IRQ>  [<ffffffff810402a5>] ? ftrace_profile_enable_sched_process_exit+0x10/0x17
 [<ffffffff81052eea>] panic+0xe0/0x198
 [<ffffffff8100eb79>] ? xen_force_evtchn_callback+0xd/0xf
 [<ffffffff8100f252>] ? check_events+0x12/0x20
 [<ffffffff8100f23f>] ? xen_restore_fl_direct_end+0x0/0x1
 [<ffffffff810402a5>] ? ftrace_profile_enable_sched_process_exit+0x10/0x17
 [<ffffffff81052b43>] ? print_oops_end_marker+0x23/0x25
 [<ffffffff810402a5>] ? ftrace_profile_enable_sched_process_exit+0x10/0x17
 [<ffffffff8143d2b5>] oops_end+0xb6/0xc6
 [<ffffffff810156c1>] die+0x5a/0x63
 [<ffffffff8143cb8c>] do_trap+0x115/0x124
 [<ffffffff81013610>] do_divide_error+0x96/0x9f
 [<ffffffff8104ee57>] ? find_busiest_group+0x37d/0x721
 [<ffffffff8100f23f>] ? xen_restore_fl_direct_end+0x0/0x1
 [<ffffffff8143c3da>] ? _spin_unlock_irqrestore+0x15/0x17
 [<ffffffff813937a0>] ? skb_release_data+0xab/0xb0
 [<ffffffff8100eb79>] ? xen_force_evtchn_callback+0xd/0xf
 [<ffffffff8100f252>] ? check_events+0x12/0x20
 [<ffffffff81012adb>] divide_error+0x1b/0x20
 [<ffffffff8104ee57>] ? find_busiest_group+0x37d/0x721
 [<ffffffff8107e36d>] ? tick_dev_program_event+0x2f/0xa1
 [<ffffffff8104fc43>] rebalance_domains+0x17b/0x45b
 [<ffffffff8104ff98>] run_rebalance_domains+0x75/0xc5
 [<ffffffff81059b9b>] __do_softirq+0xd2/0x194
 [<ffffffff81012eac>] call_softirq+0x1c/0x30
 [<ffffffff81014627>] do_softirq+0x46/0x87
 [<ffffffff81059c98>] irq_exit+0x3b/0x7a
 [<ffffffff812868ab>] xen_evtchn_do_upcall+0x156/0x172
 [<ffffffff81012efe>] xen_do_hypervisor_callback+0x1e/0x30
 <EOI>  [<ffffffff810093aa>] ? hypercall_page+0x3aa/0x1000
 [<ffffffff810093aa>] ? hypercall_page+0x3aa/0x1000
 [<ffffffff8100ebbf>] ? xen_safe_halt+0x10/0x1a
 [<ffffffff8100c102>] ? xen_idle+0x3c/0x46
 [<ffffffff81010cbd>] ? cpu_idle+0x5d/0x8c
 [<ffffffff81432d68>] ? cpu_bringup_and_idle+0x13/0x15
Kernel panic - not syncing: Fatal exception in interrupt
Pid: 0, comm: swapper Tainted: G      D    2.6.32.10xen #1
Call Trace:
 <IRQ>  [<ffffffff81052eea>] panic+0xe0/0x198
 [<ffffffff814300b2>] ? megaraid_probe_one+0xf25/0x116d
 [<ffffffff8100eb79>] ? xen_force_evtchn_callback+0xd/0xf
 [<ffffffff8100f252>] ? check_events+0x12/0x20
 [<ffffffff8100f23f>] ? xen_restore_fl_direct_end+0x0/0x1
 [<ffffffff81052b43>] ? print_oops_end_marker+0x23/0x25
 [<ffffffff8143d2b5>] oops_end+0xb6/0xc6
 [<ffffffff810156c1>] die+0x5a/0x63
 [<ffffffff8143cb8c>] do_trap+0x115/0x124
 [<ffffffff81013610>] do_divide_error+0x96/0x9f
 [<ffffffff8104ee57>] ? find_busiest_group+0x37d/0x721
 [<ffffffff813934b9>] ? __kfree_skb+0x79/0x7d
 [<ffffffff8100f23f>] ? xen_restore_fl_direct_end+0x0/0x1
 [<ffffffff810fd88b>] ? kmem_cache_free+0x88/0xbb
 [<ffffffff813934b9>] ? __kfree_skb+0x79/0x7d
 [<ffffffff81012adb>] divide_error+0x1b/0x20
 [<ffffffff8104ee57>] ? find_busiest_group+0x37d/0x721
 [<ffffffff8102edf9>] ? pvclock_clocksource_read+0x47/0x80
 [<ffffffff8100f23f>] ? xen_restore_fl_direct_end+0x0/0x1
 [<ffffffff8104fc43>] rebalance_domains+0x17b/0x45b
 [<ffffffff810480f2>] ? wake_up_process+0x15/0x17
 [<ffffffff8104ff63>] run_rebalance_domains+0x40/0xc5
 [<ffffffff81059b9b>] __do_softirq+0xd2/0x194
 [<ffffffff81012eac>] call_softirq+0x1c/0x30
 [<ffffffff81014627>] do_softirq+0x46/0x87
 [<ffffffff81059c98>] irq_exit+0x3b/0x7a
 [<ffffffff812868ab>] xen_evtchn_do_upcall+0x156/0x172
 [<ffffffff81012efe>] xen_do_hypervisor_callback+0x1e/0x30
 <EOI>  [<ffffffff810093aa>] ? hypercall_page+0x3aa/0x1000
 [<ffffffff8100f23f>] ? xen_restore_fl_direct_end+0x0/0x1
 [<ffffffff810093aa>] ? hypercall_page+0x3aa/0x1000
 [<ffffffff8100ebbf>] ? xen_safe_halt+0x10/0x1a
 [<ffffffff8100c102>] ? xen_idle+0x3c/0x46
 [<ffffffff81010cbd>] ? cpu_idle+0x5d/0x8c
 [<ffffffff81426742>] ? rest_init+0x66/0x68
 [<ffffffff8179fd8d>] ? start_kernel+0x3ef/0x3fb
 [<ffffffff8179f2c3>] ? x86_64_start_reservations+0xae/0xb2
 [<ffffffff817a2cb3>] ? xen_start_kernel+0x4c0/0x4c7










=========================crash log for machine two in 2.6.32.10 ================================================






blktap_sysfs_create: adding attributes for dev ffff88010b668c00
__ratelimit: 4 callbacks suppressed
blktap_sysfs_create: adding attributes for dev ffff8800bd384200
__ratelimit: 6 callbacks suppressed
INIT: Id "s0" respawning too fast: disabled for 5 minutes
blktap_sysfs_destroy
blktap_sysfs_create: adding attributes for dev ffff8800bf07d800
INIT: Id "s0" respawning too fast: disabled for 5 minutes
divide error: 0000 [#1] SMP 
last sysfs file: /sys/hypervisor/type
CPU 2 
Modules linked in: 8021q garp xen_netback xen_blkback blktap blkback_pagemap nbd bridge stp llc autofs4 ipmi_devintf ipmi_si ipmi_msghandler lockd sunrpc bonding ipv6 xenfs dm_multipath video output sbs sbshc parport_pc lp parport ses enclosure snd_seq_dummy snd_seq_oss snd_seq_midi_event snd_seq snd_seq_device snd_pcm_oss snd_mixer_oss bnx2 snd_pcm serio_raw snd_timer snd soundcore snd_page_alloc i2c_i801 i2c_core pcspkr pata_acpi ata_generic iTCO_wdt iTCO_vendor_support ata_piix shpchp mptsas mptscsih mptbase [last unloaded: freq_table]
Pid: 19632, comm: xenstore-list Not tainted 2.6.32.10xen #1 Tecal RH2285          
RIP: e030:[<ffffffff8104ee57>]  [<ffffffff8104ee57>] find_busiest_group+0x37d/0x721
RSP: e02b:ffff8800b52a1c08  EFLAGS: 00010046
RAX: 0000000000000000 RBX: 0000000000000000 RCX: ffff88002807b501
RDX: 0000000000000000 RSI: 0000000000000040 RDI: 0000000000000040
RBP: ffff8800b52a1d68 R08: 0000000000000000 R09: ffff880028075e08
R10: ffff8800bf0cf6c0 R11: ffff8800b52a1dd8 R12: 0000000000000040
R13: ffff880028075df0 R14: ffff880028075ce0 R15: 0000000000000002
FS:  00007fcb228146e0(0000) GS:ffff880028066000(0000) knlGS:0000000000000000
CS:  e033 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 0000003581d180e0 CR3: 00000000b83b5000 CR4: 0000000000002660
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process xenstore-list (pid: 19632, threadinfo ffff8800b52a0000, task ffff88010c080000)
Stack:
 ffff8800b52a1c20 ffff880028075a48 ffff8800b52a1dbc 00000002b52a1c30
<0> ffff8800b52a1db0 0000000000000004 0000000000000000 0000000200000000
<0> ffff880028075e00 0000000000000001 ffff880028075df0 ffffffff2807b5c0
Call Trace:
 [<ffffffff8143a805>] schedule+0x27a/0x736
 [<ffffffff8143c3da>] ? _spin_unlock_irqrestore+0x15/0x17
 [<ffffffff81288528>] read_reply+0x86/0x104
 [<ffffffff81071e82>] ? autoremove_wake_function+0x0/0x3d
 [<ffffffff8143ae0e>] ? _cond_resched+0xe/0x22
 [<ffffffff81288639>] xenbus_dev_request_and_reply+0x58/0x89
 [<ffffffffa011554e>] xenbus_file_write+0x16a/0x469 [xenfs]
 [<ffffffff81109671>] vfs_write+0xb0/0x10a
 [<ffffffff8110a3ab>] sys_write+0x4c/0x72
 [<ffffffff81011d72>] system_call_fastpath+0x16/0x1b
Code: 83 7d 10 00 74 0c 48 8b 5d 10 c7 03 00 00 00 00 eb 70 41 8b 55 08 48 8b 45 a8 48 89 d3 48 c1 a5 d0 fe ff ff 0a 48 c1 e0 0a 31 d2 <48> f7 f3 48 89 45 a0 48 8b 85 08 ff ff ff 48 29 85 00 ff ff ff 
RIP  [<ffffffff8104ee57>] find_busiest_group+0x37d/0x721
 RSP <ffff8800b52a1c08>
---[ end trace 13509d88f5b8918c ]---
divide error: 0000 [#2] SMP 
last sysfs file: /sys/hypervisor/type
CPU 1 
Modules linked in: 8021q garp xen_netback xen_blkback blktap blkback_pagemap nbd bridge stp llc autofs4 ipmi_devintf ipmi_si ipmi_msghandler lockd sunrpc bonding ipv6 xenfs dm_multipath video output sbs sbshc parport_pc lp parport ses enclosure snd_seq_dummy snd_seq_oss snd_seq_midi_event snd_seq snd_seq_device snd_pcm_oss snd_mixer_oss bnx2 snd_pcm serio_raw snd_timer snd soundcore snd_page_alloc i2c_i801 i2c_core pcspkr pata_acpi ata_generic iTCO_wdt iTCO_vendor_support ata_piix shpchp mptsas mptscsih mptbase [last unloaded: freq_table]
Pid: 3429, comm: kipmi0 Tainted: G      D    2.6.32.10xen #1 Tecal RH2285          
RIP: e030:[<ffffffff8104ee57>]  [<ffffffff8104ee57>] find_busiest_group+0x37d/0x721
RSP: e02b:ffff880155e2dc70  EFLAGS: 00010046
RAX: 0000000000000000 RBX: 0000000000000000 RCX: ffff88002807b500
RDX: 0000000000000000 RSI: 0000000000000040 RDI: 0000000000000040
RBP: ffff880155e2ddd0 R08: 0000000000000000 R09: ffff880028075e08
R10: ffff880158339400 R11: ffff880028040000 R12: 0000000000000040
R13: ffff880028075df0 R14: ffff880028058ce0 R15: 0000000000000001
FS:  00007fde594896e0(0000) GS:ffff880028049000(0000) knlGS:0000000000000000
CS:  e033 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 0000000000469000 CR3: 0000000001001000 CR4: 0000000000002660
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process kipmi0 (pid: 3429, threadinfo ffff880155e2c000, task ffff880155cf2d60)
Stack:
 ffff880155e2dc88 ffff880028058a48 ffff880155e2de24 0000000255e2dc98
<0> ffff880155e2de18 0000000000000004 0000000000000000 ffffffff00000000
<0> ffff880028075e00 0000000028040000 ffff880028058df0 000000002803be08
Call Trace:
 [<ffffffff8143a805>] schedule+0x27a/0x736
 [<ffffffff8143c3da>] ? _spin_unlock_irqrestore+0x15/0x17
 [<ffffffff8143b0c0>] schedule_timeout+0x9d/0xc4
 [<ffffffff810608c4>] ? process_timeout+0x0/0x10
 [<ffffffff8143b125>] schedule_timeout_interruptible+0x1e/0x20
 [<ffffffffa01c9d09>] ipmi_thread+0x6a/0x7e [ipmi_si]
 [<ffffffffa01c9c9f>] ? ipmi_thread+0x0/0x7e [ipmi_si]
 [<ffffffff81071aa3>] kthread+0x6e/0x76
 [<ffffffff81012daa>] child_rip+0xa/0x20
 [<ffffffff81011f91>] ? int_ret_from_sys_call+0x7/0x1b
 [<ffffffff8101271d>] ? retint_restore_args+0x5/0x6
 [<ffffffff81012da0>] ? child_rip+0x0/0x20
Code: 83 7d 10 00 74 0c 48 8b 5d 10 c7 03 00 00 00 00 eb 70 41 8b 55 08 48 8b 45 a8 48 89 d3 48 c1 a5 d0 fe ff ff 0a 48 c1 e0 0a 31 d2 <48> f7 f3 48 89 45 a0 48 8b 85 08 ff ff ff 48 29 85 00 ff ff ff 
RIP  [<ffffffff8104ee57>] find_busiest_group+0x37d/0x721
 RSP <ffff880155e2dc70>
---[ end trace 13509d88f5b8918d ]---

[-- Attachment #3: kernel_bug_at_mmu.c.txt --]
[-- Type: text/plain, Size: 4390 bytes --]

Apr  8 12:19:47 r14a11017 kernel: ------------[ cut here ]------------
Apr  8 12:19:47 r14a11017 kernel: kernel BUG at arch/x86/xen/mmu.c:1872!
Apr  8 12:19:47 r14a11017 kernel: invalid opcode: 0000 [#1] SMP
Apr  8 12:19:47 r14a11017 kernel: last sysfs file: /sys/hypervisor/properties/capabilities
Apr  8 12:19:47 r14a11017 kernel: CPU 0
Apr  8 12:19:47 r14a11017 kernel: Modules linked in: 8021q garp blktap xen_netback xen_blkback blkback_pagemap nbd bridge stp llc autofs4 ipmi_devintf ipmi_si ipmi_msghandler lockd sunrpc bonding ipv6 xenfs dm_multipath video output sbs sbshc parport_pc lp parport ses enclosure snd_seq_dummy snd_seq_oss bnx2 snd_seq_midi_event serio_raw snd_seq snd_seq_device snd_pcm_oss snd_mixer_oss snd_pcm snd_timer i2c_i801 iTCO_wdt i2c_core snd soundcore snd_page_alloc iTCO_vendor_support pata_acpi ata_generic pcspkr ata_piix shpchp mptsas mptscsih mptbase [last unloaded: freq_table]
Apr  8 12:19:47 r14a11017 kernel: Pid: 15769, comm: sh Not tainted 2.6.32.36xen #1 Tecal RH2285
Apr  8 12:19:47 r14a11017 kernel: RIP: e030:[<ffffffff8100cebc>]  [<ffffffff8100cebc>] pin_pagetable_pfn+0x36/0x3c
Apr  8 12:19:47 r14a11017 kernel: RSP: e02b:ffff88001eb7baa8  EFLAGS: 00010282
Apr  8 12:19:47 r14a11017 kernel: RAX: 00000000ffffffea RBX: 000000000007b307 RCX: 0000000000000001
Apr  8 12:19:47 r14a11017 kernel: RDX: 0000000000000000 RSI: 0000000000000001 RDI: ffff88001eb7baa8
Apr  8 12:19:47 r14a11017 kernel: RBP: ffff88001eb7bac8 R08: 0000000000000420 R09: ffff880000000000
Apr  8 12:19:47 r14a11017 kernel: R10: 0000000000007ff0 R11: ffff88008fc97248 R12: ffff88002840b000
Apr  8 12:19:47 r14a11017 kernel: R13: 000000000007b484 R14: 0000000000000003 R15: ffff88009b090000
Apr  8 12:19:47 r14a11017 kernel: FS:  00007fe8bbc656e0(0000) GS:ffff88002803b000(0000) knlGS:0000000000000000
Apr  8 12:19:47 r14a11017 kernel: CS:  e033 DS: 0000 ES: 0000 CR0: 000000008005003b
Apr  8 12:19:47 r14a11017 kernel: CR2: 00000000006bb338 CR3: 000000007b307000 CR4: 0000000000002660
Apr  8 12:19:47 r14a11017 kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Apr  8 12:19:47 r14a11017 kernel: DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Apr  8 12:19:47 r14a11017 kernel: Process sh (pid: 15769, threadinfo ffff88001eb7a000, task ffff88009b090000)
Apr  8 12:19:47 r14a11017 kernel: Stack:
Apr  8 12:19:47 r14a11017 kernel:  0000000000000000 00000000004b7484 000000011eb7bac8 000000000007b307
Apr  8 12:19:47 r14a11017 kernel: <0> ffff88001eb7baf8 ffffffff8100e8ef ffff88012e4fb100 ffff88000fb5e018
Apr  8 12:19:47 r14a11017 kernel: <0> 000000000007b484 00000000006bb338 ffff88001eb7bb08 ffffffff8100e935
Apr  8 12:19:47 r14a11017 kernel: Call Trace:
Apr  8 12:19:47 r14a11017 kernel:  [<ffffffff8100e8ef>] xen_alloc_ptpage+0x8d/0x96
Apr  8 12:19:47 r14a11017 kernel:  [<ffffffff8100e935>] xen_alloc_pte+0x13/0x15
Apr  8 12:19:47 r14a11017 kernel:  [<ffffffff810eb702>] __pte_alloc+0x7f/0xdc
Apr  8 12:19:47 r14a11017 kernel:  [<ffffffff810e90bd>] ? pmd_offset+0x13/0x3c
Apr  8 12:19:47 r14a11017 kernel:  [<ffffffff810eb818>] handle_mm_fault+0xb9/0x771
Apr  8 12:19:47 r14a11017 kernel:  [<ffffffff810f08fd>] ? vma_link+0x7c/0xa4
Apr  8 12:19:47 r14a11017 kernel:  [<ffffffff810f13b0>] ? mmap_region+0x322/0x42b
Apr  8 12:19:47 r14a11017 kernel:  [<ffffffff8100f169>] ? xen_force_evtchn_callback+0xd/0xf
Apr  8 12:19:47 r14a11017 kernel:  [<ffffffff81449701>] do_page_fault+0x21c/0x288
Apr  8 12:19:47 r14a11017 kernel:  [<ffffffff81447695>] page_fault+0x25/0x30
Apr  8 12:19:47 r14a11017 kernel:  [<ffffffff81222a39>] ? __clear_user+0x33/0x55
Apr  8 12:19:47 r14a11017 kernel:  [<ffffffff81222a1d>] ? __clear_user+0x17/0x55
Apr  8 12:19:47 r14a11017 kernel:  [<ffffffff81222a8b>] clear_user+0x30/0x38
Apr  8 12:19:47 r14a11017 kernel:  [<ffffffff8115139a>] load_elf_binary+0x5d5/0x17ef
Apr  8 12:19:47 r14a11017 kernel:  [<ffffffff811f4648>] ? process_measurement+0xc0/0xd7
Apr  8 12:19:47 r14a11017 kernel:  [<ffffffff81150dc5>] ? load_elf_binary+0x0/0x17ef
Apr  8 12:19:47 r14a11017 kernel:  [<ffffffff81113094>] search_binary_handler+0xc8/0x255
Apr  8 12:19:47 r14a11017 kernel:  [<ffffffff81114362>] do_execve+0x1c3/0x29e
Apr  8 12:19:47 r14a11017 kernel:  [<ffffffff8101155d>] sys_execve+0x43/0x5d
Apr  8 12:19:47 r14a11017 kernel:  [<ffffffff810131ca>] stub_execve+0x6a/0xc0

[-- Attachment #4: granttabl.txt --]
[-- Type: text/plain, Size: 10965 bytes --]

(XEN) grant_table.c:266:d0 Bad flags (0) or dom (0). (expected dom 0)
(XEN) grant_table.c:266:d0 Bad flags (0) or dom (0). (expected dom 0)
(XEN) grant_table.c:266:d0 Bad flags (0) or dom (0). (expected dom 0)
(XEN) grant_table.c:266:d0 Bad flags (0) or dom (0). (expected dom 0)
(XEN) grant_table.c:266:d0 Bad flags (0) or dom (0). (expected dom 0)
(XEN) grant_table.c:266:d0 Bad flags (0) or dom (0). (expected dom 0)
(XEN) grant_table.c:266:d0 Bad flags (0) or dom (0). (expected dom 0)
(XEN) grant_table.c:266:d0 Bad flags (0) or dom (0). (expected dom 0)
(XEN) grant_table.c:266:d0 Bad flags (0) or dom (0). (expected dom 0)
(XEN) grant_table.c:266:d0 Bad flags (0) or dom (0). (expected dom 0)
(XEN) grant_table.c:266:d0 Bad flags (0) or dom (0). (expected dom 0)
(XEN) printk: 1 messages suppressed.
(XEN) grant_table.c:266:d0 Bad flags (0) or dom (0). (expected dom 0)
(XEN) printk: 5 messages suppressed.
(XEN) grant_table.c:266:d0 Bad flags (0) or dom (0). (expected dom 0)
(XEN) printk: 5 messages suppressed.
(XEN) grant_table.c:266:d0 Bad flags (0) or dom (0). (expected dom 0)
(XEN) printk: 3 messages suppressed.
(XEN) grant_table.c:266:d0 Bad flags (0) or dom (0). (expected dom 0)
(XEN) printk: 1 messages suppressed.
(XEN) grant_table.c:266:d0 Bad flags (0) or dom (0). (expected dom 0)
(XEN) printk: 1 messages suppressed.
(XEN) grant_table.c:266:d0 Bad flags (0) or dom (0). (expected dom 0)
(XEN) grant_table.c:266:d0 Bad flags (0) or dom (0). (expected dom 0)
(XEN) grant_table.c:266:d0 Bad flags (0) or dom (0). (expected dom 0)
(XEN) printk: 1 messages suppressed.
(XEN) grant_table.c:266:d0 Bad flags (0) or dom (0). (expected dom 0)
(XEN) printk: 1 messages suppressed.
(XEN) grant_table.c:266:d0 Bad flags (0) or dom (0). (expected dom 0)
(XEN) printk: 3 messages suppressed.
(XEN) grant_table.c:266:d0 Bad flags (0) or dom (0). (expected dom 0)
(XEN) printk: 1 messages suppressed.
(XEN) grant_table.c:266:d0 Bad flags (0) or dom (0). (expected dom 0)
(XEN) printk: 5 messages suppressed.
(XEN) grant_table.c:266:d0 Bad flags (0) or dom (0). (expected dom 0)
(XEN) printk: 5 messages suppressed.
(XEN) grant_table.c:266:d0 Bad flags (0) or dom (0). (expected dom 0)
(XEN) printk: 3 messages suppressed.
(XEN) grant_table.c:266:d0 Bad flags (0) or dom (0). (expected dom 0)
(XEN) grant_table.c:266:d0 Bad flags (0) or dom (0). (expected dom 0)
(XEN) grant_table.c:266:d0 Bad flags (0) or dom (0). (expected dom 0)
(XEN) printk: 1 messages suppressed.
(XEN) grant_table.c:266:d0 Bad flags (0) or dom (0). (expected dom 0)
(XEN) printk: 1 messages suppressed.
(XEN) grant_table.c:266:d0 Bad flags (0) or dom (0). (expected dom 0)
(XEN) grant_table.c:266:d0 Bad flags (0) or dom (0). (expected dom 0)
(XEN) printk: 6 messages suppressed.
(XEN) grant_table.c:266:d0 Bad flags (0) or dom (0). (expected dom 0)
(XEN) printk: 1 messages suppressed.
(XEN) grant_table.c:266:d0 Bad flags (0) or dom (0). (expected dom 0)
(XEN) printk: 3 messages suppressed.
(XEN) grant_table.c:266:d0 Bad flags (0) or dom (0). (expected dom 0)
(XEN) printk: 9 messages suppressed.
(XEN) grant_table.c:266:d0 Bad flags (0) or dom (0). (expected dom 0)
(XEN) printk: 9 messages suppressed.
(XEN) grant_table.c:266:d0 Bad flags (0) or dom (0). (expected dom 0)
(XEN) printk: 9 messages suppressed.
(XEN) grant_table.c:266:d0 Bad flags (0) or dom (0). (expected dom 0)
(XEN) printk: 7 messages suppressed.
(XEN) grant_table.c:266:d0 Bad flags (0) or dom (0). (expected dom 0)
(XEN) printk: 5 messages suppressed.
(XEN) grant_table.c:266:d0 Bad flags (0) or dom (0). (expected dom 0)
(XEN) printk: 3 messages suppressed.
(XEN) grant_table.c:1717:d0 Bad grant reference 4294965983
(XEN) printk: 1 messages suppressed.
(XEN) grant_table.c:266:d0 Bad flags (0) or dom (0). (expected dom 0)
(XEN) printk: 1 messages suppressed.
(XEN) grant_table.c:266:d0 Bad flags (0) or dom (0). (expected dom 0)
(XEN) printk: 3 messages suppressed.
(XEN) grant_table.c:266:d0 Bad flags (0) or dom (0). (expected dom 0)
(XEN) grant_table.c:266:d0 Bad flags (0) or dom (0). (expected dom 0)
(XEN) grant_table.c:266:d0 Bad flags (0) or dom (0). (expected dom 0)
(XEN) grant_table.c:266:d0 Bad flags (0) or dom (0). (expected dom 0)
(XEN) grant_table.c:266:d0 Bad flags (0) or dom (0). (expected dom 0)
(XEN) grant_table.c:266:d0 Bad flags (0) or dom (0). (expected dom 0)
(XEN) grant_table.c:266:d0 Bad flags (0) or dom (0). (expected dom 0)
(XEN) grant_table.c:266:d0 Bad flags (0) or dom (0). (expected dom 0)
(XEN) grant_table.c:266:d0 Bad flags (0) or dom (0). (expected dom 0)
(XEN) printk: 3 messages suppressed.
(XEN) grant_table.c:266:d0 Bad flags (0) or dom (0). (expected dom 0)
(XEN) printk: 1 messages suppressed.
(XEN) grant_table.c:266:d0 Bad flags (0) or dom (0). (expected dom 0)
(XEN) grant_table.c:266:d0 Bad flags (0) or dom (0). (expected dom 0)
(XEN) grant_table.c:266:d0 Bad flags (0) or dom (0). (expected dom 0)
(XEN) printk: 3 messages suppressed.
(XEN) grant_table.c:266:d0 Bad flags (0) or dom (0). (expected dom 0)
(XEN) grant_table.c:266:d0 Bad flags (0) or dom (0). (expected dom 0)
(XEN) printk: 2 messages suppressed.
(XEN) grant_table.c:266:d0 Bad flags (0) or dom (0). (expected dom 0)
(XEN) grant_table.c:266:d0 Bad flags (0) or dom (0). (expected dom 0)
(XEN) grant_table.c:266:d0 Bad flags (0) or dom (0). (expected dom 0)
(XEN) grant_table.c:1717:d0 Bad grant reference 4294965983
(XEN) grant_table.c:266:d0 Bad flags (0) or dom (0). (expected dom 0)
(XEN) grant_table.c:266:d0 Bad flags (0) or dom (0). (expected dom 0)
(XEN) grant_table.c:266:d0 Bad flags (0) or dom (0). (expected dom 0)
(XEN) grant_table.c:266:d0 Bad flags (0) or dom (0). (expected dom 0)
(XEN) grant_table.c:266:d0 Bad flags (0) or dom (0). (expected dom 0)
(XEN) grant_table.c:266:d0 Bad flags (0) or dom (0). (expected dom 0)
(XEN) grant_table.c:266:d0 Bad flags (0) or dom (0). (expected dom 0)
(XEN) grant_table.c:266:d0 Bad flags (0) or dom (0). (expected dom 0)
(XEN) grant_table.c:266:d0 Bad flags (0) or dom (0). (expected dom 0)
(XEN) grant_table.c:266:d0 Bad flags (0) or dom (0). (expected dom 0)
(XEN) grant_table.c:266:d0 Bad flags (0) or dom (0). (expected dom 0)
(XEN) grant_table.c:266:d0 Bad flags (0) or dom (0). (expected dom 0)
(XEN) grant_table.c:266:d0 Bad flags (0) or dom (0). (expected dom 0)
(XEN) grant_table.c:266:d0 Bad flags (0) or dom (0). (expected dom 0)
(XEN) grant_table.c:266:d0 Bad flags (0) or dom (0). (expected dom 0)
(XEN) grant_table.c:266:d0 Bad flags (0) or dom (0). (expected dom 0)
(XEN) grant_table.c:266:d0 Bad flags (0) or dom (0). (expected dom 0)
(XEN) grant_table.c:266:d0 Bad flags (0) or dom (0). (expected dom 0)
(XEN) printk: 4 messages suppressed.
(XEN) grant_table.c:266:d0 Bad flags (0) or dom (0). (expected dom 0)
(XEN) grant_table.c:266:d0 Bad flags (0) or dom (0). (expected dom 0)
(XEN) printk: 2 messages suppressed.
(XEN) grant_table.c:266:d0 Bad flags (0) or dom (0). (expected dom 0)
(XEN) printk: 3 messages suppressed.
(XEN) grant_table.c:266:d0 Bad flags (0) or dom (0). (expected dom 0)
(XEN) printk: 1 messages suppressed.
(XEN) grant_table.c:266:d0 Bad flags (0) or dom (0). (expected dom 0)
(XEN) printk: 1 messages suppressed.
(XEN) grant_table.c:266:d0 Bad flags (0) or dom (0). (expected dom 0)
(XEN) grant_table.c:266:d0 Bad flags (0) or dom (0). (expected dom 0)
(XEN) grant_table.c:266:d0 Bad flags (0) or dom (0). (expected dom 0)
(XEN) printk: 1 messages suppressed.
(XEN) grant_table.c:266:d0 Bad flags (0) or dom (0). (expected dom 0)
(XEN) grant_table.c:266:d0 Bad flags (0) or dom (0). (expected dom 0)
(XEN) grant_table.c:266:d0 Bad flags (0) or dom (0). (expected dom 0)
(XEN) grant_table.c:266:d0 Bad flags (0) or dom (0). (expected dom 0)
(XEN) grant_table.c:266:d0 Bad flags (0) or dom (0). (expected dom 0)
(XEN) grant_table.c:266:d0 Bad flags (0) or dom (0). (expected dom 0)
(XEN) grant_table.c:266:d0 Bad flags (0) or dom (0). (expected dom 0)
(XEN) grant_table.c:266:d0 Bad flags (0) or dom (0). (expected dom 0)
(XEN) grant_table.c:266:d0 Bad flags (0) or dom (0). (expected dom 0)
(XEN) grant_table.c:266:d0 Bad flags (0) or dom (0). (expected dom 0)
(XEN) grant_table.c:266:d0 Bad flags (0) or dom (0). (expected dom 0)
(XEN) grant_table.c:266:d0 Bad flags (0) or dom (0). (expected dom 0)
(XEN) grant_table.c:266:d0 Bad flags (0) or dom (0). (expected dom 0)
(XEN) printk: 1 messages suppressed.
(XEN) grant_table.c:266:d0 Bad flags (0) or dom (0). (expected dom 0)
(XEN) printk: 1 messages suppressed.
(XEN) grant_table.c:266:d0 Bad flags (0) or dom (0). (expected dom 0)
(XEN) printk: 5 messages suppressed.
(XEN) grant_table.c:266:d0 Bad flags (0) or dom (0). (expected dom 0)
(XEN) printk: 3 messages suppressed.
(XEN) grant_table.c:266:d0 Bad flags (0) or dom (0). (expected dom 0)
(XEN) printk: 1 messages suppressed.
(XEN) grant_table.c:266:d0 Bad flags (0) or dom (0). (expected dom 0)
(XEN) printk: 3 messages suppressed.
(XEN) grant_table.c:266:d0 Bad flags (0) or dom (0). (expected dom 0)
(XEN) grant_table.c:266:d0 Bad flags (0) or dom (0). (expected dom 0)
(XEN) grant_table.c:266:d0 Bad flags (0) or dom (0). (expected dom 0)
(XEN) grant_table.c:266:d0 Bad flags (0) or dom (0). (expected dom 0)
(XEN) grant_table.c:266:d0 Bad flags (0) or dom (0). (expected dom 0)
(XEN) grant_table.c:1717:d0 Bad grant reference 4294965983
(XEN) grant_table.c:1717:d0 Bad grant reference 4294965983
(XEN) grant_table.c:266:d0 Bad flags (0) or dom (0). (expected dom 0)
(XEN) grant_table.c:266:d0 Bad flags (0) or dom (0). (expected dom 0)
(XEN) grant_table.c:266:d0 Bad flags (0) or dom (0). (expected dom 0)
(XEN) printk: 7 messages suppressed.
(XEN) grant_table.c:266:d0 Bad flags (0) or dom (0). (expected dom 0)
(XEN) printk: 15 messages suppressed.
(XEN) grant_table.c:1717:d0 Bad grant reference 4294901760
(XEN) printk: 13 messages suppressed.
(XEN) grant_table.c:1717:d0 Bad grant reference 4294901765
(XEN) printk: 9 messages suppressed.
(XEN) grant_table.c:1717:d0 Bad grant reference 4294901765
(XEN) printk: 9 messages suppressed.
(XEN) grant_table.c:1717:d0 Bad grant reference 4294901765
(XEN) printk: 9 messages suppressed.
(XEN) grant_table.c:1717:d0 Bad grant reference 4294901765
(XEN) printk: 13 messages suppressed.
(XEN) grant_table.c:1717:d0 Bad grant reference 4294901760
(XEN) printk: 9 messages suppressed.
(XEN) grant_table.c:1717:d0 Bad grant reference 4294901760
(XEN) grant_table.c:1717:d0 Bad grant reference 4294901760
(XEN) grant_table.c:1717:d0 Bad grant reference 4294901765
(XEN) printk: 5 messages suppressed.
(XEN) grant_table.c:1717:d0 Bad grant reference 4294901765
(XEN) printk: 6 messages suppressed.
(XEN) grant_table.c:1717:d0 Bad grant reference 4294901760
(XEN) printk: 8 messages suppressed.
(XEN) grant_table.c:1717:d0 Bad grant reference 4294901765


* RE: kernel BUG at arch/x86/xen/mmu.c:1872
  2011-04-10  4:29   ` MaoXiaoyun
@ 2011-04-10 13:57     ` MaoXiaoyun
  2011-04-10 20:14       ` Teck Choon Giam
  0 siblings, 1 reply; 41+ messages in thread
From: MaoXiaoyun @ 2011-04-10 13:57 UTC (permalink / raw)
  To: xen devel; +Cc: jeremy, keir, ian.campbell, konrad.wilk, giamteckchoon, dave


[-- Attachment #1.1: Type: text/plain, Size: 3342 bytes --]


Hi Konrad & Jeremy:

            I think we have finally located the missing patch for this bug.
            We tested commit http://git.kernel.org/?p=linux/kernel/git/jeremy/xen.git;a=commit;h=c97f681f138039425c87f35ea46a92385d81e70e
            which works.
 
            We tested commit http://git.kernel.org/?p=linux/kernel/git/jeremy/xen.git;a=commit;h=221c64dbf860d37f841f40893bddf8d804aa55bd
            with which the server crashed.
 
             Later I found the comments for this commit:
             http://git.kernel.org/?p=linux/kernel/git/jeremy/xen.git;a=commit;h=64141da587241301ce8638cc945f8b67853156ec
 
            So it looks like this fix has not been applied to 2.6.32.36. Could you take a look at this?
 
            Many thanks.
              
=====================================================
>Hi Konrad & Jeremy:
> 
>     I'd like to open this BUG in a new thread, since the old thread is too long to read easily.
>     
>     We recently wanted to upgrade our kernel to 2.6.32, but unfortunately we ran into a kernel crash bug.
>Our test case is simple: start 24 Win2003 HVMs on our physical machine, with each HVM rebooting
>every 15 minutes.  The kernel will crash within half an hour (that is, it crashes when the VMs start for the second time).
> 
>Our tests went much further.
>We tested different kernel versions.
>2.6.32.10  http://git.kernel.org/?p=linux/kernel/git/jeremy/xen.git;a=commit;h=d945b014ac5df9592c478bf9486d97e8914aab59
>2.6.32.11  http://git.kernel.org/?p=linux/kernel/git/jeremy/xen.git;a=commit;h=27f948a3bf365a5bc3d56119637a177d41147815
>2.6.32.12  http://git.kernel.org/?p=linux/kernel/git/jeremy/xen.git;a=commit;h=ba739f9abd3f659b907a824af1161926b420a2ce
>2.6.32.13  http://git.kernel.org/?p=linux/kernel/git/jeremy/xen.git;a=commit;h=f6fe6583b77a49b569eef1b66c3d761eec2e561b
>2.6.32.15  http://git.kernel.org/?p=linux/kernel/git/jeremy/xen.git;a=commit;h=27ed1b0e0dae5f1d5da5c76451bc84cb529128bd
>2.6.32.21  http://git.kernel.org/?p=linux/kernel/git/jeremy/xen.git;a=commit;h=69e50db231723596ed8ef9275d0068d6697f466a
> 
>There are basically three different results we met.
> 
>i1) grant table issue
>The host still functions, but with xm dmesg we see abnormal log entries;
>please refer to the attached grant table log.
> 
>i2) kernel crash in a different place.
>The host dies during the test; after reboot, we can see nothing abnormal in /var/log/messages.
> 
>i3) kernel BUG at arch/x86/xen/mmu.c:1872
>The host dies during the test; after reboot, we see the crash log in messages; refer to the attached log of 2.6.32.36.
>A summary of the test results can be classified into two groups:
> 
>1) 2.6.32.10
>30 machines were involved in the test: three had issue (i1), two had issue (i2), and *no* machine hit issue (i3).
>The other machines have run the tests successfully so far, for more than 8 hours.
> 
>2) 2.6.32.11 or later versions.
>Each version had 10 machines for tests, and all machines crashed in less than half an hour.
> 
>Conclusion:
>1) The grant table issue exists in all kernel versions.
>2) The kernel crash in a different place may exist in all kernel versions, but does not happen as frequently: 2 out of 30.
>3) We observe that the major difference is issue i3); from the test, it looks like it was introduced between versions
>2.6.32.10 and 2.6.32.11.
> 
>Hope this helps to locate the bug.
>Many thanks.
> 
> 		 	   		  

[-- Attachment #1.2: Type: text/html, Size: 5901 bytes --]


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: kernel BUG at arch/x86/xen/mmu.c:1872
  2011-04-10 13:57     ` MaoXiaoyun
@ 2011-04-10 20:14       ` Teck Choon Giam
  2011-04-11 12:16         ` Teck Choon Giam
  0 siblings, 1 reply; 41+ messages in thread
From: Teck Choon Giam @ 2011-04-10 20:14 UTC (permalink / raw)
  To: MaoXiaoyun; +Cc: jeremy, xen devel, keir, ian.campbell, konrad.wilk, dave

[-- Attachment #1: Type: text/plain, Size: 4583 bytes --]

2011/4/10 MaoXiaoyun <tinnycloud@hotmail.com>:
> Hi Konrad & Jeremy:
>
>             I think we finally located the missing patch for this commit.
>             We test commit
> http://git.kernel.org/?p=linux/kernel/git/jeremy/xen.git;a=commit;h=c97f681f138039425c87f35ea46a92385d81e70e
>             which is works.
>
>             We test commit
> http://git.kernel.org/?p=linux/kernel/git/jeremy/xen.git;a=commit;h=221c64dbf860d37f841f40893bddf8d804aa55bd
>             which server crashed.
>
>              Later I found the comments for this commit:
>
> http://git.kernel.org/?p=linux/kernel/git/jeremy/xen.git;a=commit;h=64141da587241301ce8638cc945f8b67853156ec
>
>             So It looks like this fix is not applied on 2.6.32.36, Could you
> take a look at this?
>
>             Many thanks.
>
> =====================================================
>>Hi Konrad & Jeremy:
>>
>>     I'd like to open this BUG in a new thread, since the old thread is too
>> long for easy read.
>>
>>     We recently want to upgrade our kernel to 2.6.32, but unfortunately,
>> we confront a kernel crash bug.
>>Our test case is simple, start 24 win2003 HVMS on our physical machine, and
>> each HVM reboot
>>every 15minutes. The kernel will crash in half an hour.(That is crash on VM
>> second starts).
>>
>>Our test go much further.
>>We test different kernel version.
>>2.6.32.10
>> http://git.kernel.org/?p=linux/kernel/git/jeremy/xen.git;a=commit;h=d945b014ac5df9592c478bf9486d97e8914aab59
>>2.6.32.11
>> http://git.kernel.org/?p=linux/kernel/git/jeremy/xen.git;a=commit;h=27f948a3bf365a5bc3d56119637a177d41147815
>>2.6.32.12
>> http://git.kernel.org/?p=linux/kernel/git/jeremy/xen.git;a=commit;h=ba739f9abd3f659b907a824af1161926b420a2ce
>>2.6.32.13
>> http://git.kernel.org/?p=linux/kernel/git/jeremy/xen.git;a=commit;h=f6fe6583b77a49b569eef1b66c3d761eec2e561b
>>2.6.32.15
>> http://git.kernel.org/?p=linux/kernel/git/jeremy/xen.git;a=commit;h=27ed1b0e0dae5f1d5da5c76451bc84cb529128bd
>>2.6.32.21
>> http://git.kernel.org/?p=linux/kernel/git/jeremy/xen.git;a=commit;h=69e50db231723596ed8ef9275d0068d6697f466a
>>
>>There are basic three different result we met.
>>
>>i1) grant table issue
>>The host still function, but use xm  dmesg, we have abnormal log.
>>please refer to the attched log of grant table
>>
>>i2) kernel crash on a different place.
>>Host die during the test, after reboot, we can see nothing abnormal in
>> /var/log/messages
>>
>>i3) kernel BUG at arch/x86/xen/mmu.c:1872;
>>Host die during the test, after reboot, we see the crash log in messages,
>> refer to the attached log of 2.6.32.36
>>Summary of the test result, can be classified in two:
>>
>>1) 2.6.32.10
>>30 machines involved the test, and three has issue (i1), and two has issue
>> (i2), *no* issue (i3)
>>Other machines run tests successfully till now, more than 8 hours
>>
>>2)2.6.32.11 or later version.
>>Each version containers 10 machine for tests, and all machine crashed in
>> less than half an hour.
>>
>>Conclusion:
>>1) grant table issue exists in all kernel version
>>2) kernerl crash at different place may exist in all kernel versions, but
>> not happen so frequently, 2 out of 30
>>3) We observe the major difference of issue i3), from the test, it looks
>> like it is introduced between the version
>>2.6.32.10 and 2.6.32.11.
>>
>>Hope this help to locate the bug.
>>Many thanks.
>>
>>
>

Hi,

Sorry, this mmu related BUG has been troubling me for a very long
time... I really want to "kill" this BUG, but my knowledge of kernel
hacking and/or xen is very limited.

While waiting for Jeremy or Konrad or others ...

Many thanks for spending time to track down this mmu related BUG.  I
have backported the commit
http://git.kernel.org/?p=linux/kernel/git/jeremy/xen.git;a=commit;h=64141da587241301ce8638cc945f8b67853156ec
to the 2.6.32.36 PVOPS kernel; the patch is attached.  I don't know
whether I backported it correctly or whether it affects anything else.
I am currently testing the 2.6.32.36 PVOPS kernel with this patch
applied and with CONFIG_DEBUG_PAGEALLOC unset.  I am running
testcrash.sh with a loop count of 1000, as I am unable to reproduce
this mmu BUG 1872 with a testcrash.sh loop of 100.  Please note that
when CONFIG_DEBUG_PAGEALLOC is unset, I can reproduce this mmu BUG 1872
easily within <50 testcrash.sh loop cycles with PVOPS kernels 2.6.32.24
to 2.6.32.36.  Now testing with this backport patch to see whether I
can reproduce this mmu BUG... ...

Kindest regards,
Giam Teck Choon

[-- Attachment #2: vmalloc__eagerly_clear_ptes_on_vunmap.patch --]
[-- Type: text/x-patch, Size: 3393 bytes --]

Backport from commit http://git.kernel.org/?p=linux/kernel/git/jeremy/xen.git;a=commit;h=64141da587241301ce8638cc945f8b67853156ec

diff -urN a/arch/x86/xen/mmu.c b/arch/x86/xen/mmu.c
--- a/arch/x86/xen/mmu.c	2011-03-30 06:17:46.000000000 +0800
+++ b/arch/x86/xen/mmu.c	2011-04-11 02:17:54.000000000 +0800
@@ -2430,8 +2430,6 @@
 	x86_init.paging.pagetable_setup_start = xen_pagetable_setup_start;
 	x86_init.paging.pagetable_setup_done = xen_pagetable_setup_done;
 	pv_mmu_ops = xen_mmu_ops;
-
-	vmap_lazy_unmap = false;
 }
 
 /* Protected by xen_reservation_lock. */
diff -urN a/include/linux/vmalloc.h b/include/linux/vmalloc.h
--- a/include/linux/vmalloc.h	2011-03-30 06:17:46.000000000 +0800
+++ b/include/linux/vmalloc.h	2011-04-11 02:18:43.000000000 +0800
@@ -7,8 +7,6 @@
 
 struct vm_area_struct;		/* vma defining user mapping in mm_types.h */
 
-extern bool vmap_lazy_unmap;
-
 /* bits in flags of vmalloc's vm_struct below */
 #define VM_IOREMAP	0x00000001	/* ioremap() and friends */
 #define VM_ALLOC	0x00000002	/* vmalloc() */
diff -urN a/mm/vmalloc.c b/mm/vmalloc.c
--- a/mm/vmalloc.c	2011-03-30 06:17:46.000000000 +0800
+++ b/mm/vmalloc.c	2011-04-11 02:25:38.000000000 +0800
@@ -31,8 +31,6 @@
 #include <asm/tlbflush.h>
 #include <asm/shmparam.h>
 
-bool vmap_lazy_unmap __read_mostly = true;
-
 /*** Page table manipulation functions ***/
 
 static void vunmap_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end)
@@ -503,9 +501,6 @@
 {
 	unsigned int log;
 
-	if (!vmap_lazy_unmap)
-		return 0;
-
 	log = fls(num_online_cpus());
 
 	return log * (32UL * 1024 * 1024 / PAGE_SIZE);
@@ -566,7 +561,6 @@
 			if (va->va_end > *end)
 				*end = va->va_end;
 			nr += (va->va_end - va->va_start) >> PAGE_SHIFT;
-			unmap_vmap_area(va);
 			list_add_tail(&va->purge_list, &valist);
 			va->flags |= VM_LAZY_FREEING;
 			va->flags &= ~VM_LAZY_FREE;
@@ -612,10 +606,11 @@
 }
 
 /*
- * Free and unmap a vmap area, caller ensuring flush_cache_vunmap had been
- * called for the correct range previously.
+ * Free a vmap area, caller ensuring that the area has been unmapped
+ * and flush_cache_vunmap had been called for the correct range
+ * previously.
  */
-static void free_unmap_vmap_area_noflush(struct vmap_area *va)
+static void free_vmap_area_noflush(struct vmap_area *va)
 {
 	va->flags |= VM_LAZY_FREE;
 	atomic_add((va->va_end - va->va_start) >> PAGE_SHIFT, &vmap_lazy_nr);
@@ -624,6 +619,16 @@
 }
 
 /*
+ * Free and unmap a vmap area, caller ensuring flush_cache_vunmap had been
+ * called for the correct range previously.
+ */
+static void free_unmap_vmap_area_noflush(struct vmap_area *va)
+{
+	unmap_vmap_area(va);
+	free_vmap_area_noflush(va);
+}
+
+/*
  * Free and unmap a vmap area
  */
 static void free_unmap_vmap_area(struct vmap_area *va)
@@ -799,7 +804,7 @@
 	spin_unlock(&vmap_block_tree_lock);
 	BUG_ON(tmp != vb);
 
-	free_unmap_vmap_area_noflush(vb->va);
+	free_vmap_area_noflush(vb->va);
 	call_rcu(&vb->rcu_head, rcu_free_vb);
 }
 
@@ -936,6 +941,8 @@
 	rcu_read_unlock();
 	BUG_ON(!vb);
 
+	vunmap_page_range((unsigned long)addr, (unsigned long)addr + size);
+
 	spin_lock(&vb->lock);
 	BUG_ON(bitmap_allocate_region(vb->dirty_map, offset >> PAGE_SHIFT, order));
 
@@ -988,7 +995,6 @@
 
 				s = vb->va->va_start + (i << PAGE_SHIFT);
 				e = vb->va->va_start + (j << PAGE_SHIFT);
-				vunmap_page_range(s, e);
 				flush = 1;
 
 				if (s < start)

[-- Attachment #3: testcrash.sh --]
[-- Type: application/x-sh, Size: 5573 bytes --]


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: kernel BUG at arch/x86/xen/mmu.c:1872
  2011-04-10 20:14       ` Teck Choon Giam
@ 2011-04-11 12:16         ` Teck Choon Giam
  2011-04-11 12:22           ` Teck Choon Giam
  2011-04-11 12:31           ` MaoXiaoyun
  0 siblings, 2 replies; 41+ messages in thread
From: Teck Choon Giam @ 2011-04-11 12:16 UTC (permalink / raw)
  To: MaoXiaoyun; +Cc: jeremy, xen devel, keir, ian.campbell, konrad.wilk, dave

>
> Hi,
>
> Sorry, since this mmu related BUG has been troubled me for very
> long... I really want to "kill" this BUG but my knowledge in kernel
> hacking and/or xen is very limited.
>
> While waiting for Jeremy or Konrad or others ...
>
> Many thanks for spending time to track down this mmu related BUG.  I
> have backported the commit from
> http://git.kernel.org/?p=linux/kernel/git/jeremy/xen.git;a=commit;h=64141da587241301ce8638cc945f8b67853156ec
> to 2.6.32.36 PVOPS kernel and patch attached.  I won't know whether
> did I backport it correctly nor does it affects anything.  I am
> currently testing the 2.6.32.36 PVOPS kernel with this patch applied
> and also unset CONFIG_DEBUG_PAGEALLOC.  Currently running testcrash.sh
> loop 1000 as I am unable to reproduce this mmu BUG 1872 in
> testcrash.sh loop 100.  Please note that when CONFIG_DEBUG_PAGEALLOC
> is unset, I can reproduce this mmu BUG 1872 easily within <50
> testcrash.sh loop cycle with PVOPS version 2.6.32.24 to 2.6.32.36
> kernel.  Now test with this backport patch to see whether I can
> reproduce this mmu BUG... ...
>
> Kindest regards,
> Giam Teck Choon
>

I have tested with my backport patch and it is working fine: I am
unable to reproduce the mmu.c 1872 or 1860 bug with
CONFIG_DEBUG_PAGEALLOC not set.  I tested with testcrash.sh loops of
100 and 1000, and am now doing a testcrash.sh loop of 10000.

Xiaoyun, is it possible for you to test my patch and see whether you
can reproduce the mmu.c 1872/1860 bug?

Can anyone of you review my patch?

I will post a properly formatted patch according to
Documentation/SubmittingPatches in my next reply, which hopefully can
be reviewed.

Thanks.

Kindest regards,
Giam Teck Choon

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: kernel BUG at arch/x86/xen/mmu.c:1872
  2011-04-11 12:16         ` Teck Choon Giam
@ 2011-04-11 12:22           ` Teck Choon Giam
  2011-04-11 12:31           ` MaoXiaoyun
  1 sibling, 0 replies; 41+ messages in thread
From: Teck Choon Giam @ 2011-04-11 12:22 UTC (permalink / raw)
  To: MaoXiaoyun; +Cc: jeremy, xen devel, keir, ian.campbell, konrad.wilk, dave

From: Giam Teck Choon <giamteckchoon@gmail.com>

vmalloc: eagerly clear ptes on vunmap

Backport from commit 64141da587241301ce8638cc945f8b67853156ec to 2.6.32.36

URL: http://git.kernel.org/?p=linux/kernel/git/jeremy/xen.git;a=commit;h=64141da587241301ce8638cc945f8b67853156ec

Without this patch, kernel BUG at arch/x86/xen/mmu.c:1860 or kernel BUG at
arch/x86/xen/mmu.c:1872 is easily triggered when CONFIG_DEBUG_PAGEALLOC is
unset, especially when doing LVM snapshots.

Signed-off-by: Giam Teck Choon <giamteckchoon@gmail.com>
---
 arch/x86/xen/mmu.c      |    2 --
 include/linux/vmalloc.h |    2 --
 mm/vmalloc.c            |   28 +++++++++++++++++-----------
 3 files changed, 17 insertions(+), 15 deletions(-)

diff --git a/arch/x86/xen/mmu.c b/arch/x86/xen/mmu.c
index fa36ab8..204e3ba 100644
--- a/arch/x86/xen/mmu.c
+++ b/arch/x86/xen/mmu.c
@@ -2430,8 +2430,6 @@ void __init xen_init_mmu_ops(void)
 	x86_init.paging.pagetable_setup_start = xen_pagetable_setup_start;
 	x86_init.paging.pagetable_setup_done = xen_pagetable_setup_done;
 	pv_mmu_ops = xen_mmu_ops;
-
-	vmap_lazy_unmap = false;
 }

 /* Protected by xen_reservation_lock. */
diff --git a/include/linux/vmalloc.h b/include/linux/vmalloc.h
index 1a2ba21..3c123c3 100644
--- a/include/linux/vmalloc.h
+++ b/include/linux/vmalloc.h
@@ -7,8 +7,6 @@

 struct vm_area_struct;		/* vma defining user mapping in mm_types.h */

-extern bool vmap_lazy_unmap;
-
 /* bits in flags of vmalloc's vm_struct below */
 #define VM_IOREMAP	0x00000001	/* ioremap() and friends */
 #define VM_ALLOC	0x00000002	/* vmalloc() */
diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index 4f701c2..80cbd7b 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -31,8 +31,6 @@
 #include <asm/tlbflush.h>
 #include <asm/shmparam.h>

-bool vmap_lazy_unmap __read_mostly = true;
-
 /*** Page table manipulation functions ***/

 static void vunmap_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end)
@@ -503,9 +501,6 @@ static unsigned long lazy_max_pages(void)
 {
 	unsigned int log;

-	if (!vmap_lazy_unmap)
-		return 0;
-
 	log = fls(num_online_cpus());

 	return log * (32UL * 1024 * 1024 / PAGE_SIZE);
@@ -566,7 +561,6 @@ static void __purge_vmap_area_lazy(unsigned long *start, unsigned long *end,
 			if (va->va_end > *end)
 				*end = va->va_end;
 			nr += (va->va_end - va->va_start) >> PAGE_SHIFT;
-			unmap_vmap_area(va);
 			list_add_tail(&va->purge_list, &valist);
 			va->flags |= VM_LAZY_FREEING;
 			va->flags &= ~VM_LAZY_FREE;
@@ -612,10 +606,11 @@ static void purge_vmap_area_lazy(void)
 }

 /*
- * Free and unmap a vmap area, caller ensuring flush_cache_vunmap had been
- * called for the correct range previously.
+ * Free a vmap area, caller ensuring that the area has been unmapped
+ * and flush_cache_vunmap had been called for the correct range
+ * previously.
  */
-static void free_unmap_vmap_area_noflush(struct vmap_area *va)
+static void free_vmap_area_noflush(struct vmap_area *va)
 {
 	va->flags |= VM_LAZY_FREE;
 	atomic_add((va->va_end - va->va_start) >> PAGE_SHIFT, &vmap_lazy_nr);
@@ -624,6 +619,16 @@ static void free_unmap_vmap_area_noflush(struct vmap_area *va)
 }

 /*
+ * Free and unmap a vmap area, caller ensuring flush_cache_vunmap had been
+ * called for the correct range previously.
+ */
+static void free_unmap_vmap_area_noflush(struct vmap_area *va)
+{
+	unmap_vmap_area(va);
+	free_vmap_area_noflush(va);
+}
+
+/*
  * Free and unmap a vmap area
  */
 static void free_unmap_vmap_area(struct vmap_area *va)
@@ -799,7 +804,7 @@ static void free_vmap_block(struct vmap_block *vb)
 	spin_unlock(&vmap_block_tree_lock);
 	BUG_ON(tmp != vb);

-	free_unmap_vmap_area_noflush(vb->va);
+	free_vmap_area_noflush(vb->va);
 	call_rcu(&vb->rcu_head, rcu_free_vb);
 }

@@ -936,6 +941,8 @@ static void vb_free(const void *addr, unsigned long size)
 	rcu_read_unlock();
 	BUG_ON(!vb);

+	vunmap_page_range((unsigned long)addr, (unsigned long)addr + size);
+
 	spin_lock(&vb->lock);
 	BUG_ON(bitmap_allocate_region(vb->dirty_map, offset >> PAGE_SHIFT, order));

@@ -988,7 +995,6 @@ void vm_unmap_aliases(void)

 				s = vb->va->va_start + (i << PAGE_SHIFT);
 				e = vb->va->va_start + (j << PAGE_SHIFT);
-				vunmap_page_range(s, e);
 				flush = 1;

 				if (s < start)
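
As a side note on why clearing the ptes eagerly matters here: the lazy_max_pages()
formula kept by this patch lets a fair amount of vunmap'd address space keep its
stale page-table entries until the next batch purge, and my understanding is that
leftover writable aliases of that kind are exactly what makes Xen refuse to pin
such pages later.  The small userspace program below (my own illustration, not
kernel code; it assumes 4 KiB pages and reimplements fls() by hand) simply
evaluates that formula for a few CPU counts:

#include <stdio.h>

/* fls(): index of the highest set bit, 1-based, like the kernel helper. */
static unsigned int fls_u32(unsigned int x)
{
	unsigned int r = 0;

	while (x) {
		r++;
		x >>= 1;
	}
	return r;
}

int main(void)
{
	const unsigned long page_size = 4096;	/* assumed 4 KiB pages */
	unsigned int cpus;

	for (cpus = 1; cpus <= 64; cpus *= 2) {
		unsigned int log = fls_u32(cpus);
		unsigned long pages = log * (32UL * 1024 * 1024 / page_size);

		printf("%2u CPUs: lazy_max_pages() = %lu pages (%lu MiB)\n",
		       cpus, pages, pages * page_size >> 20);
	}
	return 0;
}

On a 16-CPU box, for example, this comes to 5 * 8192 = 40960 pages, i.e. up to
160 MiB of vunmap'd space that may stay mapped between purges.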

^ permalink raw reply related	[flat|nested] 41+ messages in thread

* RE: kernel BUG at arch/x86/xen/mmu.c:1872
  2011-04-11 12:16         ` Teck Choon Giam
  2011-04-11 12:22           ` Teck Choon Giam
@ 2011-04-11 12:31           ` MaoXiaoyun
  2011-04-11 15:25             ` Teck Choon Giam
  2011-04-11 18:08             ` Jeremy Fitzhardinge
  1 sibling, 2 replies; 41+ messages in thread
From: MaoXiaoyun @ 2011-04-11 12:31 UTC (permalink / raw)
  To: giamteckchoon; +Cc: jeremy, xen devel, keir, ian.campbell, konrad.wilk, dave


[-- Attachment #1.1: Type: text/plain, Size: 2759 bytes --]


Hi:
 
     I believe this is the fix, to a large extent.
     With this patch, my own test case succeeds in 30 rounds of runs;
     every round takes 8 hours.  Without this patch, the tests fail every round within 15 minutes.
 
      So this really fixes most of the problem.
 
      But during the runs I met another crash; from the log it looks like it is related to
this BUG, since the crash log shows it is tlb related and this BUG is also tlb related.
 
      Well, I also have only poor knowledge of the kernel.
      Hope someone from Xen Devel can offer some help.
 
      Many thanks.
 
> Date: Mon, 11 Apr 2011 20:16:53 +0800
> Subject: Re: kernel BUG at arch/x86/xen/mmu.c:1872
> From: giamteckchoon@gmail.com
> To: tinnycloud@hotmail.com
> CC: xen-devel@lists.xensource.com; dave@ivt.com.au; ian.campbell@citrix.com; konrad.wilk@oracle.com; jeremy@goop.org; keir@xen.org
> 
> >
> > Hi,
> >
> > Sorry, since this mmu related BUG has been troubled me for very
> > long... I really want to "kill" this BUG but my knowledge in kernel
> > hacking and/or xen is very limited.
> >
> > While waiting for Jeremy or Konrad or others ...
> >
> > Many thanks for spending time to track down this mmu related BUG.  I
> > have backported the commit from
> > http://git.kernel.org/?p=linux/kernel/git/jeremy/xen.git;a=commit;h=64141da587241301ce8638cc945f8b67853156ec
> > to 2.6.32.36 PVOPS kernel and patch attached.  I won't know whether
> > did I backport it correctly nor does it affects anything.  I am
> > currently testing the 2.6.32.36 PVOPS kernel with this patch applied
> > and also unset CONFIG_DEBUG_PAGEALLOC.  Currently running testcrash.sh
> > loop 1000 as I am unable to reproduce this mmu BUG 1872 in
> > testcrash.sh loop 100.  Please note that when CONFIG_DEBUG_PAGEALLOC
> > is unset, I can reproduce this mmu BUG 1872 easily within <50
> > testcrash.sh loop cycle with PVOPS version 2.6.32.24 to 2.6.32.36
> > kernel.  Now test with this backport patch to see whether I can
> > reproduce this mmu BUG... ...
> >
> > Kindest regards,
> > Giam Teck Choon
> >
> 
> I have tested with my backport patch and it is working fine as I am
> unable to reproduce the mmu.c 1872 or 1860 bug with
> CONFIG_DEBUG_PAGEALLOC not set. I tested with testcrash.sh loop 100
> and 1000. Now doing testcrash.sh loop 10000.
> 
> Xiaoyun, is it possible for you to test my patch and see whether can
> you reproduce the mmu.c 1872/1860 bug?
> 
> Can anyone of you review my patch?
> 
> I will post a format patch according to
> Documentation/SubmittingPatches in my next reply and hopefully can be
> reviewed.
> 
> Thanks.
> 
> Kindest regards,
> Giam Teck Choon
 		 	   		  

[-- Attachment #1.2: Type: text/html, Size: 3639 bytes --]

[-- Attachment #2: 196.22.txt --]
[-- Type: text/plain, Size: 6940 bytes --]

_ratelimit: 62 callbacks suppressed
blktap_sysfs_create: adding attributes for dev ffff88009bdb6000
blktap_sysfs_create: adding attributes for dev ffff88009bdb2200
INIT: Id "s0" respawning too fast: disabled for 5 minutes
__ratelimit: 14 callbacks suppressed
blktap_sysfs_destroy
blktap_sysfs_destroy
------------[ cut here ]------------
kernel BUG at arch/x86/mm/tlb.c:61!
invalid opcode: 0000 [#1] SMP 
last sysfs file: /sys/devices/system/xen_memory/xen_memory0/info/current_kb
CPU 1 
Modules linked in: 8021q garp xen_netback xen_blkback blktap blkback_pagemap nbd bridge stp llc autofs4 ipmi_devintf ipmi_si ipmi_msghandler lockd sunrpc bonding ipv6 xenfs dm_multipath video output sbs sbshc parport_pc lp parport ses enclosure snd_seq_dummy snd_seq_oss snd_seq_midi_event snd_seq snd_seq_device serio_raw bnx2 snd_pcm_oss snd_mixer_oss snd_pcm snd_timer iTCO_wdt snd soundcore snd_page_alloc i2c_i801 iTCO_vendor_support i2c_core pcspkr pata_acpi ata_generic ata_piix shpchp mptsas mptscsih mptbase [last unloaded: freq_table]
Pid: 25581, comm: khelper Not tainted 2.6.32.36fixxen #1 Tecal RH2285          
RIP: e030:[<ffffffff8103a3cb>]  [<ffffffff8103a3cb>] leave_mm+0x15/0x46
RSP: e02b:ffff88002805be48  EFLAGS: 00010046
RAX: 0000000000000000 RBX: 0000000000000001 RCX: ffff88015f8e2da0
RDX: ffff88002805be78 RSI: 0000000000000000 RDI: 0000000000000001
RBP: ffff88002805be48 R08: ffff88009d662000 R09: dead000000200200
R10: dead000000100100 R11: ffffffff814472b2 R12: ffff88009bfc1880
R13: ffff880028063020 R14: 00000000000004f6 R15: 0000000000000000
FS:  00007f62362d66e0(0000) GS:ffff880028058000(0000) knlGS:0000000000000000
CS:  e033 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 0000003aabc11909 CR3: 000000009b8ca000 CR4: 0000000000002660
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process khelper (pid: 25581, threadinfo ffff88007691e000, task ffff88009b92db40)
Stack:
 ffff88002805be68 ffffffff8100e4ae 0000000000000001 ffff88009d733b88
<0> ffff88002805be98 ffffffff81087224 ffff88002805be78 ffff88002805be78
<0> ffff88015f808360 00000000000004f6 ffff88002805bea8 ffffffff81010108
Call Trace:
 <IRQ> 
 [<ffffffff8100e4ae>] drop_other_mm_ref+0x2a/0x53
 [<ffffffff81087224>] generic_smp_call_function_single_interrupt+0xd8/0xfc
 [<ffffffff81010108>] xen_call_function_single_interrupt+0x13/0x28
 [<ffffffff810a936a>] handle_IRQ_event+0x66/0x120
 [<ffffffff810aac5b>] handle_percpu_irq+0x41/0x6e
 [<ffffffff8128c1c0>] __xen_evtchn_do_upcall+0x1ab/0x27d
 [<ffffffff8128dd11>] xen_evtchn_do_upcall+0x33/0x46
 [<ffffffff81013efe>] xen_do_hypervisor_callback+0x1e/0x30
 <EOI> 
 [<ffffffff814472b2>] ? _spin_unlock_irqrestore+0x15/0x17
 [<ffffffff8100f8cf>] ? xen_restore_fl_direct_end+0x0/0x1
 [<ffffffff81113f71>] ? flush_old_exec+0x3ac/0x500
 [<ffffffff81150dc5>] ? load_elf_binary+0x0/0x17ef
 [<ffffffff81150dc5>] ? load_elf_binary+0x0/0x17ef
 [<ffffffff8115115d>] ? load_elf_binary+0x398/0x17ef
 [<ffffffff81042fcf>] ? need_resched+0x23/0x2d
 [<ffffffff811f4648>] ? process_measurement+0xc0/0xd7
 [<ffffffff81150dc5>] ? load_elf_binary+0x0/0x17ef
 [<ffffffff81113094>] ? search_binary_handler+0xc8/0x255
 [<ffffffff81114362>] ? do_execve+0x1c3/0x29e
 [<ffffffff8101155d>] ? sys_execve+0x43/0x5d
 [<ffffffff8106fc45>] ? __call_usermodehelper+0x0/0x6f
 [<ffffffff81013e28>] ? kernel_execve+0x68/0xd0
 [<ffffffff8106fc45>] ? __call_usermodehelper+0x0/0x6f
 [<ffffffff8100f8cf>] ? xen_restore_fl_direct_end+0x0/0x1
 [<ffffffff8106fb64>] ? ____call_usermodehelper+0x113/0x11e
 [<ffffffff81013daa>] ? child_rip+0xa/0x20
 [<ffffffff8106fc45>] ? __call_usermodehelper+0x0/0x6f
 [<ffffffff81012f91>] ? int_ret_from_sys_call+0x7/0x1b
 [<ffffffff8101371d>] ? retint_restore_args+0x5/0x6
 [<ffffffff81013da0>] ? child_rip+0x0/0x20
Code: 41 5e 41 5f c9 c3 55 48 89 e5 0f 1f 44 00 00 e8 17 ff ff ff c9 c3 55 48 89 e5 0f 1f 44 00 00 65 8b 04 25 c8 55 01 00 ff c8 75 04 <0f> 0b eb fe 65 48 8b 34 25 c0 55 01 00 48 81 c6 b8 02 00 00 e8 
RIP  [<ffffffff8103a3cb>] leave_mm+0x15/0x46
 RSP <ffff88002805be48>
---[ end trace ce9cee6832a9c503 ]---
Kernel panic - not syncing: Fatal exception in interrupt
Pid: 25581, comm: khelper Tainted: G      D    2.6.32.36fixxen #1
Call Trace:
 <IRQ>  [<ffffffff8105682e>] panic+0xe0/0x19a
 [<ffffffff8144008a>] ? init_amd+0x296/0x37a
 [<ffffffff8100f17d>] ? xen_force_evtchn_callback+0xd/0xf
 [<ffffffff8100f8e2>] ? check_events+0x12/0x20
 [<ffffffff8100f8cf>] ? xen_restore_fl_direct_end+0x0/0x1
 [<ffffffff81056487>] ? print_oops_end_marker+0x23/0x25
 [<ffffffff81448185>] oops_end+0xb6/0xc6
 [<ffffffff810166e5>] die+0x5a/0x63
 [<ffffffff81447a5c>] do_trap+0x115/0x124
 [<ffffffff810148e6>] do_invalid_op+0x9c/0xa5
 [<ffffffff8103a3cb>] ? leave_mm+0x15/0x46
 [<ffffffff8100f6fa>] ? xen_clocksource_read+0x21/0x23
 [<ffffffff8100f26c>] ? HYPERVISOR_vcpu_op+0xf/0x11
 [<ffffffff8100f767>] ? xen_vcpuop_set_next_event+0x52/0x67
 [<ffffffff81080bfa>] ? clockevents_program_event+0x78/0x81
 [<ffffffff81013b3b>] invalid_op+0x1b/0x20
 [<ffffffff814472b2>] ? _spin_unlock_irqrestore+0x15/0x17
 [<ffffffff8103a3cb>] ? leave_mm+0x15/0x46
 [<ffffffff8100e4ae>] drop_other_mm_ref+0x2a/0x53
 [<ffffffff81087224>] generic_smp_call_function_single_interrupt+0xd8/0xfc
 [<ffffffff81010108>] xen_call_function_single_interrupt+0x13/0x28
 [<ffffffff810a936a>] handle_IRQ_event+0x66/0x120
 [<ffffffff810aac5b>] handle_percpu_irq+0x41/0x6e
 [<ffffffff8128c1c0>] __xen_evtchn_do_upcall+0x1ab/0x27d
 [<ffffffff8128dd11>] xen_evtchn_do_upcall+0x33/0x46
 [<ffffffff81013efe>] xen_do_hypervisor_callback+0x1e/0x30
 <EOI>  [<ffffffff814472b2>] ? _spin_unlock_irqrestore+0x15/0x17
 [<ffffffff8100f8cf>] ? xen_restore_fl_direct_end+0x0/0x1
 [<ffffffff81113f71>] ? flush_old_exec+0x3ac/0x500
 [<ffffffff81150dc5>] ? load_elf_binary+0x0/0x17ef
 [<ffffffff81150dc5>] ? load_elf_binary+0x0/0x17ef
 [<ffffffff8115115d>] ? load_elf_binary+0x398/0x17ef
 [<ffffffff81042fcf>] ? need_resched+0x23/0x2d
 [<ffffffff811f4648>] ? process_measurement+0xc0/0xd7
 [<ffffffff81150dc5>] ? load_elf_binary+0x0/0x17ef
 [<ffffffff81113094>] ? search_binary_handler+0xc8/0x255
 [<ffffffff81114362>] ? do_execve+0x1c3/0x29e
 [<ffffffff8101155d>] ? sys_execve+0x43/0x5d
 [<ffffffff8106fc45>] ? __call_usermodehelper+0x0/0x6f
 [<ffffffff81013e28>] ? kernel_execve+0x68/0xd0
 [<ffffffff8106fc45>] ? __call_usermodehelper+0x0/0x6f
 [<ffffffff8100f8cf>] ? xen_restore_fl_direct_end+0x0/0x1
 [<ffffffff8106fb64>] ? ____call_usermodehelper+0x113/0x11e
 [<ffffffff81013daa>] ? child_rip+0xa/0x20
 [<ffffffff8106fc45>] ? __call_usermodehelper+0x0/0x6f
 [<ffffffff81012f91>] ? int_ret_from_sys_call+0x7/0x1b
 [<ffffffff8101371d>] ? retint_restore_args+0x5/0x6
 [<ffffffff81013da0>] ? child_rip+0x0/0x20
(XEN) Domain 0 crashed: 'noreboot' set - not rebooting.


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: kernel BUG at arch/x86/xen/mmu.c:1872
  2011-04-11 12:31           ` MaoXiaoyun
@ 2011-04-11 15:25             ` Teck Choon Giam
  2011-04-12  3:30               ` MaoXiaoyun
  2011-04-11 18:08             ` Jeremy Fitzhardinge
  1 sibling, 1 reply; 41+ messages in thread
From: Teck Choon Giam @ 2011-04-11 15:25 UTC (permalink / raw)
  To: MaoXiaoyun; +Cc: jeremy, xen devel, keir, ian.campbell, konrad.wilk, dave

2011/4/11 MaoXiaoyun <tinnycloud@hotmail.com>:
> Hi:
>
>      I believe this is the fix at much extent.
>      Since I have my own test cases which with this patch, my test case will
> success in 30 rounds run.
>      Every round takes 8hours.  While without this patch, tests fail evey
> round in 15minutes.
>
>       So this really means fix most of the things.
>
>       But during running, I met another crash, from the log it it looks like
> has relation with
> this BUG, since the crash log shows it is tlb related and this BUG also tlb
> related.

Are you able to run another test with cpuidle=0 cpufreq=none in the
kernel boot options?  Just curious whether you can reproduce the tlb
bug when you boot with cpuidle=0 cpufreq=none... ...

>
>       Well, I'm also have poor knowledge of kernel.
>       Hope someone from Xen Devel offer some help.
>
>       Many thanks.
>
>> Date: Mon, 11 Apr 2011 20:16:53 +0800
>> Subject: Re: kernel BUG at arch/x86/xen/mmu.c:1872
>> From: giamteckchoon@gmail.com
>> To: tinnycloud@hotmail.com
>> CC: xen-devel@lists.xensource.com; dave@ivt.com.au;
>> ian.campbell@citrix.com; konrad.wilk@oracle.com; jeremy@goop.org;
>> keir@xen.org
>>
>> >
>> > Hi,
>> >
>> > Sorry, since this mmu related BUG has been troubled me for very
>> > long... I really want to "kill" this BUG but my knowledge in kernel
>> > hacking and/or xen is very limited.
>> >
>> > While waiting for Jeremy or Konrad or others ...
>> >
>> > Many thanks for spending time to track down this mmu related BUG.  I
>> > have backported the commit from
>> >
>> > http://git.kernel.org/?p=linux/kernel/git/jeremy/xen.git;a=commit;h=64141da587241301ce8638cc945f8b67853156ec
>> > to 2.6.32.36 PVOPS kernel and patch attached.  I won't know whether
>> > did I backport it correctly nor does it affects anything.  I am
>> > currently testing the 2.6.32.36 PVOPS kernel with this patch applied
>> > and also unset CONFIG_DEBUG_PAGEALLOC.  Currently running testcrash.sh
>> > loop 1000 as I am unable to reproduce this mmu BUG 1872 in
>> > testcrash.sh loop 100.  Please note that when CONFIG_DEBUG_PAGEALLOC
>> > is unset, I can reproduce this mmu BUG 1872 easily within <50
>> > testcrash.sh loop cycle with PVOPS version 2.6.32.24 to 2.6.32.36
>> > kernel.  Now test with this backport patch to see whether I can
>> > reproduce this mmu BUG... ...
>> >
>> > Kindest regards,
>> > Giam Teck Choon
>> >
>>
>> I have tested with my backport patch and it is working fine as I am
>> unable to reproduce the mmu.c 1872 or 1860 bug with
>> CONFIG_DEBUG_PAGEALLOC not set. I tested with testcrash.sh loop 100
>> and 1000. Now doing testcrash.sh loop 10000.
>>
>> Xiaoyun, is it possible for you to test my patch and see whether can
>> you reproduce the mmu.c 1872/1860 bug?
>>
>> Can anyone of you review my patch?
>>
>> I will post a format patch according to
>> Documentation/SubmittingPatches in my next reply and hopefully can be
>> reviewed.
>>
>> Thanks.
>>
>> Kindest regards,
>> Giam Teck Choon
>

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: kernel BUG at arch/x86/xen/mmu.c:1872
  2011-04-11 12:31           ` MaoXiaoyun
  2011-04-11 15:25             ` Teck Choon Giam
@ 2011-04-11 18:08             ` Jeremy Fitzhardinge
  2011-04-12  3:35               ` MaoXiaoyun
  2011-04-12 16:32               ` kernel BUG at arch/x86/xen/mmu.c:1872 Teck Choon Giam
  1 sibling, 2 replies; 41+ messages in thread
From: Jeremy Fitzhardinge @ 2011-04-11 18:08 UTC (permalink / raw)
  To: MaoXiaoyun
  Cc: xen devel, keir, ian.campbell, konrad.wilk, giamteckchoon, dave

On 04/11/2011 05:31 AM, MaoXiaoyun wrote:
> Hi:
>
> I believe this is the fix at much extent.
> Since I have my own test cases which with this patch, my test case
> will success in 30 rounds run.
> Every round takes 8hours. While without this patch, tests fail evey
> round in 15minutes.
>
> So this really means fix most of the things.
>
> But during running, I met another crash, from the log it it looks like
> has relation with
> this BUG, since the crash log shows it is tlb related and this BUG
> also tlb related.
>
> Well, I'm also have poor knowledge of kernel.
> Hope someone from Xen Devel offer some help.

Thanks for confirming; it makes sense and explains the symptoms, so I'm
glad it also works ;)


J

^ permalink raw reply	[flat|nested] 41+ messages in thread

* RE: kernel BUG at arch/x86/xen/mmu.c:1872
  2011-04-11 15:25             ` Teck Choon Giam
@ 2011-04-12  3:30               ` MaoXiaoyun
  2011-04-12 16:08                 ` Teck Choon Giam
  0 siblings, 1 reply; 41+ messages in thread
From: MaoXiaoyun @ 2011-04-12  3:30 UTC (permalink / raw)
  To: giamteckchoon; +Cc: jeremy, xen devel, keir, ian.campbell, konrad.wilk, dave


[-- Attachment #1.1: Type: text/plain, Size: 4966 bytes --]


Hi:
 
       I have just kicked off the cpuidle=0 "cpufreq=none" tests.
 
       What is your Xen version?  Do you use the backend driver of 2.6.32.36?
 
       Besides the "TLB BUG", I've met at least two other issues:
       1) Xen 4.0.1 + 2.6.32.36 kernel + backend driver from 2.6.31  ==> causes "Bad grant reference" logs in the serial output
       2) Xen 4.0.1 + 2.6.32.36 kernel with its own backend driver   ==> causes disk errors like the ones below.
 
sd 0:0:0:0: rejecting I/O to offline device
sd 0:0:0:0: rejecting I/O to offline device
sd 0:0:0:0: rejecting I/O to offline device
sd 0:0:0:0: rejecting I/O to offline device
sd 0:0:0:0: rejecting I/O to offline device
sd 0:0:0:0: rejecting I/O to offline device
sd 0:0:0:0: rejecting I/O to offline device
sd 0:0:0:0: rejecting I/O to offline device
sd 0:0:0:0: rejecting I/O to offline device
sd 0:0:0:0: rejecting I/O to offline device
sd 0:0:0:0: rejecting I/O to offline device
sd 0:0:0:0: rejecting I/O to offline device
sd 0:0:0:0: rejecting I/O to offline device
end_request: I/O error, dev tdb, sector 28699593
end_request: I/O error, dev tdb, sector 28699673
end_request: I/O error, dev tdb, sector 28699753
end_request: I/O error, dev tdb, sector 28699833
end_request: I/O error, dev tdb, sector 28699913
end_request: I/O error, dev tdb, sector 28699993
end_request: I/O error, dev tdb, sector 28700073
     
    thanks.
 
 
> Date: Mon, 11 Apr 2011 23:25:19 +0800
> Subject: Re: kernel BUG at arch/x86/xen/mmu.c:1872
> From: giamteckchoon@gmail.com
> To: tinnycloud@hotmail.com
> CC: xen-devel@lists.xensource.com; dave@ivt.com.au; ian.campbell@citrix.com; konrad.wilk@oracle.com; jeremy@goop.org; keir@xen.org
> 
> 2011/4/11 MaoXiaoyun <tinnycloud@hotmail.com>:
> > Hi:
> >
> >      I believe this is the fix at much extent.
> >      Since I have my own test cases which with this patch, my test case will
> > success in 30 rounds run.
> >      Every round takes 8hours.  While without this patch, tests fail evey
> > round in 15minutes.
> >
> >       So this really means fix most of the things.
> >
> >       But during running, I met another crash, from the log it it looks like
> > has relation with
> > this BUG, since the crash log shows it is tlb related and this BUG also tlb
> > related.
> 
> Are you able to run another test with cpuidle=0 cpufreq=none in kernel
> boot option? Just curious whether can you reproduce the tlb bug when
> you boot with cpuidle=0 cpufreq=none... ...
> 
> >
> >       Well, I'm also have poor knowledge of kernel.
> >       Hope someone from Xen Devel offer some help.
> >
> >       Many thanks.
> >
> >> Date: Mon, 11 Apr 2011 20:16:53 +0800
> >> Subject: Re: kernel BUG at arch/x86/xen/mmu.c:1872
> >> From: giamteckchoon@gmail.com
> >> To: tinnycloud@hotmail.com
> >> CC: xen-devel@lists.xensource.com; dave@ivt.com.au;
> >> ian.campbell@citrix.com; konrad.wilk@oracle.com; jeremy@goop.org;
> >> keir@xen.org
> >>
> >> >
> >> > Hi,
> >> >
> >> > Sorry, since this mmu related BUG has been troubled me for very
> >> > long... I really want to "kill" this BUG but my knowledge in kernel
> >> > hacking and/or xen is very limited.
> >> >
> >> > While waiting for Jeremy or Konrad or others ...
> >> >
> >> > Many thanks for spending time to track down this mmu related BUG.  I
> >> > have backported the commit from
> >> >
> >> > http://git.kernel.org/?p=linux/kernel/git/jeremy/xen.git;a=commit;h=64141da587241301ce8638cc945f8b67853156ec
> >> > to 2.6.32.36 PVOPS kernel and patch attached.  I won't know whether
> >> > did I backport it correctly nor does it affects anything.  I am
> >> > currently testing the 2.6.32.36 PVOPS kernel with this patch applied
> >> > and also unset CONFIG_DEBUG_PAGEALLOC.  Currently running testcrash.sh
> >> > loop 1000 as I am unable to reproduce this mmu BUG 1872 in
> >> > testcrash.sh loop 100.  Please note that when CONFIG_DEBUG_PAGEALLOC
> >> > is unset, I can reproduce this mmu BUG 1872 easily within <50
> >> > testcrash.sh loop cycle with PVOPS version 2.6.32.24 to 2.6.32.36
> >> > kernel.  Now test with this backport patch to see whether I can
> >> > reproduce this mmu BUG... ...
> >> >
> >> > Kindest regards,
> >> > Giam Teck Choon
> >> >
> >>
> >> I have tested with my backport patch and it is working fine as I am
> >> unable to reproduce the mmu.c 1872 or 1860 bug with
> >> CONFIG_DEBUG_PAGEALLOC not set. I tested with testcrash.sh loop 100
> >> and 1000. Now doing testcrash.sh loop 10000.
> >>
> >> Xiaoyun, is it possible for you to test my patch and see whether can
> >> you reproduce the mmu.c 1872/1860 bug?
> >>
> >> Can anyone of you review my patch?
> >>
> >> I will post a format patch according to
> >> Documentation/SubmittingPatches in my next reply and hopefully can be
> >> reviewed.
> >>
> >> Thanks.
> >>
> >> Kindest regards,
> >> Giam Teck Choon
> >
 		 	   		  

[-- Attachment #1.2: Type: text/html, Size: 7337 bytes --]


^ permalink raw reply	[flat|nested] 41+ messages in thread

* RE: kernel BUG at arch/x86/xen/mmu.c:1872
  2011-04-11 18:08             ` Jeremy Fitzhardinge
@ 2011-04-12  3:35               ` MaoXiaoyun
  2011-04-12  6:48                 ` Grant Table Error on 2.6.32.36 + Xen 4.0.1 MaoXiaoyun
  2011-04-12  9:11                 ` Kernel BUG at arch/x86/mm/tlb.c:61 MaoXiaoyun
  2011-04-12 16:32               ` kernel BUG at arch/x86/xen/mmu.c:1872 Teck Choon Giam
  1 sibling, 2 replies; 41+ messages in thread
From: MaoXiaoyun @ 2011-04-12  3:35 UTC (permalink / raw)
  To: jeremy; +Cc: xen devel, giamteckchoon, ian.campbell, dave, konrad.wilk


[-- Attachment #1.1: Type: text/plain, Size: 1294 bytes --]


Thanks for your reply and confirmation.
 
Well, what's your opinion of the TLB bug?
Is it related to this patch, or is it a new bug?
 
Attached is the new log I've got from the 28-machine test; one machine crashed.
 
> Date: Mon, 11 Apr 2011 11:08:10 -0700
> From: jeremy@goop.org
> To: tinnycloud@hotmail.com
> CC: giamteckchoon@gmail.com; xen-devel@lists.xensource.com; dave@ivt.com.au; ian.campbell@citrix.com; konrad.wilk@oracle.com; keir@xen.org
> Subject: Re: kernel BUG at arch/x86/xen/mmu.c:1872
> 
> On 04/11/2011 05:31 AM, MaoXiaoyun wrote:
> > Hi:
> >
> > I believe this is the fix at much extent.
> > Since I have my own test cases which with this patch, my test case
> > will success in 30 rounds run.
> > Every round takes 8hours. While without this patch, tests fail evey
> > round in 15minutes.
> >
> > So this really means fix most of the things.
> >
> > But during running, I met another crash, from the log it it looks like
> > has relation with
> > this BUG, since the crash log shows it is tlb related and this BUG
> > also tlb related.
> >
> > Well, I'm also have poor knowledge of kernel.
> > Hope someone from Xen Devel offer some help.
> 
> Thanks for confirming; it makes sense and explains the symptoms, so I'm
> glad it also works ;)
> 
> 
> J
 		 	   		  

[-- Attachment #1.2: Type: text/html, Size: 1745 bytes --]

[-- Attachment #2: 195.31.txt --]
[-- Type: text/plain, Size: 6940 bytes --]

_ratelimit: 62 callbacks suppressed
blktap_sysfs_create: adding attributes for dev ffff88009bdb6000
blktap_sysfs_create: adding attributes for dev ffff88009bdb2200
INIT: Id "s0" respawning too fast: disabled for 5 minutes
__ratelimit: 14 callbacks suppressed
blktap_sysfs_destroy
blktap_sysfs_destroy
------------[ cut here ]------------
kernel BUG at arch/x86/mm/tlb.c:61!
invalid opcode: 0000 [#1] SMP 
last sysfs file: /sys/devices/system/xen_memory/xen_memory0/info/current_kb
CPU 1 
Modules linked in: 8021q garp xen_netback xen_blkback blktap blkback_pagemap nbd bridge stp llc autofs4 ipmi_devintf ipmi_si ipmi_msghandler lockd sunrpc bonding ipv6 xenfs dm_multipath video output sbs sbshc parport_pc lp parport ses enclosure snd_seq_dummy snd_seq_oss snd_seq_midi_event snd_seq snd_seq_device serio_raw bnx2 snd_pcm_oss snd_mixer_oss snd_pcm snd_timer iTCO_wdt snd soundcore snd_page_alloc i2c_i801 iTCO_vendor_support i2c_core pcspkr pata_acpi ata_generic ata_piix shpchp mptsas mptscsih mptbase [last unloaded: freq_table]
Pid: 25581, comm: khelper Not tainted 2.6.32.36fixxen #1 Tecal RH2285          
RIP: e030:[<ffffffff8103a3cb>]  [<ffffffff8103a3cb>] leave_mm+0x15/0x46
RSP: e02b:ffff88002805be48  EFLAGS: 00010046
RAX: 0000000000000000 RBX: 0000000000000001 RCX: ffff88015f8e2da0
RDX: ffff88002805be78 RSI: 0000000000000000 RDI: 0000000000000001
RBP: ffff88002805be48 R08: ffff88009d662000 R09: dead000000200200
R10: dead000000100100 R11: ffffffff814472b2 R12: ffff88009bfc1880
R13: ffff880028063020 R14: 00000000000004f6 R15: 0000000000000000
FS:  00007f62362d66e0(0000) GS:ffff880028058000(0000) knlGS:0000000000000000
CS:  e033 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 0000003aabc11909 CR3: 000000009b8ca000 CR4: 0000000000002660
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process khelper (pid: 25581, threadinfo ffff88007691e000, task ffff88009b92db40)
Stack:
 ffff88002805be68 ffffffff8100e4ae 0000000000000001 ffff88009d733b88
<0> ffff88002805be98 ffffffff81087224 ffff88002805be78 ffff88002805be78
<0> ffff88015f808360 00000000000004f6 ffff88002805bea8 ffffffff81010108
Call Trace:
 <IRQ> 
 [<ffffffff8100e4ae>] drop_other_mm_ref+0x2a/0x53
 [<ffffffff81087224>] generic_smp_call_function_single_interrupt+0xd8/0xfc
 [<ffffffff81010108>] xen_call_function_single_interrupt+0x13/0x28
 [<ffffffff810a936a>] handle_IRQ_event+0x66/0x120
 [<ffffffff810aac5b>] handle_percpu_irq+0x41/0x6e
 [<ffffffff8128c1c0>] __xen_evtchn_do_upcall+0x1ab/0x27d
 [<ffffffff8128dd11>] xen_evtchn_do_upcall+0x33/0x46
 [<ffffffff81013efe>] xen_do_hypervisor_callback+0x1e/0x30
 <EOI> 
 [<ffffffff814472b2>] ? _spin_unlock_irqrestore+0x15/0x17
 [<ffffffff8100f8cf>] ? xen_restore_fl_direct_end+0x0/0x1
 [<ffffffff81113f71>] ? flush_old_exec+0x3ac/0x500
 [<ffffffff81150dc5>] ? load_elf_binary+0x0/0x17ef
 [<ffffffff81150dc5>] ? load_elf_binary+0x0/0x17ef
 [<ffffffff8115115d>] ? load_elf_binary+0x398/0x17ef
 [<ffffffff81042fcf>] ? need_resched+0x23/0x2d
 [<ffffffff811f4648>] ? process_measurement+0xc0/0xd7
 [<ffffffff81150dc5>] ? load_elf_binary+0x0/0x17ef
 [<ffffffff81113094>] ? search_binary_handler+0xc8/0x255
 [<ffffffff81114362>] ? do_execve+0x1c3/0x29e
 [<ffffffff8101155d>] ? sys_execve+0x43/0x5d
 [<ffffffff8106fc45>] ? __call_usermodehelper+0x0/0x6f
 [<ffffffff81013e28>] ? kernel_execve+0x68/0xd0
 [<ffffffff8106fc45>] ? __call_usermodehelper+0x0/0x6f
 [<ffffffff8100f8cf>] ? xen_restore_fl_direct_end+0x0/0x1
 [<ffffffff8106fb64>] ? ____call_usermodehelper+0x113/0x11e
 [<ffffffff81013daa>] ? child_rip+0xa/0x20
 [<ffffffff8106fc45>] ? __call_usermodehelper+0x0/0x6f
 [<ffffffff81012f91>] ? int_ret_from_sys_call+0x7/0x1b
 [<ffffffff8101371d>] ? retint_restore_args+0x5/0x6
 [<ffffffff81013da0>] ? child_rip+0x0/0x20
Code: 41 5e 41 5f c9 c3 55 48 89 e5 0f 1f 44 00 00 e8 17 ff ff ff c9 c3 55 48 89 e5 0f 1f 44 00 00 65 8b 04 25 c8 55 01 00 ff c8 75 04 <0f> 0b eb fe 65 48 8b 34 25 c0 55 01 00 48 81 c6 b8 02 00 00 e8 
RIP  [<ffffffff8103a3cb>] leave_mm+0x15/0x46
 RSP <ffff88002805be48>
---[ end trace ce9cee6832a9c503 ]---
Kernel panic - not syncing: Fatal exception in interrupt
Pid: 25581, comm: khelper Tainted: G      D    2.6.32.36fixxen #1
Call Trace:
 <IRQ>  [<ffffffff8105682e>] panic+0xe0/0x19a
 [<ffffffff8144008a>] ? init_amd+0x296/0x37a
 [<ffffffff8100f17d>] ? xen_force_evtchn_callback+0xd/0xf
 [<ffffffff8100f8e2>] ? check_events+0x12/0x20
 [<ffffffff8100f8cf>] ? xen_restore_fl_direct_end+0x0/0x1
 [<ffffffff81056487>] ? print_oops_end_marker+0x23/0x25
 [<ffffffff81448185>] oops_end+0xb6/0xc6
 [<ffffffff810166e5>] die+0x5a/0x63
 [<ffffffff81447a5c>] do_trap+0x115/0x124
 [<ffffffff810148e6>] do_invalid_op+0x9c/0xa5
 [<ffffffff8103a3cb>] ? leave_mm+0x15/0x46
 [<ffffffff8100f6fa>] ? xen_clocksource_read+0x21/0x23
 [<ffffffff8100f26c>] ? HYPERVISOR_vcpu_op+0xf/0x11
 [<ffffffff8100f767>] ? xen_vcpuop_set_next_event+0x52/0x67
 [<ffffffff81080bfa>] ? clockevents_program_event+0x78/0x81
 [<ffffffff81013b3b>] invalid_op+0x1b/0x20
 [<ffffffff814472b2>] ? _spin_unlock_irqrestore+0x15/0x17
 [<ffffffff8103a3cb>] ? leave_mm+0x15/0x46
 [<ffffffff8100e4ae>] drop_other_mm_ref+0x2a/0x53
 [<ffffffff81087224>] generic_smp_call_function_single_interrupt+0xd8/0xfc
 [<ffffffff81010108>] xen_call_function_single_interrupt+0x13/0x28
 [<ffffffff810a936a>] handle_IRQ_event+0x66/0x120
 [<ffffffff810aac5b>] handle_percpu_irq+0x41/0x6e
 [<ffffffff8128c1c0>] __xen_evtchn_do_upcall+0x1ab/0x27d
 [<ffffffff8128dd11>] xen_evtchn_do_upcall+0x33/0x46
 [<ffffffff81013efe>] xen_do_hypervisor_callback+0x1e/0x30
 <EOI>  [<ffffffff814472b2>] ? _spin_unlock_irqrestore+0x15/0x17
 [<ffffffff8100f8cf>] ? xen_restore_fl_direct_end+0x0/0x1
 [<ffffffff81113f71>] ? flush_old_exec+0x3ac/0x500
 [<ffffffff81150dc5>] ? load_elf_binary+0x0/0x17ef
 [<ffffffff81150dc5>] ? load_elf_binary+0x0/0x17ef
 [<ffffffff8115115d>] ? load_elf_binary+0x398/0x17ef
 [<ffffffff81042fcf>] ? need_resched+0x23/0x2d
 [<ffffffff811f4648>] ? process_measurement+0xc0/0xd7
 [<ffffffff81150dc5>] ? load_elf_binary+0x0/0x17ef
 [<ffffffff81113094>] ? search_binary_handler+0xc8/0x255
 [<ffffffff81114362>] ? do_execve+0x1c3/0x29e
 [<ffffffff8101155d>] ? sys_execve+0x43/0x5d
 [<ffffffff8106fc45>] ? __call_usermodehelper+0x0/0x6f
 [<ffffffff81013e28>] ? kernel_execve+0x68/0xd0
 [<ffffffff8106fc45>] ? __call_usermodehelper+0x0/0x6f
 [<ffffffff8100f8cf>] ? xen_restore_fl_direct_end+0x0/0x1
 [<ffffffff8106fb64>] ? ____call_usermodehelper+0x113/0x11e
 [<ffffffff81013daa>] ? child_rip+0xa/0x20
 [<ffffffff8106fc45>] ? __call_usermodehelper+0x0/0x6f
 [<ffffffff81012f91>] ? int_ret_from_sys_call+0x7/0x1b
 [<ffffffff8101371d>] ? retint_restore_args+0x5/0x6
 [<ffffffff81013da0>] ? child_rip+0x0/0x20
(XEN) Domain 0 crashed: 'noreboot' set - not rebooting.


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Grant Table Error on 2.6.32.36 + Xen 4.0.1
  2011-04-12  3:35               ` MaoXiaoyun
@ 2011-04-12  6:48                 ` MaoXiaoyun
  2011-04-12  8:46                   ` Konrad Rzeszutek Wilk
  2011-04-12  9:11                 ` Kernel BUG at arch/x86/mm/tlb.c:61 MaoXiaoyun
  1 sibling, 1 reply; 41+ messages in thread
From: MaoXiaoyun @ 2011-04-12  6:48 UTC (permalink / raw)
  To: xen devel
  Cc: tim.deegan, george.dunlap, giamteckchoon, ian.campbell, keir.fraser


[-- Attachment #1.1: Type: text/plain, Size: 16628 bytes --]


Hi:
 
      We are just about to try the new kernel, but we have run into grant table errors.
        
     2.6.32.36 Kernel: http://git.kernel.org/?p=linux/kernel/git/jeremy/xen.git;a=commit;h=bb1a15e55ec665a64c8a9c6bd699b1f16ac01ff4
     Xen 4.0.1 http://xenbits.xen.org/hg/xen-4.0-testing.hg/rev/b536ebfba183
       
    Our test is simple: 24 HVMs (Win2003) on a single host, each HVM restarting in a loop every 15 minutes.
    Please refer to the error log from the serial output.
 
    I've traced the log a bit; the messages come from xen/common/grant_table.c
 
1) The log " grant_table.c:1717:d0 Bad grant reference 4294965983 " is from

1715     if ( unlikely(gref >= nr_grant_entries(rd->grant_table)) ){
1716         PIN_FAIL(unlock_out, GNTST_bad_gntref,
1717                  "Bad grant reference %ld\n", gref);
1718         BUG();
1719     }
 
2) The log "grant_table.c:266:d0 Bad flags (0) or dom (0). (expected dom 0)" is from
 
  grant_table.c:1967 =>  __acquire_grant_for_copy  => _set_status
 
 (not from __gnttab_map_grant_ref, since I added some logging to confirm this)

The log shows that these all come from gnttab_copy; I later found that only netback
uses the grant copy hypercall.
 
I also tried the netback code from 2.6.31 (which works well with kernel 2.6.31), but
still met these errors.  So it looks like it is kernel related.
 
What is causing this, and will it be harmful for the usage of the HVMs?
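
By the way, the bad reference value 4294965983 decodes to 0xFFFFFADF (-1313 as a
signed 32-bit integer), and the 4294901760 / 4294901765 references seen in the
earlier grant table log are 0xFFFF0000 / 0xFFFF0005, i.e. they look like small
negative numbers or 0xFFFFxxxx patterns rather than plausible grant-table
indices; I am not sure whether that is a useful hint.  A quick standalone check
(my own illustration, not Xen or kernel code):

#include <stdio.h>
#include <inttypes.h>

int main(void)
{
	/* "Bad grant reference" values copied from the serial logs. */
	const uint32_t grefs[] = { 4294965983u, 4294901760u, 4294901765u };
	int i;

	for (i = 0; i < (int)(sizeof(grefs) / sizeof(grefs[0])); i++)
		printf("%10" PRIu32 " = 0x%08" PRIX32 " (as int32: %" PRId32 ")\n",
		       grefs[i], grefs[i], (int32_t)grefs[i]);
	return 0;
}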
 
Many thanks.
 
 ======================================
(XEN) Xen trace buffers: disabled
(XEN) Std. Loglevel: Errors and warnings
(XEN) Guest Loglevel: Nothing (Rate-limited: Errors and warnings)
(XEN) Xen is relinquishing VGA console.
(XEN) *** Serial input -> DOM0 (type 'CTRL-a' three times to switch input to Xen)
(XEN) Freed 168kB init memory.
(XEN) grant_table.c:1717:d0 Bad grant reference 4294965983
(XEN) grant_table.c:1717:d0 Bad grant reference 4294965983
(XEN) grant_table.c:1717:d0 Bad grant reference 4294965983
(XEN) grant_table.c:1717:d0 Bad grant reference 4294965983
(XEN) grant_table.c:1717:d0 Bad grant reference 4294965983
(XEN) grant_table.c:1717:d0 Bad grant reference 4294965983
(XEN) grant_table.c:1717:d0 Bad grant reference 4294965983
(XEN) grant_table.c:1717:d0 Bad grant reference 4294965983
(XEN) grant_table.c:1717:d0 Bad grant reference 4294965983
(XEN) grant_table.c:1717:d0 Bad grant reference 4294965983
(XEN) printk: 3 messages suppressed.
(XEN) grant_table.c:1717:d0 Bad grant reference 4294965983
(XEN) printk: 3 messages suppressed.
(XEN) grant_table.c:1717:d0 Bad grant reference 4294965983
(XEN) printk: 3 messages suppressed.
(XEN) grant_table.c:1717:d0 Bad grant reference 4294965983
(XEN) printk: 1 messages suppressed.
(XEN) grant_table.c:1717:d0 Bad grant reference 4294965983
(XEN) printk: 1 messages suppressed.
(XEN) grant_table.c:1717:d0 Bad grant reference 4294965983
(XEN) grant_table.c:1717:d0 Bad grant reference 4294965983
(XEN) printk: 1 messages suppressed.
(XEN) grant_table.c:1717:d0 Bad grant reference 4294965983
(XEN) printk: 1 messages suppressed.
(XEN) grant_table.c:1717:d0 Bad grant reference 4294965983
(XEN) printk: 1 messages suppressed.
(XEN) grant_table.c:1717:d0 Bad grant reference 4294965983
(XEN) printk: 1 messages suppressed.
(XEN) grant_table.c:266:d0 Bad flags (0) or dom (0). (expected dom 0)
(XEN) printk: 3 messages suppressed.
(XEN) grant_table.c:266:d0 Bad flags (0) or dom (0). (expected dom 0)
(XEN) printk: 17 messages suppressed.
(XEN) grant_table.c:266:d0 Bad flags (0) or dom (0). (expected dom 0)
(XEN) printk: 13 messages suppressed.
(XEN) grant_table.c:266:d0 Bad flags (0) or dom (0). (expected dom 0)
(XEN) printk: 11 messages suppressed.
(XEN) grant_table.c:1717:d0 Bad grant reference 4294965983
(XEN) printk: 11 messages suppressed.
(XEN) grant_table.c:266:d0 Bad flags (0) or dom (0). (expected dom 0)
(XEN) grant_table.c:266:d0 Bad flags (0) or dom (0). (expected dom 0)
(XEN) printk: 10 messages suppressed.
(XEN) grant_table.c:266:d0 Bad flags (0) or dom (0). (expected dom 0)
(XEN) printk: 3 messages suppressed.
(XEN) grant_table.c:266:d0 Bad flags (0) or dom (0). (expected dom 0)
(XEN) grant_table.c:266:d0 Bad flags (0) or dom (0). (expected dom 0)
(XEN) printk: 2 messages suppressed.
(XEN) grant_table.c:266:d0 Bad flags (0) or dom (0). (expected dom 0)
(XEN) grant_table.c:266:d0 Bad flags (0) or dom (0). (expected dom 0)
(XEN) grant_table.c:1717:d0 Bad grant reference 4294965983
(XEN) grant_table.c:1717:d0 Bad grant reference 4294965983
(XEN) grant_table.c:266:d0 Bad flags (0) or dom (0). (expected dom 0)
(XEN) grant_table.c:266:d0 Bad flags (0) or dom (0). (expected dom 0)
(XEN) grant_table.c:1717:d0 Bad grant reference 4294965983
(XEN) grant_table.c:1717:d0 Bad grant reference 4294965983
(XEN) grant_table.c:266:d0 Bad flags (0) or dom (0). (expected dom 0)
(XEN) grant_table.c:266:d0 Bad flags (0) or dom (0). (expected dom 0)
(XEN) printk: 6 messages suppressed.
(XEN) grant_table.c:266:d0 Bad flags (0) or dom (0). (expected dom 0)
(XEN) printk: 7 messages suppressed.
(XEN) grant_table.c:266:d0 Bad flags (0) or dom (0). (expected dom 0)
(XEN) grant_table.c:266:d0 Bad flags (0) or dom (0). (expected dom 0)
(XEN) printk: 2 messages suppressed.
(XEN) grant_table.c:266:d0 Bad flags (0) or dom (0). (expected dom 0)
(XEN) printk: 3 messages suppressed.
(XEN) grant_table.c:266:d0 Bad flags (0) or dom (0). (expected dom 0)
(XEN) grant_table.c:266:d0 Bad flags (0) or dom (0). (expected dom 0)
(XEN) printk: 2 messages suppressed.
(XEN) grant_table.c:266:d0 Bad flags (0) or dom (0). (expected dom 0)
(XEN) grant_table.c:266:d0 Bad flags (0) or dom (0). (expected dom 0)
(XEN) grant_table.c:1717:d0 Bad grant reference 4294965983
(XEN) grant_table.c:1717:d0 Bad grant reference 4294965983
(XEN) grant_table.c:266:d0 Bad flags (0) or dom (0). (expected dom 0)
(XEN) grant_table.c:266:d0 Bad flags (0) or dom (0). (expected dom 0)
(XEN) grant_table.c:1717:d0 Bad grant reference 4294965983
(XEN) grant_table.c:1717:d0 Bad grant reference 4294965983
(XEN) grant_table.c:266:d0 Bad flags (0) or dom (0). (expected dom 0)
(XEN) grant_table.c:1717:d0 Bad grant reference 4294965983
(XEN) printk: 10 messages suppressed.
(XEN) grant_table.c:266:d0 Bad flags (0) or dom (0). (expected dom 0)
(XEN) grant_table.c:266:d0 Bad flags (0) or dom (0). (expected dom 0)
(XEN) grant_table.c:1717:d0 Bad grant reference 4294965983
(XEN) grant_table.c:1717:d0 Bad grant reference 4294965983
(XEN) grant_table.c:266:d0 Bad flags (0) or dom (0). (expected dom 0)
(XEN) grant_table.c:266:d0 Bad flags (0) or dom (0). (expected dom 0)
(XEN) grant_table.c:1717:d0 Bad grant reference 4294965983
(XEN) grant_table.c:1717:d0 Bad grant reference 4294965983
(XEN) grant_table.c:266:d0 Bad flags (0) or dom (0). (expected dom 0)
(XEN) grant_table.c:266:d0 Bad flags (0) or dom (0). (expected dom 0)
(XEN) grant_table.c:1717:d0 Bad grant reference 4294965983
(XEN) grant_table.c:1717:d0 Bad grant reference 4294965983
(XEN) grant_table.c:266:d0 Bad flags (0) or dom (0). (expected dom 0)
(XEN) grant_table.c:266:d0 Bad flags (0) or dom (0). (expected dom 0)
(XEN) grant_table.c:1717:d0 Bad grant reference 4294965983
(XEN) printk: 1 messages suppressed.
(XEN) grant_table.c:266:d0 Bad flags (0) or dom (0). (expected dom 0)
(XEN) grant_table.c:266:d0 Bad flags (0) or dom (0). (expected dom 0)
(XEN) grant_table.c:1717:d0 Bad grant reference 4294965983
(XEN) grant_table.c:1717:d0 Bad grant reference 4294965983
(XEN) grant_table.c:266:d0 Bad flags (0) or dom (0). (expected dom 0)
(XEN) grant_table.c:266:d0 Bad flags (0) or dom (0). (expected dom 0)
(XEN) grant_table.c:1717:d0 Bad grant reference 4294965983
(XEN) grant_table.c:1717:d0 Bad grant reference 4294965983
(XEN) grant_table.c:266:d0 Bad flags (0) or dom (0). (expected dom 0)
(XEN) printk: 3 messages suppressed.
(XEN) grant_table.c:1717:d0 Bad grant reference 4294965983
(XEN) printk: 15 messages suppressed.
(XEN) grant_table.c:266:d0 Bad flags (0) or dom (0). (expected dom 0)
(XEN) printk: 7 messages suppressed.
(XEN) grant_table.c:266:d0 Bad flags (0) or dom (0). (expected dom 0)
(XEN) printk: 7 messages suppressed.
(XEN) grant_table.c:266:d0 Bad flags (0) or dom (0). (expected dom 0)
(XEN) printk: 3 messages suppressed.
(XEN) grant_table.c:266:d0 Bad flags (0) or dom (0). (expected dom 0)
(XEN) printk: 7 messages suppressed.
(XEN) grant_table.c:266:d0 Bad flags (0) or dom (0). (expected dom 0)
(XEN) printk: 3 messages suppressed.
(XEN) grant_table.c:266:d0 Bad flags (0) or dom (0). (expected dom 0)
(XEN) printk: 3 messages suppressed.
(XEN) grant_table.c:266:d0 Bad flags (0) or dom (0). (expected dom 0)
(XEN) printk: 3 messages suppressed.
(XEN) grant_table.c:266:d0 Bad flags (0) or dom (0). (expected dom 0)
(XEN) printk: 3 messages suppressed.
(XEN) grant_table.c:1717:d0 Bad grant reference 4294965983
(XEN) grant_table.c:1717:d0 Bad grant reference 4294965983
(XEN) printk: 8 messages suppressed.
(XEN) grant_table.c:1717:d0 Bad grant reference 4294965983
(XEN) printk: 15 messages suppressed.
(XEN) grant_table.c:1717:d0 Bad grant reference 4294965983
(XEN) printk: 7 messages suppressed.
(XEN) grant_table.c:1717:d0 Bad grant reference 4294965983
(XEN) printk: 29 messages suppressed.
(XEN) grant_table.c:266:d0 Bad flags (0) or dom (0). (expected dom 0)
(XEN) printk: 25 messages suppressed.
(XEN) grant_table.c:266:d0 Bad flags (0) or dom (0). (expected dom 0)
(XEN) printk: 25 messages suppressed.
(XEN) grant_table.c:266:d0 Bad flags (0) or dom (0). (expected dom 0)
(XEN) printk: 19 messages suppressed.
(XEN) grant_table.c:266:d0 Bad flags (0) or dom (0). (expected dom 0)
(XEN) printk: 27 messages suppressed.
(XEN) grant_table.c:266:d0 Bad flags (0) or dom (0). (expected dom 0)
(XEN) printk: 27 messages suppressed.
(XEN) grant_table.c:266:d0 Bad flags (0) or dom (0). (expected dom 0)
(XEN) printk: 7 messages suppressed.
(XEN) grant_table.c:266:d0 Bad flags (0) or dom (0). (expected dom 0)
(XEN) printk: 3 messages suppressed.
(XEN) grant_table.c:1717:d0 Bad grant reference 4294965983
(XEN) grant_table.c:1717:d0 Bad grant reference 4294965983
(XEN) grant_table.c:266:d0 Bad flags (0) or dom (0). (expected dom 0)
(XEN) printk: 5 messages suppressed.
(XEN) grant_table.c:266:d0 Bad flags (0) or dom (0). (expected dom 0)
(XEN) grant_table.c:1717:d0 Bad grant reference 4294965983
(XEN) printk: 10 messages suppressed.
(XEN) grant_table.c:266:d0 Bad flags (0) or dom (0). (expected dom 0)
(XEN) printk: 7 messages suppressed.
(XEN) grant_table.c:266:d0 Bad flags (0) or dom (0). (expected dom 0)
(XEN) grant_table.c:266:d0 Bad flags (0) or dom (0). (expected dom 0)
(XEN) printk: 2 messages suppressed.
(XEN) grant_table.c:1717:d0 Bad grant reference 4294965983
(XEN) printk: 3 messages suppressed.
(XEN) grant_table.c:1717:d0 Bad grant reference 4294965983
(XEN) printk: 15 messages suppressed.
(XEN) grant_table.c:1717:d0 Bad grant reference 4294965983
(XEN) printk: 7 messages suppressed.
(XEN) grant_table.c:1717:d0 Bad grant reference 4294965983
(XEN) grant_table.c:1717:d0 Bad grant reference 4294965983
(XEN) printk: 2 messages suppressed.
(XEN) grant_table.c:1717:d0 Bad grant reference 4294965983
(XEN) grant_table.c:1717:d0 Bad grant reference 4294965983
(XEN) grant_table.c:266:d0 Bad flags (0) or dom (0). (expected dom 0)
(XEN) printk: 1 messages suppressed.
(XEN) grant_table.c:266:d0 Bad flags (0) or dom (0). (expected dom 0)
(XEN) grant_table.c:266:d0 Bad flags (0) or dom (0). (expected dom 0)
(XEN) printk: 8 messages suppressed.
(XEN) grant_table.c:266:d0 Bad flags (0) or dom (0). (expected dom 0)
(XEN) grant_table.c:266:d0 Bad flags (0) or dom (0). (expected dom 0)
(XEN) grant_table.c:1717:d0 Bad grant reference 4294965983
(XEN) grant_table.c:1717:d0 Bad grant reference 4294965983
(XEN) grant_table.c:1717:d0 Bad grant reference 4294965983
(XEN) grant_table.c:1717:d0 Bad grant reference 4294965983
(XEN) grant_table.c:1717:d0 Bad grant reference 4294965983
(XEN) printk: 1 messages suppressed.
(XEN) grant_table.c:266:d0 Bad flags (0) or dom (0). (expected dom 0)
(XEN) printk: 3 messages suppressed.
(XEN) grant_table.c:266:d0 Bad flags (0) or dom (0). (expected dom 0)
(XEN) printk: 3 messages suppressed.
(XEN) grant_table.c:266:d0 Bad flags (0) or dom (0). (expected dom 0)
(XEN) grant_table.c:266:d0 Bad flags (0) or dom (0). (expected dom 0)
(XEN) grant_table.c:1717:d0 Bad grant reference 4294965983
(XEN) printk: 1 messages suppressed.
(XEN) grant_table.c:266:d0 Bad flags (0) or dom (0). (expected dom 0)
(XEN) grant_table.c:266:d0 Bad flags (0) or dom (0). (expected dom 0)
(XEN) grant_table.c:1717:d0 Bad grant reference 4294965983
(XEN) printk: 1 messages suppressed.
(XEN) grant_table.c:266:d0 Bad flags (0) or dom (0). (expected dom 0)
(XEN) printk: 3 messages suppressed.
(XEN) grant_table.c:266:d0 Bad flags (0) or dom (0). (expected dom 0)
(XEN) printk: 3 messages suppressed.
(XEN) grant_table.c:266:d0 Bad flags (0) or dom (0). (expected dom 0)
(XEN) grant_table.c:1717:d0 Bad grant reference 4294965983
(XEN) printk: 8 messages suppressed.
(XEN) grant_table.c:266:d0 Bad flags (0) or dom (0). (expected dom 0)
(XEN) printk: 9 messages suppressed.
(XEN) grant_table.c:266:d0 Bad flags (0) or dom (0). (expected dom 0)
(XEN) printk: 1 messages suppressed.
(XEN) grant_table.c:1717:d0 Bad grant reference 4294965983
(XEN) grant_table.c:1717:d0 Bad grant reference 4294965983
(XEN) grant_table.c:266:d0 Bad flags (0) or dom (0). (expected dom 0)
(XEN) grant_table.c:266:d0 Bad flags (0) or dom (0). (expected dom 0)
(XEN) grant_table.c:266:d0 Bad flags (0) or dom (0). (expected dom 0)
(XEN) grant_table.c:266:d0 Bad flags (0) or dom (0). (expected dom 0)
(XEN) grant_table.c:266:d0 Bad flags (0) or dom (0). (expected dom 0)
(XEN) grant_table.c:266:d0 Bad flags (0) or dom (0). (expected dom 0)
(XEN) grant_table.c:266:d0 Bad flags (0) or dom (0). (expected dom 0)
(XEN) grant_table.c:266:d0 Bad flags (0) or dom (0). (expected dom 0)
(XEN) grant_table.c:1717:d0 Bad grant reference 4294965983
(XEN) grant_table.c:1717:d0 Bad grant reference 4294965983
(XEN) grant_table.c:1717:d0 Bad grant reference 4294965983
(XEN) grant_table.c:266:d0 Bad flags (0) or dom (0). (expected dom 0)
(XEN) grant_table.c:266:d0 Bad flags (0) or dom (0). (expected dom 0)
(XEN) grant_table.c:266:d0 Bad flags (0) or dom (0). (expected dom 0)
(XEN) printk: 2 messages suppressed.
(XEN) grant_table.c:1717:d0 Bad grant reference 4294965983
(XEN) printk: 7 messages suppressed.
(XEN) grant_table.c:266:d0 Bad flags (0) or dom (0). (expected dom 0)
(XEN) printk: 5 messages suppressed.
(XEN) grant_table.c:266:d0 Bad flags (0) or dom (0). (expected dom 0)
(XEN) printk: 2 messages suppressed.
(XEN) grant_table.c:266:d0 Bad flags (0) or dom (0). (expected dom 0)
(XEN) printk: 2 messages suppressed.
(XEN) grant_table.c:1717:d0 Bad grant reference 4294965983
(XEN) printk: 1 messages suppressed.
(XEN) grant_table.c:1717:d0 Bad grant reference 4294965983
(XEN) printk: 3 messages suppressed.
(XEN) grant_table.c:266:d0 Bad flags (0) or dom (0). (expected dom 0)
(XEN) printk: 3 messages suppressed.
(XEN) grant_table.c:266:d0 Bad flags (0) or dom (0). (expected dom 0)
(XEN) printk: 1 messages suppressed.
(XEN) grant_table.c:266:d0 Bad flags (0) or dom (0). (expected dom 0)
(XEN) grant_table.c:266:d0 Bad flags (0) or dom (0). (expected dom 0)
(XEN) grant_table.c:266:d0 Bad flags (0) or dom (0). (expected dom 0)
(XEN) grant_table.c:266:d0 Bad flags (0) or dom (0). (expected dom 0)
(XEN) grant_table.c:266:d0 Bad flags (0) or dom (0). (expected dom 0)
(XEN) grant_table.c:266:d0 Bad flags (0) or dom (0). (expected dom 0)
(XEN) grant_table.c:266:d0 Bad flags (0) or dom (0). (expected dom 0)
(XEN) printk: 1 messages suppressed.
(XEN) grant_table.c:266:d0 Bad flags (0) or dom (0). (expected dom 0)
(XEN) printk: 1 messages suppressed.
(XEN) grant_table.c:266:d0 Bad flags (0) or dom (0). (expected dom 0)
(XEN) grant_table.c:266:d0 Bad flags (0) or dom (0). (expected dom 0)
(XEN) grant_table.c:578:d0 Iomem mapping not permitted ffffffffffffffff (domain 137)
(XEN) grant_table.c:578:d0 Iomem mapping not permitted ffffffffffffffff (domain 137)
(XEN) grant_table.c:578:d0 Iomem mapping not permitted ffffffffffffffff (domain 137)
(XEN) grant_table.c:578:d0 Iomem mapping not permitted ffffffffffffffff (domain 137)
(XEN) grant_table.c:578:d0 Iomem mapping not permitted ffffffffffffffff (domain 137)
 		 	   		  

[-- Attachment #1.2: Type: text/html, Size: 18335 bytes --]

[-- Attachment #2: Type: text/plain, Size: 138 bytes --]

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xensource.com
http://lists.xensource.com/xen-devel

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: Grant Table Error on 2.6.32.36 + Xen 4.0.1
  2011-04-12  6:48                 ` Grant Table Error on 2.6.32.36 + Xen 4.0.1 MaoXiaoyun
@ 2011-04-12  8:46                   ` Konrad Rzeszutek Wilk
  2011-04-12  9:02                     ` MaoXiaoyun
  0 siblings, 1 reply; 41+ messages in thread
From: Konrad Rzeszutek Wilk @ 2011-04-12  8:46 UTC (permalink / raw)
  To: MaoXiaoyun
  Cc: xen devel, tim.deegan, george.dunlap, giamteckchoon, keir.fraser,
	ian.campbell

On Tue, Apr 12, 2011 at 02:48:36PM +0800, MaoXiaoyun wrote:
> 
> Hi:
>  
>       We are just about to try the new kernel, but have run into a grant table error.

Please open a new thread on this one. This is getting confusing.
>         
>      2.6.32.36 Kernel: http://git.kernel.org/?p=linux/kernel/git/jeremy/xen.git;a=commit;h=bb1a15e55ec665a64c8a9c6bd699b1f16ac01ff4
>      Xen 4.0.1 http://xenbits.xen.org/hg/xen-4.0-testing.hg/rev/b536ebfba183
>        
>     Our test is simple: 24 HVMs (Win2003) on a single host, each HVM rebooting in a loop every 15 minutes.
>     Please refer to the error log from the serial output.
>               
>     I've traced the log a bit, and the log is from xen/common/grant_table.c
>  
> 1) log " grant_table.c:1717:d0 Bad grant reference 4294965983 " is from 
> 
> 1715     if ( unlikely(gref >= nr_grant_entries(rd->grant_table)) ){
> 1716         PIN_FAIL(unlock_out, GNTST_bad_gntref,
> 1717                  "Bad grant reference %ld\n", gref);
> 1718         BUG();
> 1719     }
>  
> 2) log "grant_table.c:266:d0 Bad flags (0) or dom (0). (expected dom 0) " is from 
>  
>   grant_table.c:1967 =>  __acquire_grant_for_copy  => _set_status
>  
> ( not from __gnttab_map_grant_ref, since I added some logging to identify this )
> 
> The log shows that all of these come from gnttab_copy, and I later found that only netback
> issues the grant copy hypercall. 
>  
> I also tried the netback code from 2.6.31 (which works well with kernel 2.6.31), but
> still hit these errors. So it looks like it is kernel related.
>  
> What is happening here, and will it be harmful to the HVM guests?

What is the storage for your HVM guests? iSCSI?

^ permalink raw reply	[flat|nested] 41+ messages in thread

* RE: Grant Table Error on 2.6.32.36 + Xen 4.0.1
  2011-04-12  8:46                   ` Konrad Rzeszutek Wilk
@ 2011-04-12  9:02                     ` MaoXiaoyun
  0 siblings, 0 replies; 41+ messages in thread
From: MaoXiaoyun @ 2011-04-12  9:02 UTC (permalink / raw)
  To: konrad.wilk
  Cc: xen devel, tim.deegan, george.dunlap, giamteckchoon, keir.fraser,
	ian.campbell


[-- Attachment #1.1: Type: text/plain, Size: 8727 bytes --]


Thanks Konrad.
 
I will start a new thread on the TLB bug.
For the grant table error, I added some debug logging in netback.c, at line 388. 
 
 358 static u16 netbk_gop_frag(struct xen_netif *netif, struct netbk_rx_meta *meta,
 359                           int i, struct netrx_pending_operations *npo,
 360                           struct page *page, unsigned long size,
 361                           unsigned long offset)
 362 {
 363         struct gnttab_copy *copy_gop;
 364         struct xen_netif_rx_request *req;
 365         unsigned long old_mfn;
 366         int idx = netif_page_index(page);
 367 
 368         old_mfn = virt_to_mfn(page_address(page));
 369 
 370         req = RING_GET_REQUEST(&netif->rx, netif->rx.req_cons + i);
 371 
 372         copy_gop = npo->copy + npo->copy_prod++;
 373         copy_gop->flags = GNTCOPY_dest_gref;
 374         if (idx > -1) {
 375                 struct pending_tx_info *src_pend = &pending_tx_info[idx];
 376                 copy_gop->source.domid = src_pend->netif->domid;
 377                 copy_gop->source.u.ref = src_pend->req.gref;
 378                 copy_gop->flags |= GNTCOPY_source_gref;
 379         } else {
 380                 copy_gop->source.domid = DOMID_SELF;
 381                 copy_gop->source.u.gmfn = old_mfn;
 382         }
 383         copy_gop->source.offset = offset;
 384         copy_gop->dest.domid = netif->domid;
 385         copy_gop->dest.offset = 0;
 386         copy_gop->dest.u.ref = req->gref;
 387         copy_gop->len = size;
 388         if(req->gref > 16384)
 389            IPRINTK("dom %d, req gref %d size = %lu\n", netif->domid, req->gref, size);
 390 
 391         return req->id;
 392 }
 
And the output below indicates something might be wrong with the grant table.
 
Apr 12 16:38:31 xmao kernel: xen_net: dom 23, req gref -1313 size = 270
Apr 12 16:38:31 xmao kernel: xen_net: dom 23, req gref -1313 size = 72
Apr 12 16:38:31 xmao kernel: xen_net: dom 14, req gref -1313 size = 270
Apr 12 16:38:31 xmao kernel: xen_net: dom 14, req gref -1313 size = 72
Apr 12 16:38:33 xmao kernel: xen_net: dom 23, req gref -1313 size = 270
Apr 12 16:38:33 xmao kernel: xen_net: dom 23, req gref -1313 size = 72
Apr 12 16:38:34 xmao kernel: xen_net: dom 23, req gref -1313 size = 270
Apr 12 16:38:34 xmao kernel: xen_net: dom 23, req gref -1313 size = 72
Apr 12 16:38:34 xmao kernel: xen_net: dom 14, req gref -1313 size = 270
Apr 12 16:38:35 xmao kernel: xen_net: dom 23, req gref -1313 size = 270
Apr 12 16:38:35 xmao kernel: xen_net: dom 23, req gref -1313 size = 72
Apr 12 16:38:40 xmao kernel: xen_net: dom 23, req gref -1313 size = 270
Apr 12 16:38:40 xmao kernel: xen_net: dom 23, req gref -1313 size = 72
Apr 12 16:38:42 xmao kernel: xen_net: dom 23, req gref -1313 size = 270
Apr 12 16:38:42 xmao kernel: xen_net: dom 23, req gref -1313 size = 72
Apr 12 16:38:44 xmao kernel: xen_net: dom 23, req gref -1313 size = 270
Apr 12 16:38:44 xmao kernel: xen_net: dom 23, req gref -1313 size = 72
Apr 12 16:38:57 xmao kernel: xen_net: dom 23, req gref -1313 size = 270
Apr 12 16:38:57 xmao kernel: xen_net: dom 23, req gref -1313 size = 72
Apr 12 16:38:59 xmao kernel: xen_net: dom 23, req gref -1313 size = 270
Apr 12 16:38:59 xmao kernel: xen_net: dom 23, req gref -1313 size = 72
Apr 12 16:38:59 xmao kernel: xen_net: dom 23, req gref -1313 size = 270
Apr 12 16:38:59 xmao kernel: xen_net: dom 23, req gref -1313 size = 72
Apr 12 16:39:22 xmao kernel: xen_net: dom 23, req gref -1313 size = 270
Apr 12 16:39:22 xmao kernel: xen_net: dom 23, req gref -1313 size = 72
Apr 12 16:39:26 xmao kernel: xen_net: dom 23, req gref -1313 size = 270
Apr 12 16:39:26 xmao kernel: xen_net: dom 23, req gref -1313 size = 72
Apr 12 16:39:29 xmao kernel: xen_net: dom 23, req gref -1313 size = 42
Apr 12 16:39:29 xmao kernel: xen_net: dom 14, req gref -1313 size = 42
Apr 12 16:39:29 xmao kernel: xen_net: dom 23, req gref -1313 size = 42
Apr 12 16:39:29 xmao kernel: xen_net: dom 14, req gref 5242956 size = 42
Apr 12 16:39:30 xmao kernel: xen_net: dom 23, req gref -1313 size = 42
Apr 12 16:39:30 xmao kernel: xen_net: dom 14, req gref 1817341261 size = 42
Apr 12 16:39:31 xmao kernel: xen_net: dom 23, req gref -1313 size = 42
Apr 12 16:39:31 xmao kernel: xen_net: dom 23, req gref -1313 size = 38
Apr 12 16:39:31 xmao kernel: xen_net: dom 23, req gref -1313 size = 72
Apr 12 16:39:31 xmao kernel: xen_net: dom 14, req gref -1313 size = 38
Apr 12 16:39:31 xmao kernel: xen_net: dom 14, req gref -1313 size = 72
Apr 12 16:39:32 xmao kernel: xen_net: dom 23, req gref -1313 size = 270
Apr 12 16:39:32 xmao kernel: xen_net: dom 23, req gref -1313 size = 72
Apr 12 16:39:32 xmao kernel: xen_net: dom 23, req gref -1313 size = 270
Apr 12 16:39:32 xmao kernel: xen_net: dom 23, req gref -1313 size = 72
Apr 12 16:39:32 xmao kernel: xen_net: dom 23, req gref -1313 size = 42
Apr 12 16:39:32 xmao kernel: xen_net: dom 14, req gref -1408 size = 42
Apr 12 16:39:32 xmao kernel: xen_net: dom 23, req gref -1313 size = 38
Apr 12 16:39:32 xmao kernel: xen_net: dom 23, req gref -1313 size = 72
Apr 12 16:39:32 xmao kernel: xen_net: dom 14, req gref -1408 size = 38
Apr 12 16:39:32 xmao kernel: xen_net: dom 14, req gref -1408 size = 72
Apr 12 16:39:33 xmao kernel: xen_net: dom 23, req gref -1313 size = 42
Apr 12 16:39:33 xmao kernel: xen_net: dom 14, req gref -1408 size = 42
Apr 12 16:39:33 xmao kernel: xen_net: dom 23, req gref -1313 size = 42
Apr 12 16:39:33 xmao kernel: xen_net: dom 14, req gref -1408 size = 42
Apr 12 16:39:33 xmao kernel: xen_net: dom 23, req gref -1313 size = 38
Apr 12 16:39:33 xmao kernel: xen_net: dom 23, req gref -1313 size = 72
Apr 12 16:39:33 xmao kernel: xen_net: dom 14, req gref 1850305869 size = 38
Apr 12 16:39:33 xmao kernel: xen_net: dom 23, req gref -1313 size = 42
Apr 12 16:39:34 xmao kernel: xen_net: dom 23, req gref -1313 size = 38
Apr 12 16:39:34 xmao kernel: xen_net: dom 23, req gref -1313 size = 72
Apr 12 16:39:34 xmao kernel: xen_net: dom 14, req gref -1313 size = 38
Apr 12 16:39:34 xmao kernel: xen_net: dom 14, req gref -1313 size = 72
Apr 12 16:39:34 xmao kernel: xen_net: dom 23, req gref -1313 size = 42
Apr 12 16:39:34 xmao kernel: xen_net: dom 14, req gref -1313 size = 42
Apr 12 16:39:35 xmao kernel: xen_net: dom 23, req gref -1313 size = 270
Apr 12 16:39:35 xmao kernel: xen_net: dom 23, req gref -1313 size = 72
Apr 12 16:39:35 xmao kernel: xen_net: dom 23, req gref -1313 size = 38
Apr 12 16:39:35 xmao kernel: xen_net: dom 23, req gref -1313 size = 72
Apr 12 16:39:35 xmao kernel: xen_net: dom 23, req gref -1313 size = 38
Apr 12 16:39:35 xmao kernel: xen_net: dom 23, req gref -1313 size = 72
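 
(A note on reading the numbers above: the IPRINTK at line 389 formats req->gref with %d, so an out-of-range grant reference shows up as a small negative number, while Xen prints the same 32-bit value unsigned: -1313 is exactly 4294965983, and -1408 is 4294965888. A minimal standalone C sketch of the conversion, not part of the driver:

    #include <stdio.h>
    #include <stdint.h>

    int main(void)
    {
            /* the value Xen reports as "Bad grant reference 4294965983" */
            uint32_t gref = 4294965983u;

            /* netback's debug printk uses %d, so the same bits print signed */
            printf("unsigned: %u  signed: %d\n", gref, (int32_t)gref);
            /* output: unsigned: 4294965983  signed: -1313 */
            return 0;
    }

Either way the reference is far beyond nr_grant_entries() for the domain, which is why the hypervisor rejects it.)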
 
> Date: Tue, 12 Apr 2011 04:46:29 -0400
> From: konrad.wilk@oracle.com
> To: tinnycloud@hotmail.com
> CC: xen-devel@lists.xensource.com; tim.deegan@citrix.com; george.dunlap@eu.citrix.com; giamteckchoon@gmail.com; ian.campbell@citrix.com; keir.fraser@eu.citrix.com
> Subject: Re: [Xen-devel] Grant Table Error on 2.6.32.36 + Xen 4.0.1
> 
> On Tue, Apr 12, 2011 at 02:48:36PM +0800, MaoXiaoyun wrote:
> > 
> > Hi:
> > 
> > We are just about to try the new kernel, but have run into a grant table error.
> 
> Please open a new thread on this one. This is getting confusing.
> > 
> > 2.6.32.36 Kernel: http://git.kernel.org/?p=linux/kernel/git/jeremy/xen.git;a=commit;h=bb1a15e55ec665a64c8a9c6bd699b1f16ac01ff4
> > Xen 4.0.1 http://xenbits.xen.org/hg/xen-4.0-testing.hg/rev/b536ebfba183
> > 
> > Our test is simple: 24 HVMs (Win2003) on a single host, each HVM rebooting in a loop every 15 minutes.
> > Please refer to the error log from the serial output.
> > 
> > I've traced the log a bit, and the log is from xen/common/grant_table.c
> > 
> > 1) log " grant_table.c:1717:d0 Bad grant reference 4294965983 " is from 
> > 
> > 1715     if ( unlikely(gref >= nr_grant_entries(rd->grant_table)) ){
> > 1716         PIN_FAIL(unlock_out, GNTST_bad_gntref,
> > 1717                  "Bad grant reference %ld\n", gref);
> > 1718         BUG();
> > 1719     }
> > 
> > 2) log "grant_table.c:266:d0 Bad flags (0) or dom (0). (expected dom 0) " is from 
> > 
> > grant_table.c:1967 => __acquire_grant_for_copy => _set_status
> > 
> > ( not from __gnttab_map_grant_ref, since I added some logging to identify this )
> > 
> > The log shows that all of these come from gnttab_copy, and I later found that only netback
> > issues the grant copy hypercall. 
> > 
> > I also tried the netback code from 2.6.31 (which works well with kernel 2.6.31), but
> > still hit these errors. So it looks like it is kernel related.
> > 
> > What is happening here, and will it be harmful to the HVM guests?
> 
> What is the storage for your HVM guests? iSCSI?
 		 	   		  

[-- Attachment #1.2: Type: text/html, Size: 11480 bytes --]

[-- Attachment #2: Type: text/plain, Size: 138 bytes --]

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xensource.com
http://lists.xensource.com/xen-devel

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Kernel BUG at arch/x86/mm/tlb.c:61
  2011-04-12  3:35               ` MaoXiaoyun
  2011-04-12  6:48                 ` Grant Table Error on 2.6.32.36 + Xen 4.0.1 MaoXiaoyun
@ 2011-04-12  9:11                 ` MaoXiaoyun
  2011-04-12 10:00                   ` Konrad Rzeszutek Wilk
  1 sibling, 1 reply; 41+ messages in thread
From: MaoXiaoyun @ 2011-04-12  9:11 UTC (permalink / raw)
  To: xen devel; +Cc: jeremy, giamteckchoon, konrad.wilk


[-- Attachment #1.1: Type: text/plain, Size: 7435 bytes --]


Hi :
 
  We are using pvops kernel 2.6.32.36 + Xen 4.0.1, but have hit a kernel panic bug.
 
  2.6.32.36 Kernel: http://git.kernel.org/?p=linux/kernel/git/jeremy/xen.git;a=commit;h=bb1a15e55ec665a64c8a9c6bd699b1f16ac01ff4
  Xen 4.0.1 http://xenbits.xen.org/hg/xen-4.0-testing.hg/rev/b536ebfba183  
 
  Our test is simple: 24 HVMs (Win2003) on a single host, each HVM rebooting in a loop every 15 minutes.
  About 17 machines are involved in the test; after a 10-hour run, one hit a crash at arch/x86/mm/tlb.c:61.
 
  Currently I am trying "cpuidle=0 cpufreq=none" tests based on Teck's suggestion.
 
  Any comments, thanks. 
 
===============crash log==========================
INIT: Id "s0" respawning too fast: disabled for 5 minutes
__ratelimit: 14 callbacks suppressed
blktap_sysfs_destroy
blktap_sysfs_destroy
------------[ cut here ]------------
kernel BUG at arch/x86/mm/tlb.c:61!
invalid opcode: 0000 [#1] SMP 
last sysfs file: /sys/devices/system/xen_memory/xen_memory0/info/current_kb
CPU 1 
Modules linked in: 8021q garp xen_netback xen_blkback blktap blkback_pagemap nbd bridge stp llc autofs4 ipmi_devintf ipmi_si ipmi_msghandler lockd sunrpc bonding ipv6 xenfs dm_multipath video output sbs sbshc parport_pc lp parport ses enclosure snd_seq_dummy snd_seq_oss snd_seq_midi_event snd_seq snd_seq_device serio_raw bnx2 snd_pcm_oss snd_mixer_oss snd_pcm snd_timer iTCO_wdt snd soundcore snd_page_alloc i2c_i801 iTCO_vendor_support i2c_core pcspkr pata_acpi ata_generic ata_piix shpchp mptsas mptscsih mptbase [last unloaded: freq_table]
Pid: 25581, comm: khelper Not tainted 2.6.32.36fixxen #1 Tecal RH2285          
RIP: e030:[<ffffffff8103a3cb>]  [<ffffffff8103a3cb>] leave_mm+0x15/0x46
RSP: e02b:ffff88002805be48  EFLAGS: 00010046
RAX: 0000000000000000 RBX: 0000000000000001 RCX: ffff88015f8e2da0
RDX: ffff88002805be78 RSI: 0000000000000000 RDI: 0000000000000001
RBP: ffff88002805be48 R08: ffff88009d662000 R09: dead000000200200
R10: dead000000100100 R11: ffffffff814472b2 R12: ffff88009bfc1880
R13: ffff880028063020 R14: 00000000000004f6 R15: 0000000000000000
FS:  00007f62362d66e0(0000) GS:ffff880028058000(0000) knlGS:0000000000000000
CS:  e033 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 0000003aabc11909 CR3: 000000009b8ca000 CR4: 0000000000002660
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process khelper (pid: 25581, threadinfo ffff88007691e000, task ffff88009b92db40)
Stack:
 ffff88002805be68 ffffffff8100e4ae 0000000000000001 ffff88009d733b88
<0> ffff88002805be98 ffffffff81087224 ffff88002805be78 ffff88002805be78
<0> ffff88015f808360 00000000000004f6 ffff88002805bea8 ffffffff81010108
Call Trace:
 <IRQ> 
 [<ffffffff8100e4ae>] drop_other_mm_ref+0x2a/0x53
 [<ffffffff81087224>] generic_smp_call_function_single_interrupt+0xd8/0xfc
 [<ffffffff81010108>] xen_call_function_single_interrupt+0x13/0x28
 [<ffffffff810a936a>] handle_IRQ_event+0x66/0x120
 [<ffffffff810aac5b>] handle_percpu_irq+0x41/0x6e
 [<ffffffff8128c1c0>] __xen_evtchn_do_upcall+0x1ab/0x27d
 [<ffffffff8128dd11>] xen_evtchn_do_upcall+0x33/0x46
 [<ffffffff81013efe>] xen_do_hypervisor_callback+0x1e/0x30
 <EOI> 
 [<ffffffff814472b2>] ? _spin_unlock_irqrestore+0x15/0x17
 [<ffffffff8100f8cf>] ? xen_restore_fl_direct_end+0x0/0x1
 [<ffffffff81113f71>] ? flush_old_exec+0x3ac/0x500
 [<ffffffff81150dc5>] ? load_elf_binary+0x0/0x17ef
 [<ffffffff81150dc5>] ? load_elf_binary+0x0/0x17ef
 [<ffffffff8115115d>] ? load_elf_binary+0x398/0x17ef
 [<ffffffff81042fcf>] ? need_resched+0x23/0x2d
 [<ffffffff811f4648>] ? process_measurement+0xc0/0xd7
 [<ffffffff81150dc5>] ? load_elf_binary+0x0/0x17ef
 [<ffffffff81113094>] ? search_binary_handler+0xc8/0x255
 [<ffffffff81114362>] ? do_execve+0x1c3/0x29e
 [<ffffffff8101155d>] ? sys_execve+0x43/0x5d
 [<ffffffff8106fc45>] ? __call_usermodehelper+0x0/0x6f
 [<ffffffff81013e28>] ? kernel_execve+0x68/0xd0
 [<ffffffff8106fc45>] ? __call_usermodehelper+0x0/0x6f
 [<ffffffff8100f8cf>] ? xen_restore_fl_direct_end+0x0/0x1
 [<ffffffff8106fb64>] ? ____call_usermodehelper+0x113/0x11e
 [<ffffffff81013daa>] ? child_rip+0xa/0x20
 [<ffffffff8106fc45>] ? __call_usermodehelper+0x0/0x6f
 [<ffffffff81012f91>] ? int_ret_from_sys_call+0x7/0x1b
 [<ffffffff8101371d>] ? retint_restore_args+0x5/0x6
 [<ffffffff81013da0>] ? child_rip+0x0/0x20
Code: 41 5e 41 5f c9 c3 55 48 89 e5 0f 1f 44 00 00 e8 17 ff ff ff c9 c3 55 48 89 e5 0f 1f 44 00 00 65 8b 04 25 c8 55 01 00 ff c8 75 04 <0f> 0b eb fe 65 48 8b 34 25 c0 55 01 00 48 81 c6 b8 02 00 00 e8 
RIP  [<ffffffff8103a3cb>] leave_mm+0x15/0x46
 RSP <ffff88002805be48>
---[ end trace ce9cee6832a9c503 ]---
Kernel panic - not syncing: Fatal exception in interrupt
Pid: 25581, comm: khelper Tainted: G      D    2.6.32.36fixxen #1
Call Trace:
 <IRQ>  [<ffffffff8105682e>] panic+0xe0/0x19a
 [<ffffffff8144008a>] ? init_amd+0x296/0x37a
 [<ffffffff8100f17d>] ? xen_force_evtchn_callback+0xd/0xf
 [<ffffffff8100f8e2>] ? check_events+0x12/0x20
 [<ffffffff8100f8cf>] ? xen_restore_fl_direct_end+0x0/0x1
 [<ffffffff81056487>] ? print_oops_end_marker+0x23/0x25
 [<ffffffff81448185>] oops_end+0xb6/0xc6
 [<ffffffff810166e5>] die+0x5a/0x63
 [<ffffffff81447a5c>] do_trap+0x115/0x124
 [<ffffffff810148e6>] do_invalid_op+0x9c/0xa5
 [<ffffffff8103a3cb>] ? leave_mm+0x15/0x46
 [<ffffffff8100f6fa>] ? xen_clocksource_read+0x21/0x23
 [<ffffffff8100f26c>] ? HYPERVISOR_vcpu_op+0xf/0x11
 [<ffffffff8100f767>] ? xen_vcpuop_set_next_event+0x52/0x67
 [<ffffffff81080bfa>] ? clockevents_program_event+0x78/0x81
 [<ffffffff81013b3b>] invalid_op+0x1b/0x20
 [<ffffffff814472b2>] ? _spin_unlock_irqrestore+0x15/0x17
 [<ffffffff8103a3cb>] ? leave_mm+0x15/0x46
 [<ffffffff8100e4ae>] drop_other_mm_ref+0x2a/0x53
 [<ffffffff81087224>] generic_smp_call_function_single_interrupt+0xd8/0xfc
 [<ffffffff81010108>] xen_call_function_single_interrupt+0x13/0x28
 [<ffffffff810a936a>] handle_IRQ_event+0x66/0x120
 [<ffffffff810aac5b>] handle_percpu_irq+0x41/0x6e
 [<ffffffff8128c1c0>] __xen_evtchn_do_upcall+0x1ab/0x27d
 [<ffffffff8128dd11>] xen_evtchn_do_upcall+0x33/0x46
 [<ffffffff81013efe>] xen_do_hypervisor_callback+0x1e/0x30
 <EOI>  [<ffffffff814472b2>] ? _spin_unlock_irqrestore+0x15/0x17
 [<ffffffff8100f8cf>] ? xen_restore_fl_direct_end+0x0/0x1
 [<ffffffff81113f71>] ? flush_old_exec+0x3ac/0x500
 [<ffffffff81150dc5>] ? load_elf_binary+0x0/0x17ef
 [<ffffffff81150dc5>] ? load_elf_binary+0x0/0x17ef
 [<ffffffff8115115d>] ? load_elf_binary+0x398/0x17ef
 [<ffffffff81042fcf>] ? need_resched+0x23/0x2d
 [<ffffffff811f4648>] ? process_measurement+0xc0/0xd7
 [<ffffffff81150dc5>] ? load_elf_binary+0x0/0x17ef
 [<ffffffff81113094>] ? search_binary_handler+0xc8/0x255
 [<ffffffff81114362>] ? do_execve+0x1c3/0x29e
 [<ffffffff8101155d>] ? sys_execve+0x43/0x5d
 [<ffffffff8106fc45>] ? __call_usermodehelper+0x0/0x6f
 [<ffffffff81013e28>] ? kernel_execve+0x68/0xd0
 [<ffffffff8106fc45>] ? __call_usermodehelper+0x0/0x6f
 [<ffffffff8100f8cf>] ? xen_restore_fl_direct_end+0x0/0x1
 [<ffffffff8106fb64>] ? ____call_usermodehelper+0x113/0x11e
 [<ffffffff81013daa>] ? child_rip+0xa/0x20
 [<ffffffff8106fc45>] ? __call_usermodehelper+0x0/0x6f
 [<ffffffff81012f91>] ? int_ret_from_sys_call+0x7/0x1b
 [<ffffffff8101371d>] ? retint_restore_args+0x5/0x6
 [<ffffffff81013da0>] ? child_rip+0x0/0x20
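 
For reference, the check that fires at arch/x86/mm/tlb.c:61 here is the TLBSTATE_OK test at the top of leave_mm(). A rough sketch of the 2.6.32-era code (paraphrased, not copied verbatim from the tree under test):

    /* leave_mm() may only run once this CPU has put the mm into lazy
     * TLB mode.  Being called while cpu_tlbstate.state is still
     * TLBSTATE_OK means the drop_other_mm_ref IPI in the trace above
     * asked this CPU to drop an mm it still considers active. */
    void leave_mm(int cpu)
    {
            if (percpu_read(cpu_tlbstate.state) == TLBSTATE_OK)
                    BUG();          /* the BUG at tlb.c:61 above */
            cpumask_clear_cpu(cpu,
                              mm_cpumask(percpu_read(cpu_tlbstate.active_mm)));
            load_cr3(swapper_pg_dir);
    }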
  
  		 	   		  

[-- Attachment #1.2: Type: text/html, Size: 12935 bytes --]

[-- Attachment #2: Type: text/plain, Size: 138 bytes --]

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xensource.com
http://lists.xensource.com/xen-devel

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: Kernel BUG at arch/x86/mm/tlb.c:61
  2011-04-12  9:11                 ` Kernel BUG at arch/x86/mm/tlb.c:61 MaoXiaoyun
@ 2011-04-12 10:00                   ` Konrad Rzeszutek Wilk
  2011-04-12 10:10                     ` MaoXiaoyun
  2011-04-14  6:16                     ` MaoXiaoyun
  0 siblings, 2 replies; 41+ messages in thread
From: Konrad Rzeszutek Wilk @ 2011-04-12 10:00 UTC (permalink / raw)
  To: MaoXiaoyun; +Cc: jeremy, xen devel, giamteckchoon

On Tue, Apr 12, 2011 at 05:11:51PM +0800, MaoXiaoyun wrote:
> 
> Hi :
>  
>   We are using pvops kernel 2.6.32.36 + Xen 4.0.1, but have hit a kernel panic bug.
>  
>   2.6.32.36 Kernel: http://git.kernel.org/?p=linux/kernel/git/jeremy/xen.git;a=commit;h=bb1a15e55ec665a64c8a9c6bd699b1f16ac01ff4
>   Xen 4.0.1 http://xenbits.xen.org/hg/xen-4.0-testing.hg/rev/b536ebfba183  
>  
>   Our test is simple: 24 HVMs (Win2003) on a single host, each HVM rebooting in a loop every 15 minutes.

What is the storage that you are using for your guests? AoE? Local disks?

>   About 17 machines are involved in the test; after a 10-hour run, one hit a crash at arch/x86/mm/tlb.c:61.
>  
>   Currently I am trying "cpuidle=0 cpufreq=none" tests based on Teck's suggestion.
>  
>   Any comments, thanks. 
>  
> ===============crash log==========================
> INIT: Id "s0" respawning too fast: disabled for 5 minutes
> __ratelimit: 14 callbacks suppressed
> blktap_sysfs_destroy
> blktap_sysfs_destroy
> ------------[ cut here ]------------
> kernel BUG at arch/x86/mm/tlb.c:61!
> invalid opcode: 0000 [#1] SMP 
> last sysfs file: /sys/devices/system/xen_memory/xen_memory0/info/current_kb
> CPU 1 
> Modules linked in: 8021q garp xen_netback xen_blkback blktap blkback_pagemap nbd bridge stp llc autofs4 ipmi_devintf ipmi_si ipmi_msghandler lockd sunrpc bonding ipv6 xenfs dm_multipath video output sbs sbshc parport_pc lp parport ses enclosure snd_seq_dummy snd_seq_oss snd_seq_midi_event snd_seq snd_seq_device serio_raw bnx2 snd_pcm_oss snd_mixer_oss snd_pcm snd_timer iTCO_wdt snd soundcore snd_page_alloc i2c_i801 iTCO_vendor_support i2c_core pcspkr pata_acpi ata_generic ata_piix shpchp mptsas mptscsih mptbase [last unloaded: freq_table]
> Pid: 25581, comm: khelper Not tainted 2.6.32.36fixxen #1 Tecal RH2285          
> RIP: e030:[<ffffffff8103a3cb>]  [<ffffffff8103a3cb>] leave_mm+0x15/0x46
> RSP: e02b:ffff88002805be48  EFLAGS: 00010046
> RAX: 0000000000000000 RBX: 0000000000000001 RCX: ffff88015f8e2da0
> RDX: ffff88002805be78 RSI: 0000000000000000 RDI: 0000000000000001
> RBP: ffff88002805be48 R08: ffff88009d662000 R09: dead000000200200
> R10: dead000000100100 R11: ffffffff814472b2 R12: ffff88009bfc1880
> R13: ffff880028063020 R14: 00000000000004f6 R15: 0000000000000000
> FS:  00007f62362d66e0(0000) GS:ffff880028058000(0000) knlGS:0000000000000000
> CS:  e033 DS: 0000 ES: 0000 CR0: 000000008005003b
> CR2: 0000003aabc11909 CR3: 000000009b8ca000 CR4: 0000000000002660
> DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
> Process khelper (pid: 25581, threadinfo ffff88007691e000, task ffff88009b92db40)
> Stack:
>  ffff88002805be68 ffffffff8100e4ae 0000000000000001 ffff88009d733b88
> <0> ffff88002805be98 ffffffff81087224 ffff88002805be78 ffff88002805be78
> <0> ffff88015f808360 00000000000004f6 ffff88002805bea8 ffffffff81010108
> Call Trace:
>  <IRQ> 
>  [<ffffffff8100e4ae>] drop_other_mm_ref+0x2a/0x53
>  [<ffffffff81087224>] generic_smp_call_function_single_interrupt+0xd8/0xfc
>  [<ffffffff81010108>] xen_call_function_single_interrupt+0x13/0x28
>  [<ffffffff810a936a>] handle_IRQ_event+0x66/0x120
>  [<ffffffff810aac5b>] handle_percpu_irq+0x41/0x6e
>  [<ffffffff8128c1c0>] __xen_evtchn_do_upcall+0x1ab/0x27d
>  [<ffffffff8128dd11>] xen_evtchn_do_upcall+0x33/0x46
>  [<ffffffff81013efe>] xen_do_hypervisor_callback+0x1e/0x30
>  <EOI> 
>  [<ffffffff814472b2>] ? _spin_unlock_irqrestore+0x15/0x17
>  [<ffffffff8100f8cf>] ? xen_restore_fl_direct_end+0x0/0x1
>  [<ffffffff81113f71>] ? flush_old_exec+0x3ac/0x500
>  [<ffffffff81150dc5>] ? load_elf_binary+0x0/0x17ef
>  [<ffffffff81150dc5>] ? load_elf_binary+0x0/0x17ef
>  [<ffffffff8115115d>] ? load_elf_binary+0x398/0x17ef
>  [<ffffffff81042fcf>] ? need_resched+0x23/0x2d
>  [<ffffffff811f4648>] ? process_measurement+0xc0/0xd7
>  [<ffffffff81150dc5>] ? load_elf_binary+0x0/0x17ef
>  [<ffffffff81113094>] ? search_binary_handler+0xc8/0x255
>  [<ffffffff81114362>] ? do_execve+0x1c3/0x29e
>  [<ffffffff8101155d>] ? sys_execve+0x43/0x5d
>  [<ffffffff8106fc45>] ? __call_usermodehelper+0x0/0x6f
>  [<ffffffff81013e28>] ? kernel_execve+0x68/0xd0
>  [<ffffffff8106fc45>] ? __call_usermodehelper+0x0/0x6f
>  [<ffffffff8100f8cf>] ? xen_restore_fl_direct_end+0x0/0x1
>  [<ffffffff8106fb64>] ? ____call_usermodehelper+0x113/0x11e
>  [<ffffffff81013daa>] ? child_rip+0xa/0x20
>  [<ffffffff8106fc45>] ? __call_usermodehelper+0x0/0x6f
>  [<ffffffff81012f91>] ? int_ret_from_sys_call+0x7/0x1b
>  [<ffffffff8101371d>] ? retint_restore_args+0x5/0x6
>  [<ffffffff81013da0>] ? child_rip+0x0/0x20
> Code: 41 5e 41 5f c9 c3 55 48 89 e5 0f 1f 44 00 00 e8 17 ff ff ff c9 c3 55 48 89 e5 0f 1f 44 00 00 65 8b 04 25 c8 55 01 00 ff c8 75 04 <0f> 0b eb fe 65 48 8b 34 25 c0 55 01 00 48 81 c6 b8 02 00 00 e8 
> RIP  [<ffffffff8103a3cb>] leave_mm+0x15/0x46
>  RSP <ffff88002805be48>
> ---[ end trace ce9cee6832a9c503 ]---
> Kernel panic - not syncing: Fatal exception in interrupt
> Pid: 25581, comm: khelper Tainted: G      D    2.6.32.36fixxen #1
> Call Trace:
>  <IRQ>  [<ffffffff8105682e>] panic+0xe0/0x19a
>  [<ffffffff8144008a>] ? init_amd+0x296/0x37a
>  [<ffffffff8100f17d>] ? xen_force_evtchn_callback+0xd/0xf
>  [<ffffffff8100f8e2>] ? check_events+0x12/0x20
>  [<ffffffff8100f8cf>] ? xen_restore_fl_direct_end+0x0/0x1
>  [<ffffffff81056487>] ? print_oops_end_marker+0x23/0x25
>  [<ffffffff81448185>] oops_end+0xb6/0xc6
>  [<ffffffff810166e5>] die+0x5a/0x63
>  [<ffffffff81447a5c>] do_trap+0x115/0x124
>  [<ffffffff810148e6>] do_invalid_op+0x9c/0xa5
>  [<ffffffff8103a3cb>] ? leave_mm+0x15/0x46
>  [<ffffffff8100f6fa>] ? xen_clocksource_read+0x21/0x23
>  [<ffffffff8100f26c>] ? HYPERVISOR_vcpu_op+0xf/0x11
>  [<ffffffff8100f767>] ? xen_vcpuop_set_next_event+0x52/0x67
>  [<ffffffff81080bfa>] ? clockevents_program_event+0x78/0x81
>  [<ffffffff81013b3b>] invalid_op+0x1b/0x20
>  [<ffffffff814472b2>] ? _spin_unlock_irqrestore+0x15/0x17
>  [<ffffffff8103a3cb>] ? leave_mm+0x15/0x46
>  [<ffffffff8100e4ae>] drop_other_mm_ref+0x2a/0x53
>  [<ffffffff81087224>] generic_smp_call_function_single_interrupt+0xd8/0xfc
>  [<ffffffff81010108>] xen_call_function_single_interrupt+0x13/0x28
>  [<ffffffff810a936a>] handle_IRQ_event+0x66/0x120
>  [<ffffffff810aac5b>] handle_percpu_irq+0x41/0x6e
>  [<ffffffff8128c1c0>] __xen_evtchn_do_upcall+0x1ab/0x27d
>  [<ffffffff8128dd11>] xen_evtchn_do_upcall+0x33/0x46
>  [<ffffffff81013efe>] xen_do_hypervisor_callback+0x1e/0x30
>  <EOI>  [<ffffffff814472b2>] ? _spin_unlock_irqrestore+0x15/0x17
>  [<ffffffff8100f8cf>] ? xen_restore_fl_direct_end+0x0/0x1
>  [<ffffffff81113f71>] ? flush_old_exec+0x3ac/0x500
>  [<ffffffff81150dc5>] ? load_elf_binary+0x0/0x17ef
>  [<ffffffff81150dc5>] ? load_elf_binary+0x0/0x17ef
>  [<ffffffff8115115d>] ? load_elf_binary+0x398/0x17ef
>  [<ffffffff81042fcf>] ? need_resched+0x23/0x2d
>  [<ffffffff811f4648>] ? process_measurement+0xc0/0xd7
>  [<ffffffff81150dc5>] ? load_elf_binary+0x0/0x17ef
>  [<ffffffff81113094>] ? search_binary_handler+0xc8/0x255
>  [<ffffffff81114362>] ? do_execve+0x1c3/0x29e
>  [<ffffffff8101155d>] ? sys_execve+0x43/0x5d
>  [<ffffffff8106fc45>] ? __call_usermodehelper+0x0/0x6f
>  [<ffffffff81013e28>] ? kernel_execve+0x68/0xd0
>  [<ffffffff8106fc45>] ? __call_usermodehelper+0x0/0x6f
>  [<ffffffff8100f8cf>] ? xen_restore_fl_direct_end+0x0/0x1
>  [<ffffffff8106fb64>] ? ____call_usermodehelper+0x113/0x11e
>  [<ffffffff81013daa>] ? child_rip+0xa/0x20
>  [<ffffffff8106fc45>] ? __call_usermodehelper+0x0/0x6f
>  [<ffffffff81012f91>] ? int_ret_from_sys_call+0x7/0x1b
>  [<ffffffff8101371d>] ? retint_restore_args+0x5/0x6
>  [<ffffffff81013da0>] ? child_rip+0x0/0x20
>   
>   		 	   		  

^ permalink raw reply	[flat|nested] 41+ messages in thread

* RE: Kernel BUG at arch/x86/mm/tlb.c:61
  2011-04-12 10:00                   ` Konrad Rzeszutek Wilk
@ 2011-04-12 10:10                     ` MaoXiaoyun
  2011-04-14  6:16                     ` MaoXiaoyun
  1 sibling, 0 replies; 41+ messages in thread
From: MaoXiaoyun @ 2011-04-12 10:10 UTC (permalink / raw)
  To: konrad.wilk; +Cc: jeremy, xen devel, giamteckchoon


[-- Attachment #1.1: Type: text/plain, Size: 8334 bytes --]


VHD files on local disk. 
 
disk = [ 'tap:vhd:/mnt/xmao/test/img/win2003.cp1.vhd,hda,w']
 
thanks.
 
> Date: Tue, 12 Apr 2011 06:00:00 -0400
> From: konrad.wilk@oracle.com
> To: tinnycloud@hotmail.com
> CC: xen-devel@lists.xensource.com; giamteckchoon@gmail.com; jeremy@goop.org
> Subject: Re: Kernel BUG at arch/x86/mm/tlb.c:61
> 
> On Tue, Apr 12, 2011 at 05:11:51PM +0800, MaoXiaoyun wrote:
> > 
> > Hi :
> > 
> > We are using pvops kernel 2.6.32.36 + Xen 4.0.1, but have hit a kernel panic bug.
> > 
> > 2.6.32.36 Kernel: http://git.kernel.org/?p=linux/kernel/git/jeremy/xen.git;a=commit;h=bb1a15e55ec665a64c8a9c6bd699b1f16ac01ff4
> > Xen 4.0.1 http://xenbits.xen.org/hg/xen-4.0-testing.hg/rev/b536ebfba183 
> > 
> > Our test is simple: 24 HVMs (Win2003) on a single host, each HVM rebooting in a loop every 15 minutes.
> 
> What is the storage that you are using for your guests? AoE? Local disks?
> 
> > About 17 machines are involved in the test; after a 10-hour run, one hit a crash at arch/x86/mm/tlb.c:61.
> > 
> > Currently I am trying "cpuidle=0 cpufreq=none" tests based on Teck's suggestion.
> > 
> > Any comments, thanks. 
> > 
> > ===============crash log==========================
> > INIT: Id "s0" respawning too fast: disabled for 5 minutes
> > __ratelimit: 14 callbacks suppressed
> > blktap_sysfs_destroy
> > blktap_sysfs_destroy
> > ------------[ cut here ]------------
> > kernel BUG at arch/x86/mm/tlb.c:61!
> > invalid opcode: 0000 [#1] SMP 
> > last sysfs file: /sys/devices/system/xen_memory/xen_memory0/info/current_kb
> > CPU 1 
> > Modules linked in: 8021q garp xen_netback xen_blkback blktap blkback_pagemap nbd bridge stp llc autofs4 ipmi_devintf ipmi_si ipmi_msghandler lockd sunrpc bonding ipv6 xenfs dm_multipath video output sbs sbshc parport_pc lp parport ses enclosure snd_seq_dummy snd_seq_oss snd_seq_midi_event snd_seq snd_seq_device serio_raw bnx2 snd_pcm_oss snd_mixer_oss snd_pcm snd_timer iTCO_wdt snd soundcore snd_page_alloc i2c_i801 iTCO_vendor_support i2c_core pcspkr pata_acpi ata_generic ata_piix shpchp mptsas mptscsih mptbase [last unloaded: freq_table]
> > Pid: 25581, comm: khelper Not tainted 2.6.32.36fixxen #1 Tecal RH2285 
> > RIP: e030:[<ffffffff8103a3cb>] [<ffffffff8103a3cb>] leave_mm+0x15/0x46
> > RSP: e02b:ffff88002805be48 EFLAGS: 00010046
> > RAX: 0000000000000000 RBX: 0000000000000001 RCX: ffff88015f8e2da0
> > RDX: ffff88002805be78 RSI: 0000000000000000 RDI: 0000000000000001
> > RBP: ffff88002805be48 R08: ffff88009d662000 R09: dead000000200200
> > R10: dead000000100100 R11: ffffffff814472b2 R12: ffff88009bfc1880
> > R13: ffff880028063020 R14: 00000000000004f6 R15: 0000000000000000
> > FS: 00007f62362d66e0(0000) GS:ffff880028058000(0000) knlGS:0000000000000000
> > CS: e033 DS: 0000 ES: 0000 CR0: 000000008005003b
> > CR2: 0000003aabc11909 CR3: 000000009b8ca000 CR4: 0000000000002660
> > DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> > DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
> > Process khelper (pid: 25581, threadinfo ffff88007691e000, task ffff88009b92db40)
> > Stack:
> > ffff88002805be68 ffffffff8100e4ae 0000000000000001 ffff88009d733b88
> > <0> ffff88002805be98 ffffffff81087224 ffff88002805be78 ffff88002805be78
> > <0> ffff88015f808360 00000000000004f6 ffff88002805bea8 ffffffff81010108
> > Call Trace:
> > <IRQ> 
> > [<ffffffff8100e4ae>] drop_other_mm_ref+0x2a/0x53
> > [<ffffffff81087224>] generic_smp_call_function_single_interrupt+0xd8/0xfc
> > [<ffffffff81010108>] xen_call_function_single_interrupt+0x13/0x28
> > [<ffffffff810a936a>] handle_IRQ_event+0x66/0x120
> > [<ffffffff810aac5b>] handle_percpu_irq+0x41/0x6e
> > [<ffffffff8128c1c0>] __xen_evtchn_do_upcall+0x1ab/0x27d
> > [<ffffffff8128dd11>] xen_evtchn_do_upcall+0x33/0x46
> > [<ffffffff81013efe>] xen_do_hypervisor_callback+0x1e/0x30
> > <EOI> 
> > [<ffffffff814472b2>] ? _spin_unlock_irqrestore+0x15/0x17
> > [<ffffffff8100f8cf>] ? xen_restore_fl_direct_end+0x0/0x1
> > [<ffffffff81113f71>] ? flush_old_exec+0x3ac/0x500
> > [<ffffffff81150dc5>] ? load_elf_binary+0x0/0x17ef
> > [<ffffffff81150dc5>] ? load_elf_binary+0x0/0x17ef
> > [<ffffffff8115115d>] ? load_elf_binary+0x398/0x17ef
> > [<ffffffff81042fcf>] ? need_resched+0x23/0x2d
> > [<ffffffff811f4648>] ? process_measurement+0xc0/0xd7
> > [<ffffffff81150dc5>] ? load_elf_binary+0x0/0x17ef
> > [<ffffffff81113094>] ? search_binary_handler+0xc8/0x255
> > [<ffffffff81114362>] ? do_execve+0x1c3/0x29e
> > [<ffffffff8101155d>] ? sys_execve+0x43/0x5d
> > [<ffffffff8106fc45>] ? __call_usermodehelper+0x0/0x6f
> > [<ffffffff81013e28>] ? kernel_execve+0x68/0xd0
> > [<ffffffff8106fc45>] ? __call_usermodehelper+0x0/0x6f
> > [<ffffffff8100f8cf>] ? xen_restore_fl_direct_end+0x0/0x1
> > [<ffffffff8106fb64>] ? ____call_usermodehelper+0x113/0x11e
> > [<ffffffff81013daa>] ? child_rip+0xa/0x20
> > [<ffffffff8106fc45>] ? __call_usermodehelper+0x0/0x6f
> > [<ffffffff81012f91>] ? int_ret_from_sys_call+0x7/0x1b
> > [<ffffffff8101371d>] ? retint_restore_args+0x5/0x6
> > [<ffffffff81013da0>] ? child_rip+0x0/0x20
> > Code: 41 5e 41 5f c9 c3 55 48 89 e5 0f 1f 44 00 00 e8 17 ff ff ff c9 c3 55 48 89 e5 0f 1f 44 00 00 65 8b 04 25 c8 55 01 00 ff c8 75 04 <0f> 0b eb fe 65 48 8b 34 25 c0 55 01 00 48 81 c6 b8 02 00 00 e8 
> > RIP [<ffffffff8103a3cb>] leave_mm+0x15/0x46
> > RSP <ffff88002805be48>
> > ---[ end trace ce9cee6832a9c503 ]---
> > Kernel panic - not syncing: Fatal exception in interrupt
> > Pid: 25581, comm: khelper Tainted: G D 2.6.32.36fixxen #1
> > Call Trace:
> > <IRQ> [<ffffffff8105682e>] panic+0xe0/0x19a
> > [<ffffffff8144008a>] ? init_amd+0x296/0x37a
> > [<ffffffff8100f17d>] ? xen_force_evtchn_callback+0xd/0xf
> > [<ffffffff8100f8e2>] ? check_events+0x12/0x20
> > [<ffffffff8100f8cf>] ? xen_restore_fl_direct_end+0x0/0x1
> > [<ffffffff81056487>] ? print_oops_end_marker+0x23/0x25
> > [<ffffffff81448185>] oops_end+0xb6/0xc6
> > [<ffffffff810166e5>] die+0x5a/0x63
> > [<ffffffff81447a5c>] do_trap+0x115/0x124
> > [<ffffffff810148e6>] do_invalid_op+0x9c/0xa5
> > [<ffffffff8103a3cb>] ? leave_mm+0x15/0x46
> > [<ffffffff8100f6fa>] ? xen_clocksource_read+0x21/0x23
> > [<ffffffff8100f26c>] ? HYPERVISOR_vcpu_op+0xf/0x11
> > [<ffffffff8100f767>] ? xen_vcpuop_set_next_event+0x52/0x67
> > [<ffffffff81080bfa>] ? clockevents_program_event+0x78/0x81
> > [<ffffffff81013b3b>] invalid_op+0x1b/0x20
> > [<ffffffff814472b2>] ? _spin_unlock_irqrestore+0x15/0x17
> > [<ffffffff8103a3cb>] ? leave_mm+0x15/0x46
> > [<ffffffff8100e4ae>] drop_other_mm_ref+0x2a/0x53
> > [<ffffffff81087224>] generic_smp_call_function_single_interrupt+0xd8/0xfc
> > [<ffffffff81010108>] xen_call_function_single_interrupt+0x13/0x28
> > [<ffffffff810a936a>] handle_IRQ_event+0x66/0x120
> > [<ffffffff810aac5b>] handle_percpu_irq+0x41/0x6e
> > [<ffffffff8128c1c0>] __xen_evtchn_do_upcall+0x1ab/0x27d
> > [<ffffffff8128dd11>] xen_evtchn_do_upcall+0x33/0x46
> > [<ffffffff81013efe>] xen_do_hypervisor_callback+0x1e/0x30
> > <EOI> [<ffffffff814472b2>] ? _spin_unlock_irqrestore+0x15/0x17
> > [<ffffffff8100f8cf>] ? xen_restore_fl_direct_end+0x0/0x1
> > [<ffffffff81113f71>] ? flush_old_exec+0x3ac/0x500
> > [<ffffffff81150dc5>] ? load_elf_binary+0x0/0x17ef
> > [<ffffffff81150dc5>] ? load_elf_binary+0x0/0x17ef
> > [<ffffffff8115115d>] ? load_elf_binary+0x398/0x17ef
> > [<ffffffff81042fcf>] ? need_resched+0x23/0x2d
> > [<ffffffff811f4648>] ? process_measurement+0xc0/0xd7
> > [<ffffffff81150dc5>] ? load_elf_binary+0x0/0x17ef
> > [<ffffffff81113094>] ? search_binary_handler+0xc8/0x255
> > [<ffffffff81114362>] ? do_execve+0x1c3/0x29e
> > [<ffffffff8101155d>] ? sys_execve+0x43/0x5d
> > [<ffffffff8106fc45>] ? __call_usermodehelper+0x0/0x6f
> > [<ffffffff81013e28>] ? kernel_execve+0x68/0xd0
> > [<ffffffff8106fc45>] ? __call_usermodehelper+0x0/0x6f
> > [<ffffffff8100f8cf>] ? xen_restore_fl_direct_end+0x0/0x1
> > [<ffffffff8106fb64>] ? ____call_usermodehelper+0x113/0x11e
> > [<ffffffff81013daa>] ? child_rip+0xa/0x20
> > [<ffffffff8106fc45>] ? __call_usermodehelper+0x0/0x6f
> > [<ffffffff81012f91>] ? int_ret_from_sys_call+0x7/0x1b
> > [<ffffffff8101371d>] ? retint_restore_args+0x5/0x6
> > [<ffffffff81013da0>] ? child_rip+0x0/0x20
> > 
> > 
 		 	   		  

[-- Attachment #1.2: Type: text/html, Size: 10255 bytes --]

[-- Attachment #2: Type: text/plain, Size: 138 bytes --]

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xensource.com
http://lists.xensource.com/xen-devel

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: kernel BUG at arch/x86/xen/mmu.c:1872
  2011-04-12  3:30               ` MaoXiaoyun
@ 2011-04-12 16:08                 ` Teck Choon Giam
  0 siblings, 0 replies; 41+ messages in thread
From: Teck Choon Giam @ 2011-04-12 16:08 UTC (permalink / raw)
  To: MaoXiaoyun; +Cc: jeremy, xen devel, keir, ian.campbell, konrad.wilk, dave

If it is possible, please try not to top-post, as this makes reading
more confusing for me at least.  Thanks ;)


2011/4/12 MaoXiaoyun <tinnycloud@hotmail.com>:
> Hi:
>
>        I have just kicked off cpuidle=0 "cpufreq=none" tests.

Let's see whether you are able to reproduce the tlb BUG with the above.

>
>        What is your Xen version?  Do you use the backend driver of
> 2.6.32.36?

Are you asking me?
xen-4.0.2-rc3-pre latest changeset and also xen-4.1.1-rc1-pre.
What do you mean by backend driver?  My testing is mostly on PV domU and
HVM on Windows with LVM as storage.  I do not use VHD or any PV
drivers for Windows.

>
>        Besides the "TLB BUG", I've met at least two other issues:
>        1) Xen 4.0.1 + 2.6.32.36 kernel + backend driver from 2.6.31  ==> will
> cause "Bad grant reference" logs in the serial output
>        2) Xen 4.0.1 + 2.6.32.36 kernel with its own backend driver   ==> will
> cause disk errors like the ones below.
>
> sd 0:0:0:0: rejecting I/O to offline device
> sd 0:0:0:0: rejecting I/O to offline device
> sd 0:0:0:0: rejecting I/O to offline device
> sd 0:0:0:0: rejecting I/O to offline device
> sd 0:0:0:0: rejecting I/O to offline device
> sd 0:0:0:0: rejecting I/O to offline device
> sd 0:0:0:0: rejecting I/O to offline device
> sd 0:0:0:0: rejecting I/O to offline device
> sd 0:0:0:0: rejecting I/O to offline device
> sd 0:0:0:0: rejecting I/O to offline device
> sd 0:0:0:0: rejecting I/O to offline device
> sd 0:0:0:0: rejecting I/O to offline device
> sd 0:0:0:0: rejecting I/O to offline device
> end_request: I/O error, dev tdb, sector 28699593
> end_request: I/O error, dev tdb, sector 28699673
> end_request: I/O error, dev tdb, sector 28699753
> end_request: I/O error, dev tdb, sector 28699833
> end_request: I/O error, dev tdb, sector 28699913
> end_request: I/O error, dev tdb, sector 28699993
> end_request: I/O error, dev tdb, sector 28700073

Is this related to VHD?  What is the specific backend driver?  Did these
start to surface after you applied my backport patch, or were they
already there regardless of the patch?

Thanks.

Kindest regards,
Giam Teck Choon

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: kernel BUG at arch/x86/xen/mmu.c:1872
  2011-04-11 18:08             ` Jeremy Fitzhardinge
  2011-04-12  3:35               ` MaoXiaoyun
@ 2011-04-12 16:32               ` Teck Choon Giam
  1 sibling, 0 replies; 41+ messages in thread
From: Teck Choon Giam @ 2011-04-12 16:32 UTC (permalink / raw)
  To: Jeremy Fitzhardinge
  Cc: MaoXiaoyun, xen devel, keir, ian.campbell, konrad.wilk, dave

2011/4/12 Jeremy Fitzhardinge <jeremy@goop.org>:
> On 04/11/2011 05:31 AM, MaoXiaoyun wrote:
>> Hi:
>>
>> I believe this is the fix, to a large extent.
>> With this patch applied, my own test case
>> succeeds across 30 rounds of runs.
>> Every round takes 8 hours, while without this patch the tests fail every
>> round within 15 minutes.
>>
>> So this really does fix most of the problem.
>>
>> But during the runs I met another crash; from the log it looks like
>> it is related to
>> this BUG, since the crash log shows it is tlb related and this BUG is
>> also tlb related.
>>
>> Well, I also have poor knowledge of the kernel.
>> Hope someone from xen-devel can offer some help.
>
> Thanks for confirming; it makes sense and explains the symptoms, so I'm
> glad it also works ;)
>
>
> J
>

Thanks Jeremy, I can see the needed backport patch is in your
xen/next-2.6.32 tree now ;)

Kindest regards,
Giam Teck Choon

^ permalink raw reply	[flat|nested] 41+ messages in thread

* RE: Kernel BUG at arch/x86/mm/tlb.c:61
  2011-04-12 10:00                   ` Konrad Rzeszutek Wilk
  2011-04-12 10:10                     ` MaoXiaoyun
@ 2011-04-14  6:16                     ` MaoXiaoyun
  2011-04-14  7:26                       ` Teck Choon Giam
  1 sibling, 1 reply; 41+ messages in thread
From: MaoXiaoyun @ 2011-04-14  6:16 UTC (permalink / raw)
  To: xen devel; +Cc: jeremy, giamteckchoon, konrad.wilk


[-- Attachment #1.1: Type: text/plain, Size: 15036 bytes --]


Hi:
 
      I've run the tests with "cpuidle=0 cpufreq=none", and two machines crashed.
 
blktap_sysfs_destroy
blktap_sysfs_destroy
blktap_sysfs_create: adding attributes for dev ffff8800ad581000
blktap_sysfs_create: adding attributes for dev ffff8800a48e3e00
------------[ cut here ]------------
kernel BUG at arch/x86/mm/tlb.c:61!
invalid opcode: 0000 [#1] SMP 
last sysfs file: /sys/block/tapdeve/dev
CPU 0 
Modules linked in: 8021q garp blktap xen_netback xen_blkback blkback_pagemap nbd bridge stp llc autofs4 ipmi_devintf ipmi_si ipmi_msghandler lockd sunrpc bonding ipv6 xenfs dm_multipath video output sbs sbshc parport_pc lp parport ses enclosure snd_seq_dummy bnx2 serio_raw snd_seq_oss snd_seq_midi_event snd_seq snd_seq_device snd_pcm_oss snd_mixer_oss snd_pcm i2c_i801 snd_timer i2c_core snd iTCO_wdt pata_acpi soundcore iTCO_vendor_support ata_generic snd_page_alloc pcspkr ata_piix shpchp mptsas mptscsih mptbase [last unloaded: freq_table]
Pid: 8022, comm: khelper Not tainted 2.6.32.36xen #1 Tecal RH2285          
RIP: e030:[<ffffffff8103a3cb>]  [<ffffffff8103a3cb>] leave_mm+0x15/0x46
RSP: e02b:ffff88002803ee48  EFLAGS: 00010046
RAX: 0000000000000000 RBX: 0000000000000001 RCX: ffffffff81675980
RDX: ffff88002803ee78 RSI: 0000000000000000 RDI: 0000000000000000
RBP: ffff88002803ee48 R08: ffff8800a4929000 R09: dead000000200200
R10: dead000000100100 R11: ffffffff81447292 R12: ffff88012ba07b80
R13: ffff880028046020 R14: 00000000000004fb R15: 0000000000000000
FS:  00007f410af416e0(0000) GS:ffff88002803b000(0000) knlGS:0000000000000000
CS:  e033 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 0000000000469000 CR3: 00000000ad639000 CR4: 0000000000002660
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process khelper (pid: 8022, threadinfo ffff8800a4846000, task ffff8800a9ed0000)
Stack:
 ffff88002803ee68 ffffffff8100e4a4 0000000000000001 ffff880097de3b88
<0> ffff88002803ee98 ffffffff81087224 ffff88002803ee78 ffff88002803ee78
<0> ffff88015f808180 00000000000004fb ffff88002803eea8 ffffffff810100e8
Call Trace:
 <IRQ> 
 [<ffffffff8100e4a4>] drop_other_mm_ref+0x2a/0x53
 [<ffffffff81087224>] generic_smp_call_function_single_interrupt+0xd8/0xfc
 [<ffffffff810100e8>] xen_call_function_single_interrupt+0x13/0x28
 [<ffffffff810a936a>] handle_IRQ_event+0x66/0x120
 [<ffffffff810aac5b>] handle_percpu_irq+0x41/0x6e
 [<ffffffff8128c1a8>] __xen_evtchn_do_upcall+0x1ab/0x27d
 [<ffffffff8128dcf9>] xen_evtchn_do_upcall+0x33/0x46
 [<ffffffff81013efe>] xen_do_hypervisor_callback+0x1e/0x30
 <EOI> 
 [<ffffffff81447292>] ? _spin_unlock_irqrestore+0x15/0x17
 [<ffffffff8100f8af>] ? xen_restore_fl_direct_end+0x0/0x1
 [<ffffffff81113f75>] ? flush_old_exec+0x3ac/0x500
 [<ffffffff81150dc9>] ? load_elf_binary+0x0/0x17ef
 [<ffffffff81150dc9>] ? load_elf_binary+0x0/0x17ef
 [<ffffffff81151161>] ? load_elf_binary+0x398/0x17ef
 [<ffffffff81042fcf>] ? need_resched+0x23/0x2d
 [<ffffffff811f463c>] ? process_measurement+0xc0/0xd7
 [<ffffffff81150dc9>] ? load_elf_binary+0x0/0x17ef
 [<ffffffff81113098>] ? search_binary_handler+0xc8/0x255
 [<ffffffff81114366>] ? do_execve+0x1c3/0x29e
 [<ffffffff8101155d>] ? sys_execve+0x43/0x5d
 [<ffffffff8106fc45>] ? __call_usermodehelper+0x0/0x6f
 [<ffffffff81013e28>] ? kernel_execve+0x68/0xd0
 [<ffffffff8106fc45>] ? __call_usermodehelper+0x0/0x6f
 [<ffffffff8100f8af>] ? xen_restore_fl_direct_end+0x0/0x1
 [<ffffffff8106fb64>] ? ____call_usermodehelper+0x113/0x11e
 [<ffffffff81013daa>] ? child_rip+0xa/0x20
 [<ffffffff8106fc45>] ? __call_usermodehelper+0x0/0x6f
 [<ffffffff81012f91>] ? int_ret_from_sys_call+0x7/0x1b
 [<ffffffff8101371d>] ? retint_restore_args+0x5/0x6
 [<ffffffff81013da0>] ? child_rip+0x0/0x20
Code: 41 5e 41 5f c9 c3 55 48 89 e5 0f 1f 44 00 00 e8 17 ff ff ff c9 c3 55 48 89 e5 0f 1f 44 00 00 65 8b 04 25 c8 55 01 00 ff c8 75 04 <0f> 0b eb fe 65 48 8b 34 25 c0 55 01 00 48 81 c6 b8 02 00 00 e8 
RIP  [<ffffffff8103a3cb>] leave_mm+0x15/0x46
 RSP <ffff88002803ee48>
---[ end trace 1522f17fdfc9162d ]---
Kernel panic - not syncing: Fatal exception in interrupt
Pid: 8022, comm: khelper Tainted: G      D    2.6.32.36xen #1
Call Trace:
 <IRQ>  [<ffffffff8105682e>] panic+0xe0/0x19a
 [<ffffffff8144006a>] ? init_amd+0x296/0x37a
 [<ffffffff8100f169>] ? xen_force_evtchn_callback+0xd/0xf
 [<ffffffff8100f8c2>] ? check_events+0x12/0x20
 [<ffffffff8100f8af>] ? xen_restore_fl_direct_end+0x0/0x1
 [<ffffffff81056487>] ? print_oops_end_marker+0x23/0x25
 [<ffffffff81448165>] oops_end+0xb6/0xc6
 [<ffffffff810166e5>] die+0x5a/0x63
 [<ffffffff81447a3c>] do_trap+0x115/0x124
 [<ffffffff810148e6>] do_invalid_op+0x9c/0xa5
 [<ffffffff8103a3cb>] ? leave_mm+0x15/0x46
 [<ffffffff8100f6e6>] ? xen_clocksource_read+0x21/0x23
 [<ffffffff8100f258>] ? HYPERVISOR_vcpu_op+0xf/0x11
 [<ffffffff8100f753>] ? xen_vcpuop_set_next_event+0x52/0x67
 [<ffffffff81013b3b>] invalid_op+0x1b/0x20
 [<ffffffff81447292>] ? _spin_unlock_irqrestore+0x15/0x17
 [<ffffffff8103a3cb>] ? leave_mm+0x15/0x46
 [<ffffffff8100e4a4>] drop_other_mm_ref+0x2a/0x53
 [<ffffffff81087224>] generic_smp_call_function_single_interrupt+0xd8/0xfc
 [<ffffffff810100e8>] xen_call_function_single_interrupt+0x13/0x28
 [<ffffffff810a936a>] handle_IRQ_event+0x66/0x120
 [<ffffffff810aac5b>] handle_percpu_irq+0x41/0x6e
 [<ffffffff8128c1a8>] __xen_evtchn_do_upcall+0x1ab/0x27d
 [<ffffffff8128dcf9>] xen_evtchn_do_upcall+0x33/0x46
 [<ffffffff81013efe>] xen_do_hypervisor_callback+0x1e/0x30
 <EOI>  [<ffffffff81447292>] ? _spin_unlock_irqrestore+0x15/0x17
 [<ffffffff8100f8af>] ? xen_restore_fl_direct_end+0x0/0x1
 [<ffffffff81113f75>] ? flush_old_exec+0x3ac/0x500
 [<ffffffff81150dc9>] ? load_elf_binary+0x0/0x17ef
 [<ffffffff81150dc9>] ? load_elf_binary+0x0/0x17ef
 [<ffffffff81151161>] ? load_elf_binary+0x398/0x17ef
 [<ffffffff81042fcf>] ? need_resched+0x23/0x2d
 [<ffffffff811f463c>] ? process_measurement+0xc0/0xd7
 [<ffffffff81150dc9>] ? load_elf_binary+0x0/0x17ef
 [<ffffffff81113098>] ? search_binary_handler+0xc8/0x255
 [<ffffffff81114366>] ? do_execve+0x1c3/0x29e
 [<ffffffff8101155d>] ? sys_execve+0x43/0x5d
 [<ffffffff8106fc45>] ? __call_usermodehelper+0x0/0x6f
 [<ffffffff81013e28>] ? kernel_execve+0x68/0xd0
 [<ffffffff8106fc45>] ? __call_usermodehelper+0x0/0x6f
 [<ffffffff8100f8af>] ? xen_restore_fl_direct_end+0x0/0x1
 [<ffffffff8106fb64>] ? ____call_usermodehelper+0x113/0x11e
 [<ffffffff81013daa>] ? child_rip+0xa/0x20
 [<ffffffff8106fc45>] ? __call_usermodehelper+0x0/0x6f
 [<ffffffff81012f91>] ? int_ret_from_sys_call+0x7/0x1b
 [<ffffffff8101371d>] ? retint_restore_args+0x5/0x6
 [<ffffffff81013da0>] ? child_rip+0x0/0x20
(XEN) Domain 0 crashed: 'noreboot' set - not rebooting.

 
> Date: Tue, 12 Apr 2011 06:00:00 -0400
> From: konrad.wilk@oracle.com
> To: tinnycloud@hotmail.com
> CC: xen-devel@lists.xensource.com; giamteckchoon@gmail.com; jeremy@goop.org
> Subject: Re: Kernel BUG at arch/x86/mm/tlb.c:61
> 
> On Tue, Apr 12, 2011 at 05:11:51PM +0800, MaoXiaoyun wrote:
> > 
> > Hi :
> > 
> > We are using pvops kernel 2.6.32.36 + Xen 4.0.1, but have hit a kernel panic bug.
> > 
> > 2.6.32.36 Kernel: http://git.kernel.org/?p=linux/kernel/git/jeremy/xen.git;a=commit;h=bb1a15e55ec665a64c8a9c6bd699b1f16ac01ff4
> > Xen 4.0.1 http://xenbits.xen.org/hg/xen-4.0-testing.hg/rev/b536ebfba183 
> > 
> > Our test is simple: 24 HVMs (Win2003) on a single host, each HVM rebooting in a loop every 15 minutes.
> 
> What is the storage that you are using for your guests? AoE? Local disks?
> 
> > About 17 machines are invovled in the test, after 10 hours run, one confrontted a crash at arch/x86/mm/tlb.c:61
> > 
> > Currently I am trying "cpuidle=0 cpufreq=none" tests based on Teck's suggestion.
> > 
> > Any comments, thanks. 
> > 
> > ===============crash log==========================
> > INIT: Id "s0" respawning too fast: disabled for 5 minutes
> > __ratelimit: 14 callbacks suppressed
> > blktap_sysfs_destroy
> > blktap_sysfs_destroy
> > ------------[ cut here ]------------
> > kernel BUG at arch/x86/mm/tlb.c:61!
> > invalid opcode: 0000 [#1] SMP 
> > last sysfs file: /sys/devices/system/xen_memory/xen_memory0/info/current_kb
> > CPU 1 
> > Modules linked in: 8021q garp xen_netback xen_blkback blktap blkback_pagemap nbd bridge stp llc autofs4 ipmi_devintf ipmi_si ipmi_msghandler lockd sunrpc bonding ipv6 xenfs dm_multipath video output sbs sbshc parport_pc lp parport ses enclosure snd_seq_dummy snd_seq_oss snd_seq_midi_event snd_seq snd_seq_device serio_raw bnx2 snd_pcm_oss snd_mixer_oss snd_pcm snd_timer iTCO_wdt snd soundcore snd_page_alloc i2c_i801 iTCO_vendor_support i2c_core pcspkr pata_acpi ata_generic ata_piix shpchp mptsas mptscsih mptbase [last unloaded: freq_table]
> > Pid: 25581, comm: khelper Not tainted 2.6.32.36fixxen #1 Tecal RH2285 
> > RIP: e030:[<ffffffff8103a3cb>] [<ffffffff8103a3cb>] leave_mm+0x15/0x46
> > RSP: e02b:ffff88002805be48 EFLAGS: 00010046
> > RAX: 0000000000000000 RBX: 0000000000000001 RCX: ffff88015f8e2da0
> > RDX: ffff88002805be78 RSI: 0000000000000000 RDI: 0000000000000001
> > RBP: ffff88002805be48 R08: ffff88009d662000 R09: dead000000200200
> > R10: dead000000100100 R11: ffffffff814472b2 R12: ffff88009bfc1880
> > R13: ffff880028063020 R14: 00000000000004f6 R15: 0000000000000000
> > FS: 00007f62362d66e0(0000) GS:ffff880028058000(0000) knlGS:0000000000000000
> > CS: e033 DS: 0000 ES: 0000 CR0: 000000008005003b
> > CR2: 0000003aabc11909 CR3: 000000009b8ca000 CR4: 0000000000002660
> > DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> > DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
> > Process khelper (pid: 25581, threadinfo ffff88007691e000, task ffff88009b92db40)
> > Stack:
> > ffff88002805be68 ffffffff8100e4ae 0000000000000001 ffff88009d733b88
> > <0> ffff88002805be98 ffffffff81087224 ffff88002805be78 ffff88002805be78
> > <0> ffff88015f808360 00000000000004f6 ffff88002805bea8 ffffffff81010108
> > Call Trace:
> > <IRQ> 
> > [<ffffffff8100e4ae>] drop_other_mm_ref+0x2a/0x53
> > [<ffffffff81087224>] generic_smp_call_function_single_interrupt+0xd8/0xfc
> > [<ffffffff81010108>] xen_call_function_single_interrupt+0x13/0x28
> > [<ffffffff810a936a>] handle_IRQ_event+0x66/0x120
> > [<ffffffff810aac5b>] handle_percpu_irq+0x41/0x6e
> > [<ffffffff8128c1c0>] __xen_evtchn_do_upcall+0x1ab/0x27d
> > [<ffffffff8128dd11>] xen_evtchn_do_upcall+0x33/0x46
> > [<ffffffff81013efe>] xen_do_hypervisor_callback+0x1e/0x30
> > <EOI> 
> > [<ffffffff814472b2>] ? _spin_unlock_irqrestore+0x15/0x17
> > [<ffffffff8100f8cf>] ? xen_restore_fl_direct_end+0x0/0x1
> > [<ffffffff81113f71>] ? flush_old_exec+0x3ac/0x500
> > [<ffffffff81150dc5>] ? load_elf_binary+0x0/0x17ef
> > [<ffffffff81150dc5>] ? load_elf_binary+0x0/0x17ef
> > [<ffffffff8115115d>] ? load_elf_binary+0x398/0x17ef
> > [<ffffffff81042fcf>] ? need_resched+0x23/0x2d
> > [<ffffffff811f4648>] ? process_measurement+0xc0/0xd7
> > [<ffffffff81150dc5>] ? load_elf_binary+0x0/0x17ef
> > [<ffffffff81113094>] ? search_binary_handler+0xc8/0x255
> > [<ffffffff81114362>] ? do_execve+0x1c3/0x29e
> > [<ffffffff8101155d>] ? sys_execve+0x43/0x5d
> > [<ffffffff8106fc45>] ? __call_usermodehelper+0x0/0x6f
> > [<ffffffff81013e28>] ? kernel_execve+0x68/0xd0
> > [<ffffffff8106fc45>] ? __call_usermodehelper+0x0/0x6f
> > [<ffffffff8100f8cf>] ? xen_restore_fl_direct_end+0x0/0x1
> > [<ffffffff8106fb64>] ? ____call_usermodehelper+0x113/0x11e
> > [<ffffffff81013daa>] ? child_rip+0xa/0x20
> > [<ffffffff8106fc45>] ? __call_usermodehelper+0x0/0x6f
> > [<ffffffff81012f91>] ? int_ret_from_sys_call+0x7/0x1b
> > [<ffffffff8101371d>] ? retint_restore_args+0x5/0x6
> > [<ffffffff81013da0>] ? child_rip+0x0/0x20
> > Code: 41 5e 41 5f c9 c3 55 48 89 e5 0f 1f 44 00 00 e8 17 ff ff ff c9 c3 55 48 89 e5 0f 1f 44 00 00 65 8b 04 25 c8 55 01 00 ff c8 75 04 <0f> 0b eb fe 65 48 8b 34 25 c0 55 01 00 48 81 c6 b8 02 00 00 e8 
> > RIP [<ffffffff8103a3cb>] leave_mm+0x15/0x46
> > RSP <ffff88002805be48>
> > ---[ end trace ce9cee6832a9c503 ]---
> > Kernel panic - not syncing: Fatal exception in interrupt
> > Pid: 25581, comm: khelper Tainted: G D 2.6.32.36fixxen #1
> > Call Trace:
> > <IRQ> [<ffffffff8105682e>] panic+0xe0/0x19a
> > [<ffffffff8144008a>] ? init_amd+0x296/0x37a
> > [<ffffffff8100f17d>] ? xen_force_evtchn_callback+0xd/0xf
> > [<ffffffff8100f8e2>] ? check_events+0x12/0x20
> > [<ffffffff8100f8cf>] ? xen_restore_fl_direct_end+0x0/0x1
> > [<ffffffff81056487>] ? print_oops_end_marker+0x23/0x25
> > [<ffffffff81448185>] oops_end+0xb6/0xc6
> > [<ffffffff810166e5>] die+0x5a/0x63
> > [<ffffffff81447a5c>] do_trap+0x115/0x124
> > [<ffffffff810148e6>] do_invalid_op+0x9c/0xa5
> > [<ffffffff8103a3cb>] ? leave_mm+0x15/0x46
> > [<ffffffff8100f6fa>] ? xen_clocksource_read+0x21/0x23
> > [<ffffffff8100f26c>] ? HYPERVISOR_vcpu_op+0xf/0x11
> > [<ffffffff8100f767>] ? xen_vcpuop_set_next_event+0x52/0x67
> > [<ffffffff81080bfa>] ? clockevents_program_event+0x78/0x81
> > [<ffffffff81013b3b>] invalid_op+0x1b/0x20
> > [<ffffffff814472b2>] ? _spin_unlock_irqrestore+0x15/0x17
> > [<ffffffff8103a3cb>] ? leave_mm+0x15/0x46
> > [<ffffffff8100e4ae>] drop_other_mm_ref+0x2a/0x53
> > [<ffffffff81087224>] generic_smp_call_function_single_interrupt+0xd8/0xfc
> > [<ffffffff81010108>] xen_call_function_single_interrupt+0x13/0x28
> > [<ffffffff810a936a>] handle_IRQ_event+0x66/0x120
> > [<ffffffff810aac5b>] handle_percpu_irq+0x41/0x6e
> > [<ffffffff8128c1c0>] __xen_evtchn_do_upcall+0x1ab/0x27d
> > [<ffffffff8128dd11>] xen_evtchn_do_upcall+0x33/0x46
> > [<ffffffff81013efe>] xen_do_hypervisor_callback+0x1e/0x30
> > <EOI> [<ffffffff814472b2>] ? _spin_unlock_irqrestore+0x15/0x17
> > [<ffffffff8100f8cf>] ? xen_restore_fl_direct_end+0x0/0x1
> > [<ffffffff81113f71>] ? flush_old_exec+0x3ac/0x500
> > [<ffffffff81150dc5>] ? load_elf_binary+0x0/0x17ef
> > [<ffffffff81150dc5>] ? load_elf_binary+0x0/0x17ef
> > [<ffffffff8115115d>] ? load_elf_binary+0x398/0x17ef
> > [<ffffffff81042fcf>] ? need_resched+0x23/0x2d
> > [<ffffffff811f4648>] ? process_measurement+0xc0/0xd7
> > [<ffffffff81150dc5>] ? load_elf_binary+0x0/0x17ef
> > [<ffffffff81113094>] ? search_binary_handler+0xc8/0x255
> > [<ffffffff81114362>] ? do_execve+0x1c3/0x29e
> > [<ffffffff8101155d>] ? sys_execve+0x43/0x5d
> > [<ffffffff8106fc45>] ? __call_usermodehelper+0x0/0x6f
> > [<ffffffff81013e28>] ? kernel_execve+0x68/0xd0
> > [<ffffffff8106fc45>] ? __call_usermodehelper+0x0/0x6f
> > [<ffffffff8100f8cf>] ? xen_restore_fl_direct_end+0x0/0x1
> > [<ffffffff8106fb64>] ? ____call_usermodehelper+0x113/0x11e
> > [<ffffffff81013daa>] ? child_rip+0xa/0x20
> > [<ffffffff8106fc45>] ? __call_usermodehelper+0x0/0x6f
> > [<ffffffff81012f91>] ? int_ret_from_sys_call+0x7/0x1b
> > [<ffffffff8101371d>] ? retint_restore_args+0x5/0x6
> > [<ffffffff81013da0>] ? child_rip+0x0/0x20
> > 
> > 
 		 	   		  

[-- Attachment #1.2: Type: text/html, Size: 20788 bytes --]

[-- Attachment #2: Type: text/plain, Size: 138 bytes --]

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xensource.com
http://lists.xensource.com/xen-devel

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: Kernel BUG at arch/x86/mm/tlb.c:61
  2011-04-14  6:16                     ` MaoXiaoyun
@ 2011-04-14  7:26                       ` Teck Choon Giam
  2011-04-14  7:56                         ` MaoXiaoyun
  0 siblings, 1 reply; 41+ messages in thread
From: Teck Choon Giam @ 2011-04-14  7:26 UTC (permalink / raw)
  To: MaoXiaoyun; +Cc: jeremy, xen devel, konrad.wilk

2011/4/14 MaoXiaoyun <tinnycloud@hotmail.com>:
> Hi:
>
>       I've done test with "cpuidle=0 cpufreq=none", two machine crashed.
>
> blktap_sysfs_destroy
> blktap_sysfs_destroy
> blktap_sysfs_create: adding attributes for dev ffff8800ad581000
> blktap_sysfs_create: adding attributes for dev ffff8800a48e3e00
> ------------[ cut here ]------------
> kernel BUG at arch/x86/mm/tlb.c:61!
> invalid opcode: 0000 [#1] SMP
> last sysfs file: /sys/block/tapdeve/dev
> CPU 0
> Modules linked in: 8021q garp blktap xen_netback xen_blkback blkback_pagemap nbd bridge stp llc autofs4 ipmi_devintf ipmi_si ipmi_msghandler lockd sunrpc bonding ipv6 xenfs dm_multipath video output sbs sbshc parport_pc lp parport ses enclosure snd_seq_dummy bnx2 serio_raw snd_seq_oss snd_seq_midi_event snd_seq snd_seq_device snd_pcm_oss snd_mixer_oss snd_pcm i2c_i801 snd_timer i2c_core snd iTCO_wdt pata_acpi soundcore iTCO_vendor_support ata_generic snd_page_alloc pcspkr ata_piix shpchp mptsas mptscsih mptbase [last unloaded: freq_table]
> Pid: 8022, comm: khelper Not tainted 2.6.32.36xen #1 Tecal RH2285
> RIP: e030:[<ffffffff8103a3cb>]  [<ffffffff8103a3cb>] leave_mm+0x15/0x46
> RSP: e02b:ffff88002803ee48  EFLAGS: 00010046
> RAX: 0000000000000000 RBX: 0000000000000001 RCX: ffffffff81675980
> RDX: ffff88002803ee78 RSI: 0000000000000000 RDI: 0000000000000000
> RBP: ffff88002803ee48 R08: ffff8800a4929000 R09: dead000000200200
> R10: dead000000100100 R11: ffffffff81447292 R12: ffff88012ba07b80
> R13: ffff880028046020 R14: 00000000000004fb R15: 0000000000000000
> FS:  00007f410af416e0(0000) GS:ffff88002803b000(0000) knlGS:0000000000000000
> CS:  e033 DS: 0000 ES: 0000 CR0: 000000008005003b
> CR2: 0000000000469000 CR3: 00000000ad639000 CR4: 0000000000002660
> DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
> Process khelper (pid: 8022, threadinfo ffff8800a4846000, task ffff8800a9ed0000)
> Stack:
>  ffff88002803ee68 ffffffff8100e4a4 0000000000000001 ffff880097de3b88
> <0> ffff88002803ee98 ffffffff81087224 ffff88002803ee78 ffff88002803ee78
> <0> ffff88015f808180 00000000000004fb ffff88002803eea8 ffffffff810100e8
> Call Trace:
>  <IRQ>
>  [<ffffffff8100e4a4>] drop_other_mm_ref+0x2a/0x53
>  [<ffffffff81087224>] generic_smp_call_function_single_interrupt+0xd8/0xfc
>  [<ffffffff810100e8>] xen_call_function_single_interrupt+0x13/0x28
>  [<ffffffff810a936a>] handle_IRQ_event+0x66/0x120
>  [<ffffffff810aac5b>] handle_percpu_irq+0x41/0x6e
>  [<ffffffff8128c1a8>] __xen_evtchn_do_upcall+0x1ab/0x27d
>  [<ffffffff8128dcf9>] xen_evtchn_do_upcall+0x33/0x46
>  [<ffffffff81013efe>] xen_do_hypervisor_callback+0x1e/0x30
>  <EOI>
>  [<ffffffff81447292>] ? _spin_unlock_irqrestore+0x15/0x17
>  [<ffffffff8100f8af>] ? xen_restore_fl_direct_end+0x0/0x1
>  [<ffffffff81113f75>] ? flush_old_exec+0x3ac/0x500
>  [<ffffffff81150dc9>] ? load_elf_binary+0x0/0x17ef
>  [<ffffffff81150dc9>] ? load_elf_binary+0x0/0x17ef
>  [<ffffffff81151161>] ? load_elf_binary+0x398/0x17ef
>  [<ffffffff81042fcf>] ? need_resched+0x23/0x2d
>
> [<ffffffff811f463c>] ? process_measurement+0xc0/0xd7
>  [<ffffffff81150dc9>] ? load_elf_binary+0x0/0x17ef
>  [<ffffffff81113098>] ? search_binary_handler+0xc8/0x255
>  [<ffffffff81114366>] ? do_execve+0x1c3/0x29e
>  [<ffffffff8101155d>] ? sys_execve+0x43/0x5d
>  [<ffffffff8106fc45>] ? __call_usermodehelper+0x0/0x6f
>  [<ffffffff81013e28>] ? kernel_execve+0x68/0xd0
>  [<ffffffff8106fc45>] ? __call_usermodehelper+0x0/0x6f
>  [<ffffffff8100f8af>] ? xen_restore_fl_direct_end+0x0/0x1
>  [<ffffffff8106fb64>] ? ____call_usermodehelper+0x113/0x11e
>  [<ffffffff81013daa>] ? child_rip+0xa/0x20
>  [<ffffffff8106fc45>] ? __call_usermodehelper+0x0/0x6f
>  [<ffffffff81012f91>] ? int_ret_from_sys_call+0x7/0x1b
>  [<ffffffff8101371d>] ? retint_restore_args+0x5/0x6
>  [<ffffffff81013da0>] ? child_rip+0x0/0x20
> Code: 41 5e 41 5f c9 c3 55 48 89 e5 0f 1f 44 00 00 e8 17 ff ff ff c9 c3 55 48 89 e5 0f 1f 44 00 00 65 8b 04 25 c8 55 01 00 ff c8 75 04 <0f> 0b eb fe 65 48 8b 34 25 c0 55 01 00 48 81 c6 b8 02 00 00 e8
> RIP  [<ffffffff8103a3cb>] leave_mm+0x15/0x46
>  RSP <ffff88002803ee48>
> ---[ end trace 1522f17fdfc9162d ]---
> Kernel panic - not syncing: Fatal exception in interrupt
> Pid: 8022, comm: khelper Tainted: G      D    2.6.32.36xen #1
> Call Trace:
>  <IRQ>  [<ffffffff8105682e>] panic+0xe0/0x19a
>  [<ffffffff8144006a>] ? init_amd+0x296/0x37a

Hmmm... are both machines using AMD CPUs?  Did you hit the same bug on an Intel CPU?


>  [<ffffffff8100f169>] ? xen_force_evtchn_callback+0xd/0xf
>  [<ffffffff8100f8c2>] ? check_events+0x12/0x20
>  [<ffffffff8100f8af>] ? xen_restore_fl_direct_end+0x0/0x1
>  [<ffffffff81056487>] ? print_oops_end_marker+0x23/0x25
>  [<ffffffff81448165>] oops_end+0xb6/0xc6
>  [<ffffffff810166e5>] die+0x5a/0x63
>  [<ffffffff81447a3c>] do_trap+0x115/0x124
>  [<ffffffff810148e6>] do_invalid_op+0x9c/0xa5
>  [<ffffffff8103a3cb>] ? leave_mm+0x15/0x46
>  [<ffffffff8100f6e6>] ? xen_clocksource_read+0x21/0x23
>  [<ffffffff8100f258>] ? HYPERVISOR_vcpu_op+0xf/0x11
>  [<ffffffff8100f753>] ? xen_vcpuop_set_next_event+0x52/0x67
>  [<ffffffff81013b3b>] invalid_op+0x1b/0x20
>  [<ffffffff81447292>] ? _spin_unlock_irqrestore+0x15/0x17
>  [<ffffffff8103a3cb>] ? leave_mm+0x15/0x46
>  [<ffffffff8100e4a4>] drop_other_mm_ref+0x2a/0x53
>  [<ffffffff81087224>] generic_smp_call_function_single_interrupt+0xd8/0xfc
>  [<ffffffff810100e8>] xen_call_function_single_interrupt+0x13/0x28
>  [<ffffffff810a936a>] handle_IRQ_event+0x66/0x120
>  [<ffffffff810aac5b>] handle_percpu_irq+0x41/0x6e
>  [<ffffffff8128c1a8>] __xen_evtchn_do_upcall+0x1ab/0x27d
>  [<ffffffff8128dcf9>] xen_evtchn_do_upcall+0x33/0x46
>  [<ffffffff81013efe>] xen_do_hypervisor_callback+0x1e/0x30
>  <EOI>  [<ffffffff81447292>] ? _spin_unlock_irqrestore+0x15/0x17
>  [<ffffffff8100f8af>] ? xen_restore_fl_direct_end+0x0/0x1
>  [<ffffffff81113f75>] ? flush_old_exec+0x3ac/0x500
>  [<ffffffff81150dc9>] ? load_elf_binary+0x0/0x17ef
>  [<ffffffff81150dc9>] ? load_elf_binary+0x0/0x17ef
>  [<ffffffff81151161>] ? load_elf_binary+0x398/0x17ef
>  [<ffffffff81042fcf>] ? need_resched+0x23/0x2d
>  [<ffffffff811f463c>] ? process_measurement+0xc0/0xd7
>  [<ffffffff81150dc9>] ? load_elf_binary+0x0/0x17ef
>  [<ffffffff81113098>] ? search_binary_handler+0xc8/0x255
>  [<ffffffff81114366>] ? do_execve+0x1c3/0x29e
>  [<ffffffff8101155d>] ? sys_execve+0x43/0x5d
>  [<ffffffff8106fc45>] ? __call_usermodehelper+0x0/0x6f
>  [<ffffffff81013e28>] ? kernel_execve+0x68/0xd0
>  [<ffffffff8106fc45>] ? __call_usermodehelper+0x0/0x6f
>  [<ffffffff8100f8af>] ? xen_restore_fl_direct_end+0x0/0x1
>  [<ffffffff8106fb64>] ? ____call_usermodehelper+0x113/0x11e
>  [<ffffffff81013daa>] ? child_rip+0xa/0x20
>  [<ffffffff8106fc45>] ? __call_usermodehelper+0x0/0x6f
>  [<ffffffff81012f91>] ? int_ret_from_sys_call+0x7/0x1b
>  [<ffffffff8101371d>] ? retint_restore_args+0x5/0x6
>  [<ffffffff81013da0>] ? child_rip+0x0/0x20
> (XEN) Domain 0 crashed: 'noreboot' set - not rebooting.
>
>> Date: Tue, 12 Apr 2011 06:00:00 -0400
>> From: konrad.wilk@oracle.com
>> To: tinnycloud@hotmail.com
>> CC: xen-devel@lists.xensource.com; giamteckchoon@gmail.com;
>> jeremy@goop.org
>> Subject: Re: Kernel BUG at arch/x86/mm/tlb.c:61
>>
>> On Tue, Apr 12, 2011 at 05:11:51PM +0800, MaoXiaoyun wrote:
>> >
>> > Hi :
>> >
>> > We are using pvops kernel 2.6.32.36 + xen 4.0.1, but confront a kernel
>> > panic bug.
>> >
>> > 2.6.32.36 Kernel:
>> > http://git.kernel.org/?p=linux/kernel/git/jeremy/xen.git;a=commit;h=bb1a15e55ec665a64c8a9c6bd699b1f16ac01ff4
>> > Xen 4.0.1 http://xenbits.xen.org/hg/xen-4.0-testing.hg/rev/b536ebfba183
>> >
>> > Our test is simple, 24 HVMS(Win2003 ) on a single host, each HVM loopes
>> > in restart every 15minutes.
>>
>> What is the storage that you are using for your guests? AoE? Local disks?
>>
>> > About 17 machines are invovled in the test, after 10 hours run, one
>> > confrontted a crash at arch/x86/mm/tlb.c:61
>> >
>> > Currently I am trying "cpuidle=0 cpufreq=none" tests based on Teck's
>> > suggestion.
>> >
>> > Any comments, thanks.
>> >
>> > ===============crash log==========================
>> > INIT: Id "s0" respawning too fast: disabled for 5 minutes
>> > __ratelimit: 14 callbacks suppressed
>> > blktap_sysfs_destroy
>> > blktap_sysfs_destroy
>> > ------------[ cut here ]------------
>> > kernel BUG at arch/x86/mm/tlb.c:61!
>> > invalid opcode: 0000 [#1] SMP
>> > last sysfs file:
>> > /sys/devices/system/xen_memory/xen_memory0/info/current_kb
>> > CPU 1
>> > Modules linked in: 8021q garp xen_netback xen_blkback blktap
>> > blkback_pagemap nbd bridge stp llc autofs4 ipmi_devintf ipmi_si
>> > ipmi_msghandler lockd sunrpc bonding ipv6 xenfs dm_multipath video output
>> > sbs sbshc parport_pc lp parport ses enclosure snd_seq_dummy snd_seq_oss
>> > snd_seq_midi_event snd_seq snd_seq_device serio_raw bnx2 snd_pcm_oss
>> > snd_mixer_oss snd_pcm snd_timer iTCO_wdt snd soundcore snd_page_alloc
>> > i2c_i801 iTCO_vendor_support i2c_core pcspkr pata_acpi ata_generic ata_piix
>> > shpchp mptsas mptscsih mptbase [last unloaded: freq_table]
>> > Pid: 25581, comm: khelper Not tainted 2.6.32.36fixxen #1 Tecal RH2285
>> > RIP: e030:[<ffffffff8103a3cb>] [<ffffffff8103a3cb>] leave_mm+0x15/0x46
>> > RSP: e02b:ffff88002805be48 EFLAGS: 00010046
>> > RAX: 0000000000000000 RBX: 0000000000000001 RCX: ffff88015f8e2da0
>> > RDX: ffff88002805be78 RSI: 0000000000000000 RDI: 0000000000000001
>> > RBP: ffff88002805be48 R08: ffff88009d662000 R09: dead000000200200
>> > R10: dead000000100100 R11: ffffffff814472b2 R12: ffff88009bfc1880
>> > R13: ffff880028063020 R14: 00000000000004f6 R15: 0000000000000000
>> > FS: 00007f62362d66e0(0000) GS:ffff880028058000(0000)
>> > knlGS:0000000000000000
>> > CS: e033 DS: 0000 ES: 0000 CR0: 000000008005003b
>> > CR2: 0000003aabc11909 CR3: 000000009b8ca000 CR4: 0000000000002660
>> > DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
>> > DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
>> > Process khelper (pid: 25581, threadinfo ffff88007691e000, task
>> > ffff88009b92db40)
>> > Stack:
>> > ffff88002805be68 ffffffff8100e4ae 0000000000000001 ffff88009d733b88
>> > <0> ffff88002805be98 ffffffff81087224 ffff88002805be78 ffff88002805be78
>> > <0> ffff88015f808360 00000000000004f6 ffff88002805bea8 ffffffff81010108
>> > Call Trace:
>> > <IRQ>
>> > [<ffffffff8100e4ae>] drop_other_mm_ref+0x2a/0x53
>> > [<ffffffff81087224>]
>> > generic_smp_call_function_single_interrupt+0xd8/0xfc
>> > [<ffffffff81010108>] xen_call_function_single_interrupt+0x13/0x28
>> > [<ffffffff810a936a>] handle_IRQ_event+0x66/0x120
>> > [<ffffffff810aac5b>] handle_percpu_irq+0x41/0x6e
>> > [<ffffffff8128c1c0>] __xen_evtchn_do_upcall+0x1ab/0x27d
>> > [<ffffffff8128dd11>] xen_evtchn_do_upcall+0x33/0x46
>> > [<ffffffff81013efe>] xen_do_hypervisor_callback+0x1e/0x30
>> > <EOI>
>> > [<ffffffff814472b2>] ? _spin_unlock_irqrestore+0x15/0x17
>> > [<ffffffff8100f8cf>] ? xen_restore_fl_direct_end+0x0/0x1
>> > [<ffffffff81113f71>] ? flush_old_exec+0x3ac/0x500
>> > [<ffffffff81150dc5>] ? load_elf_binary+0x0/0x17ef
>> > [<ffffffff81150dc5>] ? load_elf_binary+0x0/0x17ef
>> > [<ffffffff8115115d>] ? load_elf_binary+0x398/0x17ef
>> > [<ffffffff81042fcf>] ? need_resched+0x23/0x2d
>> > [<ffffffff811f4648>] ? process_measurement+0xc0/0xd7
>> > [<ffffffff81150dc5>] ? load_elf_binary+0x0/0x17ef
>> > [<ffffffff81113094>] ? search_binary_handler+0xc8/0x255
>> > [<ffffffff81114362>] ? do_execve+0x1c3/0x29e
>> > [<ffffffff8101155d>] ? sys_execve+0x43/0x5d
>> > [<ffffffff8106fc45>] ? __call_usermodehelper+0x0/0x6f
>> > [<ffffffff81013e28>] ? kernel_execve+0x68/0xd0
>> > [<ffffffff8106fc45>] ? __call_usermodehelper+0x0/0x6f
>> > [<ffffffff8100f8cf>] ? xen_restore_fl_direct_end+0x0/0x1
>> > [<ffffffff8106fb64>] ? ____call_usermodehelper+0x113/0x11e
>> > [<ffffffff81013daa>] ? child_rip+0xa/0x20
>> > [<ffffffff8106fc45>] ? __call_usermodehelper+0x0/0x6f
>> > [<ffffffff81012f91>] ? int_ret_from_sys_call+0x7/0x1b
>> > [<ffffffff8101371d>] ? retint_restore_args+0x5/0x6
>> > [<ffffffff81013da0>] ? child_rip+0x0/0x20
>> > Code: 41 5e 41 5f c9 c3 55 48 89 e5 0f 1f 44 00 00 e8 17 ff ff ff c9 c3
>> > 55 48 89 e5 0f 1f 44 00 00 65 8b 04 25 c8 55 01 00 ff c8 75 04 <0f> 0b eb fe
>> > 65 48 8b 34 25 c0 55 01 00 48 81 c6 b8 02 00 00 e8
>> > RIP [<ffffffff8103a3cb>] leave_mm+0x15/0x46
>> > RSP <ffff88002805be48>
>> > ---[ end trace ce9cee6832a9c503 ]---
>> > Kernel panic - not syncing: Fatal exception in interrupt
>> > Pid: 25581, comm: khelper Tainted: G D 2.6.32.36fixxen #1
>> > Call Trace:
>> > <IRQ> [<ffffffff8105682e>] panic+0xe0/0x19a
>> > [<ffffffff8144008a>] ? init_amd+0x296/0x37a
>> > [<ffffffff8100f17d>] ? xen_force_evtchn_callback+0xd/0xf
>> > [<ffffffff8100f8e2>] ? check_events+0x12/0x20
>> > [<ffffffff8100f8cf>] ? xen_restore_fl_direct_end+0x0/0x1
>> > [<ffffffff81056487>] ? print_oops_end_marker+0x23/0x25
>> > [<ffffffff81448185>] oops_end+0xb6/0xc6
>> > [<ffffffff810166e5>] die+0x5a/0x63
>> > [<ffffffff81447a5c>] do_trap+0x115/0x124
>> > [<ffffffff810148e6>] do_invalid_op+0x9c/0xa5
>> > [<ffffffff8103a3cb>] ? leave_mm+0x15/0x46
>> > [<ffffffff8100f6fa>] ? xen_clocksource_read+0x21/0x23
>> > [<ffffffff8100f26c>] ? HYPERVISOR_vcpu_op+0xf/0x11
>> > [<ffffffff8100f767>] ? xen_vcpuop_set_next_event+0x52/0x67
>> > [<ffffffff81080bfa>] ? clockevents_program_event+0x78/0x81
>> > [<ffffffff81013b3b>] invalid_op+0x1b/0x20
>> > [<ffffffff814472b2>] ? _spin_unlock_irqrestore+0x15/0x17
>> > [<ffffffff8103a3cb>] ? leave_mm+0x15/0x46
>> > [<ffffffff8100e4ae>] drop_other_mm_ref+0x2a/0x53
>> > [<ffffffff81087224>]
>> > generic_smp_call_function_single_interrupt+0xd8/0xfc
>> > [<ffffffff81010108>] xen_call_function_single_interrupt+0x13/0x28
>> > [<ffffffff810a936a>] handle_IRQ_event+0x66/0x120
>> > [<ffffffff810aac5b>] handle_percpu_irq+0x41/0x6e
>> > [<ffffffff8128c1c0>] __xen_evtchn_do_upcall+0x1ab/0x27d
>> > [<ffffffff8128dd11>] xen_evtchn_do_upcall+0x33/0x46
>> > [<ffffffff81013efe>] xen_do_hypervisor_callback+0x1e/0x30
>> > <EOI> [<ffffffff814472b2>] ? _spin_unlock_irqrestore+0x15/0x17
>> > [<ffffffff8100f8cf>] ? xen_restore_fl_direct_end+0x0/0x1
>> > [<ffffffff81113f71>] ? flush_old_exec+0x3ac/0x500
>> > [<ffffffff81150dc5>] ? load_elf_binary+0x0/0x17ef
>> > [<ffffffff81150dc5>] ? load_elf_binary+0x0/0x17ef
>> > [<ffffffff8115115d>] ? load_elf_binary+0x398/0x17ef
>> > [<ffffffff81042fcf>] ? need_resched+0x23/0x2d
>> > [<ffffffff811f4648>] ? process_measurement+0xc0/0xd7
>> > [<ffffffff81150dc5>] ? load_elf_binary+0x0/0x17ef
>> > [<ffffffff81113094>] ? search_binary_handler+0xc8/0x255
>> > [<ffffffff81114362>] ? do_execve+0x1c3/0x29e
>> > [<ffffffff8101155d>] ? sys_execve+0x43/0x5d
>> > [<ffffffff8106fc45>] ? __call_usermodehelper+0x0/0x6f
>> > [<ffffffff81013e28>] ? kernel_execve+0x68/0xd0
>> > [<ffffffff8106fc45>] ? __call_usermodehelper+0x0/0x6f
>> > [<ffffffff8100f8cf>] ? xen_restore_fl_direct_end+0x0/0x1
>> > [<ffffffff8106fb64>] ? ____call_usermodehelper+0x113/0x11e
>> > [<ffffffff81013daa>] ? child_rip+0xa/0x20
>> > [<ffffffff8106fc45>] ? __call_usermodehelper+0x0/0x6f
>> > [<ffffffff81012f91>] ? int_ret_from_sys_call+0x7/0x1b
>> > [<ffffffff8101371d>] ? retint_restore_args+0x5/0x6
>> > [<ffffffff81013da0>] ? child_rip+0x0/0x20
>> >
>> >
>

^ permalink raw reply	[flat|nested] 41+ messages in thread

* RE: Kernel BUG at arch/x86/mm/tlb.c:61
  2011-04-14  7:26                       ` Teck Choon Giam
@ 2011-04-14  7:56                         ` MaoXiaoyun
  2011-04-14 11:16                           ` MaoXiaoyun
  0 siblings, 1 reply; 41+ messages in thread
From: MaoXiaoyun @ 2011-04-14  7:56 UTC (permalink / raw)
  To: giamteckchoon; +Cc: jeremy, xen devel, konrad.wilk


[-- Attachment #1.1: Type: text/plain, Size: 8988 bytes --]



 

> Date: Thu, 14 Apr 2011 15:26:14 +0800
> Subject: Re: Kernel BUG at arch/x86/mm/tlb.c:61
> From: giamteckchoon@gmail.com
> To: tinnycloud@hotmail.com
> CC: xen-devel@lists.xensource.com; jeremy@goop.org; konrad.wilk@oracle.com
> 
> 2011/4/14 MaoXiaoyun <tinnycloud@hotmail.com>:
> > Hi:
> >
> >       I've done test with "cpuidle=0 cpufreq=none", two machine crashed.
> >
> > blktap_sysfs_destroy
> > blktap_sysfs_destroy
> > blktap_sysfs_create: adding attributes for dev ffff8800ad581000
> > blktap_sysfs_create: adding attributes for dev ffff8800a48e3e00
> > ------------[ cut here ]------------
> > kernel BUG at arch/x86/mm/tlb.c:61!
> > invalid opcode: 0000 [#1] SMP
> > last sysfs file: /sys/block/tapdeve/dev
> > CPU 0
> > Modules linked in: 8021q garp blktap xen_netback xen_blkback blkback_pagemap nbd bridge stp llc autofs4 ipmi_devintf ipmi_si ipmi_msghandler lockd sunrpc bonding ipv6 xenfs dm_multipath video output sbs sbshc parport_pc lp parport ses enclosure snd_seq_dummy bnx2 serio_raw snd_seq_oss snd_seq_midi_event snd_seq snd_seq_device snd_pcm_oss snd_mixer_oss snd_pcm i2c_i801 snd_timer i2c_core snd iTCO_wdt pata_acpi soundcore iTCO_vendor_support ata_generic snd_page_alloc pcspkr ata_piix shpchp mptsas mptscsih mptbase [last unloaded: freq_table]
> > Pid: 8022, comm: khelper Not tainted 2.6.32.36xen #1 Tecal RH2285
> > RIP: e030:[<ffffffff8103a3cb>]  [<ffffffff8103a3cb>] leave_mm+0x15/0x46
> > RSP: e02b:ffff88002803ee48  EFLAGS: 00010046
> > RAX: 0000000000000000 RBX: 0000000000000001 RCX: ffffffff81675980
> > RDX: ffff88002803ee78 RSI: 0000000000000000 RDI: 0000000000000000
> > RBP: ffff88002803ee48 R08: ffff8800a4929000 R09: dead000000200200
> > R10: dead000000100100 R11: ffffffff81447292 R12: ffff88012ba07b80
> > R13: ffff880028046020 R14: 00000000000004fb R15: 0000000000000000
> > FS:  00007f410af416e0(0000) GS:ffff88002803b000(0000) knlGS:0000000000000000
> > CS:  e033 DS: 0000 ES: 0000 CR0: 000000008005003b
> > CR2: 0000000000469000 CR3: 00000000ad639000 CR4: 0000000000002660
> > DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> > DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
> > Process khelper (pid: 8022, threadinfo ffff8800a4846000, task ffff8800a9ed0000)
> > Stack:
> >  ffff88002803ee68 ffffffff8100e4a4 0000000000000001 ffff880097de3b88
> > <0> ffff88002803ee98 ffffffff81087224 ffff88002803ee78 ffff88002803ee78
> > <0> ffff88015f808180 00000000000004fb ffff88002803eea8 ffffffff810100e8
> > Call Trace:
> >  <IRQ>
> >  [<ffffffff8100e4a4>] drop_other_mm_ref+0x2a/0x53
> >  [<ffffffff81087224>] generic_smp_call_function_single_interrupt+0xd8/0xfc
> >  [<ffffffff810100e8>] xen_call_function_single_interrupt+0x13/0x28
> >  [<ffffffff810a936a>] handle_IRQ_event+0x66/0x120
> >  [<ffffffff810aac5b>] handle_percpu_irq+0x41/0x6e
> >  [<ffffffff8128c1a8>] __xen_evtchn_do_upcall+0x1ab/0x27d
> >  [<ffffffff8128dcf9>] xen_evtchn_do_upcall+0x33/0x46
> >  [<ffffffff81013efe>] xen_do_hypervisor_callback+0x1e/0x30
> >  <EOI>
> >  [<ffffffff81447292>] ? _spin_unlock_irqrestore+0x15/0x17
> >  [<ffffffff8100f8af>] ? xen_restore_fl_direct_end+0x0/0x1
> >  [<ffffffff81113f75>] ? flush_old_exec+0x3ac/0x500
> >  [<ffffffff81150dc9>] ? load_elf_binary+0x0/0x17ef
> >  [<ffffffff81150dc9>] ? load_elf_binary+0x0/0x17ef
> >  [<ffffffff81151161>] ? load_elf_binary+0x398/0x17ef
> >  [<ffffffff81042fcf>] ? need_resched+0x23/0x2d
> >
> > [<ffffffff811f463c>] ? process_measurement+0xc0/0xd7
> >  [<ffffffff81150dc9>] ? load_elf_binary+0x0/0x17ef
> >  [<ffffffff81113098>] ? search_binary_handler+0xc8/0x255
> >  [<ffffffff81114366>] ? do_execve+0x1c3/0x29e
> >  [<ffffffff8101155d>] ? sys_execve+0x43/0x5d
> >  [<ffffffff8106fc45>] ? __call_usermodehelper+0x0/0x6f
> >  [<ffffffff81013e28>] ? kernel_execve+0x68/0xd0
> >  [<ffffffff8106fc45>] ? __call_usermodehelper+0x0/0x6f
> >  [<ffffffff8100f8af>] ? xen_restore_fl_direct_end+0x0/0x1
> >  [<ffffffff8106fb64>] ? ____call_usermodehelper+0x113/0x11e
> >  [<ffffffff81013daa>] ? child_rip+0xa/0x20
> >  [<ffffffff8106fc45>] ? __call_usermodehelper+0x0/0x6f
> >  [<ffffffff81012f91>] ? int_ret_from_sys_call+0x7/0x1b
> >  [<ffffffff8101371d>] ? retint_restore_args+0x5/0x6
> >  [<ffffffff81013da0>] ? child_rip+0x0/0x20
> > Code: 41 5e 41 5f c9 c3 55 48 89 e5 0f 1f 44 00 00 e8 17 ff ff ff c9 c3 55 48 89 e5 0f 1f 44 00 00 65 8b 04 25 c8 55 01 00 ff c8 75 04 <0f> 0b eb fe 65 48 8b 34 25 c0 55 01 00 48 81 c6 b8 02 00 00 e8
> > RIP  [<ffffffff8103a3cb>] leave_mm+0x15/0x46
> >  RSP <ffff88002803ee48>
> > ---[ end trace 1522f17fdfc9162d ]---
> > Kernel panic - not syncing: Fatal exception in interrupt
> > Pid: 8022, comm: khelper Tainted: G      D    2.6.32.36xen #1
> > Call Trace:
> >  <IRQ>  [<ffffffff8105682e>] panic+0xe0/0x19a
> >  [<ffffffff8144006a>] ? init_amd+0x296/0x37a
> 
> Hmmm... both machines are using AMD CPU? Did you hit the same bug on Intel CPU?
> 
> 
 
It is an Intel CPU, not AMD. 
 
model name      : Intel(R) Xeon(R) CPU           E5620  @ 2.40GHz
 

> >  [<ffffffff8100f169>] ? xen_force_evtchn_callback+0xd/0xf
> >  [<ffffffff8100f8c2>] ? check_events+0x12/0x20
> >  [<ffffffff8100f8af>] ? xen_restore_fl_direct_end+0x0/0x1
> >  [<ffffffff81056487>] ? print_oops_end_marker+0x23/0x25
> >  [<ffffffff81448165>] oops_end+0xb6/0xc6
> >  [<ffffffff810166e5>] die+0x5a/0x63
> >  [<ffffffff81447a3c>] do_trap+0x115/0x124
> >  [<ffffffff810148e6>] do_invalid_op+0x9c/0xa5
> >  [<ffffffff8103a3cb>] ? leave_mm+0x15/0x46
> >  [<ffffffff8100f6e6>] ? xen_clocksource_read+0x21/0x23
> >  [<ffffffff8100f258>] ? HYPERVISOR_vcpu_op+0xf/0x11
> >  [<ffffffff8100f753>] ? xen_vcpuop_set_next_event+0x52/0x67
> >  [<ffffffff81013b3b>] invalid_op+0x1b/0x20
> >  [<ffffffff81447292>] ? _spin_unlock_irqrestore+0x15/0x17
> >  [<ffffffff8103a3cb>] ? leave_mm+0x15/0x46
> >  [<ffffffff8100e4a4>] drop_other_mm_ref+0x2a/0x53
> >  [<ffffffff81087224>] generic_smp_call_function_single_interrupt+0xd8/0xfc
> >  [<ffffffff810100e8>] xen_call_function_single_interrupt+0x13/0x28
> >  [<ffffffff810a936a>] handle_IRQ_event+0x66/0x120
> >  [<ffffffff810aac5b>] handle_percpu_irq+0x41/0x6e
> >  [<ffffffff8128c1a8>] __xen_evtchn_do_upcall+0x1ab/0x27d
> >  [<ffffffff8128dcf9>] xen_evtchn_do_upcall+0x33/0x46
> >  [<ffffffff81013efe>] xen_do_hypervisor_callback+0x1e/0x30
> >  <EOI>  [<ffffffff81447292>] ? _spin_unlock_irqrestore+0x15/0x17
> >  [<ffffffff8100f8af>] ? xen_restore_fl_direct_end+0x0/0x1
> >  [<ffffffff81113f75>] ? flush_old_exec+0x3ac/0x500
> >  [<ffffffff81150dc9>] ? load_elf_binary+0x0/0x17ef
> >  [<ffffffff81150dc9>] ? load_elf_binary+0x0/0x17ef
> >  [<ffffffff81151161>] ? load_elf_binary+0x398/0x17ef
> >  [<ffffffff81042fcf>] ? need_resched+0x23/0x2d
> >  [<ffffffff811f463c>] ? process_measurement+0xc0/0xd7
> >  [<ffffffff81150dc9>] ? load_elf_binary+0x0/0x17ef
> >  [<ffffffff81113098>] ? search_binary_handler+0xc8/0x255
> >  [<ffffffff81114366>] ? do_execve+0x1c3/0x29e
> >  [<ffffffff8101155d>] ? sys_execve+0x43/0x5d
> >  [<ffffffff8106fc45>] ? __call_usermodehelper+0x0/0x6f
> >  [<ffffffff81013e28>] ? kernel_execve+0x68/0xd0
> >  [<ffffffff8106fc45>] ? __call_usermodehelper+0x0/0x6f
> >  [<ffffffff8100f8af>] ? xen_restore_fl_direct_end+0x0/0x1
> >  [<ffffffff8106fb64>] ? ____call_usermodehelper+0x113/0x11e
> >  [<ffffffff81013daa>] ? child_rip+0xa/0x20
> >  [<ffffffff8106fc45>] ? __call_usermodehelper+0x0/0x6f
> >  [<ffffffff81012f91>] ? int_ret_from_sys_call+0x7/0x1b
> >  [<ffffffff8101371d>] ? retint_restore_args+0x5/0x6
> >  [<ffffffff81013da0>] ? child_rip+0x0/0x20
> > (XEN) Domain 0 crashed: 'noreboot' set - not rebooting.
> >
> >> Date: Tue, 12 Apr 2011 06:00:00 -0400
> >> From: konrad.wilk@oracle.com
> >> To: tinnycloud@hotmail.com
> >> CC: xen-devel@lists.xensource.com; giamteckchoon@gmail.com;
> >> jeremy@goop.org
> >> Subject: Re: Kernel BUG at arch/x86/mm/tlb.c:61
> >>
> >> On Tue, Apr 12, 2011 at 05:11:51PM +0800, MaoXiaoyun wrote:
> >> >
> >> > Hi :
> >> >
> >> > We are using pvops kernel 2.6.32.36 + xen 4.0.1, but confront a kernel
> >> > panic bug.
> >> >
> >> > 2.6.32.36 Kernel:
> >> > http://git.kernel.org/?p=linux/kernel/git/jeremy/xen.git;a=commit;h=bb1a15e55ec665a64c8a9c6bd699b1f16ac01ff4
> >> > Xen 4.0.1 http://xenbits.xen.org/hg/xen-4.0-testing.hg/rev/b536ebfba183
> >> >
> >> > Our test is simple, 24 HVMS(Win2003 ) on a single host, each HVM loopes
> >> > in restart every 15minutes.
> >>
> >> What is the storage that you are using for your guests? AoE? Local disks?
> >>
> >> > About 17 machines are invovled in the test, after 10 hours run, one
> >> > confrontted a crash at arch/x86/mm/tlb.c:61
> >> >
> >> > Currently I am trying "cpuidle=0 cpufreq=none" tests based on Teck's
> >> > suggestion.
> >> >
> >> > Any comments, thanks.
> >> >

 		 	   		  

[-- Attachment #1.2: Type: text/html, Size: 13881 bytes --]

[-- Attachment #2: Type: text/plain, Size: 138 bytes --]

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xensource.com
http://lists.xensource.com/xen-devel

^ permalink raw reply	[flat|nested] 41+ messages in thread

* RE: Kernel BUG at arch/x86/mm/tlb.c:61
  2011-04-14  7:56                         ` MaoXiaoyun
@ 2011-04-14 11:16                           ` MaoXiaoyun
  2011-04-15 12:23                             ` MaoXiaoyun
  0 siblings, 1 reply; 41+ messages in thread
From: MaoXiaoyun @ 2011-04-14 11:16 UTC (permalink / raw)
  To: giamteckchoon; +Cc: jeremy, xen devel, konrad.wilk


[-- Attachment #1.1: Type: text/plain, Size: 10460 bytes --]


Hi:
 
       As I went through the code:
       From tlb.c:60, it looks like cpu_tlbstate.state is TLBSTATE_OK,
which indicates user space, but the caller, at mmu.c:1512,
(active_mm == mm) indicates kernel space; that is the conflict.

   Also, the panicking CPU is processing an IPI; could it be that something is
wrong with the CPU mask?

   Thanks.
 
====== arch/x86/mm/tlb.c ======
 58 void leave_mm(int cpu)
 59 {
 60     if (percpu_read(cpu_tlbstate.state) == TLBSTATE_OK)
 61         BUG();
 62     cpumask_clear_cpu(cpu,
 63               mm_cpumask(percpu_read(cpu_tlbstate.active_mm)));
 64     load_cr3(swapper_pg_dir);
 65 }
 66 EXPORT_SYMBOL_GPL(leave_mm);
 67 

====== arch/x86/xen/mmu.c ======

1502 #ifdef CONFIG_SMP
1503 /* Another cpu may still have their %cr3 pointing at the pagetable, so
1504    we need to repoint it somewhere else before we can unpin it. */
1505 static void drop_other_mm_ref(void *info)
1506 {
1507     struct mm_struct *mm = info;
1508     struct mm_struct *active_mm;
1509 
1510     active_mm = percpu_read(cpu_tlbstate.active_mm);
1511 
1512     if (active_mm == mm)
1513         leave_mm(smp_processor_id());
1514 
1515     /* If this cpu still has a stale cr3 reference, then make sure
1516        it has been flushed. */
1517     if (percpu_read(xen_current_cr3) == __pa(mm->pgd))
1518         load_cr3(swapper_pg_dir);
1519 }
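
To check that I am reading the invariant correctly, here is a tiny user-space model of it (illustrative only: all the *_model names below are made up, and this is not kernel code). The IPI handler expects to find the mm only on CPUs that hold it lazily; a CPU that is still actively running with the mm (TLBSTATE_OK) must not simply drop it, which is what the BUG() enforces.

    #include <assert.h>
    #include <stdio.h>

    enum tlbstate { TLBSTATE_OK = 1, TLBSTATE_LAZY = 2 };

    struct cpu_model {
        enum tlbstate state;    /* what this CPU believes about its TLB */
        void *active_mm;        /* mm whose page tables it has loaded   */
    };

    /* Model of the IPI handler asked to drop references to 'mm'. */
    static void drop_other_mm_ref_model(struct cpu_model *cpu, void *mm)
    {
        if (cpu->active_mm == mm) {
            /* Leaving the mm is only legal in lazy TLB mode; this assert
               plays the role of the BUG() at tlb.c:61. */
            assert(cpu->state != TLBSTATE_OK);
            cpu->active_mm = NULL;      /* "load_cr3(swapper_pg_dir)" */
        }
    }

    int main(void)
    {
        int mm;                                 /* stand-in for a struct mm_struct */
        struct cpu_model cpu = { TLBSTATE_LAZY, &mm };

        drop_other_mm_ref_model(&cpu, &mm);     /* fine: lazy holder leaves the mm */
        printf("lazy CPU dropped the mm as expected\n");

        cpu.state = TLBSTATE_OK;                /* the crashing situation          */
        cpu.active_mm = &mm;
        drop_other_mm_ref_model(&cpu, &mm);     /* assert fires, like the BUG()    */
        return 0;
    }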





 
> Date: Thu, 14 Apr 2011 15:26:14 +0800
> Subject: Re: Kernel BUG at arch/x86/mm/tlb.c:61
> From: giamteckchoon@gmail.com
> To: tinnycloud@hotmail.com
> CC: xen-devel@lists.xensource.com; jeremy@goop.org; konrad.wilk@oracle.com
> 
> 2011/4/14 MaoXiaoyun <tinnycloud@hotmail.com>:
> > Hi:
> >
> >       I've done test with "cpuidle=0 cpufreq=none", two machine crashed.
> >
> > blktap_sysfs_destroy
> > blktap_sysfs_destroy
> > blktap_sysfs_create: adding attributes for dev ffff8800ad581000
> > blktap_sysfs_create: adding attributes for dev ffff8800a48e3e00
> > ------------[ cut here ]------------
> > kernel BUG at arch/x86/mm/tlb.c:61!
> > invalid opcode: 0000 [#1] SMP
> > last sysfs file: /sys/block/tapdeve/dev
> > CPU 0
> > Modules linked in: 8021q garp blktap xen_netback xen_blkback blkback_pagemap nbd bridge stp llc autofs4 ipmi_devintf ipmi_si ipmi_msghandler lockd sunrpc bonding ipv6 xenfs dm_multipath video output sbs sbshc parport_pc lp parport ses enclosure snd_seq_dummy bnx2 serio_raw snd_seq_oss snd_seq_midi_event snd_seq snd_seq_device snd_pcm_oss snd_mixer_oss snd_pcm i2c_i801 snd_timer i2c_core snd iTCO_wdt pata_acpi soundcore iTCO_vendor_support ata_generic snd_page_alloc pcspkr ata_piix shpchp mptsas mptscsih mptbase [last unloaded: freq_table]
> > Pid: 8022, comm: khelper Not tainted 2.6.32.36xen #1 Tecal RH2285
> > RIP: e030:[<ffffffff8103a3cb>]  [<ffffffff8103a3cb>] leave_mm+0x15/0x46
> > RSP: e02b:ffff88002803ee48  EFLAGS: 00010046
> > RAX: 0000000000000000 RBX: 0000000000000001 RCX: ffffffff81675980
> > RDX: ffff88002803ee78 RSI: 0000000000000000 RDI: 0000000000000000
> > RBP: ffff88002803ee48 R08: ffff8800a4929000 R09: dead000000200200
> > R10: dead000000100100 R11: ffffffff81447292 R12: ffff88012ba07b80
> > R13: ffff880028046020 R14: 00000000000004fb R15: 0000000000000000
> > FS:  00007f410af416e0(0000) GS:ffff88002803b000(0000) knlGS:0000000000000000
> > CS:  e033 DS: 0000 ES: 0000 CR0: 000000008005003b
> > CR2: 0000000000469000 CR3: 00000000ad639000 CR4: 0000000000002660
> > DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> > DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
> > Process khelper (pid: 8022, threadinfo ffff8800a4846000, task ffff8800a9ed0000)
> > Stack:
> >  ffff88002803ee68 ffffffff8100e4a4 0000000000000001 ffff880097de3b88
> > <0> ffff88002803ee98 ffffffff81087224 ffff88002803ee78 ffff88002803ee78
> > <0> ffff88015f808180 00000000000004fb ffff88002803eea8 ffffffff810100e8
> > Call Trace:
> >  <IRQ>
> >  [<ffffffff8100e4a4>] drop_other_mm_ref+0x2a/0x53
> >  [<ffffffff81087224>] generic_smp_call_function_single_interrupt+0xd8/0xfc
> >  [<ffffffff810100e8>] xen_call_function_single_interrupt+0x13/0x28
> >  [<ffffffff810a936a>] handle_IRQ_event+0x66/0x120
> >  [<ffffffff810aac5b>] handle_percpu_irq+0x41/0x6e
> >  [<ffffffff8128c1a8>] __xen_evtchn_do_upcall+0x1ab/0x27d
> >  [<ffffffff8128dcf9>] xen_evtchn_do_upcall+0x33/0x46
> >  [<ffffffff81013efe>] xen_do_hypervisor_callback+0x1e/0x30
> >  <EOI>
> >  [<ffffffff81447292>] ? _spin_unlock_irqrestore+0x15/0x17
> >  [<ffffffff8100f8af>] ? xen_restore_fl_direct_end+0x0/0x1
> >  [<ffffffff81113f75>] ? flush_old_exec+0x3ac/0x500
> >  [<ffffffff81150dc9>] ? load_elf_binary+0x0/0x17ef
> >  [<ffffffff81150dc9>] ? load_elf_binary+0x0/0x17ef
> >  [<ffffffff81151161>] ? load_elf_binary+0x398/0x17ef
> >  [<ffffffff81042fcf>] ? need_resched+0x23/0x2d
> >
> > [<ffffffff811f463c>] ? process_measurement+0xc0/0xd7
> >  [<ffffffff81150dc9>] ? load_elf_binary+0x0/0x17ef
> >  [<ffffffff81113098>] ? search_binary_handler+0xc8/0x255
> >  [<ffffffff81114366>] ? do_execve+0x1c3/0x29e
> >  [<ffffffff8101155d>] ? sys_execve+0x43/0x5d
> >  [<ffffffff8106fc45>] ? __call_usermodehelper+0x0/0x6f
> >  [<ffffffff81013e28>] ? kernel_execve+0x68/0xd0
> >  [<ffffffff8106fc45>] ? __call_usermodehelper+0x0/0x6f
> >  [<ffffffff8100f8af>] ? xen_restore_fl_direct_end+0x0/0x1
> >  [<ffffffff8106fb64>] ? ____call_usermodehelper+0x113/0x11e
> >  [<ffffffff81013daa>] ? child_rip+0xa/0x20
> >  [<ffffffff8106fc45>] ? __call_usermodehelper+0x0/0x6f
> >  [<ffffffff81012f91>] ? int_ret_from_sys_call+0x7/0x1b
> >  [<ffffffff8101371d>] ? retint_restore_args+0x5/0x6
> >  [<ffffffff81013da0>] ? child_rip+0x0/0x20
> > Code: 41 5e 41 5f c9 c3 55 48 89 e5 0f 1f 44 00 00 e8 17 ff ff ff c9 c3 55 48 89 e5 0f 1f 44 00 00 65 8b 04 25 c8 55 01 00 ff c8 75 04 <0f> 0b eb fe 65 48 8b 34 25 c0 55 01 00 48 81 c6 b8 02 00 00 e8
> > RIP  [<ffffffff8103a3cb>] leave_mm+0x15/0x46
> >  RSP <ffff88002803ee48>
> > ---[ end trace 1522f17fdfc9162d ]---
> > Kernel panic - not syncing: Fatal exception in interrupt
> > Pid: 8022, comm: khelper Tainted: G      D    2.6.32.36xen #1
> > Call Trace:
> >  <IRQ>  [<ffffffff8105682e>] panic+0xe0/0x19a
> >  [<ffffffff8144006a>] ? init_amd+0x296/0x37a
> >  [<ffffffff8100f169>] ? xen_force_evtchn_callback+0xd/0xf
> >  [<ffffffff8100f8c2>] ? check_events+0x12/0x20
> >  [<ffffffff8100f8af>] ? xen_restore_fl_direct_end+0x0/0x1
> >  [<ffffffff81056487>] ? print_oops_end_marker+0x23/0x25
> >  [<ffffffff81448165>] oops_end+0xb6/0xc6
> >  [<ffffffff810166e5>] die+0x5a/0x63
> >  [<ffffffff81447a3c>] do_trap+0x115/0x124
> >  [<ffffffff810148e6>] do_invalid_op+0x9c/0xa5
> >  [<ffffffff8103a3cb>] ? leave_mm+0x15/0x46
> >  [<ffffffff8100f6e6>] ? xen_clocksource_read+0x21/0x23
> >  [<ffffffff8100f258>] ? HYPERVISOR_vcpu_op+0xf/0x11
> >  [<ffffffff8100f753>] ? xen_vcpuop_set_next_event+0x52/0x67
> >  [<ffffffff81013b3b>] invalid_op+0x1b/0x20
> >  [<ffffffff81447292>] ? _spin_unlock_irqrestore+0x15/0x17
> >  [<ffffffff8103a3cb>] ? leave_mm+0x15/0x46
> >  [<ffffffff8100e4a4>] drop_other_mm_ref+0x2a/0x53
> >  [<ffffffff81087224>] generic_smp_call_function_single_interrupt+0xd8/0xfc
> >  [<ffffffff810100e8>] xen_call_function_single_interrupt+0x13/0x28
> >  [<ffffffff810a936a>] handle_IRQ_event+0x66/0x120
> >  [<ffffffff810aac5b>] handle_percpu_irq+0x41/0x6e
> >  [<ffffffff8128c1a8>] __xen_evtchn_do_upcall+0x1ab/0x27d
> >  [<ffffffff8128dcf9>] xen_evtchn_do_upcall+0x33/0x46
> >  [<ffffffff81013efe>] xen_do_hypervisor_callback+0x1e/0x30
> >  <EOI>  [<ffffffff81447292>] ? _spin_unlock_irqrestore+0x15/0x17
> >  [<ffffffff8100f8af>] ? xen_restore_fl_direct_end+0x0/0x1
> >  [<ffffffff81113f75>] ? flush_old_exec+0x3ac/0x500
> >  [<ffffffff81150dc9>] ? load_elf_binary+0x0/0x17ef
> >  [<ffffffff81150dc9>] ? load_elf_binary+0x0/0x17ef
> >  [<ffffffff81151161>] ? load_elf_binary+0x398/0x17ef
> >  [<ffffffff81042fcf>] ? need_resched+0x23/0x2d
> >  [<ffffffff811f463c>] ? process_measurement+0xc0/0xd7
> >  [<ffffffff81150dc9>] ? load_elf_binary+0x0/0x17ef
> >  [<ffffffff81113098>] ? search_binary_handler+0xc8/0x255
> >  [<ffffffff81114366>] ? do_execve+0x1c3/0x29e
> >  [<ffffffff8101155d>] ? sys_execve+0x43/0x5d
> >  [<ffffffff8106fc45>] ? __call_usermodehelper+0x0/0x6f
> >  [<ffffffff81013e28>] ? kernel_execve+0x68/0xd0
> >  [<ffffffff8106fc45>] ? __call_usermodehelper+0x0/0x6f
> >  [<ffffffff8100f8af>] ? xen_restore_fl_direct_end+0x0/0x1
> >  [<ffffffff8106fb64>] ? ____call_usermodehelper+0x113/0x11e
> >  [<ffffffff81013daa>] ? child_rip+0xa/0x20
> >  [<ffffffff8106fc45>] ? __call_usermodehelper+0x0/0x6f
> >  [<ffffffff81012f91>] ? int_ret_from_sys_call+0x7/0x1b
> >  [<ffffffff8101371d>] ? retint_restore_args+0x5/0x6
> >  [<ffffffff81013da0>] ? child_rip+0x0/0x20
> > (XEN) Domain 0 crashed: 'noreboot' set - not rebooting.
> >
> >> Date: Tue, 12 Apr 2011 06:00:00 -0400
> >> From: konrad.wilk@oracle.com
> >> To: tinnycloud@hotmail.com
> >> CC: xen-devel@lists.xensource.com; giamteckchoon@gmail.com;
> >> jeremy@goop.org
> >> Subject: Re: Kernel BUG at arch/x86/mm/tlb.c:61
> >>
> >> On Tue, Apr 12, 2011 at 05:11:51PM +0800, MaoXiaoyun wrote:
> >> >
> >> > Hi :
> >> >
> >> > We are using pvops kernel 2.6.32.36 + xen 4.0.1, but confront a kernel
> >> > panic bug.
> >> >
> >> > 2.6.32.36 Kernel:
> >> > http://git.kernel.org/?p=linux/kernel/git/jeremy/xen.git;a=commit;h=bb1a15e55ec665a64c8a9c6bd699b1f16ac01ff4
> >> > Xen 4.0.1 http://xenbits.xen.org/hg/xen-4.0-testing.hg/rev/b536ebfba183
> >> >
> >> > Our test is simple, 24 HVMS(Win2003 ) on a single host, each HVM loopes
> >> > in restart every 15minutes.
> >>
> >> What is the storage that you are using for your guests? AoE? Local disks?
> >>
> >> > About 17 machines are invovled in the test, after 10 hours run, one
> >> > confrontted a crash at arch/x86/mm/tlb.c:61
> >> >
> >> > Currently I am trying "cpuidle=0 cpufreq=none" tests based on Teck's
> >> > suggestion.
> >> >
> >> > Any comments, thanks.
> >> >


 		 	   		  

[-- Attachment #1.2: Type: text/html, Size: 17127 bytes --]

[-- Attachment #2: Type: text/plain, Size: 138 bytes --]

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xensource.com
http://lists.xensource.com/xen-devel

^ permalink raw reply	[flat|nested] 41+ messages in thread

* RE: Kernel BUG at arch/x86/mm/tlb.c:61
  2011-04-14 11:16                           ` MaoXiaoyun
@ 2011-04-15 12:23                             ` MaoXiaoyun
  2011-04-15 21:22                               ` Jeremy Fitzhardinge
  0 siblings, 1 reply; 41+ messages in thread
From: MaoXiaoyun @ 2011-04-15 12:23 UTC (permalink / raw)
  To: giamteckchoon; +Cc: jeremy, xen devel, konrad.wilk


[-- Attachment #1.1: Type: text/plain, Size: 1166 bytes --]


Hi:

Could the crash be related to this patch?
http://git.kernel.org/?p=linux/kernel/git/jeremy/xen.git;a=commitdiff;h=45bfd7bfc6cf32f8e60bb91b32349f0b5090eea3

Since that change, the TLB state is set to TLBSTATE_OK (mmu_context.h:40) before cpumask_clear_cpu() (line 49).
Could it be that, right after line 40 of mmu_context.h executes, the CPU receives an IPI from another CPU asking it to
flush the mm, and the interrupt handler then finds the TLB state already set to TLBSTATE_OK, which conflicts?
A small sequential model of this window follows the excerpt below.

Thanks.

arch/x86/include/asm/mmu_context.h

 33 static inline void switch_mm(struct mm_struct *prev, struct mm_struct *next,
 34                              struct task_struct *tsk)
 35 {
 36     unsigned cpu = smp_processor_id();
 37 
 38     if (likely(prev != next)) {
 39 #ifdef CONFIG_SMP
 40         percpu_write(cpu_tlbstate.state, TLBSTATE_OK);
 41         percpu_write(cpu_tlbstate.active_mm, next);
 42 #endif
 43         cpumask_set_cpu(cpu, mm_cpumask(next));
 44 
 45         /* Re-load page tables */
 46         load_cr3(next->pgd);
 47 
 48         /* stop flush ipis for the previous mm */
 49         cpumask_clear_cpu(cpu, mm_cpumask(prev));
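
To make the suspected window concrete, here is a small, purely sequential user-space model of that interleaving (illustrative only; the names are invented and this is not kernel code). It just spells out the ordering described above: the state already says TLBSTATE_OK while this CPU is still visible in prev's cpumask, so a flush IPI landing in that window trips the leave_mm() check.

    #include <assert.h>
    #include <stdio.h>

    enum { TLBSTATE_LAZY = 0, TLBSTATE_OK = 1 };

    static int tlb_state = TLBSTATE_LAZY;   /* this CPU's cpu_tlbstate.state      */
    static int in_prev_cpumask = 1;         /* this CPU's bit in mm_cpumask(prev) */

    /* What a flush IPI for 'prev' does if it targets this CPU. */
    static void flush_ipi_for_prev(void)
    {
        if (in_prev_cpumask)
            /* leave_mm() path: the BUG() fires on TLBSTATE_OK */
            assert(tlb_state != TLBSTATE_OK);
    }

    int main(void)
    {
        /* switch_mm() with the patched ordering: */
        tlb_state = TLBSTATE_OK;        /* mmu_context.h:40               */
        flush_ipi_for_prev();           /* IPI arrives inside the window  */
        in_prev_cpumask = 0;            /* mmu_context.h:49, too late     */

        printf("not reached if the assert above fired\n");
        return 0;
    }

In this toy model, the old ordering (clear the prev bit first) closes the window; whether that matches what actually happens with in-flight IPIs on real hardware is exactly what I am unsure about.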

 		 	   		  

[-- Attachment #1.2: Type: text/html, Size: 3143 bytes --]

[-- Attachment #2: Type: text/plain, Size: 138 bytes --]

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xensource.com
http://lists.xensource.com/xen-devel

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: Kernel BUG at arch/x86/mm/tlb.c:61
  2011-04-15 12:23                             ` MaoXiaoyun
@ 2011-04-15 21:22                               ` Jeremy Fitzhardinge
  2011-04-18 15:20                                 ` MaoXiaoyun
                                                   ` (2 more replies)
  0 siblings, 3 replies; 41+ messages in thread
From: Jeremy Fitzhardinge @ 2011-04-15 21:22 UTC (permalink / raw)
  To: MaoXiaoyun; +Cc: xen devel, giamteckchoon, konrad.wilk

On 04/15/2011 05:23 AM, MaoXiaoyun wrote:
> Hi:
>
> Could the crash related to this patch ?
> http://git.kernel.org/?p=linux/kernel/git/jeremy/xen.git;a=commitdiff;h=45bfd7bfc6cf32f8e60bb91b32349f0b5090eea3
>
> Since now TLB state change to TLBSTATE_OK(mmu_context.h:40) is before
> cpumask_clear_cpu(line 49).
> Could it possible that right after execute line 40 of mmu_context.h,
> CPU revice IPI from other CPU to
> flush the mm, and when in interrupt, find the TLB state happened to be
> TLBSTATE_OK. Which conflicts.

Does reverting it help?

J

>
> Thanks.
>
> arch/x86/include/asm/mmu_context.h
>
> 33 static inline void switch_mm(struct mm_struct *prev, struct
> mm_struct *next,
> 34 <+++<+++<+++ struct task_struct *tsk)
> 35 {
> 36 <+++unsigned cpu = smp_processor_id();
> 37
> 38 <+++if (likely(prev != next)) {
> 39 #ifdef CONFIG_SMP
> 40 <+++<+++percpu_write(cpu_tlbstate.state, TLBSTATE_OK);
> 41 <+++<+++percpu_write(cpu_tlbstate.active_mm, next);
> 42 #endif
> 43 <+++<+++cpumask_set_cpu(cpu, mm_cpumask(next));
> 44
> 45 <+++<+++/* Re-load page tables */
> 46 <+++<+++load_cr3(next->pgd);
> 47
> 48 <+++<+++/* stop flush ipis for the previous mm */
> 49 <+++<+++cpumask_clear_cpu(cpu, mm_cpumask(prev));
>
>

^ permalink raw reply	[flat|nested] 41+ messages in thread

* RE: Kernel BUG at arch/x86/mm/tlb.c:61
  2011-04-15 21:22                               ` Jeremy Fitzhardinge
@ 2011-04-18 15:20                                 ` MaoXiaoyun
  2011-04-25  3:15                                 ` MaoXiaoyun
  2011-04-25  4:42                                 ` MaoXiaoyun
  2 siblings, 0 replies; 41+ messages in thread
From: MaoXiaoyun @ 2011-04-18 15:20 UTC (permalink / raw)
  To: xen devel; +Cc: jeremy


[-- Attachment #1.1: Type: text/plain, Size: 2722 bytes --]



 

> Date: Fri, 15 Apr 2011 14:22:29 -0700
> From: jeremy@goop.org
> To: tinnycloud@hotmail.com
> CC: giamteckchoon@gmail.com; xen-devel@lists.xensource.com; konrad.wilk@oracle.com
> Subject: Re: Kernel BUG at arch/x86/mm/tlb.c:61
> 
> On 04/15/2011 05:23 AM, MaoXiaoyun wrote:
> > Hi:
> >
> > Could the crash related to this patch ?
> > http://git.kernel.org/?p=linux/kernel/git/jeremy/xen.git;a=commitdiff;h=45bfd7bfc6cf32f8e60bb91b32349f0b5090eea3
> >
> > Since now TLB state change to TLBSTATE_OK(mmu_context.h:40) is before
> > cpumask_clear_cpu(line 49).
> > Could it possible that right after execute line 40 of mmu_context.h,
> > CPU revice IPI from other CPU to
> > flush the mm, and when in interrupt, find the TLB state happened to be
> > TLBSTATE_OK. Which conflicts.
> 
> Does reverting it help?
> 
> J
 
Very likely.
 
Previously, in the 17-machine test, one to three machines would fail within 10 hours, quite easily.

After reverting, we have 29 machines in the test: 28 ran successfully for 2 days, and 1 failed after 28 hours.
Unfortunately I can't tell whether that failure is related to this bug, since I got nothing in the messages log, and
the machine was rebooted by someone before I could see anything on the serial port.

In my opinion, though, that failure points to another bug, which I happened to run into before.

Earlier, one of my development machines (2.6.32.36 kernel + xen 4.0.1) completely stopped responding,
including the serial console. There was no abnormal message on the serial port; it looks like Xen ran into a deadlock.
It happens rarely; I have only seen it once so far.

Now I am trying to figure out what might cause the deadlock; we never saw this before.
I don't have a clear idea of how to dig it out, but I think this bug is in Xen,
because if only dom0 hung, Xen should still work and the serial console would still respond.
If so, the bug may have been introduced between 4.0.0 and 4.0.1.

What do you think? Thanks.

> >
> > Thanks.
> >
> > arch/x86/include/asm/mmu_context.h
> >
> > 33 static inline void switch_mm(struct mm_struct *prev, struct
> > mm_struct *next,
> > 34 <+++<+++<+++ struct task_struct *tsk)
> > 35 {
> > 36 <+++unsigned cpu = smp_processor_id();
> > 37
> > 38 <+++if (likely(prev != next)) {
> > 39 #ifdef CONFIG_SMP
> > 40 <+++<+++percpu_write(cpu_tlbstate.state, TLBSTATE_OK);
> > 41 <+++<+++percpu_write(cpu_tlbstate.active_mm, next);
> > 42 #endif
> > 43 <+++<+++cpumask_set_cpu(cpu, mm_cpumask(next));
> > 44
> > 45 <+++<+++/* Re-load page tables */
> > 46 <+++<+++load_cr3(next->pgd);
> > 47
> > 48 <+++<+++/* stop flush ipis for the previous mm */
> > 49 <+++<+++cpumask_clear_cpu(cpu, mm_cpumask(prev));
> >
> >
> 
 		 	   		  

[-- Attachment #1.2: Type: text/html, Size: 3520 bytes --]

[-- Attachment #2: Type: text/plain, Size: 138 bytes --]

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xensource.com
http://lists.xensource.com/xen-devel

^ permalink raw reply	[flat|nested] 41+ messages in thread

* RE: Kernel BUG at arch/x86/mm/tlb.c:61
  2011-04-15 21:22                               ` Jeremy Fitzhardinge
  2011-04-18 15:20                                 ` MaoXiaoyun
@ 2011-04-25  3:15                                 ` MaoXiaoyun
  2011-04-26  5:52                                   ` Tian, Kevin
  2011-04-25  4:42                                 ` MaoXiaoyun
  2 siblings, 1 reply; 41+ messages in thread
From: MaoXiaoyun @ 2011-04-25  3:15 UTC (permalink / raw)
  To: jeremy; +Cc: xen devel, giamteckchoon, konrad.wilk


[-- Attachment #1.1: Type: text/plain, Size: 2620 bytes --]


 

> Date: Fri, 15 Apr 2011 14:22:29 -0700
> From: jeremy@goop.org
> To: tinnycloud@hotmail.com
> CC: giamteckchoon@gmail.com; xen-devel@lists.xensource.com; konrad.wilk@oracle.com
> Subject: Re: Kernel BUG at arch/x86/mm/tlb.c:61
> 
> On 04/15/2011 05:23 AM, MaoXiaoyun wrote:
> > Hi:
> >
> > Could the crash related to this patch ?
> > http://git.kernel.org/?p=linux/kernel/git/jeremy/xen.git;a=commitdiff;h=45bfd7bfc6cf32f8e60bb91b32349f0b5090eea3
> >
> > Since now TLB state change to TLBSTATE_OK(mmu_context.h:40) is before
> > cpumask_clear_cpu(line 49).
> > Could it possible that right after execute line 40 of mmu_context.h,
> > CPU revice IPI from other CPU to
> > flush the mm, and when in interrupt, find the TLB state happened to be
> > TLBSTATE_OK. Which conflicts.
> 
> Does reverting it help?
> 
> J
 
Hi Jeremy:
 
    The latest test result shows that reverting didn't help.
    The kernel panics at exactly the same place in tlb.c.

    I have a question about the TLB state. Given the stack
    xen_do_hypervisor_callback -> xen_evtchn_do_upcall -> ... -> drop_other_mm_ref,

    what should cpu_tlbstate.state be here? Are both TLBSTATE_OK and TLBSTATE_LAZY possible?
    That is, if the hypervisor callback arrived from user space, will the state be TLBSTATE_OK,
    and if it arrived from kernel space, TLBSTATE_LAZY? (My current reading of where the state
    gets set is sketched after the trace below.)

       Thanks.
    
 [<ffffffff8100e4a4>] drop_other_mm_ref+0x2a/0x53
 [<ffffffff81087224>] generic_smp_call_function_single_interrupt+0xd8/0xfc
 [<ffffffff810100e8>] xen_call_function_single_interrupt+0x13/0x28
 [<ffffffff810a936a>] handle_IRQ_event+0x66/0x120
 [<ffffffff810aac5b>] handle_percpu_irq+0x41/0x6e
 [<ffffffff8128c1a8>] __xen_evtchn_do_upcall+0x1ab/0x27d
 [<ffffffff8128dcf9>] xen_evtchn_do_upcall+0x33/0x46
 [<ffffffff81013efe>] xen_do_hypervisor_callback+0x1e/0x30
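
My current guess, written as a toy sketch so it is easy to shoot down (the *_model names are invented; this is only my paraphrase of how I read switch_mm() and enter_lazy_tlb(), not kernel code): the state seems to be driven by context switches rather than by whether the callback interrupted user or kernel mode.

    enum tlbstate { TLBSTATE_OK = 1, TLBSTATE_LAZY = 2 };

    struct cpu_tlbstate_model {
        enum tlbstate state;
        void *active_mm;
    };

    /* switch_mm(prev, next): the CPU starts actively using 'next'. */
    static void switch_mm_model(struct cpu_tlbstate_model *c, void *next)
    {
        c->state = TLBSTATE_OK;
        c->active_mm = next;
    }

    /* enter_lazy_tlb(): a kernel thread is scheduled and only borrows the
       previous mm; the CPU no longer promises its TLB stays in sync with it. */
    static void enter_lazy_tlb_model(struct cpu_tlbstate_model *c)
    {
        if (c->state == TLBSTATE_OK)
            c->state = TLBSTATE_LAZY;
    }

    int main(void)
    {
        int mm;                                         /* stand-in for an mm   */
        struct cpu_tlbstate_model c = { TLBSTATE_LAZY, 0 };

        switch_mm_model(&c, &mm);       /* user task scheduled in: OK           */
        enter_lazy_tlb_model(&c);       /* kernel thread scheduled in: LAZY     */
        return c.state == TLBSTATE_LAZY ? 0 : 1;
    }

If that reading is right, then at IPI time either value is possible on a given CPU, depending on whether a user task or a lazy kernel thread was last switched in there. Please correct me if I have this wrong.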


> 
> >
> > Thanks.
> >
> > arch/x86/include/asm/mmu_context.h
> >
> > 33 static inline void switch_mm(struct mm_struct *prev, struct
> > mm_struct *next,
> > 34 <+++<+++<+++ struct task_struct *tsk)
> > 35 {
> > 36 <+++unsigned cpu = smp_processor_id();
> > 37
> > 38 <+++if (likely(prev != next)) {
> > 39 #ifdef CONFIG_SMP
> > 40 <+++<+++percpu_write(cpu_tlbstate.state, TLBSTATE_OK);
> > 41 <+++<+++percpu_write(cpu_tlbstate.active_mm, next);
> > 42 #endif
> > 43 <+++<+++cpumask_set_cpu(cpu, mm_cpumask(next));
> > 44
> > 45 <+++<+++/* Re-load page tables */
> > 46 <+++<+++load_cr3(next->pgd);
> > 47
> > 48 <+++<+++/* stop flush ipis for the previous mm */
> > 49 <+++<+++cpumask_clear_cpu(cpu, mm_cpumask(prev));
> >
> >
> 
 		 	   		  

[-- Attachment #1.2: Type: text/html, Size: 8237 bytes --]

[-- Attachment #2: Type: text/plain, Size: 138 bytes --]

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xensource.com
http://lists.xensource.com/xen-devel

^ permalink raw reply	[flat|nested] 41+ messages in thread

* RE: Kernel BUG at arch/x86/mm/tlb.c:61
  2011-04-15 21:22                               ` Jeremy Fitzhardinge
  2011-04-18 15:20                                 ` MaoXiaoyun
  2011-04-25  3:15                                 ` MaoXiaoyun
@ 2011-04-25  4:42                                 ` MaoXiaoyun
  2011-04-25 12:54                                   ` MaoXiaoyun
  2 siblings, 1 reply; 41+ messages in thread
From: MaoXiaoyun @ 2011-04-25  4:42 UTC (permalink / raw)
  To: jeremy; +Cc: xen devel, giamteckchoon, konrad.wilk


[-- Attachment #1.1: Type: text/plain, Size: 2970 bytes --]


I went through switch_mm() some more, and came up with one more question:

Why don't we need to clear this CPU from prev's cpumask somewhere between lines 59 and 60, i.e. in the prev == next path?

Say:
1)  Context switches from process A to a kernel thread; the kernel thread then has active_mm pointing at A's mm.
2)  Context switches from the kernel thread back to A; in sched.c, oldmm = A's mm and mm = A's mm.
3)  switch_mm() therefore takes the branch at arch/x86/include/asm/mmu_context.h:60, since prev == next.
     If another CPU now flushes A's mm while this CPU has not cleared its bit in the CPU mask, this CPU can enter the IPI
     routine and find cpu_tlbstate.state already set to TLBSTATE_OK.

Could this happen? (The scenario is written out step by step in the sketch after the switch_mm() excerpt below.)
 
kernel/sched.c
 
2999 context_switch(struct rq *rq, struct task_struct *prev,
 3000            struct task_struct *next)
 3001 {
 3002     struct mm_struct *mm, *oldmm;
 3003 
 3004     prepare_task_switch(rq, prev, next);
 3005     trace_sched_switch(rq, prev, next);
 3006     mm = next->mm;
 3007     oldmm = prev->active_mm;
 3008     /*
 3009      * For paravirt, this is coupled with an exit in switch_to to
 3010      * combine the page table reload and the switch backend into
 3011      * one hypercall.
 3012      */
 3013     arch_start_context_switch(prev);
 3014 
 3015     if (unlikely(!mm)) {
 3016         next->active_mm = oldmm;
 3017         atomic_inc(&oldmm->mm_count);
 3018         enter_lazy_tlb(oldmm, next);
 3019     } else
 3020         switch_mm(oldmm, mm, next);
 3021 
 3022     if (unlikely(!prev->mm)) {
 3023         prev->active_mm = NULL;
 3024         rq->prev_mm = oldmm;
 3025     }
 
 
 33 static inline void switch_mm(struct mm_struct *prev, struct mm_struct *next,
 34                  struct task_struct *tsk)
 35 {
 36     unsigned cpu = smp_processor_id();
 37 
 38     if (likely(prev != next)) {
 39         /* stop flush ipis for the previous mm */
 40         cpumask_clear_cpu(cpu, mm_cpumask(prev));
 41 
 42 
 43 #ifdef CONFIG_SMP
 44         percpu_write(cpu_tlbstate.state, TLBSTATE_OK);
 45         percpu_write(cpu_tlbstate.active_mm, next);
 46 #endif
 47         cpumask_set_cpu(cpu, mm_cpumask(next));
 48 
 49         /* Re-load page tables */
 50         load_cr3(next->pgd);
 51 
 52         /*
 53          * load the LDT, if the LDT is different:
 54          */
 55         if (unlikely(prev->context.ldt != next->context.ldt))
 56             load_LDT_nolock(&next->context);
 57     }
 58 #ifdef CONFIG_SMP
 59     else {
 60         percpu_write(cpu_tlbstate.state, TLBSTATE_OK);
 61         BUG_ON(percpu_read(cpu_tlbstate.active_mm) != next);
 62 
 63         if (!cpumask_test_and_set_cpu(cpu, mm_cpumask(next))) {
 64             /* We were in lazy tlb mode and leave_mm disabled
 65              * tlb flush IPI delivery. We must reload CR3
 66              * to make sure to use no freed page tables.
 67              */
 68             load_cr3(next->pgd);
 69             load_LDT_nolock(&next->context);
 70         }
 71     } 
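 
To make the window I mean concrete, here is my own annotated excerpt of the else branch
(the comment is mine, not from the kernel source; line numbers refer to the listing above):
 
 59     else {
 60         percpu_write(cpu_tlbstate.state, TLBSTATE_OK);
            /*
             * <-- suspected window: if the IPI that ends up in
             *     drop_other_mm_ref()/leave_mm() arrives here, the handler
             *     finds active_mm == next with cpu_tlbstate.state already
             *     TLBSTATE_OK, and leave_mm() in that state is exactly the
             *     BUG() at arch/x86/mm/tlb.c:61.
             */
 61         BUG_ON(percpu_read(cpu_tlbstate.active_mm) != next);
        ...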

[-- Attachment #1.2: Type: text/html, Size: 6837 bytes --]

[-- Attachment #2: Type: text/plain, Size: 138 bytes --]

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xensource.com
http://lists.xensource.com/xen-devel

^ permalink raw reply	[flat|nested] 41+ messages in thread

* RE: Kernel BUG at arch/x86/mm/tlb.c:61
  2011-04-25  4:42                                 ` MaoXiaoyun
@ 2011-04-25 12:54                                   ` MaoXiaoyun
  2011-04-25 13:11                                     ` MaoXiaoyun
  0 siblings, 1 reply; 41+ messages in thread
From: MaoXiaoyun @ 2011-04-25 12:54 UTC (permalink / raw)
  To: jeremy; +Cc: xen devel, giamteckchoon, konrad.wilk


[-- Attachment #1.1: Type: text/plain, Size: 7257 bytes --]


I added some debug info in drop_other_mm_ref() (the BUG() at line 1516 below) and got one machine crash.
Log attached; a pity I lost the printk info.
 
Does current->mm indicate userspace?
Thanks.
 
============================
1502 #ifdef CONFIG_SMP
1503 /* Another cpu may still have their %cr3 pointing at the pagetable, so
1504    we need to repoint it somewhere else before we can unpin it. */
1505 static void drop_other_mm_ref(void *info)
1506 {
1507     struct mm_struct *mm = info;
1508     struct mm_struct *active_mm;
1509 
1510     active_mm = percpu_read(cpu_tlbstate.active_mm);
1511 
1512     if (active_mm == mm) {
1513         if (current->mm) {
1514             printk("in userspace active_mm %p mm %p curr_mm %p tlbstate%d\n",
1515                    active_mm, mm, current->mm, percpu_read(cpu_tlbstate.state));
1516             BUG();
1517         }
1518         leave_mm(smp_processor_id());
1519     }
1520 
 
 
============================
 
Starting udev: ------------[ cut here ]------------
kernel BUG at arch/x86/xen/mmu.c:1516!
invalid opcode: 0000 [#1] SMP 
last sysfs file: /sys/class/raw/rawctl/dev
CPU 2 
Modules linked in: snd_seq_dummy bnx2 snd_seq_oss(+) snd_seq_midi_event snd_seq 
snd_seq_device serio_raw snd_pcm_oss snd_mixer_oss snd_pcm snd_timer i2c_i801 i2c_core iTCO_wdt snd pata_acpi iTCO_vendor_support ata_generic soundcore 
snd_page_alloc pcspkr ata_piix shpchp mptsas mptscsih mptbase                           
Pid: 1126, comm: khelper Not tainted 2.6.32.36xen #1 Tecal RH2285          
RIP: e030:[<ffffffff8100e4c0>]  [<ffffffff8100e4c0>] drop_other_mm_ref+0x46/0x80
RSP: e02b:ffff880028078e58  EFLAGS: 00010092
RAX: 0000000000000015 RBX: 0000000000000001 RCX: 00000000ffff0075
RDX: 0000000000009f9f RSI: ffffffff8144006a RDI: 0000000000000004
RBP: ffff880028078e68 R08: 0000000000000001 R09: 0000000000000000
R10: 0000000028078cf8 R11: 0000000000000246 R12: ffff88012c032680
R13: ffff880028080020 R14: 00000000000004f1 R15: 0000000000000000
FS:  00007f01adcf8710(0000) GS:ffff880028075000(0000) knlGS:0000000000000000
CS:  e033 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 00007f01adf20648 CR3: 000000012a546000 CR4: 0000000000002660
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process khelper (pid: 1126, threadinfo ffff88012d80e000, task ffff88012b880000)
Stack:
 0000000000000001 ffff88012bb9bb88 ffff880028078e98 ffffffff81087224
<0> ffff880028078e78 ffff880028078e78 ffff88015f808540 00000000000004f1
<0> ffff880028078ea8 ffffffff81010118 ffff880028078ee8 ffffffff810a936a
Call Trace:
 <IRQ> 
 [<ffffffff81087224>] generic_smp_call_function_single_interrupt+0xd8/0xfc
 [<ffffffff81010118>] xen_call_function_single_interrupt+0x13/0x28
 [<ffffffff810a936a>] handle_IRQ_event+0x66/0x120
 [<ffffffff810aac5b>] handle_percpu_irq+0x41/0x6e
 [<ffffffff8128c1a8>] __xen_evtchn_do_upcall+0x1ab/0x27d
 [<ffffffff8128dcf9>] xen_evtchn_do_upcall+0x33/0x46
 [<ffffffff81013efe>] xen_do_hypervisor_callback+0x1e/0x30
 <EOI> 
 [<ffffffff8100f8df>] ? xen_restore_fl_direct_end+0x0/0x1
 [<ffffffff8100922a>] ? hypercall_page+0x22a/0x1000
 [<ffffffff8100922a>] ? hypercall_page+0x22a/0x1000
 [<ffffffff81447292>] ? _spin_unlock_irqrestore+0x15/0x17
 [<ffffffff8100f195>] ? xen_force_evtchn_callback+0xd/0xf
 [<ffffffff8100f8f2>] ? check_events+0x12/0x20
 [<ffffffff81447292>] ? _spin_unlock_irqrestore+0x15/0x17
 [<ffffffff8100f8df>] ? xen_restore_fl_direct_end+0x0/0x1
 [<ffffffff8100f8df>] ? xen_restore_fl_direct_end+0x0/0x1
 [<ffffffff8100d47f>] ? xen_mc_issue+0x2e/0x33
 [<ffffffff8100e42f>] ? __xen_pgd_pin+0xc1/0xc9
 [<ffffffff8100e449>] ? xen_pgd_pin+0x12/0x14
 [<ffffffff8100e470>] ? xen_activate_mm+0x25/0x2f
 [<ffffffff81113f59>] ? flush_old_exec+0x390/0x500
 [<ffffffff81150dc9>] ? load_elf_binary+0x0/0x17ef
 [<ffffffff81150dc9>] ? load_elf_binary+0x0/0x17ef
 [<ffffffff81151161>] ? load_elf_binary+0x398/0x17ef
 [<ffffffff81042fcf>] ? need_resched+0x23/0x2d
 [<ffffffff811f463c>] ? process_measurement+0xc0/0xd7
 [<ffffffff81150dc9>] ? load_elf_binary+0x0/0x17ef
 [<ffffffff81113098>] ? search_binary_handler+0xc8/0x255
 [<ffffffff81114366>] ? do_execve+0x1c3/0x29e
 [<ffffffff8101155d>] ? sys_execve+0x43/0x5d
 [<ffffffff8106fc45>] ? __call_usermodehelper+0x0/0x6f
 [<ffffffff81013e28>] ? kernel_execve+0x68/0xd0
 [<ffffffff8106fc45>] ? __call_usermodehelper+0x0/0x6f
 [<ffffffff8100f8df>] ? xen_restore_fl_direct_end+0x0/0x1
 [<ffffffff8106fb64>] ? ____call_usermodehelper+0x113/0x11e
 [<ffffffff81013daa>] ? child_rip+0xa/0x20
 [<ffffffff8106fc45>] ? __call_usermodehelper+0x0/0x6f
 [<ffffffff81012f91>] ? int_ret_from_sys_call+0x7/0x1b
 [<ffffffff8101371d>] ? retint_restore_args+0x5/0x6
 [<ffffffff81013da0>] ? child_rip+0x0/0x20
Code: 75 3a 65 48 8b 04 25 c0 cb 00 00 48 83 b8 78 02 00 00 00 74 1a 65 8b 34 25 c8 55 01 00 48 c7 c7 06 98 5b 81 31 c0 e8 d9 90 04 00 <0f> 0b eb fe 65 8b 3c 
25 78 e3 00 00 e8 e5 be 02 00 65 48 8b 1c                                         
RIP  [<ffffffff8100e4c0>] drop_other_mm_ref+0x46/0x80
 RSP <ffff880028078e58>
[<ffffffff8144006a>] ? init_amd+0x296/0x37a
 [<ffffffff8100f195>] ? xen_force_evtchn_callback+0xd/0xf
 [<ffffffff8100f8f2>] ? check_events+0x12/0x20
 [<ffffffff81056487>] ? print_oops_end_marker+0x23/0x25
 [<ffffffff81448165>] oops_end+0xb6/0xc6
 [<ffffffff810166e5>] die+0x5a/0x63
 [<ffffffff81447a3c>] do_trap+0x115/0x124
 [<ffffffff810148e6>] do_invalid_op+0x9c/0xa5
 [<ffffffff8100e4c0>] ? drop_other_mm_ref+0x46/0x80
 [<ffffffff81057640>] ? printk+0xa7/0xa9
 [<ffffffff81013b3b>] invalid_op+0x1b/0x20
 [<ffffffff8144006a>] ? init_amd+0x296/0x37a
 [<ffffffff8100e4c0>] ? drop_other_mm_ref+0x46/0x80
 [<ffffffff8100e4c0>] ? drop_other_mm_ref+0x46/0x80
 [<ffffffff81087224>] generic_smp_call_function_single_interrupt+0xd8/0xfc
 [<ffffffff81010118>] xen_call_function_single_interrupt+0x13/0x28
 [<ffffffff810a936a>] handle_IRQ_event+0x66/0x120
 [<ffffffff810aac5b>] handle_percpu_irq+0x41/0x6e
 [<ffffffff8128c1a8>] __xen_evtchn_do_upcall+0x1ab/0x27d
 [<ffffffff8128dcf9>] xen_evtchn_do_upcall+0x33/0x46
 [<ffffffff81013efe>] xen_do_hypervisor_callback+0x1e/0x30
 <EOI>  [<ffffffff8100f8df>] ? xen_restore_fl_direct_end+0x0/0x1
 [<ffffffff8100922a>] ? hypercall_page+0x22a/0x1000
 [<ffffffff8100922a>] ? hypercall_page+0x22a/0x1000
 [<ffffffff81447292>] ? _spin_unlock_irqrestore+0x15/0x17
 [<ffffffff8100f195>] ? xen_force_evtchn_callback+0xd/0xf
 [<ffffffff8100f8f2>] ? check_events+0x12/0x20
 [<ffffffff81447292>] ? _spin_unlock_irqrestore+0x15/0x17
 [<ffffffff8100f8df>] ? xen_restore_fl_direct_end+0x0/0x1
 [<ffffffff8100f8df>] ? xen_restore_fl_direct_end+0x0/0x1
 [<ffffffff8100d47f>] ? xen_mc_issue+0x2e/0x33
 [<ffffffff8100e42f>] ? __xen_pgd_pin+0xc1/0xc9
 [<ffffffff8100e449>] ? xen_pgd_pin+0x12/0x14
 [<ffffffff8100e470>] ? xen_activate_mm+0x25/0x2f
 [<ffffffff81113f59>] ? flush_old_exec+0x390/0x500
 [<ffffffff81150dc9>] ? load_elf_binary+0x0/0x17ef
 [<ffffffff81150dc9>] ? load_elf_binary+0x0/0x17ef
 [<ffffffff81151161>] ? load_elf_binary+0x398/0x17ef

[-- Attachment #1.2: Type: text/html, Size: 9845 bytes --]

[-- Attachment #2: Type: text/plain, Size: 138 bytes --]

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xensource.com
http://lists.xensource.com/xen-devel

^ permalink raw reply	[flat|nested] 41+ messages in thread

* RE: Kernel BUG at arch/x86/mm/tlb.c:61
  2011-04-25 12:54                                   ` MaoXiaoyun
@ 2011-04-25 13:11                                     ` MaoXiaoyun
  2011-04-25 15:05                                       ` MaoXiaoyun
  0 siblings, 1 reply; 41+ messages in thread
From: MaoXiaoyun @ 2011-04-25 13:11 UTC (permalink / raw)
  To: jeremy; +Cc: xen devel, giamteckchoon, konrad.wilk


[-- Attachment #1.1: Type: text/plain, Size: 541 bytes --]


 
>From: tinnycloud@hotmail.com
>To: jeremy@goop.org
>CC: giamteckchoon@gmail.com; xen-devel@lists.xensource.com; konrad.wilk@oracle.com
>Subject: RE: Kernel BUG at arch/x86/mm/tlb.c:61
>Date: Mon, 25 Apr 2011 20:54:54 +0800




>Add some debug info in drop_other_mm_ref(line 1516), get on machine crash.
>log attached, pity I lost prink info.

printk info: in userspace active_mm ffff8800a3669f80 mm ffff8800a3669f80 curr_mm ffff88008d73c000 tlbstate 2  

>Does current->mm indicates userspace?
>Thanks.
 


[-- Attachment #1.2: Type: text/html, Size: 1026 bytes --]

[-- Attachment #2: Type: text/plain, Size: 138 bytes --]

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xensource.com
http://lists.xensource.com/xen-devel

^ permalink raw reply	[flat|nested] 41+ messages in thread

* RE: Kernel BUG at arch/x86/mm/tlb.c:61
  2011-04-25 13:11                                     ` MaoXiaoyun
@ 2011-04-25 15:05                                       ` MaoXiaoyun
  2011-04-26  5:55                                         ` Tian, Kevin
  0 siblings, 1 reply; 41+ messages in thread
From: MaoXiaoyun @ 2011-04-25 15:05 UTC (permalink / raw)
  To: jeremy; +Cc: xen devel, giamteckchoon, konrad.wilk


[-- Attachment #1.1: Type: text/plain, Size: 715 bytes --]


Please ignore my last two mails; I just learnt that current is meaningless in irq context.
 
Here is my overall assumption:
 
In my opinion:
 
1) A CPU running in switch_mm may receive an IPI and enter the interrupt handler.
2) Before reverting that patch, no matter whether the if statement is true or not, cpu_tlbstate.state
could already have been changed to TLBSTATE_OK right before entering the irq routine.
3) Since cpu_tlbstate is a per-CPU variable, testing cpu_tlbstate.state before calling leave_mm()
in drop_other_mm_ref is feasible and necessary.
4) If I am right, the strange thing is that the 2.6.32.36 code is the same as 2.6.31.x, where we never hit this tlb bug.
 
Any comments?
 
Many thanks.

[-- Attachment #1.2: Type: text/html, Size: 1017 bytes --]

[-- Attachment #2: Type: text/plain, Size: 138 bytes --]

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xensource.com
http://lists.xensource.com/xen-devel

^ permalink raw reply	[flat|nested] 41+ messages in thread

* RE: RE: Kernel BUG at arch/x86/mm/tlb.c:61
  2011-04-25  3:15                                 ` MaoXiaoyun
@ 2011-04-26  5:52                                   ` Tian, Kevin
  2011-04-26  7:04                                     ` MaoXiaoyun
  2011-04-28 23:29                                     ` Jeremy Fitzhardinge
  0 siblings, 2 replies; 41+ messages in thread
From: Tian, Kevin @ 2011-04-26  5:52 UTC (permalink / raw)
  To: MaoXiaoyun, jeremy; +Cc: xen devel, giamteckchoon, konrad.wilk

[-- Attachment #1: Type: text/plain, Size: 2918 bytes --]

>From: MaoXiaoyun
>Sent: Monday, April 25, 2011 11:15 AM
>> Date: Fri, 15 Apr 2011 14:22:29 -0700
>> From: jeremy@goop.org
>> To: tinnycloud@hotmail.com
>> CC: giamteckchoon@gmail.com; xen-devel@lists.xensource.com; konrad.wilk@oracle.com
>> Subject: Re: Kernel BUG at arch/x86/mm/tlb.c:61
>> 
>> On 04/15/2011 05:23 AM, MaoXiaoyun wrote:
>> > Hi:
>> >
>> > Could the crash related to this patch ?
>> > http://git.kernel.org/?p=linux/kernel/git/jeremy/xen.git;a=commitdiff;h=45bfd7bfc6cf32f8e60bb91b32349f0b5090eea3
>> >
>> > Since now TLB state change to TLBSTATE_OK(mmu_context.h:40) is before
>> > cpumask_clear_cpu(line 49).
>> > Could it possible that right after execute line 40 of mmu_context.h,
>> > CPU revice IPI from other CPU to
>> > flush the mm, and when in interrupt, find the TLB state happened to be
>> > TLBSTATE_OK. Which conflicts.
>> 
>> Does reverting it help?
>> 
>> J
> 
>Hi Jeremy:
> 
>    The lastest test result shows the reverting didn't help.
>    Kernel panic exactly at the same place in tlb.c.
> 
>    I have question about TLB state, from the stack, 
>    xen_do_hypervisor_callback-> xen_evtchn_do_upcall->... ->drop_other_mm_ref
> 
>    What  cpu_tlbstate.state should be,  could  TLBSTATE_OK or TLBSTATE_LAZY all be possible? 
>    That is after a hypercall from userspace, state will be TLBSTATE_OK, and
>      if from kernel space, state will be TLBSTATE_LAZE ? 
> 
>       thanks.

It looks like a bug in the drop_other_mm_ref implementation: the current TLB state should be checked
before invoking leave_mm(). There is a race window around the lines of code below:

<xen_drop_mm_ref>
       /* Get the "official" set of cpus referring to our pagetable. */
        if (!alloc_cpumask_var(&mask, GFP_ATOMIC)) {
                for_each_online_cpu(cpu) {
                        if (!cpumask_test_cpu(cpu, mm_cpumask(mm))
                            && per_cpu(xen_current_cr3, cpu) != __pa(mm->pgd))
                                continue;
                        smp_call_function_single(cpu, drop_other_mm_ref, mm, 1);
                }
                return;
        }

There is a chance that, by the time smp_call_function_single is invoked, the actual TLB state has already
been updated on the other cpu. The upstream kernel patch you referred to earlier just makes
this bug show up more easily. But even without that patch you may still hit the issue,
which is why reverting the patch doesn't help.

Could you try adding a check in drop_other_mm_ref?

        if (active_mm == mm && percpu_read(cpu_tlbstate.state) != TLBSTATE_OK)
                leave_mm(smp_processor_id());

Once the interrupted context has TLBSTATE_OK, it implies that it will handle the TLB flush
itself later, so there is no need for leave_mm from the interrupt handler; that is the
assumption behind doing leave_mm.
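
For reference, leave_mm() in arch/x86/mm/tlb.c looks roughly like this in 2.6.32 (quoted
from memory, only for illustration); the BUG() below is the line 61 that the crashes hit:

void leave_mm(int cpu)
{
        /* Must only be called in lazy TLB mode; a cpu that is actively
           using this mm (TLBSTATE_OK) will do its own flush later. */
        if (percpu_read(cpu_tlbstate.state) == TLBSTATE_OK)
                BUG();
        cpumask_clear_cpu(cpu,
                          mm_cpumask(percpu_read(cpu_tlbstate.active_mm)));
        load_cr3(swapper_pg_dir);
}

With the extra state check, the IPI handler simply skips leave_mm() in exactly the case
that would otherwise trip this BUG().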

Thanks
Kevin

[-- Attachment #2: Type: text/plain, Size: 138 bytes --]

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xensource.com
http://lists.xensource.com/xen-devel

^ permalink raw reply	[flat|nested] 41+ messages in thread

* RE: RE: Kernel BUG at arch/x86/mm/tlb.c:61
  2011-04-25 15:05                                       ` MaoXiaoyun
@ 2011-04-26  5:55                                         ` Tian, Kevin
  0 siblings, 0 replies; 41+ messages in thread
From: Tian, Kevin @ 2011-04-26  5:55 UTC (permalink / raw)
  To: MaoXiaoyun, jeremy; +Cc: xen devel, giamteckchoon, konrad.wilk


[-- Attachment #1.1: Type: text/plain, Size: 1232 bytes --]

The race window is always there, but whether it gets triggered is not deterministic. It's possible that you have never hit this bug on 2.6.31.x so far, but that doesn't mean you won't hit it in the long run. :)

Thanks
Kevin

From: xen-devel-bounces@lists.xensource.com [mailto:xen-devel-bounces@lists.xensource.com] On Behalf Of MaoXiaoyun
Sent: Monday, April 25, 2011 11:05 PM
To: jeremy@goop.org
Cc: xen devel; giamteckchoon@gmail.com; konrad.wilk@oracle.com
Subject: [Xen-devel] RE: Kernel BUG at arch/x86/mm/tlb.c:61

Please ignore my last two mails; I just learnt that current is meaningless in irq context.

Here is my overall assumption:

In my opinion:

1) A CPU running in switch_mm may receive an IPI and enter the interrupt handler.
2) Before reverting that patch, no matter whether the if statement is true or not, cpu_tlbstate.state
could already have been changed to TLBSTATE_OK right before entering the irq routine.
3) Since cpu_tlbstate is a per-CPU variable, testing cpu_tlbstate.state before calling leave_mm()
in drop_other_mm_ref is feasible and necessary.
4) If I am right, the strange thing is that the 2.6.32.36 code is the same as 2.6.31.x, where we never hit this tlb bug.

Any comments?

Many thanks.


[-- Attachment #1.2: Type: text/html, Size: 4985 bytes --]

[-- Attachment #2: Type: text/plain, Size: 138 bytes --]

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xensource.com
http://lists.xensource.com/xen-devel

^ permalink raw reply	[flat|nested] 41+ messages in thread

* RE: RE: Kernel BUG at arch/x86/mm/tlb.c:61
  2011-04-26  5:52                                   ` Tian, Kevin
@ 2011-04-26  7:04                                     ` MaoXiaoyun
  2011-04-26  8:31                                       ` Tian, Kevin
  2011-04-28 23:29                                     ` Jeremy Fitzhardinge
  1 sibling, 1 reply; 41+ messages in thread
From: MaoXiaoyun @ 2011-04-26  7:04 UTC (permalink / raw)
  To: kevin.tian, jeremy; +Cc: xen devel, giamteckchoon, konrad.wilk


[-- Attachment #1.1: Type: text/plain, Size: 3725 bytes --]


Many thanks, Kevin.
 
I agree on the race window.
One more thing: in my understanding, the CPU that sends out the IPI will unpin the pagetable after
receiving ACKs from all the other CPUs. If a CPU that receives the IPI enters drop_other_mm_ref with
TLBSTATE_OK and does nothing, is it possible that it later ends up using a stale pagetable
(one that has been unpinned by the sender CPU)?
 
So do we need to flush the TLB when its state is TLBSTATE_OK?
 
if (active_mm == mm) {
        if (percpu_read(cpu_tlbstate.state) == TLBSTATE_OK)
                load_cr3(mm->pgd);
        else
                leave_mm(smp_processor_id());
}

 
> From: kevin.tian@intel.com
> To: tinnycloud@hotmail.com; jeremy@goop.org
> CC: xen-devel@lists.xensource.com; giamteckchoon@gmail.com; konrad.wilk@oracle.com
> Date: Tue, 26 Apr 2011 13:52:11 +0800
> Subject: RE: [Xen-devel] RE: Kernel BUG at arch/x86/mm/tlb.c:61
> 
> >From: MaoXiaoyun
> >Sent: Monday, April 25, 2011 11:15 AM
> >> Date: Fri, 15 Apr 2011 14:22:29 -0700
> >> From: jeremy@goop.org
> >> To: tinnycloud@hotmail.com
> >> CC: giamteckchoon@gmail.com; xen-devel@lists.xensource.com; konrad.wilk@oracle.com
> >> Subject: Re: Kernel BUG at arch/x86/mm/tlb.c:61
> >> 
> >> On 04/15/2011 05:23 AM, MaoXiaoyun wrote:
> >> > Hi:
> >> >
> >> > Could the crash related to this patch ?
> >> > http://git.kernel.org/?p=linux/kernel/git/jeremy/xen.git;a=commitdiff;h=45bfd7bfc6cf32f8e60bb91b32349f0b5090eea3
> >> >
> >> > Since now TLB state change to TLBSTATE_OK(mmu_context.h:40) is before
> >> > cpumask_clear_cpu(line 49).
> >> > Could it possible that right after execute line 40 of mmu_context.h,
> >> > CPU revice IPI from other CPU to
> >> > flush the mm, and when in interrupt, find the TLB state happened to be
> >> > TLBSTATE_OK. Which conflicts.
> >> 
> >> Does reverting it help?
> >> 
> >> J
> > 
> >Hi Jeremy:
> > 
> >    The lastest test result shows the reverting didn't help.
> >    Kernel panic exactly at the same place in tlb.c.
> > 
> >    I have question about TLB state, from the stack, 
> >    xen_do_hypervisor_callback-> xen_evtchn_do_upcall->... ->drop_other_mm_ref
> > 
> >    What  cpu_tlbstate.state should be,  could  TLBSTATE_OK or TLBSTATE_LAZY all be possible? 
> >    That is after a hypercall from userspace, state will be TLBSTATE_OK, and
> >      if from kernel space, state will be TLBSTATE_LAZE ? 
> > 
> >       thanks.
> 
> it looks a bug in drop_other_mm_ref implementation, that current TLB state should be checked
> before invoking leave_mm(). There's a window between below lines of code:
> 
> <xen_drop_mm_ref>
> /* Get the "official" set of cpus referring to our pagetable. */
> if (!alloc_cpumask_var(&mask, GFP_ATOMIC)) {
> for_each_online_cpu(cpu) {
> if (!cpumask_test_cpu(cpu, mm_cpumask(mm))
> && per_cpu(xen_current_cr3, cpu) != __pa(mm->pgd))
> continue;
> smp_call_function_single(cpu, drop_other_mm_ref, mm, 1);
> }
> return;
> }
> 
> there's chance that when smp_call_function_single is invoked, actual TLB state has been
> updated in the other cpu. The upstream kernel patch you referred to earlier just makes
> this bug exposed more easily. But even without this patch, you may still suffer such issue
> which is why reverting the patch doesn't help.
> 
> Could you try adding a check in drop_other_mm_ref?
> 
> if (active_mm == mm && percpu_read(cpu_tlbstate.state) != TLBSTATE_OK)
> leave_mm(smp_processor_id());
> 
> once the interrupted context has TLBSTATE_OK, it implicates that later it will handle 
> the TLB flush and thus no need for leave_mm from interrupt handler, and that's the
> assumption of doing leave_mm.
> 
> Thanks
> Kevin

[-- Attachment #1.2: Type: text/html, Size: 6181 bytes --]

[-- Attachment #2: Type: text/plain, Size: 138 bytes --]

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xensource.com
http://lists.xensource.com/xen-devel

^ permalink raw reply	[flat|nested] 41+ messages in thread

* RE: RE: Kernel BUG at arch/x86/mm/tlb.c:61
  2011-04-26  7:04                                     ` MaoXiaoyun
@ 2011-04-26  8:31                                       ` Tian, Kevin
  0 siblings, 0 replies; 41+ messages in thread
From: Tian, Kevin @ 2011-04-26  8:31 UTC (permalink / raw)
  To: MaoXiaoyun, jeremy; +Cc: xen devel, giamteckchoon, konrad.wilk


[-- Attachment #1.1: Type: text/plain, Size: 4304 bytes --]

I think that should be fine. Note the later check:

        /* If this cpu still has a stale cr3 reference, then make sure
           it has been flushed. */
        if (percpu_read(xen_current_cr3) == __pa(mm->pgd))
                load_cr3(swapper_pg_dir);

This should ensure the stale TLB is flushed if this cpu is still in lazy mode.
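
Putting the two pieces together, drop_other_mm_ref() would then look roughly like this
(just a sketch of the idea, not the formal patch):

static void drop_other_mm_ref(void *info)
{
        struct mm_struct *mm = info;
        struct mm_struct *active_mm;

        active_mm = percpu_read(cpu_tlbstate.active_mm);

        /* Only drop the mm if this cpu is in lazy TLB mode; with
           TLBSTATE_OK the interrupted context will reload cr3 itself. */
        if (active_mm == mm && percpu_read(cpu_tlbstate.state) != TLBSTATE_OK)
                leave_mm(smp_processor_id());

        /* If this cpu still has a stale cr3 reference, then make sure
           it has been flushed. */
        if (percpu_read(xen_current_cr3) == __pa(mm->pgd))
                load_cr3(swapper_pg_dir);
}

So even when leave_mm() is skipped, a cr3 that still points at the dying pagetable is
replaced with swapper_pg_dir before the unpin proceeds.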

Thanks
Kevin

From: MaoXiaoyun [mailto:tinnycloud@hotmail.com]
Sent: Tuesday, April 26, 2011 3:05 PM
To: Tian, Kevin; jeremy@goop.org
Cc: xen devel; giamteckchoon@gmail.com; konrad.wilk@oracle.com
Subject: RE: [Xen-devel] RE: Kernel BUG at arch/x86/mm/tlb.c:61

Many thanks, Kevin.

I agree on the race window.
One more thing: in my understanding, the CPU that sends out the IPI will unpin the pagetable after
receiving ACKs from all the other CPUs. If a CPU that receives the IPI enters drop_other_mm_ref with
TLBSTATE_OK and does nothing, is it possible that it later ends up using a stale pagetable
(one that has been unpinned by the sender CPU)?

So do we need to flush the TLB when its state is TLBSTATE_OK?

if (active_mm == mm) {
        if (percpu_read(cpu_tlbstate.state) == TLBSTATE_OK)
                load_cr3(mm->pgd);
        else
                leave_mm(smp_processor_id());
}

> From: kevin.tian@intel.com
> To: tinnycloud@hotmail.com; jeremy@goop.org
> CC: xen-devel@lists.xensource.com; giamteckchoon@gmail.com; konrad.wilk@oracle.com
> Date: Tue, 26 Apr 2011 13:52:11 +0800
> Subject: RE: [Xen-devel] RE: Kernel BUG at arch/x86/mm/tlb.c:61
>
> >From: MaoXiaoyun
> >Sent: Monday, April 25, 2011 11:15 AM
> >> Date: Fri, 15 Apr 2011 14:22:29 -0700
> >> From: jeremy@goop.org
> >> To: tinnycloud@hotmail.com
> >> CC: giamteckchoon@gmail.com; xen-devel@lists.xensource.com; konrad.wilk@oracle.com
> >> Subject: Re: Kernel BUG at arch/x86/mm/tlb.c:61
> >>
> >> On 04/15/2011 05:23 AM, MaoXiaoyun wrote:
> >> > Hi:
> >> >
> >> > Could the crash related to this patch ?
> >> > http://git.kernel.org/?p=linux/kernel/git/jeremy/xen.git;a=commitdiff;h=45bfd7bfc6cf32f8e60bb91b32349f0b5090eea3
> >> >
> >> > Since now TLB state change to TLBSTATE_OK(mmu_context.h:40) is before
> >> > cpumask_clear_cpu(line 49).
> >> > Could it possible that right after execute line 40 of mmu_context.h,
> >> > CPU revice IPI from other CPU to
> >> > flush the mm, and when in interrupt, find the TLB state happened to be
> >> > TLBSTATE_OK. Which conflicts.
> >>
> >> Does reverting it help?
> >>
> >> J
> >
> >Hi Jeremy:
> >
> >    The lastest test result shows the reverting didn't help.
> >    Kernel panic exactly at the same place in tlb.c.
> >
> >    I have question about TLB state, from the stack,
> >    xen_do_hypervisor_callback-> xen_evtchn_do_upcall->... ->drop_other_mm_ref
> >
> >    What  cpu_tlbstate.state should be,  could  TLBSTATE_OK or TLBSTATE_LAZY all be possible?
> >    That is after a hypercall from userspace, state will be TLBSTATE_OK, and
> >      if from kernel space, state will be TLBSTATE_LAZE ?
> >
> >       thanks.
>
> it looks a bug in drop_other_mm_ref implementation, that current TLB state should be checked
> before invoking leave_mm(). There's a window between below lines of code:
>
> <xen_drop_mm_ref>
> /* Get the "official" set of cpus referring to our pagetable. */
> if (!alloc_cpumask_var(&mask, GFP_ATOMIC)) {
> for_each_online_cpu(cpu) {
> if (!cpumask_test_cpu(cpu, mm_cpumask(mm))
> && per_cpu(xen_current_cr3, cpu) != __pa(mm->pgd))
> continue;
> smp_call_function_single(cpu, drop_other_mm_ref, mm, 1);
> }
> return;
> }
>
> there's chance that when smp_call_function_single is invoked, actual TLB state has been
> updated in the other cpu. The upstream kernel patch you referred to earlier just makes
> this bug exposed more easily. But even without this patch, you may still suffer such issue
> which is why reverting the patch doesn't help.
>
> Could you try adding a check in drop_other_mm_ref?
>
> if (active_mm == mm && percpu_read(cpu_tlbstate.state) != TLBSTATE_OK)
> leave_mm(smp_processor_id());
>
> once the interrupted context has TLBSTATE_OK, it implicates that later it will handle
> the TLB flush and thus no need for leave_mm from interrupt handler, and that's the
> assumption of doing leave_mm.
>
> Thanks
> Kevin

[-- Attachment #1.2: Type: text/html, Size: 11127 bytes --]

[-- Attachment #2: Type: text/plain, Size: 138 bytes --]

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xensource.com
http://lists.xensource.com/xen-devel

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: RE: Kernel BUG at arch/x86/mm/tlb.c:61
  2011-04-26  5:52                                   ` Tian, Kevin
  2011-04-26  7:04                                     ` MaoXiaoyun
@ 2011-04-28 23:29                                     ` Jeremy Fitzhardinge
  2011-04-29  0:19                                       ` Tian, Kevin
  1 sibling, 1 reply; 41+ messages in thread
From: Jeremy Fitzhardinge @ 2011-04-28 23:29 UTC (permalink / raw)
  To: Tian, Kevin; +Cc: MaoXiaoyun, xen devel, giamteckchoon, konrad.wilk

On 04/25/2011 10:52 PM, Tian, Kevin wrote:
>> From: MaoXiaoyun
>> Sent: Monday, April 25, 2011 11:15 AM
>>> Date: Fri, 15 Apr 2011 14:22:29 -0700
>>> From: jeremy@goop.org
>>> To: tinnycloud@hotmail.com
>>> CC: giamteckchoon@gmail.com; xen-devel@lists.xensource.com; konrad.wilk@oracle.com
>>> Subject: Re: Kernel BUG at arch/x86/mm/tlb.c:61
>>>
>>> On 04/15/2011 05:23 AM, MaoXiaoyun wrote:
>>>> Hi:
>>>>
>>>> Could the crash related to this patch ?
>>>> http://git.kernel.org/?p=linux/kernel/git/jeremy/xen.git;a=commitdiff;h=45bfd7bfc6cf32f8e60bb91b32349f0b5090eea3
>>>>
>>>> Since now TLB state change to TLBSTATE_OK(mmu_context.h:40) is before
>>>> cpumask_clear_cpu(line 49).
>>>> Could it possible that right after execute line 40 of mmu_context.h,
>>>> CPU revice IPI from other CPU to
>>>> flush the mm, and when in interrupt, find the TLB state happened to be
>>>> TLBSTATE_OK. Which conflicts.
>>> Does reverting it help?
>>>
>>> J
>>  
>> Hi Jeremy:
>>  
>>     The lastest test result shows the reverting didn't help.
>>     Kernel panic exactly at the same place in tlb.c.
>>  
>>     I have question about TLB state, from the stack, 
>>     xen_do_hypervisor_callback-> xen_evtchn_do_upcall->... ->drop_other_mm_ref
>>  
>>     What  cpu_tlbstate.state should be,  could  TLBSTATE_OK or TLBSTATE_LAZY all be possible? 
>>     That is after a hypercall from userspace, state will be TLBSTATE_OK, and
>>       if from kernel space, state will be TLBSTATE_LAZE ? 
>>  
>>        thanks.
> it looks a bug in drop_other_mm_ref implementation, that current TLB state should be checked
> before invoking leave_mm(). There's a window between below lines of code:
>
> <xen_drop_mm_ref>
>        /* Get the "official" set of cpus referring to our pagetable. */
>         if (!alloc_cpumask_var(&mask, GFP_ATOMIC)) {
>                 for_each_online_cpu(cpu) {
>                         if (!cpumask_test_cpu(cpu, mm_cpumask(mm))
>                             && per_cpu(xen_current_cr3, cpu) != __pa(mm->pgd))
>                                 continue;
>                         smp_call_function_single(cpu, drop_other_mm_ref, mm, 1);
>                 }
>                 return;
>         }
>
> there's chance that when smp_call_function_single is invoked, actual TLB state has been
> updated in the other cpu. The upstream kernel patch you referred to earlier just makes
> this bug exposed more easily. But even without this patch, you may still suffer such issue
> which is why reverting the patch doesn't help.
>
> Could you try adding a check in drop_other_mm_ref?
>
>         if (active_mm == mm && percpu_read(cpu_tlbstate.state) != TLBSTATE_OK)
>                 leave_mm(smp_processor_id());
>
> once the interrupted context has TLBSTATE_OK, it implicates that later it will handle 
> the TLB flush and thus no need for leave_mm from interrupt handler, and that's the
> assumption of doing leave_mm.

That seems reasonable.  MaoXiaoyun, does it fix the bug for you?

Kevin, could you submit this as a proper patch?

Thanks,
    J

^ permalink raw reply	[flat|nested] 41+ messages in thread

* RE: RE: Kernel BUG at arch/x86/mm/tlb.c:61
  2011-04-28 23:29                                     ` Jeremy Fitzhardinge
@ 2011-04-29  0:19                                       ` Tian, Kevin
  2011-04-29  1:50                                         ` MaoXiaoyun
  0 siblings, 1 reply; 41+ messages in thread
From: Tian, Kevin @ 2011-04-29  0:19 UTC (permalink / raw)
  To: Jeremy Fitzhardinge; +Cc: MaoXiaoyun, xen devel, giamteckchoon, konrad.wilk

[-- Attachment #1: Type: text/plain, Size: 3576 bytes --]

> From: Jeremy Fitzhardinge [mailto:jeremy@goop.org]
> Sent: Friday, April 29, 2011 7:29 AM
> 
> On 04/25/2011 10:52 PM, Tian, Kevin wrote:
> >> From: MaoXiaoyun
> >> Sent: Monday, April 25, 2011 11:15 AM
> >>> Date: Fri, 15 Apr 2011 14:22:29 -0700
> >>> From: jeremy@goop.org
> >>> To: tinnycloud@hotmail.com
> >>> CC: giamteckchoon@gmail.com; xen-devel@lists.xensource.com;
> >>> konrad.wilk@oracle.com
> >>> Subject: Re: Kernel BUG at arch/x86/mm/tlb.c:61
> >>>
> >>> On 04/15/2011 05:23 AM, MaoXiaoyun wrote:
> >>>> Hi:
> >>>>
> >>>> Could the crash related to this patch ?
> >>>> http://git.kernel.org/?p=linux/kernel/git/jeremy/xen.git;a=commitdi
> >>>> ff;h=45bfd7bfc6cf32f8e60bb91b32349f0b5090eea3
> >>>>
> >>>> Since now TLB state change to TLBSTATE_OK(mmu_context.h:40) is
> >>>> before cpumask_clear_cpu(line 49).
> >>>> Could it possible that right after execute line 40 of
> >>>> mmu_context.h, CPU revice IPI from other CPU to flush the mm, and
> >>>> when in interrupt, find the TLB state happened to be TLBSTATE_OK.
> >>>> Which conflicts.
> >>> Does reverting it help?
> >>>
> >>> J
> >>
> >> Hi Jeremy:
> >>
> >>     The lastest test result shows the reverting didn't help.
> >>     Kernel panic exactly at the same place in tlb.c.
> >>
> >>     I have question about TLB state, from the stack,
> >>     xen_do_hypervisor_callback-> xen_evtchn_do_upcall->...
> >> ->drop_other_mm_ref
> >>
> >>     What  cpu_tlbstate.state should be,  could  TLBSTATE_OK or
> TLBSTATE_LAZY all be possible?
> >>     That is after a hypercall from userspace, state will be TLBSTATE_OK,
> and
> >>       if from kernel space, state will be TLBSTATE_LAZE ?
> >>
> >>        thanks.
> > it looks a bug in drop_other_mm_ref implementation, that current TLB
> > state should be checked before invoking leave_mm(). There's a window
> between below lines of code:
> >
> > <xen_drop_mm_ref>
> >        /* Get the "official" set of cpus referring to our pagetable. */
> >         if (!alloc_cpumask_var(&mask, GFP_ATOMIC)) {
> >                 for_each_online_cpu(cpu) {
> >                         if (!cpumask_test_cpu(cpu,
> mm_cpumask(mm))
> >                             && per_cpu(xen_current_cr3, cpu) !=
> __pa(mm->pgd))
> >                                 continue;
> >                         smp_call_function_single(cpu,
> drop_other_mm_ref, mm, 1);
> >                 }
> >                 return;
> >         }
> >
> > there's chance that when smp_call_function_single is invoked, actual
> > TLB state has been updated in the other cpu. The upstream kernel patch
> > you referred to earlier just makes this bug exposed more easily. But
> > even without this patch, you may still suffer such issue which is why reverting
> the patch doesn't help.
> >
> > Could you try adding a check in drop_other_mm_ref?
> >
> >         if (active_mm == mm && percpu_read(cpu_tlbstate.state) !=
> TLBSTATE_OK)
> >                 leave_mm(smp_processor_id());
> >
> > once the interrupted context has TLBSTATE_OK, it implicates that later
> > it will handle the TLB flush and thus no need for leave_mm from
> > interrupt handler, and that's the assumption of doing leave_mm.
> 
> That seems reasonable.  MaoXiaoyun, does it fix the bug for you?
> 
> Kevin, could you submit this as a proper patch?
> 

I'm waiting for Xiaoyun's test result before submitting a proper patch, since this
part of the logic is tricky and his test can make sure we don't overlook any corner
cases. :-)

Thanks
Kevin

[-- Attachment #2: Type: text/plain, Size: 138 bytes --]

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xensource.com
http://lists.xensource.com/xen-devel

^ permalink raw reply	[flat|nested] 41+ messages in thread

* RE: RE: Kernel BUG at arch/x86/mm/tlb.c:61
  2011-04-29  0:19                                       ` Tian, Kevin
@ 2011-04-29  1:50                                         ` MaoXiaoyun
  2011-04-29  1:57                                           ` Tian, Kevin
  0 siblings, 1 reply; 41+ messages in thread
From: MaoXiaoyun @ 2011-04-29  1:50 UTC (permalink / raw)
  To: kevin.tian, jeremy; +Cc: xen devel, giamteckchoon, konrad.wilk


[-- Attachment #1.1: Type: text/plain, Size: 3927 bytes --]


 

> From: kevin.tian@intel.com
> To: jeremy@goop.org
> CC: tinnycloud@hotmail.com; xen-devel@lists.xensource.com; giamteckchoon@gmail.com; konrad.wilk@oracle.com
> Date: Fri, 29 Apr 2011 08:19:44 +0800
> Subject: RE: [Xen-devel] RE: Kernel BUG at arch/x86/mm/tlb.c:61
> 
> > From: Jeremy Fitzhardinge [mailto:jeremy@goop.org]
> > Sent: Friday, April 29, 2011 7:29 AM
> > 
> > On 04/25/2011 10:52 PM, Tian, Kevin wrote:
> > >> From: MaoXiaoyun
> > >> Sent: Monday, April 25, 2011 11:15 AM
> > >>> Date: Fri, 15 Apr 2011 14:22:29 -0700
> > >>> From: jeremy@goop.org
> > >>> To: tinnycloud@hotmail.com
> > >>> CC: giamteckchoon@gmail.com; xen-devel@lists.xensource.com;
> > >>> konrad.wilk@oracle.com
> > >>> Subject: Re: Kernel BUG at arch/x86/mm/tlb.c:61
> > >>>
> > >>> On 04/15/2011 05:23 AM, MaoXiaoyun wrote:
> > >>>> Hi:
> > >>>>
> > >>>> Could the crash related to this patch ?
> > >>>> http://git.kernel.org/?p=linux/kernel/git/jeremy/xen.git;a=commitdi
> > >>>> ff;h=45bfd7bfc6cf32f8e60bb91b32349f0b5090eea3
> > >>>>
> > >>>> Since now TLB state change to TLBSTATE_OK(mmu_context.h:40) is
> > >>>> before cpumask_clear_cpu(line 49).
> > >>>> Could it possible that right after execute line 40 of
> > >>>> mmu_context.h, CPU revice IPI from other CPU to flush the mm, and
> > >>>> when in interrupt, find the TLB state happened to be TLBSTATE_OK.
> > >>>> Which conflicts.
> > >>> Does reverting it help?
> > >>>
> > >>> J
> > >>
> > >> Hi Jeremy:
> > >>
> > >> The lastest test result shows the reverting didn't help.
> > >> Kernel panic exactly at the same place in tlb.c.
> > >>
> > >> I have question about TLB state, from the stack,
> > >> xen_do_hypervisor_callback-> xen_evtchn_do_upcall->...
> > >> ->drop_other_mm_ref
> > >>
> > >> What cpu_tlbstate.state should be, could TLBSTATE_OK or
> > TLBSTATE_LAZY all be possible?
> > >> That is after a hypercall from userspace, state will be TLBSTATE_OK,
> > and
> > >> if from kernel space, state will be TLBSTATE_LAZE ?
> > >>
> > >> thanks.
> > > it looks a bug in drop_other_mm_ref implementation, that current TLB
> > > state should be checked before invoking leave_mm(). There's a window
> > between below lines of code:
> > >
> > > <xen_drop_mm_ref>
> > > /* Get the "official" set of cpus referring to our pagetable. */
> > > if (!alloc_cpumask_var(&mask, GFP_ATOMIC)) {
> > > for_each_online_cpu(cpu) {
> > > if (!cpumask_test_cpu(cpu,
> > mm_cpumask(mm))
> > > && per_cpu(xen_current_cr3, cpu) !=
> > __pa(mm->pgd))
> > > continue;
> > > smp_call_function_single(cpu,
> > drop_other_mm_ref, mm, 1);
> > > }
> > > return;
> > > }
> > >
> > > there's chance that when smp_call_function_single is invoked, actual
> > > TLB state has been updated in the other cpu. The upstream kernel patch
> > > you referred to earlier just makes this bug exposed more easily. But
> > > even without this patch, you may still suffer such issue which is why reverting
> > the patch doesn't help.
> > >
> > > Could you try adding a check in drop_other_mm_ref?
> > >
> > > if (active_mm == mm && percpu_read(cpu_tlbstate.state) !=
> > TLBSTATE_OK)
> > > leave_mm(smp_processor_id());
> > >
> > > once the interrupted context has TLBSTATE_OK, it implicates that later
> > > it will handle the TLB flush and thus no need for leave_mm from
> > > interrupt handler, and that's the assumption of doing leave_mm.
> > 
> > That seems reasonable. MaoXiaoyun, does it fix the bug for you?
> > 
> > Kevin, could you submit this as a proper patch?
> > 
> 
> I'm waiting for Xiaoyun's test result before submitting a proper patch, since this
> part of logic is tricky and his test can make sure we don't overlook some corner
> cases. :-)
> 
 
I think it works. The test has been running successfully for over 70 hours.
My plan is to run it for one week.
 
Thanks. 
 
> Thanks
> Kevin

[-- Attachment #1.2: Type: text/html, Size: 5407 bytes --]

[-- Attachment #2: Type: text/plain, Size: 138 bytes --]

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xensource.com
http://lists.xensource.com/xen-devel

^ permalink raw reply	[flat|nested] 41+ messages in thread

* RE: RE: Kernel BUG at arch/x86/mm/tlb.c:61
  2011-04-29  1:50                                         ` MaoXiaoyun
@ 2011-04-29  1:57                                           ` Tian, Kevin
  0 siblings, 0 replies; 41+ messages in thread
From: Tian, Kevin @ 2011-04-29  1:57 UTC (permalink / raw)
  To: MaoXiaoyun, jeremy; +Cc: xen devel, giamteckchoon, konrad.wilk


[-- Attachment #1.1: Type: text/plain, Size: 4603 bytes --]

OK, thanks for the update. I’ll send out the patch then.

Thanks
Kevin

From: MaoXiaoyun [mailto:tinnycloud@hotmail.com]
Sent: Friday, April 29, 2011 9:51 AM
To: Tian, Kevin; jeremy@goop.org
Cc: xen devel; giamteckchoon@gmail.com; konrad.wilk@oracle.com
Subject: RE: [Xen-devel] RE: Kernel BUG at arch/x86/mm/tlb.c:61


> From: kevin.tian@intel.com<mailto:kevin.tian@intel.com>
> To: jeremy@goop.org<mailto:jeremy@goop.org>
> CC: tinnycloud@hotmail.com<mailto:tinnycloud@hotmail.com>; xen-devel@lists.xensource.com<mailto:xen-devel@lists.xensource.com>; giamteckchoon@gmail.com<mailto:giamteckchoon@gmail.com>; konrad.wilk@oracle.com<mailto:konrad.wilk@oracle.com>
> Date: Fri, 29 Apr 2011 08:19:44 +0800
> Subject: RE: [Xen-devel] RE: Kernel BUG at arch/x86/mm/tlb.c:61
>
> > From: Jeremy Fitzhardinge [mailto:jeremy@goop.org]<mailto:[mailto:jeremy@goop.org]>
> > Sent: Friday, April 29, 2011 7:29 AM
> >
> > On 04/25/2011 10:52 PM, Tian, Kevin wrote:
> > >> From: MaoXiaoyun
> > >> Sent: Monday, April 25, 2011 11:15 AM
> > >>> Date: Fri, 15 Apr 2011 14:22:29 -0700
> > >>> From: jeremy@goop.org<mailto:jeremy@goop.org>
> > >>> To: tinnycloud@hotmail.com<mailto:tinnycloud@hotmail.com>
> > >>> CC: giamteckchoon@gmail.com<mailto:giamteckchoon@gmail.com>; xen-devel@lists.xensource.com<mailto:xen-devel@lists.xensource.com>;
> > >>> konrad.wilk@oracle.com<mailto:konrad.wilk@oracle.com>
> > >>> Subject: Re: Kernel BUG at arch/x86/mm/tlb.c:61
> > >>>
> > >>> On 04/15/2011 05:23 AM, MaoXiaoyun wrote:
> > >>>> Hi:
> > >>>>
> > >>>> Could the crash related to this patch ?
> > >>>> http://git.kernel.org/?p=linux/kernel/git/jeremy/xen.git;a=commitdi
> > >>>> ff;h=45bfd7bfc6cf32f8e60bb91b32349f0b5090eea3
> > >>>>
> > >>>> Since now TLB state change to TLBSTATE_OK(mmu_context.h:40) is
> > >>>> before cpumask_clear_cpu(line 49).
> > >>>> Could it possible that right after execute line 40 of
> > >>>> mmu_context.h, CPU revice IPI from other CPU to flush the mm, and
> > >>>> when in interrupt, find the TLB state happened to be TLBSTATE_OK.
> > >>>> Which conflicts.
> > >>> Does reverting it help?
> > >>>
> > >>> J
> > >>
> > >> Hi Jeremy:
> > >>
> > >> The lastest test result shows the reverting didn't help.
> > >> Kernel panic exactly at the same place in tlb.c.
> > >>
> > >> I have question about TLB state, from the stack,
> > >> xen_do_hypervisor_callback-> xen_evtchn_do_upcall->...
> > >> ->drop_other_mm_ref
> > >>
> > >> What cpu_tlbstate.state should be, could TLBSTATE_OK or
> > TLBSTATE_LAZY all be possible?
> > >> That is after a hypercall from userspace, state will be TLBSTATE_OK,
> > and
> > >> if from kernel space, state will be TLBSTATE_LAZE ?
> > >>
> > >> thanks.
> > > it looks a bug in drop_other_mm_ref implementation, that current TLB
> > > state should be checked before invoking leave_mm(). There's a window
> > between below lines of code:
> > >
> > > <xen_drop_mm_ref>
> > > /* Get the "official" set of cpus referring to our pagetable. */
> > > if (!alloc_cpumask_var(&mask, GFP_ATOMIC)) {
> > > for_each_online_cpu(cpu) {
> > > if (!cpumask_test_cpu(cpu,
> > mm_cpumask(mm))
> > > && per_cpu(xen_current_cr3, cpu) !=
> > __pa(mm->pgd))
> > > continue;
> > > smp_call_function_single(cpu,
> > drop_other_mm_ref, mm, 1);
> > > }
> > > return;
> > > }
> > >
> > > there's chance that when smp_call_function_single is invoked, actual
> > > TLB state has been updated in the other cpu. The upstream kernel patch
> > > you referred to earlier just makes this bug exposed more easily. But
> > > even without this patch, you may still suffer such issue which is why reverting
> > the patch doesn't help.
> > >
> > > Could you try adding a check in drop_other_mm_ref?
> > >
> > > if (active_mm == mm && percpu_read(cpu_tlbstate.state) !=
> > TLBSTATE_OK)
> > > leave_mm(smp_processor_id());
> > >
> > > once the interrupted context has TLBSTATE_OK, it implicates that later
> > > it will handle the TLB flush and thus no need for leave_mm from
> > > interrupt handler, and that's the assumption of doing leave_mm.
> >
> > That seems reasonable. MaoXiaoyun, does it fix the bug for you?
> >
> > Kevin, could you submit this as a proper patch?
> >
>
> I'm waiting for Xiaoyun's test result before submitting a proper patch, since this
> part of logic is tricky and his test can make sure we don't overlook some corner
> cases. :-)
>

I think it works. The test has been running successfully for over 70 hours.
My plan is to run it for one week.

Thanks.

> Thanks
> Kevin

[-- Attachment #1.2: Type: text/html, Size: 9551 bytes --]

[-- Attachment #2: Type: text/plain, Size: 138 bytes --]

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xensource.com
http://lists.xensource.com/xen-devel

^ permalink raw reply	[flat|nested] 41+ messages in thread

end of thread, other threads:[~2011-04-29  1:57 UTC | newest]

Thread overview: 41+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
     [not found] <COL0-MC1-F14hmBzxHs00230882@col0-mc1-f14.Col0.hotmail.com>
2011-04-08 11:24 ` kernel BUG at arch/x86/xen/mmu.c:1860! MaoXiaoyun
2011-04-08 11:46   ` MaoXiaoyun
2011-04-10  3:57   ` kernel BUG at arch/x86/xen/mmu.c:1872 MaoXiaoyun
2011-04-10  4:29   ` MaoXiaoyun
2011-04-10 13:57     ` MaoXiaoyun
2011-04-10 20:14       ` Teck Choon Giam
2011-04-11 12:16         ` Teck Choon Giam
2011-04-11 12:22           ` Teck Choon Giam
2011-04-11 12:31           ` MaoXiaoyun
2011-04-11 15:25             ` Teck Choon Giam
2011-04-12  3:30               ` MaoXiaoyun
2011-04-12 16:08                 ` Teck Choon Giam
2011-04-11 18:08             ` Jeremy Fitzhardinge
2011-04-12  3:35               ` MaoXiaoyun
2011-04-12  6:48                 ` Grant Table Error on 2.6.32.36 + Xen 4.0.1 MaoXiaoyun
2011-04-12  8:46                   ` Konrad Rzeszutek Wilk
2011-04-12  9:02                     ` MaoXiaoyun
2011-04-12  9:11                 ` Kernel BUG at arch/x86/mm/tlb.c:61 MaoXiaoyun
2011-04-12 10:00                   ` Konrad Rzeszutek Wilk
2011-04-12 10:10                     ` MaoXiaoyun
2011-04-14  6:16                     ` MaoXiaoyun
2011-04-14  7:26                       ` Teck Choon Giam
2011-04-14  7:56                         ` MaoXiaoyun
2011-04-14 11:16                           ` MaoXiaoyun
2011-04-15 12:23                             ` MaoXiaoyun
2011-04-15 21:22                               ` Jeremy Fitzhardinge
2011-04-18 15:20                                 ` MaoXiaoyun
2011-04-25  3:15                                 ` MaoXiaoyun
2011-04-26  5:52                                   ` Tian, Kevin
2011-04-26  7:04                                     ` MaoXiaoyun
2011-04-26  8:31                                       ` Tian, Kevin
2011-04-28 23:29                                     ` Jeremy Fitzhardinge
2011-04-29  0:19                                       ` Tian, Kevin
2011-04-29  1:50                                         ` MaoXiaoyun
2011-04-29  1:57                                           ` Tian, Kevin
2011-04-25  4:42                                 ` MaoXiaoyun
2011-04-25 12:54                                   ` MaoXiaoyun
2011-04-25 13:11                                     ` MaoXiaoyun
2011-04-25 15:05                                       ` MaoXiaoyun
2011-04-26  5:55                                         ` Tian, Kevin
2011-04-12 16:32               ` kernel BUG at arch/x86/xen/mmu.c:1872 Teck Choon Giam
