From mboxrd@z Thu Jan 1 00:00:00 1970
From: "Xu, Dongxiao"
Subject: RE: Making snapshot of logical volumes handling HVM domU causes OOPS and instability
Date: Tue, 31 Aug 2010 14:59:40 +0800
Message-ID:
References: <4C7864BB.1010808@sce.pridelands.org> <4C7BE1C6.5030602@goop.org>
Mime-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
Content-Transfer-Encoding: quoted-printable
In-Reply-To: <4C7BE1C6.5030602@goop.org>
Content-Language: en-US
Sender: xen-devel-bounces@lists.xensource.com
Errors-To: xen-devel-bounces@lists.xensource.com
To: Jeremy Fitzhardinge, Scott Garron
Cc: Daniel Stodden, "xen-devel@lists.xensource.com"
List-Id: xen-devel@lists.xenproject.org

Jeremy Fitzhardinge wrote:
> On 08/27/2010 06:22 PM, Scott Garron wrote:
>> I use LVM volumes for domU disks. To create backups, I create a
>> snapshot of the volume, mount the snapshot in the dom0, mount an
>> equally-sized backup volume from another physical storage source,
>> run an rsync from one to the other, unmount both, then remove the
>> snapshot. This includes creating a snapshot and mounting NTFS
>> volumes from Windows-based HVM guests.
>>
>> This practice may not be perfect, but it has worked fine for me for
>> a couple of years while I was running Xen 3.2.1 and a
>> linux-2.6.18.8-xen dom0 (and the same kernel for domU). After
>> upgrades of udev started complaining about the kernel being too
>> old, I thought it was well past time to transition to a newer
>> version of Xen and a newer dom0 kernel. That transition has been a
>> gigantic learning experience, let me tell you.
>>
>> After that transition, here's the problem I've been wrestling with
>> and can't seem to find a solution for: it seems like any time I
>> manipulate a volume group to add or remove a snapshot of a logical
>> volume that's used as a disk for a running HVM guest, new calls to
>> LVM2 and/or Xen's storage lock up and spin forever. The first time
>> I ran across the problem, there was no indication of trouble other
>> than that any command touching LVM would freeze and could not be
>> signaled to do anything. In other words, no error messages, nothing
>> in dmesg, nothing in syslog. The commands would just freeze and
>> never return. That was with the 2.6.31.14 kernel that is currently
>> retrieved if you check out xen-4.0-testing.hg and just do a "make
>> dist".
>>
>> I have since checked out and compiled 2.6.32.18, which comes from
>> doing "git checkout -b xen/stable-2.6.32.x origin/xen/stable-2.6.32.x",
>> as described on the Wiki page here:
>> http://wiki.xensource.com/xenwiki/XenParavirtOps
>>
>> If I run that kernel for dom0, but continue to use 2.6.31.14 for
>> the paravirtualized domUs, everything works fine until I try to
>> manipulate the snapshots of the HVM volumes. Today, I got this
>> kernel OOPS:
>
> That's definitely bad. Something is causing udevd to end up with bad
> pagetables, which then cause a kernel crash on exit. I'm not sure if
> it's *the* udevd or some transient child, but either way it's bad.
>
> Any thoughts on this, Daniel?
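(For concreteness, the per-volume backup cycle Scott describes above
boils down to roughly the following. The volume group, volume names,
snapshot size, and mount points are placeholders, not taken from his
actual setup.)

    # Snapshot the guest's disk LV and mount the snapshot in dom0.
    lvcreate --snapshot --size 5G --name guest-disk-snap /dev/vg0/guest-disk
    mount /dev/vg0/guest-disk-snap /mnt/snap

    # Mount the equally-sized backup LV from the other physical store.
    mount /dev/backupvg/guest-disk-backup /mnt/backup

    # Copy, then tear everything down and drop the snapshot.
    rsync -a --delete /mnt/snap/ /mnt/backup/
    umount /mnt/backup
    umount /mnt/snap
    lvremove -f /dev/vg0/guest-disk-snap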
>
>>
>> ---------------------------
>>
>> [78084.004530] BUG: unable to handle kernel paging request at ffff8800267c9010
>> [78084.004710] IP: [] xen_set_pmd+0x24/0x44
>> [78084.004886] PGD 1002067 PUD 1006067 PMD 217067 PTE 80100000267c9065
>> [78084.005065] Oops: 0003 [#1] SMP
>> [78084.005234] last sysfs file: /sys/devices/virtual/block/dm-32/removable
>> [78084.005256] CPU 1
>> [78084.005256] Modules linked in: tun xt_multiport fuse dm_snapshot
>> nf_nat_tftp nf_conntrack_tftp nf_nat_pptp nf_conntrack_pptp
>> nf_conntrack_proto_gre nf_nat_proto_gre ntfs parport_pc parport
>> k8temp floppy forcedeth [last unloaded: scsi_wait_scan]
>> [78084.005256] Pid: 22814, comm: udevd Tainted: G W 2.6.32.18 #1 H8SMI
>> [78084.005256] RIP: e030:[] [] xen_set_pmd+0x24/0x44
>> [78084.005256] RSP: e02b:ffff88002e2e1d18  EFLAGS: 00010246
>> [78084.005256] RAX: 0000000000000000 RBX: ffff8800267c9010 RCX: ffff880000000000
>> [78084.005256] RDX: dead000000100100 RSI: 0000000000000000 RDI: 0000000000000004
>> [78084.005256] RBP: ffff88002e2e1d28 R08: 0000000001993000 R09: dead000000100100
>> [78084.005256] R10: 800000016e90e165 R11: 0000000000000000 R12: 0000000000000000
>> [78084.005256] R13: ffff880002d8f580 R14: 0000000000400000 R15: ffff880029248000
>> [78084.005256] FS: 00007fa07d87f7a0(0000) GS:ffff880002d81000(0000) knlGS:0000000000000000
>> [78084.005256] CS: e033 DS: 0000 ES: 0000 CR0: 000000008005003b
>> [78084.005256] CR2: ffff8800267c9010 CR3: 0000000001001000 CR4: 0000000000000660
>> [78084.005256] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
>> [78084.005256] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
>> [78084.005256] Process udevd (pid: 22814, threadinfo ffff88002e2e0000, task ffff880019491e80)
>> [78084.005256] Stack:
>> [78084.005256]  0000000000600000 000000000061e000 ffff88002e2e1de8 ffffffff810fb8a5
>> [78084.005256] <0> 00007fff13ffffff 0000000100000206 ffff880003158003 0000000000000000
>> [78084.005256] <0> 0000000000000000 000000000061dfff 000000000061dfff 000000000061dfff
>> [78084.005256] Call Trace:
>> [78084.005256]  [] free_pgd_range+0x27c/0x45e
>> [78084.005256]  [] free_pgtables+0xa4/0xc7
>> [78084.005256]  [] exit_mmap+0x107/0x13f
>> [78084.005256]  [] mmput+0x39/0xda
>> [78084.005256]  [] exit_mm+0xfb/0x106
>> [78084.005256]  [] do_exit+0x1e8/0x6ff
>> [78084.005256]  [] ? do_page_fault+0x2cd/0x2fd
>> [78084.005256]  [] do_group_exit+0x89/0xb3
>> [78084.005256]  [] sys_exit_group+0x12/0x16
>> [78084.005256]  [] system_call_fastpath+0x16/0x1b
>> [78084.005256] Code: 48 83 c4 28 5b c9 c3 55 48 89 e5 41 54 49 89 f4 53
>> 48 89 fb e8 fc ee ff ff 48 89 df ff 05 52 8f 9e 00 e8 78 e4 ff ff 84 c0
>> 75 05 <4c> 89 23 eb 16 e8 e0 ee ff ff 4c 89 e6 48 89 df ff 05 37 8f 9e
>> [78084.005256] RIP [] xen_set_pmd+0x24/0x44
>> [78084.005256]  RSP
>> [78084.005256] CR2: ffff8800267c9010
>> [78084.005256] ---[ end trace 4eaa2a86a8e2da24 ]---
>> [78084.005256] Fixing recursive fault but reboot is needed!
>>
>> ---------------------------
>>
>> After that was printed on the console, anything that interacts with
>> Xen (xentop, xm) would freeze whatever command it was and never
>> return. After trying to do a sane shutdown on the guests, the whole
>> dom0 locked up completely. Even the alt-sysrq functions stopped
>> working after I looked at a couple of them.
>>
>> I feel it's probably necessary to mention that this happens after
>> several fairly rapid-fire creations and deletions of snapshot
>> volumes. I have it scripted to make a snapshot, mount it, mount a
>> backup volume, rsync it, unmount both volumes, and delete the
>> snapshot, for 19 volumes in a row. In other words, there's a lot of
>> disk I/O going on around the time of the lockup. It always seems to
>> coincide with when it gets to the volumes that back active, running
>> Windows Server 2008 HVM guests. That may just be coincidence,
>> though, because those are the last ones on the list; 15 volumes
>> used by active, running paravirtualized Linux guests are at the top
>> of the list.
>>
>> Another issue that comes up is that if I run the 2.6.32.18 pvops
>> kernel for my Linux domUs, after a while (usually only about an
>> hour or so) the network interfaces stop responding. I don't know if
>> the problem is related, but it was something else that I noticed.
>> The only way to get network access back is to reboot the domU. When
>> I reverted the domU kernel to 2.6.31.14, the problem went away.
>
> That's a separate problem in netfront that appears to be a bug in the
> "smartpoll" code. I think Dongxiao is looking into it.

Yes, I have been trying to reproduce it these days, but I could not
catch it locally. I ran both netperf and ping for a long time, and the
bug was never triggered. What workload were you running when you hit
the bug?

Thanks,
Dongxiao

>
>> I'm not 100% sure, but I think this issue also causes xm console to
>> not let you type on the console that you connect to. If I connect
>> to a console, then issue an xm shutdown on the same domU from
>> another terminal, all of the console messages that show the
>> play-by-play of the shutdown process are displayed, but my keyboard
>> input doesn't seem to make it through.
>
> Hm, I'm not familiar with this problem. Perhaps it's just something
> wrong with your console settings for the domain? Do you have
> "console=" on the kernel command line?
>
>> Since I'm not a developer, I don't know if these questions are
>> better suited for the xen-users list, but since it generated an
>> OOPS with the word "BUG" in capital letters, I thought I'd post it
>> here. If that assumption was incorrect, just give me a gentle nudge
>> and I'll redirect the inquiry to somewhere more appropriate. :)
>
> Nope, they're both xen-devel fodder. Thanks for posting.
>
> J
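(For reference, the "console=" Jeremy mentions above is the guest
kernel's console parameter. For a pvops PV domU the PV console device
is hvc0, so an xm/xend-style guest config would typically carry
something like the line below; the config file and everything around
it are assumptions, not taken from Scott's actual setup.)

    # In the domU's xm config file, put the PV console device on the
    # guest kernel command line:
    extra = "console=hvc0"

For an interactive login over "xm console", the guest also needs a
getty running on hvc0; without one, console output still appears but
typed input has nothing reading it.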