* Making snapshot of logical volumes handling HVM domU causes OOPS and instability
@ 2010-08-28  1:22 Scott Garron
  2010-08-30 16:52 ` Jeremy Fitzhardinge
  0 siblings, 1 reply; 19+ messages in thread
From: Scott Garron @ 2010-08-28  1:22 UTC (permalink / raw)
  To: xen-devel

I use LVM volumes for domU disks.  To create backups, I create a
snapshot of the volume, mount the snapshot in the dom0, mount an
equally-sized backup volume from another physical storage source, run an
rsync from one to the other, unmount both, then remove the snapshot.
This includes creating a snapshot and mounting NTFS volumes from
Windows-based HVM guests.
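In script form, one pass of that per-volume cycle looks roughly like this (a sketch only: the volume group "vg0", the backup VG "backupvg", and the mount points are placeholder names, not my real ones; with DRY_RUN set it just prints the commands instead of running them):

```shell
#!/bin/sh
# One pass of the backup cycle described above.  "vg0", "backupvg", and the
# mount points are placeholders.  With DRY_RUN set, commands are only
# printed, so the sequence can be sanity-checked without touching LVM.
set -e

run() {
    if [ -n "$DRY_RUN" ]; then echo "$@"; else "$@"; fi
}

backup_one() {
    vg=$1; lv=$2
    run lvcreate -L 2G -s -n "${lv}-backupsnap" "$vg/$lv"
    run mount "/dev/$vg/${lv}-backupsnap" /mnt/snap
    run mount "/dev/backupvg/${lv}-backup" /mnt/backup
    run rsync -aH --delete /mnt/snap/ /mnt/backup/
    run umount /mnt/snap /mnt/backup
    run lvremove -f "$vg/${lv}-backupsnap"
}

# Dry-run example: print the six commands for one volume.
DRY_RUN=1 backup_one vg0 guest-disk
```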

This practice may not be perfect, but it worked fine for me for a
couple of years while I was running Xen 3.2.1 and a linux-2.6.18.8-xen
dom0 (and the same kernel for the domUs).  After newer versions of
udev started complaining about the kernel being too old, I thought it
was well past time to transition to a newer version of Xen and a newer
dom0 kernel.  This transition has been a gigantic learning experience,
let me tell you.

After that transition, here's the problem I've been wrestling with and
can't seem to find a solution for:  It seems like any time I start
manipulating a volume group to add or remove a snapshot of a logical
volume that's used as a disk for a running HVM guest, new calls to LVM2
and/or Xen's storage lock up and spin forever.  The first time I ran
across the problem, there was no indication of trouble other than that
any command that touched LVM would freeze and be completely
unresponsive to signals.  In other words, no error messages, nothing
in dmesg, nothing in syslog...  The commands would just freeze and not
return.  That was with the 2.6.31.14 kernel that is currently
retrieved if you check out xen-4.0-testing.hg and just do a make dist.

I have since checked out and compiled 2.6.32.18 that comes from doing
git checkout -b xen/stable-2.6.32.x origin/xen/stable-2.6.32.x, as
described on the Wiki page here:
http://wiki.xensource.com/xenwiki/XenParavirtOps

If I run that kernel for dom0, but continue to use 2.6.31.14 for the
paravirtualized domUs, everything works fine until I try to manipulate
the snapshots of the HVM volumes.  Today, I got this kernel OOPS:

---------------------------

[78084.004530] BUG: unable to handle kernel paging request at
ffff8800267c9010
[78084.004710] IP: [<ffffffff810382ff>] xen_set_pmd+0x24/0x44
[78084.004886] PGD 1002067 PUD 1006067 PMD 217067 PTE 80100000267c9065
[78084.005065] Oops: 0003 [#1] SMP
[78084.005234] last sysfs file: /sys/devices/virtual/block/dm-32/removable
[78084.005256] CPU 1
[78084.005256] Modules linked in: tun xt_multiport fuse dm_snapshot
nf_nat_tftp nf_conntrack_tftp nf_nat_pptp nf_conntrack_pptp
nf_conntrack_proto_gre nf_nat_proto_gre ntfs parport_pc parport k8temp
floppy forcedeth [last unloaded: scsi_wait_scan]
[78084.005256] Pid: 22814, comm: udevd Tainted: G        W  2.6.32.18 #1
H8SMI
[78084.005256] RIP: e030:[<ffffffff810382ff>]  [<ffffffff810382ff>]
xen_set_pmd+0x24/0x44
[78084.005256] RSP: e02b:ffff88002e2e1d18  EFLAGS: 00010246
[78084.005256] RAX: 0000000000000000 RBX: ffff8800267c9010 RCX:
ffff880000000000
[78084.005256] RDX: dead000000100100 RSI: 0000000000000000 RDI:
0000000000000004
[78084.005256] RBP: ffff88002e2e1d28 R08: 0000000001993000 R09:
dead000000100100
[78084.005256] R10: 800000016e90e165 R11: 0000000000000000 R12:
0000000000000000
[78084.005256] R13: ffff880002d8f580 R14: 0000000000400000 R15:
ffff880029248000
[78084.005256] FS:  00007fa07d87f7a0(0000) GS:ffff880002d81000(0000)
knlGS:0000000000000000
[78084.005256] CS:  e033 DS: 0000 ES: 0000 CR0: 000000008005003b
[78084.005256] CR2: ffff8800267c9010 CR3: 0000000001001000 CR4:
0000000000000660
[78084.005256] DR0: 0000000000000000 DR1: 0000000000000000 DR2:
0000000000000000
[78084.005256] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7:
0000000000000400
[78084.005256] Process udevd (pid: 22814, threadinfo ffff88002e2e0000,
task ffff880019491e80)
[78084.005256] Stack:
[78084.005256]  0000000000600000 000000000061e000 ffff88002e2e1de8
ffffffff810fb8a5
[78084.005256] <0> 00007fff13ffffff 0000000100000206 ffff880003158003
0000000000000000
[78084.005256] <0> 0000000000000000 000000000061dfff 000000000061dfff
000000000061dfff
[78084.005256] Call Trace:
[78084.005256]  [<ffffffff810fb8a5>] free_pgd_range+0x27c/0x45e
[78084.005256]  [<ffffffff810fbb2b>] free_pgtables+0xa4/0xc7
[78084.005256]  [<ffffffff810ff1fd>] exit_mmap+0x107/0x13f
[78084.005256]  [<ffffffff8107714b>] mmput+0x39/0xda
[78084.005256]  [<ffffffff8107adff>] exit_mm+0xfb/0x106
[78084.005256]  [<ffffffff8107c86d>] do_exit+0x1e8/0x6ff
[78084.005256]  [<ffffffff815c228b>] ? do_page_fault+0x2cd/0x2fd
[78084.005256]  [<ffffffff8107ce0d>] do_group_exit+0x89/0xb3
[78084.005256]  [<ffffffff8107ce49>] sys_exit_group+0x12/0x16
[78084.005256]  [<ffffffff8103cc82>] system_call_fastpath+0x16/0x1b
[78084.005256] Code: 48 83 c4 28 5b c9 c3 55 48 89 e5 41 54 49 89 f4 53
48 89 fb e8 fc ee ff ff 48 89 df ff 05 52 8f 9e 00 e8 78 e4 ff ff 84 c0
75 05 <4c> 89 23 eb 16 e8 e0 ee ff ff 4c 89 e6 48 89 df ff 05 37 8f 9e
[78084.005256] RIP  [<ffffffff810382ff>] xen_set_pmd+0x24/0x44
[78084.005256]  RSP <ffff88002e2e1d18>
[78084.005256] CR2: ffff8800267c9010
[78084.005256] ---[ end trace 4eaa2a86a8e2da24 ]---
[78084.005256] Fixing recursive fault but reboot is needed!

---------------------------

After that was printed on the console, anything that interacts with
Xen (xentop, xm) would freeze and never return.  After I tried to do a
sane shutdown of the guests, the whole dom0 locked up completely.
Even the Alt-SysRq combinations stopped working after I tried a couple
of them.

I feel it's probably necessary to mention that this happens after
several fairly rapid-fire creations and deletions of snapshot volumes.
I have it scripted to make a snapshot, mount it, mount a backup
volume, rsync it, unmount both volumes, and delete the snapshot, for
19 volumes in a row.  In other words, there's a lot of disk I/O going
on around the time of the lockup.  The freeze always seems to coincide
with the script reaching the volumes used by the active, running
Windows Server 2008 HVM guests.  That may be just coincidence, though,
because those are the last ones on the list.  The 15 volumes used by
active, running paravirtualized Linux guests are at the top of the
list.


Another issue that comes up is that if I run the 2.6.32.18 pvops kernel
for my Linux domUs, after a time (usually only about an hour or so), the
network interfaces stop responding.  I don't know if the problem is
related, but it was something else that I noticed.  The only way to get
network access to come back is to reboot the domU.  When I reverted the
domU kernel to 2.6.31.14, the problem went away.  I'm not 100%
sure, but I think this issue also causes xm console to not allow you to
type on the console that you connect to.  If I connect to a console,
then issue an xm shutdown on the same domU from another terminal, all of
the console messages that show the play-by-play of the shutdown process
display, but my keyboard input doesn't seem to make it through.

Since I'm not a developer, I don't know if these questions are better
suited for the xen-users list, but since it generated an OOPS with the
word "BUG" in capital letters, I thought I'd post it here.  If that
assumption was incorrect, just give me a gentle nudge and I'll redirect
the inquiry to somewhere more appropriate.  :)

If you need any more information about my setup, the steps used to
recreate the problem, or other debugging information, I'll be happy to
accommodate.  Just let me know what you need and how I can get it.

Here's some more information about my setup:
http://www.pridelands.org/~simba/hurricane-server.txt


-- 
Scott Garron

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: Making snapshot of logical volumes handling HVM domU causes OOPS and instability
  2010-08-28  1:22 Making snapshot of logical volumes handling HVM domU causes OOPS and instability Scott Garron
@ 2010-08-30 16:52 ` Jeremy Fitzhardinge
  2010-08-30 18:18   ` Scott Garron
                     ` (2 more replies)
  0 siblings, 3 replies; 19+ messages in thread
From: Jeremy Fitzhardinge @ 2010-08-30 16:52 UTC (permalink / raw)
  To: Scott Garron; +Cc: Xu, Dongxiao, xen-devel, Daniel Stodden

 On 08/27/2010 06:22 PM, Scott Garron wrote:
> I use LVM volumes for domU disks.  To create backups, I create a
> snapshot of the volume, mount the snapshot in the dom0, mount an
> equally-sized backup volume from another physical storage source, run an
> rsync from one to the other, unmount both, then remove the snapshot.
> This includes creating a snapshot and mounting NTFS volumes from
> Windows-based HVM guests.
>
> This practice may not be perfect, but has worked fine for me for a
> couple of years - while I was running Xen 3.2.1 and linux-2.6.18.8-xen
> dom0 (and the same kernel for domU).  After upgrades of udev started
> complaining about the kernel being too old, I thought it was well past
> time to try to transition to a newer version of Xen and a newer dom0
> kernel.  This transition has been a gigantic learning experience, let me
> tell you.
>
> After that transition, here's the problem I've been wrestling with and
> can't seem to find a solution for:  It seems like any time I start
> manipulating a volume group to add or remove a snapshot of a logical
> volume that's used as a disk for a running HVM guest, new calls to LVM2
> and/or Xen's storage locks up and spins forever.  The first time I ran
> across the problem, there was no indication of a problem other than
> any command I ran that handled anything to do with LVM would freeze and
> be completely unable to be signaled to do anything.  In other words, no
> error messages, nothing in dmesg, nothing in syslog...  The commands
> would just freeze and not return.  That was with the 2.6.31.14 kernel
> that is what's currently retrieved if you checkout xen-4.0-testing.hg
> and just do a make dist.
>
> I have since checked out and compiled 2.6.32.18 that comes from doing
> git checkout -b xen/stable-2.6.32.x origin/xen/stable-2.6.32.x, as
> described on the Wiki page here:
> http://wiki.xensource.com/xenwiki/XenParavirtOps
>
> If I run that kernel for dom0, but continue to use 2.6.31.14 for the
> paravirtualized domUs, everything works fine until I try to manipulate
> the snapshots of the HVM volumes.  Today, I got this kernel OOPS:

That's definitely bad.  Something is causing udevd to end up with bad
pagetables, which are causing a kernel crash on exit.  I'm not sure if
it's *the* udevd or some transient child, but either way it's bad.

Any thoughts on this Daniel?

>
> ---------------------------
>
> [78084.004530] BUG: unable to handle kernel paging request at
> ffff8800267c9010
> [78084.004710] IP: [<ffffffff810382ff>] xen_set_pmd+0x24/0x44
> [78084.004886] PGD 1002067 PUD 1006067 PMD 217067 PTE 80100000267c9065
> [78084.005065] Oops: 0003 [#1] SMP
> [78084.005234] last sysfs file:
> /sys/devices/virtual/block/dm-32/removable
> [78084.005256] CPU 1
> [78084.005256] Modules linked in: tun xt_multiport fuse dm_snapshot
> nf_nat_tftp nf_conntrack_tftp nf_nat_pptp nf_conntrack_pptp
> nf_conntrack_proto_gre nf_nat_proto_gre ntfs parport_pc parport k8temp
> floppy forcedeth [last unloaded: scsi_wait_scan]
> [78084.005256] Pid: 22814, comm: udevd Tainted: G        W  2.6.32.18 #1
> H8SMI
> [78084.005256] RIP: e030:[<ffffffff810382ff>]  [<ffffffff810382ff>]
> xen_set_pmd+0x24/0x44
> [78084.005256] RSP: e02b:ffff88002e2e1d18  EFLAGS: 00010246
> [78084.005256] RAX: 0000000000000000 RBX: ffff8800267c9010 RCX:
> ffff880000000000
> [78084.005256] RDX: dead000000100100 RSI: 0000000000000000 RDI:
> 0000000000000004
> [78084.005256] RBP: ffff88002e2e1d28 R08: 0000000001993000 R09:
> dead000000100100
> [78084.005256] R10: 800000016e90e165 R11: 0000000000000000 R12:
> 0000000000000000
> [78084.005256] R13: ffff880002d8f580 R14: 0000000000400000 R15:
> ffff880029248000
> [78084.005256] FS:  00007fa07d87f7a0(0000) GS:ffff880002d81000(0000)
> knlGS:0000000000000000
> [78084.005256] CS:  e033 DS: 0000 ES: 0000 CR0: 000000008005003b
> [78084.005256] CR2: ffff8800267c9010 CR3: 0000000001001000 CR4:
> 0000000000000660
> [78084.005256] DR0: 0000000000000000 DR1: 0000000000000000 DR2:
> 0000000000000000
> [78084.005256] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7:
> 0000000000000400
> [78084.005256] Process udevd (pid: 22814, threadinfo ffff88002e2e0000,
> task ffff880019491e80)
> [78084.005256] Stack:
> [78084.005256]  0000000000600000 000000000061e000 ffff88002e2e1de8
> ffffffff810fb8a5
> [78084.005256] <0> 00007fff13ffffff 0000000100000206 ffff880003158003
> 0000000000000000
> [78084.005256] <0> 0000000000000000 000000000061dfff 000000000061dfff
> 000000000061dfff
> [78084.005256] Call Trace:
> [78084.005256]  [<ffffffff810fb8a5>] free_pgd_range+0x27c/0x45e
> [78084.005256]  [<ffffffff810fbb2b>] free_pgtables+0xa4/0xc7
> [78084.005256]  [<ffffffff810ff1fd>] exit_mmap+0x107/0x13f
> [78084.005256]  [<ffffffff8107714b>] mmput+0x39/0xda
> [78084.005256]  [<ffffffff8107adff>] exit_mm+0xfb/0x106
> [78084.005256]  [<ffffffff8107c86d>] do_exit+0x1e8/0x6ff
> [78084.005256]  [<ffffffff815c228b>] ? do_page_fault+0x2cd/0x2fd
> [78084.005256]  [<ffffffff8107ce0d>] do_group_exit+0x89/0xb3
> [78084.005256]  [<ffffffff8107ce49>] sys_exit_group+0x12/0x16
> [78084.005256]  [<ffffffff8103cc82>] system_call_fastpath+0x16/0x1b
> [78084.005256] Code: 48 83 c4 28 5b c9 c3 55 48 89 e5 41 54 49 89 f4 53
> 48 89 fb e8 fc ee ff ff 48 89 df ff 05 52 8f 9e 00 e8 78 e4 ff ff 84 c0
> 75 05 <4c> 89 23 eb 16 e8 e0 ee ff ff 4c 89 e6 48 89 df ff 05 37 8f 9e
> [78084.005256] RIP  [<ffffffff810382ff>] xen_set_pmd+0x24/0x44
> [78084.005256]  RSP <ffff88002e2e1d18>
> [78084.005256] CR2: ffff8800267c9010
> [78084.005256] ---[ end trace 4eaa2a86a8e2da24 ]---
> [78084.005256] Fixing recursive fault but reboot is needed!
>
> ---------------------------
>
> After that was printed on the console, use of anything that interacts
> with Xen (xentop, xm) would freeze whatever command it was and not
> return.  After trying to do a sane shutdown on the guests, the whole
> dom0 locked completely.  Even the alt-sysrq things stopped working after
> looking at a couple of them.
>
> I feel it's probably necessary to mention that this is after several,
> fairly rapid-fire creations and deletions of snapshot volumes.  I have
> it scripted to make a snapshot, mount it, mount a backup volume, rsync
> it, unmount both volumes, and delete the snapshot for 19 volumes in a
> row.  In other words, there's a lot of disk I/O going on around the time
> of the lockup.  It always seems to coincide with when it gets to the
> volumes that are being used for active, running, Windows Server 2008,
> HVM volumes.  That may be just coincidental, though, because those are
> the last ones on the list.  15 volumes used in active, running
> paravirtualized Linux guests are at the top of the list.
>
>
> Another issue that comes up is that if I run the 2.6.32.18 pvops kernel
> for my Linux domUs, after a time (usually only about an hour or so), the
> network interfaces stop responding.  I don't know if the problem is
> related, but it was something else that I noticed.  The only way to get
> the network access to come back is to reboot the domU.  When I reverted
> the domU kernel to 2.6.31.14, this problem goes away.

That's a separate problem in netfront that appears to be a bug in the
"smartpoll" code.  I think Dongxiao is looking into it.

> I'm not 100%
> sure, but I think this issue also causes xm console to not allow you to
> type on the console that you connect to.  If I connect to a console,
> then issue an xm shutdown on the same domU from another terminal, all of
> the console messages that show the play-by-play of the shutdown process
> display, but my keyboard input doesn't seem to make it through.

Hm, I'm not familiar with this problem.  Perhaps it's just something
wrong with your console settings for the domain?  Do you have
"console=" on the kernel command line?

> Since I'm not a developer, I don't know if these questions are better
> suited for the xen-users list, but since it generated an OOPS with the
> word "BUG" in capital letters, I thought I'd post it here.  If that
> assumption was incorrect, just give me a gentle nudge and I'll redirect
> the inquiry to somewhere more appropriate.  :)

Nope, they're both xen-devel fodder.  Thanks for posting.

    J


* Re: Making snapshot of logical volumes handling HVM domU causes OOPS and instability
  2010-08-30 16:52 ` Jeremy Fitzhardinge
@ 2010-08-30 18:18   ` Scott Garron
  2010-09-12  9:33     ` J. Roeleveld
  2010-08-30 19:13   ` Daniel Stodden
  2010-08-31  6:59   ` Making snapshot of logical volumes handling HVM domU causes " Xu, Dongxiao
  2 siblings, 1 reply; 19+ messages in thread
From: Scott Garron @ 2010-08-30 18:18 UTC (permalink / raw)
  To: xen-devel

On 08/30/2010 12:52 PM, Jeremy Fitzhardinge wrote:
> Something is causing udevd to end up with bad pagetables, which are
> causing a kernel crash on exit.  I'm not sure if it's *the* udevd or
> some transient child, but either way it's bad.

      If it's any help, the version of udev on the machine in question is
160.  (udev_160-1_amd64.deb)  I've now added that info in the text file
that describes my system, referenced in my original post by this URL:
http://www.pridelands.org/~simba/hurricane-server.txt

>> I think this issue [unresponsive network interfaces] also causes xm
>> console to not allow you to type on the console
> Hm, not familiar with this problem.  Perhaps its just something wrong
> with your console settings for the domain?  Do you have "console=" on
> the kernel command line?

      I have "extra = "console=hvc0"" in the domU configuration files.
The keyboard input works just fine for some time.  It ceased accepting
input at around the same time that the network interfaces stopped
responding, but that could have just been coincidental.

      I wasn't paying full attention, so this may also have been related
to me attaching to the console twice (running xm console in one SSH
session to the dom0 while also running xm console in another SSH
session to the dom0).  When I couldn't connect directly to the domU via
SSH on its network interface, I tried to attach to its console to do
troubleshooting.  I may have already been attached to its console from
another SSH session to the dom0, and I suppose that might cause a
conflict.   ...  which raises the question:  "Is this the
desired/expected behavior in this scenario?"

-- 
Scott Garron


* Re: Making snapshot of logical volumes handling HVM domU causes OOPS and instability
  2010-08-30 16:52 ` Jeremy Fitzhardinge
  2010-08-30 18:18   ` Scott Garron
@ 2010-08-30 19:13   ` Daniel Stodden
  2010-08-30 20:30     ` Scott Garron
  2010-08-31  6:59   ` Making snapshot of logical volumes handling HVM domU causes " Xu, Dongxiao
  2 siblings, 1 reply; 19+ messages in thread
From: Daniel Stodden @ 2010-08-30 19:13 UTC (permalink / raw)
  To: Jeremy Fitzhardinge; +Cc: Xu, Dongxiao, Scott Garron, xen-devel

On Mon, 2010-08-30 at 12:52 -0400, Jeremy Fitzhardinge wrote:

> > After that transition, here's the problem I've been wrestling with and
> > can't seem to find a solution for:  It seems like any time I start
> > manipulating a volume group to add or remove a snapshot of a logical
> > volume that's used as a disk for a running HVM guest, new calls to LVM2
> > and/or Xen's storage locks up and spins forever.

Are you sure it's spinning or just freezing?

>   The first time I ran
> > across the problem, there was no indication of a problem other than
> > any command I ran that handled anything to do with LVM would freeze and
> > be completely unable to be signaled to do anything.  

> In other words, no
> > error messages, nothing in dmesg, nothing in syslog...  The commands
> > would just freeze and not return.  That was with the 2.6.31.14 kernel
> > that is what's currently retrieved if you checkout xen-4.0-testing.hg
> > and just do a make dist.

Can you try to find the minimum number of steps necessary to get into
that state, and then try something like $ ps -eH -owchan,nwchan,cmd

Also, is that sequence completely reproducible, or does the behaviour
change every time?  Just trying to see if there's some point where the
deadlock ends and corruption like the one quoted below would start.
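If the box wedges before that output can be read, one option is to log the wait channels continuously from a session opened in advance; a minimal sketch (the log path is an assumption, not something from this thread):

```shell
#!/bin/sh
# Append one timestamped wait-channel snapshot per invocation.  Run it in a
# loop from a shell opened *before* reproducing the hang, so the last few
# snapshots survive the freeze.  LOGFILE is a placeholder path.
LOGFILE=${LOGFILE:-/tmp/wchan.log}
{
    date
    ps -eH -o wchan:25,nwchan,cmd
} >> "$LOGFILE"
```

For example, run it as `while sleep 5; do sh wchan-snap.sh; done` and read the tail of the log after rebooting.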

Daniel

> > I have since checked out and compiled 2.6.32.18 that comes from doing
> > git checkout -b xen/stable-2.6.32.x origin/xen/stable-2.6.32.x, as
> > described on the Wiki page here:
> > http://wiki.xensource.com/xenwiki/XenParavirtOps
> >
> > If I run that kernel for dom0, but continue to use 2.6.31.14 for the
> > paravirtualized domUs, everything works fine until I try to manipulate
> > the snapshots of the HVM volumes.  Today, I got this kernel OOPS:
> 
> That's definitely bad.  Something is causing udevd to end up with bad
> pagetables which are causing a kernel crash on exit.  I'm not sure if
> its *the* udevd or some transient child, but either way its bad.
> 
> Any thoughts on this Daniel?
> 
> >
> > ---------------------------
> >
> > [78084.004530] BUG: unable to handle kernel paging request at
> > ffff8800267c9010
> > [78084.004710] IP: [<ffffffff810382ff>] xen_set_pmd+0x24/0x44
> > [78084.004886] PGD 1002067 PUD 1006067 PMD 217067 PTE 80100000267c9065
> > [78084.005065] Oops: 0003 [#1] SMP
> > [78084.005234] last sysfs file:
> > /sys/devices/virtual/block/dm-32/removable
> > [78084.005256] CPU 1
> > [78084.005256] Modules linked in: tun xt_multiport fuse dm_snapshot
> > nf_nat_tftp nf_conntrack_tftp nf_nat_pptp nf_conntrack_pptp
> > nf_conntrack_proto_gre nf_nat_proto_gre ntfs parport_pc parport k8temp
> > floppy forcedeth [last unloaded: scsi_wait_scan]
> > [78084.005256] Pid: 22814, comm: udevd Tainted: G        W  2.6.32.18 #1
> > H8SMI
> > [78084.005256] RIP: e030:[<ffffffff810382ff>]  [<ffffffff810382ff>]
> > xen_set_pmd+0x24/0x44
> > [78084.005256] RSP: e02b:ffff88002e2e1d18  EFLAGS: 00010246
> > [78084.005256] RAX: 0000000000000000 RBX: ffff8800267c9010 RCX:
> > ffff880000000000
> > [78084.005256] RDX: dead000000100100 RSI: 0000000000000000 RDI:
> > 0000000000000004
> > [78084.005256] RBP: ffff88002e2e1d28 R08: 0000000001993000 R09:
> > dead000000100100
> > [78084.005256] R10: 800000016e90e165 R11: 0000000000000000 R12:
> > 0000000000000000
> > [78084.005256] R13: ffff880002d8f580 R14: 0000000000400000 R15:
> > ffff880029248000
> > [78084.005256] FS:  00007fa07d87f7a0(0000) GS:ffff880002d81000(0000)
> > knlGS:0000000000000000
> > [78084.005256] CS:  e033 DS: 0000 ES: 0000 CR0: 000000008005003b
> > [78084.005256] CR2: ffff8800267c9010 CR3: 0000000001001000 CR4:
> > 0000000000000660
> > [78084.005256] DR0: 0000000000000000 DR1: 0000000000000000 DR2:
> > 0000000000000000
> > [78084.005256] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7:
> > 0000000000000400
> > [78084.005256] Process udevd (pid: 22814, threadinfo ffff88002e2e0000,
> > task ffff880019491e80)
> > [78084.005256] Stack:
> > [78084.005256]  0000000000600000 000000000061e000 ffff88002e2e1de8
> > ffffffff810fb8a5
> > [78084.005256] <0> 00007fff13ffffff 0000000100000206 ffff880003158003
> > 0000000000000000
> > [78084.005256] <0> 0000000000000000 000000000061dfff 000000000061dfff
> > 000000000061dfff
> > [78084.005256] Call Trace:
> > [78084.005256]  [<ffffffff810fb8a5>] free_pgd_range+0x27c/0x45e
> > [78084.005256]  [<ffffffff810fbb2b>] free_pgtables+0xa4/0xc7
> > [78084.005256]  [<ffffffff810ff1fd>] exit_mmap+0x107/0x13f
> > [78084.005256]  [<ffffffff8107714b>] mmput+0x39/0xda
> > [78084.005256]  [<ffffffff8107adff>] exit_mm+0xfb/0x106
> > [78084.005256]  [<ffffffff8107c86d>] do_exit+0x1e8/0x6ff
> > [78084.005256]  [<ffffffff815c228b>] ? do_page_fault+0x2cd/0x2fd
> > [78084.005256]  [<ffffffff8107ce0d>] do_group_exit+0x89/0xb3
> > [78084.005256]  [<ffffffff8107ce49>] sys_exit_group+0x12/0x16
> > [78084.005256]  [<ffffffff8103cc82>] system_call_fastpath+0x16/0x1b
> > [78084.005256] Code: 48 83 c4 28 5b c9 c3 55 48 89 e5 41 54 49 89 f4 53
> > 48 89 fb e8 fc ee ff ff 48 89 df ff 05 52 8f 9e 00 e8 78 e4 ff ff 84 c0
> > 75 05 <4c> 89 23 eb 16 e8 e0 ee ff ff 4c 89 e6 48 89 df ff 05 37 8f 9e
> > [78084.005256] RIP  [<ffffffff810382ff>] xen_set_pmd+0x24/0x44
> > [78084.005256]  RSP <ffff88002e2e1d18>
> > [78084.005256] CR2: ffff8800267c9010
> > [78084.005256] ---[ end trace 4eaa2a86a8e2da24 ]---
> > [78084.005256] Fixing recursive fault but reboot is needed!
> >


* Re: Making snapshot of logical volumes handling HVM domU causes OOPS and instability
  2010-08-30 19:13   ` Daniel Stodden
@ 2010-08-30 20:30     ` Scott Garron
  2010-08-31  9:20       ` Daniel Stodden
  0 siblings, 1 reply; 19+ messages in thread
From: Scott Garron @ 2010-08-30 20:30 UTC (permalink / raw)
  To: Daniel Stodden; +Cc: Jeremy Fitzhardinge, xen-devel, Xu, Dongxiao

On 08/30/2010 03:13 PM, Daniel Stodden wrote:
> Are you sure it's spinning or just freezing?

      I'm not sure that I understand the difference between those two
terms, so I'm going to guess "freezing" is probably a more accurate
description.  The best way to describe what I was seeing was that my
scripted backup procedure would get to a certain point and freeze, then
I wouldn't be able to break out of it or issue a kill from another SSH
session on its PID.  The kill command freezes the same way (never
returns to a shell prompt and pressing CTRL-C just shows ^C on the
display without breaking out).

> Can you try find the minimum number of steps necessary to get into
> that state and try sth like $ ps -eH -owchan,nwchan,cmd

      The minimum number of steps that I took, just now, to make it
happen was as follows:

      There's an HVM domU that's active and running Windows 2008 Server,
called "scrappy", with the following Xen configuration:

kernel = "hvmloader"
builder='hvm'
memory = 768
name = "scrappy"
vcpus=1
vif = [ 'type=ioemu, mac=00:16:3e:00:00:18, bridge=eth0',
        'type=ioemu, mac=00:16:3e:00:00:19, bridge=xenbr1',
        'type=ioemu, mac=00:16:3e:00:00:1A, bridge=xenbr2' ]
disk = [ 'phy:hurricanevg1/scrappy-primarymaster,xvda,w',
         'file:/mnt/scratch/WindowsServerStd2008OEM_x86-64.iso,xvdb:cdrom,r',
         'phy:hurricanevg1/scrappy-secondarymaster,xvdc,w' ]
on_reboot   = 'restart'
device_model = 'qemu-dm'
sdl=0
opengl=1
vnc=1
vnclisten="192.168.0.90"
vncdisplay=3
vncunused=1
stdvga=0
serial='pty'
tsc_mode=0
localtime=1
rtc_timeoffset=-3600


      While that was running, I created a snapshot of the primarymaster
volume and removed it, created one for the secondarymaster and removed
it, then created another one for the primarymaster and tried to remove
it, and that lvremove command froze.  A minute or two later, I got a
kernel OOPS message on my console similar to the one I posted before.
These are the commands I used to create and remove the volumes:

lvcreate -L 2G -n scrappy-primarymaster-backupsnap -s
hurricanevg1/scrappy-primarymaster

lvremove hurricanevg1/scrappy-primarymaster-backupsnap

lvcreate -L 2G -n scrappy-secondarymaster-backupsnap -s
hurricanevg1/scrappy-secondarymaster

lvremove hurricanevg1/scrappy-secondarymaster-backupsnap

lvcreate -L 2G -n scrappy-primarymaster-backupsnap -s
hurricanevg1/scrappy-primarymaster

lvremove hurricanevg1/scrappy-primarymaster-backupsnap


      This time, the console froze completely and I couldn't open any new
SSH sessions into the machine, and couldn't run the ps -eH command that
you asked for in your previous message.  If I go for another attempt,
I'll try to have a few logins already going so I can try to get that
output for you.  This is a somewhat critical, production server, though,
so I didn't want to keep bouncing it in the middle of the day.

> Also, is that sequence completely reproducible or does the behaviour
>  change evertime? Just trying if there's some point where deadlock
> ends and corruption like the one quoted below would start.

      It seems to be 3 for 3 at this point.

-- 
Scott Garron


* RE: Making snapshot of logical volumes handling HVM domU causes OOPS and instability
  2010-08-30 16:52 ` Jeremy Fitzhardinge
  2010-08-30 18:18   ` Scott Garron
  2010-08-30 19:13   ` Daniel Stodden
@ 2010-08-31  6:59   ` Xu, Dongxiao
  2010-08-31  8:16     ` Scott Garron
  2 siblings, 1 reply; 19+ messages in thread
From: Xu, Dongxiao @ 2010-08-31  6:59 UTC (permalink / raw)
  To: Jeremy Fitzhardinge, Scott Garron; +Cc: Daniel Stodden, xen-devel

Jeremy Fitzhardinge wrote:
>  On 08/27/2010 06:22 PM, Scott Garron wrote:
>> I use LVM volumes for domU disks.  To create backups, I create a
>> snapshot of the volume, mount the snapshot in the dom0, mount an
>> equally-sized backup volume from another physical storage source, run
>> an rsync from one to the other, unmount both, then remove the
>> snapshot. 
>> This includes creating a snapshot and mounting NTFS volumes from
>> Windows-based HVM guests.
>> 
>> This practice may not be perfect, but has worked fine for me for a
>> couple of years - while I was running Xen 3.2.1 and
>> linux-2.6.18.8-xen 
>> dom0 (and the same kernel for domU).  After upgrades of udev started
>> complaining about the kernel being too old, I thought it was well
>> past 
>> time to try to transition to a newer version of Xen and a newer dom0
>> kernel.  This transition has been a gigantic learning experience, let
>> me tell you.
>> 
>> After that transition, here's the problem I've been wrestling with
>> and 
>> can't seem to find a solution for:  It seems like any time I start
>> manipulating a volume group to add or remove a snapshot of a logical
>> volume that's used as a disk for a running HVM guest, new calls to
>> LVM2 and/or Xen's storage locks up and spins forever.  The first time
>> I ran across the problem, there was no indication of a problem other
>> than any command I ran that handled anything to do with LVM would
>> freeze and be completely unable to be signaled to do anything.  In
>> other words, no error messages, nothing in dmesg, nothing in
>> syslog... 
>> The commands would just freeze and not return.  That was with the
>> 2.6.31.14 kernel that is what's currently retrieved if you checkout
>> xen-4.0-testing.hg and just do a make dist.
>> 
>> I have since checked out and compiled 2.6.32.18 that comes from doing
>> git checkout -b xen/stable-2.6.32.x origin/xen/stable-2.6.32.x, as
>> described on the Wiki page here:
>> http://wiki.xensource.com/xenwiki/XenParavirtOps
>> 
>> If I run that kernel for dom0, but continue to use 2.6.31.14 for the
>> paravirtualized domUs, everything works fine until I try to
>> manipulate 
>> the snapshots of the HVM volumes.  Today, I got this kernel OOPS:
> 
> That's definitely bad.  Something is causing udevd to end up with bad
> pagetables which are causing a kernel crash on exit.  I'm not sure if
> its *the* udevd or some transient child, but either way its bad.  
> 
> Any thoughts on this Daniel?
> 
>> 
>> ---------------------------
>> 
>> [78084.004530] BUG: unable to handle kernel paging request at
>> ffff8800267c9010
>> [78084.004710] IP: [<ffffffff810382ff>] xen_set_pmd+0x24/0x44
>> [78084.004886] PGD 1002067 PUD 1006067 PMD 217067 PTE 80100000267c9065
>> [78084.005065] Oops: 0003 [#1] SMP
>> [78084.005234] last sysfs file: /sys/devices/virtual/block/dm-32/removable
>> [78084.005256] CPU 1
>> [78084.005256] Modules linked in: tun xt_multiport fuse dm_snapshot
>> nf_nat_tftp nf_conntrack_tftp nf_nat_pptp nf_conntrack_pptp
>> nf_conntrack_proto_gre nf_nat_proto_gre ntfs parport_pc parport k8temp
>> floppy forcedeth [last unloaded: scsi_wait_scan]
>> [78084.005256] Pid: 22814, comm: udevd Tainted: G        W  2.6.32.18 #1
>> H8SMI
>> [78084.005256] RIP: e030:[<ffffffff810382ff>]  [<ffffffff810382ff>]
>> xen_set_pmd+0x24/0x44
>> [78084.005256] RSP: e02b:ffff88002e2e1d18  EFLAGS: 00010246
>> [78084.005256] RAX: 0000000000000000 RBX: ffff8800267c9010 RCX:
>> ffff880000000000
>> [78084.005256] RDX: dead000000100100 RSI: 0000000000000000 RDI:
>> 0000000000000004
>> [78084.005256] RBP: ffff88002e2e1d28 R08: 0000000001993000 R09:
>> dead000000100100
>> [78084.005256] R10: 800000016e90e165 R11: 0000000000000000 R12:
>> 0000000000000000
>> [78084.005256] R13: ffff880002d8f580 R14: 0000000000400000 R15:
>> ffff880029248000
>> [78084.005256] FS:  00007fa07d87f7a0(0000) GS:ffff880002d81000(0000)
>> knlGS:0000000000000000
>> [78084.005256] CS:  e033 DS: 0000 ES: 0000 CR0: 000000008005003b
>> [78084.005256] CR2: ffff8800267c9010 CR3: 0000000001001000 CR4:
>> 0000000000000660
>> [78084.005256] DR0: 0000000000000000 DR1: 0000000000000000 DR2:
>> 0000000000000000
>> [78084.005256] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7:
>> 0000000000000400
>> [78084.005256] Process udevd (pid: 22814, threadinfo ffff88002e2e0000,
>> task ffff880019491e80)
>> [78084.005256] Stack:
>> [78084.005256]  0000000000600000 000000000061e000 ffff88002e2e1de8
>> ffffffff810fb8a5
>> [78084.005256] <0> 00007fff13ffffff 0000000100000206 ffff880003158003
>> 0000000000000000
>> [78084.005256] <0> 0000000000000000 000000000061dfff 000000000061dfff
>> 000000000061dfff
>> [78084.005256] Call Trace:
>> [78084.005256]  [<ffffffff810fb8a5>] free_pgd_range+0x27c/0x45e
>> [78084.005256]  [<ffffffff810fbb2b>] free_pgtables+0xa4/0xc7
>> [78084.005256]  [<ffffffff810ff1fd>] exit_mmap+0x107/0x13f
>> [78084.005256]  [<ffffffff8107714b>] mmput+0x39/0xda
>> [78084.005256]  [<ffffffff8107adff>] exit_mm+0xfb/0x106
>> [78084.005256]  [<ffffffff8107c86d>] do_exit+0x1e8/0x6ff
>> [78084.005256]  [<ffffffff815c228b>] ? do_page_fault+0x2cd/0x2fd
>> [78084.005256]  [<ffffffff8107ce0d>] do_group_exit+0x89/0xb3
>> [78084.005256]  [<ffffffff8107ce49>] sys_exit_group+0x12/0x16
>> [78084.005256]  [<ffffffff8103cc82>] system_call_fastpath+0x16/0x1b
>> [78084.005256] Code: 48 83 c4 28 5b c9 c3 55 48 89 e5 41 54 49 89 f4
>> 53 48 89 fb e8 fc ee ff ff 48 89 df ff 05 52 8f 9e 00 e8 78 e4 ff ff
>> 84 c0 75 05 <4c> 89 23 eb 16 e8 e0 ee ff ff 4c 89 e6 48 89 df ff 05
>> 37 8f 9e
>> [78084.005256] RIP  [<ffffffff810382ff>] xen_set_pmd+0x24/0x44
>> [78084.005256]  RSP <ffff88002e2e1d18>
>> [78084.005256] CR2: ffff8800267c9010
>> [78084.005256] ---[ end trace 4eaa2a86a8e2da24 ]---
>> [78084.005256] Fixing recursive fault but reboot is needed!
>> 
>> ---------------------------
>> 
>> After that was printed on the console, anything that interacts with
>> Xen (xentop, xm) would freeze and not return.  After I tried to do a
>> sane shutdown on the guests, the whole dom0 locked up completely.
>> Even the alt-sysrq combinations stopped working after I had tried a
>> couple of them.
>> 
>> I feel it's probably necessary to mention that this is after several
>> fairly rapid-fire creations and deletions of snapshot volumes.  I have
>> it scripted to make a snapshot, mount it, mount a backup volume, rsync
>> it, unmount both volumes, and delete the snapshot for 19 volumes in a
>> row.  In other words, there's a lot of disk I/O going on around the
>> time of the lockup.  It always seems to coincide with when it gets to
>> the volumes that are being used by active, running Windows Server
>> 2008 HVM guests.  That may be just coincidental, though, because
>> those are the last ones on the list.  15 volumes used in active,
>> running paravirtualized Linux guests are at the top of the list.
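
The per-volume backup cycle described above amounts to something like
the following.  This is only a sketch, not the poster's actual script;
the mount points, 2G snapshot size, backup volume group name, and rsync
flags are all assumptions for illustration (set RUN=echo for a dry run):

```shell
RUN=${RUN:-}   # set RUN=echo to print the commands instead of running them

# Back up one logical volume via a temporary snapshot, as described above:
# snapshot, mount snapshot and backup volume, rsync, unmount, remove.
backup_one() {
    vg=$1; lv=$2                     # e.g. hurricanevg1 digit-root
    $RUN lvcreate -L 2G -n "${lv}-backupsnap" -s "$vg/$lv"
    $RUN mount -o ro "/dev/$vg/${lv}-backupsnap" /mnt/snap
    $RUN mount "/dev/backupvg/${lv}-backup" /mnt/backup
    $RUN rsync -aHx --delete /mnt/snap/ /mnt/backup/
    $RUN umount /mnt/backup
    $RUN umount /mnt/snap
    $RUN lvremove -f "$vg/${lv}-backupsnap"
}
```

Run in a loop over 19 volumes, this generates exactly the rapid-fire
snapshot create/remove churn the report describes.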
>> 
>> 
>> Another issue that comes up is that if I run the 2.6.32.18 pvops
>> kernel for my Linux domUs, after a time (usually only about an hour
>> or so), the network interfaces stop responding.  I don't know if the
>> problem is related, but it was something else that I noticed.  The
>> only way to get network access to come back is to reboot the domU.
>> When I reverted the domU kernel to 2.6.31.14, the problem went away.
> 
> That's a separate problem in netfront that appears to be a bug in the
> "smartpoll" code.  I think Dongxiao is looking into it. 

Yes, I have tried to reproduce this over the last few days, but I could
not catch it locally.  I tried both netperf and ping for a long time,
but the bug is not triggered.  What workload are you running when you
hit the bug?

Thanks,
Dongxiao

> 
>> I'm not 100%
>> sure, but I think this issue also causes xm console to not allow you
>> to type on the console that you connect to.  If I connect to a
>> console, then issue an xm shutdown on the same domU from another
>> terminal, all of the console messages that show the play-by-play of
>> the shutdown process display, but my keyboard input doesn't seem to
>> make it through. 
> 
> Hm, not familiar with this problem.  Perhaps it's just something wrong
> with your console settings for the domain?  Do you have "console=" on
> the kernel command line?
> 
>> Since I'm not a developer, I don't know if these questions are better
>> suited for the xen-users list, but since it generated an OOPS with the
>> word "BUG" in capital letters, I thought I'd post it here.  If that
>> assumption was incorrect, just give me a gentle nudge and I'll
>> redirect the inquiry to somewhere more appropriate.  :)
> 
> Nope, they're both xen-devel fodder.  Thanks for posting.
> 
>     J

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: Making snapshot of logical volumes handling HVM domU causes OOPS and instability
  2010-08-31  6:59   ` Making snapshot of logical volumes handling HVM domU causes " Xu, Dongxiao
@ 2010-08-31  8:16     ` Scott Garron
  0 siblings, 0 replies; 19+ messages in thread
From: Scott Garron @ 2010-08-31  8:16 UTC (permalink / raw)
  To: Xu, Dongxiao; +Cc: Jeremy Fitzhardinge, xen-devel, Daniel Stodden

>> Scott Garron wrote:
>>> Another issue that comes up is that if I run the 2.6.32.18 pvops
>>> kernel for my Linux domUs, after a time (usually only about an
>>> hour or so), the network interfaces stop responding.

> Jeremy Fitzhardinge wrote:
>> That's a separate problem in netfront that appears to be a bug in
>> the "smartpoll" code.  I think Dongxiao is looking into it.

On 8/31/2010 2:59 AM, Xu, Dongxiao wrote:
> Yes, I have tried to reproduce this over the last few days, but I
> could not catch it locally.  I tried both netperf and ping for a long
> time, but the bug is not triggered.  What workload are you running
> when you hit the bug?

      I'd say that the whole machine is under moderate to high
utilization because it has 10 virtual machines running - three of which
are Windows 2008 Servers as HVM guests.  However, as far as the "load"
goes, most of the virtual machines are fairly idle and probably not
under much stress, overall.  Just to give you an idea, we have a
10Mbit/s connection to the Internet, and this server's physical network
interface (all 10 of the domUs' traffic, combined) usually accounts for
less than 2Mbit/s of the outbound traffic at any given point in the day.
Aside from Windows being Windows (the HVM guests are running graphical
desktops), I wouldn't say that any of them cause a high CPU load,
either.  Database load is fairly low to moderate on guests running MySQL
and/or PostgreSQL.  The only guest that seems to use more CPU and
RAM is one serving e-mail, and that's because it runs ClamAV and
SpamAssassin.  That e-mail server was one that kept its network
connectivity the longest, though (after a few hours, it did stop
responding, but that was after some guests with lighter loads stopped
responding).

      An observation that I made, and it may just be coincidental,
but at least noteworthy, is that the virtual machines that are assigned
less RAM seem to lose connectivity more quickly than those with more
RAM.  The most recent time that I was able to trigger the bug, the
virtual machine that lost connectivity was only assigned 384MB RAM,
running 2.6.32.18.  At the time, the rest of my paravirtualized guests
were running 2.6.31.14, and they didn't experience the problem.

      I've previously triggered the bug in multiple domUs that were
running a more recent kernel (I think it was 2.6.32.17 - before I
reverted to a netback-patched 2.6.31.14 kernel), and the first ones to
disappear from the network were ones that were only assigned 256MB.
Eventually, they all disappeared, though.  The only "load" on one of the
first to disappear is an installation of bind9, servicing about 50
domain names - none of which receive an abnormally high hit count.

      The first time I noticed the problem, I had started 7
paravirtualized guests of varying memory assignments.  The moment I
started the 8th guest, an HVM Windows 2008 Server, the networking on
all of the running guests (the paravirt ones) stopped responding at the
same time.  That may also be something to try/look at.

      After a reboot, I avoided starting any of the HVM guests, and the
connectivity lasted a couple of hours on the 7 running paravirt guests,
but started disappearing one guest at a time, over the course of the
next few hours.

      I didn't mention in my previous e-mail that in order to get
networking to work in a stable fashion in the 2.6.31.14 kernel (the one
I reverted to), I had to apply the patch mentioned here:
http://lists.xensource.com/archives/html/xen-devel/2010-05/msg01570.html
Otherwise, networking became unstable immediately at the time of guest
creation.  That patch was already applied to the 2.6.32.18 kernel that
is giving me the eventual network loss problems, though.

      More specifics about my configuration can be found here:
http://www.pridelands.org/~simba/hurricane-server.txt

-- 
Scott Garron


* Re: Making snapshot of logical volumes handling HVM domU causes OOPS and instability
  2010-08-30 20:30     ` Scott Garron
@ 2010-08-31  9:20       ` Daniel Stodden
  2010-08-31 18:06         ` Scott Garron
  0 siblings, 1 reply; 19+ messages in thread
From: Daniel Stodden @ 2010-08-31  9:20 UTC (permalink / raw)
  To: Scott Garron; +Cc: Jeremy Fitzhardinge, xen-devel, Xu, Dongxiao

On Mon, 2010-08-30 at 16:30 -0400, Scott Garron wrote:
> On 08/30/2010 03:13 PM, Daniel Stodden wrote:
> > Are you sure it's spinning or just freezing?
> 
>       I'm not sure that I understand the difference between those two
> terms, so I'm going to guess "freezing" is probably a more accurate
> description.  The best way to describe what I was seeing was that my
> scripted backup procedure would get to a certain point and freeze, then
> I wouldn't be able to break out of it or issue a kill from another SSH
> session on its PID.  The kill command freezes the same way (never
> returns to a shell prompt and pressing CTRL-C just shows ^C on the
> display without breaking out).

If it were just one or more tasks hanging, caught in some wait state,
then identifying the point where things broke can sometimes be quite
straightforward.  That doesn't seem to be the case here.

> > Can you try to find the minimum number of steps necessary to get
> > into that state and try something like $ ps -eH -owchan,nwchan,cmd
> 
>       The minimum number of steps that I took, just now, to make it
> happen was as follows:
> 
>       There's an HVM domU that's active and running Windows 2008 Server,
> called "scrappy", with the following Xen configuration:
> 
> kernel = "hvmloader"
> builder='hvm'
> memory = 768
> name = "scrappy"
> vcpus=1
> vif = [ 'type=ioemu, mac=00:16:3e:00:00:18, bridge=eth0','type=ioemu,
> mac=00:16:3e:00:00:19, bridge=xenbr1','type=ioemu,
> mac=00:16:3e:00:00:1A, bridge=xenbr2' ]
> disk = [ 'phy:hurricanevg1/scrappy-primarymaster,xvda,w',
> 'file:/mnt/scratch/WindowsServerStd2008OEM_x86-64.iso,xvdb:cdrom,r',
> 'phy:hurricanevg1/scrappy-secondarymaster,xvdc,w' ]
> on_reboot   = 'restart'
> device_model = 'qemu-dm'
> sdl=0
> opengl=1
> vnc=1
> vnclisten="192.168.0.90"
> vncdisplay=3
> vncunused=1
> stdvga=0
> serial='pty'
> tsc_mode=0
> localtime=1
> rtc_timeoffset=-3600
> 
> 
>       While that's running, I created a snapshot of the primarymaster
> volume, then removed it, created one for the secondarymaster, removed
> it, and created another one for the primarymaster, tried to remove it,
> and the lvremove command froze.  A minute or two later, I got a similar
> kernel OOPS message on my console to the one that I posted before.
> These are the commands that I used to create and remove the volumes:
> 
> lvcreate -L 2G -n scrappy-primarymaster-backupsnap -s
> hurricanevg1/scrappy-primarymaster
> 
> lvremove hurricanevg1/scrappy-primarymaster-backupsnap
> 
> lvcreate -L 2G -n scrappy-secondarymaster-backupsnap -s
> hurricanevg1/scrappy-secondarymaster
> 
> lvremove hurricanevg1/scrappy-secondarymaster-backupsnap
> 
> lvcreate -L 2G -n scrappy-primarymaster-backupsnap -s
> hurricanevg1/scrappy-primarymaster
> 
> lvremove hurricanevg1/scrappy-primarymaster-backupsnap
> 
> 
>       This time, the console froze completely and I couldn't open any new
> SSH sessions into the machine, and couldn't run the ps -eH command that
> you asked for in your previous message.  If I go for another attempt,
> I'll try to have a few logins already going so I can try to get that
> output for you.  This is a somewhat critical, production server, though,
> so I didn't want to keep bouncing it in the middle of the day.
> 
> > Also, is that sequence completely reproducible, or does the
> > behaviour change every time?  Just trying to see if there's some
> > point where deadlock ends and corruption like the one quoted below
> > would start.
> 
>       It seems to be 3 for 3 at this point.

Okay. I guess that won't be simple to repro. I wonder what you are
running in dom0. Distro and version, what you upgraded and what not, any
customized software builds etc.

Given the rate at which you reproduce this and because only the
snapshots seem to trigger the problem, to me this looks more like an
LVM/DM issue than pvops specific.

Also, it might be worth trying to turn off udev and see whether that
changes anything.

Daniel


* Re: Making snapshot of logical volumes handling HVM domU causes OOPS and instability
  2010-08-31  9:20       ` Daniel Stodden
@ 2010-08-31 18:06         ` Scott Garron
  2010-09-03  8:06           ` Scott Garron
       [not found]           ` <4C80ABA6.6000203@pridelands.org>
  0 siblings, 2 replies; 19+ messages in thread
From: Scott Garron @ 2010-08-31 18:06 UTC (permalink / raw)
  To: Daniel Stodden; +Cc: Jeremy Fitzhardinge, xen-devel, Xu, Dongxiao

On 08/31/2010 05:20 AM, Daniel Stodden wrote:
> If it were just one or more tasks hanging, caught in some wait state,
> then identifying the point where things broke can sometimes be quite
> straightforward.  That doesn't seem to be the case here.

      True.  It's at least narrowed down to something with the way LVM/DM
and udev interact during creation and removal of snapshots since the
machine can run for days without incident until I start adding and
removing snapshots (of running HVM volumes).
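
Since the lockups only happen while LVM/DM and udev are racing over
snapshot creation and removal, one untested mitigation would be to
drain the udev event queue between operations with udevadm settle, so
each device-mapper node is fully created or torn down before the next
command runs.  A minimal sketch (the snapshot_cycle helper, the RUN
dry-run hook, and the timeout are made up for illustration):

```shell
RUN=${RUN:-}   # set RUN=echo to only print the commands

# Create and remove a snapshot of one LV, waiting for udev to finish
# processing its event queue after each device-mapper change.
snapshot_cycle() {
    vg_lv=$1                           # e.g. hurricanevg1/scrappy-primarymaster
    snap="${vg_lv##*/}-backupsnap"
    $RUN lvcreate -L 2G -n "$snap" -s "$vg_lv"
    $RUN udevadm settle --timeout=30   # let udev finish creating /dev nodes
    $RUN lvremove -f "${vg_lv%%/*}/$snap"
    $RUN udevadm settle --timeout=30   # and finish tearing them down
}
```

This wouldn't fix a kernel bug, but it would at least tell us whether
overlapping udev event handling is part of the trigger.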

> Okay. I guess that won't be simple to repro. I wonder what you are
> running in dom0. Distro and version, what you upgraded and what not,
> any customized software builds etc.

      I'm running Debian Squeeze (testing) and have included a full list
of installed packages (dpkg -l) in the text file referenced in some of
my previous e-mails, here:
http://www.pridelands.org/~simba/hurricane-server.txt

      I've also included the output of "ps -eH -owchan,nwchan,cmd" during
normal operations (not yet in the "crashed" state).

      I don't recall running any customized software builds on dom0.
It's a fairly bog standard Debian installation.  If I'm going to do
anything customized, I usually do it on a domU.

> Given the rate at which you reproduce this and because only the
> snapshots seem to trigger the problem, to me this looks more like an
> LVM/DM issue than pvops specific.

      That has crossed my mind.  The only reason that I suspected
anything to do with Xen or pvops was that it only seems to happen when
creating/removing a snapshot of an active, running HVM.  I can create
and remove snapshots of other volumes all day and not trigger the bug
(tested yesterday).  It would probably be impossible to trigger the bug
on a baremetal machine that's not running a hypervisor.

> Also, it might be worth trying to turn off udev and see whether that
> changes anything.

      I'm going to try to reproduce it on another, less critical machine
today, so I can poke at it a little more.  I'll let you know what I find.

-- 
Scott Garron


* Re: Making snapshot of logical volumes handling HVM domU causes OOPS and instability
  2010-08-31 18:06         ` Scott Garron
@ 2010-09-03  8:06           ` Scott Garron
  2010-09-12  9:41             ` J. Roeleveld
       [not found]           ` <4C80ABA6.6000203@pridelands.org>
  1 sibling, 1 reply; 19+ messages in thread
From: Scott Garron @ 2010-09-03  8:06 UTC (permalink / raw)
  To: xen-devel

On 8/31/2010 2:06 PM, Scott Garron wrote:
> I'm going to try to reproduce it on another, less critical machine
> today, so I can poke at it a little more. I'll let you know what I
> find.

      To try to replicate my server environment as close as possible, I
installed, onto my desktop machine, the same version of Xen, the same
version of the Linux paravirt dom0 kernel, and four virtual machines:  1
64bit HVM, 1 32bit HVM, 1 64bit paravirt, and 1 32bit paravirt.

      My desktop machine has "similar" architecture in that it's AMD (but
it's Athlon64 X2 5000+, not Opteron 1212) and I have not yet been able
to trigger the bug.  I ran into a different problem in which both the
dom0 console and HVM domUs would periodically hang for several seconds
and then return as if nothing was wrong.  That happened every minute or
so and was really annoying, but I ended up fixing it by unsetting
CONFIG_NO_HZ in the kernel, and everything ran pretty smoothly after that.

      I went ahead and unset some other kernel options, too - mostly
things that were listed as "Experimental" or "If you don't know what
this is, say N" and such.  It ran the entire day, and I set up a while
true; do lvcreate ; sleep 2 ; lvremove ; sleep 2 ; done kind of script
to just sit there and peg lvm/dm & udev for about 15-20 minutes
straight, without incident.  I'm not sure what to make of that in terms
of a conclusion, though.  It could just be slightly different
architecture or the fact that the machine has overall less RAM (4G
instead of 8G).  The distribution is the same, and all of the versions
of software are the same.  They're both dual core AMD 64bit CPUs.
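
The stress loop mentioned above was literally an inline "while true; do
lvcreate; sleep 2; lvremove; sleep 2; done".  A cleaned-up sketch of it
(the volume names, iteration count, and RUN/SLEEP hooks are
illustrative; set RUN=echo for a dry run):

```shell
RUN=${RUN:-}          # set RUN=echo to print commands instead of running them
SLEEP=${SLEEP:-2}     # pause between operations, as in the original loop
VG=hurricanevg1
LV=digit-root

# Repeatedly create and remove a snapshot to hammer lvm/dm and udev.
stress_snapshots() {
    count=$1
    i=0
    while [ "$i" -lt "$count" ]; do
        $RUN lvcreate -L 2G -n "${LV}-stresssnap" -s "$VG/$LV"
        sleep "$SLEEP"
        $RUN lvremove -f "$VG/${LV}-stresssnap"
        sleep "$SLEEP"
        i=$((i + 1))
    done
}
```

Running this for 15-20 minutes on the desktop machine never triggered
the bug, which is what makes the server-only reproduction puzzling.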

      On a hunch, I copied the kernel config from my desktop to the
server, recompiled with those options, booted into it, and tried
triggering the bug.  It took more than two tries this time around, but
it became apparent pretty quickly that things weren't quite right.
Creations and removals of snapshot volumes started causing lvm to return
"/dev/dm-63: open failed: no such device or address" and something along
the lines of (paraphrasing here) "unable to remove active logical
volume" when the snapshot wasn't mounted or active anywhere, but a few
seconds later, without changing anything, you could remove it.  udev
didn't seem to be removing the dm-?? devices from /dev, though.
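
Since the "unable to remove active logical volume" failures cleared up
on their own a few seconds later, a retry loop around lvremove would be
one crude way to keep the backup script moving.  This is a hedged
workaround sketch only; the retry count, delay, and RUN hook are
arbitrary (RUN=echo dry-runs the command, RUN=false simulates failure):

```shell
RUN=${RUN:-}          # dry-run/failure-injection hook for testing
DELAY=${DELAY:-3}     # seconds to wait between attempts

# Retry lvremove a few times, since removal that fails with "active
# logical volume" has been observed to succeed seconds later.
lvremove_retry() {
    target=$1
    attempts=0
    while [ "$attempts" -lt 5 ]; do
        if $RUN lvremove -f "$target"; then
            return 0
        fi
        attempts=$((attempts + 1))
        sleep "$DELAY"
    done
    echo "giving up on $target" >&2
    return 1
}
```

Of course this only papers over the symptom; the stale dm-?? nodes left
in /dev still point at udev not finishing its removal events.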

      After the backup script had created and deleted about 12 snapshots
or so (not necessarily ones associated with an HVM guest this time
around), I got an oops and the lvcreate command froze.  I was able to
get the output of ps -eH -owchan,nwchan,cmd this time, though:

WCHAN   WCHAN CMD
kthrea ffffff [kthreadd]
?      ffffff   [migration/0]
?      ffffff   [ksoftirqd/0]
migrat ffffff   [migration/1]
ksofti ffffff   [ksoftirqd/1]
?      ffffff   [events/0]
worker ffffff   [events/1]
worker ffffff   [khelper]
worker ffffff   [netns]
async_ ffffff   [async/mgr]
xenwat ffffff   [xenwatch]
xb_wai ffffff   [xenbus]
bdi_sy ffffff   [sync_supers]
bdi_fo ffffff   [bdi-default]
?      ffffff   [kblockd/0]
worker ffffff   [kblockd/1]
worker ffffff   [kacpid]
worker ffffff   [kacpi_notify]
worker ffffff   [kacpi_hotplug]
worker ffffff   [ata/0]
worker ffffff   [ata/1]
worker ffffff   [ata_aux]
worker ffffff   [ksuspend_usbd]
hub_th ffffff   [khubd]
serio_ ffffff   [kseriod]
worker ffffff   [rpciod/0]
worker ffffff   [rpciod/1]
kswapd ffffff   [kswapd0]
ksm_sc ffffff   [ksmd]
worker ffffff   [aio/0]
worker ffffff   [aio/1]
worker ffffff   [nfsiod]
worker ffffff   [crypto/0]
worker ffffff   [crypto/1]
khvcd  ffffff   [khvcd]
scsi_e ffffff   [scsi_eh_0]
scsi_e ffffff   [scsi_eh_1]
scsi_e ffffff   [scsi_eh_2]
scsi_e ffffff   [scsi_eh_3]
scsi_e ffffff   [scsi_eh_4]
scsi_e ffffff   [scsi_eh_5]
scsi_e ffffff   [scsi_eh_6]
scsi_e ffffff   [scsi_eh_7]
worker ffffff   [kpsmoused]
worker ffffff   [kstriped]
worker ffffff   [kondemand/0]
worker ffffff   [kondemand/1]
worker ffffff   [usbhid_resumer]
md_thr ffffff   [md0_raid1]
md_thr ffffff   [md1_raid1]
worker ffffff   [kdmflush]
worker ffffff   [reiserfs/0]
worker ffffff   [reiserfs/1]
worker ffffff   [kdmflush]
worker ffffff   [kdmflush]
worker ffffff   [kdmflush]
worker ffffff   [kdmflush]
worker ffffff   [kdmflush]
worker ffffff   [kdmflush]
worker ffffff   [kdmflush]
worker ffffff   [kdmflush]
worker ffffff   [kdmflush]
worker ffffff   [kdmflush]
worker ffffff   [kdmflush]
worker ffffff   [kdmflush]
worker ffffff   [kdmflush]
worker ffffff   [kdmflush]
worker ffffff   [kdmflush]
worker ffffff   [kdmflush]
worker ffffff   [kdmflush]
worker ffffff   [kdmflush]
worker ffffff   [kdmflush]
worker ffffff   [kdmflush]
worker ffffff   [kdmflush]
worker ffffff   [kdmflush]
worker ffffff   [kdmflush]
worker ffffff   [kdmflush]
worker ffffff   [kdmflush]
worker ffffff   [kdmflush]
worker ffffff   [kdmflush]
worker ffffff   [kdmflush]
worker ffffff   [kdmflush]
worker ffffff   [kdmflush]
worker ffffff   [kdmflush]
worker ffffff   [kdmflush]
worker ffffff   [kdmflush]
worker ffffff   [kdmflush]
worker ffffff   [kdmflush]
worker ffffff   [kdmflush]
worker ffffff   [kdmflush]
worker ffffff   [kdmflush]
worker ffffff   [kdmflush]
worker ffffff   [kdmflush]
worker ffffff   [kdmflush]
worker ffffff   [kdmflush]
worker ffffff   [kdmflush]
worker ffffff   [kdmflush]
worker ffffff   [kdmflush]
worker ffffff   [kdmflush]
worker ffffff   [kdmflush]
worker ffffff   [kdmflush]
worker ffffff   [kdmflush]
worker ffffff   [kdmflush]
worker ffffff   [kdmflush]
worker ffffff   [kdmflush]
worker ffffff   [kdmflush]
worker ffffff   [kdmflush]
worker ffffff   [kdmflush]
worker ffffff   [kdmflush]
bdi_wr ffffff   [flush-253:39]
svc_re ffffff   [lockd]
worker ffffff   [nfsd4]
svc_re ffffff   [nfsd]
svc_re ffffff   [nfsd]
svc_re ffffff   [nfsd]
svc_re ffffff   [nfsd]
svc_re ffffff   [nfsd]
svc_re ffffff   [nfsd]
svc_re ffffff   [nfsd]
svc_re ffffff   [nfsd]
blkif_ ffffff   [blkback.1.xvda1]
blkif_ ffffff   [blkback.1.xvda2]
blkif_ ffffff   [blkback.2.xvda1]
blkif_ ffffff   [blkback.2.xvda2]
blkif_ ffffff   [blkback.2.xvdb1]
blkif_ ffffff   [blkback.3.xvda1]
blkif_ ffffff   [blkback.3.xvda2]
loop_t ffffff   [loop0]
bdi_wr ffffff   [flush-253:40]
blkif_ ffffff   [blkback.5.xvda1]
blkif_ ffffff   [blkback.5.xvda2]
blkif_ ffffff   [blkback.5.xvda3]
blkif_ ffffff   [blkback.5.xvdb1]
blkif_ ffffff   [blkback.5.xvdb2]
blkif_ ffffff   [blkback.6.xvda1]
blkif_ ffffff   [blkback.6.xvda2]
blkif_ ffffff   [blkback.6.xvda3]
loop_t ffffff   [loop1]
loop_t ffffff   [loop2]
bdi_wr ffffff   [flush-253:48]
blkif_ ffffff   [blkback.9.xvda1]
blkif_ ffffff   [blkback.9.xvda2]
blkif_ ffffff   [blkback.10.xvda]
blkif_ ffffff   [blkback.10.xvda]
worker ffffff   [ksnapd]
poll_s ffffff init [2]
poll_s ffffff   /sbin/portmap
poll_s ffffff   /sbin/rpc.statd
epoll_ ffffff   /usr/sbin/rpc.idmapd
sync_p ffffff   /sbin/syslogd
hrtime ffffff   /usr/sbin/nmbd -D
poll_s ffffff   /usr/sbin/acpid
sync_p ffffff   /usr/sbin/rpc.mountd --manage-gids
poll_s ffffff   /usr/sbin/smbd -D
poll_s ffffff     /usr/sbin/smbd -D
poll_s ffffff   /usr/sbin/apache2 -k start
skb_re ffffff     /usr/sbin/apache2 -k start
pipe_w ffffff     /usr/sbin/apache2 -k start
pipe_w ffffff     /usr/sbin/apache2 -k start
unix_w ffffff   /sbin/klogd -x
poll_s ffffff   /usr/bin/dbus-daemon --system
poll_s ffffff   /sbin/mdadm --monitor --pid-file
/var/run/mdadm/monitor.pid --daemonise --scan --syslog
poll_s ffffff   /usr/sbin/pptpd
poll_s ffffff     /usr/sbin/bcrelay -i xenbr1 -o ppp[0-9].* -n
poll_s ffffff   /usr/sbin/ntpd -p /var/run/ntpd.pid -g -u 110:110
hrtime ffffff   /usr/sbin/cron
sync_p ffffff     /USR/SBIN/CRON
sync_p ffffff     /USR/SBIN/CRON
sync_p ffffff     /USR/SBIN/CRON
sync_p ffffff     /USR/SBIN/CRON
sync_p ffffff     /USR/SBIN/CRON
pipe_w ffffff   /usr/sbin/radvd -u radvd -p /var/run/radvd/radvd.pid
poll_s ffffff   /usr/sbin/radvd -u radvd -p /var/run/radvd/radvd.pid
unix_w ffffff   /usr/sbin/snmpd -Ls6d -Lf /dev/null -u snmp -g snmp -I
-smux -p /var/run/snmpd.pid 192.168.1.4
poll_s ffffff   /usr/bin/python /usr/bin/fail2ban-server -b -s
/var/run/fail2ban/fail2ban.sock
epoll_ ffffff   /usr/lib/postfix/master
epoll_ ffffff     qmgr -l -t fifo -u
epoll_ ffffff     pickup -l -t fifo -u -c
n_tty_ ffffff   /sbin/getty 38400 tty2
n_tty_ ffffff   /sbin/getty 38400 tty3
n_tty_ ffffff   /sbin/getty 38400 tty4
n_tty_ ffffff   /sbin/getty 38400 tty5
n_tty_ ffffff   /sbin/getty 38400 tty6
poll_s ffffff   /usr/sbin/console-kit-daemon --no-daemon
poll_s ffffff   /usr/sbin/sshd
unix_s ffffff     sshd: simba [priv]
poll_s ffffff       sshd: simba@pts/10
wait    532bd         -bash
wait   ffffff           su -
wait   ffffff             -su
wait   ffffff               /bin/bash ./backuplv hurricanevg1/digit-root
blockd ffffff                 lvcreate -p r -L 2048M -n
digit-root-backupsnap -s hurricanevg1/digit-root
unix_s ffffff     sshd: simba [priv]
poll_s ffffff       sshd: simba@pts/11
wait    532bd         -bash
-           -           ps -eH -owchan,nwchan,cmd
poll_s ffffff   xenstored --pid-file /var/run/xenstore.pid
poll_s ffffff   xenconsoled
wait   ffffff   /usr/bin/python /usr/sbin/xend start
poll_s ffffff     /usr/bin/python /usr/sbin/xend start
poll_s ffffff       /usr/lib/xen/bin/qemu-dm -d 4 -domain-name orko
-videoram 4 -vnc 192.168.0.90:2,password -vncunused -vcpus 1 -vcpu_avail
0x1 -boot c -serial pty -acpi -net
nic,vlan=1,macaddr=00:16:3e:00:00:12,model=rtl8139 -net
tap,vlan=1,ifname=tap4.0,bridge=eth0 -net
nic,vlan=2,macaddr=00:16:3e:00:00:13,model=rtl8139 -net
tap,vlan=2,ifname=tap4.1,bridge=xenbr1 -M xenfv
poll_s ffffff       /usr/lib/xen/bin/qemu-dm -d 7 -domain-name snarf
-videoram 4 -vnc 192.168.0.90:4,password -vncunused -vcpus 1 -vcpu_avail
0x1 -boot c -localtime -serial pty -acpi -net
nic,vlan=1,macaddr=00:16:3e:00:00:1B,model=rtl8139 -net
tap,vlan=1,ifname=tap7.0,bridge=eth0 -net
nic,vlan=2,macaddr=00:16:3e:00:00:1C,model=rtl8139 -net
tap,vlan=2,ifname=tap7.1,bridge=xenbr1 -net
nic,vlan=3,macaddr=00:16:3e:00:00:1D,model=rtl8139 -net
tap,vlan=3,ifname=tap7.2,bridge=xenbr2 -M xenfv
poll_s ffffff       /usr/lib/xen/bin/qemu-dm -d 8 -domain-name scrappy
-videoram 4 -vnc 192.168.0.90:3,password -vncunused -vcpus 1 -vcpu_avail
0x1 -boot c -localtime -serial pty -acpi -net
nic,vlan=1,macaddr=00:16:3e:00:00:18,model=rtl8139 -net
tap,vlan=1,ifname=tap8.0,bridge=eth0 -net
nic,vlan=2,macaddr=00:16:3e:00:00:19,model=rtl8139 -net
tap,vlan=2,ifname=tap8.1,bridge=xenbr1 -net
nic,vlan=3,macaddr=00:16:3e:00:00:1A,model=rtl8139 -net
tap,vlan=3,ifname=tap8.2,bridge=xenbr2 -M xenfv
n_tty_ ffffff   /sbin/getty 38400 tty1
unix_w ffffff   udevd --daemon
poll_s ffffff     udevd --daemon
poll_s ffffff     udevd --daemon
poll_s ffffff     udevd --daemon
poll_s ffffff     udevd --daemon
poll_s ffffff     udevd --daemon
exit   ffffff     [udevd] <defunct>
poll_s ffffff     udevd --daemon
exit   ffffff     [udevd] <defunct>
poll_s ffffff     udevd --daemon
poll_s ffffff     udevd --daemon
exit   ffffff     [udevd] <defunct>
poll_s ffffff     udevd --daemon
poll_s ffffff     udevd --daemon
poll_s ffffff     udevd --daemon
poll_s ffffff     udevd --daemon
exit   ffffff     [udevd] <defunct>
exit   ffffff     [udevd] <defunct>
poll_s ffffff     udevd --daemon
poll_s ffffff     udevd --daemon
poll_s ffffff     udevd --daemon
exit   ffffff     [udevd] <defunct>
poll_s ffffff     udevd --daemon
exit   ffffff     [udevd] <defunct>
poll_s ffffff     udevd --daemon
poll_s ffffff     udevd --daemon
exit   ffffff     [udevd] <defunct>
exit   ffffff     [udevd] <defunct>
poll_s ffffff     udevd --daemon
poll_s ffffff     udevd --daemon
exit   ffffff     [udevd] <defunct>
exit   ffffff     [udevd] <defunct>
exit   ffffff     [udevd] <defunct>
poll_s ffffff     udevd --daemon
poll_s ffffff     udevd --daemon
poll_s ffffff     udevd --daemon
poll_s ffffff     udevd --daemon
poll_s ffffff     udevd --daemon
poll_s ffffff     udevd --daemon
poll_s ffffff     udevd --daemon
poll_s ffffff     udevd --daemon
poll_s ffffff     udevd --daemon
poll_s ffffff     udevd --daemon
exit   ffffff     [udevd] <defunct>
poll_s ffffff     udevd --daemon
poll_s ffffff     udevd --daemon
poll_s ffffff     udevd --daemon
poll_s ffffff     udevd --daemon
poll_s ffffff     udevd --daemon
poll_s ffffff     udevd --daemon
poll_s ffffff     udevd --daemon
poll_s ffffff     udevd --daemon
poll_s ffffff     udevd --daemon
poll_s ffffff     udevd --daemon
poll_s ffffff     udevd --daemon
poll_s ffffff     udevd --daemon
poll_s ffffff     udevd --daemon
poll_s ffffff     udevd --daemon
poll_s ffffff     udevd --daemon
sync_p ffffff   /sbin/blkid -o udev -p /dev/dm-8
sync_p ffffff   /sbin/blkid -o udev -p /dev/dm-7
sync_p ffffff   /sbin/blkid -o udev -p /dev/dm-10
sync_p ffffff   /sbin/blkid -o udev -p /dev/dm-12
sync_p ffffff   /sbin/blkid -o udev -p /dev/dm-15
sync_p ffffff   /sbin/blkid -o udev -p /dev/dm-16
sync_p ffffff   /sbin/blkid -o udev -p /dev/dm-19
sync_p ffffff   /sbin/blkid -o udev -p /dev/dm-18
sync_p ffffff   /sbin/blkid -o udev -p /dev/dm-23
sync_p ffffff   /sbin/blkid -o udev -p /dev/dm-22
sync_p ffffff   /sbin/blkid -o udev -p /dev/dm-20
sync_p ffffff   /sbin/blkid -o udev -p /dev/dm-21
?      ffffff   [udevd]
sync_p ffffff   /lib/udev/udisks-part-id /dev/dm-4

########################

      And the oops looks different this time around as well:

[ 6791.053986] ------------[ cut here ]------------
[ 6791.054160] kernel BUG at arch/x86/xen/mmu.c:1649!
[ 6791.054418] invalid opcode: 0000 [#1] SMP
[ 6791.054592] last sysfs file: /sys/devices/virtual/block/dm-1/removable
[ 6791.054761] CPU 0
[ 6791.054923] Modules linked in: dm_snapshot tun fuse xt_multiport
nf_nat_tftp nf_conntrack_tftp nf_nat_pptp nf_conntrack_pptp
nf_conntrack_proto_gre nf_nat_proto_gre ntfs parport_pc parport k8temp
floppy forcedeth [last unloaded: scsi_wait_scan]
[ 6791.055653] Pid: 8696, comm: udevd Tainted: G        W  2.6.32.18 #2
H8SMI
[ 6791.055828] RIP: e030:[<ffffffff8100cc33>]  [<ffffffff8100cc33>]
pin_pagetable_pfn+0x31/0x37
[ 6791.056010] RSP: e02b:ffff88001242fdb8  EFLAGS: 00010282
[ 6791.056010] RAX: 00000000ffffffea RBX: 000000000002af28 RCX:
00003ffffffff000
[ 6791.056010] RDX: 0000000000000000 RSI: 0000000000000001 RDI:
ffff88001242fdb8
[ 6791.056010] RBP: ffff88001242fdd8 R08: 00003ffffffff000 R09:
ffff880000000ab8
[ 6791.056010] R10: 0000000000007ff0 R11: 000000000001b4fe R12:
0000000000000003
[ 6791.056010] R13: ffff880001d03010 R14: ffff88001a8e88f0 R15:
ffff880027f50000
[ 6791.056010] FS:  00007fdb8bfd57a0(0000) GS:ffff880002d6e000(0000)
knlGS:0000000000000000
[ 6791.056010] CS:  e033 DS: 0000 ES: 0000 CR0: 000000008005003b
[ 6791.056010] CR2: 0000000000413e41 CR3: 000000001a84c000 CR4:
0000000000000660
[ 6791.056010] DR0: 0000000000000000 DR1: 0000000000000000 DR2:
0000000000000000
[ 6791.056010] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7:
0000000000000400
[ 6791.056010] Process udevd (pid: 8696, threadinfo ffff88001242e000,
task ffff880027f50000)
[ 6791.056010] Stack:
[ 6791.056010]  ffff880000000000 000000000016f22f 0000000000000010
000000000002af28
[ 6791.056010] <0> ffff88001242fdf8 ffffffff8100e515 ffff8800125a6680
000000000002af28
[ 6791.056010] <0> ffff88001242fe08 ffffffff8100e548 ffff88001242fe48
ffffffff810c8ab2
[ 6791.056010] Call Trace:
[ 6791.056010]  [<ffffffff8100e515>] xen_alloc_ptpage+0x66/0x6b
[ 6791.056010]  [<ffffffff8100e548>] xen_alloc_pte+0xe/0x10
[ 6791.056010]  [<ffffffff810c8ab2>] __pte_alloc+0x7e/0xf8
[ 6791.056010]  [<ffffffff810cae78>] handle_mm_fault+0xbb/0x7cb
[ 6791.056010]  [<ffffffff81582f75>] ? page_fault+0x25/0x30
[ 6791.056010]  [<ffffffff810381d1>] do_page_fault+0x273/0x28b
[ 6791.056010]  [<ffffffff81582f75>] page_fault+0x25/0x30
[ 6791.056010] Code: ec 20 89 7d e0 48 89 f7 e8 c9 ff ff ff 48 8d 7d e0
48 89 45 e8 be 01 00 00 00 31 d2 41 ba f0 7f 00 00 e8 11 c7 ff ff 85 c0
74 04 <0f> 0b eb fe c9 c3 55 48 89 f8 a8 01 48 89 e5 53 74 21 48 bb ff
[ 6791.056010] RIP  [<ffffffff8100cc33>] pin_pagetable_pfn+0x31/0x37
[ 6791.056010]  RSP <ffff88001242fdb8>
[ 6791.056010] ---[ end trace 4eaa2a86a8e2da24 ]---

#################

      Some other things that I noticed...  During boot, there were
several messages that looked like this:

udevd: worker did not accept message -1 (Connection refused) kill it

(I may be slightly paraphrasing that)

and this "WARNING" also appears:

[    0.004000] CPU: Physical Processor ID: 0
[    0.004000] CPU: Processor Core ID: 0
[    0.004015] mce: CPU supports 5 MCE banks
[    0.004231] Performance Events: AMD PMU driver.
[    0.004450] ------------[ cut here ]------------
[    0.004644] WARNING: at arch/x86/xen/enlighten.c:742
xen_apic_write+0x15/0x17()
[    0.004990] Hardware name: H8SMI
[    0.005176] Modules linked in:
[    0.005391] Pid: 0, comm: swapper Not tainted 2.6.32.18 #2
[    0.005581] Call Trace:
[    0.005771]  [<ffffffff810504df>] warn_slowpath_common+0x77/0x8f
[    0.005965]  [<ffffffff81050506>] warn_slowpath_null+0xf/0x11
[    0.006157]  [<ffffffff8100b30b>] xen_apic_write+0x15/0x17
[    0.006350]  [<ffffffff8101f0d6>] perf_events_lapic_init+0x2e/0x30
[    0.006545]  [<ffffffff8193eae0>] init_hw_perf_events+0x33e/0x3db
[    0.006740]  [<ffffffff8193e714>] identify_boot_cpu+0x3c/0x3e
[    0.006932]  [<ffffffff8193e77e>] check_bugs+0x9/0x2d
[    0.007125]  [<ffffffff81935d1d>] start_kernel+0x3ae/0x3c3
[    0.007318]  [<ffffffff819352c1>] x86_64_start_reservations+0xac/0xb0
[    0.007513]  [<ffffffff81939184>] xen_start_kernel+0x643/0x64a
[    0.007710] ---[ end trace 4eaa2a86a8e2da22 ]---
[    0.007900] ... version:                0
[    0.008000] ... bit width:              48
[    0.008000] ... generic registers:      4
[    0.008000] ... value mask:             0000ffffffffffff
[    0.008000] ... max period:             00007fffffffffff
[    0.008000] ... fixed-purpose events:   0
[    0.008000] ... event mask:             000000000000000f
[    0.008000] SMP alternatives: switching to UP code
[    0.008000] ACPI: Core revision 20090903

      Any ideas, or does this look more like a bug with LVM/DM?

( I've also tacked this new information, including the new kernel
configuration onto the text file at:
http://www.pridelands.org/~simba/hurricane-server.txt )

      I haven't tried disabling udev yet, but to be honest, I'm not even
sure how to pull that off without really breaking things.  Can I create
and remove snapshots and logical volumes without udev on a system that's
already kinda reliant on udev?

      This post (and subsequent thread), made today, seems to be eerily
similar to the problem I'm experiencing.  I'm wondering if they're related.

http://lists.xensource.com/archives/html/xen-devel/2010-09/msg00169.html

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: Making snapshot of logical volumes handling HVM domU causes OOPS and instability
       [not found]           ` <4C80ABA6.6000203@pridelands.org>
@ 2010-09-03 15:40             ` Jeremy Fitzhardinge
  2010-09-11 19:16               ` Scott Garron
  0 siblings, 1 reply; 19+ messages in thread
From: Jeremy Fitzhardinge @ 2010-09-03 15:40 UTC (permalink / raw)
  To: Scott Garron; +Cc: Xu, Dongxiao, xen-devel, Daniel Stodden

 On 09/03/2010 01:02 AM, Scott Garron wrote:
> On 8/31/2010 2:06 PM, Scott Garron wrote:
>> I'm going to try to reproduce it on another, less critical machine
>> today, so I can poke at it a little more. I'll let you know what I
>> find.
>
>      To try to replicate my server environment as close as possible, I
> installed, onto my desktop machine, the same version of Xen, the same
> version of the Linux paravirt dom0 kernel, and four virtual machines:  1
> 64bit HVM, 1 32bit HVM, 1 64bit paravirt, and 1 32bit paravirt.
>
>      My desktop machine has "similar" architecture in that it's AMD (but
> it's Athlon64 X2 5000+, not Opteron 1212) and I have not yet been able
> to trigger the bug.  I ran into a different problem in which both the
> dom0 console and HVM domUs would periodically hang for several seconds
> and then return as if nothing was wrong.  That happened every minute or
> so and was really annoying, but I ended up fixing it by unsetting
> CONFIG_NO_HZ in the kernel, and everything ran pretty smoothly after
> that.

What kernel is this?  This sounds like a symptom of the sched_clock
problem I fixed a few weeks ago.

>      I went ahead and unset some other kernel options, too - mostly
> things that were listed as "Experimental" or "If you don't know what
> this is, say N" and such.  It ran the entire day, and I set up a while
> true; do lvcreate ; sleep 2 ; lvremove ; sleep 2 ; done kind of script
> to just sit there and peg lvm/dm & udev for about 15-20 minutes
> straight, without incident.  I'm not sure what to make of that in terms
> of a conclusion, though.  It could just be slightly different
> architecture or the fact that the machine has overall less RAM (4G
> instead of 8G).  The distribution is the same, and all of the versions
> of software are the same.  They're both dual core AMD 64bit CPUs.
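
For reference, a dry-run sketch of the kind of snapshot-churn loop
described above (the volume group, LV name, and sizes are placeholders,
not taken from the thread):

```shell
#!/bin/sh
# Snapshot churn stress loop (sketch).  Runs in dry-run mode by default
# (DRY_RUN=1), so executing it as-is only prints the commands it would run.
VG="${VG:-vg0}"; LV="${LV:-guest-disk}"
run() { if [ "${DRY_RUN:-1}" = 1 ]; then echo "+ $*"; else "$@"; fi; }
i=0
while [ "$i" -lt "${ITERATIONS:-2}" ]; do
    run lvcreate -s -L 1G -n "${LV}-snap" "/dev/$VG/$LV"   # create snapshot
    run sleep 2
    run lvremove -f "/dev/$VG/${LV}-snap"                  # remove it again
    run sleep 2
    i=$((i + 1))
done
echo "completed $i iterations"
```

Setting DRY_RUN=0 against a scratch volume group reproduces the
lvcreate/lvremove hammering without risking a production VG.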

The RAM difference could be a significant factor.  If you have less than
4G then all pages are guaranteed to be directly accessible with 32-bit
pointers and 32-bit devices, whereas with more than 4G you need to deal
with the case where the kernel thinks a page is below 4G (=DMA
accessible by 32-bit device) but it is actually physically resident above 4G.

I don't know if that's a specific factor in this case, but the error you
got suggested something very strange going on with unusual memory mappings.

>      On a hunch, I copied the kernel config from my desktop to the
> server, recompiled with those options, booted into it, and tried
> triggering the bug.  It took more than two tries this time around, but
> it became apparent pretty quickly that things weren't quite right.
> Creations and removals of snapshot volumes started causing lvm to return
> "/dev/dm-63: open failed: no such device or address" and something along
> the lines of (paraphrasing here) "unable to remove active logical
> volume" when the snapshot wasn't mounted or active anywhere, but a few
> seconds later, without changing anything, you could remove it.  udev
> didn't seem to be removing the dm-?? devices from /dev, though.

What happens if you boot that system with "mem=4G" on the Xen command line?
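
For anyone unsure where that option goes: mem= is a hypervisor argument,
so it belongs on the xen.gz line in GRUB, not on the dom0 kernel line.
A sketch (paths and version numbers here are illustrative, not from this
thread):

```
title Xen (dom0 limited to 4G)
    root (hd0,0)
    kernel /boot/xen.gz mem=4G
    module /boot/vmlinuz-2.6.32.18 root=/dev/sda1 ro console=hvc0
    module /boot/initrd.img-2.6.32.18
```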

[...]

>
>      And the oops looks different this time around as well:
>
> [ 6791.053986] ------------[ cut here ]------------
> [ 6791.054160] kernel BUG at arch/x86/xen/mmu.c:1649!

So it has just allocated a new page to include in a pagetable, but it is
failing to pin it.  That suggests that there's another mapping of that
page somewhere which is preventing the pin.

This means that something is leaving stray mappings of pages around
somewhere.  We already deal with the standard mechanisms for doing this,
but perhaps LVM is keeping a private cache of mappings off to one side. 
But I'm surprised we haven't seen anything like this before, given the
widespread use of LVM.

> [ 6791.054418] invalid opcode: 0000 [#1] SMP
> [ 6791.054592] last sysfs file: /sys/devices/virtual/block/dm-1/removable
> [ 6791.054761] CPU 0
> [ 6791.054923] Modules linked in: dm_snapshot tun fuse xt_multiport
> nf_nat_tftp nf_conntrack_tftp nf_nat_pptp nf_conntrack_pptp
> nf_conntrack_proto_gre nf_nat_proto_gre ntfs parport_pc parport k8temp
> floppy forcedeth [last unloaded: scsi_wait_scan]
> [ 6791.055653] Pid: 8696, comm: udevd Tainted: G        W  2.6.32.18 #2
> H8SMI
> [ 6791.055828] RIP: e030:[<ffffffff8100cc33>]  [<ffffffff8100cc33>]
> pin_pagetable_pfn+0x31/0x37
> [ 6791.056010] RSP: e02b:ffff88001242fdb8  EFLAGS: 00010282
> [ 6791.056010] RAX: 00000000ffffffea RBX: 000000000002af28 RCX:
> 00003ffffffff000
> [ 6791.056010] RDX: 0000000000000000 RSI: 0000000000000001 RDI:
> ffff88001242fdb8
> [ 6791.056010] RBP: ffff88001242fdd8 R08: 00003ffffffff000 R09:
> ffff880000000ab8
> [ 6791.056010] R10: 0000000000007ff0 R11: 000000000001b4fe R12:
> 0000000000000003
> [ 6791.056010] R13: ffff880001d03010 R14: ffff88001a8e88f0 R15:
> ffff880027f50000
> [ 6791.056010] FS:  00007fdb8bfd57a0(0000) GS:ffff880002d6e000(0000)
> knlGS:0000000000000000
> [ 6791.056010] CS:  e033 DS: 0000 ES: 0000 CR0: 000000008005003b
> [ 6791.056010] CR2: 0000000000413e41 CR3: 000000001a84c000 CR4:
> 0000000000000660
> [ 6791.056010] DR0: 0000000000000000 DR1: 0000000000000000 DR2:
> 0000000000000000
> [ 6791.056010] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7:
> 0000000000000400
> [ 6791.056010] Process udevd (pid: 8696, threadinfo ffff88001242e000,
> task ffff880027f50000)
> [ 6791.056010] Stack:
> [ 6791.056010]  ffff880000000000 000000000016f22f 0000000000000010
> 000000000002af28
> [ 6791.056010] <0> ffff88001242fdf8 ffffffff8100e515 ffff8800125a6680
> 000000000002af28
> [ 6791.056010] <0> ffff88001242fe08 ffffffff8100e548 ffff88001242fe48
> ffffffff810c8ab2
> [ 6791.056010] Call Trace:
> [ 6791.056010]  [<ffffffff8100e515>] xen_alloc_ptpage+0x66/0x6b
> [ 6791.056010]  [<ffffffff8100e548>] xen_alloc_pte+0xe/0x10
> [ 6791.056010]  [<ffffffff810c8ab2>] __pte_alloc+0x7e/0xf8
> [ 6791.056010]  [<ffffffff810cae78>] handle_mm_fault+0xbb/0x7cb
> [ 6791.056010]  [<ffffffff81582f75>] ? page_fault+0x25/0x30
> [ 6791.056010]  [<ffffffff810381d1>] do_page_fault+0x273/0x28b
> [ 6791.056010]  [<ffffffff81582f75>] page_fault+0x25/0x30
> [ 6791.056010] Code: ec 20 89 7d e0 48 89 f7 e8 c9 ff ff ff 48 8d 7d e0
> 48 89 45 e8 be 01 00 00 00 31 d2 41 ba f0 7f 00 00 e8 11 c7 ff ff 85 c0
> 74 04 <0f> 0b eb fe c9 c3 55 48 89 f8 a8 01 48 89 e5 53 74 21 48 bb ff
> [ 6791.056010] RIP  [<ffffffff8100cc33>] pin_pagetable_pfn+0x31/0x37
> [ 6791.056010]  RSP <ffff88001242fdb8>
> [ 6791.056010] ---[ end trace 4eaa2a86a8e2da24 ]---
>
> #################
>
>      Some other things that I noticed...  During boot, there were
> several messages that looked like this:
>
> udevd: worker did not accept message -1 (Connection refused) kill it

Are they atypical?

>
> (I may be slightly paraphrasing that)
>
> and this "WARNING" also appears:
>
> [    0.004000] CPU: Physical Processor ID: 0
> [    0.004000] CPU: Processor Core ID: 0
> [    0.004015] mce: CPU supports 5 MCE banks
> [    0.004231] Performance Events: AMD PMU driver.
> [    0.004450] ------------[ cut here ]------------
> [    0.004644] WARNING: at arch/x86/xen/enlighten.c:742
> xen_apic_write+0x15/0x17()

That's not a big concern. It's the AMD perf counter driver trying to
access the registers which Xen doesn't allow it to access.

>      Any ideas, or does this look more like a bug with LVM/DM?

Possibly some unexpected Xen/LVM interaction rather than an outright bug.

>
> ( I've also tacked this new information, including the new kernel
> configuration onto the text file at:
> http://www.pridelands.org/~simba/hurricane-server.txt )
>
>      I haven't tried disabling udev yet, but to be honest, I'm not even
> sure how to pull that off without really breaking things.  Can I create
> and remove snapshots and logical volumes without udev on a system that's
> already kinda reliant on udev?

I think udev is the victim here, not the culprit.

>
>      This post (and subsequent thread), made today, seems to be eerily
> similar to the problem I'm experiencing.  I'm wondering if they're
> related.
>
> http://lists.xensource.com/archives/html/xen-devel/2010-09/msg00169.html

Aside from udev being involved, the symptom looks quite different.

    J


* Re: Making snapshot of logical volumes handling HVM domU causes OOPS and instability
  2010-09-03 15:40             ` Jeremy Fitzhardinge
@ 2010-09-11 19:16               ` Scott Garron
  2010-09-12  0:20                 ` Making snapshot of logical volumes handling HVM domU causes " James Harper
  0 siblings, 1 reply; 19+ messages in thread
From: Scott Garron @ 2010-09-11 19:16 UTC (permalink / raw)
  To: Jeremy Fitzhardinge; +Cc: xen-devel, Daniel Stodden

Scott Garron wrote:
>> dom0 console and HVM domUs would periodically hang for several
>> seconds and then return as if nothing was wrong.  [.snip.] I ended
>>  up fixing it by unsetting CONFIG_NO_HZ in the kernel

Jeremy Fitzhardinge wrote:
> What kernel is this?  This sounds like a symptom of the sched_clock
> problem I fixed a few weeks ago.

2.6.32.18
ref: refs/heads/xen/stable-2.6.32.x

git log shows this as the most recent commit (from Aug 30):
commit 2968b258b1ca6bd16d758dd68900669419caff2b

>> It could just be slightly different architecture or the fact that
>> the machine has overall less RAM (4G instead of 8G).
>
> What happens if you boot that system with "mem=4G"

I managed to finally be able to try this last night, and it didn't seem
to make any difference.  It did seem to last a bit longer (I had it
creating and removing snapshots every 6 seconds while the backup process
was also creating and removing them as needed, and it went along for
about 20 minutes before becoming unstable).  The OOPS message was
different than last time, but similar to the first one I sent when
reporting this.

After it crashed, I also went ahead and flashed the BIOS to the latest
version, to see if it made any difference.  After flashing, I booted
normally (without mem=4G), and got it to crash again - this time with a
similar OOPS message to the one I sent to you in my previous e-mail.
The new BIOS didn't help, obviously.  I've appended the ps -eH
-owchan,nwchan,cmd outputs and kernel OOPS messages from last night to
the end of the text file at:

http://www.pridelands.org/~simba/hurricane-server.txt

>> udevd: worker did not accept message -1 (Connection refused) kill
>
> Are they atypical?

I don't recall seeing them before, but after flashing the BIOS, they are
no longer occurring.

>> This post seems to be eerily similar to the problem I'm
>> experiencing.  http:[...]xen-devel/2010-09/msg00169.html
>
> Aside from udev being involved, the symptom looks quite different.

I suppose that's true, but he mentions in this post:

http://lists.xensource.com/archives/html/xen-devel/2010-09/msg00286.html

that lvcreate and udev are hanging while creating a snapshot volume.
That's the reason I thought it was similar.  (That, and he seems to do
backups in a similar way that I do:  creates a snapshot, makes a copy of
the snapshot [although, he block-attaches the volume to a domU to do it
whereas I just use dom0], then removes the snapshot.)

-- 
Scott Garron


* RE: Making snapshot of logical volumes handling HVM domU causes OOPS and instability
  2010-09-11 19:16               ` Scott Garron
@ 2010-09-12  0:20                 ` James Harper
  0 siblings, 0 replies; 19+ messages in thread
From: James Harper @ 2010-09-12  0:20 UTC (permalink / raw)
  To: Scott Garron, Jeremy Fitzhardinge; +Cc: xen-devel, Daniel Stodden

> >> This post seems to be eerily similar to the problem I'm
> >> experiencing.  http:[...]xen-devel/2010-09/msg00169.html
> >
> > Aside from udev being involved, the symptom looks quite different.
> 
> I suppose that's true, but he mentions in this post:
> 
>
> http://lists.xensource.com/archives/html/xen-devel/2010-09/msg00286.html
> 
> that lvcreate and udev are hanging while creating a snapshot volume.
> That's the reason I thought it was similar.  (That, and he seems to do
> backups in a similar way that I do:  creates a snapshot, makes a copy of
> the snapshot [although, he block-attaches the volume to a domU to do it
> whereas I just use dom0], then removes the snapshot.)
> 

That was me. While it doesn't rule out that the block-attach is causing
the problem, the hang is happening in the lvcreate in my script. My
other theory is that the block-detach is hanging something which means
the subsequent lvcreate can't progress so I see the hang there.

Unfortunately I don't have a machine I can burn to play with this so I
can't test it much, and it breaks the entire Dom0 so it's a bit of a big
deal.

James


* Re: Making snapshot of logical volumes handling HVM domU causes OOPS and instability
  2010-08-30 18:18   ` Scott Garron
@ 2010-09-12  9:33     ` J. Roeleveld
  0 siblings, 0 replies; 19+ messages in thread
From: J. Roeleveld @ 2010-09-12  9:33 UTC (permalink / raw)
  To: xen-devel

Hi All,

Thought I'd chip in with some info from what I experienced/noticed on my
system (I haven't seen the instability described, though).
I hope this helps with isolating the OP's issue.

On Monday 30 August 2010 20:18:08 Scott Garron wrote:
> On 08/30/2010 12:52 PM, Jeremy Fitzhardinge wrote:

<snipped udev>

> >> I think this issue [unresponsive network interfaces] also causes xm
> >> console to not allow you to type on the console
> > 
> > Hm, not familiar with this problem.  Perhaps its just something wrong
> > with your console settings for the domain?  Do you have "console=" on
> > the kernel command line?
> 
>       I have "extra = "console=hvc0"" in the domU configuration files.
> The keyboard input works just fine for some time.  It ceased accepting
> input at around the same time that the network interfaces stopped
> responding, but that could have just been coincidental.
> 
>       I wasn't paying full attention, so this may also have been related
> to me attaching to the console twice (Running xm console on one ssh
> session to the dom0 in addition to running xm console from another SSH
> session to the dom0).  When I couldn't connect directly to the domU via
> SSH on its network interface, I tried to attach to its console to do
> troubleshooting.  I may have already been attached to its console from
> another SSH session to the dom0 and I suppose that might cause a
> conflict.   ...  which begs the question:  "Is this the desired/expected
> behavior in this scenario?"

I noticed this behaviour already in older Xen-versions:
1) " xm console x "
2) (in a different shell-session) : " xm console x "

Observed situation: input and output for the xenconsole session is "weird" in
that commands entered and results returned do not show up where I expect them.

I believe this is "expected" behaviour.

--
Joost


* Re: Making snapshot of logical volumes handling HVM domU causes OOPS and instability
  2010-09-03  8:06           ` Scott Garron
@ 2010-09-12  9:41             ` J. Roeleveld
  2010-09-12 18:48               ` Scott Garron
  0 siblings, 1 reply; 19+ messages in thread
From: J. Roeleveld @ 2010-09-12  9:41 UTC (permalink / raw)
  To: xen-devel

Hi All,

Same again here - thought I'd chip in with some info from what I
experienced/noticed on my system (I haven't seen the instability described,
though).  I hope this helps with isolating the OP's issue.

On Friday 03 September 2010 10:06:45 Scott Garron wrote:
> On 8/31/2010 2:06 PM, Scott Garron wrote:

<snipped>

I also use LVMs extensively and do similar steps for backups.
1) umount in domU
2) block-detach
3) lvcreate snapshot
4) block-attach
5) mount in domU

I, however, have no need for HVM and only use PV guests.
(All Linux)
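
The five steps above, expressed as a script skeleton (the domain, device,
and volume names are hypothetical; with ECHO=1, the default, it only
prints the commands rather than executing them):

```shell
#!/bin/sh
# Sketch of the umount/detach/snapshot/attach/mount backup cycle.
# All names are placeholders; ECHO=1 (default) prints instead of executing.
DOMU="${DOMU:-guest1}"; VG="${VG:-vg0}"; LV="${LV:-guest1-data}"; DEV="${DEV:-xvdb}"
run() { if [ "${ECHO:-1}" = 1 ]; then echo "+ $*"; else "$@"; fi; }
run ssh "root@$DOMU" "umount /data"                       # 1) umount in domU
run xm block-detach "$DOMU" "$DEV"                        # 2) block-detach
run lvcreate -s -L 2G -n "${LV}-snap" "/dev/$VG/$LV"      # 3) lvcreate snapshot
run xm block-attach "$DOMU" "phy:/dev/$VG/$LV" "$DEV" w   # 4) block-attach
run ssh "root@$DOMU" "mount /dev/$DEV /data"              # 5) mount in domU
```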

>       On a hunch, I copied the kernel config from my desktop to the
> server, recompiled with those options, booted into it, and tried
> triggering the bug.  It took more than two tries this time around, but
> it became apparent pretty quickly that things weren't quite right.
> Creations and removals of snapshot volumes started causing lvm to return
> "/dev/dm-63: open failed: no such device or address" and something along
> the lines of (paraphrasing here) "unable to remove active logical
> volume" when the snapshot wasn't mounted or active anywhere, but a few
> seconds later, without changing anything, you could remove it.  udev
> didn't seem to be removing the dm-?? devices from /dev, though.

I also, on occasion, get the same issue with the "unable to remove active
logical volume" error even though the volumes have been unmounted.
Sometimes I can remove them later; sometimes I have to "force" the
snapshot to fail by filling up the snapshot myself.
When that happens, I get similar messages about "/dev/dm-63: open failed: no
such device or address".

Are you certain the snapshots are large enough to hold all possible changes 
that might occur on the LV during the existence of the snapshot?

Another thing I notice, which might be of help to people who understand this
better than I do, in my backup-script, sometimes step "5" fails because the
domU hasn't noticed the device is attached again when I try to mount it.
The domU-commands are run using SSH-connections.

--
Joost


* Re: Making snapshot of logical volumes handling HVM domU causes OOPS and instability
  2010-09-12  9:41             ` J. Roeleveld
@ 2010-09-12 18:48               ` Scott Garron
  2010-09-13  0:15                 ` Making snapshot of logical volumes handling HVM domU causes " James Harper
  2010-09-13  8:33                 ` Making snapshot of logical volumes handling HVM domU causes " J. Roeleveld
  0 siblings, 2 replies; 19+ messages in thread
From: Scott Garron @ 2010-09-12 18:48 UTC (permalink / raw)
  To: xen-devel

On 9/12/2010 5:41 AM, J. Roeleveld wrote:
> I also use LVMs extensively and do similar steps for backups.
> 1) umount in domU
> 2) block-detach
> 3) lvcreate snapshot
> 4) block-attach
> 5) mount in domU

      I think the biggest difference, here, is that you unmount and
detach the source volumes before creating the snapshot whereas I just
leave them active and mounted in the guest.  I don't know if that will
end up being the difference between stability and instability on my
system, but it's an observation and probably worth experimentation.

> I, however, have no need for HVM and only use PV guests.

      It turns out that it doesn't seem isolated to HVM guests on my
system any longer.  That was just coincidental during the first few
crashes that I observed.

> Are you certain the snapshots are large enough to hold all possible
> changes that might occur on the LV during the existence of the
> snapshot?

      Certainly.  The most recent one to cause a crash has existed
through the crash and for 3 days now, and is only using 2.65% of its COW
space.  They usually don't get a chance to go above even 0.3% before the
rsync on them is finished and they are unmounted and removed by the
backup script.

> Another thing I notice, which might be of help to people who
> understand this better than I do, in my backup-script, sometimes step
> "5" fails because the domU hasn't noticed the device is attached
> again when I try to mount it. The domU-commands are run using
> SSH-connections.

      That probably just has to do with variations in how long it takes
the guest kernel to poll or be notified of device changes, and how long
it takes for its udev to create the device files and whatnot.
Introducing some sanity checks or just a longer delay in your backup
script would likely get around that problem.  (I could be wrong, though)

-- 
Scott Garron


* RE: Making snapshot of logical volumes handling HVM domU causes OOPS and instability
  2010-09-12 18:48               ` Scott Garron
@ 2010-09-13  0:15                 ` James Harper
  2010-09-13  8:35                   ` J. Roeleveld
  2010-09-13  8:33                 ` Making snapshot of logical volumes handling HVM domU causes " J. Roeleveld
  1 sibling, 1 reply; 19+ messages in thread
From: James Harper @ 2010-09-13  0:15 UTC (permalink / raw)
  To: Scott Garron, xen-devel

> > Another thing I notice, which might be of help to people who
> > understand this better than I do, in my backup-script, sometimes step
> > "5" fails because the domU hasn't noticed the device is attached
> > again when I try to mount it. The domU-commands are run using
> > SSH-connections.
> 
>       That probably just has to do with variations in how long it takes
> the guest kernel to poll or be notified of device changes, and how long
> it takes for its udev to create the device files and whatnot.
> Introducing some sanity checks or just a longer delay in your backup
> script would likely get around that problem.  (I could be wrong, though)
> 

I found that too. I just put the mount in a loop, breaking out when the
device actually appears in the DomU. It's just a timing thing and I
don't think it's relevant to the problem at hand.
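
A minimal version of that retry loop (the device path and timeout are
illustrative; the check uses -e, though -b would be stricter by insisting
on a block device specifically):

```shell
#!/bin/sh
# Wait for a device node to appear in the domU before trying to mount it.
# Path and timeout are placeholders for whatever the backup script uses.
wait_for_dev() {
    dev="$1"; tries="${2:-30}"
    while [ "$tries" -gt 0 ]; do
        [ -e "$dev" ] && return 0   # node exists; safe to mount now
        sleep 1
        tries=$((tries - 1))
    done
    return 1                        # gave up; node never appeared
}
# Example use inside the backup script:
# wait_for_dev /dev/xvdb 30 && mount /dev/xvdb /data
```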

James


* Re: Making snapshot of logical volumes handling HVM domU causes OOPS and instability
  2010-09-12 18:48               ` Scott Garron
  2010-09-13  0:15                 ` Making snapshot of logical volumes handling HVM domU causes " James Harper
@ 2010-09-13  8:33                 ` J. Roeleveld
  1 sibling, 0 replies; 19+ messages in thread
From: J. Roeleveld @ 2010-09-13  8:33 UTC (permalink / raw)
  To: xen-devel

On Sunday 12 September 2010 20:48:09 Scott Garron wrote:
> On 9/12/2010 5:41 AM, J. Roeleveld wrote:
> > I also use LVMs extensively and do similar steps for backups.
> > 1) umount in domU
> > 2) block-detach
> > 3) lvcreate snapshot
> > 4) block-attach
> > 5) mount in domU
> 
>       I think the biggest difference, here, is that you unmount and
> detach the source volumes before creating the snapshot whereas I just
> leave them active and mounted in the guest.  I don't know if that will
> end up being the difference between stability and instability on my
> system, but it's an observation and probably worth experimentation.

I tend to umount first to ensure the filesystem is consistent and no writes are 
still left in the write-buffer on the guest.
Filesystem recoveries are fine, but why rely on them when it's not necessary? 
:)

> > I, however, have no need for HVM and only use PV guests.
> 
>       It turns out that it doesn't seem isolated to HVM guests on my
> system any longer.  That was just coincidental during the first few
> crashes that I observed.

Ok, I believe the issue might be related to the LVM-stack and the way Xen 
holds the devices locked when they are actually mounted and attached?

> > Are you certain the snapshots are large enough to hold all possible
> > changes that might occur on the LV during the existence of the
> > snapshot?
> 
>       Certainly.  The most recent one to cause a crash has existed
> through the crash and for 3 days now, and is only using 2.65% of its COW
> space.  They usually don't get a chance to go above even 0.3% before the
> rsync on them is finished and they are unmounted and removed by the
> backup script.

Ok, guess that's not the cause :)
Although, I get the "unable to remove active" error both when 0% is used and
when over 20% is used, so there is no clear indication (to me) of what is
causing it.

> > Another thing I notice, which might be of help to people who
> > understand this better than I do, in my backup-script, sometimes step
> > "5" fails because the domU hasn't noticed the device is attached
> > again when I try to mount it. The domU-commands are run using
> > SSH-connections.
> 
>       That probably just has to do with variations in how long it takes
> the guest kernel to poll or be notified of device changes, and how long
> it takes for its udev to create the device files and whatnot.
> Introducing some sanity checks or just a longer delay in your backup
> script would likely get around that problem.  (I could be wrong, though)

I do need to add some sanity checks into the script at some point, but 
currently I start these manually and 'fix' the left-overs myself.
The mount-issue is a simple one and I notice this within 30-40 seconds of the 
scripts starting.

--
Joost


* Re: Making snapshot of logical volumes handling HVM domU causes OOPS and instability
  2010-09-13  0:15                 ` Making snapshot of logical volumes handling HVM domUcauses " James Harper
@ 2010-09-13  8:35                   ` J. Roeleveld
  0 siblings, 0 replies; 19+ messages in thread
From: J. Roeleveld @ 2010-09-13  8:35 UTC (permalink / raw)
  To: xen-devel

On Monday 13 September 2010 02:15:28 James Harper wrote:
> > > Another thing I notice, which might be of help to people who
> > > understand this better than I do, in my backup-script, sometimes step
> > > "5" fails because the domU hasn't noticed the device is attached
> > > again when I try to mount it. The domU-commands are run using
> > > SSH-connections.
> > > 
> >       That probably just has to do with variations in how long it takes
> > the guest kernel to poll or be notified of device changes, and how long
> > it takes for its udev to create the device files and whatnot.
> > Introducing some sanity checks or just a longer delay in your backup
> > script would likely get around that problem.  (I could be wrong, though)
> 
> I found that too. I just put the mount in a loop, breaking out when the
> device actually appears in the DomU. It's just a timing thing and I
> don't think it's relevant to the problem at hand.
> 
> James

I agree, but I thought I'd mention it here just in case it could have been
relevant.
I am not familiar enough with all the internals to be able to determine what
we can and cannot exclude.
A comparison with a system that doesn't crash is always useful, in my
experience :)

--
Joost


Thread overview: 19+ messages
2010-08-28  1:22 Making snapshot of logical volumes handling HVM domU causes OOPS and instability Scott Garron
2010-08-30 16:52 ` Jeremy Fitzhardinge
2010-08-30 18:18   ` Scott Garron
2010-09-12  9:33     ` J. Roeleveld
2010-08-30 19:13   ` Daniel Stodden
2010-08-30 20:30     ` Scott Garron
2010-08-31  9:20       ` Daniel Stodden
2010-08-31 18:06         ` Scott Garron
2010-09-03  8:06           ` Scott Garron
2010-09-12  9:41             ` J. Roeleveld
2010-09-12 18:48               ` Scott Garron
2010-09-13  0:15                 ` Making snapshot of logical volumes handling HVM domU causes " James Harper
2010-09-13  8:35                   ` J. Roeleveld
2010-09-13  8:33                 ` Making snapshot of logical volumes handling HVM domU causes " J. Roeleveld
     [not found]           ` <4C80ABA6.6000203@pridelands.org>
2010-09-03 15:40             ` Jeremy Fitzhardinge
2010-09-11 19:16               ` Scott Garron
2010-09-12  0:20                 ` Making snapshot of logical volumes handling HVM domU causes " James Harper
2010-08-31  6:59   ` Making snapshot of logical volumes handling HVM domU causes " Xu, Dongxiao
2010-08-31  8:16     ` Scott Garron
