* [linux-lvm] LVM corruption/diagnosis
@ 2011-04-05  4:44 Jan Bakuwel
  2011-04-06  8:53 ` Radu Rendec
  0 siblings, 1 reply; 9+ messages in thread
From: Jan Bakuwel @ 2011-04-05  4:44 UTC (permalink / raw)
  To: linux-lvm

Hi,

OS:         Debian Lenny
Hypervisor: Xen 3.2-1
kernel:     2.6.18-6-xen-amd64
lvm2:       2.02.06-4etch1
lvm-common: 1.5.20
Hardware:   IBM x3650 hardware RAID5 on SAS with battery backup

I've used LVM2 for years without any issue. I recently diagnosed a
problem with a Windows XP virtual machine running on a Debian Lenny Xen
dom0. After getting reports of stability problems and unexpected
crashes, I restored the VM from an image backup that is known to work. To
my surprise, that image also crashed unexpectedly. After much
trial and error, I decided to create a new LV and restore the image to
the new LV (rather than using the existing LV). My surprise was even
bigger when that turned out to be the solution: the Windows XP VM has
been running fine since.

Now the daunting (?!) task awaits me to diagnose why one LV is suitable
for a Windows XP VM and another no longer is (even though it has been
fine for at least half a year).

Advice on how to diagnose potential corruption problems with LVs would
be much appreciated.

cheers,
Jan

* Re: [linux-lvm] LVM corruption/diagnosis
  2011-04-05  4:44 [linux-lvm] LVM corruption/diagnosis Jan Bakuwel
@ 2011-04-06  8:53 ` Radu Rendec
  2011-04-06 20:51   ` Jan Bakuwel
  2011-04-06 21:32   ` Jan Bakuwel
  0 siblings, 2 replies; 9+ messages in thread
From: Radu Rendec @ 2011-04-06  8:53 UTC (permalink / raw)
  To: LVM general discussion and development

On Tue, 2011-04-05 at 16:44 +1200, Jan Bakuwel wrote:
> I've used LVM2 for years without any issue. I recently diagnosed a
> problem with a Windows XP virtual machine running on a Debian Lenny Xen
> dom0. After getting reports of stability problems and unexpected
> crashes, I restored the VM from an image backup that is known to work. To
> my surprise, that image also crashed unexpectedly. After much
> trial and error, I decided to create a new LV and restore the image to
> the new LV (rather than using the existing LV). My surprise was even
> bigger when that turned out to be the solution: the Windows XP VM has
> been running fine since.

Hi,

I can report a strikingly similar issue *but* I'm pretty sure it's not an
LVM issue - read below for the full details of what I did.

I've recently migrated a bunch of vmware machines to xen hvm. All of the
vmware machines were originally cloned from the same image, so they were
almost identical. I applied the exact same migration procedure on all of
them. All started fine on xen, except for just one.

For the machine that didn't start, I repeatedly copied the vmware
image and converted it to raw data using qemu-img, but with no success:
the machine just wouldn't boot.

Then I saw your post, created a new LVM volume, converted *exactly the
same* vmware image to raw data in the new volume and - to my surprise -
the machine booted just fine.

So far I think this is exactly what you experienced. But I went one step
further: I zero'ed the original LVM volume (the one that didn't boot)
with dd if=/dev/zero of=... then converted the vmware image again with
qemu-img. Surprisingly, the machine booted.

So I came up with this theory:
* vmware images (vmdk) are "sparse" images (they only contain the blocks
that have been written at least once by the guest os - all other blocks
would read "0" until they are first written);
* when I used qemu-img to convert the vmware images, only the
"allocated" blocks in the sparse image were written to the LVM volume,
leaving the other blocks unchanged;
* for the guest os in xen, the "unused" blocks would no longer read "0",
they would read whatever data was previously there in the LVM volume;
* the disks in my machine *had* been used before, so I'm pretty sure
that my LVM volumes initially contained some "random junk".

Creating a new LVM volume has the side effect of being "clean". So I'm
pretty sure that the problem is not the LVM volume itself, but the data
that it contains before restoring a sparse image to it. I also believe
that "cleaning" the *same* lvm volume with dd prior to restoring the
image would have worked just as well for you.
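
To make the theory above concrete, here is a minimal sketch that uses
ordinary files as stand-ins for the LV and the image. The file names, the
4 KiB block size and the use of GNU dd's conv=sparse in place of qemu-img
are all assumptions for illustration; nothing here touches a real LV, and
it is not a claim about what qemu-img does internally.

```shell
#!/bin/sh
# Stand-in demo with regular files. Assumes GNU dd (conv=sparse).
# All file and function names are hypothetical.
set -eu
work=$(mktemp -d)
lv="$work/lv.img"        # plays the role of the previously used LV
image="$work/image.raw"  # plays the role of the converted guest image

# fill_junk: overwrite the fake LV with 4 blocks of old data ('J' bytes).
fill_junk() { head -c 16384 /dev/zero | tr '\0' 'J' > "$lv"; }

# sparse_restore: copy the image, seeking over all-zero blocks instead of
# writing them (a stand-in for a sparse-aware converter).
sparse_restore() { dd if="$image" of="$lv" bs=4096 conv=sparse,notrunc 2>/dev/null; }

# count_junk: number of leftover 'J' bytes in the fake LV.
count_junk() { tr -cd 'J' < "$lv" | wc -c | tr -d ' '; }

# Image: one block of guest data ('D'), then three never-written zero blocks.
{ head -c 4096 /dev/zero | tr '\0' 'D'; head -c 12288 /dev/zero; } > "$image"

# Case 1: restore straight onto the dirty LV.
fill_junk
sparse_restore
junk_without_zeroing=$(count_junk)

# Case 2: zero the LV first, then restore.
fill_junk
dd if=/dev/zero of="$lv" bs=4096 count=4 conv=notrunc 2>/dev/null
sparse_restore
junk_after_zeroing=$(count_junk)

echo "leftover junk bytes without zeroing: $junk_without_zeroing"
echo "leftover junk bytes after zeroing:   $junk_after_zeroing"
rm -rf "$work"
```

In the first case the blocks that the image never wrote still hold the old
'J' bytes; in the second case they read back as zeros, which is what the
guest would have seen on its original disk.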

Now the only question that's left to be answered is why the heck a
windows xp guest (yes, my guest machines are windows xp too) would crash
(or not even boot) when there is some particular data left in the
*unallocated* filesystem blocks. But since I have a "certain" opinion
about microsoft and their products, I just think that life is too short
to bother yourself with this kind of crap.

I hope this helps you debug the issue you had. It would also be
interesting if you could try to zero the *original* LVM volume (the one
that didn't work) and then restore the image once again and see if it
works. It would prove (or disprove) my theory :)

Best regards,

Radu Rendec

* Re: [linux-lvm] LVM corruption/diagnosis
  2011-04-06  8:53 ` Radu Rendec
@ 2011-04-06 20:51   ` Jan Bakuwel
  2011-04-06 21:32   ` Jan Bakuwel
  1 sibling, 0 replies; 9+ messages in thread
From: Jan Bakuwel @ 2011-04-06 20:51 UTC (permalink / raw)
  To: linux-lvm

Hi Radu,


Thanks for your response.

> Now the only question that's left to be answered is why the heck a
> windows xp guest (yes, my guest machines are windows xp too) would crash
> (or not even boot) when there is some particular data left in the
> *unallocated* filesystem blocks. But since I have a "certain" opinion
> about microsoft and their products, I just think that life is too short
> to bother yourself with this kind of crap.

That is indeed a good question and the reason why I initially didn't
pursue this track. I wouldn't be surprised if this would indeed be a
"feature" of Windows XP/Microsoft. Another question would be why
Windows XP would hang itself up beyond repair after having run just fine
for half a year, and leave its disc space in an unusable state as well.
After all, making image backups of Windows machines and restoring them to
known states is a well-tried method.


> I hope this helps you debug the issue you had. It would also be
> interesting if you could try to zero the *original* LVM volume (the one
> that didn't work) and then restore the image once again and see if it
> works. It would prove (or disprove) my theory :)

I'll do just that in the next few days and will report back.


best regards,
Jan

* Re: [linux-lvm] LVM corruption/diagnosis
  2011-04-06  8:53 ` Radu Rendec
  2011-04-06 20:51   ` Jan Bakuwel
@ 2011-04-06 21:32   ` Jan Bakuwel
  2011-04-06 21:52     ` Ron Johnson
  1 sibling, 1 reply; 9+ messages in thread
From: Jan Bakuwel @ 2011-04-06 21:32 UTC (permalink / raw)
  To: linux-lvm

Hi Radu, all,

> I hope this helps you debug the issue you had. It would also be
> interesting if you could try to zero the *original* LVM volume (the one
> that didn't work) and then restore the image once again and see if it
> works. It would prove (or disprove) my theory :)

Even though I still consider your theory (of zeroing the blocks before
restoring the image) the most plausible, something else showed up when I
looked at zeroing the partition (I have to wait before restoring the
image as this is a production system).

The old LV apparently is still online/active and I cannot deactivate
it/take it offline even though I'm sure it is not in use. This is
something (with LVM2) I've seen before: LVs are marked to be in use (and
cannot be taken offline) even though none of the running VMs is using
the LV.

# lvchange -a n /dev/d/xm.wxp
LV d/xm.wxp in use: not deactivating

If something else is using the LV as well as the VM, it would be logical
that the VM experiences corruptions (even if it's running code from
Redmond :-P ).

I've tried using kpartx in the past (as suggested in some places) but
without much success. In the following list, d-xm.wxp is the old LV
(that no longer works [pre zeroing the blocks]), d-xm.wxp2 is the new LV
(that is currently in use) and I don't know what d-xm.wxp1 is...
possibly the first partition of d-xm.wxp that keeps it online?

brw-rw----  1 root disk 254, 22 2011-03-30 04:58 d-xm.wxp
brw-rw----  1 root disk 254, 24 2011-03-28 09:18 d-xm.wxp1
brw-rw----  1 root disk 254, 25 2011-04-01 05:56 d-xm.wxp2
# kpartx -d /dev/d/d-xm.wxp
failed to stat() /dev/d/d-xm.wxp
# kpartx -d /dev/d/d-xm.wxp1
failed to stat() /dev/d/d-xm.wxp1

I wouldn't know though what else could be using the LV and I am not
aware of any methods to find out... any suggestions?

best regards,
Jan

* Re: [linux-lvm] LVM corruption/diagnosis
  2011-04-06 21:32   ` Jan Bakuwel
@ 2011-04-06 21:52     ` Ron Johnson
  2011-04-07  2:06       ` Jan Bakuwel
  0 siblings, 1 reply; 9+ messages in thread
From: Ron Johnson @ 2011-04-06 21:52 UTC (permalink / raw)
  To: linux-lvm

On 04/06/2011 04:32 PM, Jan Bakuwel wrote:
[snip]
>
> I wouldn't know though what else could be using the LV and I am not
> aware of any methods to find out... any suggestions?
>

lsof is tool #1.

Then "fdisk /dev/d/d-xm.wxp".

Then poke around to see whether a networked fs like samba or nfs is
serving the device.

As a last resort, you could always reboot the box.
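
One more place to look: a device-mapper mapping stacked on top of the LV
(for example a partition mapping left behind by kpartx) can keep it open,
and that would typically not show up in lsof because no process holds it.
A minimal sketch, assuming the kernel exposes a holders directory under
/sys/block/dm-N (which may not hold on every kernel) and using minor number
22 from the device listing earlier in the thread; list_holders is a made-up
helper name:

```shell
#!/bin/sh
# list_holders SYSFS_DIR: print the devices stacked on top of a block device,
# given its sysfs directory (e.g. /sys/block/dm-22). Hypothetical helper.
list_holders() {
    for h in "$1"/holders/*; do
        [ -e "$h" ] || continue   # glob did not match: no holders
        basename "$h"
    done
}

# 254:22 was d-xm.wxp in the listing, so its sysfs node would be dm-22.
# Prints nothing if the device does not exist or has no holders.
list_holders /sys/block/dm-22

# With root, dmsetup gives a similar picture:
#   dmsetup info -c     (the "Open" column is the open count)
#   dmsetup ls --tree   (shows which mappings sit on top of which)
```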

-- 
"Neither the wisest constitution nor the wisest laws will secure
the liberty and happiness of a people whose manners are universally
corrupt."
Samuel Adams, essay in The Public Advertiser, 1749

* Re: [linux-lvm] LVM corruption/diagnosis
  2011-04-06 21:52     ` Ron Johnson
@ 2011-04-07  2:06       ` Jan Bakuwel
  2011-04-07  2:16         ` Ron Johnson
  2011-04-07  5:47         ` Radu Rendec
  0 siblings, 2 replies; 9+ messages in thread
From: Jan Bakuwel @ 2011-04-07  2:06 UTC (permalink / raw)
  To: linux-lvm

Hi Ron,

Thanks for your reply.

Problem solved. It was my brain mixing /dev/d/ and /dev/mapper.
Releasing the partition device with kpartx -d worked - as long as I use
the correct path and don't mix the VG name with "mapper".

Duh...
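
For reference, the two paths name the same device in different ways.
/dev/<vg>/<lv> uses the VG and LV names as they are, whereas the name under
/dev/mapper joins them with a single hyphen and doubles any hyphen inside
either name. A small sketch of that mapping (dm_name is a hypothetical
helper for illustration, not an LVM command):

```shell
#!/bin/sh
# dm_name VG LV: print the device-mapper name LVM uses for VG/LV.
# Hyphens inside either name are doubled; a single hyphen joins the two.
dm_name() {
    vg=$(printf '%s' "$1" | sed 's/-/--/g')
    lv=$(printf '%s' "$2" | sed 's/-/--/g')
    printf '%s-%s\n' "$vg" "$lv"
}

dm_name d xm.wxp     # prints d-xm.wxp
```

So the LV here is /dev/d/xm.wxp or /dev/mapper/d-xm.wxp, but not
/dev/d/d-xm.wxp, which is the path that failed to stat().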

Radu: the first test I'll do is not to zero the partition but to restore
the image now the partition device (/dev/d/xm.wxp1) is gone. I don't
understand why it's there in the first place (dom0 has no business
there). If that helps, the presence of that partition device apparently
interferes with the VM. If that doesn't help, I'll zero the blocks and
report back (some time next week).

thanks folks for the help,
Jan

* Re: [linux-lvm] LVM corruption/diagnosis
  2011-04-07  2:06       ` Jan Bakuwel
@ 2011-04-07  2:16         ` Ron Johnson
  2011-04-07  5:47         ` Radu Rendec
  1 sibling, 0 replies; 9+ messages in thread
From: Ron Johnson @ 2011-04-07  2:16 UTC (permalink / raw)
  To: linux-lvm

On 04/06/2011 09:06 PM, Jan Bakuwel wrote:
> Hi Ron,
>
> Thanks for your reply.
>
> Problem solved. It was my brain mixing /dev/d/ and /dev/mapper.
> Releasing the partition device with kpartx -d worked - as long as I use
> the correct path and don't mix the VG name with "mapper".
>
> Duh...
>

:)

That's why I stick with vg/lv notation.

* Re: [linux-lvm] LVM corruption/diagnosis
  2011-04-07  2:06       ` Jan Bakuwel
  2011-04-07  2:16         ` Ron Johnson
@ 2011-04-07  5:47         ` Radu Rendec
  2011-04-07  8:31           ` Jan Bakuwel
  1 sibling, 1 reply; 9+ messages in thread
From: Radu Rendec @ 2011-04-07  5:47 UTC (permalink / raw)
  To: LVM general discussion and development

On Thu, 2011-04-07 at 14:06 +1200, Jan Bakuwel wrote:
> Problem solved. It was my brain mixing /dev/d/ and /dev/mapper.
> Releasing the partition device with kpartx -d worked - as long as I use
> the correct path and don't mix the VG name with "mapper".
>
> Radu: the first test I'll do is not to zero the partition but to restore
> the image now the partition device (/dev/d/xm.wxp1) is gone. I don't
> understand why it's there in the first place (dom0 has no business
> there). If that helps, the presence of that partition device apparently
> interferes with the VM. If that doesn't help, I'll zero the blocks and
> report back (some time next week).

I don't think that mapping the partitions with kpartx could affect the
VM (that reads/writes to the LV directly).

But what I know for sure is that when you map a block device with
kpartx, the "partition" devices that kpartx creates under /dev/mapper
have different read/write caches than the original block device (the LV
in your case).

One issue that I experienced is that when you write data to a kpartx
mapped device (partition) and some (or all) of the blocks that you write
happen to be in the read cache of the original block device (the LV),
then you'll read "old" data from the LV, even if you first unmap the
partitions with kpartx -d.

This issue can be simply addressed by using "blockdev --flushbufs" on
the LV, after you do "kpartx -d" and before you use the LV (start the VM
for instance).
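
Spelled out as commands, the order would be as follows. This is only a
sketch: run and release_and_flush are hypothetical helper names, the device
path is the one from this thread, and DRY_RUN makes the script print the
commands instead of executing them (the real commands need root).

```shell
#!/bin/sh
# run CMD...: execute a command, or just print it when DRY_RUN is set.
run() {
    if [ -n "${DRY_RUN:-}" ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# release_and_flush LV_PATH: remove the kpartx partition mappings first,
# then flush the LV's buffers so later readers do not see stale blocks.
release_and_flush() {
    run kpartx -d "$1" &&
    run blockdev --flushbufs "$1"
}

DRY_RUN=1                          # unset this to run for real (as root)
release_and_flush /dev/d/xm.wxp
# With DRY_RUN set, this prints:
#   kpartx -d /dev/d/xm.wxp
#   blockdev --flushbufs /dev/d/xm.wxp
```

Only after both steps have succeeded would the VM be started on the LV.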

What type of image are you restoring? The whole LV (including its
partition table) or just the partition inside the LV (perhaps with
ntfsclone)? Because if you're restoring the partition (and not using
"kpartx -d" and "blockdev --flushbufs"), it's very likely that you ran
into caching issues.

Best regards,

Radu

* Re: [linux-lvm] LVM corruption/diagnosis
  2011-04-07  5:47         ` Radu Rendec
@ 2011-04-07  8:31           ` Jan Bakuwel
  0 siblings, 0 replies; 9+ messages in thread
From: Jan Bakuwel @ 2011-04-07  8:31 UTC (permalink / raw)
  To: linux-lvm

Hi Radu,


> I don't think that mapping the partitions with kpartx could affect the
> VM (that reads/writes to the LV directly).

I don't think so either, but then I didn't think that not zeroing
unallocated blocks might make a difference either :-)

> But what I know for sure is that when you map a block device with
> kpartx, the "partition" devices that kpartx creates under /dev/mapper
> have different read/write caches than the original block device (the LV
> in your case).
>
> One issue that I experienced is that when you write data to a kpartx
> mapped device (partition) and some (or all) of the blocks that you write
> happen to be in the read cache of the original block device (the LV),
> then you'll read "old" data from the LV, even if you first unmap the
> partitions with kpartx -d.


The VM is the only entity accessing the LV.


> This issue can be simply addressed by using "blockdev --flushbufs" on
> the LV, after you do "kpartx -d" and before you use the LV (start the VM
> for instance).
>
> What type of image are you restoring? The whole LV (including its
> partition table) or just the partition inside the LV (perhaps with
> ntfsclone)? Because if you're restoring the partition (and not using
> "kpartx -d" and "blockdev --flushbufs"), it's very likely that you ran
> into caching issues.


A full disc image including the partition table, boot block etc.

Will let you know how it goes.

best regards,
Jan
