linux-lvm.redhat.com archive mirror
* [linux-lvm] repair pool with bad checksum in superblock
@ 2019-08-23  0:18 Dave Cohen
  2019-08-23  8:59 ` Zdenek Kabelac
  0 siblings, 1 reply; 7+ messages in thread
From: Dave Cohen @ 2019-08-23  0:18 UTC (permalink / raw)
  To: linux-lvm

I've read some old posts on this group, which give me some hope that I might recover a failed drive.  But I'm not well-versed in LVM, so details of what I've read are going over my head. 

My problems started when my laptop failed to shut down properly, and afterwards booted only to dracut emergency shell.  I've since attempted to rescue the bad drive, using `ddrescue`.  That tool reported 99.99% of the drive rescued, but so far I'm unable to access the LVM data.

Decrypting the copy I made with `ddrescue` gives me /dev/mapper/encrypted_rescue, but I can't activate the LVM data that is there.  I get these errors:

$ sudo lvconvert --repair qubes_dom0/pool00
  WARNING: Not using lvmetad because of repair.
  WARNING: Disabling lvmetad cache for repair command.
bad checksum in superblock, wanted 823063976
  Repair of thin metadata volume of thin pool qubes_dom0/pool00 failed (status:1). Manual repair required!

$ sudo thin_check /dev/mapper/encrypted_rescue
examining superblock
  superblock is corrupt
    bad checksum in superblock, wanted 636045691

(Note the two commands return different "wanted" values.  Are there two superblocks?)

I found a post, several years old, written by Ming-Hung Tsai, which describes restoring a broken superblock.  I'll show that post below, along with my questions, because I'm missing some of the knowledge necessary.

I would greatly appreciate any help! 

-Dave

Original post from several years ago, plus my questions:
> The original post asks what to do if the superblock is broken (his superblock
> was accidentally wiped). Since I don't have time to update the program
> at this moment, here's my workaround:
> 
> 1. Partially rebuild the superblock
> 
> (1) Obtain pool parameter from LVM
> 
> ./sbin/lvm lvs vg1/tp1 -o transaction_id,chunksize,lv_size --units s
> 
> sample output:
> Tran Chunk LSize
> 3545 128S 7999381504S
> 
> The number of data blocks is $((7999381504/128)) = 62495168
> 

Here's what I get:

$ sudo lvs qubes_dom0/pool00 -o transaction_id,chunksize,lv_size --units S 
  TransId Chunk LSize     
    14757  512S 901660672S

So, the number of data blocks, if I understand correctly, is $((901660672/512)) = 1761056
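
Plugging my numbers into step (2) below, I assume (unverified) that my input.xml would be:

<superblock uuid="" time="0" transaction="14757"
data_block_size="512" nr_data_blocks="1761056">
</superblock>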

> (2) Create input.xml with pool parameters obtained from LVM:
> 
> <superblock uuid="" time="0" transaction="3545"
> data_block_size="128" nr_data_blocks="62495168">
> </superblock>
> 
> (3) Run thin_restore to generate a temporary metadata with correct superblock
> 
> dd if=/dev/zero of=/tmp/test.bin bs=1M count=16
> thin_restore -i input.xml -o /tmp/test.bin
> 
> The size of /tmp/test.bin depends on your pool size.

I don't understand the last sentence.  What should the size of my /tmp/test.bin be?  Should I be using "bs=1M count=16"?
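
My guess - please correct me if this is wrong - is that matching the size of the pool's existing metadata LV should be safe.  I think I can read that size with:

$ sudo lvs -a qubes_dom0 -o lv_name,lv_size --units m | grep tmeta

and then create /tmp/test.bin at that size, e.g. if it were 128 MiB (made-up number):

$ dd if=/dev/zero of=/tmp/test.bin bs=1M count=128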


> 
> (4) Copy the partially-rebuilt superblock (4KB) to your broken metadata.
> (<src_metadata>).
> 
> dd if=/tmp/test.bin of=<src_metadata> bs=4k count=1 conv=notrunc
>

What is <src_metadata> here?
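
My guess is that <src_metadata> means the pool's own metadata device, i.e. the hidden pool00_tmeta LV, which a recent lvm2 can apparently activate on its own - but I haven't verified this:

$ sudo lvchange -ay qubes_dom0/pool00_tmeta
$ ls -l /dev/mapper/qubes_dom0-pool00_tmeta   # if this appears, use it as <src_metadata>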
 
> 2. Run thin_ll_dump and thin_ll_restore
> https://www.redhat.com/archives/linux-lvm/2016-February/msg00038.html
> 
> Example: assume that we found data-mapping-root=2303
> and device-details-root=277313
> 
> ./pdata_tools thin_ll_dump <src_metadata> --data-mapping-root=2303 \
> --device-details-root 277313 -o thin_ll_dump.txt
> 
> ./pdata_tools thin_ll_restore -E <src_metadata> -i thin_ll_dump.txt \
> -o <dst_metadata>
> 
> Note that <dst_metadata> should be sufficiently large, especially when you
> have snapshots, since the mapping trees reconstructed by the thin tools
> do not share blocks.

Here, I don't have the commands `thin_ll_dump` or `thin_ll_restore`.  How should I obtain those?  Or is there a way to do this with the tools I do have?  (I'm on Fedora 30, FYI.)
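
From a quick look at the thin-provisioning-tools repository, I'm guessing they are among the optional "dev tools" that only get built when configure is run with --enable-dev-tools, roughly like this (unverified):

$ git clone https://github.com/jthornber/thin-provisioning-tools.git
$ cd thin-provisioning-tools
$ git checkout v0.8.5
$ autoconf
$ ./configure --enable-dev-tools
$ make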

> 
> 3. Fix superblock's time field
> 
> (1) Run thin_dump on the repaired metadata
> 
> thin_dump <dst_metadata> -o thin_dump.txt
> 
> (2) Find the maximum time value in data mapping trees
> (the device with the maximum snap_time might have been removed, so find the
> maximum time in the data mapping trees, not the device details tree)
> 
> grep "time=\"[0-9]*\"" thin_dump.txt -o | uniq | sort | uniq | tail
> 
> (I run uniq twice to avoid sorting too much data)
> 
> sample output:
> ...
> time="1785"
> time="1786"
> time="1787"
> 
> so the maximum time is 1787.
> 
> (3) Edit the "time" value of the <superblock> tag in thin_dump's output
> 
> <superblock uuid="" time="1787" ... >
> ...
> 
> (4) Run thin_restore to get the final metadata
> 
> thin_restore -i thin_dump.txt -o <dst_metadata>
> 
> 
> Ming-Hung Tsai
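
One more question on top of the above: once I have a good <dst_metadata>, how do I get it back into the pool?  My unverified guess is to write it to a spare LV and swap that in with lvconvert, something like:

$ sudo lvcreate -L 1G -n repaired_meta qubes_dom0   # sized at least as large as the old _tmeta
$ sudo dd if=<dst_metadata> of=/dev/qubes_dom0/repaired_meta
$ sudo lvchange -an qubes_dom0/repaired_meta
$ sudo lvconvert --thinpool qubes_dom0/pool00 --poolmetadata qubes_dom0/repaired_meta

Is that right?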

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [linux-lvm] repair pool with bad checksum in superblock
  2019-08-23  0:18 [linux-lvm] repair pool with bad checksum in superblock Dave Cohen
@ 2019-08-23  8:59 ` Zdenek Kabelac
  2019-08-23 11:40   ` Dave Cohen
  0 siblings, 1 reply; 7+ messages in thread
From: Zdenek Kabelac @ 2019-08-23  8:59 UTC (permalink / raw)
  To: LVM general discussion and development, Dave Cohen

On 23. 08. 19 at 2:18, Dave Cohen wrote:
> I've read some old posts on this group, which give me some hope that I might recover a failed drive.  But I'm not well-versed in LVM, so details of what I've read are going over my head.
> 
> My problems started when my laptop failed to shut down properly, and afterwards booted only to dracut emergency shell.  I've since attempted to rescue the bad drive, using `ddrescue`.  That tool reported 99.99% of the drive rescued, but so far I'm unable to access the LVM data.
> 
> Decrypting the copy I made with `ddrescue` gives me /dev/mapper/encrypted_rescue, but I can't activate the LVM data that is there.  I get these errors:
> 
> $ sudo lvconvert --repair qubes_dom0/pool00
>    WARNING: Not using lvmetad because of repair.
>    WARNING: Disabling lvmetad cache for repair command.
> bad checksum in superblock, wanted 823063976
>    Repair of thin metadata volume of thin pool qubes_dom0/pool00 failed (status:1). Manual repair required!
> 
> $ sudo thin_check /dev/mapper/encrypted_rescue
> examining superblock
>    superblock is corrupt
>      bad checksum in superblock, wanted 636045691
> 
> (Note the two commands return different "wanted" values.  Are there two superblocks?)
> 
> I found a post, several years old, written by Ming-Hung Tsai, which describes restoring a broken superblock.  I'll show that post below, along with my questions, because I'm missing some of the knowledge necessary.
> 
> I would greatly appreciate any help!


I think it's important to know the version of the thin tools.

Are you using 0.8.5?

If so - feel free to open a Bugzilla and upload your metadata so we can check
what's going on there.

In the BZ, please also provide the lvm2 metadata and describe how the error was reached.

One typical error we see with thin-pool usage is 'doubled' activation.
So the thin-pool gets activated on 2 hosts in parallel (usually unwantedly) - and
when this happens and 2 pools are updating the same metadata - it gets damaged.

Regards

Zdenek

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [linux-lvm] repair pool with bad checksum in superblock
  2019-08-23  8:59 ` Zdenek Kabelac
@ 2019-08-23 11:40   ` Dave Cohen
  2019-08-23 12:47     ` Zdenek Kabelac
  0 siblings, 1 reply; 7+ messages in thread
From: Dave Cohen @ 2019-08-23 11:40 UTC (permalink / raw)
  To: Zdenek Kabelac, LVM general discussion and development



On Fri, Aug 23, 2019, at 4:59 AM, Zdenek Kabelac wrote:
> On 23. 08. 19 at 2:18, Dave Cohen wrote:
> > I've read some old posts on this group, which give me some hope that I might recover a failed drive.  But I'm not well-versed in LVM, so details of what I've read are going over my head.
> > 
> > My problems started when my laptop failed to shut down properly, and afterwards booted only to dracut emergency shell.  I've since attempted to rescue the bad drive, using `ddrescue`.  That tool reported 99.99% of the drive rescued, but so far I'm unable to access the LVM data.
> > 
> > Decrypting the copy I made with `ddrescue` gives me /dev/mapper/encrypted_rescue, but I can't activate the LVM data that is there.  I get these errors:
> > 
> > $ sudo lvconvert --repair qubes_dom0/pool00
> >    WARNING: Not using lvmetad because of repair.
> >    WARNING: Disabling lvmetad cache for repair command.
> > bad checksum in superblock, wanted 823063976
> >    Repair of thin metadata volume of thin pool qubes_dom0/pool00 failed (status:1). Manual repair required!
> > 
> > $ sudo thin_check /dev/mapper/encrypted_rescue
> > examining superblock
> >    superblock is corrupt
> >      bad checksum in superblock, wanted 636045691
> > 
> > (Note the two commands return different "wanted" values.  Are there two superblocks?)
> > 
> > I found a post, several years old, written by Ming-Hung Tsai, which describes restoring a broken superblock.  I'll show that post below, along with my questions, because I'm missing some of the knowledge necessary.
> > 
> > I would greatly appreciate any help!
> 
> 
> I think it's important to know the version of the thin tools.
> 
> Are you using 0.8.5?

I had been using "0.7.6-4.fc30" (provided by Fedora).  Upon seeing your email, I built tag "v0.8.5", but the results from the `lvconvert` and `thin_check` commands are identical to what I wrote above.

$ thin_check --version
0.8.5

> 
> If so - feel free to open a Bugzilla and upload your metadata so we can check 
> what's going on there.
> 
> In the BZ, please also provide the lvm2 metadata and describe how the error was reached.
> 

When you say "upload your metadata" and "lvm2 metadata", can you tell me exactly how to get it?  Sorry for the basic question but I'm not sure what to run and what to upload.

> One typical error we see with thin-pool usage is 'doubled' activation.
> So the thin-pool gets activated on 2 hosts in parallel (usually unwantedly) - and 
> when this happens and 2 pools are updating the same metadata - it gets damaged.

In my case, lvm was set up by qubes-os, on a laptop.  The disk drive had a physical problem.  I'll put those details into bugzilla.  (But I'm waiting for answer to metadata question above before I submit ticket.)

Thanks for your help!

-Dave

> 
> Regards
> 
> Zdenek
>

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [linux-lvm] repair pool with bad checksum in superblock
  2019-08-23 11:40   ` Dave Cohen
@ 2019-08-23 12:47     ` Zdenek Kabelac
  2019-08-23 14:58       ` Gionatan Danti
  2019-08-25  2:13       ` Dave Cohen
  0 siblings, 2 replies; 7+ messages in thread
From: Zdenek Kabelac @ 2019-08-23 12:47 UTC (permalink / raw)
  To: Dave Cohen, LVM general discussion and development

On 23. 08. 19 at 13:40, Dave Cohen wrote:
> 
> 

> $ thin_check --version
> 0.8.5

Hi

So if repairing fails even with the latest version - it's better to upload the
metadata into a BZ created here:

https://bugzilla.redhat.com/enter_bug.cgi?product=LVM%20and%20device-mapper

>> If so - feel free to open a Bugzilla and upload your metadata so we can check
>> what's going on there.
>>
>> In the BZ, please also provide the lvm2 metadata and describe how the error was reached.
>>
> 
> When you say "upload your metadata" and "lvm2 metadata", can you tell me exactly how to get it?  Sorry for the basic question but I'm not sure what to run and what to upload.


Upload a compressed 'dd' copy of your ORIGINAL _tmeta content (which is now
likely already in the volume _meta0 - if you had one successful run of the
--repair command).

If you use an older 'lvm2' you might have a problem accessing the _tmeta
device content - if you have the latest fc30 - you should be able
to activate _tmeta on its own via standalone component activation.
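
For example, something along these lines (only a sketch - adjust the names to your setup):

lvchange -ay qubes_dom0/pool00_tmeta
dd if=/dev/mapper/qubes_dom0-pool00_tmeta of=/tmp/pool00_tmeta.img bs=1M
xz /tmp/pool00_tmeta.img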

To get an lvm2 metadata backup just use 'vgcfgbackup -f output.txt VGNAME'.
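(In your case that would be e.g.:  vgcfgbackup -f output.txt qubes_dom0)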

Let us know if you have problems getting the kernel _tmeta or the lvm2 metadata.

> In my case, lvm was set up by qubes-os, on a laptop.  The disk drive had a physical problem.  I'll put those details into bugzilla.  (But I'm waiting for answer to metadata question above before I submit ticket.)

Ok - a serious disk error might lead to eventually irreparable metadata content
- since if you lose some root b-tree node sequence it might be really hard
to get something sensible back (it's the reason why the metadata should be located
on some 'mirrored' device - while there is a lot of effort put into
protection against software errors, it's hard to do something about a hardware
error...)
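
(For example, on a healthy system with a recent lvm2 I believe the pool metadata LV
can be converted to raid1 with something like 'lvconvert --type raid1 -m1 VG/pool_tmeta'
- but please check the man page for your version; this is only a rough suggestion.)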


Regards

Zdenek

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [linux-lvm] repair pool with bad checksum in superblock
  2019-08-23 12:47     ` Zdenek Kabelac
@ 2019-08-23 14:58       ` Gionatan Danti
  2019-08-23 15:29         ` Stuart D. Gathman
  2019-08-25  2:13       ` Dave Cohen
  1 sibling, 1 reply; 7+ messages in thread
From: Gionatan Danti @ 2019-08-23 14:58 UTC (permalink / raw)
  To: LVM general discussion and development; +Cc: Dave Cohen

On 23-08-2019 14:47, Zdenek Kabelac wrote:
> Ok - a serious disk error might lead to eventually irreparable metadata
> content - since if you lose some root b-tree node sequence it might be
> really hard to get something sensible back (it's the reason why the
> metadata should be located on some 'mirrored' device - while there is a
> lot of effort put into protection against software errors, it's hard to
> do something about a hardware error...)

Would it be possible to have a backup superblock, maybe located at the device end?
XFS, EXT4 and ZFS already do something similar...

Regards.

-- 
Danti Gionatan
Technical Support
Assyoma S.r.l. - www.assyoma.it
email: g.danti@assyoma.it - info@assyoma.it
GPG public key ID: FF5F32A8

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [linux-lvm] repair pool with bad checksum in superblock
  2019-08-23 14:58       ` Gionatan Danti
@ 2019-08-23 15:29         ` Stuart D. Gathman
  0 siblings, 0 replies; 7+ messages in thread
From: Stuart D. Gathman @ 2019-08-23 15:29 UTC (permalink / raw)
  To: LVM general discussion and development; +Cc: Dave Cohen

On Fri, 23 Aug 2019, Gionatan Danti wrote:

> On 23-08-2019 14:47, Zdenek Kabelac wrote:
>> Ok - a serious disk error might lead to eventually irreparable metadata
>> content - since if you lose some root b-tree node sequence it might be
>> really hard to get something sensible back (it's the reason why the
>> metadata should be located on some 'mirrored' device - while there is a
>> lot of effort put into protection against software errors, it's hard to
>> do something about a hardware error...)
>
> Would it be possible to have a backup superblock, maybe located at the device end?
> XFS, EXT4 and ZFS already do something similar...

On my btree file system, I can recover from arbitrary hardware
corruption by storing the root id of the file (table) in each node. 
Leaf nodes (with full data records) are also indicated.  Thus, even if
the root node of a file is lost/corrupted, the raw file/device can be
scanned for corresponding leaf nodes to rebuild the file (table) with
all remaining records.

Drawbacks: deleting individual leaf nodes requires changing the root id
of the node, requiring an extra write.  (Otherwise records could be
included in some future recovery.)  Deleting entire files (tables) 
just requires marking the root node deleted - no need to write all the
leaf nodes.

-- 
 	      Stuart D. Gathman <stuart@gathman.org>
"Confutatis maledictis, flamis acribus addictis" - background song for
a Microsoft sponsored "Where do you want to go from here?" commercial.

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [linux-lvm] repair pool with bad checksum in superblock
  2019-08-23 12:47     ` Zdenek Kabelac
  2019-08-23 14:58       ` Gionatan Danti
@ 2019-08-25  2:13       ` Dave Cohen
  1 sibling, 0 replies; 7+ messages in thread
From: Dave Cohen @ 2019-08-25  2:13 UTC (permalink / raw)
  To: Zdenek Kabelac, LVM general discussion and development



On Fri, Aug 23, 2019, at 8:47 AM, Zdenek Kabelac wrote:
> On 23. 08. 19 at 13:40, Dave Cohen wrote:
> > 
> > 
> 
> > $ thin_check --version
> > 0.8.5
> 
> Hi
> 
> So if repairing fails even with the latest version - it's better to upload 
> metadata into BZ created here:
> 
> https://bugzilla.redhat.com/enter_bug.cgi?product=LVM%20and%20device-mapper
>

I've created https://bugzilla.redhat.com/show_bug.cgi?id=1745204

 
> >> If so - feel free to open a Bugzilla and upload your metadata so we can check
> >> what's going on there.
> >>
> >> In the BZ, please also provide the lvm2 metadata and describe how the error was reached.
> >>
> > 
> > When you say "upload your metadata" and "lvm2 metadata", can you tell me exactly how to get it?  Sorry for the basic question but I'm not sure what to run and what to upload.
> 
> 
> Upload a compressed 'dd' copy of your ORIGINAL _tmeta content (which is now
> likely already in the volume _meta0 - if you had one successful run of the
> --repair command).
> 

Hmmm.  I'm not sure how to use `dd` for this.  If I'm missing something obvious, please let me know. Note, I cannot activate any portion of the pool.
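
My best guess - which I haven't tried, and the numbers below are made up - is that I could compute where the _tmeta LV sits on the PV from the vgcfgbackup output (extent_size and pe_start are in 512-byte sectors, and the _tmeta segment's stripes entry gives its starting physical extent) and read it straight off the decrypted device:

# hypothetical values: pe_start=2048, extent_size=8192, _tmeta starts at extent 3, extent_count=4
$ sudo dd if=/dev/mapper/encrypted_rescue of=pool00_tmeta.img bs=512 \
    skip=$((2048 + 3*8192)) count=$((4*8192))
$ xz pool00_tmeta.img

Is that a sane approach, or is there a simpler way?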

> If you use an older 'lvm2' you might have a problem accessing the _tmeta
> device content - if you have the latest fc30 - you should be able
> to activate _tmeta on its own via standalone component activation.
> 
> To get an lvm2 metadata backup just use 'vgcfgbackup -f output.txt VGNAME'.

This succeeded, and I attached to the ticket.

> 
> Let us know if you have problems getting the kernel _tmeta or the lvm2 metadata.

As I wrote above, I could not get the _tmeta.  If you're referring to a part of the pool, it does not activate via `lvchange -ay`.


> 
> > In my case, lvm was set up by qubes-os, on a laptop.  The disk drive had a physical problem.  I'll put those details into bugzilla.  (But I'm waiting for answer to metadata question above before I submit ticket.)
> 
> Ok - a serious disk error might lead to eventually irreparable metadata content
> - since if you lose some root b-tree node sequence it might be really hard
> to get something sensible back (it's the reason why the metadata should be located
> on some 'mirrored' device - while there is a lot of effort put into
> protection against software errors, it's hard to do something about a hardware
> error...)

Exactly how to do this is still beyond me.  But I'm up for learning, and contributing it back to the qubes-os project.

-Dave

^ permalink raw reply	[flat|nested] 7+ messages in thread

end of thread  [~2019-08-25  2:13 UTC | newest]

Thread overview: 7+ messages (download: mbox.gz / follow: Atom feed)
2019-08-23  0:18 [linux-lvm] repair pool with bad checksum in superblock Dave Cohen
2019-08-23  8:59 ` Zdenek Kabelac
2019-08-23 11:40   ` Dave Cohen
2019-08-23 12:47     ` Zdenek Kabelac
2019-08-23 14:58       ` Gionatan Danti
2019-08-23 15:29         ` Stuart D. Gathman
2019-08-25  2:13       ` Dave Cohen
