* corruption with multi-device btrfs + single bcache, won't mount
From: STEVE LEUNG @ 2019-02-10  6:56 UTC
To: linux-btrfs

Hi all,

I decided to try something a bit crazy, and try multi-device raid1
btrfs on top of dm-crypt and bcache. That is:

  btrfs -> dm-crypt -> bcache -> physical disks

I have a single cache device in front of 4 disks. Maybe this wasn't
such a good idea, because the filesystem went read-only a few days
after setting it up, and now it won't mount. I'd been running btrfs
on top of 4 dm-crypt-ed disks for some time without any problems, and
only recently added bcache (taking one device out at a time,
converting it over, and adding it back).

This was on Arch Linux x86-64, kernel 4.20.1.

dmesg from a mount attempt (using -o usebackuproot,nospace_cache,clear_cache):

[  267.355024] BTRFS info (device dm-5): trying to use backup root at mount time
[  267.355027] BTRFS info (device dm-5): force clearing of disk cache
[  267.355030] BTRFS info (device dm-5): disabling disk space caching
[  267.355032] BTRFS info (device dm-5): has skinny extents
[  271.446808] BTRFS error (device dm-5): parent transid verify failed on 13069706166272 wanted 4196588 found 4196585
[  271.447485] BTRFS error (device dm-5): parent transid verify failed on 13069706166272 wanted 4196588 found 4196585
[  271.447491] BTRFS error (device dm-5): failed to read block groups: -5
[  271.455868] BTRFS error (device dm-5): open_ctree failed

btrfs check:

parent transid verify failed on 13069706166272 wanted 4196588 found 4196585
parent transid verify failed on 13069706166272 wanted 4196588 found 4196585
parent transid verify failed on 13069706166272 wanted 4196588 found 4196585
parent transid verify failed on 13069706166272 wanted 4196588 found 4196585
Ignoring transid failure
ERROR: child eb corrupted: parent bytenr=13069708722176 item=7 parent level=2 child level=0
ERROR: cannot open file system

Any simple fix for the filesystem? It'd be nice to recover the data
that's hopefully still intact. I have some backups that I can dust
off if it really comes down to it, but it's more convenient to
recover the data in-place.

This is complete speculation, but I do wonder if having the single
cache device for multiple btrfs disks triggered the problem.

Thanks for any assistance.

Steve
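For readers unfamiliar with the layering, a stack like the one above could be provisioned roughly as follows. This is only a sketch: every device name, and the exact attach procedure, is hypothetical rather than taken from this report; a single bcache cache set shared by several backing devices is the standard arrangement.

```shell
# One fast cache device shared by four backing disks.
make-bcache -C /dev/sde                 # create the cache set on the SSD
for d in /dev/sd[a-d]; do
    make-bcache -B "$d"                 # register each backing disk
done
# Attach each bcacheN to the cache set by writing the cache set UUID
# to /sys/block/bcacheN/bcache/attach, then put dm-crypt on top:
cryptsetup luksFormat /dev/bcache0      # repeat for bcache1..3
cryptsetup open /dev/bcache0 crypt0
# Finally, btrfs raid1 across the four dm-crypt devices:
mkfs.btrfs -m raid1 -d raid1 /dev/mapper/crypt0 /dev/mapper/crypt1 \
           /dev/mapper/crypt2 /dev/mapper/crypt3
```

Note that this matches the ordering in the report: btrfs sits on dm-crypt, which sits on bcache, which sits on the physical disks.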
* Re: corruption with multi-device btrfs + single bcache, won't mount
From: Thiago Ramon @ 2019-02-10 10:35 UTC
To: STEVE LEUNG; +Cc: Btrfs BTRFS

On Sun, Feb 10, 2019 at 5:07 AM STEVE LEUNG <sjleung@shaw.ca> wrote:
>
> Hi all,
>
> I decided to try something a bit crazy, and try multi-device raid1 btrfs on
> top of dm-crypt and bcache. That is:
>
> btrfs -> dm-crypt -> bcache -> physical disks
>
> I have a single cache device in front of 4 disks. Maybe this wasn't
> that good of an idea, because the filesystem went read-only a few
> days after setting it up, and now it won't mount. I'd been running
> btrfs on top of 4 dm-crypt-ed disks for some time without any
> problems, and only added bcache (taking one device out at a time,
> converting it over, adding it back) recently.
>
> This was on Arch Linux x86-64, kernel 4.20.1.
>
> dmesg from a mount attempt (using -o usebackuproot,nospace_cache,clear_cache):
>
> [  267.355024] BTRFS info (device dm-5): trying to use backup root at mount time
> [  267.355027] BTRFS info (device dm-5): force clearing of disk cache
> [  267.355030] BTRFS info (device dm-5): disabling disk space caching
> [  267.355032] BTRFS info (device dm-5): has skinny extents
> [  271.446808] BTRFS error (device dm-5): parent transid verify failed on 13069706166272 wanted 4196588 found 4196585
> [  271.447485] BTRFS error (device dm-5): parent transid verify failed on 13069706166272 wanted 4196588 found 4196585
> [  271.447491] BTRFS error (device dm-5): failed to read block groups: -5
> [  271.455868] BTRFS error (device dm-5): open_ctree failed
>
> btrfs check:
>
> parent transid verify failed on 13069706166272 wanted 4196588 found 4196585
> parent transid verify failed on 13069706166272 wanted 4196588 found 4196585
> parent transid verify failed on 13069706166272 wanted 4196588 found 4196585
> parent transid verify failed on 13069706166272 wanted 4196588 found 4196585
> Ignoring transid failure
> ERROR: child eb corrupted: parent bytenr=13069708722176 item=7 parent level=2 child level=0
> ERROR: cannot open file system
>
> Any simple fix for the filesystem? It'd be nice to recover the data
> that's hopefully still intact. I have some backups that I can dust
> off if it really comes down to it, but it's more convenient to
> recover the data in-place.
>
> This is complete speculation, but I do wonder if having the single
> cache device for multiple btrfs disks triggered the problem.

No, having a single cache device with multiple backing devices is the
most common way to use bcache. I've used a setup similar to yours for
a couple of years without problems (until it broke down recently due
to other issues).

Your current filesystem is probably too damaged to repair properly
right now (some other people here might be able to help with that),
but you probably haven't lost much of what's on it. You can dump the
files out with "btrfs restore", or you can use a patch that allows
you to mount the damaged filesystem read-only
(https://patchwork.kernel.org/patch/10738583/).

But before you try to restore anything, can you go back through your
kernel logs and check for errors? Either one of your devices is
failing, you have physical link issues, or you have bad memory. Even
with a complex setup like this you shouldn't be getting random
corruption like this.

> Thanks for any assistance.
>
> Steve
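The "btrfs restore" route suggested above is non-destructive: it copies files out of an unmountable filesystem rather than writing to it. A minimal sketch might look like the following; the device path and destination directory are hypothetical, and the -t step is only needed if the default tree root is unreadable.

```shell
# Copy files out of the broken filesystem without modifying it.
# -i: ignore errors where possible, -s: restore snapshots, -v: verbose.
btrfs restore -i -s -v /dev/mapper/crypt0 /mnt/recovery

# If the default root is too damaged, look for older tree roots and
# retry with one of the bytenr values that btrfs-find-root reports:
btrfs-find-root /dev/mapper/crypt0
BYTENR=...   # a root bytenr taken from the output above
btrfs restore -t "$BYTENR" -i -v /dev/mapper/crypt0 /mnt/recovery
```

Restore only needs the target filesystem to hold enough free space for the salvaged data; it never touches the damaged devices in write mode, so it is safe to attempt before any repair.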
* Re: corruption with multi-device btrfs + single bcache, won't mount
From: STEVE LEUNG @ 2019-02-11  5:22 UTC
To: Thiago Ramon; +Cc: Btrfs BTRFS

----- Original Message -----
> From: "Thiago Ramon" <thiagoramon@gmail.com>
>
> On Sun, Feb 10, 2019 at 5:07 AM STEVE LEUNG <sjleung@shaw.ca> wrote:
>>
>> Hi all,
>>
>> I decided to try something a bit crazy, and try multi-device raid1 btrfs on
>> top of dm-crypt and bcache. That is:
>>
>> btrfs -> dm-crypt -> bcache -> physical disks
>>
>> I have a single cache device in front of 4 disks. Maybe this wasn't
>> that good of an idea, because the filesystem went read-only a few
>> days after setting it up, and now it won't mount. I'd been running
>> btrfs on top of 4 dm-crypt-ed disks for some time without any
>> problems, and only added bcache (taking one device out at a time,
>> converting it over, adding it back) recently.
>>
>> This is complete speculation, but I do wonder if having the single
>> cache device for multiple btrfs disks triggered the problem.
>
> But before you try to restore anything, can you go back in your kernel
> logs and check for errors? Either one of your devices is failing, you
> might have physical link issues or bad memory. Even with a complex
> setup like this you shouldn't be getting random corruption like this.

Indeed, it looks like plugging in the 5th device for caching may have
destabilized things (maybe I'm drawing too much power from the power
supply or something), as I've observed some spurious ATA errors when
trying to boot from rescue media. Things seem to go back to normal if
I take the cache device out.

This hardware is old, but has seemed reliable enough. That said, this
is the second btrfs corruption I've run into (fortunately with no
data lost), so maybe the hardware is not as solid as I'd thought.

I guess I should have given it more of a shakedown before rolling out
bcache everywhere. :)

Thanks for the insight.

Steve
* Re: corruption with multi-device btrfs + single bcache, won't mount
From: Qu Wenruo @ 2019-02-10 13:52 UTC
To: STEVE LEUNG, linux-btrfs

On 2019/2/10 2:56 PM, STEVE LEUNG wrote:
> Hi all,
>
> I decided to try something a bit crazy, and try multi-device raid1 btrfs on
> top of dm-crypt and bcache. That is:
>
> btrfs -> dm-crypt -> bcache -> physical disks
>
> I have a single cache device in front of 4 disks. Maybe this wasn't
> that good of an idea, because the filesystem went read-only a few
> days after setting it up, and now it won't mount. I'd been running
> btrfs on top of 4 dm-crypt-ed disks for some time without any
> problems, and only added bcache (taking one device out at a time,
> converting it over, adding it back) recently.
>
> This was on Arch Linux x86-64, kernel 4.20.1.
>
> dmesg from a mount attempt (using -o usebackuproot,nospace_cache,clear_cache):
>
> [  267.355024] BTRFS info (device dm-5): trying to use backup root at mount time
> [  267.355027] BTRFS info (device dm-5): force clearing of disk cache
> [  267.355030] BTRFS info (device dm-5): disabling disk space caching
> [  267.355032] BTRFS info (device dm-5): has skinny extents
> [  271.446808] BTRFS error (device dm-5): parent transid verify failed on 13069706166272 wanted 4196588 found 4196585
> [  271.447485] BTRFS error (device dm-5): parent transid verify failed on 13069706166272 wanted 4196588 found 4196585

When this happens, there is no good way to completely recover the fs
(i.e. to have btrfs check pass after the recovery).

We should enhance btrfs-progs to handle it, but it will take some time.

> [  271.447491] BTRFS error (device dm-5): failed to read block groups: -5
> [  271.455868] BTRFS error (device dm-5): open_ctree failed
>
> btrfs check:
>
> parent transid verify failed on 13069706166272 wanted 4196588 found 4196585
> parent transid verify failed on 13069706166272 wanted 4196588 found 4196585
> parent transid verify failed on 13069706166272 wanted 4196588 found 4196585
> parent transid verify failed on 13069706166272 wanted 4196588 found 4196585
> Ignoring transid failure
> ERROR: child eb corrupted: parent bytenr=13069708722176 item=7 parent level=2 child level=0
> ERROR: cannot open file system
>
> Any simple fix for the filesystem? It'd be nice to recover the data
> that's hopefully still intact. I have some backups that I can dust
> off if it really comes down to it, but it's more convenient to
> recover the data in-place.

However, there is a patchset that addresses this kind of "common"
corruption scenario:

https://lwn.net/Articles/777265/

In that patchset there is a new rescue=bg_skip mount option (it needs
to be used together with ro), which should allow you to access
whatever you still have on the fs.

From other reporters, such corruption is mainly related to the extent
tree, so the data damage should be pretty small.

Thanks,
Qu

> This is complete speculation, but I do wonder if having the single
> cache device for multiple btrfs disks triggered the problem.
>
> Thanks for any assistance.
>
> Steve
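As a sketch, the read-only rescue mount described in the patchset would look like this once the patched kernel is running. The device and mount point are placeholders; bg_skip existed only in that patchset at the time (a later descendant of this rescue work was merged upstream as the rescue= mount-option family).

```shell
# The patched kernel accepts rescue=bg_skip, but only together with ro:
mount -o ro,rescue=bg_skip /dev/mapper/crypt0 /mnt/rescue

# Then salvage whatever is readable, expecting some reads to fail:
rsync -a --partial /mnt/rescue/ /mnt/backup-target/ || true
```

Because the extent tree (which tracks block group usage) is skipped, the filesystem cannot safely allocate space, which is why the option refuses to work on a read-write mount.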
* Re: corruption with multi-device btrfs + single bcache, won't mount
From: STEVE LEUNG @ 2019-02-11  5:25 UTC
To: Qu Wenruo; +Cc: linux-btrfs

----- Original Message -----
> From: "Qu Wenruo" <quwenruo.btrfs@gmx.com>
>
> On 2019/2/10 2:56 PM, STEVE LEUNG wrote:
>> Hi all,
>>
>> I decided to try something a bit crazy, and try multi-device raid1 btrfs on
>> top of dm-crypt and bcache. That is:
>>
>> btrfs -> dm-crypt -> bcache -> physical disks
>>
>> I have a single cache device in front of 4 disks. Maybe this wasn't
>> that good of an idea, because the filesystem went read-only a few
>> days after setting it up, and now it won't mount. I'd been running
>> btrfs on top of 4 dm-crypt-ed disks for some time without any
>> problems, and only added bcache (taking one device out at a time,
>> converting it over, adding it back) recently.
>
> However there is a patch to address this kinda "common" corruption scenario.
>
> https://lwn.net/Articles/777265/
>
> In that patchset, there is a new rescue=bg_skip mount option (needs to
> be used with ro), which should allow you to access whatever you still
> have from the fs.
>
> From other reporters, such corruption is mainly related to extent tree,
> thus data damage should be pretty small.

I can also report that this patch has allowed me to recover the data.
The devices were apparently flaky after the addition of the cache
device to the system, which explains why the filesystem got corrupted.

Thanks very much for the help!

Steve
* Re: corruption with multi-device btrfs + single bcache, won't mount
From: Steve Leung @ 2019-02-12  6:22 UTC
To: Qu Wenruo; +Cc: linux-btrfs

----- Original Message -----
> From: "Qu Wenruo" <quwenruo.btrfs@gmx.com>
> To: "STEVE LEUNG" <sjleung@shaw.ca>, linux-btrfs@vger.kernel.org
> Sent: Sunday, February 10, 2019 6:52:23 AM
> Subject: Re: corruption with multi-device btrfs + single bcache, won't mount
>
> On 2019/2/10 2:56 PM, STEVE LEUNG wrote:
>> Hi all,
>>
>> I decided to try something a bit crazy, and try multi-device raid1 btrfs on
>> top of dm-crypt and bcache. That is:
>>
>> btrfs -> dm-crypt -> bcache -> physical disks
>>
>> I have a single cache device in front of 4 disks. Maybe this wasn't
>> that good of an idea, because the filesystem went read-only a few
>> days after setting it up, and now it won't mount. I'd been running
>> btrfs on top of 4 dm-crypt-ed disks for some time without any
>> problems, and only added bcache (taking one device out at a time,
>> converting it over, adding it back) recently.
>>
>> This was on Arch Linux x86-64, kernel 4.20.1.
>>
>> dmesg from a mount attempt (using -o usebackuproot,nospace_cache,clear_cache):
>>
>> [  267.355024] BTRFS info (device dm-5): trying to use backup root at mount time
>> [  267.355027] BTRFS info (device dm-5): force clearing of disk cache
>> [  267.355030] BTRFS info (device dm-5): disabling disk space caching
>> [  267.355032] BTRFS info (device dm-5): has skinny extents
>> [  271.446808] BTRFS error (device dm-5): parent transid verify failed on
>> 13069706166272 wanted 4196588 found 4196585
>> [  271.447485] BTRFS error (device dm-5): parent transid verify failed on
>> 13069706166272 wanted 4196588 found 4196585
>
> When this happens, there is no good way to completely recover (btrfs
> check pass after the recovery) the fs.
>
> We should enhance btrfs-progs to handle it, but it will take some time.
>
>> [  271.447491] BTRFS error (device dm-5): failed to read block groups: -5
>> [  271.455868] BTRFS error (device dm-5): open_ctree failed
>>
>> btrfs check:
>>
>> parent transid verify failed on 13069706166272 wanted 4196588 found 4196585
>> parent transid verify failed on 13069706166272 wanted 4196588 found 4196585
>> parent transid verify failed on 13069706166272 wanted 4196588 found 4196585
>> parent transid verify failed on 13069706166272 wanted 4196588 found 4196585
>> Ignoring transid failure
>> ERROR: child eb corrupted: parent bytenr=13069708722176 item=7 parent level=2
>> child level=0
>> ERROR: cannot open file system
>>
>> Any simple fix for the filesystem? It'd be nice to recover the data
>> that's hopefully still intact. I have some backups that I can dust
>> off if it really comes down to it, but it's more convenient to
>> recover the data in-place.
>
> However there is a patch to address this kinda "common" corruption scenario.
>
> https://lwn.net/Articles/777265/
>
> In that patchset, there is a new rescue=bg_skip mount option (needs to
> be used with ro), which should allow you to access whatever you still
> have from the fs.
>
> From other reporters, such corruption is mainly related to extent tree,
> thus data damage should be pretty small.

OK, I think I spoke too soon. Some files are recoverable, but many
cannot be read. Userspace gets back an I/O error, and the kernel log
reports similar parent transid verify failed errors, with what seem
to be similar generation numbers to those in my original mount error.
That is, it wants 4196588 and finds something that's off by usually 2
or 3; occasionally there's one that's off by about 1300.

There are multiple snapshots on this filesystem (going back a few
days), and the same file in each snapshot seems to be equally
affected, even if the file hasn't changed in many months.

Metadata seems to be intact - I can stat every file in one of the
snapshots and I don't get any errors back.

Any other ideas? It seems like "btrfs restore" would be suitable
here, but it sounds like it would need to be taught about
rescue=bg_skip first.

Thanks for all the help. Even a partial recovery is a lot better than
what I was facing before.

Steve
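The generation gaps described above can be pulled out of a kernel log mechanically, which makes it easier to see how far back the oldest lost commit lies. A small sketch; the sample input is reproduced from the log excerpts earlier in this thread, and in practice you would pipe in dmesg or journalctl -k output instead.

```shell
# Print the wanted generation, found generation, and their gap for
# every "parent transid verify failed" line on stdin.
awk '/parent transid verify failed/ {
    for (i = 1; i <= NF; i++) {
        if ($i == "wanted") w = $(i + 1) + 0
        if ($i == "found")  f = $(i + 1) + 0
    }
    print "wanted", w, "found", f, "gap", w - f
}' <<'EOF'
[  271.446808] BTRFS error (device dm-5): parent transid verify failed on 13069706166272 wanted 4196588 found 4196585
[  271.447485] BTRFS error (device dm-5): parent transid verify failed on 13069706166272 wanted 4196588 found 4196585
EOF
# -> wanted 4196588 found 4196585 gap 3   (printed once per input line)
```

The smallest "found" generation seen across the whole log gives a rough idea of when the earliest missed write happened, which is the same observation Qu makes in the reply below.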
* Re: corruption with multi-device btrfs + single bcache, won't mount
From: Qu Wenruo @ 2019-02-12  6:51 UTC
To: Steve Leung; +Cc: linux-btrfs

On 2019/2/12 2:22 PM, Steve Leung wrote:
> ----- Original Message -----
>> From: "Qu Wenruo" <quwenruo.btrfs@gmx.com>
>> To: "STEVE LEUNG" <sjleung@shaw.ca>, linux-btrfs@vger.kernel.org
>> Sent: Sunday, February 10, 2019 6:52:23 AM
>> Subject: Re: corruption with multi-device btrfs + single bcache, won't mount
>>
>> On 2019/2/10 2:56 PM, STEVE LEUNG wrote:
>>> Hi all,
>>>
>>> I decided to try something a bit crazy, and try multi-device raid1 btrfs on
>>> top of dm-crypt and bcache. That is:
>>>
>>> btrfs -> dm-crypt -> bcache -> physical disks
>>>
>>> I have a single cache device in front of 4 disks. Maybe this wasn't
>>> that good of an idea, because the filesystem went read-only a few
>>> days after setting it up, and now it won't mount. I'd been running
>>> btrfs on top of 4 dm-crypt-ed disks for some time without any
>>> problems, and only added bcache (taking one device out at a time,
>>> converting it over, adding it back) recently.
>>>
>>> This was on Arch Linux x86-64, kernel 4.20.1.
>>>
>>> dmesg from a mount attempt (using -o usebackuproot,nospace_cache,clear_cache):
>>>
>>> [  267.355024] BTRFS info (device dm-5): trying to use backup root at mount time
>>> [  267.355027] BTRFS info (device dm-5): force clearing of disk cache
>>> [  267.355030] BTRFS info (device dm-5): disabling disk space caching
>>> [  267.355032] BTRFS info (device dm-5): has skinny extents
>>> [  271.446808] BTRFS error (device dm-5): parent transid verify failed on
>>> 13069706166272 wanted 4196588 found 4196585
>>> [  271.447485] BTRFS error (device dm-5): parent transid verify failed on
>>> 13069706166272 wanted 4196588 found 4196585
>>
>> When this happens, there is no good way to completely recover (btrfs
>> check pass after the recovery) the fs.
>>
>> We should enhance btrfs-progs to handle it, but it will take some time.
>>
>>> [  271.447491] BTRFS error (device dm-5): failed to read block groups: -5
>>> [  271.455868] BTRFS error (device dm-5): open_ctree failed
>>>
>>> btrfs check:
>>>
>>> parent transid verify failed on 13069706166272 wanted 4196588 found 4196585
>>> parent transid verify failed on 13069706166272 wanted 4196588 found 4196585
>>> parent transid verify failed on 13069706166272 wanted 4196588 found 4196585
>>> parent transid verify failed on 13069706166272 wanted 4196588 found 4196585
>>> Ignoring transid failure
>>> ERROR: child eb corrupted: parent bytenr=13069708722176 item=7 parent level=2
>>> child level=0
>>> ERROR: cannot open file system
>>>
>>> Any simple fix for the filesystem? It'd be nice to recover the data
>>> that's hopefully still intact. I have some backups that I can dust
>>> off if it really comes down to it, but it's more convenient to
>>> recover the data in-place.
>>
>> However there is a patch to address this kinda "common" corruption scenario.
>>
>> https://lwn.net/Articles/777265/
>>
>> In that patchset, there is a new rescue=bg_skip mount option (needs to
>> be used with ro), which should allow you to access whatever you still
>> have from the fs.
>>
>> From other reporters, such corruption is mainly related to extent tree,
>> thus data damage should be pretty small.
>
> Ok I think I spoke too soon. Some files are recoverable, but many
> cannot be read. Userspace gets back an I/O error, and the kernel log
> reports similar parent transid verify failed errors, with what seem
> to be similar generation numbers to what I saw in my original mount
> error.
>
> i.e. wants 4196588, found something that's off by usually 2 or 3.
> Occasionally there's one that's off by about 1300.

That's more or less expected for such transid corruption. The fs is
already screwed up. The lowest generation you found among all these
error messages could be from when the first corruption happened.
(And it may date back to very old days.)

> There are multiple snapshots on this filesystem (going back a few
> days), and the same file in each snapshot seems to be equally
> affected, even if the file hasn't changed in many months.
>
> Metadata seems to be intact - I can stat every file in one of the
> snapshots and I don't get any errors back.
>
> Any other ideas? It kind of seems like "btrfs restore" would be
> suitable here, but it sounds like it would need to be taught about
> rescue=bg_skip first.

Since v4.16.1, btrfs restore has been able to ignore the extent tree
completely, so you can try it. btrfs restore may even be a little
better here, since it can ignore csum errors completely.

Thanks,
Qu

> Thanks for all the help. Even a partial recovery is a lot better
> than what I was facing before.
>
> Steve
End of thread, newest: 2019-02-12  6:51 UTC

Thread overview: 7+ messages
2019-02-10  6:56 corruption with multi-device btrfs + single bcache, won't mount STEVE LEUNG
2019-02-10 10:35 ` Thiago Ramon
2019-02-11  5:22   ` STEVE LEUNG
2019-02-10 13:52 ` Qu Wenruo
2019-02-11  5:25   ` STEVE LEUNG
2019-02-12  6:22   ` Steve Leung
2019-02-12  6:51     ` Qu Wenruo