From: Duncan <1i5t5.duncan@cox.net>
To: linux-btrfs@vger.kernel.org
Subject: Re: fs unreadable after powercycle: BTRFS (device sda): parent transid verify failed on 427084513280 wanted 390924 found 390922
Date: Sun, 9 Aug 2015 02:56:49 +0000 (UTC)

Martin Tippmann posted on Sat, 08 Aug 2015 20:43:34 +0200 as excerpted:

> Hi, after a hard reboot (powercycle) a btrfs volume did not come up
> again:
>
> It's a single 4TB disk - only btrfs with lzo - data=single,metadata=dup
>
> [ 121.831814] BTRFS info (device sda): disk space caching is enabled
> [ 121.857820] BTRFS (device sda): parent transid verify failed on
> 427084513280 wanted 390924 found 390922
> [ 121.861607] BTRFS (device sda): parent transid verify failed on
> 427084513280 wanted 390924 found 390922
> [ 121.861715] BTRFS: failed to read tree root on sda
> [ 121.878111] BTRFS: open_ctree failed
>
> btrfs-progs v4.0, Kernel: 4.1.4
>
> I'm quite sure that the HDD is fine (no SMART problems, the disk error
> log is empty, and it's a new enterprise drive that worked well in the
> past days/weeks).
>
> So I'm kind of at a loss what to do:
>
> How can I recover from that problem? I've found just a note in the
> FAQ[1] but no solution to the problem.

[The FAQ reference was to the wiki problem-faq's transid-failure
explanation, which explains the error but doesn't say what to do about
it.]

Did you try the recovery mount option suggested earlier in the
problem-faq, under mount problems?

https://btrfs.wiki.kernel.org/index.php/Problem_FAQ#I_can.27t_mount_my_filesystem.2

For transid failures, that's what I'd try first, since it scans
previous tree roots and tries to use the first one it can read.  Since
the transid it wants (390924) is only a couple ahead of what it finds
(390922), and the recovery mount option scans backward in the tree-root
history to see if it can find one that works, that could well solve the
problem.

If not, then as Hugo mentions, given that find-tree-root looks good,
btrfs restore has a good chance of working.  I've used it myself to
good effect a couple of times when a btrfs refused to mount (I have
backups if I have to use 'em, but recovery or restore, when they work,
will normally leave me with more current copies, since I tend to let my
backups get somewhat stale).  There's a page on the wiki on using
restore together with find-root if necessary, but that page is a bit
dated.  The btrfs-restore manpage should be current, but it doesn't
have the detail on using it with find-root that the wiki page has.
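Concretely, and assuming the device really is plain /dev/sda as in the
log above, with /mnt as a scratch mountpoint and /mnt/recovery as a
directory on some *other* filesystem with enough room (both of those
paths are just placeholders, adjust to whatever you actually have), the
sequence I'd try looks roughly like this:

# 1) try the backup tree roots first, read-only so nothing gets written
mount -o ro,recovery /dev/sda /mnt

# 2) if that mounts, copy off anything you need, then unmount and
#    decide whether to keep the filesystem or mkfs and start fresh
umount /mnt

# 3) if it won't mount at all, see what btrfs restore can pull off
#    (-D is a dry run that only lists what it would recover,
#     -v is verbose, -i ignores errors and keeps going)
btrfs restore -D -v /dev/sda /mnt/recovery
btrfs restore -v -i /dev/sda /mnt/recovery

# 4) if restore can't find a usable tree either, hunt for older roots
#    and feed a promising-looking bytenr back to restore with -t
btrfs-find-root /dev/sda
btrfs restore -t <bytenr> -v -i /dev/sda /mnt/recovery

The nice thing about restore is that it only ever writes to the
destination, never to the sick filesystem, so unlike a repair attempt
it can't make things any worse.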
> Maybe someone can give some clues why does this happen in the first
> place?
> Is it unfortunate timing due to the abrupt power cycle?
> Shouldn't CoW protect against this somewhat?

As Hugo says, in theory CoW should protect against this, but the
combination of possible bugs in a still not fully stable and mature
btrfs, and possibly buggy hardware, means theory and practice don't
always line up as well as they should, in theory.  (How's that for an
ouroboros, aka snake-eating-its-tail circular-reference, explanation?
=:^)  But the recovery mount option is a reasonable first recovery
(now ouroboroi =:^) option, and btrfs restore isn't too bad to work
with if that fails.

Referencing the hardware write-caching option you mentioned later:
yes, turning that off can help... in theory... but it also tends to
have a DRAMATICALLY bad effect on spinning-rust write performance (I
don't know enough about SSD write caching to venture a guess), and in
some cases it voids warranties due to the additional thrashing it's
likely to cause, so do your research before turning it off.  In
general it's not a good idea, as it's simply not worth it.  Both Linux
at the generic I/O level and the various filesystem stacks are
designed to work around all but the worst hardware I/O-barrier
failures, and the write slowdown and increased disk thrashing are
simply not worth it in most cases.  If the hardware is actually bad
enough that it's worth it, I'd strongly consider different hardware.

-- 
Duncan - List replies preferred.  No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman