linux-btrfs.vger.kernel.org archive mirror
From: "Alan Hardman" <alanh@fastmail.com>
To: "Hugo Mills" <hugo@carfax.org.uk>,
	"Btrfs BTRFS" <linux-btrfs@vger.kernel.org>,
	"Chris Murphy" <lists@colorremedies.com>
Subject: Re: RAID1 filesystem not mounting
Date: Sun, 03 Feb 2019 00:40:02 -0500
Message-ID: <276ebe4c-2d16-4850-8aa1-32794f3f2502@www.fastmail.com>
In-Reply-To: <CAJCQCtRL4bd+2_g8+YZ5uV254E_nK3OxS9Q0AB5dqN-N4zcd0Q@mail.gmail.com>

[-- Attachment #1: Type: text/plain, Size: 5329 bytes --]

Thanks for the quick response, Chris and Hugo!

After some testing, there *was* a RAM issue, which has now been resolved. It shouldn't be a factor going forward, but it could definitely have been related. The high number of lifetime errors for the filesystem is expected and isn't related to this issue: a bad power supply caused a disk to go completely offline during a balance operation. That was fully recovered via scrub, and the error counts hadn't increased between then and this new issue (several months and several TB written without an error).
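
For reference, the lifetime error counters mentioned above are the ones the kernel prints at mount time, in the "bdev /dev/sdc errs: ..." line of the dmesg output quoted later in this message. The following is only a minimal sketch of how that one line could be split into named counters for comparison over time; the function name parse_dev_errs is hypothetical, and the regular expression assumes the exact line layout shown in this thread.

```python
import re

# Line copied from the dmesg output quoted later in this message.
LINE = ("[  536.306194] BTRFS info (device sdd): bdev /dev/sdc errs: "
        "wr 23038942, rd 22208378, flush 1, corrupt 29486730, gen 2933")


def parse_dev_errs(line):
    """Return (device path, {counter name: value}) from a 'bdev ... errs:' line.

    Hypothetical helper: assumes the comma-separated 'name value' layout
    seen in this thread's dmesg output.
    """
    match = re.search(r"bdev (\S+) errs: (.+)$", line)
    if match is None:
        raise ValueError("not a btrfs device error summary line")
    counters = {}
    for field in match.group(2).split(","):
        name, value = field.split()
        counters[name] = int(value)
    return match.group(1), counters


device, counters = parse_dev_errs(LINE)
print(device, counters)
```

Saving the parsed values after each mount would make it easy to confirm whether any counter has grown since the earlier power supply incident.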

I've attached the full output from Chris's recommendations; here are a couple of excerpts:

# btrfs rescue super -v /dev/sdb
...
All supers are valid, no need to recover

# journalctl | grep -A 15 exception
...
Jan 23 01:06:37 localhost kernel: ata3.00: status: { DRDY }
Jan 23 01:06:37 localhost kernel: ata3.00: failed command: WRITE FPDMA QUEUED
Jan 23 01:06:37 localhost kernel: ata3.00: cmd 61/b0:98:ea:7a:48/00:00:0a:00:00/40 tag 19 ncq dma 90112 out
                                           res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
--
Jan 31 19:24:32 localhost kernel: ata5.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
Jan 31 19:24:32 localhost kernel: ata5.00: failed command: READ DMA EXT
Jan 31 19:24:32 localhost kernel: ata5.00: cmd 25/00:08:a8:2a:81/00:00:a3:03:00/e0 tag 0 dma 4096 in
                                           res 40/00:01:00:00:00/00:00:00:00:00/10 Emask 0x4 (timeout)
Jan 31 19:24:32 localhost kernel: ata5.00: status: { DRDY }
Jan 31 19:24:32 localhost kernel: ata5: link is slow to respond, please be patient (ready=0)
Jan 31 19:24:32 localhost kernel: ata5: device not ready (errno=-16), forcing hardreset
Jan 31 19:24:32 localhost kernel: ata5: soft resetting link
Jan 31 19:24:32 localhost kernel: ata5.00: configured for UDMA/33
Jan 31 19:24:32 localhost kernel: ata5.01: configured for UDMA/33
Jan 31 19:24:32 localhost kernel: sd 4:0:0:0: [sde] tag#0 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
Jan 31 19:24:32 localhost kernel: sd 4:0:0:0: [sde] tag#0 Sense Key : Illegal Request [current]
Jan 31 19:24:32 localhost kernel: sd 4:0:0:0: [sde] tag#0 Add. Sense: Unaligned write command
Jan 31 19:24:32 localhost kernel: sd 4:0:0:0: [sde] tag#0 CDB: Read(16) 88 00 00 00 00 03 a3 81 2a a8 00 00 00 08 00 00

This last journalctl result was from the first system boot when the filesystem stopped being mountable. The filesystem had been remounted as read-only automatically after a few errors (see btrfs-journal.log in archive). None of my other system log files were relevant from what I could tell, so I limited this to journalctl's output.
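
To make the journalctl excerpt above easier to scan, here is a rough sketch that counts the "failed command" lines per ATA device. The sample text is limited to the two lines shown in this message, the function name failed_commands is hypothetical, and the pattern assumes the kernel message layout seen in the excerpt; the full journal may contain other formats.

```python
import re
from collections import Counter

# Two lines copied from the journalctl excerpt above.
SAMPLE = """\
Jan 23 01:06:37 localhost kernel: ata3.00: failed command: WRITE FPDMA QUEUED
Jan 31 19:24:32 localhost kernel: ata5.00: failed command: READ DMA EXT
"""

# Assumed layout: "kernel: ataN.MM: failed command: <COMMAND NAME>"
FAILED_CMD = re.compile(r"kernel: (ata\d+\.\d+): failed command: (.+)$")


def failed_commands(log_text):
    """Count (ATA device, command) pairs reported as failed in the log text."""
    counts = Counter()
    for line in log_text.splitlines():
        match = FAILED_CMD.search(line)
        if match:
            counts[(match.group(1), match.group(2).strip())] += 1
    return counts


counts = failed_commands(SAMPLE)
for (device, command), n in sorted(counts.items()):
    print(f"{device}: {command} x{n}")
```

On the excerpt alone this shows failures on two different ports (ata3 and ata5), about a week apart.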

I have been able to successfully recover files via "btrfs restore ...", and nothing essential seems to be missing from its full output with -D. So if the filesystem can't be recovered directly, offloading everything that way at least seems possible.

Thanks for the help!

On Sat, Feb 2, 2019, at 17:26, Chris Murphy wrote:
> On Sat, Feb 2, 2019 at 5:02 AM Hugo Mills <hugo@carfax.org.uk> wrote:
> >
> > On Fri, Feb 01, 2019 at 11:28:27PM -0500, Alan Hardman wrote:
> > > I have a Btrfs filesystem using 6 partitionless disks in RAID1 that's failing to mount. I've tried the common recommended safe check options, but I haven't gotten the disk to mount at all, even with -o ro,recovery. If necessary, I can try to use the recovery to another filesystem, but I have around 18 TB of data on the filesystem that won't mount, so I'd like to avoid that if there's some other way of recovering it.
> > >
> > > Versions:
> > > btrfs-progs v4.19.1
> > > Linux localhost 4.20.6-arch1-1-ARCH #1 SMP PREEMPT Thu Jan 31 08:22:01 UTC 2019 x86_64 GNU/Linux
> > >
> > > Based on my understanding of how RAID1 works with Btrfs, I would expect a single disk failure to not prevent the volume from mounting entirely, but I'm only seeing one disk with errors according to dmesg output, maybe I'm misinterpreting it:
> > >
> > > [  534.519437] BTRFS warning (device sdd): 'recovery' is deprecated, use 'usebackuproot' instead
> > > [  534.519441] BTRFS info (device sdd): trying to use backup root at mount time
> > > [  534.519443] BTRFS info (device sdd): disk space caching is enabled
> > > [  534.519446] BTRFS info (device sdd): has skinny extents
> > > [  536.306194] BTRFS info (device sdd): bdev /dev/sdc errs: wr 23038942, rd 22208378, flush 1, corrupt 29486730, gen 2933
> > > [  556.126928] BTRFS critical (device sdd): corrupt leaf: root=2 block=25540634836992 slot=45, unexpected item end, have 13882 expect 13898
> >
> >    It's worth noting that 13898-13882 = 16, which is a power of
> > two. This means that you most likely have a single-bit error in your
> > metadata. That, plus the checksum not being warned about, would
> > strongly suggest that you have bad RAM. I would recommend that you
> > check your RAM first before trying anything else that would write to
> > your filesystem (including btrfs check --repair).
> 
> Good catch!
> 
> I think that can account for the corrupt and generation errors. I
> don't know that memory errors can account for the large number of read
> and write errors, however. So there may be more than one problem.
> 
> 
> -- 
> Chris Murphy
>
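
Hugo's arithmetic in the quoted text above can be checked mechanically. This is only a sketch of the heuristic, not a diagnosis: a difference that is a power of two is consistent with a single flipped bit in one of the values that add up to the item end, but it does not prove one. The function name single_bit_delta is hypothetical; the two numbers come from the "corrupt leaf" line in the quoted dmesg output.

```python
def single_bit_delta(have, expect):
    """Return the bit position if |expect - have| is a power of two, else None."""
    delta = abs(expect - have)
    # A power of two has exactly one bit set, so delta & (delta - 1) is zero.
    if delta == 0 or delta & (delta - 1):
        return None
    return delta.bit_length() - 1


# Values from: "unexpected item end, have 13882 expect 13898"
bit = single_bit_delta(have=13882, expect=13898)
print(bit)
```

Here the difference is 16, which corresponds to bit position 4.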

[-- Attachment #2: btrfs.tar.gz --]
[-- Type: application/x-gzip, Size: 11799 bytes --]
