Re: Unusual crash -- data rolled back ~2 weeks?

From: Zygo Blaxell <ce3g8jdj@umail.furryterror.org>
To: Timothy Pearson <tpearson@raptorengineering.com>
Cc: Qu Wenruo <quwenruo.btrfs@gmx.com>,
	linux-btrfs <linux-btrfs@vger.kernel.org>
Subject: Re: Unusual crash -- data rolled back ~2 weeks?
Date: Sun, 10 Nov 2019 15:10:17 -0500
Message-ID: <20191110201017.GV22121@hungrycats.org>
In-Reply-To: <825354711.177110.1573380131178.JavaMail.zimbra@raptorengineeringinc.com>

On Sun, Nov 10, 2019 at 04:02:11AM -0600, Timothy Pearson wrote:
> 
> 
> ----- Original Message -----
> > From: "Qu Wenruo" <quwenruo.btrfs@gmx.com>
> > To: "Timothy Pearson" <tpearson@raptorengineering.com>
> > Cc: "linux-btrfs" <linux-btrfs@vger.kernel.org>
> > Sent: Sunday, November 10, 2019 1:45:14 AM
> > Subject: Re: Unusual crash -- data rolled back ~2 weeks?
> 
> > On 2019/11/10 3:18 PM, Timothy Pearson wrote:
> >> 
> >> 
> >> ----- Original Message -----
> >>> From: "Qu Wenruo" <quwenruo.btrfs@gmx.com>
> >>> To: "Timothy Pearson" <tpearson@raptorengineering.com>
> >>> Cc: "linux-btrfs" <linux-btrfs@vger.kernel.org>
> >>> Sent: Sunday, November 10, 2019 6:54:55 AM
> >>> Subject: Re: Unusual crash -- data rolled back ~2 weeks?
> >> 
> >>> On 2019/11/10 2:47 PM, Timothy Pearson wrote:
> >>>>
> >>>>
> >>>> ----- Original Message -----
> >>>>> From: "Qu Wenruo" <quwenruo.btrfs@gmx.com>
> >>>>> To: "Timothy Pearson" <tpearson@raptorengineering.com>, "linux-btrfs"
> >>>>> <linux-btrfs@vger.kernel.org>
> >>>>> Sent: Saturday, November 9, 2019 9:38:21 PM
> >>>>> Subject: Re: Unusual crash -- data rolled back ~2 weeks?
> >>>>
> >>>>> On 2019/11/10 6:33 AM, Timothy Pearson wrote:
> >>>>>> We just experienced a very unusual crash on a Linux 5.3 file server using NFS to
> >>>>>> serve a BTRFS filesystem.  NFS went into deadlock (D wait) with no apparent
> >>>>>> underlying disk subsystem problems, and when the server was hard rebooted to
> >>>>>> clear the D wait the BTRFS filesystem remounted itself in the state that it was
> >>>>>> in approximately two weeks earlier (!).
> >>>>>
> >>>>> This means during two weeks, the btrfs is not committed.
> >>>>
> >>>> Is there any hope of getting the data from that interval back via btrfs-recover
> >>>> or a similar tool, or does the lack of commit mean the data was stored in RAM
> >>>> only and is therefore gone after the server reboot?

Writeback will dump out some data blocks between commits; however, without
a commit, there will be no metadata pages on disk that point to the data.

Writeback could keep a fileserver running for a long time, as long as
nobody calls sync() or a nontrivial fsync() (one too complex to be sent
to the log tree), or renames a file over another existing file.  Any of
these may trigger a commit if reservations fill up, and as soon as one
of them does, something should fail noticeably because the call will block.

> >>> If it's deadlock preventing new transaction to be committed, then no
> >>> metadata is even written back to disk, so no way to recover metadata.
> >>> Maybe you can find some data written, but without metadata it makes no
> >>> sense.
> >> 
> >> OK, I'll just assume the data written in that window is unrecoverable at this
> >> point then.
> >> 
> >> Would the commit deadlock affect only one btrfs filesystem or all of them on the
> >> machine?  I take it there is no automatic dmesg spew on extended deadlock?
> >> dmesg was completely clean at the time of the fault / reboot.

Stepping away from btrfs a bit, I've heard rumors of something like this
happening to SSDs (on Windows, so not a btrfs issue).  I guess it may
be possible for a log-structured FTL layer to revert to a significantly
earlier disk content state if there are enough free erase blocks so that
the older data isn't destroyed, and the pointer to the current log record
isn't updated in persistent storage due to a firmware bug.  Obviously this
is not relevant if you're not using an SSD, and not likely if you have a
multi-disk filesystem (one disk will appear to be corrupted in that case).

> > It should have some kernel message for things like process hang for over
> > 120s.
> > If you could recover that, it would help us to locate the cause.
> > 
> > Normally such deadlock should only affect the unlucky fs which meets the
> > condition, not all filesystems.
> > But if you're unlucky enough, it may happen to other filesystems.
> > 
> > Anyway, without enough info, it's really hard to say.
> 
> I was able to retrieve complete logs from the kernel for the entire time period.  The BTRFS filesystem was online resized five days before the last apparent filesystem commit.  Immediately after resize, a couple of csum errors were thrown for a single inode on the resized filesystem, though this was not detected at the time.  The underlying hardware did not experience a fault at any point and is passing all diagnostics at this time.  Intriguingly, there are a handful of files accessible from after the last known good filesystem commit (Oct. 29), but the vast majority are simply absent.
> 
> At this point I'm more interested in making sure this type of event does not happen in the future than anything else.  At no point did the kernel print any type of stack trace or deadlock warning.  I'm starting to wonder if we hit a bug in the online resize path, but am just guessing at this point.  The timing is certainly very close / coincidental.

To detect this kind of failure we use a watchdog script that invokes mkdir
and rmdir every 30 seconds on each filesystem backed by disk (i.e. btrfs,
ext4, and xfs).  If the mkdir/rmdir takes too long (*) then we try to
log some information (mostly 'echo w > /proc/sysrq-trigger') and force
a reboot.  mkdir and rmdir will eventually get stuck on btrfs if there
is a commit that is not making forward progress.  It's a surprisingly
simple and effective bug detector on ext4 and xfs too.

(This doesn't detect the SSD thing--you'd need RAID1 to handle that case).
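
For illustration, here is a minimal Python sketch of a watchdog along the
lines described above.  It is not our actual script.  The function names,
the 100-second timeout default, and the on_stuck hook are all assumptions
made for the example; only the mkdir/rmdir probe, the 30-second interval,
and the 'echo w > /proc/sysrq-trigger' step come from the description.

```python
import os
import subprocess
import sys
import time

# The child does the mkdir/rmdir so that only the child can get stuck.
_PROBE = "import os, sys; os.mkdir(sys.argv[1]); os.rmdir(sys.argv[1])"


def probe(mount_point, timeout_s):
    """Time one mkdir+rmdir under mount_point; return None on timeout."""
    name = ".watchdog-probe-%d-%d" % (os.getpid(), time.time_ns())
    path = os.path.join(mount_point, name)
    start = time.monotonic()
    child = subprocess.Popen([sys.executable, "-c", _PROBE, path])
    try:
        # Popen.wait(timeout=...) raises without killing the child.  We do
        # not try to reap it: a task in D state ignores SIGKILL, and waiting
        # for it would hang the watchdog as well.
        child.wait(timeout=timeout_s)
    except subprocess.TimeoutExpired:
        return None
    if child.returncode != 0:
        raise OSError("probe failed on %s" % mount_point)
    return time.monotonic() - start


def dump_blocked_tasks():
    """Same effect as 'echo w > /proc/sysrq-trigger' (needs root)."""
    with open("/proc/sysrq-trigger", "w") as f:
        f.write("w")


def watch(mount_points, interval_s=30, timeout_s=100, on_stuck=None):
    """Probe each mount point forever; on a timeout, log and call on_stuck."""
    while True:
        for mp in mount_points:
            if probe(mp, timeout_s) is None:
                dump_blocked_tasks()
                if on_stuck is not None:
                    on_stuck(mp)  # hypothetical hook: alert admins, reboot
        time.sleep(interval_s)
```

Something like watch(["/srv/data"]) run as root would be the entry point,
where /srv/data is a placeholder path.  The probe runs in a child process
on purpose: if the watchdog made the blocking call itself, it would end up
in D state together with everything else.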

The lack of kernel messages is unexpected, especially when you have an NFS
process stuck in D state long enough to get admins to force a reboot.
That should have produced at least a stuck task warning if those warnings
are enabled in your kernel.  Did anyone capture the nfsd process stack trace?
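
If it happens again, something like the following Python sketch could
capture the stacks of D-state tasks before the reboot.  It assumes the
standard /proc/<pid>/stat layout and that /proc/<pid>/stack is readable
(which normally needs root and a kernel built with stack trace support).
The function names are made up for the example.

```python
import os


def parse_stat(text):
    """Return (comm, state) from the contents of /proc/<pid>/stat."""
    # comm is wrapped in parentheses and may contain spaces or ')',
    # so split on the first '(' and the last ')'.
    left = text.index("(")
    right = text.rindex(")")
    return text[left + 1:right], text[right + 1:].split()[0]


def read_or_none(path):
    """Read a file, or return None if it is missing or unreadable."""
    try:
        with open(path) as f:
            return f.read()
    except OSError:
        return None


def blocked_tasks(proc="/proc"):
    """List (pid, comm, stack) for every process entry in D state."""
    found = []
    for entry in sorted(os.listdir(proc)):
        if not entry.isdigit():
            continue
        stat = read_or_none(os.path.join(proc, entry, "stat"))
        if stat is None:
            continue  # the process exited while we were scanning
        comm, state = parse_stat(stat)
        if state == "D":
            stack = read_or_none(os.path.join(proc, entry, "stack"))
            found.append((int(entry), comm, stack))
    return found
```

This only walks the top-level /proc entries, not the per-thread entries
under /proc/<pid>/task, so it is a starting point rather than a complete
tool.  The stack comes back as None when the file cannot be read.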

(*) "Too long" can be surprisingly long.  Some btrfs algorithms don't have
bounded running time and can delay a commit for several hours if there
are active writers on the system.  We log commits that take over 100
seconds, send alarms to admins at one hour, and reboot automatically
after 12 hours.

> Thanks
