Linux-BTRFS Archive on lore.kernel.org
From: Chris Murphy <lists@colorremedies.com>
To: Timothy Pearson <tpearson@raptorengineering.com>
Cc: Qu Wenruo <quwenruo.btrfs@gmx.com>,
	linux-btrfs <linux-btrfs@vger.kernel.org>
Subject: Re: Unusual crash -- data rolled back ~2 weeks?
Date: Tue, 12 Nov 2019 11:30:17 +0000
Message-ID: <CAJCQCtRxAyUPeTvZFKSi-GGdFpQGKAOz5pUEaj43xeT7xhO4eg@mail.gmail.com>
In-Reply-To: <741683181.533799.1573514917384.JavaMail.zimbra@raptorengineeringinc.com>

On Mon, Nov 11, 2019 at 11:28 PM Timothy Pearson
<tpearson@raptorengineering.com> wrote:
>
> Here's the final information we gleaned from the disk image -- that is now being archived and we're moving on from this failure.
>
> It doesn't look like a general commit failure, it looks like somehow specific directories were corrupted / automatically rolled back.  Again I wonder how much of this is due to the online resize; needless to say, we won't be doing that again -- future procedure will be to isolate the existing array, format a new array, transfer files, then restart the services.

I'm skeptical of resize being involved for a couple of reasons:
a) it should have resulted in immediate problems, not days later
b) resize involves the same code as balance and device removal. The
first step is to identify any chunks in physical areas that will no
longer exist after the resize, move those chunks to areas with free
space that will continue to exist, and update all the metadata that
points to those chunks. It's essentially identical to a filtered
balance.

Therefore, if there's a bug in resize, there's also a bug in balance
and device removal. And if that's true I think we'd have other people
running into it.
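
To make the first step in (b) concrete, here is a toy sketch of how
chunks that sit past the new end of a device could be picked out for
relocation during a shrink. This is not the kernel's code; the Chunk
type, its fields, and chunks_to_relocate() are hypothetical names made
up for illustration, and real chunk mapping is more involved.

```python
from typing import List, NamedTuple


class Chunk(NamedTuple):
    """Hypothetical, simplified view of one chunk's placement on a device."""
    physical_start: int  # byte offset on the device
    length: int          # size in bytes


def chunks_to_relocate(chunks: List[Chunk], new_device_size: int) -> List[Chunk]:
    """Return chunks that extend past the end of the shrunken device.

    Any chunk whose last byte would fall at or beyond new_device_size
    occupies space that will no longer exist, so it has to be moved first.
    """
    return [c for c in chunks if c.physical_start + c.length > new_device_size]


GIB = 1024 ** 3
# Illustrative layout only: three 1 GiB chunks at offsets 0, 5 and 9 GiB.
chunks = [Chunk(0, GIB), Chunk(5 * GIB, GIB), Chunk(9 * GIB, GIB)]

# Shrinking to 8 GiB leaves only the chunk at 9 GiB needing relocation.
print(chunks_to_relocate(chunks, 8 * GIB))
```

The selection is the same kind of filter a balance restricted to a
physical range would apply, which is why the two operations are so
closely related.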



>
> btrfs-find-root returned the following:
>
> =====
> These generations showed the missing files and also contained files from after the crash and restart:
> Well block 114904137728(gen: 295060 level: 1) seems good, but generation/level doesn't match, want gen: 294909 level: 1

It's really suspicious that it wants a LOWER generation number than
what it has. And it's not a huge difference, just 151 generations,
which isn't likely weeks. For a system root, that's maybe an hour or
two of time? Or if it's not used that much, it could be a couple of days.


> Well block 114679480320(gen: 295059 level: 1) seems good, but generation/level doesn't match, want gen: 294909 level: 1
> Well block 114592710656(gen: 295058 level: 1) seems good, but generation/level doesn't match, want gen: 294909 level: 1
> Well block 114092670976(gen: 295057 level: 1) seems good, but generation/level doesn't match, want gen: 294909 level: 1
> Well block 114844827648(gen: 295056 level: 1) seems good, but generation/level doesn't match, want gen: 294909 level: 1
> Well block 114618925056(gen: 295055 level: 1) seems good, but generation/level doesn't match, want gen: 294909 level: 1
> Well block 923598848(gen: 294112 level: 1) seems good, but generation/level doesn't match, want gen: 294909 level: 1
> Well block 495386624(gen: 294111 level: 1) seems good, but generation/level doesn't match, want gen: 294909 level: 1
>
> =====
> This generation failed to recover any data whatsoever:
> Well block 92602368(gen: 294008 level: 1) seems good, but generation/level doesn't match, want gen: 294909 level: 1

And that's 901 generations behind, which could be a day with average
use, or more days with light use.
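
The generation gaps quoted above (151 and 901) can be recomputed from
the btrfs-find-root output with a small script. This is only a sketch:
the regex assumes the exact "Well block ... want gen: ..." line format
shown in this thread, and generation_deltas() is a helper name invented
here, not part of btrfs-progs.

```python
import re

# Matches lines of the form quoted in this thread, e.g.
# "Well block 123(gen: 5 level: 1) seems good, ... want gen: 4 level: 1"
LINE_RE = re.compile(
    r"block (\d+)\(gen: (\d+) level: (\d+)\).*want gen: (\d+) level: (\d+)"
)


def generation_deltas(text):
    """Return (block, gen, gen - wanted_gen) for each matching line."""
    results = []
    for line in text.splitlines():
        m = LINE_RE.search(line)
        if m is None:
            continue  # skip anything that is not a "Well block" line
        block, gen, _level, want_gen, _want_level = map(int, m.groups())
        results.append((block, gen, gen - want_gen))
    return results


# Two lines copied from the report above.
SAMPLE = """\
Well block 114904137728(gen: 295060 level: 1) seems good, but generation/level doesn't match, want gen: 294909 level: 1
Well block 92602368(gen: 294008 level: 1) seems good, but generation/level doesn't match, want gen: 294909 level: 1
"""

for block, gen, delta in generation_deltas(SAMPLE):
    # A positive delta means the tree root is NEWER than the wanted generation.
    print(f"block {block}: gen {gen}, delta {delta:+d}")
```

A positive delta is the odd case here: the root found on disk is newer
than the generation the superblock asks for.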

What are the mount options for this file system?

>
> =====
> Generations below do not show files created after the crash and restart, but the directories that would have contained the ~2 weeks of files are corrupted badly enough that they cannot be recovered.  Lots of "leaf parent key incorrect" on those directories; unknown if this is because of corruption that occurred prior to the crash or if this data was simply overwritten after remount and file restore.
>
> Well block 299955716096(gen: 293446 level: 1) seems good, but generation/level doesn't match, want gen: 294909 level: 1
> Well block 299916853248(gen: 293446 level: 1) seems good, but generation/level doesn't match, want gen: 294909 level: 1
> Well block 299787747328(gen: 293445 level: 1) seems good, but generation/level doesn't match, want gen: 294909 level: 1
>
> My confidence still isn't great here that we don't have an underlying bug of some sort still present in btrfs, but all we can really do is keep an eye on it and increase backup frequency at this point.
>
> Thanks!

There isn't a lot to go on. Have you gone through the logs looking for
non-Btrfs related errors, like SCSI or libata link resets, or tried a
grep -i 'fail\|error' and so on? Each drive has its own log, exposed
by 'smartctl -x'. It's also useful to know what the SCT ERC setting is,
via 'smartctl -l scterc', for each drive in the volume. Somewhere
something got dropped.
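
If the logs have already been saved off as text, the same
case-insensitive search as the grep above can be scripted. A minimal
sketch follows; suspicious_lines() is a made-up helper name, and the
sample lines are placeholders written for this example, not real
kernel output.

```python
import re

# Same pattern as grep -i 'fail\|error': either word, case-insensitive.
PATTERN = re.compile(r"fail|error", re.IGNORECASE)


def suspicious_lines(lines):
    """Return (line_number, line) pairs for lines mentioning fail/error."""
    return [
        (number, line.rstrip("\n"))
        for number, line in enumerate(lines, start=1)
        if PATTERN.search(line)
    ]


# Placeholder input; in practice this would be the lines of a saved log file.
sample = [
    "example: nothing unusual here\n",
    "example: something FAILED on a link\n",
    "example: I/O error reported\n",
]

for number, line in suspicious_lines(sample):
    print(f"{number}: {line}")
```

Keeping the line numbers makes it easier to go back and read the
surrounding context, which is usually where a link reset or timeout
shows up.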


-- 
Chris Murphy


