From: Qu Wenruo <quwenruo.btrfs@gmx.com>
To: james harvey <jamespharvey20@gmail.com>,
	Btrfs BTRFS <linux-btrfs@vger.kernel.org>
Subject: Re: "decompress failed" in 1-2 files always causes kernel oops, check/scrub pass
Date: Mon, 14 May 2018 14:36:29 +0800
Message-ID: <c9feaa61-4362-f2ea-c619-56fdc4a87248@gmx.com>
In-Reply-To: <d7c78e64-0659-6552-8e7a-8dbbc9b4df72@gmx.com>





On 2018-05-14 13:30, Qu Wenruo wrote:
> 
> 
> On 2018-05-14 12:41, james harvey wrote:
>> On Sun, May 13, 2018 at 10:08 PM, Qu Wenruo <quwenruo.btrfs@gmx.com> wrote:
>>> On 2018-05-12 13:08, james harvey wrote:
>>>> Hardware is fine.  Passes memtest86+ in SMP mode.  Works fine on all
>>>> other files.
>>>>
>>>>
>>>>
>>>> [  381.869940] BUG: unable to handle kernel paging request at 0000000000390e50
>>>> [  381.870881] BTRFS: decompress failed
>>>> [  381.891775] IP: rebalance_domains+0x8a/0x2c0
>>>
>>> The interesting part here is that btrfs does not show up in the call trace,
>>> not even the lzo code
>>> (despite the "decompress failed" message).
>>> Maybe some corrupted data is overwriting random kernel memory?
>>
>> I've been surprised by this too.  I've seen a few "styles" of crashes from this.
>>
>> The fuller version of the one I posted in original post:
>> https://bugzilla.kernel.org/attachment.cgi?id=275949
>>
>> One that starts with a "general protection fault":
>> https://bugzilla.kernel.org/attachment.cgi?id=275951
>>
>> And my most recent version, starts with "BTRFS: decompress failed"
>> then "BUG: unable to handle kernel NULL pointer dereference at
>> 0000000000000001":
>> https://bugzilla.kernel.org/attachment.cgi?id=275961
>>
>> This latest one does have a call trace including btrfs.  The top of
>> the call trace is "end_compressed_bio_read+0x34e/0x3d0 [btrfs]", and
>> although it includes the word compressed, I'm not sure that's actually
>> having to do with lzo compression.  The call stack doesn't scream that
>> to me.
>>
>> It seems like when the invalid decompression happens, that code itself
>> doesn't give any kernel errors, but the rest of the kernel starts
>> spazzing.
> 
> Yep, even in the last case it still looks like kernel memory is getting
> corrupted.
> 
>>
>> I've replicated this probably about 15 times now.  Only happens on
>> these files that have inconsistent mirrored data.
> 
> Since you have already located the corrupted mirror earlier in this thread,
> would you please provide a dump of the corrupted copy along with the correct one?
> 
> It would help us a lot to understand what's going on.
> 
>>
>>
>>
>>> Would you please get the inode numbers of the corrupted files, and run
>>> them through btrfs-debug-tree?
>>>
>>> # btrfs-debug-tree -t <subvol_id> <device> | grep -A 50 \(<INO>
>>>
>>> This is the preferred method as it would provide all the details we
>>> need. But since it could contain sensitive info like filename, please
>>> double check before posting it.
>>
>> # ls -i system@00fa3c0596e64d2e84096520ca46f008-0000000000000001-00053cd2c1756577.journal
>> 291489 system@00fa3c0596e64d2e84096520ca46f008-0000000000000001-00053cd2c1756577.journal
>>
>> # ls -i user-1000@b70add0ef010457d933fec23a2afa48a-0000000000000495-00053b6b6e65e9cf.journal
>> 72267 user-1000@b70add0ef010457d933fec23a2afa48a-0000000000000495-00053b6b6e65e9cf.journal
>>
>> # btrfs-debug-tree -t 5 /dev/lvm/newMain1 | grep -A 50 \(291489 >
>> debug.tree.291489
>> Available at: http://termbin.com/kegj
>>
>> # btrfs-debug-tree -t 5 /dev/lvm/newMain1 | grep -A 50 \(72267 >
>> debug.tree.72267
>> Available at: http://termbin.com/xhdc
> 
> The dump supports the same conclusion you reached.
> The inode has the NODATACOW and NODATASUM flags, which means it should have
> neither csums nor compressed data.
> Yet in fact it has tons of compressed extents.
> 
> But the following fiemap result also shows that these extents get
> shared. This could happen when there is a snapshot.
> 
> So something is wrong: btrfs allowed compressed data to be
> generated for such a file.
> (I could not reproduce this behavior with a 4.16 kernel. Could the
> problem only happen in older kernels, or was it fixed recently?)
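
For reading the fiemap output mentioned above, here is a small sketch of a
helper that decodes the FLAGS column. It assumes xfs_io prints the flags as
a hex value such as 0x2008; the function name decode_fiemap_flags is made up
for illustration. The bit values come from the FIEMAP_EXTENT_* definitions in
linux/fiemap.h, where "encoded" is what btrfs reports for compressed extents.
Only three of the defined bits are decoded here.

```shell
# Decode a FLAGS value from "xfs_io -c 'fiemap -v' <file>" output.
# decode_fiemap_flags is a hypothetical helper, not part of xfs_io.
decode_fiemap_flags() {
    flags=$(( $1 ))   # accepts hex such as 0x2008
    names=""
    # FIEMAP_EXTENT_LAST: last extent of the file
    [ $(( flags & 0x0001 )) -ne 0 ] && names="$names last"
    # FIEMAP_EXTENT_ENCODED: data is encoded (compressed, on btrfs)
    [ $(( flags & 0x0008 )) -ne 0 ] && names="$names encoded"
    # FIEMAP_EXTENT_SHARED: extent is shared, e.g. with a snapshot
    [ $(( flags & 0x2000 )) -ne 0 ] && names="$names shared"
    echo "${names# }"
}
```

Calling it with 0x2008 would report both "encoded" and "shared", which is the
combination described above for these NODATACOW files.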

OK, I could reproduce it now.

Just mount with -o nodatasum, then create a file.
Remount with compress-force=lzo, then write something.

So at the very least, btrfs should disallow this combination.
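
The two steps above could be scripted roughly as follows. This is only a
sketch under several assumptions: DEV and MNT are placeholder names, the
mkfs.btrfs step and the dd write are my additions to get a fresh filesystem
and some compressible data, and "remount" is taken to mean mount -o remount
rather than umount followed by mount. By default the script only prints the
commands (DRY_RUN=1); running it for real needs root and a scratch device,
since mkfs.btrfs destroys whatever is on DEV.

```shell
#!/bin/sh
# Sketch of the reproducer. DEV and MNT are placeholders (assumptions).
DEV="${DEV:-/dev/loop0}"
MNT="${MNT:-/mnt/scratch}"
DRY_RUN="${DRY_RUN:-1}"

# Print the command in dry-run mode, execute it otherwise.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        printf '+ %s\n' "$*"
    else
        "$@"
    fi
}

reproduce() {
    # Assumption: start from a fresh filesystem on a scratch device.
    run mkfs.btrfs -f "$DEV"
    # Step 1: mount with nodatasum, then create a file.
    run mount -o nodatasum "$DEV" "$MNT"
    run touch "$MNT/file"
    # Step 2: remount with forced lzo compression, then write something.
    run mount -o remount,compress-force=lzo "$MNT"
    run dd if=/dev/zero of="$MNT/file" bs=1M count=4 conv=notrunc
    run sync
    run umount "$MNT"
}

reproduce
```

After a real run, the btrfs-debug-tree and fiemap commands quoted in this
mail can be used to check whether the file ended up with compressed extents.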

Thanks,
Qu

> 
> Then some corruption damaged the compressed data, and when we
> decompress it, kernel memory gets corrupted.
> 
> 
> To pin down the lzo decompression corruption, KASAN would be worth a try.
> However, it has to be enabled at compile time, so you would need to
> recompile the kernel.
> Not to mention that KASAN has a big impact on performance.
> 
> But it should provide more info before memory gets corrupted.
> 
> Thanks,
> Qu
> 
>>
>>
>>
>>> Or fiemap of that file could also help:
>>>
>>> # xfs_io -c "fiemap -v" <corrupted_file>
>>>
>>> This is completely safe, but I'm not 100% sure whether the info is enough.
>>
>> # xfs_io -c "fiemap -v"
>> system@00fa3c0596e64d2e84096520ca46f008-0000000000000001-00053cd2c1756577.journal
>> Available at: http://termbin.com/nsej
>>
>> # xfs_io -c "fiemap -v"
>> system@00fa3c0596e64d2e84096520ca46f008-0000000000000001-00053cd2c1756577.journal
>> Available at: http://termbin.com/4fiz
> 
> 
> 



Thread overview: 14+ messages
2018-05-12  5:08 "decompress failed" in 1-2 files always causes kernel oops, check/scrub pass james harvey
2018-05-12  7:51 ` Martin Steigerwald
2018-05-13  0:10   ` james harvey
2018-05-13  2:09     ` Chris Murphy
2018-05-13  5:28       ` james harvey
2018-05-13 11:01       ` james harvey
2018-05-13 11:45         ` james harvey
2018-05-13 21:27       ` Chris Murphy
2018-05-14  2:08 ` Qu Wenruo
2018-05-14  4:41   ` james harvey
2018-05-14  5:30     ` Qu Wenruo
2018-05-14  6:36       ` Qu Wenruo [this message]
2018-05-14 10:29       ` james harvey
2018-05-14 11:05         ` Qu Wenruo
