linux-btrfs.vger.kernel.org archive mirror
* Linux-next regression?
@ 2018-11-26 15:01 Andrea Gelmini
  2018-11-27  1:13 ` Qu Wenruo
  0 siblings, 1 reply; 8+ messages in thread
From: Andrea Gelmini @ 2018-11-26 15:01 UTC (permalink / raw)
  To: linux-btrfs

Hi everybody,
   and thanks a lot for your work.

   I'm using BTRFS over LVM over cryptsetup, on a Samsung SSD 860 EVO (latest git of btrfs-progs).
   I usually run development kernels, because I know BTRFS is young and there are still lots of bugs and corner cases to fix.

   Anyway, I just want to pass along some (maybe) useful information.

   Yesterday I compiled and booted the latest linux-next,¹ and I got this:

-----------
nov 26 01:18:22 glet kernel: Btrfs loaded, crc32c=crc32c-intel
nov 26 01:18:22 glet kernel: BTRFS: device label home devid 1 transid 32759 /dev/mapper/cry-home
nov 26 01:18:23 glet kernel: BTRFS info (device dm-3): force lzo compression, level 0
nov 26 01:18:23 glet kernel: BTRFS info (device dm-3): disk space caching is enabled
nov 26 01:18:23 glet kernel: BTRFS info (device dm-3): has skinny extents
nov 26 01:18:23 glet kernel: BTRFS error (device dm-3): bad tree block start, want 2152002191360 have 8829432654847901262
nov 26 01:18:23 glet kernel: BTRFS error (device dm-3): failed to read block groups: -5
nov 26 01:18:23 glet kernel: BTRFS error (device dm-3): open_ctree failed
-----------

   Now, rebooting with 4.19.0-041900 (downloaded from here)², or 4.20-rc4 (compiled on this machine), the problem disappears.

   Running scrub a few times, and copying the data (all files of the logical volume) to an external device, produces no complaints.

   Here I stop. This is my primary dev laptop, and at the moment I can't spend time switching/rebooting/testing. I'm comparing the data with the last backup (I rsync every hour), but it takes time (it's more than 3TB).

   So, this was just to let you know. For the record, it's Ubuntu 18.10, and between reboots there was no dist-upgrade and no change to boot-related packages or systemd.

  One question: can I completely trust the OK return status of scrub? I know it is made for this, but shit happens...

Kisses,
Gelma   

-----------------
¹ commit:  8c9733fd9806c71e7f2313a280f98cb3051f93df
  "Add linux-next specific files for 20181123"
² http://kernel.ubuntu.com/~kernel-ppa/mainline/v4.19/


* Re: Linux-next regression?
  2018-11-26 15:01 Linux-next regression? Andrea Gelmini
@ 2018-11-27  1:13 ` Qu Wenruo
  2018-11-27 14:11   ` Andrea Gelmini
  0 siblings, 1 reply; 8+ messages in thread
From: Qu Wenruo @ 2018-11-27  1:13 UTC (permalink / raw)
  To: Andrea Gelmini, linux-btrfs





On 2018/11/26 11:01 PM, Andrea Gelmini wrote:
> Hi everybody,
>    and thanks a lot for your work.
> 
>    I'm using BTRFS over LVM over cryptsetup, over Samsung SSD 860 EVO (latest git of btrfs-progs).
>    Usually I run kernel in development, because I know BTRFS is young and there are still lots of bugs and corner case to fix.
> 
>    Anyway, I just want to submit to you a - maybe - useful info.
> 
>    Yesterday I compiled and booted latest linux-next,¹ and I've got this:
> 
> -----------
> nov 26 01:18:22 glet kernel: Btrfs loaded, crc32c=crc32c-intel
> nov 26 01:18:22 glet kernel: BTRFS: device label home devid 1 transid 32759 /dev/mapper/cry-home
> nov 26 01:18:23 glet kernel: BTRFS info (device dm-3): force lzo compression, level 0
> nov 26 01:18:23 glet kernel: BTRFS info (device dm-3): disk space caching is enabled
> nov 26 01:18:23 glet kernel: BTRFS info (device dm-3): has skinny extents
> nov 26 01:18:23 glet kernel: BTRFS error (device dm-3): bad tree block start, want 2152002191360 have 8829432654847901262

This means we failed to read one extent tree block, and that caused the
problem.

If you're using the default mkfs profile, it should retry using the
extra copy, but that doesn't appear to be the case.
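
A small sketch for readers decoding the "bad tree block start" line: each btrfs tree block stores its own logical address in its header, and the message reports the address that was requested ("want") next to the value actually found ("have"). The snippet only parses the log line from this report and checks alignment. The helper name `parse_bad_tree_block` and the 4096-byte sector size are assumptions made for illustration, not taken from the thread or from the kernel source.

```python
import re

# Matches the kernel message quoted in this report.
BAD_START = re.compile(r"bad tree block start, want (\d+) have (\d+)")

SECTOR_SIZE = 4096  # assumed sector size; not stated in the thread


def parse_bad_tree_block(line):
    """Return (want, have) as integers, or None if the line does not match."""
    m = BAD_START.search(line)
    if m is None:
        return None
    return int(m.group(1)), int(m.group(2))


line = ("BTRFS error (device dm-3): bad tree block start, "
        "want 2152002191360 have 8829432654847901262")
want, have = parse_bad_tree_block(line)

# The requested address is sector-aligned; the value read back is not,
# which hints that the block held unrelated bytes rather than another
# valid tree block.
print(want % SECTOR_SIZE == 0, have % SECTOR_SIZE == 0)
```

An unaligned "have" value is only a hint, not a diagnosis: it is consistent with garbage being read back, but it does not say which layer produced it.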

BTW, does it always happen like this, or only some of the time?

> nov 26 01:18:23 glet kernel: BTRFS error (device dm-3): failed to read block groups: -5
> nov 26 01:18:23 glet kernel: BTRFS error (device dm-3): open_ctree failed
> -----------
> 
>    Now, rebooting with 4.19.0-041900 (downloaded from here)², or 4.20-rc4 (compiled on this machine), the problem disappears.
> 
>    Now, running scrub a few times, and copying data (all files of the logical volume) to external device, gives no complain
Would you please also try "btrfs check --readonly"?

> 
>    Here I stop. This is my primary dev laptop, and at the moment I can't spend time switching/rebooting/testing. I'm comparing the data with last backup (I rsync each hour), but it takes time (it's more then 3TB).
> 
>    So, that was about to let you know. Well, it's Ubuntu 18.10, and between reboots no dist-upgrade or changes in booting related packages or systemd.
> 
>   One question: I can completely trust the ok return status of scrub? I know is made for this, but shit happens...

No. Scrub only checks the csum of data and tree blocks; it doesn't
ensure the content of the tree blocks is OK.

For a comprehensive check, use "btrfs check --readonly".
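
To illustrate the distinction with a toy model (not btrfs code): a checksum only proves that the bytes match what was written, while a structural check looks at whether those bytes make sense. Everything in the sketch is an assumption made for illustration, including the block layout, the function names, and the use of `zlib.crc32` as a stand-in for the crc32c that btrfs actually uses.

```python
import zlib


def make_block(keys):
    """Serialize keys and prepend a CRC32 of the payload (toy format)."""
    payload = ",".join(str(k) for k in keys).encode()
    csum = zlib.crc32(payload)
    return csum.to_bytes(4, "little") + payload


def csum_ok(block):
    """Scrub-like test: does the stored checksum match the payload?"""
    stored = int.from_bytes(block[:4], "little")
    return stored == zlib.crc32(block[4:])


def content_ok(block):
    """Check-like test: are the keys in the payload in ascending order?"""
    keys = [int(k) for k in block[4:].decode().split(",")]
    return keys == sorted(keys)


good = make_block([1, 2, 3])
bad = make_block([1, 3, 2])  # valid checksum, but keys are out of order

print(csum_ok(good), content_ok(good))
print(csum_ok(bad), content_ok(bad))
```

The second block passes the checksum test and fails the ordering test, which is the kind of problem a checksum-only pass cannot see.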

However, I don't think this is something "btrfs check --readonly" would
report; it looks more like some strange behavior, maybe from LVM or
cryptsetup.

Thanks,
Qu

> 
> Kisses,
> Gelma   
> 
> -----------------
> ¹ commit:  8c9733fd9806c71e7f2313a280f98cb3051f93df
>   "Add linux-next specific files for 20181123"
> ² http://kernel.ubuntu.com/~kernel-ppa/mainline/v4.19/
> 




* Re: Linux-next regression?
  2018-11-27  1:13 ` Qu Wenruo
@ 2018-11-27 14:11   ` Andrea Gelmini
  2018-11-27 14:16     ` Qu Wenruo
  0 siblings, 1 reply; 8+ messages in thread
From: Andrea Gelmini @ 2018-11-27 14:11 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: linux-btrfs


On Tue, Nov 27, 2018 at 09:13:02AM +0800, Qu Wenruo wrote:
> 
> 
> On 2018/11/26 11:01 PM, Andrea Gelmini wrote:
> >   One question: I can completely trust the ok return status of scrub? I know is made for this, but shit happens...
> 
> No, scrub only checks csum of data and tree blocks, it doesn't ensure
> the content of tree blocks are OK.

Hi Qu,
  and thanks a lot, really. Your answers are always the best: short,
  detailed and very kind. You rock.

  I'm going to send a patch proposing to add your explanation above
  to the relevant man page, if you agree.

> For comprehensive check, go "btrfs check --readonly".

  I'll do it.

  So far I have only compared file existence between my laptop and
  the latest backup. Everything is fine.

> 
> However I don't think it's something "btrfs check --readonly" would
> report, but some strange behavior, maybe from LVM or cryptsetup.

  Well, I'm using this setup with ext4 and xfs, on the same machine,
  without trouble.
  I've got the files checksummed on the backup machine, so I can be sure
  when comparing integrity.

Anyway, thanks a lot again,
Andrea



* Re: Linux-next regression?
  2018-11-27 14:11   ` Andrea Gelmini
@ 2018-11-27 14:16     ` Qu Wenruo
  2018-11-28 16:05       ` Andrea Gelmini
  0 siblings, 1 reply; 8+ messages in thread
From: Qu Wenruo @ 2018-11-27 14:16 UTC (permalink / raw)
  To: Andrea Gelmini; +Cc: linux-btrfs





On 2018/11/27 10:11 PM, Andrea Gelmini wrote:
> On Tue, Nov 27, 2018 at 09:13:02AM +0800, Qu Wenruo wrote:
>>
>>
>> On 2018/11/26 11:01 PM, Andrea Gelmini wrote:
>>>   One question: I can completely trust the ok return status of scrub? I know is made for this, but shit happens...
>>
>> No, scrub only checks csum of data and tree blocks, it doesn't ensure
>> the content of tree blocks are OK.
> 
> Hi Qu,
>   and thanks a lot, really. Your answers are always the best: short,
>   detailed and very kind. You rock.
> 
>   I'm going to send a patch to propose to add your explanation above
>   on the relative man page, if you agree.
> 
>> For comprehensive check, go "btrfs check --readonly".
> 
>   I'll do it.
> 
>   At the moment I just compared the file existance between my laptop and
>   latest backup. Everything is fine.
> 
>>
>> However I don't think it's something "btrfs check --readonly" would
>> report, but some strange behavior, maybe from LVM or cryptsetup.
> 
>   Well, I'm using this setup with ext4 and xfs, on same machine, without
>   troubles.

Then it indeed looks like something is wrong in linux-next.

I would recommend doing a bisect if possible.

Since you compared all the data on your laptop with the backup, your
csum/file trees must be OK, so there is no corruption in those trees.
Still, something doesn't look right in the extent tree.

It's less of a concern since it hasn't reached the latest RC, so if you
can reproduce it reliably, I'd recommend doing a bisect.
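
For readers new to bisecting: the idea is a binary search over the commit history between a known-good and a known-bad kernel, which `git bisect` automates. The sketch below is only a toy model of that search. The commit labels and the `is_bad` predicate are made up for illustration; in practice the predicate would be "build, boot, and see whether the filesystem mounts".

```python
def first_bad(commits, is_bad):
    """Return the first commit for which is_bad() is true.

    Assumes commits[0] is good, commits[-1] is bad, and that once a
    commit is bad every later one is bad too.
    """
    lo, hi = 0, len(commits) - 1  # lo is known good, hi is known bad
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if is_bad(commits[mid]):
            hi = mid
        else:
            lo = mid
    return commits[hi]


# Hypothetical history: the labels are placeholders, not real commits.
history = ["c0", "c1", "c2", "c3", "c4", "c5", "c6", "c7"]
broken_since = history.index("c5")
print(first_bad(history, lambda c: history.index(c) >= broken_since))
```

Each round halves the remaining range, so the number of test boots grows with the logarithm of the number of commits. That is also why a reliable reproducer matters: one wrong good/bad answer sends the search into the wrong half.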

Thanks,
Qu

>   I've got files checksummed on the backup machine, so I can be sure about
>   comparing integrity.
> 
> Anyway, thanks a lot again,
> Andrea
> 




* Re: Linux-next regression?
  2018-11-27 14:16     ` Qu Wenruo
@ 2018-11-28 16:05       ` Andrea Gelmini
  2018-12-04 22:29         ` Chris Mason
  0 siblings, 1 reply; 8+ messages in thread
From: Andrea Gelmini @ 2018-11-28 16:05 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: linux-btrfs


On Tue, Nov 27, 2018 at 10:16:52PM +0800, Qu Wenruo wrote:
>
> But it's less a concerning problem since it doesn't reach latest RC, so
> if you could reproduce it stably, I'd recommend to do a bisect.

Bisecting is usually no problem for me.
But right now it's not possible, as I explain below.
Anyway, here is the rest of the story.

So, in the end I:
a) booted with 4.20.0-rc4
b) updated the backup
c) ran btrfs check --readonly
d) seven steps, everything is perfect
e) no complaints on screen or in the logs (never had any)
f) so, started to compile linux-next 20181128 (on another partition)
g) without using (reading or writing) /home, I started
h) btrfs filesystem defrag -v -r -t 128M /home
i) it worked without complaint (on screen or in the logs)
j) then, rebooted with kernel tag 20181128
k) and no way to mount:

------
nov 28 15:44:03 glet kernel: BTRFS: device label home devid 1 transid 37360 /dev/mapper/cry-home
nov 28 15:44:04 glet kernel: BTRFS info (device dm-3): use lzo compression, level 0
nov 28 15:44:04 glet kernel: BTRFS info (device dm-3): turning on discard
nov 28 15:44:04 glet kernel: BTRFS info (device dm-3): enabling auto defrag
nov 28 15:44:04 glet kernel: BTRFS info (device dm-3): disk space caching is enabled
nov 28 15:44:04 glet kernel: BTRFS info (device dm-3): has skinny extents
nov 28 15:44:04 glet kernel: BTRFS error (device dm-3): bad tree block start, want 2150302023680 have 17816181330383341936
nov 28 15:44:04 glet kernel: BTRFS error (device dm-3): failed to read block groups: -5
nov 28 15:44:04 glet kernel: BTRFS error (device dm-3): open_ctree failed
------

l) went back to 4.20.0-rc4
m) it mounted, but after a few minutes I got this:

------
nov 28 15:51:23 glet kernel: BTRFS warning (device dm-3): block group 2199347265536 has wrong amount of free space
nov 28 15:51:23 glet kernel: BTRFS warning (device dm-3): failed to load free space cache for block group 2199347265536, rebuilding it now
nov 28 15:51:23 glet kernel: BTRFS warning (device dm-3): block group 2196126040064 has wrong amount of free space
nov 28 15:51:23 glet kernel: BTRFS warning (device dm-3): failed to load free space cache for block group 2196126040064, rebuilding it now
nov 28 15:52:09 glet kernel: BTRFS warning (device dm-3): block group 2184314880000 has wrong amount of free space
nov 28 15:52:09 glet kernel: BTRFS warning (device dm-3): failed to load free space cache for block group 2184314880000, rebuilding it now
nov 28 15:52:09 glet kernel: BTRFS warning (device dm-3): block group 2183241138176 has wrong amount of free space
nov 28 15:52:09 glet kernel: BTRFS warning (device dm-3): failed to load free space cache for block group 2183241138176, rebuilding it now
nov 28 15:52:53 glet kernel: BTRFS warning (device dm-3): block group 2152102625280 has wrong amount of free space
nov 28 15:52:53 glet kernel: BTRFS warning (device dm-3): failed to load free space cache for block group 2152102625280, rebuilding it now
nov 28 15:54:13 glet kernel: BTRFS warning (device dm-3): block group 2530059747328 has wrong amount of free space
nov 28 15:54:13 glet kernel: BTRFS warning (device dm-3): failed to load free space cache for block group 2530059747328, rebuilding it now
nov 28 15:55:10 glet kernel: BTRFS warning (device dm-3): block group 2151028883456 has wrong amount of free space
nov 28 15:55:10 glet kernel: BTRFS warning (device dm-3): failed to load free space cache for block group 2151028883456, rebuilding it now
nov 28 15:55:48 glet kernel: BTRFS warning (device dm-3): block group 2203642232832 has wrong amount of free space
nov 28 15:55:48 glet kernel: BTRFS warning (device dm-3): failed to load free space cache for block group 2203642232832, rebuilding it now
------

n) and then read-only mode:

------
[ 1058.996960] BTRFS error (device dm-3): bad tree block start, want 2150382092288 have 159161645701828393
[ 1058.996967] BTRFS: error (device dm-3) in __btrfs_free_extent:6831: errno=-5 IO failure
[ 1058.996969] BTRFS info (device dm-3): forced readonly
[ 1058.996971] BTRFS: error (device dm-3) in btrfs_run_delayed_refs:2978: errno=-5 IO failure
[ 1059.002857] BTRFS error (device dm-3): pending csums is 97832960
------

So, I'm very sorry, but for the moment I can't help with a bisect, because I have to
revert to ext4. This is the laptop I work on.

If I can help you investigate, just tell me.

Thanks for your time,
Gelma



* Re: Linux-next regression?
  2018-11-28 16:05       ` Andrea Gelmini
@ 2018-12-04 22:29         ` Chris Mason
  2018-12-05 10:59           ` Andrea Gelmini
  0 siblings, 1 reply; 8+ messages in thread
From: Chris Mason @ 2018-12-04 22:29 UTC (permalink / raw)
  To: Andrea Gelmini; +Cc: Qu Wenruo, linux-btrfs

On 28 Nov 2018, at 11:05, Andrea Gelmini wrote:

> On Tue, Nov 27, 2018 at 10:16:52PM +0800, Qu Wenruo wrote:
>>
>> But it's less a concerning problem since it doesn't reach latest RC, 
>> so
>> if you could reproduce it stably, I'd recommend to do a bisect.
>
> No problem to bisect, usually.
> But right now it's not possible for me, I explain further.
> Anyway, here the rest of the story.
>
> So, in the end I:
> a) booted with 4.20.0-rc4
> b) updated backup
> c) did the btrfs check --read-only
> d) seven steps, everything is perfect
> e) no complains on screen or in logs (never had)
> f) so, started to compile linux-next 20181128 (on another partition)
> e) without using (reading or writing) on /home, I started
> f) btrfs filesystem defrag -v -r -t 128M /home
> g) it worked without complain (in screen or logs)
> h) then, reboot with kernel tag 20181128
> i) and no way to mount:

I think (hope) this is:

https://bugzilla.kernel.org/show_bug.cgi?id=201685

Which was just nailed down to a blkmq bug.  It triggers when you have 
scsi devices using elevator=none over blkmq.
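
For anyone wanting to see whether a machine matches that description, the active I/O scheduler for each block device is exposed in `/sys/block/<dev>/queue/scheduler`, with the active one shown in square brackets. The sketch below just reads those files. The function names are made up for illustration, and the sketch says nothing about whether a given kernel contains the bug.

```python
import os


def active_scheduler(text):
    """Return the scheduler name shown in [brackets], or None if absent."""
    for word in text.split():
        if word.startswith("[") and word.endswith("]"):
            return word[1:-1]
    return None


def schedulers_by_device(sysfs_root="/sys/block"):
    """Map each block device to its active scheduler, skipping unreadable ones."""
    result = {}
    try:
        devices = os.listdir(sysfs_root)
    except OSError:
        return result  # no sysfs here (e.g. not Linux)
    for dev in devices:
        path = os.path.join(sysfs_root, dev, "queue", "scheduler")
        try:
            with open(path) as f:
                result[dev] = active_scheduler(f.read())
        except OSError:
            continue
    return result


if __name__ == "__main__":
    for dev, sched in sorted(schedulers_by_device().items()):
        print(dev, sched)
```

A device whose file lists no bracketed entry is reported as `None` here, so that case needs a manual look.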

-chris


* Re: Linux-next regression?
  2018-12-04 22:29         ` Chris Mason
@ 2018-12-05 10:59           ` Andrea Gelmini
  2018-12-05 19:32             ` Chris Mason
  0 siblings, 1 reply; 8+ messages in thread
From: Andrea Gelmini @ 2018-12-05 10:59 UTC (permalink / raw)
  To: Chris Mason; +Cc: Qu Wenruo, linux-btrfs

On Tue, Dec 04, 2018 at 10:29:49PM +0000, Chris Mason wrote:
> I think (hope) this is:
> 
> https://bugzilla.kernel.org/show_bug.cgi?id=201685
> 
> Which was just nailed down to a blkmq bug.  It triggers when you have 
> scsi devices using elevator=none over blkmq.

Thanks a lot Chris. Really.
Good news: I confirm that I had recompiled with blkmq and was using no-op at that time.
Also, the heavy writes from btrfs defrag would explain why the bug triggered so badly,
and the corruption that followed.

Thanks again,
Andrea


* Re: Linux-next regression?
  2018-12-05 10:59           ` Andrea Gelmini
@ 2018-12-05 19:32             ` Chris Mason
  0 siblings, 0 replies; 8+ messages in thread
From: Chris Mason @ 2018-12-05 19:32 UTC (permalink / raw)
  To: Andrea Gelmini; +Cc: Qu Wenruo, linux-btrfs

On 5 Dec 2018, at 5:59, Andrea Gelmini wrote:

> On Tue, Dec 04, 2018 at 10:29:49PM +0000, Chris Mason wrote:
>> I think (hope) this is:
>>
>> https://bugzilla.kernel.org/show_bug.cgi?id=201685
>>
>> Which was just nailed down to a blkmq bug.  It triggers when you have
>> scsi devices using elevator=none over blkmq.
>
> Thanks a lot Chris. Really.
> Good news: I confirm I recompiled and used blkmq and no-op (at that 
> time).
> Also, the massive write of btrfs defrag can explain the massive 
> trigger of
> the bug, and next corruption.

Sorry this happened, but glad you were able to confirm that it explains 
the trouble you hit.  Thanks for the report, I did end up using this as 
a datapoint to convince myself the bugzilla above wasn't ext4 specific.

-chris

