* Buffer I/O error on dev md5, logical block 7073536, async page read
@ 2016-10-30  2:16 Marc MERLIN
  2016-10-30  9:33 ` Andreas Klauer
  0 siblings, 1 reply; 67+ messages in thread
From: Marc MERLIN @ 2016-10-30  2:16 UTC (permalink / raw)
  To: linux-raid

Howdy,

I'm struggling with this problem.

I have this md5 array with 5 drives:
Personalities : [linear] [raid0] [raid1] [raid10] [multipath] [raid6] [raid5] [raid4] 
md5 : active raid5 sdg1[0] sdh1[6] sdf1[2] sde1[3] sdd1[5]
      15627542528 blocks super 1.2 level 5, 512k chunk, algorithm 2 [5/5] [UUUUU]
      bitmap: 0/30 pages [0KB], 65536KB chunk

I started having filesystem problems with it, so I first scanned the
individual drives with hdrecover, and that passed. Then I ran it on the md5
array, and it failed.

With a simple dd, I get this:

25526374400 bytes (26 GB) copied, 249.888 s, 102 MB/s
dd: reading `/dev/md5': Input/output error
56588288+0 records in
56588288+0 records out
28973203456 bytes (29 GB) copied, 283.325 s, 102 MB/s
[1]+  Exit 1                  dd if=/dev/md5 of=/dev/null
kernel: [202693.708639] Buffer I/O error on dev md5, logical block 7073536, async page read

Yes, I can read the entire disk devices without problems (it took a long
time to run, but it finished).
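
(For reference, a minimal sketch of that whole-device read test; the member
partitions are the ones listed above, and bs=1M is just a reasonable buffer
size:)

for d in sdd sde sdf sdg sdh; do
    dd if=/dev/${d}1 of=/dev/null bs=1M || echo "read error on /dev/${d}1"
done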

Can someone tell me how this is possible?
More generally, is it possible for the kernel to return an md error and then not log
any underlying hardware error on the drives the md was being read from?

Kernel 4.6.0. I'll upgrade just in case, but md has been stable enough for so many years that I'm 
thinking the problem is likely elsewhere.

Any ideas?

Thanks,
Marc
-- 
"A mouse is a device used to point at the xterm you want to type in" - A.S.R.
Microsoft is to operating systems ....
                                      .... what McDonalds is to gourmet cooking
Home page: http://marc.merlins.org/                         | PGP 1024R/763BE901

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: Buffer I/O error on dev md5, logical block 7073536, async page read
  2016-10-30  2:16 Buffer I/O error on dev md5, logical block 7073536, async page read Marc MERLIN
@ 2016-10-30  9:33 ` Andreas Klauer
  2016-10-30 15:38   ` Marc MERLIN
  0 siblings, 1 reply; 67+ messages in thread
From: Andreas Klauer @ 2016-10-30  9:33 UTC (permalink / raw)
  To: Marc MERLIN; +Cc: linux-raid

On Sat, Oct 29, 2016 at 07:16:14PM -0700, Marc MERLIN wrote:
> Can someone tell me how this is possible?
> More generally, is it possible for the kernel to return an md error 
> and then not log any underlying hardware error on the drives the md 
> was being read from?

Is there something in mdadm --examine(-badblocks) /dev/sd*?

Regards
Andreas Klauer

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: Buffer I/O error on dev md5, logical block 7073536, async page read
  2016-10-30  9:33 ` Andreas Klauer
@ 2016-10-30 15:38   ` Marc MERLIN
  2016-10-30 16:19     ` Andreas Klauer
  0 siblings, 1 reply; 67+ messages in thread
From: Marc MERLIN @ 2016-10-30 15:38 UTC (permalink / raw)
  To: Andreas Klauer; +Cc: linux-raid

On Sun, Oct 30, 2016 at 10:33:37AM +0100, Andreas Klauer wrote:
> On Sat, Oct 29, 2016 at 07:16:14PM -0700, Marc MERLIN wrote:
> > Can someone tell me how this is possible?
> > More generally, is it possible for the kernel to return an md error 
> > and then not log any underlying hardware error on the drives the md 
> > was being read from?
> 
> Is there something in mdadm --examine(-badblocks) /dev/sd*?

Well, well, I learned something new today. First I had to upgrade my mdadm
tools to get that option, and sure enough:
myth:~# mdadm --examine-badblocks /dev/sd[defgh]1
Bad-blocks on /dev/sdd1:
            14408704 for 352 sectors
            14409568 for 160 sectors
           132523032 for 512 sectors
           372496968 for 440 sectors
Bad-blocks list is empty in /dev/sde1
Bad-blocks on /dev/sdf1:
            14408704 for 352 sectors
            14409568 for 160 sectors
           132523032 for 512 sectors
           372496968 for 440 sectors
Bad-blocks list is empty in /dev/sdg1
Bad-blocks list is empty in /dev/sdh1

So thank you for pointing me in the right direction.

I think they are due to the fact that it's an external disk array behind a
port multiplier, where I sometimes get bus errors that aren't actually
caused by the disks.

Questions:
1) Shouldn't my array have been invalidated if I have bad blocks on 2 drives
in the same place? Or is the only way this could have happened that the
array did get invalidated and I somehow force-rebuilt it to bring it back
up, and I don't remember doing so?
(mmmh, but even so, rebuilding the spare should have cleared the bad blocks
on at least one drive, no?)

2) I'm currently running this, which I believe is the way to recover:
myth:~# echo 'check' > /sys/block/md5/md/sync_action 
but I'm not too hopeful about how that's going to work out if I have 2
drives with supposedly bad blocks at the same offsets.
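
(For reference, a minimal way to watch that check run, using the standard md
sysfs/procfs files; md5 as above:)

cat /proc/mdstat                    # shows check progress as a percentage
cat /sys/block/md5/md/mismatch_cnt  # non-zero means inconsistent stripes were found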

Is there another way to just clear the bad block list on both drives if I've
already verified that those blocks are not bad and that they were due to some 
I/O errors that came from a bad cable connection?

Thanks,
Marc
-- 
"A mouse is a device used to point at the xterm you want to type in" - A.S.R.
Microsoft is to operating systems ....
                                      .... what McDonalds is to gourmet cooking
Home page: http://marc.merlins.org/  

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: Buffer I/O error on dev md5, logical block 7073536, async page read
  2016-10-30 15:38   ` Marc MERLIN
@ 2016-10-30 16:19     ` Andreas Klauer
  2016-10-30 16:34       ` Phil Turmel
  2016-10-30 16:43       ` Buffer I/O error on dev md5, logical block 7073536, async page read Marc MERLIN
  0 siblings, 2 replies; 67+ messages in thread
From: Andreas Klauer @ 2016-10-30 16:19 UTC (permalink / raw)
  To: Marc MERLIN; +Cc: linux-raid

On Sun, Oct 30, 2016 at 08:38:57AM -0700, Marc MERLIN wrote:
> (mmmh, but even so, rebuilding the spare should have cleared the bad blocks
> on at least one drive, no?)

If n+1 disks have bad blocks there's no data to sync over, so they just 
propagate and stay bad forever. Or at least that's how it seemed to work 
last time I tried it. I'm not familiar with bad blocks. I just turn it off.

As long as the bad block list is empty you can --update=no-bbl.
If everything else fails - edit the metadata or carefully recreate.
Which I don't recommend because you can go wrong in a hundred ways.
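
(A sketch of what that would look like here, assuming the array is stopped
first; member names as elsewhere in this thread:)

mdadm --stop /dev/md5
mdadm --assemble --update=no-bbl /dev/md5 /dev/sd[defgh]1
# no-bbl only removes a bad block list that is already empty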

I don't remember if anyone ever had a proper solution to this.
It came up a couple of times on the list so you could search.

If you've replaced drives since, the drive that has been part of the array 
the longest is probably the most likely to still have valid data in there. 
That could be synced over to the other drives once the bbl is cleared. 
It might not matter; you'd have to check with your filesystems whether they 
believe any files are located there. (Filesystems sometimes maintain their 
own bad block lists, so you'd have to check those too.)

Regards
Andreas Klauer

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: Buffer I/O error on dev md5, logical block 7073536, async page read
  2016-10-30 16:19     ` Andreas Klauer
@ 2016-10-30 16:34       ` Phil Turmel
  2016-10-30 17:12         ` clearing blocks wrongfully marked as bad if --update=no-bbl can't be used? Marc MERLIN
  2016-10-30 18:56         ` [ LR] Kernel 4.8.4: INFO: task kworker/u16:8:289 blocked for more than 120 seconds TomK
  2016-10-30 16:43       ` Buffer I/O error on dev md5, logical block 7073536, async page read Marc MERLIN
  1 sibling, 2 replies; 67+ messages in thread
From: Phil Turmel @ 2016-10-30 16:34 UTC (permalink / raw)
  To: Andreas Klauer, Marc MERLIN; +Cc: linux-raid

On 10/30/2016 12:19 PM, Andreas Klauer wrote:
> On Sun, Oct 30, 2016 at 08:38:57AM -0700, Marc MERLIN wrote:
>> (mmmh, but even so, rebuilding the spare should have cleared the bad blocks
>> on at least one drive, no?)
> 
> If n+1 disks have bad blocks there's no data to sync over, so they just 
> propagate and stay bad forever. Or at least that's how it seemed to work 
> last time I tried it. I'm not familiar with bad blocks. I just turn it off.

I, too, turn it off.  (I never let it turn on, actually.)

I'm a little disturbed that this feature has become the default on new
arrays.  This feature was introduced specifically to support underlying
storage technologies that cannot perform their own bad block management.
 And since it doesn't implement any relocation algorithm for blocks
marked bad, it simply gives up any redundancy for affected sectors.  And
when there's no remaining redundancy, it simply passes the error up the
stack.  In this case, your errors were created by known communications
weaknesses that should always be recoverable with --assemble --force.
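
(A sketch of that recovery path, assuming the array got kicked apart by the
communication errors; member names as elsewhere in this thread:)

mdadm --stop /dev/md5
mdadm --assemble --force /dev/md5 /dev/sd[defgh]1
# --force assembles even if member event counts disagree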

As far as I'm concerned, the bad block system is an incomplete feature
that should never be used in production, and certainly not on top of any
storage technology that implements error detection, correction, and
relocation.  Like, every modern SATA and SAS drive.

Phil

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: Buffer I/O error on dev md5, logical block 7073536, async page read
  2016-10-30 16:19     ` Andreas Klauer
  2016-10-30 16:34       ` Phil Turmel
@ 2016-10-30 16:43       ` Marc MERLIN
  2016-10-30 17:02         ` Andreas Klauer
  2016-10-31 19:24         ` Wols Lists
  1 sibling, 2 replies; 67+ messages in thread
From: Marc MERLIN @ 2016-10-30 16:43 UTC (permalink / raw)
  To: Andreas Klauer; +Cc: linux-raid

On Sun, Oct 30, 2016 at 05:19:29PM +0100, Andreas Klauer wrote:
> On Sun, Oct 30, 2016 at 08:38:57AM -0700, Marc MERLIN wrote:
> > (mmmh, but even so, rebuilding the spare should have cleared the bad blocks
> > on at least one drive, no?)
> 
> If n+1 disks have bad blocks there's no data to sync over, so they just 
> propagate and stay bad forever. Or at least that's how it seemed to work 
> last time I tried it. I'm not familiar with bad blocks. I just turn it off.
> 
> As long as the bad block list is empty you can --update=no-bbl.
> If everything else fails - edit the metadata or carefully recreate.
> Which I don't recommend because you can go wrong in a hundred ways.
 
Right.
There should be some --update=no-bbl --force if the admin knows the bad
block list is wrong and due to IO issues not related to the drive.

> I don't remember if anyone ever had a proper solution to this.
> It came up a couple of times on the list so you could search.

Will look, thanks.

> If you've replaced drives since, the drive that has been part of the array 
> the longest is probably the most likely to still have valid data in there. 
> That could be synced over to the other drives once the bbl is cleared. 
> It might not matter, you'd have to check with your filesystems if they 
> believe any files located there. (Filesystems sometimes maintain their 
> own bad block lists so you'd have to check those too.)

No drives were ever replaced; this is an original array used only a few
times (for backups).
At this point I'm almost tempted to wipe and start over, but it's going to
take a week to recreate the backup (lots of data, slow link).
As for the filesystem, it's btrfs with data and metadata checksums, so it's
easy to verify that everything is fine once I can get md5 to stop returning
IO errors on blocks it thinks are bad but in fact are not.
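
(A minimal sketch of that btrfs-level verification, assuming the filesystem
is mounted at a hypothetical /mnt/backup:)

btrfs scrub start -Bd /mnt/backup   # -B: stay in foreground, -d: per-device stats
btrfs scrub status /mnt/backup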

And there isn't one good drive between the 2: the bad blocks are identical on
both drives and must have happened at the same time, due to those
cable-induced I/O errors I mentioned.
Too bad that mdadm doesn't account for the possibility that it could be
wrong when marking blocks as bad, and doesn't seem to give a way to recover
from this easily....
I'll do more reading, thanks.

Marc
-- 
"A mouse is a device used to point at the xterm you want to type in" - A.S.R.
Microsoft is to operating systems ....
                                      .... what McDonalds is to gourmet cooking
Home page: http://marc.merlins.org/  

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: Buffer I/O error on dev md5, logical block 7073536, async page read
  2016-10-30 16:43       ` Buffer I/O error on dev md5, logical block 7073536, async page read Marc MERLIN
@ 2016-10-30 17:02         ` Andreas Klauer
  2016-10-31 19:24         ` Wols Lists
  1 sibling, 0 replies; 67+ messages in thread
From: Andreas Klauer @ 2016-10-30 17:02 UTC (permalink / raw)
  To: Marc MERLIN; +Cc: linux-raid

On Sun, Oct 30, 2016 at 09:43:42AM -0700, Marc MERLIN wrote:
> Right.
> There should be some --update=no-bbl --force if the admin knows the bad
> block list is wrong and due to IO issues not related to the drive.

Good point. And hey, there it is.

mdadm.c

|	if (strcmp(c.update, "bbl") == 0)
|		continue;
|	if (strcmp(c.update, "no-bbl") == 0)
|		continue;
|	if (strcmp(c.update, "force-no-bbl") == 0)
|		continue;

force-no-bbl. It's in mdadm v3.4, not sure about older ones.

If I stumbled across that one before then I forgot about it.

Good luck
Andreas Klauer

^ permalink raw reply	[flat|nested] 67+ messages in thread

* clearing blocks wrongfully marked as bad if --update=no-bbl can't be used?
  2016-10-30 16:34       ` Phil Turmel
@ 2016-10-30 17:12         ` Marc MERLIN
  2016-10-30 17:16           ` Marc MERLIN
  2016-10-30 18:56         ` [ LR] Kernel 4.8.4: INFO: task kworker/u16:8:289 blocked for more than 120 seconds TomK
  1 sibling, 1 reply; 67+ messages in thread
From: Marc MERLIN @ 2016-10-30 17:12 UTC (permalink / raw)
  To: Phil Turmel, Neil Brown; +Cc: Andreas Klauer, linux-raid

Hi Neil,

Could you offer any guidance here? Is there something else I can do to clear
those fake bad blocks (the underlying disks are fine, I scanned them)
without rebuilding the array?

On Sun, Oct 30, 2016 at 12:34:56PM -0400, Phil Turmel wrote:
> On 10/30/2016 12:19 PM, Andreas Klauer wrote:
> > On Sun, Oct 30, 2016 at 08:38:57AM -0700, Marc MERLIN wrote:
> >> (mmmh, but even so, rebuilding the spare should have cleared the bad blocks
> >> on at least one drive, no?)
> > 
> > If n+1 disks have bad blocks there's no data to sync over, so they just 
> > propagate and stay bad forever. Or at least that's how it seemed to work 
> > last time I tried it. I'm not familiar with bad blocks. I just turn it off.
> 
> I, too, turn it off.  (I never let it turn on, actually.)
> 
> I'm a little disturbed that this feature has become the default on new
> arrays.  This feature was introduced specifically to support underlying
> storage technologies that cannot perform their own bad block management.
>  And since it doesn't implement any relocation algorithm for blocks
> marked bad, it simply gives up any redundancy for affected sectors.  And
> when there's no remaining redundancy, it simply passes the error up the
> stack.  In this case, your errors were created by known communications
> weaknesses that should always be recoverable with --assemble --force.
> 
> As far as I'm concerned, the bad block system is an incomplete feature
> that should never be used in production, and certainly not on top of any
> storage technology that implements error detection, correction, and
> relocation.  Like, every modern SATA and SAS drive.

Agreed. Just to confirm, I did indeed not willingly turn this on, and I
really wish it had not been turned on automatically.
As you point out, I've never needed this, and cabling-induced problems just
used to kill my array; I would fix the cabling and manually rebuild it.
Now my array doesn't get killed, but it gets rendered not very usable and
causes my filesystem (btrfs) to abort and fail when I access the wrong parts
of it.

I'm now stuck with those fake bad blocks that I can't remove without some
complicated surgery: editing md metadata on disk, or recreating an array
on top of the current one with the option disabled and hoping things line up.

This really ought to work, or something similar:
myth:~# mdadm --assemble --force --update=no-bbl /dev/md5
mdadm: Cannot remove active bbl from /dev/sdf1
mdadm: Cannot remove active bbl from /dev/sdd1
mdadm: /dev/md5 has been started with 5 drives.
(as in the array was assembled, but it's not really useful without those
fake bad blocks cleared from the bad block list)

And yes, I agree that bad blocks should not be a default. Now I really wish
they had never been auto-enabled; I already lost a week scanning this array
and chasing problems over this feature, which turns out to have made a wrong
assumption and doesn't seem to let me clear it :-/

Thanks to you both for your answers and for pointing me in the right direction.

Marc
-- 
"A mouse is a device used to point at the xterm you want to type in" - A.S.R.
Microsoft is to operating systems ....
                                      .... what McDonalds is to gourmet cooking
Home page: http://marc.merlins.org/  

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: clearing blocks wrongfully marked as bad if --update=no-bbl can't be used?
  2016-10-30 17:12         ` clearing blocks wrongfully marked as bad if --update=no-bbl can't be used? Marc MERLIN
@ 2016-10-30 17:16           ` Marc MERLIN
  2016-11-04 18:18             ` Marc MERLIN
  0 siblings, 1 reply; 67+ messages in thread
From: Marc MERLIN @ 2016-10-30 17:16 UTC (permalink / raw)
  To: Phil Turmel, Neil Brown, Andreas Klauer; +Cc: linux-raid

On Sun, Oct 30, 2016 at 10:12:34AM -0700, Marc MERLIN wrote:
> Hi Neil,
> 
> Could you offer any guidance here? Is there something else I can do to clear
> those fake bad blocks (the underlying disks are fine, I scanned them)
> without rebuilding the array?

On Sun, Oct 30, 2016 at 06:02:42PM +0100, Andreas Klauer wrote:
> > There should be some --update=no-bbl --force if the admin knows the bad
> > block list is wrong and due to IO issues not related to the drive.
> 
> Good point. And hey, there it is.
> 
> mdadm.c
> 
> |	if (strcmp(c.update, "bbl") == 0)
> |		continue;
> |	if (strcmp(c.update, "no-bbl") == 0)
> |		continue;
> |	if (strcmp(c.update, "force-no-bbl") == 0)
> |		continue;
> 
> force-no-bbl. It's in mdadm v3.4, not sure about older ones.

Oh, very nice, thank you. It's not in the man page, but it works:

myth:~# mdadm --assemble --update=force-no-bbl /dev/md5
mdadm: /dev/md5 has been started with 5 drives.
myth:~# 
myth:~# mdadm --examine-badblocks /dev/sd[defgh]1
No bad-blocks list configured on /dev/sdd1
No bad-blocks list configured on /dev/sde1
No bad-blocks list configured on /dev/sdf1
No bad-blocks list configured on /dev/sdg1
No bad-blocks list configured on /dev/sdh1

Now I'll make sure to turn off this feature on all my other arrays
in case it got turned on without my asking for it.
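
(A sketch of that check-and-disable pass for another array, with a
hypothetical /dev/md6 made of /dev/sda1 and /dev/sdb1:)

mdadm --examine-badblocks /dev/sd[ab]1       # is a bbl configured, does it have entries?
mdadm --stop /dev/md6
mdadm --assemble --update=no-bbl /dev/md6    # use force-no-bbl if entries remain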

Thanks,
Marc
-- 
"A mouse is a device used to point at the xterm you want to type in" - A.S.R.
Microsoft is to operating systems ....
                                      .... what McDonalds is to gourmet cooking
Home page: http://marc.merlins.org/  

^ permalink raw reply	[flat|nested] 67+ messages in thread

* btrfs check --repair: ERROR: cannot read chunk root
@ 2016-10-30 18:34 Marc MERLIN
  2016-10-31  1:02 ` Qu Wenruo
  0 siblings, 1 reply; 67+ messages in thread
From: Marc MERLIN @ 2016-10-30 18:34 UTC (permalink / raw)
  To: linux-btrfs

I have a filesystem on top of md raid5 that got a few problems due to the
underlying block layer (bad data cable).
The filesystem mounts fine, but had a few issues.
Scrub runs (I didn't let it finish; it takes a _long_ time),
but check --repair won't even run at all:

myth:~# btrfs --version
btrfs-progs v4.7.3
myth:~# uname -r
4.8.5-ia32-20161028

myth:~# btrfs check -p --repair /dev/mapper/crypt_bcache0 2>&1 | tee /var/spool/repair
bytenr mismatch, want=13835462344704, have=0
ERROR: cannot read chunk root
Couldn't open file system
enabling repair mode
myth:~#

myth:~# btrfs rescue super-recover -v /dev//mapper/crypt_bcache0 
All Devices:
        Device: id = 1, name = /dev//mapper/crypt_bcache0

Before Recovering:
        [All good supers]:
                device name = /dev//mapper/crypt_bcache0
                superblock bytenr = 65536

                device name = /dev//mapper/crypt_bcache0
                superblock bytenr = 67108864

                device name = /dev//mapper/crypt_bcache0
                superblock bytenr = 274877906944

        [All bad supers]:

All supers are valid, no need to recover


I don't care about the data, it's a backup array, but I'd still like to know
if I can recover from this state and do a repair to see how much data got
damaged.

Thanks,
Marc
-- 
"A mouse is a device used to point at the xterm you want to type in" - A.S.R.
Microsoft is to operating systems ....
                                      .... what McDonalds is to gourmet cooking
Home page: http://marc.merlins.org/  

^ permalink raw reply	[flat|nested] 67+ messages in thread

* [ LR] Kernel 4.8.4: INFO: task kworker/u16:8:289 blocked for more than 120 seconds.
  2016-10-30 16:34       ` Phil Turmel
  2016-10-30 17:12         ` clearing blocks wrongfully marked as bad if --update=no-bbl can't be used? Marc MERLIN
@ 2016-10-30 18:56         ` TomK
  2016-10-30 19:16           ` TomK
                             ` (2 more replies)
  1 sibling, 3 replies; 67+ messages in thread
From: TomK @ 2016-10-30 18:56 UTC (permalink / raw)
  To: linux-raid

Hey Guys,

We recently saw a situation where smartctl -A errored out but mdadm 
didn't pick this up. Eventually, in a short time of a few days, the disk 
cascaded into bad blocks then became a completely unrecognizable SATA 
disk.  It apparently had been limping along for 6 months, causing random 
timeouts and slowdowns accessing the array, but the RAID array did not 
pull it out and did not mark it as bad.  The RAID 6 we have has been 
running for 6 years; we did have a lot of disk replacements in it, yet 
it was always very, very reliable.  The disks started as all 1TB 
Seagates but are now 2 2TB WDs and 1 2TB Seagate, with 2 left as 1TB 
Seagates and the last one a 1.5TB.  It has a mix of green, red, blue 
etc.  Yet very rock solid.

We did not do a thorough R/W test to see how the error and bad disk 
affected the data stored on the array, but we did notice pauses, 
slowdowns, and general difficulty reading data on the CIFS share 
presented from it; however, there were no data errors that we could see. 
Since then we replaced the 2TB Seagate with a new 2TB WD and everything 
is fine, even with the array degraded.  But as soon as we put the bad 
disk back in, it degraded to its previous behaviour.  Yet the array 
didn't catch it as a failed disk until the disk was nearly completely 
inaccessible.

So the question is how come the mdadm RAID did not catch this disk as a 
failed disk and pull it out of the array?  It seems this disk was going 
bad for a while now, but as long as the array reported all 6 healthy, 
there was no cause for alarm.  Also, how does the array not detect the 
disk failure while issues show up in applications using the array? 
Removing the disk and leaving the array in a degraded state also solved 
the accessibility issue on the array.  So it appears the disk was 
generating some sort of errors (possibly a bad PCB) that were not caught 
before.

Looking at the changelogs, has a similar case been addressed?

On a separate topic, if I eventually expand the array to 6 2TB disks, 
will the array be smart enough to allow me to expand it to the new size? 
  I have not tried that yet and wanted to ask first.

Cheers,
Tom


[root@mbpc-pc modprobe.d]# rpm -qf /sbin/mdadm
mdadm-3.3.2-5.el6.x86_64
[root@mbpc-pc modprobe.d]#


(The 100% util lasts roughly 30 seconds)
10/23/2016 10:18:20 PM
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
            0.00    0.00    0.25   25.19    0.00   74.56

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await  svctm  %util
sdb               0.00     0.00    0.00    1.00     0.00     2.50     5.00     0.03   27.00  27.00   2.70
sdc               0.00     0.00    0.00    1.00     0.00     2.50     5.00     0.01   15.00  15.00   1.50
sdd               0.00     0.00    0.00    1.00     0.00     2.50     5.00     0.02   18.00  18.00   1.80
sde               0.00     0.00    0.00    1.00     0.00     2.50     5.00     0.02   23.00  23.00   2.30
sdf               0.00     0.00    0.00    0.00     0.00     0.00     0.00     1.15    0.00   0.00 100.00
sdg               0.00     2.00    1.00    4.00     4.00   172.00    70.40     0.04    8.40   2.80   1.40
sda               0.00     0.00    0.00    1.00     0.00     2.50     5.00     0.04   37.00  37.00   3.70
sdh               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
sdj               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
sdk               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
sdi               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
fd0               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
dm-0              0.00     0.00    1.00    6.00     4.00   172.00    50.29     0.05    7.29   2.00   1.40
dm-1              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
dm-2              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
md0               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
dm-3              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
dm-4              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
dm-5              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
dm-6              0.00     0.00    0.00    0.00     0.00     0.00     0.00     1.00    0.00   0.00 100.00

10/23/2016 10:18:21 PM
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
            0.00    0.00    0.25   24.81    0.00   74.94

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await  svctm  %util
sdb               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
sdc               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
sdd               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
sde               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
sdf               0.00     0.00    0.00    0.00     0.00     0.00     0.00     2.00    0.00   0.00 100.00
sdg               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
sda               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
sdh               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
sdj               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
sdk               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
sdi               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
fd0               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
dm-0              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
dm-1              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
dm-2              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
md0               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
dm-3              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
dm-4              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
dm-5              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
dm-6              0.00     0.00    0.00    0.00     0.00     0.00     0.00     1.00    0.00   0.00 100.00


We can see that /dev/sdf ramps up to 100% starting at around (10/23/2016 
10:18:18 PM) and stays that way till about the (10/23/2016 10:18:42 PM) 
mark, when something occurs and it drops back below 100%.

So I checked the array which shows all clean, even across reboots:

[root@mbpc-pc ~]# cat /proc/mdstat
Personalities : [raid6] [raid5] [raid4]
md0 : active raid6 sdb[7] sdf[6] sdd[3] sda[5] sdc[1] sde[8]
       3907045632 blocks super 1.2 level 6, 64k chunk, algorithm 2 [6/6] [UUUUUU]
       bitmap: 1/8 pages [4KB], 65536KB chunk

unused devices: <none>
[root@mbpc-pc ~]#


Then I run smartctl across all disks and sure enough /dev/sdf prints this:

[root@mbpc-pc ~]# smartctl -A /dev/sdf
smartctl 5.43 2012-06-30 r3573 [x86_64-linux-4.8.4] (local build)
Copyright (C) 2002-12 by Bruce Allen, http://smartmontools.sourceforge.net

Error SMART Values Read failed: scsi error badly formed scsi parameters
Smartctl: SMART Read Values failed.

=== START OF READ SMART DATA SECTION ===
[root@mbpc-pc ~]#




^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [ LR] Kernel 4.8.4: INFO: task kworker/u16:8:289 blocked for more than 120 seconds.
  2016-10-30 18:56         ` [ LR] Kernel 4.8.4: INFO: task kworker/u16:8:289 blocked for more than 120 seconds TomK
@ 2016-10-30 19:16           ` TomK
  2016-10-30 20:13           ` Andreas Klauer
  2016-10-31 19:29           ` Wols Lists
  2 siblings, 0 replies; 67+ messages in thread
From: TomK @ 2016-10-30 19:16 UTC (permalink / raw)
  To: linux-raid

On 10/30/2016 2:56 PM, TomK wrote:
> [original message and iostat output snipped]

Bit trigger happy.  Here's a better version of the first sentence.  :)

We recently saw a situation where smartctl -A errored out but mdadm 
didn't pick this up. Eventually, in a short time of a few days, the disk 
cascaded into bad blocks then became a completely unrecognizable SATA disk.

-- 
Cheers,
Tom K.
-------------------------------------------------------------------------------------

Living on earth is expensive, but it includes a free trip around the sun.


^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [ LR] Kernel 4.8.4: INFO: task kworker/u16:8:289 blocked for more than 120 seconds.
  2016-10-30 18:56         ` [ LR] Kernel 4.8.4: INFO: task kworker/u16:8:289 blocked for more than 120 seconds TomK
  2016-10-30 19:16           ` TomK
@ 2016-10-30 20:13           ` Andreas Klauer
  2016-10-30 21:08             ` TomK
  2016-10-31 19:29           ` Wols Lists
  2 siblings, 1 reply; 67+ messages in thread
From: Andreas Klauer @ 2016-10-30 20:13 UTC (permalink / raw)
  To: TomK; +Cc: linux-raid

On Sun, Oct 30, 2016 at 02:56:58PM -0400, TomK wrote:
> So the question is how come the mdadm RAID did not catch this disk as a 
> failed disk and pull it out of the array?

RAID doesn't know about SMART. It's that simple.

If SMART already knows about errors - too bad, RAID doesn't care.
It also doesn't really know about anything else. If you ddrescue the 
member disk directly and it finds tons of errors... RAID isn't involved.

RAID will only kick a drive when it by itself stumbles over an error that 
does not go away when rewriting the data, or when the drive just doesn't 
respond anymore for an extended period of time. And that timeout is per 
request, so a bad disk can grind the entire system to a halt without ever 
being kicked.

ddrescue has this nice --min-read-rate option: any zone that yields data 
more slowly is considered a hopeless case. RAID has no such magic. 
If your drive always responds and always claims to successfully write 
even when it doesn't, then RAID will never kick it.

If you never run array checks or smart selftests, errors won't show. 
RAID will show the drives as healthy, SMART will show them as healthy, 
and that doesn't mean diddly-squat until you actually test them. Regularly.
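
(A minimal sketch of such regular testing, reusing commands that appear
elsewhere in this thread plus standard smartctl selftests; md0 and sdf as in
this report:)

echo check > /sys/block/md0/md/sync_action   # full array consistency check
smartctl -t long /dev/sdf                    # queue a SMART long selftest
smartctl -l selftest /dev/sdf                # review the selftest log afterwards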

Kicking drives yourself is quite normal. RAID only does so much. 
This is why we have mdadm --replace: that way even a semi-broken disk 
can help with the rebuild effort, and bad sectors on other disks won't 
result in an even bigger problem, or at least not right away.
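
(A sketch of --replace, assuming a fresh disk shows up as a hypothetical
/dev/sdl:)

mdadm /dev/md0 --add /dev/sdl                       # add the new disk as a spare
mdadm /dev/md0 --replace /dev/sdf --with /dev/sdl   # rebuild onto it while sdf still serves reads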

If you leave RAID to its own devices, it has a much higher chance of dying 
than if you run tests, and actually decide to do something once *you're* 
aware that there are problems that RAID itself isn't aware of.

> On a separate topic, if I eventually expand the array to 6 2TB disks, 
> will the array be smart enough to allow me to expand it to the new size? 

Yes. Perhaps after an additional --grow --size=max.
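
(That would look something like this once every member is 2TB; growing the
filesystem on top afterwards depends on what it is, e.g. resize2fs for ext4:)

mdadm --grow /dev/md0 --size=max   # use the full size of the new members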

Regards
Andreas Klauer

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [ LR] Kernel 4.8.4: INFO: task kworker/u16:8:289 blocked for more than 120 seconds.
  2016-10-30 20:13           ` Andreas Klauer
@ 2016-10-30 21:08             ` TomK
  0 siblings, 0 replies; 67+ messages in thread
From: TomK @ 2016-10-30 21:08 UTC (permalink / raw)
  To: Andreas Klauer; +Cc: linux-raid

On 10/30/2016 4:13 PM, Andreas Klauer wrote:
> [quoted message snipped]
Very clear. Thanks Andreas!

-- 
Cheers,
Tom K.
-------------------------------------------------------------------------------------

Living on earth is expensive, but it includes a free trip around the sun.


^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: btrfs check --repair: ERROR: cannot read chunk root
  2016-10-30 18:34 btrfs check --repair: ERROR: cannot read chunk root Marc MERLIN
@ 2016-10-31  1:02 ` Qu Wenruo
  2016-10-31  2:06   ` Marc MERLIN
  0 siblings, 1 reply; 67+ messages in thread
From: Qu Wenruo @ 2016-10-31  1:02 UTC (permalink / raw)
  To: Marc MERLIN, linux-btrfs



At 10/31/2016 02:34 AM, Marc MERLIN wrote:
> I have a filesystem on top of md raid5 that got a few problems due to the
> underlying block layer (bad data cable).
> The filesystem mounts fine, but had a few issues
> Scrub runs (I didn't let it finish, it takes a _long_ time)
> But check --repair won't even run at all:
>
> myth:~# btrfs --version
> btrfs-progs v4.7.3
> myth:~# uname -r
> 4.8.5-ia32-20161028
>
> myth:~# btrfs check -p --repair /dev/mapper/crypt_bcache0 2>&1 | tee /var/spool/repair
> bytenr mismatch, want=13835462344704, have=0
> ERROR: cannot read chunk root

Your chunk root is corrupted, and since the chunk tree provides the 
underlying disk layout (even for a single device), if we fail to read 
it, the filesystem will never be able to be mounted.

You could try to use the backup chunk root.

Use "btrfs inspect-internal dump-super -f" to find the backup chunk root, 
and "btrfs check --chunk-root <backup chunk root bytenr>" to have 
another try.
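
(A sketch of those two steps; the bytenr must come from a backup_chunk_root
line in the dump, and it only helps if it differs from the primary chunk_root:)

btrfs inspect-internal dump-super -f /dev/mapper/crypt_bcache0 | grep chunk_root
btrfs check --chunk-root <bytenr from a backup_chunk_root line> /dev/mapper/crypt_bcache0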

Thanks,
Qu
> [rest of message snipped]



^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: btrfs check --repair: ERROR: cannot read chunk root
  2016-10-31  1:02 ` Qu Wenruo
@ 2016-10-31  2:06   ` Marc MERLIN
  2016-10-31  4:21     ` Marc MERLIN
  2016-10-31  5:27     ` Qu Wenruo
  0 siblings, 2 replies; 67+ messages in thread
From: Marc MERLIN @ 2016-10-31  2:06 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: linux-btrfs

On Mon, Oct 31, 2016 at 09:02:50AM +0800, Qu Wenruo wrote:
> Your chunk root is corrupted, and since chunk tree provides the 
> underlying disk layout, even for single device, so if we failed to read 
> it, then it will never be able to be mounted.
 
That's the thing though, I can mount the filesystem just fine :)

> You could try to use backup chunk root.
> 
> "btrfs inspect-internal dump-super -f" to find the backup chunk root, 
> and use "btrfs check --chunk-root <backup chunk root bytenr>" to have 
> another try.

Am I doing this right? It doesn't seem to work

myth:~# btrfs check -p --repair --chunk-root 13835462344704 /dev/mapper/crypt_bcache0  2>&1 | tee /var/spool/repair2
bytenr mismatch, want=13835462344704, have=0
ERROR: cannot read chunk root
Couldn't open file system
enabling repair mode


myth:~# btrfs inspect-internal dump-super -f /dev/mapper/crypt_bcache0 | less
superblock: bytenr=65536, device=/dev/mapper/crypt_bcache0
---------------------------------------------------------
csum_type               0 (crc32c)
csum_size               4
csum                    0x3814e4a0 [match]
bytenr                  65536
flags                   0x1
                        ( WRITTEN )
magic                   _BHRfS_M [match]
fsid                    6692cf4c-93d9-438c-ac30-5db6381dc4f2
label                   DS5
generation              51176
root                    13845513109504
sys_array_size          129
chunk_root_generation   51135
root_level              1
chunk_root              13835462344704
chunk_root_level        1
log_root                0
log_root_transid        0
log_root_level          0
total_bytes             16002599346176
bytes_used              14584560160768
sectorsize              4096
nodesize                16384
leafsize                16384
stripesize              4096
root_dir                6
num_devices             1
compat_flags            0x0
compat_ro_flags         0x0
incompat_flags          0x169
                        ( MIXED_BACKREF |
                          COMPRESS_LZO |
                          BIG_METADATA |
                          EXTENDED_IREF |
                          SKINNY_METADATA )
cache_generation        51176
uuid_tree_generation    51176
dev_item.uuid           0cf779be-8e16-4982-b7d7-f8241deea0d1
dev_item.fsid           6692cf4c-93d9-438c-ac30-5db6381dc4f2 [match]
dev_item.type           0
dev_item.total_bytes    16002599346176
dev_item.bytes_used     14691011133440
dev_item.io_align       4096
dev_item.io_width       4096
dev_item.sector_size    4096
dev_item.devid          1
dev_item.dev_group      0
dev_item.seek_speed     0
dev_item.bandwidth      0
dev_item.generation     0
sys_chunk_array[2048]:
        item 0 key (FIRST_CHUNK_TREE CHUNK_ITEM 13835461197824)
                chunk length 33554432 owner 2 stripe_len 65536
                type SYSTEM|DUP num_stripes 2
                        stripe 0 devid 1 offset 13500327919616
                        dev uuid: 0cf779be-8e16-4982-b7d7-f8241deea0d1
                        stripe 1 devid 1 offset 13500361474048
                        dev uuid: 0cf779be-8e16-4982-b7d7-f8241deea0d1
backup_roots[4]:
        backup 0:
                backup_tree_root:       12801101791232  gen: 51174      level: 1
                backup_chunk_root:      13835462344704  gen: 51135      level: 1
                backup_extent_root:     12801124352000  gen: 51174      level: 3
                backup_fs_root:         10548133724160  gen: 51172      level: 0
                backup_dev_root:        11125467824128  gen: 51172      level: 1
                backup_csum_root:       12801133953024  gen: 51174      level: 3
                backup_total_bytes:     16002599346176
                backup_bytes_used:      14584560160768
                backup_num_devices:     1

        backup 1:
                backup_tree_root:       13842532810752  gen: 51175      level: 1
                backup_chunk_root:      13835462344704  gen: 51135      level: 1
                backup_extent_root:     13843784695808  gen: 51175      level: 3
                backup_fs_root:         10548133724160  gen: 51172      level: 0
                backup_dev_root:        11125467824128  gen: 51172      level: 1
                backup_csum_root:       13842542362624  gen: 51175      level: 3
                backup_total_bytes:     16002599346176
                backup_bytes_used:      14584560160768
                backup_num_devices:     1

        backup 2:
                backup_tree_root:       13845513109504  gen: 51176      level: 1
                backup_chunk_root:      13835462344704  gen: 51135      level: 1
                backup_extent_root:     13845513191424  gen: 51176      level: 3
                backup_fs_root:         10548133724160  gen: 51172      level: 0
                backup_dev_root:        11125467824128  gen: 51172      level: 1
                backup_csum_root:       13852180938752  gen: 51176      level: 3
                backup_total_bytes:     16002599346176
                backup_bytes_used:      14584560160768
                backup_num_devices:     1

        backup 3:
                backup_tree_root:       12750807580672  gen: 51173      level: 1
                backup_chunk_root:      13835462344704  gen: 51135      level: 1
                backup_extent_root:     12750810447872  gen: 51173      level: 3
                backup_fs_root:         10548133724160  gen: 51172      level: 0
                backup_dev_root:        11125467824128  gen: 51172      level: 1
                backup_csum_root:       12684302712832  gen: 51173      level: 3
                backup_total_bytes:     16002599346176
                backup_bytes_used:      14584560177152
                backup_num_devices:     1



-- 
"A mouse is a device used to point at the xterm you want to type in" - A.S.R.
Microsoft is to operating systems ....
                                      .... what McDonalds is to gourmet cooking
Home page: http://marc.merlins.org/  

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: btrfs check --repair: ERROR: cannot read chunk root
  2016-10-31  2:06   ` Marc MERLIN
@ 2016-10-31  4:21     ` Marc MERLIN
  2016-10-31  5:27     ` Qu Wenruo
  1 sibling, 0 replies; 67+ messages in thread
From: Marc MERLIN @ 2016-10-31  4:21 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: linux-btrfs

On Sun, Oct 30, 2016 at 07:06:16PM -0700, Marc MERLIN wrote:
> On Mon, Oct 31, 2016 at 09:02:50AM +0800, Qu Wenruo wrote:
> > Your chunk root is corrupted, and since chunk tree provides the 
> > underlying disk layout, even for single device, so if we failed to read 
> > it, then it will never be able to be mounted.
>  
> That's the thing though, I can mount the filesystem just fine :)

Actually, has anyone seen any configuration where the kernel can mount a
filesystem read/write, without ro or recovery options, and yet
btrfs check --repair can't open it?

This kind of sounds like a bug in check --repair, IMO.

Marc
-- 
"A mouse is a device used to point at the xterm you want to type in" - A.S.R.
Microsoft is to operating systems ....
                                      .... what McDonalds is to gourmet cooking
Home page: http://marc.merlins.org/  

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: btrfs check --repair: ERROR: cannot read chunk root
  2016-10-31  2:06   ` Marc MERLIN
  2016-10-31  4:21     ` Marc MERLIN
@ 2016-10-31  5:27     ` Qu Wenruo
  2016-10-31  5:47       ` Marc MERLIN
  1 sibling, 1 reply; 67+ messages in thread
From: Qu Wenruo @ 2016-10-31  5:27 UTC (permalink / raw)
  To: Marc MERLIN; +Cc: linux-btrfs



At 10/31/2016 10:06 AM, Marc MERLIN wrote:
> On Mon, Oct 31, 2016 at 09:02:50AM +0800, Qu Wenruo wrote:
>> Your chunk root is corrupted, and since chunk tree provides the
>> underlying disk layout, even for single device, so if we failed to read
>> it, then it will never be able to be mounted.
>
> That's the thing though, I can mount the filesystem just fine :)

That's strange, pretty strange.

And according to your super dump, I didn't see anything btrfs-progs 
can't handle.

Your chunk tree lies in a DUP chunk, which btrfs-progs should be able to 
handle. (Unlike RAID5/6, which btrfs-progs doesn't support recovering at 
read time.)



>
>> You could try to use backup chunk root.
>>
>> "btrfs inspect-internal dump-super -f" to find the backup chunk root,
>> and use "btrfs check --chunk-root <backup chunk root bytenr>" to have
>> another try.
>
> Am I doing this right? It doesn't seem to work
>
> myth:~# btrfs check -p --repair --chunk-root 13835462344704 /dev/mapper/crypt_bcache0  2>&1 | tee /var/spool/repair2
> bytenr mismatch, want=13835462344704, have=0
> ERROR: cannot read chunk root
> Couldn't open file system
> enabling repair mode

You're doing it right, but the superblock doesn't contain any older 
chunk root bytenr: all the backup_chunk_root entries point at the same 
block as the primary.

So this method doesn't work at all. :(


Would you please dump the following bytes?
That's the chunk root tree block on your disk.

offset: 13500329066496 length: 16384
offset: 13500330213376 length: 16384
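
(A sketch of dumping those ranges with plain dd; both byte offsets are
divisible by 16384, so the block arithmetic below is exact:)

dd if=/dev/mapper/crypt_bcache0 of=/tmp/chunk-copy1.bin bs=16384 skip=823994694 count=1  # byte 13500329066496
dd if=/dev/mapper/crypt_bcache0 of=/tmp/chunk-copy2.bin bs=16384 skip=823994764 count=1  # byte 13500330213376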

According to your fsck error output, I assume btrfs-progs fails to read 
the first copy of the chunk root and, due to a bug, doesn't continue on to 
the 2nd copy.

The kernel, meanwhile, continues to read the 2nd copy and everything goes on.

IIRC btrfs-progs can handle csum errors and keep trying, so maybe some 
logic goes wrong.

Thanks,
Qu
> [dump-super output snipped]
>                 backup_dev_root:        11125467824128  gen: 51172      level: 1
>                 backup_csum_root:       13852180938752  gen: 51176      level: 3
>                 backup_total_bytes:     16002599346176
>                 backup_bytes_used:      14584560160768
>                 backup_num_devices:     1
>
>         backup 3:
>                 backup_tree_root:       12750807580672  gen: 51173      level: 1
>                 backup_chunk_root:      13835462344704  gen: 51135      level: 1
>                 backup_extent_root:     12750810447872  gen: 51173      level: 3
>                 backup_fs_root:         10548133724160  gen: 51172      level: 0
>                 backup_dev_root:        11125467824128  gen: 51172      level: 1
>                 backup_csum_root:       12684302712832  gen: 51173      level: 3
>                 backup_total_bytes:     16002599346176
>                 backup_bytes_used:      14584560177152
>                 backup_num_devices:     1
>
>
>



^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: btrfs check --repair: ERROR: cannot read chunk root
  2016-10-31  5:27     ` Qu Wenruo
@ 2016-10-31  5:47       ` Marc MERLIN
  2016-10-31  6:04         ` Qu Wenruo
  0 siblings, 1 reply; 67+ messages in thread
From: Marc MERLIN @ 2016-10-31  5:47 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: linux-btrfs

On Mon, Oct 31, 2016 at 01:27:56PM +0800, Qu Wenruo wrote:
> Would you please dump the following bytes?
> That's the chunk root tree block on your disk.
> 
> offset: 13500329066496 length: 16384
> offset: 13500330213376 length: 16384
 
Sorry for asking, am I doing this wrong?
myth:~# dd if=/dev/mapper/crypt_bcache0 of=/tmp/dump1 bs=512 count=32
skip=26367830208
dd: reading `/dev/mapper/crypt_bcache0': Invalid argument
0+0 records in
0+0 records out
0 bytes (0 B) copied, 0.000401393 s, 0.0 kB/s

> According to your fsck error output, I assume btrfs-progs fails to read 
> the first copy of chunk root, and due to a bug, it doesn't continue to 
> read 2nd copy.
> 
> While kernel continues to read the 2nd copy and everything goes on.
 
Ah, that would make sense.
But from what you're saying, I should be able to do recovery by pointing
to the 2nd copy of the chunk root, but somehow I haven't typed the right
command to do so yet, correct?

Should I try another command offset than 
btrfs check -p --repair --chunk-root 13835462344704 /dev/mapper/crypt_bcache0 
?

Or are you saying the btrfs progs bug causes it to fail to even try to read
the 2nd copy of the chunk root even though it was given on the command line?

Thanks,
Marc
-- 
"A mouse is a device used to point at the xterm you want to type in" - A.S.R.
Microsoft is to operating systems ....
                                      .... what McDonalds is to gourmet cooking
Home page: http://marc.merlins.org/  

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: btrfs check --repair: ERROR: cannot read chunk root
  2016-10-31  5:47       ` Marc MERLIN
@ 2016-10-31  6:04         ` Qu Wenruo
  2016-10-31  6:25           ` Marc MERLIN
  0 siblings, 1 reply; 67+ messages in thread
From: Qu Wenruo @ 2016-10-31  6:04 UTC (permalink / raw)
  To: Marc MERLIN; +Cc: linux-btrfs



At 10/31/2016 01:47 PM, Marc MERLIN wrote:
> On Mon, Oct 31, 2016 at 01:27:56PM +0800, Qu Wenruo wrote:
>> Would you please dump the following bytes?
>> That's the chunk root tree block on your disk.
>>
>> offset: 13500329066496 length: 16384
>> offset: 13500330213376 length: 16384
>
> Sorry for asking, am I doing this wrong?
> myth:~# dd if=/dev/mapper/crypt_bcache0 of=/tmp/dump1 bs=512 count=32
> skip=26367830208
> dd: reading `/dev/mapper/crypt_bcache0': Invalid argument
> 0+0 records in
> 0+0 records out
> 0 bytes (0 B) copied, 0.000401393 s, 0.0 kB/s

So the underlying MD RAID5 is complaining about some wrong data, and 
refusing to read it out.

It seems that btrfs-progs can't handle read failure?
Maybe dm-error could emulate it.

And what about the 2nd range?

>
>> According to your fsck error output, I assume btrfs-progs fails to read
>> the first copy of chunk root, and due to a bug, it doesn't continue to
>> read 2nd copy.
>>
>> While kernel continues to read the 2nd copy and everything goes on.
>
> Ah, that would make sense.
> But from what you're saying, I should be able to do recovery by pointing
> to the 2nd copy of the chunk root, but somehow I haven't typed the right
> command to do so yet, correct?

Unfortunately, that's not the case.

The --chunk-root option takes a *logical* bytenr.

We can only tell btrfs-progs (the kernel is the same) to find the tree 
root/chunk root at a given *logical* bytenr.

But we can't specify which *physical* copy to read.

Normally, btrfs-progs/kernel should find the correct physical copy 
without problem, but not this time for btrfs-progs.
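
(In principle, btrfs-map-logical -l <logical bytenr> <device> from
btrfs-progs can print where each physical mirror of a tree block lives,
but since it has to open the filesystem first, it would presumably trip
over the same chunk root read here.)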

And furthermore, all the backup chunk roots are in fact pointing to the 
current chunk root, so --chunk-root doesn't work at all.

>
> Should I try another command offset than
> btrfs check -p --repair --chunk-root 13835462344704 /dev/mapper/crypt_bcache0
> ?
Nope, that bytenr is a *physical* bytenr, not the *logical* bytenr 
--chunk-root accepts.

But the read error on the first tree block already gives some hint.
I'll try to emulate it.

Thanks,
Qu

>
> Or are you saying the btrfs progs bug causes it to fail to even try to read
> the 2nd copy of the chunk root even though it was given on the command line?
>
> Thanks,
> Marc
>



^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: btrfs check --repair: ERROR: cannot read chunk root
  2016-10-31  6:04         ` Qu Wenruo
@ 2016-10-31  6:25           ` Marc MERLIN
  2016-10-31  6:32             ` Qu Wenruo
  0 siblings, 1 reply; 67+ messages in thread
From: Marc MERLIN @ 2016-10-31  6:25 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: linux-btrfs

On Mon, Oct 31, 2016 at 02:04:10PM +0800, Qu Wenruo wrote:
> >Sorry for asking, am I doing this wrong?
> >myth:~# dd if=/dev/mapper/crypt_bcache0 of=/tmp/dump1 bs=512 count=32
> >skip=26367830208
> >dd: reading `/dev/mapper/crypt_bcache0': Invalid argument
> >0+0 records in
> >0+0 records out
> >0 bytes (0 B) copied, 0.000401393 s, 0.0 kB/s
> 
> So the underlying MD RAID5 is complaining about some wrong data, and 
> refusing to read it out.
> 
> It seems that btrfs-progs can't handle read failure?
> Maybe dm-error could emulate it.
> 
> And what about the 2nd range?

they both fail the same, but I wasn't sure if I typed the wrong dd command
or not.

myth:~# btrfs fi df /mnt/mnt
Data, single: total=13.22TiB, used=13.19TiB
System, DUP: total=32.00MiB, used=1.42MiB
Metadata, DUP: total=74.00GiB, used=72.82GiB
GlobalReserve, single: total=512.00MiB, used=0.00B
myth:~# btrfs fi show
Label: 'DS5'  uuid: 6692cf4c-93d9-438c-ac30-5db6381dc4f2
        Total devices 1 FS bytes used 13.26TiB
        devid    1 size 14.55TiB used 13.36TiB path /dev/mapper/crypt_bcache0

For now, I mounted the filesystem and I'm running scrub on it to see how
much damage there is. It will take all night:
BTRFS warning (device dm-0): checksum error at logical 27886878720 on dev /dev/mapper/crypt_bcache0, sector 56580096, root 9461, inode 45837, offset 15460089856, length 4096, links 1 (path: system/mlocate/mlocate.db)
BTRFS error (device dm-0): bdev /dev/mapper/crypt_bcache0 errs: wr 0, rd 0, flush 0, corrupt 1, gen 0
BTRFS error (device dm-0): unable to fixup (regular) error at logical 27887009792 on dev /dev/mapper/crypt_bcache0
BTRFS error (device dm-0): bdev /dev/mapper/crypt_bcache0 errs: wr 0, rd 0, flush 0, corrupt 2, gen 0
BTRFS error (device dm-0): unable to fixup (regular) error at logical 27886878720 on dev /dev/mapper/crypt_bcache0
BTRFS warning (device dm-0): checksum error at logical 27885961216 on dev /dev/mapper/crypt_bcache0, sector 56578304, root 9461, inode 45837, offset 15459172352, length 4096, links 1 (path: system/mlocate/mlocate.db)
BTRFS warning (device dm-0): checksum error at logical 27885830144 on dev /dev/mapper/crypt_bcache0, sector 56578048, root 9461, inode 45837, offset 15459041280, length 4096, links 1 (path: system/mlocate/mlocate.db)
BTRFS error (device dm-0): bdev /dev/mapper/crypt_bcache0 errs: wr 0, rd 0, flush 0, corrupt 3, gen 0
BTRFS error (device dm-0): unable to fixup (regular) error at logical 27885830144 on dev /dev/mapper/crypt_bcache0
BTRFS error (device dm-0): bdev /dev/mapper/crypt_bcache0 errs: wr 0, rd 0, flush 0, corrupt 4, gen 0
BTRFS error (device dm-0): unable to fixup (regular) error at logical 27885961216 on dev /dev/mapper/crypt_bcache0
BTRFS warning (device dm-0): checksum error at logical 27887013888 on dev /dev/mapper/crypt_bcache0, sector 56580360, root 9461, inode 45837, offset 15460225024, length 4096, links 1 (path: system/mlocate/mlocate.db)
BTRFS error (device dm-0): bdev /dev/mapper/crypt_bcache0 errs: wr 0, rd 0, flush 0, corrupt 5, gen 0
BTRFS error (device dm-0): unable to fixup (regular) error at logical 27887013888 on dev /dev/mapper/crypt_bcache0
BTRFS warning (device dm-0): checksum error at logical 27885834240 on dev /dev/mapper/crypt_bcache0, sector 56578056, root 9461, inode 45837, offset 15459045376, length 4096, links 1 (path: system/mlocate/mlocate.db)
BTRFS error (device dm-0): bdev /dev/mapper/crypt_bcache0 errs: wr 0, rd 0, flush 0, corrupt 6, gen 0
BTRFS error (device dm-0): unable to fixup (regular) error at logical 27885834240 on dev /dev/mapper/crypt_bcache0
BTRFS warning (device dm-0): checksum error at logical 27887017984 on dev /dev/mapper/crypt_bcache0, sector 56580368, root 9461, inode 45837, offset 15460229120, length 4096, links 1 (path: system/mlocate/mlocate.db)
BTRFS error (device dm-0): bdev /dev/mapper/crypt_bcache0 errs: wr 0, rd 0, flush 0, corrupt 7, gen 0
BTRFS error (device dm-0): unable to fixup (regular) error at logical 27887017984 on dev /dev/mapper/crypt_bcache0

So far, it looks like minor damage limited to one file; I'll see tomorrow morning, after it's done reading the whole array.

> And further more, all backup chunk root are in facts pointing to current 
> chunk root, so --chunk-root doesn't work at all.

Ah, ok, so there is nothing I can do at the moment until I get a new btrfs-progs, correct?

Thanks for your answers
Marc
-- 
"A mouse is a device used to point at the xterm you want to type in" - A.S.R.
Microsoft is to operating systems ....
                                      .... what McDonalds is to gourmet cooking
Home page: http://marc.merlins.org/  

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: btrfs check --repair: ERROR: cannot read chunk root
  2016-10-31  6:25           ` Marc MERLIN
@ 2016-10-31  6:32             ` Qu Wenruo
  2016-10-31  6:37               ` Marc MERLIN
  0 siblings, 1 reply; 67+ messages in thread
From: Qu Wenruo @ 2016-10-31  6:32 UTC (permalink / raw)
  To: Marc MERLIN; +Cc: linux-btrfs



At 10/31/2016 02:25 PM, Marc MERLIN wrote:
> On Mon, Oct 31, 2016 at 02:04:10PM +0800, Qu Wenruo wrote:
>>> Sorry for asking, am I doing this wrong?
>>> myth:~# dd if=/dev/mapper/crypt_bcache0 of=/tmp/dump1 bs=512 count=32
>>> skip=26367830208
>>> dd: reading `/dev/mapper/crypt_bcache0': Invalid argument
>>> 0+0 records in
>>> 0+0 records out
>>> 0 bytes (0 B) copied, 0.000401393 s, 0.0 kB/s
>>
>> So the underlying MD RAID5 is complaining about some wrong data, and
>> refusing to read it out.
>>
>> It seems that btrfs-progs can't handle read failure?
>> Maybe dm-error could emulate it.
>>
>> And what about the 2nd range?
>
> they both fail the same, but I wasn't sure if I typed the wrong dd command
> or not.

Strange, your command seems OK to me.

Does it have anything to do with your security setup or something like that?
Or is it related to dm-crypt or bcache?


But this reminds me, if dd can't read it, maybe btrfs-progs is the same.

Maybe only the kernel can read the dm-crypt device, while user space tools 
can't access dm-crypt devices directly?

Thanks,
Qu

>
> myth:~# btrfs fi df /mnt/mnt
> Data, single: total=13.22TiB, used=13.19TiB
> System, DUP: total=32.00MiB, used=1.42MiB
> Metadata, DUP: total=74.00GiB, used=72.82GiB
> GlobalReserve, single: total=512.00MiB, used=0.00B
> myth:~# btrfs fi show
> Label: 'DS5'  uuid: 6692cf4c-93d9-438c-ac30-5db6381dc4f2
>         Total devices 1 FS bytes used 13.26TiB
>         devid    1 size 14.55TiB used 13.36TiB path /dev/mapper/crypt_bcache0
>
> For now, I mounted the filesystem and I'm running scrub on it to see how
> much damage there is. It will take all night:
> BTRFS warning (device dm-0): checksum error at logical 27886878720 on dev /dev/mapper/crypt_bcache0, sector 56580096, root 9461, inode 45837, offset 15460089856, length 4096, links 1 (path: system/mlocate/mlocate.db)
> BTRFS error (device dm-0): bdev /dev/mapper/crypt_bcache0 errs: wr 0, rd 0, flush 0, corrupt 1, gen 0
> BTRFS error (device dm-0): unable to fixup (regular) error at logical 27887009792 on dev /dev/mapper/crypt_bcache0
> BTRFS error (device dm-0): bdev /dev/mapper/crypt_bcache0 errs: wr 0, rd 0, flush 0, corrupt 2, gen 0
> BTRFS error (device dm-0): unable to fixup (regular) error at logical 27886878720 on dev /dev/mapper/crypt_bcache0
> BTRFS warning (device dm-0): checksum error at logical 27885961216 on dev /dev/mapper/crypt_bcache0, sector 56578304, root 9461, inode 45837, offset 15459172352, length 4096, links 1 (path: system/mlocate/mlocate.db)
> BTRFS warning (device dm-0): checksum error at logical 27885830144 on dev /dev/mapper/crypt_bcache0, sector 56578048, root 9461, inode 45837, offset 15459041280, length 4096, links 1 (path: system/mlocate/mlocate.db)
> BTRFS error (device dm-0): bdev /dev/mapper/crypt_bcache0 errs: wr 0, rd 0, flush 0, corrupt 3, gen 0
> BTRFS error (device dm-0): unable to fixup (regular) error at logical 27885830144 on dev /dev/mapper/crypt_bcache0
> BTRFS error (device dm-0): bdev /dev/mapper/crypt_bcache0 errs: wr 0, rd 0, flush 0, corrupt 4, gen 0
> BTRFS error (device dm-0): unable to fixup (regular) error at logical 27885961216 on dev /dev/mapper/crypt_bcache0
> BTRFS warning (device dm-0): checksum error at logical 27887013888 on dev /dev/mapper/crypt_bcache0, sector 56580360, root 9461, inode 45837, offset 15460225024, length 4096, links 1 (path: system/mlocate/mlocate.db)
> BTRFS error (device dm-0): bdev /dev/mapper/crypt_bcache0 errs: wr 0, rd 0, flush 0, corrupt 5, gen 0
> BTRFS error (device dm-0): unable to fixup (regular) error at logical 27887013888 on dev /dev/mapper/crypt_bcache0
> BTRFS warning (device dm-0): checksum error at logical 27885834240 on dev /dev/mapper/crypt_bcache0, sector 56578056, root 9461, inode 45837, offset 15459045376, length 4096, links 1 (path: system/mlocate/mlocate.db)
> BTRFS error (device dm-0): bdev /dev/mapper/crypt_bcache0 errs: wr 0, rd 0, flush 0, corrupt 6, gen 0
> BTRFS error (device dm-0): unable to fixup (regular) error at logical 27885834240 on dev /dev/mapper/crypt_bcache0
> BTRFS warning (device dm-0): checksum error at logical 27887017984 on dev /dev/mapper/crypt_bcache0, sector 56580368, root 9461, inode 45837, offset 15460229120, length 4096, links 1 (path: system/mlocate/mlocate.db)
> BTRFS error (device dm-0): bdev /dev/mapper/crypt_bcache0 errs: wr 0, rd 0, flush 0, corrupt 7, gen 0
> BTRFS error (device dm-0): unable to fixup (regular) error at logical 27887017984 on dev /dev/mapper/crypt_bcache0
>
> So far, it looks like minor damage limited to one file; I'll see tomorrow morning, after it's done reading the whole array.
>
>> And furthermore, all the backup chunk roots are in fact pointing to the
>> current chunk root, so --chunk-root doesn't work at all.
>
> Ah, ok, so there is nothing I can do at the moment until I get a new btrfs-progs, correct?
>
> Thanks for your answers
> Marc
>



^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: btrfs check --repair: ERROR: cannot read chunk root
  2016-10-31  6:32             ` Qu Wenruo
@ 2016-10-31  6:37               ` Marc MERLIN
  2016-10-31  7:04                 ` Qu Wenruo
  0 siblings, 1 reply; 67+ messages in thread
From: Marc MERLIN @ 2016-10-31  6:37 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: linux-btrfs

On Mon, Oct 31, 2016 at 02:32:53PM +0800, Qu Wenruo wrote:
> 
> 
> At 10/31/2016 02:25 PM, Marc MERLIN wrote:
> >On Mon, Oct 31, 2016 at 02:04:10PM +0800, Qu Wenruo wrote:
> >>>Sorry for asking, am I doing this wrong?
> >>>myth:~# dd if=/dev/mapper/crypt_bcache0 of=/tmp/dump1 bs=512 count=32
> >>>skip=26367830208
> >>>dd: reading `/dev/mapper/crypt_bcache0': Invalid argument
> >>>0+0 records in
> >>>0+0 records out
> >>>0 bytes (0 B) copied, 0.000401393 s, 0.0 kB/s
> >>
> >>So the underlying MD RAID5 is complaining about some wrong data, and
> >>refusing to read it out.
> >>
> >>It seems that btrfs-progs can't handle read failure?
> >>Maybe dm-error could emulate it.
> >>
> >>And what about the 2nd range?
> >
> >they both fail the same, but I wasn't sure if I typed the wrong dd command
> >or not.
> 
> Strange, your command seems OK to me.
> 
> Does it have anything to do with your security setup or something like that?
> Or is it related to dm-crypt or bcache?
> 
> 
> But this reminds me, if dd can't read it, maybe btrfs-progs is the same.
> 
> Maybe only kernel can read dm-crypt device while user space tools can't 
> access dm-crypt devices directly?

It can; it's just that the offset seems wrong:

myth:~# dd if=/dev/mapper/crypt_bcache0 of=/tmp/dump1 bs=512 count=32 skip=26367830208
dd: reading `/dev/mapper/crypt_bcache0': Invalid argument
0+0 records in
0+0 records out
0 bytes (0 B) copied, 0.000421662 s, 0.0 kB/s

If I divide by 1000, it works:
myth:~# dd if=/dev/mapper/crypt_bcache0 of=/tmp/dump1 bs=512 count=32 skip=26367830
32+0 records in
32+0 records out
16384 bytes (16 kB) copied, 0.139005 s, 118 kB/s

so that's why I was asking you if I counted the offset wrong. I took the
value you asked for and divided by 512, but it seems too big:

13500329066496 / 512 = 26367830208
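
(One more sanity check I should run, assuming util-linux is installed: ask
the kernel what size it thinks the dm device has, and compare that to the
offset:

myth:~# blockdev --getsize64 /dev/mapper/crypt_bcache0

If that number comes back smaller than 13500329066496, the EINVAL would at
least make sense.)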

Marc
-- 
"A mouse is a device used to point at the xterm you want to type in" - A.S.R.
Microsoft is to operating systems ....
                                      .... what McDonalds is to gourmet cooking
Home page: http://marc.merlins.org/  

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: btrfs check --repair: ERROR: cannot read chunk root
  2016-10-31  6:37               ` Marc MERLIN
@ 2016-10-31  7:04                 ` Qu Wenruo
  2016-10-31  8:44                   ` Hugo Mills
  0 siblings, 1 reply; 67+ messages in thread
From: Qu Wenruo @ 2016-10-31  7:04 UTC (permalink / raw)
  To: Marc MERLIN; +Cc: linux-btrfs



At 10/31/2016 02:37 PM, Marc MERLIN wrote:
> On Mon, Oct 31, 2016 at 02:32:53PM +0800, Qu Wenruo wrote:
>>
>>
>> At 10/31/2016 02:25 PM, Marc MERLIN wrote:
>>> On Mon, Oct 31, 2016 at 02:04:10PM +0800, Qu Wenruo wrote:
>>>>> Sorry for asking, am I doing this wrong?
>>>>> myth:~# dd if=/dev/mapper/crypt_bcache0 of=/tmp/dump1 bs=512 count=32
>>>>> skip=26367830208
>>>>> dd: reading `/dev/mapper/crypt_bcache0': Invalid argument
>>>>> 0+0 records in
>>>>> 0+0 records out
>>>>> 0 bytes (0 B) copied, 0.000401393 s, 0.0 kB/s
>>>>
>>>> So the underlying MD RAID5 is complaining about some wrong data, and
>>>> refusing to read it out.
>>>>
>>>> It seems that btrfs-progs can't handle read failure?
>>>> Maybe dm-error could emulate it.
>>>>
>>>> And what about the 2nd range?
>>>
>>> they both fail the same, but I wasn't sure if I typed the wrong dd command
>>> or not.
>>
>> Strange, your command seems OK to me.
>>
>> Does it have anything to do with your security setup or something like that?
>> Or is it related to dm-crypt or bcache?
>>
>>
>> But this reminds me, if dd can't read it, maybe btrfs-progs is the same.
>>
>> Maybe only kernel can read dm-crypt device while user space tools can't
>> access dm-crypt devices directly?
>
> It can, it's just the offset seems wrong:
>
> myth:~# dd if=/dev/mapper/crypt_bcache0 of=/tmp/dump1 bs=512 count=32 skip=26367830208
> dd: reading `/dev/mapper/crypt_bcache0': Invalid argument
> 0+0 records in
> 0+0 records out
> 0 bytes (0 B) copied, 0.000421662 s, 0.0 kB/s
>
> If I divide by 1000, it works:
> myth:~# dd if=/dev/mapper/crypt_bcache0 of=/tmp/dump1 bs=512 count=32 skip=26367830
> 32+0 records in
> 32+0 records out
> 16384 bytes (16 kB) copied, 0.139005 s, 118 kB/s
>
> so that's why I was asking you if I counted the offset wrong. I took the
> value you asked and divided by 512, but it seems too big
>
> 13500329066496 / 512 = 26367830208
>
> Marc
>
But according to your dump-super output, that's strange.
------
chunk_root              13835462344704 (CR)
         item 0 key (FIRST_CHUNK_TREE CHUNK_ITEM 13835461197824) (CS)
                 chunk length 33554432 owner 2 stripe_len 65536
                 type SYSTEM|DUP num_stripes 2
                         stripe 0 devid 1 offset 13500327919616 (ST1)
                         dev uuid: 0cf779be-8e16-4982-b7d7-f8241deea0d1
                         stripe 1 devid 1 offset 13500361474048 (ST2)
                         dev uuid: 0cf779be-8e16-4982-b7d7-f8241deea0d1
------

Here, your chunk logical bytenr is 13835461197824, and its physical 
bytenrs are 13500327919616 and 13500361474048.

My calculation is quite simple.
Start1 = CR - CS + ST1
Start2 = CR - CS + ST2
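
Plugging your numbers in, just to double-check the arithmetic:

Start1 = 13835462344704 - 13835461197824 + 13500327919616 = 13500329066496

which is the first offset I asked you to dump earlier.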

Unless the superblock is incorrect, it is not possible.

And the physical offset is about 12.2 TiB, which is smaller than the 
~15 TiB of your device.

So it's quite strange that dd can't read out the data.
And if dd can't read it out, then I see no reason btrfs-progs could 
read it out.

Any idea of a special dm setup which could make us fail to read some 
data range?

Thanks,
Qu




^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: btrfs check --repair: ERROR: cannot read chunk root
  2016-10-31  7:04                 ` Qu Wenruo
@ 2016-10-31  8:44                   ` Hugo Mills
  2016-10-31 15:04                     ` Marc MERLIN
  0 siblings, 1 reply; 67+ messages in thread
From: Hugo Mills @ 2016-10-31  8:44 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: Marc MERLIN, linux-btrfs

On Mon, Oct 31, 2016 at 03:04:27PM +0800, Qu Wenruo wrote:
> 
> 
> At 10/31/2016 02:37 PM, Marc MERLIN wrote:
> >On Mon, Oct 31, 2016 at 02:32:53PM +0800, Qu Wenruo wrote:
> >>
> >>
> >>At 10/31/2016 02:25 PM, Marc MERLIN wrote:
> >>>On Mon, Oct 31, 2016 at 02:04:10PM +0800, Qu Wenruo wrote:
> >>>>>Sorry for asking, am I doing this wrong?
> >>>>>myth:~# dd if=/dev/mapper/crypt_bcache0 of=/tmp/dump1 bs=512 count=32
> >>>>>skip=26367830208
> >>>>>dd: reading `/dev/mapper/crypt_bcache0': Invalid argument
> >>>>>0+0 records in
> >>>>>0+0 records out
> >>>>>0 bytes (0 B) copied, 0.000401393 s, 0.0 kB/s
> >>>>
> >>>>So the underlying MD RAID5 is complaining about some wrong data, and
> >>>>refusing to read it out.
> >>>>
> >>>>It seems that btrfs-progs can't handle read failure?
> >>>>Maybe dm-error could emulate it.
> >>>>
> >>>>And what about the 2nd range?
> >>>
> >>>they both fail the same, but I wasn't sure if I typed the wrong dd command
> >>>or not.
> >>
> >>Strange, your command seems OK to me.
> >>
> >>Does it have anything to do with your security setup or something like that?
> >>Or is it related to dm-crypt or bcache?
> >>
> >>
> >>But this reminds me, if dd can't read it, maybe btrfs-progs is the same.
> >>
> >>Maybe only kernel can read dm-crypt device while user space tools can't
> >>access dm-crypt devices directly?
> >
> >It can, it's just the offset seems wrong:
> >
> >myth:~# dd if=/dev/mapper/crypt_bcache0 of=/tmp/dump1 bs=512 count=32 skip=26367830208
> >dd: reading `/dev/mapper/crypt_bcache0': Invalid argument
> >0+0 records in
> >0+0 records out
> >0 bytes (0 B) copied, 0.000421662 s, 0.0 kB/s
> >
> >If I divide by 1000, it works:
> >myth:~# dd if=/dev/mapper/crypt_bcache0 of=/tmp/dump1 bs=512 count=32 skip=26367830
> >32+0 records in
> >32+0 records out
> >16384 bytes (16 kB) copied, 0.139005 s, 118 kB/s
> >
> >so that's why I was asking you if I counted the offset wrong. I took the
> >value you asked and divided by 512, but it seems too big
> >
> >13500329066496 / 512 = 26367830208
> >
> >Marc
> >
> But according to your dump-super output, that's strange.
> ------
> chunk_root              13835462344704 (CR)
>         item 0 key (FIRST_CHUNK_TREE CHUNK_ITEM 13835461197824) (CS)
>                 chunk length 33554432 owner 2 stripe_len 65536
>                 type SYSTEM|DUP num_stripes 2
>                         stripe 0 devid 1 offset 13500327919616 (ST1)
>                         dev uuid: 0cf779be-8e16-4982-b7d7-f8241deea0d1
>                         stripe 1 devid 1 offset 13500361474048 (ST2)
>                         dev uuid: 0cf779be-8e16-4982-b7d7-f8241deea0d1
> ------
> 
> Here, your chunk logical bytenr is 13835461197824, and its physical
> bytenr is 13500327919616 and 13500361474048.
> 
> My calculation is quite simple.
> Start1 = CR - CS + ST1
> Start2 = CR - CS + ST2
> 
> Unless the superblock is incorrect, it is not possible.
> 
> And the physical offset, is about 12.2 TiB, which is smaller than
> 15TiB of your device.
> 
> So that's quite strange that dd can't read out the data.
> And if dd can't read it out, then I see no reason btrfs-progs can
> read it out.
> 
> Any idea on special dm setup which can make us fail to read out some
> data range?

   I've seen both btrfs check and btrfs dump-super give wrong answers
(particularly, some addresses end up larger than the device, for some
reason) when run on a mounted filesystem. Worth ruling that one out.

   Hugo.

-- 
Hugo Mills             | Great films about cricket: Silly Point Break
hugo@... carfax.org.uk |
http://carfax.org.uk/  |
PGP: E2AB1DE4          |

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: btrfs check --repair: ERROR: cannot read chunk root
  2016-10-31  8:44                   ` Hugo Mills
@ 2016-10-31 15:04                     ` Marc MERLIN
  2016-11-01  3:48                       ` Marc MERLIN
  2016-11-01  4:13                       ` Qu Wenruo
  0 siblings, 2 replies; 67+ messages in thread
From: Marc MERLIN @ 2016-10-31 15:04 UTC (permalink / raw)
  To: Hugo Mills, Qu Wenruo, linux-btrfs

On Mon, Oct 31, 2016 at 08:44:12AM +0000, Hugo Mills wrote:
> > Any idea on special dm setup which can make us fail to read out some
> > data range?
> 
>    I've seen both btrfs check and btrfs dump-super give wrong answers
> (particularly, some addresses end up larger than the device, for some
> reason) when run on a mounted filesystem. Worth ruling that one out.

I just finished running my scrub overnight, and it failed around 10%:
[115500.316921] BTRFS error (device dm-0): bad tree block start 8461247125784585065 17619396231168
[115500.332354] BTRFS error (device dm-0): bad tree block start 8461247125784585065 17619396231168
[115500.332626] BTRFS: error (device dm-0) in __btrfs_free_extent:6954: errno=-5 IO failure
[115500.332629] BTRFS info (device dm-0): forced readonly
[115500.332632] BTRFS: error (device dm-0) in btrfs_run_delayed_refs:2960: errno=-5 IO failure
[115500.436002] btrfs_printk: 550 callbacks suppressed
[115500.436024] BTRFS warning (device dm-0): Skipping commit of aborted transaction.
[115500.436029] BTRFS: error (device dm-0) in cleanup_transaction:1854: errno=-5 IO failure


myth:~# ionice -c 3 nice -10 btrfs scrub start -Bd /mnt/mnt
(...)
scrub device /dev/mapper/crypt_bcache0 (id 1) canceled
        scrub started at Sun Oct 30 22:52:59 2016 and was aborted after 09:03:11
        total bytes scrubbed: 1.15TiB with 512 errors
        error details: csum=512
        corrected errors: 0, uncorrectable errors: 512, unverified errors: 0

Am I correct that if I see "__btrfs_free_extent:6954: errno=-5 IO failure" it means
that btrfs had physical read errors from the underlying block layer?

Do I have some weird mismatch between the size of my md array and the size of my filesystem
(as per dd apparently thinking parts of it are out of bounds)?
Yet, the sizes seem to match:


myth:~#  mdadm --query --detail /dev/md5
/dev/md5:
        Version : 1.2
  Creation Time : Tue Jan 21 10:35:52 2014
     Raid Level : raid5
     Array Size : 15627542528 (14903.59 GiB 16002.60 GB)
  Used Dev Size : 3906885632 (3725.90 GiB 4000.65 GB)
   Raid Devices : 5
  Total Devices : 5
    Persistence : Superblock is persistent

  Intent Bitmap : Internal

    Update Time : Mon Oct 31 07:56:07 2016
          State : clean 
 Active Devices : 5
Working Devices : 5
 Failed Devices : 0
  Spare Devices : 0

         Layout : left-symmetric
     Chunk Size : 512K

           Name : gargamel.svh.merlins.org:5
           UUID : ec672af7:a66d9557:2f00d76c:38c9f705
         Events : 147992

    Number   Major   Minor   RaidDevice State
       0       8       97        0      active sync   /dev/sdg1
       6       8      113        1      active sync   /dev/sdh1
       2       8       81        2      active sync   /dev/sdf1
       3       8       65        3      active sync   /dev/sde1
       5       8       49        4      active sync   /dev/sdd1

myth:~# btrfs fi df /mnt/mnt
Data, single: total=13.22TiB, used=13.19TiB
System, DUP: total=32.00MiB, used=1.42MiB
Metadata, DUP: total=75.00GiB, used=72.82GiB
GlobalReserve, single: total=512.00MiB, used=6.73MiB
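
Another thing I suppose I should check is the exact byte size the kernel
reports at each layer of the stack, in case something got truncated at the
bcache or dm-crypt layer (lsblk -b prints sizes in bytes):

myth:~# lsblk -b /dev/md5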

Thanks,
Marc
-- 
"A mouse is a device used to point at the xterm you want to type in" - A.S.R.
Microsoft is to operating systems ....
                                      .... what McDonalds is to gourmet cooking
Home page: http://marc.merlins.org/  

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: Buffer I/O error on dev md5, logical block 7073536, async page read
  2016-10-30 16:43       ` Buffer I/O error on dev md5, logical block 7073536, async page read Marc MERLIN
  2016-10-30 17:02         ` Andreas Klauer
@ 2016-10-31 19:24         ` Wols Lists
  1 sibling, 0 replies; 67+ messages in thread
From: Wols Lists @ 2016-10-31 19:24 UTC (permalink / raw)
  To: Marc MERLIN, Andreas Klauer; +Cc: linux-raid

On 30/10/16 16:43, Marc MERLIN wrote:
> And there isn't one good drive between the 2; the bad blocks are identical on
> both drives and must have happened at the same time due to those
> cable-induced IO errors I mentioned.
> Too bad that mdadm doesn't seem to account for the fact that it could be
> wrong when marking blocks as bad and does not seem to give a way to recover
> from this easily....
> I'll do more reading, thanks.

Reading the list, I've picked up that somehow badblocks seem to get
propagated from one drive to another. So if one drive gets a badblock,
that seems to get marked as bad on other drives too :-(

Oh - and as for badblocks being obsolete, isn't there a load of work
being done on it at the moment? For hardware raid I believe, which
presumably does not handle badblocks the way Phil thinks all modern
drives do? (Not surprising - hardware raid is regularly slated for being
buggy and not a good idea, this is probably more of the same...)

Cheers,
Wol

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [ LR] Kernel 4.8.4: INFO: task kworker/u16:8:289 blocked for more than 120 seconds.
  2016-10-30 18:56         ` [ LR] Kernel 4.8.4: INFO: task kworker/u16:8:289 blocked for more than 120 seconds TomK
  2016-10-30 19:16           ` TomK
  2016-10-30 20:13           ` Andreas Klauer
@ 2016-10-31 19:29           ` Wols Lists
  2016-11-01  2:40             ` TomK
  2 siblings, 1 reply; 67+ messages in thread
From: Wols Lists @ 2016-10-31 19:29 UTC (permalink / raw)
  To: TomK, linux-raid

On 30/10/16 18:56, TomK wrote:
> 
> We did not do a thorough R/W test to see how the error and bad disk
> affected the data stored on the array but did notice pauses and
> slowdowns on the CIFS share presented from it with pauses and generally
> difficulty in reading data, however no data errors that we could see.
> Since then we replaced the 2TB Seagate with a new 2TB WD and everything
> is fine even if the array is degraded.  But as soon as we put in this
> bad disk, it degraded to its previous behaviour.  Yet the array didn't
> catch it as a failed disk until the disk was nearly completely
> inaccessible.

What is this 2TB Seagate? A Barracuda? There's your problem, quite
possibly. Sounds like you've got your timeouts correctly matched, so
this drive is responding, but taking ages to do so. And that's why it
doesn't get kicked, but it knackers system response times - the kernel
is correctly configured to wait for the geriatric to respond.
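
For anyone wanting to double-check that matching, the usual pair of
commands is something like:

smartctl -l scterc /dev/sdX
cat /sys/block/sdX/device/timeout

(the drive reports its error recovery limit in tenths of a second, the
kernel value is plain seconds, and the kernel one wants to be comfortably
larger).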

Cheers,
Wol

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [ LR] Kernel 4.8.4: INFO: task kworker/u16:8:289 blocked for more than 120 seconds.
  2016-10-31 19:29           ` Wols Lists
@ 2016-11-01  2:40             ` TomK
  0 siblings, 0 replies; 67+ messages in thread
From: TomK @ 2016-11-01  2:40 UTC (permalink / raw)
  To: Wols Lists, linux-raid

On 10/31/2016 3:29 PM, Wols Lists wrote:
> On 30/10/16 18:56, TomK wrote:
>>
>> We did not do a thorough R/W test to see how the error and bad disk
>> affected the data stored on the array but did notice pauses and
>> slowdowns on the CIFS share presented from it with pauses and generally
>> difficulty in reading data, however no data errors that we could see.
>> Since then we replaced the 2TB Seagate with a new 2TB WD and everything
>> is fine even if the array is degraded.  But as soon as we put in this
>> bad disk, it degraded to it's previous behaviour.  Yet the array didn't
>> catch it as a failed disk until the disk was nearly completely
>> inaccessible.
>
> What is this 2TB Seagate? A Barracuda? There's your problem, quite
> possibly. Sounds like you've got your timeouts correctly matched, so
> this drive is responding, but taking ages to do so. And that's why it
> doesn't get kicked, but it knackers system response times - the kernel
> is correctly configured to wait for the geriatric to respond.
>
> Cheers,
> Wol
>

Hey Wols,

It's about a 2-3 year old Seagate but not a Barracuda.  They did not 
come with high ratings back then.  I also do adjust other recommended 
settings like write caches etc.

With the previous answer provided by Andreas, I got a very good picture 
of what scope of issues RAID should cover and what it should not.

So, rightly, there is a gap where RAID will not cover all disk failures, 
while a failing disk may still impact the applications sitting on top of 
the array.

Where I was going with this as well is to identify what other tools I may 
need in solutions that use RAID.  In this case the answer Andreas provided 
tells me I need specific disk-monitoring software alongside the RAID that 
would warn me of potential issues ahead of time.
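
For that I'll probably lean on smartmontools' smartd.  As a sketch, adapted
from the example shipped in smartd.conf (the mail target is a placeholder):

DEVICESCAN -a -o on -S on -s (S/../.././02|L/../../6/03) -m root

That monitors all attributes, enables automatic offline data collection and
attribute autosave, runs a short self-test nightly at 2am and a long one on
Saturdays at 3am, and mails warnings to root.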

On a side note, I like to see the RAID mailing lists so busy.  If I were 
to read the various blog posts, I would believe RAID died 5 years ago.  :)

-- 
Cheers,
Tom K.
-------------------------------------------------------------------------------------

Living on earth is expensive, but it includes a free trip around the sun.


^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: btrfs check --repair: ERROR: cannot read chunk root
  2016-10-31 15:04                     ` Marc MERLIN
@ 2016-11-01  3:48                       ` Marc MERLIN
  2016-11-01  4:13                       ` Qu Wenruo
  1 sibling, 0 replies; 67+ messages in thread
From: Marc MERLIN @ 2016-11-01  3:48 UTC (permalink / raw)
  To: Hugo Mills, Qu Wenruo, linux-btrfs

So, I'm willing to wait 2 more days before I wipe this filesystem and
start over, if I can't get check --repair to work on it.
If you need longer, please let me know if you have an upcoming patch for me
to try, and I'll wait.

Thanks,
Marc

On Mon, Oct 31, 2016 at 08:04:22AM -0700, Marc MERLIN wrote:
> On Mon, Oct 31, 2016 at 08:44:12AM +0000, Hugo Mills wrote:
> > > Any idea on special dm setup which can make us fail to read out some
> > > data range?
> > 
> >    I've seen both btrfs check and btrfs dump-super give wrong answers
> > (particularly, some addresses end up larger than the device, for some
> > reason) when run on a mounted filesystem. Worth ruling that one out.
> 
> I just finished running my scrub overnight, and it failed around 10%:
> [115500.316921] BTRFS error (device dm-0): bad tree block start 8461247125784585065 17619396231168
> [115500.332354] BTRFS error (device dm-0): bad tree block start 8461247125784585065 17619396231168
> [115500.332626] BTRFS: error (device dm-0) in __btrfs_free_extent:6954: errno=-5 IO failure
> [115500.332629] BTRFS info (device dm-0): forced readonly
> [115500.332632] BTRFS: error (device dm-0) in btrfs_run_delayed_refs:2960: errno=-5 IO failure
> [115500.436002] btrfs_printk: 550 callbacks suppressed
> [115500.436024] BTRFS warning (device dm-0): Skipping commit of aborted transaction.
> [115500.436029] BTRFS: error (device dm-0) in cleanup_transaction:1854: errno=-5 IO failure
> 
> 
> myth:~# ionice -c 3 nice -10 btrfs scrub start -Bd /mnt/mnt
> (...)
> scrub device /dev/mapper/crypt_bcache0 (id 1) canceled
>         scrub started at Sun Oct 30 22:52:59 2016 and was aborted after 09:03:11
>         total bytes scrubbed: 1.15TiB with 512 errors
>         error details: csum=512
>         corrected errors: 0, uncorrectable errors: 512, unverified errors: 0
> 
> Am I correct that if I see "__btrfs_free_extent:6954: errno=-5 IO failure" it means
> that btrfs had physical read errors from the underlying block layer?
> 
> Do I have some weird mismatch between the size of my md array and the size of my filesystem
> (as per dd apparently thinking parts of it are out of bounds?)
> Yet,  the sizes seem to match:
> 
> 
> myth:~#  mdadm --query --detail /dev/md5
> /dev/md5:
>         Version : 1.2
>   Creation Time : Tue Jan 21 10:35:52 2014
>      Raid Level : raid5
>      Array Size : 15627542528 (14903.59 GiB 16002.60 GB)
>   Used Dev Size : 3906885632 (3725.90 GiB 4000.65 GB)
>    Raid Devices : 5
>   Total Devices : 5
>     Persistence : Superblock is persistent
> 
>   Intent Bitmap : Internal
> 
>     Update Time : Mon Oct 31 07:56:07 2016
>           State : clean 
>  Active Devices : 5
> Working Devices : 5
>  Failed Devices : 0
>   Spare Devices : 0
> 
>          Layout : left-symmetric
>      Chunk Size : 512K
> 
>            Name : gargamel.svh.merlins.org:5
>            UUID : ec672af7:a66d9557:2f00d76c:38c9f705
>          Events : 147992
> 
>     Number   Major   Minor   RaidDevice State
>        0       8       97        0      active sync   /dev/sdg1
>        6       8      113        1      active sync   /dev/sdh1
>        2       8       81        2      active sync   /dev/sdf1
>        3       8       65        3      active sync   /dev/sde1
>        5       8       49        4      active sync   /dev/sdd1
> 
> myth:~# btrfs fi df /mnt/mnt
> Data, single: total=13.22TiB, used=13.19TiB
> System, DUP: total=32.00MiB, used=1.42MiB
> Metadata, DUP: total=75.00GiB, used=72.82GiB
> GlobalReserve, single: total=512.00MiB, used=6.73MiB
> 
> Thanks,
> Marc
> -- 
> "A mouse is a device used to point at the xterm you want to type in" - A.S.R.
> Microsoft is to operating systems ....
>                                       .... what McDonalds is to gourmet cooking
> Home page: http://marc.merlins.org/  



-- 
"A mouse is a device used to point at the xterm you want to type in" - A.S.R.
Microsoft is to operating systems ....
                                      .... what McDonalds is to gourmet cooking
Home page: http://marc.merlins.org/                         | PGP 1024R/763BE901

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: btrfs check --repair: ERROR: cannot read chunk root
  2016-10-31 15:04                     ` Marc MERLIN
  2016-11-01  3:48                       ` Marc MERLIN
@ 2016-11-01  4:13                       ` Qu Wenruo
  2016-11-01  4:21                         ` Marc MERLIN
  1 sibling, 1 reply; 67+ messages in thread
From: Qu Wenruo @ 2016-11-01  4:13 UTC (permalink / raw)
  To: Marc MERLIN, Hugo Mills, linux-btrfs



At 10/31/2016 11:04 PM, Marc MERLIN wrote:
> On Mon, Oct 31, 2016 at 08:44:12AM +0000, Hugo Mills wrote:
>>> Any idea on special dm setup which can make us fail to read out some
>>> data range?
>>
>>    I've seen both btrfs check and btrfs dump-super give wrong answers
>> (particularly, some addresses end up larger than the device, for some
>> reason) when run on a mounted filesystem. Worth ruling that one out.
>
> I just finished running my scrub overnight, and it failed around 10%:
> [115500.316921] BTRFS error (device dm-0): bad tree block start 8461247125784585065 17619396231168
> [115500.332354] BTRFS error (device dm-0): bad tree block start 8461247125784585065 17619396231168
> [115500.332626] BTRFS: error (device dm-0) in __btrfs_free_extent:6954: errno=-5 IO failure
> [115500.332629] BTRFS info (device dm-0): forced readonly
> [115500.332632] BTRFS: error (device dm-0) in btrfs_run_delayed_refs:2960: errno=-5 IO failure
> [115500.436002] btrfs_printk: 550 callbacks suppressed
> [115500.436024] BTRFS warning (device dm-0): Skipping commit of aborted transaction.
> [115500.436029] BTRFS: error (device dm-0) in cleanup_transaction:1854: errno=-5 IO failure
>
>
> myth:~# ionice -c 3 nice -10 btrfs scrub start -Bd /mnt/mnt
> (...)
> scrub device /dev/mapper/crypt_bcache0 (id 1) canceled
>         scrub started at Sun Oct 30 22:52:59 2016 and was aborted after 09:03:11
>         total bytes scrubbed: 1.15TiB with 512 errors
>         error details: csum=512
>         corrected errors: 0, uncorrectable errors: 512, unverified errors: 0
>
> Am I correct that if I see "__btrfs_free_extent:6954: errno=-5 IO failure" it means
> that btrfs had physical read errors from the underlying block layer?

Not really sure if those are physical read errors, as we throw -EIO almost 
everywhere.

But it's possible that your extent tree got corrupted, so 
__btrfs_free_extent() failed to modify the extent tree.

And in that case, we do throw -EIO.

>
> Do I have some weird mismatch between the size of my md array and the size of my filesystem
> (as per dd apparently thinking parts of it are out of bounds?)
> Yet,  the sizes seem to match:

Would you try to locate the range where we start to fail to read?

I still think the root problem is we failed to read the device in user 
space.

Thanks,
Qu
>
>
> myth:~#  mdadm --query --detail /dev/md5
> /dev/md5:
>         Version : 1.2
>   Creation Time : Tue Jan 21 10:35:52 2014
>      Raid Level : raid5
>      Array Size : 15627542528 (14903.59 GiB 16002.60 GB)
>   Used Dev Size : 3906885632 (3725.90 GiB 4000.65 GB)
>    Raid Devices : 5
>   Total Devices : 5
>     Persistence : Superblock is persistent
>
>   Intent Bitmap : Internal
>
>     Update Time : Mon Oct 31 07:56:07 2016
>           State : clean
>  Active Devices : 5
> Working Devices : 5
>  Failed Devices : 0
>   Spare Devices : 0
>
>          Layout : left-symmetric
>      Chunk Size : 512K
>
>            Name : gargamel.svh.merlins.org:5
>            UUID : ec672af7:a66d9557:2f00d76c:38c9f705
>          Events : 147992
>
>     Number   Major   Minor   RaidDevice State
>        0       8       97        0      active sync   /dev/sdg1
>        6       8      113        1      active sync   /dev/sdh1
>        2       8       81        2      active sync   /dev/sdf1
>        3       8       65        3      active sync   /dev/sde1
>        5       8       49        4      active sync   /dev/sdd1
>
> myth:~# btrfs fi df /mnt/mnt
> Data, single: total=13.22TiB, used=13.19TiB
> System, DUP: total=32.00MiB, used=1.42MiB
> Metadata, DUP: total=75.00GiB, used=72.82GiB
> GlobalReserve, single: total=512.00MiB, used=6.73MiB
>
> Thanks,
> Marc
>



^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: btrfs check --repair: ERROR: cannot read chunk root
  2016-11-01  4:13                       ` Qu Wenruo
@ 2016-11-01  4:21                         ` Marc MERLIN
  2016-11-04  8:01                           ` Marc MERLIN
  0 siblings, 1 reply; 67+ messages in thread
From: Marc MERLIN @ 2016-11-01  4:21 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: Hugo Mills, linux-btrfs

On Tue, Nov 01, 2016 at 12:13:38PM +0800, Qu Wenruo wrote:
> Would you try to locate the range where we start to fail to read?
> 
> I still think the root problem is we failed to read the device in user
> space.
 
Understood.

I'll run this then:
myth:~# dd if=/dev/mapper/crypt_bcache0 of=/dev/null bs=1M &
[2] 21108
myth:~# while :; do killall -USR1 dd; sleep 1200; done
275+0 records in
274+0 records out
287309824 bytes (287 MB) copied, 7.20248 s, 39.9 MB/s
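
(GNU dd prints its I/O statistics when it receives SIGUSR1, so the killall
loop just logs a progress line every 20 minutes.)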

This will take a while to run, I'll report back on how far it goes.

Thanks,
Marc
-- 
"A mouse is a device used to point at the xterm you want to type in" - A.S.R.
Microsoft is to operating systems ....
                                      .... what McDonalds is to gourmet cooking
Home page: http://marc.merlins.org/                         | PGP 1024R/763BE901

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: btrfs check --repair: ERROR: cannot read chunk root
  2016-11-01  4:21                         ` Marc MERLIN
@ 2016-11-04  8:01                           ` Marc MERLIN
  2016-11-04  9:00                             ` Roman Mamedov
  2016-11-07  1:11                             ` Qu Wenruo
  0 siblings, 2 replies; 67+ messages in thread
From: Marc MERLIN @ 2016-11-04  8:01 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: Hugo Mills, linux-btrfs

On Mon, Oct 31, 2016 at 09:21:40PM -0700, Marc MERLIN wrote:
> On Tue, Nov 01, 2016 at 12:13:38PM +0800, Qu Wenruo wrote:
> > Would you try to locate the range where we start to fail to read?
> > 
> > I still think the root problem is we failed to read the device in user
> > space.
>  
> Understood.
> 
> I'll run this then:
> myth:~# dd if=/dev/mapper/crypt_bcache0 of=/dev/null bs=1M &
> [2] 21108
> myth:~# while :; do killall -USR1 dd; sleep 1200; done
> 275+0 records in
> 274+0 records out
> 287309824 bytes (287 MB) copied, 7.20248 s, 39.9 MB/s
> 
> This will take a while to run, I'll report back on how far it goes.

Well, turns out you were right. My array is 14TB and dd was only able to
copy 8.8TB out of it.

I wonder if it's a bug with bcache and source devices that are too big?

8782434271232 bytes (8.8 TB) copied, 214809 s, 40.9 MB/s
dd: reading `/dev/mapper/crypt_bcache0': Invalid argument
8388608+0 records in
8388608+0 records out
8796093022208 bytes (8.8 TB) copied, 215197 s, 40.9 MB/s
[2]+  Exit 1                  dd if=/dev/mapper/crypt_bcache0 of=/dev/null bs=1M
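
For what it's worth, that byte count is suspiciously round:

myth:~$ echo $((2**43))
8796093022208

8796093022208 is exactly 2^43, i.e. 8 TiB on the nose, which looks a lot
more like a size getting truncated somewhere in the bcache/dm-crypt stack
than like a bad region on the array.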

What's vexing is that absolutely nothing has been logged in the kernel dmesg
buffer about this read error.

Basically I have this:
sde                            8:64   0   3.7T  0 
└─sde1                         8:65   0   3.7T  0 
  └─md5                        9:5    0  14.6T  0 
    └─bcache0                252:0    0  14.6T  0 
      └─crypt_bcache0 (dm-0) 253:0    0  14.6T  0 

I'll try dd'ing the md5 directly now, but that's going to take another 2 days :(

That said, given that almost half the device is not readable from user space
for some reason, that would explain why btrfs check is failing. Obviously it
can't do its job if it can't read blocks.

I'll report back on what I find out with this problem but if you have
suggestions on what to look for, let me know :)

Thanks.
Marc
-- 
"A mouse is a device used to point at the xterm you want to type in" - A.S.R.
Microsoft is to operating systems ....
                                      .... what McDonalds is to gourmet cooking
Home page: http://marc.merlins.org/  

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: btrfs check --repair: ERROR: cannot read chunk root
  2016-11-04  8:01                           ` Marc MERLIN
@ 2016-11-04  9:00                             ` Roman Mamedov
  2016-11-04 17:59                               ` Marc MERLIN
  2016-11-07  1:11                             ` Qu Wenruo
  1 sibling, 1 reply; 67+ messages in thread
From: Roman Mamedov @ 2016-11-04  9:00 UTC (permalink / raw)
  To: Marc MERLIN; +Cc: Qu Wenruo, Hugo Mills, linux-btrfs

On Fri, 4 Nov 2016 01:01:13 -0700
Marc MERLIN <marc@merlins.org> wrote:

> Basically I have this:
> sde                            8:64   0   3.7T  0 
> └─sde1                         8:65   0   3.7T  0 
>   └─md5                        9:5    0  14.6T  0 
>     └─bcache0                252:0    0  14.6T  0 
>       └─crypt_bcache0 (dm-0) 253:0    0  14.6T  0 
> 
> I'll try dd'ing the md5 directly now, but that's going to take another 2 days :(
> 
> That said, given that almost half the device is not readable from user space
> for some reason, that would explain why btrfs check is failing. Obviously it
> can't do its job if it can't read blocks.

I don't see anything to support the notion that "half is unreadable", maybe
just a 512-byte sector is unreadable -- but that would be enough to make
regular dd bail out -- which is why you should be using dd_rescue for this,
not regular dd. Assuming you just want to copy over as much data as possible,
and not simply test if dd fails or not (but in any case dd_rescue at least
would not fail instantly and would tell you a precise count of how much is
unreadable).

There is "GNU ddrescue" and "dd_rescue", I liked the first one better, but
they both work on a similar principle.
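
A sketch of a typical GNU ddrescue run (paths are only examples, and the
destination needs as much free space as the source):

ddrescue /dev/mapper/crypt_bcache0 /mnt/spare/image.img /mnt/spare/rescue.log

The logfile lets you interrupt and resume, and at the end it records exactly
which ranges could not be read.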

Also, didn't you recently have issues with bad block lists on mdadm? This
mysterious "unreadable and nothing in dmesg" does appear to be a continuation
of that.

-- 
With respect,
Roman

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: btrfs check --repair: ERROR: cannot read chunk root
  2016-11-04  9:00                             ` Roman Mamedov
@ 2016-11-04 17:59                               ` Marc MERLIN
  0 siblings, 0 replies; 67+ messages in thread
From: Marc MERLIN @ 2016-11-04 17:59 UTC (permalink / raw)
  To: Roman Mamedov; +Cc: Qu Wenruo, Hugo Mills, linux-btrfs

On Fri, Nov 04, 2016 at 02:00:43PM +0500, Roman Mamedov wrote:
> On Fri, 4 Nov 2016 01:01:13 -0700
> Marc MERLIN <marc@merlins.org> wrote:
> 
> > Basically I have this:
> > sde                            8:64   0   3.7T  0 
> > └─sde1                         8:65   0   3.7T  0 
> >   └─md5                        9:5    0  14.6T  0 
> >     └─bcache0                252:0    0  14.6T  0 
> >       └─crypt_bcache0 (dm-0) 253:0    0  14.6T  0 
> > 
> > I'll try dd'ing the md5 directly now, but that's going to take another 2 days :(
> > 
> > That said, given that almost half the device is not readable from user space
> > for some reason, that would explain why btrfs check is failing. Obviously it
> > can't do its job if it can't read blocks.
> 
> I don't see anything to support the notion that "half is unreadable", maybe
> just a 512-byte sector is unreadable -- but that would be enough to make
> regular dd bail out -- which is why you should be using dd_rescue for this,
> not regular dd. Assuming you just want to copy over as much data as possible,
> and not simply test if dd fails or not (but in any case dd_rescue at least
> would not fail instantly and would tell you precise count of how much is
> unreadable).

Thanks for the plug on ddrescue, I have used it to rescue drives in the
past.
Here, however, everything after the 8.8TB mark is unreadable, so there
is nothing to skip.

Because the underlying drives are fine, I'm not entirely sure where the
issue is, although it has to be on the mdadm side and not related to
btrfs.

And of course the mdadm array shows clean, and I have already disabled
the mdadm per-drive bad block (mis-)feature, which is probably
responsible for all the problems I've had here.
myth:~# mdadm --examine-badblocks /dev/sd[defgh]1
No bad-blocks list configured on /dev/sdd1
No bad-blocks list configured on /dev/sde1
No bad-blocks list configured on /dev/sdf1
No bad-blocks list configured on /dev/sdg1
No bad-blocks list configured on /dev/sdh1

I'm also still perplexed as to why, despite the read error I'm getting,
absolutely nothing is logged in the kernel :-/

I'll pursue that further and post a summary on the thread here if I find
something interesting.

Marc
-- 
"A mouse is a device used to point at the xterm you want to type in" - A.S.R.
Microsoft is to operating systems ....
                                      .... what McDonalds is to gourmet cooking
Home page: http://marc.merlins.org/                         | PGP 1024R/763BE901

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: clearing blocks wrongfully marked as bad if --update=no-bbl can't be used?
  2016-10-30 17:16           ` Marc MERLIN
@ 2016-11-04 18:18             ` Marc MERLIN
  2016-11-04 18:22               ` Phil Turmel
  0 siblings, 1 reply; 67+ messages in thread
From: Marc MERLIN @ 2016-11-04 18:18 UTC (permalink / raw)
  To: Phil Turmel, Neil Brown, Andreas Klauer; +Cc: linux-raid

On Sun, Oct 30, 2016 at 10:16:54AM -0700, Marc MERLIN wrote:
> myth:~# mdadm --assemble --update=force-no-bbl /dev/md5
> mdadm: /dev/md5 has been started with 5 drives.
> myth:~# 
> myth:~# mdadm --examine-badblocks /dev/sd[defgh]1
> No bad-blocks list configured on /dev/sdd1
> No bad-blocks list configured on /dev/sde1
> No bad-blocks list configured on /dev/sdf1
> No bad-blocks list configured on /dev/sdg1
> No bad-blocks list configured on /dev/sdh1
> 
> Now I'll make sure to turn off this feature on all my other arrays
> in case it got turned on without my asking for it.

Right, so I thought I was home free, but not even close. My array is
back up, the badblock feature is disabled, and the array reports clean, but I
cannot access data past 8.8TB; it just fails.

myth:~# dd if=/dev/md5 of=/dev/null bs=1GB skip=8797
dd: reading `/dev/md5': Invalid argument
0+0 records in
0+0 records out
0 bytes (0 B) copied, 0.000403171 s, 0.0 kB/s
myth:~# dd if=/dev/md5 of=/dev/null bs=1GB skip=8796
dd: reading `/dev/md5': Invalid argument
1+0 records in
1+0 records out
1000000000 bytes (1.0 GB) copied, 10.5817 s, 94.5 MB/s


myth:~# mdadm --query --detail /dev/md5
 
/dev/md5:
        Version : 1.2
  Creation Time : Tue Jan 21 10:35:52 2014
     Raid Level : raid5
     Array Size : 15627542528 (14903.59 GiB 16002.60 GB)
  Used Dev Size : 3906885632 (3725.90 GiB 4000.65 GB)
   Raid Devices : 5
  Total Devices : 5
    Persistence : Superblock is persistent
 
  Intent Bitmap : Internal
 
    Update Time : Mon Oct 31 07:56:07 2016
          State : clean 
 Active Devices : 5
Working Devices : 5
 Failed Devices : 0
  Spare Devices : 0
  
         Layout : left-symmetric
     Chunk Size : 512K
  
           Name : gargamel.svh.merlins.org:5
           UUID : ec672af7:a66d9557:2f00d76c:38c9f705
         Events : 147992
  
    Number   Major   Minor   RaidDevice State
       0       8       97        0      active sync   /dev/sdg1
       6       8      113        1      active sync   /dev/sdh1
       2       8       81        2      active sync   /dev/sdf1
       3       8       65        3      active sync   /dev/sde1
       5       8       49        4      active sync   /dev/sdd1


myth:~# 
myth:~# mdadm --examine /dev/sdd1
/dev/sdd1:
          Magic : a92b4efc
        Version : 1.2
    Feature Map : 0x1
     Array UUID : ec672af7:a66d9557:2f00d76c:38c9f705
           Name : gargamel.svh.merlins.org:5
  Creation Time : Tue Jan 21 10:35:52 2014
     Raid Level : raid5
   Raid Devices : 5

 Avail Dev Size : 7813771264 (3725.90 GiB 4000.65 GB)
     Array Size : 15627542528 (14903.59 GiB 16002.60 GB)
    Data Offset : 262144 sectors
   Super Offset : 8 sectors
   Unused Space : before=262064 sectors, after=0 sectors
          State : clean
    Device UUID : 075571ff:411517e9:027f8c2f:cef0457a

Internal Bitmap : 8 sectors from superblock
    Update Time : Mon Oct 31 07:56:07 2016
       Checksum : d4e74521 - correct
         Events : 147992

         Layout : left-symmetric
     Chunk Size : 512K

   Device Role : Active device 4
   Array State : AAAAA ('A' == active, '.' == missing, 'R' == replacing)

All 5 devices look about the same, apart from their serial numbers.

Any idea why it's failing that way?

Thanks,
Marc
-- 
"A mouse is a device used to point at the xterm you want to type in" - A.S.R.
Microsoft is to operating systems ....
                                      .... what McDonalds is to gourmet cooking
Home page: http://marc.merlins.org/                         | PGP 1024R/763BE901

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: clearing blocks wrongfully marked as bad if --update=no-bbl can't be used?
  2016-11-04 18:18             ` Marc MERLIN
@ 2016-11-04 18:22               ` Phil Turmel
  2016-11-04 18:50                 ` Marc MERLIN
  0 siblings, 1 reply; 67+ messages in thread
From: Phil Turmel @ 2016-11-04 18:22 UTC (permalink / raw)
  To: Marc MERLIN, Neil Brown, Andreas Klauer; +Cc: linux-raid

On 11/04/2016 02:18 PM, Marc MERLIN wrote:

> Right, so I thought I was home free, but not even close. My array is
> back up, the badblock feature is disabled, array reports clean, but I
> cannot access data past 8.8TB, it just fails.
> 
> myth:~# dd if=/dev/md5 of=/dev/null bs=1GB skip=8797
> dd: reading `/dev/md5': Invalid argument
> 0+0 records in
> 0+0 records out
> 0 bytes (0 B) copied, 0.000403171 s, 0.0 kB/s
> myth:~# dd if=/dev/md5 of=/dev/null bs=1GB skip=8796
> dd: reading `/dev/md5': Invalid argument
> 1+0 records in
> 1+0 records out
> 1000000000 bytes (1.0 GB) copied, 10.5817 s, 94.5 MB/s

That has nothing to do with MD.  You are using a power of ten suffix in
your block size, so you are running into non-aligned sector locations.


^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: clearing blocks wrongfully marked as bad if --update=no-bbl can't be used?
  2016-11-04 18:22               ` Phil Turmel
@ 2016-11-04 18:50                 ` Marc MERLIN
  2016-11-04 18:59                   ` Roman Mamedov
  0 siblings, 1 reply; 67+ messages in thread
From: Marc MERLIN @ 2016-11-04 18:50 UTC (permalink / raw)
  To: Phil Turmel; +Cc: Neil Brown, Andreas Klauer, linux-raid

On Fri, Nov 04, 2016 at 02:22:48PM -0400, Phil Turmel wrote:
> On 11/04/2016 02:18 PM, Marc MERLIN wrote:
> 
> > Right, so I thought I was home free, but not even close. My array is
> > back up, the badblock feature is disabled, array reports clean, but I
> > cannot access data past 8.8TB, it just fails.
> > 
> > myth:~# dd if=/dev/md5 of=/dev/null bs=1GB skip=8797
> > dd: reading `/dev/md5': Invalid argument
> > 0+0 records in
> > 0+0 records out
> > 0 bytes (0 B) copied, 0.000403171 s, 0.0 kB/s
> > myth:~# dd if=/dev/md5 of=/dev/null bs=1GB skip=8796
> > dd: reading `/dev/md5': Invalid argument
> > 1+0 records in
> > 1+0 records out
> > 1000000000 bytes (1.0 GB) copied, 10.5817 s, 94.5 MB/s
> 
> That has nothing to do with MD.  You are using a power of ten suffix in
> your block size, so you are running into non-aligned sector locations.

not really, I read the whole device from scratch (without skip) and it
read 8.8TB before it failed. It just takes 2 days to run, so it's a bit
annoying to do repeatedly :)

myth:/dev# dd if=/dev/md5 of=/dev/null bs=1GB skip=8790
dd: reading `/dev/md5': Invalid argument
7+0 records in
7+0 records out
7000000000 bytes (7.0 GB) copied, 76.7736 s, 91.2 MB/s

It doesn't matter where I start, it fails in exactly the same place, and
I can't skip over it; anything after that mark is unreadable.

I can switch to GiB if you'd like, same thing:
myth:/dev# dd if=/dev/md5 of=/dev/null bs=1GiB skip=8190
dd: reading `/dev/md5': Invalid argument
2+0 records in
2+0 records out
2147483648 bytes (2.1 GB) copied, 21.9751 s, 97.7 MB/s
myth:/dev# dd if=/dev/md5 of=/dev/null bs=1GiB skip=8200
dd: reading `/dev/md5': Invalid argument
0+0 records in
0+0 records out
0 bytes (0 B) copied, 0.000281885 s, 0.0 kB/s
myth:/dev# dd if=/dev/md5 of=/dev/null bs=1GiB skip=8500
dd: reading `/dev/md5': Invalid argument
0+0 records in
0+0 records out
0 bytes (0 B) copied, 0.000395691 s, 0.0 kB/s

Marc
-- 
"A mouse is a device used to point at the xterm you want to type in" - A.S.R.
Microsoft is to operating systems ....
                                      .... what McDonalds is to gourmet cooking
Home page: http://marc.merlins.org/                         | PGP 1024R/763BE901

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: clearing blocks wrongfully marked as bad if --update=no-bbl can't be used?
  2016-11-04 18:50                 ` Marc MERLIN
@ 2016-11-04 18:59                   ` Roman Mamedov
  2016-11-04 19:31                     ` Roman Mamedov
  2016-11-04 19:51                     ` Marc MERLIN
  0 siblings, 2 replies; 67+ messages in thread
From: Roman Mamedov @ 2016-11-04 18:59 UTC (permalink / raw)
  To: Marc MERLIN; +Cc: Phil Turmel, Neil Brown, Andreas Klauer, linux-raid

On Fri, 4 Nov 2016 11:50:40 -0700
Marc MERLIN <marc@merlins.org> wrote:

> I can switch to GiB if you'd like, same thing:
> myth:/dev# dd if=/dev/md5 of=/dev/null bs=1GiB skip=8190
> dd: reading `/dev/md5': Invalid argument
> 2+0 records in
> 2+0 records out
> 2147483648 bytes (2.1 GB) copied, 21.9751 s, 97.7 MB/s

But now you can see the cutoff point is exactly at 8192 -- a strangely familiar
number, much more so than "8.8 TB", right? :D

Could you recheck (and post) your mdadm --detail /dev/md5, if the whole array
didn't get cut to a half of its size in "Array Size".

Or maybe the bad block list removal code has some overflow bug which cuts each
device's size to 2048 GiB without the array size reflecting that. You run a RAID5
of five members; (5-1)*2048 would give you exactly 8192 GiB.

-- 
With respect,
Roman

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: clearing blocks wrongfully marked as bad if --update=no-bbl can't be used?
  2016-11-04 18:59                   ` Roman Mamedov
@ 2016-11-04 19:31                     ` Roman Mamedov
  2016-11-04 20:02                       ` Marc MERLIN
  2016-11-04 19:51                     ` Marc MERLIN
  1 sibling, 1 reply; 67+ messages in thread
From: Roman Mamedov @ 2016-11-04 19:31 UTC (permalink / raw)
  To: Marc MERLIN; +Cc: Phil Turmel, Neil Brown, Andreas Klauer, linux-raid

On Fri, 4 Nov 2016 23:59:17 +0500
Roman Mamedov <rm@romanrm.net> wrote:

> On Fri, 4 Nov 2016 11:50:40 -0700
> Marc MERLIN <marc@merlins.org> wrote:
> 
> > I can switch to GiB if you'd like, same thing:
> > myth:/dev# dd if=/dev/md5 of=/dev/null bs=1GiB skip=8190
> > dd: reading `/dev/md5': Invalid argument
> > 2+0 records in
> > 2+0 records out
> > 2147483648 bytes (2.1 GB) copied, 21.9751 s, 97.7 MB/s
> 
> But now you can see the cutoff point is exactly at 8192 -- a strangely familiar
> number, much more so than "8.8 TB", right? :D
> 
> Could you recheck (and post) your mdadm --detail /dev/md5, if the whole array
> didn't get cut to a half of its size in "Array Size".

Also check that member devices of /dev/md5 (/dev/sd*1 partitions) are still
larger than 2TB, and are still readable past 2TB.

-- 
With respect,
Roman

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: clearing blocks wrongfully marked as bad if --update=no-bbl can't be used?
  2016-11-04 18:59                   ` Roman Mamedov
  2016-11-04 19:31                     ` Roman Mamedov
@ 2016-11-04 19:51                     ` Marc MERLIN
  2016-11-07  0:16                       ` NeilBrown
  1 sibling, 1 reply; 67+ messages in thread
From: Marc MERLIN @ 2016-11-04 19:51 UTC (permalink / raw)
  To: Roman Mamedov; +Cc: Phil Turmel, Neil Brown, Andreas Klauer, linux-raid

On Fri, Nov 04, 2016 at 11:59:17PM +0500, Roman Mamedov wrote:
> On Fri, 4 Nov 2016 11:50:40 -0700
> Marc MERLIN <marc@merlins.org> wrote:
> 
> > I can switch to GiB if you'd like, same thing:
> > myth:/dev# dd if=/dev/md5 of=/dev/null bs=1GiB skip=8190
> > dd: reading `/dev/md5': Invalid argument
> > 2+0 records in
> > 2+0 records out
> > 2147483648 bytes (2.1 GB) copied, 21.9751 s, 97.7 MB/s
> 
> > But now you can see the cutoff point is exactly at 8192 -- a strangely familiar
> number, much more so than "8.8 TB", right? :D
 
Yes, that's a valid point :)

> Could you recheck (and post) your mdadm --detail /dev/md5, if the whole array
> didn't get cut to a half of its size in "Array Size".

I just posted it in my previous Email:
myth:~# mdadm --query --detail /dev/md5

/dev/md5:
        Version : 1.2
  Creation Time : Tue Jan 21 10:35:52 2014
     Raid Level : raid5
     Array Size : 15627542528 (14903.59 GiB 16002.60 GB)
  Used Dev Size : 3906885632 (3725.90 GiB 4000.65 GB)
   Raid Devices : 5
  Total Devices : 5
    Persistence : Superblock is persistent

  Intent Bitmap : Internal

    Update Time : Mon Oct 31 07:56:07 2016
          State : clean

(more in the previous Email)

> Or maybe the bad block list removal code has some overflow bug which cuts each
> device's size to 2048 GiB without the array size reflecting that. You run a RAID5
> of five members; (5-1)*2048 would give you exactly 8192 GiB.

that's very possible too.
So even though the array is marked clean and I don't care if some md
blocks return data that is actually corrupt as long as the read succeeds
(my filesystem will sort that out), I figured I could try a repair.

What's interesting is that it started exactly at 50%, which is also
likely where my reads were failing.

myth:/sys/block/md5/md# echo repair > sync_action 

md5 : active raid5 sdg1[0] sdd1[5] sde1[3] sdf1[2] sdh1[6]
      15627542528 blocks super 1.2 level 5, 512k chunk, algorithm 2 [5/5] [UUUUU]
      [==========>..........]  resync = 50.0% (1953925916/3906885632) finish=1899.1min speed=17138K/sec
      bitmap: 0/30 pages [0KB], 65536KB chunk

That said, as this resync progresses, I'd think/hope it would push
the error point forward, but it does not seem to:
myth:/sys/block/md5/md# dd if=/dev/md5 of=/dev/null bs=1GiB skip=8190
dd: reading `/dev/md5': Invalid argument
2+0 records in
2+0 records out
2147483648 bytes (2.1 GB) copied, 27.8491 s, 77.1 MB/s

So basically I'm stuck in the same place, and it seems that I've found
an actual swraid bug in the kernel and I'm not hopeful that the problem
will be fixed after the resync completes.

If someone wants me to try stuff before I wipe it all and restart, let
me know, but otherwise I've been in this broken state for 3 weeks now
and I need to fix it so that I can restart my backups again.

Thanks,
Marc
-- 
"A mouse is a device used to point at the xterm you want to type in" - A.S.R.
Microsoft is to operating systems ....
                                      .... what McDonalds is to gourmet cooking
Home page: http://marc.merlins.org/                         | PGP 1024R/763BE901

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: clearing blocks wrongfully marked as bad if --update=no-bbl can't be used?
  2016-11-04 19:31                     ` Roman Mamedov
@ 2016-11-04 20:02                       ` Marc MERLIN
  0 siblings, 0 replies; 67+ messages in thread
From: Marc MERLIN @ 2016-11-04 20:02 UTC (permalink / raw)
  To: Roman Mamedov; +Cc: Phil Turmel, Neil Brown, Andreas Klauer, linux-raid

On Sat, Nov 05, 2016 at 12:31:09AM +0500, Roman Mamedov wrote:
> Also check that member devices of /dev/md5 (/dev/sd*1 partitions) are still
> larger than 2TB, and are still readable past 2TB.

Just for my own sanity: if the drives had read errors, those would be
logged by the kernel, right?
I did run hdrecover on all those drives and it completed on all of them
(I did that first before checking anything else)

Here's me reading 1GB from each drive at the 3.5TB mark:
myth:/sys/block/md5/md# for i in /dev/sd[defgh]; do dd if=$i of=/dev/null bs=1GiB skip=3500 count=1; done
1+0 records in
1+0 records out
1073741824 bytes (1.1 GB) copied, 77.4343 s, 13.9 MB/s
1+0 records in
1+0 records out
1073741824 bytes (1.1 GB) copied, 50.1179 s, 21.4 MB/s
1+0 records in
1+0 records out
1073741824 bytes (1.1 GB) copied, 39.6499 s, 27.1 MB/s
1+0 records in
1+0 records out
1073741824 bytes (1.1 GB) copied, 71.6397 s, 15.0 MB/s
1+0 records in
1+0 records out
1073741824 bytes (1.1 GB) copied, 73.1003 s, 14.7 MB/s

Marc
-- 
"A mouse is a device used to point at the xterm you want to type in" - A.S.R.
Microsoft is to operating systems ....
                                      .... what McDonalds is to gourmet cooking
Home page: http://marc.merlins.org/                         | PGP 1024R/763BE901

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: clearing blocks wrongfully marked as bad if --update=no-bbl can't be used?
  2016-11-04 19:51                     ` Marc MERLIN
@ 2016-11-07  0:16                       ` NeilBrown
  2016-11-07  1:13                         ` Marc MERLIN
  2016-11-07  1:20                         ` Marc MERLIN
  0 siblings, 2 replies; 67+ messages in thread
From: NeilBrown @ 2016-11-07  0:16 UTC (permalink / raw)
  To: Marc MERLIN, Roman Mamedov
  Cc: Phil Turmel, Neil Brown, Andreas Klauer, linux-raid

[-- Attachment #1: Type: text/plain, Size: 1652 bytes --]

On Sat, Nov 05 2016, Marc MERLIN wrote:
>
> What's interesting is that it started exactly at 50%, which is also
> likely where my reads were failing.
>
> myth:/sys/block/md5/md# echo repair > sync_action 
>
> md5 : active raid5 sdg1[0] sdd1[5] sde1[3] sdf1[2] sdh1[6]
>       15627542528 blocks super 1.2 level 5, 512k chunk, algorithm 2 [5/5] [UUUUU]
>       [==========>..........]  resync = 50.0% (1953925916/3906885632) finish=1899.1min speed=17138K/sec
>       bitmap: 0/30 pages [0KB], 65536KB chunk

Yep, that is weird.

You can cause that to happen by e.g
   echo 7813771264 > /sys/block/md5/md/sync_min

but you are unlikely to have done that deliberately.


>
> That said, as this resync is processing, I'd think/hope it would move
> the error forward, but it does not seem to:
> myth:/sys/block/md5/md# dd if=/dev/md5 of=/dev/null bs=1GiB skip=8190
> dd: reading `/dev/md5': Invalid argument
> 2+0 records in
> 2+0 records out
> 2147483648 bytes (2.1 GB) copied, 27.8491 s, 77.1 MB/s

EINVAL from a read() system call is surprising in this context.....

do_generic_file_read can return it:
	if (unlikely(*ppos >= inode->i_sb->s_maxbytes))
		return -EINVAL;

s_maxbytes will be MAX_LFS_FILESIZE which, on a 32bit system, is

#define MAX_LFS_FILESIZE        (((loff_t)PAGE_SIZE << (BITS_PER_LONG-1))-1)

That is 2^(12+31) or 2^43 or 8TB.

Is this a 32bit system you are using?  Such systems can only support
buffered IO up to 8TB.  If you use iflag=direct to avoid buffering, you
should get access to the whole device.

If this is a 64bit system, then the problem must be elsewhere.
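
A quick cross-check of that arithmetic (a minimal sketch, not part of the
original mail; it assumes 4 KiB pages and a 32-bit long):

#include <stdio.h>

/* MAX_LFS_FILESIZE on a 32-bit build:
 * (PAGE_SIZE << (BITS_PER_LONG-1)) - 1, with PAGE_SIZE = 4096 (2^12)
 * and BITS_PER_LONG = 32 */
int main(void)
{
	long long page_size = 4096;
	int bits_per_long = 32;
	long long max = (page_size << (bits_per_long - 1)) - 1;

	/* prints 8796093022207, i.e. 2^43 - 1 (8 TiB); note that the dd
	 * run elsewhere in this thread stopped after copying exactly
	 * 8796093022208 bytes */
	printf("MAX_LFS_FILESIZE = %lld\n", max);
	return 0;
}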

NeilBrown

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 800 bytes --]

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: btrfs check --repair: ERROR: cannot read chunk root
  2016-11-04  8:01                           ` Marc MERLIN
  2016-11-04  9:00                             ` Roman Mamedov
@ 2016-11-07  1:11                             ` Qu Wenruo
  1 sibling, 0 replies; 67+ messages in thread
From: Qu Wenruo @ 2016-11-07  1:11 UTC (permalink / raw)
  To: Marc MERLIN; +Cc: Hugo Mills, linux-btrfs



At 11/04/2016 04:01 PM, Marc MERLIN wrote:
> On Mon, Oct 31, 2016 at 09:21:40PM -0700, Marc MERLIN wrote:
>> On Tue, Nov 01, 2016 at 12:13:38PM +0800, Qu Wenruo wrote:
>>> Would you try to locate the range where we starts to fail to read?
>>>
>>> I still think the root problem is we failed to read the device in user
>>> space.
>>
>> Understood.
>>
>> I'll run this then:
>> myth:~# dd if=/dev/mapper/crypt_bcache0 of=/dev/null bs=1M &
>> [2] 21108
>> myth:~# while :; do killall -USR1 dd; sleep 1200; done
>> 275+0 records in
>> 274+0 records out
>> 287309824 bytes (287 MB) copied, 7.20248 s, 39.9 MB/s
>>
>> This will take a while to run, I'll report back on how far it goes.
>
> Well, turns out you were right. My array is 14TB and dd was only able to
> copy 8.8TB out of it.
>
> I wonder if it's a bug with bcache and source devices that are too big?

At least we know it's not a problem of btrfs-progs.

And for bcache/soft raid/encryption, unfortunately I'm not familiar with 
any of them.

I would recommend reporting it to the bcache/mdadm/encryption ML after
locating the layer which returns EINVAL.

>
> 8782434271232 bytes (8.8 TB) copied, 214809 s, 40.9 MB/s
> dd: reading `/dev/mapper/crypt_bcache0': Invalid argument
> 8388608+0 records in
> 8388608+0 records out
> 8796093022208 bytes (8.8 TB) copied, 215197 s, 40.9 MB/s
> [2]+  Exit 1                  dd if=/dev/mapper/crypt_bcache0 of=/dev/null bs=1M
>
> What's vexing is that absolutely nothing has been logged in the kernel dmesg
> buffer about this read error.
>
> Basically I have this:
> sde                            8:64   0   3.7T  0
> └─sde1                         8:65   0   3.7T  0
>   └─md5                        9:5    0  14.6T  0
>     └─bcache0                252:0    0  14.6T  0
>       └─crypt_bcache0 (dm-0) 253:0    0  14.6T  0
>
> I'll try dd'ing the md5 directly now, but that's going to take another 2 days :(

No need to read it all out; just reading from the 8T mark would be good
enough for me.

BTW, that's really a complicated layout; with soft raid, bcache, and
encryption, it will take a long time to find the real cause.

But at least we know the 8.8T position, so we can save some time by not
reading the whole disk.

Thanks,
Qu

>
> That said, given that almost half the device is not readable from user space
> for some reason, that would explain why btrfs check is failing. Obviously it
> can't do its job if it can't read blocks.
>
> I'll report back on what I find out with this problem but if you have
> suggestions on what to look for, let me know :)
>
> Thanks.
> Marc
>



^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: clearing blocks wrongfully marked as bad if --update=no-bbl can't be used?
  2016-11-07  0:16                       ` NeilBrown
@ 2016-11-07  1:13                         ` Marc MERLIN
  2016-11-07  3:36                           ` Phil Turmel
  2016-11-07  1:20                         ` Marc MERLIN
  1 sibling, 1 reply; 67+ messages in thread
From: Marc MERLIN @ 2016-11-07  1:13 UTC (permalink / raw)
  To: NeilBrown
  Cc: Roman Mamedov, Phil Turmel, Neil Brown, Andreas Klauer, linux-raid

On Mon, Nov 07, 2016 at 11:16:56AM +1100, NeilBrown wrote:
> On Sat, Nov 05 2016, Marc MERLIN wrote:
> >
> > What's interesting is that it started exactly at 50%, which is also
> > likely where my reads were failing.
> >
> > myth:/sys/block/md5/md# echo repair > sync_action 
> >
> > md5 : active raid5 sdg1[0] sdd1[5] sde1[3] sdf1[2] sdh1[6]
> >       15627542528 blocks super 1.2 level 5, 512k chunk, algorithm 2 [5/5] [UUUUU]
> >       [==========>..........]  resync = 50.0% (1953925916/3906885632) finish=1899.1min speed=17138K/sec
> >       bitmap: 0/30 pages [0KB], 65536KB chunk
> 
> Yep, that is weird.
> 
> You can cause that to happen by e.g
>    echo 7813771264 > /sys/block/md5/md/sync_min
> 
> but you are unlikely to have done that deliberately.
 
I might have done this by mistake instead of sync_speed_min, but as you
say, unlikely. Then again, this is not the main problem and I think you
did find the reason below.

> s_maxbytes will be MAX_LFS_FILESIZE which, on a 32bit system, is
> 
> #define MAX_LFS_FILESIZE        (((loff_t)PAGE_SIZE << (BITS_PER_LONG-1))-1)
> 
> That is 2^(12+31) or 2^43 or 8TB.
> 
> Is this a 32bit system you are using?  Such systems can only support
> buffered IO up to 8TB.  If you use iflag=direct to avoid buffering, you
> should get access to the whole device.

You found the problem, and you also found the reason why btrfs-progs
also fails past 8TB. It is indeed a 32bit distro. If I pair a 64bit
kernel with the 32bit userland, there is a weird sound driver/video
driver sync problem, so I've stuck with 32bits.

This also explains why my btrfs filesystem mounts perfectly (the kernel
knows how to deal with it), but as soon as I use btrfs check (32bit),
it fails to access data past the 8TB limit and falls on its face too.
myth:/sys/block/md5/md# dd if=/dev/md5 of=/dev/null bs=1GiB skip=8190
dd: reading `/dev/md5': Invalid argument
2+0 records in
2+0 records out
2147483648 bytes (2.1 GB) copied, 37.0785 s, 57.9 MB/s
myth:/sys/block/md5/md# dd if=/dev/md5 of=/dev/null bs=1GiB skip=8190 count=3 iflag=direct
3+0 records in
3+0 records out
3221225472 bytes (3.2 GB) copied, 41.0663 s, 78.4 MB/s

So a big thanks for solving this mystery.

Marc
-- 
"A mouse is a device used to point at the xterm you want to type in" - A.S.R.
Microsoft is to operating systems ....
                                      .... what McDonalds is to gourmet cooking
Home page: http://marc.merlins.org/                         | PGP 1024R/763BE901

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: clearing blocks wrongfully marked as bad if --update=no-bbl can't be used?
  2016-11-07  0:16                       ` NeilBrown
  2016-11-07  1:13                         ` Marc MERLIN
@ 2016-11-07  1:20                         ` Marc MERLIN
  2016-11-07  1:39                           ` Qu Wenruo
  1 sibling, 1 reply; 67+ messages in thread
From: Marc MERLIN @ 2016-11-07  1:20 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: Hugo Mills, linux-btrfs

On Mon, Nov 07, 2016 at 09:11:54AM +0800, Qu Wenruo wrote:
> > Well, turns out you were right. My array is 14TB and dd was only able to
> > copy 8.8TB out of it.
> > 
> > I wonder if it's a bug with bcache and source devices that are too big?
> 
> At least we know it's not a problem of btrfs-progs.
> 
> And for bcache/soft raid/encryption, unfortunately I'm not familiar with any
> of them.
> 
> I would recommend to report it to bcache/mdadm/encryption ML after locating
> the layer which returns EINVAL.

So, Neil Brown found the problem.

myth:/sys/block/md5/md# dd if=/dev/md5 of=/dev/null bs=1GiB skip=8190
dd: reading `/dev/md5': Invalid argument
2+0 records in
2+0 records out
2147483648 bytes (2.1 GB) copied, 37.0785 s, 57.9 MB/s
myth:/sys/block/md5/md# dd if=/dev/md5 of=/dev/null bs=1GiB skip=8190 count=3 iflag=direct
3+0 records in
3+0 records out


On Mon, Nov 07, 2016 at 11:16:56AM +1100, NeilBrown wrote:
> EINVAL from a read() system call is surprising in this context.....
> 
> do_generic_file_read can return it:
> 	if (unlikely(*ppos >= inode->i_sb->s_maxbytes))
> 		return -EINVAL;
> 
> s_maxbytes will be MAX_LFS_FILESIZE which, on a 32bit system, is
> 
> #define MAX_LFS_FILESIZE        (((loff_t)PAGE_SIZE << (BITS_PER_LONG-1))-1)
> 
> That is 2^(12+31) or 2^43 or 8TB.
> 
> Is this a 32bit system you are using?  Such systems can only support
> buffered IO up to 8TB.  If you use iflag=direct to avoid buffering, you
> should get access to the whole device.

I am indeed using a 32bit system, and now we know why the kernel can
mount and use my filesystem just fine while btrfs check --repair fails to
deal with it.
The filesystem is more than 8TB on a 32bit kernel with 32bit userland.

Since iflag=direct fixes the issue with dd, it sounds like something
similar could be done for btrfs-progs, to support filesystems bigger
than 8TB on 32bit systems.

However, could you confirm that filesystems of more than 8TB are supported
by the kernel code itself on 32bit systems? (I think so, but just
want to make sure.)

Thanks,
Marc
-- 
"A mouse is a device used to point at the xterm you want to type in" - A.S.R.
Microsoft is to operating systems ....
                                      .... what McDonalds is to gourmet cooking
Home page: http://marc.merlins.org/                         | PGP 1024R/763BE901

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: clearing blocks wrongfully marked as bad if --update=no-bbl can't be used?
  2016-11-07  1:20                         ` Marc MERLIN
@ 2016-11-07  1:39                           ` Qu Wenruo
  2016-11-07  4:18                             ` Qu Wenruo
  0 siblings, 1 reply; 67+ messages in thread
From: Qu Wenruo @ 2016-11-07  1:39 UTC (permalink / raw)
  To: Marc MERLIN; +Cc: Hugo Mills, linux-btrfs



At 11/07/2016 09:20 AM, Marc MERLIN wrote:
> On Mon, Nov 07, 2016 at 09:11:54AM +0800, Qu Wenruo wrote:
>>> Well, turns out you were right. My array is 14TB and dd was only able to
>>> copy 8.8TB out of it.
>>>
>>> I wonder if it's a bug with bcache and source devices that are too big?
>>
>> At least we know it's not a problem of btrfs-progs.
>>
>> And for bcache/soft raid/encryption, unfortunately I'm not familiar with any
>> of them.
>>
>> I would recommend to report it to bcache/mdadm/encryption ML after locating
>> the layer which returns EINVAL.
>
> So, Neil Brown found the problem.
>
> myth:/sys/block/md5/md# dd if=/dev/md5 of=/dev/null bs=1GiB skip=8190
> dd: reading `/dev/md5': Invalid argument
> 2+0 records in
> 2+0 records out
> 2147483648 bytes (2.1 GB) copied, 37.0785 s, 57.9 MB/s
> myth:/sys/block/md5/md# dd if=/dev/md5 of=/dev/null bs=1GiB skip=8190 count=3 iflag=direct
> 3+0 records in
> 3+0 records out

That's interesting.

>
>
> On Mon, Nov 07, 2016 at 11:16:56AM +1100, NeilBrown wrote:
>> EINVAL from a read() system call is surprising in this context.....
>>
>> do_generic_file_read can return it:
>> 	if (unlikely(*ppos >= inode->i_sb->s_maxbytes))
>> 		return -EINVAL;

At least the return value is a bug.
Normally we should return -EFBIG instead of -EINVAL.

>>
>> s_maxbytes will be MAX_LFS_FILESIZE which, on a 32bit system, is
>>
>> #define MAX_LFS_FILESIZE        (((loff_t)PAGE_SIZE << (BITS_PER_LONG-1))-1)
>>
>> That is 2^(12+31) or 2^43 or 8TB.
>>
>> Is this a 32bit system you are using?  Such systems can only support
>> buffered IO up to 8TB.  If you use iflag=direct to avoid buffering, you
>> should get access to the whole device.
>
> I am indeed using a 32bit system, and now we know why the kernel can
> mount and use my filesystem just fine while btrfs check repair fails to
> deal with it.
> The filesystem is more than 8TB on a 32bit kernel with 32bit userland.
>
> Since iflag=direct fixes the issue with dd, it sounds like something
> similar could be done for btrfs progs, to support filesystems bigger
> than 8TB on 32bit systems.
>
> However, could you confirm that filesystems more than 8TB are supported
> by the kernel code itself on 32bit systems? (I think so, but just
> wanting to make sure)

Yep, a filesystem can support sizes up to the u64 max. (But I'd assume
u63 in practice, as some filesystems may use the highest bit for a special
purpose.) It's just the VFS/mm layer that is blocking things.

Direct IO can handle it because it bypasses the page cache, while for
buffered IO it's the page cache addressing that limits the maximum offset.
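
For what it's worth, the behaviour can be reproduced programmatically
too -- a minimal sketch, assuming /dev/md5 as in this thread and a
32-bit build with -D_FILE_OFFSET_BITS=64 so that off_t is 64-bit:

#define _GNU_SOURCE			/* for O_DIRECT */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(void)
{
	const char *dev = "/dev/md5";	/* assumed device path */
	off_t off = (off_t)1 << 43;	/* just past the 8 TiB limit */
	void *buf;
	int fd;

	/* O_DIRECT requires a 512-byte-aligned buffer */
	if (posix_memalign(&buf, 512, 4096))
		return 1;

	fd = open(dev, O_RDONLY);		/* buffered path */
	if (pread(fd, buf, 4096, off) < 0)
		perror("buffered pread");	/* EINVAL on 32-bit */
	close(fd);

	fd = open(dev, O_RDONLY | O_DIRECT);	/* bypasses the page cache */
	if (pread(fd, buf, 4096, off) < 0)
		perror("direct pread");
	else
		printf("direct read past 8 TiB succeeded\n");
	close(fd);
	free(buf);
	return 0;
}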

It's good to have located the root cause.

It doesn't look hard to add such a workaround to btrfs-progs.
I'll send one soon.

Thanks,
Qu

>
> Thanks,
> Marc
>



^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: clearing blocks wrongfully marked as bad if --update=no-bbl can't be used?
  2016-11-07  1:13                         ` Marc MERLIN
@ 2016-11-07  3:36                           ` Phil Turmel
  0 siblings, 0 replies; 67+ messages in thread
From: Phil Turmel @ 2016-11-07  3:36 UTC (permalink / raw)
  To: Marc MERLIN, NeilBrown
  Cc: Roman Mamedov, Neil Brown, Andreas Klauer, linux-raid

On 11/06/2016 08:13 PM, Marc MERLIN wrote:
> On Mon, Nov 07, 2016 at 11:16:56AM +1100, NeilBrown wrote:

>> Is this a 32bit system you are using?  Such systems can only support
>> buffered IO up to 8TB.  If you use iflag=direct to avoid buffering, you
>> should get access to the whole device.
> 
> You found the problem, and you also found the reason why btrfs_tools
> also fails past 8GB. It is indeed a 32bit distro. If I put a 64bit
> kernel with the 32bit userland, there is a weird problem with a sound
> driver/video driver sync, so I've stuck with 32bits.

Huh.  Learn something new every day, I suppose.  Never would have
thought of this.  Thanks, Neil.

Phil


^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: clearing blocks wrongfully marked as bad if --update=no-bbl can't be used?
  2016-11-07  1:39                           ` Qu Wenruo
@ 2016-11-07  4:18                             ` Qu Wenruo
  2016-11-07  5:36                               ` btrfs support for filesystems >8TB on 32bit architectures Marc MERLIN
  0 siblings, 1 reply; 67+ messages in thread
From: Qu Wenruo @ 2016-11-07  4:18 UTC (permalink / raw)
  To: Marc MERLIN; +Cc: Hugo Mills, linux-btrfs



At 11/07/2016 09:39 AM, Qu Wenruo wrote:
>
>
> At 11/07/2016 09:20 AM, Marc MERLIN wrote:
>> On Mon, Nov 07, 2016 at 09:11:54AM +0800, Qu Wenruo wrote:
>>>> Well, turns out you were right. My array is 14TB and dd was only
>>>> able to
>>>> copy 8.8TB out of it.
>>>>
>>>> I wonder if it's a bug with bcache and source devices that are too big?
>>>
>>> At least we know it's not a problem of btrfs-progs.
>>>
>>> And for bcache/soft raid/encryption, unfortunately I'm not familiar
>>> with any
>>> of them.
>>>
>>> I would recommend to report it to bcache/mdadm/encryption ML after
>>> locating
>>> the layer which returns EINVAL.
>>
>> So, Neil Brown found the problem.
>>
>> myth:/sys/block/md5/md# dd if=/dev/md5 of=/dev/null bs=1GiB skip=8190
>> dd: reading `/dev/md5': Invalid argument
>> 2+0 records in
>> 2+0 records out
>> 2147483648 bytes (2.1 GB) copied, 37.0785 s, 57.9 MB/s
>> myth:/sys/block/md5/md# dd if=/dev/md5 of=/dev/null bs=1GiB skip=8190
>> count=3 iflag=direct
>> 3+0 records in
>> 3+0 records out
>
> That's interesting.
>
>>
>>
>> On Mon, Nov 07, 2016 at 11:16:56AM +1100, NeilBrown wrote:
>>> EINVAL from a read() system call is surprising in this context.....
>>>
>>> do_generic_file_read can return it:
>>>     if (unlikely(*ppos >= inode->i_sb->s_maxbytes))
>>>         return -EINVAL;
>
> At least the return value is a bug.
> Normally we should return -EFBIG instead of -EINVAL.
>
>>>
>>> s_maxbytes will be MAX_LFS_FILESIZE which, on a 32bit system, is
>>>
>>> #define MAX_LFS_FILESIZE        (((loff_t)PAGE_SIZE <<
>>> (BITS_PER_LONG-1))-1)
>>>
>>> That is 2^(12+31) or 2^43 or 8TB.
>>>
>>> Is this a 32bit system you are using?  Such systems can only support
>>> buffered IO up to 8TB.  If you use iflag=direct to avoid buffering, you
>>> should get access to the whole device.
>>
>> I am indeed using a 32bit system, and now we know why the kernel can
>> mount and use my filesystem just fine while btrfs check repair fails to
>> deal with it.
>> The filesystem is more than 8TB on a 32bit kernel with 32bit userland.
>>
>> Since iflag=direct fixes the issue with dd, it sounds like something
>> similar could be done for btrfs progs, to support filesystems bigger
>> than 8TB on 32bit systems.
>>
>> However, could you confirm that filesystems more than 8TB are supported
>> by the kernel code itself on 32bit systems? (I think so, but just
>> wanting to make sure)
>
> Yep, fs can support to u64 max size fs. (But I'd assume u63 max as some
> fs may use the highest bit for special purpose)
> Just VFS/mm layer is blocking things.
>
> Direct IO can handle it because it avoids cache, while for buffered IO,
> it's cache(memory) size limiting the offsize.
>
> It's good to have located the root cause.
>
> It doesn't look hard to add such a workaround to btrfs-progs.
> I'll send one soon.

I'm totally wrong here.

DirectIO needs the 'buf' parameter of read()/pread() to be 512 bytes 
aligned.

While we are using a lot of stack memory and normal malloc()/calloc()
allocated memory, which is seldom aligned to 512 bytes.

So to *work around* the problem in btrfs-progs, we may need to change any
pread() caller to use aligned memory allocation.

I really don't think David will accept such a huge change for a workaround...
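
For illustration, a minimal sketch of the kind of change every call site
would need -- the helper name and shape here are assumed, this is not
actual btrfs-progs code:

#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/* hypothetical helper: pread() through an fd that was opened with
 * O_RDONLY | O_DIRECT, bouncing through a 512-byte-aligned buffer,
 * since O_DIRECT rejects the unaligned stack/malloc() buffers the
 * callers use today */
static ssize_t direct_pread(int fd, void *dst, size_t len, off_t off)
{
	void *bounce;
	ssize_t ret;

	/* posix_memalign() guarantees the alignment plain malloc()
	 * and stack buffers usually lack */
	if (posix_memalign(&bounce, 512, len))
		return -1;
	ret = pread(fd, bounce, len, off);
	if (ret > 0)
		memcpy(dst, bounce, ret);
	free(bounce);
	return ret;
}

(O_DIRECT also wants the offset and length sector-aligned, so callers
with odd sizes would need further massaging -- hence the "huge change".)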

Thanks,
Qu
>
> Thanks,
> Qu
>
>>
>> Thanks,
>> Marc
>>



^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: btrfs support for filesystems >8TB on 32bit architectures
  2016-11-07  4:18                             ` Qu Wenruo
@ 2016-11-07  5:36                               ` Marc MERLIN
  2016-11-07  6:16                                 ` Qu Wenruo
  0 siblings, 1 reply; 67+ messages in thread
From: Marc MERLIN @ 2016-11-07  5:36 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: Hugo Mills, linux-btrfs

(sorry for the bad subject line from the mdadm list on the previous mail) 

On Mon, Nov 07, 2016 at 12:18:10PM +0800, Qu Wenruo wrote:
> I'm totally wrong here.
> 
> DirectIO needs the 'buf' parameter of read()/pread() to be 512 bytes
> aligned.
> 
> While we are using a lot of stack memory and normal malloc()/calloc()
> allocated memory, which is seldom aligned to 512 bytes.
>
> So to *work around* the problem in btrfs-progs, we may need to change any
> pread() caller to use aligned memory allocation.
>
> I really don't think David will accept such a huge change for a workaround...

Thanks for looking into it.
So basically, should we just document that btrfs filesystems past 8TB in
size are not supported on 32bit architectures?
(as in, you can mount and use them I believe, but you cannot create
or repair them)

Marc
-- 
"A mouse is a device used to point at the xterm you want to type in" - A.S.R.
Microsoft is to operating systems ....
                                      .... what McDonalds is to gourmet cooking
Home page: http://marc.merlins.org/                         | PGP 1024R/763BE901

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: btrfs support for filesystems >8TB on 32bit architectures
  2016-11-07  5:36                               ` btrfs support for filesystems >8TB on 32bit architectures Marc MERLIN
@ 2016-11-07  6:16                                 ` Qu Wenruo
  2016-11-07 14:55                                   ` Marc MERLIN
  0 siblings, 1 reply; 67+ messages in thread
From: Qu Wenruo @ 2016-11-07  6:16 UTC (permalink / raw)
  To: Marc MERLIN, David Sterba; +Cc: Hugo Mills, linux-btrfs



At 11/07/2016 01:36 PM, Marc MERLIN wrote:
> (sorry for the bad subject line from the mdadm list on the previous mail)
>
> On Mon, Nov 07, 2016 at 12:18:10PM +0800, Qu Wenruo wrote:
>> I'm totally wrong here.
>>
>> DirectIO needs the 'buf' parameter of read()/pread() to be 512 bytes
>> aligned.
>>
>> While we are using a lot of stack memory and normal malloc()/calloc()
>> allocated memory, which is seldom aligned to 512 bytes.
>>
>> So to *work around* the problem in btrfs-progs, we may need to change any
>> pread() caller to use aligned memory allocation.
>>
>> I really don't think David will accept such a huge change for a workaround...
>
> Thanks for looking into it.
> So basically should we just document that btrfs filesystems past 8TB in
> size are not supported on 32bit architectures?
> (as in you can mount them and use them I believe, but you cannot create,
> or repair them)
>
> Marc
>
Add David to this thread.

For create, it should be OK, as at create time we hardly write beyond
3G. So it won't be a big problem.

For repair, we do have a real possibility that btrfsck can't handle it.

Anyway, I'd like to see what David thinks we should do to handle
the problem.

Thanks,
Qu



^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: btrfs support for filesystems >8TB on 32bit architectures
  2016-11-07  6:16                                 ` Qu Wenruo
@ 2016-11-07 14:55                                   ` Marc MERLIN
  2016-11-08  0:35                                     ` Qu Wenruo
  0 siblings, 1 reply; 67+ messages in thread
From: Marc MERLIN @ 2016-11-07 14:55 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: David Sterba, Hugo Mills, linux-btrfs

On Mon, Nov 07, 2016 at 02:16:37PM +0800, Qu Wenruo wrote:
> 
> 
> At 11/07/2016 01:36 PM, Marc MERLIN wrote:
> > (sorry for the bad subject line from the mdadm list on the previous mail)
> > 
> > On Mon, Nov 07, 2016 at 12:18:10PM +0800, Qu Wenruo wrote:
> > > I'm totally wrong here.
> > > 
> > > DirectIO needs the 'buf' parameter of read()/pread() to be 512 bytes
> > > aligned.
> > > 
> > > While we are using a lot of stack memory and normal malloc()/calloc()
> > > allocated memory, which is seldom aligned to 512 bytes.
> > > 
> > > So to *work around* the problem in btrfs-progs, we may need to change any
> > > pread() caller to use aligned memory allocation.
> > > 
> > > I really don't think David will accept such a huge change for a workaround...
> > 
> > Thanks for looking into it.
> > So basically should we just document that btrfs filesystems past 8TB in
> > size are not supported on 32bit architectures?
> > (as in you can mount them and use them I believe, but you cannot create,
> > or repair them)
> > 
> > Marc
> > 
> Add David to this thread.
> 
> For create, it should be OK, as at create time we hardly write beyond 3G.
> So it won't be a big problem.
> 
> For repair, we do have a real possibility that btrfsck can't handle it.
> 
> Anyway, I'd like to see what David thinks we should do to handle the
> problem.

Understood. One big thing (for me) I forgot to confirm:
1) btrfs receive
2) btrfs scrub
should both be able to work because the IO operations are done directly
inside the kernel and not from user space, correct?

Thanks,
Marc
-- 
"A mouse is a device used to point at the xterm you want to type in" - A.S.R.
Microsoft is to operating systems ....
                                      .... what McDonalds is to gourmet cooking
Home page: http://marc.merlins.org/                         | PGP 1024R/763BE901

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: btrfs support for filesystems >8TB on 32bit architectures
  2016-11-07 14:55                                   ` Marc MERLIN
@ 2016-11-08  0:35                                     ` Qu Wenruo
  2016-11-08  0:39                                       ` Marc MERLIN
  0 siblings, 1 reply; 67+ messages in thread
From: Qu Wenruo @ 2016-11-08  0:35 UTC (permalink / raw)
  To: Marc MERLIN; +Cc: David Sterba, Hugo Mills, linux-btrfs



At 11/07/2016 10:55 PM, Marc MERLIN wrote:
> On Mon, Nov 07, 2016 at 02:16:37PM +0800, Qu Wenruo wrote:
>>
>>
>> At 11/07/2016 01:36 PM, Marc MERLIN wrote:
>>> (sorry for the bad subject line from the mdadm list on the previous mail)
>>>
>>> On Mon, Nov 07, 2016 at 12:18:10PM +0800, Qu Wenruo wrote:
>>>> I'm totally wrong here.
>>>>
>>>> DirectIO needs the 'buf' parameter of read()/pread() to be 512 bytes
>>>> aligned.
>>>>
>>>> While we are using a lot of stack memory and normal malloc()/calloc()
>>>> allocated memory, which is seldom aligned to 512 bytes.
>>>>
>>>> So to *work around* the problem in btrfs-progs, we may need to change any
>>>> pread() caller to use aligned memory allocation.
>>>>
>>>> I really don't think David will accept such a huge change for a workaround...
>>>
>>> Thanks for looking into it.
>>> So basically should we just document that btrfs filesystems past 8TB in
>>> size are not supported on 32bit architectures?
>>> (as in you can mount them and use them I believe, but you cannot create,
>>> or repair them)
>>>
>>> Marc
>>>
>> Add David to this thread.
>>
>> For create, it should be OK, as at create time we hardly write beyond 3G.
>> So it won't be a big problem.
>>
>> For repair, we do have a real possibility that btrfsck can't handle it.
>>
>> Anyway, I'd like to see what David thinks we should do to handle the
>> problem.
>
> Understood. One big thing (for me) I forgot to confirm:
> 1) btrfs receive

Unfortunately, receive is completely done in userspace.
Only send works inside the kernel.

So receive will fail to reconstruct any file larger than 8T.
That said, any normal file smaller than 8T is not affected.

> 2) btrfs scrub

Scrub does its work in the kernel, so it's unaffected.

Thanks,
Qu

> should both be able to work because the IO operations are done directly
> inside the kernel and not from user space, correct?
>
> Thanks,
> Marc
>



^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: btrfs support for filesystems >8TB on 32bit architectures
  2016-11-08  0:35                                     ` Qu Wenruo
@ 2016-11-08  0:39                                       ` Marc MERLIN
  2016-11-08  0:43                                         ` Qu Wenruo
  0 siblings, 1 reply; 67+ messages in thread
From: Marc MERLIN @ 2016-11-08  0:39 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: David Sterba, Hugo Mills, linux-btrfs

On Tue, Nov 08, 2016 at 08:35:54AM +0800, Qu Wenruo wrote:
> >Understood. One big thing (for me) I forgot to confirm:
> >1) btrfs receive
> 
> Unfortunately, receive is completely done in userspace.
> Only send works inside the kernel.
 
right, I've confirmed that btrfs receive fails.
It looks like btrfs balance is also failing, which is more surprising.
Isn't that one in the kernel?

Marc
-- 
"A mouse is a device used to point at the xterm you want to type in" - A.S.R.
Microsoft is to operating systems ....
                                      .... what McDonalds is to gourmet cooking
Home page: http://marc.merlins.org/  

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: btrfs support for filesystems >8TB on 32bit architectures
  2016-11-08  0:39                                       ` Marc MERLIN
@ 2016-11-08  0:43                                         ` Qu Wenruo
  2016-11-08  1:06                                           ` Marc MERLIN
  0 siblings, 1 reply; 67+ messages in thread
From: Qu Wenruo @ 2016-11-08  0:43 UTC (permalink / raw)
  To: Marc MERLIN; +Cc: David Sterba, Hugo Mills, linux-btrfs



At 11/08/2016 08:39 AM, Marc MERLIN wrote:
> On Tue, Nov 08, 2016 at 08:35:54AM +0800, Qu Wenruo wrote:
>>> Understood. One big thing (for me) I forgot to confirm:
>>> 1) btrfs receive
>>
>> Unfortunately, receive is completely done in userspace.
>> Only send works inside the kernel.
>
> right, I've confirmed that btrfs receive fails.
> It looks like btrfs balance is also failing, which is more surprising.
> Isn't that one in the kernel?

That's strange, balance is done completely in kernel space.

Unless we're calling a vfs_* function, we won't go through the extra check.

What's the error reported?

Thanks,
Qu
>
> Marc
>



^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: btrfs support for filesystems >8TB on 32bit architectures
  2016-11-08  0:43                                         ` Qu Wenruo
@ 2016-11-08  1:06                                           ` Marc MERLIN
  2016-11-08  1:17                                             ` Qu Wenruo
  0 siblings, 1 reply; 67+ messages in thread
From: Marc MERLIN @ 2016-11-08  1:06 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: David Sterba, Hugo Mills, linux-btrfs

On Tue, Nov 08, 2016 at 08:43:34AM +0800, Qu Wenruo wrote:
> That's strange, balance is done completely in kernel space.
> 
> Unless we're calling a vfs_* function, we won't go through the extra check.
> 
> What's the error reported?

See below. Note however that it may be because btrfs receive messed up the
filesystem first.

BTRFS info (device dm-0): use zlib compression
BTRFS info (device dm-0): disk space caching is enabled
BTRFS info (device dm-0): has skinny extents
BTRFS info (device dm-0): bdev /dev/mapper/crypt_bcache0 errs: wr 0, rd 0, flush 0, corrupt 512, gen 0
BTRFS info (device dm-0): detected SSD devices, enabling SSD mode
BTRFS info (device dm-0): continuing balance
BTRFS info (device dm-0): The free space cache file (1593999097856) is invalid. skip it

BTRFS info (device dm-0): The free space cache file (1671308509184) is invalid. skip it

BTRFS info (device dm-0): relocating block group 13835461197824 flags 34
------------[ cut here ]------------
WARNING: CPU: 0 PID: 22825 at fs/btrfs/disk-io.c:520 btree_csum_one_bio.isra.39+0xf7/0x100
Modules linked in: bcache configs rc_hauppauge ir_kbd_i2c cpufreq_userspace cpufreq_powersave cpufreq_conservative autofs4 snd_hda_codec_hdmi joydev snd_hda_codec_realtek snd_hda_codec_generic tuner_simple tuner_types tda9887 snd_hda_intel snd_hda_codec snd_hda_core snd_hwdep tda8290 coretemp snd_pcm_oss snd_mixer_oss tuner snd_pcm msp3400 snd_seq_midi snd_seq_midi_event firewire_sbp2 saa7127 snd_rawmidi hwmon_vid dm_crypt dm_mod saa7115 snd_seq bttv hid_generic snd_seq_device snd_timer ehci_pci ivtv tea575x videobuf_dma_sg rc_core videobuf_core input_leds tveeprom cx2341x v4l2_common ehci_hcd videodev media acpi_cpufreq tpm_tis tpm_tis_core gpio_ich snd soundcore tpm psmouse lpc_ich evdev asus_atk0110 serio_raw lp parport raid456 async_raid6_recov async_pq async_xor async_memcpy async_tx multipath usbhid hid sr_mod cdrom sg firewire_ohci firewire_core floppy crc_itu_t i915 atl1 fjes mii uhci_hcd usbcore usb_common
CPU: 0 PID: 22825 Comm: kworker/u9:2 Tainted: G        W       4.8.5-ia32-20161028 #2
Hardware name: System manufacturer P5E-VM HDMI/P5E-VM HDMI, BIOS 0604    07/16/2008
Workqueue: btrfs-worker-high btrfs_worker_helper
 00200286 00200286 d3d81e48 df414827 00000000 dfa12da5 d3d81e78 df05677a
 df9ed884 00000000 00005929 dfa12da5 00000208 df2cf067 00000208 f7463fa0
 f401a080 00000000 d3d81e8c df05684a 00000009 00000000 00000000 d3d81eb4
Call Trace:
 [<df414827>] dump_stack+0x58/0x81
 [<df05677a>] __warn+0xea/0x110
 [<df2cf067>] ? btree_csum_one_bio.isra.39+0xf7/0x100
 [<df05684a>] warn_slowpath_null+0x2a/0x30
 [<df2cf067>] btree_csum_one_bio.isra.39+0xf7/0x100
 [<df2cf085>] __btree_submit_bio_start+0x15/0x20
 [<df2cdd10>] run_one_async_start+0x30/0x40
 [<df31286d>] btrfs_scrubparity_helper+0xcd/0x2d0
 [<df2cde70>] ? run_one_async_free+0x20/0x20
 [<df312bbd>] btrfs_worker_helper+0xd/0x10
 [<df06d05b>] process_one_work+0x10b/0x400
 [<df06d387>] worker_thread+0x37/0x4b0
 [<df06d350>] ? process_one_work+0x400/0x400
 [<df0722db>] kthread+0x9b/0xb0
 [<df799922>] ret_from_kernel_thread+0xe/0x24
 [<df072240>] ? kthread_stop+0x100/0x100
---[ end trace f461faff989bf258 ]---
BTRFS: error (device dm-0) in btrfs_commit_transaction:2232: errno=-5 IO failure (Error while writing out transaction)
BTRFS info (device dm-0): forced readonly
BTRFS warning (device dm-0): Skipping commit of aborted transaction.
------------[ cut here ]------------
WARNING: CPU: 0 PID: 22318 at fs/btrfs/transaction.c:1854 btrfs_commit_transaction+0x2f5/0xcc0
BTRFS: Transaction aborted (error -5)
Modules linked in: bcache configs rc_hauppauge ir_kbd_i2c cpufreq_userspace cpufreq_powersave cpufreq_conservative autofs4 snd_hda_codec_hdmi joydev snd_hda_codec_realtek snd_hda_codec_generic tuner_simple tuner_types tda9887 snd_hda_intel snd_hda_codec snd_hda_core snd_hwdep tda8290 coretemp snd_pcm_oss snd_mixer_oss tuner snd_pcm msp3400 snd_seq_midi snd_seq_midi_event firewire_sbp2 saa7127 snd_rawmidi hwmon_vid dm_crypt dm_mod saa7115 snd_seq bttv hid_generic snd_seq_device snd_timer ehci_pci ivtv tea575x videobuf_dma_sg rc_core videobuf_core input_leds tveeprom cx2341x v4l2_common ehci_hcd videodev media acpi_cpufreq tpm_tis tpm_tis_core gpio_ich snd soundcore tpm psmouse lpc_ich evdev asus_atk0110 serio_raw lp parport raid456 async_raid6_recov async_pq async_xor async_memcpy async_tx multipath usbhid hid sr_mod cdrom sg firewire_ohci firewire_core floppy crc_itu_t i915 atl1 fjes mii uhci_hcd usbcore usb_common
CPU: 0 PID: 22318 Comm: btrfs-balance Tainted: G        W       4.8.5-ia32-20161028 #2
Hardware name: System manufacturer P5E-VM HDMI/P5E-VM HDMI, BIOS 0604    07/16/2008
 00000286 00000286 d74a3ca4 df414827 d74a3ce8 dfa132ab d74a3cd4 df05677a
 dfa075cc d74a3d04 0000572e dfa132ab 0000073e df2d7de5 0000073e f698dc00
 e9173e70 fffffffb d74a3cf0 df0567db 00000009 00000000 d74a3ce8 dfa075cc
Call Trace:
 [<df414827>] dump_stack+0x58/0x81
 [<df05677a>] __warn+0xea/0x110
 [<df2d7de5>] ? btrfs_commit_transaction+0x2f5/0xcc0
 [<df0567db>] warn_slowpath_fmt+0x3b/0x40
 [<df2d7de5>] btrfs_commit_transaction+0x2f5/0xcc0
 [<df096800>] ? prepare_to_wait_event+0xd0/0xd0
 [<df33334f>] prepare_to_relocate+0x12f/0x180
 [<df339a41>] relocate_block_group+0x31/0x790
 [<df0b1427>] ? vprintk_default+0x37/0x40
 [<df796ca0>] ? mutex_lock+0x10/0x30
 [<df2f8f45>] ? btrfs_wait_ordered_roots+0x1d5/0x1f0
 [<df14eed6>] ? printk+0x17/0x19
 [<df2a47b2>] ? btrfs_printk+0x102/0x110
 [<df33a388>] btrfs_relocate_block_group+0x1e8/0x2e0
 [<df308a9f>] btrfs_relocate_chunk.isra.29+0x3f/0xf0
 [<df30221f>] ? free_extent_buffer+0x4f/0xa0
 [<df30a555>] btrfs_balance+0xb05/0x1820
 [<df0b0afa>] ? console_unlock+0x40a/0x630
 [<df30b2c1>] balance_kthread+0x51/0x80
 [<df30b270>] ? btrfs_balance+0x1820/0x1820
 [<df0722db>] kthread+0x9b/0xb0
 [<df799922>] ret_from_kernel_thread+0xe/0x24
 [<df072240>] ? kthread_stop+0x100/0x100
---[ end trace f461faff989bf259 ]---
BTRFS: error (device dm-0) in cleanup_transaction:1854: errno=-5 IO failure
BTRFS info (device dm-0): delayed_refs has NO entry

-- 
"A mouse is a device used to point at the xterm you want to type in" - A.S.R.
Microsoft is to operating systems ....
                                      .... what McDonalds is to gourmet cooking
Home page: http://marc.merlins.org/  

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: btrfs support for filesystems >8TB on 32bit architectures
  2016-11-08  1:06                                           ` Marc MERLIN
@ 2016-11-08  1:17                                             ` Qu Wenruo
  2016-11-08 15:24                                               ` Marc MERLIN
  0 siblings, 1 reply; 67+ messages in thread
From: Qu Wenruo @ 2016-11-08  1:17 UTC (permalink / raw)
  To: Marc MERLIN; +Cc: David Sterba, Hugo Mills, linux-btrfs



At 11/08/2016 09:06 AM, Marc MERLIN wrote:
> On Tue, Nov 08, 2016 at 08:43:34AM +0800, Qu Wenruo wrote:
>> That's strange, balance is done completely in kernel space.
>>
>> Unless we're calling a vfs_* function, we won't go through the extra check.
>>
>> What's the error reported?
>
> See below. Note however that it may be because btrfs receive messed up the
> filesystem first.

If receive could easily screw up the fs, then fsstress could also screw up
btrfs easily.

So I don't think that's the case. (Several years ago it was possible.)

>
> BTRFS info (device dm-0): use zlib compression
> BTRFS info (device dm-0): disk space caching is enabled
> BTRFS info (device dm-0): has skinny extents
> BTRFS info (device dm-0): bdev /dev/mapper/crypt_bcache0 errs: wr 0, rd 0, flush 0, corrupt 512, gen 0
> BTRFS info (device dm-0): detected SSD devices, enabling SSD mode
> BTRFS info (device dm-0): continuing balance
> BTRFS info (device dm-0): The free space cache file (1593999097856) is invalid. skip it
>
> BTRFS info (device dm-0): The free space cache file (1671308509184) is invalid. skip it
>
> BTRFS info (device dm-0): relocating block group 13835461197824 flags 34
> ------------[ cut here ]------------
> WARNING: CPU: 0 PID: 22825 at fs/btrfs/disk-io.c:520 btree_csum_one_bio.isra.39+0xf7/0x100

The dirty tree block's bytenr doesn't match the page's logical address.
It seems that the tree block is not up to date, maybe corrupted.

This seems unrelated to the 8T limit.

Could you please add a pr_info() to print out 'found_start' and 'start'?
I'm not familiar with this code, but the numbers may hold a clue to
what's going wrong.
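
A hedged sketch of roughly what that could look like next to the WARN_ON
in the 4.8-era fs/btrfs/disk-io.c (the surrounding variable names are
assumed from that code; this is not a tested patch):

	/* in csum_dirty_buffer(): found_start is the bytenr stored in
	 * the tree block header, start is the page offset being written;
	 * for a sane dirty tree block the two must match */
	found_start = btrfs_header_bytenr(eb);
	if (found_start != start) {
		pr_info("btrfs: dirty eb mismatch: found_start=%llu start=%llu\n",
			found_start, start);
		WARN_ON(1);
	}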

Thanks,
Qu

> Modules linked in: bcache configs rc_hauppauge ir_kbd_i2c cpufreq_userspace cpufreq_powersave cpufreq_conservative autofs4 snd_hda_codec_hdmi joydev snd_hda_codec_realtek snd_hda_codec_generic tuner_simple tuner_types tda9887 snd_hda_intel snd_hda_codec snd_hda_core snd_hwdep tda8290 coretemp snd_pcm_oss snd_mixer_oss tuner snd_pcm msp3400 snd_seq_midi snd_seq_midi_event firewire_sbp2 saa7127 snd_rawmidi hwmon_vid dm_crypt dm_mod saa7115 snd_seq bttv hid_generic snd_seq_device snd_timer ehci_pci ivtv tea575x videobuf_dma_sg rc_core videobuf_core input_leds tveeprom cx2341x v4l2_common ehci_hcd videodev media acpi_cpufreq tpm_tis tpm_tis_core gpio_ich snd soundcore tpm psmouse lpc_ich evdev asus_atk0110 serio_raw lp parport raid456 async_raid6_recov async_pq async_xor async_memcpy async_tx multipath usbhid hid sr_mod cdrom sg firewire_ohci firewire_core floppy crc_itu_t i915 atl1 fjes mii uhci_hcd usbcore usb_common
> CPU: 0 PID: 22825 Comm: kworker/u9:2 Tainted: G        W       4.8.5-ia32-20161028 #2
> Hardware name: System manufacturer P5E-VM HDMI/P5E-VM HDMI, BIOS 0604    07/16/2008
> Workqueue: btrfs-worker-high btrfs_worker_helper
>  00200286 00200286 d3d81e48 df414827 00000000 dfa12da5 d3d81e78 df05677a
>  df9ed884 00000000 00005929 dfa12da5 00000208 df2cf067 00000208 f7463fa0
>  f401a080 00000000 d3d81e8c df05684a 00000009 00000000 00000000 d3d81eb4
> Call Trace:
>  [<df414827>] dump_stack+0x58/0x81
>  [<df05677a>] __warn+0xea/0x110
>  [<df2cf067>] ? btree_csum_one_bio.isra.39+0xf7/0x100
>  [<df05684a>] warn_slowpath_null+0x2a/0x30
>  [<df2cf067>] btree_csum_one_bio.isra.39+0xf7/0x100
>  [<df2cf085>] __btree_submit_bio_start+0x15/0x20
>  [<df2cdd10>] run_one_async_start+0x30/0x40
>  [<df31286d>] btrfs_scrubparity_helper+0xcd/0x2d0
>  [<df2cde70>] ? run_one_async_free+0x20/0x20
>  [<df312bbd>] btrfs_worker_helper+0xd/0x10
>  [<df06d05b>] process_one_work+0x10b/0x400
>  [<df06d387>] worker_thread+0x37/0x4b0
>  [<df06d350>] ? process_one_work+0x400/0x400
>  [<df0722db>] kthread+0x9b/0xb0
>  [<df799922>] ret_from_kernel_thread+0xe/0x24
>  [<df072240>] ? kthread_stop+0x100/0x100
> ---[ end trace f461faff989bf258 ]---
> BTRFS: error (device dm-0) in btrfs_commit_transaction:2232: errno=-5 IO failure (Error while writing out transaction)
> BTRFS info (device dm-0): forced readonly
> BTRFS warning (device dm-0): Skipping commit of aborted transaction.
> ------------[ cut here ]------------
> WARNING: CPU: 0 PID: 22318 at fs/btrfs/transaction.c:1854 btrfs_commit_transaction+0x2f5/0xcc0
> BTRFS: Transaction aborted (error -5)
> Modules linked in: bcache configs rc_hauppauge ir_kbd_i2c cpufreq_userspace cpufreq_powersave cpufreq_conservative autofs4 snd_hda_codec_hdmi joydev snd_hda_codec_realtek snd_hda_codec_generic tuner_simple tuner_types tda9887 snd_hda_intel snd_hda_codec snd_hda_core snd_hwdep tda8290 coretemp snd_pcm_oss snd_mixer_oss tuner snd_pcm msp3400 snd_seq_midi snd_seq_midi_event firewire_sbp2 saa7127 snd_rawmidi hwmon_vid dm_crypt dm_mod saa7115 snd_seq bttv hid_generic snd_seq_device snd_timer ehci_pci ivtv tea575x videobuf_dma_sg rc_core videobuf_core input_leds tveeprom cx2341x v4l2_common ehci_hcd videodev media acpi_cpufreq tpm_tis tpm_tis_core gpio_ich snd soundcore tpm psmouse lpc_ich evdev asus_atk0110 serio_raw lp parport raid456 async_raid6_recov async_pq async_xor async_memcpy async_tx multipath usbhid hid sr_mod cdrom sg firewire_ohci firewire_core floppy crc_itu_t i915 atl1 fjes mii uhci_hcd usbcore usb_common
> CPU: 0 PID: 22318 Comm: btrfs-balance Tainted: G        W       4.8.5-ia32-20161028 #2
> Hardware name: System manufacturer P5E-VM HDMI/P5E-VM HDMI, BIOS 0604    07/16/2008
>  00000286 00000286 d74a3ca4 df414827 d74a3ce8 dfa132ab d74a3cd4 df05677a
>  dfa075cc d74a3d04 0000572e dfa132ab 0000073e df2d7de5 0000073e f698dc00
>  e9173e70 fffffffb d74a3cf0 df0567db 00000009 00000000 d74a3ce8 dfa075cc
> Call Trace:
>  [<df414827>] dump_stack+0x58/0x81
>  [<df05677a>] __warn+0xea/0x110
>  [<df2d7de5>] ? btrfs_commit_transaction+0x2f5/0xcc0
>  [<df0567db>] warn_slowpath_fmt+0x3b/0x40
>  [<df2d7de5>] btrfs_commit_transaction+0x2f5/0xcc0
>  [<df096800>] ? prepare_to_wait_event+0xd0/0xd0
>  [<df33334f>] prepare_to_relocate+0x12f/0x180
>  [<df339a41>] relocate_block_group+0x31/0x790
>  [<df0b1427>] ? vprintk_default+0x37/0x40
>  [<df796ca0>] ? mutex_lock+0x10/0x30
>  [<df2f8f45>] ? btrfs_wait_ordered_roots+0x1d5/0x1f0
>  [<df14eed6>] ? printk+0x17/0x19
>  [<df2a47b2>] ? btrfs_printk+0x102/0x110
>  [<df33a388>] btrfs_relocate_block_group+0x1e8/0x2e0
>  [<df308a9f>] btrfs_relocate_chunk.isra.29+0x3f/0xf0
>  [<df30221f>] ? free_extent_buffer+0x4f/0xa0
>  [<df30a555>] btrfs_balance+0xb05/0x1820
>  [<df0b0afa>] ? console_unlock+0x40a/0x630
>  [<df30b2c1>] balance_kthread+0x51/0x80
>  [<df30b270>] ? btrfs_balance+0x1820/0x1820
>  [<df0722db>] kthread+0x9b/0xb0
>  [<df799922>] ret_from_kernel_thread+0xe/0x24
>  [<df072240>] ? kthread_stop+0x100/0x100
> ---[ end trace f461faff989bf259 ]---
> BTRFS: error (device dm-0) in cleanup_transaction:1854: errno=-5 IO failure
> BTRFS info (device dm-0): delayed_refs has NO entry
>



^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: btrfs support for filesystems >8TB on 32bit architectures
  2016-11-08  1:17                                             ` Qu Wenruo
@ 2016-11-08 15:24                                               ` Marc MERLIN
  2016-11-09  1:50                                                 ` Qu Wenruo
  0 siblings, 1 reply; 67+ messages in thread
From: Marc MERLIN @ 2016-11-08 15:24 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: David Sterba, Hugo Mills, linux-btrfs

On Tue, Nov 08, 2016 at 09:17:43AM +0800, Qu Wenruo wrote:
> 
> 
> At 11/08/2016 09:06 AM, Marc MERLIN wrote:
> >On Tue, Nov 08, 2016 at 08:43:34AM +0800, Qu Wenruo wrote:
> >>That's strange, balance is done completely in kernel space.
> >>
> >>Unless we're calling a vfs_* function, we won't go through the extra check.
> >>
> >>What's the error reported?
> >
> >See below. Note however that it may be because btrfs receive messed up the
> >filesystem first.
> 
> If receive could easily screw up the fs, then fsstress could also screw up
> btrfs easily.
> 
> So I don't think that's the case. (Several years ago it was possible.)
 
So now I'm even more confused. I put the array back in my 64bit system and
check --repair comes back clean, but scrub does not. Is that supposed to be possible?

gargamel:~# btrfs check -p --repair /dev/mapper/crypt_bcache2 2>&1 | tee /mnt/dshelf1/other/btrfs2
enabling repair mode
Checking filesystem on /dev/mapper/crypt_bcache2
UUID: 6692cf4c-93d9-438c-ac30-5db6381dc4f2
checking extents [.]
Fixed 0 roots.
cache and super generation don't match, space cache will be invalidated
checking fs roots [o]
checking csums
checking root refs
found 14622791987200 bytes used err is 0
total csum bytes: 14200176492
total tree bytes: 78239416320
total fs tree bytes: 59524497408
total extent tree bytes: 3236872192
btree space waste bytes: 10068589919
file data blocks allocated: 18101311373312
 referenced 18038641020928

Nov  8 06:55:40 gargamel kernel: [35631.988896] BTRFS error (device dm-6): bdev /dev/mapper/crypt_bcache2 errs: wr 0, rd 0, flush 0, corrupt 513, gen 0
Nov  8 06:55:40 gargamel kernel: [35631.988897] BTRFS error (device dm-6): bdev /dev/mapper/crypt_bcache2 errs: wr 0, rd 0, flush 0, corrupt 514, gen 0
Nov  8 06:55:40 gargamel kernel: [35631.988899] BTRFS warning (device dm-6): checksum error at logical 27885961216 on dev /dev/mapper/crypt_bcache2, sector 56578304, root 9461, inode 45837, offset 15459172352, length 4096, links 1 (path: system/mlocate/mlocate.db)
Nov  8 06:55:40 gargamel kernel: [35631.988900] BTRFS error (device dm-6): bdev /dev/mapper/crypt_bcache2 errs: wr 0, rd 0, flush 0, corrupt 515, gen 0
Nov  8 06:55:40 gargamel kernel: [35631.988903] BTRFS warning (device dm-6): checksum error at logical 27887534080 on dev /dev/mapper/crypt_bcache2, sector 56581376, root 9461, inode 45837, offset 15460745216, length 4096, links 1 (path: system/mlocate/mlocate.db)
Nov  8 06:55:40 gargamel kernel: [35631.988904] BTRFS error (device dm-6): unable to fixup (regular) error at logical 27887009792 on dev /dev/mapper/crypt_bcache2
Nov  8 06:55:40 gargamel kernel: [35631.988905] BTRFS error (device dm-6): unable to fixup (regular) error at logical 27886878720 on dev /dev/mapper/crypt_bcache2
Nov  8 06:55:40 gargamel kernel: [35631.988906] BTRFS error (device dm-6): bdev /dev/mapper/crypt_bcache2 errs: wr 0, rd 0, flush 0, corrupt 516, gen 0
Nov  8 06:55:40 gargamel kernel: [35631.988907] BTRFS error (device dm-6): unable to fixup (regular) error at logical 27887837184 on dev /dev/mapper/crypt_bcache2
Nov  8 06:55:40 gargamel kernel: [35631.988908] BTRFS error (device dm-6): bdev /dev/mapper/crypt_bcache2 errs: wr 0, rd 0, flush 0, corrupt 517, gen 0
Nov  8 06:55:40 gargamel kernel: [35631.988909] BTRFS error (device dm-6): bdev /dev/mapper/crypt_bcache2 errs: wr 0, rd 0, flush 0, corrupt 518, gen 0
Nov  8 06:55:40 gargamel kernel: [35631.988910] BTRFS error (device dm-6): unable to fixup (regular) error at logical 27885830144 on dev /dev/mapper/crypt_bcache2
Nov  8 06:55:40 gargamel kernel: [35631.988911] BTRFS error (device dm-6): unable to fixup (regular) error at logical 27885961216 on dev /dev/mapper/crypt_bcache2
Nov  8 06:55:40 gargamel kernel: [35631.988912] BTRFS error (device dm-6): unable to fixup (regular) error at logical 27887534080 on dev /dev/mapper/crypt_bcache2
Nov  8 06:55:40 gargamel kernel: [35631.988882] BTRFS warning (device dm-6): checksum error at logical 27887403008 on dev /dev/mapper/crypt_bcache2, sector 56581120, root 9461, inode 45837, offset 15460614144, length 4096, links 1 (path: system/mlocate/mlocate.db)
Nov  8 06:55:40 gargamel kernel: [35631.988885] BTRFS warning (device dm-6): checksum error at logical 27887009792 on dev /dev/mapper/crypt_bcache2, sector 56580352, root 9461, inode 45837, offset 15460220928, length 4096, links 1 (path: system/mlocate/mlocate.db)
Nov  8 06:55:40 gargamel kernel: [35631.988887] BTRFS warning (device dm-6): checksum error at logical 27886878720 on dev /dev/mapper/crypt_bcache2, sector 56580096, root 9461, inode 45837, offset 15460089856, length 4096, links 1 (path: system/mlocate/mlocate.db)
Nov  8 06:55:40 gargamel kernel: [35631.988890] BTRFS warning (device dm-6): checksum error at logical 27887837184 on dev /dev/mapper/crypt_bcache2, sector 56581968, root 9461, inode 45837, offset 15461048320, length 4096, links 1 (path: system/mlocate/mlocate.db)
Nov  8 06:55:40 gargamel kernel: [35631.988895] BTRFS warning (device dm-6): checksum error at logical 27885830144 on dev /dev/mapper/crypt_bcache2, sector 56578048, root 9461, inode 45837, offset 15459041280, length 4096, links 1 (path: system/mlocate/mlocate.db)



> >
> >BTRFS info (device dm-0): use zlib compression
> >BTRFS info (device dm-0): disk space caching is enabled
> >BTRFS info (device dm-0): has skinny extents
> >BTRFS info (device dm-0): bdev /dev/mapper/crypt_bcache0 errs: wr 0, rd 0, 
> >flush 0, corrupt 512, gen 0
> >BTRFS info (device dm-0): detected SSD devices, enabling SSD mode
> >BTRFS info (device dm-0): continuing balance
> >BTRFS info (device dm-0): The free space cache file (1593999097856) is 
> >invalid. skip it
> >
> >BTRFS info (device dm-0): The free space cache file (1671308509184) is 
> >invalid. skip it
> >
> >BTRFS info (device dm-0): relocating block group 13835461197824 flags 34
> >------------[ cut here ]------------
> >WARNING: CPU: 0 PID: 22825 at fs/btrfs/disk-io.c:520 
> >btree_csum_one_bio.isra.39+0xf7/0x100
> 
> The dirty tree block's bytenr doesn't match the page's logical address.
> It seems that the tree block is not up to date, maybe corrupted.
> 
> Seems not related to the 8T limit.
> 
> Could you please add pr_info() to print out the 'found_start' and 'start'?
> Also, I'm not familiar with this code, but the numbers may give a clue
> about what's going wrong.
> 
> Thanks,
> Qu
> 
> >Modules linked in: bcache configs rc_hauppauge ir_kbd_i2c 
> >cpufreq_userspace cpufreq_powersave cpufreq_conservative autofs4 
> >snd_hda_codec_hdmi joydev snd_hda_codec_realtek snd_hda_codec_generic 
> >tuner_simple tuner_types tda9887 snd_hda_intel snd_hda_codec snd_hda_core 
> >snd_hwdep tda8290 coretemp snd_pcm_oss snd_mixer_oss tuner snd_pcm msp3400 
> >snd_seq_midi snd_seq_midi_event firewire_sbp2 saa7127 snd_rawmidi 
> >hwmon_vid dm_crypt dm_mod saa7115 snd_seq bttv hid_generic snd_seq_device 
> >snd_timer ehci_pci ivtv tea575x videobuf_dma_sg rc_core videobuf_core 
> >input_leds tveeprom cx2341x v4l2_common ehci_hcd videodev media 
> >acpi_cpufreq tpm_tis tpm_tis_core gpio_ich snd soundcore tpm psmouse 
> >lpc_ich evdev asus_atk0110 serio_raw lp parport raid456 async_raid6_recov 
> >async_pq async_xor async_memcpy async_tx multipath usbhid hid sr_mod cdrom 
> >sg firewire_ohci firewire_core floppy crc_itu_t i915 atl1 fjes mii 
> >uhci_hcd usbcore usb_common
> >CPU: 0 PID: 22825 Comm: kworker/u9:2 Tainted: G        W       
> >4.8.5-ia32-20161028 #2
> >Hardware name: System manufacturer P5E-VM HDMI/P5E-VM HDMI, BIOS 0604    
> >07/16/2008
> >Workqueue: btrfs-worker-high btrfs_worker_helper
> > 00200286 00200286 d3d81e48 df414827 00000000 dfa12da5 d3d81e78 df05677a
> > df9ed884 00000000 00005929 dfa12da5 00000208 df2cf067 00000208 f7463fa0
> > f401a080 00000000 d3d81e8c df05684a 00000009 00000000 00000000 d3d81eb4
> >Call Trace:
> > [<df414827>] dump_stack+0x58/0x81
> > [<df05677a>] __warn+0xea/0x110
> > [<df2cf067>] ? btree_csum_one_bio.isra.39+0xf7/0x100
> > [<df05684a>] warn_slowpath_null+0x2a/0x30
> > [<df2cf067>] btree_csum_one_bio.isra.39+0xf7/0x100
> > [<df2cf085>] __btree_submit_bio_start+0x15/0x20
> > [<df2cdd10>] run_one_async_start+0x30/0x40
> > [<df31286d>] btrfs_scrubparity_helper+0xcd/0x2d0
> > [<df2cde70>] ? run_one_async_free+0x20/0x20
> > [<df312bbd>] btrfs_worker_helper+0xd/0x10
> > [<df06d05b>] process_one_work+0x10b/0x400
> > [<df06d387>] worker_thread+0x37/0x4b0
> > [<df06d350>] ? process_one_work+0x400/0x400
> > [<df0722db>] kthread+0x9b/0xb0
> > [<df799922>] ret_from_kernel_thread+0xe/0x24
> > [<df072240>] ? kthread_stop+0x100/0x100
> >---[ end trace f461faff989bf258 ]---
> >BTRFS: error (device dm-0) in btrfs_commit_transaction:2232: errno=-5 IO 
> >failure (Error while writing out transaction)
> >BTRFS info (device dm-0): forced readonly
> >BTRFS warning (device dm-0): Skipping commit of aborted transaction.
> >------------[ cut here ]------------
> >WARNING: CPU: 0 PID: 22318 at fs/btrfs/transaction.c:1854 
> >btrfs_commit_transaction+0x2f5/0xcc0
> >BTRFS: Transaction aborted (error -5)
> >Modules linked in: bcache configs rc_hauppauge ir_kbd_i2c 
> >cpufreq_userspace cpufreq_powersave cpufreq_conservative autofs4 
> >snd_hda_codec_hdmi joydev snd_hda_codec_realtek snd_hda_codec_generic 
> >tuner_simple tuner_types tda9887 snd_hda_intel snd_hda_codec snd_hda_core 
> >snd_hwdep tda8290 coretemp snd_pcm_oss snd_mixer_oss tuner snd_pcm msp3400 
> >snd_seq_midi snd_seq_midi_event firewire_sbp2 saa7127 snd_rawmidi 
> >hwmon_vid dm_crypt dm_mod saa7115 snd_seq bttv hid_generic snd_seq_device 
> >snd_timer ehci_pci ivtv tea575x videobuf_dma_sg rc_core videobuf_core 
> >input_leds tveeprom cx2341x v4l2_common ehci_hcd videodev media 
> >acpi_cpufreq tpm_tis tpm_tis_core gpio_ich snd soundcore tpm psmouse 
> >lpc_ich evdev asus_atk0110 serio_raw lp parport raid456 async_raid6_recov 
> >async_pq async_xor async_memcpy async_tx multipath usbhid hid sr_mod cdrom 
> >sg firewire_ohci firewire_core floppy crc_itu_t i915 atl1 fjes mii 
> >uhci_hcd usbcore usb_common
> >CPU: 0 PID: 22318 Comm: btrfs-balance Tainted: G        W       
> >4.8.5-ia32-20161028 #2
> >Hardware name: System manufacturer P5E-VM HDMI/P5E-VM HDMI, BIOS 0604    
> >07/16/2008
> > 00000286 00000286 d74a3ca4 df414827 d74a3ce8 dfa132ab d74a3cd4 df05677a
> > dfa075cc d74a3d04 0000572e dfa132ab 0000073e df2d7de5 0000073e f698dc00
> > e9173e70 fffffffb d74a3cf0 df0567db 00000009 00000000 d74a3ce8 dfa075cc
> >Call Trace:
> > [<df414827>] dump_stack+0x58/0x81
> > [<df05677a>] __warn+0xea/0x110
> > [<df2d7de5>] ? btrfs_commit_transaction+0x2f5/0xcc0
> > [<df0567db>] warn_slowpath_fmt+0x3b/0x40
> > [<df2d7de5>] btrfs_commit_transaction+0x2f5/0xcc0
> > [<df096800>] ? prepare_to_wait_event+0xd0/0xd0
> > [<df33334f>] prepare_to_relocate+0x12f/0x180
> > [<df339a41>] relocate_block_group+0x31/0x790
> > [<df0b1427>] ? vprintk_default+0x37/0x40
> > [<df796ca0>] ? mutex_lock+0x10/0x30
> > [<df2f8f45>] ? btrfs_wait_ordered_roots+0x1d5/0x1f0
> > [<df14eed6>] ? printk+0x17/0x19
> > [<df2a47b2>] ? btrfs_printk+0x102/0x110
> > [<df33a388>] btrfs_relocate_block_group+0x1e8/0x2e0
> > [<df308a9f>] btrfs_relocate_chunk.isra.29+0x3f/0xf0
> > [<df30221f>] ? free_extent_buffer+0x4f/0xa0
> > [<df30a555>] btrfs_balance+0xb05/0x1820
> > [<df0b0afa>] ? console_unlock+0x40a/0x630
> > [<df30b2c1>] balance_kthread+0x51/0x80
> > [<df30b270>] ? btrfs_balance+0x1820/0x1820
> > [<df0722db>] kthread+0x9b/0xb0
> > [<df799922>] ret_from_kernel_thread+0xe/0x24
> > [<df072240>] ? kthread_stop+0x100/0x100
> >---[ end trace f461faff989bf259 ]---
> >BTRFS: error (device dm-0) in cleanup_transaction:1854: errno=-5 IO failure
> >BTRFS info (device dm-0): delayed_refs has NO entry
> >
> 
> 
> 

-- 
"A mouse is a device used to point at the xterm you want to type in" - A.S.R.
Microsoft is to operating systems ....
                                      .... what McDonalds is to gourmet cooking
Home page: http://marc.merlins.org/  

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: btrfs support for filesystems >8TB on 32bit architectures
  2016-11-08 15:24                                               ` Marc MERLIN
@ 2016-11-09  1:50                                                 ` Qu Wenruo
  2016-11-09  2:05                                                   ` Marc MERLIN
  0 siblings, 1 reply; 67+ messages in thread
From: Qu Wenruo @ 2016-11-09  1:50 UTC (permalink / raw)
  To: Marc MERLIN; +Cc: David Sterba, Hugo Mills, linux-btrfs



At 11/08/2016 11:24 PM, Marc MERLIN wrote:
> On Tue, Nov 08, 2016 at 09:17:43AM +0800, Qu Wenruo wrote:
>>
>>
>> At 11/08/2016 09:06 AM, Marc MERLIN wrote:
>>> On Tue, Nov 08, 2016 at 08:43:34AM +0800, Qu Wenruo wrote:
>>>> That's strange, balance is done completely in kernel space.
>>>>
>>>> Unless we're calling a vfs_* function, we won't go through the extra check.
>>>>
>>>> What's the error reported?
>>>
>>> See below. Note however that it may be because btrfs receive messed up the
>>> filesystem first.
>>
>> If receive could easily screw up the fs, then fsstress could also screw up
>> btrfs easily.
>>
>> So I don't think that's the case. (Several years ago it was possible.)
>
> So now I'm even more confused. I put the array back in my 64bit system and
> check --repair comes back clean, but scrub does not. Is that supposed to be possible?

Yeah, quite possible!

The truth is, current btrfs check only checks:
1) Metadata
   (the --check-data-csum option will check data, but it still
   follows restriction 3)
2) Cross-references between metadata items (the contents of metadata)
3) The first good mirror/backup

So quite a lot of problems can't be detected by btrfs check:
1) Data corruption (csum mismatch)
2) 2nd mirror corruption (DUP/RAID0/10) or parity errors (RAID5/6)

For btrfsck to check all mirrors and data, you could try the out-of-tree
offline scrub patchset:
https://github.com/adam900710/btrfs-progs/tree/fsck_scrub

which implements the kernel scrub equivalent in btrfs-progs.
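
If you want to try it, a minimal build sketch (assuming the branch
builds like stock btrfs-progs; the exact name of the offline scrub
subcommand is not spelled out here, so check the built binary's help):

  git clone -b fsck_scrub https://github.com/adam900710/btrfs-progs.git
  cd btrfs-progs
  ./autogen.sh && ./configure && make   # standard btrfs-progs build
  ./btrfs --help | grep -i scrub        # locate the offline scrub entry point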

Thanks,
Qu

>
> gargamel:~# btrfs check -p --repair /dev/mapper/crypt_bcache2 2>&1 | tee /mnt/dshelf1/other/btrfs2
> enabling repair mode
> Checking filesystem on /dev/mapper/crypt_bcache2
> UUID: 6692cf4c-93d9-438c-ac30-5db6381dc4f2
> checking extents [.]
> Fixed 0 roots.
> cache and super generation don't match, space cache will be invalidated
> checking fs roots [o]
> checking csums
> checking root refs
> found 14622791987200 bytes used err is 0
> total csum bytes: 14200176492
> total tree bytes: 78239416320
> total fs tree bytes: 59524497408
> total extent tree bytes: 3236872192
> btree space waste bytes: 10068589919
> file data blocks allocated: 18101311373312
>  referenced 18038641020928
>
> Nov  8 06:55:40 gargamel kernel: [35631.988896] BTRFS error (device dm-6): bdev /dev/mapper/crypt_bcache2 errs: wr 0, rd 0, flush 0, corrupt 513, gen 0
> Nov  8 06:55:40 gargamel kernel: [35631.988897] BTRFS error (device dm-6): bdev /dev/mapper/crypt_bcache2 errs: wr 0, rd 0, flush 0, corrupt 514, gen 0
> Nov  8 06:55:40 gargamel kernel: [35631.988899] BTRFS warning (device dm-6): checksum error at logical 27885961216 on dev /dev/mapper/crypt_bcache2, sector 56578304, root 9461, inode 45837, offset 15459172352, length 4096, links 1 (path: system/mlocate/mlocate.db)
> Nov  8 06:55:40 gargamel kernel: [35631.988900] BTRFS error (device dm-6): bdev /dev/mapper/crypt_bcache2 errs: wr 0, rd 0, flush 0, corrupt 515, gen 0
> Nov  8 06:55:40 gargamel kernel: [35631.988903] BTRFS warning (device dm-6): checksum error at logical 27887534080 on dev /dev/mapper/crypt_bcache2, sector 56581376, root 9461, inode 45837, offset 15460745216, length 4096, links 1 (path: system/mlocate/mlocate.db)
> Nov  8 06:55:40 gargamel kernel: [35631.988904] BTRFS error (device dm-6): unable to fixup (regular) error at logical 27887009792 on dev /dev/mapper/crypt_bcache2
> Nov  8 06:55:40 gargamel kernel: [35631.988905] BTRFS error (device dm-6): unable to fixup (regular) error at logical 27886878720 on dev /dev/mapper/crypt_bcache2
> Nov  8 06:55:40 gargamel kernel: [35631.988906] BTRFS error (device dm-6): bdev /dev/mapper/crypt_bcache2 errs: wr 0, rd 0, flush 0, corrupt 516, gen 0
> Nov  8 06:55:40 gargamel kernel: [35631.988907] BTRFS error (device dm-6): unable to fixup (regular) error at logical 27887837184 on dev /dev/mapper/crypt_bcache2
> Nov  8 06:55:40 gargamel kernel: [35631.988908] BTRFS error (device dm-6): bdev /dev/mapper/crypt_bcache2 errs: wr 0, rd 0, flush 0, corrupt 517, gen 0
> Nov  8 06:55:40 gargamel kernel: [35631.988909] BTRFS error (device dm-6): bdev /dev/mapper/crypt_bcache2 errs: wr 0, rd 0, flush 0, corrupt 518, gen 0
> Nov  8 06:55:40 gargamel kernel: [35631.988910] BTRFS error (device dm-6): unable to fixup (regular) error at logical 27885830144 on dev /dev/mapper/crypt_bcache2
> Nov  8 06:55:40 gargamel kernel: [35631.988911] BTRFS error (device dm-6): unable to fixup (regular) error at logical 27885961216 on dev /dev/mapper/crypt_bcache2
> Nov  8 06:55:40 gargamel kernel: [35631.988912] BTRFS error (device dm-6): unable to fixup (regular) error at logical 27887534080 on dev /dev/mapper/crypt_bcache2
> Nov  8 06:55:40 gargamel kernel: [35631.988882] BTRFS warning (device dm-6): checksum error at logical 27887403008 on dev /dev/mapper/crypt_bcache2, sector 56581120, root 9461, inode 45837, offset 15460614144, length 4096, links 1 (path: system/mlocate/mlocate.db)
> Nov  8 06:55:40 gargamel kernel: [35631.988885] BTRFS warning (device dm-6): checksum error at logical 27887009792 on dev /dev/mapper/crypt_bcache2, sector 56580352, root 9461, inode 45837, offset 15460220928, length 4096, links 1 (path: system/mlocate/mlocate.db)
> Nov  8 06:55:40 gargamel kernel: [35631.988887] BTRFS warning (device dm-6): checksum error at logical 27886878720 on dev /dev/mapper/crypt_bcache2, sector 56580096, root 9461, inode 45837, offset 15460089856, length 4096, links 1 (path: system/mlocate/mlocate.db)
> Nov  8 06:55:40 gargamel kernel: [35631.988890] BTRFS warning (device dm-6): checksum error at logical 27887837184 on dev /dev/mapper/crypt_bcache2, sector 56581968, root 9461, inode 45837, offset 15461048320, length 4096, links 1 (path: system/mlocate/mlocate.db)
> Nov  8 06:55:40 gargamel kernel: [35631.988895] BTRFS warning (device dm-6): checksum error at logical 27885830144 on dev /dev/mapper/crypt_bcache2, sector 56578048, root 9461, inode 45837, offset 15459041280, length 4096, links 1 (path: system/mlocate/mlocate.db)
>
>
>
>>>
>>> BTRFS info (device dm-0): use zlib compression
>>> BTRFS info (device dm-0): disk space caching is enabled
>>> BTRFS info (device dm-0): has skinny extents
>>> BTRFS info (device dm-0): bdev /dev/mapper/crypt_bcache0 errs: wr 0, rd 0,
>>> flush 0, corrupt 512, gen 0
>>> BTRFS info (device dm-0): detected SSD devices, enabling SSD mode
>>> BTRFS info (device dm-0): continuing balance
>>> BTRFS info (device dm-0): The free space cache file (1593999097856) is
>>> invalid. skip it
>>>
>>> BTRFS info (device dm-0): The free space cache file (1671308509184) is
>>> invalid. skip it
>>>
>>> BTRFS info (device dm-0): relocating block group 13835461197824 flags 34
>>> ------------[ cut here ]------------
>>> WARNING: CPU: 0 PID: 22825 at fs/btrfs/disk-io.c:520
>>> btree_csum_one_bio.isra.39+0xf7/0x100
>>
>> Dirty tree block's bytenr doesn't match with page's logical.
>> It seems that the tree block is not up-to-date, maybe corrupted.
>>
>> Seems not related to the 8T limit.
>>
>> Could you please add pr_info() to print out the 'found_start' and 'start'?
>> Also I'm not familiar with this code, the number may has a clue to show
>> what's going wrong.
>>
>> Thanks,
>> Qu
>>
>>> Modules linked in: bcache configs rc_hauppauge ir_kbd_i2c
>>> cpufreq_userspace cpufreq_powersave cpufreq_conservative autofs4
>>> snd_hda_codec_hdmi joydev snd_hda_codec_realtek snd_hda_codec_generic
>>> tuner_simple tuner_types tda9887 snd_hda_intel snd_hda_codec snd_hda_core
>>> snd_hwdep tda8290 coretemp snd_pcm_oss snd_mixer_oss tuner snd_pcm msp3400
>>> snd_seq_midi snd_seq_midi_event firewire_sbp2 saa7127 snd_rawmidi
>>> hwmon_vid dm_crypt dm_mod saa7115 snd_seq bttv hid_generic snd_seq_device
>>> snd_timer ehci_pci ivtv tea575x videobuf_dma_sg rc_core videobuf_core
>>> input_leds tveeprom cx2341x v4l2_common ehci_hcd videodev media
>>> acpi_cpufreq tpm_tis tpm_tis_core gpio_ich snd soundcore tpm psmouse
>>> lpc_ich evdev asus_atk0110 serio_raw lp parport raid456 async_raid6_recov
>>> async_pq async_xor async_memcpy async_tx multipath usbhid hid sr_mod cdrom
>>> sg firewire_ohci firewire_core floppy crc_itu_t i915 atl1 fjes mii
>>> uhci_hcd usbcore usb_common
>>> CPU: 0 PID: 22825 Comm: kworker/u9:2 Tainted: G        W
>>> 4.8.5-ia32-20161028 #2
>>> Hardware name: System manufacturer P5E-VM HDMI/P5E-VM HDMI, BIOS 0604
>>> 07/16/2008
>>> Workqueue: btrfs-worker-high btrfs_worker_helper
>>> 00200286 00200286 d3d81e48 df414827 00000000 dfa12da5 d3d81e78 df05677a
>>> df9ed884 00000000 00005929 dfa12da5 00000208 df2cf067 00000208 f7463fa0
>>> f401a080 00000000 d3d81e8c df05684a 00000009 00000000 00000000 d3d81eb4
>>> Call Trace:
>>> [<df414827>] dump_stack+0x58/0x81
>>> [<df05677a>] __warn+0xea/0x110
>>> [<df2cf067>] ? btree_csum_one_bio.isra.39+0xf7/0x100
>>> [<df05684a>] warn_slowpath_null+0x2a/0x30
>>> [<df2cf067>] btree_csum_one_bio.isra.39+0xf7/0x100
>>> [<df2cf085>] __btree_submit_bio_start+0x15/0x20
>>> [<df2cdd10>] run_one_async_start+0x30/0x40
>>> [<df31286d>] btrfs_scrubparity_helper+0xcd/0x2d0
>>> [<df2cde70>] ? run_one_async_free+0x20/0x20
>>> [<df312bbd>] btrfs_worker_helper+0xd/0x10
>>> [<df06d05b>] process_one_work+0x10b/0x400
>>> [<df06d387>] worker_thread+0x37/0x4b0
>>> [<df06d350>] ? process_one_work+0x400/0x400
>>> [<df0722db>] kthread+0x9b/0xb0
>>> [<df799922>] ret_from_kernel_thread+0xe/0x24
>>> [<df072240>] ? kthread_stop+0x100/0x100
>>> ---[ end trace f461faff989bf258 ]---
>>> BTRFS: error (device dm-0) in btrfs_commit_transaction:2232: errno=-5 IO
>>> failure (Error while writing out transaction)
>>> BTRFS info (device dm-0): forced readonly
>>> BTRFS warning (device dm-0): Skipping commit of aborted transaction.
>>> ------------[ cut here ]------------
>>> WARNING: CPU: 0 PID: 22318 at fs/btrfs/transaction.c:1854
>>> btrfs_commit_transaction+0x2f5/0xcc0
>>> BTRFS: Transaction aborted (error -5)
>>> Modules linked in: bcache configs rc_hauppauge ir_kbd_i2c
>>> cpufreq_userspace cpufreq_powersave cpufreq_conservative autofs4
>>> snd_hda_codec_hdmi joydev snd_hda_codec_realtek snd_hda_codec_generic
>>> tuner_simple tuner_types tda9887 snd_hda_intel snd_hda_codec snd_hda_core
>>> snd_hwdep tda8290 coretemp snd_pcm_oss snd_mixer_oss tuner snd_pcm msp3400
>>> snd_seq_midi snd_seq_midi_event firewire_sbp2 saa7127 snd_rawmidi
>>> hwmon_vid dm_crypt dm_mod saa7115 snd_seq bttv hid_generic snd_seq_device
>>> snd_timer ehci_pci ivtv tea575x videobuf_dma_sg rc_core videobuf_core
>>> input_leds tveeprom cx2341x v4l2_common ehci_hcd videodev media
>>> acpi_cpufreq tpm_tis tpm_tis_core gpio_ich snd soundcore tpm psmouse
>>> lpc_ich evdev asus_atk0110 serio_raw lp parport raid456 async_raid6_recov
>>> async_pq async_xor async_memcpy async_tx multipath usbhid hid sr_mod cdrom
>>> sg firewire_ohci firewire_core floppy crc_itu_t i915 atl1 fjes mii
>>> uhci_hcd usbcore usb_common
>>> CPU: 0 PID: 22318 Comm: btrfs-balance Tainted: G        W
>>> 4.8.5-ia32-20161028 #2
>>> Hardware name: System manufacturer P5E-VM HDMI/P5E-VM HDMI, BIOS 0604
>>> 07/16/2008
>>> 00000286 00000286 d74a3ca4 df414827 d74a3ce8 dfa132ab d74a3cd4 df05677a
>>> dfa075cc d74a3d04 0000572e dfa132ab 0000073e df2d7de5 0000073e f698dc00
>>> e9173e70 fffffffb d74a3cf0 df0567db 00000009 00000000 d74a3ce8 dfa075cc
>>> Call Trace:
>>> [<df414827>] dump_stack+0x58/0x81
>>> [<df05677a>] __warn+0xea/0x110
>>> [<df2d7de5>] ? btrfs_commit_transaction+0x2f5/0xcc0
>>> [<df0567db>] warn_slowpath_fmt+0x3b/0x40
>>> [<df2d7de5>] btrfs_commit_transaction+0x2f5/0xcc0
>>> [<df096800>] ? prepare_to_wait_event+0xd0/0xd0
>>> [<df33334f>] prepare_to_relocate+0x12f/0x180
>>> [<df339a41>] relocate_block_group+0x31/0x790
>>> [<df0b1427>] ? vprintk_default+0x37/0x40
>>> [<df796ca0>] ? mutex_lock+0x10/0x30
>>> [<df2f8f45>] ? btrfs_wait_ordered_roots+0x1d5/0x1f0
>>> [<df14eed6>] ? printk+0x17/0x19
>>> [<df2a47b2>] ? btrfs_printk+0x102/0x110
>>> [<df33a388>] btrfs_relocate_block_group+0x1e8/0x2e0
>>> [<df308a9f>] btrfs_relocate_chunk.isra.29+0x3f/0xf0
>>> [<df30221f>] ? free_extent_buffer+0x4f/0xa0
>>> [<df30a555>] btrfs_balance+0xb05/0x1820
>>> [<df0b0afa>] ? console_unlock+0x40a/0x630
>>> [<df30b2c1>] balance_kthread+0x51/0x80
>>> [<df30b270>] ? btrfs_balance+0x1820/0x1820
>>> [<df0722db>] kthread+0x9b/0xb0
>>> [<df799922>] ret_from_kernel_thread+0xe/0x24
>>> [<df072240>] ? kthread_stop+0x100/0x100
>>> ---[ end trace f461faff989bf259 ]---
>>> BTRFS: error (device dm-0) in cleanup_transaction:1854: errno=-5 IO failure
>>> BTRFS info (device dm-0): delayed_refs has NO entry
>>>
>>
>>
>>
>



^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: btrfs support for filesystems >8TB on 32bit architectures
  2016-11-09  1:50                                                 ` Qu Wenruo
@ 2016-11-09  2:05                                                   ` Marc MERLIN
  2016-11-11  3:48                                                     ` Marc MERLIN
  0 siblings, 1 reply; 67+ messages in thread
From: Marc MERLIN @ 2016-11-09  2:05 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: David Sterba, Hugo Mills, linux-btrfs

On Wed, Nov 09, 2016 at 09:50:08AM +0800, Qu Wenruo wrote:
> Yeah, quite possible!
> 
> The truth is, current btrfs check only checks:
> 1) Metadata
>    (the --check-data-csum option will check data, but it still
>    follows restriction 3)
> 2) Cross-references between metadata items (the contents of metadata)
> 3) The first good mirror/backup
> 
> So quite a lot of problems can't be detected by btrfs check:
> 1) Data corruption (csum mismatch)
> 2) 2nd mirror corruption (DUP/RAID0/10) or parity errors (RAID5/6)
> 
> For btrfsck to check all mirrors and data, you could try the out-of-tree
> offline scrub patchset:
> https://github.com/adam900710/btrfs-progs/tree/fsck_scrub
> 
> which implements the kernel scrub equivalent in btrfs-progs.

I see, thanks for the answer.
Note that this is very confusing to the end user.
If check --repair returns success, the filesystem should be clean.
Hopefully that patchset can be included in btrfs-progs

But sure enough, I'm seeing a lot of these:
BTRFS warning (device dm-6): checksum error at logical 269783986176 on dev /dev/mapper/crypt_bcache2, sector 529035384, root 16755, inode 1225897, offset 77824, length 4096, links 5 (path: magic/20150624/home/merlin/public_html/rig3/img/thumb800_302_1-Wire.jpg)

This is bad, because I would expect check --repair to find them all and
either offer to remove all the corrupted files after giving me a list of
what I've lost, or recompute the checksums so the files are known to be
corrupted but "clean", leaving me the option of keeping them as-is
(ok-ish for a video file) or restoring them from backup.

The worst part with scrub is that I have to find all these files, then
find all the snapshots they're in (maybe 10 or 20) and delete them from
each one. Some of those snapshots are read-only because they are btrfs
send sources, so I have to destroy those snapshots, lose my btrfs send
relationship, and recreate it from scratch (maybe 2 to 6 days of syncing
over a slow-ish link); see the sketch below.
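
For reference, this is roughly the manual loop I end up writing (just a
sketch: the file path is illustrative, standing in for whatever the
scrub log reports, and I'm assuming btrfs property set is how you flip
the ro flag):

  bad=path/reported/by/scrub.jpg               # illustrative, from the scrub log
  for snap in /mnt/mnt/DS2/backup/*; do
      [ -e "$snap/$bad" ] || continue
      btrfs property set -ts "$snap" ro false  # read-only snapshots refuse rm
      rm -f "$snap/$bad"
      btrfs property set -ts "$snap" ro true   # no longer a safe btrfs send parent
  done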

When data is corrupted, no solution is perfect, but hopefully check --repair
will indeed be able to restore the entire filesystem to a clean state, even
if some data must be lost in the process.

Thanks for considering.

Marc

-- 
"A mouse is a device used to point at the xterm you want to type in" - A.S.R.
Microsoft is to operating systems ....
                                      .... what McDonalds is to gourmet cooking
Home page: http://marc.merlins.org/  

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: btrfs support for filesystems >8TB on 32bit architectures
  2016-11-09  2:05                                                   ` Marc MERLIN
@ 2016-11-11  3:48                                                     ` Marc MERLIN
  2016-11-11  3:55                                                       ` Qu Wenruo
  0 siblings, 1 reply; 67+ messages in thread
From: Marc MERLIN @ 2016-11-11  3:48 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: David Sterba, Hugo Mills, linux-btrfs

On Tue, Nov 08, 2016 at 06:05:19PM -0800, Marc MERLIN wrote:
> On Wed, Nov 09, 2016 at 09:50:08AM +0800, Qu Wenruo wrote:
> > Yeah, quite possible!
> > 
> > The truth is, current btrfs check only checks:
> > 1) Metadata
> >    (the --check-data-csum option will check data, but it still
> >    follows restriction 3)
> > 2) Cross-references between metadata items (the contents of metadata)
> > 3) The first good mirror/backup
> > 
> > So quite a lot of problems can't be detected by btrfs check:
> > 1) Data corruption (csum mismatch)
> > 2) 2nd mirror corruption (DUP/RAID0/10) or parity errors (RAID5/6)
> > 
> > For btrfsck to check all mirrors and data, you could try the out-of-tree
> > offline scrub patchset:
> > https://github.com/adam900710/btrfs-progs/tree/fsck_scrub
> > 
> > which implements the kernel scrub equivalent in btrfs-progs.
> 
> I see, thanks for the answer.
> Note that this is very confusing to the end user.
> If check --repair returns success, the filesystem should be clean.
> Hopefully that patchset can be included in btrfs-progs
> 
> But sure enough, I'm seeing a lot of these:
> BTRFS warning (device dm-6): checksum error at logical 269783986176 on dev /dev/mapper/crypt_bcache2, sector 529035384, root 16755, inode 1225897, offset 77824, length 4096, links 5 (path: magic/20150624/home/merlin/public_html/rig3/img/thumb800_302_1-Wire.jpg)

So, I ran check --repair, then I ran scrub and deleted all the files
that were referenced by pathname and failed scrub.
Now I have this:
BTRFS error (device dm-6): unable to fixup (regular) error at logical 269785128960 on dev /dev/mapper/crypt_bcache2
BTRFS error (device dm-6): bdev /dev/mapper/crypt_bcache2 errs: wr 0, rd 0, flush 0, corrupt 1545, gen 0
BTRFS error (device dm-6): unable to fixup (regular) error at logical 269785133056 on dev /dev/mapper/crypt_bcache2
BTRFS error (device dm-6): bdev /dev/mapper/crypt_bcache2 errs: wr 0, rd 0, flush 0, corrupt 1546, gen 0
BTRFS error (device dm-6): unable to fixup (regular) error at logical 269785137152 on dev /dev/mapper/crypt_bcache2
BTRFS warning (device dm-6): checksum error at logical 269784580096 on dev /dev/mapper/crypt_bcache2, sector 529036544, root 17564, inode 1225903, offset 16384: path resolving failed with ret=-2
BTRFS warning (device dm-6): checksum error at logical 269784584192 on dev /dev/mapper/crypt_bcache2, sector 529036552, root 17564, inode 1225903, offset 20480: path resolving failed with ret=-2
BTRFS warning (device dm-6): checksum error at logical 269784588288 on dev /dev/mapper/crypt_bcache2, sector 529036560, root 17564, inode 1225903, offset 24576: path resolving failed with ret=-2
BTRFS warning (device dm-6): checksum error at logical 269784592384 on dev /dev/mapper/crypt_bcache2, sector 529036568, root 17564, inode 1225903, offset 28672: path resolving failed with ret=-2
BTRFS warning (device dm-6): checksum error at logical 269784596480 on dev /dev/mapper/crypt_bcache2, sector 529036576, root 17564, inode 1225903, offset 32768: path resolving failed with ret=-2
BTRFS warning (device dm-6): checksum error at logical 269784600576 on dev /dev/mapper/crypt_bcache2, sector 529036584, root 17564, inode 1225903, offset 36864: path resolving failed with ret=-2
BTRFS warning (device dm-6): checksum error at logical 269784604672 on dev /dev/mapper/crypt_bcache2, sector 529036592, root 17564, inode 1225903, offset 40960: path resolving failed with ret=-2
BTRFS warning (device dm-6): checksum error at logical 269784608768 on dev /dev/mapper/crypt_bcache2, sector 529036600, root 17564, inode 1225903, offset 45056: path resolving failed with ret=-2
BTRFS warning (device dm-6): checksum error at logical 269784612864 on dev /dev/mapper/crypt_bcache2, sector 529036608, root 17564, inode 1225903, offset 49152: path resolving failed with ret=-2

How am I supposed to deal with those?

Thanks,
Marc
-- 
"A mouse is a device used to point at the xterm you want to type in" - A.S.R.
Microsoft is to operating systems ....
                                      .... what McDonalds is to gourmet cooking
Home page: http://marc.merlins.org/                         | PGP 1024R/763BE901

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: btrfs support for filesystems >8TB on 32bit architectures
  2016-11-11  3:48                                                     ` Marc MERLIN
@ 2016-11-11  3:55                                                       ` Qu Wenruo
  2016-11-12  3:17                                                         ` when btrfs scrub reports errors and btrfs check --repair does not Marc MERLIN
  0 siblings, 1 reply; 67+ messages in thread
From: Qu Wenruo @ 2016-11-11  3:55 UTC (permalink / raw)
  To: Marc MERLIN; +Cc: David Sterba, Hugo Mills, linux-btrfs



At 11/11/2016 11:48 AM, Marc MERLIN wrote:
> On Tue, Nov 08, 2016 at 06:05:19PM -0800, Marc MERLIN wrote:
>> On Wed, Nov 09, 2016 at 09:50:08AM +0800, Qu Wenruo wrote:
>>> Yeah, quite possible!
>>>
>>> The truth is, current btrfs check only checks:
>>> 1) Metadata
>>>    (the --check-data-csum option will check data, but it still
>>>    follows restriction 3)
>>> 2) Cross-references between metadata items (the contents of metadata)
>>> 3) The first good mirror/backup
>>>
>>> So quite a lot of problems can't be detected by btrfs check:
>>> 1) Data corruption (csum mismatch)
>>> 2) 2nd mirror corruption (DUP/RAID0/10) or parity errors (RAID5/6)
>>>
>>> For btrfsck to check all mirrors and data, you could try the out-of-tree
>>> offline scrub patchset:
>>> https://github.com/adam900710/btrfs-progs/tree/fsck_scrub
>>>
>>> which implements the kernel scrub equivalent in btrfs-progs.
>>
>> I see, thanks for the answer.
>> Note that this is very confusing to the end user.
>> If check --repair returns success, the filesystem should be clean.
>> Hopefully that patchset can be included in btrfs-progs
>>
>> But sure enough, I'm seeing a lot of these:
>> BTRFS warning (device dm-6): checksum error at logical 269783986176 on dev /dev/mapper/crypt_bcache2, sector 529035384, root 16755, inode 1225897, offset 77824, length 4096, links 5 (path: magic/20150624/home/merlin/public_html/rig3/img/thumb800_302_1-Wire.jpg)
>
> So, I ran check --repair, then I ran scrub and deleted all the files
> that were referenced by pathname and failed scrub.
> Now I have this:
> BTRFS error (device dm-6): unable to fixup (regular) error at logical 269785128960 on dev /dev/mapper/crypt_bcache2
> BTRFS error (device dm-6): bdev /dev/mapper/crypt_bcache2 errs: wr 0, rd 0, flush 0, corrupt 1545, gen 0
> BTRFS error (device dm-6): unable to fixup (regular) error at logical 269785133056 on dev /dev/mapper/crypt_bcache2
> BTRFS error (device dm-6): bdev /dev/mapper/crypt_bcache2 errs: wr 0, rd 0, flush 0, corrupt 1546, gen 0
> BTRFS error (device dm-6): unable to fixup (regular) error at logical 269785137152 on dev /dev/mapper/crypt_bcache2
> BTRFS warning (device dm-6): checksum error at logical 269784580096 on dev /dev/mapper/crypt_bcache2, sector 529036544, root 17564, inode 1225903, offset 16384: path resolving failed with ret=-2
> BTRFS warning (device dm-6): checksum error at logical 269784584192 on dev /dev/mapper/crypt_bcache2, sector 529036552, root 17564, inode 1225903, offset 20480: path resolving failed with ret=-2
> BTRFS warning (device dm-6): checksum error at logical 269784588288 on dev /dev/mapper/crypt_bcache2, sector 529036560, root 17564, inode 1225903, offset 24576: path resolving failed with ret=-2
> BTRFS warning (device dm-6): checksum error at logical 269784592384 on dev /dev/mapper/crypt_bcache2, sector 529036568, root 17564, inode 1225903, offset 28672: path resolving failed with ret=-2
> BTRFS warning (device dm-6): checksum error at logical 269784596480 on dev /dev/mapper/crypt_bcache2, sector 529036576, root 17564, inode 1225903, offset 32768: path resolving failed with ret=-2
> BTRFS warning (device dm-6): checksum error at logical 269784600576 on dev /dev/mapper/crypt_bcache2, sector 529036584, root 17564, inode 1225903, offset 36864: path resolving failed with ret=-2
> BTRFS warning (device dm-6): checksum error at logical 269784604672 on dev /dev/mapper/crypt_bcache2, sector 529036592, root 17564, inode 1225903, offset 40960: path resolving failed with ret=-2
> BTRFS warning (device dm-6): checksum error at logical 269784608768 on dev /dev/mapper/crypt_bcache2, sector 529036600, root 17564, inode 1225903, offset 45056: path resolving failed with ret=-2
> BTRFS warning (device dm-6): checksum error at logical 269784612864 on dev /dev/mapper/crypt_bcache2, sector 529036608, root 17564, inode 1225903, offset 49152: path resolving failed with ret=-2
>
> How am I supposed to deal with those?

These seem to be orphan inodes.
Btrfs doesn't remove all the contents of an inode at rm time.
It just unlinks the inode and puts it into a state called an orphan
inode (one that can't be reached from any directory).

It then frees their data extents over the next several transactions.

Try to find these inodes by inode number in the specified subvolume.
If they are not found, they are orphan inodes and nothing to worry about.
The stale data extents will disappear sooner or later.

Or you can use "btrfs fi sync" to make sure the orphan inodes are really
removed from the tree.
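
Something like this (just a sketch; root 17564 and inode 1225903 are
taken from your scrub warnings, and the "btrfs subvolume list" output
format is assumed to be "ID <id> gen <gen> top level <n> path <path>"):

  # map the root id from the scrub warning to its subvolume path
  sub=$(btrfs subvolume list /mnt/mnt | awk '$2 == 17564 {print $NF}')
  # search only inside that subvolume; no hit means an orphan inode
  find "/mnt/mnt/$sub" -xdev -inum 1225903
  # force a commit so orphan cleanup can finish
  btrfs filesystem sync /mnt/mnt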

Thanks,
Qu
>
> Thanks,
> Marc
>



^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: when btrfs scrub reports errors and btrfs check --repair does not
  2016-11-11  3:55                                                       ` Qu Wenruo
@ 2016-11-12  3:17                                                         ` Marc MERLIN
  2016-11-13 15:06                                                           ` Marc MERLIN
  0 siblings, 1 reply; 67+ messages in thread
From: Marc MERLIN @ 2016-11-12  3:17 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: David Sterba, Hugo Mills, linux-btrfs

On Fri, Nov 11, 2016 at 11:55:21AM +0800, Qu Wenruo wrote:
> These seem to be orphan inodes.
> Btrfs doesn't remove all the contents of an inode at rm time.
> It just unlinks the inode and puts it into a state called an orphan
> inode (one that can't be reached from any directory).

BTRFS warning (device dm-6): checksum error at logical 269783928832 on dev /dev/mapper/crypt_bcache2, sector 529035272, root 17564, inode 1225897, offset 20480: path resolving failed with ret=-2
BTRFS warning (device dm-6): checksum error at logical 269783932928 on dev /dev/mapper/crypt_bcache2, sector 529035280, root 17564, inode 1225897, offset 24576: path resolving failed with ret=-2
 
Do you mean I should be using find /mnt/mnt -inum?
Well, how about that, you're right:
gargamel:/mnt/mnt/DS2/backup# find /mnt/mnt -inum 1225897
/mnt/mnt/DS2/backup/debian64_rw.20160713_03:21:57/gandalfthegreat/20120409/home/merlin/public_html/mirrors/rwf/vfrcharts/x9y678z6.jpg
So basically the breakage in my filesystem is enough that the backlink
from the inode to the pathname is gone? That's not good :-/

> It then frees their data extents over the next several transactions.
> 
> Try to find these inodes by inode number in the specified subvolume.
> If they are not found, they are orphan inodes and nothing to worry about.
> The stale data extents will disappear sooner or later.
> 
> Or you can use "btrfs fi sync" to make sure the orphan inodes are really
> removed from the tree.
 
So, I ran btrfs fi sync /mnt/mnt, but it returned instantly.

A scrub after that still returns:
btrfs scrub start -Bd /mnt/mnt
BTRFS error (device dm-6): bdev /dev/mapper/crypt_bcache2 errs: wr 0, rd 0, flush 0, corrupt 1793, gen 0
BTRFS error (device dm-6): unable to fixup (regular) error at logical 269785628672 on dev /dev/mapper/crypt_bcache2
BTRFS error (device dm-6): bdev /dev/mapper/crypt_bcache2 errs: wr 0, rd 0, flush 0, corrupt 1794, gen 0
BTRFS error (device dm-6): unable to fixup (regular) error at logical 269784580096 on dev /dev/mapper/crypt_bcache2
BTRFS error (device dm-6): bdev /dev/mapper/crypt_bcache2 errs: wr 0, rd 0, flush 0, corrupt 1795, gen 0
BTRFS error (device dm-6): unable to fixup (regular) error at logical 269785632768 on dev /dev/mapper/crypt_bcache2
BTRFS error (device dm-6): bdev /dev/mapper/crypt_bcache2 errs: wr 0, rd 0, flush 0, corrupt 1796, gen 0
BTRFS error (device dm-6): unable to fixup (regular) error at logical 269785104384 on dev /dev/mapper/crypt_bcache2
BTRFS error (device dm-6): bdev /dev/mapper/crypt_bcache2 errs: wr 0, rd 0, flush 0, corrupt 1797, gen 0
BTRFS error (device dm-6): unable to fixup (regular) error at logical 269784584192 on dev /dev/mapper/crypt_bcache2
BTRFS error (device dm-6): bdev /dev/mapper/crypt_bcache2 errs: wr 0, rd 0, flush 0, corrupt 1798, gen 0
BTRFS error (device dm-6): unable to fixup (regular) error at logical 269785636864 on dev /dev/mapper/crypt_bcache2
BTRFS error (device dm-6): bdev /dev/mapper/crypt_bcache2 errs: wr 0, rd 0, flush 0, corrupt 1799, gen 0
BTRFS error (device dm-6): unable to fixup (regular) error at logical 269785108480 on dev /dev/mapper/crypt_bcache2
BTRFS error (device dm-6): bdev /dev/mapper/crypt_bcache2 errs: wr 0, rd 0, flush 0, corrupt 1800, gen 0
BTRFS error (device dm-6): unable to fixup (regular) error at logical 269784588288 on dev /dev/mapper/crypt_bcache2
BTRFS error (device dm-6): bdev /dev/mapper/crypt_bcache2 errs: wr 0, rd 0, flush 0, corrupt 1801, gen 0
BTRFS error (device dm-6): unable to fixup (regular) error at logical 269784055808 on dev /dev/mapper/crypt_bcache2
BTRFS error (device dm-6): bdev /dev/mapper/crypt_bcache2 errs: wr 0, rd 0, flush 0, corrupt 1802, gen 0
BTRFS error (device dm-6): unable to fixup (regular) error at logical 269785640960 on dev /dev/mapper/crypt_bcache2

What am I supposed to do about these? I'm not even clear where this
corruption is located or how to clear it.

I understand you're saying that this does not seem to affect any
remaining data, but if scrub is not clean, can't even see which file an
inode is linked to, and that inode doesn't get cleaned up 2 days later,
then my filesystem is in a bad state that check --repair should fix, is
it not?

Yes, I can wipe it and start over, but I'm trying to use this as a
learning experience as well as seeing if the tools are working as they
should.

Thanks,
Marc
-- 
"A mouse is a device used to point at the xterm you want to type in" - A.S.R.
Microsoft is to operating systems ....
                                      .... what McDonalds is to gourmet cooking
Home page: http://marc.merlins.org/                         | PGP 1024R/763BE901

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: when btrfs scrub reports errors and btrfs check --repair does not
  2016-11-12  3:17                                                         ` when btrfs scrub reports errors and btrfs check --repair does not Marc MERLIN
@ 2016-11-13 15:06                                                           ` Marc MERLIN
  2016-11-13 15:13                                                             ` Roman Mamedov
  0 siblings, 1 reply; 67+ messages in thread
From: Marc MERLIN @ 2016-11-13 15:06 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: David Sterba, Hugo Mills, linux-btrfs

On Fri, Nov 11, 2016 at 07:17:08PM -0800, Marc MERLIN wrote:
> On Fri, Nov 11, 2016 at 11:55:21AM +0800, Qu Wenruo wrote:
> > These seem to be orphan inodes.
> > Btrfs doesn't remove all the contents of an inode at rm time.
> > It just unlinks the inode and puts it into a state called an orphan
> > inode (one that can't be reached from any directory).
> 
> BTRFS warning (device dm-6): checksum error at logical 269783928832 on dev /dev/mapper/crypt_bcache2, sector 529035272, root 17564, inode 1225897, offset 20480: path resolving failed with ret=-2
> BTRFS warning (device dm-6): checksum error at logical 269783932928 on dev /dev/mapper/crypt_bcache2, sector 529035280, root 17564, inode 1225897, offset 24576: path resolving failed with ret=-2
>  
> Do you mean I should be using find /mnt/mnt -inum ?
> Well, how about that, you're right:
> gargamel:/mnt/mnt/DS2/backup# find /mnt/mnt -inum 1225897
> /mnt/mnt/DS2/backup/debian64_rw.20160713_03:21:57/gandalfthegreat/20120409/home/merlin/public_html/mirrors/rwf/vfrcharts/x9y678z6.jpg
> So basically the breakage in my filesystem is enough that the backlink
> from the inode to the pathname is gone? That's not good :-/

Mmmn, I've been doing find -inum, deleting the hits, and re-running
scrub, but scrub still fails with more errors, and now I'm seeing this:

gargamel:~# find /mnt/mnt -inum 1225897

/mnt/mnt/DS2/backup/ubuntu_rw.20160713_03:25:42/gandalfthegrey/20100718/var/local/www/Pix/albums/Trips/200509_Malaysia/500_KapalaiIsland/BestOf/33_Diving-Dive5-2_139.jpg
/mnt/mnt/DS2/backup/debian64_ro.20160720_02:58:38/gandalfthegreat/20120409/home/merlin/public_html/mirrors/rwf/vfrcharts/x9y678z6.jpg
/mnt/mnt/DS2/backup/debian64_ro.20160720_02:58:38/gandalfthegreat/20120409/home/merlin/public_html/mirrors/rwf/vfrcharts/x9y679z6.jpg
(...)
/mnt/mnt/DS2/backup/debian64_rw.20160727_02:59:03/gandalfthegreat/20120409/home/merlin/public_html/mirrors/rwf/vfrcharts/x9y678z6.jpg
/mnt/mnt/DS2/backup/debian64_rw.20160727_02:59:03/gandalfthegreat/20120409/home/merlin/public_html/mirrors/rwf/vfrcharts/x9y679z6.jpg
/mnt/mnt/DS2/backup/debian64_rw.20160727_02:59:03/gandalfthegreat/20120409/home/merlin/public_html/mirrors/rwf/vfrcharts/x9y81z9.jpg

And then I see this:
gargamel:~# ls -li /mnt/mnt/DS2/backup/ubuntu_rw.20160713_03:25:42/gandalfthegrey/20100718/var/local/www/Pix/albums/Trips/200509_Malaysia/500_KapalaiIsland/BestOf/33_Diving-Dive5-2_139.jpg /mnt/mnt/DS2/backup/debian64_ro.20160720_02:58:38/gandalfthegreat/20120409/home/merlin/public_html/mirrors/rwf/vfrcharts/x9y678z6.jpg /mnt/mnt/DS2/backup/debian64_ro.20160720_02:58:38/gandalfthegreat/20120409/home/merlin/public_html/mirrors/rwf/vfrcharts/x9y679z6.jpg /mnt/mnt/DS2/backup/debian64_rw.20160727_02:59:03/gandalfthegreat/20120409/home/merlin/public_html/mirrors/rwf/vfrcharts/x9y678z6.jpg /mnt/mnt/DS2/backup/debian64_rw.20160727_02:59:03/gandalfthegreat/20120409/home/merlin/public_html/mirrors/rwf/vfrcharts/x9y679z6.jpg /mnt/mnt/DS2/backup/debian64_rw.20160727_02:59:03/gandalfthegreat/20120409/home/merlin/public_html/mirrors/rwf/vfrcharts/x9y81z9.jpg
1225897 -rw-r--r-- 5 merlin merlin 13794 Jan  7  2012 /mnt/mnt/DS2/backup/debian64_ro.20160720_02:58:38/gandalfthegreat/20120409/home/merlin/public_html/mirrors/rwf/vfrcharts/x9y678z6.jpg
1225898 -rw-r--r-- 5 merlin merlin 13048 Jan  7  2012 /mnt/mnt/DS2/backup/debian64_ro.20160720_02:58:38/gandalfthegreat/20120409/home/merlin/public_html/mirrors/rwf/vfrcharts/x9y679z6.jpg
1225897 -rw-r--r-- 5 merlin merlin 13794 Jan  7  2012 /mnt/mnt/DS2/backup/debian64_rw.20160727_02:59:03/gandalfthegreat/20120409/home/merlin/public_html/mirrors/rwf/vfrcharts/x9y678z6.jpg
1225898 -rw-r--r-- 5 merlin merlin 13048 Jan  7  2012 /mnt/mnt/DS2/backup/debian64_rw.20160727_02:59:03/gandalfthegreat/20120409/home/merlin/public_html/mirrors/rwf/vfrcharts/x9y679z6.jpg
1225913 -rw-r--r-- 5 merlin merlin 15247 Jan  7  2012 /mnt/mnt/DS2/backup/debian64_rw.20160727_02:59:03/gandalfthegreat/20120409/home/merlin/public_html/mirrors/rwf/vfrcharts/x9y81z9.jpg
1225897 lrwxrwxrwx 1 merlin merlin    35 Aug  1  2010 /mnt/mnt/DS2/backup/ubuntu_rw.20160713_03:25:42/gandalfthegrey/20100718/var/local/www/Pix/albums/Trips/200509_Malaysia/500_KapalaiIsland/BestOf/33_Diving-Dive5-2_139.jpg -> ../33_Diving/BestOf/Dive5-2_139.jpg

So first:
a) find -inum returns some inodes that don't match
b) but argh, multiple (very different) files have the same inode number, so finding
files by inode number after scrub flags an inode as bad isn't going to work :(
(see the sketch after this list)
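
Since inode numbers are only unique within a subvolume, the only way I
can see to disambiguate the hits is by root id; a sketch (assuming
"btrfs inspect-internal rootid" prints the id of the subvolume
containing a file):

  find /mnt/mnt -inum 1225897 | while read -r f; do
      printf '%s %s\n' "$(btrfs inspect-internal rootid "$f")" "$f"
  done   # keep only the hits whose root id matches the scrub warning (17564)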

At this point, I'm starting to lose patience (and running out of time),
so I'm going to wipe this filesystem after I hear back from you, but
basically scrub and repair are still not up to what they should be IMO
(as per my previous comment):
one should be able to fully repair an unclean filesystem with check
--repair, and scrub should report things I can either fix by hand
(delete the corrupt file) or that check --repair would fix, and neither
is true here.

Marc
-- 
"A mouse is a device used to point at the xterm you want to type in" - A.S.R.
Microsoft is to operating systems ....
                                      .... what McDonalds is to gourmet cooking
Home page: http://marc.merlins.org/                         | PGP 1024R/763BE901

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: when btrfs scrub reports errors and btrfs check --repair does not
  2016-11-13 15:06                                                           ` Marc MERLIN
@ 2016-11-13 15:13                                                             ` Roman Mamedov
  2016-11-13 15:52                                                               ` Marc MERLIN
  0 siblings, 1 reply; 67+ messages in thread
From: Roman Mamedov @ 2016-11-13 15:13 UTC (permalink / raw)
  To: Marc MERLIN; +Cc: linux-btrfs

On Sun, 13 Nov 2016 07:06:30 -0800
Marc MERLIN <marc@merlins.org> wrote:

> So first:
> a) find -inum returns some inodes that don't match
> b) but argh, multiple (very different) files have the same inode number, so finding
> files by inode number after scrub flags an inode as bad isn't going to work :(

I wonder why you even need scrub to verify file readability. Just try
reading all files, e.g. with "cfv -Crr"; the read errors produced will
point you directly to the files which are unreadable, without the need to
look them up backwards via inum. Then just restore those from backups.
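
If cfv isn't available, a plain find/cat loop does the same check (the
mountpoint below is a placeholder, adjust to taste):

find /mnt/backup -type f -exec sh -c \
    'cat -- "$1" >/dev/null || echo "unreadable: $1"' _ {} \;

Slower than checksummed verification on repeat runs, but it needs no
extra tooling and the error list maps directly to file paths.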

-- 
With respect,
Roman

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: when btrfs scrub reports errors and btrfs check --repair does not
  2016-11-13 15:13                                                             ` Roman Mamedov
@ 2016-11-13 15:52                                                               ` Marc MERLIN
  0 siblings, 0 replies; 67+ messages in thread
From: Marc MERLIN @ 2016-11-13 15:52 UTC (permalink / raw)
  To: Roman Mamedov; +Cc: linux-btrfs

On Sun, Nov 13, 2016 at 08:13:29PM +0500, Roman Mamedov wrote:
> On Sun, 13 Nov 2016 07:06:30 -0800
> Marc MERLIN <marc@merlins.org> wrote:
> 
> > So first:
> > a) find -inum returns some inodes that don't match
> > b) but argh, multiple (very different) files share the same inode number, so finding
> > files by inode number after scrub flags an inode as bad isn't going to work :(
> 
> I wonder why you even need scrub to verify file readability. Just try
> reading all files, e.g. with "cfv -Crr"; the read errors produced will
> point you directly to the files which are unreadable, without the need to
> look them up backwards via inum. Then just restore those from backups.

I could read the files, but we're talking about maybe 100 million files;
that would take a while... (and most of them are COW copies of the same
physical data), so scrub is _much_ faster.
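
(For anyone following along, nothing exotic is needed; a stock scrub
invocation on the mounted filesystem looks like this, with the mountpoint
guessed from the paths earlier in the thread:

btrfs scrub start -Bd /mnt/mnt/DS2    # -B: wait, -d: per-device stats
btrfs scrub status /mnt/mnt/DS2

scrub walks every allocated block once and verifies checksums, which is
why shared/reflinked data doesn't multiply the work the way per-file
reads do.)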

Scrub is also reporting issues that seem related not to files but to
internal data structures, while repair is not finding them.

As for the data, it's a backup device, so I can just wipe it, but again,
I'm using this as an example of how I would simply bring a filesystem back
to a clean state, and that's not pretty right now.

Marc
-- 
"A mouse is a device used to point at the xterm you want to type in" - A.S.R.
Microsoft is to operating systems ....
                                      .... what McDonalds is to gourmet cooking
Home page: http://marc.merlins.org/                         | PGP 1024R/763BE901

^ permalink raw reply	[flat|nested] 67+ messages in thread

* btrfs check --repair: ERROR: cannot read chunk root
@ 2016-10-31  1:29 Janos Toth F.
  0 siblings, 0 replies; 67+ messages in thread
From: Janos Toth F. @ 2016-10-31  1:29 UTC (permalink / raw)
  To: Btrfs BTRFS

I stopped using Btrfs RAID-5 after encountering this problem twice (once
due to a failing SATA cable, once due to a random kernel problem which
caused the SATA or the block device driver to reset/crash).

As far as I can tell, the main problem is that after a detach and a
subsequent re-attach (on purpose, or due to a failing cable/controller,
a kernel problem, etc.), your de-synced disk is taken back into the
"array" as if it were still in sync, despite having a lower generation
counter. The filesystem detects the errors later but can't reliably
handle them, let alone fully correct them. If you search this list for
"RAID 5" you will see that RAID-5/6 scrub/repair has some known serious
problems (which probably contribute to this, but I guess something on
top of those problems plays a part here to make this extra messy, like
the generations never getting synced). The de-synchronization will get
worse over time if you are in writable mode (the generation of the
de-synced disk is stuck), up to the point of the filesystem becoming
unmountable (if not already, probably due to scrub causing errors on
the disks that are in sync).
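
A quick way to spot this kind of desync is to compare the generation
counters in each device's superblock (a sketch; the device names are
placeholders, and on older btrfs-progs the standalone btrfs-show-super
tool prints the same information):

for d in /dev/sd[b-e]1; do
    echo "== $d =="
    btrfs inspect-internal dump-super "$d" | grep -E '^generation'
done

If one device reports a lower generation than the rest, it was left
behind at some point and its data can't be trusted blindly.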
I was able to use "rescue" once, and even achieved a read-only mount the
second time (with only a handful of broken files). But I could see a
pattern and didn't want to end up in a situation like that a third time.

Too bad; I would otherwise prefer RAID-5/6 over RAID-1/10 any day
(RAID-5 is faster than RAID-1, and you can lose any two disks from a
RAID-6, not just one from each mirror on RAID-10), but most people think
it's slow and obsolete (I mean they say that about hardware or mdadm
RAID-5/6, and Btrfs's RAID-5/6 is frowned upon for good reasons), while
it's actually the opposite with a limited number of drives (<=6, or
maybe up to 10).

It's not impossible to get right though; RAID-Z is nice (except for the
inability to defrag the inevitable fragmentation), so I keep hoping...

^ permalink raw reply	[flat|nested] 67+ messages in thread

end of thread, other threads:[~2016-11-13 15:52 UTC | newest]

Thread overview: 67+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-10-30 18:34 btrfs check --repair: ERROR: cannot read chunk root Marc MERLIN
2016-10-31  1:02 ` Qu Wenruo
2016-10-31  2:06   ` Marc MERLIN
2016-10-31  4:21     ` Marc MERLIN
2016-10-31  5:27     ` Qu Wenruo
2016-10-31  5:47       ` Marc MERLIN
2016-10-31  6:04         ` Qu Wenruo
2016-10-31  6:25           ` Marc MERLIN
2016-10-31  6:32             ` Qu Wenruo
2016-10-31  6:37               ` Marc MERLIN
2016-10-31  7:04                 ` Qu Wenruo
2016-10-31  8:44                   ` Hugo Mills
2016-10-31 15:04                     ` Marc MERLIN
2016-11-01  3:48                       ` Marc MERLIN
2016-11-01  4:13                       ` Qu Wenruo
2016-11-01  4:21                         ` Marc MERLIN
2016-11-04  8:01                           ` Marc MERLIN
2016-11-04  9:00                             ` Roman Mamedov
2016-11-04 17:59                               ` Marc MERLIN
2016-11-07  1:11                             ` Qu Wenruo
  -- strict thread matches above, loose matches on Subject: below --
2016-10-31  1:29 Janos Toth F.
2016-10-30  2:16 Buffer I/O error on dev md5, logical block 7073536, async page read Marc MERLIN
2016-10-30  9:33 ` Andreas Klauer
2016-10-30 15:38   ` Marc MERLIN
2016-10-30 16:19     ` Andreas Klauer
2016-10-30 16:34       ` Phil Turmel
2016-10-30 17:12         ` clearing blocks wrongfully marked as bad if --update=no-bbl can't be used? Marc MERLIN
2016-10-30 17:16           ` Marc MERLIN
2016-11-04 18:18             ` Marc MERLIN
2016-11-04 18:22               ` Phil Turmel
2016-11-04 18:50                 ` Marc MERLIN
2016-11-04 18:59                   ` Roman Mamedov
2016-11-04 19:31                     ` Roman Mamedov
2016-11-04 20:02                       ` Marc MERLIN
2016-11-04 19:51                     ` Marc MERLIN
2016-11-07  0:16                       ` NeilBrown
2016-11-07  1:13                         ` Marc MERLIN
2016-11-07  3:36                           ` Phil Turmel
2016-11-07  1:20                         ` Marc MERLIN
2016-11-07  1:39                           ` Qu Wenruo
2016-11-07  4:18                             ` Qu Wenruo
2016-11-07  5:36                               ` btrfs support for filesystems >8TB on 32bit architectures Marc MERLIN
2016-11-07  6:16                                 ` Qu Wenruo
2016-11-07 14:55                                   ` Marc MERLIN
2016-11-08  0:35                                     ` Qu Wenruo
2016-11-08  0:39                                       ` Marc MERLIN
2016-11-08  0:43                                         ` Qu Wenruo
2016-11-08  1:06                                           ` Marc MERLIN
2016-11-08  1:17                                             ` Qu Wenruo
2016-11-08 15:24                                               ` Marc MERLIN
2016-11-09  1:50                                                 ` Qu Wenruo
2016-11-09  2:05                                                   ` Marc MERLIN
2016-11-11  3:48                                                     ` Marc MERLIN
2016-11-11  3:55                                                       ` Qu Wenruo
2016-11-12  3:17                                                         ` when btrfs scrub reports errors and btrfs check --repair does not Marc MERLIN
2016-11-13 15:06                                                           ` Marc MERLIN
2016-11-13 15:13                                                             ` Roman Mamedov
2016-11-13 15:52                                                               ` Marc MERLIN
2016-10-30 18:56         ` [ LR] Kernel 4.8.4: INFO: task kworker/u16:8:289 blocked for more than 120 seconds TomK
2016-10-30 19:16           ` TomK
2016-10-30 20:13           ` Andreas Klauer
2016-10-30 21:08             ` TomK
2016-10-31 19:29           ` Wols Lists
2016-11-01  2:40             ` TomK
2016-10-30 16:43       ` Buffer I/O error on dev md5, logical block 7073536, async page read Marc MERLIN
2016-10-30 17:02         ` Andreas Klauer
2016-10-31 19:24         ` Wols Lists
