Linux-BTRFS Archive on lore.kernel.org
* Access Beyond End of Device & Input/Output Errors
@ 2020-08-01  6:51 Justin Brown
  2020-08-01  6:58 ` Qu Wenruo
  0 siblings, 1 reply; 7+ messages in thread
From: Justin Brown @ 2020-08-01  6:51 UTC (permalink / raw)
  To: linux-btrfs

Hello,

I've run into a strange problem that I haven't seen before, and I need
some help. I started getting generic "input/output" errors on a couple
of files, and when I looked deeper, the kernel logs were full of
messages like:

    sd 5:0:0:0: [sdf] tag#29 access beyond end of device

I've never seen anything like this before with any FS, so I figured it
was worth asking before I consider running the standard btrfs tools.
(I briefly started a scrub, but it was going crazy with uncorrectable
errors, so I cancelled it.)

Here's my system info:

Fedora 32, kernel 5.7.7-200.fc32.x86_64
btrfs-progs v5.7

/etc/fstab entry:
LABEL=media /var/media btrfs subvol=media,discard 0 2

btrfs fi show /var/media/
Label: 'media' uuid: 51eef0c7-2977-4037-b271-3270ea22c7d9
Total devices 6 FS bytes used 4.68TiB
devid 1 size 1.82TiB used 963.00GiB path /dev/sdf1
devid 2 size 1.82TiB used 962.00GiB path /dev/sde1
devid 4 size 1.82TiB used 963.00GiB path /dev/sdg1
devid 6 size 1.82TiB used 962.03GiB path /dev/sda1
devid 7 size 7.28TiB used 967.03GiB path /dev/sdb1
devid 8 size 7.28TiB used 967.03GiB path /dev/sdd1

btrfs fi df /var/media/
Data, RAID5: total=4.69TiB, used=4.68TiB
System, RAID1C3: total=32.00MiB, used=304.00KiB
Metadata, RAID1C3: total=6.00GiB, used=4.94GiB
GlobalReserve, single: total=512.00MiB, used=0.00B

I can only mount -o degraded now. Here are the logs when mounting:

Aug 01 01:15:26 spaceman.fandingo.org sudo[275572]: justin : TTY=pts/0
; PWD=/home/justin ; USER=root ; COMMAND=/usr/bin/mount -t btrfs -o
degraded /dev/sda1 /var/media/
Aug 01 01:15:26 spaceman.fandingo.org kernel: sd 5:0:0:0: [sdf] tag#30
access beyond end of device
Aug 01 01:15:26 spaceman.fandingo.org kernel: blk_update_request: I/O
error, dev sdf, sector 2176 op 0x0:(READ) flags 0x0 phys_seg 1 prio
class 0
Aug 01 01:15:26 spaceman.fandingo.org kernel: Buffer I/O error on dev
sdf1, logical block 16, async page read
Aug 01 01:15:26 spaceman.fandingo.org kernel: BTRFS info (device
sde1): allowing degraded mounts
Aug 01 01:15:26 spaceman.fandingo.org kernel: BTRFS info (device
sde1): disk space caching is enabled
Aug 01 01:15:26 spaceman.fandingo.org kernel: BTRFS warning (device
sde1): devid 1 uuid cb05aae6-6c03-49d3-b46d-bf51a0eb8cd0 is missing
Aug 01 01:15:26 spaceman.fandingo.org kernel: BTRFS info (device
sde1): bdev /dev/sdf1 errs: wr 4458026, rd 14571, flush 0, corrupt 0,
gen 0
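
The two I/O error lines above can be cross-checked with a little arithmetic. The sketch below assumes 512-byte sectors, a 4096-byte buffer block size, and that sdf1 starts at sector 2048; the start sector is an assumption, since it is not shown in the thread. Under those assumptions both messages point at the same spot, 64 KiB into the partition, which is where btrfs keeps its primary superblock.

```python
SECTOR_SIZE = 512                 # assumed logical sector size
BLOCK_SIZE = 4096                 # assumed buffer-cache block size
PART_START_SECTOR = 2048          # assumed start of sdf1 (not shown in the thread)
BTRFS_PRIMARY_SUPER = 64 * 1024   # byte offset of the primary btrfs superblock


def offset_from_disk_sector(sector, part_start=PART_START_SECTOR):
    """Byte offset inside the partition for a whole-disk sector number."""
    return (sector - part_start) * SECTOR_SIZE


def offset_from_logical_block(block):
    """Byte offset inside the partition for a buffer-cache block number."""
    return block * BLOCK_SIZE


from_sector = offset_from_disk_sector(2176)   # "dev sdf, sector 2176"
from_block = offset_from_logical_block(16)    # "dev sdf1, logical block 16"
print(from_sector, from_block, from_sector == from_block == BTRFS_PRIMARY_SUPER)
```

If the numbers agree, the failed read was most likely the superblock read on sdf1, which would fit with devid 1 being reported as missing a few lines later.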

It seems like only relatively recently written files are encountering
I/O errors. If I `cat` one of the problematic files when the FS is
mounted normally, I see a ton of this:

Aug 01 01:13:49 spaceman.fandingo.org kernel: sd 5:0:0:0: [sdf] tag#26
access beyond end of device
Aug 01 01:13:49 spaceman.fandingo.org kernel: sd 5:0:0:0: [sdf] tag#27
access beyond end of device
Aug 01 01:13:49 spaceman.fandingo.org kernel: sd 5:0:0:0: [sdf] tag#28
access beyond end of device
Aug 01 01:13:49 spaceman.fandingo.org kernel: sd 5:0:0:0: [sdf] tag#29
access beyond end of device
Aug 01 01:13:49 spaceman.fandingo.org kernel: sd 5:0:0:0: [sdf] tag#30
access beyond end of device
Aug 01 01:13:49 spaceman.fandingo.org kernel: sd 5:0:0:0: [sdf] tag#0
access beyond end of device
Aug 01 01:13:49 spaceman.fandingo.org kernel: sd 5:0:0:0: [sdf] tag#1
access beyond end of device
Aug 01 01:13:49 spaceman.fandingo.org kernel: sd 5:0:0:0: [sdf] tag#13
access beyond end of device
Aug 01 01:13:49 spaceman.fandingo.org kernel: sd 5:0:0:0: [sdf] tag#2
access beyond end of device

Now that I've remounted with -o degraded, I'm getting more comprehensible
warnings, but reads still fail with I/O errors:

Aug 01 01:31:53 spaceman.fandingo.org kernel: BTRFS warning (device
sde1): csum failed root 2820 ino 747435 off 99942400 csum 0x8941f998
expected csum 0xbe3f80a4 mirror 2
Aug 01 01:31:53 spaceman.fandingo.org kernel: BTRFS warning (device
sde1): csum failed root 2820 ino 747435 off 99946496 csum 0x8941f998
expected csum 0x9c36a6b4 mirror 2
Aug 01 01:31:53 spaceman.fandingo.org kernel: BTRFS warning (device
sde1): csum failed root 2820 ino 747435 off 99950592 csum 0x8941f998
expected csum 0x44d30ca2 mirror 2
Aug 01 01:31:53 spaceman.fandingo.org kernel: BTRFS warning (device
sde1): csum failed root 2820 ino 747435 off 99958784 csum 0x8941f998
expected csum 0xc0f08acc mirror 2
Aug 01 01:31:53 spaceman.fandingo.org kernel: BTRFS warning (device
sde1): csum failed root 2820 ino 747435 off 99954688 csum 0x8941f998
expected csum 0xcb11db59 mirror 2
Aug 01 01:31:53 spaceman.fandingo.org kernel: BTRFS warning (device
sde1): csum failed root 2820 ino 747435 off 99962880 csum 0x8941f998
expected csum 0x8a4ee0aa mirror 2
Aug 01 01:31:53 spaceman.fandingo.org kernel: BTRFS warning (device
sde1): csum failed root 2820 ino 747435 off 99971072 csum 0x8941f998
expected csum 0xdfb79e85 mirror 2
Aug 01 01:31:53 spaceman.fandingo.org kernel: BTRFS warning (device
sde1): csum failed root 2820 ino 747435 off 99966976 csum 0x8941f998
expected csum 0xc14921a0 mirror 2
Aug 01 01:31:53 spaceman.fandingo.org kernel: BTRFS warning (device
sde1): csum failed root 2820 ino 747435 off 99975168 csum 0x8941f998
expected csum 0xf2fe8774 mirror 2
Aug 01 01:31:53 spaceman.fandingo.org kernel: BTRFS warning (device
sde1): csum failed root 2820 ino 747435 off 99979264 csum 0x8941f998
expected csum 0xae1cafd6 mirror 2
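
One way to read these warnings: the offsets are consecutive 4 KiB blocks of a single file, and the computed csum is identical on every line, which suggests that every block read back with the same content. One hypothesis worth testing (an assumption, not something confirmed in the thread) is that the blocks came back as all zeros. The sketch below lists the block indices for three of the lines and computes the CRC32C of a zeroed 4 KiB block in both byte orders, so the result can be compared with the logged value by hand. It assumes the filesystem uses the default crc32c checksum.

```python
import re

# Three of the warnings above, joined onto single lines.
LOG = """\
csum failed root 2820 ino 747435 off 99942400 csum 0x8941f998 expected csum 0xbe3f80a4 mirror 2
csum failed root 2820 ino 747435 off 99946496 csum 0x8941f998 expected csum 0x9c36a6b4 mirror 2
csum failed root 2820 ino 747435 off 99950592 csum 0x8941f998 expected csum 0x44d30ca2 mirror 2
"""

PATTERN = re.compile(r"off (\d+) csum (0x[0-9a-f]+) expected csum (0x[0-9a-f]+)")


def parse(log):
    """Return (offset, computed csum, expected csum) for each warning."""
    return [(int(off), got, want) for off, got, want in PATTERN.findall(log)]


def crc32c(data):
    """Bitwise CRC-32C (Castagnoli), reflected polynomial 0x82F63B78."""
    crc = 0xFFFFFFFF
    for byte in data:
        crc ^= byte
        for _ in range(8):
            crc = (crc >> 1) ^ (0x82F63B78 if crc & 1 else 0)
    return crc ^ 0xFFFFFFFF


entries = parse(LOG)
blocks = [off // 4096 for off, _, _ in entries]
computed = {got for _, got, _ in entries}
print("4 KiB block indices:", blocks)
print("distinct computed csums:", computed)

zero = crc32c(bytes(4096))
swapped = int.from_bytes(zero.to_bytes(4, "little"), "big")
print(f"zeroed block: 0x{zero:08x} (byte-swapped: 0x{swapped:08x})")
```

If either printed value matches the logged csum, the reads on that mirror are probably returning zero-filled pages rather than random corruption.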

While trying to research this problem, I came across a GitHub issue
https://github.com/kdave/btrfs-progs/issues/282 and a patch from Qu
from yesterday ([PATCH] btrfs: trim: fix underflow in trim length to
prevent access beyond device boundary). I do use the discard mount
option, and I have a weekly fstrim.timer enabled. I also replaced 2x2TB
drives with 2x8TB drives about 1 month ago, which involved a
conversion to -d raid5 -m raid1c3. I suppose that could hit the same
code paths that resize2fs would?

Any advice on how to proceed would be greatly appreciated.

Thanks,
Justin

* Re: Access Beyond End of Device & Input/Output Errors
  2020-08-01  6:51 Access Beyond End of Device & Input/Output Errors Justin Brown
@ 2020-08-01  6:58 ` Qu Wenruo
  2020-08-01  7:02   ` Qu Wenruo
  0 siblings, 1 reply; 7+ messages in thread
From: Qu Wenruo @ 2020-08-01  6:58 UTC (permalink / raw)
  To: Justin Brown, linux-btrfs

On 2020/8/1 2:51 PM, Justin Brown wrote:
> Hello,
> 
> I've run into a strange problem that I haven't seen before, and I need
> some help. I started getting generic "input/output" errors on a couple
> of files, and when I looked deeper, the kernel logs are full of
> messages like:
> 
>     sd 5:0:0:0: [sdf] tag#29 access beyond end of device

We recently had a new fix for trim, but judging from your kernel
messages, this doesn't look like that case.

(There is no obvious tag showing that it's trim/discard.)

> Aug 01 01:15:26 spaceman.fandingo.org kernel: sd 5:0:0:0: [sdf] tag#30
> access beyond end of device
> Aug 01 01:15:26 spaceman.fandingo.org kernel: blk_update_request: I/O
> error, dev sdf, sector 2176 op 0x0:(READ) flags 0x0 phys_seg 1 prio
> class 0

OK, it's a READ, not a DISCARD, so this is a completely different problem.


> Why trying to research this problem, I came across a Github issue
> https://github.com/kdave/btrfs-progs/issues/282 and a patch from Qu
> from yesterday ([PATCH] btrfs: trim: fix underflow in trim length to
> prevent access beyond device boundary). I do use the discard mount
> option, and I have a weekly fstrim.timer enabled. I did replace 2x2TB
> drives with the 2x8TB drives about 1 month ago, which involved a
> conversion to -d raid5 -m raid1c3, which I suppose could hit the same
> code paths that resize2fs would?

This doesn't look like a trim problem; more likely it is some kind of
device boundary bug.

Would you please provide the following info?
- btrfs ins dump-tree -t chunk /dev/sde1
  This contains the device info and the chunk tree dump. It doesn't
  contain any confidential info.
  We can use it to determine whether some chunk really lies beyond the
  device boundary.
  My guess is that some chunks are already beyond the device boundary
  somehow.
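
The check described above boils down to comparing the end of each per-device extent against that device's total_bytes. A minimal sketch, assuming the (devid, offset, length) tuples and the per-device sizes have already been pulled out of the dump by hand; the data layout, the function name, and the sample extents are hypothetical and do not come from the actual dump. Note that for striped profiles such as RAID5 the length has to be the per-device stripe length, not the logical chunk length.

```python
GIB = 1024 ** 3


def extents_beyond_device(dev_sizes, dev_extents):
    """Return (devid, offset, length, overshoot) for extents ending past the device."""
    bad = []
    for devid, offset, length in dev_extents:
        end = offset + length
        if end > dev_sizes[devid]:
            bad.append((devid, offset, length, end - dev_sizes[devid]))
    return bad


# total_bytes per devid; the size is the 2 TB partition size seen in this thread.
dev_sizes = {1: 2000397868544}

# Hypothetical extents, for illustration only.
dev_extents = [
    (1, 1048576, GIB),                     # fits comfortably
    (1, 2000397868544 - GIB // 2, GIB),    # ends 512 MiB past the end
]

for devid, offset, length, overshoot in extents_beyond_device(dev_sizes, dev_extents):
    print(f"devid {devid}: extent at {offset} (+{length}) overshoots by {overshoot} bytes")
```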

Thanks,
Qu

> 
> Any advice on how to proceed would be greatly appreciated.
> 
> Thanks,
> Justin
> 


* Re: Access Beyond End of Device & Input/Output Errors
  2020-08-01  6:58 ` Qu Wenruo
@ 2020-08-01  7:02   ` Qu Wenruo
       [not found]     ` <CAKZK7uzmg19NDjGPPAxXKu7LJ-7ZdHu2cad22csj_chr2qxMJg@mail.gmail.com>
  0 siblings, 1 reply; 7+ messages in thread
From: Qu Wenruo @ 2020-08-01  7:02 UTC (permalink / raw)
  To: Justin Brown, linux-btrfs

On 2020/8/1 2:58 PM, Qu Wenruo wrote:
> Would you please provide the following info?
> - btrfs ins dump-tree -t chunk /dev/sde1
>   This contains the device info and chunk tree dump. Doesn't contain
>   any confidential info.
>   We can use this info to determine if there is some chunk really beyond
>   device boundary.
>   I guess some chunks are already beyond device boundary by somehow.

And the `lsblk -b` output, please.

It's possible that the device size recorded in btrfs doesn't match the
real device size...
> 
> Thanks,
> Qu
> 
>>
>> Any advice on how to proceed would be greatly appreciated.
>>
>> Thanks,
>> Justin
>>
> 


* Re: Access Beyond End of Device & Input/Output Errors
       [not found]     ` <CAKZK7uzmg19NDjGPPAxXKu7LJ-7ZdHu2cad22csj_chr2qxMJg@mail.gmail.com>
@ 2020-08-01  9:31       ` Qu Wenruo
  2020-08-01 11:56         ` Justin Brown
  0 siblings, 1 reply; 7+ messages in thread
From: Qu Wenruo @ 2020-08-01  9:31 UTC (permalink / raw)
  To: Justin Brown; +Cc: linux-btrfs

On 2020/8/1 4:30 PM, Justin Brown wrote:
> Hi Qu,
> 
> Thanks for the help.
> 
> Here is the `lsblk -b` output:
> 
> NAME   MAJ:MIN RM          SIZE RO TYPE MOUNTPOINT
> sda      8:0    0 2000398934016  0 disk
> └─sda1   8:1    0 2000397868544  0 part
> sdb      8:16   0 8001563222016  0 disk
> └─sdb1   8:17   0 8001562156544  0 part
> sdc      8:32   0  120034123776  0 disk
> ├─sdc1   8:33   0       1048576  0 part
> ├─sdc2   8:34   0     524288000  0 part /boot
> └─sdc3   8:35   0  119507255296  0 part /home
> sdd      8:48   0 8001563222016  0 disk
> └─sdd1   8:49   0 8001562156544  0 part
> sde      8:64   0 2000398934016  0 disk
> └─sde1   8:65   0 2000397868544  0 part
> sdf      8:80   0 2000398934016  0 disk
> └─sdf1   8:81   0 2000397868544  0 part /var/media
> sdg      8:96   1 2000398934016  0 disk
> └─sdg1   8:97   1 2000397868544  0 part
> 
> The `btrfs ins...` output is quite long. I've attached it as a txt and
> also uploaded it at
> https://gist.github.com/fandingo/aa345d6c6fa97162f810e86c9ab20d6a


Thanks, this already shows some device size differences.

But all of the sizes recorded in btrfs are just a little smaller than the
real device sizes, so that should be fine.

Another problem I found is that either the size or the start of some
partitions does not seem to be aligned to 4K.

That may be a problem for hard disks with 4K sectors, so it may be worth
some attention after the btrfs problem is solved.
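
The size half of the alignment concern can be checked directly from the `lsblk -b` numbers quoted above. A minimal sketch, assuming only the partition sizes are of interest; the partition start offsets are not shown in the thread, so they are not checked here, and the function name is made up for illustration:

```python
ALIGN = 4096

# Partition sizes in bytes, copied from the lsblk -b output above.
partition_sizes = {
    "sda1": 2000397868544,
    "sdb1": 8001562156544,
    "sdd1": 8001562156544,
    "sde1": 2000397868544,
    "sdf1": 2000397868544,
    "sdg1": 2000397868544,
}


def alignment_report(size, align=ALIGN):
    """Return (remainder, size rounded down to a multiple of align)."""
    remainder = size % align
    return remainder, size - remainder


for name, size in sorted(partition_sizes.items()):
    remainder, rounded = alignment_report(size)
    state = "aligned" if remainder == 0 else f"{remainder} bytes past a 4K boundary"
    print(f"{name}: {size} -> {state}, rounded down: {rounded}")
```

A non-zero remainder only says that the partition size is not a multiple of 4096; whether that matters in practice is a separate question.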

Would you please also provide some extra dumps?
- btrfs check /dev/sda1
  It should detect any problems I missed.

- btrfs ins dump-super <device> | grep dev_item.uuid
  It's a little hard to tell which device belongs to which device id,
  so we need this dump from each btrfs device to make sure.

Thanks,
Qu



* Re: Access Beyond End of Device & Input/Output Errors
  2020-08-01  9:31       ` Qu Wenruo
@ 2020-08-01 11:56         ` Justin Brown
  2020-08-01 23:30           ` Qu Wenruo
  0 siblings, 1 reply; 7+ messages in thread
From: Justin Brown @ 2020-08-01 11:56 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: linux-btrfs

Hi Qu,

Thanks for your continued help.

dump-super:

for i in a b d e f g; do x=$(sudo btrfs ins dump-super /dev/sd${i}1 |
grep dev_item.uuid | cut -f 3); echo "/dev/sd${i}1 $x"; done
/dev/sda1 cc3f9a00-bd69-4ceb-b6e5-4fb874be2aaf
/dev/sdb1 27e1cf24-9349-4f72-a23b-86668b2a9e78
/dev/sdd1 601d409e-8ffd-489c-91af-daf3e0cc9bd2
/dev/sde1 2908ebfb-e6b5-4991-b25d-32d1487ff6a4
/dev/sdf1 cb05aae6-6c03-49d3-b46d-bf51a0eb8cd0

btrfs check:

sudo btrfs check /dev/sda1
Opening filesystem to check...
Checking filesystem on /dev/sda1
UUID: 51eef0c7-2977-4037-b271-3270ea22c7d9
[1/7] checking root items
[2/7] checking extents
WARNING: unaligned total_bytes detected for devid 2, have
2000397868544 should be aligned to 4096
WARNING: this is OK for older kernel, but may cause kernel warning for
newer kernels
WARNING: this can be fixed by 'btrfs rescue fix-device-size'
WARNING: unaligned total_bytes detected for devid 4, have
2000397868544 should be aligned to 4096
WARNING: this is OK for older kernel, but may cause kernel warning for
newer kernels
WARNING: this can be fixed by 'btrfs rescue fix-device-size'
WARNING: unaligned total_bytes detected for devid 6, have
2000397868544 should be aligned to 4096
WARNING: this is OK for older kernel, but may cause kernel warning for
newer kernels
WARNING: this can be fixed by 'btrfs rescue fix-device-size'
WARNING: minor unaligned/mismatch device size detected
WARNING: recommended to use 'btrfs rescue fix-device-size' to fix it
failed to load free space cache for block group 92568662507520
failed to load free space cache for block group 92574031216640
...
failed to load free space cache for block group 97722656817152
failed to load free space cache for block group 97728025526272
[4/7] checking fs roots
[5/7] checking only csums items (without verifying data)
[6/7] checking root refs
[7/7] checking quota groups skipped (not enabled on this FS)
found 5148381876224 bytes used, no error found
total csum bytes: 4998903140
total tree bytes: 5301813248
total fs tree bytes: 96894976
total extent tree bytes: 41910272
btree space waste bytes: 135561977
file data blocks allocated: 8972043898880
referenced 5113155596288

The alignment issue would be confined to performance, correct?

Thanks,
Justin

/dev/sdg1 1b938c84-eafd-4396-b06c-8a5bf1339840

On Sat, Aug 1, 2020 at 4:31 AM Qu Wenruo <quwenruo.btrfs@gmx.com> wrote:
>
>
>
> On 2020/8/1 4:30 PM, Justin Brown wrote:
> > Hi Qu,
> >
> > Thanks for the help.
> >
> > Here is the lsblk -b output:
> >
> > NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
> > sda 8:0 0 2000398934016 0 disk
> > └─sda1 8:1 0 2000397868544 0 part
> > sdb 8:16 0 8001563222016 0 disk
> > └─sdb1 8:17 0 8001562156544 0 part
> > sdc 8:32 0 120034123776 0 disk
> > ├─sdc1 8:33 0 1048576 0 part
> > ├─sdc2 8:34 0 524288000 0 part /boot
> > └─sdc3 8:35 0 119507255296 0 part /home
> > sdd 8:48 0 8001563222016 0 disk
> > └─sdd1 8:49 0 8001562156544 0 part
> > sde 8:64 0 2000398934016 0 disk
> > └─sde1 8:65 0 2000397868544 0 part
> > sdf 8:80 0 2000398934016 0 disk
> > └─sdf1 8:81 0 2000397868544 0 part /var/media
> > sdg 8:96 1 2000398934016 0 disk
> > └─sdg1 8:97 1 2000397868544 0 part
> >
> > The `btrfs ins...` output is quite long. I've attached it as a txt and
> > also uploaded it at
> > https://gist.github.com/fandingo/aa345d6c6fa97162f810e86c9ab20d6a
>
>
> Thanks, this already shows some device size difference.
>
> But all of them are in fact just a little smaller than device size, thus
> it should be fine.
>
> Another problem I found is that either the size or the start of some
> partitions does not look aligned to 4K.
>
> That may be a problem for 4K-aligned hard disks, so it may be worth some
> attention after solving the btrfs problem.
>
> Would you please also provide some extra dumps?
> - btrfs check /dev/sda1
>   It should detect any problems I missed.
>
> - btrfs ins dump-super <device> | grep dev_item.uuid
>   It's a little hard to tell which device belongs to which device ID,
>   so we need this dump from each btrfs device to make sure.
>
> Thanks,
> Qu
>
>
> >
> > Thanks,
> > Justin
> >
> > On Sat, Aug 1, 2020 at 2:02 AM Qu Wenruo <quwenruo.btrfs@gmx.com> wrote:
> >>
> >>
> >>
> >> On 2020/8/1 2:58 PM, Qu Wenruo wrote:
> >>>
> >>>
> >>> On 2020/8/1 2:51 PM, Justin Brown wrote:
> >>>> Hello,
> >>>>
> >>>> I've run into a strange problem that I haven't seen before, and I need
> >>>> some help. I started getting generic "input/output" errors on a couple
> >>>> of files, and when I looked deeper, the kernel logs are full of
> >>>> messages like:
> >>>>
> >>>>     sd 5:0:0:0: [sdf] tag#29 access beyond end of device
> >>>
> >>> We had a new fix for trim. But according to your kernel message, it
> >>> doesn't look like the case.
> >>>
> >>> (No obvious tag showing it's trim/discard)
> >>>
> >>>>
> >>>> I've never seen anything like this before with any FS, so I figured it
> >>>> was worth asking before I consider running the standard btrfs tools.
> >>>> (I briefly started a scrub, but it was going crazy with uncorrectable
> >>>> errors, so I cancelled it.)
> >>>>
> >>>> Here's my system info:
> >>>>
> >>>> Fedora 32, kernel 5.7.7-200.fc32.x86_64
> >>>> btrfs-progs v5.7
> >>>>
> >>>> /etc/fstab entry:
> >>>> LABEL=media /var/media btrfs subvol=media,discard 0 2
> >>>>
> >>>> btrfs fi show /var/media/
> >>>> Label: 'media' uuid: 51eef0c7-2977-4037-b271-3270ea22c7d9
> >>>> Total devices 6 FS bytes used 4.68TiB
> >>>> devid 1 size 1.82TiB used 963.00GiB path /dev/sdf1
> >>>> devid 2 size 1.82TiB used 962.00GiB path /dev/sde1
> >>>> devid 4 size 1.82TiB used 963.00GiB path /dev/sdg1
> >>>> devid 6 size 1.82TiB used 962.03GiB path /dev/sda1
> >>>> devid 7 size 7.28TiB used 967.03GiB path /dev/sdb1
> >>>> devid 8 size 7.28TiB used 967.03GiB path /dev/sdd1
> >>>>
> >>>> btrfs fi df /var/media/
> >>>> Data, RAID5: total=4.69TiB, used=4.68TiB
> >>>> System, RAID1C3: total=32.00MiB, used=304.00KiB
> >>>> Metadata, RAID1C3: total=6.00GiB, used=4.94GiB
> >>>> GlobalReserve, single: total=512.00MiB, used=0.00B
> >>>>
> >>>> I can only mount -o degraded now. Here are the logs when mounting:
> >>>>
> >>>> Aug 01 01:15:26 spaceman.fandingo.org sudo[275572]: justin : TTY=pts/0
> >>>> ; PWD=/home/justin ; USER=root ; COMMAND=/usr/bin/mount -t btrfs -o
> >>>> degraded /dev/sda1 /var/media/
> >>>> Aug 01 01:15:26 spaceman.fandingo.org kernel: sd 5:0:0:0: [sdf] tag#30
> >>>> access beyond end of device
> >>>> Aug 01 01:15:26 spaceman.fandingo.org kernel: blk_update_request: I/O
> >>>> error, dev sdf, sector 2176 op 0x0:(READ) flags 0x0 phys_seg 1 prio
> >>>> class 0
> >>>
> >>> OK, it's read, not DISCARD, thus a completely different problem.
> >>>
> >>>
> >>>> Aug 01 01:15:26 spaceman.fandingo.org kernel: Buffer I/O error on dev
> >>>> sdf1, logical block 16, async page read
> >>>> Aug 01 01:15:26 spaceman.fandingo.org kernel: BTRFS info (device
> >>>> sde1): allowing degraded mounts
> >>>> Aug 01 01:15:26 spaceman.fandingo.org kernel: BTRFS info (device
> >>>> sde1): disk space caching is enabled
> >>>> Aug 01 01:15:26 spaceman.fandingo.org kernel: BTRFS warning (device
> >>>> sde1): devid 1 uuid cb05aae6-6c03-49d3-b46d-bf51a0eb8cd0 is missing
> >>>> Aug 01 01:15:26 spaceman.fandingo.org kernel: BTRFS info (device
> >>>> sde1): bdev /dev/sdf1 errs: wr 4458026, rd 14571, flush 0, corrupt 0,
> >>>> gen 0
> >>>>
> >>>> It seems like only relatively recently written files are encountering
> >>>> I/O errors. If I `cat` one of the problematic files when the FS is
> >>>> mounted normally, I see a ton of this:
> >>>>
> >>>> Aug 01 01:13:49 spaceman.fandingo.org kernel: sd 5:0:0:0: [sdf] tag#26
> >>>> access beyond end of device
> >>>> Aug 01 01:13:49 spaceman.fandingo.org kernel: sd 5:0:0:0: [sdf] tag#27
> >>>> access beyond end of device
> >>>> Aug 01 01:13:49 spaceman.fandingo.org kernel: sd 5:0:0:0: [sdf] tag#28
> >>>> access beyond end of device
> >>>> Aug 01 01:13:49 spaceman.fandingo.org kernel: sd 5:0:0:0: [sdf] tag#29
> >>>> access beyond end of device
> >>>> Aug 01 01:13:49 spaceman.fandingo.org kernel: sd 5:0:0:0: [sdf] tag#30
> >>>> access beyond end of device
> >>>> Aug 01 01:13:49 spaceman.fandingo.org kernel: sd 5:0:0:0: [sdf] tag#0
> >>>> access beyond end of device
> >>>> Aug 01 01:13:49 spaceman.fandingo.org kernel: sd 5:0:0:0: [sdf] tag#1
> >>>> access beyond end of device
> >>>> Aug 01 01:13:49 spaceman.fandingo.org kernel: sd 5:0:0:0: [sdf] tag#13
> >>>> access beyond end of device
> >>>> Aug 01 01:13:49 spaceman.fandingo.org kernel: sd 5:0:0:0: [sdf] tag#2
> >>>> access beyond end of device
> >>>>
> >>>> Now that I'm remounted in -o degraded, I'm getting more comprehensible
> >>>> warnings, but it still results in I/O read failures:
> >>>>
> >>>> Aug 01 01:31:53 spaceman.fandingo.org kernel: BTRFS warning (device
> >>>> sde1): csum failed root 2820 ino 747435 off 99942400 csum 0x8941f998
> >>>> expected csum 0xbe3f80a4 mirror 2
> >>>> Aug 01 01:31:53 spaceman.fandingo.org kernel: BTRFS warning (device
> >>>> sde1): csum failed root 2820 ino 747435 off 99946496 csum 0x8941f998
> >>>> expected csum 0x9c36a6b4 mirror 2
> >>>> Aug 01 01:31:53 spaceman.fandingo.org kernel: BTRFS warning (device
> >>>> sde1): csum failed root 2820 ino 747435 off 99950592 csum 0x8941f998
> >>>> expected csum 0x44d30ca2 mirror 2
> >>>> Aug 01 01:31:53 spaceman.fandingo.org kernel: BTRFS warning (device
> >>>> sde1): csum failed root 2820 ino 747435 off 99958784 csum 0x8941f998
> >>>> expected csum 0xc0f08acc mirror 2
> >>>> Aug 01 01:31:53 spaceman.fandingo.org kernel: BTRFS warning (device
> >>>> sde1): csum failed root 2820 ino 747435 off 99954688 csum 0x8941f998
> >>>> expected csum 0xcb11db59 mirror 2
> >>>> Aug 01 01:31:53 spaceman.fandingo.org kernel: BTRFS warning (device
> >>>> sde1): csum failed root 2820 ino 747435 off 99962880 csum 0x8941f998
> >>>> expected csum 0x8a4ee0aa mirror 2
> >>>> Aug 01 01:31:53 spaceman.fandingo.org kernel: BTRFS warning (device
> >>>> sde1): csum failed root 2820 ino 747435 off 99971072 csum 0x8941f998
> >>>> expected csum 0xdfb79e85 mirror 2
> >>>> Aug 01 01:31:53 spaceman.fandingo.org kernel: BTRFS warning (device
> >>>> sde1): csum failed root 2820 ino 747435 off 99966976 csum 0x8941f998
> >>>> expected csum 0xc14921a0 mirror 2
> >>>> Aug 01 01:31:53 spaceman.fandingo.org kernel: BTRFS warning (device
> >>>> sde1): csum failed root 2820 ino 747435 off 99975168 csum 0x8941f998
> >>>> expected csum 0xf2fe8774 mirror 2
> >>>> Aug 01 01:31:53 spaceman.fandingo.org kernel: BTRFS warning (device
> >>>> sde1): csum failed root 2820 ino 747435 off 99979264 csum 0x8941f998
> >>>> expected csum 0xae1cafd6 mirror 2
> >>>>
> >>>> While trying to research this problem, I came across a GitHub issue
> >>>> https://github.com/kdave/btrfs-progs/issues/282 and a patch from Qu
> >>>> from yesterday ([PATCH] btrfs: trim: fix underflow in trim length to
> >>>> prevent access beyond device boundary). I do use the discard mount
> >>>> option, and I have a weekly fstrim.timer enabled. I did replace 2x2TB
> >>>> drives with the 2x8TB drives about 1 month ago, which involved a
> >>>> conversion to -d raid5 -m raid1c3, which I suppose could hit the same
> >>>> code paths that resize2fs would?
> >>>
> >>> The problem doesn't look like a trim one, but more likely some device
> >>> boundary bug.
> >>>
> >>> Would you please provide the following info?
> >>> - btrfs ins dump-tree -t chunk /dev/sde1
> >>>   This contains the device info and chunk tree dump. Doesn't contain
> >>>   any confidential info.
> >>>   We can use this info to determine if there is some chunk really beyond
> >>>   device boundary.
> >>>   I guess some chunks are somehow already beyond the device boundary.
> >>
> >> And `lsblk -b` output.
> >>
> >> It may be possible that device size in btrfs doesn't match with the real
> >> device...
> >>>
> >>> Thanks,
> >>> Qu
> >>>
> >>>>
> >>>> Any advice on how to proceed would be greatly appreciated.
> >>>>
> >>>> Thanks,
> >>>> Justin
> >>>>
> >>>
> >>
>


* Re: Access Beyond End of Device & Input/Output Errors
  2020-08-01 11:56         ` Justin Brown
@ 2020-08-01 23:30           ` Qu Wenruo
  2020-09-06  1:42             ` Justin Brown
  0 siblings, 1 reply; 7+ messages in thread
From: Qu Wenruo @ 2020-08-01 23:30 UTC (permalink / raw)
  To: Justin Brown; +Cc: linux-btrfs


On 2020/8/1 7:56 PM, Justin Brown wrote:
> Hi Qu,
> 
> Thanks for your continued help.
> 
> dump-super:
> 
> for i in a b d e f g; do x=$(sudo btrfs ins dump-super /dev/sd${i}1 |
> grep dev_item.uuid | cut -f 3); echo "/dev/sd${i}1 $x"; done
> /dev/sda1 cc3f9a00-bd69-4ceb-b6e5-4fb874be2aaf
> /dev/sdb1 27e1cf24-9349-4f72-a23b-86668b2a9e78
> /dev/sdd1 601d409e-8ffd-489c-91af-daf3e0cc9bd2
> /dev/sde1 2908ebfb-e6b5-4991-b25d-32d1487ff6a4
> /dev/sdf1 cb05aae6-6c03-49d3-b46d-bf51a0eb8cd0

They match with the device size. So no chunk item beyond device boundary.
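As an illustration only, the boundary check being described amounts to
confirming that every stripe recorded in the chunk tree ends at or before
the device's total_bytes. The sketch below is a simplified stand-in, not
what btrfs-progs actually runs: the function name extent_fits, the extent
length, and the two offsets are hypothetical. Only the 2000397868544-byte
size comes from the lsblk -b output quoted further down.

```shell
# Succeeds (exit status 0) when [offset, offset + length) fits inside the device.
# Arguments: offset length total_bytes (all in bytes).
extent_fits() {
    [ $(( $1 + $2 )) -le "$3" ]
}

total_bytes=2000397868544    # partition size of the 2 TB members, from lsblk -b
length=1073741824            # hypothetical 1 GiB stripe length

# Both offsets are hypothetical and chosen only to show each outcome.
for offset in 1034030546944 2000000000000; do
    if extent_fits "$offset" "$length" "$total_bytes"; then
        echo "stripe at $offset: inside device"
    else
        echo "stripe at $offset: BEYOND device end"
    fi
done
```

The first hypothetical stripe fits. The second would run past the end of
the partition, which is the kind of entry that would explain an "access
beyond end of device" message if it existed in the chunk tree.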

> 
> btrfs check:
> 
> sudo btrfs check /dev/sda1
> Opening filesystem to check...
> Checking filesystem on /dev/sda1
> UUID: 51eef0c7-2977-4037-b271-3270ea22c7d9
> [1/7] checking root items
> [2/7] checking extents
...
> failed to load free space cache for block group 92568662507520
> failed to load free space cache for block group 92574031216640
> ...
> failed to load free space cache for block group 97722656817152
> failed to load free space cache for block group 97728025526272

This is interesting. Maybe that's related to the problem?

> [4/7] checking fs roots
> [5/7] checking only csums items (without verifying data)
> [6/7] checking root refs
> [7/7] checking quota groups skipped (not enabled on this FS)

Great that all metadata are fine.

> found 5148381876224 bytes used, no error found
> total csum bytes: 4998903140
> total tree bytes: 5301813248
> total fs tree bytes: 96894976
> total extent tree bytes: 41910272
> btree space waste bytes: 135561977
> file data blocks allocated: 8972043898880
> referenced 5113155596288
> 
> The alignment issue would be confined to performance, correct?

Yep, only related to performance and some noisy warning for newer kernel.
Not a big problem yet.
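To make the warning concrete, the arithmetic behind "unaligned
total_bytes" can be reproduced from the number in the btrfs check output.
This is a minimal sketch: the helper names unaligned_bytes and align_down
are made up for illustration, and it only shows the nearest lower 4096-byte
boundary. It does not claim to be the exact value any tool would write.

```shell
SECTORSIZE=4096    # alignment named in the btrfs check warning

# Print how many bytes a size extends past the last 4096-byte boundary.
unaligned_bytes() {
    echo $(( $1 % $SECTORSIZE ))
}

# Print the size rounded down to the previous 4096-byte boundary.
align_down() {
    echo $(( $1 - $1 % $SECTORSIZE ))
}

total_bytes=2000397868544    # total_bytes reported for devid 2, 4 and 6
echo "remainder:    $(unaligned_bytes "$total_bytes")"
echo "aligned down: $(align_down "$total_bytes")"
```

For 2000397868544 the remainder is 3584 bytes (seven 512-byte sectors), so
the size sits 3584 bytes past a 4 KiB boundary.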

Since btrfs check reports no obvious problem other than the free space
cache failures, maybe btrfs check --clear-space-cache v1 is worth trying.

BTW, since the current kernel and btrfs-progs don't do strict chunk
checks against the device boundary, I'll add such checks to both kernel
and progs soon.

In the meantime, I also see the following dmesg line showing that the
kernel failed to detect one device:

  Aug 01 01:15:26 spaceman.fandingo.org kernel: BTRFS warning (device
  sde1): devid 1 uuid cb05aae6-6c03-49d3-b46d-bf51a0eb8cd0 is missing

Can you reproduce that problem? And if so, maybe try "btrfs device scan"
and then mount again?
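One way to read the failing request in the quoted mount log (sector 2176
on sdf, logical block 16 on sdf1) is to convert the sector number into a
byte offset inside the partition. The sketch below rests on two
assumptions that the thread does not confirm: that sdf1 starts at sector
2048 (1 MiB), and that the "logical block" in the buffer I/O message is
4096 bytes. The helper name partition_offset is hypothetical.

```shell
SECTOR=512         # bytes per sector in the kernel's "sector" field
PART_START=2048    # ASSUMPTION: sdf1 begins at sector 2048 (1 MiB)
BLOCK=4096         # ASSUMPTION: buffer I/O "logical block" size in bytes

# Print the byte offset inside the partition for a whole-disk sector number.
partition_offset() {
    echo $(( ($1 - $PART_START) * $SECTOR ))
}

off=$(partition_offset 2176)
echo "offset in sdf1: $off bytes"
echo "4 KiB block:    $(( off / BLOCK ))"
```

Under those assumptions the offset is 65536 bytes, i.e. block 16, which
agrees with the "logical block 16" line. 64 KiB is also where the primary
btrfs superblock lives, so the failing read may simply be the superblock
probe on a disk that had already dropped off the bus. That is a possible
reading, not a confirmed diagnosis.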

Thanks,
Qu

> 
> Thanks,
> Justin
> 
> /dev/sdg1 1b938c84-eafd-4396-b06c-8a5bf1339840
>
> On Sat, Aug 1, 2020 at 4:31 AM Qu Wenruo <quwenruo.btrfs@gmx.com> wrote:
>>
>>
>>
>> On 2020/8/1 4:30 PM, Justin Brown wrote:
>>> Hi Qu,
>>>
>>> Thanks for the help.
>>>
>>> Here is the lsblk -b output:
>>>
>>> NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
>>> sda 8:0 0 2000398934016 0 disk
>>> └─sda1 8:1 0 2000397868544 0 part
>>> sdb 8:16 0 8001563222016 0 disk
>>> └─sdb1 8:17 0 8001562156544 0 part
>>> sdc 8:32 0 120034123776 0 disk
>>> ├─sdc1 8:33 0 1048576 0 part
>>> ├─sdc2 8:34 0 524288000 0 part /boot
>>> └─sdc3 8:35 0 119507255296 0 part /home
>>> sdd 8:48 0 8001563222016 0 disk
>>> └─sdd1 8:49 0 8001562156544 0 part
>>> sde 8:64 0 2000398934016 0 disk
>>> └─sde1 8:65 0 2000397868544 0 part
>>> sdf 8:80 0 2000398934016 0 disk
>>> └─sdf1 8:81 0 2000397868544 0 part /var/media
>>> sdg 8:96 1 2000398934016 0 disk
>>> └─sdg1 8:97 1 2000397868544 0 part
>>>
>>> The `btrfs ins...` output is quite long. I've attached it as a txt and
>>> also uploaded it at
>>> https://gist.github.com/fandingo/aa345d6c6fa97162f810e86c9ab20d6a
>>
>>
>> Thanks, this already shows some device size difference.
>>
>> But all of them are in fact just a little smaller than device size, thus
>> it should be fine.
>>
>> Another problem I found is that either the size or the start of some
>> partitions does not look aligned to 4K.
>>
>> That may be a problem for 4K-aligned hard disks, so it may be worth some
>> attention after solving the btrfs problem.
>>
>> Would you please also provide some extra dumps?
>> - btrfs check /dev/sda1
>>   It should detect any problems I missed.
>>
>> - btrfs ins dump-super <device> | grep dev_item.uuid
>>   It's a little hard to tell which device belongs to which device ID,
>>   so we need this dump from each btrfs device to make sure.
>>
>> Thanks,
>> Qu
>>
>>
>>>
>>> Thanks,
>>> Justin
>>>
>>> On Sat, Aug 1, 2020 at 2:02 AM Qu Wenruo <quwenruo.btrfs@gmx.com> wrote:
>>>>
>>>>
>>>>
>>>> On 2020/8/1 2:58 PM, Qu Wenruo wrote:
>>>>>
>>>>>
>>>>> On 2020/8/1 2:51 PM, Justin Brown wrote:
>>>>>> Hello,
>>>>>>
>>>>>> I've run into a strange problem that I haven't seen before, and I need
>>>>>> some help. I started getting generic "input/output" errors on a couple
>>>>>> of files, and when I looked deeper, the kernel logs are full of
>>>>>> messages like:
>>>>>>
>>>>>>     sd 5:0:0:0: [sdf] tag#29 access beyond end of device
>>>>>
>>>>> We had a new fix for trim. But according to your kernel message, it
>>>>> doesn't look like the case.
>>>>>
>>>>> (No obvious tag showing it's trim/discard)
>>>>>
>>>>>>
>>>>>> I've never seen anything like this before with any FS, so I figured it
>>>>>> was worth asking before I consider running the standard btrfs tools.
>>>>>> (I briefly started a scrub, but it was going crazy with uncorrectable
>>>>>> errors, so I cancelled it.)
>>>>>>
>>>>>> Here's my system info:
>>>>>>
>>>>>> Fedora 32, kernel 5.7.7-200.fc32.x86_64
>>>>>> btrfs-progs v5.7
>>>>>>
>>>>>> /etc/fstab entry:
>>>>>> LABEL=media /var/media btrfs subvol=media,discard 0 2
>>>>>>
>>>>>> btrfs fi show /var/media/
>>>>>> Label: 'media' uuid: 51eef0c7-2977-4037-b271-3270ea22c7d9
>>>>>> Total devices 6 FS bytes used 4.68TiB
>>>>>> devid 1 size 1.82TiB used 963.00GiB path /dev/sdf1
>>>>>> devid 2 size 1.82TiB used 962.00GiB path /dev/sde1
>>>>>> devid 4 size 1.82TiB used 963.00GiB path /dev/sdg1
>>>>>> devid 6 size 1.82TiB used 962.03GiB path /dev/sda1
>>>>>> devid 7 size 7.28TiB used 967.03GiB path /dev/sdb1
>>>>>> devid 8 size 7.28TiB used 967.03GiB path /dev/sdd1
>>>>>>
>>>>>> btrfs fi df /var/media/
>>>>>> Data, RAID5: total=4.69TiB, used=4.68TiB
>>>>>> System, RAID1C3: total=32.00MiB, used=304.00KiB
>>>>>> Metadata, RAID1C3: total=6.00GiB, used=4.94GiB
>>>>>> GlobalReserve, single: total=512.00MiB, used=0.00B
>>>>>>
>>>>>> I can only mount -o degraded now. Here are the logs when mounting:
>>>>>>
>>>>>> Aug 01 01:15:26 spaceman.fandingo.org sudo[275572]: justin : TTY=pts/0
>>>>>> ; PWD=/home/justin ; USER=root ; COMMAND=/usr/bin/mount -t btrfs -o
>>>>>> degraded /dev/sda1 /var/media/
>>>>>> Aug 01 01:15:26 spaceman.fandingo.org kernel: sd 5:0:0:0: [sdf] tag#30
>>>>>> access beyond end of device
>>>>>> Aug 01 01:15:26 spaceman.fandingo.org kernel: blk_update_request: I/O
>>>>>> error, dev sdf, sector 2176 op 0x0:(READ) flags 0x0 phys_seg 1 prio
>>>>>> class 0
>>>>>
>>>>> OK, it's read, not DISCARD, thus a completely different problem.
>>>>>
>>>>>
>>>>>> Aug 01 01:15:26 spaceman.fandingo.org kernel: Buffer I/O error on dev
>>>>>> sdf1, logical block 16, async page read
>>>>>> Aug 01 01:15:26 spaceman.fandingo.org kernel: BTRFS info (device
>>>>>> sde1): allowing degraded mounts
>>>>>> Aug 01 01:15:26 spaceman.fandingo.org kernel: BTRFS info (device
>>>>>> sde1): disk space caching is enabled
>>>>>> Aug 01 01:15:26 spaceman.fandingo.org kernel: BTRFS warning (device
>>>>>> sde1): devid 1 uuid cb05aae6-6c03-49d3-b46d-bf51a0eb8cd0 is missing
>>>>>> Aug 01 01:15:26 spaceman.fandingo.org kernel: BTRFS info (device
>>>>>> sde1): bdev /dev/sdf1 errs: wr 4458026, rd 14571, flush 0, corrupt 0,
>>>>>> gen 0
>>>>>>
>>>>>> It seems like only relatively recently written files are encountering
>>>>>> I/O errors. If I `cat` one of the problematic files when the FS is
>>>>>> mounted normally, I see a ton of this:
>>>>>>
>>>>>> Aug 01 01:13:49 spaceman.fandingo.org kernel: sd 5:0:0:0: [sdf] tag#26
>>>>>> access beyond end of device
>>>>>> Aug 01 01:13:49 spaceman.fandingo.org kernel: sd 5:0:0:0: [sdf] tag#27
>>>>>> access beyond end of device
>>>>>> Aug 01 01:13:49 spaceman.fandingo.org kernel: sd 5:0:0:0: [sdf] tag#28
>>>>>> access beyond end of device
>>>>>> Aug 01 01:13:49 spaceman.fandingo.org kernel: sd 5:0:0:0: [sdf] tag#29
>>>>>> access beyond end of device
>>>>>> Aug 01 01:13:49 spaceman.fandingo.org kernel: sd 5:0:0:0: [sdf] tag#30
>>>>>> access beyond end of device
>>>>>> Aug 01 01:13:49 spaceman.fandingo.org kernel: sd 5:0:0:0: [sdf] tag#0
>>>>>> access beyond end of device
>>>>>> Aug 01 01:13:49 spaceman.fandingo.org kernel: sd 5:0:0:0: [sdf] tag#1
>>>>>> access beyond end of device
>>>>>> Aug 01 01:13:49 spaceman.fandingo.org kernel: sd 5:0:0:0: [sdf] tag#13
>>>>>> access beyond end of device
>>>>>> Aug 01 01:13:49 spaceman.fandingo.org kernel: sd 5:0:0:0: [sdf] tag#2
>>>>>> access beyond end of device
>>>>>>
>>>>>> Now that I'm remounted in -o degraded, I'm getting more comprehensible
>>>>>> warnings, but it still results in I/O read failures:
>>>>>>
>>>>>> Aug 01 01:31:53 spaceman.fandingo.org kernel: BTRFS warning (device
>>>>>> sde1): csum failed root 2820 ino 747435 off 99942400 csum 0x8941f998
>>>>>> expected csum 0xbe3f80a4 mirror 2
>>>>>> Aug 01 01:31:53 spaceman.fandingo.org kernel: BTRFS warning (device
>>>>>> sde1): csum failed root 2820 ino 747435 off 99946496 csum 0x8941f998
>>>>>> expected csum 0x9c36a6b4 mirror 2
>>>>>> Aug 01 01:31:53 spaceman.fandingo.org kernel: BTRFS warning (device
>>>>>> sde1): csum failed root 2820 ino 747435 off 99950592 csum 0x8941f998
>>>>>> expected csum 0x44d30ca2 mirror 2
>>>>>> Aug 01 01:31:53 spaceman.fandingo.org kernel: BTRFS warning (device
>>>>>> sde1): csum failed root 2820 ino 747435 off 99958784 csum 0x8941f998
>>>>>> expected csum 0xc0f08acc mirror 2
>>>>>> Aug 01 01:31:53 spaceman.fandingo.org kernel: BTRFS warning (device
>>>>>> sde1): csum failed root 2820 ino 747435 off 99954688 csum 0x8941f998
>>>>>> expected csum 0xcb11db59 mirror 2
>>>>>> Aug 01 01:31:53 spaceman.fandingo.org kernel: BTRFS warning (device
>>>>>> sde1): csum failed root 2820 ino 747435 off 99962880 csum 0x8941f998
>>>>>> expected csum 0x8a4ee0aa mirror 2
>>>>>> Aug 01 01:31:53 spaceman.fandingo.org kernel: BTRFS warning (device
>>>>>> sde1): csum failed root 2820 ino 747435 off 99971072 csum 0x8941f998
>>>>>> expected csum 0xdfb79e85 mirror 2
>>>>>> Aug 01 01:31:53 spaceman.fandingo.org kernel: BTRFS warning (device
>>>>>> sde1): csum failed root 2820 ino 747435 off 99966976 csum 0x8941f998
>>>>>> expected csum 0xc14921a0 mirror 2
>>>>>> Aug 01 01:31:53 spaceman.fandingo.org kernel: BTRFS warning (device
>>>>>> sde1): csum failed root 2820 ino 747435 off 99975168 csum 0x8941f998
>>>>>> expected csum 0xf2fe8774 mirror 2
>>>>>> Aug 01 01:31:53 spaceman.fandingo.org kernel: BTRFS warning (device
>>>>>> sde1): csum failed root 2820 ino 747435 off 99979264 csum 0x8941f998
>>>>>> expected csum 0xae1cafd6 mirror 2
>>>>>>
>>>>>> While trying to research this problem, I came across a GitHub issue
>>>>>> https://github.com/kdave/btrfs-progs/issues/282 and a patch from Qu
>>>>>> from yesterday ([PATCH] btrfs: trim: fix underflow in trim length to
>>>>>> prevent access beyond device boundary). I do use the discard mount
>>>>>> option, and I have a weekly fstrim.timer enabled. I did replace 2x2TB
>>>>>> drives with the 2x8TB drives about 1 month ago, which involved a
>>>>>> conversion to -d raid5 -m raid1c3, which I suppose could hit the same
>>>>>> code paths that resize2fs would?
>>>>>
>>>>> The problem doesn't look like a trim one, but more likely some device
>>>>> boundary bug.
>>>>>
>>>>> Would you please provide the following info?
>>>>> - btrfs ins dump-tree -t chunk /dev/sde1
>>>>>   This contains the device info and chunk tree dump. Doesn't contain
>>>>>   any confidential info.
>>>>>   We can use this info to determine if there is some chunk really beyond
>>>>>   device boundary.
>>>>>   I guess some chunks are somehow already beyond the device boundary.
>>>>
>>>> And `lsblk -b` output.
>>>>
>>>> It may be possible that device size in btrfs doesn't match with the real
>>>> device...
>>>>>
>>>>> Thanks,
>>>>> Qu
>>>>>
>>>>>>
>>>>>> Any advice on how to proceed would be greatly appreciated.
>>>>>>
>>>>>> Thanks,
>>>>>> Justin
>>>>>>
>>>>>
>>>>
>>



* Re: Access Beyond End of Device & Input/Output Errors
  2020-08-01 23:30           ` Qu Wenruo
@ 2020-09-06  1:42             ` Justin Brown
  0 siblings, 0 replies; 7+ messages in thread
From: Justin Brown @ 2020-09-06  1:42 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: linux-btrfs

Hi Qu,

Sorry for the late reply. I've had this system powered off since we
last talked, so no actions taken.

Yes, /dev/sde is dropping out occasionally, but these errors happen
regardless of whether it's in the array or not. Once the disk drops
out, it's completely gone until a reboot (no response from fdisk -l,
btrfs dev scan, etc.).

The disk was manufactured in 2014, so it's quite old, and the
motherboard, CPU, and integrated SATA controller are about a year older
than that. SMART data on that disk don't indicate any serious
failures. I should probably replace that disk, or maybe just drop it
from the array. However, I'm concerned about the migration path. Any
sort of btrfs remove and btrfs add for new disks will require a btrfs
balance to maintain redundancy. The "access beyond end of device"
errors have shown different disks, not just /dev/sde (most kernel
messages are about sdf, but maybe that's just how messages are
logged), which makes me concerned that my problem isn't related to a
single disk and that any attempt at a balance could be catastrophic.

What's the best way to get this FS back to a healthy state?

Thanks,
Justin


On Sat, Aug 1, 2020 at 6:30 PM Qu Wenruo <quwenruo.btrfs@gmx.com> wrote:
>
>
>
> On 2020/8/1 7:56 PM, Justin Brown wrote:
> > Hi Qu,
> >
> > Thanks for your continued help.
> >
> > dump-super:
> >
> > for i in a b d e f g; do x=$(sudo btrfs ins dump-super /dev/sd${i}1 |
> > grep dev_item.uuid | cut -f 3); echo "/dev/sd${i}1 $x"; done
> > /dev/sda1 cc3f9a00-bd69-4ceb-b6e5-4fb874be2aaf
> > /dev/sdb1 27e1cf24-9349-4f72-a23b-86668b2a9e78
> > /dev/sdd1 601d409e-8ffd-489c-91af-daf3e0cc9bd2
> > /dev/sde1 2908ebfb-e6b5-4991-b25d-32d1487ff6a4
> > /dev/sdf1 cb05aae6-6c03-49d3-b46d-bf51a0eb8cd0
>
> They match with the device size. So no chunk item beyond device boundary.
>
> >
> > btrfs check:
> >
> > sudo btrfs check /dev/sda1
> > Opening filesystem to check...
> > Checking filesystem on /dev/sda1
> > UUID: 51eef0c7-2977-4037-b271-3270ea22c7d9
> > [1/7] checking root items
> > [2/7] checking extents
> ...
> > failed to load free space cache for block group 92568662507520
> > failed to load free space cache for block group 92574031216640
> > ...
> > failed to load free space cache for block group 97722656817152
> > failed to load free space cache for block group 97728025526272
>
> This is interesting. Maybe that's related to the problem?
>
> > [4/7] checking fs roots
> > [5/7] checking only csums items (without verifying data)
> > [6/7] checking root refs
> > [7/7] checking quota groups skipped (not enabled on this FS)
>
> Great that all metadata are fine.
>
> > found 5148381876224 bytes used, no error found
> > total csum bytes: 4998903140
> > total tree bytes: 5301813248
> > total fs tree bytes: 96894976
> > total extent tree bytes: 41910272
> > btree space waste bytes: 135561977
> > file data blocks allocated: 8972043898880
> > referenced 5113155596288
> >
> > The alignment issue would be confined to performance, correct?
>
> Yep, only related to performance and some noisy warning for newer kernel.
> Not a big problem yet.
>
> Since btrfs check reports no obvious problem other than the free space
> cache failures, maybe btrfs check --clear-space-cache v1 is worth trying.
>
> BTW, since the current kernel and btrfs-progs don't do strict chunk
> checks against the device boundary, I'll add such checks to both kernel
> and progs soon.
>
> In the meantime, I also see the following dmesg line showing that the
> kernel failed to detect one device:
>
>   Aug 01 01:15:26 spaceman.fandingo.org kernel: BTRFS warning (device
>   sde1): devid 1 uuid cb05aae6-6c03-49d3-b46d-bf51a0eb8cd0 is missing
>
> Can you reproduce that problem? And if so, maybe try "btrfs device scan"
> and then mount again?
>
> Thanks,
> Qu
>
> >
> > Thanks,
> > Justin
> >
> > /dev/sdg1 1b938c84-eafd-4396-b06c-8a5bf1339840
> >
> > On Sat, Aug 1, 2020 at 4:31 AM Qu Wenruo <quwenruo.btrfs@gmx.com> wrote:
> >>
> >>
> >>
> >> On 2020/8/1 4:30 PM, Justin Brown wrote:
> >>> Hi Qu,
> >>>
> >>> Thanks for the help.
> >>>
> >>> Here is the lsblk -b output:
> >>>
> >>> NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
> >>> sda 8:0 0 2000398934016 0 disk
> >>> └─sda1 8:1 0 2000397868544 0 part
> >>> sdb 8:16 0 8001563222016 0 disk
> >>> └─sdb1 8:17 0 8001562156544 0 part
> >>> sdc 8:32 0 120034123776 0 disk
> >>> ├─sdc1 8:33 0 1048576 0 part
> >>> ├─sdc2 8:34 0 524288000 0 part /boot
> >>> └─sdc3 8:35 0 119507255296 0 part /home
> >>> sdd 8:48 0 8001563222016 0 disk
> >>> └─sdd1 8:49 0 8001562156544 0 part
> >>> sde 8:64 0 2000398934016 0 disk
> >>> └─sde1 8:65 0 2000397868544 0 part
> >>> sdf 8:80 0 2000398934016 0 disk
> >>> └─sdf1 8:81 0 2000397868544 0 part /var/media
> >>> sdg 8:96 1 2000398934016 0 disk
> >>> └─sdg1 8:97 1 2000397868544 0 part
> >>>
> >>> The `btrfs ins...` output is quite long. I've attached it as a txt and
> >>> also uploaded it at
> >>> https://gist.github.com/fandingo/aa345d6c6fa97162f810e86c9ab20d6a
> >>
> >>
> >> Thanks, this already shows some device size difference.
> >>
> >> But all of them are in fact just a little smaller than device size, thus
> >> it should be fine.
> >>
> >> Another problem I found is that either the size or the start of some
> >> partitions does not look aligned to 4K.
> >>
> >> That may be a problem for 4K-aligned hard disks, so it may be worth some
> >> attention after solving the btrfs problem.
> >>
> >> Would you please also provide some extra dumps?
> >> - btrfs check /dev/sda1
> >>   It should detect any problems I missed.
> >>
> >> - btrfs ins dump-super <device> | grep dev_item.uuid
> >>   It's a little hard to tell which device belongs to which device ID,
> >>   so we need this dump from each btrfs device to make sure.
> >>
> >> Thanks,
> >> Qu
> >>
> >>
> >>>
> >>> Thanks,
> >>> Justin
> >>>
> >>> On Sat, Aug 1, 2020 at 2:02 AM Qu Wenruo <quwenruo.btrfs@gmx.com> wrote:
> >>>>
> >>>>
> >>>>
> >>>> On 2020/8/1 2:58 PM, Qu Wenruo wrote:
> >>>>>
> >>>>>
> >>>>> On 2020/8/1 2:51 PM, Justin Brown wrote:
> >>>>>> Hello,
> >>>>>>
> >>>>>> I've run into a strange problem that I haven't seen before, and I need
> >>>>>> some help. I started getting generic "input/output" errors on a couple
> >>>>>> of files, and when I looked deeper, the kernel logs are full of
> >>>>>> messages like:
> >>>>>>
> >>>>>>     sd 5:0:0:0: [sdf] tag#29 access beyond end of device
> >>>>>
> >>>>> We had a new fix for trim. But according to your kernel message, it
> >>>>> doesn't look like the case.
> >>>>>
> >>>>> (No obvious tag showing it's trim/discard)
> >>>>>
> >>>>>>
> >>>>>> I've never seen anything like this before with any FS, so I figured it
> >>>>>> was worth asking before I consider running the standard btrfs tools.
> >>>>>> (I briefly started a scrub, but it was going crazy with uncorrectable
> >>>>>> errors, so I cancelled it.)
> >>>>>>
> >>>>>> Here's my system info:
> >>>>>>
> >>>>>> Fedora 32, kernel 5.7.7-200.fc32.x86_64
> >>>>>> btrfs-progs v5.7
> >>>>>>
> >>>>>> /etc/fstab entry:
> >>>>>> LABEL=media /var/media btrfs subvol=media,discard 0 2
> >>>>>>
> >>>>>> btrfs fi show /var/media/
> >>>>>> Label: 'media' uuid: 51eef0c7-2977-4037-b271-3270ea22c7d9
> >>>>>> Total devices 6 FS bytes used 4.68TiB
> >>>>>> devid 1 size 1.82TiB used 963.00GiB path /dev/sdf1
> >>>>>> devid 2 size 1.82TiB used 962.00GiB path /dev/sde1
> >>>>>> devid 4 size 1.82TiB used 963.00GiB path /dev/sdg1
> >>>>>> devid 6 size 1.82TiB used 962.03GiB path /dev/sda1
> >>>>>> devid 7 size 7.28TiB used 967.03GiB path /dev/sdb1
> >>>>>> devid 8 size 7.28TiB used 967.03GiB path /dev/sdd1
> >>>>>>
> >>>>>> btrfs fi df /var/media/
> >>>>>> Data, RAID5: total=4.69TiB, used=4.68TiB
> >>>>>> System, RAID1C3: total=32.00MiB, used=304.00KiB
> >>>>>> Metadata, RAID1C3: total=6.00GiB, used=4.94GiB
> >>>>>> GlobalReserve, single: total=512.00MiB, used=0.00B
> >>>>>>
> >>>>>> I can only mount -o degraded now. Here are the logs when mounting:
> >>>>>>
> >>>>>> Aug 01 01:15:26 spaceman.fandingo.org sudo[275572]: justin : TTY=pts/0
> >>>>>> ; PWD=/home/justin ; USER=root ; COMMAND=/usr/bin/mount -t btrfs -o
> >>>>>> degraded /dev/sda1 /var/media/
> >>>>>> Aug 01 01:15:26 spaceman.fandingo.org kernel: sd 5:0:0:0: [sdf] tag#30
> >>>>>> access beyond end of device
> >>>>>> Aug 01 01:15:26 spaceman.fandingo.org kernel: blk_update_request: I/O
> >>>>>> error, dev sdf, sector 2176 op 0x0:(READ) flags 0x0 phys_seg 1 prio
> >>>>>> class 0
> >>>>>
> >>>>> OK, it's read, not DISCARD, thus a completely different problem.
> >>>>>
> >>>>>
> >>>>>> Aug 01 01:15:26 spaceman.fandingo.org kernel: Buffer I/O error on dev
> >>>>>> sdf1, logical block 16, async page read
> >>>>>> Aug 01 01:15:26 spaceman.fandingo.org kernel: BTRFS info (device
> >>>>>> sde1): allowing degraded mounts
> >>>>>> Aug 01 01:15:26 spaceman.fandingo.org kernel: BTRFS info (device
> >>>>>> sde1): disk space caching is enabled
> >>>>>> Aug 01 01:15:26 spaceman.fandingo.org kernel: BTRFS warning (device
> >>>>>> sde1): devid 1 uuid cb05aae6-6c03-49d3-b46d-bf51a0eb8cd0 is missing
> >>>>>> Aug 01 01:15:26 spaceman.fandingo.org kernel: BTRFS info (device
> >>>>>> sde1): bdev /dev/sdf1 errs: wr 4458026, rd 14571, flush 0, corrupt 0,
> >>>>>> gen 0
> >>>>>>
> >>>>>> It seems like only relatively recently written files are encountering
> >>>>>> I/O errors. If I `cat` one of the problematic files when the FS is
> >>>>>> mounted normally, I see a ton of this:
> >>>>>>
> >>>>>> Aug 01 01:13:49 spaceman.fandingo.org kernel: sd 5:0:0:0: [sdf] tag#26
> >>>>>> access beyond end of device
> >>>>>> Aug 01 01:13:49 spaceman.fandingo.org kernel: sd 5:0:0:0: [sdf] tag#27
> >>>>>> access beyond end of device
> >>>>>> Aug 01 01:13:49 spaceman.fandingo.org kernel: sd 5:0:0:0: [sdf] tag#28
> >>>>>> access beyond end of device
> >>>>>> Aug 01 01:13:49 spaceman.fandingo.org kernel: sd 5:0:0:0: [sdf] tag#29
> >>>>>> access beyond end of device
> >>>>>> Aug 01 01:13:49 spaceman.fandingo.org kernel: sd 5:0:0:0: [sdf] tag#30
> >>>>>> access beyond end of device
> >>>>>> Aug 01 01:13:49 spaceman.fandingo.org kernel: sd 5:0:0:0: [sdf] tag#0
> >>>>>> access beyond end of device
> >>>>>> Aug 01 01:13:49 spaceman.fandingo.org kernel: sd 5:0:0:0: [sdf] tag#1
> >>>>>> access beyond end of device
> >>>>>> Aug 01 01:13:49 spaceman.fandingo.org kernel: sd 5:0:0:0: [sdf] tag#13
> >>>>>> access beyond end of device
> >>>>>> Aug 01 01:13:49 spaceman.fandingo.org kernel: sd 5:0:0:0: [sdf] tag#2
> >>>>>> access beyond end of device
> >>>>>>
> >>>>>> Now that I'm remounted in -o degraded, I'm getting more comprehensible
> >>>>>> warnings, but it still results in I/O read failures:
> >>>>>>
> >>>>>> Aug 01 01:31:53 spaceman.fandingo.org kernel: BTRFS warning (device
> >>>>>> sde1): csum failed root 2820 ino 747435 off 99942400 csum 0x8941f998
> >>>>>> expected csum 0xbe3f80a4 mirror 2
> >>>>>> Aug 01 01:31:53 spaceman.fandingo.org kernel: BTRFS warning (device
> >>>>>> sde1): csum failed root 2820 ino 747435 off 99946496 csum 0x8941f998
> >>>>>> expected csum 0x9c36a6b4 mirror 2
> >>>>>> Aug 01 01:31:53 spaceman.fandingo.org kernel: BTRFS warning (device
> >>>>>> sde1): csum failed root 2820 ino 747435 off 99950592 csum 0x8941f998
> >>>>>> expected csum 0x44d30ca2 mirror 2
> >>>>>> Aug 01 01:31:53 spaceman.fandingo.org kernel: BTRFS warning (device
> >>>>>> sde1): csum failed root 2820 ino 747435 off 99958784 csum 0x8941f998
> >>>>>> expected csum 0xc0f08acc mirror 2
> >>>>>> Aug 01 01:31:53 spaceman.fandingo.org kernel: BTRFS warning (device
> >>>>>> sde1): csum failed root 2820 ino 747435 off 99954688 csum 0x8941f998
> >>>>>> expected csum 0xcb11db59 mirror 2
> >>>>>> Aug 01 01:31:53 spaceman.fandingo.org kernel: BTRFS warning (device
> >>>>>> sde1): csum failed root 2820 ino 747435 off 99962880 csum 0x8941f998
> >>>>>> expected csum 0x8a4ee0aa mirror 2
> >>>>>> Aug 01 01:31:53 spaceman.fandingo.org kernel: BTRFS warning (device
> >>>>>> sde1): csum failed root 2820 ino 747435 off 99971072 csum 0x8941f998
> >>>>>> expected csum 0xdfb79e85 mirror 2
> >>>>>> Aug 01 01:31:53 spaceman.fandingo.org kernel: BTRFS warning (device
> >>>>>> sde1): csum failed root 2820 ino 747435 off 99966976 csum 0x8941f998
> >>>>>> expected csum 0xc14921a0 mirror 2
> >>>>>> Aug 01 01:31:53 spaceman.fandingo.org kernel: BTRFS warning (device
> >>>>>> sde1): csum failed root 2820 ino 747435 off 99975168 csum 0x8941f998
> >>>>>> expected csum 0xf2fe8774 mirror 2
> >>>>>> Aug 01 01:31:53 spaceman.fandingo.org kernel: BTRFS warning (device
> >>>>>> sde1): csum failed root 2820 ino 747435 off 99979264 csum 0x8941f998
> >>>>>> expected csum 0xae1cafd6 mirror 2
> >>>>>>
> >>>>>> While trying to research this problem, I came across a GitHub issue
> >>>>>> https://github.com/kdave/btrfs-progs/issues/282 and a patch from Qu
> >>>>>> from yesterday ([PATCH] btrfs: trim: fix underflow in trim length to
> >>>>>> prevent access beyond device boundary). I do use the discard mount
> >>>>>> option, and I have a weekly fstrim.timer enabled. I did replace 2x2TB
> >>>>>> drives with the 2x8TB drives about 1 month ago, which involved a
> >>>>>> conversion to -d raid5 -m raid1c3, which I suppose could hit the same
> >>>>>> code paths that resize2fs would?
> >>>>>
> >>>>> The problem doesn't look like a trim one, but more likely some device
> >>>>> boundary bug.
> >>>>>
> >>>>> Would you please provide the following info?
> >>>>> - btrfs ins dump-tree -t chunk /dev/sde1
> >>>>>   This contains the device info and chunk tree dump. Doesn't contain
> >>>>>   any confidential info.
> >>>>>   We can use this info to determine whether some chunk really lies
> >>>>>   beyond the device boundary.
> >>>>>   I guess some chunks somehow already extend beyond the device boundary.
> >>>>
> >>>> And `lsblk -b` output.
> >>>>
> >>>> It's possible that the device size recorded in btrfs doesn't match the
> >>>> real device...
> >>>>>
> >>>>> Thanks,
> >>>>> Qu
> >>>>>
> >>>>>>
> >>>>>> Any advice on how to proceed would be greatly appreciated.
> >>>>>>
> >>>>>> Thanks,
> >>>>>> Justin
> >>>>>>
> >>>>>
> >>>>
> >>
>
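
A note on the quoted mount log: the failed read at sector 2176 on sdf and
the buffer error at logical block 16 on sdf1 appear to describe the same
location. Below is a minimal sketch of that arithmetic. It assumes 512-byte
kernel sectors, a 4096-byte block size behind "logical block 16", and the
primary btrfs superblock at 64 KiB. The function names are illustrative
and do not come from the thread.

```python
SECTOR_SIZE = 512                     # blk_update_request reports 512-byte sectors
BLOCK_SIZE = 4096                     # assumed block size behind "logical block 16"
BTRFS_SUPERBLOCK_OFFSET = 64 * 1024   # offset of the primary btrfs superblock


def block_to_byte_offset(block, block_size=BLOCK_SIZE):
    """Byte offset of a logical block inside the partition."""
    return block * block_size


def partition_start_sector(disk_sector, partition_block, block_size=BLOCK_SIZE):
    """Infer where the partition starts on the whole disk, in sectors,
    given the same I/O expressed as a disk sector and a partition block."""
    offset_in_partition = block_to_byte_offset(partition_block, block_size) // SECTOR_SIZE
    return disk_sector - offset_in_partition


# Values taken from the quoted log: sector 2176 on sdf, logical block 16 on sdf1.
byte_offset = block_to_byte_offset(16)
start = partition_start_sector(2176, 16)
print(byte_offset == BTRFS_SUPERBLOCK_OFFSET, start)
```

Under these assumptions, the read that fails during mount is the primary
superblock of sdf1, and the partition would start at sector 2048.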

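The check Qu asks for (chunk tree dump plus `lsblk -b`) comes down to
comparing the size btrfs has recorded for each device, and the end of each
stripe on it, against the real block device size. Below is a rough sketch
with made-up numbers and hypothetical names (`Stripe`, `size_mismatch`,
`stripes_beyond_end`). It does not parse any tool output, so the values
would have to be copied in by hand from the dump and from `lsblk -b`.

```python
from dataclasses import dataclass


@dataclass
class Stripe:
    devid: int    # btrfs device id the stripe lives on
    offset: int   # physical start on that device, in bytes
    length: int   # bytes the stripe occupies on that device


def size_mismatch(recorded_total_bytes, real_size_bytes):
    """Bytes that btrfs believes exist past the real end of the device (0 if none)."""
    return max(0, recorded_total_bytes - real_size_bytes)


def stripes_beyond_end(stripes, devid, real_size_bytes):
    """Stripes on the given device whose end lies past the real device size."""
    return [
        s for s in stripes
        if s.devid == devid and s.offset + s.length > real_size_bytes
    ]


# Illustrative numbers only; none of these come from the thread.
GiB = 1024 ** 3
real_size = 1000 * GiB
stripes = [
    Stripe(devid=1, offset=10 * GiB, length=1 * GiB),
    Stripe(devid=1, offset=999 * GiB + GiB // 2, length=1 * GiB),
    Stripe(devid=2, offset=1500 * GiB, length=1 * GiB),
]
bad = stripes_beyond_end(stripes, devid=1, real_size_bytes=real_size)
for s in bad:
    print(f"devid {s.devid}: stripe ends {s.offset + s.length - real_size} bytes past the device")
```

A non-zero `size_mismatch` or a non-empty result from `stripes_beyond_end`
would point at the kind of device boundary problem Qu suspects.
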

Thread overview: 7+ messages
2020-08-01  6:51 Access Beyond End of Device & Input/Output Errors Justin Brown
2020-08-01  6:58 ` Qu Wenruo
2020-08-01  7:02   ` Qu Wenruo
     [not found]     ` <CAKZK7uzmg19NDjGPPAxXKu7LJ-7ZdHu2cad22csj_chr2qxMJg@mail.gmail.com>
2020-08-01  9:31       ` Qu Wenruo
2020-08-01 11:56         ` Justin Brown
2020-08-01 23:30           ` Qu Wenruo
2020-09-06  1:42             ` Justin Brown
