On 2020/8/1 2:58 PM, Qu Wenruo wrote:
>
>
> On 2020/8/1 2:51 PM, Justin Brown wrote:
>> Hello,
>>
>> I've run into a strange problem that I haven't seen before, and I need
>> some help. I started getting generic "input/output" errors on a couple
>> of files, and when I looked deeper, the kernel logs are full of
>> messages like:
>>
>> sd 5:0:0:0: [sdf] tag#29 access beyond end of device
>
> We had a new fix for trim. But according to your kernel message, it
> doesn't look like the case.
>
> (No obvious tag showing it's trim/discard)
>
>>
>> I've never seen anything like this before with any FS, so I figured it
>> was worth asking before I consider running the standard btrfs tools.
>> (I briefly started a scrub, but it was going crazy with uncorrectable
>> errors, so I cancelled it.)
>>
>> Here's my system info:
>>
>> Fedora 32, kernel 5.7.7-200.fc32.x86_64
>> btrfs-progs v5.7
>>
>> /etc/fstab entry:
>> LABEL=media /var/media btrfs subvol=media,discard 0 2
>>
>> btrfs fi show /var/media/
>> Label: 'media' uuid: 51eef0c7-2977-4037-b271-3270ea22c7d9
>>     Total devices 6 FS bytes used 4.68TiB
>>     devid 1 size 1.82TiB used 963.00GiB path /dev/sdf1
>>     devid 2 size 1.82TiB used 962.00GiB path /dev/sde1
>>     devid 4 size 1.82TiB used 963.00GiB path /dev/sdg1
>>     devid 6 size 1.82TiB used 962.03GiB path /dev/sda1
>>     devid 7 size 7.28TiB used 967.03GiB path /dev/sdb1
>>     devid 8 size 7.28TiB used 967.03GiB path /dev/sdd1
>>
>> btrfs fi df /var/media/
>> Data, RAID5: total=4.69TiB, used=4.68TiB
>> System, RAID1C3: total=32.00MiB, used=304.00KiB
>> Metadata, RAID1C3: total=6.00GiB, used=4.94GiB
>> GlobalReserve, single: total=512.00MiB, used=0.00B
>>
>> I can only mount -o degraded now. Here are the logs when mounting:
>>
>> Aug 01 01:15:26 spaceman.fandingo.org sudo[275572]: justin : TTY=pts/0 ; PWD=/home/justin ; USER=root ; COMMAND=/usr/bin/mount -t btrfs -o degraded /dev/sda1 /var/media/
>> Aug 01 01:15:26 spaceman.fandingo.org kernel: sd 5:0:0:0: [sdf] tag#30 access beyond end of device
>> Aug 01 01:15:26 spaceman.fandingo.org kernel: blk_update_request: I/O error, dev sdf, sector 2176 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0
>
> OK, it's read, not DISCARD, thus a completely different problem.
>
>
>> Aug 01 01:15:26 spaceman.fandingo.org kernel: Buffer I/O error on dev sdf1, logical block 16, async page read
>> Aug 01 01:15:26 spaceman.fandingo.org kernel: BTRFS info (device sde1): allowing degraded mounts
>> Aug 01 01:15:26 spaceman.fandingo.org kernel: BTRFS info (device sde1): disk space caching is enabled
>> Aug 01 01:15:26 spaceman.fandingo.org kernel: BTRFS warning (device sde1): devid 1 uuid cb05aae6-6c03-49d3-b46d-bf51a0eb8cd0 is missing
>> Aug 01 01:15:26 spaceman.fandingo.org kernel: BTRFS info (device sde1): bdev /dev/sdf1 errs: wr 4458026, rd 14571, flush 0, corrupt 0, gen 0
>>
>> It seems like only relatively recently written files are encountering
>> I/O errors.
>> If I `cat` one of the problematic files when the FS is
>> mounted normally, I see a ton of this:
>>
>> Aug 01 01:13:49 spaceman.fandingo.org kernel: sd 5:0:0:0: [sdf] tag#26 access beyond end of device
>> Aug 01 01:13:49 spaceman.fandingo.org kernel: sd 5:0:0:0: [sdf] tag#27 access beyond end of device
>> Aug 01 01:13:49 spaceman.fandingo.org kernel: sd 5:0:0:0: [sdf] tag#28 access beyond end of device
>> Aug 01 01:13:49 spaceman.fandingo.org kernel: sd 5:0:0:0: [sdf] tag#29 access beyond end of device
>> Aug 01 01:13:49 spaceman.fandingo.org kernel: sd 5:0:0:0: [sdf] tag#30 access beyond end of device
>> Aug 01 01:13:49 spaceman.fandingo.org kernel: sd 5:0:0:0: [sdf] tag#0 access beyond end of device
>> Aug 01 01:13:49 spaceman.fandingo.org kernel: sd 5:0:0:0: [sdf] tag#1 access beyond end of device
>> Aug 01 01:13:49 spaceman.fandingo.org kernel: sd 5:0:0:0: [sdf] tag#13 access beyond end of device
>> Aug 01 01:13:49 spaceman.fandingo.org kernel: sd 5:0:0:0: [sdf] tag#2 access beyond end of device
>>
>> Now that I'm remounted in -o degraded, I'm getting more comprehensible
>> warnings, but it still results in I/O read failures:
>>
>> Aug 01 01:31:53 spaceman.fandingo.org kernel: BTRFS warning (device sde1): csum failed root 2820 ino 747435 off 99942400 csum 0x8941f998 expected csum 0xbe3f80a4 mirror 2
>> Aug 01 01:31:53 spaceman.fandingo.org kernel: BTRFS warning (device sde1): csum failed root 2820 ino 747435 off 99946496 csum 0x8941f998 expected csum 0x9c36a6b4 mirror 2
>> Aug 01 01:31:53 spaceman.fandingo.org kernel: BTRFS warning (device sde1): csum failed root 2820 ino 747435 off 99950592 csum 0x8941f998 expected csum 0x44d30ca2 mirror 2
>> Aug 01 01:31:53 spaceman.fandingo.org kernel: BTRFS warning (device sde1): csum failed root 2820 ino 747435 off 99958784 csum 0x8941f998 expected csum 0xc0f08acc mirror 2
>> Aug 01 01:31:53 spaceman.fandingo.org kernel: BTRFS warning (device sde1): csum failed root 2820 ino 747435 off 99954688 csum 0x8941f998 expected csum 0xcb11db59 mirror 2
>> Aug 01 01:31:53 spaceman.fandingo.org kernel: BTRFS warning (device sde1): csum failed root 2820 ino 747435 off 99962880 csum 0x8941f998 expected csum 0x8a4ee0aa mirror 2
>> Aug 01 01:31:53 spaceman.fandingo.org kernel: BTRFS warning (device sde1): csum failed root 2820 ino 747435 off 99971072 csum 0x8941f998 expected csum 0xdfb79e85 mirror 2
>> Aug 01 01:31:53 spaceman.fandingo.org kernel: BTRFS warning (device sde1): csum failed root 2820 ino 747435 off 99966976 csum 0x8941f998 expected csum 0xc14921a0 mirror 2
>> Aug 01 01:31:53 spaceman.fandingo.org kernel: BTRFS warning (device sde1): csum failed root 2820 ino 747435 off 99975168 csum 0x8941f998 expected csum 0xf2fe8774 mirror 2
>> Aug 01 01:31:53 spaceman.fandingo.org kernel: BTRFS warning (device sde1): csum failed root 2820 ino 747435 off 99979264 csum 0x8941f998 expected csum 0xae1cafd6 mirror 2
>>
>> While trying to research this problem, I came across a GitHub issue
>> https://github.com/kdave/btrfs-progs/issues/282 and a patch from Qu
>> from yesterday ([PATCH] btrfs: trim: fix underflow in trim length to
>> prevent access beyond device boundary). I do use the discard mount
>> option, and I have a weekly fstrim.timer enabled. I did replace 2x2TB
>> drives with the 2x8TB drives about 1 month ago, which involved a
>> conversion to -d raid5 -m raid1c3, which I suppose could hit the same
>> code paths that resize2fs would?
>
> The problem doesn't look like a trim one, but more likely some device
> boundary bug.
>
> Would you please provide the following info?
> - btrfs ins dump-tree -t chunk /dev/sde1
>   This contains the device info and chunk tree dump. Doesn't contain
>   any confidential info.
>   We can use this info to determine if there is some chunk really beyond
>   device boundary.
>   I guess some chunks are somehow already beyond the device boundary.

And `lsblk -b` output.
It may be possible that the device size recorded in btrfs doesn't match
the real device...

>
> Thanks,
> Qu
>
>>
>> Any advice on how to proceed would be greatly appreciated.
>>
>> Thanks,
>> Justin
>>
>
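
For anyone following along, the size comparison suggested above can be sketched roughly as below. This is only an illustration, not a btrfs tool: the helper names are hypothetical, the 512-byte sector size and 4 KiB buffer block size are assumptions, and the 2 TB figure is a nominal placeholder rather than a value measured on the affected system. The only inputs taken from the thread are sector 2176 on sdf and logical block 16 on sdf1.

```python
SECTOR_SIZE = 512           # assumption: 512-byte kernel sectors
BTRFS_SUPER_OFFSET = 65536  # primary btrfs superblock sits at 64 KiB


def implied_partition_start(disk_sector, logical_block, block_size=4096):
    """Partition start (in sectors) implied by a whole-disk sector and the
    partition-relative logical block from the "Buffer I/O error" line.
    block_size=4096 is an assumption about the buffer block size."""
    return disk_sector - (logical_block * block_size) // SECTOR_SIZE


def is_beyond_end(disk_sector, nr_sectors, device_bytes):
    """True if a request of nr_sectors starting at disk_sector would run
    past the end of a device that is device_bytes long."""
    return (disk_sector + nr_sectors) * SECTOR_SIZE > device_bytes


def oversized_devices(btrfs_bytes, lsblk_bytes):
    """Devices whose size as recorded by btrfs exceeds what the block layer
    reports. Both arguments map a device path to a size in bytes; a device
    missing from lsblk_bytes is treated as size 0."""
    return {
        path: (recorded, lsblk_bytes.get(path, 0))
        for path, recorded in btrfs_bytes.items()
        if recorded > lsblk_bytes.get(path, 0)
    }


if __name__ == "__main__":
    # From the quoted log: sector 2176 on sdf, logical block 16 on sdf1.
    print(implied_partition_start(2176, 16))
    # Logical block 16 at 4 KiB per block is the 64 KiB superblock offset.
    print(16 * 4096 == BTRFS_SUPER_OFFSET)
    # Hypothetical nominal size for a 2 TB disk, not a measured value.
    print(is_beyond_end(2176, 8, 2_000_000_000_000))
```

If `is_beyond_end` comes back false when fed the real byte count from `lsblk -b`, then a read at sector 2176 should fit comfortably on a healthy disk of that size, which would point toward the kernel currently seeing a much smaller capacity for sdf than expected. That is only a guess from the logs; the chunk tree dump and the `lsblk -b` output are what would actually settle it.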