On 2020/8/1 7:56 PM, Justin Brown wrote:
> Hi Qu,
>
> Thanks for your continued help.
>
> dump-super:
>
> for i in a b d e f g; do x=$(sudo btrfs ins dump-super /dev/sd${i}1 |
> grep dev_item.uuid | cut -f 3); echo "/dev/sd${i}1 $x"; done
> /dev/sda1 cc3f9a00-bd69-4ceb-b6e5-4fb874be2aaf
> /dev/sdb1 27e1cf24-9349-4f72-a23b-86668b2a9e78
> /dev/sdd1 601d409e-8ffd-489c-91af-daf3e0cc9bd2
> /dev/sde1 2908ebfb-e6b5-4991-b25d-32d1487ff6a4
> /dev/sdf1 cb05aae6-6c03-49d3-b46d-bf51a0eb8cd0

They match the device size, so there is no chunk item beyond the device
boundary.

>
> btrfs check:
>
> sudo btrfs check /dev/sda1
> Opening filesystem to check...
> Checking filesystem on /dev/sda1
> UUID: 51eef0c7-2977-4037-b271-3270ea22c7d9
> [1/7] checking root items
> [2/7] checking extents ...
> failed to load free space cache for block group 92568662507520
> failed to load free space cache for block group 92574031216640
> ...
> failed to load free space cache for block group 97722656817152
> failed to load free space cache for block group 97728025526272

This is interesting. Maybe that's related to the problem?

> [4/7] checking fs roots
> [5/7] checking only csums items (without verifying data)
> [6/7] checking root refs
> [7/7] checking quota groups skipped (not enabled on this FS)

Great that all the metadata is fine.

> found 5148381876224 bytes used, no error found
> total csum bytes: 4998903140
> total tree bytes: 5301813248
> total fs tree bytes: 96894976
> total extent tree bytes: 41910272
> btree space waste bytes: 135561977
> file data blocks allocated: 8972043898880
> referenced 5113155596288
>
> The alignment issue would be confined to performance, correct?

Yep, it's only related to performance and some noisy warnings on newer
kernels. Not a big problem yet.

Since btrfs check reports no obvious problems other than the free space
cache ones, maybe "btrfs check --clear-space-cache v1" is worth trying.
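
For reference, the 4K alignment concern can be checked against the partition sizes from the `lsblk -b` output quoted further down in this thread. The following is only a minimal Python sketch, not something from btrfs-progs: the names `PARTITION_SIZES` and `misaligned` are made up for illustration, and it looks at partition sizes only, because the `lsblk -b` output as posted does not show partition start offsets.

```python
# Partition sizes in bytes, copied from the `lsblk -b` output in this thread.
# Only the btrfs member partitions are listed.
PARTITION_SIZES = {
    "sda1": 2000397868544,
    "sdb1": 8001562156544,
    "sdd1": 8001562156544,
    "sde1": 2000397868544,
    "sdf1": 2000397868544,
    "sdg1": 2000397868544,
}

ALIGNMENT = 4096  # 4 KiB


def misaligned(sizes, alignment=ALIGNMENT):
    """Return {name: remainder} for every size that is not a multiple of alignment."""
    return {
        name: size % alignment
        for name, size in sizes.items()
        if size % alignment != 0
    }


if __name__ == "__main__":
    for name, remainder in misaligned(PARTITION_SIZES).items():
        print(f"{name}: size is {remainder} bytes past a 4 KiB boundary")
```

With the sizes above, every member partition leaves a remainder of 3584 bytes (seven 512-byte sectors), which agrees with the remark that the partition sizes are not aligned to 4K.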

BTW, since the current kernel and btrfs-progs don't do strict chunk checks
against the device boundary, I'll add such checks to both the kernel and
progs soon.

In the meantime, I also see the following dmesg line showing that the
kernel failed to detect one device:

Aug 01 01:15:26 spaceman.fandingo.org kernel: BTRFS warning (device sde1): devid 1 uuid cb05aae6-6c03-49d3-b46d-bf51a0eb8cd0 is missing

Can you reproduce that problem?
And if so, maybe try "btrfs device scan" and then mount again?

Thanks,
Qu

>
> Thanks,
> Justin
>
> /dev/sdg1 1b938c84-eafd-4396-b06c-8a5bf1339840
>
> On Sat, Aug 1, 2020 at 4:31 AM Qu Wenruo wrote:
>>
>>
>>
>> On 2020/8/1 4:30 PM, Justin Brown wrote:
>>> Hi Qu,
>>>
>>> Thanks for the help.
>>>
>>> Here is the lsblk -b output:
>>>
>>> NAME   MAJ:MIN RM          SIZE RO TYPE MOUNTPOINT
>>> sda      8:0    0 2000398934016  0 disk
>>> └─sda1   8:1    0 2000397868544  0 part
>>> sdb      8:16   0 8001563222016  0 disk
>>> └─sdb1   8:17   0 8001562156544  0 part
>>> sdc      8:32   0  120034123776  0 disk
>>> ├─sdc1   8:33   0       1048576  0 part
>>> ├─sdc2   8:34   0     524288000  0 part /boot
>>> └─sdc3   8:35   0  119507255296  0 part /home
>>> sdd      8:48   0 8001563222016  0 disk
>>> └─sdd1   8:49   0 8001562156544  0 part
>>> sde      8:64   0 2000398934016  0 disk
>>> └─sde1   8:65   0 2000397868544  0 part
>>> sdf      8:80   0 2000398934016  0 disk
>>> └─sdf1   8:81   0 2000397868544  0 part /var/media
>>> sdg      8:96   1 2000398934016  0 disk
>>> └─sdg1   8:97   1 2000397868544  0 part
>>>
>>> The `btrfs ins...` output is quite long. I've attached it as a txt and
>>> also uploaded it at
>>> https://gist.github.com/fandingo/aa345d6c6fa97162f810e86c9ab20d6a
>>
>>
>> Thanks, this already shows some device size difference.
>>
>> But all of them are in fact just a little smaller than the device size,
>> thus it should be fine.
>>
>> Another problem I found is that either the size or the start of some
>> partitions is not aligned to 4K.
>>
>> It may be a problem for 4K aligned hard disks, so it may be worth some
>> concern after solving the btrfs problem.
>>
>> Would you please also provide some extra dumps?
>> - btrfs check /dev/sda1
>>   It should detect any problems I missed.
>>
>> - btrfs ins dump-super | grep dev_item.uuid
>>   It's a little hard to find which device owns which device id.
>>   So we need this dump from each btrfs device to make sure.
>>
>> Thanks,
>> Qu
>>
>>
>>>
>>> Thanks,
>>> Justin
>>>
>>> On Sat, Aug 1, 2020 at 2:02 AM Qu Wenruo wrote:
>>>>
>>>>
>>>>
>>>> On 2020/8/1 2:58 PM, Qu Wenruo wrote:
>>>>>
>>>>>
>>>>> On 2020/8/1 2:51 PM, Justin Brown wrote:
>>>>>> Hello,
>>>>>>
>>>>>> I've run into a strange problem that I haven't seen before, and I need
>>>>>> some help. I started getting generic "input/output" errors on a couple
>>>>>> of files, and when I looked deeper, the kernel logs are full of
>>>>>> messages like:
>>>>>>
>>>>>> sd 5:0:0:0: [sdf] tag#29 access beyond end of device
>>>>>
>>>>> We had a new fix for trim. But according to your kernel message, it
>>>>> doesn't look like the case.
>>>>>
>>>>> (No obvious tag showing it's trim/discard)
>>>>>
>>>>>>
>>>>>> I've never seen anything like this before with any FS, so I figured it
>>>>>> was worth asking before I consider running the standard btrfs tools.
>>>>>> (I briefly started a scrub, but it was going crazy with uncorrectable
>>>>>> errors, so I cancelled it.)
>>>>>>
>>>>>> Here's my system info:
>>>>>>
>>>>>> Fedora 32, kernel 5.7.7-200.fc32.x86_64
>>>>>> btrfs-progs v5.7
>>>>>>
>>>>>> /etc/fstab entry:
>>>>>> LABEL=media /var/media btrfs subvol=media,discard 0 2
>>>>>>
>>>>>> btrfs fi show /var/media/
>>>>>> Label: 'media' uuid: 51eef0c7-2977-4037-b271-3270ea22c7d9
>>>>>> Total devices 6 FS bytes used 4.68TiB
>>>>>> devid 1 size 1.82TiB used 963.00GiB path /dev/sdf1
>>>>>> devid 2 size 1.82TiB used 962.00GiB path /dev/sde1
>>>>>> devid 4 size 1.82TiB used 963.00GiB path /dev/sdg1
>>>>>> devid 6 size 1.82TiB used 962.03GiB path /dev/sda1
>>>>>> devid 7 size 7.28TiB used 967.03GiB path /dev/sdb1
>>>>>> devid 8 size 7.28TiB used 967.03GiB path /dev/sdd1
>>>>>>
>>>>>> btrfs fi df /var/media/
>>>>>> Data, RAID5: total=4.69TiB, used=4.68TiB
>>>>>> System, RAID1C3: total=32.00MiB, used=304.00KiB
>>>>>> Metadata, RAID1C3: total=6.00GiB, used=4.94GiB
>>>>>> GlobalReserve, single: total=512.00MiB, used=0.00B
>>>>>>
>>>>>> I can only mount -o degraded now. Here are the logs when mounting:
>>>>>>
>>>>>> Aug 01 01:15:26 spaceman.fandingo.org sudo[275572]: justin : TTY=pts/0
>>>>>> ; PWD=/home/justin ; USER=root ; COMMAND=/usr/bin/mount -t btrfs -o
>>>>>> degraded /dev/sda1 /var/media/
>>>>>> Aug 01 01:15:26 spaceman.fandingo.org kernel: sd 5:0:0:0: [sdf] tag#30
>>>>>> access beyond end of device
>>>>>> Aug 01 01:15:26 spaceman.fandingo.org kernel: blk_update_request: I/O
>>>>>> error, dev sdf, sector 2176 op 0x0:(READ) flags 0x0 phys_seg 1 prio
>>>>>> class 0
>>>>>
>>>>> OK, it's a read, not a DISCARD, thus a completely different problem.
>>>>>
>>>>>
>>>>>> Aug 01 01:15:26 spaceman.fandingo.org kernel: Buffer I/O error on dev
>>>>>> sdf1, logical block 16, async page read
>>>>>> Aug 01 01:15:26 spaceman.fandingo.org kernel: BTRFS info (device
>>>>>> sde1): allowing degraded mounts
>>>>>> Aug 01 01:15:26 spaceman.fandingo.org kernel: BTRFS info (device
>>>>>> sde1): disk space caching is enabled
>>>>>> Aug 01 01:15:26 spaceman.fandingo.org kernel: BTRFS warning (device
>>>>>> sde1): devid 1 uuid cb05aae6-6c03-49d3-b46d-bf51a0eb8cd0 is missing
>>>>>> Aug 01 01:15:26 spaceman.fandingo.org kernel: BTRFS info (device
>>>>>> sde1): bdev /dev/sdf1 errs: wr 4458026, rd 14571, flush 0, corrupt 0,
>>>>>> gen 0
>>>>>>
>>>>>> It seems like only relatively recently written files are encountering
>>>>>> I/O errors. If I `cat` one of the problematic files when the FS is
>>>>>> mounted normally, I see a ton of this:
>>>>>>
>>>>>> Aug 01 01:13:49 spaceman.fandingo.org kernel: sd 5:0:0:0: [sdf] tag#26
>>>>>> access beyond end of device
>>>>>> Aug 01 01:13:49 spaceman.fandingo.org kernel: sd 5:0:0:0: [sdf] tag#27
>>>>>> access beyond end of device
>>>>>> Aug 01 01:13:49 spaceman.fandingo.org kernel: sd 5:0:0:0: [sdf] tag#28
>>>>>> access beyond end of device
>>>>>> Aug 01 01:13:49 spaceman.fandingo.org kernel: sd 5:0:0:0: [sdf] tag#29
>>>>>> access beyond end of device
>>>>>> Aug 01 01:13:49 spaceman.fandingo.org kernel: sd 5:0:0:0: [sdf] tag#30
>>>>>> access beyond end of device
>>>>>> Aug 01 01:13:49 spaceman.fandingo.org kernel: sd 5:0:0:0: [sdf] tag#0
>>>>>> access beyond end of device
>>>>>> Aug 01 01:13:49 spaceman.fandingo.org kernel: sd 5:0:0:0: [sdf] tag#1
>>>>>> access beyond end of device
>>>>>> Aug 01 01:13:49 spaceman.fandingo.org kernel: sd 5:0:0:0: [sdf] tag#13
>>>>>> access beyond end of device
>>>>>> Aug 01 01:13:49 spaceman.fandingo.org kernel: sd 5:0:0:0: [sdf] tag#2
>>>>>> access beyond end of device
>>>>>>
>>>>>> Now that I'm remounted in -o degraded, I'm getting more comprehensible
>>>>>> warnings, but it still results in I/O read failures:
>>>>>>
>>>>>> Aug 01 01:31:53 spaceman.fandingo.org kernel: BTRFS warning (device
>>>>>> sde1): csum failed root 2820 ino 747435 off 99942400 csum 0x8941f998
>>>>>> expected csum 0xbe3f80a4 mirror 2
>>>>>> Aug 01 01:31:53 spaceman.fandingo.org kernel: BTRFS warning (device
>>>>>> sde1): csum failed root 2820 ino 747435 off 99946496 csum 0x8941f998
>>>>>> expected csum 0x9c36a6b4 mirror 2
>>>>>> Aug 01 01:31:53 spaceman.fandingo.org kernel: BTRFS warning (device
>>>>>> sde1): csum failed root 2820 ino 747435 off 99950592 csum 0x8941f998
>>>>>> expected csum 0x44d30ca2 mirror 2
>>>>>> Aug 01 01:31:53 spaceman.fandingo.org kernel: BTRFS warning (device
>>>>>> sde1): csum failed root 2820 ino 747435 off 99958784 csum 0x8941f998
>>>>>> expected csum 0xc0f08acc mirror 2
>>>>>> Aug 01 01:31:53 spaceman.fandingo.org kernel: BTRFS warning (device
>>>>>> sde1): csum failed root 2820 ino 747435 off 99954688 csum 0x8941f998
>>>>>> expected csum 0xcb11db59 mirror 2
>>>>>> Aug 01 01:31:53 spaceman.fandingo.org kernel: BTRFS warning (device
>>>>>> sde1): csum failed root 2820 ino 747435 off 99962880 csum 0x8941f998
>>>>>> expected csum 0x8a4ee0aa mirror 2
>>>>>> Aug 01 01:31:53 spaceman.fandingo.org kernel: BTRFS warning (device
>>>>>> sde1): csum failed root 2820 ino 747435 off 99971072 csum 0x8941f998
>>>>>> expected csum 0xdfb79e85 mirror 2
>>>>>> Aug 01 01:31:53 spaceman.fandingo.org kernel: BTRFS warning (device
>>>>>> sde1): csum failed root 2820 ino 747435 off 99966976 csum 0x8941f998
>>>>>> expected csum 0xc14921a0 mirror 2
>>>>>> Aug 01 01:31:53 spaceman.fandingo.org kernel: BTRFS warning (device
>>>>>> sde1): csum failed root 2820 ino 747435 off 99975168 csum 0x8941f998
>>>>>> expected csum 0xf2fe8774 mirror 2
>>>>>> Aug 01 01:31:53 spaceman.fandingo.org kernel: BTRFS warning (device
>>>>>> sde1): csum failed root 2820 ino 747435 off 99979264 csum 0x8941f998
>>>>>> expected csum 0xae1cafd6 mirror 2
>>>>>>
>>>>>> While trying to research this problem, I came across a GitHub issue
>>>>>> https://github.com/kdave/btrfs-progs/issues/282 and a patch from Qu
>>>>>> from yesterday ([PATCH] btrfs: trim: fix underflow in trim length to
>>>>>> prevent access beyond device boundary). I do use the discard mount
>>>>>> option, and I have a weekly fstrim.timer enabled. I did replace 2x2TB
>>>>>> drives with the 2x8TB drives about 1 month ago, which involved a
>>>>>> conversion to -d raid5 -m raid1c3, which I suppose could hit the same
>>>>>> code paths that resize2fs would?
>>>>>
>>>>> The problem doesn't look like a trim one, but more likely some device
>>>>> boundary bug.
>>>>>
>>>>> Would you please provide the following info?
>>>>> - btrfs ins dump-tree -t chunk /dev/sde1
>>>>>   This contains the device info and chunk tree dump. It doesn't contain
>>>>>   any confidential info.
>>>>>   We can use this info to determine if some chunk is really beyond the
>>>>>   device boundary.
>>>>>   I guess some chunks are somehow already beyond the device boundary.
>>>>
>>>> And the `lsblk -b` output.
>>>>
>>>> It may be possible that the device size in btrfs doesn't match the real
>>>> device...
>>>>>
>>>>> Thanks,
>>>>> Qu
>>>>>
>>>>>>
>>>>>> Any advice on how to proceed would be greatly appreciated.
>>>>>>
>>>>>> Thanks,
>>>>>> Justin
>>>>>>
>>>>>
>>>>
>>
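
As a recap of why the dump-super output was requested, the outputs posted in this thread can be cross-referenced to see which disk the "is missing" warning refers to. This is only an illustrative Python sketch: the names `DEVID_TO_PATH`, `PATH_TO_UUID` and `locate` are made up here, the values are copied from the `btrfs fi show` and dump-super outputs above, and it assumes the /dev/sdX names did not change between the two commands.

```python
# devid -> path, copied from the `btrfs fi show /var/media/` output in this thread.
DEVID_TO_PATH = {
    1: "/dev/sdf1",
    2: "/dev/sde1",
    4: "/dev/sdg1",
    6: "/dev/sda1",
    7: "/dev/sdb1",
    8: "/dev/sdd1",
}

# path -> dev_item.uuid, copied from the dump-super loop output in this thread.
PATH_TO_UUID = {
    "/dev/sda1": "cc3f9a00-bd69-4ceb-b6e5-4fb874be2aaf",
    "/dev/sdb1": "27e1cf24-9349-4f72-a23b-86668b2a9e78",
    "/dev/sdd1": "601d409e-8ffd-489c-91af-daf3e0cc9bd2",
    "/dev/sde1": "2908ebfb-e6b5-4991-b25d-32d1487ff6a4",
    "/dev/sdf1": "cb05aae6-6c03-49d3-b46d-bf51a0eb8cd0",
    "/dev/sdg1": "1b938c84-eafd-4396-b06c-8a5bf1339840",
}


def locate(uuid):
    """Return (devid, path) for a dev_item.uuid, or None if it is not listed."""
    for devid, path in DEVID_TO_PATH.items():
        if PATH_TO_UUID.get(path) == uuid:
            return devid, path
    return None


if __name__ == "__main__":
    # UUID taken from the "devid 1 uuid ... is missing" kernel warning.
    print(locate("cb05aae6-6c03-49d3-b46d-bf51a0eb8cd0"))
```

Under that assumption, the UUID from the warning resolves to devid 1 at /dev/sdf1, the same disk that logs "access beyond end of device".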