* BTRFS scrub reports an error but check doesn't find any errors.
@ 2021-07-25 17:39 Dave T
  2021-07-25 23:49 ` Qu Wenruo
  ` (2 more replies)
  0 siblings, 3 replies; 11+ messages in thread
From: Dave T @ 2021-07-25 17:39 UTC
  To: Btrfs BTRFS

What does the list recommend I do in this case?

starting btrfs scrub ...
scrub done for 56cea9cf-5374-4a43-b19d-6b0b143dc635
Scrub started: Sun Jul 25 00:40:43 2021
Status: finished
Duration: 2:52:45
Total to scrub: 1.26TiB
Rate: 113.72MiB/s
Error summary: read=1
  Corrected: 0
  Uncorrectable: 1
  Unverified: 0
ERROR: there are uncorrectable errors

dmesg | grep "checksum error at" | tail -n 20
(no output)

# dmesg | grep -i checksum
[ +0.001698] xor: automatically using best checksumming function avx
(not related to BTRFS, right?)

# btrfs fi us /path/to/xyz
Overall:
    Device size: 2.73TiB
    Device allocated: 1.26TiB
    Device unallocated: 1.47TiB
    Device missing: 0.00B
    Used: 1.12TiB
    Free (estimated): 1.60TiB (min: 888.70GiB)
    Free (statfs, df): 1.60TiB
    Data ratio: 1.00
    Metadata ratio: 2.00
    Global reserve: 512.00MiB (used: 0.00B)
    Multiple profiles: no

Data,single: Size:1.25TiB, Used:1.11TiB (89.38%)
   /dev/mapper/userluks 1.25TiB

Metadata,DUP: Size:6.00GiB, Used:5.26GiB (87.67%)
   /dev/mapper/userluks 12.00GiB

System,DUP: Size:32.00MiB, Used:160.00KiB (0.49%)
   /dev/mapper/userluks 64.00MiB

Unallocated:
   /dev/mapper/userluks 1.47TiB

# btrfs check /dev/mapper/xyz
Opening filesystem to check...
Checking filesystem on /dev/mapper/xyz
UUID: 56cea9cf-5374-4a43-b19d-6b0b143dc635
[1/7] checking root items
[2/7] checking extents
[3/7] checking free space cache
[4/7] checking fs roots
[5/7] checking only csums items (without verifying data)
[6/7] checking root refs
[7/7] checking quota groups skipped (not enabled on this FS)
found 1230187327496 bytes used, no error found
total csum bytes: 1195610680
total tree bytes: 5648285696
total fs tree bytes: 4011016192
total extent tree bytes: 379256832
btree space waste bytes: 827370015
file data blocks allocated: 5497457123328
 referenced 5523039584256

If more info is needed, please let me know. Recommendations and advice
are appreciated.
Thank you.

* Re: BTRFS scrub reports an error but check doesn't find any errors.
  2021-07-25 17:39 BTRFS scrub reports an error but check doesn't find any errors Dave T
@ 2021-07-25 23:49 ` Qu Wenruo
  2021-07-26 18:38 ` Chris Murphy
  2021-07-27 21:41 ` Zygo Blaxell
  2 siblings, 0 replies; 11+ messages in thread
From: Qu Wenruo @ 2021-07-25 23:49 UTC
  To: Dave T, Btrfs BTRFS

On 2021/7/26 1:39 AM, Dave T wrote:
> What does the list recommend I do in this case?
>
> starting btrfs scrub ...
> scrub done for 56cea9cf-5374-4a43-b19d-6b0b143dc635
> Scrub started: Sun Jul 25 00:40:43 2021
> Status: finished
> Duration: 2:52:45
> Total to scrub: 1.26TiB
> Rate: 113.72MiB/s
> Error summary: read=1

Scrub verifies checksums for both data and metadata, while btrfs check
only examines metadata; it does not verify data checksums unless it is
run with the --check-data-csum option.

>   Corrected: 0
>   Uncorrectable: 1
>   Unverified: 0
> ERROR: there are uncorrectable errors
>
> dmesg | grep "checksum error at" | tail -n 20
> (no output)

I'm more interested in this problem: scrub finding one error without
any kernel message is already pretty weird.

>
> # dmesg | grep -i checksum
> [ +0.001698] xor: automatically using best checksumming function avx
> (not related to BTRFS, right?)
>
> # btrfs fi us /path/to/xyz
> Overall:
>     Device size: 2.73TiB
>     Device allocated: 1.26TiB
>     Device unallocated: 1.47TiB
>     Device missing: 0.00B
>     Used: 1.12TiB
>     Free (estimated): 1.60TiB (min: 888.70GiB)
>     Free (statfs, df): 1.60TiB
>     Data ratio: 1.00
>     Metadata ratio: 2.00
>     Global reserve: 512.00MiB (used: 0.00B)
>     Multiple profiles: no
>
> Data,single: Size:1.25TiB, Used:1.11TiB (89.38%)
>    /dev/mapper/userluks 1.25TiB
>
> Metadata,DUP: Size:6.00GiB, Used:5.26GiB (87.67%)
>    /dev/mapper/userluks 12.00GiB
>
> System,DUP: Size:32.00MiB, Used:160.00KiB (0.49%)
>    /dev/mapper/userluks 64.00MiB
>
> Unallocated:
>    /dev/mapper/userluks 1.47TiB
>
> # btrfs check /dev/mapper/xyz
> Opening filesystem to check...
> Checking filesystem on /dev/mapper/xyz
> UUID: 56cea9cf-5374-4a43-b19d-6b0b143dc635
> [1/7] checking root items
> [2/7] checking extents
> [3/7] checking free space cache
> [4/7] checking fs roots
> [5/7] checking only csums items (without verifying data)
> [6/7] checking root refs
> [7/7] checking quota groups skipped (not enabled on this FS)
> found 1230187327496 bytes used, no error found

Since btrfs check reports no errors, any real error must be in data,
not metadata.

And since there is no error message, the only way to catch the problem
is the "btrfs device stats" command, which shows which device has had
its error counters increased.

Because those counters accumulate from the creation of the fs, the
change may not be obvious. So you may want to record the output, run
scrub again, then compare the two outputs to determine which device is
affected.

Or, you can use "btrfs check --check-data-csum" to do a "scrub" in user
space.

Thanks,
Qu

> total csum bytes: 1195610680
> total tree bytes: 5648285696
> total fs tree bytes: 4011016192
> total extent tree bytes: 379256832
> btree space waste bytes: 827370015
> file data blocks allocated: 5497457123328
>  referenced 5523039584256
>
> If more info is needed, please let me know. Recommendations and advice
> are appreciated.
> Thank you.
>

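The record-and-compare step Qu describes can be sketched in Python. This is a minimal illustration, not something posted in the thread: it assumes the two `btrfs device stats` outputs were saved as text around a scrub, and the sample values in BEFORE and AFTER are hypothetical (chosen to resemble a single new read error on /dev/mapper/userluks).

```python
import re

# Matches lines such as "[/dev/mapper/userluks].read_io_errs    2"
STAT_LINE = re.compile(r"^\[(?P<dev>[^\]]+)\]\.(?P<counter>\w+)\s+(?P<value>\d+)$")


def parse_device_stats(text):
    """Return {(device, counter): value} parsed from `btrfs device stats` output."""
    stats = {}
    for line in text.splitlines():
        match = STAT_LINE.match(line.strip())
        if match:
            stats[(match["dev"], match["counter"])] = int(match["value"])
    return stats


def changed_counters(before, after):
    """Return {(device, counter): increase} for every counter that grew."""
    old = parse_device_stats(before)
    new = parse_device_stats(after)
    return {key: value - old.get(key, 0)
            for key, value in new.items() if value > old.get(key, 0)}


# Hypothetical captures taken before and after a scrub (not from the thread).
BEFORE = """\
[/dev/mapper/userluks].write_io_errs    0
[/dev/mapper/userluks].read_io_errs     1
[/dev/mapper/userluks].flush_io_errs    0
[/dev/mapper/userluks].corruption_errs  0
[/dev/mapper/userluks].generation_errs  0
"""
AFTER = BEFORE.replace("read_io_errs     1", "read_io_errs     2")

for (device, counter), increase in changed_counters(BEFORE, AFTER).items():
    print(f"{device}: {counter} increased by {increase}")
```

Because the counters are cumulative, only the difference between the two captures says anything about the most recent scrub.
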
* Re: BTRFS scrub reports an error but check doesn't find any errors.
  2021-07-25 17:39 BTRFS scrub reports an error but check doesn't find any errors Dave T
  2021-07-25 23:49 ` Qu Wenruo
@ 2021-07-26 18:38 ` Chris Murphy
  2021-07-27 21:41 ` Zygo Blaxell
  2 siblings, 0 replies; 11+ messages in thread
From: Chris Murphy @ 2021-07-26 18:38 UTC
  To: Dave T; +Cc: Btrfs BTRFS

On Sun, Jul 25, 2021 at 11:40 AM Dave T <davestechshop@gmail.com> wrote:

> dmesg | grep "checksum error at" | tail -n 20
> (no output)
>
> # dmesg | grep -i checksum
> [ +0.001698] xor: automatically using best checksumming function avx
> (not related to BTRFS, right?)

Search for "csum" or use -i btrfs instead.

--
Chris Murphy

* Re: BTRFS scrub reports an error but check doesn't find any errors.
  2021-07-25 17:39 BTRFS scrub reports an error but check doesn't find any errors Dave T
  2021-07-25 23:49 ` Qu Wenruo
  2021-07-26 18:38 ` Chris Murphy
@ 2021-07-27 21:41 ` Zygo Blaxell
  2021-07-28 15:15 ` Dave T
  2 siblings, 1 reply; 11+ messages in thread
From: Zygo Blaxell @ 2021-07-27 21:41 UTC
  To: Dave T; +Cc: Btrfs BTRFS

On Sun, Jul 25, 2021 at 01:39:55PM -0400, Dave T wrote:
> What does the list recommend I do in this case?
>
> starting btrfs scrub ...
> scrub done for 56cea9cf-5374-4a43-b19d-6b0b143dc635
> Scrub started: Sun Jul 25 00:40:43 2021
> Status: finished
> Duration: 2:52:45
> Total to scrub: 1.26TiB
> Rate: 113.72MiB/s
> Error summary: read=1
>   Corrected: 0
>   Uncorrectable: 1
>   Unverified: 0
> ERROR: there are uncorrectable errors

This is a read failure (data not available from device), not a csum error
(data available but not correct).

> dmesg | grep "checksum error at" | tail -n 20
> (no output)

You should be looking for an I/O failure on the underlying device (the
one below /dev/mapper/userluks). Look for log messages that appear just
before btrfs errors, or errors mentioning the device itself:

	dmesg | grep -B99 -i btrfs

	dmesg | grep -C9 sda

> # dmesg | grep -i checksum
> [ +0.001698] xor: automatically using best checksumming function avx
> (not related to BTRFS, right?)
>
> # btrfs fi us /path/to/xyz
> Overall:
>     Device size: 2.73TiB
>     Device allocated: 1.26TiB
>     Device unallocated: 1.47TiB
>     Device missing: 0.00B
>     Used: 1.12TiB
>     Free (estimated): 1.60TiB (min: 888.70GiB)
>     Free (statfs, df): 1.60TiB
>     Data ratio: 1.00
>     Metadata ratio: 2.00
>     Global reserve: 512.00MiB (used: 0.00B)
>     Multiple profiles: no
>
> Data,single: Size:1.25TiB, Used:1.11TiB (89.38%)
>    /dev/mapper/userluks 1.25TiB
>
> Metadata,DUP: Size:6.00GiB, Used:5.26GiB (87.67%)
>    /dev/mapper/userluks 12.00GiB
>
> System,DUP: Size:32.00MiB, Used:160.00KiB (0.49%)
>    /dev/mapper/userluks 64.00MiB

Since the error was not corrected, it likely occurred in the data blocks,
which use the single profile and so have no second copy.

A metadata error would be correctable from the DUP copy, so check wouldn't
report it because the scrub will have already corrected it (assuming the
underlying drive is still healthy enough to remap bad sectors).

> Unallocated:
>    /dev/mapper/userluks 1.47TiB
>
> # btrfs check /dev/mapper/xyz

That command won't read any data blocks, so it won't see any errors there.

> Opening filesystem to check...
> Checking filesystem on /dev/mapper/xyz
> UUID: 56cea9cf-5374-4a43-b19d-6b0b143dc635
> [1/7] checking root items
> [2/7] checking extents
> [3/7] checking free space cache
> [4/7] checking fs roots
> [5/7] checking only csums items (without verifying data)
> [6/7] checking root refs
> [7/7] checking quota groups skipped (not enabled on this FS)
> found 1230187327496 bytes used, no error found
> total csum bytes: 1195610680
> total tree bytes: 5648285696
> total fs tree bytes: 4011016192
> total extent tree bytes: 379256832
> btree space waste bytes: 827370015
> file data blocks allocated: 5497457123328
>  referenced 5523039584256
>
> If more info is needed, please let me know. Recommendations and advice
> are appreciated.
> Thank you.

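The distinction Zygo draws between a read failure and a csum error can be read straight off the scrub summary. The sketch below is an illustration only: it assumes the summary line has the `name=count` shape shown in the thread, that checksum mismatches appear under the name `csum` (an assumption, since only `read=1` occurs in this thread), and the helper names are hypothetical.

```python
import re


def parse_error_summary(scrub_output):
    """Return the counters from the 'Error summary:' line as a dict."""
    for line in scrub_output.splitlines():
        if line.strip().startswith("Error summary:"):
            return {name: int(count)
                    for name, count in re.findall(r"(\w+)=(\d+)", line)}
    return {}


def next_steps(summary):
    """Suggest where to look next for each kind of error counted."""
    steps = []
    if summary.get("read", 0) > 0:
        steps.append("read failure: the device returned no data; look for "
                     "I/O errors on the underlying disk in the kernel log")
    if summary.get("csum", 0) > 0:
        steps.append("csum error: data was returned but did not match its "
                     "checksum; look for BTRFS messages in the kernel log")
    return steps


# Summary lines copied from the scrub output quoted in the thread.
SCRUB_OUTPUT = """\
Status: finished
Error summary: read=1
  Corrected: 0
  Uncorrectable: 1
"""

summary = parse_error_summary(SCRUB_OUTPUT)
for step in next_steps(summary):
    print(step)
```
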
* Re: BTRFS scrub reports an error but check doesn't find any errors.
  2021-07-27 21:41 ` Zygo Blaxell
@ 2021-07-28 15:15 ` Dave T
  2021-07-28 16:11 ` Andrei Borzenkov
  ` (2 more replies)
  0 siblings, 3 replies; 11+ messages in thread
From: Dave T @ 2021-07-28 15:15 UTC
  To: Zygo Blaxell; +Cc: Btrfs BTRFS

On Tue, Jul 27, 2021 at 5:44 PM Zygo Blaxell
<ce3g8jdj@umail.furryterror.org> wrote:
>
> On Sun, Jul 25, 2021 at 01:39:55PM -0400, Dave T wrote:
> > What does the list recommend I do in this case?
> >
> > starting btrfs scrub ...
> > scrub done for 56cea9cf-5374-4a43-b19d-6b0b143dc635
> > Scrub started: Sun Jul 25 00:40:43 2021
> > Status: finished
> > Duration: 2:52:45
> > Total to scrub: 1.26TiB
> > Rate: 113.72MiB/s
> > Error summary: read=1
> >   Corrected: 0
> >   Uncorrectable: 1
> >   Unverified: 0
> > ERROR: there are uncorrectable errors
>
> This is a read failure (data not available from device), not a csum error
> (data available but not correct).

Thank you.

>
> > dmesg | grep "checksum error at" | tail -n 20
> > (no output)
>
> You should be looking for a IO failure on the underlying device (the
> one below /dev/mapper/userluks).

That would be /dev/sde in my case.

> Look for log messages that appear just
> before btrfs errors, or errors mentioning the device itself:
>
> dmesg | grep -B99 -i btrfs
>
> dmesg | grep -C9 sda
>

Here is the complete output from a new scrub, including the output of
the commands above.

# fpath="/home/userdata"
# check_space "$fpath"
Filesystem Size Used Avail Use% Mounted on
/dev/mapper/userluks 2.8T 1.2T 1.7T 42% /home/userdata
Data, single: total=1.25TiB, used=1.11TiB
System, DUP: total=32.00MiB, used=160.00KiB
Metadata, DUP: total=6.00GiB, used=5.32GiB
GlobalReserve, single: total=512.00MiB, used=0.00B

# balance "$fpath"
starting btrfs balance for /home/userdata
Done, had to relocate 10 out of 1284 chunks

real 101m35.135s
user 0m0.000s
sys 0m32.991s
-----------------------
# scrub "$fpath"
starting btrfs scrub ...
scrub done for 56cea9cf-5374-4a43-b19d-6b0b143dc635
Scrub started: Tue Jul 27 19:46:25 2021
Status: finished
Duration: 2:09:17
Total to scrub: 1.25TiB
Rate: 151.98MiB/s
Error summary: read=1
  Corrected: 0
  Uncorrectable: 1
  Unverified: 0
ERROR: there are uncorrectable errors

# btrfs fi us /home/userdata/
Overall:
    Device size: 2.73TiB
    Device allocated: 1.25TiB
    Device unallocated: 1.48TiB
    Device missing: 0.00B
    Used: 1.12TiB
    Free (estimated): 1.60TiB (min: 883.62GiB)
    Free (statfs, df): 1.60TiB
    Data ratio: 1.00
    Metadata ratio: 2.00
    Global reserve: 512.00MiB (used: 0.00B)
    Multiple profiles: no

Data,single: Size:1.24TiB, Used:1.11TiB (90.10%)
   /dev/mapper/userluks 1.24TiB

Metadata,DUP: Size:6.00GiB, Used:5.35GiB (89.17%)
   /dev/mapper/userluks 12.00GiB

System,DUP: Size:32.00MiB, Used:160.00KiB (0.49%)
   /dev/mapper/userluks 64.00MiB

Unallocated:
   /dev/mapper/userluks 1.48TiB

# lsblk
NAME         LABEL      UUID                                 PARTUUID                             MODEL              SIZE SERIAL  MOUNTPOINT
sde                                                                                               ST3000DM001-1CH166 2.7T XXXXXXX
└─sde1                  56cea9cf-3566-49f3-8abf-e59246f88a43 5db1ed2f-572e-4388-920b-6e4bfabf9e72                    2.7T
  └─userluks USERCOMMON 56cea9cf-5374-4a43-b19d-6b0b143dc635                                                         2.7T         /mnt/temp/user

dmesg | grep -B99 -i btrfs
The only relevant output is:
[ +0.025578] BTRFS warning (device dm-4): qgroup rescan init failed, qgroup is not enabled
[ +0.000983] BTRFS warning (device dm-4): qgroup rescan init failed, qgroup is not enabled
[ +0.026646] BTRFS warning (device dm-4): qgroup rescan init failed, qgroup is not enabled
[ +0.001383] BTRFS warning (device dm-4): qgroup rescan init failed, qgroup is not enabled
[Jul28 00:03] BTRFS warning (device dm-4): qgroup rescan init failed, qgroup is not enabled
[ +0.001119] BTRFS warning (device dm-4): qgroup rescan init failed, qgroup is not enabled
[ +1.828343] BTRFS warning (device dm-0): qgroup rescan init failed, qgroup is not enabled
[ +0.001132] BTRFS warning (device dm-0): qgroup rescan init failed, qgroup is not enabled
[ +0.029263] BTRFS warning (device dm-0): qgroup rescan init failed, qgroup is not enabled
[ +0.000969] BTRFS warning (device dm-0): qgroup rescan init failed, qgroup is not enabled
[ +0.068749] BTRFS warning (device dm-0): qgroup rescan init failed, qgroup is not enabled
[ +0.005558] BTRFS warning (device dm-0): qgroup rescan init failed, qgroup is not enabled
[ +2.872178] BTRFS warning (device dm-2): qgroup rescan init failed, qgroup is not enabled
[ +0.024708] BTRFS warning (device dm-2): qgroup rescan init failed, qgroup is not enabled
[ +0.041130] BTRFS warning (device dm-2): qgroup rescan init failed, qgroup is not enabled
[ +4.115641] BTRFS warning (device dm-2): qgroup rescan init failed, qgroup is not enabled
[ +0.032633] BTRFS warning (device dm-2): qgroup rescan init failed, qgroup is not enabled
[ +0.059885] BTRFS warning (device dm-2): qgroup rescan init failed, qgroup is not enabled

# dmesg | grep -C9 sde
(no output)

The journal shows:

Jul 27 21:54:39 server kernel: ata10.00: exception Emask 0x0 SAct 0xffffffff SErr 0x0 action 0x0
Jul 27 21:54:39 server kernel: ata10.00: irq_stat 0x40000008
Jul 27 21:54:39 server kernel: ata10.00: failed command: READ FPDMA QUEUED
Jul 27 21:54:39 server kernel: ata10.00: cmd 60/00:90:98:2f:9f/03:00:a4:00:00/40 tag 18 ncq dma 393216 in
                                         res 41/40:00:20:32:9f/00:03:a4:00:00/00 Emask 0x409 (media error) <F>
Jul 27 21:54:39 server kernel: ata10.00: status: { DRDY ERR }
Jul 27 21:54:39 server kernel: ata10.00: error: { UNC }
Jul 27 21:54:39 server kernel: ata10.00: configured for UDMA/133
Jul 27 21:54:39 server kernel: sd 9:0:0:0: [sde] tag#18 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE cmd_age=3s
Jul 27 21:54:39 server kernel: sd 9:0:0:0: [sde] tag#18 Sense Key : Medium Error [current]
Jul 27 21:54:39 server kernel: sd 9:0:0:0: [sde] tag#18 Add. Sense: Unrecovered read error - auto reallocate failed
Jul 27 21:54:39 server kernel: sd 9:0:0:0: [sde] tag#18 CDB: Read(16) 88 00 00 00 00 00 a4 9f 2f 98 00 00 03 00 00 00
Jul 27 21:54:39 server kernel: blk_update_request: I/O error, dev sde, sector 2761896480 op 0x0:(READ) flags 0x0 phys_seg 15 prio class 0
Jul 27 21:54:39 server kernel: ata10: EH complete
Jul 27 21:54:45 server kernel: ata10.00: exception Emask 0x0 SAct 0x4000000 SErr 0x0 action 0x0
Jul 27 21:54:45 server kernel: ata10.00: irq_stat 0x40000008
Jul 27 21:54:45 server kernel: ata10.00: failed command: READ FPDMA QUEUED
Jul 27 21:54:45 server kernel: ata10.00: cmd 60/08:d0:20:32:9f/00:00:a4:00:00/40 tag 26 ncq dma 4096 in
                                         res 41/40:08:20:32:9f/00:00:a4:00:00/00 Emask 0x409 (media error) <F>
Jul 27 21:54:45 server kernel: ata10.00: status: { DRDY ERR }
Jul 27 21:54:45 server kernel: ata10.00: error: { UNC }
Jul 27 21:54:45 server kernel: ata10.00: configured for UDMA/133
Jul 27 21:54:45 server kernel: sd 9:0:0:0: [sde] tag#26 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE cmd_age=4s
Jul 27 21:54:45 server kernel: sd 9:0:0:0: [sde] tag#26 Sense Key : Medium Error [current]
Jul 27 21:54:45 server kernel: sd 9:0:0:0: [sde] tag#26 Add. Sense: Unrecovered read error - auto reallocate failed
Jul 27 21:54:45 server kernel: sd 9:0:0:0: [sde] tag#26 CDB: Read(16) 88 00 00 00 00 00 a4 9f 32 20 00 00 00 08 00 00
Jul 27 21:54:45 server kernel: blk_update_request: I/O error, dev sde, sector 2761896480 op 0x0:(READ) flags 0x800 phys_seg 1 prio class 0
Jul 27 21:54:45 server kernel: ata10: EH complete
Jul 27 21:54:45 server kernel: BTRFS warning (device dm-2): i/o error at logical 1567691653120 on dev /dev/mapper/userluks, physical 1414087852032, root 19911, inode 624993, offset 5717954560, length 4096, links 1 (path: path/to/file/filename.ext)
Jul 27 21:54:45 server kernel: BTRFS warning (device dm-2): i/o error at logical 1567691653120 on dev /dev/mapper/userluks, physical 1414087852032, root 19989, inode 624993, offset 5717954560, length 4096, links 1 (path: path/to/file/filename.ext)
Jul 27 21:54:45 server kernel: BTRFS warning (device dm-2): i/o error at logical 1567691653120 on dev /dev/mapper/userluks, physical 1414087852032, root 20199, inode 624993, offset 5717954560, length 4096, links 1 (path: path/to/file/filename.ext)
Jul 27 21:54:45 server kernel: BTRFS error (device dm-2): bdev /dev/mapper/userluks errs: wr 0, rd 2, flush 0, corrupt 0, gen 0
Jul 27 21:54:45 server kernel: BTRFS error (device dm-2): unable to fixup (regular) error at logical 1567691653120 on dev /dev/mapper/userluks
Jul 27 21:55:42 server kernel: BTRFS info (device dm-2): scrub: finished on devid 1 with status: 0

btrfs su li /home/userdata/
...
ID 19911 gen 152144 top level 257 path @usertop/18053/snapshot
ID 19989 gen 152144 top level 257 path @usertop/18131/snapshot
ID 20199 gen 152144 top level 257 path @usertop/18313/snapshot
...

snapper -c userdata ls
...
18053 | single | | Fri 01 Jan 2021 12:00:18 AM EST | root | timeline | timeline |
18131 | single | | Mon 04 Jan 2021 12:00:15 AM EST | root | timeline | timeline |
18313 | single | | Mon 11 Jan 2021 12:00:20 AM EST | root | timeline | timeline |
...

There are snapshots after that date without any errors. The live (r/w)
file system does not show any errors.

The volume is a 3TB disk, model ST3000DM001-1CH166 (Seagate Barracuda
SATA HDD).

Is there a way to mark sectors on the disk as bad? If so, is it
advisable to keep using this physical disk?

Thanks for sharing your knowledge. This is very helpful.

> > # dmesg | grep -i checksum
> > [ +0.001698] xor: automatically using best checksumming function avx
> > (not related to BTRFS, right?)
> >
> > # btrfs fi us /path/to/xyz
> > Overall:
> >     Device size: 2.73TiB
> >     Device allocated: 1.26TiB
> >     Device unallocated: 1.47TiB
> >     Device missing: 0.00B
> >     Used: 1.12TiB
> >     Free (estimated): 1.60TiB (min: 888.70GiB)
> >     Free (statfs, df): 1.60TiB
> >     Data ratio: 1.00
> >     Metadata ratio: 2.00
> >     Global reserve: 512.00MiB (used: 0.00B)
> >     Multiple profiles: no
> >
> > Data,single: Size:1.25TiB, Used:1.11TiB (89.38%)
> >    /dev/mapper/userluks 1.25TiB
> >
> > Metadata,DUP: Size:6.00GiB, Used:5.26GiB (87.67%)
> >    /dev/mapper/userluks 12.00GiB
> >
> > System,DUP: Size:32.00MiB, Used:160.00KiB (0.49%)
> >    /dev/mapper/userluks 64.00MiB
>
> Since the error was not corrected, it likely occurred in the data blocks.

Yes, it appears so from the info above.

>
> A metadata error would be correctable, so check wouldn't report it because
> the scrub will have already corrected it (assuming the underlying drive
> is still healthy enough to remap bad sectors).
>
> > Unallocated:
> >    /dev/mapper/userluks 1.47TiB
> >
> > # btrfs check /dev/mapper/xyz
>
> That command won't read any data blocks, so it won't see any errors there.
>
> > Opening filesystem to check...
> > Checking filesystem on /dev/mapper/xyz
> > UUID: 56cea9cf-5374-4a43-b19d-6b0b143dc635
> > [1/7] checking root items
> > [2/7] checking extents
> > [3/7] checking free space cache
> > [4/7] checking fs roots
> > [5/7] checking only csums items (without verifying data)
> > [6/7] checking root refs
> > [7/7] checking quota groups skipped (not enabled on this FS)
> > found 1230187327496 bytes used, no error found
> > total csum bytes: 1195610680
> > total tree bytes: 5648285696
> > total fs tree bytes: 4011016192
> > total extent tree bytes: 379256832
> > btree space waste bytes: 827370015
> > file data blocks allocated: 5497457123328
> >  referenced 5523039584256
> >
> > If more info is needed, please let me know. Recommendations and advice
> > are appreciated.
> > Thank you.

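The numbers in the journal excerpt above can be cross-checked with a little arithmetic. The sketch below decodes the second READ(16) CDB from the log and compares the failing disk sector with the "physical" offset that btrfs reports on /dev/mapper/userluks. All input values are copied from the log; the function name is hypothetical, and reading the 3 MiB difference as the distance from the start of the disk to the start of the encrypted payload (partition start plus LUKS header) is an assumption, not something the thread confirms.

```python
SECTOR_SIZE = 512  # bytes per sector as counted by the kernel block layer


def decode_read16(cdb_hex):
    """Return (lba, sector_count) from a SCSI READ(16) CDB given as hex bytes."""
    cdb = bytes.fromhex(cdb_hex)
    if len(cdb) != 16 or cdb[0] != 0x88:
        raise ValueError("not a READ(16) CDB")
    lba = int.from_bytes(cdb[2:10], "big")      # bytes 2..9: logical block address
    count = int.from_bytes(cdb[10:14], "big")   # bytes 10..13: transfer length
    return lba, count


# Values copied from the journal output in the message above.
CDB = "88 00 00 00 00 00 a4 9f 32 20 00 00 00 08 00 00"
FAILED_SECTOR = 2761896480        # "I/O error, dev sde, sector ..."
BTRFS_PHYSICAL = 1414087852032    # "physical ..." on /dev/mapper/userluks

lba, count = decode_read16(CDB)
disk_offset = FAILED_SECTOR * SECTOR_SIZE
delta = disk_offset - BTRFS_PHYSICAL

print(f"CDB reads {count} sectors ({count * SECTOR_SIZE} bytes) at LBA {lba}")
print(f"failing sector starts at byte {disk_offset} of sde")
print(f"offset between sde and the btrfs device: {delta // 2**20} MiB")
```

The request covers 4096 bytes, the same length btrfs reports for the damaged extent, and the failing sector lines up with the btrfs physical address once the constant offset is removed. That supports reading all three BTRFS warnings as one unreadable block shared by three snapshots.
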
* Re: BTRFS scrub reports an error but check doesn't find any errors.
  2021-07-28 15:15 ` Dave T
@ 2021-07-28 16:11 ` Andrei Borzenkov
  2021-07-28 16:21 ` Dave T
  2021-07-28 19:18 ` Zygo Blaxell
  2021-07-29 3:12 ` Chris Murphy
  2 siblings, 1 reply; 11+ messages in thread
From: Andrei Borzenkov @ 2021-07-28 16:11 UTC
  To: Dave T, Zygo Blaxell; +Cc: Btrfs BTRFS

On 28.07.2021 18:15, Dave T wrote:
...
>
> Jul 27 21:54:39 server kernel: ata10.00: exception Emask 0x0 SAct
> 0xffffffff SErr 0x0 action 0x0
> Jul 27 21:54:39 server kernel: ata10.00: irq_stat 0x40000008
> Jul 27 21:54:39 server kernel: ata10.00: failed command: READ FPDMA QUEUED
> Jul 27 21:54:39 server kernel: ata10.00: cmd
> 60/00:90:98:2f:9f/03:00:a4:00:00/40 tag 18 ncq dma 393216 in
> res
> 41/40:00:20:32:9f/00:03:a4:00:00/00 Emask 0x409 (media error) <F>
> Jul 27 21:54:39 server kernel: ata10.00: status: { DRDY ERR }
> Jul 27 21:54:39 server kernel: ata10.00: error: { UNC }
> Jul 27 21:54:39 server kernel: ata10.00: configured for UDMA/133
> Jul 27 21:54:39 server kernel: sd 9:0:0:0: [sde] tag#18 FAILED Result:
> hostbyte=DID_OK driverbyte=DRIVER_SENSE cmd_age=3s
> Jul 27 21:54:39 server kernel: sd 9:0:0:0: [sde] tag#18 Sense Key :
> Medium Error [current]
> Jul 27 21:54:39 server kernel: sd 9:0:0:0: [sde] tag#18 Add. Sense:
> Unrecovered read error - auto reallocate failed
> Jul 27 21:54:39 server kernel: sd 9:0:0:0: [sde] tag#18 CDB: Read(16)
> 88 00 00 00 00 00 a4 9f 2f 98 00 00 03 00 00 00
> Jul 27 21:54:39 server kernel: blk_update_request: I/O error, dev sde,
> sector 2761896480 op 0x0:(READ) flags 0x0 phys_seg 15 prio class 0
> Jul 27 21:54:39 server kernel: ata10: EH complete
> Jul 27 21:54:45 server kernel: ata10.00: exception Emask 0x0 SAct
> 0x4000000 SErr 0x0 action 0x0
> Jul 27 21:54:45 server kernel: ata10.00: irq_stat 0x40000008
> Jul 27 21:54:45 server kernel: ata10.00: failed command: READ FPDMA QUEUED
> Jul 27 21:54:45 server kernel: ata10.00: cmd
> 60/08:d0:20:32:9f/00:00:a4:00:00/40 tag 26 ncq dma 4096 in
> res
> 41/40:08:20:32:9f/00:00:a4:00:00/00 Emask 0x409 (media error) <F>
> Jul 27 21:54:45 server kernel: ata10.00: status: { DRDY ERR }
> Jul 27 21:54:45 server kernel: ata10.00: error: { UNC }
> Jul 27 21:54:45 server kernel: ata10.00: configured for UDMA/133
> Jul 27 21:54:45 server kernel: sd 9:0:0:0: [sde] tag#26 FAILED Result:
> hostbyte=DID_OK driverbyte=DRIVER_SENSE cmd_age=4s
> Jul 27 21:54:45 server kernel: sd 9:0:0:0: [sde] tag#26 Sense Key :
> Medium Error [current]
> Jul 27 21:54:45 server kernel: sd 9:0:0:0: [sde] tag#26 Add. Sense:
> Unrecovered read error - auto reallocate failed
> Jul 27 21:54:45 server kernel: sd 9:0:0:0: [sde] tag#26 CDB: Read(16)
> 88 00 00 00 00 00 a4 9f 32 20 00 00 00 08 00 00
> Jul 27 21:54:45 server kernel: blk_update_request: I/O error, dev sde,
> sector 2761896480 op 0x0:(READ) flags 0x800 phys_seg 1 prio class 0
> Jul 27 21:54:45 server kernel: ata10: EH complete
> Jul 27 21:54:45 server kernel: BTRFS warning (device dm-2): i/o error
> at logical 1567691653120 on dev /dev/mapper/userluks, physical
> 1414087852032, root 19911, inode 624993, offset 5717954560, length
> 4096, links 1 (path: path/to/file/filename.ext)
> Jul 27 21:54:45 server kernel: BTRFS warning (device dm-2): i/o error
> at logical 1567691653120 on dev /dev/mapper/userluks, physical
> 1414087852032, root 19989, inode 624993, offset 5717954560, length
> 4096, links 1 (path: path/to/file/filename.ext)
> Jul 27 21:54:45 server kernel: BTRFS warning (device dm-2): i/o error
> at logical 1567691653120 on dev /dev/mapper/userluks, physical
> 1414087852032, root 20199, inode 624993, offset 5717954560, length
> 4096, links 1 (path: path/to/file/filename.ext)
> Jul 27 21:54:45 server kernel: BTRFS error (device dm-2): bdev
> /dev/mapper/userluks errs: wr 0, rd 2, flush 0, corrupt 0, gen 0
> Jul 27 21:54:45 server kernel: BTRFS error (device dm-2): unable to
> fixup (regular) error at logical 1567691653120 on dev
> /dev/mapper/userluks
> Jul 27 21:55:42 server kernel: BTRFS info (device dm-2): scrub:
> finished on devid 1 with status: 0
>
...
>
> The volume is a 3TB disk, model ST3000DM001-1CH166 (Seagate Barracuda
> SATA HDD).
>
> Is there a way to mark sectors on the disk as bad? If so, is it

Directly overwriting the sector may "fix" it (of course, the data is
still lost) or trigger sector reallocation. hdparm has a --write-sector
option, although I do not have any experience with it. A simple dd may
also suffice. The difference is that hdparm bypasses any kernel block
layer recovery.

If you had a redundant data profile, btrfs scrub would likely have
fixed it for you.

> advisable to keep using this physical disk?
>

Well, this happens. If it is just one sector so far, I would say yes.
You should probably keep an eye on it, though.

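To make the two options Andrei mentions concrete, the sketch below only builds the command lines as strings; it does not execute anything. It is an illustration under assumptions: the device (/dev/sde), starting sector (2761896480) and length (8 sectors of 512 bytes) are taken from the kernel log quoted above, whether all eight sectors are actually unreadable is not known from the thread, and the function name is hypothetical. Running either command destroys whatever is stored in those sectors, so the affected file would still have to be restored from a good copy or deleted.

```python
import shlex


def overwrite_commands(device, lba, sector_count, sector_size=512):
    """Build, but do not run, commands that would zero the given sectors."""
    # One dd call covering the whole range, bypassing the page cache.
    dd = ["dd", "if=/dev/zero", f"of={device}", f"bs={sector_size}",
          f"seek={lba}", f"count={sector_count}", "oflag=direct"]
    # hdparm writes a single sector per call, so one command per sector.
    hdparm = [["hdparm", "--yes-i-know-what-i-am-doing",
               "--write-sector", str(lba + i), device]
              for i in range(sector_count)]
    return shlex.join(dd), [shlex.join(cmd) for cmd in hdparm]


# Values read from the journal excerpt in the thread (assumed, not verified).
dd_cmd, hdparm_cmds = overwrite_commands("/dev/sde", 2761896480, 8)

print(dd_cmd)
for cmd in hdparm_cmds:
    print(cmd)
```

Printing the commands first makes it possible to double-check the device name and sector numbers against the log before anything is written.
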
* Re: BTRFS scrub reports an error but check doesn't find any errors.
  2021-07-28 16:11 ` Andrei Borzenkov
@ 2021-07-28 16:21 ` Dave T
  2021-07-28 18:17 ` Andrei Borzenkov
  2021-07-28 19:19 ` Zygo Blaxell
  0 siblings, 2 replies; 11+ messages in thread
From: Dave T @ 2021-07-28 16:21 UTC
  To: Andrei Borzenkov; +Cc: ce3g8jdj, Btrfs BTRFS

On Wed, Jul 28, 2021 at 12:11 PM Andrei Borzenkov <arvidjaar@gmail.com> wrote:
>
> On 28.07.2021 18:15, Dave T wrote:
> ...
> >
> > Jul 27 21:54:39 server kernel: ata10.00: exception Emask 0x0 SAct
> > 0xffffffff SErr 0x0 action 0x0
> > Jul 27 21:54:39 server kernel: ata10.00: irq_stat 0x40000008
> > Jul 27 21:54:39 server kernel: ata10.00: failed command: READ FPDMA QUEUED
> > Jul 27 21:54:39 server kernel: ata10.00: cmd
> > 60/00:90:98:2f:9f/03:00:a4:00:00/40 tag 18 ncq dma 393216 in
> > res
> > 41/40:00:20:32:9f/00:03:a4:00:00/00 Emask 0x409 (media error) <F>
> > Jul 27 21:54:39 server kernel: ata10.00: status: { DRDY ERR }
> > Jul 27 21:54:39 server kernel: ata10.00: error: { UNC }
> > Jul 27 21:54:39 server kernel: ata10.00: configured for UDMA/133
> > Jul 27 21:54:39 server kernel: sd 9:0:0:0: [sde] tag#18 FAILED Result:
> > hostbyte=DID_OK driverbyte=DRIVER_SENSE cmd_age=3s
> > Jul 27 21:54:39 server kernel: sd 9:0:0:0: [sde] tag#18 Sense Key :
> > Medium Error [current]
> > Jul 27 21:54:39 server kernel: sd 9:0:0:0: [sde] tag#18 Add. Sense:
> > Unrecovered read error - auto reallocate failed
> > Jul 27 21:54:39 server kernel: sd 9:0:0:0: [sde] tag#18 CDB: Read(16)
> > 88 00 00 00 00 00 a4 9f 2f 98 00 00 03 00 00 00
> > Jul 27 21:54:39 server kernel: blk_update_request: I/O error, dev sde,
> > sector 2761896480 op 0x0:(READ) flags 0x0 phys_seg 15 prio class 0
> > Jul 27 21:54:39 server kernel: ata10: EH complete
> > Jul 27 21:54:45 server kernel: ata10.00: exception Emask 0x0 SAct
> > 0x4000000 SErr 0x0 action 0x0
> > Jul 27 21:54:45 server kernel: ata10.00: irq_stat 0x40000008
> > Jul 27 21:54:45 server kernel: ata10.00: failed command: READ FPDMA QUEUED
> > Jul 27 21:54:45 server kernel: ata10.00: cmd
> > 60/08:d0:20:32:9f/00:00:a4:00:00/40 tag 26 ncq dma 4096 in
> > res
> > 41/40:08:20:32:9f/00:00:a4:00:00/00 Emask 0x409 (media error) <F>
> > Jul 27 21:54:45 server kernel: ata10.00: status: { DRDY ERR }
> > Jul 27 21:54:45 server kernel: ata10.00: error: { UNC }
> > Jul 27 21:54:45 server kernel: ata10.00: configured for UDMA/133
> > Jul 27 21:54:45 server kernel: sd 9:0:0:0: [sde] tag#26 FAILED Result:
> > hostbyte=DID_OK driverbyte=DRIVER_SENSE cmd_age=4s
> > Jul 27 21:54:45 server kernel: sd 9:0:0:0: [sde] tag#26 Sense Key :
> > Medium Error [current]
> > Jul 27 21:54:45 server kernel: sd 9:0:0:0: [sde] tag#26 Add. Sense:
> > Unrecovered read error - auto reallocate failed
> > Jul 27 21:54:45 server kernel: sd 9:0:0:0: [sde] tag#26 CDB: Read(16)
> > 88 00 00 00 00 00 a4 9f 32 20 00 00 00 08 00 00
> > Jul 27 21:54:45 server kernel: blk_update_request: I/O error, dev sde,
> > sector 2761896480 op 0x0:(READ) flags 0x800 phys_seg 1 prio class 0
> > Jul 27 21:54:45 server kernel: ata10: EH complete
> > Jul 27 21:54:45 server kernel: BTRFS warning (device dm-2): i/o error
> > at logical 1567691653120 on dev /dev/mapper/userluks, physical
> > 1414087852032, root 19911, inode 624993, offset 5717954560, length
> > 4096, links 1 (path: path/to/file/filename.ext)
> > Jul 27 21:54:45 server kernel: BTRFS warning (device dm-2): i/o error
> > at logical 1567691653120 on dev /dev/mapper/userluks, physical
> > 1414087852032, root 19989, inode 624993, offset 5717954560, length
> > 4096, links 1 (path: path/to/file/filename.ext)
> > Jul 27 21:54:45 server kernel: BTRFS warning (device dm-2): i/o error
> > at logical 1567691653120 on dev /dev/mapper/userluks, physical
> > 1414087852032, root 20199, inode 624993, offset 5717954560, length
> > 4096, links 1 (path: path/to/file/filename.ext)
> > Jul 27 21:54:45 server kernel: BTRFS error (device dm-2): bdev
> > /dev/mapper/userluks errs: wr 0, rd 2, flush 0, corrupt 0, gen 0
> > Jul 27 21:54:45 server kernel: BTRFS error (device dm-2): unable to
> > fixup (regular) error at logical 1567691653120 on dev
> > /dev/mapper/userluks
> > Jul 27 21:55:42 server kernel: BTRFS info (device dm-2): scrub:
> > finished on devid 1 with status: 0
> >
> ...
> >
> > The volume is a 3TB disk, model ST3000DM001-1CH166 (Seagate Barracuda
> > SATA HDD).
> >
> > Is there a way to mark sectors on the disk as bad? If so, is it
>
> Directly overwriting sector may "fix" it (of course, data is still lost)
> or trigger sector replacement. hdparm has --write-sector command
> although I do not have any experience with it. Or simple dd may suffice.
> Difference is that hdparm will bypass any kernel block layer recovery.
> If you had redundant data profile, btrfs scrub would likely have fixed
> it for you.

I never knew BTRFS could duplicate data without RAID. This looks like
a great feature for my situation. I think I may upgrade this disk to a
larger one and enable DUP for data.

Is this a good tutorial to follow?
https://zejn.net/b/2017/04/30/single-device-data-redundancy-with-btrfs/

Should I expect my data to take twice as much space after enabling DUP?

>
> > advisable to keep using this physical disk?
> >
>
> Well, this happens, if this is just one sector so far I would say yes.
> You probably need to keep an eye on it though.

Thanks. My guess is that this is an isolated issue. It doesn't seem to
be growing, but I will watch it.

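The space question can be estimated from the `btrfs fi us` figures posted earlier in the thread. The sketch below is a rough calculation, not an answer given by anyone on the list: it relies on the fact that the DUP profile stores two copies of every block (the same 2.00 ratio the output already shows for metadata), it uses the rounded values from the second `btrfs fi us` run, and it ignores the small System chunks. The function name is hypothetical.

```python
TIB = 1024 ** 4
GIB = 1024 ** 3


def raw_bytes_used(data_used, metadata_used, data_ratio, metadata_ratio):
    """Raw device space consumed for the given logical usage and profile ratios."""
    return data_used * data_ratio + metadata_used * metadata_ratio


# Rounded figures from the second `btrfs fi us` output in the thread.
DATA_USED = 1.11 * TIB
METADATA_USED = 5.35 * GIB
DEVICE_SIZE = 2.73 * TIB

current = raw_bytes_used(DATA_USED, METADATA_USED, data_ratio=1, metadata_ratio=2)
with_dup = raw_bytes_used(DATA_USED, METADATA_USED, data_ratio=2, metadata_ratio=2)

print(f"single data, DUP metadata: {current / TIB:.2f} TiB")
print(f"DUP data, DUP metadata:    {with_dup / TIB:.2f} TiB")
print(f"share of the 2.73 TiB device: {with_dup / DEVICE_SIZE:.0%}")
```

The first figure reproduces the "Used: 1.12TiB" line from the thread, which is a useful sanity check on the method. Under these assumptions the data would indeed occupy about twice the raw space, leaving the device roughly four-fifths full.
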
* Re: BTRFS scrub reports an error but check doesn't find any errors.
  2021-07-28 16:21 ` Dave T
@ 2021-07-28 18:17 ` Andrei Borzenkov
  2021-07-28 19:19 ` Zygo Blaxell
  1 sibling, 0 replies; 11+ messages in thread
From: Andrei Borzenkov @ 2021-07-28 18:17 UTC
  To: Dave T; +Cc: ce3g8jdj, Btrfs BTRFS

On 28.07.2021 19:21, Dave T wrote:
> On Wed, Jul 28, 2021 at 12:11 PM Andrei Borzenkov <arvidjaar@gmail.com> wrote:
>>
>> On 28.07.2021 18:15, Dave T wrote:
>> ...
>>>
>>> Jul 27 21:54:39 server kernel: ata10.00: exception Emask 0x0 SAct
>>> 0xffffffff SErr 0x0 action 0x0
>>> Jul 27 21:54:39 server kernel: ata10.00: irq_stat 0x40000008
>>> Jul 27 21:54:39 server kernel: ata10.00: failed command: READ FPDMA QUEUED
>>> Jul 27 21:54:39 server kernel: ata10.00: cmd
>>> 60/00:90:98:2f:9f/03:00:a4:00:00/40 tag 18 ncq dma 393216 in
>>> res
>>> 41/40:00:20:32:9f/00:03:a4:00:00/00 Emask 0x409 (media error) <F>
>>> Jul 27 21:54:39 server kernel: ata10.00: status: { DRDY ERR }
>>> Jul 27 21:54:39 server kernel: ata10.00: error: { UNC }
>>> Jul 27 21:54:39 server kernel: ata10.00: configured for UDMA/133
>>> Jul 27 21:54:39 server kernel: sd 9:0:0:0: [sde] tag#18 FAILED Result:
>>> hostbyte=DID_OK driverbyte=DRIVER_SENSE cmd_age=3s
>>> Jul 27 21:54:39 server kernel: sd 9:0:0:0: [sde] tag#18 Sense Key :
>>> Medium Error [current]
>>> Jul 27 21:54:39 server kernel: sd 9:0:0:0: [sde] tag#18 Add. Sense:
>>> Unrecovered read error - auto reallocate failed
>>> Jul 27 21:54:39 server kernel: sd 9:0:0:0: [sde] tag#18 CDB: Read(16)
>>> 88 00 00 00 00 00 a4 9f 2f 98 00 00 03 00 00 00
>>> Jul 27 21:54:39 server kernel: blk_update_request: I/O error, dev sde,
>>> sector 2761896480 op 0x0:(READ) flags 0x0 phys_seg 15 prio class 0
>>> Jul 27 21:54:39 server kernel: ata10: EH complete
>>> Jul 27 21:54:45 server kernel: ata10.00: exception Emask 0x0 SAct
>>> 0x4000000 SErr 0x0 action 0x0
>>> Jul 27 21:54:45 server kernel: ata10.00: irq_stat 0x40000008
>>> Jul 27 21:54:45 server kernel: ata10.00: failed command: READ FPDMA QUEUED
>>> Jul 27 21:54:45 server kernel: ata10.00: cmd
>>> 60/08:d0:20:32:9f/00:00:a4:00:00/40 tag 26 ncq dma 4096 in
>>> res
>>> 41/40:08:20:32:9f/00:00:a4:00:00/00 Emask 0x409 (media error) <F>
>>> Jul 27 21:54:45 server kernel: ata10.00: status: { DRDY ERR }
>>> Jul 27 21:54:45 server kernel: ata10.00: error: { UNC }
>>> Jul 27 21:54:45 server kernel: ata10.00: configured for UDMA/133
>>> Jul 27 21:54:45 server kernel: sd 9:0:0:0: [sde] tag#26 FAILED Result:
>>> hostbyte=DID_OK driverbyte=DRIVER_SENSE cmd_age=4s
>>> Jul 27 21:54:45 server kernel: sd 9:0:0:0: [sde] tag#26 Sense Key :
>>> Medium Error [current]
>>> Jul 27 21:54:45 server kernel: sd 9:0:0:0: [sde] tag#26 Add. Sense:
>>> Unrecovered read error - auto reallocate failed
>>> Jul 27 21:54:45 server kernel: sd 9:0:0:0: [sde] tag#26 CDB: Read(16)
>>> 88 00 00 00 00 00 a4 9f 32 20 00 00 00 08 00 00
>>> Jul 27 21:54:45 server kernel: blk_update_request: I/O error, dev sde,
>>> sector 2761896480 op 0x0:(READ) flags 0x800 phys_seg 1 prio class 0
>>> Jul 27 21:54:45 server kernel: ata10: EH complete
>>> Jul 27 21:54:45 server kernel: BTRFS warning (device dm-2): i/o error
>>> at logical 1567691653120 on dev /dev/mapper/userluks, physical
>>> 1414087852032, root 19911, inode 624993, offset 5717954560, length
>>> 4096, links 1 (path: path/to/file/filename.ext)
>>> Jul 27 21:54:45 server kernel: BTRFS warning (device dm-2): i/o error
>>> at logical 1567691653120 on dev /dev/mapper/userluks, physical
>>> 1414087852032, root 19989, inode 624993, offset 5717954560, length
>>> 4096, links 1 (path: path/to/file/filename.ext)
>>> Jul 27 21:54:45 server kernel: BTRFS warning (device dm-2): i/o error
>>> at logical 1567691653120 on dev /dev/mapper/userluks, physical
>>> 1414087852032, root 20199, inode 624993, offset 5717954560, length
>>> 4096, links 1 (path: path/to/file/filename.ext)
>>> Jul 27 21:54:45 server kernel: BTRFS error (device dm-2): bdev
>>> /dev/mapper/userluks errs: wr 0, rd 2, flush 0, corrupt 0, gen 0
>>> Jul 27 21:54:45 server kernel: BTRFS error (device dm-2): unable to
>>> fixup (regular) error at logical 1567691653120 on dev
>>> /dev/mapper/userluks
>>> Jul 27 21:55:42 server kernel: BTRFS info (device dm-2): scrub:
>>> finished on devid 1 with status: 0
>>>
>> ...
>>>
>>> The volume is a 3TB disk, model ST3000DM001-1CH166 (Seagate Barracuda
>>> SATA HDD).
>>>
>>> Is there a way to mark sectors on the disk as bad? If so, is it
>>
>> Directly overwriting sector may "fix" it (of course, data is still lost)
>> or trigger sector replacement. hdparm has --write-sector command
>> although I do not have any experience with it. Or simple dd may suffice.
>> Difference is that hdparm will bypass any kernel block layer recovery.
>> If you had redundant data profile, btrfs scrub would likely have fixed
>> it for you.
>
> I never knew BTRFS could duplicate data without RAID. This looks like
> a great feature for my situation. I think I may upgrade this disk to a
> larger one and enable DUP for data.
>
> Is this a good tutorial to follow?
> https://zejn.net/b/2017/04/30/single-device-data-redundancy-with-btrfs/
>
> Should I expect my data to take twice as much space after enabling DUP?
>

Yes, of course. Your data will be stored twice.

>>
>>> advisable to keep using this physical disk?
>>>
>>
>> Well, this happens, if this is just one sector so far I would say yes.
>> You probably need to keep an eye on it though.
>
> Thanks. My guess is that this is an isolated issue. It doesn't seem to
> be growing, but I will watch it.
>
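Before overwriting anything with hdparm or dd, it may help to confirm
the exact LBA. The sketch below decodes the second Read(16) CDB from the
journal and compares it with the sector the block layer reported. The
helper name decode_read16 is hypothetical, the 512-byte logical block
size is inferred from "ncq dma 4096 in" for an 8-block read, and the
hdparm line is only printed, not executed.

```shell
#!/bin/bash
# Decode a SCSI Read(16) CDB as printed in the kernel log.
# Byte 0 is the opcode (0x88), bytes 2-9 hold the 64-bit LBA and
# bytes 10-13 hold the transfer length in logical blocks.
decode_read16() {
    # Word splitting of the unquoted argument is intentional here.
    set -- $1
    lba=$(( 0x${3}${4}${5}${6}${7}${8}${9}${10} ))
    nblocks=$(( 0x${11}${12}${13}${14} ))
}

# CDB from the 21:54:45 retry in the journal.
decode_read16 "88 00 00 00 00 00 a4 9f 32 20 00 00 00 08 00 00"
echo "LBA=$lba blocks=$nblocks bytes=$(( nblocks * 512 ))"

# Dry run: show the read-back command rather than running it.
echo "hdparm --read-sector $lba /dev/sde"
```

The decoded LBA matches "sector 2761896480" in the blk_update_request
line, so the CDB and the block-layer message describe the same place on
/dev/sde.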
* Re: BTRFS scrub reports an error but check doesn't find any errors. 2021-07-28 16:21 ` Dave T 2021-07-28 18:17 ` Andrei Borzenkov @ 2021-07-28 19:19 ` Zygo Blaxell 1 sibling, 0 replies; 11+ messages in thread From: Zygo Blaxell @ 2021-07-28 19:19 UTC (permalink / raw) To: Dave T; +Cc: Andrei Borzenkov, Btrfs BTRFS On Wed, Jul 28, 2021 at 12:21:43PM -0400, Dave T wrote: > On Wed, Jul 28, 2021 at 12:11 PM Andrei Borzenkov <arvidjaar@gmail.com> wrote: > > > > On 28.07.2021 18:15, Dave T wrote: > > ... > > > > > > Jul 27 21:54:39 server kernel: ata10.00: exception Emask 0x0 SAct > > > 0xffffffff SErr 0x0 action 0x0 > > > Jul 27 21:54:39 server kernel: ata10.00: irq_stat 0x40000008 > > > Jul 27 21:54:39 server kernel: ata10.00: failed command: READ FPDMA QUEUED > > > Jul 27 21:54:39 server kernel: ata10.00: cmd > > > 60/00:90:98:2f:9f/03:00:a4:00:00/40 tag 18 ncq dma 393216 in > > > res > > > 41/40:00:20:32:9f/00:03:a4:00:00/00 Emask 0x409 (media error) <F> > > > Jul 27 21:54:39 server kernel: ata10.00: status: { DRDY ERR } > > > Jul 27 21:54:39 server kernel: ata10.00: error: { UNC } > > > Jul 27 21:54:39 server kernel: ata10.00: configured for UDMA/133 > > > Jul 27 21:54:39 server kernel: sd 9:0:0:0: [sde] tag#18 FAILED Result: > > > hostbyte=DID_OK driverbyte=DRIVER_SENSE cmd_age=3s > > > Jul 27 21:54:39 server kernel: sd 9:0:0:0: [sde] tag#18 Sense Key : > > > Medium Error [current] > > > Jul 27 21:54:39 server kernel: sd 9:0:0:0: [sde] tag#18 Add. 
Sense: > > > Unrecovered read error - auto reallocate failed > > > Jul 27 21:54:39 server kernel: sd 9:0:0:0: [sde] tag#18 CDB: Read(16) > > > 88 00 00 00 00 00 a4 9f 2f 98 00 00 03 00 00 00 > > > Jul 27 21:54:39 server kernel: blk_update_request: I/O error, dev sde, > > > sector 2761896480 op 0x0:(READ) flags 0x0 phys_seg 15 prio class 0 > > > Jul 27 21:54:39 server kernel: ata10: EH complete > > > Jul 27 21:54:45 server kernel: ata10.00: exception Emask 0x0 SAct > > > 0x4000000 SErr 0x0 action 0x0 > > > Jul 27 21:54:45 server kernel: ata10.00: irq_stat 0x40000008 > > > Jul 27 21:54:45 server kernel: ata10.00: failed command: READ FPDMA QUEUED > > > Jul 27 21:54:45 server kernel: ata10.00: cmd > > > 60/08:d0:20:32:9f/00:00:a4:00:00/40 tag 26 ncq dma 4096 in > > > res > > > 41/40:08:20:32:9f/00:00:a4:00:00/00 Emask 0x409 (media error) <F> > > > Jul 27 21:54:45 server kernel: ata10.00: status: { DRDY ERR } > > > Jul 27 21:54:45 server kernel: ata10.00: error: { UNC } > > > Jul 27 21:54:45 server kernel: ata10.00: configured for UDMA/133 > > > Jul 27 21:54:45 server kernel: sd 9:0:0:0: [sde] tag#26 FAILED Result: > > > hostbyte=DID_OK driverbyte=DRIVER_SENSE cmd_age=4s > > > Jul 27 21:54:45 server kernel: sd 9:0:0:0: [sde] tag#26 Sense Key : > > > Medium Error [current] > > > Jul 27 21:54:45 server kernel: sd 9:0:0:0: [sde] tag#26 Add. 
Sense: > > > Unrecovered read error - auto reallocate failed > > > Jul 27 21:54:45 server kernel: sd 9:0:0:0: [sde] tag#26 CDB: Read(16) > > > 88 00 00 00 00 00 a4 9f 32 20 00 00 00 08 00 00 > > > Jul 27 21:54:45 server kernel: blk_update_request: I/O error, dev sde, > > > sector 2761896480 op 0x0:(READ) flags 0x800 phys_seg 1 prio class 0 > > > Jul 27 21:54:45 server kernel: ata10: EH complete > > > Jul 27 21:54:45 server kernel: BTRFS warning (device dm-2): i/o error > > > at logical 1567691653120 on dev /dev/mapper/userluks, physical > > > 1414087852032, root 19911, inode 624993, offset 5717954560, length > > > 4096, links 1 (path: path/to/file/filename.ext) > > > Jul 27 21:54:45 server kernel: BTRFS warning (device dm-2): i/o error > > > at logical 1567691653120 on dev /dev/mapper/userluks, physical > > > 1414087852032, root 19989, inode 624993, offset 5717954560, length > > > 4096, links 1 (path: path/to/file/filename.ext) > > > Jul 27 21:54:45 server kernel: BTRFS warning (device dm-2): i/o error > > > at logical 1567691653120 on dev /dev/mapper/userluks, physical > > > 1414087852032, root 20199, inode 624993, offset 5717954560, length > > > 4096, links 1 (path: path/to/file/filename.ext) > > > Jul 27 21:54:45 server kernel: BTRFS error (device dm-2): bdev > > > /dev/mapper/userluks errs: wr 0, rd 2, flush 0, corrupt 0, gen 0 > > > Jul 27 21:54:45 server kernel: BTRFS error (device dm-2): unable to > > > fixup (regular) error at logical 1567691653120 on dev > > > /dev/mapper/userluks > > > Jul 27 21:55:42 server kernel: BTRFS info (device dm-2): scrub: > > > finished on devid 1 with status: 0 > > > > > ...> > > > The volume is a 3TB disk, model ST3000DM001-1CH166 (Seagate Barracuda > > > SATA HDD). > > > > > > Is there a way to mark sectors on the disk as bad? If so, is it > > > > Directly overwriting sector may "fix" it (of course, data is still lost) > > or trigger sector replacement. 
hdparm has --write-sector command
> > although I do not have any experience with it. Or simple dd may suffice.
> > Difference is that hdparm will bypass any kernel block layer recovery.
> > If you had redundant data profile, btrfs scrub would likely have fixed
> > it for you.
>
> I never knew BTRFS could duplicate data without RAID. This looks like
> a great feature for my situation. I think I may upgrade this disk to a
> larger one and enable DUP for data.
>
> Is this a good tutorial to follow?
> https://zejn.net/b/2017/04/30/single-device-data-redundancy-with-btrfs/
>
> Should I expect my data to take twice as much space after enabling DUP?

Twice as much space, and also there is a seeking cost because the data
is written in two locations some distance apart on the media.

It doesn't help if the entire device fails, so at best it typically
only delays the inevitable total failure...but maybe that gives you time
to finish a backup before the drive dies.

> > > advisable to keep using this physical disk?
> > >
> >
> > Well, this happens, if this is just one sector so far I would say yes.
> > You probably need to keep an eye on it though.
>
> Thanks. My guess is that this is an isolated issue. It doesn't seem to
> be growing, but I will watch it.
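To put a rough number on "twice as much space", here is a small sketch
using the figures from the `btrfs fi us` output earlier in the thread
(2.73TiB device, 1.11TiB data used, 12.00GiB of raw metadata). It is
only an estimate: it ignores chunk allocation overhead and snapshots
created later, and the variable names are made up for illustration.

```shell
#!/bin/bash
# Figures taken from the `btrfs fi us` output quoted in the thread.
device_tib=2.73
data_used_tib=1.11
meta_raw_gib=12      # Metadata,DUP already occupies 12.00GiB raw

# DUP stores every data block twice on the same device.
dup_estimate=$(LC_ALL=C awk -v d="$data_used_tib" -v m="$meta_raw_gib" \
    -v dev="$device_tib" 'BEGIN {
        raw = 2 * d + m / 1024
        printf "%.2f %.2f", raw, dev - raw
    }')
echo "raw TiB needed / TiB left over: $dup_estimate"

# The conversion itself would be a balance with a convert filter, e.g.
#   btrfs balance start -dconvert=dup /home/userdata
# (shown as a comment only; not executed by this sketch).
```

On these numbers the existing data would fit as DUP on the current 3TB
disk, but with only about half a TiB to spare.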
* Re: BTRFS scrub reports an error but check doesn't find any errors. 2021-07-28 15:15 ` Dave T 2021-07-28 16:11 ` Andrei Borzenkov @ 2021-07-28 19:18 ` Zygo Blaxell 2021-07-29 3:12 ` Chris Murphy 2 siblings, 0 replies; 11+ messages in thread From: Zygo Blaxell @ 2021-07-28 19:18 UTC (permalink / raw) To: Dave T; +Cc: Btrfs BTRFS On Wed, Jul 28, 2021 at 11:15:31AM -0400, Dave T wrote: > On Tue, Jul 27, 2021 at 5:44 PM Zygo Blaxell > <ce3g8jdj@umail.furryterror.org> wrote: > > > > On Sun, Jul 25, 2021 at 01:39:55PM -0400, Dave T wrote: > > > What does the list recommend I do in this case? > > > > > > starting btrfs scrub ... > > > scrub done for 56cea9cf-5374-4a43-b19d-6b0b143dc635 > > > Scrub started: Sun Jul 25 00:40:43 2021 > > > Status: finished > > > Duration: 2:52:45 > > > Total to scrub: 1.26TiB > > > Rate: 113.72MiB/s > > > Error summary: read=1 > > > Corrected: 0 > > > Uncorrectable: 1 > > > Unverified: 0 > > > ERROR: there are uncorrectable errors > > > > This is a read failure (data not available from device), not a csum error > > (data available but not correct). > > Thank you. > > > > > > dmesg | grep "checksum error at" | tail -n 20 > > > (no output) > > > > You should be looking for a IO failure on the underlying device (the > > one below /dev/mapper/userluks). > > That would be /dev/sde in my case. > > > Look for log messages that appear just > > before btrfs errors, or errors mentioning the device itself: > > > > dmesg | grep -B99 -i btrfs > > > > dmesg | grep -C9 sda > > > > Here is the complete output from a new scrub, including the output of > the commands above. 
> > # fpath="/home/userdata" > > # check_space "$fpath" > Filesystem Size Used Avail Use% Mounted on > /dev/mapper/userluks 2.8T 1.2T 1.7T 42% /home/userdata > Data, single: total=1.25TiB, used=1.11TiB > System, DUP: total=32.00MiB, used=160.00KiB > Metadata, DUP: total=6.00GiB, used=5.32GiB > GlobalReserve, single: total=512.00MiB, used=0.00B > > # balance "$fpath" > starting btrfs balance for /home/userdata > Done, had to relocate 10 out of 1284 chunks > > real 101m35.135s > user 0m0.000s > sys 0m32.991s > ----------------------- > > # scrub "$fpath" > starting btrfs scrub ... > scrub done for 56cea9cf-5374-4a43-b19d-6b0b143dc635 > Scrub started: Tue Jul 27 19:46:25 2021 > Status: finished > Duration: 2:09:17 > Total to scrub: 1.25TiB > Rate: 151.98MiB/s > Error summary: read=1 > Corrected: 0 > Uncorrectable: 1 > Unverified: 0 > ERROR: there are uncorrectable errors > > > # btrfs fi us /home/userdata/ > Overall: > Device size: 2.73TiB > Device allocated: 1.25TiB > Device unallocated: 1.48TiB > Device missing: 0.00B > Used: 1.12TiB > Free (estimated): 1.60TiB (min: 883.62GiB) > Free (statfs, df): 1.60TiB > Data ratio: 1.00 > Metadata ratio: 2.00 > Global reserve: 512.00MiB (used: 0.00B) > Multiple profiles: no > > Data,single: Size:1.24TiB, Used:1.11TiB (90.10%) > /dev/mapper/userluks 1.24TiB > > Metadata,DUP: Size:6.00GiB, Used:5.35GiB (89.17%) > /dev/mapper/userluks 12.00GiB > > System,DUP: Size:32.00MiB, Used:160.00KiB (0.49%) > /dev/mapper/userluks 64.00MiB > > Unallocated: > /dev/mapper/userluks 1.48TiB > > # lsblk > NAME LABEL UUID PARTUUID > MODEL SIZE SERIAL > MOUNTPOINT > > sde > ST3000DM001-1CH166 2.7T XXXXXXX > └─sde1 56cea9cf-3566-49f3-8abf-e59246f88a43 > 5db1ed2f-572e-4388-920b-6e4bfabf9e72 2.7T > └─userluks USERCOMMON 56cea9cf-5374-4a43-b19d-6b0b143dc635 > 2.7T > /mnt/temp/user > > > dmesg | grep -B99 -i btrfs > The only relevant output is: > > [ +0.025578] BTRFS warning (device dm-4): qgroup rescan init failed, > qgroup is not enabled > [ +0.000983] 
BTRFS warning (device dm-4): qgroup rescan init failed, > qgroup is not enabled > [ +0.026646] BTRFS warning (device dm-4): qgroup rescan init failed, > qgroup is not enabled > [ +0.001383] BTRFS warning (device dm-4): qgroup rescan init failed, > qgroup is not enabled > [Jul28 00:03] BTRFS warning (device dm-4): qgroup rescan init failed, > qgroup is not enabled > [ +0.001119] BTRFS warning (device dm-4): qgroup rescan init failed, > qgroup is not enabled > [ +1.828343] BTRFS warning (device dm-0): qgroup rescan init failed, > qgroup is not enabled > [ +0.001132] BTRFS warning (device dm-0): qgroup rescan init failed, > qgroup is not enabled > [ +0.029263] BTRFS warning (device dm-0): qgroup rescan init failed, > qgroup is not enabled > [ +0.000969] BTRFS warning (device dm-0): qgroup rescan init failed, > qgroup is not enabled > [ +0.068749] BTRFS warning (device dm-0): qgroup rescan init failed, > qgroup is not enabled > [ +0.005558] BTRFS warning (device dm-0): qgroup rescan init failed, > qgroup is not enabled > [ +2.872178] BTRFS warning (device dm-2): qgroup rescan init failed, > qgroup is not enabled > [ +0.024708] BTRFS warning (device dm-2): qgroup rescan init failed, > qgroup is not enabled > [ +0.041130] BTRFS warning (device dm-2): qgroup rescan init failed, > qgroup is not enabled > [ +4.115641] BTRFS warning (device dm-2): qgroup rescan init failed, > qgroup is not enabled > [ +0.032633] BTRFS warning (device dm-2): qgroup rescan init failed, > qgroup is not enabled > [ +0.059885] BTRFS warning (device dm-2): qgroup rescan init failed, > qgroup is not enabled > > # dmesg | grep -C9 sde > (no output) > > The journal shows: > > Jul 27 21:54:39 server kernel: ata10.00: exception Emask 0x0 SAct > 0xffffffff SErr 0x0 action 0x0 > Jul 27 21:54:39 server kernel: ata10.00: irq_stat 0x40000008 > Jul 27 21:54:39 server kernel: ata10.00: failed command: READ FPDMA QUEUED > Jul 27 21:54:39 server kernel: ata10.00: cmd > 60/00:90:98:2f:9f/03:00:a4:00:00/40 tag 18 
ncq dma 393216 in > res > 41/40:00:20:32:9f/00:03:a4:00:00/00 Emask 0x409 (media error) <F> > Jul 27 21:54:39 server kernel: ata10.00: status: { DRDY ERR } > Jul 27 21:54:39 server kernel: ata10.00: error: { UNC } > Jul 27 21:54:39 server kernel: ata10.00: configured for UDMA/133 > Jul 27 21:54:39 server kernel: sd 9:0:0:0: [sde] tag#18 FAILED Result: > hostbyte=DID_OK driverbyte=DRIVER_SENSE cmd_age=3s > Jul 27 21:54:39 server kernel: sd 9:0:0:0: [sde] tag#18 Sense Key : > Medium Error [current] > Jul 27 21:54:39 server kernel: sd 9:0:0:0: [sde] tag#18 Add. Sense: > Unrecovered read error - auto reallocate failed > Jul 27 21:54:39 server kernel: sd 9:0:0:0: [sde] tag#18 CDB: Read(16) > 88 00 00 00 00 00 a4 9f 2f 98 00 00 03 00 00 00 > Jul 27 21:54:39 server kernel: blk_update_request: I/O error, dev sde, > sector 2761896480 op 0x0:(READ) flags 0x0 phys_seg 15 prio class 0 > Jul 27 21:54:39 server kernel: ata10: EH complete > Jul 27 21:54:45 server kernel: ata10.00: exception Emask 0x0 SAct > 0x4000000 SErr 0x0 action 0x0 > Jul 27 21:54:45 server kernel: ata10.00: irq_stat 0x40000008 > Jul 27 21:54:45 server kernel: ata10.00: failed command: READ FPDMA QUEUED > Jul 27 21:54:45 server kernel: ata10.00: cmd > 60/08:d0:20:32:9f/00:00:a4:00:00/40 tag 26 ncq dma 4096 in > res > 41/40:08:20:32:9f/00:00:a4:00:00/00 Emask 0x409 (media error) <F> > Jul 27 21:54:45 server kernel: ata10.00: status: { DRDY ERR } > Jul 27 21:54:45 server kernel: ata10.00: error: { UNC } > Jul 27 21:54:45 server kernel: ata10.00: configured for UDMA/133 > Jul 27 21:54:45 server kernel: sd 9:0:0:0: [sde] tag#26 FAILED Result: > hostbyte=DID_OK driverbyte=DRIVER_SENSE cmd_age=4s > Jul 27 21:54:45 server kernel: sd 9:0:0:0: [sde] tag#26 Sense Key : > Medium Error [current] > Jul 27 21:54:45 server kernel: sd 9:0:0:0: [sde] tag#26 Add. 
Sense: > Unrecovered read error - auto reallocate failed > Jul 27 21:54:45 server kernel: sd 9:0:0:0: [sde] tag#26 CDB: Read(16) > 88 00 00 00 00 00 a4 9f 32 20 00 00 00 08 00 00 > Jul 27 21:54:45 server kernel: blk_update_request: I/O error, dev sde, > sector 2761896480 op 0x0:(READ) flags 0x800 phys_seg 1 prio class 0 > Jul 27 21:54:45 server kernel: ata10: EH complete > Jul 27 21:54:45 server kernel: BTRFS warning (device dm-2): i/o error > at logical 1567691653120 on dev /dev/mapper/userluks, physical > 1414087852032, root 19911, inode 624993, offset 5717954560, length > 4096, links 1 (path: path/to/file/filename.ext) > Jul 27 21:54:45 server kernel: BTRFS warning (device dm-2): i/o error > at logical 1567691653120 on dev /dev/mapper/userluks, physical > 1414087852032, root 19989, inode 624993, offset 5717954560, length > 4096, links 1 (path: path/to/file/filename.ext) > Jul 27 21:54:45 server kernel: BTRFS warning (device dm-2): i/o error > at logical 1567691653120 on dev /dev/mapper/userluks, physical > 1414087852032, root 20199, inode 624993, offset 5717954560, length > 4096, links 1 (path: path/to/file/filename.ext) > Jul 27 21:54:45 server kernel: BTRFS error (device dm-2): bdev > /dev/mapper/userluks errs: wr 0, rd 2, flush 0, corrupt 0, gen 0 > Jul 27 21:54:45 server kernel: BTRFS error (device dm-2): unable to > fixup (regular) error at logical 1567691653120 on dev > /dev/mapper/userluks > Jul 27 21:55:42 server kernel: BTRFS info (device dm-2): scrub: > finished on devid 1 with status: 0 > > > btrfs su li /home/userdata/ > > ... > ID 19911 gen 152144 top level 257 path @usertop/18053/snapshot > ID 19989 gen 152144 top level 257 path @usertop/18131/snapshot > ID 20199 gen 152144 top level 257 path @usertop/18313/snapshot > ... > > snapper -c userdata ls > > ... 
> 18053 | single | | Fri 01 Jan 2021 12:00:18 AM EST | root |
> timeline | timeline |
> 18131 | single | | Mon 04 Jan 2021 12:00:15 AM EST | root |
> timeline | timeline |
> 18313 | single | | Mon 11 Jan 2021 12:00:20 AM EST | root |
> timeline | timeline |
> ...
>
> There are snapshots after that date without any errors. The live (r/w)
> file system does not show any errors.
>
> The volume is a 3TB disk, model ST3000DM001-1CH166 (Seagate Barracuda
> SATA HDD).
>
> Is there a way to mark sectors on the disk as bad? If so, is it
> advisable to keep using this physical disk?

Delete or replace the file so the blocks are released to free space.
The next write over that sector will clobber the bad sector with fresh
data, and the drive should remap the bad sector at that time.

You can also try to expedite that process with dd, but that's error
prone (you have to calculate the offsets and not typo them into dd
commands...) and not really necessary (I never bother with that).

You should monitor the drive SMART data and run SMART self-tests
periodically (in addition to btrfs scrubs, they check different things).
If the number of UNC sectors increases over time, it's a good indicator
something is wrong with the drive and it may need replacement soon.

Seagate Barracudas often follow a pattern where over a period of a few
days, there's one UNC sector, then two, then ten, then a thousand, then
the drive doesn't spin up any more and all data is lost. On those drives
I'd run a SMART long self-test daily if possible, or as close to daily
as your system load allows.

> Thanks for sharing your knowledge. This is very helpful.
>
> > > # dmesg | grep -i checksum
> > > [ +0.001698] xor: automatically using best checksumming function avx
> > > (not related to BTRFS, right?)
> > > > > > # btrfs fi us /path/to/xyz > > > Overall: > > > Device size: 2.73TiB > > > Device allocated: 1.26TiB > > > Device unallocated: 1.47TiB > > > Device missing: 0.00B > > > Used: 1.12TiB > > > Free (estimated): 1.60TiB (min: 888.70GiB) > > > Free (statfs, df): 1.60TiB > > > Data ratio: 1.00 > > > Metadata ratio: 2.00 > > > Global reserve: 512.00MiB (used: 0.00B) > > > Multiple profiles: no > > > > > > Data,single: Size:1.25TiB, Used:1.11TiB (89.38%) > > > /dev/mapper/userluks 1.25TiB > > > > > > Metadata,DUP: Size:6.00GiB, Used:5.26GiB (87.67%) > > > /dev/mapper/userluks 12.00GiB > > > > > > System,DUP: Size:32.00MiB, Used:160.00KiB (0.49%) > > > /dev/mapper/userluks 64.00MiB > > > > Since the error was not corrected, it likely occurred in the data blocks. > > Yes, it appears so from the info above. > > > > > A metadata error would be correctable, so check wouldn't report it because > > the scrub will have already corrected it (assuming the underlying drive > > is still healthy enough to remap bad sectors). > > > > > Unallocated: > > > /dev/mapper/userluks 1.47TiB > > > > > > # btrfs check /dev/mapper/xyz > > > > That command won't read any data blocks, so it won't see any errors there. > > > > > Opening filesystem to check... 
> > > Checking filesystem on /dev/mapper/xyz > > > UUID: 56cea9cf-5374-4a43-b19d-6b0b143dc635 > > > [1/7] checking root items > > > [2/7] checking extents > > > [3/7] checking free space cache > > > [4/7] checking fs roots > > > [5/7] checking only csums items (without verifying data) > > > [6/7] checking root refs > > > [7/7] checking quota groups skipped (not enabled on this FS) > > > found 1230187327496 bytes used, no error found > > > total csum bytes: 1195610680 > > > total tree bytes: 5648285696 > > > total fs tree bytes: 4011016192 > > > total extent tree bytes: 379256832 > > > btree space waste bytes: 827370015 > > > file data blocks allocated: 5497457123328 > > > referenced 5523039584256 > > > > > > If more info is needed, please let me know. Recommendations and advice > > > are appreciated. > > > Thank you. ^ permalink raw reply [flat|nested] 11+ messages in thread
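On the point that dd offsets have to be calculated by hand: the two
numbers in the logs can at least be cross-checked against each other.
The sketch below converts the btrfs "physical" byte offset on
/dev/mapper/userluks into 512-byte sectors and compares it with the
sector reported for /dev/sde. The 6144-sector gap it finds would be
consistent with a partition starting at sector 2048 plus a 4096-sector
LUKS header, but that split is a guess; the partition table and
`cryptsetup luksDump` would confirm it. The dd line is printed as a
read-only dry run and never executed.

```shell
#!/bin/bash
btrfs_physical=1414087852032   # bytes, from the BTRFS i/o error warning
sde_sector=2761896480          # from the blk_update_request line

# Offset of the bad block on the dm device, in 512-byte sectors.
dm_sector=$(( btrfs_physical / 512 ))

# Distance between the start of /dev/sde and the start of the
# decrypted device, derived purely from the two log values.
offset_sectors=$(( sde_sector - dm_sector ))

echo "dm sector: $dm_sector"
echo "sde - dm offset: $offset_sectors sectors ($(( offset_sectors * 512 / 1048576 )) MiB)"

# Index of the 4096-byte block on the dm device, for dd with bs=4096.
dd_block=$(( btrfs_physical / 4096 ))
echo "dd if=/dev/mapper/userluks of=/dev/null bs=4096 skip=$dd_block count=1 iflag=direct"
```

If the two values had not differed by a plausible header offset, that
would be a hint that one of the numbers was copied wrongly, which is
the kind of typo the dd route invites.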
* Re: BTRFS scrub reports an error but check doesn't find any errors.
  2021-07-28 15:15 ` Dave T
  2021-07-28 16:11 ` Andrei Borzenkov
  2021-07-28 19:18 ` Zygo Blaxell
@ 2021-07-29 3:12 ` Chris Murphy
  2 siblings, 0 replies; 11+ messages in thread
From: Chris Murphy @ 2021-07-29 3:12 UTC
To: Dave T; +Cc: Zygo Blaxell, Btrfs BTRFS

On Wed, Jul 28, 2021 at 9:15 AM Dave T <davestechshop@gmail.com> wrote:
>
> The journal shows:
>
> Jul 27 21:54:39 server kernel: ata10.00: exception Emask 0x0 SAct
> 0xffffffff SErr 0x0 action 0x0
> Jul 27 21:54:39 server kernel: ata10.00: irq_stat 0x40000008
> Jul 27 21:54:39 server kernel: ata10.00: failed command: READ FPDMA QUEUED
> Jul 27 21:54:39 server kernel: ata10.00: cmd
> 60/00:90:98:2f:9f/03:00:a4:00:00/40 tag 18 ncq dma 393216 in
> res
> 41/40:00:20:32:9f/00:03:a4:00:00/00 Emask 0x409 (media error) <F>
> Jul 27 21:54:39 server kernel: ata10.00: status: { DRDY ERR }
> Jul 27 21:54:39 server kernel: ata10.00: error: { UNC }
> Jul 27 21:54:39 server kernel: ata10.00: configured for UDMA/133
> Jul 27 21:54:39 server kernel: sd 9:0:0:0: [sde] tag#18 FAILED Result:
> hostbyte=DID_OK driverbyte=DRIVER_SENSE cmd_age=3s
> Jul 27 21:54:39 server kernel: sd 9:0:0:0: [sde] tag#18 Sense Key :
> Medium Error [current]
> Jul 27 21:54:39 server kernel: sd 9:0:0:0: [sde] tag#18 Add. Sense:
> Unrecovered read error - auto reallocate failed
> Jul 27 21:54:39 server kernel: sd 9:0:0:0: [sde] tag#18 CDB: Read(16)
> 88 00 00 00 00 00 a4 9f 2f 98 00 00 03 00 00 00
> Jul 27 21:54:39 server kernel: blk_update_request: I/O error, dev sde,

Bad sector. It'll need to be overwritten to be remapped by the drive
firmware. I can't tell if it's 512n or 512e but the write needs to be
the size of the physical sector or else the firmware turns the write
into read-modify-write and you'll just get the same UNC error on the
read.

DUP profile will recover from this automatically. The file system is
informed of the physical sector, it does a reverse lookup to find the
logical block, reads good data from the good copy and overwrites the
bad copy. And either that'll stick or fail with an internal write
failure in which case the firmware remaps to a reserve sector. Once
reserve sectors are gone, you get UNC write errors - and that pretty
much means the drive is toast.

--
Chris Murphy
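The physical-sector point can be checked with a little arithmetic. The
sketch below rounds an LBA down to the first logical sector of the
physical sector that contains it. The helper name aligned_start is
hypothetical, and the 512/4096 sizes are an assumption for a 512e
drive; the real values can be read from
/sys/block/sde/queue/logical_block_size and
/sys/block/sde/queue/physical_block_size.

```shell
#!/bin/bash
# Round an LBA down to the start of its physical sector.
# Arguments: LBA, logical block size, physical block size (bytes).
aligned_start() {
    local lba=$1 logical=$2 physical=$3
    local per_phys=$(( physical / logical ))
    echo $(( lba - lba % per_phys ))
}

bad_lba=2761896480   # sector from the blk_update_request line

# Assumed 512e geometry: 8 logical sectors per physical sector.
start=$(aligned_start "$bad_lba" 512 4096)
echo "physical sector starts at LBA $start"

# With 512n geometry every LBA is its own physical sector.
aligned_start "$bad_lba" 512 512
```

Under the 512e assumption the failing LBA is already the first sector
of its physical sector, so a full overwrite would cover LBAs 2761896480
through 2761896487, which lines up with the 8-block read in the retry.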