* DRDY errors are not consistent with scrub results @ 2018-08-27 22:51 Cerem Cem ASLAN [not found] ` <CAJCQCtSq5K90gpfGQN8JhqQddBg62m8EG_bFuWN5XyzdNStDfw@mail.gmail.com> 2018-08-29 9:56 ` ein 0 siblings, 2 replies; 13+ messages in thread From: Cerem Cem ASLAN @ 2018-08-27 22:51 UTC (permalink / raw) To: Btrfs BTRFS Hi, I'm getting DRDY ERR messages which causes system crash on the server: # tail -n 40 /var/log/kern.log.1 Aug 24 21:04:55 aea3 kernel: [ 939.228059] lxc-bridge: port 5(vethI7JDHN) entered disabled state Aug 24 21:04:55 aea3 kernel: [ 939.300602] eth0: renamed from vethQ5Y2OF Aug 24 21:04:55 aea3 kernel: [ 939.328245] IPv6: ADDRCONF(NETDEV_UP): eth0: link is not ready Aug 24 21:04:55 aea3 kernel: [ 939.328453] IPv6: ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready Aug 24 21:04:55 aea3 kernel: [ 939.328474] IPv6: ADDRCONF(NETDEV_CHANGE): vethI7JDHN: link becomes ready Aug 24 21:04:55 aea3 kernel: [ 939.328491] lxc-bridge: port 5(vethI7JDHN) entered blocking state Aug 24 21:04:55 aea3 kernel: [ 939.328493] lxc-bridge: port 5(vethI7JDHN) entered forwarding state Aug 24 21:04:59 aea3 kernel: [ 943.085647] cgroup: cgroup2: unknown option "nsdelegate" Aug 24 21:16:15 aea3 kernel: [ 1619.400016] perf: interrupt took too long (2506 > 2500), lowering kernel.perf_event_max_sample_rate to 79750 Aug 24 21:17:11 aea3 kernel: [ 1675.515815] perf: interrupt took too long (3137 > 3132), lowering kernel.perf_event_max_sample_rate to 63750 Aug 24 21:17:13 aea3 kernel: [ 1677.080837] cgroup: cgroup2: unknown option "nsdelegate" Aug 25 22:38:31 aea3 kernel: [92955.512098] usb 4-2: USB disconnect, device number 2 Aug 26 02:14:21 aea3 kernel: [105906.035038] lxc-bridge: port 4(vethCTKU4K) entered disabled state Aug 26 02:15:30 aea3 kernel: [105974.107521] lxc-bridge: port 4(vethO59BPD) entered disabled state Aug 26 02:15:30 aea3 kernel: [105974.109991] device vethO59BPD left promiscuous mode Aug 26 02:15:30 aea3 kernel: [105974.109995] lxc-bridge: port 4(vethO59BPD) entered 
disabled state Aug 26 02:15:30 aea3 kernel: [105974.710490] lxc-bridge: port 4(vethBAYODL) entered blocking state Aug 26 02:15:30 aea3 kernel: [105974.710493] lxc-bridge: port 4(vethBAYODL) entered disabled state Aug 26 02:15:30 aea3 kernel: [105974.710545] device vethBAYODL entered promiscuous mode Aug 26 02:15:30 aea3 kernel: [105974.710598] IPv6: ADDRCONF(NETDEV_UP): vethBAYODL: link is not ready Aug 26 02:15:30 aea3 kernel: [105974.710600] lxc-bridge: port 4(vethBAYODL) entered blocking state Aug 26 02:15:30 aea3 kernel: [105974.710601] lxc-bridge: port 4(vethBAYODL) entered forwarding state Aug 26 02:16:35 aea3 kernel: [106039.674089] BTRFS: device fsid 5b844c7a-0cbd-40a7-a8e3-6bc636aba033 devid 1 transid 984 /dev/dm-3 Aug 26 02:17:21 aea3 kernel: [106085.352453] ata4.00: failed command: READ DMA Aug 26 02:17:21 aea3 kernel: [106085.352901] ata4.00: status: { DRDY ERR } Aug 26 02:18:56 aea3 kernel: [106180.648062] ata4.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0 Aug 26 02:18:56 aea3 kernel: [106180.648333] ata4.00: BMDMA stat 0x25 Aug 26 02:18:56 aea3 kernel: [106180.648515] ata4.00: failed command: READ DMA Aug 26 02:18:56 aea3 kernel: [106180.648706] ata4.00: cmd c8/00:08:80:9c:bb/00:00:00:00:00/e3 tag 0 dma 4096 in Aug 26 02:18:56 aea3 kernel: [106180.648706] res 51/40:00:80:9c:bb/00:00:00:00:00/03 Emask 0x9 (media error) Aug 26 02:18:56 aea3 kernel: [106180.649380] ata4.00: status: { DRDY ERR } Aug 26 02:18:56 aea3 kernel: [106180.649743] ata4.00: error: { UNC } Aug 26 02:18:56 aea3 kernel: [106180.779311] ata4.00: configured for UDMA/133 Aug 26 02:18:56 aea3 kernel: [106180.779331] sd 3:0:0:0: [sda] tag#0 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE Aug 26 02:18:56 aea3 kernel: [106180.779335] sd 3:0:0:0: [sda] tag#0 Sense Key : Medium Error [current] Aug 26 02:18:56 aea3 kernel: [106180.779339] sd 3:0:0:0: [sda] tag#0 Add. 
Sense: Unrecovered read error - auto reallocate failed Aug 26 02:18:56 aea3 kernel: [106180.779343] sd 3:0:0:0: [sda] tag#0 CDB: Read(10) 28 00 03 bb 9c 80 00 00 08 00 Aug 26 02:18:56 aea3 kernel: [106180.779346] blk_update_request: I/O error, dev sda, sector 62626944 Aug 26 02:18:56 aea3 kernel: [106180.779703] BTRFS error (device dm-2): bdev /dev/mapper/master-root errs: wr 0, rd 40, flush 0, corrupt 0, gen 0 Aug 26 02:18:56 aea3 kernel: [106180.779936] ata4: EH complete I have always seen these DRDY errors whenever I experienced physical hard drive failures, so I expected `btrfs scrub` to show similar errors, but it doesn't: btrfs scrub status /mnt/peynir/ scrub status for 8827cb0e-52d7-4f99-90fd-a975cafbfa46 scrub started at Tue Aug 28 00:43:55 2018 and finished after 00:02:07 total bytes scrubbed: 12.45GiB with 0 errors I took new snapshots of both root and the LXC containers and nothing went wrong. To be safe, I reformatted the swap partition (I had seen some messages about the swap partition on the crash screen). I'm not sure how to proceed at the moment. Taking successful backups made me think that everything might be okay, but I'm not sure whether I should continue trusting the drive. What additional checks should I perform? ^ permalink raw reply [flat|nested] 13+ messages in thread
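One quick check: the failing LBA that `blk_update_request` reports can be mapped to a byte offset and compared against the partition table. A minimal sketch (not from the thread), assuming 512-byte logical sectors, which the `Read(10)` CDB above implies:

```shell
#!/bin/sh
# Sketch: extract the failing sector from a blk_update_request line and
# compute its byte offset, assuming 512-byte logical sectors.
# The sample line is copied from the kernel log above.
line='blk_update_request: I/O error, dev sda, sector 62626944'
sector=${line##*sector }      # strip everything up to "sector " -> 62626944
offset=$(( sector * 512 ))    # byte offset from the start of sda
echo "failing sector: $sector"
echo "byte offset:    $offset"
# With the offset you can check (e.g. 'parted /dev/sda unit B print')
# which partition - and hence which LV / filesystem - contains it.
```

The offset only identifies the region on the whole disk; with LVM-on-LUKS in between, mapping it to a file needs the layer offsets as well.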
[parent not found: <CAJCQCtSq5K90gpfGQN8JhqQddBg62m8EG_bFuWN5XyzdNStDfw@mail.gmail.com>]
[parent not found: <CAN4oSBeHwnsm5Ecz1hAQLk6s6utHfn5XeR8xMhnZpmT-sb-_iw@mail.gmail.com>]
* Re: DRDY errors are not consistent with scrub results [not found] ` <CAN4oSBeHwnsm5Ecz1hAQLk6s6utHfn5XeR8xMhnZpmT-sb-_iw@mail.gmail.com> @ 2018-08-28 0:38 ` Chris Murphy 2018-08-28 0:39 ` Chris Murphy 0 siblings, 1 reply; 13+ messages in thread From: Chris Murphy @ 2018-08-28 0:38 UTC (permalink / raw) To: Cerem Cem ASLAN, Btrfs BTRFS On Mon, Aug 27, 2018 at 6:05 PM, Cerem Cem ASLAN <ceremcem@ceremcem.net> wrote: > Note that I've directly received this reply, not by mail list. I'm not > sure this is intended or not. I intended to do Reply to All but somehow this doesn't always work out between the user and Gmail, I'm just gonna assume gmail is being an asshole again. > Chris Murphy <lists@colorremedies.com>, 28 Ağu 2018 Sal, 02:25 > tarihinde şunu yazdı: >> >> On Mon, Aug 27, 2018 at 4:51 PM, Cerem Cem ASLAN <ceremcem@ceremcem.net> wrote: >> > Hi, >> > >> > I'm getting DRDY ERR messages which causes system crash on the server: >> > >> > # tail -n 40 /var/log/kern.log.1 >> > Aug 24 21:04:55 aea3 kernel: [ 939.228059] lxc-bridge: port >> > 5(vethI7JDHN) entered disabled state >> > Aug 24 21:04:55 aea3 kernel: [ 939.300602] eth0: renamed from vethQ5Y2OF >> > Aug 24 21:04:55 aea3 kernel: [ 939.328245] IPv6: ADDRCONF(NETDEV_UP): >> > eth0: link is not ready >> > Aug 24 21:04:55 aea3 kernel: [ 939.328453] IPv6: >> > ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready >> > Aug 24 21:04:55 aea3 kernel: [ 939.328474] IPv6: >> > ADDRCONF(NETDEV_CHANGE): vethI7JDHN: link becomes ready >> > Aug 24 21:04:55 aea3 kernel: [ 939.328491] lxc-bridge: port >> > 5(vethI7JDHN) entered blocking state >> > Aug 24 21:04:55 aea3 kernel: [ 939.328493] lxc-bridge: port >> > 5(vethI7JDHN) entered forwarding state >> > Aug 24 21:04:59 aea3 kernel: [ 943.085647] cgroup: cgroup2: unknown >> > option "nsdelegate" >> > Aug 24 21:16:15 aea3 kernel: [ 1619.400016] perf: interrupt took too >> > long (2506 > 2500), lowering kernel.perf_event_max_sample_rate to >> > 79750 >> > Aug 24 21:17:11 aea3 
kernel: [ 1675.515815] perf: interrupt took too >> > long (3137 > 3132), lowering kernel.perf_event_max_sample_rate to >> > 63750 >> > Aug 24 21:17:13 aea3 kernel: [ 1677.080837] cgroup: cgroup2: unknown >> > option "nsdelegate" >> > Aug 25 22:38:31 aea3 kernel: [92955.512098] usb 4-2: USB disconnect, >> > device number 2 >> > Aug 26 02:14:21 aea3 kernel: [105906.035038] lxc-bridge: port >> > 4(vethCTKU4K) entered disabled state >> > Aug 26 02:15:30 aea3 kernel: [105974.107521] lxc-bridge: port >> > 4(vethO59BPD) entered disabled state >> > Aug 26 02:15:30 aea3 kernel: [105974.109991] device vethO59BPD left >> > promiscuous mode >> > Aug 26 02:15:30 aea3 kernel: [105974.109995] lxc-bridge: port >> > 4(vethO59BPD) entered disabled state >> > Aug 26 02:15:30 aea3 kernel: [105974.710490] lxc-bridge: port >> > 4(vethBAYODL) entered blocking state >> > Aug 26 02:15:30 aea3 kernel: [105974.710493] lxc-bridge: port >> > 4(vethBAYODL) entered disabled state >> > Aug 26 02:15:30 aea3 kernel: [105974.710545] device vethBAYODL entered >> > promiscuous mode >> > Aug 26 02:15:30 aea3 kernel: [105974.710598] IPv6: >> > ADDRCONF(NETDEV_UP): vethBAYODL: link is not ready >> > Aug 26 02:15:30 aea3 kernel: [105974.710600] lxc-bridge: port >> > 4(vethBAYODL) entered blocking state >> > Aug 26 02:15:30 aea3 kernel: [105974.710601] lxc-bridge: port >> > 4(vethBAYODL) entered forwarding state >> > Aug 26 02:16:35 aea3 kernel: [106039.674089] BTRFS: device fsid >> > 5b844c7a-0cbd-40a7-a8e3-6bc636aba033 devid 1 transid 984 /dev/dm-3 >> > Aug 26 02:17:21 aea3 kernel: [106085.352453] ata4.00: failed command: READ DMA >> > Aug 26 02:17:21 aea3 kernel: [106085.352901] ata4.00: status: { DRDY ERR } >> > Aug 26 02:18:56 aea3 kernel: [106180.648062] ata4.00: exception Emask >> > 0x0 SAct 0x0 SErr 0x0 action 0x0 >> > Aug 26 02:18:56 aea3 kernel: [106180.648333] ata4.00: BMDMA stat 0x25 >> > Aug 26 02:18:56 aea3 kernel: [106180.648515] ata4.00: failed command: READ DMA >> > Aug 26 02:18:56 aea3 
kernel: [106180.648706] ata4.00: cmd >> > c8/00:08:80:9c:bb/00:00:00:00:00/e3 tag 0 dma 4096 in >> > Aug 26 02:18:56 aea3 kernel: [106180.648706] res >> > 51/40:00:80:9c:bb/00:00:00:00:00/03 Emask 0x9 (media error) >> > Aug 26 02:18:56 aea3 kernel: [106180.649380] ata4.00: status: { DRDY ERR } >> > Aug 26 02:18:56 aea3 kernel: [106180.649743] ata4.00: error: { UNC } >> >> Classic case of uncorrectable read error due to sector failure. >> >> >> >> > Aug 26 02:18:56 aea3 kernel: [106180.779311] ata4.00: configured for UDMA/133 >> > Aug 26 02:18:56 aea3 kernel: [106180.779331] sd 3:0:0:0: [sda] tag#0 >> > FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE >> > Aug 26 02:18:56 aea3 kernel: [106180.779335] sd 3:0:0:0: [sda] tag#0 >> > Sense Key : Medium Error [current] >> > Aug 26 02:18:56 aea3 kernel: [106180.779339] sd 3:0:0:0: [sda] tag#0 >> > Add. Sense: Unrecovered read error - auto reallocate failed >> > Aug 26 02:18:56 aea3 kernel: [106180.779343] sd 3:0:0:0: [sda] tag#0 >> > CDB: Read(10) 28 00 03 bb 9c 80 00 00 08 00 >> > Aug 26 02:18:56 aea3 kernel: [106180.779346] blk_update_request: I/O >> > error, dev sda, sector 62626944 >> >> And the drive has reported the physical sector that's failing. >> >> >> >> > Aug 26 02:18:56 aea3 kernel: [106180.779703] BTRFS error (device >> > dm-2): bdev /dev/mapper/master-root errs: wr 0, rd 40, flush 0, >> > corrupt 0, gen 0 >> > Aug 26 02:18:56 aea3 kernel: [106180.779936] ata4: EH complete >> >> And Btrfs reports it as a read error. Is this a single drive setup? > > Yes, this is a single drive setup. > >> And what's the profile for metadata and data? 
> > sudo btrfs fi usage /mnt/peynir/ > [sudo] password for aea: > Overall: > Device size: 931.32GiB > Device allocated: 16.08GiB > Device unallocated: 915.24GiB > Device missing: 0.00B > Used: 12.53GiB > Free (estimated): 915.81GiB (min: 458.19GiB) > Data ratio: 1.00 > Metadata ratio: 2.00 > Global reserve: 43.94MiB (used: 2.45MiB) > > Data,single: Size:12.01GiB, Used:11.43GiB > /dev/mapper/master-root 12.01GiB > > Metadata,single: Size:8.00MiB, Used:0.00B > /dev/mapper/master-root 8.00MiB > > Metadata,DUP: Size:2.00GiB, Used:562.08MiB > /dev/mapper/master-root 4.00GiB > > System,single: Size:4.00MiB, Used:0.00B > /dev/mapper/master-root 4.00MiB > > System,DUP: Size:32.00MiB, Used:16.00KiB > /dev/mapper/master-root 64.00MiB > > Unallocated: > /dev/mapper/master-root 915.24GiB OK this looks like it maybe was created a while ago, it has these empty single chunk items that was common a while back. There is a low risk to clean it up, but I still advise backup first: 'btrfs balance start -mconvert=dup <mountpoint>' OK so DUP metadata means that you have two copies. So either the previous email lacks a complete dmesg showing that Btrfs tried to do a fix up on metadata, or it was reading data and since there's no copy it fails. > > > Only if the >> data/metadata on this sector is DUP or raid1 or raid56 can Btrfs >> automatically fix it up. If there's only one copy, whatever is on that >> sector is lost, if this is a persistent error. But maybe it's >> transient. >> >> What do you get for >> >> sudo smartctl -x /dev/sda > > https://gist.github.com/ceremcem/55a219f4c46781c1d4d58e0659500c96 > >> >> That'll show stats on bad sectors, and also if the drive supports SCT >> ERC and what the settings are. >> > > I think the drive screams for help. Yep. >5 Reallocated_Sector_Ct PO--CK 070 051 036 - 40472 That's a lot. If the drive is under warranty I'd aggressively try to get it replaced. >187 Reported_Uncorrect -O--CK 001 001 000 - 4548 That's too many. 
It might be by now there are no more reserve sectors left so remapping isn't possible if there are this many uncorrectable. >197 Current_Pending_Sector -O--C- 070 069 000 - 5000 >198 Offline_Uncorrectable ----C- 070 069 000 - 5000 Same. >SCT Error Recovery Control command not supported OK too bad, no way to increase the recovery time and give it more of a chance to recover the data. So yeah, make a backup and get the drive replaced. > >> >> > >> > >> > I always saw these DRDY errors whenever I experience physical hard >> > drive errors, so I expect `btrfs scrub` show some kind of similar >> > errors but it doesn't: >> > >> > btrfs scrub status /mnt/peynir/ >> > scrub status for 8827cb0e-52d7-4f99-90fd-a975cafbfa46 >> > scrub started at Tue Aug 28 00:43:55 2018 and finished after 00:02:07 >> > total bytes scrubbed: 12.45GiB with 0 errors >> >> Well that suggests this is a transient problem. Make sure you have >> backups, drive could be dying or maybe it'll stay in this state for a >> while longer. > > I've very good set of backups, so when the drive dies it won't hurt at > all. But expecting a possible decease of the hard drive will make it > easier to get over. I would consider this drive usable only for educational and experimentation purposes at this point; real world Btrfs disaster testing ;-) > >> >> >> > >> > I took new snapshots for both root and the LXC containers and nothing >> > gone wrong. To be confident, I reformat the swap partition (which I >> > saw some messages about swap partition in the crash screen). >> > >> > I'm not sure how to proceed at the moment. Taking succesfull backups >> > made me think that everything might be okay but I'm not sure if I >> > should continue trusting the drive or not. What additional checks >> > should I perform? >> >> What you could do is a full balance. This will read everything like a >> scrub, and then write it back out. 
So in theory, if the write hits the >> transient sector the firmware will determine whether the sector needs >> remapping or not. > > I've started a full balance job right now. With this many sectors pending, I suspect it will fail spectacularly. But it's a great test in a way. I mean, the block layer might complain about failed writes, and on any failed write for rootfs by any file system, it should just fall over and probably not gracefully. But it might be educational. -- Chris Murphy ^ permalink raw reply [flat|nested] 13+ messages in thread
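The SMART attributes Chris reads above can also be checked mechanically. A sketch that flags the relevant attribute IDs from `smartctl -A`-style output; the sample lines mirror the values quoted in the thread, and the zero-tolerance policy (any nonzero raw value is a red flag) is an assumption of this sketch, not a smartctl feature:

```shell
#!/bin/sh
# Sketch: flag the sector-health SMART attributes (5, 187, 197, 198).
# In real use: smart_sample=$(sudo smartctl -A /dev/sda)
smart_sample='  5 Reallocated_Sector_Ct   PO--CK   070   051   036    -    40472
187 Reported_Uncorrect      -O--CK   001   001   000    -    4548
197 Current_Pending_Sector  -O--C-   070   069   000    -    5000
198 Offline_Uncorrectable   ----C-   070   069   000    -    5000'

# Print name=raw_value for any of the watched attributes with a nonzero raw value.
bad=$(printf '%s\n' "$smart_sample" | awk '$1 ~ /^(5|187|197|198)$/ && $NF > 0 {print $2"="$NF}')
echo "$bad"
[ -z "$bad" ] || echo "drive shows failing-media indicators; replace it"
```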
* Re: DRDY errors are not consistent with scrub results 2018-08-28 0:38 ` Chris Murphy @ 2018-08-28 0:39 ` Chris Murphy 2018-08-28 0:49 ` Cerem Cem ASLAN 0 siblings, 1 reply; 13+ messages in thread From: Chris Murphy @ 2018-08-28 0:39 UTC (permalink / raw) To: Chris Murphy; +Cc: Cerem Cem ASLAN, Btrfs BTRFS On Mon, Aug 27, 2018 at 6:38 PM, Chris Murphy <lists@colorremedies.com> wrote: >> Metadata,single: Size:8.00MiB, Used:0.00B >> /dev/mapper/master-root 8.00MiB >> >> Metadata,DUP: Size:2.00GiB, Used:562.08MiB >> /dev/mapper/master-root 4.00GiB >> >> System,single: Size:4.00MiB, Used:0.00B >> /dev/mapper/master-root 4.00MiB >> >> System,DUP: Size:32.00MiB, Used:16.00KiB >> /dev/mapper/master-root 64.00MiB >> >> Unallocated: >> /dev/mapper/master-root 915.24GiB > > > OK this looks like it maybe was created a while ago, it has these > empty single chunk items that was common a while back. There is a low > risk to clean it up, but I still advise backup first: > > 'btrfs balance start -mconvert=dup <mountpoint>' You can skip this advise now, it really doesn't matter. But future Btrfs shouldn't have both single and DUP chunks like this one is showing, if you're using relatively recent btrfs-progs to create the file system. -- Chris Murphy ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: DRDY errors are not consistent with scrub results 2018-08-28 0:39 ` Chris Murphy @ 2018-08-28 0:49 ` Cerem Cem ASLAN 2018-08-28 1:08 ` Chris Murphy 0 siblings, 1 reply; 13+ messages in thread From: Cerem Cem ASLAN @ 2018-08-28 0:49 UTC (permalink / raw) To: Chris Murphy; +Cc: Btrfs BTRFS Thanks for your guidance, I'll get the device replaced first thing in the morning. Here are the balance results, which I think turned out not too bad: sudo btrfs balance start /mnt/peynir/ WARNING: Full balance without filters requested. This operation is very intense and takes potentially very long. It is recommended to use the balance filters to narrow down the balanced data. Use 'btrfs balance start --full-balance' option to skip this warning. The operation will start in 10 seconds. Use Ctrl-C to stop it. 10 9 8 7 6 5 4 3 2 1 Starting balance without any filters. Done, had to relocate 18 out of 18 chunks I suppose this means I've not lost any data, but I remain wary given the previous `smartctl ...` results. On Tue, Aug 28, 2018 at 03:39, Chris Murphy <lists@colorremedies.com> wrote: > > On Mon, Aug 27, 2018 at 6:38 PM, Chris Murphy <lists@colorremedies.com> wrote: > > >> Metadata,single: Size:8.00MiB, Used:0.00B > >> /dev/mapper/master-root 8.00MiB > >> > >> Metadata,DUP: Size:2.00GiB, Used:562.08MiB > >> /dev/mapper/master-root 4.00GiB > >> > >> System,single: Size:4.00MiB, Used:0.00B > >> /dev/mapper/master-root 4.00MiB > >> > >> System,DUP: Size:32.00MiB, Used:16.00KiB > >> /dev/mapper/master-root 64.00MiB > >> > >> Unallocated: > >> /dev/mapper/master-root 915.24GiB > > > > > > OK this looks like it maybe was created a while ago, it has these > > empty single chunk items that was common a while back. There is a low > > risk to clean it up, but I still advise backup first: > > > > 'btrfs balance start -mconvert=dup <mountpoint>' > > You can skip this advise now, it really doesn't matter. 
But future > Btrfs shouldn't have both single and DUP chunks like this one is > showing, if you're using relatively recent btrfs-progs to create the > file system. > > > -- > Chris Murphy ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: DRDY errors are not consistent with scrub results 2018-08-28 0:49 ` Cerem Cem ASLAN @ 2018-08-28 1:08 ` Chris Murphy 2018-08-28 18:50 ` Cerem Cem ASLAN 0 siblings, 1 reply; 13+ messages in thread From: Chris Murphy @ 2018-08-28 1:08 UTC (permalink / raw) To: Cerem Cem ASLAN; +Cc: Chris Murphy, Btrfs BTRFS On Mon, Aug 27, 2018 at 6:49 PM, Cerem Cem ASLAN <ceremcem@ceremcem.net> wrote: > Thanks for your guidance, I'll get the device replaced first thing in > the morning. > > Here is balance results which I think resulted not too bad: > > sudo btrfs balance start /mnt/peynir/ > WARNING: > > Full balance without filters requested. This operation is very > intense and takes potentially very long. It is recommended to > use the balance filters to narrow down the balanced data. > Use 'btrfs balance start --full-balance' option to skip this > warning. The operation will start in 10 seconds. > Use Ctrl-C to stop it. > 10 9 8 7 6 5 4 3 2 1 > Starting balance without any filters. > Done, had to relocate 18 out of 18 chunks > > I suppose this means I've not lost any data, but I'm very prone to due > to previous `smartctl ...` results. OK so nothing fatal anyway. We'd have to see any kernel messages that appeared during the balance to see if there were read or write errors, but presumably any failure means the balance fails so... might get you by for a while actually. -- Chris Murphy ^ permalink raw reply [flat|nested] 13+ messages in thread
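Whether the balance hit read or write errors can be checked by scanning the kernel log for the time window of the balance. A sketch, using sample lines adapted from the log earlier in the thread (in real use you would read `dmesg` or /var/log/kern.log instead of the embedded sample):

```shell
#!/bin/sh
# Sketch: count I/O and Btrfs error lines in a saved kernel log.
# The sample lines are adapted from the log quoted in this thread.
log='[106180.779346] blk_update_request: I/O error, dev sda, sector 62626944
[106180.779703] BTRFS error (device dm-2): bdev /dev/mapper/master-root errs: wr 0, rd 40, flush 0, corrupt 0, gen 0
[106180.779936] ata4: EH complete'

# In real use: errors=$(dmesg | grep -cE 'I/O error|BTRFS error')
errors=$(printf '%s\n' "$log" | grep -cE 'I/O error|BTRFS error')
echo "error lines during balance: $errors"
```

A clean balance with zero such lines is much stronger evidence than the exit status alone, since the drive may have silently remapped sectors on write.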
* Re: DRDY errors are not consistent with scrub results 2018-08-28 1:08 ` Chris Murphy @ 2018-08-28 18:50 ` Cerem Cem ASLAN 2018-08-28 21:07 ` Chris Murphy 0 siblings, 1 reply; 13+ messages in thread From: Cerem Cem ASLAN @ 2018-08-28 18:50 UTC (permalink / raw) To: Chris Murphy; +Cc: Btrfs BTRFS I've successfully moved everything to another disk. (The only hard part was configuring the kernel parameters, as my root partition was on LVM, which is on a LUKS partition. Here are the notes, if anyone needs them: https://github.com/ceremcem/smith-sync/blob/master/create-bootable-backup.md) Now I'm seeking trouble :) I tried to convert my new system (booted from the new disk) to raid1, coupled with the problematic old disk. To do so, I issued: sudo btrfs device add /dev/mapper/master-root /mnt/peynir/ /dev/mapper/master-root appears to contain an existing filesystem (btrfs). ERROR: use the -f option to force overwrite of /dev/mapper/master-root aea@aea3:/mnt$ sudo btrfs device add /dev/mapper/master-root /mnt/peynir/ -f ERROR: error adding device '/dev/mapper/master-root': Input/output error aea@aea3:/mnt$ sudo btrfs device add /dev/mapper/master-root /mnt/peynir/ sudo: unable to open /var/lib/sudo/ts/aea: Read-only file system Now I've ended up with a readonly file system. Isn't it possible to add a device to a running system? On Tue, Aug 28, 2018 at 04:08, Chris Murphy <lists@colorremedies.com> wrote: > > On Mon, Aug 27, 2018 at 6:49 PM, Cerem Cem ASLAN <ceremcem@ceremcem.net> wrote: > > Thanks for your guidance, I'll get the device replaced first thing in > > the morning. > > > > Here is balance results which I think resulted not too bad: > > > > sudo btrfs balance start /mnt/peynir/ > > WARNING: > > > > Full balance without filters requested. This operation is very > > intense and takes potentially very long. It is recommended to > > use the balance filters to narrow down the balanced data. > > Use 'btrfs balance start --full-balance' option to skip this > > warning. 
The operation will start in 10 seconds. > > Use Ctrl-C to stop it. > > 10 9 8 7 6 5 4 3 2 1 > > Starting balance without any filters. > > Done, had to relocate 18 out of 18 chunks > > > > I suppose this means I've not lost any data, but I'm very prone to due > > to previous `smartctl ...` results. > > > OK so nothing fatal anyway. We'd have to see any kernel messages that > appeared during the balance to see if there were read or write errors, > but presumably any failure means the balance fails so... might get you > by for a while actually. > > > > > > > > -- > Chris Murphy ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: DRDY errors are not consistent with scrub results 2018-08-28 18:50 ` Cerem Cem ASLAN @ 2018-08-28 21:07 ` Chris Murphy 2018-08-28 23:04 ` Cerem Cem ASLAN 0 siblings, 1 reply; 13+ messages in thread From: Chris Murphy @ 2018-08-28 21:07 UTC (permalink / raw) To: Cerem Cem ASLAN; +Cc: Chris Murphy, Btrfs BTRFS On Tue, Aug 28, 2018 at 12:50 PM, Cerem Cem ASLAN <ceremcem@ceremcem.net> wrote: > I've successfully moved everything to another disk. (The only hard > part was configuring the kernel parameters, as my root partition was > on LVM which is on LUKS partition. Here are the notes, if anyone > needs: https://github.com/ceremcem/smith-sync/blob/master/create-bootable-backup.md) > > Now I'm seekin for trouble :) I tried to convert my new system (booted > with new disk) into raid1 coupled with the problematic old disk. To do > so, I issued: > > sudo btrfs device add /dev/mapper/master-root /mnt/peynir/ > /dev/mapper/master-root appears to contain an existing filesystem (btrfs). > ERROR: use the -f option to force overwrite of /dev/mapper/master-root > aea@aea3:/mnt$ sudo btrfs device add /dev/mapper/master-root /mnt/peynir/ -f > ERROR: error adding device '/dev/mapper/master-root': Input/output error > aea@aea3:/mnt$ sudo btrfs device add /dev/mapper/master-root /mnt/peynir/ > sudo: unable to open /var/lib/sudo/ts/aea: Read-only file system > > Now I ended up with a readonly file system. Isn't it possible to add a > device to a running system? Yes. The problem is the 2nd error message: ERROR: error adding device '/dev/mapper/master-root': Input/output error So you need to look in dmesg to see what Btrfs kernel messages occurred at that time. I'm gonna guess it's a failed write. You have a few of those in the smartctl log output. Any time a write failure happens, the operation is always fatal regardless of the file system. -- Chris Murphy ^ permalink raw reply [flat|nested] 13+ messages in thread
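Besides dmesg, Btrfs keeps the per-device counters behind the "errs: wr 0, rd 40, ..." kernel line as persistent statistics. A sketch that flags nonzero counters; the sample text mimics what `btrfs device stats <mountpoint>` prints, with the rd 40 value from the log above:

```shell
#!/bin/sh
# Sketch: flag nonzero Btrfs per-device error counters.
# In real use: stats=$(sudo btrfs device stats /mnt/peynir/)
stats='[/dev/mapper/master-root].write_io_errs    0
[/dev/mapper/master-root].read_io_errs     40
[/dev/mapper/master-root].flush_io_errs    0
[/dev/mapper/master-root].corruption_errs  0
[/dev/mapper/master-root].generation_errs  0'

# Print "counter -> value" for every counter that is above zero.
nonzero=$(printf '%s\n' "$stats" | awk '$2 > 0 {print $1" -> "$2}')
echo "${nonzero:-no errors recorded}"
```

Note the counters are cumulative across mounts until reset with `btrfs device stats -z`, so they distinguish old errors from ones caused by the failed `device add`.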
* Re: DRDY errors are not consistent with scrub results 2018-08-28 21:07 ` Chris Murphy @ 2018-08-28 23:04 ` Cerem Cem ASLAN 2018-08-28 23:58 ` Chris Murphy 0 siblings, 1 reply; 13+ messages in thread From: Cerem Cem ASLAN @ 2018-08-28 23:04 UTC (permalink / raw) To: Chris Murphy; +Cc: Btrfs BTRFS What I want to achieve is to add the problematic disk as raid1 and see how/when it fails and how BTRFS recovers from these failures. While the party goes on, the main system shouldn't be interrupted, since this is a production system. For example, I would never expect to end up with such a readonly state while trying to add a disk of "unknown health" to the system. Was it somewhat expected? Although we know that disk is about to fail, it still survives. Shouldn't we expect in such a scenario that when the system tries to read or write some data from/to that BROKEN_DISK and recognizes that it failed, it will try to recover that part of the data from GOOD_DISK and try to store the recovered data in some other part of the BROKEN_DISK? Or did I misunderstand the whole thing? On Wed, Aug 29, 2018 at 00:07, Chris Murphy <lists@colorremedies.com> wrote: > > On Tue, Aug 28, 2018 at 12:50 PM, Cerem Cem ASLAN <ceremcem@ceremcem.net> wrote: > > I've successfully moved everything to another disk. (The only hard > > part was configuring the kernel parameters, as my root partition was > > on LVM which is on LUKS partition. Here are the notes, if anyone > > needs: https://github.com/ceremcem/smith-sync/blob/master/create-bootable-backup.md) > > > > Now I'm seekin for trouble :) I tried to convert my new system (booted > > with new disk) into raid1 coupled with the problematic old disk. To do > > so, I issued: > > > > sudo btrfs device add /dev/mapper/master-root /mnt/peynir/ > > /dev/mapper/master-root appears to contain an existing filesystem (btrfs). 
> > ERROR: use the -f option to force overwrite of /dev/mapper/master-root > > aea@aea3:/mnt$ sudo btrfs device add /dev/mapper/master-root /mnt/peynir/ -f > > ERROR: error adding device '/dev/mapper/master-root': Input/output error > > aea@aea3:/mnt$ sudo btrfs device add /dev/mapper/master-root /mnt/peynir/ > > sudo: unable to open /var/lib/sudo/ts/aea: Read-only file system > > > > Now I ended up with a readonly file system. Isn't it possible to add a > > device to a running system? > > Yes. > > The problem is the 2nd error message: > > ERROR: error adding device '/dev/mapper/master-root': Input/output error > > So you need to look in dmesg to see what Btrfs kernel messages > occurred at that time. I'm gonna guess it's a failed write. You have a > few of those in the smartctl log output. Any time a write failure > happens, the operation is always fatal regardless of the file system. > > > > -- > Chris Murphy ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: DRDY errors are not consistent with scrub results 2018-08-28 23:04 ` Cerem Cem ASLAN @ 2018-08-28 23:58 ` Chris Murphy 2018-08-29 6:58 ` Cerem Cem ASLAN 0 siblings, 1 reply; 13+ messages in thread From: Chris Murphy @ 2018-08-28 23:58 UTC (permalink / raw) To: Cerem Cem ASLAN; +Cc: Chris Murphy, Btrfs BTRFS On Tue, Aug 28, 2018 at 5:04 PM, Cerem Cem ASLAN <ceremcem@ceremcem.net> wrote: > What I want to achive is that I want to add the problematic disk as > raid1 and see how/when it fails and how BTRFS recovers these fails. > While the party goes on, the main system shouldn't be interrupted > since this is a production system. For example, I would never expect > to be ended up with such a readonly state while trying to add a disk > with "unknown health" to the system. Was it somewhat expected? I don't know. I also can't tell you how LVM or mdraid behave in the same situation either though. For sure I've come across bug reports where underlying devices go read only and the file system falls over totally and developers shrug and say they can't do anything. This situation is a little different and difficult. You're starting out with a one drive setup so the profile is single/DUP or single/single, and that doesn't change when adding. So the 2nd drive is actually *mandatory* for a brief period of time before you've made it raid1 or higher. It's a developer question what is the design, and if this is a bug: maybe the device being added should be written to with placeholder supers or even just zeros in all the places for 'dev add' metadata, and only if that succeeds, to then write real updated supers to all devices. It's possible the 'dev add' presently writes updated supers to all devices at the same time, and has a brief period where the state is fragile and if it fails, it goes read only to prevent damaging the file system. Anyway, without a call trace, no idea why it ended up read only. So I have to speculate. 
> > Although we know that disk is about to fail, it still survives. That's a very tenuous rationalization; a drive that rejects even a single write is considered failed by the md driver. Btrfs is still very tolerant of this, so if it had successfully added and you were running in production, you should expect to see thousands of write errors dumped to the kernel log because Btrfs never ejects a bad drive still. It keeps trying. And keeps reporting the failures. And all those errors being logged can end up causing more write demand if the logs are on the same volume as the failing device, even more errors to record, and you get an escalating situation with heavy log writing. > Shouldn't we expect in such a scenario that when system tries to read > or write some data from/to that BROKEN_DISK and when it recognizes it > failed, it will try to recover the part of the data from GOOD_DISK and > try to store that recovered data in some other part of the > BROKEN_DISK? Nope. Btrfs can only write supers to fixed locations on the drive, same as any other file system. Btrfs metadata could possibly go elsewhere because it doesn't have fixed locations, but Btrfs doesn't do bad sector tracking. So once it decides metadata goes in location X, if X reports a write error it will not try to write elsewhere, and insofar as I'm aware ext4 and XFS and LVM and md don't either; md does have an optional bad block map it will use for tracking bad sectors and remap to known good sectors. Normally the drive firmware should do this, and when that fails the drive is considered toast for production purposes. >Or did I misunderstood the whole thing? Well in a way this is sorta user sabotage. It's a valid test and I'd say ideally things should fail safely, rather than fall over. 
But at the same time it's not wrong for developers to say: "look, if you add a bad device there's a decent chance we're going to face-plant and go read only to avoid causing worse problems, so next time you should qualify the drive before putting it into production." I'm willing to bet all the other file system devs would say something like that; even if Btrfs devs think something better could happen, it's probably not a super high priority. -- Chris Murphy ^ permalink raw reply [flat|nested] 13+ messages in thread
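For a qualified drive, the add-then-convert sequence being attempted in this thread would look like the following dry run. `DRY=echo` only prints the commands rather than running them; the device path and mount point are the thread's, and `-dconvert`/`-mconvert` are the standard balance convert filters:

```shell
#!/bin/sh
# Sketch (dry run): plan the two steps that turn a single-device Btrfs
# into raid1. With DRY=echo nothing is executed; drop DRY to run for real.
DRY=echo
plan=$(
  $DRY btrfs device add -f /dev/mapper/master-root /mnt/peynir/
  $DRY btrfs balance start -dconvert=raid1 -mconvert=raid1 /mnt/peynir/
)
printf '%s\n' "$plan"
# Only after BOTH steps succeed does the filesystem tolerate losing one
# device; between them, every chunk still has a single copy.
```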
* Re: DRDY errors are not consistent with scrub results
  2018-08-28 23:58           ` Chris Murphy
@ 2018-08-29  6:58             ` Cerem Cem ASLAN
  2018-08-29  9:58               ` Duncan
  0 siblings, 1 reply; 13+ messages in thread
From: Cerem Cem ASLAN @ 2018-08-29 6:58 UTC (permalink / raw)
To: Chris Murphy; +Cc: Btrfs BTRFS

Chris Murphy <lists@colorremedies.com> wrote on Wed, 29 Aug 2018 at 02:58:
>
> On Tue, Aug 28, 2018 at 5:04 PM, Cerem Cem ASLAN <ceremcem@ceremcem.net> wrote:
> > What I want to achieve is that I want to add the problematic disk as
> > raid1 and see how/when it fails and how BTRFS recovers from these failures.
> > While the party goes on, the main system shouldn't be interrupted,
> > since this is a production system. For example, I would never expect
> > to end up in such a read-only state while trying to add a disk
> > with "unknown health" to the system. Was it somewhat expected?
>
> I don't know. I also can't tell you how LVM or mdraid behave in the
> same situation either, though. For sure I've come across bug reports
> where underlying devices go read-only and the file system falls over
> totally and developers shrug and say they can't do anything.
>
> This situation is a little different and difficult. You're starting
> out with a one-drive setup, so the profile is single/DUP or
> single/single, and that doesn't change when adding. So the 2nd drive
> is actually *mandatory* for a brief period of time before you've made
> it raid1 or higher. It's a developer question what the design is, and
> whether this is a bug: maybe the device being added should be written to
> with placeholder supers, or even just zeros, in all the places for 'dev
> add' metadata, and only if that succeeds, then write real updated
> supers to all devices. It's possible 'dev add' presently writes
> updated supers to all devices at the same time, and has a brief period
> where the state is fragile; if it fails, it goes read-only to
> prevent damaging the file system.
Thinking again, this is totally acceptable. If the requirement was a disk in good health, then I think I must check the disk health myself. I may believe that the disk is in a good state, or make a quick test, or make some very detailed tests to be sure. Likewise, ending up in a read-only state is not the end of the world, even over SSH, because the system still functions and all I need to do is reboot in the worst case. That's also acceptable *while adding a new disk*.

> Anyway, without a call trace, no idea why it ended up read only. So I
> have to speculate.

I may try adding the disk again any time and provide any requested logs; it is still attached to the server. I'm only not sure if this is a useful experiment from the point of view of the rest of the people.

> > Although we know that disk is about to fail, it still survives.
>
> That's a very tenuous rationalization, a drive that rejects even a
> single write is considered failed by the md driver. Btrfs is still
> very tolerant of this, so if it had successfully added and you were
> running in production, you should expect to see thousands of write
> errors dumped to the kernel log

That's exactly what I expected :)

> because Btrfs never ejects a bad drive
> still. It keeps trying. And keeps reporting the failures. And all
> those errors being logged can end up causing more write demand if the
> logs are on the same volume as the failing device, even more errors to
> record, and you get an escalating situation with heavy log writing.

Good point. Maybe I should arrange an on-RAM virtual machine that writes back to the local disk if no hardware errors are found, and start sending logs to a different server *if* such a hardware failure occurs.
> > Shouldn't we expect in such a scenario that when system tries to read
> > or write some data from/to that BROKEN_DISK and when it recognizes it
> > failed, it will try to recover the part of the data from GOOD_DISK and
> > try to store that recovered data in some other part of the
> > BROKEN_DISK?
>
> Nope. Btrfs can only write supers to fixed locations on the drive,
> same as any other file system. Btrfs metadata could possibly go
> elsewhere because it doesn't have fixed locations, but Btrfs doesn't
> do bad sector tracking. So once it decides metadata goes in location
> X, if X reports a write error it will not try to write elsewhere, and
> insofar as I'm aware ext4 and XFS and LVM don't either; md does
> have an optional bad block map it will use for tracking bad sectors
> and remapping to known good sectors. Normally the drive firmware should do
> this, and when that fails the drive is considered toast for production
> purposes.

That's also plausible. Thinking again (again? :), if Btrfs behaved as I expected, those retries might never end if the disk is in a very bad state, and that would put a very intensive IO load on a production system. I think in such a situation I should remove the RAID device, try to reformat it, and attach it again.

> > Or did I misunderstand the whole thing?
>
> Well, in a way this is sorta user sabotage. It's a valid test, and I'd
> say ideally things should fail safely, rather than fall over. But at
> the same time it's not wrong for developers to say: "Look, if you add a
> bad device there's a decent chance we're going to face plant and go
> read-only to avoid causing worse problems, so next time you should qualify
> the drive before putting it into production."

Agreed.

> I'm willing to bet all the other file system devs would say something
> like that. Even if Btrfs devs think something better could happen, it's
> probably not a super high priority.

Devs are doing lots of things already, and yes, this is not an urgent task.
I appreciate your help, thank you!

> --
> Chris Murphy

^ permalink raw reply	[flat|nested] 13+ messages in thread
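For reference, the 'dev add' flow discussed above maps to roughly the following command sequence. This is a sketch: /dev/sdX and /mnt are placeholders, and the `maybe` dry-run wrapper is added here so the commands are printed rather than executed unless RUN=1 is set.

```shell
# Sketch of adding a second device and converting to raid1.
# /dev/sdX and /mnt are placeholders; with RUN unset, commands
# are only printed, never executed.
RUN=${RUN:-0}
maybe() { if [ "$RUN" = "1" ]; then "$@"; else echo "would run: $*"; fi; }

maybe btrfs device add /dev/sdX /mnt

# Until this balance completes, existing chunks keep their old
# single/DUP profile, so the new device provides no redundancy yet.
maybe btrfs balance start -dconvert=raid1 -mconvert=raid1 /mnt

# Afterwards, check the per-device error counters.
maybe btrfs device stats /mnt
```

If the new device starts throwing write errors during the balance, the counters printed by `btrfs device stats` are where they will show up.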
* Re: DRDY errors are not consistent with scrub results
  2018-08-29  6:58             ` Cerem Cem ASLAN
@ 2018-08-29  9:58               ` Duncan
  2018-08-29 10:04                 ` Hugo Mills
  0 siblings, 1 reply; 13+ messages in thread
From: Duncan @ 2018-08-29 9:58 UTC (permalink / raw)
To: linux-btrfs

Cerem Cem ASLAN posted on Wed, 29 Aug 2018 09:58:21 +0300 as excerpted:

> Thinking again, this is totally acceptable. If the requirement was a
> good health disk, then I think I must check the disk health by myself.
> I may believe that the disk is in a good state, or make a quick test or
> make some very detailed tests to be sure.

For testing you might try badblocks. It's most useful on a device that doesn't have a filesystem on it you're trying to save, so you can use the -w write-test option. See the manpage for details.

The -w option should force the device to remap bad blocks where it can as well, and you can take your previous smartctl read and compare it to a new one after the test.

Hint if testing multiple spinning-rust devices: try running multiple tests at once. While this might have been slower on old EIDE, at least with spinning rust, on SATA and similar you should be able to test multiple devices at once without them slowing down significantly, because the bottleneck is the spinning rust, not the bus, controller or CPU. I used badblocks years ago to test my new disks before setting up mdraid on them, and with full-disk tests on spinning rust taking (at the time) nearly a day a pass, and four passes for the -w test, the multiple-tests-at-once trick saved me quite a bit of time!

It's not a great idea to do the test on new SSDs, as it's unnecessary wear, writing the entire device four times with different patterns for a -w run, but it might be worthwhile to try it on an SSD you're just trying to salvage, forcing it to swap out any bad sectors it encounters in the process.

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman

^ permalink raw reply	[flat|nested] 13+ messages in thread
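Duncan's procedure can be sketched as a script. The device names are placeholders, and a dry-run guard is added here because `badblocks -w` destroys all data on the device; set RUN=1 only on drives whose contents you can afford to lose.

```shell
# Destructive write test with SMART snapshots before and after.
# /dev/sdX and /dev/sdY are placeholders; with RUN unset, commands
# are only printed. badblocks -w DESTROYS all data on the device.
RUN=${RUN:-0}
maybe() { if [ "$RUN" = "1" ]; then "$@"; else echo "would run: $*"; fi; }

for dev in /dev/sdX /dev/sdY; do
    maybe smartctl -A "$dev"        # snapshot Reallocated/Pending sector counts
    maybe badblocks -wsv "$dev" &   # four write+verify passes, one job per drive
done
wait    # run the drives in parallel: each drive, not the bus, is the bottleneck
for dev in /dev/sdX /dev/sdY; do
    maybe smartctl -A "$dev"        # compare against the earlier snapshot
done
```

A rise in reallocated or pending sector counts between the two smartctl snapshots is the signal Duncan describes.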
* Re: DRDY errors are not consistent with scrub results
  2018-08-29  9:58               ` Duncan
@ 2018-08-29 10:04                 ` Hugo Mills
  0 siblings, 0 replies; 13+ messages in thread
From: Hugo Mills @ 2018-08-29 10:04 UTC (permalink / raw)
To: Duncan; +Cc: linux-btrfs

[-- Attachment #1: Type: text/plain, Size: 1798 bytes --]

On Wed, Aug 29, 2018 at 09:58:58AM +0000, Duncan wrote:
> Cerem Cem ASLAN posted on Wed, 29 Aug 2018 09:58:21 +0300 as excerpted:
>
> > Thinking again, this is totally acceptable. If the requirement was a
> > good health disk, then I think I must check the disk health by myself.
> > I may believe that the disk is in a good state, or make a quick test or
> > make some very detailed tests to be sure.
>
> For testing you might try badblocks. It's most useful on a device that
> doesn't have a filesystem on it you're trying to save, so you can use the
> -w write-test option. See the manpage for details.
>
> The -w option should force the device to remap bad blocks where it can as
> well, and you can take your previous smartctl read and compare it to a
> new one after the test.
>
> Hint if testing multiple spinning-rust devices: Try running multiple
> tests at once. While this might have been slower on old EIDE, at least
> with spinning rust, on SATA and similar you should be able to test
> multiple devices at once without them slowing down significantly, because
> the bottleneck is the spinning rust, not the bus, controller or CPU. I
> used badblocks years ago to test my new disks before setting up mdraid on
> them, and with full disk tests on spinning rust taking (at the time)
> nearly a day a pass and four passes for the -w test, the multiple tests
> at once trick saved me quite a bit of time!

Hah. Only a day? It's up to 2 days now. The devices get bigger. The
interfaces don't get faster at the same rate. Back in the late '90s, it
was only an hour or so to run a badblocks pass on a big disk...

   Hugo.

-- 
Hugo Mills             | Nostalgia isn't what it used to be.
hugo@... carfax.org.uk | http://carfax.org.uk/ | PGP: E2AB1DE4

[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 836 bytes --]

^ permalink raw reply	[flat|nested] 13+ messages in thread
* Re: DRDY errors are not consistent with scrub results
  2018-08-27 22:51 DRDY errors are not consistent with scrub results Cerem Cem ASLAN
       [not found] ` <CAJCQCtSq5K90gpfGQN8JhqQddBg62m8EG_bFuWN5XyzdNStDfw@mail.gmail.com>
@ 2018-08-29  9:56 ` ein
  1 sibling, 0 replies; 13+ messages in thread
From: ein @ 2018-08-29 9:56 UTC (permalink / raw)
To: Cerem Cem ASLAN, Btrfs BTRFS

On 08/28/2018 12:51 AM, Cerem Cem ASLAN wrote:
> Hi,

Good morning.

> I'm not sure how to proceed at the moment. Taking successful backups
> made me think that everything might be okay, but I'm not sure if I
> should continue trusting the drive or not. What additional checks
> should I perform?

Can you please also show:

btrfs dev stats /path/to/the/mount/point

-- 
PGP Public Key (RSA/4096b):
ID: 0xF2C6EA10
SHA-1: 51DA 40EE 832A 0572 5AD8 B3C0 7AFF 69E1 F2C6 EA10

^ permalink raw reply	[flat|nested] 13+ messages in thread
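The command ein asks for prints five error counters per device, and a nonzero value in any of them is what matters. A small filter for surfacing only the nonzero counters, shown here against hypothetical sample output (on a real system, substitute the actual output of `btrfs device stats /path/to/the/mount/point`):

```shell
# Hypothetical `btrfs device stats` output for illustration; on a real
# system use: stats=$(btrfs device stats /path/to/the/mount/point)
stats='[/dev/sda1].write_io_errs    0
[/dev/sda1].read_io_errs     0
[/dev/sda1].flush_io_errs    0
[/dev/sda1].corruption_errs  0
[/dev/sda1].generation_errs  0
[/dev/sdb1].write_io_errs    12
[/dev/sdb1].read_io_errs     3'

# Any nonzero counter means the device has logged errors since the
# counters were last reset (btrfs device stats -z resets them).
echo "$stats" | awk '$2 != 0 { print }'
```

In this hypothetical sample, only the two /dev/sdb1 lines survive the filter, pointing at the suspect device.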
end of thread, other threads:[~2018-08-29 14:00 UTC | newest]

Thread overview: 13+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-08-27 22:51 DRDY errors are not consistent with scrub results Cerem Cem ASLAN
     [not found] ` <CAJCQCtSq5K90gpfGQN8JhqQddBg62m8EG_bFuWN5XyzdNStDfw@mail.gmail.com>
     [not found]   ` <CAN4oSBeHwnsm5Ecz1hAQLk6s6utHfn5XeR8xMhnZpmT-sb-_iw@mail.gmail.com>
2018-08-28  0:38     ` Chris Murphy
2018-08-28  0:39       ` Chris Murphy
2018-08-28  0:49         ` Cerem Cem ASLAN
2018-08-28  1:08           ` Chris Murphy
2018-08-28 18:50             ` Cerem Cem ASLAN
2018-08-28 21:07               ` Chris Murphy
2018-08-28 23:04                 ` Cerem Cem ASLAN
2018-08-28 23:58                   ` Chris Murphy
2018-08-29  6:58                     ` Cerem Cem ASLAN
2018-08-29  9:58                       ` Duncan
2018-08-29 10:04                         ` Hugo Mills
2018-08-29  9:56 ` ein