* DRDY errors are not consistent with scrub results
@ 2018-08-27 22:51 Cerem Cem ASLAN
       [not found] ` <CAJCQCtSq5K90gpfGQN8JhqQddBg62m8EG_bFuWN5XyzdNStDfw@mail.gmail.com>
  2018-08-29  9:56 ` ein
  0 siblings, 2 replies; 13+ messages in thread
From: Cerem Cem ASLAN @ 2018-08-27 22:51 UTC (permalink / raw)
  To: Btrfs BTRFS

Hi,

I'm getting DRDY ERR messages which cause system crashes on the server:

# tail -n 40 /var/log/kern.log.1
Aug 24 21:04:55 aea3 kernel: [  939.228059] lxc-bridge: port
5(vethI7JDHN) entered disabled state
Aug 24 21:04:55 aea3 kernel: [  939.300602] eth0: renamed from vethQ5Y2OF
Aug 24 21:04:55 aea3 kernel: [  939.328245] IPv6: ADDRCONF(NETDEV_UP):
eth0: link is not ready
Aug 24 21:04:55 aea3 kernel: [  939.328453] IPv6:
ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready
Aug 24 21:04:55 aea3 kernel: [  939.328474] IPv6:
ADDRCONF(NETDEV_CHANGE): vethI7JDHN: link becomes ready
Aug 24 21:04:55 aea3 kernel: [  939.328491] lxc-bridge: port
5(vethI7JDHN) entered blocking state
Aug 24 21:04:55 aea3 kernel: [  939.328493] lxc-bridge: port
5(vethI7JDHN) entered forwarding state
Aug 24 21:04:59 aea3 kernel: [  943.085647] cgroup: cgroup2: unknown
option "nsdelegate"
Aug 24 21:16:15 aea3 kernel: [ 1619.400016] perf: interrupt took too
long (2506 > 2500), lowering kernel.perf_event_max_sample_rate to
79750
Aug 24 21:17:11 aea3 kernel: [ 1675.515815] perf: interrupt took too
long (3137 > 3132), lowering kernel.perf_event_max_sample_rate to
63750
Aug 24 21:17:13 aea3 kernel: [ 1677.080837] cgroup: cgroup2: unknown
option "nsdelegate"
Aug 25 22:38:31 aea3 kernel: [92955.512098] usb 4-2: USB disconnect,
device number 2
Aug 26 02:14:21 aea3 kernel: [105906.035038] lxc-bridge: port
4(vethCTKU4K) entered disabled state
Aug 26 02:15:30 aea3 kernel: [105974.107521] lxc-bridge: port
4(vethO59BPD) entered disabled state
Aug 26 02:15:30 aea3 kernel: [105974.109991] device vethO59BPD left
promiscuous mode
Aug 26 02:15:30 aea3 kernel: [105974.109995] lxc-bridge: port
4(vethO59BPD) entered disabled state
Aug 26 02:15:30 aea3 kernel: [105974.710490] lxc-bridge: port
4(vethBAYODL) entered blocking state
Aug 26 02:15:30 aea3 kernel: [105974.710493] lxc-bridge: port
4(vethBAYODL) entered disabled state
Aug 26 02:15:30 aea3 kernel: [105974.710545] device vethBAYODL entered
promiscuous mode
Aug 26 02:15:30 aea3 kernel: [105974.710598] IPv6:
ADDRCONF(NETDEV_UP): vethBAYODL: link is not ready
Aug 26 02:15:30 aea3 kernel: [105974.710600] lxc-bridge: port
4(vethBAYODL) entered blocking state
Aug 26 02:15:30 aea3 kernel: [105974.710601] lxc-bridge: port
4(vethBAYODL) entered forwarding state
Aug 26 02:16:35 aea3 kernel: [106039.674089] BTRFS: device fsid
5b844c7a-0cbd-40a7-a8e3-6bc636aba033 devid 1 transid 984 /dev/dm-3
Aug 26 02:17:21 aea3 kernel: [106085.352453] ata4.00: failed command: READ DMA
Aug 26 02:17:21 aea3 kernel: [106085.352901] ata4.00: status: { DRDY ERR }
Aug 26 02:18:56 aea3 kernel: [106180.648062] ata4.00: exception Emask
0x0 SAct 0x0 SErr 0x0 action 0x0
Aug 26 02:18:56 aea3 kernel: [106180.648333] ata4.00: BMDMA stat 0x25
Aug 26 02:18:56 aea3 kernel: [106180.648515] ata4.00: failed command: READ DMA
Aug 26 02:18:56 aea3 kernel: [106180.648706] ata4.00: cmd
c8/00:08:80:9c:bb/00:00:00:00:00/e3 tag 0 dma 4096 in
Aug 26 02:18:56 aea3 kernel: [106180.648706]          res
51/40:00:80:9c:bb/00:00:00:00:00/03 Emask 0x9 (media error)
Aug 26 02:18:56 aea3 kernel: [106180.649380] ata4.00: status: { DRDY ERR }
Aug 26 02:18:56 aea3 kernel: [106180.649743] ata4.00: error: { UNC }
Aug 26 02:18:56 aea3 kernel: [106180.779311] ata4.00: configured for UDMA/133
Aug 26 02:18:56 aea3 kernel: [106180.779331] sd 3:0:0:0: [sda] tag#0
FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
Aug 26 02:18:56 aea3 kernel: [106180.779335] sd 3:0:0:0: [sda] tag#0
Sense Key : Medium Error [current]
Aug 26 02:18:56 aea3 kernel: [106180.779339] sd 3:0:0:0: [sda] tag#0
Add. Sense: Unrecovered read error - auto reallocate failed
Aug 26 02:18:56 aea3 kernel: [106180.779343] sd 3:0:0:0: [sda] tag#0
CDB: Read(10) 28 00 03 bb 9c 80 00 00 08 00
Aug 26 02:18:56 aea3 kernel: [106180.779346] blk_update_request: I/O
error, dev sda, sector 62626944
Aug 26 02:18:56 aea3 kernel: [106180.779703] BTRFS error (device
dm-2): bdev /dev/mapper/master-root errs: wr 0, rd 40, flush 0,
corrupt 0, gen 0
Aug 26 02:18:56 aea3 kernel: [106180.779936] ata4: EH complete


I have always seen these DRDY errors whenever I experienced physical hard
drive failures, so I expected `btrfs scrub` to show some kind of similar
errors, but it doesn't:

btrfs scrub status /mnt/peynir/
scrub status for 8827cb0e-52d7-4f99-90fd-a975cafbfa46
scrub started at Tue Aug 28 00:43:55 2018 and finished after 00:02:07
total bytes scrubbed: 12.45GiB with 0 errors

I took new snapshots of both the root and the LXC containers and nothing
went wrong. To be safe, I also reformatted the swap partition (I had seen
some messages about the swap partition on the crash screen).

I'm not sure how to proceed at the moment. Taking successful backups
makes me think that everything might be okay, but I'm not sure whether I
should continue trusting the drive. What additional checks should I
perform?
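
Would re-reading the reported sector directly tell me anything useful?
Something like this is what I have in mind (sector and count are taken
from the kernel log above, and it only reads, so it should be harmless):

sudo dd if=/dev/sda of=/dev/null bs=512 skip=62626944 count=8 iflag=direct

If that still returns an I/O error, I suppose the failure is reproducible
at the drive level rather than something transient.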


* Re: DRDY errors are not consistent with scrub results
       [not found]   ` <CAN4oSBeHwnsm5Ecz1hAQLk6s6utHfn5XeR8xMhnZpmT-sb-_iw@mail.gmail.com>
@ 2018-08-28  0:38     ` Chris Murphy
  2018-08-28  0:39       ` Chris Murphy
  0 siblings, 1 reply; 13+ messages in thread
From: Chris Murphy @ 2018-08-28  0:38 UTC (permalink / raw)
  To: Cerem Cem ASLAN, Btrfs BTRFS

On Mon, Aug 27, 2018 at 6:05 PM, Cerem Cem ASLAN <ceremcem@ceremcem.net> wrote:
> Note that I received this reply directly, not via the mailing list. I'm
> not sure whether that was intended.

I intended to do Reply to All, but somehow this doesn't always work out
between the user and Gmail. I'm just gonna assume Gmail is being an
asshole again.


> On Tue, 28 Aug 2018 at 02:25, Chris Murphy <lists@colorremedies.com>
> wrote:
>>
>> On Mon, Aug 27, 2018 at 4:51 PM, Cerem Cem ASLAN <ceremcem@ceremcem.net> wrote:
>> > Hi,
>> >
>> > I'm getting DRDY ERR messages which causes system crash on the server:
>> >
>> > # tail -n 40 /var/log/kern.log.1
>> > Aug 24 21:04:55 aea3 kernel: [  939.228059] lxc-bridge: port
>> > 5(vethI7JDHN) entered disabled state
>> > Aug 24 21:04:55 aea3 kernel: [  939.300602] eth0: renamed from vethQ5Y2OF
>> > Aug 24 21:04:55 aea3 kernel: [  939.328245] IPv6: ADDRCONF(NETDEV_UP):
>> > eth0: link is not ready
>> > Aug 24 21:04:55 aea3 kernel: [  939.328453] IPv6:
>> > ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready
>> > Aug 24 21:04:55 aea3 kernel: [  939.328474] IPv6:
>> > ADDRCONF(NETDEV_CHANGE): vethI7JDHN: link becomes ready
>> > Aug 24 21:04:55 aea3 kernel: [  939.328491] lxc-bridge: port
>> > 5(vethI7JDHN) entered blocking state
>> > Aug 24 21:04:55 aea3 kernel: [  939.328493] lxc-bridge: port
>> > 5(vethI7JDHN) entered forwarding state
>> > Aug 24 21:04:59 aea3 kernel: [  943.085647] cgroup: cgroup2: unknown
>> > option "nsdelegate"
>> > Aug 24 21:16:15 aea3 kernel: [ 1619.400016] perf: interrupt took too
>> > long (2506 > 2500), lowering kernel.perf_event_max_sample_rate to
>> > 79750
>> > Aug 24 21:17:11 aea3 kernel: [ 1675.515815] perf: interrupt took too
>> > long (3137 > 3132), lowering kernel.perf_event_max_sample_rate to
>> > 63750
>> > Aug 24 21:17:13 aea3 kernel: [ 1677.080837] cgroup: cgroup2: unknown
>> > option "nsdelegate"
>> > Aug 25 22:38:31 aea3 kernel: [92955.512098] usb 4-2: USB disconnect,
>> > device number 2
>> > Aug 26 02:14:21 aea3 kernel: [105906.035038] lxc-bridge: port
>> > 4(vethCTKU4K) entered disabled state
>> > Aug 26 02:15:30 aea3 kernel: [105974.107521] lxc-bridge: port
>> > 4(vethO59BPD) entered disabled state
>> > Aug 26 02:15:30 aea3 kernel: [105974.109991] device vethO59BPD left
>> > promiscuous mode
>> > Aug 26 02:15:30 aea3 kernel: [105974.109995] lxc-bridge: port
>> > 4(vethO59BPD) entered disabled state
>> > Aug 26 02:15:30 aea3 kernel: [105974.710490] lxc-bridge: port
>> > 4(vethBAYODL) entered blocking state
>> > Aug 26 02:15:30 aea3 kernel: [105974.710493] lxc-bridge: port
>> > 4(vethBAYODL) entered disabled state
>> > Aug 26 02:15:30 aea3 kernel: [105974.710545] device vethBAYODL entered
>> > promiscuous mode
>> > Aug 26 02:15:30 aea3 kernel: [105974.710598] IPv6:
>> > ADDRCONF(NETDEV_UP): vethBAYODL: link is not ready
>> > Aug 26 02:15:30 aea3 kernel: [105974.710600] lxc-bridge: port
>> > 4(vethBAYODL) entered blocking state
>> > Aug 26 02:15:30 aea3 kernel: [105974.710601] lxc-bridge: port
>> > 4(vethBAYODL) entered forwarding state
>> > Aug 26 02:16:35 aea3 kernel: [106039.674089] BTRFS: device fsid
>> > 5b844c7a-0cbd-40a7-a8e3-6bc636aba033 devid 1 transid 984 /dev/dm-3
>> > Aug 26 02:17:21 aea3 kernel: [106085.352453] ata4.00: failed command: READ DMA
>> > Aug 26 02:17:21 aea3 kernel: [106085.352901] ata4.00: status: { DRDY ERR }
>> > Aug 26 02:18:56 aea3 kernel: [106180.648062] ata4.00: exception Emask
>> > 0x0 SAct 0x0 SErr 0x0 action 0x0
>> > Aug 26 02:18:56 aea3 kernel: [106180.648333] ata4.00: BMDMA stat 0x25
>> > Aug 26 02:18:56 aea3 kernel: [106180.648515] ata4.00: failed command: READ DMA
>> > Aug 26 02:18:56 aea3 kernel: [106180.648706] ata4.00: cmd
>> > c8/00:08:80:9c:bb/00:00:00:00:00/e3 tag 0 dma 4096 in
>> > Aug 26 02:18:56 aea3 kernel: [106180.648706]          res
>> > 51/40:00:80:9c:bb/00:00:00:00:00/03 Emask 0x9 (media error)
>> > Aug 26 02:18:56 aea3 kernel: [106180.649380] ata4.00: status: { DRDY ERR }
>> > Aug 26 02:18:56 aea3 kernel: [106180.649743] ata4.00: error: { UNC }
>>
>> Classic case of uncorrectable read error due to sector failure.
>>
>>
>>
>> > Aug 26 02:18:56 aea3 kernel: [106180.779311] ata4.00: configured for UDMA/133
>> > Aug 26 02:18:56 aea3 kernel: [106180.779331] sd 3:0:0:0: [sda] tag#0
>> > FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
>> > Aug 26 02:18:56 aea3 kernel: [106180.779335] sd 3:0:0:0: [sda] tag#0
>> > Sense Key : Medium Error [current]
>> > Aug 26 02:18:56 aea3 kernel: [106180.779339] sd 3:0:0:0: [sda] tag#0
>> > Add. Sense: Unrecovered read error - auto reallocate failed
>> > Aug 26 02:18:56 aea3 kernel: [106180.779343] sd 3:0:0:0: [sda] tag#0
>> > CDB: Read(10) 28 00 03 bb 9c 80 00 00 08 00
>> > Aug 26 02:18:56 aea3 kernel: [106180.779346] blk_update_request: I/O
>> > error, dev sda, sector 62626944
>>
>> And the drive has reported the physical sector that's failing.
>>
>>
>>
>> > Aug 26 02:18:56 aea3 kernel: [106180.779703] BTRFS error (device
>> > dm-2): bdev /dev/mapper/master-root errs: wr 0, rd 40, flush 0,
>> > corrupt 0, gen 0
>> > Aug 26 02:18:56 aea3 kernel: [106180.779936] ata4: EH complete
>>
>> And Btrfs reports it as a read error. Is this a single drive setup?
>
> Yes, this is a single drive setup.
>
>> And what's the profile for metadata and data?
>
> sudo btrfs fi usage /mnt/peynir/
> [sudo] password for aea:
> Overall:
>     Device size:                 931.32GiB
>     Device allocated:             16.08GiB
>     Device unallocated:          915.24GiB
>     Device missing:                  0.00B
>     Used:                         12.53GiB
>     Free (estimated):            915.81GiB      (min: 458.19GiB)
>     Data ratio:                       1.00
>     Metadata ratio:                   2.00
>     Global reserve:               43.94MiB      (used: 2.45MiB)
>
> Data,single: Size:12.01GiB, Used:11.43GiB
>    /dev/mapper/master-root        12.01GiB
>
> Metadata,single: Size:8.00MiB, Used:0.00B
>    /dev/mapper/master-root         8.00MiB
>
> Metadata,DUP: Size:2.00GiB, Used:562.08MiB
>    /dev/mapper/master-root         4.00GiB
>
> System,single: Size:4.00MiB, Used:0.00B
>    /dev/mapper/master-root         4.00MiB
>
> System,DUP: Size:32.00MiB, Used:16.00KiB
>    /dev/mapper/master-root        64.00MiB
>
> Unallocated:
>    /dev/mapper/master-root       915.24GiB


OK, this looks like it may have been created a while ago; it has the
empty single chunk items that were common a while back. Cleaning them up
is low risk, but I still advise making a backup first:

'btrfs balance start -mconvert=dup <mountpoint>'

OK, so DUP metadata means that you have two copies. So either the
previous email lacks a complete dmesg showing that Btrfs tried to do a
fix-up on the metadata, or it was reading data, and since there's no
second copy of data it fails.
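
If you want to check whether anything was counted or fixed up on that
device, something like this should do it (untested; mountpoint taken from
your earlier output, and I think the kernel logs "read error corrected"
lines when a fix-up happens):

sudo btrfs device stats /mnt/peynir/
sudo dmesg | grep -i btrfs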


>
>
> Only if the
>> data/metadata on this sector is DUP or raid1 or raid56 can Btrfs
>> automatically fix it up. If there's only one copy, whatever is on that
>> sector is lost, if this is a persistent error. But maybe it's
>> transient.
>>
>> What do you get for
>>
>> sudo smartctl -x /dev/sda
>
> https://gist.github.com/ceremcem/55a219f4c46781c1d4d58e0659500c96
>
>>
>> That'll show stats on bad sectors, and also if the drive supports SCT
>> ERC and what the settings are.
>>
>
> I think the drive screams for help.


Yep.

>5 Reallocated_Sector_Ct PO--CK 070 051 036 - 40472

That's a lot. If the drive is under warranty I'd aggressively try to
get it replaced.

>187 Reported_Uncorrect -O--CK 001 001 000 - 4548

That's too many. It might be that by now there are no reserve sectors
left, so remapping isn't possible with this many uncorrectable sectors.


>197 Current_Pending_Sector -O--C- 070 069 000 - 5000
>198 Offline_Uncorrectable ----C- 070 069 000 - 5000

Same.



>SCT Error Recovery Control command not supported


OK, too bad: there's no way to increase the recovery timeout and give the
drive more of a chance to recover the data.
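
For reference, on a drive that does support SCT ERC, this is roughly what
I'd suggest (values are tenths of a second, so 70 means 7 seconds; the
device name is just the one from your earlier command):

sudo smartctl -l scterc /dev/sda          # query the current setting
sudo smartctl -l scterc,70,70 /dev/sda    # cap read/write recovery at 7 seconds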

So yeah, make a backup and get the drive replaced.



>
>>
>> >
>> >
>> > I always saw these DRDY errors whenever I experience physical hard
>> > drive errors, so I expect `btrfs scrub` show some kind of similar
>> > errors but it doesn't:
>> >
>> > btrfs scrub status /mnt/peynir/
>> > scrub status for 8827cb0e-52d7-4f99-90fd-a975cafbfa46
>> > scrub started at Tue Aug 28 00:43:55 2018 and finished after 00:02:07
>> > total bytes scrubbed: 12.45GiB with 0 errors
>>
>> Well that suggests this is a transient problem. Make sure you have
>> backups, drive could be dying or maybe it'll stay in this state for a
>> while longer.
>
> I have a very good set of backups, so when the drive dies it won't hurt
> at all. But anticipating the likely demise of the hard drive will make
> it easier to get over.


I would consider this drive usable only for educational and experimental
purposes at this point; real-world Btrfs disaster testing ;-)



>
>>
>>
>> >
>> > I took new snapshots for both root and the LXC containers and nothing
>> > gone wrong. To be confident, I reformat the swap partition (which I
>> > saw some messages about swap partition in the crash screen).
>> >
>> > I'm not sure how to proceed at the moment. Taking succesfull backups
>> > made me think that everything might be okay but I'm not sure if I
>> > should continue trusting the drive or not. What additional checks
>> > should I perform?
>>
>> What you could do is a full balance. This will read everything like a
>> scrub, and then write it back out. So in theory, if the write hits the
>> transient sector the firmware will determine whether the sector needs
>> remapping or not.
>
> I've started a full balance job right now.


With this many pending sectors, I suspect it will fail spectacularly.
But it's a great test in a way. I mean, the block layer might complain
about failed writes, and on any failed write to the rootfs, any file
system should just fall over, probably not gracefully. But it might be
educational.
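
If you want to keep an eye on it while it runs, something like this
should work (mountpoint from your earlier commands; --follow needs a
reasonably recent dmesg):

sudo btrfs balance status /mnt/peynir/
sudo dmesg --follow | grep -iE 'btrfs|ata4'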





-- 
Chris Murphy


* Re: DRDY errors are not consistent with scrub results
  2018-08-28  0:38     ` Chris Murphy
@ 2018-08-28  0:39       ` Chris Murphy
  2018-08-28  0:49         ` Cerem Cem ASLAN
  0 siblings, 1 reply; 13+ messages in thread
From: Chris Murphy @ 2018-08-28  0:39 UTC (permalink / raw)
  To: Chris Murphy; +Cc: Cerem Cem ASLAN, Btrfs BTRFS

On Mon, Aug 27, 2018 at 6:38 PM, Chris Murphy <lists@colorremedies.com> wrote:

>> Metadata,single: Size:8.00MiB, Used:0.00B
>>    /dev/mapper/master-root         8.00MiB
>>
>> Metadata,DUP: Size:2.00GiB, Used:562.08MiB
>>    /dev/mapper/master-root         4.00GiB
>>
>> System,single: Size:4.00MiB, Used:0.00B
>>    /dev/mapper/master-root         4.00MiB
>>
>> System,DUP: Size:32.00MiB, Used:16.00KiB
>>    /dev/mapper/master-root        64.00MiB
>>
>> Unallocated:
>>    /dev/mapper/master-root       915.24GiB
>
>
> OK this looks like it maybe was created a while ago, it has these
> empty single chunk items that was common a while back. There is a low
> risk to clean it up, but I still advise backup first:
>
> 'btrfs balance start -mconvert=dup <mountpoint>'

You can skip this advice now; it really doesn't matter. But future Btrfs
file systems shouldn't have both single and DUP chunks like this one is
showing, as long as you're using relatively recent btrfs-progs to create
the file system.


-- 
Chris Murphy


* Re: DRDY errors are not consistent with scrub results
  2018-08-28  0:39       ` Chris Murphy
@ 2018-08-28  0:49         ` Cerem Cem ASLAN
  2018-08-28  1:08           ` Chris Murphy
  0 siblings, 1 reply; 13+ messages in thread
From: Cerem Cem ASLAN @ 2018-08-28  0:49 UTC (permalink / raw)
  To: Chris Murphy; +Cc: Btrfs BTRFS

Thanks for your guidance, I'll get the device replaced first thing in
the morning.

Here are the balance results, which I think turned out not too bad:

sudo btrfs balance start /mnt/peynir/
WARNING:

        Full balance without filters requested. This operation is very
        intense and takes potentially very long. It is recommended to
        use the balance filters to narrow down the balanced data.
        Use 'btrfs balance start --full-balance' option to skip this
        warning. The operation will start in 10 seconds.
        Use Ctrl-C to stop it.
10 9 8 7 6 5 4 3 2 1
Starting balance without any filters.
Done, had to relocate 18 out of 18 chunks

I suppose this means I haven't lost any data, but I'm still wary because
of the previous `smartctl ...` results.

On Tue, 28 Aug 2018 at 03:39, Chris Murphy <lists@colorremedies.com>
wrote:
>
> On Mon, Aug 27, 2018 at 6:38 PM, Chris Murphy <lists@colorremedies.com> wrote:
>
> >> Metadata,single: Size:8.00MiB, Used:0.00B
> >>    /dev/mapper/master-root         8.00MiB
> >>
> >> Metadata,DUP: Size:2.00GiB, Used:562.08MiB
> >>    /dev/mapper/master-root         4.00GiB
> >>
> >> System,single: Size:4.00MiB, Used:0.00B
> >>    /dev/mapper/master-root         4.00MiB
> >>
> >> System,DUP: Size:32.00MiB, Used:16.00KiB
> >>    /dev/mapper/master-root        64.00MiB
> >>
> >> Unallocated:
> >>    /dev/mapper/master-root       915.24GiB
> >
> >
> > OK this looks like it maybe was created a while ago, it has these
> > empty single chunk items that was common a while back. There is a low
> > risk to clean it up, but I still advise backup first:
> >
> > 'btrfs balance start -mconvert=dup <mountpoint>'
>
> You can skip this advise now, it really doesn't matter. But future
> Btrfs shouldn't have both single and DUP chunks like this one is
> showing, if you're using relatively recent btrfs-progs to create the
> file system.
>
>
> --
> Chris Murphy


* Re: DRDY errors are not consistent with scrub results
  2018-08-28  0:49         ` Cerem Cem ASLAN
@ 2018-08-28  1:08           ` Chris Murphy
  2018-08-28 18:50             ` Cerem Cem ASLAN
  0 siblings, 1 reply; 13+ messages in thread
From: Chris Murphy @ 2018-08-28  1:08 UTC (permalink / raw)
  To: Cerem Cem ASLAN; +Cc: Chris Murphy, Btrfs BTRFS

On Mon, Aug 27, 2018 at 6:49 PM, Cerem Cem ASLAN <ceremcem@ceremcem.net> wrote:
> Thanks for your guidance, I'll get the device replaced first thing in
> the morning.
>
> Here is balance results which I think resulted not too bad:
>
> sudo btrfs balance start /mnt/peynir/
> WARNING:
>
>         Full balance without filters requested. This operation is very
>         intense and takes potentially very long. It is recommended to
>         use the balance filters to narrow down the balanced data.
>         Use 'btrfs balance start --full-balance' option to skip this
>         warning. The operation will start in 10 seconds.
>         Use Ctrl-C to stop it.
> 10 9 8 7 6 5 4 3 2 1
> Starting balance without any filters.
> Done, had to relocate 18 out of 18 chunks
>
> I suppose this means I've not lost any data, but I'm very prone to due
> to previous `smartctl ...` results.


OK, so nothing fatal anyway. We'd have to see any kernel messages that
appeared during the balance to tell whether there were read or write
errors, but presumably any failure means the balance fails, so... it
might get you by for a while, actually.







-- 
Chris Murphy


* Re: DRDY errors are not consistent with scrub results
  2018-08-28  1:08           ` Chris Murphy
@ 2018-08-28 18:50             ` Cerem Cem ASLAN
  2018-08-28 21:07               ` Chris Murphy
  0 siblings, 1 reply; 13+ messages in thread
From: Cerem Cem ASLAN @ 2018-08-28 18:50 UTC (permalink / raw)
  To: Chris Murphy; +Cc: Btrfs BTRFS

I've successfully moved everything to another disk. (The only hard part
was configuring the kernel parameters, as my root partition was on LVM,
which is on a LUKS partition. Here are the notes, in case anyone needs
them: https://github.com/ceremcem/smith-sync/blob/master/create-bootable-backup.md)

Now I'm looking for trouble :) I tried to convert my new system (booted
from the new disk) into raid1, paired with the problematic old disk. To
do so, I issued:
sudo btrfs device add /dev/mapper/master-root /mnt/peynir/
/dev/mapper/master-root appears to contain an existing filesystem (btrfs).
ERROR: use the -f option to force overwrite of /dev/mapper/master-root
aea@aea3:/mnt$ sudo btrfs device add /dev/mapper/master-root /mnt/peynir/ -f
ERROR: error adding device '/dev/mapper/master-root': Input/output error
aea@aea3:/mnt$ sudo btrfs device add /dev/mapper/master-root /mnt/peynir/
sudo: unable to open /var/lib/sudo/ts/aea: Read-only file system

Now I've ended up with a read-only file system. Isn't it possible to add
a device to a running system?

On Tue, 28 Aug 2018 at 04:08, Chris Murphy <lists@colorremedies.com>
wrote:
>
> On Mon, Aug 27, 2018 at 6:49 PM, Cerem Cem ASLAN <ceremcem@ceremcem.net> wrote:
> > Thanks for your guidance, I'll get the device replaced first thing in
> > the morning.
> >
> > Here is balance results which I think resulted not too bad:
> >
> > sudo btrfs balance start /mnt/peynir/
> > WARNING:
> >
> >         Full balance without filters requested. This operation is very
> >         intense and takes potentially very long. It is recommended to
> >         use the balance filters to narrow down the balanced data.
> >         Use 'btrfs balance start --full-balance' option to skip this
> >         warning. The operation will start in 10 seconds.
> >         Use Ctrl-C to stop it.
> > 10 9 8 7 6 5 4 3 2 1
> > Starting balance without any filters.
> > Done, had to relocate 18 out of 18 chunks
> >
> > I suppose this means I've not lost any data, but I'm very prone to due
> > to previous `smartctl ...` results.
>
>
> OK so nothing fatal anyway. We'd have to see any kernel messages that
> appeared during the balance to see if there were read or write errors,
> but presumably any failure means the balance fails so... might get you
> by for a while actually.
>
>
>
>
>
>
>
> --
> Chris Murphy


* Re: DRDY errors are not consistent with scrub results
  2018-08-28 18:50             ` Cerem Cem ASLAN
@ 2018-08-28 21:07               ` Chris Murphy
  2018-08-28 23:04                 ` Cerem Cem ASLAN
  0 siblings, 1 reply; 13+ messages in thread
From: Chris Murphy @ 2018-08-28 21:07 UTC (permalink / raw)
  To: Cerem Cem ASLAN; +Cc: Chris Murphy, Btrfs BTRFS

On Tue, Aug 28, 2018 at 12:50 PM, Cerem Cem ASLAN <ceremcem@ceremcem.net> wrote:
> I've successfully moved everything to another disk. (The only hard
> part was configuring the kernel parameters, as my root partition was
> on LVM which is on LUKS partition. Here are the notes, if anyone
> needs: https://github.com/ceremcem/smith-sync/blob/master/create-bootable-backup.md)
>
> Now I'm seekin for trouble :) I tried to convert my new system (booted
> with new disk) into raid1 coupled with the problematic old disk. To do
> so, I issued:
>
> sudo btrfs device add /dev/mapper/master-root /mnt/peynir/
> /dev/mapper/master-root appears to contain an existing filesystem (btrfs).
> ERROR: use the -f option to force overwrite of /dev/mapper/master-root
> aea@aea3:/mnt$ sudo btrfs device add /dev/mapper/master-root /mnt/peynir/ -f
> ERROR: error adding device '/dev/mapper/master-root': Input/output error
> aea@aea3:/mnt$ sudo btrfs device add /dev/mapper/master-root /mnt/peynir/
> sudo: unable to open /var/lib/sudo/ts/aea: Read-only file system
>
> Now I ended up with a readonly file system. Isn't it possible to add a
> device to a running system?

Yes.

The problem is the 2nd error message:

ERROR: error adding device '/dev/mapper/master-root': Input/output error

So you need to look in dmesg to see what Btrfs kernel messages
occurred at that time. I'm gonna guess it's a failed write. You have a
few of those in the smartctl log output. Any time a write failure
happens, the operation is always fatal regardless of the file system.
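
Something like this should pull the relevant lines back out of the kernel
log (the filter is just a guess at what matters here):

sudo dmesg -T | grep -iE 'btrfs|ata[0-9]|blk_update_request' | tail -n 50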



-- 
Chris Murphy


* Re: DRDY errors are not consistent with scrub results
  2018-08-28 21:07               ` Chris Murphy
@ 2018-08-28 23:04                 ` Cerem Cem ASLAN
  2018-08-28 23:58                   ` Chris Murphy
  0 siblings, 1 reply; 13+ messages in thread
From: Cerem Cem ASLAN @ 2018-08-28 23:04 UTC (permalink / raw)
  To: Chris Murphy; +Cc: Btrfs BTRFS

What I want to achieve is to add the problematic disk as raid1 and see
how/when it fails and how BTRFS recovers from those failures. While that
party goes on, the main system shouldn't be interrupted, since this is a
production system. For example, I would never expect to end up in such a
read-only state while trying to add a disk of "unknown health" to the
system. Was that somewhat expected?

Although we know the disk is about to fail, it still survives. Shouldn't
we expect, in such a scenario, that when the system tries to read or
write some data from/to that BROKEN_DISK and recognizes that the
operation failed, it will try to recover that part of the data from
GOOD_DISK and store the recovered data in some other part of
BROKEN_DISK? Or did I misunderstand the whole thing?
On Wed, 29 Aug 2018 at 00:07, Chris Murphy <lists@colorremedies.com>
wrote:
>
> On Tue, Aug 28, 2018 at 12:50 PM, Cerem Cem ASLAN <ceremcem@ceremcem.net> wrote:
> > I've successfully moved everything to another disk. (The only hard
> > part was configuring the kernel parameters, as my root partition was
> > on LVM which is on LUKS partition. Here are the notes, if anyone
> > needs: https://github.com/ceremcem/smith-sync/blob/master/create-bootable-backup.md)
> >
> > Now I'm seekin for trouble :) I tried to convert my new system (booted
> > with new disk) into raid1 coupled with the problematic old disk. To do
> > so, I issued:
> >
> > sudo btrfs device add /dev/mapper/master-root /mnt/peynir/
> > /dev/mapper/master-root appears to contain an existing filesystem (btrfs).
> > ERROR: use the -f option to force overwrite of /dev/mapper/master-root
> > aea@aea3:/mnt$ sudo btrfs device add /dev/mapper/master-root /mnt/peynir/ -f
> > ERROR: error adding device '/dev/mapper/master-root': Input/output error
> > aea@aea3:/mnt$ sudo btrfs device add /dev/mapper/master-root /mnt/peynir/
> > sudo: unable to open /var/lib/sudo/ts/aea: Read-only file system
> >
> > Now I ended up with a readonly file system. Isn't it possible to add a
> > device to a running system?
>
> Yes.
>
> The problem is the 2nd error message:
>
> ERROR: error adding device '/dev/mapper/master-root': Input/output error
>
> So you need to look in dmesg to see what Btrfs kernel messages
> occurred at that time. I'm gonna guess it's a failed write. You have a
> few of those in the smartctl log output. Any time a write failure
> happens, the operation is always fatal regardless of the file system.
>
>
>
> --
> Chris Murphy


* Re: DRDY errors are not consistent with scrub results
  2018-08-28 23:04                 ` Cerem Cem ASLAN
@ 2018-08-28 23:58                   ` Chris Murphy
  2018-08-29  6:58                     ` Cerem Cem ASLAN
  0 siblings, 1 reply; 13+ messages in thread
From: Chris Murphy @ 2018-08-28 23:58 UTC (permalink / raw)
  To: Cerem Cem ASLAN; +Cc: Chris Murphy, Btrfs BTRFS

On Tue, Aug 28, 2018 at 5:04 PM, Cerem Cem ASLAN <ceremcem@ceremcem.net> wrote:
> What I want to achive is that I want to add the problematic disk as
> raid1 and see how/when it fails and how BTRFS recovers these fails.
> While the party goes on, the main system shouldn't be interrupted
> since this is a production system. For example, I would never expect
> to be ended up with such a readonly state while trying to add a disk
> with "unknown health" to the system. Was it somewhat expected?

I don't know. I can't tell you how LVM or mdraid behave in the same
situation either, though. For sure, I've come across bug reports where
underlying devices go read-only, the file system falls over totally, and
developers shrug and say they can't do anything.

This situation is a little different and difficult. You're starting out
with a one-drive setup, so the profile is single/DUP or single/single,
and that doesn't change when adding. So the 2nd drive is actually
*mandatory* for a brief period of time before you've made it raid1 or
higher. It's a question for the developers what the design is, and
whether this is a bug: maybe the device being added should first be
written with placeholder supers, or even just zeros, in all the places
'dev add' puts metadata, and only if that succeeds should real updated
supers be written to all devices. It's possible that 'dev add' presently
writes updated supers to all devices at the same time and has a brief
period where the state is fragile; if that fails, it goes read-only to
prevent damaging the file system.

Anyway, without a call trace I have no idea why it ended up read-only,
so I can only speculate.


>
> Although we know that disk is about to fail, it still survives.

That's a very tenuous rationalization; a drive that rejects even a
single write is considered failed by the md driver. Btrfs is still very
tolerant of this, so if the add had succeeded and you were running in
production, you should expect to see thousands of write errors dumped to
the kernel log, because Btrfs still never ejects a bad drive. It keeps
trying. And keeps reporting the failures. And all those errors being
logged can cause even more write demand if the logs are on the same
volume as the failing device, which means even more errors to record,
and you get an escalating situation with heavy log writing.


> Shouldn't we expect in such a scenario that when system tries to read
> or write some data from/to that BROKEN_DISK and when it recognizes it
> failed, it will try to recover the part of the data from GOOD_DISK and
> try to store that recovered data in some other part of the
> BROKEN_DISK?

Nope. Btrfs can only write supers to fixed locations on the drive, same
as any other file system. Btrfs metadata could possibly go elsewhere
because it doesn't have fixed locations, but Btrfs doesn't do bad sector
tracking. So once it decides metadata goes in location X, if X reports a
write error it will not try to write elsewhere, and insofar as I'm aware
ext4, XFS, LVM, and md don't either; md does have an optional bad-block
map it will use for tracking bad sectors and remapping to known good
sectors. Normally the drive firmware should do this, and when that fails
the drive is considered toast for production purposes.
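
For what it's worth, I believe md's bad-block list can be inspected with
something like this, where the member device is just an example:

mdadm --examine-badblocks /dev/sdb1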

>Or did I misunderstood the whole thing?

Well, in a way this is sorta user sabotage. It's a valid test, and I'd
say ideally things should fail safely rather than fall over. But at the
same time it's not wrong for developers to say: "look, if you add a bad
device there's a decent chance we're going to face-plant and go read-only
to avoid causing worse problems, so next time you should qualify the
drive before putting it into production."

I'm willing to bet all the other file system devs would say something
like that, and even if Btrfs devs think something better could happen,
it's probably not a super high priority.




-- 
Chris Murphy


* Re: DRDY errors are not consistent with scrub results
  2018-08-28 23:58                   ` Chris Murphy
@ 2018-08-29  6:58                     ` Cerem Cem ASLAN
  2018-08-29  9:58                       ` Duncan
  0 siblings, 1 reply; 13+ messages in thread
From: Cerem Cem ASLAN @ 2018-08-29  6:58 UTC (permalink / raw)
  To: Chris Murphy; +Cc: Btrfs BTRFS

On Wed, 29 Aug 2018 at 02:58, Chris Murphy <lists@colorremedies.com>
wrote:
>
> On Tue, Aug 28, 2018 at 5:04 PM, Cerem Cem ASLAN <ceremcem@ceremcem.net> wrote:
> > What I want to achive is that I want to add the problematic disk as
> > raid1 and see how/when it fails and how BTRFS recovers these fails.
> > While the party goes on, the main system shouldn't be interrupted
> > since this is a production system. For example, I would never expect
> > to be ended up with such a readonly state while trying to add a disk
> > with "unknown health" to the system. Was it somewhat expected?
>
> I don't know. I also can't tell you how LVM or mdraid behave in the
> same situation either though. For sure I've come across bug reports
> where underlying devices go read only and the file system falls over
> totally and developers shrug and say they can't do anything.
>
> This situation is a little different and difficult. You're starting
> out with a one drive setup so the profile is single/DUP or
> single/single, and that doesn't change when adding. So the 2nd drive
> is actually *mandatory* for a brief period of time before you've made
> it raid1 or higher. It's a developer question what is the design, and
> if this is a bug: maybe the device being added should be written to
> with placeholder supers or even just zeros in all the places for 'dev
> add' metadata, and only if that succeeds, to then write real updated
> supers to all devices. It's possible the 'dev add' presently writes
> updated supers to all devices at the same time, and has a brief period
> where the state is fragile and if it fails, it goes read only to
> prevent damaging the file system.

Thinking about it again, this is totally acceptable. If the requirement
is a healthy disk, then I think I must check the disk's health myself. I
may simply believe that the disk is in a good state, run a quick test,
or run some very detailed tests to be sure.

Likewise, ending up in a read-only state is not the end of the world,
even over SSH, because the system still functions and all I need to do
in the worst case is reboot. That's also acceptable *while adding a new
disk*.

>
> Anyway, without a call trace, no idea why it ended up read only. So I
> have to speculate.
>

I can try adding the disk again at any time and provide any requested
logs; the disk is still attached to the server. I'm just not sure whether
this is a useful experiment from the point of view of the rest of the
list.

>
> >
> > Although we know that disk is about to fail, it still survives.
>
> That's very tenuous rationalization, a drive that rejects even a
> single write is considered failed by the md driver. Btrfs is still
> very tolerant of this, so if it had successfully added and you were
> running in production, you should expect to see thousands of write
> errors dumped to the kernel log

That's exactly what I expected :)

> because Btrfs never ejects a bad drive
> still. It keeps trying. And keeps reporting the failures. And all
> those errors being logged can end up causing more write demand if the
> logs are on the same volume as the failing device, even more errors to
> record, and you get an escalating situation with heavy log writing.
>

Good point. Maybe I should arrange an in-RAM virtual machine that writes
back to the local disk if no hardware errors are found, and starts
sending logs to a different server *if* such a hardware failure occurs.
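
For the forwarding part, I guess a single rsyslog rule would be enough
(the file name and address below are just placeholders):

# /etc/rsyslog.d/90-forward-kern.conf
kern.*  @@logs.example.net:514    # @@ forwards over TCP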

>
> > Shouldn't we expect in such a scenario that when system tries to read
> > or write some data from/to that BROKEN_DISK and when it recognizes it
> > failed, it will try to recover the part of the data from GOOD_DISK and
> > try to store that recovered data in some other part of the
> > BROKEN_DISK?
>
> Nope. Btrfs can only write supers to fixed locations on the drive,
> same as any other file system. Btrfs metadata could possibly go
> elsewhere because it doesn't have fixed locations, but Btrfs doesn't
> do bad sector tracking. So once it decides metadata goes in location
> X, if X reports a write error it will not try to write elsewhere and
> insofar as I'm aware ext4 and XFS and LVM and md don't either; md does
> have an optional bad block map it will use for tracking bad sectors
> and remap to known good sectors. Normally the drive firmware should do
> this and when that fails the drive is considered toast for production
> purpose

That's also plausible. Thinking again (again? :), if BTRFS behaved as I
expected, the retries might never end if the disk is in a very bad
state, and that would put a very intensive IO load on a production
system.

I think in such a situation I should remove the raid device, try to
reformat it, and attach it again.

>
> >Or did I misunderstood the whole thing?
>
> Well in a way this is sorta user sabotage. It's a valid test and I'd
> say ideally things should fail safely, rather than fall over. But at
> the same time it's not wrong for developers to say: "look if you add a
> bad device there's a decent chance we're going face plant and go read
> only to avoid causing worse problems, so next time you should qualify
> the drive before putting it into production."

Agreed.

>
> I'm willing to bet all the other file system devs would say something
> like that even if Btrfs devs think something better could happen, it's
> probably not a super high priority.
>
>

The devs are doing lots of things already, and yes, this is not an
urgent task.

I appreciate your help, thank you!

>
>
> --
> Chris Murphy


* Re: DRDY errors are not consistent with scrub results
  2018-08-27 22:51 DRDY errors are not consistent with scrub results Cerem Cem ASLAN
       [not found] ` <CAJCQCtSq5K90gpfGQN8JhqQddBg62m8EG_bFuWN5XyzdNStDfw@mail.gmail.com>
@ 2018-08-29  9:56 ` ein
  1 sibling, 0 replies; 13+ messages in thread
From: ein @ 2018-08-29  9:56 UTC (permalink / raw)
  To: Cerem Cem ASLAN, Btrfs BTRFS

On 08/28/2018 12:51 AM, Cerem Cem ASLAN wrote:
> Hi,
>

Good morning.

>
> I'm not sure how to proceed at the moment. Taking succesfull backups
> made me think that everything might be okay but I'm not sure if I
> should continue trusting the drive or not. What additional checks
> should I perform?
> 

Can you please also show:

btrfs dev stats /path/to/the/mount/point


-- 
PGP Public Key (RSA/4096b):
ID: 0xF2C6EA10
SHA-1: 51DA 40EE 832A 0572 5AD8 B3C0 7AFF 69E1 F2C6 EA10


* Re: DRDY errors are not consistent with scrub results
  2018-08-29  6:58                     ` Cerem Cem ASLAN
@ 2018-08-29  9:58                       ` Duncan
  2018-08-29 10:04                         ` Hugo Mills
  0 siblings, 1 reply; 13+ messages in thread
From: Duncan @ 2018-08-29  9:58 UTC (permalink / raw)
  To: linux-btrfs

Cerem Cem ASLAN posted on Wed, 29 Aug 2018 09:58:21 +0300 as excerpted:

> Thinking again, this is totally acceptable. If the requirement was a
> good health disk, then I think I must check the disk health by myself.
> I may believe that the disk is in a good state, or make a quick test or
> make some very detailed tests to be sure.

For testing you might try badblocks.  It's most useful on a device that 
doesn't have a filesystem on it that you're trying to save, so you can 
use the -w write-test option.  See the manpage for details.

The -w option should force the device to remap bad blocks where it can as 
well, and you can take your previous smartctl read and compare it to a 
new one after the test.
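
For example, something like this (destructive, it wipes the device, and 
/dev/sdX is a placeholder for the disk under test):

smartctl -A /dev/sdX > smart-before.txt
badblocks -wsv /dev/sdX        # the four destructive write/read passes
smartctl -A /dev/sdX > smart-after.txt
diff smart-before.txt smart-after.txt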

Hint if testing multiple spinning-rust devices:  Try running multiple 
tests at once.  While this might have been slower on old EIDE, at least 
with spinning rust, on SATA and similar you should be able to test 
multiple devices at once without them slowing down significantly, because 
the bottleneck is the spinning rust, not the bus, controller or CPU.  I 
used badblocks years ago to test my new disks before setting up mdraid on 
them, and with full disk tests on spinning rust taking (at the time) 
nearly a day a pass and four passes for the -w test, the multiple tests 
at once trick saved me quite a bit of time!

It's not a great idea to run the test on new SSDs, as a -w pass writes 
the entire device four times with different patterns and is just 
unnecessary wear, but it might be worthwhile on an SSD you're only trying 
to salvage, forcing it to swap out any bad sectors it encounters in the 
process.

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman


* Re: DRDY errors are not consistent with scrub results
  2018-08-29  9:58                       ` Duncan
@ 2018-08-29 10:04                         ` Hugo Mills
  0 siblings, 0 replies; 13+ messages in thread
From: Hugo Mills @ 2018-08-29 10:04 UTC (permalink / raw)
  To: Duncan; +Cc: linux-btrfs


On Wed, Aug 29, 2018 at 09:58:58AM +0000, Duncan wrote:
> Cerem Cem ASLAN posted on Wed, 29 Aug 2018 09:58:21 +0300 as excerpted:
> 
> > Thinking again, this is totally acceptable. If the requirement was a
> > good health disk, then I think I must check the disk health by myself.
> > I may believe that the disk is in a good state, or make a quick test or
> > make some very detailed tests to be sure.
> 
> For testing you might try badblocks.  It's most useful on a device that 
> doesn't have a filesystem on it you're trying to save, so you can use the 
> -w write-test option.  See the manpage for details.
> 
> The -w option should force the device to remap bad blocks where it can as 
> well, and you can take your previous smartctl read and compare it to a 
> new one after the test.
> 
> Hint if testing multiple spinning-rust devices:  Try running multiple 
> tests at once.  While this might have been slower on old EIDE, at least 
> with spinning rust, on SATA and similar you should be able to test 
> multiple devices at once without them slowing down significantly, because 
> the bottleneck is the spinning rust, not the bus, controller or CPU.  I 
> used badblocks years ago to test my new disks before setting up mdraid on 
> them, and with full disk tests on spinning rust taking (at the time) 
> nearly a day a pass and four passes for the -w test, the multiple tests 
> at once trick saved me quite a bit of time!

   Hah. Only a day? It's up to 2 days now.

   The devices get bigger. The interfaces don't get faster at the same
rate. Back in the late '90s, it was only an hour or so to run a
badblocks pass on a big disk...

   Hugo.

-- 
Hugo Mills             | Nostalgia isn't what it used to be.
hugo@... carfax.org.uk |
http://carfax.org.uk/  |
PGP: E2AB1DE4          |

