* Please show descriptive message about degraded raid when booting
@ 2020-03-05 14:57 Patrick Dung
  2020-03-20 16:47 ` Patrick Dung
  0 siblings, 1 reply; 6+ messages in thread
From: Patrick Dung @ 2020-03-05 14:57 UTC (permalink / raw)
  To: linux-raid

Hello,

The system has Linux software RAID (md) RAID 1.
One of the disks is missing or has a problem.

The RAID is degraded.
When the OS boots, it appears to hang at the kernel messages around the
three-second timestamp.
There is no descriptive message that the RAID is degraded.
I know the cause because I had written zeros to one of the disks in the
RAID 1 array. If I did not know the cause (maybe a loose cable or a
disk failure), it would be confusing.

Related log:

[    2.917387] sd 32:0:0:0: [sda] 56623104 512-byte logical blocks:
(29.0 GB/27.0 GiB)
[    2.917446] sd 32:0:1:0: [sdb] 56623104 512-byte logical blocks:
(29.0 GB/27.0 GiB)
[    2.917499] sd 32:0:0:0: [sda] Write Protect is off
[    2.917516] sd 32:0:0:0: [sda] Mode Sense: 61 00 00 00
[    2.917557] sd 32:0:1:0: [sdb] Write Protect is off
[    2.917575] sd 32:0:1:0: [sdb] Mode Sense: 61 00 00 00
[    2.917615] sd 32:0:0:0: [sda] Cache data unavailable
[    2.917636] sd 32:0:0:0: [sda] Assuming drive cache: write through
[    2.917661] sd 32:0:1:0: [sdb] Cache data unavailable
[    2.917677] sd 32:0:1:0: [sdb] Assuming drive cache: write through
[    2.927076] sd 32:0:0:0: [sda] Attached SCSI disk
[    2.927458]  sdb: sdb1 sdb2 sdb3 sdb4
[    2.929018] sd 32:0:1:0: [sdb] Attached SCSI disk
[    3.060855] vmxnet3 0000:0b:00.0 ens192: intr type 3, mode 0, 3
vectors allocated
[    3.061826] vmxnet3 0000:0b:00.0 ens192: NIC Link is Up 10000 Mbps
[  139.411464] md/raid1:md125: active with 1 out of 2 mirrors
[  139.412176] md125: detected capacity change from 0 to 1073676288
[  139.433441] md/raid1:md126: active with 1 out of 2 mirrors
[  139.434182] md126: detected capacity change from 0 to 314507264
[  139.436894]  md126:
[  139.455511] md/raid1:md127: active with 1 out of 2 mirrors
[  139.456739] md127: detected capacity change from 0 to 27582726144

So there are about 130 seconds without any descriptive messages. I
thought the system had hung.

Could the kernel display more descriptive messages about the RAID?

If I use the rd.debug boot parameter, I can see the kernel is still
running, but the output scrolls by so quickly that it is hard to tell
what the problem actually is.

Thanks,
Patrick


* Re: Please show descriptive message about degraded raid when booting
  2020-03-05 14:57 Please show descriptive message about degraded raid when booting Patrick Dung
@ 2020-03-20 16:47 ` Patrick Dung
  2020-03-23 18:13   ` Roger Heflin
  0 siblings, 1 reply; 6+ messages in thread
From: Patrick Dung @ 2020-03-20 16:47 UTC (permalink / raw)
  To: linux-raid

Hello,

Bump.

Got a reply from Fedora support, but they asked me to raise it upstream.
https://bugzilla.redhat.com/show_bug.cgi?id=1794139

Thanks,
Patrick

On Thu, Mar 5, 2020 at 10:57 PM Patrick Dung <patdung100@gmail.com> wrote:
> [...]


* Re: Please show descriptive message about degraded raid when booting
  2020-03-20 16:47 ` Patrick Dung
@ 2020-03-23 18:13   ` Roger Heflin
  2020-03-23 18:33     ` Patrick Dung
  2020-03-24  0:55     ` antlists
  0 siblings, 2 replies; 6+ messages in thread
From: Roger Heflin @ 2020-03-23 18:13 UTC (permalink / raw)
  To: Patrick Dung; +Cc: Linux RAID

The system had hung.  The disks are failing inside the SCSI subsystem;
I don't believe the layers above it (RAID, LVM, multipath) know
anything about what is going on inside the SCSI layer.

Those default timeouts are usually at least 30 seconds, but in the
past the SCSI subsystem did some retrying internally.  The timeout
needs to be higher than the length of time the disk could take.
Non-enterprise, non-RAID disks generally have this timeout set to
60-120 seconds, hence MD waiting to see whether the failure is a
sector read failure (no response until the disk's own timeout expires)
or a complete disk failure (no response ever).

cat /sys/block/sda/device/timeout shows the SCSI layer timeout.

Read about SCT ERC (scterc), TLER and smartctl for discussions about
what is going on.

If you can turn down your disk's maximum error-recovery time with the
smartctl commands, the disk will report back a sector failure faster,
and that is usually what is happening.  If you turn down the disk's
timeout to a maximum of, say, 7 seconds, then you can set the SCSI
layer's timeout to, say, 10 seconds.  Then the only time the SCSI
timeout matters is if the disk is there but not responding.
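(As a minimal sketch of the above; /dev/sda is just an example device,
the 7 and 10 second values are the ones suggested here, and not every
drive supports SCT ERC:)

# current SCSI layer command timeout for this disk, in seconds
cat /sys/block/sda/device/timeout

# current SCT Error Recovery Control setting of the drive, if supported
smartctl -l scterc /dev/sda

# cap the drive's read/write error recovery at 7.0 seconds (values are tenths of a second)
smartctl -l scterc,70,70 /dev/sda

# then give the SCSI layer a slightly larger timeout, e.g. 10 seconds
echo 10 > /sys/block/sda/device/timeout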


On Fri, Mar 20, 2020 at 11:50 AM Patrick Dung <patdung100@gmail.com> wrote:
> [...]


* Re: Please show descriptive message about degraded raid when booting
  2020-03-23 18:13   ` Roger Heflin
@ 2020-03-23 18:33     ` Patrick Dung
  2020-03-24  4:45       ` Patrick Dung
  2020-03-24  0:55     ` antlists
  1 sibling, 1 reply; 6+ messages in thread
From: Patrick Dung @ 2020-03-23 18:33 UTC (permalink / raw)
  To: Roger Heflin; +Cc: Linux RAID

Thanks for the reply.

The problem occurs both on my physical hardware and in a virtual
machine (where TLER cannot be set).
The log you see in my original post was captured/simulated on a
virtual machine.

The system is not 'hung'. If I boot with rd.debug there are lots of
messages scrolling by too quickly to read clearly.

What I am asking for is a more descriptive message from MD RAID that
shows the status, something like:
Trying to activate md/raid1:md125, currently 1 of 2 disks online.
Timeout in X seconds.
Something like that.

Thanks,
Patrick

On Tue, Mar 24, 2020 at 2:14 AM Roger Heflin <rogerheflin@gmail.com> wrote:
> [...]


* Re: Please show descriptive message about degraded raid when booting
  2020-03-23 18:13   ` Roger Heflin
  2020-03-23 18:33     ` Patrick Dung
@ 2020-03-24  0:55     ` antlists
  1 sibling, 0 replies; 6+ messages in thread
From: antlists @ 2020-03-24  0:55 UTC (permalink / raw)
  To: Roger Heflin, Patrick Dung; +Cc: Linux RAID

On 23/03/2020 18:13, Roger Heflin wrote:
> Those default timeouts are usually at least 30 seconds, but in the
> past the scsi subsystem did some retrying internally.  The timeout
> needs to be higher than the length of time the disk could take.
> Non-enterprise, non-raid disks generally have this timeout set 60-120
> seconds hence MD waiting to see if the failure is a sector read
> failure (will be a no-response until the disk timeout) or a complete
> disk failure (no response ever).

https://raid.wiki.kernel.org/index.php/Timeout_Mismatch

The whole website is reasonably up-to-date, so it's worth a read.
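(Rough sketch of the workaround that page describes for drives that do
not accept SCT ERC settings: instead of shortening the drive's
recovery time, raise the kernel's command timeout well above it; the
180-second value and the sd[ab] device names are only examples:)

for t in /sys/block/sd[ab]/device/timeout; do
    echo 180 > "$t"    # give the drive's internal retries time to finish
done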

Cheers,
Wol


* Re: Please show descriptive message about degraded raid when booting
  2020-03-23 18:33     ` Patrick Dung
@ 2020-03-24  4:45       ` Patrick Dung
  0 siblings, 0 replies; 6+ messages in thread
From: Patrick Dung @ 2020-03-24  4:45 UTC (permalink / raw)
  To: Linux RAID

By the way, for my original post, it's a virtual machine. I
disconnected one of the members of the RAID 1 array.
I can't simulate a hardware failure in a VM, so there is no 'SCT Error
Recovery Control/TLER' timeout involved.
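(For reference, a similar degraded state can be reproduced on a test
array by failing and removing one member with mdadm; the device names
below are placeholders:)

# mark one raid1 member as failed, then remove it from the array
mdadm --manage /dev/md127 --fail /dev/sdb3
mdadm --manage /dev/md127 --remove /dev/sdb3

# the array should now report itself as degraded
cat /proc/mdstat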

Thanks,
Patrick

On Tue, Mar 24, 2020 at 2:33 AM Patrick Dung <patdung100@gmail.com> wrote:
> [...]

