* Detecting disk failures on XFS
       [not found] <CAG5wfU0E+y_gnfQLP4x2Ctan0Ts4d3frjVgZ9dt-xegVrucdXQ@mail.gmail.com>
@ 2022-11-09  4:58 ` Alexander Hartner
  2022-11-09 11:22   ` Carlos Maiolino
  2022-11-14 22:45   ` Dave Chinner
  0 siblings, 2 replies; 3+ messages in thread
From: Alexander Hartner @ 2022-11-09  4:58 UTC (permalink / raw)
  To: linux-xfs

We have been dealing with a problem where an NVMe drive fails every so
often; more often than it really should. While we are trying to make
sense of the hardware issue, we are also looking at the recovery options.

Currently we are using Ubuntu 20.04 LTS on XFS with a single NVMe
disk. If the disk fails, the following error is reported:

Nov 6, 2022 @ 20:27:12.000    [1095930.104279] nvme nvme0: controller
is down; will reset: CSTS=0x3, PCI_STATUS=0x10
Nov 6, 2022 @ 20:27:12.000    [1095930.451711] nvme nvme0: 64/0/0
default/read/poll queues
Nov 6, 2022 @ 20:27:12.000    [1095930.453846] blk_update_request: I/O
error, dev nvme0n1, sector 34503744 op 0x1:(WRITE) flags 0x800
phys_seg 1 prio class 0

And the system becomes completely unresponsive.

I am looking for a way to stop the system when this happens, so that
the other nodes in our cluster can take over the work. However, since
the system is unresponsive and the disk is presumably in read-only
mode, we are stuck in a sort of zombie state, where the processes are
still running but don't have access to the disk. On ext3/4 there is a
mount option to take the system down:

errors={continue|remount-ro|panic}
Define the behavior when an error is encountered.  (Either ignore
errors and just mark the filesystem erroneous and continue, or remount
the filesystem read-only, or panic and halt the system.)  The default
is set in the filesystem superblock, and can be changed using
tune2fs(8).
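
For example, on ext4 the panic behaviour can be selected per mount or
set persistently in the superblock; the device and mount point below
are just placeholders:

    mount -o remount,errors=panic /mnt/data
    tune2fs -e panic /dev/nvme0n1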

Is there an equivalent for XFS? I didn't find anything similar in the
XFS man page.

Also, are there any other suggestions for handling this better?


* Re: Detecting disk failures on XFS
  2022-11-09  4:58 ` Detecting disk failures on XFS Alexander Hartner
@ 2022-11-09 11:22   ` Carlos Maiolino
  2022-11-14 22:45   ` Dave Chinner
  1 sibling, 0 replies; 3+ messages in thread
From: Carlos Maiolino @ 2022-11-09 11:22 UTC (permalink / raw)
  To: Alexander Hartner; +Cc: linux-xfs

On Wed, Nov 09, 2022 at 12:58:55PM +0800, Alexander Hartner wrote:
> We have been dealing with a problem where an NVMe drive fails every so
> often; more often than it really should. While we are trying to make
> sense of the hardware issue, we are also looking at the recovery options.
> 
> Currently we are using Ubuntu 20.04 LTS on XFS with a single NVMe
> disk. If the disk fails, the following error is reported:
> 
> Nov 6, 2022 @ 20:27:12.000    [1095930.104279] nvme nvme0: controller
> is down; will reset: CSTS=0x3, PCI_STATUS=0x10
> Nov 6, 2022 @ 20:27:12.000    [1095930.451711] nvme nvme0: 64/0/0
> default/read/poll queues
> Nov 6, 2022 @ 20:27:12.000    [1095930.453846] blk_update_request: I/O
> error, dev nvme0n1, sector 34503744 op 0x1:(WRITE) flags 0x800
> phys_seg 1 prio class 0
> 
> And the system becomes completely unresponsive.
> 
> I am looking for a way to stop the system when this happens, so that
> the other nodes in our cluster can take over the work. However, since
> the system is unresponsive and the disk is presumably in read-only
> mode, we are stuck in a sort of zombie state, where the processes are
> still running but don't have access to the disk. On ext3/4 there is a
> mount option to take the system down.
> 

XFS doesn't work like that: it will either shut down the filesystem or
keep retrying the I/O, waiting for the storage to come back, in the case
of transient I/O errors. We don't keep a filesystem alive if it might be
inconsistent.

> Is there an equivalent for XFS? I didn't find anything similar in the
> XFS man page.
> 
> Also, are there any other suggestions for handling this better?

Look at "Error handling" section at kernel's Documentation:

https://docs.kernel.org/admin-guide/xfs.html

This might help. But I don't know how it translates to the distro kernel you
are using though.

Cheers.

-- 
Carlos Maiolino


* Re: Detecting disk failures on XFS
  2022-11-09  4:58 ` Detecting disk failures on XFS Alexander Hartner
  2022-11-09 11:22   ` Carlos Maiolino
@ 2022-11-14 22:45   ` Dave Chinner
  1 sibling, 0 replies; 3+ messages in thread
From: Dave Chinner @ 2022-11-14 22:45 UTC (permalink / raw)
  To: Alexander Hartner; +Cc: linux-xfs

On Wed, Nov 09, 2022 at 12:58:55PM +0800, Alexander Hartner wrote:
> We have been dealing with a problem where an NVMe drive fails every so
> often; more often than it really should. While we are trying to make
> sense of the hardware issue, we are also looking at the recovery options.
> 
> Currently we are using Ubuntu 20.04 LTS on XFS with a single NVMe
> disk. If the disk fails, the following error is reported:
> 
> Nov 6, 2022 @ 20:27:12.000    [1095930.104279] nvme nvme0: controller
> is down; will reset: CSTS=0x3, PCI_STATUS=0x10
> Nov 6, 2022 @ 20:27:12.000    [1095930.451711] nvme nvme0: 64/0/0
> default/read/poll queues
> Nov 6, 2022 @ 20:27:12.000    [1095930.453846] blk_update_request: I/O
> error, dev nvme0n1, sector 34503744 op 0x1:(WRITE) flags 0x800
> phys_seg 1 prio class 0
> 
> And the system becomes completely unresponsive.

What is the system stuck on? The output of sysrq-w would help us
understand what is happening as a result of this failed NVMe drive.
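
(With magic SysRq enabled, e.g. via "sysctl kernel.sysrq=1", you can
capture that dump on demand with:

    echo w > /proc/sysrq-trigger

and the blocked-task backtraces will appear in dmesg.)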

> I am looking for a way to stop the system when this happens, so that
> the other nodes in our cluster can take over the work. However, since
> the system is unresponsive and the disk is presumably in read-only
> mode, we are stuck in a sort of zombie state, where the processes are
> still running but don't have access to the disk. On ext3/4 there is a
> mount option to take the system down.

On XFS, there are some configurable error behaviours that
can be changed under /sys/fs/xfs/<dev>/error/metadata.

See the "Error Handling" section of the Linux kernel admin guide's XFS
page:

https://docs.kernel.org/admin-guide/xfs.html#error-handling

I'm guessing that the behaviour you are seeing is that metadata
write EIO errors default to "retry until unmount" behaviour (i.e.
retry writes forever, fail_at_unmount = true).
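
As a sketch of the kind of tuning involved (sysfs paths as documented
in the admin guide above; "nvme0n1" is just an example device name),
you could make the first metadata write EIO shut the filesystem down
immediately instead of retrying forever:

    # Fail on the first EIO metadata write error
    # (the default of -1 means retry forever).
    echo 0 > /sys/fs/xfs/nvme0n1/error/metadata/EIO/max_retries
    # Ensure pending retries don't hang a subsequent unmount.
    echo 1 > /sys/fs/xfs/nvme0n1/error/fail_at_unmount

Once the filesystem shuts down, every subsequent I/O on it fails, which
gives your cluster manager something concrete to detect.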

> Is there an equivalent for XFS ? I didn't find anything similar on the
> XFS man page.

Hmmmm.  It might be worth documenting this sysfs stuff in xfs(5),
not just the mount options supported...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

