* [Cluster-devel] [GFS2 PATCH] gfs2: Panic when an io error occurs writing
@ 2018-12-17 17:15 Mark Syms
  2018-12-17 20:20 ` Bob Peterson
  0 siblings, 1 reply; 6+ messages in thread
From: Mark Syms @ 2018-12-17 17:15 UTC (permalink / raw)
  To: cluster-devel.redhat.com

On Mon, Dec 17, 2018 at 09:58:47AM -0500, Bob Peterson wrote:
> Dave Teigland recommended. Unless I'm mistaken, Dave has said that 
> GFS2 should never withdraw; it should always just kernel panic (Dave, 
> correct me if I'm wrong). At least this patch confines that behavior 
> to a small subset of withdraws.

The basic idea is that you want to get a malfunctioning node out of the way as quickly as possible so others can recover and carry on.  Escalating a partial failure into a total node failure is the best way to do that in this case.  Specialized recovery paths run from a partially failed node won't be as reliable, and are prone to blocking all the nodes.

I think a reasonable alternative to this is to just sit in an infinite retry loop until the i/o succeeds.

Dave
[Mark Syms] I would hope that this code would only trigger after some effort has been put into retrying, as panicking the host on the first I/O failure seems like a sure-fire way to get unhappy users (and, in our case, paying customers). As Edvin points out, other filesystems may still be able to unmount cleanly and thus avoid having to check everything on restart.
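
To be concrete, the sort of thing I had in mind is a bounded retry before anything drastic happens; a rough sketch only, where submit_journal_buffer() and the retry constant are made-up names for illustration, not real GFS2 code:

/*
 * Rough sketch: retry a journal write a bounded number of times before
 * escalating.  submit_journal_buffer() and JOURNAL_IO_RETRIES are
 * hypothetical names; struct gfs2_sbd comes from fs/gfs2/incore.h.
 */
#include <linux/buffer_head.h>
#include <linux/delay.h>
#include <linux/errno.h>

#define JOURNAL_IO_RETRIES 3

static int write_journal_block_with_retries(struct gfs2_sbd *sdp,
					     struct buffer_head *bh)
{
	int tries, ret = -EIO;

	for (tries = 0; tries < JOURNAL_IO_RETRIES; tries++) {
		ret = submit_journal_buffer(sdp, bh); /* hypothetical helper */
		if (!ret)
			return 0;	/* the write finally made it out */
		msleep(100);		/* brief back-off, then try again */
	}
	/* Only once the retries are exhausted is drastic action justified. */
	return ret;
}

Even a handful of retries with a short back-off would cover a transient path glitch without masking a genuinely dead device.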





* [Cluster-devel] [GFS2 PATCH] gfs2: Panic when an io error occurs writing
  2018-12-17 17:15 [Cluster-devel] [GFS2 PATCH] gfs2: Panic when an io error occurs writing Mark Syms
@ 2018-12-17 20:20 ` Bob Peterson
  2018-12-18  9:49   ` Mark Syms
  0 siblings, 1 reply; 6+ messages in thread
From: Bob Peterson @ 2018-12-17 20:20 UTC (permalink / raw)
  To: cluster-devel.redhat.com

----- Original Message -----
> I think a reasonable alternative to this is to just sit in an infinite retry
> loop until the i/o succeeds.
> 
> Dave
> [Mark Syms] I would hope that this code would only trigger after some effort
> has been put into retrying, as panicking the host on the first I/O failure
> seems like a sure-fire way to get unhappy users (and, in our case, paying
> customers). As Edvin points out, other filesystems may still be able to
> unmount cleanly and thus avoid having to check everything on restart.

Hi Mark,

Perhaps. I'm not a block layer or iSCSI expert, but afaik, it's not the file
system's job to retry IO, and never has been, right?

There are already iscsi tuning parameters, vfs tuning parameters, etc. So
if an IO error is sent to GFS2 for a write operation, it means the retry
algorithms and operation timeout algorithms built into the layers below us
(the iscsi layer, scsi layer, block layer, tcp/ip layer etc.) have all failed
and given up on the IO operation. We can't really justify adding yet another
layer of retries on top of all that, can we?
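
To illustrate the point: by the time GFS2 hears about the failure, all it gets is a completed bio carrying an error status in its end_io callback, roughly like the sketch below. This is simplified, not the exact in-tree code, and the SDF_JOURNAL_IO_ERROR flag is only illustrative:

/*
 * Simplified sketch of a journal-write completion handler: by the time
 * this runs, the layers below have already done their own retries and
 * timeouts and have given up.
 */
#include <linux/bio.h>
#include <linux/blk_types.h>

static void journal_write_end_io(struct bio *bio)
{
	struct gfs2_sbd *sdp = bio->bi_private;

	if (bio->bi_status) {
		/* The error is final; all GFS2 can choose is how to react
		 * (withdraw, panic, or resubmit). */
		fs_err(sdp, "journal write error %d\n",
		       blk_status_to_errno(bio->bi_status));
		set_bit(SDF_JOURNAL_IO_ERROR, &sdp->sd_flags); /* illustrative flag */
	}
	bio_put(bio);
}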

I see your point, and perhaps the system should stay up to continue other
mission-critical operations that may not require the faulted hardware.
But what's a viable alternative? As Dave T. suggested, we can keep resubmitting
the IO until it completes, but then the journal never gets replayed, nobody else
can ever get those locks, and the entire cluster could hang, especially when the
failed device is the only one and the whole cluster is using it.

In GFS2, we've got a concept of marking a resource group "in error" so the
other nodes won't try to use it, but the same corruption that affects resource
groups could be extrapolated to "hot" dinodes as well. For example, suppose the
root (mount-point) dinode was in the journal. Now the whole cluster is hung
rather than just the one node.
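
Conceptually that marking is nothing more than setting an error bit and having the allocators skip the rgrp, along the lines of this sketch (the flag value and helper names are stand-ins, not necessarily what the in-tree code uses):

/*
 * Illustrative sketch of the "resource group in error" idea.  The flag
 * value and helper names are stand-ins; struct gfs2_rgrpd is GFS2's
 * in-core resource-group descriptor.
 */
#include <linux/types.h>

#define RGRP_FLAG_ERROR 0x40000000

static void mark_rgrp_in_error(struct gfs2_rgrpd *rgd)
{
	rgd->rd_flags |= RGRP_FLAG_ERROR;	/* remember it is unusable */
}

static bool rgrp_usable_for_alloc(const struct gfs2_rgrpd *rgd)
{
	/*
	 * Block allocation skips any rgrp previously marked bad, so the
	 * damage stays contained to that rgrp instead of spreading.
	 */
	return !(rgd->rd_flags & RGRP_FLAG_ERROR);
}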

Regards,

Bob Peterson
Red Hat File Systems




* [Cluster-devel] [GFS2 PATCH] gfs2: Panic when an io error occurs writing
  2018-12-17 20:20 ` Bob Peterson
@ 2018-12-18  9:49   ` Mark Syms
  2018-12-18 15:51     ` Bob Peterson
  0 siblings, 1 reply; 6+ messages in thread
From: Mark Syms @ 2018-12-18  9:49 UTC (permalink / raw)
  To: cluster-devel.redhat.com

Hi Bob,

I agree, it's a hard problem. I'm just trying to make sure that we've done the absolute best we can and that, if this condition is hit, the best solution really is just to kill the node. I guess it's also a question of how common this actually ends up being. We now have customers starting to use GFS2 for VM storage on XenServer, so I guess we'll just have to see how many support calls we get on it.

Thanks,

Mark.


* [Cluster-devel] [GFS2 PATCH] gfs2: Panic when an io error occurs writing
  2018-12-18  9:49   ` Mark Syms
@ 2018-12-18 15:51     ` Bob Peterson
  2018-12-18 16:09       ` Mark Syms
  0 siblings, 1 reply; 6+ messages in thread
From: Bob Peterson @ 2018-12-18 15:51 UTC (permalink / raw)
  To: cluster-devel.redhat.com

----- Original Message -----
> Hi Bob,
> 
> I agree, it's a hard problem. I'm just trying to make sure that we've done
> the absolute best we can and that, if this condition is hit, the best
> solution really is just to kill the node. I guess it's also a question of
> how common this actually ends up being. We now have customers starting to
> use GFS2 for VM storage on XenServer, so I guess we'll just have to see how
> many support calls we get on it.
> 
> Thanks,
> 
> Mark.

Hi Mark,

I don't expect the problem to be very common in the real world. 
The user has to get IO errors while writing to the GFS2 journal, which is
not very common. The patch is basically reacting to a phenomenon we
recently started noticing in which the HBA (qla2xxx) driver shuts down
and stops accepting requests when you do abnormal reboots (which we sometimes
do to test node recovery). In these cases, the node doesn't go down right away.
It stays up just long enough to cause IO errors with subsequent withdraws,
which, we discovered, results in file system corruption.
Normal reboots, "/sbin/reboot -fin", and "echo b > /proc/sysrq-trigger" should
not have this problem, nor should node fencing, etc.

And like I said, I'm open to suggestions on how to fix it. I wish there was a
better solution.

As it is, I'd kind of like to get something into this merge window for the
upstream kernel, but I'll need to submit the pull request for that probably
tomorrow or Thursday. If we find a better solution, we can always revert these
changes and implement a new one.

Regards,

Bob Peterson




* [Cluster-devel] [GFS2 PATCH] gfs2: Panic when an io error occurs writing
  2018-12-18 15:51     ` Bob Peterson
@ 2018-12-18 16:09       ` Mark Syms
  2018-12-19  9:16         ` Steven Whitehouse
  0 siblings, 1 reply; 6+ messages in thread
From: Mark Syms @ 2018-12-18 16:09 UTC (permalink / raw)
  To: cluster-devel.redhat.com

Thanks Bob,

We believe we have seen these issues from time to time in our automated testing, but I suspect they indicate a configuration problem with the backing storage. For flexibility, a proportion of our purely functional testing will use storage provided by a VM running a software iSCSI target, and these tests seem to be somewhat susceptible to getting I/O errors, some of which will inevitably end up in the journal. If we start to see a lot, we'll need to look at the config of the VMs first, I think.

	Mark.




* [Cluster-devel] [GFS2 PATCH] gfs2: Panic when an io error occurs writing
  2018-12-18 16:09       ` Mark Syms
@ 2018-12-19  9:16         ` Steven Whitehouse
  0 siblings, 0 replies; 6+ messages in thread
From: Steven Whitehouse @ 2018-12-19  9:16 UTC (permalink / raw)
  To: cluster-devel.redhat.com

Hi,

On 18/12/2018 16:09, Mark Syms wrote:
> Thanks Bob,
>
> We believe we have seen these issues from time to time in our automated testing, but I suspect they indicate a configuration problem with the backing storage. For flexibility, a proportion of our purely functional testing will use storage provided by a VM running a software iSCSI target, and these tests seem to be somewhat susceptible to getting I/O errors, some of which will inevitably end up in the journal. If we start to see a lot, we'll need to look at the config of the VMs first, I think.
>
> 	Mark.

I think there are a few things here... firstly Bob is right that in 
general if we are going to retry I/O, then this would be done at the 
block layer, by multipath for example. However, having a way to 
gracefully deal with failure aside from fencing/rebooting a node is useful.

One issue with that is tracking outstanding I/O. For the journal we do 
that anyway, since we count the number of in-flight I/Os. In other cases 
this is more difficult, for example where we use the VFS library 
functions for readpages/writepages. If we were able to track all the I/O 
that GFS2 produces, and be certain of being able to turn off future I/O 
(or writes at least) internally, then we could avoid using the dm-based 
solution for withdraw that we currently have. That would be an 
improvement in terms of reliability.
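
For the journal case that tracking boils down to an in-flight counter plus a wait for it to drain, something like the sketch below (the struct and helper names are illustrative, not the actual fields we use):

/*
 * Minimal sketch of in-flight I/O accounting: bump a counter when a bio
 * is submitted, drop it on completion, and let the log code wait for the
 * count to drain to zero before it must stop issuing writes.  The caller
 * initialises in_flight to 0 and drain_wait with init_waitqueue_head().
 */
#include <linux/atomic.h>
#include <linux/bio.h>
#include <linux/wait.h>

struct log_io_tracking {
	atomic_t		in_flight;
	wait_queue_head_t	drain_wait;
};

static void log_io_end(struct bio *bio)
{
	struct log_io_tracking *trk = bio->bi_private;

	if (atomic_dec_and_test(&trk->in_flight))
		wake_up(&trk->drain_wait);
	bio_put(bio);
}

static void log_io_submit(struct log_io_tracking *trk, struct bio *bio)
{
	atomic_inc(&trk->in_flight);
	bio->bi_private = trk;
	bio->bi_end_io = log_io_end;
	submit_bio(bio);
}

static void log_io_drain(struct log_io_tracking *trk)
{
	/* Block until every submitted bio has completed (or errored out). */
	wait_event(trk->drain_wait, atomic_read(&trk->in_flight) == 0);
}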

The other issue is the one that Bob has been looking at, namely a way to 
signal that recovery is due, but without requiring fencing. If we can 
solve both of those issues, then that would certainly go a long way 
towards improving this,

Steve.






