linux-ext4.vger.kernel.org archive mirror
From: Jean-Louis Dupond <jean-louis@dupond.be>
To: "Theodore Y. Ts'o" <tytso@mit.edu>
Cc: linux-ext4@vger.kernel.org
Subject: Re: Filesystem corruption after unreachable storage
Date: Mon, 9 Mar 2020 16:33:52 +0100
Message-ID: <93e74f9f-6694-a3e9-4fac-981389522d25@dupond.be>
In-Reply-To: <20200309151838.GA4852@mit.edu>

On 9/03/2020 16:18, Theodore Y. Ts'o wrote:
> Did the panic happen immediately, or did things hang until the storage
> recovered, and *then* it rebooted.  Or did the hard reset and reboot
> happened before the storage network connection was restored?

The panic (really just a freeze, with no stack trace or automatic reboot)
happened *after* the storage came back online.
Nothing happens while the storage is offline, even if we wait until
the SCSI timeout is exceeded (180s * 6).
Only when the storage returns does the filesystem go read-only or
panic (depending on the errors= setting).
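
For scale, here is a quick sketch of how long that window is. It assumes
"180s * 6" means a 180-second per-command timeout applied across six
attempts; the variable names are illustrative, not kernel parameters.

```python
# Assumption: "180s * 6" is a 180-second per-command timeout applied
# across 6 attempts. Both names below are illustrative only.
PER_COMMAND_TIMEOUT_S = 180
ATTEMPTS = 6

total_timeout_s = PER_COMMAND_TIMEOUT_S * ATTEMPTS
total_timeout_min = total_timeout_s / 60

print(f"{total_timeout_s} s = {total_timeout_min:.0f} min")  # 1080 s = 18 min
```

So under that reading, the guest sits for roughly 18 minutes without any
visible error before the timeout is exhausted.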
>
> Fundamentally I think what's going on is that even though there is an
> I/O error reported back to the OS, but in some cases, the outstanding
> I/O actually happens.  So in the error=panic case, we do update the
> superblock saying that the file system contains inconsistencies.  And
> then we reboot.  But it appears that even though host rebooted, the
> storage area network *did* manage to send the I/O to the device.
It seems that updating the superblock to state that the filesystem
contains errors makes things worse.
By the time it does this, the storage is already accessible again, so
it seems logical that the I/O gets written.
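
The error flag in question is visible in the "Filesystem state:" line of
`tune2fs -l` output, next to the configured "Errors behavior:". Below is a
minimal sketch for pulling both fields out of that text. The helper names
and the sample text are made up for illustration; the sample is not output
captured from the affected VM.

```python
def superblock_summary(tune2fs_output):
    """Extract the state and errors-behavior fields from `tune2fs -l` text."""
    wanted = {"Filesystem state": "state", "Errors behavior": "errors_behavior"}
    summary = {}
    for line in tune2fs_output.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() in wanted:
            summary[wanted[key.strip()]] = value.strip()
    return summary


def flagged_with_errors(summary):
    """True when the superblock state records that errors were detected."""
    return "with errors" in summary.get("state", "")


# Illustrative sample text only, not real output from the affected system.
SAMPLE = """\
Filesystem state:         clean with errors
Errors behavior:          Remount read-only
"""

summary = superblock_summary(SAMPLE)
print(summary, flagged_with_errors(summary))
```

On a real system the input would be the text printed by `tune2fs -l` for
the device in question, run before and after the storage outage.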
>
> I'm not sure what we can really do here, other than simply making the
> SCSI timeout infinite.  The problem is that storage area networks are
> flaky.  Sometimes I/O's make it through, and even though we get an
> error, it's an error from the local SCSI layer --- and it's possible
> that I/O will make it through.  In other cases, even though the
> storage area network was disconnected at the time we sent the I/O
> saying the file system has problems, and then rebooted, the I/O
> actually makes it through.  Given that, assuming that if we're not
> sure, forcing an full file system check is better part of valor.
If we reset the VM before the storage is back, the filesystem check
runs fine in automatic mode.
So I think we should (in some cases) no longer try to update the
superblock on I/O errors, but just go read-only or panic,
because updating the superblock seems to make things worse.

Alternatively, could e2fsck be changed to allow automatic repair of this
kind of error?
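
To make the first suggestion concrete, here is a toy model of the ordering
being discussed. It is only a sketch: `on_io_error` and `mark_superblock`
are hypothetical names, not ext4 kernel functions or options. Only the
three mode strings correspond to the real errors= mount option values.

```python
# Toy model only. Function and parameter names are hypothetical and do
# not correspond to ext4 kernel code.
def on_io_error(errors_mode, mark_superblock):
    """Return the ordered actions taken when an I/O error is detected."""
    actions = []
    if mark_superblock:
        # Behaviour as described in this thread: flag the error first.
        actions.append("write error flag to superblock")
    if errors_mode == "panic":
        actions.append("panic")
    elif errors_mode == "remount-ro":
        actions.append("remount read-only")
    else:  # "continue"
        actions.append("continue")
    return actions


# Ordering as described in the thread versus the proposed variant.
print(on_io_error("panic", mark_superblock=True))
print(on_io_error("panic", mark_superblock=False))
```

The second call is the proposed variant: the host still panics or goes
read-only, but no late superblock write is left in flight towards the
storage.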
>
> And if it hangs forever, and we do a hard reset reboot, I don't know
> *what* to trust from the storage area network.  Ideally, there would
> be some way to do a hard reset of the storage area network so that all
> outstanding I/O's from the host that we are about to reset will get
> forgotten before we do actually the hard reset.
>
> 						- Ted
