On Fri, Jan 13 2017, Kevin Wolf wrote:

> On 13.01.2017 at 05:51, NeilBrown wrote:
>> On Wed, Jan 11 2017, Kevin Wolf wrote:
>>
>> > On 11.01.2017 at 06:03, Theodore Ts'o wrote:
>> >> A couple of thoughts.
>> >>
>> >> First of all, one of the reasons why this probably hasn't been
>> >> addressed for so long is that programs which really care about
>> >> issues like this tend to use Direct I/O and don't use the page
>> >> cache at all. And perhaps this is an option open to qemu as well?
>> >
>> > For our immediate case, yes, O_DIRECT can be enabled as an option
>> > in qemu, and it is generally recommended to do that at least for
>> > long-lived VMs. For other cases it might be nice to use the cache,
>> > e.g. for quicker startup, but those might be cases where error
>> > recovery isn't as important.
>> >
>> > I just see a much broader problem here than just for qemu.
>> > Essentially this approach would mean that every program that cares
>> > about the state it sees being safe on disk after a successful
>> > fsync() would have to use O_DIRECT. I'm not sure if that's what we
>> > want.
>>
>> This is not correct. If an application has exclusive write access to
>> a file (which is common, even if only enforced by convention) and if
>> that program checks the return value of every write() and every
>> fsync() (which, for example, stdio does, allowing ferror() to report
>> whether there have ever been errors), then it will know whether its
>> data is safe.
>>
>> If any of these writes returned an error, then there is NOTHING IT
>> CAN DO about that file. It should be considered to be toast.
>> If there is a separate filesystem it can use, then maybe there is a
>> way forward, but normally it would just report an error in whatever
>> way is appropriate.
>>
>> My position on this is primarily that if you get a single write
>> error, then you cannot trust anything any more.
>
> But why? Do you think this is inevitable and therefore it is the most
> useful approach to handle errors, or is it just more convenient
> because then you don't have to think as much about error cases?

If you get an EIO from a write or an fsync, it tells you that the
underlying storage stack has run out of options and cannot cope.
Maybe it was a media error; the drive then tried writing to a reserved
area, but got a media error there as well, or found it was full.
Maybe it was a drive mechanism error - the head won't move properly
any more.
Maybe the flash storage is too worn and won't hold data any more.
Maybe the network-attached server said "that object doesn't exist any
more".
Maybe .... all sorts of other possibilities.

What is the chance that the underlying storage mechanism has failed to
store this one block for you, but everything else is working smoothly?
I would suggest that the chance is very close to zero. Trying recovery
strategies when you have no idea what went wrong is an exercise in
futility. "EIO" doesn't carry enough information, in general, for you
to do anything other than bail out and admit failure.

(The write()/fsync() checking pattern described above, and the
O_DIRECT alternative, are sketched in code at the end of this
message.)

NeilBrown

>
> The semantics I know is that a failed write means that the contents
> of the blocks touched by a failed write request are undefined now,
> but why can't I trust anything else in the same file (we're talking
> about what is often a whole block device in the case of qemu) any
> more?
>
>> You suggested before that NFS problems can cause errors which can be
>> fixed by the sysadmin so that subsequent writes succeed. I
>> disagreed - NFS will block, not return an error.
>> Your last paragraph below indicates that you agree. So I ask again:
>> can you provide a genuine example of a case where a write might
>> result in an error, but where sysadmin involvement can allow a
>> subsequent attempt to write to succeed? I don't think you can, but
>> I'm open...
>
> I think I replied to that in the other email now, so in order to keep
> it in one place I won't repeat my answer here.
>
>> I note that ext4 has an option "errors=remount-ro". I think that
>> actually makes a lot of sense. I could easily see an argument for
>> supporting this at the file level, when it isn't enabled at the
>> filesystem level. If there is any write error, then all subsequent
>> writes should cause an error; only reads should be allowed.
>
> Obviously, that doesn't solve the recovery problems we have, but only
> makes them worse. However, I admit it would be the only reasonable
> choice if "after a single write error, you can't trust the whole
> file" is the official semantics. (Which I hope it isn't.)
>
> Kevin
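
For concreteness, the "check every write() and every fsync()" pattern
discussed above looks roughly like this in C. This is a minimal
sketch, not qemu's actual code; the file name, payload, and bail-out
behaviour are all illustrative:

    /* Sketch: treat the first write() or fsync() error as fatal for
     * the file, as described above.  All names are illustrative. */
    #include <errno.h>
    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    static int write_all(int fd, const char *buf, size_t len)
    {
        while (len > 0) {
            ssize_t n = write(fd, buf, len);
            if (n < 0) {
                if (errno == EINTR)
                    continue;   /* interrupted, not an I/O error */
                return -1;      /* first real error: the file is toast */
            }
            buf += n;
            len -= (size_t)n;
        }
        return 0;
    }

    int main(void)
    {
        static const char data[] = "important state\n";
        int fd = open("state.img", O_WRONLY | O_CREAT, 0644);

        /* fsync() is checked too: it can report deferred writeback
         * errors that the earlier write() calls did not see. */
        if (fd < 0 || write_all(fd, data, sizeof(data) - 1) < 0 ||
            fsync(fd) < 0) {
            perror("state.img");
            return 1;   /* no retry: report the failure and bail out */
        }
        close(fd);
        return 0;
    }

stdio does the same bookkeeping internally: once any underlying write
fails, ferror() keeps returning non-zero for that stream until
clearerr() is called.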
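
The O_DIRECT option mentioned at the top of the thread changes where
errors surface: data writes bypass the page cache, so a device error
is reported by the failing write() itself rather than by later
writeback or fsync(). A minimal sketch, assuming a 4096-byte alignment
requirement (the real requirement depends on the filesystem and
device, and fsync() is still needed for file metadata):

    /* Sketch of an O_DIRECT write; names and sizes are illustrative. */
    #define _GNU_SOURCE     /* for O_DIRECT on Linux */
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>

    int main(void)
    {
        const size_t align = 4096;  /* assumed buffer/length alignment */
        void *buf;
        int fd = open("disk.img", O_WRONLY | O_CREAT | O_DIRECT, 0644);

        if (fd < 0) {
            perror("disk.img");
            return 1;
        }
        if (posix_memalign(&buf, align, align) != 0) {
            fprintf(stderr, "allocation failed\n");
            return 1;
        }
        memset(buf, 0, align);

        /* With O_DIRECT the buffer address and length must both be
         * aligned; a device error shows up on this write() directly. */
        if (write(fd, buf, align) < 0 ||
            fsync(fd) < 0) {        /* still needed for metadata */
            perror("disk.img");
            return 1;
        }
        free(buf);
        close(fd);
        return 0;
    }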