On Thu, Jan 26 2017, Theodore Ts'o wrote:

> On Fri, Jan 27, 2017 at 09:19:10AM +1100, NeilBrown wrote:
>> I don't think it has.
>> The original topic was about graceful handling of recoverable IO
>> errors.  The question was framed as being about retrying fsync() if
>> it reported an error, but this was based on a misunderstanding.
>> fsync() doesn't report an error for recoverable errors.  It hangs.
>> So the original topic is really about gracefully handling IO
>> operations which currently can hang indefinitely.
>
> Well, the problem is that it is up to the device driver to decide when
> an error is recoverable or not.  This might include waiting X minutes,
> and then deciding that the fibre channel connection isn't coming back,
> and then turning it into an unrecoverable error.  Or for other
> devices, the timeout might be much smaller.
>
> Which is fine --- I think that's where the decision ought to live, and
> if users want to tune a different timeout before the driver stops
> waiting, that should be between the system administrator and the
> device driver /sys tuning knob.

Completely agree.  Whether a particular condition should be treated as
recoverable or unrecoverable is a question that driver authors and
sysadmins could reasonably provide input to.  But once that decision
has been made, the application must accept it.  EIO means
unrecoverable: there is never any point retrying.  A recoverable error
manifests as a hang, awaiting recovery.

I recently noticed that PG_error is effectively meaningless for write
errors: filemap_fdatawait_range() can clear it, and its return value is
often ignored.  AS_EIO is the really meaningful flag for write errors,
and it is per-file, not per-page.

>
>> When combined with O_DIRECT, it effectively means "no retries".  For
>> block devices and files backed by block devices,
>> REQ_FAILFAST_DEV|REQ_FAILFAST_TRANSPORT is used and a failure will be
>> reported as EWOULDBLOCK, unless it is obvious that retrying wouldn't
>> help.
>
> Absolutely no retries?  Even TCP retries in the case of iSCSI?  I
> don't think turning every TCP packet drop into EWOULDBLOCK would make
> sense under any circumstances.  What might make sense is to have a
> "short timeout" where it's up to the block device to decide what
> "short timeout" means.

The implemented semantics of REQ_FAILFAST_* are to disable retries on
certain types of failure.  That is what I meant to refer to.  There are
retries at many levels in the protocol stack, from collision-detection
retries at the data-link layer, to packet-level, connection-level and
command-level retries.  Some have predefined timeouts and should be
left alone.  Others have no timeouts and need to be disabled.  There
are probably others in the middle.  I was looking for a semantic that
could be implemented on top of current interfaces, which means working
with the REQ_FAILFAST_* semantic.
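For concreteness, here is a minimal sketch of that semantic at the bio
level.  REQ_FAILFAST_DEV and REQ_FAILFAST_TRANSPORT are the real flag
names from include/linux/blk_types.h; the function itself is a
hypothetical illustration, not code from any posted patch:

#include <linux/bio.h>

/*
 * Hypothetical helper: submit a write bio with the "failfast" hints
 * set, asking lower layers (e.g. the SCSI midlayer) to skip their
 * optional retries.  Retries with predefined timeouts further down,
 * such as TCP retransmission for iSCSI, are unaffected.
 */
static void submit_write_failfast(struct bio *bio)
{
	bio->bi_opf = REQ_OP_WRITE |
		      REQ_FAILFAST_DEV |	/* don't retry device errors */
		      REQ_FAILFAST_TRANSPORT;	/* don't retry transport errors */
	submit_bio(bio);
}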
>
> EWOULDBLOCK is also a little misleading, because even if the I/O
> request is submitted immediately to the block device and immediately
> serviced and returned, the I/O request would still be "blocking".
> Maybe ETIMEDOUT instead?

Maybe - I won't argue.

>
>> And aio_write() isn't non-blocking for O_DIRECT already because .... oh,
>> it doesn't even try.  Is there something intrinsically hard about async
>> O_DIRECT writes, or is it just that no-one has written acceptable code
>> yet?
>
> AIO/DIO writes can indeed be non-blocking, if the file system doesn't
> need to do any metadata operations.  So if the file is preallocated,
> you should be able to issue an async DIO write without losing the CPU.

Yes, I see that now.  I misread some of the code.  (A sketch of this
pattern is appended at the end of this mail.)
Thanks.

NeilBrown

>
>> A truly async O_DIRECT aio_write() combined with a working io_cancel()
>> would probably be sufficient.  The block layer doesn't provide any way
>> to cancel a bio though, so that would need to be wired up.
>
> Kent Overstreet worked up io_cancel for AIO/DIO writes when he was at
> Google.  As I recall the patchset did get posted a few times, but it
> never ended up getting accepted upstream.
>
> We even had some very rough code that would propagate the cancellation
> request to the hard drive, for those hard drives that had a facility
> for accepting a cancellation request for an I/O which was queued via
> NCQ but which hadn't executed yet.  It sort-of worked, but it never
> hit a state where it could be published before the project was
> abandoned.
>
> - Ted
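To make the preallocation point above concrete, here is a minimal
userspace sketch: fallocate() the file so the async O_DIRECT write
needs no block allocation, then submit it through the kernel AIO
interface.  The syscalls and libaio wrappers are all real; the
filename, sizes, and 4096-byte alignment are made-up assumptions
(alignment is device dependent).  The io_cancel() call is expected to
fail on mainline kernels (precisely the gap discussed above), because
nothing below fs/aio.c implements cancellation.  Build with
"gcc -Wall dio_aio.c -o dio_aio -laio".

#define _GNU_SOURCE		/* for O_DIRECT and fallocate() */
#include <libaio.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define SZ 4096			/* assumed O_DIRECT alignment */

int main(void)
{
	io_context_t ctx = 0;
	struct iocb cb, *cbs[1] = { &cb };
	struct io_event ev;
	void *buf;
	int fd, ret;

	fd = open("testfile", O_WRONLY | O_CREAT | O_DIRECT, 0644);
	if (fd < 0) { perror("open"); return 1; }

	/* Preallocate so the write itself needs no metadata update. */
	if (fallocate(fd, 0, 0, SZ) < 0) { perror("fallocate"); return 1; }

	/* O_DIRECT needs an aligned buffer. */
	if (posix_memalign(&buf, SZ, SZ)) { fprintf(stderr, "memalign failed\n"); return 1; }
	memset(buf, 'x', SZ);

	ret = io_setup(1, &ctx);
	if (ret < 0) { fprintf(stderr, "io_setup: %s\n", strerror(-ret)); return 1; }

	io_prep_pwrite(&cb, fd, buf, SZ, 0);
	ret = io_submit(ctx, 1, cbs);	/* should return without blocking */
	if (ret != 1) { fprintf(stderr, "io_submit: %s\n", strerror(-ret)); return 1; }

	/* The missing piece discussed above: cancelling in-flight DIO.
	 * Expect -EINVAL or -EAGAIN from mainline kernels. */
	ret = io_cancel(ctx, &cb, &ev);
	if (ret < 0)
		fprintf(stderr, "io_cancel: %s (expected)\n", strerror(-ret));

	/* Reap the completion normally instead. */
	if (io_getevents(ctx, 1, 1, &ev, NULL) == 1)
		printf("write done, res=%ld\n", (long)ev.res);

	io_destroy(ctx);
	close(fd);
	free(buf);
	return 0;
}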