From: NeilBrown <neilb@suse.com>
To: Theodore Ts'o <tytso@mit.edu>, Trond Myklebust <trondmy@primarydata.com>
Cc: "kwolf\@redhat.com" <kwolf@redhat.com>,
	"jlayton\@poochiereds.net" <jlayton@poochiereds.net>,
	"hch\@infradead.org" <hch@infradead.org>,
	"riel\@redhat.com" <riel@redhat.com>,
	"linux-mm\@kvack.org" <linux-mm@kvack.org>,
	"rwheeler\@redhat.com" <rwheeler@redhat.com>,
	"lsf-pc\@lists.linux-foundation.org"
	<lsf-pc@lists.linux-foundation.org>,
	"linux-fsdevel\@vger.kernel.org" <linux-fsdevel@vger.kernel.org>
Subject: Re: [Lsf-pc] [LSF/MM TOPIC] I/O error handling and fsync()
Date: Thu, 26 Jan 2017 11:36:35 +1100	[thread overview]
Message-ID: <87ziieu06k.fsf@notabene.neil.brown.name> (raw)
In-Reply-To: <20170125183542.557drncuktc5wgzy@thunk.org>

On Wed, Jan 25 2017, Theodore Ts'o wrote:

> On Tue, Jan 24, 2017 at 03:34:04AM +0000, Trond Myklebust wrote:
>> The reason why I'm thinking open() is because it has to be a contract
>> between a specific application and the kernel. If the application
>> doesn't open the file with the O_TIMEOUT flag, then it shouldn't see
>> nasty non-POSIX timeout errors, even if there is another process that
>> is using that flag on the same file.
>> 
>> The only place where that is difficult to manage is when the file is
>> mmap()ed (no file descriptor), so you'd presumably have to disallow
>> mixing mmap and O_TIMEOUT.
>
> Well, technically there *is* a file descriptor when you do an mmap.
> You can close the fd after you call mmap(), but the mmap bumps the
> refcount on the struct file while the memory map is active.
>
> I would argue though that at least for buffered writes, the timeout
> has to be a property of the underlying inode, and if there is an attempt
> to set a timeout on an inode that already has a timeout set to some
> other non-zero value, the "set timeout" operation should fail with a
> "timeout already set".  That's because we really don't want to have to
> keep track, on a per-page basis, which struct file was responsible for
> dirtying a page --- and what if it is dirtied by two different file
> descriptors?

You seem to have a very different idea to the one that is forming in my
mind.  In my vision, once the data has entered the page cache, it
doesn't matter at all where it came from.  It will remain in the page
cache, as a dirty page, until it is successfully written or until an
unrecoverable error occurs.  There are no timeouts once the data is in
the page cache.

Actually, I'm leaning away from timeouts in general.  I'm not against
them, but not entirely sure they are useful.

To be more specific, I imagine a new open flag "O_IO_NDELAY".  It is a
bit like O_NDELAY, but it explicitly affects IO, never the actual open()
call, and it is explicitly allowed on regular files and block devices.

When combined with O_DIRECT, it effectively means "no retries".  For
block devices and files backed by block devices,
REQ_FAILFAST_DEV|REQ_FAILFAST_TRANSPORT is used and a failure will be
reported as EWOULDBLOCK, unless it is obvious that retrying wouldn't
help.
Non-block-device filesystems would behave differently.  e.g. NFS would
probably use a RPC_TASK_SOFT call instead of the normal 'hard' call.
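
To make that concrete, here is a rough userspace sketch.  It assumes the
proposed flag existed: O_IO_NDELAY and its value below are invented
placeholders for illustration, not anything the kernel defines today, and
/dev/md0 is just an arbitrary example device.

  #define _GNU_SOURCE
  #include <err.h>
  #include <errno.h>
  #include <fcntl.h>
  #include <stdlib.h>
  #include <unistd.h>

  /* Placeholder only: O_IO_NDELAY is a proposal, not a real flag. */
  #ifndef O_IO_NDELAY
  #define O_IO_NDELAY 010000000000
  #endif

  int main(void)
  {
      void *buf;
      int fd;

      /* O_DIRECT needs suitably aligned buffers */
      if (posix_memalign(&buf, 4096, 4096) != 0)
          errx(1, "posix_memalign failed");

      fd = open("/dev/md0", O_RDONLY | O_DIRECT | O_IO_NDELAY);
      if (fd < 0)
          err(1, "open");

      if (pread(fd, buf, 4096, 0) < 0) {
          if (errno == EWOULDBLOCK)
              /* fail-fast: nothing was retried; the caller decides
               * whether to try again later */
              warnx("device not currently reachable");
          else
              err(1, "pread");
      }
      close(fd);
      free(buf);
      return 0;
  }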

When used without O_DIRECT:
 - read would trigger read-ahead much as it does now (which can do
   nothing if there are resource issues) and would only return data
   if it was already in the cache.
 - write would try to allocate a page, tell the filesystem that it
   is dirty so that journal space is reserved or whatever is needed,
   and would tell the dirty_pages rate-limiting that another page was
   dirty.  If the rate-limiting reported that we cannot dirty a page
   without waiting, or if any other needed resources were not available,
   then the write would fail (-EWOULDBLOCK).
 - fsync would just fail if there were any dirty pages.  It might also
   do the equivalent of sync_file_range(SYNC_FILE_RANGE_WRITE) without
   any *WAIT* flags. (Alternatively, fsync could remain unchanged, and
   sync_file_range() could gain a SYNC_FILE_RANGE_TEST flag.)
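
A minimal sketch of how an application might drive that buffered-write
behaviour, again assuming the proposed flag and the EWOULDBLOCK semantics
described above (the back-off interval is arbitrary):

  #include <errno.h>
  #include <poll.h>
  #include <unistd.h>

  /* Assumes fd was opened with the proposed O_IO_NDELAY, so write()
   * and fsync() fail with EWOULDBLOCK instead of blocking. */
  static int write_all_nowait(int fd, const char *p, size_t len)
  {
      while (len > 0) {
          ssize_t n = write(fd, p, len);

          if (n < 0) {
              if (errno == EWOULDBLOCK) {
                  poll(NULL, 0, 10);    /* crude 10ms back-off */
                  continue;
              }
              return -1;
          }
          p += n;
          len -= n;
      }

      /* Per the proposal, fsync() with dirty pages would fail with
       * EWOULDBLOCK but might also kick off writeback; poll until the
       * pages drain and it succeeds. */
      while (fsync(fd) < 0) {
          if (errno != EWOULDBLOCK)
              return -1;
          poll(NULL, 0, 10);
      }
      return 0;
  }

The point being that the application, not the kernel, decides how long it
is willing to wait and what to do when it gives up.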


With O_DIRECT there would be a delay, but it would be limited and there
would be no retry.  There is not currently any way to impose a specific
delay on REQ_FAILFAST* requests.
Without O_DIRECT, there could be no significant delay, though code might
have to wait for a mutex or similar.
There are a few places that a timeout could usefully be inserted, but
I'm not sure that would be better than just having the app try again in
a little while - it would have to be prepared for that anyway.

I would like O_DIRECT|O_IO_NDELAY for mdadm so we could safely work with
devices that block when no paths are available.

>
> That being said, I suspect that for many applications, the timeout is
> going to be *much* more interesting for O_DIRECT writes, and there we
> can certainly have different timeouts on a per-fd basis.  This is
> especially true for cases where the timeout is implemented in the storage
> device, using multi-media extensions, and where the timeout might be
> measured in milliseconds (e.g., no point reading a video frame if it's
> been delayed too long).  That being said, the block layer would need to
> know about this as well, since the timeout needs to be relative to
> when the read(2) system call is issued, not to when it is finally
> submitted to the storage device.

Yes. If a deadline could be added to "struct bio", and honoured by
drivers, then that would make a timeout much more interesting for
O_DIRECT.

Thanks,
NeilBrown


>
> And if the process has suitable privileges, perhaps the I/O scheduler
> should take the timeout into account, so that reads with a timeout
> attached are submitted promptly, on the presumption that reads w/o a
> timeout can afford to be queued.  If the process doesn't have suitable
> privileges, or if the cgroup has exceeded its I/O quota, perhaps the
> right answer would be to fail the read right away.  In the case of a
> cluster file system, if a particular server knows it can't serve a
> particular low latency read within the SLO, it might be worthwhile to
> signal to the cluster file system client that it should start doing an
> erasure code reconstruction right away (or read from one of the
> mirrors if the file is stored with n=3 replication, etc.)
>
> So depending on what the goals of userspace are, there are a number of
> different kernel policies that might be the best match for the
> particular application in question.  In particular, if you are trying
> to provide low latency reads to assure decent response time for web
> applications, it may be *reads* that are much more interesting for
> timeout purposes rather than *writes*.
>
> (Especially in a distributed system, you're going to be using some
> kind of encoding with redundancy, so as long as enough of the writes
> have completed, it doesn't matter if the other writes take a long time
> --- although if you eventually decide that the write's never going to
> make it, it's ideal if you can reshard the chunk more aggressively,
> instead of waiting for the scrubbing pass to notice that some of the
> redundant copies of the chunk had gotten corrupted or were never
> written out.)
>
> Cheers,
>
> 					- Ted


Thread overview: 53+ messages
2017-01-10 16:02 [LSF/MM TOPIC] I/O error handling and fsync() Kevin Wolf
2017-01-11  0:41 ` NeilBrown
2017-01-13 11:09   ` Kevin Wolf
2017-01-13 14:21     ` Theodore Ts'o
2017-01-13 16:00       ` Kevin Wolf
2017-01-13 22:28         ` NeilBrown
2017-01-14  6:18           ` Darrick J. Wong
2017-01-16 12:14           ` [Lsf-pc] " Jeff Layton
2017-01-22 22:44             ` NeilBrown
2017-01-22 23:31               ` Jeff Layton
2017-01-23  0:21                 ` Theodore Ts'o
2017-01-23 10:09                   ` Kevin Wolf
2017-01-23 12:10                     ` Jeff Layton
2017-01-23 12:10                       ` Jeff Layton
2017-01-23 17:25                       ` Theodore Ts'o
2017-01-23 17:53                         ` Chuck Lever
2017-01-23 22:40                         ` Jeff Layton
2017-01-23 22:40                           ` Jeff Layton
2017-01-23 22:35                     ` Jeff Layton
2017-01-23 22:35                       ` Jeff Layton
2017-01-23 23:09                       ` Trond Myklebust
2017-01-24  0:16                         ` NeilBrown
2017-01-24  0:16                           ` NeilBrown
2017-01-24  0:46                           ` Jeff Layton
2017-01-24  0:46                             ` Jeff Layton
2017-01-24 21:58                             ` NeilBrown
2017-01-24 21:58                               ` NeilBrown
2017-01-25 13:00                               ` Jeff Layton
2017-01-25 13:00                                 ` Jeff Layton
2017-01-30  5:30                                 ` NeilBrown
2017-01-30  5:30                                   ` NeilBrown
2017-01-24  3:34                           ` Trond Myklebust
2017-01-25 18:35                             ` Theodore Ts'o
2017-01-26  0:36                               ` NeilBrown [this message]
2017-01-26  0:36                                 ` NeilBrown
2017-01-26  9:25                                 ` Jan Kara
2017-01-26 22:19                                   ` NeilBrown
2017-01-26 22:19                                     ` NeilBrown
2017-01-27  3:23                                     ` Theodore Ts'o
2017-01-27  6:03                                       ` NeilBrown
2017-01-27  6:03                                         ` NeilBrown
2017-01-30 16:04                                       ` Jan Kara
2017-01-13 18:40     ` Al Viro
2017-01-13 19:06       ` Kevin Wolf
2017-01-11  5:03 ` Theodore Ts'o
2017-01-11  9:47   ` [Lsf-pc] " Jan Kara
2017-01-11 15:45     ` Theodore Ts'o
2017-01-11 10:55   ` Chris Vest
2017-01-11 11:40   ` Kevin Wolf
2017-01-13  4:51     ` NeilBrown
2017-01-13 11:51       ` Kevin Wolf
2017-01-13 21:55         ` NeilBrown
2017-01-11 12:14   ` Chris Vest
