* [LSF/MM TOPIC] I/O error handling and fsync()
@ 2017-01-10 16:02 Kevin Wolf
  2017-01-11  0:41 ` NeilBrown
  2017-01-11  5:03 ` Theodore Ts'o
  0 siblings, 2 replies; 53+ messages in thread
From: Kevin Wolf @ 2017-01-10 16:02 UTC (permalink / raw)
  To: lsf-pc
  Cc: linux-fsdevel, linux-mm, Christoph Hellwig, Ric Wheeler, Rik van Riel

Hi all,

when I mentioned the I/O error handling problem especially with fsync()
we have in QEMU to Christoph Hellwig, he thought it would be a great topic
for LSF/MM, so here I am. This came up a few months ago on qemu-devel [1]
and we managed to ignore it for a while, but it's a real and potentially
serious problem, so I think I agree with Christoph that it makes sense
to get it discussed at LSF/MM.


At the heart of it is the semantics of fsync(). A few years ago, fsync()
was fixed to actually flush data to the disk, so we now have a defined
and useful meaning of fsync() as long as all your fsync() calls return
success.

However, as soon as one fsync() call fails, even if the root problem is
solved later (network connection restored, some space freed for thin
provisioned storage, etc.), the state we're in is mostly undefined. As
Ric Wheeler told me back in the qemu-devel discussion, when a writeout
fails, you get an fsync() error returned (once), but the kernel page
cache simply marks the respective page as clean and consequently won't
ever retry the writeout. Instead, it can evict it from the cache even
though it isn't actually consistent with the state on disk, which means
throwing away data that was written by some process.

So if you do another fsync() and it returns success, this doesn't
currently mean that all of the data you wrote is on disk, but if
anything, it's just about the data you wrote after the failed fsync().
This isn't very helpful, to say the least, because you called fsync() in
order to get a consistent state on disk, and you still don't have that.

Essentially this means that once you get an fsync() failure, there is no
hope of recovery for the application and it has to stop using the file.
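
To make this concrete, here is a minimal userspace sketch of the sequence
as seen from an application (illustration only; the file name is arbitrary
and the failure itself would come from the storage, not from anything in
this code):

    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
        int fd = open("data.img", O_WRONLY | O_CREAT, 0644);
        if (fd < 0) { perror("open"); return 1; }

        if (pwrite(fd, "update #1", 9, 0) != 9)
            perror("pwrite");

        if (fsync(fd) < 0) {
            /* Writeout failed; the page cache may nevertheless mark the
             * page clean, so "update #1" can be silently dropped. */
            perror("fsync #1");
        }

        /* A later fsync() can return 0 even though "update #1" never
         * reached the disk: success here only covers data written after
         * the failed fsync(), which is exactly the problem described. */
        if (fsync(fd) == 0)
            fprintf(stderr, "fsync #2 reported success\n");

        close(fd);
        return 0;
    }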


To give some context about my perspective as the maintainer for the QEMU
block subsystem: QEMU has a mode (which is usually enabled in
production) where I/O failure isn't communicated to the guest, which
would probably offline the filesystem, thinking its hard disk has died,
but instead QEMU pauses the VM and allows the administrator to resume
when the problem has been fixed. Often the problem is only temporary,
e.g. a network hiccup when a disk image is stored on NFS, so this is a
quite helpful approach.
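
For reference, this is the behaviour selected by QEMU's drive error
options, roughly like the following (the exact syntax depends on the QEMU
version and on how the drive is defined):

    qemu-system-x86_64 ... \
        -drive file=disk.img,format=raw,werror=stop,rerror=stop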

When QEMU is told to resume the VM, the request is just resubmitted.
This works fine for read/write, but not so much for fsync, because after
the first failure all bets are off even if a subsequent fsync()
succeeds.

So this is the aspect that directly affects me, even though the problem
is much broader and by far doesn't only affect QEMU.


This leads to a few individual points to be discussed:

1. Fix the data corruption problem that follows from the current
   behaviour. Imagine the following scenario:

   Process A writes to some file, calls fsync() and gets a failure. The
   data it wrote is marked clean in the page cache even though it's
   inconsistent with the disk. Process A knows that fsync() failed, so
   maybe it can deal with it, at least by no longer using the file.

   Now process B opens the same file, reads the updated data that
   process A wrote, makes some additional changes based on that and
   calls fsync() again.  Now fsync() returns success. The data written by
   B is on disk, but the data written by A isn't. Oops, this is data
   corruption, and process B doesn't even know about it because all its
   operations succeeded. (A sketch of this sequence follows the list
   below.)

2. Define fsync() semantics that include the state after a failure (this
   probably goes a long way towards fixing 1.).

   The semantics that QEMU uses internally (and which it needs to map)
   is that after a successful flush, all writes to the disk image that
   have successfully completed before the flush was issued are stable on
   disk (no matter whether a previous flush failed).

   A possible adaptation to Linux, which considers that unlike QEMU
   images, files can be opened more than once, might be that a
   succeeding fsync() on a file descriptor means that all data that has
   been read or written through this file descriptor is consistent
   between the page cache and the disk (the read part is for avoiding
   the scenario from 1.; it means that fsync flushes data written on a
   different file descriptor if it has been seen by this one; hence, the
   page cache can't contain non-dirty pages which aren't consistent with
   the disk).

3. Actually make fsync() failure recoverable.

   You can implement 2. by making sure that a file descriptor for which
   pages have been thrown away always returns an error and never goes
   back to succeeding (it can't succeed according to the definition of 2.
   because the data that would have to be written out is gone). This is
   already a much better interface, but it doesn't really solve the
   actual problem we have.

   We also need to make sure that after a failed fsync() there is a
   chance to recover. This means that the pages shouldn't be thrown away
   immediately; but at the same time, you probably also don't want to
   keep pages indefinitely when there is a permanent writeout error.
   However, if we can make sure that these pages are only evicted in
   case of actual memory pressure, and only if there are no actually
   clean pages to evict, I think a lot would already be won.

   In the common case, you could then recover from a temporary failure,
   but if this state isn't maintainable, at least we get consistent
   fsync() failure telling us that the data is gone.
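
As mentioned in point 1, here is a schematic of the corrupting
interleaving (illustration only: in reality A and B are separate
processes, the file name is arbitrary, and the fsync() failure comes from
the storage rather than from anything visible in this code):

    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
        /* Process A: writes, fsync() fails, yet the page is marked clean. */
        int fda = open("shared.db", O_RDWR | O_CREAT, 0644);
        if (fda < 0) { perror("open A"); return 1; }
        pwrite(fda, "A's update", 10, 0);
        if (fsync(fda) < 0)
            perror("fsync A");     /* A at least learns about the failure */
        close(fda);

        /* Process B: reads A's never-persisted update from the page cache,
         * builds on it, writes, and its fsync() succeeds.  B has no way of
         * knowing that A's data never reached the disk. */
        int fdb = open("shared.db", O_RDWR);
        if (fdb < 0) { perror("open B"); return 1; }
        char buf[16];
        pread(fdb, buf, 10, 0);
        pwrite(fdb, "B's change", 10, 10);
        if (fsync(fdb) == 0)
            fprintf(stderr, "B: success, but A's data may be gone\n");
        close(fdb);
        return 0;
    }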


I think I've summarised most aspects here, but if something is unclear
or you'd like to see some more context, please refer to the qemu-devel
discussion [1] that I mentioned, or feel free to just ask.

Thanks,
Kevin

[1] https://lists.gnu.org/archive/html/qemu-block/2016-04/msg00576.html


^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [LSF/MM TOPIC] I/O error handling and fsync()
  2017-01-10 16:02 [LSF/MM TOPIC] I/O error handling and fsync() Kevin Wolf
@ 2017-01-11  0:41 ` NeilBrown
  2017-01-13 11:09   ` Kevin Wolf
  2017-01-11  5:03 ` Theodore Ts'o
  1 sibling, 1 reply; 53+ messages in thread
From: NeilBrown @ 2017-01-11  0:41 UTC (permalink / raw)
  To: Kevin Wolf, lsf-pc
  Cc: linux-fsdevel, linux-mm, Christoph Hellwig, Ric Wheeler, Rik van Riel

[-- Attachment #1: Type: text/plain, Size: 8460 bytes --]

On Wed, Jan 11 2017, Kevin Wolf wrote:

> Hi all,
>
> when I mentioned the I/O error handling problem especially with fsync()
> we have in QEMU to Christoph Hellwig, he thought it would be a great topic
> for LSF/MM, so here I am. This came up a few months ago on qemu-devel [1]
> and we managed to ignore it for a while, but it's a real and potentially
> serious problem, so I think I agree with Christoph that it makes sense
> to get it discussed at LSF/MM.
>
>
> At the heart of it is the semantics of fsync(). A few years ago, fsync()
> was fixed to actually flush data to the disk, so we now have a defined
> and useful meaning of fsync() as long as all your fsync() calls return
> success.
>
> However, as soon as one fsync() call fails, even if the root problem is
> solved later (network connection restored, some space freed for thin
> provisioned storage, etc.), the state we're in is mostly undefined. As
> Ric Wheeler told me back in the qemu-devel discussion, when a writeout
> fails, you get an fsync() error returned (once), but the kernel page
> cache simply marks the respective page as clean and consequently won't
> ever retry the writeout. Instead, it can evict it from the cache even
> though it isn't actually consistent with the state on disk, which means
> throwing away data that was written by some process.
>
> So if you do another fsync() and it returns success, this doesn't
> currently mean that all of the data you wrote is on disk, but if
> anything, it's just about the data you wrote after the failed fsync().
> This isn't very helpful, to say the least, because you called fsync() in
> order to get a consistent state on disk, and you still don't have that.
>
> Essentially this means that once you get an fsync() failure, there is no
> hope of recovery for the application and it has to stop using the file.

This is not strictly correct.  The application could repeat all the
recent writes.  It might fsync after each write so it can find out
exactly where the problem is.
So it could take a lot of work to recover, but it is not intrinsically
impossible.
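
A minimal sketch of that kind of recovery, assuming the caller keeps its
own copy of each record until fsync() has confirmed it (function name and
retry count are arbitrary):

    #define _POSIX_C_SOURCE 200809L
    #include <unistd.h>

    /* Write one record and confirm it with fsync(), retrying from the
     * caller's copy if either step fails.  This only works because the
     * caller still holds the data, which is the expensive part. */
    int write_record(int fd, const void *buf, size_t len, off_t off)
    {
        for (int attempt = 0; attempt < 3; attempt++) {
            if (pwrite(fd, buf, len, off) != (ssize_t)len)
                continue;          /* repeat the write itself            */
            if (fsync(fd) == 0)
                return 0;          /* record confirmed on stable storage */
            /* fsync() failed: the page may already have been dropped, so
             * the only safe recovery is to rewrite the data we still hold. */
        }
        return -1;
    }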


>
>
> To give some context about my perspective as the maintainer for the QEMU
> block subsystem: QEMU has a mode (which is usually enabled in
> production) where I/O failure isn't communicated to the guest, which
> would probably offline the filesystem, thinking its hard disk has died,
> but instead QEMU pauses the VM and allows the administrator to resume
> when the problem has been fixed. Often the problem is only temporary,
> e.g. a network hiccup when a disk image is stored on NFS, so this is a
> quite helpful approach.

If the disk image is stored over NFS, the write should hang, not cause
an error. (Of course if you mount with '-o soft' you will get an error,
but if you mount with '-o soft', then "you get to keep both halves").

Is there a more realistic situation where you might get a write error
that might succeed if the write is repeated?

>
> When QEMU is told to resume the VM, the request is just resubmitted.
> This works fine for read/write, but not so much for fsync, because after
> the first failure all bets are off even if a subsequent fsync()
> succeeds.
>
> So this is the aspect that directly affects me, even though the problem
> is much broader and by far doesn't only affect QEMU.
>
>
> This leads to a few individual points to be discussed:
>
> 1. Fix the data corruption problem that follows from the current
>    behaviour. Imagine the following scenario:
>
>    Process A writes to some file, calls fsync() and gets a failure. The
>    data it wrote is marked clean in the page cache even though it's
>    inconsistent with the disk. Process A knows that fsync() failed, so
>    maybe it can deal with it, at least by no longer using the file.
>
>    Now process B opens the same file, reads the updated data that
>    process A wrote, makes some additional changes based on that and
>    calls fsync() again.  Now fsync() returns success. The data written by
>    B is on disk, but the data written by A isn't. Oops, this is data
>    corruption, and process B doesn't even know about it because all its
>    operations succeeded.

Can that really happen? I would expect the filesystem to call
SetPageError() if there was a write error, then I would expect a read to
report an error for that page if it were still in cache (or maybe flush
it out).  I admit that I haven't traced through the code in detail, but
I did find some examples for SetPageError after a write error.

>
> 2. Define fsync() semantics that include the state after a failure (this
>    probably goes a long way towards fixing 1.).
>
>    The semantics that QEMU uses internally (and which it needs to map)
>    is that after a successful flush, all writes to the disk image that
>    have successfully completed before the flush was issued are stable on
>    disk (no matter whether a previous flush failed).
>
>    A possible adaptation to Linux, which considers that unlike QEMU
>    images, files can be opened more than once, might be that a
>    succeeding fsync() on a file descriptor means that all data that has
>    been read or written through this file descriptor is consistent
>    between the page cache and the disk (the read part is for avoiding
>    the scenario from 1.; it means that fsync flushes data written on a
>    different file descriptor if it has been seen by this one; hence, the
>    page cache can't contain non-dirty pages which aren't consistent with
>    the disk).

I think it would be useful to try to describe the behaviour of page
flags, particularly PG_error PG_uptodate PG_dirty in the different
scenarios.

For example, a successful read sets PG_uptodate and a successful write
clears PG_dirty.
A failed read doesn't set PG_uptodate, and maybe sets PG_error.
A failed write probably shouldn't clear PG_dirty but should set PG_error.

If background-write finds a PG_dirty|PG_error page, should it try to
write it out again?  Or should only a foreground (fsync) write?

If we did this, PG_error|PG_dirty pages would be pinned in memory until
a write was successful.  We would need a way to purge these pages
without writing them.  We would also need a way to ensure they didn't
consume a large fraction of memory.

It isn't clear to me that the behaviour can be different for different
file descriptors.  Once the data has been written to the page cache, it
belongs to the file, not to any particular fd.  So enabling
"keep-data-after-write-error" would need to be per-file rather than
per-fd, and would probably need to be a privileged operation due to the
memory consumption concerns.

>
> 3. Actually make fsync() failure recoverable.
>
>    You can implement 2. by making sure that a file descriptor for which
>    pages have been thrown away always returns an error and never goes
>    back to succeeding (it can't succeed according to the definition of 2.
>    because the data that would have to be written out is gone). This is
>    already a much better interface, but it doesn't really solve the
>    actual problem we have.
>
>    We also need to make sure that after a failed fsync() there is a
>    chance to recover. This means that the pages shouldn't be thrown away
>    immediately; but at the same time, you probably also don't want to
>    keep pages indefinitely when there is a permanent writeout error.
>    However, if we can make sure that these pages are only evicted in
>    case of actual memory pressure, and only if there are no actually
>    clean pages to evict, I think a lot would already be won.

I think this would make behaviour unpredictable, being dependent on how
much memory pressure there is.  Predictability is nice!

>
>    In the common case, you could then recover from a temporary failure,
>    but if this state isn't maintainable, at least we get consistent
>    fsync() failure telling us that the data is gone.
>
>
> I think I've summarised most aspects here, but if something is unclear
> or you'd like to see some more context, please refer to the qemu-devel
> discussion [1] that I mentioned, or feel free to just ask.

Definitely an interesting question!

Thanks,
NeilBrown

>
> Thanks,
> Kevin
>
> [1] https://lists.gnu.org/archive/html/qemu-block/2016-04/msg00576.html

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 832 bytes --]

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [LSF/MM TOPIC] I/O error handling and fsync()
  2017-01-10 16:02 [LSF/MM TOPIC] I/O error handling and fsync() Kevin Wolf
  2017-01-11  0:41 ` NeilBrown
@ 2017-01-11  5:03 ` Theodore Ts'o
  2017-01-11  9:47   ` [Lsf-pc] " Jan Kara
                     ` (3 more replies)
  1 sibling, 4 replies; 53+ messages in thread
From: Theodore Ts'o @ 2017-01-11  5:03 UTC (permalink / raw)
  To: Kevin Wolf
  Cc: lsf-pc, linux-fsdevel, linux-mm, Christoph Hellwig, Ric Wheeler,
	Rik van Riel

A couple of thoughts.

First of all, one of the reasons why this probably hasn't been
addressed for so long is because programs who really care about issues
like this tend to use Direct I/O, and don't use the page cache at all.
And perhaps this is an option open to qemu as well?
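
For comparison, a minimal sketch of the O_DIRECT route (the 4096-byte
alignment and the file name are assumptions; the real alignment
requirement depends on the device's logical block size):

    #define _GNU_SOURCE            /* O_DIRECT */
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>

    int main(void)
    {
        int fd = open("disk.img", O_WRONLY | O_CREAT | O_DIRECT, 0644);
        if (fd < 0) { perror("open"); return 1; }

        /* O_DIRECT requires aligned buffers, offsets and lengths. */
        void *buf;
        if (posix_memalign(&buf, 4096, 4096)) { close(fd); return 1; }
        memset(buf, 0, 4096);

        /* The write bypasses the page cache, so an I/O error is reported
         * here rather than surfacing later through a failed fsync(). */
        if (pwrite(fd, buf, 4096, 0) != 4096)
            perror("pwrite");

        /* A flush is still needed for metadata and the device write cache. */
        if (fdatasync(fd) < 0)
            perror("fdatasync");

        free(buf);
        close(fd);
        return 0;
    }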

Secondly, one of the reasons why we mark the page clean is because we
didn't want a failing disk to cause memory to be trapped with no way of
releasing the pages.  For example, if a user plugs in a USB
thumbstick, writes to it, and then rudely yanks it out before all of
the pages have been written back, it would be unfortunate if the dirty
pages can only be released by rebooting the system.

So an approach that might work is fsync() will keep the pages dirty
--- but only while the file descriptor is open.  This could either be
the default behavior, or something that has to be specifically
requested via fcntl(2).  That way, as soon as the process exits (at
which point it will be too late for it to do anything to save the
contents of the file) we also release the memory.  And if the process
gets OOM killed, again, the right thing happens.  But if the process
wants to take emergency measures to write the file somewhere else, it
knows that the pages won't get lost until the file gets closed.

(BTW, a process could guarantee this today without any kernel changes
by mmap'ing the whole file and mlock'ing the pages that it had
modified.  That way, even if there is an I/O error and the fsync
causes the pages to be marked clean, the pages wouldn't go away.
However, this is really a hack, and it would probably be easier for
the process to use Direct I/O instead.  :-)
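
Roughly, that hack could look like this (a sketch assuming the file
already exists and is at least one page long; sizes and names are
arbitrary):

    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <unistd.h>

    int main(void)
    {
        int fd = open("data.img", O_RDWR);
        if (fd < 0) { perror("open"); return 1; }

        size_t len = 4096;    /* assume the file is at least one page */
        char *map = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        if (map == MAP_FAILED) { perror("mmap"); return 1; }

        /* Modify a page, then pin it so it stays resident even if a
         * failed writeback marks it clean. */
        memcpy(map, "update", 6);
        if (mlock(map, len) < 0)
            perror("mlock");

        if (msync(map, len, MS_SYNC) < 0)
            perror("msync");  /* error reported, but our copy survives */

        /* ... later: retry msync(), or copy the data somewhere safer ... */
        munlock(map, len);
        munmap(map, len);
        close(fd);
        return 0;
    }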


Finally, if the kernel knows that an error might be one that could be
resolved by the simple expedient of waiting (for example, if a fibre
channel cable is temporarily unplugged so it can be rerouted, but the
user might plug it back in a minute or two later, or a dm-thin device
is full, but the system administrator might do something to fix it),
in the ideal world, the kernel should deal with it without requiring
any magic from userspace applications.  There might be a helper system
daemon that enacts policy (we've paged the sysadmin, so it's OK to
keep the page dirty and retry the writebacks to the dm-thin volume
after the helper daemon gives the all-clear), but we shouldn't require
all user space applications to have magic, Linux-specific retry code.

Cheers,

					- Ted


^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [Lsf-pc] [LSF/MM TOPIC] I/O error handling and fsync()
  2017-01-11  5:03 ` Theodore Ts'o
@ 2017-01-11  9:47   ` Jan Kara
  2017-01-11 15:45     ` Theodore Ts'o
  2017-01-11 10:55   ` Chris Vest
                     ` (2 subsequent siblings)
  3 siblings, 1 reply; 53+ messages in thread
From: Jan Kara @ 2017-01-11  9:47 UTC (permalink / raw)
  To: Theodore Ts'o
  Cc: Kevin Wolf, Rik van Riel, Christoph Hellwig, linux-mm,
	linux-fsdevel, lsf-pc, Ric Wheeler

On Wed 11-01-17 00:03:56, Ted Tso wrote:
> A couple of thoughts.
> 
> First of all, one of the reasons why this probably hasn't been
> addressed for so long is because programs who really care about issues
> like this tend to use Direct I/O, and don't use the page cache at all.
> And perhaps this is an option open to qemu as well?
> 
> Secondly, one of the reasons why we mark the page clean is because we
> didn't want a failing disk to cause memory to be trapped with no way of
> releasing the pages.  For example, if a user plugs in a USB
> thumbstick, writes to it, and then rudely yanks it out before all of
> the pages have been written back, it would be unfortunate if the dirty
> pages can only be released by rebooting the system.
> 
> So an approach that might work is fsync() will keep the pages dirty
> --- but only while the file descriptor is open.  This could either be
> the default behavior, or something that has to be specifically
> requested via fcntl(2).  That way, as soon as the process exits (at
> which point it will be too late for it to do anything to save the
> contents of the file) we also release the memory.  And if the process
> gets OOM killed, again, the right thing happens.  But if the process
> wants to take emergency measures to write the file somewhere else, it
> knows that the pages won't get lost until the file gets closed.

Well, as Neil pointed out, the problem is that once the data hits page
cache, we lose the association with a file descriptor. So for example
background writeback or sync(2) can find the dirty data and try to write
it, get EIO, and then you have to do something about it because you don't
know whether fsync(2) is coming or not.

That being said, if we just kept the pages whose writeout failed dirty,
the system would eventually block all writers in balance_dirty_pages() and
at that point it is IMO a policy decision (probably per device or per fs)
whether you should just keep things blocked waiting for better times or
whether you just want to start discarding dirty data on a failed write.
Now discarding data that failed to write only when we are close to the
dirty limit (or after some timeout or whatever) has the disadvantage that
it is not easy to predict from the user's POV, so I'm not sure we want to
go down that path. But I can see two options making sense:

1) Just hold onto data and wait indefinitely. Possibly provide a way for
   userspace to forcibly unmount a filesystem in such a state.

2) Do what we do now.
 
								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR


^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [LSF/MM TOPIC] I/O error handling and fsync()
  2017-01-11  5:03 ` Theodore Ts'o
  2017-01-11  9:47   ` [Lsf-pc] " Jan Kara
@ 2017-01-11 10:55   ` Chris Vest
  2017-01-11 11:40   ` Kevin Wolf
  2017-01-11 12:14   ` Chris Vest
  3 siblings, 0 replies; 53+ messages in thread
From: Chris Vest @ 2017-01-11 10:55 UTC (permalink / raw)
  To: Theodore Ts'o
  Cc: Kevin Wolf, lsf-pc, linux-fsdevel, linux-mm, Christoph Hellwig,
	Ric Wheeler, Rik van Riel

[-- Attachment #1: Type: text/plain, Size: 1227 bytes --]


> On 11 Jan 2017, at 06.03, Theodore Ts'o <tytso@mit.edu> wrote:
> 
> So an approach that might work is fsync() will keep the pages dirty
> --- but only while the file descriptor is open.  This could either be
> the default behavior, or something that has to be specifically
> requested via fcntl(2).  That way, as soon as the process exits (at
> which point it will be too late for it to do anything to save the
> contents of the file) we also release the memory.  And if the process
> gets OOM killed, again, the right thing happens.  But if the process
> wants to take emergency measures to write the file somewhere else, it
> knows that the pages won't get lost until the file gets closed.


I think this sounds like a very reasonable default. Before reading this thread, it would have been my first guess as to how this worked. It gives the program the opportunity to retry the fsyncs before aborting. It will also allow a database, for instance, to keep servicing reads until the issue resolves itself or an administrator intervenes. A program cannot allow reads from the file if pages that have been written to can be evicted, their changes lost, and then brought back with old data.

--
Chris Vest

[-- Attachment #2: Type: text/html, Size: 7561 bytes --]

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [LSF/MM TOPIC] I/O error handling and fsync()
  2017-01-11  5:03 ` Theodore Ts'o
  2017-01-11  9:47   ` [Lsf-pc] " Jan Kara
  2017-01-11 10:55   ` Chris Vest
@ 2017-01-11 11:40   ` Kevin Wolf
  2017-01-13  4:51     ` NeilBrown
  2017-01-11 12:14   ` Chris Vest
  3 siblings, 1 reply; 53+ messages in thread
From: Kevin Wolf @ 2017-01-11 11:40 UTC (permalink / raw)
  To: Theodore Ts'o
  Cc: lsf-pc, linux-fsdevel, linux-mm, Christoph Hellwig, Ric Wheeler,
	Rik van Riel

Am 11.01.2017 um 06:03 hat Theodore Ts'o geschrieben:
> A couple of thoughts.
> 
> First of all, one of the reasons why this probably hasn't been
> addressed for so long is because programs who really care about issues
> like this tend to use Direct I/O, and don't use the page cache at all.
> And perhaps this is an option open to qemu as well?

For our immediate case, yes, O_DIRECT can be enabled as an option in
qemu, and it is generally recommended to do that at least for long-lived
VMs. For other cases it might be nice to use the cache e.g. for quicker
startup, but those might be cases where error recovery isn't as
important.

I just see a much broader problem here than just for qemu. Essentially
this approach would mean that every program that cares about the state
it sees being safe on disk after a successful fsync() would have to use
O_DIRECT. I'm not sure if that's what we want.

> Secondly, one of the reasons why we mark the page clean is because we
> didn't want a failing disk to cause memory to be trapped with no way of
> releasing the pages.  For example, if a user plugs in a USB
> thumbstick, writes to it, and then rudely yanks it out before all of
> the pages have been written back, it would be unfortunate if the dirty
> pages can only be released by rebooting the system.

Yes, I understand that and permanent failure is definitely a case to
consider while making any changes. That's why I suggested to still allow
releasing such pages, but at a lower priority than actually clean pages.
And of course, after losing data, an fsync() may never succeed again on
a file descriptor that was open when the data was thrown away.

> So an approach that might work is fsync() will keep the pages dirty
> --- but only while the file descriptor is open.  This could either be
> the default behavior, or something that has to be specifically
> requested via fcntl(2).  That way, as soon as the process exits (at
> which point it will be too late for it to do anything to save the
> contents of the file) we also release the memory.  And if the process
> gets OOM killed, again, the right thing happens.  But if the process
> wants to take emergency measures to write the file somewhere else, it
> knows that the pages won't get lost until the file gets closed.

This sounds more or less like what I had in mind, so I agree.

The fcntl() flag is an interesting thought, too, but would there be
any situation where the userspace would have an advantage from not
requesting the flag?

> (BTW, a process could guarantee this today without any kernel changes
> by mmap'ing the whole file and mlock'ing the pages that it had
> modified.  That way, even if there is an I/O error and the fsync
> causes the pages to be marked clean, the pages wouldn't go away.
> However, this is really a hack, and it would probably be easier for
> the process to use Direct I/O instead.  :-)

That, and even if the pages would still be in memory, as I understand it,
the writeout would never be retried because they are still marked clean.
So it wouldn't be usable for a temporary failure, but only for reading
the data back from the cache into a different file.

> Finally, if the kernel knows that an error might be one that could be
> resolved by the simple expedient of waiting (for example, if a fibre
> channel cable is temporarily unplugged so it can be rerouted, but the
> user might plug it back in a minute or two later, or a dm-thin device
> is full, but the system administrator might do something to fix it),
> in the ideal world, the kernel should deal with it without requiring
> any magic from userspace applications.  There might be a helper system
> daemon that enacts policy (we've paged the sysadmin, so it's OK to
> keep the page dirty and retry the writebacks to the dm-thin volume
> after the helper daemon gives the all-clear), but we shouldn't require
> all user space applications to have magic, Linux-specific retry code.

Yes and no. I agree that the kernel should mostly make things just work.
We're talking about a relatively obscure error case here, so if
userspace applications have to do something extraordinary, chances are
they won't be doing it.

On the other hand, indefinitely blocking on fsync() isn't really what we
want either, so while the kernel should keep trying to get the data
written in the background, a failing fsync() would be okay, as long as a
succeeding fsync() afterwards means that we're fully consistent again.

In qemu, indefinitely blocking read/write syscalls are already a problem
(on NFS), because instead of getting an error and then stopping the VM,
the request hangs so long that the guest kernel sees a timeout and
offlines the disk anyway. But that's a separate problem...

Kevin


^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [LSF/MM TOPIC] I/O error handling and fsync()
  2017-01-11  5:03 ` Theodore Ts'o
                     ` (2 preceding siblings ...)
  2017-01-11 11:40   ` Kevin Wolf
@ 2017-01-11 12:14   ` Chris Vest
  3 siblings, 0 replies; 53+ messages in thread
From: Chris Vest @ 2017-01-11 12:14 UTC (permalink / raw)
  To: Theodore Ts'o
  Cc: Kevin Wolf, lsf-pc, linux-fsdevel, linux-mm, Christoph Hellwig,
	Ric Wheeler, Rik van Riel


> On 11 Jan 2017, at 06.03, Theodore Ts'o <tytso@mit.edu> wrote:
> 
> So an approach that might work is fsync() will keep the pages dirty
> --- but only while the file descriptor is open.  This could either be
> the default behavior, or something that has to be specifically
> requested via fcntl(2).  That way, as soon as the process exits (at
> which point it will be too late for it to do anything to save the
> contents of the file) we also release the memory.  And if the process
> gets OOM killed, again, the right thing happens.  But if the process
> wants to take emergency measures to write the file somewhere else, it
> knows that the pages won't get lost until the file gets closed.

I think this sounds like a very reasonable default. Before reading this thread, it would have been my first guess as to how this worked. It gives the program the opportunity to retry the fsyncs before aborting. It will also allow a database, for instance, to keep servicing reads until the issue resolves itself or an administrator intervenes. A program cannot allow reads from the file if pages that have been written to can be evicted, their changes lost, and then brought back with old data.

--
Chris Vest

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [Lsf-pc] [LSF/MM TOPIC] I/O error handling and fsync()
  2017-01-11  9:47   ` [Lsf-pc] " Jan Kara
@ 2017-01-11 15:45     ` Theodore Ts'o
  0 siblings, 0 replies; 53+ messages in thread
From: Theodore Ts'o @ 2017-01-11 15:45 UTC (permalink / raw)
  To: Jan Kara
  Cc: Kevin Wolf, Rik van Riel, Christoph Hellwig, linux-mm,
	linux-fsdevel, lsf-pc, Ric Wheeler

On Wed, Jan 11, 2017 at 10:47:29AM +0100, Jan Kara wrote:
> Well, as Neil pointed out, the problem is that once the data hits page
> cache, we lose the association with a file descriptor. So for example
> background writeback or sync(2) can find the dirty data and try to write
> it, get EIO, and then you have to do something about it because you don't
> know whether fsync(2) is coming or not.

We could solve that by being able to track the number of open file
descriptors in struct inode.  We have i_writecount and i_readcount (if
CONFIG_IMA is defined).  So we can *almost* do this already.  If we
always tracked i_readcount, then we would have the count of all struct
files opened that are writeable or read-only.  So we *can* know
whether or not the page is backed by an inode that has an open file
descriptor.

So the heuristic I'm suggesting is "if i_writecount + i_readcount is
non-zero, then keep the pages".  The pages would be marked with the
error flag, so fsync() can harvest the fact that there was an error,
but afterwards, the pages would be left marked dirty.  After the last
file descriptor is closed, on the next attempt to write back those
pages, if the I/O error is still occurring, we can make the pages go
away.
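
In kernel terms the check might look roughly like the sketch below
(kernel-context pseudo-code, not compilable on its own, and the helper
name is made up; mainline only keeps i_readcount under CONFIG_IMA, as
noted above):

    /* Sketch of the heuristic: treat the inode as "still open somewhere"
     * while any struct file references it, and only then keep pages that
     * failed writeback dirty instead of discarding them. */
    static bool inode_has_open_files(struct inode *inode)
    {
            return atomic_read(&inode->i_writecount) +
                   atomic_read(&inode->i_readcount) > 0;
    }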

					- Ted


^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [LSF/MM TOPIC] I/O error handling and fsync()
  2017-01-11 11:40   ` Kevin Wolf
@ 2017-01-13  4:51     ` NeilBrown
  2017-01-13 11:51       ` Kevin Wolf
  0 siblings, 1 reply; 53+ messages in thread
From: NeilBrown @ 2017-01-13  4:51 UTC (permalink / raw)
  To: Kevin Wolf, Theodore Ts'o
  Cc: lsf-pc, linux-fsdevel, linux-mm, Christoph Hellwig, Ric Wheeler,
	Rik van Riel

[-- Attachment #1: Type: text/plain, Size: 6761 bytes --]

On Wed, Jan 11 2017, Kevin Wolf wrote:

> Am 11.01.2017 um 06:03 hat Theodore Ts'o geschrieben:
>> A couple of thoughts.
>> 
>> First of all, one of the reasons why this probably hasn't been
>> addressed for so long is because programs who really care about issues
>> like this tend to use Direct I/O, and don't use the page cache at all.
>> And perhaps this is an option open to qemu as well?
>
> For our immediate case, yes, O_DIRECT can be enabled as an option in
> qemu, and it is generally recommended to do that at least for long-lived
> VMs. For other cases it might be nice to use the cache e.g. for quicker
> startup, but those might be cases where error recovery isn't as
> important.
>
> I just see a much broader problem here than just for qemu. Essentially
> this approach would mean that every program that cares about the state
> it sees being safe on disk after a successful fsync() would have to use
> O_DIRECT. I'm not sure if that's what we want.

This is not correct.  If an application has exclusive write access to a
file (which is common, even if only enforced by convention) and if that
program checks the return of every write() and every fsync() (which, for
example, stdio does, allowing ferror() to report if there have ever been
errors), then it will know if its data is safe.

If any of these writes returned an error, then there is NOTHING IT CAN
DO about that file.  It should be considered to be toast.
If there is a separate filesystem it can use, then maybe there is a way
forward, but normally it would just report an error in whatever way is
appropriate.
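
As a concrete reading, a minimal sketch of that checking discipline,
using stdio's error latch plus an explicit fsync() (the function name is
arbitrary):

    #define _POSIX_C_SOURCE 200809L
    #include <stdio.h>
    #include <unistd.h>

    /* Returns 0 only if every buffered write reached the kernel and
     * fsync() confirmed it on stable storage.  Any earlier write error
     * is latched by stdio and reported here via ferror(). */
    int save_exclusive(FILE *fp, const void *buf, size_t len)
    {
        fwrite(buf, 1, len, fp);
        if (fflush(fp) != 0 || ferror(fp))
            return -1;            /* the file must be considered toast */
        if (fsync(fileno(fp)) != 0)
            return -1;
        return 0;
    }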

My position on this is primarily that if you get a single write error,
then you cannot trust anything any more.
You suggested before that NFS problems can cause errors which can be
fixed by the sysadmin so subsequent writes succeed.  I disagreed - NFS
will block, not return an error.  Your last paragraph below indicates
that you agree.  So I ask again: can you provide a genuine example of a
case where a write might result in an error, but where sysadmin
involvement can allow a subsequent attempt to write to succeed?   I
don't think you can, but I'm open...

I note that ext4 has an option "errors=remount-ro".  I think that
actually makes a lot of sense.  I could easily see an argument for
supporting this at the file level, when it isn't enabled at the
filesystem level. If there is any write error, then all subsequent
writes should cause an error, only reads should be allowed.

Thanks,
NeilBrown


>
>> Secondly, one of the reasons why we mark the page clean is because we
>> didn't want a failing disk to cause memory to be trapped with no way of
>> releasing the pages.  For example, if a user plugs in a USB
>> thumbstick, writes to it, and then rudely yanks it out before all of
>> the pages have been written back, it would be unfortunate if the dirty
>> pages can only be released by rebooting the system.
>
> Yes, I understand that and permanent failure is definitely a case to
> consider while making any changes. That's why I suggested to still allow
> releasing such pages, but at a lower priority than actually clean pages.
> And of course, after losing data, an fsync() may never succeed again on
> a file descriptor that was open when the data was thrown away.
>
>> So an approach that might work is fsync() will keep the pages dirty
>> --- but only while the file descriptor is open.  This could either be
>> the default behavior, or something that has to be specifically
>> requested via fcntl(2).  That way, as soon as the process exits (at
>> which point it will be too late for it to do anything to save the
>> contents of the file) we also release the memory.  And if the process
>> gets OOM killed, again, the right thing happens.  But if the process
>> wants to take emergency measures to write the file somewhere else, it
>> knows that the pages won't get lost until the file gets closed.
>
> This sounds more or less like what I had in mind, so I agree.
>
> The fcntl() flag is an interesting thought, too, but would there be
> any situation where the userspace would have an advantage from not
> requesting the flag?
>
>> (BTW, a process could guarantee this today without any kernel changes
>> by mmap'ing the whole file and mlock'ing the pages that it had
>> modified.  That way, even if there is an I/O error and the fsync
>> causes the pages to be marked clean, the pages wouldn't go away.
>> However, this is really a hack, and it would probably be easier for
>> the process to use Direct I/O instead.  :-)
>
> That, and even if the pages would still be in memory, as I understand it,
> the writeout would never be retried because they are still marked clean.
> So it wouldn't be usable for a temporary failure, but only for reading
> the data back from the cache into a different file.
>
>> Finally, if the kernel knows that an error might be one that could be
>> resolved by the simple expedient of waiting (for example, if a fibre
>> channel cable is temporarily unplugged so it can be rerouted, but the
>> user might plug it back in a minute or two later, or a dm-thin device
>> is full, but the system administrator might do something to fix it),
>> in the ideal world, the kernel should deal with it without requiring
>> any magic from userspace applications.  There might be a helper system
>> daemon that enacts policy (we've paged the sysadmin, so it's OK to
>> keep the page dirty and retry the writebacks to the dm-thin volume
>> after the helper daemon gives the all-clear), but we shouldn't require
>> all user space applications to have magic, Linux-specific retry code.
>
> Yes and no. I agree that the kernel should mostly make things just work.
> We're talking about a relatively obscure error case here, so if
> userspace applications have to do something extraordinary, chances are
> they won't be doing it.
>
> On the other hand, indefinitely blocking on fsync() isn't really what we
> want either, so while the kernel should keep trying to get the data
> written in the background, a failing fsync() would be okay, as long as a
> succeeding fsync() afterwards means that we're fully consistent again.
>
> In qemu, indefinitely blocking read/write syscalls are already a problem
> (on NFS), because instead of getting an error and then stopping the VM,
> the request hangs so long that the guest kernel sees a timeout and
> offlines the disk anyway. But that's a separate problem...
>
> Kevin

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 832 bytes --]

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [LSF/MM TOPIC] I/O error handling and fsync()
  2017-01-11  0:41 ` NeilBrown
@ 2017-01-13 11:09   ` Kevin Wolf
  2017-01-13 14:21     ` Theodore Ts'o
  2017-01-13 18:40     ` Al Viro
  0 siblings, 2 replies; 53+ messages in thread
From: Kevin Wolf @ 2017-01-13 11:09 UTC (permalink / raw)
  To: NeilBrown
  Cc: lsf-pc, linux-fsdevel, linux-mm, Christoph Hellwig, Ric Wheeler,
	Rik van Riel

[-- Attachment #1: Type: text/plain, Size: 13434 bytes --]

Am 11.01.2017 um 01:41 hat NeilBrown geschrieben:
> On Wed, Jan 11 2017, Kevin Wolf wrote:
> 
> > Hi all,
> >
> > when I mentioned the I/O error handling problem especially with fsync()
> > we have in QEMU to Christoph Hellwig, he thought it would be a great topic
> > for LSF/MM, so here I am. This came up a few months ago on qemu-devel [1]
> > and we managed to ignore it for a while, but it's a real and potentially
> > serious problem, so I think I agree with Christoph that it makes sense
> > to get it discussed at LSF/MM.
> >
> >
> > At the heart of it is the semantics of fsync(). A few years ago, fsync()
> > was fixed to actually flush data to the disk, so we now have a defined
> > and useful meaning of fsync() as long as all your fsync() calls return
> > success.
> >
> > However, as soon as one fsync() call fails, even if the root problem is
> > solved later (network connection restored, some space freed for thin
> > provisioned storage, etc.), the state we're in is mostly undefined. As
> > Ric Wheeler told me back in the qemu-devel discussion, when a writeout
> > fails, you get an fsync() error returned (once), but the kernel page
> > cache simply marks the respective page as clean and consequently won't
> > ever retry the writeout. Instead, it can evict it from the cache even
> > though it isn't actually consistent with the state on disk, which means
> > throwing away data that was written by some process.
> >
> > So if you do another fsync() and it returns success, this doesn't
> > currently mean that all of the data you wrote is on disk, but if
> > anything, it's just about the data you wrote after the failed fsync().
> > This isn't very helpful, to say the least, because you called fsync() in
> > order to get a consistent state on disk, and you still don't have that.
> >
> > Essentially this means that once you get an fsync() failure, there is no
> > hope of recovery for the application and it has to stop using the file.
> 
> This is not strictly correct.  The application could repeat all the
> recent writes.  It might fsync after each write so it can find out
> exactly where the problem is.
> So it could take a lot of work to recover, but it is not intrinsically
> impossible.

You are right, I probably overgeneralised from our situation. qemu
doesn't have the written data any more, and basically duplicating the
page cache by keeping a second copy of all data in qemu until the next
flush isn't really practicable and would both consume a considerable
amount of memory (if we don't add artificial flushes that the guest
didn't request, potentially unbounded) and impact performance because we
wouldn't be zero-copy any more.

So it is not intrinsically impossible, but practically impossible for at
least some applications. As you say, it's probably also too much extra
code to deal with an unlikely corner case for applications where it
would be possible, so it's still unlikely they will do this.

> > To give some context about my perspective as the maintainer for the QEMU
> > block subsystem: QEMU has a mode (which is usually enabled in
> > production) where I/O failure isn't communicated to the guest, which
> > would probably offline the filesystem, thinking its hard disk has died,
> > but instead QEMU pauses the VM and allows the administrator to resume
> > when the problem has been fixed. Often the problem is only temporary,
> > e.g. a network hiccup when a disk image is stored on NFS, so this is a
> > quite helpful approach.
> 
> If the disk image is stored over NFS, the write should hang, not cause
> an error. (Of course if you mount with '-o soft' you will get an error,
> but if you mount with '-o soft', then "you get to keep both halves").

Yes, bad example. (The hanging write is a problem of its own, and I
think one of the reasons why '-o soft' is bad is the behaviour of the
page cache if we let it fail, but while possibly related, it's a
separate problem.)

> Is there a more realistic situation where you might get a write error
> that might succeed if the write is repeated?

So where we noticed this problem in practice wasn't the kernel page
cache, but the userspace gluster implementation, which exposed a similar
behaviour: It threw away the cache contents on a failed fsync() and the
next fsync() would report success again.

In the following discussion we came to think of the kernel and that the
same problem exists there in theory. This was confirmed by Ric Wheeler
and Rik van Riel, who I trust to have some knowledge about this, and my
own superficial read of some kernel code didn't contradict. Neither did
anyone in this thread disagree, so I assume that the problem does exist
on the page cache level.

Now even if at the moment there were no storage backend where a write
failure can be temporary (which I find hard to believe, but who knows),
a single new driver is enough to expose the problem. Are you confident
enough that no single driver will ever behave this way to make data
integrity depend on the assumption?

Now to answer your question a bit more directly: The other example we
had in mind was ENOSPC in thin provisioned block devices, which can be
fixed by freeing up some space. I also still see potential for such
behaviour in things using the network, but I haven't checked them in
detail.

> > When QEMU is told to resume the VM, the request is just resubmitted.
> > This works fine for read/write, but not so much for fsync, because after
> > the first failure all bets are off even if a subsequent fsync()
> > succeeds.
> >
> > So this is the aspect that directly affects me, even though the problem
> > is much broader and by far doesn't only affect QEMU.
> >
> >
> > This leads to a few individual points to be discussed:
> >
> > 1. Fix the data corruption problem that follows from the current
> >    behaviour. Imagine the following scenario:
> >
> >    Process A writes to some file, calls fsync() and gets a failure. The
> >    data it wrote is marked clean in the page cache even though it's
> >    inconsistent with the disk. Process A knows that fsync() failed, so
> >    maybe it can deal with it, at least by no longer using the file.
> >
> >    Now process B opens the same file, reads the updated data that
> >    process A wrote, makes some additional changes based on that and
> >    calls fsync() again.  Now fsync() returns success. The data written by
> >    B is on disk, but the data written by A isn't. Oops, this is data
> >    corruption, and process B doesn't even know about it because all its
> >    operations succeeded.
> 
> Can that really happen? I would expect the filesystem to call
> SetPageError() if there was a write error, then I would expect a read to
> report an error for that page if it were still in cache (or maybe flush
> it out).  I admit that I haven't traced through the code in detail, but
> I did find some examples for SetPageError after a write error.

To be honest, I kept the proposal intentionally on the high-level
userspace API semantics level because I'm not familiar with the
internals. I did have a look and might have been lucky enough to spot
something that contradicts the theoretical considerations (I didn't),
but I certainly didn't spend enough time to make the opposite claim,
i.e. that nothing prevents it from happening. I took Rik's word on
this.

Anyway, it would probably be good if someone had a closer look.

> >
> > 2. Define fsync() semantics that include the state after a failure (this
> >    probably goes a long way towards fixing 1.).
> >
> >    The semantics that QEMU uses internally (and which it needs to map)
> >    is that after a successful flush, all writes to the disk image that
> >    have successfully completed before the flush was issued are stable on
> >    disk (no matter whether a previous flush failed).
> >
> >    A possible adaptation to Linux, which considers that unlike QEMU
> >    images, files can be opened more than once, might be that a
> >    succeeding fsync() on a file descriptor means that all data that has
> >    been read or written through this file descriptor is consistent
> >    between the page cache and the disk (the read part is for avoiding
> >    the scenario from 1.; it means that fsync flushes data written on a
> >    different file descriptor if it has been seen by this one; hence, the
> >    page cache can't contain non-dirty pages which aren't consistent with
> >    the disk).
> 
> I think it would be useful to try to describe the behaviour of page
> flags, particularly PG_error PG_uptodate PG_dirty in the different
> scenarios.
> 
> For example, a successful read sets PG_uptodate and a successful write
> clears PG_dirty.
> A failed read doesn't set PG_uptodate, and maybe sets PG_error.
> A failed write probably shouldn't clear PG_dirty but should set PG_error.
> 
> If background-write finds a PG_dirty|PG_error page, should it try to
> write it out again?  Or should only a foreground (fsync) write?

That's a good question. I think a background write (if that includes
anything not coming from userspace) needs to be able to retry writing
out pages at least sometimes, specifically as the final attempt when we
need the memory and are about to throw the data away for good.

> If we did this, PG_error|PG_dirty pages would be pinned in memory until
> a write was successful.  We would need a way to purge these pages
> without writing them.  We would also need a way to ensure they didn't
> consume a large fraction of memory.

Yes, at some point throwing them away is unavoidable. If we do, a good
fsync() behaviour is important to communicate this to userspace.

> It isn't clear to me that the behaviour can be different for different
> file descriptors.  Once the data has been written to the page cache, it
> belongs to the file, not to any particular fd.  So enabling
> "keep-data-after-write-error" would need to be per-file rather than
> per-fd, and would probably need to be a privileged operation due to the
> memory consumption concerns.

Note that I didn't think of a "keep-data-after-write-error" flag,
neither per-fd nor per-file, because I assumed that everyone would want
it as long as there is some hope that the data could still be
successfully written out later.

The per-fd thing I envisioned was a flag that basically tells "this fd
has gone bad, fsync() won't ever return success for it again" and that
would be set for all open file descriptors for a file when we release
PG_error|PG_dirty pages in it without having written them.

I had assumed that there is a way to get back from the file to all file
descriptors that are open for it, but looking at the code, I indeed
don't see one. Is this an intentional design decision, or is it just that
nobody needed it?

You could still mark the whole file as "gone bad", but then this would
also affect new file descriptors that never saw the content that we
threw away. If I understand correctly, you would have to close all file
descriptors on the file first to get rid of the "gone bad" flag (is this
enough or are files kept around for longer than their fds?), and only
then could you get a working new one again. This sounds a bit too heavy
to me.

> >
> > 3. Actually make fsync() failure recoverable.
> >
> >    You can implement 2. by making sure that a file descriptor for which
> >    pages have been thrown away always returns an error and never goes
> >    back to succeeding (it can't succeed according to the definition of 2.
> >    because the data that would have to be written out is gone). This is
> >    already a much better interface, but it doesn't really solve the
> >    actual problem we have.
> >
> >    We also need to make sure that after a failed fsync() there is a
> >    chance to recover. This means that the pages shouldn't be thrown away
> >    immediately; but at the same time, you probably also don't want to
> >    keep pages indefinitely when there is a permanent writeout error.
> >    However, if we can make sure that these pages are only evicted in
> >    case of actual memory pressure, and only if there are no actually
> >    clean pages to evict, I think a lot would already be won.
> 
> I think this would make behaviour unpredictable, being dependent on how
> much memory pressure there is.  Predictability is nice!

Yes, predictability is nice. Recovering from errors and not losing data
is nice, too. I think I would generally value the latter higher, but I
see that there may be cases where a different tradeoff might make sense.
A sign that it should be an option?

On the other hand, I wouldn't really consider page cache writeouts
particularly predictable for userspace anyway.

> >
> >    In the common case, you could then recover from a temporary failure,
> >    but if this state isn't maintainable, at least we get consistent
> >    fsync() failure telling us that the data is gone.
> >
> >
> > I think I've summarised most aspects here, but if something is unclear
> > or you'd like to see some more context, please refer to the qemu-devel
> > discussion [1] that I mentioned, or feel free to just ask.
> 
> Definitely an interesting question!
> 
> Thanks,
> NeilBrown

Kevin

[-- Attachment #2: Type: application/pgp-signature, Size: 836 bytes --]

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [LSF/MM TOPIC] I/O error handling and fsync()
  2017-01-13  4:51     ` NeilBrown
@ 2017-01-13 11:51       ` Kevin Wolf
  2017-01-13 21:55         ` NeilBrown
  0 siblings, 1 reply; 53+ messages in thread
From: Kevin Wolf @ 2017-01-13 11:51 UTC (permalink / raw)
  To: NeilBrown
  Cc: Theodore Ts'o, lsf-pc, linux-fsdevel, linux-mm,
	Christoph Hellwig, Ric Wheeler, Rik van Riel

[-- Attachment #1: Type: text/plain, Size: 3554 bytes --]

Am 13.01.2017 um 05:51 hat NeilBrown geschrieben:
> On Wed, Jan 11 2017, Kevin Wolf wrote:
> 
> > Am 11.01.2017 um 06:03 hat Theodore Ts'o geschrieben:
> >> A couple of thoughts.
> >> 
> >> First of all, one of the reasons why this probably hasn't been
> >> addressed for so long is because programs who really care about issues
> >> like this tend to use Direct I/O, and don't use the page cache at all.
> >> And perhaps this is an option open to qemu as well?
> >
> > For our immediate case, yes, O_DIRECT can be enabled as an option in
> > qemu, and it is generally recommended to do that at least for long-lived
> > VMs. For other cases it might be nice to use the cache e.g. for quicker
> > startup, but those might be cases where error recovery isn't as
> > important.
> >
> > I just see a much broader problem here than just for qemu. Essentially
> > this approach would mean that every program that cares about the state
> > it sees being safe on disk after a successful fsync() would have to use
> > O_DIRECT. I'm not sure if that's what we want.
> 
> This is not correct.  If an application has exclusive write access to a
> file (which is common, even if only enforced by convention) and if that
> program checks the return of every write() and every fsync() (which, for
> example, stdio does, allowing ferror() to report if there have ever been
> errors), then it will know if its data is safe.
> 
> If any of these writes returned an error, then there is NOTHING IT CAN
> DO about that file.  It should be considered to be toast.
> If there is a separate filesystem it can use, then maybe there is a way
> forward, but normally it would just report an error in whatever way is
> appropriate.
> 
> My position on this is primarily that if you get a single write error,
> then you cannot trust anything any more.

But why? Do you think this is inevitable and therefore it is the most
useful approach to handle errors, or is it just more convenient because
then you don't have to think as much about error cases?

The semantics I know is that a failed write means that the contents of
the blocks touched by a failed write request is undefined now, but why
can't I trust anything else in the same file (we're talking about what
is often a whole block device in the case of qemu) any more?

> You suggested before that NFS problems can cause errors which can be
> fixed by the sysadmin so subsequent writes succeed.  I disagreed - NFS
> will block, not return an error.  Your last paragraph below indicates
> that you agree.  So I ask again: can you provide a genuine example of a
> case where a write might result in an error, but that sysadmin
> involvement can allow a subsequent attempt to write to succeed.   I
> don't think you can, but I'm open...

I think I replied to that in the other email now, so in order to keep it
in one place I won't repeat my answer here.

> I note that ext4 has an option "errors=remount-ro".  I think that
> actually makes a lot of sense.  I could easily see an argument for
> supporting this at the file level, when it isn't enabled at the
> filesystem level. If there is any write error, then all subsequent
> writes should cause an error, only reads should be allowed.

Obviously, that doesn't solve the recovery problems we have; it only
makes them worse. However, I admit it would be the only reasonable
choice if "after a single write error, you can't trust the whole file"
is the official semantics. (Which I hope it isn't.)

Kevin

[-- Attachment #2: Type: application/pgp-signature, Size: 836 bytes --]

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [LSF/MM TOPIC] I/O error handling and fsync()
  2017-01-13 11:09   ` Kevin Wolf
@ 2017-01-13 14:21     ` Theodore Ts'o
  2017-01-13 16:00       ` Kevin Wolf
  2017-01-13 18:40     ` Al Viro
  1 sibling, 1 reply; 53+ messages in thread
From: Theodore Ts'o @ 2017-01-13 14:21 UTC (permalink / raw)
  To: Kevin Wolf
  Cc: NeilBrown, lsf-pc, linux-fsdevel, linux-mm, Christoph Hellwig,
	Ric Wheeler, Rik van Riel

On Fri, Jan 13, 2017 at 12:09:59PM +0100, Kevin Wolf wrote:
> Now even if at the moment there were no storage backend where a write
> failure can be temporary (which I find hard to believe, but who knows),
> a single new driver is enough to expose the problem. Are you confident
> enough that no single driver will ever behave this way to make data
> integrity depend on the assumption?

This is really a philosophical question.  It very much simplifies
things if we can make the assumption that a driver that *does* behave
this way is **broken**.  If the I/O error is temporary, then the
driver should simply not complete the write, and wait.  If it fails,
it should only be because it has timed out on waiting and has assumed
that the problem is permanent.

Otherwise, every single application is going to have to learn how to
deal with temporary errors, and everything that implies (throwing up
dialog boxes to the user, who may not be able to do anything --- this
is why in the dm-thin case, if you think it should be temporary,
dm-thin should be calling out to a user space program that pages a
system administrator; why do you think the process or the user who
started the process can do anything about it?)

Now, perhaps there ought to be a way for the application to say, "you
know, if you are going to have to wait more than <timeval>, don't
bother".  This might be interesting from a general sense, even for
working hardware, since there are HDD's with media extensions where
you can tell the disk drive not to bother with the I/O operation if
it's going to take more than XX milliseconds, and if there is a way to
reflect that back to userspace, that can be useful for other
applications, such as video or other soft realtime programs.

But forcing every single application to have to deal with retries in
the case of temporary errors?  That way lies madness, and there's no
way we can get to all of the applications to make them do the right
thing.

> Note that I didn't think of a "keep-data-after-write-error" flag,
> neither per-fd nor per-file, because I assumed that everyone would want
> it as long as there is some hope that the data could still be
> successfully written out later.

But not everyone is going to know to do this.  This is why the retry
really should be done by the device driver, and if it fails, everyone's
lives will be much simpler if the failure is a permanent failure where
there is no hope.

Are there use cases you are concerned about where this model wouldn't
suit?

	       	    	      	      	    - Ted


^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [LSF/MM TOPIC] I/O error handling and fsync()
  2017-01-13 14:21     ` Theodore Ts'o
@ 2017-01-13 16:00       ` Kevin Wolf
  2017-01-13 22:28         ` NeilBrown
  0 siblings, 1 reply; 53+ messages in thread
From: Kevin Wolf @ 2017-01-13 16:00 UTC (permalink / raw)
  To: Theodore Ts'o
  Cc: NeilBrown, lsf-pc, linux-fsdevel, linux-mm, Christoph Hellwig,
	Ric Wheeler, Rik van Riel

Am 13.01.2017 um 15:21 hat Theodore Ts'o geschrieben:
> On Fri, Jan 13, 2017 at 12:09:59PM +0100, Kevin Wolf wrote:
> > Now even if at the moment there were no storage backend where a write
> > failure can be temporary (which I find hard to believe, but who knows),
> > a single new driver is enough to expose the problem. Are you confident
> > enough that no single driver will ever behave this way to make data
> > integrity depend on the assumption?
> 
> This is really a philosophical question.  It very much simplifiees
> things if we can make the assumption that a driver that *does* behave
> this way is **broken**.  If the I/O error is temporary, then the
> driver should simply not complete the write, and wait.

If we are sure that (at least we make it so that) every error is
permanent, then yes, this simplifies things a bit because it saves you
the retries that we know wouldn't succeed anyway.

In that case, what's possibly left is modifying fsync() so that it
consistently returns an error; or if not, we need to promise this
behaviour to userspace, so that on the first fsync() failure an
application can give up on the file knowing that it isn't abandoning
data that could still have been saved.
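
To spell out the userspace discipline this implies, here is a minimal
sketch (illustration only, not qemu code; plain POSIX calls) of "check
every write() and fsync(), and give up on the file at the first
failure":

    #define _XOPEN_SOURCE 500
    #include <errno.h>
    #include <sys/types.h>
    #include <unistd.h>

    /* Returns 0 on success, -errno on the first unrecoverable failure.
     * After a failure the caller stops using the file: per the semantics
     * discussed here, retrying would not bring the lost data back. */
    static int write_and_sync(int fd, const void *buf, size_t len, off_t off)
    {
        ssize_t ret = pwrite(fd, buf, len, off);
        if (ret < 0 || (size_t)ret != len)
            return ret < 0 ? -errno : -EIO;  /* treat short writes as fatal */

        if (fsync(fd) < 0)
            return -errno;                   /* first fsync() failure: give up */

        return 0;
    }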

> If it fails, it should only be because it has timed out on waiting and
> has assumed that the problem is permanent.

If manual action is required to restore functionality, how can you use
a timeout to determine whether a problem is permanent or not?

This is exactly the kind of error from which we want to recover in
qemu instead of killing the VMs. Assuming that errors are permanent when
they aren't, but just require some action before they can succeed, is
not a solution to the problem; it's pretty much the description of the
problem that we had before we implemented the retry logic.

So if you say that all errors are permanent, fine; but if some of them
are actually temporary, we're back to square one.

> Otherwise, every single application is going to have to learn how to
> deal with temporary errors, and everything that implies (throwing up
> dialog boxes to the user, who may not be able to do anything

Yes, that's obviously not a realistic option.

> --- this is why in the dm-thin case, if you think it should be
> temporary, dm-thin should be calling out to a usr space program that
> pages an system administrator; why do you think the process or the
> user who started the process can do anything about it/)

In the case of qemu, we can't do anything about it in terms of making
the request work, but we can do something useful with the information:
We limit the damage by pausing the VM and preventing it from
seeing a broken hard disk from which it wouldn't recover without a
reboot. So in our case, both the system administrator and the process
want to be informed.
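
For reference, this mode is configured per drive, roughly like the
following command line; the image path is just an example and the
option names are from memory, so check the qemu documentation:

    qemu-system-x86_64 ... \
        -drive file=/path/to/guest.img,format=raw,werror=stop,rerror=stop

werror=stop/rerror=stop is what makes qemu pause the VM on a failing
request instead of forwarding the error to the guest.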

A timeout could serve as a trigger for qemu, but we could possibly do
better for things like the dm-thin case where we know immediately that
we'll have to wait for manual action.

> Now, perhaps there ought to be a way for the application to say, "you
> know, if you are going to have to wait more than <timeval>, don't
> bother".  This might be interesting from a general sense, even for
> working hardware, since there are HDD's with media extensions where
> you can tell the disk drive not to bother with the I/O operation if
> it's going to take more than XX milliseconds, and if there is a way to
> reflect that back to userspace, that can be useful for other
> applications, such as video or other soft realtime programs.
> 
> But forcing every single application to have to deal with retries in
> the case of temporary errors?  That way lies madness, and there's no
> way we can get to all of the applications to make them do the right
> thing.

Agree on both points.

> > Note that I didn't think of a "keep-data-after-write-error" flag,
> > neither per-fd nor per-file, because I assumed that everyone would want
> > it as long as there is some hope that the data could still be
> > successfully written out later.
> 
> But not everyone is going to know to do this.  This is why the retry
> really should be done by the device driver, and if it fails, everyone
> lives will be much simpler if the failure should be a permanent
> failure where there is no hope.
> 
> Are there use cases you are concerned about where this model wouldn't
> suit?

If, and only if, all errors reported as permanent are actually
permanent, I think this works.

Of course, this makes handling hanging requests even more important for
us. We have certain places where we want to get to a clean state with no
pending requests. We could probably use timeouts in userspace, but we
would also want to get the thread doing the syscall unstuck and ideally
be sure that the kernel doesn't still try changing the file behind our
back (maybe the latter part is only thinkable with direct I/O, though).

In other words, we're the only user of a file and we want to cancel
hanging I/O syscalls. I think we once came to the conclusion that this
isn't currently possible, but it's been a while...

Kevin


^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [LSF/MM TOPIC] I/O error handling and fsync()
  2017-01-13 11:09   ` Kevin Wolf
  2017-01-13 14:21     ` Theodore Ts'o
@ 2017-01-13 18:40     ` Al Viro
  2017-01-13 19:06       ` Kevin Wolf
  1 sibling, 1 reply; 53+ messages in thread
From: Al Viro @ 2017-01-13 18:40 UTC (permalink / raw)
  To: Kevin Wolf
  Cc: NeilBrown, lsf-pc, linux-fsdevel, linux-mm, Christoph Hellwig,
	Ric Wheeler, Rik van Riel

On Fri, Jan 13, 2017 at 12:09:59PM +0100, Kevin Wolf wrote:

> I had assumed that there is a way to get back from the file to all file
> descriptors that are open for it, but looking at the code I don't see
> one indeed. Is this an intentional design decision or is it just that
> nobody needed it?

The locking required for that would be horrible.  Ditto for the memory
*and* dirty cache footprint.  Besides, what kind of locking would the
callers need, simply to keep the answer from going stale by the time
they see it?  System-wide exclusion of operations that might affect
descriptors (including fork and exit, BTW)?

And that's aside of the fact that an opened file might have no descriptors
whatsoever - e.g. stuff it into SCM_RIGHTS, send to another process (or
to yourself) and close the descriptor you've used.  recvmsg() will reattach
it to descriptor table nicely...

If you are not actually talking about the descriptors and want all
struct file associated with given... inode, presumably?  That one is
merely a nasty headache from dirty cache footprint on a bunch of
hot paths.  That, and the same "how do you keep the results valid by the
time they are returned to caller" problem - e.g. how do you know that
another process has not opened the same thing just as you'd been examining
the set of opened files with that inode?


^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [LSF/MM TOPIC] I/O error handling and fsync()
  2017-01-13 18:40     ` Al Viro
@ 2017-01-13 19:06       ` Kevin Wolf
  0 siblings, 0 replies; 53+ messages in thread
From: Kevin Wolf @ 2017-01-13 19:06 UTC (permalink / raw)
  To: Al Viro
  Cc: NeilBrown, lsf-pc, linux-fsdevel, linux-mm, Christoph Hellwig,
	Ric Wheeler, Rik van Riel

Am 13.01.2017 um 19:40 hat Al Viro geschrieben:
> On Fri, Jan 13, 2017 at 12:09:59PM +0100, Kevin Wolf wrote:
> 
> > I had assumed that there is a way to get back from the file to all file
> > descriptors that are open for it, but looking at the code I don't see
> > one indeed. Is this an intentional design decision or is it just that
> > nobody needed it?
> 
> The locking required for that would be horrible.  Ditto for the memory
> *and* dirty cache footprint.  Besides, what kind of locking would the
> callers need, simply to keep the answer from going stale by the time
> they see it?  System-wide exclusion of operations that might affect
> descriptors (including fork and exit, BTW)?
> 
> And that's aside of the fact that an opened file might have no descriptors
> whatsoever - e.g. stuff it into SCM_RIGHTS, send to another process (or
> to yourself) and close the descriptor you've used.  recvmsg() will reattach
> it to descriptor table nicely...
> 
> If you are not actually talking about the descriptors and want all
> struct file associated with given... inode, presumably?  That one is
> merely a nasty headache from dirty cache footprint on a bunch of
> hot paths.  That, and the same "how do you keep the results valid by the
> time they are returned to caller" problem - e.g. how do you know that
> another process has not opened the same thing just as you'd been examining
> the set of opened files with that inode?

Sorry, yes, I was really thinking of struct file rather than the
descriptors per se.

I kind of expected that locking might play a role, but I was curious
whether there's more to it, so thanks for explaining.

Kevin


^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [LSF/MM TOPIC] I/O error handling and fsync()
  2017-01-13 11:51       ` Kevin Wolf
@ 2017-01-13 21:55         ` NeilBrown
  0 siblings, 0 replies; 53+ messages in thread
From: NeilBrown @ 2017-01-13 21:55 UTC (permalink / raw)
  To: Kevin Wolf
  Cc: Theodore Ts'o, lsf-pc, linux-fsdevel, linux-mm,
	Christoph Hellwig, Ric Wheeler, Rik van Riel

[-- Attachment #1: Type: text/plain, Size: 4688 bytes --]

On Fri, Jan 13 2017, Kevin Wolf wrote:

> [ Unknown signature status ]
> Am 13.01.2017 um 05:51 hat NeilBrown geschrieben:
>> On Wed, Jan 11 2017, Kevin Wolf wrote:
>> 
>> > Am 11.01.2017 um 06:03 hat Theodore Ts'o geschrieben:
>> >> A couple of thoughts.
>> >> 
>> >> First of all, one of the reasons why this probably hasn't been
>> >> addressed for so long is because programs who really care about issues
>> >> like this tend to use Direct I/O, and don't use the page cache at all.
>> >> And perhaps this is an option open to qemu as well?
>> >
>> > For our immediate case, yes, O_DIRECT can be enabled as an option in
>> > qemu, and it is generally recommended to do that at least for long-lived
>> > VMs. For other cases it might be nice to use the cache e.g. for quicker
>> > startup, but those might be cases where error recovery isn't as
>> > important.
>> >
>> > I just see a much broader problem here than just for qemu. Essentially
>> > this approach would mean that every program that cares about the state
>> > it sees being safe on disk after a successful fsync() would have to use
>> > O_DIRECT. I'm not sure if that's what we want.
>> 
>> This is not correct.  If an application has exclusive write access to a
>> file (which is common, even if only enforced by convention) and if that
>> program checks the return of every write() and every fsync() (which, for
>> example, stdio does, allowing ferror() to report if there have ever been
>> errors), then it will know if its data if safe.
>> 
>> If any of these writes returned an error, then there is NOTHING IT CAN
>> DO about that file.  It should be considered to be toast.
>> If there is a separate filesystem it can use, then maybe there is a way
>> forward, but normally it would just report an error in whatever way is
>> appropriate.
>> 
>> My position on this is primarily that if you get a single write error,
>> then you cannot trust anything any more.
>
> But why? Do you think this is inevitable and therefore it is the most
> useful approach to handle errors, or is it just more convenient because
> then you don't have to think as much about error cases?

If you get an EIO from a write, or fsync, it tells you that the
underlying storage module has run out of options and cannot cope.
Maybe it was a media error, then the drive tried writing to a reserved
area but got a media error there as well, or found it was full.
Maybe it was a drive mechanism error - the head won't move properly any
more.
Maybe the flash storage is too worn and it won't hold data any more.
Maybe the network-attached server said "that object doesn't exist any more"
Maybe .... all sorts of other possibilities.

What is the chance that the underlying storage mechanism has failed to
store this one block for you, but everything else is working smoothly?
I would suggest that the chance is very close to zero.

Trying recovery strategies when you have no idea what went wrong is an
exercise in futility.  "EIO" doesn't carry enough information, in
general, for you to do anything other than bail out and admit failure.

NeilBrown


>
> The semantics I know is that a failed write means that the contents of
> the blocks touched by a failed write request is undefined now, but why
> can't I trust anything else in the same file (we're talking about what
> is often a whole block device in the case of qemu) any more?
>
>> You suggested before that NFS problems can cause errors which can be
>> fixed by the sysadmin so subsequent writes succeed.  I disagreed - NFS
>> will block, not return an error.  Your last paragraph below indicates
>> that you agree.  So I ask again: can you provide a genuine example of a
>> case where a write might result in an error, but that sysadmin
>> involvement can allow a subsequent attempt to write to succeed.   I
>> don't think you can, but I'm open...
>
> I think I replied to that in the other email now, so in order to keep it
> in one place I don't repeat my answer here
>
>> I note that ext4 has an option "errors=remount-ro".  I think that
>> actually makes a lot of sense.  I could easily see an argument for
>> supporting this at the file level, when it isn't enabled at the
>> filesystem level. If there is any write error, then all subsequent
>> writes should cause an error, only reads should be allowed.
>
> Obviously, that doesn't solve the problems we have to recover, but makes
> them only worse. However, I admit it would be the only reasonable choice
> if "after a single write error, you can't trust the whole file" is the
> official semantics. (Which I hope it isn't.)
>
> Kevin

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 832 bytes --]

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [LSF/MM TOPIC] I/O error handling and fsync()
  2017-01-13 16:00       ` Kevin Wolf
@ 2017-01-13 22:28         ` NeilBrown
  2017-01-14  6:18           ` Darrick J. Wong
  2017-01-16 12:14           ` [Lsf-pc] " Jeff Layton
  0 siblings, 2 replies; 53+ messages in thread
From: NeilBrown @ 2017-01-13 22:28 UTC (permalink / raw)
  To: Kevin Wolf, Theodore Ts'o
  Cc: lsf-pc, linux-fsdevel, linux-mm, Christoph Hellwig, Ric Wheeler,
	Rik van Riel

[-- Attachment #1: Type: text/plain, Size: 5927 bytes --]

On Sat, Jan 14 2017, Kevin Wolf wrote:

> Am 13.01.2017 um 15:21 hat Theodore Ts'o geschrieben:
>> On Fri, Jan 13, 2017 at 12:09:59PM +0100, Kevin Wolf wrote:
>> > Now even if at the moment there were no storage backend where a write
>> > failure can be temporary (which I find hard to believe, but who knows),
>> > a single new driver is enough to expose the problem. Are you confident
>> > enough that no single driver will ever behave this way to make data
>> > integrity depend on the assumption?
>> 
>> This is really a philosophical question.  It very much simplifiees
>> things if we can make the assumption that a driver that *does* behave
>> this way is **broken**.  If the I/O error is temporary, then the
>> driver should simply not complete the write, and wait.
>
> If we are sure that (at least we make it so that) every error is
> permanent, then yes, this simplifies things a bit because it saves you
> the retries that we know wouldn't succeed anyway.
>
> In that case, what's possibly left is modifying fsync() so that it
> consistently returns an error; or if not, we need to promise this
> behaviour to userspace so that on the first fsync() failure it can give
> up on the file without doing less for the user than it could do.

I think we can (and implicitly do) make that promise: if you get EIO
from fsync, then there is no credible recovery action you can try.

>
>> If it fails, it should only be because it has timed out on waiting and
>> has assumed that the problem is permanent.
>
> If a manual action is required to restore the functionality, how can you
> use a timeout for determining whether a problem is permanent or not?

If manual action is required, and can reasonably be expected, then the
device driver should block indefinitely.
As an example, the IBM s390 systems have a "dasd" storage driver, which
I think is a fiber-attached storage array.  If the connection to the
array stops working (and no more paths are available), it will (by
default) block indefinitely.  I presume it logs the problem and the
sysadmin can find out and fix things - or if "things" are unfixable,
they can change the configuration to report an error.

Similarly the DM multipath module has an option "queue_if_no_path" (aka
"no_path_retry") which means that if no working paths are found, the
request should be queued and retried (no error reported).

If manual action is an option, then the driver must be configured to wait for
manual action.
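
For completeness, that multipath behaviour is a one-line setting in
/etc/multipath.conf, roughly like this (syntax from memory, so check
multipath.conf(5) before copying it):

    defaults {
        # queue I/O indefinitely when no path is available, instead of
        # failing it (equivalent to the "1 queue_if_no_path" feature string)
        no_path_retry    queue
    }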

>
> This is exactly the kind of errors from which we want to recover in
> qemu instead of killing the VMs. Assuming that errors are permanent when
> they aren't, but just require some action before they can succeed, is
> not a solution to the problem, but it's pretty much the description of
> the problem that we had before we implemented the retry logic.
>
> So if you say that all errors are permanent, fine; but if some of them
> are actually temporary, we're back to square one.
>
>> Otherwise, every single application is going to have to learn how to
>> deal with temporary errors, and everything that implies (throwing up
>> dialog boxes to the user, who may not be able to do anything
>
> Yes, that's obviously not a realistic option.
>
>> --- this is why in the dm-thin case, if you think it should be
>> temporary, dm-thin should be calling out to a usr space program that
>> pages an system administrator; why do you think the process or the
>> user who started the process can do anything about it/)
>
> In the case of qemu, we can't do anything about it in terms of making
> the request work, but we can do something useful with the information:
> We limit the damage done, by pausing the VM and preventing it from
> seeing a broken hard disk from which it wouldn't recover without a
> reboot. So in our case, both the system administrator and the process
> want to be informed.

In theory, using aio_fsync() should allow the process to determine if
any writes are blocking indefinitely.   I have a suspicion that
aio_fsync() is not actually asynchronous, but that might be old
information.
Alternately a child process could call "fsync" and report when it completed.
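
A rough, untested sketch of that child-process variant (plain POSIX; the
100ms polling interval is arbitrary):

    #include <sys/types.h>
    #include <sys/wait.h>
    #include <unistd.h>

    /* Run fsync(fd) in a child and wait up to timeout_sec seconds for it.
     * Returns 0 on success, -1 on failure, 1 if the child is still stuck
     * after the timeout (the child is then simply left running, since a
     * hanging fsync cannot be cancelled anyway). */
    static int fsync_with_timeout(int fd, int timeout_sec)
    {
        pid_t pid = fork();
        if (pid < 0)
            return -1;
        if (pid == 0)
            _exit(fsync(fd) == 0 ? 0 : 1);   /* child: fsync and report */

        for (int i = 0; i < timeout_sec * 10; i++) {
            int status;
            pid_t r = waitpid(pid, &status, WNOHANG);
            if (r < 0)
                return -1;
            if (r == pid)
                return (WIFEXITED(status) &&
                        WEXITSTATUS(status) == 0) ? 0 : -1;
            usleep(100 * 1000);              /* poll every 100 ms */
        }
        return 1;   /* still blocked: report/pause instead of hanging */
    }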

>
> A timeout could serve as a trigger for qemu, but we could possibly do
> better for things like the dm-thin case where we know immediately that
> we'll have to wait for manual action.

A consistent way for devices to be able to report "operator
intervention required" would certainly be useful.  I'm not sure how easy
it would be for a particular application to determine if such a report
was relevant for any of its IO though.

It might not be too hard to add a flag to "sync_file_range()" to ask it to
report the status of queues, e.g.:
 0 - nothing queued, no data to sync
 1 - writes are being queued, and progress appears to be normal
 2 - queue appears to be stalled
 3 - queue reports that admin intervention is required.

The last one would require a fair bit of plumbing to get the information
to the right place.  The others are probably fairly easy if they can be
defined properly.
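
To make that concrete, usage could look like the sketch below.  Note
that the QUEUE_STATUS flag and the 0-3 return values are purely
hypothetical (today sync_file_range() only knows the
WAIT_BEFORE/WRITE/WAIT_AFTER flags and returns 0 or -1), so this is only
meant to illustrate the proposal:

    #define _GNU_SOURCE
    #include <fcntl.h>

    #define SYNC_FILE_RANGE_QUEUE_STATUS  0x8   /* made-up flag value */

    static const char *queue_state(int fd)
    {
        /* Hypothetical: ask about the state of the file's writeback queue. */
        switch (sync_file_range(fd, 0, 0, SYNC_FILE_RANGE_QUEUE_STATUS)) {
        case 0:  return "idle, nothing to sync";
        case 1:  return "writes queued, progressing normally";
        case 2:  return "writeback queue appears stalled";
        case 3:  return "admin intervention required";
        default: return "error (today's kernels return -EINVAL here)";
        }
    }
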
If you look in /sys/kernel/bdi/*/stats you will see statistics for each
bdi (backing device info) which roughly correspond to filesystems.  You
can easily map from a file descriptor to a bdi.
The "BdiWriteBandwidth" will (presumably) drop if there is data to be
written which cannot get out.  Monitoring these stats might give an
application a useful understanding about what is happening in a particular
storage device.
I don't suggest that qemu should access this file, because it is a
'debugfs' file and not part of the api.  But the information is there
and might be useful.  If you can show that it is directly useful to an
application in some way, that would be a useful step towards making the
information more directly available in an api-stable way.
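
For experimentation it really is only a few lines to get from an fd to
that stats file.  The sketch below assumes debugfs is mounted at
/sys/kernel/debug (so the stats live under
/sys/kernel/debug/bdi/<major>:<minor>/stats) and a plain
block-device-backed filesystem:

    #include <stdio.h>
    #include <sys/stat.h>
    #include <sys/sysmacros.h>

    /* Dump the debugfs bdi stats for the device backing fd. */
    static void dump_bdi_stats(int fd)
    {
        struct stat st;
        char path[128], line[256];

        if (fstat(fd, &st) < 0)
            return;
        snprintf(path, sizeof(path), "/sys/kernel/debug/bdi/%u:%u/stats",
                 major(st.st_dev), minor(st.st_dev));

        FILE *f = fopen(path, "r");
        if (!f)
            return;                  /* debugfs not mounted, or no such bdi */
        while (fgets(line, sizeof(line), f))
            fputs(line, stdout);     /* BdiWriteBandwidth is one of these */
        fclose(f);
    }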

NeilBrown


[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 832 bytes --]

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [LSF/MM TOPIC] I/O error handling and fsync()
  2017-01-13 22:28         ` NeilBrown
@ 2017-01-14  6:18           ` Darrick J. Wong
  2017-01-16 12:14           ` [Lsf-pc] " Jeff Layton
  1 sibling, 0 replies; 53+ messages in thread
From: Darrick J. Wong @ 2017-01-14  6:18 UTC (permalink / raw)
  To: NeilBrown
  Cc: Kevin Wolf, Theodore Ts'o, lsf-pc, linux-fsdevel, linux-mm,
	Christoph Hellwig, Ric Wheeler, Rik van Riel

On Sat, Jan 14, 2017 at 09:28:53AM +1100, NeilBrown wrote:
> On Sat, Jan 14 2017, Kevin Wolf wrote:
> 
> > Am 13.01.2017 um 15:21 hat Theodore Ts'o geschrieben:
> >> On Fri, Jan 13, 2017 at 12:09:59PM +0100, Kevin Wolf wrote:
> >> > Now even if at the moment there were no storage backend where a write
> >> > failure can be temporary (which I find hard to believe, but who knows),
> >> > a single new driver is enough to expose the problem. Are you confident
> >> > enough that no single driver will ever behave this way to make data
> >> > integrity depend on the assumption?
> >> 
> >> This is really a philosophical question.  It very much simplifiees
> >> things if we can make the assumption that a driver that *does* behave
> >> this way is **broken**.  If the I/O error is temporary, then the
> >> driver should simply not complete the write, and wait.
> >
> > If we are sure that (at least we make it so that) every error is
> > permanent, then yes, this simplifies things a bit because it saves you
> > the retries that we know wouldn't succeed anyway.
> >
> > In that case, what's possibly left is modifying fsync() so that it
> > consistently returns an error; or if not, we need to promise this
> > behaviour to userspace so that on the first fsync() failure it can give
> > up on the file without doing less for the user than it could do.
> 
> I think we can (and implicitly do) make that promise: if you get EIO
> From fsync, then there is no credible recovery action you can try.
> 
> >
> >> If it fails, it should only be because it has timed out on waiting and
> >> has assumed that the problem is permanent.
> >
> > If a manual action is required to restore the functionality, how can you
> > use a timeout for determining whether a problem is permanent or not?
> 
> If manual action is required, and can reasonably be expected, then the
> device driver should block indefinitely.
> As an example, the IBM s390 systems have a "dasd" storage driver, which
> I think is a fiber-attached storage array.  If the connection to the
> array stops working (and no more paths are available), it will (by
> default) block indefinitely.  I presume it logs the problem and the
> sysadmin can find out and fix things - or if "things" are unfixable,
> they can change the configuration to report an error.
> 
> Similary the DM multipath module has an option "queue_if_no_path" (aka
> "no_path_retry") which means that if no working paths are found, the
> request should be queued and retried (no error reported).
> 
> If manual action is an option, then the driver must be configured to wait for
> manual action.
> 
> >
> > This is exactly the kind of errors from which we want to recover in
> > qemu instead of killing the VMs. Assuming that errors are permanent when
> > they aren't, but just require some action before they can succeed, is
> > not a solution to the problem, but it's pretty much the description of
> > the problem that we had before we implemented the retry logic.
> >
> > So if you say that all errors are permanent, fine; but if some of them
> > are actually temporary, we're back to square one.
> >
> >> Otherwise, every single application is going to have to learn how to
> >> deal with temporary errors, and everything that implies (throwing up
> >> dialog boxes to the user, who may not be able to do anything
> >
> > Yes, that's obviously not a realistic option.
> >
> >> --- this is why in the dm-thin case, if you think it should be
> >> temporary, dm-thin should be calling out to a usr space program that
> >> pages an system administrator; why do you think the process or the
> >> user who started the process can do anything about it/)
> >
> > In the case of qemu, we can't do anything about it in terms of making
> > the request work, but we can do something useful with the information:
> > We limit the damage done, by pausing the VM and preventing it from
> > seeing a broken hard disk from which it wouldn't recover without a
> > reboot. So in our case, both the system administrator and the process
> > want to be informed.
> 
> In theory, using aio_fsync() should allow the process to determine if
> any writes are blocking indefinitely.   I have a suspicion that
> aio_fsync() is not actually asynchronous, but that might be old
> information.
> Alternately a child process could call "fsync" and report when it completed.

<shrug> I thought aio_fsync wasn't implemented?

--D

> >
> > A timeout could serve as a trigger for qemu, but we could possibly do
> > better for things like the dm-thin case where we know immediately that
> > we'll have to wait for manual action.
> 
> A consistent way for devices to be able to report "operator
> intervention required" would certainly be useful.  I'm not sure how easy
> it would be for a particular application to determine if such a report
> was relevant for any of its IO though.
> 
> It might not be too hard to add a flag to "sync_file_range()" to ask it to
> report the status of queues, e.g.:
>  0 - nothing queued, no data to sync
>  1 - writes are being queued, and progress appears to be normal
>  2 - queue appears to be stalled
>  3 - queue reports that admin intervention is required.
> 
> The last one would require a fair bit of plumbing to get the information
> to the right place.  The others are probably fairly easy if they can be
> defined properly.
> If you look in /sys/kernel/bdi/*/stats you will see statistic for each
> bdi (backing device info) which roughly correspond to filesystems.  You
> can easily map from a file descriptor to a bdi.
> The "BdiWriteBandwidth" will (presumably) drop if there is data to be
> written which cannot get out.  Monitoring these stats might give an
> application a useful understanding about what is happening in a particular
> storage device.
> I don't suggest that qemu should access this file, because it is a
> 'debugfs' file and not part of the api.  But the information is there
> and might be useful.  If you can show that it is directly useful to an
> application in some way, that would a useful step towards making the
> information more directly available in an api-stable way.
> 
> NeilBrown
> 



^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [Lsf-pc] [LSF/MM TOPIC] I/O error handling and fsync()
  2017-01-13 22:28         ` NeilBrown
  2017-01-14  6:18           ` Darrick J. Wong
@ 2017-01-16 12:14           ` Jeff Layton
  2017-01-22 22:44             ` NeilBrown
  1 sibling, 1 reply; 53+ messages in thread
From: Jeff Layton @ 2017-01-16 12:14 UTC (permalink / raw)
  To: NeilBrown, Kevin Wolf, Theodore Ts'o
  Cc: Rik van Riel, Christoph Hellwig, linux-mm, linux-fsdevel, lsf-pc,
	Ric Wheeler

On Sat, 2017-01-14 at 09:28 +1100, NeilBrown wrote:
> On Sat, Jan 14 2017, Kevin Wolf wrote:
> 
> > Am 13.01.2017 um 15:21 hat Theodore Ts'o geschrieben:
> > > On Fri, Jan 13, 2017 at 12:09:59PM +0100, Kevin Wolf wrote:
> > > > Now even if at the moment there were no storage backend where a write
> > > > failure can be temporary (which I find hard to believe, but who knows),
> > > > a single new driver is enough to expose the problem. Are you confident
> > > > enough that no single driver will ever behave this way to make data
> > > > integrity depend on the assumption?
> > > 
> > > This is really a philosophical question.  It very much simplifiees
> > > things if we can make the assumption that a driver that *does* behave
> > > this way is **broken**.  If the I/O error is temporary, then the
> > > driver should simply not complete the write, and wait.
> > 
> > If we are sure that (at least we make it so that) every error is
> > permanent, then yes, this simplifies things a bit because it saves you
> > the retries that we know wouldn't succeed anyway.
> > 
> > In that case, what's possibly left is modifying fsync() so that it
> > consistently returns an error; or if not, we need to promise this
> > behaviour to userspace so that on the first fsync() failure it can give
> > up on the file without doing less for the user than it could do.
> 
> I think we can (and implicitly do) make that promise: if you get EIO
> From fsync, then there is no credible recovery action you can try.
> 
> > 
> > > If it fails, it should only be because it has timed out on waiting and
> > > has assumed that the problem is permanent.
> > 
> > If a manual action is required to restore the functionality, how can you
> > use a timeout for determining whether a problem is permanent or not?
> 
> If manual action is required, and can reasonably be expected, then the
> device driver should block indefinitely.
> As an example, the IBM s390 systems have a "dasd" storage driver, which
> I think is a fiber-attached storage array.  If the connection to the
> array stops working (and no more paths are available), it will (by
> default) block indefinitely.  I presume it logs the problem and the
> sysadmin can find out and fix things - or if "things" are unfixable,
> they can change the configuration to report an error.
> 
> Similary the DM multipath module has an option "queue_if_no_path" (aka
> "no_path_retry") which means that if no working paths are found, the
> request should be queued and retried (no error reported).
> 
> If manual action is an option, then the driver must be configured to wait for
> manual action.
> 
> > 
> > This is exactly the kind of errors from which we want to recover in
> > qemu instead of killing the VMs. Assuming that errors are permanent when
> > they aren't, but just require some action before they can succeed, is
> > not a solution to the problem, but it's pretty much the description of
> > the problem that we had before we implemented the retry logic.
> > 
> > So if you say that all errors are permanent, fine; but if some of them
> > are actually temporary, we're back to square one.
> > 
> > > Otherwise, every single application is going to have to learn how to
> > > deal with temporary errors, and everything that implies (throwing up
> > > dialog boxes to the user, who may not be able to do anything
> > 
> > Yes, that's obviously not a realistic option.
> > 
> > > --- this is why in the dm-thin case, if you think it should be
> > > temporary, dm-thin should be calling out to a usr space program that
> > > pages an system administrator; why do you think the process or the
> > > user who started the process can do anything about it/)
> > 
> > In the case of qemu, we can't do anything about it in terms of making
> > the request work, but we can do something useful with the information:
> > We limit the damage done, by pausing the VM and preventing it from
> > seeing a broken hard disk from which it wouldn't recover without a
> > reboot. So in our case, both the system administrator and the process
> > want to be informed.
> 
> In theory, using aio_fsync() should allow the process to determine if
> any writes are blocking indefinitely.   I have a suspicion that
> aio_fsync() is not actually asynchronous, but that might be old
> information.
> Alternately a child process could call "fsync" and report when it completed.
> 
> > 
> > A timeout could serve as a trigger for qemu, but we could possibly do
> > better for things like the dm-thin case where we know immediately that
> > we'll have to wait for manual action.
> 
> A consistent way for devices to be able to report "operator
> intervention required" would certainly be useful.  I'm not sure how easy
> it would be for a particular application to determine if such a report
> was relevant for any of its IO though.
> 
> It might not be too hard to add a flag to "sync_file_range()" to ask it to
> report the status of queues, e.g.:
>  0 - nothing queued, no data to sync
>  1 - writes are being queued, and progress appears to be normal
>  2 - queue appears to be stalled
>  3 - queue reports that admin intervention is required.
> 
> The last one would require a fair bit of plumbing to get the information
> to the right place.  The others are probably fairly easy if they can be
> defined properly.
> If you look in /sys/kernel/bdi/*/stats you will see statistic for each
> bdi (backing device info) which roughly correspond to filesystems.  You
> can easily map from a file descriptor to a bdi.
> The "BdiWriteBandwidth" will (presumably) drop if there is data to be
> written which cannot get out.  Monitoring these stats might give an
> application a useful understanding about what is happening in a particular
> storage device.
> I don't suggest that qemu should access this file, because it is a
> 'debugfs' file and not part of the api.  But the information is there
> and might be useful.  If you can show that it is directly useful to an
> application in some way, that would a useful step towards making the
> information more directly available in an api-stable way.
> 

I think my main takeaway from reading this discussion is that the
write/fsync model as a whole is really unsuitable for this (common) use
case. Given that we're already discussing using Linux specific
interfaces (sync_file_range, for instance), maybe we should turn this
topic around:

What would an ideal kernel<->userland interface for this use case look
like?

-- 
Jeff Layton <jlayton@poochiereds.net>


^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [Lsf-pc] [LSF/MM TOPIC] I/O error handling and fsync()
  2017-01-16 12:14           ` [Lsf-pc] " Jeff Layton
@ 2017-01-22 22:44             ` NeilBrown
  2017-01-22 23:31               ` Jeff Layton
  0 siblings, 1 reply; 53+ messages in thread
From: NeilBrown @ 2017-01-22 22:44 UTC (permalink / raw)
  To: Jeff Layton, Kevin Wolf, Theodore Ts'o
  Cc: Rik van Riel, Christoph Hellwig, linux-mm, linux-fsdevel, lsf-pc,
	Ric Wheeler

[-- Attachment #1: Type: text/plain, Size: 7190 bytes --]

On Mon, Jan 16 2017, Jeff Layton wrote:

> On Sat, 2017-01-14 at 09:28 +1100, NeilBrown wrote:
>> On Sat, Jan 14 2017, Kevin Wolf wrote:
>> 
>> > Am 13.01.2017 um 15:21 hat Theodore Ts'o geschrieben:
>> > > On Fri, Jan 13, 2017 at 12:09:59PM +0100, Kevin Wolf wrote:
>> > > > Now even if at the moment there were no storage backend where a write
>> > > > failure can be temporary (which I find hard to believe, but who knows),
>> > > > a single new driver is enough to expose the problem. Are you confident
>> > > > enough that no single driver will ever behave this way to make data
>> > > > integrity depend on the assumption?
>> > > 
>> > > This is really a philosophical question.  It very much simplifiees
>> > > things if we can make the assumption that a driver that *does* behave
>> > > this way is **broken**.  If the I/O error is temporary, then the
>> > > driver should simply not complete the write, and wait.
>> > 
>> > If we are sure that (at least we make it so that) every error is
>> > permanent, then yes, this simplifies things a bit because it saves you
>> > the retries that we know wouldn't succeed anyway.
>> > 
>> > In that case, what's possibly left is modifying fsync() so that it
>> > consistently returns an error; or if not, we need to promise this
>> > behaviour to userspace so that on the first fsync() failure it can give
>> > up on the file without doing less for the user than it could do.
>> 
>> I think we can (and implicitly do) make that promise: if you get EIO
>> From fsync, then there is no credible recovery action you can try.
>> 
>> > 
>> > > If it fails, it should only be because it has timed out on waiting and
>> > > has assumed that the problem is permanent.
>> > 
>> > If a manual action is required to restore the functionality, how can you
>> > use a timeout for determining whether a problem is permanent or not?
>> 
>> If manual action is required, and can reasonably be expected, then the
>> device driver should block indefinitely.
>> As an example, the IBM s390 systems have a "dasd" storage driver, which
>> I think is a fiber-attached storage array.  If the connection to the
>> array stops working (and no more paths are available), it will (by
>> default) block indefinitely.  I presume it logs the problem and the
>> sysadmin can find out and fix things - or if "things" are unfixable,
>> they can change the configuration to report an error.
>> 
>> Similary the DM multipath module has an option "queue_if_no_path" (aka
>> "no_path_retry") which means that if no working paths are found, the
>> request should be queued and retried (no error reported).
>> 
>> If manual action is an option, then the driver must be configured to wait for
>> manual action.
>> 
>> > 
>> > This is exactly the kind of errors from which we want to recover in
>> > qemu instead of killing the VMs. Assuming that errors are permanent when
>> > they aren't, but just require some action before they can succeed, is
>> > not a solution to the problem, but it's pretty much the description of
>> > the problem that we had before we implemented the retry logic.
>> > 
>> > So if you say that all errors are permanent, fine; but if some of them
>> > are actually temporary, we're back to square one.
>> > 
>> > > Otherwise, every single application is going to have to learn how to
>> > > deal with temporary errors, and everything that implies (throwing up
>> > > dialog boxes to the user, who may not be able to do anything
>> > 
>> > Yes, that's obviously not a realistic option.
>> > 
>> > > --- this is why in the dm-thin case, if you think it should be
>> > > temporary, dm-thin should be calling out to a usr space program that
>> > > pages an system administrator; why do you think the process or the
>> > > user who started the process can do anything about it/)
>> > 
>> > In the case of qemu, we can't do anything about it in terms of making
>> > the request work, but we can do something useful with the information:
>> > We limit the damage done, by pausing the VM and preventing it from
>> > seeing a broken hard disk from which it wouldn't recover without a
>> > reboot. So in our case, both the system administrator and the process
>> > want to be informed.
>> 
>> In theory, using aio_fsync() should allow the process to determine if
>> any writes are blocking indefinitely.   I have a suspicion that
>> aio_fsync() is not actually asynchronous, but that might be old
>> information.
>> Alternately a child process could call "fsync" and report when it completed.
>> 
>> > 
>> > A timeout could serve as a trigger for qemu, but we could possibly do
>> > better for things like the dm-thin case where we know immediately that
>> > we'll have to wait for manual action.
>> 
>> A consistent way for devices to be able to report "operator
>> intervention required" would certainly be useful.  I'm not sure how easy
>> it would be for a particular application to determine if such a report
>> was relevant for any of its IO though.
>> 
>> It might not be too hard to add a flag to "sync_file_range()" to ask it to
>> report the status of queues, e.g.:
>>  0 - nothing queued, no data to sync
>>  1 - writes are being queued, and progress appears to be normal
>>  2 - queue appears to be stalled
>>  3 - queue reports that admin intervention is required.
>> 
>> The last one would require a fair bit of plumbing to get the information
>> to the right place.  The others are probably fairly easy if they can be
>> defined properly.
>> If you look in /sys/kernel/bdi/*/stats you will see statistic for each
>> bdi (backing device info) which roughly correspond to filesystems.  You
>> can easily map from a file descriptor to a bdi.
>> The "BdiWriteBandwidth" will (presumably) drop if there is data to be
>> written which cannot get out.  Monitoring these stats might give an
>> application a useful understanding about what is happening in a particular
>> storage device.
>> I don't suggest that qemu should access this file, because it is a
>> 'debugfs' file and not part of the api.  But the information is there
>> and might be useful.  If you can show that it is directly useful to an
>> application in some way, that would a useful step towards making the
>> information more directly available in an api-stable way.
>> 
>
> I think my main takeaway from reading this discussion is that the
> write/fsync model as a whole is really unsuitable for this (common) use
> case. Given that we're already discussing using Linux specific
> interfaces (sync_file_range, for instance), maybe we should turn this
> topic around:
>
> What would an ideal kernel<->userland interface for this use case look
> like?


"(common) use case" ??
I understand the use case to be "application needs to know when output
queue is congested so that it can pause gracefully instead of hang at
some arbitrary moment".
Is that the same use case that you see?
Is it really "common"?

(It may still be important, even if it isn't common).

I support the idea of stepping back and asking the big picture question!

NeilBrown

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 832 bytes --]

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [Lsf-pc] [LSF/MM TOPIC] I/O error handling and fsync()
  2017-01-22 22:44             ` NeilBrown
@ 2017-01-22 23:31               ` Jeff Layton
  2017-01-23  0:21                 ` Theodore Ts'o
  0 siblings, 1 reply; 53+ messages in thread
From: Jeff Layton @ 2017-01-22 23:31 UTC (permalink / raw)
  To: NeilBrown, Kevin Wolf, Theodore Ts'o
  Cc: Rik van Riel, Christoph Hellwig, linux-mm, linux-fsdevel, lsf-pc,
	Ric Wheeler

On Mon, 2017-01-23 at 09:44 +1100, NeilBrown wrote:
> On Mon, Jan 16 2017, Jeff Layton wrote:
> 
> > On Sat, 2017-01-14 at 09:28 +1100, NeilBrown wrote:
> > > On Sat, Jan 14 2017, Kevin Wolf wrote:
> > > 
> > > > Am 13.01.2017 um 15:21 hat Theodore Ts'o geschrieben:
> > > > > On Fri, Jan 13, 2017 at 12:09:59PM +0100, Kevin Wolf wrote:
> > > > > > Now even if at the moment there were no storage backend where a write
> > > > > > failure can be temporary (which I find hard to believe, but who knows),
> > > > > > a single new driver is enough to expose the problem. Are you confident
> > > > > > enough that no single driver will ever behave this way to make data
> > > > > > integrity depend on the assumption?
> > > > > 
> > > > > This is really a philosophical question.  It very much simplifiees
> > > > > things if we can make the assumption that a driver that *does* behave
> > > > > this way is **broken**.  If the I/O error is temporary, then the
> > > > > driver should simply not complete the write, and wait.
> > > > 
> > > > If we are sure that (at least we make it so that) every error is
> > > > permanent, then yes, this simplifies things a bit because it saves you
> > > > the retries that we know wouldn't succeed anyway.
> > > > 
> > > > In that case, what's possibly left is modifying fsync() so that it
> > > > consistently returns an error; or if not, we need to promise this
> > > > behaviour to userspace so that on the first fsync() failure it can give
> > > > up on the file without doing less for the user than it could do.
> > > 
> > > I think we can (and implicitly do) make that promise: if you get EIO
> > > From fsync, then there is no credible recovery action you can try.
> > > 
> > > > 
> > > > > If it fails, it should only be because it has timed out on waiting and
> > > > > has assumed that the problem is permanent.
> > > > 
> > > > If a manual action is required to restore the functionality, how can you
> > > > use a timeout for determining whether a problem is permanent or not?
> > > 
> > > If manual action is required, and can reasonably be expected, then the
> > > device driver should block indefinitely.
> > > As an example, the IBM s390 systems have a "dasd" storage driver, which
> > > I think is a fiber-attached storage array.  If the connection to the
> > > array stops working (and no more paths are available), it will (by
> > > default) block indefinitely.  I presume it logs the problem and the
> > > sysadmin can find out and fix things - or if "things" are unfixable,
> > > they can change the configuration to report an error.
> > > 
> > > Similary the DM multipath module has an option "queue_if_no_path" (aka
> > > "no_path_retry") which means that if no working paths are found, the
> > > request should be queued and retried (no error reported).
> > > 
> > > If manual action is an option, then the driver must be configured to wait for
> > > manual action.
> > > 
> > > > 
> > > > This is exactly the kind of errors from which we want to recover in
> > > > qemu instead of killing the VMs. Assuming that errors are permanent when
> > > > they aren't, but just require some action before they can succeed, is
> > > > not a solution to the problem, but it's pretty much the description of
> > > > the problem that we had before we implemented the retry logic.
> > > > 
> > > > So if you say that all errors are permanent, fine; but if some of them
> > > > are actually temporary, we're back to square one.
> > > > 
> > > > > Otherwise, every single application is going to have to learn how to
> > > > > deal with temporary errors, and everything that implies (throwing up
> > > > > dialog boxes to the user, who may not be able to do anything
> > > > 
> > > > Yes, that's obviously not a realistic option.
> > > > 
> > > > > --- this is why in the dm-thin case, if you think it should be
> > > > > temporary, dm-thin should be calling out to a usr space program that
> > > > > pages an system administrator; why do you think the process or the
> > > > > user who started the process can do anything about it/)
> > > > 
> > > > In the case of qemu, we can't do anything about it in terms of making
> > > > the request work, but we can do something useful with the information:
> > > > We limit the damage done, by pausing the VM and preventing it from
> > > > seeing a broken hard disk from which it wouldn't recover without a
> > > > reboot. So in our case, both the system administrator and the process
> > > > want to be informed.
> > > 
> > > In theory, using aio_fsync() should allow the process to determine if
> > > any writes are blocking indefinitely.   I have a suspicion that
> > > aio_fsync() is not actually asynchronous, but that might be old
> > > information.
> > > Alternately a child process could call "fsync" and report when it completed.
> > > 
> > > > 
> > > > A timeout could serve as a trigger for qemu, but we could possibly do
> > > > better for things like the dm-thin case where we know immediately that
> > > > we'll have to wait for manual action.
> > > 
> > > A consistent way for devices to be able to report "operator
> > > intervention required" would certainly be useful.  I'm not sure how easy
> > > it would be for a particular application to determine if such a report
> > > was relevant for any of its IO though.
> > > 
> > > It might not be too hard to add a flag to "sync_file_range()" to ask it to
> > > report the status of queues, e.g.:
> > >  0 - nothing queued, no data to sync
> > >  1 - writes are being queued, and progress appears to be normal
> > >  2 - queue appears to be stalled
> > >  3 - queue reports that admin intervention is required.
> > > 
> > > The last one would require a fair bit of plumbing to get the information
> > > to the right place.  The others are probably fairly easy if they can be
> > > defined properly.
> > > If you look in /sys/kernel/bdi/*/stats you will see statistic for each
> > > bdi (backing device info) which roughly correspond to filesystems.  You
> > > can easily map from a file descriptor to a bdi.
> > > The "BdiWriteBandwidth" will (presumably) drop if there is data to be
> > > written which cannot get out.  Monitoring these stats might give an
> > > application a useful understanding about what is happening in a particular
> > > storage device.
> > > I don't suggest that qemu should access this file, because it is a
> > > 'debugfs' file and not part of the api.  But the information is there
> > > and might be useful.  If you can show that it is directly useful to an
> > > application in some way, that would a useful step towards making the
> > > information more directly available in an api-stable way.
> > > 
> > 
> > I think my main takeaway from reading this discussion is that the
> > write/fsync model as a whole is really unsuitable for this (common) use
> > case. Given that we're already discussing using Linux specific
> > interfaces (sync_file_range, for instance), maybe we should turn this
> > topic around:
> > 
> > What would an ideal kernel<->userland interface for this use case look
> > like?
> 
> 
> "(common) use case" ??
> I understand the use case to be "application needs to know when output
> queue is congested so that it can pause gracefully instead of hang at
> some arbitrary moment".
> Is that the same use case that you see?
> It is really "common"?
> 
> (It may still be important, even if it isn't common).
> 
> I support the idea of stepping back and asking the big picture question!
> 
> NeilBrown

Ahh, sorry if I wasn't clear.

I know Kevin posed this topic in the context of QEMU/KVM, and I figure
that running virt guests (themselves doing all sorts of workloads) is a
pretty common setup these days. That was what I meant by "use case"
here. Obviously there are many other workloads that could benefit from
(or be harmed by) changes in this area.

Still, I think that looking at QEMU/KVM as an "application" and
considering what we can do to help optimize that case could be helpful
here (and might also be helpful for other workloads).

-- 
Jeff Layton <jlayton@poochiereds.net>


^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [Lsf-pc] [LSF/MM TOPIC] I/O error handling and fsync()
  2017-01-22 23:31               ` Jeff Layton
@ 2017-01-23  0:21                 ` Theodore Ts'o
  2017-01-23 10:09                   ` Kevin Wolf
  0 siblings, 1 reply; 53+ messages in thread
From: Theodore Ts'o @ 2017-01-23  0:21 UTC (permalink / raw)
  To: Jeff Layton
  Cc: NeilBrown, Kevin Wolf, Rik van Riel, Christoph Hellwig, linux-mm,
	linux-fsdevel, lsf-pc, Ric Wheeler

On Sun, Jan 22, 2017 at 06:31:57PM -0500, Jeff Layton wrote:
> 
> Ahh, sorry if I wasn't clear.
> 
> I know Kevin posed this topic in the context of QEMU/KVM, and I figure
> that running virt guests (themselves doing all sorts of workloads) is a
> pretty common setup these days. That was what I meant by "use case"
> here. Obviously there are many other workloads that could benefit from
> (or be harmed by) changes in this area.
> 
> Still, I think that looking at QEMU/KVM as a "application" and
> considering what we can do to help optimize that case could be helpful
> here (and might also be helpful for other workloads).

Well, except for QEMU/KVM, Kevin has already confirmed that using
Direct I/O is a completely viable solution.  (And I'll add it solves a
bunch of other problems, including page cache efficiency....)

					- Ted



* Re: [Lsf-pc] [LSF/MM TOPIC] I/O error handling and fsync()
  2017-01-23  0:21                 ` Theodore Ts'o
@ 2017-01-23 10:09                   ` Kevin Wolf
  2017-01-23 12:10                       ` Jeff Layton
  2017-01-23 22:35                       ` Jeff Layton
  0 siblings, 2 replies; 53+ messages in thread
From: Kevin Wolf @ 2017-01-23 10:09 UTC (permalink / raw)
  To: Theodore Ts'o
  Cc: Jeff Layton, NeilBrown, Rik van Riel, Christoph Hellwig,
	linux-mm, linux-fsdevel, lsf-pc, Ric Wheeler

Am 23.01.2017 um 01:21 hat Theodore Ts'o geschrieben:
> On Sun, Jan 22, 2017 at 06:31:57PM -0500, Jeff Layton wrote:
> > 
> > Ahh, sorry if I wasn't clear.
> > 
> > I know Kevin posed this topic in the context of QEMU/KVM, and I figure
> > that running virt guests (themselves doing all sorts of workloads) is a
> > pretty common setup these days. That was what I meant by "use case"
> > here. Obviously there are many other workloads that could benefit from
> > (or be harmed by) changes in this area.
> > 
> > Still, I think that looking at QEMU/KVM as a "application" and
> > considering what we can do to help optimize that case could be helpful
> > here (and might also be helpful for other workloads).
> 
> Well, except for QEMU/KVM, Kevin has already confirmed that using
> Direct I/O is a completely viable solution.  (And I'll add it solves a
> bunch of other problems, including page cache efficiency....)

Yes, "don't ever use non-O_DIRECT in production" is probably workable as
a solution to the "state after failed fsync()" problem, as long as it is
consistently implemented throughout the stack. That is, if we use a
network protocol in QEMU (NFS, gluster, etc.), the server needs to use
O_DIRECT, too, if we don't want to get the same problem one level down
the stack. I'm not sure if that's possible with all of them, but if it
is, it's mostly just a matter of configuring them correctly.
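
A minimal sketch of what that O_DIRECT discipline looks like from the
application side (the 4096-byte alignment below is an assumption; the
real requirement depends on the filesystem and device):

#define _GNU_SOURCE             /* for O_DIRECT */
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int write_block(const char *path, const char *buf, size_t len, off_t off)
{
    /* O_DIRECT bypasses the page cache, so a failed write or fsync
     * cannot leave clean-but-stale cached pages behind. */
    int fd = open(path, O_WRONLY | O_DIRECT);
    if (fd < 0)
        return -1;

    /* Buffer, length and offset generally have to be aligned for
     * O_DIRECT; 4096 is a common but not universal value. */
    void *aligned;
    if (posix_memalign(&aligned, 4096, len)) {
        close(fd);
        return -1;
    }
    memcpy(aligned, buf, len);

    ssize_t ret = pwrite(fd, aligned, len, off);

    /* fdatasync() is still needed to flush volatile drive caches. */
    if (ret == (ssize_t)len && fdatasync(fd) != 0)
        ret = -1;

    free(aligned);
    close(fd);
    return ret == (ssize_t)len ? 0 : -1;
}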

However, if we look at the greater problem of hanging requests that came
up in the more recent emails of this thread, it is only moved rather
than solved. Chances are that already write() would hang now instead of
only fsync(), but we still have a hard time dealing with this.

Kevin


* Re: [Lsf-pc] [LSF/MM TOPIC] I/O error handling and fsync()
  2017-01-23 10:09                   ` Kevin Wolf
@ 2017-01-23 12:10                       ` Jeff Layton
  2017-01-23 22:35                       ` Jeff Layton
  1 sibling, 0 replies; 53+ messages in thread
From: Jeff Layton @ 2017-01-23 12:10 UTC (permalink / raw)
  To: Kevin Wolf, Theodore Ts'o
  Cc: NeilBrown, Rik van Riel, Christoph Hellwig, linux-mm,
	linux-fsdevel, lsf-pc, Ric Wheeler

On Mon, 2017-01-23 at 11:09 +0100, Kevin Wolf wrote:
> Am 23.01.2017 um 01:21 hat Theodore Ts'o geschrieben:
> > On Sun, Jan 22, 2017 at 06:31:57PM -0500, Jeff Layton wrote:
> > > 
> > > Ahh, sorry if I wasn't clear.
> > > 
> > > I know Kevin posed this topic in the context of QEMU/KVM, and I figure
> > > that running virt guests (themselves doing all sorts of workloads) is a
> > > pretty common setup these days. That was what I meant by "use case"
> > > here. Obviously there are many other workloads that could benefit from
> > > (or be harmed by) changes in this area.
> > > 
> > > Still, I think that looking at QEMU/KVM as a "application" and
> > > considering what we can do to help optimize that case could be helpful
> > > here (and might also be helpful for other workloads).
> > 
> > Well, except for QEMU/KVM, Kevin has already confirmed that using
> > Direct I/O is a completely viable solution.  (And I'll add it solves a
> > bunch of other problems, including page cache efficiency....)
> 

Sure, O_DIRECT does make this simpler (though it's not always the most
efficient way to do I/O). I'm more interested in whether we can improve
the error handling with buffered I/O.

Maybe it's possible to add new flags/behaviors to sync_file_range such
that you could more easily determine the status? Maybe a new syscall
would do it?
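
Purely to make the idea concrete, a sketch of what such a flag might
look like from userspace; SYNC_FILE_RANGE_REPORT_STATUS does not exist
anywhere and is invented here, borrowing the 0-3 status values from
NeilBrown's earlier sketch:

#define _GNU_SOURCE             /* for sync_file_range() */
#include <fcntl.h>
#include <stdio.h>

/* Invented flag: report writeback queue state instead of starting I/O. */
#define SYNC_FILE_RANGE_REPORT_STATUS 0x8

static void report_writeback_state(int fd)
{
    int status = sync_file_range(fd, 0, 0, SYNC_FILE_RANGE_REPORT_STATUS);

    switch (status) {
    case 0: printf("nothing queued, no data to sync\n"); break;
    case 1: printf("writes queued, progress looks normal\n"); break;
    case 2: printf("queue appears to be stalled\n"); break;
    case 3: printf("admin intervention required\n"); break;
    default: perror("sync_file_range");  /* a real kernel returns EINVAL */
    }
}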

Either way, getting a good feel for how QEMU would like to handle this
situation would be informative. It might be possible to make things
better with small tweaks to existing interfaces.

> Yes, "don't ever use non-O_DIRECT in production" is probably workable as
> a solution to the "state after failed fsync()" problem, as long as it is
> consistently implemented throughout the stack. That is, if we use a
> network protocol in QEMU (NFS, gluster, etc.), the server needs to use
> O_DIRECT, too, if we don't want to get the same problem one level down
> the stack. I'm not sure if that's possible with all of them, but if it
> is, it's mostly just a matter of configuring them correctly.
> 
> However, if we look at the greater problem of hanging requests that came
> up in the more recent emails of this thread, it is only moved rather
> than solved. Chances are that already write() would hang now instead of
> only fsync(), but we still have a hard time dealing with this.
> 

Yeah, not much you can do there currently (at least when it comes to
NFS). If you had your choice, what would you like to have happen in
this situation where there is a loss of communication between client and
server?

-- 
Jeff Layton <jlayton@poochiereds.net>


* Re: [Lsf-pc] [LSF/MM TOPIC] I/O error handling and fsync()
  2017-01-23 12:10                       ` Jeff Layton
  (?)
@ 2017-01-23 17:25                       ` Theodore Ts'o
  2017-01-23 17:53                         ` Chuck Lever
  2017-01-23 22:40                           ` Jeff Layton
  -1 siblings, 2 replies; 53+ messages in thread
From: Theodore Ts'o @ 2017-01-23 17:25 UTC (permalink / raw)
  To: Jeff Layton
  Cc: Kevin Wolf, NeilBrown, Rik van Riel, Christoph Hellwig, linux-mm,
	linux-fsdevel, lsf-pc, Ric Wheeler

On Mon, Jan 23, 2017 at 07:10:00AM -0500, Jeff Layton wrote:
> > > Well, except for QEMU/KVM, Kevin has already confirmed that using
> > > Direct I/O is a completely viable solution.  (And I'll add it solves a
> > > bunch of other problems, including page cache efficiency....)
> 
> Sure, O_DIRECT does make this simpler (though it's not always the most
> efficient way to do I/O). I'm more interested in whether we can improve
> the error handling with buffered I/O.

I just want to make sure we're designing a solution that will actually
be _used_, because it is a good fit for at least one real-world use
case.

Is QEMU/KVM using volumes that are stored over NFS really used in the
real world?  Especially one where you want a huge amount of
reliability and recovery after some kind of network failure?  If we are
talking about customers who are going to suspend the VM and restart it
on another server, that presumes a fairly large installation size and
enough servers.  Would they *really* want to use a single point of
failure such as an NFS filer?  Even if it was a proprietary
purpose-built NFS filer?  Why wouldn't they be using RADOS and Ceph
instead, for example?

					- Ted


* Re: [Lsf-pc] [LSF/MM TOPIC] I/O error handling and fsync()
  2017-01-23 17:25                       ` Theodore Ts'o
@ 2017-01-23 17:53                         ` Chuck Lever
  2017-01-23 22:40                           ` Jeff Layton
  1 sibling, 0 replies; 53+ messages in thread
From: Chuck Lever @ 2017-01-23 17:53 UTC (permalink / raw)
  To: Theodore Ts'o
  Cc: Jeff Layton, Kevin Wolf, NeilBrown, Rik van Riel,
	Christoph Hellwig, linux-mm, linux-fsdevel, lsf-pc, Ric Wheeler


> On Jan 23, 2017, at 12:25 PM, Theodore Ts'o <tytso@mit.edu> wrote:
> 
> On Mon, Jan 23, 2017 at 07:10:00AM -0500, Jeff Layton wrote:
>>>> Well, except for QEMU/KVM, Kevin has already confirmed that using
>>>> Direct I/O is a completely viable solution.  (And I'll add it solves a
>>>> bunch of other problems, including page cache efficiency....)
>> 
>> Sure, O_DIRECT does make this simpler (though it's not always the most
>> efficient way to do I/O). I'm more interested in whether we can improve
>> the error handling with buffered I/O.
> 
> I just want to make sure we're designing a solution that will actually
> be _used_, because it is a good fit for at least one real-world use
> case.
> 
> Is QEMU/KVM using volumes that are stored over NFS really used in the
> real world?

Yes. NFS has worked well for many years in pre-cloud virtualization
environments; in other words, environments that have supported guest
migration for much longer than OpenStack has been around.


> Especially one where you want a huge amount of
> reliability and recovery after some kind network failure?

These are largely data center-grade machine room area networks, not
WANs. Network failures are not as frequent as they used to be.

Most server systems ship with more than one Ethernet device anyway.
Adding a second LAN path between each client and storage targets is
pretty straightforward.


> If we are
> talking about customers who are going to suspend the VM and restart it
> on another server, that presumes a fairly large installation size and
> enough servers that would they *really* want to use a single point of
> failure such as an NFS filer?

You certainly can make NFS more reliable by using a filer that supports
IP-based cluster failover, and has a reasonable amount of redundant
durable storage.

I don't see why we should presume anything about installation size.


> Even if it was a proprietary
> purpose-built NFS filer?  Why wouldn't they be using RADOS and Ceph
> instead, for example?

NFS is a fine inexpensive solution for small deployments and experimental
setups.

It's much simpler for a single user with no administrative rights to
manage NFS-based files than to deal with creating LUNs or backing
objects, for instance.

Considering the various weirdnesses and inefficiencies involved in
turning an object store into something that has proper POSIX file
semantics, IMO NFS is a known quantity that is straightforward
and a natural fit for some cloud deployments. If it wasn't, then
there would be no reason to provide object-to-NFS gateway services.


Wrt O_DIRECT, an NFS client can open the NFS file that backs a virtual
block device with O_DIRECT, and you get the same semantics as reading
or writing to a physical block device. There is no need for the server
to use O_DIRECT as well: the client uses the NFS protocol to control
when the server commits data to durable storage (like, immediately).


--
Chuck Lever




* Re: [Lsf-pc] [LSF/MM TOPIC] I/O error handling and fsync()
  2017-01-23 10:09                   ` Kevin Wolf
@ 2017-01-23 22:35                       ` Jeff Layton
  2017-01-23 22:35                       ` Jeff Layton
  1 sibling, 0 replies; 53+ messages in thread
From: Jeff Layton @ 2017-01-23 22:35 UTC (permalink / raw)
  To: Kevin Wolf, Theodore Ts'o
  Cc: NeilBrown, Rik van Riel, Christoph Hellwig, linux-mm,
	linux-fsdevel, lsf-pc, Ric Wheeler

On Mon, 2017-01-23 at 11:09 +0100, Kevin Wolf wrote:
> Am 23.01.2017 um 01:21 hat Theodore Ts'o geschrieben:
> > On Sun, Jan 22, 2017 at 06:31:57PM -0500, Jeff Layton wrote:
> > > 
> > > Ahh, sorry if I wasn't clear.
> > > 
> > > I know Kevin posed this topic in the context of QEMU/KVM, and I figure
> > > that running virt guests (themselves doing all sorts of workloads) is a
> > > pretty common setup these days. That was what I meant by "use case"
> > > here. Obviously there are many other workloads that could benefit from
> > > (or be harmed by) changes in this area.
> > > 
> > > Still, I think that looking at QEMU/KVM as a "application" and
> > > considering what we can do to help optimize that case could be helpful
> > > here (and might also be helpful for other workloads).
> > 
> > Well, except for QEMU/KVM, Kevin has already confirmed that using
> > Direct I/O is a completely viable solution.  (And I'll add it solves a
> > bunch of other problems, including page cache efficiency....)
> 
> Yes, "don't ever use non-O_DIRECT in production" is probably workable as
> a solution to the "state after failed fsync()" problem, as long as it is
> consistently implemented throughout the stack. That is, if we use a
> network protocol in QEMU (NFS, gluster, etc.), the server needs to use
> O_DIRECT, too, if we don't want to get the same problem one level down
> the stack. I'm not sure if that's possible with all of them, but if it
> is, it's mostly just a matter of configuring them correctly.
> 

It's actually not necessary with NFS. O_DIRECT I/O is entirely a client-
side thing. There's no support for it in the protocol (and there doesn't
really need to be).

If something happens and the server crashes before the writes are
stable, then I believe the client will reissue them.

If both the client and server crash at the same time, then all bets are
off of course. :)

> However, if we look at the greater problem of hanging requests that came
> up in the more recent emails of this thread, it is only moved rather
> than solved. Chances are that already write() would hang now instead of
> only fsync(), but we still have a hard time dealing with this.
> 

Well, it _is_ better with O_DIRECT as you can usually at least break out
of the I/O with SIGKILL.

When I last looked at this, the problem with buffered I/O was that you
often end up waiting on page bits to clear (usually PG_writeback or
PG_dirty), in non-killable sleeps for the most part.

Maybe the fix here is as simple as changing that?
-- 
Jeff Layton <jlayton@poochiereds.net>


* Re: [Lsf-pc] [LSF/MM TOPIC] I/O error handling and fsync()
  2017-01-23 17:25                       ` Theodore Ts'o
@ 2017-01-23 22:40                           ` Jeff Layton
  2017-01-23 22:40                           ` Jeff Layton
  1 sibling, 0 replies; 53+ messages in thread
From: Jeff Layton @ 2017-01-23 22:40 UTC (permalink / raw)
  To: Theodore Ts'o
  Cc: Kevin Wolf, NeilBrown, Rik van Riel, Christoph Hellwig, linux-mm,
	linux-fsdevel, lsf-pc, Ric Wheeler

On Mon, 2017-01-23 at 12:25 -0500, Theodore Ts'o wrote:
> On Mon, Jan 23, 2017 at 07:10:00AM -0500, Jeff Layton wrote:
> > > > Well, except for QEMU/KVM, Kevin has already confirmed that using
> > > > Direct I/O is a completely viable solution.  (And I'll add it solves a
> > > > bunch of other problems, including page cache efficiency....)
> > 
> > Sure, O_DIRECT does make this simpler (though it's not always the most
> > efficient way to do I/O). I'm more interested in whether we can improve
> > the error handling with buffered I/O.
> 
> I just want to make sure we're designing a solution that will actually
> be _used_, because it is a good fit for at least one real-world use
> case.
> 

Exactly. Asking how the QEMU folks would like to be able to interact
with the kernel is not the same as promising to implement said solution.
Still, I think it's a valid question, and I'll pose it in terms of NFS
though I think the semantics apply to other situations as well.

I'm mostly just asking to get a better idea of what the KVM folks would
really like to have happen in this situation. I don't think they want to
error out on every network blip, but in the face of a hung mount that
isn't making progress in writeback, what would they like to be able to
do to resolve it?

For instance, with NFS you can generally send a SIGKILL to the process
to make it abandon O_DIRECT writes. But, tasks accessing NFS mounts
still seem to get stuck in buffered writeback if the server goes away,
generally waiting on the page bits to clear in uninterruptible sleeps.

Would better handling of SIGKILL when waiting on buffered writeback be
what QEMU devs would like? That seems like a reasonable thing to
consider.

> Is QEMU/KVM using volumes that are stored over NFS really used in the
> real world?  Especially one where you want a huge amount of
> reliability and recovery after some kind of network failure?  If we are
> talking about customers who are going to suspend the VM and restart it
> on another server, that presumes a fairly large installation size and
> enough servers.  Would they *really* want to use a single point of
> failure such as an NFS filer?  Even if it was a proprietary
> purpose-built NFS filer?  Why wouldn't they be using RADOS and Ceph
> instead, for example?

Nothing specific about NFS in what I was asking. I think cephfs has
similar behavior in the face of the client not being able to reach any
of its MDS', for instance.


-- 
Jeff Layton <jlayton@poochiereds.net>


* Re: [Lsf-pc] [LSF/MM TOPIC] I/O error handling and fsync()
  2017-01-23 22:35                       ` Jeff Layton
  (?)
@ 2017-01-23 23:09                       ` Trond Myklebust
  2017-01-24  0:16                           ` NeilBrown
  -1 siblings, 1 reply; 53+ messages in thread
From: Trond Myklebust @ 2017-01-23 23:09 UTC (permalink / raw)
  To: kwolf, jlayton, tytso
  Cc: lsf-pc, hch, riel, neilb, rwheeler, linux-mm, linux-fsdevel

On Mon, 2017-01-23 at 17:35 -0500, Jeff Layton wrote:
> On Mon, 2017-01-23 at 11:09 +0100, Kevin Wolf wrote:
> > 
> > However, if we look at the greater problem of hanging requests that
> > came
> > up in the more recent emails of this thread, it is only moved
> > rather
> > than solved. Chances are that already write() would hang now
> > instead of
> > only fsync(), but we still have a hard time dealing with this.
> > 
> 
> Well, it _is_ better with O_DIRECT as you can usually at least break
> out
> of the I/O with SIGKILL.
> 
> When I last looked at this, the problem with buffered I/O was that
> you
> often end up waiting on page bits to clear (usually PG_writeback or
> PG_dirty), in non-killable sleeps for the most part.
> 
> Maybe the fix here is as simple as changing that?

At the risk of kicking off another O_PONIES discussion: Add an
open(O_TIMEOUT) flag that would let the kernel know that the
application is prepared to handle timeouts from operations such as
read(), write() and fsync(), then add an ioctl() or syscall to allow
said application to set the timeout value.
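
A hypothetical sketch of how that could look from the application side;
both O_TIMEOUT and the ioctl below are invented for illustration and
exist nowhere:

#include <fcntl.h>
#include <sys/ioctl.h>
#include <unistd.h>

/* Invented flag and ioctl number, placeholders only. */
#define O_TIMEOUT       040000000
#define FIOSETIOTIMEOUT 0x4604

int open_with_timeout(const char *path, int timeout_ms)
{
    /* Opening with O_TIMEOUT is the application's declaration that it
     * can cope with non-POSIX timeout errors from read/write/fsync. */
    int fd = open(path, O_RDWR | O_TIMEOUT);
    if (fd < 0)
        return -1;

    if (ioctl(fd, FIOSETIOTIMEOUT, &timeout_ms) != 0) {
        close(fd);
        return -1;
    }
    return fd;
}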


-- 
Trond Myklebust
Linux NFS client maintainer, PrimaryData
trond.myklebust@primarydata.com


* Re: [Lsf-pc] [LSF/MM TOPIC] I/O error handling and fsync()
  2017-01-23 23:09                       ` Trond Myklebust
@ 2017-01-24  0:16                           ` NeilBrown
  0 siblings, 0 replies; 53+ messages in thread
From: NeilBrown @ 2017-01-24  0:16 UTC (permalink / raw)
  To: Trond Myklebust, kwolf, jlayton, tytso
  Cc: lsf-pc, hch, riel, rwheeler, linux-mm, linux-fsdevel

On Mon, Jan 23 2017, Trond Myklebust wrote:

> On Mon, 2017-01-23 at 17:35 -0500, Jeff Layton wrote:
>> On Mon, 2017-01-23 at 11:09 +0100, Kevin Wolf wrote:
>> > 
>> > However, if we look at the greater problem of hanging requests that
>> > came
>> > up in the more recent emails of this thread, it is only moved
>> > rather
>> > than solved. Chances are that already write() would hang now
>> > instead of
>> > only fsync(), but we still have a hard time dealing with this.
>> > 
>> 
>> Well, it _is_ better with O_DIRECT as you can usually at least break
>> out
>> of the I/O with SIGKILL.
>> 
>> When I last looked at this, the problem with buffered I/O was that
>> you
>> often end up waiting on page bits to clear (usually PG_writeback or
>> PG_dirty), in non-killable sleeps for the most part.
>> 
>> Maybe the fix here is as simple as changing that?
>
> At the risk of kicking off another O_PONIES discussion: Add an
> open(O_TIMEOUT) flag that would let the kernel know that the
> application is prepared to handle timeouts from operations such as
> read(), write() and fsync(), then add an ioctl() or syscall to allow
> said application to set the timeout value.

I was thinking on very similar lines, though I'd use 'fcntl()' if
possible because it would be a per-"file description" option.
This would be a function of the page cache, and a filesystem wouldn't
need to know about it at all.  Once enabled, 'read', 'write', or 'fsync'
would return EWOULDBLOCK rather than waiting indefinitely.
It might be nice if 'select' could then be used on page-cache file
descriptors, but I think that is much harder.  Support for O_TIMEOUT would
be a practical first step - if someone agreed to actually try to use it.
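
A hypothetical sketch of the fcntl() variant and of what a caller would
have to do with the new error; F_SET_IO_TIMEOUT is invented here and is
not a real fcntl command:

#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* Invented fcntl command: arm an I/O timeout (ms) on this open file
 * description.  Placeholder only. */
#define F_SET_IO_TIMEOUT 1042

/* With the timeout armed, write() may fail with EWOULDBLOCK instead of
 * sleeping indefinitely on stuck page-cache writeback. */
ssize_t write_or_back_off(int fd, const void *buf, size_t len)
{
    if (fcntl(fd, F_SET_IO_TIMEOUT, 5000) != 0)
        return -1;

    for (;;) {
        ssize_t n = write(fd, buf, len);
        if (n >= 0 || errno != EWOULDBLOCK)
            return n;
        /* Timed out: storage looks stuck.  A caller like QEMU would
         * pause the guest here and wait for the admin; this sketch
         * just backs off and retries. */
        sleep(1);
    }
}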

Thanks,
NeilBrown


* Re: [Lsf-pc] [LSF/MM TOPIC] I/O error handling and fsync()
  2017-01-24  0:16                           ` NeilBrown
@ 2017-01-24  0:46                             ` Jeff Layton
  -1 siblings, 0 replies; 53+ messages in thread
From: Jeff Layton @ 2017-01-24  0:46 UTC (permalink / raw)
  To: NeilBrown, Trond Myklebust, kwolf, tytso
  Cc: lsf-pc, hch, riel, rwheeler, linux-mm, linux-fsdevel

On Tue, 2017-01-24 at 11:16 +1100, NeilBrown wrote:
> On Mon, Jan 23 2017, Trond Myklebust wrote:
> 
> > On Mon, 2017-01-23 at 17:35 -0500, Jeff Layton wrote:
> > > On Mon, 2017-01-23 at 11:09 +0100, Kevin Wolf wrote:
> > > > 
> > > > However, if we look at the greater problem of hanging requests that
> > > > came
> > > > up in the more recent emails of this thread, it is only moved
> > > > rather
> > > > than solved. Chances are that already write() would hang now
> > > > instead of
> > > > only fsync(), but we still have a hard time dealing with this.
> > > > 
> > > 
> > > Well, it _is_ better with O_DIRECT as you can usually at least break
> > > out
> > > of the I/O with SIGKILL.
> > > 
> > > When I last looked at this, the problem with buffered I/O was that
> > > you
> > > often end up waiting on page bits to clear (usually PG_writeback or
> > > PG_dirty), in non-killable sleeps for the most part.
> > > 
> > > Maybe the fix here is as simple as changing that?
> > 
> > At the risk of kicking off another O_PONIES discussion: Add an
> > open(O_TIMEOUT) flag that would let the kernel know that the
> > application is prepared to handle timeouts from operations such as
> > read(), write() and fsync(), then add an ioctl() or syscall to allow
> > said application to set the timeout value.
> 
> I was thinking on very similar lines, though I'd use 'fcntl()' if
> possible because it would be a per-"file description" option.
> This would be a function of the page cache, and a filesystem wouldn't
> need to know about it at all.  Once enable, 'read', 'write', or 'fsync'
> would return EWOULDBLOCK rather than waiting indefinitely.
> It might be nice if 'select' could then be used on page-cache file
> descriptors, but I think that is much harder.  Support O_TIMEOUT would
> be a practical first step - if someone agreed to actually try to use it.
> 

Yeah, that does seem like it might be worth exploring. 

That said, I think there's something even simpler we can do to make
things better for a lot of cases, and it may even help pave the way for
the proposal above.

Looking closer and remembering more, I think the main problem area when
the pages are stuck in writeback is the wait_on_page_writeback call in
places like wait_for_stable_page and __filemap_fdatawait_range.

That uses an uninterruptible sleep and it's common to see applications
stuck there in these situations. They're unkillable too so your only
recourse is to hard reset the box when you can't reestablish
connectivity.

I think it might be good to consider making some of those sleeps
TASK_KILLABLE. For instance, both of the above callers of those
functions are int return functions. It may be possible to return
ERESTARTSYS when the task catches a signal.

-- 
Jeff Layton <jlayton@poochiereds.net>


* Re: [Lsf-pc] [LSF/MM TOPIC] I/O error handling and fsync()
  2017-01-24  0:16                           ` NeilBrown
  (?)
  (?)
@ 2017-01-24  3:34                           ` Trond Myklebust
  2017-01-25 18:35                             ` Theodore Ts'o
  -1 siblings, 1 reply; 53+ messages in thread
From: Trond Myklebust @ 2017-01-24  3:34 UTC (permalink / raw)
  To: kwolf, jlayton, neilb, tytso
  Cc: hch, riel, linux-mm, rwheeler, lsf-pc, linux-fsdevel

On Tue, 2017-01-24 at 11:16 +1100, NeilBrown wrote:
> On Mon, Jan 23 2017, Trond Myklebust wrote:
> 
> > On Mon, 2017-01-23 at 17:35 -0500, Jeff Layton wrote:
> > > On Mon, 2017-01-23 at 11:09 +0100, Kevin Wolf wrote:
> > > > 
> > > > However, if we look at the greater problem of hanging requests
> > > > that
> > > > came
> > > > up in the more recent emails of this thread, it is only moved
> > > > rather
> > > > than solved. Chances are that already write() would hang now
> > > > instead of
> > > > only fsync(), but we still have a hard time dealing with this.
> > > > 
> > > 
> > > Well, it _is_ better with O_DIRECT as you can usually at least
> > > break
> > > out
> > > of the I/O with SIGKILL.
> > > 
> > > When I last looked at this, the problem with buffered I/O was
> > > that
> > > you
> > > often end up waiting on page bits to clear (usually PG_writeback
> > > or
> > > PG_dirty), in non-killable sleeps for the most part.
> > > 
> > > Maybe the fix here is as simple as changing that?
> > 
> > At the risk of kicking off another O_PONIES discussion: Add an
> > open(O_TIMEOUT) flag that would let the kernel know that the
> > application is prepared to handle timeouts from operations such as
> > read(), write() and fsync(), then add an ioctl() or syscall to
> > allow
> > said application to set the timeout value.
> 
> I was thinking on very similar lines, though I'd use 'fcntl()' if
> possible because it would be a per-"file description" option.
> This would be a function of the page cache, and a filesystem wouldn't
> need to know about it at all.  Once enable, 'read', 'write', or
> 'fsync'
> would return EWOULDBLOCK rather than waiting indefinitely.
> It might be nice if 'select' could then be used on page-cache file
> descriptors, but I think that is much harder.  Support O_TIMEOUT
> would
> be a practical first step - if someone agreed to actually try to use
> it.

The reason why I'm thinking open() is because it has to be a contract
between a specific application and the kernel. If the application
doesn't open the file with the O_TIMEOUT flag, then it shouldn't see
nasty non-POSIX timeout errors, even if there is another process that
is using that flag on the same file.

The only place where that is difficult to manage is when the file is
mmap()ed (no file descriptor), so you'd presumably have to disallow
mixing mmap and O_TIMEOUT.

-- 
Trond Myklebust
Linux NFS client maintainer, PrimaryData
trond.myklebust@primarydata.com


* Re: [Lsf-pc] [LSF/MM TOPIC] I/O error handling and fsync()
  2017-01-24  0:46                             ` Jeff Layton
@ 2017-01-24 21:58                               ` NeilBrown
  -1 siblings, 0 replies; 53+ messages in thread
From: NeilBrown @ 2017-01-24 21:58 UTC (permalink / raw)
  To: Jeff Layton, Trond Myklebust, kwolf, tytso
  Cc: lsf-pc, hch, riel, rwheeler, linux-mm, linux-fsdevel

On Mon, Jan 23 2017, Jeff Layton wrote:

> On Tue, 2017-01-24 at 11:16 +1100, NeilBrown wrote:
>> On Mon, Jan 23 2017, Trond Myklebust wrote:
>> 
>> > On Mon, 2017-01-23 at 17:35 -0500, Jeff Layton wrote:
>> > > On Mon, 2017-01-23 at 11:09 +0100, Kevin Wolf wrote:
>> > > > 
>> > > > However, if we look at the greater problem of hanging requests that
>> > > > came
>> > > > up in the more recent emails of this thread, it is only moved
>> > > > rather
>> > > > than solved. Chances are that already write() would hang now
>> > > > instead of
>> > > > only fsync(), but we still have a hard time dealing with this.
>> > > > 
>> > > 
>> > > Well, it _is_ better with O_DIRECT as you can usually at least break
>> > > out
>> > > of the I/O with SIGKILL.
>> > > 
>> > > When I last looked at this, the problem with buffered I/O was that
>> > > you
>> > > often end up waiting on page bits to clear (usually PG_writeback or
>> > > PG_dirty), in non-killable sleeps for the most part.
>> > > 
>> > > Maybe the fix here is as simple as changing that?
>> > 
>> > At the risk of kicking off another O_PONIES discussion: Add an
>> > open(O_TIMEOUT) flag that would let the kernel know that the
>> > application is prepared to handle timeouts from operations such as
>> > read(), write() and fsync(), then add an ioctl() or syscall to allow
>> > said application to set the timeout value.
>> 
>> I was thinking on very similar lines, though I'd use 'fcntl()' if
>> possible because it would be a per-"file description" option.
>> This would be a function of the page cache, and a filesystem wouldn't
>> need to know about it at all.  Once enable, 'read', 'write', or 'fsync'
>> would return EWOULDBLOCK rather than waiting indefinitely.
>> It might be nice if 'select' could then be used on page-cache file
>> descriptors, but I think that is much harder.  Support O_TIMEOUT would
>> be a practical first step - if someone agreed to actually try to use it.
>> 
>
> Yeah, that does seem like it might be worth exploring. 
>
> That said, I think there's something even simpler we can do to make
> things better for a lot of cases, and it may even help pave the way for
> the proposal above.
>
> Looking closer and remembering more, I think the main problem area when
> the pages are stuck in writeback is the wait_on_page_writeback call in
> places like wait_for_stable_page and __filemap_fdatawait_range.

I can't see wait_for_stable_page() being very relevant.  That only
blocks on backing devices which have requested stable pages.
raid5 sometimes does that.  Some scsi/sata devices can somehow.
And rbd (part of ceph) sometimes does.  I don't think NFS ever will.
wait_for_stable_page() doesn't currently return an error, so getting to
abort in SIGKILL would be a lot of work.

filemap_fdatawait_range() is much easier.

diff --git a/mm/filemap.c b/mm/filemap.c
index b772a33ef640..2773f6dde1da 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -401,7 +401,9 @@ static int __filemap_fdatawait_range(struct address_space *mapping,
 			if (page->index > end)
 				continue;
 
-			wait_on_page_writeback(page);
+			if (PageWriteback(page))
+				if (wait_on_page_bit_killable(page, PG_writeback))
+					err = -ERESTARTSYS;
 			if (TestClearPageError(page))
 				ret = -EIO;
 		}

That isn't a complete solution. There is code in f2fs which doesn't
check the return value and probably should.  And gfs2 calls
	mapping_set_error(mapping, error);
with the return value, which we probably don't want in the ERESTARTSYS case.
There are some usages in btrfs that I'd need to double-check too.

But it looks to be manageable. 

Thanks,
NeilBrown

>
> That uses an uninterruptible sleep and it's common to see applications
> stuck there in these situations. They're unkillable too so your only
> recourse is to hard reset the box when you can't reestablish
> connectivity.
>
> I think it might be good to consider making some of those sleeps
> TASK_KILLABLE. For instance, both of the above callers of those
> functions are int return functions. It may be possible to return
> ERESTARTSYS when the task catches a signal.
>
> -- 
> Jeff Layton <jlayton@poochiereds.net>


* Re: [Lsf-pc] [LSF/MM TOPIC] I/O error handling and fsync()
  2017-01-24 21:58                               ` NeilBrown
@ 2017-01-25 13:00                                 ` Jeff Layton
  -1 siblings, 0 replies; 53+ messages in thread
From: Jeff Layton @ 2017-01-25 13:00 UTC (permalink / raw)
  To: NeilBrown, Trond Myklebust, kwolf, tytso
  Cc: lsf-pc, hch, riel, rwheeler, linux-mm, linux-fsdevel

On Wed, 2017-01-25 at 08:58 +1100, NeilBrown wrote:
> On Mon, Jan 23 2017, Jeff Layton wrote:
> 
> > On Tue, 2017-01-24 at 11:16 +1100, NeilBrown wrote:
> > > On Mon, Jan 23 2017, Trond Myklebust wrote:
> > > 
> > > > On Mon, 2017-01-23 at 17:35 -0500, Jeff Layton wrote:
> > > > > On Mon, 2017-01-23 at 11:09 +0100, Kevin Wolf wrote:
> > > > > > 
> > > > > > However, if we look at the greater problem of hanging requests that
> > > > > > came
> > > > > > up in the more recent emails of this thread, it is only moved
> > > > > > rather
> > > > > > than solved. Chances are that already write() would hang now
> > > > > > instead of
> > > > > > only fsync(), but we still have a hard time dealing with this.
> > > > > > 
> > > > > 
> > > > > Well, it _is_ better with O_DIRECT as you can usually at least break
> > > > > out
> > > > > of the I/O with SIGKILL.
> > > > > 
> > > > > When I last looked at this, the problem with buffered I/O was that
> > > > > you
> > > > > often end up waiting on page bits to clear (usually PG_writeback or
> > > > > PG_dirty), in non-killable sleeps for the most part.
> > > > > 
> > > > > Maybe the fix here is as simple as changing that?
> > > > 
> > > > At the risk of kicking off another O_PONIES discussion: Add an
> > > > open(O_TIMEOUT) flag that would let the kernel know that the
> > > > application is prepared to handle timeouts from operations such as
> > > > read(), write() and fsync(), then add an ioctl() or syscall to allow
> > > > said application to set the timeout value.
> > > 
> > > I was thinking on very similar lines, though I'd use 'fcntl()' if
> > > possible because it would be a per-"file description" option.
> > > This would be a function of the page cache, and a filesystem wouldn't
> > > need to know about it at all.  Once enable, 'read', 'write', or 'fsync'
> > > would return EWOULDBLOCK rather than waiting indefinitely.
> > > It might be nice if 'select' could then be used on page-cache file
> > > descriptors, but I think that is much harder.  Support O_TIMEOUT would
> > > be a practical first step - if someone agreed to actually try to use it.
> > > 
> > 
> > Yeah, that does seem like it might be worth exploring. 
> > 
> > That said, I think there's something even simpler we can do to make
> > things better for a lot of cases, and it may even help pave the way for
> > the proposal above.
> > 
> > Looking closer and remembering more, I think the main problem area when
> > the pages are stuck in writeback is the wait_on_page_writeback call in
> > places like wait_for_stable_page and __filemap_fdatawait_range.
> 
> I can't see wait_for_stable_page() being very relevant.  That only
> blocks on backing devices which have requested stable pages.
> raid5 sometimes does that.  Some scsi/sata devices can somehow.
> And rbd (part of ceph) sometimes does.  I don't think NFS ever will.
> wait_for_stable_page() doesn't currently return an error, so getting to
> abort in SIGKILL would be a lot of work.
> 

Ahh right, I missed that it only affects pages backed by a BDI that has
BDI_CAP_STABLE_WRITES. Good.


> filemap_fdatawait_range() is much easier.
> 
> diff --git a/mm/filemap.c b/mm/filemap.c
> index b772a33ef640..2773f6dde1da 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -401,7 +401,9 @@ static int __filemap_fdatawait_range(struct address_space *mapping,
>  			if (page->index > end)
>  				continue;
>  
> -			wait_on_page_writeback(page);
> +			if (PageWriteback(page))
> +				if (wait_on_page_bit_killable(page, PG_writeback))
> +					err = -ERESTARTSYS;
>  			if (TestClearPageError(page))
>  				ret = -EIO;
>  		}
> 
> That isn't a complete solution. There is code in f2fs which doesn't
> check the return value and probably should.  And gfs2 calls
> 	mapping_set_error(mapping, error);
> with the return value, with we probably don't want in the ERESTARTSYS case.
> There are some usages in btrfs that I'd need to double-check too.
> 
> But it looks to be manageable. 
> 
> Thanks,
> NeilBrown
> 

Yeah, it does. The main worry I have is that this function is called all
over the place in fairly deep call chains. It definitely needs review
and testing (and probably a lot of related fixes like you mention).

We should also note that this is not really a fix for applications per
se. It's more of an administrative improvement, to allow admins to
kill off processes stuck waiting for an inode to finish writeback.

I think there may still be room for an interface like you and Trond were
debating. But, I think this might be a good first step and would improve
a lot of hung mount situations.
-- 
Jeff Layton <jlayton@poochiereds.net>


^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [Lsf-pc] [LSF/MM TOPIC] I/O error handling and fsync()
  2017-01-24  3:34                           ` Trond Myklebust
@ 2017-01-25 18:35                             ` Theodore Ts'o
  2017-01-26  0:36                                 ` NeilBrown
  0 siblings, 1 reply; 53+ messages in thread
From: Theodore Ts'o @ 2017-01-25 18:35 UTC (permalink / raw)
  To: Trond Myklebust
  Cc: kwolf, jlayton, neilb, hch, riel, linux-mm, rwheeler, lsf-pc,
	linux-fsdevel

On Tue, Jan 24, 2017 at 03:34:04AM +0000, Trond Myklebust wrote:
> The reason why I'm thinking open() is because it has to be a contract
> between a specific application and the kernel. If the application
> doesn't open the file with the O_TIMEOUT flag, then it shouldn't see
> nasty non-POSIX timeout errors, even if there is another process that
> is using that flag on the same file.
> 
> The only place where that is difficult to manage is when the file is
> mmap()ed (no file descriptor), so you'd presumably have to disallow
> mixing mmap and O_TIMEOUT.

Well, technically there *is* a file descriptor when you do an mmap.
You can close the fd after you call mmap(), but the mmap bumps the
refcount on the struct file while the memory map is active.

I would argue though that at least for buffered writes, the timeout
has to be a property of the underlying inode, and if there is an attempt
to set a timeout on an inode that already has a timeout set to some
other non-zero value, the "set timeout" operation should fail with a
"timeout already set".  That's because we really don't want to have to
keep track, on a per-page basis, which struct file was responsible for
dirtying a page --- and what if it is dirtied by two different file
descriptors?
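
Something like the following is the rule I have in mind (purely
illustrative: there is no i_io_timeout field and no "set timeout"
operation today):

#include <linux/fs.h>

/*
 * Hypothetical helper: inode->i_io_timeout does not exist.  It just
 * encodes the "one timeout per inode, otherwise fail" rule above.
 */
static int set_inode_io_timeout(struct inode *inode, unsigned long timeout_ms)
{
	int ret = 0;

	spin_lock(&inode->i_lock);
	if (inode->i_io_timeout && inode->i_io_timeout != timeout_ms)
		ret = -EBUSY;		/* "timeout already set" */
	else
		inode->i_io_timeout = timeout_ms;
	spin_unlock(&inode->i_lock);
	return ret;
}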

That being said, I suspect that for many applications, the timeout is
going to be *much* more interesting for O_DIRECT writes, and there we
can certainly have different timeouts on a per-fd basis.  This is
especially true for cases where the timeout is implemented in the storage
device, using multi-media extensions, and where the timeout might be
measured in milliseconds (e.g., no point reading a video frame if it's
been delayed too long).  That being said, the block layer would need to
know about this as well, since the timeout needs to be relative to
when the read(2) system call is issued, not to when it is finally
submitted to the storage device.

And if the process has suitable privileges, perhaps the I/O scheduler
should take the timeout into account, so that reads with a timeout
attached get submitted promptly, with the presumption that reads w/o a
timeout can afford to be queued.  If the process doesn't have suitable
privileges, or if the cgroup has exceeded its I/O quota, perhaps the right
answer would be to fail the read right away.  In the case of a cluster
file system, if a particular server knows it can't serve a
particular low latency read within the SLO, it might be worthwhile to
signal to the cluster file system client that it should start doing an
erasure code reconstruction right away (or read from one of the
mirrors if the file is stored with n=3 replication, etc.).

So depending on what the goals of userspace are, there are a number of
different kernel policies that might be the best match for the
particular application in question.  In particular, if you are trying
to provide low latency reads to assure decent response time for web
applications, it may be *reads* that are much more interesting for
timeout purposes rather than *writes*.

(Especially in a distributed system, you're going to be using some
kind of encoding with redundancy, so as long as enough of the writes
have completed, it doesn't matter if the other writes take a long time
--- although if you eventually decide that the write's never going to
make it, it's ideal if you can reshard the chunk more aggressively,
instead of waiting for the scrubbing pass to notice that some of the
redundant copies of the chunk had gotten corrupted or were never
written out.)

Cheers,

					- Ted


^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [Lsf-pc] [LSF/MM TOPIC] I/O error handling and fsync()
  2017-01-25 18:35                             ` Theodore Ts'o
@ 2017-01-26  0:36                                 ` NeilBrown
  0 siblings, 0 replies; 53+ messages in thread
From: NeilBrown @ 2017-01-26  0:36 UTC (permalink / raw)
  To: Theodore Ts'o, Trond Myklebust
  Cc: kwolf, jlayton, hch, riel, linux-mm, rwheeler, lsf-pc, linux-fsdevel

On Wed, Jan 25 2017, Theodore Ts'o wrote:

> On Tue, Jan 24, 2017 at 03:34:04AM +0000, Trond Myklebust wrote:
>> The reason why I'm thinking open() is because it has to be a contract
>> between a specific application and the kernel. If the application
>> doesn't open the file with the O_TIMEOUT flag, then it shouldn't see
>> nasty non-POSIX timeout errors, even if there is another process that
>> is using that flag on the same file.
>> 
>> The only place where that is difficult to manage is when the file is
>> mmap()ed (no file descriptor), so you'd presumably have to disallow
>> mixing mmap and O_TIMEOUT.
>
> Well, technically there *is* a file descriptor when you do an mmap.
> You can close the fd after you call mmap(), but the mmap bumps the
> refcount on the struct file while the memory map is active.
>
> I would argue though that at least for buffered writes, the timeout
> has to be property of the underlying inode, and if there is an attempt
> to set timeout on an inode that already has a timeout set to some
> other non-zero value, the "set timeout" operation should fail with a
> "timeout already set".  That's becuase we really don't want to have to
> keep track, on a per-page basis, which struct file was responsible for
> dirtying a page --- and what if it is dirtied by two different file
> descriptors?

You seem to have a very different idea to the one that is forming in my
mind.  In my vision, once the data has entered the page cache, it
doesn't matter at all where it came from.  It will remain in the page
cache, as a dirty page, until it is successfully written or until an
unrecoverable error occurs.  There are no timeouts once the data is in
the page cache.

Actually, I'm leaning away from timeouts in general.  I'm not against
them, but not entirely sure they are useful.

To be more specific, I imagine a new open flag "O_IO_NDELAY".  It is a
bit like O_NDELAY, but it explicitly affects IO, never the actual open()
call, and it is explicitly allowed on regular files and block devices.

When combined with O_DIRECT, it effectively means "no retries".  For
block devices and files backed by block devices,
REQ_FAILFAST_DEV|REQ_FAILFAST_TRANSPORT is used and a failure will be
reported as EWOULDBLOCK, unless it is obvious that retrying wouldn't
help.
Non-block-device filesystems would behave differently.  e.g. NFS would
probably use a RPC_TASK_SOFT call instead of the normal 'hard' call.

When used without O_DIRECT:
 - read would trigger read-ahead much as it does now (which can do
   nothing if there are resource issues) and would only return data
   if it was already in the cache.
 - write would try to allocate a page, tell the filesystem that it
   is dirty so that journal space is reserved or whatever is needed,
   and would tell the dirty_pages rate-limiting that another page was
   dirty.  If the rate-limiting reported that we cannot dirty a page
   without waiting, or if any other needed resources were not available,
   then the write would fail (-EWOULDBLOCK).
 - fsync would just fail if there were any dirty pages.  It might also
   do the equivalent of sync_file_range(SYNC_FILE_RANGE_WRITE) without
   any *WAIT* flags. (alternately, fsync could remain unchanged, and
   sync_file_range() could gain a SYNC_FILE_RANGE_TEST flag).


With O_DIRECT there would be a delay, but it would be limited and there
would be no retry.  There is not currently any way to impose a specific
delay on REQ_FAILFAST* requests.
Without O_DIRECT, there could be no significant delay, though code might
have to wait for a mutex or similar.
There are a few places that a timeout could usefully be inserted, but
I'm not sure that would be better than just having the app try again in
a little while - it would have to be prepared for that anyway.

I would like O_DIRECT|O_IO_NDELAY for mdadm so we could safely work with
devices that block when no paths are available.
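
As a rough illustration of how mdadm would use it (O_IO_NDELAY is of
course made up, so it is defined to 0 below purely as a placeholder):

#define _GNU_SOURCE		/* for O_DIRECT */
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

#ifndef O_IO_NDELAY
#define O_IO_NDELAY 0		/* placeholder for the proposed flag */
#endif

/* Probe a device that may have lost all paths without hanging forever. */
int main(int argc, char **argv)
{
	void *buf;
	int fd = open(argv[1], O_RDONLY | O_DIRECT | O_IO_NDELAY);

	if (fd < 0 || posix_memalign(&buf, 4096, 4096) != 0)
		return 1;
	if (pread(fd, buf, 4096, 0) < 0 && errno == EWOULDBLOCK)
		printf("%s: no usable path right now, not waiting\n", argv[1]);
	free(buf);
	close(fd);
	return 0;
}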

>
> That being said, I suspect that for many applications, the timeout is
> going to be *much* more interesting for O_DIRECT writes, and there we
> can certainly have different timeouts on a per-fd basis.  This is
> especially for cases where the timeout is implemented in storage
> device, using multi-media extensions, and where the timout might be
> measured in milliseconds (e.g., no point reading a video frame if its
> been delayed too long).  That being said, it block layer would need to
> know about this as well, since the timeout needs to be relative to
> when the read(2) system call is issued, not to when it is finally
> submitted to the storage device.

Yes. If a deadline could be added to "struct bio", and honoured by
drivers, then that would make a timeout much more interesting for
O_DIRECT.

Thanks,
NeilBrown


>
> And if the process has suitable privileges, perhaps the I/O scheduler
> should take the timeout into account, so that reads with a timeout
> attached should be submitted, with the presumption that reads w/o a
> timeout can afford to be queued.  If the process doesn't have suitable
> privileges, or if cgroup has exceeded its I/O quota, perhaps the right
> answer would be to fail the read right away.  In the case of a cluster
> file system such, if a particular server knows its can't serve a
> particular low latency read within the SLO, it might be worthwhile to
> signal to the cluster file system client that it should start doing an
> erasure code reconstruction right away (or read from one of the
> mirrors if the file is stored with n=3 replication, etc.)
>
> So depending on what the goals of userspace are, there are number of
> different kernel policies that might be the best match for the
> particular application in question.  In particular, if you are trying
> to provide low latency reads to assure decent response time for web
> applications, it may be *reads* that are much more interesting for
> timeout purposes rather than *writes*.
>
> (Especially in a distributed system, you're going to be using some
> kind of encoding with redundancy, so as long as enough of the writes
> have completed, it doesn't matter if the other writes take a long time
> --- although if you eventually decide that the write's never going to
> make it, it's ideal if you can reshard the chunk more aggressively,
> instead of waiting for the scurbbing pass to notice that some of the
> redundant copies of the chunk had gotten corrupted or were never
> written out.)
>
> Cheers,
>
> 					- Ted

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [Lsf-pc] [LSF/MM TOPIC] I/O error handling and fsync()
  2017-01-26  0:36                                 ` NeilBrown
  (?)
@ 2017-01-26  9:25                                 ` Jan Kara
  2017-01-26 22:19                                     ` NeilBrown
  -1 siblings, 1 reply; 53+ messages in thread
From: Jan Kara @ 2017-01-26  9:25 UTC (permalink / raw)
  To: NeilBrown
  Cc: Theodore Ts'o, Trond Myklebust, kwolf, riel, hch, linux-mm,
	linux-fsdevel, jlayton, lsf-pc, rwheeler

On Thu 26-01-17 11:36:35, NeilBrown wrote:
> On Wed, Jan 25 2017, Theodore Ts'o wrote:
> > On Tue, Jan 24, 2017 at 03:34:04AM +0000, Trond Myklebust wrote:
> >> The reason why I'm thinking open() is because it has to be a contract
> >> between a specific application and the kernel. If the application
> >> doesn't open the file with the O_TIMEOUT flag, then it shouldn't see
> >> nasty non-POSIX timeout errors, even if there is another process that
> >> is using that flag on the same file.
> >> 
> >> The only place where that is difficult to manage is when the file is
> >> mmap()ed (no file descriptor), so you'd presumably have to disallow
> >> mixing mmap and O_TIMEOUT.
> >
> > Well, technically there *is* a file descriptor when you do an mmap.
> > You can close the fd after you call mmap(), but the mmap bumps the
> > refcount on the struct file while the memory map is active.
> >
> > I would argue though that at least for buffered writes, the timeout
> > has to be property of the underlying inode, and if there is an attempt
> > to set timeout on an inode that already has a timeout set to some
> > other non-zero value, the "set timeout" operation should fail with a
> > "timeout already set".  That's becuase we really don't want to have to
> > keep track, on a per-page basis, which struct file was responsible for
> > dirtying a page --- and what if it is dirtied by two different file
> > descriptors?
> 
> You seem to have a very different idea to the one that is forming in my
> mind.  In my vision, once the data has entered the page cache, it
> doesn't matter at all where it came from.  It will remain in the page
> cache, as a dirty page, until it is successfully written or until an
> unrecoverable error occurs.  There are no timeouts once the data is in
> the page cache.

Heh, this has somehow drifted away from the original topic of handling IO
errors :)

> Actually, I'm leaning away from timeouts in general.  I'm not against
> them, but not entirely sure they are useful.
> 
> To be more specific, I imagine a new open flag "O_IO_NDELAY".  It is a
> bit like O_NDELAY, but it explicitly affects IO, never the actual open()
> call, and it is explicitly allowed on regular files and block devices.
> 
> When combined with O_DIRECT, it effectively means "no retries".  For
> block devices and files backed by block devices,
> REQ_FAILFAST_DEV|REQ_FAILFAST_TRANSPORT is used and a failure will be
> reported as EWOULDBLOCK, unless it is obvious that retrying wouldn't
> help.
> Non-block-device filesystems would behave differently.  e.g. NFS would
> probably use a RPC_TASK_SOFT call instead of the normal 'hard' call.
> 
> When used without O_DIRECT:
>  - read would trigger read-ahead much as it does now (which can do
>    nothing if there are resource issues) and would only return data
>    if it was already in the cache.

There was a patch set which did this [1]. Not on a per-fd basis but rather on
a per-IO basis. Andrew blocked it because he was convinced that mincore() is
a good enough interface for this.

>  - write would try to allocate a page, tell the filesystem that it
>    is dirty so that journal space is reserved or whatever is needed,
>    and would tell the dirty_pages rate-limiting that another page was
>    dirty.  If the rate-limiting reported that we cannot dirty a page
>    without waiting, or if any other needed resources were not available,
>    then the write would fail (-EWOULDBLOCK).
>  - fsync would just fail if there were any dirty pages.  It might also
>    do the equivalent of sync_file_range(SYNC_FILE_RANGE_WRITE) without
>    any *WAIT* flags. (alternately, fsync could remain unchanged, and
>    sync_file_range() could gain a SYNC_FILE_RANGE_TEST flag).
> 
> 
> With O_DIRECT there would be a delay, but it would be limited and there
> would be no retry.  There is not currently any way to impose a specific
> delay on REQ_FAILFAST* requests.
> Without O_DIRECT, there could be no significant delay, though code might
> have to wait for a mutex or similar.
> There are a few places that a timeout could usefully be inserted, but
> I'm not sure that would be better than just having the app try again in
> a little while - it would have to be prepared for that anyway.
> 
> I would like O_DIRECT|O_IO_NDELAY for mdadm so we could safely work with
> devices that block when no paths are available.

For O_DIRECT writes, there are database people who want to do non-blocking
AIO writes. Although the problem they want to solve is different - rather
similar to the one patch set [1] is trying to solve for buffered reads -
they want to do AIO writes and they want them truly non-blocking so they can
do IO submission directly from the computation thread without the cost of
offloading to a different process which normally does the IO.

Now you need something different for mdadm but interfaces should probably
be consistent...

> > That being said, I suspect that for many applications, the timeout is
> > going to be *much* more interesting for O_DIRECT writes, and there we
> > can certainly have different timeouts on a per-fd basis.  This is
> > especially for cases where the timeout is implemented in storage
> > device, using multi-media extensions, and where the timout might be
> > measured in milliseconds (e.g., no point reading a video frame if its
> > been delayed too long).  That being said, it block layer would need to
> > know about this as well, since the timeout needs to be relative to
> > when the read(2) system call is issued, not to when it is finally
> > submitted to the storage device.
> 
> Yes. If a deadline could be added to "struct bio", and honoured by
> drivers, then that would make a timeout much more interesting for
> O_DIRECT.

Timeouts are nice but IMO a lot of work and I suspect you'd really need a
dedicated "real-time" IO scheduler for this.

								Honza

[1] https://lwn.net/Articles/636955/

-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR


^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [Lsf-pc] [LSF/MM TOPIC] I/O error handling and fsync()
  2017-01-26  9:25                                 ` Jan Kara
@ 2017-01-26 22:19                                     ` NeilBrown
  0 siblings, 0 replies; 53+ messages in thread
From: NeilBrown @ 2017-01-26 22:19 UTC (permalink / raw)
  To: Jan Kara
  Cc: Theodore Ts'o, Trond Myklebust, kwolf, riel, hch, linux-mm,
	linux-fsdevel, jlayton, lsf-pc, rwheeler

On Thu, Jan 26 2017, Jan Kara wrote:

> On Thu 26-01-17 11:36:35, NeilBrown wrote:
>> On Wed, Jan 25 2017, Theodore Ts'o wrote:
>> > On Tue, Jan 24, 2017 at 03:34:04AM +0000, Trond Myklebust wrote:
>> >> The reason why I'm thinking open() is because it has to be a contract
>> >> between a specific application and the kernel. If the application
>> >> doesn't open the file with the O_TIMEOUT flag, then it shouldn't see
>> >> nasty non-POSIX timeout errors, even if there is another process that
>> >> is using that flag on the same file.
>> >> 
>> >> The only place where that is difficult to manage is when the file is
>> >> mmap()ed (no file descriptor), so you'd presumably have to disallow
>> >> mixing mmap and O_TIMEOUT.
>> >
>> > Well, technically there *is* a file descriptor when you do an mmap.
>> > You can close the fd after you call mmap(), but the mmap bumps the
>> > refcount on the struct file while the memory map is active.
>> >
>> > I would argue though that at least for buffered writes, the timeout
>> > has to be property of the underlying inode, and if there is an attempt
>> > to set timeout on an inode that already has a timeout set to some
>> > other non-zero value, the "set timeout" operation should fail with a
>> > "timeout already set".  That's becuase we really don't want to have to
>> > keep track, on a per-page basis, which struct file was responsible for
>> > dirtying a page --- and what if it is dirtied by two different file
>> > descriptors?
>> 
>> You seem to have a very different idea to the one that is forming in my
>> mind.  In my vision, once the data has entered the page cache, it
>> doesn't matter at all where it came from.  It will remain in the page
>> cache, as a dirty page, until it is successfully written or until an
>> unrecoverable error occurs.  There are no timeouts once the data is in
>> the page cache.
>
> Heh, this has somehow drifted away from the original topic of handling IO
> errors :)

I don't think it has.
The original topic was about graceful handling of recoverable IO errors.
The question was framed as being about retrying fsync() if it reported an
error, but this was based on a misunderstanding.  fsync() doesn't report
an error for recoverable errors.  It hangs.
So the original topic is really about gracefully handling IO operations
which currently can hang indefinitely.


>
>> Actually, I'm leaning away from timeouts in general.  I'm not against
>> them, but not entirely sure they are useful.
>> 
>> To be more specific, I imagine a new open flag "O_IO_NDELAY".  It is a
>> bit like O_NDELAY, but it explicitly affects IO, never the actual open()
>> call, and it is explicitly allowed on regular files and block devices.
>> 
>> When combined with O_DIRECT, it effectively means "no retries".  For
>> block devices and files backed by block devices,
>> REQ_FAILFAST_DEV|REQ_FAILFAST_TRANSPORT is used and a failure will be
>> reported as EWOULDBLOCK, unless it is obvious that retrying wouldn't
>> help.
>> Non-block-device filesystems would behave differently.  e.g. NFS would
>> probably use a RPC_TASK_SOFT call instead of the normal 'hard' call.
>> 
>> When used without O_DIRECT:
>>  - read would trigger read-ahead much as it does now (which can do
>>    nothing if there are resource issues) and would only return data
>>    if it was already in the cache.
>
> There was a patch set which did this [1]. Not on per-fd basis but rather on
> per-IO basis. Andrew blocked it because he was convinced that mincore() is
> good enough interface for this.

Thanks for the link.  mincore() won't trigger read-ahead of course, but
fadvise() can do that.  I don't see how it can provide a hard guarantee
that a subsequent read won't block though.
Still, interesting to be reminded of the history, thanks.
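
For reference, the fadvise()/mincore() dance presumably looks something
like the sketch below; it can hint and it can check, but nothing stops
the page being reclaimed before the subsequent read():

#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Sketch only: assumes 'off' is page aligned and 'len' is at most one
 * page.  "Probably cached" is the strongest statement it can make.
 */
static int probably_cached(int fd, off_t off, size_t len)
{
	unsigned char vec;
	void *p = mmap(NULL, len, PROT_READ, MAP_SHARED, fd, off);
	int resident = 0;

	if (p == MAP_FAILED)
		return 0;
	posix_fadvise(fd, off, len, POSIX_FADV_WILLNEED);	/* start read-ahead */
	if (mincore(p, len, &vec) == 0)
		resident = vec & 1;
	munmap(p, len);
	return resident;
}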

>
>>  - write would try to allocate a page, tell the filesystem that it
>>    is dirty so that journal space is reserved or whatever is needed,
>>    and would tell the dirty_pages rate-limiting that another page was
>>    dirty.  If the rate-limiting reported that we cannot dirty a page
>>    without waiting, or if any other needed resources were not available,
>>    then the write would fail (-EWOULDBLOCK).
>>  - fsync would just fail if there were any dirty pages.  It might also
>>    do the equivalent of sync_file_range(SYNC_FILE_RANGE_WRITE) without
>>    any *WAIT* flags. (alternately, fsync could remain unchanged, and
>>    sync_file_range() could gain a SYNC_FILE_RANGE_TEST flag).
>> 
>> 
>> With O_DIRECT there would be a delay, but it would be limited and there
>> would be no retry.  There is not currently any way to impose a specific
>> delay on REQ_FAILFAST* requests.
>> Without O_DIRECT, there could be no significant delay, though code might
>> have to wait for a mutex or similar.
>> There are a few places that a timeout could usefully be inserted, but
>> I'm not sure that would be better than just having the app try again in
>> a little while - it would have to be prepared for that anyway.
>> 
>> I would like O_DIRECT|O_IO_NDELAY for mdadm so we could safely work with
>> devices that block when no paths are available.
>
> For O_DIRECT writes, there are database people who want to do non-blocking
> AIO writes. Although the problem they want to solve is different - rather
> similar to the one patch set [1] is trying to solve for buffered reads -
> they want to do AIO write and they want it really non-blocking so they can
> do IO submission directly from computation thread without the cost of the
> offload to a different process which normally does the IO.

And aio_write() isn't non-blocking for O_DIRECT already because .... oh,
it doesn't even try.  Is there something intrinsically hard about async
O_DIRECT writes, or is it just that no-one has written acceptable code
yet?

>
> Now you need something different for mdadm but interfaces should probably
> be consistent...

A truly async O_DIRECT aio_write() combined with a working io_cancel()
would probably be sufficient.  The block layer doesn't provide any way
to cancel a bio though, so that would need to be wired up.
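
In other words, something like the sketch below - the libaio calls are
real, but io_cancel() doesn't currently reach the block layer, so this
shows what the interface would allow rather than what works today:

#include <libaio.h>
#include <time.h>

/* 'buf' must already be aligned suitably for O_DIRECT on 'fd'. */
static int write_with_deadline(int fd, void *buf, size_t len)
{
	io_context_t ctx = 0;
	struct iocb cb, *cbs[1] = { &cb };
	struct io_event ev;
	struct timespec timeout = { .tv_sec = 30 };	/* arbitrary deadline */
	int ret = 0;

	if (io_setup(1, &ctx))
		return -1;
	io_prep_pwrite(&cb, fd, buf, len, 0);
	io_submit(ctx, 1, cbs);
	if (io_getevents(ctx, 1, 1, &ev, &timeout) < 1) {
		io_cancel(ctx, &cb, &ev);	/* give up instead of hanging */
		ret = -1;
	}
	io_destroy(ctx);
	return ret;
}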

I'm not sure the two cases are all that similar though.
In one case the app doesn't want to wait at all.  In the other it is
happy to wait, but would prefer an error to indefinite retries.

>
>> > That being said, I suspect that for many applications, the timeout is
>> > going to be *much* more interesting for O_DIRECT writes, and there we
>> > can certainly have different timeouts on a per-fd basis.  This is
>> > especially for cases where the timeout is implemented in storage
>> > device, using multi-media extensions, and where the timout might be
>> > measured in milliseconds (e.g., no point reading a video frame if its
>> > been delayed too long).  That being said, it block layer would need to
>> > know about this as well, since the timeout needs to be relative to
>> > when the read(2) system call is issued, not to when it is finally
>> > submitted to the storage device.
>> 
>> Yes. If a deadline could be added to "struct bio", and honoured by
>> drivers, then that would make a timeout much more interesting for
>> O_DIRECT.
>
> Timeouts are nice but IMO a lot of work and I suspect you'd really need a
> dedicated "real-time" IO scheduler for this.

While such a scheduler might be nice, I think it would solve a different
problem.
There is a difference between:
 "please do your best to complete before this time"
and
 "don't even bother trying to complete after this time".

Both could be useful.  We shouldn't reject one because the other is too
hard.

Thanks,
NeilBrown

>
> 								Honza
>
> [1] https://lwn.net/Articles/636955/
>
> -- 
> Jan Kara <jack@suse.com>
> SUSE Labs, CR

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [Lsf-pc] [LSF/MM TOPIC] I/O error handling and fsync()
  2017-01-26 22:19                                     ` NeilBrown
  (?)
@ 2017-01-27  3:23                                     ` Theodore Ts'o
  2017-01-27  6:03                                         ` NeilBrown
  2017-01-30 16:04                                       ` Jan Kara
  -1 siblings, 2 replies; 53+ messages in thread
From: Theodore Ts'o @ 2017-01-27  3:23 UTC (permalink / raw)
  To: NeilBrown
  Cc: Jan Kara, Trond Myklebust, kwolf, riel, hch, linux-mm,
	linux-fsdevel, jlayton, lsf-pc, rwheeler

On Fri, Jan 27, 2017 at 09:19:10AM +1100, NeilBrown wrote:
> I don't think it has.
> The original topic was about gracefully handling of recoverable IO errors.
> The question was framed as about retrying fsync() is it reported an
> error, but this was based on a misunderstand.  fsync() doesn't report
> an error for recoverable errors.  It hangs.
> So the original topic is really about gracefully handling IO operations
> which currently can hang indefinitely.

Well, the problem is that it is up to the device driver to decide when
an error is recoverable or not.  This might include waiting X minutes,
and then deciding that the fibre channel connection isn't coming back,
and then turning it into an unrecoverable error.  Or for other
devices, the timeout might be much smaller.

Which is fine --- I think that's where the decision ought to live, and
if users want to tune a different timeout before the driver stops
waiting, that should be between the system administrator and the
device driver /sys tuning knob.

> >> When combined with O_DIRECT, it effectively means "no retries".  For
> >> block devices and files backed by block devices,
> >> REQ_FAILFAST_DEV|REQ_FAILFAST_TRANSPORT is used and a failure will be
> >> reported as EWOULDBLOCK, unless it is obvious that retrying wouldn't
> >> help.

Absolutely no retries?  Even TCP retries in the case of iSCSI?  I
don't think turning every TCP packet drop into EWOULDBLOCK would make
sense under any circumstances.  What might make sense is to have a
"short timeout" where it's up to the block device to decide what
"short timeout" means.

EWOULDBLOCK is also a little misleading, because even if the I/O
request is submitted immediately to the block device and immediately
serviced and returned, the I/O request would still be "blocking".
Maybe ETIMEDOUT instead?

> And aio_write() isn't non-blocking for O_DIRECT already because .... oh,
> it doesn't even try.  Is there something intrinsically hard about async
> O_DIRECT writes, or is it just that no-one has written acceptable code
> yet?

AIO/DIO writes can indeed be non-blocking, if the file system doesn't
need to do any metadata operations.  So if the file is preallocated,
you should be able to issue an async DIO write without losing the CPU.

> A truly async O_DIRECT aio_write() combined with a working io_cancel()
> would probably be sufficient.  The block layer doesn't provide any way
> to cancel a bio though, so that would need to be wired up.

Kent Overstreet worked up io_cancel for AIO/DIO writes when he was at
Google.  As I recall the patchset did get posted a few times, but it
never ended up getting accepted for upstream adoption.

We even had some very rough code that would propagate the cancellation
request to the hard drive, for those hard drives that had a facility
for accepting a cancellation request for an I/O which was queued via
NCQ but which hadn't executed yet.  It sort-of worked, but it never
hit a state where it could be published before the project was
abandoned.

						- Ted


^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [Lsf-pc] [LSF/MM TOPIC] I/O error handling and fsync()
  2017-01-27  3:23                                     ` Theodore Ts'o
@ 2017-01-27  6:03                                         ` NeilBrown
  2017-01-30 16:04                                       ` Jan Kara
  1 sibling, 0 replies; 53+ messages in thread
From: NeilBrown @ 2017-01-27  6:03 UTC (permalink / raw)
  To: Theodore Ts'o
  Cc: Jan Kara, Trond Myklebust, kwolf, riel, hch, linux-mm,
	linux-fsdevel, jlayton, lsf-pc, rwheeler

On Thu, Jan 26 2017, Theodore Ts'o wrote:

> On Fri, Jan 27, 2017 at 09:19:10AM +1100, NeilBrown wrote:
>> I don't think it has.
>> The original topic was about gracefully handling of recoverable IO errors.
>> The question was framed as about retrying fsync() is it reported an
>> error, but this was based on a misunderstand.  fsync() doesn't report
>> an error for recoverable errors.  It hangs.
>> So the original topic is really about gracefully handling IO operations
>> which currently can hang indefinitely.
>
> Well, the problem is that it is up to the device driver to decide when
> an error is recoverable or not.  This might include waiting X minutes,
> and then deciding that the fibre channel connection isn't coming back,
> and then turning it into an unrecoverable error.  Or for other
> devices, the timeout might be much smaller.
>
> Which is fine --- I think that's where the decision ought to live, and
> if users want to tune a different timeout before the driver stops
> waiting, that should be between the system administrator and the
> device driver /sys tuning knob.

Completely agree.  Whether a particular condition should be treated as
recoverable or unrecoverable is a question that driver authors and
sysadmins could reasonably provide input to.
But once that decision has been made, the application must accept the
decision.  EIO means unrecoverable.  There is never any point retrying.
Recoverable manifests as a hang, awaiting recovery.

I recently noticed that PG_error is effectively meaningless for write
errors.  filemap_fdatawait_range() can clear it, and the return value is
often ignored. AS_EIO is the really meaningful flag for write errors,
and it is per-file, not per-page.
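
For reference, the per-file reporting boils down to roughly this pattern
(a paraphrase of what mm/filemap.c does, not a verbatim copy):

#include <linux/fs.h>
#include <linux/pagemap.h>

/*
 * A writeback failure sets AS_EIO/AS_ENOSPC on the mapping via
 * mapping_set_error(); the next caller that checks the flags reports
 * the error once and clears it.
 */
static int report_mapping_error(struct address_space *mapping)
{
	int ret = 0;

	if (test_and_clear_bit(AS_ENOSPC, &mapping->flags))
		ret = -ENOSPC;
	if (test_and_clear_bit(AS_EIO, &mapping->flags))
		ret = -EIO;
	return ret;
}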

>
>> >> When combined with O_DIRECT, it effectively means "no retries".  For
>> >> block devices and files backed by block devices,
>> >> REQ_FAILFAST_DEV|REQ_FAILFAST_TRANSPORT is used and a failure will be
>> >> reported as EWOULDBLOCK, unless it is obvious that retrying wouldn't
>> >> help.
>
> Absolutely no retries?  Even TCP retries in the case of iSCSI?  I
> don't think turning every TCP packet drop into EWOULDBLOCK would make
> sense under any circumstances.  What might make sense is to have a
> "short timeout" where it's up to the block device to decide what
> "short timeout" means.

The implemented semantics of REQ_FAILFAST_* are to disable retries on
certain types of failure.  That is what I meant to refer to.
There are retries at many levels in the protocol stack, from the
collision-detection retries at the data-link layer, up through the
packet, connection and command levels.  Some have predefined timeouts
and should be left alone.  Others have no timeout and need to be
disabled.  There are probably others in the middle.
I was looking for a semantic that could be implemented on top of the
current interfaces, which means working with the REQ_FAILFAST_* semantic.
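
On the block-layer side the sketch is nothing more exotic than tagging
the bios submitted on behalf of an opted-in file.  This is a
hypothetical kernel-side fragment, not current page-cache writeback
behaviour:

#include <linux/bio.h>
#include <linux/blk_types.h>

static void submit_failfast(struct bio *bio)
{
	/* Suppress the retries that the device and transport layers
	 * would otherwise perform; layers with their own well-defined
	 * timeouts (e.g. TCP under an iSCSI transport) are untouched. */
	bio->bi_opf |= REQ_FAILFAST_DEV | REQ_FAILFAST_TRANSPORT;
	submit_bio(bio);
}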

>
> EWOULDBLOCK is also a little misleading, because even if the I/O
> request is submitted immediately to the block device and immediately
> serviced and returned, the I/O request would still be "blocking".
> Maybe ETIMEDOUT instead?

Maybe - I won't argue.

>
>> And aio_write() isn't non-blocking for O_DIRECT already because .... oh,
>> it doesn't even try.  Is there something intrinsically hard about async
>> O_DIRECT writes, or is it just that no-one has written acceptable code
>> yet?
>
> AIO/DIO writes can indeed be non-blocking, if the file system doesn't
> need to do any metadata operations.  So if the file is preallocated,
> you should be able to issue an async DIO write without losing the CPU.

Yes, I see that now.  I misread some of the code.
Thanks.

NeilBrown


>
>> A truly async O_DIRECT aio_write() combined with a working io_cancel()
>> would probably be sufficient.  The block layer doesn't provide any way
>> to cancel a bio though, so that would need to be wired up.
>
> Kent Overstreet worked up io_cancel for AIO/DIO writes when he was at
> Google.  As I recall the patchset did get posted a few times, but it
> never ended up getting accepted for upstream adoption.
>
> We even had some very rough code that would propagate the cancellation
> request to the hard drive, for those hard drives that had a facility
> for accepting a cancellation request for an I/O which was queued via
> NCQ but which hadn't executed yet.  It sort-of worked, but it never
> hit a state where it could be published before the project was
> abandoned.
>
> 						- Ted

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 832 bytes --]

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [Lsf-pc] [LSF/MM TOPIC] I/O error handling and fsync()
  2017-01-25 13:00                                 ` Jeff Layton
@ 2017-01-30  5:30                                   ` NeilBrown
  -1 siblings, 0 replies; 53+ messages in thread
From: NeilBrown @ 2017-01-30  5:30 UTC (permalink / raw)
  To: Jeff Layton, Trond Myklebust, kwolf, tytso
  Cc: lsf-pc, hch, riel, rwheeler, linux-mm, linux-fsdevel

[-- Attachment #1: Type: text/plain, Size: 7093 bytes --]

On Wed, Jan 25 2017, Jeff Layton wrote:

> On Wed, 2017-01-25 at 08:58 +1100, NeilBrown wrote:
>> On Mon, Jan 23 2017, Jeff Layton wrote:
>> 
>> > On Tue, 2017-01-24 at 11:16 +1100, NeilBrown wrote:
>> > > On Mon, Jan 23 2017, Trond Myklebust wrote:
>> > > 
>> > > > On Mon, 2017-01-23 at 17:35 -0500, Jeff Layton wrote:
>> > > > > On Mon, 2017-01-23 at 11:09 +0100, Kevin Wolf wrote:
>> > > > > > 
>> > > > > > However, if we look at the greater problem of hanging requests that
>> > > > > > came
>> > > > > > up in the more recent emails of this thread, it is only moved
>> > > > > > rather
>> > > > > > than solved. Chances are that already write() would hang now
>> > > > > > instead of
>> > > > > > only fsync(), but we still have a hard time dealing with this.
>> > > > > > 
>> > > > > 
>> > > > > Well, it _is_ better with O_DIRECT as you can usually at least break
>> > > > > out
>> > > > > of the I/O with SIGKILL.
>> > > > > 
>> > > > > When I last looked at this, the problem with buffered I/O was that
>> > > > > you
>> > > > > often end up waiting on page bits to clear (usually PG_writeback or
>> > > > > PG_dirty), in non-killable sleeps for the most part.
>> > > > > 
>> > > > > Maybe the fix here is as simple as changing that?
>> > > > 
>> > > > At the risk of kicking off another O_PONIES discussion: Add an
>> > > > open(O_TIMEOUT) flag that would let the kernel know that the
>> > > > application is prepared to handle timeouts from operations such as
>> > > > read(), write() and fsync(), then add an ioctl() or syscall to allow
>> > > > said application to set the timeout value.
>> > > 
>> > > I was thinking on very similar lines, though I'd use 'fcntl()' if
>> > > possible because it would be a per-"file description" option.
>> > > This would be a function of the page cache, and a filesystem wouldn't
>> > > need to know about it at all.  Once enabled, 'read', 'write', or 'fsync'
>> > > would return EWOULDBLOCK rather than waiting indefinitely.
>> > > It might be nice if 'select' could then be used on page-cache file
>> > > descriptors, but I think that is much harder.  Supporting O_TIMEOUT would
>> > > be a practical first step - if someone agreed to actually try to use it.
>> > > 
>> > 
>> > Yeah, that does seem like it might be worth exploring. 
>> > 
>> > That said, I think there's something even simpler we can do to make
>> > things better for a lot of cases, and it may even help pave the way for
>> > the proposal above.
>> > 
>> > Looking closer and remembering more, I think the main problem area when
>> > the pages are stuck in writeback is the wait_on_page_writeback call in
>> > places like wait_for_stable_page and __filemap_fdatawait_range.
>> 
>> I can't see wait_for_stable_page() being very relevant.  That only
>> blocks on backing devices which have requested stable pages.
>> raid5 sometimes does that.  Some scsi/sata devices can somehow.
>> And rbd (part of ceph) sometimes does.  I don't think NFS ever will.
>> wait_for_stable_page() doesn't currently return an error, so getting to
>> abort in SIGKILL would be a lot of work.
>> 
>
> Ahh right, I missed that it only affects pages backed by a BDI that has
> BDI_CAP_STABLE_WRITES. Good.
>
>
>> filemap_fdatawait_range() is much easier.
>> 
>> diff --git a/mm/filemap.c b/mm/filemap.c
>> index b772a33ef640..2773f6dde1da 100644
>> --- a/mm/filemap.c
>> +++ b/mm/filemap.c
>> @@ -401,7 +401,9 @@ static int __filemap_fdatawait_range(struct address_space *mapping,
>>  			if (page->index > end)
>>  				continue;
>>  
>> -			wait_on_page_writeback(page);
>> +			if (PageWriteback(page))
>> +				if (wait_on_page_bit_killable(page, PG_writeback))
>> +					ret = -ERESTARTSYS;
>>  			if (TestClearPageError(page))
>>  				ret = -EIO;
>>  		}
>> 
>> That isn't a complete solution. There is code in f2fs which doesn't
>> check the return value and probably should.  And gfs2 calls
>> 	mapping_set_error(mapping, error);
>> with the return value, which we probably don't want in the ERESTARTSYS case.
>> There are some usages in btrfs that I'd need to double-check too.
>> 
>> But it looks to be manageable. 
>> 
>> Thanks,
>> NeilBrown
>> 
>
> Yeah, it does. The main worry I have is that this function is called all
> over the place in fairly deep call chains. It definitely needs review
> and testing (and probably a lot of related fixes like you mention).

I decided to dig a bit more deeply into this.  There aren't all *that*
many calls to filemap_fdatawait() itself, but when you add in calls
through filemap_write_and_wait(), it starts to add up quite a lot.

One interesting semantic of filemap_fdatawait() is that it calls
filemap_check_errors() which clears the AS_ENOSPC and AS_EIO bits.
This means that you need to be careful to save the error status,
otherwise the application might never see that a write error happened.
Some places aren't as careful as they should be.

Several filesystems use filemap_fdatawait() or similar to flush dirty
pages before setattr or file locking.  9p is particularly guilty, but
it isn't the only one.  The filesystem cannot usefully return EIO from
either of these operations, so it really needs to set the AS_E* flag
again afterwards so that a later fsync() still sees the error.  Some
do, but many don't.
There is a filemap_fdatawait_keep_errors() interface that avoids
clearing the flags, but it isn't exported, so filesystems don't use it.
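
To make the trap concrete, here is a sketch with made-up function names
(not code from any particular filesystem):

static int careless_flush(struct address_space *mapping)
{
	filemap_fdatawait(mapping);	/* AS_EIO/AS_ENOSPC cleared and lost */
	return 0;			/* a later fsync() now reports success */
}

static int careful_flush(struct address_space *mapping)
{
	int err = filemap_fdatawait(mapping);

	/* put the error back so that a later fsync() still sees it */
	if (err)
		mapping_set_error(mapping, err);
	return err;
}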

One strangeness that caught my eye is the O_DIRECT write handling in
btrfs.
If it cannot write directly to the device (e.g. if it has to allocate
space first) it falls back to buffered writes and then flushes.
If it successfully writes some blocks directly to the device, and only
has to fall back for part of the write, then any EIO which happens
during the later part of the write gets lost.


It is a bit of a mess really.  Some of it is simple coding errors,
possibly based on not understanding the requirements properly.  But part
of it is, I think, because the filemap_fdatawait() interface is too
easy to misuse.  It would be safer if filemap_fdatawait() never cleared
the AS_E* flags, and didn't even return an error.
filemap_check_errors() could then be called separately where needed,
and possibly a filemap_fdatawait_errors() (or
filemap_write_and_wait_errors()) helper could be provided if it proved
useful enough.
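
As a rough illustration of the split I mean (entirely hypothetical,
names included):

/* waiting and error reporting become separate operations */
void filemap_fdatawait(struct address_space *mapping);		/* wait only */
int filemap_check_errors(struct address_space *mapping);	/* test-and-clear AS_E* */

/* provided only if enough callers actually want the combination */
static inline int filemap_fdatawait_errors(struct address_space *mapping)
{
	filemap_fdatawait(mapping);
	return filemap_check_errors(mapping);
}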

Probably many of the places that want the errors would be OK with EINTR
on a SIGKILL.  The many places that don't check the return value at all
certainly wouldn't cope with an interrupted wait.

I'll have a play and see how easy it is to clean bits of this up.

NeilBrown


>
> We should also note that this is not really a fix for applications,
> per-se. It's more of an administrative improvement, to allow admins to
> kill off processes stuck waiting for an inode to finish writeback.
>
> I think there may still be room for an interface like you and Trond were
> debating. But, I think this might be a good first step and would improve
> a lot of hung mount situations.
> -- 
> Jeff Layton <jlayton@poochiereds.net>

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 832 bytes --]

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [Lsf-pc] [LSF/MM TOPIC] I/O error handling and fsync()
  2017-01-27  3:23                                     ` Theodore Ts'o
  2017-01-27  6:03                                         ` NeilBrown
@ 2017-01-30 16:04                                       ` Jan Kara
  1 sibling, 0 replies; 53+ messages in thread
From: Jan Kara @ 2017-01-30 16:04 UTC (permalink / raw)
  To: Theodore Ts'o
  Cc: NeilBrown, kwolf, riel, Jan Kara, hch, linux-mm, linux-fsdevel,
	Trond Myklebust, jlayton, lsf-pc, rwheeler

On Thu 26-01-17 22:23:18, Ted Tso wrote:
> > And aio_write() isn't non-blocking for O_DIRECT already because .... oh,
> > it doesn't even try.  Is there something intrinsically hard about async
> > O_DIRECT writes, or is it just that no-one has written acceptable code
> > yet?
> 
> AIO/DIO writes can indeed be non-blocking, if the file system doesn't
> need to do any metadata operations.  So if the file is preallocated,
> you should be able to issue an async DIO write without losing the CPU.

Well, there are a couple of ifs though.  You can still block on locks,
memory allocation, or IO request allocation...
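
For illustration, the application-side preparation looks roughly like
this (a userspace sketch with a hypothetical helper name); even with
the preallocation, io_submit() can still block briefly for the reasons
above:

#define _GNU_SOURCE
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/* Preallocate real blocks so that later O_DIRECT AIO writes within
 * [0, size) should need no block allocation or other metadata work. */
static int prealloc_for_aio(const char *path, off_t size)
{
	int fd = open(path, O_RDWR | O_CREAT | O_DIRECT, 0600);

	if (fd < 0)
		return -1;
	if (fallocate(fd, 0, 0, size) < 0) {
		close(fd);
		return -1;
	}
	return fd;
}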

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR


^ permalink raw reply	[flat|nested] 53+ messages in thread

end of thread, other threads:[~2017-01-30 16:04 UTC | newest]

Thread overview: 53+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-01-10 16:02 [LSF/MM TOPIC] I/O error handling and fsync() Kevin Wolf
2017-01-11  0:41 ` NeilBrown
2017-01-13 11:09   ` Kevin Wolf
2017-01-13 14:21     ` Theodore Ts'o
2017-01-13 16:00       ` Kevin Wolf
2017-01-13 22:28         ` NeilBrown
2017-01-14  6:18           ` Darrick J. Wong
2017-01-16 12:14           ` [Lsf-pc] " Jeff Layton
2017-01-22 22:44             ` NeilBrown
2017-01-22 23:31               ` Jeff Layton
2017-01-23  0:21                 ` Theodore Ts'o
2017-01-23 10:09                   ` Kevin Wolf
2017-01-23 12:10                     ` Jeff Layton
2017-01-23 12:10                       ` Jeff Layton
2017-01-23 17:25                       ` Theodore Ts'o
2017-01-23 17:53                         ` Chuck Lever
2017-01-23 22:40                         ` Jeff Layton
2017-01-23 22:40                           ` Jeff Layton
2017-01-23 22:35                     ` Jeff Layton
2017-01-23 22:35                       ` Jeff Layton
2017-01-23 23:09                       ` Trond Myklebust
2017-01-24  0:16                         ` NeilBrown
2017-01-24  0:16                           ` NeilBrown
2017-01-24  0:46                           ` Jeff Layton
2017-01-24  0:46                             ` Jeff Layton
2017-01-24 21:58                             ` NeilBrown
2017-01-24 21:58                               ` NeilBrown
2017-01-25 13:00                               ` Jeff Layton
2017-01-25 13:00                                 ` Jeff Layton
2017-01-30  5:30                                 ` NeilBrown
2017-01-30  5:30                                   ` NeilBrown
2017-01-24  3:34                           ` Trond Myklebust
2017-01-25 18:35                             ` Theodore Ts'o
2017-01-26  0:36                               ` NeilBrown
2017-01-26  0:36                                 ` NeilBrown
2017-01-26  9:25                                 ` Jan Kara
2017-01-26 22:19                                   ` NeilBrown
2017-01-26 22:19                                     ` NeilBrown
2017-01-27  3:23                                     ` Theodore Ts'o
2017-01-27  6:03                                       ` NeilBrown
2017-01-27  6:03                                         ` NeilBrown
2017-01-30 16:04                                       ` Jan Kara
2017-01-13 18:40     ` Al Viro
2017-01-13 19:06       ` Kevin Wolf
2017-01-11  5:03 ` Theodore Ts'o
2017-01-11  9:47   ` [Lsf-pc] " Jan Kara
2017-01-11 15:45     ` Theodore Ts'o
2017-01-11 10:55   ` Chris Vest
2017-01-11 11:40   ` Kevin Wolf
2017-01-13  4:51     ` NeilBrown
2017-01-13 11:51       ` Kevin Wolf
2017-01-13 21:55         ` NeilBrown
2017-01-11 12:14   ` Chris Vest
