On 11.01.2017 at 01:41, NeilBrown wrote:
> On Wed, Jan 11 2017, Kevin Wolf wrote:
>
> > Hi all,
> >
> > when I mentioned the I/O error handling problem especially with fsync()
> > we have in QEMU to Christoph Hellwig, he thought it would be a great
> > topic for LSF/MM, so here I am. This came up a few months ago on
> > qemu-devel [1] and we managed to ignore it for a while, but it's a real
> > and potentially serious problem, so I think I agree with Christoph that
> > it makes sense to get it discussed at LSF/MM.
> >
> >
> > At the heart of it is the semantics of fsync(). A few years ago, fsync()
> > was fixed to actually flush data to the disk, so we now have a defined
> > and useful meaning of fsync() as long as all your fsync() calls return
> > success.
> >
> > However, as soon as one fsync() call fails, even if the root problem is
> > solved later (network connection restored, some space freed for thin
> > provisioned storage, etc.), the state we're in is mostly undefined. As
> > Ric Wheeler told me back in the qemu-devel discussion, when a writeout
> > fails, you get an fsync() error returned (once), but the kernel page
> > cache simply marks the respective page as clean and consequently won't
> > ever retry the writeout. Instead, it can evict it from the cache even
> > though it isn't actually consistent with the state on disk, which means
> > throwing away data that was written by some process.
> >
> > So if you do another fsync() and it returns success, this doesn't
> > currently mean that all of the data you wrote is on disk, but if
> > anything, it's just about the data you wrote after the failed fsync().
> > This isn't very helpful, to say the least, because you called fsync() in
> > order to get a consistent state on disk, and you still don't have that.
> >
> > Essentially this means that once you get an fsync() failure, there is no
> > hope to recover for the application and it has to stop using the file.
>
> This is not strictly correct.
> The application could repeat all the recent writes. It might fsync
> after each write so it can find out exactly where the problem is.
> So it could be a lot of work to recover, but it is not intrinsically
> impossible.

You are right, I probably overgeneralised from our situation. QEMU
doesn't have the written data any more, and basically duplicating the
page cache by keeping a second copy of all data in QEMU until the next
flush isn't really practicable. It would both consume a considerable
amount of memory (potentially unbounded, if we don't add artificial
flushes that the guest didn't request) and impact performance because we
wouldn't be zero-copy any more.

So it is not intrinsically impossible, but practically impossible for at
least some applications. As you say, it's probably also too much extra
code to deal with an unlikely corner case for applications where it
would be possible, so it's still unlikely they will do this.

> > To give some context about my perspective as the maintainer for the QEMU
> > block subsystem: QEMU has a mode (which is usually enabled in
> > production) where I/O failure isn't communicated to the guest, which
> > would probably offline the filesystem, thinking its hard disk has died,
> > but instead QEMU pauses the VM and allows the administrator to resume
> > when the problem has been fixed. Often the problem is only temporary,
> > e.g. a network hiccup when a disk image is stored on NFS, so this is a
> > quite helpful approach.
>
> If the disk image is stored over NFS, the write should hang, not cause
> an error. (Of course if you mount with '-o soft' you will get an error,
> but if you mount with '-o soft', then "you get to keep both halves").

Yes, bad example. (The hanging write is a problem of its own, and I
think one of the reasons why '-o soft' is bad is the behaviour of the
page cache if we let it fail, but while possibly related, it's a
separate problem.)

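To make the cost of the "repeat all the recent writes" approach a bit
more concrete, here is a rough sketch of what an application would have
to do. This is only an illustration and not code from QEMU or any
existing library: the names tracked_file, pending_write,
tracked_pwrite(), tracked_fsync() and tracked_recover() are made up for
this mail, and it assumes that reissuing the writes makes the pages
dirty again so that a later fsync() actually covers them.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* One write not yet covered by a successful fsync() (hypothetical type). */
struct pending_write {
    struct pending_write *next;
    off_t offset;
    size_t len;
    unsigned char data[];       /* private copy of the payload */
};

/* An fd plus the log of writes issued since the last successful flush. */
struct tracked_file {
    int fd;
    struct pending_write *head, *tail;
    size_t pending_bytes;       /* grows without bound until a flush succeeds */
};

/* Write the whole buffer at the given offset, retrying short writes. */
static int pwrite_all(int fd, const unsigned char *p, size_t len, off_t off)
{
    while (len > 0) {
        ssize_t n = pwrite(fd, p, len, off);
        if (n < 0 && errno == EINTR)
            continue;
        if (n <= 0)
            return -1;
        p += n;
        len -= (size_t)n;
        off += n;
    }
    return 0;
}

/* Write to the file and remember a copy until the next successful flush. */
int tracked_pwrite(struct tracked_file *f, const void *buf, size_t len, off_t off)
{
    assert(f != NULL && buf != NULL);
    struct pending_write *w = malloc(sizeof(*w) + len);
    if (w == NULL)
        return -1;
    w->next = NULL;
    w->offset = off;
    w->len = len;
    memcpy(w->data, buf, len);      /* this copy is what defeats zero-copy */
    if (pwrite_all(f->fd, w->data, len, off) < 0) {
        free(w);
        return -1;
    }
    if (f->tail != NULL)
        f->tail->next = w;
    else
        f->head = w;
    f->tail = w;
    f->pending_bytes += len;
    return 0;
}

/* Flush; only a success lets us forget the logged writes. */
int tracked_fsync(struct tracked_file *f)
{
    if (fsync(f->fd) != 0)
        return -1;                  /* keep the log for tracked_recover() */
    while (f->head != NULL) {
        struct pending_write *w = f->head;
        f->head = w->next;
        free(w);
    }
    f->tail = NULL;
    f->pending_bytes = 0;
    return 0;
}

/* After a failed flush: reissue every logged write in order, then flush again. */
int tracked_recover(struct tracked_file *f)
{
    for (struct pending_write *w = f->head; w != NULL; w = w->next)
        if (pwrite_all(f->fd, w->data, w->len, w->offset) < 0)
            return -1;
    return tracked_fsync(f);
}
```

A caller would use tracked_pwrite() in place of pwrite() and call
tracked_recover() once the underlying problem has been fixed. Note that
pending_bytes is exactly the second copy of the page cache mentioned
above: it is only released by a successful flush, which is why this does
not look workable for QEMU.
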
> Is there a more realistic situation where you might get a write error
> that might succeed if the write is repeated?

Where we noticed this problem in practice wasn't the kernel page cache,
but the userspace gluster implementation, which exposed a similar
behaviour: It threw away the cache contents on a failed fsync() and the
next fsync() would report success again.

In the following discussion we came to think of the kernel and that the
same problem exists there in theory. This was confirmed by Ric Wheeler
and Rik van Riel, who I trust to have some knowledge about this, and my
own superficial read of some kernel code didn't contradict it. Neither
did anyone in this thread disagree, so I assume that the problem does
exist on the page cache level.

Now even if at the moment there were no storage backend where a write
failure can be temporary (which I find hard to believe, but who knows),
a single new driver is enough to expose the problem. Are you confident
enough that no single driver will ever behave this way to make data
integrity depend on the assumption?

Now to answer your question a bit more directly: The other example we
had in mind was ENOSPC in thin provisioned block devices, which can be
fixed by freeing up some space. I also still see potential for such
behaviour in things using the network, but I haven't checked them in
detail.

> > When QEMU is told to resume the VM, the request is just resubmitted.
> > This works fine for read/write, but not so much for fsync, because after
> > the first failure all bets are off even if a subsequent fsync()
> > succeeds.
> >
> > So this is the aspect that directly affects me, even though the problem
> > is much broader and by far doesn't only affect QEMU.
> >
> >
> > This leads to a few individual points to be discussed:
> >
> > 1. Fix the data corruption problem that follows from the current
> >    behaviour. Imagine the following scenario:
> >
> >    Process A writes to some file, calls fsync() and gets a failure.
> >    The data it wrote is marked clean in the page cache even though it's
> >    inconsistent with the disk. Process A knows that fsync() failed, so
> >    maybe it can deal with it, at least by no longer using the file.
> >
> >    Now process B opens the same file, reads the updated data that
> >    process A wrote, makes some additional changes based on that and
> >    calls fsync() again. Now fsync() returns success. The data written by
> >    B is on disk, but the data written by A isn't. Oops, this is data
> >    corruption, and process B doesn't even know about it because all its
> >    operations succeeded.
>
> Can that really happen? I would expect the filesystem to call
> SetPageError() if there was a write error, then I would expect a read to
> report an error for that page if it were still in cache (or maybe flush
> it out). I admit that I haven't traced through the code in detail, but
> I did find some examples for SetPageError after a write error.

To be honest, I kept the proposal intentionally on the high-level
userspace API semantics level because I'm not familiar with the
internals. I did have a look and could have been lucky enough to spot
something that contradicts the theoretical considerations (which I
didn't), but I didn't spend nearly enough time to make the opposite
statement, i.e. that nothing prevents it from happening. I took Rik's
word on this.

Anyway, it would probably be good if someone had a closer look.

> >
> > 2. Define fsync() semantics that include the state after a failure (this
> >    probably goes a long way towards fixing 1.).
> >
> >    The semantics that QEMU uses internally (and which it needs to map)
> >    is that after a successful flush, all writes to the disk image that
> >    have successfully completed before the flush was issued are stable on
> >    disk (no matter whether a previous flush failed).
> >
> >    A possible adaptation to Linux, which considers that unlike QEMU
> >    images, files can be opened more than once, might be that a
> >    succeeding fsync() on a file descriptor means that all data that has
> >    been read or written through this file descriptor is consistent
> >    between the page cache and the disk (the read part is for avoiding
> >    the scenario from 1.; it means that fsync flushes data written on a
> >    different file descriptor if it has been seen by this one; hence, the
> >    page cache can't contain non-dirty pages which aren't consistent with
> >    the disk).
>
> I think it would be useful to try to describe the behaviour of page
> flags, particularly PG_error, PG_uptodate and PG_dirty, in the different
> scenarios.
>
> For example, a successful read sets PG_uptodate and a successful write
> clears PG_dirty.
> A failed read doesn't set PG_uptodate, and maybe sets PG_error.
> A failed write probably shouldn't clear PG_dirty but should set PG_error.
>
> If background-write finds a PG_dirty|PG_error page, should it try to
> write it out again? Or should only a foreground (fsync) write?

That's a good question. I think a background write (if that includes
anything not coming from userspace) needs to be able to retry writing
out pages at least sometimes, specifically as the final attempt when we
need the memory and are about to throw the data away for good.

> If we did this, PG_error|PG_dirty pages would be pinned in memory until
> a write was successful. We would need a way to purge these pages
> without writing them. We would also need a way to ensure they didn't
> consume a large fraction of memory.

Yes, at some point throwing them away is unavoidable. If we do, a good
fsync() behaviour is important to communicate this to userspace.

> It isn't clear to me that the behaviour can be different for different
> file descriptors. Once the data has been written to the page cache, it
> belongs to the file, not to any particular fd.
> So enabling "keep-data-after-write-error" would need to be per-file
> rather than per-fd, and would probably need to be a privileged
> operation due to the memory consumption concerns.

Note that I didn't think of a "keep-data-after-write-error" flag,
neither per-fd nor per-file, because I assumed that everyone would want
it as long as there is some hope that the data could still be
successfully written out later.

The per-fd thing I envisioned was a flag that basically says "this fd
has gone bad, fsync() won't ever return success for it again" and that
would be set for all open file descriptors for a file when we release
PG_error|PG_dirty pages in it without having written them.

I had assumed that there is a way to get back from the file to all file
descriptors that are open for it, but looking at the code I indeed don't
see one. Is this an intentional design decision or is it just that
nobody needed it?

You could still mark the whole file as "gone bad", but then this would
also affect new file descriptors that never saw the content that we
threw away. If I understand correctly, you would have to close all file
descriptors on the file first to get rid of the "gone bad" flag (is this
enough or are files kept around for longer than their fds?), and only
then could you get a working new one again. This sounds a bit too heavy
to me.

> >
> > 3. Actually make fsync() failure recoverable.
> >
> >    You can implement 2. by making sure that a file descriptor for which
> >    pages have been thrown away always returns an error and never goes
> >    back to succeeding (it can't succeed according to the definition of 2.
> >    because the data that would have to be written out is gone). This is
> >    already a much better interface, but it doesn't really solve the
> >    actual problem we have.
> >
> >    We also need to make sure that after a failed fsync() there is a
> >    chance to recover.
> >    This means that the pages shouldn't be thrown away
> >    immediately; but at the same time, you probably also don't want to
> >    keep pages indefinitely when there is a permanent writeout error.
> >    However, if we can make sure that these pages are only evicted in
> >    case of actual memory pressure, and only if there are no actually
> >    clean pages to evict, I think a lot would already be won.
>
> I think this would make behaviour unpredictable, being dependent on how
> much memory pressure there is. Predictability is nice!

Yes, predictability is nice. Recovering from errors and not losing data
is nice, too. I think I would generally value the latter higher, but I
see that there may be cases where a different tradeoff might make sense.
A sign that it should be an option?

On the other hand, I wouldn't really consider page cache writeouts
particularly predictable for userspace anyway.

> >
> >    In the common case, you could then recover from a temporary failure,
> >    but if this state isn't maintainable, at least we get consistent
> >    fsync() failure telling us that the data is gone.
> >
> >
> > I think I've summarised most aspects here, but if something is unclear
> > or you'd like to see some more context, please refer to the qemu-devel
> > discussion [1] that I mentioned, or feel free to just ask.
>
> Definitely an interesting question!
>
> Thanks,
> NeilBrown

Kevin