linux-fsdevel.vger.kernel.org archive mirror
* Re: fsync() errors is unsafe and risks data loss
From: Andres Freund @ 2018-04-10 22:07 UTC (permalink / raw)
  To: Theodore Y. Ts'o
  Cc: linux-ext4, linux-fsdevel, Joshua D. Drake, Andreas Dilger


(Sorry if I screwed up the thread structure - I had to reconstruct the
reply-to and CC list from the web archive, as I've not found a way to
properly download an mbox or similar of old content. I was subscribed to
the fsdevel list but not the ext4 list.)

Hi,

2018-04-10 18:43:56 Ted wrote:
> I'll try to give as unbiased a description as possible, but certainly
> some of this is going to be filtered by my own biases no matter how
> careful I can be.

Same ;)


2018-04-10 18:43:56 Ted wrote:
> So for better or for worse, there has not been as much investment in
> buffered I/O and data robustness in the face of exception handling of
> storage devices.

That's a bit of a cop out. It's not just databases that care. Even more
basic tools like SCMs, package managers and editors care whether they
get proper responses back from fsync that imply things actually were
synced.


2018-04-10 18:43:56 Ted wrote:
> So this is the explanation for why Linux handles I/O errors by
> clearing the dirty bit after reporting the error up to user space.
> And why there is not eagerness to solve the problem simply by "don't
> clear the dirty bit".  For every one Postgres installation that might
> have a better recover after an I/O error, there's probably a thousand
> clueless Fedora and Ubuntu users who will have a much worse user
> experience after a USB stick pull happens.

I don't think these goals are necessarily as contradictory as you paint
them.  At least in postgres' case we can deal with the fact that an
fsync retry isn't going to fix the problem by reentering crash recovery
or just shutting down - therefore we don't need to keep all the dirty
buffers around.  A per-inode or per-superblock bit that causes further
fsyncs to fail would be entirely sufficient for that.
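
To make the "don't retry, stop instead" policy concrete, here is a
minimal sketch in Python. The names DurabilityError, fsync_or_fail and
write_durably are made up for illustration, and the injectable `fsync`
parameter exists only so the failure path can be exercised; this is not
how postgres implements it.

```python
import os


class DurabilityError(RuntimeError):
    """Raised when fsync fails; the caller must stop, not retry."""


def fsync_or_fail(fd, fsync=os.fsync):
    """Flush fd to stable storage, turning any failure into a hard stop.

    `fsync` is injectable (hypothetical hook) purely for testing.
    """
    try:
        fsync(fd)
    except OSError as exc:
        # No retry: after a failed fsync the kernel may already have
        # marked the pages clean, so a later "success" proves nothing.
        raise DurabilityError(f"fsync failed: {exc}") from exc


def write_durably(path, data, fsync=os.fsync):
    """Write data to path; return only once fsync has succeeded."""
    with open(path, "wb") as f:
        f.write(data)
        f.flush()  # push Python's userspace buffer to the kernel first
        fsync_or_fail(f.fileno(), fsync)
```

A caller would treat DurabilityError as the signal to shut down or
reenter crash recovery rather than calling fsync again.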

While there are some differing opinions on the referenced postgres
thread, the fundamental problem isn't so much that a retry won't fix the
problem, it's that we might NEVER see the failure.  If writeback happens
in the background, encounters an error, and undirties the buffer, we
will happily carry on because we've never seen that error. That's when
we're majorly screwed.

Both in postgres *and* in a lot of other applications, it's not at all
guaranteed that one FD is consistently kept open for every file
written. Therefore even the more recent per-fd errseq logic doesn't
guarantee that the failure will ever be seen by an application
diligently fsync()ing.
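
The missed-error scenario can be illustrated with a toy model. This is
a deliberately simplified sketch of the behaviour described above, not
the kernel's actual errseq_t code; ToyInode and ToyFile are invented
names, and the only assumption modelled is that a descriptor samples
the error counter when it is opened.

```python
class ToyInode:
    """Toy stand-in for an inode's writeback error state (not kernel code)."""

    def __init__(self):
        self.err_seq = 0  # bumped each time background writeback fails

    def writeback_fails(self):
        self.err_seq += 1


class ToyFile:
    """Toy open file that samples the error counter at open time."""

    def __init__(self, inode):
        self.inode = inode
        self.seen = inode.err_seq

    def fsync(self):
        """Return True on success, False if an error happened since last check."""
        if self.inode.err_seq != self.seen:
            self.seen = self.inode.err_seq
            return False
        return True


inode = ToyInode()
early = ToyFile(inode)   # opened before the failure
inode.writeback_fails()  # background writeback hits an I/O error
late = ToyFile(inode)    # opened afterwards, e.g. by a checkpointer

print(early.fsync())  # False: this descriptor observes the error
print(late.fsync())   # True: the error is invisible to the new descriptor
```

In this model the process that opens the file only to fsync it is the
one that never learns about the failure.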

You'd not even need to have per inode information or such in the case
that the block device goes away entirely. As the FS isn't generally
unmounted in that case, you could trivially keep a per-mount (or
superblock?) bit that says "I died" and set that instead of keeping per
inode/whatever information.


2018-04-10 18:43:56 Ted wrote:
> If you are aware of a company who is willing to pay to have a new
> kernel feature implemented to meet your needs, we might be able to
> refer you to a company or a consultant who might be able to do that
> work.

I find that a somewhat disappointing response. I think it's a fair thing
to say for advanced features, but we're talking about the basic
guarantee that fsync actually does something even remotely reasonable.


2018-04-10 19:44:48 Andreas wrote:
> The confusion is whether fsync() is a "level" state (return error
> forever if there were pages that could not be written), or an "edge"
> state (return error only for any write failures since the previous
> fsync() call).

I don't think that's the full issue. We can deal with the fact that an
fsync failure is edge-triggered if there's a guarantee that every
process doing so would get it.  The fact that one needs to have an FD
open from before any failing writes occurred to get a failure, *THAT'S*
the big issue.
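
For readers unfamiliar with the "level" versus "edge" wording, the
difference can be sketched with two toy error states. LevelError and
EdgeError are hypothetical names for this illustration only; neither
class models what any real filesystem does.

```python
class LevelError:
    """Sticky: once writeback has failed, every fsync reports failure."""

    def __init__(self):
        self.failed = False

    def writeback_fails(self):
        self.failed = True

    def fsync(self):
        return not self.failed


class EdgeError:
    """One-shot: the first fsync after a failure reports it, then it clears."""

    def __init__(self):
        self.pending = False

    def writeback_fails(self):
        self.pending = True

    def fsync(self):
        ok = not self.pending
        self.pending = False
        return ok


def two_callers(state):
    """One writeback failure, then two successive fsync calls."""
    state.writeback_fails()
    return state.fsync(), state.fsync()


print(two_callers(LevelError()))  # (False, False)
print(two_callers(EdgeError()))   # (False, True)
```

Edge-triggered reporting is workable only if each interested caller
gets its own "first" report; a single shared one-shot flag, as in
EdgeError here, hands the error to whoever calls fsync first.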

Beyond postgres, it's a pretty common approach to do work on a lot of
files without fsyncing, then iterate over the directory, fsync
everything, and *then* assume you're safe. But unless I severely
misunderstand something, that'd only be safe if you kept an FD open for
every file, which isn't realistic for pretty obvious reasons.
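
The pattern in question looks roughly like the following sketch. It
assumes a POSIX system where fsync works on a read-only descriptor,
including one for a directory; fsync_path and fsync_tree are names
chosen for this example.

```python
import os
import tempfile


def fsync_path(path):
    """Open path afresh, fsync it, and close it again."""
    fd = os.open(path, os.O_RDONLY)
    try:
        os.fsync(fd)
    finally:
        os.close(fd)


def fsync_tree(root):
    """fsync every file under root, then its directory.

    Returns the paths for which opening or fsyncing raised OSError.
    """
    failed = []
    for dirpath, _dirnames, filenames in os.walk(root):
        targets = [os.path.join(dirpath, n) for n in filenames] + [dirpath]
        for path in targets:
            try:
                fsync_path(path)
            except OSError:
                failed.append(path)
    return failed


with tempfile.TemporaryDirectory() as root:
    for i in range(3):
        with open(os.path.join(root, f"f{i}"), "wb") as f:
            f.write(b"data")  # written and closed without fsync
    print(fsync_tree(root))
```

An empty result only says that no error was reported to these freshly
opened descriptors. If writeback already failed between the close and
the reopen, this loop has no way to find out.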


2018-04-10 18:43:56 Ted wrote:
> I think Anthony Iliopoulos was pretty clear in his multiple
> descriptions in that thread of why the current behaviour is needed
> (OOM of the whole system if dirty pages are kept around forever), but
> many others were stuck on "I can't believe this is happening??? This
> is totally unacceptable and every kernel needs to change to match my
> expectations!!!" without looking at the larger picture of what is
> practical to change and where the issue should best be fixed.

Everyone can participate in discussions...

Greetings,

Andres Freund
