linux-fsdevel.vger.kernel.org archive mirror
* Re: fsync() errors is unsafe and risks data loss
@ 2018-04-10 22:07 Andres Freund
  2018-04-11 21:52 ` Andreas Dilger
  2018-04-13 14:48 ` Matthew Wilcox
  0 siblings, 2 replies; 57+ messages in thread
From: Andres Freund @ 2018-04-10 22:07 UTC (permalink / raw)
  To: Theodore Y. Ts'o
  Cc: linux-ext4, linux-fsdevel, Joshua D. Drake, Andreas Dilger


(Sorry if I screwed up the thread structure - I had to reconstruct the
reply-to and CC list from the web archive, as I've not found a way to
properly download an mbox or such of the old content. I was subscribed to
fsdevel but not the ext4 list.)

Hi,

2018-04-10 18:43:56 Ted wrote:
> I'll try to give as unbiased a description as possible, but certainly
> some of this is going to be filtered by my own biases no matter how
> careful I can be.

Same ;)


2018-04-10 18:43:56 Ted wrote:
> So for better or for worse, there has not been as much investment in
> buffered I/O and data robustness in the face of exception handling of
> storage devices.

That's a bit of a cop-out. It's not just databases that care. Even more
basic tools like SCM, package managers and editors care whether they can
get proper responses back from fsync that imply things actually were synced.


2018-04-10 18:43:56 Ted wrote:
> So this is the explanation for why Linux handles I/O errors by
> clearing the dirty bit after reporting the error up to user space.
> And why there is no eagerness to solve the problem simply by "don't
> clear the dirty bit".  For every one Postgres installation that might
> have a better recovery after an I/O error, there's probably a thousand
> clueless Fedora and Ubuntu users who will have a much worse user
> experience after a USB stick pull happens.

I don't think these necessarily are as contradictory goals as you paint
them.  At least in postgres' case we can deal with the fact that an
fsync retry isn't going to fix the problem by reentering crash recovery
or just shutting down - therefore we don't need to keep all the dirty
buffers around.  A per-inode or per-superblock bit that causes further
fsyncs to fail would be entirely sufficient for that.
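
To make that concrete, the application-side handling I'm describing is
roughly the following (a sketch only, not actual postgres code; the
abort() is a stand-in for PANICing and re-entering WAL-based crash
recovery):

#include <errno.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/* Sketch only: if the kernel reliably failed further fsync()s after a
 * writeback error, we would not need the dirty pages kept around - we
 * can throw our own buffers away and let WAL replay restore the data. */
static void checkpoint_sync(int fd)
{
    if (fsync(fd) == 0)
        return;             /* data is durable; the checkpoint may advance */

    /* EIO/ENOSPC/...: do NOT advance the checkpoint */
    fprintf(stderr, "fsync failed: %s\n", strerror(errno));
    abort();                /* stand-in for PANIC + crash recovery */
}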

While there's some differing opinions on the referenced postgres thread,
the fundamental problem isn't so much that a retry won't fix the
problem, it's that we might NEVER see the failure.  If writeback happens
in the background, encounters an error, undirties the buffer, we will
happily carry on because we've never seen that. That's when we're
majorly screwed.

Both in postgres, *and* a lot of other applications, it's not at all
guaranteed to consistently have one FD open for every file
written. Therefore even the more recent per-fd errseq logic doesn't
guarantee that the failure will ever be seen by an application
diligently fsync()ing.

You'd not even need to have per inode information or such in the case
that the block device goes away entirely. As the FS isn't generally
unmounted in that case, you could trivially keep a per-mount (or
superblock?) bit that says "I died" and set that instead of keeping per
inode/whatever information.


2018-04-10 18:43:56 Ted wrote:
> If you are aware of a company who is willing to pay to have a new
> kernel feature implemented to meet your needs, we might be able to
> refer you to a company or a consultant who might be able to do that
> work.

I find that a bit of a disappointing response. I think that's fair to say
for advanced features, but we're talking about the basic guarantee that
fsync actually does something even remotely reasonable.


2018-04-10 19:44:48 Andreas wrote:
> The confusion is whether fsync() is a "level" state (return error
> forever if there were pages that could not be written), or an "edge"
> state (return error only for any write failures since the previous
> fsync() call).

I don't think that's the full issue. We can deal with the fact that an
fsync failure is edge-triggered if there's a guarantee that every
process doing so would get it.  The fact that one needs to have an FD
open from before any failing writes occurred to get a failure, *THAT'S*
the big issue.

Beyond postgres, it's a pretty common approach to do work on a lot of
files without fsyncing, then iterate over the directory fsyncing
everything, and *then* assume you're safe. But unless I severely
misunderstand something, that'd only be safe if you kept an FD for every
file open, which isn't realistic for pretty obvious reasons.
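
Roughly the pattern I mean, as a sketch (directory walk simplified,
error handling minimal):

#include <dirent.h>
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Write lots of files without syncing, then later walk the directory and
 * fsync everything.  If writeback already failed in between, with no fd
 * open at the time, every one of these fsync()s can still succeed. */
static void fsync_dir_contents(const char *dir)
{
    DIR *d = opendir(dir);
    struct dirent *de;

    if (d == NULL)
        return;
    while ((de = readdir(d)) != NULL) {
        int fd;

        if (strcmp(de->d_name, ".") == 0 || strcmp(de->d_name, "..") == 0)
            continue;
        fd = openat(dirfd(d), de->d_name, O_RDONLY);
        if (fd < 0)
            continue;
        if (fsync(fd) != 0)     /* may never report the earlier failure */
            fprintf(stderr, "fsync %s: %s\n", de->d_name, strerror(errno));
        close(fd);
    }
    fsync(dirfd(d));            /* and the directory entries themselves */
    closedir(d);
}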


2018-04-10 18:43:56 Ted wrote:
> I think Anthony Iliopoulos was pretty clear in his multiple
> descriptions in that thread of why the current behaviour is needed
> (OOM of the whole system if dirty pages are kept around forever), but
> many others were stuck on "I can't believe this is happening??? This
> is totally unacceptable and every kernel needs to change to match my
> expectations!!!" without looking at the larger picture of what is
> practical to change and where the issue should best be fixed.

Everyone can participate in discussions...

Greetings,

Andres Freund


* Re: fsync() errors is unsafe and risks data loss
  2018-04-10 22:07 fsync() errors is unsafe and risks data loss Andres Freund
@ 2018-04-11 21:52 ` Andreas Dilger
  2018-04-12  0:09   ` Dave Chinner
  2018-04-12  2:17   ` Andres Freund
  2018-04-13 14:48 ` Matthew Wilcox
  1 sibling, 2 replies; 57+ messages in thread
From: Andreas Dilger @ 2018-04-11 21:52 UTC (permalink / raw)
  To: 20180410184356.GD3563
  Cc: Theodore Y. Ts'o, Ext4 Developers List, Linux FS Devel,
	Jeff Layton, Joshua D. Drake, andres

On Apr 10, 2018, at 4:07 PM, Andres Freund <andres@anarazel.de> wrote:
> 2018-04-10 18:43:56 Ted wrote:
>> So for better or for worse, there has not been as much investment in
>> buffered I/O and data robustness in the face of exception handling of
>> storage devices.
> 
> That's a bit of a cop-out. It's not just databases that care. Even more
> basic tools like SCM, package managers and editors care whether they can
> get proper responses back from fsync that imply things actually were synced.

Sure, but it is mostly PG that is doing (IMHO) crazy things like writing
to thousands(?) of files, closing the file descriptors, then expecting
fsync() on a newly-opened fd to return a historical error.  If an editor
tries to write a file, then calls fsync and gets an error, the user will
enter a new pathname and retry the write.  The package manager will assume
the package installation failed, and uninstall the parts of the package
that were already written.

There is no way the filesystem can handle the package manager failure case,
and keeping the pages dirty and retrying indefinitely may never work (e.g.
disk is dead or disconnected, is a sparse volume without any free space,
etc).  This (IMHO) implies that the higher layer (which knows more about
what the write failure implies) needs to deal with this.

> 2018-04-10 18:43:56 Ted wrote:
>> So this is the explanation for why Linux handles I/O errors by
>> clearing the dirty bit after reporting the error up to user space.
>> And why there is no eagerness to solve the problem simply by "don't
>> clear the dirty bit".  For every one Postgres installation that might
>> have a better recovery after an I/O error, there's probably a thousand
>> clueless Fedora and Ubuntu users who will have a much worse user
>> experience after a USB stick pull happens.
> 
> I don't think these necessarily are as contradictory goals as you paint
> them.  At least in postgres' case we can deal with the fact that an
> fsync retry isn't going to fix the problem by reentering crash recovery
> or just shutting down - therefore we don't need to keep all the dirty
> buffers around.  A per-inode or per-superblock bit that causes further
> fsyncs to fail would be entirely sufficient for that.
> 

> While there's some differing opinions on the referenced postgres thread,
> the fundamental problem isn't so much that a retry won't fix the
> problem, it's that we might NEVER see the failure.  If writeback happens
> in the background, encounters an error, undirties the buffer, we will
> happily carry on because we've never seen that. That's when we're
> majorly screwed.


I think there are two issues here - "fsync() on an fd that was just opened"
and "persistent error state (without keeping dirty pages in memory)".

If there is background data writeback *without an open file descriptor*,
there is no mechanism for the kernel to return an error to any application
which may exist, or may not ever come back.

Consider if there was a per-inode "there was once an error writing this
inode" flag.  Then fsync() would return an error on the inode forever,
since there is no way in POSIX to clear this state, since it would need
to be kept in case some new fd is opened on the inode and does an fsync()
and wants the error to be returned.

IMHO, the only alternative would be to keep the dirty pages in memory
until they are written to disk.  If that were not possible, what then?
Would it take a reboot to clear the dirty pages, or a truncation of the
file (discarding all data)?

> Both in postgres, *and* a lot of other applications, it's not at all
> guaranteed to consistently have one FD open for every file
> written. Therefore even the more recent per-fd errseq logic doesn't
> guarantee that the failure will ever be seen by an application
> diligently fsync()ing.

... only if the application closes all fds for the file before calling
fsync.  If any fd is kept open from the time of the failure, it will
return the original error on fsync() (and then no longer return it).

It's not that you need to keep every fd open forever.  You could put them
into a shared pool, and re-use them if the file is "re-opened", and call
fsync on each fd before it is closed (because the pool is getting too big
or because you want to flush the data for that file, or shut down the DB).
That wouldn't require a huge re-architecture of PG, just a small library
to handle the shared fd pool.

That might even improve performance, because opening and closing files
is itself not free, especially if you are working with remote filesystems.
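
Roughly the shape of the small library I have in mind - a sketch only,
with a fixed-size pool and no locking or eviction policy:

#include <fcntl.h>
#include <string.h>
#include <unistd.h>

#define POOL_SIZE 1024

/* The point: an fd opened before any writeback failure stays open (and
 * is reused) until we have fsync()ed and close()d it ourselves, so the
 * error state remains reachable. */
struct pooled_fd { char path[256]; int fd; };
static struct pooled_fd pool[POOL_SIZE];
static int pool_used;

int pool_open(const char *path)
{
    for (int i = 0; i < pool_used; i++)
        if (strcmp(pool[i].path, path) == 0)
            return pool[i].fd;          /* "re-open" reuses the old fd */
    int fd = open(path, O_RDWR);
    if (fd >= 0 && pool_used < POOL_SIZE) {
        strncpy(pool[pool_used].path, path, sizeof(pool[0].path) - 1);
        pool[pool_used].fd = fd;
        pool_used++;
    }
    return fd;
}

int pool_flush_and_close_all(void)
{
    int ret = 0;
    for (int i = 0; i < pool_used; i++) {
        if (fsync(pool[i].fd) != 0)     /* surfaces any writeback error */
            ret = -1;
        if (close(pool[i].fd) != 0)     /* close() can report errors too */
            ret = -1;
    }
    pool_used = 0;
    return ret;
}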

> You'd not even need to have per inode information or such in the case
> that the block device goes away entirely. As the FS isn't generally
> unmounted in that case, you could trivially keep a per-mount (or
> superblock?) bit that says "I died" and set that instead of keeping per
> inode/whatever information.

The filesystem will definitely return an error in this case, I don't
think this needs any kind of changes:

int ext4_sync_file(struct file *file, loff_t start, loff_t end, int datasync)
{
        if (unlikely(ext4_forced_shutdown(EXT4_SB(inode->i_sb))))
                return -EIO;


> 2018-04-10 18:43:56 Ted wrote:
>> If you are aware of a company who is willing to pay to have a new
>> kernel feature implemented to meet your needs, we might be able to
>> refer you to a company or a consultant who might be able to do that
>> work.
> 
> I find that a bit of a disappointing response. I think that's fair to say
> for advanced features, but we're talking about the basic guarantee that
> fsync actually does something even remotely reasonable.

Linux (as PG) is run by people who develop it for their own needs, or
are paid to develop it for the needs of others.  Everyone already has
too much work to do, so you need to find someone who has an interest
in fixing this (IMHO very peculiar) use case.  If PG developers want
to add a tunable "keep dirty pages in RAM on IO failure", I don't think
that it would be too hard for someone to do.  It might be harder to
convince some of the kernel maintainers to accept it, and I've been on
the losing side of that battle more than once.  However, like everything
you don't pay for, you can't require someone else to do this for you.
It wouldn't hurt to see if Jeff Layton, who wrote the errseq patches,
would be interested to work on something like this.

That said, _even_ if a fix were available for Linux tomorrow, it would
be *years* before a majority of users would have it available on their
systems; that includes even the errseq mechanism that landed a few
months ago.  That implies to me that you'd want something that fixes PG
*now* so that it works around whatever (perceived) breakage exists in
the Linux fsync() implementation.  Since the thread indicates that
non-Linux kernels have the same fsync() behaviour, it makes sense to do
that even if the Linux fix were available.

> 2018-04-10 19:44:48 Andreas wrote:
>> The confusion is whether fsync() is a "level" state (return error
>> forever if there were pages that could not be written), or an "edge"
>> state (return error only for any write failures since the previous
>> fsync() call).
> 
> I don't think that's the full issue. We can deal with the fact that an
> fsync failure is edge-triggered if there's a guarantee that every
> process doing so would get it.  The fact that one needs to have an FD
> open from before any failing writes occurred to get a failure, *THAT'S*
> the big issue.
> 
> Beyond postgres, it's a pretty common approach to do work on a lot of
> files without fsyncing, then iterate over the directory fsyncing
> everything, and *then* assume you're safe. But unless I severely
> misunderstand something, that'd only be safe if you kept an FD for
> every file open, which isn't realistic for pretty obvious reasons.

I can't say how common or uncommon such a workload is, though PG is the
only application that I've heard of doing it, and I've been working on
filesystems for 20 years.  I'm a bit surprised that anyone expects
fsync() on a newly-opened fd to have any state from write() calls that
predate the open.  I can understand fsync() returning an error for any
IO that happens within the context of that fsync(), but how far should
it go back for reporting errors on that file?  Forever?  The only
way to clear the error would be to reboot the system, since I'm not
aware of any existing POSIX interface for clearing such an error.

> 2018-04-10 19:44:48 Andreas wrote:
>> I think Anthony Iliopoulos was pretty clear in his multiple
>> descriptions in that thread of why the current behaviour is needed
>> (OOM of the whole system if dirty pages are kept around forever), but
>> many others were stuck on "I can't believe this is happening??? This
>> is totally unacceptable and every kernel needs to change to match my
>> expectations!!!" without looking at the larger picture of what is
>> practical to change and where the issue should best be fixed.
> 
> Everyone can participate in discussions...
> 
> Greetings,
> 
> Andres Freund


Cheers, Andreas






[-- Attachment #2: Message signed with OpenPGP --]
[-- Type: application/pgp-signature, Size: 873 bytes --]


* Re: fsync() errors is unsafe and risks data loss
  2018-04-11 21:52 ` Andreas Dilger
@ 2018-04-12  0:09   ` Dave Chinner
  2018-04-12  2:32     ` Andres Freund
  2018-04-12  2:17   ` Andres Freund
  1 sibling, 1 reply; 57+ messages in thread
From: Dave Chinner @ 2018-04-12  0:09 UTC (permalink / raw)
  To: Andreas Dilger
  Cc: 20180410184356.GD3563, Theodore Y. Ts'o,
	Ext4 Developers List, Linux FS Devel, Jeff Layton,
	Joshua D. Drake, andres

On Wed, Apr 11, 2018 at 03:52:44PM -0600, Andreas Dilger wrote:
> On Apr 10, 2018, at 4:07 PM, Andres Freund <andres@anarazel.de> wrote:
> > 2018-04-10 18:43:56 Ted wrote:
> >> So for better or for worse, there has not been as much investment in
> >> buffered I/O and data robustness in the face of exception handling of
> >> storage devices.
> > 
> > That's a bit of a cop-out. It's not just databases that care. Even more
> > basic tools like SCM, package managers and editors care whether they can
> > get proper responses back from fsync that imply things actually were synced.
> 
> Sure, but it is mostly PG that is doing (IMHO) crazy things like writing
> to thousands(?) of files, closing the file descriptors, then expecting
> fsync() on a newly-opened fd to return a historical error. 

Yeah, this seems like a recipe for disaster, especially on
cross-platform code where every OS platform behaves differently and
almost never to expectation.

And speaking of "behaving differently to expectations", nobody has
mentioned that close() can also return write errors. Hence if you do
write - close - open - fsync the write error might get reported
on close, not fsync.  IOWs, the assumption that "async writeback
errors will persist across close to open" is fundamentally broken to
begin with. It's even documented as a silent data loss vector in
the close(2) man page:

$ man 2 close
.....
   Dealing with error returns from close()

	  A careful programmer will check the return value of
	  close(), since it is quite possible that  errors  on  a
	  previous  write(2)  operation  are reported  only on the
	  final close() that releases the open file description.
	  Failing to check the return value when closing a file may
	  lead to silent loss of data.  This can especially be
	  observed with NFS and with disk quota.
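
In other words, a tool that actually cares has to check all three calls,
because the error can surface at any of them. A minimal sketch:

#include <unistd.h>

/* Check write(), fsync() and close(); any of the three can be where a
 * previous IO error finally surfaces. */
static int careful_write_and_close(int fd, const void *buf, size_t len)
{
    int ret = 0;
    ssize_t n = write(fd, buf, len);

    if (n < 0 || (size_t)n != len)
        ret = -1;               /* write error or short write */
    if (fsync(fd) != 0)
        ret = -1;               /* writeback errors reported here... */
    if (close(fd) != 0)
        ret = -1;               /* ...or only here (NFS, quotas, etc.) */
    return ret;
}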

Yeah, ensuring data integrity in the face of IO errors is a really
hard problem. :/

----

To pound the broken record: there are many good reasons why Linux
filesystem developers have said "you should use direct IO" to the PG
devs each time we have this "the kernel doesn't do <complex things
PG needs>" discussion.

In this case, robust IO error reporting is easy with DIO. It's one
of the reasons most of the high performance database engines are
either using or moving to non-blocking AIO+DIO (RWF_NOWAIT) and use
O_DSYNC/RWF_DSYNC for integrity-critical IO dispatch. This is also
being driven by the availability of high performance, high IOPS
solid state storage where buffering in RAM to optimise IO patterns
and throughput provides no real performance benefit.

Using the AIO+DIO infrastructure ensures errors are reported for the
specific write that fails at failure time (i.e. in the aio
completion event for the specific IO), yet high IO throughput can be
maintained without the application needing its own threading
infrastructure to prevent blocking.
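
For illustration, the simplest synchronous flavour of this on Linux looks
something like the sketch below (pwritev2() with RWF_DSYNC on an O_DIRECT
fd; the real engines drive this through AIO rings instead, and the 4k
alignment is an assumption about the device's logical block size):

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <sys/uio.h>
#include <unistd.h>

/* Write one 4k block through an fd opened with O_WRONLY | O_DIRECT.
 * RWF_DSYNC means the data (and required metadata) is durable when this
 * returns, and any IO error is reported for *this* specific write. */
static int dio_write_block(int fd, const void *data, off_t offset)
{
    void *buf;
    struct iovec iov;
    ssize_t n;

    if (posix_memalign(&buf, 4096, 4096) != 0)
        return -1;              /* O_DIRECT needs aligned buffers */
    memcpy(buf, data, 4096);
    iov.iov_base = buf;
    iov.iov_len = 4096;
    n = pwritev2(fd, &iov, 1, offset, RWF_DSYNC);
    free(buf);
    return (n == 4096) ? 0 : -1;
}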

This means the application doesn't have to guess where the write
error occurred to retry/recover, have to handle async write errors
on close(), have to use fsync() to gather write IO errors and then
infer where the IO failure was, or require kernels on every
supported platform to jump through hoops to try to do exactly the
right thing in error conditions for everyone in all circumstances at
all times....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: fsync() errors is unsafe and risks data loss
  2018-04-11 21:52 ` Andreas Dilger
  2018-04-12  0:09   ` Dave Chinner
@ 2018-04-12  2:17   ` Andres Freund
  2018-04-12  3:02     ` Matthew Wilcox
                       ` (2 more replies)
  1 sibling, 3 replies; 57+ messages in thread
From: Andres Freund @ 2018-04-12  2:17 UTC (permalink / raw)
  To: Andreas Dilger
  Cc: 20180410184356.GD3563, Theodore Y. Ts'o,
	Ext4 Developers List, Linux FS Devel, Jeff Layton,
	Joshua D. Drake

Hi,

On 2018-04-11 15:52:44 -0600, Andreas Dilger wrote:
> On Apr 10, 2018, at 4:07 PM, Andres Freund <andres@anarazel.de> wrote:
> > 2018-04-10 18:43:56 Ted wrote:
> >> So for better or for worse, there has not been as much investment in
> >> buffered I/O and data robustness in the face of exception handling of
> >> storage devices.
> > 
> > That's a bit of a cop-out. It's not just databases that care. Even more
> > basic tools like SCM, package managers and editors care whether they can
> > get proper responses back from fsync that imply things actually were synced.
> 
> Sure, but it is mostly PG that is doing (IMHO) crazy things like writing
> to thousands(?) of files, closing the file descriptors, then expecting
> fsync() on a newly-opened fd to return a historical error.

It's not just postgres. dpkg (underlying apt, on Debian-derived distros),
to take an example I just guessed at random, does too:
  /* We want to guarantee the extracted files are on the disk, so that the
   * subsequent renames to the info database do not end up with old or zero
   * length files in case of a system crash. As neither dpkg-deb nor tar do
   * explicit fsync()s, we have to do them here.
   * XXX: This could be avoided by switching to an internal tar extractor. */
  dir_sync_contents(cidir);

(a bunch of other places too)

Especially on ext3, but also on newer filesystems, it's entirely
infeasible performance-wise to fsync() every single file individually -
performance becomes atrocious if you do that.

I think there are some legitimate arguments that a database should use
direct IO (more on that in my reply to David), but claiming that all
sorts of random utilities need to use DIO, with their own buffering
etc., is just insane.


> If an editor tries to write a file, then calls fsync and gets an
> error, the user will enter a new pathname and retry the write.  The
> package manager will assume the package installation failed, and
> uninstall the parts of the package that were already written.

Except that they won't notice that they got a failure, at least in the
dpkg case, and will happily continue installing corrupted data.


> There is no way the filesystem can handle the package manager failure case,
> and keeping the pages dirty and retrying indefinitely may never work (e.g.
> disk is dead or disconnected, is a sparse volume without any free space,
> etc).  This (IMHO) implies that the higher layer (which knows more about
> what the write failure implies) needs to deal with this.

Yeah, I agree that'd not be sane. As far as I understand the dpkg code
(all of 10 minutes reading it), that'd also be unnecessary. It can abort the
installation, but only if it detects the error.  Which isn't happening.


> > While there's some differing opinions on the referenced postgres thread,
> > the fundamental problem isn't so much that a retry won't fix the
> > problem, it's that we might NEVER see the failure.  If writeback happens
> > in the background, encounters an error, undirties the buffer, we will
> > happily carry on because we've never seen that. That's when we're
> > majorly screwed.
> 
> 
> I think there are two issues here - "fsync() on an fd that was just opened"
> and "persistent error state (without keeping dirty pages in memory)".
> 
> If there is background data writeback *without an open file descriptor*,
> there is no mechanism for the kernel to return an error to any application
> which may exist, or may not ever come back.

And that's *horrible*. If I cp a file, and writeback fails in the
background, and I then cat that file before restarting, I should be able
to see that that failed. Instead of returning something bogus.

Or even more extreme, you untar/zip/git clone a directory. Then do a
sync. And you don't know whether anything actually succeeded.


> Consider if there was a per-inode "there was once an error writing this
> inode" flag.  Then fsync() would return an error on the inode forever,
> since there is no way in POSIX to clear this state, since it would need
> to be kept in case some new fd is opened on the inode and does an fsync()
> and wants the error to be returned.

The data in the file also is corrupt. Having to unmount or delete the
file to reset the fact that it can't safely be assumed to be on disk
isn't insane.


> > Both in postgres, *and* a lot of other applications, it's not at all
> > guaranteed to consistently have one FD open for every file
> > written. Therefore even the more recent per-fd errseq logic doesn't
> > guarantee that the failure will ever be seen by an application
> > diligently fsync()ing.
> 
> ... only if the application closes all fds for the file before calling
> fsync.  If any fd is kept open from the time of the failure, it will
> return the original error on fsync() (and then no longer return it).
> 
> It's not that you need to keep every fd open forever.  You could put them
> into a shared pool, and re-use them if the file is "re-opened", and call
> fsync on each fd before it is closed (because the pool is getting too big
> or because you want to flush the data for that file, or shut down the DB).
> That wouldn't require a huge re-architecture of PG, just a small library
> to handle the shared fd pool.

Except that postgres uses multiple processes. And works on a lot of
architectures.  If we started to fsync all opened files on process exit
our users would *lynch* us.  We'd need a complicated scheme that sends
file descriptors across sockets between processes, then deduplicates them
on the receiving side, somehow figuring out which is the oldest file
descriptor (handling clock drift safely).

Note that it'd be perfectly fine to have "thrown away" the buffer
contents if we got notified that the fsync failed. We could just do
WAL replay and restore the contents (just as we do after crashes
and/or for replication).


> That might even improve performance, because opening and closing files
> is itself not free, especially if you are working with remote filesystems.

There's already a per-process cache of open files.


> > You'd not even need to have per inode information or such in the case
> > that the block device goes away entirely. As the FS isn't generally
> > unmounted in that case, you could trivially keep a per-mount (or
> > superblock?) bit that says "I died" and set that instead of keeping per
> > inode/whatever information.
> 
> The filesystem will definitely return an error in this case, I don't
> think this needs any kind of changes:
> 
> int ext4_sync_file(struct file *file, loff_t start, loff_t end, int datasync)
> {
>         if (unlikely(ext4_forced_shutdown(EXT4_SB(inode->i_sb))))
>                 return -EIO;

Well, I'm making that argument because several people argued that
throwing away buffer contents in this case is the only way to not cause
OOMs, and that that's incompatible with reporting errors. It's clearly
not...


> > 2018-04-10 18:43:56 Ted wrote:
> >> If you are aware of a company who is willing to pay to have a new
> >> kernel feature implemented to meet your needs, we might be able to
> >> refer you to a company or a consultant who might be able to do that
> >> work.
> > 
> > I find that a bit of a disappointing response. I think that's fair to say
> > for advanced features, but we're talking about the basic guarantee that
> > fsync actually does something even remotely reasonable.
> 
> Linux (as PG) is run by people who develop it for their own needs, or
> are paid to develop it for the needs of others.

Sure.


> Everyone already has too much work to do, so you need to find someone
> who has an interest in fixing this (IMHO very peculiar) use case.  If
> PG developers want to add a tunable "keep dirty pages in RAM on IO
> failure", I don't think that it would be too hard for someone to do.
> It might be harder to convince some of the kernel maintainers to
> accept it, and I've been on the losing side of that battle more than
> once.  However, like everything you don't pay for, you can't require
> someone else to do this for you.  It wouldn't hurt to see if Jeff
> Layton, who wrote the errseq patches, would be interested to work on
> something like this.

I don't think this is that PG specific, as explained above.


Greetings,

Andres Freund


* Re: fsync() errors is unsafe and risks data loss
  2018-04-12  0:09   ` Dave Chinner
@ 2018-04-12  2:32     ` Andres Freund
  2018-04-12  2:51       ` Andres Freund
                         ` (3 more replies)
  0 siblings, 4 replies; 57+ messages in thread
From: Andres Freund @ 2018-04-12  2:32 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Andreas Dilger, 20180410184356.GD3563, Theodore Y. Ts'o,
	Ext4 Developers List, Linux FS Devel, Jeff Layton,
	Joshua D. Drake

Hi,

On 2018-04-12 10:09:16 +1000, Dave Chinner wrote:
> To pound the broken record: there are many good reasons why Linux
> filesystem developers have said "you should use direct IO" to the PG
> devs each time we have this "the kernel doesn't do <complex things
> PG needs>" discussion.

I personally am on board with doing that. But you also gotta recognize
that an efficient DIO usage is a metric ton of work, and you need a
large amount of differing logic for different platforms. It's just not
realistic to do so for every platform.  Postgres is developed by a small
number of people, isn't VC backed etc. The amount of resources we can
throw at something is fairly limited.  I'm hoping to work on adding
Linux DIO support to pg, but I'm sure as hell not going to be able to
do the same on Windows (Solaris, HP-UX, AIX, ...) etc.

And there's cases where that just doesn't help at all. Being able to
untar a database from backup / archive / timetravel / whatnot, and then
fsyncing the directory tree to make sure it's actually safe, is really
not an insane idea.  Or even just cp -r ing it, and then starting up a
copy of the database.  What you're saying is that none of that is doable
in a safe way, unless you use special-case DIO using tooling for the
whole operation (or at least tools that fsync carefully without ever
closing a fd, which certainly isn't the case for cp et al).


> In this case, robust IO error reporting is easy with DIO. It's one
> of the reasons most of the high performance database engines are
> either using or moving to non-blocking AIO+DIO (RWF_NOWAIT) and use
> O_DSYNC/RWF_DSYNC for integrity-critical IO dispatch. This is also
> being driven by the availability of high performance, high IOPS
> solid state storage where buffering in RAM to optimise IO patterns
> and throughput provides no real performance benefit.
> 
> Using the AIO+DIO infrastructure ensures errors are reported for the
> specific write that fails at failure time (i.e. in the aio
> completion event for the specific IO), yet high IO throughput can be
> maintained without the application needing its own threading
> infrastructure to prevent blocking.
> 
> This means the application doesn't have to guess where the write
> error occurred to retry/recover, have to handle async write errors
> on close(), have to use fsync() to gather write IO errors and then
> infer where the IO failure was, or require kernels on every
> supported platform to jump through hoops to try to do exactly the
> right thing in error conditions for everyone in all circumstances at
> all times....

Most of that sounds like a good thing to do, but you've got to recognize
that that's a lot of Linux-specific code.

Greetings,

Andres Freund


* Re: fsync() errors is unsafe and risks data loss
  2018-04-12  2:32     ` Andres Freund
@ 2018-04-12  2:51       ` Andres Freund
  2018-04-12  5:09       ` Theodore Y. Ts'o
                         ` (2 subsequent siblings)
  3 siblings, 0 replies; 57+ messages in thread
From: Andres Freund @ 2018-04-12  2:51 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Andreas Dilger, 20180410184356.GD3563, Theodore Y. Ts'o,
	Ext4 Developers List, Linux FS Devel, Jeff Layton,
	Joshua D. Drake

Hi,

On 2018-04-11 19:32:21 -0700, Andres Freund wrote:
> And there's cases where that just doesn't help at all. Being able to
> untar a database from backup / archive / timetravel / whatnot, and then
> fsyncing the directory tree to make sure it's actually safe, is really
> not an insane idea.  Or even just cp -r ing it, and then starting up a
> copy of the database.  What you're saying is that none of that is doable
> in a safe way, unless you use special-case DIO using tooling for the
> whole operation (or at least tools that fsync carefully without ever
> closing a fd, which certainly isn't the case for cp et al).

And before somebody argues that that's too small a window to trigger the
problem realistically: Restoring large databases happens pretty commonly
(for new replicas, testcases, or actual fatal issues), takes time, and
it's where a lot of storage is actually written to for the first time in
a while, so it's far from unlikely to trigger bad block errors or such.

Greetings,

Andres Freund


* Re: fsync() errors is unsafe and risks data loss
  2018-04-12  2:17   ` Andres Freund
@ 2018-04-12  3:02     ` Matthew Wilcox
  2018-04-12 11:09       ` Jeff Layton
  2018-04-12  5:34     ` Theodore Y. Ts'o
  2018-04-18 18:09     ` J. Bruce Fields
  2 siblings, 1 reply; 57+ messages in thread
From: Matthew Wilcox @ 2018-04-12  3:02 UTC (permalink / raw)
  To: Andres Freund
  Cc: Andreas Dilger, 20180410184356.GD3563, Theodore Y. Ts'o,
	Ext4 Developers List, Linux FS Devel, Jeff Layton,
	Joshua D. Drake

On Wed, Apr 11, 2018 at 07:17:52PM -0700, Andres Freund wrote:
> > > While there's some differing opinions on the referenced postgres thread,
> > > the fundamental problem isn't so much that a retry won't fix the
> > > problem, it's that we might NEVER see the failure.  If writeback happens
> > > in the background, encounters an error, undirties the buffer, we will
> > > happily carry on because we've never seen that. That's when we're
> > > majorly screwed.
> > 
> > 
> > I think there are two issues here - "fsync() on an fd that was just opened"
> > and "persistent error state (without keeping dirty pages in memory)".
> > 
> > If there is background data writeback *without an open file descriptor*,
> > there is no mechanism for the kernel to return an error to any application
> > which may exist, or may not ever come back.
> 
> And that's *horrible*. If I cp a file, and writeback fails in the
> background, and I then cat that file before restarting, I should be able
> to see that that failed. Instead of returning something bogus.

At the moment, when we open a file, we sample the current state of the
writeback error and only report new errors.  We could set it to zero
instead, and report the most recent error as soon as anything happens
which would report an error.  That way err = close(open("file")); would
report the most recent error.
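
Under that (proposed, not current) behaviour, an application could then
do a crude "has this file ever seen a writeback error" probe; a sketch:

#include <fcntl.h>
#include <unistd.h>

/* Hypothetical semantics only: close() on a freshly opened fd would
 * report the most recent writeback error recorded for the inode. */
static int file_had_writeback_error(const char *path)
{
    int fd = open(path, O_RDONLY);

    if (fd < 0)
        return -1;
    return close(fd);       /* non-zero => a past writeback error */
}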

That's not going to be persistent across the data structure for that inode
being removed from memory; we'd need filesystem support for persisting
that.  But maybe it's "good enough" to only support it for recent files.

Jeff, what do you think?


* Re: fsync() errors is unsafe and risks data loss
  2018-04-12  2:32     ` Andres Freund
  2018-04-12  2:51       ` Andres Freund
@ 2018-04-12  5:09       ` Theodore Y. Ts'o
  2018-04-12  5:45       ` Dave Chinner
  2018-04-12 10:19       ` Lukas Czerner
  3 siblings, 0 replies; 57+ messages in thread
From: Theodore Y. Ts'o @ 2018-04-12  5:09 UTC (permalink / raw)
  To: Andres Freund
  Cc: Dave Chinner, Andreas Dilger, Ext4 Developers List,
	Linux FS Devel, Jeff Layton, Joshua D. Drake

On Wed, Apr 11, 2018 at 07:32:21PM -0700, Andres Freund wrote:
> 
> Most of that sounds like a good thing to do, but you've got to recognize
> that that's a lot of Linux-specific code.

I know it's not what PG has chosen, but realistically all of the other
major databases and userspace-based storage systems have used DIO
precisely *because* it's the way to avoid OS-specific behavior or
the need for OS-specific code.  DIO is simple, and pretty much the same
everywhere.

In contrast, the exact details of how buffered I/O works can be quite
different on different OSes.  This is especially true if you take into
account performance-related details (e.g., the cleaning algorithm, how
pages get chosen for eviction, etc.).

As I read the PG-hackers thread, I thought I saw acknowledgement that
some of the behaviors you don't like with Linux also show up on other
Unix or Unix-like systems?

						- Ted


* Re: fsync() errors is unsafe and risks data loss
  2018-04-12  2:17   ` Andres Freund
  2018-04-12  3:02     ` Matthew Wilcox
@ 2018-04-12  5:34     ` Theodore Y. Ts'o
  2018-04-12 19:55       ` Andres Freund
  2018-04-18 18:09     ` J. Bruce Fields
  2 siblings, 1 reply; 57+ messages in thread
From: Theodore Y. Ts'o @ 2018-04-12  5:34 UTC (permalink / raw)
  To: Andres Freund
  Cc: Andreas Dilger, 20180410184356.GD3563, Ext4 Developers List,
	Linux FS Devel, Jeff Layton, Joshua D. Drake

On Wed, Apr 11, 2018 at 07:17:52PM -0700, Andres Freund wrote:
> > If there is background data writeback *without an open file descriptor*,
> > there is no mechanism for the kernel to return an error to any application
> > which may exist, or may not ever come back.
> 
> And that's *horrible*. If I cp a file, and writeback fails in the
> background, and I then cat that file before restarting, I should be able
> to see that that failed. Instead of returning something bogus.

If there is no open file descriptor, and in many cases, no process
(because it has already exited), it may be horrible, but what the h*ll
else do you expect the OS to do?

The solution we use at Google is that we watch for I/O errors using a
completely different process that is responsible for monitoring
machine health.  It used to scrape dmesg, but we now arrange to have
I/O errors get sent via a netlink channel to the machine health
monitoring daemon.  If it detects errors on a particular hard drive,
it tells the cluster file system to stop using that disk, and to
reconstruct from erasure code all of the data chunks on that disk onto
other disks in the cluster.  We then run a series of disk diagnostics
to make sure we find all of the bad sectors (very often, where there
is one bad sector, there are several more waiting to be found), and
then afterwards, put the disk back into service.

By making it a separate health-monitoring process, we can have HDD
experts write much more sophisticated code that can ask the disk
firmware for more information (e.g., SMART, the grown defect list), do
much more careful scrubbing of the disk media, etc., before returning
the disk back to service.

> > Everyone already has too much work to do, so you need to find someone
> > who has an interest in fixing this (IMHO very peculiar) use case.  If
> > PG developers want to add a tunable "keep dirty pages in RAM on IO
> > failure", I don't think that it would be too hard for someone to do.
> > It might be harder to convince some of the kernel maintainers to
> > accept it, and I've been on the losing side of that battle more than
> > once.  However, like everything you don't pay for, you can't require
> > someone else to do this for you.  It wouldn't hurt to see if Jeff
> > Layton, who wrote the errseq patches, would be interested to work on
> > something like this.
> 
> I don't think this is that PG specific, as explained above.

The reality is that recovering from disk errors is tricky business,
and I very much doubt most userspace applications, including distro
package managers, are going to want to engineer for trying to detect
and recover from disk errors.  If that were true, then Red Hat and/or
SuSE have kernel engineers, and they would have implemented everything
on your wish list.  They haven't, and that should tell you
something.

The other reality is that once a disk starts developing errors, you
will probably need to take the disk off-line, scrub it to find any
other media errors, and there's a good chance you'll need to rewrite
bad sectors (including some which are on top of file system metadata,
so you probably will have to run fsck or reformat the whole file
system).  I certainly don't think it's realistic to expect each and
every userspace program to add that much sophistication.

If you have tens or hundreds of thousands of disk drives, then you
will need to do something automated, but I claim that you really
don't want to smush all of that detailed exception handling and HDD
repair technology into each database or cluster file system component.
It really needs to be done in a separate health-monitor and
machine-level management system.

Regards,

						- Ted


* Re: fsync() errors is unsafe and risks data loss
  2018-04-12  2:32     ` Andres Freund
  2018-04-12  2:51       ` Andres Freund
  2018-04-12  5:09       ` Theodore Y. Ts'o
@ 2018-04-12  5:45       ` Dave Chinner
  2018-04-12 11:24         ` Jeff Layton
  2018-04-12 10:19       ` Lukas Czerner
  3 siblings, 1 reply; 57+ messages in thread
From: Dave Chinner @ 2018-04-12  5:45 UTC (permalink / raw)
  To: Andres Freund
  Cc: Andreas Dilger, 20180410184356.GD3563, Theodore Y. Ts'o,
	Ext4 Developers List, Linux FS Devel, Jeff Layton,
	Joshua D. Drake

On Wed, Apr 11, 2018 at 07:32:21PM -0700, Andres Freund wrote:
> Hi,
> 
> On 2018-04-12 10:09:16 +1000, Dave Chinner wrote:
> > To pound the broken record: there are many good reasons why Linux
> > filesystem developers have said "you should use direct IO" to the PG
> > devs each time we have this "the kernel doesn't do <complex things
> > PG needs>" discussion.
> 
> I personally am on board with doing that. But you also gotta recognize
> that an efficient DIO usage is a metric ton of work, and you need a
> large amount of differing logic for different platforms. It's just not
> realistic to do so for every platform.  Postgres is developed by a small
> number of people, isn't VC backed etc. The amount of resources we can
> throw at something is fairly limited.  I'm hoping to work on adding
> Linux DIO support to pg, but I'm sure as hell not going to be able to
> do the same on Windows (Solaris, HP-UX, AIX, ...) etc.
> 
> And there's cases where that just doesn't help at all. Being able to
> untar a database from backup / archive / timetravel / whatnot, and then
> fsyncing the directory tree to make sure it's actually safe, is really
> not an insane idea.

Yes it is. 

This is what syncfs() is for - making sure a large amount of data
and metadata spread across many files and subdirectories in a single
filesystem is pushed to stable storage in the most efficient manner
possible.

> Or even just cp -r ing it, and then starting up a
> copy of the database.  What you're saying is that none of that is doable
> in a safe way, unless you use special-case DIO using tooling for the
> whole operation (or at least tools that fsync carefully without ever
> closing a fd, which certainly isn't the case for cp et al).

No, just saying that fsyncing individual files and directories is about
the most inefficient way you could possibly go about doing this.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: fsync() errors is unsafe and risks data loss
  2018-04-12  2:32     ` Andres Freund
                         ` (2 preceding siblings ...)
  2018-04-12  5:45       ` Dave Chinner
@ 2018-04-12 10:19       ` Lukas Czerner
  2018-04-12 19:46         ` Andres Freund
  3 siblings, 1 reply; 57+ messages in thread
From: Lukas Czerner @ 2018-04-12 10:19 UTC (permalink / raw)
  To: Andres Freund
  Cc: Dave Chinner, Andreas Dilger, 20180410184356.GD3563,
	Theodore Y. Ts'o, Ext4 Developers List, Linux FS Devel,
	Jeff Layton, Joshua D. Drake

On Wed, Apr 11, 2018 at 07:32:21PM -0700, Andres Freund wrote:

> And there's cases where that just doesn't help at all. Being able to
> untar a database from backup / archive / timetravel / whatnot, and then
> fsyncing the directory tree to make sure it's actually safe, is really
> not an insane idea.  Or even just cp -r ing it, and then starting up a
> copy of the database.  What you're saying is that none of that is doable
> in a safe way, unless you use special-case DIO using tooling for the
> whole operation (or at least tools that fsync carefully without ever
> closing a fd, which certainly isn't the case for cp et al).

That does not seem like a problem to me - just checksum the thing if you
really need to be extra safe. You should probably be doing that anyway if
you back up / archive / timetravel / whatnot.

-Lukas


* Re: fsync() errors is unsafe and risks data loss
  2018-04-12  3:02     ` Matthew Wilcox
@ 2018-04-12 11:09       ` Jeff Layton
  2018-04-12 11:19         ` Matthew Wilcox
                           ` (3 more replies)
  0 siblings, 4 replies; 57+ messages in thread
From: Jeff Layton @ 2018-04-12 11:09 UTC (permalink / raw)
  To: Matthew Wilcox, Andres Freund
  Cc: Andreas Dilger, 20180410184356.GD3563, Theodore Y. Ts'o,
	Ext4 Developers List, Linux FS Devel, Joshua D. Drake

On Wed, 2018-04-11 at 20:02 -0700, Matthew Wilcox wrote:
> On Wed, Apr 11, 2018 at 07:17:52PM -0700, Andres Freund wrote:
> > > > While there's some differing opinions on the referenced postgres thread,
> > > > the fundamental problem isn't so much that a retry won't fix the
> > > > problem, it's that we might NEVER see the failure.  If writeback happens
> > > > in the background, encounters an error, undirties the buffer, we will
> > > > happily carry on because we've never seen that. That's when we're
> > > > majorly screwed.
> > > 
> > > 
> > > I think there are two issues here - "fsync() on an fd that was just opened"
> > > and "persistent error state (without keeping dirty pages in memory)".
> > > 
> > > If there is background data writeback *without an open file descriptor*,
> > > there is no mechanism for the kernel to return an error to any application
> > > which may exist, or may not ever come back.
> > 
> > And that's *horrible*. If I cp a file, and writeback fails in the
> > background, and I then cat that file before restarting, I should be able
> > to see that that failed. Instead of returning something bogus.

What are you expecting to happen in this case? Are you expecting a read
error due to a writeback failure? Or are you just saying that we should
be invalidating pages that failed to be written back, so that they can
be re-read?

> At the moment, when we open a file, we sample the current state of the
> writeback error and only report new errors.  We could set it to zero
> instead, and report the most recent error as soon as anything happens
> which would report an error.  That way err = close(open("file")); would
> report the most recent error.
> 
> That's not going to be persistent across the data structure for that inode
> being removed from memory; we'd need filesystem support for persisting
> that.  But maybe it's "good enough" to only support it for recent files.
> 
> Jeff, what do you think?

I hate it :). We could do that, but....yecchhhh.

Reporting errors only in the case where the inode happened to stick
around in the cache seems too unreliable for real-world usage, and might
be problematic for some use cases. I'm also not sure it would really be
helpful.

I think the crux of the matter here is not really about error reporting,
per se. I asked this at LSF last year, and got no real answer:

When there is a writeback error, what should be done with the dirty
page(s)? Right now, we usually just mark them clean and carry on. Is
that the right thing to do?

One possibility would be to invalidate the range that failed to be
written (or the whole file) and force the pages to be faulted in again
on the next access. It could be surprising for some applications to not
see the results of their writes on a subsequent read after such an
event.

Maybe that's ok in the face of a writeback error though? IDK.
-- 
Jeff Layton <jlayton@redhat.com>


* Re: fsync() errors is unsafe and risks data loss
  2018-04-12 11:09       ` Jeff Layton
@ 2018-04-12 11:19         ` Matthew Wilcox
  2018-04-12 12:01         ` Dave Chinner
                           ` (2 subsequent siblings)
  3 siblings, 0 replies; 57+ messages in thread
From: Matthew Wilcox @ 2018-04-12 11:19 UTC (permalink / raw)
  To: Jeff Layton
  Cc: Andres Freund, Andreas Dilger, 20180410184356.GD3563,
	Theodore Y. Ts'o, Ext4 Developers List, Linux FS Devel,
	Joshua D. Drake

On Thu, Apr 12, 2018 at 07:09:14AM -0400, Jeff Layton wrote:
> On Wed, 2018-04-11 at 20:02 -0700, Matthew Wilcox wrote:
> > At the moment, when we open a file, we sample the current state of the
> > writeback error and only report new errors.  We could set it to zero
> > instead, and report the most recent error as soon as anything happens
> > which would report an error.  That way err = close(open("file")); would
> > report the most recent error.
> > 
> > That's not going to be persistent across the data structure for that inode
> > being removed from memory; we'd need filesystem support for persisting
> > that.  But maybe it's "good enough" to only support it for recent files.
> > 
> > Jeff, what do you think?
> 
> I hate it :). We could do that, but....yecchhhh.
> 
> Reporting errors only in the case where the inode happened to stick
> around in the cache seems too unreliable for real-world usage, and might
> be problematic for some use cases. I'm also not sure it would really be
> helpful.

Yeah, it's definitely half-arsed.  We could make further changes to
improve the situation, but they'd have wider impact.  For example, we can
tell if the error has been sampled by any existing fd, so we could bias
our inode reaping to have inodes with unreported errors stick around in
the cache for longer.

> I think the crux of the matter here is not really about error reporting,
> per se. I asked this at LSF last year, and got no real answer:
> 
> When there is a writeback error, what should be done with the dirty
> page(s)? Right now, we usually just mark them clean and carry on. Is
> that the right thing to do?

I suspect it isn't.  If there's a transient error then we should reattempt
the write.  OTOH if the error is permanent then reattempting the write
isn't going to do any good and it's just going to cause the drive to go
through the whole error handling dance again.  And what do we do if we're
low on memory and need these pages back to avoid going OOM?  There's a
lot of options here, all of them bad in one situation or another.

> One possibility would be to invalidate the range that failed to be
> written (or the whole file) and force the pages to be faulted in again
> on the next access. It could be surprising for some applications to not
> see the results of their writes on a subsequent read after such an
> event.
> 
> Maybe that's ok in the face of a writeback error though? IDK.

I don't know either.  It'd force the application to face up to the fact
that the data is gone immediately rather than only finding it out after
a reboot.  Again though that might cause more problems than it solves.
It's hard to know what the right thing to do is.


* Re: fsync() errors is unsafe and risks data loss
  2018-04-12  5:45       ` Dave Chinner
@ 2018-04-12 11:24         ` Jeff Layton
  2018-04-12 21:11           ` Andres Freund
  0 siblings, 1 reply; 57+ messages in thread
From: Jeff Layton @ 2018-04-12 11:24 UTC (permalink / raw)
  To: Dave Chinner, Andres Freund
  Cc: Andreas Dilger, 20180410184356.GD3563, Theodore Y. Ts'o,
	Ext4 Developers List, Linux FS Devel, Joshua D. Drake

On Thu, 2018-04-12 at 15:45 +1000, Dave Chinner wrote:
> On Wed, Apr 11, 2018 at 07:32:21PM -0700, Andres Freund wrote:
> > Hi,
> > 
> > On 2018-04-12 10:09:16 +1000, Dave Chinner wrote:
> > > To pound the broken record: there are many good reasons why Linux
> > > filesystem developers have said "you should use direct IO" to the PG
> > > devs each time we have this "the kernel doesn't do <complex things
> > > PG needs>" discussion.
> > 
> > I personally am on board with doing that. But you also gotta recognize
> > that an efficient DIO usage is a metric ton of work, and you need a
> > large amount of differing logic for different platforms. It's just not
> > realistic to do so for every platform.  Postgres is developed by a small
> > number of people, isn't VC backed etc. The amount of resources we can
> > throw at something is fairly limited.  I'm hoping to work on adding
> > Linux DIO support to pg, but I'm sure as hell not going to be able to
> > do the same on Windows (Solaris, HP-UX, AIX, ...) etc.
> > 
> > And there's cases where that just doesn't help at all. Being able to
> > untar a database from backup / archive / timetravel / whatnot, and then
> > fsyncing the directory tree to make sure it's actually safe, is really
> > not an insane idea.
> 
> Yes it is. 
> 
> This is what syncfs() is for - making sure a large amount of data
> and metadata spread across many files and subdirectories in a single
> filesystem is pushed to stable storage in the most efficient manner
> possible.
> 

Just note that the error return from syncfs is somewhat iffy. It doesn't
necessarily return an error when one inode fails to be written back. I
think it mainly returns errors when you get a metadata writeback error.


> > Or even just cp -r ing it, and then starting up a
> > copy of the database.  What you're saying is that none of that is doable
> > in a safe way, unless you use special-case DIO using tooling for the
> > whole operation (or at least tools that fsync carefully without ever
> > closing a fd, which certainly isn't the case for cp et al).
> 
> No, just saying that fsyncing individual files and directories is about
> the most inefficient way you could possibly go about doing this.
> 

You can still use syncfs, but what you'd probably have to do is call
syncfs while you still hold all of the fds open, and then fsync each
one afterward to ensure that they all got written back properly. That
should work as you'd expect.
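
A sketch of that combination (Linux-specific; syncfs() needs _GNU_SOURCE):

#define _GNU_SOURCE
#include <unistd.h>

/* Push everything out with one syncfs(), then fsync() each still-open fd
 * so any per-inode writeback error actually gets reported back to us. */
static int sync_all(int *fds, size_t nfds)
{
    int ret = 0;

    if (nfds == 0)
        return 0;
    if (syncfs(fds[0]) != 0)        /* any fd on that filesystem will do */
        ret = -1;
    for (size_t i = 0; i < nfds; i++)
        if (fsync(fds[i]) != 0)     /* should be cheap: data already written */
            ret = -1;
    return ret;
}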

-- 
Jeff Layton <jlayton@redhat.com>


* Re: fsync() errors is unsafe and risks data loss
  2018-04-12 11:09       ` Jeff Layton
  2018-04-12 11:19         ` Matthew Wilcox
@ 2018-04-12 12:01         ` Dave Chinner
  2018-04-12 15:08           ` Jeff Layton
  2018-04-12 15:16           ` Theodore Y. Ts'o
  2018-04-12 20:24         ` Andres Freund
  2018-04-21 18:14         ` Jan Kara
  3 siblings, 2 replies; 57+ messages in thread
From: Dave Chinner @ 2018-04-12 12:01 UTC (permalink / raw)
  To: Jeff Layton
  Cc: Matthew Wilcox, Andres Freund, Andreas Dilger,
	20180410184356.GD3563, Theodore Y. Ts'o,
	Ext4 Developers List, Linux FS Devel, Joshua D. Drake

On Thu, Apr 12, 2018 at 07:09:14AM -0400, Jeff Layton wrote:
> When there is a writeback error, what should be done with the dirty
> page(s)? Right now, we usually just mark them clean and carry on. Is
> that the right thing to do?

There isn't a right thing. Whatever we do will be wrong for someone.

> One possibility would be to invalidate the range that failed to be
> written (or the whole file) and force the pages to be faulted in again
> on the next access. It could be surprising for some applications to not
> see the results of their writes on a subsequent read after such an
> event.

Not to mention a POSIX IO ordering violation. Seeing stale data
after a "successful" write is simply not allowed.

> Maybe that's ok in the face of a writeback error though? IDK.

No matter what we do for async writeback error handling, it will be
slightly different from filesystem to filesystem, not to mention OS
to OS. There is no magic bullet here, so I'm not sure we should worry
too much. There's direct IO for anyone who cares and needs to know
about the completion status of every single write IO....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: fsync() errors is unsafe and risks data loss
  2018-04-12 12:01         ` Dave Chinner
@ 2018-04-12 15:08           ` Jeff Layton
  2018-04-12 22:44             ` Dave Chinner
  2018-04-12 15:16           ` Theodore Y. Ts'o
  1 sibling, 1 reply; 57+ messages in thread
From: Jeff Layton @ 2018-04-12 15:08 UTC (permalink / raw)
  To: Dave Chinner, lsf-pc
  Cc: Matthew Wilcox, Andres Freund, Andreas Dilger,
	20180410184356.GD3563, Theodore Y. Ts'o,
	Ext4 Developers List, Linux FS Devel, Joshua D. Drake

On Thu, 2018-04-12 at 22:01 +1000, Dave Chinner wrote:
> On Thu, Apr 12, 2018 at 07:09:14AM -0400, Jeff Layton wrote:
> > When there is a writeback error, what should be done with the dirty
> > page(s)? Right now, we usually just mark them clean and carry on. Is
> > that the right thing to do?
> 
> There isn't a right thing. Whatever we do will be wrong for someone.
> 
> > One possibility would be to invalidate the range that failed to be
> > written (or the whole file) and force the pages to be faulted in again
> > on the next access. It could be surprising for some applications to not
> > see the results of their writes on a subsequent read after such an
> > event.
> 
> Not to mention a POSIX IO ordering violation. Seeing stale data
> after a "successful" write is simply not allowed.
> 

I'm not so sure here, given that we're dealing with an error condition.
Are we really obligated not to allow any changes to pages that we can't
write back?

Given that the pages are clean after these failures, we aren't doing
this even today:

Suppose we're unable to do writes but can do reads vs. the backing
store. After a wb failure, the page has the dirty bit cleared. If it
gets kicked out of the cache before the read occurs, it'll have to be
faulted back in. Poof -- your write just disappeared.

That can even happen before you get the chance to call fsync, so even a
write()+read()+fsync() is not guaranteed to be safe in this regard
today, given sufficient memory pressure.
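
Concretely, even a careful sequence like this one can quietly lose the
write today (sketch only, nothing here is a durability guarantee):

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
	char out[] = "new data", in[sizeof(out)];
	int fd = open("file", O_RDWR);

	if (fd < 0 || pwrite(fd, out, sizeof(out), 0) != (ssize_t)sizeof(out))
		return 1;
	/* Background writeback can fail here, mark the page clean, and
	 * let it be reclaimed under memory pressure... */
	if (pread(fd, in, sizeof(in), 0) != (ssize_t)sizeof(in) ||
	    memcmp(in, out, sizeof(out)))
		fprintf(stderr, "...so this read may see the old data\n");
	/* fsync() should still report the error on this fd, but by then
	 * the only cached copy of the new data may already be gone. */
	if (fsync(fd))
		perror("fsync");
	return close(fd);
}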

I think the current situation is fine from a "let's not OOM at all
costs" standpoint, but not so good for application predictability. We
should really consider ways to do better here.


> > Maybe that's ok in the face of a writeback error though? IDK.
> 
> No matter what we do for async writeback error handling, it will be
> slightly different from filesystem to filesystem, not to mention OS
> to OS. There is no magic bullet here, so I'm not sure we should worry
> too much. There's direct IO for anyone who cares enough to need to know
> about the completion status of every single write IO....

I think we have an opportunity here to come up with better defined
and hopefully more useful behavior for buffered I/O in the face of
writeback errors. The first step would be to hash out what we'd want it
to look like.

Maybe we need a plenary session at LSF/MM?
-- 
Jeff Layton <jlayton@redhat.com>

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: fsync() errors is unsafe and risks data loss
  2018-04-12 12:01         ` Dave Chinner
  2018-04-12 15:08           ` Jeff Layton
@ 2018-04-12 15:16           ` Theodore Y. Ts'o
  2018-04-12 20:13             ` Andres Freund
  1 sibling, 1 reply; 57+ messages in thread
From: Theodore Y. Ts'o @ 2018-04-12 15:16 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Jeff Layton, Matthew Wilcox, Andres Freund, Andreas Dilger,
	20180410184356.GD3563, Ext4 Developers List, Linux FS Devel,
	Joshua D. Drake

On Thu, Apr 12, 2018 at 10:01:22PM +1000, Dave Chinner wrote:
> On Thu, Apr 12, 2018 at 07:09:14AM -0400, Jeff Layton wrote:
> > When there is a writeback error, what should be done with the dirty
> > page(s)? Right now, we usually just mark them clean and carry on. Is
> > that the right thing to do?
> 
> There isn't a right thing. Whatever we do will be wrong for someone.

That's the problem.  The best that could be done (and it's not enough)
would be to have a mode which does what the PG folks want (or what
they *think* they want).  It seems what they want is to not have an
error result in the page being marked clean.  When they discover the
outcome (OOM-city and the inability to unmount a file system on a failed
drive), then they will complain to us *again*, at which point we can
tell them that what they really want is another variation on O_PONIES,
and welcome to the real world and real life.

Which is why, even if they were to pay someone to implement what they
want, I'm not sure we would want to accept it upstream --- or distro's
might consider it a support nightmare, and refuse to allow that mode
to be enabled on enterprise distro's.  But at least, it will have been
some PG-based company who will have implemented it, so they're not
wasting other people's time or other people's resources...

We could try to get something like what Google is doing upstream,
which is to have the I/O errors sent to userspace via a netlink
channel (without changing anything else about how buffered writeback
is handled in the face of errors).  Then userspace applications could
switch to Direct I/O like all of the other really serious userspace
storage solutions I'm aware of, and then someone could try to write
some kind of HDD health monitoring system that tries to do the right
thing when a disk is discovered to have developed some media errors or
something more serious (e.g., a head failure).  That plus some kind of
RAID solution is I think the only thing which is really realistic for
a typical PG site.

That's certainly what *I* would do if I didn't decide to use a
hosted cloud solution, such as Cloud SQL for Postgres, and let someone
else solve the really hard problems of dealing with real-world HDD
failures.   :-)

					- Ted

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: fsync() errors is unsafe and risks data loss
  2018-04-12 10:19       ` Lukas Czerner
@ 2018-04-12 19:46         ` Andres Freund
  0 siblings, 0 replies; 57+ messages in thread
From: Andres Freund @ 2018-04-12 19:46 UTC (permalink / raw)
  To: Lukas Czerner
  Cc: Dave Chinner, Andreas Dilger, 20180410184356.GD3563,
	Theodore Y. Ts'o, Ext4 Developers List, Linux FS Devel,
	Jeff Layton, Joshua D. Drake

Hi,

On 2018-04-12 12:19:26 +0200, Lukas Czerner wrote:
> On Wed, Apr 11, 2018 at 07:32:21PM -0700, Andres Freund wrote:
> 
> > And there's cases where that just doesn't help at all. Being able to
> > untar a database from backup / archive / timetravel / whatnot, and then
> > fsyncing the directory tree to make sure it's actually safe, is really
> > not an insane idea.  Or even just cp -r ing it, and then starting up a
> > copy of the database.  What you're saying is that none of that is doable
> > in a safe way, unless you use special-case DIO using tooling for the
> > whole operation (or at least tools that fsync carefully without ever
> > closing a fd, which certainly isn't the case for cp et al).
> 
> Does not seem like a problem to me, just checksum the thing if you
> really need to be extra safe. You should probably be doing it anyway if
> you backup / archive / timetravel / whatnot.

That doesn't really help, unless you want to sync() and then re-read all
the data to make sure it's the same. Rereading multi-TB backups just to
know whether there was an error that the OS knew about isn't
particularly fun. Without verifying after sync, it's not going to improve
the situation measurably; you're still only going to discover that $data
isn't available when it's needed.

What you're saying here is that there's no way to use standard Linux
tools to manipulate files and know whether anything failed, without filtering
kernel logs for IO errors. Or am I missing something?

Greetings,

Andres Freund

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: fsync() errors is unsafe and risks data loss
  2018-04-12  5:34     ` Theodore Y. Ts'o
@ 2018-04-12 19:55       ` Andres Freund
  2018-04-12 21:52         ` Theodore Y. Ts'o
  0 siblings, 1 reply; 57+ messages in thread
From: Andres Freund @ 2018-04-12 19:55 UTC (permalink / raw)
  To: Theodore Y. Ts'o
  Cc: Andreas Dilger, Ext4 Developers List, Linux FS Devel,
	Jeff Layton, Joshua D. Drake

Hi,

On 2018-04-12 01:34:45 -0400, Theodore Y. Ts'o wrote:
> The solution we use at Google is that we watch for I/O errors using a
> completely different process that is responsible for monitoring
> machine health.  It used to scrape dmesg, but we now arrange to have
> I/O errors get sent via a netlink channel to the machine health
> monitoring daemon.

Any pointers to the underlying netlink mechanism? If we can force
postgres to kill itself when such an error is detected (via a dedicated
monitoring process), I'd personally be happy enough.  It'd be nicer if
we could associate that knowledge with particular filesystems etc
(which'd possibly be hard through dm etc.?), but this'd be much better than
nothing.



> The reality is that recovering from disk errors is tricky business,
> and I very much doubt most userspace applications, including distro
> package managers, are going to want to engineer for trying to detect
> and recover from disk errors.  If that were true, then Red Hat and/or
> SuSE have kernel engineers, and they would have implemented everything
> on your wish list.  They haven't, and that should tell you
> something.

The problem really isn't about *recovering* from disk errors. *Knowing*
about them is the crucial part. We do not want to give clients back the
information that an operation succeeded when it actually didn't. There
could be improvements beyond that, but as long as it's guaranteed that
"we" get the error (rather than just some kernel log we don't have
access to, which looks different due to config etc), it's ok; we can
throw our hands up in the air and give up.


> The other reality is that once a disk starts developing errors, in
> reality you will probably need to take the disk off-line, scrub it to
> find any other media errors, and there's a good chance you'll need to
> rewrite bad sectors (including some which are on top of file system
> metadata, so you probably will have to run fsck or reformat the whole
> file system).  I certainly don't think it's realistic to assume adding
> lots of sophistication to each and every userspace program.

> If you have tens or hundreds of thousands of disk drives, then you
> will need to do something automated, but I claim that you really
> don't want to smush all of that detailed exception handling and HDD
> repair technology into each database or cluster file system component.
> It really needs to be done in a separate health-monitor and
> machine-level management system.

Yea, agreed on all that. I don't think anybody actually involved in
postgres wants to do anything like that. Seems far outside of postgres'
remit.

Greetings,

Andres Freund

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: fsync() errors is unsafe and risks data loss
  2018-04-12 15:16           ` Theodore Y. Ts'o
@ 2018-04-12 20:13             ` Andres Freund
  2018-04-12 20:28               ` Matthew Wilcox
  0 siblings, 1 reply; 57+ messages in thread
From: Andres Freund @ 2018-04-12 20:13 UTC (permalink / raw)
  To: Theodore Y. Ts'o
  Cc: Dave Chinner, Jeff Layton, Matthew Wilcox, Andreas Dilger,
	20180410184356.GD3563, Ext4 Developers List, Linux FS Devel,
	Joshua D. Drake

Hi,

On 2018-04-12 11:16:46 -0400, Theodore Y. Ts'o wrote:
> That's the problem.  The best that could be done (and it's not enough)
> would be to have a mode which does what the PG folks want (or what
> they *think* they want).  It seems what they want is to not have an
> error result in the page being marked clean.  When they discover the
> outcome (OOM-city and the inability to unmount a file system on a failed
> drive), then they will complain to us *again*, at which point we can
> tell them that what they really want is another variation on O_PONIES,
> and welcome to the real world and real life.

I think a per-file or even per-blockdev/fs error state that'd be
returned by fsync() would be more than sufficient.  I don't see that
that'd realistically trigger OOM or the inability to unmount a
filesystem.  If the drive is entirely gone there's obviously no point in
keeping per-file information around, so per-blockdev/fs information
suffices entirely to return an error on fsync (which at least on ext4
appears to happen if the underlying blockdev is gone).

Have fun making up things we want, but I'm not sure it's particularly
productive.


> Which is why, even if they were to pay someone to implement what they
> want, I'm not sure we would want to accept it upstream --- or distro's
> might consider it a support nightmare, and refuse to allow that mode
> to be enabled on enterprise distro's.  But at least, it will have been
> some PG-based company who will have implemented it, so they're not
> wasting other people's time or other people's resources...

Well, that's why I'm discussing here so we can figure out what's
acceptable before considering wasting money and review cycles doing or
paying somebody to do some crazy useless shit.


> We could try to get something like what Google is doing upstream,
> which is to have the I/O errors sent to userspace via a netlink
> channel (without changing anything else about how buffered writeback
> is handled in the face of errors).

Ah, darn. After you'd mentioned that in an earlier mail I'd hoped that'd
be upstream. And yes, that'd be perfect.


> Then userspace applications could switch to Direct I/O like all of the
> other really serious userspace storage solutions I'm aware of, and
> then someone could try to write some kind of HDD health monitoring
> system that tries to do the right thing when a disk is discovered to
> have developed some media errors or something more serious (e.g., a
> head failure).  That plus some kind of RAID solution is I think the
> only thing which is really realistic for a typical PG site.

As I said earlier, I think there's good reason to move to DIO for
postgres. But to keep that performant is going to need some serious
work.

But afaict such a solution wouldn't really depend on applications using
DIO or not. Before finishing a checkpoint (logging it persistently and
allowing older data to be thrown away), we could check if any errors have
been reported and give up if there have been any.  And after starting
postgres on a directory restored from backup using $tool, we can fsync
the directory recursively, check for such errors, and give up if
there've been any.
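
The "fsync the directory recursively" part might look roughly like this
(a sketch; it obviously can only see errors the kernel still reports,
which is the whole point of this discussion):

#define _XOPEN_SOURCE 700
#include <fcntl.h>
#include <ftw.h>
#include <stdio.h>
#include <sys/stat.h>
#include <unistd.h>

static int nerrors;

/* fsync every regular file and directory in a restored tree and count
 * any reported failures. */
static int sync_one(const char *path, const struct stat *sb,
		    int type, struct FTW *ftw)
{
	int fd;

	if (type != FTW_F && type != FTW_D)
		return 0;
	fd = open(path, O_RDONLY);
	if (fd < 0 || fsync(fd)) {
		fprintf(stderr, "fsync of %s failed\n", path);
		nerrors++;
	}
	if (fd >= 0)
		close(fd);
	return 0;
}

int main(int argc, char **argv)
{
	if (argc < 2 || nftw(argv[1], sync_one, 64, FTW_PHYS))
		return 2;
	return nerrors ? 1 : 0;
}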

Greetings,

Andres Freund

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: fsync() errors is unsafe and risks data loss
  2018-04-12 11:09       ` Jeff Layton
  2018-04-12 11:19         ` Matthew Wilcox
  2018-04-12 12:01         ` Dave Chinner
@ 2018-04-12 20:24         ` Andres Freund
  2018-04-12 21:27           ` Jeff Layton
  2018-04-21 18:14         ` Jan Kara
  3 siblings, 1 reply; 57+ messages in thread
From: Andres Freund @ 2018-04-12 20:24 UTC (permalink / raw)
  To: Jeff Layton
  Cc: Matthew Wilcox, Andreas Dilger, 20180410184356.GD3563,
	Theodore Y. Ts'o, Ext4 Developers List, Linux FS Devel,
	Joshua D. Drake

On 2018-04-12 07:09:14 -0400, Jeff Layton wrote:
> On Wed, 2018-04-11 at 20:02 -0700, Matthew Wilcox wrote:
> > On Wed, Apr 11, 2018 at 07:17:52PM -0700, Andres Freund wrote:
> > > > > While there's some differing opinions on the referenced postgres thread,
> > > > > the fundamental problem isn't so much that a retry won't fix the
> > > > > problem, it's that we might NEVER see the failure.  If writeback happens
> > > > > in the background, encounters an error, undirties the buffer, we will
> > > > > happily carry on because we've never seen that. That's when we're
> > > > > majorly screwed.
> > > > 
> > > > 
> > > > I think there are two issues here - "fsync() on an fd that was just opened"
> > > > and "persistent error state (without keeping dirty pages in memory)".
> > > > 
> > > > If there is background data writeback *without an open file descriptor*,
> > > > there is no mechanism for the kernel to return an error to any application
> > > > which may exist, or may not ever come back.
> > > 
> > > And that's *horrible*. If I cp a file, and writeback fails in the
> > > background, and I then cat that file before restarting, I should be able
> > > to see that that failed. Instead of returning something bogus.
> 
> What are you expecting to happen in this case? Are you expecting a read
> error due to a writeback failure? Or are you just saying that we should
> be invalidating pages that failed to be written back, so that they can
> be re-read?

Yes, I'd hope for a read error after a writeback failure. I think that's
sane behaviour. But I don't really care *that* much.

At the very least *some* way to *know* that such a failure occurred from
userland without having to parse the kernel log. As far as I understand,
neither sync(2) (and thus sync(1)) nor syncfs(2) is guaranteed to report
an error if it was encountered by writeback in the background.

If that's indeed true for syncfs(2), even if the fd has been opened
before (which I can see how it could happen from an implementation POV,
nothing would associate a random FD with failures on different files),
it's really impossible to detect this stuff from userland without text
parsing.

Even if it were just a per-fs /sys/$something file that'd return the
current count of unreported errors in a filesystem-independent way, it'd
be better than what we have right now.

1) figure out /sys/$whatnot $directory belongs to
2) oldcount=$(cat /sys/$whatnot/unreported_errors)
3) filesystem operations in $directory
4) sync;sync;
5) newcount=$(cat /sys/$whatnot/unreported_errors)
6) test "$oldcount" -eq "$newcount" || die-with-horrible-message

Isn't beautiful to script, but it's also not absolutely terrible.

Greetings,

Andres Freund

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: fsync() errors is unsafe and risks data loss
  2018-04-12 20:13             ` Andres Freund
@ 2018-04-12 20:28               ` Matthew Wilcox
  2018-04-12 21:14                 ` Jeff Layton
  2018-04-12 21:21                 ` Theodore Y. Ts'o
  0 siblings, 2 replies; 57+ messages in thread
From: Matthew Wilcox @ 2018-04-12 20:28 UTC (permalink / raw)
  To: Andres Freund
  Cc: Theodore Y. Ts'o, Dave Chinner, Jeff Layton, Andreas Dilger,
	20180410184356.GD3563, Ext4 Developers List, Linux FS Devel,
	Joshua D. Drake

On Thu, Apr 12, 2018 at 01:13:22PM -0700, Andres Freund wrote:
> I think a per-file or even per-blockdev/fs error state that'd be
> returned by fsync() would be more than sufficient.

Ah; this was my suggestion to Jeff on IRC.  That we add a per-superblock
wb_err and then allow syncfs() to return it.  So you'd open an fd on
a directory (for example), and call syncfs() which would return -EIO
or -ENOSPC if either of those conditions had occurred since you opened
the fd.

>  I don't see that
> that'd realistically trigger OOM or the inability to unmount a
> filesystem.

Ted's referring to the current state of affairs where the writeback error
is held in the inode; if we can't evict the inode because it's holding
the error indicator, that can send us OOM.  If instead we transfer the
error indicator to the superblock, then there's no problem.
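
Under that proposal, userspace usage might look roughly like this (to be
clear, syncfs() is not guaranteed to report these errors today):

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* Open an fd on any directory in the filesystem before doing the work,
 * then syncfs() it and check the result afterwards. */
int check_fs_errors(const char *any_dir_on_fs)
{
	int ret = 0;
	int fd = open(any_dir_on_fs, O_RDONLY | O_DIRECTORY);

	if (fd < 0)
		return -1;
	/* ... create / copy / overwrite files on that filesystem ... */
	if (syncfs(fd)) {	/* proposed: -EIO/-ENOSPC seen since open */
		perror("syncfs");
		ret = -1;
	}
	close(fd);
	return ret;
}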

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: fsync() errors is unsafe and risks data loss
  2018-04-12 11:24         ` Jeff Layton
@ 2018-04-12 21:11           ` Andres Freund
  0 siblings, 0 replies; 57+ messages in thread
From: Andres Freund @ 2018-04-12 21:11 UTC (permalink / raw)
  To: Jeff Layton
  Cc: Dave Chinner, Andreas Dilger, 20180410184356.GD3563,
	Theodore Y. Ts'o, Ext4 Developers List, Linux FS Devel,
	Joshua D. Drake

On 2018-04-12 07:24:12 -0400, Jeff Layton wrote:
> On Thu, 2018-04-12 at 15:45 +1000, Dave Chinner wrote:
> > On Wed, Apr 11, 2018 at 07:32:21PM -0700, Andres Freund wrote:
> > > Hi,
> > > 
> > > On 2018-04-12 10:09:16 +1000, Dave Chinner wrote:
> > > > To pound the broken record: there are many good reasons why Linux
> > > > filesystem developers have said "you should use direct IO" to the PG
> > > > devs each time we have this "the kernel doesn't do <complex things
> > > > PG needs>" discussion.
> > > 
> > > I personally am on board with doing that. But you also gotta recognize
> > > that an efficient DIO usage is a metric ton of work, and you need a
> > > large amount of differing logic for different platforms. It's just not
> > > realistic to do so for every platform.  Postgres is developed by a small
> > > number of people, isn't VC backed etc. The amount of resources we can
> > > throw at something is fairly limited.  I'm hoping to work on adding
> > > linux DIO support to pg, but I'm sure as hell not going to do be able to
> > > do the same on windows (solaris, hpux, aix, ...) etc.
> > > 
> > > And there's cases where that just doesn't help at all. Being able to
> > > untar a database from backup / archive / timetravel / whatnot, and then
> > > fsyncing the directory tree to make sure it's actually safe, is really
> > > not an insane idea.
> > 
> > Yes it is. 
> > 
> > This is what syncfs() is for - making sure a large amount of of data
> > and metadata spread across many files and subdirectories in a single
> > filesystem is pushed to stable storage in the most efficient manner
> > possible.

syncfs isn't standardized, it operates on an entire filesystem (thus
writing out unnecessary stuff), and it has no meaningful documentation of
its return codes.  Yes, using syncfs() might be better performance-wise,
but it doesn't seem like it actually solves anything, performance aside:

> Just note that the error return from syncfs is somewhat iffy. It doesn't
> necessarily return an error when one inode fails to be written back. I
> think it mainly returns errors when you get a metadata writeback error.


> You can still use syncfs but what you'd probably have to do is call
> syncfs while you still hold all of the fd's open, and then fsync each
> one afterward to ensure that they all got written back properly. That
> should work as you'd expect.

Which again doesn't allow one to use any non-bespoke tooling (like tar
or whatnot). And it means you'll have to call syncfs() every few hundred
files, because you'll obviously run into filehandle limitations.

Greetings,

Andres Freund

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: fsync() errors is unsafe and risks data loss
  2018-04-12 20:28               ` Matthew Wilcox
@ 2018-04-12 21:14                 ` Jeff Layton
  2018-04-12 21:31                   ` Matthew Wilcox
  2018-04-12 21:21                 ` Theodore Y. Ts'o
  1 sibling, 1 reply; 57+ messages in thread
From: Jeff Layton @ 2018-04-12 21:14 UTC (permalink / raw)
  To: Matthew Wilcox, Andres Freund
  Cc: Theodore Y. Ts'o, Dave Chinner, Andreas Dilger,
	20180410184356.GD3563, Ext4 Developers List, Linux FS Devel,
	Joshua D. Drake

On Thu, 2018-04-12 at 13:28 -0700, Matthew Wilcox wrote:
> On Thu, Apr 12, 2018 at 01:13:22PM -0700, Andres Freund wrote:
> > I think a per-file or even per-blockdev/fs error state that'd be
> > returned by fsync() would be more than sufficient.
>
> Ah; this was my suggestion to Jeff on IRC.  That we add a per-superblock
> wb_err and then allow syncfs() to return it.  So you'd open an fd on
> a directory (for example), and call syncfs() which would return -EIO
> or -ENOSPC if either of those conditions had occurred since you opened
> the fd.

Not a bad idea and shouldn't be too costly. mapping_set_error could
flag the superblock one before or after the one in the mapping.

We'd need to define what happens if you interleave fsync and syncfs
calls on the same inode though. How do we handle file->f_wb_err in that
case? Would we need a second field in struct file to act as the per-sb
error cursor?
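
Something along these lines, purely as a sketch - the s_wb_err field and
the helper names are invented here, and this is not what
mapping_set_error() does today:

#include <linux/errseq.h>
#include <linux/fs.h>
#include <linux/pagemap.h>

/* Record a writeback error in a per-superblock errseq_t in addition to
 * the per-mapping one, so syncfs() has something to report against. */
static inline void record_wb_error(struct address_space *mapping, int error)
{
	if (unlikely(error)) {
		if (mapping->host)
			errseq_set(&mapping->host->i_sb->s_wb_err, error);
		errseq_set(&mapping->wb_err, error);
	}
}

/* syncfs() side: report anything new since this fd last looked, using
 * file->f_wb_err (or a second per-file cursor) as the "since" value. */
static int sb_check_wb_err(struct super_block *sb, struct file *file)
{
	return errseq_check_and_advance(&sb->s_wb_err, &file->f_wb_err);
}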

> > I don't see that
> > that'd realistically trigger OOM or the inability to unmount a
> > filesystem.
> 
> Ted's referring to the current state of affairs where the writeback error
> is held in the inode; if we can't evict the inode because it's holding
> the error indicator, that can send us OOM.  If instead we transfer the
> error indicator to the superblock, then there's no problem.
> 

-- 
Jeff Layton <jlayton@redhat.com>

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: fsync() errors is unsafe and risks data loss
  2018-04-12 20:28               ` Matthew Wilcox
  2018-04-12 21:14                 ` Jeff Layton
@ 2018-04-12 21:21                 ` Theodore Y. Ts'o
  2018-04-12 21:24                   ` Matthew Wilcox
  2018-04-12 21:37                   ` Andres Freund
  1 sibling, 2 replies; 57+ messages in thread
From: Theodore Y. Ts'o @ 2018-04-12 21:21 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Andres Freund, Dave Chinner, Jeff Layton, Andreas Dilger,
	20180410184356.GD3563, Ext4 Developers List, Linux FS Devel,
	Joshua D. Drake

On Thu, Apr 12, 2018 at 01:28:30PM -0700, Matthew Wilcox wrote:
> On Thu, Apr 12, 2018 at 01:13:22PM -0700, Andres Freund wrote:
> > I think a per-file or even per-blockdev/fs error state that'd be
> > returned by fsync() would be more than sufficient.
> 
> Ah; this was my suggestion to Jeff on IRC.  That we add a per-superblock
> wb_err and then allow syncfs() to return it.  So you'd open an fd on
> a directory (for example), and call syncfs() which would return -EIO
> or -ENOSPC if either of those conditions had occurred since you opened
> the fd.

When or how would the per-superblock wb_err flag get cleared?

Would all subsequent fsync() calls on that file system now return EIO?
Or would only all subsequent syncfs() calls return EIO?

> >  I don't see that
> > that'd realistically trigger OOM or the inability to unmount a
> > filesystem.
> 
> Ted's referring to the current state of affairs where the writeback error
> is held in the inode; if we can't evict the inode because it's holding
> the error indicator, that can send us OOM.  If instead we transfer the
> error indicator to the superblock, then there's no problem.

Actually, I was referring to the pg-hackers original ask, which was
that after an error, all of the dirty pages that couldn't be written
out would stay dirty.

If it's only a single inode which is pinned in memory with the dirty
flag, that's bad, but it's not as bad as pinning all of the memory
pages for which there was a failed write.  We would still need to
invent some mechanism or define some semantics for when it would be OK to
clear the per-inode flag and let the memory associated with that
pinned inode get released, though.

						- Ted

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: fsync() errors is unsafe and risks data loss
  2018-04-12 21:21                 ` Theodore Y. Ts'o
@ 2018-04-12 21:24                   ` Matthew Wilcox
  2018-04-12 21:37                   ` Andres Freund
  1 sibling, 0 replies; 57+ messages in thread
From: Matthew Wilcox @ 2018-04-12 21:24 UTC (permalink / raw)
  To: Theodore Y. Ts'o
  Cc: Andres Freund, Dave Chinner, Jeff Layton, Andreas Dilger,
	20180410184356.GD3563, Ext4 Developers List, Linux FS Devel,
	Joshua D. Drake

On Thu, Apr 12, 2018 at 05:21:44PM -0400, Theodore Y. Ts'o wrote:
> On Thu, Apr 12, 2018 at 01:28:30PM -0700, Matthew Wilcox wrote:
> > On Thu, Apr 12, 2018 at 01:13:22PM -0700, Andres Freund wrote:
> > > I think a per-file or even per-blockdev/fs error state that'd be
> > > returned by fsync() would be more than sufficient.
> > 
> > Ah; this was my suggestion to Jeff on IRC.  That we add a per-superblock
> > wb_err and then allow syncfs() to return it.  So you'd open an fd on
> > a directory (for example), and call syncfs() which would return -EIO
> > or -ENOSPC if either of those conditions had occurred since you opened
> > the fd.
> 
> When or how would the per-superblock wb_err flag get cleared?

That's not how errseq works, Ted ;-)

> Would all subsequent fsync() calls on that file system now return EIO?
> Or would only all subsequent syncfs() calls return EIO?

Only ones which occur after the last sampling get reported through this
particular file descriptor.

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: fsync() errors is unsafe and risks data loss
  2018-04-12 20:24         ` Andres Freund
@ 2018-04-12 21:27           ` Jeff Layton
  2018-04-12 21:53             ` Andres Freund
  0 siblings, 1 reply; 57+ messages in thread
From: Jeff Layton @ 2018-04-12 21:27 UTC (permalink / raw)
  To: Andres Freund
  Cc: Matthew Wilcox, Andreas Dilger, 20180410184356.GD3563,
	Theodore Y. Ts'o, Ext4 Developers List, Linux FS Devel,
	Joshua D. Drake

On Thu, 2018-04-12 at 13:24 -0700, Andres Freund wrote:
> On 2018-04-12 07:09:14 -0400, Jeff Layton wrote:
> > On Wed, 2018-04-11 at 20:02 -0700, Matthew Wilcox wrote:
> > > On Wed, Apr 11, 2018 at 07:17:52PM -0700, Andres Freund wrote:
> > > > > > While there's some differing opinions on the referenced postgres thread,
> > > > > > the fundamental problem isn't so much that a retry won't fix the
> > > > > > problem, it's that we might NEVER see the failure.  If writeback happens
> > > > > > in the background, encounters an error, undirties the buffer, we will
> > > > > > happily carry on because we've never seen that. That's when we're
> > > > > > majorly screwed.
> > > > > 
> > > > > 
> > > > > I think there are two issues here - "fsync() on an fd that was just opened"
> > > > > and "persistent error state (without keeping dirty pages in memory)".
> > > > > 
> > > > > If there is background data writeback *without an open file descriptor*,
> > > > > there is no mechanism for the kernel to return an error to any application
> > > > > which may exist, or may not ever come back.
> > > > 
> > > > And that's *horrible*. If I cp a file, and writeback fails in the
> > > > background, and I then cat that file before restarting, I should be able
> > > > to see that that failed. Instead of returning something bogus.
> > 
> > What are you expecting to happen in this case? Are you expecting a read
> > error due to a writeback failure? Or are you just saying that we should
> > be invalidating pages that failed to be written back, so that they can
> > be re-read?
> 
> Yes, I'd hope for a read error after a writeback failure. I think that's
> sane behaviour. But I don't really care *that* much.
> 

I'll have to respectfully disagree. Why should I interpret an error on
a read() syscall to mean that writeback failed? Note that the data is
still potentially intact.

What _might_ make sense, IMO, is to just invalidate the pages that
failed to be written back. Then you could potentially do a read to
fault them in again (i.e. sync the pagecache and the backing store) and
possibly redirty them for another try.

Note that you can detect this situation by checking the return code
from fsync. It should report the latest error once per file
description.
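
i.e. roughly (the function name is just for illustration):

#include <stdio.h>
#include <unistd.h>

/* The writeback error is reported once per file description; treat any
 * fsync() failure as "this data is not durable", not as something a
 * blind retry will fix. */
int checkpoint_file(int fd)
{
	if (fsync(fd)) {
		perror("fsync");	/* EIO/ENOSPC since the last check */
		return -1;
	}
	return 0;
}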

> At the very least *some* way to *know* that such a failure occurred from
> userland without having to parse the kernel log. As far as I understand,
> neither sync(2) (and thus sync(1)) nor syncfs(2) is guaranteed to report
> an error if it was encountered by writeback in the background.
> 
> If that's indeed true for syncfs(2), even if the fd has been opened
> before (which I can see how it could happen from an implementation POV,
> nothing would associate a random FD with failures on different files),
> it's really impossible to detect this stuff from userland without text
> parsing.
> 

syncfs could use some work.

I'm warming to willy's idea to add a per-sb errseq_t. I think that
might be a simple way to get better semantics here. Not sure how we
want to handle the reporting end yet though...

We probably also need to consider how to better track metadata
writeback errors (on e.g. ext2). We don't really do that properly
quite yet either.


> Even if it were just a per-fs /sys/$something file that'd return the
> current count of unreported errors in a filesystem-independent way, it'd
> be better than what we have right now.
> 
> 1) figure out /sys/$whatnot $directory belongs to
> 2) oldcount=$(cat /sys/$whatnot/unreported_errors)
> 3) filesystem operations in $directory
> 4) sync;sync;
> 5) newcount=$(cat /sys/$whatnot/unreported_errors)
> 6) test "$oldcount" -eq "$newcount" || die-with-horrible-message
> 
> Isn't beautiful to script, but it's also not absolutely terrible.

-- 
Jeff Layton <jlayton@redhat.com>

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: fsync() errors is unsafe and risks data loss
  2018-04-12 21:14                 ` Jeff Layton
@ 2018-04-12 21:31                   ` Matthew Wilcox
  2018-04-13 12:56                     ` Jeff Layton
  0 siblings, 1 reply; 57+ messages in thread
From: Matthew Wilcox @ 2018-04-12 21:31 UTC (permalink / raw)
  To: Jeff Layton
  Cc: Andres Freund, Theodore Y. Ts'o, Dave Chinner,
	Andreas Dilger, 20180410184356.GD3563, Ext4 Developers List,
	Linux FS Devel, Joshua D. Drake

On Thu, Apr 12, 2018 at 05:14:54PM -0400, Jeff Layton wrote:
> On Thu, 2018-04-12 at 13:28 -0700, Matthew Wilcox wrote:
> > On Thu, Apr 12, 2018 at 01:13:22PM -0700, Andres Freund wrote:
> > > I think a per-file or even per-blockdev/fs error state that'd be
> > > returned by fsync() would be more than sufficient.
> >
> > Ah; this was my suggestion to Jeff on IRC.  That we add a per-superblock
> > wb_err and then allow syncfs() to return it.  So you'd open an fd on
> > a directory (for example), and call syncfs() which would return -EIO
> > or -ENOSPC if either of those conditions had occurred since you opened
> > the fd.
> 
> Not a bad idea and shouldn't be too costly. mapping_set_error could
> flag the superblock one before or after the one in the mapping.
> 
> We'd need to define what happens if you interleave fsync and syncfs
> calls on the same inode though. How do we handle file->f_wb_err in that
> case? Would we need a second field in struct file to act as the per-sb
> error cursor?

Ooh.  I hadn't thought that through.  Bleh.  I don't want to add a field
to struct file for this uncommon case.

Maybe O_PATH could be used for this?  It gets you a file descriptor on
a particular filesystem, so syncfs() is defined, but it can't report
a writeback error.  So if you open something O_PATH, you can use the
file's f_wb_err for the mapping's error cursor.
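
So, hypothetically (the error reporting shown is the proposal, not
current behaviour):

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* An O_PATH fd never sees ordinary per-file writeback errors, so its
 * f_wb_err could double as the per-superblock error cursor. */
int open_fs_watch(const char *mountpoint)
{
	return open(mountpoint, O_PATH);
}

int check_fs_watch(int fd)
{
	if (syncfs(fd)) {	/* proposed: errors seen since last check */
		perror("syncfs");
		return -1;
	}
	return 0;
}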

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: fsync() errors is unsafe and risks data loss
  2018-04-12 21:21                 ` Theodore Y. Ts'o
  2018-04-12 21:24                   ` Matthew Wilcox
@ 2018-04-12 21:37                   ` Andres Freund
  1 sibling, 0 replies; 57+ messages in thread
From: Andres Freund @ 2018-04-12 21:37 UTC (permalink / raw)
  To: Theodore Y. Ts'o
  Cc: Matthew Wilcox, Dave Chinner, Jeff Layton, Andreas Dilger,
	20180410184356.GD3563, Ext4 Developers List, Linux FS Devel,
	Joshua D. Drake

Hi,

On 2018-04-12 17:21:44 -0400, Theodore Y. Ts'o wrote:
> On Thu, Apr 12, 2018 at 01:28:30PM -0700, Matthew Wilcox wrote:
> > On Thu, Apr 12, 2018 at 01:13:22PM -0700, Andres Freund wrote:
> > > I think a per-file or even per-blockdev/fs error state that'd be
> > > returned by fsync() would be more than sufficient.
> > 
> > Ah; this was my suggestion to Jeff on IRC.  That we add a per-superblock
> > wb_err and then allow syncfs() to return it.  So you'd open an fd on
> > a directory (for example), and call syncfs() which would return -EIO
> > or -ENOSPC if either of those conditions had occurred since you opened
> > the fd.
> 
> When or how would the per-superblock wb_err flag get cleared?

I don't think unmount + resettable via /sys would be an insane
approach. Requiring explicit action to acknowledge data loss isn't a
crazy concept.  But I think that's something reasonable minds could
disagree with.


> Would all subsequent fsync() calls on that file system now return EIO?
> Or would only all subsequent syncfs() calls return EIO?

If it were tied to syncfs, I wonder if there's a way to have some
errseq-type logic. Store a per-superblock (or whatever equivalent thing)
errseq value of errors.  For each fd calling syncfs(), report the error
once, but then store the current value in a separate per-fd field.  And
if that's considered too weird, only report the errors to fds that were
opened before the error occurred.

I can see writing a tool 'pg_run_and_sync /directo /ries -- command'
which opens an fd for each of the filesystems the directories reside on,
and calls syncfs() afterwards.  That'd allow backup/restore tools to be
used at least semi-safely.
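
A rough sketch of what such a wrapper could look like (the details are
invented, and it leans on the per-sb error reporting discussed above):

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/wait.h>
#include <unistd.h>

/* pg_run_and_sync dir... -- command...: open one fd per directory (one
 * per filesystem would do), run the command, then syncfs() each fd and
 * fail if any of them reports an error. */
int main(int argc, char **argv)
{
	int fds[64], nfds = 0, i, status, failed = 0;

	for (i = 1; i < argc && strcmp(argv[i], "--"); i++)
		if (nfds >= 64 ||
		    (fds[nfds++] = open(argv[i], O_RDONLY | O_DIRECTORY)) < 0)
			return 2;
	if (i + 1 >= argc)
		return 2;

	if (fork() == 0) {
		execvp(argv[i + 1], &argv[i + 1]);
		_exit(127);
	}
	wait(&status);

	for (i = 0; i < nfds; i++)
		if (syncfs(fds[i])) {
			perror("syncfs");
			failed = 1;
		}

	return failed || !WIFEXITED(status) || WEXITSTATUS(status) ? 1 : 0;
}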


> > >  I don't see that
> > > that'd realistically trigger OOM or the inability to unmount a
> > > filesystem.
> > 
> > Ted's referring to the current state of affairs where the writeback error
> > is held in the inode; if we can't evict the inode because it's holding
> > the error indicator, that can send us OOM.  If instead we transfer the
> > error indicator to the superblock, then there's no problem.
> 
> Actually, I was referring to the pg-hackers original ask, which was
> that after an error, all of the dirty pages that couldn't be written
> out would stay dirty.

Well, it's an open list, everyone can argue. And initially people
didn't know the OOM explanation, and it takes some time to revise one's
priors :).  I think it's a design question that reasonable people can
disagree on (at least if "hot"-removed devices are handled by throwing
data away regardless).  But as it's clearly not something viable, we can
move on to something that can solve the problem.


> If it's only a single inode which is pinned in memory with the dirty
> flag, that's bad, but it's not as bad as pinning all of the memory
> pages for which there was a failed write.  We would still need to
> invent some mechanism or define some semantic when it would be OK to
> clear the per-inode flag and let the memory associated with that
> pinned inode get released, though.

Yea, I agree that that's not obvious. One way would be to say that it's
only automatically cleared when you unlink the file. A bit heavy-handed,
but not too crazy.

Greetings,

Andres Freund

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: fsync() errors is unsafe and risks data loss
  2018-04-12 19:55       ` Andres Freund
@ 2018-04-12 21:52         ` Theodore Y. Ts'o
  2018-04-12 22:03           ` Andres Freund
  0 siblings, 1 reply; 57+ messages in thread
From: Theodore Y. Ts'o @ 2018-04-12 21:52 UTC (permalink / raw)
  To: Andres Freund
  Cc: Andreas Dilger, Ext4 Developers List, Linux FS Devel,
	Jeff Layton, Joshua D. Drake

On Thu, Apr 12, 2018 at 12:55:36PM -0700, Andres Freund wrote:
> 
> Any pointers to that the underling netlink mechanism? If we can force
> postgres to kill itself when such an error is detected (via a dedicated
> monitoring process), I'd personally be happy enough.  It'd be nicer if
> we could associate that knowledge with particular filesystems etc
> (which'd possibly hard through dm etc?), but this'd be much better than
> nothing.

Yeah, sorry, it never got upstreamed.  It's not really all that
complicated, it was just that there were some other folks who wanted
to do something similar, and there was a round of bike-shedding
several years ago, and nothing ever went upstream.  Part of the
problem was that our original scheme sent up information about file
system-level corruption reports --- e.g., those stemming from calls to
ext4_error() --- and lots of people had different ideas about how to
get all of the possible information up in some structured format.
(Think something like uerf from Digital's OSF/1.)

We did something *really* simple/stupid.  We just sent essentially an
ASCII text string out the netlink socket.  That's because what we were
doing before was essentially scraping the output of dmesg
(e.g. /dev/kmsg).

That's actually probably the simplest thing to do, and it has the
advantage that it will work even on ancient enterprise kernels that PG
users are likely to want to use.  So you will need to implement the
dmesg text scraper anyway, and that's probably good enough for most
use cases.
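
The scraper end can be almost embarrassingly dumb - something like the
following, where the substrings matched are only examples and the real
patterns vary by driver (which is exactly the annoying part):

#include <stdio.h>
#include <string.h>

int main(void)
{
	char rec[8192];
	FILE *kmsg = fopen("/dev/kmsg", "r");

	if (!kmsg)
		return 1;
	/* Each read of /dev/kmsg returns one log record. */
	while (fgets(rec, sizeof(rec), kmsg))
		if (strstr(rec, "I/O error") ||
		    strstr(rec, "Buffer I/O error"))
			fprintf(stderr, "possible lost write: %s", rec);
	return 0;
}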

> The problem really isn't about *recovering* from disk errors. *Knowing*
> about them is the crucial part. We do not want to give back clients the
> information that an operation succeeded, when it actually didn't. There
> could be improvements above that, but as long as it's guaranteed that
> "we" get the error (rather than just some kernel log we don't have
> access to, which looks different due to config etc), it's ok. We can
> throw our hands up in the air and give up.

Right, it's a little challenging because the actual regexps you would
need to use do vary from device driver to device driver.  Fortunately
nearly everything is a SCSI/SATA device these days, so there isn't
_that_ much variability.

> Yea, agreed on all that. I don't think anybody actually involved in
> postgres wants to do anything like that. Seems far outside of postgres'
> remit.

Some people on the pg-hackers list were talking about wanting to retry
the fsync() and hoping that would cause the write to somehow succeed.
It's *possible* that might help, but it's not likely to be helpful in
my experience.

Cheers,

						- Ted

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: fsync() errors is unsafe and risks data loss
  2018-04-12 21:27           ` Jeff Layton
@ 2018-04-12 21:53             ` Andres Freund
  2018-04-12 21:57               ` Theodore Y. Ts'o
  0 siblings, 1 reply; 57+ messages in thread
From: Andres Freund @ 2018-04-12 21:53 UTC (permalink / raw)
  To: Jeff Layton
  Cc: Matthew Wilcox, Andreas Dilger, 20180410184356.GD3563,
	Theodore Y. Ts'o, Ext4 Developers List, Linux FS Devel,
	Joshua D. Drake

On 2018-04-12 17:27:54 -0400, Jeff Layton wrote:
> On Thu, 2018-04-12 at 13:24 -0700, Andres Freund wrote:
> > At the very least *some* way to *know* that such a failure occurred from
> > userland without having to parse the kernel log. As far as I understand,
> > neither sync(2) (and thus sync(1)) nor syncfs(2) is guaranteed to report
> > an error if it was encountered by writeback in the background.
> > 
> > If that's indeed true for syncfs(2), even if the fd has been opened
> > before (which I can see how it could happen from an implementation POV,
> > nothing would associate a random FD with failures on different files),
> > it's really impossible to detect this stuff from userland without text
> > parsing.
> > 
> 
> syncfs could use some work.

It's really too bad that it doesn't have a flags argument.


> We probably also need to consider how to better track metadata
> writeback errors (on e.g. ext2). We don't really do that properly
> quite yet either.
> 
> 
> > Even if it were just a per-fs /sys/$something file that'd return the
> > current count of unreported errors in a filesystem-independent way, it'd
> > be better than what we have right now.
> > 
> > 1) figure out /sys/$whatnot $directory belongs to
> > 2) oldcount=$(cat /sys/$whatnot/unreported_errors)
> > 3) filesystem operations in $directory
> > 4) sync;sync;
> > 5) newcount=$(cat /sys/$whatnot/unreported_errors)
> > 6) test "$oldcount" -eq "$newcount" || die-with-horrible-message
> > 
> > Isn't beautiful to script, but it's also not absolutely terrible.

ext4 seems to have something roughly like that
(/sys/fs/ext4/$dev/errors_count), and by my reading it already seems to
be incremented from the necessary places.  XFS doesn't seem to have
anything similar.

Wouldn't be bad to standardize...

Greetings,

Andres Freund

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: fsync() errors is unsafe and risks data loss
  2018-04-12 21:53             ` Andres Freund
@ 2018-04-12 21:57               ` Theodore Y. Ts'o
  0 siblings, 0 replies; 57+ messages in thread
From: Theodore Y. Ts'o @ 2018-04-12 21:57 UTC (permalink / raw)
  To: Andres Freund
  Cc: Jeff Layton, Matthew Wilcox, Andreas Dilger,
	20180410184356.GD3563, Ext4 Developers List, Linux FS Devel,
	Joshua D. Drake

On Thu, Apr 12, 2018 at 02:53:19PM -0700, Andres Freund wrote:
> > > 
> > > Isn't beautiful to script, but it's also not absolutely terrible.
> 
> ext4 seems to have something roughly like that
> (/sys/fs/ext4/$dev/errors_count), and by my reading it already seems to
> be incremented from the necessary places.

This is only for file system inconsistencies noticed by the kernel.
We don't bump that count for data block I/O errors.

The same idea could be used at the block device level.  It would be
pretty simple to maintain a counter of I/O errors, plus a record of when
the last error was detected on a particular device.  You could even
break out and track read errors and write errors separately if that
would be useful.

If you don't care what block was bad, but just that _some_ I/O error
had happened, a counter is definitely the simplest approach, and less
hairy to implement and use than something like a netlink channel or
scraping dmesg....
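
Consumed from userspace that could be as simple as the following - to be
clear, the sysfs attribute name here is invented, nothing like it exists
today:

#include <stdio.h>
#include <unistd.h>

/* Hypothetical: compare a per-device I/O error counter before and
 * after syncing. */
static long read_count(const char *path)
{
	long n = -1;
	FILE *f = fopen(path, "r");

	if (f) {
		if (fscanf(f, "%ld", &n) != 1)
			n = -1;
		fclose(f);
	}
	return n;
}

int main(void)
{
	const char *ctr = "/sys/block/sda/io_err_count";	/* invented */
	long before = read_count(ctr);

	sync();
	return read_count(ctr) != before;	/* non-zero: new errors */
}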

						- Ted

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: fsync() errors is unsafe and risks data loss
  2018-04-12 21:52         ` Theodore Y. Ts'o
@ 2018-04-12 22:03           ` Andres Freund
  0 siblings, 0 replies; 57+ messages in thread
From: Andres Freund @ 2018-04-12 22:03 UTC (permalink / raw)
  To: Theodore Y. Ts'o
  Cc: Andreas Dilger, Ext4 Developers List, Linux FS Devel,
	Jeff Layton, Joshua D. Drake

Hi,

On 2018-04-12 17:52:52 -0400, Theodore Y. Ts'o wrote:
> We did something *really* simple/stupid.  We just sent essentially an
> ascii test string out the netlink socket.  That's because what we were
> doing before was essentially scraping the output of dmesg
> (e.g. /dev/kmssg).
> 
> That's actually probably the simplest thing to do, and it has the
> advantage that it will work even on ancient enterprise kernels that PG
> users are likely to want to use.  So you will need to implement the
> dmesg text scraper anyway, and that's probably good enough for most
> use cases.

The worst part of that is, as you mention below, needing to handle a lot
of different error message formats. I guess it's reasonable enough if
you control your hardware, but we have no such luck.

Aren't there quite realistic scenarios where one could miss kmsg-style
messages due to it being a ring buffer?


> Right, it's a little challenging because the actual regexp's you would
> need to use do vary from device driver to device driver.  Fortunately
> nearly everything is a SCSI/SATA device these days, so there isn't
> _that_ much variability.

There's also SAN / NAS type stuff - not all of that presents as a
SCSI/SATA device, right?


> > Yea, agreed on all that. I don't think anybody actually involved in
> > postgres wants to do anything like that. Seems far outside of postgres'
> > remit.
> 
> Some people on the pg-hackers list were talking about wanting to retry
> the fsync() and hoping that would cause the write to somehow suceed.
> It's *possible* that might help, but it's not likely to be helpful in
> my experience.

Depends on the type of error and storage. ENOSPC, especially over NFS,
has some reasonable chances of being cleared up. And for networked block
storage it's also not impossible to think of scenarios where that'd
work for EIO.

But I think, besides the hope of the error clearing up by itself,
retrying has the advantage that it can trivially give *some* feedback to
the user. The user'll get back strerror(ENOSPC) with some decent SQL
error code, which'll hopefully cause them to investigate (well, once
monitoring detects high error rates).  It's much nicer for the user to
type COMMIT and get an appropriate error back, etc., than if the
database just commits suicide.

Greetings,

Andres Freund

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: fsync() errors is unsafe and risks data loss
  2018-04-12 15:08           ` Jeff Layton
@ 2018-04-12 22:44             ` Dave Chinner
  2018-04-13 13:18               ` Jeff Layton
  0 siblings, 1 reply; 57+ messages in thread
From: Dave Chinner @ 2018-04-12 22:44 UTC (permalink / raw)
  To: Jeff Layton
  Cc: lsf-pc, Matthew Wilcox, Andres Freund, Andreas Dilger,
	20180410184356.GD3563, Theodore Y. Ts'o,
	Ext4 Developers List, Linux FS Devel, Joshua D. Drake

On Thu, Apr 12, 2018 at 11:08:50AM -0400, Jeff Layton wrote:
> On Thu, 2018-04-12 at 22:01 +1000, Dave Chinner wrote:
> > On Thu, Apr 12, 2018 at 07:09:14AM -0400, Jeff Layton wrote:
> > > When there is a writeback error, what should be done with the dirty
> > > page(s)? Right now, we usually just mark them clean and carry on. Is
> > > that the right thing to do?
> > 
> > There isn't a right thing. Whatever we do will be wrong for someone.
> > 
> > > One possibility would be to invalidate the range that failed to be
> > > written (or the whole file) and force the pages to be faulted in again
> > > on the next access. It could be surprising for some applications to not
> > > see the results of their writes on a subsequent read after such an
> > > event.
> > 
> > Not to mention a POSIX IO ordering violation. Seeing stale data
> > after a "successful" write is simply not allowed.
> 
> I'm not so sure here, given that we're dealing with an error condition.
> Are we really obligated not to allow any changes to pages that we can't
> write back?

Posix says this about write():

  After a write() to a regular file has successfully returned:

     Any successful read() from each byte position in the file that
     was modified by that write shall return the data specified by
     the write() for that position until such byte positions are
     again modified.

IOWs, even if there is a later error, we told the user the write was
successful, and so according to POSIX we are not allowed to wind
back the data to what it was before the write() occurred.

> Given that the pages are clean after these failures, we aren't doing
> this even today:
> 
> Suppose we're unable to do writes but can do reads vs. the backing
> store. After a wb failure, the page has the dirty bit cleared. If it
> gets kicked out of the cache before the read occurs, it'll have to be
> faulted back in. Poof -- your write just disappeared.

Yes - I was pointing out what the specification we supposedly
conform to says about this behaviour, not that our current behaviour
conforms to the spec.  Indeed, have you even noticed
xfs_aops_discard_page() and its surrounding context on page
writeback submission errors?

To save you looking, XFS will trash the page contents completely on
a filesystem level ->writepage error. It doesn't mark them "clean",
doesn't attempt to redirty and rewrite them - it clears the uptodate
state and may invalidate it completely. IOWs, the data written
"sucessfully" to the cached page is now gone. It will be re-read
from disk on the next read() call, in direct violation of the above
POSIX requirements.

This is my point: we've done that in XFS knowing that we violate
POSIX specifications in this specific corner case - it's the lesser
of many evils we have to choose between. Hence if we choose to encode
that behaviour as the general writeback IO error handling algorithm,
then it needs to be done with the knowledge that it is a specification
violation. Not to mention be documented as a POSIX violation in the
various relevant man pages and that this is how all filesystems will
behave on async writeback error.....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: fsync() errors is unsafe and risks data loss
  2018-04-12 21:31                   ` Matthew Wilcox
@ 2018-04-13 12:56                     ` Jeff Layton
  0 siblings, 0 replies; 57+ messages in thread
From: Jeff Layton @ 2018-04-13 12:56 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Andres Freund, Theodore Y. Ts'o, Dave Chinner,
	Andreas Dilger, 20180410184356.GD3563, Ext4 Developers List,
	Linux FS Devel, Joshua D. Drake

On Thu, 2018-04-12 at 14:31 -0700, Matthew Wilcox wrote:
> On Thu, Apr 12, 2018 at 05:14:54PM -0400, Jeff Layton wrote:
> > On Thu, 2018-04-12 at 13:28 -0700, Matthew Wilcox wrote:
> > > On Thu, Apr 12, 2018 at 01:13:22PM -0700, Andres Freund wrote:
> > > > I think a per-file or even per-blockdev/fs error state that'd be
> > > > returned by fsync() would be more than sufficient.
> > > 
> > > Ah; this was my suggestion to Jeff on IRC.  That we add a per-superblock
> > > wb_err and then allow syncfs() to return it.  So you'd open an fd on
> > > a directory (for example), and call syncfs() which would return -EIO
> > > or -ENOSPC if either of those conditions had occurred since you opened
> > > the fd.
> > 
> > Not a bad idea and shouldn't be too costly. mapping_set_error could
> > flag the superblock one before or after the one in the mapping.
> > 
> > We'd need to define what happens if you interleave fsync and syncfs
> > calls on the same inode though. How do we handle file->f_wb_err in that
> > case? Would we need a second field in struct file to act as the per-sb
> > error cursor?
> 
> Ooh.  I hadn't thought that through.  Bleh.  I don't want to add a field
> to struct file for this uncommon case.
> 
> Maybe O_PATH could be used for this?  It gets you a file descriptor on
> a particular filesystem, so syncfs() is defined, but it can't report
> a writeback error.  So if you open something O_PATH, you can use the
> file's f_wb_err for the mapping's error cursor.
> 

That might work.

It'd be a syscall behavioral change so we'd need to document that well.
It's probably innocuous though -- I doubt we have a lot of callers in
the field opening files with O_PATH and calling syncfs on them.
-- 
Jeff Layton <jlayton@redhat.com>

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: fsync() errors is unsafe and risks data loss
  2018-04-12 22:44             ` Dave Chinner
@ 2018-04-13 13:18               ` Jeff Layton
  2018-04-13 13:25                 ` Andres Freund
  2018-04-13 14:02                 ` Matthew Wilcox
  0 siblings, 2 replies; 57+ messages in thread
From: Jeff Layton @ 2018-04-13 13:18 UTC (permalink / raw)
  To: Dave Chinner
  Cc: lsf-pc, Matthew Wilcox, Andres Freund, Andreas Dilger,
	Theodore Y. Ts'o, Ext4 Developers List, Linux FS Devel,
	Joshua D. Drake

On Fri, 2018-04-13 at 08:44 +1000, Dave Chinner wrote:
> On Thu, Apr 12, 2018 at 11:08:50AM -0400, Jeff Layton wrote:
> > On Thu, 2018-04-12 at 22:01 +1000, Dave Chinner wrote:
> > > On Thu, Apr 12, 2018 at 07:09:14AM -0400, Jeff Layton wrote:
> > > > When there is a writeback error, what should be done with the dirty
> > > > page(s)? Right now, we usually just mark them clean and carry on. Is
> > > > that the right thing to do?
> > > 
> > > There isn't a right thing. Whatever we do will be wrong for someone.
> > > 
> > > > One possibility would be to invalidate the range that failed to be
> > > > written (or the whole file) and force the pages to be faulted in again
> > > > on the next access. It could be surprising for some applications to not
> > > > see the results of their writes on a subsequent read after such an
> > > > event.
> > > 
> > > Not to mention a POSIX IO ordering violation. Seeing stale data
> > > after a "successful" write is simply not allowed.
> > 
> > I'm not so sure here, given that we're dealing with an error condition.
> > Are we really obligated not to allow any changes to pages that we can't
> > write back?
> 
> Posix says this about write():
> 
>   After a write() to a regular file has successfully returned:
> 
>      Any successful read() from each byte position in the file that
>      was modified by that write shall return the data specified by
>      the write() for that position until such byte positions are
>      again modified.
> 
> IOWs, even if there is a later error, we told the user the write was
> successful, and so according to POSIX we are not allowed to wind
> back the data to what it was before the write() occurred.
> 
> > Given that the pages are clean after these failures, we aren't doing
> > this even today:
> > 
> > Suppose we're unable to do writes but can do reads vs. the backing
> > store. After a wb failure, the page has the dirty bit cleared. If it
> > gets kicked out of the cache before the read occurs, it'll have to be
> > faulted back in. Poof -- your write just disappeared.
> 
> Yes - I was pointing out what the specification we supposedly
> conform to says about this behaviour, not that our current behaviour
> conforms to the spec.  Indeed, have you even noticed
> xfs_aops_discard_page() and its surrounding context on page
> writeback submission errors?
> 
> To save you looking, XFS will trash the page contents completely on
> a filesystem level ->writepage error. It doesn't mark them "clean",
> doesn't attempt to redirty and rewrite them - it clears the uptodate
> state and may invalidate it completely. IOWs, the data written
> "sucessfully" to the cached page is now gone. It will be re-read
> from disk on the next read() call, in direct violation of the above
> POSIX requirements.
> 
> This is my point: we've done that in XFS knowing that we violate
> POSIX specifications in this specific corner case - it's the lesser
> of many evils we have to choose between. Hence if we choose to encode
> that behaviour as the general writeback IO error handling algorithm,
> then it needs to be done with the knowledge that it is a specification
> violation. Not to mention be documented as a POSIX violation in the
> various relevant man pages and that this is how all filesystems will
> behave on async writeback error.....
> 

Got it, thanks.

Yes, I think we probably ought to do the same thing globally. It's nice
to know that xfs has already been doing this. That makes me feel better
about making this behavior the gold standard for Linux filesystems.

So to summarize, at this point in the discussion, I think we want to
consider doing the following:

* better reporting from syncfs (report an error when even one inode
failed to be written back since last syncfs call). We'll probably
implement this via a per-sb errseq_t in some fashion, though there are
some implementation issues to work out.

* invalidate or clear uptodate flag on pages that experience writeback
errors, across filesystems. Encourage this as standard behavior for
filesystems and maybe add helpers to make it easier to do this.

Did I miss anything? Would that be enough to help the Pg usecase?

I don't see us ever being able to reasonably support its current
expectation that writeback errors will be seen on fd's that were opened
after the error occurred. That's a really thorny problem from an object
lifetime perspective.
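
For illustration, here is a rough sketch of what that per-sb errseq_t
plumbing might look like if we reuse the existing errseq helpers. The
s_wb_err / f_sb_err fields and the helper names below are assumptions
for the sake of discussion, not code that exists today:

#include <linux/errseq.h>
#include <linux/fs.h>

/*
 * Assume struct super_block grows an errseq_t s_wb_err and struct file
 * grows an errseq_t f_sb_err cursor.
 */

/* Writeback paths call this whenever any inode on the sb hits an error. */
static inline void sb_mark_wb_error(struct super_block *sb, int err)
{
	errseq_set(&sb->s_wb_err, err);
}

/* At open() time, record where this struct file starts watching from. */
static inline void file_sample_sb_err(struct file *file)
{
	file->f_sb_err = errseq_sample(&file_inode(file)->i_sb->s_wb_err);
}

/* syncfs(fd) then reports an error if anything failed since we sampled. */
static inline int sb_check_wb_error(struct file *file)
{
	struct super_block *sb = file_inode(file)->i_sb;

	return errseq_check_and_advance(&sb->s_wb_err, &file->f_sb_err);
}

That keeps the object lifetime simple: the error state lives with the
superblock, and each fd only carries a cursor into it.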

-- 
Jeff Layton <jlayton@redhat.com>

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: fsync() errors is unsafe and risks data loss
  2018-04-13 13:18               ` Jeff Layton
@ 2018-04-13 13:25                 ` Andres Freund
  2018-04-13 14:02                 ` Matthew Wilcox
  1 sibling, 0 replies; 57+ messages in thread
From: Andres Freund @ 2018-04-13 13:25 UTC (permalink / raw)
  To: Jeff Layton
  Cc: Dave Chinner, lsf-pc, Matthew Wilcox, Andreas Dilger,
	Theodore Y. Ts'o, Ext4 Developers List, Linux FS Devel,
	Joshua D. Drake

Hi,

On 2018-04-13 09:18:56 -0400, Jeff Layton wrote:
> Yes, I think we ought to probably do the same thing globally. It's nice
> to know that xfs has already been doing this. That makes me feel better
> about making this behavior the gold standard for Linux filesystems.
> 
> So to summarize, at this point in the discussion, I think we want to
> consider doing the following:
> 
> * better reporting from syncfs (report an error when even one inode
> failed to be written back since last syncfs call). We'll probably
> implement this via a per-sb errseq_t in some fashion, though there are
> some implementation issues to work out.
> 
> * invalidate or clear uptodate flag on pages that experience writeback
> errors, across filesystems. Encourage this as standard behavior for
> filesystems and maybe add helpers to make it easier to do this.
> 
> Did I miss anything? Would that be enough to help the Pg usecase?
> 
> I don't see us ever being able to reasonably support its current
> expectation that writeback errors will be seen on fd's that were opened
> after the error occurred. That's a really thorny problem from an object
> lifetime perspective.

It's not perfect, but I think the amount of hacky OS specific code
should be acceptable. And it does allow for a wrapper tool that can be
used around backup restores etc to syncfs all the necessary filesystems.
Let me mull with others for a bit.
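
For concreteness, such a wrapper could be about as simple as the sketch
below (userspace, untested; it only becomes useful once syncfs()
actually reports per-sb writeback errors as discussed above):

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* syncfs-wrapper: syncfs() every filesystem named on the command line
 * and exit non-zero if any of them reports an error. */
int main(int argc, char **argv)
{
	int ret = 0;

	for (int i = 1; i < argc; i++) {
		int fd = open(argv[i], O_RDONLY | O_DIRECTORY);

		if (fd < 0 || syncfs(fd) < 0) {
			perror(argv[i]);
			ret = 1;
		}
		if (fd >= 0)
			close(fd);
	}
	return ret;
}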

Greetings,

Andres Freund

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: fsync() errors is unsafe and risks data loss
  2018-04-13 13:18               ` Jeff Layton
  2018-04-13 13:25                 ` Andres Freund
@ 2018-04-13 14:02                 ` Matthew Wilcox
  2018-04-14  1:47                   ` Dave Chinner
  1 sibling, 1 reply; 57+ messages in thread
From: Matthew Wilcox @ 2018-04-13 14:02 UTC (permalink / raw)
  To: Jeff Layton
  Cc: Dave Chinner, lsf-pc, Andres Freund, Andreas Dilger,
	Theodore Y. Ts'o, Ext4 Developers List, Linux FS Devel,
	Joshua D. Drake

On Fri, Apr 13, 2018 at 09:18:56AM -0400, Jeff Layton wrote:
> On Fri, 2018-04-13 at 08:44 +1000, Dave Chinner wrote:
> > To save you looking, XFS will trash the page contents completely on
> > a filesystem level ->writepage error. It doesn't mark them "clean",
> > doesn't attempt to redirty and rewrite them - it clears the uptodate
> > state and may invalidate it completely. IOWs, the data written
> > "successfully" to the cached page is now gone. It will be re-read
> > from disk on the next read() call, in direct violation of the above
> > POSIX requirements.
> > 
> > This is my point: we've done that in XFS knowing that we violate
> > POSIX specifications in this specific corner case - it's the lesser
> > of many evils we have to choose between. Hence if we choose to encode
> > that behaviour as the general writeback IO error handling algorithm,
> > then it needs to be done with the knowledge it is a specification
> > violation. Not to mention be documented as a POSIX violation in the
> > various relevant man pages and that this is how all filesystems will
> > behave on async writeback error.....
> > 
> 
> Got it, thanks.
> 
> Yes, I think we ought to probably do the same thing globally. It's nice
> to know that xfs has already been doing this. That makes me feel better
> about making this behavior the gold standard for Linux filesystems.
> 
> So to summarize, at this point in the discussion, I think we want to
> consider doing the following:
> 
> * better reporting from syncfs (report an error when even one inode
> failed to be written back since last syncfs call). We'll probably
> implement this via a per-sb errseq_t in some fashion, though there are
> some implementation issues to work out.
> 
> * invalidate or clear uptodate flag on pages that experience writeback
> errors, across filesystems. Encourage this as standard behavior for
> filesystems and maybe add helpers to make it easier to do this.
> 
> Did I miss anything? Would that be enough to help the Pg usecase?
> 
> I don't see us ever being able to reasonably support its current
> expectation that writeback errors will be seen on fd's that were opened
> after the error occurred. That's a really thorny problem from an object
> lifetime perspective.

I think we can do better than XFS is currently doing (but I agree that
we should have the same behaviour across all Linux filesystems!)

1. If we get an error while wbc->for_background is true, we should not clear
   uptodate on the page, rather SetPageError and SetPageDirty.
2. Background writebacks should skip pages which are PageError.
3. for_sync writebacks should attempt one last write.  Maybe it'll
   succeed this time.  If it does, just ClearPageError.  If not, we have
   somebody to report this writeback error to, and ClearPageUptodate.

I think kupdate writes are the same as for_background writes.  for_reclaim
is tougher.  I don't want to see us getting into OOM because we're hanging
onto stale data, but we don't necessarily have an open fd to report the
error on.  I think I'm leaning towards behaving the same for for_reclaim
as for_sync, but this is probably a subject on which reasonable people
can disagree.

And this logic all needs to be in one place, although invoked from
each filesystem.
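
As a rough sketch (the helper names here are made up, not existing API),
the shared logic could look something like:

#include <linux/mm.h>
#include <linux/pagemap.h>
#include <linux/writeback.h>

/* Called by a filesystem when writeback of @page failed with @err. */
static void handle_page_writeback_error(struct page *page,
					struct writeback_control *wbc, int err)
{
	/* Always record the error so fsync()/syncfs() can report it. */
	mapping_set_error(page->mapping, err);

	if (wbc->for_background || wbc->for_kupdate) {
		/*
		 * Rule 1: treat it as transient.  Keep the data around and
		 * mark the page so that rule 2 can avoid retrying it on
		 * every periodic writeback pass.
		 */
		SetPageError(page);
		set_page_dirty(page);
		return;
	}

	/*
	 * Rule 3: this was the last (for_sync) attempt and it failed again,
	 * so there is somebody to report the error to; drop the stale data.
	 */
	ClearPageUptodate(page);
}

/* Rule 2: background writeback skips pages that already failed once. */
static bool writeback_skip_page(struct page *page,
				struct writeback_control *wbc)
{
	return wbc->for_background && PageError(page);
}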

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: fsync() errors is unsafe and risks data loss
  2018-04-10 22:07 fsync() errors is unsafe and risks data loss Andres Freund
  2018-04-11 21:52 ` Andreas Dilger
@ 2018-04-13 14:48 ` Matthew Wilcox
  2018-04-21 16:59   ` Jan Kara
  1 sibling, 1 reply; 57+ messages in thread
From: Matthew Wilcox @ 2018-04-13 14:48 UTC (permalink / raw)
  To: Andres Freund
  Cc: Theodore Y. Ts'o, linux-ext4, linux-fsdevel, Joshua D. Drake,
	Andreas Dilger

On Tue, Apr 10, 2018 at 03:07:26PM -0700, Andres Freund wrote:
> I don't think that's the full issue. We can deal with the fact that an
> fsync failure is edge-triggered if there's a guarantee that every
> process doing so would get it.  The fact that one needs to have an FD
> open from before any failing writes occurred to get a failure, *THAT'S*
> the big issue.
> 
> Beyond postgres, it's a pretty common approach to do work on a lot of
> files without fsyncing, then iterate over the directory fsyncing
> everything, and *then* assume you're safe. But unless I severely
> misunderstand something that'd only be safe if you kept an FD for every
> file open, which isn't realistic for pretty obvious reasons.

While accepting that under memory pressure we can still evict the error
indicators, we can do a better job than we do today.  The current design
of error reporting says that all errors which occurred before you opened
the file descriptor are of no interest to you.  I don't think that's
necessarily true, and it's actually a change of behaviour from before
the errseq work.

Consider Stupid Task A which calls open(), write(), close(), and Smart
Task B which calls open(), write(), fsync(), close() operating on the
same file.  If A goes entirely before B and encounters an error, before
errseq_t, B would see the error from A's write.

If A and B overlap, even a little bit, then B still gets to see A's
error today.  But if writeback happens for A's write before B opens the
file then B will never see the error.

B doesn't want to see historical errors that a previous invocation of
B has already handled, but we know whether *anyone* has seen the error
or not.  So here's a patch which restores the historical behaviour of
seeing old unhandled errors on a fresh file descriptor:

Signed-off-by: Matthew Wilcox <mawilcox@microsoft.com>

diff --git a/lib/errseq.c b/lib/errseq.c
index df782418b333..093f1fba4ee0 100644
--- a/lib/errseq.c
+++ b/lib/errseq.c
@@ -119,19 +119,11 @@ EXPORT_SYMBOL(errseq_set);
 errseq_t errseq_sample(errseq_t *eseq)
 {
 	errseq_t old = READ_ONCE(*eseq);
-	errseq_t new = old;
 
-	/*
-	 * For the common case of no errors ever having been set, we can skip
-	 * marking the SEEN bit. Once an error has been set, the value will
-	 * never go back to zero.
-	 */
-	if (old != 0) {
-		new |= ERRSEQ_SEEN;
-		if (old != new)
-			cmpxchg(eseq, old, new);
-	}
-	return new;
+	/* If nobody has seen this error yet, then we can be the first. */
+	if (!(old & ERRSEQ_SEEN))
+		old = 0;
+	return old;
 }
 EXPORT_SYMBOL(errseq_sample);
 
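(Not part of the patch, just an illustration of the semantic change,
assuming kernel context and the existing errseq helpers:)

#include <linux/errseq.h>
#include <linux/errno.h>

static int errseq_sample_demo(void)
{
	errseq_t wb_err = 0;	/* stands in for mapping->wb_err */
	errseq_t since;

	errseq_set(&wb_err, -EIO);	/* A's write fails; nobody watching */

	/* B opens the file; its cursor is initialised via errseq_sample(). */
	since = errseq_sample(&wb_err);

	/*
	 * Before this patch: errseq_sample() marks the error SEEN and
	 * returns the current value, so the check below returns 0 and B's
	 * fsync() quietly succeeds.  With the patch: the error was never
	 * SEEN, so errseq_sample() returns 0 and the check returns -EIO on
	 * B's very first fsync(), like it did before the errseq conversion.
	 */
	return errseq_check(&wb_err, since);
}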

^ permalink raw reply related	[flat|nested] 57+ messages in thread

* Re: fsync() errors is unsafe and risks data loss
  2018-04-13 14:02                 ` Matthew Wilcox
@ 2018-04-14  1:47                   ` Dave Chinner
  2018-04-14  2:04                     ` Andres Freund
  2018-04-14  2:38                     ` Matthew Wilcox
  0 siblings, 2 replies; 57+ messages in thread
From: Dave Chinner @ 2018-04-14  1:47 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Jeff Layton, lsf-pc, Andres Freund, Andreas Dilger,
	Theodore Y. Ts'o, Ext4 Developers List, Linux FS Devel,
	Joshua D. Drake

On Fri, Apr 13, 2018 at 07:02:32AM -0700, Matthew Wilcox wrote:
> On Fri, Apr 13, 2018 at 09:18:56AM -0400, Jeff Layton wrote:
> > On Fri, 2018-04-13 at 08:44 +1000, Dave Chinner wrote:
> > > To save you looking, XFS will trash the page contents completely on
> > > a filesystem level ->writepage error. It doesn't mark them "clean",
> > > doesn't attempt to redirty and rewrite them - it clears the uptodate
> > > state and may invalidate it completely. IOWs, the data written
> > > "successfully" to the cached page is now gone. It will be re-read
> > > from disk on the next read() call, in direct violation of the above
> > > POSIX requirements.
> > > 
> > > This is my point: we've done that in XFS knowing that we violate
> > > POSIX specifications in this specific corner case - it's the lesser
> > > of many evils we have to choose between. Hence if we choose to encode
> > > that behaviour as the general writeback IO error handling algorithm,
> > > then it needs to be done with the knowledge it is a specification
> > > violation. Not to mention be documented as a POSIX violation in the
> > > various relevant man pages and that this is how all filesystems will
> > > behave on async writeback error.....
> > > 
> > 
> > Got it, thanks.
> > 
> > Yes, I think we ought to probably do the same thing globally. It's nice
> > to know that xfs has already been doing this. That makes me feel better
> > about making this behavior the gold standard for Linux filesystems.
> > 
> > So to summarize, at this point in the discussion, I think we want to
> > consider doing the following:
> > 
> > * better reporting from syncfs (report an error when even one inode
> > failed to be written back since last syncfs call). We'll probably
> > implement this via a per-sb errseq_t in some fashion, though there are
> > some implementation issues to work out.
> > 
> > * invalidate or clear uptodate flag on pages that experience writeback
> > errors, across filesystems. Encourage this as standard behavior for
> > filesystems and maybe add helpers to make it easier to do this.
> > 
> > Did I miss anything? Would that be enough to help the Pg usecase?
> > 
> > I don't see us ever being able to reasonably support its current
> > expectation that writeback errors will be seen on fd's that were opened
> > after the error occurred. That's a really thorny problem from an object
> > lifetime perspective.
> 
> I think we can do better than XFS is currently doing (but I agree that
> we should have the same behaviour across all Linux filesystems!)
> 
> 1. If we get an error while wbc->for_background is true, we should not clear
>    uptodate on the page, rather SetPageError and SetPageDirty.

So you're saying we should treat it as a transient error rather than
a permanent error.

> 2. Background writebacks should skip pages which are PageError.

That seems decidedly dodgy in the case where there is a transient
error - it requires a user to specifically run sync to get the data
to disk after the transient error has occurred. Say they don't
notice the problem because it's fleeting and doesn't cause any
obvious problems?

e.g. XFS gets to enospc, runs out of reserve pool blocks so can't
allocate space to write back the page, then space is freed up a few
seconds later and so the next write will work just fine.

This is a recipe for "I lost data that I wrote /days/ before the
system crashed" bug reports.

> 3. for_sync writebacks should attempt one last write.  Maybe it'll
>    succeed this time.  If it does, just ClearPageError.  If not, we have
>    somebody to report this writeback error to, and ClearPageUptodate.

Which may well be unmount. Are we really going to wait until unmount
to report fatal errors?

We used to do this with XFS metadata. We'd just keep trying to write
metadata and keep the filesystem running (because it's consistent in
memory and it might be a transient error) rather than shutting down
the filesystem after a couple of retries. The result was that users
wouldn't notice there were problems until unmount, and the most
common symptom of that was "why is system shutdown hanging?".

We now don't hang at unmount by default:

$ cat /sys/fs/xfs/dm-0/error/fail_at_unmount 
1
$

And we treat different errors according to their seriousness.  EIO
and device ENOSPC we default to retry forever because they are often
transient, but for ENODEV we fail and shut down immediately (someone
pulled the USB stick out). Metadata failure behaviour is configured
via changing fields in /sys/fs/xfs/<dev>/error/metadata/<errno>/...

We've planned to extend this failure configuration to data IO, too,
but never quite got around to it yet. This is a clear example of
"one size doesn't fit all" and I think we'll end up doing the same
sort of error behaviour configuration in XFS for these cases.
(i.e. /sys/fs/xfs/<dev>/error/writeback/<errno>/....)

> And this logic all needs to be in one place, although invoked from
> each filesystem.

Perhaps so, but as there's no "one-size-fits-all" behaviour, I
really want to extend the XFS error config infrastructure to control
what the filesystem does on error here.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: fsync() errors is unsafe and risks data loss
  2018-04-14  1:47                   ` Dave Chinner
@ 2018-04-14  2:04                     ` Andres Freund
  2018-04-18 23:59                       ` Dave Chinner
  2018-04-14  2:38                     ` Matthew Wilcox
  1 sibling, 1 reply; 57+ messages in thread
From: Andres Freund @ 2018-04-14  2:04 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Matthew Wilcox, Jeff Layton, lsf-pc, Andreas Dilger,
	Theodore Y. Ts'o, Ext4 Developers List, Linux FS Devel,
	Joshua D. Drake

Hi,

On 2018-04-14 11:47:52 +1000, Dave Chinner wrote:
> And we treat different errors according to their seriousness.  EIO
> and device ENOSPC we default to retry forever because they are often
> transient, but for ENODEV we fail and shut down immediately (someone
> pulled the USB stick out). Metadata failure behaviour is configured
> via changing fields in /sys/fs/xfs/<dev>/error/metadata/<errno>/...
> 
> We've planned to extend this failure configuration to data IO, too,
> but never quite got around to it yet. This is a clear example of
> "one size doesn't fit all" and I think we'll end up doing the same
> sort of error behaviour configuration in XFS for these cases.
> (i.e. /sys/fs/xfs/<dev>/error/writeback/<errno>/....)

Have you considered adding an ext/fat/jfs
errors=remount-ro/panic/continue style mount parameter?

Greetings,

Andres Freund

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: fsync() errors is unsafe and risks data loss
  2018-04-14  1:47                   ` Dave Chinner
  2018-04-14  2:04                     ` Andres Freund
@ 2018-04-14  2:38                     ` Matthew Wilcox
  2018-04-19  0:13                       ` Dave Chinner
  1 sibling, 1 reply; 57+ messages in thread
From: Matthew Wilcox @ 2018-04-14  2:38 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Jeff Layton, lsf-pc, Andres Freund, Andreas Dilger,
	Theodore Y. Ts'o, Ext4 Developers List, Linux FS Devel,
	Joshua D. Drake

On Sat, Apr 14, 2018 at 11:47:52AM +1000, Dave Chinner wrote:
> On Fri, Apr 13, 2018 at 07:02:32AM -0700, Matthew Wilcox wrote:
> > 1. If we get an error while wbc->for_background is true, we should not clear
> >    uptodate on the page, rather SetPageError and SetPageDirty.
> 
> So you're saying we should treat it as a transient error rather than
> a permanent error.

Yes, I'm proposing leaving the data in memory in case the user wants to
try writing it somewhere else.

> > 2. Background writebacks should skip pages which are PageError.
> 
> That seems decidedly dodgy in the case where there is a transient
> error - it requires a user to specifically run sync to get the data
> to disk after the transient error has occurred. Say they don't
> notice the problem because it's fleeting and doesn't cause any
> obvious problems?

That's fair.  What I want to avoid is triggering the same error every
30 seconds (or whatever the periodic writeback threshold is set to).

> e.g. XFS gets to enospc, runs out of reserve pool blocks so can't
> allocate space to write back the page, then space is freed up a few
> seconds later and so the next write will work just fine.
> 
> This is a recipe for "I lost data that I wrote /days/ before the
> system crashed" bug reports.

So ... exponential backoff on retries?

> > 3. for_sync writebacks should attempt one last write.  Maybe it'll
> >    succeed this time.  If it does, just ClearPageError.  If not, we have
> >    somebody to report this writeback error to, and ClearPageUptodate.
> 
> Which may well be unmount. Are we really going to wait until unmount
> to report fatal errors?

Goodness, no.  The errors would be immediately reportable using the wb_err
mechanism, as soon as the first error was encountered.

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: fsync() errors is unsafe and risks data loss
  2018-04-12  2:17   ` Andres Freund
  2018-04-12  3:02     ` Matthew Wilcox
  2018-04-12  5:34     ` Theodore Y. Ts'o
@ 2018-04-18 18:09     ` J. Bruce Fields
  2 siblings, 0 replies; 57+ messages in thread
From: J. Bruce Fields @ 2018-04-18 18:09 UTC (permalink / raw)
  To: Andres Freund
  Cc: Andreas Dilger, 20180410184356.GD3563, Theodore Y. Ts'o,
	Ext4 Developers List, Linux FS Devel, Jeff Layton,
	Joshua D. Drake

On Wed, Apr 11, 2018 at 07:17:52PM -0700, Andres Freund wrote:
> Hi,
> 
> On 2018-04-11 15:52:44 -0600, Andreas Dilger wrote:
> > On Apr 10, 2018, at 4:07 PM, Andres Freund <andres@anarazel.de> wrote:
> > > 2018-04-10 18:43:56 Ted wrote:
> > >> So for better or for worse, there has not been as much investment in
> > >> buffered I/O and data robustness in the face of exception handling of
> > >> storage devices.
> > > 
> > > That's a bit of a cop out. It's not just databases that care. Even more
> > > basic tools like SCM, package managers and editors care whether they can
> > > proper responses back from fsync that imply things actually were synced.
> > 
> > Sure, but it is mostly PG that is doing (IMHO) crazy things like writing
> > to thousands(?) of files, closing the file descriptors, then expecting
> > fsync() on a newly-opened fd to return a historical error.
> 
> It's not just postgres. dpkg (underlying apt, on Debian-derived distros),
> to take an example I just randomly guessed, does too:
>   /* We want to guarantee the extracted files are on the disk, so that the
>    * subsequent renames to the info database do not end up with old or zero
>    * length files in case of a system crash. As neither dpkg-deb nor tar do
>    * explicit fsync()s, we have to do them here.
>    * XXX: This could be avoided by switching to an internal tar extractor. */
>   dir_sync_contents(cidir);
> 
> (a bunch of other places too)
> 
> Especially on ext3, but also on newer filesystems, it's performance-wise
> entirely infeasible to fsync() every single file individually - the
> performance becomes entirely atrocious if you do that.

Is that still true if you're able to use some kind of parallelism?
(async io, or fsync from multiple processes?)
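
For instance (a userspace sketch, untested; glibc's POSIX AIO uses a
thread pool underneath, so link with -lrt): queue aio_fsync() on a batch
of descriptors and reap them afterwards, so the flushes can overlap
instead of being issued strictly one after another:

#include <aio.h>
#include <errno.h>
#include <fcntl.h>
#include <stdlib.h>

/* Flush a batch of already-written fds in parallel; returns -1 if any
 * queueing or flush failed. */
static int fsync_many(int *fds, int nr)
{
	struct aiocb *cbs = calloc(nr, sizeof(*cbs));
	int i, ret = 0;

	if (!cbs)
		return -1;
	for (i = 0; i < nr; i++) {
		cbs[i].aio_fildes = fds[i];
		if (aio_fsync(O_SYNC, &cbs[i]) != 0) {
			cbs[i].aio_fildes = -1;		/* never queued */
			ret = -1;
		}
	}
	for (i = 0; i < nr; i++) {
		const struct aiocb *list[1] = { &cbs[i] };
		int err;

		if (cbs[i].aio_fildes == -1)
			continue;
		while ((err = aio_error(&cbs[i])) == EINPROGRESS)
			aio_suspend(list, 1, NULL);
		(void)aio_return(&cbs[i]);		/* reap the request */
		if (err != 0)
			ret = -1;			/* this flush failed */
	}
	free(cbs);
	return ret;
}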

--b.

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: fsync() errors is unsafe and risks data loss
  2018-04-14  2:04                     ` Andres Freund
@ 2018-04-18 23:59                       ` Dave Chinner
  2018-04-19  0:23                         ` Eric Sandeen
  0 siblings, 1 reply; 57+ messages in thread
From: Dave Chinner @ 2018-04-18 23:59 UTC (permalink / raw)
  To: Andres Freund
  Cc: Matthew Wilcox, Jeff Layton, lsf-pc, Andreas Dilger,
	Theodore Y. Ts'o, Ext4 Developers List, Linux FS Devel,
	Joshua D. Drake

On Fri, Apr 13, 2018 at 07:04:33PM -0700, Andres Freund wrote:
> Hi,
> 
> On 2018-04-14 11:47:52 +1000, Dave Chinner wrote:
> > And we treat different errors according to their seriousness.  EIO
> > and device ENOSPC we default to retry forever because they are often
> > transient, but for ENODEV we fail and shut down immediately (someone
> > pulled the USB stick out). Metadata failure behaviour is configured
> > via changing fields in /sys/fs/xfs/<dev>/error/metadata/<errno>/...
> > 
> > We've planned to extend this failure configuration to data IO, too,
> > but never quite got around to it yet. This is a clear example of
> > "one size doesn't fit all" and I think we'll end up doing the same
> > sort of error behaviour configuration in XFS for these cases.
> > (i.e. /sys/fs/xfs/<dev>/error/writeback/<errno>/....)
> 
> Have you considered adding an ext/fat/jfs
> errors=remount-ro/panic/continue style mount parameter?

That's for metadata writeback error behaviour, not data writeback
IO errors.

We are definitely not planning to add mount options to configure IO
error behaviors. Mount options are a horrible way to configure
filesystem behaviour and we've already got other, fine-grained
configuration infrastructure for configuring IO error behaviour.
Which, as I just pointed out, was designed to be extended to data
writeback and other operational error handling in the filesystem
(e.g. dealing with ENOMEM in different ways).

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: fsync() errors is unsafe and risks data loss
  2018-04-14  2:38                     ` Matthew Wilcox
@ 2018-04-19  0:13                       ` Dave Chinner
  2018-04-19  0:40                         ` Matthew Wilcox
  0 siblings, 1 reply; 57+ messages in thread
From: Dave Chinner @ 2018-04-19  0:13 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Jeff Layton, lsf-pc, Andres Freund, Andreas Dilger,
	Theodore Y. Ts'o, Ext4 Developers List, Linux FS Devel,
	Joshua D. Drake

On Fri, Apr 13, 2018 at 07:38:14PM -0700, Matthew Wilcox wrote:
> On Sat, Apr 14, 2018 at 11:47:52AM +1000, Dave Chinner wrote:
> > On Fri, Apr 13, 2018 at 07:02:32AM -0700, Matthew Wilcox wrote:
> > > 1. If we get an error while wbc->for_background is true, we should not clear
> > >    uptodate on the page, rather SetPageError and SetPageDirty.
> > 
> > So you're saying we should treat it as a transient error rather than
> > a permanent error.
> 
> Yes, I'm proposing leaving the data in memory in case the user wants to
> try writing it somewhere else.

And if it's getting IO errors because of USB stick pull? What
then?

> > > 2. Background writebacks should skip pages which are PageError.
> > 
> > That seems decidedly dodgy in the case where there is a transient
> > error - it requires a user to specifically run sync to get the data
> > to disk after the transient error has occurred. Say they don't
> > notice the problem because it's fleeting and doesn't cause any
> > obvious problems?
> 
> That's fair.  What I want to avoid is triggering the same error every
> 30 seconds (or whatever the periodic writeback threshold is set to).

So if the kernel ring buffer overflows and users miss the first error
report, they'll have no idea that the data writeback is still
failing?

> > e.g. XFS gets to enospc, runs out of reserve pool blocks so can't
> > allocate space to write back the page, then space is freed up a few
> > seconds later and so the next write will work just fine.
> > 
> > This is a recipe for "I lost data that I wrote /days/ before the
> > system crashed" bug reports.
> 
> So ... exponential backoff on retries?

Maybe, but I don't think that actually helps anything and adds yet
more "when should we write this" complication to inode writeback....

> > > 3. for_sync writebacks should attempt one last write.  Maybe it'll
> > >    succeed this time.  If it does, just ClearPageError.  If not, we have
> > >    somebody to report this writeback error to, and ClearPageUptodate.
> > 
> > Which may well be unmount. Are we really going to wait until unmount
> > to report fatal errors?
> 
> Goodness, no.  The errors would be immediately reportable using the wb_err
> mechanism, as soon as the first error was encountered.

But if there are no open files when the error occurs, that error
won't get reported to anyone. Which means the next time anyone
accesses that inode from a user context could very well be unmount
or a third party sync/syncfs()....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: fsync() errors is unsafe and risks data loss
  2018-04-18 23:59                       ` Dave Chinner
@ 2018-04-19  0:23                         ` Eric Sandeen
  0 siblings, 0 replies; 57+ messages in thread
From: Eric Sandeen @ 2018-04-19  0:23 UTC (permalink / raw)
  To: Dave Chinner, Andres Freund
  Cc: Matthew Wilcox, Jeff Layton, lsf-pc, Andreas Dilger,
	Theodore Y. Ts'o, Ext4 Developers List, Linux FS Devel,
	Joshua D. Drake



On 4/18/18 6:59 PM, Dave Chinner wrote:
> On Fri, Apr 13, 2018 at 07:04:33PM -0700, Andres Freund wrote:
>> Hi,
>>
>> On 2018-04-14 11:47:52 +1000, Dave Chinner wrote:
>>> And we treat different errors according to their seriousness.  EIO
>>> and device ENOSPC we default to retry forever because they are often
>>> transient, but for ENODEV we fail and shut down immediately (someone
>>> pulled the USB stick out). Metadata failure behaviour is configured
>>> via changing fields in /sys/fs/xfs/<dev>/error/metadata/<errno>/...
>>>
>>> We've planned to extend this failure configuration to data IO, too,
>>> but never quite got around to it yet. This is a clear example of
>>> "one size doesn't fit all" and I think we'll end up doing the same
>>> sort of error behaviour configuration in XFS for these cases.
>>> (i.e. /sys/fs/xfs/<dev>/error/writeback/<errno>/....)
>>
>> Have you considered adding an ext/fat/jfs
>> errors=remount-ro/panic/continue style mount parameter?
> 
> That's for metadata writeback error behaviour, not data writeback
> IO errors.

/me points casually at data_err=abort & data_err=ignore in ext4...
 
       data_err=ignore
              Just print an error message if an error occurs in a file data buffer in ordered mode.

       data_err=abort
              Abort the journal if an error occurs in a file data buffer in ordered mode.

Just sayin'

> We are definitely not planning to add mount options to configure IO
> error behaviors. Mount options are a horrible way to configure
> filesystem behaviour and we've already got other, fine-grained
> configuration infrastructure for configuring IO error behaviour.
> Which, as I just pointed out, was designed to be extended to data
> writeback and other operational error handling in the filesystem
> (e.g. dealing with ENOMEM in different ways).

I don't disagree, but there are already mount-option knobs in ext4, FWIW.

-Eric

> Cheers,
> 
> Dave.
> 

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: fsync() errors is unsafe and risks data loss
  2018-04-19  0:13                       ` Dave Chinner
@ 2018-04-19  0:40                         ` Matthew Wilcox
  2018-04-19  1:08                           ` Theodore Y. Ts'o
  2018-04-19 23:28                           ` Dave Chinner
  0 siblings, 2 replies; 57+ messages in thread
From: Matthew Wilcox @ 2018-04-19  0:40 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Jeff Layton, lsf-pc, Andres Freund, Andreas Dilger,
	Theodore Y. Ts'o, Ext4 Developers List, Linux FS Devel,
	Joshua D. Drake

On Thu, Apr 19, 2018 at 10:13:43AM +1000, Dave Chinner wrote:
> On Fri, Apr 13, 2018 at 07:38:14PM -0700, Matthew Wilcox wrote:
> > On Sat, Apr 14, 2018 at 11:47:52AM +1000, Dave Chinner wrote:
> > > On Fri, Apr 13, 2018 at 07:02:32AM -0700, Matthew Wilcox wrote:
> > > > 1. If we get an error while wbc->for_background is true, we should not clear
> > > >    uptodate on the page, rather SetPageError and SetPageDirty.
> > > 
> > > So you're saying we should treat it as a transient error rather than
> > > a permanent error.
> > 
> > Yes, I'm proposing leaving the data in memory in case the user wants to
> > try writing it somewhere else.
> 
> And if it's getting IO errors because of USB stick pull? What
> then?

I've been thinking about this.  Ideally we want to pass some kind of
notification all the way up to the desktop and tell the user to plug the
damn stick back in.  Then have the USB stick become the same blockdev
that it used to be, and complete the writeback.  We are so far from
being able to do that right now that it's not even funny.

> > > > 2. Background writebacks should skip pages which are PageError.
> > > 
> > > That seems decidedly dodgy in the case where there is a transient
> > > error - it requires a user to specifically run sync to get the data
> > > to disk after the transient error has occurred. Say they don't
> > > notice the problem because it's fleeting and doesn't cause any
> > > obvious problems?
> > 
> > That's fair.  What I want to avoid is triggering the same error every
> > 30 seconds (or whatever the periodic writeback threshold is set to).
> 
> So if the kernel ring buffer overflows and users miss the first error
> report, they'll have no idea that the data writeback is still
> failing?

I wasn't thinking about kernel ringbuffer based reporting; I was thinking
about errseq_t based reporting, so the application can tell the fsync
failed and maybe does something application-level to recover like send
the transactions across to another node in the cluster (or whatever this
hypothetical application is).

> > > > 3. for_sync writebacks should attempt one last write.  Maybe it'll
> > > >    succeed this time.  If it does, just ClearPageError.  If not, we have
> > > >    somebody to report this writeback error to, and ClearPageUptodate.
> > > 
> > > Which may well be unmount. Are we really going to wait until unmount
> > > to report fatal errors?
> > 
> > Goodness, no.  The errors would be immediately reportable using the wb_err
> > mechanism, as soon as the first error was encountered.
> 
> But if there are no open files when the error occurs, that error
> won't get reported to anyone. Which means the next time anyone
> accesses that inode from a user context could very well be unmount
> or a third party sync/syncfs()....

Right.  But then that's on the application.

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: fsync() errors is unsafe and risks data loss
  2018-04-19  0:40                         ` Matthew Wilcox
@ 2018-04-19  1:08                           ` Theodore Y. Ts'o
  2018-04-19 17:40                             ` Matthew Wilcox
  2018-04-19 23:28                           ` Dave Chinner
  1 sibling, 1 reply; 57+ messages in thread
From: Theodore Y. Ts'o @ 2018-04-19  1:08 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Dave Chinner, Jeff Layton, lsf-pc, Andres Freund, Andreas Dilger,
	Ext4 Developers List, Linux FS Devel, Joshua D. Drake

On Wed, Apr 18, 2018 at 05:40:37PM -0700, Matthew Wilcox wrote:
> 
> I've been thinking about this.  Ideally we want to pass some kind of
> notification all the way up to the desktop and tell the user to plug the
> damn stick back in.  Then have the USB stick become the same blockdev
> that it used to be, and complete the writeback.  We are so far from
> being able to do that right now that it's not even funny.

Maybe we shouldn't be trying to do any of this in the kernel, or at
least as little as possible in the kernel?  Perhaps it would be better
to do most of this as a device mapper hack; I suspect we'll need
userspace help to figure out whether the user has plugged the same USB
stick in, or a different USB stick, anyway.

					- Ted

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: fsync() errors is unsafe and risks data loss
  2018-04-19  1:08                           ` Theodore Y. Ts'o
@ 2018-04-19 17:40                             ` Matthew Wilcox
  2018-04-19 23:27                               ` Theodore Y. Ts'o
  0 siblings, 1 reply; 57+ messages in thread
From: Matthew Wilcox @ 2018-04-19 17:40 UTC (permalink / raw)
  To: Theodore Y. Ts'o
  Cc: Dave Chinner, Jeff Layton, lsf-pc, Andres Freund, Andreas Dilger,
	Ext4 Developers List, Linux FS Devel, Joshua D. Drake

On Wed, Apr 18, 2018 at 09:08:19PM -0400, Theodore Y. Ts'o wrote:
> On Wed, Apr 18, 2018 at 05:40:37PM -0700, Matthew Wilcox wrote:
> > 
> > I've been thinking about this.  Ideally we want to pass some kind of
> > notification all the way up to the desktop and tell the user to plug the
> > damn stick back in.  Then have the USB stick become the same blockdev
> > that it used to be, and complete the writeback.  We are so far from
> > being able to do that right now that it's not even funny.
> 
> Maybe we shouldn't be trying to do any of this in the kernel, or at
> least as little as possible in the kernel?  Perhaps it would be better
> to do most of this as a device mapper hack; I suspect we'll need
> userspace help to figure out whether the user has plugged the same USB
> stick in, or a different USB stick, anyway.

The device mapper target (dm-removable?) was my first idea too, but I kept
thinking through use cases and I think we end up wanting this functionality
in the block layer.  Let's try a story.

Stephen the PFY goes into the data centre looking to hotswap a failed
drive.  Due to the eight pints of lager he had for lunch, he pulls out
the root drive instead of the failed drive.  The air raid siren warbles
and he realises his mistake, shoving the drive back in.

CYOA:

Currently: All writes are lost, calamities ensue.  The PFY is fired.

With dm-removable: Nobody thought to set up dm-removable on the root
drive.  Calamities still ensue, but now it's the BOFH's fault instead
of the PFY's fault.

Built into the block layer: After a brief hiccup while we reattach the
drive to its block_device, the writes resume and nobody loses their job.

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: fsync() errors is unsafe and risks data loss
  2018-04-19 17:40                             ` Matthew Wilcox
@ 2018-04-19 23:27                               ` Theodore Y. Ts'o
  0 siblings, 0 replies; 57+ messages in thread
From: Theodore Y. Ts'o @ 2018-04-19 23:27 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Dave Chinner, Jeff Layton, lsf-pc, Andres Freund, Andreas Dilger,
	Ext4 Developers List, Linux FS Devel, Joshua D. Drake

On Thu, Apr 19, 2018 at 10:40:10AM -0700, Matthew Wilcox wrote:
> With dm-removable: Nobody thought to set up dm-removable on the root
> drive.  Calamities still ensue, but now it's the BOFH's fault instead
> of the PFY's fault.
> 
> Built into the block layer: After a brief hiccup while we reattach the
> drive to its block_device, the writes resume and nobody loses their job.

What you're talking about is a deployment issue, though.  Ultimately
the distribution will set up dm-removable automatically if the user
requests it, much like it sets up dm-crypt automatically for laptop
users upon request.

My concern is that not all removable devices have a globally unique id
number available in hardware so the kernel can tell whether or not
it's the same device that has been plugged in.  There are heuristics
you could use -- for example, you could look at the file system uuid
plus the last fsck time.  But they tend to be very file system
specific, and not things we would want to have in the kernel.

					- Ted

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: fsync() errors is unsafe and risks data loss
  2018-04-19  0:40                         ` Matthew Wilcox
  2018-04-19  1:08                           ` Theodore Y. Ts'o
@ 2018-04-19 23:28                           ` Dave Chinner
  1 sibling, 0 replies; 57+ messages in thread
From: Dave Chinner @ 2018-04-19 23:28 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Jeff Layton, lsf-pc, Andres Freund, Andreas Dilger,
	Theodore Y. Ts'o, Ext4 Developers List, Linux FS Devel,
	Joshua D. Drake

On Wed, Apr 18, 2018 at 05:40:37PM -0700, Matthew Wilcox wrote:
> On Thu, Apr 19, 2018 at 10:13:43AM +1000, Dave Chinner wrote:
> > On Fri, Apr 13, 2018 at 07:38:14PM -0700, Matthew Wilcox wrote:
> > > On Sat, Apr 14, 2018 at 11:47:52AM +1000, Dave Chinner wrote:
> > > > On Fri, Apr 13, 2018 at 07:02:32AM -0700, Matthew Wilcox wrote:
> > > > > 1. If we get an error while wbc->for_background is true, we should not clear
> > > > >    uptodate on the page, rather SetPageError and SetPageDirty.
> > > > 
> > > > So you're saying we should treat it as a transient error rather than
> > > > a permanent error.
> > > 
> > > Yes, I'm proposing leaving the data in memory in case the user wants to
> > > try writing it somewhere else.
> > 
> > And if it's getting IO errors because of USB stick pull? What
> > then?
> 
> I've been thinking about this.  Ideally we want to pass some kind of
> notification all the way up to the desktop and tell the user to plug the
> damn stick back in.  Then have the USB stick become the same blockdev
> that it used to be, and complete the writeback.  We are so far from
> being able to do that right now that it's not even funny.

*nod*

But in the meantime, device unplug (should give ENODEV, not EIO) is
a fatal error and we need to toss away the data.

> > > > > 2. Background writebacks should skip pages which are PageError.
> > > > 
> > > > That seems decidedly dodgy in the case where there is a transient
> > > > error - it requires a user to specifically run sync to get the data
> > > > to disk after the transient error has occurred. Say they don't
> > > > notice the problem because it's fleeting and doesn't cause any
> > > > obvious problems?
> > > 
> > > That's fair.  What I want to avoid is triggering the same error every
> > > 30 seconds (or whatever the periodic writeback threshold is set to).
> > 
> > So if the kernel ring buffer overflows and users miss the first error
> > report, they'll have no idea that the data writeback is still
> > failing?
> 
> I wasn't thinking about kernel ringbuffer based reporting; I was thinking
> about errseq_t based reporting, so the application can tell the fsync
> failed and maybe does something application-level to recover like send
> the transactions across to another node in the cluster (or whatever this
> hypothetical application is).

But if it's still failing, then we should still be trying to report
the error. i.e. if fsync fails and the page remains dirty, then the
next attempt to write it is a new error and fsync should report
that. IOWs, I think we should be returning errors at every occasion
errors need to be reported if we have a persistent writeback
failure...

> > > > > 3. for_sync writebacks should attempt one last write.  Maybe it'll
> > > > >    succeed this time.  If it does, just ClearPageError.  If not, we have
> > > > >    somebody to report this writeback error to, and ClearPageUptodate.
> > > > 
> > > > Which may well be unmount. Are we really going to wait until unmount
> > > > to report fatal errors?
> > > 
> > > Goodness, no.  The errors would be immediately reportable using the wb_err
> > > mechanism, as soon as the first error was encountered.
> > 
> > But if there are no open files when the error occurs, that error
> > won't get reported to anyone. Which means the next time anyone
> > accesses that inode from a user context could very well be unmount
> > or a third party sync/syncfs()....
> 
> Right.  But then that's on the application.

Which we know don't do the right thing. Seems like a lot of hoops to
jump through given it still won't work if the application isn't
changed to support Linux-specific error handling requirements...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: fsync() errors is unsafe and risks data loss
  2018-04-13 14:48 ` Matthew Wilcox
@ 2018-04-21 16:59   ` Jan Kara
  0 siblings, 0 replies; 57+ messages in thread
From: Jan Kara @ 2018-04-21 16:59 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Andres Freund, Theodore Y. Ts'o, linux-ext4, linux-fsdevel,
	Joshua D. Drake, Andreas Dilger

On Fri 13-04-18 07:48:07, Matthew Wilcox wrote:
> On Tue, Apr 10, 2018 at 03:07:26PM -0700, Andres Freund wrote:
> > I don't think that's the full issue. We can deal with the fact that an
> > fsync failure is edge-triggered if there's a guarantee that every
> > process doing so would get it.  The fact that one needs to have an FD
> > open from before any failing writes occurred to get a failure, *THAT'S*
> > the big issue.
> > 
> > Beyond postgres, it's a pretty common approach to do work on a lot of
> > files without fsyncing, then iterate over the directory fsyncing
> > everything, and *then* assume you're safe. But unless I severely
> > misunderstand something that'd only be safe if you kept an FD for every
> > file open, which isn't realistic for pretty obvious reasons.
> 
> While accepting that under memory pressure we can still evict the error
> indicators, we can do a better job than we do today.  The current design
> of error reporting says that all errors which occurred before you opened
> the file descriptor are of no interest to you.  I don't think that's
> necessarily true, and it's actually a change of behaviour from before
> the errseq work.
> 
> Consider Stupid Task A which calls open(), write(), close(), and Smart
> Task B which calls open(), write(), fsync(), close() operating on the
> same file.  If A goes entirely before B and encounters an error, before
> errseq_t, B would see the error from A's write.
> 
> If A and B overlap, even a little bit, then B still gets to see A's
> error today.  But if writeback happens for A's write before B opens the
> file then B will never see the error.
> 
> B doesn't want to see historical errors that a previous invocation of
> B has already handled, but we know whether *anyone* has seen the error
> or not.  So here's a patch which restores the historical behaviour of
> seeing old unhandled errors on a fresh file descriptor:
> 
> Signed-off-by: Matthew Wilcox <mawilcox@microsoft.com>

So I agree with going back to the old semantics of reporting errors from
before a file was opened at least once to someone. As the PG case shows,
apps are indeed relying on the old behavior. As much as it is unreliable, it
ends up doing the right thing for these apps in 99% of cases and we shouldn't
break them (BTW IMO the changelog should contain a note that this fixes a
regression for PostgreSQL, a reference to this thread, and a CC to stable).
Anyway feel free to add:

Reviewed-by: Jan Kara <jack@suse.cz>

Oh, and to make myself clear I do think we need to find a better way of
reporting IO errors. I consider this just an immediate band-aid to avoid
userspace regressions.

								Honza

> diff --git a/lib/errseq.c b/lib/errseq.c
> index df782418b333..093f1fba4ee0 100644
> --- a/lib/errseq.c
> +++ b/lib/errseq.c
> @@ -119,19 +119,11 @@ EXPORT_SYMBOL(errseq_set);
>  errseq_t errseq_sample(errseq_t *eseq)
>  {
>  	errseq_t old = READ_ONCE(*eseq);
> -	errseq_t new = old;
>  
> -	/*
> -	 * For the common case of no errors ever having been set, we can skip
> -	 * marking the SEEN bit. Once an error has been set, the value will
> -	 * never go back to zero.
> -	 */
> -	if (old != 0) {
> -		new |= ERRSEQ_SEEN;
> -		if (old != new)
> -			cmpxchg(eseq, old, new);
> -	}
> -	return new;
> +	/* If nobody has seen this error yet, then we can be the first. */
> +	if (!(old & ERRSEQ_SEEN))
> +		old = 0;
> +	return old;



-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: fsync() errors is unsafe and risks data loss
  2018-04-12 11:09       ` Jeff Layton
                           ` (2 preceding siblings ...)
  2018-04-12 20:24         ` Andres Freund
@ 2018-04-21 18:14         ` Jan Kara
  3 siblings, 0 replies; 57+ messages in thread
From: Jan Kara @ 2018-04-21 18:14 UTC (permalink / raw)
  To: Jeff Layton
  Cc: Matthew Wilcox, Andres Freund, Andreas Dilger,
	20180410184356.GD3563, Theodore Y. Ts'o,
	Ext4 Developers List, Linux FS Devel, Joshua D. Drake

On Thu 12-04-18 07:09:14, Jeff Layton wrote:
> On Wed, 2018-04-11 at 20:02 -0700, Matthew Wilcox wrote:
> > At the moment, when we open a file, we sample the current state of the
> > writeback error and only report new errors.  We could set it to zero
> > instead, and report the most recent error as soon as anything happens
> > which would report an error.  That way err = close(open("file")); would
> > report the most recent error.
> > 
> > That's not going to be persistent across the data structure for that inode
> > being removed from memory; we'd need filesystem support for persisting
> > that.  But maybe it's "good enough" to only support it for recent files.
> > 
> > Jeff, what do you think?
> 
> I hate it :). We could do that, but....yecchhhh.
> 
> Reporting errors only in the case where the inode happened to stick
> around in the cache seems too unreliable for real-world usage, and might
> be problematic for some use cases. I'm also not sure it would really be
> helpful.

So this is never going to be perfect, but I think we could do well enough
by:
1) Mark inodes that hit IO error.
2) If the inode gets evicted from memory we store the fact that we hit an
error for this IO in a more space efficient data structure (sparse bitmap,
radix tree, extent tree, whatever).
3) If the underlying device gets destroyed, we can just switch the whole SB
to an error state and forget per inode info.
4) If there's too much of per-inode error info (probably per-fs configurable
limit in terms of number of inodes), we would yell in the kernel log,
switch the whole fs to the error state and forget per inode info.

This way there won't be silent loss of IO errors. Memory usage would be
reasonably limited. It could happen that the whole fs switches to the error
state "prematurely", but if that's a problem for the machine, the admin could
tune the limit on the number of inodes to keep IO errors for...
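
To make 2) and 4) concrete, a rough sketch (all structure, field and
function names here are made up, and locking / radix tree preloading are
omitted for brevity):

#include <linux/radix-tree.h>
#include <linux/slab.h>
#include <linux/printk.h>

struct sb_wb_errors {
	struct radix_tree_root	evicted;	/* ino -> last error */
	unsigned long		nr;		/* entries currently stored */
	unsigned long		limit;		/* admin-tunable cap */
	bool			whole_fs;	/* 3)/4): taint the whole fs */
};

struct wb_err_rec {
	int err;
};

/* 2): called when evicting an inode that still has an unreported error. */
static void sb_remember_evicted_error(struct sb_wb_errors *we,
				      unsigned long ino, int err)
{
	struct wb_err_rec *rec;

	if (we->whole_fs)
		return;

	/* 4): too much per-inode state, degrade to a per-fs error flag. */
	if (we->nr >= we->limit) {
		pr_warn("writeback: too many IO errors, flagging whole fs\n");
		we->whole_fs = true;
		return;
	}

	rec = kmalloc(sizeof(*rec), GFP_NOFS);
	if (!rec) {
		we->whole_fs = true;	/* better loud than silent */
		return;
	}
	rec->err = err;
	if (radix_tree_insert(&we->evicted, ino, rec))
		kfree(rec);		/* already recorded for this inode */
	else
		we->nr++;
}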

> I think the crux of the matter here is not really about error reporting,
> per-se.

I think this is related but a different question.

> I asked this at LSF last year, and got no real answer:
> 
> When there is a writeback error, what should be done with the dirty
> page(s)? Right now, we usually just mark them clean and carry on. Is
> that the right thing to do?
> 
> One possibility would be to invalidate the range that failed to be
> written (or the whole file) and force the pages to be faulted in again
> on the next access. It could be surprising for some applications to not
> see the results of their writes on a subsequent read after such an
> event.
> 
> Maybe that's ok in the face of a writeback error though? IDK.

I can see the admin preferring to kill the machine with OOM rather than having
to deal with data loss due to IO errors (e.g. if he has HA server failover
set up).  Or retry for some time before dropping the dirty data.  Or do
what we do now (possibly with invalidating pages as you say). As Dave said
elsewhere there's not one strategy that's going to please everybody. So it
might be beneficial to have this configurable like XFS has it for metadata.

OTOH, if I look at the problem from an application developer's POV, most apps
will just declare game over in the face of IO errors (if they take care to
check for them at all). And the sophisticated apps that will try some kind
of error recovery have to be prepared that the data is just gone (as
depending on what exactly the kernel does is rather fragile) so I'm not
sure how much practical value the configurable behavior on writeback errors
would bring.

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: fsync() errors is unsafe and risks data loss
  2018-04-19  8:39       ` Christoph Hellwig
@ 2018-04-19 14:10         ` J. Bruce Fields
  0 siblings, 0 replies; 57+ messages in thread
From: J. Bruce Fields @ 2018-04-19 14:10 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Martin Steigerwald, Theodore Y. Ts'o, Joshua D. Drake,
	linux-ext4, linux-fsdevel

On Thu, Apr 19, 2018 at 01:39:04AM -0700, Christoph Hellwig wrote:
> On Wed, Apr 18, 2018 at 12:52:19PM -0400, J. Bruce Fields wrote:
> > > Theodore Y. Ts'o - 10.04.18, 20:43:
> > > > First of all, what storage devices will do when they hit an exception
> > > > condition is quite non-deterministic.  For example, the vast majority
> > > > of SSD's are not power fail certified.  What this means is that if
> > > > they suffer a power drop while they are doing a GC, it is quite
> > > > possible for data written six months ago to be lost as a result.  The
> > > > LBA could potentially be far, far away from any LBA's that were
> > > > recently written, and there could have been multiple CACHE FLUSH
> > > > operations since the LBA in question was last written six
> > > > months ago.  No matter; for a consumer-grade SSD, it's possible for
> > > > that LBA to be trashed after an unexpected power drop.
> > 
> > Pointers to documentation or papers or anything?  The only google
> > results I can find for "power fail certified" are your posts.
> > 
> > I've always been confused by SSD power-loss protection, as nobody seems
> > completely clear whether it's a safety or a performance feature.
> 
> Devices from reputable vendors should always be power fail safe, bugs
> notwithstanding.  What power-loss protection in marketing slides usually
> means is that an SSD has a non-volatile write cache.  That is once a
> write is ACKed data is persisted and no additional cache flush needs to
> be sent.  This is a feature only available in expensive enterprise SSDs
> as the required capacitors are expensive.  Cheaper consumer or boot
> drive SSDs have a volatile write cache, that is we need to do a
> separate cache flush to persist data (REQ_OP_FLUSH in Linux).  But
> a reasonable implementation of those still won't corrupt previously
> written data, they will just lose the volatile write cache that hasn't
> been flushed.  Occasional bugs, bad actors or other issues might still
> happen.

Thanks!  That was my understanding too.  But then the name is terrible.
As is all the vendor documentation I can find:

	https://insights.samsung.com/2016/03/22/power-loss-protection-how-ssds-are-protecting-data-integrity-white-paper/

	"Power loss protection is a critical aspect of ensuring data
	integrity, especially in servers or data centers."

	https://www.intel.com/content/.../ssd-320-series-power-loss-data-protection-brief.pdf

	"Data safety features prepare for unexpected power-loss and
	protect system and user data."

Why do they all neglect to mention that their consumer drives are also
perfectly capable of well-defined behavior after power loss, just at the
expense of flush performance?  It's ridiculously confusing.

--b.

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: fsync() errors is unsafe and risks data loss
  2018-04-18 16:52     ` J. Bruce Fields
@ 2018-04-19  8:39       ` Christoph Hellwig
  2018-04-19 14:10         ` J. Bruce Fields
  0 siblings, 1 reply; 57+ messages in thread
From: Christoph Hellwig @ 2018-04-19  8:39 UTC (permalink / raw)
  To: J. Bruce Fields
  Cc: Martin Steigerwald, Theodore Y. Ts'o, Joshua D. Drake,
	linux-ext4, linux-fsdevel

On Wed, Apr 18, 2018 at 12:52:19PM -0400, J. Bruce Fields wrote:
> > Theodore Y. Ts'o - 10.04.18, 20:43:
> > > First of all, what storage devices will do when they hit an exception
> > > condition is quite non-deterministic.  For example, the vast majority
> > > of SSD's are not power fail certified.  What this means is that if
> > > they suffer a power drop while they are doing a GC, it is quite
> > > possible for data written six months ago to be lost as a result.  The
> > > LBA could potentially be far, far away from any LBA's that were
> > > recently written, and there could have been multiple CACHE FLUSH
> > > operations since the LBA in question was last written six
> > > months ago.  No matter; for a consumer-grade SSD, it's possible for
> > > that LBA to be trashed after an unexpected power drop.
> 
> Pointers to documentation or papers or anything?  The only google
> results I can find for "power fail certified" are your posts.
> 
> I've always been confused by SSD power-loss protection, as nobody seems
> completely clear whether it's a safety or a performance feature.

Devices from reputable vendors should always be power fail safe, bugs
notwithstanding.  What power-loss protection in marketing slides usually
means is that an SSD has a non-volatile write cache.  That is once a
write is ACKed data is persisted and no additional cache flush needs to
be sent.  This is a feature only available in expensive enterprise SSDs
as the required capacitors are expensive.  Cheaper consumer or boot
drive SSDs have a volatile write cache, that is we need to do a
separate cache flush to persist data (REQ_OP_FLUSH in Linux).  But
a reasonable implementation of those still won't corrupt previously
written data, they will just lose the volatile write cache that hasn't
been flushed.  Occasional bugs, bad actors or other issues might still
happen.

> 
> --b.
---end quoted text---

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: fsync() errors is unsafe and risks data loss
  2018-04-10 19:47   ` Martin Steigerwald
@ 2018-04-18 16:52     ` J. Bruce Fields
  2018-04-19  8:39       ` Christoph Hellwig
  0 siblings, 1 reply; 57+ messages in thread
From: J. Bruce Fields @ 2018-04-18 16:52 UTC (permalink / raw)
  To: Martin Steigerwald
  Cc: Theodore Y. Ts'o, Joshua D. Drake, linux-ext4, linux-fsdevel

> Theodore Y. Ts'o - 10.04.18, 20:43:
> > First of all, what storage devices will do when they hit an exception
> > condition is quite non-deterministic.  For example, the vast majority
> > of SSD's are not power fail certified.  What this means is that if
> > they suffer a power drop while they are doing a GC, it is quite
> > possible for data written six months ago to be lost as a result.  The
> > LBA could potentially be far, far away from any LBA's that were
> > recently written, and there could have been multiple CACHE FLUSH
> > operations since the LBA in question was last written six
> > months ago.  No matter; for a consumer-grade SSD, it's possible for
> > that LBA to be trashed after an unexpected power drop.

Pointers to documentation or papers or anything?  The only google
results I can find for "power fail certified" are your posts.

I've always been confused by SSD power-loss protection, as nobody seems
completely clear whether it's a safety or a performance feature.

--b.


* Re: fsync() errors is unsafe and risks data loss
       [not found] ` <20180410184356.GD3563@thunk.org>
@ 2018-04-10 19:47   ` Martin Steigerwald
  2018-04-18 16:52     ` J. Bruce Fields
  0 siblings, 1 reply; 57+ messages in thread
From: Martin Steigerwald @ 2018-04-10 19:47 UTC (permalink / raw)
  To: Theodore Y. Ts'o; +Cc: Joshua D. Drake, linux-ext4, linux-fsdevel

Hi Theodore, Darrick, Joshua.

CC´d fsdevel as it does not appear to be Ext4-specific to me (nor to you, it 
seems, Theodore).

Theodore Y. Ts'o - 10.04.18, 20:43:
> This isn't actually an ext4 issue, but a long-standing VFS/MM issue.
[…]
> First of all, what storage devices will do when they hit an exception
> condition is quite non-deterministic.  For example, the vast majority
> of SSD's are not power fail certified.  What this means is that if
> they suffer a power drop while they are doing a GC, it is quite
> possible for data written six months ago to be lost as a result.  The
> LBA could potentially be far, far away from any LBA's that were
> recently written, and there could have been multiple CACHE FLUSH
> operations since the LBA in question was last written six
> months ago.  No matter; for a consumer-grade SSD, it's possible for
> that LBA to be trashed after an unexpected power drop.

Guh. I was not aware of this. I knew consumer-grade SSDs often do not have 
power loss protection, but still thought they´d handle garbage collection in an 
atomic way. Sometimes I am tempted to sing an "all hardware is crap" song 
(starting with Meltdown/Spectre, then probably heading over to storage devices 
and so on… including firmware crap like Intel ME).

> Next, the reason why fsync() has the behaviour that it does is that one
> of the most common cases of I/O storage errors in buffered use
> cases, certainly as seen by the community distros, is the user who
> pulls out the USB stick while it is in use.  In that case, if there are
> dirtied pages in the page cache, the question is what can you do?
> Sooner or later the writes will time out, and if you leave the pages
> dirty, then it effectively becomes a permanent memory leak.  You can't
> unmount the file system --- that requires writing out all of the pages
> such that the dirty bit is turned off.  And if you don't clear the
> dirty bit on an I/O error, then they can never be cleaned.  You can't
> even re-insert the USB stick; the re-inserted USB stick will get a new
> block device.  Worse, when the USB stick was pulled, it will have
> suffered a power drop, and see above about what could happen after a
> power drop for non-power fail certified flash devices --- it goes
> double for the cheap sh*t USB sticks found in the checkout aisle of
> Micro Center.

From the original PostgreSQL mailing list thread I did not get how exactly 
FreeBSD differs in behavior from Linux. I am aware of one operating system 
that, from a user´s point of view, handles this in almost the right way 
IMHO: AmigaOS.

When you removed a floppy disk from the drive while the OS was writing to it, 
it showed a "You MUST insert volume somename into drive somedrive:" requester, 
and if you did, it just continued writing. (The part that did not work well 
was that with the original filesystem, if you did not insert it back, the 
whole disk was corrupted, usually beyond repair, so the "MUST" was no joke.)

In my opinion, from a user´s point of view this is the only sane way to handle 
the premature removal of removable media. I have read of a GSoC project to 
implement something like this for NetBSD, but I did not check on its outcome. 
I think MS-DOS had something similar as well; however, MS-DOS is not a 
multitasking operating system as AmigaOS is.

Implementing something like this for Linux would be quite a feat, I think, 
because in addition to the implementation in the kernel, the desktop 
environment or whatever other userspace you use would need to handle it as 
well, so you´d have to adapt udev / udisks / probably systemd. And this 
behavior would probably need to be restricted to devices that are really 
removable. Even then, in order to prevent memory exhaustion in case processes 
continue to write to a removed and not yet re-inserted USB harddisk, the 
kernel would need to block processes that keep dirtying pages for this 
device. (I believe this is what AmigaOS did. It just blocked all subsequent 
I/O to the device until it was re-inserted. But then the I/O handling in that 
OS at that time was quite different from what Linux does.)
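Since Linux does not block writers like that today, the closest userspace
approximation I can think of (a rough sketch of my own, not something
proposed in this thread) is for the application to keep its own copy of the
data until fsync() has succeeded, and on failure to rewrite the file from
that copy once the volume is reachable again, rather than retrying fsync()
and trusting that the page cache still holds the dirty pages.  The name
flush_or_rewrite, the retry interval and the exact errno values checked are
all illustrative; a pulled device can surface different errors.

#include <errno.h>
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Rough sketch: the caller keeps its own copy of the data (buf/len)
 * until fsync() has succeeded once.  After a failed fsync() the kernel
 * may already have dropped the dirty pages, so calling fsync() again on
 * the same fd proves nothing; instead the file is reopened and rewritten
 * from the application's copy once the volume is back.
 */
static int flush_or_rewrite(const char *path, const void *buf, size_t len)
{
	for (;;) {
		int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
		if (fd < 0) {
			if (errno == ENOENT || errno == ENXIO || errno == EIO) {
				sleep(1);	/* wait for the media to come back */
				continue;
			}
			return -errno;	/* some other, permanent problem */
		}
		if (write(fd, buf, len) == (ssize_t)len && fsync(fd) == 0)
			return close(fd);

		close(fd);		/* I/O error or short write: do a full rewrite later */
		sleep(1);
	}
}

A real program would bound the retries and report the error to the user; the
point is only that recovery has to come from the application's own buffer,
not from the kernel's page cache.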

> So this is the explanation for why Linux handles I/O errors by
> clearing the dirty bit after reporting the error up to user space.
> And why there is not eagerness to solve the problem simply by "don't
> clear the dirty bit".  For every one Postgres installation that might
> have a better recover after an I/O error, there's probably a thousand
> clueless Fedora and Ubuntu users who will have a much worse user
> experience after a USB stick pull happens.

I was not aware that flash-based media may be as crappy as you hint at.

From my tests with AmigaOS 4.something or AmigaOS 3.9 + the 3rd-party Poseidon 
USB stack, the above mechanism worked even with USB sticks. However, I did not 
test this often, and I did not check for data corruption after a test.

Thanks,
-- 
Martin


end of thread

Thread overview: 57+ messages
2018-04-10 22:07 fsync() errors is unsafe and risks data loss Andres Freund
2018-04-11 21:52 ` Andreas Dilger
2018-04-12  0:09   ` Dave Chinner
2018-04-12  2:32     ` Andres Freund
2018-04-12  2:51       ` Andres Freund
2018-04-12  5:09       ` Theodore Y. Ts'o
2018-04-12  5:45       ` Dave Chinner
2018-04-12 11:24         ` Jeff Layton
2018-04-12 21:11           ` Andres Freund
2018-04-12 10:19       ` Lukas Czerner
2018-04-12 19:46         ` Andres Freund
2018-04-12  2:17   ` Andres Freund
2018-04-12  3:02     ` Matthew Wilcox
2018-04-12 11:09       ` Jeff Layton
2018-04-12 11:19         ` Matthew Wilcox
2018-04-12 12:01         ` Dave Chinner
2018-04-12 15:08           ` Jeff Layton
2018-04-12 22:44             ` Dave Chinner
2018-04-13 13:18               ` Jeff Layton
2018-04-13 13:25                 ` Andres Freund
2018-04-13 14:02                 ` Matthew Wilcox
2018-04-14  1:47                   ` Dave Chinner
2018-04-14  2:04                     ` Andres Freund
2018-04-18 23:59                       ` Dave Chinner
2018-04-19  0:23                         ` Eric Sandeen
2018-04-14  2:38                     ` Matthew Wilcox
2018-04-19  0:13                       ` Dave Chinner
2018-04-19  0:40                         ` Matthew Wilcox
2018-04-19  1:08                           ` Theodore Y. Ts'o
2018-04-19 17:40                             ` Matthew Wilcox
2018-04-19 23:27                               ` Theodore Y. Ts'o
2018-04-19 23:28                           ` Dave Chinner
2018-04-12 15:16           ` Theodore Y. Ts'o
2018-04-12 20:13             ` Andres Freund
2018-04-12 20:28               ` Matthew Wilcox
2018-04-12 21:14                 ` Jeff Layton
2018-04-12 21:31                   ` Matthew Wilcox
2018-04-13 12:56                     ` Jeff Layton
2018-04-12 21:21                 ` Theodore Y. Ts'o
2018-04-12 21:24                   ` Matthew Wilcox
2018-04-12 21:37                   ` Andres Freund
2018-04-12 20:24         ` Andres Freund
2018-04-12 21:27           ` Jeff Layton
2018-04-12 21:53             ` Andres Freund
2018-04-12 21:57               ` Theodore Y. Ts'o
2018-04-21 18:14         ` Jan Kara
2018-04-12  5:34     ` Theodore Y. Ts'o
2018-04-12 19:55       ` Andres Freund
2018-04-12 21:52         ` Theodore Y. Ts'o
2018-04-12 22:03           ` Andres Freund
2018-04-18 18:09     ` J. Bruce Fields
2018-04-13 14:48 ` Matthew Wilcox
2018-04-21 16:59   ` Jan Kara
     [not found] <8da874c9-cf9c-d40a-3474-b773190878e7@commandprompt.com>
     [not found] ` <20180410184356.GD3563@thunk.org>
2018-04-10 19:47   ` Martin Steigerwald
2018-04-18 16:52     ` J. Bruce Fields
2018-04-19  8:39       ` Christoph Hellwig
2018-04-19 14:10         ` J. Bruce Fields
