* [PATCH v3 0/3] vfs: have syncfs() return error when there are writeback errors
@ 2020-02-07 17:04 Jeff Layton
  2020-02-07 17:04 ` [PATCH v3 1/3] vfs: track per-sb writeback errors and report them to syncfs Jeff Layton
                   ` (4 more replies)
  0 siblings, 5 replies; 15+ messages in thread
From: Jeff Layton @ 2020-02-07 17:04 UTC (permalink / raw)
  To: viro
  Cc: linux-fsdevel, linux-kernel, linux-api, andres, willy, dhowells,
	hch, jack, akpm

You're probably wondering -- Where are v1 and v2 sets?

I did the first couple of versions of this set back in 2018, and then
got dragged off to work on other things. I'd like to resurrect this set
though, as I think it's valuable overall, and I have need of it for some
other work I'm doing.

Currently, syncfs does not return errors when one of the inodes fails to
be written back. It will return errors based on the legacy AS_EIO and
AS_ENOSPC flags when syncing out the block device fails, but that's not
particularly helpful for filesystems that aren't backed by a blockdev.
It's also possible for a stray sync to lose those errors.

The basic idea is to track writeback errors at the superblock level,
so that we can quickly and easily check whether something bad happened
without having to fsync each file individually. syncfs is then changed
to reliably report writeback errors, and a new ioctl is added to allow
userland to get at the current errseq_t value w/o having to sync out
anything.

I do have an xfstest for this. I do not yet have manpage patches, but
I'm happy to roll some once there is consensus on the interface.
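
To make the intended usage concrete, here's a minimal userspace sketch
(illustrative only -- it assumes the O_PATH behavior from patch 1, and
error handling is trimmed):

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
	/* Open an O_PATH fd early; with patch 1 this samples the
	 * superblock's errseq_t rather than the file's mapping. */
	int mfd = open("/mnt/data", O_PATH);

	if (mfd < 0)
		return 1;

	/* ... write data to files on the filesystem via other fds ... */

	/* syncfs() now fails if any inode on the filesystem hit a
	 * writeback error since mfd was opened, even one this process
	 * never fsynced. */
	if (syncfs(mfd) < 0)
		perror("syncfs");	/* enter error recovery */

	close(mfd);
	return 0;
}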

Caveats:

- Having different behavior for an O_PATH descriptor in syncfs is
  a bit odd, but it means that we don't have to grow struct file. Is
  that acceptable from an API standpoint?

- This adds a new generic fs ioctl to allow userland to scrape the
  current superblock's errseq_t value. It may be best to present this
  to userland via fsinfo() instead (once that's merged). I'm fine with
  dropping the last patch for now and reworking it for fsinfo if so.

Jeff Layton (3):
  vfs: track per-sb writeback errors and report them to syncfs
  buffer: record blockdev write errors in super_block that it backs
  vfs: add a new ioctl for fetching the superblock's errseq_t

 fs/buffer.c             |  2 ++
 fs/ioctl.c              |  4 ++++
 fs/open.c               |  6 +++---
 fs/sync.c               |  9 ++++++++-
 include/linux/errseq.h  |  1 +
 include/linux/fs.h      |  3 +++
 include/linux/pagemap.h |  5 ++++-
 include/uapi/linux/fs.h |  1 +
 lib/errseq.c            | 33 +++++++++++++++++++++++++++++++--
 9 files changed, 57 insertions(+), 7 deletions(-)

-- 
2.24.1


^ permalink raw reply	[flat|nested] 15+ messages in thread

* [PATCH v3 1/3] vfs: track per-sb writeback errors and report them to syncfs
  2020-02-07 17:04 [PATCH v3 0/3] vfs: have syncfs() return error when there are writeback errors Jeff Layton
@ 2020-02-07 17:04 ` Jeff Layton
  2020-02-07 17:04 ` [PATCH v3 2/3] buffer: record blockdev write errors in super_block that it backs Jeff Layton
                   ` (3 subsequent siblings)
  4 siblings, 0 replies; 15+ messages in thread
From: Jeff Layton @ 2020-02-07 17:04 UTC (permalink / raw)
  To: viro
  Cc: linux-fsdevel, linux-kernel, linux-api, andres, willy, dhowells,
	hch, jack, akpm

From: Jeff Layton <jlayton@redhat.com>

Usually we suggest that applications call fsync when they want to
ensure that all data written to the file has made it to the backing
store, but that can be inefficient when there are a lot of open
files.

Calling syncfs on the filesystem can be more efficient in some
situations, but the error reporting doesn't currently work the way most
people expect. If a single inode on a filesystem reports a writeback
error, syncfs won't necessarily return an error. syncfs only returns an
error if __sync_blockdev fails, and on some filesystems that's a no-op.

It would be better if syncfs reported an error if there were any writeback
failures. Then applications could call syncfs to see if there are any
errors on any open files, and could then call fsync on all of the other
descriptors to figure out which one failed.

This patch adds a new errseq_t to struct super_block, and has
mapping_set_error also record writeback errors there.

To report those errors, we also need to keep an errseq_t in struct
file to act as a cursor, but growing struct file for this purpose is
undesirable. We could just reuse f_wb_err, but someone could mix calls
to fsync and syncfs and that would break things.

This patch implements an alternative suggested by Willy. When the file
is opened with O_PATH, then we repurpose the f_wb_err cursor to track
s_wb_err. Any file opened with O_PATH will not have an fsync
file_operation, and attempts to fsync such an fd will return -EBADF.

Note that calling syncfs on an O_PATH descriptor today will also return
-EBADF, so this scheme gives userland a way to tell whether this
mechanism will work at runtime.
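
For instance, a runtime probe could look something like this
(hypothetical userspace snippet, not part of this patch):

	int fd = open("/mnt", O_PATH);

	if (syncfs(fd) < 0 && errno == EBADF) {
		/* no per-sb error reporting here; fall back to
		 * fsync()ing individual fds */
	}
	close(fd);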

Cc: Andres Freund <andres@anarazel.de>
Cc: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Jeff Layton <jlayton@kernel.org>
---
 fs/open.c               | 6 +++---
 fs/sync.c               | 9 ++++++++-
 include/linux/fs.h      | 3 +++
 include/linux/pagemap.h | 5 ++++-
 4 files changed, 18 insertions(+), 5 deletions(-)

diff --git a/fs/open.c b/fs/open.c
index 0788b3715731..de10a0bf7697 100644
--- a/fs/open.c
+++ b/fs/open.c
@@ -744,12 +744,10 @@ static int do_dentry_open(struct file *f,
 	f->f_inode = inode;
 	f->f_mapping = inode->i_mapping;
 
-	/* Ensure that we skip any errors that predate opening of the file */
-	f->f_wb_err = filemap_sample_wb_err(f->f_mapping);
-
 	if (unlikely(f->f_flags & O_PATH)) {
 		f->f_mode = FMODE_PATH | FMODE_OPENED;
 		f->f_op = &empty_fops;
+		f->f_wb_err = errseq_sample(&f->f_path.dentry->d_sb->s_wb_err);
 		return 0;
 	}
 
@@ -759,6 +757,8 @@ static int do_dentry_open(struct file *f,
 		goto cleanup_file;
 	}
 
+	f->f_wb_err = filemap_sample_wb_err(f->f_mapping);
+
 	if (f->f_mode & FMODE_WRITE && !special_file(inode->i_mode)) {
 		error = get_write_access(inode);
 		if (unlikely(error))
diff --git a/fs/sync.c b/fs/sync.c
index 4d1ff010bc5a..8373d0372767 100644
--- a/fs/sync.c
+++ b/fs/sync.c
@@ -159,7 +159,7 @@ void emergency_sync(void)
  */
 SYSCALL_DEFINE1(syncfs, int, fd)
 {
-	struct fd f = fdget(fd);
+	struct fd f = fdget_raw(fd);
 	struct super_block *sb;
 	int ret;
 
@@ -171,6 +171,13 @@ SYSCALL_DEFINE1(syncfs, int, fd)
 	ret = sync_filesystem(sb);
 	up_read(&sb->s_umount);
 
+	if (f.file->f_flags & O_PATH) {
+		int ret2 = errseq_check_and_advance(&sb->s_wb_err,
+						    &f.file->f_wb_err);
+		if (ret == 0)
+			ret = ret2;
+	}
+
 	fdput(f);
 	return ret;
 }
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 6eae91c0668f..bdbb0cbad03a 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1514,6 +1514,9 @@ struct super_block {
 	/* Being remounted read-only */
 	int s_readonly_remount;
 
+	/* per-sb errseq_t for reporting writeback errors via syncfs */
+	errseq_t s_wb_err;
+
 	/* AIO completions deferred from interrupt context */
 	struct workqueue_struct *s_dio_done_wq;
 	struct hlist_head s_pins;
diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index ccb14b6a16b5..897439475315 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -51,7 +51,10 @@ static inline void mapping_set_error(struct address_space *mapping, int error)
 		return;
 
 	/* Record in wb_err for checkers using errseq_t based tracking */
-	filemap_set_wb_err(mapping, error);
+	__filemap_set_wb_err(mapping, error);
+
+	/* Record it in superblock */
+	errseq_set(&mapping->host->i_sb->s_wb_err, error);
 
 	/* Record it in flags for now, for legacy callers */
 	if (error == -ENOSPC)
-- 
2.24.1


^ permalink raw reply related	[flat|nested] 15+ messages in thread

* [PATCH v3 2/3] buffer: record blockdev write errors in super_block that it backs
  2020-02-07 17:04 [PATCH v3 0/3] vfs: have syncfs() return error when there are writeback errors Jeff Layton
  2020-02-07 17:04 ` [PATCH v3 1/3] vfs: track per-sb writeback errors and report them to syncfs Jeff Layton
@ 2020-02-07 17:04 ` Jeff Layton
  2020-02-07 17:04 ` [PATCH v3 3/3] vfs: add a new ioctl for fetching the superblock's errseq_t Jeff Layton
                   ` (2 subsequent siblings)
  4 siblings, 0 replies; 15+ messages in thread
From: Jeff Layton @ 2020-02-07 17:04 UTC (permalink / raw)
  To: viro
  Cc: linux-fsdevel, linux-kernel, linux-api, andres, willy, dhowells,
	hch, jack, akpm

From: Jeff Layton <jlayton@redhat.com>

When syncing out a block device (a la __sync_blockdev), any error
encountered will only be recorded in the bd_inode's mapping. When the
blockdev contains a filesystem however, we'd like to also record the
error in the super_block that's stored there.

Make mark_buffer_write_io_error also record the error in the
corresponding super_block when a writeback error occurs and the block
device contains a mounted superblock.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
---
 fs/buffer.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/fs/buffer.c b/fs/buffer.c
index b8d28370cfd7..451f1be6e1a4 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -1166,6 +1166,8 @@ void mark_buffer_write_io_error(struct buffer_head *bh)
 		mapping_set_error(bh->b_page->mapping, -EIO);
 	if (bh->b_assoc_map)
 		mapping_set_error(bh->b_assoc_map, -EIO);
+	if (bh->b_bdev->bd_super)
+		errseq_set(&bh->b_bdev->bd_super->s_wb_err, -EIO);
 }
 EXPORT_SYMBOL(mark_buffer_write_io_error);
 
-- 
2.24.1


^ permalink raw reply related	[flat|nested] 15+ messages in thread

* [PATCH v3 3/3] vfs: add a new ioctl for fetching the superblock's errseq_t
  2020-02-07 17:04 [PATCH v3 0/3] vfs: have syncfs() return error when there are writeback errors Jeff Layton
  2020-02-07 17:04 ` [PATCH v3 1/3] vfs: track per-sb writeback errors and report them to syncfs Jeff Layton
  2020-02-07 17:04 ` [PATCH v3 2/3] buffer: record blockdev write errors in super_block that it backs Jeff Layton
@ 2020-02-07 17:04 ` Jeff Layton
  2020-02-07 20:52 ` [PATCH v3 0/3] vfs: have syncfs() return error when there are writeback errors Dave Chinner
  2020-02-12 12:21 ` Jeff Layton
  4 siblings, 0 replies; 15+ messages in thread
From: Jeff Layton @ 2020-02-07 17:04 UTC (permalink / raw)
  To: viro
  Cc: linux-fsdevel, linux-kernel, linux-api, andres, willy, dhowells,
	hch, jack, akpm

From: Jeff Layton <jlayton@redhat.com>

Some time ago, the PostgreSQL developers mentioned that they'd like a
way to tell whether there have been any writeback errors on a given
filesystem without having to forcibly sync out all buffered writes.

Now that we have a per-sb errseq_t that tracks whether any inode on the
filesystem might have failed writeback, we can present that to userland
applications via a new interface. Add a new generic fs ioctl for that
purpose. This just reports the current state of the errseq_t counter
with the SEEN bit masked off.
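
A sketch of the intended polling pattern (illustrative; any open fd on
the filesystem will do, and error handling is omitted):

	unsigned int before, after;
	int fd = open("/mnt/data/somefile", O_RDONLY);

	ioctl(fd, FS_IOC_GETFSERR, &before);	/* e.g. at checkpoint start */
	/* ... issue and wait on writes ... */
	ioctl(fd, FS_IOC_GETFSERR, &after);	/* before finalizing */
	if (before != after) {
		/* new writeback errors were recorded in between */
	}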

Cc: Andres Freund <andres@anarazel.de>
Signed-off-by: Jeff Layton <jlayton@kernel.org>
---
 fs/ioctl.c              |  4 ++++
 include/linux/errseq.h  |  1 +
 include/uapi/linux/fs.h |  1 +
 lib/errseq.c            | 33 +++++++++++++++++++++++++++++++--
 4 files changed, 37 insertions(+), 2 deletions(-)

diff --git a/fs/ioctl.c b/fs/ioctl.c
index 7c9a5df5a597..41e991cec4c3 100644
--- a/fs/ioctl.c
+++ b/fs/ioctl.c
@@ -705,6 +705,10 @@ static int do_vfs_ioctl(struct file *filp, unsigned int fd,
 	case FS_IOC_FIEMAP:
 		return ioctl_fiemap(filp, argp);
 
+	case FS_IOC_GETFSERR:
+		return put_user(errseq_scrape(&inode->i_sb->s_wb_err),
+				(unsigned int __user *)argp);
+
 	case FIGETBSZ:
 		/* anon_bdev filesystems may not have a block size */
 		if (!inode->i_sb->s_blocksize)
diff --git a/include/linux/errseq.h b/include/linux/errseq.h
index fc2777770768..de165623fa86 100644
--- a/include/linux/errseq.h
+++ b/include/linux/errseq.h
@@ -9,6 +9,7 @@ typedef u32	errseq_t;
 
 errseq_t errseq_set(errseq_t *eseq, int err);
 errseq_t errseq_sample(errseq_t *eseq);
+errseq_t errseq_scrape(errseq_t *eseq);
 int errseq_check(errseq_t *eseq, errseq_t since);
 int errseq_check_and_advance(errseq_t *eseq, errseq_t *since);
 #endif
diff --git a/include/uapi/linux/fs.h b/include/uapi/linux/fs.h
index 379a612f8f1d..c39b37fba7f9 100644
--- a/include/uapi/linux/fs.h
+++ b/include/uapi/linux/fs.h
@@ -214,6 +214,7 @@ struct fsxattr {
 #define FS_IOC_FSSETXATTR		_IOW('X', 32, struct fsxattr)
 #define FS_IOC_GETFSLABEL		_IOR(0x94, 49, char[FSLABEL_MAX])
 #define FS_IOC_SETFSLABEL		_IOW(0x94, 50, char[FSLABEL_MAX])
+#define FS_IOC_GETFSERR			_IOR('e', 1, unsigned int)
 
 /*
  * Inode flags (FS_IOC_GETFLAGS / FS_IOC_SETFLAGS)
diff --git a/lib/errseq.c b/lib/errseq.c
index 81f9e33aa7e7..8ded0920eed3 100644
--- a/lib/errseq.c
+++ b/lib/errseq.c
@@ -108,7 +108,7 @@ errseq_t errseq_set(errseq_t *eseq, int err)
 EXPORT_SYMBOL(errseq_set);
 
 /**
- * errseq_sample() - Grab current errseq_t value.
+ * errseq_sample() - Grab current errseq_t value (or 0 if it hasn't been seen)
  * @eseq: Pointer to errseq_t to be sampled.
  *
  * This function allows callers to initialise their errseq_t variable.
@@ -117,7 +117,7 @@ EXPORT_SYMBOL(errseq_set);
  * see it the next time it checks for an error.
  *
  * Context: Any context.
- * Return: The current errseq value.
+ * Return: The current errseq value or 0 if it wasn't previously seen
  */
 errseq_t errseq_sample(errseq_t *eseq)
 {
@@ -130,6 +130,35 @@ errseq_t errseq_sample(errseq_t *eseq)
 }
 EXPORT_SYMBOL(errseq_sample);
 
+/**
+ * errseq_scrape() - Grab current errseq_t value
+ * @eseq: Pointer to errseq_t to be sampled.
+ *
+ * This function allows callers to scrape the current value of an errseq_t.
+ * Unlike errseq_sample, this will always return the current value with
+ * the SEEN flag unset, even when the value has not yet been seen.
+ *
+ * Context: Any context.
+ * Return: The current errseq value with ERRSEQ_SEEN masked off
+ */
+errseq_t errseq_scrape(errseq_t *eseq)
+{
+	errseq_t old = READ_ONCE(*eseq);
+
+	/*
+	 * For the common case of no errors ever having been set, we can skip
+	 * marking the SEEN bit. Once an error has been set, the value will
+	 * never go back to zero.
+	 */
+	if (old != 0) {
+		errseq_t new = old | ERRSEQ_SEEN;
+		if (old != new)
+			cmpxchg(eseq, old, new);
+	}
+	return old & ~ERRSEQ_SEEN;
+}
+EXPORT_SYMBOL(errseq_scrape);
+
 /**
  * errseq_check() - Has an error occurred since a particular sample point?
  * @eseq: Pointer to errseq_t value to be checked.
-- 
2.24.1


^ permalink raw reply related	[flat|nested] 15+ messages in thread

* Re: [PATCH v3 0/3] vfs: have syncfs() return error when there are writeback errors
  2020-02-07 17:04 [PATCH v3 0/3] vfs: have syncfs() return error when there are writeback errors Jeff Layton
                   ` (2 preceding siblings ...)
  2020-02-07 17:04 ` [PATCH v3 3/3] vfs: add a new ioctl for fetching the superblock's errseq_t Jeff Layton
@ 2020-02-07 20:52 ` Dave Chinner
  2020-02-07 21:20   ` Andres Freund
  2020-02-12 12:21 ` Jeff Layton
  4 siblings, 1 reply; 15+ messages in thread
From: Dave Chinner @ 2020-02-07 20:52 UTC (permalink / raw)
  To: Jeff Layton
  Cc: viro, linux-fsdevel, linux-kernel, linux-api, andres, willy,
	dhowells, hch, jack, akpm

On Fri, Feb 07, 2020 at 12:04:20PM -0500, Jeff Layton wrote:
> You're probably wondering -- Where are v1 and v2 sets?
> 
> I did the first couple of versions of this set back in 2018, and then
> got dragged off to work on other things. I'd like to resurrect this set
> though, as I think it's valuable overall, and I have need of it for some
> other work I'm doing.
> 
> Currently, syncfs does not return errors when one of the inodes fails to
> be written back. It will return errors based on the legacy AS_EIO and
> AS_ENOSPC flags when syncing out the block device fails, but that's not
> particularly helpful for filesystems that aren't backed by a blockdev.
> It's also possible for a stray sync to lose those errors.
> 
> The basic idea is to track writeback errors at the superblock level,
> so that we can quickly and easily check whether something bad happened
> without having to fsync each file individually. syncfs is then changed
> to reliably report writeback errors, and a new ioctl is added to allow
> userland to get at the current errseq_t value w/o having to sync out
> anything.

So what, exactly, can userspace do with this error? It has no idea
at all what file the writeback failure occurred on, or even
what files syncfs() acted on, so there's no obvious error
recovery that it could perform on reception of such an error.

> I do have an xfstest for this. I do not yet have manpage patches, but
> I'm happy to roll some once there is consensus on the interface.
> 
> Caveats:
> 
> - Having different behavior for an O_PATH descriptor in syncfs is
>   a bit odd, but it means that we don't have to grow struct file. Is
>   that acceptable from an API standpoint?

It's an ugly wart, IMO. But because we suck at APIs, I'm betting
that we'll decide this is OK or do something even worse...

> - This adds a new generic fs ioctl to allow userland to scrape the
>   current superblock's errseq_t value. It may be best to present this
>   to userland via fsinfo() instead (once that's merged). I'm fine with
>   dropping the last patch for now and reworking it for fsinfo if so.

What, exactly, is this useful for? Why would we consider exposing
an internal implementation detail to userspace like this?

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH v3 0/3] vfs: have syncfs() return error when there are writeback errors
  2020-02-07 20:52 ` [PATCH v3 0/3] vfs: have syncfs() return error when there are writeback errors Dave Chinner
@ 2020-02-07 21:20   ` Andres Freund
  2020-02-07 22:05     ` Jeff Layton
  2020-02-10 21:46     ` Dave Chinner
  0 siblings, 2 replies; 15+ messages in thread
From: Andres Freund @ 2020-02-07 21:20 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Jeff Layton, viro, linux-fsdevel, linux-kernel, linux-api, willy,
	dhowells, hch, jack, akpm

Hi,

On 2020-02-08 07:52:43 +1100, Dave Chinner wrote:
> On Fri, Feb 07, 2020 at 12:04:20PM -0500, Jeff Layton wrote:
> > You're probably wondering -- Where are v1 and v2 sets?

> > The basic idea is to track writeback errors at the superblock level,
> > so that we can quickly and easily check whether something bad happened
> > without having to fsync each file individually. syncfs is then changed
> > to reliably report writeback errors, and a new ioctl is added to allow
> > userland to get at the current errseq_t value w/o having to sync out
> > anything.
> 
> So what, exactly, can userspace do with this error? It has no idea
> at all what file the writeback failure occurred on, or even
> what files syncfs() acted on, so there's no obvious error
> recovery that it could perform on reception of such an error.

Depends on the application.  For e.g. postgres it'd be to reset
in-memory contents and perform WAL replay from the last checkpoint. Due
to various reasons* it's very hard for us (without major performance
and/or reliability impact) to fully guarantee that by the time we fsync
specific files we do so on an old enough fd to guarantee that we'd see
an error triggered by background writeback.  But keeping track of
all potential filesystems data resides on (with one fd open permanently
for each) and then syncfs()ing them at checkpoint time is quite doable.

*I can go into details, but it's probably not interesting enough


> > - This adds a new generic fs ioctl to allow userland to scrape the
> >   current superblock's errseq_t value. It may be best to present this
> >   to userland via fsinfo() instead (once that's merged). I'm fine with
> >   dropping the last patch for now and reworking it for fsinfo if so.
> 
> What, exactly, is this useful for? Why would we consider exposing
> an internal implementation detail to userspace like this?

There is, as far as I can tell, so far no way but scraping the kernel
log to figure out if there have been data loss errors on a
machine/fs. Even besides app-specific reactions like those outlined above,
just generally being able to alert whenever the error count increases
seems extremely useful.  I'm not sure it makes sense to expose the
errseq_t bits straight though - seems like it'd enshrine them in
userspace ABI too much?

Greetings,

Andres Freund

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH v3 0/3] vfs: have syncfs() return error when there are writeback errors
  2020-02-07 21:20   ` Andres Freund
@ 2020-02-07 22:05     ` Jeff Layton
  2020-02-07 22:21       ` Andres Freund
  2020-02-10 21:46     ` Dave Chinner
  1 sibling, 1 reply; 15+ messages in thread
From: Jeff Layton @ 2020-02-07 22:05 UTC (permalink / raw)
  To: Andres Freund, Dave Chinner
  Cc: viro, linux-fsdevel, linux-kernel, linux-api, willy, dhowells,
	hch, jack, akpm

On Fri, 2020-02-07 at 13:20 -0800, Andres Freund wrote:
> Hi,
> 
> On 2020-02-08 07:52:43 +1100, Dave Chinner wrote:
> > On Fri, Feb 07, 2020 at 12:04:20PM -0500, Jeff Layton wrote:
> > > You're probably wondering -- Where are v1 and v2 sets?
> > > The basic idea is to track writeback errors at the superblock level,
> > > so that we can quickly and easily check whether something bad happened
> > > without having to fsync each file individually. syncfs is then changed
> > > to reliably report writeback errors, and a new ioctl is added to allow
> > > userland to get at the current errseq_t value w/o having to sync out
> > > anything.
> > 
> > So what, exactly, can userspace do with this error? It has no idea
> > at all what file the writeback failure occurred on, or even
> > what files syncfs() acted on, so there's no obvious error
> > recovery that it could perform on reception of such an error.
> 
> Depends on the application.  For e.g. postgres it'd be to reset
> in-memory contents and perform WAL replay from the last checkpoint. Due
> to various reasons* it's very hard for us (without major performance
> and/or reliability impact) to fully guarantee that by the time we fsync
> specific files we do so on an old enough fd to guarantee that we'd see
> an error triggered by background writeback.  But keeping track of
> all potential filesystems data resides on (with one fd open permanently
> for each) and then syncfs()ing them at checkpoint time is quite doable.
> 
> *I can go into details, but it's probably not interesting enough
> 

Do applications (specifically postgresql) need the ability to check
whether there have been writeback errors on a filesystem w/o blocking on
a syncfs() call?  I thought that you had mentioned a specific usecase
for that, but if you're actually ok with syncfs() then we can drop that
part altogether.

> 
> > > - This adds a new generic fs ioctl to allow userland to scrape the
> > >   current superblock's errseq_t value. It may be best to present this
> > >   to userland via fsinfo() instead (once that's merged). I'm fine with
> > >   dropping the last patch for now and reworking it for fsinfo if so.
> > 
> > What, exactly, is this useful for? Why would we consider exposing
> > an internal implementation detail to userspace like this?
> 
> There is, as far as I can tell, so far no way but scraping the kernel
> log to figure out if there have been data loss errors on a
> machine/fs. Even besides app-specific reactions like those outlined above,
> just generally being able to alert whenever the error count increases
> seems extremely useful.  I'm not sure it makes sense to expose the
> errseq_t bits straight though - seems like it'd enshrine them in
> userspace ABI too much?
> 

Yeah, if we do end up keeping it, I'm leaning toward making this
fetchable via fsinfo() (once that's merged). If we do that, then we'll
split this into a struct with two fields -- the most recent errno and an
opaque token that you can keep to tell whether new errors have been
recorded since.

I think that should be a little cleaner from an API standpoint. Probably
we can just drop the ioctl, under the assumption that fsinfo() will be
available in 5.7.
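
Something roughly like this, perhaps (field names purely hypothetical
at this point):

	struct fsinfo_wb_error {
		__s32	wbe_errno;	/* most recent writeback error */
		__u32	wbe_cookie;	/* opaque; changes when new
					 * errors are recorded */
	};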

Cheers,
-- 
Jeff Layton <jlayton@kernel.org>


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH v3 0/3] vfs: have syncfs() return error when there are writeback errors
  2020-02-07 22:05     ` Jeff Layton
@ 2020-02-07 22:21       ` Andres Freund
  0 siblings, 0 replies; 15+ messages in thread
From: Andres Freund @ 2020-02-07 22:21 UTC (permalink / raw)
  To: Jeff Layton
  Cc: Dave Chinner, viro, linux-fsdevel, linux-kernel, linux-api,
	willy, dhowells, hch, jack, akpm

Hi,

On 2020-02-07 17:05:28 -0500, Jeff Layton wrote:
> On Fri, 2020-02-07 at 13:20 -0800, Andres Freund wrote:
> > On 2020-02-08 07:52:43 +1100, Dave Chinner wrote:
> > > On Fri, Feb 07, 2020 at 12:04:20PM -0500, Jeff Layton wrote:
> > > > You're probably wondering -- Where are v1 and v2 sets?
> > > > The basic idea is to track writeback errors at the superblock level,
> > > > so that we can quickly and easily check whether something bad happened
> > > > without having to fsync each file individually. syncfs is then changed
> > > > to reliably report writeback errors, and a new ioctl is added to allow
> > > > userland to get at the current errseq_t value w/o having to sync out
> > > > anything.
> > > 
> > > So what, exactly, can userspace do with this error? It has no idea
> > > at all what file the writeback failure occurred on, or even
> > > what files syncfs() acted on, so there's no obvious error
> > > recovery that it could perform on reception of such an error.
> > 
> > Depends on the application.  For e.g. postgres it'd be to reset
> > in-memory contents and perform WAL replay from the last checkpoint. Due
> > to various reasons* it's very hard for us (without major performance
> > and/or reliability impact) to fully guarantee that by the time we fsync
> > specific files we do so on an old enough fd to guarantee that we'd see
> > an error triggered by background writeback.  But keeping track of
> > all potential filesystems data resides on (with one fd open permanently
> > for each) and then syncfs()ing them at checkpoint time is quite doable.
> > 
> > *I can go into details, but it's probably not interesting enough
> > 
> 
> Do applications (specifically postgresql) need the ability to check
> whether there have been writeback errors on a filesystem w/o blocking on
> a syncfs() call?  I thought that you had mentioned a specific usecase
> for that, but if you're actually ok with syncfs() then we can drop that
> part altogether.

It'd be considerably better if we could check for errors without a
blocking syncfs(). A syncfs will trigger many more dirty pages to be
written back than we need for durability. Our checkpoint writes are
throttled to reduce the impact on current IO, we try to ensure there's
not much outstanding IO before calling fsync() on FDs, etc - all to
avoid stalls. Especially as on plenty of installations there are also
temporary files, e.g. for bigger-than-memory sorts, on the same FS.  So
if we had to syncfs() to reliably detect errors it'd cause some pain
- but would still be an improvement.

But with a nonblocking check we could compare the error count from the
last checkpoint with the current count before finalizing the checkpoint
- without causing unnecessary writeback.

Currently, if we crash (any unclean shutdown, be it a PG bug, OS dying,
kill -9), we'll iterate over all files afterwards to make sure they're
fsynced, before starting to perform WAL replay. That can take quite a
while on some systems - it'd be much nicer if we could just syncfs() the
involved filesystems (which we can detect more quickly than iterating
over the whole directory tree, as there are only a few places where we
support separate mounts), and still get errors.
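
As a sketch (hypothetical recovery path, assuming the per-sb reporting
lands; mounts/nmounts and recovery_failed() are made-up names):

	/* Instead of fsync()ing every file under the data directory,
	 * syncfs() each involved mount. A freshly opened O_PATH fd
	 * samples a not-yet-seen error as 0, so a previously unseen
	 * error is still reported here. */
	for (i = 0; i < nmounts; i++) {
		int fd = open(mounts[i], O_PATH);

		if (fd < 0 || syncfs(fd) < 0)
			recovery_failed();
		if (fd >= 0)
			close(fd);
	}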


> Yeah, if we do end up keeping it, I'm leaning toward making this
> fetchable via fsinfo() (once that's merged). If we do that, then we'll
> split this into a struct with two fields -- the most recent errno and an
> opaque token that you can keep to tell whether new errors have been
> recorded since.
> 
> I think that should be a little cleaner from an API standpoint. Probably
> we can just drop the ioctl, under the assumption that fsinfo() will be
> available in 5.7.

Sounds like a plan.

I guess an alternative could be to expose the error count in /sys, but
that looks like it'd be a bigger project, as there doesn't seem to be a
good pre-existing hierarchy to hook into.

Greetings,

Andres Freund

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH v3 0/3] vfs: have syncfs() return error when there are writeback errors
  2020-02-07 21:20   ` Andres Freund
  2020-02-07 22:05     ` Jeff Layton
@ 2020-02-10 21:46     ` Dave Chinner
  2020-02-11  0:04       ` Andres Freund
                         ` (1 more reply)
  1 sibling, 2 replies; 15+ messages in thread
From: Dave Chinner @ 2020-02-10 21:46 UTC (permalink / raw)
  To: Andres Freund
  Cc: Jeff Layton, viro, linux-fsdevel, linux-kernel, linux-api, willy,
	dhowells, hch, jack, akpm

On Fri, Feb 07, 2020 at 01:20:12PM -0800, Andres Freund wrote:
> Hi,
> 
> On 2020-02-08 07:52:43 +1100, Dave Chinner wrote:
> > On Fri, Feb 07, 2020 at 12:04:20PM -0500, Jeff Layton wrote:
> > > You're probably wondering -- Where are v1 and v2 sets?
> 
> > > The basic idea is to track writeback errors at the superblock level,
> > > so that we can quickly and easily check whether something bad happened
> > > without having to fsync each file individually. syncfs is then changed
> > > to reliably report writeback errors, and a new ioctl is added to allow
> > > userland to get at the current errseq_t value w/o having to sync out
> > > anything.
> > 
> > So what, exactly, can userspace do with this error? It has no idea
> > at all what file the writeback failure occurred on, or even
> > what files syncfs() acted on, so there's no obvious error
> > recovery that it could perform on reception of such an error.
> 
> Depends on the application.  For e.g. postgres it'd be to reset
> in-memory contents and perform WAL replay from the last checkpoint.

What happens if a user runs 'sync -f /path/to/postgres/data' instead
of postgres? All the writeback errors are consumed at that point by
reporting them to the process that ran syncfs()...

> Due
> to various reasons* it's very hard for us (without major performance
> and/or reliability impact) to fully guarantee that by the time we fsync
> specific files we do so on an old enough fd to guarantee that we'd see
> an error triggered by background writeback.  But keeping track of
> all potential filesystems data resides on (with one fd open permanently
> for each) and then syncfs()ing them at checkpoint time is quite doable.

Oh, you have to keep an fd permanently open to every superblock that
the application holds data on so that errors detected by other users of
that filesystem are also reported to the application?

This seems like a fairly important requirement for applications to
ensure this error reporting is "reliable" and that certainly wasn't
apparent from the patches or their description.  i.e. the API has an
explicit userspace application behaviour requirement for reliable
functioning, and that was not documented.  "we suck at APIs" and all
that..

It also seems to me to be useful only to applications that have a
"rollback and replay" error recovery mechanism. If the application
doesn't have the ability to go back in time to before the
"unfindable" writeback error occurred, then this error is largely
useless to those applications because they can't do anything with
it, and so....

> > > - This adds a new generic fs ioctl to allow userland to scrape
> > > the current superblock's errseq_t value. It may be best to
> > > present this to userland via fsinfo() instead (once that's
> > > merged). I'm fine with dropping the last patch for now and
> > > reworking it for fsinfo if so.
> > 
> > What, exactly, is this useful for? Why would we consider
> > exposing an internal implementation detail to userspace like
> > this?
> 
> There is, as far as I can tell, so far no way but scraping the
> kernel log to figure out if there have been data loss errors on a
> machine/fs.

.... most applications will still require users to scrape their
logs to find out what error actually occurred. IOWs, we haven't
really changed the status quo with this new mechanism.

FWIW, explicit userspace error notifications for data loss events is
one of the features that David Howells' generic filesystem
notification mechanism is intended to provide.  Hence I'm not sure
that there's a huge amount of value in providing a partial solution
that only certain applications can use when there's a fully generic
mechanism for error notification just around the corner.

> Even besides app-specific reactions like those outlined above,
> just generally being able to alert whenever the error count
> increases seems extremely useful.

Yup, a generic filesystem notification mechanism is perfect for
that, plus it can provide more explicit details of where the error
actually occurred rather than just a handwavy "some error occurred
somewhere" report....

> I'm not sure it makes sense to
> expose the errseq_t bits straight though - seems like it'd
> enshrine them in userspace ABI too much?

Even a little is way too much. Userspace ABI needs to be completely
independent of the kernel internal structures and implementation.
This is basic "we suck at APIs 101" stuff...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH v3 0/3] vfs: have syncfs() return error when there are writeback errors
  2020-02-10 21:46     ` Dave Chinner
@ 2020-02-11  0:04       ` Andres Freund
  2020-02-11  0:48         ` Dave Chinner
  2020-02-11 12:57       ` Jeff Layton
  1 sibling, 1 reply; 15+ messages in thread
From: Andres Freund @ 2020-02-11  0:04 UTC (permalink / raw)
  To: Dave Chinner, David Howells
  Cc: Jeff Layton, viro, linux-fsdevel, linux-kernel, linux-api, willy,
	dhowells, hch, jack, akpm

Hi,

(sorry if somebody got this twice)

David added you, because there's discussion about your notify work
below.

On 2020-02-11 08:46:57 +1100, Dave Chinner wrote:
> On Fri, Feb 07, 2020 at 01:20:12PM -0800, Andres Freund wrote:
> > Hi,
> > 
> > On 2020-02-08 07:52:43 +1100, Dave Chinner wrote:
> > > On Fri, Feb 07, 2020 at 12:04:20PM -0500, Jeff Layton wrote:
> > > > You're probably wondering -- Where are v1 and v2 sets?
> > 
> > > > The basic idea is to track writeback errors at the superblock level,
> > > > so that we can quickly and easily check whether something bad happened
> > > > without having to fsync each file individually. syncfs is then changed
> > > > to reliably report writeback errors, and a new ioctl is added to allow
> > > > userland to get at the current errseq_t value w/o having to sync out
> > > > anything.
> > > 
> > > So what, exactly, can userspace do with this error? It has no idea
> > > at all what file the writeback failure occurred on, or even
> > > what files syncfs() acted on, so there's no obvious error
> > > recovery that it could perform on reception of such an error.
> > 
> > Depends on the application.  For e.g. postgres it'd be to reset
> > in-memory contents and perform WAL replay from the last checkpoint.
> 
> What happens if a user runs 'sync -f /path/to/postgres/data' instead
> of postgres? All the writeback errors are consumed at that point by
> reporting them to the process that ran syncfs()...

We'd have to keep an fd open from *before* we start durable operations,
which has a sampled errseq_t from before we rely on seeing errors.


> > Due to various reasons* it's very hard for us (without major performance
> > and/or reliability impact) to fully guarantee that by the time we fsync
> > specific files we do so on an old enough fd to guarantee that we'd see
> > an error triggered by background writeback.  But keeping track of
> > all potential filesystems data resides on (with one fd open permanently
> > for each) and then syncfs()ing them at checkpoint time is quite doable.
> 
> Oh, you have to keep an fd permanently open to every superblock that
> the application holds data on so that errors detected by other users of
> that filesystem are also reported to the application?

Right

Currently it's much worse (as you probably know):

Without error reporting capabilities in syncfs or such you have to keep
an fd open to *every* single inode you want to reliably get errors
for. Fds have an errseq_t to keep track of which errors have been seen
by that fd, so if you have one open from *before* an error is triggered,
you can be sure to detect that. But if the fd is not guaranteed to be
old enough you can hit two cases:

1) Some other application actually sees an error, address_space->wb_err
   is marked ERRSEQ_SEEN. Any new fd will not see or report the
   problem anymore.
2) The inode with the error gets evicted (memory pressure on a database
   server isn't rare) while there is no fd open. The error might never
   be seen by anybody.

If there were a reliable (i.e. one that cannot wrap around) error
counter available *somewhere*, we could just keep track of that, instead
of actually needing an "old" open fd in the right process.


> This seems like a fairly important requirement for applications to
> ensure this error reporting is "reliable" and that certainly wasn't
> apparent from the patches or their description.  i.e. the API has an
> explicit userspace application behaviour requirement for reliable
> functioning, and that was not documented.  "we suck at APIs" and all
> that..

Yup.


> It also seems to me to be useful only to applications that have a
> "rollback and replay" error recovery mechanism. If the application
> doesn't have the ability to go back in time to before the
> "unfindable" writeback error occurred, then this error is largely
> useless to those applications because they can't do anything with
> it, and so....

That's a pretty common thing these days for applications that actually
care about data to some degree. Either they immediately f[data]sync
after writing, or they use some form of storage that has journalling
capabilities. Which e.g. sqlite provides for the myriad of cases that
don't want a separate server.

And even if they can't recover from the error, there's a huge difference
between not noticing that shit has hit the fan and happily continuing to
accept further data, and telling the user that something has gone wrong
with data integrity without details.


> .... most applications will still require users to scrape their
> logs to find out what error actually occurred. IOWs, we haven't
> really changed the status quo with this new mechanism.

I think there's a huge practical difference between having to do so in
case there was an actual error (on the relevant FSs), and having to do
so continually.


> FWIW, explicit userspace error notifications for data loss events is
> one of the features that David Howells' generic filesystem
> notification mechanism is intended to provide.  Hence I'm not sure
> that there's a huge amount of value in providing a partial solution
> that only certain applications can use when there's a fully generic
> mechanism for error notification just around the corner.

Interesting. I largely missed that work, unfortunately. It's hard to
keep up with all kernel things, while also maintaining / developing an
RDBMS :/

I assume the last state that includes the superblock layer stuff is at
https://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs.git/log/?h=notifications
whereas there's a newer
https://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs.git/log/?h=notifications-core
not including that. There do seem to be some significant changes between
the two.

As far as I can tell the superblock-based stuff does *not* actually
report any errors yet (in contrast to READONLY, EDQUOT). Is the plan here
to include writeback errors as well? Or just filesystem metadata/journal
IO?

I don't think that block layer notifications would be sufficient for an
individual userspace application's data integrity purposes? For one,
it'd need to map devices to relevant filesystems afaict. And there's
also errors above the block layer.


Based on skimming the commits in those two trees, I'm not quite sure I
understand what the permission model for accessing the
notifications will be. Seems anyone, even within a container or
something, will see blockdev errors from everywhere?  The earlier
superblock support (I'm not sure I like that name btw, hard to
understand for us userspace folks) seems to have required exec
permission, but nothing else.

For it to be usable for integrity purposes delivery has to be reliable
and there needs to be a clear ordering, in the sense that reading
notifications needs to return all errors that occurred timewise before the
last pending notification has been read. Looks like that's largely the
case, although I'm a bit scared after seeing:

+void __post_watch_notification(struct watch_list *wlist,
+			       struct watch_notification *n,
+			       const struct cred *cred,
+			       u64 id)
+
+		if (security_post_notification(watch->cred, cred, n) < 0)
+			continue;
+

if an LSM module just decides to hide notifications that are relied upon for
integrity, we'll be in trouble.


> > I'm not sure it makes sense to
> > expose the errseq_t bits straight though - seems like it'd
> > enshrine them in userspace ABI too much?
> 
> Even a little is way too much. Userspace ABI needs to be completely
> independent of the kernel internal structures and implementation.
> This is basic "we suck at APIs 101" stuff...

Well, if it were just a counter of errors that gets stuck at some
well-defined max, it seems like it'd be ok. But I agree that it'd not be the
best possible interface, and that the notify API seems like it could
turn into that.

Greetings,

Andres Freund

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH v3 0/3] vfs: have syncfs() return error when there are writeback errors
  2020-02-11  0:04       ` Andres Freund
@ 2020-02-11  0:48         ` Dave Chinner
  2020-02-11  1:31           ` Andres Freund
  0 siblings, 1 reply; 15+ messages in thread
From: Dave Chinner @ 2020-02-11  0:48 UTC (permalink / raw)
  To: Andres Freund
  Cc: David Howells, Jeff Layton, viro, linux-fsdevel, linux-kernel,
	linux-api, willy, hch, jack, akpm

On Mon, Feb 10, 2020 at 04:04:05PM -0800, Andres Freund wrote:
> Hi,
> 
> (sorry if somebody got this twice)
> 
> David added you, because there's discussion about your notify work
> below.
> 
> On 2020-02-11 08:46:57 +1100, Dave Chinner wrote:
> > On Fri, Feb 07, 2020 at 01:20:12PM -0800, Andres Freund wrote:
> > > Hi,
> > > 
> > > On 2020-02-08 07:52:43 +1100, Dave Chinner wrote:
> > > > On Fri, Feb 07, 2020 at 12:04:20PM -0500, Jeff Layton wrote:
> > > > > You're probably wondering -- Where are v1 and v2 sets?
> > > 
> > > > > The basic idea is to track writeback errors at the superblock level,
> > > > > so that we can quickly and easily check whether something bad happened
> > > > > without having to fsync each file individually. syncfs is then changed
> > > > > to reliably report writeback errors, and a new ioctl is added to allow
> > > > > userland to get at the current errseq_t value w/o having to sync out
> > > > > anything.
> > > > 
> > > > So what, exactly, can userspace do with this error? It has no idea
> > > > at all what file the writeback failure occurred on, or even
> > > > what files syncfs() acted on, so there's no obvious error
> > > > recovery that it could perform on reception of such an error.
> > > 
> > > Depends on the application.  For e.g. postgres it'd be to reset
> > > in-memory contents and perform WAL replay from the last checkpoint.
> > 
> > What happens if a user runs 'sync -f /path/to/postgres/data' instead
> > of postgres? All the writeback errors are consumed at that point by
> > reporting them to the process that ran syncfs()...
> 
> We'd have to keep an fd open from *before* we start durable operations,
> which has a sampled errseq_t from before we rely on seeing errors.
> 
> 
> > > Due to various reasons* it's very hard for us (without major performance
> > > and/or reliability impact) to fully guarantee that by the time we fsync
> > > specific files we do so on an old enough fd to guarantee that we'd see
> > > an error triggered by background writeback.  But keeping track of
> > > all potential filesystems data resides on (with one fd open permanently
> > > for each) and then syncfs()ing them at checkpoint time is quite doable.
> > 
> > Oh, you have to keep an fd permanently open to every superblock that
> > the application holds data on so that errors detected by other users of
> > that filesystem are also reported to the application?
> 
> Right
> 
> Currently it's much worse (as you probably know):

*nod*

> > FWIW, explicit userspace error notifications for data loss events is
> > one of the features that David Howells' generic filesystem
> > notification mechanism is intended to provide.  Hence I'm not sure
> > that there's a huge amount of value in providing a partial solution
> > that only certain applications can use when there's a fully generic
> > mechanism for error notification just around the corner.
> 
> Interesting. I largely missed that work, unfortunately. It's hard to
> keep up with all kernel things, while also maintaining / developing an
> RDBMS :/

It's hard enough trying to keep up with everything as a filesystem
developer... :/

> I assume the last state that includes the superblock layer stuff is at
> https://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs.git/log/?h=notifications
> whereas there's a newer
> https://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs.git/log/?h=notifications-core
> not including that. There do seem to be some significant changes between
> the two.

And they are out of date, anyway, because they are still based on
an mmap'd ring buffer rather than the pipe infrastructure. See this
branch for the latest:

https://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs.git/log/?h=notifications-pipe-core

It doesn't have all the block/superblock/usb notification
infrastructure in it, just the keyring implementation. No point
reimplementing everything while the core notification mechanism is
still changing...

> 
> As far as I can tell the superblock based stuff does *not* actually
> report any errors yet (contrast to READONLY, EDQUOT). Is the plan here
> to include writeback errors as well? Or just filesystem metadata/journal
> IO?

Right, that part hasn't been implemented yet, though it's repeatedly
mentioned as intended to be supported functionality. It will depend
on the filesystem as to what it is going to report, but I would expect
that it will initially be focussed on reporting user data errors
(e.g. writeback errors, block device gone bad data loss reports,
etc). It may not be possible to do anything sane with
metadata/journal IO errors as they typically cause the filesystem to
shutdown.

Of course, a filesystem shutdown is likely to result in a thundering
herd of userspace IO error notifications (think hundreds of GB of
dirty page cache getting EIO errors). Hence individual filesystems
will have to put some thought into how critical filesystem error
notifications are handled.

That said, we likely want userspace notification of metadata IO
errors for our own purposes. e.g. so we can trigger the online
filesystem repair code to start trying to fix whatever went wrong. I
doubt there's much userspace can do with things like "bad freespace
btree block" notifications, whilst the filesystem's online repair
tool can trigger a free space scan and rebuild/repair it without
userspace applications even being aware that we just detected and
corrected a critical metadata corruption....

> I don't think that block layer notifications would be sufficient for an
> individual userspace application's data integrity purposes? For one,
> it'd need to map devices to relevant filesystems afaict. And there's
> also errors above the block layer.

Block device errors are separate notifications from the superblock
notifications. If you want the notification of raw block device
errors, then that's what you listen for. If you want the filesystem
to actually tell you what file and offset that EIO was generated
for, then you'd get that through the superblock notifier, not the
block device notifier...

> Based on skimming the commits in those two trees, I'm not quite sure I
> understand what the permission model for accessing the
> notifications will be. Seems anyone, even within a container or
> something, will see blockdev errors from everywhere?  The earlier
> superblock support (I'm not sure I like that name btw, hard to
> understand for us userspace folks), seems to have required exec
> permission, but nothing else.

I'm not really familiar with those details - I've only got all the
"how it fits into the big picture" stuff in my head. Little
implementation details like that :) aren't that important to me -
all I need to know is how the infrastructure interacts with the
kernel filesystem code and whether it provides the functionality we
need to report filesystem errors directly to userspace...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH v3 0/3] vfs: have syncfs() return error when there are writeback errors
  2020-02-11  0:48         ` Dave Chinner
@ 2020-02-11  1:31           ` Andres Freund
  0 siblings, 0 replies; 15+ messages in thread
From: Andres Freund @ 2020-02-11  1:31 UTC (permalink / raw)
  To: Dave Chinner
  Cc: David Howells, Jeff Layton, viro, linux-fsdevel, linux-kernel,
	linux-api, willy, hch, jack, akpm

Hi,

Shortly after this I found a thread where Linus was explicitly asking
for potential userspace users of the feature, so I also responded there:
https://lore.kernel.org/linux-fsdevel/20200211005626.7yqjf5rbs3vbwagd@alap3.anarazel.de/

On 2020-02-11 11:48:30 +1100, Dave Chinner wrote:
> On Mon, Feb 10, 2020 at 04:04:05PM -0800, Andres Freund wrote:
> > On 2020-02-11 08:46:57 +1100, Dave Chinner wrote:
> > As far as I can tell the superblock based stuff does *not* actually
> > report any errors yet (contrast to READONLY, EDQUOT). Is the plan here
> > to include writeback errors as well? Or just filesystem metadata/journal
> > IO?
> 
> Right, that part hasn't been implemented yet, though it's repeatedly
> mentioned as intended to be supported functionality. It will depend
> on the filesystem as to what it is going to report

There really ought to be some clear guidelines on what is expected to
be reported though. Otherwise we'll just end up with a hodgepodge of
different semantics, which'd be, ummm, not good.


> but I would expect that it will initially be focussed on reporting
> user data errors (e.g. writeback errors, block device gone bad data
> loss reports, etc). It may not be possible to do anything sane with
> metadata/journal IO errors as they typically cause the filesystem to
> shut down.

I was mostly referencing the metadata/journal errors because those are
what a number of filesystems seem to treat as errors (cf errors=remount-ro
etc), and I just wanted to be sure that more than just those get
reported up...

I think the patch already had support for getting a separate type of
notification for SBs remounted ro; it shouldn't be too hard to change
that so it'd report error-triggered shutdowns / remount-ro as a
different category.


> Of course, a filesystem shutdown is likely to result in a thundering
> herd of userspace IO error notifications (think hundreds of GB of
> dirty page cache getting EIO errors). Hence individual filesystems
> will have to put some thought into how critical filesystem error
> notifications are handled.

Probably would make sense to stop reporting them individually once the
whole FS is shut down/remounted due to errors, and a notification about
that fact has been sent.
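
i.e., conceptually something like the fragment below (the flag and the
helper are hypothetical, purely to illustrate the suppression rule):

	/* on a writeback error, before emitting a notification */
	if (sb->s_flags & SB_SHUTDOWN_NOTIFIED)	/* hypothetical flag */
		return;		/* one shutdown event already covers this */
	notify_sb_wb_error(sb, inode, err);	/* hypothetical helper */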


> That said, we likely want userspace notification of metadata IO
> errors for our own purposes. e.g. so we can trigger the online
> filesystem repair code to start trying to fix whatever went wrong. I
> doubt there's much userspace can do with things like "bad freespace
> btree block" notifications, whilst the filesystem's online repair
> tool can trigger a free space scan and rebuild/repair it without
> userspace applications even being aware that we just detected and
> corrected a critical metadata corruption....

Neat.


> > I don't think that block layer notifications would be sufficient for an
> > individual userspace application's data integrity purposes? For one,
> > it'd need to map devices to relevant filesystems afaict. And there's
> > also errors above the block layer.
> 
> Block device errors are separate notifications from the superblock
> notifications. If you want the notification of raw block device
> errors, then that's what you listen for. If you want the filesystem
> to actually tell you what file and offset that EIO was generated
> for, then you'd get that through the superblock notifier, not the
> block device notifier...

Not something we urgently need, but it might come in handy at a later
point.

Thanks,

Andres

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH v3 0/3] vfs: have syncfs() return error when there are writeback errors
  2020-02-10 21:46     ` Dave Chinner
  2020-02-10 23:59       ` Andres Freund, David Howells
  2020-02-11  0:04       ` Andres Freund
@ 2020-02-11 12:57       ` Jeff Layton
  2 siblings, 0 replies; 15+ messages in thread
From: Jeff Layton @ 2020-02-11 12:57 UTC (permalink / raw)
  To: Dave Chinner, Andres Freund
  Cc: viro, linux-fsdevel, linux-kernel, linux-api, willy, dhowells,
	hch, jack, akpm

On Tue, 2020-02-11 at 08:46 +1100, Dave Chinner wrote:
> On Fri, Feb 07, 2020 at 01:20:12PM -0800, Andres Freund wrote:
> > Hi,
> > 
> > On 2020-02-08 07:52:43 +1100, Dave Chinner wrote:
> > > On Fri, Feb 07, 2020 at 12:04:20PM -0500, Jeff Layton wrote:
> > > > You're probably wondering -- Where are v1 and v2 sets?
> > > > The basic idea is to track writeback errors at the superblock level,
> > > > so that we can quickly and easily check whether something bad happened
> > > > without having to fsync each file individually. syncfs is then changed
> > > > to reliably report writeback errors, and a new ioctl is added to allow
> > > > userland to get at the current errseq_t value w/o having to sync out
> > > > anything.
> > > 
> > > So what, exactly, can userspace do with this error? It has no idea
> > > at all what file the writeback failure occurred on or even
> > > what files syncfs() even acted on so there's no obvious error
> > > recovery that it could perform on reception of such an error.
> > 
> > Depends on the application.  For e.g. postgres it'd be to reset
> > in-memory contents and perform WAL replay from the last checkpoint.
> 
> What happens if a user runs 'sync -f /path/to/postgres/data' instead
> of postgres? All the writeback errors are consumed at that point by
> reporting them to the process that ran syncfs()...
> 

Well, no. If you keep an fd open, then you can be sure that you'll see
any errors that occurred since that open at syncfs time, regardless of
who else is issuing syncfs() calls. That's basically how errseq_t works.

I guess I figured that part was obvious and didn't point it out here --
mea culpa.
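
In rough terms, errseq_t behaves like the simplified model below. This
is an illustration only -- the real code lives in lib/errseq.c and has
extra subtleties (e.g. around errors that nobody has sampled yet) --
but it shows why each opener gets its own view of "new" errors:

typedef unsigned int errseq_t;

#define ERRSEQ_ERRMASK	0x00000fffU	/* assumed layout, for illustration */
#define ERRSEQ_CTR_STEP	0x00002000U

/* on writeback failure: record the errno and bump the counter so that
 * every previously sampled value goes stale */
static void errseq_set_model(errseq_t *eseq, int err)
{
	*eseq = ((*eseq + ERRSEQ_CTR_STEP) & ~ERRSEQ_ERRMASK)
		| ((unsigned int)-err & ERRSEQ_ERRMASK);
}

/* at open(): remember where this opener starts from */
static errseq_t errseq_sample_model(const errseq_t *eseq)
{
	return *eseq;
}

/* at fsync()/syncfs(): anything new since our sample? */
static int errseq_check_model(const errseq_t *eseq, errseq_t since)
{
	errseq_t cur = *eseq;

	if (cur == since)
		return 0;			/* nothing new for us */
	return -(int)(cur & ERRSEQ_ERRMASK);	/* e.g. -EIO */
}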

> > Due
> > to various reasons* it's very hard for us (without major performance
> > and/or reliability impact) to fully guarantee that by the time we fsync
> > specific files we do so on an old enough fd to guarantee that we'd see
> > an error triggered by background writeback.  But keeping track of
> > all potential filesystems data resides on (with one fd open permanently
> > for each) and then syncfs()ing them at checkpoint time is quite doable.
> 
> Oh, you have to keep an fd permanently open to every superblock that
> application holds data on so that errors detected by other users of
> that filesystem are also reported to the application?
> 
> This seems like a fairly important requirement for applications to
> ensure this error reporting is "reliable" and that certainly wasn't
> apparent from the patches or their description.  i.e. the API has an
> explicit userspace application behaviour requirement for reliable
> functioning, and that was not documented.  "we suck at APIs" and all
> that..
> 
> It also seems to me as useful only to applications that have a
> "rollback and replay" error recovery mechanism. If the application
> doesn't have the ability to go back in time to before the
> "unfindable" writeback error occurred, then this error is largely
> useless to those applications because they can't do anything with
> it, and so....
> 

Just knowing that an error occurred is still a better situation than
letting the application obliviously chug along.

In a sense I see the above argument as circular. Our error reporting
mechanisms have historically sucked, and applications have been written
accordingly. Unless we improve how errors are reported, applications
will never improve.

It is true that we can't reasonably report which inodes failed writeback
with this interface, but syncfs only returns int. That's really the best
we can do with it.

> > > > - This adds a new generic fs ioctl to allow userland to scrape
> > > > the current superblock's errseq_t value. It may be best to
> > > > present this to userland via fsinfo() instead (once that's
> > > > merged). I'm fine with dropping the last patch for now and
> > > > reworking it for fsinfo if so.
> > > 
> > > What, exactly, is this useful for? Why would we consider
> > > exposing an internal implementation detail to userspace like
> > > this?
> > 
> > There is, as far as I can tell, so far no way other than scraping the
> > kernel log to figure out if there have been data loss errors on a
> > machine/fs.
> 
> .... most applications will still require users to scrape their
> logs to find out what error actually occurred. IOWs, we haven't
> really changed the status quo with this new mechanism.
> 
> FWIW, explicit userspace error notifications for data loss events is
> one of the features that David Howell's generic filesystem
> notification mechanism is intended to provide.  Hence I'm not sure
> that there's a huge amount of value in providing a partial solution
> that only certain applications can use when there's a fully generic
> mechanism for error notification just around the corner.
> 

David's notification work is great, but it's quite a bit more
heavyweight than just allowing syncfs to return better errors. The
application would need to register to be notified and watch a pipe. Not
all applications are going to want or need to do that.

> > Even besides app specific reactions like outlined above,
> > just generally being able to alert whenever the error count
> > increases seems extremely useful.
> 
> Yup, a generic filesystem notification mechanism is perfect for
> that, plus it can provide more explicit details of where the error
> actually occurred rather than just a handwavy "some error occurred
> somewhere" report....
> 
> > I'm not sure it makes sense to
> > expose the errseq_t bits straight though - seems like it'd
> > enshrine them in userspace ABI too much?
> 
> Even a little is way too much. Userspace ABI needs to be completely
> independent of the kernel internal structures and implementation.
> This is basic "we suck at APIs 101" stuff...
> 

Yeah, I've already self-NAK'ed that patch.

If we are going to expose that info, we'll probably do it via fsinfo(),
and put it in a struct with two fields: last reported error code, and an
opaque token that you can use to see whether new errors have been
recorded since you last checked. I think that should be enough to ensure
that we don't tie this too closely to the internal kernel implementation.
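
Purely as a sketch, such an fsinfo() attribute might look like this
(the name and layout are hypothetical; nothing like this is merged):

#include <linux/types.h>

struct fsinfo_wb_error {
	__s32	last_error;	/* most recent writeback errno, 0 if none */
	__u32	token;		/* opaque cookie; changes whenever a new
				 * error is recorded -- userspace compares
				 * it against a previously saved copy */
};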

-- 
Jeff Layton <jlayton@kernel.org>


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH v3 0/3] vfs: have syncfs() return error when there are writeback errors
  2020-02-07 17:04 [PATCH v3 0/3] vfs: have syncfs() return error when there are writeback errors Jeff Layton
                   ` (3 preceding siblings ...)
  2020-02-07 20:52 ` [PATCH v3 0/3] vfs: have syncfs() return error when there are writeback errors Dave Chinner
@ 2020-02-12 12:21 ` Jeff Layton
  4 siblings, 0 replies; 15+ messages in thread
From: Jeff Layton @ 2020-02-12 12:21 UTC (permalink / raw)
  To: viro
  Cc: linux-fsdevel, linux-kernel, linux-api, andres, willy, dhowells,
	hch, jack, akpm

On Fri, 2020-02-07 at 12:04 -0500, Jeff Layton wrote:
> You're probably wondering -- Where are v1 and v2 sets?
> 
> I did the first couple of versions of this set back in 2018, and then
> got dragged off to work on other things. I'd like to resurrect this set
> though, as I think it's valuable overall, and I have need of it for some
> other work I'm doing.
> 
> Currently, syncfs does not return errors when one of the inodes fails to
> be written back. It will return errors based on the legacy AS_EIO and
> AS_ENOSPC flags when syncing out the block device fails, but that's not
> particularly helpful for filesystems that aren't backed by a blockdev.
> It's also possible for a stray sync to lose those errors.
> 
> The basic idea is to track writeback errors at the superblock level,
> so that we can quickly and easily check whether something bad happened
> without having to fsync each file individually. syncfs is then changed
> to reliably report writeback errors, and a new ioctl is added to allow
> userland to get at the current errseq_t value w/o having to sync out
> anything.
> 
> I do have a xfstest for this. I do not yet have manpage patches, but
> I'm happy to roll some once there is consensus on the interface.
> 
> Caveats:
> 
> - Having different behavior for an O_PATH descriptor in syncfs is
>   a bit odd, but it means that we don't have to grow struct file. Is
>   that acceptable from an API standpoint?
> 

There are a couple of other options besides requiring an O_PATH fd here:

1) we could just add a new errseq_t field to struct file for this. On my
machine (x86_64) there are 4 bytes of padding at the end of struct file.
An errseq_t would slot in there without changing the slab object size.
YMMV on other arches of course.

2) we could add a new fcntl command value (F_SYNCFS or something?) that
would flip the fd to being suitable for syncfs. If you tried to use the
fd to do an fsync at that point, we could return an error (a rough
sketch follows below).
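
For illustration, userspace usage of option 2 might look like the
fragment below. F_SYNCFS is hypothetical -- it's only a suggestion in
this mail -- so this cannot compile against any real kernel:

	int fd = open("/some/mountpoint", O_RDONLY);

	/* flip the fd to syncfs-only mode (hypothetical command) */
	if (fcntl(fd, F_SYNCFS, 1) == -1)
		err(1, "fcntl(F_SYNCFS)");

	/* fsync(fd) would now fail, while syncfs(fd) reports any
	 * per-sb writeback errors recorded since the open */
	if (syncfs(fd) == -1)
		warn("syncfs");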

Anyone else have other thoughts on how best to do this?

> - This adds a new generic fs ioctl to allow userland to scrape the
>   current superblock's errseq_t value. It may be best to present this
>   to userland via fsinfo() instead (once that's merged). I'm fine with
>   dropping the last patch for now and reworking it for fsinfo if so.
> 

To be clear, as I stated in earlier replies, I think we can drop the
ioctl. If we did want something like this, I think we'd want to expose
it via fsinfo() instead, and that could be done after the syncfs changes
went in.

> Jeff Layton (3):
>   vfs: track per-sb writeback errors and report them to syncfs
>   buffer: record blockdev write errors in super_block that it backs
>   vfs: add a new ioctl for fetching the superblock's errseq_t
> 
>  fs/buffer.c             |  2 ++
>  fs/ioctl.c              |  4 ++++
>  fs/open.c               |  6 +++---
>  fs/sync.c               |  9 ++++++++-
>  include/linux/errseq.h  |  1 +
>  include/linux/fs.h      |  3 +++
>  include/linux/pagemap.h |  5 ++++-
>  include/uapi/linux/fs.h |  1 +
>  lib/errseq.c            | 33 +++++++++++++++++++++++++++++++--
>  9 files changed, 57 insertions(+), 7 deletions(-)
> 

-- 
Jeff Layton <jlayton@kernel.org>


^ permalink raw reply	[flat|nested] 15+ messages in thread

end of thread, other threads:[~2020-02-12 12:21 UTC | newest]

Thread overview: 15+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-02-07 17:04 [PATCH v3 0/3] vfs: have syncfs() return error when there are writeback errors Jeff Layton
2020-02-07 17:04 ` [PATCH v3 1/3] vfs: track per-sb writeback errors and report them to syncfs Jeff Layton
2020-02-07 17:04 ` [PATCH v3 2/3] buffer: record blockdev write errors in super_block that it backs Jeff Layton
2020-02-07 17:04 ` [PATCH v3 3/3] vfs: add a new ioctl for fetching the superblock's errseq_t Jeff Layton
2020-02-07 20:52 ` [PATCH v3 0/3] vfs: have syncfs() return error when there are writeback errors Dave Chinner
2020-02-07 21:20   ` Andres Freund
2020-02-07 22:05     ` Jeff Layton
2020-02-07 22:21       ` Andres Freund
2020-02-10 21:46     ` Dave Chinner
2020-02-10 23:59       ` Andres Freund, David Howells
2020-02-11  0:04       ` Andres Freund
2020-02-11  0:48         ` Dave Chinner
2020-02-11  1:31           ` Andres Freund
2020-02-11 12:57       ` Jeff Layton
2020-02-12 12:21 ` Jeff Layton

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).