[PATCH v2 0/3] Check errors on sync for volatile overlayfs mounts

linux-unionfs.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* [PATCH v2 0/3] Check errors on sync for volatile overlayfs mounts
@ 2020-12-11 23:49 Sargun Dhillon
  2020-12-11 23:50 ` [PATCH v2 1/3] errseq: Add errseq_counter to allow for all errors to be observed Sargun Dhillon
                   ` (3 more replies)
  0 siblings, 4 replies; 9+ messages in thread
From: Sargun Dhillon @ 2020-12-11 23:49 UTC (permalink / raw)
  To: Amir Goldstein, Miklos Szeredi
  Cc: Sargun Dhillon, Vivek Goyal, overlayfs,
	Linux FS-devel Mailing List, Matthew Wilcox

The semantics of errseq and syncfs are such that it is impossible to track
if any errors have occurred between the time the first error occurred, and
the user checks for the error (calls syncfs, and subsequently
errseq_check_and_advance.

Overlayfs has a volatile feature which short-circuits syncfs. This, in turn
makes it so that the user can have silent data corruption and not know
about it. The third patch in the series introduces behaviour that makes it
so that we can track errors, and bubble up whether the user has put
themselves in bad situation.

This required some gymanstics in errseq, and adding a wrapper around it
called "errseq_counter" (errseq + counter). The data structure uses an
atomic to track overflow errors. This approach, rather than moving to an
atomic64 / u64 is so we can avoid bloating every person that subscribes to
an errseq, and only add the subscriber behaviour to those who care (at the
expense of space.

The datastructure is write-optimized, and rightfully so, as the users
of the counter feature are just overlayfs, and it's called in fsync
checking, which is a rather seldom operation, and not really on
any hotpaths.

[1]: https://lore.kernel.org/linux-fsdevel/20201202092720.41522-1-sargun@sargun.me/

Sargun Dhillon (3):
  errseq: Add errseq_counter to allow for all errors to be observed
  errseq: Add mechanism to snapshot errseq_counter and check snapshot
  overlay: Implement volatile-specific fsync error behaviour

 Documentation/filesystems/overlayfs.rst |   8 ++
 fs/buffer.c                             |   2 +-
 fs/overlayfs/file.c                     |   5 +-
 fs/overlayfs/overlayfs.h                |   1 +
 fs/overlayfs/ovl_entry.h                |   3 +
 fs/overlayfs/readdir.c                  |   5 +-
 fs/overlayfs/super.c                    |  26 +++--
 fs/overlayfs/util.c                     |  28 +++++
 fs/super.c                              |   1 +
 fs/sync.c                               |   3 +-
 include/linux/errseq.h                  |  18 ++++
 include/linux/fs.h                      |   6 +-
 include/linux/pagemap.h                 |   2 +-
 lib/errseq.c                            | 129 ++++++++++++++++++++----
 14 files changed, 202 insertions(+), 35 deletions(-)

-- 
2.25.1

^ permalink raw reply	[flat|nested] 9+ messages in thread

* [PATCH v2 1/3] errseq: Add errseq_counter to allow for all errors to be observed
  2020-12-11 23:49 [PATCH v2 0/3] Check errors on sync for volatile overlayfs mounts Sargun Dhillon
@ 2020-12-11 23:50 ` Sargun Dhillon
  2020-12-11 23:50 ` [PATCH v2 2/3] errseq: Add mechanism to snapshot errseq_counter and check snapshot Sargun Dhillon
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 9+ messages in thread
From: Sargun Dhillon @ 2020-12-11 23:50 UTC (permalink / raw)
  To: Amir Goldstein, Miklos Szeredi
  Cc: Sargun Dhillon, Vivek Goyal, overlayfs,
	Linux FS-devel Mailing List, Matthew Wilcox

One of the challenges presented with the current errseq method is that it
is impossible for an observer to determine whether or not an error has
occurred since the last error if the last error has not been marked as
"seen". The reason is because the counter is not incremented on every
error, only on each seen -> unseen transition.

One of the reasons for this is that adding an additional 32-bits to each
errseq_t is suboptimal, and having to pay for the cost of doing a cmpxchg
on every error is not ideal either. Instead this introduces a new structure
- errseq_counter, which is meant to be on superblocks, or other objects
that have far fewer instances. A subscriber can then use the combination of
a counter and errseq to determine if an error has occurred.

Signed-off-by: Sargun Dhillon <sargun@sargun.me>
---
 fs/buffer.c             |  2 +-
 fs/super.c              |  1 +
 fs/sync.c               |  3 +-
 include/linux/errseq.h  | 14 ++++++++
 include/linux/fs.h      |  6 ++--
 include/linux/pagemap.h |  2 +-
 lib/errseq.c            | 78 +++++++++++++++++++++++++++++++----------
 7 files changed, 82 insertions(+), 24 deletions(-)

diff --git a/fs/buffer.c b/fs/buffer.c
index 23f645657488..dc787834b7a9 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -1156,7 +1156,7 @@ void mark_buffer_write_io_error(struct buffer_head *bh)
 	rcu_read_lock();
 	sb = READ_ONCE(bh->b_bdev->bd_super);
 	if (sb)
-		errseq_set(&sb->s_wb_err, -EIO);
+		errseq_counter_set(&sb->s_wb_err, -EIO);
 	rcu_read_unlock();
 }
 EXPORT_SYMBOL(mark_buffer_write_io_error);
diff --git a/fs/super.c b/fs/super.c
index 98bb0629ee10..542c61929c37 100644
--- a/fs/super.c
+++ b/fs/super.c
@@ -211,6 +211,7 @@ static struct super_block *alloc_super(struct file_system_type *type, int flags,
 	s->s_user_ns = get_user_ns(user_ns);
 	init_rwsem(&s->s_umount);
 	lockdep_set_class(&s->s_umount, &type->s_umount_key);
+	INIT_ERRSEQ_COUNTER(&s->s_wb_err);
 	/*
 	 * sget() can have s_umount recursion.
 	 *
diff --git a/fs/sync.c b/fs/sync.c
index 1373a610dc78..a3bd9ca5052a 100644
--- a/fs/sync.c
+++ b/fs/sync.c
@@ -172,7 +172,8 @@ SYSCALL_DEFINE1(syncfs, int, fd)
 	ret = sync_filesystem(sb);
 	up_read(&sb->s_umount);
 
-	ret2 = errseq_check_and_advance(&sb->s_wb_err, &f.file->f_sb_err);
+	ret2 = errseq_check_and_advance(&sb->s_wb_err.errseq,
+					&f.file->f_sb_err);
 
 	fdput(f);
 	return ret ? ret : ret2;
diff --git a/include/linux/errseq.h b/include/linux/errseq.h
index fc2777770768..35818c484290 100644
--- a/include/linux/errseq.h
+++ b/include/linux/errseq.h
@@ -5,8 +5,22 @@
 #ifndef _LINUX_ERRSEQ_H
 #define _LINUX_ERRSEQ_H
 
+#include <linux/atomic.h>
+
 typedef u32	errseq_t;
+struct errseq_counter {
+	errseq_t	errseq;
+	/* errors are only incremented if the errseq has not been seen */
+	atomic_t	errors;
+};
+
+static inline void INIT_ERRSEQ_COUNTER(struct errseq_counter *counter)
+{
+	counter->errseq = 0;
+	atomic_set(&counter->errors, 0);
+}
 
+errseq_t errseq_counter_set(struct errseq_counter *counter, int err);
 errseq_t errseq_set(errseq_t *eseq, int err);
 errseq_t errseq_sample(errseq_t *eseq);
 int errseq_check(errseq_t *eseq, errseq_t since);
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 8667d0cdc71e..127860fb938a 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1510,8 +1510,8 @@ struct super_block {
 	/* Being remounted read-only */
 	int s_readonly_remount;
 
-	/* per-sb errseq_t for reporting writeback errors via syncfs */
-	errseq_t s_wb_err;
+	/* per-sb errseq_counter for reporting writeback errors via syncfs */
+	struct errseq_counter s_wb_err;
 
 	/* AIO completions deferred from interrupt context */
 	struct workqueue_struct *s_dio_done_wq;
@@ -2718,7 +2718,7 @@ static inline errseq_t filemap_sample_wb_err(struct address_space *mapping)
  */
 static inline errseq_t file_sample_sb_err(struct file *file)
 {
-	return errseq_sample(&file->f_path.dentry->d_sb->s_wb_err);
+	return errseq_sample(&file->f_path.dentry->d_sb->s_wb_err.errseq);
 }
 
 extern int vfs_fsync_range(struct file *file, loff_t start, loff_t end,
diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index d5570deff400..1779eee01e34 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -56,7 +56,7 @@ static inline void mapping_set_error(struct address_space *mapping, int error)
 
 	/* Record it in superblock */
 	if (mapping->host)
-		errseq_set(&mapping->host->i_sb->s_wb_err, error);
+		errseq_set(&mapping->host->i_sb->s_wb_err.errseq, error);
 
 	/* Record it in flags for now, for legacy callers */
 	if (error == -ENOSPC)
diff --git a/lib/errseq.c b/lib/errseq.c
index 81f9e33aa7e7..d555e7fc18d2 100644
--- a/lib/errseq.c
+++ b/lib/errseq.c
@@ -41,23 +41,9 @@
 /* The lowest bit of the counter */
 #define ERRSEQ_CTR_INC		(1 << (ERRSEQ_SHIFT + 1))
 
-/**
- * errseq_set - set a errseq_t for later reporting
- * @eseq: errseq_t field that should be set
- * @err: error to set (must be between -1 and -MAX_ERRNO)
- *
- * This function sets the error in @eseq, and increments the sequence counter
- * if the last sequence was sampled at some point in the past.
- *
- * Any error set will always overwrite an existing error.
- *
- * Return: The previous value, primarily for debugging purposes. The
- * return value should not be used as a previously sampled value in later
- * calls as it will not have the SEEN flag set.
- */
-errseq_t errseq_set(errseq_t *eseq, int err)
+static errseq_t __errseq_set(errseq_t old, errseq_t *eseq, int err)
 {
-	errseq_t cur, old;
+	errseq_t cur;
 
 	/* MAX_ERRNO must be able to serve as a mask */
 	BUILD_BUG_ON_NOT_POWER_OF_2(MAX_ERRNO + 1);
@@ -68,8 +54,6 @@ errseq_t errseq_set(errseq_t *eseq, int err)
 	 * also don't accept zero here as that would effectively clear a
 	 * previous error.
 	 */
-	old = READ_ONCE(*eseq);
-
 	if (WARN(unlikely(err == 0 || (unsigned int)-err > MAX_ERRNO),
 				"err = %d\n", err))
 		return old;
@@ -105,8 +89,66 @@ errseq_t errseq_set(errseq_t *eseq, int err)
 	}
 	return cur;
 }
+
+/**
+ * errseq_set - set a errseq_t for later reporting
+ * @eseq: errseq_t field that should be set
+ * @err: error to set (must be between -1 and -MAX_ERRNO)
+ *
+ * This function sets the error in @eseq, and increments the sequence counter
+ * if the last sequence was sampled at some point in the past.
+ *
+ * Any error set will always overwrite an existing error.
+ *
+ * Return: The previous value, primarily for debugging purposes. The
+ * return value should not be used as a previously sampled value in later
+ * calls as it will not have the SEEN flag set.
+ */
+errseq_t errseq_set(errseq_t *eseq, int err)
+{
+	errseq_t old = READ_ONCE(*eseq);
+
+	return __errseq_set(old, eseq, err);
+}
 EXPORT_SYMBOL(errseq_set);
 
+/**
+ * errseq_counter_set - set an errseq_counter for later reporting
+ * @counter: errseq_counter that should be set
+ * @err: error to set (must be between -1 and -MAX_ERRNO)
+ *
+ * This function follows same semantics as errseq_set(), but if the error
+ * has not been seen, it will increment the errors, allowing a subscriber to
+ * determine is any errors have occurred since they sampled the errseq_counter.
+ *
+ * Return: The same value as errseq_set
+ */
+errseq_t errseq_counter_set(struct errseq_counter *counter, int err)
+{
+	errseq_t new, old;
+
+	old = READ_ONCE(counter->errseq);
+	new = __errseq_set(old, &counter->errseq, err);
+
+	if (new == old) {
+		/*
+		 * When the observer checks if counters is incremented, then
+		 * will then retrieve the error from the errseq itself,
+		 * therefore we have to ensure that the error in errseq is
+		 * visible before errors is incremented.
+		 *
+		 * If errors and errseq ar both read, a read barrier is required
+		 * to preserve this ordering.
+		 */
+		smp_mb__before_atomic();
+		atomic_inc(&counter->errors);
+	}
+
+	return new;
+}
+EXPORT_SYMBOL(errseq_counter_set);
+
+
 /**
  * errseq_sample() - Grab current errseq_t value.
  * @eseq: Pointer to errseq_t to be sampled.
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 9+ messages in thread

* [PATCH v2 2/3] errseq: Add mechanism to snapshot errseq_counter and check snapshot
  2020-12-11 23:49 [PATCH v2 0/3] Check errors on sync for volatile overlayfs mounts Sargun Dhillon
  2020-12-11 23:50 ` [PATCH v2 1/3] errseq: Add errseq_counter to allow for all errors to be observed Sargun Dhillon
@ 2020-12-11 23:50 ` Sargun Dhillon
  2020-12-12  9:57   ` Amir Goldstein
  2020-12-11 23:50 ` [PATCH v2 3/3] overlay: Implement volatile-specific fsync error behaviour Sargun Dhillon
  2020-12-12 11:21 ` [PATCH v2 0/3] Check errors on sync for volatile overlayfs mounts Jeff Layton
  3 siblings, 1 reply; 9+ messages in thread
From: Sargun Dhillon @ 2020-12-11 23:50 UTC (permalink / raw)
  To: Amir Goldstein, Miklos Szeredi
  Cc: Sargun Dhillon, Vivek Goyal, overlayfs,
	Linux FS-devel Mailing List, Matthew Wilcox

This adds the function errseq_counter_sample to allow for "subscribers"
to take point-in-time snapshots of the errseq_counter, and store the
counter + errseq_t.

Signed-off-by: Sargun Dhillon <sargun@sargun.me>
---
 include/linux/errseq.h |  4 ++++
 lib/errseq.c           | 51 ++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 55 insertions(+)

diff --git a/include/linux/errseq.h b/include/linux/errseq.h
index 35818c484290..8998df499a3b 100644
--- a/include/linux/errseq.h
+++ b/include/linux/errseq.h
@@ -25,4 +25,8 @@ errseq_t errseq_set(errseq_t *eseq, int err);
 errseq_t errseq_sample(errseq_t *eseq);
 int errseq_check(errseq_t *eseq, errseq_t since);
 int errseq_check_and_advance(errseq_t *eseq, errseq_t *since);
+void errseq_counter_sample(errseq_t *dst_errseq, int *dst_errors,
+			   struct errseq_counter *counter);
+int errseq_counter_check(struct errseq_counter *counter, errseq_t errseq_since,
+			 int errors_since);
 #endif
diff --git a/lib/errseq.c b/lib/errseq.c
index d555e7fc18d2..98fcfafa3d97 100644
--- a/lib/errseq.c
+++ b/lib/errseq.c
@@ -246,3 +246,54 @@ int errseq_check_and_advance(errseq_t *eseq, errseq_t *since)
 	return err;
 }
 EXPORT_SYMBOL(errseq_check_and_advance);
+
+/**
+ * errseq_counter_sample() - Grab the current errseq_counter value
+ * @dst_errseq: The errseq_t to copy to
+ * @dst_errors: The destination overflow to copy to
+ * @counter: The errseq_counter to copy from
+ *
+ * Grabs a point in time sample of the errseq_counter for latter comparison
+ */
+void errseq_counter_sample(errseq_t *dst_errseq, int *dst_errors,
+			   struct errseq_counter *counter)
+{
+	errseq_t cur;
+
+	do {
+		cur = READ_ONCE(counter->errseq);
+		*dst_errors = atomic_read(&counter->errors);
+	} while (cur != READ_ONCE(counter->errseq));
+
+	/* Clear the seen bit to make checking later easier */
+	*dst_errseq = cur & ~ERRSEQ_SEEN;
+}
+EXPORT_SYMBOL(errseq_counter_sample);
+
+/**
+ * errseq_counter_check() - Has an error occurred since the sample
+ * @counter: The errseq_counter from which to check.
+ * @errseq_since: The errseq_t sampled with errseq_counter_sample to check
+ * @errors_since: The errors sampled with errseq_counter_sample to check
+ *
+ * Returns: The latest error set in the errseq_t or 0 if there have been none.
+ */
+int errseq_counter_check(struct errseq_counter *counter, errseq_t errseq_since,
+			 int errors_since)
+{
+	errseq_t cur_errseq;
+	int cur_errors;
+
+	cur_errors = atomic_read(&counter->errors);
+	/* To match the barrier in errseq_counter_set */
+	smp_rmb();
+
+	/* Clear / ignore the seen bit as we do at sample time */
+	cur_errseq = READ_ONCE(counter->errseq) & ~ERRSEQ_SEEN;
+
+	if (cur_errseq == errseq_since && errors_since == cur_errors)
+		return 0;
+
+	return -(cur_errseq & MAX_ERRNO);
+}
+EXPORT_SYMBOL(errseq_counter_check);
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 9+ messages in thread

* [PATCH v2 3/3] overlay: Implement volatile-specific fsync error behaviour
  2020-12-11 23:49 [PATCH v2 0/3] Check errors on sync for volatile overlayfs mounts Sargun Dhillon
  2020-12-11 23:50 ` [PATCH v2 1/3] errseq: Add errseq_counter to allow for all errors to be observed Sargun Dhillon
  2020-12-11 23:50 ` [PATCH v2 2/3] errseq: Add mechanism to snapshot errseq_counter and check snapshot Sargun Dhillon
@ 2020-12-11 23:50 ` Sargun Dhillon
  2020-12-12 11:21 ` [PATCH v2 0/3] Check errors on sync for volatile overlayfs mounts Jeff Layton
  3 siblings, 0 replies; 9+ messages in thread
From: Sargun Dhillon @ 2020-12-11 23:50 UTC (permalink / raw)
  To: Amir Goldstein, Miklos Szeredi
  Cc: Sargun Dhillon, Vivek Goyal, overlayfs,
	Linux FS-devel Mailing List, Matthew Wilcox, Jeff Layton

Overlayfs's volatile option allows the user to bypass all forced sync calls
to the upperdir filesystem. This comes at the cost of safety. We can never
ensure that the user's data is intact, but we can make a best effort to
expose whether or not the data is likely to be in a bad state.

We decided[1] that the best way to handle this in the time being is that if
an overlayfs's upperdir experiences an error after a volatile mount occurs,
that error will be returned on fsync, fdatasync, sync, and syncfs. This is
contradictory to the traditional behaviour of VFS which fails the call
once, and only raises an error if a subsequent fsync error has occurred,
and been raised by the filesystem.

One awkward aspect of the patch is that we have to manually set the
superblock's errseq_t after the sync_fs callback as opposed to just
returning an error from syncfs. This is because the call chain looks
something like this:

sys_syncfs ->
	sync_filesystem ->
		__sync_filesystem ->
			/* The return value is ignored here
			sb->s_op->sync_fs(sb)
			_sync_blockdev
		/* Where the VFS fetches the error to raise to userspace */
		errseq_check_and_advance

Because of this we call errseq_set every time the sync_fs callback occurs.

[1]: https://lore.kernel.org/linux-fsdevel/36d820394c3e7cd1faa1b28a8135136d5001dadd.camel@redhat.com/T/#u

Signed-off-by: Sargun Dhillon <sargun@sargun.me>
Suggested-by: Amir Goldstein <amir73il@gmail.com>
Cc: linux-fsdevel@vger.kernel.org
Cc: linux-unionfs@vger.kernel.org
Cc: Jeff Layton <jlayton@redhat.com>
Cc: Miklos Szeredi <miklos@szeredi.hu>
Cc: Amir Goldstein <amir73il@gmail.com>
Cc: Vivek Goyal <vgoyal@redhat.com>
---
 Documentation/filesystems/overlayfs.rst |  8 +++++++
 fs/overlayfs/file.c                     |  5 +++--
 fs/overlayfs/overlayfs.h                |  1 +
 fs/overlayfs/ovl_entry.h                |  3 +++
 fs/overlayfs/readdir.c                  |  5 +++--
 fs/overlayfs/super.c                    | 26 ++++++++++++++++-------
 fs/overlayfs/util.c                     | 28 +++++++++++++++++++++++++
 7 files changed, 65 insertions(+), 11 deletions(-)

diff --git a/Documentation/filesystems/overlayfs.rst b/Documentation/filesystems/overlayfs.rst
index 580ab9a0fe31..3af569cea6a7 100644
--- a/Documentation/filesystems/overlayfs.rst
+++ b/Documentation/filesystems/overlayfs.rst
@@ -575,6 +575,14 @@ without significant effort.
 The advantage of mounting with the "volatile" option is that all forms of
 sync calls to the upper filesystem are omitted.
 
+In order to avoid a giving a false sense of safety, the syncfs (and fsync)
+semantics of volatile mounts are slightly different than that of the rest of
+VFS.  If any error occurs on the upperdir's filesystem after a volatile mount
+takes place, all sync functions will return the last error observed on the
+upperdir filesystem.  Once this condition is reached, the filesystem will not
+recover, and every subsequent sync call will return an error, even if the
+upperdir has not experience a new error since the last sync call.
+
 When overlay is mounted with "volatile" option, the directory
 "$workdir/work/incompat/volatile" is created.  During next mount, overlay
 checks for this directory and refuses to mount if present. This is a strong
diff --git a/fs/overlayfs/file.c b/fs/overlayfs/file.c
index efccb7c1f9bc..d7bc3e94a106 100644
--- a/fs/overlayfs/file.c
+++ b/fs/overlayfs/file.c
@@ -445,8 +445,9 @@ static int ovl_fsync(struct file *file, loff_t start, loff_t end, int datasync)
 	const struct cred *old_cred;
 	int ret;
 
-	if (!ovl_should_sync(OVL_FS(file_inode(file)->i_sb)))
-		return 0;
+	ret = ovl_check_sync(OVL_FS(file_inode(file)->i_sb));
+	if (ret <= 0)
+		return ret;
 
 	ret = ovl_real_fdget_meta(file, &real, !datasync);
 	if (ret)
diff --git a/fs/overlayfs/overlayfs.h b/fs/overlayfs/overlayfs.h
index f8880aa2ba0e..af79c3a2392e 100644
--- a/fs/overlayfs/overlayfs.h
+++ b/fs/overlayfs/overlayfs.h
@@ -322,6 +322,7 @@ int ovl_check_metacopy_xattr(struct ovl_fs *ofs, struct dentry *dentry);
 bool ovl_is_metacopy_dentry(struct dentry *dentry);
 char *ovl_get_redirect_xattr(struct ovl_fs *ofs, struct dentry *dentry,
 			     int padding);
+int ovl_check_sync(struct ovl_fs *ofs);
 
 static inline bool ovl_is_impuredir(struct super_block *sb,
 				    struct dentry *dentry)
diff --git a/fs/overlayfs/ovl_entry.h b/fs/overlayfs/ovl_entry.h
index 1b5a2094df8e..355a0b66ba16 100644
--- a/fs/overlayfs/ovl_entry.h
+++ b/fs/overlayfs/ovl_entry.h
@@ -79,6 +79,9 @@ struct ovl_fs {
 	atomic_long_t last_ino;
 	/* Whiteout dentry cache */
 	struct dentry *whiteout;
+	/* snapshot of upperdir's errseq_counter for volatile mounts */
+	errseq_t upper_errseq;
+	int upper_errseq_counter;
 };
 
 static inline struct vfsmount *ovl_upper_mnt(struct ovl_fs *ofs)
diff --git a/fs/overlayfs/readdir.c b/fs/overlayfs/readdir.c
index 01620ebae1bd..f7f1a29e290f 100644
--- a/fs/overlayfs/readdir.c
+++ b/fs/overlayfs/readdir.c
@@ -909,8 +909,9 @@ static int ovl_dir_fsync(struct file *file, loff_t start, loff_t end,
 	struct file *realfile;
 	int err;
 
-	if (!ovl_should_sync(OVL_FS(file->f_path.dentry->d_sb)))
-		return 0;
+	err = ovl_check_sync(OVL_FS(file->f_path.dentry->d_sb));
+	if (err <= 0)
+		return err;
 
 	realfile = ovl_dir_real_file(file, true);
 	err = PTR_ERR_OR_ZERO(realfile);
diff --git a/fs/overlayfs/super.c b/fs/overlayfs/super.c
index 290983bcfbb3..ec6bed8f35c8 100644
--- a/fs/overlayfs/super.c
+++ b/fs/overlayfs/super.c
@@ -261,11 +261,18 @@ static int ovl_sync_fs(struct super_block *sb, int wait)
 	struct super_block *upper_sb;
 	int ret;
 
-	if (!ovl_upper_mnt(ofs))
-		return 0;
+	ret = ovl_check_sync(ofs);
+	/*
+	 * We have to always set the err, because the return value isn't
+	 * checked, and instead VFS looks at the writeback errseq after
+	 * this call.
+	 */
+	if (ret < 0)
+		errseq_counter_set(&sb->s_wb_err, ret);
+
+	if (!ret)
+		return ret;
 
-	if (!ovl_should_sync(ofs))
-		return 0;
 	/*
 	 * Not called for sync(2) call or an emergency sync (SB_I_SKIP_SYNC).
 	 * All the super blocks will be iterated, including upper_sb.
@@ -1927,6 +1934,8 @@ static int ovl_fill_super(struct super_block *sb, void *data, int silent)
 	sb->s_op = &ovl_super_operations;
 
 	if (ofs->config.upperdir) {
+		struct super_block *upper_mnt_sb;
+
 		if (!ofs->config.workdir) {
 			pr_err("missing 'workdir'\n");
 			goto out_err;
@@ -1943,9 +1952,12 @@ static int ovl_fill_super(struct super_block *sb, void *data, int silent)
 		if (!ofs->workdir)
 			sb->s_flags |= SB_RDONLY;
 
-		sb->s_stack_depth = ovl_upper_mnt(ofs)->mnt_sb->s_stack_depth;
-		sb->s_time_gran = ovl_upper_mnt(ofs)->mnt_sb->s_time_gran;
-
+		upper_mnt_sb = ovl_upper_mnt(ofs)->mnt_sb;
+		sb->s_stack_depth = upper_mnt_sb->s_stack_depth;
+		sb->s_time_gran = upper_mnt_sb->s_time_gran;
+		errseq_counter_sample(&ofs->upper_errseq,
+				      &ofs->upper_errseq_counter,
+				      &upper_mnt_sb->s_wb_err);
 	}
 	oe = ovl_get_lowerstack(sb, splitlower, numlower, ofs, layers);
 	err = PTR_ERR(oe);
diff --git a/fs/overlayfs/util.c b/fs/overlayfs/util.c
index 23f475627d07..1408d56748c0 100644
--- a/fs/overlayfs/util.c
+++ b/fs/overlayfs/util.c
@@ -950,3 +950,31 @@ char *ovl_get_redirect_xattr(struct ovl_fs *ofs, struct dentry *dentry,
 	kfree(buf);
 	return ERR_PTR(res);
 }
+
+/*
+ * ovl_check_sync provides sync checking, and safety for volatile mounts
+ *
+ * Returns 1 if sync required.
+ *
+ * Returns 0 if syncing can be skipped because mount is volatile, and no errors
+ * have occurred on the upperdir since the mount.
+ *
+ * Returns -errno if it is a volatile mount, and the error that occurred since
+ * the last mount. If the error code changes, it'll return the latest error
+ * code.
+ */
+
+int ovl_check_sync(struct ovl_fs *ofs)
+{
+	struct vfsmount *mnt;
+
+	if (ovl_should_sync(ofs))
+		return 1;
+
+	mnt = ovl_upper_mnt(ofs);
+	if (!mnt)
+		return 0;
+
+	return errseq_counter_check(&mnt->mnt_sb->s_wb_err, ofs->upper_errseq,
+				    ofs->upper_errseq_counter);
+}
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 9+ messages in thread

* Re: [PATCH v2 2/3] errseq: Add mechanism to snapshot errseq_counter and check snapshot
  2020-12-11 23:50 ` [PATCH v2 2/3] errseq: Add mechanism to snapshot errseq_counter and check snapshot Sargun Dhillon
@ 2020-12-12  9:57   ` Amir Goldstein
  2020-12-13 19:41     ` Sargun Dhillon
  0 siblings, 1 reply; 9+ messages in thread
From: Amir Goldstein @ 2020-12-12  9:57 UTC (permalink / raw)
  To: Sargun Dhillon
  Cc: Miklos Szeredi, Vivek Goyal, overlayfs,
	Linux FS-devel Mailing List, Matthew Wilcox, Jeff Layton

Forgot to CC Jeff?

On Sat, Dec 12, 2020 at 1:50 AM Sargun Dhillon <sargun@sargun.me> wrote:
>
> This adds the function errseq_counter_sample to allow for "subscribers"
> to take point-in-time snapshots of the errseq_counter, and store the
> counter + errseq_t.
>
> Signed-off-by: Sargun Dhillon <sargun@sargun.me>
> ---
>  include/linux/errseq.h |  4 ++++
>  lib/errseq.c           | 51 ++++++++++++++++++++++++++++++++++++++++++
>  2 files changed, 55 insertions(+)
>
> diff --git a/include/linux/errseq.h b/include/linux/errseq.h
> index 35818c484290..8998df499a3b 100644
> --- a/include/linux/errseq.h
> +++ b/include/linux/errseq.h
> @@ -25,4 +25,8 @@ errseq_t errseq_set(errseq_t *eseq, int err);
>  errseq_t errseq_sample(errseq_t *eseq);
>  int errseq_check(errseq_t *eseq, errseq_t since);
>  int errseq_check_and_advance(errseq_t *eseq, errseq_t *since);
> +void errseq_counter_sample(errseq_t *dst_errseq, int *dst_errors,
> +                          struct errseq_counter *counter);
> +int errseq_counter_check(struct errseq_counter *counter, errseq_t errseq_since,
> +                        int errors_since);
>  #endif
> diff --git a/lib/errseq.c b/lib/errseq.c
> index d555e7fc18d2..98fcfafa3d97 100644
> --- a/lib/errseq.c
> +++ b/lib/errseq.c
> @@ -246,3 +246,54 @@ int errseq_check_and_advance(errseq_t *eseq, errseq_t *since)
>         return err;
>  }
>  EXPORT_SYMBOL(errseq_check_and_advance);
> +
> +/**
> + * errseq_counter_sample() - Grab the current errseq_counter value
> + * @dst_errseq: The errseq_t to copy to
> + * @dst_errors: The destination overflow to copy to
> + * @counter: The errseq_counter to copy from
> + *
> + * Grabs a point in time sample of the errseq_counter for latter comparison
> + */
> +void errseq_counter_sample(errseq_t *dst_errseq, int *dst_errors,

Why 2 arguments and not struct errseq_counter *dst_counter?

> +                          struct errseq_counter *counter)
> +{
> +       errseq_t cur;
> +
> +       do {
> +               cur = READ_ONCE(counter->errseq);
> +               *dst_errors = atomic_read(&counter->errors);
> +       } while (cur != READ_ONCE(counter->errseq));

This loop seems odd. I think the return value should reflect the fact that
the snapshot failed and let the caller decide if it wants to loop.

And about the one and only introduced caller, I think the answer is that
it shouldn't loop. If volatile overlayfs mount tries to sample the upper sb
error counter and an unseen error exists, I argued before that I think
mount should fail, so that the container orchestrator can decide what to do.
Failure to take an errseq_counter sample means than an unseen error
has been observed at least in the first or second check.

> +
> +       /* Clear the seen bit to make checking later easier */
> +       *dst_errseq = cur & ~ERRSEQ_SEEN;
> +}
> +EXPORT_SYMBOL(errseq_counter_sample);
> +
> +/**
> + * errseq_counter_check() - Has an error occurred since the sample
> + * @counter: The errseq_counter from which to check.
> + * @errseq_since: The errseq_t sampled with errseq_counter_sample to check
> + * @errors_since: The errors sampled with errseq_counter_sample to check
> + *
> + * Returns: The latest error set in the errseq_t or 0 if there have been none.
> + */
> +int errseq_counter_check(struct errseq_counter *counter, errseq_t errseq_since,
> +                        int errors_since)
> +{
> +       errseq_t cur_errseq;
> +       int cur_errors;
> +
> +       cur_errors = atomic_read(&counter->errors);
> +       /* To match the barrier in errseq_counter_set */
> +       smp_rmb();
> +
> +       /* Clear / ignore the seen bit as we do at sample time */
> +       cur_errseq = READ_ONCE(counter->errseq) & ~ERRSEQ_SEEN;
> +
> +       if (cur_errseq == errseq_since && errors_since == cur_errors)
> +               return 0;
> +
> +       return -(cur_errseq & MAX_ERRNO);
> +}


Same here. Why not pass an errseq_counter_since argument?

Thanks,
Amir.

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH v2 0/3] Check errors on sync for volatile overlayfs mounts
  2020-12-11 23:49 [PATCH v2 0/3] Check errors on sync for volatile overlayfs mounts Sargun Dhillon
                   ` (2 preceding siblings ...)
  2020-12-11 23:50 ` [PATCH v2 3/3] overlay: Implement volatile-specific fsync error behaviour Sargun Dhillon
@ 2020-12-12 11:21 ` Jeff Layton
  2020-12-12 11:48   ` Jeff Layton
  2020-12-13 20:06   ` Sargun Dhillon
  3 siblings, 2 replies; 9+ messages in thread
From: Jeff Layton @ 2020-12-12 11:21 UTC (permalink / raw)
  To: Sargun Dhillon, Amir Goldstein, Miklos Szeredi
  Cc: Vivek Goyal, overlayfs, Linux FS-devel Mailing List, Matthew Wilcox

On Fri, 2020-12-11 at 15:49 -0800, Sargun Dhillon wrote:
> The semantics of errseq and syncfs are such that it is impossible to track
> if any errors have occurred between the time the first error occurred, and
> the user checks for the error (calls syncfs, and subsequently
> errseq_check_and_advance.
> 
> Overlayfs has a volatile feature which short-circuits syncfs. This, in turn
> makes it so that the user can have silent data corruption and not know
> about it. The third patch in the series introduces behaviour that makes it
> so that we can track errors, and bubble up whether the user has put
> themselves in bad situation.
> 
> This required some gymanstics in errseq, and adding a wrapper around it
> called "errseq_counter" (errseq + counter). The data structure uses an
> atomic to track overflow errors. This approach, rather than moving to an
> atomic64 / u64 is so we can avoid bloating every person that subscribes to
> an errseq, and only add the subscriber behaviour to those who care (at the
> expense of space.
> 
> The datastructure is write-optimized, and rightfully so, as the users
> of the counter feature are just overlayfs, and it's called in fsync
> checking, which is a rather seldom operation, and not really on
> any hotpaths.
> 
> [1]: https://lore.kernel.org/linux-fsdevel/20201202092720.41522-1-sargun@sargun.me/
> 
> Sargun Dhillon (3):
>   errseq: Add errseq_counter to allow for all errors to be observed
>   errseq: Add mechanism to snapshot errseq_counter and check snapshot
>   overlay: Implement volatile-specific fsync error behaviour
> 
>  Documentation/filesystems/overlayfs.rst |   8 ++
>  fs/buffer.c                             |   2 +-
>  fs/overlayfs/file.c                     |   5 +-
>  fs/overlayfs/overlayfs.h                |   1 +
>  fs/overlayfs/ovl_entry.h                |   3 +
>  fs/overlayfs/readdir.c                  |   5 +-
>  fs/overlayfs/super.c                    |  26 +++--
>  fs/overlayfs/util.c                     |  28 +++++
>  fs/super.c                              |   1 +
>  fs/sync.c                               |   3 +-
>  include/linux/errseq.h                  |  18 ++++
>  include/linux/fs.h                      |   6 +-
>  include/linux/pagemap.h                 |   2 +-
>  lib/errseq.c                            | 129 ++++++++++++++++++++----
>  14 files changed, 202 insertions(+), 35 deletions(-)
> 

It would hel if you could more clearly lay out the semantics you're
looking for. If I understand correctly:

You basically want to be able to sample the sb->s_wb_err of the upper
layer at mount time and then always return an error if any new errors
were recorded since that point.

If that's correct, then I'm not sure I get need for all of this extra
counter machinery. Why not just sample it at mount time without
recording it as 0 if the seen flag isn't set. Then just do an
errseq_check against the upper superblock (without advancing) in the
overlayfs ->sync_fs routine and just errseq_set that error into the
overlayfs superblock? The syncfs syscall wrapper should then always
report the latest error.

Or (even better) rework all of the sync_fs/syncfs mess to be more sane,
so that overlayfs has more control over what errors get returned to
userland. ISTM that the main problem you have is that the
errseq_check_and_advance is done in the syscall wrapper, and that's
probably not appropriate for your use-case.

-- 
Jeff Layton <jlayton@kernel.org>


^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH v2 0/3] Check errors on sync for volatile overlayfs mounts
  2020-12-12 11:21 ` [PATCH v2 0/3] Check errors on sync for volatile overlayfs mounts Jeff Layton
@ 2020-12-12 11:48   ` Jeff Layton
  2020-12-13 20:06   ` Sargun Dhillon
  1 sibling, 0 replies; 9+ messages in thread
From: Jeff Layton @ 2020-12-12 11:48 UTC (permalink / raw)
  To: Sargun Dhillon, Amir Goldstein, Miklos Szeredi
  Cc: Vivek Goyal, overlayfs, Linux FS-devel Mailing List, Matthew Wilcox

On Sat, 2020-12-12 at 06:21 -0500, Jeff Layton wrote:
> On Fri, 2020-12-11 at 15:49 -0800, Sargun Dhillon wrote:
> > The semantics of errseq and syncfs are such that it is impossible to track
> > if any errors have occurred between the time the first error occurred, and
> > the user checks for the error (calls syncfs, and subsequently
> > errseq_check_and_advance.
> > 
> > Overlayfs has a volatile feature which short-circuits syncfs. This, in turn
> > makes it so that the user can have silent data corruption and not know
> > about it. The third patch in the series introduces behaviour that makes it
> > so that we can track errors, and bubble up whether the user has put
> > themselves in bad situation.
> > 
> > This required some gymanstics in errseq, and adding a wrapper around it
> > called "errseq_counter" (errseq + counter). The data structure uses an
> > atomic to track overflow errors. This approach, rather than moving to an
> > atomic64 / u64 is so we can avoid bloating every person that subscribes to
> > an errseq, and only add the subscriber behaviour to those who care (at the
> > expense of space.
> > 
> > The datastructure is write-optimized, and rightfully so, as the users
> > of the counter feature are just overlayfs, and it's called in fsync
> > checking, which is a rather seldom operation, and not really on
> > any hotpaths.
> > 
> > [1]: https://lore.kernel.org/linux-fsdevel/20201202092720.41522-1-sargun@sargun.me/
> > 
> > Sargun Dhillon (3):
> >   errseq: Add errseq_counter to allow for all errors to be observed
> >   errseq: Add mechanism to snapshot errseq_counter and check snapshot
> >   overlay: Implement volatile-specific fsync error behaviour
> > 
> >  Documentation/filesystems/overlayfs.rst |   8 ++
> >  fs/buffer.c                             |   2 +-
> >  fs/overlayfs/file.c                     |   5 +-
> >  fs/overlayfs/overlayfs.h                |   1 +
> >  fs/overlayfs/ovl_entry.h                |   3 +
> >  fs/overlayfs/readdir.c                  |   5 +-
> >  fs/overlayfs/super.c                    |  26 +++--
> >  fs/overlayfs/util.c                     |  28 +++++
> >  fs/super.c                              |   1 +
> >  fs/sync.c                               |   3 +-
> >  include/linux/errseq.h                  |  18 ++++
> >  include/linux/fs.h                      |   6 +-
> >  include/linux/pagemap.h                 |   2 +-
> >  lib/errseq.c                            | 129 ++++++++++++++++++++----
> >  14 files changed, 202 insertions(+), 35 deletions(-)
> > 
> 
> It would hel if you could more clearly lay out the semantics you're
> looking for. If I understand correctly:
> 
> You basically want to be able to sample the sb->s_wb_err of the upper
> layer at mount time and then always return an error if any new errors
> were recorded since that point.
> 
> If that's correct, then I'm not sure I get need for all of this extra
> counter machinery. Why not just sample it at mount time without
> recording it as 0 if the seen flag isn't set. Then just do an
> errseq_check against the upper superblock (without advancing) in the
> overlayfs ->sync_fs routine and just errseq_set that error into the
> overlayfs superblock? The syncfs syscall wrapper should then always
> report the latest error.
> 
> Or (even better) rework all of the sync_fs/syncfs mess to be more sane,
> so that overlayfs has more control over what errors get returned to
> userland. ISTM that the main problem you have is that the
> errseq_check_and_advance is done in the syscall wrapper, and that's
> probably not appropriate for your use-case.
> 

Something like this is what I was thinking (completely untested, of
course and needs comments). Would this not give you what you want for
volatile mount syncfs semantics?

diff --git a/fs/overlayfs/ovl_entry.h b/fs/overlayfs/ovl_entry.h
index 1b5a2094df8e..b7ada3fd04fd 100644
--- a/fs/overlayfs/ovl_entry.h
+++ b/fs/overlayfs/ovl_entry.h
@@ -79,6 +79,7 @@ struct ovl_fs {
 	atomic_long_t last_ino;
 	/* Whiteout dentry cache */
 	struct dentry *whiteout;
+	errseq_t err;
 };
 
 static inline struct vfsmount *ovl_upper_mnt(struct ovl_fs *ofs)
diff --git a/fs/overlayfs/super.c b/fs/overlayfs/super.c
index 290983bcfbb3..35780776360c 100644
--- a/fs/overlayfs/super.c
+++ b/fs/overlayfs/super.c
@@ -264,8 +264,12 @@ static int ovl_sync_fs(struct super_block *sb, int wait)
 	if (!ovl_upper_mnt(ofs))
 		return 0;
 
-	if (!ovl_should_sync(ofs))
-		return 0;
+	if (!ovl_should_sync(ofs)) {
+		ret = errseq_check(&upper_sb->s_wb_err, ofs->err);
+		errseq_set(&sb->s_wb_err, ret);
+		return ret;
+	}
+
 	/*
 	 * Not called for sync(2) call or an emergency sync (SB_I_SKIP_SYNC).
 	 * All the super blocks will be iterated, including upper_sb.
@@ -1945,7 +1949,7 @@ static int ovl_fill_super(struct super_block *sb, void *data, int silent)
 
 		sb->s_stack_depth = ovl_upper_mnt(ofs)->mnt_sb->s_stack_depth;
 		sb->s_time_gran = ovl_upper_mnt(ofs)->mnt_sb->s_time_gran;
-
+		ofs->err = errseq_sample(&ovl_upper_mnt(ofs)->mnt_sb->s_wb_err);
 	}
 	oe = ovl_get_lowerstack(sb, splitlower, numlower, ofs, layers);
 	err = PTR_ERR(oe);
diff --git a/include/linux/errseq.h b/include/linux/errseq.h
index fc2777770768..0d9ead687dc1 100644
--- a/include/linux/errseq.h
+++ b/include/linux/errseq.h
@@ -9,6 +9,7 @@ typedef u32	errseq_t;
 
 errseq_t errseq_set(errseq_t *eseq, int err);
 errseq_t errseq_sample(errseq_t *eseq);
+errseq_t errseq_sample_advance(errseq_t *eseq);
 int errseq_check(errseq_t *eseq, errseq_t since);
 int errseq_check_and_advance(errseq_t *eseq, errseq_t *since);
 #endif
diff --git a/lib/errseq.c b/lib/errseq.c
index 81f9e33aa7e7..27ab3000507f 100644
--- a/lib/errseq.c
+++ b/lib/errseq.c
@@ -130,6 +130,25 @@ errseq_t errseq_sample(errseq_t *eseq)
 }
 EXPORT_SYMBOL(errseq_sample);
 
+errseq_t errseq_sample_advance(errseq_t *eseq)
+{
+	errseq_t old = READ_ONCE(*eseq);
+	errseq_t new = old;
+
+	/*
+	 * For the common case of no errors ever having been set, we can skip
+	 * marking the SEEN bit. Once an error has been set, the value will
+	 * never go back to zero.
+	 */
+	if (old != 0) {
+		new |= ERRSEQ_SEEN;
+		if (old != new)
+			cmpxchg(eseq, old, new);
+	}
+	return new;
+}
+EXPORT_SYMBOL(errseq_sample_advance);
+
 /**
  * errseq_check() - Has an error occurred since a particular sample point?
  * @eseq: Pointer to errseq_t value to be checked.



^ permalink raw reply related	[flat|nested] 9+ messages in thread

* Re: [PATCH v2 2/3] errseq: Add mechanism to snapshot errseq_counter and check snapshot
  2020-12-12  9:57   ` Amir Goldstein
@ 2020-12-13 19:41     ` Sargun Dhillon
  0 siblings, 0 replies; 9+ messages in thread
From: Sargun Dhillon @ 2020-12-13 19:41 UTC (permalink / raw)
  To: Amir Goldstein
  Cc: Miklos Szeredi, Vivek Goyal, overlayfs,
	Linux FS-devel Mailing List, Matthew Wilcox, Jeff Layton

On Sat, Dec 12, 2020 at 11:57:52AM +0200, Amir Goldstein wrote:
> Forgot to CC Jeff?
> 
Oops.
> On Sat, Dec 12, 2020 at 1:50 AM Sargun Dhillon <sargun@sargun.me> wrote:
> >
> > This adds the function errseq_counter_sample to allow for "subscribers"
> > to take point-in-time snapshots of the errseq_counter, and store the
> > counter + errseq_t.
> >
> > Signed-off-by: Sargun Dhillon <sargun@sargun.me>
> > ---
> >  include/linux/errseq.h |  4 ++++
> >  lib/errseq.c           | 51 ++++++++++++++++++++++++++++++++++++++++++
> >  2 files changed, 55 insertions(+)
> >
> > diff --git a/include/linux/errseq.h b/include/linux/errseq.h
> > index 35818c484290..8998df499a3b 100644
> > --- a/include/linux/errseq.h
> > +++ b/include/linux/errseq.h
> > @@ -25,4 +25,8 @@ errseq_t errseq_set(errseq_t *eseq, int err);
> >  errseq_t errseq_sample(errseq_t *eseq);
> >  int errseq_check(errseq_t *eseq, errseq_t since);
> >  int errseq_check_and_advance(errseq_t *eseq, errseq_t *since);
> > +void errseq_counter_sample(errseq_t *dst_errseq, int *dst_errors,
> > +                          struct errseq_counter *counter);
> > +int errseq_counter_check(struct errseq_counter *counter, errseq_t errseq_since,
> > +                        int errors_since);
> >  #endif
> > diff --git a/lib/errseq.c b/lib/errseq.c
> > index d555e7fc18d2..98fcfafa3d97 100644
> > --- a/lib/errseq.c
> > +++ b/lib/errseq.c
> > @@ -246,3 +246,54 @@ int errseq_check_and_advance(errseq_t *eseq, errseq_t *since)
> >         return err;
> >  }
> >  EXPORT_SYMBOL(errseq_check_and_advance);
> > +
> > +/**
> > + * errseq_counter_sample() - Grab the current errseq_counter value
> > + * @dst_errseq: The errseq_t to copy to
> > + * @dst_errors: The destination overflow to copy to
> > + * @counter: The errseq_counter to copy from
> > + *
> > + * Grabs a point in time sample of the errseq_counter for latter comparison
> > + */
> > +void errseq_counter_sample(errseq_t *dst_errseq, int *dst_errors,
> 
> Why 2 arguments and not struct errseq_counter *dst_counter?
> 

Mostly not to have to use atomic_* when setting this value and avoiding locking 
another cacheline on the CPU. IIRC, atomic_t is always 4-byte aligned but int 
doesn't have to be.

> > +                          struct errseq_counter *counter)
> > +{
> > +       errseq_t cur;
> > +
> > +       do {
> > +               cur = READ_ONCE(counter->errseq);
> > +               *dst_errors = atomic_read(&counter->errors);
> > +       } while (cur != READ_ONCE(counter->errseq));
> 
> This loop seems odd. I think the return value should reflect the fact that
> the snapshot failed and let the caller decide if it wants to loop.
> 
> And about the one and only introduced caller, I think the answer is that
> it shouldn't loop. If volatile overlayfs mount tries to sample the upper sb
> error counter and an unseen error exists, I argued before that I think
> mount should fail, so that the container orchestrator can decide what to do.
> Failure to take an errseq_counter sample means than an unseen error
> has been observed at least in the first or second check.
> 

I guess. In the "good" case, there's the same computational cost, but the bad
case (error occurs while we are snapshotting results in another spin.

> > +
> > +       /* Clear the seen bit to make checking later easier */
> > +       *dst_errseq = cur & ~ERRSEQ_SEEN;
> > +}
> > +EXPORT_SYMBOL(errseq_counter_sample);
> > +
> > +/**
> > + * errseq_counter_check() - Has an error occurred since the sample
> > + * @counter: The errseq_counter from which to check.
> > + * @errseq_since: The errseq_t sampled with errseq_counter_sample to check
> > + * @errors_since: The errors sampled with errseq_counter_sample to check
> > + *
> > + * Returns: The latest error set in the errseq_t or 0 if there have been none.
> > + */
> > +int errseq_counter_check(struct errseq_counter *counter, errseq_t errseq_since,
> > +                        int errors_since)
> > +{
> > +       errseq_t cur_errseq;
> > +       int cur_errors;
> > +
> > +       cur_errors = atomic_read(&counter->errors);
> > +       /* To match the barrier in errseq_counter_set */
> > +       smp_rmb();
> > +
> > +       /* Clear / ignore the seen bit as we do at sample time */
> > +       cur_errseq = READ_ONCE(counter->errseq) & ~ERRSEQ_SEEN;
> > +
> > +       if (cur_errseq == errseq_since && errors_since == cur_errors)
> > +               return 0;
> > +
> > +       return -(cur_errseq & MAX_ERRNO);
> > +}
> 
> 
> Same here. Why not pass an errseq_counter_since argument?
> 
> Thanks,
> Amir.

See above. I can change this, and I mulled over this decision a bunch, 
unfortunately (micro)benchmarking was inconclusive as to whether this made a 
difference or not.


^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH v2 0/3] Check errors on sync for volatile overlayfs mounts
  2020-12-12 11:21 ` [PATCH v2 0/3] Check errors on sync for volatile overlayfs mounts Jeff Layton
  2020-12-12 11:48   ` Jeff Layton
@ 2020-12-13 20:06   ` Sargun Dhillon
  1 sibling, 0 replies; 9+ messages in thread
From: Sargun Dhillon @ 2020-12-13 20:06 UTC (permalink / raw)
  To: Jeff Layton
  Cc: Amir Goldstein, Miklos Szeredi, Vivek Goyal, overlayfs,
	Linux FS-devel Mailing List, Matthew Wilcox

On Sat, Dec 12, 2020 at 06:21:37AM -0500, Jeff Layton wrote:
> On Fri, 2020-12-11 at 15:49 -0800, Sargun Dhillon wrote:
> > The semantics of errseq and syncfs are such that it is impossible to track
> > if any errors have occurred between the time the first error occurred, and
> > the user checks for the error (calls syncfs, and subsequently
> > errseq_check_and_advance.
> > 
> > Overlayfs has a volatile feature which short-circuits syncfs. This, in turn
> > makes it so that the user can have silent data corruption and not know
> > about it. The third patch in the series introduces behaviour that makes it
> > so that we can track errors, and bubble up whether the user has put
> > themselves in bad situation.
> > 
> > This required some gymanstics in errseq, and adding a wrapper around it
> > called "errseq_counter" (errseq + counter). The data structure uses an
> > atomic to track overflow errors. This approach, rather than moving to an
> > atomic64 / u64 is so we can avoid bloating every person that subscribes to
> > an errseq, and only add the subscriber behaviour to those who care (at the
> > expense of space.
> > 
> > The datastructure is write-optimized, and rightfully so, as the users
> > of the counter feature are just overlayfs, and it's called in fsync
> > checking, which is a rather seldom operation, and not really on
> > any hotpaths.
> > 
> > [1]: https://lore.kernel.org/linux-fsdevel/20201202092720.41522-1-sargun@sargun.me/
> > 
> > Sargun Dhillon (3):
> >   errseq: Add errseq_counter to allow for all errors to be observed
> >   errseq: Add mechanism to snapshot errseq_counter and check snapshot
> >   overlay: Implement volatile-specific fsync error behaviour
> > 
> >  Documentation/filesystems/overlayfs.rst |   8 ++
> >  fs/buffer.c                             |   2 +-
> >  fs/overlayfs/file.c                     |   5 +-
> >  fs/overlayfs/overlayfs.h                |   1 +
> >  fs/overlayfs/ovl_entry.h                |   3 +
> >  fs/overlayfs/readdir.c                  |   5 +-
> >  fs/overlayfs/super.c                    |  26 +++--
> >  fs/overlayfs/util.c                     |  28 +++++
> >  fs/super.c                              |   1 +
> >  fs/sync.c                               |   3 +-
> >  include/linux/errseq.h                  |  18 ++++
> >  include/linux/fs.h                      |   6 +-
> >  include/linux/pagemap.h                 |   2 +-
> >  lib/errseq.c                            | 129 ++++++++++++++++++++----
> >  14 files changed, 202 insertions(+), 35 deletions(-)
> > 
> 
> It would hel if you could more clearly lay out the semantics you're
> looking for. If I understand correctly:
> 
> You basically want to be able to sample the sb->s_wb_err of the upper
> layer at mount time and then always return an error if any new errors
> were recorded since that point.
> 
There's two things we want to achieve:

1. If an error occurs on the upperidr after mount time, we want to tell the user 
  on every syncfs  they try to do on the overlayfs volume that it occurred, and
  that the volume is in an inconsistent state.
2. We want to be able to checkpoint some information to disk, and if an overlayfs
   mount was unmounted, and remounted, while in volatile mode, we want to make sure
   no error occurred while we were way.

> If that's correct, then I'm not sure I get need for all of this extra
> counter machinery. Why not just sample it at mount time without
> recording it as 0 if the seen flag isn't set. Then just do an
> errseq_check against the upper superblock (without advancing) in the
> overlayfs ->sync_fs routine and just errseq_set that error into the
> overlayfs superblock? The syncfs syscall wrapper should then always
> report the latest error.
I considered the following options:
1. Make errseq_t a u64: Downside: Bloats all errseq_ts to u64 / 8-byte aligned
2. Make errseq_counter_t an atomic64 / u64 giving us 52 error checking bits vs.
   just 20: Downside: We would have to do an cmpxchg64 on every error, which
   seemed like it could be costly on platforms that don't naturally support it.
3. Have an overflow counter: This doesn't introduce extra CPU overhead for any
   other user of errseq, nor does it introduce much memory overhead. Downside:
   complexity.

> 
> Or (even better) rework all of the sync_fs/syncfs mess to be more sane,
> so that overlayfs has more control over what errors get returned to
> userland. ISTM that the main problem you have is that the
> errseq_check_and_advance is done in the syscall wrapper, and that's
> probably not appropriate for your use-case.
> 
> -- 
> Jeff Layton <jlayton@kernel.org>
> 

^ permalink raw reply	[flat|nested] 9+ messages in thread

end of thread, other threads:[~2020-12-13 20:07 UTC | newest]

Thread overview: 9+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-12-11 23:49 [PATCH v2 0/3] Check errors on sync for volatile overlayfs mounts Sargun Dhillon
2020-12-11 23:50 ` [PATCH v2 1/3] errseq: Add errseq_counter to allow for all errors to be observed Sargun Dhillon
2020-12-11 23:50 ` [PATCH v2 2/3] errseq: Add mechanism to snapshot errseq_counter and check snapshot Sargun Dhillon
2020-12-12  9:57   ` Amir Goldstein
2020-12-13 19:41     ` Sargun Dhillon
2020-12-11 23:50 ` [PATCH v2 3/3] overlay: Implement volatile-specific fsync error behaviour Sargun Dhillon
2020-12-12 11:21 ` [PATCH v2 0/3] Check errors on sync for volatile overlayfs mounts Jeff Layton
2020-12-12 11:48   ` Jeff Layton
2020-12-13 20:06   ` Sargun Dhillon

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).