[RFC PATCH v2 0/2] errseq+overlayfs: accomodate the volatile upper layer use-case

* [RFC PATCH v2 0/2] errseq+overlayfs: accomodate the volatile upper layer use-case
@ 2020-12-14 22:14 Jeff Layton
  2020-12-14 22:14 ` [RFC PATCH v2 1/2] errseq: split the SEEN flag into two new flags Jeff Layton
  2020-12-14 22:14 ` [RFC PATCH v2 2/2] overlayfs: propagate errors from upper to overlay sb in sync_fs Jeff Layton
  0 siblings, 2 replies; 6+ messages in thread
From: Jeff Layton @ 2020-12-14 22:14 UTC
  To: Amir Goldstein, Sargun Dhillon
  Cc: Miklos Szeredi, Vivek Goyal, overlayfs,
	Linux FS-devel Mailing List, Matthew Wilcox, NeilBrown, Jan Kara

Here's a second pass at working in the overlayfs volatile use case.
Some differences since the first RFC set:

- use the BIT() macro for the flags and counter, also add some new
  mask constants
- fix bug in errseq_sample (we don't want to set the SEEN bit there)
- fix handling in errseq_check_and_advance. We now need to reattempt
  the cmpxchg in one case, but we should only need to do it once.
- comment and documentation fixes and cleanup
- initialize upper_sb pointer before dereferencing it
- only call errseq_set when there is an error

I think this is getting closer to merge. It seems to do the right thing
on xfs (and I assume other filesystems). I've also sorted out a number
of bugs.

What I haven't actually tested is the overlayfs part. Sargun, would you
or someone else (Vivek?) be able to verify that it does the right thing?

This is in the errseq-mustinc branch in my kernel.org tree:

    https://git.kernel.org/pub/scm/linux/kernel/git/jlayton/linux.git/log/?h=errseq-mustinc

-------------[ Original cover letter follows ]--------------

What about this as an alternate approach to the problem that Sargun has
been working on? I have some minor concerns about the complexity of
managing a stateful object across two different words. That can be
done, but I think this may be simpler.

This set steals an extra flag bit from the errseq_t counter so that we
have two flags: one indicating whether to increment the counter at set
time, and another to indicate whether the error has been reported to
userland.
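
As an illustration of the layout described above (not part of the
series), here is a minimal userspace sketch of how the two flag bits sit
between the errno field and the counter. The shift of 12 assumes
MAX_ERRNO is 4095, as in the kernel; the sk_* names are hypothetical and
exist only for this example.

```c
#include <assert.h>
#include <stdint.h>

/* Userspace model; the kernel derives the shift as ilog2(MAX_ERRNO + 1). */
#define SK_MAX_ERRNO	4095u
#define SK_SHIFT	12u
#define SK_SEEN		(1u << SK_SHIFT)	/* reported to userland */
#define SK_MUSTINC	(1u << (SK_SHIFT + 1))	/* observed; bump counter on next set */
#define SK_CTR_INC	(1u << (SK_SHIFT + 2))	/* lowest bit of the counter */
#define SK_CTR_MASK	(~(SK_CTR_INC - 1))	/* counter bits only */

typedef uint32_t sk_errseq_t;

/* Extract the (positive) errno stored in the low bits. */
static inline unsigned int sk_errno(sk_errseq_t e)
{
	return e & SK_MAX_ERRNO;
}

/* Non-zero if the error has been reported to userland. */
static inline int sk_seen(sk_errseq_t e)
{
	return !!(e & SK_SEEN);
}

/* Non-zero if the next recorded error must bump the counter. */
static inline int sk_mustinc(sk_errseq_t e)
{
	return !!(e & SK_MUSTINC);
}

/* Extract the counter, which now starts two bits above the errno field. */
static inline uint32_t sk_counter(sk_errseq_t e)
{
	return (e & SK_CTR_MASK) >> (SK_SHIFT + 2);
}
```

With this layout the counter shrinks from 19 to 18 bits in a 32-bit
errseq_t, which is the cost of stealing the extra flag bit.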

This should give you the semantics you want in the syncfs case, no?  If
this does look like it's a suitable approach, then I'll plan to clean up
the comments and docs.

I have a vague feeling that this might help us eventually kill the
AS_EIO and AS_ENOSPC bits too, but that would require a bit more work to
plumb in "since" samples at appropriate places.

Jeff Layton (2):
  errseq: split the SEEN flag into two new flags
  overlayfs: propagate errors from upper to overlay sb in sync_fs

 Documentation/core-api/errseq.rst |  22 +++--
 fs/overlayfs/ovl_entry.h          |   1 +
 fs/overlayfs/super.c              |  19 +++--
 include/linux/errseq.h            |   2 +
 lib/errseq.c                      | 136 ++++++++++++++++++++++--------
 5 files changed, 132 insertions(+), 48 deletions(-)

-- 
2.29.2


* [RFC PATCH v2 1/2] errseq: split the SEEN flag into two new flags
  2020-12-14 22:14 [RFC PATCH v2 0/2] errseq+overlayfs: accomodate the volatile upper layer use-case Jeff Layton
@ 2020-12-14 22:14 ` Jeff Layton
  2020-12-16 23:51   ` NeilBrown
  2020-12-14 22:14 ` [RFC PATCH v2 2/2] overlayfs: propagate errors from upper to overlay sb in sync_fs Jeff Layton
  1 sibling, 1 reply; 6+ messages in thread
From: Jeff Layton @ 2020-12-14 22:14 UTC
  To: Amir Goldstein, Sargun Dhillon
  Cc: Miklos Szeredi, Vivek Goyal, overlayfs,
	Linux FS-devel Mailing List, Matthew Wilcox, NeilBrown, Jan Kara

Overlayfs's volatile mounts want to be able to sample an error for
their own purposes, without preventing a later opener from potentially
seeing the error.

The original reason for the SEEN flag was to make it so that we didn't
need to increment the counter if nothing had observed the latest value
and the error was the same. Eventually, a regression was reported in
the errseq_t conversion, and we fixed that by using the SEEN flag to
also mean that the error had been reported to userland at least once
somewhere.

Those are two different states, however. If we instead take a second
flag bit from the counter, we can track these two things separately,
and accommodate the overlayfs volatile mount use-case.

Add a new MUSTINC flag that indicates that the counter must be
incremented the next time an error is set, and rework the errseq
functions to set and clear that flag whenever the SEEN bit is set or
cleared.

Test only for the MUSTINC bit when deciding whether to increment the
counter and only for the SEEN bit when deciding what to return in
errseq_sample.

Add a new errseq_peek function to allow for the overlayfs use-case.
This just grabs the latest counter and sets the MUSTINC bit, leaving
the SEEN bit untouched.

errseq_check_and_advance must now handle a single special case where
it races against a "peek" of an as of yet unseen value. The do/while
loop looks scary, but shouldn't loop more than once.
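
For readers who want to experiment with the semantics described above, a
rough userspace model follows. It is a sketch under stated assumptions,
not the kernel code: it uses C11 atomics in place of READ_ONCE/cmpxchg,
takes a positive errno, and the model_* names are hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

#define M_MAX_ERRNO	4095u
#define M_SEEN		(1u << 12)
#define M_MUSTINC	(1u << 13)
#define M_CTR_INC	(1u << 14)
#define M_CTR_MASK	(~(M_CTR_INC - 1))

typedef _Atomic uint32_t model_errseq_t;

/* Record an error; bump the counter only if someone has observed the value. */
static void model_set(model_errseq_t *eseq, unsigned int err)
{
	uint32_t old = atomic_load(eseq);

	for (;;) {
		/* Clear flags and the old errno, then store the new errno. */
		uint32_t new = (old & M_CTR_MASK) | err;

		if (old & M_MUSTINC)
			new += M_CTR_INC;
		if (new == old)
			return;
		/* On failure, old is refreshed with the current value. */
		if (atomic_compare_exchange_strong(eseq, &old, new))
			return;
	}
}

/* Observe the value without claiming it was reported to userland. */
static uint32_t model_peek(model_errseq_t *eseq)
{
	uint32_t old = atomic_load(eseq);
	uint32_t new = old;

	if (old != 0) {
		new |= M_MUSTINC;
		if (new != old)
			atomic_compare_exchange_strong(eseq, &old, new);
	}
	return new;
}

/* Sample for a new opener: return 0 if the error is still unreported. */
static uint32_t model_sample(model_errseq_t *eseq)
{
	uint32_t cur = model_peek(eseq);

	return (cur & M_SEEN) ? cur : 0;
}

/* Compare against an earlier sample, ignoring both flag bits. */
static int model_check(model_errseq_t *eseq, uint32_t since)
{
	uint32_t cur = atomic_load(eseq);

	if (((cur ^ since) & ~(M_SEEN | M_MUSTINC)) == 0)
		return 0;
	return -(int)(cur & M_MAX_ERRNO);
}
```

In this model, a peek followed by a second error with the same errno
still bumps the counter, while a later sample keeps returning 0 until
the error has actually been reported.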

Signed-off-by: Jeff Layton <jlayton@kernel.org>
---
 Documentation/core-api/errseq.rst |  22 +++--
 include/linux/errseq.h            |   2 +
 lib/errseq.c                      | 136 ++++++++++++++++++++++--------
 3 files changed, 117 insertions(+), 43 deletions(-)

diff --git a/Documentation/core-api/errseq.rst b/Documentation/core-api/errseq.rst
index ff332e272405..43a4042a0546 100644
--- a/Documentation/core-api/errseq.rst
+++ b/Documentation/core-api/errseq.rst
@@ -18,18 +18,22 @@ these functions can be called from any context.
 Note that there is a risk of collisions if new errors are being recorded
 frequently, since we have so few bits to use as a counter.
 
-To mitigate this, the bit between the error value and counter is used as
-a flag to tell whether the value has been sampled since a new value was
-recorded.  That allows us to avoid bumping the counter if no one has
-sampled it since the last time an error was recorded.
+To mitigate this, the bits between the error value and counter are used
+as flags to tell whether the value has been sampled since a new value
+was recorded, and whether the latest error has been seen by userland.
+That allows us to avoid bumping the counter if no one has sampled it
+since the last time an error was recorded, and also ensures that any
+recorded error will be seen at least once.
 
 Thus we end up with a value that looks something like this:
 
-+--------------------------------------+----+------------------------+
-| 31..13                               | 12 | 11..0                  |
-+--------------------------------------+----+------------------------+
-| counter                              | SF | errno                  |
-+--------------------------------------+----+------------------------+
++---------------------------------+----+----+------------------------+
+| 31..13                          | 13 | 12 | 11..0                  |
++---------------------------------+----+----+------------------------+
+| counter                         | MF | SF | errno                  |
++---------------------------------+----+----+------------------------+
+SF = ERRSEQ_SEEN flag
+MF = ERRSEQ_MUSTINC flag
 
 The general idea is for "watchers" to sample an errseq_t value and keep
 it as a running cursor.  That value can later be used to tell whether
diff --git a/include/linux/errseq.h b/include/linux/errseq.h
index fc2777770768..6d4b9bc629ac 100644
--- a/include/linux/errseq.h
+++ b/include/linux/errseq.h
@@ -9,6 +9,8 @@ typedef u32	errseq_t;
 
 errseq_t errseq_set(errseq_t *eseq, int err);
 errseq_t errseq_sample(errseq_t *eseq);
+errseq_t errseq_peek(errseq_t *eseq);
+errseq_t errseq_sample_advance(errseq_t *eseq);
 int errseq_check(errseq_t *eseq, errseq_t since);
 int errseq_check_and_advance(errseq_t *eseq, errseq_t *since);
 #endif
diff --git a/lib/errseq.c b/lib/errseq.c
index 81f9e33aa7e7..cee9f6b45725 100644
--- a/lib/errseq.c
+++ b/lib/errseq.c
@@ -21,10 +21,14 @@
  * Note that there is a risk of collisions if new errors are being recorded
  * frequently, since we have so few bits to use as a counter.
  *
- * To mitigate this, one bit is used as a flag to tell whether the value has
- * been sampled since a new value was recorded. That allows us to avoid bumping
- * the counter if no one has sampled it since the last time an error was
- * recorded.
+ * To mitigate this, one bit is used as a flag to tell whether the value has been
+ * observed in some fashion. That allows us to avoid bumping the counter if no
+ * one has sampled it since the last time an error was recorded.
+ *
+ * A second flag bit is used to indicate whether the latest error that has been
+ * recorded has been reported to userland. If the SEEN bit is not set when the
+ * file is opened, then we ensure that the opener will see the error by setting
+ * its sample to 0.
  *
  * A new errseq_t should always be zeroed out.  A errseq_t value of all zeroes
  * is the special (but common) case where there has never been an error. An all
@@ -36,10 +40,32 @@
 #define ERRSEQ_SHIFT		ilog2(MAX_ERRNO + 1)
 
 /* This bit is used as a flag to indicate whether the value has been seen */
-#define ERRSEQ_SEEN		(1 << ERRSEQ_SHIFT)
+#define ERRSEQ_SEEN		BIT(ERRSEQ_SHIFT)
+
+/* This bit indicates that value must be incremented even when error is same */
+#define ERRSEQ_MUSTINC		BIT(ERRSEQ_SHIFT + 1)
 
 /* The lowest bit of the counter */
-#define ERRSEQ_CTR_INC		(1 << (ERRSEQ_SHIFT + 1))
+#define ERRSEQ_CTR_INC		BIT(ERRSEQ_SHIFT + 2)
+
+/* Mask that just contains the counter bits */
+#define ERRSEQ_CTR_MASK		(~(ERRSEQ_CTR_INC - 1))
+
+/* Mask that just contains flags */
+#define ERRSEQ_FLAG_MASK	(ERRSEQ_SEEN|ERRSEQ_MUSTINC)
+
+/**
+ * errseq_same - return true if the errseq counters and values are the same
+ * @a: first errseq
+ * @b: second errseq
+ *
+ * Compare two errseqs and return true if they are the same, ignoring their
+ * flag bits.
+ */
+static inline bool errseq_same(errseq_t a, errseq_t b)
+{
+	return (a & ~ERRSEQ_FLAG_MASK) == (b & ~ERRSEQ_FLAG_MASK);
+}
 
 /**
  * errseq_set - set a errseq_t for later reporting
@@ -53,7 +79,7 @@
  *
  * Return: The previous value, primarily for debugging purposes. The
  * return value should not be used as a previously sampled value in later
- * calls as it will not have the SEEN flag set.
+ * calls as it will not have the MUSTINC flag set.
  */
 errseq_t errseq_set(errseq_t *eseq, int err)
 {
@@ -77,11 +103,11 @@ errseq_t errseq_set(errseq_t *eseq, int err)
 	for (;;) {
 		errseq_t new;
 
-		/* Clear out error bits and set new error */
-		new = (old & ~(MAX_ERRNO|ERRSEQ_SEEN)) | -err;
+		/* Clear out flag bits and old errors, and set new error */
+		new = (old & ERRSEQ_CTR_MASK) | -err;
 
-		/* Only increment if someone has looked at it */
-		if (old & ERRSEQ_SEEN)
+		/* Only increment if we have to */
+		if (old & ERRSEQ_MUSTINC)
 			new += ERRSEQ_CTR_INC;
 
 		/* If there would be no change, then call it done */
@@ -108,11 +134,38 @@ errseq_t errseq_set(errseq_t *eseq, int err)
 EXPORT_SYMBOL(errseq_set);
 
 /**
- * errseq_sample() - Grab current errseq_t value.
+ * errseq_peek - Grab current errseq_t value
+ * @eseq: Pointer to errseq_t to be sampled.
+ *
+ * In some cases, we need to be able to sample the errseq_t, but we're not
+ * in a situation where we can report the value to userland. Use this
+ * function to do that. This ensures that later errors will be recorded,
+ * and that any current errors are reported at least once when it is
+ * next sampled.
+ *
+ * Context: Any context.
+ * Return: The current errseq value.
+ */
+errseq_t errseq_peek(errseq_t *eseq)
+{
+	errseq_t old = READ_ONCE(*eseq);
+	errseq_t new = old;
+
+	if (old != 0) {
+		new |= ERRSEQ_MUSTINC;
+		if (old != new)
+			cmpxchg(eseq, old, new);
+	}
+	return new;
+}
+EXPORT_SYMBOL(errseq_peek);
+
+/**
+ * errseq_sample() - Sample errseq_t value, and ensure that unseen errors are reported
  * @eseq: Pointer to errseq_t to be sampled.
  *
  * This function allows callers to initialise their errseq_t variable.
- * If the error has been "seen", new callers will not see an old error.
+ * If the latest error has been "seen", new callers will not see an old error.
  * If there is an unseen error in @eseq, the caller of this function will
  * see it the next time it checks for an error.
  *
@@ -121,12 +174,11 @@ EXPORT_SYMBOL(errseq_set);
  */
 errseq_t errseq_sample(errseq_t *eseq)
 {
-	errseq_t old = READ_ONCE(*eseq);
+	errseq_t new = errseq_peek(eseq);
 
-	/* If nobody has seen this error yet, then we can be the first. */
-	if (!(old & ERRSEQ_SEEN))
-		old = 0;
-	return old;
+	if (!(new & ERRSEQ_SEEN))
+		return 0;
+	return new;
 }
 EXPORT_SYMBOL(errseq_sample);
 
@@ -145,7 +197,7 @@ int errseq_check(errseq_t *eseq, errseq_t since)
 {
 	errseq_t cur = READ_ONCE(*eseq);
 
-	if (likely(cur == since))
+	if (errseq_same(cur, since))
 		return 0;
 	return -(cur & MAX_ERRNO);
 }
@@ -159,9 +211,9 @@ EXPORT_SYMBOL(errseq_check);
  * Grab the eseq value, and see whether it matches the value that @since
  * points to. If it does, then just return 0.
  *
- * If it doesn't, then the value has changed. Set the "seen" flag, and try to
- * swap it into place as the new eseq value. Then, set that value as the new
- * "since" value, and return whatever the error portion is set to.
+ * If it doesn't, then the value has changed. Set the SEEN+MUSTINC flags, and
+ * try to swap it into place as the new eseq value. Then, set that value as
+ * the new "since" value, and return whatever the error portion is set to.
  *
  * Note that no locking is provided here for concurrent updates to the "since"
  * value. The caller must provide that if necessary. Because of this, callers
@@ -183,21 +235,37 @@ int errseq_check_and_advance(errseq_t *eseq, errseq_t *since)
 	 */
 	old = READ_ONCE(*eseq);
 	if (old != *since) {
+		int loops = 0;
+
 		/*
-		 * Set the flag and try to swap it into place if it has
-		 * changed.
+		 * Set the flag and try to swap it into place if it has changed.
+		 *
+		 * If the swap doesn't occur, then it has either been updated by a
+		 * writer who is setting a new error and/or bumping the counter, or
+		 * another reader who is setting flags.
 		 *
-		 * We don't care about the outcome of the swap here. If the
-		 * swap doesn't occur, then it has either been updated by a
-		 * writer who is altering the value in some way (updating
-		 * counter or resetting the error), or another reader who is
-		 * just setting the "seen" flag. Either outcome is OK, and we
-		 * can advance "since" and return an error based on what we
-		 * have.
+		 * We only need to retry in one case -- if we raced with another
+		 * reader that is only setting the MUSTINC flag. We need the
+		 * current value to have the SEEN bit set if the other fields
+		 * didn't change, or we might report the same error twice.
 		 */
-		new = old | ERRSEQ_SEEN;
-		if (new != old)
-			cmpxchg(eseq, old, new);
+		do {
+			if (unlikely(loops >= 2)) {
+				/*
+				 * This should never loop more than once, as any
+				 * change not involving the SEEN bit would also
+				 * involve non-flag bits. WARN and just go with
+				 * what we have in that case.
+				 */
+				WARN_ON_ONCE(true);
+				break;
+			}
+			loops++;
+			new = old | ERRSEQ_SEEN | ERRSEQ_MUSTINC;
+			if (new == old)
+				break;
+			old = cmpxchg(eseq, old, new);
+		} while (old == (new & ~ERRSEQ_SEEN));
 		*since = new;
 		err = -(new & MAX_ERRNO);
 	}
-- 
2.29.2



* [RFC PATCH v2 2/2] overlayfs: propagate errors from upper to overlay sb in sync_fs
  2020-12-14 22:14 [RFC PATCH v2 0/2] errseq+overlayfs: accomodate the volatile upper layer use-case Jeff Layton
  2020-12-14 22:14 ` [RFC PATCH v2 1/2] errseq: split the SEEN flag into two new flags Jeff Layton
@ 2020-12-14 22:14 ` Jeff Layton
  2020-12-15 16:30   ` Vivek Goyal
  1 sibling, 1 reply; 6+ messages in thread
From: Jeff Layton @ 2020-12-14 22:14 UTC
  To: Amir Goldstein, Sargun Dhillon
  Cc: Miklos Szeredi, Vivek Goyal, overlayfs,
	Linux FS-devel Mailing List, Matthew Wilcox, NeilBrown, Jan Kara

Peek at the upper layer's errseq_t at mount time for volatile mounts,
and record it in the per-sb info. In sync_fs, check for an error since
the recorded point and set it in the overlayfs superblock if there was
one.
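
To make the intended flow concrete, here is a single-threaded userspace
sketch of "peek at mount, check in sync_fs". It is only an illustration
under simplifying assumptions: the sketch_* types and functions are
hypothetical stand-ins for the superblocks and errseq helpers, and the
overlay's own s_wb_err is reduced to a plain errno store.

```c
#include <assert.h>
#include <stdint.h>

#define S_MAX_ERRNO	4095u
#define S_SEEN		(1u << 12)
#define S_MUSTINC	(1u << 13)
#define S_CTR_INC	(1u << 14)
#define S_CTR_MASK	(~(S_CTR_INC - 1))

struct sketch_sb {
	uint32_t s_wb_err;		/* models sb->s_wb_err */
};

struct sketch_ovl {
	struct sketch_sb *upper;	/* models the upper layer's sb */
	struct sketch_sb self;		/* models the overlay sb */
	uint32_t since;			/* models ofs->errseq */
};

/* Models errseq_set(): bump the counter only if MUSTINC was set. */
static void sketch_record_error(struct sketch_sb *sb, unsigned int err)
{
	uint32_t old = sb->s_wb_err;
	uint32_t new = (old & S_CTR_MASK) | err;

	if (old & S_MUSTINC)
		new += S_CTR_INC;
	sb->s_wb_err = new;
}

/* Mount time: peek at the upper sb so pre-existing errors are ignored. */
static void sketch_mount(struct sketch_ovl *o, struct sketch_sb *upper)
{
	o->upper = upper;
	o->self.s_wb_err = 0;
	if (upper->s_wb_err)
		upper->s_wb_err |= S_MUSTINC;	/* SEEN is left untouched */
	o->since = upper->s_wb_err;
}

/* Volatile sync_fs path: report any upper error newer than the mount. */
static int sketch_sync_fs(struct sketch_ovl *o)
{
	uint32_t cur = o->upper->s_wb_err;

	if (((cur ^ o->since) & ~(S_SEEN | S_MUSTINC)) == 0)
		return 0;
	/* The real code calls errseq_set() on the overlay sb here. */
	o->self.s_wb_err = cur & S_MAX_ERRNO;
	return -(int)(cur & S_MAX_ERRNO);
}
```

Because the recorded point is never advanced, every later sync in this
model keeps failing once the upper layer has seen a new error.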

Signed-off-by: Jeff Layton <jlayton@kernel.org>
---
 fs/overlayfs/ovl_entry.h |  1 +
 fs/overlayfs/super.c     | 19 ++++++++++++++-----
 2 files changed, 15 insertions(+), 5 deletions(-)

diff --git a/fs/overlayfs/ovl_entry.h b/fs/overlayfs/ovl_entry.h
index 1b5a2094df8e..f4285da50525 100644
--- a/fs/overlayfs/ovl_entry.h
+++ b/fs/overlayfs/ovl_entry.h
@@ -79,6 +79,7 @@ struct ovl_fs {
 	atomic_long_t last_ino;
 	/* Whiteout dentry cache */
 	struct dentry *whiteout;
+	errseq_t errseq;
 };
 
 static inline struct vfsmount *ovl_upper_mnt(struct ovl_fs *ofs)
diff --git a/fs/overlayfs/super.c b/fs/overlayfs/super.c
index 290983bcfbb3..3f0cb91915ff 100644
--- a/fs/overlayfs/super.c
+++ b/fs/overlayfs/super.c
@@ -264,8 +264,16 @@ static int ovl_sync_fs(struct super_block *sb, int wait)
 	if (!ovl_upper_mnt(ofs))
 		return 0;
 
-	if (!ovl_should_sync(ofs))
-		return 0;
+	upper_sb = ovl_upper_mnt(ofs)->mnt_sb;
+
+	if (!ovl_should_sync(ofs)) {
+		/* Propagate errors from upper to overlayfs */
+		ret = errseq_check(&upper_sb->s_wb_err, ofs->errseq);
+		if (ret)
+			errseq_set(&sb->s_wb_err, ret);
+		return ret;
+	}
+
 	/*
 	 * Not called for sync(2) call or an emergency sync (SB_I_SKIP_SYNC).
 	 * All the super blocks will be iterated, including upper_sb.
@@ -277,8 +285,6 @@ static int ovl_sync_fs(struct super_block *sb, int wait)
 	if (!wait)
 		return 0;
 
-	upper_sb = ovl_upper_mnt(ofs)->mnt_sb;
-
 	down_read(&upper_sb->s_umount);
 	ret = sync_filesystem(upper_sb);
 	up_read(&upper_sb->s_umount);
@@ -1945,8 +1951,11 @@ static int ovl_fill_super(struct super_block *sb, void *data, int silent)
 
 		sb->s_stack_depth = ovl_upper_mnt(ofs)->mnt_sb->s_stack_depth;
 		sb->s_time_gran = ovl_upper_mnt(ofs)->mnt_sb->s_time_gran;
-
 	}
+
+	if (ofs->config.ovl_volatile)
+		ofs->errseq = errseq_peek(&ovl_upper_mnt(ofs)->mnt_sb->s_wb_err);
+
 	oe = ovl_get_lowerstack(sb, splitlower, numlower, ofs, layers);
 	err = PTR_ERR(oe);
 	if (IS_ERR(oe))
-- 
2.29.2



* Re: [RFC PATCH v2 2/2] overlayfs: propagate errors from upper to overlay sb in sync_fs
  2020-12-14 22:14 ` [RFC PATCH v2 2/2] overlayfs: propagate errors from upper to overlay sb in sync_fs Jeff Layton
@ 2020-12-15 16:30   ` Vivek Goyal
  2020-12-15 16:43     ` Jeff Layton
  0 siblings, 1 reply; 6+ messages in thread
From: Vivek Goyal @ 2020-12-15 16:30 UTC
  To: Jeff Layton
  Cc: Amir Goldstein, Sargun Dhillon, Miklos Szeredi, overlayfs,
	Linux FS-devel Mailing List, Matthew Wilcox, NeilBrown, Jan Kara

On Mon, Dec 14, 2020 at 05:14:21PM -0500, Jeff Layton wrote:
> Peek at the upper layer's errseq_t at mount time for volatile mounts,
> and record it in the per-sb info. In sync_fs, check for an error since
> the recorded point and set it in the overlayfs superblock if there was
> one.
> 
> Signed-off-by: Jeff Layton <jlayton@kernel.org>
> ---
>  fs/overlayfs/ovl_entry.h |  1 +
>  fs/overlayfs/super.c     | 19 ++++++++++++++-----
>  2 files changed, 15 insertions(+), 5 deletions(-)
> 
> diff --git a/fs/overlayfs/ovl_entry.h b/fs/overlayfs/ovl_entry.h
> index 1b5a2094df8e..f4285da50525 100644
> --- a/fs/overlayfs/ovl_entry.h
> +++ b/fs/overlayfs/ovl_entry.h
> @@ -79,6 +79,7 @@ struct ovl_fs {
>  	atomic_long_t last_ino;
>  	/* Whiteout dentry cache */
>  	struct dentry *whiteout;
> +	errseq_t errseq;
>  };
>  
>  static inline struct vfsmount *ovl_upper_mnt(struct ovl_fs *ofs)
> diff --git a/fs/overlayfs/super.c b/fs/overlayfs/super.c
> index 290983bcfbb3..3f0cb91915ff 100644
> --- a/fs/overlayfs/super.c
> +++ b/fs/overlayfs/super.c
> @@ -264,8 +264,16 @@ static int ovl_sync_fs(struct super_block *sb, int wait)
>  	if (!ovl_upper_mnt(ofs))
>  		return 0;
>  
> -	if (!ovl_should_sync(ofs))
> -		return 0;
> +	upper_sb = ovl_upper_mnt(ofs)->mnt_sb;
> +
> +	if (!ovl_should_sync(ofs)) {
> +		/* Propagate errors from upper to overlayfs */
> +		ret = errseq_check(&upper_sb->s_wb_err, ofs->errseq);
> +		if (ret)
> +			errseq_set(&sb->s_wb_err, ret);
> +		return ret;
> +	}
> +

I have a few concerns here. I think ovl_sync_fs() should not behave
differently for volatile and non-volatile mounts. IOW, if an overlayfs
user calls syncfs(fd), then the only difference from a non-volatile
mount is that we will not call sync_filesystem() on the underlying
filesystem. But if there is an existing writeback error, then that
should be reported to the syncfs(fd) caller for both volatile and
non-volatile mounts.

An additional requirement in the case of a volatile mount seems to be
that as soon as we detect the first error, we should probably mark the
whole filesystem bad and start returning errors for overlay operations,
so that the upper layer can be thrown away and the process restarted.

And the final volatile mount requirement seems to be that we want to
detect writeback errors in non-syncfs() paths, e.g. mount(). That's what
Sargun is trying to do: keep a snapshot of the upper_sb errseq on disk
and, upon remount of a volatile overlay, make sure no writeback errors
have happened since then. And that's where I think we should be using
the new errseq_peek() and errseq_check(&upper_sb->s_wb_err, ofs->errseq)
infrastructure. That way we can detect an error on the upper without
consuming it upon overlay remount.

IOW, IMHO, ovl_sync_fs() should use the same mechanism to report errors
to user space for both volatile and non-volatile mounts. And this new
mechanism of peeking at an error without consuming it should be used
in other paths like remount and possibly other overlay operations (if
need be).

But creating a special path in ovl_sync_fs() for volatile mounts
only will create conflicts with error reporting for non-volatile
mounts. And IMHO, these should be the same.

Is there a good reason why we should treat volatile and non-volatile
mounts differently in ovl_sync_fs() from an error detection and
reporting point of view?

Thanks
Vivek

>  	/*
>  	 * Not called for sync(2) call or an emergency sync (SB_I_SKIP_SYNC).
>  	 * All the super blocks will be iterated, including upper_sb.
> @@ -277,8 +285,6 @@ static int ovl_sync_fs(struct super_block *sb, int wait)
>  	if (!wait)
>  		return 0;
>  
> -	upper_sb = ovl_upper_mnt(ofs)->mnt_sb;
> -
>  	down_read(&upper_sb->s_umount);
>  	ret = sync_filesystem(upper_sb);
>  	up_read(&upper_sb->s_umount);
> @@ -1945,8 +1951,11 @@ static int ovl_fill_super(struct super_block *sb, void *data, int silent)
>  
>  		sb->s_stack_depth = ovl_upper_mnt(ofs)->mnt_sb->s_stack_depth;
>  		sb->s_time_gran = ovl_upper_mnt(ofs)->mnt_sb->s_time_gran;
> -
>  	}
> +
> +	if (ofs->config.ovl_volatile)
> +		ofs->errseq = errseq_peek(&ovl_upper_mnt(ofs)->mnt_sb->s_wb_err);
> +
>  	oe = ovl_get_lowerstack(sb, splitlower, numlower, ofs, layers);
>  	err = PTR_ERR(oe);
>  	if (IS_ERR(oe))
> -- 
> 2.29.2
> 


* Re: [RFC PATCH v2 2/2] overlayfs: propagate errors from upper to overlay sb in sync_fs
  2020-12-15 16:30   ` Vivek Goyal
@ 2020-12-15 16:43     ` Jeff Layton
  0 siblings, 0 replies; 6+ messages in thread
From: Jeff Layton @ 2020-12-15 16:43 UTC
  To: Vivek Goyal
  Cc: Amir Goldstein, Sargun Dhillon, Miklos Szeredi, overlayfs,
	Linux FS-devel Mailing List, Matthew Wilcox, NeilBrown, Jan Kara

On Tue, 2020-12-15 at 11:30 -0500, Vivek Goyal wrote:
> On Mon, Dec 14, 2020 at 05:14:21PM -0500, Jeff Layton wrote:
> > Peek at the upper layer's errseq_t at mount time for volatile mounts,
> > and record it in the per-sb info. In sync_fs, check for an error since
> > the recorded point and set it in the overlayfs superblock if there was
> > one.
> > 
> > Signed-off-by: Jeff Layton <jlayton@kernel.org>
> > ---
> >  fs/overlayfs/ovl_entry.h |  1 +
> >  fs/overlayfs/super.c     | 19 ++++++++++++++-----
> >  2 files changed, 15 insertions(+), 5 deletions(-)
> > 
> > diff --git a/fs/overlayfs/ovl_entry.h b/fs/overlayfs/ovl_entry.h
> > index 1b5a2094df8e..f4285da50525 100644
> > --- a/fs/overlayfs/ovl_entry.h
> > +++ b/fs/overlayfs/ovl_entry.h
> > @@ -79,6 +79,7 @@ struct ovl_fs {
> >  	atomic_long_t last_ino;
> >  	/* Whiteout dentry cache */
> >  	struct dentry *whiteout;
> > +	errseq_t errseq;
> >  };
> >  
> >  static inline struct vfsmount *ovl_upper_mnt(struct ovl_fs *ofs)
> > diff --git a/fs/overlayfs/super.c b/fs/overlayfs/super.c
> > index 290983bcfbb3..3f0cb91915ff 100644
> > --- a/fs/overlayfs/super.c
> > +++ b/fs/overlayfs/super.c
> > @@ -264,8 +264,16 @@ static int ovl_sync_fs(struct super_block *sb, int wait)
> >  	if (!ovl_upper_mnt(ofs))
> >  		return 0;
> >  
> > -	if (!ovl_should_sync(ofs))
> > -		return 0;
> > +	upper_sb = ovl_upper_mnt(ofs)->mnt_sb;
> > +
> > +	if (!ovl_should_sync(ofs)) {
> > +		/* Propagate errors from upper to overlayfs */
> > +		ret = errseq_check(&upper_sb->s_wb_err, ofs->errseq);
> > +		if (ret)
> > +			errseq_set(&sb->s_wb_err, ret);
> > +		return ret;
> > +	}
> > +
> 
> I have a few concerns here. I think ovl_sync_fs() should not behave
> differently for volatile and non-volatile mounts. IOW, if an overlayfs
> user calls syncfs(fd), then the only difference from a non-volatile
> mount is that we will not call sync_filesystem() on the underlying
> filesystem. But if there is an existing writeback error, then that
> should be reported to the syncfs(fd) caller for both volatile and
> non-volatile mounts.
> 
> An additional requirement in the case of a volatile mount seems to be
> that as soon as we detect the first error, we should probably mark the
> whole filesystem bad and start returning errors for overlay operations,
> so that the upper layer can be thrown away and the process restarted.
> 

That was the reason the patch did the errseq_set on every sync_fs
invocation for a volatile mount. That should ensure that syncfs always
returns an error. Still, there probably are cleaner ways to do this...

> And the final volatile mount requirement seems to be that we want to
> detect writeback errors in non-syncfs() paths, e.g. mount(). That's what
> Sargun is trying to do: keep a snapshot of the upper_sb errseq on disk
> and, upon remount of a volatile overlay, make sure no writeback errors
> have happened since then. And that's where I think we should be using
> the new errseq_peek() and errseq_check(&upper_sb->s_wb_err, ofs->errseq)
> infrastructure. That way we can detect an error on the upper without
> consuming it upon overlay remount.
> 
> IOW, IMHO, ovl_sync_fs() should use the same mechanism to report errors
> to user space for both volatile and non-volatile mounts. And this new
> mechanism of peeking at an error without consuming it should be used
> in other paths like remount and possibly other overlay operations (if
> need be).
> 
> But creating a special path in ovl_sync_fs() for volatile mounts
> only will create conflicts with error reporting for non-volatile
> mounts. And IMHO, these should be the same.
> 
> Is there a good reason why we should treat volatile and non-volatile
> mounts differently in ovl_sync_fs() from an error detection and
> reporting point of view?
> 

Fair enough. I'm not that well-versed in overlayfs, so if you see a
better way to do this, then that's fine by me. I just sent this out as a
demonstration of how you could do it. Feel free to drop the second
patch.

I think the simplest solution to most of these issues is to add a new
f_op->syncfs vector. You shouldn't need to propagate errors to the ovl
sb at all if you add that. You can just operate on the upper sb's
s_wb_err, and ignore the one in the ovl sb.

> >  	/*
> >  	 * Not called for sync(2) call or an emergency sync (SB_I_SKIP_SYNC).
> >  	 * All the super blocks will be iterated, including upper_sb.
> > @@ -277,8 +285,6 @@ static int ovl_sync_fs(struct super_block *sb, int wait)
> >  	if (!wait)
> >  		return 0;
> >  
> > -	upper_sb = ovl_upper_mnt(ofs)->mnt_sb;
> > -
> >  	down_read(&upper_sb->s_umount);
> >  	ret = sync_filesystem(upper_sb);
> >  	up_read(&upper_sb->s_umount);
> > @@ -1945,8 +1951,11 @@ static int ovl_fill_super(struct super_block *sb, void *data, int silent)
> >  
> >  		sb->s_stack_depth = ovl_upper_mnt(ofs)->mnt_sb->s_stack_depth;
> >  		sb->s_time_gran = ovl_upper_mnt(ofs)->mnt_sb->s_time_gran;
> > -
> >  	}
> > +
> > +	if (ofs->config.ovl_volatile)
> > +		ofs->errseq = errseq_peek(&ovl_upper_mnt(ofs)->mnt_sb->s_wb_err);
> > +
> >  	oe = ovl_get_lowerstack(sb, splitlower, numlower, ofs, layers);
> >  	err = PTR_ERR(oe);
> >  	if (IS_ERR(oe))
> > -- 
> > 2.29.2
> > 
> 

-- 
Jeff Layton <jlayton@redhat.com>



* Re: [RFC PATCH v2 1/2] errseq: split the SEEN flag into two new flags
  2020-12-14 22:14 ` [RFC PATCH v2 1/2] errseq: split the SEEN flag into two new flags Jeff Layton
@ 2020-12-16 23:51   ` NeilBrown
  0 siblings, 0 replies; 6+ messages in thread
From: NeilBrown @ 2020-12-16 23:51 UTC
  To: Jeff Layton, Amir Goldstein, Sargun Dhillon
  Cc: Miklos Szeredi, Vivek Goyal, overlayfs,
	Linux FS-devel Mailing List, Matthew Wilcox, NeilBrown, Jan Kara

On Mon, Dec 14 2020, Jeff Layton wrote:

> Overlayfs's volatile mounts want to be able to sample an error for
> their own purposes, without preventing a later opener from potentially
> seeing the error.
>
> The original reason for the SEEN flag was to make it so that we didn't
> need to increment the counter if nothing had observed the latest value
> and the error was the same. Eventually, a regression was reported in
> the errseq_t conversion, and we fixed that by using the SEEN flag to
> also mean that the error had been reported to userland at least once
> somewhere.
>
> Those are two different states, however. If we instead take a second
> flag bit from the counter, we can track these two things separately,
> and accommodate the overlayfs volatile mount use-case.
>
> Add a new MUSTINC flag that indicates that the counter must be
> incremented the next time an error is set, and rework the errseq
> functions to set and clear that flag whenever the SEEN bit is set or
> cleared.
>
> Test only for the MUSTINC bit when deciding whether to increment the
> counter and only for the SEEN bit when deciding what to return in
> errseq_sample.
>
> Add a new errseq_peek function to allow for the overlayfs use-case.
> This just grabs the latest counter and sets the MUSTINC bit, leaving
> the SEEN bit untouched.
>
> errseq_check_and_advance must now handle a single special case where
> it races against a "peek" of an as of yet unseen value. The do/while
> loop looks scary, but shouldn't loop more than once.
>
> Signed-off-by: Jeff Layton <jlayton@kernel.org>
> ---
>  Documentation/core-api/errseq.rst |  22 +++--
>  include/linux/errseq.h            |   2 +
>  lib/errseq.c                      | 136 ++++++++++++++++++++++--------
>  3 files changed, 117 insertions(+), 43 deletions(-)
>
> diff --git a/Documentation/core-api/errseq.rst b/Documentation/core-api/errseq.rst
> index ff332e272405..43a4042a0546 100644
> --- a/Documentation/core-api/errseq.rst
> +++ b/Documentation/core-api/errseq.rst
> @@ -18,18 +18,22 @@ these functions can be called from any context.
>  Note that there is a risk of collisions if new errors are being recorded
>  frequently, since we have so few bits to use as a counter.
>  
> -To mitigate this, the bit between the error value and counter is used as
> -a flag to tell whether the value has been sampled since a new value was
> -recorded.  That allows us to avoid bumping the counter if no one has
> -sampled it since the last time an error was recorded.
> +To mitigate this, the bits between the error value and counter are used
> +as flags to tell whether the value has been sampled since a new value
> +was recorded, and whether the latest error has been seen by userland.
> +That allows us to avoid bumping the counter if no one has sampled it
> +since the last time an error was recorded, and also ensures that any
> +recorded error will be seen at least once.
>  
>  Thus we end up with a value that looks something like this:
>  
> -+--------------------------------------+----+------------------------+
> -| 31..13                               | 12 | 11..0                  |
> -+--------------------------------------+----+------------------------+
> -| counter                              | SF | errno                  |
> -+--------------------------------------+----+------------------------+
> ++---------------------------------+----+----+------------------------+
> +| 31..13                          | 13 | 12 | 11..0                  |

  31..14 :-)
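
For reference, the arithmetic behind that correction can be sketched in
user space. This is only an illustration, not kernel code: it assumes
MAX_ERRNO is 4095 (so ERRSEQ_SHIFT works out to 12), redefines BIT()
locally, and lowest_set_bit()/highest_set_bit() are hypothetical helpers
that exist only for this example.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: MAX_ERRNO is 4095, so ilog2(MAX_ERRNO + 1) is 12. */
#define MAX_ERRNO        4095u
#define ERRSEQ_SHIFT     12u

/* Local stand-in for the kernel's BIT() macro. */
#define BIT(n)           (UINT32_C(1) << (n))

#define ERRSEQ_SEEN      BIT(ERRSEQ_SHIFT)        /* bit 12 */
#define ERRSEQ_MUSTINC   BIT(ERRSEQ_SHIFT + 1)    /* bit 13 */
#define ERRSEQ_CTR_INC   BIT(ERRSEQ_SHIFT + 2)    /* bit 14 */
#define ERRSEQ_CTR_MASK  (~(ERRSEQ_CTR_INC - 1))  /* bits 31..14 */
#define ERRSEQ_FLAG_MASK (ERRSEQ_SEEN | ERRSEQ_MUSTINC)

/* Hypothetical helper: index of the lowest set bit of a non-zero mask. */
static unsigned int lowest_set_bit(uint32_t mask)
{
	unsigned int n = 0;

	assert(mask != 0);
	while (!(mask & 1u)) {
		mask >>= 1;
		n++;
	}
	return n;
}

/* Hypothetical helper: index of the highest set bit of a non-zero mask. */
static unsigned int highest_set_bit(uint32_t mask)
{
	unsigned int n = 0;

	assert(mask != 0);
	while (mask >>= 1)
		n++;
	return n;
}
```

With two flag bits at 12 and 13, the lowest counter bit is
ERRSEQ_CTR_INC at bit 14, so the counter column of the table spans
31..14.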

Otherwise this all seems to make sense.

Reviewed-by: NeilBrown <neilb@suse.de>

Thanks,
NeilBrown



> ++---------------------------------+----+----+------------------------+
> +| counter                         | MF | SF | errno                  |
> ++---------------------------------+----+----+------------------------+
> +SF = ERRSEQ_SEEN flag
> +MI = ERRSEQ_MUSTINC flag
>  
>  The general idea is for "watchers" to sample an errseq_t value and keep
>  it as a running cursor.  That value can later be used to tell whether
> diff --git a/include/linux/errseq.h b/include/linux/errseq.h
> index fc2777770768..6d4b9bc629ac 100644
> --- a/include/linux/errseq.h
> +++ b/include/linux/errseq.h
> @@ -9,6 +9,8 @@ typedef u32	errseq_t;
>  
>  errseq_t errseq_set(errseq_t *eseq, int err);
>  errseq_t errseq_sample(errseq_t *eseq);
> +errseq_t errseq_peek(errseq_t *eseq);
> +errseq_t errseq_sample_advance(errseq_t *eseq);
>  int errseq_check(errseq_t *eseq, errseq_t since);
>  int errseq_check_and_advance(errseq_t *eseq, errseq_t *since);
>  #endif
> diff --git a/lib/errseq.c b/lib/errseq.c
> index 81f9e33aa7e7..cee9f6b45725 100644
> --- a/lib/errseq.c
> +++ b/lib/errseq.c
> @@ -21,10 +21,14 @@
>   * Note that there is a risk of collisions if new errors are being recorded
>   * frequently, since we have so few bits to use as a counter.
>   *
> - * To mitigate this, one bit is used as a flag to tell whether the value has
> - * been sampled since a new value was recorded. That allows us to avoid bumping
> - * the counter if no one has sampled it since the last time an error was
> - * recorded.
> + * To mitigate this, one bit is used as a flag to tell whether the value has been
> + * observed in some fashion. That allows us to avoid bumping the counter if no
> + * one has sampled it since the last time an error was recorded.
> + *
> + * A second flag bit is used to indicate whether the latest error that has been
> + * recorded has been reported to userland. If the SEEN bit is not set when the
> + * file is opened, then we ensure that the opener will see the error by setting
> + * its sample to 0.
>   *
>   * A new errseq_t should always be zeroed out.  A errseq_t value of all zeroes
>   * is the special (but common) case where there has never been an error. An all
> @@ -36,10 +40,32 @@
>  #define ERRSEQ_SHIFT		ilog2(MAX_ERRNO + 1)
>  
>  /* This bit is used as a flag to indicate whether the value has been seen */
> -#define ERRSEQ_SEEN		(1 << ERRSEQ_SHIFT)
> +#define ERRSEQ_SEEN		BIT(ERRSEQ_SHIFT)
> +
> +/* This bit indicates that value must be incremented even when error is same */
> +#define ERRSEQ_MUSTINC		BIT(ERRSEQ_SHIFT + 1)
>  
>  /* The lowest bit of the counter */
> -#define ERRSEQ_CTR_INC		(1 << (ERRSEQ_SHIFT + 1))
> +#define ERRSEQ_CTR_INC		BIT(ERRSEQ_SHIFT + 2)
> +
> +/* Mask that just contains the counter bits */
> +#define ERRSEQ_CTR_MASK		~(ERRSEQ_CTR_INC - 1)
> +
> +/* Mask that just contains flags */
> +#define ERRSEQ_FLAG_MASK	(ERRSEQ_SEEN|ERRSEQ_MUSTINC)
> +
> +/**
> + * errseq_same - return true if the errseq counters and values are the same
> + * @a: first errseq
> + * @b: second errseq
> + *
> + * Compare two errseqs and return true if they are the same, ignoring their
> + * flag bits.
> + */
> +static inline bool errseq_same(errseq_t a, errseq_t b)
> +{
> +	return (a & ~ERRSEQ_FLAG_MASK) == (b & ~ERRSEQ_FLAG_MASK);
> +}
>  
>  /**
>   * errseq_set - set a errseq_t for later reporting
> @@ -53,7 +79,7 @@
>   *
>   * Return: The previous value, primarily for debugging purposes. The
>   * return value should not be used as a previously sampled value in later
> - * calls as it will not have the SEEN flag set.
> + * calls as it will not have the MUSTINC flag set.
>   */
>  errseq_t errseq_set(errseq_t *eseq, int err)
>  {
> @@ -77,11 +103,11 @@ errseq_t errseq_set(errseq_t *eseq, int err)
>  	for (;;) {
>  		errseq_t new;
>  
> -		/* Clear out error bits and set new error */
> -		new = (old & ~(MAX_ERRNO|ERRSEQ_SEEN)) | -err;
> +		/* Clear out flag bits and old errors, and set new error */
> +		new = (old & ERRSEQ_CTR_MASK) | -err;
>  
> -		/* Only increment if someone has looked at it */
> -		if (old & ERRSEQ_SEEN)
> +		/* Only increment if we have to */
> +		if (old & ERRSEQ_MUSTINC)
>  			new += ERRSEQ_CTR_INC;
>  
>  		/* If there would be no change, then call it done */
> @@ -108,11 +134,38 @@ errseq_t errseq_set(errseq_t *eseq, int err)
>  EXPORT_SYMBOL(errseq_set);
>  
>  /**
> - * errseq_sample() - Grab current errseq_t value.
> + * errseq_peek - Grab current errseq_t value
> + * @eseq: Pointer to errseq_t to be sampled.
> + *
> + * In some cases, we need to be able to sample the errseq_t, but we're not
> + * in a situation where we can report the value to userland. Use this
> + * function to do that. This ensures that later errors will be recorded,
> + * and that any current errors are reported at least once when it is
> + * next sampled.
> + *
> + * Context: Any context.
> + * Return: The current errseq value.
> + */
> +errseq_t errseq_peek(errseq_t *eseq)
> +{
> +	errseq_t old = READ_ONCE(*eseq);
> +	errseq_t new = old;
> +
> +	if (old != 0) {
> +		new |= ERRSEQ_MUSTINC;
> +		if (old != new)
> +			cmpxchg(eseq, old, new);
> +	}
> +	return new;
> +}
> +EXPORT_SYMBOL(errseq_peek);
> +
> +/**
> + * errseq_sample() - Sample errseq_t value, and ensure that unseen errors are reported
>   * @eseq: Pointer to errseq_t to be sampled.
>   *
>   * This function allows callers to initialise their errseq_t variable.
> - * If the error has been "seen", new callers will not see an old error.
> + * If the latest error has been "seen", new callers will not see an old error.
>   * If there is an unseen error in @eseq, the caller of this function will
>   * see it the next time it checks for an error.
>   *
> @@ -121,12 +174,11 @@ EXPORT_SYMBOL(errseq_set);
>   */
>  errseq_t errseq_sample(errseq_t *eseq)
>  {
> -	errseq_t old = READ_ONCE(*eseq);
> +	errseq_t new = errseq_peek(eseq);
>  
> -	/* If nobody has seen this error yet, then we can be the first. */
> -	if (!(old & ERRSEQ_SEEN))
> -		old = 0;
> -	return old;
> +	if (!(new & ERRSEQ_SEEN))
> +		return 0;
> +	return new;
>  }
>  EXPORT_SYMBOL(errseq_sample);
>  
> @@ -145,7 +197,7 @@ int errseq_check(errseq_t *eseq, errseq_t since)
>  {
>  	errseq_t cur = READ_ONCE(*eseq);
>  
> -	if (likely(cur == since))
> +	if (errseq_same(cur, since))
>  		return 0;
>  	return -(cur & MAX_ERRNO);
>  }
> @@ -159,9 +211,9 @@ EXPORT_SYMBOL(errseq_check);
>   * Grab the eseq value, and see whether it matches the value that @since
>   * points to. If it does, then just return 0.
>   *
> - * If it doesn't, then the value has changed. Set the "seen" flag, and try to
> - * swap it into place as the new eseq value. Then, set that value as the new
> - * "since" value, and return whatever the error portion is set to.
> + * If it doesn't, then the value has changed. Set the SEEN+MUSTINC flags, and
> + * try to swap it into place as the new eseq value. Then, set that value as
> + * the new "since" value, and return whatever the error portion is set to.
>   *
>   * Note that no locking is provided here for concurrent updates to the "since"
>   * value. The caller must provide that if necessary. Because of this, callers
> @@ -183,21 +235,37 @@ int errseq_check_and_advance(errseq_t *eseq, errseq_t *since)
>  	 */
>  	old = READ_ONCE(*eseq);
>  	if (old != *since) {
> +		int loops = 0;
> +
>  		/*
> -		 * Set the flag and try to swap it into place if it has
> -		 * changed.
> +		 * Set the flag and try to swap it into place if it has changed.
> +		 *
> +		 * If the swap doesn't occur, then it has either been updated by a
> +		 * writer who is setting a new error and/or bumping the counter, or
> +		 * another reader who is setting flags.
>  		 *
> -		 * We don't care about the outcome of the swap here. If the
> -		 * swap doesn't occur, then it has either been updated by a
> -		 * writer who is altering the value in some way (updating
> -		 * counter or resetting the error), or another reader who is
> -		 * just setting the "seen" flag. Either outcome is OK, and we
> -		 * can advance "since" and return an error based on what we
> -		 * have.
> +		 * We only need to retry in one case -- if we raced with another
> +		 * reader that is only setting the MUSTINC flag. We need the
> +		 * current value to have the SEEN bit set if the other fields
> +		 * didn't change, or we might report the same error twice.
>  		 */
> -		new = old | ERRSEQ_SEEN;
> -		if (new != old)
> -			cmpxchg(eseq, old, new);
> +		do {
> +			if (unlikely(loops >= 2)) {
> +				/*
> +				 * This should never loop more than once, as any
> +				 * change not involving the SEEN bit would also
> +				 * involve non-flag bits. WARN and just go with
> +				 * what we have in that case.
> +				 */
> +				WARN_ON_ONCE(true);
> +				break;
> +			}
> +			loops++;
> +			new = old | ERRSEQ_SEEN | ERRSEQ_MUSTINC;
> +			if (new == old)
> +				break;
> +			old = cmpxchg(eseq, old, new);
> +		} while (old == (new & ~ERRSEQ_SEEN));
>  		*since = new;
>  		err = -(new & MAX_ERRNO);
>  	}
> -- 
> 2.29.2


Thread overview: 6+ messages
2020-12-14 22:14 [RFC PATCH v2 0/2] errseq+overlayfs: accomodate the volatile upper layer use-case Jeff Layton
2020-12-14 22:14 ` [RFC PATCH v2 1/2] errseq: split the SEEN flag into two new flags Jeff Layton
2020-12-16 23:51   ` NeilBrown
2020-12-14 22:14 ` [RFC PATCH v2 2/2] overlayfs: propagate errors from upper to overlay sb in sync_fs Jeff Layton
2020-12-15 16:30   ` Vivek Goyal
2020-12-15 16:43     ` Jeff Layton
