linux-unionfs.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH v3] errseq: split the ERRSEQ_SEEN flag into two new flags
@ 2020-12-17 15:00 Jeff Layton
  2020-12-17 20:35 ` Sargun Dhillon
  2020-12-19  6:13 ` Matthew Wilcox
  0 siblings, 2 replies; 13+ messages in thread
From: Jeff Layton @ 2020-12-17 15:00 UTC (permalink / raw)
  To: Amir Goldstein, Sargun Dhillon
  Cc: Miklos Szeredi, Vivek Goyal, overlayfs,
	Linux FS-devel Mailing List, Matthew Wilcox, NeilBrown, Jan Kara

Overlayfs's volatile mounts want to be able to sample an error for their
own purposes, without preventing a later opener from potentially seeing
the error.

The original reason for the ERRSEQ_SEEN flag was to make it so that we
didn't need to increment the counter if nothing had observed the latest
value and the error was the same. Eventually, a regression was reported
in the errseq_t conversion, and we fixed that by using the ERRSEQ_SEEN
flag to also mean that the error had been reported to userland at least
once somewhere.

Those are two different states, however. If we instead take a second
flag bit from the counter, we can track these two things separately, and
accomodate the overlayfs volatile mount use-case.

Rename the ERRSEQ_SEEN flag to ERRSEQ_OBSERVED and use that to indicate
that the counter must be incremented the next time an error is set.
Also, add a new ERRSEQ_REPORTED flag that indicates whether the current
error was returned to userland (and thus doesn't need to be reported on
newly open file descriptions).

Test only for the OBSERVED bit when deciding whether to increment the
counter and only for the REPORTED bit when deciding what to return in
errseq_sample.

Add a new errseq_peek function to allow for the overlayfs use-case.
This just grabs the latest counter and sets the OBSERVED bit, leaving the
REPORTED bit untouched.

errseq_check_and_advance must now handle a single special case where
it races against a "peek" of an as of yet unseen value. The do/while
loop looks scary, but shouldn't loop more than once.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
---
 Documentation/core-api/errseq.rst |  22 +++--
 include/linux/errseq.h            |   1 +
 lib/errseq.c                      | 139 ++++++++++++++++++++++--------
 3 files changed, 118 insertions(+), 44 deletions(-)

v3: rename SEEN/MUSTINC flags to REPORTED/OBSERVED

Hopefully the new flag names will make this a bit more clear. We could
also rename some of the functions if that helps too. We could consider
moving from errseq_sample/_check_and_advance to
errseq_observe/errseq_report?  I'm not sure that helps anything though.

I know that Vivek and Sargun are working on syncfs() for overlayfs, so
we probably don't want to merge this until that work is ready. I think
the errseq_peek call will need to be part of their solution for volatile
mounts, however, so I'm fine with merging this via the overlayfs tree,
once that work is complete.

diff --git a/Documentation/core-api/errseq.rst b/Documentation/core-api/errseq.rst
index ff332e272405..ce46ddcc1487 100644
--- a/Documentation/core-api/errseq.rst
+++ b/Documentation/core-api/errseq.rst
@@ -18,18 +18,22 @@ these functions can be called from any context.
 Note that there is a risk of collisions if new errors are being recorded
 frequently, since we have so few bits to use as a counter.
 
-To mitigate this, the bit between the error value and counter is used as
-a flag to tell whether the value has been sampled since a new value was
-recorded.  That allows us to avoid bumping the counter if no one has
-sampled it since the last time an error was recorded.
+To mitigate this, the bits between the error value and counter are used
+as flags to tell whether the value has been sampled since a new value
+was recorded, and whether the latest error has been seen by userland.
+That allows us to avoid bumping the counter if no one has sampled it
+since the last time an error was recorded, and also ensures that any
+recorded error will be seen at least once.
 
 Thus we end up with a value that looks something like this:
 
-+--------------------------------------+----+------------------------+
-| 31..13                               | 12 | 11..0                  |
-+--------------------------------------+----+------------------------+
-| counter                              | SF | errno                  |
-+--------------------------------------+----+------------------------+
++---------------------------------+----+----+------------------------+
+| 31..14                          | 13 | 12 | 11..0                  |
++---------------------------------+----+----+------------------------+
+| counter                         | OF | RF | errno                  |
++---------------------------------+----+----+------------------------+
+OF = ERRSEQ_OBSERVED flag
+RF = ERRSEQ_REPORTED flag
 
 The general idea is for "watchers" to sample an errseq_t value and keep
 it as a running cursor.  That value can later be used to tell whether
diff --git a/include/linux/errseq.h b/include/linux/errseq.h
index fc2777770768..7e3634269c95 100644
--- a/include/linux/errseq.h
+++ b/include/linux/errseq.h
@@ -9,6 +9,7 @@ typedef u32	errseq_t;
 
 errseq_t errseq_set(errseq_t *eseq, int err);
 errseq_t errseq_sample(errseq_t *eseq);
+errseq_t errseq_peek(errseq_t *eseq);
 int errseq_check(errseq_t *eseq, errseq_t since);
 int errseq_check_and_advance(errseq_t *eseq, errseq_t *since);
 #endif
diff --git a/lib/errseq.c b/lib/errseq.c
index 81f9e33aa7e7..8fd6be134dcc 100644
--- a/lib/errseq.c
+++ b/lib/errseq.c
@@ -21,10 +21,14 @@
  * Note that there is a risk of collisions if new errors are being recorded
  * frequently, since we have so few bits to use as a counter.
  *
- * To mitigate this, one bit is used as a flag to tell whether the value has
- * been sampled since a new value was recorded. That allows us to avoid bumping
- * the counter if no one has sampled it since the last time an error was
- * recorded.
+ * To mitigate this, one bit is used as a flag to tell whether the value has been
+ * observed in some fashion. That allows us to avoid bumping the counter if no
+ * one has sampled it since the last time an error was recorded.
+ *
+ * A second flag bit is used to indicate whether the latest error that has been
+ * recorded has been reported to userland. If the REPORTED bit is not set when the
+ * file is opened, then we ensure that the opener will see the error by setting
+ * its sample to 0.
  *
  * A new errseq_t should always be zeroed out.  A errseq_t value of all zeroes
  * is the special (but common) case where there has never been an error. An all
@@ -35,11 +39,33 @@
 /* The low bits are designated for error code (max of MAX_ERRNO) */
 #define ERRSEQ_SHIFT		ilog2(MAX_ERRNO + 1)
 
-/* This bit is used as a flag to indicate whether the value has been seen */
-#define ERRSEQ_SEEN		(1 << ERRSEQ_SHIFT)
+/* Flag to indicate whether the value will be or has been reported */
+#define ERRSEQ_REPORTED		BIT(ERRSEQ_SHIFT)
+
+/* Flag to ndicate that error must be recorded */
+#define ERRSEQ_OBSERVED		BIT(ERRSEQ_SHIFT + 1)
 
 /* The lowest bit of the counter */
-#define ERRSEQ_CTR_INC		(1 << (ERRSEQ_SHIFT + 1))
+#define ERRSEQ_CTR_INC		BIT(ERRSEQ_SHIFT + 2)
+
+/* Mask that just contains the counter bits */
+#define ERRSEQ_CTR_MASK		~(ERRSEQ_CTR_INC - 1)
+
+/* Mask that just contains flags */
+#define ERRSEQ_FLAG_MASK	(ERRSEQ_REPORTED|ERRSEQ_OBSERVED)
+
+/**
+ * errseq_same - return true if the errseq counters and values are the same
+ * @a: first errseq
+ * @b: second errseq
+ *
+ * Compare two errseqs and return true if they are the same, ignoring their
+ * flag bits.
+ */
+static inline bool errseq_same(errseq_t a, errseq_t b)
+{
+	return (a & ~ERRSEQ_FLAG_MASK) == (b & ~ERRSEQ_FLAG_MASK);
+}
 
 /**
  * errseq_set - set a errseq_t for later reporting
@@ -53,7 +79,7 @@
  *
  * Return: The previous value, primarily for debugging purposes. The
  * return value should not be used as a previously sampled value in later
- * calls as it will not have the SEEN flag set.
+ * calls as it will not have the OBSERVED flag set.
  */
 errseq_t errseq_set(errseq_t *eseq, int err)
 {
@@ -77,11 +103,11 @@ errseq_t errseq_set(errseq_t *eseq, int err)
 	for (;;) {
 		errseq_t new;
 
-		/* Clear out error bits and set new error */
-		new = (old & ~(MAX_ERRNO|ERRSEQ_SEEN)) | -err;
+		/* Clear out flag bits and old errors, and set new error */
+		new = (old & ERRSEQ_CTR_MASK) | -err;
 
-		/* Only increment if someone has looked at it */
-		if (old & ERRSEQ_SEEN)
+		/* Only increment if we have to */
+		if (old & ERRSEQ_OBSERVED)
 			new += ERRSEQ_CTR_INC;
 
 		/* If there would be no change, then call it done */
@@ -108,11 +134,38 @@ errseq_t errseq_set(errseq_t *eseq, int err)
 EXPORT_SYMBOL(errseq_set);
 
 /**
- * errseq_sample() - Grab current errseq_t value.
+ * errseq_peek - Grab current errseq_t value
+ * @eseq: Pointer to errseq_t to be sampled.
+ *
+ * In some cases, we need to be able to sample the errseq_t, but we're not
+ * in a situation where we can report the value to userland. Use this
+ * function to do that. This ensures that later errors will be recorded,
+ * and that any current errors are reported at least once when it is
+ * next sampled.
+ *
+ * Context: Any context.
+ * Return: The current errseq value.
+ */
+errseq_t errseq_peek(errseq_t *eseq)
+{
+	errseq_t old = READ_ONCE(*eseq);
+	errseq_t new = old;
+
+	if (old != 0) {
+		new |= ERRSEQ_OBSERVED;
+		if (old != new)
+			cmpxchg(eseq, old, new);
+	}
+	return new;
+}
+EXPORT_SYMBOL(errseq_peek);
+
+/**
+ * errseq_sample() - Sample errseq_t value, and ensure that unseen errors are reported
  * @eseq: Pointer to errseq_t to be sampled.
  *
  * This function allows callers to initialise their errseq_t variable.
- * If the error has been "seen", new callers will not see an old error.
+ * If the latest error has been "seen", new callers will not see an old error.
  * If there is an unseen error in @eseq, the caller of this function will
  * see it the next time it checks for an error.
  *
@@ -121,12 +174,11 @@ EXPORT_SYMBOL(errseq_set);
  */
 errseq_t errseq_sample(errseq_t *eseq)
 {
-	errseq_t old = READ_ONCE(*eseq);
+	errseq_t new = errseq_peek(eseq);
 
-	/* If nobody has seen this error yet, then we can be the first. */
-	if (!(old & ERRSEQ_SEEN))
-		old = 0;
-	return old;
+	if (!(new & ERRSEQ_REPORTED))
+		return 0;
+	return new;
 }
 EXPORT_SYMBOL(errseq_sample);
 
@@ -145,7 +197,7 @@ int errseq_check(errseq_t *eseq, errseq_t since)
 {
 	errseq_t cur = READ_ONCE(*eseq);
 
-	if (likely(cur == since))
+	if (errseq_same(cur, since))
 		return 0;
 	return -(cur & MAX_ERRNO);
 }
@@ -159,9 +211,9 @@ EXPORT_SYMBOL(errseq_check);
  * Grab the eseq value, and see whether it matches the value that @since
  * points to. If it does, then just return 0.
  *
- * If it doesn't, then the value has changed. Set the "seen" flag, and try to
- * swap it into place as the new eseq value. Then, set that value as the new
- * "since" value, and return whatever the error portion is set to.
+ * If it doesn't, then the value has changed. Set the REPORTED+OBSERVED flags, and
+ * try to swap it into place as the new eseq value. Then, set that value as
+ * the new "since" value, and return whatever the error portion is set to.
  *
  * Note that no locking is provided here for concurrent updates to the "since"
  * value. The caller must provide that if necessary. Because of this, callers
@@ -183,21 +235,38 @@ int errseq_check_and_advance(errseq_t *eseq, errseq_t *since)
 	 */
 	old = READ_ONCE(*eseq);
 	if (old != *since) {
+		int loops = 0;
+
 		/*
-		 * Set the flag and try to swap it into place if it has
-		 * changed.
+		 * Set the flag and try to swap it into place if it has changed.
+		 *
+		 * If the swap doesn't occur, then it has either been updated by a
+		 * writer who is setting a new error and/or bumping the counter, or
+		 * another reader who is setting flags.
 		 *
-		 * We don't care about the outcome of the swap here. If the
-		 * swap doesn't occur, then it has either been updated by a
-		 * writer who is altering the value in some way (updating
-		 * counter or resetting the error), or another reader who is
-		 * just setting the "seen" flag. Either outcome is OK, and we
-		 * can advance "since" and return an error based on what we
-		 * have.
+		 * We only need to retry in one case -- if we raced with another
+		 * reader that is only setting the OBSERVED flag. We need the
+		 * current value to have the REPORTED bit set if the other fields
+		 * didn't change, or we might report the same error on newly opened
+		 * files.
 		 */
-		new = old | ERRSEQ_SEEN;
-		if (new != old)
-			cmpxchg(eseq, old, new);
+		do {
+			if (unlikely(loops >= 2)) {
+				/*
+				 * This should never loop more than once, as any
+				 * change not involving the REPORTED bit would also
+				 * involve non-flag bits. WARN and just go with
+				 * what we have in that case.
+				 */
+				WARN_ON_ONCE(true);
+				break;
+			}
+			loops++;
+			new = old | ERRSEQ_REPORTED | ERRSEQ_OBSERVED;
+			if (new == old)
+				break;
+			old = cmpxchg(eseq, old, new);
+		} while (old == (new & ~ERRSEQ_REPORTED));
 		*since = new;
 		err = -(new & MAX_ERRNO);
 	}
-- 
2.29.2


^ permalink raw reply related	[flat|nested] 13+ messages in thread

* Re: [PATCH v3] errseq: split the ERRSEQ_SEEN flag into two new flags
  2020-12-17 15:00 [PATCH v3] errseq: split the ERRSEQ_SEEN flag into two new flags Jeff Layton
@ 2020-12-17 20:35 ` Sargun Dhillon
  2020-12-17 21:18   ` Jeff Layton
  2020-12-19  6:13 ` Matthew Wilcox
  1 sibling, 1 reply; 13+ messages in thread
From: Sargun Dhillon @ 2020-12-17 20:35 UTC (permalink / raw)
  To: Jeff Layton
  Cc: Amir Goldstein, Miklos Szeredi, Vivek Goyal, overlayfs,
	Linux FS-devel Mailing List, Matthew Wilcox, NeilBrown, Jan Kara

On Thu, Dec 17, 2020 at 10:00:37AM -0500, Jeff Layton wrote:
> Overlayfs's volatile mounts want to be able to sample an error for their
> own purposes, without preventing a later opener from potentially seeing
> the error.
> 
> The original reason for the ERRSEQ_SEEN flag was to make it so that we
> didn't need to increment the counter if nothing had observed the latest
> value and the error was the same. Eventually, a regression was reported
> in the errseq_t conversion, and we fixed that by using the ERRSEQ_SEEN
> flag to also mean that the error had been reported to userland at least
> once somewhere.
> 
> Those are two different states, however. If we instead take a second
> flag bit from the counter, we can track these two things separately, and
> accomodate the overlayfs volatile mount use-case.
> 
> Rename the ERRSEQ_SEEN flag to ERRSEQ_OBSERVED and use that to indicate
> that the counter must be incremented the next time an error is set.
> Also, add a new ERRSEQ_REPORTED flag that indicates whether the current
> error was returned to userland (and thus doesn't need to be reported on
> newly open file descriptions).
> 
> Test only for the OBSERVED bit when deciding whether to increment the
> counter and only for the REPORTED bit when deciding what to return in
> errseq_sample.
> 
> Add a new errseq_peek function to allow for the overlayfs use-case.
> This just grabs the latest counter and sets the OBSERVED bit, leaving the
> REPORTED bit untouched.
> 
> errseq_check_and_advance must now handle a single special case where
> it races against a "peek" of an as of yet unseen value. The do/while
> loop looks scary, but shouldn't loop more than once.
> 
> Signed-off-by: Jeff Layton <jlayton@kernel.org>
> ---
>  Documentation/core-api/errseq.rst |  22 +++--
>  include/linux/errseq.h            |   1 +
>  lib/errseq.c                      | 139 ++++++++++++++++++++++--------
>  3 files changed, 118 insertions(+), 44 deletions(-)
> 
> v3: rename SEEN/MUSTINC flags to REPORTED/OBSERVED
> 
> Hopefully the new flag names will make this a bit more clear. We could
> also rename some of the functions if that helps too. We could consider
> moving from errseq_sample/_check_and_advance to
> errseq_observe/errseq_report?  I'm not sure that helps anything though.
> 
> I know that Vivek and Sargun are working on syncfs() for overlayfs, so
> we probably don't want to merge this until that work is ready. I think

I disagree. I think that this work can land ahead of that, given that I think 
this is probably backportable to v5.10 without much risk, with the addition of 
your RFC v2 Overlay patch. I think the work proper long-term repair Vivek is 
embarking upon seems like it may be far more invasive.

> the errseq_peek call will need to be part of their solution for volatile
> mounts, however, so I'm fine with merging this via the overlayfs tree,
> once that work is complete.
> 
> diff --git a/Documentation/core-api/errseq.rst b/Documentation/core-api/errseq.rst
> index ff332e272405..ce46ddcc1487 100644
> --- a/Documentation/core-api/errseq.rst
> +++ b/Documentation/core-api/errseq.rst
> @@ -18,18 +18,22 @@ these functions can be called from any context.
>  Note that there is a risk of collisions if new errors are being recorded
>  frequently, since we have so few bits to use as a counter.
>  
> -To mitigate this, the bit between the error value and counter is used as
> -a flag to tell whether the value has been sampled since a new value was
> -recorded.  That allows us to avoid bumping the counter if no one has
> -sampled it since the last time an error was recorded.
> +To mitigate this, the bits between the error value and counter are used
> +as flags to tell whether the value has been sampled since a new value
> +was recorded, and whether the latest error has been seen by userland.
> +That allows us to avoid bumping the counter if no one has sampled it
> +since the last time an error was recorded, and also ensures that any
> +recorded error will be seen at least once.
>  
>  Thus we end up with a value that looks something like this:
>  
> -+--------------------------------------+----+------------------------+
> -| 31..13                               | 12 | 11..0                  |
> -+--------------------------------------+----+------------------------+
> -| counter                              | SF | errno                  |
> -+--------------------------------------+----+------------------------+
> ++---------------------------------+----+----+------------------------+
> +| 31..14                          | 13 | 12 | 11..0                  |
> ++---------------------------------+----+----+------------------------+
> +| counter                         | OF | RF | errno                  |
> ++---------------------------------+----+----+------------------------+
> +OF = ERRSEQ_OBSERVED flag
> +RF = ERRSEQ_REPORTED flag
>  
>  The general idea is for "watchers" to sample an errseq_t value and keep
>  it as a running cursor.  That value can later be used to tell whether
> diff --git a/include/linux/errseq.h b/include/linux/errseq.h
> index fc2777770768..7e3634269c95 100644
> --- a/include/linux/errseq.h
> +++ b/include/linux/errseq.h
> @@ -9,6 +9,7 @@ typedef u32	errseq_t;
>  
>  errseq_t errseq_set(errseq_t *eseq, int err);
>  errseq_t errseq_sample(errseq_t *eseq);
> +errseq_t errseq_peek(errseq_t *eseq);
>  int errseq_check(errseq_t *eseq, errseq_t since);
>  int errseq_check_and_advance(errseq_t *eseq, errseq_t *since);
>  #endif
> diff --git a/lib/errseq.c b/lib/errseq.c
> index 81f9e33aa7e7..8fd6be134dcc 100644
> --- a/lib/errseq.c
> +++ b/lib/errseq.c
> @@ -21,10 +21,14 @@
>   * Note that there is a risk of collisions if new errors are being recorded
>   * frequently, since we have so few bits to use as a counter.
>   *
> - * To mitigate this, one bit is used as a flag to tell whether the value has
> - * been sampled since a new value was recorded. That allows us to avoid bumping
> - * the counter if no one has sampled it since the last time an error was
> - * recorded.
> + * To mitigate this, one bit is used as a flag to tell whether the value has been
> + * observed in some fashion. That allows us to avoid bumping the counter if no
> + * one has sampled it since the last time an error was recorded.
> + *
> + * A second flag bit is used to indicate whether the latest error that has been
> + * recorded has been reported to userland. If the REPORTED bit is not set when the
> + * file is opened, then we ensure that the opener will see the error by setting
> + * its sample to 0.

Since there are only a few places that report to userland (as far as I can tell, 
a bit of usage in ceph), does it make sense to maintain this specific flag that
indicates it's reported to userspace? Instead can userspace keep a snapshot
of the last errseq it reported (say on the superblock), and use that to drive
reports to userspace?

It's a 32-bit sacrifice per SB though, but it means we can get rid of 
errseq_check_and_advance and potentially remove any need for locking and just
rely on cmpxchg.

>   *
>   * A new errseq_t should always be zeroed out.  A errseq_t value of all zeroes
>   * is the special (but common) case where there has never been an error. An all
> @@ -35,11 +39,33 @@
>  /* The low bits are designated for error code (max of MAX_ERRNO) */
>  #define ERRSEQ_SHIFT		ilog2(MAX_ERRNO + 1)
>  
> -/* This bit is used as a flag to indicate whether the value has been seen */
> -#define ERRSEQ_SEEN		(1 << ERRSEQ_SHIFT)
> +/* Flag to indicate whether the value will be or has been reported */
> +#define ERRSEQ_REPORTED		BIT(ERRSEQ_SHIFT)
> +
> +/* Flag to ndicate that error must be recorded */
> +#define ERRSEQ_OBSERVED		BIT(ERRSEQ_SHIFT + 1)
>  
>  /* The lowest bit of the counter */
> -#define ERRSEQ_CTR_INC		(1 << (ERRSEQ_SHIFT + 1))
> +#define ERRSEQ_CTR_INC		BIT(ERRSEQ_SHIFT + 2)
> +
> +/* Mask that just contains the counter bits */
> +#define ERRSEQ_CTR_MASK		~(ERRSEQ_CTR_INC - 1)
> +
> +/* Mask that just contains flags */
> +#define ERRSEQ_FLAG_MASK	(ERRSEQ_REPORTED|ERRSEQ_OBSERVED)
> +
> +/**
> + * errseq_same - return true if the errseq counters and values are the same
> + * @a: first errseq
> + * @b: second errseq
> + *
> + * Compare two errseqs and return true if they are the same, ignoring their
> + * flag bits.
> + */
> +static inline bool errseq_same(errseq_t a, errseq_t b)
> +{
> +	return (a & ~ERRSEQ_FLAG_MASK) == (b & ~ERRSEQ_FLAG_MASK);
> +}
>  
>  /**
>   * errseq_set - set a errseq_t for later reporting
> @@ -53,7 +79,7 @@
>   *
>   * Return: The previous value, primarily for debugging purposes. The
>   * return value should not be used as a previously sampled value in later
> - * calls as it will not have the SEEN flag set.
> + * calls as it will not have the OBSERVED flag set.
>   */
>  errseq_t errseq_set(errseq_t *eseq, int err)
>  {
> @@ -77,11 +103,11 @@ errseq_t errseq_set(errseq_t *eseq, int err)
>  	for (;;) {
>  		errseq_t new;
>  
> -		/* Clear out error bits and set new error */
> -		new = (old & ~(MAX_ERRNO|ERRSEQ_SEEN)) | -err;
> +		/* Clear out flag bits and old errors, and set new error */
> +		new = (old & ERRSEQ_CTR_MASK) | -err;
>  
> -		/* Only increment if someone has looked at it */
> -		if (old & ERRSEQ_SEEN)
> +		/* Only increment if we have to */
> +		if (old & ERRSEQ_OBSERVED)
>  			new += ERRSEQ_CTR_INC;
>  
>  		/* If there would be no change, then call it done */
> @@ -108,11 +134,38 @@ errseq_t errseq_set(errseq_t *eseq, int err)
>  EXPORT_SYMBOL(errseq_set);
>  
>  /**
> - * errseq_sample() - Grab current errseq_t value.
> + * errseq_peek - Grab current errseq_t value
> + * @eseq: Pointer to errseq_t to be sampled.
> + *
> + * In some cases, we need to be able to sample the errseq_t, but we're not
> + * in a situation where we can report the value to userland. Use this
> + * function to do that. This ensures that later errors will be recorded,
> + * and that any current errors are reported at least once when it is
> + * next sampled.
I would add that this function has side effects, and mutates errseq_t,
so callers understand they're mutating data.


> + *
> + * Context: Any context.
> + * Return: The current errseq value.
> + */
> +errseq_t errseq_peek(errseq_t *eseq)
> +{
> +	errseq_t old = READ_ONCE(*eseq);
> +	errseq_t new = old;
> +
> +	if (old != 0) {
> +		new |= ERRSEQ_OBSERVED;
> +		if (old != new)
> +			cmpxchg(eseq, old, new);
> +	}
> +	return new;
> +}
> +EXPORT_SYMBOL(errseq_peek);
> +
> +/**
> + * errseq_sample() - Sample errseq_t value, and ensure that unseen errors are reported
>   * @eseq: Pointer to errseq_t to be sampled.
>   *
>   * This function allows callers to initialise their errseq_t variable.
> - * If the error has been "seen", new callers will not see an old error.
> + * If the latest error has been "seen", new callers will not see an old error.
>   * If there is an unseen error in @eseq, the caller of this function will
>   * see it the next time it checks for an error.
>   *
> @@ -121,12 +174,11 @@ EXPORT_SYMBOL(errseq_set);
>   */
>  errseq_t errseq_sample(errseq_t *eseq)
>  {
> -	errseq_t old = READ_ONCE(*eseq);
> +	errseq_t new = errseq_peek(eseq);
>  
> -	/* If nobody has seen this error yet, then we can be the first. */
> -	if (!(old & ERRSEQ_SEEN))
> -		old = 0;
> -	return old;
> +	if (!(new & ERRSEQ_REPORTED))
> +		return 0;
> +	return new;
>  }
>  EXPORT_SYMBOL(errseq_sample);
>  
> @@ -145,7 +197,7 @@ int errseq_check(errseq_t *eseq, errseq_t since)
>  {
>  	errseq_t cur = READ_ONCE(*eseq);
>  
> -	if (likely(cur == since))
> +	if (errseq_same(cur, since))
>  		return 0;
>  	return -(cur & MAX_ERRNO);
>  }
> @@ -159,9 +211,9 @@ EXPORT_SYMBOL(errseq_check);
>   * Grab the eseq value, and see whether it matches the value that @since
>   * points to. If it does, then just return 0.
>   *
> - * If it doesn't, then the value has changed. Set the "seen" flag, and try to
> - * swap it into place as the new eseq value. Then, set that value as the new
> - * "since" value, and return whatever the error portion is set to.
> + * If it doesn't, then the value has changed. Set the REPORTED+OBSERVED flags, and
> + * try to swap it into place as the new eseq value. Then, set that value as
> + * the new "since" value, and return whatever the error portion is set to.
>   *
>   * Note that no locking is provided here for concurrent updates to the "since"
>   * value. The caller must provide that if necessary. Because of this, callers
> @@ -183,21 +235,38 @@ int errseq_check_and_advance(errseq_t *eseq, errseq_t *since)
>  	 */
>  	old = READ_ONCE(*eseq);
>  	if (old != *since) {
> +		int loops = 0;
> +
>  		/*
> -		 * Set the flag and try to swap it into place if it has
> -		 * changed.
> +		 * Set the flag and try to swap it into place if it has changed.
> +		 *
> +		 * If the swap doesn't occur, then it has either been updated by a
> +		 * writer who is setting a new error and/or bumping the counter, or
> +		 * another reader who is setting flags.
>  		 *
> -		 * We don't care about the outcome of the swap here. If the
> -		 * swap doesn't occur, then it has either been updated by a
> -		 * writer who is altering the value in some way (updating
> -		 * counter or resetting the error), or another reader who is
> -		 * just setting the "seen" flag. Either outcome is OK, and we
> -		 * can advance "since" and return an error based on what we
> -		 * have.
> +		 * We only need to retry in one case -- if we raced with another
> +		 * reader that is only setting the OBSERVED flag. We need the
> +		 * current value to have the REPORTED bit set if the other fields
> +		 * didn't change, or we might report the same error on newly opened
> +		 * files.
>  		 */
> -		new = old | ERRSEQ_SEEN;
> -		if (new != old)
> -			cmpxchg(eseq, old, new);
> +		do {
> +			if (unlikely(loops >= 2)) {
Any reason not to just make this:

if (WARN_ON_ONCE(loops >= 2))
	break

Given WARN_ON_ONCE already does the unlikely.

> +				/*
> +				 * This should never loop more than once, as any
> +				 * change not involving the REPORTED bit would also
> +				 * involve non-flag bits. WARN and just go with
> +				 * what we have in that case.
Couldn't we get a race of errors were being incremented faster than this loop runs?
> +				 */
> +				WARN_ON_ONCE(true);
> +				break;
> +			}
> +			loops++;
> +			new = old | ERRSEQ_REPORTED | ERRSEQ_OBSERVED;
> +			if (new == old)
> +				break;
> +			old = cmpxchg(eseq, old, new);
> +		} while (old == (new & ~ERRSEQ_REPORTED));
>  		*since = new;
>  		err = -(new & MAX_ERRNO);
>  	}
> -- 
> 2.29.2
> 

Thanks,

I'd like to see if we can get this in, and a janky version of the ovl_syncfs 
changes for volatile in stable, and then close the loop on the work that Vivek 
is doing to fix this for all of VFS in a later release.

Do you think that's reasonable?

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH v3] errseq: split the ERRSEQ_SEEN flag into two new flags
  2020-12-17 20:35 ` Sargun Dhillon
@ 2020-12-17 21:18   ` Jeff Layton
  2020-12-18 23:44     ` Sargun Dhillon
  0 siblings, 1 reply; 13+ messages in thread
From: Jeff Layton @ 2020-12-17 21:18 UTC (permalink / raw)
  To: Sargun Dhillon
  Cc: Amir Goldstein, Miklos Szeredi, Vivek Goyal, overlayfs,
	Linux FS-devel Mailing List, Matthew Wilcox, NeilBrown, Jan Kara

On Thu, 2020-12-17 at 20:35 +0000, Sargun Dhillon wrote:
> On Thu, Dec 17, 2020 at 10:00:37AM -0500, Jeff Layton wrote:
> > Overlayfs's volatile mounts want to be able to sample an error for their
> > own purposes, without preventing a later opener from potentially seeing
> > the error.
> > 
> > The original reason for the ERRSEQ_SEEN flag was to make it so that we
> > didn't need to increment the counter if nothing had observed the latest
> > value and the error was the same. Eventually, a regression was reported
> > in the errseq_t conversion, and we fixed that by using the ERRSEQ_SEEN
> > flag to also mean that the error had been reported to userland at least
> > once somewhere.
> > 
> > Those are two different states, however. If we instead take a second
> > flag bit from the counter, we can track these two things separately, and
> > accomodate the overlayfs volatile mount use-case.
> > 
> > Rename the ERRSEQ_SEEN flag to ERRSEQ_OBSERVED and use that to indicate
> > that the counter must be incremented the next time an error is set.
> > Also, add a new ERRSEQ_REPORTED flag that indicates whether the current
> > error was returned to userland (and thus doesn't need to be reported on
> > newly open file descriptions).
> > 
> > Test only for the OBSERVED bit when deciding whether to increment the
> > counter and only for the REPORTED bit when deciding what to return in
> > errseq_sample.
> > 
> > Add a new errseq_peek function to allow for the overlayfs use-case.
> > This just grabs the latest counter and sets the OBSERVED bit, leaving the
> > REPORTED bit untouched.
> > 
> > errseq_check_and_advance must now handle a single special case where
> > it races against a "peek" of an as of yet unseen value. The do/while
> > loop looks scary, but shouldn't loop more than once.
> > 
> > Signed-off-by: Jeff Layton <jlayton@kernel.org>
> > ---
> >  Documentation/core-api/errseq.rst |  22 +++--
> >  include/linux/errseq.h            |   1 +
> >  lib/errseq.c                      | 139 ++++++++++++++++++++++--------
> >  3 files changed, 118 insertions(+), 44 deletions(-)
> > 
> > v3: rename SEEN/MUSTINC flags to REPORTED/OBSERVED
> > 
> > Hopefully the new flag names will make this a bit more clear. We could
> > also rename some of the functions if that helps too. We could consider
> > moving from errseq_sample/_check_and_advance to
> > errseq_observe/errseq_report?  I'm not sure that helps anything though.
> > 
> > I know that Vivek and Sargun are working on syncfs() for overlayfs, so
> > we probably don't want to merge this until that work is ready. I think
> 
> I disagree. I think that this work can land ahead of that, given that I think 
> this is probably backportable to v5.10 without much risk, with the addition of 
> your RFC v2 Overlay patch. I think the work proper long-term repair Vivek is 
> embarking upon seems like it may be far more invasive.
> 
> > the errseq_peek call will need to be part of their solution for volatile
> > mounts, however, so I'm fine with merging this via the overlayfs tree,
> > once that work is complete.
> > 
> > diff --git a/Documentation/core-api/errseq.rst b/Documentation/core-api/errseq.rst
> > index ff332e272405..ce46ddcc1487 100644
> > --- a/Documentation/core-api/errseq.rst
> > +++ b/Documentation/core-api/errseq.rst
> > @@ -18,18 +18,22 @@ these functions can be called from any context.
> >  Note that there is a risk of collisions if new errors are being recorded
> >  frequently, since we have so few bits to use as a counter.
> >  
> > 
> > 
> > 
> > -To mitigate this, the bit between the error value and counter is used as
> > -a flag to tell whether the value has been sampled since a new value was
> > -recorded.  That allows us to avoid bumping the counter if no one has
> > -sampled it since the last time an error was recorded.
> > +To mitigate this, the bits between the error value and counter are used
> > +as flags to tell whether the value has been sampled since a new value
> > +was recorded, and whether the latest error has been seen by userland.
> > +That allows us to avoid bumping the counter if no one has sampled it
> > +since the last time an error was recorded, and also ensures that any
> > +recorded error will be seen at least once.
> >  
> > 
> > 
> > 
> >  Thus we end up with a value that looks something like this:
> >  
> > 
> > 
> > 
> > -+--------------------------------------+----+------------------------+
> > -| 31..13                               | 12 | 11..0                  |
> > -+--------------------------------------+----+------------------------+
> > -| counter                              | SF | errno                  |
> > -+--------------------------------------+----+------------------------+
> > ++---------------------------------+----+----+------------------------+
> > +| 31..14                          | 13 | 12 | 11..0                  |
> > ++---------------------------------+----+----+------------------------+
> > +| counter                         | OF | RF | errno                  |
> > ++---------------------------------+----+----+------------------------+
> > +OF = ERRSEQ_OBSERVED flag
> > +RF = ERRSEQ_REPORTED flag
> >  
> > 
> > 
> > 
> >  The general idea is for "watchers" to sample an errseq_t value and keep
> >  it as a running cursor.  That value can later be used to tell whether
> > diff --git a/include/linux/errseq.h b/include/linux/errseq.h
> > index fc2777770768..7e3634269c95 100644
> > --- a/include/linux/errseq.h
> > +++ b/include/linux/errseq.h
> > @@ -9,6 +9,7 @@ typedef u32	errseq_t;
> >  
> > 
> > 
> > 
> >  errseq_t errseq_set(errseq_t *eseq, int err);
> >  errseq_t errseq_sample(errseq_t *eseq);
> > +errseq_t errseq_peek(errseq_t *eseq);
> >  int errseq_check(errseq_t *eseq, errseq_t since);
> >  int errseq_check_and_advance(errseq_t *eseq, errseq_t *since);
> >  #endif
> > diff --git a/lib/errseq.c b/lib/errseq.c
> > index 81f9e33aa7e7..8fd6be134dcc 100644
> > --- a/lib/errseq.c
> > +++ b/lib/errseq.c
> > @@ -21,10 +21,14 @@
> >   * Note that there is a risk of collisions if new errors are being recorded
> >   * frequently, since we have so few bits to use as a counter.
> >   *
> > - * To mitigate this, one bit is used as a flag to tell whether the value has
> > - * been sampled since a new value was recorded. That allows us to avoid bumping
> > - * the counter if no one has sampled it since the last time an error was
> > - * recorded.
> > + * To mitigate this, one bit is used as a flag to tell whether the value has been
> > + * observed in some fashion. That allows us to avoid bumping the counter if no
> > + * one has sampled it since the last time an error was recorded.
> > + *
> > + * A second flag bit is used to indicate whether the latest error that has been
> > + * recorded has been reported to userland. If the REPORTED bit is not set when the
> > + * file is opened, then we ensure that the opener will see the error by setting
> > + * its sample to 0.
> 
> Since there are only a few places that report to userland (as far as I can tell, 
> a bit of usage in ceph), does it make sense to maintain this specific flag that
> indicates it's reported to userspace? Instead can userspace keep a snapshot
> of the last errseq it reported (say on the superblock), and use that to drive
> reports to userspace?
> 
> It's a 32-bit sacrifice per SB though, but it means we can get rid of 
> errseq_check_and_advance and potentially remove any need for locking and just
> rely on cmpxchg.

I think it makes sense. You are essentially adding a new class of
"samplers" that use the error for their own purposes and won't be
reporting it to userland via normal channels (syncfs, etc.). A single
bit to indicate whether it has only been observed by such samplers is
not a huge sacrifice.

I worry too about race conditions when tracking this information across
multiple words. You'll either need to use some locking to manage that,
or get clever with memory barriers. Keeping everything in one word makes
things a lot simpler.
-- 
Jeff Layton <jlayton@kernel.org>


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH v3] errseq: split the ERRSEQ_SEEN flag into two new flags
  2020-12-17 21:18   ` Jeff Layton
@ 2020-12-18 23:44     ` Sargun Dhillon
  2020-12-19  1:03       ` Jeff Layton
  0 siblings, 1 reply; 13+ messages in thread
From: Sargun Dhillon @ 2020-12-18 23:44 UTC (permalink / raw)
  To: Jeff Layton
  Cc: Amir Goldstein, Miklos Szeredi, Vivek Goyal, overlayfs,
	Linux FS-devel Mailing List, Matthew Wilcox, NeilBrown, Jan Kara

On Thu, Dec 17, 2020 at 04:18:49PM -0500, Jeff Layton wrote:
> On Thu, 2020-12-17 at 20:35 +0000, Sargun Dhillon wrote:
> > On Thu, Dec 17, 2020 at 10:00:37AM -0500, Jeff Layton wrote:
> > > Overlayfs's volatile mounts want to be able to sample an error for their
> > > own purposes, without preventing a later opener from potentially seeing
> > > the error.
> > > 
> > > The original reason for the ERRSEQ_SEEN flag was to make it so that we
> > > didn't need to increment the counter if nothing had observed the latest
> > > value and the error was the same. Eventually, a regression was reported
> > > in the errseq_t conversion, and we fixed that by using the ERRSEQ_SEEN
> > > flag to also mean that the error had been reported to userland at least
> > > once somewhere.
> > > 
> > > Those are two different states, however. If we instead take a second
> > > flag bit from the counter, we can track these two things separately, and
> > > accomodate the overlayfs volatile mount use-case.
> > > 
> > > Rename the ERRSEQ_SEEN flag to ERRSEQ_OBSERVED and use that to indicate
> > > that the counter must be incremented the next time an error is set.
> > > Also, add a new ERRSEQ_REPORTED flag that indicates whether the current
> > > error was returned to userland (and thus doesn't need to be reported on
> > > newly open file descriptions).
> > > 
> > > Test only for the OBSERVED bit when deciding whether to increment the
> > > counter and only for the REPORTED bit when deciding what to return in
> > > errseq_sample.
> > > 
> > > Add a new errseq_peek function to allow for the overlayfs use-case.
> > > This just grabs the latest counter and sets the OBSERVED bit, leaving the
> > > REPORTED bit untouched.
> > > 
> > > errseq_check_and_advance must now handle a single special case where
> > > it races against a "peek" of an as of yet unseen value. The do/while
> > > loop looks scary, but shouldn't loop more than once.
> > > 
> > > Signed-off-by: Jeff Layton <jlayton@kernel.org>
> > > ---
> > >  Documentation/core-api/errseq.rst |  22 +++--
> > >  include/linux/errseq.h            |   1 +
> > >  lib/errseq.c                      | 139 ++++++++++++++++++++++--------
> > >  3 files changed, 118 insertions(+), 44 deletions(-)
> > > 
> > > v3: rename SEEN/MUSTINC flags to REPORTED/OBSERVED
> > > 
> > > Hopefully the new flag names will make this a bit more clear. We could
> > > also rename some of the functions if that helps too. We could consider
> > > moving from errseq_sample/_check_and_advance to
> > > errseq_observe/errseq_report?  I'm not sure that helps anything though.
> > > 
> > > I know that Vivek and Sargun are working on syncfs() for overlayfs, so
> > > we probably don't want to merge this until that work is ready. I think
> > 
> > I disagree. I think that this work can land ahead of that, given that I think 
> > this is probably backportable to v5.10 without much risk, with the addition of 
> > your RFC v2 Overlay patch. I think the work proper long-term repair Vivek is 
> > embarking upon seems like it may be far more invasive.
> > 
> > > the errseq_peek call will need to be part of their solution for volatile
> > > mounts, however, so I'm fine with merging this via the overlayfs tree,
> > > once that work is complete.
> > > 
> > > diff --git a/Documentation/core-api/errseq.rst b/Documentation/core-api/errseq.rst
> > > index ff332e272405..ce46ddcc1487 100644
> > > --- a/Documentation/core-api/errseq.rst
> > > +++ b/Documentation/core-api/errseq.rst
> > > @@ -18,18 +18,22 @@ these functions can be called from any context.
> > >  Note that there is a risk of collisions if new errors are being recorded
> > >  frequently, since we have so few bits to use as a counter.
> > >  
> > > 
> > > 
> > > 
> > > -To mitigate this, the bit between the error value and counter is used as
> > > -a flag to tell whether the value has been sampled since a new value was
> > > -recorded.  That allows us to avoid bumping the counter if no one has
> > > -sampled it since the last time an error was recorded.
> > > +To mitigate this, the bits between the error value and counter are used
> > > +as flags to tell whether the value has been sampled since a new value
> > > +was recorded, and whether the latest error has been seen by userland.
> > > +That allows us to avoid bumping the counter if no one has sampled it
> > > +since the last time an error was recorded, and also ensures that any
> > > +recorded error will be seen at least once.
> > >  
> > > 
> > > 
> > > 
> > >  Thus we end up with a value that looks something like this:
> > >  
> > > 
> > > 
> > > 
> > > -+--------------------------------------+----+------------------------+
> > > -| 31..13                               | 12 | 11..0                  |
> > > -+--------------------------------------+----+------------------------+
> > > -| counter                              | SF | errno                  |
> > > -+--------------------------------------+----+------------------------+
> > > ++---------------------------------+----+----+------------------------+
> > > +| 31..14                          | 13 | 12 | 11..0                  |
> > > ++---------------------------------+----+----+------------------------+
> > > +| counter                         | OF | RF | errno                  |
> > > ++---------------------------------+----+----+------------------------+
> > > +OF = ERRSEQ_OBSERVED flag
> > > +RF = ERRSEQ_REPORTED flag
> > >  
> > > 
> > > 
> > > 
> > >  The general idea is for "watchers" to sample an errseq_t value and keep
> > >  it as a running cursor.  That value can later be used to tell whether
> > > diff --git a/include/linux/errseq.h b/include/linux/errseq.h
> > > index fc2777770768..7e3634269c95 100644
> > > --- a/include/linux/errseq.h
> > > +++ b/include/linux/errseq.h
> > > @@ -9,6 +9,7 @@ typedef u32	errseq_t;
> > >  
> > > 
> > > 
> > > 
> > >  errseq_t errseq_set(errseq_t *eseq, int err);
> > >  errseq_t errseq_sample(errseq_t *eseq);
> > > +errseq_t errseq_peek(errseq_t *eseq);
> > >  int errseq_check(errseq_t *eseq, errseq_t since);
> > >  int errseq_check_and_advance(errseq_t *eseq, errseq_t *since);
> > >  #endif
> > > diff --git a/lib/errseq.c b/lib/errseq.c
> > > index 81f9e33aa7e7..8fd6be134dcc 100644
> > > --- a/lib/errseq.c
> > > +++ b/lib/errseq.c
> > > @@ -21,10 +21,14 @@
> > >   * Note that there is a risk of collisions if new errors are being recorded
> > >   * frequently, since we have so few bits to use as a counter.
> > >   *
> > > - * To mitigate this, one bit is used as a flag to tell whether the value has
> > > - * been sampled since a new value was recorded. That allows us to avoid bumping
> > > - * the counter if no one has sampled it since the last time an error was
> > > - * recorded.
> > > + * To mitigate this, one bit is used as a flag to tell whether the value has been
> > > + * observed in some fashion. That allows us to avoid bumping the counter if no
> > > + * one has sampled it since the last time an error was recorded.
> > > + *
> > > + * A second flag bit is used to indicate whether the latest error that has been
> > > + * recorded has been reported to userland. If the REPORTED bit is not set when the
> > > + * file is opened, then we ensure that the opener will see the error by setting
> > > + * its sample to 0.
> > 
> > Since there are only a few places that report to userland (as far as I can tell, 
> > a bit of usage in ceph), does it make sense to maintain this specific flag that
> > indicates it's reported to userspace? Instead can userspace keep a snapshot
> > of the last errseq it reported (say on the superblock), and use that to drive
> > reports to userspace?
> > 
> > It's a 32-bit sacrifice per SB though, but it means we can get rid of 
> > errseq_check_and_advance and potentially remove any need for locking and just
> > rely on cmpxchg.
> 
> I think it makes sense. You are essentially adding a new class of
> "samplers" that use the error for their own purposes and won't be
> reporting it to userland via normal channels (syncfs, etc.). A single
> bit to indicate whether it has only been observed by such samplers is
> not a huge sacrifice.
> 
> I worry too about race conditions when tracking this information across
> multiple words. You'll either need to use some locking to manage that,
> or get clever with memory barriers. Keeping everything in one word makes
> things a lot simpler.
> -- 
> Jeff Layton <jlayton@kernel.org>
> 

I'll wait for Amir or Miklos to chime in, but I'm happy with this, and it solves 
my problems.

Do you want to respin this patch with the overlayfs patch as well, so
we can cherry-pick to stable, and then focus on how we want to deal
with this problem in the future?

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH v3] errseq: split the ERRSEQ_SEEN flag into two new flags
  2020-12-18 23:44     ` Sargun Dhillon
@ 2020-12-19  1:03       ` Jeff Layton
  2020-12-19  9:09         ` Amir Goldstein
  0 siblings, 1 reply; 13+ messages in thread
From: Jeff Layton @ 2020-12-19  1:03 UTC (permalink / raw)
  To: Sargun Dhillon
  Cc: Amir Goldstein, Miklos Szeredi, Vivek Goyal, overlayfs,
	Linux FS-devel Mailing List, Matthew Wilcox, NeilBrown, Jan Kara

On Fri, 2020-12-18 at 23:44 +0000, Sargun Dhillon wrote:
> On Thu, Dec 17, 2020 at 04:18:49PM -0500, Jeff Layton wrote:
> > On Thu, 2020-12-17 at 20:35 +0000, Sargun Dhillon wrote:
> > > On Thu, Dec 17, 2020 at 10:00:37AM -0500, Jeff Layton wrote:
> > > > Overlayfs's volatile mounts want to be able to sample an error for their
> > > > own purposes, without preventing a later opener from potentially seeing
> > > > the error.
> > > > 
> > > > The original reason for the ERRSEQ_SEEN flag was to make it so that we
> > > > didn't need to increment the counter if nothing had observed the latest
> > > > value and the error was the same. Eventually, a regression was reported
> > > > in the errseq_t conversion, and we fixed that by using the ERRSEQ_SEEN
> > > > flag to also mean that the error had been reported to userland at least
> > > > once somewhere.
> > > > 
> > > > Those are two different states, however. If we instead take a second
> > > > flag bit from the counter, we can track these two things separately, and
> > > > accomodate the overlayfs volatile mount use-case.
> > > > 
> > > > Rename the ERRSEQ_SEEN flag to ERRSEQ_OBSERVED and use that to indicate
> > > > that the counter must be incremented the next time an error is set.
> > > > Also, add a new ERRSEQ_REPORTED flag that indicates whether the current
> > > > error was returned to userland (and thus doesn't need to be reported on
> > > > newly open file descriptions).
> > > > 
> > > > Test only for the OBSERVED bit when deciding whether to increment the
> > > > counter and only for the REPORTED bit when deciding what to return in
> > > > errseq_sample.
> > > > 
> > > > Add a new errseq_peek function to allow for the overlayfs use-case.
> > > > This just grabs the latest counter and sets the OBSERVED bit, leaving the
> > > > REPORTED bit untouched.
> > > > 
> > > > errseq_check_and_advance must now handle a single special case where
> > > > it races against a "peek" of an as of yet unseen value. The do/while
> > > > loop looks scary, but shouldn't loop more than once.
> > > > 
> > > > Signed-off-by: Jeff Layton <jlayton@kernel.org>
> > > > ---
> > > >  Documentation/core-api/errseq.rst |  22 +++--
> > > >  include/linux/errseq.h            |   1 +
> > > >  lib/errseq.c                      | 139 ++++++++++++++++++++++--------
> > > >  3 files changed, 118 insertions(+), 44 deletions(-)
> > > > 
> > > > v3: rename SEEN/MUSTINC flags to REPORTED/OBSERVED
> > > > 
> > > > Hopefully the new flag names will make this a bit more clear. We could
> > > > also rename some of the functions if that helps too. We could consider
> > > > moving from errseq_sample/_check_and_advance to
> > > > errseq_observe/errseq_report?  I'm not sure that helps anything though.
> > > > 
> > > > I know that Vivek and Sargun are working on syncfs() for overlayfs, so
> > > > we probably don't want to merge this until that work is ready. I think
> > > 
> > > I disagree. I think that this work can land ahead of that, given that I think 
> > > this is probably backportable to v5.10 without much risk, with the addition of 
> > > your RFC v2 Overlay patch. I think the work proper long-term repair Vivek is 
> > > embarking upon seems like it may be far more invasive.
> > > 
> > > > the errseq_peek call will need to be part of their solution for volatile
> > > > mounts, however, so I'm fine with merging this via the overlayfs tree,
> > > > once that work is complete.
> > > > 
> > > > diff --git a/Documentation/core-api/errseq.rst b/Documentation/core-api/errseq.rst
> > > > index ff332e272405..ce46ddcc1487 100644
> > > > --- a/Documentation/core-api/errseq.rst
> > > > +++ b/Documentation/core-api/errseq.rst
> > > > @@ -18,18 +18,22 @@ these functions can be called from any context.
> > > >  Note that there is a risk of collisions if new errors are being recorded
> > > >  frequently, since we have so few bits to use as a counter.
> > > >  
> > > > 
> > > > 
> > > > 
> > > > -To mitigate this, the bit between the error value and counter is used as
> > > > -a flag to tell whether the value has been sampled since a new value was
> > > > -recorded.  That allows us to avoid bumping the counter if no one has
> > > > -sampled it since the last time an error was recorded.
> > > > +To mitigate this, the bits between the error value and counter are used
> > > > +as flags to tell whether the value has been sampled since a new value
> > > > +was recorded, and whether the latest error has been seen by userland.
> > > > +That allows us to avoid bumping the counter if no one has sampled it
> > > > +since the last time an error was recorded, and also ensures that any
> > > > +recorded error will be seen at least once.
> > > >  
> > > > 
> > > > 
> > > > 
> > > >  Thus we end up with a value that looks something like this:
> > > >  
> > > > 
> > > > 
> > > > 
> > > > -+--------------------------------------+----+------------------------+
> > > > -| 31..13                               | 12 | 11..0                  |
> > > > -+--------------------------------------+----+------------------------+
> > > > -| counter                              | SF | errno                  |
> > > > -+--------------------------------------+----+------------------------+
> > > > ++---------------------------------+----+----+------------------------+
> > > > +| 31..14                          | 13 | 12 | 11..0                  |
> > > > ++---------------------------------+----+----+------------------------+
> > > > +| counter                         | OF | RF | errno                  |
> > > > ++---------------------------------+----+----+------------------------+
> > > > +OF = ERRSEQ_OBSERVED flag
> > > > +RF = ERRSEQ_REPORTED flag
> > > >  
> > > > 
> > > > 
> > > > 
> > > >  The general idea is for "watchers" to sample an errseq_t value and keep
> > > >  it as a running cursor.  That value can later be used to tell whether
> > > > diff --git a/include/linux/errseq.h b/include/linux/errseq.h
> > > > index fc2777770768..7e3634269c95 100644
> > > > --- a/include/linux/errseq.h
> > > > +++ b/include/linux/errseq.h
> > > > @@ -9,6 +9,7 @@ typedef u32	errseq_t;
> > > >  
> > > > 
> > > > 
> > > > 
> > > >  errseq_t errseq_set(errseq_t *eseq, int err);
> > > >  errseq_t errseq_sample(errseq_t *eseq);
> > > > +errseq_t errseq_peek(errseq_t *eseq);
> > > >  int errseq_check(errseq_t *eseq, errseq_t since);
> > > >  int errseq_check_and_advance(errseq_t *eseq, errseq_t *since);
> > > >  #endif
> > > > diff --git a/lib/errseq.c b/lib/errseq.c
> > > > index 81f9e33aa7e7..8fd6be134dcc 100644
> > > > --- a/lib/errseq.c
> > > > +++ b/lib/errseq.c
> > > > @@ -21,10 +21,14 @@
> > > >   * Note that there is a risk of collisions if new errors are being recorded
> > > >   * frequently, since we have so few bits to use as a counter.
> > > >   *
> > > > - * To mitigate this, one bit is used as a flag to tell whether the value has
> > > > - * been sampled since a new value was recorded. That allows us to avoid bumping
> > > > - * the counter if no one has sampled it since the last time an error was
> > > > - * recorded.
> > > > + * To mitigate this, one bit is used as a flag to tell whether the value has been
> > > > + * observed in some fashion. That allows us to avoid bumping the counter if no
> > > > + * one has sampled it since the last time an error was recorded.
> > > > + *
> > > > + * A second flag bit is used to indicate whether the latest error that has been
> > > > + * recorded has been reported to userland. If the REPORTED bit is not set when the
> > > > + * file is opened, then we ensure that the opener will see the error by setting
> > > > + * its sample to 0.
> > > 
> > > Since there are only a few places that report to userland (as far as I can tell, 
> > > a bit of usage in ceph), does it make sense to maintain this specific flag that
> > > indicates it's reported to userspace? Instead can userspace keep a snapshot
> > > of the last errseq it reported (say on the superblock), and use that to drive
> > > reports to userspace?
> > > 
> > > It's a 32-bit sacrifice per SB though, but it means we can get rid of 
> > > errseq_check_and_advance and potentially remove any need for locking and just
> > > rely on cmpxchg.
> > 
> > I think it makes sense. You are essentially adding a new class of
> > "samplers" that use the error for their own purposes and won't be
> > reporting it to userland via normal channels (syncfs, etc.). A single
> > bit to indicate whether it has only been observed by such samplers is
> > not a huge sacrifice.
> > 
> > I worry too about race conditions when tracking this information across
> > multiple words. You'll either need to use some locking to manage that,
> > or get clever with memory barriers. Keeping everything in one word makes
> > things a lot simpler.
> > -- 
> > Jeff Layton <jlayton@kernel.org>
> > 
> 
> I'll wait for Amir or Miklos to chime in, but I'm happy with this, and it solves 
> my problems.
> 
> Do you want to respin this patch with the overlayfs patch as well, so
> we can cherry-pick to stable, and then focus on how we want to deal
> with this problem in the future?

Assuming no one sees issues with it and that this solves the problem of
writeback errors on volatile mounts, I'm fine with this going in via the
overlayfs tree, just ahead of the patch that adds the first caller of
errseq_peek.

I think we're finding that the thornier problem is how to pass along
writeback errors on non-volatile mounts. That's probably going to
require some vfs-layer surgery, so it may be best to wait until the
shape of that is clear.
-- 
Jeff Layton <jlayton@kernel.org>


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH v3] errseq: split the ERRSEQ_SEEN flag into two new flags
  2020-12-17 15:00 [PATCH v3] errseq: split the ERRSEQ_SEEN flag into two new flags Jeff Layton
  2020-12-17 20:35 ` Sargun Dhillon
@ 2020-12-19  6:13 ` Matthew Wilcox
  2020-12-19 12:53   ` Jeff Layton
  1 sibling, 1 reply; 13+ messages in thread
From: Matthew Wilcox @ 2020-12-19  6:13 UTC (permalink / raw)
  To: Jeff Layton
  Cc: Amir Goldstein, Sargun Dhillon, Miklos Szeredi, Vivek Goyal,
	overlayfs, Linux FS-devel Mailing List, NeilBrown, Jan Kara

On Thu, Dec 17, 2020 at 10:00:37AM -0500, Jeff Layton wrote:
> Overlayfs's volatile mounts want to be able to sample an error for their
> own purposes, without preventing a later opener from potentially seeing
> the error.

umm ... can't they just copy the errseq_t they're interested in, followed
by calling errseq_check() later?

actually, isn't errseq_check() buggy in the face of multiple
watchers?  consider this:

worker.es starts at 0
t2.es = errseq_sample(&worker.es)
errseq_set(&worker.es, -EIO)
t1.es = errseq_sample(&worker.es)
t2.err = errseq_check_and_advance(&es, t2.es)
	** this sets ERRSEQ_SEEN **
t1.err = errseq_check(&worker.es, t1.es)
	** reports an error, even though the only change is that
	   ERRSEQ_SEEN moved **.

i think errseq_check() should be:

	if (likely(cur | ERRSEQ_SEEN) == (since | ERRSEQ_SEEN))
		return 0;

i'm not yet convinced other changes are needed to errseq.  but i am
having great trouble understanding exactly what overlayfs is trying to do.

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH v3] errseq: split the ERRSEQ_SEEN flag into two new flags
  2020-12-19  1:03       ` Jeff Layton
@ 2020-12-19  9:09         ` Amir Goldstein
  0 siblings, 0 replies; 13+ messages in thread
From: Amir Goldstein @ 2020-12-19  9:09 UTC (permalink / raw)
  To: Jeff Layton
  Cc: Sargun Dhillon, Miklos Szeredi, Vivek Goyal, overlayfs,
	Linux FS-devel Mailing List, Matthew Wilcox, NeilBrown, Jan Kara

On Sat, Dec 19, 2020 at 3:03 AM Jeff Layton <jlayton@kernel.org> wrote:
>
> On Fri, 2020-12-18 at 23:44 +0000, Sargun Dhillon wrote:
> > On Thu, Dec 17, 2020 at 04:18:49PM -0500, Jeff Layton wrote:
> > > On Thu, 2020-12-17 at 20:35 +0000, Sargun Dhillon wrote:
> > > > On Thu, Dec 17, 2020 at 10:00:37AM -0500, Jeff Layton wrote:
> > > > > Overlayfs's volatile mounts want to be able to sample an error for their
> > > > > own purposes, without preventing a later opener from potentially seeing
> > > > > the error.
> > > > >
> > > > > The original reason for the ERRSEQ_SEEN flag was to make it so that we
> > > > > didn't need to increment the counter if nothing had observed the latest
> > > > > value and the error was the same. Eventually, a regression was reported
> > > > > in the errseq_t conversion, and we fixed that by using the ERRSEQ_SEEN
> > > > > flag to also mean that the error had been reported to userland at least
> > > > > once somewhere.
> > > > >
> > > > > Those are two different states, however. If we instead take a second
> > > > > flag bit from the counter, we can track these two things separately, and
> > > > > accomodate the overlayfs volatile mount use-case.
> > > > >
> > > > > Rename the ERRSEQ_SEEN flag to ERRSEQ_OBSERVED and use that to indicate
> > > > > that the counter must be incremented the next time an error is set.
> > > > > Also, add a new ERRSEQ_REPORTED flag that indicates whether the current
> > > > > error was returned to userland (and thus doesn't need to be reported on
> > > > > newly open file descriptions).
> > > > >
> > > > > Test only for the OBSERVED bit when deciding whether to increment the
> > > > > counter and only for the REPORTED bit when deciding what to return in
> > > > > errseq_sample.
> > > > >
> > > > > Add a new errseq_peek function to allow for the overlayfs use-case.
> > > > > This just grabs the latest counter and sets the OBSERVED bit, leaving the
> > > > > REPORTED bit untouched.
> > > > >
> > > > > errseq_check_and_advance must now handle a single special case where
> > > > > it races against a "peek" of an as of yet unseen value. The do/while
> > > > > loop looks scary, but shouldn't loop more than once.
> > > > >
> > > > > Signed-off-by: Jeff Layton <jlayton@kernel.org>
> > > > > ---
> > > > >  Documentation/core-api/errseq.rst |  22 +++--
> > > > >  include/linux/errseq.h            |   1 +
> > > > >  lib/errseq.c                      | 139 ++++++++++++++++++++++--------
> > > > >  3 files changed, 118 insertions(+), 44 deletions(-)
> > > > >
> > > > > v3: rename SEEN/MUSTINC flags to REPORTED/OBSERVED
> > > > >
> > > > > Hopefully the new flag names will make this a bit more clear. We could
> > > > > also rename some of the functions if that helps too. We could consider
> > > > > moving from errseq_sample/_check_and_advance to
> > > > > errseq_observe/errseq_report?  I'm not sure that helps anything though.
> > > > >
> > > > > I know that Vivek and Sargun are working on syncfs() for overlayfs, so
> > > > > we probably don't want to merge this until that work is ready. I think
> > > >
> > > > I disagree. I think that this work can land ahead of that, given that I think
> > > > this is probably backportable to v5.10 without much risk, with the addition of
> > > > your RFC v2 Overlay patch. I think the work proper long-term repair Vivek is
> > > > embarking upon seems like it may be far more invasive.
> > > >
> > > > > the errseq_peek call will need to be part of their solution for volatile
> > > > > mounts, however, so I'm fine with merging this via the overlayfs tree,
> > > > > once that work is complete.
> > > > >
> > > > > diff --git a/Documentation/core-api/errseq.rst b/Documentation/core-api/errseq.rst
> > > > > index ff332e272405..ce46ddcc1487 100644
> > > > > --- a/Documentation/core-api/errseq.rst
> > > > > +++ b/Documentation/core-api/errseq.rst
> > > > > @@ -18,18 +18,22 @@ these functions can be called from any context.
> > > > >  Note that there is a risk of collisions if new errors are being recorded
> > > > >  frequently, since we have so few bits to use as a counter.
> > > > >
> > > > >
> > > > >
> > > > >
> > > > > -To mitigate this, the bit between the error value and counter is used as
> > > > > -a flag to tell whether the value has been sampled since a new value was
> > > > > -recorded.  That allows us to avoid bumping the counter if no one has
> > > > > -sampled it since the last time an error was recorded.
> > > > > +To mitigate this, the bits between the error value and counter are used
> > > > > +as flags to tell whether the value has been sampled since a new value
> > > > > +was recorded, and whether the latest error has been seen by userland.
> > > > > +That allows us to avoid bumping the counter if no one has sampled it
> > > > > +since the last time an error was recorded, and also ensures that any
> > > > > +recorded error will be seen at least once.
> > > > >
> > > > >
> > > > >
> > > > >
> > > > >  Thus we end up with a value that looks something like this:
> > > > >
> > > > >
> > > > >
> > > > >
> > > > > -+--------------------------------------+----+------------------------+
> > > > > -| 31..13                               | 12 | 11..0                  |
> > > > > -+--------------------------------------+----+------------------------+
> > > > > -| counter                              | SF | errno                  |
> > > > > -+--------------------------------------+----+------------------------+
> > > > > ++---------------------------------+----+----+------------------------+
> > > > > +| 31..14                          | 13 | 12 | 11..0                  |
> > > > > ++---------------------------------+----+----+------------------------+
> > > > > +| counter                         | OF | RF | errno                  |
> > > > > ++---------------------------------+----+----+------------------------+
> > > > > +OF = ERRSEQ_OBSERVED flag
> > > > > +RF = ERRSEQ_REPORTED flag
> > > > >
> > > > >
> > > > >
> > > > >
> > > > >  The general idea is for "watchers" to sample an errseq_t value and keep
> > > > >  it as a running cursor.  That value can later be used to tell whether
> > > > > diff --git a/include/linux/errseq.h b/include/linux/errseq.h
> > > > > index fc2777770768..7e3634269c95 100644
> > > > > --- a/include/linux/errseq.h
> > > > > +++ b/include/linux/errseq.h
> > > > > @@ -9,6 +9,7 @@ typedef u32     errseq_t;
> > > > >
> > > > >
> > > > >
> > > > >
> > > > >  errseq_t errseq_set(errseq_t *eseq, int err);
> > > > >  errseq_t errseq_sample(errseq_t *eseq);
> > > > > +errseq_t errseq_peek(errseq_t *eseq);
> > > > >  int errseq_check(errseq_t *eseq, errseq_t since);
> > > > >  int errseq_check_and_advance(errseq_t *eseq, errseq_t *since);
> > > > >  #endif
> > > > > diff --git a/lib/errseq.c b/lib/errseq.c
> > > > > index 81f9e33aa7e7..8fd6be134dcc 100644
> > > > > --- a/lib/errseq.c
> > > > > +++ b/lib/errseq.c
> > > > > @@ -21,10 +21,14 @@
> > > > >   * Note that there is a risk of collisions if new errors are being recorded
> > > > >   * frequently, since we have so few bits to use as a counter.
> > > > >   *
> > > > > - * To mitigate this, one bit is used as a flag to tell whether the value has
> > > > > - * been sampled since a new value was recorded. That allows us to avoid bumping
> > > > > - * the counter if no one has sampled it since the last time an error was
> > > > > - * recorded.
> > > > > + * To mitigate this, one bit is used as a flag to tell whether the value has been
> > > > > + * observed in some fashion. That allows us to avoid bumping the counter if no
> > > > > + * one has sampled it since the last time an error was recorded.
> > > > > + *
> > > > > + * A second flag bit is used to indicate whether the latest error that has been
> > > > > + * recorded has been reported to userland. If the REPORTED bit is not set when the
> > > > > + * file is opened, then we ensure that the opener will see the error by setting
> > > > > + * its sample to 0.
> > > >
> > > > Since there are only a few places that report to userland (as far as I can tell,
> > > > a bit of usage in ceph), does it make sense to maintain this specific flag that
> > > > indicates it's reported to userspace? Instead can userspace keep a snapshot
> > > > of the last errseq it reported (say on the superblock), and use that to drive
> > > > reports to userspace?
> > > >
> > > > It's a 32-bit sacrifice per SB though, but it means we can get rid of
> > > > errseq_check_and_advance and potentially remove any need for locking and just
> > > > rely on cmpxchg.
> > >
> > > I think it makes sense. You are essentially adding a new class of
> > > "samplers" that use the error for their own purposes and won't be
> > > reporting it to userland via normal channels (syncfs, etc.). A single
> > > bit to indicate whether it has only been observed by such samplers is
> > > not a huge sacrifice.
> > >
> > > I worry too about race conditions when tracking this information across
> > > multiple words. You'll either need to use some locking to manage that,
> > > or get clever with memory barriers. Keeping everything in one word makes
> > > things a lot simpler.
> > > --
> > > Jeff Layton <jlayton@kernel.org>
> > >
> >
> > I'll wait for Amir or Miklos to chime in, but I'm happy with this, and it solves
> > my problems.
> >
> > Do you want to respin this patch with the overlayfs patch as well, so
> > we can cherry-pick to stable, and then focus on how we want to deal
> > with this problem in the future?
>
> Assuming no one sees issues with it and that this solves the problem of
> writeback errors on volatile mounts, I'm fine with this going in via the
> overlayfs tree, just ahead of the patch that adds the first caller of
> errseq_peek.
>

I like the ERRSEQ_OBSERVED/ERRSEQ_REPORTED abstraction.
I agree with Jeff that ERRSEQ_SEEN wrongly multiplexies two
completely different things.

We've had to maintain backward compact to the syncfs() behavior
expected by existing users, but I can also imagine that fsinfo() would
want to check for sb error without consuming it, so errseq_peek()
looks like the right direction to take.

> I think we're finding that the thornier problem is how to pass along
> writeback errors on non-volatile mounts. That's probably going to
> require some vfs-layer surgery, so it may be best to wait until the
> shape of that is clear.

I have to say, following the thread of each of those problems is pretty
challenging. Following both issues in several intewinding threads is a
workout...

Thanks,
Amir.

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH v3] errseq: split the ERRSEQ_SEEN flag into two new flags
  2020-12-19  6:13 ` Matthew Wilcox
@ 2020-12-19 12:53   ` Jeff Layton
  2020-12-19 13:25     ` Jeff Layton
                       ` (2 more replies)
  0 siblings, 3 replies; 13+ messages in thread
From: Jeff Layton @ 2020-12-19 12:53 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Amir Goldstein, Sargun Dhillon, Miklos Szeredi, Vivek Goyal,
	overlayfs, Linux FS-devel Mailing List, NeilBrown, Jan Kara

On Sat, 2020-12-19 at 06:13 +0000, Matthew Wilcox wrote:
> On Thu, Dec 17, 2020 at 10:00:37AM -0500, Jeff Layton wrote:
> > Overlayfs's volatile mounts want to be able to sample an error for their
> > own purposes, without preventing a later opener from potentially seeing
> > the error.
> 
> umm ... can't they just copy the errseq_t they're interested in, followed
> by calling errseq_check() later?
> 

They don't want the sampling for the volatile mount to prevent later
openers from seeing an error that hasn't yet been reported.

If they copy the errseq_t (or just do an errseq_sample), and then follow
it with a errseq_check_and_advance then the SEEN bit will end up being
set and a later opener wouldn't see the error.

Aside from that though, I think this patch clarifies things a bit since
the SEEN flag currently means two different things:

1/ do I need to increment the counter when recording another error?

2/ do I need to report this error to new samplers (at open time)

That was ok before, since we those conditions were always changed
together, but with the overlayfs volatile mount use-case, it no longer
does.

> actually, isn't errseq_check() buggy in the face of multiple
> watchers?  consider this:
> 
> worker.es starts at 0
> t2.es = errseq_sample(&worker.es)
> errseq_set(&worker.es, -EIO)
> t1.es = errseq_sample(&worker.es)
> t2.err = errseq_check_and_advance(&es, t2.es)
> 	** this sets ERRSEQ_SEEN **
> t1.err = errseq_check(&worker.es, t1.es)
> 	** reports an error, even though the only change is that
> 	   ERRSEQ_SEEN moved **.
> 
> i think errseq_check() should be:
> 
> 	if (likely(cur | ERRSEQ_SEEN) == (since | ERRSEQ_SEEN))
> 		return 0;
> 
> i'm not yet convinced other changes are needed to errseq.  but i am
> having great trouble understanding exactly what overlayfs is trying to do.

I think you're right on errseq_check. I'll plan to do a patch to fix
that up as well.

I too am having a bit of trouble understanding all of the nuances here.
My current understanding is that it depends on the "volatility" of the
mount:

normal (non-volatile): they basically want to be able to track errors as
if the files were being opened on the upper layer. For this case I think
they should aim to just do all of the error checking against the upper
sb and ignore the overlayfs s_wb_err field. This does mean pushing the
errseq_check_and_advance down into the individual filesystems in some
fashion though.

volatile: they want to sample at mount time and always return an error
to syncfs if there has been another error since the original sample
point. This sampling should also not affect later openers on the upper
layer (or on other overlayfs mounts).

I'm not 100% clear on whether I understand both use-cases correctly
though.
-- 
Jeff Layton <jlayton@kernel.org>


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH v3] errseq: split the ERRSEQ_SEEN flag into two new flags
  2020-12-19 12:53   ` Jeff Layton
@ 2020-12-19 13:25     ` Jeff Layton
  2020-12-19 15:33     ` Matthew Wilcox
  2020-12-21 15:14     ` Vivek Goyal
  2 siblings, 0 replies; 13+ messages in thread
From: Jeff Layton @ 2020-12-19 13:25 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Amir Goldstein, Sargun Dhillon, Miklos Szeredi, Vivek Goyal,
	overlayfs, Linux FS-devel Mailing List, NeilBrown, Jan Kara

On Sat, 2020-12-19 at 07:53 -0500, Jeff Layton wrote:
> On Sat, 2020-12-19 at 06:13 +0000, Matthew Wilcox wrote:
> > On Thu, Dec 17, 2020 at 10:00:37AM -0500, Jeff Layton wrote:
> > > Overlayfs's volatile mounts want to be able to sample an error for their
> > > own purposes, without preventing a later opener from potentially seeing
> > > the error.
> > 
> > umm ... can't they just copy the errseq_t they're interested in, followed
> > by calling errseq_check() later?
> > 
> 
> They don't want the sampling for the volatile mount to prevent later
> openers from seeing an error that hasn't yet been reported.
> 
> If they copy the errseq_t (or just do an errseq_sample), and then follow
> it with a errseq_check_and_advance then the SEEN bit will end up being
> set and a later opener wouldn't see the error.
> 
> Aside from that though, I think this patch clarifies things a bit since
> the SEEN flag currently means two different things:
> 
> 1/ do I need to increment the counter when recording another error?
> 
> 2/ do I need to report this error to new samplers (at open time)
> 
> That was ok before, since we those conditions were always changed
> together, but with the overlayfs volatile mount use-case, it no longer
> does.
> 
> > actually, isn't errseq_check() buggy in the face of multiple
> > watchers?  consider this:
> > 
> > worker.es starts at 0
> > t2.es = errseq_sample(&worker.es)
> > errseq_set(&worker.es, -EIO)
> > t1.es = errseq_sample(&worker.es)
> > t2.err = errseq_check_and_advance(&es, t2.es)
> > 	** this sets ERRSEQ_SEEN **
> > t1.err = errseq_check(&worker.es, t1.es)
> > 	** reports an error, even though the only change is that
> > 	   ERRSEQ_SEEN moved **.
> > 
> > i think errseq_check() should be:
> > 
> > 	if (likely(cur | ERRSEQ_SEEN) == (since | ERRSEQ_SEEN))
> > 		return 0;
> > 
> > i'm not yet convinced other changes are needed to errseq.  but i am
> > having great trouble understanding exactly what overlayfs is trying to do.
> 
> I think you're right on errseq_check. I'll plan to do a patch to fix
> that up as well.
> 

I take it back. This isn't a problem for a couple of (non-obvious)
reasons:

1/ any "real" sample will either have the SEEN bit set or be 0. In the
first case, errseq_check will return true (which is correct). In the
second any change from what you sampled will have to have been a
recorded error.

2/ ...but even if that weren't the case and you got a false positive
from errseq_check, all of the existing callers call
errseq_check_and_advance just afterward, which handles this correctly.

Either way, splitting the flag in two means that we do need to take the
flag bits into account, so errseq_same masks them off before comparing.

> I too am having a bit of trouble understanding all of the nuances here.
> My current understanding is that it depends on the "volatility" of the
> mount:
> 
> normal (non-volatile): they basically want to be able to track errors as
> if the files were being opened on the upper layer. For this case I think
> they should aim to just do all of the error checking against the upper
> sb and ignore the overlayfs s_wb_err field. This does mean pushing the
> errseq_check_and_advance down into the individual filesystems in some
> fashion though.
> 
> volatile: they want to sample at mount time and always return an error
> to syncfs if there has been another error since the original sample
> point. This sampling should also not affect later openers on the upper
> layer (or on other overlayfs mounts).
> 
> I'm not 100% clear on whether I understand both use-cases correctly
> though.

-- 
Jeff Layton <jlayton@kernel.org>


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH v3] errseq: split the ERRSEQ_SEEN flag into two new flags
  2020-12-19 12:53   ` Jeff Layton
  2020-12-19 13:25     ` Jeff Layton
@ 2020-12-19 15:33     ` Matthew Wilcox
  2020-12-19 15:49       ` Jeff Layton
  2020-12-21 15:14     ` Vivek Goyal
  2 siblings, 1 reply; 13+ messages in thread
From: Matthew Wilcox @ 2020-12-19 15:33 UTC (permalink / raw)
  To: Jeff Layton
  Cc: Amir Goldstein, Sargun Dhillon, Miklos Szeredi, Vivek Goyal,
	overlayfs, Linux FS-devel Mailing List, NeilBrown, Jan Kara

On Sat, Dec 19, 2020 at 07:53:12AM -0500, Jeff Layton wrote:
> On Sat, 2020-12-19 at 06:13 +0000, Matthew Wilcox wrote:
> > On Thu, Dec 17, 2020 at 10:00:37AM -0500, Jeff Layton wrote:
> > > Overlayfs's volatile mounts want to be able to sample an error for their
> > > own purposes, without preventing a later opener from potentially seeing
> > > the error.
> > 
> > umm ... can't they just copy the errseq_t they're interested in, followed
> > by calling errseq_check() later?
> > 
> 
> They don't want the sampling for the volatile mount to prevent later
> openers from seeing an error that hasn't yet been reported.

That's why they should use errseq_check(), not errseq_check_and_advance()
...

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH v3] errseq: split the ERRSEQ_SEEN flag into two new flags
  2020-12-19 15:33     ` Matthew Wilcox
@ 2020-12-19 15:49       ` Jeff Layton
  2020-12-19 16:53         ` Matthew Wilcox
  0 siblings, 1 reply; 13+ messages in thread
From: Jeff Layton @ 2020-12-19 15:49 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Amir Goldstein, Sargun Dhillon, Miklos Szeredi, Vivek Goyal,
	overlayfs, Linux FS-devel Mailing List, NeilBrown, Jan Kara

On Sat, 2020-12-19 at 15:33 +0000, Matthew Wilcox wrote:
> On Sat, Dec 19, 2020 at 07:53:12AM -0500, Jeff Layton wrote:
> > On Sat, 2020-12-19 at 06:13 +0000, Matthew Wilcox wrote:
> > > On Thu, Dec 17, 2020 at 10:00:37AM -0500, Jeff Layton wrote:
> > > > Overlayfs's volatile mounts want to be able to sample an error for their
> > > > own purposes, without preventing a later opener from potentially seeing
> > > > the error.
> > > 
> > > umm ... can't they just copy the errseq_t they're interested in, followed
> > > by calling errseq_check() later?
> > > 
> > 
> > They don't want the sampling for the volatile mount to prevent later
> > openers from seeing an error that hasn't yet been reported.
> 
> That's why they should use errseq_check(), not errseq_check_and_advance()
> ...

If you sample it without setting the OBSERVED (aka SEEN) bit, then you
can't guarantee that the next error that occurs will be recorded. The
counter won't be bumped unless that flag is set.

-- 
Jeff Layton <jlayton@kernel.org>


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH v3] errseq: split the ERRSEQ_SEEN flag into two new flags
  2020-12-19 15:49       ` Jeff Layton
@ 2020-12-19 16:53         ` Matthew Wilcox
  0 siblings, 0 replies; 13+ messages in thread
From: Matthew Wilcox @ 2020-12-19 16:53 UTC (permalink / raw)
  To: Jeff Layton
  Cc: Amir Goldstein, Sargun Dhillon, Miklos Szeredi, Vivek Goyal,
	overlayfs, Linux FS-devel Mailing List, NeilBrown, Jan Kara

On Sat, Dec 19, 2020 at 10:49:58AM -0500, Jeff Layton wrote:
> On Sat, 2020-12-19 at 15:33 +0000, Matthew Wilcox wrote:
> > On Sat, Dec 19, 2020 at 07:53:12AM -0500, Jeff Layton wrote:
> > > On Sat, 2020-12-19 at 06:13 +0000, Matthew Wilcox wrote:
> > > > On Thu, Dec 17, 2020 at 10:00:37AM -0500, Jeff Layton wrote:
> > > > > Overlayfs's volatile mounts want to be able to sample an error for their
> > > > > own purposes, without preventing a later opener from potentially seeing
> > > > > the error.
> > > > 
> > > > umm ... can't they just copy the errseq_t they're interested in, followed
> > > > by calling errseq_check() later?
> > > > 
> > > 
> > > They don't want the sampling for the volatile mount to prevent later
> > > openers from seeing an error that hasn't yet been reported.
> > 
> > That's why they should use errseq_check(), not errseq_check_and_advance()
> > ...
> 
> If you sample it without setting the OBSERVED (aka SEEN) bit, then you
> can't guarantee that the next error that occurs will be recorded. The
> counter won't be bumped unless that flag is set.

Ah, right, that's why we set to zero when sampling.

It isn't clear to me that overlayfs doesn't want that behaviour ...
because the overlayfs people have been so very unclear on what they
actually want.

I'm beginning to think we want a test-suite for errseq_t, which would
serve the twin purpose of documenting how to use it and what behaviours
you can get from it, as well as making sure we don't regress anything
when making changes.

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH v3] errseq: split the ERRSEQ_SEEN flag into two new flags
  2020-12-19 12:53   ` Jeff Layton
  2020-12-19 13:25     ` Jeff Layton
  2020-12-19 15:33     ` Matthew Wilcox
@ 2020-12-21 15:14     ` Vivek Goyal
  2 siblings, 0 replies; 13+ messages in thread
From: Vivek Goyal @ 2020-12-21 15:14 UTC (permalink / raw)
  To: Jeff Layton
  Cc: Matthew Wilcox, Amir Goldstein, Sargun Dhillon, Miklos Szeredi,
	overlayfs, Linux FS-devel Mailing List, NeilBrown, Jan Kara

On Sat, Dec 19, 2020 at 07:53:12AM -0500, Jeff Layton wrote:
> On Sat, 2020-12-19 at 06:13 +0000, Matthew Wilcox wrote:
> > On Thu, Dec 17, 2020 at 10:00:37AM -0500, Jeff Layton wrote:
> > > Overlayfs's volatile mounts want to be able to sample an error for their
> > > own purposes, without preventing a later opener from potentially seeing
> > > the error.
> > 
> > umm ... can't they just copy the errseq_t they're interested in, followed
> > by calling errseq_check() later?
> > 
> 
> They don't want the sampling for the volatile mount to prevent later
> openers from seeing an error that hasn't yet been reported.
> 
> If they copy the errseq_t (or just do an errseq_sample), and then follow
> it with a errseq_check_and_advance then the SEEN bit will end up being
> set and a later opener wouldn't see the error.
> 
> Aside from that though, I think this patch clarifies things a bit since
> the SEEN flag currently means two different things:
> 
> 1/ do I need to increment the counter when recording another error?
> 
> 2/ do I need to report this error to new samplers (at open time)
> 
> That was ok before, since we those conditions were always changed
> together, but with the overlayfs volatile mount use-case, it no longer
> does.
> 
> > actually, isn't errseq_check() buggy in the face of multiple
> > watchers?  consider this:
> > 
> > worker.es starts at 0
> > t2.es = errseq_sample(&worker.es)
> > errseq_set(&worker.es, -EIO)
> > t1.es = errseq_sample(&worker.es)
> > t2.err = errseq_check_and_advance(&es, t2.es)
> > 	** this sets ERRSEQ_SEEN **
> > t1.err = errseq_check(&worker.es, t1.es)
> > 	** reports an error, even though the only change is that
> > 	   ERRSEQ_SEEN moved **.
> > 
> > i think errseq_check() should be:
> > 
> > 	if (likely(cur | ERRSEQ_SEEN) == (since | ERRSEQ_SEEN))
> > 		return 0;
> > 
> > i'm not yet convinced other changes are needed to errseq.  but i am
> > having great trouble understanding exactly what overlayfs is trying to do.
> 
> I think you're right on errseq_check. I'll plan to do a patch to fix
> that up as well.
> 
> I too am having a bit of trouble understanding all of the nuances here.
> My current understanding is that it depends on the "volatility" of the
> mount:
> 
> normal (non-volatile): they basically want to be able to track errors as
> if the files were being opened on the upper layer. For this case I think
> they should aim to just do all of the error checking against the upper
> sb and ignore the overlayfs s_wb_err field. This does mean pushing the
> errseq_check_and_advance down into the individual filesystems in some
> fashion though.
> 
> volatile: they want to sample at mount time and always return an error
> to syncfs if there has been another error since the original sample
> point. This sampling should also not affect later openers on the upper
> layer (or on other overlayfs mounts).

I think syncfs() for both volatile and non-volatile mounts should
probably be treated same way and existing erreseq API should be
good enough for that.

So say overlayfs installs file->f_sb_err at file open time. 

ovl_file->f_sb_err = errseq_sample(&upper_sb->s_wb_err)

And at syncfs() time either file system specific ->sync_fs routine
or a separate ->sb_errseq_check() routine does following

ret = errseq_check_and_advance(&upper_sb->s_wb_err, &ovl_file->f_sb_err);

And syncfs() will return error to user space. 

So both volatile and non-volatile mount see similar to me. There is
one key difference.

- non-volatile mount does not actually sync the file to disk. So it
  it only return error if some other writeback error has been recorded
  on upper_sb since file was opened.

So what are the use cases for new errseq API? I think there are
two.

- Sargun has been trying to remount volatile mounts without every calling
  syncfs on upper. And he wants to detect if there has been an error
  since volatile mount was last mounted/unmounted. In this case he
  wants to take a snapshot of current state of upper_sb->s_wb_err
  and check this again at remount time to make sure there are not
  any new errors since he took snapshot.

  He does not want to consume the UNSEEN error in this process because
  it can potentially lead to complaints from normal syncfs users
  saying that sheer act of mounting an overlayfs consumed UNSEEN
  error on upper sb. They would expect it to be consumed when one
  of the callers does syncfs() either on upper->sb directly or
  indirectly through overlay mount.

  So IMHO, that's the core reason behind the new errseq API.

- Once we have it, I think we can potentialy use it at other places
  in overlayfs to check for errors. For example, at overlay read()
  routines might want to check if there has been an error recorded
  on upper since volatile overlay was mounted. If yes, fail read
  with -EIO and fail the filesystem so all the future reads
  return -EIO.

  That's one thing different with volatile overlayfs. Typically
  if applications write something to a file, they can do.

  write(fd)
  fsync(fd)
  read(fd)

  And this should make sure apps are protected against writeback
  errors.

  But with volatile mounts, we actually don't do any fsync(). So it
  is possible that fysnc() returns success and when read() happens
  we either read data we wrote or we read old data from disk (because
  writeback failed and pagecache page was reclaimed).

  To avoid this situation with volatile overlay mounts, we thought
  that we will fail overlay volatile mount as soon as first writeback
  error is detected.

  Now syncfs() is just only one of the sample points. We should be able
  to check for writeback errors in other paths like mount/read/write
  etc and fail volatile fs. And in all these paths, we don't want
  to mark error as SEEN otherwise user space might complain.

  In summary, with volatile overlay use case, there is a need to
  check for errors outside the syncfs() (mount/read/write/...). If
  we check for errors in these paths we have two alternatives.

  - Either mark unseen error on upper_sb as SEEN. And then user space
    can complain that an unseen error was consumed without anybody
    ever calling syncfs.

  - Or we introduce new errseq API which allows checking for errors
    without marking it SEEN. That way we can detect errors at the
    same time unseen error will be reported to user space through
    syncfs().

And second option seems to be safer choice, that's why errseq_peek()
patches.

> I'm not 100% clear on whether I understand both use-cases correctly
> though.

I hope, above is a decent summary of various discussions we have had
so far.

Vivek


^ permalink raw reply	[flat|nested] 13+ messages in thread

end of thread, other threads:[~2020-12-21 15:16 UTC | newest]

Thread overview: 13+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-12-17 15:00 [PATCH v3] errseq: split the ERRSEQ_SEEN flag into two new flags Jeff Layton
2020-12-17 20:35 ` Sargun Dhillon
2020-12-17 21:18   ` Jeff Layton
2020-12-18 23:44     ` Sargun Dhillon
2020-12-19  1:03       ` Jeff Layton
2020-12-19  9:09         ` Amir Goldstein
2020-12-19  6:13 ` Matthew Wilcox
2020-12-19 12:53   ` Jeff Layton
2020-12-19 13:25     ` Jeff Layton
2020-12-19 15:33     ` Matthew Wilcox
2020-12-19 15:49       ` Jeff Layton
2020-12-19 16:53         ` Matthew Wilcox
2020-12-21 15:14     ` Vivek Goyal

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).