All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH v2 0/2] xfs-4.20: scrub fixes
@ 2018-10-01 22:48 Darrick J. Wong
  2018-10-01 22:48 ` [PATCH 1/2] xfs: xrep_findroot_block should reject root blocks with siblings Darrick J. Wong
  2018-10-01 22:48 ` [PATCH 2/2] xfs: fix buffer state management in xrep_findroot_block Darrick J. Wong
  0 siblings, 2 replies; 13+ messages in thread
From: Darrick J. Wong @ 2018-10-01 22:48 UTC (permalink / raw)
  To: david, darrick.wong; +Cc: linux-xfs, bfoster

Hi all,

Here are a couple of online fsck fixes for 4.20.

The first patch fixes the online repair "find AG btree root" function to
ignore btree blocks that have siblings and to ignore a btree level if
multiple sibling-less blocks are found.

The second patch fixes some buffer state management bugs so that we
don't accidentally clobber b_ops on buffers that were already in-core
when we try to find an AG header's btree root blocks.

Comments and questions are, as always, welcome.

--D

^ permalink raw reply	[flat|nested] 13+ messages in thread

* [PATCH 1/2] xfs: xrep_findroot_block should reject root blocks with siblings
  2018-10-01 22:48 [PATCH v2 0/2] xfs-4.20: scrub fixes Darrick J. Wong
@ 2018-10-01 22:48 ` Darrick J. Wong
  2018-10-02 12:34   ` Brian Foster
  2018-10-03  2:01   ` [PATCH v2 " Darrick J. Wong
  2018-10-01 22:48 ` [PATCH 2/2] xfs: fix buffer state management in xrep_findroot_block Darrick J. Wong
  1 sibling, 2 replies; 13+ messages in thread
From: Darrick J. Wong @ 2018-10-01 22:48 UTC (permalink / raw)
  To: david, darrick.wong; +Cc: linux-xfs, bfoster

From: Darrick J. Wong <darrick.wong@oracle.com>

In xrep_findroot_block, if we find a candidate root block with sibling
pointers or sibling blocks on the same tree level, we should not return
that block as a tree root because root blocks cannot have siblings.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 fs/xfs/scrub/repair.c |   61 +++++++++++++++++++++++++++++++++++++++----------
 1 file changed, 48 insertions(+), 13 deletions(-)


diff --git a/fs/xfs/scrub/repair.c b/fs/xfs/scrub/repair.c
index 9f08dd9bf1d5..6eb66b3543ff 100644
--- a/fs/xfs/scrub/repair.c
+++ b/fs/xfs/scrub/repair.c
@@ -692,12 +692,13 @@ xrep_findroot_block(
 	struct xrep_find_ag_btree	*fab,
 	uint64_t			owner,
 	xfs_agblock_t			agbno,
-	bool				*found_it)
+	bool				*done_with_block)
 {
 	struct xfs_mount		*mp = ri->sc->mp;
 	struct xfs_buf			*bp;
 	struct xfs_btree_block		*btblock;
 	xfs_daddr_t			daddr;
+	int				block_level;
 	int				error;
 
 	daddr = XFS_AGB_TO_DADDR(mp, ri->sc->sa.agno, agbno);
@@ -735,18 +736,52 @@ xrep_findroot_block(
 		goto out;
 	bp->b_ops = fab->buf_ops;
 
-	/* Ignore this block if it's lower in the tree than we've seen. */
-	if (fab->root != NULLAGBLOCK &&
-	    xfs_btree_get_level(btblock) < fab->height)
-		goto out;
-
 	/* Make sure we pass the verifiers. */
 	bp->b_ops->verify_read(bp);
 	if (bp->b_error)
 		goto out;
-	fab->root = agbno;
-	fab->height = xfs_btree_get_level(btblock) + 1;
-	*found_it = true;
+
+	/*
+	 * This block passes the magic/uuid and verifier tests for this btree
+	 * type.  We don't need the caller to try the other tree types.
+	 */
+	*done_with_block = true;
+
+	block_level = xfs_btree_get_level(btblock);
+	if (block_level + 1 == fab->height) {
+		/*
+		 * This block claims to be at the same level as the root we
+		 * found previously.  There can't be two candidate roots, so
+		 * we'll throw away both of them and hope we later find a block
+		 * even higher in the tree.
+		 */
+		fab->root = NULLAGBLOCK;
+		goto out;
+	} else if (block_level < fab->height) {
+		/*
+		 * This block is lower in the tree than the root we found
+		 * previously, so just ignore it.
+		 */
+		goto out;
+	}
+
+	/*
+	 * This is the highest block in the tree that we've found so far.
+	 * Update the btree height to reflect what we've learned from this
+	 * block.
+	 */
+	fab->height = block_level + 1;
+
+	/*
+	 * If this block doesn't have sibling pointers, then it's the new root
+	 * block candidate.  Otherwise, the root will be found farther up the
+	 * tree.
+	 */
+	if (btblock->bb_u.s.bb_leftsib == cpu_to_be32(NULLAGBLOCK) &&
+	    btblock->bb_u.s.bb_rightsib == cpu_to_be32(NULLAGBLOCK))
+		fab->root = agbno;
+	else
+		fab->root = NULLAGBLOCK;
 
 	trace_xrep_findroot_block(mp, ri->sc->sa.agno, agbno,
 			be32_to_cpu(btblock->bb_magic), fab->height - 1);
@@ -768,7 +803,7 @@ xrep_findroot_rmap(
 	struct xrep_findroot		*ri = priv;
 	struct xrep_find_ag_btree	*fab;
 	xfs_agblock_t			b;
-	bool				found_it;
+	bool				done;
 	int				error = 0;
 
 	/* Ignore anything that isn't AG metadata. */
@@ -777,16 +812,16 @@ xrep_findroot_rmap(
 
 	/* Otherwise scan each block + btree type. */
 	for (b = 0; b < rec->rm_blockcount; b++) {
-		found_it = false;
+		done = false;
 		for (fab = ri->btree_info; fab->buf_ops; fab++) {
 			if (rec->rm_owner != fab->rmap_owner)
 				continue;
 			error = xrep_findroot_block(ri, fab,
 					rec->rm_owner, rec->rm_startblock + b,
-					&found_it);
+					&done);
 			if (error)
 				return error;
-			if (found_it)
+			if (done)
 				break;
 		}
 	}

^ permalink raw reply related	[flat|nested] 13+ messages in thread

* [PATCH 2/2] xfs: fix buffer state management in xrep_findroot_block
  2018-10-01 22:48 [PATCH v2 0/2] xfs-4.20: scrub fixes Darrick J. Wong
  2018-10-01 22:48 ` [PATCH 1/2] xfs: xrep_findroot_block should reject root blocks with siblings Darrick J. Wong
@ 2018-10-01 22:48 ` Darrick J. Wong
  2018-10-02 12:36   ` Brian Foster
  2018-10-03  2:02   ` [PATCH v2 " Darrick J. Wong
  1 sibling, 2 replies; 13+ messages in thread
From: Darrick J. Wong @ 2018-10-01 22:48 UTC (permalink / raw)
  To: david, darrick.wong; +Cc: linux-xfs, bfoster

From: Darrick J. Wong <darrick.wong@oracle.com>

We don't quite handle buffer state properly in online repair's findroot
routine.  If the buffer is already in-core we don't want to trash its
b_ops and state, but we /do/ want to read the contents in from disk if
the buffer wasn't already in memory.

Therefore, introduce the ONESHOT_OPS buffer flag.  If the buffer had to
be read in from disk, the buf_ops supplied to read_buf will be used once
to verify the buffer contents, but they will not be attached to the
buffer itself.

With this change, xrep_findroot_block's buffer handling can be
simplified -- if b_ops is null, we know that it was freshly read (and
verified), we can set b_ops, and proceed.  If b_ops is not null, the
buffer was already in memory and we need to do some more structure
checks.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 fs/xfs/scrub/repair.c |   76 ++++++++++++++++++++++++++++++++++++++-----------
 fs/xfs/xfs_buf.c      |    7 ++++-
 fs/xfs/xfs_buf.h      |    2 +
 3 files changed, 67 insertions(+), 18 deletions(-)


diff --git a/fs/xfs/scrub/repair.c b/fs/xfs/scrub/repair.c
index 6eb66b3543ff..8e1dfa5752b4 100644
--- a/fs/xfs/scrub/repair.c
+++ b/fs/xfs/scrub/repair.c
@@ -29,6 +29,8 @@
 #include "xfs_ag_resv.h"
 #include "xfs_trans_space.h"
 #include "xfs_quota.h"
+#include "xfs_attr.h"
+#include "xfs_reflink.h"
 #include "scrub/xfs_scrub.h"
 #include "scrub/scrub.h"
 #include "scrub/common.h"
@@ -697,9 +699,10 @@ xrep_findroot_block(
 	struct xfs_mount		*mp = ri->sc->mp;
 	struct xfs_buf			*bp;
 	struct xfs_btree_block		*btblock;
+	xfs_failaddr_t			fa;
 	xfs_daddr_t			daddr;
 	int				block_level;
-	int				error;
+	int				error = 0;
 
 	daddr = XFS_AGB_TO_DADDR(mp, ri->sc->sa.agno, agbno);
 
@@ -718,28 +721,67 @@ xrep_findroot_block(
 			return error;
 	}
 
+	/*
+	 * Read the buffer into memory with ONESHOT_OPS.  We can tell if we
+	 * had to go to disk by whether or not b_ops is set.
+	 */
 	error = xfs_trans_read_buf(mp, ri->sc->tp, mp->m_ddev_targp, daddr,
-			mp->m_bsize, 0, &bp, NULL);
+			mp->m_bsize, XBF_ONESHOT_OPS, &bp, fab->buf_ops);
 	if (error)
 		return error;
 
-	/*
-	 * Does this look like a block matching our fs and higher than any
-	 * other block we've found so far?  If so, reattach buffer verifiers
-	 * so the AIL won't complain if the buffer is also dirty.
-	 */
 	btblock = XFS_BUF_TO_BLOCK(bp);
-	if (be32_to_cpu(btblock->bb_magic) != fab->magic)
-		goto out;
-	if (xfs_sb_version_hascrc(&mp->m_sb) &&
-	    !uuid_equal(&btblock->bb_u.s.bb_uuid, &mp->m_sb.sb_meta_uuid))
-		goto out;
-	bp->b_ops = fab->buf_ops;
+	if (bp->b_ops == NULL) {
+		/*
+		 * The buffer came back with NULL b_ops, which means that the
+		 * buffer was not in memory, we had to read it off disk, and
+		 * the verifier has run.
+		 *
+		 * If b_error == 0, the block matches the magic, the uuid, and
+		 * ->verify_struct of the btree type and we can proceed.  Set
+		 * b_ops to the btree type's buf_ops so that we don't have a
+		 * buffer in memory with no verifiers.
+		 *
+		 * Otherwise, the block doesn't match the btree type so we
+		 * mark it oneshot so it doesn't stick around, and move on.
+		 */
+		if (bp->b_error) {
+			xfs_buf_oneshot(bp);
+			goto out;
+		}
+		bp->b_ops = fab->buf_ops;
+	} else {
+		/*
+		 * The buffer came back with b_ops set, which means that the
+		 * buffer was already in memory.  We'll run the magic, uuid,
+		 * and ->verify_struct checks by hand to see if we want to
+		 * examine it further.
+		 */
+		if (be32_to_cpu(btblock->bb_magic) != fab->magic)
+			goto out;
+		if (xfs_sb_version_hascrc(&mp->m_sb) &&
+		    !uuid_equal(&btblock->bb_u.s.bb_uuid,
+				&mp->m_sb.sb_meta_uuid))
+			goto out;
 
-	/* Make sure we pass the verifiers. */
-	bp->b_ops->verify_read(bp);
-	if (bp->b_error)
-		goto out;
+		/*
+		 * If b_ops do not match fab->buf_ops even though the magic
+		 * does, something is seriously wrong here and we'd rather
+		 * abort the entire repair than risk screwing things up.
+		 */
+		if (bp->b_ops != fab->buf_ops)
+			goto out;
+
+		/*
+		 * If the buffer was already incore (on a v5 fs) then it should
+		 * already have had b_ops assigned.  Call ->verify_struct to
+		 * check the structure.  Avoid checking the CRC because we
+		 * don't calculate CRCs until the buffer is written by the log.
+		 */
+		fa = bp->b_ops->verify_struct(bp);
+		if (fa)
+			goto out;
+	}
 
 	/*
 	 * This block passes the magic/uuid and verifier tests for this btree
diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c
index e839907e8492..139d35f6da64 100644
--- a/fs/xfs/xfs_buf.c
+++ b/fs/xfs/xfs_buf.c
@@ -206,7 +206,8 @@ _xfs_buf_alloc(
 	 * We don't want certain flags to appear in b_flags unless they are
 	 * specifically set by later operations on the buffer.
 	 */
-	flags &= ~(XBF_UNMAPPED | XBF_TRYLOCK | XBF_ASYNC | XBF_READ_AHEAD);
+	flags &= ~(XBF_UNMAPPED | XBF_TRYLOCK | XBF_ASYNC | XBF_READ_AHEAD |
+		   XBF_ONESHOT_OPS);
 
 	atomic_set(&bp->b_hold, 1);
 	atomic_set(&bp->b_lru_ref, 1);
@@ -758,6 +759,7 @@ xfs_buf_read_map(
 	const struct xfs_buf_ops *ops)
 {
 	struct xfs_buf		*bp;
+	const struct xfs_buf_ops *orig_ops;
 
 	flags |= XBF_READ;
 
@@ -767,8 +769,11 @@ xfs_buf_read_map(
 
 		if (!(bp->b_flags & XBF_DONE)) {
 			XFS_STATS_INC(target->bt_mount, xb_get_read);
+			orig_ops = bp->b_ops;
 			bp->b_ops = ops;
 			_xfs_buf_read(bp, flags);
+			if (flags & XBF_ONESHOT_OPS)
+				bp->b_ops = orig_ops;
 		} else if (flags & XBF_ASYNC) {
 			/*
 			 * Read ahead call which is already satisfied,
diff --git a/fs/xfs/xfs_buf.h b/fs/xfs/xfs_buf.h
index 4e3171acd0f8..62139d3ad349 100644
--- a/fs/xfs/xfs_buf.h
+++ b/fs/xfs/xfs_buf.h
@@ -34,6 +34,7 @@ typedef enum {
 #define XBF_ASYNC	 (1 << 4) /* initiator will not wait for completion */
 #define XBF_DONE	 (1 << 5) /* all pages in the buffer uptodate */
 #define XBF_STALE	 (1 << 6) /* buffer has been staled, do not find it */
+#define XBF_ONESHOT_OPS	 (1 << 7) /* only use ops if we read in the buf */
 #define XBF_WRITE_FAIL	 (1 << 24)/* async writes have failed on this buffer */
 
 /* I/O hints for the BIO layer */
@@ -61,6 +62,7 @@ typedef unsigned int xfs_buf_flags_t;
 	{ XBF_ASYNC,		"ASYNC" }, \
 	{ XBF_DONE,		"DONE" }, \
 	{ XBF_STALE,		"STALE" }, \
+	{ XBF_ONESHOT_OPS,	"ONESHOT_OPS" }, /* should never be set */ \
 	{ XBF_WRITE_FAIL,	"WRITE_FAIL" }, \
 	{ XBF_SYNCIO,		"SYNCIO" }, \
 	{ XBF_FUA,		"FUA" }, \

^ permalink raw reply related	[flat|nested] 13+ messages in thread

* Re: [PATCH 1/2] xfs: xrep_findroot_block should reject root blocks with siblings
  2018-10-01 22:48 ` [PATCH 1/2] xfs: xrep_findroot_block should reject root blocks with siblings Darrick J. Wong
@ 2018-10-02 12:34   ` Brian Foster
  2018-10-02 15:39     ` Darrick J. Wong
  2018-10-03  2:01   ` [PATCH v2 " Darrick J. Wong
  1 sibling, 1 reply; 13+ messages in thread
From: Brian Foster @ 2018-10-02 12:34 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: david, linux-xfs

On Mon, Oct 01, 2018 at 03:48:22PM -0700, Darrick J. Wong wrote:
> From: Darrick J. Wong <darrick.wong@oracle.com>
> 
> In xrep_findroot_block, if we find a candidate root block with sibling
> pointers or sibling blocks on the same tree level, we should not return
> that block as a tree root because root blocks cannot have siblings.
> 
> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> ---
>  fs/xfs/scrub/repair.c |   61 +++++++++++++++++++++++++++++++++++++++----------
>  1 file changed, 48 insertions(+), 13 deletions(-)
> 
> 
> diff --git a/fs/xfs/scrub/repair.c b/fs/xfs/scrub/repair.c
> index 9f08dd9bf1d5..6eb66b3543ff 100644
> --- a/fs/xfs/scrub/repair.c
> +++ b/fs/xfs/scrub/repair.c
...
> @@ -735,18 +736,52 @@ xrep_findroot_block(
>  		goto out;
>  	bp->b_ops = fab->buf_ops;
>  
> -	/* Ignore this block if it's lower in the tree than we've seen. */
> -	if (fab->root != NULLAGBLOCK &&
> -	    xfs_btree_get_level(btblock) < fab->height)
> -		goto out;
> -
>  	/* Make sure we pass the verifiers. */
>  	bp->b_ops->verify_read(bp);
>  	if (bp->b_error)
>  		goto out;
> -	fab->root = agbno;
> -	fab->height = xfs_btree_get_level(btblock) + 1;
> -	*found_it = true;
> +
> +	/*
> +	 * This block passes the magic/uuid and verifier tests for this btree
> +	 * type.  We don't need the caller to try the other tree types.
> +	 */
> +	*done_with_block = true;
> +
> +	block_level = xfs_btree_get_level(btblock);
> +	if (block_level + 1 == fab->height) {
> +		/*
> +		 * This block claims to be at the same level as the root we
> +		 * found previously.  There can't be two candidate roots, so
> +		 * we'll throw away both of them and hope we later find a block
> +		 * even higher in the tree.
> +		 */
> +		fab->root = NULLAGBLOCK;
> +		goto out;
> +	} else if (block_level < fab->height) {
> +		/*
> +		 * This block is lower in the tree than the root we found
> +		 * previously, so just ignore it.
> +		 */

Nit: can we combine these two comments into one above the whole if/else
statement? It makes the code slightly more readable than multi-line
comments in each branch, IMO. E.g.:

        /*
         * Compare the current block level with the height of the current
         * candidate root block. If the level matches the current candidate,
         * we know the current candidate can't be a root so we must invalidate
         * it. If the level is lower than the current candidate, just ignore
         * this block.
         */
        block_level = xfs_btree_get_level(btblock);
        if (block_level + 1 == fab->height) {
                fab->root = NULLAGBLOCK;
                goto out;
        } else if (block_level < fab->height) {
                goto out;
        }

With that:

Reviewed-by: Brian Foster <bfoster@redhat.com>

> +		goto out;
> +	}
> +
> +	/*
> +	 * This is the highest block in the tree that we've found so far.
> +	 * Update the btree height to reflect what we've learned from this
> +	 * block.
> +	 */
> +	fab->height = block_level + 1;
> +
> +	/*
> +	 * If this block doesn't have sibling pointers, then it's the new root
> +	 * block candidate.  Otherwise, the root will be found farther up the
> +	 * tree.
> +	 */
> +	if (btblock->bb_u.s.bb_leftsib == cpu_to_be32(NULLAGBLOCK) &&
> +	    btblock->bb_u.s.bb_rightsib == cpu_to_be32(NULLAGBLOCK))
> +		fab->root = agbno;
> +	else
> +		fab->root = NULLAGBLOCK;
>  
>  	trace_xrep_findroot_block(mp, ri->sc->sa.agno, agbno,
>  			be32_to_cpu(btblock->bb_magic), fab->height - 1);
> @@ -768,7 +803,7 @@ xrep_findroot_rmap(
>  	struct xrep_findroot		*ri = priv;
>  	struct xrep_find_ag_btree	*fab;
>  	xfs_agblock_t			b;
> -	bool				found_it;
> +	bool				done;
>  	int				error = 0;
>  
>  	/* Ignore anything that isn't AG metadata. */
> @@ -777,16 +812,16 @@ xrep_findroot_rmap(
>  
>  	/* Otherwise scan each block + btree type. */
>  	for (b = 0; b < rec->rm_blockcount; b++) {
> -		found_it = false;
> +		done = false;
>  		for (fab = ri->btree_info; fab->buf_ops; fab++) {
>  			if (rec->rm_owner != fab->rmap_owner)
>  				continue;
>  			error = xrep_findroot_block(ri, fab,
>  					rec->rm_owner, rec->rm_startblock + b,
> -					&found_it);
> +					&done);
>  			if (error)
>  				return error;
> -			if (found_it)
> +			if (done)
>  				break;
>  		}
>  	}
> 

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH 2/2] xfs: fix buffer state management in xrep_findroot_block
  2018-10-01 22:48 ` [PATCH 2/2] xfs: fix buffer state management in xrep_findroot_block Darrick J. Wong
@ 2018-10-02 12:36   ` Brian Foster
  2018-10-02 19:56     ` Darrick J. Wong
  2018-10-03  2:02   ` [PATCH v2 " Darrick J. Wong
  1 sibling, 1 reply; 13+ messages in thread
From: Brian Foster @ 2018-10-02 12:36 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: david, linux-xfs

On Mon, Oct 01, 2018 at 03:48:28PM -0700, Darrick J. Wong wrote:
> From: Darrick J. Wong <darrick.wong@oracle.com>
> 
> We don't quite handle buffer state properly in online repair's findroot
> routine.  If the buffer is already in-core we don't want to trash its
> b_ops and state, but we /do/ want to read the contents in from disk if
> the buffer wasn't already in memory.
> 
> Therefore, introduce the ONESHOT_OPS buffer flag.  If the buffer had to
> be read in from disk, the buf_ops supplied to read_buf will be used once
> to verify the buffer contents, but they will not be attached to the
> buffer itself.
> 
> With this change, xrep_findroot_block's buffer handling can be
> simplified -- if b_ops is null, we know that it was freshly read (and
> verified), we can set b_ops, and proceed.  If b_ops is not null, the
> buffer was already in memory and we need to do some more structure
> checks.
> 
> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> ---
>  fs/xfs/scrub/repair.c |   76 ++++++++++++++++++++++++++++++++++++++-----------
>  fs/xfs/xfs_buf.c      |    7 ++++-
>  fs/xfs/xfs_buf.h      |    2 +
>  3 files changed, 67 insertions(+), 18 deletions(-)
> 
> 
> diff --git a/fs/xfs/scrub/repair.c b/fs/xfs/scrub/repair.c
> index 6eb66b3543ff..8e1dfa5752b4 100644
> --- a/fs/xfs/scrub/repair.c
> +++ b/fs/xfs/scrub/repair.c
...
> @@ -718,28 +721,67 @@ xrep_findroot_block(
>  			return error;
>  	}
>  
> +	/*
> +	 * Read the buffer into memory with ONESHOT_OPS.  We can tell if we
> +	 * had to go to disk by whether or not b_ops is set.
> +	 */
>  	error = xfs_trans_read_buf(mp, ri->sc->tp, mp->m_ddev_targp, daddr,
> -			mp->m_bsize, 0, &bp, NULL);
> +			mp->m_bsize, XBF_ONESHOT_OPS, &bp, fab->buf_ops);
>  	if (error)
>  		return error;
>  
> -	/*
> -	 * Does this look like a block matching our fs and higher than any
> -	 * other block we've found so far?  If so, reattach buffer verifiers
> -	 * so the AIL won't complain if the buffer is also dirty.
> -	 */
>  	btblock = XFS_BUF_TO_BLOCK(bp);
> -	if (be32_to_cpu(btblock->bb_magic) != fab->magic)
> -		goto out;
> -	if (xfs_sb_version_hascrc(&mp->m_sb) &&
> -	    !uuid_equal(&btblock->bb_u.s.bb_uuid, &mp->m_sb.sb_meta_uuid))
> -		goto out;
> -	bp->b_ops = fab->buf_ops;
> +	if (bp->b_ops == NULL) {
> +		/*
> +		 * The buffer came back with NULL b_ops, which means that the
> +		 * buffer was not in memory, we had to read it off disk, and
> +		 * the verifier has run.
> +		 *
> +		 * If b_error == 0, the block matches the magic, the uuid, and
> +		 * ->verify_struct of the btree type and we can proceed.  Set
> +		 * b_ops to the btree type's buf_ops so that we don't have a
> +		 * buffer in memory with no verifiers.
> +		 *
> +		 * Otherwise, the block doesn't match the btree type so we
> +		 * mark it oneshot so it doesn't stick around, and move on.
> +		 */

Hmm, the fact that this interface still requires looking at internal
buffer state to process what happened suggests to me it may not be the
best approach. IIUC, a buffer that's already in-core is just going to
return with the existing b_ops (the else branch below). Otherwise,
->b_ops is NULL to indicate that we read it from disk and ran the passed
in (prospective) verifier...

> +		if (bp->b_error) {
> +			xfs_buf_oneshot(bp);
> +			goto out;
> +		}

... which means that this can't happen, because if we ran the passed in
verifier and it fails the xfs_trans_read_buf() call invalidates the
buffer and returns an error. I think that also means we'd abort the scan
if we don't filter out -EFSCORRUPTED returns so long as we pass a
potentially incorrect verifier.

Also note that I think this still means that each failure to match a
verifier has to re-read the buffer from disk, but we can deal with that
separately.

> +		bp->b_ops = fab->buf_ops;

Therefore, a successful return with a NULL ->b_ops means we ran the
verifier, it passed, and we're going to attach it. Am I missing
something, or is this a roundabout way to just do what the buffer read
mechanism already does without the flag?

I could easily be missing some of the verification intricacies that are
required from scrub context, but is this logic roughly equivalent to the
following (without any special control flags)? We'd repeat some checks
in the event the buffer was read from disk, but perhaps that can also be
optimized out separately if it really matters. Hm?

	...
        error = xfs_trans_read_buf(mp, ri->sc->tp, mp->m_ddev_targp, daddr,
                        mp->m_bsize, 0, &bp, fab->buf_ops);
        if (error == -EFSCORRUPTED)
                goto out;
        else if (error)
                return error;

        /*
         * If b_ops doesn't match fab, the buffer was already in-core with a
         * different verifier. We know it doesn't match in that case.
         */
        if (bp->b_ops != fab->buf_ops)
                goto out;

        /*
         * The verifier matches but the buffer may have already been in-core
         * with this verifier. Sanity check the latest content before we
         * proceed. Don't check the CRC because CRCs aren't updated until
         * writeback.
         */
        btblock = XFS_BUF_TO_BLOCK(bp);
        if (be32_to_cpu(btblock->bb_magic) != fab->magic)
                goto out;
        if (xfs_sb_version_hascrc(&mp->m_sb) &&
            !uuid_equal(&btblock->bb_u.s.bb_uuid,
                        &mp->m_sb.sb_meta_uuid))
                goto out;
        fa = bp->b_ops->verify_struct(bp);
        if (fa)
                goto out;
	...

Brian

> +	} else {
> +		/*
> +		 * The buffer came back with b_ops set, which means that the
> +		 * buffer was already in memory.  We'll run the magic, uuid,
> +		 * and ->verify_struct checks by hand to see if we want to
> +		 * examine it further.
> +		 */
> +		if (be32_to_cpu(btblock->bb_magic) != fab->magic)
> +			goto out;
> +		if (xfs_sb_version_hascrc(&mp->m_sb) &&
> +		    !uuid_equal(&btblock->bb_u.s.bb_uuid,
> +				&mp->m_sb.sb_meta_uuid))
> +			goto out;
>  
> -	/* Make sure we pass the verifiers. */
> -	bp->b_ops->verify_read(bp);
> -	if (bp->b_error)
> -		goto out;
> +		/*
> +		 * If b_ops do not match fab->buf_ops even though the magic
> +		 * does, something is seriously wrong here and we'd rather
> +		 * abort the entire repair than risk screwing things up.
> +		 */
> +		if (bp->b_ops != fab->buf_ops)
> +			goto out;
> +
> +		/*
> +		 * If the buffer was already incore (on a v5 fs) then it should
> +		 * already have had b_ops assigned.  Call ->verify_struct to
> +		 * check the structure.  Avoid checking the CRC because we
> +		 * don't calculate CRCs until the buffer is written by the log.
> +		 */
> +		fa = bp->b_ops->verify_struct(bp);
> +		if (fa)
> +			goto out;
> +	}
>  
>  	/*
>  	 * This block passes the magic/uuid and verifier tests for this btree
> diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c
> index e839907e8492..139d35f6da64 100644
> --- a/fs/xfs/xfs_buf.c
> +++ b/fs/xfs/xfs_buf.c
> @@ -206,7 +206,8 @@ _xfs_buf_alloc(
>  	 * We don't want certain flags to appear in b_flags unless they are
>  	 * specifically set by later operations on the buffer.
>  	 */
> -	flags &= ~(XBF_UNMAPPED | XBF_TRYLOCK | XBF_ASYNC | XBF_READ_AHEAD);
> +	flags &= ~(XBF_UNMAPPED | XBF_TRYLOCK | XBF_ASYNC | XBF_READ_AHEAD |
> +		   XBF_ONESHOT_OPS);
>  
>  	atomic_set(&bp->b_hold, 1);
>  	atomic_set(&bp->b_lru_ref, 1);
> @@ -758,6 +759,7 @@ xfs_buf_read_map(
>  	const struct xfs_buf_ops *ops)
>  {
>  	struct xfs_buf		*bp;
> +	const struct xfs_buf_ops *orig_ops;
>  
>  	flags |= XBF_READ;
>  
> @@ -767,8 +769,11 @@ xfs_buf_read_map(
>  
>  		if (!(bp->b_flags & XBF_DONE)) {
>  			XFS_STATS_INC(target->bt_mount, xb_get_read);
> +			orig_ops = bp->b_ops;
>  			bp->b_ops = ops;
>  			_xfs_buf_read(bp, flags);
> +			if (flags & XBF_ONESHOT_OPS)
> +				bp->b_ops = orig_ops;
>  		} else if (flags & XBF_ASYNC) {
>  			/*
>  			 * Read ahead call which is already satisfied,
> diff --git a/fs/xfs/xfs_buf.h b/fs/xfs/xfs_buf.h
> index 4e3171acd0f8..62139d3ad349 100644
> --- a/fs/xfs/xfs_buf.h
> +++ b/fs/xfs/xfs_buf.h
> @@ -34,6 +34,7 @@ typedef enum {
>  #define XBF_ASYNC	 (1 << 4) /* initiator will not wait for completion */
>  #define XBF_DONE	 (1 << 5) /* all pages in the buffer uptodate */
>  #define XBF_STALE	 (1 << 6) /* buffer has been staled, do not find it */
> +#define XBF_ONESHOT_OPS	 (1 << 7) /* only use ops if we read in the buf */
>  #define XBF_WRITE_FAIL	 (1 << 24)/* async writes have failed on this buffer */
>  
>  /* I/O hints for the BIO layer */
> @@ -61,6 +62,7 @@ typedef unsigned int xfs_buf_flags_t;
>  	{ XBF_ASYNC,		"ASYNC" }, \
>  	{ XBF_DONE,		"DONE" }, \
>  	{ XBF_STALE,		"STALE" }, \
> +	{ XBF_ONESHOT_OPS,	"ONESHOT_OPS" }, /* should never be set */ \
>  	{ XBF_WRITE_FAIL,	"WRITE_FAIL" }, \
>  	{ XBF_SYNCIO,		"SYNCIO" }, \
>  	{ XBF_FUA,		"FUA" }, \
> 

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH 1/2] xfs: xrep_findroot_block should reject root blocks with siblings
  2018-10-02 12:34   ` Brian Foster
@ 2018-10-02 15:39     ` Darrick J. Wong
  0 siblings, 0 replies; 13+ messages in thread
From: Darrick J. Wong @ 2018-10-02 15:39 UTC (permalink / raw)
  To: Brian Foster; +Cc: david, linux-xfs

On Tue, Oct 02, 2018 at 08:34:30AM -0400, Brian Foster wrote:
> On Mon, Oct 01, 2018 at 03:48:22PM -0700, Darrick J. Wong wrote:
> > From: Darrick J. Wong <darrick.wong@oracle.com>
> > 
> > In xrep_findroot_block, if we find a candidate root block with sibling
> > pointers or sibling blocks on the same tree level, we should not return
> > that block as a tree root because root blocks cannot have siblings.
> > 
> > Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> > ---
> >  fs/xfs/scrub/repair.c |   61 +++++++++++++++++++++++++++++++++++++++----------
> >  1 file changed, 48 insertions(+), 13 deletions(-)
> > 
> > 
> > diff --git a/fs/xfs/scrub/repair.c b/fs/xfs/scrub/repair.c
> > index 9f08dd9bf1d5..6eb66b3543ff 100644
> > --- a/fs/xfs/scrub/repair.c
> > +++ b/fs/xfs/scrub/repair.c
> ...
> > @@ -735,18 +736,52 @@ xrep_findroot_block(
> >  		goto out;
> >  	bp->b_ops = fab->buf_ops;
> >  
> > -	/* Ignore this block if it's lower in the tree than we've seen. */
> > -	if (fab->root != NULLAGBLOCK &&
> > -	    xfs_btree_get_level(btblock) < fab->height)
> > -		goto out;
> > -
> >  	/* Make sure we pass the verifiers. */
> >  	bp->b_ops->verify_read(bp);
> >  	if (bp->b_error)
> >  		goto out;
> > -	fab->root = agbno;
> > -	fab->height = xfs_btree_get_level(btblock) + 1;
> > -	*found_it = true;
> > +
> > +	/*
> > +	 * This block passes the magic/uuid and verifier tests for this btree
> > +	 * type.  We don't need the caller to try the other tree types.
> > +	 */
> > +	*done_with_block = true;
> > +
> > +	block_level = xfs_btree_get_level(btblock);
> > +	if (block_level + 1 == fab->height) {
> > +		/*
> > +		 * This block claims to be at the same level as the root we
> > +		 * found previously.  There can't be two candidate roots, so
> > +		 * we'll throw away both of them and hope we later find a block
> > +		 * even higher in the tree.
> > +		 */
> > +		fab->root = NULLAGBLOCK;
> > +		goto out;
> > +	} else if (block_level < fab->height) {
> > +		/*
> > +		 * This block is lower in the tree than the root we found
> > +		 * previously, so just ignore it.
> > +		 */
> 
> Nit: can we combine these two comments into one above the whole if/else
> statement? It makes the code slightly more readable than multi-line
> comments in each branch, IMO. E.g.:
> 
>         /*
>          * Compare the current block level with the height of the current
>          * candidate root block. If the level matches the current candidate,
>          * we know the current candidate can't be a root so we must invalidate
>          * it. If the level is lower than the current candidate, just ignore
>          * this block.
>          */
>         block_level = xfs_btree_get_level(btblock);
>         if (block_level + 1 == fab->height) {
>                 fab->root = NULLAGBLOCK;
>                 goto out;
>         } else if (block_level < fab->height) {
>                 goto out;
>         }
> 
> With that:
> 
> Reviewed-by: Brian Foster <bfoster@redhat.com>

Done; thanks for the review!

--D

> 
> > +		goto out;
> > +	}
> > +
> > +	/*
> > +	 * This is the highest block in the tree that we've found so far.
> > +	 * Update the btree height to reflect what we've learned from this
> > +	 * block.
> > +	 */
> > +	fab->height = block_level + 1;
> > +
> > +	/*
> > +	 * If this block doesn't have sibling pointers, then it's the new root
> > +	 * block candidate.  Otherwise, the root will be found farther up the
> > +	 * tree.
> > +	 */
> > +	if (btblock->bb_u.s.bb_leftsib == cpu_to_be32(NULLAGBLOCK) &&
> > +	    btblock->bb_u.s.bb_rightsib == cpu_to_be32(NULLAGBLOCK))
> > +		fab->root = agbno;
> > +	else
> > +		fab->root = NULLAGBLOCK;
> >  
> >  	trace_xrep_findroot_block(mp, ri->sc->sa.agno, agbno,
> >  			be32_to_cpu(btblock->bb_magic), fab->height - 1);
> > @@ -768,7 +803,7 @@ xrep_findroot_rmap(
> >  	struct xrep_findroot		*ri = priv;
> >  	struct xrep_find_ag_btree	*fab;
> >  	xfs_agblock_t			b;
> > -	bool				found_it;
> > +	bool				done;
> >  	int				error = 0;
> >  
> >  	/* Ignore anything that isn't AG metadata. */
> > @@ -777,16 +812,16 @@ xrep_findroot_rmap(
> >  
> >  	/* Otherwise scan each block + btree type. */
> >  	for (b = 0; b < rec->rm_blockcount; b++) {
> > -		found_it = false;
> > +		done = false;
> >  		for (fab = ri->btree_info; fab->buf_ops; fab++) {
> >  			if (rec->rm_owner != fab->rmap_owner)
> >  				continue;
> >  			error = xrep_findroot_block(ri, fab,
> >  					rec->rm_owner, rec->rm_startblock + b,
> > -					&found_it);
> > +					&done);
> >  			if (error)
> >  				return error;
> > -			if (found_it)
> > +			if (done)
> >  				break;
> >  		}
> >  	}
> > 

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH 2/2] xfs: fix buffer state management in xrep_findroot_block
  2018-10-02 12:36   ` Brian Foster
@ 2018-10-02 19:56     ` Darrick J. Wong
  2018-10-03 11:03       ` Brian Foster
  0 siblings, 1 reply; 13+ messages in thread
From: Darrick J. Wong @ 2018-10-02 19:56 UTC (permalink / raw)
  To: Brian Foster; +Cc: david, linux-xfs

On Tue, Oct 02, 2018 at 08:36:41AM -0400, Brian Foster wrote:
> On Mon, Oct 01, 2018 at 03:48:28PM -0700, Darrick J. Wong wrote:
> > From: Darrick J. Wong <darrick.wong@oracle.com>
> > 
> > We don't quite handle buffer state properly in online repair's findroot
> > routine.  If the buffer is already in-core we don't want to trash its
> > b_ops and state, but we /do/ want to read the contents in from disk if
> > the buffer wasn't already in memory.
> > 
> > Therefore, introduce the ONESHOT_OPS buffer flag.  If the buffer had to
> > be read in from disk, the buf_ops supplied to read_buf will be used once
> > to verify the buffer contents, but they will not be attached to the
> > buffer itself.
> > 
> > With this change, xrep_findroot_block's buffer handling can be
> > simplified -- if b_ops is null, we know that it was freshly read (and
> > verified), we can set b_ops, and proceed.  If b_ops is not null, the
> > buffer was already in memory and we need to do some more structure
> > checks.
> > 
> > Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> > ---
> >  fs/xfs/scrub/repair.c |   76 ++++++++++++++++++++++++++++++++++++++-----------
> >  fs/xfs/xfs_buf.c      |    7 ++++-
> >  fs/xfs/xfs_buf.h      |    2 +
> >  3 files changed, 67 insertions(+), 18 deletions(-)
> > 
> > 
> > diff --git a/fs/xfs/scrub/repair.c b/fs/xfs/scrub/repair.c
> > index 6eb66b3543ff..8e1dfa5752b4 100644
> > --- a/fs/xfs/scrub/repair.c
> > +++ b/fs/xfs/scrub/repair.c
> ...
> > @@ -718,28 +721,67 @@ xrep_findroot_block(
> >  			return error;
> >  	}
> >  
> > +	/*
> > +	 * Read the buffer into memory with ONESHOT_OPS.  We can tell if we
> > +	 * had to go to disk by whether or not b_ops is set.
> > +	 */
> >  	error = xfs_trans_read_buf(mp, ri->sc->tp, mp->m_ddev_targp, daddr,
> > -			mp->m_bsize, 0, &bp, NULL);
> > +			mp->m_bsize, XBF_ONESHOT_OPS, &bp, fab->buf_ops);
> >  	if (error)
> >  		return error;
> >  
> > -	/*
> > -	 * Does this look like a block matching our fs and higher than any
> > -	 * other block we've found so far?  If so, reattach buffer verifiers
> > -	 * so the AIL won't complain if the buffer is also dirty.
> > -	 */
> >  	btblock = XFS_BUF_TO_BLOCK(bp);
> > -	if (be32_to_cpu(btblock->bb_magic) != fab->magic)
> > -		goto out;
> > -	if (xfs_sb_version_hascrc(&mp->m_sb) &&
> > -	    !uuid_equal(&btblock->bb_u.s.bb_uuid, &mp->m_sb.sb_meta_uuid))
> > -		goto out;
> > -	bp->b_ops = fab->buf_ops;
> > +	if (bp->b_ops == NULL) {
> > +		/*
> > +		 * The buffer came back with NULL b_ops, which means that the
> > +		 * buffer was not in memory, we had to read it off disk, and
> > +		 * the verifier has run.
> > +		 *
> > +		 * If b_error == 0, the block matches the magic, the uuid, and
> > +		 * ->verify_struct of the btree type and we can proceed.  Set
> > +		 * b_ops to the btree type's buf_ops so that we don't have a
> > +		 * buffer in memory with no verifiers.
> > +		 *
> > +		 * Otherwise, the block doesn't match the btree type so we
> > +		 * mark it oneshot so it doesn't stick around, and move on.
> > +		 */
> 
> Hmm, the fact that this interface still requires looking at internal
> buffer state to process what happened suggests to me it may not be the
> best approach.

Aha!  Having recollected more of my marbles, now I remember why the
previous iteration of this patch did the weird dance with get_buf and
read_buf.  At a high level, we want to do:

  1. If the buffer is already in memory, we want to grab it, check the
  magic, uuid, and structure, and if it doesn't matches our btree type
  then we move on.

  2. If the buffer is /not/ in memory (i.e. !XBF_DONE), then we want to
  read it off the disk and run the verifier to see if it matches our
  btree type.  If not, we move on.

However, the buffer state management doesn't /quite/ work that way.  In
particular, if we call _trans_read_buf on a !DONE buffer and fail
->verify_read, it will log the verifier error and stale the buffer,
which means that a subsequent call of xrep_findroot_block with the same
buffer will have to reread the disk and may very well log a verifier
error too.  This is obnoxious since we're wasting resources and logging
non-errors.

That's why the previous patch called _trans_read_buf with NULL ops.  In
the first case we'd get back our DONE buffer and proceed with checks.
In the !DONE case, _trans_read_buf would read the metadata off disk into
the buffer, set DONE, and return it to us unchecked.  Then we'd call
->verify_read ourselves; if it succeeds we attach the ops and use the
buffer, or else we release the buffer (which would still have null ops).
In theory if anyone ever tried to write the buffer without attaching ops
the log code will complain.

However, there /is/ a bug -- if none of the find_ag_btree profiles match
the buffer, the xrep code will never attach any buf ops.  A subsequent
_trans_read_buf call from some other part of XFS will see the DONE state
and decline to attach the passed in ops, which later leads to a log
splat on the NULL b_ops.

I think the solution here is to modify xfs_buf_read_map to handle a
(DONE buffer with NULL b_ops and !NULL passed-in ops) to call
->verify_read on the buffer and set b_ops if successful.

(please keep reading)

> IIUC, a buffer that's already in-core is just going to
> return with the existing b_ops (the else branch below). Otherwise,
> ->b_ops is NULL to indicate that we read it from disk and ran the passed
> in (prospective) verifier...
> 
> > +		if (bp->b_error) {
> > +			xfs_buf_oneshot(bp);
> > +			goto out;
> > +		}
> 
> ... which means that this can't happen, because if we ran the passed in
> verifier and it fails the xfs_trans_read_buf() call invalidates the
> buffer and returns an error. I think that also means we'd abort the scan
> if we don't filter out -EFSCORRUPTED returns so long as we pass a
> potentially incorrect verifier.

Correct.  Good catch! :)

> Also note that I think this still means that each failure to match a
> verifier has to re-read the buffer from disk, but we can deal with that
> separately.
> 
> > +		bp->b_ops = fab->buf_ops;
> 
> Therefore, a successful return with a NULL ->b_ops means we ran the
> verifier, it passed, and we're going to attach it. Am I missing
> something, or is this a roundabout way to just do what the buffer read
> mechanism already does without the flag?

Yes, it's roundabout.

> I could easily be missing some of the verification intricacies that are
> required from scrub context,

I think I've explained it above, now that I finally remember what I was
trying to accomplish. :)

> but is this logic roughly equivalent to the following (without any
> special control flags)? We'd repeat some checks in the event the
> buffer was read from disk, but perhaps that can also be optimized out
> separately if it really matters. Hm?

How about something like this?

	/*
	 * Read the buffer into memory so that we can see if it's a match for
	 * our btree type.  We have no clue if it is beforehand, and we want to
	 * avoid xfs_trans_read_buf's behavior of dumping the DONE state (which
	 * will cause needless disk reads in subsequent calls to this function)
	 * and logging metadata verifier failures.
	 *
	 * Therefore, pass in NULL buffer ops.  If the buffer was already in
	 * memory from some other caller it will already have b_ops assigned.
	 * If it was in memory from a previous unsuccessful findroot_block
	 * call, the buffer won't have b_ops but it should be clean and ready
	 * for us to try to verify if the read call succeeds.  The same applies
	 * if the buffer wasn't in memory at all.
	 *
	 * Note: If we never match a btree type with this buffer, it will be
	 * left in memory with NULL b_ops.  This shouldn't be a problem unless
	 * the buffer gets written.
	 */
	error = xfs_trans_read_buf(mp, ri->sc->tp, mp->m_ddev_targp, daddr,
			mp->m_bsize, 0, &bp, NULL);
	if (error)
		return error;

	/*
	 * If the buffer already has ops applied and they're not the ones for
	 * this btree type, we know this block doesn't match the btree and we
	 * can bail out.
	 *
	 * If the buffer ops match ours, someone else has already validated
	 * the block for us, so we can move on to checking if this is a root
	 * block candidate.
	 *
	 * If the buffer does not have ops, nobody has successfully validated
	 * the contents and the buffer cannot be dirty.  If the magic, uuid,
	 * and structure match this btree type then we'll move on to checking
	 * if it's a root block candidate.  If there is no match, bail out.
	 */
	if (bp->b_ops) {
		if (b->b_ops != fab->buf_ops)
			goto out;
	} else {
		if (be32_to_cpu(btblock->bb_magic) != fab->magic)
			goto out;
		if (xfs_sb_version_hascrc(&mp->m_sb) &&
		    !uuid_equal(&btblock->bb_u.s.bb_uuid,
				&mp->m_sb.sb_meta_uuid))
			goto out;
		fa = fab->buf_ops->verify_struct(bp);
		if (fa)
			goto out;
		b->b_ops = fab->buf_ops;
	}

and then xfs_buf_read_map can be taught to call the read verifier if it
ever encounters an XBF_DONE buffer with b_ops == NULL and ops != NULL.

--D

> 
> 	...
>         error = xfs_trans_read_buf(mp, ri->sc->tp, mp->m_ddev_targp, daddr,
>                         mp->m_bsize, 0, &bp, fab->buf_ops);
>         if (error == -EFSCORRUPTED)
>                 goto out;
>         else if (error)
>                 return error;
> 
>         /*
>          * If b_ops doesn't match fab, the buffer was already in-core with a
>          * different verifier. We know it doesn't match in that case.
>          */
>         if (bp->b_ops != fab->buf_ops)
>                 goto out;
> 
>         /*
>          * The verifier matches but the buffer may have already been in-core
>          * with this verifier. Sanity check the latest content before we
>          * proceed. Don't check the CRC because CRCs aren't updated until
>          * writeback.
>          */
>         btblock = XFS_BUF_TO_BLOCK(bp);
>         if (be32_to_cpu(btblock->bb_magic) != fab->magic)
>                 goto out;
>         if (xfs_sb_version_hascrc(&mp->m_sb) &&
>             !uuid_equal(&btblock->bb_u.s.bb_uuid,
>                         &mp->m_sb.sb_meta_uuid))
>                 goto out;
>         fa = bp->b_ops->verify_struct(bp);
>         if (fa)
>                 goto out;
> 	...
> 
> Brian
> 
> > +	} else {
> > +		/*
> > +		 * The buffer came back with b_ops set, which means that the
> > +		 * buffer was already in memory.  We'll run the magic, uuid,
> > +		 * and ->verify_struct checks by hand to see if we want to
> > +		 * examine it further.
> > +		 */
> > +		if (be32_to_cpu(btblock->bb_magic) != fab->magic)
> > +			goto out;
> > +		if (xfs_sb_version_hascrc(&mp->m_sb) &&
> > +		    !uuid_equal(&btblock->bb_u.s.bb_uuid,
> > +				&mp->m_sb.sb_meta_uuid))
> > +			goto out;
> >  
> > -	/* Make sure we pass the verifiers. */
> > -	bp->b_ops->verify_read(bp);
> > -	if (bp->b_error)
> > -		goto out;
> > +		/*
> > +		 * If b_ops do not match fab->buf_ops even though the magic
> > +		 * does, something is seriously wrong here and we'd rather
> > +		 * abort the entire repair than risk screwing things up.
> > +		 */
> > +		if (bp->b_ops != fab->buf_ops)
> > +			goto out;
> > +
> > +		/*
> > +		 * If the buffer was already incore (on a v5 fs) then it should
> > +		 * already have had b_ops assigned.  Call ->verify_struct to
> > +		 * check the structure.  Avoid checking the CRC because we
> > +		 * don't calculate CRCs until the buffer is written by the log.
> > +		 */
> > +		fa = bp->b_ops->verify_struct(bp);
> > +		if (fa)
> > +			goto out;
> > +	}
> >  
> >  	/*
> >  	 * This block passes the magic/uuid and verifier tests for this btree
> > diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c
> > index e839907e8492..139d35f6da64 100644
> > --- a/fs/xfs/xfs_buf.c
> > +++ b/fs/xfs/xfs_buf.c
> > @@ -206,7 +206,8 @@ _xfs_buf_alloc(
> >  	 * We don't want certain flags to appear in b_flags unless they are
> >  	 * specifically set by later operations on the buffer.
> >  	 */
> > -	flags &= ~(XBF_UNMAPPED | XBF_TRYLOCK | XBF_ASYNC | XBF_READ_AHEAD);
> > +	flags &= ~(XBF_UNMAPPED | XBF_TRYLOCK | XBF_ASYNC | XBF_READ_AHEAD |
> > +		   XBF_ONESHOT_OPS);
> >  
> >  	atomic_set(&bp->b_hold, 1);
> >  	atomic_set(&bp->b_lru_ref, 1);
> > @@ -758,6 +759,7 @@ xfs_buf_read_map(
> >  	const struct xfs_buf_ops *ops)
> >  {
> >  	struct xfs_buf		*bp;
> > +	const struct xfs_buf_ops *orig_ops;
> >  
> >  	flags |= XBF_READ;
> >  
> > @@ -767,8 +769,11 @@ xfs_buf_read_map(
> >  
> >  		if (!(bp->b_flags & XBF_DONE)) {
> >  			XFS_STATS_INC(target->bt_mount, xb_get_read);
> > +			orig_ops = bp->b_ops;
> >  			bp->b_ops = ops;
> >  			_xfs_buf_read(bp, flags);
> > +			if (flags & XBF_ONESHOT_OPS)
> > +				bp->b_ops = orig_ops;
> >  		} else if (flags & XBF_ASYNC) {
> >  			/*
> >  			 * Read ahead call which is already satisfied,
> > diff --git a/fs/xfs/xfs_buf.h b/fs/xfs/xfs_buf.h
> > index 4e3171acd0f8..62139d3ad349 100644
> > --- a/fs/xfs/xfs_buf.h
> > +++ b/fs/xfs/xfs_buf.h
> > @@ -34,6 +34,7 @@ typedef enum {
> >  #define XBF_ASYNC	 (1 << 4) /* initiator will not wait for completion */
> >  #define XBF_DONE	 (1 << 5) /* all pages in the buffer uptodate */
> >  #define XBF_STALE	 (1 << 6) /* buffer has been staled, do not find it */
> > +#define XBF_ONESHOT_OPS	 (1 << 7) /* only use ops if we read in the buf */
> >  #define XBF_WRITE_FAIL	 (1 << 24)/* async writes have failed on this buffer */
> >  
> >  /* I/O hints for the BIO layer */
> > @@ -61,6 +62,7 @@ typedef unsigned int xfs_buf_flags_t;
> >  	{ XBF_ASYNC,		"ASYNC" }, \
> >  	{ XBF_DONE,		"DONE" }, \
> >  	{ XBF_STALE,		"STALE" }, \
> > +	{ XBF_ONESHOT_OPS,	"ONESHOT_OPS" }, /* should never be set */ \
> >  	{ XBF_WRITE_FAIL,	"WRITE_FAIL" }, \
> >  	{ XBF_SYNCIO,		"SYNCIO" }, \
> >  	{ XBF_FUA,		"FUA" }, \
> > 

^ permalink raw reply	[flat|nested] 13+ messages in thread

* [PATCH v2 1/2] xfs: xrep_findroot_block should reject root blocks with siblings
  2018-10-01 22:48 ` [PATCH 1/2] xfs: xrep_findroot_block should reject root blocks with siblings Darrick J. Wong
  2018-10-02 12:34   ` Brian Foster
@ 2018-10-03  2:01   ` Darrick J. Wong
  1 sibling, 0 replies; 13+ messages in thread
From: Darrick J. Wong @ 2018-10-03  2:01 UTC (permalink / raw)
  To: david; +Cc: linux-xfs, bfoster


From: Darrick J. Wong <darrick.wong@oracle.com>
Subject: [PATCH] xfs: xrep_findroot_block should reject root blocks with siblings

In xrep_findroot_block, if we find a candidate root block with sibling
pointers or sibling blocks on the same tree level, we should not return
that block as a tree root because root blocks cannot have siblings.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
---
 fs/xfs/scrub/repair.c |   61 +++++++++++++++++++++++++++++++++++++++----------
 1 file changed, 48 insertions(+), 13 deletions(-)

diff --git a/fs/xfs/scrub/repair.c b/fs/xfs/scrub/repair.c
index 9f08dd9bf1d5..63786341ac2a 100644
--- a/fs/xfs/scrub/repair.c
+++ b/fs/xfs/scrub/repair.c
@@ -692,12 +692,13 @@ xrep_findroot_block(
 	struct xrep_find_ag_btree	*fab,
 	uint64_t			owner,
 	xfs_agblock_t			agbno,
-	bool				*found_it)
+	bool				*done_with_block)
 {
 	struct xfs_mount		*mp = ri->sc->mp;
 	struct xfs_buf			*bp;
 	struct xfs_btree_block		*btblock;
 	xfs_daddr_t			daddr;
+	int				block_level;
 	int				error;
 
 	daddr = XFS_AGB_TO_DADDR(mp, ri->sc->sa.agno, agbno);
@@ -735,18 +736,52 @@ xrep_findroot_block(
 		goto out;
 	bp->b_ops = fab->buf_ops;
 
-	/* Ignore this block if it's lower in the tree than we've seen. */
-	if (fab->root != NULLAGBLOCK &&
-	    xfs_btree_get_level(btblock) < fab->height)
-		goto out;
-
 	/* Make sure we pass the verifiers. */
 	bp->b_ops->verify_read(bp);
 	if (bp->b_error)
 		goto out;
-	fab->root = agbno;
-	fab->height = xfs_btree_get_level(btblock) + 1;
-	*found_it = true;
+
+	/*
+	 * This block passes the magic/uuid and verifier tests for this btree
+	 * type.  We don't need the caller to try the other tree types.
+	 */
+	*done_with_block = true;
+
+	/*
+	 * Compare this btree block's level to the height of the current
+	 * candidate root block.
+	 *
+	 * If the level matches the root we found previously, throw away both
+	 * blocks because there can't be two candidate roots.
+	 *
+	 * If level is lower in the tree than the root we found previously,
+	 * ignore this block.
+	 */
+	block_level = xfs_btree_get_level(btblock);
+	if (block_level + 1 == fab->height) {
+		fab->root = NULLAGBLOCK;
+		goto out;
+	} else if (block_level < fab->height) {
+		goto out;
+	}
+
+	/*
+	 * This is the highest block in the tree that we've found so far.
+	 * Update the btree height to reflect what we've learned from this
+	 * block.
+	 */
+	fab->height = block_level + 1;
+
+	/*
+	 * If this block doesn't have sibling pointers, then it's the new root
+	 * block candidate.  Otherwise, the root will be found farther up the
+	 * tree.
+	 */
+	if (btblock->bb_u.s.bb_leftsib == cpu_to_be32(NULLAGBLOCK) &&
+	    btblock->bb_u.s.bb_rightsib == cpu_to_be32(NULLAGBLOCK))
+		fab->root = agbno;
+	else
+		fab->root = NULLAGBLOCK;
 
 	trace_xrep_findroot_block(mp, ri->sc->sa.agno, agbno,
 			be32_to_cpu(btblock->bb_magic), fab->height - 1);
@@ -768,7 +803,7 @@ xrep_findroot_rmap(
 	struct xrep_findroot		*ri = priv;
 	struct xrep_find_ag_btree	*fab;
 	xfs_agblock_t			b;
-	bool				found_it;
+	bool				done;
 	int				error = 0;
 
 	/* Ignore anything that isn't AG metadata. */
@@ -777,16 +812,16 @@ xrep_findroot_rmap(
 
 	/* Otherwise scan each block + btree type. */
 	for (b = 0; b < rec->rm_blockcount; b++) {
-		found_it = false;
+		done = false;
 		for (fab = ri->btree_info; fab->buf_ops; fab++) {
 			if (rec->rm_owner != fab->rmap_owner)
 				continue;
 			error = xrep_findroot_block(ri, fab,
 					rec->rm_owner, rec->rm_startblock + b,
-					&found_it);
+					&done);
 			if (error)
 				return error;
-			if (found_it)
+			if (done)
 				break;
 		}
 	}

^ permalink raw reply related	[flat|nested] 13+ messages in thread

* [PATCH v2 2/2] xfs: fix buffer state management in xrep_findroot_block
  2018-10-01 22:48 ` [PATCH 2/2] xfs: fix buffer state management in xrep_findroot_block Darrick J. Wong
  2018-10-02 12:36   ` Brian Foster
@ 2018-10-03  2:02   ` Darrick J. Wong
  2018-10-03 11:05     ` Brian Foster
  1 sibling, 1 reply; 13+ messages in thread
From: Darrick J. Wong @ 2018-10-03  2:02 UTC (permalink / raw)
  To: david; +Cc: linux-xfs, bfoster

We don't handle buffer state properly in online repair's findroot
routine.  If a buffer already has b_ops set, we don't ever want to touch
that, and we don't want to call the read verifiers on a buffer that
could be dirty (CRCs are only recomputed during log checkpoints).

Therefore, be more careful about what we do with a buffer -- if someone
else already attached ops that are not the ones for this btree type,
just ignore the buffer.  We only attach our btree type's buf ops if it
matches the magic/uuid and structure checks.

We also modify xfs_buf_read_map to allow callers to set buffer ops on a
DONE buffer with NULL ops so that repair doesn't leave behind buffers
which won't have buffers attached to them.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 fs/xfs/scrub/repair.c  |   63 +++++++++++++++++++++++++++++++++++++-----------
 fs/xfs/xfs_buf.c       |   12 +++++++++
 fs/xfs/xfs_trans.h     |    1 +
 fs/xfs/xfs_trans_buf.c |   13 ++++++++++
 4 files changed, 75 insertions(+), 14 deletions(-)

diff --git a/fs/xfs/scrub/repair.c b/fs/xfs/scrub/repair.c
index 63786341ac2a..cebaebb26566 100644
--- a/fs/xfs/scrub/repair.c
+++ b/fs/xfs/scrub/repair.c
@@ -29,6 +29,8 @@
 #include "xfs_ag_resv.h"
 #include "xfs_trans_space.h"
 #include "xfs_quota.h"
+#include "xfs_attr.h"
+#include "xfs_reflink.h"
 #include "scrub/xfs_scrub.h"
 #include "scrub/scrub.h"
 #include "scrub/common.h"
@@ -699,7 +701,7 @@ xrep_findroot_block(
 	struct xfs_btree_block		*btblock;
 	xfs_daddr_t			daddr;
 	int				block_level;
-	int				error;
+	int				error = 0;
 
 	daddr = XFS_AGB_TO_DADDR(mp, ri->sc->sa.agno, agbno);
 
@@ -718,28 +720,61 @@ xrep_findroot_block(
 			return error;
 	}
 
+	/*
+	 * Read the buffer into memory so that we can see if it's a match for
+	 * our btree type.  We have no clue if it is beforehand, and we want to
+	 * avoid xfs_trans_read_buf's behavior of dumping the DONE state (which
+	 * will cause needless disk reads in subsequent calls to this function)
+	 * and logging metadata verifier failures.
+	 *
+	 * Therefore, pass in NULL buffer ops.  If the buffer was already in
+	 * memory from some other caller it will already have b_ops assigned.
+	 * If it was in memory from a previous unsuccessful findroot_block
+	 * call, the buffer won't have b_ops but it should be clean and ready
+	 * for us to try to verify if the read call succeeds.  The same applies
+	 * if the buffer wasn't in memory at all.
+	 *
+	 * Note: If we never match a btree type with this buffer, it will be
+	 * left in memory with NULL b_ops.  This shouldn't be a problem unless
+	 * the buffer gets written.
+	 */
 	error = xfs_trans_read_buf(mp, ri->sc->tp, mp->m_ddev_targp, daddr,
 			mp->m_bsize, 0, &bp, NULL);
 	if (error)
 		return error;
 
-	/*
-	 * Does this look like a block matching our fs and higher than any
-	 * other block we've found so far?  If so, reattach buffer verifiers
-	 * so the AIL won't complain if the buffer is also dirty.
-	 */
+	/* Ensure the block magic matches the btree type we're looking for. */
 	btblock = XFS_BUF_TO_BLOCK(bp);
 	if (be32_to_cpu(btblock->bb_magic) != fab->magic)
 		goto out;
-	if (xfs_sb_version_hascrc(&mp->m_sb) &&
-	    !uuid_equal(&btblock->bb_u.s.bb_uuid, &mp->m_sb.sb_meta_uuid))
-		goto out;
-	bp->b_ops = fab->buf_ops;
 
-	/* Make sure we pass the verifiers. */
-	bp->b_ops->verify_read(bp);
-	if (bp->b_error)
-		goto out;
+	/*
+	 * If the buffer already has ops applied and they're not the ones for
+	 * this btree type, we know this block doesn't match the btree and we
+	 * can bail out.
+	 *
+	 * If the buffer ops match ours, someone else has already validated
+	 * the block for us, so we can move on to checking if this is a root
+	 * block candidate.
+	 *
+	 * If the buffer does not have ops, nobody has successfully validated
+	 * the contents and the buffer cannot be dirty.  If the magic, uuid,
+	 * and structure match this btree type then we'll move on to checking
+	 * if it's a root block candidate.  If there is no match, bail out.
+	 */
+	if (bp->b_ops) {
+		if (bp->b_ops != fab->buf_ops)
+			goto out;
+	} else {
+		ASSERT(!xfs_trans_buf_is_dirty(bp));
+		if (!uuid_equal(&btblock->bb_u.s.bb_uuid,
+				&mp->m_sb.sb_meta_uuid))
+			goto out;
+		fab->buf_ops->verify_read(bp);
+		if (bp->b_error)
+			goto out;
+		bp->b_ops = fab->buf_ops;
+	}
 
 	/*
 	 * This block passes the magic/uuid and verifier tests for this btree
diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c
index e839907e8492..7aa78cef9c3a 100644
--- a/fs/xfs/xfs_buf.c
+++ b/fs/xfs/xfs_buf.c
@@ -777,6 +777,18 @@ xfs_buf_read_map(
 			xfs_buf_relse(bp);
 			return NULL;
 		} else {
+			/*
+			 * Online repair can leave done buffers in memory with
+			 * no b_ops assigned.  If the caller passed in an ops
+			 * structure, see if the buffer passes the verifier.
+			 * If successful, assign the ops to the buffer.
+			 */
+			if (!bp->b_ops && ops) {
+				ops->verify_read(bp);
+				if (!bp->b_error)
+					bp->b_ops = ops;
+			}
+
 			/* We do not want read in the flags */
 			bp->b_flags &= ~XBF_READ;
 		}
diff --git a/fs/xfs/xfs_trans.h b/fs/xfs/xfs_trans.h
index c3d278e96ad1..a0c5dbda18aa 100644
--- a/fs/xfs/xfs_trans.h
+++ b/fs/xfs/xfs_trans.h
@@ -220,6 +220,7 @@ void		xfs_trans_ijoin(struct xfs_trans *, struct xfs_inode *, uint);
 void		xfs_trans_log_buf(struct xfs_trans *, struct xfs_buf *, uint,
 				  uint);
 void		xfs_trans_dirty_buf(struct xfs_trans *, struct xfs_buf *);
+bool		xfs_trans_buf_is_dirty(struct xfs_buf *bp);
 void		xfs_trans_log_inode(xfs_trans_t *, struct xfs_inode *, uint);
 
 void		xfs_extent_free_init_defer_op(void);
diff --git a/fs/xfs/xfs_trans_buf.c b/fs/xfs/xfs_trans_buf.c
index 15919f67a88f..bdc5be4d0210 100644
--- a/fs/xfs/xfs_trans_buf.c
+++ b/fs/xfs/xfs_trans_buf.c
@@ -321,6 +321,19 @@ xfs_trans_read_buf_map(
 
 }
 
+/* Has this buffer been dirtied by anyone? */
+bool
+xfs_trans_buf_is_dirty(
+	struct xfs_buf		*bp)
+{
+	struct xfs_buf_log_item	*bip = bp->b_log_item;
+
+	if (!bip)
+		return false;
+	ASSERT(bip->bli_item.li_type == XFS_LI_BUF);
+	return test_bit(XFS_LI_DIRTY, &bip->bli_item.li_flags);
+}
+
 /*
  * Release the buffer bp which was previously acquired with one of the
  * xfs_trans_... buffer allocation routines if the buffer has not

^ permalink raw reply related	[flat|nested] 13+ messages in thread

* Re: [PATCH 2/2] xfs: fix buffer state management in xrep_findroot_block
  2018-10-02 19:56     ` Darrick J. Wong
@ 2018-10-03 11:03       ` Brian Foster
  0 siblings, 0 replies; 13+ messages in thread
From: Brian Foster @ 2018-10-03 11:03 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: david, linux-xfs

On Tue, Oct 02, 2018 at 12:56:17PM -0700, Darrick J. Wong wrote:
> On Tue, Oct 02, 2018 at 08:36:41AM -0400, Brian Foster wrote:
> > On Mon, Oct 01, 2018 at 03:48:28PM -0700, Darrick J. Wong wrote:
> > > From: Darrick J. Wong <darrick.wong@oracle.com>
> > > 
> > > We don't quite handle buffer state properly in online repair's findroot
> > > routine.  If the buffer is already in-core we don't want to trash its
> > > b_ops and state, but we /do/ want to read the contents in from disk if
> > > the buffer wasn't already in memory.
> > > 
> > > Therefore, introduce the ONESHOT_OPS buffer flag.  If the buffer had to
> > > be read in from disk, the buf_ops supplied to read_buf will be used once
> > > to verify the buffer contents, but they will not be attached to the
> > > buffer itself.
> > > 
> > > With this change, xrep_findroot_block's buffer handling can be
> > > simplified -- if b_ops is null, we know that it was freshly read (and
> > > verified), we can set b_ops, and proceed.  If b_ops is not null, the
> > > buffer was already in memory and we need to do some more structure
> > > checks.
> > > 
> > > Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> > > ---
> > >  fs/xfs/scrub/repair.c |   76 ++++++++++++++++++++++++++++++++++++++-----------
> > >  fs/xfs/xfs_buf.c      |    7 ++++-
> > >  fs/xfs/xfs_buf.h      |    2 +
> > >  3 files changed, 67 insertions(+), 18 deletions(-)
> > > 
> > > 
> > > diff --git a/fs/xfs/scrub/repair.c b/fs/xfs/scrub/repair.c
> > > index 6eb66b3543ff..8e1dfa5752b4 100644
> > > --- a/fs/xfs/scrub/repair.c
> > > +++ b/fs/xfs/scrub/repair.c
> > ...
> > > @@ -718,28 +721,67 @@ xrep_findroot_block(
> > >  			return error;
> > >  	}
> > >  
> > > +	/*
> > > +	 * Read the buffer into memory with ONESHOT_OPS.  We can tell if we
> > > +	 * had to go to disk by whether or not b_ops is set.
> > > +	 */
> > >  	error = xfs_trans_read_buf(mp, ri->sc->tp, mp->m_ddev_targp, daddr,
> > > -			mp->m_bsize, 0, &bp, NULL);
> > > +			mp->m_bsize, XBF_ONESHOT_OPS, &bp, fab->buf_ops);
> > >  	if (error)
> > >  		return error;
> > >  
> > > -	/*
> > > -	 * Does this look like a block matching our fs and higher than any
> > > -	 * other block we've found so far?  If so, reattach buffer verifiers
> > > -	 * so the AIL won't complain if the buffer is also dirty.
> > > -	 */
> > >  	btblock = XFS_BUF_TO_BLOCK(bp);
> > > -	if (be32_to_cpu(btblock->bb_magic) != fab->magic)
> > > -		goto out;
> > > -	if (xfs_sb_version_hascrc(&mp->m_sb) &&
> > > -	    !uuid_equal(&btblock->bb_u.s.bb_uuid, &mp->m_sb.sb_meta_uuid))
> > > -		goto out;
> > > -	bp->b_ops = fab->buf_ops;
> > > +	if (bp->b_ops == NULL) {
> > > +		/*
> > > +		 * The buffer came back with NULL b_ops, which means that the
> > > +		 * buffer was not in memory, we had to read it off disk, and
> > > +		 * the verifier has run.
> > > +		 *
> > > +		 * If b_error == 0, the block matches the magic, the uuid, and
> > > +		 * ->verify_struct of the btree type and we can proceed.  Set
> > > +		 * b_ops to the btree type's buf_ops so that we don't have a
> > > +		 * buffer in memory with no verifiers.
> > > +		 *
> > > +		 * Otherwise, the block doesn't match the btree type so we
> > > +		 * mark it oneshot so it doesn't stick around, and move on.
> > > +		 */
> > 
> > Hmm, the fact that this interface still requires looking at internal
> > buffer state to process what happened suggests to me it may not be the
> > best approach.
> 
> Aha!  Having recollected more of my marbles, now I remember why the
> previous iteration of this patch did the weird dance with get_buf and
> read_buf.  At a high level, we want to do:
> 
>   1. If the buffer is already in memory, we want to grab it, check the
>   magic, uuid, and structure, and if it doesn't matches our btree type
>   then we move on.
> 
>   2. If the buffer is /not/ in memory (i.e. !XBF_DONE), then we want to
>   read it off the disk and run the verifier to see if it matches our
>   btree type.  If not, we move on.
> 
> However, the buffer state management doesn't /quite/ work that way.  In
> particular, if we call _trans_read_buf on a !DONE buffer and fail
> ->verify_read, it will log the verifier error and stale the buffer,
> which means that a subsequent call of xrep_findroot_block with the same
> buffer will have to reread the disk and may very well log a verifier
> error too.  This is obnoxious since we're wasting resources and logging
> non-errors.
> 
> That's why the previous patch called _trans_read_buf with NULL ops.  In
> the first case we'd get back our DONE buffer and proceed with checks.
> In the !DONE case, _trans_read_buf would read the metadata off disk into
> the buffer, set DONE, and return it to us unchecked.  Then we'd call
> ->verify_read ourselves; if it succeeds we attach the ops and use the
> buffer, or else we release the buffer (which would still have null ops).
> In theory if anyone ever tried to write the buffer without attaching ops
> the log code will complain.
> 
> However, there /is/ a bug -- if none of the find_ag_btree profiles match
> the buffer, the xrep code will never attach any buf ops.  A subsequent
> _trans_read_buf call from some other part of XFS will see the DONE state
> and decline to attach the passed in ops, which later leads to a log
> splat on the NULL b_ops.
> 

Right, that's my understanding of the original problem. And IIRC, one of
the tradeoffs with the previous approach (and this one) was that in
attempting to fix that problem, we'd introduce behavior where the buffer
had to be re-read from disk each time.

> I think the solution here is to modify xfs_buf_read_map to handle a
> (DONE buffer with NULL b_ops and !NULL passed-in ops) to call
> ->verify_read on the buffer and set b_ops if successful.
> 
> (please keep reading)
> 
> > IIUC, a buffer that's already in-core is just going to
> > return with the existing b_ops (the else branch below). Otherwise,
> > ->b_ops is NULL to indicate that we read it from disk and ran the passed
> > in (prospective) verifier...
> > 
> > > +		if (bp->b_error) {
> > > +			xfs_buf_oneshot(bp);
> > > +			goto out;
> > > +		}
> > 
> > ... which means that this can't happen, because if we ran the passed in
> > verifier and it fails the xfs_trans_read_buf() call invalidates the
> > buffer and returns an error. I think that also means we'd abort the scan
> > if we don't filter out -EFSCORRUPTED returns so long as we pass a
> > potentially incorrect verifier.
> 
> Correct.  Good catch! :)
> 
> > Also note that I think this still means that each failure to match a
> > verifier has to re-read the buffer from disk, but we can deal with that
> > separately.
> > 
> > > +		bp->b_ops = fab->buf_ops;
> > 
> > Therefore, a successful return with a NULL ->b_ops means we ran the
> > verifier, it passed, and we're going to attach it. Am I missing
> > something, or is this a roundabout way to just do what the buffer read
> > mechanism already does without the flag?
> 
> Yes, it's roundabout.
> 
> > I could easily be missing some of the verification intricacies that are
> > required from scrub context,
> 
> I think I've explained it above, now that I finally remember what I was
> trying to accomplish. :)
> 
> > but is this logic roughly equivalent to the following (without any
> > special control flags)? We'd repeat some checks in the event the
> > buffer was read from disk, but perhaps that can also be optimized out
> > separately if it really matters. Hm?
> 
> How about something like this?
> 

This looks reasonable enough to me. I'd just suggest to streamline the
comments a bit...

> 	/*
> 	 * Read the buffer into memory so that we can see if it's a match for
> 	 * our btree type.  We have no clue if it is beforehand, and we want to
> 	 * avoid xfs_trans_read_buf's behavior of dumping the DONE state (which
> 	 * will cause needless disk reads in subsequent calls to this function)
> 	 * and logging metadata verifier failures.
> 	 *
> 	 * Therefore, pass in NULL buffer ops.  If the buffer was already in
> 	 * memory from some other caller it will already have b_ops assigned.
> 	 * If it was in memory from a previous unsuccessful findroot_block
> 	 * call, the buffer won't have b_ops but it should be clean and ready
> 	 * for us to try to verify if the read call succeeds.  The same applies
> 	 * if the buffer wasn't in memory at all.
> 	 *
> 	 * Note: If we never match a btree type with this buffer, it will be
> 	 * left in memory with NULL b_ops.  This shouldn't be a problem unless
> 	 * the buffer gets written.
> 	 */

/*
 * Read an unverified buffer so we can match it to one of the caller's
 * verifiers without generating I/O errors and log noise. If we can't
 * match it, the buffer may remain in-core without a verifier. This is
 * fine because the first reader with a non-NULL buf_ops verifies an
 * unverified buffer regardless of whether it was already in-core.
 */

> 	error = xfs_trans_read_buf(mp, ri->sc->tp, mp->m_ddev_targp, daddr,
> 			mp->m_bsize, 0, &bp, NULL);
> 	if (error)
> 		return error;
> 
> 	/*
> 	 * If the buffer already has ops applied and they're not the ones for
> 	 * this btree type, we know this block doesn't match the btree and we
> 	 * can bail out.
> 	 *
> 	 * If the buffer ops match ours, someone else has already validated
> 	 * the block for us, so we can move on to checking if this is a root
> 	 * block candidate.
> 	 *
> 	 * If the buffer does not have ops, nobody has successfully validated
> 	 * the contents and the buffer cannot be dirty.  If the magic, uuid,
> 	 * and structure match this btree type then we'll move on to checking
> 	 * if it's a root block candidate.  If there is no match, bail out.
> 	 */
> 	if (bp->b_ops) {
> 		if (b->b_ops != fab->buf_ops)
> 			goto out;
> 	} else {
> 		if (be32_to_cpu(btblock->bb_magic) != fab->magic)
> 			goto out;
> 		if (xfs_sb_version_hascrc(&mp->m_sb) &&
> 		    !uuid_equal(&btblock->bb_u.s.bb_uuid,
> 				&mp->m_sb.sb_meta_uuid))
> 			goto out;
> 		fa = fab->buf_ops->verify_struct(bp);
> 		if (fa)
> 			goto out;
> 		b->b_ops = fab->buf_ops;
> 	}
> 

... and then maybe reorder that logic:

	/*
	 * Verify the buffer against fab->buf_ops. If it's a match,
	 * attach buf_ops and proceed. If not, ->b_ops remains NULL
	 * until we find a match or another reader attaches the
	 * appropriate buf_ops.
	 */
	if (!bp->b_ops) {
		...
		b->b_ops = fab->buf_ops;
	}

	/*
	 * If the buffer was in-core and already verified with a
	 * different buf_ops, we know it's not a match.
	 */
	if (bp->b_ops != fab->buf_ops)
		goto out;

> and then xfs_buf_read_map can be taught to call the read verifier if it
> ever encounters an XBF_DONE buffer with b_ops == NULL and ops != NULL.
> 

Yep, but I think the higher level buffer behavior change should be an
independent patch rather than be buried in a scrub bug fix.

Brian

> --D
> 
> > 
> > 	...
> >         error = xfs_trans_read_buf(mp, ri->sc->tp, mp->m_ddev_targp, daddr,
> >                         mp->m_bsize, 0, &bp, fab->buf_ops);
> >         if (error == -EFSCORRUPTED)
> >                 goto out;
> >         else if (error)
> >                 return error;
> > 
> >         /*
> >          * If b_ops doesn't match fab, the buffer was already in-core with a
> >          * different verifier. We know it doesn't match in that case.
> >          */
> >         if (bp->b_ops != fab->buf_ops)
> >                 goto out;
> > 
> >         /*
> >          * The verifier matches but the buffer may have already been in-core
> >          * with this verifier. Sanity check the latest content before we
> >          * proceed. Don't check the CRC because CRCs aren't updated until
> >          * writeback.
> >          */
> >         btblock = XFS_BUF_TO_BLOCK(bp);
> >         if (be32_to_cpu(btblock->bb_magic) != fab->magic)
> >                 goto out;
> >         if (xfs_sb_version_hascrc(&mp->m_sb) &&
> >             !uuid_equal(&btblock->bb_u.s.bb_uuid,
> >                         &mp->m_sb.sb_meta_uuid))
> >                 goto out;
> >         fa = bp->b_ops->verify_struct(bp);
> >         if (fa)
> >                 goto out;
> > 	...
> > 
> > Brian
> > 
> > > +	} else {
> > > +		/*
> > > +		 * The buffer came back with b_ops set, which means that the
> > > +		 * buffer was already in memory.  We'll run the magic, uuid,
> > > +		 * and ->verify_struct checks by hand to see if we want to
> > > +		 * examine it further.
> > > +		 */
> > > +		if (be32_to_cpu(btblock->bb_magic) != fab->magic)
> > > +			goto out;
> > > +		if (xfs_sb_version_hascrc(&mp->m_sb) &&
> > > +		    !uuid_equal(&btblock->bb_u.s.bb_uuid,
> > > +				&mp->m_sb.sb_meta_uuid))
> > > +			goto out;
> > >  
> > > -	/* Make sure we pass the verifiers. */
> > > -	bp->b_ops->verify_read(bp);
> > > -	if (bp->b_error)
> > > -		goto out;
> > > +		/*
> > > +		 * If b_ops do not match fab->buf_ops even though the magic
> > > +		 * does, something is seriously wrong here and we'd rather
> > > +		 * abort the entire repair than risk screwing things up.
> > > +		 */
> > > +		if (bp->b_ops != fab->buf_ops)
> > > +			goto out;
> > > +
> > > +		/*
> > > +		 * If the buffer was already incore (on a v5 fs) then it should
> > > +		 * already have had b_ops assigned.  Call ->verify_struct to
> > > +		 * check the structure.  Avoid checking the CRC because we
> > > +		 * don't calculate CRCs until the buffer is written by the log.
> > > +		 */
> > > +		fa = bp->b_ops->verify_struct(bp);
> > > +		if (fa)
> > > +			goto out;
> > > +	}
> > >  
> > >  	/*
> > >  	 * This block passes the magic/uuid and verifier tests for this btree
> > > diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c
> > > index e839907e8492..139d35f6da64 100644
> > > --- a/fs/xfs/xfs_buf.c
> > > +++ b/fs/xfs/xfs_buf.c
> > > @@ -206,7 +206,8 @@ _xfs_buf_alloc(
> > >  	 * We don't want certain flags to appear in b_flags unless they are
> > >  	 * specifically set by later operations on the buffer.
> > >  	 */
> > > -	flags &= ~(XBF_UNMAPPED | XBF_TRYLOCK | XBF_ASYNC | XBF_READ_AHEAD);
> > > +	flags &= ~(XBF_UNMAPPED | XBF_TRYLOCK | XBF_ASYNC | XBF_READ_AHEAD |
> > > +		   XBF_ONESHOT_OPS);
> > >  
> > >  	atomic_set(&bp->b_hold, 1);
> > >  	atomic_set(&bp->b_lru_ref, 1);
> > > @@ -758,6 +759,7 @@ xfs_buf_read_map(
> > >  	const struct xfs_buf_ops *ops)
> > >  {
> > >  	struct xfs_buf		*bp;
> > > +	const struct xfs_buf_ops *orig_ops;
> > >  
> > >  	flags |= XBF_READ;
> > >  
> > > @@ -767,8 +769,11 @@ xfs_buf_read_map(
> > >  
> > >  		if (!(bp->b_flags & XBF_DONE)) {
> > >  			XFS_STATS_INC(target->bt_mount, xb_get_read);
> > > +			orig_ops = bp->b_ops;
> > >  			bp->b_ops = ops;
> > >  			_xfs_buf_read(bp, flags);
> > > +			if (flags & XBF_ONESHOT_OPS)
> > > +				bp->b_ops = orig_ops;
> > >  		} else if (flags & XBF_ASYNC) {
> > >  			/*
> > >  			 * Read ahead call which is already satisfied,
> > > diff --git a/fs/xfs/xfs_buf.h b/fs/xfs/xfs_buf.h
> > > index 4e3171acd0f8..62139d3ad349 100644
> > > --- a/fs/xfs/xfs_buf.h
> > > +++ b/fs/xfs/xfs_buf.h
> > > @@ -34,6 +34,7 @@ typedef enum {
> > >  #define XBF_ASYNC	 (1 << 4) /* initiator will not wait for completion */
> > >  #define XBF_DONE	 (1 << 5) /* all pages in the buffer uptodate */
> > >  #define XBF_STALE	 (1 << 6) /* buffer has been staled, do not find it */
> > > +#define XBF_ONESHOT_OPS	 (1 << 7) /* only use ops if we read in the buf */
> > >  #define XBF_WRITE_FAIL	 (1 << 24)/* async writes have failed on this buffer */
> > >  
> > >  /* I/O hints for the BIO layer */
> > > @@ -61,6 +62,7 @@ typedef unsigned int xfs_buf_flags_t;
> > >  	{ XBF_ASYNC,		"ASYNC" }, \
> > >  	{ XBF_DONE,		"DONE" }, \
> > >  	{ XBF_STALE,		"STALE" }, \
> > > +	{ XBF_ONESHOT_OPS,	"ONESHOT_OPS" }, /* should never be set */ \
> > >  	{ XBF_WRITE_FAIL,	"WRITE_FAIL" }, \
> > >  	{ XBF_SYNCIO,		"SYNCIO" }, \
> > >  	{ XBF_FUA,		"FUA" }, \
> > > 

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH v2 2/2] xfs: fix buffer state management in xrep_findroot_block
  2018-10-03  2:02   ` [PATCH v2 " Darrick J. Wong
@ 2018-10-03 11:05     ` Brian Foster
  2018-10-04 22:15       ` Darrick J. Wong
  0 siblings, 1 reply; 13+ messages in thread
From: Brian Foster @ 2018-10-03 11:05 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: david, linux-xfs

On Tue, Oct 02, 2018 at 07:02:11PM -0700, Darrick J. Wong wrote:
> We don't handle buffer state properly in online repair's findroot
> routine.  If a buffer already has b_ops set, we don't ever want to touch
> that, and we don't want to call the read verifiers on a buffer that
> could be dirty (CRCs are only recomputed during log checkpoints).
> 
> Therefore, be more careful about what we do with a buffer -- if someone
> else already attached ops that are not the ones for this btree type,
> just ignore the buffer.  We only attach our btree type's buf ops if it
> matches the magic/uuid and structure checks.
> 
> We also modify xfs_buf_read_map to allow callers to set buffer ops on a
> DONE buffer with NULL ops so that repair doesn't leave behind buffers
> which won't have buffers attached to them.
> 
> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> ---
>  fs/xfs/scrub/repair.c  |   63 +++++++++++++++++++++++++++++++++++++-----------
>  fs/xfs/xfs_buf.c       |   12 +++++++++
>  fs/xfs/xfs_trans.h     |    1 +
>  fs/xfs/xfs_trans_buf.c |   13 ++++++++++
>  4 files changed, 75 insertions(+), 14 deletions(-)
> 
> diff --git a/fs/xfs/scrub/repair.c b/fs/xfs/scrub/repair.c
> index 63786341ac2a..cebaebb26566 100644
> --- a/fs/xfs/scrub/repair.c
> +++ b/fs/xfs/scrub/repair.c
> @@ -29,6 +29,8 @@
>  #include "xfs_ag_resv.h"
>  #include "xfs_trans_space.h"
>  #include "xfs_quota.h"
> +#include "xfs_attr.h"
> +#include "xfs_reflink.h"
>  #include "scrub/xfs_scrub.h"
>  #include "scrub/scrub.h"
>  #include "scrub/common.h"
> @@ -699,7 +701,7 @@ xrep_findroot_block(
>  	struct xfs_btree_block		*btblock;
>  	xfs_daddr_t			daddr;
>  	int				block_level;
> -	int				error;
> +	int				error = 0;
>  
>  	daddr = XFS_AGB_TO_DADDR(mp, ri->sc->sa.agno, agbno);
>  
> @@ -718,28 +720,61 @@ xrep_findroot_block(
>  			return error;
>  	}
>  
> +	/*
> +	 * Read the buffer into memory so that we can see if it's a match for
> +	 * our btree type.  We have no clue if it is beforehand, and we want to
> +	 * avoid xfs_trans_read_buf's behavior of dumping the DONE state (which
> +	 * will cause needless disk reads in subsequent calls to this function)
> +	 * and logging metadata verifier failures.
> +	 *
> +	 * Therefore, pass in NULL buffer ops.  If the buffer was already in
> +	 * memory from some other caller it will already have b_ops assigned.
> +	 * If it was in memory from a previous unsuccessful findroot_block
> +	 * call, the buffer won't have b_ops but it should be clean and ready
> +	 * for us to try to verify if the read call succeeds.  The same applies
> +	 * if the buffer wasn't in memory at all.
> +	 *
> +	 * Note: If we never match a btree type with this buffer, it will be
> +	 * left in memory with NULL b_ops.  This shouldn't be a problem unless
> +	 * the buffer gets written.
> +	 */
>  	error = xfs_trans_read_buf(mp, ri->sc->tp, mp->m_ddev_targp, daddr,
>  			mp->m_bsize, 0, &bp, NULL);
>  	if (error)
>  		return error;
>  
> -	/*
> -	 * Does this look like a block matching our fs and higher than any
> -	 * other block we've found so far?  If so, reattach buffer verifiers
> -	 * so the AIL won't complain if the buffer is also dirty.
> -	 */
> +	/* Ensure the block magic matches the btree type we're looking for. */
>  	btblock = XFS_BUF_TO_BLOCK(bp);
>  	if (be32_to_cpu(btblock->bb_magic) != fab->magic)
>  		goto out;
> -	if (xfs_sb_version_hascrc(&mp->m_sb) &&
> -	    !uuid_equal(&btblock->bb_u.s.bb_uuid, &mp->m_sb.sb_meta_uuid))
> -		goto out;
> -	bp->b_ops = fab->buf_ops;
>  
> -	/* Make sure we pass the verifiers. */
> -	bp->b_ops->verify_read(bp);
> -	if (bp->b_error)
> -		goto out;
> +	/*
> +	 * If the buffer already has ops applied and they're not the ones for
> +	 * this btree type, we know this block doesn't match the btree and we
> +	 * can bail out.
> +	 *
> +	 * If the buffer ops match ours, someone else has already validated
> +	 * the block for us, so we can move on to checking if this is a root
> +	 * block candidate.
> +	 *
> +	 * If the buffer does not have ops, nobody has successfully validated
> +	 * the contents and the buffer cannot be dirty.  If the magic, uuid,
> +	 * and structure match this btree type then we'll move on to checking
> +	 * if it's a root block candidate.  If there is no match, bail out.
> +	 */
> +	if (bp->b_ops) {
> +		if (bp->b_ops != fab->buf_ops)
> +			goto out;
> +	} else {
> +		ASSERT(!xfs_trans_buf_is_dirty(bp));
> +		if (!uuid_equal(&btblock->bb_u.s.bb_uuid,
> +				&mp->m_sb.sb_meta_uuid))
> +			goto out;
> +		fab->buf_ops->verify_read(bp);
> +		if (bp->b_error)
> +			goto out;
> +		bp->b_ops = fab->buf_ops;
> +	}
>  
>  	/*
>  	 * This block passes the magic/uuid and verifier tests for this btree
> diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c
> index e839907e8492..7aa78cef9c3a 100644
> --- a/fs/xfs/xfs_buf.c
> +++ b/fs/xfs/xfs_buf.c

Separate patch for the core buffer changes please. :)

> @@ -777,6 +777,18 @@ xfs_buf_read_map(
>  			xfs_buf_relse(bp);
>  			return NULL;
>  		} else {
> +			/*
> +			 * Online repair can leave done buffers in memory with
> +			 * no b_ops assigned.  If the caller passed in an ops
> +			 * structure, see if the buffer passes the verifier.
> +			 * If successful, assign the ops to the buffer.
> +			 */
> +			if (!bp->b_ops && ops) {
> +				ops->verify_read(bp);
> +				if (!bp->b_error)
> +					bp->b_ops = ops;

Why the conditional assignment? The read branch above looks like all
current ops != NULL callers essentially force attach ->b_ops. E.g., a
verification failure is an unconditional I/O error in this context, so
ops is not considered "prospective" at this layer like it is in the
scrub context. Not that it matters that much, but I think it's strange
to attach ->b_ops (on failure) based on whether the buffer had already
been read unverified or not.

Another quirk to consider here is that we currently don't set XBF_DONE
on read verifier failure. Should we clear XBF_DONE in the event of
deferred verification failure so a follow on read can retry?

> +			}
> +

And since this is a core infrastructure change, I also think we need to
step back and think about how this could be used in a variety of cases
going forward (not just our current scrub use case). For one (and nobody
does this right now), but it looks like an xfs_trans_read_buf(..., NULL)
caller followed by an xfs_trans_read_buf(..., ops) caller would still
never verify the buffer because it can be looked up in the transaction
without hitting the read path.

Also, it may be technically Ok to defer verification in the readahead
case of a previously unverified buffer, but I _think_ I'm of the opinion
that we should verify the buffer ASAP when the caller has provided a
non-NULL ops. So if verification currently happens on read-ahead in some
particular workload (in combination with scrub), we should preserve that
behavior. At minimum, doing so ensures we don't lose potential
verification failures if a particular readahead buffer never otherwise
ends up being read "for real."

Hmm, I'm wondering if it might be a good idea to factor some combination
of the current ->verify_read() call and ->b_error handling into a small
helper, make sure we call it in all the right places and conditionalize
the error handling to callers that aren't in ioend context. I dunno,
maybe there's a cleaner way to do that. I also think we could use some
asserts to (for example) make sure we always return a ->b_ops != NULL
buffer if the caller passed a buf_ops. FWIW, I'm also not against the
XBF_VERIFIED thing to facilitate that, if that's cleaner than just
checking ->b_ops everywhere.

Brian

>  			/* We do not want read in the flags */
>  			bp->b_flags &= ~XBF_READ;
>  		}
> diff --git a/fs/xfs/xfs_trans.h b/fs/xfs/xfs_trans.h
> index c3d278e96ad1..a0c5dbda18aa 100644
> --- a/fs/xfs/xfs_trans.h
> +++ b/fs/xfs/xfs_trans.h
> @@ -220,6 +220,7 @@ void		xfs_trans_ijoin(struct xfs_trans *, struct xfs_inode *, uint);
>  void		xfs_trans_log_buf(struct xfs_trans *, struct xfs_buf *, uint,
>  				  uint);
>  void		xfs_trans_dirty_buf(struct xfs_trans *, struct xfs_buf *);
> +bool		xfs_trans_buf_is_dirty(struct xfs_buf *bp);
>  void		xfs_trans_log_inode(xfs_trans_t *, struct xfs_inode *, uint);
>  
>  void		xfs_extent_free_init_defer_op(void);
> diff --git a/fs/xfs/xfs_trans_buf.c b/fs/xfs/xfs_trans_buf.c
> index 15919f67a88f..bdc5be4d0210 100644
> --- a/fs/xfs/xfs_trans_buf.c
> +++ b/fs/xfs/xfs_trans_buf.c
> @@ -321,6 +321,19 @@ xfs_trans_read_buf_map(
>  
>  }
>  
> +/* Has this buffer been dirtied by anyone? */
> +bool
> +xfs_trans_buf_is_dirty(
> +	struct xfs_buf		*bp)
> +{
> +	struct xfs_buf_log_item	*bip = bp->b_log_item;
> +
> +	if (!bip)
> +		return false;
> +	ASSERT(bip->bli_item.li_type == XFS_LI_BUF);
> +	return test_bit(XFS_LI_DIRTY, &bip->bli_item.li_flags);
> +}
> +
>  /*
>   * Release the buffer bp which was previously acquired with one of the
>   * xfs_trans_... buffer allocation routines if the buffer has not

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH v2 2/2] xfs: fix buffer state management in xrep_findroot_block
  2018-10-03 11:05     ` Brian Foster
@ 2018-10-04 22:15       ` Darrick J. Wong
  2018-10-05 10:27         ` Brian Foster
  0 siblings, 1 reply; 13+ messages in thread
From: Darrick J. Wong @ 2018-10-04 22:15 UTC (permalink / raw)
  To: Brian Foster; +Cc: david, linux-xfs

On Wed, Oct 03, 2018 at 07:05:27AM -0400, Brian Foster wrote:
> On Tue, Oct 02, 2018 at 07:02:11PM -0700, Darrick J. Wong wrote:
> > We don't handle buffer state properly in online repair's findroot
> > routine.  If a buffer already has b_ops set, we don't ever want to touch
> > that, and we don't want to call the read verifiers on a buffer that
> > could be dirty (CRCs are only recomputed during log checkpoints).
> > 
> > Therefore, be more careful about what we do with a buffer -- if someone
> > else already attached ops that are not the ones for this btree type,
> > just ignore the buffer.  We only attach our btree type's buf ops if it
> > matches the magic/uuid and structure checks.
> > 
> > We also modify xfs_buf_read_map to allow callers to set buffer ops on a
> > DONE buffer with NULL ops so that repair doesn't leave behind buffers
> > which won't have buffers attached to them.
> > 
> > Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> > ---
> >  fs/xfs/scrub/repair.c  |   63 +++++++++++++++++++++++++++++++++++++-----------
> >  fs/xfs/xfs_buf.c       |   12 +++++++++
> >  fs/xfs/xfs_trans.h     |    1 +
> >  fs/xfs/xfs_trans_buf.c |   13 ++++++++++
> >  4 files changed, 75 insertions(+), 14 deletions(-)
> > 
> > diff --git a/fs/xfs/scrub/repair.c b/fs/xfs/scrub/repair.c
> > index 63786341ac2a..cebaebb26566 100644
> > --- a/fs/xfs/scrub/repair.c
> > +++ b/fs/xfs/scrub/repair.c
> > @@ -29,6 +29,8 @@
> >  #include "xfs_ag_resv.h"
> >  #include "xfs_trans_space.h"
> >  #include "xfs_quota.h"
> > +#include "xfs_attr.h"
> > +#include "xfs_reflink.h"
> >  #include "scrub/xfs_scrub.h"
> >  #include "scrub/scrub.h"
> >  #include "scrub/common.h"
> > @@ -699,7 +701,7 @@ xrep_findroot_block(
> >  	struct xfs_btree_block		*btblock;
> >  	xfs_daddr_t			daddr;
> >  	int				block_level;
> > -	int				error;
> > +	int				error = 0;
> >  
> >  	daddr = XFS_AGB_TO_DADDR(mp, ri->sc->sa.agno, agbno);
> >  
> > @@ -718,28 +720,61 @@ xrep_findroot_block(
> >  			return error;
> >  	}
> >  
> > +	/*
> > +	 * Read the buffer into memory so that we can see if it's a match for
> > +	 * our btree type.  We have no clue if it is beforehand, and we want to
> > +	 * avoid xfs_trans_read_buf's behavior of dumping the DONE state (which
> > +	 * will cause needless disk reads in subsequent calls to this function)
> > +	 * and logging metadata verifier failures.
> > +	 *
> > +	 * Therefore, pass in NULL buffer ops.  If the buffer was already in
> > +	 * memory from some other caller it will already have b_ops assigned.
> > +	 * If it was in memory from a previous unsuccessful findroot_block
> > +	 * call, the buffer won't have b_ops but it should be clean and ready
> > +	 * for us to try to verify if the read call succeeds.  The same applies
> > +	 * if the buffer wasn't in memory at all.
> > +	 *
> > +	 * Note: If we never match a btree type with this buffer, it will be
> > +	 * left in memory with NULL b_ops.  This shouldn't be a problem unless
> > +	 * the buffer gets written.
> > +	 */
> >  	error = xfs_trans_read_buf(mp, ri->sc->tp, mp->m_ddev_targp, daddr,
> >  			mp->m_bsize, 0, &bp, NULL);
> >  	if (error)
> >  		return error;
> >  
> > -	/*
> > -	 * Does this look like a block matching our fs and higher than any
> > -	 * other block we've found so far?  If so, reattach buffer verifiers
> > -	 * so the AIL won't complain if the buffer is also dirty.
> > -	 */
> > +	/* Ensure the block magic matches the btree type we're looking for. */
> >  	btblock = XFS_BUF_TO_BLOCK(bp);
> >  	if (be32_to_cpu(btblock->bb_magic) != fab->magic)
> >  		goto out;
> > -	if (xfs_sb_version_hascrc(&mp->m_sb) &&
> > -	    !uuid_equal(&btblock->bb_u.s.bb_uuid, &mp->m_sb.sb_meta_uuid))
> > -		goto out;
> > -	bp->b_ops = fab->buf_ops;
> >  
> > -	/* Make sure we pass the verifiers. */
> > -	bp->b_ops->verify_read(bp);
> > -	if (bp->b_error)
> > -		goto out;
> > +	/*
> > +	 * If the buffer already has ops applied and they're not the ones for
> > +	 * this btree type, we know this block doesn't match the btree and we
> > +	 * can bail out.
> > +	 *
> > +	 * If the buffer ops match ours, someone else has already validated
> > +	 * the block for us, so we can move on to checking if this is a root
> > +	 * block candidate.
> > +	 *
> > +	 * If the buffer does not have ops, nobody has successfully validated
> > +	 * the contents and the buffer cannot be dirty.  If the magic, uuid,
> > +	 * and structure match this btree type then we'll move on to checking
> > +	 * if it's a root block candidate.  If there is no match, bail out.
> > +	 */
> > +	if (bp->b_ops) {
> > +		if (bp->b_ops != fab->buf_ops)
> > +			goto out;
> > +	} else {
> > +		ASSERT(!xfs_trans_buf_is_dirty(bp));
> > +		if (!uuid_equal(&btblock->bb_u.s.bb_uuid,
> > +				&mp->m_sb.sb_meta_uuid))
> > +			goto out;
> > +		fab->buf_ops->verify_read(bp);
> > +		if (bp->b_error)
> > +			goto out;
> > +		bp->b_ops = fab->buf_ops;
> > +	}
> >  
> >  	/*
> >  	 * This block passes the magic/uuid and verifier tests for this btree
> > diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c
> > index e839907e8492..7aa78cef9c3a 100644
> > --- a/fs/xfs/xfs_buf.c
> > +++ b/fs/xfs/xfs_buf.c
> 
> Separate patch for the core buffer changes please. :)

Ok, sorry about that.

> > @@ -777,6 +777,18 @@ xfs_buf_read_map(
> >  			xfs_buf_relse(bp);
> >  			return NULL;
> >  		} else {
> > +			/*
> > +			 * Online repair can leave done buffers in memory with
> > +			 * no b_ops assigned.  If the caller passed in an ops
> > +			 * structure, see if the buffer passes the verifier.
> > +			 * If successful, assign the ops to the buffer.
> > +			 */
> > +			if (!bp->b_ops && ops) {
> > +				ops->verify_read(bp);
> > +				if (!bp->b_error)
> > +					bp->b_ops = ops;
> 
> Why the conditional assignment? The read branch above looks like all
> current ops != NULL callers essentially force attach ->b_ops. E.g., a
> verification failure is an unconditional I/O error in this context, so
> ops is not considered "prospective" at this layer like it is in the
> scrub context.

You're right.  If the caller gave us an *ops that means they want to
take ownership of the buffer, and that implies setting b_ops.

> Not that it matters that much, but I think it's strange to attach
> ->b_ops (on failure) based on whether the buffer had already been read
> unverified or not.
> 
> Another quirk to consider here is that we currently don't set XBF_DONE
> on read verifier failure. Should we clear XBF_DONE in the event of
> deferred verification failure so a follow on read can retry?

Yes.  Originally I thought that we'd leave the NULL b_ops on verifier
failure so that a future caller might succeed with its own ops, but I
realize that doing so means the caller passes in ops and we'd pass back
a corrupt buffer with NULL ops.  Oops.

> > +			}
> > +
> 
> And since this is a core infrastructure change, I also think we need to
> step back and think about how this could be used in a variety of cases
> going forward (not just our current scrub use case). For one (and nobody
> does this right now), but it looks like an xfs_trans_read_buf(..., NULL)
> caller followed by an xfs_trans_read_buf(..., ops) caller would still
> never verify the buffer because it can be looked up in the transaction
> without hitting the read path.

I don't think I quite agree with you here.  Normally we require callers
of xfs_trans_read_buf to provide an ops structure so that we don't play
with obviously corrupt data.  Upon receipt of the buffer, the caller
needs to decide if it's truly interested in that buffer.

If so, the caller needs to set the b_ops manually before moving
on, in which case a subsequent _trans_read_buf on the same transaction
and buffer will be fine.  The existing caller (quotacheck) does this,
as will repair.

If the caller doesn't want the buffer, it should release the buffer
immediately and let a subsequent reader try its own read verifier.
Currently we don't try the read verifier again, which is what this part
of the patch aims to fix.

That said, we could be a little more defensive in regular operation.  If
someone tries a _trans_read_buf of a buffer that's already joined to the
transaction but has no ops we'll complain but only shut down if the
buffer fails verification.

> Also, it may be technically Ok to defer verification in the readahead
> case of a previously unverified buffer, but I _think_ I'm of the opinion
> that we should verify the buffer ASAP when the caller has provided a
> non-NULL ops. So if verification currently happens on read-ahead in some
> particular workload (in combination with scrub), we should preserve that
> behavior. At minimum, doing so ensures we don't lose potential
> verification failures if a particular readahead buffer never otherwise
> ends up being read "for real."

Agreed.

> Hmm, I'm wondering if it might be a good idea to factor some combination
> of the current ->verify_read() call and ->b_error handling into a small
> helper, make sure we call it in all the right places and conditionalize
> the error handling to callers that aren't in ioend context. I dunno,
> maybe there's a cleaner way to do that. I also think we could use some
> asserts to (for example) make sure we always return a ->b_ops != NULL
> buffer if the caller passed a buf_ops. FWIW, I'm also not against the
> XBF_VERIFIED thing to facilitate that, if that's cleaner than just
> checking ->b_ops everywhere.

Yes, I think we can share that in a xfs_buf_ensure_ops helper.

--D

> Brian
> 
> >  			/* We do not want read in the flags */
> >  			bp->b_flags &= ~XBF_READ;
> >  		}
> > diff --git a/fs/xfs/xfs_trans.h b/fs/xfs/xfs_trans.h
> > index c3d278e96ad1..a0c5dbda18aa 100644
> > --- a/fs/xfs/xfs_trans.h
> > +++ b/fs/xfs/xfs_trans.h
> > @@ -220,6 +220,7 @@ void		xfs_trans_ijoin(struct xfs_trans *, struct xfs_inode *, uint);
> >  void		xfs_trans_log_buf(struct xfs_trans *, struct xfs_buf *, uint,
> >  				  uint);
> >  void		xfs_trans_dirty_buf(struct xfs_trans *, struct xfs_buf *);
> > +bool		xfs_trans_buf_is_dirty(struct xfs_buf *bp);
> >  void		xfs_trans_log_inode(xfs_trans_t *, struct xfs_inode *, uint);
> >  
> >  void		xfs_extent_free_init_defer_op(void);
> > diff --git a/fs/xfs/xfs_trans_buf.c b/fs/xfs/xfs_trans_buf.c
> > index 15919f67a88f..bdc5be4d0210 100644
> > --- a/fs/xfs/xfs_trans_buf.c
> > +++ b/fs/xfs/xfs_trans_buf.c
> > @@ -321,6 +321,19 @@ xfs_trans_read_buf_map(
> >  
> >  }
> >  
> > +/* Has this buffer been dirtied by anyone? */
> > +bool
> > +xfs_trans_buf_is_dirty(
> > +	struct xfs_buf		*bp)
> > +{
> > +	struct xfs_buf_log_item	*bip = bp->b_log_item;
> > +
> > +	if (!bip)
> > +		return false;
> > +	ASSERT(bip->bli_item.li_type == XFS_LI_BUF);
> > +	return test_bit(XFS_LI_DIRTY, &bip->bli_item.li_flags);
> > +}
> > +
> >  /*
> >   * Release the buffer bp which was previously acquired with one of the
> >   * xfs_trans_... buffer allocation routines if the buffer has not

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH v2 2/2] xfs: fix buffer state management in xrep_findroot_block
  2018-10-04 22:15       ` Darrick J. Wong
@ 2018-10-05 10:27         ` Brian Foster
  0 siblings, 0 replies; 13+ messages in thread
From: Brian Foster @ 2018-10-05 10:27 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: david, linux-xfs

On Thu, Oct 04, 2018 at 03:15:38PM -0700, Darrick J. Wong wrote:
> On Wed, Oct 03, 2018 at 07:05:27AM -0400, Brian Foster wrote:
> > On Tue, Oct 02, 2018 at 07:02:11PM -0700, Darrick J. Wong wrote:
> > > We don't handle buffer state properly in online repair's findroot
> > > routine.  If a buffer already has b_ops set, we don't ever want to touch
> > > that, and we don't want to call the read verifiers on a buffer that
> > > could be dirty (CRCs are only recomputed during log checkpoints).
> > > 
> > > Therefore, be more careful about what we do with a buffer -- if someone
> > > else already attached ops that are not the ones for this btree type,
> > > just ignore the buffer.  We only attach our btree type's buf ops if it
> > > matches the magic/uuid and structure checks.
> > > 
> > > We also modify xfs_buf_read_map to allow callers to set buffer ops on a
> > > DONE buffer with NULL ops so that repair doesn't leave behind buffers
> > > which won't have buffers attached to them.
> > > 
> > > Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> > > ---
> > >  fs/xfs/scrub/repair.c  |   63 +++++++++++++++++++++++++++++++++++++-----------
> > >  fs/xfs/xfs_buf.c       |   12 +++++++++
> > >  fs/xfs/xfs_trans.h     |    1 +
> > >  fs/xfs/xfs_trans_buf.c |   13 ++++++++++
> > >  4 files changed, 75 insertions(+), 14 deletions(-)
> > > 
..
> > 
> > And since this is a core infrastructure change, I also think we need to
> > step back and think about how this could be used in a variety of cases
> > going forward (not just our current scrub use case). For one (and nobody
> > does this right now), but it looks like an xfs_trans_read_buf(..., NULL)
> > caller followed by an xfs_trans_read_buf(..., ops) caller would still
> > never verify the buffer because it can be looked up in the transaction
> > without hitting the read path.
> 
> I don't think I quite agree with you here.  Normally we require callers
> of xfs_trans_read_buf to provide an ops structure so that we don't play
> with obviously corrupt data.  Upon receipt of the buffer, the caller
> needs to decide if it's truly interested in that buffer.
> 
> If so, the caller needs to set the b_ops manually before moving
> on, in which case a subsequent _trans_read_buf on the same transaction
> and buffer will be fine.  The existing caller (quotacheck) does this,
> as will repair.
> 
> If the caller doesn't want the buffer, it should release the buffer
> immediately and let a subsequent reader try its own read verifier.
> Currently we don't try the read verifier again, which is what this part
> of the patch aims to fix.
> 

That all sounds reasonable. My point is not necessarily to say that we
have to support this case. The concern is that nothing in the interface
prevents it and thus it's a landmine at best. The point is basically
that we should never return an unverified buffer to a caller that passes
a non-NULL ops...

> That said, we could be a little more defensive in regular operation.  If
> someone tries a _trans_read_buf of a buffer that's already joined to the
> transaction but has no ops we'll complain but only shut down if the
> buffer fails verification.
> 

... which doesn't preclude returning an error (or some sufficiently loud
combination of assert/warning or whatever..) in the case where an
unverified buffer was found in the current transaction, if that's what
you prefer. ;)

Brian

> > Also, it may be technically Ok to defer verification in the readahead
> > case of a previously unverified buffer, but I _think_ I'm of the opinion
> > that we should verify the buffer ASAP when the caller has provided a
> > non-NULL ops. So if verification currently happens on read-ahead in some
> > particular workload (in combination with scrub), we should preserve that
> > behavior. At minimum, doing so ensures we don't lose potential
> > verification failures if a particular readahead buffer never otherwise
> > ends up being read "for real."
> 
> Agreed.
> 
> > Hmm, I'm wondering if it might be a good idea to factor some combination
> > of the current ->verify_read() call and ->b_error handling into a small
> > helper, make sure we call it in all the right places and conditionalize
> > the error handling to callers that aren't in ioend context. I dunno,
> > maybe there's a cleaner way to do that. I also think we could use some
> > asserts to (for example) make sure we always return a ->b_ops != NULL
> > buffer if the caller passed a buf_ops. FWIW, I'm also not against the
> > XBF_VERIFIED thing to facilitate that, if that's cleaner than just
> > checking ->b_ops everywhere.
> 
> Yes, I think we can share that in a xfs_buf_ensure_ops helper.
> 
> --D
> 
> > Brian
> > 
> > >  			/* We do not want read in the flags */
> > >  			bp->b_flags &= ~XBF_READ;
> > >  		}
> > > diff --git a/fs/xfs/xfs_trans.h b/fs/xfs/xfs_trans.h
> > > index c3d278e96ad1..a0c5dbda18aa 100644
> > > --- a/fs/xfs/xfs_trans.h
> > > +++ b/fs/xfs/xfs_trans.h
> > > @@ -220,6 +220,7 @@ void		xfs_trans_ijoin(struct xfs_trans *, struct xfs_inode *, uint);
> > >  void		xfs_trans_log_buf(struct xfs_trans *, struct xfs_buf *, uint,
> > >  				  uint);
> > >  void		xfs_trans_dirty_buf(struct xfs_trans *, struct xfs_buf *);
> > > +bool		xfs_trans_buf_is_dirty(struct xfs_buf *bp);
> > >  void		xfs_trans_log_inode(xfs_trans_t *, struct xfs_inode *, uint);
> > >  
> > >  void		xfs_extent_free_init_defer_op(void);
> > > diff --git a/fs/xfs/xfs_trans_buf.c b/fs/xfs/xfs_trans_buf.c
> > > index 15919f67a88f..bdc5be4d0210 100644
> > > --- a/fs/xfs/xfs_trans_buf.c
> > > +++ b/fs/xfs/xfs_trans_buf.c
> > > @@ -321,6 +321,19 @@ xfs_trans_read_buf_map(
> > >  
> > >  }
> > >  
> > > +/* Has this buffer been dirtied by anyone? */
> > > +bool
> > > +xfs_trans_buf_is_dirty(
> > > +	struct xfs_buf		*bp)
> > > +{
> > > +	struct xfs_buf_log_item	*bip = bp->b_log_item;
> > > +
> > > +	if (!bip)
> > > +		return false;
> > > +	ASSERT(bip->bli_item.li_type == XFS_LI_BUF);
> > > +	return test_bit(XFS_LI_DIRTY, &bip->bli_item.li_flags);
> > > +}
> > > +
> > >  /*
> > >   * Release the buffer bp which was previously acquired with one of the
> > >   * xfs_trans_... buffer allocation routines if the buffer has not

^ permalink raw reply	[flat|nested] 13+ messages in thread

end of thread, other threads:[~2018-10-05 17:26 UTC | newest]

Thread overview: 13+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-10-01 22:48 [PATCH v2 0/2] xfs-4.20: scrub fixes Darrick J. Wong
2018-10-01 22:48 ` [PATCH 1/2] xfs: xrep_findroot_block should reject root blocks with siblings Darrick J. Wong
2018-10-02 12:34   ` Brian Foster
2018-10-02 15:39     ` Darrick J. Wong
2018-10-03  2:01   ` [PATCH v2 " Darrick J. Wong
2018-10-01 22:48 ` [PATCH 2/2] xfs: fix buffer state management in xrep_findroot_block Darrick J. Wong
2018-10-02 12:36   ` Brian Foster
2018-10-02 19:56     ` Darrick J. Wong
2018-10-03 11:03       ` Brian Foster
2018-10-03  2:02   ` [PATCH v2 " Darrick J. Wong
2018-10-03 11:05     ` Brian Foster
2018-10-04 22:15       ` Darrick J. Wong
2018-10-05 10:27         ` Brian Foster

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.