All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH 0/5] fix log recovery for v5 superblocks
@ 2016-08-11 17:11 Brian Foster
  2016-08-11 17:11 ` [PATCH 1/5] xfs: rework log recovery to submit buffers on LSN boundaries Brian Foster
                   ` (4 more replies)
  0 siblings, 5 replies; 14+ messages in thread
From: Brian Foster @ 2016-08-11 17:11 UTC (permalink / raw)
  To: xfs

Hi all,

This series addresses a problem with log recovery on v5 superblocks
where log recovery doesn't update the metadata LSN of buffers it writes
out. The v5 superblock metadata LSN recovery ordering logic helps
prevent transient corruption states and works fine in most cases. Since
recovery doesn't update the metadata LSN of buffers it writes out,
however, it introduces the possibility of ordering problems if the
recovery happens to repeat from the beginning (e.g., system crash after
recovery I/O has completed but before the log tail is pushed forward).

This tends to manifest as reports of directory corruption during log
recovery and commonly reproduces in situations where a system is
restarted frequently (e.g., constant/frequent reboot tests). It can be
reproduced fairly easily on demand by running fsstress in a simple
(non-log forcing) shutdown loop, e.g.:

	while [ true ]; do 
		fsstress -d <dir> -p 4 &
		sleep $((RANDOM % 5))
		xfs_io -xc shutdown <dir>
		umount <dir>
		mount <dev> <dir> || break
	done

The problem is that the log attempts to replay a buffer update that is
no longer valid due to subsequent, already replayed updates. This
results in a corruption error when, in fact, the filesystem is perfectly
fine. The problem is addressed by updating the metadata LSN of buffers
submitted for I/O by log recovery. This bit is implemented in patch 4.

Updating metadata LSNs in this manner uncovered another issue in which
log recovery would incorrectly skip modifications from independent
transactions if they happen to share an LSN with a previous transaction
that modified the same buffer (which had already updated the LSN). This
particular problem is addressed in patch 2, which modifies log recovery
to submit buffers on current LSN boundaries rather than transaction
boundaries. The remaining patches fix up some minor issues, refactor a
bit and add some tracepoints.

This has survived a single xfstests run so far but still requires more
regression testing. It has also survived several thousand iterations of
a shutdown test similar to the above that normally reproduces problems
within tens of iterations (xfstests test tbd).

Brian

Brian Foster (5):
  xfs: rework log recovery to submit buffers on LSN boundaries
  xfs: pass current lsn to log recovery buffer validation
  xfs: don't warn on buffers not being recovered due to LSN
  xfs: update metadata LSN in buffers during log recovery
  xfs: log recovery tracepoints to track current lsn and buffer
    submission

 fs/xfs/xfs_log_priv.h    |   3 +-
 fs/xfs/xfs_log_recover.c | 188 +++++++++++++++++++++++++++++++++--------------
 fs/xfs/xfs_trace.h       |  31 +++++++-
 3 files changed, 163 insertions(+), 59 deletions(-)

-- 
2.5.5

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 14+ messages in thread

* [PATCH 1/5] xfs: rework log recovery to submit buffers on LSN boundaries
  2016-08-11 17:11 [PATCH 0/5] fix log recovery for v5 superblocks Brian Foster
@ 2016-08-11 17:11 ` Brian Foster
  2016-08-29  1:16   ` Dave Chinner
  2016-08-11 17:11 ` [PATCH 2/5] xfs: pass current lsn to log recovery buffer validation Brian Foster
                   ` (3 subsequent siblings)
  4 siblings, 1 reply; 14+ messages in thread
From: Brian Foster @ 2016-08-11 17:11 UTC (permalink / raw)
  To: xfs

The fix to log recovery to update the metadata LSN in recovered buffers
introduces the requirement that a buffer is submitted only once per
current LSN. Log recovery currently submits buffers on transaction
boundaries. This is not sufficient as the abstraction between log
records and transactions allows for various scenarios where multiple
transactions can share the same current LSN. If independent transactions
share an LSN and both modify the same buffer, log recovery can
incorrectly skip updates and leave the filesystem in an inconsisent
state.

In preparation for proper metadata LSN updates during log recovery,
update log recovery to submit buffers for write on LSN change boundaries
rather than transaction boundaries. Explicitly track the current LSN in
a new struct xlog field to handle the various corner cases of when the
current LSN may or may not change.

Signed-off-by: Brian Foster <bfoster@redhat.com>
---
 fs/xfs/xfs_log_priv.h    |  3 +-
 fs/xfs/xfs_log_recover.c | 82 +++++++++++++++++++++++++++++++++++++-----------
 2 files changed, 66 insertions(+), 19 deletions(-)

diff --git a/fs/xfs/xfs_log_priv.h b/fs/xfs/xfs_log_priv.h
index 765f084..2b6eec5 100644
--- a/fs/xfs/xfs_log_priv.h
+++ b/fs/xfs/xfs_log_priv.h
@@ -413,7 +413,8 @@ struct xlog {
 	/* log record crc error injection factor */
 	uint32_t		l_badcrc_factor;
 #endif
-
+	/* log recovery lsn tracking (for buffer submission */
+	xfs_lsn_t		l_recovery_lsn;
 };
 
 #define XLOG_BUF_CANCEL_BUCKET(log, blkno) \
diff --git a/fs/xfs/xfs_log_recover.c b/fs/xfs/xfs_log_recover.c
index e8638fd..97a0af9 100644
--- a/fs/xfs/xfs_log_recover.c
+++ b/fs/xfs/xfs_log_recover.c
@@ -3846,14 +3846,13 @@ STATIC int
 xlog_recover_commit_trans(
 	struct xlog		*log,
 	struct xlog_recover	*trans,
-	int			pass)
+	int			pass,
+	struct list_head	*buffer_list)
 {
 	int				error = 0;
-	int				error2;
 	int				items_queued = 0;
 	struct xlog_recover_item	*item;
 	struct xlog_recover_item	*next;
-	LIST_HEAD			(buffer_list);
 	LIST_HEAD			(ra_list);
 	LIST_HEAD			(done_list);
 
@@ -3876,7 +3875,7 @@ xlog_recover_commit_trans(
 			items_queued++;
 			if (items_queued >= XLOG_RECOVER_COMMIT_QUEUE_MAX) {
 				error = xlog_recover_items_pass2(log, trans,
-						&buffer_list, &ra_list);
+						buffer_list, &ra_list);
 				list_splice_tail_init(&ra_list, &done_list);
 				items_queued = 0;
 			}
@@ -3894,15 +3893,14 @@ out:
 	if (!list_empty(&ra_list)) {
 		if (!error)
 			error = xlog_recover_items_pass2(log, trans,
-					&buffer_list, &ra_list);
+					buffer_list, &ra_list);
 		list_splice_tail_init(&ra_list, &done_list);
 	}
 
 	if (!list_empty(&done_list))
 		list_splice_init(&done_list, &trans->r_itemq);
 
-	error2 = xfs_buf_delwri_submit(&buffer_list);
-	return error ? error : error2;
+	return error;
 }
 
 STATIC void
@@ -4085,7 +4083,8 @@ xlog_recovery_process_trans(
 	char			*dp,
 	unsigned int		len,
 	unsigned int		flags,
-	int			pass)
+	int			pass,
+	struct list_head	*buffer_list)
 {
 	int			error = 0;
 	bool			freeit = false;
@@ -4109,7 +4108,8 @@ xlog_recovery_process_trans(
 		error = xlog_recover_add_to_cont_trans(log, trans, dp, len);
 		break;
 	case XLOG_COMMIT_TRANS:
-		error = xlog_recover_commit_trans(log, trans, pass);
+		error = xlog_recover_commit_trans(log, trans, pass,
+						  buffer_list);
 		/* success or fail, we are now done with this transaction. */
 		freeit = true;
 		break;
@@ -4191,10 +4191,12 @@ xlog_recover_process_ophdr(
 	struct xlog_op_header	*ohead,
 	char			*dp,
 	char			*end,
-	int			pass)
+	int			pass,
+	struct list_head	*buffer_list)
 {
 	struct xlog_recover	*trans;
 	unsigned int		len;
+	int			error;
 
 	/* Do we understand who wrote this op? */
 	if (ohead->oh_clientid != XFS_TRANSACTION &&
@@ -4221,8 +4223,39 @@ xlog_recover_process_ophdr(
 		return 0;
 	}
 
+	/*
+	 * Recovered buffers are submitted for I/O on current LSN change
+	 * boundaries. This is necessary to accommodate metadata LSN ordering
+	 * rules of v5 superblock filesystems.
+	 *
+	 * Store the new current LSN in l_recovery_lsn as we cannot rely on
+	 * either record boundaries or transaction boundaries alone to track LSN
+	 * changes. This has several contributing factors:
+	 *
+	 * - Metadata LSNs are updated at buffer submission time. Thus, buffers
+	 *   can only be submitted safely once per current LSN value.
+	 * - The current LSN is defined as the start cycle/block of the first
+	 *   record in which a transaction appears.
+	 * - A record can hold multiple transactions. Thus, a transaction change
+	 *   does not guarantee a change in current LSN.
+	 * - A transaction can span multiple records. Thus, a record change does
+	 *   not guarantee a change in current LSN. Consider the case where a
+	 *   record holds one small transaction and a subsequent that carries
+	 *   over to the next record. Both transactions share the same LSN as
+	 *   per the definition of the current LSN.
+	 *
+	 * In summary, this means we must track the current LSN independently
+	 * and submit buffers for the previous LSN only when it has changed.
+	 */
+	if (log->l_recovery_lsn != trans->r_lsn) {
+		error = xfs_buf_delwri_submit(buffer_list);
+		if (error)
+			return error;
+		log->l_recovery_lsn = trans->r_lsn;
+	}
+
 	return xlog_recovery_process_trans(log, trans, dp, len,
-					   ohead->oh_flags, pass);
+					   ohead->oh_flags, pass, buffer_list);
 }
 
 /*
@@ -4240,7 +4273,8 @@ xlog_recover_process_data(
 	struct hlist_head	rhash[],
 	struct xlog_rec_header	*rhead,
 	char			*dp,
-	int			pass)
+	int			pass,
+	struct list_head	*buffer_list)
 {
 	struct xlog_op_header	*ohead;
 	char			*end;
@@ -4262,7 +4296,7 @@ xlog_recover_process_data(
 
 		/* errors will abort recovery */
 		error = xlog_recover_process_ophdr(log, rhash, rhead, ohead,
-						    dp, end, pass);
+						   dp, end, pass, buffer_list);
 		if (error)
 			return error;
 
@@ -4685,7 +4719,8 @@ xlog_recover_process(
 	struct hlist_head	rhash[],
 	struct xlog_rec_header	*rhead,
 	char			*dp,
-	int			pass)
+	int			pass,
+	struct list_head	*buffer_list)
 {
 	int			error;
 	__le32			crc;
@@ -4732,7 +4767,8 @@ xlog_recover_process(
 	if (error)
 		return error;
 
-	return xlog_recover_process_data(log, rhash, rhead, dp, pass);
+	return xlog_recover_process_data(log, rhash, rhead, dp, pass,
+					 buffer_list);
 }
 
 STATIC int
@@ -4793,9 +4829,11 @@ xlog_do_recovery_pass(
 	char			*offset;
 	xfs_buf_t		*hbp, *dbp;
 	int			error = 0, h_size, h_len;
+	int			error2 = 0;
 	int			bblks, split_bblks;
 	int			hblks, split_hblks, wrapped_hblks;
 	struct hlist_head	rhash[XLOG_RHASH_SIZE];
+	LIST_HEAD		(buffer_list);
 
 	ASSERT(head_blk != tail_blk);
 	rhead_blk = 0;
@@ -4981,7 +5019,7 @@ xlog_do_recovery_pass(
 			}
 
 			error = xlog_recover_process(log, rhash, rhead, offset,
-						     pass);
+						     pass, &buffer_list);
 			if (error)
 				goto bread_err2;
 
@@ -5012,7 +5050,8 @@ xlog_do_recovery_pass(
 		if (error)
 			goto bread_err2;
 
-		error = xlog_recover_process(log, rhash, rhead, offset, pass);
+		error = xlog_recover_process(log, rhash, rhead, offset, pass,
+					     &buffer_list);
 		if (error)
 			goto bread_err2;
 
@@ -5025,10 +5064,17 @@ xlog_do_recovery_pass(
  bread_err1:
 	xlog_put_bp(hbp);
 
+	/*
+	 * Submit buffers that have been added from the last record processed,
+	 * regardless of error status.
+	 */
+	if (!list_empty(&buffer_list))
+		error2 = xfs_buf_delwri_submit(&buffer_list);
+
 	if (error && first_bad)
 		*first_bad = rhead_blk;
 
-	return error;
+	return error ? error : error2;
 }
 
 /*
-- 
2.5.5

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply related	[flat|nested] 14+ messages in thread

* [PATCH 2/5] xfs: pass current lsn to log recovery buffer validation
  2016-08-11 17:11 [PATCH 0/5] fix log recovery for v5 superblocks Brian Foster
  2016-08-11 17:11 ` [PATCH 1/5] xfs: rework log recovery to submit buffers on LSN boundaries Brian Foster
@ 2016-08-11 17:11 ` Brian Foster
  2016-08-11 17:11 ` [PATCH 3/5] xfs: don't warn on buffers not being recovered due to LSN Brian Foster
                   ` (2 subsequent siblings)
  4 siblings, 0 replies; 14+ messages in thread
From: Brian Foster @ 2016-08-11 17:11 UTC (permalink / raw)
  To: xfs

The current LSN must be available to the buffer validation function to
provide the ability to update the metadata LSN of the buffer. Pass the
current_lsn value down to xlog_recover_validate_buf_type() in
preparation.

Signed-off-by: Brian Foster <bfoster@redhat.com>
---
 fs/xfs/xfs_log_recover.c | 14 ++++++++------
 1 file changed, 8 insertions(+), 6 deletions(-)

diff --git a/fs/xfs/xfs_log_recover.c b/fs/xfs/xfs_log_recover.c
index 97a0af9..283bb7e 100644
--- a/fs/xfs/xfs_log_recover.c
+++ b/fs/xfs/xfs_log_recover.c
@@ -2360,7 +2360,8 @@ static void
 xlog_recover_validate_buf_type(
 	struct xfs_mount	*mp,
 	struct xfs_buf		*bp,
-	xfs_buf_log_format_t	*buf_f)
+	xfs_buf_log_format_t	*buf_f,
+	xfs_lsn_t		current_lsn)
 {
 	struct xfs_da_blkinfo	*info = bp->b_addr;
 	__uint32_t		magic32;
@@ -2569,7 +2570,8 @@ xlog_recover_do_reg_buffer(
 	struct xfs_mount	*mp,
 	xlog_recover_item_t	*item,
 	struct xfs_buf		*bp,
-	xfs_buf_log_format_t	*buf_f)
+	xfs_buf_log_format_t	*buf_f,
+	xfs_lsn_t		current_lsn)
 {
 	int			i;
 	int			bit;
@@ -2642,7 +2644,7 @@ xlog_recover_do_reg_buffer(
 	/* Shouldn't be any more regions */
 	ASSERT(i == item->ri_total);
 
-	xlog_recover_validate_buf_type(mp, bp, buf_f);
+	xlog_recover_validate_buf_type(mp, bp, buf_f, current_lsn);
 }
 
 /*
@@ -2685,7 +2687,7 @@ xlog_recover_do_dquot_buffer(
 	if (log->l_quotaoffs_flag & type)
 		return false;
 
-	xlog_recover_do_reg_buffer(mp, item, bp, buf_f);
+	xlog_recover_do_reg_buffer(mp, item, bp, buf_f, NULLCOMMITLSN);
 	return true;
 }
 
@@ -2773,7 +2775,7 @@ xlog_recover_buffer_pass2(
 	 */
 	lsn = xlog_recover_get_buf_lsn(mp, bp);
 	if (lsn && lsn != -1 && XFS_LSN_CMP(lsn, current_lsn) >= 0) {
-		xlog_recover_validate_buf_type(mp, bp, buf_f);
+		xlog_recover_validate_buf_type(mp, bp, buf_f, NULLCOMMITLSN);
 		goto out_release;
 	}
 
@@ -2789,7 +2791,7 @@ xlog_recover_buffer_pass2(
 		if (!dirty)
 			goto out_release;
 	} else {
-		xlog_recover_do_reg_buffer(mp, item, bp, buf_f);
+		xlog_recover_do_reg_buffer(mp, item, bp, buf_f, current_lsn);
 	}
 
 	/*
-- 
2.5.5

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply related	[flat|nested] 14+ messages in thread

* [PATCH 3/5] xfs: don't warn on buffers not being recovered due to LSN
  2016-08-11 17:11 [PATCH 0/5] fix log recovery for v5 superblocks Brian Foster
  2016-08-11 17:11 ` [PATCH 1/5] xfs: rework log recovery to submit buffers on LSN boundaries Brian Foster
  2016-08-11 17:11 ` [PATCH 2/5] xfs: pass current lsn to log recovery buffer validation Brian Foster
@ 2016-08-11 17:11 ` Brian Foster
  2016-08-29  1:25   ` Dave Chinner
  2016-08-11 17:11 ` [PATCH 4/5] xfs: update metadata LSN in buffers during log recovery Brian Foster
  2016-08-11 17:11 ` [PATCH 5/5] xfs: log recovery tracepoints to track current lsn and buffer submission Brian Foster
  4 siblings, 1 reply; 14+ messages in thread
From: Brian Foster @ 2016-08-11 17:11 UTC (permalink / raw)
  To: xfs

The log recovery buffer validation function is invoked in cases where a
buffer update may be skipped due to LSN ordering. If the validation
function happens to come across directory conversion situations (e.g., a
dir3 block to data conversion), it may warn about seeing a buffer log
format of one type and a buffer with a magic number of another.

This warning is not valid as the buffer update is ultimately skipped.
This is indicated by a current_lsn of NULLCOMMITLSN provided by the
caller. As such, update xlog_recover_validate_buf_type() to only warn in
such cases when a buffer update is expected.

XXX: other issues here? better to not validate in such cases?

Signed-off-by: Brian Foster <bfoster@redhat.com>
---
 fs/xfs/xfs_log_recover.c | 59 ++++++++++++++++++++++--------------------------
 1 file changed, 27 insertions(+), 32 deletions(-)

diff --git a/fs/xfs/xfs_log_recover.c b/fs/xfs/xfs_log_recover.c
index 283bb7e..cc46db7 100644
--- a/fs/xfs/xfs_log_recover.c
+++ b/fs/xfs/xfs_log_recover.c
@@ -2367,6 +2367,7 @@ xlog_recover_validate_buf_type(
 	__uint32_t		magic32;
 	__uint16_t		magic16;
 	__uint16_t		magicda;
+	char			*warnmsg = NULL;
 
 	/*
 	 * We can only do post recovery validation on items on CRC enabled
@@ -2405,31 +2406,27 @@ xlog_recover_validate_buf_type(
 			bp->b_ops = &xfs_rmapbt_buf_ops;
 			break;
 		default:
-			xfs_warn(mp, "Bad btree block magic!");
-			ASSERT(0);
+			warnmsg = "Bad btree block magic!";
 			break;
 		}
 		break;
 	case XFS_BLFT_AGF_BUF:
 		if (magic32 != XFS_AGF_MAGIC) {
-			xfs_warn(mp, "Bad AGF block magic!");
-			ASSERT(0);
+			warnmsg = "Bad AGF block magic!";
 			break;
 		}
 		bp->b_ops = &xfs_agf_buf_ops;
 		break;
 	case XFS_BLFT_AGFL_BUF:
 		if (magic32 != XFS_AGFL_MAGIC) {
-			xfs_warn(mp, "Bad AGFL block magic!");
-			ASSERT(0);
+			warnmsg = "Bad AGFL block magic!";
 			break;
 		}
 		bp->b_ops = &xfs_agfl_buf_ops;
 		break;
 	case XFS_BLFT_AGI_BUF:
 		if (magic32 != XFS_AGI_MAGIC) {
-			xfs_warn(mp, "Bad AGI block magic!");
-			ASSERT(0);
+			warnmsg = "Bad AGI block magic!";
 			break;
 		}
 		bp->b_ops = &xfs_agi_buf_ops;
@@ -2439,8 +2436,7 @@ xlog_recover_validate_buf_type(
 	case XFS_BLFT_GDQUOT_BUF:
 #ifdef CONFIG_XFS_QUOTA
 		if (magic16 != XFS_DQUOT_MAGIC) {
-			xfs_warn(mp, "Bad DQUOT block magic!");
-			ASSERT(0);
+			warnmsg = "Bad DQUOT block magic!";
 			break;
 		}
 		bp->b_ops = &xfs_dquot_buf_ops;
@@ -2452,16 +2448,14 @@ xlog_recover_validate_buf_type(
 		break;
 	case XFS_BLFT_DINO_BUF:
 		if (magic16 != XFS_DINODE_MAGIC) {
-			xfs_warn(mp, "Bad INODE block magic!");
-			ASSERT(0);
+			warnmsg = "Bad INODE block magic!";
 			break;
 		}
 		bp->b_ops = &xfs_inode_buf_ops;
 		break;
 	case XFS_BLFT_SYMLINK_BUF:
 		if (magic32 != XFS_SYMLINK_MAGIC) {
-			xfs_warn(mp, "Bad symlink block magic!");
-			ASSERT(0);
+			warnmsg = "Bad symlink block magic!";
 			break;
 		}
 		bp->b_ops = &xfs_symlink_buf_ops;
@@ -2469,8 +2463,7 @@ xlog_recover_validate_buf_type(
 	case XFS_BLFT_DIR_BLOCK_BUF:
 		if (magic32 != XFS_DIR2_BLOCK_MAGIC &&
 		    magic32 != XFS_DIR3_BLOCK_MAGIC) {
-			xfs_warn(mp, "Bad dir block magic!");
-			ASSERT(0);
+			warnmsg = "Bad dir block magic!";
 			break;
 		}
 		bp->b_ops = &xfs_dir3_block_buf_ops;
@@ -2478,8 +2471,7 @@ xlog_recover_validate_buf_type(
 	case XFS_BLFT_DIR_DATA_BUF:
 		if (magic32 != XFS_DIR2_DATA_MAGIC &&
 		    magic32 != XFS_DIR3_DATA_MAGIC) {
-			xfs_warn(mp, "Bad dir data magic!");
-			ASSERT(0);
+			warnmsg = "Bad dir data magic!";
 			break;
 		}
 		bp->b_ops = &xfs_dir3_data_buf_ops;
@@ -2487,8 +2479,7 @@ xlog_recover_validate_buf_type(
 	case XFS_BLFT_DIR_FREE_BUF:
 		if (magic32 != XFS_DIR2_FREE_MAGIC &&
 		    magic32 != XFS_DIR3_FREE_MAGIC) {
-			xfs_warn(mp, "Bad dir3 free magic!");
-			ASSERT(0);
+			warnmsg = "Bad dir3 free magic!";
 			break;
 		}
 		bp->b_ops = &xfs_dir3_free_buf_ops;
@@ -2496,8 +2487,7 @@ xlog_recover_validate_buf_type(
 	case XFS_BLFT_DIR_LEAF1_BUF:
 		if (magicda != XFS_DIR2_LEAF1_MAGIC &&
 		    magicda != XFS_DIR3_LEAF1_MAGIC) {
-			xfs_warn(mp, "Bad dir leaf1 magic!");
-			ASSERT(0);
+			warnmsg = "Bad dir leaf1 magic!";
 			break;
 		}
 		bp->b_ops = &xfs_dir3_leaf1_buf_ops;
@@ -2505,8 +2495,7 @@ xlog_recover_validate_buf_type(
 	case XFS_BLFT_DIR_LEAFN_BUF:
 		if (magicda != XFS_DIR2_LEAFN_MAGIC &&
 		    magicda != XFS_DIR3_LEAFN_MAGIC) {
-			xfs_warn(mp, "Bad dir leafn magic!");
-			ASSERT(0);
+			warnmsg = "Bad dir leafn magic!";
 			break;
 		}
 		bp->b_ops = &xfs_dir3_leafn_buf_ops;
@@ -2514,8 +2503,7 @@ xlog_recover_validate_buf_type(
 	case XFS_BLFT_DA_NODE_BUF:
 		if (magicda != XFS_DA_NODE_MAGIC &&
 		    magicda != XFS_DA3_NODE_MAGIC) {
-			xfs_warn(mp, "Bad da node magic!");
-			ASSERT(0);
+			warnmsg = "Bad da node magic!";
 			break;
 		}
 		bp->b_ops = &xfs_da3_node_buf_ops;
@@ -2523,24 +2511,21 @@ xlog_recover_validate_buf_type(
 	case XFS_BLFT_ATTR_LEAF_BUF:
 		if (magicda != XFS_ATTR_LEAF_MAGIC &&
 		    magicda != XFS_ATTR3_LEAF_MAGIC) {
-			xfs_warn(mp, "Bad attr leaf magic!");
-			ASSERT(0);
+			warnmsg = "Bad attr leaf magic!";
 			break;
 		}
 		bp->b_ops = &xfs_attr3_leaf_buf_ops;
 		break;
 	case XFS_BLFT_ATTR_RMT_BUF:
 		if (magic32 != XFS_ATTR3_RMT_MAGIC) {
-			xfs_warn(mp, "Bad attr remote magic!");
-			ASSERT(0);
+			warnmsg = "Bad attr remote magic!";
 			break;
 		}
 		bp->b_ops = &xfs_attr3_rmt_buf_ops;
 		break;
 	case XFS_BLFT_SB_BUF:
 		if (magic32 != XFS_SB_MAGIC) {
-			xfs_warn(mp, "Bad SB block magic!");
-			ASSERT(0);
+			warnmsg = "Bad SB block magic!";
 			break;
 		}
 		bp->b_ops = &xfs_sb_buf_ops;
@@ -2557,6 +2542,16 @@ xlog_recover_validate_buf_type(
 			 xfs_blft_from_flags(buf_f));
 		break;
 	}
+
+	/*
+	 * A NULL current LSN indicates the buffer update is skipped due to LSN
+	 * ordering. Don't warn in such cases, we skip the update for a reason
+	 * (it's no longer valid)!
+	 */
+	if (warnmsg && current_lsn != NULLCOMMITLSN) {
+		xfs_warn(mp, warnmsg);
+		ASSERT(0);
+	}
 }
 
 /*
-- 
2.5.5

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply related	[flat|nested] 14+ messages in thread

* [PATCH 4/5] xfs: update metadata LSN in buffers during log recovery
  2016-08-11 17:11 [PATCH 0/5] fix log recovery for v5 superblocks Brian Foster
                   ` (2 preceding siblings ...)
  2016-08-11 17:11 ` [PATCH 3/5] xfs: don't warn on buffers not being recovered due to LSN Brian Foster
@ 2016-08-11 17:11 ` Brian Foster
  2016-08-29  1:29   ` Dave Chinner
  2016-08-11 17:11 ` [PATCH 5/5] xfs: log recovery tracepoints to track current lsn and buffer submission Brian Foster
  4 siblings, 1 reply; 14+ messages in thread
From: Brian Foster @ 2016-08-11 17:11 UTC (permalink / raw)
  To: xfs

Log recovery is currently broken for v5 superblocks in that it never
updates the metadata LSN of buffers written out during recovery. The
metadata LSN is recorded in various bits of metadata to provide recovery
ordering criteria that prevents transient corruption states reported by
buffer write verifiers. Without such ordering logic, buffer updates can
be replayed out of order and lead to false positive transient corruption
states. This is generally not a corruption vector on its own, but
corruption detection shuts down the filesystem and ultimately prevents a
mount if it occurs during log recovery. This requires an xfs_repair run
that clears the log and potentially loses filesystem updates.

This problem is avoided in most cases as metadata writes during normal
filesystem operation update the metadata LSN appropriately. The problem
with log recovery not updating metadata LSNs manifests if the system
happens to crash shortly after log recovery itself. In this scenario, it
is possible for log recovery to complete all metadata I/O such that the
filesystem is consistent. If a crash occurs after that point but before
the log tail is pushed forward by subsequent operations, however, the
next mount performs the same log recovery over again. If a buffer is
updated multiple times in the dirty range of the log, an earlier update
in the log might not be valid based on the current state of the
associated buffer after all of the updates in the log had been replayed
(before the previous crash). If a verifier happens to detect such a
problem, the filesystem claims corruption and immediately shuts down.

This commonly manifests in practice as directory block verifier failures
such as the following, likely due to directory verifiers being
particularly detailed in their checks as compared to most others:

  ...
  Mounting V5 Filesystem
  XFS (dm-0): Starting recovery (logdev: internal)
  XFS (dm-0): Internal error XFS_WANT_CORRUPTED_RETURN at line ... of \
    file fs/xfs/libxfs/xfs_dir2_data.c.  Caller xfs_dir3_data_verify ...
  ...

Update log recovery to update the metadata LSN of recovered buffers.
Since metadata LSNs are already updated by write verifer functions via
attached log items, attach a dummy log item to the buffer during
validation and explicitly set the LSN of the current transaction. This
ensures that the metadata LSN of a buffer is updated based on whether
the recovery I/O actually completes, and if so, that subsequent recovery
attempts identify that the buffer is already up to date with respect to
the current transaction.

Signed-off-by: Brian Foster <bfoster@redhat.com>
---
 fs/xfs/xfs_log_recover.c | 31 +++++++++++++++++++++++++++++++
 1 file changed, 31 insertions(+)

diff --git a/fs/xfs/xfs_log_recover.c b/fs/xfs/xfs_log_recover.c
index cc46db7..0c0b743 100644
--- a/fs/xfs/xfs_log_recover.c
+++ b/fs/xfs/xfs_log_recover.c
@@ -44,6 +44,7 @@
 #include "xfs_error.h"
 #include "xfs_dir2.h"
 #include "xfs_rmap_item.h"
+#include "xfs_buf_item.h"
 
 #define BLK_AVG(blk1, blk2)	((blk1+blk2) >> 1)
 
@@ -381,6 +382,15 @@ xlog_recover_iodone(
 						SHUTDOWN_META_IO_ERROR);
 		}
 	}
+
+	/*
+	 * On v5 supers, a bli could be attached to update the metadata LSN.
+	 * Clean it up.
+	 */
+	if (bp->b_fspriv)
+		xfs_buf_item_relse(bp);
+	ASSERT(bp->b_fspriv == NULL);
+
 	bp->b_iodone = NULL;
 	xfs_buf_ioend(bp);
 }
@@ -2552,6 +2562,27 @@ xlog_recover_validate_buf_type(
 		xfs_warn(mp, warnmsg);
 		ASSERT(0);
 	}
+
+	/*
+	 * We must update the metadata LSN of the buffer as it is written out to
+	 * ensure that older transactions never replay over this one and corrupt
+	 * the buffer. This can occur if log recovery is interrupted at some
+	 * point after the current transaction completes, at which point a
+	 * subsequent mount starts recovery from the beginning.
+	 *
+	 * Write verifiers update the metadata LSN from log items attached to
+	 * the buffer. Therefore, initialize a bli purely to carry the LSN to
+	 * the verifier. We'll clean it up in our ->iodone() callback.
+	 */
+	if (bp->b_ops && current_lsn != NULLCOMMITLSN) {
+		struct xfs_buf_log_item	*bip;
+
+		ASSERT(!bp->b_iodone || bp->b_iodone == xlog_recover_iodone);
+		bp->b_iodone = xlog_recover_iodone;
+		xfs_buf_item_init(bp, mp);
+		bip = bp->b_fspriv;
+		bip->bli_item.li_lsn = current_lsn;
+	}
 }
 
 /*
-- 
2.5.5

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply related	[flat|nested] 14+ messages in thread

* [PATCH 5/5] xfs: log recovery tracepoints to track current lsn and buffer submission
  2016-08-11 17:11 [PATCH 0/5] fix log recovery for v5 superblocks Brian Foster
                   ` (3 preceding siblings ...)
  2016-08-11 17:11 ` [PATCH 4/5] xfs: update metadata LSN in buffers during log recovery Brian Foster
@ 2016-08-11 17:11 ` Brian Foster
  4 siblings, 0 replies; 14+ messages in thread
From: Brian Foster @ 2016-08-11 17:11 UTC (permalink / raw)
  To: xfs

Log recovery has particular rules around buffer submission along with
tricky corner cases where independent transactions can share an LSN. As
such, it can be difficult to follow when/why buffers are submitted
during recovery.

Add a couple tracepoints to post the current LSN of a record when a new
record is being processed and when a buffer is being skipped due to LSN
ordering. Also, update the recover item class to include the LSN of the
current transaction for the item being processed.

Signed-off-by: Brian Foster <bfoster@redhat.com>
---
 fs/xfs/xfs_log_recover.c |  2 ++
 fs/xfs/xfs_trace.h       | 31 +++++++++++++++++++++++++++++--
 2 files changed, 31 insertions(+), 2 deletions(-)

diff --git a/fs/xfs/xfs_log_recover.c b/fs/xfs/xfs_log_recover.c
index 0c0b743..ad63ec0 100644
--- a/fs/xfs/xfs_log_recover.c
+++ b/fs/xfs/xfs_log_recover.c
@@ -2801,6 +2801,7 @@ xlog_recover_buffer_pass2(
 	 */
 	lsn = xlog_recover_get_buf_lsn(mp, bp);
 	if (lsn && lsn != -1 && XFS_LSN_CMP(lsn, current_lsn) >= 0) {
+		trace_xfs_log_recover_buf_skip(log, buf_f);
 		xlog_recover_validate_buf_type(mp, bp, buf_f, NULLCOMMITLSN);
 		goto out_release;
 	}
@@ -4316,6 +4317,7 @@ xlog_recover_process_data(
 	if (xlog_header_check_recover(log->l_mp, rhead))
 		return -EIO;
 
+	trace_xfs_log_recover_record(log, rhead, pass);
 	while ((dp < end) && num_logops) {
 
 		ohead = (struct xlog_op_header *)dp;
diff --git a/fs/xfs/xfs_trace.h b/fs/xfs/xfs_trace.h
index 551b7e2..d5a8abd 100644
--- a/fs/xfs/xfs_trace.h
+++ b/fs/xfs/xfs_trace.h
@@ -1985,6 +1985,29 @@ DEFINE_EVENT(xfs_swap_extent_class, name, \
 DEFINE_SWAPEXT_EVENT(xfs_swap_extent_before);
 DEFINE_SWAPEXT_EVENT(xfs_swap_extent_after);
 
+TRACE_EVENT(xfs_log_recover_record,
+	TP_PROTO(struct xlog *log, struct xlog_rec_header *rhead, int pass),
+	TP_ARGS(log, rhead, pass),
+	TP_STRUCT__entry(
+		__field(dev_t, dev)
+		__field(xfs_lsn_t, lsn)
+		__field(int, len)
+		__field(int, num_logops)
+		__field(int, pass)
+	),
+	TP_fast_assign(
+		__entry->dev = log->l_mp->m_super->s_dev;
+		__entry->lsn = be64_to_cpu(rhead->h_lsn);
+		__entry->len = be32_to_cpu(rhead->h_len);
+		__entry->num_logops = be32_to_cpu(rhead->h_num_logops);
+		__entry->pass = pass;
+	),
+	TP_printk("dev %d:%d lsn 0x%llx len 0x%x num_logops 0x%x pass %d",
+		  MAJOR(__entry->dev), MINOR(__entry->dev),
+		  __entry->lsn, __entry->len, __entry->num_logops,
+		   __entry->pass)
+)
+
 DECLARE_EVENT_CLASS(xfs_log_recover_item_class,
 	TP_PROTO(struct xlog *log, struct xlog_recover *trans,
 		struct xlog_recover_item *item, int pass),
@@ -1993,6 +2016,7 @@ DECLARE_EVENT_CLASS(xfs_log_recover_item_class,
 		__field(dev_t, dev)
 		__field(unsigned long, item)
 		__field(xlog_tid_t, tid)
+		__field(xfs_lsn_t, lsn)
 		__field(int, type)
 		__field(int, pass)
 		__field(int, count)
@@ -2002,15 +2026,17 @@ DECLARE_EVENT_CLASS(xfs_log_recover_item_class,
 		__entry->dev = log->l_mp->m_super->s_dev;
 		__entry->item = (unsigned long)item;
 		__entry->tid = trans->r_log_tid;
+		__entry->lsn = trans->r_lsn;
 		__entry->type = ITEM_TYPE(item);
 		__entry->pass = pass;
 		__entry->count = item->ri_cnt;
 		__entry->total = item->ri_total;
 	),
-	TP_printk("dev %d:%d trans 0x%x, pass %d, item 0x%p, item type %s "
-		  "item region count/total %d/%d",
+	TP_printk("dev %d:%d tid 0x%x lsn 0x%llx, pass %d, item 0x%p, "
+		  "item type %s item region count/total %d/%d",
 		  MAJOR(__entry->dev), MINOR(__entry->dev),
 		  __entry->tid,
+		  __entry->lsn,
 		  __entry->pass,
 		  (void *)__entry->item,
 		  __print_symbolic(__entry->type, XFS_LI_TYPE_DESC),
@@ -2069,6 +2095,7 @@ DEFINE_LOG_RECOVER_BUF_ITEM(xfs_log_recover_buf_cancel);
 DEFINE_LOG_RECOVER_BUF_ITEM(xfs_log_recover_buf_cancel_add);
 DEFINE_LOG_RECOVER_BUF_ITEM(xfs_log_recover_buf_cancel_ref_inc);
 DEFINE_LOG_RECOVER_BUF_ITEM(xfs_log_recover_buf_recover);
+DEFINE_LOG_RECOVER_BUF_ITEM(xfs_log_recover_buf_skip);
 DEFINE_LOG_RECOVER_BUF_ITEM(xfs_log_recover_buf_inode_buf);
 DEFINE_LOG_RECOVER_BUF_ITEM(xfs_log_recover_buf_reg_buf);
 DEFINE_LOG_RECOVER_BUF_ITEM(xfs_log_recover_buf_dquot_buf);
-- 
2.5.5

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply related	[flat|nested] 14+ messages in thread

* Re: [PATCH 1/5] xfs: rework log recovery to submit buffers on LSN boundaries
  2016-08-11 17:11 ` [PATCH 1/5] xfs: rework log recovery to submit buffers on LSN boundaries Brian Foster
@ 2016-08-29  1:16   ` Dave Chinner
  2016-08-29 18:17     ` Brian Foster
  0 siblings, 1 reply; 14+ messages in thread
From: Dave Chinner @ 2016-08-29  1:16 UTC (permalink / raw)
  To: Brian Foster; +Cc: xfs

On Thu, Aug 11, 2016 at 01:11:03PM -0400, Brian Foster wrote:
> The fix to log recovery to update the metadata LSN in recovered buffers
> introduces the requirement that a buffer is submitted only once per
> current LSN. Log recovery currently submits buffers on transaction
> boundaries. This is not sufficient as the abstraction between log
> records and transactions allows for various scenarios where multiple
> transactions can share the same current LSN. If independent transactions
> share an LSN and both modify the same buffer, log recovery can
> incorrectly skip updates and leave the filesystem in an inconsisent
> state.
> 
> In preparation for proper metadata LSN updates during log recovery,
> update log recovery to submit buffers for write on LSN change boundaries
> rather than transaction boundaries. Explicitly track the current LSN in
> a new struct xlog field to handle the various corner cases of when the
> current LSN may or may not change.

.....

> @@ -4221,8 +4223,39 @@ xlog_recover_process_ophdr(
>  		return 0;
>  	}
>  
> +	/*
> +	 * Recovered buffers are submitted for I/O on current LSN change
> +	 * boundaries. This is necessary to accommodate metadata LSN ordering
> +	 * rules of v5 superblock filesystems.
> +	 *
> +	 * Store the new current LSN in l_recovery_lsn as we cannot rely on
> +	 * either record boundaries or transaction boundaries alone to track LSN
> +	 * changes. This has several contributing factors:
> +	 *
> +	 * - Metadata LSNs are updated at buffer submission time. Thus, buffers
> +	 *   can only be submitted safely once per current LSN value.
> +	 * - The current LSN is defined as the start cycle/block of the first
> +	 *   record in which a transaction appears.
> +	 * - A record can hold multiple transactions. Thus, a transaction change
> +	 *   does not guarantee a change in current LSN.
> +	 * - A transaction can span multiple records. Thus, a record change does
> +	 *   not guarantee a change in current LSN. Consider the case where a
> +	 *   record holds one small transaction and a subsequent that carries
> +	 *   over to the next record. Both transactions share the same LSN as
> +	 *   per the definition of the current LSN.
> +	 *
> +	 * In summary, this means we must track the current LSN independently
> +	 * and submit buffers for the previous LSN only when it has changed.
> +	 */
> +	if (log->l_recovery_lsn != trans->r_lsn) {
> +		error = xfs_buf_delwri_submit(buffer_list);
> +		if (error)
> +			return error;
> +		log->l_recovery_lsn = trans->r_lsn;
> +	}

I'm not sure this is the right place to be submitting buffers. We
can have multiple transactions open at once because the writing of
the transaction to the log is split into two parts: xlog_write()
which writes the changes to the log, and xfs_log_done() which writes
the commit record (via xlog_commit_record()) to close the
transaction.

Hence we can get the situation where we have multiple open
transactions such as:

                              CA CB                  CC CD
 +---------+--------+--------+--+--+--------+-------+--+--+
   trans A   trans B  trans C       trans C  trans D

where the changes in multiple transactions are written before the
ophdr that contains the commit record ("CA", "CB", ....) is written.
With the above code, we'd be doing writeback of A when we see B, not
when we see the commit record for A. Like wise B when we see C. And
worse, partial writeback of C when we see the commit record for A...

i.e. We are very careful to write commit records in the correct
order because that is what determines recovery order, but we don't
care what order we write the actual contents of the checkpoints or
whether they interleave with other checkpoints.  As such, ophdrs
change transactions and LSNs without having actually completed
recovery of a checkpoint. 

I think writeback should occur when all the transactions with a
given lsn have been committed. I'm not sure there's a simple way to
track and detect this, but using the ophdrs to detect a change of
lsn to trigger buffer writeback does not look correct to me at this
point in time.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH 3/5] xfs: don't warn on buffers not being recovered due to LSN
  2016-08-11 17:11 ` [PATCH 3/5] xfs: don't warn on buffers not being recovered due to LSN Brian Foster
@ 2016-08-29  1:25   ` Dave Chinner
  2016-08-29 18:17     ` Brian Foster
  0 siblings, 1 reply; 14+ messages in thread
From: Dave Chinner @ 2016-08-29  1:25 UTC (permalink / raw)
  To: Brian Foster; +Cc: xfs

On Thu, Aug 11, 2016 at 01:11:05PM -0400, Brian Foster wrote:
> The log recovery buffer validation function is invoked in cases where a
> buffer update may be skipped due to LSN ordering. If the validation
> function happens to come across directory conversion situations (e.g., a
> dir3 block to data conversion), it may warn about seeing a buffer log
> format of one type and a buffer with a magic number of another.
> 
> This warning is not valid as the buffer update is ultimately skipped.
> This is indicated by a current_lsn of NULLCOMMITLSN provided by the
> caller. As such, update xlog_recover_validate_buf_type() to only warn in
> such cases when a buffer update is expected.
> 
> XXX: other issues here? better to not validate in such cases?

I think this is OK - we really want to ensure that buffers that are
in cache always have the correct verifier attached to them. Hence if
we've read the buffer in, even if we are not modifying it because
it's more recent that what is being replayed we should still attach
the verifiers to it.

If it changes type due to later recovery replay, we'll change the
verifier appropriately at that point.

> @@ -2557,6 +2542,16 @@ xlog_recover_validate_buf_type(
>  			 xfs_blft_from_flags(buf_f));
>  		break;
>  	}
> +
> +	/*
> +	 * A NULL current LSN indicates the buffer update is skipped due to LSN
> +	 * ordering. Don't warn in such cases, we skip the update for a reason
> +	 * (it's no longer valid)!
> +	 */

I read that the first time as "we skip the update for a reason that
is no longer valid" :P

perhaps rework this to make it clear what is being referred to here.
e.g. No need to warn if the the buffer contents are more recent
that this recovery item and hence recovery did not modify the
buffer.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH 4/5] xfs: update metadata LSN in buffers during log recovery
  2016-08-11 17:11 ` [PATCH 4/5] xfs: update metadata LSN in buffers during log recovery Brian Foster
@ 2016-08-29  1:29   ` Dave Chinner
  2016-08-29 18:17     ` Brian Foster
  0 siblings, 1 reply; 14+ messages in thread
From: Dave Chinner @ 2016-08-29  1:29 UTC (permalink / raw)
  To: Brian Foster; +Cc: xfs

On Thu, Aug 11, 2016 at 01:11:06PM -0400, Brian Foster wrote:
> @@ -2552,6 +2562,27 @@ xlog_recover_validate_buf_type(
>  		xfs_warn(mp, warnmsg);
>  		ASSERT(0);
>  	}
> +
> +	/*
> +	 * We must update the metadata LSN of the buffer as it is written out to
> +	 * ensure that older transactions never replay over this one and corrupt
> +	 * the buffer. This can occur if log recovery is interrupted at some
> +	 * point after the current transaction completes, at which point a
> +	 * subsequent mount starts recovery from the beginning.
> +	 *
> +	 * Write verifiers update the metadata LSN from log items attached to
> +	 * the buffer. Therefore, initialize a bli purely to carry the LSN to
> +	 * the verifier. We'll clean it up in our ->iodone() callback.
> +	 */
> +	if (bp->b_ops && current_lsn != NULLCOMMITLSN) {
> +		struct xfs_buf_log_item	*bip;
> +
> +		ASSERT(!bp->b_iodone || bp->b_iodone == xlog_recover_iodone);
> +		bp->b_iodone = xlog_recover_iodone;
> +		xfs_buf_item_init(bp, mp);
> +		bip = bp->b_fspriv;
> +		bip->bli_item.li_lsn = current_lsn;
> +	}
>  }

Of, so now we have two things we do when current_lsn !=
NULLCOMMITLSN. I'd change this to something like:


	ASSERT(bp->b_fspriv == NULL);
	if (current_lsn == NULLCOMMITLSN)
		return;

	if (warn) {
		....
	}

	if (!bp->b_ops)
		return

	/* add buf_item */

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH 1/5] xfs: rework log recovery to submit buffers on LSN boundaries
  2016-08-29  1:16   ` Dave Chinner
@ 2016-08-29 18:17     ` Brian Foster
  2016-09-20  0:13       ` Dave Chinner
  0 siblings, 1 reply; 14+ messages in thread
From: Brian Foster @ 2016-08-29 18:17 UTC (permalink / raw)
  To: Dave Chinner; +Cc: xfs

On Mon, Aug 29, 2016 at 11:16:31AM +1000, Dave Chinner wrote:
> On Thu, Aug 11, 2016 at 01:11:03PM -0400, Brian Foster wrote:
> > The fix to log recovery to update the metadata LSN in recovered buffers
> > introduces the requirement that a buffer is submitted only once per
> > current LSN. Log recovery currently submits buffers on transaction
> > boundaries. This is not sufficient as the abstraction between log
> > records and transactions allows for various scenarios where multiple
> > transactions can share the same current LSN. If independent transactions
> > share an LSN and both modify the same buffer, log recovery can
> > incorrectly skip updates and leave the filesystem in an inconsisent
> > state.
> > 
> > In preparation for proper metadata LSN updates during log recovery,
> > update log recovery to submit buffers for write on LSN change boundaries
> > rather than transaction boundaries. Explicitly track the current LSN in
> > a new struct xlog field to handle the various corner cases of when the
> > current LSN may or may not change.
> 
> .....
> 
> > @@ -4221,8 +4223,39 @@ xlog_recover_process_ophdr(
> >  		return 0;
> >  	}
> >  
> > +	/*
> > +	 * Recovered buffers are submitted for I/O on current LSN change
> > +	 * boundaries. This is necessary to accommodate metadata LSN ordering
> > +	 * rules of v5 superblock filesystems.
> > +	 *
> > +	 * Store the new current LSN in l_recovery_lsn as we cannot rely on
> > +	 * either record boundaries or transaction boundaries alone to track LSN
> > +	 * changes. This has several contributing factors:
> > +	 *
> > +	 * - Metadata LSNs are updated at buffer submission time. Thus, buffers
> > +	 *   can only be submitted safely once per current LSN value.
> > +	 * - The current LSN is defined as the start cycle/block of the first
> > +	 *   record in which a transaction appears.
> > +	 * - A record can hold multiple transactions. Thus, a transaction change
> > +	 *   does not guarantee a change in current LSN.
> > +	 * - A transaction can span multiple records. Thus, a record change does
> > +	 *   not guarantee a change in current LSN. Consider the case where a
> > +	 *   record holds one small transaction and a subsequent that carries
> > +	 *   over to the next record. Both transactions share the same LSN as
> > +	 *   per the definition of the current LSN.
> > +	 *
> > +	 * In summary, this means we must track the current LSN independently
> > +	 * and submit buffers for the previous LSN only when it has changed.
> > +	 */
> > +	if (log->l_recovery_lsn != trans->r_lsn) {
> > +		error = xfs_buf_delwri_submit(buffer_list);
> > +		if (error)
> > +			return error;
> > +		log->l_recovery_lsn = trans->r_lsn;
> > +	}
> 
> I'm not sure this is the right place to be submitting buffers. We
> can have multiple transactions open at once because the writing of
> the transaction to the log is split into two parts: xlog_write()
> which writes the changes to the log, and xfs_log_done() which writes
> the commit record (via xlog_commit_record()) to close the
> transaction.
> 
> Hence we can get the situation where we have multiple open
> transactions such as:
> 
>                               CA CB                  CC CD
>  +---------+--------+--------+--+--+--------+-------+--+--+
>    trans A   trans B  trans C       trans C  trans D
> 

Ok, thanks for drawing this out. I hadn't thought about it from this
angle...

> where the changes in multiple transactions are written before the
> ophdr that contains the commit record ("CA", "CB", ....) is written.
> With the above code, we'd be doing writeback of A when we see B, not
> when we see the commit record for A. Like wise B when we see C. And
> worse, partial writeback of C when we see the commit record for A...
> 

... but I don't think that accurately describes the behavior here. At
least, I'm not sure there is enough information presented to describe
when buffers are going to be submitted because we don't have the mapping
of transactions to records.

That aside for a moment, the current recovery code makes a couple passes
over the entire dirty log. The first pass is for management of cancelled
items and thus not really relevant to this problem. The second pass
constructs all of the transactions in memory and on XLOG_COMMIT_TRANS,
we iterate all of the items in said transaction, perform recovery on the
actual filesystem metadata buffers and submit the buffers for I/O. As
you've outlined above, I believe this corresponds to the commit record
as this flag is written out by xlog_commit_record().

Given that, this change can't cause writeback of A when we see B because
the buffers for A still aren't read, recovered and queued for I/O until
we see the commit record for A. Instead, what this change does is to
defer the final I/O submission from commit record time until we see a
metadata LSN change. This effectively means that the current "per
transaction" buffer delwri queue becomes a "per metadata LSN" queue and
ensures that a subsequent transaction that happens to use the same LSN
and modify the same block(s) does not submit the buffer for write
multiple times from contexts that would stamp the same metadata LSN.

In other words, this change can be viewed as batching or "plugging"
buffer submissions from commit records to subsequent metadata LSN
updates. It doesn't actually reorder how transactions are recovered in
any ways. Getting back to the example above, what should happen (in pass
2) is:

- Allocate trans A and start populating ->r_itemq.
- Allocate trans B and start populating ->r_itemq.
- Allocate trans C and start populating ->r_itemq.
[Note: buffer_list is still empty so any current LSN updates cause no
actions to this point.]
- Hit CR A, recover all of the items in trans A and queue the buffers
  onto buffer_list. Now that buffer_list is populated, the buffers will
  be submitted on the next metadata LSN change.
- Hit CR B. We don't know what's going to happen to buffer_list here
  because the example doesn't define record context. We'll look up
  transaction B and if transaction B started in the same record as A,
  for example, we won't actually drain buffer_list. If they started in
  different records (e.g., different LSNs), we drain buffer_list and
  proceed. In either case, we perform recovery of the items in trans B
  and queue B's buffers to buffer_list.
- We revisit trans C. Again, the starting LSN of trans C and B determine
  whether we drain buffer_list. Note that buffer_list still only
  contains buffers up through transaction B (i.e., possibly A as well)
  as we have not yet recovered or queued anything from transaction C.
- Allocate trans D and start populating ->r_itemq.
...

... and so on until the end and we drain buffer_list from the last LSN
in xlog_do_recovery_pass().

> i.e. We are very careful to write commit records in the correct
> order because that is what determines recovery order, but we don't
> care what order we write the actual contents of the checkpoints or
> whether they interleave with other checkpoints.  As such, ophdrs
> change transactions and LSNs without having actually completed
> recovery of a checkpoint. 
> 
> I think writeback should occur when all the transactions with a
> given lsn have been committed. I'm not sure there's a simple way to
> track and detect this, but using the ophdrs to detect a change of
> lsn to trigger buffer writeback does not look correct to me at this
> point in time.
> 

That is precisely the intent of this patch. What I think could be a
problem is something like the following, if possible:

                    CA         CB                  CC CD
+---------+--------+--+-------+--+--------+-------+--+--+
  trans A   trans B    trans C    trans C  trans D

Assume that trans A and trans B are within the same record and trans C
is in a separate record. In that case, we commit trans A which populates
buffer_list. We lookup trans C, note a new LSN and drain buffer_list.
Then we ultimately commit trans B, which has the same metadata LSN as
trans A and thus is a path to the original problem if trans B happened
to modify any of the same blocks as trans A.

Do note however that this is just an occurrence of the problem with log
recovery as implemented today (once we update metadata LSNs, and is
likely rare as I haven't been able to reproduce corruption in many
tries). If that analysis is correct, I think a straightforward solution
might be to defer submission to the lookup of a transaction with a new
LSN that _also_ corresponds with processing of a commit record based on
where we are in the on-disk log. E.g.:

	if (log->l_recovery_lsn != trans->r_lsn &&
	    oh_flags & XLOG_COMMIT_TRANS) {
		error = xfs_buf_delwri_submit(buffer_list);
		...
	}

So in the above, we'd submit buffers for A and B once we visit the
commit record for trans C. Thoughts?

Brian

> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH 3/5] xfs: don't warn on buffers not being recovered due to LSN
  2016-08-29  1:25   ` Dave Chinner
@ 2016-08-29 18:17     ` Brian Foster
  0 siblings, 0 replies; 14+ messages in thread
From: Brian Foster @ 2016-08-29 18:17 UTC (permalink / raw)
  To: Dave Chinner; +Cc: xfs

On Mon, Aug 29, 2016 at 11:25:14AM +1000, Dave Chinner wrote:
> On Thu, Aug 11, 2016 at 01:11:05PM -0400, Brian Foster wrote:
> > The log recovery buffer validation function is invoked in cases where a
> > buffer update may be skipped due to LSN ordering. If the validation
> > function happens to come across directory conversion situations (e.g., a
> > dir3 block to data conversion), it may warn about seeing a buffer log
> > format of one type and a buffer with a magic number of another.
> > 
> > This warning is not valid as the buffer update is ultimately skipped.
> > This is indicated by a current_lsn of NULLCOMMITLSN provided by the
> > caller. As such, update xlog_recover_validate_buf_type() to only warn in
> > such cases when a buffer update is expected.
> > 
> > XXX: other issues here? better to not validate in such cases?
> 
> I think this is OK - we really want to ensure that buffers that are
> in cache always have the correct verifier attached to them. Hence if
> we've read the buffer in, even if we are not modifying it because
> it's more recent that what is being replayed we should still attach
> the verifiers to it.
> 
> If it changes type due to later recovery replay, we'll change the
> verifier appropriately at that point.
> 

Sounds good, thanks for commenting on this.

> > @@ -2557,6 +2542,16 @@ xlog_recover_validate_buf_type(
> >  			 xfs_blft_from_flags(buf_f));
> >  		break;
> >  	}
> > +
> > +	/*
> > +	 * A NULL current LSN indicates the buffer update is skipped due to LSN
> > +	 * ordering. Don't warn in such cases, we skip the update for a reason
> > +	 * (it's no longer valid)!
> > +	 */
> 
> I read that the first time as "we skip the update for a reason that
> is no longer valid" :P
> 

Heh, Ok.

> perhaps rework this to make it clear what is being referred to here.
> e.g. No need to warn if the the buffer contents are more recent
> that this recovery item and hence recovery did not modify the
> buffer.
> 

Yeah, I'll try to make that more clear.

Brian

> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
> 
> _______________________________________________
> xfs mailing list
> xfs@oss.sgi.com
> http://oss.sgi.com/mailman/listinfo/xfs

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH 4/5] xfs: update metadata LSN in buffers during log recovery
  2016-08-29  1:29   ` Dave Chinner
@ 2016-08-29 18:17     ` Brian Foster
  0 siblings, 0 replies; 14+ messages in thread
From: Brian Foster @ 2016-08-29 18:17 UTC (permalink / raw)
  To: Dave Chinner; +Cc: xfs

On Mon, Aug 29, 2016 at 11:29:23AM +1000, Dave Chinner wrote:
> On Thu, Aug 11, 2016 at 01:11:06PM -0400, Brian Foster wrote:
> > @@ -2552,6 +2562,27 @@ xlog_recover_validate_buf_type(
> >  		xfs_warn(mp, warnmsg);
> >  		ASSERT(0);
> >  	}
> > +
> > +	/*
> > +	 * We must update the metadata LSN of the buffer as it is written out to
> > +	 * ensure that older transactions never replay over this one and corrupt
> > +	 * the buffer. This can occur if log recovery is interrupted at some
> > +	 * point after the current transaction completes, at which point a
> > +	 * subsequent mount starts recovery from the beginning.
> > +	 *
> > +	 * Write verifiers update the metadata LSN from log items attached to
> > +	 * the buffer. Therefore, initialize a bli purely to carry the LSN to
> > +	 * the verifier. We'll clean it up in our ->iodone() callback.
> > +	 */
> > +	if (bp->b_ops && current_lsn != NULLCOMMITLSN) {
> > +		struct xfs_buf_log_item	*bip;
> > +
> > +		ASSERT(!bp->b_iodone || bp->b_iodone == xlog_recover_iodone);
> > +		bp->b_iodone = xlog_recover_iodone;
> > +		xfs_buf_item_init(bp, mp);
> > +		bip = bp->b_fspriv;
> > +		bip->bli_item.li_lsn = current_lsn;
> > +	}
> >  }
> 
> Of, so now we have two things we do when current_lsn !=
> NULLCOMMITLSN. I'd change this to something like:
> 
> 
> 	ASSERT(bp->b_fspriv == NULL);
> 	if (current_lsn == NULLCOMMITLSN)
> 		return;
> 
> 	if (warn) {
> 		....
> 	}
> 
> 	if (!bp->b_ops)
> 		return
> 
> 	/* add buf_item */

Ok, I may still invert the bp->b_ops check as that seems more clear to
me for whatever reason, but otherwise that looks like a nice cleanup.
Thanks.

Brian

> 
> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
> 
> _______________________________________________
> xfs mailing list
> xfs@oss.sgi.com
> http://oss.sgi.com/mailman/listinfo/xfs

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH 1/5] xfs: rework log recovery to submit buffers on LSN boundaries
  2016-08-29 18:17     ` Brian Foster
@ 2016-09-20  0:13       ` Dave Chinner
  2016-09-23 17:08         ` Brian Foster
  0 siblings, 1 reply; 14+ messages in thread
From: Dave Chinner @ 2016-09-20  0:13 UTC (permalink / raw)
  To: Brian Foster; +Cc: xfs, linux-xfs

[ sorry to take so long to get back to this, Brian, I missed your
reply and only yesterday when I was sorting out for-next updates
that I still had this on my "for-review" patch stack. ]

On Mon, Aug 29, 2016 at 02:17:22PM -0400, Brian Foster wrote:
> On Mon, Aug 29, 2016 at 11:16:31AM +1000, Dave Chinner wrote:
> > On Thu, Aug 11, 2016 at 01:11:03PM -0400, Brian Foster wrote:
> > i.e. We are very careful to write commit records in the correct
> > order because that is what determines recovery order, but we don't
> > care what order we write the actual contents of the checkpoints or
> > whether they interleave with other checkpoints.  As such, ophdrs
> > change transactions and LSNs without having actually completed
> > recovery of a checkpoint. 
> > 
> > I think writeback should occur when all the transactions with a
> > given lsn have been committed. I'm not sure there's a simple way to
> > track and detect this, but using the ophdrs to detect a change of
> > lsn to trigger buffer writeback does not look correct to me at this
> > point in time.
> > 
> 
> That is precisely the intent of this patch. What I think could be a
> problem is something like the following, if possible:
> 
>                     CA         CB                  CC CD
> +---------+--------+--+-------+--+--------+-------+--+--+
>   trans A   trans B    trans C    trans C  trans D

Yes, that's possible.

> Assume that trans A and trans B are within the same record and trans C
> is in a separate record. In that case, we commit trans A which populates
> buffer_list. We lookup trans C, note a new LSN and drain buffer_list.
> Then we ultimately commit trans B, which has the same metadata LSN as
> trans A and thus is a path to the original problem if trans B happened
> to modify any of the same blocks as trans A.

Yes, that's right, we still are exposed to the same problem, and
there's much more convoluted versions of it possible.

> Do note however that this is just an occurrence of the problem with log
> recovery as implemented today (once we update metadata LSNs, and is
> likely rare as I haven't been able to reproduce corruption in many
> tries).

Yeah, it's damn hard to intentionally cause interleaving of
checkpoint and commit records these days because of the delayed
logging does aggregation in memory rather than in the log buffers
themselves.

> If that analysis is correct, I think a straightforward solution
> might be to defer submission to the lookup of a transaction with a new
> LSN that _also_ corresponds with processing of a commit record based on
> where we are in the on-disk log. E.g.:
> 
> 	if (log->l_recovery_lsn != trans->r_lsn &&
> 	    oh_flags & XLOG_COMMIT_TRANS) {
> 		error = xfs_buf_delwri_submit(buffer_list);
> 		...
> 	}
> 
> So in the above, we'd submit buffers for A and B once we visit the
> commit record for trans C. Thoughts?

Sounds plausible - let me just check I understood by repeating it
back. Given the above case, we start with log->l_recovery_lsn set to
the lsn before trans A and an empty buffer list.

1. We now recover trans A and trans B into their respective structures,
but we don't don't add their dirty buffers to the delwri list yet -
they are kept internal to the trans.

2. We then see commit A, and because the buffer list is empty we
simply add them to the buffer list and update log->l_recovery_lsn to
point at the transaction LSN.

3. We now see trans C, and start recovering it into an internal buffer
list.

4. Then we process commit B, see that there are already queued buffers
and so check the transaction LSN against log->l_recovery_lsn. They
are the same, so we simply add the transactions dirty buffers to
the buffer list.

5. We continue processing transaction C, and start on transaction D.
We then see commit C. Buffer list is populated, so we check
transaction lsn against log->l_recovery_lsn. They are different.
At this point we know we have fully processed all the transactions
that are associated with log->l_recovery_lsn, hence we can submit
the buffer_list and mark it empty again.

6. At this point we jump back to step 2, this time processing commit
C onwards....

7. At the end of log recovery, we commit the remaining buffer list
from the last transaction we recovered from the log.

Did I understand it right? If so, I think this will work just fine.

Thanks, Brian!

-Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH 1/5] xfs: rework log recovery to submit buffers on LSN boundaries
  2016-09-20  0:13       ` Dave Chinner
@ 2016-09-23 17:08         ` Brian Foster
  0 siblings, 0 replies; 14+ messages in thread
From: Brian Foster @ 2016-09-23 17:08 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs, xfs

On Tue, Sep 20, 2016 at 10:13:30AM +1000, Dave Chinner wrote:
> [ sorry to take so long to get back to this, Brian, I missed your
> reply and only yesterday when I was sorting out for-next updates
> that I still had this on my "for-review" patch stack. ]
> 

No problem. I've been away anyways..

> On Mon, Aug 29, 2016 at 02:17:22PM -0400, Brian Foster wrote:
> > On Mon, Aug 29, 2016 at 11:16:31AM +1000, Dave Chinner wrote:
> > > On Thu, Aug 11, 2016 at 01:11:03PM -0400, Brian Foster wrote:
> > > i.e. We are very careful to write commit records in the correct
> > > order because that is what determines recovery order, but we don't
> > > care what order we write the actual contents of the checkpoints or
> > > whether they interleave with other checkpoints.  As such, ophdrs
> > > change transactions and LSNs without having actually completed
> > > recovery of a checkpoint. 
> > > 
> > > I think writeback should occur when all the transactions with a
> > > given lsn have been committed. I'm not sure there's a simple way to
> > > track and detect this, but using the ophdrs to detect a change of
> > > lsn to trigger buffer writeback does not look correct to me at this
> > > point in time.
> > > 
> > 
> > That is precisely the intent of this patch. What I think could be a
> > problem is something like the following, if possible:
> > 
> >                     CA         CB                  CC CD
> > +---------+--------+--+-------+--+--------+-------+--+--+
> >   trans A   trans B    trans C    trans C  trans D
> 
> Yes, that's possible.
> 

Ok.

> > Assume that trans A and trans B are within the same record and trans C
> > is in a separate record. In that case, we commit trans A which populates
> > buffer_list. We lookup trans C, note a new LSN and drain buffer_list.
> > Then we ultimately commit trans B, which has the same metadata LSN as
> > trans A and thus is a path to the original problem if trans B happened
> > to modify any of the same blocks as trans A.
> 
> Yes, that's right, we still are exposed to the same problem, and
> there's much more convoluted versions of it possible.
> 
> > Do note however that this is just an occurrence of the problem with log
> > recovery as implemented today (once we update metadata LSNs, and is
> > likely rare as I haven't been able to reproduce corruption in many
> > tries).
> 
> Yeah, it's damn hard to intentionally cause interleaving of
> checkpoint and commit records these days because of the delayed
> logging does aggregation in memory rather than in the log buffers
> themselves.
> 

Makes sense.

> > If that analysis is correct, I think a straightforward solution
> > might be to defer submission to the lookup of a transaction with a new
> > LSN that _also_ corresponds with processing of a commit record based on
> > where we are in the on-disk log. E.g.:
> > 
> > 	if (log->l_recovery_lsn != trans->r_lsn &&
> > 	    oh_flags & XLOG_COMMIT_TRANS) {
> > 		error = xfs_buf_delwri_submit(buffer_list);
> > 		...
> > 	}
> > 
> > So in the above, we'd submit buffers for A and B once we visit the
> > commit record for trans C. Thoughts?
> 
> Sounds plausible - let me just check I understood by repeating it
> back. Given the above case, we start with log->l_recovery_lsn set to
> the lsn before trans A and an empty buffer list.
> 
> 1. We now recover trans A and trans B into their respective structures,
> but we don't don't add their dirty buffers to the delwri list yet -
> they are kept internal to the trans.
> 
> 2. We then see commit A, and because the buffer list is empty we
> simply add them to the buffer list and update log->l_recovery_lsn to
> point at the transaction LSN.
> 

Right...

> 3. We now see trans C, and start recovering it into an internal buffer
> list.
> 
> 4. Then we process commit B, see that there are already queued buffers
> and so check the transaction LSN against log->l_recovery_lsn. They
> are the same, so we simply add the transactions dirty buffers to
> the buffer list.
> 

Maybe just weird wording here, but to be precise (and pedantic), the
top-level check is for the current LSN change, not necessarily whether
the buffer_list is empty or not. The behavior is the same either way.

> 5. We continue processing transaction C, and start on transaction D.
> We then see commit C. Buffer list is populated, so we check
> transaction lsn against log->l_recovery_lsn. They are different.
> At this point we know we have fully processed all the transactions
> that are associated with log->l_recovery_lsn, hence we can submit
> the buffer_list and mark it empty again.
> 
> 6. At this point we jump back to step 2, this time processing commit
> C onwards....
> 
> 7. At the end of log recovery, we commit the remaining buffer list
> from the last transaction we recovered from the log.
> 
> Did I understand it right? If so, I think this will work just fine.
> 

Yep, I think so. I'll send an updated version.

Brian

> Thanks, Brian!
> 
> -Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
> 
> _______________________________________________
> xfs mailing list
> xfs@oss.sgi.com
> http://oss.sgi.com/mailman/listinfo/xfs

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 14+ messages in thread

end of thread, other threads:[~2016-09-23 17:09 UTC | newest]

Thread overview: 14+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-08-11 17:11 [PATCH 0/5] fix log recovery for v5 superblocks Brian Foster
2016-08-11 17:11 ` [PATCH 1/5] xfs: rework log recovery to submit buffers on LSN boundaries Brian Foster
2016-08-29  1:16   ` Dave Chinner
2016-08-29 18:17     ` Brian Foster
2016-09-20  0:13       ` Dave Chinner
2016-09-23 17:08         ` Brian Foster
2016-08-11 17:11 ` [PATCH 2/5] xfs: pass current lsn to log recovery buffer validation Brian Foster
2016-08-11 17:11 ` [PATCH 3/5] xfs: don't warn on buffers not being recovered due to LSN Brian Foster
2016-08-29  1:25   ` Dave Chinner
2016-08-29 18:17     ` Brian Foster
2016-08-11 17:11 ` [PATCH 4/5] xfs: update metadata LSN in buffers during log recovery Brian Foster
2016-08-29  1:29   ` Dave Chinner
2016-08-29 18:17     ` Brian Foster
2016-08-11 17:11 ` [PATCH 5/5] xfs: log recovery tracepoints to track current lsn and buffer submission Brian Foster

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.