linux-xfs.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH 0/5 v2] xfs: strictly order log start records
@ 2021-07-14  3:36 Dave Chinner
  2021-07-14  3:36 ` [PATCH 1/5] xfs: move xlog_commit_record to xfs_log_cil.c Dave Chinner
                   ` (5 more replies)
  0 siblings, 6 replies; 15+ messages in thread
From: Dave Chinner @ 2021-07-14  3:36 UTC (permalink / raw)
  To: linux-xfs


We recently found a zero-day log recovery issue where overlapping
log transactions used the wrong LSN for recovery operations despite
being replayed in the correct commit record order. The issue is that
the log recovery code uses the LSN of the start record of the log
transaction for ordering and metadata stamping, while the actual
order of transaction replay is determined by the commit record.
Hence if we pipeline CIL commits we can end up with overlapping
transactions in the log like:

Start A .. Start C .. Start B .... Commit A .. Commit B .. Commit C

The issue is that the "start B" lsn is later than the "start C" lsn.
When the same metadata block is modified in both transaction B and
C, writeback from "commit B" will correctly stamp "start B" into the
metadata.

However, when "commit C" runs, it will see the LSN in that metadata
block is "start B", which is *more recent than "Start C" and so
will, incorrectly, fail to recover that change into the metadata
block. This results in silent metadata corruption, which can then be
exposed by future recovery operations failing, runtime
inconsistencies causing shutdowns and/or xfs_scrub/xfs_repair check
failures.

We could fix log recovery to avoid this problem, but there's a
runtime problem as well: the AIL is ordered by start record LSN. We
cannot order the AIL by commit LSN as we cannot allow the tail of
the log to overwrite -any- of the log transaction until the entire
transaction has been written. As the lowest LSN of the items in the
AIL defines the current log tail, the same metadata writeback
ordering issues apply as with log recovery.

In this case, we run the callbacks for commit B first, which place
all the items at the head of the log at "start B". Then we run
callbacks for "commit C", which then do not insert at the head -
they get inserted before "start B". If the item was modified in both
B and C, then it moves *backwards* in the AIL and this screws up all
manner of things that assume relogging can only move objects
forwards in the log. One of these things it can screw up is the tail
lsn of the log. Nothing good comes from this...

Because we have both runtime and journal-based ordering requirements
for the start_lsn, we have multiple places where there is an
implicit assumption that transaction start records are strictly
ordered. Rather than play whack-a-mole with such assumptions, and to
avoid the eternal "are you running a fixed kernel" question, it's
better just to strictly order the start records in the same way we
strictly order the commit records.

This patch series takes the mechanisms of the strict commit record
ordering and utilises them for strict start record ordering. It
builds upon the shutdown rework patchset to guarantee that the CIL
context structure will not get freed from under it by a racing
shutdown, and so moves the LSN recording for ordering up into a
callback from xlog_write() once we have a guaranteed iclog write
location. This means we have one mechanism for both start and commit
record ordering, and they both work in exactly the same way.


Version 2:
- rebase on 5.14-rc1 + "xfs: shutdown is a racy mess"
- fixed typos in commit messages

Version 1:
- https://lore.kernel.org/linux-xfs/20210630072108.1752073-1-david@fromorbit.com/


^ permalink raw reply	[flat|nested] 15+ messages in thread

* [PATCH 1/5] xfs: move xlog_commit_record to xfs_log_cil.c
  2021-07-14  3:36 [PATCH 0/5 v2] xfs: strictly order log start records Dave Chinner
@ 2021-07-14  3:36 ` Dave Chinner
  2021-07-14  3:36 ` [PATCH 2/5] xfs: pass a CIL context to xlog_write() Dave Chinner
                   ` (4 subsequent siblings)
  5 siblings, 0 replies; 15+ messages in thread
From: Dave Chinner @ 2021-07-14  3:36 UTC (permalink / raw)
  To: linux-xfs

From: Dave Chinner <dchinner@redhat.com>

It is only used by the CIL checkpoints, and is the counterpart to
start record formatting and writing that is already local to
xfs_log_cil.c.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
---
 fs/xfs/xfs_log.c      | 31 -------------------------------
 fs/xfs/xfs_log_cil.c  | 35 ++++++++++++++++++++++++++++++++++-
 fs/xfs/xfs_log_priv.h |  2 --
 3 files changed, 34 insertions(+), 34 deletions(-)

diff --git a/fs/xfs/xfs_log.c b/fs/xfs/xfs_log.c
index 6617cdccaf00..f919c8736d50 100644
--- a/fs/xfs/xfs_log.c
+++ b/fs/xfs/xfs_log.c
@@ -1595,37 +1595,6 @@ xlog_alloc_log(
 	return ERR_PTR(error);
 }	/* xlog_alloc_log */
 
-/*
- * Write out the commit record of a transaction associated with the given
- * ticket to close off a running log write. Return the lsn of the commit record.
- */
-int
-xlog_commit_record(
-	struct xlog		*log,
-	struct xlog_ticket	*ticket,
-	struct xlog_in_core	**iclog,
-	xfs_lsn_t		*lsn)
-{
-	struct xfs_log_iovec reg = {
-		.i_addr = NULL,
-		.i_len = 0,
-		.i_type = XLOG_REG_TYPE_COMMIT,
-	};
-	struct xfs_log_vec vec = {
-		.lv_niovecs = 1,
-		.lv_iovecp = &reg,
-	};
-	int	error;
-
-	if (xlog_is_shutdown(log))
-		return -EIO;
-
-	error = xlog_write(log, &vec, ticket, lsn, iclog, XLOG_COMMIT_TRANS);
-	if (error)
-		xfs_force_shutdown(log->l_mp, SHUTDOWN_LOG_IO_ERROR);
-	return error;
-}
-
 /*
  * Compute the LSN that we'd need to push the log tail towards in order to have
  * (a) enough on-disk log space to log the number of bytes specified, (b) at
diff --git a/fs/xfs/xfs_log_cil.c b/fs/xfs/xfs_log_cil.c
index 62967449fe8c..5eb854bda839 100644
--- a/fs/xfs/xfs_log_cil.c
+++ b/fs/xfs/xfs_log_cil.c
@@ -623,6 +623,38 @@ xlog_cil_process_committed(
 	}
 }
 
+/*
+ * Write out the commit record of a checkpoint transaction associated with the
+ * given ticket to close off a running log write. Return the lsn of the commit
+ * record.
+ */
+static int
+xlog_cil_write_commit_record(
+	struct xlog		*log,
+	struct xlog_ticket	*ticket,
+	struct xlog_in_core	**iclog,
+	xfs_lsn_t		*lsn)
+{
+	struct xfs_log_iovec reg = {
+		.i_addr = NULL,
+		.i_len = 0,
+		.i_type = XLOG_REG_TYPE_COMMIT,
+	};
+	struct xfs_log_vec vec = {
+		.lv_niovecs = 1,
+		.lv_iovecp = &reg,
+	};
+	int	error;
+
+	if (xlog_is_shutdown(log))
+		return -EIO;
+
+	error = xlog_write(log, &vec, ticket, lsn, iclog, XLOG_COMMIT_TRANS);
+	if (error)
+		xfs_force_shutdown(log->l_mp, SHUTDOWN_LOG_IO_ERROR);
+	return error;
+}
+
 /*
  * Push the Committed Item List to the log.
  *
@@ -867,7 +899,8 @@ xlog_cil_push_work(
 	}
 	spin_unlock(&cil->xc_push_lock);
 
-	error = xlog_commit_record(log, tic, &commit_iclog, &commit_lsn);
+	error = xlog_cil_write_commit_record(log, tic, &commit_iclog,
+			&commit_lsn);
 	if (error)
 		goto out_abort_free_ticket;
 
diff --git a/fs/xfs/xfs_log_priv.h b/fs/xfs/xfs_log_priv.h
index 6c2f88e06ac3..2de9fe62d8ca 100644
--- a/fs/xfs/xfs_log_priv.h
+++ b/fs/xfs/xfs_log_priv.h
@@ -504,8 +504,6 @@ void	xlog_print_trans(struct xfs_trans *);
 int	xlog_write(struct xlog *log, struct xfs_log_vec *log_vector,
 		struct xlog_ticket *tic, xfs_lsn_t *start_lsn,
 		struct xlog_in_core **commit_iclog, uint optype);
-int	xlog_commit_record(struct xlog *log, struct xlog_ticket *ticket,
-		struct xlog_in_core **iclog, xfs_lsn_t *lsn);
 void	xfs_log_ticket_ungrant(struct xlog *log, struct xlog_ticket *ticket);
 void	xfs_log_ticket_regrant(struct xlog *log, struct xlog_ticket *ticket);
 
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 15+ messages in thread

* [PATCH 2/5] xfs: pass a CIL context to xlog_write()
  2021-07-14  3:36 [PATCH 0/5 v2] xfs: strictly order log start records Dave Chinner
  2021-07-14  3:36 ` [PATCH 1/5] xfs: move xlog_commit_record to xfs_log_cil.c Dave Chinner
@ 2021-07-14  3:36 ` Dave Chinner
  2021-07-14  3:36 ` [PATCH 3/5] xfs: factor out log write ordering from xlog_cil_push_work() Dave Chinner
                   ` (3 subsequent siblings)
  5 siblings, 0 replies; 15+ messages in thread
From: Dave Chinner @ 2021-07-14  3:36 UTC (permalink / raw)
  To: linux-xfs

From: Dave Chinner <dchinner@redhat.com>

Pass the CIL context to xlog_write() rather than a pointer to a LSN
variable. Only the CIL checkpoint calls to xlog_write() need to know
about the start LSN of the writes, so rework xlog_write to directly
write the LSNs into the CIL context structure.

This removes the commit_lsn variable from xlog_cil_push_work(), so
now we only have to issue the commit record ordering wakeup from
there.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/xfs_log.c      | 18 +++++++++------
 fs/xfs/xfs_log_cil.c  | 52 ++++++++++++++++++++++++++++++-------------
 fs/xfs/xfs_log_priv.h |  7 ++++--
 3 files changed, 52 insertions(+), 25 deletions(-)

diff --git a/fs/xfs/xfs_log.c b/fs/xfs/xfs_log.c
index f919c8736d50..a190efbbe451 100644
--- a/fs/xfs/xfs_log.c
+++ b/fs/xfs/xfs_log.c
@@ -901,7 +901,7 @@ xlog_write_unmount_record(
 	 */
 	if (log->l_targ != log->l_mp->m_ddev_targp)
 		blkdev_issue_flush(log->l_targ->bt_bdev);
-	return xlog_write(log, &vec, ticket, NULL, NULL, XLOG_UNMOUNT_TRANS);
+	return xlog_write(log, NULL, &vec, ticket, NULL, XLOG_UNMOUNT_TRANS);
 }
 
 /*
@@ -2397,9 +2397,9 @@ xlog_write_copy_finish(
 int
 xlog_write(
 	struct xlog		*log,
+	struct xfs_cil_ctx	*ctx,
 	struct xfs_log_vec	*log_vector,
 	struct xlog_ticket	*ticket,
-	xfs_lsn_t		*start_lsn,
 	struct xlog_in_core	**commit_iclog,
 	uint			optype)
 {
@@ -2430,8 +2430,6 @@ xlog_write(
 	}
 
 	len = xlog_write_calc_vec_length(ticket, log_vector, optype);
-	if (start_lsn)
-		*start_lsn = 0;
 	while (lv && (!lv->lv_niovecs || index < lv->lv_niovecs)) {
 		void		*ptr;
 		int		log_offset;
@@ -2444,9 +2442,15 @@ xlog_write(
 		ASSERT(log_offset <= iclog->ic_size - 1);
 		ptr = iclog->ic_datap + log_offset;
 
-		/* Start_lsn is the first lsn written to. */
-		if (start_lsn && !*start_lsn)
-			*start_lsn = be64_to_cpu(iclog->ic_header.h_lsn);
+		/*
+		 * If we have a context pointer, pass it the first iclog we are
+		 * writing to so it can record state needed for iclog write
+		 * ordering.
+		 */
+		if (ctx) {
+			xlog_cil_set_ctx_write_state(ctx, iclog);
+			ctx = NULL;
+		}
 
 		/*
 		 * This loop writes out as many regions as can fit in the amount
diff --git a/fs/xfs/xfs_log_cil.c b/fs/xfs/xfs_log_cil.c
index 5eb854bda839..581feb043fba 100644
--- a/fs/xfs/xfs_log_cil.c
+++ b/fs/xfs/xfs_log_cil.c
@@ -623,6 +623,30 @@ xlog_cil_process_committed(
 	}
 }
 
+/*
+* Record the LSN of the iclog we were just granted space to start writing into.
+* If the context doesn't have a start_lsn recorded, then this iclog will
+* contain the start record for the checkpoint. Otherwise this write contains
+* the commit record for the checkpoint.
+*/
+void
+xlog_cil_set_ctx_write_state(
+	struct xfs_cil_ctx	*ctx,
+	struct xlog_in_core	*iclog)
+{
+	struct xfs_cil		*cil = ctx->cil;
+	xfs_lsn_t		lsn = be64_to_cpu(iclog->ic_header.h_lsn);
+
+	ASSERT(!ctx->commit_lsn);
+	spin_lock(&cil->xc_push_lock);
+	if (!ctx->start_lsn)
+		ctx->start_lsn = lsn;
+	else
+		ctx->commit_lsn = lsn;
+	spin_unlock(&cil->xc_push_lock);
+}
+
+
 /*
  * Write out the commit record of a checkpoint transaction associated with the
  * given ticket to close off a running log write. Return the lsn of the commit
@@ -630,26 +654,26 @@ xlog_cil_process_committed(
  */
 static int
 xlog_cil_write_commit_record(
-	struct xlog		*log,
-	struct xlog_ticket	*ticket,
-	struct xlog_in_core	**iclog,
-	xfs_lsn_t		*lsn)
+	struct xfs_cil_ctx	*ctx,
+	struct xlog_in_core	**iclog)
 {
-	struct xfs_log_iovec reg = {
+	struct xlog		*log = ctx->cil->xc_log;
+	struct xfs_log_iovec	reg = {
 		.i_addr = NULL,
 		.i_len = 0,
 		.i_type = XLOG_REG_TYPE_COMMIT,
 	};
-	struct xfs_log_vec vec = {
+	struct xfs_log_vec	vec = {
 		.lv_niovecs = 1,
 		.lv_iovecp = &reg,
 	};
-	int	error;
+	int			error;
 
 	if (xlog_is_shutdown(log))
 		return -EIO;
 
-	error = xlog_write(log, &vec, ticket, lsn, iclog, XLOG_COMMIT_TRANS);
+	error = xlog_write(log, ctx, &vec, ctx->ticket, iclog,
+			XLOG_COMMIT_TRANS);
 	if (error)
 		xfs_force_shutdown(log->l_mp, SHUTDOWN_LOG_IO_ERROR);
 	return error;
@@ -686,7 +710,6 @@ xlog_cil_push_work(
 	struct xfs_trans_header thdr;
 	struct xfs_log_iovec	lhdr;
 	struct xfs_log_vec	lvhdr = { NULL };
-	xfs_lsn_t		commit_lsn;
 	xfs_lsn_t		push_seq;
 	struct bio		bio;
 	DECLARE_COMPLETION_ONSTACK(bdev_flush);
@@ -860,8 +883,7 @@ xlog_cil_push_work(
 	 */
 	wait_for_completion(&bdev_flush);
 
-	error = xlog_write(log, &lvhdr, tic, &ctx->start_lsn, NULL,
-				XLOG_START_TRANS);
+	error = xlog_write(log, ctx, &lvhdr, tic, NULL, XLOG_START_TRANS);
 	if (error)
 		goto out_abort_free_ticket;
 
@@ -899,8 +921,7 @@ xlog_cil_push_work(
 	}
 	spin_unlock(&cil->xc_push_lock);
 
-	error = xlog_cil_write_commit_record(log, tic, &commit_iclog,
-			&commit_lsn);
+	error = xlog_cil_write_commit_record(ctx, &commit_iclog);
 	if (error)
 		goto out_abort_free_ticket;
 
@@ -927,7 +948,6 @@ xlog_cil_push_work(
 	 * and wake up anyone who is waiting for the commit to complete.
 	 */
 	spin_lock(&cil->xc_push_lock);
-	ctx->commit_lsn = commit_lsn;
 	wake_up_all(&cil->xc_commit_wait);
 	spin_unlock(&cil->xc_push_lock);
 
@@ -943,11 +963,11 @@ xlog_cil_push_work(
 	 * iclog header lsn and compare it to the commit lsn to determine if we
 	 * need to wait on iclogs or not.
 	 */
-	if (ctx->start_lsn != commit_lsn) {
+	if (ctx->start_lsn != ctx->commit_lsn) {
 		xfs_lsn_t	plsn;
 
 		plsn = be64_to_cpu(commit_iclog->ic_prev->ic_header.h_lsn);
-		if (plsn && XFS_LSN_CMP(plsn, commit_lsn) < 0) {
+		if (plsn && XFS_LSN_CMP(plsn, ctx->commit_lsn) < 0) {
 			/*
 			 * Waiting on ic_force_wait orders the completion of
 			 * iclogs older than ic_prev. Hence we only need to wait
diff --git a/fs/xfs/xfs_log_priv.h b/fs/xfs/xfs_log_priv.h
index 2de9fe62d8ca..2a02ce05b649 100644
--- a/fs/xfs/xfs_log_priv.h
+++ b/fs/xfs/xfs_log_priv.h
@@ -501,8 +501,8 @@ xlog_write_adv_cnt(void **ptr, int *len, int *off, size_t bytes)
 
 void	xlog_print_tic_res(struct xfs_mount *mp, struct xlog_ticket *ticket);
 void	xlog_print_trans(struct xfs_trans *);
-int	xlog_write(struct xlog *log, struct xfs_log_vec *log_vector,
-		struct xlog_ticket *tic, xfs_lsn_t *start_lsn,
+int	xlog_write(struct xlog *log, struct xfs_cil_ctx *ctx,
+		struct xfs_log_vec *log_vector, struct xlog_ticket *tic,
 		struct xlog_in_core **commit_iclog, uint optype);
 void	xfs_log_ticket_ungrant(struct xlog *log, struct xlog_ticket *ticket);
 void	xfs_log_ticket_regrant(struct xlog *log, struct xlog_ticket *ticket);
@@ -573,6 +573,9 @@ void	xlog_cil_destroy(struct xlog *log);
 bool	xlog_cil_empty(struct xlog *log);
 void	xlog_cil_commit(struct xlog *log, struct xfs_trans *tp,
 			xfs_csn_t *commit_seq, bool regrant);
+void	xlog_cil_set_ctx_write_state(struct xfs_cil_ctx *ctx,
+			struct xlog_in_core *iclog);
+
 
 /*
  * CIL force routines
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 15+ messages in thread

* [PATCH 3/5] xfs: factor out log write ordering from xlog_cil_push_work()
  2021-07-14  3:36 [PATCH 0/5 v2] xfs: strictly order log start records Dave Chinner
  2021-07-14  3:36 ` [PATCH 1/5] xfs: move xlog_commit_record to xfs_log_cil.c Dave Chinner
  2021-07-14  3:36 ` [PATCH 2/5] xfs: pass a CIL context to xlog_write() Dave Chinner
@ 2021-07-14  3:36 ` Dave Chinner
  2021-07-14  3:36 ` [PATCH 4/5] xfs: attached iclog callbacks in xlog_cil_set_ctx_write_state() Dave Chinner
                   ` (2 subsequent siblings)
  5 siblings, 0 replies; 15+ messages in thread
From: Dave Chinner @ 2021-07-14  3:36 UTC (permalink / raw)
  To: linux-xfs

From: Dave Chinner <dchinner@redhat.com>

So we can use it for start record ordering as well as commit record
ordering in future.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
---
 fs/xfs/xfs_log_cil.c | 87 ++++++++++++++++++++++++++------------------
 1 file changed, 51 insertions(+), 36 deletions(-)

diff --git a/fs/xfs/xfs_log_cil.c b/fs/xfs/xfs_log_cil.c
index 581feb043fba..cac3c9c894e5 100644
--- a/fs/xfs/xfs_log_cil.c
+++ b/fs/xfs/xfs_log_cil.c
@@ -648,9 +648,54 @@ xlog_cil_set_ctx_write_state(
 
 
 /*
- * Write out the commit record of a checkpoint transaction associated with the
- * given ticket to close off a running log write. Return the lsn of the commit
- * record.
+ * Ensure that the order of log writes follows checkpoint sequence order. This
+ * relies on the context LSN being zero until the log write has guaranteed the
+ * LSN that the log write will start at via xlog_state_get_iclog_space().
+ */
+static int
+xlog_cil_order_write(
+	struct xfs_cil		*cil,
+	xfs_csn_t		sequence)
+{
+	struct xfs_cil_ctx	*ctx;
+
+restart:
+	spin_lock(&cil->xc_push_lock);
+	list_for_each_entry(ctx, &cil->xc_committing, committing) {
+		/*
+		 * Avoid getting stuck in this loop because we were woken by the
+		 * shutdown, but then went back to sleep once already in the
+		 * shutdown state.
+		 */
+		if (xlog_is_shutdown(cil->xc_log)) {
+			spin_unlock(&cil->xc_push_lock);
+			return -EIO;
+		}
+
+		/*
+		 * Higher sequences will wait for this one so skip them.
+		 * Don't wait for our own sequence, either.
+		 */
+		if (ctx->sequence >= sequence)
+			continue;
+		if (!ctx->commit_lsn) {
+			/*
+			 * It is still being pushed! Wait for the push to
+			 * complete, then start again from the beginning.
+			 */
+			xlog_wait(&cil->xc_commit_wait, &cil->xc_push_lock);
+			goto restart;
+		}
+	}
+	spin_unlock(&cil->xc_push_lock);
+	return 0;
+}
+
+/*
+ * Write out the commit record of a checkpoint transaction to close off a
+ * running log write. These commit records are strictly ordered in ascending CIL
+ * sequence order so that log recovery will always replay the checkpoints in the
+ * correct order.
  */
 static int
 xlog_cil_write_commit_record(
@@ -887,39 +932,9 @@ xlog_cil_push_work(
 	if (error)
 		goto out_abort_free_ticket;
 
-	/*
-	 * now that we've written the checkpoint into the log, strictly
-	 * order the commit records so replay will get them in the right order.
-	 */
-restart:
-	spin_lock(&cil->xc_push_lock);
-	list_for_each_entry(new_ctx, &cil->xc_committing, committing) {
-		/*
-		 * Avoid getting stuck in this loop because we were woken by the
-		 * shutdown, but then went back to sleep once already in the
-		 * shutdown state.
-		 */
-		if (xlog_is_shutdown(log)) {
-			spin_unlock(&cil->xc_push_lock);
-			goto out_abort_free_ticket;
-		}
-
-		/*
-		 * Higher sequences will wait for this one so skip them.
-		 * Don't wait for our own sequence, either.
-		 */
-		if (new_ctx->sequence >= ctx->sequence)
-			continue;
-		if (!new_ctx->commit_lsn) {
-			/*
-			 * It is still being pushed! Wait for the push to
-			 * complete, then start again from the beginning.
-			 */
-			xlog_wait(&cil->xc_commit_wait, &cil->xc_push_lock);
-			goto restart;
-		}
-	}
-	spin_unlock(&cil->xc_push_lock);
+	error = xlog_cil_order_write(ctx->cil, ctx->sequence);
+	if (error)
+		goto out_abort_free_ticket;
 
 	error = xlog_cil_write_commit_record(ctx, &commit_iclog);
 	if (error)
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 15+ messages in thread

* [PATCH 4/5] xfs: attached iclog callbacks in xlog_cil_set_ctx_write_state()
  2021-07-14  3:36 [PATCH 0/5 v2] xfs: strictly order log start records Dave Chinner
                   ` (2 preceding siblings ...)
  2021-07-14  3:36 ` [PATCH 3/5] xfs: factor out log write ordering from xlog_cil_push_work() Dave Chinner
@ 2021-07-14  3:36 ` Dave Chinner
  2021-07-14  6:21   ` Christoph Hellwig
  2021-07-14 22:36   ` Darrick J. Wong
  2021-07-14  3:36 ` [PATCH 5/5] xfs: order CIL checkpoint start records Dave Chinner
  2021-08-09 18:39 ` [PATCH 0/5 v2] xfs: strictly order log " Darrick J. Wong
  5 siblings, 2 replies; 15+ messages in thread
From: Dave Chinner @ 2021-07-14  3:36 UTC (permalink / raw)
  To: linux-xfs

From: Dave Chinner <dchinner@redhat.com>

Now that we have a mechanism to guarantee that the callbacks
attached to an iclog are owned by the context that attaches them
until they drop their reference to the iclog via
xlog_state_release_iclog(), we can attach callbacks to the iclog at
any time we have an active reference to the iclog.

xlog_state_get_iclog_space() always guarantees that the commit
record will fit in the iclog it returns, so we can move this IO
callback setting to xlog_cil_set_ctx_write_state(), record the
commit iclog in the context and remove the need for the commit iclog
to be returned by xlog_write() altogether.

This, in turn, allows us to move the wakeup for ordered commit
record writes up into xlog_cil_set_ctx_write_state(), too, because
we have been guaranteed that this commit record will be physically
located in the iclog before any waiting commit record at a higher
sequence number will be granted iclog space.

This further cleans up the post commit record write processing in
the CIL push code, especially as xlog_state_release_iclog() will now
clean up the context when shutdown errors occur.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/xfs_log.c      | 47 ++++++++--------------
 fs/xfs/xfs_log_cil.c  | 94 ++++++++++++++++++++++++-------------------
 fs/xfs/xfs_log_priv.h |  3 +-
 3 files changed, 70 insertions(+), 74 deletions(-)

diff --git a/fs/xfs/xfs_log.c b/fs/xfs/xfs_log.c
index a190efbbe451..f41c6f388456 100644
--- a/fs/xfs/xfs_log.c
+++ b/fs/xfs/xfs_log.c
@@ -901,7 +901,7 @@ xlog_write_unmount_record(
 	 */
 	if (log->l_targ != log->l_mp->m_ddev_targp)
 		blkdev_issue_flush(log->l_targ->bt_bdev);
-	return xlog_write(log, NULL, &vec, ticket, NULL, XLOG_UNMOUNT_TRANS);
+	return xlog_write(log, NULL, &vec, ticket, XLOG_UNMOUNT_TRANS);
 }
 
 /*
@@ -2307,8 +2307,7 @@ xlog_write_copy_finish(
 	int			*data_cnt,
 	int			*partial_copy,
 	int			*partial_copy_len,
-	int			log_offset,
-	struct xlog_in_core	**commit_iclog)
+	int			log_offset)
 {
 	int			error;
 
@@ -2327,27 +2326,20 @@ xlog_write_copy_finish(
 	*partial_copy = 0;
 	*partial_copy_len = 0;
 
-	if (iclog->ic_size - log_offset <= sizeof(xlog_op_header_t)) {
-		/* no more space in this iclog - push it. */
-		spin_lock(&log->l_icloglock);
-		xlog_state_finish_copy(log, iclog, *record_cnt, *data_cnt);
-		*record_cnt = 0;
-		*data_cnt = 0;
-
-		if (iclog->ic_state == XLOG_STATE_ACTIVE)
-			xlog_state_switch_iclogs(log, iclog, 0);
-		else
-			ASSERT(iclog->ic_state == XLOG_STATE_WANT_SYNC ||
-				xlog_is_shutdown(log));
-		if (!commit_iclog)
-			goto release_iclog;
-		spin_unlock(&log->l_icloglock);
-		ASSERT(flags & XLOG_COMMIT_TRANS);
-		*commit_iclog = iclog;
-	}
+	if (iclog->ic_size - log_offset > sizeof(xlog_op_header_t))
+		return 0;
 
-	return 0;
+	/* no more space in this iclog - push it. */
+	spin_lock(&log->l_icloglock);
+	xlog_state_finish_copy(log, iclog, *record_cnt, *data_cnt);
+	*record_cnt = 0;
+	*data_cnt = 0;
 
+	if (iclog->ic_state == XLOG_STATE_ACTIVE)
+		xlog_state_switch_iclogs(log, iclog, 0);
+	else
+		ASSERT(iclog->ic_state == XLOG_STATE_WANT_SYNC ||
+			xlog_is_shutdown(log));
 release_iclog:
 	error = xlog_state_release_iclog(log, iclog);
 	spin_unlock(&log->l_icloglock);
@@ -2400,7 +2392,6 @@ xlog_write(
 	struct xfs_cil_ctx	*ctx,
 	struct xfs_log_vec	*log_vector,
 	struct xlog_ticket	*ticket,
-	struct xlog_in_core	**commit_iclog,
 	uint			optype)
 {
 	struct xlog_in_core	*iclog = NULL;
@@ -2529,8 +2520,7 @@ xlog_write(
 						       &record_cnt, &data_cnt,
 						       &partial_copy,
 						       &partial_copy_len,
-						       log_offset,
-						       commit_iclog);
+						       log_offset);
 			if (error)
 				return error;
 
@@ -2568,12 +2558,7 @@ xlog_write(
 
 	spin_lock(&log->l_icloglock);
 	xlog_state_finish_copy(log, iclog, record_cnt, data_cnt);
-	if (commit_iclog) {
-		ASSERT(optype & XLOG_COMMIT_TRANS);
-		*commit_iclog = iclog;
-	} else {
-		error = xlog_state_release_iclog(log, iclog);
-	}
+	error = xlog_state_release_iclog(log, iclog);
 	spin_unlock(&log->l_icloglock);
 
 	return error;
diff --git a/fs/xfs/xfs_log_cil.c b/fs/xfs/xfs_log_cil.c
index cac3c9c894e5..18a2d6a9f26e 100644
--- a/fs/xfs/xfs_log_cil.c
+++ b/fs/xfs/xfs_log_cil.c
@@ -638,11 +638,41 @@ xlog_cil_set_ctx_write_state(
 	xfs_lsn_t		lsn = be64_to_cpu(iclog->ic_header.h_lsn);
 
 	ASSERT(!ctx->commit_lsn);
-	spin_lock(&cil->xc_push_lock);
-	if (!ctx->start_lsn)
+	if (!ctx->start_lsn) {
+		spin_lock(&cil->xc_push_lock);
 		ctx->start_lsn = lsn;
-	else
-		ctx->commit_lsn = lsn;
+		spin_unlock(&cil->xc_push_lock);
+		return;
+	}
+
+	/*
+	 * Take a reference to the iclog for the context so that we still hold
+	 * it when xlog_write is done and has released it. This means the
+	 * context controls when the iclog is released for IO.
+	 */
+	atomic_inc(&iclog->ic_refcnt);
+
+	/*
+	 * xlog_state_get_iclog_space() guarantees there is enough space in the
+	 * iclog for an entire commit record, so we can attach the context
+	 * callbacks now.  This needs to be done before we make the commit_lsn
+	 * visible to waiters so that checkpoints with commit records in the
+	 * same iclog order their IO completion callbacks in the same order that
+	 * the commit records appear in the iclog.
+	 */
+	spin_lock(&cil->xc_log->l_icloglock);
+	list_add_tail(&ctx->iclog_entry, &iclog->ic_callbacks);
+	spin_unlock(&cil->xc_log->l_icloglock);
+
+	/*
+	 * Now we can record the commit LSN and wake anyone waiting for this
+	 * sequence to have the ordered commit record assigned to a physical
+	 * location in the log.
+	 */
+	spin_lock(&cil->xc_push_lock);
+	ctx->commit_iclog = iclog;
+	ctx->commit_lsn = lsn;
+	wake_up_all(&cil->xc_commit_wait);
 	spin_unlock(&cil->xc_push_lock);
 }
 
@@ -699,8 +729,7 @@ xlog_cil_order_write(
  */
 static int
 xlog_cil_write_commit_record(
-	struct xfs_cil_ctx	*ctx,
-	struct xlog_in_core	**iclog)
+	struct xfs_cil_ctx	*ctx)
 {
 	struct xlog		*log = ctx->cil->xc_log;
 	struct xfs_log_iovec	reg = {
@@ -717,8 +746,7 @@ xlog_cil_write_commit_record(
 	if (xlog_is_shutdown(log))
 		return -EIO;
 
-	error = xlog_write(log, ctx, &vec, ctx->ticket, iclog,
-			XLOG_COMMIT_TRANS);
+	error = xlog_write(log, ctx, &vec, ctx->ticket, XLOG_COMMIT_TRANS);
 	if (error)
 		xfs_force_shutdown(log->l_mp, SHUTDOWN_LOG_IO_ERROR);
 	return error;
@@ -748,7 +776,6 @@ xlog_cil_push_work(
 	struct xfs_log_vec	*lv;
 	struct xfs_cil_ctx	*ctx;
 	struct xfs_cil_ctx	*new_ctx;
-	struct xlog_in_core	*commit_iclog;
 	struct xlog_ticket	*tic;
 	int			num_iovecs;
 	int			error = 0;
@@ -928,7 +955,7 @@ xlog_cil_push_work(
 	 */
 	wait_for_completion(&bdev_flush);
 
-	error = xlog_write(log, ctx, &lvhdr, tic, NULL, XLOG_START_TRANS);
+	error = xlog_write(log, ctx, &lvhdr, tic, XLOG_START_TRANS);
 	if (error)
 		goto out_abort_free_ticket;
 
@@ -936,36 +963,12 @@ xlog_cil_push_work(
 	if (error)
 		goto out_abort_free_ticket;
 
-	error = xlog_cil_write_commit_record(ctx, &commit_iclog);
+	error = xlog_cil_write_commit_record(ctx);
 	if (error)
 		goto out_abort_free_ticket;
 
 	xfs_log_ticket_ungrant(log, tic);
 
-	/*
-	 * Once we attach the ctx to the iclog, it is effectively owned by the
-	 * iclog and we can only use it while we still have an active reference
-	 * to the iclog. i.e. once we call xlog_state_release_iclog() we can no
-	 * longer safely reference the ctx.
-	 */
-	spin_lock(&log->l_icloglock);
-	if (xlog_is_shutdown(log)) {
-		spin_unlock(&log->l_icloglock);
-		goto out_abort;
-	}
-	ASSERT_ALWAYS(commit_iclog->ic_state == XLOG_STATE_ACTIVE ||
-		      commit_iclog->ic_state == XLOG_STATE_WANT_SYNC);
-	list_add_tail(&ctx->iclog_entry, &commit_iclog->ic_callbacks);
-
-	/*
-	 * now the checkpoint commit is complete and we've attached the
-	 * callbacks to the iclog we can assign the commit LSN to the context
-	 * and wake up anyone who is waiting for the commit to complete.
-	 */
-	spin_lock(&cil->xc_push_lock);
-	wake_up_all(&cil->xc_commit_wait);
-	spin_unlock(&cil->xc_push_lock);
-
 	/*
 	 * If the checkpoint spans multiple iclogs, wait for all previous iclogs
 	 * to complete before we submit the commit_iclog. We can't use state
@@ -978,17 +981,18 @@ xlog_cil_push_work(
 	 * iclog header lsn and compare it to the commit lsn to determine if we
 	 * need to wait on iclogs or not.
 	 */
+	spin_lock(&log->l_icloglock);
 	if (ctx->start_lsn != ctx->commit_lsn) {
 		xfs_lsn_t	plsn;
 
-		plsn = be64_to_cpu(commit_iclog->ic_prev->ic_header.h_lsn);
+		plsn = be64_to_cpu(ctx->commit_iclog->ic_prev->ic_header.h_lsn);
 		if (plsn && XFS_LSN_CMP(plsn, ctx->commit_lsn) < 0) {
 			/*
 			 * Waiting on ic_force_wait orders the completion of
 			 * iclogs older than ic_prev. Hence we only need to wait
 			 * on the most recent older iclog here.
 			 */
-			xlog_wait_on_iclog(commit_iclog->ic_prev);
+			xlog_wait_on_iclog(ctx->commit_iclog->ic_prev);
 			spin_lock(&log->l_icloglock);
 		}
 
@@ -996,7 +1000,7 @@ xlog_cil_push_work(
 		 * We need to issue a pre-flush so that the ordering for this
 		 * checkpoint is correctly preserved down to stable storage.
 		 */
-		commit_iclog->ic_flags |= XLOG_ICL_NEED_FLUSH;
+		ctx->commit_iclog->ic_flags |= XLOG_ICL_NEED_FLUSH;
 	}
 
 	/*
@@ -1004,8 +1008,8 @@ xlog_cil_push_work(
 	 * journal IO vs metadata writeback IO is correctly ordered on stable
 	 * storage.
 	 */
-	commit_iclog->ic_flags |= XLOG_ICL_NEED_FUA;
-	xlog_state_release_iclog(log, commit_iclog);
+	ctx->commit_iclog->ic_flags |= XLOG_ICL_NEED_FUA;
+	xlog_state_release_iclog(log, ctx->commit_iclog);
 
 	/* Not safe to reference ctx now! */
 
@@ -1020,9 +1024,15 @@ xlog_cil_push_work(
 
 out_abort_free_ticket:
 	xfs_log_ticket_ungrant(log, tic);
-out_abort:
 	ASSERT(xlog_is_shutdown(log));
-	xlog_cil_committed(ctx);
+	if (!ctx->commit_iclog) {
+		xlog_cil_committed(ctx);
+		return;
+	}
+	spin_lock(&log->l_icloglock);
+	xlog_state_release_iclog(log, ctx->commit_iclog);
+	/* Not safe to reference ctx now! */
+	spin_unlock(&log->l_icloglock);
 }
 
 /*
diff --git a/fs/xfs/xfs_log_priv.h b/fs/xfs/xfs_log_priv.h
index 2a02ce05b649..f74e3968bb84 100644
--- a/fs/xfs/xfs_log_priv.h
+++ b/fs/xfs/xfs_log_priv.h
@@ -232,6 +232,7 @@ struct xfs_cil_ctx {
 	xfs_csn_t		sequence;	/* chkpt sequence # */
 	xfs_lsn_t		start_lsn;	/* first LSN of chkpt commit */
 	xfs_lsn_t		commit_lsn;	/* chkpt commit record lsn */
+	struct xlog_in_core	*commit_iclog;
 	struct xlog_ticket	*ticket;	/* chkpt ticket */
 	int			nvecs;		/* number of regions */
 	int			space_used;	/* aggregate size of regions */
@@ -503,7 +504,7 @@ void	xlog_print_tic_res(struct xfs_mount *mp, struct xlog_ticket *ticket);
 void	xlog_print_trans(struct xfs_trans *);
 int	xlog_write(struct xlog *log, struct xfs_cil_ctx *ctx,
 		struct xfs_log_vec *log_vector, struct xlog_ticket *tic,
-		struct xlog_in_core **commit_iclog, uint optype);
+		uint optype);
 void	xfs_log_ticket_ungrant(struct xlog *log, struct xlog_ticket *ticket);
 void	xfs_log_ticket_regrant(struct xlog *log, struct xlog_ticket *ticket);
 
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 15+ messages in thread

* [PATCH 5/5] xfs: order CIL checkpoint start records
  2021-07-14  3:36 [PATCH 0/5 v2] xfs: strictly order log start records Dave Chinner
                   ` (3 preceding siblings ...)
  2021-07-14  3:36 ` [PATCH 4/5] xfs: attached iclog callbacks in xlog_cil_set_ctx_write_state() Dave Chinner
@ 2021-07-14  3:36 ` Dave Chinner
  2021-07-14  6:34   ` Christoph Hellwig
  2021-07-14 22:39   ` Darrick J. Wong
  2021-08-09 18:39 ` [PATCH 0/5 v2] xfs: strictly order log " Darrick J. Wong
  5 siblings, 2 replies; 15+ messages in thread
From: Dave Chinner @ 2021-07-14  3:36 UTC (permalink / raw)
  To: linux-xfs

From: Dave Chinner <dchinner@redhat.com>

Because log recovery depends on strictly ordered start records as
well as strictly ordered commit records.

This is a zero day bug in the way XFS writes pipelined transactions
to the journal which is exposed by fixing the zero day bug that
prevents the CIL from pipelining checkpoints. This re-introduces
explicit concurrent commits back into the on-disk journal and hence
out of order start records.

The XFS journal commit code has never ordered start records and we
have relied on strict commit record ordering for correct recovery
ordering of concurrently written transactions. Unfortunately, root
cause analysis uncovered the fact that log recovery uses the LSN of
the start record for transaction commit processing. Hence, whilst
the commits are processed in strict order by recovery, the LSNs
associated with the commits can be out of order and so recovery may
stamp incorrect LSNs into objects and/or misorder intents in the AIL
for later processing. This can result in log recovery failures
and/or on disk corruption, sometimes silent.

Because this is a long standing log recovery issue, we can't just
fix log recovery and call it good. This still leaves older kernels
susceptible to recovery failures and corruption when replaying a log
from a kernel that pipelines checkpoints. There is also the issue
that in-memory ordering for AIL pushing and data integrity
operations are based on checkpoint start LSNs, and if the start LSN
is incorrect in the journal, it is also incorrect in memory.

Hence there's really only one choice for fixing this zero-day bug:
we need to strictly order checkpoint start records in ascending
sequence order in the log, the same way we already strictly order
commit records.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/xfs_log.c      |  1 +
 fs/xfs/xfs_log_cil.c  | 69 +++++++++++++++++++++++++++++++++++--------
 fs/xfs/xfs_log_priv.h |  1 +
 3 files changed, 58 insertions(+), 13 deletions(-)

diff --git a/fs/xfs/xfs_log.c b/fs/xfs/xfs_log.c
index f41c6f388456..98d8fea6a83b 100644
--- a/fs/xfs/xfs_log.c
+++ b/fs/xfs/xfs_log.c
@@ -3777,6 +3777,7 @@ xlog_force_shutdown(
 	 * avoid races.
 	 */
 	spin_lock(&log->l_cilp->xc_push_lock);
+	wake_up_all(&log->l_cilp->xc_start_wait);
 	wake_up_all(&log->l_cilp->xc_commit_wait);
 	spin_unlock(&log->l_cilp->xc_push_lock);
 	xlog_state_shutdown_callbacks(log);
diff --git a/fs/xfs/xfs_log_cil.c b/fs/xfs/xfs_log_cil.c
index 18a2d6a9f26e..5ef47aed10f9 100644
--- a/fs/xfs/xfs_log_cil.c
+++ b/fs/xfs/xfs_log_cil.c
@@ -587,6 +587,7 @@ xlog_cil_committed(
 	 */
 	if (abort) {
 		spin_lock(&ctx->cil->xc_push_lock);
+		wake_up_all(&ctx->cil->xc_start_wait);
 		wake_up_all(&ctx->cil->xc_commit_wait);
 		spin_unlock(&ctx->cil->xc_push_lock);
 	}
@@ -640,7 +641,14 @@ xlog_cil_set_ctx_write_state(
 	ASSERT(!ctx->commit_lsn);
 	if (!ctx->start_lsn) {
 		spin_lock(&cil->xc_push_lock);
+		/*
+		 * The LSN we need to pass to the log items on transaction
+		 * commit is the LSN reported by the first log vector write, not
+		 * the commit lsn. If we use the commit record lsn then we can
+		 * move the tail beyond the grant write head.
+		 */
 		ctx->start_lsn = lsn;
+		wake_up_all(&cil->xc_start_wait);
 		spin_unlock(&cil->xc_push_lock);
 		return;
 	}
@@ -682,10 +690,16 @@ xlog_cil_set_ctx_write_state(
  * relies on the context LSN being zero until the log write has guaranteed the
  * LSN that the log write will start at via xlog_state_get_iclog_space().
  */
+enum _record_type {
+	_START_RECORD,
+	_COMMIT_RECORD,
+};
+
 static int
 xlog_cil_order_write(
 	struct xfs_cil		*cil,
-	xfs_csn_t		sequence)
+	xfs_csn_t		sequence,
+	enum _record_type	record)
 {
 	struct xfs_cil_ctx	*ctx;
 
@@ -708,19 +722,47 @@ xlog_cil_order_write(
 		 */
 		if (ctx->sequence >= sequence)
 			continue;
-		if (!ctx->commit_lsn) {
-			/*
-			 * It is still being pushed! Wait for the push to
-			 * complete, then start again from the beginning.
-			 */
-			xlog_wait(&cil->xc_commit_wait, &cil->xc_push_lock);
-			goto restart;
+
+		/* Wait until the LSN for the record has been recorded. */
+		switch (record) {
+		case _START_RECORD:
+			if (!ctx->start_lsn) {
+				xlog_wait(&cil->xc_start_wait, &cil->xc_push_lock);
+				goto restart;
+			}
+			break;
+		case _COMMIT_RECORD:
+			if (!ctx->commit_lsn) {
+				xlog_wait(&cil->xc_commit_wait, &cil->xc_push_lock);
+				goto restart;
+			}
+			break;
 		}
 	}
 	spin_unlock(&cil->xc_push_lock);
 	return 0;
 }
 
+/*
+ * Write out the log vector change now attached to the CIL context. This will
+ * write a start record that needs to be strictly ordered in ascending CIL
+ * sequence order so that log recovery will always use in-order start LSNs when
+ * replaying checkpoints.
+ */
+static int
+xlog_cil_write_chain(
+	struct xfs_cil_ctx	*ctx,
+	struct xfs_log_vec	*chain)
+{
+	struct xlog		*log = ctx->cil->xc_log;
+	int			error;
+
+	error = xlog_cil_order_write(ctx->cil, ctx->sequence, _START_RECORD);
+	if (error)
+		return error;
+	return xlog_write(log, ctx, chain, ctx->ticket, XLOG_START_TRANS);
+}
+
 /*
  * Write out the commit record of a checkpoint transaction to close off a
  * running log write. These commit records are strictly ordered in ascending CIL
@@ -746,6 +788,10 @@ xlog_cil_write_commit_record(
 	if (xlog_is_shutdown(log))
 		return -EIO;
 
+	error = xlog_cil_order_write(ctx->cil, ctx->sequence, _COMMIT_RECORD);
+	if (error)
+		return error;
+
 	error = xlog_write(log, ctx, &vec, ctx->ticket, XLOG_COMMIT_TRANS);
 	if (error)
 		xfs_force_shutdown(log->l_mp, SHUTDOWN_LOG_IO_ERROR);
@@ -955,11 +1001,7 @@ xlog_cil_push_work(
 	 */
 	wait_for_completion(&bdev_flush);
 
-	error = xlog_write(log, ctx, &lvhdr, tic, XLOG_START_TRANS);
-	if (error)
-		goto out_abort_free_ticket;
-
-	error = xlog_cil_order_write(ctx->cil, ctx->sequence);
+	error = xlog_cil_write_chain(ctx, &lvhdr);
 	if (error)
 		goto out_abort_free_ticket;
 
@@ -1364,6 +1406,7 @@ xlog_cil_init(
 	spin_lock_init(&cil->xc_push_lock);
 	init_waitqueue_head(&cil->xc_push_wait);
 	init_rwsem(&cil->xc_ctx_lock);
+	init_waitqueue_head(&cil->xc_start_wait);
 	init_waitqueue_head(&cil->xc_commit_wait);
 
 	INIT_LIST_HEAD(&ctx->committing);
diff --git a/fs/xfs/xfs_log_priv.h b/fs/xfs/xfs_log_priv.h
index f74e3968bb84..400471fa12d2 100644
--- a/fs/xfs/xfs_log_priv.h
+++ b/fs/xfs/xfs_log_priv.h
@@ -271,6 +271,7 @@ struct xfs_cil {
 	xfs_csn_t		xc_push_seq;
 	struct list_head	xc_committing;
 	wait_queue_head_t	xc_commit_wait;
+	wait_queue_head_t	xc_start_wait;
 	xfs_csn_t		xc_current_sequence;
 	struct work_struct	xc_push_work;
 	wait_queue_head_t	xc_push_wait;	/* background push throttle */
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 15+ messages in thread

* Re: [PATCH 4/5] xfs: attached iclog callbacks in xlog_cil_set_ctx_write_state()
  2021-07-14  3:36 ` [PATCH 4/5] xfs: attached iclog callbacks in xlog_cil_set_ctx_write_state() Dave Chinner
@ 2021-07-14  6:21   ` Christoph Hellwig
  2021-07-14 22:36   ` Darrick J. Wong
  1 sibling, 0 replies; 15+ messages in thread
From: Christoph Hellwig @ 2021-07-14  6:21 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

s/attached/attach/ in the Subject.

xc_push_lock around the start_lsn assignment still looks weird and
unnecessary.

Otherwise looks good:

Reviewed-by: Christoph Hellwig <hch@lst.de>

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 5/5] xfs: order CIL checkpoint start records
  2021-07-14  3:36 ` [PATCH 5/5] xfs: order CIL checkpoint start records Dave Chinner
@ 2021-07-14  6:34   ` Christoph Hellwig
  2021-07-14 22:39   ` Darrick J. Wong
  1 sibling, 0 replies; 15+ messages in thread
From: Christoph Hellwig @ 2021-07-14  6:34 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Wed, Jul 14, 2021 at 01:36:56PM +1000, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> Because log recovery depends on strictly ordered start records as
> well as strictly ordered commit records.
> 
> This is a zero day bug in the way XFS writes pipelined transactions
> to the journal which is exposed by fixing the zero day bug that
> prevents the CIL from pipelining checkpoints. This re-introduces
> explicit concurrent commits back into the on-disk journal and hence
> out of order start records.
> 
> The XFS journal commit code has never ordered start records and we
> have relied on strict commit record ordering for correct recovery
> ordering of concurrently written transactions. Unfortunately, root
> cause analysis uncovered the fact that log recovery uses the LSN of
> the start record for transaction commit processing. Hence, whilst
> the commits are processed in strict order by recovery, the LSNs
> associated with the commits can be out of order and so recovery may
> stamp incorrect LSNs into objects and/or misorder intents in the AIL
> for later processing. This can result in log recovery failures
> and/or on disk corruption, sometimes silent.
> 
> Because this is a long standing log recovery issue, we can't just
> fix log recovery and call it good. This still leaves older kernels
> susceptible to recovery failures and corruption when replaying a log
> from a kernel that pipelines checkpoints. There is also the issue
> that in-memory ordering for AIL pushing and data integrity
> operations are based on checkpoint start LSNs, and if the start LSN
> is incorrect in the journal, it is also incorrect in memory.
> 
> Hence there's really only one choice for fixing this zero-day bug:
> we need to strictly order checkpoint start records in ascending
> sequence order in the log, the same way we already strictly order
> commit records.
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>

I can't say I like the overloading of a mostly trivial function
with the record enum.  I think just two separate helpers would
be much more obvious.

But technically this looks fine.

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 4/5] xfs: attached iclog callbacks in xlog_cil_set_ctx_write_state()
  2021-07-14  3:36 ` [PATCH 4/5] xfs: attached iclog callbacks in xlog_cil_set_ctx_write_state() Dave Chinner
  2021-07-14  6:21   ` Christoph Hellwig
@ 2021-07-14 22:36   ` Darrick J. Wong
  1 sibling, 0 replies; 15+ messages in thread
From: Darrick J. Wong @ 2021-07-14 22:36 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Wed, Jul 14, 2021 at 01:36:55PM +1000, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> Now that we have a mechanism to guarantee that the callbacks
> attached to an iclog are owned by the context that attaches them
> until they drop their reference to the iclog via
> xlog_state_release_iclog(), we can attach callbacks to the iclog at
> any time we have an active reference to the iclog.
> 
> xlog_state_get_iclog_space() always guarantees that the commit
> record will fit in the iclog it returns, so we can move this IO
> callback setting to xlog_cil_set_ctx_write_state(), record the
> commit iclog in the context and remove the need for the commit iclog
> to be returned by xlog_write() altogether.
> 
> This, in turn, allows us to move the wakeup for ordered commit
> record writes up into xlog_cil_set_ctx_write_state(), too, because
> we have been guaranteed that this commit record will be physically
> located in the iclog before any waiting commit record at a higher
> sequence number will be granted iclog space.
> 
> This further cleans up the post commit record write processing in
> the CIL push code, especially as xlog_state_release_iclog() will now
> clean up the context when shutdown errors occur.
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>

Modulo the typo in the subject, I think I figured this out to my
satisfaction.  The addition of the ic_refcnt increment when we're
committing the transaction exists to balance out the now unconditional
release_iclog call.

Reviewed-by: Darrick J. Wong <djwong@kernel.org>

--D

> ---
>  fs/xfs/xfs_log.c      | 47 ++++++++--------------
>  fs/xfs/xfs_log_cil.c  | 94 ++++++++++++++++++++++++-------------------
>  fs/xfs/xfs_log_priv.h |  3 +-
>  3 files changed, 70 insertions(+), 74 deletions(-)
> 
> diff --git a/fs/xfs/xfs_log.c b/fs/xfs/xfs_log.c
> index a190efbbe451..f41c6f388456 100644
> --- a/fs/xfs/xfs_log.c
> +++ b/fs/xfs/xfs_log.c
> @@ -901,7 +901,7 @@ xlog_write_unmount_record(
>  	 */
>  	if (log->l_targ != log->l_mp->m_ddev_targp)
>  		blkdev_issue_flush(log->l_targ->bt_bdev);
> -	return xlog_write(log, NULL, &vec, ticket, NULL, XLOG_UNMOUNT_TRANS);
> +	return xlog_write(log, NULL, &vec, ticket, XLOG_UNMOUNT_TRANS);
>  }
>  
>  /*
> @@ -2307,8 +2307,7 @@ xlog_write_copy_finish(
>  	int			*data_cnt,
>  	int			*partial_copy,
>  	int			*partial_copy_len,
> -	int			log_offset,
> -	struct xlog_in_core	**commit_iclog)
> +	int			log_offset)
>  {
>  	int			error;
>  
> @@ -2327,27 +2326,20 @@ xlog_write_copy_finish(
>  	*partial_copy = 0;
>  	*partial_copy_len = 0;
>  
> -	if (iclog->ic_size - log_offset <= sizeof(xlog_op_header_t)) {
> -		/* no more space in this iclog - push it. */
> -		spin_lock(&log->l_icloglock);
> -		xlog_state_finish_copy(log, iclog, *record_cnt, *data_cnt);
> -		*record_cnt = 0;
> -		*data_cnt = 0;
> -
> -		if (iclog->ic_state == XLOG_STATE_ACTIVE)
> -			xlog_state_switch_iclogs(log, iclog, 0);
> -		else
> -			ASSERT(iclog->ic_state == XLOG_STATE_WANT_SYNC ||
> -				xlog_is_shutdown(log));
> -		if (!commit_iclog)
> -			goto release_iclog;
> -		spin_unlock(&log->l_icloglock);
> -		ASSERT(flags & XLOG_COMMIT_TRANS);
> -		*commit_iclog = iclog;
> -	}
> +	if (iclog->ic_size - log_offset > sizeof(xlog_op_header_t))
> +		return 0;
>  
> -	return 0;
> +	/* no more space in this iclog - push it. */
> +	spin_lock(&log->l_icloglock);
> +	xlog_state_finish_copy(log, iclog, *record_cnt, *data_cnt);
> +	*record_cnt = 0;
> +	*data_cnt = 0;
>  
> +	if (iclog->ic_state == XLOG_STATE_ACTIVE)
> +		xlog_state_switch_iclogs(log, iclog, 0);
> +	else
> +		ASSERT(iclog->ic_state == XLOG_STATE_WANT_SYNC ||
> +			xlog_is_shutdown(log));
>  release_iclog:
>  	error = xlog_state_release_iclog(log, iclog);
>  	spin_unlock(&log->l_icloglock);
> @@ -2400,7 +2392,6 @@ xlog_write(
>  	struct xfs_cil_ctx	*ctx,
>  	struct xfs_log_vec	*log_vector,
>  	struct xlog_ticket	*ticket,
> -	struct xlog_in_core	**commit_iclog,
>  	uint			optype)
>  {
>  	struct xlog_in_core	*iclog = NULL;
> @@ -2529,8 +2520,7 @@ xlog_write(
>  						       &record_cnt, &data_cnt,
>  						       &partial_copy,
>  						       &partial_copy_len,
> -						       log_offset,
> -						       commit_iclog);
> +						       log_offset);
>  			if (error)
>  				return error;
>  
> @@ -2568,12 +2558,7 @@ xlog_write(
>  
>  	spin_lock(&log->l_icloglock);
>  	xlog_state_finish_copy(log, iclog, record_cnt, data_cnt);
> -	if (commit_iclog) {
> -		ASSERT(optype & XLOG_COMMIT_TRANS);
> -		*commit_iclog = iclog;
> -	} else {
> -		error = xlog_state_release_iclog(log, iclog);
> -	}
> +	error = xlog_state_release_iclog(log, iclog);
>  	spin_unlock(&log->l_icloglock);
>  
>  	return error;
> diff --git a/fs/xfs/xfs_log_cil.c b/fs/xfs/xfs_log_cil.c
> index cac3c9c894e5..18a2d6a9f26e 100644
> --- a/fs/xfs/xfs_log_cil.c
> +++ b/fs/xfs/xfs_log_cil.c
> @@ -638,11 +638,41 @@ xlog_cil_set_ctx_write_state(
>  	xfs_lsn_t		lsn = be64_to_cpu(iclog->ic_header.h_lsn);
>  
>  	ASSERT(!ctx->commit_lsn);
> -	spin_lock(&cil->xc_push_lock);
> -	if (!ctx->start_lsn)
> +	if (!ctx->start_lsn) {
> +		spin_lock(&cil->xc_push_lock);
>  		ctx->start_lsn = lsn;
> -	else
> -		ctx->commit_lsn = lsn;
> +		spin_unlock(&cil->xc_push_lock);
> +		return;
> +	}
> +
> +	/*
> +	 * Take a reference to the iclog for the context so that we still hold
> +	 * it when xlog_write is done and has released it. This means the
> +	 * context controls when the iclog is released for IO.
> +	 */
> +	atomic_inc(&iclog->ic_refcnt);
> +
> +	/*
> +	 * xlog_state_get_iclog_space() guarantees there is enough space in the
> +	 * iclog for an entire commit record, so we can attach the context
> +	 * callbacks now.  This needs to be done before we make the commit_lsn
> +	 * visible to waiters so that checkpoints with commit records in the
> +	 * same iclog order their IO completion callbacks in the same order that
> +	 * the commit records appear in the iclog.
> +	 */
> +	spin_lock(&cil->xc_log->l_icloglock);
> +	list_add_tail(&ctx->iclog_entry, &iclog->ic_callbacks);
> +	spin_unlock(&cil->xc_log->l_icloglock);
> +
> +	/*
> +	 * Now we can record the commit LSN and wake anyone waiting for this
> +	 * sequence to have the ordered commit record assigned to a physical
> +	 * location in the log.
> +	 */
> +	spin_lock(&cil->xc_push_lock);
> +	ctx->commit_iclog = iclog;
> +	ctx->commit_lsn = lsn;
> +	wake_up_all(&cil->xc_commit_wait);
>  	spin_unlock(&cil->xc_push_lock);
>  }
>  
> @@ -699,8 +729,7 @@ xlog_cil_order_write(
>   */
>  static int
>  xlog_cil_write_commit_record(
> -	struct xfs_cil_ctx	*ctx,
> -	struct xlog_in_core	**iclog)
> +	struct xfs_cil_ctx	*ctx)
>  {
>  	struct xlog		*log = ctx->cil->xc_log;
>  	struct xfs_log_iovec	reg = {
> @@ -717,8 +746,7 @@ xlog_cil_write_commit_record(
>  	if (xlog_is_shutdown(log))
>  		return -EIO;
>  
> -	error = xlog_write(log, ctx, &vec, ctx->ticket, iclog,
> -			XLOG_COMMIT_TRANS);
> +	error = xlog_write(log, ctx, &vec, ctx->ticket, XLOG_COMMIT_TRANS);
>  	if (error)
>  		xfs_force_shutdown(log->l_mp, SHUTDOWN_LOG_IO_ERROR);
>  	return error;
> @@ -748,7 +776,6 @@ xlog_cil_push_work(
>  	struct xfs_log_vec	*lv;
>  	struct xfs_cil_ctx	*ctx;
>  	struct xfs_cil_ctx	*new_ctx;
> -	struct xlog_in_core	*commit_iclog;
>  	struct xlog_ticket	*tic;
>  	int			num_iovecs;
>  	int			error = 0;
> @@ -928,7 +955,7 @@ xlog_cil_push_work(
>  	 */
>  	wait_for_completion(&bdev_flush);
>  
> -	error = xlog_write(log, ctx, &lvhdr, tic, NULL, XLOG_START_TRANS);
> +	error = xlog_write(log, ctx, &lvhdr, tic, XLOG_START_TRANS);
>  	if (error)
>  		goto out_abort_free_ticket;
>  
> @@ -936,36 +963,12 @@ xlog_cil_push_work(
>  	if (error)
>  		goto out_abort_free_ticket;
>  
> -	error = xlog_cil_write_commit_record(ctx, &commit_iclog);
> +	error = xlog_cil_write_commit_record(ctx);
>  	if (error)
>  		goto out_abort_free_ticket;
>  
>  	xfs_log_ticket_ungrant(log, tic);
>  
> -	/*
> -	 * Once we attach the ctx to the iclog, it is effectively owned by the
> -	 * iclog and we can only use it while we still have an active reference
> -	 * to the iclog. i.e. once we call xlog_state_release_iclog() we can no
> -	 * longer safely reference the ctx.
> -	 */
> -	spin_lock(&log->l_icloglock);
> -	if (xlog_is_shutdown(log)) {
> -		spin_unlock(&log->l_icloglock);
> -		goto out_abort;
> -	}
> -	ASSERT_ALWAYS(commit_iclog->ic_state == XLOG_STATE_ACTIVE ||
> -		      commit_iclog->ic_state == XLOG_STATE_WANT_SYNC);
> -	list_add_tail(&ctx->iclog_entry, &commit_iclog->ic_callbacks);
> -
> -	/*
> -	 * now the checkpoint commit is complete and we've attached the
> -	 * callbacks to the iclog we can assign the commit LSN to the context
> -	 * and wake up anyone who is waiting for the commit to complete.
> -	 */
> -	spin_lock(&cil->xc_push_lock);
> -	wake_up_all(&cil->xc_commit_wait);
> -	spin_unlock(&cil->xc_push_lock);
> -
>  	/*
>  	 * If the checkpoint spans multiple iclogs, wait for all previous iclogs
>  	 * to complete before we submit the commit_iclog. We can't use state
> @@ -978,17 +981,18 @@ xlog_cil_push_work(
>  	 * iclog header lsn and compare it to the commit lsn to determine if we
>  	 * need to wait on iclogs or not.
>  	 */
> +	spin_lock(&log->l_icloglock);
>  	if (ctx->start_lsn != ctx->commit_lsn) {
>  		xfs_lsn_t	plsn;
>  
> -		plsn = be64_to_cpu(commit_iclog->ic_prev->ic_header.h_lsn);
> +		plsn = be64_to_cpu(ctx->commit_iclog->ic_prev->ic_header.h_lsn);
>  		if (plsn && XFS_LSN_CMP(plsn, ctx->commit_lsn) < 0) {
>  			/*
>  			 * Waiting on ic_force_wait orders the completion of
>  			 * iclogs older than ic_prev. Hence we only need to wait
>  			 * on the most recent older iclog here.
>  			 */
> -			xlog_wait_on_iclog(commit_iclog->ic_prev);
> +			xlog_wait_on_iclog(ctx->commit_iclog->ic_prev);
>  			spin_lock(&log->l_icloglock);
>  		}
>  
> @@ -996,7 +1000,7 @@ xlog_cil_push_work(
>  		 * We need to issue a pre-flush so that the ordering for this
>  		 * checkpoint is correctly preserved down to stable storage.
>  		 */
> -		commit_iclog->ic_flags |= XLOG_ICL_NEED_FLUSH;
> +		ctx->commit_iclog->ic_flags |= XLOG_ICL_NEED_FLUSH;
>  	}
>  
>  	/*
> @@ -1004,8 +1008,8 @@ xlog_cil_push_work(
>  	 * journal IO vs metadata writeback IO is correctly ordered on stable
>  	 * storage.
>  	 */
> -	commit_iclog->ic_flags |= XLOG_ICL_NEED_FUA;
> -	xlog_state_release_iclog(log, commit_iclog);
> +	ctx->commit_iclog->ic_flags |= XLOG_ICL_NEED_FUA;
> +	xlog_state_release_iclog(log, ctx->commit_iclog);
>  
>  	/* Not safe to reference ctx now! */
>  
> @@ -1020,9 +1024,15 @@ xlog_cil_push_work(
>  
>  out_abort_free_ticket:
>  	xfs_log_ticket_ungrant(log, tic);
> -out_abort:
>  	ASSERT(xlog_is_shutdown(log));
> -	xlog_cil_committed(ctx);
> +	if (!ctx->commit_iclog) {
> +		xlog_cil_committed(ctx);
> +		return;
> +	}
> +	spin_lock(&log->l_icloglock);
> +	xlog_state_release_iclog(log, ctx->commit_iclog);
> +	/* Not safe to reference ctx now! */
> +	spin_unlock(&log->l_icloglock);
>  }
>  
>  /*
> diff --git a/fs/xfs/xfs_log_priv.h b/fs/xfs/xfs_log_priv.h
> index 2a02ce05b649..f74e3968bb84 100644
> --- a/fs/xfs/xfs_log_priv.h
> +++ b/fs/xfs/xfs_log_priv.h
> @@ -232,6 +232,7 @@ struct xfs_cil_ctx {
>  	xfs_csn_t		sequence;	/* chkpt sequence # */
>  	xfs_lsn_t		start_lsn;	/* first LSN of chkpt commit */
>  	xfs_lsn_t		commit_lsn;	/* chkpt commit record lsn */
> +	struct xlog_in_core	*commit_iclog;
>  	struct xlog_ticket	*ticket;	/* chkpt ticket */
>  	int			nvecs;		/* number of regions */
>  	int			space_used;	/* aggregate size of regions */
> @@ -503,7 +504,7 @@ void	xlog_print_tic_res(struct xfs_mount *mp, struct xlog_ticket *ticket);
>  void	xlog_print_trans(struct xfs_trans *);
>  int	xlog_write(struct xlog *log, struct xfs_cil_ctx *ctx,
>  		struct xfs_log_vec *log_vector, struct xlog_ticket *tic,
> -		struct xlog_in_core **commit_iclog, uint optype);
> +		uint optype);
>  void	xfs_log_ticket_ungrant(struct xlog *log, struct xlog_ticket *ticket);
>  void	xfs_log_ticket_regrant(struct xlog *log, struct xlog_ticket *ticket);
>  
> -- 
> 2.31.1
> 

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 5/5] xfs: order CIL checkpoint start records
  2021-07-14  3:36 ` [PATCH 5/5] xfs: order CIL checkpoint start records Dave Chinner
  2021-07-14  6:34   ` Christoph Hellwig
@ 2021-07-14 22:39   ` Darrick J. Wong
  1 sibling, 0 replies; 15+ messages in thread
From: Darrick J. Wong @ 2021-07-14 22:39 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Wed, Jul 14, 2021 at 01:36:56PM +1000, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> Because log recovery depends on strictly ordered start records as
> well as strictly ordered commit records.
> 
> This is a zero day bug in the way XFS writes pipelined transactions
> to the journal which is exposed by fixing the zero day bug that
> prevents the CIL from pipelining checkpoints. This re-introduces
> explicit concurrent commits back into the on-disk journal and hence
> out of order start records.
> 
> The XFS journal commit code has never ordered start records and we
> have relied on strict commit record ordering for correct recovery
> ordering of concurrently written transactions. Unfortunately, root
> cause analysis uncovered the fact that log recovery uses the LSN of
> the start record for transaction commit processing. Hence, whilst
> the commits are processed in strict order by recovery, the LSNs
> associated with the commits can be out of order and so recovery may
> stamp incorrect LSNs into objects and/or misorder intents in the AIL
> for later processing. This can result in log recovery failures
> and/or on disk corruption, sometimes silent.
> 
> Because this is a long standing log recovery issue, we can't just
> fix log recovery and call it good. This still leaves older kernels
> susceptible to recovery failures and corruption when replaying a log
> from a kernel that pipelines checkpoints. There is also the issue
> that in-memory ordering for AIL pushing and data integrity
> operations are based on checkpoint start LSNs, and if the start LSN
> is incorrect in the journal, it is also incorrect in memory.
> 
> Hence there's really only one choice for fixing this zero-day bug:
> we need to strictly order checkpoint start records in ascending
> sequence order in the log, the same way we already strictly order
> commit records.
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>

Looks good to me now,
Reviewed-by: Darrick J. Wong <djwong@kernel.org>

--D

> ---
>  fs/xfs/xfs_log.c      |  1 +
>  fs/xfs/xfs_log_cil.c  | 69 +++++++++++++++++++++++++++++++++++--------
>  fs/xfs/xfs_log_priv.h |  1 +
>  3 files changed, 58 insertions(+), 13 deletions(-)
> 
> diff --git a/fs/xfs/xfs_log.c b/fs/xfs/xfs_log.c
> index f41c6f388456..98d8fea6a83b 100644
> --- a/fs/xfs/xfs_log.c
> +++ b/fs/xfs/xfs_log.c
> @@ -3777,6 +3777,7 @@ xlog_force_shutdown(
>  	 * avoid races.
>  	 */
>  	spin_lock(&log->l_cilp->xc_push_lock);
> +	wake_up_all(&log->l_cilp->xc_start_wait);
>  	wake_up_all(&log->l_cilp->xc_commit_wait);
>  	spin_unlock(&log->l_cilp->xc_push_lock);
>  	xlog_state_shutdown_callbacks(log);
> diff --git a/fs/xfs/xfs_log_cil.c b/fs/xfs/xfs_log_cil.c
> index 18a2d6a9f26e..5ef47aed10f9 100644
> --- a/fs/xfs/xfs_log_cil.c
> +++ b/fs/xfs/xfs_log_cil.c
> @@ -587,6 +587,7 @@ xlog_cil_committed(
>  	 */
>  	if (abort) {
>  		spin_lock(&ctx->cil->xc_push_lock);
> +		wake_up_all(&ctx->cil->xc_start_wait);
>  		wake_up_all(&ctx->cil->xc_commit_wait);
>  		spin_unlock(&ctx->cil->xc_push_lock);
>  	}
> @@ -640,7 +641,14 @@ xlog_cil_set_ctx_write_state(
>  	ASSERT(!ctx->commit_lsn);
>  	if (!ctx->start_lsn) {
>  		spin_lock(&cil->xc_push_lock);
> +		/*
> +		 * The LSN we need to pass to the log items on transaction
> +		 * commit is the LSN reported by the first log vector write, not
> +		 * the commit lsn. If we use the commit record lsn then we can
> +		 * move the tail beyond the grant write head.
> +		 */
>  		ctx->start_lsn = lsn;
> +		wake_up_all(&cil->xc_start_wait);
>  		spin_unlock(&cil->xc_push_lock);
>  		return;
>  	}
> @@ -682,10 +690,16 @@ xlog_cil_set_ctx_write_state(
>   * relies on the context LSN being zero until the log write has guaranteed the
>   * LSN that the log write will start at via xlog_state_get_iclog_space().
>   */
> +enum _record_type {
> +	_START_RECORD,
> +	_COMMIT_RECORD,
> +};
> +
>  static int
>  xlog_cil_order_write(
>  	struct xfs_cil		*cil,
> -	xfs_csn_t		sequence)
> +	xfs_csn_t		sequence,
> +	enum _record_type	record)
>  {
>  	struct xfs_cil_ctx	*ctx;
>  
> @@ -708,19 +722,47 @@ xlog_cil_order_write(
>  		 */
>  		if (ctx->sequence >= sequence)
>  			continue;
> -		if (!ctx->commit_lsn) {
> -			/*
> -			 * It is still being pushed! Wait for the push to
> -			 * complete, then start again from the beginning.
> -			 */
> -			xlog_wait(&cil->xc_commit_wait, &cil->xc_push_lock);
> -			goto restart;
> +
> +		/* Wait until the LSN for the record has been recorded. */
> +		switch (record) {
> +		case _START_RECORD:
> +			if (!ctx->start_lsn) {
> +				xlog_wait(&cil->xc_start_wait, &cil->xc_push_lock);
> +				goto restart;
> +			}
> +			break;
> +		case _COMMIT_RECORD:
> +			if (!ctx->commit_lsn) {
> +				xlog_wait(&cil->xc_commit_wait, &cil->xc_push_lock);
> +				goto restart;
> +			}
> +			break;
>  		}
>  	}
>  	spin_unlock(&cil->xc_push_lock);
>  	return 0;
>  }
>  
> +/*
> + * Write out the log vector change now attached to the CIL context. This will
> + * write a start record that needs to be strictly ordered in ascending CIL
> + * sequence order so that log recovery will always use in-order start LSNs when
> + * replaying checkpoints.
> + */
> +static int
> +xlog_cil_write_chain(
> +	struct xfs_cil_ctx	*ctx,
> +	struct xfs_log_vec	*chain)
> +{
> +	struct xlog		*log = ctx->cil->xc_log;
> +	int			error;
> +
> +	error = xlog_cil_order_write(ctx->cil, ctx->sequence, _START_RECORD);
> +	if (error)
> +		return error;
> +	return xlog_write(log, ctx, chain, ctx->ticket, XLOG_START_TRANS);
> +}
> +
>  /*
>   * Write out the commit record of a checkpoint transaction to close off a
>   * running log write. These commit records are strictly ordered in ascending CIL
> @@ -746,6 +788,10 @@ xlog_cil_write_commit_record(
>  	if (xlog_is_shutdown(log))
>  		return -EIO;
>  
> +	error = xlog_cil_order_write(ctx->cil, ctx->sequence, _COMMIT_RECORD);
> +	if (error)
> +		return error;
> +
>  	error = xlog_write(log, ctx, &vec, ctx->ticket, XLOG_COMMIT_TRANS);
>  	if (error)
>  		xfs_force_shutdown(log->l_mp, SHUTDOWN_LOG_IO_ERROR);
> @@ -955,11 +1001,7 @@ xlog_cil_push_work(
>  	 */
>  	wait_for_completion(&bdev_flush);
>  
> -	error = xlog_write(log, ctx, &lvhdr, tic, XLOG_START_TRANS);
> -	if (error)
> -		goto out_abort_free_ticket;
> -
> -	error = xlog_cil_order_write(ctx->cil, ctx->sequence);
> +	error = xlog_cil_write_chain(ctx, &lvhdr);
>  	if (error)
>  		goto out_abort_free_ticket;
>  
> @@ -1364,6 +1406,7 @@ xlog_cil_init(
>  	spin_lock_init(&cil->xc_push_lock);
>  	init_waitqueue_head(&cil->xc_push_wait);
>  	init_rwsem(&cil->xc_ctx_lock);
> +	init_waitqueue_head(&cil->xc_start_wait);
>  	init_waitqueue_head(&cil->xc_commit_wait);
>  
>  	INIT_LIST_HEAD(&ctx->committing);
> diff --git a/fs/xfs/xfs_log_priv.h b/fs/xfs/xfs_log_priv.h
> index f74e3968bb84..400471fa12d2 100644
> --- a/fs/xfs/xfs_log_priv.h
> +++ b/fs/xfs/xfs_log_priv.h
> @@ -271,6 +271,7 @@ struct xfs_cil {
>  	xfs_csn_t		xc_push_seq;
>  	struct list_head	xc_committing;
>  	wait_queue_head_t	xc_commit_wait;
> +	wait_queue_head_t	xc_start_wait;
>  	xfs_csn_t		xc_current_sequence;
>  	struct work_struct	xc_push_work;
>  	wait_queue_head_t	xc_push_wait;	/* background push throttle */
> -- 
> 2.31.1
> 

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 0/5 v2] xfs: strictly order log start records
  2021-07-14  3:36 [PATCH 0/5 v2] xfs: strictly order log start records Dave Chinner
                   ` (4 preceding siblings ...)
  2021-07-14  3:36 ` [PATCH 5/5] xfs: order CIL checkpoint start records Dave Chinner
@ 2021-08-09 18:39 ` Darrick J. Wong
  5 siblings, 0 replies; 15+ messages in thread
From: Darrick J. Wong @ 2021-08-09 18:39 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Wed, Jul 14, 2021 at 01:36:51PM +1000, Dave Chinner wrote:
> 
> We recently found a zero-day log recovery issue where overlapping
> log transactions used the wrong LSN for recovery operations despite
> being replayed in the correct commit record order. The issue is that

This series doesn't apply to 5.14-rc4, probably because of the changes
we made in "xfs: fix log cache flush regressions and bugs".  I'd ask if
you could rebase this (and the "make CIL pipelining work" series)
against for-next, except that found regressions with one of Allison's
patches, so I had to kick that one out and restart all the QA stuff.

Between that and fixing all the annoying indentation and whitespace
problems, the 5.15 merge branch isn't going to be ready for for-next
until Wednesday's tree build... and that's why there's still no for-next
in my tree.

--D

> the log recovery code uses the LSN of the start record of the log
> transaction for ordering and metadata stamping, while the actual
> order of transaction replay is determined by the commit record.
> Hence if we pipeline CIL commits we can end up with overlapping
> transactions in the log like:
> 
> Start A .. Start C .. Start B .... Commit A .. Commit B .. Commit C
> 
> The issue is that the "start B" lsn is later than the "start C" lsn.
> When the same metadata block is modified in both transaction B and
> C, writeback from "commit B" will correctly stamp "start B" into the
> metadata.
> 
> However, when "commit C" runs, it will see the LSN in that metadata
> block is "start B", which is *more recent than "Start C" and so
> will, incorrectly, fail to recover that change into the metadata
> block. This results in silent metadata corruption, which can then be
> exposed by future recovery operations failing, runtime
> inconsistencies causing shutdowns and/or xfs_scrub/xfs_repair check
> failures.
> 
> We could fix log recovery to avoid this problem, but there's a
> runtime problem as well: the AIL is ordered by start record LSN. We
> cannot order the AIL by commit LSN as we cannot allow the tail of
> the log to overwrite -any- of the log transaction until the entire
> transaction has been written. As the lowest LSN of the items in the
> AIL defines the current log tail, the same metadata writeback
> ordering issues apply as with log recovery.
> 
> In this case, we run the callbacks for commit B first, which place
> all the items at the head of the log at "start B". Then we run
> callbacks for "commit C", which then do not insert at the head -
> they get inserted before "start B". If the item was modified in both
> B and C, then it moves *backwards* in the AIL and this screws up all
> manner of things that assume relogging can only move objects
> forwards in the log. One of these things it can screw up is the tail
> lsn of the log. Nothing good comes from this...
> 
> Because we have both runtime and journal-based ordering requirements
> for the start_lsn, we have multiple places where there is an
> implicit assumption that transaction start records are strictly
> ordered. Rather than play whack-a-mole with such assumptions, and to
> avoid the eternal "are you running a fixed kernel" question, it's
> better just to strictly order the start records in the same way we
> strictly order the commit records.
> 
> This patch series takes the mechanisms of the strict commit record
> ordering and utilises them for strict start record ordering. It
> builds upon the shutdown rework patchset to guarantee that the CIL
> context structure will not get freed from under it by a racing
> shutdown, and so moves the LSN recording for ordering up into a
> callback from xlog_write() once we have a guaranteed iclog write
> location. This means we have one mechanism for both start and commit
> record ordering, and they both work in exactly the same way.
> 
> 
> Version 2:
> - rebase on 5.14-rc1 + "xfs: shutdown is a racy mess"
> - fixed typos in commit messages
> 
> Version 1:
> - https://lore.kernel.org/linux-xfs/20210630072108.1752073-1-david@fromorbit.com/
> 

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 4/5] xfs: attached iclog callbacks in xlog_cil_set_ctx_write_state()
  2021-07-05  5:49     ` Dave Chinner
@ 2021-07-13  0:52       ` Darrick J. Wong
  0 siblings, 0 replies; 15+ messages in thread
From: Darrick J. Wong @ 2021-07-13  0:52 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Christoph Hellwig, linux-xfs

On Mon, Jul 05, 2021 at 03:49:19PM +1000, Dave Chinner wrote:
> On Fri, Jul 02, 2021 at 10:17:57AM +0100, Christoph Hellwig wrote:
> > On Wed, Jun 30, 2021 at 05:21:07PM +1000, Dave Chinner wrote:
> > > From: Dave Chinner <dchinner@redhat.com>
> > > 
> > > Now that we have a mechanism to guarantee that the callbacks
> > > attached to an iclog are owned by the context that attaches them
> > > until they drop their reference to the iclog via
> > > xlog_state_release_iclog(), we can attach callbacks to the iclog at
> > > any time we have an active reference to the iclog.
> > > 
> > > xlog_state_get_iclog_space() always guarantees that the commit
> > > record will fit in the iclog it returns, so we can move this IO
> > > callback setting to xlog_cil_set_ctx_write_state(), record the
> > > commit iclog in the context and remove the need for the commit iclog
> > > to be returned by xlog_write() altogether.
> > > 
> > > This, in turn, allows us to move the wakeup for ordered commit
> > > recrod writes up into xlog_cil_set_ctx_write_state(), too, because
> > 
> > s/recrod/record/
> > 
> > > --- a/fs/xfs/xfs_log_cil.c
> > > +++ b/fs/xfs/xfs_log_cil.c
> > > @@ -646,11 +646,41 @@ xlog_cil_set_ctx_write_state(
> > >  	xfs_lsn_t		lsn = be64_to_cpu(iclog->ic_header.h_lsn);
> > >  
> > >  	ASSERT(!ctx->commit_lsn);
> > > +	if (!ctx->start_lsn) {
> > > +		spin_lock(&cil->xc_push_lock);
> > >  		ctx->start_lsn = lsn;
> > > +		spin_unlock(&cil->xc_push_lock);
> > > +		return;
> > 
> > What does xc_push_lock protect here?  None of the read of
> > ->start_lsn are under xc_push_lock, and this patch moves one of the
> > two readers to be under l_icloglock.
> 
> For this patch - nothing. It just maintains the consistency
> introduced in the previous patch of doing the CIL context updates
> under the xc_push_lock. I did that in the previous patch for
> simplicity: the next patch adds the start record ordering which,
> like the commit record ordering, needs to set ctx->start_lsn and run
> the waiter wakeup under the xc_push_lock.
> 
> > Also I wonder if the comment about what is done if start_lsn is not
> > set would be better right above the if instead of on top of the function
> > so that it stays closer to the code it documents.
> 
> I think it's better to document calling conventions at the top of
> the function, rather than having to read the implementation of the
> function to determine how it is supposed to be called. i.e. we
> expect two calls to this function per CIL checkpoint - the first for
> the start record ordering, the second for the commit record
> ordering...

For calling conventions, I totally agree.  It's a lot easier to figure
out a function's preconditions if they're listed in the function comment
and not buried in the body.

--D

> 
> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 4/5] xfs: attached iclog callbacks in xlog_cil_set_ctx_write_state()
  2021-07-02  9:17   ` Christoph Hellwig
@ 2021-07-05  5:49     ` Dave Chinner
  2021-07-13  0:52       ` Darrick J. Wong
  0 siblings, 1 reply; 15+ messages in thread
From: Dave Chinner @ 2021-07-05  5:49 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: linux-xfs

On Fri, Jul 02, 2021 at 10:17:57AM +0100, Christoph Hellwig wrote:
> On Wed, Jun 30, 2021 at 05:21:07PM +1000, Dave Chinner wrote:
> > From: Dave Chinner <dchinner@redhat.com>
> > 
> > Now that we have a mechanism to guarantee that the callbacks
> > attached to an iclog are owned by the context that attaches them
> > until they drop their reference to the iclog via
> > xlog_state_release_iclog(), we can attach callbacks to the iclog at
> > any time we have an active reference to the iclog.
> > 
> > xlog_state_get_iclog_space() always guarantees that the commit
> > record will fit in the iclog it returns, so we can move this IO
> > callback setting to xlog_cil_set_ctx_write_state(), record the
> > commit iclog in the context and remove the need for the commit iclog
> > to be returned by xlog_write() altogether.
> > 
> > This, in turn, allows us to move the wakeup for ordered commit
> > recrod writes up into xlog_cil_set_ctx_write_state(), too, because
> 
> s/recrod/record/
> 
> > --- a/fs/xfs/xfs_log_cil.c
> > +++ b/fs/xfs/xfs_log_cil.c
> > @@ -646,11 +646,41 @@ xlog_cil_set_ctx_write_state(
> >  	xfs_lsn_t		lsn = be64_to_cpu(iclog->ic_header.h_lsn);
> >  
> >  	ASSERT(!ctx->commit_lsn);
> > +	if (!ctx->start_lsn) {
> > +		spin_lock(&cil->xc_push_lock);
> >  		ctx->start_lsn = lsn;
> > +		spin_unlock(&cil->xc_push_lock);
> > +		return;
> 
> What does xc_push_lock protect here?  None of the read of
> ->start_lsn are under xc_push_lock, and this patch moves one of the
> two readers to be under l_icloglock.

For this patch - nothing. It just maintains the consistency
introduced in the previous patch of doing the CIL context updates
under the xc_push_lock. I did that in the previous patch for
simplicity: the next patch adds the start record ordering which,
like the commit record ordering, needs to set ctx->start_lsn and run
the waiter wakeup under the xc_push_lock.

> Also I wonder if the comment about what is done if start_lsn is not
> set would be better right above the if instead of on top of the function
> so that it stays closer to the code it documents.

I think it's better to document calling conventions at the top of
the function, rather than having to read the implementation of the
function to determine how it is supposed to be called. i.e. we
expect two calls to this function per CIL checkpoint - the first for
the start record ordering, the second for the commit record
ordering...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 4/5] xfs: attached iclog callbacks in xlog_cil_set_ctx_write_state()
  2021-06-30  7:21 ` [PATCH 4/5] xfs: attached iclog callbacks in xlog_cil_set_ctx_write_state() Dave Chinner
@ 2021-07-02  9:17   ` Christoph Hellwig
  2021-07-05  5:49     ` Dave Chinner
  0 siblings, 1 reply; 15+ messages in thread
From: Christoph Hellwig @ 2021-07-02  9:17 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Wed, Jun 30, 2021 at 05:21:07PM +1000, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> Now that we have a mechanism to guarantee that the callbacks
> attached to an iclog are owned by the context that attaches them
> until they drop their reference to the iclog via
> xlog_state_release_iclog(), we can attach callbacks to the iclog at
> any time we have an active reference to the iclog.
> 
> xlog_state_get_iclog_space() always guarantees that the commit
> record will fit in the iclog it returns, so we can move this IO
> callback setting to xlog_cil_set_ctx_write_state(), record the
> commit iclog in the context and remove the need for the commit iclog
> to be returned by xlog_write() altogether.
> 
> This, in turn, allows us to move the wakeup for ordered commit
> recrod writes up into xlog_cil_set_ctx_write_state(), too, because

s/recrod/record/

> --- a/fs/xfs/xfs_log_cil.c
> +++ b/fs/xfs/xfs_log_cil.c
> @@ -646,11 +646,41 @@ xlog_cil_set_ctx_write_state(
>  	xfs_lsn_t		lsn = be64_to_cpu(iclog->ic_header.h_lsn);
>  
>  	ASSERT(!ctx->commit_lsn);
> +	if (!ctx->start_lsn) {
> +		spin_lock(&cil->xc_push_lock);
>  		ctx->start_lsn = lsn;
> +		spin_unlock(&cil->xc_push_lock);
> +		return;

What does xc_push_lock protect here?  None of the read of
->start_lsn are under xc_push_lock, and this patch moves one of the
two readers to be under l_icloglock.

Also I wonder if the comment about what is done if start_lsn is not
set would be better right above the if instead of on top of the function
so that it stays closer to the code it documents.

^ permalink raw reply	[flat|nested] 15+ messages in thread

* [PATCH 4/5] xfs: attached iclog callbacks in xlog_cil_set_ctx_write_state()
  2021-06-30  7:21 [PATCH 0/5] " Dave Chinner
@ 2021-06-30  7:21 ` Dave Chinner
  2021-07-02  9:17   ` Christoph Hellwig
  0 siblings, 1 reply; 15+ messages in thread
From: Dave Chinner @ 2021-06-30  7:21 UTC (permalink / raw)
  To: linux-xfs

From: Dave Chinner <dchinner@redhat.com>

Now that we have a mechanism to guarantee that the callbacks
attached to an iclog are owned by the context that attaches them
until they drop their reference to the iclog via
xlog_state_release_iclog(), we can attach callbacks to the iclog at
any time we have an active reference to the iclog.

xlog_state_get_iclog_space() always guarantees that the commit
record will fit in the iclog it returns, so we can move this IO
callback setting to xlog_cil_set_ctx_write_state(), record the
commit iclog in the context and remove the need for the commit iclog
to be returned by xlog_write() altogether.

This, in turn, allows us to move the wakeup for ordered commit
recrod writes up into xlog_cil_set_ctx_write_state(), too, because
we have been guaranteed that this commit record will be physically
located in the iclog before any waiting commit record at a higher
sequence number will be granted iclog space.

This further cleans up the post commit record write processing in
the CIL push code, especially as xlog_state_release_iclog() will now
clean up the context when shutdown errors occur.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/xfs_log.c      | 47 ++++++++--------------
 fs/xfs/xfs_log_cil.c  | 94 ++++++++++++++++++++++++-------------------
 fs/xfs/xfs_log_priv.h |  3 +-
 3 files changed, 70 insertions(+), 74 deletions(-)

diff --git a/fs/xfs/xfs_log.c b/fs/xfs/xfs_log.c
index fd0731cf1cb3..d3e1e1a07cdc 100644
--- a/fs/xfs/xfs_log.c
+++ b/fs/xfs/xfs_log.c
@@ -898,7 +898,7 @@ xlog_write_unmount_record(
 	 */
 	if (log->l_targ != log->l_mp->m_ddev_targp)
 		blkdev_issue_flush(log->l_targ->bt_bdev);
-	return xlog_write(log, NULL, &vec, ticket, NULL, XLOG_UNMOUNT_TRANS);
+	return xlog_write(log, NULL, &vec, ticket, XLOG_UNMOUNT_TRANS);
 }
 
 /*
@@ -2301,8 +2301,7 @@ xlog_write_copy_finish(
 	int			*data_cnt,
 	int			*partial_copy,
 	int			*partial_copy_len,
-	int			log_offset,
-	struct xlog_in_core	**commit_iclog)
+	int			log_offset)
 {
 	int			error;
 
@@ -2321,27 +2320,20 @@ xlog_write_copy_finish(
 	*partial_copy = 0;
 	*partial_copy_len = 0;
 
-	if (iclog->ic_size - log_offset <= sizeof(xlog_op_header_t)) {
-		/* no more space in this iclog - push it. */
-		spin_lock(&log->l_icloglock);
-		xlog_state_finish_copy(log, iclog, *record_cnt, *data_cnt);
-		*record_cnt = 0;
-		*data_cnt = 0;
-
-		if (iclog->ic_state == XLOG_STATE_ACTIVE)
-			xlog_state_switch_iclogs(log, iclog, 0);
-		else
-			ASSERT(iclog->ic_state == XLOG_STATE_WANT_SYNC ||
-				xlog_is_shutdown(log));
-		if (!commit_iclog)
-			goto release_iclog;
-		spin_unlock(&log->l_icloglock);
-		ASSERT(flags & XLOG_COMMIT_TRANS);
-		*commit_iclog = iclog;
-	}
+	if (iclog->ic_size - log_offset > sizeof(xlog_op_header_t))
+		return 0;
 
-	return 0;
+	/* no more space in this iclog - push it. */
+	spin_lock(&log->l_icloglock);
+	xlog_state_finish_copy(log, iclog, *record_cnt, *data_cnt);
+	*record_cnt = 0;
+	*data_cnt = 0;
 
+	if (iclog->ic_state == XLOG_STATE_ACTIVE)
+		xlog_state_switch_iclogs(log, iclog, 0);
+	else
+		ASSERT(iclog->ic_state == XLOG_STATE_WANT_SYNC ||
+			xlog_is_shutdown(log));
 release_iclog:
 	error = xlog_state_release_iclog(log, iclog);
 	spin_unlock(&log->l_icloglock);
@@ -2394,7 +2386,6 @@ xlog_write(
 	struct xfs_cil_ctx	*ctx,
 	struct xfs_log_vec	*log_vector,
 	struct xlog_ticket	*ticket,
-	struct xlog_in_core	**commit_iclog,
 	uint			optype)
 {
 	struct xlog_in_core	*iclog = NULL;
@@ -2523,8 +2514,7 @@ xlog_write(
 						       &record_cnt, &data_cnt,
 						       &partial_copy,
 						       &partial_copy_len,
-						       log_offset,
-						       commit_iclog);
+						       log_offset);
 			if (error)
 				return error;
 
@@ -2562,12 +2552,7 @@ xlog_write(
 
 	spin_lock(&log->l_icloglock);
 	xlog_state_finish_copy(log, iclog, record_cnt, data_cnt);
-	if (commit_iclog) {
-		ASSERT(optype & XLOG_COMMIT_TRANS);
-		*commit_iclog = iclog;
-	} else {
-		error = xlog_state_release_iclog(log, iclog);
-	}
+	error = xlog_state_release_iclog(log, iclog);
 	spin_unlock(&log->l_icloglock);
 
 	return error;
diff --git a/fs/xfs/xfs_log_cil.c b/fs/xfs/xfs_log_cil.c
index 8f88ee2abdf6..30810b896e46 100644
--- a/fs/xfs/xfs_log_cil.c
+++ b/fs/xfs/xfs_log_cil.c
@@ -646,11 +646,41 @@ xlog_cil_set_ctx_write_state(
 	xfs_lsn_t		lsn = be64_to_cpu(iclog->ic_header.h_lsn);
 
 	ASSERT(!ctx->commit_lsn);
-	spin_lock(&cil->xc_push_lock);
-	if (!ctx->start_lsn)
+	if (!ctx->start_lsn) {
+		spin_lock(&cil->xc_push_lock);
 		ctx->start_lsn = lsn;
-	else
-		ctx->commit_lsn = lsn;
+		spin_unlock(&cil->xc_push_lock);
+		return;
+	}
+
+	/*
+	 * Take a reference to the iclog for the context so that we still hold
+	 * it when xlog_write is done and has released it. This means the
+	 * context controls when the iclog is released for IO.
+	 */
+	atomic_inc(&iclog->ic_refcnt);
+
+	/*
+	 * xlog_state_get_iclog_space() guarantees there is enough space in the
+	 * iclog for an entire commit record, so we can attach the context
+	 * callbacks now.  This needs to be done before we make the commit_lsn
+	 * visible to waiters so that checkpoints with commit records in the
+	 * same iclog order their IO completion callbacks in the same order that
+	 * the commit records appear in the iclog.
+	 */
+	spin_lock(&cil->xc_log->l_icloglock);
+	list_add_tail(&ctx->iclog_entry, &iclog->ic_callbacks);
+	spin_unlock(&cil->xc_log->l_icloglock);
+
+	/*
+	 * Now we can record the commit LSN and wake anyone waiting for this
+	 * sequence to have the ordered commit record assigned to a physical
+	 * location in the log.
+	 */
+	spin_lock(&cil->xc_push_lock);
+	ctx->commit_iclog = iclog;
+	ctx->commit_lsn = lsn;
+	wake_up_all(&cil->xc_commit_wait);
 	spin_unlock(&cil->xc_push_lock);
 }
 
@@ -707,8 +737,7 @@ xlog_cil_order_write(
  */
 static int
 xlog_cil_write_commit_record(
-	struct xfs_cil_ctx	*ctx,
-	struct xlog_in_core	**iclog)
+	struct xfs_cil_ctx	*ctx)
 {
 	struct xlog		*log = ctx->cil->xc_log;
 	struct xfs_log_iovec	reg = {
@@ -725,8 +754,7 @@ xlog_cil_write_commit_record(
 	if (xlog_is_shutdown(log))
 		return -EIO;
 
-	error = xlog_write(log, ctx, &vec, ctx->ticket, iclog,
-			XLOG_COMMIT_TRANS);
+	error = xlog_write(log, ctx, &vec, ctx->ticket, XLOG_COMMIT_TRANS);
 	if (error)
 		xfs_force_shutdown(log->l_mp, SHUTDOWN_LOG_IO_ERROR);
 	return error;
@@ -756,7 +784,6 @@ xlog_cil_push_work(
 	struct xfs_log_vec	*lv;
 	struct xfs_cil_ctx	*ctx;
 	struct xfs_cil_ctx	*new_ctx;
-	struct xlog_in_core	*commit_iclog;
 	struct xlog_ticket	*tic;
 	int			num_iovecs;
 	int			error = 0;
@@ -936,7 +963,7 @@ xlog_cil_push_work(
 	 */
 	wait_for_completion(&bdev_flush);
 
-	error = xlog_write(log, ctx, &lvhdr, tic, NULL, XLOG_START_TRANS);
+	error = xlog_write(log, ctx, &lvhdr, tic, XLOG_START_TRANS);
 	if (error)
 		goto out_abort_free_ticket;
 
@@ -944,36 +971,12 @@ xlog_cil_push_work(
 	if (error)
 		goto out_abort_free_ticket;
 
-	error = xlog_cil_write_commit_record(ctx, &commit_iclog);
+	error = xlog_cil_write_commit_record(ctx);
 	if (error)
 		goto out_abort_free_ticket;
 
 	xfs_log_ticket_ungrant(log, tic);
 
-	/*
-	 * Once we attach the ctx to the iclog, it is effectively owned by the
-	 * iclog and we can only use it while we still have an active reference
-	 * to the iclog. i.e. once we call xlog_state_release_iclog() we can no
-	 * longer safely reference the ctx.
-	 */
-	spin_lock(&log->l_icloglock);
-	if (xlog_is_shutdown(log)) {
-		spin_unlock(&log->l_icloglock);
-		goto out_abort;
-	}
-	ASSERT_ALWAYS(commit_iclog->ic_state == XLOG_STATE_ACTIVE ||
-		      commit_iclog->ic_state == XLOG_STATE_WANT_SYNC);
-	list_add_tail(&ctx->iclog_entry, &commit_iclog->ic_callbacks);
-
-	/*
-	 * now the checkpoint commit is complete and we've attached the
-	 * callbacks to the iclog we can assign the commit LSN to the context
-	 * and wake up anyone who is waiting for the commit to complete.
-	 */
-	spin_lock(&cil->xc_push_lock);
-	wake_up_all(&cil->xc_commit_wait);
-	spin_unlock(&cil->xc_push_lock);
-
 	/*
 	 * If the checkpoint spans multiple iclogs, wait for all previous iclogs
 	 * to complete before we submit the commit_iclog. We can't use state
@@ -986,17 +989,18 @@ xlog_cil_push_work(
 	 * iclog header lsn and compare it to the commit lsn to determine if we
 	 * need to wait on iclogs or not.
 	 */
+	spin_lock(&log->l_icloglock);
 	if (ctx->start_lsn != ctx->commit_lsn) {
 		xfs_lsn_t	plsn;
 
-		plsn = be64_to_cpu(commit_iclog->ic_prev->ic_header.h_lsn);
+		plsn = be64_to_cpu(ctx->commit_iclog->ic_prev->ic_header.h_lsn);
 		if (plsn && XFS_LSN_CMP(plsn, ctx->commit_lsn) < 0) {
 			/*
 			 * Waiting on ic_force_wait orders the completion of
 			 * iclogs older than ic_prev. Hence we only need to wait
 			 * on the most recent older iclog here.
 			 */
-			xlog_wait_on_iclog(commit_iclog->ic_prev);
+			xlog_wait_on_iclog(ctx->commit_iclog->ic_prev);
 			spin_lock(&log->l_icloglock);
 		}
 
@@ -1004,7 +1008,7 @@ xlog_cil_push_work(
 		 * We need to issue a pre-flush so that the ordering for this
 		 * checkpoint is correctly preserved down to stable storage.
 		 */
-		commit_iclog->ic_flags |= XLOG_ICL_NEED_FLUSH;
+		ctx->commit_iclog->ic_flags |= XLOG_ICL_NEED_FLUSH;
 	}
 
 	/*
@@ -1012,8 +1016,8 @@ xlog_cil_push_work(
 	 * journal IO vs metadata writeback IO is correctly ordered on stable
 	 * storage.
 	 */
-	commit_iclog->ic_flags |= XLOG_ICL_NEED_FUA;
-	xlog_state_release_iclog(log, commit_iclog);
+	ctx->commit_iclog->ic_flags |= XLOG_ICL_NEED_FUA;
+	xlog_state_release_iclog(log, ctx->commit_iclog);
 
 	/* Not safe to reference ctx now! */
 
@@ -1028,9 +1032,15 @@ xlog_cil_push_work(
 
 out_abort_free_ticket:
 	xfs_log_ticket_ungrant(log, tic);
-out_abort:
 	ASSERT(xlog_is_shutdown(log));
-	xlog_cil_committed(ctx);
+	if (!ctx->commit_iclog) {
+		xlog_cil_committed(ctx);
+		return;
+	}
+	spin_lock(&log->l_icloglock);
+	xlog_state_release_iclog(log, ctx->commit_iclog);
+	/* Not safe to reference ctx now! */
+	spin_unlock(&log->l_icloglock);
 }
 
 /*
diff --git a/fs/xfs/xfs_log_priv.h b/fs/xfs/xfs_log_priv.h
index 2a02ce05b649..f74e3968bb84 100644
--- a/fs/xfs/xfs_log_priv.h
+++ b/fs/xfs/xfs_log_priv.h
@@ -232,6 +232,7 @@ struct xfs_cil_ctx {
 	xfs_csn_t		sequence;	/* chkpt sequence # */
 	xfs_lsn_t		start_lsn;	/* first LSN of chkpt commit */
 	xfs_lsn_t		commit_lsn;	/* chkpt commit record lsn */
+	struct xlog_in_core	*commit_iclog;
 	struct xlog_ticket	*ticket;	/* chkpt ticket */
 	int			nvecs;		/* number of regions */
 	int			space_used;	/* aggregate size of regions */
@@ -503,7 +504,7 @@ void	xlog_print_tic_res(struct xfs_mount *mp, struct xlog_ticket *ticket);
 void	xlog_print_trans(struct xfs_trans *);
 int	xlog_write(struct xlog *log, struct xfs_cil_ctx *ctx,
 		struct xfs_log_vec *log_vector, struct xlog_ticket *tic,
-		struct xlog_in_core **commit_iclog, uint optype);
+		uint optype);
 void	xfs_log_ticket_ungrant(struct xlog *log, struct xlog_ticket *ticket);
 void	xfs_log_ticket_regrant(struct xlog *log, struct xlog_ticket *ticket);
 
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 15+ messages in thread

end of thread, other threads:[~2021-08-09 18:39 UTC | newest]

Thread overview: 15+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-07-14  3:36 [PATCH 0/5 v2] xfs: strictly order log start records Dave Chinner
2021-07-14  3:36 ` [PATCH 1/5] xfs: move xlog_commit_record to xfs_log_cil.c Dave Chinner
2021-07-14  3:36 ` [PATCH 2/5] xfs: pass a CIL context to xlog_write() Dave Chinner
2021-07-14  3:36 ` [PATCH 3/5] xfs: factor out log write ordering from xlog_cil_push_work() Dave Chinner
2021-07-14  3:36 ` [PATCH 4/5] xfs: attached iclog callbacks in xlog_cil_set_ctx_write_state() Dave Chinner
2021-07-14  6:21   ` Christoph Hellwig
2021-07-14 22:36   ` Darrick J. Wong
2021-07-14  3:36 ` [PATCH 5/5] xfs: order CIL checkpoint start records Dave Chinner
2021-07-14  6:34   ` Christoph Hellwig
2021-07-14 22:39   ` Darrick J. Wong
2021-08-09 18:39 ` [PATCH 0/5 v2] xfs: strictly order log " Darrick J. Wong
  -- strict thread matches above, loose matches on Subject: below --
2021-06-30  7:21 [PATCH 0/5] " Dave Chinner
2021-06-30  7:21 ` [PATCH 4/5] xfs: attached iclog callbacks in xlog_cil_set_ctx_write_state() Dave Chinner
2021-07-02  9:17   ` Christoph Hellwig
2021-07-05  5:49     ` Dave Chinner
2021-07-13  0:52       ` Darrick J. Wong

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).