linux-xfs.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH 0/7] xfs: log race fixes and cleanups
@ 2019-09-04  4:24 Dave Chinner
  2019-09-04  4:24 ` [PATCH 1/7] xfs: push the AIL in xlog_grant_head_wake Dave Chinner
                   ` (7 more replies)
  0 siblings, 8 replies; 41+ messages in thread
From: Dave Chinner @ 2019-09-04  4:24 UTC (permalink / raw)
  To: linux-xfs

HI folks,

This series of patches contain fixes for the generic/530 hangs that
Chandan and Jan reported recently. The first patch in the series is
the main fix for this, though in theory the last patch in the series
is necessary to totally close the problem off.

The second patch is an outstanding fix for a log flush wakeup race
in the xlog_state_do_callback() code from Rik van Riel. Given that
the next 5 patches in the series clean up iand address the tail
pushing race in xlog_state_do_callback(), it makes sense to get this
into the kernel at the same time.

Patches 3-6 factor several independent chunks out of
xlog_state_do_callback, greatly simplifying that function and
breaking the iclog IO completion processing down into discrete
chunks of separate functionality. This makes the code much easier
to read, follow and understand.

An the last patch kicks the AIL if necessary when the log tail is
moved to avod the situation where the AIL pushing has stalled
because the previous last_sync_lsn is further into the log than the
push target reaches.

Comments welcome.

-Dave.


^ permalink raw reply	[flat|nested] 41+ messages in thread

* [PATCH 1/7] xfs: push the AIL in xlog_grant_head_wake
  2019-09-04  4:24 [PATCH 0/7] xfs: log race fixes and cleanups Dave Chinner
@ 2019-09-04  4:24 ` Dave Chinner
  2019-09-04  6:07   ` Christoph Hellwig
  2019-09-04  4:24 ` [PATCH 2/7] xfs: fix missed wakeup on l_flush_wait Dave Chinner
                   ` (6 subsequent siblings)
  7 siblings, 1 reply; 41+ messages in thread
From: Dave Chinner @ 2019-09-04  4:24 UTC (permalink / raw)
  To: linux-xfs

From: Dave Chinner <dchinner@redhat.com>

In the situation where the log is full and the CIL has not recently
flushed, the AIL push threshold is throttled back to the where the
last write of the head of the log was completed. This is stored in
log->l_last_sync_lsn. Hence if the CIL holds > 25% of the log space
pinned by flushes and/or aggregation in progress, we can get the
situation where the head of the log lags a long way behind the
reservation grant head.

When this happens, the AIL push target is trimmed back from where
the reservation grant head wants to push the log tail to, back to
where the head of the log currently is. This means the push target
doesn't reach far enough into the log to actually move the tail
before the transaction reservation goes to sleep.

When the CIL push completes, it moves the log head forward such that
the AIL push target can now be moved, but that has no mechanism for
puhsing the log tail. Further, if the next tail movement of the log
is not large enough wake the waiter (i.e. still not enough space for
it to have a reservation granted), we don't wake anything up, and
hence we do not update the AIL push target to take into account the
head of the log moving and allowing the push target to be moved
forwards.

To avoid this particular condition, if we fail to wake the first
waiter on the grant head because we don't have enough space,
push on the AIL again. This will pick up any movement of the log
head and allow the push target to move forward due to completion of
CIL pushing.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/xfs_log.c | 7 ++++++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/fs/xfs/xfs_log.c b/fs/xfs/xfs_log.c
index b159a9e9fef0..941f10ff99d9 100644
--- a/fs/xfs/xfs_log.c
+++ b/fs/xfs/xfs_log.c
@@ -214,15 +214,20 @@ xlog_grant_head_wake(
 {
 	struct xlog_ticket	*tic;
 	int			need_bytes;
+	bool			woken_task = false;
 
 	list_for_each_entry(tic, &head->waiters, t_queue) {
 		need_bytes = xlog_ticket_reservation(log, head, tic);
-		if (*free_bytes < need_bytes)
+		if (*free_bytes < need_bytes) {
+			if (!woken_task)
+				xlog_grant_push_ail(log, need_bytes);
 			return false;
+		}
 
 		*free_bytes -= need_bytes;
 		trace_xfs_log_grant_wake_up(log, tic);
 		wake_up_process(tic->t_task);
+		woken_task = true;
 	}
 
 	return true;
-- 
2.23.0.rc1


^ permalink raw reply related	[flat|nested] 41+ messages in thread

* [PATCH 2/7] xfs: fix missed wakeup on l_flush_wait
  2019-09-04  4:24 [PATCH 0/7] xfs: log race fixes and cleanups Dave Chinner
  2019-09-04  4:24 ` [PATCH 1/7] xfs: push the AIL in xlog_grant_head_wake Dave Chinner
@ 2019-09-04  4:24 ` Dave Chinner
  2019-09-04  6:07   ` Christoph Hellwig
  2019-09-04  4:24 ` [PATCH 3/7] xfs: factor debug code out of xlog_state_do_callback() Dave Chinner
                   ` (5 subsequent siblings)
  7 siblings, 1 reply; 41+ messages in thread
From: Dave Chinner @ 2019-09-04  4:24 UTC (permalink / raw)
  To: linux-xfs

From: Rik van Riel <riel@surriel.com>

The code in xlog_wait uses the spinlock to make adding the task to
the wait queue, and setting the task state to UNINTERRUPTIBLE atomic
with respect to the waker.

Doing the wakeup after releasing the spinlock opens up the following
race condition:

Task 1					task 2
add task to wait queue
					wake up task
set task state to UNINTERRUPTIBLE

This issue was found through code inspection as a result of kworkers
being observed stuck in UNINTERRUPTIBLE state with an empty
wait queue. It is rare and largely unreproducable.

Simply moving the spin_unlock to after the wake_up_all results
in the waker not being able to see a task on the waitqueue before
it has set its state to UNINTERRUPTIBLE.

This bug dates back to the conversion of this code to generic
waitqueue infrastructure from a counting semaphore back in 2008
which didn't place the wakeups consistently w.r.t. to the relevant
spin locks.

[dchinner: Also fix a similar issue in the shutdown path on
xc_commit_wait. Update commit log with more details of the issue.]

Fixes: d748c62367eb ("[XFS] Convert l_flushsema to a sv_t")
Reported-by: Chris Mason <clm@fb.com>
Signed-off-by: Rik van Riel <riel@surriel.com>
Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/xfs_log.c | 9 ++++-----
 1 file changed, 4 insertions(+), 5 deletions(-)

diff --git a/fs/xfs/xfs_log.c b/fs/xfs/xfs_log.c
index 941f10ff99d9..6f494f6369e8 100644
--- a/fs/xfs/xfs_log.c
+++ b/fs/xfs/xfs_log.c
@@ -2628,7 +2628,6 @@ xlog_state_do_callback(
 	int		   funcdidcallbacks; /* flag: function did callbacks */
 	int		   repeats;	/* for issuing console warnings if
 					 * looping too many times */
-	int		   wake = 0;
 
 	spin_lock(&log->l_icloglock);
 	first_iclog = iclog = log->l_iclog;
@@ -2824,11 +2823,9 @@ xlog_state_do_callback(
 #endif
 
 	if (log->l_iclog->ic_state & (XLOG_STATE_ACTIVE|XLOG_STATE_IOERROR))
-		wake = 1;
-	spin_unlock(&log->l_icloglock);
-
-	if (wake)
 		wake_up_all(&log->l_flush_wait);
+
+	spin_unlock(&log->l_icloglock);
 }
 
 
@@ -3928,7 +3925,9 @@ xfs_log_force_umount(
 	 * item committed callback functions will do this again under lock to
 	 * avoid races.
 	 */
+	spin_lock(&log->l_cilp->xc_push_lock);
 	wake_up_all(&log->l_cilp->xc_commit_wait);
+	spin_unlock(&log->l_cilp->xc_push_lock);
 	xlog_state_do_callback(log, true, NULL);
 
 #ifdef XFSERRORDEBUG
-- 
2.23.0.rc1


^ permalink raw reply related	[flat|nested] 41+ messages in thread

* [PATCH 3/7] xfs: factor debug code out of xlog_state_do_callback()
  2019-09-04  4:24 [PATCH 0/7] xfs: log race fixes and cleanups Dave Chinner
  2019-09-04  4:24 ` [PATCH 1/7] xfs: push the AIL in xlog_grant_head_wake Dave Chinner
  2019-09-04  4:24 ` [PATCH 2/7] xfs: fix missed wakeup on l_flush_wait Dave Chinner
@ 2019-09-04  4:24 ` Dave Chinner
  2019-09-04  6:10   ` Christoph Hellwig
  2019-09-04  4:24 ` [PATCH 4/7] xfs: factor callbacks " Dave Chinner
                   ` (4 subsequent siblings)
  7 siblings, 1 reply; 41+ messages in thread
From: Dave Chinner @ 2019-09-04  4:24 UTC (permalink / raw)
  To: linux-xfs

From: Dave Chinner <dchinner@redhat.com>

Start making this function readable by lifting the debug code into
a conditional function.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/xfs_log.c | 83 ++++++++++++++++++++++++++++--------------------
 1 file changed, 48 insertions(+), 35 deletions(-)

diff --git a/fs/xfs/xfs_log.c b/fs/xfs/xfs_log.c
index 6f494f6369e8..b3d8dbb3b956 100644
--- a/fs/xfs/xfs_log.c
+++ b/fs/xfs/xfs_log.c
@@ -2612,6 +2612,53 @@ xlog_get_lowest_lsn(
 	return lowest_lsn;
 }
 
+#ifdef DEBUG
+/*
+ * Make one last gasp attempt to see if iclogs are being left in limbo.  If the
+ * above loop finds an iclog earlier than the current iclog and in one of the
+ * syncing states, the current iclog is put into DO_CALLBACK and the callbacks
+ * are deferred to the completion of the earlier iclog. Walk the iclogs in order
+ * and make sure that no iclog is in DO_CALLBACK unless an earlier iclog is in
+ * one of the syncing states.
+ *
+ * Note that SYNCING|IOABORT is a valid state so we cannot just check for
+ * ic_state == SYNCING.
+ */
+static void
+xlog_state_callback_check_state(
+	struct xlog		*log,
+	int			did_callbacks)
+{
+	struct xlog_in_core	*iclog;
+	struct xlog_in_core	*first_iclog;
+
+	if (!did_callbacks)
+		return;
+
+	first_iclog = iclog = log->l_iclog;
+	do {
+		ASSERT(iclog->ic_state != XLOG_STATE_DO_CALLBACK);
+		/*
+		 * Terminate the loop if iclogs are found in states
+		 * which will cause other threads to clean up iclogs.
+		 *
+		 * SYNCING - i/o completion will go through logs
+		 * DONE_SYNC - interrupt thread should be waiting for
+		 *              l_icloglock
+		 * IOERROR - give up hope all ye who enter here
+		 */
+		if (iclog->ic_state == XLOG_STATE_WANT_SYNC ||
+		    iclog->ic_state & XLOG_STATE_SYNCING ||
+		    iclog->ic_state == XLOG_STATE_DONE_SYNC ||
+		    iclog->ic_state == XLOG_STATE_IOERROR )
+			break;
+		iclog = iclog->ic_next;
+	} while (first_iclog != iclog);
+}
+#else
+#define xlog_state_callback_check_state(l,d)	((void)0)
+#endif
+
 STATIC void
 xlog_state_do_callback(
 	struct xlog		*log,
@@ -2786,41 +2833,7 @@ xlog_state_do_callback(
 		}
 	} while (!ioerrors && loopdidcallbacks);
 
-#ifdef DEBUG
-	/*
-	 * Make one last gasp attempt to see if iclogs are being left in limbo.
-	 * If the above loop finds an iclog earlier than the current iclog and
-	 * in one of the syncing states, the current iclog is put into
-	 * DO_CALLBACK and the callbacks are deferred to the completion of the
-	 * earlier iclog. Walk the iclogs in order and make sure that no iclog
-	 * is in DO_CALLBACK unless an earlier iclog is in one of the syncing
-	 * states.
-	 *
-	 * Note that SYNCING|IOABORT is a valid state so we cannot just check
-	 * for ic_state == SYNCING.
-	 */
-	if (funcdidcallbacks) {
-		first_iclog = iclog = log->l_iclog;
-		do {
-			ASSERT(iclog->ic_state != XLOG_STATE_DO_CALLBACK);
-			/*
-			 * Terminate the loop if iclogs are found in states
-			 * which will cause other threads to clean up iclogs.
-			 *
-			 * SYNCING - i/o completion will go through logs
-			 * DONE_SYNC - interrupt thread should be waiting for
-			 *              l_icloglock
-			 * IOERROR - give up hope all ye who enter here
-			 */
-			if (iclog->ic_state == XLOG_STATE_WANT_SYNC ||
-			    iclog->ic_state & XLOG_STATE_SYNCING ||
-			    iclog->ic_state == XLOG_STATE_DONE_SYNC ||
-			    iclog->ic_state == XLOG_STATE_IOERROR )
-				break;
-			iclog = iclog->ic_next;
-		} while (first_iclog != iclog);
-	}
-#endif
+	xlog_state_callback_check_state(log, funcdidcallbacks);
 
 	if (log->l_iclog->ic_state & (XLOG_STATE_ACTIVE|XLOG_STATE_IOERROR))
 		wake_up_all(&log->l_flush_wait);
-- 
2.23.0.rc1


^ permalink raw reply related	[flat|nested] 41+ messages in thread

* [PATCH 4/7] xfs: factor callbacks out of xlog_state_do_callback()
  2019-09-04  4:24 [PATCH 0/7] xfs: log race fixes and cleanups Dave Chinner
                   ` (2 preceding siblings ...)
  2019-09-04  4:24 ` [PATCH 3/7] xfs: factor debug code out of xlog_state_do_callback() Dave Chinner
@ 2019-09-04  4:24 ` Dave Chinner
  2019-09-04  6:13   ` Christoph Hellwig
  2019-09-04  6:32   ` Christoph Hellwig
  2019-09-04  4:24 ` [PATCH 5/7] xfs: factor iclog state processing " Dave Chinner
                   ` (3 subsequent siblings)
  7 siblings, 2 replies; 41+ messages in thread
From: Dave Chinner @ 2019-09-04  4:24 UTC (permalink / raw)
  To: linux-xfs

From: Dave Chinner <dchinner@redhat.com>

Simplify the code flow by lifting the iclog callback work out of
the main iclog iteration loop. This isolates the log juggling and
callbacks from the iclog state change logic in the loop.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/xfs_log.c | 69 +++++++++++++++++++++++++++++++-----------------
 1 file changed, 45 insertions(+), 24 deletions(-)

diff --git a/fs/xfs/xfs_log.c b/fs/xfs/xfs_log.c
index b3d8dbb3b956..c3a43fe023ea 100644
--- a/fs/xfs/xfs_log.c
+++ b/fs/xfs/xfs_log.c
@@ -2612,6 +2612,47 @@ xlog_get_lowest_lsn(
 	return lowest_lsn;
 }
 
+/*
+ * Keep processing entries in the iclog callback list until we come around and
+ * it is empty.  We need to atomically see that the list is empty and change the
+ * state to DIRTY so that we don't miss any more callbacks being added.
+ *
+ * This function is called with the icloglock held and returns with it held. We
+ * drop it while running callbacks, however, as holding it over thousands of
+ * callbacks is unnecessary and causes excessive contention if we do.
+ */
+static bool
+xlog_state_do_iclog_callbacks(
+	struct xlog		*log,
+	struct xlog_in_core	*iclog,
+	bool			aborted)
+{
+	bool			ret = false;
+
+	spin_unlock(&log->l_icloglock);
+
+	spin_lock(&iclog->ic_callback_lock);
+	while (!list_empty(&iclog->ic_callbacks)) {
+		LIST_HEAD(tmp);
+
+		list_splice_init(&iclog->ic_callbacks, &tmp);
+
+		spin_unlock(&iclog->ic_callback_lock);
+		xlog_cil_process_committed(&tmp, aborted);
+		spin_lock(&iclog->ic_callback_lock);
+		ret = true;
+	}
+
+	/*
+	 * Pick up the icloglock while still holding the callback lock so we
+	 * serialise against anyone trying to add more callbacks to this iclog
+	 * now we've finished processing.
+	 */
+	spin_lock(&log->l_icloglock);
+	spin_unlock(&iclog->ic_callback_lock);
+	return ret;
+}
+
 #ifdef DEBUG
 /*
  * Make one last gasp attempt to see if iclogs are being left in limbo.  If the
@@ -2784,31 +2825,9 @@ xlog_state_do_callback(
 			} else
 				ioerrors++;
 
-			spin_unlock(&log->l_icloglock);
+			if (xlog_state_do_iclog_callbacks(log, iclog, aborted))
+				loopdidcallbacks++;
 
-			/*
-			 * Keep processing entries in the callback list until
-			 * we come around and it is empty.  We need to
-			 * atomically see that the list is empty and change the
-			 * state to DIRTY so that we don't miss any more
-			 * callbacks being added.
-			 */
-			spin_lock(&iclog->ic_callback_lock);
-			while (!list_empty(&iclog->ic_callbacks)) {
-				LIST_HEAD(tmp);
-
-				list_splice_init(&iclog->ic_callbacks, &tmp);
-
-				spin_unlock(&iclog->ic_callback_lock);
-				xlog_cil_process_committed(&tmp, aborted);
-				spin_lock(&iclog->ic_callback_lock);
-			}
-
-			loopdidcallbacks++;
-			funcdidcallbacks++;
-
-			spin_lock(&log->l_icloglock);
-			spin_unlock(&iclog->ic_callback_lock);
 			if (!(iclog->ic_state & XLOG_STATE_IOERROR))
 				iclog->ic_state = XLOG_STATE_DIRTY;
 
@@ -2824,6 +2843,8 @@ xlog_state_do_callback(
 			iclog = iclog->ic_next;
 		} while (first_iclog != iclog);
 
+		funcdidcallbacks += loopdidcallbacks;
+
 		if (repeats > 5000) {
 			flushcnt += repeats;
 			repeats = 0;
-- 
2.23.0.rc1


^ permalink raw reply related	[flat|nested] 41+ messages in thread

* [PATCH 5/7] xfs: factor iclog state processing out of xlog_state_do_callback()
  2019-09-04  4:24 [PATCH 0/7] xfs: log race fixes and cleanups Dave Chinner
                   ` (3 preceding siblings ...)
  2019-09-04  4:24 ` [PATCH 4/7] xfs: factor callbacks " Dave Chinner
@ 2019-09-04  4:24 ` Dave Chinner
  2019-09-04  6:42   ` Christoph Hellwig
  2019-09-04  4:24 ` [PATCH 6/7] xfs: push iclog state cleaning into xlog_state_clean_log Dave Chinner
                   ` (2 subsequent siblings)
  7 siblings, 1 reply; 41+ messages in thread
From: Dave Chinner @ 2019-09-04  4:24 UTC (permalink / raw)
  To: linux-xfs

From: Dave Chinner <dchinner@redhat.com>

The iclog IO completion state processing is somewhat complex, and
because it's inside two nested loops it is highly indented and very
hard to read. Factor it out, flatten the logic flow and clean up the
comments so that it much easier to see what the code is doing both
in processing the individual iclogs and in the over
xlog_state_do_callback() operation.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/xfs_log.c | 199 ++++++++++++++++++++++++-----------------------
 1 file changed, 103 insertions(+), 96 deletions(-)

diff --git a/fs/xfs/xfs_log.c b/fs/xfs/xfs_log.c
index c3a43fe023ea..c3efece72847 100644
--- a/fs/xfs/xfs_log.c
+++ b/fs/xfs/xfs_log.c
@@ -2612,6 +2612,87 @@ xlog_get_lowest_lsn(
 	return lowest_lsn;
 }
 
+/*
+ * Return true if we need to stop processing, false to continue to the next
+ * iclog. The caller will need to run callbacks if the iclog is returned in the
+ * XLOG_STATE_CALLBACK state.
+ */
+static bool
+xlog_state_iodone_process_iclog(
+	struct xlog		*log,
+	struct xlog_in_core	*iclog,
+	struct xlog_in_core	*completed_iclog)
+{
+	xfs_lsn_t		lowest_lsn;
+
+	/* Skip all iclogs in the ACTIVE & DIRTY states */
+	if (iclog->ic_state & (XLOG_STATE_ACTIVE|XLOG_STATE_DIRTY))
+		return false;
+
+	/*
+	 * Between marking a filesystem SHUTDOWN and stopping the log, we do
+	 * flush all iclogs to disk (if there wasn't a log I/O error). So, we do
+	 * want things to go smoothly in case of just a SHUTDOWN  w/o a
+	 * LOG_IO_ERROR.
+	 */
+	if (iclog->ic_state & XLOG_STATE_IOERROR)
+		return false;
+
+	/*
+	 * Can only perform callbacks in order.  Since this iclog is not in the
+	 * DONE_SYNC/ DO_CALLBACK state, we skip the rest and just try to clean
+	 * up.  If we set our iclog to DO_CALLBACK, we will not process it when
+	 * we retry since a previous iclog is in the CALLBACK and the state
+	 * cannot change since we are holding the l_icloglock.
+	 */
+	if (!(iclog->ic_state &
+			(XLOG_STATE_DONE_SYNC | XLOG_STATE_DO_CALLBACK))) {
+		if (completed_iclog &&
+		    (completed_iclog->ic_state == XLOG_STATE_DONE_SYNC)) {
+			completed_iclog->ic_state = XLOG_STATE_DO_CALLBACK;
+		}
+		return true;
+	}
+
+	/*
+	 * We now have an iclog that is in either the DO_CALLBACK or DONE_SYNC
+	 * states. The other states (WANT_SYNC, SYNCING, or CALLBACK were caught
+	 * by the above if and are going to clean (i.e. we aren't doing their
+	 * callbacks) see the above if.
+	 *
+	 * We will do one more check here to see if we have chased our tail
+	 * around.
+	 */
+	lowest_lsn = xlog_get_lowest_lsn(log);
+	if (lowest_lsn &&
+	    XFS_LSN_CMP(lowest_lsn, be64_to_cpu(iclog->ic_header.h_lsn)) < 0)
+		return false; /* Leave this iclog for another thread */
+
+	iclog->ic_state = XLOG_STATE_CALLBACK;
+
+	/*
+	 * Completion of a iclog IO does not imply that a transaction has
+	 * completed, as transactions can be large enough to span many iclogs.
+	 * We cannot change the tail of the log half way through a transaction
+	 * as this may be the only transaction in the log and moving th etail to
+	 * point to the middle of it will prevent recovery from finding the
+	 * start of the transaction.  Hence we should only update the
+	 * last_sync_lsn if this iclog contains transaction completion callbacks
+	 * on it.
+	 *
+	 * We have to do this before we drop the icloglock to ensure we are the
+	 * only one that can update it.
+	 */
+	ASSERT(XFS_LSN_CMP(atomic64_read(&log->l_last_sync_lsn),
+			be64_to_cpu(iclog->ic_header.h_lsn)) <= 0);
+	if (!list_empty_careful(&iclog->ic_callbacks))
+		atomic64_set(&log->l_last_sync_lsn,
+			be64_to_cpu(iclog->ic_header.h_lsn));
+
+	return false;
+
+}
+
 /*
  * Keep processing entries in the iclog callback list until we come around and
  * it is empty.  We need to atomically see that the list is empty and change the
@@ -2706,22 +2787,16 @@ xlog_state_do_callback(
 	bool			aborted,
 	struct xlog_in_core	*ciclog)
 {
-	xlog_in_core_t	   *iclog;
-	xlog_in_core_t	   *first_iclog;	/* used to know when we've
-						 * processed all iclogs once */
-	int		   flushcnt = 0;
-	xfs_lsn_t	   lowest_lsn;
-	int		   ioerrors;	/* counter: iclogs with errors */
-	int		   loopdidcallbacks; /* flag: inner loop did callbacks*/
-	int		   funcdidcallbacks; /* flag: function did callbacks */
-	int		   repeats;	/* for issuing console warnings if
-					 * looping too many times */
+	struct xlog_in_core	*iclog;
+	struct xlog_in_core	*first_iclog;
+	int			flushcnt = 0;
+	int			ioerrors = 0;
+	int			funcdidcallbacks = 0;
+	int			loopdidcallbacks;
+	int			repeats = 0;
+	int			ret;
 
 	spin_lock(&log->l_icloglock);
-	first_iclog = iclog = log->l_iclog;
-	ioerrors = 0;
-	funcdidcallbacks = 0;
-	repeats = 0;
 
 	do {
 		/*
@@ -2738,92 +2813,24 @@ xlog_state_do_callback(
 		repeats++;
 
 		do {
-
-			/* skip all iclogs in the ACTIVE & DIRTY states */
-			if (iclog->ic_state &
-			    (XLOG_STATE_ACTIVE|XLOG_STATE_DIRTY)) {
-				iclog = iclog->ic_next;
-				continue;
-			}
-
 			/*
-			 * Between marking a filesystem SHUTDOWN and stopping
-			 * the log, we do flush all iclogs to disk (if there
-			 * wasn't a log I/O error). So, we do want things to
-			 * go smoothly in case of just a SHUTDOWN  w/o a
-			 * LOG_IO_ERROR.
+			 * We have to process iclogs marked with errors at
+			 * least once, so we count errors rather than terminate
+			 * the inner loop here.
 			 */
-			if (!(iclog->ic_state & XLOG_STATE_IOERROR)) {
-				/*
-				 * Can only perform callbacks in order.  Since
-				 * this iclog is not in the DONE_SYNC/
-				 * DO_CALLBACK state, we skip the rest and
-				 * just try to clean up.  If we set our iclog
-				 * to DO_CALLBACK, we will not process it when
-				 * we retry since a previous iclog is in the
-				 * CALLBACK and the state cannot change since
-				 * we are holding the l_icloglock.
-				 */
-				if (!(iclog->ic_state &
-					(XLOG_STATE_DONE_SYNC |
-						 XLOG_STATE_DO_CALLBACK))) {
-					if (ciclog && (ciclog->ic_state ==
-							XLOG_STATE_DONE_SYNC)) {
-						ciclog->ic_state = XLOG_STATE_DO_CALLBACK;
-					}
-					break;
-				}
-				/*
-				 * We now have an iclog that is in either the
-				 * DO_CALLBACK or DONE_SYNC states. The other
-				 * states (WANT_SYNC, SYNCING, or CALLBACK were
-				 * caught by the above if and are going to
-				 * clean (i.e. we aren't doing their callbacks)
-				 * see the above if.
-				 */
-
-				/*
-				 * We will do one more check here to see if we
-				 * have chased our tail around.
-				 */
-
-				lowest_lsn = xlog_get_lowest_lsn(log);
-				if (lowest_lsn &&
-				    XFS_LSN_CMP(lowest_lsn,
-						be64_to_cpu(iclog->ic_header.h_lsn)) < 0) {
-					iclog = iclog->ic_next;
-					continue; /* Leave this iclog for
-						   * another thread */
-				}
-
-				iclog->ic_state = XLOG_STATE_CALLBACK;
-
+			if (iclog->ic_state & XLOG_STATE_IOERROR)
+				ioerrors++;
 
-				/*
-				 * Completion of a iclog IO does not imply that
-				 * a transaction has completed, as transactions
-				 * can be large enough to span many iclogs. We
-				 * cannot change the tail of the log half way
-				 * through a transaction as this may be the only
-				 * transaction in the log and moving th etail to
-				 * point to the middle of it will prevent
-				 * recovery from finding the start of the
-				 * transaction. Hence we should only update the
-				 * last_sync_lsn if this iclog contains
-				 * transaction completion callbacks on it.
-				 *
-				 * We have to do this before we drop the
-				 * icloglock to ensure we are the only one that
-				 * can update it.
-				 */
-				ASSERT(XFS_LSN_CMP(atomic64_read(&log->l_last_sync_lsn),
-					be64_to_cpu(iclog->ic_header.h_lsn)) <= 0);
-				if (!list_empty_careful(&iclog->ic_callbacks))
-					atomic64_set(&log->l_last_sync_lsn,
-						be64_to_cpu(iclog->ic_header.h_lsn));
+			ret = xlog_state_iodone_process_iclog(log, iclog,
+								ciclog);
+			if (ret)
+				break;
 
-			} else
-				ioerrors++;
+			if (!(iclog->ic_state &
+			      (XLOG_STATE_CALLBACK | XLOG_STATE_IOERROR))) {
+				iclog = iclog->ic_next;
+				continue;
+			}
 
 			if (xlog_state_do_iclog_callbacks(log, iclog, aborted))
 				loopdidcallbacks++;
-- 
2.23.0.rc1


^ permalink raw reply related	[flat|nested] 41+ messages in thread

* [PATCH 6/7] xfs: push iclog state cleaning into xlog_state_clean_log
  2019-09-04  4:24 [PATCH 0/7] xfs: log race fixes and cleanups Dave Chinner
                   ` (4 preceding siblings ...)
  2019-09-04  4:24 ` [PATCH 5/7] xfs: factor iclog state processing " Dave Chinner
@ 2019-09-04  4:24 ` Dave Chinner
  2019-09-04  6:44   ` Christoph Hellwig
  2019-09-04  4:24 ` [PATCH 7/7] xfs: push the grant head when the log head moves forward Dave Chinner
  2019-09-04  5:26 ` [PATCH 0/7] xfs: log race fixes and cleanups Christoph Hellwig
  7 siblings, 1 reply; 41+ messages in thread
From: Dave Chinner @ 2019-09-04  4:24 UTC (permalink / raw)
  To: linux-xfs

From: Dave Chinner <dchinner@redhat.com>

xlog_state_clean_log() is only called from one place, and it occurs
when an iclog is transitioning back to ACTIVE. Prior to calling
xlog_state_clean_log, the iclog we are processing has a hard coded
state check to DIRTY so that xlog_state_clean_log() processes it
correctly. We also have a hard coded wakeup after
xlog_state_clean_log() to enfore log force waiters on that iclog
are woken correctly.

Both of these things are operations required to finish processing an
iclog and return it to the ACTIVE state again, so they make little
sense to be separated from the rest of the clean state transition
code.

Hence push these things inside xlog_state_clean_log(), document the
behaviour and rename it xlog_state_clean_iclog() to indicate that
it's being driven by an iclog state change and does the iclog state
change work itself.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/xfs_log.c | 57 ++++++++++++++++++++++++++++--------------------
 1 file changed, 33 insertions(+), 24 deletions(-)

diff --git a/fs/xfs/xfs_log.c b/fs/xfs/xfs_log.c
index c3efece72847..6a59d71d4c60 100644
--- a/fs/xfs/xfs_log.c
+++ b/fs/xfs/xfs_log.c
@@ -2505,21 +2505,35 @@ xlog_write(
  *****************************************************************************
  */
 
-/* Clean iclogs starting from the head.  This ordering must be
- * maintained, so an iclog doesn't become ACTIVE beyond one that
- * is SYNCING.  This is also required to maintain the notion that we use
- * a ordered wait queue to hold off would be writers to the log when every
- * iclog is trying to sync to disk.
+/*
+ * An iclog has just finished it's completion processing, so we need to update
+ * the iclog state and propagate that up into the overall log state. Hence we
+ * prepare the iclog for cleaning, and then clean all the pending dirty iclogs
+ * starting from the head, and then wake up any threads that are waiting for the
+ * iclog to be marked clean.
+ *
+ * The ordering of marking iclogs ACTIVE must be maintained, so an iclog
+ * doesn't become ACTIVE beyond one that is SYNCING.  This is also required to
+ * maintain the notion that we use a ordered wait queue to hold off would be
+ * writers to the log when every iclog is trying to sync to disk.
+ *
+ * Caller must hold the icloglock before calling us.
  *
- * State Change: DIRTY -> ACTIVE
+ * State Change: !IOERROR -> DIRTY -> ACTIVE
  */
 STATIC void
-xlog_state_clean_log(
-	struct xlog *log)
+xlog_state_clean_iclog(
+	struct xlog		*log,
+	struct xlog_in_core	*dirty_iclog)
 {
-	xlog_in_core_t	*iclog;
-	int changed = 0;
+	struct xlog_in_core	*iclog;
+	int			changed = 0;
+
+	/* Prepare the completed iclog. */
+	if (!(dirty_iclog->ic_state & XLOG_STATE_IOERROR))
+		dirty_iclog->ic_state = XLOG_STATE_DIRTY;
 
+	/* Walk all the iclogs to update the ordered active state. */
 	iclog = log->l_iclog;
 	do {
 		if (iclog->ic_state == XLOG_STATE_DIRTY) {
@@ -2557,7 +2571,13 @@ xlog_state_clean_log(
 		iclog = iclog->ic_next;
 	} while (iclog != log->l_iclog);
 
-	/* log is locked when we are called */
+
+	/*
+	 * Wake up threads waiting in xfs_log_force() for the dirty iclog
+	 * to be cleaned.
+	 */
+	wake_up_all(&dirty_iclog->ic_force_wait);
+
 	/*
 	 * Change state for the dummy log recording.
 	 * We usually go to NEED. But we go to NEED2 if the changed indicates
@@ -2591,7 +2611,7 @@ xlog_state_clean_log(
 			ASSERT(0);
 		}
 	}
-}	/* xlog_state_clean_log */
+}
 
 STATIC xfs_lsn_t
 xlog_get_lowest_lsn(
@@ -2835,18 +2855,7 @@ xlog_state_do_callback(
 			if (xlog_state_do_iclog_callbacks(log, iclog, aborted))
 				loopdidcallbacks++;
 
-			if (!(iclog->ic_state & XLOG_STATE_IOERROR))
-				iclog->ic_state = XLOG_STATE_DIRTY;
-
-			/*
-			 * Transition from DIRTY to ACTIVE if applicable.
-			 * NOP if STATE_IOERROR.
-			 */
-			xlog_state_clean_log(log);
-
-			/* wake up threads waiting in xfs_log_force() */
-			wake_up_all(&iclog->ic_force_wait);
-
+			xlog_state_clean_iclog(log, iclog);
 			iclog = iclog->ic_next;
 		} while (first_iclog != iclog);
 
-- 
2.23.0.rc1


^ permalink raw reply related	[flat|nested] 41+ messages in thread

* [PATCH 7/7] xfs: push the grant head when the log head moves forward
  2019-09-04  4:24 [PATCH 0/7] xfs: log race fixes and cleanups Dave Chinner
                   ` (5 preceding siblings ...)
  2019-09-04  4:24 ` [PATCH 6/7] xfs: push iclog state cleaning into xlog_state_clean_log Dave Chinner
@ 2019-09-04  4:24 ` Dave Chinner
  2019-09-04  6:45   ` Christoph Hellwig
  2019-09-04 19:34   ` Brian Foster
  2019-09-04  5:26 ` [PATCH 0/7] xfs: log race fixes and cleanups Christoph Hellwig
  7 siblings, 2 replies; 41+ messages in thread
From: Dave Chinner @ 2019-09-04  4:24 UTC (permalink / raw)
  To: linux-xfs

From: Dave Chinner <dchinner@redhat.com>

When the log fills up, we can get into the state where the
outstanding items in the CIL being committed and aggregated are
larger than the range that the reservation grant head tail pushing
will attempt to clean. This can result in the tail pushing range
being trimmed back to the the log head (l_last_sync_lsn) and so
may not actually move the push target at all.

When the iclogs associated with the CIL commit finally land, the
log head moves forward, and this removes the restriction on the AIL
push target. However, if we already have transactions sleeping on
the grant head, and there's nothing in the AIL still to flush from
the current push target, then nothing will move the tail of the log
and trigger a log reservation wakeup.

Hence the there is nothing that will trigger xlog_grant_push_ail()
to recalculate the AIL push target and start pushing on the AIL
again to write back the metadata objects that pin the tail of the
log and hence free up space and allow the transaction reservations
to be woken and make progress.

Hence we need to push on the grant head when we move the log head
forward, as this may be the only trigger we have that can move the
AIL push target forwards in this situation.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/xfs_log.c | 72 +++++++++++++++++++++++++++++++-----------------
 1 file changed, 47 insertions(+), 25 deletions(-)

diff --git a/fs/xfs/xfs_log.c b/fs/xfs/xfs_log.c
index 6a59d71d4c60..733693e1ac9f 100644
--- a/fs/xfs/xfs_log.c
+++ b/fs/xfs/xfs_log.c
@@ -2632,6 +2632,46 @@ xlog_get_lowest_lsn(
 	return lowest_lsn;
 }
 
+/*
+ * Completion of a iclog IO does not imply that a transaction has completed, as
+ * transactions can be large enough to span many iclogs. We cannot change the
+ * tail of the log half way through a transaction as this may be the only
+ * transaction in the log and moving the tail to point to the middle of it
+ * will prevent recovery from finding the start of the transaction. Hence we
+ * should only update the last_sync_lsn if this iclog contains transaction
+ * completion callbacks on it.
+ *
+ * We have to do this before we drop the icloglock to ensure we are the only one
+ * that can update it.
+ *
+ * If we are moving the last_sync_lsn forwards, we also need to ensure we kick
+ * the reservation grant head pushing. This is due to the fact that the push
+ * target is bound by the current last_sync_lsn value. Hence if we have a large
+ * amount of log space bound up in this committing transaction then the
+ * last_sync_lsn value may be the limiting factor preventing tail pushing from
+ * freeing space in the log. Hence once we've updated the last_sync_lsn we
+ * should push the AIL to ensure the push target (and hence the grant head) is
+ * no longer bound by the old log head location and can move forwards and make
+ * progress again.
+ */
+static void
+xlog_state_set_callback(
+	struct xlog		*log,
+	struct xlog_in_core	*iclog,
+	xfs_lsn_t		header_lsn)
+{
+	iclog->ic_state = XLOG_STATE_CALLBACK;
+
+	ASSERT(XFS_LSN_CMP(atomic64_read(&log->l_last_sync_lsn), header_lsn) <= 0);
+
+	if (list_empty_careful(&iclog->ic_callbacks))
+		return;
+
+	atomic64_set(&log->l_last_sync_lsn, header_lsn);
+	xlog_grant_push_ail(log, 0);
+
+}
+
 /*
  * Return true if we need to stop processing, false to continue to the next
  * iclog. The caller will need to run callbacks if the iclog is returned in the
@@ -2644,6 +2684,7 @@ xlog_state_iodone_process_iclog(
 	struct xlog_in_core	*completed_iclog)
 {
 	xfs_lsn_t		lowest_lsn;
+	xfs_lsn_t		header_lsn;
 
 	/* Skip all iclogs in the ACTIVE & DIRTY states */
 	if (iclog->ic_state & (XLOG_STATE_ACTIVE|XLOG_STATE_DIRTY))
@@ -2681,34 +2722,15 @@ xlog_state_iodone_process_iclog(
 	 * callbacks) see the above if.
 	 *
 	 * We will do one more check here to see if we have chased our tail
-	 * around.
+	 * around. If this is not the lowest lsn iclog, then we will leave it
+	 * for another completion to process.
 	 */
+	header_lsn = be64_to_cpu(iclog->ic_header.h_lsn);
 	lowest_lsn = xlog_get_lowest_lsn(log);
-	if (lowest_lsn &&
-	    XFS_LSN_CMP(lowest_lsn, be64_to_cpu(iclog->ic_header.h_lsn)) < 0)
-		return false; /* Leave this iclog for another thread */
-
-	iclog->ic_state = XLOG_STATE_CALLBACK;
-
-	/*
-	 * Completion of a iclog IO does not imply that a transaction has
-	 * completed, as transactions can be large enough to span many iclogs.
-	 * We cannot change the tail of the log half way through a transaction
-	 * as this may be the only transaction in the log and moving th etail to
-	 * point to the middle of it will prevent recovery from finding the
-	 * start of the transaction.  Hence we should only update the
-	 * last_sync_lsn if this iclog contains transaction completion callbacks
-	 * on it.
-	 *
-	 * We have to do this before we drop the icloglock to ensure we are the
-	 * only one that can update it.
-	 */
-	ASSERT(XFS_LSN_CMP(atomic64_read(&log->l_last_sync_lsn),
-			be64_to_cpu(iclog->ic_header.h_lsn)) <= 0);
-	if (!list_empty_careful(&iclog->ic_callbacks))
-		atomic64_set(&log->l_last_sync_lsn,
-			be64_to_cpu(iclog->ic_header.h_lsn));
+	if (lowest_lsn && XFS_LSN_CMP(lowest_lsn, header_lsn) < 0)
+		return false;
 
+	xlog_state_set_callback(log, iclog, header_lsn);
 	return false;
 
 }
-- 
2.23.0.rc1


^ permalink raw reply related	[flat|nested] 41+ messages in thread

* Re: [PATCH 0/7] xfs: log race fixes and cleanups
  2019-09-04  4:24 [PATCH 0/7] xfs: log race fixes and cleanups Dave Chinner
                   ` (6 preceding siblings ...)
  2019-09-04  4:24 ` [PATCH 7/7] xfs: push the grant head when the log head moves forward Dave Chinner
@ 2019-09-04  5:26 ` Christoph Hellwig
  2019-09-04  5:56   ` Christoph Hellwig
  7 siblings, 1 reply; 41+ messages in thread
From: Christoph Hellwig @ 2019-09-04  5:26 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Wed, Sep 04, 2019 at 02:24:44PM +1000, Dave Chinner wrote:
> HI folks,
> 
> This series of patches contain fixes for the generic/530 hangs that
> Chandan and Jan reported recently. The first patch in the series is
> the main fix for this, though in theory the last patch in the series
> is necessary to totally close the problem off.

While I haven't been active in the thread I already reported this weeks
ago as well..

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 0/7] xfs: log race fixes and cleanups
  2019-09-04  5:26 ` [PATCH 0/7] xfs: log race fixes and cleanups Christoph Hellwig
@ 2019-09-04  5:56   ` Christoph Hellwig
  2019-09-04 22:57     ` Dave Chinner
  0 siblings, 1 reply; 41+ messages in thread
From: Christoph Hellwig @ 2019-09-04  5:56 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Tue, Sep 03, 2019 at 10:26:00PM -0700, Christoph Hellwig wrote:
> On Wed, Sep 04, 2019 at 02:24:44PM +1000, Dave Chinner wrote:
> > HI folks,
> > 
> > This series of patches contain fixes for the generic/530 hangs that
> > Chandan and Jan reported recently. The first patch in the series is
> > the main fix for this, though in theory the last patch in the series
> > is necessary to totally close the problem off.
> 
> While I haven't been active in the thread I already reported this weeks
> ago as well..

And unfortunately generic/530 still hangs for me with this series.
This is an x86-64 VM with 4G of RAM and virtio-blk, default mkfs.xfs
options from current xfsprogs, 20G test and scratch fs.

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 1/7] xfs: push the AIL in xlog_grant_head_wake
  2019-09-04  4:24 ` [PATCH 1/7] xfs: push the AIL in xlog_grant_head_wake Dave Chinner
@ 2019-09-04  6:07   ` Christoph Hellwig
  2019-09-04 21:46     ` Dave Chinner
  0 siblings, 1 reply; 41+ messages in thread
From: Christoph Hellwig @ 2019-09-04  6:07 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

> +		if (*free_bytes < need_bytes) {
> +			if (!woken_task)
> +				xlog_grant_push_ail(log, need_bytes);
>  			return false;
> +		}

It would be nice to have a comment here with the condensed wisdom from
the commit log.

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 2/7] xfs: fix missed wakeup on l_flush_wait
  2019-09-04  4:24 ` [PATCH 2/7] xfs: fix missed wakeup on l_flush_wait Dave Chinner
@ 2019-09-04  6:07   ` Christoph Hellwig
  2019-09-04 21:47     ` Dave Chinner
  0 siblings, 1 reply; 41+ messages in thread
From: Christoph Hellwig @ 2019-09-04  6:07 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

Looks good.  I vaguely remember reviewing this before, though.

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 3/7] xfs: factor debug code out of xlog_state_do_callback()
  2019-09-04  4:24 ` [PATCH 3/7] xfs: factor debug code out of xlog_state_do_callback() Dave Chinner
@ 2019-09-04  6:10   ` Christoph Hellwig
  2019-09-04 21:14     ` Dave Chinner
  0 siblings, 1 reply; 41+ messages in thread
From: Christoph Hellwig @ 2019-09-04  6:10 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

> + * Note that SYNCING|IOABORT is a valid state so we cannot just check for
> + * ic_state == SYNCING.

I've removed the IOABORT flag recently, but it seems I forgot to remove
this comment.  That beeing said the IOERR flag is still a complete mess
as we sometimes use it as a flag and sometimes as a separate state.
I've been wanting to sort that out, but always got preempted.

> + */
> +static void
> +xlog_state_callback_check_state(
> +	struct xlog		*log,
> +	int			did_callbacks)
> +{
> +	struct xlog_in_core	*iclog;
> +	struct xlog_in_core	*first_iclog;
> +
> +	if (!did_callbacks)
> +		return;
> +
> +	first_iclog = iclog = log->l_iclog;

I'd keep the did_callbacks check in the caller.  For the non-debug case
it will be optimized away, but it saves an argument, and allows
initializing the iclog variables on the declaration line.

> +	do {
> +		ASSERT(iclog->ic_state != XLOG_STATE_DO_CALLBACK);
> +		/*
> +		 * Terminate the loop if iclogs are found in states
> +		 * which will cause other threads to clean up iclogs.
> +		 *
> +		 * SYNCING - i/o completion will go through logs
> +		 * DONE_SYNC - interrupt thread should be waiting for
> +		 *              l_icloglock
> +		 * IOERROR - give up hope all ye who enter here
> +		 */
> +		if (iclog->ic_state == XLOG_STATE_WANT_SYNC ||
> +		    iclog->ic_state & XLOG_STATE_SYNCING ||
> +		    iclog->ic_state == XLOG_STATE_DONE_SYNC ||
> +		    iclog->ic_state == XLOG_STATE_IOERROR )
> +			break;
> +		iclog = iclog->ic_next;

No new, but if we list the states we should not miss one..

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 4/7] xfs: factor callbacks out of xlog_state_do_callback()
  2019-09-04  4:24 ` [PATCH 4/7] xfs: factor callbacks " Dave Chinner
@ 2019-09-04  6:13   ` Christoph Hellwig
  2019-09-04  6:32   ` Christoph Hellwig
  1 sibling, 0 replies; 41+ messages in thread
From: Christoph Hellwig @ 2019-09-04  6:13 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

Looks good,

Reviewed-by: Christoph Hellwig <hch@lst.de>

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 4/7] xfs: factor callbacks out of xlog_state_do_callback()
  2019-09-04  4:24 ` [PATCH 4/7] xfs: factor callbacks " Dave Chinner
  2019-09-04  6:13   ` Christoph Hellwig
@ 2019-09-04  6:32   ` Christoph Hellwig
  2019-09-04 21:22     ` Dave Chinner
  1 sibling, 1 reply; 41+ messages in thread
From: Christoph Hellwig @ 2019-09-04  6:32 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

After going back to this from the next patch I think I spotted a
behavior difference:  xlog_state_do_iclog_callbacks only returns true,
and xlog_state_do_callback only increments loopdidcallbacks if
ic_callbacks was non-emty.  But we already dropped the block just
to check that it is empty, so I think we need to keep the old
behavior.

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 5/7] xfs: factor iclog state processing out of xlog_state_do_callback()
  2019-09-04  4:24 ` [PATCH 5/7] xfs: factor iclog state processing " Dave Chinner
@ 2019-09-04  6:42   ` Christoph Hellwig
  2019-09-04 21:43     ` Dave Chinner
  0 siblings, 1 reply; 41+ messages in thread
From: Christoph Hellwig @ 2019-09-04  6:42 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Wed, Sep 04, 2019 at 02:24:49PM +1000, Dave Chinner wrote:
> +	/* Skip all iclogs in the ACTIVE & DIRTY states */
> +	if (iclog->ic_state & (XLOG_STATE_ACTIVE|XLOG_STATE_DIRTY))
> +		return false;

Please use spaces around the "|".

> +			if (iclog->ic_state & XLOG_STATE_IOERROR)
> +				ioerrors++;

This now also counts the ierrror flag for dirty and active iclogs.
Not sure it matters given our state machine, but it does change
behavior.

> +			ret = xlog_state_iodone_process_iclog(log, iclog,
> +								ciclog);
> +			if (ret)
> +				break;

No need for the ret variable.

>  
> -			} else
> -				ioerrors++;
> +			if (!(iclog->ic_state &
> +			      (XLOG_STATE_CALLBACK | XLOG_STATE_IOERROR))) {
> +				iclog = iclog->ic_next;
> +				continue;
> +			}

Btw, one cleanup I had pending is that all our loops ovr the iclog
list can be cleaned up nicely so that continue does that right thing
without all these manual "iclog = iclog->ic_next" next statements.  Just
turn the loop into:

	do {
		..
	} while ((iclog = iclog->ic_next) != first_iclog);

this might be applicable to a few of your patches.

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 6/7] xfs: push iclog state cleaning into xlog_state_clean_log
  2019-09-04  4:24 ` [PATCH 6/7] xfs: push iclog state cleaning into xlog_state_clean_log Dave Chinner
@ 2019-09-04  6:44   ` Christoph Hellwig
  0 siblings, 0 replies; 41+ messages in thread
From: Christoph Hellwig @ 2019-09-04  6:44 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Wed, Sep 04, 2019 at 02:24:50PM +1000, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> xlog_state_clean_log() is only called from one place, and it occurs
> when an iclog is transitioning back to ACTIVE. Prior to calling
> xlog_state_clean_log, the iclog we are processing has a hard coded
> state check to DIRTY so that xlog_state_clean_log() processes it
> correctly. We also have a hard coded wakeup after
> xlog_state_clean_log() to enfore log force waiters on that iclog
> are woken correctly.
> 
> Both of these things are operations required to finish processing an
> iclog and return it to the ACTIVE state again, so they make little
> sense to be separated from the rest of the clean state transition
> code.
> 
> Hence push these things inside xlog_state_clean_log(), document the
> behaviour and rename it xlog_state_clean_iclog() to indicate that
> it's being driven by an iclog state change and does the iclog state
> change work itself.

Looks good,

Reviewed-by: Christoph Hellwig <hch@lst.de>

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 7/7] xfs: push the grant head when the log head moves forward
  2019-09-04  4:24 ` [PATCH 7/7] xfs: push the grant head when the log head moves forward Dave Chinner
@ 2019-09-04  6:45   ` Christoph Hellwig
  2019-09-04 21:49     ` Dave Chinner
  2019-09-04 19:34   ` Brian Foster
  1 sibling, 1 reply; 41+ messages in thread
From: Christoph Hellwig @ 2019-09-04  6:45 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

> +	ASSERT(XFS_LSN_CMP(atomic64_read(&log->l_last_sync_lsn), header_lsn) <= 0);

This adds an > 80 char line.

Otherwise this looks sensible to me.

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 7/7] xfs: push the grant head when the log head moves forward
  2019-09-04  4:24 ` [PATCH 7/7] xfs: push the grant head when the log head moves forward Dave Chinner
  2019-09-04  6:45   ` Christoph Hellwig
@ 2019-09-04 19:34   ` Brian Foster
  2019-09-04 22:50     ` Dave Chinner
  1 sibling, 1 reply; 41+ messages in thread
From: Brian Foster @ 2019-09-04 19:34 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Wed, Sep 04, 2019 at 02:24:51PM +1000, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> When the log fills up, we can get into the state where the
> outstanding items in the CIL being committed and aggregated are
> larger than the range that the reservation grant head tail pushing
> will attempt to clean. This can result in the tail pushing range
> being trimmed back to the the log head (l_last_sync_lsn) and so
> may not actually move the push target at all.
> 
> When the iclogs associated with the CIL commit finally land, the
> log head moves forward, and this removes the restriction on the AIL
> push target. However, if we already have transactions sleeping on
> the grant head, and there's nothing in the AIL still to flush from
> the current push target, then nothing will move the tail of the log
> and trigger a log reservation wakeup.
> 
> Hence the there is nothing that will trigger xlog_grant_push_ail()
> to recalculate the AIL push target and start pushing on the AIL
> again to write back the metadata objects that pin the tail of the
> log and hence free up space and allow the transaction reservations
> to be woken and make progress.
> 
> Hence we need to push on the grant head when we move the log head
> forward, as this may be the only trigger we have that can move the
> AIL push target forwards in this situation.
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---
>  fs/xfs/xfs_log.c | 72 +++++++++++++++++++++++++++++++-----------------
>  1 file changed, 47 insertions(+), 25 deletions(-)
> 
> diff --git a/fs/xfs/xfs_log.c b/fs/xfs/xfs_log.c
> index 6a59d71d4c60..733693e1ac9f 100644
> --- a/fs/xfs/xfs_log.c
> +++ b/fs/xfs/xfs_log.c
> @@ -2632,6 +2632,46 @@ xlog_get_lowest_lsn(
>  	return lowest_lsn;
>  }
>  
> +/*
> + * Completion of a iclog IO does not imply that a transaction has completed, as
> + * transactions can be large enough to span many iclogs. We cannot change the
> + * tail of the log half way through a transaction as this may be the only
> + * transaction in the log and moving the tail to point to the middle of it
> + * will prevent recovery from finding the start of the transaction. Hence we
> + * should only update the last_sync_lsn if this iclog contains transaction
> + * completion callbacks on it.
> + *
> + * We have to do this before we drop the icloglock to ensure we are the only one
> + * that can update it.
> + *
> + * If we are moving the last_sync_lsn forwards, we also need to ensure we kick
> + * the reservation grant head pushing. This is due to the fact that the push
> + * target is bound by the current last_sync_lsn value. Hence if we have a large
> + * amount of log space bound up in this committing transaction then the
> + * last_sync_lsn value may be the limiting factor preventing tail pushing from
> + * freeing space in the log. Hence once we've updated the last_sync_lsn we
> + * should push the AIL to ensure the push target (and hence the grant head) is
> + * no longer bound by the old log head location and can move forwards and make
> + * progress again.
> + */
> +static void
> +xlog_state_set_callback(
> +	struct xlog		*log,
> +	struct xlog_in_core	*iclog,
> +	xfs_lsn_t		header_lsn)
> +{
> +	iclog->ic_state = XLOG_STATE_CALLBACK;
> +
> +	ASSERT(XFS_LSN_CMP(atomic64_read(&log->l_last_sync_lsn), header_lsn) <= 0);
> +
> +	if (list_empty_careful(&iclog->ic_callbacks))
> +		return;
> +
> +	atomic64_set(&log->l_last_sync_lsn, header_lsn);
> +	xlog_grant_push_ail(log, 0);
> +

Nit: extra whitespace line above.

This still seems racy to me, FWIW. What if the AIL is empty (i.e. the
push is skipped)? What if xfsaild completes this push before the
associated log items land in the AIL or we race with xfsaild emptying
the AIL? Why not just reuse/update the existing grant head wake up logic
in the iclog callback itself? E.g., something like the following
(untested):

@@ -740,12 +740,10 @@ xfs_trans_ail_update_bulk(
 	if (mlip_changed) {
 		if (!XFS_FORCED_SHUTDOWN(ailp->ail_mount))
 			xlog_assign_tail_lsn_locked(ailp->ail_mount);
-		spin_unlock(&ailp->ail_lock);
-
-		xfs_log_space_wake(ailp->ail_mount);
-	} else {
-		spin_unlock(&ailp->ail_lock);
 	}
+
+	spin_unlock(&ailp->ail_lock);
+	xfs_log_space_wake(ailp->ail_mount);
}

... seems to solve the same prospective problem without being racy and
with less and more simple code. Hm?

Brian

> +}
> +
>  /*
>   * Return true if we need to stop processing, false to continue to the next
>   * iclog. The caller will need to run callbacks if the iclog is returned in the
> @@ -2644,6 +2684,7 @@ xlog_state_iodone_process_iclog(
>  	struct xlog_in_core	*completed_iclog)
>  {
>  	xfs_lsn_t		lowest_lsn;
> +	xfs_lsn_t		header_lsn;
>  
>  	/* Skip all iclogs in the ACTIVE & DIRTY states */
>  	if (iclog->ic_state & (XLOG_STATE_ACTIVE|XLOG_STATE_DIRTY))
> @@ -2681,34 +2722,15 @@ xlog_state_iodone_process_iclog(
>  	 * callbacks) see the above if.
>  	 *
>  	 * We will do one more check here to see if we have chased our tail
> -	 * around.
> +	 * around. If this is not the lowest lsn iclog, then we will leave it
> +	 * for another completion to process.
>  	 */
> +	header_lsn = be64_to_cpu(iclog->ic_header.h_lsn);
>  	lowest_lsn = xlog_get_lowest_lsn(log);
> -	if (lowest_lsn &&
> -	    XFS_LSN_CMP(lowest_lsn, be64_to_cpu(iclog->ic_header.h_lsn)) < 0)
> -		return false; /* Leave this iclog for another thread */
> -
> -	iclog->ic_state = XLOG_STATE_CALLBACK;
> -
> -	/*
> -	 * Completion of a iclog IO does not imply that a transaction has
> -	 * completed, as transactions can be large enough to span many iclogs.
> -	 * We cannot change the tail of the log half way through a transaction
> -	 * as this may be the only transaction in the log and moving th etail to
> -	 * point to the middle of it will prevent recovery from finding the
> -	 * start of the transaction.  Hence we should only update the
> -	 * last_sync_lsn if this iclog contains transaction completion callbacks
> -	 * on it.
> -	 *
> -	 * We have to do this before we drop the icloglock to ensure we are the
> -	 * only one that can update it.
> -	 */
> -	ASSERT(XFS_LSN_CMP(atomic64_read(&log->l_last_sync_lsn),
> -			be64_to_cpu(iclog->ic_header.h_lsn)) <= 0);
> -	if (!list_empty_careful(&iclog->ic_callbacks))
> -		atomic64_set(&log->l_last_sync_lsn,
> -			be64_to_cpu(iclog->ic_header.h_lsn));
> +	if (lowest_lsn && XFS_LSN_CMP(lowest_lsn, header_lsn) < 0)
> +		return false;
>  
> +	xlog_state_set_callback(log, iclog, header_lsn);
>  	return false;
>  
>  }
> -- 
> 2.23.0.rc1
> 

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 3/7] xfs: factor debug code out of xlog_state_do_callback()
  2019-09-04  6:10   ` Christoph Hellwig
@ 2019-09-04 21:14     ` Dave Chinner
  0 siblings, 0 replies; 41+ messages in thread
From: Dave Chinner @ 2019-09-04 21:14 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: linux-xfs

On Tue, Sep 03, 2019 at 11:10:38PM -0700, Christoph Hellwig wrote:
> > + * Note that SYNCING|IOABORT is a valid state so we cannot just check for
> > + * ic_state == SYNCING.
> 
> I've removed the IOABORT flag recently, but it seems I forgot to remove
> this comment. 

I think the comment is still relevant for SYNCING|IOERROR state,
so I s/ABORT/ERROR/

> That beeing said the IOERR flag is still a complete mess
> as we sometimes use it as a flag and sometimes as a separate state.
> I've been wanting to sort that out, but always got preempted.

Yes, I started cleaning that up (eg. using XLOG_FORCED_SHUTDOWN
instead of checking the IOERROR state of the first iclog for
shutdown) but it's not that simple unfortunately - it's a bit of a
mess - and so I dropped that from the series to get the fixes out.
I have an idea of what is wrong, but it's not ready yet.

> > + */
> > +static void
> > +xlog_state_callback_check_state(
> > +	struct xlog		*log,
> > +	int			did_callbacks)
> > +{
> > +	struct xlog_in_core	*iclog;
> > +	struct xlog_in_core	*first_iclog;
> > +
> > +	if (!did_callbacks)
> > +		return;
> > +
> > +	first_iclog = iclog = log->l_iclog;
> 
> I'd keep the did_callbacks check in the caller.  For the non-debug case
> it will be optimized away, but it saves an argument, and allows
> initializing the iclog variables on the declaration line.

Fixed.

FWIW, I started cleaning that up to with xlog_for_each_iclog()
for all the places where we iterate like this. But that's also
dependent on the IOERROR cleanup I shelved for the moment....

> > +	do {
> > +		ASSERT(iclog->ic_state != XLOG_STATE_DO_CALLBACK);
> > +		/*
> > +		 * Terminate the loop if iclogs are found in states
> > +		 * which will cause other threads to clean up iclogs.
> > +		 *
> > +		 * SYNCING - i/o completion will go through logs
> > +		 * DONE_SYNC - interrupt thread should be waiting for
> > +		 *              l_icloglock
> > +		 * IOERROR - give up hope all ye who enter here
> > +		 */
> > +		if (iclog->ic_state == XLOG_STATE_WANT_SYNC ||
> > +		    iclog->ic_state & XLOG_STATE_SYNCING ||
> > +		    iclog->ic_state == XLOG_STATE_DONE_SYNC ||
> > +		    iclog->ic_state == XLOG_STATE_IOERROR )
> > +			break;
> > +		iclog = iclog->ic_next;
> 
> No new, but if we list the states we should not miss one..

Which one if missing?

Cheers,
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 4/7] xfs: factor callbacks out of xlog_state_do_callback()
  2019-09-04  6:32   ` Christoph Hellwig
@ 2019-09-04 21:22     ` Dave Chinner
  0 siblings, 0 replies; 41+ messages in thread
From: Dave Chinner @ 2019-09-04 21:22 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: linux-xfs

On Tue, Sep 03, 2019 at 11:32:40PM -0700, Christoph Hellwig wrote:
> After going back to this from the next patch I think I spotted a
> behavior difference:  xlog_state_do_iclog_callbacks only returns true,
> and xlog_state_do_callback only increments loopdidcallbacks if
> ic_callbacks was non-emty.  But we already dropped the block just
> to check that it is empty, so I think we need to keep the old
> behavior.

IIUC, you are saying that loopdidcallbacks is not a variable that
tracks whether callbacks were run, but whether the icloglock was
released or not. Ok, I'll go fix that up, and name the variable
appropriately.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 5/7] xfs: factor iclog state processing out of xlog_state_do_callback()
  2019-09-04  6:42   ` Christoph Hellwig
@ 2019-09-04 21:43     ` Dave Chinner
  0 siblings, 0 replies; 41+ messages in thread
From: Dave Chinner @ 2019-09-04 21:43 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: linux-xfs

On Tue, Sep 03, 2019 at 11:42:21PM -0700, Christoph Hellwig wrote:
> On Wed, Sep 04, 2019 at 02:24:49PM +1000, Dave Chinner wrote:
> > +	/* Skip all iclogs in the ACTIVE & DIRTY states */
> > +	if (iclog->ic_state & (XLOG_STATE_ACTIVE|XLOG_STATE_DIRTY))
> > +		return false;
> 
> Please use spaces around the "|".

Fixed.

> 
> > +			if (iclog->ic_state & XLOG_STATE_IOERROR)
> > +				ioerrors++;
> 
> This now also counts the ierrror flag for dirty and active iclogs.
> Not sure it matters given our state machine, but it does change
> behavior.

True. There's an ioerror check in xlog_state_iodone_process_iclog()
so I can pass the ioerror parameter into it and retain the existing
logic.

> > +			ret = xlog_state_iodone_process_iclog(log, iclog,
> > +								ciclog);
> > +			if (ret)
> > +				break;
> 
> No need for the ret variable.

Fixed.

> >  
> > -			} else
> > -				ioerrors++;
> > +			if (!(iclog->ic_state &
> > +			      (XLOG_STATE_CALLBACK | XLOG_STATE_IOERROR))) {
> > +				iclog = iclog->ic_next;
> > +				continue;
> > +			}
> 
> Btw, one cleanup I had pending is that all our loops ovr the iclog
> list can be cleaned up nicely so that continue does that right thing
> without all these manual "iclog = iclog->ic_next" next statements.  Just
> turn the loop into:
> 
> 	do {
> 		..
> 	} while ((iclog = iclog->ic_next) != first_iclog);
> 
> this might be applicable to a few of your patches.

Yup, as I mentioned earlier, that's in progress :P

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 1/7] xfs: push the AIL in xlog_grant_head_wake
  2019-09-04  6:07   ` Christoph Hellwig
@ 2019-09-04 21:46     ` Dave Chinner
  0 siblings, 0 replies; 41+ messages in thread
From: Dave Chinner @ 2019-09-04 21:46 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: linux-xfs

On Tue, Sep 03, 2019 at 11:07:07PM -0700, Christoph Hellwig wrote:
> > +		if (*free_bytes < need_bytes) {
> > +			if (!woken_task)
> > +				xlog_grant_push_ail(log, need_bytes);
> >  			return false;
> > +		}
> 
> It would be nice to have a comment here with the condensed wisdom from
> the commit log.

Fixed.

-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 2/7] xfs: fix missed wakeup on l_flush_wait
  2019-09-04  6:07   ` Christoph Hellwig
@ 2019-09-04 21:47     ` Dave Chinner
  0 siblings, 0 replies; 41+ messages in thread
From: Dave Chinner @ 2019-09-04 21:47 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: linux-xfs

On Tue, Sep 03, 2019 at 11:07:51PM -0700, Christoph Hellwig wrote:
> Looks good.  I vaguely remember reviewing this before, though.

Probably, but no one but me commented on it the first time, and I
don't recall anyone reviewing it specifically when I posted it as
part of the nonblocking inode reclaim RFC. SO yeah, maybe seen, not
sure it if was reviewed...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 7/7] xfs: push the grant head when the log head moves forward
  2019-09-04  6:45   ` Christoph Hellwig
@ 2019-09-04 21:49     ` Dave Chinner
  0 siblings, 0 replies; 41+ messages in thread
From: Dave Chinner @ 2019-09-04 21:49 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: linux-xfs

On Tue, Sep 03, 2019 at 11:45:10PM -0700, Christoph Hellwig wrote:
> > +	ASSERT(XFS_LSN_CMP(atomic64_read(&log->l_last_sync_lsn), header_lsn) <= 0);
> 
> This adds an > 80 char line.

Fixed.


-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 7/7] xfs: push the grant head when the log head moves forward
  2019-09-04 19:34   ` Brian Foster
@ 2019-09-04 22:50     ` Dave Chinner
  2019-09-05 16:25       ` Brian Foster
  0 siblings, 1 reply; 41+ messages in thread
From: Dave Chinner @ 2019-09-04 22:50 UTC (permalink / raw)
  To: Brian Foster; +Cc: linux-xfs

On Wed, Sep 04, 2019 at 03:34:42PM -0400, Brian Foster wrote:
> On Wed, Sep 04, 2019 at 02:24:51PM +1000, Dave Chinner wrote:
> > From: Dave Chinner <dchinner@redhat.com>
> > +/*
> > + * Completion of a iclog IO does not imply that a transaction has completed, as
> > + * transactions can be large enough to span many iclogs. We cannot change the
> > + * tail of the log half way through a transaction as this may be the only
> > + * transaction in the log and moving the tail to point to the middle of it
> > + * will prevent recovery from finding the start of the transaction. Hence we
> > + * should only update the last_sync_lsn if this iclog contains transaction
> > + * completion callbacks on it.
> > + *
> > + * We have to do this before we drop the icloglock to ensure we are the only one
> > + * that can update it.
> > + *
> > + * If we are moving the last_sync_lsn forwards, we also need to ensure we kick
> > + * the reservation grant head pushing. This is due to the fact that the push
> > + * target is bound by the current last_sync_lsn value. Hence if we have a large
> > + * amount of log space bound up in this committing transaction then the
> > + * last_sync_lsn value may be the limiting factor preventing tail pushing from
> > + * freeing space in the log. Hence once we've updated the last_sync_lsn we
> > + * should push the AIL to ensure the push target (and hence the grant head) is
> > + * no longer bound by the old log head location and can move forwards and make
> > + * progress again.
> > + */
> > +static void
> > +xlog_state_set_callback(
> > +	struct xlog		*log,
> > +	struct xlog_in_core	*iclog,
> > +	xfs_lsn_t		header_lsn)
> > +{
> > +	iclog->ic_state = XLOG_STATE_CALLBACK;
> > +
> > +	ASSERT(XFS_LSN_CMP(atomic64_read(&log->l_last_sync_lsn), header_lsn) <= 0);
> > +
> > +	if (list_empty_careful(&iclog->ic_callbacks))
> > +		return;
> > +
> > +	atomic64_set(&log->l_last_sync_lsn, header_lsn);
> > +	xlog_grant_push_ail(log, 0);
> > +
> 
> Nit: extra whitespace line above.

Fixed.

> This still seems racy to me, FWIW. What if the AIL is empty (i.e. the
> push is skipped)?

If the AIL is empty, then it's a no-op because pushing on the AIL is
not going to make more log space become free. Besides, that's not
the problem being solved here - reservation wakeups on first insert
into the AIL are already handled by xfs_trans_ail_update_bulk() and
hence the first patch in the series. This patch is addressing the
situation where the bulk insert that occurs from the callbacks that
are about to run -does not modify the tail of the log-. i.e. the
commit moved the head but not the tail, so we have to update the AIL
push target to take into account the new log head....

i.e. the AIL is for moving the tail of the log - this code moves the
head of the log. But both impact on the AIL push target (it is based on
the distance between the head and tail), so we need
to update the push target just in case this commit does not move
the tail...

> What if xfsaild completes this push before the
> associated log items land in the AIL or we race with xfsaild emptying
> the AIL? Why not just reuse/update the existing grant head wake up logic
> in the iclog callback itself? E.g., something like the following
> (untested):
> 
> @@ -740,12 +740,10 @@ xfs_trans_ail_update_bulk(
>  	if (mlip_changed) {
>  		if (!XFS_FORCED_SHUTDOWN(ailp->ail_mount))
>  			xlog_assign_tail_lsn_locked(ailp->ail_mount);
> -		spin_unlock(&ailp->ail_lock);
> -
> -		xfs_log_space_wake(ailp->ail_mount);
> -	} else {
> -		spin_unlock(&ailp->ail_lock);
>  	}
> +
> +	spin_unlock(&ailp->ail_lock);
> +	xfs_log_space_wake(ailp->ail_mount);

Two things that I see straight away:

1. if the AIL is empty, the first insert does not set mlip_changed =
true and and so there will be no wakeup in the scenario you are
posing. This would be easy to fix - if (!mlip || changed) - so that
a wakeup is triggered in this case.

2. if we have not moved the tail, then calling xfs_log_space_wake()
will, at best, just burn CPU. At worst, it wll cause hundreds of
thousands of spurious wakeups a seconds because the waiting
transaction reservation will be woken continuously when there isn't
space available and there hasn't been any space made available.

So, from #1 we see that unconditional wakeups are not necessary in
the scenario you pose, and from #2 it's not a viable solution even
if it was required.

However, #1 indicates other problems if a xfs_log_space_wake() call
is necessary in this case. No reservations space and an empty AIL
implies that the CIL pins the entire log - a pending commit that
hasn't finished flushing and the current context that is
aggregating. This implies we've violated a much more important rule
of the on-disk log format: finding the head and tail of the log
requires no individual commit be larger than 50% of the log.

So if we are actually stalling on trasnaction reservations with an
empty AIL and an uncommitted CIL, screwing around with tail pushing
wakeups does not address the bigger problem being seen...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 0/7] xfs: log race fixes and cleanups
  2019-09-04  5:56   ` Christoph Hellwig
@ 2019-09-04 22:57     ` Dave Chinner
       [not found]       ` <20190905065133.GA21840@infradead.org>
  0 siblings, 1 reply; 41+ messages in thread
From: Dave Chinner @ 2019-09-04 22:57 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: linux-xfs

On Tue, Sep 03, 2019 at 10:56:19PM -0700, Christoph Hellwig wrote:
> On Tue, Sep 03, 2019 at 10:26:00PM -0700, Christoph Hellwig wrote:
> > On Wed, Sep 04, 2019 at 02:24:44PM +1000, Dave Chinner wrote:
> > > HI folks,
> > > 
> > > This series of patches contain fixes for the generic/530 hangs that
> > > Chandan and Jan reported recently. The first patch in the series is
> > > the main fix for this, though in theory the last patch in the series
> > > is necessary to totally close the problem off.
> > 
> > While I haven't been active in the thread I already reported this weeks
> > ago as well..

Ok, I didn't know that, and now that I look there's no information
that I can use to determine what went wrong...

> And unfortunately generic/530 still hangs for me with this series.

Where does it hang?

> This is an x86-64 VM with 4G of RAM and virtio-blk, default mkfs.xfs
> options from current xfsprogs, 20G test and scratch fs.

That's pretty much what one of my test rigs is, except iscsi luns
rather than virtio-blk. I haven't been able to reproduce the issues,
so I'm kinda flying blind w.r.t. to testing them here. Can you
get a trace of what is happening (xfs_trans*, xfs_log*, xfs_ail*
tracepoints) so I can have a deeper look?

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 0/7] xfs: log race fixes and cleanups
       [not found]       ` <20190905065133.GA21840@infradead.org>
@ 2019-09-05  7:10         ` Dave Chinner
  2019-09-05  7:28           ` Dave Chinner
  0 siblings, 1 reply; 41+ messages in thread
From: Dave Chinner @ 2019-09-05  7:10 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: linux-xfs

On Wed, Sep 04, 2019 at 11:51:33PM -0700, Christoph Hellwig wrote:
> On Thu, Sep 05, 2019 at 08:57:16AM +1000, Dave Chinner wrote:
> > > And unfortunately generic/530 still hangs for me with this series.
> > 
> > Where does it hang?
> > 
> > > This is an x86-64 VM with 4G of RAM and virtio-blk, default mkfs.xfs
> > > options from current xfsprogs, 20G test and scratch fs.
> > 
> > That's pretty much what one of my test rigs is, except iscsi luns
> > rather than virtio-blk. I haven't been able to reproduce the issues,
> > so I'm kinda flying blind w.r.t. to testing them here. Can you
> > get a trace of what is happening (xfs_trans*, xfs_log*, xfs_ail*
> > tracepoints) so I can have a deeper look?
> 
> console output below, traces attached.

Thanks, I'll have a look in a minute. I'm pretty sure I know what it
will show - I got a trace from Chandan earlier this afternoon and
the problem is log recovery doesn't yeild the cpu until it runs out
of transaction reservation space, so the push work doesn't run
because workqueue default behaviour is strict "run work only on the
CPU it is queued on"....

I've got an additional patch in testing right now...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 0/7] xfs: log race fixes and cleanups
  2019-09-05  7:10         ` Dave Chinner
@ 2019-09-05  7:28           ` Dave Chinner
  0 siblings, 0 replies; 41+ messages in thread
From: Dave Chinner @ 2019-09-05  7:28 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: linux-xfs

On Thu, Sep 05, 2019 at 05:10:31PM +1000, Dave Chinner wrote:
> On Wed, Sep 04, 2019 at 11:51:33PM -0700, Christoph Hellwig wrote:
> > On Thu, Sep 05, 2019 at 08:57:16AM +1000, Dave Chinner wrote:
> > > > And unfortunately generic/530 still hangs for me with this series.
> > > 
> > > Where does it hang?
> > > 
> > > > This is an x86-64 VM with 4G of RAM and virtio-blk, default mkfs.xfs
> > > > options from current xfsprogs, 20G test and scratch fs.
> > > 
> > > That's pretty much what one of my test rigs is, except iscsi luns
> > > rather than virtio-blk. I haven't been able to reproduce the issues,
> > > so I'm kinda flying blind w.r.t. to testing them here. Can you
> > > get a trace of what is happening (xfs_trans*, xfs_log*, xfs_ail*
> > > tracepoints) so I can have a deeper look?
> > 
> > console output below, traces attached.
> 
> Thanks, I'll have a look in a minute. I'm pretty sure I know what it
> will show - I got a trace from Chandan earlier this afternoon and
> the problem is log recovery doesn't yeild the cpu until it runs out
> of transaction reservation space, so the push work doesn't run
> because workqueue default behaviour is strict "run work only on the
> CPU it is queued on"....

Yup, exactly the same trace. Right down to the lsns in the log and
the 307 iclog writes just after the log runs out of space. To quote
from #xfs earlier this afternoon:

[5/9/19 14:21] <dchinner> I see what is -likely- to be a cil checkpoint but without the closing commit record
[5/9/19 14:21] <chandan> which line number in the trace log are you noticing that?
[5/9/19 14:22] <dchinner> 307 sequential calls to xfs_log_assign_tail_lsn() from a kworker and then releasing a log reservation
[5/9/19 14:22] <dchinner> Assuming 32kB iclog size (default)
[5/9/19 14:23] <dchinner> thats 307 * 32 / 4 filesystem blocks, which is 2456 blocks
[5/9/19 14:24] <dchinner> that's 96% of the log in a single CIL commit
[5/9/19 14:24] <dchinner> this isn't a "why hasn't there been iclog completion" problem
[5/9/19 14:24] <dchinner> this is a "why didn't the CIL push occur when it passed 12% of the log...
[5/9/19 14:25] <dchinner> ?
[5/9/19 14:25] <dchinner> " problem
[5/9/19 14:26] <dchinner> oooohhhh
[5/9/19 14:27] <dchinner> this isn't a premeptible kernel, is it?
[5/9/19 14:27] <chandan> correct. Linux kernel on ppc64le isn't prememptible
[5/9/19 14:28] <dchinner> so a kernel thread running in a tight loop wil delay a kworker thread scheduled on the same CPU until running kthread yields the CPU
[5/9/19 14:28] <dchinner> but, because we've recovered all the inodes, etc, everything is hot in cache
[5/9/19 14:28] <dchinner> so the unlink workload runs without blocking, and so never yeilds the CPU until it runs out of transaction space.
[5/9/19 14:29] <dchinner> and only then does the background kworker get scheduled to run.

I'll send the updated patch set soon...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 7/7] xfs: push the grant head when the log head moves forward
  2019-09-04 22:50     ` Dave Chinner
@ 2019-09-05 16:25       ` Brian Foster
  2019-09-06  0:02         ` Dave Chinner
  0 siblings, 1 reply; 41+ messages in thread
From: Brian Foster @ 2019-09-05 16:25 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Thu, Sep 05, 2019 at 08:50:56AM +1000, Dave Chinner wrote:
> On Wed, Sep 04, 2019 at 03:34:42PM -0400, Brian Foster wrote:
> > On Wed, Sep 04, 2019 at 02:24:51PM +1000, Dave Chinner wrote:
> > > From: Dave Chinner <dchinner@redhat.com>
> > > +/*
> > > + * Completion of a iclog IO does not imply that a transaction has completed, as
> > > + * transactions can be large enough to span many iclogs. We cannot change the
> > > + * tail of the log half way through a transaction as this may be the only
> > > + * transaction in the log and moving the tail to point to the middle of it
> > > + * will prevent recovery from finding the start of the transaction. Hence we
> > > + * should only update the last_sync_lsn if this iclog contains transaction
> > > + * completion callbacks on it.
> > > + *
> > > + * We have to do this before we drop the icloglock to ensure we are the only one
> > > + * that can update it.
> > > + *
> > > + * If we are moving the last_sync_lsn forwards, we also need to ensure we kick
> > > + * the reservation grant head pushing. This is due to the fact that the push
> > > + * target is bound by the current last_sync_lsn value. Hence if we have a large
> > > + * amount of log space bound up in this committing transaction then the
> > > + * last_sync_lsn value may be the limiting factor preventing tail pushing from
> > > + * freeing space in the log. Hence once we've updated the last_sync_lsn we
> > > + * should push the AIL to ensure the push target (and hence the grant head) is
> > > + * no longer bound by the old log head location and can move forwards and make
> > > + * progress again.
> > > + */
> > > +static void
> > > +xlog_state_set_callback(
> > > +	struct xlog		*log,
> > > +	struct xlog_in_core	*iclog,
> > > +	xfs_lsn_t		header_lsn)
> > > +{
> > > +	iclog->ic_state = XLOG_STATE_CALLBACK;
> > > +
> > > +	ASSERT(XFS_LSN_CMP(atomic64_read(&log->l_last_sync_lsn), header_lsn) <= 0);
> > > +
> > > +	if (list_empty_careful(&iclog->ic_callbacks))
> > > +		return;
> > > +
> > > +	atomic64_set(&log->l_last_sync_lsn, header_lsn);
> > > +	xlog_grant_push_ail(log, 0);
> > > +
> > 
> > Nit: extra whitespace line above.
> 
> Fixed.
> 
> > This still seems racy to me, FWIW. What if the AIL is empty (i.e. the
> > push is skipped)?
> 
> If the AIL is empty, then it's a no-op because pushing on the AIL is
> not going to make more log space become free. Besides, that's not
> the problem being solved here - reservation wakeups on first insert
> into the AIL are already handled by xfs_trans_ail_update_bulk() and
> hence the first patch in the series. This patch is addressing the

Nothing currently wakes up reservation waiters on first AIL insertion. I
pointed this out in the original thread along with the fact that the
push is a no-op for an empty AIL. What wasn't clear to me is whether it
matters for the problem this patch is trying to fix. It sounds like not,
but that's a separate question from whether this is a problem itself.

> situation where the bulk insert that occurs from the callbacks that
> are about to run -does not modify the tail of the log-. i.e. the
> commit moved the head but not the tail, so we have to update the AIL
> push target to take into account the new log head....
> 

Ok, I figured based on process of elimination. xfs_ail_push() ignores
the push on an empty AIL and we obviously already have wakeups on tail
updates.

> i.e. the AIL is for moving the tail of the log - this code moves the
> head of the log. But both impact on the AIL push target (it is based on
> the distance between the head and tail), so we need
> to update the push target just in case this commit does not move
> the tail...
> 
> > What if xfsaild completes this push before the
> > associated log items land in the AIL or we race with xfsaild emptying
> > the AIL? Why not just reuse/update the existing grant head wake up logic
> > in the iclog callback itself? E.g., something like the following
> > (untested):
> > 

And the raciness concerns..? AFAICT this still opens a race window where
the AIL can idle on the push target before AIL insertion.

> > @@ -740,12 +740,10 @@ xfs_trans_ail_update_bulk(
> >  	if (mlip_changed) {
> >  		if (!XFS_FORCED_SHUTDOWN(ailp->ail_mount))
> >  			xlog_assign_tail_lsn_locked(ailp->ail_mount);
> > -		spin_unlock(&ailp->ail_lock);
> > -
> > -		xfs_log_space_wake(ailp->ail_mount);
> > -	} else {
> > -		spin_unlock(&ailp->ail_lock);
> >  	}
> > +
> > +	spin_unlock(&ailp->ail_lock);
> > +	xfs_log_space_wake(ailp->ail_mount);
> 
> Two things that I see straight away:
> 
> 1. if the AIL is empty, the first insert does not set mlip_changed =
> true and and so there will be no wakeup in the scenario you are
> posing. This would be easy to fix - if (!mlip || changed) - so that
> a wakeup is triggered in this case.
> 

This (again) was what I suggested originally in Chandan's thread for the
empty AIL case.

> 2. if we have not moved the tail, then calling xfs_log_space_wake()
> will, at best, just burn CPU. At worst, it wll cause hundreds of
> thousands of spurious wakeups a seconds because the waiting
> transaction reservation will be woken continuously when there isn't
> space available and there hasn't been any space made available.
> 

Yes, I can see how that would be problematic with the diff I posted
above. It's also something that can be easily fixed. Note that I think
there's another potential side effect of that diff in terms of
amplifying pressure on the AIL because we don't know whether the waiters
were blocked because of pent up in-core reservation consumption or
simply because the tail is pinned. That said, I think both patches share
that particular quirk.

Either way, this doesn't address the raciness concern I have with this
patch. If you're wedded to this particular approach, then the simplest
fix is probably to just reorder the xlog_grans_push_ail() call properly
after processing iclog callbacks. A more appropriate fix, IMO, would be
to either export this logic to where the AIL update happens and/or
enhance the existing log space wake up logic to filter wakeups in the
scenarios where it is not necessary (i.e. no tail update &&
xa_push_target == max_lsn), but this is more subjective...

> So, from #1 we see that unconditional wakeups are not necessary in
> the scenario you pose, and from #2 it's not a viable solution even
> if it was required.
> 
> However, #1 indicates other problems if a xfs_log_space_wake() call
> is necessary in this case. No reservations space and an empty AIL
> implies that the CIL pins the entire log - a pending commit that
> hasn't finished flushing and the current context that is
> aggregating. This implies we've violated a much more important rule
> of the on-disk log format: finding the head and tail of the log
> requires no individual commit be larger than 50% of the log.
> 

I described this exact problem days ago in the original thread. There's
no need to rehash it here. FWIW, I can reproduce much worse than 50% log
consumption aggregated outside of the AIL with the current code and it
doesn't depend on a nonpreemptible kernel (though the workqueue fix
looks legit to me).

Brian

> So if we are actually stalling on trasnaction reservations with an
> empty AIL and an uncommitted CIL, screwing around with tail pushing
> wakeups does not address the bigger problem being seen...
> 
> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 7/7] xfs: push the grant head when the log head moves forward
  2019-09-05 16:25       ` Brian Foster
@ 2019-09-06  0:02         ` Dave Chinner
  2019-09-06 13:10           ` Brian Foster
  0 siblings, 1 reply; 41+ messages in thread
From: Dave Chinner @ 2019-09-06  0:02 UTC (permalink / raw)
  To: Brian Foster; +Cc: linux-xfs

On Thu, Sep 05, 2019 at 12:25:33PM -0400, Brian Foster wrote:
> On Thu, Sep 05, 2019 at 08:50:56AM +1000, Dave Chinner wrote:
> > On Wed, Sep 04, 2019 at 03:34:42PM -0400, Brian Foster wrote:
> > > On Wed, Sep 04, 2019 at 02:24:51PM +1000, Dave Chinner wrote:
> > > > From: Dave Chinner <dchinner@redhat.com>
> > > > +/*
> > > > + * Completion of a iclog IO does not imply that a transaction has completed, as
> > > > + * transactions can be large enough to span many iclogs. We cannot change the
> > > > + * tail of the log half way through a transaction as this may be the only
> > > > + * transaction in the log and moving the tail to point to the middle of it
> > > > + * will prevent recovery from finding the start of the transaction. Hence we
> > > > + * should only update the last_sync_lsn if this iclog contains transaction
> > > > + * completion callbacks on it.
> > > > + *
> > > > + * We have to do this before we drop the icloglock to ensure we are the only one
> > > > + * that can update it.
> > > > + *
> > > > + * If we are moving the last_sync_lsn forwards, we also need to ensure we kick
> > > > + * the reservation grant head pushing. This is due to the fact that the push
> > > > + * target is bound by the current last_sync_lsn value. Hence if we have a large
> > > > + * amount of log space bound up in this committing transaction then the
> > > > + * last_sync_lsn value may be the limiting factor preventing tail pushing from
> > > > + * freeing space in the log. Hence once we've updated the last_sync_lsn we
> > > > + * should push the AIL to ensure the push target (and hence the grant head) is
> > > > + * no longer bound by the old log head location and can move forwards and make
> > > > + * progress again.
> > > > + */
> > > > +static void
> > > > +xlog_state_set_callback(
> > > > +	struct xlog		*log,
> > > > +	struct xlog_in_core	*iclog,
> > > > +	xfs_lsn_t		header_lsn)
> > > > +{
> > > > +	iclog->ic_state = XLOG_STATE_CALLBACK;
> > > > +
> > > > +	ASSERT(XFS_LSN_CMP(atomic64_read(&log->l_last_sync_lsn), header_lsn) <= 0);
> > > > +
> > > > +	if (list_empty_careful(&iclog->ic_callbacks))
> > > > +		return;
> > > > +
> > > > +	atomic64_set(&log->l_last_sync_lsn, header_lsn);
> > > > +	xlog_grant_push_ail(log, 0);
> > > > +
> > > 
> > > Nit: extra whitespace line above.
> > 
> > Fixed.
> > 
> > > This still seems racy to me, FWIW. What if the AIL is empty (i.e. the
> > > push is skipped)?
> > 
> > If the AIL is empty, then it's a no-op because pushing on the AIL is
> > not going to make more log space become free. Besides, that's not
> > the problem being solved here - reservation wakeups on first insert
> > into the AIL are already handled by xfs_trans_ail_update_bulk() and
> > hence the first patch in the series. This patch is addressing the
> 
> Nothing currently wakes up reservation waiters on first AIL insertion.

Nor should it be necessary - it's the removal from the AIL that
frees up log space, not insertion. The update operation is a
remove followed by an insert - the remove part of that operation is
what may free up log space, not the insert.

Hence if we need to wake the log reservation waiters on first AIL
insert to fix a bug, we haven't found the underlying problem is
preventing log space from being freed...
>
> > i.e. the AIL is for moving the tail of the log - this code moves the
> > head of the log. But both impact on the AIL push target (it is based on
> > the distance between the head and tail), so we need
> > to update the push target just in case this commit does not move
> > the tail...
> > 
> > > What if xfsaild completes this push before the
> > > associated log items land in the AIL or we race with xfsaild emptying
> > > the AIL? Why not just reuse/update the existing grant head wake up logic
> > > in the iclog callback itself? E.g., something like the following
> > > (untested):
> > > 
> 
> And the raciness concerns..? AFAICT this still opens a race window where
> the AIL can idle on the push target before AIL insertion.

I don't know what race you see - if the AIL completes a push before
we insert new objects at the head from the current commit, then it
does not matter one bit because the items are being inserted at the
log head, not the log tail where the pushing occurs at. If we are
inserting objects into the AIL within the push target window, then
there is something else very wrong going on, because when the log is
out of space the push target should be nowhere near the LSN we are
inserting inew objects into the AIL at. (i.e. they should be 3/4s of
the log apart...)

> > So, from #1 we see that unconditional wakeups are not necessary in
> > the scenario you pose, and from #2 it's not a viable solution even
> > if it was required.
> > 
> > However, #1 indicates other problems if a xfs_log_space_wake() call
> > is necessary in this case. No reservations space and an empty AIL
> > implies that the CIL pins the entire log - a pending commit that
> > hasn't finished flushing and the current context that is
> > aggregating. This implies we've violated a much more important rule
> > of the on-disk log format: finding the head and tail of the log
> > requires no individual commit be larger than 50% of the log.
> > 
> 
> I described this exact problem days ago in the original thread. There's
> no need to rehash it here. FWIW, I can reproduce much worse than 50% log
> consumption aggregated outside of the AIL with the current code and it
> doesn't depend on a nonpreemptible kernel (though the workqueue fix
> looks legit to me).

I'm not rehashing anything intentionally - I'm responding to the
questions you are asking me directly in this thread. Maybe I am
going over something you've already mentioned in a previous thread,
and maybe that hasn't occurred to me because you didn't reference it
and the similarites didn't occur to me because I've spend more time
looking at the code trying to understand how this "impossible
situation" was occurring than reading mailing list discussions.

I've been certain that we were seeing was some fundamental rule was
being violated to cause this "log full, AIL empty", but I couldn't
work out exactly what it was. I was even questioning whether I
understood the basic operation of the log because there was no way I
could see that CIL would not push during log recovery until the log
was full.  I said this to Darrick yesterday morning on #xfs:

[5/9/19 12:56] <dchinner> there's something bothering me about this
log head update thing and I can't put my finger on what it is....

It wasn't until Chandan's trace showed me the CPU hold-off problem
with the CIL workqueue. A couple of hours later, after I'd seen
Chandan's trace:

[5/9/19 14:26] <dchinner> oooohhhh
[5/9/19 14:27] <dchinner> this isn't a premeptible kernel, is it?

And that was the thing that I couldn't put my finger on - I couldn't
work out how a CIL push was being delayed so long on a multi-cpu
system with lots of idle CPU that we had a completely empty AIL when
we ran out of reservation space.  IOWs, I didn't know the right
question to ask until I saw the answer in front of me.

I've never seen a "CIL checkpoint too large" issue manifiest in the
real world, but it's been there since delayed logging was
introduced. I knew about this issue right from the start, but it was
largely a theoretical concern because workqueue scheduling preempts
userspace and so is mostly only ever delayed by the number of
transactions in a single syscall. And for large, ongoing
transactions like a truncate, it will yield the moment we have to
pull in metadata from disk.

What's new in recent kernels is the in-core inode unlinked
processing mechanisms have changed the way both the syscall and log
recovery mechanisms work (merged in 5.1, IIRC), and it looks like it
no longer blocks in log recovery like it used to. Given Christoph
first reported this generic/530 issue in May there's a fair
correlation indicating that the two are linked.

i.e. we changed the unlinked inode processing in a way that
the kernel can now runs tens of thousands of unlink transactions
without yeilding the CPU. That violated the "CIL push work will run
within a few transactions of the background push occurring"
mechanism the workqueue provided us with and that, fundamentally, is
the underlying issue here. It's not a CIL vs empty AIL vs log
reservation exhaustion race condition - that's just an observable
symptom.

To that end, I have been prototyping patches to fix this exact
problem as part of the non-blocking inode reclaim series. I've been
looking at this because the CIL pins so much memory on large logs
and I wanted to put an upper bound on it that wasn't measured in GBs
of RAM. Hence I'm planning to pull these out into a separate series
now as it's clear that non-preemptible kernels and workqueues do not
play well together and that the more we use workqueues for async
processing, the more we introduce a potential real-world vector for
CIL overruns...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 7/7] xfs: push the grant head when the log head moves forward
  2019-09-06  0:02         ` Dave Chinner
@ 2019-09-06 13:10           ` Brian Foster
  2019-09-07 15:10             ` Brian Foster
  0 siblings, 1 reply; 41+ messages in thread
From: Brian Foster @ 2019-09-06 13:10 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Fri, Sep 06, 2019 at 10:02:05AM +1000, Dave Chinner wrote:
> On Thu, Sep 05, 2019 at 12:25:33PM -0400, Brian Foster wrote:
> > On Thu, Sep 05, 2019 at 08:50:56AM +1000, Dave Chinner wrote:
> > > On Wed, Sep 04, 2019 at 03:34:42PM -0400, Brian Foster wrote:
> > > > On Wed, Sep 04, 2019 at 02:24:51PM +1000, Dave Chinner wrote:
> > > > > From: Dave Chinner <dchinner@redhat.com>
> > > > > +/*
> > > > > + * Completion of a iclog IO does not imply that a transaction has completed, as
> > > > > + * transactions can be large enough to span many iclogs. We cannot change the
> > > > > + * tail of the log half way through a transaction as this may be the only
> > > > > + * transaction in the log and moving the tail to point to the middle of it
> > > > > + * will prevent recovery from finding the start of the transaction. Hence we
> > > > > + * should only update the last_sync_lsn if this iclog contains transaction
> > > > > + * completion callbacks on it.
> > > > > + *
> > > > > + * We have to do this before we drop the icloglock to ensure we are the only one
> > > > > + * that can update it.
> > > > > + *
> > > > > + * If we are moving the last_sync_lsn forwards, we also need to ensure we kick
> > > > > + * the reservation grant head pushing. This is due to the fact that the push
> > > > > + * target is bound by the current last_sync_lsn value. Hence if we have a large
> > > > > + * amount of log space bound up in this committing transaction then the
> > > > > + * last_sync_lsn value may be the limiting factor preventing tail pushing from
> > > > > + * freeing space in the log. Hence once we've updated the last_sync_lsn we
> > > > > + * should push the AIL to ensure the push target (and hence the grant head) is
> > > > > + * no longer bound by the old log head location and can move forwards and make
> > > > > + * progress again.
> > > > > + */
> > > > > +static void
> > > > > +xlog_state_set_callback(
> > > > > +	struct xlog		*log,
> > > > > +	struct xlog_in_core	*iclog,
> > > > > +	xfs_lsn_t		header_lsn)
> > > > > +{
> > > > > +	iclog->ic_state = XLOG_STATE_CALLBACK;
> > > > > +
> > > > > +	ASSERT(XFS_LSN_CMP(atomic64_read(&log->l_last_sync_lsn), header_lsn) <= 0);
> > > > > +
> > > > > +	if (list_empty_careful(&iclog->ic_callbacks))
> > > > > +		return;
> > > > > +
> > > > > +	atomic64_set(&log->l_last_sync_lsn, header_lsn);
> > > > > +	xlog_grant_push_ail(log, 0);
> > > > > +
> > > > 
> > > > Nit: extra whitespace line above.
> > > 
> > > Fixed.
> > > 
> > > > This still seems racy to me, FWIW. What if the AIL is empty (i.e. the
> > > > push is skipped)?
> > > 
> > > If the AIL is empty, then it's a no-op because pushing on the AIL is
> > > not going to make more log space become free. Besides, that's not
> > > the problem being solved here - reservation wakeups on first insert
> > > into the AIL are already handled by xfs_trans_ail_update_bulk() and
> > > hence the first patch in the series. This patch is addressing the
> > 
> > Nothing currently wakes up reservation waiters on first AIL insertion.
> 
> Nor should it be necessary - it's the removal from the AIL that
> frees up log space, not insertion. The update operation is a
> remove followed by an insert - the remove part of that operation is
> what may free up log space, not the insert.
> 

Just above you wrote: "reservation wakeups on first insert into the AIL
are already handled by xfs_trans_ail_update_bulk()". My reply was just
to point out that there are no wakeups in that case.

> Hence if we need to wake the log reservation waiters on first AIL
> insert to fix a bug, we haven't found the underlying problem is
> preventing log space from being freed...
> >
> > > i.e. the AIL is for moving the tail of the log - this code moves the
> > > head of the log. But both impact on the AIL push target (it is based on
> > > the distance between the head and tail), so we need
> > > to update the push target just in case this commit does not move
> > > the tail...
> > > 
> > > > What if xfsaild completes this push before the
> > > > associated log items land in the AIL or we race with xfsaild emptying
> > > > the AIL? Why not just reuse/update the existing grant head wake up logic
> > > > in the iclog callback itself? E.g., something like the following
> > > > (untested):
> > > > 
> > 
> > And the raciness concerns..? AFAICT this still opens a race window where
> > the AIL can idle on the push target before AIL insertion.
> 
> I don't know what race you see - if the AIL completes a push before
> we insert new objects at the head from the current commit, then it
> does not matter one bit because the items are being inserted at the
> log head, not the log tail where the pushing occurs at. If we are
> inserting objects into the AIL within the push target window, then
> there is something else very wrong going on, because when the log is
> out of space the push target should be nowhere near the LSN we are
> inserting inew objects into the AIL at. (i.e. they should be 3/4s of
> the log apart...)
> 

I'm not following your reasoning. It sounds to me that you're arguing it
doesn't matter that the AIL is not populated from the current commit
because the push target should be much farther behind the head. If
that's the case, why does this patch order the AIL push after a
->l_last_sync_lsn update? That's the LSN of the most recent commit
record to hit the log and hence the new physical log head.

Side note: I think the LSN of the commit record iclog is different and
actually ahead of the LSN associated with AIL insertion. I don't
necessarily think that's a problem given how the log subsystem behaves
today, but it's another subtle/undocumented (and easily avoidable) quirk
that may not always be so benign.

> > > So, from #1 we see that unconditional wakeups are not necessary in
> > > the scenario you pose, and from #2 it's not a viable solution even
> > > if it was required.
> > > 
> > > However, #1 indicates other problems if a xfs_log_space_wake() call
> > > is necessary in this case. No reservations space and an empty AIL
> > > implies that the CIL pins the entire log - a pending commit that
> > > hasn't finished flushing and the current context that is
> > > aggregating. This implies we've violated a much more important rule
> > > of the on-disk log format: finding the head and tail of the log
> > > requires no individual commit be larger than 50% of the log.
> > > 
> > 
> > I described this exact problem days ago in the original thread. There's
> > no need to rehash it here. FWIW, I can reproduce much worse than 50% log
> > consumption aggregated outside of the AIL with the current code and it
> > doesn't depend on a nonpreemptible kernel (though the workqueue fix
> > looks legit to me).
> 
...
> 
> i.e. we changed the unlinked inode processing in a way that
> the kernel can now runs tens of thousands of unlink transactions
> without yeilding the CPU. That violated the "CIL push work will run
> within a few transactions of the background push occurring"
> mechanism the workqueue provided us with and that, fundamentally, is
> the underlying issue here. It's not a CIL vs empty AIL vs log
> reservation exhaustion race condition - that's just an observable
> symptom.
> 

Yes, but the point is that's not the only thing that can delay CIL push
work. Since the AIL is not populated until the commit record iclog is
written out, and background CIL pushing doesn't flush the commit record
for the associated checkpoint before it completes, and CIL pushing
itself is serialized, a stalled commit record iclog I/O is enough to
create "log full, empty AIL" conditions.

> To that end, I have been prototyping patches to fix this exact problem
> as part of the non-blocking inode reclaim series. I've been looking at
> this because the CIL pins so much memory on large logs and I wanted to
> put an upper bound on it that wasn't measured in GBs of RAM. Hence I'm
> planning to pull these out into a separate series now as it's clear
> that non-preemptible kernels and workqueues do not play well together
> and that the more we use workqueues for async processing, the more we
> introduce a potential real-world vector for CIL overruns...
> 

Yes, I think a separate series for CIL management makes sense.

Brian

> Cheers,
> 
> Dave.  -- Dave Chinner david@fromorbit.com

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 7/7] xfs: push the grant head when the log head moves forward
  2019-09-06 13:10           ` Brian Foster
@ 2019-09-07 15:10             ` Brian Foster
  2019-09-08 23:26               ` Dave Chinner
  0 siblings, 1 reply; 41+ messages in thread
From: Brian Foster @ 2019-09-07 15:10 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Fri, Sep 06, 2019 at 09:10:14AM -0400, Brian Foster wrote:
> On Fri, Sep 06, 2019 at 10:02:05AM +1000, Dave Chinner wrote:
> > On Thu, Sep 05, 2019 at 12:25:33PM -0400, Brian Foster wrote:
> > > On Thu, Sep 05, 2019 at 08:50:56AM +1000, Dave Chinner wrote:
> > > > On Wed, Sep 04, 2019 at 03:34:42PM -0400, Brian Foster wrote:
> > > > > On Wed, Sep 04, 2019 at 02:24:51PM +1000, Dave Chinner wrote:
> > > > > > From: Dave Chinner <dchinner@redhat.com>
> > > > > > +/*
> > > > > > + * Completion of a iclog IO does not imply that a transaction has completed, as
> > > > > > + * transactions can be large enough to span many iclogs. We cannot change the
> > > > > > + * tail of the log half way through a transaction as this may be the only
> > > > > > + * transaction in the log and moving the tail to point to the middle of it
> > > > > > + * will prevent recovery from finding the start of the transaction. Hence we
> > > > > > + * should only update the last_sync_lsn if this iclog contains transaction
> > > > > > + * completion callbacks on it.
> > > > > > + *
> > > > > > + * We have to do this before we drop the icloglock to ensure we are the only one
> > > > > > + * that can update it.
> > > > > > + *
> > > > > > + * If we are moving the last_sync_lsn forwards, we also need to ensure we kick
> > > > > > + * the reservation grant head pushing. This is due to the fact that the push
> > > > > > + * target is bound by the current last_sync_lsn value. Hence if we have a large
> > > > > > + * amount of log space bound up in this committing transaction then the
> > > > > > + * last_sync_lsn value may be the limiting factor preventing tail pushing from
> > > > > > + * freeing space in the log. Hence once we've updated the last_sync_lsn we
> > > > > > + * should push the AIL to ensure the push target (and hence the grant head) is
> > > > > > + * no longer bound by the old log head location and can move forwards and make
> > > > > > + * progress again.
> > > > > > + */
> > > > > > +static void
> > > > > > +xlog_state_set_callback(
> > > > > > +	struct xlog		*log,
> > > > > > +	struct xlog_in_core	*iclog,
> > > > > > +	xfs_lsn_t		header_lsn)
> > > > > > +{
> > > > > > +	iclog->ic_state = XLOG_STATE_CALLBACK;
> > > > > > +
> > > > > > +	ASSERT(XFS_LSN_CMP(atomic64_read(&log->l_last_sync_lsn), header_lsn) <= 0);
> > > > > > +
> > > > > > +	if (list_empty_careful(&iclog->ic_callbacks))
> > > > > > +		return;
> > > > > > +
> > > > > > +	atomic64_set(&log->l_last_sync_lsn, header_lsn);
> > > > > > +	xlog_grant_push_ail(log, 0);
> > > > > > +
> > > > > 
> > > > > Nit: extra whitespace line above.
> > > > 
> > > > Fixed.
> > > > 
> > > > > This still seems racy to me, FWIW. What if the AIL is empty (i.e. the
> > > > > push is skipped)?
> > > > 
> > > > If the AIL is empty, then it's a no-op because pushing on the AIL is
> > > > not going to make more log space become free. Besides, that's not
> > > > the problem being solved here - reservation wakeups on first insert
> > > > into the AIL are already handled by xfs_trans_ail_update_bulk() and
> > > > hence the first patch in the series. This patch is addressing the
> > > 
> > > Nothing currently wakes up reservation waiters on first AIL insertion.
> > 
> > Nor should it be necessary - it's the removal from the AIL that
> > frees up log space, not insertion. The update operation is a
> > remove followed by an insert - the remove part of that operation is
> > what may free up log space, not the insert.
> > 
> 
> Just above you wrote: "reservation wakeups on first insert into the AIL
> are already handled by xfs_trans_ail_update_bulk()". My reply was just
> to point out that there are no wakeups in that case.
> 
> > Hence if we need to wake the log reservation waiters on first AIL
> > insert to fix a bug, we haven't found the underlying problem is
> > preventing log space from being freed...
> > >
> > > > i.e. the AIL is for moving the tail of the log - this code moves the
> > > > head of the log. But both impact on the AIL push target (it is based on
> > > > the distance between the head and tail), so we need
> > > > to update the push target just in case this commit does not move
> > > > the tail...
> > > > 
> > > > > What if xfsaild completes this push before the
> > > > > associated log items land in the AIL or we race with xfsaild emptying
> > > > > the AIL? Why not just reuse/update the existing grant head wake up logic
> > > > > in the iclog callback itself? E.g., something like the following
> > > > > (untested):
> > > > > 
> > > 
> > > And the raciness concerns..? AFAICT this still opens a race window where
> > > the AIL can idle on the push target before AIL insertion.
> > 
> > I don't know what race you see - if the AIL completes a push before
> > we insert new objects at the head from the current commit, then it
> > does not matter one bit because the items are being inserted at the
> > log head, not the log tail where the pushing occurs at. If we are
> > inserting objects into the AIL within the push target window, then
> > there is something else very wrong going on, because when the log is
> > out of space the push target should be nowhere near the LSN we are
> > inserting inew objects into the AIL at. (i.e. they should be 3/4s of
> > the log apart...)
> > 
> 
> I'm not following your reasoning. It sounds to me that you're arguing it
> doesn't matter that the AIL is not populated from the current commit
> because the push target should be much farther behind the head. If
> that's the case, why does this patch order the AIL push after a
> ->l_last_sync_lsn update? That's the LSN of the most recent commit
> record to hit the log and hence the new physical log head.
> 
> Side note: I think the LSN of the commit record iclog is different and
> actually ahead of the LSN associated with AIL insertion. I don't
> necessarily think that's a problem given how the log subsystem behaves
> today, but it's another subtle/undocumented (and easily avoidable) quirk
> that may not always be so benign.
> 

Just to put a finer point on this (and since this seems to be the only
way I can get you to consider nontrivial feedback to your patches):

kworker/0:1H-220   [000] ...1  3869.403829: xlog_state_do_callback: 2691: l_last_sync_lsn 0x15000021f6
kworker/0:1H-220   [000] ...1  3869.403864: xfs_ail_push: 639: ail_target 0x15000021f6
      <...>-215246 [002] ...1  3875.568561: xfsaild: 403: empty (target 0x15000021f6)
      <...>-215246 [002] ....  3875.568649: xfsaild: 589: idle
kworker/0:1H-220   [000] ...1  3889.843872: xfs_trans_ail_update_bulk: 746: inserted lsn 0x1500001bf6

This is an instance of xfsaild going idle between the time this new AIL
push sets the target based on the iclog about to be committed and AIL
insertion of the associated log items, reproduced via a bit of timing
instrumentation. Don't be distracted by the timestamps or the fact that
the LSNs do not match because the log items in the AIL end up indexed by
the start lsn of the CIL checkpoint (whereas last_sync_lsn refers to the
commit record). The point is simply that xfsaild has completed a push of
a target that hasn't been inserted yet.

A couple additional notes.. I don't see further side effects in the
variant I reproduced, I suspect because we have other wakeups that
squash this transient state created by the race, but I'm not totally
sure of that. I'm also not totally convinced this is the only vector to
this problem, FWIW. It wouldn't surprise me a ton if we had some other
scenario that could result in the same problem with actual side effects,
but this is beside the point.

Brian

> > > > So, from #1 we see that unconditional wakeups are not necessary in
> > > > the scenario you pose, and from #2 it's not a viable solution even
> > > > if it was required.
> > > > 
> > > > However, #1 indicates other problems if a xfs_log_space_wake() call
> > > > is necessary in this case. No reservations space and an empty AIL
> > > > implies that the CIL pins the entire log - a pending commit that
> > > > hasn't finished flushing and the current context that is
> > > > aggregating. This implies we've violated a much more important rule
> > > > of the on-disk log format: finding the head and tail of the log
> > > > requires no individual commit be larger than 50% of the log.
> > > > 
> > > 
> > > I described this exact problem days ago in the original thread. There's
> > > no need to rehash it here. FWIW, I can reproduce much worse than 50% log
> > > consumption aggregated outside of the AIL with the current code and it
> > > doesn't depend on a nonpreemptible kernel (though the workqueue fix
> > > looks legit to me).
> > 
> ...
> > 
> > i.e. we changed the unlinked inode processing in a way that
> > the kernel can now runs tens of thousands of unlink transactions
> > without yeilding the CPU. That violated the "CIL push work will run
> > within a few transactions of the background push occurring"
> > mechanism the workqueue provided us with and that, fundamentally, is
> > the underlying issue here. It's not a CIL vs empty AIL vs log
> > reservation exhaustion race condition - that's just an observable
> > symptom.
> > 
> 
> Yes, but the point is that's not the only thing that can delay CIL push
> work. Since the AIL is not populated until the commit record iclog is
> written out, and background CIL pushing doesn't flush the commit record
> for the associated checkpoint before it completes, and CIL pushing
> itself is serialized, a stalled commit record iclog I/O is enough to
> create "log full, empty AIL" conditions.
> 
> > To that end, I have been prototyping patches to fix this exact problem
> > as part of the non-blocking inode reclaim series. I've been looking at
> > this because the CIL pins so much memory on large logs and I wanted to
> > put an upper bound on it that wasn't measured in GBs of RAM. Hence I'm
> > planning to pull these out into a separate series now as it's clear
> > that non-preemptible kernels and workqueues do not play well together
> > and that the more we use workqueues for async processing, the more we
> > introduce a potential real-world vector for CIL overruns...
> > 
> 
> Yes, I think a separate series for CIL management makes sense.
> 
> Brian
> 
> > Cheers,
> > 
> > Dave.  -- Dave Chinner david@fromorbit.com

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 7/7] xfs: push the grant head when the log head moves forward
  2019-09-07 15:10             ` Brian Foster
@ 2019-09-08 23:26               ` Dave Chinner
  2019-09-10  9:56                 ` Brian Foster
  0 siblings, 1 reply; 41+ messages in thread
From: Dave Chinner @ 2019-09-08 23:26 UTC (permalink / raw)
  To: Brian Foster; +Cc: linux-xfs

On Sat, Sep 07, 2019 at 11:10:50AM -0400, Brian Foster wrote:
> On Fri, Sep 06, 2019 at 09:10:14AM -0400, Brian Foster wrote:
> > On Fri, Sep 06, 2019 at 10:02:05AM +1000, Dave Chinner wrote:
> > > > And the raciness concerns..? AFAICT this still opens a race window where
> > > > the AIL can idle on the push target before AIL insertion.
> > > 
> > > I don't know what race you see - if the AIL completes a push before
> > > we insert new objects at the head from the current commit, then it
> > > does not matter one bit because the items are being inserted at the
> > > log head, not the log tail where the pushing occurs at. If we are
> > > inserting objects into the AIL within the push target window, then
> > > there is something else very wrong going on, because when the log is
> > > out of space the push target should be nowhere near the LSN we are
> > > inserting inew objects into the AIL at. (i.e. they should be 3/4s of
> > > the log apart...)
> > > 
> > 
> > I'm not following your reasoning. It sounds to me that you're arguing it
> > doesn't matter that the AIL is not populated from the current commit
> > because the push target should be much farther behind the head. If
> > that's the case, why does this patch order the AIL push after a
> > ->l_last_sync_lsn update? That's the LSN of the most recent commit
> > record to hit the log and hence the new physical log head.
> > 
> > Side note: I think the LSN of the commit record iclog is different and
> > actually ahead of the LSN associated with AIL insertion. I don't
> > necessarily think that's a problem given how the log subsystem behaves
> > today, but it's another subtle/undocumented (and easily avoidable) quirk
> > that may not always be so benign.
> > 
> 
> Just to put a finer point on this (and since this seems to be the only
> way I can get you to consider nontrivial feedback to your patches):

If I can't make head or tail of the problem you are describing,
exactly how am I supposed to respond? If I'm unable to get my point
across, I'd much prefer to spend my time on patches than on going
around in circles. I'm not interested in winning arguments. I'm not
interested in spending lots of time discussing theoretical problems
with the current set of fixes that don't exist once the root cause
we've already identified is fixed. My time is much better spent
fixing that root cause...

Keep in mind that I also have a lot of different, complex  things
going on at once that all require total focus while I'm looking at
them, so it can take days for me to cycle through everything and get
back to past topics. Delay doesn't mean I haven't read your response
or taken it on board, it just means I don't have time to write a
-meaingful response- straight away.

> kworker/0:1H-220   [000] ...1  3869.403829: xlog_state_do_callback: 2691: l_last_sync_lsn 0x15000021f6
> kworker/0:1H-220   [000] ...1  3869.403864: xfs_ail_push: 639: ail_target 0x15000021f6

Which implies that the log has less than 25% of space free because
we've issued a push, and that the distance we push is bound by the
log head.

>       <...>-215246 [002] ...1  3875.568561: xfsaild: 403: empty (target 0x15000021f6)
>       <...>-215246 [002] ....  3875.568649: xfsaild: 589: idle

has an empty AIL. IOWs, you are creating the situation where the CIL
has not been allowed to run and hence has violated the >50% log size
limit on transactions. This goes away once we prevent the CIL from
doing this.

> kworker/0:1H-220   [000] ...1  3889.843872: xfs_trans_ail_update_bulk: 746: inserted lsn 0x1500001bf6

Ok, so what you see here is somewhat intentional, based on the fact
that the LSN used for items is different to the LSN used for the
commit record (start of commit vs end of commit).  We don't want to
push the currently commiting items instantly to disk as that defeats
the "delayed write" behaviour the AIL uses to allow efficient
relogging to occur.

The next commit will do a similar push during with the new
l_last_sync_lsn, which causes the target to point at the new
last_sync_lsn and so all the items in the AIL from the previous
commit that haven't been relogged (pinned) in the current commit
will get pushed. i.e. commit N will cause commit (N - 1) to get
pushed.

This will continue while we are in a situation where the current log
head location is limiting the push target and we are completely out
of log reservation space. Once we get to the point where the
physical head of the log is more than 25% of the log away from the
tail, the push target will stop being limited by the l_last_sync_lsn
and we'll go back to triggering push target updates via the tail of
the log moving forwards as we currently do.  IOWs, this "log head
pushing" behaviour is likely only necessary for the first 2-3 CIL
commits of a workload, then we fall back into the normal tail
pushing scenario.

> This is an instance of xfsaild going idle between the time this
> new AIL push sets the target based on the iclog about to be
> committed and AIL insertion of the associated log items,
> reproduced via a bit of timing instrumentation.  Don't be
> distracted by the timestamps or the fact that the LSNs do not
> match because the log items in the AIL end up indexed by the start
> lsn of the CIL checkpoint (whereas last_sync_lsn refers to the
> commit record). The point is simply that xfsaild has completed a
> push of a target that hasn't been inserted yet.

AFAICT, what you are showing requires delaying of the CIL push to the
point it violates a fundamental assumption about commit sizes, which
is why I largely think it's irrelevant.

> A couple additional notes.. I don't see further side effects in the
> variant I reproduced, I suspect because we have other wakeups that
> squash this transient state created by the race,

Right, if we do run out of log space, the log reservation tail
pushing mechanisms takes over and does the right thing.

> > > i.e. we changed the unlinked inode processing in a way that
> > > the kernel can now runs tens of thousands of unlink transactions
> > > without yeilding the CPU. That violated the "CIL push work will run
> > > within a few transactions of the background push occurring"
> > > mechanism the workqueue provided us with and that, fundamentally, is
> > > the underlying issue here. It's not a CIL vs empty AIL vs log
> > > reservation exhaustion race condition - that's just an observable
> > > symptom.
> > > 
> > 
> > Yes, but the point is that's not the only thing that can delay CIL push
> > work. Since the AIL is not populated until the commit record iclog is
> > written out, and background CIL pushing doesn't flush the commit record
> > for the associated checkpoint before it completes, and CIL pushing
> > itself is serialized, a stalled commit record iclog I/O is enough to
> > create "log full, empty AIL" conditions.

CIL pushing is not actually serialised. Ordered, yes, serialised,
no. ANd stalling an iclog with a commit record should not cause the
log to fill completely - the next CIL push when it overflows should
get it moving long before the log runs out of reservation space.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 7/7] xfs: push the grant head when the log head moves forward
  2019-09-08 23:26               ` Dave Chinner
@ 2019-09-10  9:56                 ` Brian Foster
  2019-09-10 23:38                   ` Dave Chinner
  0 siblings, 1 reply; 41+ messages in thread
From: Brian Foster @ 2019-09-10  9:56 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Mon, Sep 09, 2019 at 09:26:32AM +1000, Dave Chinner wrote:
> On Sat, Sep 07, 2019 at 11:10:50AM -0400, Brian Foster wrote:
> > On Fri, Sep 06, 2019 at 09:10:14AM -0400, Brian Foster wrote:
> > > On Fri, Sep 06, 2019 at 10:02:05AM +1000, Dave Chinner wrote:
> > > > > And the raciness concerns..? AFAICT this still opens a race window where
> > > > > the AIL can idle on the push target before AIL insertion.
> > > > 
> > > > I don't know what race you see - if the AIL completes a push before
> > > > we insert new objects at the head from the current commit, then it
> > > > does not matter one bit because the items are being inserted at the
> > > > log head, not the log tail where the pushing occurs at. If we are
> > > > inserting objects into the AIL within the push target window, then
> > > > there is something else very wrong going on, because when the log is
> > > > out of space the push target should be nowhere near the LSN we are
> > > > inserting inew objects into the AIL at. (i.e. they should be 3/4s of
> > > > the log apart...)
> > > > 
> > > 
> > > I'm not following your reasoning. It sounds to me that you're arguing it
> > > doesn't matter that the AIL is not populated from the current commit
> > > because the push target should be much farther behind the head. If
> > > that's the case, why does this patch order the AIL push after a
> > > ->l_last_sync_lsn update? That's the LSN of the most recent commit
> > > record to hit the log and hence the new physical log head.
> > > 
> > > Side note: I think the LSN of the commit record iclog is different and
> > > actually ahead of the LSN associated with AIL insertion. I don't
> > > necessarily think that's a problem given how the log subsystem behaves
> > > today, but it's another subtle/undocumented (and easily avoidable) quirk
> > > that may not always be so benign.
> > > 
> > 
> > Just to put a finer point on this (and since this seems to be the only
> > way I can get you to consider nontrivial feedback to your patches):
> 
> If I can't make head or tail of the problem you are describing,
> exactly how am I supposed to respond? If I'm unable to get my point
> across, I'd much prefer to spend my time on patches than on going
> around in circles. I'm not interested in winning arguments. I'm not
> interested in spending lots of time discussing theoretical problems
> with the current set of fixes that don't exist once the root cause
> we've already identified is fixed. My time is much better spent
> fixing that root cause...
> 
> Keep in mind that I also have a lot of different, complex  things
> going on at once that all require total focus while I'm looking at
> them, so it can take days for me to cycle through everything and get
> back to past topics. Delay doesn't mean I haven't read your response
> or taken it on board, it just means I don't have time to write a
> -meaingful response- straight away.
> 
> > kworker/0:1H-220   [000] ...1  3869.403829: xlog_state_do_callback: 2691: l_last_sync_lsn 0x15000021f6
> > kworker/0:1H-220   [000] ...1  3869.403864: xfs_ail_push: 639: ail_target 0x15000021f6
> 
> Which implies that the log has less than 25% of space free because
> we've issued a push, and that the distance we push is bound by the
> log head.
> 
> >       <...>-215246 [002] ...1  3875.568561: xfsaild: 403: empty (target 0x15000021f6)
> >       <...>-215246 [002] ....  3875.568649: xfsaild: 589: idle
> 
> has an empty AIL. IOWs, you are creating the situation where the CIL
> has not been allowed to run and hence has violated the >50% log size
> limit on transactions. This goes away once we prevent the CIL from
> doing this.
> 
> > kworker/0:1H-220   [000] ...1  3889.843872: xfs_trans_ail_update_bulk: 746: inserted lsn 0x1500001bf6
> 
> Ok, so what you see here is somewhat intentional, based on the fact
> that the LSN used for items is different to the LSN used for the
> commit record (start of commit vs end of commit).  We don't want to
> push the currently commiting items instantly to disk as that defeats
> the "delayed write" behaviour the AIL uses to allow efficient
> relogging to occur.
> 
> The next commit will do a similar push during with the new
> l_last_sync_lsn, which causes the target to point at the new
> last_sync_lsn and so all the items in the AIL from the previous
> commit that haven't been relogged (pinned) in the current commit
> will get pushed. i.e. commit N will cause commit (N - 1) to get
> pushed.
> 
> This will continue while we are in a situation where the current log
> head location is limiting the push target and we are completely out
> of log reservation space. Once we get to the point where the
> physical head of the log is more than 25% of the log away from the
> tail, the push target will stop being limited by the l_last_sync_lsn
> and we'll go back to triggering push target updates via the tail of
> the log moving forwards as we currently do.  IOWs, this "log head
> pushing" behaviour is likely only necessary for the first 2-3 CIL
> commits of a workload, then we fall back into the normal tail
> pushing scenario.
> 
> > This is an instance of xfsaild going idle between the time this
> > new AIL push sets the target based on the iclog about to be
> > committed and AIL insertion of the associated log items,
> > reproduced via a bit of timing instrumentation.  Don't be
> > distracted by the timestamps or the fact that the LSNs do not
> > match because the log items in the AIL end up indexed by the start
> > lsn of the CIL checkpoint (whereas last_sync_lsn refers to the
> > commit record). The point is simply that xfsaild has completed a
> > push of a target that hasn't been inserted yet.
> 
> AFAICT, what you are showing requires delaying of the CIL push to the
> point it violates a fundamental assumption about commit sizes, which
> is why I largely think it's irrelevant.
> 

The CIL checkpoint size is an unrelated side effect of the test I
happened to use, not a fundamental cause of the problem it demonstrates.
Fixing CIL checkpoint size issues won't change anything. Here's a
different variant of the same problem with a small enough number of log
items such that background CIL pushing is not a factor:

       <...>-79670 [000] ...1 56126.015522: xfs_log_force: dev 253:4 lsn 0x0 caller xfs_log_worker+0x2f/0xf0 [xfs]
kworker/0:1H-220   [000] ...1 56126.030587: __xlog_grant_push_ail: 1596: threshold_lsn 0x1000032e4
	...
       <...>-81293 [000] ...2 56126.032647: xfs_ail_delete: dev 253:4 lip 00000000cbe82125 old lsn 1/13026 new lsn 1/13026 type XFS_LI_INODE flags IN_AIL
       <...>-81633 [000] .... 56126.053544: xfsaild: 588: idle ->ail_target 0x1000032e4
kworker/0:1H-220   [000] ...2 56127.038835: xfs_ail_insert: dev 253:4 lip 00000000a44ab1ef old lsn 0/0 new lsn 1/13028 type XFS_LI_INODE flags IN_AIL
kworker/0:1H-220   [000] ...2 56127.038911: xfs_ail_insert: dev 253:4 lip 0000000028d2061f old lsn 0/0 new lsn 1/13028 type XFS_LI_INODE flags IN_AIL
	....

This sequence starts with one log item in the AIL and some number of
items in the CIL such that a checkpoint executes from the background log
worker. The worker forces the CIL and log I/O completion issues an AIL
push that is truncated by the recently updated ->l_last_sync_lsn due to
outstanding transaction reservation and small AIL size. This push races
with completion of a previous push that empties the AIL and iclog
callbacks insert log items for the current checkpoint at the LSN target
xfsaild just idled at.

Brian

> > A couple additional notes.. I don't see further side effects in the
> > variant I reproduced, I suspect because we have other wakeups that
> > squash this transient state created by the race,
> 
> Right, if we do run out of log space, the log reservation tail
> pushing mechanisms takes over and does the right thing.
> 
> > > > i.e. we changed the unlinked inode processing in a way that
> > > > the kernel can now runs tens of thousands of unlink transactions
> > > > without yeilding the CPU. That violated the "CIL push work will run
> > > > within a few transactions of the background push occurring"
> > > > mechanism the workqueue provided us with and that, fundamentally, is
> > > > the underlying issue here. It's not a CIL vs empty AIL vs log
> > > > reservation exhaustion race condition - that's just an observable
> > > > symptom.
> > > > 
> > > 
> > > Yes, but the point is that's not the only thing that can delay CIL push
> > > work. Since the AIL is not populated until the commit record iclog is
> > > written out, and background CIL pushing doesn't flush the commit record
> > > for the associated checkpoint before it completes, and CIL pushing
> > > itself is serialized, a stalled commit record iclog I/O is enough to
> > > create "log full, empty AIL" conditions.
> 
> CIL pushing is not actually serialised. Ordered, yes, serialised,
> no. ANd stalling an iclog with a commit record should not cause the
> log to fill completely - the next CIL push when it overflows should
> get it moving long before the log runs out of reservation space.
> 
> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 7/7] xfs: push the grant head when the log head moves forward
  2019-09-10  9:56                 ` Brian Foster
@ 2019-09-10 23:38                   ` Dave Chinner
  2019-09-12 13:46                     ` Brian Foster
  0 siblings, 1 reply; 41+ messages in thread
From: Dave Chinner @ 2019-09-10 23:38 UTC (permalink / raw)
  To: Brian Foster; +Cc: linux-xfs

On Tue, Sep 10, 2019 at 05:56:28AM -0400, Brian Foster wrote:
> On Mon, Sep 09, 2019 at 09:26:32AM +1000, Dave Chinner wrote:
> > On Sat, Sep 07, 2019 at 11:10:50AM -0400, Brian Foster wrote:
> > > This is an instance of xfsaild going idle between the time this
> > > new AIL push sets the target based on the iclog about to be
> > > committed and AIL insertion of the associated log items,
> > > reproduced via a bit of timing instrumentation.  Don't be
> > > distracted by the timestamps or the fact that the LSNs do not
> > > match because the log items in the AIL end up indexed by the start
> > > lsn of the CIL checkpoint (whereas last_sync_lsn refers to the
> > > commit record). The point is simply that xfsaild has completed a
> > > push of a target that hasn't been inserted yet.
> > 
> > AFAICT, what you are showing requires delaying of the CIL push to the
> > point it violates a fundamental assumption about commit sizes, which
> > is why I largely think it's irrelevant.
> > 
> 
> The CIL checkpoint size is an unrelated side effect of the test I
> happened to use, not a fundamental cause of the problem it demonstrates.
> Fixing CIL checkpoint size issues won't change anything. Here's a
> different variant of the same problem with a small enough number of log
> items such that background CIL pushing is not a factor:
> 
>        <...>-79670 [000] ...1 56126.015522: xfs_log_force: dev 253:4 lsn 0x0 caller xfs_log_worker+0x2f/0xf0 [xfs]
> kworker/0:1H-220   [000] ...1 56126.030587: __xlog_grant_push_ail: 1596: threshold_lsn 0x1000032e4
> 	...
>        <...>-81293 [000] ...2 56126.032647: xfs_ail_delete: dev 253:4 lip 00000000cbe82125 old lsn 1/13026 new lsn 1/13026 type XFS_LI_INODE flags IN_AIL
>        <...>-81633 [000] .... 56126.053544: xfsaild: 588: idle ->ail_target 0x1000032e4
> kworker/0:1H-220   [000] ...2 56127.038835: xfs_ail_insert: dev 253:4 lip 00000000a44ab1ef old lsn 0/0 new lsn 1/13028 type XFS_LI_INODE flags IN_AIL
> kworker/0:1H-220   [000] ...2 56127.038911: xfs_ail_insert: dev 253:4 lip 0000000028d2061f old lsn 0/0 new lsn 1/13028 type XFS_LI_INODE flags IN_AIL
> 	....
>
> This sequence starts with one log item in the AIL and some number of
> items in the CIL such that a checkpoint executes from the background log
> worker. The worker forces the CIL and log I/O completion issues an AIL
> push that is truncated by the recently updated ->l_last_sync_lsn due to
> outstanding transaction reservation and small AIL size. This push races
> with completion of a previous push that empties the AIL and iclog
> callbacks insert log items for the current checkpoint at the LSN target
> xfsaild just idled at.

I'm just not seeing what the problem here is. The behaviour you are
describing has been around since day zero and doesn't require the
addition of an ail push from iclog completion to trigger.  Prior to
this series, it would be:

process 1	reservation	log completion	xfsaild
<completes metadata IO>
  xfs_ail_delete()
    mlip_changed
    xlog_assign_tail_lsn_locked()
      ail empty, sets l_last_sync = 0x1000032e2
    xfs_log_space_wake()
				xlog_state_do_callback
				  sets CALLBACK
				  sets last_sync_lsn to iclog head
				    -> 0x1000032e4
				  <drops icloglock, gets preempted>
		<wakes>
		xlog_grant_head_wait
		  free_bytes < need_bytes
		    xlog_grant_push_ail()
		      xlog_push_ail()
		        ->ail_target 0x1000032e4
		<sleeps>
						<wakes>
						sets prev target to 0x1000032e4
						sees empty AIL
						<sleeps>
				    <runs again>
				    runs callbacks
				      xfs_ail_insert(lsn = 0x1000032e4)

and now we have the AIL push thread asleep with items in it at the
push threshold.  IOWs, what you describe has always been possible,
and before the CIL was introduced this sort of thing happened quite
a bit because iclog completions freed up much less space in the log
than a CIL commit completion.

It's not a problem, however, because if we are out of transaction
reservation space we must have transactions in progress, and as long
as they make progress then the commit of each transaction will end
up calling xlog_ungrant_log_space() to return the unused portion of
the transaction reservation. That calls xfs_log_space_wake() to
allow reservation waiters to try to make progress.

If there's still not enough space reservation after the transaction
in progress has released it's reservation, then it goes back to
sleep. As long as we have active transactions in progress while
there are transaction reservations waiting on reservation space,
there will be a wakeup vector for the reservation independent of
the CIL, iclogs and AIL behaviour.

[ Yes, there was a bug here, in the case xfs_log_space_wake() did
not issue a wakeup because of not enough space being availble and
the push target was limited by the old log head location. i.e.
nothing ever updated the push target to reflect the new log head and
so the tail might never get moved now. That particular bug was fixed
by a an earlier patch in the series, so we can ignore it here. ]

IOWs, if the AIL is empty, the CIL cannot consume more than 25% of
the log space, and we have transactions waiting on log reservation
space, then we must have enough transactions in progress to cover at
least 75% of the log space. Completion of those transactions will
wake waiters and, if necessary, push the AIL again to keep the log
tail moving appropriately. This handles the AIL empty and "insert
before target" situations you are concerned about just fine, as long
as we have a guarantee of forwards progress. Bounding the CIL size
provides that forwards progress guarantee for the CIL...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 7/7] xfs: push the grant head when the log head moves forward
  2019-09-10 23:38                   ` Dave Chinner
@ 2019-09-12 13:46                     ` Brian Foster
  2019-09-17  4:31                       ` Darrick J. Wong
  0 siblings, 1 reply; 41+ messages in thread
From: Brian Foster @ 2019-09-12 13:46 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Wed, Sep 11, 2019 at 09:38:58AM +1000, Dave Chinner wrote:
> On Tue, Sep 10, 2019 at 05:56:28AM -0400, Brian Foster wrote:
> > On Mon, Sep 09, 2019 at 09:26:32AM +1000, Dave Chinner wrote:
> > > On Sat, Sep 07, 2019 at 11:10:50AM -0400, Brian Foster wrote:
> > > > This is an instance of xfsaild going idle between the time this
> > > > new AIL push sets the target based on the iclog about to be
> > > > committed and AIL insertion of the associated log items,
> > > > reproduced via a bit of timing instrumentation.  Don't be
> > > > distracted by the timestamps or the fact that the LSNs do not
> > > > match because the log items in the AIL end up indexed by the start
> > > > lsn of the CIL checkpoint (whereas last_sync_lsn refers to the
> > > > commit record). The point is simply that xfsaild has completed a
> > > > push of a target that hasn't been inserted yet.
> > > 
> > > AFAICT, what you are showing requires delaying of the CIL push to the
> > > point it violates a fundamental assumption about commit sizes, which
> > > is why I largely think it's irrelevant.
> > > 
> > 
> > The CIL checkpoint size is an unrelated side effect of the test I
> > happened to use, not a fundamental cause of the problem it demonstrates.
> > Fixing CIL checkpoint size issues won't change anything. Here's a
> > different variant of the same problem with a small enough number of log
> > items such that background CIL pushing is not a factor:
> > 
> >        <...>-79670 [000] ...1 56126.015522: xfs_log_force: dev 253:4 lsn 0x0 caller xfs_log_worker+0x2f/0xf0 [xfs]
> > kworker/0:1H-220   [000] ...1 56126.030587: __xlog_grant_push_ail: 1596: threshold_lsn 0x1000032e4
> > 	...
> >        <...>-81293 [000] ...2 56126.032647: xfs_ail_delete: dev 253:4 lip 00000000cbe82125 old lsn 1/13026 new lsn 1/13026 type XFS_LI_INODE flags IN_AIL
> >        <...>-81633 [000] .... 56126.053544: xfsaild: 588: idle ->ail_target 0x1000032e4
> > kworker/0:1H-220   [000] ...2 56127.038835: xfs_ail_insert: dev 253:4 lip 00000000a44ab1ef old lsn 0/0 new lsn 1/13028 type XFS_LI_INODE flags IN_AIL
> > kworker/0:1H-220   [000] ...2 56127.038911: xfs_ail_insert: dev 253:4 lip 0000000028d2061f old lsn 0/0 new lsn 1/13028 type XFS_LI_INODE flags IN_AIL
> > 	....
> >
> > This sequence starts with one log item in the AIL and some number of
> > items in the CIL such that a checkpoint executes from the background log
> > worker. The worker forces the CIL and log I/O completion issues an AIL
> > push that is truncated by the recently updated ->l_last_sync_lsn due to
> > outstanding transaction reservation and small AIL size. This push races
> > with completion of a previous push that empties the AIL and iclog
> > callbacks insert log items for the current checkpoint at the LSN target
> > xfsaild just idled at.
> 
> I'm just not seeing what the problem here is. The behaviour you are
> describing has been around since day zero and doesn't require the
> addition of an ail push from iclog completion to trigger.  Prior to
> this series, it would be:
> 

A few days ago you said that if we're inserting log items before the
push target, "something is very wrong." Since this was what I was
concerned about, I attempted to manufacture the issue to demonstrate.
You suggested the first reproducer I came up with was a separate problem
(related to CIL size issues), so I came up with the one above to avoid
that distraction. Now you're telling me this has always happened and is
fine..

While I don't think this is quite accurate (more below), I do find this
reasoning somewhat amusing in that it essentially implies that this
patch itself is dubious. If this new AIL push is required to fix a real
issue, and this race is essentially manifest as implied, then this patch
can't possibly reliably fix the original problem. Anyways, that is
neither here nor there..

All of the details of this particular issue aside, I do think there's a
development process problem here. It shouldn't require an extended game
of whack-a-mole with this kind of inconsistent reasoning just to request
a trivial change to a patch (you also implied in a previous response it
was me wasting your time on this topic) that closes an obvious race and
otherwise has no negative effect. Someone is being unreasonable here and
I don't think it's me. More importantly, discussion of open issues
shouldn't be a race against the associated patch being merged. :/

> process 1	reservation	log completion	xfsaild
> <completes metadata IO>
>   xfs_ail_delete()
>     mlip_changed
>     xlog_assign_tail_lsn_locked()
>       ail empty, sets l_last_sync = 0x1000032e2
>     xfs_log_space_wake()
> 				xlog_state_do_callback
> 				  sets CALLBACK
> 				  sets last_sync_lsn to iclog head
> 				    -> 0x1000032e4
> 				  <drops icloglock, gets preempted>
> 		<wakes>
> 		xlog_grant_head_wait
> 		  free_bytes < need_bytes
> 		    xlog_grant_push_ail()
> 		      xlog_push_ail()
> 		        ->ail_target 0x1000032e4
> 		<sleeps>
> 						<wakes>
> 						sets prev target to 0x1000032e4
> 						sees empty AIL
> 						<sleeps>
> 				    <runs again>
> 				    runs callbacks
> 				      xfs_ail_insert(lsn = 0x1000032e4)
> 
> and now we have the AIL push thread asleep with items in it at the
> push threshold.  IOWs, what you describe has always been possible,
> and before the CIL was introduced this sort of thing happened quite
> a bit because iclog completions freed up much less space in the log
> than a CIL commit completion.
> 

I was suspicious that this could occur prior to this change but I hadn't
confirmed. The scenario documented above cannot occur because a push on
an empty AIL has no effect. The target doesn't move and the task isn't
woken. That said, I still suspect the race can occur with the current
code via between a grant head waiter, AIL emptying and iclog completion.

This just speaks to the frequency of the problem, though. I'm not
convinced it's something that happens "quite a bit" given the nature of
the 3-way race. I also don't agree that existence of a historical
problem somehow excuses introduction a new variant of the same problem.
Instead, if this patch exposes a historical problem that simply had no
noticeable impact to this point, we should probably look into whether it
needs fixing too.

> It's not a problem, however, because if we are out of transaction
> reservation space we must have transactions in progress, and as long
> as they make progress then the commit of each transaction will end
> up calling xlog_ungrant_log_space() to return the unused portion of
> the transaction reservation. That calls xfs_log_space_wake() to
> allow reservation waiters to try to make progress.
> 

Yes, this is why I don't see immediate side effects in the tests I've
run so far. The assumptions you're basing this off are not always true,
however. Particularly on smaller (<= 1GB) filesystems, it's relatively
easy to produce conditions where the entire reservation space is
consumed by open transactions that don't ultimately commit anything to
the log subsystem and thus generate no forward progress.

> If there's still not enough space reservation after the transaction
> in progress has released it's reservation, then it goes back to
> sleep. As long as we have active transactions in progress while
> there are transaction reservations waiting on reservation space,
> there will be a wakeup vector for the reservation independent of
> the CIL, iclogs and AIL behaviour.
> 

We do have clean transaction cancel and error scenarios, existing log
deadlock vectors, increasing reliance on long running transactions via
deferred ops, scrub, etc. Also consider the fact that open transactions
consume considerably more reservation than committed transactions on
average.

I'm not saying it's likely for a real world workload to consume the
entirety of log reservation space via open transactions and then release
it without filesystem modification (and then race with log I/O and AIL
emptying), but from the perspective of proving the existence of a bug
it's really not that difficult to produce. I've not seen a real world
workload that reproduces the problems fixed by any of these patches
either, but we still fix them.

> [ Yes, there was a bug here, in the case xfs_log_space_wake() did
> not issue a wakeup because of not enough space being availble and
> the push target was limited by the old log head location. i.e.
> nothing ever updated the push target to reflect the new log head and
> so the tail might never get moved now. That particular bug was fixed
> by a an earlier patch in the series, so we can ignore it here. ]
> 
> IOWs, if the AIL is empty, the CIL cannot consume more than 25% of
> the log space, and we have transactions waiting on log reservation
> space, then we must have enough transactions in progress to cover at
> least 75% of the log space. Completion of those transactions will
> wake waiters and, if necessary, push the AIL again to keep the log
> tail moving appropriately. This handles the AIL empty and "insert
> before target" situations you are concerned about just fine, as long
> as we have a guarantee of forwards progress. Bounding the CIL size
> provides that forwards progress guarantee for the CIL...
> 

I think you have some tunnel vision or something going on here with
regard to the higher level architectural view of how things are supposed
to operate in a normal running/steady state vs simply what can and
cannot happen in the code. I can't really tell why/how, but the only
suggestion I can make is to perhaps separate from this high level view
of things and take a closer look at the code. This is a simple code bug,
not some grand architectural flaw. The context here is way out of whack.
The repeated unrelated and overblown architectural assertions come off
as indication of lack of any real argument to allow this race to live.
There is simply no such guarantee of forward progress in all scenarios
that produce the conditions that can cause this race.

Yet another example:

           <...>-369   [002] ...2   220.055746: xfs_ail_insert: dev 253:4 lip 00000000ddb123f2 old lsn 0/0 new lsn 1/248 type XFS_LI_INODE flags IN_AIL
           <...>-27    [003] ...1   224.753110: xfs_log_force: dev 253:4 lsn 0x0 caller xfs_log_worker+0x2f/0xf0 [xfs]
           <...>-404   [003] ...1   224.775551: __xlog_grant_push_ail: 1596: threshold_lsn 0x1000000fa
     kworker/3:1-39    [003] ...2   224.777953: xfs_ail_delete: dev 253:4 lip 00000000ddb123f2 old lsn 1/248 new lsn 1/248 type XFS_LI_INODE flags IN_AIL
    xfsaild/dm-4-1034  [000] ....   224.797919: xfsaild: 588: idle ->ail_target 0x1000000fa
    kworker/3:1H-404   [003] ...2   225.841198: xfs_ail_insert: dev 253:4 lip 000000006845aeed old lsn 0/0 new lsn 1/250 type XFS_LI_INODE flags IN_AIL
     kworker/3:1-39    [003] ...1   254.962822: xfs_log_force: dev 253:4 lsn 0x0 caller xfs_log_worker+0x2f/0xf0 [xfs]
	...
     kworker/3:2-1920  [003] ...1  3759.291275: xfs_log_force: dev 253:4 lsn 0x0 caller xfs_log_worker+0x2f/0xf0 [xfs]


# cat /sys/fs/xfs/dm-4/log/log_*lsn
1:252
1:250

This instance of the race uses the same serialization instrumentation to
control execution timing and whatnot as before (i.e. no functional
changes). First, an item is inserted into the AIL. Immediately after AIL
insertion, another transaction commits to the CIL (not shown in the
trace). The background log worker comes around a few seconds later and
forces the log/CIL. The checkpoint for this log force races with an AIL
delete and idle (same as before). AIL insertion occurs at the push
target xfsaild just idled at, but this time reservation pressure
relieves and the filesystem goes idle.

At this point, nothing occurs on the fs except for continuous background
log worker jobs. Note the timestamp difference between the first
post-race log force and the last in the trace. The log worker runs at
the default 30s interval and has run repeatedly for almost an hour while
failing to push the AIL and subsequently cover the log. To confirm the
AIL is populated, see the log head/tail LSNs reported via sysfs. This
state persists indefinitely so long as the fs is idle. This is a bug.

Brian

> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 7/7] xfs: push the grant head when the log head moves forward
  2019-09-12 13:46                     ` Brian Foster
@ 2019-09-17  4:31                       ` Darrick J. Wong
  2019-09-17 12:48                         ` Brian Foster
  0 siblings, 1 reply; 41+ messages in thread
From: Darrick J. Wong @ 2019-09-17  4:31 UTC (permalink / raw)
  To: Brian Foster; +Cc: Dave Chinner, linux-xfs

On Thu, Sep 12, 2019 at 09:46:06AM -0400, Brian Foster wrote:
> On Wed, Sep 11, 2019 at 09:38:58AM +1000, Dave Chinner wrote:
> > On Tue, Sep 10, 2019 at 05:56:28AM -0400, Brian Foster wrote:
> > > On Mon, Sep 09, 2019 at 09:26:32AM +1000, Dave Chinner wrote:
> > > > On Sat, Sep 07, 2019 at 11:10:50AM -0400, Brian Foster wrote:
> > > > > This is an instance of xfsaild going idle between the time this
> > > > > new AIL push sets the target based on the iclog about to be
> > > > > committed and AIL insertion of the associated log items,
> > > > > reproduced via a bit of timing instrumentation.  Don't be
> > > > > distracted by the timestamps or the fact that the LSNs do not
> > > > > match because the log items in the AIL end up indexed by the start
> > > > > lsn of the CIL checkpoint (whereas last_sync_lsn refers to the
> > > > > commit record). The point is simply that xfsaild has completed a
> > > > > push of a target that hasn't been inserted yet.
> > > > 
> > > > AFAICT, what you are showing requires delaying of the CIL push to the
> > > > point it violates a fundamental assumption about commit sizes, which
> > > > is why I largely think it's irrelevant.
> > > > 
> > > 
> > > The CIL checkpoint size is an unrelated side effect of the test I
> > > happened to use, not a fundamental cause of the problem it demonstrates.
> > > Fixing CIL checkpoint size issues won't change anything. Here's a
> > > different variant of the same problem with a small enough number of log
> > > items such that background CIL pushing is not a factor:
> > > 
> > >        <...>-79670 [000] ...1 56126.015522: xfs_log_force: dev 253:4 lsn 0x0 caller xfs_log_worker+0x2f/0xf0 [xfs]
> > > kworker/0:1H-220   [000] ...1 56126.030587: __xlog_grant_push_ail: 1596: threshold_lsn 0x1000032e4
> > > 	...
> > >        <...>-81293 [000] ...2 56126.032647: xfs_ail_delete: dev 253:4 lip 00000000cbe82125 old lsn 1/13026 new lsn 1/13026 type XFS_LI_INODE flags IN_AIL
> > >        <...>-81633 [000] .... 56126.053544: xfsaild: 588: idle ->ail_target 0x1000032e4
> > > kworker/0:1H-220   [000] ...2 56127.038835: xfs_ail_insert: dev 253:4 lip 00000000a44ab1ef old lsn 0/0 new lsn 1/13028 type XFS_LI_INODE flags IN_AIL
> > > kworker/0:1H-220   [000] ...2 56127.038911: xfs_ail_insert: dev 253:4 lip 0000000028d2061f old lsn 0/0 new lsn 1/13028 type XFS_LI_INODE flags IN_AIL
> > > 	....
> > >
> > > This sequence starts with one log item in the AIL and some number of
> > > items in the CIL such that a checkpoint executes from the background log
> > > worker. The worker forces the CIL and log I/O completion issues an AIL
> > > push that is truncated by the recently updated ->l_last_sync_lsn due to
> > > outstanding transaction reservation and small AIL size. This push races
> > > with completion of a previous push that empties the AIL and iclog
> > > callbacks insert log items for the current checkpoint at the LSN target
> > > xfsaild just idled at.
> > 
> > I'm just not seeing what the problem here is. The behaviour you are
> > describing has been around since day zero and doesn't require the
> > addition of an ail push from iclog completion to trigger.  Prior to
> > this series, it would be:
> > 
> 
> A few days ago you said that if we're inserting log items before the
> push target, "something is very wrong." Since this was what I was
> concerned about, I attempted to manufacture the issue to demonstrate.
> You suggested the first reproducer I came up with was a separate problem
> (related to CIL size issues), so I came up with the one above to avoid
> that distraction. Now you're telling me this has always happened and is
> fine..
> 
> While I don't think this is quite accurate (more below), I do find this
> reasoning somewhat amusing in that it essentially implies that this
> patch itself is dubious. If this new AIL push is required to fix a real
> issue, and this race is essentially manifest as implied, then this patch
> can't possibly reliably fix the original problem. Anyways, that is
> neither here nor there..
> 
> All of the details of this particular issue aside, I do think there's a
> development process problem here. It shouldn't require an extended game
> of whack-a-mole with this kind of inconsistent reasoning just to request
> a trivial change to a patch (you also implied in a previous response it
> was me wasting your time on this topic) that closes an obvious race and
> otherwise has no negative effect. Someone is being unreasonable here and
> I don't think it's me. More importantly, discussion of open issues
> shouldn't be a race against the associated patch being merged. :/
> 
> > process 1	reservation	log completion	xfsaild
> > <completes metadata IO>
> >   xfs_ail_delete()
> >     mlip_changed
> >     xlog_assign_tail_lsn_locked()
> >       ail empty, sets l_last_sync = 0x1000032e2
> >     xfs_log_space_wake()
> > 				xlog_state_do_callback
> > 				  sets CALLBACK
> > 				  sets last_sync_lsn to iclog head
> > 				    -> 0x1000032e4
> > 				  <drops icloglock, gets preempted>
> > 		<wakes>
> > 		xlog_grant_head_wait
> > 		  free_bytes < need_bytes
> > 		    xlog_grant_push_ail()
> > 		      xlog_push_ail()
> > 		        ->ail_target 0x1000032e4
> > 		<sleeps>
> > 						<wakes>
> > 						sets prev target to 0x1000032e4
> > 						sees empty AIL
> > 						<sleeps>
> > 				    <runs again>
> > 				    runs callbacks
> > 				      xfs_ail_insert(lsn = 0x1000032e4)
> > 
> > and now we have the AIL push thread asleep with items in it at the
> > push threshold.  IOWs, what you describe has always been possible,
> > and before the CIL was introduced this sort of thing happened quite
> > a bit because iclog completions freed up much less space in the log
> > than a CIL commit completion.
> > 
> 
> I was suspicious that this could occur prior to this change but I hadn't
> confirmed. The scenario documented above cannot occur because a push on
> an empty AIL has no effect. The target doesn't move and the task isn't
> woken. That said, I still suspect the race can occur with the current
> code via between a grant head waiter, AIL emptying and iclog completion.
> 
> This just speaks to the frequency of the problem, though. I'm not
> convinced it's something that happens "quite a bit" given the nature of
> the 3-way race. I also don't agree that existence of a historical
> problem somehow excuses introduction a new variant of the same problem.
> Instead, if this patch exposes a historical problem that simply had no
> noticeable impact to this point, we should probably look into whether it
> needs fixing too.
> 
> > It's not a problem, however, because if we are out of transaction
> > reservation space we must have transactions in progress, and as long
> > as they make progress then the commit of each transaction will end
> > up calling xlog_ungrant_log_space() to return the unused portion of
> > the transaction reservation. That calls xfs_log_space_wake() to
> > allow reservation waiters to try to make progress.
> > 
> 
> Yes, this is why I don't see immediate side effects in the tests I've
> run so far. The assumptions you're basing this off are not always true,
> however. Particularly on smaller (<= 1GB) filesystems, it's relatively
> easy to produce conditions where the entire reservation space is
> consumed by open transactions that don't ultimately commit anything to
> the log subsystem and thus generate no forward progress.
> 
> > If there's still not enough space reservation after the transaction
> > in progress has released it's reservation, then it goes back to
> > sleep. As long as we have active transactions in progress while
> > there are transaction reservations waiting on reservation space,
> > there will be a wakeup vector for the reservation independent of
> > the CIL, iclogs and AIL behaviour.
> > 
> 
> We do have clean transaction cancel and error scenarios, existing log
> deadlock vectors, increasing reliance on long running transactions via
> deferred ops, scrub, etc. Also consider the fact that open transactions
> consume considerably more reservation than committed transactions on
> average.
> 
> I'm not saying it's likely for a real world workload to consume the
> entirety of log reservation space via open transactions and then release
> it without filesystem modification (and then race with log I/O and AIL
> emptying), but from the perspective of proving the existence of a bug
> it's really not that difficult to produce. I've not seen a real world
> workload that reproduces the problems fixed by any of these patches
> either, but we still fix them.
> 
> > [ Yes, there was a bug here, in the case xfs_log_space_wake() did
> > not issue a wakeup because of not enough space being availble and
> > the push target was limited by the old log head location. i.e.
> > nothing ever updated the push target to reflect the new log head and
> > so the tail might never get moved now. That particular bug was fixed
> > by a an earlier patch in the series, so we can ignore it here. ]
> > 
> > IOWs, if the AIL is empty, the CIL cannot consume more than 25% of
> > the log space, and we have transactions waiting on log reservation
> > space, then we must have enough transactions in progress to cover at
> > least 75% of the log space. Completion of those transactions will
> > wake waiters and, if necessary, push the AIL again to keep the log
> > tail moving appropriately. This handles the AIL empty and "insert
> > before target" situations you are concerned about just fine, as long
> > as we have a guarantee of forwards progress. Bounding the CIL size
> > provides that forwards progress guarantee for the CIL...
> > 
> 
> I think you have some tunnel vision or something going on here with
> regard to the higher level architectural view of how things are supposed
> to operate in a normal running/steady state vs simply what can and
> cannot happen in the code. I can't really tell why/how, but the only
> suggestion I can make is to perhaps separate from this high level view
> of things and take a closer look at the code. This is a simple code bug,
> not some grand architectural flaw. The context here is way out of whack.
> The repeated unrelated and overblown architectural assertions come off
> as indication of lack of any real argument to allow this race to live.
> There is simply no such guarantee of forward progress in all scenarios
> that produce the conditions that can cause this race.
> 
> Yet another example:
> 
>            <...>-369   [002] ...2   220.055746: xfs_ail_insert: dev 253:4 lip 00000000ddb123f2 old lsn 0/0 new lsn 1/248 type XFS_LI_INODE flags IN_AIL
>            <...>-27    [003] ...1   224.753110: xfs_log_force: dev 253:4 lsn 0x0 caller xfs_log_worker+0x2f/0xf0 [xfs]
>            <...>-404   [003] ...1   224.775551: __xlog_grant_push_ail: 1596: threshold_lsn 0x1000000fa
>      kworker/3:1-39    [003] ...2   224.777953: xfs_ail_delete: dev 253:4 lip 00000000ddb123f2 old lsn 1/248 new lsn 1/248 type XFS_LI_INODE flags IN_AIL
>     xfsaild/dm-4-1034  [000] ....   224.797919: xfsaild: 588: idle ->ail_target 0x1000000fa
>     kworker/3:1H-404   [003] ...2   225.841198: xfs_ail_insert: dev 253:4 lip 000000006845aeed old lsn 0/0 new lsn 1/250 type XFS_LI_INODE flags IN_AIL
>      kworker/3:1-39    [003] ...1   254.962822: xfs_log_force: dev 253:4 lsn 0x0 caller xfs_log_worker+0x2f/0xf0 [xfs]
> 	...
>      kworker/3:2-1920  [003] ...1  3759.291275: xfs_log_force: dev 253:4 lsn 0x0 caller xfs_log_worker+0x2f/0xf0 [xfs]
> 
> 
> # cat /sys/fs/xfs/dm-4/log/log_*lsn
> 1:252
> 1:250
> 
> This instance of the race uses the same serialization instrumentation to
> control execution timing and whatnot as before (i.e. no functional
> changes). First, an item is inserted into the AIL. Immediately after AIL
> insertion, another transaction commits to the CIL (not shown in the
> trace). The background log worker comes around a few seconds later and
> forces the log/CIL. The checkpoint for this log force races with an AIL
> delete and idle (same as before). AIL insertion occurs at the push
> target xfsaild just idled at, but this time reservation pressure
> relieves and the filesystem goes idle.
> 
> At this point, nothing occurs on the fs except for continuous background
> log worker jobs. Note the timestamp difference between the first
> post-race log force and the last in the trace. The log worker runs at
> the default 30s interval and has run repeatedly for almost an hour while
> failing to push the AIL and subsequently cover the log. To confirm the
> AIL is populated, see the log head/tail LSNs reported via sysfs. This
> state persists indefinitely so long as the fs is idle. This is a bug.

/me stumbles back in after ~2wks, and has a few questions:

1) Are these concerns a reason to hold up this series, or are they a
separate bug lurking in the code being touched by the series?  AFAICT I
think it's the second, but <shrug> my brain is still mush.

2) Er... how do you get the log stuck like this?  I see things earlier
in the thread like "open transactions that don't ultimately commit
anything to the log subsystem" and think "OH, you mean xfs_scrub!"

--D

> Brian
> 
> > Cheers,
> > 
> > Dave.
> > -- 
> > Dave Chinner
> > david@fromorbit.com

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 7/7] xfs: push the grant head when the log head moves forward
  2019-09-17  4:31                       ` Darrick J. Wong
@ 2019-09-17 12:48                         ` Brian Foster
  2019-09-24 17:16                           ` Darrick J. Wong
  0 siblings, 1 reply; 41+ messages in thread
From: Brian Foster @ 2019-09-17 12:48 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: Dave Chinner, linux-xfs

On Mon, Sep 16, 2019 at 09:31:56PM -0700, Darrick J. Wong wrote:
> On Thu, Sep 12, 2019 at 09:46:06AM -0400, Brian Foster wrote:
> > On Wed, Sep 11, 2019 at 09:38:58AM +1000, Dave Chinner wrote:
> > > On Tue, Sep 10, 2019 at 05:56:28AM -0400, Brian Foster wrote:
> > > > On Mon, Sep 09, 2019 at 09:26:32AM +1000, Dave Chinner wrote:
> > > > > On Sat, Sep 07, 2019 at 11:10:50AM -0400, Brian Foster wrote:
> > > > > > This is an instance of xfsaild going idle between the time this
> > > > > > new AIL push sets the target based on the iclog about to be
> > > > > > committed and AIL insertion of the associated log items,
> > > > > > reproduced via a bit of timing instrumentation.  Don't be
> > > > > > distracted by the timestamps or the fact that the LSNs do not
> > > > > > match because the log items in the AIL end up indexed by the start
> > > > > > lsn of the CIL checkpoint (whereas last_sync_lsn refers to the
> > > > > > commit record). The point is simply that xfsaild has completed a
> > > > > > push of a target that hasn't been inserted yet.
> > > > > 
> > > > > AFAICT, what you are showing requires delaying of the CIL push to the
> > > > > point it violates a fundamental assumption about commit sizes, which
> > > > > is why I largely think it's irrelevant.
> > > > > 
> > > > 
> > > > The CIL checkpoint size is an unrelated side effect of the test I
> > > > happened to use, not a fundamental cause of the problem it demonstrates.
> > > > Fixing CIL checkpoint size issues won't change anything. Here's a
> > > > different variant of the same problem with a small enough number of log
> > > > items such that background CIL pushing is not a factor:
> > > > 
> > > >        <...>-79670 [000] ...1 56126.015522: xfs_log_force: dev 253:4 lsn 0x0 caller xfs_log_worker+0x2f/0xf0 [xfs]
> > > > kworker/0:1H-220   [000] ...1 56126.030587: __xlog_grant_push_ail: 1596: threshold_lsn 0x1000032e4
> > > > 	...
> > > >        <...>-81293 [000] ...2 56126.032647: xfs_ail_delete: dev 253:4 lip 00000000cbe82125 old lsn 1/13026 new lsn 1/13026 type XFS_LI_INODE flags IN_AIL
> > > >        <...>-81633 [000] .... 56126.053544: xfsaild: 588: idle ->ail_target 0x1000032e4
> > > > kworker/0:1H-220   [000] ...2 56127.038835: xfs_ail_insert: dev 253:4 lip 00000000a44ab1ef old lsn 0/0 new lsn 1/13028 type XFS_LI_INODE flags IN_AIL
> > > > kworker/0:1H-220   [000] ...2 56127.038911: xfs_ail_insert: dev 253:4 lip 0000000028d2061f old lsn 0/0 new lsn 1/13028 type XFS_LI_INODE flags IN_AIL
> > > > 	....
> > > >
> > > > This sequence starts with one log item in the AIL and some number of
> > > > items in the CIL such that a checkpoint executes from the background log
> > > > worker. The worker forces the CIL and log I/O completion issues an AIL
> > > > push that is truncated by the recently updated ->l_last_sync_lsn due to
> > > > outstanding transaction reservation and small AIL size. This push races
> > > > with completion of a previous push that empties the AIL and iclog
> > > > callbacks insert log items for the current checkpoint at the LSN target
> > > > xfsaild just idled at.
> > > 
> > > I'm just not seeing what the problem here is. The behaviour you are
> > > describing has been around since day zero and doesn't require the
> > > addition of an ail push from iclog completion to trigger.  Prior to
> > > this series, it would be:
> > > 
> > 
> > A few days ago you said that if we're inserting log items before the
> > push target, "something is very wrong." Since this was what I was
> > concerned about, I attempted to manufacture the issue to demonstrate.
> > You suggested the first reproducer I came up with was a separate problem
> > (related to CIL size issues), so I came up with the one above to avoid
> > that distraction. Now you're telling me this has always happened and is
> > fine..
> > 
> > While I don't think this is quite accurate (more below), I do find this
> > reasoning somewhat amusing in that it essentially implies that this
> > patch itself is dubious. If this new AIL push is required to fix a real
> > issue, and this race is essentially manifest as implied, then this patch
> > can't possibly reliably fix the original problem. Anyways, that is
> > neither here nor there..
> > 
> > All of the details of this particular issue aside, I do think there's a
> > development process problem here. It shouldn't require an extended game
> > of whack-a-mole with this kind of inconsistent reasoning just to request
> > a trivial change to a patch (you also implied in a previous response it
> > was me wasting your time on this topic) that closes an obvious race and
> > otherwise has no negative effect. Someone is being unreasonable here and
> > I don't think it's me. More importantly, discussion of open issues
> > shouldn't be a race against the associated patch being merged. :/
> > 
> > > process 1	reservation	log completion	xfsaild
> > > <completes metadata IO>
> > >   xfs_ail_delete()
> > >     mlip_changed
> > >     xlog_assign_tail_lsn_locked()
> > >       ail empty, sets l_last_sync = 0x1000032e2
> > >     xfs_log_space_wake()
> > > 				xlog_state_do_callback
> > > 				  sets CALLBACK
> > > 				  sets last_sync_lsn to iclog head
> > > 				    -> 0x1000032e4
> > > 				  <drops icloglock, gets preempted>
> > > 		<wakes>
> > > 		xlog_grant_head_wait
> > > 		  free_bytes < need_bytes
> > > 		    xlog_grant_push_ail()
> > > 		      xlog_push_ail()
> > > 		        ->ail_target 0x1000032e4
> > > 		<sleeps>
> > > 						<wakes>
> > > 						sets prev target to 0x1000032e4
> > > 						sees empty AIL
> > > 						<sleeps>
> > > 				    <runs again>
> > > 				    runs callbacks
> > > 				      xfs_ail_insert(lsn = 0x1000032e4)
> > > 
> > > and now we have the AIL push thread asleep with items in it at the
> > > push threshold.  IOWs, what you describe has always been possible,
> > > and before the CIL was introduced this sort of thing happened quite
> > > a bit because iclog completions freed up much less space in the log
> > > than a CIL commit completion.
> > > 
> > 
> > I was suspicious that this could occur prior to this change but I hadn't
> > confirmed. The scenario documented above cannot occur because a push on
> > an empty AIL has no effect. The target doesn't move and the task isn't
> > woken. That said, I still suspect the race can occur with the current
> > code via between a grant head waiter, AIL emptying and iclog completion.
> > 
> > This just speaks to the frequency of the problem, though. I'm not
> > convinced it's something that happens "quite a bit" given the nature of
> > the 3-way race. I also don't agree that existence of a historical
> > problem somehow excuses introduction a new variant of the same problem.
> > Instead, if this patch exposes a historical problem that simply had no
> > noticeable impact to this point, we should probably look into whether it
> > needs fixing too.
> > 
> > > It's not a problem, however, because if we are out of transaction
> > > reservation space we must have transactions in progress, and as long
> > > as they make progress then the commit of each transaction will end
> > > up calling xlog_ungrant_log_space() to return the unused portion of
> > > the transaction reservation. That calls xfs_log_space_wake() to
> > > allow reservation waiters to try to make progress.
> > > 
> > 
> > Yes, this is why I don't see immediate side effects in the tests I've
> > run so far. The assumptions you're basing this off are not always true,
> > however. Particularly on smaller (<= 1GB) filesystems, it's relatively
> > easy to produce conditions where the entire reservation space is
> > consumed by open transactions that don't ultimately commit anything to
> > the log subsystem and thus generate no forward progress.
> > 
> > > If there's still not enough space reservation after the transaction
> > > in progress has released it's reservation, then it goes back to
> > > sleep. As long as we have active transactions in progress while
> > > there are transaction reservations waiting on reservation space,
> > > there will be a wakeup vector for the reservation independent of
> > > the CIL, iclogs and AIL behaviour.
> > > 
> > 
> > We do have clean transaction cancel and error scenarios, existing log
> > deadlock vectors, increasing reliance on long running transactions via
> > deferred ops, scrub, etc. Also consider the fact that open transactions
> > consume considerably more reservation than committed transactions on
> > average.
> > 
> > I'm not saying it's likely for a real world workload to consume the
> > entirety of log reservation space via open transactions and then release
> > it without filesystem modification (and then race with log I/O and AIL
> > emptying), but from the perspective of proving the existence of a bug
> > it's really not that difficult to produce. I've not seen a real world
> > workload that reproduces the problems fixed by any of these patches
> > either, but we still fix them.
> > 
> > > [ Yes, there was a bug here, in the case xfs_log_space_wake() did
> > > not issue a wakeup because of not enough space being availble and
> > > the push target was limited by the old log head location. i.e.
> > > nothing ever updated the push target to reflect the new log head and
> > > so the tail might never get moved now. That particular bug was fixed
> > > by a an earlier patch in the series, so we can ignore it here. ]
> > > 
> > > IOWs, if the AIL is empty, the CIL cannot consume more than 25% of
> > > the log space, and we have transactions waiting on log reservation
> > > space, then we must have enough transactions in progress to cover at
> > > least 75% of the log space. Completion of those transactions will
> > > wake waiters and, if necessary, push the AIL again to keep the log
> > > tail moving appropriately. This handles the AIL empty and "insert
> > > before target" situations you are concerned about just fine, as long
> > > as we have a guarantee of forwards progress. Bounding the CIL size
> > > provides that forwards progress guarantee for the CIL...
> > > 
> > 
> > I think you have some tunnel vision or something going on here with
> > regard to the higher level architectural view of how things are supposed
> > to operate in a normal running/steady state vs simply what can and
> > cannot happen in the code. I can't really tell why/how, but the only
> > suggestion I can make is to perhaps separate from this high level view
> > of things and take a closer look at the code. This is a simple code bug,
> > not some grand architectural flaw. The context here is way out of whack.
> > The repeated unrelated and overblown architectural assertions come off
> > as indication of lack of any real argument to allow this race to live.
> > There is simply no such guarantee of forward progress in all scenarios
> > that produce the conditions that can cause this race.
> > 
> > Yet another example:
> > 
> >            <...>-369   [002] ...2   220.055746: xfs_ail_insert: dev 253:4 lip 00000000ddb123f2 old lsn 0/0 new lsn 1/248 type XFS_LI_INODE flags IN_AIL
> >            <...>-27    [003] ...1   224.753110: xfs_log_force: dev 253:4 lsn 0x0 caller xfs_log_worker+0x2f/0xf0 [xfs]
> >            <...>-404   [003] ...1   224.775551: __xlog_grant_push_ail: 1596: threshold_lsn 0x1000000fa
> >      kworker/3:1-39    [003] ...2   224.777953: xfs_ail_delete: dev 253:4 lip 00000000ddb123f2 old lsn 1/248 new lsn 1/248 type XFS_LI_INODE flags IN_AIL
> >     xfsaild/dm-4-1034  [000] ....   224.797919: xfsaild: 588: idle ->ail_target 0x1000000fa
> >     kworker/3:1H-404   [003] ...2   225.841198: xfs_ail_insert: dev 253:4 lip 000000006845aeed old lsn 0/0 new lsn 1/250 type XFS_LI_INODE flags IN_AIL
> >      kworker/3:1-39    [003] ...1   254.962822: xfs_log_force: dev 253:4 lsn 0x0 caller xfs_log_worker+0x2f/0xf0 [xfs]
> > 	...
> >      kworker/3:2-1920  [003] ...1  3759.291275: xfs_log_force: dev 253:4 lsn 0x0 caller xfs_log_worker+0x2f/0xf0 [xfs]
> > 
> > 
> > # cat /sys/fs/xfs/dm-4/log/log_*lsn
> > 1:252
> > 1:250
> > 
> > This instance of the race uses the same serialization instrumentation to
> > control execution timing and whatnot as before (i.e. no functional
> > changes). First, an item is inserted into the AIL. Immediately after AIL
> > insertion, another transaction commits to the CIL (not shown in the
> > trace). The background log worker comes around a few seconds later and
> > forces the log/CIL. The checkpoint for this log force races with an AIL
> > delete and idle (same as before). AIL insertion occurs at the push
> > target xfsaild just idled at, but this time reservation pressure
> > relieves and the filesystem goes idle.
> > 
> > At this point, nothing occurs on the fs except for continuous background
> > log worker jobs. Note the timestamp difference between the first
> > post-race log force and the last in the trace. The log worker runs at
> > the default 30s interval and has run repeatedly for almost an hour while
> > failing to push the AIL and subsequently cover the log. To confirm the
> > AIL is populated, see the log head/tail LSNs reported via sysfs. This
> > state persists indefinitely so long as the fs is idle. This is a bug.
> 
> /me stumbles back in after ~2wks, and has a few questions:
> 

Heh, welcome back.. ;)

> 1) Are these concerns a reason to hold up this series, or are they a
> separate bug lurking in the code being touched by the series?  AFAICT I
> think it's the second, but <shrug> my brain is still mush.
> 

A little of both I guess. To Dave's earlier point, I think this
technically can happen in the existing code as a 3-way race between the
aforementioned tasks (just not the way it was described). OTOH, I'm not
sure what this has to do with the fact that the new code being added is
racy on its own (or since when discovery of some old bug justifies
adding new ones..?). The examples shown above are fundamentally races
between log I/O completion and xfsaild. The last one shows the log
remain uncovered indefinitely on an idle fs (which is not a corruption
or anything, but certainly a bug) simply because that's the easiest side
effect to reproduce. I'm fairly confident at this point that one could
be manufactured into a similar log deadlock if we really wanted to try,
but that would be much more difficult and TBH I'm tired of burning
myself out on these kind of objections to obvious and easily addressed
landmines. How likely is it that somebody would hit these problems?
Probably highly unlikely. How likely is it somebody would hit this
problem before whatever problem this patch fixes? *shrug*

I don't think it's a reason to hold up the series, but at the same time
this patch is unrelated to the original problem. IIRC, it fell out of
some other issue reproduced with a different experimental hack/fix (that
was eventually replaced) to the original problem. FWIW, I'm annoyed with
the lazy approach to review here more than anything. In hindsight, if I
knew the feedback was going to be dismissed and the patchset rolled
forward and merged, perhaps I should have just nacked the subsequent
reposts to make the objection clear.

I dunno, not my call on what to do with it now. Feel free to add my
Nacked-by: to the upstream commit I guess so I at least remember this
when/if considering whether to backport it anywhere. :/

> 2) Er... how do you get the log stuck like this?  I see things earlier
> in the thread like "open transactions that don't ultimately commit
> anything to the log subsystem" and think "OH, you mean xfs_scrub!"
> 

That's one thing I was thinking about but I didn't end up looking into
it (does scrub actually acquire log reservation?). For a more simple
example, consider a bunch of threads running into quota block allocation
failures where a system is also under memory pressure. On filesystems
with smaller logs, it only takes a handful of such threads to bash the
reservation grant head against the log tail even though the log is empty
(and doing so without ever committing anything to the log).

Note that this by itself isn't what gets the log "stuck" in the most
recent example (note: not deadlocked), but rather if we're in a state
where the grant head is close enough to the log head (such that we AIL
push the items associated with the current checkpoint before it inserts)
when log I/O completion happens to race with AIL emptying as described.

Brian

> --D
> 
> > Brian
> > 
> > > Cheers,
> > > 
> > > Dave.
> > > -- 
> > > Dave Chinner
> > > david@fromorbit.com

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 7/7] xfs: push the grant head when the log head moves forward
  2019-09-17 12:48                         ` Brian Foster
@ 2019-09-24 17:16                           ` Darrick J. Wong
  2019-09-26 13:19                             ` Brian Foster
  0 siblings, 1 reply; 41+ messages in thread
From: Darrick J. Wong @ 2019-09-24 17:16 UTC (permalink / raw)
  To: Brian Foster; +Cc: Dave Chinner, linux-xfs

On Tue, Sep 17, 2019 at 08:48:27AM -0400, Brian Foster wrote:
> On Mon, Sep 16, 2019 at 09:31:56PM -0700, Darrick J. Wong wrote:
> > On Thu, Sep 12, 2019 at 09:46:06AM -0400, Brian Foster wrote:
> > > On Wed, Sep 11, 2019 at 09:38:58AM +1000, Dave Chinner wrote:
> > > > On Tue, Sep 10, 2019 at 05:56:28AM -0400, Brian Foster wrote:
> > > > > On Mon, Sep 09, 2019 at 09:26:32AM +1000, Dave Chinner wrote:
> > > > > > On Sat, Sep 07, 2019 at 11:10:50AM -0400, Brian Foster wrote:
> > > > > > > This is an instance of xfsaild going idle between the time this
> > > > > > > new AIL push sets the target based on the iclog about to be
> > > > > > > committed and AIL insertion of the associated log items,
> > > > > > > reproduced via a bit of timing instrumentation.  Don't be
> > > > > > > distracted by the timestamps or the fact that the LSNs do not
> > > > > > > match because the log items in the AIL end up indexed by the start
> > > > > > > lsn of the CIL checkpoint (whereas last_sync_lsn refers to the
> > > > > > > commit record). The point is simply that xfsaild has completed a
> > > > > > > push of a target that hasn't been inserted yet.
> > > > > > 
> > > > > > AFAICT, what you are showing requires delaying of the CIL push to the
> > > > > > point it violates a fundamental assumption about commit sizes, which
> > > > > > is why I largely think it's irrelevant.
> > > > > > 
> > > > > 
> > > > > The CIL checkpoint size is an unrelated side effect of the test I
> > > > > happened to use, not a fundamental cause of the problem it demonstrates.
> > > > > Fixing CIL checkpoint size issues won't change anything. Here's a
> > > > > different variant of the same problem with a small enough number of log
> > > > > items such that background CIL pushing is not a factor:
> > > > > 
> > > > >        <...>-79670 [000] ...1 56126.015522: xfs_log_force: dev 253:4 lsn 0x0 caller xfs_log_worker+0x2f/0xf0 [xfs]
> > > > > kworker/0:1H-220   [000] ...1 56126.030587: __xlog_grant_push_ail: 1596: threshold_lsn 0x1000032e4
> > > > > 	...
> > > > >        <...>-81293 [000] ...2 56126.032647: xfs_ail_delete: dev 253:4 lip 00000000cbe82125 old lsn 1/13026 new lsn 1/13026 type XFS_LI_INODE flags IN_AIL
> > > > >        <...>-81633 [000] .... 56126.053544: xfsaild: 588: idle ->ail_target 0x1000032e4
> > > > > kworker/0:1H-220   [000] ...2 56127.038835: xfs_ail_insert: dev 253:4 lip 00000000a44ab1ef old lsn 0/0 new lsn 1/13028 type XFS_LI_INODE flags IN_AIL
> > > > > kworker/0:1H-220   [000] ...2 56127.038911: xfs_ail_insert: dev 253:4 lip 0000000028d2061f old lsn 0/0 new lsn 1/13028 type XFS_LI_INODE flags IN_AIL
> > > > > 	....
> > > > >
> > > > > This sequence starts with one log item in the AIL and some number of
> > > > > items in the CIL such that a checkpoint executes from the background log
> > > > > worker. The worker forces the CIL and log I/O completion issues an AIL
> > > > > push that is truncated by the recently updated ->l_last_sync_lsn due to
> > > > > outstanding transaction reservation and small AIL size. This push races
> > > > > with completion of a previous push that empties the AIL and iclog
> > > > > callbacks insert log items for the current checkpoint at the LSN target
> > > > > xfsaild just idled at.
> > > > 
> > > > I'm just not seeing what the problem here is. The behaviour you are
> > > > describing has been around since day zero and doesn't require the
> > > > addition of an ail push from iclog completion to trigger.  Prior to
> > > > this series, it would be:
> > > > 
> > > 
> > > A few days ago you said that if we're inserting log items before the
> > > push target, "something is very wrong." Since this was what I was
> > > concerned about, I attempted to manufacture the issue to demonstrate.
> > > You suggested the first reproducer I came up with was a separate problem
> > > (related to CIL size issues), so I came up with the one above to avoid
> > > that distraction. Now you're telling me this has always happened and is
> > > fine..
> > > 
> > > While I don't think this is quite accurate (more below), I do find this
> > > reasoning somewhat amusing in that it essentially implies that this
> > > patch itself is dubious. If this new AIL push is required to fix a real
> > > issue, and this race is essentially manifest as implied, then this patch
> > > can't possibly reliably fix the original problem. Anyways, that is
> > > neither here nor there..
> > > 
> > > All of the details of this particular issue aside, I do think there's a
> > > development process problem here. It shouldn't require an extended game
> > > of whack-a-mole with this kind of inconsistent reasoning just to request
> > > a trivial change to a patch (you also implied in a previous response it
> > > was me wasting your time on this topic) that closes an obvious race and
> > > otherwise has no negative effect. Someone is being unreasonable here and
> > > I don't think it's me. More importantly, discussion of open issues
> > > shouldn't be a race against the associated patch being merged. :/
> > > 
> > > > process 1	reservation	log completion	xfsaild
> > > > <completes metadata IO>
> > > >   xfs_ail_delete()
> > > >     mlip_changed
> > > >     xlog_assign_tail_lsn_locked()
> > > >       ail empty, sets l_last_sync = 0x1000032e2
> > > >     xfs_log_space_wake()
> > > > 				xlog_state_do_callback
> > > > 				  sets CALLBACK
> > > > 				  sets last_sync_lsn to iclog head
> > > > 				    -> 0x1000032e4
> > > > 				  <drops icloglock, gets preempted>
> > > > 		<wakes>
> > > > 		xlog_grant_head_wait
> > > > 		  free_bytes < need_bytes
> > > > 		    xlog_grant_push_ail()
> > > > 		      xlog_push_ail()
> > > > 		        ->ail_target 0x1000032e4
> > > > 		<sleeps>
> > > > 						<wakes>
> > > > 						sets prev target to 0x1000032e4
> > > > 						sees empty AIL
> > > > 						<sleeps>
> > > > 				    <runs again>
> > > > 				    runs callbacks
> > > > 				      xfs_ail_insert(lsn = 0x1000032e4)
> > > > 
> > > > and now we have the AIL push thread asleep with items in it at the
> > > > push threshold.  IOWs, what you describe has always been possible,
> > > > and before the CIL was introduced this sort of thing happened quite
> > > > a bit because iclog completions freed up much less space in the log
> > > > than a CIL commit completion.
> > > > 
> > > 
> > > I was suspicious that this could occur prior to this change but I hadn't
> > > confirmed. The scenario documented above cannot occur because a push on
> > > an empty AIL has no effect. The target doesn't move and the task isn't
> > > woken. That said, I still suspect the race can occur with the current
> > > code via between a grant head waiter, AIL emptying and iclog completion.
> > > 
> > > This just speaks to the frequency of the problem, though. I'm not
> > > convinced it's something that happens "quite a bit" given the nature of
> > > the 3-way race. I also don't agree that existence of a historical
> > > problem somehow excuses introduction a new variant of the same problem.
> > > Instead, if this patch exposes a historical problem that simply had no
> > > noticeable impact to this point, we should probably look into whether it
> > > needs fixing too.
> > > 
> > > > It's not a problem, however, because if we are out of transaction
> > > > reservation space we must have transactions in progress, and as long
> > > > as they make progress then the commit of each transaction will end
> > > > up calling xlog_ungrant_log_space() to return the unused portion of
> > > > the transaction reservation. That calls xfs_log_space_wake() to
> > > > allow reservation waiters to try to make progress.
> > > > 
> > > 
> > > Yes, this is why I don't see immediate side effects in the tests I've
> > > run so far. The assumptions you're basing this off are not always true,
> > > however. Particularly on smaller (<= 1GB) filesystems, it's relatively
> > > easy to produce conditions where the entire reservation space is
> > > consumed by open transactions that don't ultimately commit anything to
> > > the log subsystem and thus generate no forward progress.
> > > 
> > > > If there's still not enough space reservation after the transaction
> > > > in progress has released it's reservation, then it goes back to
> > > > sleep. As long as we have active transactions in progress while
> > > > there are transaction reservations waiting on reservation space,
> > > > there will be a wakeup vector for the reservation independent of
> > > > the CIL, iclogs and AIL behaviour.
> > > > 
> > > 
> > > We do have clean transaction cancel and error scenarios, existing log
> > > deadlock vectors, increasing reliance on long running transactions via
> > > deferred ops, scrub, etc. Also consider the fact that open transactions
> > > consume considerably more reservation than committed transactions on
> > > average.
> > > 
> > > I'm not saying it's likely for a real world workload to consume the
> > > entirety of log reservation space via open transactions and then release
> > > it without filesystem modification (and then race with log I/O and AIL
> > > emptying), but from the perspective of proving the existence of a bug
> > > it's really not that difficult to produce. I've not seen a real world
> > > workload that reproduces the problems fixed by any of these patches
> > > either, but we still fix them.
> > > 
> > > > [ Yes, there was a bug here, in the case xfs_log_space_wake() did
> > > > not issue a wakeup because of not enough space being availble and
> > > > the push target was limited by the old log head location. i.e.
> > > > nothing ever updated the push target to reflect the new log head and
> > > > so the tail might never get moved now. That particular bug was fixed
> > > > by a an earlier patch in the series, so we can ignore it here. ]
> > > > 
> > > > IOWs, if the AIL is empty, the CIL cannot consume more than 25% of
> > > > the log space, and we have transactions waiting on log reservation
> > > > space, then we must have enough transactions in progress to cover at
> > > > least 75% of the log space. Completion of those transactions will
> > > > wake waiters and, if necessary, push the AIL again to keep the log
> > > > tail moving appropriately. This handles the AIL empty and "insert
> > > > before target" situations you are concerned about just fine, as long
> > > > as we have a guarantee of forwards progress. Bounding the CIL size
> > > > provides that forwards progress guarantee for the CIL...
> > > > 
> > > 
> > > I think you have some tunnel vision or something going on here with
> > > regard to the higher level architectural view of how things are supposed
> > > to operate in a normal running/steady state vs simply what can and
> > > cannot happen in the code. I can't really tell why/how, but the only
> > > suggestion I can make is to perhaps separate from this high level view
> > > of things and take a closer look at the code. This is a simple code bug,
> > > not some grand architectural flaw. The context here is way out of whack.
> > > The repeated unrelated and overblown architectural assertions come off
> > > as indication of lack of any real argument to allow this race to live.
> > > There is simply no such guarantee of forward progress in all scenarios
> > > that produce the conditions that can cause this race.
> > > 
> > > Yet another example:
> > > 
> > >            <...>-369   [002] ...2   220.055746: xfs_ail_insert: dev 253:4 lip 00000000ddb123f2 old lsn 0/0 new lsn 1/248 type XFS_LI_INODE flags IN_AIL
> > >            <...>-27    [003] ...1   224.753110: xfs_log_force: dev 253:4 lsn 0x0 caller xfs_log_worker+0x2f/0xf0 [xfs]
> > >            <...>-404   [003] ...1   224.775551: __xlog_grant_push_ail: 1596: threshold_lsn 0x1000000fa
> > >      kworker/3:1-39    [003] ...2   224.777953: xfs_ail_delete: dev 253:4 lip 00000000ddb123f2 old lsn 1/248 new lsn 1/248 type XFS_LI_INODE flags IN_AIL
> > >     xfsaild/dm-4-1034  [000] ....   224.797919: xfsaild: 588: idle ->ail_target 0x1000000fa
> > >     kworker/3:1H-404   [003] ...2   225.841198: xfs_ail_insert: dev 253:4 lip 000000006845aeed old lsn 0/0 new lsn 1/250 type XFS_LI_INODE flags IN_AIL
> > >      kworker/3:1-39    [003] ...1   254.962822: xfs_log_force: dev 253:4 lsn 0x0 caller xfs_log_worker+0x2f/0xf0 [xfs]
> > > 	...
> > >      kworker/3:2-1920  [003] ...1  3759.291275: xfs_log_force: dev 253:4 lsn 0x0 caller xfs_log_worker+0x2f/0xf0 [xfs]
> > > 
> > > 
> > > # cat /sys/fs/xfs/dm-4/log/log_*lsn
> > > 1:252
> > > 1:250
> > > 
> > > This instance of the race uses the same serialization instrumentation to
> > > control execution timing and whatnot as before (i.e. no functional
> > > changes). First, an item is inserted into the AIL. Immediately after AIL
> > > insertion, another transaction commits to the CIL (not shown in the
> > > trace). The background log worker comes around a few seconds later and
> > > forces the log/CIL. The checkpoint for this log force races with an AIL
> > > delete and idle (same as before). AIL insertion occurs at the push
> > > target xfsaild just idled at, but this time reservation pressure
> > > relieves and the filesystem goes idle.
> > > 
> > > At this point, nothing occurs on the fs except for continuous background
> > > log worker jobs. Note the timestamp difference between the first
> > > post-race log force and the last in the trace. The log worker runs at
> > > the default 30s interval and has run repeatedly for almost an hour while
> > > failing to push the AIL and subsequently cover the log. To confirm the
> > > AIL is populated, see the log head/tail LSNs reported via sysfs. This
> > > state persists indefinitely so long as the fs is idle. This is a bug.
> > 
> > /me stumbles back in after ~2wks, and has a few questions:
> > 
> 
> Heh, welcome back.. ;)
> 
> > 1) Are these concerns a reason to hold up this series, or are they a
> > separate bug lurking in the code being touched by the series?  AFAICT I
> > think it's the second, but <shrug> my brain is still mush.
> > 
> 
> A little of both I guess. To Dave's earlier point, I think this
> technically can happen in the existing code as a 3-way race between the
> aforementioned tasks (just not the way it was described). OTOH, I'm not
> sure what this has to do with the fact that the new code being added is
> racy on its own (or since when discovery of some old bug justifies
> adding new ones..?). The examples shown above are fundamentally races
> between log I/O completion and xfsaild. The last one shows the log
> remain uncovered indefinitely on an idle fs (which is not a corruption
> or anything, but certainly a bug) simply because that's the easiest side
> effect to reproduce. I'm fairly confident at this point that one could
> be manufactured into a similar log deadlock if we really wanted to try,
> but that would be much more difficult and TBH I'm tired of burning
> myself out on these kind of objections to obvious and easily addressed
> landmines. How likely is it that somebody would hit these problems?
> Probably highly unlikely. How likely is it somebody would hit this
> problem before whatever problem this patch fixes? *shrug*
> 
> I don't think it's a reason to hold up the series, but at the same time
> this patch is unrelated to the original problem. IIRC, it fell out of
> some other issue reproduced with a different experimental hack/fix (that
> was eventually replaced) to the original problem. FWIW, I'm annoyed with
> the lazy approach to review here more than anything. In hindsight, if I
> knew the feedback was going to be dismissed and the patchset rolled
> forward and merged, perhaps I should have just nacked the subsequent
> reposts to make the objection clear.

:(

I'm sorry you feel that way.  I myself don't feel that my own handling
of this merge window has been good, between feeling pressured to get the
branches ready to go before vacation and for-next becoming intermittent
right around the same time.  Those both decrease my certainty about
what's going in the next merge and increases my own anxieties, and it
becomes a competition in my head between "I can add it now and revert it
later as a regression fix" vs. "if I don't add it I'll wonder if it was
necessary".

Anyway, I /think/ the end result is that if the AIL gets stuck /and/ the
system goes down before it becomes unstuck, then there'll be more work
for log recovery to do, because we failed to checkpoint everything that
we possibly could have before the crash?

So AFAICT it's not a critical disaster bug but I would like to study
this situation some more, particularly now that we have ~2mos for
stabilizing things.

> I dunno, not my call on what to do with it now. Feel free to add my
> Nacked-by: to the upstream commit I guess so I at least remember this
> when/if considering whether to backport it anywhere. :/

(/me continues to wish there was an easy way to add tagging to a commit,
particularly when it comes well after the fact.)

> > 2) Er... how do you get the log stuck like this?  I see things earlier
> > in the thread like "open transactions that don't ultimately commit
> > anything to the log subsystem" and think "OH, you mean xfs_scrub!"
> > 
> 
> That's one thing I was thinking about but I didn't end up looking into
> it (does scrub actually acquire log reservation?).

If you invoke the scrub ioctl with IFLAG_REPAIR set, it allocates a
non-empty transaction (itruncate, iirc) to do the check and rebuild the
data structure.  If the item is ok then it'll cancel the transaction.

> For a more simple
> example, consider a bunch of threads running into quota block allocation
> failures where a system is also under memory pressure. On filesystems
> with smaller logs, it only takes a handful of such threads to bash the
> reservation grant head against the log tail even though the log is empty
> (and doing so without ever committing anything to the log).
> 
> Note that this by itself isn't what gets the log "stuck" in the most
> recent example (note: not deadlocked), but rather if we're in a state
> where the grant head is close enough to the log head (such that we AIL
> push the items associated with the current checkpoint before it inserts)
> when log I/O completion happens to race with AIL emptying as described.

Hmm... I wonder if we could reproduce this by formatting a filesystem
with a small log; running a slow moving thread that touches a file once
per second (or something to generate a moderate amount of workload) and
monitors the log to watch its progress; and then kicking off dozens of
threads to invoke IFLAG_REPAIR scrubbers on some other non-corrupt part
of the filesystem?

--D

> Brian
> 
> > --D
> > 
> > > Brian
> > > 
> > > > Cheers,
> > > > 
> > > > Dave.
> > > > -- 
> > > > Dave Chinner
> > > > david@fromorbit.com

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 7/7] xfs: push the grant head when the log head moves forward
  2019-09-24 17:16                           ` Darrick J. Wong
@ 2019-09-26 13:19                             ` Brian Foster
  0 siblings, 0 replies; 41+ messages in thread
From: Brian Foster @ 2019-09-26 13:19 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: Dave Chinner, linux-xfs

On Tue, Sep 24, 2019 at 10:16:09AM -0700, Darrick J. Wong wrote:
> On Tue, Sep 17, 2019 at 08:48:27AM -0400, Brian Foster wrote:
> > On Mon, Sep 16, 2019 at 09:31:56PM -0700, Darrick J. Wong wrote:
> > > On Thu, Sep 12, 2019 at 09:46:06AM -0400, Brian Foster wrote:
> > > > On Wed, Sep 11, 2019 at 09:38:58AM +1000, Dave Chinner wrote:
> > > > > On Tue, Sep 10, 2019 at 05:56:28AM -0400, Brian Foster wrote:
> > > > > > On Mon, Sep 09, 2019 at 09:26:32AM +1000, Dave Chinner wrote:
> > > > > > > On Sat, Sep 07, 2019 at 11:10:50AM -0400, Brian Foster wrote:
> > > > > > > > This is an instance of xfsaild going idle between the time this
> > > > > > > > new AIL push sets the target based on the iclog about to be
> > > > > > > > committed and AIL insertion of the associated log items,
> > > > > > > > reproduced via a bit of timing instrumentation.  Don't be
> > > > > > > > distracted by the timestamps or the fact that the LSNs do not
> > > > > > > > match because the log items in the AIL end up indexed by the start
> > > > > > > > lsn of the CIL checkpoint (whereas last_sync_lsn refers to the
> > > > > > > > commit record). The point is simply that xfsaild has completed a
> > > > > > > > push of a target that hasn't been inserted yet.
> > > > > > > 
> > > > > > > AFAICT, what you are showing requires delaying of the CIL push to the
> > > > > > > point it violates a fundamental assumption about commit sizes, which
> > > > > > > is why I largely think it's irrelevant.
> > > > > > > 
> > > > > > 
> > > > > > The CIL checkpoint size is an unrelated side effect of the test I
> > > > > > happened to use, not a fundamental cause of the problem it demonstrates.
> > > > > > Fixing CIL checkpoint size issues won't change anything. Here's a
> > > > > > different variant of the same problem with a small enough number of log
> > > > > > items such that background CIL pushing is not a factor:
> > > > > > 
> > > > > >        <...>-79670 [000] ...1 56126.015522: xfs_log_force: dev 253:4 lsn 0x0 caller xfs_log_worker+0x2f/0xf0 [xfs]
> > > > > > kworker/0:1H-220   [000] ...1 56126.030587: __xlog_grant_push_ail: 1596: threshold_lsn 0x1000032e4
> > > > > > 	...
> > > > > >        <...>-81293 [000] ...2 56126.032647: xfs_ail_delete: dev 253:4 lip 00000000cbe82125 old lsn 1/13026 new lsn 1/13026 type XFS_LI_INODE flags IN_AIL
> > > > > >        <...>-81633 [000] .... 56126.053544: xfsaild: 588: idle ->ail_target 0x1000032e4
> > > > > > kworker/0:1H-220   [000] ...2 56127.038835: xfs_ail_insert: dev 253:4 lip 00000000a44ab1ef old lsn 0/0 new lsn 1/13028 type XFS_LI_INODE flags IN_AIL
> > > > > > kworker/0:1H-220   [000] ...2 56127.038911: xfs_ail_insert: dev 253:4 lip 0000000028d2061f old lsn 0/0 new lsn 1/13028 type XFS_LI_INODE flags IN_AIL
> > > > > > 	....
> > > > > >
> > > > > > This sequence starts with one log item in the AIL and some number of
> > > > > > items in the CIL such that a checkpoint executes from the background log
> > > > > > worker. The worker forces the CIL and log I/O completion issues an AIL
> > > > > > push that is truncated by the recently updated ->l_last_sync_lsn due to
> > > > > > outstanding transaction reservation and small AIL size. This push races
> > > > > > with completion of a previous push that empties the AIL and iclog
> > > > > > callbacks insert log items for the current checkpoint at the LSN target
> > > > > > xfsaild just idled at.
> > > > > 
> > > > > I'm just not seeing what the problem here is. The behaviour you are
> > > > > describing has been around since day zero and doesn't require the
> > > > > addition of an ail push from iclog completion to trigger.  Prior to
> > > > > this series, it would be:
> > > > > 
> > > > 
> > > > A few days ago you said that if we're inserting log items before the
> > > > push target, "something is very wrong." Since this was what I was
> > > > concerned about, I attempted to manufacture the issue to demonstrate.
> > > > You suggested the first reproducer I came up with was a separate problem
> > > > (related to CIL size issues), so I came up with the one above to avoid
> > > > that distraction. Now you're telling me this has always happened and is
> > > > fine..
> > > > 
> > > > While I don't think this is quite accurate (more below), I do find this
> > > > reasoning somewhat amusing in that it essentially implies that this
> > > > patch itself is dubious. If this new AIL push is required to fix a real
> > > > issue, and this race is essentially manifest as implied, then this patch
> > > > can't possibly reliably fix the original problem. Anyways, that is
> > > > neither here nor there..
> > > > 
> > > > All of the details of this particular issue aside, I do think there's a
> > > > development process problem here. It shouldn't require an extended game
> > > > of whack-a-mole with this kind of inconsistent reasoning just to request
> > > > a trivial change to a patch (you also implied in a previous response it
> > > > was me wasting your time on this topic) that closes an obvious race and
> > > > otherwise has no negative effect. Someone is being unreasonable here and
> > > > I don't think it's me. More importantly, discussion of open issues
> > > > shouldn't be a race against the associated patch being merged. :/
> > > > 
> > > > > process 1	reservation	log completion	xfsaild
> > > > > <completes metadata IO>
> > > > >   xfs_ail_delete()
> > > > >     mlip_changed
> > > > >     xlog_assign_tail_lsn_locked()
> > > > >       ail empty, sets l_last_sync = 0x1000032e2
> > > > >     xfs_log_space_wake()
> > > > > 				xlog_state_do_callback
> > > > > 				  sets CALLBACK
> > > > > 				  sets last_sync_lsn to iclog head
> > > > > 				    -> 0x1000032e4
> > > > > 				  <drops icloglock, gets preempted>
> > > > > 		<wakes>
> > > > > 		xlog_grant_head_wait
> > > > > 		  free_bytes < need_bytes
> > > > > 		    xlog_grant_push_ail()
> > > > > 		      xlog_push_ail()
> > > > > 		        ->ail_target 0x1000032e4
> > > > > 		<sleeps>
> > > > > 						<wakes>
> > > > > 						sets prev target to 0x1000032e4
> > > > > 						sees empty AIL
> > > > > 						<sleeps>
> > > > > 				    <runs again>
> > > > > 				    runs callbacks
> > > > > 				      xfs_ail_insert(lsn = 0x1000032e4)
> > > > > 
> > > > > and now we have the AIL push thread asleep with items in it at the
> > > > > push threshold.  IOWs, what you describe has always been possible,
> > > > > and before the CIL was introduced this sort of thing happened quite
> > > > > a bit because iclog completions freed up much less space in the log
> > > > > than a CIL commit completion.
> > > > > 
> > > > 
> > > > I was suspicious that this could occur prior to this change but I hadn't
> > > > confirmed. The scenario documented above cannot occur because a push on
> > > > an empty AIL has no effect. The target doesn't move and the task isn't
> > > > woken. That said, I still suspect the race can occur with the current
> > > > code via between a grant head waiter, AIL emptying and iclog completion.
> > > > 
> > > > This just speaks to the frequency of the problem, though. I'm not
> > > > convinced it's something that happens "quite a bit" given the nature of
> > > > the 3-way race. I also don't agree that existence of a historical
> > > > problem somehow excuses introduction a new variant of the same problem.
> > > > Instead, if this patch exposes a historical problem that simply had no
> > > > noticeable impact to this point, we should probably look into whether it
> > > > needs fixing too.
> > > > 
> > > > > It's not a problem, however, because if we are out of transaction
> > > > > reservation space we must have transactions in progress, and as long
> > > > > as they make progress then the commit of each transaction will end
> > > > > up calling xlog_ungrant_log_space() to return the unused portion of
> > > > > the transaction reservation. That calls xfs_log_space_wake() to
> > > > > allow reservation waiters to try to make progress.
> > > > > 
> > > > 
> > > > Yes, this is why I don't see immediate side effects in the tests I've
> > > > run so far. The assumptions you're basing this off are not always true,
> > > > however. Particularly on smaller (<= 1GB) filesystems, it's relatively
> > > > easy to produce conditions where the entire reservation space is
> > > > consumed by open transactions that don't ultimately commit anything to
> > > > the log subsystem and thus generate no forward progress.
> > > > 
> > > > > If there's still not enough space reservation after the transaction
> > > > > in progress has released it's reservation, then it goes back to
> > > > > sleep. As long as we have active transactions in progress while
> > > > > there are transaction reservations waiting on reservation space,
> > > > > there will be a wakeup vector for the reservation independent of
> > > > > the CIL, iclogs and AIL behaviour.
> > > > > 
> > > > 
> > > > We do have clean transaction cancel and error scenarios, existing log
> > > > deadlock vectors, increasing reliance on long running transactions via
> > > > deferred ops, scrub, etc. Also consider the fact that open transactions
> > > > consume considerably more reservation than committed transactions on
> > > > average.
> > > > 
> > > > I'm not saying it's likely for a real world workload to consume the
> > > > entirety of log reservation space via open transactions and then release
> > > > it without filesystem modification (and then race with log I/O and AIL
> > > > emptying), but from the perspective of proving the existence of a bug
> > > > it's really not that difficult to produce. I've not seen a real world
> > > > workload that reproduces the problems fixed by any of these patches
> > > > either, but we still fix them.
> > > > 
> > > > > [ Yes, there was a bug here, in the case xfs_log_space_wake() did
> > > > > not issue a wakeup because of not enough space being availble and
> > > > > the push target was limited by the old log head location. i.e.
> > > > > nothing ever updated the push target to reflect the new log head and
> > > > > so the tail might never get moved now. That particular bug was fixed
> > > > > by a an earlier patch in the series, so we can ignore it here. ]
> > > > > 
> > > > > IOWs, if the AIL is empty, the CIL cannot consume more than 25% of
> > > > > the log space, and we have transactions waiting on log reservation
> > > > > space, then we must have enough transactions in progress to cover at
> > > > > least 75% of the log space. Completion of those transactions will
> > > > > wake waiters and, if necessary, push the AIL again to keep the log
> > > > > tail moving appropriately. This handles the AIL empty and "insert
> > > > > before target" situations you are concerned about just fine, as long
> > > > > as we have a guarantee of forwards progress. Bounding the CIL size
> > > > > provides that forwards progress guarantee for the CIL...
> > > > > 
> > > > 
> > > > I think you have some tunnel vision or something going on here with
> > > > regard to the higher level architectural view of how things are supposed
> > > > to operate in a normal running/steady state vs simply what can and
> > > > cannot happen in the code. I can't really tell why/how, but the only
> > > > suggestion I can make is to perhaps separate from this high level view
> > > > of things and take a closer look at the code. This is a simple code bug,
> > > > not some grand architectural flaw. The context here is way out of whack.
> > > > The repeated unrelated and overblown architectural assertions come off
> > > > as indication of lack of any real argument to allow this race to live.
> > > > There is simply no such guarantee of forward progress in all scenarios
> > > > that produce the conditions that can cause this race.
> > > > 
> > > > Yet another example:
> > > > 
> > > >            <...>-369   [002] ...2   220.055746: xfs_ail_insert: dev 253:4 lip 00000000ddb123f2 old lsn 0/0 new lsn 1/248 type XFS_LI_INODE flags IN_AIL
> > > >            <...>-27    [003] ...1   224.753110: xfs_log_force: dev 253:4 lsn 0x0 caller xfs_log_worker+0x2f/0xf0 [xfs]
> > > >            <...>-404   [003] ...1   224.775551: __xlog_grant_push_ail: 1596: threshold_lsn 0x1000000fa
> > > >      kworker/3:1-39    [003] ...2   224.777953: xfs_ail_delete: dev 253:4 lip 00000000ddb123f2 old lsn 1/248 new lsn 1/248 type XFS_LI_INODE flags IN_AIL
> > > >     xfsaild/dm-4-1034  [000] ....   224.797919: xfsaild: 588: idle ->ail_target 0x1000000fa
> > > >     kworker/3:1H-404   [003] ...2   225.841198: xfs_ail_insert: dev 253:4 lip 000000006845aeed old lsn 0/0 new lsn 1/250 type XFS_LI_INODE flags IN_AIL
> > > >      kworker/3:1-39    [003] ...1   254.962822: xfs_log_force: dev 253:4 lsn 0x0 caller xfs_log_worker+0x2f/0xf0 [xfs]
> > > > 	...
> > > >      kworker/3:2-1920  [003] ...1  3759.291275: xfs_log_force: dev 253:4 lsn 0x0 caller xfs_log_worker+0x2f/0xf0 [xfs]
> > > > 
> > > > 
> > > > # cat /sys/fs/xfs/dm-4/log/log_*lsn
> > > > 1:252
> > > > 1:250
> > > > 
> > > > This instance of the race uses the same serialization instrumentation to
> > > > control execution timing and whatnot as before (i.e. no functional
> > > > changes). First, an item is inserted into the AIL. Immediately after AIL
> > > > insertion, another transaction commits to the CIL (not shown in the
> > > > trace). The background log worker comes around a few seconds later and
> > > > forces the log/CIL. The checkpoint for this log force races with an AIL
> > > > delete and idle (same as before). AIL insertion occurs at the push
> > > > target xfsaild just idled at, but this time reservation pressure
> > > > relieves and the filesystem goes idle.
> > > > 
> > > > At this point, nothing occurs on the fs except for continuous background
> > > > log worker jobs. Note the timestamp difference between the first
> > > > post-race log force and the last in the trace. The log worker runs at
> > > > the default 30s interval and has run repeatedly for almost an hour while
> > > > failing to push the AIL and subsequently cover the log. To confirm the
> > > > AIL is populated, see the log head/tail LSNs reported via sysfs. This
> > > > state persists indefinitely so long as the fs is idle. This is a bug.
> > > 
> > > /me stumbles back in after ~2wks, and has a few questions:
> > > 
> > 
> > Heh, welcome back.. ;)
> > 
> > > 1) Are these concerns a reason to hold up this series, or are they a
> > > separate bug lurking in the code being touched by the series?  AFAICT I
> > > think it's the second, but <shrug> my brain is still mush.
> > > 
> > 
> > A little of both I guess. To Dave's earlier point, I think this
> > technically can happen in the existing code as a 3-way race between the
> > aforementioned tasks (just not the way it was described). OTOH, I'm not
> > sure what this has to do with the fact that the new code being added is
> > racy on its own (or since when discovery of some old bug justifies
> > adding new ones..?). The examples shown above are fundamentally races
> > between log I/O completion and xfsaild. The last one shows the log
> > remain uncovered indefinitely on an idle fs (which is not a corruption
> > or anything, but certainly a bug) simply because that's the easiest side
> > effect to reproduce. I'm fairly confident at this point that one could
> > be manufactured into a similar log deadlock if we really wanted to try,
> > but that would be much more difficult and TBH I'm tired of burning
> > myself out on these kind of objections to obvious and easily addressed
> > landmines. How likely is it that somebody would hit these problems?
> > Probably highly unlikely. How likely is it somebody would hit this
> > problem before whatever problem this patch fixes? *shrug*
> > 
> > I don't think it's a reason to hold up the series, but at the same time
> > this patch is unrelated to the original problem. IIRC, it fell out of
> > some other issue reproduced with a different experimental hack/fix (that
> > was eventually replaced) to the original problem. FWIW, I'm annoyed with
> > the lazy approach to review here more than anything. In hindsight, if I
> > knew the feedback was going to be dismissed and the patchset rolled
> > forward and merged, perhaps I should have just nacked the subsequent
> > reposts to make the objection clear.
> 
> :(
> 
> I'm sorry you feel that way.  I myself don't feel that my own handling
> of this merge window has been good, between feeling pressured to get the
> branches ready to go before vacation and for-next becoming intermittent
> right around the same time.  Those both decrease my certainty about
> what's going in the next merge and increases my own anxieties, and it
> becomes a competition in my head between "I can add it now and revert it
> later as a regression fix" vs. "if I don't add it I'll wonder if it was
> necessary".
> 

Eh, it is what it is. I don't expect to always agree on everything. I
could/should have noted the objection more clearly on subsequent posts
and will try to do so in the future. Conversely, a quick "there appears
to be a disagreement, the maintainer is making a decision" note on the
list would be appreciated should that play out in the future.

Just my .02 (and not saying this is some kind of pattern or anything),
but I do think the whole "merge it since we can revert it from for-next
later if it causes problems" thing kind of sets a low bar and a bad
precedent. What's the point of reviewing patches with that approach?
Beyond discouraging review, it also diminishes sense of responsibility
for the quality of affected areas of code IMO.

> Anyway, I /think/ the end result is that if the AIL gets stuck /and/ the
> system goes down before it becomes unstuck, then there'll be more work
> for log recovery to do, because we failed to checkpoint everything that
> we possibly could have before the crash?
> 

Yep, in this particular variant at least.

> So AFAICT it's not a critical disaster bug but I would like to study
> this situation some more, particularly now that we have ~2mos for
> stabilizing things.
> 

Right, this is definitely not a critical bug. AFAICT neither is the bug
fixed by the patch.

Note that just on principle I'm not going to spend a whole lot of time
testing for things post-merge that I consider and/or call out as
problems during review and end up ignored. The point of addressing such
things during review is to avoid those problems in the first place. In
this particular case, why spend time on that when it is so relatively
simple to relocate the xlog_grant_push_ail() call to after callback
processing? I still don't understand that tbh.

> > I dunno, not my call on what to do with it now. Feel free to add my
> > Nacked-by: to the upstream commit I guess so I at least remember this
> > when/if considering whether to backport it anywhere. :/
> 
> (/me continues to wish there was an easy way to add tagging to a commit,
> particularly when it comes well after the fact.)
> 

No big deal, this was all in hindsight.

> > > 2) Er... how do you get the log stuck like this?  I see things earlier
> > > in the thread like "open transactions that don't ultimately commit
> > > anything to the log subsystem" and think "OH, you mean xfs_scrub!"
> > > 
> > 
> > That's one thing I was thinking about but I didn't end up looking into
> > it (does scrub actually acquire log reservation?).
> 
> If you invoke the scrub ioctl with IFLAG_REPAIR set, it allocates a
> non-empty transaction (itruncate, iirc) to do the check and rebuild the
> data structure.  If the item is ok then it'll cancel the transaction.
> 

Ah, tr_itruncate is actually larger than tr_write too (which I assume is
why it's used here)...

> > For a more simple
> > example, consider a bunch of threads running into quota block allocation
> > failures where a system is also under memory pressure. On filesystems
> > with smaller logs, it only takes a handful of such threads to bash the
> > reservation grant head against the log tail even though the log is empty
> > (and doing so without ever committing anything to the log).
> > 
> > Note that this by itself isn't what gets the log "stuck" in the most
> > recent example (note: not deadlocked), but rather if we're in a state
> > where the grant head is close enough to the log head (such that we AIL
> > push the items associated with the current checkpoint before it inserts)
> > when log I/O completion happens to race with AIL emptying as described.
> 
> Hmm... I wonder if we could reproduce this by formatting a filesystem
> with a small log; running a slow moving thread that touches a file once
> per second (or something to generate a moderate amount of workload) and
> monitors the log to watch its progress; and then kicking off dozens of
> threads to invoke IFLAG_REPAIR scrubbers on some other non-corrupt part
> of the filesystem?
> 

Yeah, given the above and if we're able to kick off enough concurrent
scrubbers such that they do nontrivial work and don't block eachother
before transaction allocation, that looks like it could result in
similar behavior wrt to reservation. Concurrent allocbt scrubs perhaps?
If so, on an otherwise default sized 100g fs, concurrent repair scrubs
to anything more than 18 or so AGs (based on tr_itruncate size) would be
enough to cause AIL pushing from the log subsystem (with a minimum size
log, I think something like 3 AGs would be enough) without any guarantee
that any of those transactions commit. The rest is just a simple race
between AIL pushing a fabricated target and xfsaild.

BTW, another thing I noticed with regard to timing is that on 32-bit
systems the push target is updated under ->ail_lock, which I think means
the xfs_ail_push() call from log I/O completion can serialize behind an
xfsaild push in progress (after the former has checked for a !empty
AIL). I'm not sure that's enough to make the race easy to reproduce
without explicit delay injection, but it's a step in that direction..

Brian

> --D
> 
> > Brian
> > 
> > > --D
> > > 
> > > > Brian
> > > > 
> > > > > Cheers,
> > > > > 
> > > > > Dave.
> > > > > -- 
> > > > > Dave Chinner
> > > > > david@fromorbit.com

^ permalink raw reply	[flat|nested] 41+ messages in thread

end of thread, other threads:[~2019-09-26 13:19 UTC | newest]

Thread overview: 41+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-09-04  4:24 [PATCH 0/7] xfs: log race fixes and cleanups Dave Chinner
2019-09-04  4:24 ` [PATCH 1/7] xfs: push the AIL in xlog_grant_head_wake Dave Chinner
2019-09-04  6:07   ` Christoph Hellwig
2019-09-04 21:46     ` Dave Chinner
2019-09-04  4:24 ` [PATCH 2/7] xfs: fix missed wakeup on l_flush_wait Dave Chinner
2019-09-04  6:07   ` Christoph Hellwig
2019-09-04 21:47     ` Dave Chinner
2019-09-04  4:24 ` [PATCH 3/7] xfs: factor debug code out of xlog_state_do_callback() Dave Chinner
2019-09-04  6:10   ` Christoph Hellwig
2019-09-04 21:14     ` Dave Chinner
2019-09-04  4:24 ` [PATCH 4/7] xfs: factor callbacks " Dave Chinner
2019-09-04  6:13   ` Christoph Hellwig
2019-09-04  6:32   ` Christoph Hellwig
2019-09-04 21:22     ` Dave Chinner
2019-09-04  4:24 ` [PATCH 5/7] xfs: factor iclog state processing " Dave Chinner
2019-09-04  6:42   ` Christoph Hellwig
2019-09-04 21:43     ` Dave Chinner
2019-09-04  4:24 ` [PATCH 6/7] xfs: push iclog state cleaning into xlog_state_clean_log Dave Chinner
2019-09-04  6:44   ` Christoph Hellwig
2019-09-04  4:24 ` [PATCH 7/7] xfs: push the grant head when the log head moves forward Dave Chinner
2019-09-04  6:45   ` Christoph Hellwig
2019-09-04 21:49     ` Dave Chinner
2019-09-04 19:34   ` Brian Foster
2019-09-04 22:50     ` Dave Chinner
2019-09-05 16:25       ` Brian Foster
2019-09-06  0:02         ` Dave Chinner
2019-09-06 13:10           ` Brian Foster
2019-09-07 15:10             ` Brian Foster
2019-09-08 23:26               ` Dave Chinner
2019-09-10  9:56                 ` Brian Foster
2019-09-10 23:38                   ` Dave Chinner
2019-09-12 13:46                     ` Brian Foster
2019-09-17  4:31                       ` Darrick J. Wong
2019-09-17 12:48                         ` Brian Foster
2019-09-24 17:16                           ` Darrick J. Wong
2019-09-26 13:19                             ` Brian Foster
2019-09-04  5:26 ` [PATCH 0/7] xfs: log race fixes and cleanups Christoph Hellwig
2019-09-04  5:56   ` Christoph Hellwig
2019-09-04 22:57     ` Dave Chinner
     [not found]       ` <20190905065133.GA21840@infradead.org>
2019-09-05  7:10         ` Dave Chinner
2019-09-05  7:28           ` Dave Chinner

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).