* [PATCH 0/3] xfs: CIL improvements
@ 2021-02-23  5:32 Dave Chinner
  2021-02-23  5:32 ` [PATCH 1/3] xfs: xfs_log_force_lsn isn't passed a LSN Dave Chinner
                   ` (2 more replies)
  0 siblings, 3 replies; 33+ messages in thread
From: Dave Chinner @ 2021-02-23  5:32 UTC (permalink / raw)
  To: linux-xfs

Hi folks,

The following patches improve the behaviour of the CIL and increase
the processing capacity available for pushing changes into the
journal.

The first patch is an interface change, fixing something I did when
first introducing delayed logging to make the change the CIL imposed
on the rest of the code easier to understand. That was to "pretend"
that the log items still recorded the log sequence numbers that they
were committed at, and that these LSNs could be passed to
xfs_log_force_lsn() to force the log to a certain LSN.

The truth is that the CIL hands out CIL sequence numbers to log
items on commit, not journal log sequence numbers. LSNs are now
essentially an internal log and AIL implementation detail;
everything else that runs transactions and interfaces with log items
to force the log works with CIL sequence numbers.

Hence the first patch converts xfs_log_force_lsn() to
xfs_log_force_seq() and all the high level stuff to refer to
sequence numbers rather than LSNs.
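
In sketch form, the top-level interface change is just this
(prototypes taken from the first patch below):

/* Before: the "lsn" passed in here was never an on-disk LSN. */
int	xfs_log_force_lsn(struct xfs_mount *mp, xfs_lsn_t lsn, uint flags,
		int *log_flushed);

/* After: the name and type now match what callers actually pass. */
int	xfs_log_force_seq(struct xfs_mount *mp, uint64_t seq, uint flags,
		int *log_flushed);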

The second patch gets rid of severe latency issues that the AIL can
see due to having to force the log to unpin items. While the AIL is
running a log force, it does nothing else, and hence we can run out
of log space before the log force returns and the AIL starts pushing
again.

The AIL only needs to push on the CIL to get items unpinned, and it
doesn't need to wait for it to complete, either, before it continues
onwards trying to push out items to disk. The AIL will back off when
it reaches its target, so it doesn't need to wait on log forces to back
off when there are pinned items in the AIL.

Hence this patch adds a mechanism to do an async push on the CIL
that does not block and changes the AIL to use it. This results in
the AIL backing off on its own short timeouts and trying to make
progress repeatedly instead of stalling for seconds waiting for log
forces to complete.
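
In patch 2 below this boils down to a one-line change in
xfsaild_push(), where a sequence of zero means "push the current
sequence" and the absence of XFS_LOG_SYNC means "don't wait for it":

-		xfs_log_force(mp, XFS_LOG_SYNC);
+		xlog_cil_force_seq(mp->m_log, 0, 0);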

The last patch is a fix to the pipelining of the CIL pushes. The
pipelining isn't working as intended; it's actually serialising and
resulting in the CIL push work being CPU bound and limiting the rate
at which items can be pushed to the journal. It is also creating
excessive push latency where the CIL fills and hits the hard
throttle while waiting for the push work to finish the current push
and then start on the new push and swap in a new CIL context that
can be committed to.

Essentially, the problem is an implementation problem, not a design
flaw. The implementation has a single work attached to the CIL,
meaning we can only have a single outstanding push work in progress
at any time. The workqueue can handle more, but we only have a
single work. So the fix is to move the work to the CIL context so we
can queue and process multiple works at the same time, thereby
actually allowing the CIL push work to pipeline in the intended
manner.
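
As a sketch of the mechanism, the fix allocates and initialises one
work per context (this is the xlog_cil_ctx_alloc() helper added in
patch 3 below):

static struct xfs_cil_ctx *
xlog_cil_ctx_alloc(void)
{
	struct xfs_cil_ctx	*ctx;

	ctx = kmem_zalloc(sizeof(*ctx), KM_NOFS);
	INIT_LIST_HEAD(&ctx->committing);
	INIT_LIST_HEAD(&ctx->busy_extents);
	/* One work per context, so context pushes can overlap. */
	INIT_WORK(&ctx->push_work, xlog_cil_push_work);
	return ctx;
}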

Cheers,

Dave.




* [PATCH 1/3] xfs: xfs_log_force_lsn isn't passed a LSN
  2021-02-23  5:32 [PATCH 0/3] xfs: CIL improvements Dave Chinner
@ 2021-02-23  5:32 ` Dave Chinner
  2021-02-24 21:42   ` Darrick J. Wong
  2021-03-02 18:12   ` Brian Foster
  2021-02-23  5:32 ` [PATCH 2/3] xfs: AIL needs asynchronous CIL forcing Dave Chinner
  2021-02-23  5:32 ` [PATCH 3/3] xfs: CIL work is serialised, not pipelined Dave Chinner
  2 siblings, 2 replies; 33+ messages in thread
From: Dave Chinner @ 2021-02-23  5:32 UTC (permalink / raw)
  To: linux-xfs

From: Dave Chinner <dchinner@redhat.com>

In doing an investigation into AIL push stalls, I was looking at the
log force code to see if an async CIL push could be done instead.
This led me to xfs_log_force_lsn() and looking at how it works.

xfs_log_force_lsn() is only called from inode synchronisation
contexts such as fsync(), and it takes the ip->i_itemp->ili_last_lsn
value as the LSN to sync the log to. This gets passed to
xlog_cil_force_lsn() via xfs_log_force_lsn() to flush the CIL to the
journal, and then used by xfs_log_force_lsn() to flush the iclogs to
the journal.

The problem with this is that ip->i_itemp->ili_last_lsn does not store a
log sequence number. What it stores is passed to it from the
->iop_committing method, which is called by xfs_log_commit_cil().
The value this passes to the iop_committing method is the CIL
context sequence number that the item was committed to.

As it turns out, xlog_cil_force_lsn() converts the sequence to an
actual commit LSN for the related context and returns that to
xfs_log_force_lsn(). xfs_log_force_lsn() overwrites its "lsn"
variable that contained a sequence with an actual LSN and then uses
that to sync the iclogs.

This caused me some confusion for a while, even though I originally
wrote all this code a decade ago. ->iop_committing is only used by
a couple of log item types, and only inode items use the sequence
number it is passed.

Let's clean up the API, CIL structures and inode log item to call it
a sequence number, and make it clear that the high level code is
using CIL sequence numbers and not on-disk LSNs for integrity
synchronisation purposes.
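
After the renaming, the conversion point between the two number
spaces is explicit in xfs_log_force_seq(); condensed from the patch
below:

	/* Convert the checkpoint sequence to a commit record LSN. */
	lsn = xlog_cil_force_seq(log, seq);
	if (lsn == NULLCOMMITLSN)
		return 0;

	/* From here on we operate on a real, on-disk LSN. */
	ret = xlog_force_lsn(log, lsn, flags, log_flushed, false);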

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/xfs_buf_item.c   |  2 +-
 fs/xfs/xfs_dquot_item.c |  2 +-
 fs/xfs/xfs_file.c       | 14 +++++++-------
 fs/xfs/xfs_inode.c      | 10 +++++-----
 fs/xfs/xfs_inode_item.c |  4 ++--
 fs/xfs/xfs_inode_item.h |  2 +-
 fs/xfs/xfs_log.c        | 27 ++++++++++++++-------------
 fs/xfs/xfs_log.h        |  4 +---
 fs/xfs/xfs_log_cil.c    | 22 +++++++++-------------
 fs/xfs/xfs_log_priv.h   | 15 +++++++--------
 fs/xfs/xfs_trans.c      |  6 +++---
 fs/xfs/xfs_trans.h      |  4 ++--
 12 files changed, 53 insertions(+), 59 deletions(-)

diff --git a/fs/xfs/xfs_buf_item.c b/fs/xfs/xfs_buf_item.c
index 14d1fefcbf4c..7affe1aa16da 100644
--- a/fs/xfs/xfs_buf_item.c
+++ b/fs/xfs/xfs_buf_item.c
@@ -713,7 +713,7 @@ xfs_buf_item_release(
 STATIC void
 xfs_buf_item_committing(
 	struct xfs_log_item	*lip,
-	xfs_lsn_t		commit_lsn)
+	uint64_t		seq)
 {
 	return xfs_buf_item_release(lip);
 }
diff --git a/fs/xfs/xfs_dquot_item.c b/fs/xfs/xfs_dquot_item.c
index 8c1fdf37ee8f..74ca8fbbf1f3 100644
--- a/fs/xfs/xfs_dquot_item.c
+++ b/fs/xfs/xfs_dquot_item.c
@@ -188,7 +188,7 @@ xfs_qm_dquot_logitem_release(
 STATIC void
 xfs_qm_dquot_logitem_committing(
 	struct xfs_log_item	*lip,
-	xfs_lsn_t		commit_lsn)
+	uint64_t		seq)
 {
 	return xfs_qm_dquot_logitem_release(lip);
 }
diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index dd33ef2d0e20..8b07408928a9 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -118,8 +118,8 @@ xfs_dir_fsync(
 	return xfs_log_force_inode(ip);
 }
 
-static xfs_lsn_t
-xfs_fsync_lsn(
+static uint64_t
+xfs_fsync_seq(
 	struct xfs_inode	*ip,
 	bool			datasync)
 {
@@ -127,7 +127,7 @@ xfs_fsync_lsn(
 		return 0;
 	if (datasync && !(ip->i_itemp->ili_fsync_fields & ~XFS_ILOG_TIMESTAMP))
 		return 0;
-	return ip->i_itemp->ili_last_lsn;
+	return ip->i_itemp->ili_commit_seq;
 }
 
 /*
@@ -150,12 +150,12 @@ xfs_fsync_flush_log(
 	int			*log_flushed)
 {
 	int			error = 0;
-	xfs_lsn_t		lsn;
+	uint64_t		seq;
 
 	xfs_ilock(ip, XFS_ILOCK_SHARED);
-	lsn = xfs_fsync_lsn(ip, datasync);
-	if (lsn) {
-		error = xfs_log_force_lsn(ip->i_mount, lsn, XFS_LOG_SYNC,
+	seq = xfs_fsync_seq(ip, datasync);
+	if (seq) {
+		error = xfs_log_force_seq(ip->i_mount, seq, XFS_LOG_SYNC,
 					  log_flushed);
 
 		spin_lock(&ip->i_itemp->ili_lock);
diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
index 95e3a5e6e5e2..bc3e27f9c7f2 100644
--- a/fs/xfs/xfs_inode.c
+++ b/fs/xfs/xfs_inode.c
@@ -2636,7 +2636,7 @@ xfs_iunpin(
 	trace_xfs_inode_unpin_nowait(ip, _RET_IP_);
 
 	/* Give the log a push to start the unpinning I/O */
-	xfs_log_force_lsn(ip->i_mount, ip->i_itemp->ili_last_lsn, 0, NULL);
+	xfs_log_force_seq(ip->i_mount, ip->i_itemp->ili_commit_seq, 0, NULL);
 
 }
 
@@ -3641,16 +3641,16 @@ int
 xfs_log_force_inode(
 	struct xfs_inode	*ip)
 {
-	xfs_lsn_t		lsn = 0;
+	uint64_t		seq = 0;
 
 	xfs_ilock(ip, XFS_ILOCK_SHARED);
 	if (xfs_ipincount(ip))
-		lsn = ip->i_itemp->ili_last_lsn;
+		seq = ip->i_itemp->ili_commit_seq;
 	xfs_iunlock(ip, XFS_ILOCK_SHARED);
 
-	if (!lsn)
+	if (!seq)
 		return 0;
-	return xfs_log_force_lsn(ip->i_mount, lsn, XFS_LOG_SYNC, NULL);
+	return xfs_log_force_seq(ip->i_mount, seq, XFS_LOG_SYNC, NULL);
 }
 
 /*
diff --git a/fs/xfs/xfs_inode_item.c b/fs/xfs/xfs_inode_item.c
index 6ff91e5bf3cd..7270a80f21dc 100644
--- a/fs/xfs/xfs_inode_item.c
+++ b/fs/xfs/xfs_inode_item.c
@@ -617,9 +617,9 @@ xfs_inode_item_committed(
 STATIC void
 xfs_inode_item_committing(
 	struct xfs_log_item	*lip,
-	xfs_lsn_t		commit_lsn)
+	uint64_t		seq)
 {
-	INODE_ITEM(lip)->ili_last_lsn = commit_lsn;
+	INODE_ITEM(lip)->ili_commit_seq = seq;
 	return xfs_inode_item_release(lip);
 }
 
diff --git a/fs/xfs/xfs_inode_item.h b/fs/xfs/xfs_inode_item.h
index 4b926e32831c..e60125b88292 100644
--- a/fs/xfs/xfs_inode_item.h
+++ b/fs/xfs/xfs_inode_item.h
@@ -33,7 +33,7 @@ struct xfs_inode_log_item {
 	unsigned int		ili_fields;	   /* fields to be logged */
 	unsigned int		ili_fsync_fields;  /* logged since last fsync */
 	xfs_lsn_t		ili_flush_lsn;	   /* lsn at last flush */
-	xfs_lsn_t		ili_last_lsn;	   /* lsn at last transaction */
+	uint64_t		ili_commit_seq;	   /* last transaction commit */
 };
 
 static inline int xfs_inode_clean(struct xfs_inode *ip)
diff --git a/fs/xfs/xfs_log.c b/fs/xfs/xfs_log.c
index 08d68a6161ae..84cd9b6c6d1f 100644
--- a/fs/xfs/xfs_log.c
+++ b/fs/xfs/xfs_log.c
@@ -3336,14 +3336,13 @@ xfs_log_force(
 }
 
 static int
-__xfs_log_force_lsn(
-	struct xfs_mount	*mp,
+xlog_force_lsn(
+	struct xlog		*log,
 	xfs_lsn_t		lsn,
 	uint			flags,
 	int			*log_flushed,
 	bool			already_slept)
 {
-	struct xlog		*log = mp->m_log;
 	struct xlog_in_core	*iclog;
 
 	spin_lock(&log->l_icloglock);
@@ -3376,8 +3375,6 @@ __xfs_log_force_lsn(
 		if (!already_slept &&
 		    (iclog->ic_prev->ic_state == XLOG_STATE_WANT_SYNC ||
 		     iclog->ic_prev->ic_state == XLOG_STATE_SYNCING)) {
-			XFS_STATS_INC(mp, xs_log_force_sleep);
-
 			xlog_wait(&iclog->ic_prev->ic_write_wait,
 					&log->l_icloglock);
 			return -EAGAIN;
@@ -3415,25 +3412,29 @@ __xfs_log_force_lsn(
  * to disk, that thread will wake up all threads waiting on the queue.
  */
 int
-xfs_log_force_lsn(
+xfs_log_force_seq(
 	struct xfs_mount	*mp,
-	xfs_lsn_t		lsn,
+	uint64_t		seq,
 	uint			flags,
 	int			*log_flushed)
 {
+	struct xlog		*log = mp->m_log;
+	xfs_lsn_t		lsn;
 	int			ret;
-	ASSERT(lsn != 0);
+	ASSERT(seq != 0);
 
 	XFS_STATS_INC(mp, xs_log_force);
-	trace_xfs_log_force(mp, lsn, _RET_IP_);
+	trace_xfs_log_force(mp, seq, _RET_IP_);
 
-	lsn = xlog_cil_force_lsn(mp->m_log, lsn);
+	lsn = xlog_cil_force_seq(log, seq);
 	if (lsn == NULLCOMMITLSN)
 		return 0;
 
-	ret = __xfs_log_force_lsn(mp, lsn, flags, log_flushed, false);
-	if (ret == -EAGAIN)
-		ret = __xfs_log_force_lsn(mp, lsn, flags, log_flushed, true);
+	ret = xlog_force_lsn(log, lsn, flags, log_flushed, false);
+	if (ret == -EAGAIN) {
+		XFS_STATS_INC(mp, xs_log_force_sleep);
+		ret = xlog_force_lsn(log, lsn, flags, log_flushed, true);
+	}
 	return ret;
 }
 
diff --git a/fs/xfs/xfs_log.h b/fs/xfs/xfs_log.h
index 044e02cb8921..33ae53401060 100644
--- a/fs/xfs/xfs_log.h
+++ b/fs/xfs/xfs_log.h
@@ -106,7 +106,7 @@ struct xfs_item_ops;
 struct xfs_trans;
 
 int	  xfs_log_force(struct xfs_mount *mp, uint flags);
-int	  xfs_log_force_lsn(struct xfs_mount *mp, xfs_lsn_t lsn, uint flags,
+int	  xfs_log_force_seq(struct xfs_mount *mp, uint64_t seq, uint flags,
 		int *log_forced);
 int	  xfs_log_mount(struct xfs_mount	*mp,
 			struct xfs_buftarg	*log_target,
@@ -132,8 +132,6 @@ bool	xfs_log_writable(struct xfs_mount *mp);
 struct xlog_ticket *xfs_log_ticket_get(struct xlog_ticket *ticket);
 void	  xfs_log_ticket_put(struct xlog_ticket *ticket);
 
-void	xfs_log_commit_cil(struct xfs_mount *mp, struct xfs_trans *tp,
-				xfs_lsn_t *commit_lsn, bool regrant);
 void	xlog_cil_process_committed(struct list_head *list);
 bool	xfs_log_item_in_current_chkpt(struct xfs_log_item *lip);
 
diff --git a/fs/xfs/xfs_log_cil.c b/fs/xfs/xfs_log_cil.c
index 0a00c3c9610c..2fda8c4513b2 100644
--- a/fs/xfs/xfs_log_cil.c
+++ b/fs/xfs/xfs_log_cil.c
@@ -792,7 +792,7 @@ xlog_cil_push_work(
 	 * that higher sequences will wait for us to write out a commit record
 	 * before they do.
 	 *
-	 * xfs_log_force_lsn requires us to mirror the new sequence into the cil
+	 * xfs_log_force_seq requires us to mirror the new sequence into the cil
 	 * structure atomically with the addition of this sequence to the
 	 * committing list. This also ensures that we can do unlocked checks
 	 * against the current sequence in log forces without risking
@@ -1054,16 +1054,14 @@ xlog_cil_empty(
  * allowed again.
  */
 void
-xfs_log_commit_cil(
-	struct xfs_mount	*mp,
+xlog_cil_commit(
+	struct xlog		*log,
 	struct xfs_trans	*tp,
-	xfs_lsn_t		*commit_lsn,
+	uint64_t		*commit_seq,
 	bool			regrant)
 {
-	struct xlog		*log = mp->m_log;
 	struct xfs_cil		*cil = log->l_cilp;
 	struct xfs_log_item	*lip, *next;
-	xfs_lsn_t		xc_commit_lsn;
 
 	/*
 	 * Do all necessary memory allocation before we lock the CIL.
@@ -1077,10 +1075,6 @@ xfs_log_commit_cil(
 
 	xlog_cil_insert_items(log, tp);
 
-	xc_commit_lsn = cil->xc_ctx->sequence;
-	if (commit_lsn)
-		*commit_lsn = xc_commit_lsn;
-
 	if (regrant && !XLOG_FORCED_SHUTDOWN(log))
 		xfs_log_ticket_regrant(log, tp->t_ticket);
 	else
@@ -1103,8 +1097,10 @@ xfs_log_commit_cil(
 	list_for_each_entry_safe(lip, next, &tp->t_items, li_trans) {
 		xfs_trans_del_item(lip);
 		if (lip->li_ops->iop_committing)
-			lip->li_ops->iop_committing(lip, xc_commit_lsn);
+			lip->li_ops->iop_committing(lip, cil->xc_ctx->sequence);
 	}
+	if (commit_seq)
+		*commit_seq = cil->xc_ctx->sequence;
 
 	/* xlog_cil_push_background() releases cil->xc_ctx_lock */
 	xlog_cil_push_background(log);
@@ -1121,9 +1117,9 @@ xfs_log_commit_cil(
  * iclog flush is necessary following this call.
  */
 xfs_lsn_t
-xlog_cil_force_lsn(
+xlog_cil_force_seq(
 	struct xlog	*log,
-	xfs_lsn_t	sequence)
+	uint64_t	sequence)
 {
 	struct xfs_cil		*cil = log->l_cilp;
 	struct xfs_cil_ctx	*ctx;
diff --git a/fs/xfs/xfs_log_priv.h b/fs/xfs/xfs_log_priv.h
index 24acdc54e44e..59778cd5ecdd 100644
--- a/fs/xfs/xfs_log_priv.h
+++ b/fs/xfs/xfs_log_priv.h
@@ -234,7 +234,7 @@ struct xfs_cil;
 
 struct xfs_cil_ctx {
 	struct xfs_cil		*cil;
-	xfs_lsn_t		sequence;	/* chkpt sequence # */
+	uint64_t		sequence;	/* chkpt sequence # */
 	xfs_lsn_t		start_lsn;	/* first LSN of chkpt commit */
 	xfs_lsn_t		commit_lsn;	/* chkpt commit record lsn */
 	struct xlog_ticket	*ticket;	/* chkpt ticket */
@@ -272,10 +272,10 @@ struct xfs_cil {
 	struct xfs_cil_ctx	*xc_ctx;
 
 	spinlock_t		xc_push_lock ____cacheline_aligned_in_smp;
-	xfs_lsn_t		xc_push_seq;
+	uint64_t		xc_push_seq;
 	struct list_head	xc_committing;
 	wait_queue_head_t	xc_commit_wait;
-	xfs_lsn_t		xc_current_sequence;
+	uint64_t		xc_current_sequence;
 	struct work_struct	xc_push_work;
 	wait_queue_head_t	xc_push_wait;	/* background push throttle */
 } ____cacheline_aligned_in_smp;
@@ -552,19 +552,18 @@ int	xlog_cil_init(struct xlog *log);
 void	xlog_cil_init_post_recovery(struct xlog *log);
 void	xlog_cil_destroy(struct xlog *log);
 bool	xlog_cil_empty(struct xlog *log);
+void	xlog_cil_commit(struct xlog *log, struct xfs_trans *tp,
+			uint64_t *commit_seq, bool regrant);
 
 /*
  * CIL force routines
  */
-xfs_lsn_t
-xlog_cil_force_lsn(
-	struct xlog *log,
-	xfs_lsn_t sequence);
+xfs_lsn_t xlog_cil_force_seq(struct xlog *log, uint64_t sequence);
 
 static inline void
 xlog_cil_force(struct xlog *log)
 {
-	xlog_cil_force_lsn(log, log->l_cilp->xc_current_sequence);
+	xlog_cil_force_seq(log, log->l_cilp->xc_current_sequence);
 }
 
 /*
diff --git a/fs/xfs/xfs_trans.c b/fs/xfs/xfs_trans.c
index 44f72c09c203..697703f3be48 100644
--- a/fs/xfs/xfs_trans.c
+++ b/fs/xfs/xfs_trans.c
@@ -849,7 +849,7 @@ __xfs_trans_commit(
 	bool			regrant)
 {
 	struct xfs_mount	*mp = tp->t_mountp;
-	xfs_lsn_t		commit_lsn = -1;
+	uint64_t		commit_seq = 0;
 	int			error = 0;
 	int			sync = tp->t_flags & XFS_TRANS_SYNC;
 
@@ -891,7 +891,7 @@ __xfs_trans_commit(
 		xfs_trans_apply_sb_deltas(tp);
 	xfs_trans_apply_dquot_deltas(tp);
 
-	xfs_log_commit_cil(mp, tp, &commit_lsn, regrant);
+	xlog_cil_commit(mp->m_log, tp, &commit_seq, regrant);
 
 	current_restore_flags_nested(&tp->t_pflags, PF_MEMALLOC_NOFS);
 	xfs_trans_free(tp);
@@ -901,7 +901,7 @@ __xfs_trans_commit(
 	 * log out now and wait for it.
 	 */
 	if (sync) {
-		error = xfs_log_force_lsn(mp, commit_lsn, XFS_LOG_SYNC, NULL);
+		error = xfs_log_force_seq(mp, commit_seq, XFS_LOG_SYNC, NULL);
 		XFS_STATS_INC(mp, xs_trans_sync);
 	} else {
 		XFS_STATS_INC(mp, xs_trans_async);
diff --git a/fs/xfs/xfs_trans.h b/fs/xfs/xfs_trans.h
index 8b03fbfe9a1b..d223d4f4e429 100644
--- a/fs/xfs/xfs_trans.h
+++ b/fs/xfs/xfs_trans.h
@@ -43,7 +43,7 @@ struct xfs_log_item {
 	struct list_head		li_cil;		/* CIL pointers */
 	struct xfs_log_vec		*li_lv;		/* active log vector */
 	struct xfs_log_vec		*li_lv_shadow;	/* standby vector */
-	xfs_lsn_t			li_seq;		/* CIL commit seq */
+	uint64_t			li_seq;		/* CIL commit seq */
 };
 
 /*
@@ -69,7 +69,7 @@ struct xfs_item_ops {
 	void (*iop_pin)(struct xfs_log_item *);
 	void (*iop_unpin)(struct xfs_log_item *, int remove);
 	uint (*iop_push)(struct xfs_log_item *, struct list_head *);
-	void (*iop_committing)(struct xfs_log_item *, xfs_lsn_t commit_lsn);
+	void (*iop_committing)(struct xfs_log_item *lip, uint64_t seq);
 	void (*iop_release)(struct xfs_log_item *);
 	xfs_lsn_t (*iop_committed)(struct xfs_log_item *, xfs_lsn_t);
 	int (*iop_recover)(struct xfs_log_item *lip,
-- 
2.28.0



* [PATCH 2/3] xfs: AIL needs asynchronous CIL forcing
  2021-02-23  5:32 [PATCH 0/3] xfs: CIL improvements Dave Chinner
  2021-02-23  5:32 ` [PATCH 1/3] xfs: xfs_log_force_lsn isn't passed a LSN Dave Chinner
@ 2021-02-23  5:32 ` Dave Chinner
  2021-02-24 21:10   ` Dave Chinner
  2021-02-23  5:32 ` [PATCH 3/3] xfs: CIL work is serialised, not pipelined Dave Chinner
  2 siblings, 1 reply; 33+ messages in thread
From: Dave Chinner @ 2021-02-23  5:32 UTC (permalink / raw)
  To: linux-xfs

From: Dave Chinner <dchinner@redhat.com>

The AIL pushing is stalling on log forces when it comes across
pinned items. This is happening on removal workloads where the AIL
is dominated by stale items that are removed from the AIL when the
checkpoint that marks the items stale is committed to the journal.
This results in relatively few items in the AIL, but those that
remain are often pinned, as the directories the items are being
removed from are still being logged.

As a result, many push cycles through the CIL will first issue a
blocking log force to unpin the items. This can take some time to
complete, with tracing regularly showing push delays of half a
second and sometimes up into the range of several seconds. Sequences
like this aren't uncommon:

....
 399.829437:  xfsaild: last lsn 0x11002dd000 count 101 stuck 101 flushing 0 tout 20
<wanted 20ms, got 270ms delay>
 400.099622:  xfsaild: target 0x11002f3600, prev 0x11002f3600, last lsn 0x0
 400.099623:  xfsaild: first lsn 0x11002f3600
 400.099679:  xfsaild: last lsn 0x1100305000 count 16 stuck 11 flushing 0 tout 50
<wanted 50ms, got 500ms delay>
 400.589348:  xfsaild: target 0x110032e600, prev 0x11002f3600, last lsn 0x0
 400.589349:  xfsaild: first lsn 0x1100305000
 400.589595:  xfsaild: last lsn 0x110032e600 count 156 stuck 101 flushing 30 tout 50
<wanted 50ms, got 460ms delay>
 400.950341:  xfsaild: target 0x1100353000, prev 0x110032e600, last lsn 0x0
 400.950343:  xfsaild: first lsn 0x1100317c00
 400.950436:  xfsaild: last lsn 0x110033d200 count 105 stuck 101 flushing 0 tout 20
<wanted 20ms, got 200ms delay>
 401.142333:  xfsaild: target 0x1100361600, prev 0x1100353000, last lsn 0x0
 401.142334:  xfsaild: first lsn 0x110032e600
 401.142535:  xfsaild: last lsn 0x1100353000 count 122 stuck 101 flushing 8 tout 10
<wanted 10ms, got 10ms delay>
 401.154323:  xfsaild: target 0x1100361600, prev 0x1100361600, last lsn 0x1100353000
 401.154328:  xfsaild: first lsn 0x1100353000
 401.154389:  xfsaild: last lsn 0x1100353000 count 101 stuck 101 flushing 0 tout 20
<wanted 20ms, got 300ms delay>
 401.451525:  xfsaild: target 0x1100361600, prev 0x1100361600, last lsn 0x0
 401.451526:  xfsaild: first lsn 0x1100353000
 401.451804:  xfsaild: last lsn 0x1100377200 count 170 stuck 22 flushing 122 tout 50
<wanted 50ms, got 500ms delay>
 401.933581:  xfsaild: target 0x1100361600, prev 0x1100361600, last lsn 0x0
....

In each of these cases, every AIL pass saw 101 log items stuck on
the AIL (pinned) with very few other items being found. Each pass, a
log force was issued, and the delay between last/first is the sleep time
+ the sync log force time.

Some of these 101 items pinned the tail of the log. The tail of the
log does slowly creep forward (first lsn), but the problem is that
the log is actually out of reservation space because it has been
running so many transactions that generate stale items that never
reach the AIL but still consume log space. Hence we have a largely
empty AIL, with
long term pins on items that pin the tail of the log that don't get
pushed frequently enough to keep log space available.

The problem is the hundreds of milliseconds that we block in the log
force pushing the CIL out to disk. The AIL should not be stalled
like this - it needs to run and flush items that are at the tail of
the log with minimal latency. What we really need to do is trigger a
log flush, but then not wait for it at all - we've already done our
waiting for stuff to complete when we backed off prior to the log
force being issued.

Even if we remove the XFS_LOG_SYNC from the xfs_log_force() call, we
still do a blocking flush of the CIL and that is what is causing the
issue. Hence we need a new interface for the CIL to trigger an
immediate background push of the CIL to get it moving faster but not
to wait on that to occur. While the CIL is pushing, the AIL can also
be pushing.

We already have an internal interface to do this -
xlog_cil_push_now() - but we need a wrapper for it to be used
externally. xlog_cil_force_seq() can easily be extended to do what
we need as it already implements the synchronous CIL push via
xlog_cil_push_now(). Add the necessary flags and "push current
sequence" semantics to xlog_cil_force_seq() and convert the AIL
pushing to use it.
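
Condensed from the patch below, the resulting flag handling in
xlog_cil_force_seq() is roughly:

	if (!sequence)
		sequence = cil->xc_current_sequence;	/* push current */

	xlog_cil_push_now(log, sequence, flags & XFS_LOG_SYNC);
	if (!(flags & XFS_LOG_SYNC))
		return commit_lsn;	/* NULLCOMMITLSN - we didn't wait */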

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/xfs_log.c       | 33 ++++++++++++++++++++++-----------
 fs/xfs/xfs_log.h       |  1 +
 fs/xfs/xfs_log_cil.c   | 29 +++++++++++++++++++++--------
 fs/xfs/xfs_log_priv.h  |  5 +++--
 fs/xfs/xfs_sysfs.c     |  1 +
 fs/xfs/xfs_trace.c     |  1 +
 fs/xfs/xfs_trans.c     |  2 +-
 fs/xfs/xfs_trans_ail.c | 11 ++++++++---
 8 files changed, 58 insertions(+), 25 deletions(-)

diff --git a/fs/xfs/xfs_log.c b/fs/xfs/xfs_log.c
index 84cd9b6c6d1f..a0ecaf4f9b77 100644
--- a/fs/xfs/xfs_log.c
+++ b/fs/xfs/xfs_log.c
@@ -3335,6 +3335,20 @@ xfs_log_force(
 	return -EIO;
 }
 
+/*
+ * Force the log to a specific LSN.
+ *
+ * If an iclog with that lsn can be found:
+ *	If it is in the DIRTY state, just return.
+ *	If it is in the ACTIVE state, move the in-core log into the WANT_SYNC
+ *		state and go to sleep or return.
+ *	If it is in any other state, go to sleep or return.
+ *
+ * Synchronous forces are implemented with a wait queue.  All callers trying
+ * to force a given lsn to disk must wait on the queue attached to the
+ * specific in-core log.  When given in-core log finally completes its write
+ * to disk, that thread will wake up all threads waiting on the queue.
+ */
 static int
 xlog_force_lsn(
 	struct xlog		*log,
@@ -3398,18 +3412,15 @@ xlog_force_lsn(
 }
 
 /*
- * Force the in-core log to disk for a specific LSN.
+ * Force the log to a specific checkpoint sequence.
  *
- * Find in-core log with lsn.
- *	If it is in the DIRTY state, just return.
- *	If it is in the ACTIVE state, move the in-core log into the WANT_SYNC
- *		state and go to sleep or return.
- *	If it is in any other state, go to sleep or return.
+ * First force the CIL so that all the required changes have been flushed to the
+ * iclogs. If the CIL force completed it will return a commit LSN that indicates
+ * the iclog that needs to be flushed to stable storage.
  *
- * Synchronous forces are implemented with a wait queue.  All callers trying
- * to force a given lsn to disk must wait on the queue attached to the
- * specific in-core log.  When given in-core log finally completes its write
- * to disk, that thread will wake up all threads waiting on the queue.
+ * If the XFS_LOG_SYNC flag is not set, we only trigger a background CIL force
+ * and do not wait for it to complete, nor do we attempt to check/flush iclogs
+ * as the CIL will not have committed when xlog_cil_force_seq() returns.
  */
 int
 xfs_log_force_seq(
@@ -3426,7 +3437,7 @@ xfs_log_force_seq(
 	XFS_STATS_INC(mp, xs_log_force);
 	trace_xfs_log_force(mp, seq, _RET_IP_);
 
-	lsn = xlog_cil_force_seq(log, seq);
+	lsn = xlog_cil_force_seq(log, flags, seq);
 	if (lsn == NULLCOMMITLSN)
 		return 0;
 
diff --git a/fs/xfs/xfs_log.h b/fs/xfs/xfs_log.h
index 33ae53401060..77d3fca28f55 100644
--- a/fs/xfs/xfs_log.h
+++ b/fs/xfs/xfs_log.h
@@ -104,6 +104,7 @@ struct xlog_ticket;
 struct xfs_log_item;
 struct xfs_item_ops;
 struct xfs_trans;
+struct xlog;
 
 int	  xfs_log_force(struct xfs_mount *mp, uint flags);
 int	  xfs_log_force_seq(struct xfs_mount *mp, uint64_t seq, uint flags,
diff --git a/fs/xfs/xfs_log_cil.c b/fs/xfs/xfs_log_cil.c
index 2fda8c4513b2..be6c91d7218f 100644
--- a/fs/xfs/xfs_log_cil.c
+++ b/fs/xfs/xfs_log_cil.c
@@ -999,7 +999,8 @@ xlog_cil_push_background(
 static void
 xlog_cil_push_now(
 	struct xlog	*log,
-	xfs_lsn_t	push_seq)
+	xfs_lsn_t	push_seq,
+	bool		flush)
 {
 	struct xfs_cil	*cil = log->l_cilp;
 
@@ -1009,7 +1010,8 @@ xlog_cil_push_now(
 	ASSERT(push_seq && push_seq <= cil->xc_current_sequence);
 
 	/* start on any pending background push to minimise wait time on it */
-	flush_work(&cil->xc_push_work);
+	if (flush)
+		flush_work(&cil->xc_push_work);
 
 	/*
 	 * If the CIL is empty or we've already pushed the sequence then
@@ -1109,16 +1111,22 @@ xlog_cil_commit(
 /*
  * Conditionally push the CIL based on the sequence passed in.
  *
- * We only need to push if we haven't already pushed the sequence
- * number given. Hence the only time we will trigger a push here is
- * if the push sequence is the same as the current context.
+ * We only need to push if we haven't already pushed the sequence number given.
+ * Hence the only time we will trigger a push here is if the push sequence is
+ * the same as the current context.
  *
- * We return the current commit lsn to allow the callers to determine if a
- * iclog flush is necessary following this call.
+ * If the sequence is zero, push the current sequence. If XFS_LOG_SYNC is set in
+ * the flags, wait for it to complete, otherwise just return NULLCOMMITLSN to
+ * indicate we didn't wait for a commit lsn.
+ *
+ * If we waited for the push to complete, then we return the current commit lsn
+ * to allow the callers to determine if an iclog flush is necessary following
+ * this call.
  */
 xfs_lsn_t
 xlog_cil_force_seq(
 	struct xlog	*log,
+	uint32_t	flags,
 	uint64_t	sequence)
 {
 	struct xfs_cil		*cil = log->l_cilp;
@@ -1127,13 +1135,18 @@ xlog_cil_force_seq(
 
 	ASSERT(sequence <= cil->xc_current_sequence);
 
+	if (!sequence)
+		sequence = cil->xc_current_sequence;
+
 	/*
 	 * check to see if we need to force out the current context.
 	 * xlog_cil_push() handles racing pushes for the same sequence,
 	 * so no need to deal with it here.
 	 */
 restart:
-	xlog_cil_push_now(log, sequence);
+	xlog_cil_push_now(log, sequence, flags & XFS_LOG_SYNC);
+	if (!(flags & XFS_LOG_SYNC))
+		return commit_lsn;
 
 	/*
 	 * See if we can find a previous sequence still committing.
diff --git a/fs/xfs/xfs_log_priv.h b/fs/xfs/xfs_log_priv.h
index 59778cd5ecdd..d9929d9146e4 100644
--- a/fs/xfs/xfs_log_priv.h
+++ b/fs/xfs/xfs_log_priv.h
@@ -558,12 +558,13 @@ void	xlog_cil_commit(struct xlog *log, struct xfs_trans *tp,
 /*
  * CIL force routines
  */
-xfs_lsn_t xlog_cil_force_seq(struct xlog *log, uint64_t sequence);
+xfs_lsn_t xlog_cil_force_seq(struct xlog *log, uint32_t flags,
+				uint64_t sequence);
 
 static inline void
 xlog_cil_force(struct xlog *log)
 {
-	xlog_cil_force_seq(log, log->l_cilp->xc_current_sequence);
+	xlog_cil_force_seq(log, XFS_LOG_SYNC, log->l_cilp->xc_current_sequence);
 }
 
 /*
diff --git a/fs/xfs/xfs_sysfs.c b/fs/xfs/xfs_sysfs.c
index f1bc88f4367c..18dc5eca6c04 100644
--- a/fs/xfs/xfs_sysfs.c
+++ b/fs/xfs/xfs_sysfs.c
@@ -10,6 +10,7 @@
 #include "xfs_log_format.h"
 #include "xfs_trans_resv.h"
 #include "xfs_sysfs.h"
+#include "xfs_log.h"
 #include "xfs_log_priv.h"
 #include "xfs_mount.h"
 
diff --git a/fs/xfs/xfs_trace.c b/fs/xfs/xfs_trace.c
index 9b8d703dc9fd..d111a994b7b6 100644
--- a/fs/xfs/xfs_trace.c
+++ b/fs/xfs/xfs_trace.c
@@ -20,6 +20,7 @@
 #include "xfs_bmap.h"
 #include "xfs_attr.h"
 #include "xfs_trans.h"
+#include "xfs_log.h"
 #include "xfs_log_priv.h"
 #include "xfs_buf_item.h"
 #include "xfs_quota.h"
diff --git a/fs/xfs/xfs_trans.c b/fs/xfs/xfs_trans.c
index 697703f3be48..7d05694681e3 100644
--- a/fs/xfs/xfs_trans.c
+++ b/fs/xfs/xfs_trans.c
@@ -9,7 +9,6 @@
 #include "xfs_shared.h"
 #include "xfs_format.h"
 #include "xfs_log_format.h"
-#include "xfs_log_priv.h"
 #include "xfs_trans_resv.h"
 #include "xfs_mount.h"
 #include "xfs_extent_busy.h"
@@ -17,6 +16,7 @@
 #include "xfs_trans.h"
 #include "xfs_trans_priv.h"
 #include "xfs_log.h"
+#include "xfs_log_priv.h"
 #include "xfs_trace.h"
 #include "xfs_error.h"
 #include "xfs_defer.h"
diff --git a/fs/xfs/xfs_trans_ail.c b/fs/xfs/xfs_trans_ail.c
index dbb69b4bf3ed..dfc0206c0d36 100644
--- a/fs/xfs/xfs_trans_ail.c
+++ b/fs/xfs/xfs_trans_ail.c
@@ -17,6 +17,7 @@
 #include "xfs_errortag.h"
 #include "xfs_error.h"
 #include "xfs_log.h"
+#include "xfs_log_priv.h"
 
 #ifdef DEBUG
 /*
@@ -429,8 +430,12 @@ xfsaild_push(
 
 	/*
 	 * If we encountered pinned items or did not finish writing out all
-	 * buffers the last time we ran, force the log first and wait for it
-	 * before pushing again.
+	 * buffers the last time we ran, force a background CIL push to get the
+	 * items unpinned in the near future. We do not wait on the CIL push as
+	 * that could stall us for seconds if there is enough background IO
+	 * load. Stalling for that long when the tail of the log is pinned and
+	 * needs flushing will hard stop the transaction subsystem when log
+	 * space runs out.
 	 */
 	if (ailp->ail_log_flush && ailp->ail_last_pushed_lsn == 0 &&
 	    (!list_empty_careful(&ailp->ail_buf_list) ||
@@ -438,7 +443,7 @@ xfsaild_push(
 		ailp->ail_log_flush = 0;
 
 		XFS_STATS_INC(mp, xs_push_ail_flush);
-		xfs_log_force(mp, XFS_LOG_SYNC);
+		xlog_cil_force_seq(mp->m_log, 0, 0);
 	}
 
 	spin_lock(&ailp->ail_lock);
-- 
2.28.0



* [PATCH 3/3] xfs: CIL work is serialised, not pipelined
  2021-02-23  5:32 [PATCH 0/3] xfs: CIL improvements Dave Chinner
  2021-02-23  5:32 ` [PATCH 1/3] xfs: xfs_log_force_lsn isn't passed a LSN Dave Chinner
  2021-02-23  5:32 ` [PATCH 2/3] xfs: AIL needs asynchronous CIL forcing Dave Chinner
@ 2021-02-23  5:32 ` Dave Chinner
  2 siblings, 0 replies; 33+ messages in thread
From: Dave Chinner @ 2021-02-23  5:32 UTC (permalink / raw)
  To: linux-xfs

From: Dave Chinner <dchinner@redhat.com>

Because we use a single work structure attached to the CIL rather
than the CIL context, we can only queue a single work item at a
time. This results in the CIL being single threaded and limits
performance when it becomes CPU bound.

The design of the CIL is that it is pipelined and multiple commits
can be running concurrently, but the way the work is currently
implemented means that it is not pipelining as it was intended. The
critical work to switch the CIL context can take a few milliseconds
to run, but the rest of the CIL context flush can take hundreds of
milliseconds to complete. The context switching is the serialisation
point of the CIL; once the context has been switched, the rest of the
context push can run asynchronously with all other context pushes.

Hence we can move the work to the CIL context so that we can run
multiple CIL pushes at the same time and spread the majority of
the work out over multiple CPUs. We can keep the per-cpu CIL commit
state on the CIL rather than the context, because the context is
pinned to the CIL until the switch is done and we aggregate and
drain the per-cpu state held on the CIL during the context switch.

However, because we no longer serialise the CIL work, we can have
effectively unlimited CIL pushes in progress. We don't want to do
this - not only does it create contention on the iclogs and the
state machine locks, we can run the log right out of space with
outstanding pushes. Instead, limit the work concurrency to 4
concurrent works being processed at a time. This is enough
concurrency to remove the CIL from being a CPU bound bottleneck but
not enough to create new contention points or unbound concurrency
issues.
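
The limit itself is just the workqueue's max_active parameter (from
the patch below):

	mp->m_cil_workqueue = alloc_workqueue("xfs-cil/%s",
			XFS_WQFLAGS(WQ_FREEZABLE | WQ_MEM_RECLAIM | WQ_UNBOUND),
			4, mp->m_super->s_id);	/* max 4 concurrent pushes */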

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/xfs_log_cil.c  | 80 +++++++++++++++++++++++--------------------
 fs/xfs/xfs_log_priv.h |  2 +-
 fs/xfs/xfs_super.c    |  2 +-
 3 files changed, 44 insertions(+), 40 deletions(-)

diff --git a/fs/xfs/xfs_log_cil.c b/fs/xfs/xfs_log_cil.c
index be6c91d7218f..54d3fabd9a5b 100644
--- a/fs/xfs/xfs_log_cil.c
+++ b/fs/xfs/xfs_log_cil.c
@@ -47,6 +47,34 @@ xlog_cil_ticket_alloc(
 	return tic;
 }
 
+/*
+ * Unavoidable forward declaration - xlog_cil_push_work() calls
+ * xlog_cil_ctx_alloc() itself.
+ */
+static void xlog_cil_push_work(struct work_struct *work);
+
+static struct xfs_cil_ctx *
+xlog_cil_ctx_alloc(void)
+{
+	struct xfs_cil_ctx	*ctx;
+
+	ctx = kmem_zalloc(sizeof(*ctx), KM_NOFS);
+	INIT_LIST_HEAD(&ctx->committing);
+	INIT_LIST_HEAD(&ctx->busy_extents);
+	INIT_WORK(&ctx->push_work, xlog_cil_push_work);
+	return ctx;
+}
+
+static void
+xlog_cil_ctx_switch(
+	struct xfs_cil		*cil,
+	struct xfs_cil_ctx	*ctx)
+{
+	ctx->sequence = ++cil->xc_current_sequence;
+	ctx->cil = cil;
+	cil->xc_ctx = ctx;
+}
+
 /*
  * After the first stage of log recovery is done, we know where the head and
  * tail of the log are. We need this log initialisation done before we can
@@ -641,11 +669,11 @@ static void
 xlog_cil_push_work(
 	struct work_struct	*work)
 {
-	struct xfs_cil		*cil =
-		container_of(work, struct xfs_cil, xc_push_work);
+	struct xfs_cil_ctx	*ctx =
+		container_of(work, struct xfs_cil_ctx, push_work);
+	struct xfs_cil		*cil = ctx->cil;
 	struct xlog		*log = cil->xc_log;
 	struct xfs_log_vec	*lv;
-	struct xfs_cil_ctx	*ctx;
 	struct xfs_cil_ctx	*new_ctx;
 	struct xlog_in_core	*commit_iclog;
 	struct xlog_ticket	*tic;
@@ -658,11 +686,10 @@ xlog_cil_push_work(
 	xfs_lsn_t		push_seq;
 	DECLARE_COMPLETION_ONSTACK(bdev_flush);
 
-	new_ctx = kmem_zalloc(sizeof(*new_ctx), KM_NOFS);
+	new_ctx = xlog_cil_ctx_alloc();
 	new_ctx->ticket = xlog_cil_ticket_alloc(log);
 
 	down_write(&cil->xc_ctx_lock);
-	ctx = cil->xc_ctx;
 
 	spin_lock(&cil->xc_push_lock);
 	push_seq = cil->xc_push_seq;
@@ -692,7 +719,7 @@ xlog_cil_push_work(
 
 
 	/* check for a previously pushed sequence */
-	if (push_seq < cil->xc_ctx->sequence) {
+	if (push_seq < ctx->sequence) {
 		spin_unlock(&cil->xc_push_lock);
 		goto out_skip;
 	}
@@ -762,19 +789,7 @@ xlog_cil_push_work(
 	}
 
 	/*
-	 * initialise the new context and attach it to the CIL. Then attach
-	 * the current context to the CIL committing list so it can be found
-	 * during log forces to extract the commit lsn of the sequence that
-	 * needs to be forced.
-	 */
-	INIT_LIST_HEAD(&new_ctx->committing);
-	INIT_LIST_HEAD(&new_ctx->busy_extents);
-	new_ctx->sequence = ctx->sequence + 1;
-	new_ctx->cil = cil;
-	cil->xc_ctx = new_ctx;
-
-	/*
-	 * The switch is now done, so we can drop the context lock and move out
+	 * Switch the contexts so we can drop the context lock and move out
 	 * of a shared context. We can't just go straight to the commit record,
 	 * though - we need to synchronise with previous and future commits so
 	 * that the commit records are correctly ordered in the log to ensure
@@ -799,7 +814,7 @@ xlog_cil_push_work(
 	 * deferencing a freed context pointer.
 	 */
 	spin_lock(&cil->xc_push_lock);
-	cil->xc_current_sequence = new_ctx->sequence;
+	xlog_cil_ctx_switch(cil, new_ctx);
 	spin_unlock(&cil->xc_push_lock);
 	up_write(&cil->xc_ctx_lock);
 
@@ -957,7 +972,7 @@ xlog_cil_push_background(
 	spin_lock(&cil->xc_push_lock);
 	if (cil->xc_push_seq < cil->xc_current_sequence) {
 		cil->xc_push_seq = cil->xc_current_sequence;
-		queue_work(log->l_mp->m_cil_workqueue, &cil->xc_push_work);
+		queue_work(log->l_mp->m_cil_workqueue, &cil->xc_ctx->push_work);
 	}
 
 	/*
@@ -1011,7 +1026,7 @@ xlog_cil_push_now(
 
 	/* start on any pending background push to minimise wait time on it */
 	if (flush)
-		flush_work(&cil->xc_push_work);
+		flush_workqueue(log->l_mp->m_cil_workqueue);
 
 	/*
 	 * If the CIL is empty or we've already pushed the sequence then
@@ -1024,7 +1039,7 @@ xlog_cil_push_now(
 	}
 
 	cil->xc_push_seq = push_seq;
-	queue_work(log->l_mp->m_cil_workqueue, &cil->xc_push_work);
+	queue_work(log->l_mp->m_cil_workqueue, &cil->xc_ctx->push_work);
 	spin_unlock(&cil->xc_push_lock);
 }
 
@@ -1259,13 +1274,6 @@ xlog_cil_init(
 	if (!cil)
 		return -ENOMEM;
 
-	ctx = kmem_zalloc(sizeof(*ctx), KM_MAYFAIL);
-	if (!ctx) {
-		kmem_free(cil);
-		return -ENOMEM;
-	}
-
-	INIT_WORK(&cil->xc_push_work, xlog_cil_push_work);
 	INIT_LIST_HEAD(&cil->xc_cil);
 	INIT_LIST_HEAD(&cil->xc_committing);
 	spin_lock_init(&cil->xc_cil_lock);
@@ -1273,16 +1281,12 @@ xlog_cil_init(
 	init_waitqueue_head(&cil->xc_push_wait);
 	init_rwsem(&cil->xc_ctx_lock);
 	init_waitqueue_head(&cil->xc_commit_wait);
-
-	INIT_LIST_HEAD(&ctx->committing);
-	INIT_LIST_HEAD(&ctx->busy_extents);
-	ctx->sequence = 1;
-	ctx->cil = cil;
-	cil->xc_ctx = ctx;
-	cil->xc_current_sequence = ctx->sequence;
-
 	cil->xc_log = log;
 	log->l_cilp = cil;
+
+	ctx = xlog_cil_ctx_alloc();
+	xlog_cil_ctx_switch(cil, ctx);
+
 	return 0;
 }
 
diff --git a/fs/xfs/xfs_log_priv.h b/fs/xfs/xfs_log_priv.h
index d9929d9146e4..c0058c1f0c81 100644
--- a/fs/xfs/xfs_log_priv.h
+++ b/fs/xfs/xfs_log_priv.h
@@ -245,6 +245,7 @@ struct xfs_cil_ctx {
 	struct list_head	iclog_entry;
 	struct list_head	committing;	/* ctx committing list */
 	struct work_struct	discard_endio_work;
+	struct work_struct	push_work;
 };
 
 /*
@@ -276,7 +277,6 @@ struct xfs_cil {
 	struct list_head	xc_committing;
 	wait_queue_head_t	xc_commit_wait;
 	uint64_t		xc_current_sequence;
-	struct work_struct	xc_push_work;
 	wait_queue_head_t	xc_push_wait;	/* background push throttle */
 } ____cacheline_aligned_in_smp;
 
diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
index 85dd9593b40b..343499fdcbf3 100644
--- a/fs/xfs/xfs_super.c
+++ b/fs/xfs/xfs_super.c
@@ -502,7 +502,7 @@ xfs_init_mount_workqueues(
 
 	mp->m_cil_workqueue = alloc_workqueue("xfs-cil/%s",
 			XFS_WQFLAGS(WQ_FREEZABLE | WQ_MEM_RECLAIM | WQ_UNBOUND),
-			0, mp->m_super->s_id);
+			4, mp->m_super->s_id);
 	if (!mp->m_cil_workqueue)
 		goto out_destroy_unwritten;
 
-- 
2.28.0



* Re: [PATCH 2/3] xfs: AIL needs asynchronous CIL forcing
  2021-02-23  5:32 ` [PATCH 2/3] xfs: AIL needs asynchronous CIL forcing Dave Chinner
@ 2021-02-24 21:10   ` Dave Chinner
  2021-02-24 23:26     ` [PATCH 2/3 v2] " Dave Chinner
  2021-02-25 13:12     ` [PATCH 2/3] " Brian Foster
  0 siblings, 2 replies; 33+ messages in thread
From: Dave Chinner @ 2021-02-24 21:10 UTC (permalink / raw)
  To: linux-xfs

On Tue, Feb 23, 2021 at 04:32:11PM +1100, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> The AIL pushing is stalling on log forces when it comes across
> pinned items. This is happening on removal workloads where the AIL
> is dominated by stale items that are removed from the AIL when the
> checkpoint that marks the items stale is committed to the journal.
> This results in relatively few items in the AIL, but those that
> remain are often pinned, as the directories the items are being
> removed from are still being logged.
> 
> As a result, many push cycles through the CIL will first issue a
> blocking log force to unpin the items. This can take some time to
> complete, with tracing regularly showing push delays of half a
> second and sometimes up into the range of several seconds. Sequences
> like this aren't uncommon:
> 
> ....
>  399.829437:  xfsaild: last lsn 0x11002dd000 count 101 stuck 101 flushing 0 tout 20
> <wanted 20ms, got 270ms delay>
>  400.099622:  xfsaild: target 0x11002f3600, prev 0x11002f3600, last lsn 0x0
>  400.099623:  xfsaild: first lsn 0x11002f3600
>  400.099679:  xfsaild: last lsn 0x1100305000 count 16 stuck 11 flushing 0 tout 50
> <wanted 50ms, got 500ms delay>
>  400.589348:  xfsaild: target 0x110032e600, prev 0x11002f3600, last lsn 0x0
>  400.589349:  xfsaild: first lsn 0x1100305000
>  400.589595:  xfsaild: last lsn 0x110032e600 count 156 stuck 101 flushing 30 tout 50
> <wanted 50ms, got 460ms delay>
>  400.950341:  xfsaild: target 0x1100353000, prev 0x110032e600, last lsn 0x0
>  400.950343:  xfsaild: first lsn 0x1100317c00
>  400.950436:  xfsaild: last lsn 0x110033d200 count 105 stuck 101 flushing 0 tout 20
> <wanted 20ms, got 200ms delay>
>  401.142333:  xfsaild: target 0x1100361600, prev 0x1100353000, last lsn 0x0
>  401.142334:  xfsaild: first lsn 0x110032e600
>  401.142535:  xfsaild: last lsn 0x1100353000 count 122 stuck 101 flushing 8 tout 10
> <wanted 10ms, got 10ms delay>
>  401.154323:  xfsaild: target 0x1100361600, prev 0x1100361600, last lsn 0x1100353000
>  401.154328:  xfsaild: first lsn 0x1100353000
>  401.154389:  xfsaild: last lsn 0x1100353000 count 101 stuck 101 flushing 0 tout 20
> <wanted 20ms, got 300ms delay>
>  401.451525:  xfsaild: target 0x1100361600, prev 0x1100361600, last lsn 0x0
>  401.451526:  xfsaild: first lsn 0x1100353000
>  401.451804:  xfsaild: last lsn 0x1100377200 count 170 stuck 22 flushing 122 tout 50
> <wanted 50ms, got 500ms delay>
>  401.933581:  xfsaild: target 0x1100361600, prev 0x1100361600, last lsn 0x0
> ....
> 
> In each of these cases, every AIL pass saw 101 log items stuck on
> the AIL (pinned) with very few other items being found. Each pass, a
> log force was issued, and the delay between last/first is the sleep time
> + the sync log force time.
> 
> Some of these 101 items pinned the tail of the log. The tail of the
> log does slowly creep forward (first lsn), but the problem is that
> the log is actually out of reservation space because it has been
> running so many transactions that generate stale items that never
> reach the AIL but still consume log space. Hence we have a largely
> empty AIL, with
> long term pins on items that pin the tail of the log that don't get
> pushed frequently enough to keep log space available.
> 
> The problem is the hundreds of milliseconds that we block in the log
> force pushing the CIL out to disk. The AIL should not be stalled
> like this - it needs to run and flush items that are at the tail of
> the log with minimal latency. What we really need to do is trigger a
> log flush, but then not wait for it at all - we've already done our
> waiting for stuff to complete when we backed off prior to the log
> force being issued.
> 
> Even if we remove the XFS_LOG_SYNC from the xfs_log_force() call, we
> still do a blocking flush of the CIL and that is what is causing the
> issue. Hence we need a new interface for the CIL to trigger an
> immediate background push of the CIL to get it moving faster but not
> to wait on that to occur. While the CIL is pushing, the AIL can also
> be pushing.
> 
> We already have an internal interface to do this -
> xlog_cil_push_now() - but we need a wrapper for it to be used
> externally. xlog_cil_force_seq() can easily be extended to do what
> we need as it already implements the synchronous CIL push via
> xlog_cil_push_now(). Add the necessary flags and "push current
> sequence" semantics to xlog_cil_force_seq() and convert the AIL
> pushing to use it.

Overnight testing indicates generic/530 hangs on small logs with
this patch. It essentially runs out of log space because there are
inode cluster buffers permanently pinned by this workload.

That is, as inodes are unlinked, they repeatedly relog the inode
cluster buffer, and so while the CIL is flushing to unpin the
buffer, a new unlink relogs it and adds a new pin to the buffer.
Hence the log force that the AIL does to unpin pinned items doesn't
actually unpin buffers that are being relogged.

These buffers only get unpinned when the log actually runs out of
space because they pin the tail of the log. Then the modifications
that are relogging the buffer stop because they fail to get a log
reservation, and the tail of the log does not move forward until the
AIL issues a log force that finally unpins the buffer and writes it
back.

Essentially, relogged buffers cannot be flushed by the AIL until the
log completely stalls.

The problem being tripped over here is that we no longer force the
final iclogs in a CIL push to disk - we leave the iclog with the
commit record in it in ACTIVE state, and by not waiting and forcing
all the iclogs to disk, the buffer never gets unpinned because there
isn't any more log pressure to force it out - everything is
blocked on reservation space.

The solution to this is to have the CIL push change the state of the
commit iclog to WANT_SYNC before it is released. This means the CIL
push will always flush the iclog to disk and the checkpoint will
complete and unpin the buffers.
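
Roughly something like this at the end of the CIL push (a
hypothetical sketch, not the reworked patch - it assumes we still
hold a reference to the commit iclog and that the state change is
made under l_icloglock):

	/*
	 * Mark the commit iclog WANT_SYNC before releasing it so the
	 * checkpoint is always submitted to disk, even when there is
	 * no further log pressure to push it out.
	 */
	spin_lock(&log->l_icloglock);
	xlog_state_switch_iclogs(log, commit_iclog, 0);
	spin_unlock(&log->l_icloglock);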

Right now, we really only want to do this state switch for these
async pushes - for small sync transactions and fsync we really want
the iclog aggregation that we have now to optimise iclogbuf usage,
so I'll have to pass a new flag through the push code and back into
xlog_write(). That will make this patch behave the same as we
currently do.

In the longer term, we need to change how the AIL deals with pinned
buffers, as the current method is definitely sub-optimal. It also
explains the "everything stops for a second" performance anomalies
I've been seeing for a while in testing. But fixing that is outside
the scope of this patchset, so in the mean time I'll fix this one up
and repost it in a little while.

FWIW, this is also the likely cause of the "CIL workqueue too deep"
hangs I was seeing with the next patch in the series, too.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: [PATCH 1/3] xfs: xfs_log_force_lsn isn't passed a LSN
  2021-02-23  5:32 ` [PATCH 1/3] xfs: xfs_log_force_lsn isn't passed a LSN Dave Chinner
@ 2021-02-24 21:42   ` Darrick J. Wong
  2021-02-24 22:19     ` Dave Chinner
  2021-02-25 19:01     ` Christoph Hellwig
  2021-03-02 18:12   ` Brian Foster
  1 sibling, 2 replies; 33+ messages in thread
From: Darrick J. Wong @ 2021-02-24 21:42 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Tue, Feb 23, 2021 at 04:32:10PM +1100, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> In doing an investigation into AIL push stalls, I was looking at the
> log force code to see if an async CIL push could be done instead.
> This led me to xfs_log_force_lsn() and looking at how it works.
> 
> xfs_log_force_lsn() is only called from inode synchronisation
> contexts such as fsync(), and it takes the ip->i_itemp->ili_last_lsn
> value as the LSN to sync the log to. This gets passed to
> xlog_cil_force_lsn() via xfs_log_force_lsn() to flush the CIL to the
> journal, and then used by xfs_log_force_lsn() to flush the iclogs to
> the journal.
> 
> The problem with this is that ip->i_itemp->ili_last_lsn does not store a
> log sequence number. What it stores is passed to it from the
> ->iop_committing method, which is called by xfs_log_commit_cil().
> The value this passes to the iop_committing method is the CIL
> context sequence number that the item was committed to.
> 
> As it turns out, xlog_cil_force_lsn() converts the sequence to an
> actual commit LSN for the related context and returns that to
> xfs_log_force_lsn(). xfs_log_force_lsn() overwrites its "lsn"
> variable that contained a sequence with an actual LSN and then uses
> that to sync the iclogs.
> 
> This caused me some confusion for a while, even though I originally
> wrote all this code a decade ago. ->iop_committing is only used by
> a couple of log item types, and only inode items use the sequence
> number it is passed.
> 
> Let's clean up the API, CIL structures and inode log item to call it
> a sequence number, and make it clear that the high level code is
> using CIL sequence numbers and not on-disk LSNs for integrity
> synchronisation purposes.
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---
>  fs/xfs/xfs_buf_item.c   |  2 +-
>  fs/xfs/xfs_dquot_item.c |  2 +-
>  fs/xfs/xfs_file.c       | 14 +++++++-------
>  fs/xfs/xfs_inode.c      | 10 +++++-----
>  fs/xfs/xfs_inode_item.c |  4 ++--
>  fs/xfs/xfs_inode_item.h |  2 +-
>  fs/xfs/xfs_log.c        | 27 ++++++++++++++-------------
>  fs/xfs/xfs_log.h        |  4 +---
>  fs/xfs/xfs_log_cil.c    | 22 +++++++++-------------
>  fs/xfs/xfs_log_priv.h   | 15 +++++++--------
>  fs/xfs/xfs_trans.c      |  6 +++---
>  fs/xfs/xfs_trans.h      |  4 ++--
>  12 files changed, 53 insertions(+), 59 deletions(-)
> 
> diff --git a/fs/xfs/xfs_buf_item.c b/fs/xfs/xfs_buf_item.c
> index 14d1fefcbf4c..7affe1aa16da 100644
> --- a/fs/xfs/xfs_buf_item.c
> +++ b/fs/xfs/xfs_buf_item.c
> @@ -713,7 +713,7 @@ xfs_buf_item_release(
>  STATIC void
>  xfs_buf_item_committing(
>  	struct xfs_log_item	*lip,
> -	xfs_lsn_t		commit_lsn)
> +	uint64_t		seq)

FWIW I rather wish you'd defined a new type for cil sequence numbers,
since uint64_t is rather generic.  Even if checkpatch whines about new
typedefs.

I was kind of hoping that we'd be able to mark xfs_lsn_t and xfs_csn_t
with __bitwise and so static checkers could catch us if we accidentally
feed a CIL sequence number into a function that wants an LSN.
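
i.e. something along these lines (hypothetical - "xfs_csn_t" is just
a name I'm making up here):

/* sparse-checked CIL checkpoint sequence number */
typedef uint64_t __bitwise	xfs_csn_t;

int	xfs_log_force_seq(struct xfs_mount *mp, xfs_csn_t seq, uint flags,
		int *log_flushed);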

--D

>  {
>  	return xfs_buf_item_release(lip);
>  }
> diff --git a/fs/xfs/xfs_dquot_item.c b/fs/xfs/xfs_dquot_item.c
> index 8c1fdf37ee8f..74ca8fbbf1f3 100644
> --- a/fs/xfs/xfs_dquot_item.c
> +++ b/fs/xfs/xfs_dquot_item.c
> @@ -188,7 +188,7 @@ xfs_qm_dquot_logitem_release(
>  STATIC void
>  xfs_qm_dquot_logitem_committing(
>  	struct xfs_log_item	*lip,
> -	xfs_lsn_t		commit_lsn)
> +	uint64_t		seq)
>  {
>  	return xfs_qm_dquot_logitem_release(lip);
>  }
> diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
> index dd33ef2d0e20..8b07408928a9 100644
> --- a/fs/xfs/xfs_file.c
> +++ b/fs/xfs/xfs_file.c
> @@ -118,8 +118,8 @@ xfs_dir_fsync(
>  	return xfs_log_force_inode(ip);
>  }
>  
> -static xfs_lsn_t
> -xfs_fsync_lsn(
> +static uint64_t
> +xfs_fsync_seq(
>  	struct xfs_inode	*ip,
>  	bool			datasync)
>  {
> @@ -127,7 +127,7 @@ xfs_fsync_lsn(
>  		return 0;
>  	if (datasync && !(ip->i_itemp->ili_fsync_fields & ~XFS_ILOG_TIMESTAMP))
>  		return 0;
> -	return ip->i_itemp->ili_last_lsn;
> +	return ip->i_itemp->ili_commit_seq;
>  }
>  
>  /*
> @@ -150,12 +150,12 @@ xfs_fsync_flush_log(
>  	int			*log_flushed)
>  {
>  	int			error = 0;
> -	xfs_lsn_t		lsn;
> +	uint64_t		seq;
>  
>  	xfs_ilock(ip, XFS_ILOCK_SHARED);
> -	lsn = xfs_fsync_lsn(ip, datasync);
> -	if (lsn) {
> -		error = xfs_log_force_lsn(ip->i_mount, lsn, XFS_LOG_SYNC,
> +	seq = xfs_fsync_seq(ip, datasync);
> +	if (seq) {
> +		error = xfs_log_force_seq(ip->i_mount, seq, XFS_LOG_SYNC,
>  					  log_flushed);
>  
>  		spin_lock(&ip->i_itemp->ili_lock);
> diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
> index 95e3a5e6e5e2..bc3e27f9c7f2 100644
> --- a/fs/xfs/xfs_inode.c
> +++ b/fs/xfs/xfs_inode.c
> @@ -2636,7 +2636,7 @@ xfs_iunpin(
>  	trace_xfs_inode_unpin_nowait(ip, _RET_IP_);
>  
>  	/* Give the log a push to start the unpinning I/O */
> -	xfs_log_force_lsn(ip->i_mount, ip->i_itemp->ili_last_lsn, 0, NULL);
> +	xfs_log_force_seq(ip->i_mount, ip->i_itemp->ili_commit_seq, 0, NULL);
>  
>  }
>  
> @@ -3641,16 +3641,16 @@ int
>  xfs_log_force_inode(
>  	struct xfs_inode	*ip)
>  {
> -	xfs_lsn_t		lsn = 0;
> +	uint64_t		seq = 0;
>  
>  	xfs_ilock(ip, XFS_ILOCK_SHARED);
>  	if (xfs_ipincount(ip))
> -		lsn = ip->i_itemp->ili_last_lsn;
> +		seq = ip->i_itemp->ili_commit_seq;
>  	xfs_iunlock(ip, XFS_ILOCK_SHARED);
>  
> -	if (!lsn)
> +	if (!seq)
>  		return 0;
> -	return xfs_log_force_lsn(ip->i_mount, lsn, XFS_LOG_SYNC, NULL);
> +	return xfs_log_force_seq(ip->i_mount, seq, XFS_LOG_SYNC, NULL);
>  }
>  
>  /*
> diff --git a/fs/xfs/xfs_inode_item.c b/fs/xfs/xfs_inode_item.c
> index 6ff91e5bf3cd..7270a80f21dc 100644
> --- a/fs/xfs/xfs_inode_item.c
> +++ b/fs/xfs/xfs_inode_item.c
> @@ -617,9 +617,9 @@ xfs_inode_item_committed(
>  STATIC void
>  xfs_inode_item_committing(
>  	struct xfs_log_item	*lip,
> -	xfs_lsn_t		commit_lsn)
> +	uint64_t		seq)
>  {
> -	INODE_ITEM(lip)->ili_last_lsn = commit_lsn;
> +	INODE_ITEM(lip)->ili_commit_seq = seq;
>  	return xfs_inode_item_release(lip);
>  }
>  
> diff --git a/fs/xfs/xfs_inode_item.h b/fs/xfs/xfs_inode_item.h
> index 4b926e32831c..e60125b88292 100644
> --- a/fs/xfs/xfs_inode_item.h
> +++ b/fs/xfs/xfs_inode_item.h
> @@ -33,7 +33,7 @@ struct xfs_inode_log_item {
>  	unsigned int		ili_fields;	   /* fields to be logged */
>  	unsigned int		ili_fsync_fields;  /* logged since last fsync */
>  	xfs_lsn_t		ili_flush_lsn;	   /* lsn at last flush */
> -	xfs_lsn_t		ili_last_lsn;	   /* lsn at last transaction */
> +	uint64_t		ili_commit_seq;	   /* last transaction commit */
>  };
>  
>  static inline int xfs_inode_clean(struct xfs_inode *ip)
> diff --git a/fs/xfs/xfs_log.c b/fs/xfs/xfs_log.c
> index 08d68a6161ae..84cd9b6c6d1f 100644
> --- a/fs/xfs/xfs_log.c
> +++ b/fs/xfs/xfs_log.c
> @@ -3336,14 +3336,13 @@ xfs_log_force(
>  }
>  
>  static int
> -__xfs_log_force_lsn(
> -	struct xfs_mount	*mp,
> +xlog_force_lsn(
> +	struct xlog		*log,
>  	xfs_lsn_t		lsn,
>  	uint			flags,
>  	int			*log_flushed,
>  	bool			already_slept)
>  {
> -	struct xlog		*log = mp->m_log;
>  	struct xlog_in_core	*iclog;
>  
>  	spin_lock(&log->l_icloglock);
> @@ -3376,8 +3375,6 @@ __xfs_log_force_lsn(
>  		if (!already_slept &&
>  		    (iclog->ic_prev->ic_state == XLOG_STATE_WANT_SYNC ||
>  		     iclog->ic_prev->ic_state == XLOG_STATE_SYNCING)) {
> -			XFS_STATS_INC(mp, xs_log_force_sleep);
> -
>  			xlog_wait(&iclog->ic_prev->ic_write_wait,
>  					&log->l_icloglock);
>  			return -EAGAIN;
> @@ -3415,25 +3412,29 @@ __xfs_log_force_lsn(
>   * to disk, that thread will wake up all threads waiting on the queue.
>   */
>  int
> -xfs_log_force_lsn(
> +xfs_log_force_seq(
>  	struct xfs_mount	*mp,
> -	xfs_lsn_t		lsn,
> +	uint64_t		seq,
>  	uint			flags,
>  	int			*log_flushed)
>  {
> +	struct xlog		*log = mp->m_log;
> +	xfs_lsn_t		lsn;
>  	int			ret;
> -	ASSERT(lsn != 0);
> +	ASSERT(seq != 0);
>  
>  	XFS_STATS_INC(mp, xs_log_force);
> -	trace_xfs_log_force(mp, lsn, _RET_IP_);
> +	trace_xfs_log_force(mp, seq, _RET_IP_);
>  
> -	lsn = xlog_cil_force_lsn(mp->m_log, lsn);
> +	lsn = xlog_cil_force_seq(log, seq);
>  	if (lsn == NULLCOMMITLSN)
>  		return 0;
>  
> -	ret = __xfs_log_force_lsn(mp, lsn, flags, log_flushed, false);
> -	if (ret == -EAGAIN)
> -		ret = __xfs_log_force_lsn(mp, lsn, flags, log_flushed, true);
> +	ret = xlog_force_lsn(log, lsn, flags, log_flushed, false);
> +	if (ret == -EAGAIN) {
> +		XFS_STATS_INC(mp, xs_log_force_sleep);
> +		ret = xlog_force_lsn(log, lsn, flags, log_flushed, true);
> +	}
>  	return ret;
>  }
>  
> diff --git a/fs/xfs/xfs_log.h b/fs/xfs/xfs_log.h
> index 044e02cb8921..33ae53401060 100644
> --- a/fs/xfs/xfs_log.h
> +++ b/fs/xfs/xfs_log.h
> @@ -106,7 +106,7 @@ struct xfs_item_ops;
>  struct xfs_trans;
>  
>  int	  xfs_log_force(struct xfs_mount *mp, uint flags);
> -int	  xfs_log_force_lsn(struct xfs_mount *mp, xfs_lsn_t lsn, uint flags,
> +int	  xfs_log_force_seq(struct xfs_mount *mp, uint64_t seq, uint flags,
>  		int *log_forced);
>  int	  xfs_log_mount(struct xfs_mount	*mp,
>  			struct xfs_buftarg	*log_target,
> @@ -132,8 +132,6 @@ bool	xfs_log_writable(struct xfs_mount *mp);
>  struct xlog_ticket *xfs_log_ticket_get(struct xlog_ticket *ticket);
>  void	  xfs_log_ticket_put(struct xlog_ticket *ticket);
>  
> -void	xfs_log_commit_cil(struct xfs_mount *mp, struct xfs_trans *tp,
> -				xfs_lsn_t *commit_lsn, bool regrant);
>  void	xlog_cil_process_committed(struct list_head *list);
>  bool	xfs_log_item_in_current_chkpt(struct xfs_log_item *lip);
>  
> diff --git a/fs/xfs/xfs_log_cil.c b/fs/xfs/xfs_log_cil.c
> index 0a00c3c9610c..2fda8c4513b2 100644
> --- a/fs/xfs/xfs_log_cil.c
> +++ b/fs/xfs/xfs_log_cil.c
> @@ -792,7 +792,7 @@ xlog_cil_push_work(
>  	 * that higher sequences will wait for us to write out a commit record
>  	 * before they do.
>  	 *
> -	 * xfs_log_force_lsn requires us to mirror the new sequence into the cil
> +	 * xfs_log_force_seq requires us to mirror the new sequence into the cil
>  	 * structure atomically with the addition of this sequence to the
>  	 * committing list. This also ensures that we can do unlocked checks
>  	 * against the current sequence in log forces without risking
> @@ -1054,16 +1054,14 @@ xlog_cil_empty(
>   * allowed again.
>   */
>  void
> -xfs_log_commit_cil(
> -	struct xfs_mount	*mp,
> +xlog_cil_commit(
> +	struct xlog		*log,
>  	struct xfs_trans	*tp,
> -	xfs_lsn_t		*commit_lsn,
> +	uint64_t		*commit_seq,
>  	bool			regrant)
>  {
> -	struct xlog		*log = mp->m_log;
>  	struct xfs_cil		*cil = log->l_cilp;
>  	struct xfs_log_item	*lip, *next;
> -	xfs_lsn_t		xc_commit_lsn;
>  
>  	/*
>  	 * Do all necessary memory allocation before we lock the CIL.
> @@ -1077,10 +1075,6 @@ xfs_log_commit_cil(
>  
>  	xlog_cil_insert_items(log, tp);
>  
> -	xc_commit_lsn = cil->xc_ctx->sequence;
> -	if (commit_lsn)
> -		*commit_lsn = xc_commit_lsn;
> -
>  	if (regrant && !XLOG_FORCED_SHUTDOWN(log))
>  		xfs_log_ticket_regrant(log, tp->t_ticket);
>  	else
> @@ -1103,8 +1097,10 @@ xfs_log_commit_cil(
>  	list_for_each_entry_safe(lip, next, &tp->t_items, li_trans) {
>  		xfs_trans_del_item(lip);
>  		if (lip->li_ops->iop_committing)
> -			lip->li_ops->iop_committing(lip, xc_commit_lsn);
> +			lip->li_ops->iop_committing(lip, cil->xc_ctx->sequence);
>  	}
> +	if (commit_seq)
> +		*commit_seq = cil->xc_ctx->sequence;
>  
>  	/* xlog_cil_push_background() releases cil->xc_ctx_lock */
>  	xlog_cil_push_background(log);
> @@ -1121,9 +1117,9 @@ xfs_log_commit_cil(
>   * iclog flush is necessary following this call.
>   */
>  xfs_lsn_t
> -xlog_cil_force_lsn(
> +xlog_cil_force_seq(
>  	struct xlog	*log,
> -	xfs_lsn_t	sequence)
> +	uint64_t	sequence)
>  {
>  	struct xfs_cil		*cil = log->l_cilp;
>  	struct xfs_cil_ctx	*ctx;
> diff --git a/fs/xfs/xfs_log_priv.h b/fs/xfs/xfs_log_priv.h
> index 24acdc54e44e..59778cd5ecdd 100644
> --- a/fs/xfs/xfs_log_priv.h
> +++ b/fs/xfs/xfs_log_priv.h
> @@ -234,7 +234,7 @@ struct xfs_cil;
>  
>  struct xfs_cil_ctx {
>  	struct xfs_cil		*cil;
> -	xfs_lsn_t		sequence;	/* chkpt sequence # */
> +	uint64_t		sequence;	/* chkpt sequence # */
>  	xfs_lsn_t		start_lsn;	/* first LSN of chkpt commit */
>  	xfs_lsn_t		commit_lsn;	/* chkpt commit record lsn */
>  	struct xlog_ticket	*ticket;	/* chkpt ticket */
> @@ -272,10 +272,10 @@ struct xfs_cil {
>  	struct xfs_cil_ctx	*xc_ctx;
>  
>  	spinlock_t		xc_push_lock ____cacheline_aligned_in_smp;
> -	xfs_lsn_t		xc_push_seq;
> +	uint64_t		xc_push_seq;
>  	struct list_head	xc_committing;
>  	wait_queue_head_t	xc_commit_wait;
> -	xfs_lsn_t		xc_current_sequence;
> +	uint64_t		xc_current_sequence;
>  	struct work_struct	xc_push_work;
>  	wait_queue_head_t	xc_push_wait;	/* background push throttle */
>  } ____cacheline_aligned_in_smp;
> @@ -552,19 +552,18 @@ int	xlog_cil_init(struct xlog *log);
>  void	xlog_cil_init_post_recovery(struct xlog *log);
>  void	xlog_cil_destroy(struct xlog *log);
>  bool	xlog_cil_empty(struct xlog *log);
> +void	xlog_cil_commit(struct xlog *log, struct xfs_trans *tp,
> +			uint64_t *commit_seq, bool regrant);
>  
>  /*
>   * CIL force routines
>   */
> -xfs_lsn_t
> -xlog_cil_force_lsn(
> -	struct xlog *log,
> -	xfs_lsn_t sequence);
> +xfs_lsn_t xlog_cil_force_seq(struct xlog *log, uint64_t sequence);
>  
>  static inline void
>  xlog_cil_force(struct xlog *log)
>  {
> -	xlog_cil_force_lsn(log, log->l_cilp->xc_current_sequence);
> +	xlog_cil_force_seq(log, log->l_cilp->xc_current_sequence);
>  }
>  
>  /*
> diff --git a/fs/xfs/xfs_trans.c b/fs/xfs/xfs_trans.c
> index 44f72c09c203..697703f3be48 100644
> --- a/fs/xfs/xfs_trans.c
> +++ b/fs/xfs/xfs_trans.c
> @@ -849,7 +849,7 @@ __xfs_trans_commit(
>  	bool			regrant)
>  {
>  	struct xfs_mount	*mp = tp->t_mountp;
> -	xfs_lsn_t		commit_lsn = -1;
> +	uint64_t		commit_seq = 0;
>  	int			error = 0;
>  	int			sync = tp->t_flags & XFS_TRANS_SYNC;
>  
> @@ -891,7 +891,7 @@ __xfs_trans_commit(
>  		xfs_trans_apply_sb_deltas(tp);
>  	xfs_trans_apply_dquot_deltas(tp);
>  
> -	xfs_log_commit_cil(mp, tp, &commit_lsn, regrant);
> +	xlog_cil_commit(mp->m_log, tp, &commit_seq, regrant);
>  
>  	current_restore_flags_nested(&tp->t_pflags, PF_MEMALLOC_NOFS);
>  	xfs_trans_free(tp);
> @@ -901,7 +901,7 @@ __xfs_trans_commit(
>  	 * log out now and wait for it.
>  	 */
>  	if (sync) {
> -		error = xfs_log_force_lsn(mp, commit_lsn, XFS_LOG_SYNC, NULL);
> +		error = xfs_log_force_seq(mp, commit_seq, XFS_LOG_SYNC, NULL);
>  		XFS_STATS_INC(mp, xs_trans_sync);
>  	} else {
>  		XFS_STATS_INC(mp, xs_trans_async);
> diff --git a/fs/xfs/xfs_trans.h b/fs/xfs/xfs_trans.h
> index 8b03fbfe9a1b..d223d4f4e429 100644
> --- a/fs/xfs/xfs_trans.h
> +++ b/fs/xfs/xfs_trans.h
> @@ -43,7 +43,7 @@ struct xfs_log_item {
>  	struct list_head		li_cil;		/* CIL pointers */
>  	struct xfs_log_vec		*li_lv;		/* active log vector */
>  	struct xfs_log_vec		*li_lv_shadow;	/* standby vector */
> -	xfs_lsn_t			li_seq;		/* CIL commit seq */
> +	uint64_t			li_seq;		/* CIL commit seq */
>  };
>  
>  /*
> @@ -69,7 +69,7 @@ struct xfs_item_ops {
>  	void (*iop_pin)(struct xfs_log_item *);
>  	void (*iop_unpin)(struct xfs_log_item *, int remove);
>  	uint (*iop_push)(struct xfs_log_item *, struct list_head *);
> -	void (*iop_committing)(struct xfs_log_item *, xfs_lsn_t commit_lsn);
> +	void (*iop_committing)(struct xfs_log_item *lip, uint64_t seq);
>  	void (*iop_release)(struct xfs_log_item *);
>  	xfs_lsn_t (*iop_committed)(struct xfs_log_item *, xfs_lsn_t);
>  	int (*iop_recover)(struct xfs_log_item *lip,
> -- 
> 2.28.0
> 

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH 1/3] xfs: xfs_log_force_lsn isn't passed a LSN
  2021-02-24 21:42   ` Darrick J. Wong
@ 2021-02-24 22:19     ` Dave Chinner
  2021-02-25 19:01     ` Christoph Hellwig
  1 sibling, 0 replies; 33+ messages in thread
From: Dave Chinner @ 2021-02-24 22:19 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs

On Wed, Feb 24, 2021 at 01:42:35PM -0800, Darrick J. Wong wrote:
> On Tue, Feb 23, 2021 at 04:32:10PM +1100, Dave Chinner wrote:
> > From: Dave Chinner <dchinner@redhat.com>
> > 
> > In doing an investigation into AIL push stalls, I was looking at the
> > log force code to see if an async CIL push could be done instead.
> > This lead me to xfs_log_force_lsn() and looking at how it works.
> > 
> > xfs_log_force_lsn() is only called from inode synchronisation
> > contexts such as fsync(), and it takes the ip->i_itemp->ili_last_lsn
> > value as the LSN to sync the log to. This gets passed to
> > xlog_cil_force_lsn() via xfs_log_force_lsn() to flush the CIL to the
> > journal, and then used by xfs_log_force_lsn() to flush the iclogs to
> > the journal.
> > 
> > The problem with this is that ip->i_itemp->ili_last_lsn does not store a
> > log sequence number. What it stores is passed to it from the
> > ->iop_committing method, which is called by xfs_log_commit_cil().
> > The value this passes to the iop_committing method is the CIL
> > context sequence number that the item was committed to.
> > 
> > As it turns out, xlog_cil_force_lsn() converts the sequence to an
> > actual commit LSN for the related context and returns that to
> > xfs_log_force_lsn(). xfs_log_force_lsn() overwrites its "lsn"
> > variable that contained a sequence with an actual LSN and then uses
> > that to sync the iclogs.
> > 
> > This caused me some confusion for a while, even though I originally
> > wrote all this code a decade ago. ->iop_committing is only used by
> > a couple of log item types, and only inode items use the sequence
> > number it is passed.
> > 
> > Let's clean up the API, CIL structures and inode log item to call it
> > a sequence number, and make it clear that the high level code is
> > using CIL sequence numbers and not on-disk LSNs for integrity
> > synchronisation purposes.
> > 
> > Signed-off-by: Dave Chinner <dchinner@redhat.com>
> > ---
> >  fs/xfs/xfs_buf_item.c   |  2 +-
> >  fs/xfs/xfs_dquot_item.c |  2 +-
> >  fs/xfs/xfs_file.c       | 14 +++++++-------
> >  fs/xfs/xfs_inode.c      | 10 +++++-----
> >  fs/xfs/xfs_inode_item.c |  4 ++--
> >  fs/xfs/xfs_inode_item.h |  2 +-
> >  fs/xfs/xfs_log.c        | 27 ++++++++++++++-------------
> >  fs/xfs/xfs_log.h        |  4 +---
> >  fs/xfs/xfs_log_cil.c    | 22 +++++++++-------------
> >  fs/xfs/xfs_log_priv.h   | 15 +++++++--------
> >  fs/xfs/xfs_trans.c      |  6 +++---
> >  fs/xfs/xfs_trans.h      |  4 ++--
> >  12 files changed, 53 insertions(+), 59 deletions(-)
> > 
> > diff --git a/fs/xfs/xfs_buf_item.c b/fs/xfs/xfs_buf_item.c
> > index 14d1fefcbf4c..7affe1aa16da 100644
> > --- a/fs/xfs/xfs_buf_item.c
> > +++ b/fs/xfs/xfs_buf_item.c
> > @@ -713,7 +713,7 @@ xfs_buf_item_release(
> >  STATIC void
> >  xfs_buf_item_committing(
> >  	struct xfs_log_item	*lip,
> > -	xfs_lsn_t		commit_lsn)
> > +	uint64_t		seq)
> 
> FWIW I rather wish you'd defined a new type for cil sequence numbers,
> since uint64_t is rather generic.  Even if checkpatch whines about new
> typedefs.

I don't use checkpatch so I don't care about all the idiotic
"crusade of the month" stuff it is always whining about.

> I was kind of hoping that we'd be able to mark xfs_lsn_t and xfs_csn_t
> with __bitwise and so static checkers could catch us if we accidentally
> feed a CIL sequence number into a function that wants an LSN.

Seems reasonable to hide that behind a typedef. No idea how to set
it up and test it, though, and I don't really have time right now.
I'll change it to a typedef to make this easier to do in future,
though.
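
Something like this, perhaps (a sketch only - the xfs_csn_t name and
the helpers below are my guesses at what a sparse-checked version
could look like, not what this series adds):

	/* checkpoint sequence numbers, distinct from on-disk LSNs */
	typedef uint64_t __bitwise xfs_csn_t;

	/* explicit, checker-visible conversions at the CIL boundary */
	static inline xfs_csn_t xfs_seq_to_csn(uint64_t seq)
	{
		return (__force xfs_csn_t)seq;
	}

	static inline uint64_t xfs_csn_to_seq(xfs_csn_t csn)
	{
		return (__force uint64_t)csn;
	}

With that in place, sparse would warn if a CIL sequence was ever fed
to a function expecting an xfs_lsn_t.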

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 33+ messages in thread

* [PATCH 2/3 v2] xfs: AIL needs asynchronous CIL forcing
  2021-02-24 21:10   ` Dave Chinner
@ 2021-02-24 23:26     ` Dave Chinner
  2021-02-25  2:15       ` Dave Chinner
  2021-03-02 21:44       ` Brian Foster
  2021-02-25 13:12     ` [PATCH 2/3] " Brian Foster
  1 sibling, 2 replies; 33+ messages in thread
From: Dave Chinner @ 2021-02-24 23:26 UTC (permalink / raw)
  To: linux-xfs

From: Dave Chinner <dchinner@redhat.com>

The AIL pushing is stalling on log forces when it comes across
pinned items. This is happening on removal workloads where the AIL
is dominated by stale items that are removed from AIL when the
checkpoint that marks the items stale is committed to the journal.
This results in relatively few items in the AIL, but those that
are there are often pinned, as the directories that items are being
removed from are still being logged.

As a result, many push cycles through the CIL will first issue a
blocking log force to unpin the items. This can take some time to
complete, with tracing regularly showing push delays of half a
second and sometimes up into the range of several seconds. Sequences
like this aren't uncommon:

....
 399.829437:  xfsaild: last lsn 0x11002dd000 count 101 stuck 101 flushing 0 tout 20
<wanted 20ms, got 270ms delay>
 400.099622:  xfsaild: target 0x11002f3600, prev 0x11002f3600, last lsn 0x0
 400.099623:  xfsaild: first lsn 0x11002f3600
 400.099679:  xfsaild: last lsn 0x1100305000 count 16 stuck 11 flushing 0 tout 50
<wanted 50ms, got 500ms delay>
 400.589348:  xfsaild: target 0x110032e600, prev 0x11002f3600, last lsn 0x0
 400.589349:  xfsaild: first lsn 0x1100305000
 400.589595:  xfsaild: last lsn 0x110032e600 count 156 stuck 101 flushing 30 tout 50
<wanted 50ms, got 460ms delay>
 400.950341:  xfsaild: target 0x1100353000, prev 0x110032e600, last lsn 0x0
 400.950343:  xfsaild: first lsn 0x1100317c00
 400.950436:  xfsaild: last lsn 0x110033d200 count 105 stuck 101 flushing 0 tout 20
<wanted 20ms, got 200ms delay>
 401.142333:  xfsaild: target 0x1100361600, prev 0x1100353000, last lsn 0x0
 401.142334:  xfsaild: first lsn 0x110032e600
 401.142535:  xfsaild: last lsn 0x1100353000 count 122 stuck 101 flushing 8 tout 10
<wanted 10ms, got 10ms delay>
 401.154323:  xfsaild: target 0x1100361600, prev 0x1100361600, last lsn 0x1100353000
 401.154328:  xfsaild: first lsn 0x1100353000
 401.154389:  xfsaild: last lsn 0x1100353000 count 101 stuck 101 flushing 0 tout 20
<wanted 20ms, got 300ms delay>
 401.451525:  xfsaild: target 0x1100361600, prev 0x1100361600, last lsn 0x0
 401.451526:  xfsaild: first lsn 0x1100353000
 401.451804:  xfsaild: last lsn 0x1100377200 count 170 stuck 22 flushing 122 tout 50
<wanted 50ms, got 500ms delay>
 401.933581:  xfsaild: target 0x1100361600, prev 0x1100361600, last lsn 0x0
....

In each of these cases, every AIL pass saw 101 log items stuck on
the AIL (pinned) with very few other items being found. Each pass, a
log force was issued, and delay between last/first is the sleep time
+ the sync log force time.

Some of these 101 items pinned the tail of the log. The tail of the
log does slowly creep forward (first lsn), but the problem is that
the log is actually out of reservation space because it's been
running so many transactions that stale items that never reach the
AIL but consume log space. Hence we have a largely empty AIL, with
long term pins on items that pin the tail of the log that don't get
pushed frequently enough to keep log space available.

The problem is the hundreds of milliseconds that we block in the log
force pushing the CIL out to disk. The AIL should not be stalled
like this - it needs to run and flush items that are at the tail of
the log with minimal latency. What we really need to do is trigger a
log flush, but then not wait for it at all - we've already done our
waiting for stuff to complete when we backed off prior to the log
force being issued.

Even if we remove the XFS_LOG_SYNC from the xfs_log_force() call, we
still do a blocking flush of the CIL and that is what is causing the
issue. Hence we need a new interface for the CIL to trigger an
immediate background push of the CIL to get it moving faster but not
to wait on that to occur. While the CIL is pushing, the AIL can also
be pushing.

We already have an internal interface to do this -
xlog_cil_push_now() - but we need a wrapper for it to be used
externally. xlog_cil_force_seq() can easily be extended to do what
we need as it already implements the synchronous CIL push via
xlog_cil_push_now(). Add the necessary flags and "push current
sequence" semantics to xlog_cil_force_seq() and convert the AIL
pushing to use it.

One of the complexities here is that the CIL push does not guarantee
that the commit record for the CIL checkpoint is written to disk.
The current log force ensures this by submitting the current ACTIVE
iclog that the commit record was written to. We need the CIL to
actually write this commit record to disk for an async push to
ensure that the checkpoint actually makes it to disk and unpins the
pinned items in the checkpoint on completion. Hence we need to pass
down to the CIL push that we are doing an async flush so that it can
switch out the commit_iclog if necessary to get written to disk when
the commit iclog is finally released.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
Version 2:
- ensure the CIL checkpoint issues the commit record to disk for an
  async push. Fixes generic/530 hang on small logs.
- increment log force stats when the CIL is forced and also when it
  sleeps to give insight into the amount of blocking being done when
  the CIL is forced.

 fs/xfs/xfs_log.c       | 59 +++++++++++++++++++---------------------------
 fs/xfs/xfs_log.h       |  2 +-
 fs/xfs/xfs_log_cil.c   | 64 +++++++++++++++++++++++++++++++++++++++++---------
 fs/xfs/xfs_log_priv.h  | 10 ++++++--
 fs/xfs/xfs_sysfs.c     |  1 +
 fs/xfs/xfs_trace.c     |  1 +
 fs/xfs/xfs_trans.c     |  2 +-
 fs/xfs/xfs_trans_ail.c | 11 ++++++---
 8 files changed, 97 insertions(+), 53 deletions(-)

diff --git a/fs/xfs/xfs_log.c b/fs/xfs/xfs_log.c
index 84cd9b6c6d1f..51acc0075699 100644
--- a/fs/xfs/xfs_log.c
+++ b/fs/xfs/xfs_log.c
@@ -50,11 +50,6 @@ xlog_state_get_iclog_space(
 	int			*continued_write,
 	int			*logoffsetp);
 STATIC void
-xlog_state_switch_iclogs(
-	struct xlog		*log,
-	struct xlog_in_core	*iclog,
-	int			eventual_size);
-STATIC void
 xlog_grant_push_ail(
 	struct xlog		*log,
 	int			need_bytes);
@@ -511,7 +506,7 @@ __xlog_state_release_iclog(
  * Flush iclog to disk if this is the last reference to the given iclog and the
  * it is in the WANT_SYNC state.
  */
-static int
+int
 xlog_state_release_iclog(
 	struct xlog		*log,
 	struct xlog_in_core	*iclog)
@@ -531,23 +526,6 @@ xlog_state_release_iclog(
 	return 0;
 }
 
-void
-xfs_log_release_iclog(
-	struct xlog_in_core	*iclog)
-{
-	struct xlog		*log = iclog->ic_log;
-	bool			sync = false;
-
-	if (atomic_dec_and_lock(&iclog->ic_refcnt, &log->l_icloglock)) {
-		if (iclog->ic_state != XLOG_STATE_IOERROR)
-			sync = __xlog_state_release_iclog(log, iclog);
-		spin_unlock(&log->l_icloglock);
-	}
-
-	if (sync)
-		xlog_sync(log, iclog);
-}
-
 /*
  * Mount a log filesystem
  *
@@ -3188,7 +3166,7 @@ xfs_log_ticket_ungrant(
  * This routine will mark the current iclog in the ring as WANT_SYNC and move
  * the current iclog pointer to the next iclog in the ring.
  */
-STATIC void
+void
 xlog_state_switch_iclogs(
 	struct xlog		*log,
 	struct xlog_in_core	*iclog,
@@ -3335,6 +3313,20 @@ xfs_log_force(
 	return -EIO;
 }
 
+/*
+ * Force the log to a specific LSN.
+ *
+ * If an iclog with that lsn can be found:
+ *	If it is in the DIRTY state, just return.
+ *	If it is in the ACTIVE state, move the in-core log into the WANT_SYNC
+ *		state and go to sleep or return.
+ *	If it is in any other state, go to sleep or return.
+ *
+ * Synchronous forces are implemented with a wait queue.  All callers trying
+ * to force a given lsn to disk must wait on the queue attached to the
+ * specific in-core log.  When given in-core log finally completes its write
+ * to disk, that thread will wake up all threads waiting on the queue.
+ */
 static int
 xlog_force_lsn(
 	struct xlog		*log,
@@ -3398,18 +3390,15 @@ xlog_force_lsn(
 }
 
 /*
- * Force the in-core log to disk for a specific LSN.
+ * Force the log to a specific checkpoint sequence.
  *
- * Find in-core log with lsn.
- *	If it is in the DIRTY state, just return.
- *	If it is in the ACTIVE state, move the in-core log into the WANT_SYNC
- *		state and go to sleep or return.
- *	If it is in any other state, go to sleep or return.
+ * First force the CIL so that all the required changes have been flushed to the
+ * iclogs. If the CIL force completed it will return a commit LSN that indicates
+ * the iclog that needs to be flushed to stable storage.
  *
- * Synchronous forces are implemented with a wait queue.  All callers trying
- * to force a given lsn to disk must wait on the queue attached to the
- * specific in-core log.  When given in-core log finally completes its write
- * to disk, that thread will wake up all threads waiting on the queue.
+ * If the XFS_LOG_SYNC flag is not set, we only trigger a background CIL force
+ * and do not wait for it to complete, nor do we attempt to check/flush iclogs
+ * as the CIL will not have committed when xlog_cil_force_seq() returns.
  */
 int
 xfs_log_force_seq(
@@ -3426,7 +3415,7 @@ xfs_log_force_seq(
 	XFS_STATS_INC(mp, xs_log_force);
 	trace_xfs_log_force(mp, seq, _RET_IP_);
 
-	lsn = xlog_cil_force_seq(log, seq);
+	lsn = xlog_cil_force_seq(log, flags, seq);
 	if (lsn == NULLCOMMITLSN)
 		return 0;
 
diff --git a/fs/xfs/xfs_log.h b/fs/xfs/xfs_log.h
index 33ae53401060..ff480d372dcf 100644
--- a/fs/xfs/xfs_log.h
+++ b/fs/xfs/xfs_log.h
@@ -104,6 +104,7 @@ struct xlog_ticket;
 struct xfs_log_item;
 struct xfs_item_ops;
 struct xfs_trans;
+struct xlog;
 
 int	  xfs_log_force(struct xfs_mount *mp, uint flags);
 int	  xfs_log_force_seq(struct xfs_mount *mp, uint64_t seq, uint flags,
@@ -117,7 +118,6 @@ void	xfs_log_mount_cancel(struct xfs_mount *);
 xfs_lsn_t xlog_assign_tail_lsn(struct xfs_mount *mp);
 xfs_lsn_t xlog_assign_tail_lsn_locked(struct xfs_mount *mp);
 void	  xfs_log_space_wake(struct xfs_mount *mp);
-void	  xfs_log_release_iclog(struct xlog_in_core *iclog);
 int	  xfs_log_reserve(struct xfs_mount *mp,
 			  int		   length,
 			  int		   count,
diff --git a/fs/xfs/xfs_log_cil.c b/fs/xfs/xfs_log_cil.c
index 2fda8c4513b2..1b46559ef64a 100644
--- a/fs/xfs/xfs_log_cil.c
+++ b/fs/xfs/xfs_log_cil.c
@@ -657,6 +657,7 @@ xlog_cil_push_work(
 	xfs_lsn_t		commit_lsn;
 	xfs_lsn_t		push_seq;
 	DECLARE_COMPLETION_ONSTACK(bdev_flush);
+	bool			commit_iclog_sync = false;
 
 	new_ctx = kmem_zalloc(sizeof(*new_ctx), KM_NOFS);
 	new_ctx->ticket = xlog_cil_ticket_alloc(log);
@@ -667,7 +668,12 @@ xlog_cil_push_work(
 	spin_lock(&cil->xc_push_lock);
 	push_seq = cil->xc_push_seq;
 	ASSERT(push_seq <= ctx->sequence);
+	commit_iclog_sync = cil->xc_push_async;
+	cil->xc_push_async = false;
 
+trace_printk("curseq %llu, ctxseq %llu pushseq %llu empty %d",
+		cil->xc_current_sequence, cil->xc_ctx->sequence,
+		cil->xc_push_seq, list_empty(&cil->xc_cil));
 	/*
 	 * As we are about to switch to a new, empty CIL context, we no longer
 	 * need to throttle tasks on CIL space overruns. Wake any waiters that
@@ -910,7 +916,11 @@ xlog_cil_push_work(
 		commit_iclog->ic_flags &= ~XLOG_ICL_NEED_FLUSH;
 
 	/* release the hounds! */
-	xfs_log_release_iclog(commit_iclog);
+	spin_lock(&log->l_icloglock);
+	if (commit_iclog_sync && commit_iclog->ic_state == XLOG_STATE_ACTIVE)
+		xlog_state_switch_iclogs(log, commit_iclog, 0);
+	xlog_state_release_iclog(log, commit_iclog);
+	spin_unlock(&log->l_icloglock);
 	return;
 
 out_skip:
@@ -993,13 +1003,26 @@ xlog_cil_push_background(
 /*
  * xlog_cil_push_now() is used to trigger an immediate CIL push to the sequence
  * number that is passed. When it returns, the work will be queued for
- * @push_seq, but it won't be completed. The caller is expected to do any
- * waiting for push_seq to complete if it is required.
+ * @push_seq, but it won't be completed.
+ *
+ * If the caller is performing a synchronous force, we will flush the workqueue
+ * to get previously queued work moving to minimise the wait time they will
+ * undergo waiting for all outstanding pushes to complete. The caller is
+ * expected to do the required waiting for push_seq to complete.
+ *
+ * If the caller is performing an async push, we need to ensure that the
+ * checkpoint is fully flushed out of the iclogs when we finish the push. If we
+ * don't do this, then the commit record may remain sitting in memory in an
+ * ACTIVE iclog. This then requires another full log force to push to disk,
+ * which defeats the purpose of having an async, non-blocking CIL force
+ * mechanism. Hence in this case we need to pass a flag to the push work to
+ * indicate it needs to flush the commit record itself.
  */
 static void
 xlog_cil_push_now(
 	struct xlog	*log,
-	xfs_lsn_t	push_seq)
+	xfs_lsn_t	push_seq,
+	bool		sync)
 {
 	struct xfs_cil	*cil = log->l_cilp;
 
@@ -1009,7 +1032,8 @@ xlog_cil_push_now(
 	ASSERT(push_seq && push_seq <= cil->xc_current_sequence);
 
 	/* start on any pending background push to minimise wait time on it */
-	flush_work(&cil->xc_push_work);
+	if (sync)
+		flush_work(&cil->xc_push_work);
 
 	/*
 	 * If the CIL is empty or we've already pushed the sequence then
@@ -1022,6 +1046,8 @@ xlog_cil_push_now(
 	}
 
 	cil->xc_push_seq = push_seq;
+	if (!sync)
+		cil->xc_push_async = true;
 	queue_work(log->l_mp->m_cil_workqueue, &cil->xc_push_work);
 	spin_unlock(&cil->xc_push_lock);
 }
@@ -1109,16 +1135,22 @@ xlog_cil_commit(
 /*
  * Conditionally push the CIL based on the sequence passed in.
  *
- * We only need to push if we haven't already pushed the sequence
- * number given. Hence the only time we will trigger a push here is
- * if the push sequence is the same as the current context.
+ * We only need to push if we haven't already pushed the sequence number given.
+ * Hence the only time we will trigger a push here is if the push sequence is
+ * the same as the current context.
  *
- * We return the current commit lsn to allow the callers to determine if a
- * iclog flush is necessary following this call.
+ * If the sequence is zero, push the current sequence. If XFS_LOG_SYNC is set in
+ * the flags, wait for it to complete; otherwise just return NULLCOMMITLSN to
+ * indicate we didn't wait for a commit lsn.
+ *
+ * If we waited for the push to complete, then we return the current commit lsn
+ * to allow the callers to determine if an iclog flush is necessary following
+ * this call.
  */
 xfs_lsn_t
 xlog_cil_force_seq(
 	struct xlog	*log,
+	uint32_t	flags,
 	uint64_t	sequence)
 {
 	struct xfs_cil		*cil = log->l_cilp;
@@ -1127,13 +1159,22 @@ xlog_cil_force_seq(
 
 	ASSERT(sequence <= cil->xc_current_sequence);
 
+	if (!sequence)
+		sequence = cil->xc_current_sequence;
+trace_printk("seq %llu, curseq %llu, ctxseq %llu pushseq %llu flags 0x%x",
+		sequence, cil->xc_current_sequence, cil->xc_ctx->sequence,
+		cil->xc_push_seq, flags);
+
+	trace_xfs_log_force(log->l_mp, sequence, _RET_IP_);
 	/*
 	 * check to see if we need to force out the current context.
 	 * xlog_cil_push() handles racing pushes for the same sequence,
 	 * so no need to deal with it here.
 	 */
 restart:
-	xlog_cil_push_now(log, sequence);
+	xlog_cil_push_now(log, sequence, flags & XFS_LOG_SYNC);
+	if (!(flags & XFS_LOG_SYNC))
+		return commit_lsn;
 
 	/*
 	 * See if we can find a previous sequence still committing.
@@ -1157,6 +1198,7 @@ xlog_cil_force_seq(
 			 * It is still being pushed! Wait for the push to
 			 * complete, then start again from the beginning.
 			 */
+			XFS_STATS_INC(log->l_mp, xs_log_force_sleep);
 			xlog_wait(&cil->xc_commit_wait, &cil->xc_push_lock);
 			goto restart;
 		}
diff --git a/fs/xfs/xfs_log_priv.h b/fs/xfs/xfs_log_priv.h
index 7e8ec77bc6e6..6b1a251dd013 100644
--- a/fs/xfs/xfs_log_priv.h
+++ b/fs/xfs/xfs_log_priv.h
@@ -273,6 +273,7 @@ struct xfs_cil {
 
 	spinlock_t		xc_push_lock ____cacheline_aligned_in_smp;
 	uint64_t		xc_push_seq;
+	bool			xc_push_async;
 	struct list_head	xc_committing;
 	wait_queue_head_t	xc_commit_wait;
 	uint64_t		xc_current_sequence;
@@ -487,6 +488,10 @@ int	xlog_write(struct xlog *log, struct xfs_log_vec *log_vector,
 		struct xlog_in_core **commit_iclog, uint optype);
 int	xlog_commit_record(struct xlog *log, struct xlog_ticket *ticket,
 		struct xlog_in_core **iclog, xfs_lsn_t *lsn);
+void	xlog_state_switch_iclogs(struct xlog *log, struct xlog_in_core *iclog,
+		int eventual_size);
+int	xlog_state_release_iclog(struct xlog *xlog, struct xlog_in_core *iclog);
+
 void	xfs_log_ticket_ungrant(struct xlog *log, struct xlog_ticket *ticket);
 void	xfs_log_ticket_regrant(struct xlog *log, struct xlog_ticket *ticket);
 
@@ -558,12 +563,13 @@ void	xlog_cil_commit(struct xlog *log, struct xfs_trans *tp,
 /*
  * CIL force routines
  */
-xfs_lsn_t xlog_cil_force_seq(struct xlog *log, uint64_t sequence);
+xfs_lsn_t xlog_cil_force_seq(struct xlog *log, uint32_t flags,
+				uint64_t sequence);
 
 static inline void
 xlog_cil_force(struct xlog *log)
 {
-	xlog_cil_force_seq(log, log->l_cilp->xc_current_sequence);
+	xlog_cil_force_seq(log, XFS_LOG_SYNC, log->l_cilp->xc_current_sequence);
 }
 
 /*
diff --git a/fs/xfs/xfs_sysfs.c b/fs/xfs/xfs_sysfs.c
index f1bc88f4367c..18dc5eca6c04 100644
--- a/fs/xfs/xfs_sysfs.c
+++ b/fs/xfs/xfs_sysfs.c
@@ -10,6 +10,7 @@
 #include "xfs_log_format.h"
 #include "xfs_trans_resv.h"
 #include "xfs_sysfs.h"
+#include "xfs_log.h"
 #include "xfs_log_priv.h"
 #include "xfs_mount.h"
 
diff --git a/fs/xfs/xfs_trace.c b/fs/xfs/xfs_trace.c
index 9b8d703dc9fd..d111a994b7b6 100644
--- a/fs/xfs/xfs_trace.c
+++ b/fs/xfs/xfs_trace.c
@@ -20,6 +20,7 @@
 #include "xfs_bmap.h"
 #include "xfs_attr.h"
 #include "xfs_trans.h"
+#include "xfs_log.h"
 #include "xfs_log_priv.h"
 #include "xfs_buf_item.h"
 #include "xfs_quota.h"
diff --git a/fs/xfs/xfs_trans.c b/fs/xfs/xfs_trans.c
index 697703f3be48..7d05694681e3 100644
--- a/fs/xfs/xfs_trans.c
+++ b/fs/xfs/xfs_trans.c
@@ -9,7 +9,6 @@
 #include "xfs_shared.h"
 #include "xfs_format.h"
 #include "xfs_log_format.h"
-#include "xfs_log_priv.h"
 #include "xfs_trans_resv.h"
 #include "xfs_mount.h"
 #include "xfs_extent_busy.h"
@@ -17,6 +16,7 @@
 #include "xfs_trans.h"
 #include "xfs_trans_priv.h"
 #include "xfs_log.h"
+#include "xfs_log_priv.h"
 #include "xfs_trace.h"
 #include "xfs_error.h"
 #include "xfs_defer.h"
diff --git a/fs/xfs/xfs_trans_ail.c b/fs/xfs/xfs_trans_ail.c
index dbb69b4bf3ed..dfc0206c0d36 100644
--- a/fs/xfs/xfs_trans_ail.c
+++ b/fs/xfs/xfs_trans_ail.c
@@ -17,6 +17,7 @@
 #include "xfs_errortag.h"
 #include "xfs_error.h"
 #include "xfs_log.h"
+#include "xfs_log_priv.h"
 
 #ifdef DEBUG
 /*
@@ -429,8 +430,12 @@ xfsaild_push(
 
 	/*
 	 * If we encountered pinned items or did not finish writing out all
-	 * buffers the last time we ran, force the log first and wait for it
-	 * before pushing again.
+	 * buffers the last time we ran, force a background CIL push to get the
+	 * items unpinned in the near future. We do not wait on the CIL push as
+	 * that could stall us for seconds if there is enough background IO
+	 * load. Stalling for that long when the tail of the log is pinned and
+	 * needs flushing will hard stop the transaction subsystem when log
+	 * space runs out.
 	 */
 	if (ailp->ail_log_flush && ailp->ail_last_pushed_lsn == 0 &&
 	    (!list_empty_careful(&ailp->ail_buf_list) ||
@@ -438,7 +443,7 @@ xfsaild_push(
 		ailp->ail_log_flush = 0;
 
 		XFS_STATS_INC(mp, xs_push_ail_flush);
-		xfs_log_force(mp, XFS_LOG_SYNC);
+		xlog_cil_force_seq(mp->m_log, 0, 0);
 	}
 
 	spin_lock(&ailp->ail_lock);

^ permalink raw reply related	[flat|nested] 33+ messages in thread

* Re: [PATCH 2/3 v2] xfs: AIL needs asynchronous CIL forcing
  2021-02-24 23:26     ` [PATCH 2/3 v2] " Dave Chinner
@ 2021-02-25  2:15       ` Dave Chinner
  2021-03-02 21:44       ` Brian Foster
  1 sibling, 0 replies; 33+ messages in thread
From: Dave Chinner @ 2021-02-25  2:15 UTC (permalink / raw)
  To: linux-xfs

On Thu, Feb 25, 2021 at 10:26:00AM +1100, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> The AIL pushing is stalling on log forces when it comes across
> pinned items. This is happening on removal workloads where the AIL
> is dominated by stale items that are removed from AIL when the
> checkpoint that marks the items stale is committed to the journal.
> This results in relatively few items in the AIL, but those that
> are there are often pinned, as the directories that items are being
> removed from are still being logged.
.....
> One of the complexities here is that the CIL push does not guarantee
> that the commit record for the CIL checkpoint is written to disk.
> The current log force ensures this by submitting the current ACTIVE
> iclog that the commit record was written to. We need the CIL to
> actually write this commit record to disk for an async push to
> ensure that the checkpoint actually makes it to disk and unpins the
> pinned items in the checkpoint on completion. Hence we need to pass
> down to the CIL push that we are doing an async flush so that it can
> switch out the commit_iclog if necessary to get written to disk when
> the commit iclog is finally released.
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---
> Version 2:
> - ensure the CIL checkpoint issues the commit record to disk for an
>   async push. Fixes generic/530 hang on small logs.
> - increment log force stats when the CIL is forced and also when it
>   sleeps to give insight into the amount of blocking being done when
>   the CIL is forced.

Oops, looks like I forgot to strip debug trace_printk()s out of the
patch before sending it. They are gone now, so I'll wait for review
comments before resending again...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH 2/3] xfs: AIL needs asynchronous CIL forcing
  2021-02-24 21:10   ` Dave Chinner
  2021-02-24 23:26     ` [PATCH 2/3 v2] " Dave Chinner
@ 2021-02-25 13:12     ` Brian Foster
  2021-02-25 22:03       ` Dave Chinner
  1 sibling, 1 reply; 33+ messages in thread
From: Brian Foster @ 2021-02-25 13:12 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Thu, Feb 25, 2021 at 08:10:58AM +1100, Dave Chinner wrote:
> On Tue, Feb 23, 2021 at 04:32:11PM +1100, Dave Chinner wrote:
> > From: Dave Chinner <dchinner@redhat.com>
> > 
> > The AIL pushing is stalling on log forces when it comes across
> > pinned items. This is happening on removal workloads where the AIL
> > is dominated by stale items that are removed from AIL when the
> > checkpoint that marks the items stale is committed to the journal.
> > This results in relatively few items in the AIL, but those that
> > are there are often pinned, as the directories that items are being
> > removed from are still being logged.
> > 
> > As a result, many push cycles through the CIL will first issue a
> > blocking log force to unpin the items. This can take some time to
> > complete, with tracing regularly showing push delays of half a
> > second and sometimes up into the range of several seconds. Sequences
> > like this aren't uncommon:
> > 
> > ....
> >  399.829437:  xfsaild: last lsn 0x11002dd000 count 101 stuck 101 flushing 0 tout 20
> > <wanted 20ms, got 270ms delay>
> >  400.099622:  xfsaild: target 0x11002f3600, prev 0x11002f3600, last lsn 0x0
> >  400.099623:  xfsaild: first lsn 0x11002f3600
> >  400.099679:  xfsaild: last lsn 0x1100305000 count 16 stuck 11 flushing 0 tout 50
> > <wanted 50ms, got 500ms delay>
> >  400.589348:  xfsaild: target 0x110032e600, prev 0x11002f3600, last lsn 0x0
> >  400.589349:  xfsaild: first lsn 0x1100305000
> >  400.589595:  xfsaild: last lsn 0x110032e600 count 156 stuck 101 flushing 30 tout 50
> > <wanted 50ms, got 460ms delay>
> >  400.950341:  xfsaild: target 0x1100353000, prev 0x110032e600, last lsn 0x0
> >  400.950343:  xfsaild: first lsn 0x1100317c00
> >  400.950436:  xfsaild: last lsn 0x110033d200 count 105 stuck 101 flushing 0 tout 20
> > <wanted 20ms, got 200ms delay>
> >  401.142333:  xfsaild: target 0x1100361600, prev 0x1100353000, last lsn 0x0
> >  401.142334:  xfsaild: first lsn 0x110032e600
> >  401.142535:  xfsaild: last lsn 0x1100353000 count 122 stuck 101 flushing 8 tout 10
> > <wanted 10ms, got 10ms delay>
> >  401.154323:  xfsaild: target 0x1100361600, prev 0x1100361600, last lsn 0x1100353000
> >  401.154328:  xfsaild: first lsn 0x1100353000
> >  401.154389:  xfsaild: last lsn 0x1100353000 count 101 stuck 101 flushing 0 tout 20
> > <wanted 20ms, got 300ms delay>
> >  401.451525:  xfsaild: target 0x1100361600, prev 0x1100361600, last lsn 0x0
> >  401.451526:  xfsaild: first lsn 0x1100353000
> >  401.451804:  xfsaild: last lsn 0x1100377200 count 170 stuck 22 flushing 122 tout 50
> > <wanted 50ms, got 500ms delay>
> >  401.933581:  xfsaild: target 0x1100361600, prev 0x1100361600, last lsn 0x0
> > ....
> > 
> > In each of these cases, every AIL pass saw 101 log items stuck on
> > the AIL (pinned) with very few other items being found. Each pass, a
> > log force was issued, and delay between last/first is the sleep time
> > + the sync log force time.
> > 
> > Some of these 101 items pinned the tail of the log. The tail of the
> > log does slowly creep forward (first lsn), but the problem is that
> > the log is actually out of reservation space because it's been
> > running so many transactions that stale items that never reach the
> > AIL but consume log space. Hence we have a largely empty AIL, with
> > long term pins on items that pin the tail of the log that don't get
> > pushed frequently enough to keep log space available.
> > 
> > The problem is the hundreds of milliseconds that we block in the log
> > force pushing the CIL out to disk. The AIL should not be stalled
> > like this - it needs to run and flush items that are at the tail of
> > the log with minimal latency. What we really need to do is trigger a
> > log flush, but then not wait for it at all - we've already done our
> > waiting for stuff to complete when we backed off prior to the log
> > force being issued.
> > 
> > Even if we remove the XFS_LOG_SYNC from the xfs_log_force() call, we
> > still do a blocking flush of the CIL and that is what is causing the
> > issue. Hence we need a new interface for the CIL to trigger an
> > immediate background push of the CIL to get it moving faster but not
> > to wait on that to occur. While the CIL is pushing, the AIL can also
> > be pushing.
> > 
> > We already have an internal interface to do this -
> > xlog_cil_push_now() - but we need a wrapper for it to be used
> > externally. xlog_cil_force_seq() can easily be extended to do what
> > we need as it already implements the synchronous CIL push via
> > xlog_cil_push_now(). Add the necessary flags and "push current
> > sequence" semantics to xlog_cil_force_seq() and convert the AIL
> > pushing to use it.
> 
> Overnight testing indicates generic/530 hangs on small logs with
> this patch. It essentially runs out of log space because there are
> inode cluster buffers permanently pinned by this workload.
> 
> That is, as inodes are unlinked, they repeatedly relog the inode
> cluster buffer, and so while the CIL is flushing to unpin the
> buffer, a new unlink relogs it and adds a new pin to the buffer.
> Hence the log force that the AIL does to unpin pinned items doesn't
> actually unpin buffers that are being relogged.
> 
> These buffers only get unpinned when the log actually runs out of
> space because they pin the tail of the log. Then the modifications
> that are relogging the buffer stop because they fail to get a log
> reservation, and the tail of the log does not move forward until the
> AIL issues a log force that finally unpins the buffer and writes it
> back.
> 
> Essentially, relogged buffers cannot be flushed by the AIL until the
> log completely stalls.
> 
> The problem being tripped over here is that we no longer force the
> final iclogs in a CIL push to disk - we leave the iclog with the
> commit record in it in ACTIVE state, and by not waiting and forcing
> all the iclogs to disk, the buffer never gets unpinned because there
> isn't any more log pressure to force it out because everything is
> blocked on reservation space.
> 
> The solution to this is to have the CIL push change the state of the
> commit iclog to WANT_SYNC before it is released. This means the CIL
> push will always flush the iclog to disk and the checkpoint will
> complete and unpin the buffers.
> 
> Right now, we really only want to do this state switch for these
> async pushes - for small sync transactions and fsync we really want
> the iclog aggregation that we have now to optimise iclogbuf usage,
> so I'll have to pass a new flag through the push code and back into
> xlog_write(). That will make this patch behave the same as we
> currently do.
> 

Unfortunately I've not yet caught up to reviewing your most recently
posted set of log patches so I can easily be missing some context, but
when passing through some of the feedback/updates so far this has me
rather confused. We discussed this pre-existing CIL behavior in the
previous version, I suggested some similar potential behavior change
where we would opportunistically send off checkpoint iclogs for I/O a
bit earlier than normal and you argued [1] against it. Now it sounds
like not only are we implementing that, but it's actually necessary to
fix a log hang problem..? What am I missing?

The updated iclog behavior does sound more friendly to me than what we
currently do (obviously, based on my previous comments), but I am
slightly skeptical of how such a change fixes the root cause of a hang.
Is this a stall/perf issue or an actual log deadlock? If the latter,
what prevents this deadlock on current upstream?

Brian

[1] https://lore.kernel.org/linux-xfs/20210129222559.GT4662@dread.disaster.area/

> In the longer term, we need to change how the AIL deals with pinned
> buffers, as the current method is definitely sub-optimal. It also
> explains the "everything stops for a second" performance anomalies
> I've been seeing for a while in testing. But fixing that is outside
> the scope of this patchset, so in the mean time I'll fix this one up
> and repost it in a little while.
> 
> FWIW, this is also the likely cause of the "CIL workqueue too deep"
> hangs I was seeing with the next patch in the series, too.
> 
> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
> 


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH 1/3] xfs: xfs_log_force_lsn isn't passed a LSN
  2021-02-24 21:42   ` Darrick J. Wong
  2021-02-24 22:19     ` Dave Chinner
@ 2021-02-25 19:01     ` Christoph Hellwig
  1 sibling, 0 replies; 33+ messages in thread
From: Christoph Hellwig @ 2021-02-25 19:01 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: Dave Chinner, linux-xfs

On Wed, Feb 24, 2021 at 01:42:35PM -0800, Darrick J. Wong wrote:
> FWIW I rather wish you'd defined a new type for cil sequence numbers,
> since uint64_t is rather generic.  Even if checkpatch whines about new
> typedefs.
> 
> I was kind of hoping that we'd be able to mark xfs_lsn_t and xfs_csn_t
> with __bitwise and so static checkers could catch us if we accidentally
> feed a CIL sequence number into a function that wants an LSN.

__bitwise is a rather bad fit for xfs_lsn_t.  It is used as a scalar and
not as flags, which is what __bitwise is for, and it is written to disk in
many places.  And in case you're asking:  I tried to quickly convert it and
noticed that the hard way.
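
To give a flavour of why (illustrative fragments only, not from an
actual conversion attempt): with a __bitwise xfs_lsn_t, every on-disk
store and every scalar use grows a __force cast, e.g.

	iclog->ic_header.h_lsn = cpu_to_be64((__force uint64_t)lsn);

	if ((__force uint64_t)lsn1 > (__force uint64_t)lsn2)
		/* ... */;

and there are a lot of those in the log and log recovery code.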

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH 2/3] xfs: AIL needs asynchronous CIL forcing
  2021-02-25 13:12     ` [PATCH 2/3] " Brian Foster
@ 2021-02-25 22:03       ` Dave Chinner
  2021-02-27 16:25         ` Brian Foster
  0 siblings, 1 reply; 33+ messages in thread
From: Dave Chinner @ 2021-02-25 22:03 UTC (permalink / raw)
  To: Brian Foster; +Cc: linux-xfs

On Thu, Feb 25, 2021 at 08:12:12AM -0500, Brian Foster wrote:
> On Thu, Feb 25, 2021 at 08:10:58AM +1100, Dave Chinner wrote:
> > On Tue, Feb 23, 2021 at 04:32:11PM +1100, Dave Chinner wrote:
> > The problem being tripped over here is that we no longer force the
> > final iclogs in a CIL push to disk - we leave the iclog with the
> > commit record in it in ACTIVE state, and by not waiting and forcing
> > all the iclogs to disk, the buffer never gets unpinned because there
> > isn't any more log pressure to force it out because everything is
> > blocked on reservation space.
> > 
> > The solution to this is to have the CIL push change the state of the
> > commit iclog to WANT_SYNC before it is released. This means the CIL
> > push will always flush the iclog to disk and the checkpoint will
> > complete and unpin the buffers.
> > 
> > Right now, we really only want to do this state switch for these
> > async pushes - for small sync transactions and fsync we really want
> > the iclog aggregation that we have now to optimise iclogbuf usage,
> > so I'll have to pass a new flag through the push code and back into
> > xlog_write(). That will make this patch behave the same as we
> > currently do.
> > 
> 
> Unfortunately I've not yet caught up to reviewing your most recently
> posted set of log patches so I can easily be missing some context, but
> when passing through some of the feedback/updates so far this has me
> rather confused. We discussed this pre-existing CIL behavior in the
> previous version, I suggested some similar potential behavior change
> where we would opportunistically send off checkpoint iclogs for I/O a
> bit earlier than normal and you argued [1] against it. Now it sounds
> like not only are we implementing that, but it's actually necessary to
> fix a log hang problem..? What am I missing?

You're talking about this comment?

| > That means there's an increased chance that
| > the next iclog in the ring may be active. Perhaps we could introduce
| > some logic to switch out the commit record iclog before the CIL push
| > returns in those particular cases.  For example, switch out if the
| > current checkpoint hit multiple iclogs..? Or hit multiple iclogs and the
| > next iclog in the ring is clean..?
|
| We could just call xlog_state_switch_iclogs() to mark it WANT_SYNC,
| but then we can't aggregate more changes into it and fill it up. If
| someone starts waiting on that iclog (i.e. a log force), then it
| immediately gets marked WANT_SYNC and submitted to disk when it is
| released. But if there is no-one waiting on it, then we largely
| don't care if an asynchronous checkpoint is committed immediately,
| at the start of the next checkpoint, or at worst, within 30s when
| the log worker next kicks a log force....

I'm arguing against the general case here of submitting iclog
buffers with checkpoints in them early. There is no reason to do that
because the xfs_log_force() code will kick the iclog to disk if
necessary. That still stands - I have not changed behaviour
unconditionally.

This patch only kicks the commit_iclog to disk if the cause of the
CIL push was an async call to xlog_cil_force_seq(). That can occur
from xfs_iunpin() and, with this patch, the AIL.

In both cases, all they care about is that the CIL checkpoint fully
completes in the near future. For xfs_iunpin(), the log force
doesn't guarantee the inode is actually unpinned, hence it waits on
the pin count and not the log force. So waiting for the CIL push is
unnecessary as long as the CIL push gets the entire checkpoint to
disk.
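
The shape of that wait is roughly this (paraphrased from the
xfs_iunpin_wait() path, not verbatim; waitqueue setup elided):

	xfs_iunpin(ip);				/* kicks the CIL force */
	do {
		prepare_to_wait(&wq, &wait, TASK_UNINTERRUPTIBLE);
		if (xfs_ipincount(ip))
			io_schedule();		/* woken by the last unpin */
	} while (xfs_ipincount(ip));
	finish_wait(&wq, &wait);

so the force itself never needs to block - correctness comes from
waiting on the pin count.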

Similarly, the AIL only needs the log force to guarantee the CIL
commits to disk in the near future, but it does not need to block
waiting for that to occur.

So in these cases, we don't want to block while the CIL is pushing
just so the log force code can iterate the last iclog the CIL wrote
to flush it to disk. Instead, we pass a flag to the CIL to tell it
"you need to push the commit record to disk" so we don't have to
wait for it to then check if it pushed the commit record or not.
This is a special case, not the general case where we are running
lots of fsync or sync transactions.
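
In terms of the interfaces in the patch, the two cases end up looking
like this (sketch using the signatures above):

	/* AIL: kick the CIL and return; the push work now guarantees
	 * the commit record iclog gets submitted */
	xlog_cil_force_seq(mp->m_log, 0, 0);

	/* sync force (e.g. fsync): push, wait for the commit lsn, then
	 * flush iclogs up to it */
	lsn = xlog_cil_force_seq(log, XFS_LOG_SYNC, seq);
	if (lsn != NULLCOMMITLSN)
		ret = xlog_force_lsn(log, lsn, flags, log_flushed, false);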

In the general case, we are trying to aggregate as many sync
transactions/fsync log forces as possible into the current iclog to
batch them up effectively. We'll continue to batch them while there
is space in the current iclog and the previous iclog is in the
WANT_SYNC/SYNCING state indicating it is still being written to
disk.

This automated batching behaviour of the iclog state machine is one
of the reasons XFS has always had excellent performance under
sync transaction/fsync heavy workloads such as NFS/CIFS servers. It
is functionality we need to preserve, and this patch does not
actually change that behaviour.
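
That batching falls out of the already_slept backoff in
xlog_force_lsn() (visible in the patch 1 hunk quoted earlier) - while
the previous iclog is still in flight, a sync force backs off so more
commits can aggregate into the current iclog:

	if (!already_slept &&
	    (iclog->ic_prev->ic_state == XLOG_STATE_WANT_SYNC ||
	     iclog->ic_prev->ic_state == XLOG_STATE_SYNCING)) {
		xlog_wait(&iclog->ic_prev->ic_write_wait,
				&log->l_icloglock);
		return -EAGAIN;
	}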

> The updated iclog behavior does sound more friendly to me than what we
> currently do (obviously, based on my previous comments),

As per above, the conditional behaviour I've added here is actually
less friendly :)

> but I am
> slightly skeptical of how such a change fixes the root cause of a hang.
> Is this a stall/perf issue or an actual log deadlock? If the latter,
> what prevents this deadlock on current upstream?

The problem that I talk about is introduced by this patch and also
fixed by this patch. The problem does not exist until we have code
that depends on an async CIL push guaranteeing that items in the CIL
are unpinned. A stock xfs_log_force_seq(0) call guarantees that the
log reaches disk by waiting on the CIL push then pushing the iclogs.
All it does is elide waiting for the final iclogs to hit the disk.

By eliding the "wait for CIL push" in an async xfs_log_force_seq()
call, we also elide the "push iclogs" part of the log force and
that's what causes the AIL problem. We don't push the iclogs, so the
items committed to the last iclog don't get dispatched and the items
never get unpinned because the checkpoint never completes.

So this "AIL can hang because tail items aren't unpinned" condition
is exposed by async CIL forces. It is also fixed by this patch
passing the async push state to the CIL push and having it ensure
that the commit iclog is written out if the CIL push was triggered
by an async push.

Like all log forces, async pushes are rare, but we have to ensure
they behave correctly. There is no bug in the code before this
patch, and there is no bug in the code (that I know of) after this
patch. I've described a condition that we have to handle in this
patch that, if we don't, we end up with broken AIL pushing.

As to the "root cause of the hang" I now know what that is too, but
addressing it is outside the scope of this patchset. It's taken me a
long time to identify this, but I've been seeing symptoms of it for
quite a long time (i.e. years) but until doing this async CIL push
work I've never been able to identify the cause.

That is, the root cause is that the AIL cannot force a pinned buffer
to disk, and a log force (of any kind) does not guarantee a pinned
item is unpinned. That's because the CIL is pipelined, meaning that
an item can be pinned in one CIL context, and while that is being
pushed, the front end can relog that same item and pin it in the
current CIL context. Hence it has two pins, not one. Now when the
original log force completes, the pin from the first CIL context has
been removed, but the pin from the current CIL context is still
present.
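
As a timeline (pin counts illustrative):

	CIL ctx N:	commit logs the item		-> pin count 1
	log force:	push of ctx N starts
	CIL ctx N+1:	front end relogs the item	-> pin count 2
	ctx N checkpoint completes, unpins		-> pin count 1
	log force returns: the item is *still* pinned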

Hence when we are relogging buffers over time (e.g. repeatedly
modifying a directory), the AIL always sees the buffer pinned and
cannot flush it. Even if the buffer has been moved to the delwri
list (i.e. had a successful push) we still can't write it out
because the delwri code skips pinned buffers.
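
(The delwri submission loop does roughly the following - paraphrased
from xfs_buf_delwri_submit_buffers(), not verbatim:)

	list_for_each_entry_safe(bp, n, buffer_list, b_list) {
		...
		if (xfs_buf_ispinned(bp)) {
			/* skipped: only a log force and checkpoint
			 * I/O completion will unpin it */
			pinned++;
			continue;
		}
		...
	}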

This iterative pin continues preventing the AIL from flushing the
buffer until it pins the tail of the log. When the log runs out of
space and blocks the modifying/relogging task(s), the log force the
AIL issues allows the CIL to be pushed without the front end
relogging the item and re-pinning the item. At the completion of the
log force, the AIL can now push the unpinned buffer and hence move
the tail of the log forward which frees up log reservation space.
This then wakes the front end, which goes back to its relogging
game and the cycle repeats over.

This is really nasty behaviour, and it's only recently that I've got
a handle on it. I found it because my original "async CIL push" code
resulted in long stalls every time the log is filled and the tail is
pinned by a buffer that is being relogged in this manner....

I'm not sure how to fix this yet - the AIL needs to block the front
end relogging to allow the buffer to be unpinned. Essentially, we
need to hold the pinned items locked across a CIL push to guarantee
they are unpinned, but that's the complete opposite of what the AIL
currently does to prevent the front end from seeing long tail lock
latencies when modifying stuff....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH 2/3] xfs: AIL needs asynchronous CIL forcing
  2021-02-25 22:03       ` Dave Chinner
@ 2021-02-27 16:25         ` Brian Foster
  2021-03-01  4:54           ` Dave Chinner
  0 siblings, 1 reply; 33+ messages in thread
From: Brian Foster @ 2021-02-27 16:25 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Fri, Feb 26, 2021 at 09:03:05AM +1100, Dave Chinner wrote:
> On Thu, Feb 25, 2021 at 08:12:12AM -0500, Brian Foster wrote:
> > On Thu, Feb 25, 2021 at 08:10:58AM +1100, Dave Chinner wrote:
> > > On Tue, Feb 23, 2021 at 04:32:11PM +1100, Dave Chinner wrote:
> > > The problem being tripped over here is that we no longer force the
> > > final iclogs in a CIL push to disk - we leave the iclog with the
> > > commit record in it in ACTIVE state, and by not waiting and forcing
> > > all the iclogs to disk, the buffer never gets unpinned because there
> > > isn't any more log pressure to force it out because everything is
> > > blocked on reservation space.
> > > 
> > > The solution to this is to have the CIL push change the state of the
> > > commit iclog to WANT_SYNC before it is released. This means the CIL
> > > push will always flush the iclog to disk and the checkpoint will
> > > complete and unpin the buffers.
> > > 
> > > Right now, we really only want to do this state switch for these
> > > async pushes - for small sync transactions and fsync we really want
> > > the iclog aggregation that we have now to optimise iclogbuf usage,
> > > so I'll have to pass a new flag through the push code and back into
> > > xlog_write(). That will make this patch behave the same as we
> > > currently do.
> > > 
> > 
> > Unfortunately I've not yet caught up to reviewing your most recently
> > posted set of log patches so I can easily be missing some context, but
> > when passing through some of the feedback/updates so far this has me
> > rather confused. We discussed this pre-existing CIL behavior in the
> > previous version, I suggested some similar potential behavior change
> > where we would opportunistically send off checkpoint iclogs for I/O a
> > bit earlier than normal and you argued [1] against it. Now it sounds
> > like not only are we implementing that, but it's actually necessary to
> > fix a log hang problem..? What am I missing?
> 
> You're talking about this comment?
> 

Yes.

> | > That means there's an increased chance that
> | > the next iclog in the ring may be active. Perhaps we could introduce
> | > some logic to switch out the commit record iclog before the CIL push
> | > returns in those particular cases.  For example, switch out if the
> | > current checkpoint hit multiple iclogs..? Or hit multiple iclogs and the
> | > next iclog in the ring is clean..?
> |
> | We could just call xlog_state_switch_iclogs() to mark it WANT_SYNC,
> | but then we can't aggregate more changes into it and fill it up. If
> | someone starts waiting on that iclog (i.e. a log force), then it
> | immediately gets marked WANT_SYNC and submitted to disk when it is
> | released. But if there is no-one waiting on it, then we largely
> | don't care if an asynchronous checkpoint is committed immediately,
> | at the start of the next checkpoint, or at worst, within 30s when
> | the log worker next kicks a log force....
> 
> I'm arguing against the general case here of submitting iclog
> buffers with checkpoints in them early. THere is no reason to do that
> because the xfs_log_force() code will kick the iclog to disk if
> necessary. That still stands - I have not changed behaviour
> unconditionally.
> 

But the suggestion wasn't to change behavior unconditionally. That seems
fairly clear to me even from just the quoted snippet above..?

> This patch only kicks the commit_iclog to disk if the cause of the
> CIL push was an async call to xlog_cil_force_seq(). That can occur
> from xfs_iunpin() and, with this patch, the AIL.
> 
> In both cases, all they care about is that the CIL checkpoint fully
> completes in the near future. For xfs_iunpin(), the log force
> doesn't guarantee the inode is actually unpinned, hence it waits on
> the pin count and not the log force. So waiting for the CIL push is
> unnecessary as long as the CIL push gets the entire checkpoint to
> disk.
> 
> Similarly, the AIL only needs the log force to guarantee the CIL
> commits to disk in the near future, but it does not need to block
> waiting for that to occur.
> 
> So in these cases, we don't want to block while the CIL is pushing
> just so the log force code can iterate the last iclog the CIL wrote
> to flush it to disk. Instead, we pass a flag to the CIL to tell it
> "you need to push the commit record to disk" so we don't have to
> wait for it to then check if it pushed the commit record or not.
> This is a special case, not the general case where we are running
> lots of fsync or sync transactions.
> 
> In the general case, we are trying to aggregate as many sync
> transactions/fsync log forces as possible into the current iclog to
> batch them up effectively. We'll continue to batch them while there
> is space in the current iclog and the previous iclog is in the
> WANT_SYNC/SYNCING state indicating it is still being written to
> disk.
> 
> This automated batching behaviour of the iclog state machine is one
> of the reasons XFS has always had excellent performance under
> sync transaction/fsync heavy workloads such as NFS/CIFS servers. It
> is functionality we need to preserve, and this patch does not
> actually change that behaviour.
> 

Sure, this all makes sense.
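
If I'm reading the updated patch right, the plumbing amounts to
something like this (a loose sketch of the relevant calls, not the
literal diff):

	/*
	 * Force side: an async force kicks the CIL push work without
	 * waiting for the push to complete. The third argument is the
	 * sync flag, so false here means async.
	 */
	xlog_cil_push_now(log, seq, false);

	/*
	 * Push work side: if this context was pushed by an async force,
	 * mark the commit record iclog WANT_SYNC so it is submitted for
	 * IO when released instead of sitting ACTIVE indefinitely.
	 */
	if (cil->xc_push_async)
		xlog_state_switch_iclogs(log, commit_iclog, 0);
	xlog_state_release_iclog(log, commit_iclog);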

> > The updated iclog behavior does sound more friendly to me than what we
> > currently do (obviously, based on my previous comments),
> 
> As per above, the conditional behaviour I've added here is actually
> less friendly :)
> 

Heh. :P

> > but I am
> > slightly skeptical of how such a change fixes the root cause of a hang.
> > Is this a stall/perf issue or an actual log deadlock? If the latter,
> > what prevents this deadlock on current upstream?
> 
> The problem that I talk about is introduced by this patch and also
> fixed by this patch. The problem does not exist until we have code
> that depends on an async CIL push guaranteeing that items in the CIL
> are unpinned. A stock xfs_log_force_seq(0) call guarantees that the
> log reaches disk by waiting on the CIL push then pushing the iclogs.
> All it does is elide waiting for the final iclogs to hit the disk.
> 
> By eliding the "wait for CIL push" in an async xfs_log_force_seq()
> call, we also elide the "push iclogs" part of the log force and
> that's what causes the AIL problem. We don't push the iclogs, so the
> items committed to the last iclog don't get dispatched and the items
> never get unpinned because the checkpoint never completes.
> 

So in general it sounds like this patch broke async log force by
skipping the current iclog switch somehow or another and then this
change is incorporated to fix it (hence there is no outstanding bug
upstream). That makes a bit more sense. I'll get the detailed context
from the patches.

> So this "AIL can hang because tail items aren't unpinned" condition
> is exposed by async CIL forces. It is also fixed by this patch
> passing the async push state to the CIL push and having it ensure
> that the commit iclog is written out if the CIL push was triggered
> by an async push.
> 
> Like all log forces, async pushes are rare, but we have to ensure
> they behave correctly. There is no bug in the code before this
> patch, and there is no bug in the code (that I know of) after this
> patch. I've described a condition that we have to handle in this
> patch that, if we don't, we end up with broken AIL pushing.
> 

Ok.

> As to the "root cause of the hang" I now know what that is too, but
> addressing it is outside the scope of this patchset. It's taken me a
> long time to identify this, but I've been seeing symptoms of it for
> quite a long time (i.e. years) but until doing this async CIL push
> work I've never been able to identify the cause.
> 
> That is, the root cause is that the AIL cannot force a pinned buffer
> to disk, and a log force (of any kind) does not guarantee a pinned
> item is unpinned. That's because the CIL is pipelined, meaning that
> an item can be pinned in one CIL context, and while that is being
> pushed, the front end can relog that same item and pin it in the
> current CIL context. Hence it has two pins, not one. Now when the
> original log force completes, the pin from the first CIL context has
> been removed, but the pin from the current CIL context is still
> present.
> 

Somewhat related to the above point, ISTM our current behavior can lead
to scenarios where the potential window for elevated pin counts is
somewhat artificially widened. IOW, the log vector for a pincount == 1
item might have been written out to disk, but the log item itself not
unpinned for some time later because the commit record is still sitting
in an active iclog. The unpin processing of the item doesn't occur until
that commit record iclog is synced out and I/O completes, which leaves
plenty of time for another pin to come along in the meantime (and the
scenario potentially repeats, starving out pincount == 0 state). So ISTM
the "damage" is done to some degree long before the AIL has any
opportunity to fix it, particularly if the only recourse it has is a log
force.

(And of course there's a huge tradeoff between current behavior and
something that unconditionally synced out commit record iclogs,
particularly depending on CIL checkpoint sizes, so that's a broad
simplification and not a valid solution.)

> Hence when we are relogging buffers over time (e.g. repeatedly
> modifying a directory), the AIL always sees the buffer pinned and
> cannot flush it. Even if the buffer has been moved to the delwri
> list (i.e. had a successful push) we still can't write it out
> because the delwri code skips pinned buffers.
> 
> This iterative pin continues preventing the AIL from flushing the
> buffer until it pins the tail of the log. When the log runs out of
> space and blocks the modifying/relogging task(s), the log force the
> AIL issues allows the CIL to be pushed without the front end
> relogging the item and re-pinning the item. At the completion of the
> log force, the AIL can now push the unpinned buffer and hence move
> the tail of the log forward which frees up log reservation space.
> This then wakes the front end, which goes back to its relogging
> game and the cycle repeats over.
> 

The description of the nested pin behavior makes sense. I've observed
the same thing recently when looking at some unrelated shutdown
problems.

In any event, if I'm following correctly the deadlock issue was a
transient problem with this patch and not an upstream problem. However
there is still a pipeline stall/performance issue that manifests as
described above, and that preexists the proposed changes. Yes?

> This is really nasty behaviour, and it's only recently that I've got
> a handle on it. I found it because my original "async CIL push" code
> resulted in long stalls every time the log is filled and the tail is
> pinned by a buffer that is being relogged in this manner....
> 
> I'm not sure how to fix this yet - the AIL needs to block the front
> end relogging to allow the buffer to be unpinned. Essentially, we
> need to hold the pinned items locked across a CIL push to guarantee
> they are unpinned, but that's the complete opposite of what the AIL
> currently does to prevent the front end from seeing long tail lock
> latencies when modifying stuff....
> 

When this stall problem manifests, I'm assuming it's exacerbated by
delayed logging and the commit record behavior I described above. If
that's the case, could the AIL communicate writeback pressure through
affected log items such that checkpoints in which they are resident are
flushed out completely/immediately when the checkpoints occur? I suppose
that would require a log item flag or some such, which does raise a
concern of unnecessarily tagging many items (it's not clear to me how
likely that really is), but I'm curious if that would be an effective
POC at least..

Brian

> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
> 


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH 2/3] xfs: AIL needs asynchronous CIL forcing
  2021-02-27 16:25         ` Brian Foster
@ 2021-03-01  4:54           ` Dave Chinner
  2021-03-01 13:32             ` Brian Foster
  0 siblings, 1 reply; 33+ messages in thread
From: Dave Chinner @ 2021-03-01  4:54 UTC (permalink / raw)
  To: Brian Foster; +Cc: linux-xfs

On Sat, Feb 27, 2021 at 11:25:06AM -0500, Brian Foster wrote:
> On Fri, Feb 26, 2021 at 09:03:05AM +1100, Dave Chinner wrote:
> > On Thu, Feb 25, 2021 at 08:12:12AM -0500, Brian Foster wrote:
> > > On Thu, Feb 25, 2021 at 08:10:58AM +1100, Dave Chinner wrote:
> > > > On Tue, Feb 23, 2021 at 04:32:11PM +1100, Dave Chinner wrote:
> > The problem that I talk about is introduced by this patch and also
> > fixed by this patch. The problem does not exist until we have code
> > that depends on an async CIL push guaranteeing that items in the CIL
> > are unpinned. A stock xfs_log_force_seq(0) call guarantees that the
> > log reaches disk by waiting on the CIL push then pushing the iclogs.
> > All it does is elide waiting for the final iclogs to hit the disk.
> > 
> > By eliding the "wait for CIL push" in an async xfs_log_force_seq()
> > call, we also elide the "push iclogs" part of the log force and
> > that's what causes the AIL problem. We don't push the iclogs, so the
> > items committed to the last iclog don't get dispatched and the items
> > never get unpinned because the checkpoint never completes.
> > 
> 
> So in general it sounds like this patch broke async log force by
> skipping the current iclog switch somehow or another and then this
> change is incorporated to fix it (hence there is no outstanding bug
> upstream). That makes a bit more sense. I'll get the detailed context
> from the patches.

Yes.

> > As to the "root cause of the hang" I now know what that is too, but
> > addressing it is outside the scope of this patchset. It's taken me a
> > long time to identify this, but I've been seeing symptoms of it for
> > quite a long time (i.e. years) but until doing this async CIL push
> > work I've never been able to identify the cause.
> > 
> > That is, the root cause is that the AIL cannot force a pinned buffer
> > to disk, and a log force (of any kind) does not guarantee a pinned
> > item is unpinned. That's because the CIL is pipelined, meaning that
> > an item can be pinned in one CIL context, and while that is being
> > pushed, the front end can relog that same item and pin it in the
> > current CIL context. Hence it has two pins, not one. Now when the
> > original log force completes, the pin from the first CIL context has
> > been removed, but the pin from the current CIL context is still
> > present.
> > 
> 
> Somewhat related to the above point, ISTM our current behavior can lead
> to scenarios where the potential window for elevated pin counts is
> somewhat artificially widened. IOW, the log vector for a pincount == 1
> item might have been written out to disk, but the log item itself not
> unpinned for some time later because the commit record is still sitting
> in an active iclog. The unpin processing of the item doesn't occur until
> that commit record iclog is synced out and I/O completes, which leaves
> plenty of time for another pin to come along in the meantime (and the
> scenario potentially repeats, starving out pincount == 0 state). So ISTM
> the "damage" is done to some degree long before the AIL has any
> opportunity to fix it, particularly if the only recourse it has is a log
> force.
> 
> (And of course there's a huge tradeoff between current behavior and
> something that unconditionally synced out commit record iclogs,
> particularly depending on CIL checkpoint sizes, so that's a broad
> simplification and not a valid solution.)

Yes, this can be an issue because when the single CIL work is CPU
bound, it is the first iclog write of the next checkpoint that pushes
the previous checkpoint's commit record to disk. So, like you say,
there's a natural
pin overlap in this situation.
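
Schematically:

	chkpt N:   [ iclog writes ... ][ commit record -> ACTIVE iclog ]
	chkpt N+1: [ first iclog write ]  <- this switches out N's
	                                     commit record iclog, so
	                                     items in N are not unpinned
	                                     until N+1 starts writing.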

> > Hence when we are relogging buffers over time (e.g. repeatedly
> > modifying a directory), the AIL always sees the buffer pinned and
> > cannot flush it. Even if the buffer has been moved to the delwri
> > list (i.e. had a successful push) we still can't write it out
> > because the delwri code skips pinned buffers.
> > 
> > This iterative pin continues preventing the AIL from flushing the
> > buffer until it pins the tail of the log. When the log runs out of
> > space and blocks the modifying/relogging task(s), the log force the
> > AIL issues allows the CIL to be pushed without the front end
> > relogging the item and re-pinning the item. At the completion of the
> > log force, the AIL can now push the unpinned buffer and hence move
> > the tail of the log forward which frees up log reservation space.
> > This then wakes the front end, which goes back to its relogging
> > game and the cycle repeats over.
> > 
> 
> The description of the nested pin behavior makes sense. I've observed
> the same thing recently when looking at some unrelated shutdown
> problems.
> 
> In any event, if I'm following correctly the deadlock issue was a
> transient problem with this patch and not an upstream problem. However
> there is still a pipeline stall/performance issue that manifests as
> described above, and that preexists the proposed changes. Yes?

Yes, AFAICT, this pin problem dates back to day zero. Before delayed
logging it required the inode or buffer to be relogged in almost
every iclog that was written (think a long running truncate removing
hundreds of thousands of extents). But before delayed logging, this
was very slow and limited by iclog IO so having the AIL stall would
have largely gone unnoticed.

> > This is really nasty behaviour, and it's only recently that I've got
> > a handle on it. I found it because my original "async CIL push" code
> > resulted in long stalls every time the log is filled and the tail is
> > pinned by a buffer that is being relogged in this manner....
> > 
> > I'm not sure how to fix this yet - the AIL needs to block the front
> > end relogging to allow the buffer to be unpinned. Essentially, we
> > need to hold the pinned items locked across a CIL push to guarantee
> > they are unpinned, but that's the complete opposite of what the AIL
> > currently does to prevent the front end from seeing long tail lock
> > latencies when modifying stuff....
> 
> When this stall problem manifests, I'm assuming it's exacerbated by
> delayed logging and the commit record behavior I described above. If
> that's the case, could the AIL communicate writeback pressure through
> affected log items such that checkpoints in which they are resident are
> flushed out completely/immediately when the checkpoints occur? I suppose
> that would require a log item flag or some such, which does raise a
> concern of unnecessarily tagging many items (it's not clear to me how
> likely that really is), but I'm curious if that would be an effective
> POC at least..

I don't think we need to do anything like that. All we need to do to
ensure that the AIL can flush a pinned buffer is to lock it, kick
the log and wait for the pin count to go to zero. Then we can write
it just fine, blocking only the front end transactions that need
that buffer lock.  Same goes for inodes, though xfs_iunpin_wait()
already does this....

The only issue is determining if this causes unacceptable long tail
latencies on buffers like the AG headers....
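
In sketch form, something like this (assuming we can use a buffer
side unpin wait similar to the xfs_buf_wait_unpin() helper the
buffer IO submission path already uses internally):

	xfs_buf_lock(bp);
	if (xfs_buf_ispinned(bp)) {
		xfs_log_force(mp, 0);		/* kick the log, don't wait */
		xfs_buf_wait_unpin(bp);		/* wait for pin count -> 0 */
	}
	/* buffer can now be written via the normal delwri submit path */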

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH 2/3] xfs: AIL needs asynchronous CIL forcing
  2021-03-01  4:54           ` Dave Chinner
@ 2021-03-01 13:32             ` Brian Foster
  2021-03-03  1:23               ` Dave Chinner
  0 siblings, 1 reply; 33+ messages in thread
From: Brian Foster @ 2021-03-01 13:32 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Mon, Mar 01, 2021 at 03:54:22PM +1100, Dave Chinner wrote:
> On Sat, Feb 27, 2021 at 11:25:06AM -0500, Brian Foster wrote:
> > On Fri, Feb 26, 2021 at 09:03:05AM +1100, Dave Chinner wrote:
> > > On Thu, Feb 25, 2021 at 08:12:12AM -0500, Brian Foster wrote:
> > > > On Thu, Feb 25, 2021 at 08:10:58AM +1100, Dave Chinner wrote:
> > > > > On Tue, Feb 23, 2021 at 04:32:11PM +1100, Dave Chinner wrote:
> > > The problem that I talk about is introduced by this patch and also
> > > fixed by this patch. The problem does not exist until we have code
> > > that depends on an async CIL push guaranteeing that items in the CIL
> > > are unpinned. A stock xfs_log_force_seq(0) call guarantees that the
> > > log reaches disk by waiting on the CIL push then pushing the iclogs.
> > > All it does is elide waiting for the final iclogs to hit the disk.
> > > 
> > > By eliding the "wait for CIL push" in an async xfs_log_force_seq()
> > > call, we also elide the "push iclogs" part of the log force and
> > > that's what causes the AIL problem. We don't push the iclogs, so the
> > > items committed to the last iclog don't get dispatched and the items
> > > never get unpinned because the checkpoint never completes.
> > > 
> > 
> > So in general it sounds like this patch broke async log force by
> > skipping the current iclog switch somehow or another and then this
> > change is incorporated to fix it (hence there is no outstanding bug
> > upstream). That makes a bit more sense. I'll get the detailed context
> > from the patches.
> 
> Yes.
> 
> > > As to the "root cause of the hang" I now know what that is too, but
> > > addressing it is outside the scope of this patchset. It's taken me a
> > > long time to identify this, but I've been seeing symptoms of it for
> > > quite a long time (i.e. years) but until doing this async CIL push
> > > work I've never been able to identify the cause.
> > > 
> > > That is, the root cause is that the AIL cannot force a pinned buffer
> > > to disk, and a log force (of any kind) does not guarantee a pinned
> > > item is unpinned. That's because the CIL is pipelined, meaning that
> > > an item can be pinned in one CIL context, and while that is being
> > > pushed, the front end can relog that same item and pin it in the
> > > current CIL context. Hence it has two pins, not one. Now when the
> > > original log force completes, the pin from the first CIL context has
> > > been removed, but the pin from the current CIL context is still
> > > present.
> > > 
> > 
> > Somewhat related to the above point, ISTM our current behavior can lead
> > to scenarios where the potential window for elevated pin counts is
> > somewhat artificially widened. IOW, the log vector for a pincount == 1
> > item might have been written out to disk, but the log item itself not
> > unpinned for some time later because the commit record is still sitting
> > in an active iclog. The unpin processing of the item doesn't occur until
> > that commit record iclog is synced out and I/O completes, which leaves
> > plenty of time for another pin to come along in the meantime (and the
> > scenario potentially repeats, starving out pincount == 0 state). So ISTM
> > the "damage" is done to some degree long before the AIL has any
> > opportunity to fix it, particularly if the only recourse it has is a log
> > force.
> > 
> > (And of course there's a huge tradeoff between current behavior and
> > something that unconditionally synced out commit record iclogs,
> > particularly depending on CIL checkpoint sizes, so that's a broad
> > simplification and not a valid solution.)
> 
> Yes, this can be an issue because when the single CIL work is CPU
> bound, it is the first iclog write of the next checkpoint that pushes
> the previous checkpoint's commit record to disk. So, like you say,
> there's a natural
> pin overlap in this situation.
> 
> > > Hence when we are relogging buffers over time (e.g. repeatedly
> > > modifying a directory), the AIL always sees the buffer pinned and
> > > cannot flush it. Even if the buffer has been moved to the delwri
> > > list (i.e. had a successful push) we still can't write it out
> > > because the delwri code skips pinned buffers.
> > > 
> > > This iterative pin continues preventing the AIL from flushing the
> > > buffer until it pins the tail of the log. When the log runs out of
> > > space and blocks the modifying/relogging task(s), the log force the
> > > AIL issues allows the CIL to be pushed without the front end
> > > relogging the item and re-pinning the item. At the completion of the
> > > log force, the AIL can now push the unpinned buffer and hence move
> > > the tail of the log forward which frees up log reservation space.
> > > This then wakes the front end, which goes back to its relogging
> > > game and the cycle repeats over.
> > > 
> > 
> > The description of the nested pin behavior makes sense. I've observed
> > the same thing recently when looking at some unrelated shutdown
> > problems.
> > 
> > In any event, if I'm following correctly the deadlock issue was a
> > transient problem with this patch and not an upstream problem. However
> > there is still a pipeline stall/performance issue that manifests as
> > described above, and that preexists the proposed changes. Yes?
> 
> Yes, AFAICT, this pin problem dates back to day zero. Before delayed
> logging it required the inode or buffer to be relogged in almost
> every iclog that was written (think a long running truncate removing
> hundreds of thousands of extents). But before delayed logging, this
> was very slow and limited by iclog IO so having the AIL stall would
> have largely gone unnoticed.
> 
> > > This is really nasty behaviour, and it's only recently that I've got
> > > a handle on it. I found it because my original "async CIL push" code
> > > resulted in long stalls every time the log is filled and the tail is
> > > pinned by a buffer that is being relogged in this manner....
> > > 
> > > I'm not sure how to fix this yet - the AIL needs to block the front
> > > end relogging to allow the buffer to be unpinned. Essentially, we
> > > need to hold the pinned items locked across a CIL push to guarantee
> > > they are unpinned, but that's the complete opposite of what the AIL
> > > currently does to prevent the front end from seeing long tail lock
> > > latencies when modifying stuff....
> > 
> > When this stall problem manifests, I'm assuming it's exacerbated by
> > delayed logging and the commit record behavior I described above. If
> > that's the case, could the AIL communicate writeback pressure through
> > affected log items such that checkpoints in which they are resident are
> > flushed out completely/immediately when the checkpoints occur? I suppose
> > that would require a log item flag or some such, which does raise a
> > concern of unnecessarily tagging many items (it's not clear to me how
> > likely that really is), but I'm curious if that would be an effective
> > POC at least..
> 
> I don't think we need to do anything like that. All we need to do to
> ensure that the AIL can flush a pinned buffer is to lock it, kick
> the log and wait for the pin count to go to zero. Then we can write
> it just fine, blocking only the front end transactions that need
> that buffer lock.  Same goes for inodes, though xfs_iunpin_wait()
> already does this....
> 

Yeah, but why would we want to block xfsaild on a single item like that?
Wouldn't holding the item locked like that just create a new stall point
within xfsaild? Maybe I'm missing something, but what you describe here
basically just sounds like a "lock the item and run a sync log force"
pattern.

I agree that the implementation I handwave about above seems overkill,
but I'm really curious what the disadvantage of some form of logic in
the logging code to do something like "if this checkpoint contains items
that are AIL resident && under some measure of writeback pressure && the
next iclog is active, want_sync the commit record iclog on background
push to reduce item pin count times" might be.
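
As a strawman, something like this, where both predicates are
entirely hypothetical:

	/*
	 * Background CIL push, after the commit record is written: if
	 * this checkpoint carries items the AIL is applying writeback
	 * pressure to and the next iclog is idle, kick the commit
	 * record iclog out now instead of leaving it ACTIVE.
	 */
	if (cil_ctx_has_ail_pressure(ctx) &&		/* hypothetical */
	    xlog_next_iclog_is_active(log))		/* hypothetical */
		xlog_state_switch_iclogs(log, commit_iclog, 0);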

Brian

> The only issue is determining if this causes unacceptable long tail
> latencies on buffers like the AG headers....
> 
> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
> 


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH 1/3] xfs: xfs_log_force_lsn isn't passed a LSN
  2021-02-23  5:32 ` [PATCH 1/3] xfs: xfs_log_force_lsn isn't passed a LSN Dave Chinner
  2021-02-24 21:42   ` Darrick J. Wong
@ 2021-03-02 18:12   ` Brian Foster
  1 sibling, 0 replies; 33+ messages in thread
From: Brian Foster @ 2021-03-02 18:12 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Tue, Feb 23, 2021 at 04:32:10PM +1100, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> In doing an investigation into AIL push stalls, I was looking at the
> log force code to see if an async CIL push could be done instead.
> This led me to xfs_log_force_lsn() and looking at how it works.
> 
> xfs_log_force_lsn() is only called from inode synchronisation
> contexts such as fsync(), and it takes the ip->i_itemp->ili_last_lsn
> value as the LSN to sync the log to. This gets passed to
> xlog_cil_force_lsn() via xfs_log_force_lsn() to flush the CIL to the
> journal, and then used by xfs_log_force_lsn() to flush the iclogs to
> the journal.
> 

As you point out just below, this ->ili_last_lsn value is not used by
xfs_log_force_lsn(). It's passed to the CIL force and that potentially
returns a LSN for the log force call.

> The problem with this is that ip->i_itemp->ili_last_lsn does not store a
> log sequence number. What it stores is passed to it from the
> ->iop_committing method, which is called by xfs_log_commit_cil().
> The value this passes to the iop_committing method is the CIL
> context sequence number that the item was committed to.
> 

Ok, but what is the "problem?" It sounds like this patch is cleaning
things up vs. fixing a problem...

> As it turns out, xlog_cil_force_lsn() converts the sequence to an
> actual commit LSN for the related context and returns that to
> xfs_log_force_lsn(). xfs_log_force_lsn() overwrites its "lsn"
> variable that contained a sequence with an actual LSN and then uses
> that to sync the iclogs.
> 
> This caused me some confusion for a while, even though I originally
> wrote all this code a decade ago. ->iop_committing is only used by
> a couple of log item types, and only inode items use the sequence
> number it is passed.
> 
> Let's clean up the API, CIL structures and inode log item to call it
> a sequence number, and make it clear that the high level code is
> using CIL sequence numbers and not on-disk LSNs for integrity
> synchronisation purposes.
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---
>  fs/xfs/xfs_buf_item.c   |  2 +-
>  fs/xfs/xfs_dquot_item.c |  2 +-
>  fs/xfs/xfs_file.c       | 14 +++++++-------
>  fs/xfs/xfs_inode.c      | 10 +++++-----
>  fs/xfs/xfs_inode_item.c |  4 ++--
>  fs/xfs/xfs_inode_item.h |  2 +-
>  fs/xfs/xfs_log.c        | 27 ++++++++++++++-------------
>  fs/xfs/xfs_log.h        |  4 +---
>  fs/xfs/xfs_log_cil.c    | 22 +++++++++-------------
>  fs/xfs/xfs_log_priv.h   | 15 +++++++--------
>  fs/xfs/xfs_trans.c      |  6 +++---
>  fs/xfs/xfs_trans.h      |  4 ++--
>  12 files changed, 53 insertions(+), 59 deletions(-)
> 
...
> diff --git a/fs/xfs/xfs_log.c b/fs/xfs/xfs_log.c
> index 08d68a6161ae..84cd9b6c6d1f 100644
> --- a/fs/xfs/xfs_log.c
> +++ b/fs/xfs/xfs_log.c
...
> @@ -3415,25 +3412,29 @@ __xfs_log_force_lsn(
>   * to disk, that thread will wake up all threads waiting on the queue.
>   */
>  int
> -xfs_log_force_lsn(
> +xfs_log_force_seq(
>  	struct xfs_mount	*mp,
> -	xfs_lsn_t		lsn,
> +	uint64_t		seq,
>  	uint			flags,
>  	int			*log_flushed)
>  {
> +	struct xlog		*log = mp->m_log;
> +	xfs_lsn_t		lsn;
>  	int			ret;
> -	ASSERT(lsn != 0);
> +	ASSERT(seq != 0);
>  
>  	XFS_STATS_INC(mp, xs_log_force);
> -	trace_xfs_log_force(mp, lsn, _RET_IP_);
> +	trace_xfs_log_force(mp, seq, _RET_IP_);
>  
> -	lsn = xlog_cil_force_lsn(mp->m_log, lsn);
> +	lsn = xlog_cil_force_seq(log, seq);
>  	if (lsn == NULLCOMMITLSN)
>  		return 0;
>  
> -	ret = __xfs_log_force_lsn(mp, lsn, flags, log_flushed, false);
> -	if (ret == -EAGAIN)
> -		ret = __xfs_log_force_lsn(mp, lsn, flags, log_flushed, true);
> +	ret = xlog_force_lsn(log, lsn, flags, log_flushed, false);
> +	if (ret == -EAGAIN) {
> +		XFS_STATS_INC(mp, xs_log_force_sleep);

This looks misplaced to me here. I'd just leave it where we do the sleep
and pass log->l_mp. That nit and the slightly misleading commit log
aside, the rest LGTM and seems like a nice cleanup:

Reviewed-by: Brian Foster <bfoster@redhat.com>

> +		ret = xlog_force_lsn(log, lsn, flags, log_flushed, true);
> +	}
>  	return ret;
>  }
>  
> diff --git a/fs/xfs/xfs_log.h b/fs/xfs/xfs_log.h
> index 044e02cb8921..33ae53401060 100644
> --- a/fs/xfs/xfs_log.h
> +++ b/fs/xfs/xfs_log.h
> @@ -106,7 +106,7 @@ struct xfs_item_ops;
>  struct xfs_trans;
>  
>  int	  xfs_log_force(struct xfs_mount *mp, uint flags);
> -int	  xfs_log_force_lsn(struct xfs_mount *mp, xfs_lsn_t lsn, uint flags,
> +int	  xfs_log_force_seq(struct xfs_mount *mp, uint64_t seq, uint flags,
>  		int *log_forced);
>  int	  xfs_log_mount(struct xfs_mount	*mp,
>  			struct xfs_buftarg	*log_target,
> @@ -132,8 +132,6 @@ bool	xfs_log_writable(struct xfs_mount *mp);
>  struct xlog_ticket *xfs_log_ticket_get(struct xlog_ticket *ticket);
>  void	  xfs_log_ticket_put(struct xlog_ticket *ticket);
>  
> -void	xfs_log_commit_cil(struct xfs_mount *mp, struct xfs_trans *tp,
> -				xfs_lsn_t *commit_lsn, bool regrant);
>  void	xlog_cil_process_committed(struct list_head *list);
>  bool	xfs_log_item_in_current_chkpt(struct xfs_log_item *lip);
>  
> diff --git a/fs/xfs/xfs_log_cil.c b/fs/xfs/xfs_log_cil.c
> index 0a00c3c9610c..2fda8c4513b2 100644
> --- a/fs/xfs/xfs_log_cil.c
> +++ b/fs/xfs/xfs_log_cil.c
> @@ -792,7 +792,7 @@ xlog_cil_push_work(
>  	 * that higher sequences will wait for us to write out a commit record
>  	 * before they do.
>  	 *
> -	 * xfs_log_force_lsn requires us to mirror the new sequence into the cil
> +	 * xfs_log_force_seq requires us to mirror the new sequence into the cil
>  	 * structure atomically with the addition of this sequence to the
>  	 * committing list. This also ensures that we can do unlocked checks
>  	 * against the current sequence in log forces without risking
> @@ -1054,16 +1054,14 @@ xlog_cil_empty(
>   * allowed again.
>   */
>  void
> -xfs_log_commit_cil(
> -	struct xfs_mount	*mp,
> +xlog_cil_commit(
> +	struct xlog		*log,
>  	struct xfs_trans	*tp,
> -	xfs_lsn_t		*commit_lsn,
> +	uint64_t		*commit_seq,
>  	bool			regrant)
>  {
> -	struct xlog		*log = mp->m_log;
>  	struct xfs_cil		*cil = log->l_cilp;
>  	struct xfs_log_item	*lip, *next;
> -	xfs_lsn_t		xc_commit_lsn;
>  
>  	/*
>  	 * Do all necessary memory allocation before we lock the CIL.
> @@ -1077,10 +1075,6 @@ xfs_log_commit_cil(
>  
>  	xlog_cil_insert_items(log, tp);
>  
> -	xc_commit_lsn = cil->xc_ctx->sequence;
> -	if (commit_lsn)
> -		*commit_lsn = xc_commit_lsn;
> -
>  	if (regrant && !XLOG_FORCED_SHUTDOWN(log))
>  		xfs_log_ticket_regrant(log, tp->t_ticket);
>  	else
> @@ -1103,8 +1097,10 @@ xfs_log_commit_cil(
>  	list_for_each_entry_safe(lip, next, &tp->t_items, li_trans) {
>  		xfs_trans_del_item(lip);
>  		if (lip->li_ops->iop_committing)
> -			lip->li_ops->iop_committing(lip, xc_commit_lsn);
> +			lip->li_ops->iop_committing(lip, cil->xc_ctx->sequence);
>  	}
> +	if (commit_seq)
> +		*commit_seq = cil->xc_ctx->sequence;
>  
>  	/* xlog_cil_push_background() releases cil->xc_ctx_lock */
>  	xlog_cil_push_background(log);
> @@ -1121,9 +1117,9 @@ xfs_log_commit_cil(
>   * iclog flush is necessary following this call.
>   */
>  xfs_lsn_t
> -xlog_cil_force_lsn(
> +xlog_cil_force_seq(
>  	struct xlog	*log,
> -	xfs_lsn_t	sequence)
> +	uint64_t	sequence)
>  {
>  	struct xfs_cil		*cil = log->l_cilp;
>  	struct xfs_cil_ctx	*ctx;
> diff --git a/fs/xfs/xfs_log_priv.h b/fs/xfs/xfs_log_priv.h
> index 24acdc54e44e..59778cd5ecdd 100644
> --- a/fs/xfs/xfs_log_priv.h
> +++ b/fs/xfs/xfs_log_priv.h
> @@ -234,7 +234,7 @@ struct xfs_cil;
>  
>  struct xfs_cil_ctx {
>  	struct xfs_cil		*cil;
> -	xfs_lsn_t		sequence;	/* chkpt sequence # */
> +	uint64_t		sequence;	/* chkpt sequence # */
>  	xfs_lsn_t		start_lsn;	/* first LSN of chkpt commit */
>  	xfs_lsn_t		commit_lsn;	/* chkpt commit record lsn */
>  	struct xlog_ticket	*ticket;	/* chkpt ticket */
> @@ -272,10 +272,10 @@ struct xfs_cil {
>  	struct xfs_cil_ctx	*xc_ctx;
>  
>  	spinlock_t		xc_push_lock ____cacheline_aligned_in_smp;
> -	xfs_lsn_t		xc_push_seq;
> +	uint64_t		xc_push_seq;
>  	struct list_head	xc_committing;
>  	wait_queue_head_t	xc_commit_wait;
> -	xfs_lsn_t		xc_current_sequence;
> +	uint64_t		xc_current_sequence;
>  	struct work_struct	xc_push_work;
>  	wait_queue_head_t	xc_push_wait;	/* background push throttle */
>  } ____cacheline_aligned_in_smp;
> @@ -552,19 +552,18 @@ int	xlog_cil_init(struct xlog *log);
>  void	xlog_cil_init_post_recovery(struct xlog *log);
>  void	xlog_cil_destroy(struct xlog *log);
>  bool	xlog_cil_empty(struct xlog *log);
> +void	xlog_cil_commit(struct xlog *log, struct xfs_trans *tp,
> +			uint64_t *commit_seq, bool regrant);
>  
>  /*
>   * CIL force routines
>   */
> -xfs_lsn_t
> -xlog_cil_force_lsn(
> -	struct xlog *log,
> -	xfs_lsn_t sequence);
> +xfs_lsn_t xlog_cil_force_seq(struct xlog *log, uint64_t sequence);
>  
>  static inline void
>  xlog_cil_force(struct xlog *log)
>  {
> -	xlog_cil_force_lsn(log, log->l_cilp->xc_current_sequence);
> +	xlog_cil_force_seq(log, log->l_cilp->xc_current_sequence);
>  }
>  
>  /*
> diff --git a/fs/xfs/xfs_trans.c b/fs/xfs/xfs_trans.c
> index 44f72c09c203..697703f3be48 100644
> --- a/fs/xfs/xfs_trans.c
> +++ b/fs/xfs/xfs_trans.c
> @@ -849,7 +849,7 @@ __xfs_trans_commit(
>  	bool			regrant)
>  {
>  	struct xfs_mount	*mp = tp->t_mountp;
> -	xfs_lsn_t		commit_lsn = -1;
> +	uint64_t		commit_seq = 0;
>  	int			error = 0;
>  	int			sync = tp->t_flags & XFS_TRANS_SYNC;
>  
> @@ -891,7 +891,7 @@ __xfs_trans_commit(
>  		xfs_trans_apply_sb_deltas(tp);
>  	xfs_trans_apply_dquot_deltas(tp);
>  
> -	xfs_log_commit_cil(mp, tp, &commit_lsn, regrant);
> +	xlog_cil_commit(mp->m_log, tp, &commit_seq, regrant);
>  
>  	current_restore_flags_nested(&tp->t_pflags, PF_MEMALLOC_NOFS);
>  	xfs_trans_free(tp);
> @@ -901,7 +901,7 @@ __xfs_trans_commit(
>  	 * log out now and wait for it.
>  	 */
>  	if (sync) {
> -		error = xfs_log_force_lsn(mp, commit_lsn, XFS_LOG_SYNC, NULL);
> +		error = xfs_log_force_seq(mp, commit_seq, XFS_LOG_SYNC, NULL);
>  		XFS_STATS_INC(mp, xs_trans_sync);
>  	} else {
>  		XFS_STATS_INC(mp, xs_trans_async);
> diff --git a/fs/xfs/xfs_trans.h b/fs/xfs/xfs_trans.h
> index 8b03fbfe9a1b..d223d4f4e429 100644
> --- a/fs/xfs/xfs_trans.h
> +++ b/fs/xfs/xfs_trans.h
> @@ -43,7 +43,7 @@ struct xfs_log_item {
>  	struct list_head		li_cil;		/* CIL pointers */
>  	struct xfs_log_vec		*li_lv;		/* active log vector */
>  	struct xfs_log_vec		*li_lv_shadow;	/* standby vector */
> -	xfs_lsn_t			li_seq;		/* CIL commit seq */
> +	uint64_t			li_seq;		/* CIL commit seq */
>  };
>  
>  /*
> @@ -69,7 +69,7 @@ struct xfs_item_ops {
>  	void (*iop_pin)(struct xfs_log_item *);
>  	void (*iop_unpin)(struct xfs_log_item *, int remove);
>  	uint (*iop_push)(struct xfs_log_item *, struct list_head *);
> -	void (*iop_committing)(struct xfs_log_item *, xfs_lsn_t commit_lsn);
> +	void (*iop_committing)(struct xfs_log_item *lip, uint64_t seq);
>  	void (*iop_release)(struct xfs_log_item *);
>  	xfs_lsn_t (*iop_committed)(struct xfs_log_item *, xfs_lsn_t);
>  	int (*iop_recover)(struct xfs_log_item *lip,
> -- 
> 2.28.0
> 


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH 2/3 v2] xfs: AIL needs asynchronous CIL forcing
  2021-02-24 23:26     ` [PATCH 2/3 v2] " Dave Chinner
  2021-02-25  2:15       ` Dave Chinner
@ 2021-03-02 21:44       ` Brian Foster
  2021-03-03  0:57         ` Dave Chinner
  1 sibling, 1 reply; 33+ messages in thread
From: Brian Foster @ 2021-03-02 21:44 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Thu, Feb 25, 2021 at 10:26:00AM +1100, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> The AIL pushing is stalling on log forces when it comes across
> pinned items. This is happening on removal workloads where the AIL
> is dominated by stale items that are removed from the AIL when the
> checkpoint that marks the items stale is committed to the journal.
> This results in relatively few items in the AIL, but those that
> remain are often pinned, as the directories the items are being
> removed from are still being logged.
> 
> As a result, many push cycles through the CIL will first issue a
> blocking log force to unpin the items. This can take some time to
> complete, with tracing regularly showing push delays of half a
> second and sometimes up into the range of several seconds. Sequences
> like this aren't uncommon:
> 
...
> 
> In each of these cases, every AIL pass saw 101 log items stuck on
> the AIL (pinned) with very few other items being found. Each pass, a
> log force was issued, and delay between last/first is the sleep time
> + the sync log force time.
> 
> Some of these 101 items pinned the tail of the log. The tail of the
> log does slowly creep forward (first lsn), but the problem is that
> the log is actually out of reservation space because it's been
> running so many transactions that create stale items which never
> reach the AIL but still consume log space. Hence we have a largely
> empty AIL, with
> long term pins on items that pin the tail of the log that don't get
> pushed frequently enough to keep log space available.
> 
> The problem is the hundreds of milliseconds that we block in the log
> force pushing the CIL out to disk. The AIL should not be stalled
> like this - it needs to run and flush items that are at the tail of
> the log with minimal latency. What we really need to do is trigger a
> log flush, but then not wait for it at all - we've already done our
> waiting for stuff to complete when we backed off prior to the log
> force being issued.
> 
> Even if we remove the XFS_LOG_SYNC from the xfs_log_force() call, we
> still do a blocking flush of the CIL and that is what is causing the
> issue. Hence we need a new interface for the CIL to trigger an
> immediate background push of the CIL to get it moving faster but not
> to wait on that to occur. While the CIL is pushing, the AIL can also
> be pushing.
> 

So I think I follow what you're describing wrt the xfsaild delay, but
what I'm not clear on is what changing this behavior actually fixes.
IOW, if the xfsaild is cycling through and finding mostly stuck items,
then we see a delay because of the time spent in a log force (because of
those stuck items), what's the advantage to issue an async force over a
sync force? Won't xfsaild effectively continue to spin on these stuck
items until the log force completes, regardless?

Is the issue not so much the throughput of xfsaild, but just the fact
that the task stalls are large enough to cause other problems? I.e., the
comment you add down in the xfsaild code refers to tail pinning causing
reservation stalls. IOW, the purpose of this patch is then to keep
xfsaild sleeps to a predictable latency to ensure we come around again
quickly enough to keep the tail moving forward (even though the AIL
itself might remain in a state where the majority of items are stuck).
Yes?

> We already have an internal interface to do this -
> xlog_cil_push_now() - but we need a wrapper for it to be used
> externally. xlog_cil_force_seq() can easily be extended to do what
> we need as it already implements the synchronous CIL push via
> xlog_cil_push_now(). Add the necessary flags and "push current
> sequence" semantics to xlog_cil_force_seq() and convert the AIL
> pushing to use it.
> 
> One of the complexities here is that the CIL push does not guarantee
> that the commit record for the CIL checkpoint is written to disk.
> The current log force ensures this by submitting the current ACTIVE
> iclog that the commit record was written to. We need the CIL to
> actually write this commit record to disk for an async push to
> ensure that the checkpoint actually makes it to disk and unpins the
> pinned items in the checkpoint on completion. Hence we need to pass
> down to the CIL push that we are doing an async flush so that it can
> switch out the commit_iclog if necessary to get written to disk when
> the commit iclog is finally released.
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---
> Version 2:
> - ensure the CIL checkpoint issues the commit record to disk for an
>   async push. Fixes generic/530 hang on small logs.
> - increment log force stats when the CIL is forced and also when it
>   sleeps to give insight into the amount of blocking being done when
>   the CIL is forced.
> 
>  fs/xfs/xfs_log.c       | 59 +++++++++++++++++++---------------------------
>  fs/xfs/xfs_log.h       |  2 +-
>  fs/xfs/xfs_log_cil.c   | 64 +++++++++++++++++++++++++++++++++++++++++---------
>  fs/xfs/xfs_log_priv.h  | 10 ++++++--
>  fs/xfs/xfs_sysfs.c     |  1 +
>  fs/xfs/xfs_trace.c     |  1 +
>  fs/xfs/xfs_trans.c     |  2 +-
>  fs/xfs/xfs_trans_ail.c | 11 ++++++---
>  8 files changed, 97 insertions(+), 53 deletions(-)
> 
...
> diff --git a/fs/xfs/xfs_log_cil.c b/fs/xfs/xfs_log_cil.c
> index 2fda8c4513b2..1b46559ef64a 100644
> --- a/fs/xfs/xfs_log_cil.c
> +++ b/fs/xfs/xfs_log_cil.c
...
> @@ -1127,13 +1159,22 @@ xlog_cil_force_seq(
>  
>  	ASSERT(sequence <= cil->xc_current_sequence);
>  
> +	if (!sequence)
> +		sequence = cil->xc_current_sequence;
> +trace_printk("seq %llu, curseq %llu, ctxseq %llu pushseq %llu flags 0x%x",
> +		sequence, cil->xc_current_sequence, cil->xc_ctx->sequence,
> +		cil->xc_push_seq, flags);
> +
> +	trace_xfs_log_force(log->l_mp, sequence, _RET_IP_);
>  	/*
>  	 * check to see if we need to force out the current context.
>  	 * xlog_cil_push() handles racing pushes for the same sequence,
>  	 * so no need to deal with it here.
>  	 */
>  restart:
> -	xlog_cil_push_now(log, sequence);
> +	xlog_cil_push_now(log, sequence, flags & XFS_LOG_SYNC);
> +	if (!(flags & XFS_LOG_SYNC))
> +		return commit_lsn;

Hm, so now we have sync and async log force and sync and async CIL push.
A log force always requires a sync CIL push and commit record submit;
the difference is simply whether or not we wait on commit record I/O
completion. The sync CIL push drains the CIL of log items but does not
switch out the commit record iclog, while the async CIL push executes in
the workqueue context so the drain is async, but it does switch out the
commit record iclog before it completes. IOW, the async CIL push is
basically a "more async" async log force.
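
Laying that out as a matrix (my reading of the code, so correct me
if I have it wrong):

	                  CIL push       commit record iclog handling
	sync log force    sync, waited   switched out by log force code,
	                                 IO completion waited on
	async log force   sync, waited   switched out by log force code,
	                                 IO completion not waited on
	async CIL push    async (wq)     switched out by the push work
	                                 itself, nothing waited on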

I can see the need for the behavior of the async CIL push here, but that
leaves a mess of interfaces and behavior matrix. Is there any reason we
couldn't just make async log forces unconditionally behave equivalent to
the async CIL push as defined by this patch? There's only a handful of
existing users and I don't see any obvious reason why they might care
whether the underlying CIL push is synchronous or not...

Brian

>  
>  	/*
>  	 * See if we can find a previous sequence still committing.
> @@ -1157,6 +1198,7 @@ xlog_cil_force_seq(
>  			 * It is still being pushed! Wait for the push to
>  			 * complete, then start again from the beginning.
>  			 */
> +			XFS_STATS_INC(log->l_mp, xs_log_force_sleep);
>  			xlog_wait(&cil->xc_commit_wait, &cil->xc_push_lock);
>  			goto restart;
>  		}
> diff --git a/fs/xfs/xfs_log_priv.h b/fs/xfs/xfs_log_priv.h
> index 7e8ec77bc6e6..6b1a251dd013 100644
> --- a/fs/xfs/xfs_log_priv.h
> +++ b/fs/xfs/xfs_log_priv.h
> @@ -273,6 +273,7 @@ struct xfs_cil {
>  
>  	spinlock_t		xc_push_lock ____cacheline_aligned_in_smp;
>  	uint64_t		xc_push_seq;
> +	bool			xc_push_async;
>  	struct list_head	xc_committing;
>  	wait_queue_head_t	xc_commit_wait;
>  	uint64_t		xc_current_sequence;
> @@ -487,6 +488,10 @@ int	xlog_write(struct xlog *log, struct xfs_log_vec *log_vector,
>  		struct xlog_in_core **commit_iclog, uint optype);
>  int	xlog_commit_record(struct xlog *log, struct xlog_ticket *ticket,
>  		struct xlog_in_core **iclog, xfs_lsn_t *lsn);
> +void	xlog_state_switch_iclogs(struct xlog *log, struct xlog_in_core *iclog,
> +		int eventual_size);
> +int	xlog_state_release_iclog(struct xlog *xlog, struct xlog_in_core *iclog);
> +
>  void	xfs_log_ticket_ungrant(struct xlog *log, struct xlog_ticket *ticket);
>  void	xfs_log_ticket_regrant(struct xlog *log, struct xlog_ticket *ticket);
>  
> @@ -558,12 +563,13 @@ void	xlog_cil_commit(struct xlog *log, struct xfs_trans *tp,
>  /*
>   * CIL force routines
>   */
> -xfs_lsn_t xlog_cil_force_seq(struct xlog *log, uint64_t sequence);
> +xfs_lsn_t xlog_cil_force_seq(struct xlog *log, uint32_t flags,
> +				uint64_t sequence);
>  
>  static inline void
>  xlog_cil_force(struct xlog *log)
>  {
> -	xlog_cil_force_seq(log, log->l_cilp->xc_current_sequence);
> +	xlog_cil_force_seq(log, XFS_LOG_SYNC, log->l_cilp->xc_current_sequence);
>  }
>  
>  /*
> diff --git a/fs/xfs/xfs_sysfs.c b/fs/xfs/xfs_sysfs.c
> index f1bc88f4367c..18dc5eca6c04 100644
> --- a/fs/xfs/xfs_sysfs.c
> +++ b/fs/xfs/xfs_sysfs.c
> @@ -10,6 +10,7 @@
>  #include "xfs_log_format.h"
>  #include "xfs_trans_resv.h"
>  #include "xfs_sysfs.h"
> +#include "xfs_log.h"
>  #include "xfs_log_priv.h"
>  #include "xfs_mount.h"
>  
> diff --git a/fs/xfs/xfs_trace.c b/fs/xfs/xfs_trace.c
> index 9b8d703dc9fd..d111a994b7b6 100644
> --- a/fs/xfs/xfs_trace.c
> +++ b/fs/xfs/xfs_trace.c
> @@ -20,6 +20,7 @@
>  #include "xfs_bmap.h"
>  #include "xfs_attr.h"
>  #include "xfs_trans.h"
> +#include "xfs_log.h"
>  #include "xfs_log_priv.h"
>  #include "xfs_buf_item.h"
>  #include "xfs_quota.h"
> diff --git a/fs/xfs/xfs_trans.c b/fs/xfs/xfs_trans.c
> index 697703f3be48..7d05694681e3 100644
> --- a/fs/xfs/xfs_trans.c
> +++ b/fs/xfs/xfs_trans.c
> @@ -9,7 +9,6 @@
>  #include "xfs_shared.h"
>  #include "xfs_format.h"
>  #include "xfs_log_format.h"
> -#include "xfs_log_priv.h"
>  #include "xfs_trans_resv.h"
>  #include "xfs_mount.h"
>  #include "xfs_extent_busy.h"
> @@ -17,6 +16,7 @@
>  #include "xfs_trans.h"
>  #include "xfs_trans_priv.h"
>  #include "xfs_log.h"
> +#include "xfs_log_priv.h"
>  #include "xfs_trace.h"
>  #include "xfs_error.h"
>  #include "xfs_defer.h"
> diff --git a/fs/xfs/xfs_trans_ail.c b/fs/xfs/xfs_trans_ail.c
> index dbb69b4bf3ed..dfc0206c0d36 100644
> --- a/fs/xfs/xfs_trans_ail.c
> +++ b/fs/xfs/xfs_trans_ail.c
> @@ -17,6 +17,7 @@
>  #include "xfs_errortag.h"
>  #include "xfs_error.h"
>  #include "xfs_log.h"
> +#include "xfs_log_priv.h"
>  
>  #ifdef DEBUG
>  /*
> @@ -429,8 +430,12 @@ xfsaild_push(
>  
>  	/*
>  	 * If we encountered pinned items or did not finish writing out all
> -	 * buffers the last time we ran, force the log first and wait for it
> -	 * before pushing again.
> +	 * buffers the last time we ran, force a background CIL push to get the
> +	 * items unpinned in the near future. We do not wait on the CIL push as
> +	 * that could stall us for seconds if there is enough background IO
> +	 * load. Stalling for that long when the tail of the log is pinned and
> +	 * needs flushing will hard stop the transaction subsystem when log
> +	 * space runs out.
>  	 */
>  	if (ailp->ail_log_flush && ailp->ail_last_pushed_lsn == 0 &&
>  	    (!list_empty_careful(&ailp->ail_buf_list) ||
> @@ -438,7 +443,7 @@ xfsaild_push(
>  		ailp->ail_log_flush = 0;
>  
>  		XFS_STATS_INC(mp, xs_push_ail_flush);
> -		xfs_log_force(mp, XFS_LOG_SYNC);
> +		xlog_cil_force_seq(mp->m_log, 0, 0);
>  	}
>  
>  	spin_lock(&ailp->ail_lock);
> 


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH 2/3 v2] xfs: AIL needs asynchronous CIL forcing
  2021-03-02 21:44       ` Brian Foster
@ 2021-03-03  0:57         ` Dave Chinner
  2021-03-03 17:32           ` Brian Foster
  0 siblings, 1 reply; 33+ messages in thread
From: Dave Chinner @ 2021-03-03  0:57 UTC (permalink / raw)
  To: Brian Foster; +Cc: linux-xfs

On Tue, Mar 02, 2021 at 04:44:12PM -0500, Brian Foster wrote:
> On Thu, Feb 25, 2021 at 10:26:00AM +1100, Dave Chinner wrote:
> > From: Dave Chinner <dchinner@redhat.com>
> > 
> > The AIL pushing is stalling on log forces when it comes across
> > pinned items. This is happening on removal workloads where the AIL
> > is dominated by stale items that are removed from the AIL when the
> > checkpoint that marks the items stale is committed to the journal.
> > This results in relatively few items in the AIL, but those that
> > remain are often pinned, as the directories the items are being
> > removed from are still being logged.
> > 
> > As a result, many push cycles through the CIL will first issue a
> > blocking log force to unpin the items. This can take some time to
> > complete, with tracing regularly showing push delays of half a
> > second and sometimes up into the range of several seconds. Sequences
> > like this aren't uncommon:
> > 
> ...
> > 
> > In each of these cases, every AIL pass saw 101 log items stuck on
> > the AIL (pinned) with very few other items being found. Each pass, a
> > log force was issued, and delay between last/first is the sleep time
> > + the sync log force time.
> > 
> > Some of these 101 items pinned the tail of the log. The tail of the
> > log does slowly creep forward (first lsn), but the problem is that
> > the log is actually out of reservation space because it's been
> > running so many transactions that create stale items which never
> > reach the AIL but still consume log space. Hence we have a largely
> > empty AIL, with
> > long term pins on items that pin the tail of the log that don't get
> > pushed frequently enough to keep log space available.
> > 
> > The problem is the hundreds of milliseconds that we block in the log
> > force pushing the CIL out to disk. The AIL should not be stalled
> > like this - it needs to run and flush items that are at the tail of
> > the log with minimal latency. What we really need to do is trigger a
> > log flush, but then not wait for it at all - we've already done our
> > waiting for stuff to complete when we backed off prior to the log
> > force being issued.
> > 
> > Even if we remove the XFS_LOG_SYNC from the xfs_log_force() call, we
> > still do a blocking flush of the CIL and that is what is causing the
> > issue. Hence we need a new interface for the CIL to trigger an
> > immediate background push of the CIL to get it moving faster but not
> > to wait on that to occur. While the CIL is pushing, the AIL can also
> > be pushing.
> > 
> 
> So I think I follow what you're describing wrt the xfsaild delay, but
> what I'm not clear on is what changing this behavior actually fixes.
> IOW, if the xfsaild is cycling through and finding mostly stuck items,
> then we see a delay because of the time spent in a log force (because of
> those stuck items), what's the advantage to issue an async force over a
> sync force? Won't xfsaild effectively continue to spin on these stuck
> items until the log force completes, regardless?

When buffers get flushed and then pinned and can't be written back,
we issue sync log forces from the AIL because the delwri list is not
empty and the force count is non-zero.

So it's not just pinned items in the AIL that trigger this - it's
stuff that gets pinned after it has been flushed but before it gets
written that causes problems, too.

And if it's pinned buffers in the delwri list that are the issue,
then we stop flushing other items from the list while we wait for
the log force. Hence if the tail-pinning items are not actually
pinned, but other buffers are, then we block on the log force when
we could have been flushing and writing the items that pin the tail
of the log.
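
The delwri submit loop is roughly this shape (a heavily trimmed
sketch of the non-waiting path of xfs_buf_delwri_submit_buffers(),
from memory):

	list_for_each_entry_safe(bp, n, buffer_list, b_list) {
		if (xfs_buf_ispinned(bp))
			continue;	/* left queued, not written */
		if (!xfs_buf_trylock(bp))
			continue;
		/* ... clear _XBF_DELWRI_Q and submit bp for write ... */
	}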

> Is the issue not so much the throughput of xfsaild, but just the fact
> that the task stalls are large enough to cause other problems? I.e., the
> comment you add down in the xfsaild code refers to tail pinning causing
> reservation stalls.  IOW, the purpose of this patch is then to keep
> xfsaild sleeps to a predictable latency to ensure we come around again
> quickly enough to keep the tail moving forward (even though the AIL
> itself might remain in a state where the majority of items are stuck).
> Yes?

Yup, largely because there are multiple triggers for log forces. 
There's a difference between buffers that can't be flushed because
they are pinned and flushed buffers that can't be written because
they are pinned. The former prevents the AIL from making push progress,
the latter doesn't.

> > +trace_printk("seq %llu, curseq %llu, ctxseq %llu pushseq %llu flags 0x%x",
> > +		sequence, cil->xc_current_sequence, cil->xc_ctx->sequence,
> > +		cil->xc_push_seq, flags);
> > +
> > +	trace_xfs_log_force(log->l_mp, sequence, _RET_IP_);
> >  	/*
> >  	 * check to see if we need to force out the current context.
> >  	 * xlog_cil_push() handles racing pushes for the same sequence,
> >  	 * so no need to deal with it here.
> >  	 */
> >  restart:
> > -	xlog_cil_push_now(log, sequence);
> > +	xlog_cil_push_now(log, sequence, flags & XFS_LOG_SYNC);
> > +	if (!(flags & XFS_LOG_SYNC))
> > +		return commit_lsn;
> 
> Hm, so now we have sync and async log force and sync and async CIL push.
> A log force always requires a sync CIL push and commit record submit;
> the difference is simply whether or not we wait on commit record I/O
> completion. The sync CIL push drains the CIL of log items but does not
> switch out the commit record iclog, while the async CIL push executes in
> the workqueue context so the drain is async, but it does switch out the
> commit record iclog before it completes. IOW, the async CIL push is
> basically a "more async" async log force.

Yes.

> I can see the need for the behavior of the async CIL push here, but that
> leaves a mess of interfaces and behavior matrix. Is there any reason we
> couldn't just make async log forces unconditionally behave equivalent to
> the async CIL push as defined by this patch? There's only a handful of
> existing users and I don't see any obvious reason why they might care
> whether the underlying CIL push is synchronous or not...

I'm not touching the rest of the log force code in this series - it
is out of scope of this bug fix and the rest of the series that it
is part of.

Cleaning up the mess that is the xfs_log_* and xlog_* interfaces and
changing things like log force behaviour and implementation is for
a future series.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH 2/3] xfs: AIL needs asynchronous CIL forcing
  2021-03-01 13:32             ` Brian Foster
@ 2021-03-03  1:23               ` Dave Chinner
  2021-03-03 17:20                 ` Brian Foster
  0 siblings, 1 reply; 33+ messages in thread
From: Dave Chinner @ 2021-03-03  1:23 UTC (permalink / raw)
  To: Brian Foster; +Cc: linux-xfs

On Mon, Mar 01, 2021 at 08:32:18AM -0500, Brian Foster wrote:
> On Mon, Mar 01, 2021 at 03:54:22PM +1100, Dave Chinner wrote:
> > On Sat, Feb 27, 2021 at 11:25:06AM -0500, Brian Foster wrote:
> > > On Fri, Feb 26, 2021 at 09:03:05AM +1100, Dave Chinner wrote:
> > > > This is really nasty behaviour, and it's only recently that I've got
> > > > a handle on it. I found it because my original "async CIL push" code
> > > > resulted in long stalls every time the log is filled and the tail is
> > > > pinned by a buffer that is being relogged in this manner....
> > > > 
> > > > I'm not sure how to fix this yet - the AIL needs to block the front
> > > > end relogging to allow the buffer to be unpinned. Essentially, we
> > > > need to hold the pinned items locked across a CIL push to guarantee
> > > > they are unpinned, but that's the complete opposite of what the AIL
> > > > currently does to prevent the front end from seeing long tail lock
> > > > latencies when modifying stuff....
> > > 
> > > When this stall problem manifests, I'm assuming it's exacerbated by
> > > delayed logging and the commit record behavior I described above. If
> > > that's the case, could the AIL communicate writeback pressure through
> > > affected log items such that checkpoints in which they are resident are
> > > flushed out completely/immediately when the checkpoints occur? I suppose
> > > that would require a log item flag or some such, which does raise a
> > > concern of unnecessarily tagging many items (it's not clear to me how
> > > likely that really is), but I'm curious if that would be an effective
> > > POC at least..
> > 
> > I don't think we need to do anything like that. All we need to do to
> > ensure that the AIL can flush a pinned buffer is to lock it, kick
> > the log and wait for the pin count to go to zero. Then we can write
> > it just fine, blocking only the front end transactions that need
> > that buffer lock.  Same goes for inodes, though xfs_iunpin_wait()
> > already does this....
> > 
> 
> Yeah, but why would we want to block xfsaild on a single item like that?

Who said anything about blocking the AIL on a single item? :)

> Wouldn't holding the item locked like that just create a new stall point
> within xfsaild? Maybe I'm missing something, but what you describe here
> basically just sounds like a "lock the item and run a sync log force"
> pattern.

What I was thinking is that if the item is pinned and at the
tail of the log, then we leave it locked when we flush it rather
than unlocking it and relocking it when the delwri submit code gets
to it. If it gets unpinned before the delwri code gets to it, all
good. If not, the delwri code being unable to flush it will feed
back up into the AIL to trigger a log force, which will unpin it
in the near future and it will be written on the next AIL delwri
submit cycle.

The thing we need to be careful of here is minimising the lock hold
time by the AIL while we unpin the item. That's the "long tail
latencies" problem I mention above. Essentially, we need to try to
avoid holding the item locked for a long period of time before
issuing the log force and/or resubmitting it for IO once it is
unlocked. I have a few ideas on this, but no patches yet.
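
To make that concrete, the push path could end up looking something
like the sketch below. Every helper name in it is invented purely to
illustrate the flow - there is no patch behind this yet:

	/*
	 * Hypothetical xfsaild item push. If a pinned item pins the
	 * tail of the log, flush it but keep it locked so the front
	 * end cannot relog and repin it, then kick an async CIL push
	 * to get it unpinned. The delwri submit writes it once the
	 * pin count drops to zero.
	 */
	if (!xfs_item_trylock(lip))
		return XFS_ITEM_LOCKED;		/* retry next cycle */

	if (xfs_item_pinned(lip) && xfs_item_pins_tail(ailp, lip)) {
		xfs_item_flush(lip);			/* stays locked */
		xlog_cil_push_async(ailp->ail_log);	/* non-blocking */
		return XFS_ITEM_FLUSHING;
	}

	/* normal path: flush, unlock and queue for delwri submission */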

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH 2/3] xfs: AIL needs asynchronous CIL forcing
  2021-03-03  1:23               ` Dave Chinner
@ 2021-03-03 17:20                 ` Brian Foster
  2021-03-04  2:01                   ` Dave Chinner
  0 siblings, 1 reply; 33+ messages in thread
From: Brian Foster @ 2021-03-03 17:20 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Wed, Mar 03, 2021 at 12:23:04PM +1100, Dave Chinner wrote:
> On Mon, Mar 01, 2021 at 08:32:18AM -0500, Brian Foster wrote:
> > On Mon, Mar 01, 2021 at 03:54:22PM +1100, Dave Chinner wrote:
> > > On Sat, Feb 27, 2021 at 11:25:06AM -0500, Brian Foster wrote:
> > > > On Fri, Feb 26, 2021 at 09:03:05AM +1100, Dave Chinner wrote:
> > > > > This is really nasty behaviour, and it's only recently that I've got
> > > > > a handle on it. I found it because my original "async CIL push" code
> > > > > resulted in long stalls every time the log is filled and the tail is
> > > > > pinned by a buffer that is being relogged in this manner....
> > > > > 
> > > > > I'm not sure how to fix this yet - the AIL needs to block the front
> > > > > end relogging to allow the buffer to be unpinned. Essentially, we
> > > > > need to hold the pinned items locked across a CIL push to guarantee
> > > > > they are unpinned, but that's the complete opposite of what the AIL
> > > > > currently does to prevent the front end from seeing long tail lock
> > > > > latencies when modifying stuff....
> > > > 
> > > > When this stall problem manifests, I'm assuming it's exacerbated by
> > > > delayed logging and the commit record behavior I described above. If
> > > > that's the case, could the AIL communicate writeback pressure through
> > > > affected log items such that checkpoints in which they are resident are
> > > > flushed out completely/immediately when the checkpoints occur? I suppose
> > > > that would require a log item flag or some such, which does raise a
> > > > concern of unnecessarily tagging many items (it's not clear to me how
> > > > likely that really is), but I'm curious if that would be an effective
> > > > POC at least..
> > > 
> > > I don't think we need to do anything like that. All we need to do to
> > > ensure that the AIL can flush a pinned buffer is to lock it, kick
> > > the log and wait for the pin count to go to zero. Then we can write
> > > it just fine, blocking only the front end transactions that need
> > > that buffer lock.  Same goes for inodes, though xfs_iunpin_wait()
> > > already does this....
> > > 
> > 
> > Yeah, but why would we want to block xfsaild on a single item like that?
> 
> Who said anything about blocking the AIL on a single item? :)
> 
> > Wouldn't holding the item locked like that just create a new stall point
> > within xfsaild? Maybe I'm missing something, but what you describe here
> > basically just sounds like a "lock the item and run a sync log force"
> > pattern.
> 
> What I was thinking is that if the item is pinned and at the
> tail of the log, then we leave it locked when we flush it rather
> than unlocking it and relocking it when the delwri submit code gets
> to it. If it gets unpinned before the delwri code gets to it, all
> good. If not, the delwri code being unable to flush it will feed
> back up into the AIL to trigger a log force, which will unpin it
> in the near future and it will be written on the next AIL delwri
> submit cycle.
> 

I'm not sure what you mean by leaving the item locked when we flush it
if it is pinned, since we don't flush pinned items. Perhaps implied is
that this would trylock the buffer first, then only worry about whether
it's pinned once we have the lock. If so (and the pin is at the tail), hold
the lock and kick the log as a means to ensure that xfsaild is
guaranteed next access to the buffer once unpinned. Hm?

IIUC, that seems interesting. Though as noted in the flush optimization
series, I am a little leery about issuing (even async) log forces with
locks held, at least with the current implementation..

Brian

> The thing we need to be careful of here is minimising the lock hold
> time by the AIL while we unpin the item. That's the "long tail
> latencies" problem I mention above. Essentially, we need to try to
> avoid holding the item locked for a long period of time before
> issuing the log force and/or resubmitting it for IO once it is
> unlocked. I have a few ideas on this, but no patches yet.
> 
> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
> 


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH 2/3 v2] xfs: AIL needs asynchronous CIL forcing
  2021-03-03  0:57         ` Dave Chinner
@ 2021-03-03 17:32           ` Brian Foster
  2021-03-04  1:59             ` Dave Chinner
  0 siblings, 1 reply; 33+ messages in thread
From: Brian Foster @ 2021-03-03 17:32 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Wed, Mar 03, 2021 at 11:57:52AM +1100, Dave Chinner wrote:
> On Tue, Mar 02, 2021 at 04:44:12PM -0500, Brian Foster wrote:
> > On Thu, Feb 25, 2021 at 10:26:00AM +1100, Dave Chinner wrote:
> > > From: Dave Chinner <dchinner@redhat.com>
> > > 
> > > The AIL pushing is stalling on log forces when it comes across
> > > pinned items. This is happening on removal workloads where the AIL
> > > is dominated by stale items that are removed from AIL when the
> > > checkpoint that marks the items stale is committed to the journal.
> > > This results is relatively few items in the AIL, but those that are
> > > are often pinned as directories items are being removed from are
> > > still being logged.
> > > 
> > > As a result, many push cycles through the CIL will first issue a
> > > blocking log force to unpin the items. This can take some time to
> > > complete, with tracing regularly showing push delays of half a
> > > second and sometimes up into the range of several seconds. Sequences
> > > like this aren't uncommon:
> > > 
> > ...
> > > 
> > > In each of these cases, every AIL pass saw 101 log items stuck on
> > > the AIL (pinned) with very few other items being found. Each pass, a
> > > log force was issued, and the delay between last/first is the
> > > sleep time + the sync log force time.
> > > 
> > > Some of these 101 items pinned the tail of the log. The tail of the
> > > log does slowly creep forward (first lsn), but the problem is that
> > > the log is actually out of reservation space because it's been
> > > running so many transactions that generate stale items that never
> > > reach the AIL but still consume log space.
> > > long term pins on items that pin the tail of the log that don't get
> > > pushed frequently enough to keep log space available.
> > > 
> > > The problem is the hundreds of milliseconds that we block in the log
> > > force pushing the CIL out to disk. The AIL should not be stalled
> > > like this - it needs to run and flush items that are at the tail of
> > > the log with minimal latency. What we really need to do is trigger a
> > > log flush, but then not wait for it at all - we've already done our
> > > waiting for stuff to complete when we backed off prior to the log
> > > force being issued.
> > > 
> > > Even if we remove the XFS_LOG_SYNC from the xfs_log_force() call, we
> > > still do a blocking flush of the CIL and that is what is causing the
> > > issue. Hence we need a new interface for the CIL to trigger an
> > > immediate background push of the CIL to get it moving faster but not
> > > to wait on that to occur. While the CIL is pushing, the AIL can also
> > > be pushing.
> > > 
> > 
> So I think I follow what you're describing wrt the xfsaild delay, but
> > what I'm not clear on is what changing this behavior actually fixes.
> > IOW, if the xfsaild is cycling through and finding mostly stuck items,
> > then we see a delay because of the time spent in a log force (because of
> > those stuck items), what's the advantage to issue an async force over a
> > sync force? Won't xfsaild effectively continue to spin on these stuck
> > items until the log force completes, regardless?
> 
> When buffers get flushed and then pinned and can't be written back,
> we issue sync log forces from the AIL because the delwri list is not
> empty and the force count is non-zero.
> 
> So it's not just pinned items in the AIL that trigger this - it's
> stuff that gets pinned after it has been flushed but before it gets
> written that causes problems, too.
> 
> And if it's pinned buffers in the delwri list that are the issue,
> then we stop flushing other items from the list while we wait for
> the log force. Hence if the tail pinning items are not actually
> pinned, but other buffers are, then we block on the log force when
> we could have been flushing and writing the items that pin the tail
> of the log.
> 
> > Is the issue not so much the throughput of xfsaild, but just the fact
> > that the task stalls are large enough to cause other problems? I.e., the
> > comment you add down in the xfsaild code refers to tail pinning causing
> > reservation stalls.  IOW, the purpose of this patch is then to keep
> > xfsaild sleeps to a predictable latency to ensure we come around again
> > quickly enough to keep the tail moving forward (even though the AIL
> > itself might remain in a state where the majority of items are stuck).
> > Yes?
> 
> Yup, largely because there are multiple triggers for log forces. 
> There's a difference between buffers that can't be flushed because
> they are pinned and flushed buffers that can't be written because
> they are pinned. The former prevents the AIL from making push progress;
> the latter doesn't.
> 
> > > +trace_printk("seq %llu, curseq %llu, ctxseq %llu pushseq %llu flags 0x%x",
> > > +		sequence, cil->xc_current_sequence, cil->xc_ctx->sequence,
> > > +		cil->xc_push_seq, flags);
> > > +
> > > +	trace_xfs_log_force(log->l_mp, sequence, _RET_IP_);
> > >  	/*
> > >  	 * check to see if we need to force out the current context.
> > >  	 * xlog_cil_push() handles racing pushes for the same sequence,
> > >  	 * so no need to deal with it here.
> > >  	 */
> > >  restart:
> > > -	xlog_cil_push_now(log, sequence);
> > > +	xlog_cil_push_now(log, sequence, flags & XFS_LOG_SYNC);
> > > +	if (!(flags & XFS_LOG_SYNC))
> > > +		return commit_lsn;
> > 
> > Hm, so now we have sync and async log force and sync and async CIL push.
> > A log force always requires a sync CIL push and commit record submit;
> > the difference is simply whether or not we wait on commit record I/O
> > completion. The sync CIL push drains the CIL of log items but does not
> > switch out the commit record iclog, while the async CIL push executes in
> > the workqueue context so the drain is async, but it does switch out the
> > commit record iclog before it completes. IOW, the async CIL push is
> > basically a "more async" async log force.
> 
> Yes.
> 
> > I can see the need for the behavior of the async CIL push here, but that
> > leaves a mess of interfaces and behavior matrix. Is there any reason we
> > couldn't just make async log forces unconditionally behave equivalent to
> > the async CIL push as defined by this patch? There's only a handful of
> > existing users and I don't see any obvious reason why they might care
> > whether the underlying CIL push is synchronous or not...
> 
> I'm not touching the rest of the log force code in this series - it
> is out of scope of this bug fix and the rest of the series that it
> is part of.
> 

But you already have altered the log force code by changing
xlog_cil_force_seq(), which implicitly changes how xfs_log_force_seq()
behaves. So not only does this patch expose subsystem internals to
external layers and create more log forcing interfaces/behaviors to
maintain, it invokes one or the other from the preexisting async log
force variants purely based on whether the caller had a target sequence
number or not. This doesn't make a whole lot of sense to me.

> Cleaning up the mess that is the xfs_log_* and xlog_* interfaces and
> changing things like log force behaviour and implementation is for
> a future series.
> 

TBH I think this patch is kind of a mess on its own. I think the
mechanism it wants to provide is sane, but I've not even got to the
point of reviewing _that_ yet because of the seeming dismissal of higher
level feedback. I'd rather not go around in circles on this so I'll just
offer my summarized feedback to this patch...

I think this patch should be broken down into two, possibly three
sub-patches that generally do the following (not necessarily in this
order):

1. Implement the new underlying async CIL push mechanism (i.e. the
   ->xc_push_async bits and new commit iclog switching behavior).
2. Change the behavior of existing async log forces to use the new
   mechanism consistently. Document the change in behavior for historical
   reference.
3. Switch xfsaild to use an async log force instead of the current sync
   log force. Document why this might depend on the updated async
   behavior.

Further log subsystem cleanup or consolidation that this might
facilitate between the historical async log force and the newer CIL
based async log force can occur separately in future patches..

Brian

> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
> 


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH 2/3 v2] xfs: AIL needs asynchronous CIL forcing
  2021-03-03 17:32           ` Brian Foster
@ 2021-03-04  1:59             ` Dave Chinner
  2021-03-04 13:13               ` Brian Foster
  0 siblings, 1 reply; 33+ messages in thread
From: Dave Chinner @ 2021-03-04  1:59 UTC (permalink / raw)
  To: Brian Foster; +Cc: linux-xfs

On Wed, Mar 03, 2021 at 12:32:39PM -0500, Brian Foster wrote:
> On Wed, Mar 03, 2021 at 11:57:52AM +1100, Dave Chinner wrote:
> > On Tue, Mar 02, 2021 at 04:44:12PM -0500, Brian Foster wrote:
> > > On Thu, Feb 25, 2021 at 10:26:00AM +1100, Dave Chinner wrote:
> > > >  	 * xlog_cil_push() handles racing pushes for the same sequence,
> > > >  	 * so no need to deal with it here.
> > > >  	 */
> > > >  restart:
> > > > -	xlog_cil_push_now(log, sequence);
> > > > +	xlog_cil_push_now(log, sequence, flags & XFS_LOG_SYNC);
> > > > +	if (!(flags & XFS_LOG_SYNC))
> > > > +		return commit_lsn;
> > > 
> > > Hm, so now we have sync and async log force and sync and async CIL push.
> > > A log force always requires a sync CIL push and commit record submit;
> > > the difference is simply whether or not we wait on commit record I/O
> > > completion. The sync CIL push drains the CIL of log items but does not
> > > switch out the commit record iclog, while the async CIL push executes in
> > > the workqueue context so the drain is async, but it does switch out the
> > > commit record iclog before it completes. IOW, the async CIL push is
> > > basically a "more async" async log force.
> > 
> > Yes.
> > 
> > > I can see the need for the behavior of the async CIL push here, but that
> > > leaves a mess of interfaces and behavior matrix. Is there any reason we
> > > couldn't just make async log forces unconditionally behave equivalent to
> > > the async CIL push as defined by this patch? There's only a handful of
> > > existing users and I don't see any obvious reason why they might care
> > > whether the underlying CIL push is synchronous or not...
> > 
> > I'm not touching the rest of the log force code in this series - it
> > is out of scope of this bug fix and the rest of the series that it
> > is part of.
> > 
> 
> But you already have altered the log force code by changing
> xlog_cil_force_seq(), which implicitly changes how xfs_log_force_seq()
> behaves.

The behaviour of the function when called from xfs_log_force*()
should be unchanged. Can you be specific about exactly what
behaviour I changed for non-synchronous xfs_log_force*() callers
so I can fix it? I did not intend to change it at all, so
whatever you are referring to is an issue I need to fix...

> So not only does this patch expose subsystem internals to
> external layers and create more log forcing interfaces/behaviors to

Sorry, I don't follow. What "subsystem internals" are being exposed
and what external layer are they being exposed to?

> > Cleaning up the mess that is the xfs_log_* and xlog_* interfaces and
> > changing things like log force behaviour and implementation is for
> > a future series.
> > 
> 
> TBH I think this patch is kind of a mess on its own. I think the
> mechanism it wants to provide is sane, but I've not even got to the
> point of reviewing _that_ yet because of the seeming dismissal of higher
> level feedback. I'd rather not go around in circles on this so I'll just
> offer my summarized feedback to this patch...

I'm not dismissing review nor am I saying the API cannot or should
not be improved. I'm simply telling you that I think the additional
changes you are proposing are outside the scope of the problem I am
addressing here. I already plan to rework the log force API (and
others) but doing so is not something that this patchset needs to
address, or indeed should address.

There are already enough subtle changes being made to core code and
algorithms that mixing them with unrelated high level external
behavioural changes greatly increases the risk of unexpected
regressions in the patchset. The log force paths are used in
data integrity paths, so I want to limit the scope of behavioural
change to just the AIL, where the log force has no data integrity
requirement associated with it.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH 2/3] xfs: AIL needs asynchronous CIL forcing
  2021-03-03 17:20                 ` Brian Foster
@ 2021-03-04  2:01                   ` Dave Chinner
  0 siblings, 0 replies; 33+ messages in thread
From: Dave Chinner @ 2021-03-04  2:01 UTC (permalink / raw)
  To: Brian Foster; +Cc: linux-xfs

On Wed, Mar 03, 2021 at 12:20:19PM -0500, Brian Foster wrote:
> On Wed, Mar 03, 2021 at 12:23:04PM +1100, Dave Chinner wrote:
> > On Mon, Mar 01, 2021 at 08:32:18AM -0500, Brian Foster wrote:
> > > On Mon, Mar 01, 2021 at 03:54:22PM +1100, Dave Chinner wrote:
> > > > On Sat, Feb 27, 2021 at 11:25:06AM -0500, Brian Foster wrote:
> > > > > On Fri, Feb 26, 2021 at 09:03:05AM +1100, Dave Chinner wrote:
> > > > > > This is really nasty behaviour, and it's only recently that I've got
> > > > > > a handle on it. I found it because my original "async CIL push" code
> > > > > > resulted in long stalls every time the log is filled and the tail is
> > > > > > pinned by a buffer that is being relogged in this manner....
> > > > > > 
> > > > > > I'm not sure how to fix this yet - the AIL needs to block the front
> > > > > > end relogging to allow the buffer to be unpinned. Essentially, we
> > > > > > need to hold the pinned items locked across a CIL push to guarantee
> > > > > > they are unpinned, but that's the complete opposite of what the AIL
> > > > > > currently does to prevent the front end from seeing long tail lock
> > > > > > latencies when modifying stuff....
> > > > > 
> > > > > When this stall problem manifests, I'm assuming it's exacerbated by
> > > > > delayed logging and the commit record behavior I described above. If
> > > > > that's the case, could the AIL communicate writeback pressure through
> > > > > affected log items such that checkpoints in which they are resident are
> > > > > flushed out completely/immediately when the checkpoints occur? I suppose
> > > > > that would require a log item flag or some such, which does raise a
> > > > > concern of unnecessarily tagging many items (it's not clear to me how
> > > > > likely that really is), but I'm curious if that would be an effective
> > > > > POC at least..
> > > > 
> > > > I don't think we need to do anything like that. All we need to do to
> > > > ensure that the AIL can flush a pinned buffer is to lock it, kick
> > > > the log and wait for the pin count to go to zero. Then we can write
> > > > it just fine, blocking only the front end transactions that need
> > > > that buffer lock.  Same goes for inodes, though xfs_iunpin_wait()
> > > > already does this....
> > > > 
> > > 
> > > Yeah, but why would we want to block xfsaild on a single item like that?
> > 
> > Who said anything about blocking the AIL on a single item? :)
> > 
> > > Wouldn't holding the item locked like that just create a new stall point
> > > within xfsaild? Maybe I'm missing something, but what you describe here
> > > basically just sounds like a "lock the item and run a sync log force"
> > > pattern.
> > 
> > What I was thinking is that if the item is pinned and at the
> > tail of the log, then we leave it locked when we flush it rather
> > than unlocking it and relocking it when the delwri submit code gets
> > to it. If it gets unpinned before the delwri code gets to it, all
> > good. If not, the delwri code being unable to flush it will feed
> > back up into the AIL to trigger a log force, which will unpin it
> > in the near future and it will be written on the next AIL delwri
> > submit cycle.
> > 
> 
> I'm not sure what you mean by leaving the item locked when we flush it
> if it is pinned, since we don't flush pinned items.

That's exactly what I'm talking about changing.

-Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH 2/3 v2] xfs: AIL needs asynchronous CIL forcing
  2021-03-04  1:59             ` Dave Chinner
@ 2021-03-04 13:13               ` Brian Foster
  2021-03-04 22:48                 ` Dave Chinner
  0 siblings, 1 reply; 33+ messages in thread
From: Brian Foster @ 2021-03-04 13:13 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Thu, Mar 04, 2021 at 12:59:33PM +1100, Dave Chinner wrote:
> On Wed, Mar 03, 2021 at 12:32:39PM -0500, Brian Foster wrote:
> > On Wed, Mar 03, 2021 at 11:57:52AM +1100, Dave Chinner wrote:
> > > On Tue, Mar 02, 2021 at 04:44:12PM -0500, Brian Foster wrote:
> > > > On Thu, Feb 25, 2021 at 10:26:00AM +1100, Dave Chinner wrote:
> > > > >  	 * xlog_cil_push() handles racing pushes for the same sequence,
> > > > >  	 * so no need to deal with it here.
> > > > >  	 */
> > > > >  restart:
> > > > > -	xlog_cil_push_now(log, sequence);
> > > > > +	xlog_cil_push_now(log, sequence, flags & XFS_LOG_SYNC);
> > > > > +	if (!(flags & XFS_LOG_SYNC))
> > > > > +		return commit_lsn;
> > > > 
> > > > Hm, so now we have sync and async log force and sync and async CIL push.
> > > > A log force always requires a sync CIL push and commit record submit;
> > > > the difference is simply whether or not we wait on commit record I/O
> > > > completion. The sync CIL push drains the CIL of log items but does not
> > > > switch out the commit record iclog, while the async CIL push executes in
> > > > the workqueue context so the drain is async, but it does switch out the
> > > > commit record iclog before it completes. IOW, the async CIL push is
> > > > basically a "more async" async log force.
> > > 
> > > Yes.
> > > 
> > > > I can see the need for the behavior of the async CIL push here, but that
> > > > leaves a mess of interfaces and behavior matrix. Is there any reason we
> > > > couldn't just make async log forces unconditionally behave equivalent to
> > > > the async CIL push as defined by this patch? There's only a handful of
> > > > existing users and I don't see any obvious reason why they might care
> > > > whether the underlying CIL push is synchronous or not...
> > > 
> > > I'm not touching the rest of the log force code in this series - it
> > > is out of scope of this bug fix and the rest of the series that it
> > > is part of.
> > > 
> > 
> > But you already have altered the log force code by changing
> > xlog_cil_force_seq(), which implicitly changes how xfs_log_force_seq()
> > behaves.
> 
> The behaviour of the function when called from xfs_log_force*()
> should be unchanged. Can you be specific about exactly what
> behaviour I changed for non-synchronous xfs_log_force*() callers
> so I can fix it? I did not intend to change it at all, so
> whatever you are referring to is an issue I need to fix...
> 

xfs_log_force_seq() passes flags from the caller to xlog_cil_force_seq()
(whereas this patch presumably wants it to pass XFS_LOG_SYNC
unconditionally). IOW, xfs_log_force(mp, 0) behavior is different from
xfs_log_force_seq(mp, seq, 0, ...).
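
IOW, I'd have expected the call site to end up something like the
sketch below (argument order approximated from the posted hunks; the
comments are mine):

	/* in xfs_log_force_seq(): push the CIL synchronously... */
	lsn = xlog_cil_force_seq(log, seq, XFS_LOG_SYNC);

	/*
	 * ...and only apply the caller's XFS_LOG_SYNC to the wait on
	 * the commit record iclog that follows, matching what
	 * xfs_log_force() does.
	 */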

> > So not only does this patch expose subsystem internals to
> > external layers and create more log forcing interfaces/behaviors to
> 
> Sorry, I don't follow. What "subsystem internals" are being exposed
> and what external layer are they being exposed to?
> 
> > > Cleaning up the mess that is the xfs_log_* and xlog_* interfaces and
> > > changing things like log force behaviour and implementation is for
> > > a future series.
> > > 
> > 
> > TBH I think this patch is kind of a mess on its own. I think the
> > mechanism it wants to provide is sane, but I've not even got to the
> > point of reviewing _that_ yet because of the seeming dismissal of higher
> > level feedback. I'd rather not go around in circles on this so I'll just
> > offer my summarized feedback to this patch...
> 
> I'm not dismissing review nor am I saying the API cannot or should
> not be improved. I'm simply telling you that I think the additional
> changes you are proposing are outside the scope of the problem I am
> addressing here. I already plan to rework the log force API (and
> others) but doing so is not something that this patchset needs to
> address, or indeed should address.
> 

I'm not proposing additional changes nor to rework the log force API.
I'm pointing out that I find this implementation to be extremely and
unnecessarily confusing. To improve it, I'm suggesting to either coopt
the existing async log force API...

> There are already enough subtle changes being made to core code and
> algorithms that mixing them with unrelated high level external
> behavioural changes that it greatly increases the risk of unexpected
> regressions in the patchset. The log force are paths are used in
> data integrity paths, so I want to limit the scope of behavioural
> change to just the AIL where the log force has no data integrity
> requirement associcated with it.
> 

... or if we really want a special async log force just for xfsaild (why
is still not clear to me), then tie it to an XFS_LOG_REALLY_ASYNC flag
or some such, pass that to the existing log force call, and document the
purpose/behavior of the new mode in detail. That at least won't require
a developer to wonder what !(flags & XFS_LOG_SYNC) happens to mean
depending on the particular log force function.

Brian

> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
> 


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH 2/3 v2] xfs: AIL needs asynchronous CIL forcing
  2021-03-04 13:13               ` Brian Foster
@ 2021-03-04 22:48                 ` Dave Chinner
  2021-03-05 14:58                   ` Brian Foster
  0 siblings, 1 reply; 33+ messages in thread
From: Dave Chinner @ 2021-03-04 22:48 UTC (permalink / raw)
  To: Brian Foster; +Cc: linux-xfs

On Thu, Mar 04, 2021 at 08:13:07AM -0500, Brian Foster wrote:
> On Thu, Mar 04, 2021 at 12:59:33PM +1100, Dave Chinner wrote:
> > On Wed, Mar 03, 2021 at 12:32:39PM -0500, Brian Foster wrote:
> > > On Wed, Mar 03, 2021 at 11:57:52AM +1100, Dave Chinner wrote:
> > > > On Tue, Mar 02, 2021 at 04:44:12PM -0500, Brian Foster wrote:
> > > > > On Thu, Feb 25, 2021 at 10:26:00AM +1100, Dave Chinner wrote:
> > > > > >  	 * xlog_cil_push() handles racing pushes for the same sequence,
> > > > > >  	 * so no need to deal with it here.
> > > > > >  	 */
> > > > > >  restart:
> > > > > > -	xlog_cil_push_now(log, sequence);
> > > > > > +	xlog_cil_push_now(log, sequence, flags & XFS_LOG_SYNC);
> > > > > > +	if (!(flags & XFS_LOG_SYNC))
> > > > > > +		return commit_lsn;
> > > > > 
> > > > > Hm, so now we have sync and async log force and sync and async CIL push.
> > > > > A log force always requires a sync CIL push and commit record submit;
> > > > > the difference is simply whether or not we wait on commit record I/O
> > > > > completion. The sync CIL push drains the CIL of log items but does not
> > > > > switch out the commit record iclog, while the async CIL push executes in
> > > > > the workqueue context so the drain is async, but it does switch out the
> > > > > commit record iclog before it completes. IOW, the async CIL push is
> > > > > basically a "more async" async log force.
> > > > 
> > > > Yes.
> > > > 
> > > > > I can see the need for the behavior of the async CIL push here, but that
> > > > > leaves a mess of interfaces and behavior matrix. Is there any reason we
> > > > > couldn't just make async log forces unconditionally behave equivalent to
> > > > > the async CIL push as defined by this patch? There's only a handful of
> > > > > existing users and I don't see any obvious reason why they might care
> > > > > whether the underlying CIL push is synchronous or not...
> > > > 
> > > > I'm not touching the rest of the log force code in this series - it
> > > > is out of scope of this bug fix and the rest of the series that it
> > > > is part of.
> > > > 
> > > 
> > > But you already have altered the log force code by changing
> > > xlog_cil_force_seq(), which implicitly changes how xfs_log_force_seq()
> > > behaves.
> > 
> > The behaviour of the function when called from xfs_log_force*()
> > should be unchanged. Can you be specific about exactly what
> > behaviour I changed for non-synchronous xfs_log_force*() callers
> > so I can fix it? I did not intend to change it at all, so
> > whatever you are referring to is an issue I need to fix...
> > 
> 
> xfs_log_force_seq() passes flags from the caller to xlog_cil_force_seq()
> (whereas this patch presumably wants it to pass XFS_LOG_SYNC
> unconditionally). IOW, xfs_log_force(mp, 0) behavior is different from
> xfs_log_force_seq(mp, seq, 0, ...).

Fixed.

> > > So not only does this patch expose subsystem internals to
> > > external layers and create more log forcing interfaces/behaviors to
> > 
> > Sorry, I don't follow. What "subsystem internals" are being exposed
> > and what external layer are they being exposed to?
> > 
> > > > Cleaning up the mess that is the xfs_log_* and xlog_* interfaces and
> > > > changing things like log force behaviour and implementation is for
> > > > a future series.
> > > > 
> > > 
> > > TBH I think this patch is kind of a mess on its own. I think the
> > > mechanism it wants to provide is sane, but I've not even got to the
> > > point of reviewing _that_ yet because of the seeming dismissal of higher
> > > level feedback. I'd rather not go around in circles on this so I'll just
> > > offer my summarized feedback to this patch...
> > 
> > I'm not dismissing review nor am I saying the API cannot or should
> > not be improved. I'm simply telling you that I think the additional
> > changes you are proposing are outside the scope of the problem I am
> > addressing here. I already plan to rework the log force API (and
> > others) but doing so is not something that this patchset needs to
> > address, or indeed should address.
> > 
> 
> I'm not proposing additional changes nor to rework the log force API.
> I'm pointing out that I find this implementation to be extremely and
> unnecessarily confusing.

And that's just fine - not everyone has to understand all of the
code all of the time. That's why we have a team....

> To improve it, I'm suggesting to either coopt
> the existing async log force API...

And I'm telling you that this is *future work*. As the person
actually doing the work, that's my decision and you need to accept
that. I understand your concerns and have a plan to address them -
you just have to accept that the plan to address them isn't going to
be exactly to your liking.

> > There are already enough subtle changes being made to core code and
> > algorithms that mixing them with unrelated high level external
> > behavioural changes greatly increases the risk of unexpected
> > regressions in the patchset. The log force paths are used in
> > data integrity paths, so I want to limit the scope of behavioural
> > change to just the AIL, where the log force has no data integrity
> > requirement associated with it.
> > 
> 
> ... or if we really want a special async log force just for xfsaild (why
> is still not clear to me), then tie it to an XFS_LOG_REALLY_ASYNC flag
> or some such, pass that to the existing log force call, and document the
> purpose/behavior of the new mode in detail. That at least won't require
> a developer to wonder what !(flags & XFS_LOG_SYNC) happens to mean
> depending on the particular log force function.

No. That's just obfuscation. And it is extremely likely that the
first thing Christoph will ask me to do with a one-shot flag that just
calls out to a single function is to get rid of the flag and call
the function directly.
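
i.e. rather than the AIL doing:

	xfs_log_force(mp, XFS_LOG_REALLY_ASYNC);

we'd end up with it calling the mechanism directly, something like
(name illustrative only):

	/* trigger a background CIL push and return immediately */
	xlog_cil_flush(mp->m_log);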

This will get cleaned up. Just not in this patchset.

-Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH 2/3 v2] xfs: AIL needs asynchronous CIL forcing
  2021-03-04 22:48                 ` Dave Chinner
@ 2021-03-05 14:58                   ` Brian Foster
  2021-03-09  0:44                     ` Dave Chinner
  0 siblings, 1 reply; 33+ messages in thread
From: Brian Foster @ 2021-03-05 14:58 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Fri, Mar 05, 2021 at 09:48:48AM +1100, Dave Chinner wrote:
> On Thu, Mar 04, 2021 at 08:13:07AM -0500, Brian Foster wrote:
> > On Thu, Mar 04, 2021 at 12:59:33PM +1100, Dave Chinner wrote:
> > > On Wed, Mar 03, 2021 at 12:32:39PM -0500, Brian Foster wrote:
> > > > On Wed, Mar 03, 2021 at 11:57:52AM +1100, Dave Chinner wrote:
> > > > > On Tue, Mar 02, 2021 at 04:44:12PM -0500, Brian Foster wrote:
> > > > > > On Thu, Feb 25, 2021 at 10:26:00AM +1100, Dave Chinner wrote:
> > > > > > >  	 * xlog_cil_push() handles racing pushes for the same sequence,
> > > > > > >  	 * so no need to deal with it here.
> > > > > > >  	 */
> > > > > > >  restart:
> > > > > > > -	xlog_cil_push_now(log, sequence);
> > > > > > > +	xlog_cil_push_now(log, sequence, flags & XFS_LOG_SYNC);
> > > > > > > +	if (!(flags & XFS_LOG_SYNC))
> > > > > > > +		return commit_lsn;
> > > > > > 
> > > > > > Hm, so now we have sync and async log force and sync and async CIL push.
> > > > > > A log force always requires a sync CIL push and commit record submit;
> > > > > > the difference is simply whether or not we wait on commit record I/O
> > > > > > completion. The sync CIL push drains the CIL of log items but does not
> > > > > > switch out the commit record iclog, while the async CIL push executes in
> > > > > > the workqueue context so the drain is async, but it does switch out the
> > > > > > commit record iclog before it completes. IOW, the async CIL push is
> > > > > > basically a "more async" async log force.
> > > > > 
> > > > > Yes.
> > > > > 
> > > > > > I can see the need for the behavior of the async CIL push here, but that
> > > > > > leaves a mess of interfaces and behavior matrix. Is there any reason we
> > > > > > couldn't just make async log forces unconditionally behave equivalent to
> > > > > > the async CIL push as defined by this patch? There's only a handful of
> > > > > > existing users and I don't see any obvious reason why they might care
> > > > > > whether the underlying CIL push is synchronous or not...
> > > > > 
> > > > > I'm not touching the rest of the log force code in this series - it
> > > > > is out of scope of this bug fix and the rest of the series that it
> > > > > is part of.
> > > > > 
> > > > 
> > > > But you already have altered the log force code by changing
> > > > xlog_cil_force_seq(), which implicitly changes how xfs_log_force_seq()
> > > > behaves.
> > > 
> > > The behaviour of the function when called from xfs_log_force*()
> > > should be unchanged. Can you be specific about exactly what
> > > behaviour I changed for non-synchronous xfs_log_force*() callers
> > > so I can fix it? I did not intend to change it at all, so
> > > whatever you are referring to is an issue I need to fix...
> > > 
> > 
> > xfs_log_force_seq() passes flags from the caller to xlog_cil_force_seq()
> > (whereas this patch presumably wants it to pass XFS_LOG_SYNC
> > unconditionally). IOW, xfs_log_force(mp, 0) behavior is different from
> > xfs_log_force_seq(mp, seq, 0, ...).
> 
> Fixed.
> 
> > > > So not only does this patch expose subsystem internals to
> > > > external layers and create more log forcing interfaces/behaviors to
> > > 
> > > Sorry, I don't follow. What "subsystem internals" are being exposed
> > > and what external layer are they being exposed to?
> > > 
> > > > > Cleaning up the mess that is the xfs_log_* and xlog_* interfaces and
> > > > > changing things like log force behaviour and implementation is for
> > > > > a future series.
> > > > > 
> > > > 
> > > > TBH I think this patch is kind of a mess on its own. I think the
> > > > mechanism it wants to provide is sane, but I've not even got to the
> > > > point of reviewing _that_ yet because of the seeming dismissal of higher
> > > > level feedback. I'd rather not go around in circles on this so I'll just
> > > > offer my summarized feedback to this patch...
> > > 
> > > I'm not dismissing review nor am I saying the API cannot or should
> > > not be improved. I'm simply telling you that I think the additional
> > > changes you are proposing are outside the scope of the problem I am
> > > addressing here. I already plan to rework the log force API (and
> > > others) but doing so is not something that this patchset needs to
> > > address, or indeed should address.
> > > 
> > 
> > I'm not proposing additional changes nor to rework the log force API.
> > I'm pointing out that I find this implementation to be extremely and
> > unnecessarily confusing.
> 
> And that's just fine - not everyone has to understand all of the
> code all of the time. That's why we have a team....
> 
> > To improve it, I'm suggesting to either coopt
> > the existing async log force API...
> 
> And I'm telling you that this is *future work*. As the person
> actually doing the work, that's my decision and you need to accept
> that. I understand your concerns and have a plan to address them -
> you just have to accept that the plan to address them isn't going to
> be exactly to your liking.
> 

As a reviewer, I am pointing out that I object to how this patch is
implemented and offered several fairly simple suggestions on how to
address my concerns. Those suggestions have been rejected pretty much
out of hand on the basis of the existence of some future plans.

Future rework plans do not justify or change my view of this patch and
the mess it adds (for other developers, for historical context and
downstreams, etc.). Therefore, if the only option you're willing to
consider is "make this mess now and clean it up later in some broader
rework," then I'd rather we just defer this patch until after that
rework is available and avoid the mess that way.

Brian

> > > There are already enough subtle changes being made to core code and
> > > algorithms that mixing them with unrelated high level external
> > > behavioural changes greatly increases the risk of unexpected
> > > regressions in the patchset. The log force paths are used in
> > > data integrity paths, so I want to limit the scope of behavioural
> > > change to just the AIL, where the log force has no data integrity
> > > requirement associated with it.
> > > 
> > 
> > ... or if we really want a special async log force just for xfsaild (why
> > is still not clear to me), then tie it to an XFS_LOG_REALLY_ASYNC flag
> > or some such, pass that to the existing log force call, and document the
> > purpose/behavior of the new mode in detail. That at least won't require
> > a developer to wonder what !(flags & XFS_LOG_SYNC) happens to mean
> > depending on the particular log force function.
> 
> No. That's just obfuscation. And it is extremely likely that the
> first thing Christoph will ask me to do with a one-shot flag that just
> calls out to a single function is to get rid of the flag and call
> the function directly.
> 
> This will get cleaned up. Just not in this patchset.
> 
> -Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
> 


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH 2/3 v2] xfs: AIL needs asynchronous CIL forcing
  2021-03-05 14:58                   ` Brian Foster
@ 2021-03-09  0:44                     ` Dave Chinner
  2021-03-09  4:35                       ` [PATCH 2/3 v2] xfs: AIL needs asynchronous CIL forcing\ Darrick J. Wong
  2021-03-10 14:49                       ` [PATCH 2/3 v2] xfs: AIL needs asynchronous CIL forcing Brian Foster
  0 siblings, 2 replies; 33+ messages in thread
From: Dave Chinner @ 2021-03-09  0:44 UTC (permalink / raw)
  To: Brian Foster; +Cc: linux-xfs

On Fri, Mar 05, 2021 at 09:58:26AM -0500, Brian Foster wrote:
> On Fri, Mar 05, 2021 at 09:48:48AM +1100, Dave Chinner wrote:
> > > > > So not only does this patch expose subsystem internals to
> > > > > external layers and create more log forcing interfaces/behaviors to
> > > > 
> > > > Sorry, I don't follow. What "subsystem internals" are being exposed
> > > > and what external layer are they being exposed to?
> > > > 
> > > > > > Cleaning up the mess that is the xfs_log_* and xlog_* interfaces and
> > > > > > changing things like log force behaviour and implementation is for
> > > > > > a future series.
> > > > > > 
> > > > > 
> > > > > TBH I think this patch is kind of a mess on its own. I think the
> > > > > mechanism it wants to provide is sane, but I've not even got to the
> > > > > point of reviewing _that_ yet because of the seeming dismissal of higher
> > > > > level feedback. I'd rather not go around in circles on this so I'll just
> > > > > offer my summarized feedback to this patch...
> > > > 
> > > > I'm not dismissing review nor am I saying the API cannot or should
> > > > not be improved. I'm simply telling you that I think the additional
> > > > changes you are proposing are outside the scope of the problem I am
> > > > addressing here. I already plan to rework the log force API (and
> > > > others) but doing so is not something that this patchset needs to
> > > > address, or indeed should address.
> > > > 
> > > 
> > > I'm not proposing additional changes nor to rework the log force API.
> > > I'm pointing out that I find this implementation to be extremely and
> > > unnecessarily confusing.
> > 
> > And that's just fine - not everyone has to understand all of the
> > code all of the time. That's why we have a team....
> > 
> > > To improve it, I'm suggesting to either coopt
> > > the existing async log force API...
> > 
> > And I'm telling you that this is *future work*. As the person
> > actually doing the work, that's my decision and you need to accept
> > that. I understand your concerns and have a plan to address them -
> > you just have to accept that the plan to address them isn't going to
> > be exactly to your liking.
> > 
> 
> As a reviewer, I am pointing out that I object to how this patch is
> implemented and offered several fairly simple suggestions on how to
> address my concerns. Those suggestions have been rejected pretty much
> out of hand on the basis of the existence of some future plans.

We're allowed to disagree on the best approach. But to find a way
forward means that both the summitter and the reviewer need to
compromise. I've fixed up the obvious warts and bugs you've pointed
out, and agreed that it needs cleaning up and have committed to
cleaning it up in the near future.

> Future rework plans do not justify or change my view of this patch
> and the mess it adds (for other developers, for historical context
> and downstreams, etc.).  Therefore, if the only option you're
> willing to consider is "make this mess now and clean it up later
> in some broader rework," then I'd rather we just defer this patch
> until after that rework is available and avoid the mess that way.

So you won't review it until I have 100 outstanding patches in this
series and it's completely and utterly unreviewable? Then you'll ask
me to split it up and re-order it into digestible chunks, and so
we'll go back to having to merge this first because all of the API
rework depends on the mechanism that this patch introduces.

I'm not going to play that "now jump through this hoop" game.  We
add flags for on-off behaviours in internal functions -all the
time-. If this makes the interface so complex and confusing that you
don't understand it, then the interface was already too complex and
confusing. And fixing that is *not in the scope of this patchset*.

Demanding that code be made perfect before it can be merged is
really not very helpful. Especially when there are already plans to
rework the API but that rework is dependent on a bunch of other
changes that need to be done first.

iclogs are something that need to be moved behind the CIL, not sit
in front of the CIL. The CIL abstracts the journal and writing to the
journal completely away from the transaction subsystem, yet the log
force code punches straight through that abstraction and walks
iclogs directly. The whole log force implementation needs to change,
and I plan for the API that wraps the log forces to get reworked at
that time.

For example, if we want to direct map storage for log writes, then
iclog-based log force synchronisation needs to go away because we
don't need iclogs for buffering journal writes. Hence the log force
code should interface only with the CIL, and only the CIL should
manage whatever mechanism it is using to write to stable storage.
The code is currently the way it is because the CIL, when first
implemented, had to co-exist with the old way of writing to the log.
We haven't used that old way for a decade now and we have very
different storage performance characteristics these days, so it's
about time we got rid of the mechanism designed to be optimal for
spinning disks and actually integrated the CIL and the log
efficiently.
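
The shape I'm aiming for is something like this - purely illustrative
prototypes, none of this exists yet:

	/* transaction and AIL code only ever talk to the CIL... */
	void		xlog_cil_flush(struct xlog *log);	/* async */
	xfs_lsn_t	xlog_cil_force_seq(struct xlog *log,
					   xfs_csn_t seq);	/* sync */

	/*
	 * ...and only the CIL knows how checkpoints reach stable
	 * storage, whether that is via iclogs or direct mapped log
	 * buffers.
	 */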

There are a *lot* of steps to do this, and reworking the log force
implementation and API is part of that. But reworking that API is
premature because we haven't put all the necessary pre-work in
place to make such a change yet. This patch is actually part of that
pre-work, putting in place the mechanisms that the log force rework
will rely on.

I have very good reasons for pushing back against your suggestions,
Brian. Your suggestions have merit but this patch is not the time to
be making the changes you suggest. Code does not need to be perfect
to be merged, nor does the entire larger change they will be part of
need to be complete and 100% tested and reviewed before preparation
and infrastructure patches can be merged. This is how we've done big
changes in the past - they've been staged across multiple kernel
cycles - and not everything needs to be done in the first
patchset of a larger body of work....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH 2/3 v2] xfs: AIL needs asynchronous CIL forcing\
  2021-03-09  0:44                     ` Dave Chinner
@ 2021-03-09  4:35                       ` Darrick J. Wong
  2021-03-10  2:10                         ` Brian Foster
                                           ` (2 more replies)
  2021-03-10 14:49                       ` [PATCH 2/3 v2] xfs: AIL needs asynchronous CIL forcing Brian Foster
  1 sibling, 3 replies; 33+ messages in thread
From: Darrick J. Wong @ 2021-03-09  4:35 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Brian Foster, linux-xfs

On Tue, Mar 09, 2021 at 11:44:10AM +1100, Dave Chinner wrote:
> On Fri, Mar 05, 2021 at 09:58:26AM -0500, Brian Foster wrote:
> > On Fri, Mar 05, 2021 at 09:48:48AM +1100, Dave Chinner wrote:
> > > > > > So not only does this patch expose subsystem internals to
> > > > > > external layers and create more log forcing interfaces/behaviors to
> > > > > 
> > > > > Sorry, I don't follow. What "subsystem internals" are being exposed
> > > > > and what external layer are they being exposed to?
> > > > > 
> > > > > > > Cleaning up the mess that is the xfs_log_* and xlog_* interfaces and
> > > > > > > changing things like log force behaviour and implementation is for
> > > > > > > a future series.
> > > > > > > 
> > > > > > 
> > > > > > TBH I think this patch is kind of a mess on its own. I think the
> > > > > > mechanism it wants to provide is sane, but I've not even got to the
> > > > > > point of reviewing _that_ yet because of the seeming dismissal of higher
> > > > > > level feedback. I'd rather not go around in circles on this so I'll just
> > > > > > offer my summarized feedback to this patch...
> > > > > 
> > > > > I'm not dismissing review nor am I saying the API cannot or should
> > > > > not be improved. I'm simply telling you that I think the additional
> > > > > changes you are proposing are outside the scope of the problem I am
> > > > > addressing here. I already plan to rework the log force API (and
> > > > > others) but doing so is not something that this patchset needs to
> > > > > address, or indeed should address.
> > > > > 
> > > > 
> > > > I'm not proposing additional changes nor to rework the log force API.
> > > > I'm pointing out that I find this implementation to be extremely and
> > > > unnecessarily confusing.
> > > 
> > > And that's just fine - not everyone has to understand all of the
> > > code all of the time. That's why we have a team....
> > > 
> > > > To improve it, I'm suggesting to either coopt
> > > > the existing async log force API...
> > > 
> > > And I'm telling you that this is *future work*. As the person
> > > actually doing the work, that's my decision and you need to accept
> > > that. I understand your concerns and have a plan to address them -
> > > you just have to accept that the plan to address them isn't going to
> > > be exactly to your liking.
> > > 
> > 
> > As a reviewer, I am pointing out that I object to how this patch is
> > implemented and offered several fairly simple suggestions on how to
> > address my concerns. Those suggestions have been rejected pretty much
> > out of hand on the basis of the existence of some future plans.
> 
> We're allowed to disagree on the best approach. But to find a way
> forward means that both the submitter and the reviewer need to
> compromise. I've fixed up the obvious warts and bugs you've pointed
> out, and agreed that it needs cleaning up and have committed to
> cleaning it up in the near future.
> 
> > Future rework plans do not justify or change my view of this patch
> > and the mess it adds (for other developers, for historical context
> > and downstreams, etc.).  Therefore, if the only option you're
> > willing to consider is "make this mess now and clean it up later
> > in some broader rework," then I'd rather we just defer this patch
> > until after that rework is available and avoid the mess that way.
> 
> So you won't review it until I have 100 outstanding patches in this
> series and it's completely and utterly unreviewable?

Already unreviewable at 45, and I've only gotten through 2/3 of it.

> Then you'll ask me to split it up and re-order it into digestible
> chunks, and so we'll go back to having to merge this first because
> all of the API rework depends on the mechanism that this patch
> introduces.

Here's something I haven't previously shared with all of you: Last cycle
when we were going around and around on the ENOSPC/EDQUOT retry loop
patches (which exploded from 13 to 41 patches) it was /very/ stressful
to have to rework this and that part every day and a half for almost
three weeks.

When I get a patchset ready for submission, I export the patches from my
development branch into a topic branch based off the latest -rc.  Every
time anyone makes a suggestion, I update the topic branch and send that
to the fstests "cloud" to see if it broke anything.  If it succeeds, I
push it to k.org and resubmit.

The part that's less obvious is the fact that rebasing is /not/ that
simple.  Next I integrate the changes into my development tree and
rebase everything that comes after /that/ to make sure it doesn't
interfere with my longer term development goals.  Sometimes this is
easy, sometimes this takes an entire day to work out the warts.
Then I send that to the fstests "cloud".  This rebase is particularly
painful because every change that everyone makes to inode
initialization collides with a rework of inode initialization that I've
been working on in preparation for metadata directory trees.

The part that's even /less/ obvious than that is that once the
development tree tests ok, then I do the same to my xfsprogs dev tree to
make sure that nothing broke.  Frequently there are ABI or cognitive
mismatches between kernel and userspace such that the build breaks, and
then I have to patch the two kernel trees and re-run everything.

So, that's really stressful for me because a minor tweak to an interface
can result in an enormous amount of work.  And I reject the argument
that I could just rebase less frequently -- Dave does that, which means
that any time he hears that one of his topic branches is being asked
about, he has to spend weeks rebasing the whole mess to the latest
upstream.  Maybe that works for him, but for me, I would hate that even
more than doing a daily rebase.

Add to that the fact that vger has been a total delivery sh*tshow all
year.  Now in addition to feeling disconnected from my family and
friends, I also feel disconnected from work people too.  This really did
nearly push me over the edge three weeks ago.

Please remember, one of the big advantages of our open development
processes is that we /do/ accept code with warty (but functional)
interfaces now, and we can clean them up later.  This is (IMHO) a good
stress-reduction tactic, because each of us (ideally) should concentrate
on getting the core algorithms right, not on rebasing code
and smoothing over the same d*** merge conflicts over and over.

Yes, it's true that people think that a maintainer's only real power is
to say 'no' in the hopes of forcing developers to fix everything now
because they can't trust that a dev will ever come back with the
promised updates, but I reject that 110%.  I'm not going anywhere, and I
/do/ trust that when the rest of you say that you'll be back with wart
remover, you will.

> I'm not going to play that "now jump through this hoop" game.  We
> add flags for on-off behaviours in internal functions -all the
> time-. If this makes the interface so complex and confusing that you
> don't understand it, then the interface was already too complex and
> confusing. And fixing that is *not in the scope of this patchset*.
> 
> Demanding that code be made perfect before it can be merged is
> really not very helpful. Especially when there are already plans to
> rework the API but that rework is dependent on a bunch of other
> changes that need to be done first.
> 
> iclogs are something that need to be moved behind the CIL, not sit
> in front of the CIL. The CIL abstracts the journal and writing to the
> journal completely away from the transaction subsystem, yet the log
> force code punches straight through that abstraction and walks
> iclogs directly. The whole log force implementation needs to change,
> and I plan for the API that wraps the log forces to get reworked at
> that time.

So here's what I want to know: Do Dave's changes to the log force APIs
introduce broken behavior?  If the interfaces are so confusing that
/none/ of us understand them, can we introduce the appropriate wrappers
and documentation so that the rest of us plugging away at the rest of
the system can only call them in the supported ways to achieve any of
the supported outcomes?

I'm willing to accept a bunch of documentation and "trivial" wrappers
for the rest of us as a shoofly to enable the rest of the xfs developers
to keep moving around a messy slow-moving log restructuring without
falling into a work pit.

However, it's been difficult for me to check that without being able to
reference a branch to see that at least the end result looks sane.  That
was immensely helpful for reviewing Allison's deferred xattrs series.

(TLDR: git branch plz)

The other way to ease my fears, of course, would be to submit a ton of
fstests to examine the log behavior for correctness, but that's
difficult to pull off when the control surface is the userspace ABI.

> For example, if we want to direct map storage for log writes, then
> iclog-based log force synchronisation needs to go away because we
> don't need iclogs for buffering journal writes. Hence the log force
> code should interface only with the CIL, and only the CIL should
> manage whatever mechanism it is using to write to stable storage.
> The code is currently the way it is because the CIL, when first
> implemented, had to co-exist with the old way of writing to the log.
> We haven't used that old way for a decade now and we have very
> different storage performance characteristics these days, so it's
> about time we got rid of the mechanism designed to be optimal for
> spinning disks and actually integrated the CIL and the log
> efficiently.
> 
> There are a *lot* of steps to do this, and reworking the log force
> implementation and API is part of that. But reworking that API is
> premature because we haven't put all the necessary pre-work in
> place to make such a change yet. This patch is actually part of that
> pre-work to get the mechanisms that the log force rework will rely
> on.
> 
> I have very good reasons for pushing back against your suggestions,
> Brian. Your suggestions have merit but this patch is not the time to
> be making the changes you suggest. Code does not need to be perfect
> to be merged, nor does the entire larger change they will be part of
> need to be complete and 100% tested and reviewed before preparation
> and infrastructure patches can be merged. This is how we've done big
> changes in the past - they've been staged across multiple kernel
> cycles - and not everything needs to be done in the first
> patchset of a larger body of work....

You mean there's even more beyond the 45 already on the list? /groan/

--D

> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH 2/3 v2] xfs: AIL needs asynchronous CIL forcing\
  2021-03-09  4:35                       ` [PATCH 2/3 v2] xfs: AIL needs asynchronous CIL forcing\ Darrick J. Wong
@ 2021-03-10  2:10                         ` Brian Foster
  2021-03-10 22:00                           ` [PATCH 2/3 v2] xfs: AIL needs asynchronous CIL forcing Dave Chinner
  2021-03-10 15:13                         ` [PATCH 2/3 v2] xfs: AIL needs asynchronous CIL forcing\ Brian Foster
  2021-03-11 12:41                         ` Christoph Hellwig
  2 siblings, 1 reply; 33+ messages in thread
From: Brian Foster @ 2021-03-10  2:10 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: Dave Chinner, linux-xfs

On Mon, Mar 08, 2021 at 08:35:59PM -0800, Darrick J. Wong wrote:
> On Tue, Mar 09, 2021 at 11:44:10AM +1100, Dave Chinner wrote:
> > On Fri, Mar 05, 2021 at 09:58:26AM -0500, Brian Foster wrote:
> > > On Fri, Mar 05, 2021 at 09:48:48AM +1100, Dave Chinner wrote:
> > > > > > > So not only does this patch expose subsystem internals to
> > > > > > > external layers and create more log forcing interfaces/behaviors to
> > > > > > 
> > > > > > Sorry, I don't follow. What "subsystem internals" are being exposed
> > > > > > and what external layer are they being exposed to?
> > > > > > 
> > > > > > > > Cleaning up the mess that is the xfs_log_* and xlog_* interfaces and
> > > > > > > > changing things like log force behaviour and implementation is for
> > > > > > > > a future series.
> > > > > > > > 
> > > > > > > 
> > > > > > > TBH I think this patch is kind of a mess on its own. I think the
> > > > > > > mechanism it wants to provide is sane, but I've not even got to the
> > > > > > > point of reviewing _that_ yet because of the seeming dismissal of higher
> > > > > > > level feedback. I'd rather not go around in circles on this so I'll just
> > > > > > > offer my summarized feedback to this patch...
> > > > > > 
> > > > > > I'm not dismissing review nor am I saying the API cannot or should
> > > > > > not be improved. I'm simply telling you that I think the additional
> > > > > > changes you are proposing are outside the scope of the problem I am
> > > > > > addressing here. I already plan to rework the log force API (and
> > > > > > others) but doing so is not something that this patchset needs to
> > > > > > address, or indeed should address.
> > > > > > 
> > > > > 
> > > > > I'm not proposing additional changes nor to rework the log force API.
> > > > > I'm pointing out that I find this implementation to be extremely and
> > > > > unnecessarily confusing.
> > > > 
> > > > And that's just fine - not everyone has to understand all of the
> > > > code all of the time. That's why we have a team....
> > > > 
> > > > > To improve it, I'm suggesting to either co-opt
> > > > > the existing async log force API...
> > > > 
> > > > And I'm telling you that this is *future work*. As the person
> > > > actually doing the work, that's my decision and you need to accept
> > > > that. I understand your concerns and have a plan to address them -
> > > > you just have to accept that the plan to address them isn't going to
> > > > be exactly to your liking.
> > > > 
> > > 
> > > As a reviewer, I am pointing out that I object to how this patch is
> > > implemented and offered several fairly simple suggestions on how to
> > > address my concerns. Those suggestions have been rejected pretty much
> > > out of hand on the basis of the existence of some future plans.
> > 
> > We're allowed to disagree on the best approach. But to find a way
> > forward means that both the submitter and the reviewer need to
> > compromise. I've fixed up the obvious warts and bugs you've pointed
> > out, and agreed that it needs cleaning up and have committed to
> > cleaning it up in the near future.
> > 
> > > Future rework plans do not justify or change my view of this patch
> > > and the mess it adds (for other developers, for historical context
> > > and downstreams, etc.).  Therefore, if the only option you're
> > > willing to consider is "make this mess now and clean it up later
> > > in some broader rework," then I'd rather we just defer this patch
> > > until after that rework is available and avoid the mess that way.
> > 
> > So you won't review it until I have 100 outstanding patches in this
> > series and it's completely and utterly unreviewable?
> 
> Already unreviewable at 45, and I've only gotten through 2/3 of it.
> 
> > Then you'll ask me to split it up and re-order it into digestible
> > chunks, and so we'll go back to having to merge this because any of
> > the API rework depends on the mechanism that this patch
> > introduces.
> 
> Here's something I haven't previously shared with all of you: Last cycle
> when we were going around and around on the ENOSPC/EDQUOT retry loop
> patches (which exploded from 13 to 41 patches) it was /very/ stressful
> to have to rework this and that part every day and a half for almost
> three weeks.
> 
> When I get a patchset ready for submission, I export the patches from my
> development branch into a topic branch based off the latest -rc.  Every
> time anyone makes a suggestion, I update the topic branch and send that
> to the fstests "cloud" to see if it broke anything.  If it succeeds, I
> push it to k.org and resubmit.
> 
> The part that's less obvious is the fact that rebasing is /not/ that
> simple.  Next I integrate the changes into my development tree and
> rebase everything that comes after /that/ to make sure it doesn't
> interfere with my longer term development goals.  Sometimes this is
> easy, sometimes this takes an entire day to work out the warts.
> Then I send that to the fstests "cloud".  This rebase is particularly
> painful because every change that everyone makes to inode
> initialization collides with a rework of inode initialization that I've
> been working on in preparation for metadata directory trees.
> 
> The part that's even /less/ obvious than that is that once the
> development tree tests ok, then I do the same to my xfsprogs dev tree to
> make sure that nothing broke.  Frequently there are ABI or cognitive
> mismatches between kernel and userspace such that the build breaks, and
> then I have to patch the two kernel trees and re-run everything.
> 
> So, that's really stressful for me because a minor tweak to an interface
> can result in an enormous amount of work.  And I reject the argument
> that I could just rebase less frequently -- Dave does that, which means
> that any time he hears that one of his topic branches is being asked
> about, he has to spend weeks rebasing the whole mess to the latest
> upstream.  Maybe that works for him, but for me, I would hate that even
> more than doing a daily rebase.
> 
> Add to that the fact that vger has been a total delivery sh*tshow all
> year.  Now in addition to feeling disconnected from my family and
> friends, I also feel disconnected from work people too.  This really did
> nearly push me over the edge three weeks ago.
> 
> Please remember, one of the big advantages of our open development
> processes is that we /do/ accept code with warty (but functional)
> interfaces now, and we can clean them up later.  This is (IMHO) a good
> stress-reduction tactic, because each of us (ideally) should concentrate
> on getting the core algorithms right, and not focus on rebasing code
> and smoothing over the same d*** merge conflicts over and over.
> 
> Yes, it's true that people think that a maintainer's only real power is
> to say 'no' in the hopes of forcing developers to fix everything now
> because they can't trust that a dev will ever come back with the
> promised updates, but I reject that 110%.  I'm not going anywhere, and I
> /do/ trust that when the rest of you say that you'll be back with wart
> remover, you will.
> 
> > I'm not going to play that "now jump through this hoop" game.  We
> > add flags for on-off behaviours in internal functions -all the
> > time-. If this makes the interface so complex and confusing that you
> > don't understand it, then the interface was already too complex and
> > confusing. And fixing that is *not in the scope of this patchset*.
> > 
> > Demanding that code be made perfect before it can be merged is
> > really not very helpful. Especially when there are already plans to
> > rework the API but that rework is dependent on a bunch of other
> > changes that need to be done first.
> > 
> > iclogs are something that need to be moved behind the CIL, not sit
> > in front of CIL. The CIL abstracts the journal and writing to the
> > journal completely away from the transaction subsystem, yet the log
> > force code punches straight through that abstraction and walks
> > iclogs directly. The whole log force implementation needs to change,
> > and I plan for the API that wraps the log forces to get reworked at
> > that time.
> 
> So here's what I want to know: Do Dave's changes to the log force APIs
> introduce broken behavior?  If the interfaces are so confusing that
> /none/ of us understand them, can we introduce the appropriate wrappers
> and documentation so that the rest of us plugging away at the rest of
> the system can only call them in the supported ways to achieve any of the
> supported outcomes?
> 

It looks to me like the changes to xlog_cil_force_seq() could
essentially be replaced with something like:

/*
 * Behold the magical async CIL force
 * ... <explain what this does> ...
 */
static inline void
xfs_cil_force_async(
	struct xlog	*log)
{
	struct xfs_cil	*cil = log->l_cilp;

	/* true for magic async CIL force */
	xlog_cil_push_now(log, cil->xc_current_sequence, true);
}

If so, we wouldn't have to hack up xlog_cil_force_seq() with a flags
parameter that provides inconsistent async semantics (and thus show
mercy on the log force call stack above it) and otherwise only serves to
bypass 95% of that function anyway. I also think that inverting the cil
push bool param (and calling it async_push or some such) is a bit
clearer, because it separates it from XFS_LOG_SYNC (or !XFS_LOG_SYNC)
semantics and simultaneously maps directly to the underlying
->xc_push_async control.
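
Concretely, that would give a prototype along these lines (I'm guessing
at the types from context, so treat this as a sketch rather than a
patch):

static void
xlog_cil_push_now(
	struct xlog	*log,
	xfs_csn_t	push_seq,	/* CIL sequence (type assumed) */
	bool		async_push);	/* don't block on the push; maps
					 * directly to ->xc_push_async */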

Brian

> I'm willing to accept a bunch of documentation and "trivial" wrappers
> for the rest of us as a shoofly to enable the rest of the xfs developers
> to keep moving around a messy slow-moving log restructuring without
> falling into a work pit.
> 
> However, it's been difficult for me to check that without being able to
> reference a branch to see that at least the end result looks sane.  That
> was immensely helpful for reviewing Allison's deferred xattrs series.
> 
> (TLDR: git branch plz)
> 
> The other way to ease my fears, of course, would be to submit a ton of
> fstests to examine the log behavior for correctness, but that's
> difficult to pull off when the control surface is the userspace ABI.
> 
> > For example, if we want to direct map storage for log writes, then
> > iclog-based log force synchronisation needs to go away because we
> > don't need iclogs for buffering journal writes. Hence the log force
> > code should interface only with the CIL, and only the CIL should
> > manage whatever mechanism it is using to write to stable storage.
> > The code is currently the way it is because the CIL, when first
> > implemented, had to co-exist with the old way of writing to the log.
> > We haven't used that old way for a decade now and we have very
> > different storage performance characteristics these days, so it's
> > about time we got rid of the mechanism designed to be optimal for
> > spinning disks and actually integrated the CIL and the log
> > efficiently.
> > 
> > There are a *lot* of steps to do this, and reworking the log force
> > implementation and API is part of that. But reworking that API is
> > premature because we haven't put all the necessary pre-work in
> > place to make such a change yet. This patch is actually part of that
> > pre-work to get the mechanisms that the log force rework will rely
> > on.
> > 
> > I have very good reasons for pushing back against your suggestions,
> > Brian. Your suggestions have merit but this patch is not the time to
> > be making the changes you suggest. Code does not need to be perfect
> > to be merged, nor does the entire larger change they will be part of
> > need to be complete and 100% tested and reviewed before preparation
> > and infrastructure patches can be merged. This is how we've done big
> > changes in the past - they've been staged across multiple kernel
> > cycles - and not everything needs to be done in the first
> > patchset of a larger body of work....
> 
> You mean there's even more beyond the 45 already on the list? /groan/
> 
> --D
> 
> > Cheers,
> > 
> > Dave.
> > -- 
> > Dave Chinner
> > david@fromorbit.com
> 


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH 2/3 v2] xfs: AIL needs asynchronous CIL forcing
  2021-03-09  0:44                     ` Dave Chinner
  2021-03-09  4:35                       ` [PATCH 2/3 v2] xfs: AIL needs asynchronous CIL forcing\ Darrick J. Wong
@ 2021-03-10 14:49                       ` Brian Foster
  1 sibling, 0 replies; 33+ messages in thread
From: Brian Foster @ 2021-03-10 14:49 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Tue, Mar 09, 2021 at 11:44:10AM +1100, Dave Chinner wrote:
> On Fri, Mar 05, 2021 at 09:58:26AM -0500, Brian Foster wrote:
> > On Fri, Mar 05, 2021 at 09:48:48AM +1100, Dave Chinner wrote:
> > > > > > So not only does this patch expose subsystem internals to
> > > > > > external layers and create more log forcing interfaces/behaviors to
> > > > > 
> > > > > Sorry, I don't follow. What "subsystem internals" are being exposed
> > > > > and what external layer are they being exposed to?
> > > > > 
> > > > > > > Cleaning up the mess that is the xfs_log_* and xlog_* interfaces and
> > > > > > > changing things like log force behaviour and implementation is for
> > > > > > > a future series.
> > > > > > > 
> > > > > > 
> > > > > > TBH I think this patch is kind of a mess on its own. I think the
> > > > > > mechanism it wants to provide is sane, but I've not even got to the
> > > > > > point of reviewing _that_ yet because of the seeming dismissal of higher
> > > > > > level feedback. I'd rather not go around in circles on this so I'll just
> > > > > > offer my summarized feedback to this patch...
> > > > > 
> > > > > I'm not dismissing review nor am I saying the API cannot or should
> > > > > not be improved. I'm simply telling you that I think the additional
> > > > > changes you are proposing are outside the scope of the problem I am
> > > > > addressing here. I already plan to rework the log force API (and
> > > > > others) but doing so is not something that this patchset needs to
> > > > > address, or indeed should address.
> > > > > 
> > > > 
> > > > I'm not proposing additional changes nor to rework the log force API.
> > > > I'm pointing out that I find this implementation to be extremely and
> > > > unnecessarily confusing.
> > > 
> > > And that's just fine - not everyone has to understand all of the
> > > code all of the time. That's why we have a team....
> > > 
> > > > To improve it, I'm suggesting to either co-opt
> > > > the existing async log force API...
> > > 
> > > And I'm telling you that this is *future work*. As the person
> > > actually doing the work, that's my decision and you need to accept
> > > that. I understand your concerns and have a plan to address them -
> > > you just have to accept that the plan to address them isn't going to
> > > be exactly to your liking.
> > > 
> > 
> > As a reviewer, I am pointing out that I object to how this patch is
> > implemented and offered several fairly simple suggestions on how to
> > address my concerns. Those suggestions have been rejected pretty much
> > out of hand on the basis of the existence of some future plans.
> 
> We're allowed to disagree on the best approach. But to find a way
> forward means that both the submitter and the reviewer need to
> compromise. I've fixed up the obvious warts and bugs you've pointed
> out, and agreed that it needs cleaning up and have committed to
> cleaning it up in the near future.
> 
> > Future rework plans do not justify or change my view of this patch
> > and the mess it adds (for other developers, for historical context
> > and downstreams, etc.).  Therefore, if the only option you're
> > willing to consider is "make this mess now and clean it up later
> > in some broader rework," then I'd rather we just defer this patch
> > until after that rework is available and avoid the mess that way.
> 
> So you won't review it until I have 100 outstanding patches in this
> series and it's completely and utterly unreviewable? Then you'll ask
> me to split it up and re-order it into digestible chunks, and so
> we'll go back to having to merge this because any of the API rework
> depends on the mechanism that this patch introduces.
> 

(Well, I'm not reviewing it now because I'm on PTO this week...)

> I'm not going to play that "now jump through this hoop" game.  We
> add flags for on-off behaviours in internal functions -all the
> time-. If this makes the interface so complex and confusing that you
> don't understand it, then the interface was already too complex and
> confusing. And fixing that is *not in the scope of this patchset*.
> 

I'm pretty sure creating a flag for the one-off behavior here was one of
my earlier suggestions. Based on the above, what's the problem with
that? I think Darrick's suggestion (re: my previous reply) around a
wrapper approach is quite reasonable as well.

> Demanding that code be made perfect before it can be merged is
> really not very helpful. Especially when there are already plans to
> rework the API but that rework is dependent on a bunch of other
> changes that need to be done first.
> 

"Jumping through hoops" on 100 patch series? Demanding perfection?
That's a little extreme for first pass feedback on v1 of a 3 patch
series, don't you think?

My position on this patch in its current form is that regardless of any
pending rework, it is unnecessarily gross. As stated numerous times, I
think there are various options for small tweaks that make it passable
for the time being. Let me know if you become willing to entertain the
notion that this patch might stand to improve independently of some
broader rework, and I'm happy to revisit any of those options in further
detail.

In the meantime, I'm growing a little tired of the hyperbolic replies
around playing games, jumping through hoops, demanding perfection, or
whatever other nonsense. I've invested the better part of a week (and
now into PTO) in review of this patch alone because I feel strongly
about the result and am willing to contribute toward a solution, not
because I want to waste time or play games. I don't feel in kind about
the replies.

Brian

P.S.,

FWIW, I certainly don't expect all feedback to be incorporated during
review. On balance, I'd say a reasonable breakdown is that probably half
makes sense, another half might be bogus, and then some percentage of
that comes down to preference between reviewer and author where I'll try
to defer to the patch author when I feel there's at least been some good
faith discussion or justification for the current approach. "No, that's
obfuscation" or "no, I'll do it later" are not productive arguments if
you're looking for compromise. Those leave no room for discussion.
There's no counter suggestion to try and address the immediate concern
and work towards something mutually agreeable, or even any opportunity
to arrive at the conclusion that the original patch might be the ideal
outcome after all.

> iclogs are something that need to be moved behind the CIL, not sit
> in front of CIL. The CIL abstracts the journal and writing to the
> journal completely away from the transaction subsystem, yet the log
> force code punches straight through that abstraction and walks
> iclogs directly. The whole log force implementation needs to change,
> and I plan for the API that wraps the log forces to get reworked at
> that time.
> 
> For example, if we want to direct map storage for log writes, then
> iclog-based log force synchronisation needs to go away because we
> don't need iclogs for buffering journal writes. Hence the log force
> code should interface only with the CIL, and only the CIL should
> manage whatever mechanism it is using to write to stable storage.
> The code is currently the way it is because the CIL, when first
> implemented, had to co-exist with the old way of writing to the log.
> We haven't used that old way for a decade now and we have very
> different storage performance characteristics these days, so it's
> about time we got rid of the mechanism designed to be optimal for
> spinning disks and actually integrated the CIL and the log
> efficiently.
> 
> There are a *lot* of steps to do this, and reworking the log force
> implementation and API is part of that. But reworking that API is
> premature because we haven't put all the necessary pre-work in
> place to make such a change yet. This patch is actually part of that
> pre-work to get the mechanisms that the log force rework will rely
> on.
> 
> I have very good reasons for pushing back against your suggestions,
> Brian. Your suggestions have merit but this patch is not the time to
> be making the changes you suggest. Code does not need to be perfect
> to be merged, nor does the entire larger change they will be part of
> need to be complete and 100% tested and reviewed before preparation
> and infrastructure patches can be merged. This is how we've done big
> changes in the past - they've been staged across multiple kernel
> cycles - and not everything needs to be done in the first
> patchset of a larger body of work....
> 
> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
> 


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH 2/3 v2] xfs: AIL needs asynchronous CIL forcing\
  2021-03-09  4:35                       ` [PATCH 2/3 v2] xfs: AIL needs asynchronous CIL forcing\ Darrick J. Wong
  2021-03-10  2:10                         ` Brian Foster
@ 2021-03-10 15:13                         ` Brian Foster
  2021-03-11 12:41                         ` Christoph Hellwig
  2 siblings, 0 replies; 33+ messages in thread
From: Brian Foster @ 2021-03-10 15:13 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: Dave Chinner, linux-xfs

(Replying separately to the logistical bits...)

On Mon, Mar 08, 2021 at 08:35:59PM -0800, Darrick J. Wong wrote:
> On Tue, Mar 09, 2021 at 11:44:10AM +1100, Dave Chinner wrote:
> > On Fri, Mar 05, 2021 at 09:58:26AM -0500, Brian Foster wrote:
> > > On Fri, Mar 05, 2021 at 09:48:48AM +1100, Dave Chinner wrote:
...
> > > 
> > > As a reviewer, I am pointing out that I object to how this patch is
> > > implemented and offered several fairly simple suggestions on how to
> > > address my concerns. Those suggestions have been rejected pretty much
> > > out of hand on the basis of the existence of some future plans.
> > 
> > We're allowed to disagree on the best approach. But to find a way
> > forward means that both the submitter and the reviewer need to
> > compromise. I've fixed up the obvious warts and bugs you've pointed
> > out, and agreed that it needs cleaning up and have committed to
> > cleaning it up in the near future.
> > 
> > > Future rework plans do not justify or change my view of this patch
> > > and the mess it adds (for other developers, for historical context
> > > and downstreams, etc.).  Therefore, if the only option you're
> > > willing to consider is "make this mess now and clean it up later
> > > in some broader rework," then I'd rather we just defer this patch
> > > until after that rework is available and avoid the mess that way.
> > 
> > So you won't review it until I have 100 outstanding patches in this
> > series and it's completely and utterly unreviewable?
> 
> Already unreviewable at 45, and I've only gotten through 2/3 of it.
> 
> > Then you'll ask me to split it up and re-order it into digestible
> > chunks, and so we'll go back to having to merge this because any of
> > the API rework depends on the mechanism that this patch
> > introduces.
> 
> Here's something I haven't previously shared with all of you: Last cycle
> when we were going around and around on the ENOSPC/EDQUOT retry loop
> patches (which exploded from 13 to 41 patches) it was /very/ stressful
> to have to rework this and that part every day and a half for almost
> three weeks.
> 

I sympathize. There was at least a week in there that I spent pretty
much on nothing but review. :/

> When I get a patchset ready for submission, I export the patches from my
> development branch into a topic branch based off the latest -rc.  Every
> time anyone makes a suggestion, I update the topic branch and send that
> to the fstests "cloud" to see if it broke anything.  If it succeeds, I
> push it to k.org and resubmit.
> 
> The part that's less obvious is the fact that rebasing is /not/ that
> simple.  Next I integrate the changes into my development tree and
> rebase everything that comes after /that/ to make sure it doesn't
> interfere with my longer term development goals.  Sometimes this is
> easy, sometimes this takes an entire day to work out the warts.
> Then I send that to the fstests "cloud".  This rebase is particularly
> painful because every change that everyone makes to inode
> initialization collides with a rework of inode initialization that I've
> been working on in preparation for metadata directory trees.
> 
> The part that's even /less/ obvious than that is that once the
> development tree tests ok, then I do the same to my xfsprogs dev tree to
> make sure that nothing broke.  Frequently there are ABI or cognitive
> mismatches between kernel and userspace such that the build breaks, and
> then I have to patch the two kernel trees and re-run everything.
> 
> So, that's really stressful for me because a minor tweak to an interface
> can result in an enormous amount of work.  And I reject the argument
> that I could just rebase less frequently -- Dave does that, which means
> that any time he hears that one of his topic branches is being asked
> about, he has to spend weeks rebasing the whole mess to the latest
> upstream.  Maybe that works for him, but for me, I would hate that even
> more than doing a daily rebase.
> 
> Add to that the fact that vger has been a total delivery sh*tshow all
> year.  Now in addition to feeling disconnected from my family and
> friends, I also feel disconnected from work people too.  This really did
> nearly push me over the edge three weeks ago.
> 

Sorry. :(

> Please remember, one of the big advantages of our open development
> processes is that we /do/ accept code with warty (but functional)
> interfaces now, and we can clean them up later.  This is (IMHO) a good
> stress-reduction tactic, because each of us (ideally) should concentrate
> on getting the core algorithms right, and not focus on rebasing code
> and smoothing over the same d*** merge conflicts over and over.
> 

I agree that we can and do accept code as such in certain cases, but not
always or by default. Usually, only when it makes sense due to
complexity, cross-subsystem issues, etc. If the situation is that the
submitter has sent a large patch series at a point where it's already so
unwieldy to change, for non-upstream reasons, that some restricted
form of review is required, then I think that's a bit unfair to the
upstream community.

Using the whole block reclaim/retry loop thing as an example (just
because it was the most recent example of this kind of thing causing you
grief), perhaps those foundational changes should have been sent to the
list earlier, before they became "too unwieldy to change"? I think it's
perfectly reasonable to send lightly tested or compile-tested-only
patches for initial thoughts, factoring feedback, etc., particularly if
a change proposes to add retry loops to various transaction allocation
callers. That's a fairly significant change in some core areas.

ISTM that on one hand review is fairly open-ended and it's not really up
to the submitter to declare the scope of reviewer feedback (the
submitter is then certainly free to change certain things or not, but
the goal should be to cooperate toward some common ground). OTOH I
suppose the open development model allows for a community to determine a
standard for if/how they prefer to review patches.

That said, I personally don't like the "accept ugly code now, fix it
later over time" model by default because, for one, I think it leads to a
lower-quality mainline. I also think there's an implied assumption in
there that a reviewer wants to review endless series of cleanup patches
after providing such feedback on early versions of functionality. I know
that if I take the time to review a 30+ patch series and have some
non-negligible amount of aesthetic feedback, I'm not really looking to
review another 20+ patches some time later just to clean up a mess that
the first series might have (unnecessarily) created, particularly when
the feedback has already been provided and so iterative review is more
time efficient (I find review much more time-consuming than
development).

I dunno. That's just my .02 I guess. I do agree with Dave's previous
statements that not everything has to be 100% mutually agreeable. I
generally try to accommodate that by incorporating suggestions into my
own patches that I might not necessarily agree with, but don't feel
strongly enough about, to save time and keep things moving. Conversely
on the review side, I generally try to defer to other reviewers to see
whether my particular position might be in the majority or minority, and
go with that (That doesn't always mean I'm going to put reviewed-by tags
on patches I really don't like. In some cases, I think reviewing for
function/bugs and not nacking a patch is a form of compromise. ;).

> Yes, it's true that people think that a maintainer's only real power is
> to say 'no' in the hopes of forcing developers to fix everything now
> because they can't trust that a dev will ever come back with the
> promised updates, but I reject that 110%.  I'm not going anywhere, and I
> /do/ trust that when the rest of you say that you'll be back with wart
> remover, you will.
> 

It's not that I distrust any XFS developer to follow through on promised
fixes; I don't really see that as a problem at all. I think the whole
subsystem rework thing is pretty much a distraction from the original
feedback. E.g., I think your wrapper idea (explored in my previous
reply) is yet another reasonable solution that doesn't involve or
require any sort of broader rework.

Brian

> > I'm not going to play that "now jump through this hoop" game.  We
> > add flags for on-off behaviours in internal functions -all the
> > time-. If this makes the interface so complex and confusing that you
> > don't understand it, then the interface was already too complex and
> > confusing. And fixing that is *not in the scope of this patchset*.
> > 
> > Demanding that code be made perfect before it can be merged is
> > really not very helpful. Especially when there are already plans to
> > rework the API but that rework is dependent on a bunch of other
> > changes that need to be done first.
> > 
> > iclogs are something that need to be moved behind the CIL, not sit
> > in front of CIL. The CIL abstracts the journal and writing to the
> > journal completely away from the transaction subsystem, yet the log
> > force code punches straight through that abstraction and walks
> > iclogs directly. The whole log force implementation needs to change,
> > and I plan for the API that wraps the log forces to get reworked at
> > that time.
> 
> So here's what I want to know: Do Dave's changes to the log force APIs
> introduce broken behavior?  If the interfaces are so confusing that
> /none/ of us understand them, can we introduce the appropriate wrappers
> and documentation so that the rest of us plugging away at the rest of
> the system can only call them in the supported ways to achieve any of the
> supported outcomes?
> 
> I'm willing to accept a bunch of documentation and "trivial" wrappers
> for the rest of us as a shoofly to enable the rest of the xfs developers
> to keep moving around a messy slow-moving log restructuring without
> falling into a work pit.
> 
> However, it's been difficult for me to check that without being able to
> reference a branch to see that at least the end result looks sane.  That
> was immensely helpful for reviewing Allison's deferred xattrs series.
> 
> (TLDR: git branch plz)
> 
> The other way to ease my fears, of course, would be to submit a ton of
> fstests to examine the log behavior for correctness, but that's
> difficult to pull off when the control surface is the userspace ABI.
> 
> > For example, if we want to direct map storage for log writes, then
> > iclog-based log force synchronisation needs to go away because we
> > don't need iclogs for buffering journal writes. Hence the log force
> > code should interface only with the CIL, and only the CIL should
> > manage whatever mechanism it is using to write to stable storage.
> > The code is currently the way it is because the CIL, when first
> > implemented, had to co-exist with the old way of writing to the log.
> > We haven't used that old way for a decade now and we have very
> > different storage performance characteristics these days, so it's
> > about time we got rid of the mechanism designed to be optimal for
> > spinning disks and actually integrated the CIL and the log
> > efficiently.
> > 
> > There are a *lot* of steps to do this, and reworking the log force
> > implementation and API is part of that. But reworking that API is
> > premature because we haven't put all the necessary pre-work in
> > place to make such a change yet. This patch is actually part of that
> > pre-work to get the mechanisms that the log force rework will rely
> > on.
> > 
> > I have very good reasons for pushing back against your suggestions,
> > Brian. Your suggestions have merit but this patch is not the time to
> > be making the changes you suggest. Code does not need to be perfect
> > to be merged, nor does the entire larger change they will be part of
> > need to be complete and 100% tested and reviewed before preparation
> > and infrastructure patches can be merged. This is how we've done big
> > changes in the past - they've been staged across multiple kernel
> > cycles - and not everything needs to be done in the first
> > patchset of a larger body of work....
> 
> You mean there's even more beyond the 45 already on the list? /groan/
> 
> --D
> 
> > Cheers,
> > 
> > Dave.
> > -- 
> > Dave Chinner
> > david@fromorbit.com
> 


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH 2/3 v2] xfs: AIL needs asynchronous CIL forcing
  2021-03-10  2:10                         ` Brian Foster
@ 2021-03-10 22:00                           ` Dave Chinner
  0 siblings, 0 replies; 33+ messages in thread
From: Dave Chinner @ 2021-03-10 22:00 UTC (permalink / raw)
  To: Brian Foster; +Cc: Darrick J. Wong, linux-xfs

On Tue, Mar 09, 2021 at 09:10:47PM -0500, Brian Foster wrote:
> On Mon, Mar 08, 2021 at 08:35:59PM -0800, Darrick J. Wong wrote:
> > On Tue, Mar 09, 2021 at 11:44:10AM +1100, Dave Chinner wrote:
> > > On Fri, Mar 05, 2021 at 09:58:26AM -0500, Brian Foster wrote:
> > > I'm not going to play that "now jump through this hoop" game.  We
> > > add flags for on-off behaviours in internal functions -all the
> > > time-. If this makes the interface so complex and confusing that you
> > > don't understand it, then the interface was already too complex and
> > > confusing. And fixing that is *not in the scope of this patchset*.
> > > 
> > > Demanding that code be made perfect before it can be merged is
> > > really not very helpful. Especially when there are already plans to
> > > rework the API but that rework is dependent on a bunch of other
> > > changes that need to be done first.
> > > 
> > > iclogs are something that need to be moved behind the CIL, not sit
> > > in front of CIL. The CIL abstracts the journal and writing to the
> > > journal completely away from the transaction subsystem, yet the log
> > > force code punches straight through that abstraction and walks
> > > iclogs directly. The whole log force implementation needs to change,
> > > and I plan for the API that wraps the log forces to get reworked at
> > > that time.
> > 
> > So here's what I want to know: Do Dave's changes to the log force APIs
> > introduce broken behavior?  If the interfaces are so confusing that
> > /none/ of us understand them, can we introduce the appropriate wrappers
> > and documentation so that the rest of us plugging away at the rest of
> > the system can only call them in the supported ways to achieve any of the
> > supported outcomes?
> > 
> 
> It looks to me like the changes to xlog_cil_force_seq() could
> essentially be replaced with something like:
> 
> /*
>  * Behold the magical async CIL force
>  * ... <explain what this does> ...
>  */
> static inline void
> xfs_cil_force_async(
> 	struct xlog	*log)
> {
> 	struct xfs_cil	*cil = log->l_cilp;
> 
> 	/* true for magic async CIL force */
> 	xlog_cil_push_now(log, cil->xc_current_sequence, true);
> }

Oh, is that what you are asking for? You were talking about changing
the API for all the callers that didn't use XFS_LOG_SYNC to make
them all async, not about a one-off, single-use wrapper for the AIL.

If all you want is a single-line wrapper function, then don't
suggest that the whole API should be reworked and callers changed to
use a new API. *Ask for a one-line wrapper to be added*.

This wrapper will still need to go away in the near future, but at
least it doesn't involve changing lots of unrelated code...
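
FWIW, the AIL call site then collapses to a single call, something
like this (sketch only, modulo the final name of the wrapper):

	/* name per the sketch above; kick the CIL without blocking */
	xfs_cil_force_async(log);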

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH 2/3 v2] xfs: AIL needs asynchronous CIL forcing\
  2021-03-09  4:35                       ` [PATCH 2/3 v2] xfs: AIL needs asynchronous CIL forcing\ Darrick J. Wong
  2021-03-10  2:10                         ` Brian Foster
  2021-03-10 15:13                         ` [PATCH 2/3 v2] xfs: AIL needs asynchronous CIL forcing\ Brian Foster
@ 2021-03-11 12:41                         ` Christoph Hellwig
  2 siblings, 0 replies; 33+ messages in thread
From: Christoph Hellwig @ 2021-03-11 12:41 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: Dave Chinner, Brian Foster, linux-xfs

On Mon, Mar 08, 2021 at 08:35:59PM -0800, Darrick J. Wong wrote:
> > So you won't review it until I have 100 outstanding patches in this
> > series and it's completely and utterly unreviewable?
> 
> Already unreviewable at 45, and I've only gotten through 2/3 of it.


Yes.  For patches that don't just repetitively apply similar changes
to a few places, about 20 patches is the max that is digestible.

> Here's something I haven't previously shared with all of you: Last cycle
> when we were going around and around on the ENOSPC/EDQUOT retry loop
> patches (which exploded from 13 to 41 patches) it was /very/ stressful
> to have to rework this and that part every day and a half for almost
> three weeks.

As someone who was part of the loop, I was a little surprised how
quickly you respun the patches.  In general, if I have feedback that
requires a major rework of a non-trivial series, I do not want to touch
it instantly.  Let the discussion continue a bit, because it can easily
take another direction and create more work.  I think waiting a few
days before doing anything that involves a lot of work generally helps
to make everyone's life a little easier.

> (TLDR: git branch plz)

Yes, for any non-trivial series that really, really helps.

^ permalink raw reply	[flat|nested] 33+ messages in thread

end of thread, other threads:[~2021-03-11 12:43 UTC | newest]

Thread overview: 33+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-02-23  5:32 [PATCH 0/3] xfs: CIL improvements Dave Chinner
2021-02-23  5:32 ` [PATCH 1/3] xfs: xfs_log_force_lsn isn't passed a LSN Dave Chinner
2021-02-24 21:42   ` Darrick J. Wong
2021-02-24 22:19     ` Dave Chinner
2021-02-25 19:01     ` Christoph Hellwig
2021-03-02 18:12   ` Brian Foster
2021-02-23  5:32 ` [PATCH 2/3] xfs: AIL needs asynchronous CIL forcing Dave Chinner
2021-02-24 21:10   ` Dave Chinner
2021-02-24 23:26     ` [PATCH 2/3 v2] " Dave Chinner
2021-02-25  2:15       ` Dave Chinner
2021-03-02 21:44       ` Brian Foster
2021-03-03  0:57         ` Dave Chinner
2021-03-03 17:32           ` Brian Foster
2021-03-04  1:59             ` Dave Chinner
2021-03-04 13:13               ` Brian Foster
2021-03-04 22:48                 ` Dave Chinner
2021-03-05 14:58                   ` Brian Foster
2021-03-09  0:44                     ` Dave Chinner
2021-03-09  4:35                       ` [PATCH 2/3 v2] xfs: AIL needs asynchronous CIL forcing\ Darrick J. Wong
2021-03-10  2:10                         ` Brian Foster
2021-03-10 22:00                           ` [PATCH 2/3 v2] xfs: AIL needs asynchronous CIL forcing Dave Chinner
2021-03-10 15:13                         ` [PATCH 2/3 v2] xfs: AIL needs asynchronous CIL forcing\ Brian Foster
2021-03-11 12:41                         ` Christoph Hellwig
2021-03-10 14:49                       ` [PATCH 2/3 v2] xfs: AIL needs asynchronous CIL forcing Brian Foster
2021-02-25 13:12     ` [PATCH 2/3] " Brian Foster
2021-02-25 22:03       ` Dave Chinner
2021-02-27 16:25         ` Brian Foster
2021-03-01  4:54           ` Dave Chinner
2021-03-01 13:32             ` Brian Foster
2021-03-03  1:23               ` Dave Chinner
2021-03-03 17:20                 ` Brian Foster
2021-03-04  2:01                   ` Dave Chinner
2021-02-23  5:32 ` [PATCH 3/3] xfs: CIL work is serialised, not pipelined Dave Chinner

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.